Chapter 14: Virtual Filesystem Layer
VFS architecture, dentry cache, mount tree, path resolution, overlayfs, mount namespace operations
The Virtual Filesystem Layer provides a unified interface over all filesystem implementations. Dentry caching, inode management, mount tree operations, and path resolution are kernel-internal — filesystems plug in via well-defined traits. FUSE, overlayfs, configfs, and autofs are first-class citizens, not afterthoughts.
14.1 Virtual Filesystem Layer
The VFS (umka-vfs) provides a unified interface over all filesystem types. It is a Tier 1 component that shares a hardware isolation domain with filesystem drivers (see Section 11.2 for platform-specific isolation mechanisms). This shared domain provides crash containment from umka-core but does not provide mutual isolation between VFS and the filesystem drivers within it.
Why VFS is Tier 1 (not Tier 0):
The VFS handles complex, security-sensitive operations: path resolution (symlink loops, mount point crossing), permission checks, and filesystem driver coordination. Isolating VFS from Core provides:
- Attack surface reduction: Path resolution bugs (symlink attacks, directory traversal) are confined to the VFS domain and cannot corrupt Core memory.
- Domain boundary: Core → VFS+FS domain (Tier 1) → individual FS drivers (Tier 1/2). A compromised VFS+FS domain cannot corrupt Core memory. However, VFS and filesystem drivers share a domain by default, so a filesystem driver bug can corrupt VFS metadata within the shared domain. On platforms with sufficient isolation domains and few active Tier 1 drivers (e.g., x86-64 with PKU and <12 Tier 1 drivers), VFS and filesystem drivers may be placed in separate domains, providing inter-driver hardware isolation. Rust memory safety, not hardware isolation, is the primary defense against filesystem driver bugs within the shared domain. The hard isolation boundary is between Core and the VFS+FS domain, not between VFS and individual filesystem drivers.
- Crash containment: A VFS panic (e.g., corrupted dentry cache) is recoverable without rebooting the entire kernel. The recovery protocol:
a. Detection: umka-core detects VFS domain death (MPK exception, panic handler,
or watchdog timeout on the VFS heartbeat ring).
b. Freeze: All syscalls that enter VFS (open, stat, read, write, close, etc.)
are blocked at the umka-core domain boundary. Callers receive -ERESTARTSYS
and the VFS ring is drained.
c. Dirty page cache flush: Dirty pages in umka-core's page cache are flushed to
their backing block devices. The page cache is in umka-core memory (not VFS memory),
so it survives the VFS crash. Flush uses the block layer ring directly.
d. Dentry/inode cache rebuild: The new VFS instance starts with an empty dentry
cache. Dentries are lazily re-populated on the next path lookup (cache miss triggers
disk read). Inode cache is similarly rebuilt on demand.
e. Mount tree reconstruction: umka-core maintains a shadow mount registry
— a Tier 0 table recording (mount_id, device, fstype, mountpoint_path, flags)
for every active mount. The registry is updated atomically by the syscall layer
on every mount()/umount() call (BEFORE dispatching to VFS). After VFS
restart, the new VFS instance iterates the shadow registry and re-mounts each
filesystem in depth-sorted order (root first, then children):
```rust
/// Shadow mount registry in Tier 0 Core. Survives VFS Tier 1 crash.
/// Updated on mount/umount syscalls before VFS dispatch.
pub struct ShadowMountRegistry {
/// Active mounts, keyed by mount_id (u64).
mounts: XArray<ShadowMountEntry>,
}
/// Stable layout for crash recovery: Tier 0 Core persists this struct
/// and reads it back after a VFS Tier 1 crash. `#[repr(C)]` ensures
/// deterministic field ordering across compiler versions.
#[repr(C)]
pub struct ShadowMountEntry {
pub mount_id: u64, // 8 bytes (offset 0)
/// Block device backing this mount (`DevId`, or `DevId::ZERO` for
/// pseudo-fs). Uses `DevId` (u32) — the same type as
/// `SuperBlock.sb_dev` — for consistency with all other VFS device
/// ID fields. KABI `DeviceId` (u64) is not used here because
/// ShadowMountEntry is a Tier 0 crash recovery struct, not a KABI
/// wire type.
pub device: DevId, // 4 bytes (offset 8)
/// Explicit padding for u8-array alignment of `fstype`.
pub _pad0: [u8; 4], // 4 bytes (offset 12)
/// Filesystem type ("ext4", "proc", "sysfs", "tmpfs", "overlay",
/// "fuse.sshfs"). 32 bytes: covers all standard types and FUSE names.
pub fstype: [u8; 32], // 32 bytes (offset 16)
/// Mountpoint path ("/", "/proc", "/sys", "/dev", etc.).
/// 512 bytes: covers deeply nested container mount paths
/// (Kubernetes pod volumes, Docker overlay lower directories).
pub path: [u8; 512], // 512 bytes (offset 48)
/// Mount flags (MS_RDONLY, MS_NOSUID, etc.). u64 for consistency
/// with MountFlags (highest defined bit is 29, but u64 prevents
/// silent truncation if future flags use bits 32+).
pub flags: u64, // 8 bytes (offset 560)
/// Depth in the mount tree (root=0, /proc=1, /proc/sys=2, ...).
/// Used for depth-sorted re-mount ordering.
pub depth: u16, // 2 bytes (offset 568)
/// Filesystem-specific mount data (e.g., overlayfs lower/upper/work).
/// Opaque bytes, copied from the original mount() call.
/// **Truncation**: If the original mount data exceeds 256 bytes, only
/// the first 256 bytes are stored and `fs_data_len` is set to 256.
/// The `truncated` flag below is set to indicate data loss. During
/// crash recovery, truncated mounts fall back to re-reading mount
/// options from /etc/fstab or fail with -EINVAL if unavailable.
pub fs_data: [u8; 256], // 256 bytes (offset 570)
pub fs_data_len: u16, // 2 bytes (offset 826)
/// Set to 1 if the original fs_data was longer than 256 bytes.
pub truncated: u8, // 1 byte (offset 828)
/// Explicit trailing padding to u64 alignment boundary.
pub _pad: [u8; 3], // 3 bytes (offset 829)
// Total: 8+4+4+32+512+8+2+256+2+1+3 = 832 bytes.
}
const_assert!(size_of::<ShadowMountEntry>() == 832);
```
Pseudo-filesystems (/proc, /sys, /dev/devtmpfs) are re-mounted from kernel
state — they have no on-disk backing. Overlayfs is re-constituted from the
recorded `fs_data` (lower/upper/work paths). FUSE mounts that require a
userspace daemon connection receive `-ENOTCONN` until the daemon reconnects.
Memory cost: ~832 bytes per mount (with alignment) × ~20 typical mounts = ~16.6 KB in Core.
f. Open file descriptor recovery: umka-core's FdTable (in Tier 0 Core,
per-task via Task.files: Arc<FdTable>, Section 8.1) survives the VFS crash.
After mount tree reconstruction, umka-core re-opens each fd by inode number
using the re-mounted filesystem. File descriptors that pointed to deleted files
(unlinked but still open) receive -EIO on next access.
g. Resume: The VFS ring is reopened and blocked syscalls are retried.
Recovery time: ~100-500ms depending on the number of open file descriptors.
Limitation: In-flight writes that had not yet reached the page cache are lost
(the application receives -EIO and must retry).
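The depth-sorted re-mount ordering from step e can be illustrated with a minimal sketch. It uses a simplified stand-in for `ShadowMountEntry` (only `mount_id`, `depth`, and a `String` path), and it assumes that ties at the same depth are broken by `mount_id`; the names `MountRecord`, `remount_order`, and `record` are hypothetical and not part of the design.

```rust
/// Simplified stand-in for `ShadowMountEntry`: only the fields needed
/// to decide the re-mount order.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MountRecord {
    pub mount_id: u64,
    pub depth: u16,
    pub path: String,
}

/// Returns the records in re-mount order: shallowest first, so that a
/// parent mountpoint exists before any child is mounted on top of it.
/// Ties at the same depth are broken by `mount_id` (an assumption made
/// here for determinism).
pub fn remount_order(mut records: Vec<MountRecord>) -> Vec<MountRecord> {
    records.sort_by_key(|r| (r.depth, r.mount_id));
    records
}

/// Hypothetical helper for building example data.
pub fn record(mount_id: u64, depth: u16, path: &str) -> MountRecord {
    MountRecord { mount_id, depth, path: path.to_string() }
}
```

With records for `/proc/sys` (depth 2), `/proc` (depth 1), `/` (depth 0), and `/sys` (depth 1), the function yields `/`, `/proc`, `/sys`, `/proc/sys`.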
Domain grouping limitation: The crash recovery protocol above is most effective
when the crash originates from VFS logic itself (e.g., a bug in path resolution or
dentry management). Because VFS and filesystem drivers share an isolation domain by
default, a filesystem driver bug can corrupt VFS metadata (dentry cache, mount tree,
inode state) before detection. In this case, the corrupted VFS state may have already
produced incorrect I/O (wrong block mappings, stale metadata replies) before the
domain crash is detected by Core. Recovery restores VFS to a clean state, but data
written to disk under corrupted VFS guidance may be silently wrong. This is a known
limitation of domain grouping — the hardware fault boundary catches the crash, but
cannot retroactively undo I/O performed with corrupted in-domain state. Rust memory
safety mitigates this risk by preventing most classes of memory corruption bugs, but
unsafe code within the shared domain remains a vector.
In-flight write definition: In-flight writes are writes that have entered the VFS
write path (passed the syscall boundary) but whose data has not yet been inserted into
the page cache. This includes: (1) writes buffered in the VFS ring command queue awaiting
processing, (2) writes being copied from user buffer to a page that has not yet been
marked dirty. Writes that have reached the page cache (page marked PG_DIRTY with a
committed dirty extent) are NOT in-flight — they survive VFS crash via the dirty extent
protocol (Section 14.4). Error reporting for lost in-flight writes: the
write() syscall returns -EIO if the VFS crashes during the write. If write() had
already returned success, the data is in the page cache and is safe. Applications should
retry write() calls that returned -EIO after VFS recovery completes (the ring is
reopened in step g of the recovery protocol).
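From the application side, the retry advice above might look like the following sketch. It is written against the standard `std::io::Write` trait rather than any UmkaOS interface; the helper name `write_with_eio_retry`, the retry limit, and the `FlakyWriter` test double are all assumptions for illustration. A real caller would probably also wait or back off between attempts.

```rust
use std::io::{self, Write};

/// Linux errno value for EIO.
const EIO: i32 = 5;

/// Writes `buf` to `out`, retrying up to `max_retries` times when the
/// write fails with EIO. Any other error is returned immediately.
pub fn write_with_eio_retry<W: Write>(
    out: &mut W,
    buf: &[u8],
    max_retries: u32,
) -> io::Result<usize> {
    let mut attempts = 0;
    loop {
        match out.write(buf) {
            Err(e) if e.raw_os_error() == Some(EIO) && attempts < max_retries => {
                // A real caller would wait for recovery to finish here.
                attempts += 1;
            }
            other => return other,
        }
    }
}

/// Test double: fails with EIO for the first `failures` writes, then succeeds.
pub struct FlakyWriter {
    pub failures: u32,
    pub data: Vec<u8>,
}

impl Write for FlakyWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if self.failures > 0 {
            self.failures -= 1;
            return Err(io::Error::from_raw_os_error(EIO));
        }
        self.data.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```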
14.1.1.1.1 Dirty Page Handling on VFS Crash
When a Tier 1 VFS driver crashes, UmkaOS Core cannot safely flush dirty pages using the crashed driver's block mapping (the file-offset → block-address translation lives in the now-destroyed VFS domain).
UmkaOS's design: two-phase dirty extent protocol.
The dirty extent protocol must accommodate two fundamentally different filesystem write models:
- In-place / pre-allocated filesystems (ext4 without delayed allocation, XFS non-reflink, FAT): The block address is known at page-dirty time (the extent tree maps file offsets to fixed block addresses). Both the logical reservation and physical commit can happen in one step.
- Copy-on-Write filesystems (Btrfs, XFS reflink, bcachefs): The physical block address is not known at page-dirty time. CoW filesystems allocate new blocks at writeback time, not when the page is first dirtied. A protocol that requires block_addr at dirty time is incompatible with CoW.
To support both models, UmkaOS uses a two-phase dirty extent protocol: Phase 1 (reserve) records the logical intent at dirty time; Phase 2 (commit) binds the physical block address after writeback allocation.
/// Phase 1: Reserve a dirty extent at page-dirty time.
///
/// Called by VFS drivers when a page is first marked dirty (from the
/// `AddressSpaceOps::dirty_extent()` callback). Records a **logical**
/// writeback intent — the file offset and length that will need to be
/// written back. No physical block address is required at this point.
///
/// The intent is stored in a per-inode **writeback intent list** maintained
/// by UmkaOS Core. The intent list is protected by `i_rwsem` (held at
/// least shared by the caller, since `dirty_extent()` is called from
/// `__set_page_dirty()` which holds the page lock, which nests inside
/// `i_rwsem`).
///
/// # Parameters
/// - `inode_id`: Stable inode identifier (survives VFS crash).
/// - `file_offset`: Byte offset of the dirty range start.
/// - `len`: Length of the dirty range in bytes.
///
/// # Errors
/// Returns `VfsDirtyError::IntentListFull` when the per-inode intent list
/// reaches its capacity (8192 entries). The caller must trigger writeback
/// for this inode to drain completed intents before retrying.
pub fn vfs_dirty_extent_reserve(
inode_id: InodeId,
file_offset: u64,
len: u64,
) -> Result<DirtyExtentToken, VfsDirtyError>;
/// Phase 2: Commit a dirty extent with its physical block address.
///
/// Called by the filesystem's writeback path **after** it has allocated
/// the physical blocks for a dirty range (CoW allocation for Btrfs/XFS
/// reflink, or extent-tree lookup for ext4/XFS non-reflink). Binds the
/// physical block address to the previously reserved logical intent.
///
/// After `vfs_dirty_extent_commit()` returns, UmkaOS Core has a complete
/// record of the dirty extent's physical location. If the VFS crashes
/// between commit and actual I/O completion, Core can flush the extent
/// directly via the block layer.
///
/// # Parameters
/// - `token`: The token returned by `vfs_dirty_extent_reserve()` for
/// this extent. Tokens are single-use; committing the same token
/// twice is a kernel bug (caught by debug-mode assertion).
/// - `block_addr`: Physical block address assigned by the filesystem's
/// writeback allocator.
/// - `block_len`: Length of the physical block range in bytes. May
/// differ from the logical `len` if the filesystem compresses or
/// coalesces extents.
///
/// # Errors
/// Returns `VfsDirtyError::InvalidToken` if the token has already been
/// committed or was never issued.
pub fn vfs_dirty_extent_commit(
token: DirtyExtentToken,
block_addr: PhysBlockAddr,
block_len: u64,
) -> Result<(), VfsDirtyError>;
/// Atomic reserve+commit for non-CoW filesystems.
///
/// Convenience function for filesystems that know the block address at
/// dirty time (ext4 non-delayed-alloc, FAT, exFAT). Equivalent to
/// calling `vfs_dirty_extent_reserve()` followed immediately by
/// `vfs_dirty_extent_commit()`, but avoids the overhead of a separate
/// token round-trip.
///
/// CoW filesystems MUST NOT use this function — they must use the
/// two-phase protocol because the block address is not available at
/// dirty time.
pub fn vfs_dirty_extent_reserve_and_commit(
inode_id: InodeId,
file_offset: u64,
len: u64,
block_addr: PhysBlockAddr,
block_len: u64,
) -> Result<(), VfsDirtyError>;
/// Abort a previously reserved dirty extent. Called when writeback fails
/// after `vfs_dirty_extent_reserve()` but before
/// `vfs_dirty_extent_commit()` — e.g., block allocation failure, I/O
/// error during journal write, or filesystem shutdown. Releases the
/// reserved intent list entry, decrementing `nr_reserved` and freeing
/// the token. The token is consumed (single-use, same as commit).
///
/// **Retry policy**: After 3 consecutive writeback failures for the same
/// extent (tracked per-inode by a `(file_offset, len)` → `fail_count`
/// map in the intent list), the filesystem marks the extent permanently
/// dirty and logs an FMA error:
/// `"writeback abort: extent [offset, offset+len) on inode {id} failed 3 times"`.
/// The permanently-dirty extent remains in the intent list until the
/// inode is evicted or the filesystem is unmounted, ensuring crash
/// recovery can still identify it.
pub fn vfs_dirty_extent_abort(
token: DirtyExtentToken,
) -> Result<(), VfsDirtyError>;
/// Acknowledge that a dirty extent has been successfully flushed to
/// stable storage. Removes the extent from Core's dirty extent log.
/// Called by the VFS driver after receiving I/O completion for the
/// writeback.
pub fn vfs_flush_extent_complete(
inode_id: InodeId,
file_offset: u64,
len: u64,
) -> Result<(), VfsDirtyError>;
/// Opaque token binding a reserved dirty extent to its commit.
/// Issued by `vfs_dirty_extent_reserve()`, consumed by
/// `vfs_dirty_extent_commit()` or `vfs_dirty_extent_abort()`. Internally
/// encodes the inode ID, file offset, and a monotonic sequence number for
/// double-commit detection. The extent length is kept in the intent list
/// entry, not in the token.
///
/// Size: 24 bytes (inode_id: u64 + file_offset: u64 + seq: u64).
/// Passed by value through the VFS ring buffer.
#[derive(Clone, Copy)]
pub struct DirtyExtentToken {
pub inode_id: InodeId,
pub file_offset: u64,
pub seq: u64,
}
/// Error type for dirty extent operations.
pub enum VfsDirtyError {
/// The per-inode intent list is full (8192 entries). Trigger writeback
/// to drain completed intents before retrying. Equivalent to EBUSY.
IntentListFull,
/// The token has already been committed or was never issued.
InvalidToken,
/// Other VFS error (invalid inode ID, etc.).
Other(VfsError),
}
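The token lifecycle described by these declarations (reserve, then exactly one of commit or abort) can be modeled in a few lines. The following is a toy, single-inode model that assumes a `HashMap` keyed by sequence number in place of the Core-side intent list and plain integers in place of `InodeId` and `PhysBlockAddr`; the type and method names are hypothetical and only mirror the API above.

```rust
use std::collections::HashMap;

/// Toy counterpart of `DirtyExtentToken` (plain integers instead of `InodeId`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub inode_id: u64,
    pub file_offset: u64,
    pub seq: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DirtyError {
    IntentListFull,
    InvalidToken,
}

/// Lifecycle state of one intent: Reserved (Phase 1) or Committed (Phase 2).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intent {
    Reserved { len: u64 },
    Committed { len: u64, block_addr: u64, block_len: u64 },
}

pub struct IntentList {
    inode_id: u64,
    capacity: usize,
    next_seq: u64,
    entries: HashMap<u64, Intent>, // keyed by sequence number
}

impl IntentList {
    pub fn new(inode_id: u64, capacity: usize) -> Self {
        IntentList { inode_id, capacity, next_seq: 0, entries: HashMap::new() }
    }

    /// Phase 1: record the logical intent and hand out a token.
    pub fn reserve(&mut self, file_offset: u64, len: u64) -> Result<Token, DirtyError> {
        if self.entries.len() >= self.capacity {
            return Err(DirtyError::IntentListFull);
        }
        let seq = self.next_seq;
        self.next_seq += 1;
        self.entries.insert(seq, Intent::Reserved { len });
        Ok(Token { inode_id: self.inode_id, file_offset, seq })
    }

    /// Phase 2: bind a physical address. Fails for unknown or already
    /// committed tokens, which is how a double commit is detected.
    pub fn commit(&mut self, token: Token, block_addr: u64, block_len: u64) -> Result<(), DirtyError> {
        match self.entries.get(&token.seq).copied() {
            Some(Intent::Reserved { len }) => {
                self.entries.insert(token.seq, Intent::Committed { len, block_addr, block_len });
                Ok(())
            }
            _ => Err(DirtyError::InvalidToken),
        }
    }

    /// Abort a reservation that was never committed, freeing its slot.
    pub fn abort(&mut self, token: Token) -> Result<(), DirtyError> {
        match self.entries.get(&token.seq).copied() {
            Some(Intent::Reserved { .. }) => {
                self.entries.remove(&token.seq);
                Ok(())
            }
            _ => Err(DirtyError::InvalidToken),
        }
    }

    /// Current state of the intent a token refers to, if any.
    pub fn state(&self, token: Token) -> Option<Intent> {
        self.entries.get(&token.seq).copied()
    }
}
```

Because the token is `Copy`, single use cannot be enforced by the type system in this model; it is enforced at run time by the state check, which matches the "debug-mode assertion" wording above.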
Dirty extent intent list — UmkaOS Core maintains a per-inode writeback intent list in core memory (not in VFS domain memory):
/// Per-inode writeback intent list. Maintained by UmkaOS Core in its own
/// memory domain, surviving VFS crashes.
///
/// Each entry tracks a dirty file range through its lifecycle:
/// Reserved (logical intent only) → Committed (physical address bound) →
/// Complete (flushed to stable storage, entry removed).
///
/// Protected by `i_rwsem` for structural modifications (insert, remove).
/// The writeback thread reads the list under `i_rwsem` shared to collect
/// committed extents for I/O submission.
pub struct DirtyIntentList {
/// Ring buffer of intent entries. Capacity: 8192 per inode.
/// At 80 bytes per entry (inode_id(8) + file_offset(8) + len(8) +
/// block_addr(16) + block_len(8) + seq(8) + sb_dev(4) + pad(4) +
/// block_dev(16) = 80), worst case is ~640 KB per heavily-dirtied
/// inode — acceptable for a production server. Most inodes have
/// <100 entries at any given time.
///
/// **Allocation**: `DirtyIntentList` is allocated from the
/// `dirty_intent_slab` (a dedicated slab cache, object size = sizeof
/// `DirtyIntentList`, created at boot Phase 2.4). Allocation occurs
/// lazily on the first `vfs_dirty_extent_reserve()` for each inode.
/// The slab is GC'd when idle inodes are evicted from the inode cache.
pub entries: BoundedRing<DirtyIntentEntry, 8192>,
/// Monotonic sequence counter for token generation.
pub next_seq: u64,
}
/// Physical block address on a block device. Newtype around `u64` (plain u64,
/// no `NonZero` — `Option<PhysBlockAddr>` is 16 bytes: 8-byte discriminant +
/// 8-byte payload, with no niche optimization).
pub struct PhysBlockAddr(pub u64);
/// A single dirty extent intent entry.
// Kernel-internal, not KABI: contains Option<Arc<dyn>>, never crosses a compilation
// boundary. #[repr(C)] is required to make the const_assert deterministic across
// compiler versions — without it the compiler may reorder fields.
#[repr(C)]
pub struct DirtyIntentEntry {
/// Stable inode identifier. Redundant during normal operation (the list
/// is per-inode) but needed during crash recovery log replay, where entries
/// may be iterated without per-inode context.
pub inode_id: InodeId,
/// File offset of the dirty range (bytes).
pub file_offset: u64,
/// Length of the dirty range (bytes).
pub len: u64,
/// Physical block address. `None` if Phase 1 only (reserved but not
/// yet committed). `Some(addr)` after Phase 2 commit.
pub block_addr: Option<PhysBlockAddr>,
/// Physical block length. Only valid when `block_addr` is `Some`.
pub block_len: u64,
/// Sequence number (matches `DirtyExtentToken.seq`).
pub seq: u64,
/// Device ID of the superblock that owns this inode.
///
/// This is the key that Core (Tier 0) uses to look up the block device
/// during crash recovery, via the global device registry
/// (`DEVICE_REGISTRY: XArray<Arc<dyn BlockDeviceOps>>`, keyed by `DevId`).
/// When a Tier 1 VFS driver crashes, Core iterates dirty intent entries
/// and uses `sb_dev` to resolve the target block device — without needing
/// any VFS-domain state. This is strictly more reliable than using
/// `block_dev` alone because `block_dev` is `None` for Phase 1 entries
/// (deferred allocation) and for NFS, whereas `sb_dev` is always set for
/// local filesystems and enables Core to match intent entries to the
/// correct `SUPER_BLOCK_MAP` entry for filesystem journal replay.
///
/// Set to `DevId::ZERO` for network filesystems (NFS, CIFS) where there
/// is no local block device — crash recovery for network filesystems uses
/// the network reconnection path instead.
pub sb_dev: DevId,
/// Reference to the block device that owns the physical blocks.
/// Required by crash recovery: when a Tier 1 VFS driver crashes, Core
/// must issue direct block writes via the block layer (bypassing VFS)
/// for all committed extents. Without this reference, Core would need
/// to resolve the block device from the superblock — but the VFS driver
/// holding the superblock may be the one that crashed.
///
/// Set to `None` for network filesystems (NFS) where `block_addr` is an
/// opaque server-side commit token, not a local block address.
///
/// **Redundancy with `sb_dev`**: `block_dev` provides the fast path —
/// Core can issue I/O immediately without a registry lookup. `sb_dev`
/// provides the fallback — if `block_dev` is `None` (Phase 1 entry),
/// Core uses `sb_dev` to locate the block device via `DEVICE_REGISTRY`
/// and then uses the filesystem's block allocator (from the superblock)
/// to determine whether the extent can be committed or must be discarded.
/// The `block_dev` reference is valid during crash recovery because block
/// device drivers are in a separate Tier 1 domain from the VFS driver. If
/// the block device driver crashes simultaneously (a different, independently
/// handled crash event), dirty intent entries referencing that block device
/// are skipped with an FMA warning.
pub block_dev: Option<Arc<dyn BlockDeviceOps>>,
}
// Verify DirtyIntentEntry size matches the capacity analysis (80 bytes per entry,
// 8192 entries = ~640 KB worst case per heavily-dirtied inode). If this assertion
// fails, update the capacity analysis in DirtyIntentList.entries comment above.
//
// Note: DirtyIntentEntry is kernel-internal (never crosses a compilation boundary —
// only Core crash-recovery code touches it). The `#[repr(C)]` attribute ensures
// deterministic layout for the const_assert below, NOT for KABI compatibility.
// `Option<Arc<dyn BlockDeviceOps>>` relies on Rust's niche optimization for
// `Arc<T>` (null pointer → None). The language documents this guarantee only for
// types such as `Option<Box<T>>` and `Option<&T>` with `T: Sized`; for `Arc` it
// holds in practice (the inner pointer is non-null) and is verified at compile
// time by this const_assert. If a future rustc version changes the representation,
// the const_assert will fail and the struct must be reworked (e.g., raw pointer +
// validity flag).
//
// Field breakdown (64-bit target):
// inode_id: InodeId(u64) = 8
// file_offset: u64 = 8
// len: u64 = 8
// block_addr: Option<PhysBlockAddr> = 16 (u64 discriminant + u64 payload; no niche — PhysBlockAddr wraps plain u64)
// block_len: u64 = 8
// seq: u64 = 8
// sb_dev: DevId(u32) = 4
// block_dev: Option<Arc<dyn BlockDeviceOps>> = 16 (fat pointer: data_ptr + vtable_ptr)
// padding for alignment = 4 (after sb_dev, before block_dev's 8-byte alignment)
// TOTAL = 80 bytes
const_assert!(core::mem::size_of::<DirtyIntentEntry>() == 80);
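The field breakdown can be checked outside the kernel with a stand-in struct. The sketch below assumes a 64-bit target, replaces `InodeId` and `DevId` with `u64` and `u32`, and uses an empty hypothetical `BlockDeviceOps` trait; it only reproduces the layout arithmetic, not the real types.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Empty stand-in for the document's `BlockDeviceOps` trait (hypothetical).
pub trait BlockDeviceOps {}

/// Newtype around a plain `u64`, so `Option<PhysBlockAddr>` has no niche.
pub struct PhysBlockAddr(pub u64);

/// Same field order as `DirtyIntentEntry`, with primitive stand-ins.
#[repr(C)]
pub struct Entry {
    pub inode_id: u64,                             // offset 0
    pub file_offset: u64,                          // offset 8
    pub len: u64,                                  // offset 16
    pub block_addr: Option<PhysBlockAddr>,         // offset 24, 16 bytes
    pub block_len: u64,                            // offset 40
    pub seq: u64,                                  // offset 48
    pub sb_dev: u32,                               // offset 56, then 4 bytes padding
    pub block_dev: Option<Arc<dyn BlockDeviceOps>>, // offset 64, 16 bytes
}

/// Returns (size of Option<PhysBlockAddr>, size of the optional fat
/// pointer, size of the whole entry) as seen by the current compiler.
pub fn layout_summary() -> (usize, usize, usize) {
    (
        size_of::<Option<PhysBlockAddr>>(),
        size_of::<Option<Arc<dyn BlockDeviceOps>>>(),
        size_of::<Entry>(),
    )
}
```

On a 64-bit target with current rustc this is expected to report 16, 16, and 80 bytes, matching the breakdown above.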
Intent list overflow policy: vfs_dirty_extent_reserve() returns
IntentListFull when the per-inode intent list reaches 8192 entries. The VFS
driver must not proceed with the write operation when IntentListFull is
returned; it must first trigger writeback for this inode (via
writeback_single_inode()) to flush committed extents and free intent list
slots, then retry vfs_dirty_extent_reserve().
This is a deliberate design choice that differs from Linux's approach: UmkaOS
never silently discards safety information. The backpressure ensures that on any
VFS crash, umka-core has a complete record of all outstanding dirty extents
and can accurately flag inconsistent data — no dirty extent is ever "forgotten."
If the VFS driver is unresponsive (not calling vfs_flush_extent_complete() for
5 seconds), umka-core treats all entries in the intent list as dirty and initiates
VFS driver restart. The backpressure prevents intent list overflow from masking a
stuck VFS driver.
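The overflow policy amounts to a reserve, writeback, retry loop in the VFS driver. The sketch below models it with a toy counter in place of the real intent list; `ToyIntentList`, `reserve_with_backpressure`, the flush batch size, and the attempt limit are illustrative assumptions, and the real `writeback_single_inode()` path is not modeled.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ReserveError {
    IntentListFull,
}

/// Toy per-inode list: counts outstanding intents against a fixed capacity.
pub struct ToyIntentList {
    pub outstanding: usize,
    pub capacity: usize,
}

impl ToyIntentList {
    pub fn try_reserve(&mut self) -> Result<(), ReserveError> {
        if self.outstanding >= self.capacity {
            return Err(ReserveError::IntentListFull);
        }
        self.outstanding += 1;
        Ok(())
    }

    /// Stand-in for writeback: pretends `n` committed extents were
    /// flushed and their slots released.
    pub fn writeback(&mut self, n: usize) {
        self.outstanding = self.outstanding.saturating_sub(n);
    }
}

/// Reserves a slot, triggering writeback and retrying when the list is
/// full. Returns the number of writeback rounds that were needed. The
/// write never proceeds without a successful reservation.
pub fn reserve_with_backpressure(
    list: &mut ToyIntentList,
    flush_batch: usize,
    max_attempts: u32,
) -> Result<u32, ReserveError> {
    let mut writebacks = 0;
    loop {
        match list.try_reserve() {
            Ok(()) => return Ok(writebacks),
            Err(ReserveError::IntentListFull) if writebacks < max_attempts => {
                list.writeback(flush_batch);
                writebacks += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```

If writeback never frees a slot, the loop gives up after `max_attempts` rounds; in the design above that situation is instead handled by the 5-second unresponsiveness check in umka-core.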
Per-filesystem type usage patterns:
| Filesystem | Write model | Phase 1 (reserve) | Phase 2 (commit) | Notes |
|---|---|---|---|---|
| ext4 (non-delayed-alloc) | In-place | reserve_and_commit() | N/A (atomic) | Block address known from extent tree at dirty time |
| ext4 (delayed-alloc) | Deferred | reserve() at dirty time | commit() during writeback after ext4_map_blocks() allocates | Delayed allocation defers block assignment |
| XFS (non-reflink) | In-place | reserve_and_commit() | N/A (atomic) | BMBT lookup gives block address at dirty time |
| XFS (reflink) | CoW | reserve() at dirty time | commit() during writeback after CoW fork allocates new blocks | Shared extents require new block allocation |
| Btrfs | Redirect-on-Write | reserve() at dirty time | commit() during writeback after extent allocator assigns new tree location | All writes redirect; old blocks freed at transaction commit. Crash semantics: reserved-only intents (Phase 1 without Phase 2) are volatile; on VFS crash, they are discarded because no physical blocks were allocated. The CoW filesystem's on-disk tree remains consistent because uncommitted writes never modified the on-disk tree. |
| FAT/exFAT | In-place | reserve_and_commit() | N/A (atomic) | Cluster chain gives block address at dirty time |
| NFS | Network | reserve() at dirty time | commit() after NFS WRITE RPC completes with server-assigned stable storage | Block address is the server's opaque commit token |
Core-owned superblock registry for crash recovery bypass:
/// Global superblock map, owned by Core (Tier 0). Maps device IDs to
/// superblock references so that crash recovery can locate the block
/// device and filesystem metadata without going through the crashed VFS
/// driver.
///
/// Keyed by `DevId` (integer key → XArray per collection policy).
/// Populated during `mount()`: Core registers the superblock before handing
/// control to the VFS driver. Removed during `umount()` after the VFS driver
/// has cleanly shut down.
///
/// This map is **not** used on the normal I/O path — it exists solely for
/// crash recovery. Normal path resolution goes through the VFS mount tree.
pub static SUPER_BLOCK_MAP: OnceCell<XArray<Arc<SuperBlock>>> = OnceCell::new();
On crash recovery, Core uses SUPER_BLOCK_MAP to resolve the superblock for the
crashed filesystem, then iterates its inodes' dirty intent lists. Each
DirtyIntentEntry carries its own block_dev reference, enabling Core to issue
direct block writes without the VFS driver's cooperation.
Crash recovery sequence:
Dirty page flush during VFS crash uses the committed dirty extent records stored
in Core memory (via vfs_dirty_extent_commit()). These records contain physical
block addresses that survive the VFS crash. The file-to-inode-to-superblock-to-device
mapping is NOT needed — the dirty extent protocol captures the physical location at
commit time specifically to enable crash recovery without VFS state.
When a Tier 1 VFS driver crashes and UmkaOS Core detects pending dirty intents:
- Iterate the dirty intent lists for all inodes on the crashed filesystem. Core looks up the superblock via SUPER_BLOCK_MAP.get(dev_id) and walks its inode cache.
- Committed extents (Phase 2 complete, block_addr is Some):
  - For each committed extent (newest first, to preserve journal ordering): issue a direct block write via the block layer (bypassing VFS), using entry.block_dev for the device handle and block_addr/block_len from the intent entry. Wait for write completion.
- Reserved-only extents (Phase 1 only, block_addr is None):
  - These extents have dirty pages in the page cache but no physical block assignment. Core cannot flush them without knowing the block address.
  - Flag these pages as "potentially inconsistent." The filesystem's own journal/log handles recovery on next mount (same as a hard power-off scenario where dirty pages had not reached disk).
- After all committed extents are flushed: mark as "crash-flushed" and continue with driver reload.
- Any dirty pages NOT covered by any intent entry (neither reserved nor committed) are also flagged as "potentially inconsistent."
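The split between committed and reserved-only extents in the sequence above can be sketched as a pure planning step. This is a simplified model under stated assumptions: `IntentEntry` carries only a subset of the `DirtyIntentEntry` fields, "newest first" is interpreted as descending `seq`, and the names `RecoveryPlan` and `plan_recovery` are hypothetical. No block I/O is performed.

```rust
/// Reduced stand-in for `DirtyIntentEntry` (no device handles).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IntentEntry {
    pub file_offset: u64,
    pub len: u64,
    pub block_addr: Option<u64>,
    pub block_len: u64,
    pub seq: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub struct RecoveryPlan {
    /// Committed extents to write directly through the block layer,
    /// newest first.
    pub flush: Vec<IntentEntry>,
    /// Reserved-only extents: flagged "potentially inconsistent" and
    /// left to the filesystem's journal on next mount.
    pub flag_inconsistent: Vec<IntentEntry>,
}

/// Splits intent entries into the two recovery classes.
pub fn plan_recovery(entries: &[IntentEntry]) -> RecoveryPlan {
    let (mut flush, flag_inconsistent): (Vec<_>, Vec<_>) = entries
        .iter()
        .cloned()
        .partition(|e| e.block_addr.is_some());
    // Assumption: a higher `seq` means a newer reservation.
    flush.sort_by(|a, b| b.seq.cmp(&a.seq));
    RecoveryPlan { flush, flag_inconsistent }
}
```

For entries with `seq` 1 (committed), 2 (reserved only), and 3 (committed), the plan flushes 3 and then 1, and flags 2.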
Design rationale: This is better than Linux's approach (which silently loses dirty pages when a kernel module crashes) while being simpler than running a full WAL in UmkaOS Core. The pre-registration overhead is one lightweight ring-buffer push per dirtied file region — negligible for writeback-dominated workloads.
Performance implications and mitigation:
The Core → VFS domain switch costs ~23 cycles for the bare WRPKRU instruction
(x86-64 MPK). The full domain crossing — including argument marshaling via the
inter-domain ring buffer and cache effects — is ~30-35 cycles per crossing.
This overhead is amortized by:
- Page Cache in Core: The Page Cache (Section 4.4) lives in Core, not VFS. Cached file reads/writes hit the Page Cache directly with zero domain switches. Only cache misses (actual I/O) cross into VFS.
- Batching: Multiple file operations within a single syscall (e.g., readv, io_uring batches) amortize the domain switch over many operations.
- Dentry cache hit rate: The dentry cache (in VFS) has >99% hit rate for typical workloads. Path resolution is fast, and the domain switch cost is dominated by the actual I/O latency (microseconds vs nanoseconds).
Measured overhead: For a 4KB NVMe read (~10μs device latency), the additional domain switches (Core → VFS → FS driver) add ~70 cycles (~30ns total), which is 0.3% overhead. This is well within the "<5% overhead" target.
Metadata-heavy workloads: Individual metadata syscalls (stat, readdir,
open/close) pay higher per-call overhead because the operation base cost is
lower (~200-500ns vs ~10μs for I/O). A single stat() on x86-64 incurs ~46 cycles
(~18ns) for the Core → VFS → Core round-trip, which is ~3.6-9% per call. This is the
design tradeoff for VFS crash containment: the dentry/inode cache lives in the VFS
domain, enabling cache rebuild on VFS crash recovery. Two amortization mechanisms
reduce this cost to ~0.3-0.5% effective overhead for the dominant readdir+stat access
pattern and ~0.05% per stat for io_uring batch workloads; see Section 14.1 below.
For per-architecture raw overhead figures, see Section 3.4.
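The percentages quoted above follow from simple ratios, reproduced here as a small worked calculation. The 2.5 GHz clock used to convert 46 cycles to about 18 ns is an assumption inferred from the figures in the text, not a stated parameter; the function names are illustrative.

```rust
/// Converts a cycle count to nanoseconds at the given clock frequency (GHz).
pub fn cycles_to_ns(cycles: f64, ghz: f64) -> f64 {
    cycles / ghz
}

/// Overhead of `extra_ns` relative to an operation whose base cost is
/// `base_ns`, expressed in percent.
pub fn overhead_percent(extra_ns: f64, base_ns: f64) -> f64 {
    100.0 * extra_ns / base_ns
}
```

With these helpers, 30 ns on a 10 μs NVMe read gives 0.3%, and an 18 ns round-trip on a 200-500 ns `stat()` gives 9% down to 3.6%, which are the bounds stated in the text.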
14.1.1.2 Metadata Access Amortization
Metadata-heavy workloads (find, package managers, ls -la, container image
unpacking) are dominated by the readdir+stat pattern: the application reads a
directory, then immediately stats every entry. Without amortization, each stat()
incurs a full Core-to-VFS domain crossing (~18ns on x86-64 MPK, ~32-64ns on AArch64
POE). This section specifies two complementary mechanisms that eliminate most of those
crossings, plus a per-filesystem policy framework that controls when prefetch is safe.
14.1.1.2.1 Mechanism 1: Readdir-Plus Prefetch with Per-Task Buffer
When a filesystem's readdir implementation returns directory entries, the VFS also
collects statx metadata for each returned entry. The inode is already resolved from
the dentry lookup during readdir — fetching its attributes is essentially free for
local filesystems (the inode struct is already cache-hot). The VFS writes the metadata
into a per-task prefetch buffer allocated in Core memory.
Subsequent stat() / statx() calls on the same directory's entries check the
per-task buffer before crossing into the VFS domain. On hit, stat() returns
immediately with zero domain crossings.
Key design constraints:
- Per-task buffer, not a global cache. The buffer is private to each task. No locking, no cross-task contention, no cache coherence traffic. Allocated on first readdir() for a given directory fd, freed when the directory fd is closed or the task exits.
- Bounded size: 128 entries x ~320 bytes = ~40KB per task. This exceeds L1 data cache capacity on some architectures (ARMv7 Cortex-A15: 32KB L1D; Cortex-A72/A76: 32KB-48KB L1D). On these targets the scan spills to L2 (~10-15 cycles per access vs ~4 cycles for L1), adding ~640-1280 cycles worst-case on a full miss path. On x86-64 (32-48KB L1D typical) the buffer may fit depending on core implementation. The sequential access pattern means the hardware prefetcher keeps up regardless: L2 prefetch on ARM delivers ~6-8 cycles per line, which is acceptable for the miss path (which is cold: a readdir-stat pattern that misses has already paid a domain crossing). Miss path cost: 128 iterations x (DevId compare + InodeId compare + Relaxed atomic load) = ~384-640 cycles on L1 hit, ~640-1280 cycles with L2 spills. This is paid only when the target inode is NOT in the prefetch buffer (the common case after readdir IS a hit). The 128-entry bound covers the vast majority of directories (median directory size in real workloads is 20-60 entries). For directories larger than 128 entries, only the most recently returned batch is buffered; earlier entries that were already stat'd remain valid, later entries fall through to the normal VFS path.
- Keyed by (sb_dev, ino): stable identifiers that survive VFS crash/evolution. No path strings, no dentry pointers, no VFS-internal state.
- Generation counter per entry. VFS increments the inode's generation counter on any metadata mutation (setattr, truncate, write that changes mtime, etc.). Core checks the generation before returning a prefetch hit. Stale entries produce a miss and fall through to the normal VFS path.
- VFS epoch counter for crash/evolution invalidation. The buffer records the VFS epoch at fill time. If the VFS is replaced (crash recovery or live evolution), the global VFS_EPOCH counter is incremented and all buffers become stale in O(1).
- Core memory (Tier 0): the buffer itself is not in the VFS domain. It survives VFS crash without corruption.
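The sizing figures in the list above follow from simple multiplication. The sketch below reproduces them. The per-entry cycle costs used in the usage note (3, 5, and 10 cycles) are inferred from the quoted totals and are assumptions, not measured values; the function names are illustrative only.

```rust
/// Maximum entries per prefetch buffer (from the text).
pub const PREFETCH_ENTRIES: u64 = 128;
/// Size of one entry in bytes (from the text).
pub const ENTRY_BYTES: u64 = 320;

/// Total buffer footprint in bytes.
pub const fn buffer_bytes() -> u64 {
    PREFETCH_ENTRIES * ENTRY_BYTES
}

/// Worst-case cycles for a full linear scan, given an assumed
/// per-entry cost in cycles (hypothetical parameter).
pub const fn full_scan_cycles(cycles_per_entry: u64) -> u64 {
    PREFETCH_ENTRIES * cycles_per_entry
}
```

With these definitions, `buffer_bytes()` is 40,960 bytes (40 KiB), and `full_scan_cycles` at 3, 5, and 10 cycles per entry gives 384, 640, and 1,280 cycles, matching the bounds quoted for the L1-resident and L2-spill cases.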
Data structures:
/// Single prefetch entry. Aligned to cache line to avoid false sharing
/// when the VFS writer and the Core reader access adjacent entries.
///
/// Total size: 320 bytes (64-byte aligned).
/// - DevId (4 bytes) + InodeId (8 bytes) + generation (8 bytes)
/// + StatxBuf (256 bytes) + valid (1 byte) + padding (43 bytes) = 320.
#[repr(C, align(64))]
pub struct PrefetchEntry {
/// Superblock device ID. Together with `ino`, forms the unique key.
pub sb_dev: DevId,
/// Explicit padding: DevId is 4 bytes, InodeId requires 8-byte alignment.
pub _pad0: [u8; 4],
/// Inode number within the filesystem identified by `sb_dev`.
pub ino: InodeId,
/// Inode generation counter at the time this entry was filled.
/// VFS increments the inode's generation on any metadata mutation.
/// Core compares this against the current inode generation before
/// returning the entry. Mismatch → stale → fall through to VFS.
///
/// Memory ordering: stored with Release by VFS (during readdir fill),
/// loaded with Acquire by Core (during stat fast path).
pub generation: AtomicU64,
/// Cached statx result. Layout matches `struct statx` from Linux
/// (256 bytes, binary compatible with the userspace ABI).
pub stx: StatxBuf,
/// Entry validity flag. 0 = invalid, 1 = valid. AtomicU8 instead of
/// AtomicBool: this is cross-domain shared memory (VFS Tier 1 writes,
/// Core Tier 0 reads). A non-0/1 value from a corrupted Tier 1 domain
/// would cause UB with AtomicBool's validity invariant.
pub valid: AtomicU8,
/// Explicit trailing padding: AtomicU8 ends at offset 281; align(64)
/// rounds struct size to 320 (next multiple of 64). 320 - 281 = 39.
pub _pad_tail: [u8; 39],
}
// Layout with align(64): DevId(4) + _pad0(4) + InodeId(8) + AtomicU64(8) +
// StatxBuf(256) + AtomicU8(1) + _pad_tail(39) = 320 bytes (5 × 64-byte cache lines).
const_assert!(size_of::<PrefetchEntry>() == 320);
/// Per-task readdir prefetch buffer. Allocated in Core memory (Tier 0).
///
/// One buffer exists per open directory fd per task. The buffer is
/// populated during `readdir()` and consumed during subsequent
/// `stat()` / `statx()` calls. It is freed when the directory fd is
/// closed or the task exits.
///
/// **Collection policy**: Hot path (per-syscall lookup). `entries` is
/// a fixed-capacity `ArrayVec` — no heap allocation on the hot path.
/// The 128-entry limit bounds memory to ~40KB per task per open
/// directory, which is acceptable for the readdir+stat pattern.
pub struct TaskPrefetchBuf {
/// Prefetch entries, indexed by insertion order. Lookup is linear
/// scan over at most 128 entries (~40KB — exceeds L1 on ARMv7/some
/// AArch64; spills to L2 with ~3-5 extra cycles/access on those
/// targets). For the readdir+stat pattern, entries are accessed in
/// insertion order (sequential scan), so linear search has optimal
/// prefetch behavior regardless of L1/L2 residency.
entries: ArrayVec<PrefetchEntry, 128>,
/// Directory fd this buffer was filled for. Used to associate the
/// buffer with the correct directory on subsequent stat() calls.
/// When the fd is closed, the buffer is freed.
dir_fd: i32,
/// VFS epoch at fill time. If the global `VFS_EPOCH` has advanced
/// (crash or live evolution), the entire buffer is stale and must
/// be discarded. This is an O(1) invalidation mechanism.
vfs_epoch: u64,
}
/// Global VFS epoch counter. Incremented on VFS crash recovery or
/// live evolution. All `TaskPrefetchBuf` instances whose `vfs_epoch`
/// differs from this value are stale.
///
/// Stored as AtomicU64 in Core memory. Incremented with Release
/// ordering; read with Acquire ordering in the stat fast path.
///
/// **Longevity**: u64 counter incremented only on VFS crash or
/// evolution events. At one event per second (vastly exceeding any
/// realistic crash rate), this counter lasts ~584 billion years.
pub static VFS_EPOCH: AtomicU64 = AtomicU64::new(0);
stat() fast path (Core-side, before any domain crossing):
/// Attempt to serve a statx() call from the per-task prefetch buffer.
/// Returns `Some(stx)` on hit (zero domain crossings), `None` on miss
/// (caller falls through to the normal VFS domain crossing path).
///
/// This function runs entirely in Core (Tier 0). No locks, no domain
/// crossings, no ring buffer interaction. The only synchronization is
/// atomic loads on the VFS epoch and per-entry generation counters.
///
/// # Hot path classification
///
/// This is called on every `stat()` / `statx()` / `fstat()` /
/// `newfstatat()` syscall when the task has an active prefetch buffer.
/// Must run in bounded time (linear scan of at most 128 entries) with no heap allocation.
fn sys_statx_fast_path(task: &Task, dentry_dev: DevId, dentry_ino: InodeId) -> Option<StatxBuf> {
let buf = task.prefetch_buf.as_ref()?;
// Check VFS epoch — if VFS was replaced, entire buffer is stale.
if buf.vfs_epoch != VFS_EPOCH.load(Acquire) {
return None;
}
// Linear scan over at most 128 entries. Sequential access pattern
// means the prefetcher keeps up; worst case is 128 × 320B = 40KB
// which exceeds L1 on some architectures (ARMv7 32KB L1D, some
// AArch64 32-48KB L1D) — L2 spill adds ~3-5 cycles/access on
// those targets. Acceptable: miss path is cold (already paying a
// domain crossing on fallthrough).
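let entry = buf.entries.iter().find(|e| {
// `valid` is an AtomicU8 (0 = invalid, 1 = valid), so compare explicitly;
// any other value (e.g., from a corrupted Tier 1 domain) is treated as invalid.
e.valid.load(Relaxed) == 1 && e.sb_dev == dentry_dev && e.ino == dentry_ino
})?;
// Check generation: if the inode was mutated since readdir, the entry is stale.
// (`gen` is a reserved keyword in the Rust 2024 edition, hence `entry_gen`.)
let entry_gen = entry.generation.load(Acquire);
if entry_gen != inode_current_generation(dentry_dev, dentry_ino) {
return None;
}
Some(entry.stx)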
}
Generation counter update path (VFS-side):
When the VFS processes any inode-mutating operation (SetAttr, Truncate, Write
that updates mtime/ctime, Link, Unlink, Rename), it increments the inode's
generation counter. This is a single AtomicU64::fetch_add(1, Release) on the inode
struct — the inode is already locked for the mutation, so this adds zero contention.
The generation counter is stored in the inode struct itself (in VFS memory), and the
Core stat fast path reads it via a shared-memory mapping (the inode's generation
field is in a page mapped read-only into Core's domain). No ring buffer crossing is
needed for the generation check.
Cross-domain memory ordering invariant: This is a shared-memory cross-domain
access pattern — the VFS (Tier 1) writes the generation counter with Release,
and Core (Tier 0) reads it with Acquire. On x86-64 (TSO), these translate to
plain loads/stores with no additional fences. On ARM/AArch64 and RISC-V (weak
memory models), the Acquire load emits the appropriate barrier instruction
(LDAR on AArch64, fence r,rw on RISC-V) to ensure the stat fields are
observed consistently with the generation counter. This ordering MUST NOT be
downgraded to Relaxed in future maintenance — doing so would allow Core to
observe a new generation counter but stale stat fields, returning incorrect
metadata to userspace.
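The Release/Acquire pairing described above can be illustrated with a minimal, self-contained sketch. It is not the kernel's inode layout: `SharedInode`, its `size` field (standing in for the stat fields), and both methods are hypothetical names, and the payload is itself an atomic so the example stays within safe Rust. A single writer is assumed, mirroring the statement that the inode is locked during mutation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Toy stand-in for the shared part of an inode (hypothetical layout).
pub struct SharedInode {
    /// Stand-in for the stat fields.
    size: AtomicU64,
    /// Generation counter, bumped on every mutation.
    generation: AtomicU64,
}

impl SharedInode {
    pub const fn new() -> Self {
        SharedInode {
            size: AtomicU64::new(0),
            generation: AtomicU64::new(0),
        }
    }

    /// Writer side: update the payload first, then publish the new
    /// generation with Release so the payload write is ordered before it.
    pub fn mutate(&self, new_size: u64) {
        self.size.store(new_size, Ordering::Relaxed);
        self.generation.fetch_add(1, Ordering::Release);
    }

    /// Reader side: Acquire-load the generation, then read the payload.
    /// Observing generation `g` guarantees the payload is at least as
    /// new as the write that preceded the `g`-th bump.
    pub fn snapshot(&self) -> (u64, u64) {
        let generation = self.generation.load(Ordering::Acquire);
        let size = self.size.load(Ordering::Relaxed);
        (generation, size)
    }
}
```

If the writer's k-th mutation stores the value k, a reader that observes generation g must also observe a size of at least g. Downgrading both generation accesses to Relaxed would remove that guarantee on weakly ordered hardware.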
Readdir fill path (VFS-side):
During readdir() processing, after the filesystem driver returns each batch of
directory entries (via the VfsResponse for ReadDir), the VFS iterates over the
returned entries and for each one:
- Looks up the inode in the inode cache (already resolved during readdir).
- Copies the inode's current statx attributes into a PrefetchEntry.
- Stores the current inode generation counter.
- Writes the entry into the task's TaskPrefetchBuf via the shared-memory mapping.
This piggybacks on work already being done — the inode is cache-hot from the readdir lookup. The additional cost is ~50-80 cycles per entry (one memcpy of 256 bytes for the StatxBuf + two atomic stores). For a typical directory of 50 entries, this is ~2,500-4,000 cycles total — less than the cost of a single domain crossing.
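The per-entry fill step can be sketched as follows. This is a simplified illustration under stated assumptions: `ToyStatx` and `ToyEntry` are hypothetical stand-ins without the cache-line alignment and padding of the real PrefetchEntry, and exclusive access (`&mut self`) replaces the cross-domain shared-memory mapping.

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

/// Simplified stand-in for the 256-byte statx buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ToyStatx(pub [u8; 256]);

/// Simplified prefetch entry (no alignment or explicit padding).
pub struct ToyEntry {
    pub sb_dev: u32,
    pub ino: u64,
    pub generation: AtomicU64,
    pub stx: ToyStatx,
    pub valid: AtomicU8,
}

impl ToyEntry {
    /// An empty, invalid entry.
    pub fn empty() -> Self {
        ToyEntry {
            sb_dev: 0,
            ino: 0,
            generation: AtomicU64::new(0),
            stx: ToyStatx([0; 256]),
            valid: AtomicU8::new(0),
        }
    }

    /// Fill one entry: copy the key and the 256-byte statx payload,
    /// then perform the two atomic stores (generation, then valid).
    pub fn fill(&mut self, sb_dev: u32, ino: u64, stx: &ToyStatx, inode_generation: u64) {
        self.sb_dev = sb_dev;
        self.ino = ino;
        self.stx = *stx; // the 256-byte copy
        self.generation.store(inode_generation, Ordering::Release);
        self.valid.store(1, Ordering::Release);
    }
}
```

Publishing `valid` last means a reader that sees `valid == 1` with Acquire ordering also sees the key, payload, and generation written before it.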
14.1.1.2.2 Mechanism 2: io_uring Statx Coalescing¶
When the io_uring submission queue contains multiple consecutive IORING_OP_STATX
entries, the VFS dispatcher coalesces them into a single domain crossing. Instead of
N crossings for N stat requests, one crossing processes all N.
Detection and dispatch:
The io_uring dispatch loop (Section 19.3) already processes SQEs in
batches. When the dispatcher encounters an IORING_OP_STATX SQE, it peeks ahead in
the submission queue for consecutive IORING_OP_STATX entries, collecting up to 64
into a single batch. The batch is sent to the VFS as a single
VfsRequest::StatxBatch over the ring buffer.
/// Batched statx request. Sent as a single VfsRequest when the io_uring
/// dispatcher detects consecutive IORING_OP_STATX SQEs.
///
/// The VFS resolves all paths in a single domain stay and writes all
/// results back in a single VfsResponse::StatxBatchResult.
pub struct StatxBatchArgs {
/// Number of statx requests in this batch (1..=64).
pub count: u8,
/// DMA buffer handle containing an array of `StatxBatchEntry` structs.
/// The buffer is allocated from the io_uring's pre-registered buffer
/// pool when available, or from the shared DMA pool otherwise.
pub entries_buf: DmaBufferHandle,
}
/// Single entry within a StatxBatch request.
#[repr(C)]
pub struct StatxBatchEntry {
/// Directory fd for path resolution (AT_FDCWD or an open directory).
pub dirfd: i32,
/// AT_* flags (AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH, etc.).
pub flags: u32,
/// STATX_* request mask.
pub mask: u32,
/// Path string offset within the DMA buffer's string region.
pub path_offset: u32,
/// Path string length in bytes.
pub path_len: u16,
/// Explicit tail padding. `path_len: u16` sits at offset 16 and ends at
/// offset 18; `repr(C)` would otherwise add 2 bytes of implicit tail padding
/// to round the struct up to 20 bytes (a multiple of the `u32` alignment).
/// Making it explicit ensures no uninitialized bytes leak to userspace.
pub _pad: [u8; 2],
}
const_assert!(size_of::<StatxBatchEntry>() == 20);
/// Batched statx response. One result per entry in the request.
/// C-compatible layout: fixed array with explicit count. Neither
/// `ArrayVec` nor `Result<T, E>` have stable repr(C) layout.
#[repr(C)]
pub struct StatxBatchResult {
/// Number of valid entries in `results`.
pub count: u8,
/// Explicit padding: `count` (u8) ends at offset 1, and `results[0]` needs
/// 8-byte alignment (from StatxBuf), so 7 bytes are required: 1 + 7 = 8. CLAUDE.md rule 11.
pub _pad: [u8; 7],
/// Per-entry results. Index corresponds to the request entry index.
pub results: [StatxBatchResultEntry; 64],
}
/// Single entry in a batched statx response.
#[repr(C)]
pub struct StatxBatchResultEntry {
/// 0 = success (stx is valid), negative = errno (stx is zeroed).
pub error: i32,
/// Explicit padding: `error` (i32) occupies offsets 0..4, and `stx` needs
/// 8-byte alignment (from StatxBuf), so 4 bytes are required. CLAUDE.md rule 11.
pub _pad: [u8; 4],
/// Valid only when `error == 0`.
pub stx: StatxBuf,
}
// Layout: error(4) + _pad(4) + StatxBuf(256) = 264 bytes. All padding explicit.
const_assert!(size_of::<StatxBatchResultEntry>() == 264);
// StatxBatchResult: count(1) + _pad(7) + 64 × 264 = 16904 bytes. All padding explicit.
const_assert!(size_of::<StatxBatchResult>() == 16904);
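The peek-ahead step described under "Detection and dispatch" can be sketched as a pure function over a slice of pending opcodes. `ToyOp`, `MAX_STATX_BATCH`, and `statx_run_len` are hypothetical names for illustration; the real dispatcher operates on io_uring SQEs rather than a toy enum.

```rust
/// Toy opcode set (hypothetical; real SQEs carry IORING_OP_* values).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ToyOp {
    Statx,
    Read,
    Write,
}

/// Upper bound on a coalesced batch (from the text).
pub const MAX_STATX_BATCH: usize = 64;

/// Count how many consecutive `Statx` entries start at `head`,
/// capped at `MAX_STATX_BATCH`. A result of 0 means the entry at
/// `head` is not a statx request (or `head` is past the end).
pub fn statx_run_len(sq: &[ToyOp], head: usize) -> usize {
    sq.iter()
        .skip(head)
        .take(MAX_STATX_BATCH)
        .take_while(|op| **op == ToyOp::Statx)
        .count()
}
```

For a queue of `[Statx, Statx, Read, Statx]`, the run starting at index 0 has length 2, so two requests share one crossing; a queue of 100 consecutive statx entries yields a first batch of 64 and a second of 36.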
Key design properties:
- Transparent to userspace. Applications submit individual IORING_OP_STATX SQEs as usual. The coalescing is entirely internal to the kernel's io_uring dispatch path. Each SQE still gets its own CQE with the correct user_data and result code.
- No additional memory overhead. The batch uses the existing ring buffer and DMA buffer infrastructure. The StatxBatchEntry array is written into a DMA buffer that is already allocated for the io_uring ring.
- Works with all filesystem types. Each stat within the batch goes through normal VFS path resolution and filesystem locking. The coalescing only eliminates the domain crossing overhead, not any per-file locking. Filesystems with Never prefetch policy still benefit from crossing amortization.
- Interaction with readdir-plus prefetch. Before sending a StatxBatch to the VFS, the dispatcher checks each entry against the task's TaskPrefetchBuf. Entries that hit the prefetch buffer are resolved immediately and removed from the batch. Only cache-miss entries cross into the VFS domain. In the best case (all 64 entries hit the prefetch buffer), zero domain crossings occur.
- Ring protocol extension. StatxBatch is added as VfsOpcode::StatxBatch = 70 in the VFS ring protocol (Section 14.2). The response uses VfsOpcode::StatxBatchResult = 71. These opcodes are only generated by the io_uring coalescing path; they are never exposed to filesystem drivers directly (the VFS dispatches individual Getattr calls internally for each entry in the batch).
14.1.1.2.3 Prefetch Policy Framework¶
Each filesystem declares its readdir prefetch policy via a method on the
FileSystemOps trait (Section 14.1):
/// Policy controlling whether the VFS prefetches statx metadata during
/// readdir for this filesystem.
///
/// The default implementation returns `Always`, which is correct for all
/// single-node local filesystems. Network and cluster filesystems must
/// override this to return the appropriate policy.
pub enum ReaddirPrefetchPolicy {
/// Always prefetch. Data is authoritative — no external consistency
/// concerns. The VFS fills the per-task prefetch buffer on every
/// readdir, and stat() uses the buffer unconditionally (subject to
/// generation counter freshness).
///
/// Appropriate for: ext4, XFS, btrfs, tmpfs, procfs, sysfs, debugfs.
Always,
/// Prefetch, but layer on top of the filesystem's existing attribute
/// cache. The prefetch buffer entries are valid only as long as the
/// filesystem's own cache considers them valid. When the filesystem
/// invalidates its cache (e.g., NFS delegation recall, CIFS oplock
/// break, FUSE attr_timeout expiry), it calls
/// `vfs_invalidate_prefetch(sb_dev, ino)`, which bumps the inode's generation
/// counter so that any matching prefetch entry goes stale. No additional locking is
/// needed — the prefetch mechanism layers on top of whatever
/// consistency protocol the filesystem already implements.
///
/// Appropriate for: NFS, CIFS/SMB, FUSE, UmkaOS peerfs.
CacheAware,
/// Never prefetch. Each stat() acquires its own consistency token
/// (e.g., DLM glock) for linearizability. The domain crossing cost
/// (~18ns on x86-64) is negligible compared to the distributed lock
/// round-trip (~50-500us), so eliminating the crossing via prefetch would
/// provide no measurable benefit and would violate the consistency model.
///
/// Appropriate for: GFS2, OCFS2, UmkaOS DLM-based cluster FS.
Never,
}
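A filesystem's policy declaration could look roughly like the sketch below. The trait name `PrefetchPolicySketch`, the method name `readdir_prefetch_policy`, the three toy filesystem types, and `should_fill_on_readdir` are all assumptions made for illustration; only the enum variants and the `Always` default come from the text above.

```rust
/// Local copy of the policy enum so the sketch is self-contained.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ReaddirPrefetchPolicy {
    Always,
    CacheAware,
    Never,
}

/// Hypothetical trait carrying the policy method with its default.
pub trait PrefetchPolicySketch {
    fn readdir_prefetch_policy(&self) -> ReaddirPrefetchPolicy {
        ReaddirPrefetchPolicy::Always
    }
}

/// Local filesystem: accepts the default.
pub struct Ext4Like;
impl PrefetchPolicySketch for Ext4Like {}

/// Network filesystem: layers on its own attribute cache.
pub struct NfsLike;
impl PrefetchPolicySketch for NfsLike {
    fn readdir_prefetch_policy(&self) -> ReaddirPrefetchPolicy {
        ReaddirPrefetchPolicy::CacheAware
    }
}

/// DLM-based cluster filesystem: opts out entirely.
pub struct Gfs2Like;
impl PrefetchPolicySketch for Gfs2Like {
    fn readdir_prefetch_policy(&self) -> ReaddirPrefetchPolicy {
        ReaddirPrefetchPolicy::Never
    }
}

/// Whether the readdir path should populate the prefetch buffer.
pub fn should_fill_on_readdir(fs: &dyn PrefetchPolicySketch) -> bool {
    !matches!(fs.readdir_prefetch_policy(), ReaddirPrefetchPolicy::Never)
}
```

Using a default method means local filesystems need no code at all, while network and cluster filesystems must opt in to a stricter policy explicitly.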
Per-filesystem policy table:
| Filesystem Type | Policy | Rationale |
|---|---|---|
| ext4, XFS, btrfs, tmpfs | Always | Single-node, no external consistency concerns. Inode data is authoritative. |
| procfs, sysfs, debugfs | Always | Synthetic FS. Metadata is kernel-generated and stable within a readdir window. |
| NFS (v3/v4) | CacheAware | Layers on NFS actimeo/delegation cache. CB_RECALL bumps generation counter. |
| CIFS/SMB | CacheAware | Layers on oplock/lease cache. Lease break bumps generation counter. |
| FUSE | CacheAware | Layers on FUSE entry_timeout/attr_timeout. Unified regardless of whether the daemon supports FUSE_READDIRPLUS. |
| UmkaOS peerfs (distributed) | CacheAware | Peer protocol metadata push notifications bump generation counter (Section 5.1). |
| GFS2, OCFS2 | Never | DLM linearizability required. The ~18ns crossing is ~0.004-0.04% of the ~50-500us glock round-trip. |
| UmkaOS DLM-based cluster FS | Never | Same as GFS2/OCFS2: the DLM consistency model requires per-stat lock acquisition. |
| Overlayfs | Inherits | Upper layer: Always (local, mutable). Lower layers: read-only, so Always (immutable data is trivially consistent). |
14.1.1.2.4 Network/Cluster Filesystem Invalidation Integration¶
The CacheAware policy integrates with each filesystem's existing cache invalidation
mechanism through a single Core-side callback:
/// Invalidate a prefetch entry for the given (sb_dev, ino) pair.
/// Called by filesystem cache invalidation handlers (NFS CB_RECALL,
/// CIFS lease break, FUSE NOTIFY_INVAL_INODE, peerfs METADATA_INVALIDATE).
///
/// This function bumps the generation counter of the inode itself. It does
/// not walk per-task prefetch buffers: every entry that recorded the old
/// generation fails the freshness check on its next stat() fast-path lookup
/// and falls through to the VFS.
///
/// # Performance
///
/// Cold path, called only on cache invalidation events, which are
/// infrequent relative to stat() calls. The cost is a single generation
/// bump regardless of how many tasks hold a matching entry.
pub fn vfs_invalidate_prefetch(sb_dev: DevId, ino: InodeId) {
// Bump the generation counter on the inode. Any prefetch entry
// holding the old generation will fail the freshness check on the
// next stat() fast path and fall through to the VFS.
inode_bump_generation(sb_dev, ino);
}
Per-filesystem invalidation triggers:
- NFS: When the NFS client receives CB_RECALL (NFSv4 delegation return) or detects actimeo expiry (NFSv3/v4 attribute timeout), it calls vfs_invalidate_prefetch(sb_dev, ino). The next stat() sees a generation mismatch, crosses to the VFS, and the NFS client re-fetches attributes from the server.
- CIFS/SMB: When the CIFS client receives an oplock break or lease break notification from the SMB server, it calls vfs_invalidate_prefetch(sb_dev, ino).
- FUSE: When attr_timeout expires or the FUSE daemon sends FUSE_NOTIFY_INVAL_INODE, the FUSE client calls vfs_invalidate_prefetch(sb_dev, ino).
- UmkaOS peerfs: The peer protocol METADATA_INVALIDATE message (Section 5.1) triggers vfs_invalidate_prefetch(). This is tighter than NFS because the peer node pushes invalidations proactively (not just on delegation recall), reducing the stale-data window.
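The effect of an invalidation trigger on a cached entry can be shown with a toy model. `ToyInode`, `CachedAttrs`, and `lookup` are hypothetical names; `invalidate` stands in for the generation bump performed by vfs_invalidate_prefetch, and the real entry carries a full statx payload rather than a single size field.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Toy inode holding only the generation counter.
pub struct ToyInode {
    generation: AtomicU64,
}

impl ToyInode {
    pub fn new() -> Self {
        ToyInode { generation: AtomicU64::new(0) }
    }

    pub fn current_generation(&self) -> u64 {
        self.generation.load(Ordering::Acquire)
    }

    /// Stand-in for the invalidation callback: bump the generation.
    pub fn invalidate(&self) {
        self.generation.fetch_add(1, Ordering::Release);
    }
}

/// Toy prefetch entry: the generation seen at fill time plus a payload.
pub struct CachedAttrs {
    pub generation_at_fill: u64,
    pub size: u64,
}

/// Freshness check: a hit only if the generation is unchanged.
pub fn lookup(cached: &CachedAttrs, inode: &ToyInode) -> Option<u64> {
    if cached.generation_at_fill == inode.current_generation() {
        Some(cached.size)
    } else {
        None
    }
}
```

A lookup succeeds until `invalidate` is called; afterwards the same entry misses, and only an entry refilled with the new generation hits again.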
Why this is better than existing approaches:
- Eliminates the domain crossing. NFS READDIRPLUS only eliminates the network round-trip for attribute fetches; the kernel-side VFS domain crossing still occurs for each stat(). Our readdir-plus prefetch eliminates both.
- Unified across all remote filesystems. NFS, FUSE, CIFS, and peerfs all use the same Core-side prefetch buffer with the same generation-counter invalidation. No per-filesystem prefetch implementation is needed.
- Generation-counter invalidation is more precise than time-based expiry. NFS actimeo is a blunt timeout; our generation counter reflects actual inode mutations. The result is fewer false invalidations and a higher effective hit rate.
14.1.1.2.5 Live Evolution Interaction¶
- The prefetch buffer is in Core memory (Tier 0) and survives VFS live evolution (Section 13.18) unchanged.
- On VFS evolution: Core increments VFS_EPOCH (single fetch_add(1, Release)). All prefetch buffers become stale in O(1), with no per-task or per-entry iteration.
- The new VFS instance exports fresh inode generation counters. The first readdir() after evolution refills the buffer with current data.
- No data from the old VFS instance leaks through: the epoch check catches everything before any stale entry is returned to userspace.
14.1.1.2.6 Crash Recovery Interaction¶
- On VFS crash: Core increments VFS_EPOCH (same mechanism as evolution). All prefetch buffers are invalidated atomically.
- Buffer memory is in Core (Tier 0) and cannot be corrupted by a VFS crash.
- Subsequent stat() calls miss the buffer, fall through to the VFS domain crossing, and trigger VFS restart via the normal crash recovery path (Section 11.9).
- After VFS recovery completes, the next readdir() refills the buffer. The transient period between crash and buffer refill uses the unoptimized path (full domain crossing per stat), which is correct but slower.
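Both the live-evolution and crash-recovery paths rely on the same epoch bump. The sketch below shows why a single increment invalidates every buffer without touching any of them. `ToyPrefetchBuf` and `vfs_replaced` are hypothetical names, and the epoch is passed by reference here rather than read from a global static.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Toy prefetch buffer: remembers the epoch at fill time.
pub struct ToyPrefetchBuf {
    epoch_at_fill: u64,
    inos: Vec<u64>,
}

impl ToyPrefetchBuf {
    /// Fill a buffer, stamping it with the current epoch.
    pub fn fill(epoch: &AtomicU64, inos: Vec<u64>) -> Self {
        ToyPrefetchBuf {
            epoch_at_fill: epoch.load(Ordering::Acquire),
            inos,
        }
    }

    /// A lookup hits only if the buffer's epoch is still current.
    pub fn contains(&self, epoch: &AtomicU64, ino: u64) -> bool {
        self.epoch_at_fill == epoch.load(Ordering::Acquire) && self.inos.contains(&ino)
    }
}

/// Called on VFS crash recovery or live evolution: one increment,
/// independent of the number of buffers or entries.
pub fn vfs_replaced(epoch: &AtomicU64) {
    epoch.fetch_add(1, Ordering::Release);
}
```

Invalidation is lazy: stale buffers are never visited at replacement time, they simply fail the epoch comparison on their next lookup and are refilled by the next readdir().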
14.1.1.2.7 Amortized Performance Budget¶
With both mechanisms active, the effective metadata overhead for common access patterns:
| Access Pattern | Mechanism | Effective Overhead (x86-64 MPK) | Domain Crossings |
|---|---|---|---|
| Single stat() (no prefetch) | None | ~3.6-9% per call (~18ns / 200-500ns base) | 1 round-trip |
| readdir + stat (Always/CacheAware FS) | Readdir-plus prefetch | ~0.3-0.5% effective | 1 crossing for readdir, 0 for stat hits (~95% hit rate) |
| io_uring batch of 64 IORING_OP_STATX | Statx coalescing | ~0.05% per stat | 1 crossing for 64 stats |
| io_uring batch + prefetch buffer | Both | ~0.01% per stat (best case) | 0 crossings on full prefetch hit |
| readdir + stat (Never-policy FS) | None (DLM overhead dominates) | ~3.6-9% (negligible vs ~50-500us DLM) | 1 round-trip per stat |
Assumptions: 95% prefetch buffer hit rate for readdir+stat pattern (based on: median directory has <128 entries, stat() calls follow readdir in program order, inode mutation between readdir and stat is rare). Hit rate degrades for directories >128 entries (only the last batch is buffered) and for workloads that interleave mutations with stat.
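The percentages in the table can be reproduced from the quoted inputs (~18ns crossing, 200-500ns base stat cost, 64 stats per batch, 95% hit rate). The sketch below is a back-of-the-envelope model, not a measurement; the function names are illustrative, and the miss-weighted figure deliberately ignores the readdir crossing itself.

```rust
/// Crossing overhead as a percentage of the base stat cost, when one
/// crossing is shared by `stats_per_crossing` stat requests.
pub fn crossing_overhead_pct(crossing_ns: f64, base_ns: f64, stats_per_crossing: f64) -> f64 {
    (crossing_ns / stats_per_crossing) / base_ns * 100.0
}

/// Overhead remaining when only prefetch misses pay the crossing.
pub fn miss_weighted_pct(per_call_pct: f64, hit_rate: f64) -> f64 {
    per_call_pct * (1.0 - hit_rate)
}
```

With these inputs, a single stat costs 3.6% (500ns base) to 9% (200ns base), and a 64-entry batch costs about 0.056% per stat at the 500ns base (about 0.14% at 200ns). At a 95% hit rate the miss-only contribution is 0.18-0.45%, which is of the same order as the table's ~0.3-0.5% figure.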
Phase assignment: Readdir-plus prefetch is Phase 2 (required for metadata-heavy workload performance targets). io_uring statx coalescing is Phase 3 (optimization; the system is correct without it).
14.1.2 VFS Architecture¶
Responsibilities: path resolution, dentry caching, inode management, mount tree traversal, and permission checks (delegated to umka-core's capability system via the inter-domain ring buffer).
14.1.2.1 Nucleus / Evolvable Classification¶
Every VFS component is explicitly classified per the replaceability model
(Section 13.18). Nucleus components are non-replaceable data
structures whose invariants are verified and whose layout survives live evolution.
Evolvable components are replaceable policy modules that can be hot-swapped
via EvolvableComponent without rebooting.
| Component | Classification | Rationale |
|---|---|---|
| Dentry cache (dcache hash table, Dentry struct, LRU list) | Nucleus | Correctness-critical: path resolution depends on dentry integrity. RCU-walk protocol embeds ordering invariants into the data structure. Corrupted dcache = silent wrong-file access. Cannot be swapped while RCU readers hold references. |
| Inode cache (per-superblock XArray, Inode struct, AddressSpace) | Nucleus | Correctness-critical: inode metadata (permissions, size, link count) governs security and data integrity. The inode generation counter protocol for prefetch invalidation depends on immutable layout. |
| Mount table (MountNamespace, Mount struct, mount hash table) | Nucleus | Correctness-critical: mount tree integrity governs which filesystem serves each path. Corrupted mount table = namespace escape. The RCU-protected mount hash and propagation graph encode safety invariants. |
| SuperBlock (per-mount filesystem state, SbWriters, freeze FSM) | Nucleus | Correctness-critical: the freeze state machine, writer tracking counters, and error behavior mode are safety-critical invariants that must not change during operation. |
| VFS ring protocol (VfsRingSet, VfsRingPair, request/response format) | Nucleus | Correctness-critical: the ring is the isolation boundary. Ring layout, opcode encoding, and response matching must be immutable across live evolution. A new VFS Evolvable inherits the existing ring set (see Section 14.3). |
| Dirty extent protocol (DirtyIntentList, reserve/commit/abort API) | Nucleus | Correctness-critical: crash recovery depends on the intent list being complete and uncorrupted. The two-phase protocol's invariants (token single-use, overflow backpressure) are safety properties. |
| ErrSeq (writeback error tracking) | Nucleus | Correctness-critical: one-shot error reporting to userspace is a POSIX contract. The atomic packing of errno + counter is a verified invariant. |
| Path resolution algorithm (RCU-walk, ref-walk fallback, symlink loop detection) | Nucleus | Correctness-critical: symlink loop detection (depth limit, visited set) and mount-crossing logic are security invariants. TOCTOU resistance depends on the algorithm, not tunable parameters. |
| Readahead window sizing (sequential detection, window growth/shrink) | Evolvable | Policy decision: the heuristic for when to grow or shrink the readahead window is a tuning knob, not a correctness property. ML can improve it. The readahead engine (page pre-allocation, I/O submission) is Nucleus; the window sizing policy is Evolvable. |
| Writeback scheduling (BDI dirty page selection, inode writeback ordering) | Evolvable | Policy decision: which dirty inodes to write back first, how to interleave sequential and random I/O, and when to trigger background writeback are heuristic choices. The writeback infrastructure (bio submission, completion tracking, writeback_lock) is Nucleus. |
| Dirty page throttling (balance_dirty_pages pause duration, dirty ratio) | Evolvable | Policy decision: the bandwidth-proportional throttling algorithm and dirty ratio thresholds are tunable via sysctl and ML. The throttling mechanism (task sleep, PerCpuCounter for dirty page counts) is Nucleus. |
| Dentry LRU eviction policy (which unused dentries to reclaim first) | Evolvable | Policy decision: LRU ordering and shrinker batch size are heuristics. The LRU list data structure is Nucleus. |
| Readdir-plus prefetch policy (ReaddirPrefetchPolicy per-filesystem) | Evolvable | Policy decision: whether to prefetch statx metadata during readdir is a per-filesystem heuristic. The prefetch buffer infrastructure (TaskPrefetchBuf, PrefetchEntry) is Nucleus. |
| Doorbell coalescing policy (batch size, timeout thresholds) | Evolvable | Policy decision: the coalescing batch size and timeout are ML-tunable parameters. The coalescing mechanism (CoalescedDoorbell, atomic bitmask) is Nucleus. |
Swap mechanics: When the VFS Evolvable is live-replaced (Section 13.18), the Nucleus data structures (dcache, inode cache, mount table, ring set, dirty intent lists) are preserved in-place. The new Evolvable inherits them and resumes operation. Only the policy vtables (readahead sizing, writeback scheduling, dirty throttling, LRU eviction) are swapped. This is why the Nucleus/Evolvable boundary is drawn at the data-structure / policy-algorithm line: data survives the swap, policy is replaced.
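The data-survives, policy-is-replaced split can be illustrated with a trait object swap. This is a toy model, not the EvolvableComponent mechanism: `ReadaheadPolicy`, `Doubling`, `FixedWindow`, and `ToyVfs` are hypothetical names, and a `Vec<String>` stands in for the Nucleus data structures.

```rust
/// Hypothetical policy interface (the Evolvable side).
pub trait ReadaheadPolicy {
    fn next_window(&self, current_pages: u32) -> u32;
}

/// One policy: double the window on each step.
pub struct Doubling;
impl ReadaheadPolicy for Doubling {
    fn next_window(&self, current_pages: u32) -> u32 {
        current_pages.saturating_mul(2)
    }
}

/// Another policy: always use a fixed window.
pub struct FixedWindow(pub u32);
impl ReadaheadPolicy for FixedWindow {
    fn next_window(&self, _current_pages: u32) -> u32 {
        self.0
    }
}

/// Toy VFS: `dcache` stands in for Nucleus data, `policy` for a vtable.
pub struct ToyVfs {
    pub dcache: Vec<String>,
    policy: Box<dyn ReadaheadPolicy>,
}

impl ToyVfs {
    pub fn new(policy: Box<dyn ReadaheadPolicy>) -> Self {
        ToyVfs { dcache: Vec::new(), policy }
    }

    pub fn next_window(&self, current_pages: u32) -> u32 {
        self.policy.next_window(current_pages)
    }

    /// Replace only the policy; the data fields are left untouched.
    /// Returns the old policy so the caller can retire it.
    pub fn swap_policy(&mut self, new_policy: Box<dyn ReadaheadPolicy>) -> Box<dyn ReadaheadPolicy> {
        std::mem::replace(&mut self.policy, new_policy)
    }
}
```

After `swap_policy`, `next_window` reflects the new heuristic while `dcache` still holds exactly what it held before the swap.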
Filesystem drivers register as VFS backends. The VFS never interprets on-disk format directly — it delegates all storage operations through three trait interfaces:
Foundational VFS types (used throughout this chapter):
/// Opaque filesystem inode identifier. Unique within a single SuperBlock.
///
/// Inode 0 is never valid (used as the null sentinel in `AtomicOption`).
/// Inode 1 is conventionally the root directory inode.
/// The u64 width accommodates all known filesystem inode spaces (ext4 uses
/// u32 internally but promotes to u64 for future-proofing; Btrfs and ZFS
/// use u64 natively).
///
/// `InodeId` is filesystem-private: the same u64 value in two different
/// `SuperBlock` instances refers to different inodes.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct InodeId(pub u64);
impl From<u64> for InodeId { fn from(v: u64) -> Self { InodeId(v) } }
impl From<InodeId> for u64 { fn from(id: InodeId) -> u64 { id.0 } }
/// Opaque VFS pipe identifier. Each `pipe(2)` / `pipe2(2)` call produces a
/// unique `PipeId` for internal tracking (waitqueue association, splice
/// routing, and PipeBuffer lifetime management). Not visible to userspace.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct PipeId(pub u64);
/// Memory protection flags for `FileOps::mmap()`.
///
/// Bitfield matching Linux `PROT_*` constants from `<sys/mman.h>`.
/// Passed by the VMM to the filesystem's mmap callback so it can validate
/// or adjust protections (e.g., deny PROT_WRITE for read-only mounts,
/// deny PROT_EXEC for noexec mounts).
///
/// These are the userspace-facing PROT_* values, NOT the kernel-internal
/// VM_* flags. The VMM converts between MmapProt and VmFlags via
/// `prot_flags_to_vm_flags()` ([Section 4.8](04-memory.md#virtual-memory-manager)).
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct MmapProt(u32);
impl MmapProt {
pub const NONE: MmapProt = MmapProt(0x0);
pub const READ: MmapProt = MmapProt(0x1); // PROT_READ
pub const WRITE: MmapProt = MmapProt(0x2); // PROT_WRITE
pub const EXEC: MmapProt = MmapProt(0x4); // PROT_EXEC
pub fn contains(&self, flag: MmapProt) -> bool {
self.0 & flag.0 == flag.0
}
}
/// Result type returned by `FileOps::mmap()`.
///
/// On success, the filesystem returns `MmapResult` describing any
/// adjustments it made to the mapping. The VMM applies these adjustments
/// to the VMA after the callback returns.
///
/// For the Tier 0 in-kernel direct-call path (where `f_op.mmap(f, &mut vma)`
/// modifies the VMA directly), `Ok(MmapResult)` is returned after the VMA
/// has been modified in place. For the KABI ring transport (Tier 1/2), the
/// decomposed return struct carries the adjusted fields back to the VMM.
pub struct MmapResult {
/// Adjusted vm_flags (the filesystem may set VM_IO, clear VM_MAYWRITE, etc.).
/// If the filesystem did not modify flags, this equals the input vm_flags.
pub vm_flags: u64,
/// Filesystem-specific VmOperations handle (opaque u64 for KABI transport).
/// The VMM sets `vma.vm_ops` from this value. Zero means no custom vm_ops.
pub vm_ops_handle: u64,
}
/// Response envelope for cross-domain VFS ring buffer calls.
///
/// This is the kernel-internal typed representation. The wire-level
/// representation on the ring buffer is `VfsResponseWire`
/// ([Section 14.2](#vfs-ring-buffer-protocol)), which uses a single `i64 status`
/// field for compact encoding. The VFS dispatch layer converts between
/// the two: `status >= 0` → `Ok(status)`, `status < 0 && status !=
/// i64::MIN` → `Err(status as i32)`, `status == i64::MIN` → `Pending`.
#[derive(Debug)]
pub enum VfsResponse {
/// Success, possibly with a return value (e.g., byte count for read/write).
Ok(i64),
/// Error code (negated Linux errno, e.g., `-ENOENT`).
Err(i32),
/// Asynchronous completion pending; caller must wait on the completion ring.
Pending,
}
/// Filesystem-level operations (mount, unmount, statfs).
/// Implemented once per filesystem type (ext4, XFS, btrfs, ZFS, tmpfs, etc.).
pub trait FileSystemOps: Send + Sync {
/// Mount a filesystem from the given source device with flags and options.
fn mount(&self, source: &str, flags: MountFlags, data: &[u8]) -> Result<SuperBlock>;
/// Unmount a previously mounted filesystem.
fn unmount(&self, sb: &SuperBlock) -> Result<()>;
/// Force-unmount: abort in-flight I/O with EIO. Called when umount2()
/// is invoked with MNT_FORCE. Not all filesystems support this — return
/// ENOSYS if unsupported. NFS uses this for stale server recovery.
fn force_umount(&self, sb: &SuperBlock) -> Result<()>;
/// Return filesystem statistics (total/free/available blocks and inodes).
fn statfs(&self, sb: &SuperBlock) -> Result<StatFs>;
/// Flush all dirty data and metadata for this filesystem to stable storage.
/// Backend for syncfs(2) and the filesystem-level portion of sync(2).
fn sync_fs(&self, sb: &SuperBlock, wait: bool) -> Result<()>;
/// Remount with changed flags/options (e.g., `mount -o remount,ro`).
fn remount(&self, sb: &SuperBlock, flags: MountFlags, data: &[u8]) -> Result<()>;
/// Freeze the filesystem for a consistent snapshot. All pending writes are
/// flushed and new writes block until thaw. Used by LVM snapshots, device-mapper,
/// and backup tools via FIFREEZE ioctl.
fn freeze(&self, sb: &SuperBlock) -> Result<()>;
/// Thaw a previously frozen filesystem, allowing writes to resume.
fn thaw(&self, sb: &SuperBlock) -> Result<()>;
/// Format filesystem-specific mount options for /proc/mounts output.
fn show_options(&self, sb: &SuperBlock, buf: &mut [u8]) -> Result<usize>;
/// Declare the filesystem's write mode. Called once at mount time and cached
/// by the VFS in `SuperBlock.write_mode`. Informs writeback scheduling, page
/// cache sharing, and free space accounting.
/// See [Section 14.4](#vfs-fsync-and-cow--copy-on-write-and-redirect-on-write-infrastructure)
/// for the `WriteMode` enum and design rationale.
/// Default: `WriteMode::InPlace` (traditional overwrite semantics).
fn write_mode(&self) -> WriteMode {
WriteMode::InPlace
}
}
/// Inode (directory structure) operations.
/// Handles namespace operations: lookup, create, link, unlink, rename.
///
/// Note: `OsStr` is a kernel-defined type (NOT `std::ffi::OsStr`, which is
/// unavailable in `no_std`). It is a dynamically-sized type (DST) wrapping
/// `[u8]`, representing filenames that may contain arbitrary non-UTF-8 bytes
/// (Linux filenames are byte strings, not Unicode). Defined in
/// `umka-vfs/src/types.rs`:
/// `pub struct OsStr([u8]);`
/// As a DST, `OsStr` cannot be used by value — it is always behind a
/// reference (`&OsStr`) or `Box<OsStr>`. `&OsStr` is a fat pointer
/// (pointer + length), analogous to `&[u8]` but carrying the semantic
/// intent of "filesystem name component." Conversion from `&str` is
/// infallible (UTF-8 is a valid byte sequence); conversion TO `&str`
/// returns `Result` (may fail on non-UTF-8 filenames).
pub trait InodeOps: Send + Sync {
/// Look up a child entry by name within a parent directory.
fn lookup(&self, parent: InodeId, name: &OsStr) -> Result<InodeId>;
/// Create a regular file in the given directory.
fn create(&self, parent: InodeId, name: &OsStr, mode: FileMode) -> Result<InodeId>;
/// Create a subdirectory.
fn mkdir(&self, parent: InodeId, name: &OsStr, mode: FileMode) -> Result<InodeId>;
/// Create a hard link: new entry `new_name` in `new_parent` pointing to `inode`.
fn link(&self, inode: InodeId, new_parent: InodeId, new_name: &OsStr) -> Result<()>;
/// Create a symbolic link containing `target` at `parent/name`.
fn symlink(&self, parent: InodeId, name: &OsStr, target: &OsStr) -> Result<InodeId>;
/// Read the target of a symbolic link.
fn readlink(&self, inode: InodeId, buf: &mut [u8]) -> Result<usize>;
/// Create a device special file (block/char device, FIFO, or socket).
fn mknod(&self, parent: InodeId, name: &OsStr, mode: FileMode, dev: DevId) -> Result<InodeId>;
/// Remove a non-directory entry (regular file, symlink, or special file).
fn unlink(&self, parent: InodeId, name: &OsStr) -> Result<()>;
/// Remove an empty directory. Separate from unlink for POSIX semantics:
/// `unlink()` on a directory returns EISDIR; `rmdir()` on a file returns ENOTDIR.
fn rmdir(&self, parent: InodeId, name: &OsStr) -> Result<()>;
/// Rename/move a directory entry, possibly across directories.
/// `flags` supports RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT
/// (Linux renameat2 semantics, required for overlayfs).
fn rename(
&self,
old_parent: InodeId, old_name: &OsStr,
new_parent: InodeId, new_name: &OsStr,
flags: RenameFlags,
) -> Result<()>;
/// Get inode attributes (size, mode, timestamps, link count).
fn getattr(&self, inode: InodeId) -> Result<InodeAttr>;
/// Set inode attributes (chmod, chown, utimes).
fn setattr(&self, inode: InodeId, attr: &SetAttr) -> Result<()>;
/// Truncate a byte range within a file, deallocating the corresponding
/// on-disk blocks (extent tree updates, journal entries, COW handling).
/// Used by hole-punch (`FALLOC_FL_PUNCH_HOLE`) and range-discard
/// operations. The VFS calls this after evicting the affected pages
/// from the page cache; the filesystem is responsible only for the
/// on-disk state. `start` and `end` are byte offsets (inclusive start,
/// exclusive end; `end == u64::MAX` means "to end of file").
fn truncate_range(&self, inode: InodeId, start: u64, end: u64) -> Result<(), IoError>;
/// List extended attributes on an inode.
fn listxattr(&self, inode: InodeId, buf: &mut [u8]) -> Result<usize>;
/// Get an extended attribute value.
fn getxattr(&self, inode: InodeId, name: &OsStr, buf: &mut [u8]) -> Result<usize>;
/// Set an extended attribute value.
fn setxattr(&self, inode: InodeId, name: &OsStr, value: &[u8], flags: XattrFlags)
-> Result<()>;
/// Remove an extended attribute.
fn removexattr(&self, inode: InodeId, name: &OsStr) -> Result<()>;
/// Flush inode metadata to stable storage. Called by
/// `vfs_fsync_metadata()` for O_SYNC/O_DSYNC writes when the inode's
/// on-disk metadata must be updated (timestamps, size, block map).
///
/// `sync_mode`: `WB_SYNC_ALL` (wait for I/O completion) or
/// `WB_SYNC_NONE` (schedule I/O but do not wait). O_SYNC always uses
/// `WB_SYNC_ALL`.
fn write_inode(&self, ino: InodeId, sync_mode: WriteSyncMode) -> Result<()>;
}
/// Validated userspace pointer wrapper for writing data to userspace.
///
/// `UserSliceMut` represents a region of userspace memory that the kernel has
/// validated for write access. It ensures that:
/// 1. The pointer range `[ptr, ptr + len)` lies entirely within the task's
/// user address space (below `TASK_SIZE`, not in kernel address space).
/// 2. The pages are mapped writable (or will be demand-faulted on copy).
///
/// **Construction**: Created by `UserSliceMut::new(ptr, len)` which performs
/// the address range validation. This is called early in the syscall path
/// (before any I/O) so that an invalid buffer is rejected with `EFAULT`
/// before work is done.
///
/// **Copy path**: `copy_to_user(dst: &mut UserSliceMut, src: &[u8])` copies
/// kernel data into the validated userspace region. The copy handles:
/// - Page faults: if a destination page is not resident, the fault handler
/// allocates and maps it (demand paging), then retries the copy.
/// - Partial copies: if a fault cannot be resolved (e.g., SIGBUS on a
/// mapped-but-uncommittable page), the copy returns the number of bytes
/// successfully copied. The caller (VFS read dispatch) returns a short
/// read to userspace.
/// - SMAP/PAN enforcement: on architectures with Supervisor Mode Access
/// Prevention (x86 SMAP, ARM PAN), the copy temporarily enables user
/// access via `stac`/`clac` (x86) or `uaccess_enable`/`uaccess_disable`
/// (ARM). The access window is scoped to the copy operation.
///
/// **Advance semantics**: After each `copy_to_user()` call, the internal
/// pointer advances by the number of bytes written and `remaining()` decreases
/// accordingly. This allows iterative filling (e.g., page-by-page copy from
/// the page cache in `generic_file_read_iter()`).
///
/// **Thread safety**: `UserSliceMut` is `!Send` and `!Sync` — it is valid
/// only for the current task's address space on the current CPU. It must not
/// be stored beyond the syscall lifetime.
pub struct UserSliceMut {
/// Validated userspace destination pointer. Guaranteed to be below
/// `TASK_SIZE` at construction time.
ptr: *mut u8,
/// Remaining bytes available for writing.
len: usize,
}
impl UserSliceMut {
/// Create a validated userspace write buffer.
///
/// Returns `EFAULT` if `ptr + len` overflows or exceeds `TASK_SIZE`.
pub fn new(ptr: *mut u8, len: usize) -> Result<Self, Errno>;
/// Number of bytes remaining in the buffer.
pub fn remaining(&self) -> usize;
/// Copy `src` into the userspace buffer, advancing the internal pointer.
/// Returns the number of bytes actually copied (may be less than
/// `src.len()` if a page fault cannot be resolved).
pub fn write(&mut self, src: &[u8]) -> Result<usize, Errno>;
}
/// Validated userspace pointer wrapper for reading data from userspace.
///
/// Analogous to `UserSliceMut` but for kernel reads from user memory.
/// `copy_from_user(dst: &mut [u8], src: &mut UserSlice)` copies userspace data
/// into a kernel buffer with the same fault-handling and SMAP/PAN semantics
/// as `UserSliceMut`.
pub struct UserSlice {
/// Validated userspace source pointer. Guaranteed to be below
/// `TASK_SIZE` at construction time.
ptr: *const u8,
/// Remaining bytes available for reading.
len: usize,
}
impl UserSlice {
/// Create a validated userspace read buffer.
///
/// Returns `EFAULT` if `ptr + len` overflows or exceeds `TASK_SIZE`.
pub fn new(ptr: *const u8, len: usize) -> Result<Self, Errno>;
/// Number of bytes remaining in the buffer.
pub fn remaining(&self) -> usize;
/// Copy data from the userspace buffer into `dst`, advancing the
/// internal pointer. Returns the number of bytes actually copied.
pub fn read(&mut self, dst: &mut [u8]) -> Result<usize, Errno>;
}
/// File data operations (open, read, write, sync, allocate, close).
pub trait FileOps: Send + Sync {
/// Called when a file is opened. Allows the filesystem to initialize per-open
/// state (NFS delegation, device state, lock state). Returns a filesystem-private
/// context value stored in the file descriptor.
fn open(&self, inode: InodeId, flags: OpenFlags) -> Result<u64>;
/// Called when the last file descriptor referencing this open file is closed.
/// Filesystem releases per-open state (flock release-on-close, NFS delegation
/// return, device cleanup). `private` is the value returned by `open()`.
fn release(&self, inode: InodeId, private: u64) -> Result<()>;
/// Read data from a file. `file` provides the OpenFile context (f_pos,
/// f_flags, filesystem-private state). `offset` is read-write: the
/// implementation advances it by the number of bytes read (supporting
/// both pread with caller-supplied offset and read with f_pos).
/// `buf` is a user-space slice descriptor for safe copy-to-user.
fn read(
&self,
file: &OpenFile,
buf: &mut UserSliceMut,
offset: &mut i64,
) -> Result<usize, IoError>;
/// Write data to a file. Same conventions as `read()`: `offset` is
/// advanced by the number of bytes written.
fn write(
&self,
file: &OpenFile,
buf: &mut UserSlice,
offset: &mut i64,
) -> Result<usize, IoError>;
/// Truncate a file to the specified size. This is separate from setattr
/// because truncation is a complex operation on many filesystems: it must
/// free blocks/extents, update extent trees, handle COW (ZFS/btrfs),
/// interact with snapshots, and flush in-progress writes beyond the new
/// size. The VFS calls truncate after updating the in-memory inode size.
/// `private` is the filesystem-private context value returned by `open()`.
fn truncate(&self, inode: InodeId, private: u64, new_size: u64) -> Result<()>;
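/// Flush file data (and optionally metadata) to stable storage for the byte
/// range given by `start` and `end`. A nonzero `datasync` selects fdatasync(2)
/// semantics: metadata not needed to read the data back (e.g., timestamps)
/// may be left unflushed.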
/// `private` is the filesystem-private context value returned by `open()`.
/// For DSM-managed pages: fsync waits for both local writeback completion
/// AND DSM PutAck receipt ([Section 6.12](06-dsm.md#dsm-subscriber-controlled-caching--fsync-semantics)).
/// The VFS fsync path calls `dsm_sync_pages(inode)` after
/// `filemap_write_and_wait_range()` to ensure DSM coherence.
fn fsync(&self, inode: InodeId, private: u64, start: u64, end: u64, datasync: u8) -> Result<()>;
/// Pre-allocate or punch holes in file storage. `private` is the
/// filesystem-private context value returned by `open()`.
fn fallocate(&self, inode: InodeId, private: u64, offset: u64, len: u64, mode: FallocateMode) -> Result<()>;
/// Read directory entries. Returns entries starting from `offset` (an opaque
/// cookie, not a byte position). The callback is invoked for each entry; it
/// returns `false` to stop iteration (buffer full). This is the backend for
/// `getdents64(2)`. `private` is the filesystem-private context value
/// returned by `open()`.
fn readdir(
&self,
inode: InodeId,
private: u64,
offset: u64,
emit: &mut dyn FnMut(InodeId, u64, FileType, &OsStr) -> bool,
) -> Result<()>;
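/// Seek to a data or hole region (SEEK_DATA / SEEK_HOLE, lseek(2)).
/// Filesystems that do not track holes treat the whole file as data:
/// SEEK_DATA returns `offset` unchanged and SEEK_HOLE returns the file size
/// (the implicit hole at end of file). Both return ENXIO when `offset` is at
/// or beyond end of file.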
/// `private` is the filesystem-private context value returned by `open()`.
fn llseek(&self, inode: InodeId, private: u64, offset: i64, whence: SeekWhence) -> Result<u64>;
/// Map a file region into a process address space. The VFS calls this to
/// obtain the page frame list; the actual page table manipulation is done
/// by umka-core (Section 4.1). Filesystems that do not support mmap (e.g.,
/// procfs, sysfs) return ENODEV. `private` is the filesystem-private
/// context value returned by `open()`.
fn mmap(&self, inode: InodeId, private: u64, offset: u64, len: usize, prot: MmapProt) -> Result<MmapResult>;
/// Handle a filesystem-specific ioctl. The VFS dispatches generic ioctls
/// (FIOCLEX, FIONREAD, etc.) itself; only unrecognized ioctls reach the
/// filesystem driver. Returns ENOTTY for unsupported ioctls. `private` is
/// the filesystem-private context value returned by `open()`.
fn ioctl(&self, inode: InodeId, private: u64, cmd: u32, arg: u64) -> Result<i64>;
/// Splice data between a file and a pipe without copying through userspace.
/// Backend for splice(2), sendfile(2), and copy_file_range(2). Filesystems
/// that do not implement this get a generic page-cache-based fallback
/// provided by the VFS. `private` is the filesystem-private context value
/// returned by `open()`.
fn splice_read(
&self,
inode: InodeId,
private: u64,
offset: u64,
pipe: PipeId,
len: usize,
) -> Result<usize>;
/// Splice data from a pipe into a file without copying through userspace.
/// Reverse direction of splice_read: pipe is the data source, file is the
/// destination. Backend for the pipe-to-file direction of splice(2).
/// Filesystems that do not implement this get a generic page-cache-based
/// fallback provided by the VFS. `private` is the filesystem-private
/// context value returned by `open()`.
fn splice_write(
&self,
pipe: PipeId,
inode: InodeId,
private: u64,
offset: u64,
len: usize,
) -> Result<usize>;
/// Remap a file range: create shared extent references between files.
/// Backend for FICLONE, FICLONERANGE, and FIDEDUPERANGE ioctls, and the
/// server-side copy path of copy_file_range(2). Source and destination
/// must be on the same filesystem.
///
/// `flags` controls behavior (see `RemapFlags` in
/// [Section 14.4](#vfs-fsync-and-cow--copy-on-write-and-redirect-on-write-infrastructure)):
/// - `REMAP_FILE_DEDUP`: only remap if source and destination byte ranges
/// are identical (deduplication mode; byte-by-byte comparison first).
/// - `REMAP_FILE_CAN_SHORTEN`: caller accepts a shorter remap than
/// requested (e.g., if source extent ends before `len` bytes).
///
/// Returns the number of bytes actually remapped. Filesystems that do not
/// support reflinks return `EOPNOTSUPP`. The VFS generic layer handles
/// permission checks, file size validation, and lock ordering before
/// dispatching to this method.
fn remap_file_range(
&self,
src_inode: InodeId,
src_private: u64,
src_offset: u64,
dst_inode: InodeId,
dst_private: u64,
dst_offset: u64,
len: u64,
flags: RemapFlags,
) -> Result<u64> {
Err(Errno::EOPNOTSUPP)
}
/// Poll for readiness events (POLLIN, POLLOUT, POLLERR, etc.).
///
/// Called by `poll(2)`, `select(2)`, and `epoll_ctl(EPOLL_CTL_ADD)` to:
/// 1. Register the caller's wait entry on the file's internal WaitQueue(s)
/// via `poll_wait()`, so the caller is woken when readiness changes.
/// 2. Return the current readiness mask (which events are ready *right now*).
///
/// `pt` is `Some(&mut PollTable)` on the first call (registration pass) and
/// `None` on subsequent re-polls after wakeup (just check readiness, don't
/// re-register). Regular files always return `EPOLLIN | EPOLLOUT | EPOLLRDNORM
/// | EPOLLWRNORM` — they are always ready. Special files (pipes, sockets,
/// eventfd, signalfd, timerfd, pidfd) check their internal state and call
/// `poll_wait()` on the appropriate WaitQueue(s).
///
/// `private` is the filesystem-private context value returned by `open()`.
fn poll(
&self,
inode: InodeId,
private: u64,
events: PollEvents,
pt: Option<&mut PollTable>,
) -> Result<PollEvents>;
}
/// Poll callback registration table.
///
/// Passed to `FileOps::poll()` on the first call. The file implementation calls
/// `poll_wait(wq, pt)` for each WaitQueue that can change the file's readiness.
/// The `PollTable` records which wait queues were registered so that the polling
/// infrastructure (epoll, poll, select) can install wakeup callbacks.
///
/// **Lifecycle**: allocated on the caller's stack (for poll/select) or embedded
/// in the `EpollItem` (for epoll). The `queue_proc` function pointer is the
/// mechanism that installs the actual `WaitQueueEntry`:
/// - For `poll(2)` / `select(2)`: installs a one-shot entry that wakes the
/// calling task.
/// - For `epoll_ctl(EPOLL_CTL_ADD)`: installs a persistent entry whose wakeup
/// function is `ep_poll_callback` ([Section 19.1](19-sysapi.md#syscall-interface--epoll-primary)).
pub struct PollTable {
/// Callback invoked by `poll_wait()`. Installs a `WaitQueueEntry` on the
/// given `WaitQueueHead`. The `key` parameter carries the events mask so
/// the wakeup callback can filter spurious wakes.
pub queue_proc: fn(wq: &WaitQueueHead, pt: &mut PollTable, key: PollEvents),
/// Opaque pointer to the polling infrastructure's private state.
/// For epoll: points to the `EpollItem` that owns this poll table entry.
/// For poll/select: points to the per-fd poll state on the caller's stack.
/// SAFETY: For poll/select: points to caller-stack-allocated poll state,
/// valid for the duration of the poll syscall. For epoll: points to the
/// owning EpollItem, valid for the EpollItem's lifetime. The queue_proc
/// callback must cast to the correct type.
pub private: *mut (),
/// Events the caller is interested in. Set by the polling infrastructure
/// before calling `FileOps::poll()`. The file implementation may use this
/// to avoid registering on wait queues that cannot produce requested events.
pub events: PollEvents,
}
/// Register a wait queue with the poll table.
///
/// Called by `FileOps::poll()` implementations to tell the polling infrastructure
/// "wake me when this wait queue fires." The `PollTable` installs a
/// `WaitQueueEntry` on `wq` with the appropriate wakeup function.
///
/// If `pt` is `None` (re-poll after wakeup), this is a no-op — the entry is
/// already installed from the first call.
///
/// **Cost**: one `WaitQueueEntry` insertion per wait queue per monitored fd.
/// Most files have one wait queue; sockets may have two (read + write).
///
/// ```rust
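/// fn poll_wait(wq: &WaitQueueHead, pt: Option<&mut PollTable>) {
///     if let Some(pt) = pt {
///         // Copy the fn pointer and event mask out first (assumes
///         // `PollEvents: Copy`): `pt` is mutably reborrowed for the call,
///         // so its fields cannot also be read in the argument list.
///         let (queue_proc, events) = (pt.queue_proc, pt.events);
///         queue_proc(wq, pt, events);
///     }
/// }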
/// ```
pub fn poll_wait(wq: &WaitQueueHead, pt: Option<&mut PollTable>);
/// Dentry (directory entry) lifecycle operations.
/// Most filesystems use the default VFS implementations. Only network and
/// clustered filesystems need custom implementations (primarily d_revalidate).
pub trait DentryOps: Send + Sync {
/// Revalidate a cached dentry. Called before using a cached dentry to verify
/// it is still valid. Returns true if the dentry is still valid, false if
/// the VFS should discard it and perform a fresh lookup.
/// Default: always returns true (local filesystems).
/// Network FS: checks with the server. Clustered FS: checks DLM lease (Section 15.12.6).
fn d_revalidate(&self, parent: InodeId, name: &OsStr) -> Result<bool> {
Ok(true)
}
/// Custom name comparison. Called during lookup to compare a dentry name
/// with a search name. Used by case-insensitive filesystems (e.g., VFAT,
/// CIFS with case folding, ext4 with casefold feature).
/// Default: byte-exact comparison.
fn d_compare(&self, name: &OsStr, search: &OsStr) -> bool {
name == search
}
/// Returns a custom hash for this dentry name, or `None` to use the
/// VFS default (SipHash-1-3 with per-superblock key from `SuperBlock.hash_key`).
/// Must be consistent with d_compare: if two names are equal per d_compare,
/// they must produce the same hash.
///
/// The VFS lookup layer calls `d_hash()` and checks the return value.
/// If `None`, the VFS uses its own SipHash-1-3 with the per-superblock
/// random key directly, without requiring filesystem involvement. This
/// matches Linux's pattern where `d_hash` is only invoked when
/// `dentry->d_op->d_hash` is non-NULL.
///
/// Filesystems with custom hash requirements (e.g., case-insensitive)
/// override this to return `Some(hash_value)` using their own algorithm —
/// they never see the SipHash key. The per-superblock key is managed by
/// the VFS, not exposed to filesystem implementations.
fn d_hash(&self, name: &OsStr) -> Option<u64> {
None
}
/// Called when a dentry's reference count drops to zero (dentry enters
/// the unused LRU list). Filesystem can veto caching by returning false.
fn d_delete(&self, inode: InodeId, name: &OsStr) -> bool {
true // default: allow LRU caching
}
/// Called when a dentry is finally freed from the cache.
fn d_release(&self, inode: InodeId, name: &OsStr) {}
}
/// Kernel-internal inode attribute structure. Contains all fields exposed by
/// Linux statx(2). The SysAPI layer translates to the userspace struct statx
/// layout (different field ordering, padding, and encoding).
pub struct InodeAttr {
/// Bitmask of valid fields (STATX_* flags). Filesystems set only
/// the bits for fields they actually populate.
pub mask: u32,
pub mode: u32, // File type and permissions. u32 for internal storage
// convenience and future extensibility. Only bits [15:0]
// are defined (identical to Linux umode_t). Bits [31:16]
// are reserved and must be zero. The SysAPI translation
// to userspace statx truncates to u16.
pub nlink: u32, // Hard link count
pub uid: u32, // Owner UID
pub gid: u32, // Group GID
pub ino: u64, // Inode number
pub size: u64, // File size in bytes
pub blocks: u64, // 512-byte blocks allocated
pub blksize: u32, // Preferred I/O block size
// Timestamps with nanosecond precision
pub atime_sec: i64, // Last access
pub atime_nsec: u32,
pub mtime_sec: i64, // Last modification
pub mtime_nsec: u32,
pub ctime_sec: i64, // Last status change
pub ctime_nsec: u32,
pub btime_sec: i64, // Creation time (birth time)
pub btime_nsec: u32,
/// Device ID (for device special files: char/block). Uses the `DevId` type
/// ([Section 14.5](#device-node-framework)) with Linux-compatible MKDEV encoding:
/// `(major << 20) | (minor & 0xFFFFF)`. Major occupies bits 31:20 (12 bits,
/// 0-4095), minor occupies bits 19:0 (20 bits, 0-1048575). The SysAPI layer
/// (the Linux-compatible syscall dispatch layer) splits `DevId` into separate
/// `stx_rdev_major`/`stx_rdev_minor` u32 fields for `statx()` responses using
/// `dev_id.major()` and `dev_id.minor()`.
pub rdev: DevId,
/// Device ID of the filesystem containing this inode. Same `DevId` encoding
/// as `rdev`. The SysAPI layer splits into `stx_dev_major`/`stx_dev_minor`
/// for `statx()` responses.
pub dev: DevId,
pub mount_id: u64, // Mount identifier (STATX_MNT_ID, since Linux 5.8)
pub attributes: u64, // File attributes (STATX_ATTR_* flags)
pub attributes_mask: u64, // Supported attributes mask
// Direct I/O alignment (STATX_DIOALIGN, since Linux 6.1)
pub dio_mem_align: u32, // Required alignment for DIO memory buffers
pub dio_offset_align: u32, // Required alignment for DIO file offsets
// Subvolume identifier (STATX_SUBVOL, since Linux 6.10; btrfs, bcachefs)
pub subvol: u64,
// Atomic write limits (STATX_WRITE_ATOMIC, since Linux 6.11)
pub atomic_write_unit_min: u32, // Min atomic write size (power-of-2)
pub atomic_write_unit_max: u32, // Max atomic write size (power-of-2)
pub atomic_write_segments_max: u32, // Max segments in atomic write
pub atomic_write_unit_max_opt: u32, // Optimal max atomic write size (STATX_WRITE_ATOMIC, since Linux 6.16)
// Direct I/O read alignment (STATX_DIO_READ_ALIGN, since Linux 6.14)
pub dio_read_offset_align: u32, // DIO read offset alignment (0 = use dio_offset_align)
}
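The MKDEV-style packing described for the rdev and dev fields can be checked with a small worked example. The sketch below uses a hypothetical stand-in for DevId; only the 12-bit major / 20-bit minor layout is taken from the comments above, while the range-checking constructor and the raw() accessor are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the chapter's `DevId`. Only the bit layout
/// (major in bits 31:20, minor in bits 19:0) comes from the text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DevId(u32);

impl DevId {
    const MINOR_BITS: u32 = 20;
    const MINOR_MASK: u32 = (1 << Self::MINOR_BITS) - 1; // 0xFFFFF
    const MAJOR_MAX: u32 = 0xFFF; // 12 bits

    /// Pack `major` and `minor`. Rejecting out-of-range values (instead of
    /// masking them) is an assumption of this sketch.
    pub fn new(major: u32, minor: u32) -> Option<Self> {
        if major > Self::MAJOR_MAX || minor > Self::MINOR_MASK {
            return None;
        }
        Some(DevId((major << Self::MINOR_BITS) | minor))
    }

    /// Upper 12 bits.
    pub fn major(self) -> u32 {
        self.0 >> Self::MINOR_BITS
    }

    /// Lower 20 bits.
    pub fn minor(self) -> u32 {
        self.0 & Self::MINOR_MASK
    }

    /// Packed 32-bit value.
    pub fn raw(self) -> u32 {
        self.0
    }
}
```

For instance, major 8 and minor 1 pack to 0x0080_0001, and major() / minor() recover the two halves that a statx() translation would place into the separate major and minor fields.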
Linux comparison: Linux's VFS uses struct super_operations, struct inode_operations,
struct file_operations, and struct dentry_operations, all of which are C structs of
function pointers (file_operations alone has 30+ methods). UmkaOS's trait-based design
serves the same purpose with Rust's safety guarantees: a filesystem that omits a required
method such as fsync fails to compile, instead of dereferencing a null pointer at runtime.
The trait methods above cover the operations needed for POSIX compatibility, including
remap_file_range() for reflink/clone/dedup (see Section 14.4). Rarely used
operations (e.g., fiemap) are handled by generic VFS fallback code that calls the
core read/write/fallocate methods.
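To make the compile-time claim concrete, the following sketch reduces FileOps to a hypothetical two-method trait, MiniFileOps. The names MiniFileOps and ToyFs, and the simplified signatures, are illustrative assumptions and not part of the umka-vfs API. It shows the two cases described above: a required method that every filesystem must supply, and an optional one with a default body.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EOPNOTSUPP,
}

/// Reduced, hypothetical stand-in for `FileOps`.
pub trait MiniFileOps {
    /// Required method: an `impl` block that leaves it out is rejected by
    /// the compiler (error E0046, missing trait item).
    fn fsync(&self, inode: u64) -> Result<(), Errno>;

    /// Optional method: filesystems without reflink support inherit this
    /// default, mirroring the `remap_file_range` default shown earlier.
    fn remap_file_range(&self, _src: u64, _dst: u64, _len: u64) -> Result<u64, Errno> {
        Err(Errno::EOPNOTSUPP)
    }
}

/// Toy filesystem that implements only the required method.
pub struct ToyFs;

impl MiniFileOps for ToyFs {
    fn fsync(&self, _inode: u64) -> Result<(), Errno> {
        Ok(()) // nothing to flush in this toy
    }
}
```

Deleting the fsync body from the impl block turns the omission into a build failure, whereas the equivalent mistake in a C function-pointer table would leave a NULL slot to be discovered at call time.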
14.1.2.2 File Handle Export (ExportOps)
The ExportOps trait is implemented by filesystems that support persistent file handles —
opaque tokens that identify an inode across server reboots and path renames. Required for:
- NFS server (clients hold file handles that survive server restart)
- CRIU checkpoint/restore (open_by_handle_at reopens files by handle)
- Backup software (rsync --no-implied-dirs, backup agents)
/// File system export operations. Optional — implement only if the filesystem
/// supports persistent, path-independent file handles.
///
/// A file handle is a short opaque byte string (max 128 bytes) that uniquely
/// identifies an inode within a filesystem instance. The handle must survive:
/// - Server reboots (handle encodes stable inode ID + generation counter)
/// - Directory renames (handle does not encode path)
/// - Mount point changes (handle is filesystem-relative, not global)
pub trait ExportOps: Send + Sync {
/// Encode an inode into a file handle.
///
/// Returns the handle bytes written and a filesystem-defined `fh_type` code
/// (passed back to `decode_fh`; used to distinguish handle formats).
///
/// # Typical encoding
/// ext4: [ inode_number: u32, generation: u32 ] → 8 bytes, fh_type=1
/// XFS: [ ino: u64, gen: u32, parent_ino: u64, parent_gen: u32 ] → 24 bytes, fh_type=1
/// Btrfs: [ objectid: u64, root_objectid: u64, gen: u64 ] → 24 bytes, fh_type=1
///
/// Returns `Err(EOVERFLOW)` if `max_bytes` is too small for this filesystem's handle.
fn encode_fh(
&self,
inode: &Inode,
handle: &mut [u8; 128],
max_bytes: usize,
// If true, include parent inode info to enable NFS reconnect after server reboot.
connectable: bool,
) -> Result<(usize, u8), VfsError>; // (bytes_written, fh_type)
/// Decode a file handle back to an inode reference.
///
/// Called by `open_by_handle_at`. Must look up the inode using the filesystem's
/// internal handle format without path traversal.
///
/// Returns `Err(ESTALE)` if the inode no longer exists or the generation counter
/// does not match (inode number reused after deletion).
fn decode_fh(
&self,
handle: &[u8],
fh_type: u8,
) -> Result<Arc<Inode>, VfsError>;
/// Get the parent directory inode of an inode (for NFS reconnect after reboot).
///
/// Returns `Err(EACCES)` if the filesystem cannot determine the parent without a
/// full tree walk (e.g., hardlinks with multiple parents).
fn get_parent(&self, inode: &Inode) -> Result<Arc<Inode>, VfsError>;
/// Get the directory entry name for `child` within `parent`.
///
/// Used by the NFS server to reconstruct paths for client caches.
/// Returns the byte length of the name written into `name_buf`.
/// Returns `Err(ENOENT)` if no entry for `child` is found in `parent`.
fn get_name(
&self,
parent: &Inode,
child: &Inode,
name_buf: &mut [u8; 256],
) -> Result<usize, VfsError>;
}
/// Kernel-side file handle: wraps the opaque handle bytes with metadata.
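/// Matches the header layout of Linux's `struct file_handle` for syscall ABI
/// compatibility. Linux declares `f_handle` as a flexible array; here it is
/// fixed at the 128-byte maximum.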
#[repr(C)]
pub struct FileHandle {
/// Byte length of the handle data (the populated prefix of `f_handle`).
pub handle_bytes: u32,
/// Filesystem-defined type code (passed back verbatim to `ExportOps::decode_fh`).
pub handle_type: i32,
/// Opaque handle data (filesystem-defined encoding, up to 128 bytes).
pub f_handle: [u8; 128],
}
const_assert!(size_of::<FileHandle>() == 136);
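As an illustration of the ext4-style encoding listed under encode_fh, the sketch below packs an inode number and generation counter into 8 bytes and rejects stale handles on decode. ToyFs, its HashMap-backed generation table, the little-endian byte order, and the reduced signatures (inode numbers instead of Inode references) are all assumptions of this example, not the umka-vfs implementation.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum VfsError {
    EOVERFLOW,
    ESTALE,
}

/// Assumed type code, mirroring `fh_type=1` in the ext4 example.
const FH_TYPE_INO32_GEN: u8 = 1;

/// Toy filesystem: maps inode number to its current generation counter.
pub struct ToyFs {
    pub generations: HashMap<u32, u32>,
}

impl ToyFs {
    /// Write `[ino: u32, generation: u32]` (little-endian) into `handle`.
    pub fn encode_fh(
        &self,
        ino: u32,
        handle: &mut [u8; 128],
        max_bytes: usize,
    ) -> Result<(usize, u8), VfsError> {
        if max_bytes < 8 {
            return Err(VfsError::EOVERFLOW);
        }
        let generation = *self.generations.get(&ino).ok_or(VfsError::ESTALE)?;
        handle[0..4].copy_from_slice(&ino.to_le_bytes());
        handle[4..8].copy_from_slice(&generation.to_le_bytes());
        Ok((8, FH_TYPE_INO32_GEN))
    }

    /// Recover the inode number. A generation mismatch means the inode
    /// number was reused after deletion, so the handle is stale.
    pub fn decode_fh(&self, handle: &[u8], fh_type: u8) -> Result<u32, VfsError> {
        if fh_type != FH_TYPE_INO32_GEN || handle.len() < 8 {
            return Err(VfsError::ESTALE);
        }
        let ino = u32::from_le_bytes([handle[0], handle[1], handle[2], handle[3]]);
        let generation = u32::from_le_bytes([handle[4], handle[5], handle[6], handle[7]]);
        match self.generations.get(&ino) {
            Some(&current) if current == generation => Ok(ino),
            _ => Err(VfsError::ESTALE),
        }
    }
}
```

Because the handle carries no path component, renaming the file leaves it valid; only bumping the generation counter (inode reuse) invalidates it.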
name_to_handle_at(2) implementation:
name_to_handle_at(dirfd, pathname, handle, mount_id, flags):
1. Resolve pathname to an inode (using normal path resolution with dirfd as the base;
AT_EMPTY_PATH allows operating on dirfd itself without a pathname component).
2. Retrieve the inode's superblock.
3. Check that the superblock implements ExportOps. Return EOPNOTSUPP if not.
4. Call superblock.export_ops.encode_fh(inode, handle.f_handle, handle.handle_bytes,
connectable=true).
5. Write back handle_bytes and handle_type into the userspace handle struct.
6. Write the mount's numeric ID to *mount_id. Mount IDs are assigned at mount time
via a monotonic counter (Section 14.2.3 MountNode.mnt_id).
7. Return 0 on success; EOVERFLOW if the handle buffer is too small.
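The steps above can be sketched for an already-resolved (mount, inode) pair. All types here (MiniExportOps, Ino64Export, MiniSuperBlock, MiniMount, MiniFileHandle) are hypothetical reductions of the real structures, and path resolution (step 1) is left out. The point is the ordering: the export check comes first, then the encode call bounded by the caller's handle_bytes, then the write-back.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EOPNOTSUPP,
    EOVERFLOW,
}

/// Minimal stand-in for `ExportOps`; only `encode_fh` is modeled.
pub trait MiniExportOps {
    fn encode_fh(&self, ino: u64, out: &mut [u8; 128], max_bytes: usize)
        -> Result<(usize, u8), Errno>;
}

/// Encodes the inode number as 8 little-endian bytes (illustrative format).
pub struct Ino64Export;

impl MiniExportOps for Ino64Export {
    fn encode_fh(&self, ino: u64, out: &mut [u8; 128], max_bytes: usize)
        -> Result<(usize, u8), Errno> {
        if max_bytes < 8 {
            return Err(Errno::EOVERFLOW);
        }
        out[..8].copy_from_slice(&ino.to_le_bytes());
        Ok((8, 1))
    }
}

pub struct MiniSuperBlock {
    pub export_ops: Option<Box<dyn MiniExportOps>>,
}

pub struct MiniMount {
    pub mnt_id: u64,
    pub sb: MiniSuperBlock,
}

pub struct MiniFileHandle {
    pub handle_bytes: u32,
    pub handle_type: i32,
    pub f_handle: [u8; 128],
}

/// Steps 2-7 of the list above. Returns the mount ID on success.
pub fn name_to_handle(
    mount: &MiniMount,
    ino: u64,
    handle: &mut MiniFileHandle,
) -> Result<u64, Errno> {
    // Step 3: the filesystem must export handles at all.
    let ops = mount.sb.export_ops.as_ref().ok_or(Errno::EOPNOTSUPP)?;
    // Step 4: the caller's `handle_bytes` bounds the encoding.
    let max = (handle.handle_bytes as usize).min(128);
    let (len, fh_type) = ops.encode_fh(ino, &mut handle.f_handle, max)?;
    // Step 5: report the actual length and type back to the caller.
    handle.handle_bytes = len as u32;
    handle.handle_type = fh_type as i32;
    // Step 6: hand back the mount ID.
    Ok(mount.mnt_id)
}
```

A caller that passes handle_bytes smaller than the filesystem's handle size gets EOVERFLOW and can retry with a larger buffer.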
open_by_handle_at(2) implementation:
open_by_handle_at(mount_fd, handle, flags):
1. Requires CAP_DAC_READ_SEARCH. This syscall bypasses normal path-based access checks
by design — it is intended for root-equivalent processes such as NFS servers and
backup agents. Return EPERM if the capability is absent.
2. Resolve mount_fd to identify which filesystem the handle belongs to:
fdget(mount_fd) → extract the file's MountDentry → use that mount's superblock.
mount_fd must be an open fd on any file or directory within the target filesystem
(typically the mountpoint itself, e.g., `fd = open("/mnt")`). If mount_fd is
AT_FDCWD, the current working directory's mount is used.
3. Retrieve the mount's superblock (from the MountDentry resolved in step 2).
4. Check that the superblock implements ExportOps. Return EOPNOTSUPP if not.
5. Call superblock.export_ops.decode_fh(handle.f_handle, handle.handle_type) → Arc<Inode>.
6. If Err(ESTALE): the inode was deleted or the generation counter does not match
(inode number reused). Return ESTALE.
7. Perform a DAC check and LSM check on the inode using the caller's credentials.
8. Allocate a new OpenFile wrapping the inode. The open file description does not
carry a path — the inode is accessed directly without directory traversal.
9. Return the new file descriptor number.
Security note: open_by_handle_at intentionally skips directory execute-permission
checks along the path to the inode (the path is not known at this point). This is
the documented and expected behavior for NFS server use. CAP_DAC_READ_SEARCH is the
required guard.
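The ordering of checks in open_by_handle_at matters because the capability test must come before any filesystem lookup. The sketch below models that ordering with hypothetical toy types (Caller, ToyInode, ToyFs). The DAC/LSM check of step 7 and the OpenFile allocation are deliberately omitted, so the function simply returns the inode number that the new open file would wrap.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EPERM,
    EOPNOTSUPP,
    ESTALE,
}

/// Caller credentials, reduced to the one capability that matters here.
pub struct Caller {
    pub cap_dac_read_search: bool,
}

pub struct ToyInode {
    pub ino: u64,
    pub generation: u32,
}

pub struct ToyFs {
    /// Whether this filesystem provides export operations.
    pub exportable: bool,
    pub inodes: Vec<ToyInode>,
}

impl ToyFs {
    /// Look up by (ino, generation) without any path traversal. A missing
    /// inode or a generation mismatch both yield ESTALE.
    fn decode_fh(&self, ino: u64, generation: u32) -> Result<&ToyInode, Errno> {
        self.inodes
            .iter()
            .find(|i| i.ino == ino && i.generation == generation)
            .ok_or(Errno::ESTALE)
    }
}

/// Steps 1 and 4-6 of the list above, in order.
pub fn open_by_handle(
    caller: &Caller,
    fs: &ToyFs,
    ino: u64,
    generation: u32,
) -> Result<u64, Errno> {
    // Step 1: capability gate, checked before touching the filesystem.
    if !caller.cap_dac_read_search {
        return Err(Errno::EPERM);
    }
    // Step 4: the filesystem must support handle decoding.
    if !fs.exportable {
        return Err(Errno::EOPNOTSUPP);
    }
    // Steps 5-6: decode, propagating ESTALE for deleted or reused inodes.
    let inode = fs.decode_fh(ino, generation)?;
    Ok(inode.ino)
}
```

Since no directory is traversed, an unprivileged caller is stopped only by the capability check, which is why that check is the first step.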
14.1.2.3 Core VFS Data Structures
The VFS layer operates on four fundamental data structures: dentries (directory entries), inodes (index nodes), superblocks (mounted filesystem state), and open files (open file descriptions). All four are defined in this section.
14.1.2.3.1.1 OpenFile (Open File Description)
/// An open file description — the kernel-internal object backing one or more
/// file descriptors. Created by `open(2)`, `openat(2)`, `socket(2)`, `pipe(2)`,
/// `accept(2)`, etc. Multiple file descriptors can reference the same `OpenFile`
/// via `dup(2)` or `fork(2)`.
///
/// **Lifecycle**: Allocated at open time. Reference-counted (`Arc<OpenFile>`).
/// The `FdTable` holds `Arc<OpenFile>` entries. When the last fd referencing
/// this open file is closed (refcount drops to zero), `FileOps::release()` is
/// called and the `OpenFile` is freed.
///
/// **Concurrency**: Most fields are immutable after creation (`inode`, `dentry`,
/// `mount`, `f_ops`, `f_cred`, `f_mode`). Mutable fields use atomic operations
/// unless noted otherwise:
/// - `f_pos`: `AtomicI64` — updated by `read()`/`write()`/`lseek()`. `pread()`
/// and `pwrite()` do not touch `f_pos`. Access is mediated by `fdget_pos()`:
///
/// **`fdget_pos()` protocol** (f_pos serialization):
/// The VFS read/write dispatch path calls `fdget_pos(fd)` instead of plain
/// `fdget(fd)`. This function returns an `FdPos` guard that provides
/// exclusive `&mut i64` access to the file position:
///
/// - **Single-user fast path**: If the `OpenFile` has exactly one `Arc`
/// reference (refcount == 1, meaning no `dup(2)` or `fork(2)` sharing),
/// `fdget_pos()` loads `f_pos` into a local `i64`, returns `&mut` to it,
/// and stores it back on drop. No mutex, no contention. This is the
/// common case for most file descriptors.
///
/// - **Multi-user slow path**: If the `OpenFile` has multiple references
/// (shared via `dup(2)` or `fork(2)` — detected by `Arc::strong_count() > 1`),
/// `fdget_pos()` acquires `f_pos_lock` (a per-OpenFile `Mutex<()>`)
/// before returning `&mut` access to a local copy. This serializes
/// concurrent `read()`/`write()` calls that share the same open file
/// description, matching POSIX requirements for atomic position updates.
/// The mutex is released when the `FdPos` guard is dropped.
///
/// - **`pread()`/`pwrite()` bypass**: These syscalls use a caller-supplied
/// offset and never call `fdget_pos()` — they call `fdget()` directly.
/// No f_pos serialization is needed because the caller-supplied offset
/// is on the stack.
///
/// This design follows Linux's `fdget_pos()` / `__fdget_pos()` protocol
/// (see `fs/file.c`), with the goal of matching its concurrency semantics.
/// - `f_flags`: `AtomicU32`. Status flags (`O_APPEND`, `O_NONBLOCK`,
/// `O_ASYNC`, `O_DIRECT`) may be modified by `fcntl(F_SETFL)`. The access
/// mode (`O_RDONLY`, `O_WRONLY`, `O_RDWR`) and the creation flags
/// (`O_CREAT`, `O_EXCL`) are fixed at open time and never change.
/// - `f_wb_err`: `u64` — writeback error snapshot (plain value, not atomic).
/// Initialized from `AddressSpace::wb_err.sample()` at open time. Compared
/// at `fsync()` time against `AddressSpace::wb_err` via
/// `check_and_advance(&mut self.f_wb_err)` to detect new errors.
/// - `private_data`: `AtomicPtr` — set once by `FileOps::open()` and read by
/// subsequent operations. Typically not modified after initialization.
///
/// **Relationship to FdTable**: The `FdTable` (in [Section 8.1](08-process.md#process-and-task-management))
/// maps integer file descriptors (0, 1, 2, ...) to `Arc<OpenFile>`. `dup(2)`
/// creates a new fd pointing to the same `Arc<OpenFile>`. `fork()` copies the
/// `FdTable`, incrementing the `Arc` refcount for each entry.
pub struct OpenFile {
/// Inode backing this open file. For regular files, directories, symlinks,
/// and device nodes, this is the filesystem inode. For pipes and sockets,
/// this is a synthetic inode from the pipefs/sockfs pseudo-filesystem.
pub inode: Arc<Inode>,
/// Dentry that was used to open this file. Pinned for the lifetime of the
/// open file — this prevents the dentry from being evicted while the file
/// is open, which is necessary for `/proc/[pid]/fd/N` readlink (returns
/// the path via `d_path()` on this dentry).
pub dentry: DentryRef,
/// Mount instance through which this file was opened. Pinned for the
/// lifetime of the open file — this prevents `umount` from proceeding
/// while files are open on the filesystem (umount checks `mnt_count`).
pub mount: Arc<Mount>,
/// File operations vtable. Set at open time from the inode's `i_fop`
/// (regular files, directories) or the device driver's registered
/// `FileOps` (character/block devices). Immutable after creation.
pub f_ops: &'static dyn FileOps,
/// Current file position (seek offset). Updated by `read()`, `write()`,
/// and `lseek()`. Not used by `pread()`/`pwrite()` (which take an
/// explicit offset). Initialized to 0 for regular opens, to the file
/// size for `O_APPEND` opens (the kernel re-seeks to EOF before each
/// `write()` regardless of the stored position).
pub f_pos: AtomicI64,
/// Mutex protecting `f_pos` for shared open file descriptions.
/// Only acquired by `fdget_pos()` when `Arc::strong_count() > 1`
/// (i.e., the open file is shared via `dup(2)` or `fork(2)`).
/// Single-user file descriptors (the common case) never touch this
/// mutex — `fdget_pos()` skips it entirely. This matches Linux's
/// `struct file::f_pos_lock` mutex.
pub f_pos_lock: Mutex<()>,
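/// Open flags. Lower bits contain the access mode (O_RDONLY=0, O_WRONLY=1,
/// O_RDWR=2). Upper bits contain status flags (O_APPEND, O_NONBLOCK,
/// O_ASYNC, O_DIRECT, O_NOATIME). Status flags may be modified by
/// `fcntl(F_SETFL)`; access mode bits are immutable after open. `O_CLOEXEC`
/// is not a status flag: close-on-exec is a per-descriptor property
/// (`FD_CLOEXEC`, changed via `fcntl(F_SETFD)`), not a property of the open
/// file description, so `F_SETFL` cannot modify it.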
pub f_flags: AtomicU32,
/// File mode derived from open flags. Bitflags indicating which operations
/// are permitted on this open file. Set at open time and immutable.
/// Checked by the VFS before dispatching to `FileOps` methods.
pub f_mode: FileMode,
/// Credentials captured at open time. Used for permission checks that
/// occur after open (e.g., writeback, async I/O completion) where the
/// original opener's credentials must be used, not the current task's.
/// Immutable after creation.
pub f_cred: Arc<Credentials>,
/// Writeback error snapshot (plain `u64`, not atomic `ErrSeq`). Initialized
/// from `AddressSpace::wb_err.sample()` at open time. At `fsync()` time,
/// compared against the current `AddressSpace::wb_err` via
/// `check_and_advance(&mut self.f_wb_err)` — if a new error occurred since
/// this fd was opened (or since the last `fsync()`), `fsync()` returns the
/// error. After reporting, the snapshot is advanced so the error is reported
/// exactly once per fd. The snapshot is a non-atomic `u64` because only the
/// owning fd thread accesses it (no concurrent readers), unlike
/// `AddressSpace::wb_err` which is the atomic source.
pub f_wb_err: u64,
/// Readahead state for this open file. Tracks sequential access detection,
/// the current readahead window size, and the last readahead position.
/// Used by `filemap_get_pages()` and the readahead engine
/// ([Section 4.4](04-memory.md#page-cache--readahead-engine)) to decide how many pages to
/// prefetch. Each open file has independent readahead state — two
/// processes reading the same file at different positions maintain
/// separate readahead windows.
pub ra_state: Mutex<FileRaState>,
/// Filesystem-private data. Set by `FileOps::open()` to store per-open
/// state (e.g., ext4 journal handle, NFS delegation ID, device driver
/// context). The VFS passes this value (as `private: u64`) to all
/// subsequent `FileOps` method calls. Cleared by `FileOps::release()`.
pub private_data: AtomicPtr<()>,
/// Driver generation at the time this file was opened. Set to
/// `sb.driver_generation.load(Acquire)` during `open()`. Compared
/// against `sb.driver_generation` on every VFS operation; mismatch
/// returns `ENOTCONN` (the file handle is stale from a pre-crash
/// driver instance). Not atomic — set once at open time, read-only
/// thereafter.
///
/// The generation check is in the VFS dispatch path (before
/// `select_ring()`):
/// ```rust
/// if file.open_generation != file.inode.i_sb.driver_generation.load(Acquire) {
///     return Err(ENOTCONN);
/// }
/// ```
pub open_generation: u64,
}
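The `fdget_pos()` protocol described in the `OpenFile` documentation can be sketched in userspace Rust as follows. This is a minimal illustration under stated assumptions, not the kernel implementation: `OpenFile` is reduced to its two position-related fields, the accessor names `pos_mut()` and `is_locked()` are hypothetical, and `std::sync::Mutex` stands in for the kernel mutex.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::{Arc, Mutex, MutexGuard};

/// Reduced stand-in for `OpenFile`: only the position-related fields.
pub struct OpenFile {
    pub f_pos: AtomicI64,
    pub f_pos_lock: Mutex<()>,
}

/// Guard returned by `fdget_pos()`. It holds a local copy of the file
/// position and stores it back into `f_pos` when dropped.
pub struct FdPos<'a> {
    file: &'a OpenFile,
    pos: i64,
    /// `Some` only on the shared (slow) path.
    lock: Option<MutexGuard<'a, ()>>,
}

impl FdPos<'_> {
    /// Exclusive access to the local copy of the position (hypothetical name).
    pub fn pos_mut(&mut self) -> &mut i64 {
        &mut self.pos
    }

    /// True if this guard took `f_pos_lock` (hypothetical helper for tests).
    pub fn is_locked(&self) -> bool {
        self.lock.is_some()
    }
}

impl Drop for FdPos<'_> {
    fn drop(&mut self) {
        // `Drop::drop` runs before the fields are dropped, so on the slow
        // path the store happens while `f_pos_lock` is still held.
        self.file.f_pos.store(self.pos, Ordering::Release);
    }
}

pub fn fdget_pos(file: &Arc<OpenFile>) -> FdPos<'_> {
    let inner: &OpenFile = &**file;
    // Slow path: the open file description is shared, so serialize.
    let lock = if Arc::strong_count(file) > 1 {
        // A userspace mutex can be poisoned; this sketch simply panics.
        Some(inner.f_pos_lock.lock().unwrap())
    } else {
        None
    };
    let pos = inner.f_pos.load(Ordering::Acquire);
    FdPos { file: inner, pos, lock }
}
```

A caller would obtain the guard, update the position through `pos_mut()` while performing the read or write, and let the guard drop to publish the new position. `pread()`/`pwrite()` would never construct an `FdPos`.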
bitflags! {
/// File mode flags — derived from open flags at open time. These indicate
/// which operations the VFS permits on this open file description.
/// Immutable after open.
///
/// These are internal VFS flags (not directly visible to userspace). They
/// are derived from the `O_*` flags passed to `open(2)`:
/// - `O_RDONLY` (0) → `FMODE_READ`
/// - `O_WRONLY` (1) → `FMODE_WRITE`
/// - `O_RDWR` (2) → `FMODE_READ | FMODE_WRITE`
///
/// Additional flags are set based on the file type and filesystem
/// capabilities.
pub struct FileMode: u32 {
/// Read operations permitted (`read`, `pread`, `readv`, `mmap PROT_READ`).
const FMODE_READ = 0x0001;
/// Write operations permitted (`write`, `pwrite`, `writev`, `mmap PROT_WRITE`).
const FMODE_WRITE = 0x0002;
/// `lseek` is meaningful. Set for regular files and block devices.
/// Not set for pipes, sockets, and some character devices.
const FMODE_LSEEK = 0x0004;
/// `pread` is supported (implies the file has a stable notion of offset).
/// Set for regular files and block devices. Not set for pipes or sockets.
const FMODE_PREAD = 0x0008;
/// `pwrite` is supported.
const FMODE_PWRITE = 0x0010;
/// Execute permission was checked at open time (implies `O_PATH` was not
/// used and the file's execute bit was verified). Used by `execveat(2)`
/// with `AT_EMPTY_PATH` to avoid a redundant permission check.
const FMODE_EXEC = 0x0020;
/// File does not contribute to filesystem busy state. Set for files
/// opened with `O_PATH` (which are just path references, not real opens).
const FMODE_PATH = 0x0040;
/// Direct I/O mode. Set when `O_DIRECT` is in effect and the filesystem
/// supports it. The VFS bypasses the page cache for read/write.
const FMODE_DIRECT = 0x0080;
}
}
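The mapping from `O_*` access mode to `FMODE_*` bits can be illustrated with a small sketch. It uses plain `u32` constants instead of the `bitflags!` type so that it compiles without external crates. The function name `fmode_from_open_flags`, the `seekable` parameter, and the decision to reject access mode 3 are assumptions made for illustration only.

```rust
// Access mode values as documented for `f_flags`.
pub const O_ACCMODE: u32 = 0o3;
pub const O_RDONLY: u32 = 0;
pub const O_WRONLY: u32 = 1;
pub const O_RDWR: u32 = 2;

// Values copied from the `FileMode` definition.
pub const FMODE_READ: u32 = 0x0001;
pub const FMODE_WRITE: u32 = 0x0002;
pub const FMODE_LSEEK: u32 = 0x0004;
pub const FMODE_PREAD: u32 = 0x0008;
pub const FMODE_PWRITE: u32 = 0x0010;

/// Derive the internal mode bits from the open flags.
/// `seekable` is a hypothetical input standing in for "regular file or
/// block device"; pipes and sockets would pass `false`.
pub fn fmode_from_open_flags(flags: u32, seekable: bool) -> Option<u32> {
    let mut mode = match flags & O_ACCMODE {
        O_RDONLY => FMODE_READ,
        O_WRONLY => FMODE_WRITE,
        O_RDWR => FMODE_READ | FMODE_WRITE,
        // Access mode 3 is not one of the three documented modes; this
        // sketch rejects it.
        _ => return None,
    };
    if seekable {
        mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
    }
    Some(mode)
}
```

Because the result is computed once at open time and stored in the immutable `f_mode` field, later dispatch checks need only a bit test rather than a re-derivation from `f_flags`.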
14.1.2.3.1.2 Dentry (Directory Cache Entry)¶
/// Directory cache entry — represents a single component in a pathname.
///
/// Dentries form a tree that mirrors the filesystem namespace. Each dentry
/// caches the result of a directory lookup: the mapping from a name to an
/// inode. The dentry cache (dcache) is the primary mechanism for avoiding
/// repeated directory lookups on hot paths.
///
/// **Lifecycle**: Created by `InodeOps::lookup()` on first access. Cached
/// in the dcache hash table (keyed by parent + name). Freed when the
/// reference count drops to zero AND the dentry is evicted from the LRU.
/// Negative dentries (the name was looked up and does not exist, so there
/// is no inode) are also cached to avoid repeated failed lookups.
///
/// **Concurrency**: Dentries are RCU-protected for lockless path resolution
/// (RCU-walk mode, Section 14.1.3). Mutations (create, unlink, rename)
/// acquire the parent dentry's `d_lock` spinlock.
///
/// `#[repr(C)]` on `Dentry` is for deterministic field ordering (cache line
/// layout control), not for cross-compilation-unit ABI stability. Inner types
/// (`DentryName`, `RcuCell<..>`, `IntrusiveList<..>`) retain Rust-default layout.
/// Tier 1 drivers never receive raw `Dentry` pointers — all access is through
/// the VFS ring protocol by inode number.
// kernel-internal, not KABI — no const_assert (contains Rust-layout inner types).
#[repr(C)]
pub struct Dentry {
/// The name of this directory entry (the final component, not the full path).
/// Inline for short names (<=32 bytes); heap-allocated for longer names.
/// Immutable after creation (renames create a new dentry).
pub d_name: DentryName,
/// Inode that this dentry points to. `None` for negative dentries
/// (cached "does not exist" results). Set once by `d_instantiate()`
/// after a successful lookup or create. Protected by RCU for readers;
/// `d_lock` for writers.
pub d_inode: RcuCell<Option<Arc<Inode>>>,
/// Parent dentry. The root dentry's parent is itself.
/// Protected by RCU (for RCU-walk path resolution).
pub d_parent: RcuCell<Arc<Dentry>>,
/// Hash table linkage for dcache lookup (keyed by parent + name hash).
pub d_hash: HashListNode,
/// Children list (subdirectories and files in this directory).
/// Only meaningful for directory dentries. Protected by `d_lock`.
pub d_children: IntrusiveList<Dentry>,
/// Sibling linkage (entry in parent's `d_children` list).
pub d_sibling: IntrusiveListNode,
/// Per-dentry spinlock. Protects `d_children`, `d_inode` mutations,
/// and `d_flags` updates. Lock level: DENTRY_LOCK (level 16).
pub d_lock: SpinLock<(), DENTRY_LOCK>,
/// Dentry flags (DCACHE_MOUNTED, DCACHE_NEGATIVE, etc.).
pub d_flags: AtomicU32,
/// Cross-namespace mount refcount. Counts how many mount namespaces have
/// a mount at this dentry. Incremented in `do_mount()`, decremented in
/// `do_umount()`. `DCACHE_MOUNTED` is cleared only when this reaches 0.
///
/// **Why needed**: A single dentry can be a mount point in multiple
/// namespaces simultaneously (e.g., "/" is mounted in every namespace
/// that cloned the mount tree). Without this refcount, `do_umount()` in
/// one namespace would clear `DCACHE_MOUNTED` and break path resolution
/// in all other namespaces that still have mounts at this dentry.
///
/// **Protocol**:
/// - `do_mount()` step 6f: `dentry.d_mount_refcount.fetch_add(1, Relaxed)`
/// THEN `dentry.d_flags.fetch_or(DCACHE_MOUNTED, Release)`.
/// - `do_umount()` step 9:
/// `if dentry.d_mount_refcount.fetch_sub(1, AcqRel) == 1 {`
/// ` dentry.d_flags.fetch_and(!DCACHE_MOUNTED, Release);`
/// `}`
/// - Same pattern in `do_umount_tree()` step 3d, `do_move_mount()` step 5c.
///
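/// u32: the count is per dentry, so it is bounded by the number of
/// namespaces that have a mount at this dentry, not by the global product
/// mount_max × max_namespaces (100K × 100K = 10^10 would not fit in u32).
/// Even with 100K namespaces the per-dentry count is ~100K, well within
/// u32 range.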
pub d_mount_refcount: AtomicU32,
/// Reference count. Dentries with refcount > 0 are pinned (in use).
/// Dentries with refcount == 0 are on the LRU and may be evicted
/// under memory pressure.
/// u32: bounded by max_files sysctl (default 8M). At max_files=8M
/// concurrent references to a single dentry, u32 provides ~536x
/// headroom. AtomicU64 rejected: hot-path refcount, 2x width penalty
/// on ILP32 architectures (ARMv7, PPC32).
pub d_refcount: AtomicU32,
/// Cached permission bits for fast path resolution (Section 14.1.3).
pub cached_perm: AtomicU32,
/// Superblock this dentry belongs to.
pub d_sb: Arc<SuperBlock>,
/// Filesystem-specific dentry operations (d_revalidate, d_release, etc.).
/// Set by the filesystem during lookup. `None` for simple filesystems.
pub d_ops: Option<&'static dyn DentryOps>,
/// RCU head for deferred freeing.
pub d_rcu: RcuHead,
/// LRU list linkage for dcache reclaim.
pub d_lru: IntrusiveListNode,
/// Mount point generation counter. Incremented when a filesystem is
/// mounted or unmounted on this dentry. Used by RCU-walk to detect
/// mount table changes during lockless traversal. This is a generation
/// counter protocol, not a Linux-style seqcount (no even/odd semantics).
///
/// Reader protocol: (1) sample d_mount_seq with Acquire, (2) lookup in
/// mount hash table, (3) sample d_mount_seq again with Acquire, (4) if
/// values differ, retry from step 1.
pub d_mount_seq: AtomicU32,
}
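The two mount-related protocols on `Dentry` (the `d_mount_refcount` / `DCACHE_MOUNTED` pairing and the `d_mount_seq` reader retry loop) can be sketched as below. This is an illustration under assumptions: `MountPointState`, `note_mount`, `note_umount`, `is_mounted`, and `lookup_stable` are hypothetical names, the bit value chosen for `DCACHE_MOUNTED` is arbitrary, and the sketch assumes that mount and unmount on the same dentry are serialized by an outer lock that is not shown.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical bit value; the real flag layout is defined elsewhere.
pub const DCACHE_MOUNTED: u32 = 1 << 0;

/// Only the dentry fields that take part in the mount protocols.
pub struct MountPointState {
    pub d_flags: AtomicU32,
    pub d_mount_refcount: AtomicU32,
    pub d_mount_seq: AtomicU32,
}

impl MountPointState {
    pub fn new() -> Self {
        Self {
            d_flags: AtomicU32::new(0),
            d_mount_refcount: AtomicU32::new(0),
            d_mount_seq: AtomicU32::new(0),
        }
    }

    /// Mount side: bump the refcount first, then publish the flag.
    pub fn note_mount(&self) {
        self.d_mount_refcount.fetch_add(1, Ordering::Relaxed);
        self.d_flags.fetch_or(DCACHE_MOUNTED, Ordering::Release);
        self.d_mount_seq.fetch_add(1, Ordering::Release);
    }

    /// Unmount side: clear the flag only when the last mount goes away.
    pub fn note_umount(&self) {
        if self.d_mount_refcount.fetch_sub(1, Ordering::AcqRel) == 1 {
            self.d_flags.fetch_and(!DCACHE_MOUNTED, Ordering::Release);
        }
        self.d_mount_seq.fetch_add(1, Ordering::Release);
    }

    pub fn is_mounted(&self) -> bool {
        self.d_flags.load(Ordering::Acquire) & DCACHE_MOUNTED != 0
    }
}

/// Reader protocol for a generation counter: sample, look up, sample
/// again, and retry if the generation moved in between.
pub fn lookup_stable<T>(seq: &AtomicU32, mut lookup: impl FnMut() -> T) -> T {
    loop {
        let before = seq.load(Ordering::Acquire);
        let result = lookup();
        if seq.load(Ordering::Acquire) == before {
            return result;
        }
    }
}
```

With two namespaces mounted on the same dentry, the first `note_umount()` leaves `DCACHE_MOUNTED` set and only the second clears it, which is the behavior the refcount exists to guarantee.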
/// Short name inline buffer size. Names <=32 bytes are stored inline
/// in the dentry (no heap allocation). Covers >99% of real filenames.
pub const DENTRY_INLINE_NAME_LEN: usize = 32;
/// Maximum dentry name length (POSIX NAME_MAX).
pub const DENTRY_MAX_NAME_LEN: usize = 255;
/// Dentry name: inline for short names, heap-allocated for long names.
/// The Heap variant stores names up to DENTRY_MAX_NAME_LEN bytes; the
/// bound is enforced by d_alloc() which validates name.len() <= NAME_MAX
/// before construction. debug_assert!(name.len() <= DENTRY_MAX_NAME_LEN)
/// in the Heap constructor provides defense-in-depth.
pub enum DentryName {
Inline { buf: [u8; DENTRY_INLINE_NAME_LEN], len: u8 },
Heap { ptr: Box<[u8]> },
}
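A possible constructor and accessor for `DentryName` are sketched below to show how the inline and heap variants are selected. The type and constants are repeated so the block stands alone. The method names `new` and `as_bytes` are assumptions, and returning `None` for an over-long name stands in for whatever error `d_alloc()` reports.

```rust
pub const DENTRY_INLINE_NAME_LEN: usize = 32;
pub const DENTRY_MAX_NAME_LEN: usize = 255;

pub enum DentryName {
    Inline { buf: [u8; DENTRY_INLINE_NAME_LEN], len: u8 },
    Heap { ptr: Box<[u8]> },
}

impl DentryName {
    /// Build a name, choosing inline storage when it fits.
    /// Returns `None` if the name exceeds `DENTRY_MAX_NAME_LEN`.
    pub fn new(name: &[u8]) -> Option<Self> {
        if name.len() > DENTRY_MAX_NAME_LEN {
            return None;
        }
        if name.len() <= DENTRY_INLINE_NAME_LEN {
            let mut buf = [0u8; DENTRY_INLINE_NAME_LEN];
            buf[..name.len()].copy_from_slice(name);
            // The cast cannot truncate: the length is at most 32 here.
            Some(DentryName::Inline { buf, len: name.len() as u8 })
        } else {
            Some(DentryName::Heap { ptr: name.into() })
        }
    }

    /// View the name as a byte slice regardless of storage.
    pub fn as_bytes(&self) -> &[u8] {
        match self {
            DentryName::Inline { buf, len } => &buf[..*len as usize],
            DentryName::Heap { ptr } => &ptr[..],
        }
    }
}
```

The `len: u8` field is sufficient because the inline variant never holds more than 32 bytes; the heap variant carries its length in the boxed slice itself.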
14.1.2.3.1.3 AddressSpace (Page Cache Mapping)¶
/// VFS-layer page cache wrapper for one inode. Wraps a `PageCache`
/// ([Section 4.4](04-memory.md#page-cache)) with VFS-layer writeback
/// coordination, error tracking, and filesystem-specific operations.
///
/// Each inode for a regular file or block device has exactly one
/// `AddressSpace`. Directories and symlinks typically do not use
/// `AddressSpace` unless the filesystem maps their data through the page
/// cache (e.g., directories in ext4 are page-cache-backed).
///
/// **Storage**: `AddressSpace` is embedded directly inside `Inode`
/// (field `i_mapping`). No separate allocation is needed on the fast
/// path.
///
/// **Concurrency**:
/// - `page_cache`: `Option<PageCache>` — `Some` for normal files, `None` for
/// DAX files (AS_DAX set). When `Some`, the inner XArray provides RCU-safe
/// lock-free reads and per-instance `xa_lock` for writers. See
/// [Section 4.4](04-memory.md#page-cache) for the full concurrency model.
/// All code paths that access `page_cache` must check `is_some()` first;
/// DAX paths bypass the page cache entirely.
/// - `page_cache.nr_pages`, `page_cache.nr_dirty`, `nrwriteback`: independent
/// atomic counters; no lock needed for individual increments/decrements
/// (`nr_pages` and `nr_dirty` only exist when `page_cache` is `Some`).
/// - `writeback_lock`: `Mutex` serializing concurrent writeback of
/// this inode's pages. At most one writeback agent runs per inode
/// at any time.
/// - `writeback_in_progress`: `AtomicBool` lightweight sentinel checked
/// by the reclaim path without acquiring `writeback_lock`.
pub struct AddressSpace {
/// Back-pointer to the owning inode. `Weak` to avoid a reference
/// cycle (Inode → AddressSpace → Inode).
pub host: Weak<Inode>,
/// Page storage backend — XArray with RCU-safe lock-free reads and
/// per-instance `xa_lock` for writers. Defined in [Section 4.4](04-memory.md#page-cache).
/// `page_cache.nr_pages` and `page_cache.nr_dirty` are the canonical
/// page/dirty counters (no separate copies here — use accessors).
/// `None` for DAX mappings (`AS_DAX` set), which map persistent memory directly.
pub page_cache: Option<PageCache>,
/// Number of pages currently under active writeback I/O. A page is
/// counted here from the moment writeback I/O is submitted until the
/// I/O completion handler clears the `PG_WRITEBACK` flag.
pub nrwriteback: AtomicU64,
/// True while writeback I/O is in progress for this inode. Lightweight
/// sentinel for the memory reclaim path: reclaim checks this flag
/// without acquiring `writeback_lock` to skip inodes already being
/// flushed. The writeback thread sets this AFTER acquiring
/// `writeback_lock` and clears it BEFORE releasing the lock.
/// Ordering: `writeback_lock` acquisition → set flag → writeback I/O →
/// clear flag → release `writeback_lock`.
/// Intra-domain (VFS Tier 1). Not accessed from Core directly.
/// AtomicBool validity invariant maintained by Rust type safety
/// within the compilation unit.
pub writeback_in_progress: AtomicBool,
/// Writeback error sequence counter. Updated on I/O errors via
/// `ErrSeq::set_err(errno)`. Each open file descriptor snapshots
/// `wb_err` at open time (`file.f_wb_err`); `fsync()` compares the
/// snapshot to detect new errors. See [Section 14.4](#vfs-fsync-and-cow).
pub wb_err: ErrSeq,
/// Writeback serialization state. At most one concurrent writeback
/// agent is permitted per `AddressSpace` to avoid seek amplification
/// on rotational storage and to simplify error propagation.
///
/// `writeback_lock` serializes writeback *within* a single inode's
/// `AddressSpace`. Multiple inodes on the same backing device can
/// writeback concurrently — `BdiWriteback` ([Section 4.6](04-memory.md#writeback-subsystem--writeback-thread-organization))
/// coordinates device-level I/O scheduling across all inodes, not
/// per-inode serialization. Two threads holding their respective
/// inode writeback_locks may both submit bios to the same block
/// device — this is correct and desirable for throughput.
pub writeback_lock: Mutex<WritebackState>,
/// Sequence counter for truncation-fault coordination.
///
/// Replaces Linux's `mapping->invalidate_lock` (rwsem, added v5.15,
/// commit 730633f0b7f9) with a lockless seqcount. Writers (truncate,
/// hole-punch, collapse-range) bracket page cache mutations with
/// `invalidate_begin()` / `invalidate_end()` while holding
/// `I_RWSEM(write)`. Readers (page fault) call `read_begin()` before
/// page cache lookup and `read_check()` after PTE installation -- two
/// atomic loads, no lock acquired.
///
/// The seqcount eliminates the ONLY lock ordering exception that was
/// previously required in the page fault path (`VMA_LOCK(105)` ->
/// `INVALIDATE_LOCK(90)` descends in level, violating the ascending-level
/// order). With `InvalidateSeq`, the fault path lock chain is strictly
/// ascending: `VMA_LOCK(105, read)` -> `PAGE_LOCK(180)` -> `PTL(185)`.
///
/// See [Section 4.8](04-memory.md#virtual-memory-manager--invalidateseq-lockless-truncation-fault-coordination)
/// for the full struct definition, memory ordering table, and edge
/// case analysis.
pub invalidate_seq: InvalidateSeq,
/// Filesystem-provided callbacks for page cache operations.
/// Statically known at inode creation time; never changes.
pub ops: &'static dyn AddressSpaceOps,
/// Flags controlling eviction and special page semantics.
///
/// - `AS_UNEVICTABLE` (bit 0): pages must not be reclaimed under
/// memory pressure (e.g., ramfs, tmpfs locked pages).
/// - `AS_BALLOON_PAGE` (bit 1): pages are balloon-inflated and may
/// be reclaimed by the balloon driver at any time.
/// - `AS_EIO` (bit 2): a writeback error occurred; subsequent
/// `fsync` calls must return `-EIO` until the flag is cleared.
/// - `AS_ENOSPC` (bit 3): a writeback error occurred due to no
/// space remaining on device.
/// - `AS_DAX` (bit 4): this mapping is DAX (Direct Access). File data
/// lives in persistent memory and is mapped directly into user page
/// tables without page cache copies. When set, the page fault handler
/// calls `dax_iomap_fault()` instead of `filemap_fault()`, and
/// `writepages`/`writepage` are never called (no page cache to write
/// back). Set at mount time for filesystems on persistent memory
/// mounted with `-o dax`. See [Section 15.16](15-storage.md#persistent-memory--design-dax-direct-access-integration).
/// When `AS_DAX` is set, `page_cache` is `None`: no `PageCache` is
/// allocated for DAX files. Direct-access files use CPU load/store
/// through the DAX mapping ([Section 15.16](15-storage.md#persistent-memory--design-dax-direct-access-integration)),
/// bypassing the page cache entirely. This saves ~256 bytes per DAX
/// inode (the `PageCache` struct including its embedded XArray root,
/// counters, and xa_lock).
///
/// **Dual error reporting**: `AS_EIO`/`AS_ENOSPC` flags and `wb_err`
/// (errseq_t) serve complementary purposes. The flags provide a quick
/// boolean "any error occurred?" check used by `sync_file_range()` and
/// the writeback scanner. The errseq_t counter provides per-fd error
/// tracking so that multiple concurrent `fsync()` callers each see the
/// error exactly once. Both are set atomically in `writeback_end_io()`.
/// This dual mechanism matches Linux 4.13+ semantics (commit 5660e13d).
pub flags: AtomicU32,
/// DAX error generation counter. Only meaningful when `flags` contains
/// `AS_DAX`. DAX files bypass the page cache, so the standard `wb_err`
/// mechanism (which tracks writeback I/O errors on page cache pages)
/// does not apply. Instead, hardware-detected errors on persistent
/// memory (MCE on x86, SEA on ARM64) are recorded here.
///
/// Error propagation for DAX files:
/// - MCE/SEA → `SIGBUS` to the accessing process (immediate, via the
/// page fault / machine-check handler).
/// - MCE/SEA → increment `dax_err` generation (for deferred `fsync`
/// reporting).
/// - `fsync()` on a DAX file: compare `file.f_dax_err` with
/// `mapping.dax_err`. If generations differ, return `-EIO`. This is
/// the same generation-counter protocol used by `wb_err` for non-DAX
/// files ([Section 14.15](#disk-quota-subsystem--writeback-error-propagation-errseq)),
/// but applied to DAX hardware errors instead of writeback I/O errors.
/// - `f_dax_err` is snapshotted at `open()` time, identical to `f_wb_err`.
///
/// For non-DAX files (`AS_DAX` not set), this field is unused (reads as 0).
pub dax_err: AtomicU32,
/// Interval tree of file-backed VMAs mapping this file. Used for
/// reverse mapping: truncation, writeback, page migration, and KSM
/// need to find all VMAs mapping a given file offset range. This is
/// the UmkaOS equivalent of Linux's `address_space.i_mmap` (`rb_root_cached`
/// interval tree) protected by `i_mmap_rwsem`.
///
/// The `RwLock` protects concurrent insert/remove during mmap/munmap
/// (writers) vs. read during truncation/writeback/rmap walks (readers).
/// Lock level: follows `mmap_lock` — callers hold `mmap_lock.write()`
/// before acquiring `i_mmap.write()`. Readers (rmap walks) acquire
/// `i_mmap.read()` independently.
///
/// `IntervalTree<VmaRef>` stores `(start_pgoff, end_pgoff, VmaRef)` tuples.
/// Lookup: `i_mmap.tree.query(pgoff_start, pgoff_end)` returns all VMAs
/// whose file offset range overlaps `[pgoff_start, pgoff_end)`.
pub i_mmap: RwLock<IntervalTree<VmaRef>>,
}
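The per-fd error reporting built on `wb_err` and `f_wb_err` can be illustrated with a deliberately simplified generation counter. The real `ErrSeq` is defined elsewhere in this document and may differ in layout and in details such as "seen" tracking. The 12-bit errno field, the `u16` errno type, and the `compare_exchange` loop are assumptions made for this sketch; only the method names `sample`, `set_err`, and `check_and_advance` come from the text above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Assumed layout: low 12 bits hold the errno, upper bits a generation.
const ERRNO_BITS: u32 = 12;
const ERRNO_MASK: u64 = (1 << ERRNO_BITS) - 1;

/// Simplified stand-in for the `ErrSeq` type used by `AddressSpace::wb_err`.
pub struct ErrSeq(AtomicU64);

impl ErrSeq {
    pub const fn new() -> Self {
        ErrSeq(AtomicU64::new(0))
    }

    /// Record a writeback error: bump the generation and store the errno.
    pub fn set_err(&self, errno: u16) {
        let mut cur = self.0.load(Ordering::Relaxed);
        loop {
            let generation = (cur >> ERRNO_BITS).wrapping_add(1);
            let new = (generation << ERRNO_BITS) | (errno as u64 & ERRNO_MASK);
            match self.0.compare_exchange_weak(cur, new, Ordering::AcqRel, Ordering::Relaxed) {
                Ok(_) => return,
                Err(seen) => cur = seen,
            }
        }
    }

    /// Snapshot taken at open time and stored in `f_wb_err`.
    pub fn sample(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }

    /// Called at `fsync()` time. Reports an error recorded since `since`
    /// was taken, then advances `since` so it is reported only once.
    pub fn check_and_advance(&self, since: &mut u64) -> Result<(), u16> {
        let cur = self.0.load(Ordering::Acquire);
        if cur == *since {
            return Ok(());
        }
        *since = cur;
        Err((cur & ERRNO_MASK) as u16)
    }
}
```

Each open file keeps its own snapshot, so two descriptors opened before an error both observe it once, while a descriptor whose snapshot was taken after the error observes nothing in this simplified model.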
/// DAX File Handling
///
/// DAX (Direct Access) files on persistent memory bypass the page cache
/// entirely. When a filesystem is mounted with `-o dax` on a persistent
/// memory device, every inode's `AddressSpace` has `AS_DAX` set and
/// `page_cache` is `None`.
///
/// **Memory savings**: Skipping `PageCache` allocation saves ~256 bytes per
/// DAX inode (XArray root node, `nr_pages`/`nr_dirty` counters, `xa_lock`,
/// internal bookkeeping). On a persistent memory filesystem with millions
/// of small files, this is significant.
///
/// **Error tracking**: DAX files cannot use the standard `wb_err` writeback
/// error mechanism because there are no page cache pages and no writeback
/// I/O. Instead, hardware memory errors (MCE on x86-64, Synchronous
/// External Abort on AArch64) are tracked via `AddressSpace::dax_err`:
///
/// 1. Hardware detects uncorrectable error on persistent memory address.
/// 2. MCE/SEA handler delivers `SIGBUS` (`BUS_MCEERR_AR` for synchronous,
/// `BUS_MCEERR_AO` for asynchronous) to the process whose access
/// triggered the fault. This is immediate — the process is notified
/// before `fsync` is ever called.
/// 3. MCE/SEA handler increments `mapping.dax_err` (AtomicU32 generation
/// counter, same wrap-around protocol as `ErrSeq`).
/// 4. On `fsync()`: the VFS compares `file.f_dax_err` (snapshotted at
/// `open()`) with `mapping.dax_err`. If they differ, `fsync` returns
/// `-EIO` and advances `file.f_dax_err` to the current generation
/// (so the error is reported exactly once per fd, matching `wb_err`
/// semantics).
///
/// **Dirty page throttling**: `balance_dirty_pages()` excludes DAX files.
/// DAX writes go directly to persistent memory via CPU store instructions
/// — there are no dirty page cache pages to throttle. Write bandwidth is
/// bounded by the persistent memory device's write throughput, not by the
/// kernel's dirty page ratio. The `writeback_lock`, `nrwriteback`, and
/// `writeback_in_progress` fields are unused for DAX inodes.
///
/// **Page fault path**: When a DAX file is faulted, the VFS calls
/// `dax_iomap_fault()` (not `filemap_fault()`). This maps the persistent
/// memory physical address directly into the process's page table — no
/// page allocation, no page cache insertion, no copy. For huge page faults
/// (PMD-level, 2 MiB on x86-64), `dax_iomap_pmd_fault()` maps a single
/// PMD entry covering the entire 2 MiB region.
/// Serialized writeback state embedded inside `AddressSpace::writeback_lock`.
///
/// Protected by `AddressSpace::writeback_lock`. The `Mutex` ensures only
/// one writeback agent runs at a time; the fields inside track progress
/// so that a new agent can resume where the previous one left off.
pub struct WritebackState {
/// Next page index to examine during writeback. The writeback agent
/// advances this forward as pages are submitted for I/O. Wraps to 0
/// after reaching the last page, implementing a cyclic scan
/// consistent with the kernel's "kupdate" writeback policy.
pub writeback_index: u64,
/// Accumulated bytes of dirty data at the time writeback started.
/// Used to limit how much data a single writeback pass writes, so
/// that a continuous dirty stream does not starve readers.
pub dirty_bytes: u64,
}
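The interaction between `writeback_lock`, `writeback_in_progress`, and the cyclic `writeback_index` can be sketched as follows. This is an illustration only: `Mapping`, `writeback_pass`, and `reclaim_should_skip` are hypothetical names, the page count and per-pass budget are passed in as plain integers, and "submitting I/O" is reduced to recording the page index.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

pub struct WritebackState {
    pub writeback_index: u64,
    pub dirty_bytes: u64,
}

/// Only the `AddressSpace` fields involved in writeback serialization.
pub struct Mapping {
    pub writeback_lock: Mutex<WritebackState>,
    pub writeback_in_progress: AtomicBool,
}

impl Mapping {
    /// Run one writeback pass over at most `budget` of `nr_pages` pages and
    /// return the page indices visited, resuming where the last pass stopped.
    pub fn writeback_pass(&self, nr_pages: u64, budget: u64) -> Vec<u64> {
        // Ordering from the field documentation:
        // acquire lock, set flag, do I/O, clear flag, release lock.
        let mut state = self.writeback_lock.lock().unwrap();
        self.writeback_in_progress.store(true, Ordering::Release);

        let mut visited = Vec::new();
        if nr_pages > 0 {
            for _ in 0..budget.min(nr_pages) {
                let idx = state.writeback_index % nr_pages;
                visited.push(idx); // stand-in for submitting the page
                state.writeback_index = (idx + 1) % nr_pages; // wrap to 0
            }
        }

        self.writeback_in_progress.store(false, Ordering::Release);
        visited
        // `state` (the mutex guard) is dropped here, after the flag is cleared.
    }

    /// Reclaim-side check: no lock is taken, only the sentinel is read.
    pub fn reclaim_should_skip(&self) -> bool {
        self.writeback_in_progress.load(Ordering::Acquire)
    }
}
```

Because the index lives inside the mutex, a second agent that acquires the lock later continues from the saved position rather than rescanning from page 0.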
/// Filesystem callbacks invoked by the VFS page cache layer.
///
/// Each filesystem that participates in the page cache provides a
/// static `AddressSpaceOps` implementation. The VFS calls these methods
/// when it needs to populate the cache (read miss), flush dirty pages
/// (writeback), or decide whether a page can be dropped (reclaim).
///
/// **Object safety**: all methods take `&self` on the ops vtable plus
/// explicit `AddressSpace`/`Page` references. The vtable itself is
/// `'static`, `Send`, and `Sync`.
pub trait AddressSpaceOps: Send + Sync {
/// Read one page from the backing store into the page cache. `index` is
/// the page index (file offset divided by `PAGE_SIZE`). The caller
/// (`filemap_get_pages`) has already allocated `page`, locked it
/// (`PageFlags::LOCKED`), and inserted it into the page cache XArray; the
/// filesystem must initiate the I/O that fills the page contents.
///
/// **Contract**: Implementations MUST NOT allocate a new page or
/// overwrite the page cache XArray slot. The caller owns the slot;
/// overwriting it orphans the locked page and deadlocks concurrent
/// readers waiting on `PageFlags::LOCKED`.
///
/// Called with no locks held other than the page lock
/// (`PageFlags::LOCKED`) on `page`. The implementation may block.
fn read_page(
&self,
mapping: &AddressSpace,
index: u64,
page: &Arc<Page>,
) -> Result<(), IoError>;
// **Example: ext4 read_page() flow**
//
// When a file-backed page fault triggers `read_page()` on an ext4 file,
// `ext4_read_page(mapping, pgoff, page)` proceeds as follows:
//
// 1. Map logical block: `ext4_map_blocks(inode, pgoff)` translates the file
//    offset to a physical block number via the extent tree.
// 2. Build Bio: `Bio::new_read(bdev, phys_block, page)`.
// 3. Submit: `bio_submit(bio)` dispatches to the block device driver.
// 4. Wait: the bio completion callback unlocks the page when I/O finishes.
// 5. Return `Ok(())`: the page now contains file data.
//
// The readahead engine ([Section 4.4](04-memory.md#page-cache--readahead-engine)) may
// batch multiple pages into a single Bio with scatter-gather, submitting them
// via `AddressSpaceOps::readahead()` instead of individual `read_page()` calls.

/// Read multiple pages as a batch for readahead. Receives the readahead
/// window from the readahead engine ([Section 4.4](04-memory.md#page-cache--readahead-engine)).
/// Implementations should submit I/O for all requested pages in a single
/// Bio batch. Filesystems that do not implement this method fall back to
/// sequential `read_page()` calls.
/// Default: returns `EOPNOTSUPP` (use `read_page` fallback).
fn readahead(
&self,
mapping: &AddressSpace,
ra: &ReadaheadControl,
) -> Result<(), IoError> {
Err(IoError::new(Errno::EOPNOTSUPP))
}
/// Write a single dirty page to the backing store. `wbc` carries
/// writeback control parameters (sync mode, range limits, number
/// of pages already written in this pass). The implementation must
/// set `PG_WRITEBACK` for the duration of the I/O. The implementation
/// MUST NOT clear `PG_DIRTY` — the `DIRTY → clean` transition and
/// `nr_dirty` decrement are owned exclusively by the completion callback
/// (`writeback_end_io()` for Tier 0, or the Tier 0 `WritebackResponse`
/// handler for Tier 1). Clearing DIRTY here would cause a double-decrement
/// of `nr_dirty` when the completion callback also clears it.
///
/// Called with no locks held. The implementation may block.
fn writepage(
&self,
mapping: &AddressSpace,
page: &Page,
wbc: &WritebackControl,
) -> Result<(), IoError>;
/// Write multiple dirty pages to the backing store in a single batch.
/// Called by the writeback subsystem ([Section 4.6](04-memory.md#writeback-subsystem)) instead of
/// iterating `writepage()` one page at a time. The filesystem should submit
/// all dirty pages in the address space (subject to `wbc` constraints) as
/// coalesced Bio requests for maximum throughput.
///
/// # Returns
/// - `Ok(n)`: Number of pages successfully submitted for writeback.
/// - `Err(IoError)`: Fatal error; writeback aborted for this inode.
///
/// Default: returns `EOPNOTSUPP` (writeback layer falls back to per-page
/// `writepage()` calls). Filesystems that support extent-based I/O (ext4,
/// XFS, btrfs) should implement this for 5-10x writeback throughput vs.
/// per-page writepage on rotational media.
fn writepages(
&self,
mapping: &AddressSpace,
wbc: &WritebackControl,
) -> Result<u64, IoError> {
Err(IoError::new(Errno::EOPNOTSUPP))
}
/// Verify data integrity of a page populated through a non-standard
/// path (RDMA fetch, DSM migration, decompression). Filesystems that
/// store per-page checksums (btrfs, ext4 metadata, ZFS) implement this
/// to catch silent corruption from paths that bypass the standard block
/// I/O checksum pipeline.
///
/// Called by the DSM page fetch path ([Section 6.11](06-dsm.md#dsm-distributed-page-cache))
/// after RDMA-fetching a page from a remote peer, before setting
/// `PageFlags::UPTODATE`. If verification fails, the fetched page is
/// discarded and the DSM falls back to storage I/O.
///
/// Default: returns `Ok(true)` — the page is accepted without
/// verification (appropriate for filesystems without per-page
/// checksums, e.g., tmpfs, ext2, NFS).
fn verify_page(
&self,
mapping: &AddressSpace,
index: u64,
page: &Page,
) -> Result<bool, IoError> {
Ok(true)
}
/// Called by the page reclaimer immediately before a clean page is
/// removed from the cache. The filesystem may decline eviction by
/// returning `false` (e.g., because it has pinned the page for
/// journalling). Returning `true` grants permission to evict.
///
/// Must not block; must not acquire locks that might sleep.
fn releasepage(&self, page: &Page) -> bool;
/// Called by `generic_file_write_iter()` before writing user data into
/// a page. The filesystem prepares the page for writing:
///
/// - **ext4**: starts a JBD2 journal handle (`journal_start()`), allocates
/// blocks for delayed allocation, reads the page from disk if the write
/// is partial (does not cover the entire page).
/// - **XFS**: creates a delayed allocation extent reservation.
/// - **tmpfs**: allocates a swap-backed page.
/// - **Default (simple filesystems)**: allocates a clean page from the
/// page cache if not already present, zeroing unwritten portions.
///
/// The returned page reference is locked (`PageFlags::LOCKED` set).
/// The caller (`generic_file_write_iter`) copies user data into the
/// page between `write_begin` and `write_end`.
///
/// On error (e.g., `ENOSPC` from block allocation), the write is aborted
/// and the page is released without modification.
///
/// See [Section 15.6](15-storage.md#filesystem-ext4) for ext4's implementation.
fn write_begin(
&self,
mapping: &AddressSpace,
pos: u64,
len: usize,
flags: u32,
) -> Result<PageRef, IoError>;
/// Called by `generic_file_write_iter()` after writing user data into
/// the page returned by `write_begin()`. The filesystem commits the
/// write:
///
/// - **ext4**: marks buffer heads dirty, stops the JBD2 journal handle
/// (`journal_stop()`), updates `i_size` if the write extended the file.
/// - **XFS**: marks the page dirty, updates extent state.
/// - **Default (simple filesystems)**: marks the page dirty via
/// `set_page_dirty()`.
///
/// `copied` is the number of bytes actually copied by the write (may be
/// less than `len` for a short copy from user memory). The filesystem
/// must handle partial writes correctly (e.g., by not advancing `i_size`
/// past the last successfully written byte).
///
/// The page is still locked on entry; the filesystem may unlock it
/// before returning.
///
/// **Tier boundary for `set_page_dirty()`**: For Tier 1 filesystems
/// (ext4, XFS, Btrfs), `write_end()` is invoked via the KABI ring:
/// the Tier 0 VFS dispatches a `WriteEnd` command to the filesystem's
/// domain, the filesystem processes it and returns a response. The
/// response includes a `dirty: bool` flag indicating whether the page
/// should be marked dirty. The **Tier 0 VFS ring consumer** -- not the
/// Tier 1 filesystem -- calls `set_page_dirty()` upon receiving a
/// response with `dirty == true`. This keeps all page cache metadata
/// operations (`set_page_dirty()`, `nr_dirty` counters, BDI dirty
/// list) in Tier 0, avoiding cross-domain direct calls from Tier 1.
///
/// For Tier 0 filesystems (tmpfs, ramfs -- statically linked), the
/// `write_end()` callback runs in the same domain and calls
/// `set_page_dirty()` directly. No ring dispatch is needed.
///
/// This is the same pattern used for block I/O completion (Tier 1
/// NVMe driver signals via outbound ring, Tier 0 consumer calls
/// `bio_complete()`) -- see [Section 12.8](12-kabi.md#kabi-domain-runtime) and
/// [Section 15.19](15-storage.md#nvme-driver-architecture).
fn write_end(
&self,
mapping: &AddressSpace,
pos: u64,
len: usize,
copied: usize,
page: PageRef,
) -> Result<usize, IoError>;
/// Called by the page cache when a page is first dirtied. Allows the
/// filesystem to register the affected block extent for crash recovery
/// journaling BEFORE the page is modified.
///
/// Filesystems with journaling (ext4, btrfs, XFS) implement this to
/// record dirty extents in their journal. Filesystems without journaling
/// (tmpfs, ramfs, NFS) leave this as the default no-op.
///
/// Called from `__set_page_dirty()` with the page locked. The `offset`
/// and `len` arguments describe the byte range within the file that
/// will be dirtied (typically `page_offset` and `PAGE_SIZE`, but
/// sub-page dirty tracking for large folios may pass smaller ranges).
///
/// **Interaction with two-phase dirty extent protocol**: For Tier 1
/// VFS drivers running in an isolated domain, the `dirty_extent()`
/// callback calls `vfs_dirty_extent_reserve()` to register the
/// logical intent in Core's dirty intent list
/// ([Section 14.1](#virtual-filesystem-layer)). For in-place filesystems that
/// know the block address at dirty time (ext4 non-delayed-alloc),
/// the callback may use `vfs_dirty_extent_reserve_and_commit()` to
/// atomically reserve and bind the physical address. CoW filesystems
/// (Btrfs, XFS reflink) call only `vfs_dirty_extent_reserve()` here
/// and defer `vfs_dirty_extent_commit()` to the writeback path after
/// block allocation. For Tier 0 filesystems (statically linked, e.g.,
/// tmpfs), the callback can directly update internal journal
/// structures without crossing a domain boundary.
///
/// **In-place filesystem implementation pattern** (ext4 example):
/// 1. `dirty_extent()` is called with the byte range `[offset, offset+len)`.
/// 2. The filesystem maps the byte range to physical block extents via
/// its extent tree.
/// 3. Calls `vfs_dirty_extent_reserve_and_commit(inode_id, offset, len,
/// block_addr, block_len)` — atomic reserve+commit since the block
/// address is known.
/// 4. The filesystem writes a journal descriptor block recording the
/// physical extents that are about to be modified.
/// 5. Only after the journal descriptor is committed (or at least
/// queued for commit) does `dirty_extent()` return `Ok(())`.
/// 6. The caller (`__set_page_dirty()`) then sets `PageFlags::DIRTY`
/// on the page.
///
/// **CoW filesystem implementation pattern** (Btrfs example):
/// 1. `dirty_extent()` is called with the byte range `[offset, offset+len)`.
/// 2. Calls `vfs_dirty_extent_reserve(inode_id, offset, len)` — Phase 1
/// only. No block address is available yet (CoW allocates at writeback).
/// 3. Returns `Ok(())` with the `DirtyExtentToken` stored in the inode's
/// per-extent pending-commit table (filesystem-private state).
/// 4. During writeback, the filesystem allocates new blocks via its
/// extent allocator, then calls `vfs_dirty_extent_commit(token,
/// block_addr, block_len)` — Phase 2.
/// 5. After I/O completion, calls `vfs_flush_extent_complete()`.
///
/// This ordering guarantee ensures that on crash recovery, Core has a
/// record of every dirty extent — no data modification happens without
/// a corresponding intent entry.
fn dirty_extent(
&self,
_mapping: &AddressSpace,
_offset: u64,
_len: u64,
) -> Result<(), IoError> {
// Default: no-op. Appropriate for in-memory filesystems (tmpfs,
// ramfs) and network filesystems (NFS, which has its own write
// delegation protocol).
Ok(())
}
/// Returns the direct-I/O implementation for this address space,
/// if the filesystem supports bypassing the page cache (e.g., for
/// `O_DIRECT` opens). Returns `None` if direct I/O is not supported;
/// the VFS will then fall back to the page-cache path.
fn direct_io(&self) -> Option<&dyn DirectIoOps> {
None
}
}
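The fallback contract described for writepages() (a default that reports EOPNOTSUPP, after which the writeback layer issues per-page writepage() calls) can be illustrated with a reduced, userspace-compilable sketch. Everything named here (WritebackOps, WbError, write_back, SimpleFs, ExtentFs) is hypothetical and invented for the illustration; it stands in for AddressSpaceOps and the writeback subsystem and is not the umka-vfs API. Pages are represented only by their index.

```rust
use std::cell::Cell;

/// Errors the (hypothetical) writeback hooks can report.
#[derive(Debug, PartialEq)]
pub enum WbError {
    /// Batch writeback is not implemented; plays the role of `EOPNOTSUPP`.
    NotSupported,
    /// Any other I/O failure.
    Io,
}

/// Reduced stand-in for `AddressSpaceOps`: only the two writeback hooks.
pub trait WritebackOps {
    /// Write one dirty page, identified here by its page index.
    fn writepage(&self, index: u64) -> Result<(), WbError>;

    /// Batch hook. The default reports "not supported".
    fn writepages(&self, _dirty: &[u64]) -> Result<u64, WbError> {
        Err(WbError::NotSupported)
    }
}

/// Try the batch path first; fall back to per-page writes only when the
/// filesystem reports `NotSupported`. Returns the number of pages submitted.
pub fn write_back(ops: &dyn WritebackOps, dirty: &[u64]) -> Result<u64, WbError> {
    match ops.writepages(dirty) {
        Err(WbError::NotSupported) => {
            let mut submitted = 0;
            for &index in dirty {
                ops.writepage(index)?;
                submitted += 1;
            }
            Ok(submitted)
        }
        // Success or a real error from the batch path is passed through.
        other => other,
    }
}

/// A filesystem that implements only the per-page hook.
pub struct SimpleFs {
    calls: Cell<u64>,
}

impl SimpleFs {
    pub fn new() -> Self {
        SimpleFs { calls: Cell::new(0) }
    }
    /// Number of `writepage()` calls observed so far.
    pub fn calls(&self) -> u64 {
        self.calls.get()
    }
}

impl WritebackOps for SimpleFs {
    fn writepage(&self, _index: u64) -> Result<(), WbError> {
        self.calls.set(self.calls.get() + 1);
        Ok(())
    }
}

/// A filesystem that overrides the batch hook.
pub struct ExtentFs;

impl WritebackOps for ExtentFs {
    fn writepage(&self, _index: u64) -> Result<(), WbError> {
        Ok(())
    }
    fn writepages(&self, dirty: &[u64]) -> Result<u64, WbError> {
        Ok(dirty.len() as u64)
    }
}
```

Under these assumptions, SimpleFs gets working (if slower) writeback without implementing the batch hook, while ExtentFs opts into the batched path by overriding a single method.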
14.1.2.3.1.4 Generic File Operations (VFS-to-Page-Cache Bridge)¶
The VFS provides generic implementations of file read/write that bridge
FileOps calls to the page cache. Most filesystem types delegate their
FileOps::read() and FileOps::write() to these generic functions,
only providing the AddressSpaceOps callbacks for cache miss I/O.
Isolation domain: filemap_get_pages() runs in Tier 0 (Core domain). The page
cache XArray is Core memory. VFS dispatches the read request to Core via the KABI ring;
Core's filemap_get_pages() accesses the page cache directly. The ~23-46 cycle domain
crossing happens at the VFS-to-Core ring boundary.
/// Maximum number of pages fetched in a single readahead or
/// `filemap_get_pages()` call. Matches Linux's default readahead window
/// (`VM_READAHEAD_PAGES`: 128 KiB, i.e., 32 pages at a 4 KiB page size).
/// A bounded `ArrayVec` is used to avoid heap allocation on the hot path.
pub const MAX_READAHEAD_PAGES: usize = 32;
/// Read pages from the page cache for a file read operation.
/// This is the generic implementation used by most filesystem types.
/// Equivalent to Linux's `filemap_get_pages()` + `generic_file_read_iter()`.
///
/// Returns pages covering the requested range `[pgoff, pgoff + nr_pages)`.
/// Pages not in cache are fetched via `mapping.ops.read_page()`.
///
/// **Concurrency**: Called with no inode locks held. Multiple threads may
/// call this concurrently on the same `AddressSpace`; the page cache XArray
/// ([Section 4.4](04-memory.md#page-cache)) provides internal synchronization.
///
/// **Concurrent reader deduplication (lock-or-find protocol)**:
/// When a cache miss occurs, multiple threads may race to populate the same
/// page index. The following protocol ensures exactly one thread performs
/// I/O, while all others wait for the result:
///
/// 1. **Cache probe**: `pc.pages.load(idx)` — if found, return (cache hit).
/// 2. **Allocate**: Allocate a new page, set `PageFlags::LOCKED` atomically.
/// 3. **Atomic insert**: `pc.pages.try_store(idx, page)` — attempts a
/// compare-and-swap insertion into the XArray slot.
/// 4. **Lost race**: If `try_store` returns an existing page (a concurrent
/// reader won the race and inserted first), drop our freshly allocated
/// page, then wait for the existing page's `PageFlags::LOCKED` to be
/// cleared (the winner is performing I/O). Once unlocked, the existing
/// page contains valid data — return it.
/// 5. **Won race**: If `try_store` succeeds (our page is now in the cache),
/// call `read_page()` to fill the page from the backing store. On I/O
/// completion, clear `PageFlags::LOCKED` and wake all waiters sleeping
/// on this page's lock (step 4 above). Return the filled page.
///
/// This protocol prevents duplicate I/O: at most one `read_page()` call is
/// issued per page index, regardless of the number of concurrent readers.
/// The cost of the losing path is one wasted page allocation (returned to
/// the buddy allocator immediately) plus a sleep on the page lock — no I/O.
///
/// **Readahead integration**: Before the lock-or-find path, this function
/// checks the readahead state (`FileRaState` on the `OpenFile`) and may
/// trigger `AddressSpaceOps::readahead()` to batch-fetch a window of pages
/// in a single I/O. The readahead engine ([Section 4.4](04-memory.md#page-cache--readahead-engine))
/// determines the window size based on sequential access detection.
/// Readahead-populated pages are inserted via the same `try_store` protocol,
/// so concurrent readahead and fault-driven reads do not duplicate I/O.
///
/// **Error handling (short-read semantics)**: If `read_page()` fails for
/// any page in the range, the function sets `PageFlags::ERROR` on the
/// failed page, then clears `PageFlags::LOCKED` and wakes waiters (in that
/// order, so every woken waiter observes the error), and finally removes
/// the failed page from the cache via
/// `pc.pages.erase(idx)`. If pages were successfully fetched in earlier
/// iterations, they are returned as a short read (the caller receives
/// fewer pages than requested — not an error). Only if *no* pages were
/// successfully fetched does the function return `Err`. This matches
/// POSIX read semantics: a successful partial transfer is reported as a
/// short read, not an error. Waiters sleeping on a page that fails I/O
/// are woken and observe `PageFlags::ERROR`, causing them to return `EIO`.
pub fn filemap_get_pages(
mapping: &AddressSpace,
pgoff: u64,
nr_pages: u32,
ra_state: &mut FileRaState,
) -> Result<ArrayVec<PageRef, MAX_READAHEAD_PAGES>, IoError> {
let mut pages = ArrayVec::new();
let pc = mapping.page_cache.as_ref().ok_or(IoError::new(Errno::EINVAL))?;
for i in 0..nr_pages as u64 {
let idx = pgoff + i;
// Step 1: Cache probe with RCU + speculative refcount.
// The XArray load returns a PageRef valid only under RCU read lock.
// We must bump the refcount before releasing RCU to prevent the
// page reclaimer from freeing the page between lookup and use.
// This matches Linux's folio_try_get_rcu() pattern in filemap_get_pages().
{
let _rcu = rcu_read_lock();
if let Some(page) = pc.pages.load(idx) {
// Speculative refcount bump under RCU protection. If the page
// was concurrently freed (refcount already 0), try_get_ref()
// returns false and we fall through to the miss path.
if page.try_get_ref() {
drop(_rcu); // Release RCU after refcount is stable.
// Cache hit — mark referenced for LRU aging.
page.flags.fetch_or(PageFlags::ACCESSED, Relaxed);
pages.push(page);
continue;
}
// Refcount bump failed — page is being freed. Fall through
// to the miss path (will allocate a new page or find a
// replacement after reclaim completes).
}
}
// Cache miss — DSM cooperative cache check (three-stage filter).
//
// Design: never add latency to the common case. Most misses are
// local-only (no remote node has the page). The three stages
// progressively filter out unnecessary RDMA lookups:
//
// Stage 1: Bloom filter (~15-30ns, 3-5 cache lines).
// Per-peer counting Bloom filters ([Section 6.11](06-dsm.md#dsm-distributed-page-cache))
// are exchanged lazily via DSM heartbeat piggyback. A negative
// result means "definitely not cached remotely" — skip RDMA.
// Eliminates ~90-95% of remote lookups with zero I/O.
//
// Stage 2: Sequential access rejection (~5ns, single branch).
// Sequential readahead streams rarely benefit from cooperative
// caching — the same sequential stream is unlikely to be cached
// on a peer. If FileRaState indicates sequential pattern (the
// readahead engine already tracks this), skip RDMA even if
// bloom says "maybe". Eliminates another ~3-5% of lookups.
//
// Stage 3: Speculative parallel issue (RDMA + NVMe simultaneously).
// For the remaining ~2-5% of random-access misses where bloom
// says "maybe", fire BOTH the RDMA cooperative lookup AND the
// local NVMe readahead in parallel. First completion wins;
// the loser is cancelled. RDMA (~2-5μs) typically beats NVMe
// (~10-100μs) when the remote cache is warm, so we get a real
// speedup. When the remote cache is cold, NVMe completes
// normally — no added latency.
//
// Net effect: ~95-98% of misses pay only the Bloom probe (plus one branch)
// and no RDMA round-trip; ~5-10μs speedup for the ~2-5% where remote cache
// is warm; a cold remote cache adds no I/O latency over local-only.
// Tracks whether Stage 3 already submitted readahead for this index, so the
// common miss path below does not submit it a second time.
let mut readahead_issued = false;
if mapping.host.i_sb.s_flags.load(Relaxed) & MS_DSM_COOPERATIVE != 0 {
let file_id = DsmFileId::from_inode(&mapping.host);
// Stage 1: Bloom filter, fast local rejection.
let bloom_hit = dsm_bloom_probe(&file_id, idx);
if bloom_hit {
// Stage 2: Sequential access rejection.
let is_sequential = ra_state.prev_pos != 0
&& idx == ra_state.prev_pos + 1;
if !is_sequential {
// Stage 3: Speculative parallel issue.
// Fire RDMA cooperative lookup, then submit readahead (the NVMe
// path) right away. The RDMA result is checked after readahead
// submission: if RDMA completed first, use the remote page and
// cancel the local I/O.
let rdma_fut = dsm_cooperative_cache_lookup_async(
&file_id, idx,
);
// NVMe path starts here.
ra_state.start = idx;
page_cache_readahead(mapping, ra_state, nr_pages - i as u32);
readahead_issued = true;
// Check RDMA result: did the remote cache beat NVMe?
if let Some(page) = rdma_fut.try_complete() {
// Remote cache won. Cancel local I/O for this page
// (readahead may have submitted it; the page cache
// deduplicates, and the readahead page is simply evicted
// on next reclaim pass if unused).
page.flags.fetch_or(PageFlags::ACCESSED, Relaxed);
pages.push(page);
continue;
}
// RDMA didn't complete yet or missed. NVMe readahead is
// already in flight; re-check cache below (normal path).
// The RDMA future is dropped (cancelled on drop).
}
}
}
// Cache miss: trigger readahead before attempting I/O, unless Stage 3
// above already submitted it for this index.
// The readahead engine ([Section 4.4](04-memory.md#page-cache--readahead-engine)) may
// submit a larger I/O batch here to prefetch upcoming pages.
// Pass the remaining page count (nr_pages - pages already fetched),
// not the page index: page_cache_readahead uses FileRaState.start
// (set by the caller) to determine *which* pages to read, and
// nr_pages to bound the readahead window size.
if !readahead_issued {
ra_state.start = idx;
page_cache_readahead(mapping, ra_state, nr_pages - i as u32);
}
// Re-check cache (readahead may have populated this page).
// Same RCU + speculative refcount pattern as Step 1.
{
let _rcu = rcu_read_lock();
if let Some(page) = pc.pages.load(idx) {
if page.try_get_ref() {
drop(_rcu);
page.flags.fetch_or(PageFlags::ACCESSED, Relaxed);
pages.push(page);
continue;
}
}
}
// Step 2: Allocate a new page with LOCKED flag. Relaxed ordering
// suffices because the page is freshly allocated and not yet shared.
let new_page = alloc_page(GFP_KERNEL)?;
new_page.flags.fetch_or(PageFlags::LOCKED, Relaxed);
// Step 3: Atomic insert — race with concurrent readers.
match pc.pages.try_store(idx, new_page.clone()) {
Ok(()) => {
// Step 5: Won race — we own this slot. Fill via I/O.
//
// **Ring dispatch integration (VFS-BUG-2/4 fix)**:
// For Tier 0 filesystems, `read_page` is a direct synchronous
// call. For Tier 1 filesystems, the VFS ring protocol is used:
// (a) Construct VfsRequest { opcode: ReadPage, page_index: idx,
// buf: DmaBufferHandle from the page, offset, count }.
// (b) Submit via reserve_slot() -> complete_slot() on the
// per-superblock VfsRingPair.
// (c) The submitting thread sleeps on the page's wait queue
// (wait_on_page_locked), NOT on the ring completion queue.
// (d) The Tier 1 driver processes the request, fills the DMA
// buffer, and posts a VfsResponse. The ring completion
// handler copies data to the page cache page, calls
// set_page_uptodate(), clears LOCKED, and wakes the
// page's wait queue.
// This ensures the read path works identically for both tiers:
// the caller always sleeps on wait_on_page_locked, and the
// I/O completion path always unlocks the page.
match mapping.ops.read_page(mapping, idx, &new_page) {
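Ok(()) => {
// I/O initiated. For Tier 0, this returns after the
// page is filled. For Tier 1, this returns after the
// ring request is submitted (async). In both cases,
// LOCKED is cleared by the completion path.
wait_on_page_locked(&new_page);
// An asynchronous (Tier 1) completion can still fail after
// read_page() returned Ok, so check ERROR exactly as the
// lost-race path does before handing the page out.
if new_page.flags.load(Acquire) & PageFlags::ERROR != 0 {
if !pages.is_empty() {
return Ok(pages);
}
return Err(IoError::new(Errno::EIO));
}
new_page.flags.fetch_or(PageFlags::ACCESSED, Relaxed);
pages.push(new_page.clone());
}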
Err(e) => {
// I/O failed. Mark error, unlock, remove, wake.
new_page.flags.fetch_or(PageFlags::ERROR, Release);
new_page.flags.fetch_and(!PageFlags::LOCKED, Release);
wake_page_waiters(&new_page);
pc.pages.erase(idx);
// Short-read semantics: if we already have pages,
// return them (partial success). Only error if no
// pages were fetched at all.
if !pages.is_empty() {
return Ok(pages);
}
return Err(e);
}
}
}
Err(existing) => {
// Step 4: Lost race — another thread inserted first.
// Drop our page (returns to buddy allocator).
drop(new_page);
// Wait for the winner to finish I/O (LOCKED cleared).
wait_on_page_locked(&existing);
// Check if the winner's I/O failed.
if existing.flags.load(Acquire) & PageFlags::ERROR != 0 {
// Short-read: return pages collected so far, or error
// if none were successfully fetched.
if !pages.is_empty() {
return Ok(pages);
}
return Err(IoError::new(Errno::EIO));
}
existing.flags.fetch_or(PageFlags::ACCESSED, Relaxed);
pages.push(existing);
}
}
}
Ok(pages)
}
/// Generic file read iterator. Used by most filesystem `FileOps::read()`
/// implementations. Reads from the page cache, triggering readahead and
/// I/O as needed.
///
/// **Flow**:
/// 1. Compute the starting page offset and intra-page offset from `*offset`.
/// 2. Call `filemap_get_pages()` for the required page range.
/// 3. Copy data from the returned pages into `buf` via `copy_to_user()`.
/// 4. Advance `*offset` by the number of bytes read.
/// 5. Return the total bytes copied, or an error if no bytes were read.
///
/// **Short reads**: If the file ends mid-page (offset + len > i_size),
/// only the valid bytes are copied. This is not an error — the return
/// value reflects the actual bytes read, and `*offset` is advanced
/// accordingly.
///
/// **DAX bypass**: If `mapping.flags` has `AS_DAX` set, this function
/// is never called. DAX files use `dax_iomap_rw()` instead, which copies
/// directly between persistent memory and the user buffer, bypassing the
/// page cache.
pub fn generic_file_read_iter(
file: &OpenFile,
buf: &mut UserSliceMut,
offset: &mut i64,
) -> Result<usize, IoError> {
let mapping = &file.inode.i_mapping;
let mut total = 0usize;
// VFS-BUG-5 fix: i_size check before read loop. Without this, a read
// past EOF would copy uninitialized page data to userspace (information
// disclosure). POSIX: read() past EOF returns 0 bytes, not an error.
let i_size = file.inode.i_size.load(Acquire) as i64;
if *offset >= i_size {
return Ok(0);
}
// Clamp the effective read length to not exceed i_size. This ensures
// we never read beyond the file's logical end, even if page cache pages
// exist beyond i_size (e.g., from a concurrent truncate race — the
// truncate path clears those pages asynchronously).
let max_readable = (i_size - *offset) as usize;
let effective_remaining = min(buf.remaining(), max_readable);
while total < effective_remaining {
let pgoff = (*offset as u64) / PAGE_SIZE as u64;
let intra = (*offset as usize) % PAGE_SIZE;
let remaining = effective_remaining - total;
let nr = min((remaining + PAGE_SIZE - 1) / PAGE_SIZE, MAX_READAHEAD_PAGES as usize);
let pages = filemap_get_pages(mapping, pgoff, nr as u32, &mut file.ra_state.lock())?;
for (i, page) in pages.iter().enumerate() {
// intra-page offset: non-zero only for the first page of each
// filemap_get_pages batch (the read may start mid-page). All
// subsequent pages are read from offset 0.
let page_intra = if i == 0 { intra } else { 0 };
// Clamp to i_size: never copy beyond the file's logical end.
let avail = min(
min(PAGE_SIZE - page_intra, effective_remaining - total),
buf.remaining(),
);
if avail == 0 { break; }
copy_to_user(buf, page.data_ptr().add(page_intra), avail)?;
*offset += avail as i64;
total += avail;
}
if pages.len() < nr as usize { break; } // short read or EOF
}
Ok(total)
}
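The lock-or-find protocol documented for filemap_get_pages() can be modelled in userspace to show the key property: however many readers race on one index, only one of them performs I/O. This is a minimal sketch under stated assumptions, not the kernel implementation. Cache, Page, State, and race_readers are hypothetical names. A Mutex-protected HashMap stands in for the XArray and its compare-and-swap try_store (so, unlike the real protocol, a losing reader never allocates a spare page), and a Condvar stands in for the page wait queue.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};

/// State of one cached page. `Locked` plays the role of `PageFlags::LOCKED`.
#[derive(Clone, Copy, PartialEq)]
enum State {
    Locked,
    UpToDate(u64),
    Error,
}

struct Page {
    state: Mutex<State>,
    unlocked: Condvar, // signalled when the winner finishes I/O
}

impl Page {
    fn new_locked() -> Arc<Page> {
        Arc::new(Page { state: Mutex::new(State::Locked), unlocked: Condvar::new() })
    }
    /// Counterpart of `wait_on_page_locked()`: block until no longer Locked.
    fn wait_unlocked(&self) -> State {
        let mut s = self.state.lock().unwrap();
        while *s == State::Locked {
            s = self.unlocked.wait(s).unwrap();
        }
        *s
    }
    /// Publish the I/O result, then wake every waiter.
    fn finish(&self, result: State) {
        *self.state.lock().unwrap() = result;
        self.unlocked.notify_all();
    }
}

pub struct Cache {
    pages: Mutex<HashMap<u64, Arc<Page>>>,
    io_count: AtomicUsize,
}

impl Cache {
    pub fn new() -> Self {
        Cache { pages: Mutex::new(HashMap::new()), io_count: AtomicUsize::new(0) }
    }
    /// Number of backing-store reads issued so far.
    pub fn io_issued(&self) -> usize {
        self.io_count.load(Ordering::SeqCst)
    }
    /// Return the page contents; `read_page` runs only in the winning thread.
    pub fn get<F>(&self, idx: u64, read_page: F) -> Option<u64>
    where
        F: FnOnce(u64) -> Option<u64>,
    {
        // Probe and insert under one lock (stand-in for load + try_store).
        let (page, won) = {
            let mut map = self.pages.lock().unwrap();
            match map.get(&idx) {
                Some(existing) => (existing.clone(), false),
                None => {
                    let fresh = Page::new_locked();
                    map.insert(idx, fresh.clone());
                    (fresh, true)
                }
            }
        };
        if won {
            // Won race: fill the page, then unlock it.
            self.io_count.fetch_add(1, Ordering::SeqCst);
            match read_page(idx) {
                Some(data) => page.finish(State::UpToDate(data)),
                None => {
                    // Failure: publish the error, wake waiters, erase the slot.
                    page.finish(State::Error);
                    self.pages.lock().unwrap().remove(&idx);
                }
            }
        }
        // Cache hit, winner, and loser all converge here.
        match page.wait_unlocked() {
            State::UpToDate(data) => Some(data),
            _ => None,
        }
    }
}

/// Race `n` readers on the same index; returns how many I/Os were issued.
pub fn race_readers(n: usize) -> usize {
    let cache = Cache::new();
    std::thread::scope(|s| {
        for _ in 0..n {
            s.spawn(|| assert_eq!(cache.get(7, |i| Some(i * 10)), Some(70)));
        }
    });
    cache.io_issued()
}
```

In this model a failed read is erased from the map, so a later reader of the same index starts a fresh I/O rather than inheriting the error.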
Tier isolation note: When the filesystem runs in Tier 1 (a hardware-isolated memory
domain), the VFS read path performs copy_to_user via a KABI return-buffer mechanism.
The Tier 1 filesystem populates a shared bounce buffer (mapped into the
driver's isolation domain as writable), and umka-core (Tier 0) performs the actual
copy_to_user() after the KABI ring response returns. This ensures the Tier 1
driver never directly accesses userspace memory — all user memory writes go through
Tier 0 copy_to_user(), which validates the destination address against the task's
address space. For Tier 0 filesystems (e.g., the root filesystem driver), the
copy_to_user() call is inlined directly — no bounce buffer needed.
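The division of labour in the note above can be sketched with a toy model: the driver side only ever writes into a bounce buffer, and the core side validates the user destination before copying. This is an illustration under loud assumptions. Task, UserRange, CopyError, driver_fill, and core_copy_to_user are hypothetical names, user memory is modelled as a flat byte vector, and the address-space check is reduced to a single writable range.

```rust
/// A user-space range the task is allowed to write to (toy address space).
#[derive(Clone, Copy)]
pub struct UserRange {
    pub start: usize,
    pub len: usize,
}

#[derive(Debug, PartialEq)]
pub enum CopyError {
    /// Destination is not fully inside the task's writable range.
    Fault,
}

/// Toy task: "user memory" is a flat byte array indexed by address.
pub struct Task {
    pub mem: Vec<u8>,
    pub writable: UserRange,
}

/// Tier 1 side: fill the shared bounce buffer. It never sees a user address.
pub fn driver_fill(bounce: &mut [u8], byte: u8) -> usize {
    bounce.fill(byte);
    bounce.len()
}

/// Tier 0 side: validate the destination against the task's address space,
/// then copy from the bounce buffer into user memory.
pub fn core_copy_to_user(task: &mut Task, dst: usize, src: &[u8]) -> Result<usize, CopyError> {
    let end = dst.checked_add(src.len()).ok_or(CopyError::Fault)?;
    let range = task.writable;
    let range_end = range.start.checked_add(range.len).ok_or(CopyError::Fault)?;
    if dst < range.start || end > range_end || end > task.mem.len() {
        return Err(CopyError::Fault);
    }
    task.mem[dst..end].copy_from_slice(src);
    Ok(src.len())
}
```

Because only core_copy_to_user() takes a destination address, a misbehaving driver in this model can at worst hand back wrong bytes; it cannot choose where they land.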
Read path walkthrough (numbered trace, analogous to write path):
1. sys_read(): extract fd, buf, count from SyscallContext; resolve OpenFile.
2. VFS dispatch: call file.ops.read_iter() (or generic_file_read_iter for regular files).
3. filemap_get_pages(): compute page offset, probe page cache XArray.
4. Page cache hit: mark PageFlags::ACCESSED, return cached page.
5. Page cache miss: trigger readahead (AddressSpaceOps::readahead()).
6. wait_on_page_locked(): sleep until I/O completion clears PageFlags::LOCKED.
7. copy_to_user(): copy page data to userspace buffer (bounce buffer if Tier 1).
8. Return total bytes read to userspace via syscall return register.

Cross-references:
- Page cache structure and XArray: Section 4.4
- Readahead engine: Section 4.4
- DAX direct access path: Section 15.16
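The offset arithmetic behind the read path (clamp the request to i_size, then split it into per-page pieces where only the first piece can start mid-page) is easy to get wrong by one. The following is a minimal, pure-function sketch of that arithmetic. plan_read and Segment are hypothetical helpers for illustration only, and a 4 KiB page size is assumed.

```rust
/// Assumed page size for this illustration.
pub const PAGE_SIZE: usize = 4096;

/// One per-page copy: which page, where in the page, and how many bytes.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Segment {
    pub page_index: u64,
    pub intra: usize,
    pub len: usize,
}

/// Split a read of `want` bytes at `offset` into per-page segments,
/// never going past `i_size`. A read at or past EOF yields no segments.
pub fn plan_read(offset: u64, want: usize, i_size: u64) -> Vec<Segment> {
    let mut segments = Vec::new();
    if offset >= i_size {
        return segments; // read() at EOF returns 0 bytes, not an error
    }
    // Clamp the request so it ends at i_size at the latest.
    let max_readable = (i_size - offset).min(usize::MAX as u64) as usize;
    let mut remaining = want.min(max_readable);
    let mut pos = offset;
    while remaining > 0 {
        // Non-zero only when `pos` is not page-aligned (first segment).
        let intra = (pos % PAGE_SIZE as u64) as usize;
        let len = remaining.min(PAGE_SIZE - intra);
        segments.push(Segment { page_index: pos / PAGE_SIZE as u64, intra, len });
        pos += len as u64;
        remaining -= len;
    }
    segments
}
```

For example, a 200-byte read at offset 4000 of a large file splits into 96 bytes from the tail of page 0 and 104 bytes from the start of page 1.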
14.1.2.3.1.5 generic_file_write_iter() — Buffered Write Path¶
/// Generic buffered write implementation. Called from `FileOps::write_iter()`
/// for regular file writes. Iterates over the write range page by page, using
/// the filesystem's `write_begin()`/`write_end()` callbacks for per-page
/// preparation and commit.
///
/// Equivalent to Linux's `generic_file_write_iter()` + `generic_perform_write()`.
///
/// # Steps
///
/// 1. RLIMIT_FSIZE check (truncate write to limit).
/// 2. O_APPEND: acquire i_rwsem exclusive, seek to i_size.
/// 3. For each page in [pos, pos+count):
/// a. `write_begin(mapping, pos, len, flags)` — filesystem prepares page
/// (journal reservation, delayed allocation, partial page read-in).
/// Returns locked page reference.
/// b. `copy_from_user(page_addr + offset, buf, bytes)` — copy user data
/// into the page. Handles fault (short copy) via `bytes_copied`.
/// c. `write_end(mapping, pos, len, bytes_copied, page)` — filesystem
/// commits the write (mark dirty, update journal, update i_size).
/// d. `balance_dirty_pages_ratelimited(mapping)` — throttle the writer
/// if dirty page count exceeds the threshold, preventing memory
/// pressure from unbounded dirtying.
/// 4. Update `mtime` and `ctime` timestamps.
/// 5. Return total bytes written.
///
/// # Error handling
///
/// If `write_begin()` fails (e.g., ENOSPC from block allocation), the write
/// returns a short count or error. Pages already committed via `write_end()`
/// remain dirty and will be written back asynchronously. This matches Linux:
/// a partial write is not rolled back.
///
/// If `copy_from_user()` returns a short count (user page not present),
/// `write_end()` is called with the short `bytes_copied`. The filesystem
/// handles the partial page correctly (e.g., ext4 does not advance i_size
/// past the last successfully written byte).
fn generic_file_write_iter(
file: &OpenFile,
buf: &UserSlice,
pos: &mut i64,
) -> Result<usize, IoError> {
// `file.inode` is a field (`Arc<Inode>`), not a method; the inode's page
// cache address space is its `i_mapping` field.
let inode = &*file.inode;
let mapping = &inode.i_mapping;
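// Step 1: RLIMIT_FSIZE: truncate write to file size limit. As in Linux's
// generic_write_check_limits(), SIGXFSZ is raised only when no byte can be
// written (pos already at or past the limit); a write that is merely
// clamped proceeds as a short write without a signal.
let rlimit_fsize = current_task().process.rlimits.limits[RLIMIT_FSIZE].soft;
// saturating_add: pos + len must not wrap around before the comparison.
let end = (*pos as u64).saturating_add(buf.len() as u64);
let count = if rlimit_fsize != u64::MAX && end > rlimit_fsize {
if *pos as u64 >= rlimit_fsize {
signal_send(current_task(), SIGXFSZ);
return Err(IoError::new(Errno::EFBIG));
}
(rlimit_fsize - *pos as u64) as usize
} else {
buf.len()
};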
// Step 2: O_APPEND — atomically seek to end of file.
if file.f_flags.load(Relaxed) & O_APPEND != 0 {
*pos = inode.i_size.load(Acquire) as i64;
}
// Step 3: Page-by-page write loop.
let mut written: usize = 0;
while written < count {
let offset_in_page = (*pos as usize) % PAGE_SIZE;
let bytes = core::cmp::min(PAGE_SIZE - offset_in_page, count - written);
// 3a: Filesystem prepares page (alloc, journal, partial read-in).
let page = mapping.ops.write_begin(mapping, *pos as u64, bytes, 0)?;
// 3b: Copy user data from UserSlice into page.
// UserSlice::read_at() performs copy_from_user internally,
// handling SMAP/PAN page faults and returning bytes copied.
// This is the correct API for user → kernel page copies.
let page_kaddr = page_address(&page) + offset_in_page;
let bytes_copied = buf.read_at(written, page_kaddr, bytes);
// 3c: Filesystem commits write (dirty, journal, i_size update).
// WF-03 fix: use the return value — the filesystem may commit
// fewer bytes than copied (block-aligned partial commit).
let committed = mapping.ops.write_end(
mapping, *pos as u64, bytes, bytes_copied, page,
)?;
written += committed;
*pos += committed as i64;
if committed < bytes {
break; // Short write — stop.
}
// 3d: Dirty page throttling.
balance_dirty_pages_ratelimited(mapping);
}
// Step 4: Update timestamps. Sample the clock once so that mtime and
// ctime record the same instant.
let now = current_time_ns();
inode.i_mtime.store(now, Release);
inode.i_ctime.store(now, Release);
// Step 5: Return bytes written.
Ok(written)
}
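Step 1 of the write path (the RLIMIT_FSIZE check) reduces to a small pure function, sketched here so the boundary cases are explicit. clamp_to_fsize_limit, LimitOutcome, and RLIM_INFINITY are hypothetical names for this illustration; the sketch covers only the byte-count decision and leaves signal delivery to the caller.

```rust
/// Sentinel meaning "no file size limit" in this sketch.
pub const RLIM_INFINITY: u64 = u64::MAX;

#[derive(Debug, PartialEq)]
pub enum LimitOutcome {
    /// The write may proceed with this many bytes (possibly fewer than asked).
    Proceed(usize),
    /// No room for even one byte: the caller fails the write with EFBIG.
    FileTooBig,
}

/// Decide how many of `len` bytes may be written at `pos` under `limit`.
pub fn clamp_to_fsize_limit(pos: u64, len: usize, limit: u64) -> LimitOutcome {
    if limit == RLIM_INFINITY {
        return LimitOutcome::Proceed(len);
    }
    if pos >= limit {
        return LimitOutcome::FileTooBig;
    }
    // Bytes still available below the limit; cannot underflow since pos < limit.
    let room = limit - pos;
    let room = room.min(usize::MAX as u64) as usize;
    LimitOutcome::Proceed(len.min(room))
}
```

A write of 100 bytes at position 90 with a limit of 100 therefore proceeds as a 10-byte short write, while a write starting exactly at the limit is rejected outright.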
14.1.2.3.1.6 Inode (Index Node)¶
/// In-memory representation of a filesystem object (file, directory,
/// symlink, device, pipe, socket).
///
/// Each inode has a unique (superblock, inode_number) pair. The VFS
/// maintains an inode cache (icache) keyed by this pair to avoid
/// repeated disk reads.
///
/// **Lifecycle**: Created by `FileSystemOps::mount()` (root inode) or
/// `InodeOps::lookup()`/`InodeOps::create()` for other entries. Cached
/// in the icache. The in-memory inode becomes evictable once its last
/// reference (dentry or open file) is dropped; the on-disk inode is freed
/// as well only if the link count has dropped to zero (unlinked).
///
/// **Concurrency**: Inode metadata is protected by `i_lock` (spinlock).
/// File data is protected by `i_rwsem` (read-write semaphore) — readers
/// (read, readdir) take shared; writers (write, truncate) take exclusive.
// kernel-internal, not KABI — no const_assert (contains Arc, RwSemaphore, dyn traits).
#[repr(C)]
pub struct Inode {
/// Inode number. Unique within a superblock. Assigned by the filesystem.
pub i_ino: u64,
/// File type and permission mode (S_IFREG, S_IFDIR, etc. | rwxrwxrwx).
pub i_mode: u32,
/// Owner UID (kernel-internal representation, namespace-agnostic).
/// Permission checks compare against `from_kuid(mnt_userns, i_uid)` to
/// translate between the filesystem's user namespace and the calling
/// process's user namespace ([Section 17.1](17-containers.md#namespace-architecture)).
pub i_uid: u32,
/// Owner GID (kernel-internal representation, namespace-agnostic).
/// Permission checks use `from_kgid(mnt_userns, i_gid)` analogously.
pub i_gid: u32,
/// Hard link count. When this reaches 0 and no open file descriptors
/// remain, the inode is freed (both in-memory and on-disk).
pub i_nlink: AtomicU32,
/// File size in bytes. AtomicI64 for compatibility with Linux loff_t
/// semantics. For regular files and directories, i_size is always >= 0.
/// For symlinks: length of the target path. Updated under `i_rwsem`.
/// Consumers cast via `i_size.load(Acquire) as u64` after asserting
/// non-negative: `debug_assert!(self.i_size.load(Acquire) >= 0)`.
pub i_size: AtomicI64,
/// Timestamps (seconds + nanoseconds since epoch).
pub i_atime: Timespec,
pub i_mtime: Timespec,
pub i_ctime: Timespec,
/// Block size for this inode's filesystem (typically 4096).
pub i_blksize: u32,
/// Number of 512-byte blocks allocated on disk.
pub i_blocks: u64,
/// Device number (major:minor) for device special files (S_IFBLK/S_IFCHR).
/// Uses `DevId` type with Linux MKDEV encoding: `(major << 20) | minor`.
/// See [Section 14.5](#device-node-framework) for encoding details.
/// `DevId { raw: 0 }` for regular files.
pub i_rdev: DevId,
/// Generation number. Incremented when an inode is recycled (same i_ino
/// reused for a new file). Used by NFS file handles to detect stale handles.
/// Constrained to u32 by NFS file handle wire format (nfs_fh generation
/// field). At 10K inode recycles/sec, wraps after ~5 days, but
/// stale-handle collision requires matching SAME i_ino AND i_generation
/// (1-in-4B chance). NFS clients detect wrap mismatch via ESTALE.
/// Matches Linux i_generation behavior.
pub i_generation: u32,
/// Per-inode spinlock. Protects metadata updates (mode, uid, gid, timestamps,
/// nlink). Lock level: INODE_LOCK (level 15).
pub i_lock: SpinLock<(), INODE_LOCK>,
/// Read-write semaphore for file data. read()/readdir() take shared;
/// write()/truncate() take exclusive.
pub i_rwsem: RwSemaphore,
/// Superblock this inode belongs to.
pub i_sb: Arc<SuperBlock>,
/// Inode operations (lookup, create, link, unlink, etc.).
/// Set by the filesystem when the inode is created.
pub i_op: &'static dyn InodeOps,
/// File operations (read, write, mmap, ioctl, etc.).
/// Set by the filesystem; used when opening this inode as a file.
pub i_fop: &'static dyn FileOps,
/// Filesystem-private data. Opaque pointer used by the filesystem
/// driver to attach its own per-inode state (e.g., ext4_inode_info).
/// SAFETY: Set to a filesystem-specific type (e.g., *mut Ext4InodeInfo)
/// during inode initialization under I_NEW flag. The filesystem's
/// evict_inode() method must cast back to the original type and free.
/// Type safety is NOT enforced — callers must maintain the type
/// invariant. Set once during inode init, read-only thereafter.
/// Concurrent access is safe because i_private is immutable after
/// I_NEW is cleared.
///
/// `unsafe impl Send for Inode {}` — SAFETY: i_private is set once
/// during inode initialization (under I_NEW). After I_NEW is cleared,
/// i_private is read-only. All other fields of Inode are either atomic
/// or protected by documented locks.
/// `unsafe impl Sync for Inode {}` — same safety argument applies.
pub i_private: *mut (),
/// Per-inode LSM security blob. Allocated by `security_inode_alloc()`
/// during `iget()`; freed by `security_inode_free()` during
/// `evict_inode()`. See [Section 9.8](09-security.md#linux-security-module-framework) for
/// LsmBlob definition and lifecycle.
pub i_security: Option<NonNull<LsmBlob>>,
/// Page cache address space for this inode's data.
/// Contains the `PageCache` storage backend, writeback coordination,
/// and error tracking. See the `AddressSpace` struct defined above.
pub i_mapping: AddressSpace,
/// Reference count. Managed by dentry references and open file handles.
/// u32: bounded by concurrent references (max_files sysctl, default 8M).
/// Same rationale as Dentry.d_refcount — hot-path, ILP32 penalty.
pub i_refcount: AtomicU32,
/// RCU callback head for deferred inode slab free. The inode struct
/// cannot be freed immediately when the last reference is dropped
/// because concurrent RCU readers (path lookup via `rcu_walk`,
/// `find_inode()` under `rcu_read_lock()`) may still hold pointers
/// to this inode. Instead, eviction step 5 calls
/// `call_rcu(&inode.i_rcu, inode_free_rcu)` which defers the slab
/// free until all RCU readers have exited their critical sections.
/// `inode_free_rcu` calls `inode_slab.free(inode)` to return the
/// object to the inode slab cache.
pub i_rcu: RcuHead,
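/// Inode state bitflags (see `InodeStateFlags`): dirty bits, `I_NEW`,
/// `I_FREEING`, `I_HASHED`, and others. The dirty bits are set when the
/// inode has been modified in memory but not yet written to disk.
pub i_state: AtomicU32,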
// Inode cache membership is tracked in `i_state` (`I_HASHED` bit).
// Lookup is via the per-superblock XArray (`SuperBlock.inode_cache`).
// No hash-table linkage node: the XArray manages its own internal nodes.
/// Superblock inode list linkage (`SuperBlock.s_inodes`, protected by
/// `s_inode_list_lock`).
pub i_sb_list: IntrusiveListNode,
// ---- Writeback integration ----
/// Nanosecond timestamp when this inode was first dirtied (metadata or data).
/// Set by `mark_inode_dirty()` on the first dirty transition (I_DIRTY_*
/// flags going from 0 → non-zero). Used by the writeback subsystem to
/// implement `dirty_expire_centisecs`: inodes dirtied longer than the
/// threshold are prioritized for writeback. Zero when clean.
pub dirtied_when: AtomicU64,
/// Writeback association. Links this inode to a specific `BdiWriteback`
/// instance (backing device writeback context). Set when the inode is
/// first dirtied; cleared when writeback completes and the inode returns
/// to clean state. `None` for inodes on pseudo-filesystems (procfs,
/// sysfs) that have no backing device.
pub i_wb: Option<Arc<BdiWriteback>>,
/// BDI dirty inode list linkage. Links this inode into the per-BDI
/// dirty list (`BdiWriteback.b_dirty`, `b_io`, or `b_more_io`)
/// maintained by the writeback subsystem. The writeback thread walks
/// these lists to find inodes that need flushing. The node is unlinked
/// when the inode is no longer dirty.
pub wb_link: IntrusiveListNode,
}
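Two numeric claims in the field comments above can be checked with a short sketch: the `(major << 20) | minor` device-number layout and the wrap time of a 32-bit generation counter. The `DevId` below is a stand-in that assumes a `raw: u32` field; the helper names (`new`, `major`, `minor`, `generation_wrap_secs`) are hypothetical and not part of the specification.

```rust
/// Low bits reserved for the minor number, taken from the
/// `(major << 20) | minor` encoding quoted in the `i_rdev` comment.
const MINOR_BITS: u32 = 20;
const MINOR_MASK: u32 = (1 << MINOR_BITS) - 1;

/// Stand-in for the spec's `DevId`; the `u32` width is an assumption.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DevId {
    pub raw: u32,
}

impl DevId {
    /// Pack a major/minor pair. Returns `None` if either part does not fit
    /// (12 bits for major, 20 bits for minor in a 32-bit value).
    pub fn new(major: u32, minor: u32) -> Option<DevId> {
        if major >= (1 << (32 - MINOR_BITS)) || minor > MINOR_MASK {
            return None;
        }
        Some(DevId { raw: (major << MINOR_BITS) | minor })
    }

    /// Upper 12 bits.
    pub fn major(self) -> u32 {
        self.raw >> MINOR_BITS
    }

    /// Lower 20 bits.
    pub fn minor(self) -> u32 {
        self.raw & MINOR_MASK
    }
}

/// Whole seconds until a u32 generation counter wraps at the given recycle
/// rate. `recycles_per_sec` must be non-zero.
pub fn generation_wrap_secs(recycles_per_sec: u64) -> u64 {
    (1u64 << 32) / recycles_per_sec
}
```

At 10,000 recycles per second, `generation_wrap_secs` yields 429,496 seconds, which is just under five days and agrees with the figure in the `i_generation` comment.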
14.1.2.3.1.7 Inode Lifecycle and Page Cache Teardown¶
This subsection specifies the interaction between inode reference counting, page cache lifetime, and the eviction sequence. These paths are critical for correctness: a missed writeback silently loses data; a missed page free leaks memory; a race between eviction and page fault corrupts the page cache.
Inode state flags (i_state: AtomicU32):
/// Inode state bitflags stored in `Inode::i_state`.
///
/// Multiple flags may be set simultaneously. All updates use atomic
/// read-modify-write operations (CAS or fetch-OR) on `i_state`, so no
/// separate lock is required for flag manipulation; metadata fields
/// protected by `i_lock` must still be accessed under that lock.
pub mod InodeStateFlags {
/// Inode is newly allocated; filesystem has not yet filled its
/// on-disk fields. `unlock_new_inode()` clears this flag and
/// wakes waiters.
pub const I_NEW: u32 = 1 << 0;
/// Inode metadata (mode, uid, timestamps, etc.) is dirty. Set by
/// `mark_inode_dirty(I_DIRTY_SYNC)`.
pub const I_DIRTY_SYNC: u32 = 1 << 1;
/// Inode has dirty data-bearing metadata (file size, block
/// mappings) that `fdatasync` must flush. Set by
/// `mark_inode_dirty(I_DIRTY_DATASYNC)`.
pub const I_DIRTY_DATASYNC: u32 = 1 << 2;
/// Inode has dirty pages in its AddressSpace. Set when the first
/// page is dirtied; cleared when writeback drains all dirty pages.
pub const I_DIRTY_PAGES: u32 = 1 << 3;
/// Inode is being freed by `evict_inode()`. While set,
/// `mark_inode_dirty()` is a no-op (does not re-add the inode to
/// dirty lists), and `find_inode()` skips this inode (returns
/// `None`). Set atomically before eviction step 1; never cleared
/// (the inode is freed).
pub const I_FREEING: u32 = 1 << 4;
/// Writeback of dirty timestamp fields is pending but not yet
/// submitted. Used to coalesce frequent `atime` updates into a
/// single writeback pass. `fdatasync` skips metadata flush when
/// only this flag is set (timestamps are not data-relevant).
pub const I_DIRTY_TIME: u32 = 1 << 5;
/// Inode will be freed as soon as its reference count reaches
/// zero. Set when `i_nlink` drops to 0 while file descriptors
/// still hold references.
pub const I_WILL_FREE: u32 = 1 << 6;
/// Writeback in progress. Set by the writeback thread before issuing
/// I/O for this inode's pages, cleared on writeback completion.
/// Prevents concurrent writeback of the same inode (a second writeback
/// request skips the inode while this flag is set). Also prevents the
/// inode from being evicted from the inode cache during writeback.
pub const I_WRITEBACK: u32 = 1 << 7;
/// Inode is present in its superblock's inode cache XArray.
/// Set by `inode_cache_insert()`, cleared when removed from the XArray
/// (during eviction or `inode_cache_evict()`). Used by
/// debug assertions to verify cache consistency.
pub const I_HASHED: u32 = 1 << 8;
/// Page cache of this inode was corrupted during a driver crash
/// recovery cycle. Set by VFS ring crash recovery
/// ([Section 14.3](#vfs-per-cpu-ring-extension)) when coherence checks fail on
/// the inode's cached pages. While set, writeback skips this inode
/// (writing corrupt data to disk would propagate the corruption).
/// Applications reading from this inode receive `EIO`. Cleared
/// only by unmount + fsck + remount, or by inode eviction.
pub const I_PAGE_CACHE_CORRUPT: u32 = 1 << 9;
/// Combined mask: any data or metadata is dirty (not including timestamps).
pub const I_DIRTY: u32 = I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES;
/// Combined mask: any dirty state including timestamps.
pub const I_DIRTY_ALL: u32 = I_DIRTY | I_DIRTY_TIME;
}
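The flag rules above (lock-free updates, `mark_inode_dirty()` becoming a no-op once `I_FREEING` is set, and the 0 → non-zero dirty transition that starts the `dirtied_when` clock) can be sketched with a plain `AtomicU32`. This is a minimal illustration under assumptions: `mark_dirty`, `begin_freeing`, and `DirtyOutcome` are hypothetical names, and the real functions take an inode rather than a bare atomic.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

pub const I_DIRTY_SYNC: u32 = 1 << 1;
pub const I_DIRTY_DATASYNC: u32 = 1 << 2;
pub const I_DIRTY_PAGES: u32 = 1 << 3;
pub const I_FREEING: u32 = 1 << 4;
pub const I_DIRTY_TIME: u32 = 1 << 5;
pub const I_DIRTY_ALL: u32 =
    I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES | I_DIRTY_TIME;

/// Result of the hypothetical `mark_dirty` helper.
#[derive(Debug, PartialEq, Eq)]
pub enum DirtyOutcome {
    /// `I_FREEING` was set; the state was left untouched.
    Ignored,
    /// The inode went from clean to dirty (caller would stamp `dirtied_when`).
    FirstDirty,
    /// Some dirty bit was already set.
    AlreadyDirty,
}

/// Set dirty bits with a CAS loop so the `I_FREEING` check and the update
/// are a single atomic step.
pub fn mark_dirty(i_state: &AtomicU32, flags: u32) -> DirtyOutcome {
    debug_assert!(flags != 0 && flags & !I_DIRTY_ALL == 0);
    let mut cur = i_state.load(Ordering::Acquire);
    loop {
        if cur & I_FREEING != 0 {
            return DirtyOutcome::Ignored;
        }
        match i_state.compare_exchange_weak(
            cur,
            cur | flags,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => {
                return if cur & I_DIRTY_ALL == 0 {
                    DirtyOutcome::FirstDirty
                } else {
                    DirtyOutcome::AlreadyDirty
                };
            }
            // Another thread changed the state; retry with the fresh value.
            Err(observed) => cur = observed,
        }
    }
}

/// Eviction step 0: atomic OR. Returns true if this caller set the flag.
pub fn begin_freeing(i_state: &AtomicU32) -> bool {
    i_state.fetch_or(I_FREEING, Ordering::AcqRel) & I_FREEING == 0
}
```

The CAS loop matters only for operations that must test one bit while setting another; unconditionally setting a flag, as in `begin_freeing`, needs just a fetch-OR.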
Inode reference counting:
// Inode is reference-counted via Inode::i_refcount (AtomicU32).
//
// References are held by:
// 1. Dentry cache — each dentry pointing to this inode holds one ref.
// 2. Open file descriptors — each struct File holds one ref via its dentry.
// 3. Page cache (implicit) — the page cache does NOT hold an explicit
// i_refcount reference. Instead, the page cache is torn down as part
// of the eviction sequence (step 3) which runs only after all explicit
// references are released. Pages hold a Weak<Inode> back-pointer via
// AddressSpace::host.
// 4. Writeback thread — holds a temporary ref during flush (acquired
// when the inode is picked for writeback, released when writeback
// completes or the inode is skipped).
//
// Eviction eligibility:
// An inode is eligible for eviction when ALL of the following hold:
// (a) i_refcount == 0 (no dentry, fd, or writeback refs), AND
// (b) i_nlink == 0 (no on-disk hard links remain).
// If i_refcount == 0 but i_nlink > 0, the inode stays in the per-superblock
// inode cache and is placed on the LRU list for potential reuse.
// It is only freed (and its pages torn down) when memory pressure evicts
// it from the LRU, or when i_nlink later drops to 0 via unlink().
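The eligibility rule reduces to a three-way decision made when a reference is dropped. The sketch below encodes it as a pure function; `Disposition` and `disposition_after_put` are hypothetical names used only for illustration.

```rust
/// What happens to an inode after a reference has been released.
#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    /// Other references remain; nothing to do.
    StillReferenced,
    /// No references, but on-disk links remain: park on the LRU.
    ParkOnLru,
    /// No references and no links: run the eviction sequence.
    Evict,
}

/// Decide the inode's fate from the values observed after the decrement.
pub fn disposition_after_put(i_refcount: u32, i_nlink: u32) -> Disposition {
    if i_refcount > 0 {
        Disposition::StillReferenced
    } else if i_nlink > 0 {
        Disposition::ParkOnLru
    } else {
        Disposition::Evict
    }
}
```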
Eviction sequence (evict_inode()):
The eviction sequence is the sole path through which an inode and its page
cache pages are freed. It is called by iput_final() when the last reference
is released on an inode with i_nlink == 0, and by inode_cache_evict() when
memory pressure reclaims an unreferenced inode from the LRU.
evict_inode(inode):
Step 0: Set I_FREEING in inode.i_state (atomic OR).
From this point:
- mark_inode_dirty(inode) is a no-op.
- find_inode(sb, ino) skips this inode (returns None).
No new references can be acquired.
Step 1: Remove inode from the per-superblock XArray (icache).
sb.inode_cache.erase(inode.i_ino)
After this, no new lookups can find the inode. Concurrent
lookups that raced with step 0 will see I_FREEING and retry.
Step 2: If the inode has dirty pages or dirty metadata:
writeback_single_inode(inode, WB_SYNC_ALL)
This flushes ALL dirty pages via AddressSpaceOps::writepage()
and waits for every writeback I/O to complete (blocks until
inode.i_mapping.nrwriteback == 0).
If any writeback I/O fails, the error is recorded in
AddressSpace::wb_err (ErrSeq). The page is still freed in
step 3 — errors are recorded, not retried.
Step 3: truncate_inode_pages_final(&inode.i_mapping)
If page_cache is None (DAX inode), skip this step entirely —
there are no cached pages to tear down.
This is the page cache teardown:
a. Walk AddressSpace.page_cache (unwrapped XArray), removing all entries.
b. For each page:
- If PG_WRITEBACK is set: wait for I/O completion
(sleep on the page's wait queue; this should be rare
after step 2 drained writeback, but handles races with
async I/O completion).
- If PG_DIRTY is set: BUG in debug builds, since step 2 should
have flushed all dirty pages. In release builds, log a warning
and proceed (the page will be freed without writeback).
- Remove the page from the LRU list if present.
- Release the page frame back to the buddy allocator
(Section 4.2).
c. Set AddressSpace.page_cache.nr_pages = 0.
d. Set AddressSpace.page_cache.nr_dirty = 0.
Ordering: step 3 MUST NOT begin until step 2 has fully
completed (all writeback I/O acknowledged).
Step 4: Call the filesystem's evict_inode() for on-disk cleanup:
inode.i_op.evict_inode(inode.i_ino)
The filesystem driver:
- Frees disk blocks (updates block bitmap / extent tree).
- Removes the inode from its on-disk inode table.
- Commits a journal transaction if journaled.
For KABI Tier 1/2 drivers, this is dispatched as
VfsOpcode::EvictInode through the VFS ring buffer.
Step 5: Remove inode from superblock's inode list:
sb.s_inode_list_lock.lock()
sb.s_inodes.remove(&inode.i_sb_list)
sb.s_inode_list_lock.unlock()
Defer inode slab free via RCU:
call_rcu(&inode.i_rcu, inode_free_rcu)
where inode_free_rcu returns the slab object to the
inode slab cache. This is necessary because concurrent
RCU readers (rcu_walk path lookup, find_inode under
rcu_read_lock) may still hold pointers to this inode.
Direct slab free would cause use-after-free.
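The ordering constraints in the sequence above can be modelled as a plan of steps. The sketch below is not the eviction code itself: `EvictStep`, `EvictInput`, and `eviction_plan` are hypothetical names, and the model captures only which steps run and in what order (writeback only when dirty, page cache teardown skipped for DAX inodes).

```rust
/// One entry per step of the eviction sequence.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvictStep {
    SetFreeing,       // step 0
    RemoveFromIcache, // step 1
    SyncWriteback,    // step 2 (only if dirty)
    TruncatePages,    // step 3 (skipped for DAX inodes)
    FsEvict,          // step 4
    RcuFree,          // step 5: unlink from s_inodes, then call_rcu
}

/// The two inode properties that change which steps run.
pub struct EvictInput {
    pub dirty: bool,
    pub has_page_cache: bool,
}

/// Build the ordered list of steps for one inode.
pub fn eviction_plan(input: &EvictInput) -> Vec<EvictStep> {
    let mut plan = vec![EvictStep::SetFreeing, EvictStep::RemoveFromIcache];
    if input.dirty {
        // Must complete before any page is torn down.
        plan.push(EvictStep::SyncWriteback);
    }
    if input.has_page_cache {
        plan.push(EvictStep::TruncatePages);
    }
    plan.push(EvictStep::FsEvict);
    plan.push(EvictStep::RcuFree);
    plan
}
```

Because the plan is built in step order, any plan containing both `SyncWriteback` and `TruncatePages` lists writeback first, which is the invariant step 3 depends on.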
InodeOps::evict_inode() method (extends the InodeOps trait defined above):
/// Called during inode eviction (step 4) after the VFS has flushed
/// all dirty pages and torn down the page cache. The filesystem
/// must free on-disk resources (blocks, extent tree entries, inode
/// bitmap bit) and commit any necessary journal transactions.
///
/// `i_nlink` is guaranteed to be 0 when this is called for a real
/// eviction (unlinked file). For pseudo-filesystems (tmpfs, procfs)
/// that have no on-disk state, this method is a no-op.
///
/// Must not fail — on-disk cleanup is best-effort. If the journal
/// commit fails, the filesystem marks itself as needing fsck
/// (sets the error flag in the superblock) and returns.
fn evict_inode(&self, ino: InodeId);
VfsOpcode::EvictInode (extends the VfsOpcode enum):
/// `InodeOps::evict_inode`. Inode eviction — free on-disk resources.
/// Sent after the VFS has completed page cache teardown (step 3).
EvictInode = 38,
With a corresponding VfsRequestArgs variant:
/// `InodeOps::evict_inode`. No extra arguments — the inode number
/// is in `VfsRequest::ino`.
EvictInode {},
Truncate path (truncate_inode_pages_range(mapping, lstart, lend)):
Called by ftruncate(2) (shrinking a file), unlink(2) (via eviction),
fallocate(FALLOC_FL_PUNCH_HOLE), and fallocate(FALLOC_FL_COLLAPSE_RANGE).
This is a partial teardown — only pages in the specified range are removed,
unlike truncate_inode_pages_final() which removes all pages.
truncate_inode_pages_range(mapping, lstart, lend):
Precondition: caller holds inode.i_rwsem exclusive (write lock).
This prevents concurrent page faults, reads, and writes from
populating the range being truncated.
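1. Compute the range of pages wholly inside [lstart, lend]:
start_index = ceil(lstart / PAGE_SIZE)
end_index = (lend + 1) / PAGE_SIZE - 1 (or u64::MAX for "to end of file")
Pages only partially covered at either end stay in the cache and are
zeroed in steps 5 and 6. If end_index < start_index, no page is removed.
2. If mapping.page_cache is None (DAX inode), skip the rest of step 2 and
step 3: DAX files have no cached pages. Proceed directly to step 4
(on-disk block deallocation).
Otherwise, for each page in mapping.page_cache[start_index..=end_index]: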
a. Acquire page lock (set `PageFlags::LOCKED` via atomic CAS; sleep if
already locked by another thread — e.g., readahead).
b. If `PageFlags::DIRTY` is set:
cancel_dirty_page(page):
- Clear `PageFlags::DIRTY` on the page.
- Decrement mapping.page_cache.nr_dirty.
- Decrement the BDI (backing device info) dirty page counter.
- The page is NOT written back — truncated data is discarded.
c. If `PageFlags::WRITEBACK` is set:
wait_on_page_writeback(page):
- Sleep until the bio completion handler clears `PageFlags::WRITEBACK`.
- This handles the race where writeback was submitted before
truncate acquired i_rwsem but has not yet completed.
d. Invalidate shared futex keys on this page:
futex_key_invalidate(page.phys_frame())
([Section 19.4](19-sysapi.md#futex-and-userspace-synchronization--physical-page-stability-for-shared-futex-keys)).
Wakes all FUTEX_WAIT callers keyed on this physical frame
with -EINVAL, since the backing page is about to be freed.
e. Remove the page from mapping.page_cache (XArray delete).
f. Remove the page from the LRU list (if present).
g. Unlock the page (clear `PageFlags::LOCKED`).
h. Release the page frame reference. If this is the last
reference, the page is freed back to the buddy allocator
(Section 4.2). If another mapping holds a reference (e.g.,
a shared mmap), the page survives until that reference is
dropped.
3. Decrement mapping.page_cache.nr_pages by the count of removed
pages (atomic subtract).
4. Notify the filesystem for on-disk block deallocation:
inode.i_op.truncate_range(ino, lstart, lend)
The filesystem frees the corresponding disk blocks, updates
extent trees, and journals the change. For KABI Tier 1/2
drivers this is dispatched as the existing VfsOpcode::Truncate
with the range encoded in the size field.
5. If lstart is not page-aligned (partial page at the start of the
range): zero the tail of the partial page from lstart to the
next page boundary. The page remains in the cache with its
leading portion intact. Mark it dirty so the zeroed region is
written back.
6. If lend is not page-aligned and lend != u64::MAX (partial page
at the end): zero the head of the partial page from the page
start to lend. Mark it dirty. (This case arises only with
FALLOC_FL_PUNCH_HOLE; ftruncate always has lend = u64::MAX.)
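Steps 5 and 6 keep partially covered pages in the cache, so only pages lying wholly inside the byte range can be removed. The sketch below shows one way to split a range accordingly. It assumes `lend` is an inclusive byte offset, a 4096-byte page, and offsets far below `u64::MAX` (other than the "to end of file" sentinel); `RangeSplit` and `split_range` are hypothetical names.

```rust
pub const PAGE_SIZE: u64 = 4096;

/// How a byte range [lstart, lend] maps onto page cache work.
#[derive(Debug, PartialEq, Eq)]
pub struct RangeSplit {
    /// Inclusive page-index range to remove; `None` if no page is wholly covered.
    pub full_pages: Option<(u64, u64)>,
    /// Bytes to zero in a partial leading page, as (start, exclusive end).
    pub head_zero: Option<(u64, u64)>,
    /// Bytes to zero in a partial trailing page, as (start, exclusive end).
    pub tail_zero: Option<(u64, u64)>,
}

pub fn split_range(lstart: u64, lend: u64) -> RangeSplit {
    assert!(lstart <= lend, "inverted range");
    let to_eof = lend == u64::MAX;

    // First page that starts at or after lstart.
    let first_full = lstart / PAGE_SIZE + u64::from(lstart % PAGE_SIZE != 0);
    let full_pages = if to_eof {
        Some((first_full, u64::MAX))
    } else {
        // Exclusive index of the first page not wholly covered.
        let end_excl = (lend + 1) / PAGE_SIZE;
        if end_excl > first_full {
            Some((first_full, end_excl - 1))
        } else {
            None
        }
    };

    // Step 5: zero from lstart to the next page boundary (or to lend).
    let head_zero = if lstart % PAGE_SIZE != 0 {
        let boundary = lstart - lstart % PAGE_SIZE + PAGE_SIZE;
        let end = if to_eof { boundary } else { boundary.min(lend + 1) };
        Some((lstart, end))
    } else {
        None
    };

    // Step 6: zero from the trailing page's start through lend, unless the
    // head case already covered that same page.
    let tail_zero = if !to_eof && (lend + 1) % PAGE_SIZE != 0 {
        let page_start = lend - lend % PAGE_SIZE;
        if page_start >= lstart {
            Some((page_start, lend + 1))
        } else {
            None
        }
    } else {
        None
    };

    RangeSplit { full_pages, head_zero, tail_zero }
}
```

For a hole punched over bytes 5000 through 13000, this removes only page 2, zeroes bytes 5000..8192 of page 1, and zeroes bytes 12288..13001 of page 3.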
Dirty page handling before eviction:
Dirty pages are NEVER silently discarded during eviction. The eviction sequence guarantees data integrity through the following invariants:
- writeback_single_inode() (eviction step 2) is called with WB_SYNC_ALL, which means:
  - ALL dirty pages are submitted for writeback via AddressSpaceOps::writepage().
  - The caller blocks until every submitted bio has completed (waits on each page's PG_WRITEBACK flag).
- If the inode is already being written back by the periodic writeback thread, WB_SYNC_ALL waits for that in-progress writeback to finish, then re-scans for any pages dirtied in the interim.
- If writeback I/O fails (disk error, transport error):
  - The error code is recorded in AddressSpace::wb_err (ErrSeq counter) so that any concurrent fsync() on another fd for this inode will observe the error.
  - The AS_EIO or AS_ENOSPC flag is set in AddressSpace::flags.
  - The page is still freed in step 3; there is no retry loop. The data is lost, but the error is recorded. This matches POSIX semantics: a subsequent fsync() returns -EIO exactly once per file descriptor.
- A kernel log message is emitted at KERN_ERR level: "VFS: writeback error during eviction of inode {sb}:{ino}: {errno}".
- truncate_inode_pages_final() (eviction step 3) asserts that no dirty pages remain (PG_DIRTY must be clear after step 2). In debug builds, a dirty page at this point triggers a BUG (logic error in the writeback path). In release builds, it is logged as a warning and the page is freed without writeback.
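The "exactly once per file descriptor" behaviour can be illustrated with a sequence counter and a per-descriptor cursor. This is a deliberately simplified, single-threaded model and not the ErrSeq type itself: `WbErr`, `sample`, `record`, and `check_and_advance` are hypothetical names, and the real counter is updated atomically.

```rust
/// Simplified writeback error tracker: a sequence number that bumps on
/// every recorded error, plus the most recent errno.
pub struct WbErr {
    seq: u64,
    errno: i32,
}

impl WbErr {
    pub fn new() -> Self {
        WbErr { seq: 0, errno: 0 }
    }

    /// Called at open(): remember the current sequence as this fd's cursor.
    pub fn sample(&self) -> u64 {
        self.seq
    }

    /// Called when writeback I/O fails.
    pub fn record(&mut self, errno: i32) {
        self.seq += 1;
        self.errno = errno;
    }

    /// Called by fsync(): report an error if one was recorded since this
    /// fd's cursor, then advance the cursor so it is reported only once.
    pub fn check_and_advance(&self, cursor: &mut u64) -> Result<(), i32> {
        if *cursor == self.seq {
            Ok(())
        } else {
            *cursor = self.seq;
            Err(self.errno)
        }
    }
}
```

Each descriptor holds its own cursor, so two descriptors open on the same inode both see the failure once, and neither sees it a second time.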
Race prevention:
The eviction sequence must be safe against concurrent operations:
| Race scenario | Prevention mechanism |
|---|---|
| find_inode() during eviction | I_FREEING flag checked by find_inode(): it returns None, causing the caller to read a fresh inode from disk (or get ENOENT if unlinked). |
| mark_inode_dirty() during eviction | I_FREEING flag checked: mark_inode_dirty() is a no-op when I_FREEING is set. |
| Page fault on evicting inode | find_inode() returns None (inode removed from XArray in step 1). The fault handler reads a fresh inode from disk. If the file is unlinked (i_nlink == 0), the on-disk inode is already marked free and the read fails with ESTALE. |
| Writeback thread picks inode during eviction | The writeback thread checks I_FREEING before acquiring the inode ref. If the flag is set, the inode is skipped. If writeback is already in progress when I_FREEING is set, step 2 (WB_SYNC_ALL) waits for it to complete. |
| iget() racing with iput_final() | iget() loads the inode from sb.inode_cache XArray under RCU and increments i_refcount atomically. iput_final() only proceeds if the CAS i_refcount: 1 -> 0 succeeds. If iget() increments first, the CAS fails and eviction is aborted. |
| Truncate racing with readahead | truncate_inode_pages_range() holds i_rwsem exclusive. Readahead acquires i_rwsem shared. The rwsem serializes them. |
| Truncate racing with mmap read fault | The page fault handler acquires i_rwsem shared (via filemap_fault()). Truncate holds it exclusive. The fault retries after truncate completes and finds no page (returns SIGBUS if the fault address is beyond the new EOF). |
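The iget() versus iput_final() row relies on a single compare-and-swap deciding the winner. A minimal sketch of that handshake on a bare `AtomicU32` follows; `iget` and `try_iput_final` here are simplified stand-ins that model only the reference count, not the XArray lookup or the `I_FREEING` check.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Take a reference. Returns the new count.
pub fn iget(i_refcount: &AtomicU32) -> u32 {
    i_refcount.fetch_add(1, Ordering::AcqRel) + 1
}

/// Try to drop the last reference. Succeeds only if the count is exactly 1
/// at the moment of the CAS; if a concurrent iget() got in first, the CAS
/// fails, the count is left untouched, and eviction must not proceed.
pub fn try_iput_final(i_refcount: &AtomicU32) -> bool {
    i_refcount
        .compare_exchange(1, 0, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```

When `try_iput_final` fails, the caller still holds its own reference and releases it with an ordinary decrement instead of starting eviction.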
Cross-references:
- Writeback thread organization and writeback_single_inode(): Section 4.6
- Buddy allocator (page frame release): Section 4.2
- fsync end-to-end flow and ErrSeq semantics: this section (fsync / fdatasync End-to-End Flow)
- VFS ring buffer protocol (EvictInode dispatch): this section (VFS Ring Buffer Protocol)
- Page cache XArray structure: Section 4.4
- LRU lists and page reclaim: Section 4.2
- DLM-aware page cache invalidation on lock release/downgrade: Section 15.15
14.1.2.3.1.8 Inode Cache (icache)¶
All in-memory inodes are registered in a per-superblock inode cache (icache). The cache
serves two purposes: (1) deduplication — ensuring that only one Inode instance
exists for any given (superblock, inode_number) pair via per-superblock XArray lookup,
and (2) memory management — tracking unreferenced inodes on a global LRU list for
eviction under memory pressure.
/// Global inode cache LRU and shrinker state. Inode lookup is per-superblock
/// (via `SuperBlock.inode_cache: XArray<u64, Arc<Inode>>`), but the LRU list
/// for memory pressure eviction is global — the shrinker needs a single list
/// to scan across all filesystems.
///
/// **Design rationale**: Per-superblock XArray eliminates hash computation
/// (~15-25 cycles saved), provides O(1) guaranteed lookup (no collision
/// chains), and improves cache locality (per-filesystem working set stays
/// in its own radix tree). The caller always has the superblock from path
/// resolution (`dentry.d_sb`), so no extra lookup is needed.
///
/// **Singleton**: one global instance, initialized during VFS subsystem
/// init. Accessed via `inode_cache()` which returns `&'static InodeCache`.
pub struct InodeCache {
/// LRU list of unreferenced inodes (i_refcount == 0, i_nlink > 0).
///
/// Head = least recently used (oldest unreferenced inode).
/// Tail = most recently unreferenced inode.
///
/// An inode is added to the LRU tail when its `i_refcount` drops
/// to 0 (via `iput()`) and `i_nlink > 0` (still has on-disk links).
/// It is removed from the LRU when:
/// - A new reference is acquired (`iget()` / `find_inode()`).
/// - Memory pressure triggers LRU eviction.
/// - The inode is evicted due to `i_nlink` dropping to 0.
///
/// Protected by `lru_lock`. Lock ordering: `lru_lock` is acquired
/// AFTER any per-superblock XArray lock (never the reverse).
pub lru: SpinLock<IntrusiveList<Inode>>,
/// Number of inodes currently on the LRU list. Updated atomically
/// on LRU insert/remove. Used by the shrinker to estimate
/// reclaimable memory without taking `lru_lock`.
pub lru_count: AtomicU64,
/// High watermark — when `lru_count` exceeds this value, the
/// background reclaim kthread is woken to proactively evict cold
/// inodes. Set during VFS init based on total system memory:
/// reclaim_watermark = max(1024, total_pages / 256)
/// This keeps the LRU from growing unboundedly on large-memory
/// systems while ensuring small systems still cache a useful
/// number of inodes.
pub reclaim_watermark: u64,
}
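As a worked example of the watermark formula in the `reclaim_watermark` comment, the sketch below evaluates it for two memory sizes. The 4096-byte page size and the helper names are assumptions made for illustration.

```rust
/// Page size assumed for the worked example.
pub const PAGE_SIZE: u64 = 4096;

/// Number of whole pages in `bytes` of memory.
pub fn pages_for(bytes: u64) -> u64 {
    bytes / PAGE_SIZE
}

/// reclaim_watermark = max(1024, total_pages / 256)
pub fn reclaim_watermark(total_pages: u64) -> u64 {
    (total_pages / 256).max(1024)
}
```

A 16 GiB machine has 4,194,304 pages, giving a watermark of 16,384 inodes. A 512 MiB machine has 131,072 pages; 131,072 / 256 is 512, so the floor of 1024 applies.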
Inode cache operations:
/// Look up an inode in the per-superblock XArray by inode number.
///
/// Uses RCU read-side protection: the lookup itself takes no locks. On a
/// hit, the inode's reference count is incremented atomically and the
/// returned `Arc<Inode>` owns that reference.
///
/// Returns `None` if:
/// - No inode with this ino exists in the superblock's inode cache.
/// - The inode has `I_FREEING` set (eviction in progress) — the
/// caller should re-read from disk or retry.
///
/// **Hot path**: called on every `open()`, `stat()`, and path lookup
/// that misses the dentry cache. O(1) XArray lookup with no hash
/// computation and no collision chains.
pub fn inode_cache_lookup(sb: &SuperBlock, ino: u64) -> Option<Arc<Inode>> {
// rcu_read_lock() (implicit in XArray::load)
// sb.inode_cache.load(ino)
// → if found and !(i_state & I_FREEING): increment i_refcount, return Some
// → if found and (i_state & I_FREEING): return None (skip dying inode)
// → if not found: return None
// If the inode was on the LRU (i_refcount was 0), remove it from
// the global LRU list (acquire lru_lock, unlink, decrement lru_count).
// rcu_read_unlock() (implicit)
}
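A self-contained approximation of the lookup path is sketched below. It is not the actual implementation: a `HashMap` stands in for the per-superblock XArray, RCU is not modelled, LRU removal is omitted, and the `Inode` and `SuperBlock` types are reduced to the fields the lookup touches. The reference is taken before `I_FREEING` is re-checked, following the order the eviction concurrency note gives for `iget()`.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

pub const I_FREEING: u32 = 1 << 4;

/// Reduced stand-in for the spec's `Inode`.
pub struct Inode {
    pub i_ino: u64,
    pub i_state: AtomicU32,
    pub i_refcount: AtomicU32,
}

/// Reduced stand-in for the spec's `SuperBlock`; the map replaces the XArray.
pub struct SuperBlock {
    pub inode_cache: HashMap<u64, Arc<Inode>>,
}

pub fn inode_cache_lookup(sb: &SuperBlock, ino: u64) -> Option<Arc<Inode>> {
    // Miss: no entry for this inode number.
    let inode = sb.inode_cache.get(&ino)?;

    // Take the reference first, then check for a dying inode.
    inode.i_refcount.fetch_add(1, Ordering::AcqRel);
    if inode.i_state.load(Ordering::Acquire) & I_FREEING != 0 {
        // Eviction is in progress: give the reference back and report a miss.
        inode.i_refcount.fetch_sub(1, Ordering::AcqRel);
        return None;
    }
    Some(Arc::clone(inode))
}
```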
/// Insert an inode into its superblock's inode cache.
///
/// Called after a filesystem driver has allocated and filled a new inode
/// (from `InodeOps::lookup()`, `InodeOps::create()`, or `read_inode()`).
/// The inode must have `I_NEW` set in `i_state` — this flag is cleared
/// by `unlock_new_inode()` after the filesystem finishes initialization.
///
/// **Preconditions**:
/// - `inode.i_ino` and `inode.i_sb` are set (valid superblock + ino).
/// - `inode.i_refcount >= 1` (the caller holds a reference).
/// - No existing entry for `inode.i_ino` in `inode.i_sb.inode_cache`
/// — callers must check `inode_cache_lookup()` first.
///
/// **Panics** in debug builds if a duplicate entry exists (logic error
/// in the filesystem driver). In release builds, the existing entry is
/// kept and the new inode is dropped.
pub fn inode_cache_insert(inode: Arc<Inode>) {
// inode.i_sb.inode_cache.store(inode.i_ino, inode.clone())
// Set i_state |= I_HASHED (indicates presence in icache).
}
/// Evict unreferenced inodes from the LRU list to reclaim memory.
///
/// Called by the memory shrinker when the slab allocator or page reclaim
/// needs to free memory. Evicts up to `count` inodes from the LRU head
/// (least recently used first).
///
/// For each inode taken from the LRU head:
/// 1. If the inode has dirty pages or metadata: call
/// `writeback_single_inode(inode, WB_SYNC_NONE)` (best-effort
/// writeback), move the inode to the LRU tail, and skip it. The
/// periodic writeback thread cleans it before it becomes eligible
/// for eviction again.
/// 2. Otherwise, remove it from the LRU list (under `lru_lock`).
/// 3. Set `I_FREEING` in `i_state` (atomic OR).
/// 4. Remove it from the superblock's XArray (`sb.inode_cache.erase(ino)`).
/// 5. Call `evict_inode()` for page cache teardown and on-disk cleanup.
/// 6. Defer inode slab free via `call_rcu(&inode.i_rcu, inode_free_rcu)`
/// (RCU readers may still hold pointers; see `Inode::i_rcu` field).
///
/// **Concurrency**: `iget()` racing with eviction is safe — `iget()`
/// checks `I_FREEING` after incrementing `i_refcount`. If `I_FREEING`
/// is set, `iget()` decrements `i_refcount` back and returns `None`.
///
/// **Scope**: `sb` constrains eviction to a single superblock. Pass
/// `None` to evict across all superblocks (global memory pressure).
pub fn inode_cache_evict(sb: Option<&SuperBlock>, count: usize) -> usize {
// Returns the number of inodes actually evicted (may be < count
// if fewer reclaimable inodes exist or dirty inodes are skipped).
}
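The scan policy described for `inode_cache_evict()` (take from the LRU head, rotate dirty inodes to the tail, evict clean ones) can be sketched with a `VecDeque`. This is an illustrative model only: `LruEntry` and `lru_scan` are hypothetical, locking and the actual eviction call are omitted, and `count` is treated as a bound on entries examined so that rotated dirty entries are not visited twice in one pass.

```rust
use std::collections::VecDeque;

/// Minimal LRU entry: inode number plus whether it is dirty.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LruEntry {
    pub ino: u64,
    pub dirty: bool,
}

/// Examine up to `count` entries from the LRU head. Clean entries are
/// removed and their inode numbers returned; dirty entries are moved to
/// the tail and left for the periodic writeback thread.
pub fn lru_scan(lru: &mut VecDeque<LruEntry>, count: usize) -> Vec<u64> {
    let mut evicted = Vec::new();
    let budget = count.min(lru.len());
    for _ in 0..budget {
        if let Some(entry) = lru.pop_front() {
            if entry.dirty {
                lru.push_back(entry);
            } else {
                evicted.push(entry.ino);
            }
        }
    }
    evicted
}
```

The return value can be smaller than `count`, which matches the note that fewer inodes may be evicted when dirty inodes are skipped.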
Shrinker integration:
The inode cache registers a memory shrinker callback during VFS init. The shrinker is invoked by the page reclaim path (Section 4.2) when free memory falls below the low watermark.
/// Inode cache shrinker. Registered as a global shrinker during VFS init.
///
/// `count_objects()`: returns `inode_cache().lru_count` (fast — no lock).
/// `scan_objects()`: calls `inode_cache_evict(None, nr_to_scan)`.
///
/// Priority: shrinker priority is set to `SHRINKER_DEFAULT_PRIORITY` (0).
/// The inode cache shrinker runs alongside the dentry cache shrinker and
/// slab shrinkers. The reclaim path distributes scan pressure across all
/// registered shrinkers proportionally to their `count_objects()` return
/// value — larger caches receive more scan pressure.
pub static INODE_CACHE_SHRINKER: Shrinker = Shrinker {
count_objects: inode_cache_count,
scan_objects: inode_cache_scan,
seeks: DEFAULT_SEEKS, // 2 — moderate cost to recreate (disk read)
flags: 0,
};
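The statement that scan pressure is distributed in proportion to each shrinker's `count_objects()` value can be illustrated with a purely proportional split. The exact formula used by the reclaim path is not given here, so the function below is an assumption: `distribute_scan` is a hypothetical name, and rounding is simply truncated.

```rust
/// Split `total_scan` objects across shrinkers in proportion to the object
/// counts they reported. Results are rounded down, so the shares may sum
/// to slightly less than `total_scan`.
pub fn distribute_scan(total_scan: u64, counts: &[u64]) -> Vec<u64> {
    // Widen to u128 so the multiplication cannot overflow.
    let sum: u128 = counts.iter().map(|&c| c as u128).sum();
    if sum == 0 {
        return vec![0; counts.len()];
    }
    counts
        .iter()
        .map(|&c| ((total_scan as u128 * c as u128) / sum) as u64)
        .collect()
}
```

With 3000 inodes and 1000 dentries reported and a budget of 1000 objects, the inode cache would be asked to scan 750 and the dentry cache 250.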
Invariants:
- Every in-memory Inode with I_HASHED set is present in exactly one
SuperBlock.inode_cache XArray entry. Removing from the XArray clears I_HASHED.
- An inode is on the LRU list if and only if i_refcount == 0 AND
i_nlink > 0 AND I_FREEING is not set.
- lru_count always equals the number of nodes in lru (maintained
atomically — incremented on LRU insert, decremented on LRU remove).
- Lock ordering: XArray internal lock -> lru_lock. Never acquire an
XArray lock while holding lru_lock.
Cross-references:
- Inode struct and lifecycle: this section (Inode above)
- Eviction sequence: this section (Inode Lifecycle and Page Cache Teardown above)
- Page reclaim and shrinker framework: Section 4.2
- Dentry cache (parallel deduplication cache for path components): this section (Dentry above)
- Crash recovery inode iteration: this section (Dirty Page Handling on VFS Crash above)
14.1.2.3.1.9 SuperBlock¶
/// In-memory representation of a mounted filesystem.
///
/// Each mount creates one SuperBlock instance. The superblock holds
/// filesystem-level metadata (block size, feature flags, root inode)
/// and provides the interface between the VFS and the filesystem driver.
///
/// **Lifecycle**: Created by `FileSystemOps::mount()`. Destroyed by
/// `FileSystemOps::unmount()` after all references are released.
pub struct SuperBlock {
/// Filesystem type identifier (e.g., "ext4", "xfs", "tmpfs").
pub s_type: &'static str,
/// Block size in bytes (typically 1024, 2048, or 4096).
pub s_blocksize: u32,
/// Log2 of block size (for bit-shift division).
pub s_blocksize_bits: u8,
/// Maximum file size supported by this filesystem.
pub s_maxbytes: i64,
/// Root dentry of the mounted filesystem.
pub s_root: Arc<Dentry>,
/// Filesystem operations (mount, unmount, statfs, sync).
pub s_op: &'static dyn FileSystemOps,
/// Mount flags (MS_RDONLY, MS_NOSUID, MS_NODEV, MS_DSM_COOPERATIVE, etc.).
///
/// **`MS_DSM_COOPERATIVE` (bit 30)**: Set at mount time to indicate this
/// filesystem participates in DSM cooperative caching. When set, the VFS
/// page cache miss path (`filemap_get_pages`) uses the three-stage
/// speculative protocol ([Section 6.11](06-dsm.md#dsm-distributed-page-cache)): (1) Bloom
/// filter fast rejection (~15ns), (2) sequential access skip, (3)
/// speculative parallel RDMA + NVMe for random misses. Zero overhead
/// for ~95% of misses; ~5-10μs speedup for the rest. Enables
/// cluster-wide page cache sharing for distributed filesystems.
/// Only meaningful when DSM is enabled ([Section 6.11](06-dsm.md#dsm-distributed-page-cache)).
/// Filesystems that do not set this flag use the standard local-only
/// page cache lookup (no DSM overhead on the read path).
/// AtomicU32: all defined flags fit in bits 0-30, leaving bit 31 free.
/// If more than 32 flags are needed, widen to AtomicU64.
pub s_flags: AtomicU32,
/// Filesystem-specific data. Opaque pointer used by the filesystem
/// driver to attach its own per-superblock state (e.g., ext4_sb_info,
/// xfs_mount).
/// SAFETY: Set to a filesystem-specific type during mount(). The
/// filesystem's kill_sb()/unmount() must cast back and free. Valid
/// for the lifetime of the superblock. Set once during mount,
/// read-only thereafter.
pub s_fs_info: *mut (),
/// UUID of the filesystem (if supported). Used for persistent mount
/// identification and `/proc/mounts` output.
pub s_uuid: [u8; 16],
/// Per-superblock inode cache. Provides O(1) lookup by inode number
/// via XArray (radix tree). RCU-protected reads on the hot path.
///
/// Per-superblock XArray eliminates hash computation (~15-25 cycles
/// saved vs. global RcuHashMap), provides O(1) guaranteed lookup
/// (no collision chains), and improves cache locality (per-filesystem
/// working set stays in its own radix tree). The caller always has the
/// superblock from path resolution (`dentry.d_sb`), so no extra lookup
/// is needed.
/// On 32-bit architectures (ARMv7, PPC32), XArray stores u64 keys via
/// a synthetic two-level index. Performance is O(1) for keys <= u32::MAX
/// and O(log64(N)) for larger keys.
pub inode_cache: XArray<u64, Arc<Inode>>,
/// List of all inodes belonging to this superblock.
/// Protected by `s_inode_list_lock`.
pub s_inodes: IntrusiveList<Inode>,
// Dirty inode tracking is handled exclusively by BdiWriteback.b_dirty
// (see [Section 4.6](04-memory.md#writeback-subsystem--bdiwriteback)). No per-superblock dirty list.
/// Per-superblock lock for inode list management.
pub s_inode_list_lock: SpinLock<()>,
/// Block device backing this filesystem (None for pseudo-filesystems
/// like tmpfs, procfs, sysfs).
pub s_bdev: Option<Arc<BlockDevice>>,
/// Backing device info — controls writeback rate limiting, readahead
/// window, and per-device dirty page accounting. Set during mount:
///
/// - **Disk-backed filesystems** (ext4, XFS, Btrfs): points to the
/// `BackingDevInfo` owned by the underlying `BlockDevice`. The BDI
/// is shared if multiple mounts use the same block device (e.g.,
/// bind mounts). The writeback thread
/// ([Section 4.6](04-memory.md#writeback-subsystem--writeback-thread-organization)) uses `s_bdi` to
/// locate the inode dirty lists (`BdiWriteback.b_dirty`) and to
/// enforce per-device dirty page throttling via
/// `balance_dirty_pages()`.
///
/// - **Network filesystems** (NFS, CIFS, 9P): allocate a dedicated
/// `BackingDevInfo` per superblock during mount. The BDI's `bdev`
/// field is `None`; writeback goes through the filesystem's own
/// network I/O path rather than the block layer.
///
/// - **Pseudo-filesystems** (tmpfs, procfs, sysfs, devtmpfs): `None`.
/// These filesystems have no backing store and never produce dirty
/// pages that require writeback. The VFS skips all writeback and
/// dirty-throttling code paths when `s_bdi` is `None`.
///
/// **Writeback chain**: inode → `i_sb` (`SuperBlock`) → `s_bdi`
/// (`BackingDevInfo`) → `wb` (`BdiWriteback`). This chain is how the
/// writeback subsystem discovers which device an inode's dirty pages
/// should be flushed to. Breaking this chain (e.g., a disk-backed
/// filesystem with `s_bdi = None`) would silently prevent writeback
/// and leak dirty pages indefinitely.
///
/// **Cross-reference**: `BackingDevInfo` struct definition and
/// `BdiWriteback` internals are in [Section 4.6](04-memory.md#writeback-subsystem--writeback-thread-organization).
pub s_bdi: Option<Arc<BackingDevInfo>>,
/// Reference count. Held by Mount nodes and open file handles.
pub s_refcount: AtomicU32,
/// Freeze count. >0 means filesystem is frozen (FIFREEZE).
pub s_freeze_count: AtomicU32,
/// Per-superblock writer tracking for the freeze state machine.
/// Implements the `SB_FREEZE_WRITE -> SB_FREEZE_PAGEFAULT -> SB_FREEZE_FS
/// -> SB_FREEZE_COMPLETE` progression used by FIFREEZE/FITHAW ioctls and
/// `do_remount()`.
pub s_writers: SbWriters,
/// Error handling behavior for this filesystem. Set at mount time from
/// the `errors=` mount option (e.g., `errors=remount-ro`). Defaults to
/// `FsErrorMode::Continue` unless the filesystem specifies otherwise.
/// Consulted by the VFS error path when a filesystem reports an I/O or
/// metadata corruption error.
pub s_error_behavior: FsErrorMode,
/// Per-mount VFS ring set for Tier 1 filesystem driver communication.
/// Contains N ring pairs (request + response), where N is negotiated
/// at mount time. `None` for pseudo-filesystems (tmpfs, procfs, sysfs)
/// that run in Tier 0 and do not use ring-based dispatch.
///
/// The ring_set is allocated at mount time and persists across driver
/// crashes (rings are drained and reset, not recreated). The replacement
/// driver re-binds to the existing ring_set during crash recovery
/// Step U14 ([Section 14.3](#vfs-per-cpu-ring-extension--crash-recovery)).
pub ring_set: Option<Box<VfsRingSet>>,
/// Driver generation counter. Incremented each time the filesystem
/// driver is (re)loaded after a crash (Step U16 of the unified VFS
/// crash recovery sequence in [Section 14.3](#vfs-per-cpu-ring-extension--crash-recovery)).
/// Initialized to 0 at first mount.
///
/// Used for:
/// - **Stale response detection**: VFS response consumer checks
/// `response.driver_generation == sb.driver_generation.load(Acquire)`
/// before processing any completion. Mismatched responses (from the
/// pre-crash driver instance) are discarded.
/// - **Stale file handle detection**: `OpenFile.open_generation` is
/// compared against this field; mismatch returns `ENOTCONN`.
///
/// One counter per superblock (not per ring) — all rings on a mount
/// share the same generation. Persisted in Core (Tier 0) memory, so
/// it survives Tier 1 driver crashes.
///
/// **Longevity**: u64 at crash-per-minute rate (extreme) wraps after
/// ~35 trillion years. No wrap handling needed.
pub driver_generation: AtomicU64,
}
/// Freeze level for the superblock writer tracking state machine.
///
/// The freeze process advances through levels in order:
/// `Unfrozen -> Write -> PageFault -> Fs -> Complete`.
/// Each level blocks a broader category of operations. Thaw reverses
/// the progression. Used by FIFREEZE/FITHAW ioctls and `do_remount()`.
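#[repr(u8)]
// `Copy` is required: `sb_start_write()` casts `level` several times and
// then stores it in the guard; `Drop` casts it through `&mut self`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]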
pub enum SbFreezeLevel {
/// Not frozen. All operations permitted.
Unfrozen = 0,
/// Block new writes (`write`, `truncate`, `fallocate`).
Write = 1,
/// Block page faults (prevent new page-cache population via mmap writes).
PageFault = 2,
/// Block filesystem operations (metadata updates, journal commits).
Fs = 3,
/// Fully frozen. No filesystem activity. Safe for snapshots.
Complete = 4,
}
/// Per-superblock writer tracking for the freeze state machine.
///
/// The `frozen` field records the current freeze level. The `writers`
/// array tracks the number of active writers at each of the three
/// blockable levels (Write, PageFault, Fs). Each counter uses
/// `PerCpuCounter` for scalable per-CPU increment/decrement on the
/// write path, with a global `sum()` used only during freeze/thaw
/// transitions to wait for all writers to drain.
///
/// **Freeze protocol**:
/// 1. Set `frozen` to the target level (e.g., `SbFreezeLevel::Write`).
/// 2. Wait for `writers[level - 1].sum() == 0` (all active writers at
/// that level have completed).
/// 3. Advance to the next level. Repeat until `Complete`.
///
/// **Thaw protocol**: Set `frozen` back to `Unfrozen` and wake all
/// waiters blocked by the freeze.
pub struct SbWriters {
/// Current freeze level. Read with `Acquire`, written with `Release`.
pub frozen: AtomicU8,
/// Per-level writer counts: `[0]` = Write level, `[1]` = PageFault
/// level, `[2]` = Fs level. `PerCpuCounter` for scalable hot-path
/// increment (no cross-CPU contention on the write path).
pub writers: [PerCpuCounter; 3],
/// WaitQueue for threads blocked by a freeze. Woken on thaw.
pub freeze_wait: WaitQueue,
}
Writer entry/exit protocol (sb_start_write / sb_end_write):
Every VFS operation that modifies the filesystem must bracket its work
with the sb_start_write()/sb_end_write() protocol. This allows the
freeze state machine to wait for all in-flight writers to drain before
advancing to the next freeze level.
/// Attempt to enter the filesystem for a write-class operation.
/// Returns a guard that decrements the writer count on drop.
///
/// If the filesystem is frozen at or beyond the requested level,
/// the caller blocks on `sb.s_writers.freeze_wait` until thaw.
///
/// **Interruptibility**: If the calling task receives a fatal signal
/// while blocked, `sb_start_write` returns `Err(EINTR)`. The VFS
/// write path converts this to `-EINTR` for the syscall.
///
/// # Arguments
/// * `sb` — The superblock of the filesystem being written to.
/// * `level` — The freeze level this operation belongs to:
/// - `SbFreezeLevel::Write` (1): data writes (`write`, `truncate`, `fallocate`).
/// - `SbFreezeLevel::PageFault` (2): page fault writes (mmap dirty page).
/// - `SbFreezeLevel::Fs` (3): filesystem metadata updates (journal commit).
///
/// # Hot path
/// The common case (filesystem not frozen) is a single `Acquire` load
/// on `frozen` + one `PerCpuCounter::inc()` (~5-10 cycles total, no
/// contention). The slow path (freeze in progress) blocks.
pub fn sb_start_write(sb: &SuperBlock, level: SbFreezeLevel) -> Result<SbWriteGuard, Errno> {
    // Only Write, PageFault, and Fs have a writer counter. Unfrozen would
    // underflow the index below and Complete would be out of bounds.
    debug_assert!(matches!(
        level,
        SbFreezeLevel::Write | SbFreezeLevel::PageFault | SbFreezeLevel::Fs
    ));
    let idx = (level as u8 - 1) as usize;
    loop {
        let frozen = sb.s_writers.frozen.load(Ordering::Acquire);
        if frozen >= level as u8 {
            // Filesystem is frozen at or beyond our level. Block.
            sb.s_writers.freeze_wait.wait_interruptible()?;
            continue;
        }
        sb.s_writers.writers[idx].inc();
        // Re-check after increment (the freeze may have advanced between
        // our check and increment, the same race as Linux's percpu_rwsem).
        let frozen_after = sb.s_writers.frozen.load(Ordering::Acquire);
        if frozen_after >= level as u8 {
            // Undo the increment so the freezer can observe a drained counter.
            sb.s_writers.writers[idx].dec();
            sb.s_writers.freeze_wait.wait_interruptible()?;
            continue;
        }
        return Ok(SbWriteGuard { sb, level });
    }
}
/// RAII guard that decrements the writer count on drop.
pub struct SbWriteGuard<'a> {
sb: &'a SuperBlock,
level: SbFreezeLevel,
}
impl Drop for SbWriteGuard<'_> {
fn drop(&mut self) {
self.sb.s_writers.writers[(self.level as u8 - 1) as usize].dec();
}
}
/// Freeze a filesystem, advancing it through every level to `Complete`.
///
/// Called by `FIFREEZE` ioctl and `do_remount()` (remount read-only).
/// Advances through freeze levels sequentially:
/// `Write -> PageFault -> Fs -> Complete`.
///
/// At each level:
/// 1. Set `frozen` to the target level (Release store).
/// 2. Wait for `writers[level-1].sum() == 0` (all writers at that level drained).
///
/// After reaching `Complete`, the filesystem is fully quiesced: no pending
/// writes, no page faults, no metadata updates. Safe for LVM snapshots,
/// device-mapper operations, and backup tools.
///
/// # Errors
/// Returns `Err(EBUSY)` if the filesystem is already frozen.
/// Returns `Err(EINTR)` if the wait is interrupted by a fatal signal.
pub fn freeze_super(sb: &SuperBlock) -> Result<(), Errno>;
/// Thaw a frozen filesystem.
///
/// Called by `FITHAW` ioctl. Sets `frozen` back to `Unfrozen` and
/// wakes all threads blocked on `freeze_wait`.
pub fn thaw_super(sb: &SuperBlock) -> Result<(), Errno>;
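The freeze and thaw protocols above can be sketched as a small user-space model. This is an illustration under stated assumptions, not the kernel implementation: `Level`, `Writers`, `WriteGuard`, `FreezeError`, and `try_start_write` are hypothetical names, plain `AtomicU64` counters stand in for `PerCpuCounter`, the writer entry point returns `None` instead of sleeping on `freeze_wait`, and the freezer spins rather than blocking. Sequentially consistent ordering is used throughout the model for simplicity.

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

/// Freeze levels, mirroring `SbFreezeLevel` (model type).
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Level { Unfrozen = 0, Write = 1, PageFault = 2, Fs = 3, Complete = 4 }

#[derive(Debug, PartialEq, Eq)]
pub enum FreezeError { Busy }

/// Stand-in for `SbWriters`; plain atomics replace `PerCpuCounter`.
pub struct Writers {
    pub frozen: AtomicU8,
    pub writers: [AtomicU64; 3],
}

/// RAII guard: decrements the writer count for its level on drop.
pub struct WriteGuard<'a> { owner: &'a Writers, idx: usize }

impl Drop for WriteGuard<'_> {
    fn drop(&mut self) {
        self.owner.writers[self.idx].fetch_sub(1, Ordering::SeqCst);
    }
}

impl Writers {
    pub fn new() -> Self {
        Writers {
            frozen: AtomicU8::new(Level::Unfrozen as u8),
            writers: [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)],
        }
    }

    /// Non-blocking model of `sb_start_write`: `None` means "would block".
    pub fn try_start_write(&self, level: Level) -> Option<WriteGuard<'_>> {
        let idx = match level {
            Level::Write => 0,
            Level::PageFault => 1,
            Level::Fs => 2,
            _ => return None, // Unfrozen/Complete have no writer counter
        };
        if self.frozen.load(Ordering::SeqCst) >= level as u8 {
            return None;
        }
        self.writers[idx].fetch_add(1, Ordering::SeqCst);
        // Re-check: a freeze may have started between the load and the increment.
        if self.frozen.load(Ordering::SeqCst) >= level as u8 {
            self.writers[idx].fetch_sub(1, Ordering::SeqCst);
            return None;
        }
        Some(WriteGuard { owner: self, idx })
    }

    /// Advance Write -> PageFault -> Fs -> Complete, draining each level.
    pub fn freeze(&self) -> Result<(), FreezeError> {
        let claimed = self.frozen.compare_exchange(
            Level::Unfrozen as u8, Level::Write as u8,
            Ordering::SeqCst, Ordering::SeqCst);
        if claimed.is_err() {
            return Err(FreezeError::Busy); // already frozen or freezing
        }
        for level in [Level::Write, Level::PageFault, Level::Fs] {
            self.frozen.store(level as u8, Ordering::SeqCst);
            let idx = (level as u8 - 1) as usize;
            while self.writers[idx].load(Ordering::SeqCst) != 0 {
                std::thread::yield_now(); // the kernel would sleep here
            }
        }
        self.frozen.store(Level::Complete as u8, Ordering::SeqCst);
        Ok(())
    }

    /// Return to `Unfrozen`; the kernel would also wake `freeze_wait`.
    pub fn thaw(&self) {
        self.frozen.store(Level::Unfrozen as u8, Ordering::SeqCst);
    }
}
```

In this model a caller holds the guard for the duration of the modifying operation; dropping it is what lets a concurrent `freeze()` make progress past that level.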
Freeze/thaw interaction with VFS ring protocol: When a filesystem is frozen
at SbFreezeLevel::Write or beyond, the VFS ring dispatch path
(Section 14.2) returns -EROFS for write-class operations
(Write, Truncate, Fallocate, Create, Mkdir, Symlink, Link,
Unlink, Rmdir, Rename, Mknod, SetAttr, SetXattr, RemoveXattr)
without enqueuing them on the ring. Read-class operations (Read, Lookup,
Getattr, Readdir, ReadPage, Readahead) continue to function during
freeze. The Freeze and Thaw opcodes in the ring protocol
(Section 14.2) are used to notify the filesystem driver to
quiesce/resume its own internal state (journal, allocator).
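The gating rule above can be sketched as a classification function plus a dispatch check. This is a minimal sketch: the `VfsOp` enum, `is_write_class`, and `gate_dispatch` are hypothetical names, the variant list is taken from the opcodes named in the paragraph, and `EROFS` uses the Linux errno value 30.

```rust
/// Opcodes named in the text above (model enum; the real ring opcode type
/// is defined in Section 14.2).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VfsOp {
    Write, Truncate, Fallocate, Create, Mkdir, Symlink, Link, Unlink, Rmdir,
    Rename, Mknod, SetAttr, SetXattr, RemoveXattr,
    Read, Lookup, Getattr, Readdir, ReadPage, Readahead,
    Freeze, Thaw,
}

pub const EROFS: i32 = 30;

/// True for operations that modify the filesystem.
pub fn is_write_class(op: VfsOp) -> bool {
    matches!(
        op,
        VfsOp::Write | VfsOp::Truncate | VfsOp::Fallocate | VfsOp::Create
            | VfsOp::Mkdir | VfsOp::Symlink | VfsOp::Link | VfsOp::Unlink
            | VfsOp::Rmdir | VfsOp::Rename | VfsOp::Mknod | VfsOp::SetAttr
            | VfsOp::SetXattr | VfsOp::RemoveXattr
    )
}

/// Decide whether `op` may be enqueued on the ring.
/// `frozen_level` is the current `SbFreezeLevel` as a `u8` (0 = Unfrozen).
pub fn gate_dispatch(frozen_level: u8, op: VfsOp) -> Result<(), i32> {
    if frozen_level >= 1 && is_write_class(op) {
        Err(-EROFS) // rejected without touching the ring
    } else {
        Ok(())
    }
}
```

Read-class operations and the `Freeze`/`Thaw` control opcodes pass the gate at every freeze level in this sketch.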
/// Filesystem error handling behavior (set via mount option `errors=`).
///
/// When a filesystem encounters an I/O error or metadata corruption,
/// the VFS consults `SuperBlock.s_error_behavior` to determine the
/// system-level response. This is separate from the error returned to
/// the calling application (which always gets an appropriate errno).
///
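/// **Linux compatibility**: The `errors=` mount option values and their
/// semantics match Linux exactly. The discriminants below are NOT the
/// ext4 `s_errors` on-disk values (ext4 stores 1 = continue,
/// 2 = remount-ro, 3 = panic), so an ext4 driver must translate
/// (on-disk value minus one) when reading the superblock.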
#[repr(u8)]
pub enum FsErrorMode {
/// Continue operation after error (default for ext4).
/// The error is reported to the application via errno, but the
/// filesystem remains mounted read-write. Suitable for non-critical
/// filesystems where availability is preferred over safety.
Continue = 0,
/// Remount filesystem read-only on error.
/// This is the safest non-destructive option: it prevents further
/// data corruption while keeping existing data readable.
///
/// **Remount-ro procedure**:
/// 1. Set `SuperBlock.s_flags |= MS_RDONLY` (atomic OR).
/// 2. Flush all dirty pages via `sync_fs(sb, wait=true)`. Pages that
/// fail to flush are marked with `PG_ERROR` and left in the cache
/// (they cannot be written back to a read-only filesystem).
/// 3. Reject all future write operations (`write`, `truncate`,
/// `fallocate`, `rename`, `unlink`, `mkdir`, etc.) with `EROFS`.
/// 4. Log the error and the remount event to the kernel log and
/// the fault management subsystem ([Section 20.1](20-observability.md#fault-management-architecture)).
/// 5. Existing read-only file descriptors continue to work.
/// Existing read-write file descriptors remain open but all
/// write operations return `EROFS`.
RemountRo = 1,
/// Kernel panic on filesystem error.
/// Used for critical root filesystems where continuing with a
/// corrupted filesystem is worse than rebooting. This should only
/// be set on the root filesystem in environments with automatic
/// reboot and fsck (e.g., servers with watchdog timers).
Panic = 2,
}
Writeback error integration: The writeback subsystem calls check_fs_error_mode(sb)
when a page writeback I/O completes with an error. This function inspects
sb.s_error_behavior and takes the configured action (log, remount-ro, or panic).
Without this hook, errors=remount-ro would be meaningless for asynchronous writeback
errors — the writeback subsystem would mark pages PG_ERROR but never trigger the
VFS-level error policy. See Section 4.6 for the writeback I/O completion
path that invokes this check.
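A minimal sketch of the policy decision described above. The source names `check_fs_error_mode(sb)` but does not give its body or return type, so `ErrorAction` and `SuperBlockModel` are hypothetical, and the real function would act on the superblock's atomic flags rather than a plain `bool`.

```rust
/// Mirrors `FsErrorMode` from the struct definitions above.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FsErrorMode { Continue = 0, RemountRo = 1, Panic = 2 }

/// What the caller should do after a failed writeback (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum ErrorAction { LogOnly, RemountReadOnly, Panic }

/// Reduced superblock model: only the fields this sketch needs.
pub struct SuperBlockModel {
    pub s_error_behavior: FsErrorMode,
    pub read_only: bool,
}

/// Called from the writeback completion path when a page write fails.
pub fn check_fs_error_mode(sb: &mut SuperBlockModel) -> ErrorAction {
    match sb.s_error_behavior {
        // Error is logged; the filesystem stays read-write.
        FsErrorMode::Continue => ErrorAction::LogOnly,
        // Flip the read-only flag first so new writes fail with EROFS.
        FsErrorMode::RemountRo => {
            sb.read_only = true;
            ErrorAction::RemountReadOnly
        }
        // The caller is expected to panic the kernel.
        FsErrorMode::Panic => ErrorAction::Panic,
    }
}
```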
See Section 14.2 for the VFS ring buffer cross-domain dispatch protocol (request/response ring pairs, opcodes, marshaling, timeout, cancellation, crash recovery).
See Section 14.4 for the fsync/fdatasync end-to-end flow and Copy-on-Write / Redirect-on-Write infrastructure (WriteMode, ExtentSharingOps, shared-extent page cache, reflink ioctls, CoW-aware writeback, free space accounting).
14.1.2.4 End-to-End Write Path: Userspace to Hardware¶
This walkthrough traces a single buffered write(2) from a userspace application
through every kernel layer to stable media. It serves as a cross-reference map
connecting the VFS, page cache, writeback, block layer, and device driver
specifications.
1. USERSPACE: write(fd, buf, len)
→ Syscall entry ([Section 19.1](19-sysapi.md#syscall-interface))
→ umka-sysapi resolves fd to OpenFile
2. VFS DISPATCH: vfs_write(file, buf, len, &pos)
→ File operations dispatch via file.f_op.write_iter
→ fdget_pos() acquires position lock (§14.5 above)
→ Calls generic_file_write_iter() for regular files
3. PAGE CACHE WRITE: generic_file_write_iter()
→ pagecache_get_page(mapping, pgoff, FGP_WRITEBEGIN)
→ Page cache lookup via XArray ([Section 4.4](04-memory.md#page-cache))
→ On miss: allocate page, insert into XArray
→ copy_from_user(page_addr + offset, buf, len)
→ Data copied from userspace buffer to page cache page
→ set_page_dirty(page) → marks PG_DIRTY
→ vfs_dirty_extent_reserve() ([Section 14.4](#vfs-fsync-and-cow))
→ Reserves writeback intent in the dirty extent tracker
4. FILESYSTEM NOTIFICATION: .write_iter() callback
→ For ext4: ext4_write_begin() / ext4_write_end()
→ Journal reservation (JBD2) for metadata
→ Delayed allocation: logical blocks reserved, physical not yet assigned
→ For XFS: xfs_file_write_iter() → iomap framework
→ For Btrfs: CoW reservation via extent tree
**Tier boundary**: For Tier 1 filesystems, write_begin/write_end are
dispatched via KABI ring. The Tier 1 filesystem's write_end() response
includes `dirty: bool`. The Tier 0 VFS ring consumer calls
set_page_dirty() on behalf of the filesystem -- the filesystem never
calls set_page_dirty() directly across the domain boundary.
([Section 12.8](12-kabi.md#kabi-domain-runtime))
5. WRITEBACK (ASYNCHRONOUS): triggered by dirty ratio threshold,
periodic writeback timer (default 5s), or explicit fsync()
→ Writeback thread ([Section 4.6](04-memory.md#writeback-subsystem--writeback-thread-organization))
picks dirty inode from per-bdi writeback list
→ writeback_single_inode() → .writepages() or .writepage()
→ Filesystem allocates physical blocks (delayed allocation commit):
- ext4: ext4_writepages() → ext4_map_blocks() assigns physical extents
- XFS: xfs_vm_writepages() → xfs_bmapi_write()
- Btrfs: extent_writepages() → CoW extent allocation
→ vfs_dirty_extent_commit() binds physical block address to intent
6. BIO CONSTRUCTION: filesystem builds Bio from dirty pages
→ Bio { op: Write, start_lba, segments: [page, ...] }
→ Sets BioFlags: FUA for journal commits, PERSISTENT for critical I/O
→ ([Section 15.2](15-storage.md#block-io-and-volume-management--bio-crash-recovery))
7. BLOCK LAYER: bio_submit()
→ Cgroup I/O throttling check ([Section 15.2](15-storage.md#block-io-and-volume-management--cgroup-io-throttling))
→ I/O scheduler path (if attached):
bio_to_io_request() → scheduler merges, reorders
([Section 15.18](15-storage.md#io-priority-and-scheduling))
→ Direct dispatch path (NVMe multi-queue): bypass scheduler
8. DEVICE DRIVER: BlockDeviceOps::submit_bio()
→ Tier 0: direct function call in kernel context
→ Tier 1: KABI ring dispatch through DomainRingBuffer
([Section 12.3](12-kabi.md#kabi-bilateral-capability-exchange))
→ Tier 2: IPC message to userspace driver process
9. HARDWARE DMA: driver programs NVMe SQ / AHCI command slot / virtio desc
→ DMA from page cache page to device
→ DmaDevice::dma_map_sgl() creates IOMMU mapping
([Section 4.14](04-memory.md#dma-subsystem))
→ Device writes data to stable media
10. COMPLETION: device signals IRQ → driver processes CQ entry
→ bio_complete() invokes bio.end_io callback (interrupt context)
→ Deferred to blk-io workqueue for page cache updates:
- Clear PG_WRITEBACK on the page
- Wake fsync() waiters if applicable
- Update AddressSpace.wb_err on error
Design note — write() and async writeback visibility: write() returns success
as soon as data is in the page cache (step 3 above). Asynchronous writeback failures
(step 10) are NOT visible to write() — they are visible only to fsync() via the
ErrSeq mechanism (Section 15.1). This is intentional and
Linux-compatible: write() is a buffer-fill operation, not a durability guarantee.
Applications that need durability must call fsync() or use O_SYNC/O_DSYNC.
The ErrSeq mechanism ensures each open file descriptor sees each writeback error
exactly once on the next fsync() call — the fd snapshots AddressSpace::wb_err
at open() time (file.f_wb_err), and fsync() compares the snapshot to the
current wb_err generation to detect new errors. If the application never calls
fsync(), writeback errors are silently absorbed (the data is lost, but the
application was not requesting durability guarantees). This matches POSIX semantics
and Linux 4.13+ behavior (errseq_t).
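The snapshot-and-compare behaviour can be illustrated with a simplified model. This is a sketch under assumptions: `WbErr` and its methods are hypothetical, the packing (generation in the upper bits, errno in the low 16) is chosen for the example, and the model omits the "seen" flag optimisation of Linux's `errseq_t`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Model of `AddressSpace::wb_err`: (generation << 16) | errno.
pub struct WbErr(AtomicU64);

impl WbErr {
    pub fn new() -> Self {
        WbErr(AtomicU64::new(0))
    }

    /// Writeback completion path: record an error and bump the generation.
    pub fn set(&self, errno: u16) {
        let mut cur = self.0.load(Ordering::Acquire);
        loop {
            let next = (((cur >> 16) + 1) << 16) | u64::from(errno);
            match self.0.compare_exchange_weak(
                cur, next, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return,
                Err(actual) => cur = actual,
            }
        }
    }

    /// open(): take the snapshot stored in `file.f_wb_err`.
    pub fn sample(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }

    /// fsync(): report an error only if one was recorded since `since`,
    /// then advance the snapshot so the same error is not reported twice.
    pub fn check_and_advance(&self, since: &mut u64) -> Result<(), u16> {
        let cur = self.0.load(Ordering::Acquire);
        if cur == *since {
            return Ok(());
        }
        *since = cur;
        Err((cur & 0xFFFF) as u16)
    }
}
```

Each file descriptor carries its own snapshot, so two descriptors opened before the failure each observe the error once, independently.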
14.1.2.5 O_SYNC / O_DSYNC Write Path¶
When a file is opened with O_SYNC or O_DSYNC, write() must not return until
the data (and possibly metadata) is on stable storage. This guarantee is enforced
after step 3 (page cache write) completes, before returning to userspace.
The synchronous write path reuses the normal page-cache write path above — data is still copied to a page cache page and the page is marked dirty. The difference is that the caller blocks on writeback before returning, instead of deferring to the asynchronous writeback thread. This design keeps the page cache as the single source of truth for dirty tracking and avoids duplicating writeback logic.
O_SYNC/O_DSYNC branch (inserted between steps 3 and 4 above):
3a. SYNC CHECK: after set_page_dirty() and vfs_dirty_extent_reserve():
if file.f_flags & (O_SYNC | O_DSYNC) != 0:
// Flush the dirty range we just wrote to stable storage.
err = filemap_write_and_wait_range(
mapping,
offset, // start of the write
offset + len - 1, // end of the write (inclusive)
)
// filemap_write_and_wait_range():
// 1. Calls writeback_range(mapping, start, end) which triggers
// AddressSpaceOps::writepages() for the dirty pages in [start, end].
// 2. Waits for PG_WRITEBACK to clear on all pages in the range
// (blocks until device DMA + completion for those pages).
// 3. Returns the first error from wb_err in the range, if any.
if err != 0:
// Writeback failed — propagate error to write() caller.
// The page remains in the page cache (still dirty or errored).
// AddressSpace.wb_err records the error for subsequent fsync().
return Err(err)
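// O_SYNC: data + ALL metadata must be stable.
// O_DSYNC: data must be stable; metadata only if file size changed.
// On Linux, O_SYNC is defined as (__O_SYNC | O_DSYNC), so compare against
// the full mask; a `!= 0` test would also match O_DSYNC-only files.
if file.f_flags & O_SYNC == O_SYNC: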
// Full sync: flush inode metadata (timestamps, size, blocks).
err = vfs_fsync_metadata(inode)
if err != 0:
return Err(err)
else:
// O_DSYNC: flush metadata only if i_size changed (data integrity).
// File size changes affect data recoverability — a crash after
// extending the file but before updating i_size on disk would lose
// the new data (it would be beyond the on-disk EOF). Timestamp
// updates (mtime, ctime) are NOT required for data integrity.
if offset + len > old_i_size:
err = vfs_fsync_metadata(inode)
if err != 0:
return Err(err)
vfs_fsync_metadata(inode) calls InodeOps::write_inode(inode, WB_SYNC_ALL) to
flush the inode's on-disk metadata. For journaling filesystems (ext4, XFS), this
commits the journal transaction containing the inode update. For non-journaling
filesystems, this writes the inode block and issues a cache flush.
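The flag test in step 3a can be sketched as a pure function. This is an illustration, not the VFS code: `SyncAction` and `sync_action` are hypothetical names, and the flag constants are the Linux values used on x86-64 (other architectures may differ), where `O_SYNC` includes the `O_DSYNC` bit.

```rust
/// Linux x86-64 values: O_SYNC = __O_SYNC | O_DSYNC.
pub const O_DSYNC: u32 = 0o10000;
pub const O_SYNC: u32 = 0o4010000;

#[derive(Debug, PartialEq, Eq)]
pub enum SyncAction {
    /// Neither flag set: return after the page cache write.
    NoSync,
    /// Flush the written data range only.
    DataOnly,
    /// Flush the data range, then the inode metadata.
    DataAndMetadata,
}

/// Decide what a write of `len` bytes at `offset` must flush before
/// returning, given the file size before the write.
pub fn sync_action(f_flags: u32, offset: u64, len: u64, old_i_size: u64) -> SyncAction {
    if (f_flags & O_SYNC) == O_SYNC {
        // Full O_SYNC: compare against the whole mask, because a plain
        // non-zero test would also be true for O_DSYNC alone.
        SyncAction::DataAndMetadata
    } else if (f_flags & O_DSYNC) != 0 {
        if offset + len > old_i_size {
            // The write extended the file: i_size must reach disk too.
            SyncAction::DataAndMetadata
        } else {
            SyncAction::DataOnly
        }
    } else {
        SyncAction::NoSync
    }
}
```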
Performance: O_SYNC adds the device write latency to every write() call (~10-15 us on NVMe, ~3-8 ms on SATA). This is inherent — the user requested durability. The page cache write (step 3) remains ~1-5 us; the additional cost is entirely device I/O.
Interaction with writeback: The synchronous flush in step 3a writes back the same
dirty pages that the asynchronous writeback thread (step 5) would eventually process.
After filemap_write_and_wait_range() completes, the pages are clean (PG_DIRTY cleared),
so the writeback thread skips them. No double-write occurs. Dirty page accounting
(AddressSpace.page_cache.nr_dirty, BDI dirty counters) is correctly decremented by the
writeback completion path, regardless of whether writeback was triggered synchronously
or asynchronously.
Error semantics: If the device reports a write error:
1. The error is stored in AddressSpace.wb_err (for subsequent fsync() error reporting).
2. The error is returned from write() to the caller (the write "failed" from the
durability perspective, even though the data is in the page cache).
3. The page may remain dirty in the cache (for retry on the next writeback attempt).
14.1.2.6 O_DIRECT (Direct I/O) Path¶
O_DIRECT bypasses the page cache entirely: data is transferred via DMA directly between the user buffer and the block device. This eliminates double-copying (user buffer to page cache to device) and avoids polluting the page cache with streaming I/O data that will never be re-read.
Alignment requirements: O_DIRECT requires sector-aligned file offset and transfer
length. The required alignment is filesystem-dependent and reported via statx()
(dio_offset_align, dio_mem_align fields in UmkaStatx). Typical values:
- ext4/XFS on NVMe: 512 bytes (sector size)
- ext4 with bigalloc: filesystem block size (e.g., 4096)
- Btrfs: sector size (4096)
Unaligned offset or length returns EINVAL from write() / read().
The user buffer must also be aligned to dio_mem_align (typically 512 bytes). This
ensures the DMA controller can transfer directly to/from the buffer without bounce
buffering.
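The alignment rules above amount to three mask checks. A minimal sketch follows; `check_dio_alignment` is a hypothetical helper, and it assumes both alignment values are powers of two, as sector and block sizes are.

```rust
pub const EINVAL: i32 = 22;

/// Validate an O_DIRECT request against the alignments reported by statx().
/// `buf_addr` is the user buffer's virtual address.
pub fn check_dio_alignment(
    offset: u64,
    len: u64,
    buf_addr: u64,
    dio_offset_align: u64,
    dio_mem_align: u64,
) -> Result<(), i32> {
    // Masking only works for non-zero powers of two.
    if !dio_offset_align.is_power_of_two() || !dio_mem_align.is_power_of_two() {
        return Err(EINVAL);
    }
    let offset_mask = dio_offset_align - 1;
    // File offset and transfer length share the same alignment requirement.
    if (offset & offset_mask) != 0 || (len & offset_mask) != 0 {
        return Err(EINVAL);
    }
    // The buffer address has its own, possibly smaller, requirement.
    if (buf_addr & (dio_mem_align - 1)) != 0 {
        return Err(EINVAL);
    }
    Ok(())
}
```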
/// Direct I/O operations. Returned by `AddressSpaceOps::direct_io()` for
/// filesystems that support O_DIRECT. The VFS calls these methods instead
/// of the page-cache path when `FMODE_DIRECT` is set on the file.
pub trait DirectIoOps: Send + Sync {
/// Perform a direct read from the block device into the user buffer.
///
/// `file`: the open file (provides inode, block mapping).
/// `buf`: user-space destination buffer (must be dio_mem_align-aligned).
/// `offset`: file offset (must be dio_offset_align-aligned).
/// `len`: number of bytes to read.
///
/// Returns the number of bytes actually read (may be less than `len`
/// on EOF or partial DMA completion).
///
/// The implementation must:
/// 1. Map the file offset range to block device LBAs via the filesystem's
/// extent/block map.
/// 2. Pin the user buffer pages via `get_user_pages(buf, len, WRITE)`.
/// 3. Build a Bio with the pinned user pages as DMA targets.
/// 4. Submit the Bio and wait for completion.
/// 5. Unpin the user pages on completion.
fn direct_read(
&self,
file: &OpenFile,
buf: UserSliceMut,
offset: u64,
len: u64,
) -> Result<u64, IoError>;
/// Perform a direct write from the user buffer to the block device.
///
/// `file`: the open file.
/// `buf`: user-space source buffer (must be dio_mem_align-aligned).
/// `offset`: file offset (must be dio_offset_align-aligned).
/// `len`: number of bytes to write.
///
/// Returns the number of bytes actually written (short write on error).
///
/// The implementation must:
/// 1. Allocate blocks if writing beyond current extents (fallocate or
/// delayed allocation commit).
/// 2. Pin the user buffer pages via `get_user_pages(buf, len, READ)`.
/// 3. Build a Bio with the pinned user pages as DMA sources.
/// 4. Submit the Bio and wait for completion.
/// 5. Update i_size if the write extended the file.
/// 6. Unpin the user pages on completion.
fn direct_write(
&self,
file: &OpenFile,
buf: UserSlice,
offset: u64,
len: u64,
) -> Result<u64, IoError>;
}
Cache coherence protocol: O_DIRECT and buffered I/O on the same file must not produce stale reads or lost writes. The VFS enforces coherence through serialization and invalidation:
Direct I/O cache coherence:
1. SERIALIZATION via i_rwsem:
- Buffered read/write: acquires i_rwsem SHARED.
- Direct I/O read/write: acquires i_rwsem EXCLUSIVE.
This prevents concurrent DIO and buffered I/O on the same inode.
A DIO write cannot race with a buffered read that might see stale
page cache data, and a DIO read cannot race with a buffered write
that might dirty a page cache page covering the same range.
2. BEFORE DIO READ — invalidate cached pages in the range:
invalidate_inode_pages2_range(mapping, offset, offset + len - 1)
→ Removes all page cache pages covering [offset, offset+len).
→ If any page is dirty, it is written back first (to avoid data loss),
then removed from the cache.
→ Subsequent buffered reads will re-fetch from disk (seeing DIO writes).
3. BEFORE DIO WRITE — flush + invalidate:
filemap_write_and_wait_range(mapping, offset, offset + len - 1)
→ Writes back any dirty pages in the range to disk.
invalidate_inode_pages2_range(mapping, offset, offset + len - 1)
→ Removes the pages from the cache.
→ This ensures the DIO write does not race with dirty page writeback
(which would overwrite the DIO data with stale page cache contents).
4. AFTER DIO WRITE — no page cache update:
The written data is on disk. Page cache pages for this range were
invalidated in step 3 and are not re-populated. Subsequent buffered
reads will fetch the new data from disk (cache miss → readpage).
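The ordering in steps 2 and 3 can be made concrete with a recording mock. This is a sketch, not the VFS code: the `Mapping` trait, `RecordingMapping`, `dio_read_prepare`, and `dio_write_prepare` are hypothetical, and the exclusive `i_rwsem` acquisition from step 1 is assumed to be held by the caller.

```rust
/// The two page cache operations the coherence protocol relies on.
/// Both take an inclusive byte range, as in the steps above.
pub trait Mapping {
    fn filemap_write_and_wait_range(&mut self, start: u64, end: u64);
    fn invalidate_inode_pages2_range(&mut self, start: u64, end: u64);
}

/// Mock that records the order of calls instead of touching a page cache.
#[derive(Default)]
pub struct RecordingMapping {
    pub calls: Vec<(&'static str, u64, u64)>,
}

impl Mapping for RecordingMapping {
    fn filemap_write_and_wait_range(&mut self, start: u64, end: u64) {
        self.calls.push(("write_and_wait", start, end));
    }
    fn invalidate_inode_pages2_range(&mut self, start: u64, end: u64) {
        self.calls.push(("invalidate", start, end));
    }
}

/// Step 2: before a DIO read, drop cached pages covering the range.
pub fn dio_read_prepare<M: Mapping>(mapping: &mut M, offset: u64, len: u64) {
    if len == 0 {
        return; // empty range: nothing to invalidate
    }
    mapping.invalidate_inode_pages2_range(offset, offset + len - 1);
}

/// Step 3: before a DIO write, flush dirty pages first, then invalidate,
/// so later writeback cannot overwrite the DIO data with stale contents.
pub fn dio_write_prepare<M: Mapping>(mapping: &mut M, offset: u64, len: u64) {
    if len == 0 {
        return;
    }
    let end = offset + len - 1;
    mapping.filemap_write_and_wait_range(offset, end);
    mapping.invalidate_inode_pages2_range(offset, end);
}
```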
VMA integration: O_DIRECT does NOT allocate page cache pages. The user buffer
pages are pinned in physical memory via get_user_pages() for the duration of the
DMA transfer and unpinned on completion. This is fundamentally different from
buffered I/O, where data passes through kernel-owned page cache pages.
Fallback: If AddressSpaceOps::direct_io() returns None (filesystem does not
support DIO), the VFS silently falls back to the buffered I/O path. The FMODE_DIRECT
flag is not set on the OpenFile, and all reads/writes use the page cache. This is
more permissive than mainline Linux, where open() with O_DIRECT fails with EINVAL if
the filesystem provides no direct I/O implementation (e.g., some network filesystems).
Error handling: If DMA fails mid-transfer (device error, IOMMU fault), the Bio
completion callback reports the error. The DIO path returns the number of bytes
successfully transferred (short read/write). If zero bytes were transferred, the
error code from the Bio is returned directly (e.g., EIO).
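The short-transfer rule can be written as a small function. `dio_result` is a hypothetical helper; errno values are plain positive integers here for brevity.

```rust
/// Map a finished direct I/O to the value returned to the caller.
/// `bytes_done` is how much was transferred before any failure;
/// `bio_error` is the errno reported by the Bio completion, if any.
pub fn dio_result(bytes_done: u64, bio_error: Option<i32>) -> Result<u64, i32> {
    match bio_error {
        // Nothing was transferred: surface the device error itself.
        Some(errno) if bytes_done == 0 => Err(errno),
        // Partial or full success: report the byte count (short read/write).
        _ => Ok(bytes_done),
    }
}
```

With this rule a caller that receives a short count simply retries the remainder, and the error surfaces on that retry if the failure persists.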
Key latency contributors (approximate, NVMe on x86-64):
- Steps 1-4 (VFS + page cache): ~1-5 us (CPU-bound, no I/O)
- Step 3a (O_SYNC/O_DSYNC): +10-15 us NVMe, +3-8 ms SATA (device write latency)
- Step 5 (writeback trigger): 0-5s delay (async) or 0 (fsync path)
- Steps 6-7 (bio construction + block layer): ~1-3 us
- Steps 8-9 (driver + DMA): ~1-2 us (Tier 0/1) or ~5-10 us (Tier 2)
- Step 10 (hardware): 10-100 us (NVMe) / 1-10 ms (SATA)
- O_DIRECT path (bypasses steps 3-5): ~15-120 us total (DMA + device)
Cross-references for each step:
- Syscall dispatch: Section 19.1
- Page cache: Section 4.4
- Dirty extent tracking: Section 14.4
- Writeback subsystem: Section 4.6
- Bio and block device trait: Section 15.2
- I/O scheduling: Section 15.18
- KABI ring dispatch: Section 12.3
- DMA subsystem: Section 4.14
- Tier 1 crash recovery for in-flight writes: Section 11.9
The following sections (Pipe Subsystem, Inode Cache, Dentry Cache, Path Resolution, Mount Namespace) remain in this file.
14.1.3 Pipe Subsystem¶
For pipe implementation, see Section 14.17.
14.1.4 Inode Cache (icache)¶
The inode cache uses per-superblock XArray lookup (SuperBlock.inode_cache) with a
global LRU list (InodeCache) for eviction. See the Core VFS Data Structures section
above for struct definitions. It provides:
- inode_cache_lookup(sb, ino): O(1) per-superblock XArray lookup under RCU (hot path).
- inode_cache_insert(inode): inserts into the inode's superblock XArray.
- inode_cache_evict(sb, count): LRU eviction for memory pressure.
- INODE_CACHE_SHRINKER: registered shrinker for integration with the page reclaim subsystem.
Each superblock holds an XArray<u64, Arc<Inode>> keyed by inode number for O(1)
lookup with no hash computation. Unreferenced inodes (i_refcount == 0,
i_nlink > 0) are placed on a global LRU list for eviction under memory pressure.
Memory pressure integration: The inode cache registers a shrinker with
umka-core's memory reclaim subsystem (Section 4.2). When
the page allocator signals pressure, inode_cache_evict() walks the LRU list
and evicts up to nr_to_scan inodes. Each eviction frees the inode struct,
its associated page cache pages (via truncate_inode_pages()), and its LSM
blob. The dentry cache shrinker runs first (evicting dentries drops inode
refcounts, making more inodes eligible for LRU eviction).
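The lookup and eviction behaviour can be illustrated with a user-space model. This is a sketch under assumptions: `InodeCacheModel` and its methods are hypothetical, a `BTreeMap` stands in for the per-superblock XArray, a `VecDeque` stands in for the global LRU list, and `Arc::strong_count` stands in for `i_refcount`.

```rust
use std::collections::{BTreeMap, VecDeque};
use std::sync::Arc;

pub struct Inode {
    pub ino: u64,
}

pub struct InodeCacheModel {
    by_ino: BTreeMap<u64, Arc<Inode>>, // stand-in for SuperBlock.inode_cache
    lru: VecDeque<u64>,                // unreferenced inodes, oldest first
}

impl InodeCacheModel {
    pub fn new() -> Self {
        InodeCacheModel { by_ino: BTreeMap::new(), lru: VecDeque::new() }
    }

    pub fn insert(&mut self, inode: Arc<Inode>) {
        self.by_ino.insert(inode.ino, inode);
    }

    /// Lookup by inode number; a hit takes a reference and leaves the LRU.
    pub fn lookup(&mut self, ino: u64) -> Option<Arc<Inode>> {
        let inode = self.by_ino.get(&ino)?.clone();
        self.lru.retain(|&i| i != ino);
        Some(inode)
    }

    /// Drop a reference; the inode joins the LRU once only the cache holds it.
    pub fn release(&mut self, inode: Arc<Inode>) {
        let ino = inode.ino;
        drop(inode);
        if let Some(cached) = self.by_ino.get(&ino) {
            if Arc::strong_count(cached) == 1 && !self.lru.contains(&ino) {
                self.lru.push_back(ino);
            }
        }
    }

    /// Evict up to `nr_to_scan` unreferenced inodes, oldest first.
    pub fn evict(&mut self, nr_to_scan: usize) -> usize {
        let mut freed = 0;
        while freed < nr_to_scan {
            let ino = match self.lru.pop_front() {
                Some(ino) => ino,
                None => break,
            };
            let unreferenced = self
                .by_ino
                .get(&ino)
                .map_or(false, |i| Arc::strong_count(i) == 1);
            if unreferenced {
                self.by_ino.remove(&ino); // frees the inode in this model
                freed += 1;
            }
        }
        freed
    }
}
```

Referenced inodes never appear on the LRU in this model, so eviction cannot free an inode that is still in use.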
14.1.5 Dentry Cache¶
The dentry (directory entry) cache is the performance-critical data structure of the VFS.
It maps (parent_inode, name) pairs to child inodes, eliminating repeated disk lookups
for path resolution.
Data structure: RCU-protected hash table. Read-side lookups are lock-free — no atomic operations on the read path, only a memory barrier on RCU read lock entry/exit. This matches Linux's dentry cache design, which is similarly RCU-protected for the same performance reasons.
Negative dentries: When a lookup() returns ENOENT, the VFS caches a negative
dentry for that (parent, name) pair. Subsequent lookups for the same nonexistent path
component return ENOENT immediately without calling into the filesystem driver. This
is critical for workloads like $PATH searches where the shell looks for an executable
in 5-10 directories, finding it only in one. Without negative dentries, every command
invocation would perform 4-9 unnecessary disk lookups.
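The effect of negative dentries can be shown with a toy cache. This is an illustrative model only: `DentryCacheModel` is hypothetical, a `HashMap` replaces the RCU-protected hash table, and a closure stands in for the filesystem driver's `lookup()`.

```rust
use std::collections::HashMap;

/// Maps (parent inode, name) to `Some(child inode)` for a positive dentry
/// or `None` for a negative dentry.
pub struct DentryCacheModel {
    entries: HashMap<(u64, String), Option<u64>>,
    /// How many times the filesystem driver was actually consulted.
    pub fs_lookups: usize,
}

impl DentryCacheModel {
    pub fn new() -> Self {
        DentryCacheModel { entries: HashMap::new(), fs_lookups: 0 }
    }

    /// Resolve one path component, consulting the filesystem only on a miss.
    pub fn lookup<F>(&mut self, parent: u64, name: &str, fs_lookup: F) -> Option<u64>
    where
        F: FnOnce(u64, &str) -> Option<u64>,
    {
        let key = (parent, name.to_string());
        if let Some(cached) = self.entries.get(&key) {
            // Hit: positive and negative entries are both answered here.
            return *cached;
        }
        self.fs_lookups += 1;
        let result = fs_lookup(parent, name);
        // Cache the outcome even when it is "no such entry".
        self.entries.insert(key, result);
        result
    }
}
```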
Eviction: LRU eviction under memory pressure. The dentry cache integrates with umka-core's memory reclaim (Section 4.12 — Memory Compression Tier, in 04-memory.md) — when the page allocator signals memory pressure, the dentry cache shrinker evicts least-recently-used entries. Negative dentries are evicted preferentially (they are cheaper to re-create than positive dentries).
14.1.6 Path Resolution¶
Path resolution walks the dentry cache component by component. For example,
/usr/lib/libfoo.so resolves as: root dentry -> lookup("usr") -> lookup("lib")
-> lookup("libfoo.so").
RCU path walk (fast path): The entire resolution is attempted under an RCU read-side critical section. No dentry reference counts are taken, no locks are acquired. If every component is in the dentry cache and no concurrent renames or unmounts are in progress, the entire path resolves with zero atomic operations.
Ref-walk fallback (slow path): If any component is not cached, or if a concurrent
mount/rename is detected (via sequence counters), the RCU walk aborts and restarts in
ref-walk mode. Ref-walk takes dentry reference counts and inode locks as needed. This
two-phase approach mirrors Linux's rcu-walk -> ref-walk fallback, in which the walk is
retried without LOOKUP_RCU.
Mount point traversal: When a dentry is flagged as a mount point, resolution crosses into the mounted filesystem's root dentry. The mount table is consulted via RCU lookup (no lock) in the fast path.
Symlink resolution: The VFS follows at most 40 symlinks during a single path
resolution before returning ELOOP. This matches the Linux limit (MAXSYMLINKS) and
bounds the work done on symlink loops.
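The symlink limit can be sketched with a toy resolver. This is a simplified model: `Node` and `resolve` are hypothetical, whole paths are used as map keys instead of walking component by component, and errno values are the Linux ones (`ENOENT` = 2, `ELOOP` = 40).

```rust
use std::collections::HashMap;

pub const MAX_SYMLINKS: u32 = 40;
pub const ENOENT: i32 = 2;
pub const ELOOP: i32 = 40;

pub enum Node {
    /// A regular file with its inode number.
    File(u64),
    /// A symlink with its target path.
    Symlink(String),
}

/// Resolve `path` to an inode number, following at most `MAX_SYMLINKS`
/// symlinks over the whole resolution.
pub fn resolve(ns: &HashMap<String, Node>, path: &str) -> Result<u64, i32> {
    let mut current = path.to_string();
    let mut followed = 0u32;
    loop {
        match ns.get(&current) {
            None => return Err(ENOENT),
            Some(Node::File(ino)) => return Ok(*ino),
            Some(Node::Symlink(target)) => {
                if followed >= MAX_SYMLINKS {
                    // Too many links: a loop or an excessively long chain.
                    return Err(ELOOP);
                }
                followed += 1;
                current = target.clone();
            }
        }
    }
}
```

The counter covers the whole resolution rather than nesting depth, so a chain of exactly 40 links resolves and a chain of 41 fails.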
Symlink namespace semantics: Symlink targets are always resolved relative to
the current task's mount namespace, not the symlink inode's namespace. Absolute
symlink targets (/foo/bar) start from task.fs.root (the task's chroot/pivot_root).
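Relative symlink targets are resolved from the symlink's parent directory. The
AT_SYMLINK_NOFOLLOW flag stops the VFS from following a symlink in the final path
component (the symlink inode itself is returned); symlinks in intermediate
components are still followed.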
This matches Linux behavior and ensures that symlinks do not become cross-namespace
escape vectors — a symlink created in one mount namespace cannot force resolution
through a different namespace's mount tree.
Capability checks: Traverse permission is checked at each path component, but
not via an inter-domain ring call on every component. Instead, each dentry
stores a cached_perm: AtomicU32 field recording the permission bits resolved on the
last successful access, tagged with a hash of the accessing UID. During RCU-walk, the VFS reads
cached_perm from the dentry (same domain, no ring call) and compares it against the
requesting process's UID hash and requested permission. If the cached entry matches
(the common case: the same user accessing the same path), no domain crossing occurs and the
check costs only a single atomic load (~1-3 cycles). The permission cache is
invalidated on chmod(), chown(), ACL changes, and capability revocation (all
infrequent operations).
Permission cache encoding: The 32-bit cached_perm field is divided into:
- Bits [31:16]: Truncated UID hash (upper 16 bits of a fast hash of the accessor's
UID). This is NOT a full UID — it is a probabilistic match filter.
- Bits [15:12]: Reserved (zero).
- Bits [11:9]: Permission result for owner (rwx).
- Bits [8:6]: Permission result for group (rwx).
- Bits [5:3]: Permission result for other (rwx).
- Bits [2:0]: Access mode that was checked (rwx).
On a cache hit (UID hash matches AND requested permission bits are a subset of the cached grant), the VFS skips the domain crossing. On a cache miss (UID hash mismatch or permission bits not cached), the VFS performs a full capability check via the inter-domain ring and updates the cache. Because the hash is truncated to 16 bits, a different user produces the same hash as the cached entry with probability ~1/65536 per lookup.
The permission cache is purely advisory. A hash collision must never be served as a grant: it is handled as a cache miss, and the VFS falls back to the slow-path inter-domain capability check, which always produces a correct result. Fail-safe direction: deny unknown, never grant unknown.
Multi-user ping-pong: On shared directory trees accessed by multiple UIDs concurrently, the single-entry permission cache experiences ping-pong (alternating misses as different UIDs overwrite each other's cached hash). This is acceptable for the common case (single-user access patterns dominate) and bounded — each miss costs one domain crossing, identical to the no-cache case. A multi-entry cache was considered but rejected due to per-dentry memory overhead (each additional entry adds 4 bytes per dentry, multiplied by millions of dentries in active caches).
This design is correct because:
1. A cache hit is only accepted when the UID hash AND the requested permission bits match the stored grant exactly. The probability of a different user with a different permission set matching both fields is ~1/65536 per lookup, and that case results in a cache miss and full slow-path check anyway.
2. The cache is invalidated on ALL permission-changing operations (chmod, chown, ACL changes, capability revocation), ensuring stale grants are never served after the underlying permission state changes.
3. Only the slow-path inter-domain ring call is authoritative. umka-vfs cannot grant access that umka-core's capability tables do not authorize.
Only on a cache miss (first access, different UID, or invalidated entry) does the
VFS call umka-core via the inter-domain ring to perform a full capability check and
update the dentry's cached permissions. This amortized design preserves the security
guarantee (umka-vfs cannot bypass capability checks — it has no access to capability
tables, per Section 11.2 and Section 11.3) while keeping the hot-path overhead to a single
atomic load per component, comparable to Linux's inode->i_mode check.
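The bit layout above can be made concrete with a small packing and hit-test sketch. This is an illustration only: `uid_hash16` uses an arbitrary multiplicative hash (the document does not specify the hash function), `pack` and `fast_path_hit` are hypothetical helpers, and the sketch assumes that the "cached grant" consulted on the fast path is the access-mode field in bits [2:0].

```rust
use std::sync::atomic::{AtomicU32, Ordering};

pub const MAY_EXEC: u32 = 0b001;
pub const MAY_WRITE: u32 = 0b010;
pub const MAY_READ: u32 = 0b100;

/// Hypothetical 16-bit UID hash: upper 16 bits of a multiplicative hash.
pub fn uid_hash16(uid: u32) -> u32 {
    uid.wrapping_mul(0x9E37_79B9) >> 16
}

/// Pack the fields into the 32-bit layout described in the text.
/// Bits [15:12] are left as zero (reserved).
pub fn pack(uid: u32, owner: u32, group: u32, other: u32, checked: u32) -> u32 {
    (uid_hash16(uid) << 16)
        | ((owner & 7) << 9)
        | ((group & 7) << 6)
        | ((other & 7) << 3)
        | (checked & 7)
}

/// Fast-path test: one atomic load, then a hash compare and a subset check.
/// A `false` result means "take the slow path", not "deny".
pub fn fast_path_hit(cached: &AtomicU32, uid: u32, requested: u32) -> bool {
    let v = cached.load(Ordering::Acquire);
    let hash_ok = (v >> 16) == uid_hash16(uid);
    // Every requested bit must already be present in the checked-mode field.
    let mode_ok = ((requested & 7) & !(v & 7)) == 0;
    (requested & 7) != 0 && hash_ok && mode_ok
}
```

An invalidated entry can be represented as an all-zero word in this sketch: its access-mode field is empty, so every non-empty request misses and falls through to the slow path.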
14.1.6.1 MountDentry — VFS Location Pair¶
A MountDentry is the fundamental VFS location type: a (mount, dentry) pair that
uniquely identifies a point in the mount tree. It is the result of every path
resolution and the primary reference type passed between VFS operations.
/// A reference to a location in the mount tree: the specific mount and the
/// dentry within that mount's filesystem. Two dentries with the same inode
/// in different mounts are different `MountDentry` values.
///
/// `MountDentry` holds Arc references to both the `Mount` and the `Dentry`,
/// keeping both alive for the duration of the reference. Dropping a
/// `MountDentry` decrements both refcounts.
pub struct MountDentry {
/// The mount containing this dentry.
pub mnt: Arc<Mount>,
/// The dentry within the mount's filesystem.
pub dentry: Arc<Dentry>,
}
/// Resolve an open file descriptor to its `MountDentry`.
///
/// Used by `open_by_handle_at()`, `fstatat(AT_EMPTY_PATH)`, and io_uring
/// `AT_FDCWD` resolution. The returned `MountDentry` identifies the mount
/// and dentry that the file descriptor was opened on.
///
/// # Errors
/// - `EBADF`: `fd` is not a valid open file descriptor.
/// - `ENOENT`: The file descriptor's dentry has been unlinked (deleted)
/// and `FMODE_PATH` is not set.
pub fn fd_to_mount_dentry(fd: i32) -> Result<MountDentry, Errno> {
let file = fget(fd)?;
Ok(MountDentry {
mnt: Arc::clone(&file.mnt),
dentry: Arc::clone(&file.dentry),
})
}
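The identity and lifetime rules for `MountDentry` can be shown with a self-contained model. This is a sketch, not the kernel types: `Mount` and `Dentry` are reduced to a single field each, and `same_location` is a hypothetical helper that compares the two `Arc` pointers. The example models a bind mount, where one dentry is visible through two mounts.

```rust
use std::sync::Arc;

/// Stand-in for the kernel `Mount` (only an identifier is modeled).
pub struct Mount {
    pub mount_id: u64,
}

/// Stand-in for the kernel `Dentry` (only the inode number is modeled).
pub struct Dentry {
    pub ino: u64,
}

/// A (mount, dentry) pair; cloning takes a reference on both halves.
#[derive(Clone)]
pub struct MountDentry {
    pub mnt: Arc<Mount>,
    pub dentry: Arc<Dentry>,
}

impl MountDentry {
    /// Hypothetical helper: two values name the same location only if both
    /// the mount and the dentry are the same objects.
    pub fn same_location(&self, other: &MountDentry) -> bool {
        Arc::ptr_eq(&self.mnt, &other.mnt) && Arc::ptr_eq(&self.dentry, &other.dentry)
    }
}
```

Because both fields are `Arc`s, no explicit release call is needed in this model: dropping a `MountDentry` decrements both counts.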
14.1.6.2 Path Lookup Entry Point¶
The path_lookup function is the primary entry point for all VFS path resolution.
Every syscall that accepts a pathname (open, stat, access, mkdir, unlink,
mount, execve, etc.) calls path_lookup to translate the user-provided path
string into a MountDentry pair identifying the target location in the mount tree.
bitflags! {
/// Flags controlling path resolution behavior. Passed to `path_lookup()`
/// by syscall handlers to customize resolution semantics.
///
/// These flags correspond to Linux's internal `LOOKUP_*` flags (not
/// directly visible to userspace, but indirectly controlled by syscall
/// flags like `O_NOFOLLOW`, `O_DIRECTORY`, `O_CREAT`, `AT_SYMLINK_NOFOLLOW`,
/// `AT_EMPTY_PATH`, `RESOLVE_BENEATH`, `RESOLVE_NO_XDEV`, etc.).
pub struct LookupFlags: u32 {
/// Follow the terminal symlink. If the final path component is a
/// symlink and this flag is set, resolution follows it to the target.
/// If not set, resolution returns the symlink inode itself.
///
/// Default for most syscalls (`open`, `stat`). Cleared by `O_NOFOLLOW`
/// and `AT_SYMLINK_NOFOLLOW`. `lstat()` clears this flag.
///
/// Note: intermediate symlinks (non-terminal components) are ALWAYS
/// followed regardless of this flag — only the final component is
/// affected. This matches POSIX behavior.
const FOLLOW = 0x0001;
/// The final component must be a directory. If it resolves to a
/// non-directory inode, return `ENOTDIR`.
///
/// Set by `O_DIRECTORY` (open, openat, openat2), `mkdir` (parent lookup), and
/// `rmdir` (target validation). Also implicitly set when the path
/// ends with a trailing `/` (POSIX: trailing slash implies directory).
const DIRECTORY = 0x0002;
/// Resolve the parent directory of the final component, not the
/// final component itself. The final component name is returned
/// separately (not resolved to an inode). Used by syscalls that
/// create or remove entries: `mkdir`, `mknod`, `unlink`, `rmdir`,
/// `rename`, `link`, `symlink`.
///
/// When set, `path_lookup` returns the `MountDentry` of the parent
/// directory, and the final component name is stored in a separate
/// output parameter (not shown in this simplified signature).
const PARENT = 0x0004;
/// The syscall is creating a new entry (`O_CREAT`). This flag is
/// informational — it does not change resolution behavior, but it
/// is checked by audit/LSM hooks to distinguish "create" from
/// "open existing".
const CREATE = 0x0008;
/// The syscall requires exclusive creation (`O_EXCL`). Combined
/// with `CREATE`. If the final component already exists, return
/// `EEXIST`. The VFS checks this after resolution completes.
const EXCL = 0x0010;
/// The resolution is part of an `open()` operation. This flag
/// enables open-intent optimizations: the VFS can pass an open
/// intent to the filesystem's `lookup()` so that NFS can perform
/// an atomic lookup-and-open in a single RPC, avoiding a TOCTOU
/// race between lookup and open.
const OPEN = 0x0020;
/// Resolution must not cross the `root` boundary upward. If the
/// path contains `..` components that would ascend above `root`,
/// return `EXDEV` instead of silently clamping to `root`.
///
/// Maps to `RESOLVE_BENEATH` (openat2). Provides a stronger
/// security guarantee than chroot: even a privileged process
/// cannot escape the `root` boundary via `..` traversal.
const BENEATH = 0x0040;
/// Resolution must not cross mount point boundaries. If the path
/// traverses a mount point (in either direction — into a mounted
/// filesystem or back out via `..`), return `EXDEV`.
///
/// Maps to `RESOLVE_NO_XDEV` (openat2). Used by sandboxed
/// processes and container runtimes that want to confine path
/// resolution to a single filesystem.
const NO_XDEV = 0x0080;
/// Do not trigger automounts during resolution. If a path
/// component is an autofs trigger point, return `ENOENT` instead
/// of mounting the remote filesystem.
///
/// Maps to `AT_NO_AUTOMOUNT`. This is separate from
/// `RESOLVE_NO_MAGICLINKS` — automount suppression and magic-link
/// suppression are independent concepts.
const NO_AUTOMOUNT = 0x0100;
/// Fail if any path component (including the terminal) is a
/// symlink. Stricter than clearing `FOLLOW` (which only affects
/// the terminal component).
///
/// Maps to `RESOLVE_NO_SYMLINKS` (openat2, 0x04).
const NO_SYMLINKS = 0x0200;
/// Fail on `/proc/[pid]/fd/*` style magic symlinks (procfs magic
/// links that jump to arbitrary filesystem locations). Regular
/// symlinks are still followed unless `NO_SYMLINKS` is also set.
///
/// Maps to `RESOLVE_NO_MAGICLINKS` (openat2, 0x02).
const NO_MAGICLINKS = 0x0400;
/// Treat `dirfd` as the filesystem root. `..` at `dirfd` stays
/// at `dirfd` (like chroot but per-syscall). Combined with
/// `BENEATH`, provides a complete sandboxed lookup.
///
/// Maps to `RESOLVE_IN_ROOT` (openat2, 0x10).
const IN_ROOT = 0x0800;
/// Non-blocking lookup. If the resolution would block (uncached
/// dentry, lazy NFS lookup, autofs trigger), return `EAGAIN`
/// instead of blocking. Used by io_uring for async path ops.
///
/// Maps to `RESOLVE_CACHED` (openat2, 0x20).
const CACHED = 0x1000;
/// Empty path resolution. When set with an empty path string,
/// the resolution returns the `MountDentry` of the `dirfd`
/// itself (or `pwd` if `dirfd` is `AT_FDCWD`). Used by
/// `AT_EMPTY_PATH` (fstatat, linkat, etc.) and `fexecve`.
const EMPTY_PATH = 0x2000;
}
}
/// VFS path resolution entry point. Called from syscall handlers to resolve
/// a user-provided path string to a `MountDentry` pair
/// ([Section 8.1](08-process.md#process-and-task-management--fsstruct)).
///
/// This function implements the two-phase resolution protocol described above:
/// first attempts RCU-walk (lockless, no atomic RMW operations on hit), then
/// falls back to ref-walk on miss or concurrent modification.
///
/// # Arguments
///
/// - `mnt_ns`: Mount namespace for mount point traversal. Determines which
/// mounts are visible during resolution. Obtained from
/// `task.nsproxy.mount_ns` ([Section 17.1](17-containers.md#namespace-architecture)).
/// - `root`: Chroot root boundary from `task.fs.read().root`. Path resolution
/// never ascends above this point via `..`. If `LookupFlags::BENEATH` is set,
/// attempting to ascend above `root` returns `EXDEV` instead of clamping.
/// - `pwd`: Current working directory from `task.fs.read().pwd`. Used as the
/// starting point for relative path resolution. Ignored for absolute paths
/// (paths starting with `/`).
/// - `path`: Path string (absolute or relative). Kernel-space byte slice;
/// the syscall layer has already copied this from userspace via
/// `copy_from_user`. Must be NUL-terminated or bounded by `PATH_MAX`
/// (4096 bytes). An empty `path` is valid only if `LookupFlags::EMPTY_PATH` is
/// set in `flags`.
/// - `flags`: `LookupFlags` bitflags controlling resolution behavior (see
/// the `LookupFlags` definition above).
///
/// # Returns
///
/// On success, returns a `MountDentry` identifying the resolved location
/// in the mount tree. The returned `MountDentry` holds references to both
/// the mount and the dentry (refcounts incremented). Both references are
/// released when the caller drops the `MountDentry`.
///
/// # Errors
///
/// | Error | Condition |
/// |-------|-----------|
/// | `ENOENT` | A path component does not exist (and `LookupFlags::CREATE` is not set) |
/// | `EACCES` | Traverse (execute) permission denied on a directory component |
/// | `ENOTDIR` | A non-terminal component is not a directory, or `LookupFlags::DIRECTORY` is set and the final component is not a directory |
/// | `ELOOP` | More than 40 symlinks followed during resolution |
/// | `ENAMETOOLONG` | A path component exceeds `NAME_MAX` (255 bytes) or the total path exceeds `PATH_MAX` (4096 bytes) |
/// | `EXDEV` | `LookupFlags::BENEATH`: path escapes `root` via `..`. `LookupFlags::NO_XDEV`: path crosses a mount boundary |
/// | `EINVAL` | Empty path without `LookupFlags::EMPTY_PATH` |
///
/// # Concurrency
///
/// Thread-safe. Multiple threads may call `path_lookup` concurrently. The
/// RCU-walk phase is fully lockless. The ref-walk fallback acquires per-dentry
/// spinlocks and inode `i_rwsem` (shared) as needed.
///
/// # Performance
///
/// Hot path (all components cached, no concurrent mutations): O(n) where n is
/// the number of path components. Each component costs one dentry hash lookup
/// (~5-10ns) plus one `cached_perm` check (~1-3ns). No domain crossings, no
/// locks, no atomic RMW operations.
pub fn path_lookup(
mnt_ns: &MountNamespace,
root: &MountDentry,
pwd: &MountDentry,
path: &[u8],
flags: LookupFlags,
) -> Result<MountDentry, Errno>
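The difference between clamping at `root` and failing with `EXDEV` under `BENEATH` can be seen in a purely lexical model. This is a deliberately simplified sketch: `walk_components` is a hypothetical function, the path is treated as relative to the root boundary, and symlinks, mount crossings, and permission checks are not modeled at all (a real resolver must resolve `..` against actual parent dentries, not strings).

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EXDEV,
    ENAMETOOLONG,
}

/// Walk `path` component by component, tracking the names below the root.
/// With `beneath == false`, `..` at the root is clamped (stays at the root).
/// With `beneath == true`, the same situation fails with `EXDEV`.
pub fn walk_components(path: &str, beneath: bool) -> Result<Vec<String>, Errno> {
    const NAME_MAX: usize = 255;
    let mut stack: Vec<String> = Vec::new();
    for comp in path.split('/') {
        match comp {
            // Empty components (from "//") and "." do not move the walk.
            "" | "." => continue,
            ".." => {
                // `pop` returns None when the walk is already at the root.
                if stack.pop().is_none() && beneath {
                    return Err(Errno::EXDEV);
                }
            }
            name if name.len() > NAME_MAX => return Err(Errno::ENAMETOOLONG),
            name => stack.push(name.to_string()),
        }
    }
    Ok(stack)
}
```

For the input `a/../../b`, the second `..` would ascend above the root: without `beneath` the walk stays at the root and ends at `b`, while with `beneath` it is rejected.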
Credential resolution: path_lookup() accesses the calling task's credentials
via current_task().cred (RCU-protected read) for permission checks at each path
component. This is valid because VFS runs as Tier 1 (Ring 0, shared per-CPU state
with Core). The CpuLocal current_task pointer lives in Nucleus memory, readable
by all Tier 1 domains. No domain crossing is needed to read task credentials.
RESOLVE_IN_ROOT capture timing: The root boundary is captured at the start of
path_lookup() from task.fs.root. If the task's namespace changes between syscall
entry and path_lookup(), the root captured at path_lookup entry is authoritative.
This is consistent with Linux openat2() behavior.
dirfd validity across unshare(CLONE_NEWNS): After unshare(CLONE_NEWNS),
existing file descriptors (including dirfd values) remain valid. The dentry referenced
by a dirfd is in the mount tree — after unshare, the new mount namespace is a copy of
the old, and existing dentries are shared (copy-on-write mount points). Operations
using AT_FDCWD or an explicit dirfd resolve in the calling task's current mount
namespace. The dirfd does not become invalid.
pwd after unshare(CLONE_NEWNS): If the current working directory is unreachable
from the new namespace's root (e.g., the mount point was not copied), the pwd becomes
a "floating" dentry. File operations relative to pwd succeed (the dentry is still
valid). getcwd() returns ENOENT. This matches Linux behavior.
Syscall-to-VFS bridge: Syscall handlers construct the path_lookup call from
the current task's state:
// Example: openat(dirfd, pathname, flags, mode) syscall handler sketch.
// Shows how SyscallContext fields feed into path_lookup. The handler returns
// a Result so that `?` can propagate an Errno; the syscall dispatcher is
// assumed to convert Err(errno) into the negative return value.
fn sys_openat(ctx: &mut SyscallContext) -> Result<i64, Errno> {
    let dirfd = ctx.args[0] as i32;
    let pathname = copy_path_from_user(ctx.args[1] as *const u8, PATH_MAX)?;
    let flags = ctx.args[2] as u32;
    let mode = ctx.args[3] as u32;
    let fs = ctx.task.fs.read();
    let nsproxy = ctx.task.nsproxy.load();
    let mnt_ns = &nsproxy.mount_ns;
    // Determine the base directory for relative paths. `dirfd_base` owns the
    // references taken for an explicit dirfd so that `base` can borrow them.
    let dirfd_base;
    let base = if dirfd == AT_FDCWD {
        &fs.pwd
    } else {
        dirfd_base = fd_to_mount_dentry(dirfd)?;
        &dirfd_base
    };
    let lookup_flags = open_flags_to_lookup_flags(flags);
    let target = path_lookup(mnt_ns, &fs.root, base, &pathname, lookup_flags)?;
    // ... proceed with open using the resolved MountDentry ...
}
Flag translation functions:
/// Translate `open(2)` / `openat(2)` O_* flags to internal LookupFlags.
/// Used by sys_open, sys_openat, and legacy open paths.
fn open_flags_to_lookup_flags(o_flags: u32) -> LookupFlags {
    let mut lf = LookupFlags::FOLLOW; // default: follow terminal symlinks
    if o_flags & O_NOFOLLOW != 0 {
        lf.remove(LookupFlags::FOLLOW);
    }
    if o_flags & O_DIRECTORY != 0 {
        lf |= LookupFlags::DIRECTORY;
    }
    if o_flags & O_CREAT != 0 {
        lf |= LookupFlags::CREATE;
        // O_EXCL is only meaningful together with O_CREAT.
        if o_flags & O_EXCL != 0 {
            lf |= LookupFlags::EXCL;
        }
    }
    // O_PATH needs no lookup flag: it yields a lightweight fd-only reference
    // (no file data access) and resolution itself proceeds as usual. It does
    // not imply EMPTY_PATH, which is reserved for AT_EMPTY_PATH callers.
    lf
}
/// Translate `openat2(2)` resolve flags (from `struct open_how.resolve`) to
/// internal LookupFlags. Called by sys_openat2 AFTER `open_flags_to_lookup_flags()`
/// to layer the RESOLVE_* restrictions on top of the O_* translations.
///
/// Linux `openat2(2)` resolve flag values (from `include/uapi/linux/openat2.h`):
/// RESOLVE_NO_XDEV = 0x01
/// RESOLVE_NO_MAGICLINKS = 0x02
/// RESOLVE_NO_SYMLINKS = 0x04
/// RESOLVE_BENEATH = 0x08
/// RESOLVE_IN_ROOT = 0x10
/// RESOLVE_CACHED = 0x20
///
/// Returns `Err(EINVAL)` if `resolve` contains bits outside the set listed
/// above. `RESOLVE_BENEATH | RESOLVE_IN_ROOT` is accepted here, with
/// `RESOLVE_IN_ROOT`-dominant semantics; Linux `openat2(2)` instead rejects
/// that combination with `EINVAL`.
fn resolve_flags_to_lookup_flags(
base: LookupFlags,
resolve: u64,
) -> Result<LookupFlags, Errno> {
let mut lf = base;
if resolve & RESOLVE_NO_XDEV != 0 {
lf |= LookupFlags::NO_XDEV;
}
if resolve & RESOLVE_NO_MAGICLINKS != 0 {
lf |= LookupFlags::NO_MAGICLINKS;
}
if resolve & RESOLVE_NO_SYMLINKS != 0 {
lf |= LookupFlags::NO_SYMLINKS;
lf.remove(LookupFlags::FOLLOW); // NO_SYMLINKS implies no terminal follow
}
if resolve & RESOLVE_BENEATH != 0 {
lf |= LookupFlags::BENEATH;
}
if resolve & RESOLVE_IN_ROOT != 0 {
lf |= LookupFlags::IN_ROOT;
}
if resolve & RESOLVE_CACHED != 0 {
lf |= LookupFlags::CACHED;
}
// Reject unknown bits (forward compatibility).
let known_bits = RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS
| RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_CACHED;
if resolve & !known_bits != 0 {
return Err(Errno::EINVAL);
}
Ok(lf)
}
14.1.7 Mount Namespace and Capability-Gated Mounting¶
Each process belongs to a mount namespace containing its own mount tree.
Mount operations are capability-gated:
| Operation | Required Capability | Scope |
|---|---|---|
| mount | CAP_MOUNT | Mount namespace |
| bind mount | CAP_MOUNT + read access to source | Mount namespace + source |
| remount | CAP_MOUNT | Mount namespace |
| umount | CAP_MOUNT | Mount namespace |
| pivot_root | CAP_SYS_ADMIN | Mount namespace |
CAP_MOUNT is scoped to the calling process's mount namespace — it does not grant
mount authority in other namespaces. A container with its own mount namespace can mount
filesystems within that namespace without affecting the host.
Mount propagation: Shared, private, slave, and unbindable propagation types, with
the same semantics as Linux (MS_SHARED, MS_PRIVATE, MS_SLAVE, MS_UNBINDABLE).
This is essential for container runtimes that rely on mount propagation for volume mounts.
Filesystem type registration: Only umka-core can register new filesystem types with the VFS. Filesystem drivers request registration via the inter-domain ring, and umka-core verifies the driver's identity and KABI certification before granting registration.
14.1.7.1 Mount Lifecycle¶
The mount(2) syscall drives a multi-step flow that creates or reuses a
SuperBlock, allocates a MountPoint node, and inserts it into the calling
process's mount tree. Each step has a defined rollback on failure, ensuring
no resource leaks.
Mount flow (do_mount) (summary; the canonical do_mount algorithm with full
step ordering is defined in Section 14.6):
1. Lookup filesystem type. Search the global FS_TYPE_TABLE (XArray keyed by filesystem name hash) for the requested fs_type string (e.g., "ext4", "tmpfs"). If not found, return ENODEV.
2. Cgroup device controller check. For block-backed filesystems (source resolves to a block device), check the calling task's cgroup device controller allowlist (Section 17.2). If the device's (major, minor) is not in devices.allow for the task's cgroup, return EPERM. This prevents containers from mounting arbitrary block devices. Pseudo-filesystems (tmpfs, procfs, sysfs) skip this check.
3. Resolve or allocate SuperBlock. For block-backed filesystems, hash the (fs_type, device) pair and search the active superblock table:
   - Existing superblock found: Increment s_refcount. Verify mount flags are compatible (e.g., cannot mount the same device MS_RDONLY and read-write simultaneously). If incompatible, return EBUSY.
   - No existing superblock: Allocate a new SuperBlock from the VFS slab cache. Initialize s_type, s_blocksize, s_flags, s_bdev, and s_uuid with defaults. Set s_refcount = 1.
4. Call FileSystemOps::mount(sb, source, flags, data). The filesystem driver reads the on-disk superblock (for block-backed filesystems), fills the SuperBlock fields (s_blocksize, s_maxbytes, s_root, s_fs_info), performs journal replay if needed (ext4, XFS), and returns. For pseudo-filesystems (tmpfs, procfs), this step populates the root inode and dentry without any block I/O.
   - On error: release the SuperBlock (decrement refcount; if zero, free it). Return the filesystem's error code.
5. Create MountPoint node. Allocate a MountPoint linking:
   - parent: the dentry where this mount is attached (e.g., /mnt/data).
   - source: the device path or source string (e.g., /dev/sda1).
   - sb: the SuperBlock from step 3/4.
   - mount_id: a globally unique monotonic u64 mount identifier (exposed to userspace via STATX_MNT_ID).
   - flags: mount flags (MS_RDONLY, MS_NOSUID, MS_NODEV, etc.).
   - propagation: propagation type (MS_SHARED, MS_PRIVATE, etc.), defaulting to MS_PRIVATE.
   - On error: call FileSystemOps::unmount(sb), release SuperBlock.
6. Bind BackingDevInfo. For block-backed filesystems, associate the BackingDevInfo (BDI) from the block device with the superblock. The BDI controls writeback rate limiting, readahead window defaults, and dirty page accounting per backing device (Section 4.6).
   - For pseudo-filesystems (tmpfs, procfs), a default BDI with no writeback is used.
7. Insert into mount tree. Acquire the mount namespace write lock. Attach the MountPoint as a child of the parent dentry in the namespace's mount tree. Set d_mount_seq on the parent dentry (incremented to invalidate any in-flight RCU-walk lookups that cached the old state). Apply mount propagation rules: if the parent mount is MS_SHARED, replicate the new mount into all peer mount namespaces. Release the mount namespace lock.
   - On error: deallocate MountPoint, call FileSystemOps::unmount(sb), release SuperBlock.
8. Return success. The filesystem is now accessible at the mount point.
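The superblock sharing and rollback behavior in steps 1, 3, and 4 can be modeled in a few lines. This is a toy sketch, not the canonical do_mount from Section 14.6: `Vfs`, its fields, and the `driver_mount` closure (standing in for FileSystemOps::mount) are hypothetical, the cgroup check, BDI binding, and propagation are omitted, and the sketch assumes the driver is only invoked when a fresh superblock is allocated.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    ENODEV,
    EBUSY,
    EIO,
}

pub struct SuperBlock {
    pub refcount: u32,
    pub read_only: bool,
}

#[derive(Default)]
pub struct Vfs {
    /// Registered filesystem type names (stand-in for FS_TYPE_TABLE).
    pub fs_types: Vec<&'static str>,
    /// Active superblocks keyed by (fs_type, device).
    pub superblocks: HashMap<(String, String), SuperBlock>,
    /// Attached mounts as (device, mount point) pairs.
    pub mounts: Vec<(String, String)>,
}

impl Vfs {
    pub fn do_mount<F>(
        &mut self,
        fs_type: &str,
        device: &str,
        target: &str,
        read_only: bool,
        driver_mount: F,
    ) -> Result<(), Errno>
    where
        F: FnOnce() -> Result<(), Errno>,
    {
        // Step 1: look up the filesystem type.
        if !self.fs_types.iter().any(|t| *t == fs_type) {
            return Err(Errno::ENODEV);
        }
        // Step 3: reuse an existing superblock or allocate a new one.
        let key = (fs_type.to_string(), device.to_string());
        match self.superblocks.get_mut(&key) {
            Some(sb) => {
                if sb.read_only != read_only {
                    return Err(Errno::EBUSY);
                }
                sb.refcount += 1;
            }
            None => {
                self.superblocks
                    .insert(key.clone(), SuperBlock { refcount: 1, read_only });
                // Step 4: let the driver fill the superblock.
                if let Err(e) = driver_mount() {
                    // Rollback: drop the only reference, freeing the superblock.
                    self.superblocks.remove(&key);
                    return Err(e);
                }
            }
        }
        // Steps 5 and 7: create the mount node and attach it.
        self.mounts.push((device.to_string(), target.to_string()));
        Ok(())
    }
}
```

After a failed driver call the superblock table is unchanged, which is the "no resource leaks" property the step-by-step rollback rules are meant to guarantee.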
Unmount flow (do_umount):
1. Check reference count. If the mount has active open files, child mounts, or CWD references, return EBUSY (unless MNT_FORCE or MNT_DETACH is specified).
2. Detach from mount tree. Acquire the mount namespace write lock. Remove the MountPoint from its parent's child list. Increment d_mount_seq on the parent dentry. For MNT_DETACH (lazy umount), the mount is detached from the namespace tree immediately but the SuperBlock is kept alive until all references are released.
3. Sync dirty data. Call FileSystemOps::sync_fs(sb, wait=true) to flush all dirty pages and metadata. This invokes the writeback thread (Section 4.6) for the superblock's BDI. For MNT_FORCE, skip the sync and proceed with best-effort teardown (in-flight I/O is drained with -EIO).
4. Tear down SuperBlock. Decrement s_refcount. If the refcount reaches zero (no other mounts share this superblock):
   a. Evict all inodes: walk s_inodes, call inode eviction sequence (writeback dirty pages, InodeOps::evict_inode, remove from sb.inode_cache XArray).
   b. Call FileSystemOps::unmount(sb): the filesystem flushes its journal, writes the clean-unmount marker, and releases s_fs_info.
   c. Release s_bdev reference (if block-backed).
   d. Free the SuperBlock slab object.
5. Deallocate MountPoint. Free the MountPoint slab object.
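The interaction between the busy check, lazy detach, and superblock teardown can be sketched as follows. This is an illustrative model only: `Mount`, `SuperBlock`, and their fields are hypothetical simplifications, syncing and inode eviction are not modeled, and the flag values are the Linux `umount2(2)` values for `MNT_FORCE` and `MNT_DETACH`.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EBUSY,
}

pub const MNT_FORCE: u32 = 0x1;
pub const MNT_DETACH: u32 = 0x2;

/// Toy mount: `users` counts open files, child mounts, and CWD references.
pub struct Mount {
    pub users: u32,
    pub attached: bool,
}

/// Toy superblock shared by one or more mounts.
pub struct SuperBlock {
    pub refcount: u32,
    pub torn_down: bool,
}

pub fn do_umount(m: &mut Mount, sb: &mut SuperBlock, flags: u32) -> Result<(), Errno> {
    // Step 1: a busy mount is refused unless a force or lazy flag is given.
    if m.users > 0 && (flags & (MNT_FORCE | MNT_DETACH)) == 0 {
        return Err(Errno::EBUSY);
    }
    // Step 2: detach from the namespace tree.
    m.attached = false;
    // Lazy umount of a busy mount: teardown is deferred until users drop.
    if m.users > 0 && (flags & MNT_DETACH) != 0 {
        return Ok(());
    }
    // Step 4: drop this mount's superblock reference; the last one tears down.
    sb.refcount -= 1;
    if sb.refcount == 0 {
        sb.torn_down = true;
    }
    Ok(())
}
```

In this model a lazily detached mount disappears from the tree immediately while its superblock reference, and therefore the superblock, stays alive.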
Force unmount (umount2 with MNT_FORCE): Calls
FileSystemOps::force_umount(sb), which aborts in-flight I/O with -EIO
and skips journal commit. Used when a network filesystem server is
unreachable or a device has been physically removed. Data loss may occur
for unflushed dirty pages.
Remount (mount -o remount): Does not create a new MountPoint.
Instead, calls FileSystemOps::remount(sb, new_flags, data) to update
mount options on the existing superblock. The VFS validates flag
transitions (e.g., MS_RDONLY → read-write requires CAP_MOUNT and a
journal replay check).
See Section 14.5 for the character/block device node framework
(chrdev/blkdev registration, major number table, devtmpfs automatic /dev node lifecycle).
14.1.7.2 ML Policy Integration for VFS¶
The VFS subsystem emits observations and exposes tunable parameters through the ML policy framework (Section 23.1). This enables closed-loop optimization of readahead, writeback scheduling, and dirty page throttling.
Observation hooks: The following observe_kernel! call sites are placed in
VFS hot/warm paths. Each call is zero-cost (NOP) when no policy service consumer
is attached (static key patching; see Section 23.1).
| Call site | Subsystem | Observation type | Path class | Data emitted |
|---|---|---|---|---|
| filemap_get_pages() cache miss | VfsLayer | VfsObs::PageCacheMiss | Hot | (ino, file_offset, ra_window_size, sequential: bool) |
| filemap_get_pages() cache hit | VfsLayer | VfsObs::PageCacheHit | Hot | (ino, file_offset); sampled at 1/64 rate to bound overhead |
| generic_file_write_iter() | VfsLayer | VfsObs::BufferedWrite | Hot | (ino, bytes_written, dirty_pages_after); sampled at 1/16 |
| writeback_single_inode() completion | VfsLayer | VfsObs::WritebackComplete | Warm | (ino, pages_written, elapsed_us, sequential_ratio) |
| balance_dirty_pages() throttle | VfsLayer | VfsObs::DirtyThrottle | Warm | (bdi_id, dirty_pages, dirty_limit, throttle_ms) |
| page_cache_readahead() trigger | VfsLayer | VfsObs::ReadaheadTrigger | Warm | (ino, start_offset, window_pages, sequential: bool) |
| Dentry cache miss in path_lookup() | VfsLayer | VfsObs::DentryCacheMiss | Hot | (parent_ino, name_hash); sampled at 1/32 |
| VFS ring request enqueue | VfsLayer | VfsObs::RingEnqueue | Hot | (mount_id, opcode, ring_index); sampled at 1/128 |
| path_lookup() completion | VfsLayer | VfsObs::PathLookupLatency | Hot | (path_components, elapsed_ns, rcu_walk_success: bool); sampled at 1/64. Measures end-to-end path resolution latency including mount crossings and symlink follows. RCU-walk success rate is a key metric: low success rate indicates contention forcing ref-walk fallbacks. |
| select_ring() → response dequeue | VfsLayer | VfsObs::RingUtilization | Warm | (mount_id, ring_index, ring_depth, pending_slots, response_latency_ns); emitted on every response dequeue. Measures ring fill level and round-trip latency. High pending_slots/ring_depth ratio signals the ring is saturated and ring count should be increased (or ring depth enlarged). |
| Readahead completion audit | VfsLayer | VfsObs::ReadaheadHitRate | Warm | (bdi_id, window_pages, pages_used_before_eviction, hit_ratio_pct); emitted when a readahead window is fully consumed or evicted. Tracks how many prefetched pages were actually accessed before eviction. Low hit rate means the readahead window is oversized (wasting memory and I/O bandwidth). |
/// VFS-specific observation types for the ML policy framework.
/// Used as the `obs_type` field in `observe_kernel!` calls.
#[repr(u16)]
pub enum VfsObs {
/// Page cache miss — readahead evaluation opportunity.
PageCacheMiss = 0,
/// Page cache hit — confirms readahead effectiveness.
PageCacheHit = 1,
/// Buffered write completion — dirty page accumulation signal.
BufferedWrite = 2,
/// Writeback completion for a single inode.
WritebackComplete = 3,
/// Dirty page throttling engaged — backpressure signal.
DirtyThrottle = 4,
/// Readahead triggered — window sizing feedback.
ReadaheadTrigger = 5,
/// Dentry cache miss — path resolution pressure signal.
DentryCacheMiss = 6,
/// VFS ring enqueue — cross-domain I/O pressure signal.
RingEnqueue = 7,
/// Path resolution end-to-end latency — RCU-walk success rate signal.
PathLookupLatency = 8,
/// Ring utilization — fill level and response latency signal.
RingUtilization = 9,
/// Readahead hit rate — window sizing effectiveness feedback.
ReadaheadHitRate = 10,
}
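The "sampled at 1/N" rates in the observation table can be implemented with a counter and a power-of-two mask. The sketch below is one possible approach, not the `observe_kernel!` implementation: `Sampler` is a hypothetical type, and a single shared atomic counter is used for brevity (a hot-path implementation would more plausibly keep the counter per CPU to avoid cache-line contention).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Emits one event out of every `rate` calls.
pub struct Sampler {
    counter: AtomicU64,
    mask: u64,
}

impl Sampler {
    /// `rate` must be a power of two (e.g. 64 for 1/64 sampling).
    pub fn new(rate: u64) -> Self {
        assert!(rate.is_power_of_two(), "sampling rate must be a power of two");
        Sampler { counter: AtomicU64::new(0), mask: rate - 1 }
    }

    /// Returns true when this call should emit an observation.
    pub fn should_emit(&self) -> bool {
        // fetch_add returns the previous value, so the very first call emits.
        (self.counter.fetch_add(1, Ordering::Relaxed) & self.mask) == 0
    }
}
```

With a rate of 64, calls number 0, 64, 128, and so on emit; all others return immediately after one relaxed increment.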
Tunable parameters: The following VFS parameters are registered in the Kernel Tunable Parameter Store (Section 23.1). ParamId values are allocated in the I/O Scheduler range (0x0300-0x03FF) since VFS readahead and writeback are I/O-adjacent. Each parameter has a default, bounds, and a cooldown period to prevent oscillation.
| ParamId | Name | Default | Min | Max | Cooldown | Description |
|---|---|---|---|---|---|---|
| IoReadaheadPages (0x0300) | readahead_pages | 32 | 1 | 512 | 30s | Per-BDI max readahead window in pages |
| 0x0303 | vfs_dirty_ratio_pct | 20 | 5 | 80 | 60s | vm.dirty_ratio equivalent: percentage of total memory that can be dirty before synchronous writeback |
| 0x0304 | vfs_dirty_bg_ratio_pct | 10 | 1 | 50 | 60s | vm.dirty_background_ratio equivalent: background writeback trigger threshold |
| 0x0305 | vfs_writeback_interval_cs | 500 | 100 | 6000 | 30s | Writeback timer interval in centiseconds (default 5s = 500cs) |
| 0x0306 | vfs_ra_sequential_threshold | 4 | 1 | 32 | 30s | Number of sequential page accesses before readahead window doubles |
| 0x0307 | vfs_ring_coalesce_batch | 8 | 1 | 64 | 10s | Default VFS ring doorbell coalescing batch size for regular I/O |
| 0x0308 | vfs_ring_coalesce_timeout_us | 20 | 1 | 200 | 10s | Default VFS ring doorbell coalescing timeout in microseconds |
| 0x0309 | vfs_completion_coalesce_batch | 8 | 1 | 32 | 10s | Response-direction completion coalescing batch size (Section 14.3). Number of completions batched before waking the VFS consumer. |
| 0x030A | vfs_completion_coalesce_timeout_us | 10 | 1 | 100 | 10s | Response-direction completion coalescing timeout in microseconds. Bounds worst-case latency for sparse completion streams. |
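The default, bounds, and cooldown columns suggest a simple guard around each parameter update. The following sketch shows one way such a guard could behave; `Tunable`, `AdjustError`, and `adjust` are hypothetical names, and rejecting out-of-range values (rather than clamping them) is an assumption, since the table does not say which policy the Parameter Store applies.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum AdjustError {
    OutOfBounds,
    CoolingDown,
}

/// One tunable parameter with bounds and an anti-oscillation cooldown.
pub struct Tunable {
    pub value: u32,
    pub min: u32,
    pub max: u32,
    pub cooldown: Duration,
    pub last_change: Option<Instant>,
}

impl Tunable {
    pub fn new(default: u32, min: u32, max: u32, cooldown: Duration) -> Self {
        Tunable { value: default, min, max, cooldown, last_change: None }
    }

    /// Apply `new_value` at time `now` if it is in bounds and the cooldown
    /// since the previous accepted change has elapsed.
    pub fn adjust(&mut self, new_value: u32, now: Instant) -> Result<(), AdjustError> {
        if new_value < self.min || new_value > self.max {
            return Err(AdjustError::OutOfBounds);
        }
        if let Some(prev) = self.last_change {
            if now.duration_since(prev) < self.cooldown {
                return Err(AdjustError::CoolingDown);
            }
        }
        self.value = new_value;
        self.last_change = Some(now);
        Ok(())
    }
}
```

Using the readahead_pages row as an example (default 32, range 1 to 512, 30s cooldown), a second adjustment ten seconds after the first is refused, while one thirty seconds later is accepted.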
Closed-loop example — readahead window auto-tuning:
1. Policy service observes VfsObs::PageCacheMiss and VfsObs::ReadaheadTrigger on a per-BDI basis. High miss rate after readahead suggests the window is too small.
2. Policy service computes the optimal readahead_pages using the PID controller (Section 23.1). Target metric: page cache hit rate > 95% for sequential workloads.
3. Policy service sends a ParamAdjust { param_id: IoReadaheadPages, value: N } message.
4. The VFS readahead engine reads the updated value via PARAM_STORE.get(IoReadaheadPages) on the next readahead evaluation (warm path, no hot-path overhead).
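One controller iteration of this loop might look like the sketch below. It is a generic textbook PID step, not the controller defined in Section 23.1: the `Pid` type, the gains, and `next_readahead_pages` are hypothetical, and only the 0.95 hit-rate target and the [1, 512] bounds come from the text above.

```rust
/// Minimal discrete PID controller (unit time step assumed).
pub struct Pid {
    pub kp: f64,
    pub ki: f64,
    pub kd: f64,
    integral: f64,
    prev_err: f64,
}

impl Pid {
    pub fn new(kp: f64, ki: f64, kd: f64) -> Self {
        Pid { kp, ki, kd, integral: 0.0, prev_err: 0.0 }
    }

    /// Returns the correction for one observation interval.
    pub fn step(&mut self, target: f64, measured: f64) -> f64 {
        let err = target - measured;
        self.integral += err;
        let deriv = err - self.prev_err;
        self.prev_err = err;
        self.kp * err + self.ki * self.integral + self.kd * deriv
    }
}

/// Turn a correction (in pages) into the next readahead_pages value,
/// clamped to the parameter's documented bounds.
pub fn next_readahead_pages(current: u32, correction: f64) -> u32 {
    let proposed = (current as f64 + correction).round();
    proposed.clamp(1.0, 512.0) as u32
}
```

With a purely proportional gain of 100 (an arbitrary illustrative value), a measured hit rate of 0.80 against the 0.95 target yields a correction of about 15 pages, so a 32-page window would grow to 47; the policy service would then send that value in a ParamAdjust message.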
Phase assignment: VFS observation hooks are Phase 3 (functional without ML; ML provides optimization). Parameter registration is Phase 2 (parameters are readable by sysctl even without a policy service).
14.2 VFS Ring Buffer Protocol (Cross-Domain Dispatch)¶
The tier model (Section 11.3) requires ALL cross-domain communication to use ring
buffer IPC. However, the FileSystemOps, InodeOps, and FileOps traits defined
in Section 14.1 use direct Rust function call signatures. This section specifies how trait
method calls are marshaled across the isolation domain boundary between umka-core
(VFS layer) and Tier 1 filesystem drivers.
Architecture: Each mounted filesystem has a dedicated request/response ring pair:
/// Maximum inline I/O data size in bytes. Reads/writes at or below this
/// threshold carry data inline in the ring entry, avoiding DMA buffer
/// allocation and IOMMU mapping. Covers >90% of procfs/sysfs reads.
///
/// 192 bytes fits within the ring entry without bloating large-I/O variants
/// (the `VfsRequestArgs` union is already dominated by `SetXattr` at ~280 bytes).
/// Saves ~150-300ns per small I/O by eliminating DMA alloc/free + IOMMU map/unmap.
pub const INLINE_IO_MAX: usize = 192;
/// Sentinel value for `DmaBufferHandle` indicating no DMA buffer is
/// allocated. Used by the inline small I/O path: when `buf == ZERO`,
/// data is carried inline in the ring entry (request's `inline_data` for
/// writes, response's `inline_data` for reads). The driver checks
/// `buf == DmaBufferHandle::ZERO` to select the inline path.
impl DmaBufferHandle {
/// All-zero sentinel indicating no DMA buffer is allocated.
/// Pool ID 0 is reserved as invalid and never allocated by the DMA
/// buffer pool allocator ([Section 4.14](04-memory.md#dma-subsystem)). Combined with
/// `iova_base = 0` (page 0 is never mapped by any IOMMU implementation),
/// `ZERO` is guaranteed to never match any valid handle.
pub const ZERO: Self = DmaBufferHandle { pool_id: 0, generation: 0, offset: 0, iova_base: 0 };
}
/// VFS-specific ring buffer. Extends `DomainRingBuffer`
/// ([Section 11.8](11-drivers.md#ipc-architecture-and-message-passing)) with per-slot state tracking
/// for the split reservation protocol (`EMPTY -> RESERVED -> FILLED ->
/// CONSUMED -> EMPTY`) and a per-ring in-flight operation counter for
/// crash recovery quiescence.
///
/// The `DomainRingBuffer` header occupies 128 bytes (2 cache lines).
/// The `slot_states` array is allocated contiguously after the ring data
/// region. The `inflight_ops` counter is used by crash recovery
/// (`drain_all_vfs_rings()`) and live evolution quiescence to wait for
/// producers to complete their current operations before draining.
///
/// **Relationship to DomainRingBuffer**: `RingBuffer<T>` composes (not
/// inherits) `DomainRingBuffer`. All ring pointer fields (`head`, `tail`,
/// `published`, `state`) are accessed through `inner`. The generic
/// parameter `T` is the entry type (`VfsRequest` or `VfsResponseWire`)
/// for type-safe entry access via `read_entry()`.
// kernel-internal, not KABI
pub struct RingBuffer<T> {
/// Base ring buffer header (128 bytes) + data region.
/// Contains `head`, `tail`, `published`, `state`, `size`, `entry_size`.
pub inner: DomainRingBuffer,
/// Per-slot state for the split reservation protocol.
/// Length: `inner.size` entries. Allocated contiguously after the
/// ring data region. Each entry is an `AtomicU8` holding one of
/// the `RingSlotState` values (`Empty`, `Reserved`, `Filled`,
/// `Consumed`).
///
/// SAFETY: Pointer is valid for `inner.size` elements. Allocated
/// from kernel slab at mount time, valid for the lifetime of the
/// mount. Freed during umount after all rings are drained.
pub slot_states: *const AtomicU8,
/// Number of in-flight producer operations on this ring. Incremented
/// in `reserve_slot()` after successful slot reservation (after CAS
/// on `inner.head`). Decremented in `complete_slot()` after marking
/// the slot `FILLED` and advancing `inner.published`.
///
/// Used by crash recovery (`drain_all_vfs_rings()`) and live evolution
/// quiescence to wait for all producers to complete before draining.
/// The counter reaching zero guarantees no producer is between
/// `reserve_slot()` and `complete_slot()` (i.e., no producer is
/// mid-`copy_from_user` with a RESERVED slot).
///
/// `AtomicU32`: maximum concurrent producers per ring is bounded by
/// CPU count (≤256 for MAX_VFS_RINGS). `u32` is sufficient.
pub inflight_ops: AtomicU32,
/// Phantom type for entry type safety.
_marker: core::marker::PhantomData<T>,
}
impl<T> RingBuffer<T> {
/// Read a typed entry at the given slot index.
///
/// SAFETY: `idx` must be < `self.inner.size`. The caller must ensure
/// the slot contains valid data (slot_state == FILLED or the ring is
/// being drained during crash recovery with all producers quiesced).
pub unsafe fn read_entry(&self, idx: usize) -> &T {
debug_assert!(idx < self.inner.size as usize);
let data_base = (&self.inner as *const DomainRingBuffer as *const u8)
.add(size_of::<DomainRingBuffer>());
&*(data_base.add(idx * self.inner.entry_size as usize) as *const T)
}
}
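The slot lifecycle described for `RingBuffer<T>` (`EMPTY -> RESERVED -> FILLED -> CONSUMED -> EMPTY`) and the `inflight_ops` counter can be modelled in isolation. This is a simplified sketch, assuming per-slot CAS transitions only. The actual `reserve_slot()` and `complete_slot()` also advance `head` and `published`, which are omitted here. `SlotTracker` and its method names are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};

const EMPTY: u8 = 0;
const RESERVED: u8 = 1;
const FILLED: u8 = 2;
const CONSUMED: u8 = 3;

/// Simplified stand-in for the `slot_states` array plus `inflight_ops`.
pub struct SlotTracker {
    states: Vec<AtomicU8>,
    inflight_ops: AtomicU32,
}

impl SlotTracker {
    pub fn new(size: usize) -> Self {
        SlotTracker {
            states: (0..size).map(|_| AtomicU8::new(EMPTY)).collect(),
            inflight_ops: AtomicU32::new(0),
        }
    }

    /// Attempt one state transition; fails if the slot is not in `from`.
    fn step(&self, idx: usize, from: u8, to: u8) -> bool {
        self.states[idx]
            .compare_exchange(from, to, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Producer: EMPTY -> RESERVED, then count the operation as in flight.
    pub fn reserve(&self, idx: usize) -> bool {
        let ok = self.step(idx, EMPTY, RESERVED);
        if ok {
            self.inflight_ops.fetch_add(1, Ordering::AcqRel);
        }
        ok
    }

    /// Producer: RESERVED -> FILLED, then drop the in-flight count.
    pub fn complete(&self, idx: usize) -> bool {
        let ok = self.step(idx, RESERVED, FILLED);
        if ok {
            self.inflight_ops.fetch_sub(1, Ordering::AcqRel);
        }
        ok
    }

    /// Consumer: FILLED -> CONSUMED.
    pub fn consume(&self, idx: usize) -> bool {
        self.step(idx, FILLED, CONSUMED)
    }

    /// Consumer: CONSUMED -> EMPTY, making the slot reusable.
    pub fn recycle(&self, idx: usize) -> bool {
        self.step(idx, CONSUMED, EMPTY)
    }

    /// True when no producer sits between reserve and complete.
    pub fn quiesced(&self) -> bool {
        self.inflight_ops.load(Ordering::Acquire) == 0
    }
}
```

A drain routine in this model would poll `quiesced()` before touching any slot, mirroring the quiescence guarantee described for `inflight_ops`.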
/// Per-mount ring buffer pair for VFS <-> filesystem driver communication.
///
/// The VFS (in umka-core) enqueues requests on `request_ring`; the filesystem
/// driver dequeues, processes, and enqueues responses on `response_ring`.
/// Both rings are in shared memory (PKEY 1 on x86-64 — read-only for both
/// domains; actual data in PKEY 14 shared DMA pool).
pub struct VfsRingPair {
/// Request ring: VFS -> filesystem driver. Ring size: 256 entries
/// (configurable per-mount via mount options).
///
/// **Producer model**: Under PerCpu granularity, each ring has exactly
/// one producer (pure SPSC). Under PerNuma/PerLlc/Fixed granularity,
/// multiple CPUs may share a ring; the producer side uses a CAS loop
/// on `head` for atomic slot reservation (see `reserve_slot()` in
/// [Section 14.3](#vfs-per-cpu-ring-extension)). The consumer side is always single-
/// threaded per ring (driver consumer thread).
pub request_ring: RingBuffer<VfsRequest>,
/// Whether this ring is shared by multiple CPUs. Set at mount time
/// based on the ring granularity and CPU-to-ring mapping. When `true`,
/// `reserve_slot()` uses a CAS loop on `head` for atomic slot
/// allocation. When `false` (PerCpu mode), `reserve_slot()` uses a
/// simple load/store on `head` (no contention possible).
///
/// Invariant: `shared_ring == false` implies exactly one CPU maps to
/// this ring in the `cpu_to_ring` table. This is verified at mount time.
pub shared_ring: bool, // Kernel-internal, not KABI.
/// Response ring: filesystem driver -> VFS. SPSC (driver produces, VFS
/// consumes). Same size as request ring.
pub response_ring: RingBuffer<VfsResponseWire>,
/// Doorbell: the VFS writes here to signal request availability to the filesystem driver.
/// Uses the doorbell coalescing mechanism (Section 11.5.1.1) to batch
/// notifications when multiple requests are enqueued.
pub doorbell: DoorbellRegister,
/// Completion WaitQueue: VFS threads wait here when a synchronous
/// operation needs a response. Multiple threads may be blocked on
/// the same WaitQueue simultaneously (one per in-flight request).
///
/// **Response matching protocol** (request_id -> waiting thread):
///
/// 1. **Submit**: The VFS caller allocates a `request_id` from
/// `VfsRingPair.next_request_id` (AtomicU64, fetch_add(1, Relaxed)),
/// stores it in `VfsRequest.request_id`, enqueues the request on
/// the request ring, and parks on `completion` via `wait_event!`.
///
/// 2. **Wait condition**: The caller's `wait_event!` condition checks:
/// ```rust
/// wait_event!(ring.completion, {
/// // Check for crash recovery (ring set entered RECOVERING state).
/// if ring_set.state.load(Acquire) == VFSRS_RECOVERING {
/// return Err(EIO);
/// }
/// // Check if our response has been deposited in the response table.
/// ring.response_table.contains(our_request_id)
/// });
/// ```
///
/// 3. **Completion**: The VFS response consumer thread (running in
/// Tier 0) drains the response ring, reads each `VfsResponseWire`,
/// and deposits it into `response_table` keyed by `request_id`.
/// After depositing, it calls `completion.wake_up_all()`. Each
/// woken thread re-evaluates its wait_event condition: only the
/// thread whose `request_id` is in `response_table` proceeds.
/// Others re-sleep. This is the standard "thundering herd with
/// condition recheck" pattern — acceptable because VFS rings are
/// per-CPU (PerCpu mode: one thread at a time) or per-NUMA
/// (bounded contention).
///
/// 4. **Retrieval**: The woken thread calls
/// `response_table.remove(our_request_id)` to extract its
/// `VfsResponseWire`, processes the result, and returns.
pub completion: WaitQueue,
/// Per-ring response table: maps request_id -> VfsResponseWire.
/// Used by the response matching protocol to deliver responses to
/// the correct waiting thread. XArray keyed by request_id (u64).
///
/// Populated by the VFS response consumer thread (Tier 0) as it
/// drains the response ring. Consumed by waiting threads after
/// `wake_up_all()` signals availability.
///
/// Bounded size: at most `request_ring.inner.size` entries (one per
/// in-flight request slot). Entries are removed by the waiting thread
/// after retrieval, so the table does not grow unboundedly.
pub response_table: XArray<VfsResponseWire>,
/// Monotonically increasing request ID allocator. Each VFS caller
/// gets a unique request_id for response matching.
/// u64: at 10^9 requests/sec, wraps in 584 years.
pub next_request_id: AtomicU64,
}
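The four-step response matching protocol documented on `completion` (submit, wait condition, completion, retrieval) can be approximated in user-space Rust. This is only a sketch under stated assumptions: `Mutex<HashMap>` stands in for the `XArray`, `Condvar` for the `WaitQueue`, an `AtomicBool` for the `VFSRS_RECOVERING` state, and `Matcher` and `Response` are hypothetical names. The errno value 5 is Linux `EIO`.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Condvar, Mutex};

#[derive(Debug, PartialEq)]
pub struct Response {
    pub request_id: u64,
    pub status: i64,
}

pub struct Matcher {
    next_request_id: AtomicU64,
    table: Mutex<HashMap<u64, Response>>, // stand-in for `response_table`
    completion: Condvar,                  // stand-in for the WaitQueue
    recovering: AtomicBool,               // stand-in for VFSRS_RECOVERING
}

impl Matcher {
    pub fn new() -> Self {
        Matcher {
            next_request_id: AtomicU64::new(1),
            table: Mutex::new(HashMap::new()),
            completion: Condvar::new(),
            recovering: AtomicBool::new(false),
        }
    }

    /// Step 1: allocate a request ID (uniqueness only, no ordering implied).
    pub fn alloc_id(&self) -> u64 {
        self.next_request_id.fetch_add(1, Ordering::Relaxed)
    }

    /// Step 3: consumer deposits a response, then wakes every waiter.
    pub fn deposit(&self, r: Response) {
        self.table.lock().unwrap().insert(r.request_id, r);
        self.completion.notify_all();
    }

    /// Crash recovery: set the flag under the lock so no waiter misses it.
    pub fn begin_recovery(&self) {
        let _guard = self.table.lock().unwrap();
        self.recovering.store(true, Ordering::Release);
        self.completion.notify_all();
    }

    /// Steps 2 and 4: re-check the condition after each wakeup, then remove
    /// the matching entry. Returns Err(5) (EIO) once recovery has begun.
    pub fn wait(&self, id: u64) -> Result<Response, i32> {
        let mut table = self.table.lock().unwrap();
        loop {
            if self.recovering.load(Ordering::Acquire) {
                return Err(5);
            }
            if let Some(r) = table.remove(&id) {
                return Ok(r);
            }
            table = self.completion.wait(table).unwrap();
        }
    }
}
```

Waiters whose ID is absent simply go back to sleep, which is the "condition recheck" half of the pattern; the table stays bounded because each waiter removes its own entry.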
/// VFS request message. Serialized representation of a trait method call.
/// Fixed-size header + variable-length payload.
///
/// **Layout**: The header fields (`request_id`, `opcode`, `ino`, `fh`) are
/// followed by the tagged-union `args` payload. The `opcode` field is `u32`
/// (from `VfsOpcode`); `_pad_opcode` provides explicit padding to maintain
/// natural alignment of the subsequent `u64` fields. This prevents
/// information disclosure via implicit compiler-inserted padding bytes.
///
/// **Size**: Header is 32 bytes + `VfsRequestArgs` (288 bytes; the largest
/// variant is `SetXattr`). Total entry size is 320 bytes (see per-CPU ring
/// extension memory analysis). The `const_assert!`s below verify both the
/// header offset and the total size.
#[repr(C)]
pub struct VfsRequest {
/// Unique request ID for matching responses. Globally unique per mount
/// (allocated from `VfsRingSet::next_request_id`). IDs are unique but
/// not necessarily monotonic within a single ring — two CPUs sharing a
/// ring may allocate IDs before ring slot assignment, so a lower ID can
/// appear in a later slot. The protocol uses IDs for response matching
/// only, not ordering. u64 counter: at 10M ops/sec, wraps after
/// ~58,000 years (well beyond the 50-year uptime target). No wrap
/// handling needed.
pub request_id: u64,
/// Operation code identifying the trait method.
pub opcode: VfsOpcode,
/// Explicit padding after the u32 `opcode` to align `ino` to 8 bytes.
/// Must be zero. Prevents information disclosure from implicit padding.
pub _pad_opcode: u32,
/// Inode number (for InodeOps/FileOps calls). 0 for FileSystemOps calls.
pub ino: u64,
/// File handle (for FileOps calls). u64::MAX for non-file operations.
pub fh: u64,
/// Operation-specific arguments. The variant must match `opcode`.
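/// Variable-length data that exceeds the inline limits (names longer
/// than 255 bytes, xattr values, writes larger than `INLINE_IO_MAX`)
/// is passed via shared DMA buffer references embedded in the variant.
/// Short names (`KernelString`) and small writes (`inline_data`) are
/// carried inline in the ring entry.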
///
/// The VFS dispatcher validates that the `args` variant matches
/// `opcode` before dispatching; a mismatch is a kernel bug. Debug
/// builds panic; release builds skip the dispatch and return an
/// error response.
pub args: VfsRequestArgs,
}
// Verify header layout: request_id(8) + opcode(4) + pad(4) + ino(8) + fh(8) = 32.
const_assert!(core::mem::offset_of!(VfsRequest, args) == 32);
// Verify total size: header(32) + VfsRequestArgs(288, largest variant SetXattr) = 320.
// VfsRequestArgs = 4 (discriminant) + 256 (KernelString) + 4 (padding) + 16 (DmaBufferHandle)
// + 4 (value_len) + 4 (flags) = 288 bytes, aligned to 8.
//
// Memory tradeoff: 320 bytes per entry × 256 entries = 80 KiB per request
// ring. With one ring per CPU on a 64-CPU system, request rings total
// 64 × 80 KiB = 5 MiB per mount. The inline_data optimization
// (192 bytes embedded in the Write variant) eliminates DMA alloc + IOMMU mapping
// for small I/O (saving ~150-300 ns per operation). Ring memory is pinned DMA
// pages pre-allocated from the shared DMA pool — these pages are committed
// regardless of entry size and cannot be used for other purposes.
const_assert!(core::mem::size_of::<VfsRequest>() == 320);
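The size arithmetic in the comments above can be checked with stand-in types. The sketch below assumes a 16-byte `Handle` with field widths `u16, u16, u32, u64` (the real `DmaBufferHandle` layout is defined in Section 4.14) and models only the two largest-looking variants. All type names here are hypothetical.

```rust
use std::mem::size_of;

/// Assumed 16-byte handle layout (field widths are a guess).
#[repr(C)]
pub struct Handle {
    pool_id: u16,
    generation: u16,
    offset: u32,
    iova_base: u64,
}

/// Mirrors `KernelString`: 1-byte length + 255 bytes of inline data.
#[repr(C)]
pub struct Name {
    len: u8,
    data: [u8; 255],
}

/// Reduced model of `VfsRequestArgs` with a u32 discriminant.
#[repr(C, u32)]
pub enum Args {
    Empty {},
    // tag(4) + name(256) + pad(4) + value(16) + value_len(4) + flags(4) = 288
    SetXattr { name: Name, value: Handle, value_len: u32, flags: u32 },
    // tag(4) + pad(4) + buf(16) + offset(8) + count(4) + inline(192) + pad(4) = 232
    Write { buf: Handle, offset: u64, count: u32, inline_data: [u8; 192] },
}

/// Reduced model of `VfsRequest`: 32-byte header followed by the args.
#[repr(C)]
pub struct Request {
    request_id: u64,
    opcode: u32,
    _pad_opcode: u32,
    ino: u64,
    fh: u64,
    args: Args,
}

pub const ENTRY_SIZE: usize = size_of::<Request>();
pub const RING_ENTRIES: usize = 256;
pub const RING_BYTES: usize = ENTRY_SIZE * RING_ENTRIES;

/// Request-ring memory for a mount with `rings` rings.
pub fn per_mount_bytes(rings: usize) -> usize {
    RING_BYTES * rings
}
```

Under these assumptions `SetXattr` determines the enum size, so enlarging `INLINE_IO_MAX` beyond 248 bytes would start to grow every ring entry.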
/// Per-opcode argument payload for a `VfsRequest`.
///
/// `#[repr(C, u32)]` tagged union: the discriminant is a `u32` matching
/// `VfsOpcode`, and each variant is an independent `#[repr(C)]` struct.
/// This ensures a stable ABI across the Tier 0 / Tier 1 KABI boundary
/// (zero-copy ring, matching the io_uring SQE pattern). The `VfsRequest.opcode`
/// field in the header serves as the authoritative discriminant; the
/// in-union discriminant is redundant but guarantees Rust's safety
/// invariant (no invalid discriminant UB).
///
/// Every `VfsOpcode` variant has a corresponding `VfsRequestArgs` variant
/// with the exact parameters that the trait method requires. Variants
/// that carry no extra data beyond what is already in the `VfsRequest`
/// header (opcode, ino, fh) use an empty body `{}`.
///
/// **Inline string limits**: `KernelString` holds up to 255 bytes. Names
/// longer than 255 bytes (possible on some exotic filesystems) must be
/// passed via a `DmaBufferHandle` placed in the `buf` field of the
/// relevant variant; the VFS sets the string `len` to 0 as a sentinel in
/// that case.
///
/// **Caller contract**: The caller fills `VfsRequest { opcode, args, .. }`
/// and enqueues it on `request_ring`. The VFS dispatcher validates that
/// the `args` variant matches `opcode` before dispatching to the
/// filesystem driver.
#[repr(C, u32)]
pub enum VfsRequestArgs {
// ---------------------------------------------------------------
// FileSystemOps
// ---------------------------------------------------------------
/// `FileSystemOps::mount`. No extra args; mount options are passed
/// via a separate `DmaBufferHandle` in the ring header.
Mount {},
/// `FileSystemOps::unmount`. Graceful unmount; all dirty data must
/// be flushed before the response is sent.
Unmount {},
/// `FileSystemOps::force_unmount`. Best-effort: abandon in-flight
/// I/O and free resources.
ForceUnmount {},
/// `FileSystemOps::statfs`. No per-call arguments.
Statfs {},
/// `FileSystemOps::sync_fs`. `wait` controls whether the driver
/// must block until all I/O is complete (`1`) or may return once
/// I/O is queued (`0`).
SyncFs { wait: u8 }, // 0 = no-wait, 1 = wait. u8 for cross-domain safety.
/// `FileSystemOps::remount`. New flags; updated option string is in
/// a `DmaBufferHandle` in the ring header.
Remount { flags: u32 },
/// `FileSystemOps::freeze`. Quiesce all writes for snapshotting.
Freeze {},
/// `FileSystemOps::thaw`. Resume writes after a freeze.
Thaw {},
// ---------------------------------------------------------------
// InodeOps
// ---------------------------------------------------------------
/// `InodeOps::lookup`. Look up `name` in the directory identified
/// by `VfsRequest::ino`.
Lookup { name: KernelString },
/// `InodeOps::create`. Create a regular file named by the dentry
/// already allocated by the VFS. `mode` is the combined file-type
/// and permission bits.
Create { mode: FileMode },
/// `InodeOps::link`. Create a hard link whose new name is
/// `new_name` inside the directory inode of the request.
Link { src_ino: u64, new_name: KernelString },
/// `InodeOps::unlink`. Remove a directory entry. The inode is freed
/// when its link count reaches zero and all file descriptors are
/// closed.
Unlink { name: KernelString },
/// `InodeOps::mkdir`. Create a directory with the given permission
/// bits.
Mkdir { mode: FileMode },
/// `InodeOps::rmdir`. Remove an empty directory.
Rmdir { name: KernelString },
/// `InodeOps::rename`. Move or rename a directory entry.
/// `new_dir_ino` is the inode number of the destination directory.
/// `new_name` is the destination name. `flags` carries `RENAME_*`
/// constants (e.g., `RENAME_NOREPLACE`, `RENAME_EXCHANGE`).
Rename { new_dir_ino: u64, new_name: KernelString, flags: u32 },
/// `InodeOps::symlink`. Create a symbolic link whose target path is
/// `target`. The created inode is named by the dentry pre-allocated
/// by the VFS.
Symlink { target: KernelString },
/// `InodeOps::readlink`. Resolve the symlink target into
/// `buf`. The driver writes the target string into the DMA buffer
/// identified by `buf`.
Readlink { buf: DmaBufferHandle },
/// `InodeOps::mknod`. Create a special file (block device, character
/// device, FIFO, or socket). `dev` carries the (major, minor) pair
/// using the `DevId` type with Linux MKDEV encoding: `(major << 20) | minor`.
/// See [Section 14.5](#device-node-framework) for encoding details.
Mknod { mode: FileMode, dev: DevId },
/// `InodeOps::getattr`. Retrieve inode attributes into an
/// `InodeAttr`. `request_mask` is a bitmask of `STATX_*` fields the
/// caller wants. `flags` is `AT_*` flags from `statx(2)`.
GetAttr { request_mask: u32, flags: u32 },
/// `InodeOps::setattr`. Modify inode attributes. `valid` is a
/// bitmask of `ATTR_*` flags indicating which fields in `attr` the
/// driver must update.
SetAttr { attr: InodeAttr, valid: u32 },
/// `InodeOps::truncate`. Set the file size to `size` bytes,
/// releasing or zero-extending as needed.
Truncate { size: u64 },
/// `InodeOps::getxattr`. Retrieve the extended attribute `name` into
/// `buf`. On return, the response `status` field (>= 0) carries the
/// attribute value length.
GetXattr { name: KernelString, buf: DmaBufferHandle },
/// `InodeOps::setxattr`. Set extended attribute `name` to `value`.
/// `flags` is `XATTR_CREATE`, `XATTR_REPLACE`, or 0.
SetXattr { name: KernelString, value: DmaBufferHandle, value_len: u32, flags: u32 },
/// `InodeOps::listxattr`. Enumerate all extended attribute names into
/// `buf` as a sequence of NUL-terminated strings. On return, the
/// response `status` field (>= 0) carries the total length written.
ListXattr { buf: DmaBufferHandle },
/// `InodeOps::removexattr`. Delete the extended attribute `name`.
RemoveXattr { name: KernelString },
/// `FileSystemOps::show_options`. Write the filesystem-specific
/// mount options (as they would appear in `/proc/mounts`) into
/// `buf`.
ShowOptions { buf: DmaBufferHandle },
// ---------------------------------------------------------------
// AddressSpaceOps (page cache → filesystem driver)
// ---------------------------------------------------------------
/// `AddressSpaceOps::read_page`. Populate one page from backing
/// store on a page cache miss. `page_index` is the page-aligned
/// file offset divided by `PAGE_SIZE`. The driver reads data from
/// the backing block device and writes it into the DMA buffer
/// identified by `buf` (exactly `PAGE_SIZE` bytes). The page has
/// already been allocated and inserted into the page cache by the
/// caller; the driver only needs to fill it.
///
/// `page_cache_id`: Core-resident handle identifying the
/// `AddressSpace`/`PageCache` that owns this page. Set by the VFS
/// dispatch path (in Core, Tier 0) before enqueuing. Used by crash
/// recovery to resolve orphaned pages WITHOUT traversing VFS-domain
/// state (the inode cache is in VFS/Tier 1 and may be corrupted).
/// The handle is the `AddressSpace` pointer cast to `u64` — valid
/// because `AddressSpace` is in Core memory and pinned for the
/// lifetime of the superblock.
ReadPage { page_index: u64, page_cache_id: u64, buf: DmaBufferHandle },
/// `AddressSpaceOps::readahead`. Batch read for the readahead
/// engine ([Section 4.4](04-memory.md#page-cache--readahead-engine)). `start_index` is
/// the first page index; `nr_pages` is the count. The driver
/// should submit I/O for the entire range in a single Bio batch.
/// `buf` is a DMA buffer large enough for `nr_pages * PAGE_SIZE`
/// bytes. Pages have been pre-allocated and cache-inserted by the
/// readahead engine; the driver fills them sequentially.
/// Filesystems that do not implement batched readahead return
/// `EOPNOTSUPP`; the VFS falls back to per-page `ReadPage` calls.
///
/// `page_cache_id`: Same semantics as `ReadPage::page_cache_id`.
/// All pages in the readahead batch belong to the same
/// `AddressSpace`.
Readahead { start_index: u64, nr_pages: u32, page_cache_id: u64, buf: DmaBufferHandle },
/// `AddressSpaceOps::writepage`. Write a single dirty page to
/// the backing store. Used by the page reclaimer when it needs to
/// evict a dirty page. For normal writeback, the `WritebackRequest`
/// ring ([Section 4.6](04-memory.md#writeback-subsystem--writeback-domain-crossing-tier-0---tier-1))
/// is used instead (batched, higher throughput). `writepage` is
/// the single-page fallback for reclaim pressure.
WritePage { page_index: u64, buf: DmaBufferHandle, sync_mode: u8 },
/// `AddressSpaceOps::dirty_extent`. Notify the filesystem that a
/// page range is about to be dirtied. The filesystem records the
/// affected extent for crash-recovery journaling. `offset` and
/// `len` are byte offsets within the file.
DirtyExtent { offset: u64, len: u64 },
/// `AddressSpaceOps::releasepage`. Ask the filesystem whether a
/// clean page may be evicted from the cache. The driver responds
/// with `aux[0] = 1` (permit eviction) or `aux[0] = 0` (page is
/// pinned for journaling or other reasons). Must not block.
ReleasePage { page_index: u64 },
// ---------------------------------------------------------------
// FileOps
// ---------------------------------------------------------------
/// `FileOps::open`. Open the file. `flags` are the `O_*` open
/// flags from `open(2)`/`openat(2)`. `mode` is relevant only when
/// `O_CREAT` is set.
Open { flags: u32, mode: FileMode },
/// `FileOps::release`. The last reference to this open file
/// descriptor has been closed. The driver must flush any cached
/// state for `fh`.
Release {},
/// `FileOps::read`. Read up to `count` bytes starting at `offset`
/// from the file into `buf`. The driver writes data into the DMA
/// buffer identified by `buf`. On return, `VfsResponseWire::status`
/// (>= 0) carries the number of bytes actually read.
///
/// **Inline small I/O path**: If `count <= INLINE_IO_MAX` (192 bytes),
/// the VFS sets `buf` to `DmaBufferHandle::ZERO` (sentinel: no DMA
/// buffer allocated). The driver writes read data into the response's
/// `inline_data` field instead of a DMA buffer. This eliminates DMA
/// alloc/free + IOMMU map/unmap for small reads (procfs, sysfs, small
/// config files). Saves ~150-300ns per small read. Covers >90% of
/// procfs/sysfs reads. See `VfsResponseWire::inline_data`.
Read { buf: DmaBufferHandle, offset: u64, count: u32 },
/// `FileOps::write`. Write `count` bytes from `buf` into the file
/// starting at `offset`. `buf` points to a DMA buffer the VFS has
/// already filled with the data to be written.
///
/// **Inline small I/O path**: If `count <= INLINE_IO_MAX` (192 bytes),
/// the VFS places write data inline in `inline_data` and sets `buf` to
/// `DmaBufferHandle::ZERO` (sentinel). The driver reads from
/// `inline_data` instead of a DMA buffer. The `copy_from_user()` that
/// fills `inline_data` happens after slot reservation but before
/// `complete_slot()` — enabled by the split reservation/completion
/// protocol ([Section 14.3](#vfs-per-cpu-ring-extension)).
Write { buf: DmaBufferHandle, offset: u64, count: u32,
inline_data: [u8; INLINE_IO_MAX] },
/// `FileOps::fsync`. Flush dirty data and metadata to stable
/// storage. If `datasync` is 1, only data blocks need to be
/// flushed (equivalent to `fdatasync(2)`). `start`..`end` is the
/// byte range to sync; `end == u64::MAX` means "to end of file".
Fsync { datasync: u8, start: u64, end: u64 }, // 0 = fsync, 1 = fdatasync. u8 for cross-domain safety.
/// `FileOps::readdir`. Enumerate directory entries into `buf`
/// starting after the position identified by `cookie`. A `cookie` of
/// 0 means start from the beginning. The driver fills `buf` with
/// `linux_dirent64` records. `VfsResponseWire::status` (>= 0) carries
/// the number of bytes written.
ReadDir { buf: DmaBufferHandle, cookie: u64 },
/// `FileOps::ioctl`. Pass a device-specific command to the
/// filesystem driver. `cmd` is the ioctl number; `arg` is the raw
/// usize argument (may be a user pointer, a small integer, or a
/// `DmaBufferHandle` depending on the command).
Ioctl { cmd: u32, arg: usize },
/// `FileOps::mmap`. Establish a memory mapping. `vma_token` is an
/// opaque handle the VFS passes to the driver to identify the
/// virtual memory area; the driver uses it to call back into the
/// VFS to install PTEs via the KABI page-fault callback.
Mmap { vma_token: u64, prot: u32, flags: u32 },
/// `FileOps::fallocate`. Pre-allocate or manipulate storage for the
/// given byte range. `mode` carries `FALLOC_FL_*` flags.
Fallocate { mode: u32, offset: u64, len: u64 },
/// `FileOps::seek_data`. Find the next byte range containing data
/// at or after `offset` (implements `SEEK_DATA` from `lseek(2)`).
SeekData { offset: u64 },
/// `FileOps::seek_hole`. Find the next hole (unallocated range) at
/// or after `offset` (implements `SEEK_HOLE` from `lseek(2)`).
SeekHole { offset: u64 },
/// `FileOps::poll`. Query which I/O events are ready. `events` is
/// a bitmask of `POLLIN`, `POLLOUT`, `POLLERR`, etc. The driver
/// responds immediately with the currently ready events; the VFS
/// handles `epoll`/`select` wait registration separately.
Poll { events: u32 },
/// `FileOps::splice_read`. Transfer up to `len` bytes from the file
/// at `offset` into an in-kernel pipe identified by `pipe_ino`,
/// without copying through userspace. `flags` carries `SPLICE_F_*`
/// flags.
SpliceRead { pipe_ino: u64, offset: u64, len: u32, flags: u32 },
/// `FileOps::splice_write`. Transfer up to `len` bytes from the
/// in-kernel pipe identified by `pipe_ino` into the file at
/// `offset`. `flags` carries `SPLICE_F_*` flags.
SpliceWrite { pipe_ino: u64, offset: u64, len: u32, flags: u32 },
// ---------------------------------------------------------------
// Inode lifecycle operations
// ---------------------------------------------------------------
/// `InodeOps::evict_inode`. Inode eviction — free on-disk resources.
/// Sent after the VFS has completed page cache teardown. The inode
/// number is in `VfsRequest::ino`; no extra arguments are needed.
/// The driver MUST release all on-disk resources (extent tree entries,
/// block allocations, journal reservations) associated with the inode.
/// The response is `VfsResponse::Ok(0)` on success or
/// `VfsResponse::Err(-errno)` on failure (which is logged but does not
/// prevent inode freeing — the VFS continues eviction regardless).
EvictInode {},
/// `InodeOps::truncate_range`. Deallocate blocks within
/// `[offset, offset+len)` without changing file size.
/// Used by `FALLOC_FL_PUNCH_HOLE` and `FALLOC_FL_ZERO_RANGE`
/// (`FALLOC_FL_COLLAPSE_RANGE` shrinks the file, so it falls outside this
/// size-preserving contract). Separate from `Truncate` (which sets
/// `i_size` via `setattr`). The VFS evicts page cache pages in the
/// affected range before sending this request.
TruncateRange { offset: u64, len: u64 },
/// `InodeOps::write_inode`. Flush inode metadata to stable storage.
/// Called by `vfs_fsync_metadata()` for O_SYNC/O_DSYNC writes when the
/// inode's on-disk metadata must be updated (timestamps, size, block map).
/// `sync_mode`: 0 = `WB_SYNC_NONE` (schedule I/O, don't wait),
/// 1 = `WB_SYNC_ALL` (wait for I/O completion). u8 for cross-domain safety.
WriteInode { sync_mode: u8 },
// ---------------------------------------------------------------
// Batched metadata operations (io_uring coalescing)
// ---------------------------------------------------------------
/// Batched `statx()` request. Generated by the io_uring dispatch path
/// when consecutive `IORING_OP_STATX` SQEs are detected. The VFS
/// resolves all paths in a single domain stay. Never sent to
/// filesystem drivers.
/// See [Section 14.1](#virtual-filesystem-layer--mechanism-2-io_uring-statx-coalescing).
StatxBatch { count: u8, entries_buf: DmaBufferHandle },
}
/// Bounded kernel-internal string. Avoids heap allocation for the common
/// case of short names (directory entries, xattr names, symlink targets
/// ≤ 255 bytes).
///
/// For strings longer than 255 bytes the caller must use a
/// `DmaBufferHandle` instead and set `len = 0` as a sentinel.
#[repr(C)]
pub struct KernelString {
/// Byte length of the string, not including any NUL terminator.
/// Range: 0 (sentinel for "use DMA buffer") to 255.
pub len: u8,
/// Inline storage. Valid bytes are `data[..len]`. The remainder
/// is zero-padded. Not NUL-terminated; callers must use `len`.
pub data: [u8; 255],
}
// Layout: 1 + 255 = 256 bytes.
const_assert!(size_of::<KernelString>() == 256);
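The `KernelString` contract (at most 255 bytes inline, `len = 0` reserved as the "use DMA buffer" sentinel) suggests a fallible constructor. The helper below is a hypothetical sketch, not part of the specified API. It assumes that an empty name is also rejected, since `len = 0` could not be told apart from the sentinel.

```rust
/// Local copy of the layout above, so the sketch is self-contained.
#[repr(C)]
pub struct KernelString {
    pub len: u8,
    pub data: [u8; 255],
}

impl KernelString {
    /// Build an inline string. Returns `None` when the name is empty or
    /// longer than 255 bytes; the caller would then fall back to a DMA
    /// buffer and set `len = 0`.
    pub fn try_inline(name: &[u8]) -> Option<Self> {
        if name.is_empty() || name.len() > 255 {
            return None;
        }
        let mut data = [0u8; 255]; // remainder stays zero-padded
        data[..name.len()].copy_from_slice(name);
        Some(KernelString { len: name.len() as u8, data })
    }

    /// Valid bytes only; the string is not NUL-terminated.
    pub fn as_bytes(&self) -> &[u8] {
        &self.data[..self.len as usize]
    }
}
```

Zero-filling the tail before copying keeps stale bytes out of the ring entry, in line with the padding rules stated for `VfsRequest`.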
/// VFS operation codes. One-to-one mapping to trait methods.
#[repr(u32)]
pub enum VfsOpcode {
// FileSystemOps
Mount = 1,
Unmount = 2,
ForceUnmount = 3,
Statfs = 4,
SyncFs = 5,
Remount = 6,
Freeze = 7,
Thaw = 8,
ShowOptions = 37, // → FileSystemOps::show_options; called by /proc/mounts, mount(8)
// InodeOps
Lookup = 20,
Create = 21,
Link = 22,
Unlink = 23,
Mkdir = 24,
Rmdir = 25,
Rename = 26,
Symlink = 27,
Readlink = 28,
Getattr = 29,
Setattr = 30,
Truncate = 35,
Getxattr = 31,
Setxattr = 32,
Listxattr = 33,
Removexattr = 34,
Mknod = 36, // → InodeOps::mknod; called by mknod(2) for device nodes
EvictInode = 38, // → InodeOps::evict_inode; called when inode's last reference drops
TruncateRange = 39, // → InodeOps::truncate_range; FALLOC_FL_PUNCH_HOLE/ZERO_RANGE
// Separate from Truncate (35) which sets i_size via setattr.
// TruncateRange deallocates blocks within [offset, offset+len)
// without changing file size.
WriteInode = 55, // → InodeOps::write_inode; flush inode metadata to stable storage
// Called by vfs_fsync_metadata() for O_SYNC/O_DSYNC writes.
// AddressSpaceOps (page cache ↔ filesystem)
ReadPage = 60, // → AddressSpaceOps::read_page; page cache miss
Readahead = 61, // → AddressSpaceOps::readahead; batched readahead
WritePage = 62, // → AddressSpaceOps::writepage; reclaim single-page writeback
DirtyExtent = 63, // → AddressSpaceOps::dirty_extent; journal pre-registration
ReleasePage = 64, // → AddressSpaceOps::releasepage; reclaim eviction check
// FileOps
Open = 40,
Release = 41,
Read = 42,
Write = 43,
Fsync = 44,
Readdir = 45,
Ioctl = 46,
Mmap = 47,
Fallocate = 48,
SeekData = 49,
SeekHole = 50,
Poll = 51,
SpliceRead = 52, // → FileOps::splice_read; called by splice(2), sendfile(2)
SpliceWrite = 53, // → FileOps::splice_write; called by splice(2) write side
// Batched metadata operations (io_uring coalescing)
// These opcodes are generated only by the io_uring statx coalescing path
// ([Section 14.1](#virtual-filesystem-layer--mechanism-2-io_uring-statx-coalescing)).
// They are never exposed to filesystem drivers directly — the VFS
// dispatches individual Getattr calls internally for each batch entry.
StatxBatch = 70, // → Batched statx; args in DmaBufferHandle as StatxBatchEntry[]
StatxBatchResult = 71, // → Response-only opcode (no VfsRequestArgs variant). Carries
// per-entry StatxBuf or error as a batched response payload.
// Used only by VFS internal response routing; never sent on
// the request ring.
}
/// VFS response message — wire-level representation on the response ring.
///
/// Every request placed on the `request_ring` eventually produces exactly one
/// `VfsResponseWire` on the paired `response_ring`. The `request_id` field
/// matches the request it completes, enabling out-of-order completion.
///
/// **Status encoding**: `status` is a signed 64-bit value.
/// - `status >= 0`: Success. For data-transfer operations (`Read`, `Write`,
/// `Readdir`, `ReadPage`, `Readahead`, `SpliceRead`, `SpliceWrite`), the
/// value is the byte count transferred. For operations that return a new
/// inode (`Lookup`, `Create`, `Mkdir`, `Symlink`, `Mknod`), the value
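/// is the new inode number. `GetXattr` and `ListXattr` return the value or
/// list length, and `Open` returns the filesystem-private file handle (see
/// `aux`). For all other operations, the value is 0.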
/// - `status == -4095..-1`: Error. The negated Linux errno (e.g., `-2` for
/// `ENOENT`). Matches the kernel's standard error encoding.
/// - `status == i64::MIN` (`0x8000_0000_0000_0000`): Pending — the driver
/// has acknowledged the request but not yet completed it. The VFS must
/// continue waiting for the final response. At most one `Pending` response
/// per request is permitted.
///
/// **Size**: Header is 40 bytes (8 + 8 + 8 + 8 + 4 + 4). For responses
/// carrying inline read data (small I/O path), `inline_data` adds up to
/// `INLINE_IO_MAX` (192) bytes. Total response entry: 256 bytes
/// (40 header + 192 inline_data + 24 padding, aligned to 256 for cache
/// efficiency on response ring). For responses without inline data
/// (large I/O, non-read operations), `inline_data_len` is 0 and the
/// consumer can skip the inline data region.
#[repr(C, align(256))]
pub struct VfsResponseWire {
/// Request ID this response completes. Matches `VfsRequest::request_id`.
pub request_id: u64,
/// Driver generation counter at the time this response was produced.
/// The VFS discards responses whose generation does not match the
/// current `sb.driver_generation` (stale responses from a pre-crash
/// driver instance). See Step 3a below (previously numbered Step 5.5).
pub driver_generation: u64,
/// Status code: >= 0 for success (byte count or inode number),
/// negative for error (negated errno), `i64::MIN` for Pending.
pub status: i64,
/// Operation-specific supplementary data. Currently used by:
/// - `Lookup`/`Create`/`Mkdir`/`Symlink`/`Mknod`: inode generation
/// counter in `aux[0]` (u32, for NFS file handle staleness detection).
/// - `Open`: filesystem-private file handle in `status` (u64, stored
/// by VFS in `OpenFile::private_data`).
/// - `GetAttr`: `STATX_*` result mask in `aux[0]`.
/// - `ReleasePage`: `aux[0]` = 1 if eviction is permitted, 0 if denied.
/// - All other operations: `aux` is zero.
pub aux: [u32; 2],
/// Number of valid bytes in `inline_data`. 0 for non-inline responses.
/// Range: 0..=INLINE_IO_MAX (192). When > 0, the VFS copies
/// `inline_data[..inline_data_len]` directly to userspace, bypassing
/// the DMA buffer entirely.
pub inline_data_len: u32,
/// Padding after inline_data_len to maintain 8-byte alignment.
pub _pad_len: u32,
/// Inline read data for small I/O responses. Used when the original
/// `Read` request had `count <= INLINE_IO_MAX` and `buf == ZERO`.
/// The filesystem driver writes read data here instead of into a DMA
/// buffer. Eliminates DMA alloc/free + IOMMU map/unmap for small reads.
/// For non-inline responses, this region is unused (content undefined).
pub inline_data: [u8; INLINE_IO_MAX],
/// Padding to fill the 256-byte struct size mandated by `#[repr(C, align(256))]`.
/// 8 + 8 + 8 + 8 + 4 + 4 + 192 = 232 bytes of fields. 256 - 232 = 24 bytes pad.
/// The `align(256)` attribute ensures each response entry is cache-line-aligned
/// and power-of-two sized for efficient ring indexing (index × 256 = byte offset).
pub _pad: [u8; 24],
}
const_assert!(core::mem::size_of::<VfsResponseWire>() == 256);
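The status encoding documented on VfsResponseWire can be decoded as in the following minimal sketch. The `DecodedStatus` enum, its variant names, and the `decode_status` function are hypothetical helpers introduced for illustration; only the numeric ranges come from the struct documentation. Treating negative values below -4095 (other than `i64::MIN`) as out of range is an assumption.

```rust
/// Hypothetical decoded view of `VfsResponseWire::status`; the enum and its
/// variant names are illustrative, not part of the specified wire format.
#[derive(Debug, PartialEq, Eq)]
pub enum DecodedStatus {
    /// `status >= 0`: byte count, inode number, or 0, depending on the opcode.
    Success(u64),
    /// `-4095..=-1`: the positive errno recovered from the negated value.
    Errno(u16),
    /// `i64::MIN`: acknowledged by the driver, final response still to come.
    Pending,
    /// Any other negative value lies outside the documented encoding.
    OutOfRange(i64),
}

pub fn decode_status(status: i64) -> DecodedStatus {
    match status {
        // Check the sentinel first: it is also "negative" and must not be
        // mistaken for an error code.
        i64::MIN => DecodedStatus::Pending,
        s if s >= 0 => DecodedStatus::Success(s as u64),
        s if s >= -4095 => DecodedStatus::Errno(s.unsigned_abs() as u16),
        s => DecodedStatus::OutOfRange(s),
    }
}
```

Checking the `i64::MIN` sentinel before the range tests keeps the Pending case from falling into the error branch.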
Dispatch flow (read syscall example):
- Userspace calls read(fd, buf, len).
- Syscall entry point resolves fd to a ValidatedCap (Section 9.1).
- VFS checks the page cache (Section 4.4). On cache HIT: data is served from core memory with zero domain crossings. On cache MISS: continue.
- VFS constructs a VfsRequest:
  - Large I/O (count > INLINE_IO_MAX): { opcode: Read, buf: DmaBufferHandle, offset, count }. The buf is a DmaBufferHandle pointing to a shared-memory region where the driver will write the read data (zero-copy).
  - Small I/O (count <= INLINE_IO_MAX): { opcode: Read, buf: DmaBufferHandle::ZERO, offset, count }. No DMA buffer is allocated. The driver writes data into VfsResponseWire::inline_data.
- VFS enqueues the request on request_ring and rings the doorbell.
- The filesystem driver (in its Tier 1 domain) dequeues the request. It checks buf == DmaBufferHandle::ZERO to select the path:
  - DMA path: reads via BlockDevice, writes data to the shared DMA buffer.
  - Inline path: reads from page cache or block device, writes data into VfsResponseWire::inline_data[..inline_data_len] and sets inline_data_len.
- Driver enqueues a VfsResponseWire { request_id, status, inline_data_len, ... } on response_ring.
- VFS dequeues the response:
  - DMA path: populates the page cache, copies data from DMA buffer to userspace.
  - Inline path: copies inline_data[..inline_data_len] directly to userspace. No DMA buffer to free, no IOMMU unmap. Saves ~150-300ns per small read.
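The path selection in the dispatch flow can be sketched as follows. This is an illustration under stated assumptions: `DmaBufferHandle` is reduced to a newtype over `u64`, `alloc_dma` is a hypothetical allocator callback assumed never to return the ZERO sentinel, and rejecting an `inline_data_len` above INLINE_IO_MAX is an assumed policy rather than documented behavior.

```rust
pub const INLINE_IO_MAX: usize = 192; // value from the VfsResponseWire documentation

/// Stand-in for the real handle type; only the ZERO sentinel matters here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DmaBufferHandle(pub u64);

impl DmaBufferHandle {
    pub const ZERO: DmaBufferHandle = DmaBufferHandle(0);
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReadPath {
    Inline,
    Dma,
}

/// VFS side: pick the `buf` field of a Read request. The allocator callback
/// (hypothetical) runs only for reads larger than INLINE_IO_MAX.
pub fn choose_read_buf(
    count: usize,
    alloc_dma: impl FnOnce(usize) -> DmaBufferHandle,
) -> DmaBufferHandle {
    if count <= INLINE_IO_MAX {
        DmaBufferHandle::ZERO
    } else {
        alloc_dma(count)
    }
}

/// Driver side: the sentinel alone selects the path.
pub fn driver_path(buf: DmaBufferHandle) -> ReadPath {
    if buf == DmaBufferHandle::ZERO {
        ReadPath::Inline
    } else {
        ReadPath::Dma
    }
}

/// VFS side on completion: slice out the valid inline bytes. A length above
/// INLINE_IO_MAX is rejected here (assumed policy).
pub fn inline_payload(inline_data: &[u8; INLINE_IO_MAX], inline_data_len: u32) -> Option<&[u8]> {
    inline_data.get(..inline_data_len as usize)
}
```

Because the driver keys purely on the sentinel, the request needs no separate "inline" flag.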
Key design properties:
- Page cache absorbs most I/O: Only cache misses cross the domain boundary. On a warm cache (common for frequently accessed files), read() has zero domain crossings — data is served directly from core memory. This is why the page cache lives in umka-core, not in the filesystem driver.
- Zero-copy data path: Read/write data is transferred via shared DMA buffer handles, not copied into the ring buffer. The ring carries only the metadata (opcode, offsets, lengths, buffer handles). Data pages are in the shared DMA pool (PKEY 14 / domain 2).
- Batching: The doorbell coalescing mechanism (Section 11.5.1.1) batches multiple requests into a single domain switch. readahead() enqueues multiple read requests before ringing the doorbell once.
- Trait interface as specification: The FileSystemOps, InodeOps, FileOps, and AddressSpaceOps traits defined in Section 14.1 serve as the SPECIFICATION of the ring protocol. Each trait method maps to exactly one VfsOpcode. The trait signatures define the arguments; the ring protocol serializes them into VfsRequestArgs. Filesystem driver developers implement the traits; the KABI code generator (Section 12.1) produces the serialization/deserialization stubs.
VFS Ring Error Handling and Cancellation:
Every cross-domain VFS request is subject to timeout, cancellation, and driver crash handling. This section specifies the complete lifecycle of a request that does not complete normally.
1. Timeout: Every VFS request has a per-operation timeout based on the expected latency class of the operation:
| Timeout class | Operations | Default timeout |
|---|---|---|
| Regular | Read, Write, Stat, Lookup, Create, Open, Release, Getattr, Setattr, Readdir, Readlink, Link, Unlink, Mkdir, Rmdir, Rename, Symlink, Getxattr, Setxattr, Listxattr, Removexattr, Mmap, SeekData, SeekHole, Poll, Ioctl | 30 seconds |
| Slow | Fsync, Truncate, Fallocate | 120 seconds |
| Mount | Mount, Unmount, ForceUnmount, Remount, Statfs, SyncFs, Freeze, Thaw | 300 seconds |
The kernel VFS layer starts a per-request timer when the request is enqueued on the
request_ring. If the timer fires before a VfsResponse::Ok or VfsResponse::Err
arrives on the response_ring, the kernel performs the following steps:
a. Sets request.state to Cancelled in the shared ring metadata.
b. Returns ETIMEDOUT to the waiting syscall (waking the blocked thread via
the VfsRingPair::completion wait queue).
c. Enqueues a CancelToken { request_id, reason: CancelReason::Timeout } on a
dedicated cancellation side-channel in the ring so the filesystem driver can
detect the cancellation and avoid processing a stale request. The driver is
expected to check the cancellation channel before beginning I/O for each
dequeued request.
Timeouts are per-mount configurable via mount options (vfs_timeout_regular=<secs>,
vfs_timeout_slow=<secs>, vfs_timeout_mount=<secs>). The values above are defaults.
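The default timeouts and per-mount overrides can be modeled as in the sketch below. The `MountTimeouts` type, its methods, and the comma-separated option-string format are assumptions made for illustration; only the three option names and the default values come from the text. The sketch uses `std` for brevity, which a kernel implementation would not.

```rust
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TimeoutClass {
    Regular,
    Slow,
    Mount,
}

/// Hypothetical per-mount timeout table.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MountTimeouts {
    pub regular: Duration,
    pub slow: Duration,
    pub mount: Duration,
}

impl Default for MountTimeouts {
    /// Defaults from the timeout table: 30 s, 120 s, 300 s.
    fn default() -> Self {
        MountTimeouts {
            regular: Duration::from_secs(30),
            slow: Duration::from_secs(120),
            mount: Duration::from_secs(300),
        }
    }
}

impl MountTimeouts {
    /// Parse a comma-separated option string (assumed format). Options other
    /// than the three timeout keys are skipped, not rejected.
    pub fn from_options(opts: &str) -> Result<Self, String> {
        let mut t = Self::default();
        for opt in opts.split(',').map(str::trim) {
            let Some((key, value)) = opt.split_once('=') else { continue };
            let slot = match key {
                "vfs_timeout_regular" => &mut t.regular,
                "vfs_timeout_slow" => &mut t.slow,
                "vfs_timeout_mount" => &mut t.mount,
                _ => continue,
            };
            let secs: u64 = value
                .parse()
                .map_err(|_| format!("bad value for {key}: {value}"))?;
            *slot = Duration::from_secs(secs);
        }
        Ok(t)
    }

    pub fn for_class(&self, class: TimeoutClass) -> Duration {
        match class {
            TimeoutClass::Regular => self.regular,
            TimeoutClass::Slow => self.slow,
            TimeoutClass::Mount => self.mount,
        }
    }
}
```

Resolving the timeout once at mount time keeps the per-request path down to a single table lookup when the timer is armed.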
2. Crash handling (filesystem driver crashes): When a Tier 1 filesystem driver crashes (detected by the isolation recovery mechanism described in Section 11.6), the kernel VFS layer performs the following recovery sequence:
a. All pending requests for the crashed filesystem driver are immediately failed
with EIO. Every thread blocked on VfsRingPair::completion for that mount is
woken with VfsResponse::Err(-EIO).
b. The VFS ring is closed: the kernel unmaps the shared ring pages and marks the
VfsRingPair as defunct. No new requests are accepted.
c. Any subsequent access to files on that filesystem (open files, cached dentries,
inode operations) returns ENOTCONN until the driver is restarted and the
filesystem is remounted.
d. For Tier 1 filesystem drivers: the crash recovery mechanism reloads the driver
module and replays the mount sequence (using the stored mount arguments from
SuperBlock). Pending request state is lost — applications whose requests
were failed with EIO must retry. Open file descriptors pointing to the crashed
filesystem become invalid and return ENOTCONN on any operation; applications
must close and reopen them after remount completes.
Crash Recovery Algorithm — Complete Specification:
VFS crash recovery runs when a Tier 1 VFS driver (e.g., ext4, XFS) crashes and is reloaded (Section 11.9).
Synchronization during recovery (no lock-based ordering — uses atomics):
The unified VFS crash recovery sequence (Section 14.3)
uses VfsRingSet.state atomics (VFSRS_RECOVERING) to block new operations, NOT
explicit locks. The previous lock-based model (vfs_global_lock, sb.recovery_lock)
was replaced by the atomic state machine approach:
- ring_set.state.store(VFSRS_RECOVERING, Release) blocks all new select_ring() calls.
- Per-ring inflight_ops counters provide the quiescence barrier.
- Per-inode inode.lock (level 185) is acquired only if individual inodes need repair
(e.g., truncate-on-recovery for partially-written files).
This eliminates lock ordering complexity and avoids adding a global lock to the recovery path. See the unified U1-U18 sequence in Section 14.3 for the authoritative step ordering.
Step 1: Quiesce in-flight operations
- Set ring_set.state = VFSRS_RECOVERING (atomic store, Release ordering).
Note: the recovery state is tracked on the VfsRingSet, not on the SuperBlock.
See Section 14.3 for the unified U1-U18 recovery sequence.
- All new VFS operations on this superblock return ENXIO immediately (no-op check at syscall entry).
- Wait for all per-ring inflight_ops counters to reach zero: sum(ring.request_ring.inflight_ops for all rings in ring_set) == 0 (spin with a 5s timeout; if not drained after 5s, send SIGKILL to processes with operations stuck in the crashed driver's domain). SIGKILL is the escalation path of last resort: it is used only when a process cannot be unblocked by returning EIO on its stuck syscall (i.e., the process is in TASK_UNINTERRUPTIBLE waiting on a ring response that will never arrive). SIGTERM cannot wake an uninterruptible process. Only processes with operations stuck in the crashed driver's domain are affected.
- The inflight_ops counter (defined in RingBuffer<T>) is incremented in reserve_slot() after successful slot reservation and decremented in complete_slot() after marking the slot FILLED. Per-ring counters avoid false sharing between CPUs on different rings. See Section 14.3 for the per-CPU ring extension that defines these counters.
- Why per-ring, not per-sb: A single per-sb AtomicU32 would be a cache line contention point for N concurrent producers. Per-ring counters eliminate cross-ring false sharing. The crash recovery path sums all N counters (cold path, O(N) where N <= 256).
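The quiescence barrier in Step 1 can be sketched with plain atomics. The `Ring`, `RingSet`, and `Quiesce` types below are reduced stand-ins introduced for illustration, not the real VfsRingSet layout. The sketch also ignores the window in which a producer passes the state check just before the store and increments its counter afterwards; the unified sequence in Section 14.3 remains authoritative for that ordering.

```rust
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};
use std::time::{Duration, Instant};

pub const VFSRS_ACTIVE: u8 = 0;
pub const VFSRS_RECOVERING: u8 = 1;

/// Reduced stand-in for one ring: only the in-flight counter.
pub struct Ring {
    pub inflight_ops: AtomicU32,
}

/// Reduced stand-in for VfsRingSet.
pub struct RingSet {
    pub state: AtomicU8,
    pub rings: Vec<Ring>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Quiesce {
    Drained,
    /// Escalation (SIGKILL of stuck processes) would start from here.
    TimedOut { still_inflight: u64 },
}

impl RingSet {
    /// Gate checked before ring selection: refuse new work during recovery.
    pub fn may_submit(&self) -> bool {
        self.state.load(Ordering::Acquire) == VFSRS_ACTIVE
    }

    /// Cold path: O(N) sum over the per-ring counters.
    fn inflight_total(&self) -> u64 {
        self.rings
            .iter()
            .map(|r| u64::from(r.inflight_ops.load(Ordering::Acquire)))
            .sum()
    }

    /// Flip the state, then spin until every counter reads zero or the
    /// timeout expires.
    pub fn quiesce(&self, timeout: Duration) -> Quiesce {
        self.state.store(VFSRS_RECOVERING, Ordering::Release);
        let deadline = Instant::now() + timeout;
        loop {
            let n = self.inflight_total();
            if n == 0 {
                return Quiesce::Drained;
            }
            if Instant::now() >= deadline {
                return Quiesce::TimedOut { still_inflight: n };
            }
            std::hint::spin_loop();
        }
    }
}
```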
Step 2: Extract ring data, drain, and wake orphaned page waiters
This step has three phases that MUST execute in order. The extraction phase reads ring entries while ring pointers are still valid; the drain phase resets ring pointers; the wake phase unlocks orphaned pages using the extracted data. The unified recovery sequence in Section 14.3 specifies the exact interleaving with the general crash recovery steps from Section 11.9.
Phase 2a: EXTRACT (ring pointers still valid)
- Walk each ring's request entries from tail to published. For each entry:
- If the opcode is ReadPage or Readahead: extract the page_cache_id: u64
and page_index: u64 from the request args (these identify the page in Core
memory via the page cache, NOT via the VFS-domain inode cache). Collect into
an ArrayVec<OrphanedPageEntry, MAX_ORPHANED_PAGES>.
- If the entry holds a DMA buffer handle (buf != DmaBufferHandle::ZERO):
collect the handle for deferred freeing.
- Inode cache independence: The page cache lives in Core (Tier 0). During
crash recovery, the VFS domain (Tier 1) may be corrupted. The ring entry must
carry enough information to resolve the page WITHOUT traversing VFS-domain
state. Specifically, VfsRequest carries page_cache_id (a u64 handle to the
Core-resident AddressSpace/PageCache) set by the VFS dispatch path BEFORE
enqueuing the request. This handle is read directly from the ring entry during
extraction — no inode lookup needed.
Phase 2b: DRAIN (reset rings)
- The driver-to-kernel ring buffer (Section 12.1) may have pending completion events from operations submitted before the crash.
- Call ring_drain_completions(sb.driver_ring): process all pending completions (call the registered callback for each entry). Completions after a crash return EIO.
- Discard all pending submission-side entries by waking blocked threads with EIO.
- Free all collected DMA buffer handles (free_request_dma_handles()).
- Reset all slot_states to EMPTY and ring pointers to 0. See
drain_all_vfs_rings() in Section 14.3.
Phase 2c: WAKE (orphaned pages)
- Orphaned page wake: Crash recovery must handle threads sleeping on page
wait queues (wait_on_page_locked), not just the ring completion WaitQueue.
When a cache-miss read is in progress, the requesting thread sleeps on the
PAGE's wait queue, not on ring.completion. If the driver crashes, these
pages remain LOCKED forever, causing indefinite hangs.
For each collected OrphanedPageEntry:
(a) Resolve the target page from page_cache_id + page_index using Core-
resident page cache lookups (no VFS-domain data needed).
(b) Set PageFlags::ERROR on the page.
(c) Clear PageFlags::LOCKED via unlock_page().
(d) This wakes all threads sleeping on wait_on_page_locked() for that page.
They observe PageFlags::ERROR and return EIO to userspace.
This ensures no orphaned LOCKED pages survive a driver crash.
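Phases 2a and 2c can be sketched as two small functions. Everything here is a simplified model for illustration: `RingEntry` carries only the fields the extraction inspects, a `Vec` replaces the bounded ArrayVec, a `HashMap` stands in for the Core-resident page cache lookup, and the `buf` field uses 0 in place of DmaBufferHandle::ZERO.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Opcode {
    Read,
    Write,
    ReadPage,
    Readahead,
}

/// Reduced ring entry: only the fields Phase 2a inspects.
pub struct RingEntry {
    pub opcode: Opcode,
    pub page_cache_id: u64,
    pub page_index: u64,
    pub buf: u64, // 0 stands in for DmaBufferHandle::ZERO
}

#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct PageFlags {
    pub locked: bool,
    pub error: bool,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct Extracted {
    pub orphans: Vec<(u64, u64)>, // (page_cache_id, page_index)
    pub dma_handles: Vec<u64>,
}

/// Phase 2a: read-only pass over the not-yet-consumed entries, performed
/// while the ring pointers are still valid.
pub fn extract(pending: &[RingEntry]) -> Extracted {
    let mut out = Extracted::default();
    for e in pending {
        if matches!(e.opcode, Opcode::ReadPage | Opcode::Readahead) {
            out.orphans.push((e.page_cache_id, e.page_index));
        }
        if e.buf != 0 {
            out.dma_handles.push(e.buf);
        }
    }
    out
}

/// Phase 2c: mark each orphaned page as failed, then unlock it. Returns the
/// number of pages found and released.
pub fn wake_orphans(
    cache: &mut HashMap<(u64, u64), PageFlags>,
    orphans: &[(u64, u64)],
) -> usize {
    let mut woken = 0;
    for key in orphans {
        if let Some(page) = cache.get_mut(key) {
            page.error = true; // set ERROR first so woken readers observe it
            page.locked = false; // unlocking is what wakes the waiters
            woken += 1;
        }
    }
    woken
}
```

Setting the error flag before clearing the lock mirrors the (b) then (c) ordering above, so a woken reader cannot see an unlocked page without the error.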
Step 2.5: Page cache integrity verification
- Before inspecting page cache state, verify page cache metadata integrity.
A crashing Tier 1 driver may have corrupted XArray tree nodes (if the
driver had write access to page cache metadata via the shared memory
domain).
- For each inode with a non-empty page cache: walk the XArray tree and
verify: (a) all slot pointers are within valid slab regions, (b) the
xa_node.count field matches the actual non-null slot count, (c) no
cycles exist (bounded walk depth = XArray max height = 6 for 64-bit).
Synchronization protocol for the XArray walk:
- Read-only validation phase: Acquire rcu_read_lock(). The XArray
walk reads node pointers and slot entries under RCU protection.
kswapd may run concurrently (it removes pages from the XArray via
xa_erase()), but RCU protects node lifetimes — freed XArray nodes
are not reused until the grace period ends.
- Mutation phase (corruption repair): If corruption is detected,
drop the RCU lock, acquire the per-AddressSpace i_pages lock
(xa_lock_irq(&mapping->i_pages)), then call
truncate_inode_pages() to drop the entire page cache for that
inode. The i_pages lock prevents concurrent page cache mutations
during the truncation.
- The walk is bounded (max XArray height = 6 levels, each level is
64-way fanout = 64^6 = ~68 billion pages). For a 1TB file with
4KB pages, the tree has ~256M entries across ~4M nodes, requiring
~100ms to walk. This is acceptable for a cold crash recovery path.
- If corruption is detected: mark the inode as
I_PAGE_CACHE_CORRUPT, drop the entire page cache for that inode (truncate_inode_pages()), and log an FMA event. The data will be re-read from disk after remount.
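The sizing figures quoted for the XArray walk can be reproduced with a few lines of arithmetic. The helper names below are hypothetical; the fanout of 64 and the height bound of 6 are taken from the text, and the node counts assume a fully populated file.

```rust
pub const FANOUT: u64 = 64; // slots per node, from the text
pub const MAX_HEIGHT: u32 = 6; // bounded walk depth, from the text

/// Maximum number of page slots addressable by a tree of the given height.
pub const fn max_slots(height: u32) -> u64 {
    FANOUT.pow(height)
}

/// Pages needed to cover `file_bytes`, rounded up.
pub const fn page_count(file_bytes: u64, page_size: u64) -> u64 {
    (file_bytes + page_size - 1) / page_size
}

/// Node count at each level, leaf level first, for a fully populated file
/// of `pages` pages (meaningful for pages > 1).
pub fn nodes_per_level(pages: u64) -> Vec<u64> {
    let mut levels = Vec::new();
    let mut entries = pages;
    while entries > 1 {
        entries = (entries + FANOUT - 1) / FANOUT;
        levels.push(entries);
    }
    levels
}
```

For a 1 TiB file with 4 KiB pages this gives 268,435,456 entries and per-level node counts of 4,194,304, 65,536, 1,024, 16, and 1: five levels and 4,260,881 nodes in total, consistent with the ~256M entries and ~4M nodes quoted above. At ~100ms for the whole walk, that corresponds to a budget of roughly 23 ns per node.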
Step 3: Dirty page detection and writeback
- Walk each inode's page cache (skipping DAX inodes where page_cache is None and inodes marked I_PAGE_CACHE_CORRUPT) for all dirty pages: pages with PageFlags::Dirty set.
- For journaled filesystems (e.g., ext4/JBD2), the filesystem's journal tracks
the relationship between journal transactions and page state via
Transaction.tid — not via per-page LSN fields. The VFS layer checks
whether each dirty page's owning transaction has been committed:
- If the page's transaction has committed: mark page clean (the journal will replay it during recovery).
- If the page's transaction has NOT committed: writeback must be deferred until the filesystem is repaired.
- Dirty pages beyond the last commit are kept in memory (pinned) until the filesystem is fsck'd and remounted, at which point a forced writeback is issued.
Step 4: Reload driver and remount
- Load the new driver image (Section 11.3 reload protocol).
- Call driver.mount(sb.device, sb.flags) with MS_RDONLY first (safe mode).
- Run the filesystem's built-in consistency check (ext4 replay journal; XFS log recovery; Btrfs tree walk) via driver.fsck_fast().
- If fsck_fast() returns Ok(()): remount read-write; resume normal operations.
- If fsck_fast() returns Err: emit FMA fault event, keep read-only, require manual intervention.
Step 5: Flush deferred dirty pages
- After successful RW remount, call writeback_deferred_dirty(sb) to flush the dirty pages held since Step 3.
Step 3a: Generation counter bump (before driver reload)
- The SuperBlock has a driver_generation: AtomicU64 counter that is
incremented each time the driver is (re)loaded. The VFS sets
sb.driver_generation = old_generation + 1 BEFORE the driver is
reloaded (Step 4) and BEFORE dirty page writeback (Step 5). This
ordering is critical: the replacement driver's responses (including
writeback completions) must carry the NEW generation so the VFS
consumer does not discard them. (Previously numbered Step 5.5 and
placed after Step 5 — moved to avoid silent data loss on the
writeback flush path. See Section 14.3 Step U13a
for the unified sequence rationale.) This ensures:
- Stale responses from the pre-crash ring (if any were in-flight) are
detected and discarded: the VFS checks response.driver_generation ==
sb.driver_generation.load(Acquire) before processing any completion.
- Open file handles acquired before the crash carry the old generation
(stored in OpenFile.open_generation, set at open time). Any VFS
operation using an old-generation file handle returns ENOTCONN:
if file.open_generation != file.inode.i_sb.driver_generation.load(Acquire) {
return Err(ENOTCONN);
}
- The counter lives in the SuperBlock struct (not in
driver-owned memory), so it survives driver crashes. There is one counter per
mount (not per ring): all rings on a mount share the same generation.
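The two generation checks can be sketched together. The structs below are reduced to the fields involved, and the method names (`begin_reload`, `accept_response`, `check_file`) are hypothetical; the errno value 107 is the Linux value of ENOTCONN.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

pub const ENOTCONN: i32 = 107; // Linux errno value

/// Reduced stand-ins carrying only the generation fields.
pub struct SuperBlock {
    pub driver_generation: AtomicU64,
}
pub struct OpenFile {
    pub open_generation: u64,
}
pub struct Response {
    pub driver_generation: u64,
}

impl SuperBlock {
    /// Step 3a: bump before the replacement driver is loaded, so that its
    /// responses carry a generation the consumer will accept.
    pub fn begin_reload(&self) -> u64 {
        self.driver_generation.fetch_add(1, Ordering::AcqRel) + 1
    }

    /// Completion path: responses from a pre-crash driver instance are dropped.
    pub fn accept_response(&self, r: &Response) -> bool {
        r.driver_generation == self.driver_generation.load(Ordering::Acquire)
    }

    /// Per-operation check for file handles opened before the crash.
    pub fn check_file(&self, f: &OpenFile) -> Result<(), i32> {
        if f.open_generation != self.driver_generation.load(Ordering::Acquire) {
            return Err(ENOTCONN);
        }
        Ok(())
    }
}
```

A single counter serves both purposes: after one bump, every pre-crash response and every pre-crash file handle fails its comparison without per-object bookkeeping.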
Recovery latency target: ≤500ms for ≤1 million in-flight operations and ≤10 million dirty pages.
3. Cancellation protocol: A caller (or the kernel on behalf of a caller) can cancel a pending request through the following protocol:
a. The caller invokes vfs_cancel(request_id) (internal kernel API, not exposed
as a syscall — cancellation is triggered by signal delivery, thread exit, or
timeout).
b. The kernel sets the CANCEL bit (RingSlotFlags::CANCEL) in the slot_states entry
for the target request's ring slot. This is an atomic bitwise-OR on that byte
(Relaxed ordering; the flag is advisory, and correctness depends on the completion
protocol, not on ordering of the flag write itself).
c. The kernel enqueues a CancelToken on the cancellation side-channel of the
VfsRingPair (belt-and-suspenders: the side-channel ensures the driver is
notified even if it has already dequeued the request but not yet checked the
slot flags).
d. The filesystem driver checks RingSlotFlags::CANCEL in slot_states[idx] before
starting I/O for each dequeued request. If the flag is set, the driver writes
a VfsResponse::Err(-ECANCELED) to the response ring and moves to the next
request. The driver MUST NOT begin any side-effecting I/O (block reads, metadata
updates) for a cancelled request.
e. If the driver has already started processing the request (e.g., issued a block
I/O read before the CANCEL flag was set), the driver completes the operation
normally and writes the result to the response ring. The kernel discards the
response silently since the request is already resolved from the caller's
perspective.
f. The caller (kernel-side) always waits for a completion slot from the driver —
either VfsResponse::Err(-ECANCELED) (if the driver saw the flag) or a normal
VfsResponse::Ok/VfsResponse::Err (if the driver had already started I/O).
The per-request timeout still applies; if neither response arrives within the
timeout, the request enters the crash-recovery path (step 2 above).
bitflags::bitflags! {
/// Per-slot flags for the cancellation protocol.
///
/// **Storage**: These flags are stored in the `slot_states` array
/// alongside the `RingSlotState` values (see
/// [Section 14.3](#vfs-per-cpu-ring-extension--ring-topology)), NOT in the ring
/// slot entry itself. The `VfsRequest` struct starts with
/// `request_id: u64` — there is no flags header in the slot data.
///
/// The `slot_states[idx]` array uses `AtomicU8` where the lower 2 bits
/// encode `RingSlotState` (EMPTY=0, RESERVED=1, FILLED=2, CONSUMED=3)
/// and bit 7 encodes the CANCEL flag. This allows both state transitions
/// and cancellation to be managed through a single atomic byte per slot,
/// avoiding a separate flags array.
///
/// The `OCCUPIED` flag from the previous design is replaced by the
/// `RingSlotState::Filled` value — a slot is "occupied" when its state
/// is FILLED (consumer has not yet consumed it).
#[repr(transparent)]
pub struct RingSlotFlags: u8 {
/// Request is cancelled. The writer (VFS side) sets this bit to
/// signal that the driver should skip processing. The driver checks
/// this bit before beginning I/O. If the driver has already started
/// I/O, the flag is ignored and the operation completes normally.
///
/// Bit 7 of the slot_states[idx] AtomicU8. Set via
/// `slot_states[idx].fetch_or(CANCEL.bits(), Relaxed)`.
const CANCEL = 1 << 7;
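// Bits 2-6 are reserved for future use. (Plain comment: a trailing doc
// comment with no flag after it would not parse inside `bitflags!`.)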
}
}
/// Token placed on the cancellation side-channel of a VfsRingPair to notify
/// the filesystem driver that a previously enqueued request should be skipped.
#[repr(C)]
pub struct CancelToken {
/// The `request_id` of the cancelled request. Matches `VfsRequest::request_id`.
pub request_id: u64,
/// Why the request was cancelled.
pub reason: CancelReason,
/// Explicit trailing padding (struct alignment = 8 from u64 field).
_pad: [u8; 4],
}
const_assert!(size_of::<CancelToken>() == 16);
/// Reason for request cancellation.
#[repr(u32)]
pub enum CancelReason {
/// The per-operation timeout expired before the driver responded.
Timeout = 1,
/// The calling thread was interrupted (signal delivery or thread exit).
CallerCancelled = 2,
/// The filesystem driver crashed; all pending requests are being flushed.
DriverCrash = 3,
}
Cancellation state machine — the lifecycle of a cancelled request from the driver's perspective:
Driver dequeues request from ring:
1. Read slot_states[idx] with Acquire ordering and extract the flag bits.
2. If flags & CANCEL:
→ Write VfsResponse::Err(-ECANCELED) to response ring.
→ Increment consumer pointer. Done.
3. If !(flags & CANCEL):
→ Begin I/O processing.
→ Re-check flags & CANCEL periodically for long operations
(optional optimization — not required for correctness).
→ Complete I/O. Write VfsResponse::Ok/Err to response ring.
→ If CANCEL was set between steps 3 and completion:
kernel discards the response (request already resolved).
This protocol guarantees: (1) each request is resolved exactly once from the caller's perspective, with any late driver response discarded by the kernel, (2) no I/O is wasted on requests that were cancelled before the driver began processing, (3) I/O that has already started is never aborted mid-flight (which would risk filesystem inconsistency).
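The single-byte slot encoding used by the cancellation protocol (state in the lower 2 bits, CANCEL in bit 7) can be exercised with the sketch below. The function names and the `DriverAction` enum are hypothetical; the bit positions, state values, and memory orderings follow the RingSlotFlags documentation and the state machine above.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

pub const STATE_MASK: u8 = 0b0000_0011; // lower 2 bits: RingSlotState
pub const CANCEL: u8 = 1 << 7; // bit 7: cancellation flag

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum RingSlotState {
    Empty = 0,
    Reserved = 1,
    Filled = 2,
    Consumed = 3,
}

/// Decode the state bits, ignoring the flag bits.
pub fn state_of(raw: u8) -> RingSlotState {
    match raw & STATE_MASK {
        0 => RingSlotState::Empty,
        1 => RingSlotState::Reserved,
        2 => RingSlotState::Filled,
        _ => RingSlotState::Consumed,
    }
}

/// VFS side: request cancellation without disturbing the state bits.
pub fn request_cancel(slot: &AtomicU8) {
    slot.fetch_or(CANCEL, Ordering::Relaxed);
}

#[derive(Debug, PartialEq, Eq)]
pub enum DriverAction {
    /// Reply with -ECANCELED and move on; no side-effecting I/O.
    ReplyCancelled,
    /// Begin I/O; a later CANCEL no longer aborts the operation.
    Process,
}

/// Driver side: decide once, before any side-effecting I/O begins.
pub fn on_dequeue(slot: &AtomicU8) -> DriverAction {
    let raw = slot.load(Ordering::Acquire);
    if (raw & CANCEL) != 0 {
        DriverAction::ReplyCancelled
    } else {
        DriverAction::Process
    }
}
```

Because `fetch_or` touches only bit 7, a cancellation that races with a state transition cannot corrupt the state bits.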
4. VfsResponse::Pending semantics: A VfsResponse::Pending response from the
filesystem driver means the request has been accepted and acknowledged but not yet
completed (for example, the driver has issued a block I/O request and is waiting for
device completion). The contract is:
- The caller must poll the response_ring or sleep on VfsRingPair::completion for the final VfsResponse::Ok or VfsResponse::Err.
- Pending does NOT reset the per-request timeout timer. The maximum time in Pending state is bounded by the operation timeout defined above. If the final response does not arrive within the timeout, the request is cancelled using the standard cancellation protocol (step 3).
- A driver may send at most one Pending response per request. Sending multiple Pending responses for the same request_id is a protocol violation; the kernel logs a warning and ignores duplicate Pending responses.
- Pending is optional: a driver may respond directly with Ok or Err without ever sending Pending. It exists to allow the VFS layer to distinguish "driver has seen the request" from "request is still sitting in the ring unprocessed" for diagnostic and health-monitoring purposes (Section 20.1).
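The Pending contract can be captured in a small per-request tracker. The `InFlight` type, the `WireEvent` and `Outcome` enums, and the millisecond clock are hypothetical simplifications; the sketch only shows that Pending leaves the deadline untouched and that a second Pending is ignored.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum WireEvent {
    Pending,
    Final(i64), // final status: >= 0 or negated errno
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    StillWaiting,
    /// Protocol violation: the kernel would log a warning here.
    DuplicatePendingIgnored,
    Completed(i64),
}

/// Hypothetical kernel-side bookkeeping for one in-flight request.
pub struct InFlight {
    pub deadline_ms: u64, // absolute deadline, fixed at enqueue time
    pub seen_pending: bool,
}

impl InFlight {
    pub fn new(enqueued_at_ms: u64, timeout_ms: u64) -> Self {
        InFlight {
            deadline_ms: enqueued_at_ms + timeout_ms,
            seen_pending: false,
        }
    }

    pub fn on_response(&mut self, ev: WireEvent) -> Outcome {
        match ev {
            WireEvent::Final(status) => Outcome::Completed(status),
            WireEvent::Pending if self.seen_pending => Outcome::DuplicatePendingIgnored,
            WireEvent::Pending => {
                // Acknowledgment only: the deadline is deliberately not reset.
                self.seen_pending = true;
                Outcome::StillWaiting
            }
        }
    }

    pub fn timed_out(&self, now_ms: u64) -> bool {
        now_ms >= self.deadline_ms
    }
}
```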
5. Filesystem Driver Concurrency Requirements:
Driver concurrency model: The SPSC ring serializes delivery of requests to
the filesystem driver, not processing. Filesystem driver implementations MUST
process dequeued requests concurrently using internal work queues. A driver that
processes requests sequentially will exhibit head-of-line blocking (e.g., a
stat() waiting behind a fsync()).
The request_id-based response matching already supports out-of-order completion.
A compliant driver implementation:
- Dequeues requests in batches (up to ring depth).
- Dispatches each request to an appropriate internal work queue (e.g., metadata operations to a fast-path thread, journal commits to a dedicated journal thread).
- Responds via VfsResponse with the matching request_id as each operation completes; responses may arrive in any order.
The VfsResponse::Pending mechanism (see above) provides acknowledgment for
long-running operations, allowing the VFS layer to distinguish "driver is
processing" from "driver is stuck."
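The concurrency requirement can be illustrated with a userspace model. This is a sketch under loose assumptions: one thread per request stands in for the driver's internal work queues, an mpsc channel stands in for the response ring, and the sleep durations are arbitrary simulated latencies. The `Request` and `Response` types are reduced to the fields needed.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Clone, Copy, Debug)]
pub enum Op {
    Stat,
    Fsync,
}

pub struct Request {
    pub request_id: u64,
    pub op: Op,
}

pub struct Response {
    pub request_id: u64,
    pub status: i64,
}

/// Driver side: requests are delivered in order but processed concurrently;
/// each response is sent as soon as its operation finishes.
pub fn run_driver(requests: Vec<Request>) -> Vec<Response> {
    let (tx, rx) = mpsc::channel();
    let workers: Vec<_> = requests
        .into_iter()
        .map(|req| {
            let tx = tx.clone();
            thread::spawn(move || {
                // Simulated latency: fsync is slow, stat is fast.
                let delay_ms = match req.op {
                    Op::Fsync => 50,
                    Op::Stat => 1,
                };
                thread::sleep(Duration::from_millis(delay_ms));
                tx.send(Response { request_id: req.request_id, status: 0 })
                    .unwrap();
            })
        })
        .collect();
    drop(tx); // the receiver loop ends once every worker has sent
    let responses: Vec<Response> = rx.iter().collect();
    for w in workers {
        w.join().unwrap();
    }
    responses
}

/// VFS side: match responses to waiters by request_id, not arrival order.
pub fn match_by_id(responses: Vec<Response>) -> HashMap<u64, i64> {
    responses
        .into_iter()
        .map(|r| (r.request_id, r.status))
        .collect()
}
```

With an fsync submitted ahead of a stat, the stat response will typically arrive first, so the stat waiter is not held behind the slower operation; matching by request_id makes the arrival order irrelevant.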
Rationale: Unlike Linux, where multiple threads enter the filesystem driver concurrently on different CPUs (with filesystem-internal locking providing serialization at a finer granularity), UmkaOS's SPSC ring serializes delivery. The page cache absorbs >95% of read/stat operations with zero domain crossings (see "Key design properties" above), so the ring is crossed primarily for cache-miss reads, writes, and metadata mutations. For these operations, concurrent driver-internal processing is essential for matching Linux's multi-threaded I/O throughput on a single filesystem.
Scaling beyond single-ring: For workloads with high write concurrency on a single mount (e.g., PostgreSQL checkpoint with 64 backends issuing concurrent fsyncs), the single SPSC ring becomes a producer-side contention point. The per-CPU VFS ring extension (Section 14.3) replaces the single ring with N SPSC rings (one per CPU or per CPU group), eliminating cross-CPU cache line bouncing while preserving all SPSC lock-free invariants per ring.
14.3 Per-CPU VFS Ring Extension¶
14.3.1 Motivation¶
The baseline VFS ring protocol (Section 14.2) allocates a single
SPSC VfsRingPair per mounted filesystem. The VFS (umka-core) is the sole producer
on the request ring; the filesystem driver is the sole consumer. This design is
correct and efficient for single-threaded workloads, but creates a producer-side
serialization bottleneck when many CPUs issue concurrent filesystem operations on
the same mount.
PostgreSQL checkpoint scenario (the motivating workload): PostgreSQL's checkpoint
process (and since PG 15, the checkpointer plus background writer) issues fsync()
on hundreds to thousands of relation files concurrently. With 64 backends each doing
fsync() on different files, all fsync requests serialize through the single request
ring. The ring depth is 256 entries by default; under contention, producers spin
waiting for the ring to drain. The filesystem driver dequeues and dispatches to
internal work queues, but the single-ring bottleneck means:
- Producer serialization: The SPSC ring protocol requires a single producer. Since the VFS runs on any CPU, slot reservation must be serialized. Under 64-way contention with a single ring, this serialization becomes the bottleneck. Each CPU contends for slot reservation, waiting for the lock, and the lock holder's cache line bounces between CPUs via the coherence protocol. At ~50-70 cycles per bounce on x86-64, handing the lock across 64 contending CPUs costs ~3,200-4,500 cycles (roughly 1.1-1.5 us at an assumed 3 GHz clock).
- Head-of-line blocking: A slow fsync that fills the ring blocks all subsequent producers from enqueueing any request (reads, stats, lookups) on the same mount. The single ring's depth (256 entries by default) is shared across all CPUs.
- Doorbell storm: Each CPU that reserves a slot and enqueues a request rings the doorbell independently. With 64 concurrent producers, the driver receives up to 64 doorbells for requests that could have been coalesced.
This extension replaces the single SPSC ring per mount with N rings per
mount, one per CPU or per CPU group. Under PerCpu granularity (the primary
mode), each ring is pure SPSC — no CAS, no contention between producers. Under
shared-ring modes (PerNuma/PerLlc/Fixed), multiple CPUs share a ring with
atomic (CAS) head allocation. The CPU that issues a VFS operation uses its
assigned ring, eliminating or greatly reducing cross-CPU cache line bouncing.
14.3.2 Design Principles¶
- PerCpu rings are pure SPSC (the primary mode). Under PerCpu granularity (the default for >= 65 CPUs), each ring has exactly one producer. All SPSC invariants, memory ordering guarantees, and backpressure semantics are preserved per ring. No CAS on the ring head, just relaxed load/store.
- Shared-ring modes use CAS on head: safe but slower. Under PerNuma, PerLlc, and Fixed granularity, multiple CPUs share a ring. The reserve_slot() function uses a CAS loop on head for atomic slot allocation. This is lock-free (no spinlock) but has O(N) expected CAS retries under N-way contention. The CAS loop is bounded by ring size.
- The driver side multiplexes N rings. The filesystem driver's consumer thread(s) poll all N rings. This is the only new complexity: the driver must handle multiple input rings instead of one.
- Backward compatible. Drivers that only support single-ring mode (ring_count_max = 1 in their KABI manifest) continue to work unchanged. The VFS falls back to single-ring mode for such drivers.
- No new lock types. Ring selection is determined by cpu_id at VFS entry: no lock, no CAS, no arbitration. The mapping is a static array lookup. The shared-ring CAS operates on the existing head atomic.
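Ring selection and shared-ring slot reservation can be sketched as follows. `SharedRing` is a reduced stand-in with only `head`, `tail`, and a capacity; the real reserve_slot() also manages slot states and inflight_ops, which are omitted here. The fallback to ring 0 for out-of-range CPU IDs follows the VfsRingSet field documentation.

```rust
use std::sync::atomic::{AtomicU16, AtomicU64, Ordering};

/// Ring selection: a table lookup keyed by CPU ID. Out-of-range IDs fall
/// back to ring 0.
pub fn select_ring(cpu_to_ring: &[AtomicU16], cpu_id: usize) -> u16 {
    cpu_to_ring
        .get(cpu_id)
        .map_or(0, |ring| ring.load(Ordering::Relaxed))
}

/// Reduced shared ring: `head` is the next sequence number to hand out,
/// `tail` is the oldest sequence number not yet released by the consumer.
pub struct SharedRing {
    pub head: AtomicU64,
    pub tail: AtomicU64,
    pub capacity: u64,
}

impl SharedRing {
    /// Shared-ring modes: CAS loop on `head`. Returns the slot index, or
    /// None when the ring is full.
    pub fn reserve_slot(&self) -> Option<u64> {
        let mut head = self.head.load(Ordering::Relaxed);
        loop {
            let tail = self.tail.load(Ordering::Acquire);
            if head.wrapping_sub(tail) >= self.capacity {
                return None; // full: the caller applies backpressure
            }
            match self.head.compare_exchange_weak(
                head,
                head.wrapping_add(1),
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(head % self.capacity),
                // Another producer won the race (or the weak CAS failed
                // spuriously): retry with the value actually observed.
                Err(observed) => head = observed,
            }
        }
    }
}
```

Each failed CAS either was spurious or means some other producer made progress, which is what makes the loop lock-free. Under PerCpu granularity the same reservation reduces to a plain load and store, since no other producer can touch `head`.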
14.3.3 Ring Topology¶
14.3.3.1 VfsRingSet: Per-Mount Ring Collection¶
The single VfsRingPair is replaced by a VfsRingSet that contains 1..N ring
pairs, where N is negotiated at mount time.
/// VfsRingSet state constants. These are the mount-level recovery states,
/// distinct from the per-ring DomainRingBuffer.state values (Active=0,
/// Disconnected=1).
pub const VFSRS_ACTIVE: u8 = 0;
pub const VFSRS_RECOVERING: u8 = 1;
pub const VFSRS_QUIESCING: u8 = 2;
/// Per-mount collection of VFS ring pairs. Replaces the single `VfsRingPair`
/// from [Section 14.2](#vfs-ring-buffer-protocol).
///
/// Each ring pair is a full SPSC channel (request + response + doorbell +
/// completion wait queue). Rings are indexed by CPU group — each CPU is
/// statically assigned to exactly one ring at mount time.
///
/// Placement: Tier 0 (Core). The VfsRingSet is owned by the superblock
/// and lives in umka-core memory. The ring data regions are in shared
/// memory (PKEY 14 shared DMA pool on x86-64).
// Kernel-allocated, kernel-owned. Layout is depended upon by Tier 1
// drivers via `&VfsRingSet` reference passed at vfs_init() time.
// `#[repr(C)]` ensures stable field offsets across compilation units.
#[repr(C)]
pub struct VfsRingSet {
/// Array of ring pairs. Length is `ring_count`. Index 0 is always valid.
/// Allocated from the kernel slab at mount time (warm path, bounded N).
/// Maximum N = MAX_VFS_RINGS (256, sufficient for 256-core systems
/// with 1:1 CPU-to-ring mapping).
///
/// SAFETY: Allocated from kernel slab at mount time. Valid for the
/// lifetime of the VfsRingSet (which is the lifetime of the mount).
/// `ring_count` is the element count. Freed in umount after all rings
/// are drained. During crash recovery, `drain_all_vfs_rings()` accesses
/// rings under `VfsRingSet.state == RECOVERING`, which prevents
/// concurrent `select_ring()` access. Raw pointer required because
/// VfsRingSet is `#[repr(C)]` for KABI transport.
/// `debug_assert!(ring_count <= MAX_VFS_RINGS)` at mount-time init.
///
/// VfsRingSet implements Send + Sync because `rings` points to
/// slab-allocated memory that outlives all users; ring_count is
/// immutable after mount.
///
/// `*const` because VfsRingPair's mutable fields (via `RingBuffer<T>`:
/// `inner.head`, `inner.tail`, `inner.published`, `slot_states`,
/// `inflight_ops`) are all atomics — shared access is safe via atomic
/// operations. The pointer itself is immutable after mount (the array
/// base never changes); interior mutability is provided by the atomic
/// fields within each VfsRingPair's `RingBuffer<T>` members.
///
/// Raw pointer required because `#[repr(C)]` layout is depended upon
/// by Tier 1 VFS drivers that receive `&VfsRingSet` via `vfs_init()`.
/// Although VfsRingSet is kernel-allocated and kernel-owned, its field
/// offsets are ABI-visible to the driver through the reference.
pub rings: *const VfsRingPair,
/// Number of active ring pairs. Range: 1..=MAX_VFS_RINGS.
/// Set at mount time after negotiation with the driver.
/// Invariant: ring_count >= 1 (single-ring mode is the minimum).
pub ring_count: u16,
/// CPU-to-ring mapping table. Index: CPU ID (0..nr_cpu_ids).
/// Value: ring index (0..ring_count). Populated at mount time.
/// Updated atomically on CPU hotplug events.
///
/// This is a read-only lookup table on the hot path — no lock, no CAS.
/// The table is allocated once at mount time with capacity for
/// `nr_cpu_ids` entries (runtime-discovered, not hardcoded).
///
/// For CPU IDs beyond the table size (should not happen — table is
/// sized to nr_cpu_ids at mount time), the fallback is ring index 0.
///
/// SAFETY: Same lifetime as `rings` — allocated at mount, freed at
/// umount. `cpu_to_ring_len` is the element count. Raw pointer for
/// `#[repr(C)]` KABI transport compatibility. `*mut` because
/// `cpu_to_ring` entries are updated by CPU hotplug
/// (`vfs_rings_cpu_online`) via `AtomicU16::store()`.
pub cpu_to_ring: *mut AtomicU16,
/// Number of entries in the cpu_to_ring table (== nr_cpu_ids at mount time).
/// u32: supports systems with >65535 CPUs (large HPC / datacenter nodes).
pub cpu_to_ring_len: u32,
/// Ring allocation granularity used at mount time.
pub granularity: RingGranularity,
/// Mount-wide doorbell coalescing state, shared by all rings in this set.
/// See "Doorbell Coalescing" section below.
pub coalesced_doorbell: CoalescedDoorbell,
/// Global request_id generator for this mount. All rings draw from this
/// single atomic counter to ensure mount-wide unique request IDs.
/// See "Request ID Generation" below.
///
/// **Cache line isolation**: This field is hot-path (atomically incremented
/// on every VFS operation). It must NOT share a cache line with other
/// frequently-written fields. The `state` field below is cold-path (written
/// only during crash recovery), so false sharing with `state` is acceptable.
/// The `CacheLinePadded` wrapper ensures `next_request_id` starts on its
/// own cache line when VfsRingSet is cache-line aligned.
pub next_request_id: CacheLinePadded<AtomicU64>,
/// Mount-level state for crash recovery coordination.
/// Set to RECOVERING during crash recovery to block all rings.
/// Cold path only — written during crash recovery, read (Relaxed) on
/// ring selection hot path for early-exit check.
///
/// Constants:
/// - `VFSRS_ACTIVE = 0u8`: Normal operation. select_ring() proceeds.
/// - `VFSRS_RECOVERING = 1u8`: Crash recovery in progress. select_ring()
/// returns ENXIO. Set at Step U3, cleared at Step U17.
/// - `VFSRS_QUIESCING = 2u8`: Live evolution quiescence in progress.
/// select_ring() returns ENXIO. Set at evolution initiation, cleared
/// when the replacement driver's consumer loops are ready.
///
/// These are DISTINCT from the per-ring `DomainRingBuffer.state` values
/// (0 = Active, 1 = Disconnected). The VfsRingSet state gates ALL rings
/// in the mount; the per-ring state controls individual ring operation.
pub state: AtomicU8,
/// Padding to fill remaining bytes in the struct layout after state,
/// preventing adjacent slab allocations from sharing this cache line.
/// (Not cache-line alignment of `state` itself — for that, add
/// `#[repr(C, align(64))]` to the struct.)
_pad: [u8; 5],
}
// VfsRingSet is kernel-allocated but its layout is depended upon by Tier 1
// drivers that receive &VfsRingSet via vfs_init(). Contains raw pointers and
// atomics with platform-dependent sizes — no const_assert (size varies between
// 32-bit and 64-bit architectures).
// SAFETY: `rings` and `cpu_to_ring` point to slab-allocated memory that
// outlives all users of VfsRingSet (freed at umount after all rings are
// drained). All mutable state within VfsRingPair uses atomics. ring_count
// and cpu_to_ring_len are immutable after mount-time initialization.
unsafe impl Send for VfsRingSet {}
unsafe impl Sync for VfsRingSet {}
/// Maximum number of VFS rings per mount. Sized for 256-core systems
/// with 1:1 CPU-to-ring mapping. Systems with >256 CPUs use CPU-group
/// mapping (multiple CPUs share one ring). The value is a compile-time
/// upper bound for array sizing; the actual ring count is negotiated
/// at mount time and is typically much smaller.
pub const MAX_VFS_RINGS: usize = 256;
/// Ring allocation granularity — how CPUs are mapped to rings.
#[repr(u8)]
pub enum RingGranularity {
/// One ring per CPU. Maximum parallelism, maximum memory usage.
/// Best for high-IOPS workloads (databases, storage servers).
PerCpu = 0,
/// One ring per NUMA node. CPUs on the same NUMA node share one ring.
/// Good balance of parallelism and memory. Reduces cross-NUMA cache
/// bouncing while keeping ring count manageable.
PerNuma = 1,
/// One ring per LLC (Last-Level Cache) group. CPUs sharing an L3 cache
/// share one ring. Finer than PerNuma on multi-CCX/chiplet designs
/// (AMD EPYC, Intel Sapphire Rapids). Within an LLC group, cache line
/// bouncing for the ring head is L3-local (~10-15 cycles) rather than
/// cross-socket (~50-70 cycles).
PerLlc = 2,
/// Fixed number of rings (specified via mount option). CPUs are
/// distributed round-robin across rings. Used when the operator
/// wants explicit control (e.g., `vfs_ring_count=4`).
Fixed = 3,
/// Single ring (legacy mode). Equivalent to the baseline protocol.
/// Used when the driver reports `ring_count_max = 1`.
Single = 4,
}
14.3.3.2 CPU-to-Ring Assignment¶
At mount time, the VFS builds the cpu_to_ring mapping table based on the
negotiated ring_count and granularity:
/// Build the CPU-to-ring mapping table at mount time.
///
/// # Arguments
/// * `ring_count` — Negotiated number of rings (1..=MAX_VFS_RINGS).
/// * `granularity` — How CPUs are grouped into rings.
///
/// # Returns
/// Slab-allocated mapping table of length `nr_cpu_ids`.
///
/// Hot path access: `cpu_to_ring[smp_processor_id()].load(Relaxed)` — one
/// atomic load (~1 cycle on x86-64 TSO, ~1-3 cycles on ARM/RISC-V).
fn build_cpu_to_ring_map(
ring_count: u16,
granularity: RingGranularity,
) -> &'static [AtomicU16] {
let nr_cpus = arch::current::cpu::nr_cpu_ids();
let table = slab_alloc_zeroed::<AtomicU16>(nr_cpus);
match granularity {
RingGranularity::PerCpu => {
// 1:1 mapping when ring_count >= nr_cpus: CPU i → ring i.
// If nr_cpus > ring_count, CPUs wrap with modulo (i % ring_count).
for cpu in 0..nr_cpus {
table[cpu].store((cpu % ring_count as usize) as u16, Relaxed);
}
}
RingGranularity::PerNuma => {
// One ring per NUMA node (up to ring_count nodes).
// NUMA node IDs are discovered at boot via ACPI SRAT / device tree.
for cpu in 0..nr_cpus {
let node = arch::current::cpu::cpu_to_node(cpu);
table[cpu].store((node % ring_count as usize) as u16, Relaxed);
}
}
RingGranularity::PerLlc => {
// One ring per LLC group. LLC group IDs are discovered at boot
// via CPUID (x86), CLIDR_EL1 (AArch64), or device tree.
for cpu in 0..nr_cpus {
let llc_id = arch::current::cpu::cpu_to_llc_group(cpu);
table[cpu].store((llc_id % ring_count as usize) as u16, Relaxed);
}
}
RingGranularity::Fixed => {
// Round-robin distribution.
for cpu in 0..nr_cpus {
table[cpu].store((cpu % ring_count as usize) as u16, Relaxed);
}
}
RingGranularity::Single => {
// All CPUs map to ring 0.
// Table is already zero-initialized.
}
}
table
}
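The mapping rules in `build_cpu_to_ring_map` can be sketched in user space as follows. This is a minimal model, not the kernel code: `Granularity`, `map_cpus`, and the `cpu_to_group` closure are hypothetical stand-ins for `RingGranularity`, `build_cpu_to_ring_map`, and the `cpu_to_node` / `cpu_to_llc_group` topology lookups, and the table is a plain `Vec<u16>` rather than a slab-allocated `AtomicU16` array.

```rust
/// Hypothetical user-space model of the CPU-to-ring mapping (not the kernel API).
#[derive(Clone, Copy)]
pub enum Granularity {
    /// CPU index modulo ring count (models PerCpu and Fixed).
    ByCpu,
    /// Group ID (NUMA node or LLC group) modulo ring count (models PerNuma and PerLlc).
    ByGroup,
    /// Every CPU maps to ring 0 (models Single).
    Single,
}

/// Build a CPU -> ring table. `cpu_to_group` stands in for the topology lookup.
pub fn map_cpus(
    nr_cpus: usize,
    ring_count: u16,
    granularity: Granularity,
    cpu_to_group: impl Fn(usize) -> usize,
) -> Vec<u16> {
    assert!(ring_count >= 1, "single-ring mode is the minimum");
    let rings = ring_count as usize;
    (0..nr_cpus)
        .map(|cpu| match granularity {
            Granularity::ByCpu => (cpu % rings) as u16,
            Granularity::ByGroup => (cpu_to_group(cpu) % rings) as u16,
            Granularity::Single => 0,
        })
        .collect()
}
```

For a toy topology of 8 CPUs in 2 groups of 4 with `ring_count = 2`, `map_cpus(8, 2, Granularity::ByGroup, |cpu| cpu / 4)` yields `[0, 0, 0, 0, 1, 1, 1, 1]`. Every entry is always below `ring_count`, which is the invariant `select_ring()` relies on.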
Hot path ring selection — the VFS dispatch path (step 4 in the dispatch flow from Section 14.2) selects the ring as follows:
/// Ring slot states for the split reservation/completion protocol.
///
/// This protocol separates slot reservation (which requires producer ordering)
/// from data fill (which may fault on `copy_from_user`). The key insight:
/// `preempt_disable` is needed only around `select_ring()` + `reserve_slot()`
/// (~3-8 instructions, a few nanoseconds), NOT around the entire ring operation.
/// After reservation, the slot is owned by the reserving task regardless of
/// which CPU it runs on. This enables:
/// - Inline small I/O: `copy_from_user` during ring fill
/// - FUSE passthrough: userspace access during ring submission
/// - Any path needing page faults during data fill
///
/// ```text
///   EMPTY → RESERVED → FILLED → CONSUMED → EMPTY
///     ↑                                      |
///     +--------------------------------------+
/// ```
///
/// - `EMPTY`: Slot is available for reservation. The producer moves it to
/// RESERVED (swap, debug-asserting that the previous state was EMPTY).
/// The consumer skips EMPTY slots.
/// - `RESERVED`: Slot is claimed by a producer, which owns it exclusively
/// until it is FILLED. `tail` never advances past a RESERVED slot
/// (head-of-line blocking for `tail`), but the consumer's scan continues
/// past it and processes later FILLED slots (see the consumer algorithm).
/// - `FILLED`: Producer has written data and marked the slot complete. Consumer
/// may process this slot. Transition: consumer sets CONSUMED after processing.
/// - `CONSUMED`: Consumer has processed the slot. The consumer transitions
/// CONSUMED → EMPTY immediately after processing (single owner; no race).
/// `tail` advances past contiguous CONSUMED→EMPTY slots.
///
/// **State ownership**: Only the producer writes EMPTY→RESERVED and
/// RESERVED→FILLED. Only the consumer writes FILLED→CONSUMED→EMPTY and
/// advances `tail`. The `head` is producer-owned. This two-party protocol
/// has no ABA risk because each slot has exactly one owner at a time.
#[repr(u8)]
pub enum RingSlotState {
Empty = 0,
Reserved = 1,
Filled = 2,
Consumed = 3,
}
/// Check open_generation before dispatching any VFS operation.
/// This is the VFS dispatch entry point referenced in the OpenFile
/// struct doc comment. Must be called before `select_ring()`.
///
/// Returns `Err(ENOTCONN)` if the file was opened with a different
/// driver generation (the driver has crashed and been reloaded since
/// this file was opened). Userspace must close and re-open the file.
#[inline(always)]
fn vfs_check_open_generation(file: &File) -> Result<(), KernelError> {
let current_gen = file.inode.i_sb.driver_generation.load(Ordering::Acquire);
if file.open_generation != current_gen {
return Err(KernelError::ENOTCONN);
}
Ok(())
}
/// Select the VFS ring for the current CPU (the first step of slot
/// reservation).
///
/// This is the hot-path entry point, called on every VFS operation that
/// crosses the domain boundary. The full protocol:
///
/// ```ignore
/// preempt_disable();
/// let ring = select_ring(ring_set)?; // ENXIO if RECOVERING/QUIESCING
/// let (slot_idx, seq) = reserve_slot(ring)?; // CAS on head (shared) or store (per-CPU)
/// preempt_enable();
/// // --- preemption safe zone: fill data, may fault ---
/// fill_slot_data(slot_idx, &request); // copy_from_user OK here
/// complete_slot(ring, slot_idx, seq); // store(FILLED, Release) + advance published
/// ```
///
/// **Preemption note**: `preempt_disable()` is held only around
/// `select_ring()` + `reserve_slot()` (~3-8 instructions, ~4-20 cycles),
/// not around the entire ring operation. After reservation, preemption
/// is re-enabled: the slot is owned by the reserving task regardless of
/// which CPU it runs on. Data fill and `complete_slot()` (mark slot as
/// FILLED) can happen from any CPU, including after migration or a page
/// fault.
///
/// **Producer model**: Under `PerCpu` granularity (the primary mode), each
/// ring has exactly one producer: the `preempt_disable` window guarantees
/// no other task on this CPU can interleave. This is pure SPSC: no CAS on
/// `head`, just a relaxed load + store. Under `PerNuma`/`PerLlc`/`Fixed`
/// granularity, multiple CPUs share a ring. The `reserve_slot()` function
/// uses a CAS loop on `head` to provide atomic slot allocation. See
/// `reserve_slot()` below for both paths.
///
/// **Inline write advisory**: For shared-ring modes (`PerNuma`/`PerLlc`/
/// `Fixed`), inline writes that may trigger page faults during
/// `copy_from_user()` can hold a RESERVED slot for the duration of the
/// fault (~1-50 us minor, ~1-10 ms major). This causes head-of-line
/// blocking for the consumer on that ring. For shared rings, the VFS
/// dispatch path SHOULD prefer the DMA buffer path (pre-copy before
/// reservation) for inline-eligible writes if the ring's pending count
/// exceeds `ring.size / 2`. This is a performance heuristic, not a
/// correctness requirement: the consumer never advances `tail` past a
/// RESERVED slot (see consumer algorithm below).
///
/// Cost (PerCpu): 1 atomic load (Relaxed) + bounds check + 1 store. ~3-5 cycles.
/// Cost (shared): 1 CAS loop (~5-20 cycles under contention) + 1 swap on slot state.
#[inline(always)]
fn select_ring(ring_set: &VfsRingSet) -> Result<&VfsRingPair, KernelError> {
// Fast-path rejection during crash recovery or live evolution.
// Relaxed ordering is sufficient for two reasons:
// 1. **Downstream synchronization provides correctness**: operations
// that pass this check go on to `reserve_slot()`, which increments
// `inflight_ops`. The inflight_ops counter is the true
// synchronization point: any operation that slips past this
// Relaxed check is still counted and waited for during drain.
// 2. **False negatives are bounded**: reading ACTIVE when the state
// is actually RECOVERING is bounded by store propagation delay
// (nanoseconds on TSO, microseconds on weak-memory architectures).
// The operation will be caught by the inflight_ops drain.
//
// False positives during ACTIVE -> RECOVERING transition cannot happen
// because the inflight_ops barrier in wait_for_producers_quiesced()
// ensures all producers have completed before drain begins.
//
// False positives during RECOVERING -> ACTIVE transition (U17) CAN
// occur on weak-memory architectures (ARM, RISC-V): a producer on
// another CPU may load Relaxed and see RECOVERING after U17 stores
// ACTIVE with Release. This is harmless — the producer gets ENXIO and
// retries on the next attempt (the window is nanoseconds to microseconds).
if ring_set.state.load(Ordering::Relaxed) != VFSRS_ACTIVE {
return Err(KernelError::ENXIO);
}
let cpu = arch::current::cpu::smp_processor_id();
let ring_idx = if cpu < ring_set.cpu_to_ring_len as usize {
// SAFETY: cpu < cpu_to_ring_len, validated above. cpu_to_ring points
// to a slab-allocated array of cpu_to_ring_len AtomicU16 elements,
// valid for the lifetime of the mount.
unsafe { (*ring_set.cpu_to_ring.add(cpu)).load(Ordering::Relaxed) } as usize
} else {
0 // Fallback for CPUs beyond the table (should not happen).
};
// SAFETY: ring_idx is in bounds — cpu_to_ring values are validated
// at mount time to be < ring_count. Bounds check is redundant but
// present for defense-in-depth.
debug_assert!(ring_idx < ring_set.ring_count as usize);
let idx = if ring_idx < ring_set.ring_count as usize {
ring_idx
} else {
0
};
// SAFETY: rings is a valid, non-null pointer to ring_count VfsRingPair
// elements, allocated from kernel slab at mount time and valid for the
// lifetime of the mount. The pointer is set during mount initialization
// and never modified afterward.
debug_assert!(!ring_set.rings.is_null());
Ok(unsafe { &*ring_set.rings.add(idx) })
}
/// Reserve a slot on the given ring. Returns `(slot_index, sequence_number)`.
/// The `sequence_number` is the `head` value at reservation time — passed to
/// `complete_slot()` so it can correctly advance the `published` watermark
/// without reading a potentially stale `head`.
///
/// Must be called with preemption disabled (caller holds PreemptGuard).
///
/// **PerCpu mode** (single producer per ring): No CAS on `head`. The caller
/// is the sole producer under `preempt_disable`, so `head` load + store is
/// safe. Cost: ~3-5 cycles.
///
/// **Shared-ring mode** (PerNuma/PerLlc/Fixed — multiple CPUs share a ring):
/// Uses a CAS loop on `head` to atomically claim the next slot. If the CAS
/// fails, the loop retries with the updated `head` (no false RingFull).
/// Cost: ~5-20 cycles depending on contention.
///
/// After `head` is successfully advanced, the producer swaps the slot state
/// to RESERVED and debug-asserts that the previous state was EMPTY. This is
/// defense-in-depth: the check always holds because the consumer transitions
/// CONSUMED → EMPTY before `tail` advances past the slot, and
/// `head - tail < size` was checked.
///
/// **Inflight tracking**: On successful reservation, increments
/// `ring.request_ring.inflight_ops` (AtomicU32, Relaxed). This counter is
/// decremented by `complete_slot()`. Crash recovery waits for all per-ring
/// `inflight_ops` to reach zero before draining, ensuring no producer is
/// mid-`copy_from_user` with a RESERVED slot.
#[inline]
fn reserve_slot(ring: &VfsRingPair) -> Result<(u32, u64), RingFull> {
if ring.shared_ring {
// --- Shared-ring path: CAS loop on head ---
loop {
let head = ring.request_ring.inner.head.load(Ordering::Acquire);
let tail = ring.request_ring.inner.tail.load(Ordering::Acquire);
if head.wrapping_sub(tail) >= ring.request_ring.inner.size as u64 {
return Err(RingFull);
}
// Attempt to advance head atomically.
match ring.request_ring.inner.head.compare_exchange_weak(
head,
head.wrapping_add(1),
Ordering::AcqRel,
Ordering::Relaxed,
) {
Ok(_) => {
// We own slot `head`. Swap the slot state; the debug_assert is defense-in-depth.
let idx = (head & (ring.request_ring.inner.size as u64 - 1)) as u32;
let slot_state = &ring.request_ring.slot_states[idx as usize];
let prev = slot_state.swap(
RingSlotState::Reserved as u8,
Ordering::AcqRel,
);
debug_assert_eq!(prev, RingSlotState::Empty as u8,
"Slot {} should be EMPTY after head CAS, was {}",
idx, prev);
// Track in-flight operation for crash recovery quiescence.
ring.request_ring.inflight_ops.fetch_add(1, Ordering::Relaxed);
return Ok((idx, head));
}
Err(_) => {
// Another CPU advanced head. Retry with updated value.
core::hint::spin_loop();
}
}
}
} else {
// --- PerCpu path: single producer, no CAS on head ---
let head = ring.request_ring.inner.head.load(Ordering::Relaxed);
let tail = ring.request_ring.inner.tail.load(Ordering::Acquire);
if head.wrapping_sub(tail) >= ring.request_ring.inner.size as u64 {
return Err(RingFull);
}
let idx = (head & (ring.request_ring.inner.size as u64 - 1)) as u32;
// Unconditional swap(RESERVED). Under preempt_disable, we are the
// sole producer on this per-CPU ring. If head - tail < size, the
// slot at `head` MUST be EMPTY — any other state indicates a broken
// invariant (a bug), not ring congestion. Using swap instead of CAS
// avoids silently returning RingFull on invariant violations.
let slot_state = &ring.request_ring.slot_states[idx as usize];
let prev = slot_state.swap(
RingSlotState::Reserved as u8,
Ordering::AcqRel,
);
debug_assert_eq!(prev, RingSlotState::Empty as u8,
"PerCpu ring invariant violation: slot {} should be EMPTY, was {}",
idx, prev);
ring.request_ring.inner.head.store(
head.wrapping_add(1), Ordering::Release,
);
// Track in-flight operation for crash recovery quiescence.
ring.request_ring.inflight_ops.fetch_add(1, Ordering::Relaxed);
Ok((idx, head))
}
}
/// Mark a reserved slot as filled (data is ready for consumer).
/// Called after data fill is complete. May be called from any CPU —
/// the slot is owned by the reserving task, not bound to a CPU.
///
/// `seq` is the sequence number (head value) returned by `reserve_slot()`.
/// It is used to advance the `published` watermark: `published` tracks
/// `seq + 1` (i.e., the slot just filled is now visible). The consumer
/// uses `published` as a doorbell hint — it tells the consumer that new
/// slots may be ready up to `published`, but the consumer still checks
/// per-slot state to determine which slots are actually FILLED.
///
/// **Out-of-order completion**: When slots are completed out of order
/// (e.g., slot 6 fills before slot 5 because slot 5's copy_from_user
/// faulted), `published` advances past slot 5 while it is still RESERVED,
/// so the range `tail..published` may contain RESERVED slots. Conversely,
/// between the FILLED store and the `fetch_max`, `published` may briefly
/// lag behind the highest FILLED slot. Both are correct: `published` only
/// bounds the scan window. The consumer scans from `tail` to `published`
/// and processes only FILLED slots. A RESERVED slot at position `tail`
/// blocks `tail` advancement (head-of-line blocking), but the consumer
/// can process FILLED slots ahead of it (scan-and-process model, see
/// consumer algorithm below).
///
/// **Memory ordering**: The `Release` store on `slot_state` (FILLED)
/// happens before the `Release` in `fetch_max` on `published`. The
/// consumer loads `published` with `Acquire`, establishing a happens-before
/// relationship: the consumer is guaranteed to see the FILLED state and
/// all data written by the producer.
#[inline]
fn complete_slot(ring: &VfsRingPair, idx: u32, seq: u64) {
let slot_state = &ring.request_ring.slot_states[idx as usize];
slot_state.store(RingSlotState::Filled as u8, Ordering::Release);
// Advance published watermark. Uses the reservation-time sequence
// number (seq + 1), NOT the current head. This prevents advertising
// slots that were reserved after this one but not yet filled.
ring.request_ring.inner.published.fetch_max(
seq.wrapping_add(1),
Ordering::Release,
);
// Decrement in-flight counter. Crash recovery waits for this to
// reach zero before draining (ensures no producer is mid-fill).
ring.request_ring.inflight_ops.fetch_sub(1, Ordering::Release);
}
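The shared-ring reservation and completion steps above can be exercised outside the kernel with a small model. The sketch below assumes a hypothetical `ModelRing` type built on `std::sync::atomic`; its `reserve` and `complete` methods mirror the shared-ring path of `reserve_slot()` and `complete_slot()` but carry no payload, no doorbell, and no preemption control.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, AtomicU8, Ordering};

pub const EMPTY: u8 = 0;
pub const RESERVED: u8 = 1;
pub const FILLED: u8 = 2;

/// Hypothetical user-space model of one request ring (not the kernel type).
pub struct ModelRing {
    pub head: AtomicU64,
    pub tail: AtomicU64,
    pub published: AtomicU64,
    pub inflight_ops: AtomicU32,
    pub slot_states: Vec<AtomicU8>,
}

impl ModelRing {
    /// `size` must be a power of two so that `seq & (size - 1)` is the slot index.
    pub fn new(size: usize) -> Self {
        assert!(size.is_power_of_two());
        ModelRing {
            head: AtomicU64::new(0),
            tail: AtomicU64::new(0),
            published: AtomicU64::new(0),
            inflight_ops: AtomicU32::new(0),
            slot_states: (0..size).map(|_| AtomicU8::new(EMPTY)).collect(),
        }
    }

    /// Shared-ring reservation: CAS loop on `head`.
    /// Returns (slot index, sequence number), or None when the ring is full.
    pub fn reserve(&self) -> Option<(usize, u64)> {
        let size = self.slot_states.len() as u64;
        loop {
            let head = self.head.load(Ordering::Acquire);
            let tail = self.tail.load(Ordering::Acquire);
            if head.wrapping_sub(tail) >= size {
                return None;
            }
            let claimed = self.head.compare_exchange_weak(
                head,
                head.wrapping_add(1),
                Ordering::AcqRel,
                Ordering::Relaxed,
            );
            if claimed.is_ok() {
                let idx = (head & (size - 1)) as usize;
                let prev = self.slot_states[idx].swap(RESERVED, Ordering::AcqRel);
                debug_assert_eq!(prev, EMPTY);
                self.inflight_ops.fetch_add(1, Ordering::Relaxed);
                return Some((idx, head));
            }
            // Lost the race (or spurious failure): retry with a fresh `head`.
            std::hint::spin_loop();
        }
    }

    /// Mark the slot FILLED and raise the `published` watermark to `seq + 1`.
    pub fn complete(&self, idx: usize, seq: u64) {
        self.slot_states[idx].store(FILLED, Ordering::Release);
        self.published.fetch_max(seq.wrapping_add(1), Ordering::Release);
        self.inflight_ops.fetch_sub(1, Ordering::Release);
    }
}
```

Reserving two slots and completing the second one first leaves `published` at 2 while slot 0 is still RESERVED and `inflight_ops` is 1. This is the out-of-order case the consumer has to tolerate by checking per-slot state.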
14.3.3.3 Consumer-Side Algorithm¶
The consumer (filesystem driver) processes slots from tail towards published.
The algorithm handles out-of-order completion (RESERVED slots between FILLED slots)
by using a scan-and-process model with strict tail advancement rules.
/// Consumer-side ring drain algorithm. Called by the driver's consumer
/// thread(s) when the doorbell fires or on polling wakeup.
///
/// **Invariants maintained by this function**:
/// - `tail` advances only past contiguous slots that have been processed
/// (CONSUMED → EMPTY transition). `tail` never skips a RESERVED slot.
/// - FILLED slots ahead of a RESERVED slot ARE processed (dispatched to
/// the driver's internal work queue), but `tail` does not advance past
/// the RESERVED blocker until it becomes FILLED and is processed.
/// - After processing a FILLED slot, the consumer transitions it to
/// CONSUMED and then immediately to EMPTY (single owner — no race).
/// This two-step transition maintains the documented state machine
/// and allows diagnostic observation of the CONSUMED state.
///
/// **Head-of-line blocking**: A slow RESERVED slot (e.g., producer stuck
/// in a page fault during copy_from_user) blocks `tail` advancement but
/// does NOT block processing of later FILLED slots. The ring's effective
/// capacity is reduced by the number of RESERVED slots between `tail` and
/// the first unprocessed FILLED slot. Under PerCpu mode (the primary
/// mode), the producer IS the blocked task, so no other slots can be
/// reserved on this ring during the fault — head-of-line blocking is
/// moot. Under shared-ring modes, other CPUs can reserve and fill slots
/// past the blocked one; those slots are processed but `tail` stays
/// pinned until the blocked slot completes.
///
/// **Livelock prevention**: The scan window is bounded by `published`.
/// The consumer does not scan beyond `published` (which is updated
/// atomically by producers). Each scan pass has bounded work: at most
/// `published - tail` slots. If no FILLED slots are found in a pass,
/// the consumer returns and waits for the next doorbell.
fn drain_ring(ring: &VfsRingPair) {
let mask = ring.request_ring.inner.size as u64 - 1;
loop {
let tail = ring.request_ring.inner.tail.load(Ordering::Acquire);
let published = ring.request_ring.inner.published.load(Ordering::Acquire);
if tail == published {
return; // Ring is empty (no new slots to process).
}
let mut processed_any = false;
// Scan from tail to published, processing FILLED slots.
// Use `!=` instead of `<` for wrapping-safe comparison: pos starts
// at tail and increments toward published; the ring size guarantees
// published - tail <= ring.size, so the scan always terminates.
let mut pos = tail;
while pos != published {
let idx = (pos & mask) as usize;
let state = ring.request_ring.slot_states[idx]
.load(Ordering::Acquire);
match state {
s if s == RingSlotState::Filled as u8 => {
// Process this slot.
// SAFETY: slot is FILLED and we are the sole consumer.
let request = unsafe { ring.request_ring.read_entry(idx) };
dispatch_to_work_queue(request);
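// Transition: FILLED → CONSUMED → EMPTY.
// Single owner (consumer), so plain stores suffice (no CAS).
ring.request_ring.slot_states[idx]
.store(RingSlotState::Consumed as u8, Ordering::Release);
ring.request_ring.slot_states[idx]
.store(RingSlotState::Empty as u8, Ordering::Release);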
processed_any = true;
}
s if s == RingSlotState::Reserved as u8 => {
// Slot not yet filled by its producer (e.g., blocked
// in copy_from_user page fault). Skip this slot and
// continue scanning — FILLED slots ahead of a RESERVED
// slot ARE processed (dispatched to the driver's work
// queue). `tail` cannot advance past this RESERVED
// blocker until it becomes FILLED and is processed,
// but processing later FILLED slots reduces latency
// for those requests. Under PerCpu mode, only one
// producer exists per ring, so a RESERVED slot means
// the producer is mid-fill and no later FILLED slots
// exist — the continue is effectively a no-op. Under
// shared-ring mode, other CPUs may have filled later
// slots that can be processed now.
pos = pos.wrapping_add(1);
continue;
}
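s if s == RingSlotState::Empty as u8 => {
// Already processed on an earlier pass while `tail` was
// pinned by a RESERVED slot. Skip it; the tail-advance
// loop below reclaims it once the blocker completes.
}
_ => {
// CONSUMED or an unknown value. The consumer never leaves
// a slot in CONSUMED, so this is an invariant violation.
debug_assert!(false,
"Unexpected slot state {} at pos {} (tail={}, published={})",
state, pos, tail, published);
break;
}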
}
pos = pos.wrapping_add(1);
}
// Advance tail past contiguous EMPTY slots from the old tail.
// Slots we processed above were set to EMPTY, so tail advances
// past all of them (up to the first non-EMPTY slot).
let mut new_tail = tail;
while new_tail != published {
let idx = (new_tail & mask) as usize;
let state = ring.request_ring.slot_states[idx]
.load(Ordering::Acquire);
if state != RingSlotState::Empty as u8 {
break;
}
new_tail = new_tail.wrapping_add(1);
}
if new_tail != tail {
ring.request_ring.inner.tail.store(new_tail, Ordering::Release);
}
if !processed_any {
return; // No progress this pass — wait for next doorbell.
}
// Loop to check if new slots were published while we were processing.
}
}
Tail advancement guarantee: tail advances strictly monotonically and only past
slots in EMPTY state (meaning they were FILLED, processed, and reset to EMPTY by the
consumer). A RESERVED slot at position tail pins tail until that slot completes
its lifecycle (RESERVED → FILLED → process → EMPTY). This ensures no slot is ever
skipped or double-processed.
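The tail advancement rule can be checked with a single-threaded model of one drain pass. This is a sketch under stated assumptions: `Slot` and `drain_pass` are hypothetical names, slot state is a plain enum rather than an atomic, and an EMPTY slot inside the scan window is treated as "already processed on an earlier pass" and skipped.

```rust
/// Hypothetical slot states for the model (CONSUMED is omitted because the
/// consumer never leaves a slot in that state).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Slot {
    Empty,
    Reserved,
    Filled,
}

/// One drain pass over the window `tail..published`.
/// Returns the new tail and the sequence numbers processed in this pass.
pub fn drain_pass(slots: &mut [Slot], tail: u64, published: u64) -> (u64, Vec<u64>) {
    assert!(slots.len().is_power_of_two());
    let mask = slots.len() as u64 - 1;

    // Scan: process every FILLED slot, skip RESERVED and EMPTY ones.
    let mut processed = Vec::new();
    let mut pos = tail;
    while pos != published {
        let idx = (pos & mask) as usize;
        if slots[idx] == Slot::Filled {
            processed.push(pos);
            slots[idx] = Slot::Empty;
        }
        pos = pos.wrapping_add(1);
    }

    // Advance tail only past contiguous EMPTY slots; a RESERVED slot pins it.
    let mut new_tail = tail;
    while new_tail != published && slots[(new_tail & mask) as usize] == Slot::Empty {
        new_tail = new_tail.wrapping_add(1);
    }
    (new_tail, processed)
}
```

With slot 0 RESERVED and slots 1 and 2 FILLED, the first pass processes sequence numbers 1 and 2 but leaves `tail` at 0. Once slot 0 becomes FILLED, the next pass processes only sequence number 0 and moves `tail` to 3, so each request is handled exactly once.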
Ring capacity under head-of-line blocking: If one RESERVED slot pins tail, slots that
were processed ahead of it (and already reset to EMPTY) cannot be reused until tail
advances, because producers gate reservation on head - tail < size. The ring's usable
capacity is therefore reduced by the blocker plus every processed slot between it and
head. With a single RESERVED slot at tail and nothing else outstanding, the ring has
size - 1 usable slots, functionally identical to a normal SPSC ring. Under shared-ring
modes with high contention, multiple RESERVED slots can accumulate, temporarily
reducing effective capacity. The negotiation protocol's auto-selection heuristic
(Section 14.3) accounts for this by allocating more slots per ring in shared modes
(default depth 256 for PerNuma/PerLlc vs 64-128 for PerCpu with many rings).
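As a small worked example of the capacity arithmetic, the helper below (a hypothetical `free_slots` function, not part of the ring API) applies the same `head - tail < size` gate that `reserve_slot()` uses. It shows that slots processed ahead of a pinned `tail` still count against capacity.

```rust
/// Slots a producer can still reserve under the `head - tail < size` gate.
/// Sequence numbers are free-running u64 counters, so the subtraction wraps.
pub fn free_slots(head: u64, tail: u64, size: u64) -> u64 {
    let in_window = head.wrapping_sub(tail);
    assert!(in_window <= size, "head may not run more than `size` ahead of tail");
    size - in_window
}
```

For a ring of 64 slots with `tail` pinned at 0 and `head` at 40, `free_slots(40, 0, 64)` is 24, even if 39 of the 40 slots in the window have already been processed and reset to EMPTY.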
14.3.4 Request ID Generation¶
Request IDs must be unique within a mount to support response matching. The baseline protocol uses per-ring monotonic IDs. With N rings, two strategies are possible: a single global counter per mount, or per-ring monotonic IDs qualified by a ring index.
Chosen strategy: Global atomic counter per mount.
/// Generate a mount-globally unique request ID.
///
/// All rings on this mount share a single AtomicU64 counter. This ensures
/// that request IDs are unique across all rings without per-ring namespacing.
///
/// Cost: one atomic fetch_add(1, Relaxed) per VFS operation. On x86-64,
/// this is a LOCK XADD (~15-20 cycles under contention). On AArch64
/// with LSE atomics (ARMv8.1+), LDADD is ~10-30 cycles uncontended,
/// ~40-80 cycles under 64-core contention. On ARMv8.0 without LSE, the
/// LL/SC retry loop costs ~10-20 cycles per attempt with ~O(N) expected
/// attempts under N-way contention (RISC-V with the A extension uses a
/// single AMOADD instead of an LL/SC loop). Under
/// 64-core saturation (worst case): ~640-1280 cycles per request_id
/// allocation. Phase 3 optimization: per-CPU pre-allocated ID ranges
/// can eliminate this contention for non-LSE ARM targets.
/// Under 64-core contention, the cache line bounces — but this is a
/// DIFFERENT cache line from the ring's head/published, so it does not
/// compound with ring contention.
///
/// The alternative (per-ring monotonic + ring_index prefix) was rejected
/// because it complicates response matching in the driver and breaks the
/// existing assumption that request_id is a simple monotonic u64.
#[inline]
fn alloc_request_id(ring_set: &VfsRingSet) -> u64 {
ring_set.next_request_id.fetch_add(1, Ordering::Relaxed)
}
Longevity analysis: At 10 billion requests per second (far beyond any conceivable filesystem workload: a 100 Gbps link at 4 KB per request is only ~3M IOPS), a u64 counter wraps after ~58 years. At realistic rates (1M IOPS sustained), wrap time is ~584,000 years. Safe for 50-year uptime.
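The wrap-time figures can be reproduced with a few lines of arithmetic. The `wrap_years` helper below is a hypothetical illustration that assumes a constant request rate and a Julian year of 365.25 days.

```rust
/// Years until a u64 counter wraps at a constant request rate.
pub fn wrap_years(requests_per_second: f64) -> f64 {
    const SECONDS_PER_YEAR: f64 = 365.25 * 24.0 * 3600.0; // Julian year
    (u64::MAX as f64) / requests_per_second / SECONDS_PER_YEAR
}
```

`wrap_years(1e10)` evaluates to roughly 58 years and `wrap_years(1e6)` to roughly 584,000 years, matching the figures above.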
Why not per-ring monotonic IDs? Per-ring IDs would avoid the global atomic but were
rejected for three reasons: (1) the cancellation side-channel uses request_id to
identify requests, so cancel tokens would need a (ring_index, request_id) pair,
breaking the existing CancelToken struct layout (Section 14.2); (2) the driver must
track which ring a response belongs to, making response matching ring-aware; (3) crash
recovery log replay becomes more complex, because per-ring ordering must be
reconstructed from ring sequence numbers rather than from a single global ID sequence.
The global counter adds ~15-80 cycles under contention (< 0.5% of minimum VFS
operation latency). Relaxed ordering on the counter is sufficient: the ring's
Release/Acquire on published provides the happens-before guarantee between the writer
(who wrote the request including its ID) and the reader.
The CacheLinePadded wrapping ensures no false sharing.
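To illustrate why Relaxed ordering is enough for uniqueness, the following user-space sketch has several threads draw IDs from one shared counter. `alloc_ids_concurrently` and `all_unique` are hypothetical helpers; the `AtomicU64` stands in for `next_request_id`, and no ring or cache-line padding is modeled.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawn `threads` workers that each draw `per_thread` IDs from one counter.
pub fn alloc_ids_concurrently(threads: usize, per_thread: usize) -> Vec<u64> {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                (0..per_thread)
                    // fetch_add is a single atomic read-modify-write, so each
                    // caller observes a distinct previous value even with Relaxed.
                    .map(|_| counter.fetch_add(1, Ordering::Relaxed))
                    .collect::<Vec<u64>>()
            })
        })
        .collect();
    handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker thread panicked"))
        .collect()
}

/// True if no ID appears twice.
pub fn all_unique(ids: &[u64]) -> bool {
    let set: HashSet<u64> = ids.iter().copied().collect();
    set.len() == ids.len()
}
```

Uniqueness comes from the atomicity of the read-modify-write itself. Ordering relative to the request payload is a separate concern, handled by the Release/Acquire pair on `published`.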
14.3.5 Doorbell Coalescing Across N Rings¶
With N rings, naively ringing each ring's doorbell independently would cause N doorbell interrupts per batch of operations. The coalesced doorbell mechanism aggregates notifications across all rings in a mount.
/// Coalesced doorbell for a VfsRingSet. Instead of each ring having an
/// independent doorbell, a single coalesced doorbell aggregates pending
/// work across all rings.
///
/// The producer (VFS) sets a per-ring "pending" bit and then decides
/// whether to ring the shared doorbell based on coalescing policy.
/// The consumer (driver) checks all rings with pending bits set.
// kernel-internal, not KABI
#[repr(C, align(64))]
pub struct CoalescedDoorbell {
/// Bitmask of rings that have new entries since the last doorbell.
/// Bit i is set when ring i has new entries. The driver clears bits
/// as it drains rings.
///
/// No 256-bit atomic type is available, so for MAX_VFS_RINGS = 256 this
/// is implemented as an array of 4 AtomicU64 values. Each AtomicU64
/// covers 64 rings.
///
/// **32-bit architecture note** (ARMv7, PPC32): 64-bit atomics require
/// LDREXD/STREXD (ARMv7) or lwarx/stwcx pairs (PPC32), which are slower
/// than native-width atomics (~10-20 cycles vs ~3-5 cycles). The cost is
/// paid on every VFS operation (hot path, not per-mount), but it is
/// acceptable: on 32-bit architectures, the ~10-20 cycle overhead for
/// LDREXD/STREXD is <8% of minimum VFS metadata operation latency
/// (~200-500 ns).
/// Functional correctness is maintained on all architectures.
pub pending_mask: [AtomicU64; 4],
/// The actual doorbell register. Writing any non-zero value wakes
/// the driver's consumer thread(s).
pub doorbell: DoorbellRegister,
/// Coalescing state (producer-side). Tracks entries since last doorbell
/// for adaptive coalescing.
pub coalescer: DoorbellCoalescer,
}
// kernel-internal, not KABI — CoalescedDoorbell contains DoorbellRegister and
// DoorbellCoalescer which have platform-dependent layout. Accessed only within
// Tier 0 Core via VfsRingSet.
impl CoalescedDoorbell {
/// Mark a ring as having pending entries and optionally ring the doorbell.
///
/// Called by the VFS after enqueueing a request on a specific ring.
/// The doorbell is rung when:
/// (a) the coalescer's pending_count reaches max_batch, OR
/// (b) the coalescer's timeout expires (first entry in batch is older
/// than coalesce_timeout_us), OR
/// (c) the request is a synchronous high-priority operation (fsync,
/// mount, unmount) — these bypass coalescing entirely.
///
/// Cost: one atomic OR (~5-10 cycles) + conditional doorbell write.
#[inline]
pub fn notify(&self, ring_index: u16, force: bool) {
let word = ring_index as usize / 64;
let bit = ring_index as u64 % 64;
self.pending_mask[word].fetch_or(1u64 << bit, Ordering::Release);
if force || self.coalescer.should_ring() {
self.doorbell.ring();
self.coalescer.reset();
}
}
/// Read and clear pending ring mask. Called by the driver consumer.
///
/// Returns a snapshot of which rings have pending entries, then clears
/// those bits. The driver iterates the set bits and drains each ring.
#[inline]
pub fn take_pending(&self) -> [u64; 4] {
let mut result = [0u64; 4];
for i in 0..4 {
result[i] = self.pending_mask[i].swap(0, Ordering::AcqRel);
}
result
}
}
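The word/bit arithmetic used by `notify()` and `take_pending()` can be exercised outside the kernel. The following is a minimal user-space sketch, assuming `std` atomics in place of the kernel's; `PendingMask` and `pending_ring_indices` are hypothetical names introduced for illustration, and the doorbell register and coalescer are deliberately left out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the `pending_mask` field of `CoalescedDoorbell`.
pub struct PendingMask {
    words: [AtomicU64; 4],
}

impl PendingMask {
    pub fn new() -> Self {
        Self {
            words: [
                AtomicU64::new(0),
                AtomicU64::new(0),
                AtomicU64::new(0),
                AtomicU64::new(0),
            ],
        }
    }

    /// Producer side: set the bit for `ring_index` (must be < 256).
    pub fn mark(&self, ring_index: u16) {
        let word = ring_index as usize / 64;
        let bit = ring_index as u64 % 64;
        self.words[word].fetch_or(1u64 << bit, Ordering::Release);
    }

    /// Consumer side: snapshot and clear all four words.
    pub fn take(&self) -> [u64; 4] {
        let mut out = [0u64; 4];
        for (slot, word) in out.iter_mut().zip(self.words.iter()) {
            *slot = word.swap(0, Ordering::AcqRel);
        }
        out
    }
}

/// Expand a snapshot into ascending ring indices.
pub fn pending_ring_indices(snapshot: [u64; 4]) -> Vec<u16> {
    let mut rings = Vec::new();
    for (word_idx, &word) in snapshot.iter().enumerate() {
        let mut bits = word;
        while bits != 0 {
            let bit = bits.trailing_zeros() as u16;
            rings.push(word_idx as u16 * 64 + bit);
            bits &= bits - 1; // Clear lowest set bit.
        }
    }
    rings
}
```

Each `swap` is atomic on its own word, but the four-word snapshot as a whole is not: a bit set in word 0 after that word was swapped is simply returned by the next `take()`.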
Coalescing policy for VFS operations:
| Operation class | Coalescing behavior | Rationale |
|---|---|---|
| Synchronous metadata (Fsync, Mount, Unmount, Freeze, Thaw, SyncFs) | Force doorbell immediately (force = true) | Caller is blocked waiting for completion. Coalescing adds latency with no throughput benefit. |
| Readahead (Readahead, ReadPage) | Coalesce up to batch-32 or 50 us timeout | Readahead is speculative; latency tolerance is high. Batch amortizes doorbell cost. |
| Regular I/O (Read, Write, Lookup, Create, etc.) | Coalesce up to batch-8 or 20 us timeout | Balance between latency and throughput. |
| Batched metadata (StatxBatch) | Coalesce entire batch into one doorbell | Already batched by io_uring path. |
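The policy table can be read as a small decision function. The sketch below is one possible encoding, not the `DoorbellCoalescer` implementation: `OpClass`, `Policy`, `policy_for`, and `should_ring` are hypothetical names, and the caller is assumed to supply the pending count, the age of the oldest unsignaled entry, and an end-of-batch flag.

```rust
/// Operation classes from the coalescing policy table.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OpClass {
    SyncMetadata,
    Readahead,
    RegularIo,
    BatchedMetadata,
}

/// Hypothetical encoding of one table row.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Policy {
    /// Ring immediately (force = true).
    Immediate,
    /// Ring after `max_batch` entries or `timeout_us`, whichever comes first.
    Batch { max_batch: u32, timeout_us: u64 },
    /// Ring once, after the last entry of an already-batched submission.
    WholeBatch,
}

pub fn policy_for(class: OpClass) -> Policy {
    match class {
        OpClass::SyncMetadata => Policy::Immediate,
        OpClass::Readahead => Policy::Batch { max_batch: 32, timeout_us: 50 },
        OpClass::RegularIo => Policy::Batch { max_batch: 8, timeout_us: 20 },
        OpClass::BatchedMetadata => Policy::WholeBatch,
    }
}

/// Decide whether the doorbell should be rung for the entry just enqueued.
pub fn should_ring(
    policy: Policy,
    pending: u32,
    oldest_age_us: u64,
    last_in_batch: bool,
) -> bool {
    match policy {
        Policy::Immediate => true,
        Policy::Batch { max_batch, timeout_us } => {
            pending >= max_batch || oldest_age_us >= timeout_us
        }
        Policy::WholeBatch => last_in_batch,
    }
}
```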
Performance impact: With N=64 rings and coalesced doorbells, the PostgreSQL checkpoint scenario goes from 64 independent doorbells per batch to 1 coalesced doorbell. This saves ~63 * doorbell_cost (~5-150 cycles depending on whether the doorbell is an MMIO write or a memory write with interrupt). Combined with the elimination of cache line bouncing on the ring head, the total saving per fsync batch is ~200-4500 cycles.
14.3.6 Mount-Time Negotiation
Ring count is negotiated between the VFS and the filesystem driver during the
mount() sequence. The driver advertises its capability; the VFS selects the
actual count based on system topology and mount options.
14.3.6.1 Driver Capability Advertisement
The KABI driver manifest (Section 12.6) is extended with a VFS ring count field:
/// Extension to KabiDriverManifest for VFS drivers.
/// Placed in section `.kabi_vfs_caps` adjacent to `.kabi_manifest`.
#[repr(C)]
pub struct KabiVfsCapabilities {
/// Magic: 0x56465343 ("VFSC") — identifies a valid VFS capability block.
pub magic: u32,
/// Structure version (currently 1).
pub version: u32,
/// Maximum number of VFS request rings this driver can consume.
/// 1 = legacy single-ring mode (backward compatible).
/// N > 1 = driver supports multi-ring mode with up to N rings.
///
/// The driver must be prepared to handle any ring_count in [1, ring_count_max].
/// The VFS selects the actual count and communicates it during mount.
pub ring_count_max: u16,
/// Driver's preferred ring granularity hint. The VFS may override this
/// based on system topology and mount options.
pub preferred_granularity: RingGranularity,
/// Reserved for future use. Must be zero.
pub _reserved: [u8; 5],
}
const_assert!(size_of::<KabiVfsCapabilities>() == 16);
/// Default VFS capabilities for drivers that do not include a `.kabi_vfs_caps`
/// section (backward compatibility).
pub const KABI_VFS_CAPS_DEFAULT: KabiVfsCapabilities = KabiVfsCapabilities {
magic: 0x56465343,
version: 1,
ring_count_max: 1, // Single-ring mode (legacy).
preferred_granularity: RingGranularity::Single,
_reserved: [0; 5],
};
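A loader-side validity check for the capability block might look like the sketch below. It is an illustration under stated assumptions, not the KABI loader: fields are assumed to be stored little-endian, `RingGranularity` is assumed to occupy one byte (which the 16-byte size assertion implies), and rejecting `ring_count_max == 0` is a choice made here. `ParsedVfsCaps` and `parse_vfs_caps` are hypothetical names.

```rust
/// Magic value "VFSC" from the capability block definition.
pub const KABI_VFS_CAPS_MAGIC: u32 = 0x5646_5343;

/// Hypothetical decoded view of the fields the VFS needs for negotiation.
#[derive(Debug, PartialEq, Eq)]
pub struct ParsedVfsCaps {
    pub version: u32,
    pub ring_count_max: u16,
    /// Raw discriminant of `RingGranularity` (not interpreted here).
    pub granularity_raw: u8,
}

/// Decode a 16-byte capability block.
/// `None` means "treat the section as absent and use the default".
pub fn parse_vfs_caps(raw: &[u8]) -> Option<ParsedVfsCaps> {
    if raw.len() != 16 {
        return None;
    }
    let magic = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
    let version = u32::from_le_bytes([raw[4], raw[5], raw[6], raw[7]]);
    let ring_count_max = u16::from_le_bytes([raw[8], raw[9]]);
    let granularity_raw = raw[10];
    // Bytes 11..16 are `_reserved` and must be zero.
    let reserved_is_zero = raw[11..16].iter().all(|&b| b == 0);
    if magic != KABI_VFS_CAPS_MAGIC || ring_count_max == 0 || !reserved_is_zero {
        return None;
    }
    Some(ParsedVfsCaps { version, ring_count_max, granularity_raw })
}
```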
14.3.6.2 Negotiation Protocol
Mount sequence (extended from [Section 14.2](#vfs-ring-buffer-protocol)):
1. VFS reads driver's KabiVfsCapabilities from the .kabi_vfs_caps ELF section.
If absent, use KABI_VFS_CAPS_DEFAULT (ring_count_max = 1).
2. VFS determines the desired ring count:
a. If mount option `vfs_ring_count=N` is specified: use min(N, driver.ring_count_max).
b. If mount option `vfs_ring_granularity=<mode>` is specified: compute ring count
from topology (per-cpu, per-numa, per-llc).
c. If no mount options: auto-select based on online CPU count:
- 1-4 CPUs: 1 ring (single-ring mode, no overhead).
- 5-16 CPUs: min(nr_numa_nodes, driver.ring_count_max) rings (PerNuma).
- 17-64 CPUs: min(nr_llc_groups, driver.ring_count_max) rings (PerLlc).
- 65+ CPUs: min(nr_cpus, driver.ring_count_max) rings (PerCpu).
3. VFS allocates ring_count VfsRingPair structures in shared memory.
Each ring has independent head/tail/published/size fields.
Ring entry size and depth are uniform across all rings (the depth is
taken from the per-mount option `vfs_ring_depth=N`).
4. VFS builds the cpu_to_ring mapping table.
5. VFS passes the ring set to the driver during mount initialization.
For T1 transport: the KabiT1EntryFn signature is extended to accept
a ring array:
entry_ring(ksvc, rings: *mut [RingBuffer], ring_count: u16) -> u32
For backward compatibility: if ring_count == 1, this is identical to
the existing single-ring entry point.
6. Driver initializes its consumer side for all ring_count rings.
The driver may spawn multiple consumer threads or use a single thread
with round-robin polling — this is an internal driver decision.
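Step 2 of the sequence above reduces to a short selection function. The following sketch covers steps 2a and 2c only (an explicit `vfs_ring_count` and the automatic thresholds). `Topology` and `select_ring_count` are hypothetical names, and treating a driver maximum of 0 as single-ring is an assumption made here.

```rust
/// Hypothetical topology summary; the real VFS would query the topology subsystem.
#[derive(Clone, Copy, Debug)]
pub struct Topology {
    pub nr_cpus: u16,
    pub nr_numa_nodes: u16,
    pub nr_llc_groups: u16,
}

/// Pick the ring count for a mount.
/// `requested` models the `vfs_ring_count=N` option: `None` or `Some(0)` means auto.
pub fn select_ring_count(requested: Option<u16>, topo: Topology, driver_max: u16) -> u16 {
    // Assumption: a driver advertising 0 is treated as legacy single-ring.
    let driver_max = driver_max.max(1);
    let wanted = match requested {
        // Step 2a: explicit count, later capped by the driver maximum.
        Some(n) if n > 0 => n,
        // Step 2c: auto-select from the online CPU count.
        _ => match topo.nr_cpus {
            0..=4 => 1,
            5..=16 => topo.nr_numa_nodes,
            17..=64 => topo.nr_llc_groups,
            _ => topo.nr_cpus,
        },
    };
    wanted.clamp(1, driver_max)
}
```

For example, a 64-CPU machine with 8 LLC groups and a driver advertising `ring_count_max = 4` ends up with 4 rings: the LLC-based choice of 8 is capped by the driver.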
14.3.6.3 Mount Options
New mount options for per-CPU rings:
| Mount option | Type | Default | Description |
|---|---|---|---|
| vfs_ring_count=N | u16 | auto | Number of VFS rings. 0 = auto (topology-based). 1 = force single-ring. |
| vfs_ring_granularity=<mode> | string | auto | One of: auto, per-cpu, per-numa, per-llc, fixed. auto selects based on CPU count (step 2c above). |
| vfs_ring_depth=N | u32 | 256 | Depth of each individual ring. Same as existing mount option. Applies uniformly to all rings. |
14.3.7 Driver-Side Multiplexing
The filesystem driver must consume requests from N rings instead of one. Three consumer strategies are supported:
14.3.7.1 Strategy 1: Single-Thread Round-Robin (Simple Drivers)
/// Simple round-robin consumer for multi-ring VFS.
/// Suitable for filesystem drivers with a single consumer thread.
///
/// The thread iterates all rings in round-robin order, draining each ring
/// before moving to the next. The pending_mask from the coalesced doorbell
/// guides which rings to check, avoiding wasted iteration over empty rings.
fn vfs_consumer_round_robin(
rings: &[VfsRingPair],
doorbell: &CoalescedDoorbell,
) {
loop {
// Wait for doorbell notification.
doorbell.doorbell.wait();
// Read which rings have pending work.
let pending = doorbell.take_pending();
// Iterate set bits — each bit corresponds to a ring with work.
for word_idx in 0..4 {
let mut bits = pending[word_idx];
while bits != 0 {
let bit = bits.trailing_zeros() as u16;
let ring_idx = (word_idx as u16 * 64) + bit;
bits &= bits - 1; // Clear lowest set bit.
// Drain this ring completely before moving to the next.
drain_ring(&rings[ring_idx as usize]);
}
}
}
}
14.3.7.2 Strategy 2: Per-Ring Consumer Threads (High-Performance Drivers)
/// High-performance consumer model: one kthread per ring.
///
/// Each kthread is affinity-bound to the NUMA node (or LLC group) that
/// the ring serves. This maximizes cache locality — the ring's head/published
/// cache lines stay in the consumer thread's L1/L2.
///
/// Suitable for high-IOPS filesystem drivers (ext4, XFS, btrfs) that can
/// process requests independently per ring.
///
/// The per-ring doorbell is embedded in each VfsRingPair (the existing
/// doorbell field). The coalesced doorbell is used only when strategy 1
/// is active. In strategy 2, each ring uses its own doorbell independently.
fn vfs_consumer_per_ring(
ring: &VfsRingPair,
ring_index: u16,
) {
loop {
ring.doorbell.wait();
drain_ring(ring);
}
}
14.3.7.3 Strategy 3: Adaptive (Recommended)
Drivers detect the ring count at mount time and choose:
- ring_count == 1: Use the existing single-consumer model (no change).
- ring_count <= 4: Use round-robin (one thread, minimal overhead).
- ring_count > 4: Use per-ring consumer threads (maximum parallelism).
The threshold (4) is a heuristic — with 4 rings, one thread can drain all rings without significant latency. Above 4, the round-robin cycle time exceeds the typical operation latency and per-ring threads become worthwhile.
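The adaptive rule is small enough to state as code. This is a sketch of the selection only; `ConsumerStrategy`, `ROUND_ROBIN_MAX_RINGS`, and `choose_strategy` are hypothetical names, and mapping a ring count of 0 to the single-consumer model is an assumption (negotiation never produces 0).

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ConsumerStrategy {
    /// Existing single-ring consumer, unchanged.
    SingleConsumer,
    /// Strategy 1: one thread draining all rings.
    RoundRobin,
    /// Strategy 2: one consumer thread per ring.
    PerRingThreads,
}

/// Heuristic threshold from the text above.
pub const ROUND_ROBIN_MAX_RINGS: u16 = 4;

pub fn choose_strategy(ring_count: u16) -> ConsumerStrategy {
    match ring_count {
        0 | 1 => ConsumerStrategy::SingleConsumer,
        n if n <= ROUND_ROBIN_MAX_RINGS => ConsumerStrategy::RoundRobin,
        _ => ConsumerStrategy::PerRingThreads,
    }
}
```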
14.3.7.4 Response Routing
With N request rings, the driver sends responses on the corresponding response ring — ring index i's request ring has a paired response ring i. The VFS consumer for responses is already per-ring (each VfsRingPair has its own response_ring and completion WaitQueue). A thread blocked on a synchronous VFS operation sleeps on the specific ring's WaitQueue, not a global one. When the response arrives on ring i's response_ring, only threads waiting on ring i are woken.
Request flow:
CPU 7 → cpu_to_ring[7] = ring 2 → ring_set.rings[2].request_ring → enqueue
Response flow:
Driver dequeues from ring 2's request_ring
Driver enqueues response on ring 2's response_ring
VFS consumer for ring 2 wakes threads on ring 2's completion WaitQueue
This ensures that the response arrives on the same ring where the request was submitted. The thread that submitted the request is sleeping on that ring's WaitQueue and is woken directly — no global WaitQueue scanning.
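To make the `cpu_to_ring` lookup concrete, the sketch below builds a table by splitting CPU numbers into contiguous, equal-sized blocks. This layout is an assumption chosen for illustration; the real table is built from NUMA/LLC topology. `build_cpu_to_ring` is a hypothetical helper.

```rust
/// Build a CPU -> ring index table with contiguous CPU blocks per ring.
/// Assumption: CPU numbering is used as a stand-in for real topology.
pub fn build_cpu_to_ring(nr_cpus: u16, ring_count: u16) -> Vec<u16> {
    assert!(nr_cpus > 0 && ring_count > 0 && ring_count <= nr_cpus);
    (0..nr_cpus)
        .map(|cpu| (cpu as u32 * ring_count as u32 / nr_cpus as u32) as u16)
        .collect()
}
```

With 12 CPUs and 4 rings, CPUs 6-8 share ring 2, which reproduces the `cpu_to_ring[7] = ring 2` example; the response for a request from CPU 7 is then expected on ring 2's response ring.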
14.3.8 Driver-Side Ring Entry Prefetch
When a filesystem driver's consumer thread dequeues a request from the ring, the ring protocol specifies that the consumer prefetches the next N ring entries into CPU cache while processing the current request. This hides the memory latency of reading ring entries behind the computation/I/O latency of processing the current request.
/// Prefetch the next `prefetch_count` ring entries after the current dequeue
/// position. Called by the driver's consumer loop immediately after dequeuing
/// a request, before beginning I/O processing for that request.
///
/// The prefetch is a cache hint (software prefetch instruction); it does not
/// modify the ring state or advance the consumer pointer. If the prefetched
/// entries are not yet published (tail has not advanced that far), the
/// prefetch is a harmless no-op (prefetching an unpublished slot reads
/// stale data that will be overwritten before the consumer reaches it).
///
/// **Architecture mapping**:
/// - x86-64: `_mm_prefetch(ptr, _MM_HINT_T0)` — prefetch into L1.
/// - AArch64: `PRFM PLDL1KEEP, [ptr]` — prefetch for load, keep in L1.
/// - ARMv7: `PLD [ptr]` — data prefetch.
/// - RISC-V: no standard prefetch instruction (Zicbop extension adds
/// `prefetch.r`; fallback is no-op on cores without Zicbop).
/// - PPC32/PPC64LE: `dcbt` — data cache block touch.
/// - s390x: no user-accessible prefetch; no-op.
/// - LoongArch64: `preld` — prefetch for load.
///
/// **Prefetch count**: 4 entries is the default. This covers the typical
/// pipeline depth where the driver has dispatched the current request to
/// a work queue and is about to dequeue the next. At ~320 bytes per
/// `VfsRequest` entry, 4 entries = 1,280 bytes = 20 cache lines of 64 bytes.
/// This fits comfortably in L1 on all architectures (1,280 bytes is under
/// 8% of even a 16 KiB L1 data cache). Configurable per-driver via the KABI manifest
/// field `ring_prefetch_count` (default 4, range 0-16, 0 = disabled).
///
/// **Cost**: ~1-4 cycles per prefetch instruction (overlapped with
/// current request processing — effectively zero additional latency).
///
/// **Benefit**: Eliminates ~50-100 cycles of L2/L3 read latency per
/// ring entry dequeue (the entry is already in L1 when the consumer
/// reaches it). Over a batch of 8 dequeues, this saves ~400-800 cycles.
#[inline(always)]
fn prefetch_ring_entries(
ring: &RingBuffer<VfsRequest>,
current_tail: u64,
prefetch_count: u32,
) {
let mask = ring.size - 1; // ring size is power-of-2
for i in 1..=prefetch_count {
let idx = (current_tail.wrapping_add(i as u64)) & mask;
let entry_ptr = ring.data_ptr(idx as usize);
// SAFETY: entry_ptr is within the ring's allocated data region
// (bounded by mask). The prefetch is a hint and does not dereference
// the pointer; no memory safety violation even if the slot is
// unpublished.
unsafe { arch::current::cpu::prefetch_read(entry_ptr as *const u8); }
}
}
Integration point: The prefetch call is inserted into the consumer loop between "dequeue current entry" and "dispatch to work queue":
loop {
let entry = ring.dequeue(); // Read current request
prefetch_ring_entries(ring, tail, 4); // Prefetch next 4 while processing
dispatch_to_work_queue(entry); // Dispatch (may involve I/O)
tail = tail.wrapping_add(1);
}
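The slot arithmetic in `prefetch_ring_entries()` wraps at the ring boundary, which is easy to check in isolation. The helpers below are hypothetical and perform no prefetching; they only compute which slots would be touched and how many cache lines the default configuration covers.

```rust
/// Slot indices that would be prefetched after dequeuing at `current_tail`.
/// `ring_size` must be a power of two, as in the ring protocol.
pub fn prefetch_slots(current_tail: u64, ring_size: u64, prefetch_count: u32) -> Vec<u64> {
    assert!(ring_size.is_power_of_two());
    let mask = ring_size - 1;
    (1..=prefetch_count as u64)
        .map(|i| current_tail.wrapping_add(i) & mask)
        .collect()
}

/// Cache lines spanned by `prefetch_count` entries of `entry_bytes` each.
pub fn cache_lines_touched(entry_bytes: usize, prefetch_count: usize, line_bytes: usize) -> usize {
    (entry_bytes * prefetch_count + line_bytes - 1) / line_bytes
}
```

Near the end of a 256-entry ring, a tail of 254 yields slots 255, 0, 1, 2, so the prefetch window follows the consumer across the wrap without any special case.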
14.3.9 Completion Coalescing (Response Direction)
The doorbell coalescing mechanism (above) batches request-direction notifications (Tier 0 VFS → Tier 1 driver). An analogous mechanism is needed for the response direction (Tier 1 driver → Tier 0 VFS) to avoid waking the VFS consumer on every individual completion.
Problem: Without completion coalescing, a driver that completes 8 requests in rapid succession (e.g., 8 readahead pages arriving from NVMe in a single interrupt) generates 8 separate WaitQueue wakeups on the VFS side. Each wakeup involves an IPI to the waiting CPU (~200-500 cycles on x86-64 cross-core) plus the WaitQueue wake protocol (~50-100 cycles). For 8 completions: ~2,000-4,800 cycles of wakeup overhead.
Solution: The driver batches completions on the response ring and signals the VFS with a single completion doorbell after writing N responses (or after a configurable timeout).
/// Completion coalescing state, embedded in each VfsRingPair.
/// The driver side accumulates completions and signals the VFS consumer
/// only after the batch threshold is reached or the coalescing timeout
/// expires.
///
/// **Classification**: Nucleus (data structure), Evolvable (threshold/timeout
/// parameters are ML-tunable via ParamId 0x0309 and 0x030A).
// SAFETY: All fields are accessed exclusively by the driver's consumer
// thread via `&mut self`. No cross-domain or cross-thread reads. Adding
// a diagnostic read path requires converting to AtomicU32/AtomicU64.
pub struct CompletionCoalescer {
/// Number of responses written since the last VFS wakeup.
/// Incremented by the driver on each response_ring enqueue.
/// Reset to 0 after signaling the VFS.
pub pending_completions: u32,
/// Batch threshold: signal VFS after this many completions.
/// Default: 8 for regular I/O, 1 for synchronous operations
/// (Fsync, Mount — these bypass coalescing because the caller
/// is blocked waiting for exactly one response).
pub batch_threshold: u32,
/// Timestamp (in TSC or arch-equivalent monotonic cycles) of the
/// first unsignaled completion. If the time since first unsignaled
/// completion exceeds `coalesce_timeout_cycles`, signal the VFS
/// regardless of batch size. This bounds worst-case latency for
/// sparse completion streams.
///
/// **CPU migration note**: If the consumer thread migrates between
/// CPUs, TSC on the new CPU may differ from the old CPU (pre-Zen3
/// AMD, non-constant-TSC platforms). The `wrapping_sub` check in
/// `should_signal()` treats a negative delta as a very large positive
/// value, causing immediate signal — a false positive (extra wakeup),
/// not a missed wakeup. This is benign: one extra wakeup per
/// migration event (at most about once per second). On AArch64 (CNTVCT_EL0
/// is globally synchronized) and x86 with constant_tsc + nonstop_tsc,
/// migration has no effect on cycle counter monotonicity.
///
/// **Recommendation**: Driver consumer threads SHOULD be affinity-pinned
/// to a NUMA node or LLC group (see Strategy 2: Per-Ring Consumer Threads).
/// Pinned threads avoid this edge case entirely.
pub first_unsignaled_cycles: u64,
/// Coalescing timeout in cycles. Default: ~10 us worth of cycles
/// (e.g., ~30,000 cycles at 3 GHz). Converted from the ML-tunable
/// parameter `vfs_completion_coalesce_timeout_us` (ParamId 0x030A)
/// at mount time and on parameter update.
pub coalesce_timeout_cycles: u64,
}
impl CompletionCoalescer {
/// Called by the driver after writing a response to the response ring.
/// Returns `true` if the VFS should be signaled (wake the completion
/// WaitQueue), `false` if the completion should be coalesced.
///
/// **Synchronous bypass**: If the response is for a synchronous
/// operation (Fsync, Mount, Unmount, Freeze, Thaw, SyncFs), this
/// function always returns `true` — the caller is blocked and must
/// be woken immediately.
#[inline]
pub fn should_signal(&mut self, is_sync_op: bool) -> bool {
if is_sync_op {
self.pending_completions = 0;
return true;
}
self.pending_completions += 1;
if self.pending_completions == 1 {
self.first_unsignaled_cycles = arch::current::cpu::read_cycle_counter();
}
if self.pending_completions >= self.batch_threshold {
self.pending_completions = 0;
return true;
}
let now = arch::current::cpu::read_cycle_counter();
if now.wrapping_sub(self.first_unsignaled_cycles) >= self.coalesce_timeout_cycles {
self.pending_completions = 0;
return true;
}
false
}
}
Driver integration: After writing each VfsResponseWire to the response ring:
ring.response_ring.enqueue(response);
ring.response_ring.inner.published.store(new_published, Ordering::Release);
if ring.completion_coalescer.should_signal(is_sync_op) {
ring.completion.wake_all(); // Signal VFS consumer
}
Performance impact: With completion coalescing at batch=8, the 8-readahead scenario generates 1 wakeup instead of 8. Savings: ~1,750-4,200 cycles per readahead batch. Combined with request-side doorbell coalescing, a full readahead cycle (8 requests batched into 1 doorbell + 8 responses batched into 1 wakeup) saves ~2,000-5,000 cycles total vs. the uncoalesced baseline.
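The three exit paths of `should_signal()` (batch threshold, timeout, synchronous bypass) can be traced with a user-space model in which the cycle counter is passed in by the caller. `CoalescerModel` is a hypothetical stand-in that mirrors the listing above; it is a sketch for reasoning about behavior, not the driver code.

```rust
/// User-space model of `CompletionCoalescer` with an injected cycle counter.
pub struct CoalescerModel {
    pending_completions: u32,
    batch_threshold: u32,
    first_unsignaled_cycles: u64,
    coalesce_timeout_cycles: u64,
}

impl CoalescerModel {
    pub fn new(batch_threshold: u32, coalesce_timeout_cycles: u64) -> Self {
        Self {
            pending_completions: 0,
            batch_threshold,
            first_unsignaled_cycles: 0,
            coalesce_timeout_cycles,
        }
    }

    /// Same decision logic as the listing, with `now` supplied by the caller.
    pub fn should_signal(&mut self, is_sync_op: bool, now: u64) -> bool {
        if is_sync_op {
            self.pending_completions = 0;
            return true;
        }
        self.pending_completions += 1;
        if self.pending_completions == 1 {
            self.first_unsignaled_cycles = now;
        }
        if self.pending_completions >= self.batch_threshold {
            self.pending_completions = 0;
            return true;
        }
        // A counter that moved backwards wraps to a huge delta and signals.
        if now.wrapping_sub(self.first_unsignaled_cycles) >= self.coalesce_timeout_cycles {
            self.pending_completions = 0;
            return true;
        }
        false
    }
}
```

In this model, as in the listing, the timeout is evaluated only when a completion is recorded; nothing fires between completions.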
14.3.10 Crash Recovery
Crash recovery (Section 11.9) must drain ALL N rings when a filesystem driver crashes.
14.3.10.1 Unified VFS Driver Crash Recovery Sequence
The base protocol (Section 14.2) defines VFS-specific Steps 1-5.5. The general recovery protocol (Section 11.9) defines Steps 1-9. This section specifies the canonical merged sequence — the single authoritative ordering of all steps. Both base protocol and general recovery descriptions are normative for their individual step content, but THIS section defines the step ordering and interleaving. An implementing agent follows this sequence top-to-bottom.
UNIFIED VFS DRIVER CRASH RECOVERY SEQUENCE
Step U1. [General 1] FAULT DETECTED
Hardware exception / watchdog / ring corruption in Tier 1 domain.
Step U1a. [General 1a] TIER CHECK
If effective_tier() == Tier::Zero: panic (no isolation).
If Tier::One or Tier::Two: proceed.
Step U2. [General 2] ISOLATE
Revoke domain (PKRU AD bit / POR_EL0 / DACR). Mask interrupts.
Step U3. [General 2'] SET RING STATE
ring_set.state = RECOVERING (blocks select_ring()).
Per-ring inner.state = RING_STATE_DISCONNECTED.
See set_all_rings_disconnected() below.
Step U4. [General 2a] NMI EJECTION
NMI IPI to eject CPUs still in the crashed domain.
Step U5. [VFS Step 1] QUIESCE PRODUCERS
Wait for all per-ring inflight_ops == 0 (5s timeout).
This ensures no producer is mid-copy_from_user with a
RESERVED slot. If timeout: SIGKILL stuck processes.
See wait_for_producers_quiesced() below.
Step U6. [VFS Step 2a] EXTRACT RING DATA (ring pointers still valid)
Walk each ring's request entries (tail..published).
Collect orphaned ReadPage/Readahead entries (page_cache_id +
page_index) for later page unlock.
Collect DMA buffer handles for deferred freeing.
See collect_orphaned_page_entries() and
collect_dma_handles() in drain_all_vfs_rings() below.
Step U7. [VFS Step 2b + General 3] DRAIN AND RESET RINGS
For each ring:
(a) Drain response ring (driver -> VFS completions as EIO).
(b) Fail pending requests (wake threads with EIO).
(c) Free collected DMA buffer handles.
(d) Reset all slot_states to EMPTY.
(e) Reset ring pointers (head=tail=published=0).
(f) Reset per-ring inner.state to RING_STATE_ACTIVE.
NOTE: Between U7(f) and U17, individual ring states are ACTIVE
while ring_set.state remains RECOVERING. This mixed state is
safe and intentional:
- Producers: blocked by ring_set.state == RECOVERING at
select_ring() — no new requests can be enqueued.
- Consumer: the new driver (loaded in U14) needs ACTIVE
per-ring state to start its consumer loop. If ring.state
were still RING_STATE_DISCONNECTED, the consumer loop's Phase 1.5
state check would immediately break out.
See drain_all_vfs_rings() below.
Step U8. [VFS Step 2c] WAKE ORPHANED PAGES
For each collected OrphanedPageEntry:
Set PageFlags::ERROR, unlock_page().
Wakes wait_on_page_locked() waiters -> EIO.
Step U9. [VFS Step 2.5] PAGE CACHE INTEGRITY CHECK
Walk XArray trees for corruption detection.
Step U10. [VFS Step 3] DIRTY PAGE DETECTION
Identify dirty pages for deferred writeback.
Step U11. [General 4] DRAIN PENDING I/O
Complete all remaining user requests with EIO.
Post io_uring CQEs with error status.
Step U12. [General 4a] EMIT FMA EVENT
fma_emit(FaultEvent::DriverCrash { ... })
Step U13. [General 5-7 + DMA] DEVICE RESET + RELEASE LOCKS + UNLOAD
FLR, KABI lock release, driver memory free.
DMA quiescence (FLR + IOTLB invalidation + wait_dma_quiesce) was
initiated between U2-U4 as part of the unified interleaving
specified in [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation--dma-quiescence-during-crash-recovery).
By this step, DMA is fully quiesced and IOMMU entries are revoked.
Step U13a. [VFS Step 5.5] GENERATION COUNTER BUMP (moved before U14)
sb.driver_generation.fetch_add(1, Release)
**This MUST happen BEFORE driver reload (U14)** so that the
new driver instance and all its responses (including writeback
completions in U15) carry the NEW generation. If the bump were
after U15 (the old U16 position), writeback responses from U15
would carry the OLD generation and be discarded by the VFS
consumer's `response.driver_generation == sb.driver_generation`
check, causing **silent data loss** (dirty pages completed by
the driver but not marked clean).
Step U14. [VFS Step 4 + General 8] RELOAD DRIVER AND REMOUNT
Load new driver binary from CrashRecoveryPool or buddy allocator.
The new driver instance goes through the standard KABI module Hello
protocol ([Section 12.8](12-kabi.md#kabi-domain-runtime--module-hello-protocol)):
(a) Register with the domain service.
(b) Declare dependencies (block device, DMA allocator, etc.).
(c) Domain service resolves dependencies and hands out handles.
**KABI→VFS ring handoff**: The Hello protocol creates
`CrossDomainRing` objects for generic KABI service bindings.
However, VFS rings use a different ring type (`VfsRingPair`
with 320-byte `VfsRequest` entries and `VfsOpcode`-based
dispatch). The handoff works as follows:
- The generic KABI Hello protocol creates `CrossDomainRing`
objects for the driver's non-VFS dependencies (DMA, crypto,
etc.) — these use the standard 64-byte `T1CommandEntry`.
- For the VFS-specific ring, the domain service does NOT
create a `CrossDomainRing`. Instead, it passes the existing
`VfsRingSet` pointer (which survived the crash — ring
memory is kernel-owned, not domain-owned) directly to the
driver's `vfs_init()` entry point. The `VfsRingSet` was
reset in Steps U7-U10 and its rings are ready for reuse.
- The driver's `vfs_init()` receives both the generic KABI
handles (from Hello) and the VFS-specific `VfsRingSet`
pointer (passed separately by the domain service).
**VFS initialization KABI interface**: The `vfs_init()` function
is declared in the filesystem KABI `.kabi` definition as an
optional initialization method (present only for filesystem
drivers, not for all Tier 1 drivers):
```rust
/// Filesystem-specific initialization. Called by the domain
/// service after the Hello protocol completes and generic KABI
/// handles are resolved. Receives the VfsRingSet for this mount.
///
/// The driver creates consumer threads (one per ring pair) using
/// `kernel_services.create_kthread()` — a KABI kernel-services
/// method, NOT a direct kthread_create() call. Tier 1 drivers
/// cannot create kthreads directly; they request creation via
/// the kernel-services KABI handle obtained during Hello.
///
/// Ring memory permissions: VfsRingSet and its ring data regions
/// are in shared memory mapped with the driver's domain key
/// (read-write for the driver's PKEY/POE/DACR domain). The ring
/// control structures (head/tail/published) are AtomicU64 —
/// interior mutability through shared references is safe.
fn vfs_init(
&self,
ring_set: &VfsRingSet,
kernel_services: &KernelServicesHandle,
) -> Result<(), KabiError>;
```
(d) The new driver inherits the existing sb.ring_set: the domain
service passes the VfsRingSet pointer to the driver's init
function. The driver starts N consumer threads (one per ring
in ring_set), each bound to the corresponding VfsRingPair.
The per-ring inner.state was reset to ACTIVE in U7(f), so
the consumer loop's Phase 1.5 state check passes.
(e) Mount RO, run fsck_fast() (fast metadata consistency check),
remount RW if fsck_fast passes.
Step U15. [VFS Step 5] FLUSH DEFERRED DIRTY PAGES
writeback_deferred_dirty(sb)
Step U16. (REMOVED — generation bump moved to Step U13a, before driver reload.
See U13a rationale above. This step is now a no-op placeholder
to preserve step numbering.)
Step U17. [Per-CPU ext] RING SET ACTIVE
ring_set.state.store(VFSRS_ACTIVE, Release)
This is the LAST step. The Release ordering ensures all
prior ring resets, slot_states clears, and driver init are
visible to producers before they observe ACTIVE.
Step U18. [General 9] DRIVER READY
Driver announces readiness to domain service.
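The ordering constraint in Step U13a can be illustrated with a toy model of the generation check. The sketch assumes that a driver instance captures `sb.driver_generation` when it is loaded and stamps every response with that value; how the real driver obtains the generation is not specified here. `SuperBlockModel`, `load_driver`, `bump_generation`, and `accept_response` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical minimal superblock holding only the generation counter.
pub struct SuperBlockModel {
    pub driver_generation: AtomicU64,
}

/// Assumption: a driver instance records the generation current at load time.
pub fn load_driver(sb: &SuperBlockModel) -> u64 {
    sb.driver_generation.load(Ordering::Acquire)
}

/// Step U13a: bump the generation; returns the new value.
pub fn bump_generation(sb: &SuperBlockModel) -> u64 {
    sb.driver_generation.fetch_add(1, Ordering::Release) + 1
}

/// VFS consumer check: accept only responses stamped with the current generation.
pub fn accept_response(sb: &SuperBlockModel, response_generation: u64) -> bool {
    response_generation == sb.driver_generation.load(Ordering::Acquire)
}
```

Under these assumptions, bumping before the reload means the new instance's responses pass the check while late responses from the crashed instance are rejected. A driver loaded before the bump keeps stamping the old value, so its responses are rejected once the bump lands.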
VFS/Block Recovery Interleaving (Same-Domain Crash)
When a Tier 1 filesystem driver crashes and it shares a domain with the block driver (common on platforms with limited isolation domains), TWO crash recovery sequences are triggered for the SAME domain crash event:
- Block I/O recovery (Section 15.2): Drains block request queues, completes in-flight bios with EIO, resets the block device's hardware queues.
- VFS recovery (this section): The unified sequence U1-U18 above.
Ordering constraint: Block I/O recovery MUST complete before VFS Step
U14 (RELOAD DRIVER AND REMOUNT). The filesystem driver's vfs_init()
submits block I/O (for fsck_fast metadata reads), which requires the block
device to be operational. The crash recovery worker
(Section 12.8) serializes these via the domain-level
recovery_mutex — block recovery runs first (it is faster: ~10-50ms
for queue drain), then VFS recovery starts.
Bio completion callback domain: When the VFS submits a bio to the
block layer, the bio completion callback (bio.bi_end_io) executes in
the context of the block device's interrupt handler — which runs in
Tier 0 (Core domain), not in the VFS driver domain. This is by design:
bio completions update Core-resident page cache state (PG_writeback,
PG_uptodate, PG_error flags) and wake Tier 0 waitqueues. The bio
completion callback NEVER enters the VFS Tier 1 domain — it only touches
Core data structures. This ensures bio completions continue to work even
during VFS driver recovery.
Recovery Functions
Step U3: Set ring state
fn set_all_rings_disconnected(ring_set: &VfsRingSet) {
// Block new select_ring() calls.
ring_set.state.store(VFSRS_RECOVERING, Ordering::Release);
for i in 0..ring_set.ring_count as usize {
// SAFETY: rings is valid for ring_count elements.
let ring = unsafe { &*ring_set.rings.add(i) };
ring.request_ring.inner.state.store(RING_STATE_DISCONNECTED, Ordering::Release);
ring.response_ring.inner.state.store(RING_STATE_DISCONNECTED, Ordering::Release);
}
}
The mount-level ring_set.state is checked at the top of select_ring() — if
not ACTIVE, the VFS operation returns ENXIO immediately without touching any
individual ring. This provides a fast-path rejection of new operations during
recovery, avoiding the need to check each ring's state individually.
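The fast-path rejection described above amounts to one atomic load before the table lookup. The following is a minimal sketch, not the real `select_ring()`: `RingSetModel` is hypothetical, the numeric values of `VFSRS_ACTIVE` and `VFSRS_RECOVERING` are assumed, and `ENXIO` uses the Linux errno value 6.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

pub const VFSRS_ACTIVE: u8 = 0; // assumed encoding
pub const VFSRS_RECOVERING: u8 = 1; // assumed encoding
pub const ENXIO: i32 = 6; // Linux errno value

/// Hypothetical reduction of `VfsRingSet` to the two fields used here.
pub struct RingSetModel {
    pub state: AtomicU8,
    pub cpu_to_ring: Vec<u16>,
}

/// Reject immediately unless the ring set is ACTIVE; otherwise map CPU -> ring.
pub fn select_ring(ring_set: &RingSetModel, cpu: usize) -> Result<u16, i32> {
    if ring_set.state.load(Ordering::Acquire) != VFSRS_ACTIVE {
        return Err(ENXIO);
    }
    Ok(ring_set.cpu_to_ring[cpu])
}
```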
Step U5: Wait for producer quiescence
/// Wait for all in-flight producers to complete their current operations.
///
/// After set_all_rings_disconnected(), no NEW operations can enter via
/// select_ring() (returns ENXIO). But producers that already passed
/// select_ring() and reserve_slot() may be mid-copy_from_user with a
/// RESERVED slot. This function waits for those producers to finish.
///
/// The inflight_ops counter is incremented in reserve_slot() and
/// decremented in complete_slot(). When it reaches zero for all rings,
/// no producer is between reserve and complete — safe to drain.
///
/// Timeout: 5 seconds (matches the per-sb quiescence timeout in the
/// base protocol). After timeout: SIGKILL processes with stuck operations.
///
/// **Busy-wait justification**: This is a cold path (crash recovery only,
/// not normal operation). The spin_loop uses `core::hint::spin_loop()`
/// which issues a PAUSE instruction on x86 (reducing power and yielding
/// the pipeline to other hyperthreads). The 5-second window is the maximum;
/// typical quiescence completes in microseconds (the stuck producer only
/// needs to finish its `copy_from_user` + `complete_slot()` sequence).
/// A WaitQueue-based approach was considered but rejected because the
/// producers are not aware of the recovery and thus cannot signal a
/// waitqueue — polling is the only option.
fn wait_for_producers_quiesced(ring_set: &VfsRingSet) {
let deadline = ktime_get() + Duration::from_secs(5);
loop {
let mut all_quiesced = true;
for i in 0..ring_set.ring_count as usize {
// SAFETY: rings pointer valid for ring_count elements.
let ring = unsafe { &*ring_set.rings.add(i) };
if ring.request_ring.inflight_ops.load(Ordering::Acquire) != 0 {
all_quiesced = false;
break;
}
}
if all_quiesced {
return;
}
if ktime_get() > deadline {
// Escalation: SIGKILL stuck processes (same as base protocol).
sigkill_stuck_producers(ring_set);
return;
}
core::hint::spin_loop();
}
}
/// Identify and SIGKILL processes that have operations stuck in the VFS ring.
///
/// Called when `wait_for_producers_quiesced()` times out after 5 seconds.
/// A producer is "stuck" if it called `reserve_slot()` (incrementing
/// `inflight_ops`) but never called `complete_slot()` (decrementing it).
/// This can happen if the producer's thread is blocked in an uninterruptible
/// sleep between reserve and complete (e.g., page fault during `copy_from_user`
/// that blocks on I/O to the now-crashed filesystem — a deadlock).
///
/// **Mechanism**: Each ring maintains a per-ring `WaitQueue` that producers
/// sleep on when the ring is full (`reserve_slot()` calls `wq.wait_event`).
/// After the ring state is set to RING_STATE_DISCONNECTED, producers woken from this
/// waitqueue check the state and return ENXIO. For producers stuck in
/// `copy_from_user()` (not sleeping on the waitqueue), SIGKILL is the only
/// option — it interrupts the page fault handler and causes the thread to
/// enter `do_exit()`.
///
/// The function iterates all tasks in the system and sends SIGKILL to any
/// task that has a pending VFS operation on a ring belonging to this ring_set.
/// This is identified by checking if the task's `current_vfs_ring` pointer
/// (set in `reserve_slot()`, cleared in `complete_slot()`) points to a ring
/// in this ring_set.
fn sigkill_stuck_producers(ring_set: &VfsRingSet) {
// Iterate the task table (RCU-protected read) looking for tasks
// with current_vfs_ring pointing into this ring_set's ring array.
let ring_base = ring_set.rings as usize;
let ring_end = ring_base + ring_set.ring_count as usize
* core::mem::size_of::<VfsRingPair>();
rcu_read_lock();
for_each_task(|task| {
let ring_ptr = task.current_vfs_ring.load(Ordering::Relaxed) as usize;
if ring_ptr >= ring_base && ring_ptr < ring_end {
// This task has an in-flight VFS operation on one of our rings.
// Send SIGKILL to force it out of whatever blocking state it's in.
signal_wake_up(task, /* fatal */ true);
}
});
rcu_read_unlock();
}
Steps U6-U8: Drain all VFS rings (unified 4-phase function)
/// Maximum orphaned page entries to collect during crash recovery.
/// With 256 rings * 256 depth = 65536 total entries, but only a
/// fraction are ReadPage/Readahead. 4096 covers the worst case
/// for a single mount under heavy read load.
const MAX_ORPHANED_PAGES: usize = 4096;
/// Orphaned page entry collected from ring during crash recovery.
/// Contains Core-resident data only (no VFS-domain pointers).
struct OrphanedPageEntry {
/// Core-resident AddressSpace handle (pointer cast to u64).
/// Valid because AddressSpace is in Core memory, pinned for
/// the lifetime of the superblock.
page_cache_id: u64,
/// Page index within the AddressSpace.
page_index: u64,
}
/// Drain all VFS rings during crash recovery.
///
/// This function implements Steps U6 (EXTRACT), U7 (DRAIN+RESET), and
/// collects data for U8 (WAKE). The caller invokes wake_orphaned_pages()
/// after this function returns.
///
/// **Three phases per ring** (executed in order for each ring, with rings
/// processed sequentially), followed by one phase after all rings:
///
/// Phase 1 (EXTRACT): read ring entries while pointers are valid.
/// Collect orphaned page entries (ReadPage/Readahead requests with
/// page_cache_id for Core-resident page lookup).
/// Free the DMA buffer handles carried by each request.
///
/// Phase 2 (DRAIN): process completions and fail pending requests.
/// Drain response ring (EIO for all entries).
/// Wake blocked threads with EIO.
///
/// Phase 3 (RESET): clear ring state for replacement driver.
/// Reset all slot_states to EMPTY.
/// Reset ring pointers (head = tail = published = 0).
/// Reset per-ring inner.state to RING_STATE_ACTIVE.
///
/// Phase 4 (WAKE, after all rings): wake orphaned pages.
/// Done by caller using the returned orphaned_pages collection.
///
/// Order: rings are drained sequentially (ring 0, ring 1, ..., ring N-1).
/// No parallelism needed — crash recovery is a cold path with a 500ms
/// latency target.
fn drain_all_vfs_rings(
ring_set: &VfsRingSet,
) -> ArrayVec<OrphanedPageEntry, MAX_ORPHANED_PAGES> {
let mut orphaned_pages: ArrayVec<OrphanedPageEntry, MAX_ORPHANED_PAGES> =
ArrayVec::new();
for i in 0..ring_set.ring_count as usize {
// SAFETY: rings pointer valid for ring_count elements.
let ring = unsafe { &*ring_set.rings.add(i) };
// Phase 1: EXTRACT — read ring entries while pointers are valid.
let tail = ring.request_ring.inner.tail.load(Ordering::Acquire);
let published = ring.request_ring.inner.published.load(Ordering::Acquire);
let mask = ring.request_ring.inner.size as u64 - 1;
let mut pos = tail;
// Use `!=` (not `<`) for wrapping-safe comparison — consistent
// with drain_ring() at the hot-path consumer (see its comment).
// At u64 scale wrapping is unreachable (~58,000 years at 10M ops/sec),
// but the correct idiom avoids self-contradiction in the spec.
while pos != published {
let idx = (pos & mask) as usize;
// SAFETY: pos is between tail and published; all producers are
// quiesced (Step U5 waited for inflight_ops == 0), so all slots
// in [tail..published) were written by producers before they
// decremented inflight_ops. Slots in this range are FILLED (not
// RESERVED or EMPTY) because: (1) the producer's complete_slot()
// call advances `published` only AFTER storing FILLED state, and
// (2) quiescence guarantees no producer is mid-fill. A RESERVED
// slot would imply an incomplete producer, contradicting the
// inflight_ops == 0 quiescence condition.
// The state is NOT verified at runtime — the quiescence guarantee
// makes the check unnecessary, and adding one would add overhead
// to a cold crash-recovery path for no correctness benefit.
let entry: &VfsRequest = unsafe { ring.request_ring.read_entry(idx) };
// Collect orphaned page entries for ReadPage/Readahead.
match entry.opcode {
VfsOpcode::ReadPage => {
if let VfsRequestArgs::ReadPage { page_index, page_cache_id, .. } = &entry.args {
if !orphaned_pages.is_full() {
orphaned_pages.push(OrphanedPageEntry {
page_cache_id: *page_cache_id,
page_index: *page_index,
});
}
// If full: log FMA warning. Pages will remain locked
// until oom_reaper or manual intervention.
}
}
VfsOpcode::Readahead => {
if let VfsRequestArgs::Readahead { start_index, nr_pages, page_cache_id, .. } = &entry.args {
for pg in 0..*nr_pages as u64 {
if !orphaned_pages.is_full() {
orphaned_pages.push(OrphanedPageEntry {
page_cache_id: *page_cache_id,
page_index: start_index + pg,
});
}
}
}
}
_ => {}
}
// Free the DMA buffer handles carried by this request.
free_request_dma_handles(entry);
pos = pos.wrapping_add(1);
}
// Phase 2: DRAIN — process completions and fail pending.
// NOTE: These operate on DIFFERENT rings within the same VfsRingPair:
// - drain_spsc_response_ring: advances the RESPONSE ring tail to
// discard driver completions (driver → VFS direction).
// - fail_pending_requests: checks the REQUEST ring pending_count
// and wakes blocked submitters (VFS → driver direction).
// The response ring drain does NOT affect request ring state.
// fail_pending_requests correctly sees the original pending_count
// because it reads from a separate ring object.
drain_spsc_response_ring(&ring.response_ring);
fail_pending_requests(&ring.request_ring, &ring.completion);
// Phase 3: RESET — clear ring state for replacement driver.
// Reset all slot_states to EMPTY FIRST (before ring pointers).
// Relaxed ordering is safe here because the downstream barrier
// chain guarantees visibility: the ring_set.state store to
// VFSRS_ACTIVE (Step U17) uses Release ordering. The
// replacement driver's consumer thread observes VFSRS_ACTIVE
// via Acquire load. This Release/Acquire pair establishes a
// happens-before relationship: all Relaxed stores to slot_states
// (done before the Release store) are visible to the consumer
// thread (which reads after the Acquire load). No intermediate
// Release is needed on the individual slot_state stores.
for slot_idx in 0..ring.request_ring.inner.size as usize {
// SAFETY: slot_states points to an array of `inner.size`
// AtomicU8 elements, allocated at mount time. `slot_idx` is
// bounded by `inner.size`. Raw pointer arithmetic is required
// because `*const AtomicU8` does not support `[]` indexing.
unsafe {
(*ring.request_ring.slot_states.add(slot_idx)).store(
RingSlotState::Empty as u8,
Ordering::Relaxed,
);
}
}
for slot_idx in 0..ring.response_ring.inner.size as usize {
// SAFETY: same invariant as request_ring slot_states above.
unsafe {
(*ring.response_ring.slot_states.add(slot_idx)).store(
RingSlotState::Empty as u8,
Ordering::Relaxed,
);
}
}
// Reset inflight_ops to zero (should already be zero after
// quiescence, but defensive reset for correctness).
ring.request_ring.inflight_ops.store(0, Ordering::Relaxed);
ring.response_ring.inflight_ops.store(0, Ordering::Relaxed);
// Reset ring pointers.
ring.request_ring.inner.head.store(0, Ordering::Release);
ring.request_ring.inner.published.store(0, Ordering::Release);
ring.request_ring.inner.tail.store(0, Ordering::Release);
ring.response_ring.inner.head.store(0, Ordering::Release);
ring.response_ring.inner.published.store(0, Ordering::Release);
ring.response_ring.inner.tail.store(0, Ordering::Release);
// Reset per-ring state to Active for the replacement driver.
ring.request_ring.inner.state.store(RING_STATE_ACTIVE, Ordering::Release);
ring.response_ring.inner.state.store(RING_STATE_ACTIVE, Ordering::Release);
}
orphaned_pages
}
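The wrap-safe walk used in Phase 1 can be shown in isolation. The following is a minimal sketch rather than the spec's implementation: ring_indices is a hypothetical helper, the ring is reduced to a pair of u64 counters, and the size is assumed to be a power of two so that masking is equivalent to a modulo.

```rust
/// Slot indices visited when walking a ring from `tail` up to (but not
/// including) `published`. Hypothetical helper for illustration only.
fn ring_indices(tail: u64, published: u64, size: u64) -> Vec<usize> {
    assert!(size.is_power_of_two());
    let mask = size - 1;
    let mut out = Vec::new();
    let mut pos = tail;
    // `!=` still terminates correctly if the counter wrapped between
    // `tail` and `published`; a `pos < published` test would exit at once.
    while pos != published {
        out.push((pos & mask) as usize);
        pos = pos.wrapping_add(1);
    }
    out
}
```

With tail = u64::MAX - 1 and published = 1, a less-than comparison would skip the loop entirely, while the inequality test visits the three pending slots.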
/// Free DMA buffer handles referenced by a pending VFS request.
/// Called during Phase 1 extraction to prevent DMA pool memory leaks.
///
/// Over a 50-year uptime with periodic driver crashes, unfreed DMA handles
/// are a slow memory leak. Each crash could leak up to ring_depth *
/// ring_count DMA buffer handles if not properly freed.
fn free_request_dma_handles(entry: &VfsRequest) {
// Free any DMA buffer handles carried by VfsRequestArgs variants.
// Every variant with a `buf: DmaBufferHandle` or `entries_buf: DmaBufferHandle`
// field must be handled. A missing arm leaks DMA pool memory on every crash.
//
// Note: For Read and Write, the inline small I/O path uses DmaBufferHandle::ZERO
// (no DMA buffer allocated for inline payloads). The != ZERO check correctly
// skips these. Large I/O paths DO allocate DMA buffers and their handles are
// freed here.
match &entry.args {
// Page cache operations: buf is the DMA-mapped page buffer.
VfsRequestArgs::ReadPage { buf, .. }
| VfsRequestArgs::WritePage { buf, .. }
| VfsRequestArgs::Readahead { buf, .. } => {
if *buf != DmaBufferHandle::ZERO {
dma_pool_free(*buf);
}
}
// Byte-range I/O operations: buf is the DMA-mapped user data buffer.
// DmaBufferHandle::ZERO for inline small I/O (no DMA buffer allocated).
VfsRequestArgs::Read { buf, .. }
| VfsRequestArgs::Write { buf, .. }
| VfsRequestArgs::Readlink { buf, .. } => {
if *buf != DmaBufferHandle::ZERO {
dma_pool_free(*buf);
}
}
// Extended attribute operations: value/buf is the DMA-mapped xattr data.
VfsRequestArgs::SetXattr { value, .. } => {
if *value != DmaBufferHandle::ZERO {
dma_pool_free(*value);
}
}
VfsRequestArgs::GetXattr { buf, .. }
| VfsRequestArgs::ListXattr { buf, .. } => {
if *buf != DmaBufferHandle::ZERO {
dma_pool_free(*buf);
}
}
// Directory and info operations: buf is the DMA-mapped result buffer.
VfsRequestArgs::ShowOptions { buf, .. }
| VfsRequestArgs::ReadDir { buf, .. } => {
if *buf != DmaBufferHandle::ZERO {
dma_pool_free(*buf);
}
}
// Batched statx: entries_buf is the DMA-mapped result array.
VfsRequestArgs::StatxBatch { entries_buf, .. } => {
if *entries_buf != DmaBufferHandle::ZERO {
dma_pool_free(*entries_buf);
}
}
// Variants without DMA buffer handles — listed exhaustively
// so the compiler catches new variants with DMA buffers.
// Adding a VfsRequestArgs variant with a DmaBufferHandle field
// WITHOUT adding an arm here is a DMA pool memory leak.
VfsRequestArgs::Mount { .. }
| VfsRequestArgs::Unmount { .. }
| VfsRequestArgs::ForceUnmount { .. }
| VfsRequestArgs::Statfs { .. }
| VfsRequestArgs::SyncFs { .. }
| VfsRequestArgs::Remount { .. }
| VfsRequestArgs::Freeze { .. }
| VfsRequestArgs::Thaw { .. }
| VfsRequestArgs::Lookup { .. }
| VfsRequestArgs::Create { .. }
| VfsRequestArgs::Link { .. }
| VfsRequestArgs::Unlink { .. }
| VfsRequestArgs::Mkdir { .. }
| VfsRequestArgs::Rmdir { .. }
| VfsRequestArgs::Rename { .. }
| VfsRequestArgs::Symlink { .. }
| VfsRequestArgs::Mknod { .. }
| VfsRequestArgs::GetAttr { .. }
| VfsRequestArgs::SetAttr { .. }
| VfsRequestArgs::Truncate { .. }
| VfsRequestArgs::RemoveXattr { .. }
| VfsRequestArgs::DirtyExtent { .. }
| VfsRequestArgs::ReleasePage { .. }
| VfsRequestArgs::Open { .. }
| VfsRequestArgs::Release { .. }
| VfsRequestArgs::Fsync { .. }
| VfsRequestArgs::Ioctl { .. }
| VfsRequestArgs::Mmap { .. }
| VfsRequestArgs::Fallocate { .. }
| VfsRequestArgs::SeekData { .. }
| VfsRequestArgs::SeekHole { .. }
| VfsRequestArgs::Poll { .. }
| VfsRequestArgs::SpliceRead { .. }
| VfsRequestArgs::SpliceWrite { .. }
| VfsRequestArgs::EvictInode { .. }
| VfsRequestArgs::TruncateRange { .. }
| VfsRequestArgs::WriteInode { .. } => {}
}
}
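The reason for listing every variant instead of using a wildcard arm can be seen on a reduced example. This is a sketch under stated assumptions: Handle, Args, and handles_to_free are hypothetical stand-ins for DmaBufferHandle, VfsRequestArgs, and the freeing logic, and the function returns the handles instead of calling a pool allocator.

```rust
/// Stand-in for a DMA buffer handle; `ZERO` means "no buffer allocated".
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Handle(u64);

impl Handle {
    const ZERO: Handle = Handle(0);
}

/// Reduced, hypothetical request enum with three variants.
#[allow(dead_code)]
enum Args {
    Read { buf: Handle, len: u32 },
    SetXattr { value: Handle },
    Unlink { ino: u64 },
}

/// Returns the handles that must be released for `args`.
/// There is deliberately no `_ =>` arm: adding a fourth variant to `Args`
/// makes this match fail to compile until the variant is classified.
fn handles_to_free(args: &Args) -> Vec<Handle> {
    let mut out = Vec::new();
    match args {
        Args::Read { buf, .. } => {
            if *buf != Handle::ZERO {
                out.push(*buf);
            }
        }
        Args::SetXattr { value } => {
            if *value != Handle::ZERO {
                out.push(*value);
            }
        }
        // No DMA handle carried by this variant.
        Args::Unlink { .. } => {}
    }
    out
}
```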
/// Drain the response ring (driver -> VFS completions) during crash recovery.
/// Any completions that the driver had posted but the VFS consumer had not
/// yet consumed are discarded here. Since the driver crashed, these
/// completions may contain partial or corrupted data, so the waiting
/// threads see EIO regardless of the completion status.
///
/// Walks from `response_ring.inner.tail` to `response_ring.inner.published`,
/// reading and discarding each completion entry, then publishes the
/// advanced tail.
fn drain_spsc_response_ring(response_ring: &RingBuffer<VfsResponseWire>) {
let mut tail = response_ring.inner.tail.load(Ordering::Acquire);
let published = response_ring.inner.published.load(Ordering::Acquire);
// Use `!=` (not `<`) for wrapping-safe comparison — consistent
// with drain_ring() and drain_all_vfs_rings() (see SF-169).
while tail != published {
let idx = (tail % response_ring.inner.size as u64) as usize;
// SAFETY: idx < inner.size by construction (modulo above), so it is
// within ring bounds.
let completion: &VfsResponseWire = unsafe {
response_ring.read_entry(idx)
};
// Intentionally discard every completion, including "successful" ones.
// The driver crashed, so its state is unknown and any completion in the
// ring may reflect corrupted state. No thread is woken here: waiters
// are woken by fail_pending_requests(), observe VFSRS_RECOVERING in
// their wait_event condition, and translate it into -EIO. The
// application can retry after the replacement driver is loaded.
let _ = completion;
tail = tail.wrapping_add(1);
}
response_ring.inner.tail.store(tail, Ordering::Release);
}
/// Fail all pending requests on a single request ring.
/// If any entries are pending (`published - tail > 0`), wakes every thread
/// blocked on the ring's completion queue; those threads then return EIO.
///
/// Uses `wake_up_all()` (not a hypothetical `wake_all_with_error(EIO)`) because
/// `WaitQueue` does not have an error-passing wake method. Woken threads
/// check the ring/superblock state in their `wait_event` condition loop and
/// detect the RECOVERING state, translating it into an EIO return to userspace.
/// This is the standard Linux pattern: `wake_up_all()` + condition check in
/// the waiter loop.
fn fail_pending_requests(
request_ring: &RingBuffer<VfsRequest>,
completion: &WaitQueue,
) {
let tail = request_ring.inner.tail.load(Ordering::Acquire);
let published = request_ring.inner.published.load(Ordering::Acquire);
let pending_count = published.wrapping_sub(tail);
// Wake all threads on this ring's completion queue.
// The woken threads check ring_set.state in their wait_event loop
// condition and return EIO when they observe RECOVERING.
if pending_count > 0 {
completion.wake_up_all();
}
}
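The wake_up_all() plus condition-check pattern described for fail_pending_requests() can be sketched in userspace terms. This is an illustrative analogue only: std::sync::Condvar stands in for the kernel WaitQueue, and Waitable, wait_for_response, fail_pending, and run_wake_demo are hypothetical names.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

const EIO: i32 = 5;

struct Inner {
    recovering: bool,
    response_ready: bool,
}

struct Waitable {
    inner: Mutex<Inner>,
    wq: Condvar,
}

/// Waiter side: the wake carries no error code, so the loop condition
/// itself decides between success and EIO.
fn wait_for_response(w: &Waitable) -> Result<(), i32> {
    let mut guard = w.inner.lock().unwrap();
    loop {
        if guard.recovering {
            return Err(EIO);
        }
        if guard.response_ready {
            return Ok(());
        }
        guard = w.wq.wait(guard).unwrap();
    }
}

/// Recovery side: publish the state first, then wake every waiter.
fn fail_pending(w: &Waitable) {
    w.inner.lock().unwrap().recovering = true;
    w.wq.notify_all();
}

/// Spawns `waiters` blocked threads, fails them, and collects the results.
fn run_wake_demo(waiters: usize) -> Vec<Result<(), i32>> {
    let w = Arc::new(Waitable {
        inner: Mutex::new(Inner { recovering: false, response_ready: false }),
        wq: Condvar::new(),
    });
    let handles: Vec<_> = (0..waiters)
        .map(|_| {
            let w = Arc::clone(&w);
            thread::spawn(move || wait_for_response(&w))
        })
        .collect();
    fail_pending(&w);
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Because the condition is checked under the same lock that guards the state change, a waiter that starts after fail_pending() still returns EIO without sleeping.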
/// Wake orphaned pages after ring drain (Step U8).
///
/// Uses Core-resident page_cache_id to resolve pages WITHOUT traversing
/// VFS-domain state (which may be corrupted after the crash). The page
/// cache is in Core (Tier 0); the inode cache is in VFS (Tier 1).
fn wake_orphaned_pages(orphaned_pages: &[OrphanedPageEntry]) {
// RCU read lock is mandatory: the page could be concurrently evicted
// by memory reclaim between the XArray lookup and the flags update.
// RCU protection ensures the page reference remains valid for the
// duration of the lookup + flag-set + unlock sequence.
let _rcu = rcu_read_lock();
for entry in orphaned_pages {
// SAFETY: page_cache_id is a pointer to a Core-resident
// AddressSpace, cast to u64 at enqueue time. The AddressSpace
// is pinned for the lifetime of the superblock.
//
// **Driver corruption mitigation**: page_cache_id was read from
// the VFS request ring, which is in shared memory accessible to
// the (now crashed) driver. A corrupted driver could have written
// arbitrary values to request ring entries before crashing.
// Validate that the pointer falls within Core (Tier 0) memory
// before dereferencing. This prevents a corrupted page_cache_id
// from causing the recovery path to dereference a driver-domain
// or arbitrary address.
if !is_core_memory_range(entry.page_cache_id as usize, core::mem::size_of::<AddressSpace>()) {
log_fma_warning!("wake_orphaned_pages: page_cache_id {:#x} not in Core memory, skipping",
entry.page_cache_id);
continue;
}
let address_space = unsafe {
&*(entry.page_cache_id as *const AddressSpace)
};
// Look up the page in the XArray-backed page cache under RCU.
// PageCache.pages is an XArray<PageEntry>; load() returns
// Option<&PageEntry> under the current RCU read lock.
if let Some(page_entry) = address_space.page_cache.pages.load(entry.page_index) {
let page = &page_entry.page;
page.flags.fetch_or(PageFlags::ERROR.bits(), Ordering::Release);
unlock_page(page);
}
// If page not found: it was already evicted or never inserted.
// No action needed — no thread can be waiting on a non-existent page.
}
// _rcu dropped here — end of RCU read-side critical section.
}
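This section does not show is_core_memory_range(). One plausible shape for such a check, assuming Core memory is a single contiguous address range (an assumption made purely for illustration), is an overflow-safe containment test. The name range_within and its parameters are hypothetical.

```rust
/// Returns true if `[addr, addr + len)` lies entirely inside
/// `[region_start, region_end)`. `checked_add` rejects ranges whose end
/// would overflow, which a corrupted pointer near `usize::MAX` could produce.
fn range_within(addr: usize, len: usize, region_start: usize, region_end: usize) -> bool {
    match addr.checked_add(len) {
        Some(end) => addr >= region_start && end <= region_end,
        None => false,
    }
}
```

Checking the whole object extent, not just the first byte, matters here because the recovery path dereferences the full AddressSpace structure.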
writeback_deferred_dirty() Definition (Step U15)
/// Flush dirty pages that were deferred during crash recovery.
///
/// During the driver outage (Steps U1-U14), dirty pages accumulated in
/// the page cache with no backing driver to write them to disk. After
/// the replacement driver remounts (Step U14), this function flushes
/// those dirty pages via the standard writeback path.
///
/// The function iterates the superblock's dirty inode list
/// (`sb.s_dirty` / `sb.s_io`) and submits writeback work items for
/// each dirty inode. The writeback is synchronous: this function
/// blocks until all deferred dirty pages are written (or error).
///
/// # Arguments
///
/// - `sb`: The superblock of the remounted filesystem. The replacement
/// driver is already loaded and accepting writeback requests.
///
/// # Errors
///
/// Individual page writeback errors are logged via FMA but do not abort
/// the recovery. Pages that fail writeback retain the DIRTY flag and
/// are retried on the next periodic writeback cycle. The function
/// returns the number of inodes whose writeback failed, for diagnostics.
///
/// # Performance
///
/// This is a cold path (runs once per crash recovery). The writeback
/// rate is bounded by the replacement driver's throughput. On a typical
/// NVMe device, flushing 100 MB of deferred dirty pages takes ~20-50ms.
pub fn writeback_deferred_dirty(sb: &SuperBlock) -> u64 {
let mut failed_count: u64 = 0;
// Phase 1: Collect dirty inodes under RCU read lock.
// We MUST NOT perform blocking I/O (WB_SYNC_ALL) inside an RCU
// read-side critical section — blocking with tree-RCU prevents
// grace period completion, causing RCU stalls and potential deadlock.
// Instead, collect inode references into a bounded ArrayVec, drop
// the RCU lock, then writeback outside RCU.
//
// Capacity 1024: sufficient for most crash recovery scenarios.
// If the superblock has more than 1024 dirty inodes, the function
// iterates in batches (collect 1024, drop RCU, writeback, re-acquire
// RCU for the next batch). This is correct because new dirty inodes
// cannot be created during crash recovery (ring_set.state == RECOVERING,
// so no new I/O is accepted). The dirty inode list only shrinks
// (as writeback completes) or stays the same.
const BATCH_SIZE: usize = 1024;
let mut batch: ArrayVec<InodeRef, BATCH_SIZE> = ArrayVec::new();
let mut resume_ino: u64 = 0;
loop {
batch.clear();
// Phase 1: Collect dirty inodes under RCU protection.
{
let _rcu = rcu_read_lock();
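for inode in sb.dirty_inodes_iter_from(resume_ino) {
if batch.is_full() {
// resume_ino already points past the last collected inode, so
// this not-yet-collected inode starts the next batch.
break;
}
// Acquire a reference to the inode (pin it against eviction).
batch.push(InodeRef::from(&inode));
// Advance the cursor past every collected inode. The next pass
// starts after this batch, so an inode that stays dirty after a
// failed writeback is not collected again in this call.
resume_ino = inode.ino + 1;
}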
}
// _rcu dropped here — RCU lock released before blocking I/O.
if batch.is_empty() {
break; // No more dirty inodes.
}
// Phase 2: Writeback each dirty inode OUTSIDE RCU.
for inode_ref in batch.iter() {
let mapping = &inode_ref.address_space;
let wbc = WritebackControl {
sync_mode: WB_SYNC_ALL,
nr_to_write: i64::MAX, // Write all dirty pages
range_start: 0,
range_end: i64::MAX,
};
if let Err(_e) = mapping.writeback_range(&wbc) {
failed_count += 1;
fma_emit(FaultEvent::WritebackError {
inode: inode_ref.ino,
sb_dev: sb.s_dev,
});
}
}
// InodeRef drops release the references.
}
failed_count
}
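The collect-under-lock, write-outside-lock batching can be modelled without any kernel types. The sketch below is an assumption-laden analogue: a BTreeSet of inode numbers stands in for the superblock's dirty list, the closure write_one stands in for per-inode writeback, and flush_in_batches is a hypothetical name.

```rust
use std::collections::BTreeSet;

/// Processes `dirty` in bounded batches ordered by inode number.
/// Returns (inodes visited in order, number of failed inodes).
fn flush_in_batches(
    dirty: &mut BTreeSet<u64>,
    batch_size: usize,
    mut write_one: impl FnMut(u64) -> bool,
) -> (Vec<u64>, u64) {
    let mut visited = Vec::new();
    let mut failed = 0u64;
    let mut resume_ino = 0u64;
    loop {
        // "Locked" phase: only copy identifiers, never block here.
        let batch: Vec<u64> =
            dirty.range(resume_ino..).take(batch_size).copied().collect();
        let last = match batch.last() {
            Some(&ino) => ino,
            None => break, // nothing left at or after the cursor
        };
        // "Unlocked" phase: the blocking work happens outside the lock.
        for &ino in &batch {
            visited.push(ino);
            if write_one(ino) {
                dirty.remove(&ino);
            } else {
                failed += 1; // stays dirty, retried by a later cycle
            }
        }
        // Move the cursor past the batch so failed inodes are not revisited.
        match last.checked_add(1) {
            Some(next) => resume_ino = next,
            None => break,
        }
    }
    (visited, failed)
}
```

Each inode is visited exactly once per call even when some writebacks fail, which is what keeps the loop bounded.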
Step U17: Restore ring set state to ACTIVE
After the generation counter is bumped (Step U13a), the replacement driver remounts (Step U14), and dirty pages are flushed (Step U15), the ring set is re-activated:
// Step U13a already bumped generation (before driver reload).
// Step U16 is now a no-op (generation bump moved to U13a).
// Step U17: re-activate the ring set. This is the LAST step.
// The Release ordering ensures all prior ring resets, slot_states clears,
// driver initialization, and generation bump are visible to producers
// before they observe ACTIVE and begin enqueuing new requests.
ring_set.state.store(VFSRS_ACTIVE, Ordering::Release);
This transition from RECOVERING to ACTIVE is the final gate. Without it, the mount remains permanently stuck in RECOVERING state and all VFS operations return ENXIO indefinitely.
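The visibility argument behind the final Release store can be demonstrated with standard atomics. This is a minimal userspace sketch, not the kernel code: the Ring struct, the numeric encodings of the slot and ring states, and all function names are hypothetical.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical encodings chosen for this sketch.
const SLOT_EMPTY: u8 = 0;
const SLOT_FILLED: u8 = 2;
const STATE_ACTIVE: u8 = 0;
const STATE_RECOVERING: u8 = 1;

struct Ring {
    slot_states: Vec<AtomicU8>,
    state: AtomicU8,
}

/// Recovery side: Relaxed slot resets, then one Release store as the gate.
fn reset_and_activate(ring: &Ring) {
    for slot in &ring.slot_states {
        slot.store(SLOT_EMPTY, Ordering::Relaxed);
    }
    ring.state.store(STATE_ACTIVE, Ordering::Release);
}

/// Consumer side: the Acquire load that observes ACTIVE synchronizes with
/// the Release store, so every earlier Relaxed reset is visible afterwards.
fn wait_active_then_count_stale(ring: &Ring) -> usize {
    while ring.state.load(Ordering::Acquire) != STATE_ACTIVE {
        std::hint::spin_loop();
    }
    ring.slot_states
        .iter()
        .filter(|s| s.load(Ordering::Relaxed) != SLOT_EMPTY)
        .count()
}

/// Runs both sides concurrently and returns the number of stale slots seen.
fn run_publish_demo(slots: usize) -> usize {
    let ring = Arc::new(Ring {
        slot_states: (0..slots).map(|_| AtomicU8::new(SLOT_FILLED)).collect(),
        state: AtomicU8::new(STATE_RECOVERING),
    });
    let consumer = {
        let ring = Arc::clone(&ring);
        thread::spawn(move || wait_active_then_count_stale(&ring))
    };
    reset_and_activate(&ring);
    consumer.join().unwrap()
}
```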
Recovery latency impact: Draining N rings sequentially adds O(N) to recovery time. Each ring drain is O(ring_depth) — with depth 256, each drain is ~256 iterations of entry extraction + pointer arithmetic. For N=64 rings: 64 * 256 = 16,384 iterations, each ~50-200 ns (includes DMA handle freeing and orphaned page collection) = ~0.8-3.3 ms. Well within the 500 ms recovery latency target. The producer quiescence wait (Step U5) adds at most 5 seconds in the worst case (copy_from_user on a major page fault), but typically completes in microseconds (most copy_from_user operations are cache-hot).
14.3.11 Live Evolution¶
Live kernel evolution (Section 13.18) replaces a running filesystem driver with a new version. The evolution protocol interacts with per-CPU rings as follows:
Phase A' (Quiescence) — extended for N rings:
- Set ring_set.state = QUIESCING. New VFS operations return EAGAIN (callers retry after evolution completes).
- Wait for all N rings' inflight_ops counters to reach zero. Each RingBuffer<T> has an independent inflight_ops: AtomicU32 counter (incremented in reserve_slot(), decremented in complete_slot()). The same counter is used by crash recovery (Step U5 above).
- Drain all N response rings to process any final completions from the old driver.
Phase B (Atomic Swap) — the ring pointers are unchanged during vtable swap. The new driver inherits the same ring set. Ring count does not change during live evolution (changing ring count requires unmount/remount).
Phase C (Post-Swap Cleanup) — the new driver re-initializes its consumer
threads for all N rings. If the new driver supports a different ring_count_max
than the old driver, the ring count remains unchanged until the next remount.
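The Phase A' quiescence gate can be sketched as a small state machine. This is one possible arrangement under stated assumptions, not the spec's implementation: the RingSet and Ring types here are reduced stand-ins, the state encodings are hypothetical, and sequentially consistent ordering is used so that the increment-then-check reasoning in the comments holds.

```rust
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};

const ACTIVE: u8 = 0; // hypothetical encoding
const QUIESCING: u8 = 1; // hypothetical encoding
const EAGAIN: i32 = 11;

struct Ring {
    inflight_ops: AtomicU32,
}

struct RingSet {
    state: AtomicU8,
    rings: Vec<Ring>,
}

impl RingSet {
    fn new(ring_count: usize) -> Self {
        RingSet {
            state: AtomicU8::new(ACTIVE),
            rings: (0..ring_count)
                .map(|_| Ring { inflight_ops: AtomicU32::new(0) })
                .collect(),
        }
    }

    /// Counts the operation first and checks the state second, so either
    /// the quiescer sees the increment or the producer sees QUIESCING.
    fn reserve_slot(&self, ring: usize) -> Result<(), i32> {
        let r = &self.rings[ring];
        r.inflight_ops.fetch_add(1, Ordering::SeqCst);
        if self.state.load(Ordering::SeqCst) != ACTIVE {
            r.inflight_ops.fetch_sub(1, Ordering::SeqCst);
            return Err(EAGAIN);
        }
        Ok(())
    }

    fn complete_slot(&self, ring: usize) {
        self.rings[ring].inflight_ops.fetch_sub(1, Ordering::SeqCst);
    }

    fn begin_quiesce(&self) {
        self.state.store(QUIESCING, Ordering::SeqCst);
    }

    /// True once every ring's in-flight counter has drained to zero.
    fn all_quiesced(&self) -> bool {
        self.rings
            .iter()
            .all(|r| r.inflight_ops.load(Ordering::SeqCst) == 0)
    }
}
```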
14.3.12 CPU Hotplug¶
When a CPU comes online or goes offline, the cpu_to_ring mapping must be
updated.
14.3.12.1 CPU Online¶
/// Called by the CPU hotplug framework when a new CPU comes online.
/// Updates the cpu_to_ring mapping for all mounted filesystems.
fn vfs_rings_cpu_online(cpu: CpuId) {
for sb in all_superblocks() {
let ring_set = &sb.ring_set;
if cpu < ring_set.cpu_to_ring_len as usize {
// Assign the new CPU to a ring based on the mount's granularity.
let ring_idx = compute_ring_for_cpu(cpu, ring_set);
// SAFETY: cpu < cpu_to_ring_len, validated above. cpu_to_ring
// points to a slab-allocated array valid for the mount's lifetime.
unsafe { (*ring_set.cpu_to_ring.add(cpu)).store(ring_idx, Ordering::Release) };
}
// If cpu >= cpu_to_ring_len (CPU ID exceeds table allocated at mount
// time), the fallback in select_ring() routes to ring 0. This can
// occur if CPUs are hot-added beyond the boot-time nr_cpu_ids.
// A remount would rebuild the table with the new nr_cpu_ids.
}
}
14.3.12.2 CPU Offline¶
/// Called by the CPU hotplug framework when a CPU goes offline.
/// The cpu_to_ring entry for the offline CPU is NOT cleared — it becomes
/// stale but harmless (any thread migrated off the offline CPU will call
/// select_ring() on its new CPU and get the correct ring). No ring is
/// removed or deallocated on CPU offline — ring count is fixed for the
/// lifetime of the mount.
fn vfs_rings_cpu_offline(cpu: CpuId) {
// No action required. The ring assigned to this CPU continues to exist
// and may still have in-flight operations. The driver's consumer thread
// for this ring continues to drain it.
//
// If the offline CPU was the ONLY CPU assigned to a particular ring,
// that ring becomes idle — the driver's consumer thread for it will
// find no new work. This is benign.
}
Ring count is immutable for the lifetime of a mount. Rings are allocated at mount time and freed at unmount. CPU hotplug changes the CPU-to-ring mapping but never adds or removes rings. Adding rings would require the driver to reinitialize its consumer side (equivalent to a mini-remount); removing rings would orphan in-flight requests. Both are too disruptive for a hotplug event. If the operator wants to adjust ring count after a topology change, they must unmount and remount.
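The hotplug behaviour described above (update the mapping on online, fall back to ring 0 for CPU IDs beyond the table, never change the ring count) can be modelled as follows. This is a sketch only: the round-robin policy inside compute_ring_for_cpu is an assumption, since the real policy depends on the mount's granularity, and the RingSet type is a reduced stand-in.

```rust
use std::sync::atomic::{AtomicU16, Ordering};

struct RingSet {
    ring_count: u16,
    /// One entry per CPU ID known at mount time; fixed length afterwards.
    cpu_to_ring: Vec<AtomicU16>,
}

impl RingSet {
    fn new(ring_count: u16, nr_cpu_ids: usize) -> Self {
        assert!(ring_count > 0);
        RingSet {
            ring_count,
            cpu_to_ring: (0..nr_cpu_ids).map(|_| AtomicU16::new(0)).collect(),
        }
    }

    /// Hypothetical placement policy: plain round-robin over the rings.
    fn compute_ring_for_cpu(&self, cpu: usize) -> u16 {
        (cpu % self.ring_count as usize) as u16
    }

    /// CPU online: update the mapping if the CPU fits in the table.
    fn cpu_online(&self, cpu: usize) {
        if let Some(slot) = self.cpu_to_ring.get(cpu) {
            slot.store(self.compute_ring_for_cpu(cpu), Ordering::Release);
        }
    }

    /// Ring lookup with the ring-0 fallback for out-of-table CPU IDs.
    fn select_ring(&self, cpu: usize) -> u16 {
        match self.cpu_to_ring.get(cpu) {
            Some(slot) => slot.load(Ordering::Acquire),
            None => 0,
        }
    }
}
```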
14.3.13 Performance Analysis¶
14.3.13.1 Cache Line Contention Elimination¶
The primary performance gain is eliminating cross-CPU cache line contention on the
request ring's head/published cache line.
Before (single ring, 64-CPU PostgreSQL checkpoint):
| Metric | Value |
|---|---|
| Producer reservation contention per fsync | ~63 contenders on single-ring CAS |
| Reservation CAS cost (x86-64, 64-way contention) | ~3,150-4,410 cycles (~1.3-1.8 us) |
| Ring head cache line bounces per produce | ~1 (lock holder writes, lock release bounces) |
| Total contention overhead per fsync | ~3,200-4,500 cycles |
| Total for 1000-file checkpoint | ~1.3-1.8 ms contention overhead |
After (per-CPU rings, 64-CPU PostgreSQL checkpoint):
| Metric | Value |
|---|---|
| Producer reservation contention | 0 (each CPU is sole SPSC producer on its ring) |
| Ring head cache line bounces per produce | 0 (ring head is CPU-local) |
| Global request_id bounce | ~1 per fsync (~15-20 cycles) |
| Total contention overhead per fsync | ~15-20 cycles (~6-8 ns) |
| Total for 1000-file checkpoint | ~6-8 us contention overhead |
Speedup on contention (PerCpu mode): ~200x reduction in per-fsync contention overhead. The single-ring CAS contention is eliminated entirely — each CPU owns its ring and reserves slots without cross-CPU contention.
Shared-ring mode (PerNuma with 16 CPUs/node): The CAS loop on head has
O(N) expected retries under N-way contention, where N is the number of CPUs
sharing the ring. With 16 CPUs per NUMA node: ~15 contenders on head CAS,
~750-1,050 cycles per reservation (vs ~3,150-4,410 for the global single-ring
case). This is a ~4x improvement over single-ring, but ~50x worse than PerCpu.
PerNuma is appropriate when the memory overhead of PerCpu is unacceptable but
some contention reduction is needed.
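The contention figures above are consistent with a linear model of roughly 50-70 cycles per contender (3,150 / 63 = 50 and 4,410 / 63 = 70). That per-contender cost is inferred from the stated totals, not given directly, so the sketch below should be read as a back-of-envelope check; cas_cost_cycles is a hypothetical helper.

```rust
/// Estimated reservation cost in cycles as (low, high) under a linear
/// model: (CPUs sharing the ring - 1) * per-contender cost.
fn cas_cost_cycles(cpus_sharing_ring: u32) -> (u32, u32) {
    // Inferred from the single-ring figures; an assumption of this sketch.
    const PER_CONTENDER_LOW: u32 = 50;
    const PER_CONTENDER_HIGH: u32 = 70;
    let contenders = cpus_sharing_ring.saturating_sub(1);
    (contenders * PER_CONTENDER_LOW, contenders * PER_CONTENDER_HIGH)
}
```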
14.3.13.2 Memory Overhead¶
Each VfsRingPair occupies:
- Ring headers: 2 * 128 bytes = 256 bytes (two cache lines per ring, request + response)
- Ring data: 2 * (ring_depth * entry_size). VfsRequest is a Rust enum whose size is
dominated by the largest VfsRequestArgs variant (SetXattr contains a KernelString
at 256 bytes plus DmaBufferHandle, value_len, and flags). With the enum
discriminant, alignment padding, and the VfsRequest header fields (request_id,
opcode, ino, fh), the actual entry size is ~320 bytes. VfsCompletion is
smaller (~64 bytes). With default depth 256: request ring = 256 * 320 = 80 KB,
response ring = 256 * 64 = 16 KB, total ring data ≈ 96 KB per ring pair.
- Doorbell + WaitQueue: ~128 bytes
- Total per ring: ~96.4 KB
| Ring count | Total memory per mount | Notes |
|---|---|---|
| 1 (legacy) | ~96 KB | Baseline |
| 4 (per-NUMA, 4-socket) | ~386 KB | Typical server |
| 16 (per-LLC, AMD EPYC) | ~1.5 MB | High-core-count server |
| 64 (per-CPU) | ~6.2 MB | Maximum parallelism |
| 256 (per-CPU, 256-core) | ~24.7 MB | Extreme case |
Note: If the per-ring memory for high ring counts is excessive, the ring depth can be reduced proportionally. With 64 per-CPU rings at depth 64 (instead of 256), total per-mount memory drops to ~1.5 MB while still providing sufficient queue depth for typical VFS workloads.
Memory is allocated from the kernel slab at mount time (warm path). For the common case (4-16 rings), memory overhead is modest. The 64-ring and 256-ring cases are opt-in via explicit mount options and appropriate only for high-IOPS storage servers.
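The per-mount figures can be reproduced from the component sizes listed above. The constants below copy the approximate sizes given in this section (they are estimates, not measured struct sizes), and the function names are hypothetical.

```rust
const HEADER_BYTES: usize = 2 * 128; // request + response ring headers
const REQUEST_ENTRY_BYTES: usize = 320; // approximate VfsRequest size
const RESPONSE_ENTRY_BYTES: usize = 64; // approximate response entry size
const DOORBELL_WAITQUEUE_BYTES: usize = 128;

/// Approximate bytes occupied by one ring pair at the given depth.
fn ring_pair_bytes(depth: usize) -> usize {
    HEADER_BYTES
        + depth * (REQUEST_ENTRY_BYTES + RESPONSE_ENTRY_BYTES)
        + DOORBELL_WAITQUEUE_BYTES
}

/// Approximate bytes per mount for `ring_count` ring pairs.
fn mount_bytes(ring_count: usize, depth: usize) -> usize {
    ring_count * ring_pair_bytes(depth)
}
```

For example, ring_pair_bytes(256) gives 98,688 bytes (about 96.4 KB), and mount_bytes(64, 64) gives 1,597,440 bytes, in line with the reduced-depth note above.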
14.3.13.3 Per-CPU Ring Depth Reduction¶
With N rings, each ring serves fewer CPUs and therefore needs fewer slots. The effective queue depth per CPU remains the same or better:
| Configuration | Rings | Depth/ring | Effective depth/CPU | Total slots |
|---|---|---|---|---|
| Single ring, 64 CPUs | 1 | 256 | 4 | 256 |
| Per-NUMA (4 nodes), 64 CPUs | 4 | 256 | 16 | 1024 |
| Per-LLC (16 groups), 64 CPUs | 16 | 128 | 32 | 2048 |
| Per-CPU, 64 CPUs | 64 | 64 | 64 | 4096 |
For per-CPU mode, the per-ring depth can be reduced (via vfs_ring_depth mount
option) without reducing effective per-CPU capacity. A depth of 64 per ring with
64 rings provides 64 in-flight operations per CPU — far more than any single CPU
can sustain.
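Effective depth per CPU is the ring depth divided by the number of CPUs sharing each ring. A small sketch of that calculation, assuming CPUs are spread evenly across rings (effective_depth_per_cpu is a hypothetical helper):

```rust
/// Slots available to each CPU when `cpus` CPUs are spread evenly over
/// `rings` rings of `depth` slots each. `rings` must be non-zero.
fn effective_depth_per_cpu(rings: u32, depth: u32, cpus: u32) -> u32 {
    // Ceiling division: the most heavily shared ring bounds the result.
    let cpus_per_ring = (cpus + rings - 1) / rings;
    depth / cpus_per_ring.max(1)
}
```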
14.3.13.4 Impact on Performance Budget¶
The per-CPU ring extension does NOT increase the per-I/O domain crossing cost. The ring protocol (SPSC produce, domain switch, consume) is identical per operation. The changes affect only:
| Cost component | Change | Impact |
|---|---|---|
| Ring selection (select_ring()) | +1-3 cycles (atomic load + bounds check) | +0.001% on 10 us op |
| Request ID generation | +15-20 cycles (global atomic fetch_add) | +0.006-0.008% on 10 us op |
| Doorbell coalescing mask update | +5-10 cycles (atomic OR) | +0.002-0.004% on 10 us op |
| Cache line contention elimination | -3,150-4,410 cycles (under 64-CPU contention) | -1.3-1.8% saved |
| Net impact under contention | -3,100-4,370 cycles saved | -1.26-1.78% improvement |
| Net impact without contention | +21-33 cycles added | +0.009-0.013% overhead |
The extension is a net performance win under any multi-CPU workload and negligibly more expensive (~0.01%) for single-CPU workloads. The overhead is well within the existing 2.5% headroom under the 5% budget (Section 3.4).
14.3.13.5 Amortization Math: Negative Overhead Analysis¶
The design target is NEGATIVE overhead — UmkaOS filesystem I/O must be FASTER than Linux despite the Tier 1 domain switch (Section 1.1). This section presents the amortization math against a production Linux kernel baseline (CONFIG_PROVE_LOCKING=n, CONFIG_LOCK_STAT=n).
Linux baseline for a read() cache miss (measured path, no isolation):
| Linux path component | Cost (x86-64, cycles) | Notes |
|---|---|---|
| Syscall entry (SYSCALL + kernel stack setup) | ~40 | Shared with UmkaOS |
| VFS vfs_read() dispatch (function call, no vtable) | ~10-15 | Direct call |
| Filesystem ext4_file_read_iter() (indirect call through f_op) | ~5-8 | Indirect call + branch predictor |
| filemap_get_pages() (page cache XArray lookup) | ~30-50 | XArray walk, cache miss case |
| ext4_readahead() (extent tree lookup + bio build) | ~100-300 | Varies with extent depth |
| submit_bio() (block layer dispatch) | ~50-100 | Request queue + scheduling |
| Total Linux in-kernel overhead | ~235-513 | Before device I/O |
UmkaOS path for the same read() cache miss (with per-CPU ring, N=16 batch):
| UmkaOS path component | Cost (x86-64, cycles) | Notes |
|---|---|---|
| Syscall entry (SYSCALL + kernel stack setup) | ~40 | Identical to Linux |
| VFS ring enqueue (write entry + advance published) | ~15-20 | SPSC produce, no lock |
| Ring selection (select_ring()) | ~1-3 | Atomic load + bounds check |
| Request ID generation | ~15-20 | Atomic fetch_add |
| Doorbell coalescing mask update | ~5-10 | Atomic OR (amortized) |
| Doorbell (domain switch) — amortized over N | ~23/N | WRPKRU, amortized |
| Driver dequeue (ring entry already prefetched in L1) | ~5-10 | L1 hit from prefetch |
| Filesystem processing (same as Linux) | ~100-300 | Extent tree + bio |
| Response enqueue + completion coalescing | ~10-15 | SPSC produce + coalesce check |
| Completion wakeup — amortized over N | ~300/N | IPI + WaitQueue, amortized (see note) |
| Total UmkaOS in-kernel overhead (N=16) | ~220-433 | Before device I/O |
Completion wakeup cost (300 cycles) derivation: The 300-cycle figure assumes
same-NUMA-node IPI (~200 cycles on x86-64 Intel Xeon Scalable, measured via
rdtsc across APIC_ICR write to handler entry) plus WaitQueue wake overhead
(~50-100 cycles for priority-ordered wake + rescheduling check). Cross-NUMA IPI
costs ~500-1000 cycles; if the VFS consumer runs on a different NUMA node than the
filesystem driver, completion wakeup rises to ~600-1100 cycles. The 300-cycle
figure applies to PerLlc and PerNuma ring granularities where producer and consumer
share a NUMA node. Cross-NUMA worst case is documented in the per-architecture table
below.
Per-operation domain crossing cost — production Linux baseline:
The per-operation overhead that UmkaOS must amortize is compared against the Linux function-call chain overhead that UmkaOS eliminates by replacing indirect calls with a ring protocol.
Linux production per-operation function-call overhead eliminated by UmkaOS:
vfs_read() dispatch: ~10-15 cycles (direct call)
f_op->read_iter: ~5-8 cycles (indirect call, x86-64 retpoline)
Lock stat accounting: ~5-10 cycles (CONFIG_LOCK_STAT=n: 0 cycles;
production kernels vary, ~5 cycles
for inline static key check residual)
TOTAL (production): ~20-28 cycles
Note: Debug kernels with CONFIG_PROVE_LOCKING=y add ~25 cycles of lockdep
checking per lock acquisition. The baseline above uses PRODUCTION builds only.
Total domain crossing overhead = doorbell_cost + completion_cost
= 23 + 300 = 323 cycles (uncoalesced, x86-64)
Breakeven batch size = 323 / 28 = ~12 operations (production Linux)
Why UmkaOS achieves savings despite the domain switch:
- No lockdep overhead: Linux's lockdep (lock dependency checker) adds ~20-30 cycles to every lock acquisition on debug kernels. Production kernels with CONFIG_PROVE_LOCKING=n and CONFIG_LOCK_STAT=n have zero lockdep overhead, but still pay ~5 cycles for inline static key checks on lock-stat-capable paths. UmkaOS's compile-time lock ordering eliminates all runtime lock validation: 0 cycles on all builds.
- No indirect call overhead: Linux's VFS dispatches through f_op->read_iter, an indirect call through a function pointer. On x86-64 with Spectre v2 mitigations (retpoline/IBRS), indirect calls can cost ~15-25 cycles (the baseline above conservatively counts only ~5-8 cycles for this call). UmkaOS's ring protocol avoids indirect calls on the hot path: the opcode is dispatched by a match on a u32, which the compiler can lower to direct compare-and-branch sequences. (A jump table would itself be an indirect jump, which is why retpoline builds typically disable jump tables.)
- Cache-friendlier ring layout: The ring buffer is a contiguous array with a predictable access pattern (sequential consume). Linux's VFS path walks multiple non-contiguous data structures (file -> dentry -> inode -> superblock -> f_op -> address_space -> page tree). The ring's sequential layout produces fewer L1 cache misses (~2-3 vs ~5-8 for the pointer-chasing VFS path).
- Prefetch hides latency: The driver-side ring entry prefetch (see "Driver-Side Ring Entry Prefetch" above) loads the next 4 entries into L1 while processing the current request. Linux has no equivalent: each VFS function call must load its arguments from wherever they happen to reside in the cache hierarchy.
Summary table — per-operation overhead at different batch sizes (x86-64, production Linux):
| Batch size (N) | Domain crossing per-op | Linux prod. overhead saved | Net | Verdict |
|---|---|---|---|---|
| 1 (uncoalesced) | 323 cycles | 28 cycles | +295 cycles | 1.2% overhead on 10us op |
| 2 | 162 cycles | 28 cycles | +134 cycles | 0.54% overhead |
| 4 | 81 cycles | 28 cycles | +53 cycles | 0.21% overhead |
| 8 | 40 cycles | 28 cycles | +12 cycles | 0.048% overhead |
| 12 | 27 cycles | 28 cycles | -1 cycle | Breakeven |
| 16 | 20 cycles | 28 cycles | -8 cycles | NEGATIVE overhead |
| 32 | 10 cycles | 28 cycles | -18 cycles | Negative (bonus) |
Breakeven is at N=12 against production Linux. At N>=12, UmkaOS per-operation cost is less than Linux's equivalent function-call path. At N=16 (below the typical io_uring depth of 32-128), the saving is ~8 cycles/op. At N=32 (common for high-IOPS NVMe workloads), the saving is ~18 cycles/op.
N=8 (the default regular I/O coalescing batch) is NOT negative-overhead against
production Linux — it adds ~12 cycles/op (+0.048% on a 10us operation). This is
well within the 2.5% headroom under the 5% budget. The negative-overhead threshold
requires N>=12, which is achieved by:
- io_uring workloads (typical depth 32-128): always negative overhead.
- PostgreSQL checkpoint (fsync storm on 64 backends): N>>12.
- Batched readahead (sequential reads trigger 4-32 page prefetch): N>=16.
When N < 12: For single-threaded sequential reads (effective batch size 1-4), the domain crossing adds ~0.2-1.2% overhead — well within the 5% budget. The page cache absorbs >95% of reads without any domain crossing (cache hits are served entirely in Tier 0), so the effective overhead across all operations is much lower than the per-miss figure.
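The batch-size arithmetic above can be reproduced in a few lines of Rust. This is a minimal sketch: the cycle constants are the x86-64 figures quoted in this section, while the function names and the 25,000-cycles-per-10us conversion (which assumes a 2.5 GHz clock, chosen because it reproduces the percentages in the summary table) are illustrative assumptions.

```rust
/// Doorbell (WRPKRU domain switch) cost in cycles, from the table above.
const DOORBELL_CYCLES: f64 = 23.0;
/// Completion wakeup (same-NUMA IPI + WaitQueue) cost in cycles.
const COMPLETION_CYCLES: f64 = 300.0;
/// Linux production function-call overhead eliminated per operation.
const LINUX_SAVED_CYCLES: f64 = 28.0;
/// ASSUMPTION: 2.5 GHz clock, so a 10us operation is 25,000 cycles.
const CYCLES_PER_10US: f64 = 25_000.0;

/// Domain crossing cost per operation when one crossing serves `n` operations.
fn crossing_per_op(n: u32) -> f64 {
    (DOORBELL_CYCLES + COMPLETION_CYCLES) / f64::from(n)
}

/// Net cycles per operation relative to Linux (negative: UmkaOS is cheaper).
fn net_per_op(n: u32) -> f64 {
    crossing_per_op(n) - LINUX_SAVED_CYCLES
}

/// Smallest batch size at which the net cost is no longer positive.
fn breakeven_batch() -> u32 {
    ((DOORBELL_CYCLES + COMPLETION_CYCLES) / LINUX_SAVED_CYCLES).ceil() as u32
}

/// Net overhead as a percentage of a 10us operation.
fn overhead_percent(n: u32) -> f64 {
    net_per_op(n) / CYCLES_PER_10US * 100.0
}
```

Evaluating net_per_op() for N = 1, 8, 12, 16 gives the +295, +12, -1, and -8 cycle entries of the summary table after rounding. Substituting the cross-NUMA wakeup figure for COMPLETION_CYCLES shows how far the breakeven batch size moves when producer and consumer do not share a NUMA node.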
Inline small I/O path (N=1, negative overhead): For reads/writes where
count <= INLINE_IO_MAX (192 bytes) — covering >90% of procfs/sysfs accesses —
data is carried inline in the ring entry (Section 14.2). No DMA
buffer allocation, no IOMMU map/unmap. This eliminates ~150-300ns per small I/O:
| Path | Linux cost | UmkaOS inline cost | Delta |
|---|---|---|---|
| read("/proc/self/status", buf, 128) | ~280-400 cycles (VFS + seq_file + copy_to_user) | ~180-260 cycles (ring + inline_data + copy_to_user) | -100 to -140 cycles |
| read("/sys/class/net/eth0/mtu", buf, 8) | ~250-380 cycles | ~160-240 cycles | -90 to -140 cycles |
For metadata-heavy workloads (container startup reading hundreds of small files), this is a measurable throughput improvement. The inline path achieves negative overhead at N=1, with no batching required. This is the strongest negative-overhead argument for procfs/sysfs workloads that dominate container and monitoring scenarios.
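The inline-versus-buffer decision can be sketched as follows. Only INLINE_IO_MAX (192 bytes) comes from the text; the Payload enum, its fields, and the alloc_dma callback are hypothetical stand-ins, and the real ring entry layout is the one defined in Section 14.2.

```rust
/// Inline threshold quoted in the text above.
const INLINE_IO_MAX: usize = 192;

/// HYPOTHETICAL payload representation, for illustration only.
#[derive(Debug, PartialEq)]
enum Payload {
    /// Data travels inside the ring entry itself.
    Inline { len: u8, data: [u8; INLINE_IO_MAX] },
    /// Data lives in a separately allocated (and IOMMU-mapped) buffer.
    DmaBuffer { handle: u64, len: usize },
}

/// Pick the payload form for a write of `buf`. `alloc_dma` stands in for the
/// DMA allocation + IOMMU mapping step and is only invoked on the large path.
fn choose_payload(buf: &[u8], alloc_dma: impl FnOnce(&[u8]) -> u64) -> Payload {
    if buf.len() <= INLINE_IO_MAX {
        let mut data = [0u8; INLINE_IO_MAX];
        data[..buf.len()].copy_from_slice(buf);
        // The cast cannot truncate: len <= 192 fits in a u8.
        Payload::Inline { len: buf.len() as u8, data }
    } else {
        Payload::DmaBuffer { handle: alloc_dma(buf), len: buf.len() }
    }
}
```

The point of the sketch is that the small path never calls the allocation callback, which is where the savings described above come from.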
14.3.14 Backward Compatibility¶
14.3.14.1 Single-Ring Drivers¶
Drivers that do not include a .kabi_vfs_caps ELF section (or set
ring_count_max = 1) operate in single-ring mode. The VfsRingSet is allocated
with ring_count = 1 and the cpu_to_ring table maps all CPUs to ring 0.
This is functionally identical to the baseline protocol — no behavioral change.
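A minimal sketch of how the mapping table degenerates to single-ring mode is shown below. RingMap is a hypothetical, simplified stand-in for VfsRingSet that keeps only the cpu_to_ring table, and the modulo assignment is an assumption made for brevity (the actual mapping follows the configured ring granularity).

```rust
/// HYPOTHETICAL, simplified stand-in for VfsRingSet: only the mapping table.
struct RingMap {
    ring_count: u16,
    /// Indexed by CPU id; the value is the ring index for that CPU.
    cpu_to_ring: Vec<u16>,
}

impl RingMap {
    /// Build the table. `driver_ring_max` models ring_count_max from the
    /// driver's capability section; `None` models a driver without one.
    fn new(nr_cpus: usize, requested: u16, driver_ring_max: Option<u16>) -> Self {
        // A missing capability section is treated as ring_count_max = 1.
        let ring_count = requested.min(driver_ring_max.unwrap_or(1)).max(1);
        let cpu_to_ring = (0..nr_cpus)
            .map(|cpu| (cpu % ring_count as usize) as u16)
            .collect();
        Self { ring_count, cpu_to_ring }
    }

    /// Hot-path lookup: one table load plus a bounds check.
    fn select_ring(&self, cpu: usize) -> u16 {
        self.cpu_to_ring.get(cpu).copied().unwrap_or(0)
    }
}
```

With driver_ring_max = None every CPU resolves to ring 0, which is the behavior described above for drivers without a .kabi_vfs_caps section.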
14.3.14.2 VfsRingPair Preservation¶
The VfsRingPair struct is unchanged. The extension wraps it in VfsRingSet
without modifying the per-ring structure. The request/response ring layout,
entry format, opcodes, and all existing fields remain identical.
14.3.14.3 Cancellation Protocol¶
The cancellation protocol (Section 14.2) is unchanged per ring.
CancelToken.request_id uses the mount-global ID, so the driver can match it
against any ring's pending requests. The cancellation side-channel is per ring —
each ring has its own cancellation channel, and the cancel token is enqueued on
the ring where the original request was submitted (the VFS tracks which ring
each request went to via the per-task last_vfs_ring_index field, set during
select_ring()).
14.3.14.4 Timeout Handling¶
Per-request timeouts (Section 14.2) are unchanged. Each request has an independent timer regardless of which ring it was submitted on. The timer callback cancels the request on the specific ring where it was submitted.
14.3.15 Cross-References¶
- VFS ring buffer protocol (base): Section 14.2
- IPC ring buffer design: Section 11.8
- Crash recovery: Section 11.9
- Live kernel evolution: Section 13.18
- KABI transport classes: Section 12.6
- Performance budget: Section 1.3
- Cumulative performance budget: Section 3.4
- Core provisioning: Section 7.11
- Doorbell coalescing: Section 5.1
- fsync end-to-end flow: Section 14.4
- Page cache: Section 4.4
14.3.16 Phase Assignment¶
| Component | Phase | Rationale |
|---|---|---|
| VfsRingSet struct and single-ring allocation | Phase 2 | Replaces VfsRingPair allocation at mount; backward compatible with ring_count=1. |
| Mount option parsing (vfs_ring_count, vfs_ring_granularity) | Phase 2 | Mount option infrastructure exists; adding new options is incremental. |
| CPU-to-ring mapping table and select_ring() | Phase 2 | Core hot-path change; must be correct from first multi-ring mount. |
| Global request_id counter | Phase 2 | Replaces per-ring counter; simple atomic. |
| Coalesced doorbell | Phase 2 | Required for multi-ring to avoid doorbell storms. |
| Driver-side round-robin consumer (Strategy 1) | Phase 2 | Minimum viable multi-ring consumer. |
| KabiVfsCapabilities ELF section and negotiation | Phase 2 | Must exist before any driver can advertise multi-ring support. |
| Crash recovery for N rings | Phase 2 | Must be correct before any production use of multi-ring mode. |
| Per-ring consumer threads (Strategy 2) | Phase 3 | Optimization; round-robin is sufficient for Phase 2. |
| CPU hotplug integration | Phase 3 | Hotplug is uncommon in production; Phase 2 mapping is static. |
| Live evolution for N rings | Phase 3 | Live evolution is a Phase 3 feature. |
| Adaptive granularity auto-selection | Phase 3 | Requires topology discovery infrastructure. |
| Driver-side ring entry prefetch | Phase 2 | Trivial to implement (one prefetch intrinsic per dequeue); significant L1 cache benefit. |
| Completion coalescing (response direction) | Phase 2 | Required for negative-overhead target; mirrors request-side doorbell coalescing. |
14.4 fsync / fdatasync End-to-End Flow¶
The fsync(2) and fdatasync(2) syscalls guarantee that file data (and optionally
metadata) reach stable storage. This section documents the complete call path from
syscall entry to disk write completion, crossing VFS, page cache, filesystem, and
block layer boundaries.
Syscall entry → VFS dispatch:
fsync(fd) / fdatasync(fd)
→ sys_fsync() / sys_fdatasync()
→ vfs_fsync_range(file, start=0, end=LLONG_MAX, datasync)
vfs_fsync_range(file, start, end, datasync) is the canonical entry point. The
sync_file_range(2) syscall also calls it with a sub-range.
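From user space, the two syscalls are reached through ordinary file APIs. The sketch below uses the Rust standard library, where File::sync_all() and File::sync_data() issue fsync(2) and fdatasync(2) respectively on Linux; the write_durably helper and its data_only parameter are illustrative names, not part of any UmkaOS interface.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Write `data` to `path` and force it to stable storage before returning.
/// `data_only = true` uses sync_data() (fdatasync semantics); otherwise
/// sync_all() (fsync semantics) is used.
fn write_durably(path: &Path, data: &[u8], data_only: bool) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(path)?;
    file.write_all(data)?;
    if data_only {
        // Data plus the metadata needed to read it back (e.g. file size).
        file.sync_data()
    } else {
        // Data plus all metadata, including timestamps.
        file.sync_all()
    }
}
```

Either call enters the kernel path described in the rest of this section; the only difference visible to the VFS is the datasync argument.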
Step 1 — Writeback dirty pages:
/// VFS-level fsync implementation.
fn vfs_fsync_range(
file: &File,
start: i64,
end: i64,
datasync: bool,
) -> Result<(), IoError> {
let mapping = &file.inode.i_mapping;
// (1) Flush all dirty pages in [start, end] to the block layer and
// wait for the submitted I/O to complete. The writeback engine
// iterates dirty pages, calling AddressSpaceOps::writepage() for
// each one, which builds bios and submits them; the wait phase
// then blocks until WRITEBACK is cleared on every page in range.
filemap_write_and_wait_range(mapping, start, end)?;
// (1b) DSM-aware writeback: for MS_DSM_COOPERATIVE superblocks, after
// local writeback completes, wait for DSM home node acknowledgment.
// dsm_sync_pages() sends dirty DSM pages via RDMA and blocks until
// PutAck is received from each home node, ensuring data durability
// on the remote node before fsync returns.
// See [Section 6.12](06-dsm.md#dsm-subscriber-controlled-caching--fsync-semantics).
if mapping.host.i_sb.s_flags.load(Relaxed) & MS_DSM_COOPERATIVE != 0 {
dsm_sync_pages(&file.inode, start, end)?;
}
// (2) Dispatch to filesystem-specific fsync (journal commit, etc).
// For KABI Tier 1/2 drivers: sends Fsync through VFS ring buffer.
// For in-kernel filesystems: calls FileOps::fsync() directly.
// The inode and private values are extracted from the OpenFile.
let inode_id = file.inode.id;
let private = file.private_data.load(Ordering::Relaxed) as u64;
file.f_op.fsync(inode_id, private, start as u64, end as u64, datasync as u8)?;
// (3) Check for writeback errors that arrived asynchronously since
// this fd was opened (or since the last successful fsync).
// ErrSeq::check_and_advance() compares the fd's snapshot against
// the mapping's current wb_err generation. If they differ, a
// writeback error occurred — return it and advance the snapshot.
// Without this step, writeback errors that arrive between
// write() and fsync() would be silently lost.
if let Some(errno) = mapping.wb_err.check_and_advance(&mut file.f_wb_err) {
return Err(IoError::from_raw(errno));
}
Ok(())
}
filemap_write_and_wait_range(mapping, start, end) performs two phases:
- Write phase: Walk the page cache (XArray range scan) for pages in [start, end] with PageFlags::DIRTY set. For each dirty page, set the WRITEBACK flag on the page, then call AddressSpaceOps::writepage(mapping, page, wbc). The filesystem maps the page to physical block(s), builds a Bio, and submits it via bio_submit() (Section 15.2).
DIRTY flag ownership: filemap_write_and_wait_range does NOT clear
DIRTY — it only sets WRITEBACK. The DIRTY → clean transition and
nr_dirty decrement are owned by the completion callback:
- Tier 0 filesystems: writeback_end_io() (deferred via writeback_end_io_deferred
callback on the blk-io workqueue) clears DIRTY and decrements nr_dirty on success.
- Tier 1 filesystems: The Tier 0 WritebackResponse handler (step 11
in Section 4.6) clears DIRTY and decrements nr_dirty.
This single-owner design prevents the double-decrement bug that would occur
if both the write phase and the completion handler cleared DIRTY.
- Wait phase: Walk the same range again. For each page with WRITEBACK set, block until the bio completion callback clears the flag. If the mapping carries AS_EIO/AS_ENOSPC error state, return the error and clear it (one-shot error reporting; see below).
Dual error reporting (AS flags + ErrSeq): The AS_EIO/AS_ENOSPC
AddressSpace flags are a legacy mechanism. ErrSeq (wb_err) is the
primary error reporting mechanism: it provides per-fd error visibility
(each open fd sees the error exactly once via check_and_advance()).
Both mechanisms are set together by writeback_end_io() for consistency,
but ErrSeq is authoritative for fsync() error returns. The AS flags
are used by filemap_write_and_wait_range for backward compatibility
with callers that check AddressSpace state directly.
/// Flush all dirty pages in [start, end] to the block layer and wait
/// for completion. This is the core implementation behind fsync step 1.
///
/// # Algorithm
/// Phase 1 (write): iterate dirty pages and submit writeback bios.
/// Phase 2 (wait): block until all submitted bios complete.
///
/// # Locking
/// Does NOT hold i_rwsem (already held by caller for write paths).
/// Acquires page locks individually via write_begin/end protocol.
///
/// # Error handling
/// Collects errors from both phases. Returns the first error encountered.
/// All pages are processed even after an error (no short-circuit) to
/// maximize data written to disk before returning failure.
fn filemap_write_and_wait_range(
mapping: &AddressSpace,
start: i64,
end: i64,
) -> Result<(), IoError> {
let mut first_err: Result<(), IoError> = Ok(());
// --- Phase 1: Write dirty pages ---
let start_idx = (start as u64) >> PAGE_SHIFT;
let end_idx = (end as u64) >> PAGE_SHIFT;
let wbc = WritebackControl {
sync_mode: WbSyncMode::All,
range_start: start as u64,
range_end: end as u64,
nr_to_write: i64::MAX,
};
// Prefer writepages() for batch I/O if the filesystem supports it.
if mapping.ops.writepages(mapping, &wbc).is_err() {
// Fall back to per-page writepage() iteration.
let mut idx = start_idx;
while idx <= end_idx {
if let Some(page) = mapping.pages.load(idx) {
if page.flags.load(Acquire) & PageFlags::DIRTY.bits() != 0 {
// Set WRITEBACK before submitting. Do NOT clear DIRTY;
// the completion callback owns that transition.
// set_page_writeback: set the WRITEBACK flag AND
// increment nrwriteback together. The matching decrement
// is in writeback_end_io(). Without the increment,
// the decrement in writeback_end_io causes underflow
// (wrapping to u64::MAX), permanently corrupting
// writeback scheduling.
page.flags.fetch_or(PageFlags::WRITEBACK.bits(), Release);
mapping.nrwriteback.fetch_add(1, Relaxed);
if let Err(e) = mapping.ops.writepage(mapping, &page, &wbc) {
// Submission failed: undo the WRITEBACK accounting so the
// wait phase does not block on this page.
page.flags.fetch_and(!PageFlags::WRITEBACK.bits(), Release);
mapping.nrwriteback.fetch_sub(1, Relaxed);
if first_err.is_ok() { first_err = Err(e); }
}
}
}
idx += 1;
}
}
// --- Phase 2: Wait for WRITEBACK completion ---
let mut idx = start_idx;
while idx <= end_idx {
if let Some(page) = mapping.pages.load(idx) {
// Spin/sleep until the completion callback clears WRITEBACK.
wait_on_page_writeback(&page);
// Check for per-page error (set by writeback_end_io on failure).
if page.flags.load(Acquire) & PageFlags::ERROR.bits() != 0 {
if first_err.is_ok() {
first_err = Err(IoError::new(Errno::EIO));
}
}
}
idx += 1;
}
// Check AS-level error flags (set by mapping_set_error on I/O error).
if mapping.flags.load(Acquire) & (AS_EIO | AS_ENOSPC) != 0 {
let flags = mapping.flags.fetch_and(!(AS_EIO | AS_ENOSPC), Release);
if first_err.is_ok() {
if flags & AS_ENOSPC != 0 {
first_err = Err(IoError::new(Errno::ENOSPC));
} else {
first_err = Err(IoError::new(Errno::EIO));
}
}
}
first_err
}
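The error-handling policy used above (process every item, report only the first failure) is independent of the page cache and can be shown in isolation. The following is a generic sketch; process_all is a hypothetical helper name and not part of the VFS API.

```rust
/// Apply `op` to every item, even after a failure, and return the first
/// error encountered (or Ok if none failed).
fn process_all<T, E>(
    items: &[T],
    mut op: impl FnMut(&T) -> Result<(), E>,
) -> Result<(), E> {
    let mut first_err: Result<(), E> = Ok(());
    for item in items {
        if let Err(e) = op(item) {
            // Remember only the first error; later ones are dropped.
            if first_err.is_ok() {
                first_err = Err(e);
            }
        }
    }
    first_err
}
```

Compared with the `?` operator, this trades early exit for maximum progress: every page still gets a writeback attempt, which matches the stated goal of getting as much data to disk as possible before reporting failure.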
Page Wait Queue Infrastructure (used by wait_on_page_writeback and
wait_on_page_locked):
/// Global page wait hash table. Hashed by page address to reduce memory
/// overhead (one WaitQueueHead per hash bucket, not per page).
/// Size: 256 buckets (matches Linux's PAGE_WAIT_TABLE_BITS = 8).
/// Warm path: accessed on every fsync and every page fault that waits
/// for I/O completion.
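// Inline const initializer: a plain `[expr; 256]` repeat requires a Copy
// element type, which a wait queue head is not.
static PAGE_WAIT_TABLE: [WaitQueueHead; 256] = [const { WaitQueueHead::new() }; 256];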
/// Map a Page reference to its hash bucket in the page wait table.
fn page_waitqueue(page: &Page) -> &'static WaitQueueHead {
// Hash by page struct address (NOT physical address) — the page struct
// is pinned in MEMMAP and has a stable address for the kernel lifetime.
let hash = (page as *const Page as usize >> PAGE_SHIFT) & 0xFF;
&PAGE_WAIT_TABLE[hash]
}
/// Sleep until the WRITEBACK flag is cleared on the page.
/// Called by fsync Phase 2 and by the page fault path when a page is
/// undergoing writeback. The matching wake is in `Page::wake_waiters()`,
/// called from `writeback_end_io()` after clearing WRITEBACK.
fn wait_on_page_writeback(page: &Page) {
// Fast check: if WRITEBACK is already clear, return immediately.
if page.flags.load(Acquire) & PageFlags::WRITEBACK.bits() == 0 {
return;
}
let wq = page_waitqueue(page);
wq.wait_event(|| page.flags.load(Acquire) & PageFlags::WRITEBACK.bits() == 0);
}
impl Page {
/// Wake all waiters sleeping on this page's hash bucket.
/// Called from writeback completion (after clearing WRITEBACK) and
/// from unlock_page (after clearing PG_LOCKED).
pub fn wake_waiters(&self) {
page_waitqueue(self).wake_up_all();
}
}
/// Set writeback error on an AddressSpace. Called from writeback completion
/// paths (writeback_end_io, end_page_writeback) when a bio completes with
/// an I/O error.
///
/// Wraps ErrSeq::set_err() and sets the legacy AS_EIO/AS_ENOSPC flags.
/// Both mechanisms are updated together for consistency — ErrSeq is
/// authoritative for fsync() error reporting, AS flags are for legacy
/// callers that check AddressSpace state directly.
fn mapping_set_error(mapping: &AddressSpace, errno: Errno) {
// Increment ErrSeq generation and store the error code.
mapping.wb_err.set_err(errno);
// Set legacy AS flags for backward compatibility.
if errno == Errno::ENOSPC {
mapping.flags.fetch_or(AS_ENOSPC, Release);
} else {
mapping.flags.fetch_or(AS_EIO, Release);
}
}
/// Add an inode to the BDI's dirty list if not already present.
/// Called from set_page_dirty() when a page first transitions to dirty.
/// Equivalent to mark_inode_dirty(inode, I_DIRTY_PAGES) — checks the
/// I_DIRTY_PAGES flag to avoid duplicate list insertions.
fn bdi_dirty_inode(bdi: &BackingDevInfo, inode: InodeId) {
// Check I_DIRTY_PAGES — idempotent, no action if already set.
let inode_ref = inode_lookup(inode);
if inode_ref.state.fetch_or(I_DIRTY_PAGES, AcqRel) & I_DIRTY_PAGES != 0 {
return; // Already on the dirty list.
}
// Add to BDI's b_dirty list under writeback_lock.
bdi.wb.push_dirty_inode(inode_ref);
}
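The hashed wait table pattern used by wait_on_page_writeback() can be modeled in user space with a mutex and condition variable per bucket. This is a sketch under stated assumptions: WaitTable, its methods, and the use of std::sync primitives are illustrative substitutes for the kernel's WaitQueueHead, and an AtomicBool stands in for the WRITEBACK flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Condvar, Mutex};

const BUCKETS: usize = 256;

/// User-space analogue of the hashed wait table: many keys share a bucket.
struct WaitTable {
    buckets: Vec<(Mutex<()>, Condvar)>,
}

impl WaitTable {
    fn new() -> Self {
        let buckets = (0..BUCKETS)
            .map(|_| (Mutex::new(()), Condvar::new()))
            .collect();
        Self { buckets }
    }

    fn bucket(&self, key: usize) -> &(Mutex<()>, Condvar) {
        &self.buckets[key % BUCKETS]
    }

    /// Block until `busy` is false. The loop re-checks the flag after every
    /// wakeup because a wake on the bucket may be for a different key.
    fn wait_until_clear(&self, key: usize, busy: &AtomicBool) {
        let (lock, cv) = self.bucket(key);
        let mut guard = lock.lock().unwrap();
        while busy.load(Ordering::Acquire) {
            guard = cv.wait(guard).unwrap();
        }
    }

    /// Clear `busy` and wake every waiter on the bucket. Taking the bucket
    /// lock orders the store against a waiter's check, avoiding lost wakeups.
    fn clear_and_wake(&self, key: usize, busy: &AtomicBool) {
        let (lock, cv) = self.bucket(key);
        let _guard = lock.lock().unwrap();
        busy.store(false, Ordering::Release);
        cv.notify_all();
    }
}
```

Because buckets are shared, a wake-all can rouse waiters for unrelated keys; the condition re-check in the wait loop is what keeps that harmless.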
Step 2 — Filesystem-specific sync (journaled filesystems):
For journaled filesystems (ext4, XFS, btrfs), fsync() does more than flush pages:
| Filesystem | fsync action after writeback |
|---|---|
| ext4 (data=ordered) | jbd2_journal_force_commit() — flush journal to disk, issue cache flush |
| ext4 (data=journal) | Commit journal transaction containing both data and metadata |
| XFS | xfs_log_force_lsn() — force log to LSN that covers the inode's metadata |
| btrfs | btrfs_sync_log() — flush the per-root log tree, then superblock |
| tmpfs | No-op (no backing store) |
| NFS | nfs_file_fsync() — send COMMIT RPC to server |
| FUSE (Section 14.11) | Send FUSE_FSYNC opcode through /dev/fuse |
| KABI Tier 1/2 | Send VfsRequest::Fsync { datasync, start, end } through ring buffer |
/// ext4 fsync implementation. Called via FileOps::fsync() dispatch.
/// Ensures all data and metadata for the inode reach stable storage
/// by forcing the JBD2 journal to commit.
///
/// Linux equivalent: ext4_sync_file() in fs/ext4/fsync.c.
fn ext4_fsync(
inode_id: InodeId,
private: u64,
start: u64,
end: u64,
datasync: u8,
) -> Result<()> {
let inode = inode_lookup(inode_id);
let sbi = ext4_sb_info(&inode.i_sb);
let journal = &sbi.journal;
// For data=journal mode, all data is already in the journal.
// Force the transaction containing this inode's data to commit.
//
// For data=ordered mode, data pages were already flushed by
// filemap_write_and_wait_range() (step 1 in vfs_fsync_range).
// We only need to commit the metadata transaction.
// If fdatasync and only timestamps changed (I_DIRTY_TIME without
// I_DIRTY_DATASYNC), skip the journal commit entirely.
// Load the state word once so both tests see the same snapshot.
let state = inode.state.load(Acquire);
if datasync != 0
&& state & I_DIRTY_DATASYNC == 0
&& state & I_DIRTY_TIME != 0
{
return Ok(());
}
// Force the journal transaction containing this inode's metadata
// to disk. journal_force_commit() waits for the commit I/O to
// complete, including the commit block written with BioFlags::FUA.
// The FUA flag ensures the commit block reaches stable storage
// without needing a separate cache flush bio.
// i_datasync_tid is in the ext4-specific inode info, not the generic Inode.
// Access via i_private cast (see Ext4InodeInfo in [Section 15.6](15-storage.md#filesystem-ext4)).
// SAFETY: i_private was set by ext4's alloc_inode() and is valid for the
// inode's lifetime. The Ext4InodeInfo is slab-allocated and immovable.
let ext4_info = unsafe { &*(inode.i_private as *const Ext4InodeInfo) };
let tid = ext4_info.i_datasync_tid.load(Acquire);
journal.force_commit(tid)?;
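// On ext4, no separate flush bio is issued after the commit. Note that
// FUA by itself only makes the commit block durable; it does not drain
// earlier writes from a volatile device cache. Ordering against the data
// and journal blocks written before the commit block therefore requires
// the commit bio to carry preflush semantics as well
// (REQ_PREFLUSH | REQ_FUA, as in the flush sequence shown in step 3).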
Ok(())
}
Step 3 — Block layer flush (inside filesystem-specific fsync):
The device cache flush described below is performed inside the
filesystem-specific fsync() implementation (step 2), not as a separate
VFS-level post-fsync action. For ext4, the BioFlags::FUA on the journal
commit block serves this purpose. For other filesystems:
After the filesystem commits its journal/log, it issues a cache flush to the storage device to ensure write-back caches are drained:
bio_submit(bio with REQ_PREFLUSH | REQ_FUA)
→ block device request queue
→ NVMe: FLUSH command (opcode 0x00) / SATA: FLUSH CACHE EXT (0xEA)
→ completion interrupt → bio_endio() → wake waiters
Tier 1 crash recovery: Journal commit bios MUST set BioFlags::PERSISTENT
(Section 15.2) so they are preserved across
Tier 1 storage driver crash recovery. The block layer's pending bio list retains
PERSISTENT bios during domain teardown and replays them to the new driver instance
after reload. Without this flag, a Tier 1 driver crash between journal write
submission and completion would lose the journal commit — corrupting the filesystem
on the next mount (journal replay would be incomplete).
For fdatasync: metadata-only changes (atime, mtime) are NOT flushed. The
filesystem skips journal commit if only timestamps changed (I_DIRTY_TIME flag
without I_DIRTY_DATASYNC).
Error reporting — one-shot semantics:
/// Error state per AddressSpace. Encapsulates a monotonic sequence counter
/// and the most recent errno. On writeback I/O failure, call `set_err(errno)`.
/// On fsync(), compare the file's snapshot with `sample()` to detect new errors.
///
/// Provides the same "each error seen exactly once per fd" semantics as
/// Linux's errseq_t. Errno and counter are packed into a single atomic word
/// to prevent torn reads between errno and sequence counter.
///
/// **64-bit architectures** (x86-64, AArch64, RISC-V 64, PPC64LE, s390x,
/// LoongArch64): uses `AtomicU64` with errno in the low 16 bits, the seen
/// flag in bit 16, and the counter in the high 47 bits. No counter wrap concern.
///
/// **32-bit architectures** (ARMv7, PPC32): uses packed `AtomicU32` matching
/// Linux's errseq_t layout — bits [11:0] = errno, bit [12] = seen flag,
/// bits [31:13] = counter.
/// Longevity: 19-bit counter wraps after 524,288 errors. At 1 error/sec
/// (extreme), wraps in ~6 days. Matches Linux errseq_t layout (ABI-constrained).
/// 32-bit targets (ARMv7, PPC32) only. 64-bit targets use 47-bit counter
/// (140T values, safe for 50-year uptime at any realistic error rate).
/// Wrap behavior: false "no new error" on check_and_advance — acceptable
/// because (a) errno field still carries the last error code, (b) filesystems
/// with 500K+ errors have already been marked for fsck.
///
/// Single atomic operation for both `set_err()` and `check_and_advance()` —
/// no torn reads between errno and counter.
#[cfg(target_pointer_width = "64")]
pub struct ErrSeq {
/// Packed: bits [15:0] = errno (unsigned, max 4095),
/// bit [16] = seen flag, bits [63:17] = counter.
/// Single AtomicU64 prevents torn reads.
inner: AtomicU64,
}
#[cfg(target_pointer_width = "32")]
pub struct ErrSeq {
/// Packed: bits [11:0] = errno, bit [12] = seen flag,
/// bits [31:13] = counter. Matches Linux errseq_t layout.
inner: AtomicU32,
}
#[cfg(target_pointer_width = "64")]
impl ErrSeq {
// 64-bit ERRNO_BITS is 16 (not 12) to simplify the bit layout; values > 4095
// are kernel bugs caught by debug_assert. 47-bit counter provides ample headroom.
const ERRNO_BITS: u32 = 16;
const SEEN_BIT: u64 = 1 << Self::ERRNO_BITS;
const CTR_INC: u64 = 1 << (Self::ERRNO_BITS + 1);
const ERRNO_MASK: u64 = Self::SEEN_BIT - 1;
pub const fn new() -> Self { Self { inner: AtomicU64::new(0) } }
/// Record a new writeback error. Atomically packs errno + incremented
/// counter into a single word. No backoff needed: writeback_lock
/// serializes concurrent writeback per inode. The only contention
/// source is a rare race between set_err (writeback error completion)
/// and check_and_advance (fsync), which retries at most once.
pub fn set_err(&self, errno: i32) {
debug_assert!(errno.unsigned_abs() <= 4095, "errno exceeds MAX_ERRNO");
let errno_val = (errno.unsigned_abs() as u64) & Self::ERRNO_MASK;
loop {
let old = self.inner.load(Acquire);
let new_val = ((old & !Self::ERRNO_MASK & !Self::SEEN_BIT)
.wrapping_add(Self::CTR_INC)) | errno_val;
match self.inner.compare_exchange_weak(old, new_val, AcqRel, Acquire) {
Ok(_) => break,
Err(_) => continue,
}
}
}
/// Snapshot current value with "seen" bit set.
pub fn sample(&self) -> u64 { self.inner.load(Acquire) | Self::SEEN_BIT }
/// Check for new errors since `since`. Returns errno if changed.
/// The returned errno is always POSITIVE (e.g., 5 for EIO, not -5).
/// `set_err()` accepts both positive and negative errnos (via
/// `unsigned_abs()`), but the stored and returned value is always the
/// absolute (positive) errno. Callers that need a negative errno for
/// syscall returns must negate: `Err(-(errno as i32))`.
pub fn check_and_advance(&self, since: &mut u64) -> Option<i32> {
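let current = self.inner.load(Acquire);
// `inner` never carries the seen bit (set_err clears it) while
// snapshots always do, so compare with the seen bit applied.
// A raw comparison would never match and would re-report the
// same error on every call.
if (current | Self::SEEN_BIT) == *since { return None; }
*since = current | Self::SEEN_BIT;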
let errno = (current & Self::ERRNO_MASK) as i32;
if errno == 0 { None } else { Some(errno) }
}
}
#[cfg(target_pointer_width = "32")]
impl ErrSeq {
const ERRNO_BITS: u32 = 12; // ilog2(MAX_ERRNO=4095) + 1
const SEEN_BIT: u32 = 1 << Self::ERRNO_BITS;
const CTR_INC: u32 = 1 << (Self::ERRNO_BITS + 1);
const ERRNO_MASK: u32 = Self::SEEN_BIT - 1;
pub const fn new() -> Self { Self { inner: AtomicU32::new(0) } }
pub fn set_err(&self, errno: i32) {
debug_assert!(errno.unsigned_abs() <= 4095, "errno exceeds MAX_ERRNO");
let errno_val = (errno.unsigned_abs()) & Self::ERRNO_MASK;
loop {
let old = self.inner.load(Acquire);
let new_val = ((old & !Self::ERRNO_MASK & !Self::SEEN_BIT)
.wrapping_add(Self::CTR_INC)) | errno_val;
match self.inner.compare_exchange_weak(old, new_val, AcqRel, Acquire) {
Ok(_) => break,
Err(_) => continue,
}
}
}
pub fn sample(&self) -> u32 { self.inner.load(Acquire) | Self::SEEN_BIT }
/// Returns positive errno (see 64-bit variant doc comment).
pub fn check_and_advance(&self, since: &mut u32) -> Option<i32> {
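let current = self.inner.load(Acquire);
// Compare with the seen bit applied: `inner` never carries it,
// snapshots always do (see the 64-bit variant).
if (current | Self::SEEN_BIT) == *since { return None; }
*since = current | Self::SEEN_BIT;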
let errno = (current & Self::ERRNO_MASK) as i32;
if errno == 0 { None } else { Some(errno) }
}
}
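The "each fd sees each error exactly once" contract can be exercised with a small user-space model. The sketch below follows the 64-bit bit layout described above but is not the kernel type: ErrSeqModel is a hypothetical name, it uses AtomicU64::fetch_update in place of the explicit CAS loop, and it assumes the snapshot comparison is made with the seen bit applied to the current word.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified model: bits [15:0] = errno, bit 16 = seen flag,
/// bits [63:17] = counter.
struct ErrSeqModel {
    inner: AtomicU64,
}

impl ErrSeqModel {
    const SEEN_BIT: u64 = 1 << 16;
    const CTR_INC: u64 = 1 << 17;
    const ERRNO_MASK: u64 = Self::SEEN_BIT - 1;

    fn new() -> Self {
        Self { inner: AtomicU64::new(0) }
    }

    /// Record an error: bump the counter and store the positive errno.
    fn set_err(&self, errno: i32) {
        let errno_val = u64::from(errno.unsigned_abs()) & Self::ERRNO_MASK;
        let _ = self.inner.fetch_update(Ordering::AcqRel, Ordering::Acquire, |old| {
            let cleared = old & !Self::ERRNO_MASK & !Self::SEEN_BIT;
            Some(cleared.wrapping_add(Self::CTR_INC) | errno_val)
        });
    }

    /// Snapshot taken at open(): the current word with the seen bit applied.
    fn sample(&self) -> u64 {
        self.inner.load(Ordering::Acquire) | Self::SEEN_BIT
    }

    /// Report an error recorded since `since`, at most once per snapshot.
    /// ASSUMPTION: the seen bit is applied before comparing, because the
    /// stored word in this model never carries it.
    fn check_and_advance(&self, since: &mut u64) -> Option<i32> {
        let current = self.inner.load(Ordering::Acquire) | Self::SEEN_BIT;
        if current == *since {
            return None;
        }
        *since = current;
        let errno = (current & Self::ERRNO_MASK) as i32;
        if errno == 0 { None } else { Some(errno) }
    }
}
```

In this model, two snapshots taken before an error each observe it once, and a snapshot taken after the error does not observe it at all, which is the per-fd behavior the fsync path relies on.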
fdatasync vs fsync decision matrix:
| Condition | fdatasync action | fsync action |
|---|---|---|
| Dirty data pages exist | Writeback + wait | Writeback + wait |
| File size changed (truncate/append) | Metadata flush (size is data-relevant) | Metadata flush |
| Only timestamps changed | Skip metadata flush | Metadata flush |
| Permissions/ownership changed | Skip metadata flush | Metadata flush |
| Journal commit needed | Yes (for data blocks) | Yes (for all) |
| Device cache flush | Yes | Yes |
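The metadata rows of the matrix reduce to a small predicate over the inode's dirty state. The sketch below is illustrative: the flag values are made up, I_DIRTY_SYNC is a hypothetical name borrowed from Linux for "metadata not needed to read the data back", and the function follows the matrix above rather than any particular filesystem's implementation.

```rust
// ASSUMPTION: illustrative flag values, not the kernel's actual encoding.
const I_DIRTY_SYNC: u32 = 1 << 0; // permissions, ownership (hypothetical name)
const I_DIRTY_DATASYNC: u32 = 1 << 1; // metadata needed to read data (size)
const I_DIRTY_TIME: u32 = 1 << 2; // timestamps only

/// Returns true if a metadata flush (journal commit) is required,
/// following the decision matrix above.
fn needs_metadata_flush(datasync: bool, state: u32) -> bool {
    if datasync {
        // fdatasync: only data-relevant metadata forces a flush.
        state & I_DIRTY_DATASYNC != 0
    } else {
        // fsync: any dirty metadata forces a flush.
        state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME) != 0
    }
}
```

Dirty data pages and the device cache flush are handled unconditionally by both syscalls, so they do not appear in the predicate.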
Cross-references:
- AddressSpaceOps::writepage(): Section 14.1
- Page cache dirty tracking: Section 4.2
- Block I/O layer and bio_submit(): Section 15.2
- Journal write barrier: Section 15.5
- VFS ring buffer protocol (Tier 1/2 fsync dispatch): Section 14.2
- Writeback thread organization: Section 4.6
- Copy-on-Write / Redirect-on-Write infrastructure: below
14.4.1 Copy-on-Write and Redirect-on-Write Infrastructure¶
Modern filesystems fall into three write models. Linux treats all three identically at the VFS level — each filesystem independently manages its own write path, extent sharing, and snapshot interaction. This means the VFS cannot optimize writeback scheduling, cannot share page cache pages between reflinked files, and cannot accurately predict free space costs for dirty page flushes.
UmkaOS's VFS distinguishes these models explicitly, enabling generic optimizations that benefit all CoW/RoW filesystems without filesystem-specific code in the VFS.
14.4.1.1 Write Mode Declaration¶
/// Write mode declared by each filesystem via `FileSystemOps::write_mode()`.
/// Cached in `SuperBlock.write_mode` at mount time. Informs the VFS writeback
/// path, page cache sharing strategy, and free space accounting.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum WriteMode {
/// Traditional in-place overwrite (ext4 without reflinks, tmpfs, ramfs).
/// Writeback reuses the same block address. No extent sharing awareness
/// needed. Free space cost of flushing dirty pages: zero (no new blocks).
InPlace,
/// Copy-on-Write for shared extents (XFS with reflinks, ext4 with reflinks).
/// Non-shared extents are overwritten in place. Shared extents (refcount > 1
/// due to reflinks, snapshots, or dedup) require new block allocation on
/// write. The writeback path queries `ExtentSharingOps::is_extent_shared()`
/// to decide: shared → allocate new block; unshared → overwrite in place.
CopyOnWrite,
/// Redirect-on-Write: the filesystem NEVER overwrites data blocks (Btrfs,
/// ZFS, bcachefs, UPFS). All writes allocate new blocks; old blocks are
/// freed only when no snapshot, clone, or active reference retains them.
/// Consistency is achieved by atomic metadata root pointer updates (e.g.,
/// ZFS uberblock, Btrfs tree root, UPFS checkpoint record) rather than
/// journaling.
///
/// Writeback always requests a new block address from the filesystem.
/// Free space accounting must reserve space for pending redirections.
RedirectOnWrite,
}
Why three modes matter:
| Aspect | InPlace | CopyOnWrite | RedirectOnWrite |
|---|---|---|---|
| Writeback block allocation | Reuse existing | Conditional (check sharing) | Always new |
| Free space cost of dirty flush | Zero | Zero (unshared) or one block (shared) | One new block per dirty block |
| Sequential writeback batching | Not useful (scattered overwrites) | Not useful (scattered) | Highly useful (batch new allocations into sequential runs) |
| Page cache sharing for reflinks | N/A (no reflinks) | Yes (shared extents) | Yes (all data may be snapshot-shared) |
| Journal needed | Typically yes | Depends on FS | No (atomic root update suffices) |
| Snapshot integration | External (LVM/dm-snapshot) | Per-extent refcount | Native (tree root versioning) |
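The first two rows of the table can be read as a small decision function. The following is a minimal sketch under stated assumptions: `needs_new_block` and `flush_cost_blocks` are hypothetical helper names (not part of the UmkaOS API), and the enum is redeclared locally so the block compiles on its own.

```rust
/// Local copy of the section's `WriteMode` enum so this sketch stands alone.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum WriteMode {
    InPlace,
    CopyOnWrite,
    RedirectOnWrite,
}

/// Hypothetical helper: does flushing one dirty block need a fresh allocation?
/// `extent_shared` stands in for the result of `is_extent_shared()`.
pub fn needs_new_block(mode: WriteMode, extent_shared: bool) -> bool {
    match mode {
        WriteMode::InPlace => false,              // always reuse the existing block
        WriteMode::CopyOnWrite => extent_shared,  // conditional on sharing
        WriteMode::RedirectOnWrite => true,       // never overwrite in place
    }
}

/// Hypothetical helper: number of new blocks consumed by flushing `dirty`
/// blocks, of which `shared` currently sit on shared extents.
pub fn flush_cost_blocks(mode: WriteMode, dirty: u64, shared: u64) -> u64 {
    match mode {
        WriteMode::InPlace => 0,
        WriteMode::CopyOnWrite => shared.min(dirty),
        WriteMode::RedirectOnWrite => dirty,
    }
}
```

For example, flushing 100 dirty blocks of which 10 are shared costs 0, 10, or 100 new blocks depending on the mode.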
14.4.1.2 Extent Sharing Operations¶
/// Trait implemented by filesystems that support extent sharing (reflinks,
/// snapshots, clones, dedup). Optional — only `CopyOnWrite` and
/// `RedirectOnWrite` filesystems implement this.
///
/// The VFS queries these methods during:
/// - Writeback: to decide CoW vs in-place for `CopyOnWrite` filesystems.
/// - Page cache lookup: to enable shared-extent page cache.
/// - Free space accounting: to estimate true cost of flushing dirty pages.
///
/// These methods are called from the writeback workqueue and from the page
/// cache miss path (never the page cache hit hot path). Implementations may
/// acquire filesystem-internal locks.
pub trait ExtentSharingOps: Send + Sync {
/// Returns true if the extent covering `[file_offset, file_offset + len)`
/// in the given inode is shared (refcount > 1 due to reflinks, snapshots,
/// or dedup). Returns false for holes, unallocated ranges, and unshared
/// extents.
///
/// For `CopyOnWrite` filesystems: determines whether writeback must
/// allocate a new block or can overwrite in place.
/// For `RedirectOnWrite` filesystems: always returns true conceptually
/// (all blocks are "shared" with the previous tree version), but the
/// implementation may optimize by returning false for blocks that are
/// guaranteed unshared (e.g., newly allocated since the last checkpoint).
fn is_extent_shared(&self, inode: InodeId, file_offset: u64, len: u64) -> bool;
/// Returns the physical location of the extent backing `[file_offset..]`
/// in the given inode. Used by the shared-extent page cache to index
/// pages by physical address. Returns `None` for holes, unallocated
/// ranges, and inline data.
fn extent_phys_addr(&self, inode: InodeId, file_offset: u64) -> Option<PhysExtent>;
/// Allocate a new block for a CoW write. Called by the writeback path when
/// `is_extent_shared()` returns true (CopyOnWrite mode) or unconditionally
/// (RedirectOnWrite mode). The filesystem allocates a new physical extent,
/// updates its internal mapping, and returns the new extent descriptor.
///
/// The old extent's refcount is decremented. If it drops to zero and no
/// snapshot retains it, the filesystem may free it (deferred to the
/// filesystem's own garbage collection or checkpoint cycle).
fn cow_allocate(
&self,
inode: InodeId,
file_offset: u64,
len: u64,
) -> Result<PhysExtent>;
}
/// A physical extent descriptor — identifies a contiguous range on a block device.
pub struct PhysExtent {
/// Block device that owns this extent.
pub bdev: BlockDeviceId,
/// Physical byte offset on the block device.
pub phys_offset: u64,
/// Length of the extent in bytes.
pub len: u64,
}
14.4.1.3 Shared-Extent Page Cache¶
Problem (Linux limitation): Linux's page cache is indexed solely by (address_space,
file_offset). When two files share a physical extent via reflink, each file gets a
separate page cache entry for the same data — doubling memory consumption. This is a
known limitation acknowledged by Linux developers,
unfixed because the assumption "one page = one mapping pointer" is deeply embedded in
the Linux MM.
UmkaOS design: dual-indexed page cache with physical extent awareness.
UmkaOS has no such legacy constraint. The page cache uses a two-level lookup:
- Primary index (unchanged): per-inode PageCache keyed by file_offset, the standard per-inode page tree used for all I/O operations (Section 4.4).
- Secondary index (new): PhysExtentCache, a global RcuHashMap keyed by (BlockDeviceId, phys_offset) that maps to Page references. Populated only for filesystems with WriteMode::CopyOnWrite or WriteMode::RedirectOnWrite.
/// Global cache of pages indexed by physical extent location.
/// Enables page sharing between files that reference the same physical blocks
/// (reflinks, snapshots, clones). RCU-protected for lock-free read-side lookups.
///
/// Lookup path (read): RCU read lock → hash lookup → Page refcount increment.
/// Insert path (miss): filesystem provides PhysExtent via ExtentSharingOps →
/// read from disk → insert into PhysExtentCache → insert into per-inode PageCache.
/// Eviction: when a Page's refcount (across all address_spaces) drops to zero,
/// the page is removed from PhysExtentCache and per-inode PageCaches.
pub static PHYS_EXTENT_CACHE: OnceCell<RcuHashMap<PhysExtentKey, Page>> = OnceCell::new();
/// Key for the physical extent cache: (block device, physical byte offset).
/// Uses the page-aligned offset (physical offset rounded down to page size).
#[derive(Hash, Eq, PartialEq, Clone, Copy)]
pub struct PhysExtentKey {
pub bdev: BlockDeviceId,
pub phys_offset: u64, // page-aligned
}
Read path (for CoW/RoW filesystems):
read(inode_A, file_offset_X):
1. Look up (inode_A.address_space, file_offset_X) in per-inode PageCache.
→ Hit: return page (standard fast path, no change from InPlace mode).
→ Miss: continue to step 2.
2. Call extent_phys_addr(inode_A, file_offset_X) → PhysExtent { bdev, phys_offset }.
3. Look up (bdev, phys_offset) in PHYS_EXTENT_CACHE.
→ Hit: page is already cached (another file sharing this extent loaded it).
Insert a reference into inode_A's PageCache. Return page.
→ Miss: read from disk. Insert into PHYS_EXTENT_CACHE and inode_A's PageCache.
Write path (CoW-on-write for shared pages):
write(inode_A, file_offset_X, data):
1. Look up page in inode_A's PageCache.
2. If page is in PHYS_EXTENT_CACHE AND has references from multiple address_spaces:
→ Allocate a new private page for inode_A.
→ Copy old page contents to new page.
→ Apply the write to the new page.
→ Remove inode_A's reference from the old page in PHYS_EXTENT_CACHE.
→ Insert new page into inode_A's PageCache (not into PHYS_EXTENT_CACHE —
it's now private to inode_A until the filesystem assigns it a new
physical location during writeback).
→ Mark new page dirty.
3. If page is NOT shared (single reference):
→ Modify in place, mark dirty (standard path).
This design eliminates the 2x memory penalty for reflinked files. The cost is one additional hash lookup on a per-inode cache miss (read path steps 2-3), which is a cold path where disk I/O dominates. Hot-path reads (step 1 hit) are unchanged.
Memory savings: For a 10 GB dataset reflinked to 5 containers that each read it in full, Linux can use up to 50 GB of page cache (5 copies). UmkaOS uses 10 GB (one copy, 5 references). This is particularly significant for container-dense workloads where the same base image is reflinked across hundreds of containers.
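The read and write paths above can be modelled in a few lines. This is a toy sketch only, assuming single-threaded access and std `HashMap`s in place of the RCU structures; `ToyPageCache`, its fields, and the frame-index representation are illustrative names that do not appear in the design.

```rust
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096;

/// Toy model: page frames are identified by their index in `frames`.
#[derive(Default)]
pub struct ToyPageCache {
    pub frames: Vec<Vec<u8>>,                   // page contents
    pub mappers: Vec<u32>,                      // inodes mapping each frame
    pub per_inode: HashMap<(u64, u64), usize>,  // (inode, file_offset) -> frame
    pub by_phys: HashMap<(u32, u64), usize>,    // (bdev, phys_offset) -> frame
    pub disk_reads: u32,
}

impl ToyPageCache {
    /// Read path. `phys` stands in for the result of `extent_phys_addr()`.
    pub fn read(&mut self, inode: u64, file_off: u64, phys: (u32, u64)) -> usize {
        // Step 1: per-inode hit is the unchanged fast path.
        if let Some(&frame) = self.per_inode.get(&(inode, file_off)) {
            return frame;
        }
        // Step 3: consult the physical index before touching the disk.
        let frame = match self.by_phys.get(&phys).copied() {
            Some(f) => f,
            None => {
                self.disk_reads += 1; // simulated disk read
                self.frames.push(vec![0u8; PAGE_SIZE]);
                self.mappers.push(0);
                let f = self.frames.len() - 1;
                self.by_phys.insert(phys, f);
                f
            }
        };
        self.mappers[frame] += 1;
        self.per_inode.insert((inode, file_off), frame);
        frame
    }

    /// Write path: privatize the page if more than one inode maps it.
    pub fn write(&mut self, inode: u64, file_off: u64, byte: u8) -> Option<usize> {
        let frame = *self.per_inode.get(&(inode, file_off))?;
        if self.mappers[frame] > 1 {
            let mut private = self.frames[frame].clone(); // copy old contents
            private[0] = byte;                            // apply the write
            self.mappers[frame] -= 1;                     // drop our reference
            self.frames.push(private);
            self.mappers.push(1);
            let new_frame = self.frames.len() - 1;
            // The private page is NOT inserted into `by_phys`.
            self.per_inode.insert((inode, file_off), new_frame);
            Some(new_frame)
        } else {
            self.frames[frame][0] = byte; // unshared: modify in place
            Some(frame)
        }
    }
}
```

Two inodes reading the same physical location trigger a single simulated disk read; the first writer receives a private frame while the other inode keeps the original.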
14.4.1.4 Reflink Operations and Ioctl Dispatch¶
The remap_file_range() method in FileOps (Section 14.1) is the filesystem-level backend.
The VFS generic layer handles validation and dispatches via ioctls:
/// Flags for remap_file_range().
pub struct RemapFlags(u32);
impl RemapFlags {
/// Only remap if source and destination byte ranges are identical.
/// Used by FIDEDUPERANGE ioctl for deduplication.
pub const REMAP_FILE_DEDUP: Self = Self(1 << 0);
/// Caller accepts a shorter remap than requested. The filesystem may
/// return fewer bytes than `len` if the source extent ends early.
/// Used by FICLONE/FICLONERANGE (always set) and copy_file_range.
pub const REMAP_FILE_CAN_SHORTEN: Self = Self(1 << 1);
}
/// Clone range descriptor for FICLONERANGE ioctl.
/// Matches Linux's `struct file_clone_range` layout for binary compatibility.
#[repr(C)]
pub struct FileCloneRange {
/// Source file descriptor.
pub src_fd: i64,
/// Source file offset.
pub src_offset: u64,
/// Length to clone (0 = clone to EOF).
pub src_length: u64,
/// Destination file offset.
pub dest_offset: u64,
}
// Layout: 8 + 8 + 8 + 8 = 32 bytes.
const_assert!(size_of::<FileCloneRange>() == 32);
Ioctl definitions (Linux ABI-compatible):
| Ioctl | Number (x86-64) | Argument | Semantics |
|---|---|---|---|
| FICLONE | 0x40049409 (_IOW(0x94, 9, i32)) | Source fd | Clone entire file into destination fd |
| FICLONERANGE | 0x4020940d (_IOW(0x94, 13, FileCloneRange)) | FileCloneRange | Clone specified byte range |
| FIDEDUPERANGE | 0xC0189436 (_IOWR(0x94, 54, FileDeduperangeHdr)) | Variable-length | Dedup if content matches |
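The numbers in the table follow from the generic Linux `_IOC` encoding used on x86-64 (direction in bits 31:30, argument size in bits 29:16, type in bits 15:8, number in bits 7:0). The sketch below recomputes them; `ioc` is a hypothetical helper name, and the `FileDeduperangeHdr` field layout is an assumption that mirrors the fixed header of Linux's `struct file_dedupe_range`.

```rust
use std::mem::size_of;

const IOC_WRITE: u32 = 1;
const IOC_READ: u32 = 2;

/// Generic `_IOC(dir, type, nr, size)` encoding (x86-64 and most other arches).
pub const fn ioc(dir: u32, ty: u32, nr: u32, size: usize) -> u32 {
    (dir << 30) | ((size as u32) << 16) | (ty << 8) | nr
}

#[repr(C)]
pub struct FileCloneRange {
    pub src_fd: i64,
    pub src_offset: u64,
    pub src_length: u64,
    pub dest_offset: u64,
}

/// Assumed layout: fixed header of Linux `struct file_dedupe_range`.
#[repr(C)]
pub struct FileDeduperangeHdr {
    pub src_offset: u64,
    pub src_length: u64,
    pub dest_count: u16,
    pub reserved1: u16,
    pub reserved2: u32,
}

pub const FICLONE: u32 = ioc(IOC_WRITE, 0x94, 9, size_of::<i32>());
pub const FICLONERANGE: u32 = ioc(IOC_WRITE, 0x94, 13, size_of::<FileCloneRange>());
pub const FIDEDUPERANGE: u32 =
    ioc(IOC_READ | IOC_WRITE, 0x94, 54, size_of::<FileDeduperangeHdr>());
```

Because the argument size is part of the number, any change to a `#[repr(C)]` struct layout changes the ioctl value, which is why the size assertion on `FileCloneRange` matters for ABI compatibility.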
VFS ioctl dispatch (generic, not per-filesystem):
ioctl(dst_fd, FICLONE, src_fd):
1. Validate: both fds open, dst writable, same superblock, same filesystem type.
2. Lock ordering: lock both inodes in ascending inode-number order, regardless of
which is src and which is dst (prevents deadlock when src and dst are swapped
in concurrent calls).
3. Call dst.file_ops.remap_file_range(src, 0, dst, 0, src.size,
REMAP_FILE_CAN_SHORTEN).
4. Invalidate dst's affected range in PHYS_EXTENT_CACHE (new shared extents will
be populated lazily on next read).
5. Return 0 on success, -errno on failure.
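The lock-ordering rule in step 2 can be sketched with ordinary mutexes. This is an illustration only: `ToyInode`, `lock_order`, and `lock_pair` are hypothetical names, and `std::sync::Mutex` stands in for the kernel's inode lock.

```rust
use std::sync::{Mutex, MutexGuard};

pub struct ToyInode {
    pub ino: u64,
    pub lock: Mutex<()>,
}

/// Order in which two inode numbers are locked: lower number first.
pub fn lock_order(src_ino: u64, dst_ino: u64) -> (u64, u64) {
    (src_ino.min(dst_ino), src_ino.max(dst_ino))
}

/// Lock both inodes in ascending inode-number order. When src and dst are
/// the same inode, only one lock is taken and the second guard is `None`.
pub fn lock_pair<'a>(
    src: &'a ToyInode,
    dst: &'a ToyInode,
) -> (MutexGuard<'a, ()>, Option<MutexGuard<'a, ()>>) {
    if src.ino == dst.ino {
        return (src.lock.lock().unwrap(), None);
    }
    let (first, second) = if src.ino < dst.ino { (src, dst) } else { (dst, src) };
    let g1 = first.lock.lock().unwrap();
    let g2 = second.lock.lock().unwrap();
    (g1, Some(g2))
}
```

Two concurrent callers passing `(a, b)` and `(b, a)` both take the lower-numbered lock first, so neither can hold one lock while waiting for the other in the opposite order.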
14.4.1.5 copy_file_range() VFS Dispatch¶
copy_file_range(2) (syscall 326 on x86-64) is the general-purpose server-side copy
interface. The VFS dispatch prioritizes zero-copy where possible:
copy_file_range(fd_in, off_in, fd_out, off_out, len, flags=0):
1. If same filesystem AND filesystem implements remap_file_range():
→ Try reflink (remap_file_range with REMAP_FILE_CAN_SHORTEN).
→ If EOPNOTSUPP: fall through to step 2.
2. If same filesystem AND filesystem implements a dedicated copy_file_range handler:
→ Use filesystem-specific server-side copy (e.g., NFS server-side copy,
CIFS CopyChunk).
3. Fallback: splice-based copy through page cache (generic, works cross-filesystem).
This reads source pages into the page cache, then writes them to the
destination — no userspace round-trip, but does consume page cache memory.
Note: UmkaOS uses the same syscall number as Linux (326 on x86-64) for ABI
compatibility. The flags parameter is reserved (must be 0).
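The three-step fallback chain can be expressed as a small selection function. This is a sketch under assumptions: `CopyPath`, `CopyCaps`, and `choose_copy_path` are hypothetical names, and the outcome of the reflink attempt is modelled as an input flag rather than an actual call.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CopyPath {
    Reflink,
    FsServerSideCopy,
    PageCacheSplice,
}

#[derive(Clone, Copy)]
pub struct CopyCaps {
    /// Source and destination live on the same filesystem.
    pub same_fs: bool,
    /// `None`: no remap_file_range(); `Some(false)`: it returned EOPNOTSUPP.
    pub remap_ok: Option<bool>,
    /// Filesystem provides a dedicated copy_file_range handler.
    pub has_copy_handler: bool,
}

pub fn choose_copy_path(c: CopyCaps) -> CopyPath {
    // Step 1: reflink, if available and supported for this range.
    if c.same_fs && c.remap_ok == Some(true) {
        return CopyPath::Reflink;
    }
    // Step 2: filesystem-specific server-side copy.
    if c.same_fs && c.has_copy_handler {
        return CopyPath::FsServerSideCopy;
    }
    // Step 3: generic copy through the page cache.
    CopyPath::PageCacheSplice
}
```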
14.4.1.6 CoW/RoW-Aware Writeback¶
The writeback thread (Section 4.6)
uses SuperBlock.write_mode to adapt its behavior:
| Write mode | Writeback behavior |
|---|---|
| InPlace | Standard: for each dirty page, issue write to the page's existing block address. The block address is already known (stored in the iomap). |
| CopyOnWrite | Before writing each dirty page, call is_extent_shared(). If shared: call cow_allocate() to get a new block address, write to the new address, update the filesystem's extent mapping. If unshared: write in place (same as InPlace). |
| RedirectOnWrite | For ALL dirty pages, call cow_allocate() to get new block addresses. The filesystem's allocator can batch these requests to produce sequential physical layouts, reducing seek overhead on rotational media and lowering write amplification on flash. The old block addresses are not reused until the filesystem's checkpoint/commit cycle confirms the new tree root. |
RoW writeback batching optimization: For RedirectOnWrite filesystems, the writeback
thread collects all dirty pages for a given inode before requesting block allocations.
This allows the filesystem allocator to assign a contiguous physical range (one large
extent) rather than many scattered single-block allocations. The batch size is bounded
by BDI_MAX_WRITEBACK_BATCH (default: 1024 pages = 4 MB). This produces sequential
I/O patterns even when dirty pages were written at random file offsets — a significant
advantage for RoW filesystems on both rotational and flash storage.
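A minimal sketch of the batching idea follows, assuming 4 KiB pages and a caller-supplied allocator closure that returns the start of a contiguous physical run. `plan_batches` and the closure signature are illustrative assumptions, not the actual writeback interface.

```rust
pub const BDI_MAX_WRITEBACK_BATCH: usize = 1024;
pub const PAGE_SIZE: u64 = 4096;

/// Map dirty file pages (given as page indices, in any order) to new physical
/// offsets. `alloc(n)` is a hypothetical allocator call returning the physical
/// byte offset of a contiguous run of `n` pages.
/// Returns `(file_offset, phys_offset)` pairs in ascending file order.
pub fn plan_batches(
    dirty: &[u64],
    mut alloc: impl FnMut(usize) -> u64,
) -> Vec<(u64, u64)> {
    let mut pages = dirty.to_vec();
    pages.sort_unstable();
    pages.dedup();
    let mut plan = Vec::with_capacity(pages.len());
    // One allocation request per batch, bounded by the batch limit.
    for batch in pages.chunks(BDI_MAX_WRITEBACK_BATCH) {
        let base = alloc(batch.len());
        for (i, &page_idx) in batch.iter().enumerate() {
            plan.push((page_idx * PAGE_SIZE, base + i as u64 * PAGE_SIZE));
        }
    }
    plan
}
```

Pages dirtied at scattered file offsets (for example indices 900, 3, 57) end up at three adjacent physical pages, so the resulting I/O is sequential even though the file offsets are not.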
14.4.1.7 Free Space Accounting for CoW/RoW Filesystems¶
Traditional statfs() reports free blocks = total - used. For CoW/RoW filesystems, this
is misleading because:
- Pending CoW: Dirty shared pages will consume new blocks on writeback. The "true" free space is lower than statfs() reports.
- Snapshot overhead: Deleting a file doesn't free its blocks if snapshots reference them.
- RoW garbage: Old blocks from previous tree versions occupy space until garbage collection reclaims them.
UmkaOS adds an extended space accounting interface:
/// Extended filesystem space information, reported by CoW/RoW-aware
/// filesystems in addition to the standard StatFs. Optional — InPlace
/// filesystems return None.
pub struct ExtendedSpaceInfo {
/// Bytes reserved for pending CoW allocations (dirty shared pages that
/// will need new blocks on writeback).
pub cow_reserved_bytes: u64,
/// Bytes reclaimable by snapshot deletion (blocks held only by snapshots,
/// not by live files).
pub snapshot_reclaimable_bytes: u64,
/// Bytes occupied by stale RoW tree versions pending garbage collection.
pub gc_pending_bytes: u64,
/// Effective free bytes = statfs.free - cow_reserved - gc_pending.
/// This is the "true" free space available for new writes.
pub effective_free_bytes: u64,
}
This information is exposed via the statfs extended attributes
(STATX_ATTR_* flags) and through the UmkaOS-specific
/ukfs/kernel/fs/<mount>/space_info umkafs interface.
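As a worked example of the `effective_free_bytes` formula, the sketch below applies it with saturating arithmetic. The helper name and the choice to clamp at zero (rather than report an error) are assumptions made for illustration.

```rust
/// effective_free = statfs.free - cow_reserved - gc_pending, clamped at zero.
pub fn effective_free_bytes(statfs_free: u64, cow_reserved: u64, gc_pending: u64) -> u64 {
    statfs_free
        .saturating_sub(cow_reserved)
        .saturating_sub(gc_pending)
}
```

With 100 GiB reported free, 30 GiB reserved for pending CoW, and 20 GiB awaiting garbage collection, the effective free space is 50 GiB. If reservations exceed the reported free space, the result is 0 instead of wrapping around.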
Cross-references:
- WriteMode declaration: FileSystemOps::write_mode() (Section 14.1)
- remap_file_range(): FileOps trait (Section 14.1)
- Writeback thread organization: Section 4.6
- Page cache and AddressSpace: Section 4.4
- Block I/O submission: Section 15.2
- Btrfs (RedirectOnWrite): Section 15.8
- XFS (CopyOnWrite with reflinks): Section 15.7
- ZFS (RedirectOnWrite): Section 15.10
- FICLONE/FICLONERANGE Linux compat: Section 19.1
- Dirty extent pre-registration: Section 14.1 (VFS crash recovery)
14.5 Character and Block Device Node Framework¶
All device classes that expose character or block device files under /dev register
through a unified device node framework. This framework manages major/minor number
allocation, the global device registry, and automatic /dev node lifecycle via
devtmpfs.
14.5.1 Character Device Region Registration¶
/// Character device region registration. All device classes (TTY, evdev, ALSA,
/// DRM, watchdog, SPI, RTC, etc.) register through this unified interface.
/// A region reserves a contiguous range of minor numbers under a single major.
pub struct ChrdevRegion {
/// Major device number. Either a well-known major from the allocation
/// table below, or dynamically allocated via `alloc_chrdev_region()`.
/// Valid range: 1-4095 (0 is reserved). Linux ABI uses 12-bit major (MKDEV
/// encoding: bits 31:20), so the hard limit is 4095. Dynamic allocation
/// uses the range 234-254, then 384-511.
pub major: u16,
/// First minor number in this region.
pub minor_base: u32,
/// Number of minor numbers reserved (contiguous from `minor_base`).
/// Must be >= 1. The range `minor_base..minor_base+minor_count` must
/// not overlap with any other registered region under the same major.
///
/// **Overflow check**: `register_chrdev_region()` validates:
/// ```
/// let end = minor_base.checked_add(minor_count).ok_or(EINVAL)?;
/// if end > MINORMASK + 1 { return Err(EINVAL); }
/// ```
/// Without this check, a `minor_base + minor_count` overflow wraps
/// to a valid range, causing silent overlap with unrelated device
/// regions under the same major. `MINORMASK` is `0xFFFFF` (20 bits,
/// matching Linux's `MINORBITS = 20`).
pub minor_count: u32,
/// File operations for all devices in this region. Called by the VFS
/// when userspace opens, reads, writes, ioctls, or closes a device node
/// with a matching major:minor pair.
pub fops: &'static dyn FileOps,
/// Human-readable name for this region (e.g., "ttyS", "input/event",
/// "snd/pcmC"). Used in `/proc/devices` output and diagnostic logging.
/// Max 31 bytes (null-terminated).
pub name: &'static str,
}
/// Global character device registry. Indexed by a composite key of
/// `(major << 20 | minor_base)` for O(1) lookup during `open()`.
///
/// `RcuXArray` provides:
/// - O(1) lookup by composite key on the `open()` hot path.
/// - RCU-protected reads: `open()` does not acquire any lock — readers
/// call `rcu_read_lock()` + `xa_load()` for lockless lookup.
/// - Ordered iteration for `/proc/devices` enumeration.
///
/// Writers (register/unregister) acquire `CHRDEV_WRITE_LOCK` to serialize
/// mutations, then modify the XArray under its internal lock.
/// Registrations happen at subsystem init and driver probe (warm path).
static CHRDEV_TABLE: RcuXArray<Arc<ChrdevRegion>> = RcuXArray::new();
/// Writer-side serialization for `CHRDEV_TABLE`. Readers never touch this lock.
/// Held only during `register_chrdev_region` / `unregister_chrdev_region`.
static CHRDEV_WRITE_LOCK: SpinLock<()> = SpinLock::new(());
/// Register a character device region. Called by subsystems during init
/// (e.g., TTY layer registers major 4 for serial, input layer registers
/// major 13 for evdev).
///
/// Returns `Ok(())` on success. Returns `Err(DeviceError::RegionConflict)`
/// if the requested major:minor range overlaps with an existing registration.
/// Returns `Err(DeviceError::MajorExhausted)` if dynamic allocation is
/// requested (`major == 0`) and no free major numbers remain.
pub fn register_chrdev_region(region: ChrdevRegion) -> Result<(), DeviceError>;
/// Dynamically allocate a major number and register a region. Used by
/// device classes that do not have a well-known major (UIO, RTC, etc.).
/// The kernel selects the lowest available major in the dynamic range
/// (234-254, then 384-511).
///
/// Returns the allocated major number on success.
pub fn alloc_chrdev_region(
name: &'static str,
minor_base: u32,
minor_count: u32,
fops: &'static dyn FileOps,
) -> Result<u16, DeviceError>;
/// Unregister a character device region. Called during driver unload or
/// crash recovery. After this call, `open()` on device nodes with matching
/// major:minor returns `ENODEV`.
///
/// Does NOT remove `/dev` nodes — that is handled by `devtmpfs_remove_node()`.
/// The two operations are decoupled because a crash recovery sequence may
/// unregister the old region before registering the replacement.
pub fn unregister_chrdev_region(major: u16, minor_base: u32);
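Two of the registration rules above lend themselves to a short sketch: the minor-range overflow check and the "lowest available major" scan over the dynamic ranges. Both helpers are hypothetical (`minor_range_end`, `alloc_dynamic_major`), and a `BTreeSet` stands in for whatever structure tracks majors in use.

```rust
use std::collections::BTreeSet;

pub const MINORMASK: u32 = 0xF_FFFF; // 20-bit minor space

/// Returns the exclusive end of the minor range, or `None` if the range is
/// empty, overflows u32, or extends past the 20-bit minor space.
pub fn minor_range_end(minor_base: u32, minor_count: u32) -> Option<u32> {
    if minor_count == 0 {
        return None;
    }
    let end = minor_base.checked_add(minor_count)?;
    (end <= MINORMASK + 1).then_some(end)
}

/// Lowest free major in the dynamic ranges: 234-254 first, then 384-511.
pub fn alloc_dynamic_major(in_use: &BTreeSet<u16>) -> Option<u16> {
    (234u16..=254)
        .chain(384..=511)
        .find(|major| !in_use.contains(major))
}
```

A `None` from `alloc_dynamic_major` corresponds to the `DeviceError::MajorExhausted` case described for `register_chrdev_region()`.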
open() dispatch: When userspace calls open("/dev/foo", ...):
1. VFS resolves the path through the dentry cache to a device inode.
2. The inode's i_rdev field contains the DevId (major:minor).
3. Cgroup device access check: The VFS calls cgroup_bpf_run(BPF_CGROUP_DEVICE, &ctx) where ctx is a BpfCgroupDevCtx { access_type, dev_type, major, minor } populated from the inode's device type and the requested access flags (BPF_DEVCG_ACC_READ, BPF_DEVCG_ACC_WRITE, or both depending on O_RDONLY/O_WRONLY/O_RDWR). The BPF program is evaluated bottom-up from the task's cgroup to the root; access is allowed only if every ancestor's program (if any) returns 1. If any program returns 0, open() returns -EPERM immediately. If no BPF program is attached to any ancestor, access is allowed by default. See Section 17.2 for the full enforcement model and the v1 devices.allow/devices.deny translation.
4. The VFS extracts the actual major and minor from the inode's i_rdev (DevId). It looks up the ChrdevRegion in CHRDEV_TABLE (for character devices) or BlkdevRegion in BLKDEV_TABLE (for block devices) under RCU read lock. The lookup iterates entries for the given major to find the region whose range [minor_base, minor_base + minor_count) contains the inode's minor number. The XArray is keyed by (major << 20 | minor_base), so for a major with multiple regions, the lookup walks entries at keys (major << 20 | 0) through (major << 20 | inode_minor) to find the containing range (XArray ordered iteration, typically 1-2 entries per major).
5. If found, the region's fops is attached to the new OpenFile.
6. fops.open() is called with the inode's actual minor number (not the region's minor_base), allowing the driver to compute the device instance index as minor - region.minor_base.
The cgroup check (step 3) applies identically to both chrdev_open()
and blkdev_open() — the BPF context distinguishes the two via the
dev_type field (BPF_DEVCG_DEV_CHAR=2 vs BPF_DEVCG_DEV_BLOCK=1).
The mknod() syscall also calls the same hook with
access_type = BPF_DEVCG_ACC_MKNOD before creating a device node in
the filesystem.
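The region lookup can be sketched with an ordered map in place of the XArray. This is a simplified model: `Region`, `key`, `register`, and `lookup` are illustrative names, a `BTreeMap` replaces `RcuXArray`, and the sketch relies on the non-overlap rule to inspect only the nearest region at or below the requested minor instead of walking every entry.

```rust
use std::collections::BTreeMap;

pub struct Region {
    pub major: u16,
    pub minor_base: u32,
    pub minor_count: u32,
    pub name: &'static str,
}

/// Composite key: (major << 20 | minor), with minor limited to 20 bits.
pub fn key(major: u16, minor: u32) -> u32 {
    ((major as u32) << 20) | (minor & 0xF_FFFF)
}

pub fn register(table: &mut BTreeMap<u32, Region>, region: Region) {
    table.insert(key(region.major, region.minor_base), region);
}

/// Find the region under `major` whose range contains `minor`.
pub fn lookup(table: &BTreeMap<u32, Region>, major: u16, minor: u32) -> Option<&Region> {
    // Regions under one major never overlap, so only the entry with the
    // largest minor_base <= minor can contain it.
    let (_, region) = table.range(key(major, 0)..=key(major, minor)).next_back()?;
    (minor - region.minor_base < region.minor_count).then_some(region)
}
```

The driver-visible instance index is then `minor - region.minor_base`, as described in the dispatch steps.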
14.5.2 Block Device Registration¶
Block devices follow an analogous pattern with register_blkdev() and a
separate BLKDEV_TABLE: XArray<Arc<BlkdevRegion>>. The block layer
(Section 15.2) adds additional registration
state (request queue, disk geometry, partition table) that character devices
do not need.
14.5.3 Major Number Allocation Table¶
Well-known major numbers are assigned to match Linux for userspace compatibility.
Tools like ls -l, stat, udev rules, and container runtimes rely on these
values being identical to Linux.
| Major | Device Class | Minor Range | Notes |
|---|---|---|---|
| 1 | mem (null, zero, random, urandom, full) | 0-15 | /dev/null=1,3; /dev/zero=1,5; /dev/full=1,7; /dev/random=1,8; /dev/urandom=1,9 |
| 4 | ttyS (serial terminals) | 64-255 | /dev/ttyS0=4,64; legacy range for 16550-compatible UARTs |
| 5 | tty, console, ptmx | 0-2 | /dev/tty=5,0; /dev/console=5,1; /dev/ptmx=5,2 |
| 10 | misc (miscellaneous character devices) | varies | /dev/fuse=10,229; /dev/rfkill=10,242; /dev/watchdog=10,130; /dev/loop-control=10,237 |
| 13 | input (evdev, joydev, mousedev) | 0-1023 | /dev/input/event0=13,64; mousedev 13,32-63; joydev 13,0-31; evdev 13,64-95; extended evdev 13,256+ (Linux 3.7+) |
| 29 | fb (framebuffer) | 0-31 | /dev/fb0=29,0; legacy interface, DRM preferred |
| 31 | mtdblock (MTD block translation) | 0-31 | /dev/mtdblock0=31,0 |
| 90 | mtd (raw MTD character access) | 0-31 | /dev/mtd0=90,0 |
| 116 | ALSA (snd) | 0-255 | /dev/snd/pcmC0D0p=116,16; /dev/snd/controlC0=116,0 |
| 136 | pts (PTY slave devices) | 0-1048575 | /dev/pts/0=136,0; devpts filesystem allocates minors dynamically |
| 226 | DRM (dri) | 0-255 | /dev/dri/card0=226,0; /dev/dri/renderD128=226,128 |
| 239 | IPMI device interface | 0-31 | /dev/ipmi0=239,0 |
| dynamic | UIO, RTC, hwmon, etc. | allocated at registration | Major assigned by alloc_chrdev_region() |
14.5.4 Devtmpfs: Automatic /dev Node Lifecycle¶
Devtmpfs is a kernel-managed tmpfs instance mounted on /dev that
automatically creates and removes device nodes in response to device
registration and unregistration events. It eliminates the boot-time race
between device discovery and userspace udev — device nodes exist before
any userspace process runs.
/// Devtmpfs entry describing a device node to create under /dev.
/// Passed to `devtmpfs_create_node()` by the device registry when a
/// device is registered, and to `devtmpfs_remove_node()` on unregistration
/// or crash recovery.
pub struct DevtmpfsEntry {
/// Path relative to /dev (e.g., "ttyS0", "input/event3", "snd/pcmC0D0p").
/// Intermediate directories (e.g., "input/", "snd/") are created
/// automatically if they do not exist. Max 63 bytes.
pub path: ArrayString<64>,
/// Device type: character or block.
pub dev_type: DevType,
/// Major:minor device identifier.
pub dev_id: DevId,
/// File permissions (e.g., 0o666 for /dev/null, 0o620 for TTY devices,
/// 0o660 for block devices). The owner is always root:root; udev rules
/// can adjust ownership after boot.
pub mode: u16,
}
/// Device type discriminant for device nodes.
#[repr(u8)]
pub enum DevType {
/// Character device (S_IFCHR).
Char = 0,
/// Block device (S_IFBLK).
Block = 1,
}
/// Major:minor device identifier. Encoded as a single u32 for storage
/// efficiency (matches Linux's `MKDEV(major, minor)` encoding).
///
/// Linux `dev_t` is u32 with MAJOR = top 12 bits (0–4095) and MINOR =
/// bottom 20 bits (0–1048575). The `new()` constructor validates that
/// `major` fits in 12 bits and `minor` fits in 20 bits; out-of-range values
/// panic, because silent truncation would produce wrong device numbers.
pub struct DevId {
/// Encoded as `(major << 20) | (minor & 0xFFFFF)`.
/// Major occupies bits 31:20 (12 bits, 0–4095).
/// Minor occupies bits 19:0 (20 bits, 0–1048575).
pub raw: u32,
}
impl DevId {
/// Create a `DevId` from separate major and minor numbers.
///
/// # Panics
/// Panics if `major > 4095` — Linux ABI reserves only 12 bits for major.
pub const fn new(major: u16, minor: u32) -> Self {
// `const fn` so the `const_assert!` round-trip checks below can evaluate it.
// Panics in const context accept only literal messages, so the offending
// value is not formatted into the message.
assert!(major <= 0x0FFF, "DevId: major exceeds 12-bit Linux ABI limit (max 4095)");
assert!(minor <= 0xFFFFF, "DevId: minor exceeds 20-bit limit (max 1048575)");
DevId { raw: (major as u32) << 20 | (minor & 0x000F_FFFF) }
}
pub const fn major(&self) -> u16 { (self.raw >> 20) as u16 }
pub const fn minor(&self) -> u32 { self.raw & 0x000F_FFFF }
/// Encode for stat()/fstat()/newfstatat() `st_dev` and `st_rdev` fields.
/// This encoding differs from the kernel-internal MKDEV layout.
/// Matches Linux `new_encode_dev()` in include/linux/kdev_t.h.
/// The SysAPI layer calls this when filling `struct stat` responses.
/// For statx() responses, use `major()` and `minor()` directly (statx
/// has separate `stx_rdev_major`/`stx_rdev_minor` u32 fields).
pub const fn new_encode_dev(&self) -> u32 {
let major = self.major() as u32;
let minor = self.minor();
(minor & 0xff) | ((major & 0xfff) << 8) | ((minor & !0xffu32) << 12)
}
/// Decode a stat()/fstat() encoded device number back to DevId.
/// Inverse of `new_encode_dev()`. Matches Linux `new_decode_dev()`.
pub const fn new_decode_dev(encoded: u32) -> DevId {
let major = ((encoded & 0xfff00) >> 8) as u16;
let minor = (encoded & 0xff) | ((encoded >> 12) & 0xfff00);
DevId::new(major, minor)
}
}
// Round-trip verification: encode and decode must be inverses.
// Test with boundary values: major 0..4095 (12 bits), minor 0..1048575 (20 bits).
const_assert!({
let d = DevId::new(0, 0);
let enc = d.new_encode_dev();
let dec = DevId::new_decode_dev(enc);
dec.major() == 0 && dec.minor() == 0
});
const_assert!({
let d = DevId::new(4095, 1048575);
let enc = d.new_encode_dev();
let dec = DevId::new_decode_dev(enc);
dec.major() == 4095 && dec.minor() == 1048575
});
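To make the difference between the internal MKDEV layout and the stat() encoding concrete, the following standalone sketch reimplements both on plain integers and works through a few device numbers. The free functions `mkdev`, `new_encode_dev`, and `new_decode_dev` are written here for illustration and are not the `DevId` methods themselves.

```rust
/// Kernel-internal layout: major in bits 31:20, minor in bits 19:0.
pub fn mkdev(major: u32, minor: u32) -> u32 {
    (major << 20) | (minor & 0xF_FFFF)
}

/// stat() layout: minor low byte in bits 7:0, major in bits 19:8,
/// remaining minor bits in bits 31:20.
pub fn new_encode_dev(major: u32, minor: u32) -> u32 {
    (minor & 0xff) | ((major & 0xfff) << 8) | ((minor & !0xff) << 12)
}

/// Inverse of `new_encode_dev`, returning (major, minor).
pub fn new_decode_dev(encoded: u32) -> (u32, u32) {
    let major = (encoded & 0xfff00) >> 8;
    let minor = (encoded & 0xff) | ((encoded >> 12) & 0xfff00);
    (major, minor)
}
```

For /dev/null (1,3) the internal value is 0x00100003 while stat() reports 0x103. For a PTY slave at (136,300) the minor no longer fits in the low byte, so its upper bits move above the major and stat() reports 0x10882C.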
Lifecycle hooks:
/// Create a device node under /dev. Called by `DeviceRegistry::register()`
/// after a device and its chrdev/blkdev region are successfully registered.
///
/// Creates the inode in the devtmpfs superblock with the specified
/// major:minor, type, and permissions. If intermediate path components
/// do not exist (e.g., "input/" for "input/event3"), they are created as
/// directories with mode 0o755.
///
/// This function is idempotent: if the node already exists with the same
/// major:minor, it is a no-op. If it exists with a different major:minor,
/// the old node is replaced (stale node from a previous driver instance).
pub fn devtmpfs_create_node(entry: &DevtmpfsEntry) -> Result<(), IoError>;
/// Remove a device node from /dev. Called by `DeviceRegistry::unregister()`
/// and by the crash recovery manager when a driver's device is being
/// cleaned up.
///
/// Removes the inode from devtmpfs. Empty parent directories are NOT
/// removed (they may be needed by other devices in the same class).
///
/// Idempotent: removing a non-existent node is a no-op (returns `Ok(())`).
pub fn devtmpfs_remove_node(path: &str) -> Result<(), IoError>;
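The idempotence rules of the two hooks can be illustrated with an in-memory model. This is a toy sketch: `ToyDevtmpfs` and `Outcome` are hypothetical names, device IDs are plain `(major, minor)` tuples, and no inodes, modes, or locking are modelled.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Created,
    Unchanged, // same major:minor already present: no-op
    Replaced,  // stale node with a different major:minor
}

#[derive(Default)]
pub struct ToyDevtmpfs {
    pub nodes: BTreeMap<String, (u16, u32)>,
    pub dirs: BTreeSet<String>,
}

impl ToyDevtmpfs {
    pub fn create_node(&mut self, path: &str, dev: (u16, u32)) -> Outcome {
        // Create intermediate directories ("input" for "input/event3").
        let mut prefix = String::new();
        let mut parts = path.split('/').peekable();
        while let Some(part) = parts.next() {
            if parts.peek().is_none() {
                break; // last component is the node itself
            }
            if !prefix.is_empty() {
                prefix.push('/');
            }
            prefix.push_str(part);
            self.dirs.insert(prefix.clone());
        }
        match self.nodes.insert(path.to_string(), dev) {
            None => Outcome::Created,
            Some(old) if old == dev => Outcome::Unchanged,
            Some(_) => Outcome::Replaced,
        }
    }

    /// Returns whether a node was actually removed. Removing a missing node
    /// is not an error, and parent directories are left in place.
    pub fn remove_node(&mut self, path: &str) -> bool {
        self.nodes.remove(path).is_some()
    }
}
```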
Boot sequence:
Devtmpfs is mounted during boot Phase 5 (after the physical memory allocator, slab allocator, and VFS are initialized, but before the root filesystem is mounted):
Boot Phase 5: devtmpfs initialization
1. Create an in-kernel tmpfs instance for devtmpfs.
2. Mount it internally (not yet visible to userspace).
3. Create standard device nodes:
- /dev/null (1, 3) mode 0o666 — discard sink
- /dev/zero (1, 5) mode 0o666 — zero source
- /dev/full (1, 7) mode 0o666 — always-full sink
- /dev/random (1, 8) mode 0o666 — blocking entropy source
- /dev/urandom (1, 9) mode 0o666 — non-blocking entropy source
- /dev/console (5, 1) mode 0o600 — kernel console
- /dev/tty (5, 0) mode 0o666 — controlling terminal alias
- /dev/ptmx (5, 2) mode 0o666 — PTY master multiplexer
4. Device discovery (PCI enumeration, platform devices, DT/ACPI) probes
drivers, which register devices → devtmpfs_create_node() populates
/dev with hardware-specific nodes.
5. After rootfs mount: bind-mount devtmpfs onto /dev in the real root.
Userspace udev starts and may adjust permissions, create symlinks
(e.g., /dev/disk/by-uuid/...), and apply udev rules.
Crash recovery interaction: When a Tier 1 driver crashes and its device is
being recovered (Section 11.9), the crash recovery manager calls
devtmpfs_remove_node() for all device nodes owned by the crashed driver.
After the replacement driver loads and re-registers its devices,
devtmpfs_create_node() recreates the nodes. Userspace processes that had
open file descriptors to the old nodes receive EIO on subsequent I/O; they
must reopen the device to get a file descriptor backed by the new driver
instance.
14.5.4.1 Crash Recovery and Hotplug Event Interaction¶
When a driver crash overlaps with hotplug events (e.g., a USB hub driver crashes while devices are being enumerated), the following ordering guarantees apply:
- Event queue freeze: The crash recovery manager acquires the hotplug workqueue's drain lock before beginning recovery. New HotplugEvent::DeviceArrival events for the crashed driver's bus subtree are enqueued but not processed until recovery completes. Events for unrelated bus subtrees continue processing normally.
- Device node cleanup: devtmpfs_remove_node() is called for each device owned by the crashed driver. The removal is atomic per-node: either the inode is fully removed or the operation has no effect (idempotent).
- Pending event replay: After the replacement driver loads and its init() returns ProbeResult::Ok, the hotplug workqueue drain lock is released. Queued arrival events for the recovered subtree are replayed in FIFO order. The new driver instance receives DeviceArrival events for any devices that appeared during the recovery window.
- Stale removal events: DeviceRemoval events for devices that were already cleaned up during crash recovery are silently dropped (the device handle no longer exists in the registry). This is safe because devtmpfs_remove_node() is idempotent.
- Uevent replay to userspace: After recovery, the netlink translation layer (Section 19.5) emits a synthetic change uevent for each recovered device. This notifies udev/systemd-udevd to re-apply rules (permissions, symlinks) without requiring a full udevadm trigger.
/// Crash recovery hotplug coordination.
///
/// Acquires the drain lock on the hotplug workqueue for the specified bus
/// subtree, preventing event processing until `release_hotplug_drain()`.
/// Events continue to be enqueued — they are replayed on release.
pub fn acquire_hotplug_drain(subtree_root: DeviceHandle) -> HotplugDrainGuard;
/// RAII guard that releases the hotplug drain lock on drop.
/// Queued events for the frozen subtree are replayed in FIFO order.
pub struct HotplugDrainGuard {
subtree_root: DeviceHandle,
}
impl Drop for HotplugDrainGuard {
fn drop(&mut self) {
// Release drain lock. The hotplug workqueue processes all queued
// events for subtree_root's descendants in FIFO order.
}
}
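The freeze-then-replay behaviour of the drain guard can be sketched as follows. This is a single-threaded toy model under stated assumptions: `ToyHotplugQueue` and `DrainGuard` are hypothetical names, events are plain integers, and the whole queue is frozen rather than a single bus subtree.

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;

#[derive(Default)]
pub struct ToyHotplugQueue {
    frozen: Cell<bool>,
    pending: RefCell<VecDeque<u32>>,
    processed: RefCell<Vec<u32>>,
}

impl ToyHotplugQueue {
    /// Events are processed immediately unless the queue is frozen,
    /// in which case they are enqueued for later replay.
    pub fn submit(&self, event: u32) {
        if self.frozen.get() {
            self.pending.borrow_mut().push_back(event);
        } else {
            self.processed.borrow_mut().push(event);
        }
    }

    pub fn acquire_drain(&self) -> DrainGuard<'_> {
        self.frozen.set(true);
        DrainGuard { queue: self }
    }

    pub fn processed(&self) -> Vec<u32> {
        self.processed.borrow().clone()
    }
}

/// RAII guard: dropping it unfreezes the queue and replays events in FIFO order.
pub struct DrainGuard<'a> {
    queue: &'a ToyHotplugQueue,
}

impl Drop for DrainGuard<'_> {
    fn drop(&mut self) {
        self.queue.frozen.set(false);
        let mut pending = self.queue.pending.borrow_mut();
        while let Some(event) = pending.pop_front() {
            self.queue.processed.borrow_mut().push(event);
        }
    }
}
```

Tying the release to `Drop` means an early return or error during recovery still replays the queued events, which is the property the RAII guard in the listing above is meant to provide.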
14.5.5 Initial Device Naming¶
The kernel assigns initial device names following Linux conventions for userspace
compatibility. Userspace udev may later create persistent symlinks
(/dev/disk/by-uuid/, /dev/disk/by-id/, etc.) but the kernel-assigned names
must match what Linux tools expect.
Naming rules by device class:
| Class | Pattern | Algorithm | Examples |
|---|---|---|---|
| Block (SCSI/NVMe) | sd[a-z]+ / nvme[N]n[M] | SCSI: alphabetic sequence by probe order. NVMe: controller N, namespace M. | sda, sdb, nvme0n1 |
| Block partitions | <disk>N, or <disk>pN when the disk name ends in a digit | Partition number from GPT/MBR table. | sda1, nvme0n1p1 |
| Network | eth[N] / wlan[N] | Sequential index per subsystem (Ethernet vs WiFi). udev's predictable naming (ens3, enp0s25) is applied by userspace rules, not the kernel. | eth0, wlan0 |
| TTY serial | ttyS[N] | Port index from UART enumeration (PCI BAR order, DT aliases, ACPI UID). | ttyS0, ttyS1 |
| TTY USB serial | ttyUSB[N] | Sequential index by USB probe order. | ttyUSB0 |
| Input (evdev) | input/event[N] | Sequential index by registration order. | input/event0 |
| ALSA | snd/pcmC[N]D[M]p | Card N (probe order), device M (codec order), p=playback / c=capture. | snd/pcmC0D0p |
| DRM | dri/card[N] / dri/renderD[128+N] | Sequential by GPU probe order. Render nodes start at minor 128. | dri/card0, dri/renderD128 |
| Framebuffer | fb[N] | Legacy. Sequential by registration. | fb0 |
| Watchdog | watchdog[N] | Sequential by registration. | watchdog0 |
| Loop | loop[N] | Fixed pool of max_loop devices (default 256). Created at boot. | loop0 |
Implementation: Each device class maintains its own index counter (typically a
static AtomicU32). The counter is incremented atomically at register_chrdev()
or add_disk() time. The generated name is passed to devtmpfs_create_node()
and stored in the DeviceNode.dev_name field.
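As a rough illustration of the counter-based scheme, the sketch below generates names for three of the classes in the table. The helper names (`next_indexed_name`, `sd_name`, `partition_name`) are hypothetical, and the SCSI suffix sequence is assumed to follow the Linux convention in which sdz is followed by sdaa.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical helper: take the next index from a class counter and
/// format it as `<prefix><index>` (for example eth0, eth1, ...).
fn next_indexed_name(prefix: &str, counter: &AtomicU32) -> String {
    let index = counter.fetch_add(1, Ordering::Relaxed);
    format!("{prefix}{index}")
}

/// Hypothetical helper: map a probe-order index to an `sd` name.
/// Assumes the Linux sequence: sda..sdz, then sdaa..sdzz, then sdaaa, ...
fn sd_name(mut index: u32) -> String {
    let mut suffix = Vec::new();
    loop {
        // Least-significant letter first; reversed below.
        suffix.push(b'a' + (index % 26) as u8);
        if index < 26 {
            break;
        }
        index = index / 26 - 1;
    }
    suffix.reverse();
    format!("sd{}", String::from_utf8(suffix).expect("suffix is ASCII"))
}

/// Hypothetical helper: a `p` separator is inserted when the disk name
/// ends in a digit, which is why nvme0n1 yields nvme0n1p1.
fn partition_name(disk: &str, number: u32) -> String {
    let ends_in_digit = disk.chars().last().map_or(false, |c| c.is_ascii_digit());
    if ends_in_digit {
        format!("{disk}p{number}")
    } else {
        format!("{disk}{number}")
    }
}
```

`Relaxed` ordering is sufficient in this sketch because the counter only needs to hand out unique indices; it does not publish any other data.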
Stability caveat: Kernel-assigned names like sda/sdb depend on probe order,
which can vary across boots. This is a known Linux behavior. Persistent naming
(/dev/disk/by-uuid/, /dev/disk/by-path/, /dev/disk/by-id/) is handled
entirely by userspace udev rules that read device attributes from the
uevent/sysfs interface and create stable symlinks. The kernel provides all
necessary attributes (serial number, WWN, partition UUID) via the uevent
mechanism (Section 19.5).
14.5.6 File Operations Replacement (replace_fops)¶
Some device classes use a single /dev entry as a multiplexer that switches to a
specialized FileOps vtable after open(). Examples:
- ALSA: /dev/snd/controlC0 opens with a generic ALSA control FileOps. PCM device files (/dev/snd/pcmC0D0p) open with PCM-specific FileOps from the start and do NOT use replace_fops. The ALSA replace_fops use case is the control device switching to a specialized monitoring mode via SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS.
- TTY: A TTY file descriptor switches its line discipline (e.g., from N_TTY to N_SLIP) via TIOCSETD, which replaces the FileOps to reflect the new discipline's read/write/ioctl behavior.
- evdev: EVIOCGRAB transitions an input device to exclusive-grab mode with a specialized FileOps that filters events to the grabbing client.
The OpenFile.f_ops field is declared as &'static dyn FileOps and is normally
immutable after creation. replace_fops provides a controlled mechanism to swap it:
/// Atomically replace the FileOps vtable on an open file descriptor.
///
/// This is the mechanism for device classes that multiplex multiple
/// operational modes through a single device node. The caller (device
/// subsystem code, NOT the driver directly) must hold the file's position
/// lock (`fdget_pos()` guard) to prevent concurrent read/write operations
/// from observing a partially-switched state.
///
/// # Safety
///
/// * `new_ops` must be `&'static` — it must outlive the `OpenFile`. In practice
/// this means the new FileOps must be a static vtable defined in the subsystem
/// module (e.g., `static PCM_PLAYBACK_OPS: FileOps = ...`), not a dynamically
/// constructed object.
/// * The caller must ensure no I/O operations are in-flight on the file at the
/// time of the swap. The `fdget_pos()` guard serializes with `read()`/`write()`;
/// `ioctl()` is serialized by the subsystem's own locking (e.g., ALSA's
/// `pcm_stream_lock`, TTY's `tty_lock`).
/// * The old `FileOps` is not freed (it is `&'static`). No cleanup callback is
/// needed.
///
/// # Implementation
///
/// Uses `AtomicPtr::store(new_ops, Release)` on the internal representation of
/// `f_ops`. Subsequent `read()`/`write()`/`ioctl()` calls load with `Acquire`
/// ordering and dispatch through the new vtable. The `Release`/`Acquire` pair
/// ensures all state mutations made by the caller before calling `replace_fops()`
/// (e.g., initializing PCM hardware parameters, setting up the line discipline
/// buffer) are visible to the next I/O operation through the new vtable.
pub fn replace_fops(
file: &OpenFile,
new_ops: &'static dyn FileOps,
_guard: &FdPosGuard,
) {
// `_guard` enforces at the type level that the caller holds the
// fdget_pos() guard, serializing concurrent replace_fops() calls
// on the same OpenFile. Without this parameter, the UnsafeCell
// write to f_ops_vtable relies on caller discipline alone.
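// Decompose the fat pointer into data + vtable, store data atomically.
// NOTE: `to_raw_parts()` is nightly-only (`ptr_metadata` feature) and
// yields the metadata half as `DynMetadata<dyn FileOps>`, which is
// treated here as an opaque vtable pointer.
let (data_ptr, vtable_ptr) = (new_ops as *const dyn FileOps).to_raw_parts();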
// Write vtable FIRST (plain store, ordered by the subsequent Release).
// SAFETY: all FileOps impl types have 'static vtables. The plain store
// is safe because the subsequent Release on f_ops_data orders this write
// relative to any reader's Acquire load.
unsafe { *file.f_ops_vtable.get() = vtable_ptr; }
// THEN publish the data pointer with Release. A reader's Acquire load
// on f_ops_data guarantees visibility of the vtable write above.
file.f_ops_data.store(data_ptr as *mut (), Release);
}
OpenFile.f_ops representation: To support replace_fops, the internal
representation uses two fields rather than a single &'static dyn FileOps:
- f_ops_data: AtomicPtr<()> — the data pointer component of the fat pointer,
swapped atomically with Release ordering on write, Acquire on read.
- f_ops_vtable: UnsafeCell<*const ()> — the vtable pointer component, written
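BEFORE the f_ops_data Release store. The Release/Acquire pair on f_ops_data
guarantees that a reader observing the new data pointer also observes the new
vtable pointer. This direction holds on all architectures, including weakly
ordered ones (AArch64, RISC-V), because Release orders ALL prior writes. The
reverse pairing (old data pointer with new vtable pointer) is not excluded by
the ordering itself; it is ruled out by the replace_fops safety contract that
no I/O is in flight during the swap.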
AtomicPtr<dyn FileOps> is NOT valid Rust (dyn FileOps is !Sized, and
AtomicPtr<T> requires T: Sized). The two-field decomposition avoids this
limitation. The public f_ops() accessor reconstructs the fat pointer from the
two components. For the common case (no replacement), this adds zero overhead on
x86-64 (Acquire is free under TSO) and a single ldar instruction on AArch64
(~1 cycle). The f_ops field shown in OpenFile (Section 14.1)
is the accessor return type, not the storage type.
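The Release/Acquire publication pattern can be tried on stable Rust with a different representation from the one specified here. The sketch below does not split the fat pointer; it swaps a thin pointer to a static holder that contains the fat reference (one extra indirection). This is a swapped-in technique for illustration only, and `ToyOpenFile`, `GenericOps`, `GrabOps`, and the `name()` method are hypothetical.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical, minimal stand-in for the FileOps trait.
trait FileOps: Sync {
    fn name(&self) -> &'static str;
}

struct GenericOps;
struct GrabOps;

impl FileOps for GenericOps {
    fn name(&self) -> &'static str {
        "generic"
    }
}

impl FileOps for GrabOps {
    fn name(&self) -> &'static str {
        "grab"
    }
}

type OpsRef = &'static dyn FileOps;

// Each static holds a fat reference. A pointer to the static itself is thin,
// so it fits in an AtomicPtr.
static GENERIC: OpsRef = &GenericOps;
static GRAB: OpsRef = &GrabOps;

struct ToyOpenFile {
    f_ops: AtomicPtr<OpsRef>,
}

impl ToyOpenFile {
    fn new(ops: &'static OpsRef) -> Self {
        Self {
            f_ops: AtomicPtr::new(ops as *const OpsRef as *mut OpsRef),
        }
    }

    /// Publish the new ops holder. Release makes earlier writes by the
    /// caller visible to any reader that observes the new pointer.
    fn replace_fops(&self, ops: &'static OpsRef) {
        self.f_ops
            .store(ops as *const OpsRef as *mut OpsRef, Ordering::Release);
    }

    /// Accessor that reconstructs the trait-object reference.
    fn f_ops(&self) -> OpsRef {
        let holder = self.f_ops.load(Ordering::Acquire);
        // SAFETY: only pointers to `'static` holders are ever stored.
        unsafe { *holder }
    }
}
```

The trade-off is one additional dependent load per dispatch in exchange for a single atomic word, so a reader can never observe a mismatched data/vtable pair.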
Subsystem usage constraints: replace_fops is callable only from kernel
subsystem code (ALSA core, TTY layer, input core), not from KABI driver callbacks.
Tier 1/Tier 2 drivers that need mode-switching behavior must request the switch through
their subsystem's control interface (e.g., ALSA snd_pcm_hw_params(), TTY tty_set_ldisc()),
which validates the request and calls replace_fops internally.
Cross-references:
- Device registry and bus management: Section 11.4
- Crash recovery node cleanup: Section 11.9
- TTY/PTY device nodes: Section 21.1
- ALSA device nodes: Section 21.4
- DRM device nodes: Section 22.1
- Input (evdev) device nodes: Section 21.3
14.6 Mount Tree Data Structures and Operations¶
The mount tree is the central data structure of the VFS layer that tracks all mounted filesystems, their hierarchical relationships, and their propagation properties. Every path resolution operation traverses the mount tree (via the mount hash table) to cross mount boundaries. This section defines the complete data structures, algorithms, and namespace operations that were previously referenced but unspecified by Section 14.1 and Section 17.1.
Design principles:
-
RCU for the read path: Mount hash table lookups happen on every path resolution (every open(), stat(), readlink(), execve()). The read path must be completely lock-free. Writers (mount/unmount) serialize through the per-namespace mount_lock and publish changes via RCU.
-
Per-namespace scoping: Unlike Linux, which uses a single global mount_hashtable, UmkaOS scopes the mount hash table per mount namespace. This eliminates contention between namespaces in container-heavy workloads (thousands of namespaces with independent mount trees) and allows mount operations in different namespaces to proceed in parallel with no shared lock. The trade-off is additional memory per namespace; this is acceptable because each namespace already has an independent mount tree and the hash table overhead is proportional to the number of mounts (typically 30-100 per container, well under 1 KiB of hash table memory).
-
Arc-based lifetime management: Mount nodes are reference-counted via Arc<Mount>. Parent, master, and peer references use Arc (strong) or Weak (where appropriate to break cycles). RCU protects the hash chains and list traversals; Arc protects the Mount node lifetime beyond the RCU grace period.
-
Capability gating: All mount tree modifications check CAP_MOUNT or CAP_SYS_ADMIN as specified in Section 14.1. The data structures below enforce this at the entry point of each operation, not deep inside the algorithm.
-
64-bit mount IDs: Per-namespace monotonic counter, never wrapping on any realistic system. Mount IDs are unique within a namespace and are the stable identifier used by statx() (STATX_MNT_ID), the new statmount()/listmount() syscalls, and /proc/PID/mountinfo.
14.6.1 Mount Flags¶
bitflags! {
/// Per-mount flags controlling security and access behavior.
///
/// These are distinct from per-superblock options (which control the
/// filesystem driver's behavior). A single superblock can be mounted
/// at multiple locations with different per-mount flags (e.g., one
/// mount point read-write, another read-only via bind mount + remount).
///
/// Bit assignments match Linux's `MNT_*` internal flags
/// (`include/linux/mount.h`, stable since Linux 2.6.x). These are
/// NOT the userspace `MS_*` flags (`include/uapi/linux/mount.h`) —
/// the `mount(2)` and `mount_setattr(2)` compat shims translate
/// `MS_*`/`MOUNT_ATTR_*` to `MountFlags` at syscall entry.
#[repr(transparent)]
pub struct MountFlags: u64 {
// --- Userspace-visible flags (set via mount/remount/mount_setattr) ---
//
// Bit assignments match Linux `include/linux/mount.h` exactly.
// Verified against torvalds/linux master (2026-03-25).
/// Do not honor set-user-ID and set-group-ID bits on executables.
const MNT_NOSUID = 0x01; // Linux: MNT_NOSUID = 0x01
/// Do not allow access to device special files on this mount.
const MNT_NODEV = 0x02; // Linux: MNT_NODEV = 0x02
/// Do not allow execution of programs on this mount.
const MNT_NOEXEC = 0x04; // Linux: MNT_NOEXEC = 0x04
/// Do not update access times on this mount.
const MNT_NOATIME = 0x08; // Linux: MNT_NOATIME = 0x08
/// Do not update directory access times on this mount.
const MNT_NODIRATIME = 0x10; // Linux: MNT_NODIRATIME = 0x10
/// Update atime only if atime <= mtime or atime <= ctime, or if
/// the previous atime is more than 24 hours old. Default for most
/// mounts since Linux 2.6.30 and UmkaOS.
const MNT_RELATIME = 0x20; // Linux: MNT_RELATIME = 0x20
/// Mount is read-only. Writes return EROFS.
const MNT_READONLY = 0x40; // Linux: MNT_READONLY = 0x40
/// Do not follow symlinks on this mount. Used by container runtimes
/// to prevent symlink-based escapes from bind-mounted directories.
const MNT_NOSYMFOLLOW = 0x80; // Linux: MNT_NOSYMFOLLOW = 0x80
// --- Internal flags (kernel-managed, not settable by userspace) ---
/// Mount can be expired and automatically unmounted under memory
/// pressure or after an idle timeout. Used by autofs. The VFS
/// checks `mnt_count == 0` before expiring a shrinkable mount.
const MNT_SHRINKABLE = 0x100; // Linux: MNT_SHRINKABLE = 0x100
/// Internal mount (not exposed to userspace). Used for kernel-
/// internal mounts (pipefs, sockfs, etc.).
const MNT_INTERNAL = 0x4000; // Linux: MNT_INTERNAL = 0x4000
// --- Container namespace lock flags (MNT_LOCK_*) ---
//
// These flags prevent unprivileged users in child mount namespaces
// from changing mount attributes inherited from the parent namespace.
// Set by the kernel when creating a user namespace or copying a
// mount namespace. Critical for container security — without these,
// a container could remount a read-only host path as read-write.
/// Atime setting is locked (NOATIME/RELATIME/NODIRATIME cannot
/// be changed by unprivileged mount_setattr in child namespace).
const MNT_LOCK_ATIME = 0x040000; // Linux: MNT_LOCK_ATIME = 0x040000
/// NOEXEC flag is locked.
const MNT_LOCK_NOEXEC = 0x080000; // Linux: MNT_LOCK_NOEXEC = 0x080000
/// NOSUID flag is locked.
const MNT_LOCK_NOSUID = 0x100000; // Linux: MNT_LOCK_NOSUID = 0x100000
/// NODEV flag is locked.
const MNT_LOCK_NODEV = 0x200000; // Linux: MNT_LOCK_NODEV = 0x200000
/// READONLY flag is locked (cannot be remounted read-write by
/// unprivileged users in child namespace).
const MNT_LOCK_READONLY = 0x400000; // Linux: MNT_LOCK_READONLY = 0x400000
/// Mount is locked and cannot be unmounted by unprivileged
/// processes. Set on mounts visible in child mount namespaces
/// created by unprivileged users — prevents a child namespace
/// from unmounting a mount inherited from the parent. Cleared
/// only by a process with `CAP_SYS_ADMIN` in the mount's owning
/// user namespace.
const MNT_LOCKED = 0x800000; // Linux: MNT_LOCKED = 0x800000
/// Mount is in the process of being unmounted. Set by `umount()`
/// before removing the mount from the hash table. Prevents new
/// path lookups from entering this mount. Once set, never cleared
/// (the mount node is freed after the RCU grace period).
const MNT_DOOMED = 0x1000000; // Linux: MNT_DOOMED = 0x1000000
/// Synchronous unmount requested. Set when MNT_DETACH was NOT
/// specified and the kernel must wait for all references to drain.
const MNT_SYNC_UMOUNT = 0x2000000; // Linux: MNT_SYNC_UMOUNT = 0x2000000
/// Mount is being torn down by the umount process.
const MNT_UMOUNT = 0x8000000; // Linux: MNT_UMOUNT = 0x8000000
// --- UmkaOS extension flags (bits 28+) ---
//
// These flags are UmkaOS-original extensions NOT present in Linux's
// mnt_flags. They occupy high bit positions (28+) to avoid collision
// with future Linux MNT_* additions. Both are intentional design
// improvements over Linux.
/// **UmkaOS extension — not present in Linux mnt_flags.**
/// Per-mount lazytime: buffer atime updates in memory and flush
/// lazily. Reduces write I/O for atime-heavy workloads (mail servers).
/// This is a genuine improvement over Linux's per-superblock
/// `SB_LAZYTIME`: different bind mounts of the same filesystem can
/// have different lazytime policies (e.g., `/var/mail` with lazytime,
/// `/var/log` without, on the same ext4 volume).
const MNT_LAZYTIME = 1 << 28; // UmkaOS extension (bit 28)
/// **UmkaOS extension — not present in Linux mnt_flags.**
/// Explicit detached-mount state flag. Set by `fsmount()` before
/// `move_mount()` attaches the mount to the namespace tree. Detached
/// mounts are invisible to path resolution and /proc/PID/mountinfo.
/// Linux tracks this implicitly through namespace tree membership;
/// UmkaOS makes it an explicit flag used in 10+ places in
/// fsmount/move_mount/open_tree flows for clarity and correctness.
const MNT_DETACHED = 1 << 29; // UmkaOS extension (bit 29)
}
}
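The comment on `MountFlags` notes that the syscall shims translate userspace `MS_*` bits into internal `MNT_*` bits. A minimal sketch of that translation follows, using plain `u64` constants instead of the `bitflags!` type so the block compiles on its own. The `MS_*` values are those of Linux's `include/uapi/linux/mount.h`; the function names and the lock check are hypothetical illustrations, not the actual shim.

```rust
// Userspace MS_* values (Linux include/uapi/linux/mount.h).
const MS_RDONLY: u64 = 1;
const MS_NOSUID: u64 = 2;
const MS_NODEV: u64 = 4;
const MS_NOEXEC: u64 = 8;
const MS_NOSYMFOLLOW: u64 = 256;
const MS_NOATIME: u64 = 1024;
const MS_NODIRATIME: u64 = 2048;
const MS_RELATIME: u64 = 1 << 21;

// Internal MNT_* values, copied from the MountFlags definition above.
const MNT_NOSUID: u64 = 0x01;
const MNT_NODEV: u64 = 0x02;
const MNT_NOEXEC: u64 = 0x04;
const MNT_NOATIME: u64 = 0x08;
const MNT_NODIRATIME: u64 = 0x10;
const MNT_RELATIME: u64 = 0x20;
const MNT_READONLY: u64 = 0x40;
const MNT_NOSYMFOLLOW: u64 = 0x80;
const MNT_LOCK_READONLY: u64 = 0x400000;

/// (userspace bit, internal bit) pairs; the numeric values differ.
const MS_TO_MNT: [(u64, u64); 8] = [
    (MS_RDONLY, MNT_READONLY),
    (MS_NOSUID, MNT_NOSUID),
    (MS_NODEV, MNT_NODEV),
    (MS_NOEXEC, MNT_NOEXEC),
    (MS_NOSYMFOLLOW, MNT_NOSYMFOLLOW),
    (MS_NOATIME, MNT_NOATIME),
    (MS_NODIRATIME, MNT_NODIRATIME),
    (MS_RELATIME, MNT_RELATIME),
];

/// Hypothetical shim helper: translate MS_* bits to MNT_* bits.
/// MS_* bits without a per-mount equivalent are ignored here.
fn ms_to_mnt(ms: u64) -> u64 {
    let mut mnt = 0;
    for &(ms_bit, mnt_bit) in MS_TO_MNT.iter() {
        if (ms & ms_bit) != 0 {
            mnt |= mnt_bit;
        }
    }
    mnt
}

/// Hypothetical check: an unprivileged remount may not clear READONLY
/// on a mount that carries MNT_LOCK_READONLY.
fn unprivileged_remount_allowed(current: u64, requested: u64) -> bool {
    let locked = (current & MNT_LOCK_READONLY) != 0;
    let clears_ro = (current & MNT_READONLY) != 0 && (requested & MNT_READONLY) == 0;
    !(locked && clears_ro)
}
```

A table-driven mapping keeps the two numbering schemes visibly separate, which matters because, for example, `MS_RDONLY` is bit 0 while `MNT_READONLY` is bit 6.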
14.6.2 Propagation Type¶
/// Mount propagation type. Controls whether mount/unmount events at this
/// mount point are propagated to other mount points, and in which direction.
///
/// Propagation is fundamental to container runtimes: Docker sets the rootfs
/// to MS_PRIVATE by default, Kubernetes uses MS_SHARED for volume mounts
/// that must be visible across pod containers.
///
/// See: Linux kernel Documentation/filesystems/sharedsubtree.rst
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum PropagationType {
/// Mount events propagate bidirectionally within the peer group.
/// All mounts in the same peer group see each other's mount/unmount
/// events. systemd makes the initial namespace root shared at boot;
/// the Linux kernel's own default propagation type is private.
Shared = 0,
/// Mount events are not propagated to or from this mount. This is
/// the default for new mount namespaces (container isolation).
Private = 1,
/// Mount events propagate unidirectionally from the master to this
/// mount, but not in the reverse direction. Used when a container
/// should see new mounts from the host but not expose its own mounts
/// to the host.
Slave = 2,
/// Like Private, but additionally prevents this mount from being
/// used as the source of a bind mount. Used for security-sensitive
/// mount points that should never be replicated.
Unbindable = 3,
}
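The four variants differ in three observable ways: whether events are sent to a peer group, whether events are received from a master, and whether the mount may be a bind source. The sketch below encodes those rules as predicates. The enum is redeclared so the block compiles on its own, and the method names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum PropagationType {
    Shared = 0,
    Private = 1,
    Slave = 2,
    Unbindable = 3,
}

impl PropagationType {
    /// Only shared mounts forward their mount/unmount events to peers.
    fn sends_to_peers(self) -> bool {
        self == PropagationType::Shared
    }

    /// Only slave mounts receive one-way events from a master.
    fn receives_from_master(self) -> bool {
        self == PropagationType::Slave
    }

    /// Unbindable mounts behave like private mounts, except that they
    /// are additionally refused as the source of a bind mount.
    fn can_be_bind_source(self) -> bool {
        self != PropagationType::Unbindable
    }
}
```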
14.6.3 Mount Node¶
/// A single mount instance in the mount tree.
///
/// Equivalent to Linux's `struct mount` (not `struct vfsmount` — the latter
/// is the subset exposed to filesystem drivers; `struct mount` is the full
/// internal structure). Each `Mount` represents one attachment of a
/// filesystem at a specific point in the directory tree.
///
/// **Lifetime**: `Mount` nodes are allocated via `Arc<Mount>`. References
/// are held by:
/// - The mount hash table (via RCU-protected hash chain)
/// - The parent mount's `children` list
/// - The peer group's `mnt_share` ring
/// - The master mount's `mnt_slave_list`
/// - Any open file descriptor whose path traversed this mount
/// (via `mnt_count` reference count)
/// - The `MountNamespace.mount_list`
///
/// A mount node is freed when all strong references are dropped, which
/// happens after: (a) removal from the hash table, (b) removal from the
/// parent's child list, (c) RCU grace period completion, and (d) all
/// path-resolution references (`mnt_count`) have been released.
pub struct Mount {
// --- Identity ---
/// Unique mount identifier within the owning namespace. Monotonically
/// increasing, 64-bit, never reused. This is the value returned by
/// `statx()` in `stx_mnt_id` (STATX_MNT_ID) and reported in
/// `/proc/PID/mountinfo` field 1.
pub mount_id: u64,
/// Device name string (e.g., "/dev/sda1", "tmpfs", "overlay").
/// Displayed in `/proc/PID/mountinfo` field 10 (mount source).
/// Heap-allocated, immutable after mount creation.
pub device_name: Box<[u8]>,
// --- Tree structure ---
/// Parent mount. `None` for the root of the mount namespace.
/// Uses `Weak` to prevent reference cycles in the mount tree:
/// parent -> children -> parent would create a cycle with `Arc`.
/// The parent is always alive while any child exists (the child
/// holds a position in the parent's hash chain), so the `Weak`
/// can always be upgraded during normal operation. It fails only
/// during the teardown of a doomed mount tree, which is expected.
pub parent: Option<Weak<Mount>>,
/// Cached parent mount ID. Set at mount time, updated on `move_mount`.
/// Avoids `Weak::upgrade()` during RCU-walk lookups (the upgrade may
/// fail during concurrent umount). Helper: `fn mount_id_of_parent(&self)
/// -> u64 { self.parent_mount_id }`.
pub parent_mount_id: u64,
/// The dentry in the parent mount's filesystem where this mount is
/// attached. For the root mount of a namespace (`parent == None`),
/// this is the mount's own root dentry.
///
/// Together with `parent`, this pair `(parent_mount, mountpoint_dentry)`
/// is the key in the mount hash table. Path resolution uses this to
/// detect mount crossings: when a dentry has `DCACHE_MOUNTED` set,
/// the VFS calls `lookup_mnt(current_mount, dentry)` to find the
/// child mount.
pub mountpoint: DentryRef,
/// Root dentry of the mounted filesystem. When path resolution
/// crosses into this mount, it continues from this dentry.
pub root: DentryRef,
/// The superblock of the mounted filesystem. Shared across all
/// mounts of the same filesystem instance (e.g., bind mounts share
/// the superblock). The superblock holds the filesystem-specific
/// state and the `FileSystemOps`/`InodeOps`/`FileOps` trait objects.
pub superblock: Arc<SuperBlock>,
/// Children of this mount — sub-mounts attached at dentries within
/// this mount's filesystem. Intrusive doubly-linked list for O(1)
/// insertion and removal. Protected by the namespace's `mount_lock`
/// for writes; RCU-protected for reads during path resolution.
pub children: IntrusiveList<Arc<Mount>>,
/// Link entry for this mount in its parent's `children` list.
/// Embedded in the `Mount` node to avoid per-child heap allocation.
pub child_link: IntrusiveListNode,
// --- Mount flags ---
/// Per-mount flags (nosuid, nodev, noexec, readonly, noatime, etc.).
/// Atomically readable for the path-resolution hot path (no lock
/// needed to check MNT_READONLY or MNT_NOSUID). Modified only under
/// `mount_lock` via atomic store with Release ordering.
pub flags: AtomicU64,
// --- Propagation ---
/// Propagation type for this mount (Shared, Private, Slave, Unbindable).
/// Determines how mount/unmount events are forwarded to related mounts.
/// Modified only under `mount_lock`.
pub propagation: PropagationType,
/// Peer group ID for shared mounts. All mounts in the same peer group
/// have the same `group_id`. Private and unbindable mounts have
/// `group_id == 0`. Slave mounts retain the `group_id` of their
/// former peer group (for /proc/PID/mountinfo optional fields).
///
/// Allocated from the namespace's `group_id_allocator`. Unique within
/// a namespace.
pub group_id: u64,
/// Circular linked list of peer mounts (shared propagation).
/// All mounts in a peer group are linked through `mnt_share`.
/// When a mount/unmount event occurs on any peer, it is propagated
/// to all other peers in the ring. For Private/Unbindable mounts,
/// this list contains only the mount itself (self-loop).
pub mnt_share: IntrusiveListNode,
/// Master mount for slave propagation. When this mount is a slave,
/// `mnt_master` points to the shared mount from which this mount
/// receives (but does not send) propagation events.
/// `None` for shared, private, and unbindable mounts.
pub mnt_master: Option<Weak<Mount>>,
/// List head for slave mounts of this mount. When this mount is
/// shared (or was shared), slave mounts derived from it are linked
/// through `mnt_slave_list`. Each slave's `mnt_slave` node is an
/// entry in this list.
pub mnt_slave_list: IntrusiveList<Arc<Mount>>,
/// Link entry for this mount in its master's `mnt_slave_list`.
pub mnt_slave: IntrusiveListNode,
// --- Namespace membership ---
/// The mount namespace that owns this mount. `Weak` because the
/// namespace may be destroyed (all processes exited) while detached
/// mounts or lazy-unmount remnants still exist.
pub ns: Weak<MountNamespace>,
/// Link entry in the namespace's `mount_list`. Used for ordered
/// iteration (e.g., /proc/PID/mountinfo output, umount ordering).
pub ns_list_link: IntrusiveListNode,
// --- Reference counting ---
/// Active reference count. Incremented when path resolution enters
/// this mount (ref-walk mode) or when an open file descriptor
/// references a path within this mount. `umount()` checks this
/// before removing the mount: if `mnt_count > 0`, the mount is
/// busy and umount returns `EBUSY` (unless `MNT_DETACH` is used).
///
/// Note: this is separate from the `Arc` reference count. `Arc`
/// tracks the lifetime of the `Mount` struct itself. `mnt_count`
/// tracks whether the mount is actively *in use* by path lookups
/// and open files. A mount can have `mnt_count == 0` (not busy)
/// while still having `Arc` strong count > 0 (struct not yet freed
/// because it's still in the hash table or child list).
pub mnt_count: AtomicU64,
// --- Mount hash chain ---
/// Link entry in the mount hash table bucket chain. RCU-protected:
/// readers traverse the chain under `rcu_read_lock()` without any
/// lock; writers modify the chain under `mount_lock` and publish
/// via RCU. Uses intrusive linking for zero-allocation hash insertion.
pub hash_link: IntrusiveListNode,
}
impl Mount {
/// Cached parent mount ID, avoids Weak::upgrade() during RCU-walk.
#[inline]
pub fn mount_id_of_parent(&self) -> u64 {
self.parent_mount_id
}
/// Inode ID of the mountpoint dentry (the dentry in the parent mount
/// where this mount is attached). Used as the secondary key in the
/// mount hash table: lookup is `(parent_mount_id, mountpoint_inode_id)`.
#[inline]
pub fn mountpoint_inode(&self) -> InodeId {
self.mountpoint.inode
}
}
/// Reference to a dentry. Wraps the parent inode ID, the name hash, and the
/// dentry's own inode ID, which together identify a dentry in the dentry
/// cache (Section 13.1.2). The VFS resolves this to a cached dentry entry
/// on access.
///
/// This avoids holding a direct pointer into the dentry cache (which is
/// RCU-managed and may be evicted), while still providing O(1) lookup via
/// the dentry hash table.
pub struct DentryRef {
/// Inode ID of the parent directory containing this dentry.
pub parent_inode: InodeId,
/// Name hash of this dentry. For filesystems with a custom
/// `DentryOps::d_hash()` (case-insensitive filesystems), `name_hash`
/// stores the result of `d_hash()`, not the default hash. For
/// filesystems without custom hashing, the default SipHash-1-3 of the
/// name component is used. Used for O(1) dentry cache lookup without
/// storing the full name.
pub name_hash: u64,
/// Inode ID of the dentry itself (for positive dentries).
pub inode: InodeId,
}
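The interaction between `mnt_count`, `MNT_DETACH`, and `MNT_DOOMED` described in the `Mount` field comments can be sketched as follows. `ToyMount`, `UmountError`, and `try_umount` are hypothetical; the sketch also ignores the window between the busy check and the doom marking, which a real implementation would have to close under `mount_lock`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const MNT_DOOMED: u64 = 0x1000000;

#[derive(Debug, PartialEq)]
enum UmountError {
    /// Corresponds to EBUSY: the mount still has active users.
    Busy,
}

/// Hypothetical reduction of `Mount` to the two fields involved.
struct ToyMount {
    flags: AtomicU64,
    mnt_count: AtomicU64,
}

impl ToyMount {
    fn new() -> Self {
        Self {
            flags: AtomicU64::new(0),
            mnt_count: AtomicU64::new(0),
        }
    }

    fn is_doomed(&self) -> bool {
        (self.flags.load(Ordering::Acquire) & MNT_DOOMED) != 0
    }

    /// A plain umount fails while the mount is in use; a detaching
    /// (lazy) umount proceeds regardless and lets references drain later.
    fn try_umount(&self, detach: bool) -> Result<(), UmountError> {
        if !detach && self.mnt_count.load(Ordering::Acquire) > 0 {
            return Err(UmountError::Busy);
        }
        // Once set, the flag is never cleared; lookups skip doomed mounts.
        self.flags.fetch_or(MNT_DOOMED, Ordering::Release);
        Ok(())
    }
}
```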
14.6.4 Mount Hash Table¶
/// Per-namespace mount hash table. Maps `(parent_mount_id, mountpoint_dentry)`
/// pairs to child `Mount` nodes. This is the data structure consulted on
/// every mount-point crossing during path resolution.
///
/// **Why per-namespace**: Linux uses a single global `mount_hashtable` with
/// ~2048 buckets, protected by a per-bucket spinlock for writes and RCU for
/// reads. In container-heavy environments (thousands of namespaces, each with
/// 30-100 mounts), this creates false sharing on hash buckets and limits
/// scalability of concurrent mount operations across namespaces. UmkaOS's
/// per-namespace hash table eliminates cross-namespace contention entirely.
///
/// **Sizing**: The hash table is sized to the number of mounts in the
/// namespace, with a minimum of 32 buckets and a maximum of 1024. The table
/// is resized (doubled) when the load factor exceeds 2.0, and shrunk
/// (halved) when the load factor drops below 0.25. Resizing allocates a
/// new bucket array, rehashes under `mount_lock`, and publishes via RCU.
///
/// **Hash function**: SipHash-1-3 of `(parent_mount_id, mountpoint_inode_id)`.
/// The SipHash key is per-namespace, generated from a CSPRNG at namespace
/// creation. This prevents hash-flooding attacks where an adversary crafts
/// mount points that collide in the hash table.
pub struct MountHashTable {
/// RCU-protected bucket array. Wrapped in `Arc<BucketArray>` because
/// `RcuCell` requires an atomically-swappable thin pointer — `Box<[T]>`
/// is a fat pointer (data + length) that cannot be atomically swapped
/// on any current architecture. `Arc<BucketArray>` is a single thin
/// pointer that `RcuCell` can swap atomically.
/// Readers traverse under `rcu_read_lock()`; writers modify under
/// the namespace's `mount_lock`.
buckets: RcuCell<Arc<BucketArray>>,
/// Number of entries in the hash table. Used for load-factor
/// computation during resize decisions. Modified only under `mount_lock`.
/// **Bounded**: u32 supports ~4 billion mounts per namespace. Linux's
/// default `sysctl fs.mount-max` is 100,000; even extreme container
/// workloads rarely exceed 1 million. u32 is sufficient.
/// At mount_max=100K, u32 provides ~42,949x headroom. This is a hash
/// table entry count, not an identifier — the 50-year u64 policy does
/// not apply.
count: u32,
/// SipHash key for this hash table. Per-namespace, generated at
/// namespace creation from the kernel CSPRNG.
hash_key: [u64; 2],
}
/// Thin-pointer wrapper for the dynamically-sized bucket array.
/// `Box<[MountHashBucket]>` is a fat pointer (data + length) that cannot be
/// atomically swapped by `RcuCell`. This wrapper provides a thin `Arc` pointer.
// Kernel-internal, not KABI.
struct BucketArray {
buckets: Box<[MountHashBucket]>,
}
/// A single bucket in the mount hash table. Contains the head pointer
/// of an RCU-protected chain of Mount nodes.
struct MountHashBucket {
/// Head of the intrusive linked list of Mount nodes hashing to this
/// bucket. Null if the bucket is empty. Readers follow this chain
/// under RCU; writers modify under `mount_lock`.
///
/// **Lifecycle**: Hash chain insertion calls `Arc::into_raw()` to obtain
/// the raw pointer (incrementing the strong count); hash chain removal
/// under `mount_lock` uses RCU to defer `Arc::from_raw()` (which
/// decrements the count) until after the grace period. This ensures
/// RCU readers never access freed memory.
head: AtomicPtr<Mount>,
}
impl MountHashTable {
/// Look up a child mount at the given `(parent, dentry)` pair.
///
/// Called during path resolution when a dentry has the `DCACHE_MOUNTED`
/// flag set. Must be called under `rcu_read_lock()`.
///
/// Returns `Some(&Mount)` if a mount is found at this point, or
/// `None` if the dentry is not a mount point (stale `DCACHE_MOUNTED`
/// flag — possible after lazy unmount).
///
/// **Performance**: O(1) expected, O(n) worst-case where n is the
/// chain length. The average chain length stays at or below 2.0 until
/// the table reaches its 1024-bucket cap. No locks, no atomics
/// beyond the `Acquire` loads of the bucket head and chain pointers.
pub fn lookup<'a>(
&'a self,
parent_mount_id: u64,
mountpoint_inode: InodeId,
rcu: &'a RcuReadGuard,
) -> Option<&'a Mount> {
let hash = siphash_1_3(
self.hash_key,
parent_mount_id,
mountpoint_inode,
);
// Readers must obtain the bucket array pointer and compute
// bucket_count from the same RCU-protected snapshot to avoid
// OOB access during a concurrent resize.
let array = self.buckets.read(rcu);
// `BucketArray` wraps the slice, so index through its `buckets` field.
let bucket_idx = hash as usize % array.buckets.len();
let bucket = &array.buckets[bucket_idx];
let mut current = bucket.head.load(Ordering::Acquire);
while !current.is_null() {
// SAFETY: `current` is a valid Mount pointer within an RCU
// read-side critical section. The Mount node is not freed
// until after the RCU grace period.
let mnt = unsafe { &*current };
if mnt.mount_id_of_parent() == parent_mount_id
&& mnt.mountpoint_inode() == mountpoint_inode
&& !mnt.is_doomed()
{
return Some(mnt);
}
current = mnt.hash_link.next.load(Ordering::Acquire);
}
None
}
/// Transition from RCU-protected `&Mount` to a long-lived reference.
///
/// **Ref-walk mode**: After `lookup()` returns `Some(&Mount)`, the
/// caller must increment `mnt_count` before dropping the `RcuReadGuard`:
/// ```
/// let rcu = rcu_read_lock();
/// if let Some(mnt) = mount_hash.lookup(parent_id, ino, &rcu) {
/// mnt.mnt_count.fetch_add(1, Acquire);
/// drop(rcu);
/// // `mnt` is now safe to use without RCU protection.
/// // Caller must call mnt.mnt_count.fetch_sub(1, Release)
/// // when the reference is no longer needed.
/// }
/// ```
///
/// **RCU-walk mode**: The caller stays within the RCU critical section
/// for the entire path resolution and never increments `mnt_count`.
/// If RCU-walk fails (e.g., dentry seqlock mismatch), the path
/// resolution restarts in ref-walk mode.
///
/// The `Acquire` on `fetch_add` pairs with the `Release` on
/// `fetch_sub` to ensure visibility of all mount state modifications
/// made before the reference was taken.
pub fn get_counted_ref(mnt: &Mount) {
mnt.mnt_count.fetch_add(1, Ordering::Acquire);
}
}
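The sizing rules stated in the `MountHashTable` comment (minimum 32 buckets, maximum 1024, double above a load factor of 2.0, halve below 0.25) reduce to a small decision function. The sketch below assumes power-of-two bucket counts and uses integer comparisons instead of floating point; `next_bucket_count` is a hypothetical name.

```rust
const MIN_BUCKETS: usize = 32;
const MAX_BUCKETS: usize = 1024;

/// Hypothetical resize decision: returns the bucket count the table
/// should have after an insert or removal leaves `entries` mounts.
fn next_bucket_count(buckets: usize, entries: usize) -> usize {
    if entries > 2 * buckets && buckets < MAX_BUCKETS {
        // Load factor above 2.0: double, up to the cap.
        buckets * 2
    } else if 4 * entries < buckets && buckets > MIN_BUCKETS {
        // Load factor below 0.25: halve, down to the floor.
        buckets / 2
    } else {
        buckets
    }
}
```

Once the table is at the cap, the load factor is no longer bounded: a namespace at the 100,000-mount limit would average roughly 98 entries per bucket across 1024 buckets.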
14.6.5 Mount Namespace¶
/// A mount namespace. Contains an independent mount tree with its own root
/// mount, hash table, and mount list. Created by `clone(CLONE_NEWNS)` or
/// `unshare(CLONE_NEWNS)`.
///
/// The `vfs_root: Capability<VfsNode>` field in `NamespaceSet` (Section 17.1.2)
/// is updated to point to this namespace's root mount:
///
/// ```rust
/// // Updated NamespaceSet field (replaces the previous Capability<VfsNode>):
/// pub mount_ns: Arc<MountNamespace>,
/// ```
///
/// **Relationship to NamespaceSet**: Each task's `NamespaceSet` holds
/// an `Arc<MountNamespace>`. Multiple tasks in the same mount namespace
/// share the same `Arc<MountNamespace>`. When `clone(CLONE_NEWNS)` is called,
/// a new `MountNamespace` is created by cloning the parent's mount tree
/// (via `copy_tree()`).
pub struct MountNamespace {
/// Unique namespace identifier. Used for `/proc/PID/ns/mnt` inode
/// number and `setns()` namespace comparison.
pub ns_id: u64,
/// Root mount of this namespace's mount tree. This is the mount
/// that corresponds to "/" for all processes in this namespace.
/// Updated atomically by `pivot_root()`.
pub root: RcuCell<Arc<Mount>>,
/// Ordered list of all mounts in this namespace. The ordering is
/// topological: parent mounts appear before their children. This
/// ordering is used by:
/// - `/proc/PID/mountinfo`: output follows this order
/// - `umount -a`: unmounts in reverse order (leaves before parents)
/// - Namespace teardown: unmounts in reverse topological order
pub mount_list: IntrusiveList<Arc<Mount>>,
/// Number of mounts in this namespace. Used to enforce the
/// per-namespace mount count limit (default: 100,000 — matching
/// Linux's `sysctl fs.mount-max`). Prevents mount-storm DoS attacks
/// where a compromised container creates millions of mounts.
/// Current-state count bounded by mount_max (~100K). u64 used for
/// consistency with other AtomicU64 counters in the namespace; u32
/// would suffice. On ILP32 architectures (ARMv7, PPC32), AtomicU64
/// requires a CAS loop (no native 64-bit atomics), adding ~5-10ns
/// per increment. Acceptable: mount/unmount is a warm path.
pub mount_count: AtomicU64,
/// Event counter. Incremented on every mount/unmount/remount
/// operation. Used by `poll()` on `/proc/PID/mountinfo` to detect
/// mount tree changes. Container runtimes and systemd use this
/// to react to mount events without periodic scanning.
pub event_seq: AtomicU64,
/// Per-namespace mount hash table. Maps `(parent_mount, dentry)` to
/// child mount for path resolution mount-point crossings.
pub hash_table: MountHashTable,
/// Mutex serializing mount tree modifications (mount, unmount,
/// remount, pivot_root, bind mount, move mount). Readers (path
/// resolution) do not acquire this lock — they use RCU.
/// Lock hierarchy level 20 (MOUNT_LOCK): above DENTRY_LOCK (19),
/// below EVM_LOCK (22). See [Section 3.5](03-concurrency.md#locking-strategy--lock-hierarchy-summary).
pub mount_lock: Mutex<()>,
/// Mount ID allocator. Monotonically increasing 64-bit counter.
/// IDs are never reused within a namespace. At 1 mount/second
/// sustained, a 64-bit counter would not wrap for ~584 billion years.
pub id_allocator: AtomicU64,
/// Peer group ID allocator. Like mount IDs, monotonically increasing
/// and never reused. Separate from mount IDs because group IDs are
/// shared across mounts and have a different lifecycle.
pub group_id_allocator: AtomicU64,
/// User namespace that owns this mount namespace. Determines
/// capability checks for mount operations. A process must have
/// `CAP_MOUNT` in this user namespace (or an ancestor) to modify
/// the mount tree.
pub user_ns: Arc<UserNamespace>,
}
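The interaction between `mount_count`, the mount limit, and `event_seq` can be sketched as follows. `NsCounters`, `try_reserve`, and `notify` are hypothetical names, and the string error stands in for an errno. In the specified design these updates happen under `mount_lock`, so the compare-and-update loop here is only one possible way to express the check.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Default limit quoted in the text.
pub const MOUNT_MAX: u64 = 100_000;

/// Hypothetical subset of `MountNamespace` holding only the two counters.
pub struct NsCounters {
    pub mount_count: AtomicU64,
    pub event_seq: AtomicU64,
}

impl NsCounters {
    /// Reserve one mount slot. Fails without changing the count when the
    /// namespace already holds `max` mounts. Returns the new count.
    pub fn try_reserve(&self, max: u64) -> Result<u64, &'static str> {
        self.mount_count
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
                if n < max { Some(n + 1) } else { None }
            })
            .map(|old| old + 1)
            .map_err(|_| "ENOSPC")
    }

    /// Bump the event counter after a successful tree change and return
    /// the new sequence number.
    pub fn notify(&self) -> u64 {
        self.event_seq.fetch_add(1, Ordering::Release) + 1
    }
}
```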
14.6.6 DCACHE_MOUNTED Integration¶
The dentry cache (Section 14.1) must track which dentries are mount points.
When a filesystem is mounted at a dentry, the VFS sets the DCACHE_MOUNTED
flag on that dentry. During path resolution (Section 14.1), when the VFS
encounters a dentry with DCACHE_MOUNTED set, it calls
mnt_ns.hash_table.lookup() (where mnt_ns is the current task's mount namespace)
to find the child mount and continues resolution from the child mount's root dentry.
/// Dentry cache entry flags. Stored in the dentry's `flags: AtomicU32` field.
/// Extended to include DCACHE_MOUNTED for mount-point detection.
bitflags! {
#[repr(transparent)]
pub struct DcacheFlags: u32 {
/// This dentry is a mount point — a filesystem is mounted on it.
/// Set by `do_mount()` when attaching a mount. Cleared by
/// `do_umount()` when the last mount at this dentry is removed.
///
/// Path resolution checks this flag on every path component.
/// When set, `mnt_ns.hash_table.lookup(current_mount.mount_id, dentry)`
/// is called to find the child mount. This check is a single atomic
/// load (~1 cycle) — the flag exists specifically to avoid a hash
/// table lookup on every path component (only mount points need
/// the lookup).
const DCACHE_MOUNTED = 1 << 0;
/// Dentry has been disconnected from the tree (e.g., NFS stale
/// handle, deleted directory that is still open).
const DCACHE_DISCONNECTED = 1 << 1;
/// Dentry is a negative dentry (caches a failed lookup).
const DCACHE_NEGATIVE = 1 << 2;
/// Dentry has filesystem-specific operations (d_revalidate, etc.).
const DCACHE_OP_MASK = 1 << 3;
}
}
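The fast-path check described for `DCACHE_MOUNTED` can be illustrated with a small model. The `Dentry` struct, the `MountHash` alias, and `cross_mount` are assumptions of this sketch; the real lookup takes an RCU guard and returns a `Mount` reference rather than an id.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};

pub const DCACHE_MOUNTED: u32 = 1 << 0;

/// Hypothetical dentry: an id plus the atomic flags word.
pub struct Dentry {
    pub id: u64,
    pub flags: AtomicU32,
}

/// Hypothetical mount hash: (parent mount id, dentry id) -> child mount id.
pub type MountHash = HashMap<(u64, u64), u64>;

/// Returns the mount in which resolution continues after stepping onto
/// `dentry`. The hash lookup runs only when DCACHE_MOUNTED is set.
pub fn cross_mount(hash: &MountHash, cur_mount: u64, dentry: &Dentry) -> u64 {
    if (dentry.flags.load(Ordering::Acquire) & DCACHE_MOUNTED) == 0 {
        return cur_mount; // fast path: not a mount point
    }
    // The flag may be set on behalf of a different parent mount or
    // namespace, so a miss simply means "stay in the current mount".
    hash.get(&(cur_mount, dentry.id)).copied().unwrap_or(cur_mount)
}
```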
14.6.7 Filesystem Context (New Mount API)¶
The new mount API (Linux 5.2+, used increasingly by container runtimes and
systemd) separates mount operations into discrete steps: context creation,
configuration, superblock creation, and attachment. This provides better
error reporting (errors at each step, not a single mount(2) errno) and
supports atomic mount configuration changes.
/// Maximum mount options (key-value pairs) across `options` and
/// `binary_options` combined. `fsconfig()` returns `ENOSPC` when
/// `options.len() + binary_options.len() >= FS_CONTEXT_MAX_OPTIONS`.
pub const FS_CONTEXT_MAX_OPTIONS: usize = 256;
/// Maximum error log size in bytes (matches Linux `FC_LOG_SIZE`).
/// The `fc_log_write()` function checks
/// `log.len() + msg.len() <= FC_LOG_SIZE` before appending; excess
/// bytes are silently truncated.
pub const FC_LOG_SIZE: usize = 4096;
/// Filesystem context for the new mount API.
///
/// Created by `fsopen()`, configured by `fsconfig()`, and consumed by
/// `fsmount()`. The context holds all the state needed to create a new
/// superblock and mount, accumulated through multiple `fsconfig()` calls.
///
/// This is equivalent to Linux's `struct fs_context`.
///
/// **Lifetime**: The context is reference-counted via a file descriptor
/// returned by `fsopen()`. It is destroyed when the file descriptor is
/// closed. If `fsmount()` has not been called, the context is simply
/// freed (no mount created). If `fsmount()` was called, the context's
/// state has been consumed and the mount exists independently.
pub struct FsContext {
/// Filesystem type (e.g., "ext4", "tmpfs", "overlay"). Set at
/// `fsopen()` time and immutable thereafter.
pub fs_type: Arc<dyn FileSystemOps>,
/// Filesystem type name (for diagnostics and /proc/mounts).
pub fs_type_name: Box<[u8]>,
/// Source device or path (equivalent to mount(2) `source` parameter).
/// Set via `fsconfig(FSCONFIG_SET_STRING, "source", ...)`.
pub source: Option<Box<[u8]>>,
/// Accumulated mount options as key-value pairs. Each `fsconfig()`
/// call adds or modifies an entry. The filesystem driver validates
/// options at `fsconfig(FSCONFIG_CMD_CREATE)` time.
/// Bounded by FS_CONTEXT_MAX_OPTIONS (256 total across `options` and
/// `binary_options`). `fsconfig()` returns `ENOSPC` when the combined
/// count reaches the limit. Cold-path allocation (mount/remount only).
pub options: Vec<(Box<[u8]>, Box<[u8]>)>,
/// Binary data options (for filesystems that accept binary mount data).
/// Set via `fsconfig(FSCONFIG_SET_BINARY, ...)`.
/// Shares the `FS_CONTEXT_MAX_OPTIONS` limit with `options`.
pub binary_options: Vec<(Box<[u8]>, Box<[u8]>)>,
/// Mount flags to apply to the created mount.
pub mount_flags: MountFlags,
/// The created superblock. Set by `fsconfig(FSCONFIG_CMD_CREATE)`,
/// consumed by `fsmount()`.
pub superblock: Option<Arc<SuperBlock>>,
/// Error log. Filesystem drivers write diagnostic messages here
/// during context creation and configuration. Readable by userspace
/// via `read()` on the fscontext file descriptor.
/// Bounded to FC_LOG_SIZE (4096) bytes. Truncated silently when full.
/// Cold-path allocation (mount error reporting only).
pub log: Vec<u8>,
/// Purpose of this context: new mount, reconfiguration, or submount.
pub purpose: FsContextPurpose,
/// Lifecycle state of this context. Transitions: New → Configuring →
/// Created (by `FSCONFIG_CMD_CREATE`) → Consumed (by `fsmount()`), or
/// Failed if creation fails. Further `fsconfig()` calls on a Consumed
/// or Failed context return `EBUSY`.
pub state: FsContextState,
/// User namespace for permission checks. Set at `fsopen()` time
/// to the caller's user namespace.
pub user_ns: Arc<UserNamespace>,
}
/// Purpose of a filesystem context, controlling which operations are valid.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum FsContextPurpose {
/// Creating a new mount (from `fsopen()`).
NewMount = 0,
/// Reconfiguring an existing mount (from `fspick()`).
Reconfig = 1,
/// Internal: creating a submount (e.g., automount).
Submount = 2,
}
/// Lifecycle state of an `FsContext`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum FsContextState {
/// Freshly created by `fsopen()` or `fspick()`. Accepting `fsconfig()` calls.
New = 0,
/// Options have been set via `fsconfig()`, but `FSCONFIG_CMD_CREATE`/
/// `FSCONFIG_CMD_RECONFIGURE` has not yet been called.
Configuring = 1,
/// `FSCONFIG_CMD_CREATE` succeeded; superblock is ready. Awaiting `fsmount()`.
Created = 2,
/// `fsmount()` has consumed the superblock. The fsopen fd is still open
/// for error log retrieval but cannot create another mount.
Consumed = 3,
/// An error occurred during creation. The error log is readable.
/// Further `fsconfig()` calls return `EBUSY`.
Failed = 4,
}
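The state transitions documented on `FsContextState` can be sketched as a small state machine. The `Ctx` wrapper, its method names, and the `Errno` enum are hypothetical. Returning `EBUSY` for option changes on a `Created` context is an assumption, since the text specifies that result only for `Consumed` and `Failed`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State { New, Configuring, Created, Consumed, Failed }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Errno { EBUSY, EINVAL, EIO }

/// Hypothetical wrapper that tracks only the lifecycle state.
pub struct Ctx { pub state: State }

impl Ctx {
    /// Models fsconfig(FSCONFIG_SET_*): allowed while options accumulate.
    pub fn set_option(&mut self) -> Result<(), Errno> {
        match self.state {
            State::New | State::Configuring => { self.state = State::Configuring; Ok(()) }
            _ => Err(Errno::EBUSY), // assumption for Created; stated for Consumed/Failed
        }
    }

    /// Models fsconfig(FSCONFIG_CMD_CREATE). `driver` is the result of the
    /// filesystem driver's mount call.
    pub fn cmd_create(&mut self, driver: Result<(), Errno>) -> Result<(), Errno> {
        match self.state {
            State::New | State::Configuring => {
                self.state = if driver.is_ok() { State::Created } else { State::Failed };
                driver
            }
            _ => Err(Errno::EBUSY),
        }
    }

    /// Models fsmount(): consumes the superblock exactly once.
    pub fn fsmount(&mut self) -> Result<(), Errno> {
        match self.state {
            State::Created => { self.state = State::Consumed; Ok(()) }
            State::Consumed => Err(Errno::EBUSY),
            _ => Err(Errno::EINVAL),
        }
    }
}
```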
14.6.7.1 FsContext Lifecycle and Error Channel¶
The new mount API separates mount configuration into discrete, verifiable steps. Each step either advances the context state or returns a structured error. The full lifecycle:
Step 1: fd = fsopen("ext4", FSOPEN_CLOEXEC)
→ Validates "ext4" against the filesystem type registry.
→ Allocates FsContext { fs_type: ext4_ops, purpose: NewMount, state: New, ... }.
→ Returns an O_RDWR file descriptor backed by the FsContext.
→ FsContext state: New.
Step 2: fsconfig(fd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0)
fsconfig(fd, FSCONFIG_SET_STRING, "errors", "remount-ro", 0)
fsconfig(fd, FSCONFIG_SET_FLAG, "noatime", NULL, 0)
→ Each call appends to FsContext.options: [("source", "/dev/sda1"), ("errors", "remount-ro"), ...].
→ Returns 0 on success; EINVAL if the key is not recognized by the filesystem type.
→ First `fsconfig()` call transitions state: New → Configuring.
→ FsContext state: Configuring (still accumulating options).
Step 3: fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
→ Calls FileSystemOps::mount(source, flags, options) on the configured filesystem type.
→ On success: FsContext.superblock = Some(sb); state → Created.
→ On failure: diagnostic message is written to FsContext.log; state → Failed.
Caller can read the error via read(fd, buf, len) — see Error Channel below.
→ Returns 0 on success; -errno on failure.
Step 4: mnt_fd = fsmount(fd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOATIME)
→ Consumes FsContext.superblock (state must be Created; returns EBUSY if
Consumed, EINVAL if New, Configuring, or Failed).
→ Allocates a MountNode with MNT_DETACHED flag set.
→ Returns an O_PATH fd referencing the detached mount.
→ FsContext state: Consumed (further fsconfig/fsmount calls return EBUSY).
Step 5: move_mount(mnt_fd, "", AT_FDCWD, "/mnt/data", MOVE_MOUNT_F_EMPTY_PATH)
→ Attaches the detached mount to the namespace mount tree at /mnt/data.
→ Clears MNT_DETACHED from the MountNode.
→ Triggers mount propagation to peer/slave mounts (Section 14.6.10).
open_tree(2) — clone or open a mount:
fd = open_tree(dirfd, path, OPEN_TREE_CLONE | AT_RECURSIVE)
→ Resolves path to a mount.
→ OPEN_TREE_CLONE: creates a detached copy of the mount tree rooted at path,
identical to a recursive bind mount but without modifying the namespace.
AT_RECURSIVE: the clone includes all submounts below path.
→ The returned O_PATH fd can be passed to move_mount() to attach elsewhere.
→ Without OPEN_TREE_CLONE: returns an O_PATH fd referencing the existing mount
without cloning (useful for passing a mount reference across namespaces).
mount_setattr(2) — bulk-modify mount tree flags:
mount_setattr(dirfd, path, AT_RECURSIVE, &mount_attr { attr_set, attr_clr }, sizeof)
→ Resolves path to a mount.
→ AT_RECURSIVE: applies to all mounts in the subtree rooted at path.
→ attr_clr: clears these flags from each mount (applied first).
→ attr_set: sets these flags on each mount (applied after attr_clr).
→ The operation is atomic within the subtree: if validation fails for any mount
(e.g., clearing MNT_READONLY on a superblock-level read-only filesystem), no
flags are changed on any mount.
→ Requires CAP_MOUNT.
FsContext Error Channel:
When fsconfig(FSCONFIG_CMD_CREATE) or fsmount() encounters a filesystem-level error
(e.g., superblock checksum mismatch, missing required option, device I/O error), the error
is not conveyed solely via errno. The filesystem driver writes a human-readable diagnostic
string to FsContext.log. The caller retrieves it via read(fd, buf, len) on the FsContext
file descriptor:
read(fs_context_fd, buf, len):
if FsContext.log is empty: return 0 (EOF — no error message pending)
n = min(len, FsContext.log.len())
copy_to_user(buf, FsContext.log[..n])
FsContext.log.drain(..n)
return n
Example error message (readable by system administrators):
This approach is superior to the traditional single-errno response: it gives system
administrators and container runtimes actionable diagnostic information without requiring
a separate diagnostics ioctl or /proc file.
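The bounded log and the draining `read()` described above can be modeled in a few lines. This is a sketch over a plain `Vec<u8>`: `fc_log_read` is a hypothetical name for the read handler, and a caller-supplied slice stands in for `copy_to_user`.

```rust
pub const FC_LOG_SIZE: usize = 4096;

/// Append `msg` to the log, silently dropping whatever does not fit
/// within FC_LOG_SIZE.
pub fn fc_log_write(log: &mut Vec<u8>, msg: &[u8]) {
    let room = FC_LOG_SIZE.saturating_sub(log.len());
    let take = msg.len().min(room);
    log.extend_from_slice(&msg[..take]);
}

/// Model of read() on the context fd: copy up to `buf.len()` bytes and
/// consume them from the log. Returns 0 when no message is pending.
pub fn fc_log_read(log: &mut Vec<u8>, buf: &mut [u8]) -> usize {
    let n = buf.len().min(log.len());
    buf[..n].copy_from_slice(&log[..n]);
    log.drain(..n);
    n
}
```

Because each read consumes what it returns, a caller with a small buffer retrieves a long message by reading repeatedly until the call returns 0.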
14.6.8 Mount Attribute Structure (mount_setattr)¶
/// User-visible mount attribute structure for `mount_setattr(2)`.
/// Matches Linux's `struct mount_attr` exactly for ABI compatibility.
///
/// `mount_setattr()` atomically modifies mount properties on a single
/// mount or recursively on an entire mount tree (when `AT_RECURSIVE`
/// is passed). Container runtimes use this for recursive read-only
/// mounts (`MOUNT_ATTR_RDONLY` + `AT_RECURSIVE`).
#[repr(C)]
pub struct MountAttr {
/// Flags to set on the mount(s). Bits correspond to `MOUNT_ATTR_*`
/// constants. Applied after `attr_clr` (clear first, then set).
pub attr_set: u64,
/// Flags to clear from the mount(s). Applied before `attr_set`.
pub attr_clr: u64,
/// Propagation type to set. One of `MS_SHARED`, `MS_PRIVATE`,
/// `MS_SLAVE`, `MS_UNBINDABLE`, or 0 (no change). Only one
/// propagation flag may be set; combining them returns `EINVAL`.
/// The mount_setattr handler validates `attr.propagation` is a valid
/// PropagationType variant (0-3); returns EINVAL on invalid values.
pub propagation: u64,
/// File descriptor of the user namespace to associate with the
/// mount (for ID-mapped mounts). Set to 0 or omit if not
/// changing the mount's user namespace mapping.
pub userns_fd: u64,
}
// Layout: 4 × u64 = 32 bytes.
const_assert!(size_of::<MountAttr>() == 32);
/// MOUNT_ATTR_* flag constants for mount_setattr(2).
/// These map to MountFlags but use a separate constant space matching
/// Linux's UAPI.
pub const MOUNT_ATTR_RDONLY: u64 = 0x00000001;
pub const MOUNT_ATTR_NOSUID: u64 = 0x00000002;
pub const MOUNT_ATTR_NODEV: u64 = 0x00000004;
pub const MOUNT_ATTR_NOEXEC: u64 = 0x00000008;
pub const MOUNT_ATTR_NOATIME: u64 = 0x00000010;
pub const MOUNT_ATTR_STRICTATIME: u64 = 0x00000020;
pub const MOUNT_ATTR_NODIRATIME: u64 = 0x00000080;
pub const MOUNT_ATTR_NOSYMFOLLOW: u64 = 0x00200000;
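The clear-then-set order and the all-or-nothing behavior of a recursive `mount_setattr()` can be sketched with a two-pass update. `MountRec` and `setattr_tree` are hypothetical, and the error value is simply the index of the first mount that fails validation rather than a real errno. The only validation modeled is the read-only superblock case mentioned earlier.

```rust
pub const MOUNT_ATTR_RDONLY: u64 = 0x0000_0001;
pub const MOUNT_ATTR_NOSUID: u64 = 0x0000_0002;

/// Hypothetical per-mount record: current attribute bits plus whether the
/// underlying superblock is itself read-only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MountRec {
    pub attrs: u64,
    pub sb_readonly: bool,
}

/// Apply `attr_clr` and then `attr_set` to every mount, or to none.
/// On failure, returns the index of the first rejected mount.
pub fn setattr_tree(mounts: &mut [MountRec], attr_set: u64, attr_clr: u64) -> Result<(), usize> {
    // Pass 1: compute and validate the new flags without modifying anything.
    let mut staged = Vec::with_capacity(mounts.len());
    for (i, m) in mounts.iter().enumerate() {
        let new = (m.attrs & !attr_clr) | attr_set;
        if m.sb_readonly && (new & MOUNT_ATTR_RDONLY) == 0 {
            return Err(i);
        }
        staged.push(new);
    }
    // Pass 2: commit. Nothing in this pass can fail.
    for (m, new) in mounts.iter_mut().zip(staged) {
        m.attrs = new;
    }
    Ok(())
}
```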
14.6.9 Mount Operations — Algorithms¶
All mount tree modification algorithms require holding the namespace's
mount_lock (lock hierarchy level 20, Section 3.5). Path resolution (read
path) uses only RCU and never acquires mount_lock. The algorithms below
describe the kernel-internal implementation; the syscall entry points
(mount(2), umount2(2), and the new mount API) perform argument
validation and capability checks before calling these internal functions.
14.6.9.1 do_mount — Mount a Filesystem¶
do_mount(source, target_path, fs_type, flags, data) -> Result<()>
0a. Capability check: verify caller holds CAP_MOUNT ([Section 9.1](09-security.md#capability-based-foundation))
in the target mount namespace. Return EPERM if not held.
0b. LSM hook: `lsm_call_superblock_security(Mount, cred, sb, &SbOpContext { ... })`.
If the LSM denies the mount request, return EPERM. This hook fires
before any path resolution to allow early rejection of unauthorized
mount operations (e.g., SELinux `mount` permission check against the
caller's security context and the target path label).
0c. Cgroup device controller check. If the calling task's cgroup has a device
controller with `BPF_CGROUP_DEVICE` program attached: call
`cgroup_bpf_run(BPF_CGROUP_DEVICE, &DeviceAccessCtx { dev: source_dev,
access: BLK_OPEN_READ | BLK_OPEN_WRITE })`. If denied, return EPERM.
This check runs before filesystem lookup because the source device may
not be accessible to the container.
1. Resolve `target_path` to (mount, dentry) via path resolution (Section 14.1.3).
2. If `flags` contains MS_REMOUNT, delegate to do_remount() (Section 14.6.9.4).
3. If `flags` contains MS_BIND, delegate to do_bind_mount() (Section 14.6.9.5).
4. If `flags` contains MS_MOVE, delegate to do_move_mount() (Section 14.6.9.6).
5. If `flags` contains any of MS_SHARED|MS_PRIVATE|MS_SLAVE|MS_UNBINDABLE,
delegate to do_change_propagation() (Section 14.6.9.7).
6. Otherwise, this is a new filesystem mount:
a. Look up the filesystem type by name in the filesystem registry.
If not registered, return ENODEV.
b. Call `FileSystemOps::mount(source, flags, data)` on the filesystem
driver. This creates and returns a `SuperBlock`. On failure, return
the error from the driver.
b2. LSM hook: `lsm_call_superblock_security(superblock, source, flags, data)`.
If the LSM denies the mount (returns non-zero), drop the superblock
and return EPERM. This hook allows SELinux/AppArmor to enforce mount
restrictions based on the filesystem type, source device, and
mount options.
c. Check namespace mount count against `mount_max` limit. If exceeded,
drop the superblock and return ENOSPC.
d. Allocate a new `Mount` node:
- `mount_id` from `namespace.id_allocator.fetch_add(1)`
- `parent` = resolved mount from step 1
- `mountpoint` = resolved dentry from step 1
- `root` = superblock's root dentry
- `superblock` = the SuperBlock from step 6b
- `flags` = translate MS_* to MountFlags
- `propagation` = Private (default for new mounts)
- `group_id` = 0 (private mount has no peer group)
- `mnt_count` = 0
e. Acquire `mount_lock`.
f. Increment the mountpoint dentry's mount refcount and set the
`DCACHE_MOUNTED` flag:
```
mountpoint_dentry.d_mount_refcount.fetch_add(1, Relaxed);
mountpoint_dentry.d_flags.fetch_or(DCACHE_MOUNTED, Release);
```
The refcount increment uses Relaxed ordering because it is
protected by `mount_lock` (held since step 6e). The flag set
uses Release so RCU readers see it on path resolution.
`d_mount_refcount` tracks how many mounts reference this dentry
as their mountpoint (see [Section 14.1](#virtual-filesystem-layer)); it is
decremented by `do_umount()` and only when it reaches zero is
`DCACHE_MOUNTED` cleared. Updating the flag atomically instead of taking
`d_lock` avoids a lock ordering violation: `mount_lock` (level 20) is
already held, and acquiring `d_lock` (level 19) would violate the
lower-first ordering rule.
**Writer-writer safety**: both mount and umount hold the
namespace's `mount_lock` before modifying `DCACHE_MOUNTED`.
The atomic `fetch_or`/`fetch_and` are for reader visibility
under RCU, not for writer synchronization.
g. Insert the Mount into the mount hash table at
bucket(parent_mount_id, mountpoint_inode_id).
h. Add the Mount to the parent's `children` list.
i. Add the Mount to the namespace's `mount_list` (after its parent
in topological order).
j. Increment `namespace.mount_count`.
k. Propagate: if the parent mount is shared, call
`propagate_mount()` (Section 14.6.10.1) to replicate this mount
on all peers and slaves of the parent.
l. Increment `namespace.event_seq`.
m. Release `mount_lock`.
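Steps 2 through 6 of `do_mount` amount to a priority-ordered dispatch on the flag word. A minimal sketch follows, using the Linux UAPI values for the `MS_*` constants; the `MountOp` enum and `classify` function are hypothetical names introduced for illustration.

```rust
pub const MS_REMOUNT: u64 = 32;
pub const MS_BIND: u64 = 4096;
pub const MS_MOVE: u64 = 8192;
pub const MS_UNBINDABLE: u64 = 1 << 17;
pub const MS_PRIVATE: u64 = 1 << 18;
pub const MS_SLAVE: u64 = 1 << 19;
pub const MS_SHARED: u64 = 1 << 20;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MountOp { Remount, Bind, Move, ChangePropagation, NewMount }

/// Pick the internal operation for a mount(2) flag word, testing the
/// flags in the same order as the numbered steps.
pub fn classify(flags: u64) -> MountOp {
    const PROPAGATION: u64 = MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE;
    if (flags & MS_REMOUNT) != 0 {
        MountOp::Remount
    } else if (flags & MS_BIND) != 0 {
        MountOp::Bind
    } else if (flags & MS_MOVE) != 0 {
        MountOp::Move
    } else if (flags & PROPAGATION) != 0 {
        MountOp::ChangePropagation
    } else {
        MountOp::NewMount
    }
}
```

Because `MS_REMOUNT` is tested first, a request carrying both `MS_REMOUNT` and `MS_BIND` is routed to the remount path in this sketch.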
14.6.9.2 do_umount — Unmount a Filesystem¶
do_umount(target_mount, flags) -> Result<()>
Capability check: CAP_MOUNT in caller's mount namespace.
1. If `target_mount` is the namespace root and flags does not contain
MNT_DETACH, return EBUSY (cannot unmount root).
2. If `target_mount.flags` has MNT_LOCKED and the caller lacks
CAP_SYS_ADMIN in the mount's owning user namespace, return EPERM.
3. If `flags` does not contain MNT_DETACH (not lazy):
a. Check `target_mount.mnt_count`. If > 0, return EBUSY.
b. Check that `target_mount.children` is empty. If not, return EBUSY
(sub-mounts must be unmounted first, unless MNT_DETACH is used).
4. If `flags` contains MNT_FORCE:
a. Call `FileSystemOps::force_umount()` if the filesystem supports it.
This causes in-flight I/O to fail with EIO. NFS uses this for stale
server recovery.
5. Acquire `mount_lock`.
6. Set `MNT_DOOMED` on `target_mount.flags` (atomic OR).
This prevents new path lookups from entering the mount.
7. Remove `target_mount` from the mount hash table.
8. Remove `target_mount` from the parent's `children` list.
9. Decrement the mountpoint dentry's mount refcount and conditionally
clear `DCACHE_MOUNTED`:
```
let mountpoint_dentry = target_mount.mountpoint;
if mountpoint_dentry.d_mount_refcount.fetch_sub(1, AcqRel) == 1 {
mountpoint_dentry.d_flags.fetch_and(!DCACHE_MOUNTED, Release);
}
```
AcqRel on the decrement: Acquire ensures we see all prior increments
from other mount operations; Release ensures the flag clear is visible
to RCU readers only after the refcount reaches zero. Multiple mounts
can be stacked on the same dentry (cross-namespace or bind mounts);
`DCACHE_MOUNTED` is cleared only when the last one is removed.
10. Propagate: if the parent mount is shared, call `propagate_umount()`
(Section 14.6.10.2) to remove corresponding mounts from peers and slaves.
11. Remove from `namespace.mount_list`.
12. Decrement `namespace.mount_count`.
13. Increment `namespace.event_seq`.
14. Release `mount_lock`.
15. If `flags` contains MNT_DETACH (lazy unmount):
a. The mount is now disconnected from the tree but may still be
referenced by open file descriptors (mnt_count > 0). It will be
fully freed when the last reference is dropped.
b. Open files continue to work on the disconnected mount. New path
lookups cannot reach it.
16. If not lazy: call `FileSystemOps::unmount()` synchronously.
If lazy: schedule `FileSystemOps::unmount()` to run when `mnt_count`
drops to 0 (via a callback registered on the final `Arc::drop`).
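The mountpoint refcount protocol shared by `do_mount` step 6f and `do_umount` step 9 can be exercised in isolation. In this sketch the `Dentry` struct carries only the two atomic fields, and `attach`, `detach`, and `is_mountpoint` are hypothetical helper names. Writers are assumed to be serialized externally, as they are by `mount_lock`.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

pub const DCACHE_MOUNTED: u32 = 1 << 0;

/// Hypothetical dentry with only the fields the protocol touches.
pub struct Dentry {
    pub d_flags: AtomicU32,
    pub d_mount_refcount: AtomicU32,
}

/// Mount side: count the new mount, then publish the flag to readers.
pub fn attach(d: &Dentry) {
    d.d_mount_refcount.fetch_add(1, Ordering::Relaxed);
    d.d_flags.fetch_or(DCACHE_MOUNTED, Ordering::Release);
}

/// Unmount side: clear the flag only when the last mount goes away.
/// `fetch_sub` returns the previous value, so 1 means "now zero".
pub fn detach(d: &Dentry) {
    if d.d_mount_refcount.fetch_sub(1, Ordering::AcqRel) == 1 {
        d.d_flags.fetch_and(!DCACHE_MOUNTED, Ordering::Release);
    }
}

/// Reader side: the single atomic load used during path resolution.
pub fn is_mountpoint(d: &Dentry) -> bool {
    (d.d_flags.load(Ordering::Acquire) & DCACHE_MOUNTED) != 0
}
```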
14.6.9.3 do_umount_tree — Recursive Unmount¶
do_umount_tree(root_mount, flags) -> Result<()>
Used by MNT_DETACH on a mount with sub-mounts, and by namespace teardown.
1. Acquire `mount_lock`.
2. Collect all mounts in the subtree rooted at `root_mount` by traversing
`root_mount.children` recursively. Collect in reverse topological order
(leaves first, root last).
3. For each mount in the collected list:
a. Set MNT_DOOMED.
b. Remove from hash table.
c. Remove from parent's children list.
d. Decrement the mountpoint dentry's `d_mount_refcount` and
conditionally clear `DCACHE_MOUNTED` (same protocol as `do_umount`
step 9: `d_mount_refcount.fetch_sub(1, AcqRel)`; clear flag only
when refcount reaches 0).
e. Remove from namespace.mount_list.
f. Decrement namespace.mount_count.
4. Propagate umount for each removed mount.
5. Increment namespace.event_seq.
6. Release `mount_lock`.
7. For each collected mount: schedule filesystem unmount (immediate
if mnt_count == 0, deferred if lazy).
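Step 2 asks for a leaves-first ordering of the subtree. One way to obtain it without recursion is to record a parent-first walk and reverse it. The index-based `children` table below is an assumption of this sketch, standing in for each mount's `children` list.

```rust
/// `children[i]` lists the child mount indices of mount `i` (hypothetical
/// arena representation). Returns the subtree rooted at `root` with every
/// mount placed after all of its descendants.
pub fn collect_leaves_first(children: &[Vec<usize>], root: usize) -> Vec<usize> {
    let mut order = Vec::new();
    let mut stack = vec![root];
    // Iterative walk: each mount is recorded before any of its descendants.
    while let Some(m) = stack.pop() {
        order.push(m);
        stack.extend(children[m].iter().copied());
    }
    // Reversing a parent-first order yields a descendants-first order.
    order.reverse();
    order
}
```

Processing mounts in the returned order guarantees that a mount's `children` list is already empty by the time the mount itself is detached.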
14.6.9.4 do_remount — Change Mount Flags/Options¶
do_remount(target_mount, flags, data) -> Result<()>
Capability check: CAP_MOUNT in caller's mount namespace.
1. Translate new `flags` to `MountFlags`.
2. Extract per-superblock options from `data`.
3. **RW→RO transition: flush dirty pages before flag change.**
If the remount transitions from read-write to read-only
(`!(old_flags & MS_RDONLY) && (new_flags & MS_RDONLY)`):
a. Call `sync_filesystem(sb)` to flush all dirty pages and metadata.
This triggers `writeback_inodes_sb(sb, WB_SYNC_ALL)` which
writes all dirty pages for this superblock to stable storage.
b. If any dirty inode cannot be flushed (device error), retry up to
`REMOUNT_RO_FLUSH_RETRIES` (3) times with a 100ms delay between
retries. If all retries fail:
- If `flags & MS_FORCE`: proceed with remount-ro anyway. Dirty
pages for failed inodes are discarded (data loss accepted —
the admin explicitly requested force). Log FMA warning.
- If no MS_FORCE: return `Err(EBUSY)` — cannot remount read-only
while dirty pages exist that cannot be flushed. The caller
must either fix the device error or use `mount -o remount,ro,force`.
c. After successful flush, verify no new dirty pages appeared during
the flush (a writer may have dirtied pages concurrently). If
`sb.nr_dirty_inodes > 0`, retry from step 3a (bounded by the
same 3-retry limit — total retries across all sub-attempts).
d. Set `sb.s_writers.frozen` to `SbFreezeLevel::Write` to prevent
new writers from dirtying pages between the final flush and the
flag change in step 5. Wait for `sb.s_writers.writers[0].sum() == 0`
(all active writers drain). The freeze is released in step 7.
4. Acquire `mount_lock`.
5. Update `target_mount.flags` atomically with `Release` ordering.
Concurrent readers (path walk, statfs) load mount flags with `Acquire`
ordering to pair with this `Release`, ensuring the flag change is
visible before subsequent filesystem operations on this mount.
Note: a remount can change per-mount flags (readonly, nosuid, etc.)
independently of superblock options. For example, `mount -o remount,ro`
on a bind mount makes that mount point read-only without affecting
other mount points of the same filesystem.
6. If per-superblock options changed, call
`FileSystemOps::remount(sb, flags, data)`. On failure, restore the
old flags, release freeze if held, and return the error.
7. Release freeze: set `sb.s_writers.frozen` to `SbFreezeLevel::Unfrozen`
and wake blocked writers (step 3d).
8. Increment `namespace.event_seq`.
9. Release `mount_lock`.
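The bounded flush-retry policy of step 3b can be sketched as a loop around a caller-supplied flush attempt. This reading treats "retry up to 3 times" as one initial attempt followed by at most three retries, which is an assumption. `flush_for_remount_ro` and `FlushOutcome` are hypothetical names, and the 100 ms delay is only noted in a comment.

```rust
pub const REMOUNT_RO_FLUSH_RETRIES: u32 = 3;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FlushOutcome {
    /// All dirty data reached stable storage.
    Clean,
    /// Flushing kept failing and the caller forced the transition.
    ForcedWithLoss,
}

/// `try_flush` returns true when the superblock has no dirty inodes left.
pub fn flush_for_remount_ro<F>(mut try_flush: F, force: bool) -> Result<FlushOutcome, &'static str>
where
    F: FnMut() -> bool,
{
    // One initial attempt plus up to REMOUNT_RO_FLUSH_RETRIES retries.
    for _ in 0..=REMOUNT_RO_FLUSH_RETRIES {
        if try_flush() {
            return Ok(FlushOutcome::Clean);
        }
        // The kernel would wait 100 ms here before the next attempt.
    }
    if force {
        Ok(FlushOutcome::ForcedWithLoss)
    } else {
        Err("EBUSY")
    }
}
```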
14.6.9.5 do_bind_mount — Bind Mount (MS_BIND)¶
do_bind_mount(source_path, target_path, flags) -> Result<()>
Capability check: CAP_MOUNT + read access to source path.
1. Resolve `source_path` to (source_mount, source_dentry).
2. Resolve `target_path` to (target_mount, target_dentry).
3. If `source_mount.propagation == Unbindable`, return EINVAL.
4. Clone the source mount:
a. Allocate a new `Mount` node.
b. `superblock` = `source_mount.superblock` (shared — same filesystem
instance, same data pages).
c. `root` = `source_dentry` (bind mount's root is the source path,
not necessarily the source mount's root — this is how bind mounts
of subdirectories work).
d. `flags` = copy from source, then apply any new flags from `flags`.
e. `propagation` = Private (new bind mounts default to Private).
5. If `flags` contains MS_REC (recursive bind):
a. For each sub-mount under `source_mount` (descendants of
`source_dentry`), clone the mount and attach it at the
corresponding dentry under the new bind mount.
b. Skip unbindable mounts.
6. Acquire `mount_lock`.
7. Attach the cloned mount(s) at target_path (same steps as
do_mount steps 6f-6l; `mount_lock` is already held from step 6).
8. Release `mount_lock`.
14.6.9.6 do_move_mount — Move a Mount (MS_MOVE)¶
do_move_mount(source_mount, target_path) -> Result<()>
Capability check: CAP_MOUNT in caller's mount namespace.
1. Resolve `target_path` to (target_parent_mount, target_dentry).
2. Verify `target_dentry` is not a descendant of `source_mount`
(moving a mount underneath itself would create a cycle). Return
EINVAL if it is.
3. Verify `source_mount` is not the namespace root. Return EINVAL if it is.
4. Acquire `mount_lock`.
5. Remove `source_mount` from the old location:
a. Remove from hash table at old (parent, dentry) key.
b. Remove from old parent's children list.
c. Decrement old mountpoint dentry's `d_mount_refcount` and
conditionally clear `DCACHE_MOUNTED` (same protocol as
`do_umount` step 9).
6. Attach at new location:
a. Update `source_mount.parent` to `target_parent_mount`.
b. Update `source_mount.mountpoint` to `target_dentry`.
c. Insert into hash table at new (parent, dentry) key.
d. Add to new parent's children list.
e. Increment `target_dentry.d_mount_refcount` and set
`DCACHE_MOUNTED` (same protocol as `do_mount` step 6f).
7. Propagation: moving a mount does not trigger propagation
(matches Linux behavior).
8. Increment `namespace.event_seq`.
9. Release `mount_lock`.
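The cycle check in step 2 reduces to walking the parent chain from the target's mount toward the namespace root. A minimal sketch over a hypothetical index-based parent table:

```rust
/// `parent[i]` is the parent mount index of mount `i`; the namespace root
/// has `None`. Returns true if `m` is `ancestor` itself or lies below it.
pub fn is_descendant_or_self(parent: &[Option<usize>], mut m: usize, ancestor: usize) -> bool {
    loop {
        if m == ancestor {
            return true;
        }
        match parent[m] {
            Some(p) => m = p,
            None => return false, // reached the root without meeting `ancestor`
        }
    }
}
```

In this model a move is rejected when `is_descendant_or_self(parent, target_parent_mount, source_mount)` holds, since attaching there would place the mount underneath itself.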
14.6.9.7 do_change_propagation — Set Propagation Type¶
do_change_propagation(target_mount, type, flags) -> Result<()>
Capability check: CAP_MOUNT in caller's mount namespace.
1. Determine the target mount(s):
- If `flags` contains MS_REC: target mount and all descendants.
- Otherwise: target mount only.
2. Acquire `mount_lock`.
3. For each target mount:
a. If changing to Shared:
- Allocate a new `group_id` from `namespace.group_id_allocator`.
- Set `mount.group_id = new_id`.
- If the mount was previously a slave, it becomes shared+slave
(receives from master AND propagates to peers).
b. If changing to Private:
- Remove from peer group ring (`mnt_share`).
- Remove from master's slave list (if slave).
- Set `mount.group_id = 0`.
- Set `mount.mnt_master = None`.
c. If changing to Slave:
- If the mount is currently shared, it becomes a slave of its
former peer group. The first remaining peer becomes the master.
- Remove from peer group ring.
- Add to master's `mnt_slave_list`.
- Set `mount.mnt_master` to the former peer group leader.
- Mount retains its `group_id` (for mountinfo optional fields).
d. If changing to Unbindable:
- Same as Private, plus prevents bind mount of this mount.
e. Update `mount.propagation`.
4. Increment `namespace.event_seq`.
5. Release `mount_lock`.
14.6.10 Mount Propagation Algorithms¶
Mount propagation ensures that mount/unmount events on shared mount points are replicated across all related mount points. This is essential for container volume mounts: when a volume is mounted on a shared host path, all containers that have a slave relationship to that path see the new mount.
14.6.10.1 propagate_mount¶
propagate_mount(source_mount, new_child_mount) -> Result<()>
Called under mount_lock when a mount is added to a shared mount point.
Lock ordering: when the propagation walk must acquire per-mount locks
(e.g., for mnt_count, children list, or mountpoint hash updates),
locks are acquired in ascending mnt_id order. This prevents ABBA
deadlocks when two concurrent propagation walks traverse overlapping
peer groups. If a lock cannot be acquired in order (e.g., a lower
mnt_id mount is discovered after a higher one is already locked),
the higher lock is released and re-acquired after the lower one.
1. Walk the peer group ring of `source_mount` (via `mnt_share` links).
For each peer mount (excluding `source_mount` itself):
a. Clone `new_child_mount` with the peer as parent.
The clone's mountpoint is the dentry in the peer's filesystem
that corresponds to `new_child_mount.mountpoint` in the source.
b. Attach the clone at the peer (insert into hash table, set
DCACHE_MOUNTED, add to children list, add to mount_list).
c. If the clone's parent is shared, iteratively propagate to
that peer group using the tree walk algorithm (matching Linux's
`propagate_mnt()` iterative walker). Visited groups are tracked
via a marker flag on each mount to prevent infinite loops. The
iterative walker processes the propagation tree in a single loop
without stack recursion, preventing stack overflow regardless of
propagation chain depth.
2. Walk the slave list of `source_mount` (via `mnt_slave_list`).
For each slave mount:
a. Clone `new_child_mount` with the slave as parent.
b. Attach the clone at the slave.
c. If the slave is also shared (shared+slave), propagate to the
slave's peer group (step 1 applied to the slave's peers).
3. If the cloning in any propagation step fails (e.g., ENOMEM on
allocation failure, or ENOSPC when the mount count limit is reached),
roll back: remove all clones created in
this propagation pass and return the error. Propagation is
all-or-nothing within a single mount operation.
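The all-or-nothing rule in step 3 can be sketched by remembering how many mounts were attached before the pass began and truncating back to that point on the first failure. The id-based representation, `propagate_all_or_nothing`, and the `clone_to` callback are assumptions made for illustration.

```rust
/// `targets` are the peer and slave mounts to receive a clone, `attached`
/// is the list of mounts attached so far, and `clone_to` creates one clone
/// and returns its id. On any failure, every clone added by this call is
/// removed again and the error is returned.
pub fn propagate_all_or_nothing(
    targets: &[u64],
    attached: &mut Vec<u64>,
    mut clone_to: impl FnMut(u64) -> Result<u64, &'static str>,
) -> Result<(), &'static str> {
    let start = attached.len();
    for &t in targets {
        match clone_to(t) {
            Ok(id) => attached.push(id),
            Err(e) => {
                // Roll back only what this propagation pass created.
                attached.truncate(start);
                return Err(e);
            }
        }
    }
    Ok(())
}
```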
14.6.10.2 propagate_umount¶
propagate_umount(source_mount) -> Result<()>
Called under mount_lock when a mount is removed from a shared mount point.
1. Walk the peer group ring of `source_mount.parent` (the parent must
be shared for propagation to occur).
For each peer of the parent:
a. Look up a child mount at the corresponding mountpoint dentry
in the peer's mount hash table.
b. If found and the child's superblock matches `source_mount`'s
superblock (same filesystem), unmount it (do_umount steps 6-12).
c. If the child mount has its own children, recursively unmount
the subtree (do_umount_tree).
2. Walk the slave list of the parent.
For each slave:
a. Same as step 1a-1c, applied to the slave.
14.6.11 Namespace Operations¶
14.6.11.1 copy_tree — Clone Mount Tree for CLONE_NEWNS¶
copy_tree(source_root_mount, source_root_dentry) -> Result<Arc<MountNamespace>>
Called by clone(CLONE_NEWNS) and unshare(CLONE_NEWNS).
1. Allocate a new `MountNamespace` with fresh `ns_id`, empty hash table,
and a new `mount_lock`.
2. The new namespace inherits the parent's `user_ns`.
3. Clone the source root mount:
a. Allocate new `Mount` with the same superblock and root dentry.
b. Flags are copied. **Propagation is INHERITED from the source root mount**
(not forced to Private). If the source root is shared, the clone is added
to the same peer group. If the source root is private, the clone is private.
This matches Linux's `copy_tree()` behavior and is consistent with step 4f
logic. The statement "child's mounts are private unless marked shared"
means shared propagation is PRESERVED, not overridden.
c. Record old-to-new mount mapping: `mount_map[source_root] = cloned_root`.
`mount_map` type: `HashMap<*const Mount, Arc<Mount>>` — cold path
(runs once per `clone(CLONE_NEWNS)` or `unshare(CLONE_NEWNS)`).
Keys are raw pointers for identity comparison (Arc pointer values
are not integers, so XArray cannot be used per collection policy).
HashMap is acceptable on this cold path.
4. For each mount in the source namespace's mount_list (topological order):
a. Skip unbindable mounts.
b. Clone the mount into the new namespace.
c. Preserve the parent-child relationship (the cloned child's parent
is the clone of the original child's parent).
d. Insert into the new namespace's hash table and mount_list.
e. Record old-to-new mapping: `mount_map[source_mount] = cloned_mount`.
f. Set propagation:
- If the source mount is shared: the clone is added to the same
peer group (shared propagation preserved across CLONE_NEWNS).
This is critical for container runtimes that rely on propagation.
- If the source mount is private/slave/unbindable: the clone is
Private.
**Error handling**: If mount cloning fails at step 4b (e.g., ENOMEM from
Mount allocation), drop the partially-constructed MountNamespace. All
previously cloned mounts are freed via their Arc destructors. The task's
`fs.root` and `fs.pwd` are unchanged because step 6 was not reached.
Return the error to the caller (`ENOMEM`).
5. Set the new namespace's root to the clone of `source_root_mount`.
6. **Update calling task's fs.root and fs.pwd**: Using the `mount_map`, find the
cloned counterpart of the task's current root mount and pwd mount:
```
let mut fs = task.fs.write();
// `mount_map` is keyed by `*const Mount`, so look up by pointer identity.
if let Some(new_root) = mount_map.get(&Arc::as_ptr(&fs.root.mount)) {
    fs.root = PathRef { mount: Arc::clone(new_root), dentry: fs.root.dentry.clone() };
}
if let Some(new_pwd) = mount_map.get(&Arc::as_ptr(&fs.pwd.mount)) {
    fs.pwd = PathRef { mount: Arc::clone(new_pwd), dentry: fs.pwd.dentry.clone() };
}
```
Without this step, the task would still resolve paths against the old namespace's
mounts, defeating the purpose of CLONE_NEWNS.
7. Return the new namespace.
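The old-to-new mapping used in steps 3c, 4c, and 4e can be sketched as follows. This is a minimal illustration, not the real `copy_tree()`: the `Mount` struct here is a toy with only the fields needed, the `id + 1000` numbering is arbitrary, and skipping the children of an unbindable mount is an assumption the steps above do not state explicitly.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Toy mount node; the real `Mount` carries far more state.
pub struct Mount {
    pub id: u32,
    pub parent: Option<Arc<Mount>>,
    pub unbindable: bool,
}

/// Clone `sources` (assumed to be in topological order, parents first).
/// Returns the clones and the old-to-new identity map.
pub fn clone_mounts(
    sources: &[Arc<Mount>],
) -> (Vec<Arc<Mount>>, HashMap<*const Mount, Arc<Mount>>) {
    let mut map: HashMap<*const Mount, Arc<Mount>> = HashMap::new();
    let mut clones = Vec::new();
    for src in sources {
        if src.unbindable {
            continue; // step 4a: unbindable mounts are not cloned
        }
        // Step 4c: the clone's parent is the clone of the source's parent.
        let parent = match &src.parent {
            None => None,
            Some(p) => match map.get(&Arc::as_ptr(p)) {
                Some(cloned_parent) => Some(Arc::clone(cloned_parent)),
                // Assumption: if the parent was skipped, skip this mount too.
                None => continue,
            },
        };
        // Arbitrary toy id so clones are distinguishable from sources.
        let clone = Arc::new(Mount { id: src.id + 1000, parent, unbindable: false });
        // Steps 3c / 4e: key by raw pointer for identity comparison.
        map.insert(Arc::as_ptr(src), Arc::clone(&clone));
        clones.push(clone);
    }
    (clones, map)
}
```

Because parents precede children in the input, every parent lookup in the map either succeeds or indicates a skipped ancestor, so a single pass is enough.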
14.6.11.2 pivot_root Integration¶
The pivot_root(2) algorithm specified in Section 17.1 is updated to
use the Mount data structure:
pivot_root(new_root_path, put_old_path) -> Result<()>
Capability check: CAP_SYS_ADMIN in caller's user namespace.
The caller must be in a mount namespace (not the initial namespace).
1. Resolve `new_root_path` to (new_root_mount, new_root_dentry).
Verify `new_root_dentry` is the root of `new_root_mount` (i.e.,
new_root is a mount point, not just a directory).
2. Resolve `put_old_path` to (put_old_mount, put_old_dentry).
Verify `put_old` is at or under `new_root`.
3. Verify `new_root_mount` is not the current namespace root
(i.e., `new_root_mount != namespace.root`). If it is, pivot_root
is a no-op — return `EINVAL`.
4. Verify `put_old` is reachable from `new_root` by walking the
mount tree upward. This ensures `put_old` is a valid location
within the new root's subtree for mounting the old root.
5. Acquire `mount_lock`.
6. Let `old_root_mount` = namespace's current root mount.
7. Detach `new_root_mount` from its current position:
a. Remove from hash table.
b. Remove from parent's children.
c. Clear DCACHE_MOUNTED on its old mountpoint.
8. Reattach `old_root_mount` at `put_old`:
a. Set `old_root_mount.parent` = `new_root_mount`.
b. Set `old_root_mount.mountpoint` = the dentry corresponding to
`put_old` within `new_root_mount`'s filesystem.
c. Insert `old_root_mount` into hash table at new position.
d. Set DCACHE_MOUNTED on the put_old dentry.
9. Set `new_root_mount` as the namespace root:
a. `new_root_mount.parent` = None (it is now the root).
b. `new_root_mount.mountpoint` = `new_root_mount.root` (self-referential
for the root mount).
c. `namespace.root.update(new_root_mount, &mount_lock_guard)` (RCU
publish via RcuCell::update).
10. Update the CURRENT TASK's fs.root and fs.pwd if they reference the old root:
```
let mut fs = current_task().fs.write();
// Compare mounts by identity (same Arc allocation), not by value.
if Arc::ptr_eq(&fs.root.mount, &old_root_mount) {
    fs.root = PathRef { mount: Arc::clone(&new_root_mount), dentry: new_root_dentry.clone() };
}
if Arc::ptr_eq(&fs.pwd.mount, &old_root_mount) {
    fs.pwd = PathRef { mount: Arc::clone(&new_root_mount), dentry: new_root_dentry.clone() };
}
```
NOTE: UmkaOS does NOT iterate all tasks in the namespace. (Linux's
pivot_root(2) does: it also switches the root and working directory of every
other task in the mount namespace that pointed at the old root.) Other tasks
sharing the same `fs_struct` see the update via the shared reference. Tasks
with different `fs_struct` instances that reference the old root will find it
relocated to `put_old` on their next path resolution; they can then chdir to
the new root if desired.
11. Increment `namespace.event_seq`.
12. Release `mount_lock`.
Note: Steps 7-9 are the atomic state change. In-flight path lookups
that started before step 9 see the old root via RCU (the old
`RcuCell` value remains valid until the grace period). New lookups
after step 9 see the new root. This matches the atomicity guarantee
specified in Section 17.1.3.
14.6.11.3 Namespace Teardown¶
When a mount namespace is destroyed (all processes exited, all
/proc/PID/ns/mnt file descriptors closed, all bind mounts of the
namespace file unmounted):
destroy_mount_namespace(ns) -> ()
1. Acquire `mount_lock`.
2. Iterate `ns.mount_list` in reverse topological order (leaves first).
3. For each mount:
a. Set MNT_DOOMED.
b. Remove from hash table.
c. Remove from parent's children.
d. Remove from peer group and slave lists.
4. Release `mount_lock`.
5. For each removed mount (in reverse order):
a. If `mnt_count == 0`, call `FileSystemOps::unmount()`.
b. If `mnt_count > 0` (lazy unmount remnants still referenced by
open file descriptors), defer unmount to final reference drop.
6. Drop the hash table and mount list.
14.6.12 New Mount API Syscalls¶
UmkaOS implements the Linux 5.2+ mount API syscalls for compatibility with modern container runtimes (containerd, CRI-O) and systemd. These are thin wrappers around the internal mount operations described above.
| Syscall | Purpose | Capability |
|---|---|---|
| `fsopen(fs_type, flags)` | Create a filesystem context | CAP_MOUNT |
| `fspick(dirfd, path, flags)` | Create a reconfiguration context for an existing mount | CAP_MOUNT |
| `fsconfig(fd, cmd, key, value, aux)` | Configure a filesystem context | CAP_MOUNT |
| `fsmount(fs_fd, flags, mount_attr)` | Create a detached mount from a configured context | CAP_MOUNT |
| `move_mount(from_dirfd, from_path, to_dirfd, to_path, flags)` | Attach a detached mount or move an existing mount | CAP_MOUNT |
| `open_tree(dirfd, path, flags)` | Open or clone a mount point as a file descriptor | CAP_MOUNT (if OPEN_TREE_CLONE) |
| `mount_setattr(dirfd, path, flags, attr, size)` | Modify mount attributes, optionally recursively | CAP_MOUNT |
fsopen flow:
1. Validate fs_type against the filesystem registry.
2. Allocate FsContext with purpose = NewMount.
3. Return a file descriptor referencing the context.
fsconfig flow (selected commands):
- FSCONFIG_SET_STRING: set a key-value option string.
- FSCONFIG_SET_BINARY: set a binary option blob.
- FSCONFIG_SET_FD: set an option to a file descriptor (e.g., source device).
- FSCONFIG_CMD_CREATE: validate all options and create the superblock
by calling FileSystemOps::mount(). On success, the superblock is stored
in FsContext.superblock. On failure, diagnostic messages are written
to the context's error log.
- FSCONFIG_CMD_RECONFIGURE: for fspick contexts, apply new options
to the existing superblock via FileSystemOps::remount().
fsmount flow:
1. Consume the superblock from the FsContext. The FsContext is
marked consumed (state = FsContextState::Consumed); further
fsconfig() calls on this fd return EBUSY. The fsopen fd remains
open for error log retrieval but cannot be used to create another
mount. Closing the fsopen fd releases the FsContext; double-release
is prevented by the consumed state flag.
2. Allocate a Mount node with MNT_DETACHED flag set.
3. The mount is not yet attached to any namespace or visible to path
resolution. It exists only as a detached object referenced by the
returned file descriptor.
4. Return an O_PATH file descriptor referencing the detached mount.
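The consumed-state rule in step 1 of the fsmount flow amounts to a small state machine. The sketch below is illustrative only: `FsContextState::Consumed` comes from the text above, while the `Configuring` and `Created` states, the `FsError` type, the `u64` superblock handle, and the choice of `Invalid` for a premature `fsmount` are assumptions made for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FsContextState {
    Configuring, // assumed initial state after fsopen()
    Created,     // assumed state after FSCONFIG_CMD_CREATE succeeds
    Consumed,    // named in the spec: superblock handed to fsmount()
}

/// Hypothetical error type: `Busy` stands for EBUSY, `Invalid` for EINVAL.
#[derive(Debug, PartialEq, Eq)]
pub enum FsError {
    Busy,
    Invalid,
}

pub struct FsContext {
    pub state: FsContextState,
    superblock: Option<u64>, // toy superblock handle
}

impl FsContext {
    pub fn new() -> Self {
        FsContext { state: FsContextState::Configuring, superblock: None }
    }

    /// Models FSCONFIG_SET_*: rejected once the context is consumed.
    pub fn fsconfig_set(&mut self, _key: &str, _value: &str) -> Result<(), FsError> {
        match self.state {
            FsContextState::Consumed => Err(FsError::Busy),
            _ => Ok(()),
        }
    }

    /// Models FSCONFIG_CMD_CREATE: stores the superblock in the context.
    pub fn fsconfig_create(&mut self, sb: u64) -> Result<(), FsError> {
        if self.state != FsContextState::Configuring {
            return Err(FsError::Busy);
        }
        self.superblock = Some(sb);
        self.state = FsContextState::Created;
        Ok(())
    }

    /// Models fsmount(): takes the superblock exactly once.
    pub fn fsmount(&mut self) -> Result<u64, FsError> {
        let sb = self.superblock.take().ok_or(FsError::Invalid)?;
        self.state = FsContextState::Consumed;
        Ok(sb)
    }
}
```

Using `Option::take` makes the superblock unavailable after the first `fsmount`, so a second mount from the same context cannot be created even if the state check were bypassed.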
move_mount flow:
1. Resolve the source (detached mount fd or existing mount path).
2. Resolve the target path.
3. If the source is detached (MNT_DETACHED):
a. Clear MNT_DETACHED.
b. Attach to the namespace via do_mount steps 6e-6m.
4. If the source is an existing mount:
a. Delegate to do_move_mount() (Section 14.6).
open_tree flow:
1. Resolve the path to a mount.
2. If OPEN_TREE_CLONE:
a. Clone the mount (like do_bind_mount without attaching).
b. The clone is detached (MNT_DETACHED).
c. If OPEN_TREE_CLONE | AT_RECURSIVE: recursively clone the subtree.
3. Return an O_PATH file descriptor.
mount_setattr flow:
1. Resolve the path to a mount.
2. Validate attr_set and attr_clr do not conflict.
3. Acquire mount_lock.
4. If AT_RECURSIVE:
a. Collect all mounts in the subtree.
b. Validate the changes are valid for all mounts (e.g., clearing
MNT_READONLY on a mount whose superblock is read-only is invalid).
c. If validation fails for any mount, return error (no partial changes).
d. Apply attr_clr then attr_set to all mounts atomically.
5. If not recursive: apply to the single mount.
6. If attr.propagation != 0: change propagation type (Section 14.6).
7. Increment namespace.event_seq.
8. Release mount_lock.
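The validate-then-apply structure of the recursive case (steps 2 and 4b-4d) can be sketched as below. The flag bit values, `ToyMount`, and `SetAttrError` are hypothetical; locking, subtree collection, and propagation changes are omitted. The point is only that no mount is modified until every mount has passed validation.

```rust
// Hypothetical flag bit positions, chosen for the example only.
pub const MNT_READONLY: u32 = 1 << 0;
pub const MNT_NOSUID: u32 = 1 << 1;

pub struct ToyMount {
    pub flags: u32,
    /// True if the underlying superblock itself is read-only.
    pub sb_readonly: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SetAttrError {
    Conflict,
    ReadOnlySuperblock,
}

/// Apply `attr_clr` then `attr_set` to every mount, or change nothing.
pub fn mount_setattr(
    mounts: &mut [ToyMount],
    attr_set: u32,
    attr_clr: u32,
) -> Result<(), SetAttrError> {
    // Step 2: the same bit must not be both set and cleared.
    if attr_set & attr_clr != 0 {
        return Err(SetAttrError::Conflict);
    }
    // Steps 4b-4c: validate every mount before touching any of them.
    for m in mounts.iter() {
        if attr_clr & MNT_READONLY != 0 && m.sb_readonly {
            return Err(SetAttrError::ReadOnlySuperblock);
        }
    }
    // Step 4d: all checks passed, so apply clear-then-set everywhere.
    for m in mounts.iter_mut() {
        m.flags = (m.flags & !attr_clr) | attr_set;
    }
    Ok(())
}
```

Splitting the work into a read-only validation pass and a mutation pass is what makes partial application impossible without needing an undo log.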
14.6.13 Mount Introspection Syscalls¶
Linux 6.8 introduced statmount(2) and listmount(2) as structured
replacements for parsing /proc/PID/mountinfo. UmkaOS implements both for
container introspection tools and future-compatible userspace.
| Syscall | Purpose | Capability |
|---|---|---|
| `statmount(req, buf, bufsize, flags)` | Query detailed mount information by mount ID | None (own namespace) |
| `listmount(req, buf, bufsize, flags)` | List child mount IDs of a given mount | None (own namespace) |
statmount: Returns a struct statmount containing the mount's ID,
parent ID, mount flags, propagation type, peer group ID, master mount ID,
filesystem type, mount source, mount point path, and superblock options.
The request specifies which fields to populate via a bitmask, avoiding
unnecessary work (e.g., path resolution for mount point is skipped if
STATMOUNT_MNT_POINT is not requested).
listmount: Returns an array of 64-bit mount IDs for the child mounts
of a given mount. Supports cursor-based iteration: the caller passes the
last seen mount ID, and listmount returns mount IDs after that cursor.
This handles concurrent mount/unmount gracefully (mounts added after the
cursor are seen; mounts removed are skipped).
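Cursor-based iteration of this kind can be sketched with an ordered set. This is an assumption-laden illustration: it presumes child mount IDs are returned in ascending order, that a cursor of 0 means "start from the beginning", and that a `BTreeSet` stands in for whatever structure the kernel actually walks. `listmount_page` is a hypothetical helper name.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Return up to `max` child mount IDs strictly greater than `cursor`.
pub fn listmount_page(children: &BTreeSet<u64>, cursor: u64, max: usize) -> Vec<u64> {
    children
        .range((Bound::Excluded(cursor), Bound::Unbounded))
        .take(max)
        .copied()
        .collect()
}
```

Because each call re-reads the set starting after the last ID the caller saw, a mount removed between calls is simply absent from the next page, and a mount added with a larger ID shows up in a later page.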
14.6.14 /proc/PID/mountinfo Format¶
Each process exposes its mount namespace's mount tree through
/proc/PID/mountinfo and /proc/PID/mounts. These files are read by
systemd, Docker, findmnt, df, mountpoint, and other tools.
mountinfo line format (one line per mount, matching Linux exactly):
<mount_id> <parent_id> <major>:<minor> <root> <mount_point> <mount_options> <optional_fields> - <fs_type> <mount_source> <super_options>
| Field | Source | Example |
|---|---|---|
| mount_id | `Mount.mount_id` | 36 |
| parent_id | `Mount.parent.mount_id` (self for root) | 35 |
| major:minor | `SuperBlock.dev` major:minor | 98:0 |
| root | Path of mount root within the filesystem | / or /subdir |
| mount_point | Path of mount point relative to process root | /mnt/data |
| mount_options | Per-mount flags as comma-separated options | rw,noatime,nosuid |
| optional fields | Propagation: shared:N, master:N, propagate_from:N | shared:1 master:2 |
| separator | Literal hyphen | - |
| fs_type | Filesystem type name | ext4 |
| mount_source | `Mount.device_name` | /dev/sda1 |
| super_options | From `FileSystemOps::show_options()` | rw,errors=continue |
Implementation: The VFS iterates the namespace's mount_list under
rcu_read_lock() and formats each line. The mount_list's topological
ordering ensures that parent mounts appear before children (matching
Linux's output order).
/proc/PID/mounts: A simplified view matching the old /etc/mtab
format: <device> <mount_point> <fs_type> <options> 0 0. Generated
from the same mount_list, omitting mount IDs and propagation fields.
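As a rough illustration of the mountinfo line layout, the sketch below formats one line from already-resolved field values. The `MountInfo` struct and `format_mountinfo_line` are hypothetical names; escaping of whitespace in paths and the lookup of each field from `Mount` and `SuperBlock` are deliberately left out.

```rust
/// Pre-resolved values for one mountinfo line (hypothetical helper type).
pub struct MountInfo<'a> {
    pub mount_id: u64,
    pub parent_id: u64,
    pub major: u32,
    pub minor: u32,
    pub root: &'a str,
    pub mount_point: &'a str,
    pub mount_options: &'a str,
    /// Zero or more propagation tags such as "shared:1" or "master:2".
    pub optional_fields: &'a [&'a str],
    pub fs_type: &'a str,
    pub mount_source: &'a str,
    pub super_options: &'a str,
}

/// Format one line; the literal "-" separates optional fields from the rest.
pub fn format_mountinfo_line(m: &MountInfo) -> String {
    let mut line = format!(
        "{} {} {}:{} {} {} {}",
        m.mount_id, m.parent_id, m.major, m.minor, m.root, m.mount_point, m.mount_options
    );
    for field in m.optional_fields {
        line.push(' ');
        line.push_str(field);
    }
    line.push_str(&format!(" - {} {} {}", m.fs_type, m.mount_source, m.super_options));
    line
}
```

The separator matters for parsers: the number of optional fields varies per mount, so consumers locate `fs_type` by scanning for the standalone hyphen rather than by counting columns.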
14.6.15 Path Resolution Integration¶
This section details how the mount tree integrates with the path resolution algorithm described in Section 14.1.
Mount crossing in RCU-walk (fast path):
resolve_component_rcu(current_mount, current_dentry, name):
1. Look up `name` in the dentry cache: dentry = dcache_lookup(current_dentry, name).
2. If dentry is not found: fall through to ref-walk (cache miss).
3. If dentry.flags has DCACHE_MOUNTED:
a. Call mnt_ns.hash_table.lookup(current_mount.mount_id, dentry.inode, &rcu_guard).
b. If a child mount is found:
- current_mount = child_mount
- current_dentry = child_mount.root
- If child_mount.root also has DCACHE_MOUNTED, repeat step 3
(stacked mounts — rare but legal).
c. If no child mount found: DCACHE_MOUNTED is stale (race with
umount). Clear the flag lazily and continue with the dentry.
4. Return (current_mount, dentry).
Mount crossing in ref-walk (slow path):
resolve_component_ref(current_mount, current_dentry, name):
1. Same as RCU-walk step 1, but takes a dentry reference count.
2. Same DCACHE_MOUNTED check.
3. If mount crossing:
a. Call mnt_ns.hash_table.lookup() under rcu_read_lock().
b. If found: increment child_mount.mnt_count (atomic add).
c. Decrement current_mount.mnt_count.
d. current_mount = child_mount; current_dentry = child_mount.root.
4. Return (current_mount, dentry).
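The stacked-mount loop in step 3 of the RCU-walk can be sketched on a toy hash table. This is a simplification under stated assumptions: plain `u64` IDs replace mounts and dentries, a `HashMap` keyed by (parent mount ID, mountpoint dentry ID) replaces the RCU-protected mount hash table, and reference counting, the `DCACHE_MOUNTED` gate, and stale-flag handling are omitted. `ToyChild` and `cross_mounts` are hypothetical names.

```rust
use std::collections::HashMap;

/// Toy child mount record: its own ID and the ID of its root dentry.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ToyChild {
    pub mount_id: u64,
    pub root_dentry: u64,
}

/// Follow mounts stacked on (mount_id, dentry) until no child is found.
/// Returns the final (mount ID, dentry ID) pair.
pub fn cross_mounts(
    table: &HashMap<(u64, u64), ToyChild>,
    mut mount_id: u64,
    mut dentry: u64,
) -> (u64, u64) {
    while let Some(child) = table.get(&(mount_id, dentry)) {
        mount_id = child.mount_id;
        dentry = child.root_dentry;
    }
    (mount_id, dentry)
}
```

A lookup miss on the first iteration leaves the position unchanged, which corresponds to continuing with the original dentry.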
".." traversal across mount boundaries:
resolve_dotdot(current_mount, current_dentry):
1. Chroot boundary check: if current_dentry == task.fs.root.dentry
AND current_mount == task.fs.root.mount, return (current_mount,
current_dentry). The process is at its chroot root, and ".." must
not escape the jail.
2. If current_dentry == current_mount.root:
- We are at the root of this mount. ".." should cross into the parent
mount.
- If current_mount.parent is None: we are at the namespace root.
".." resolves to the root itself (cannot go above /).
- Otherwise: current_dentry = current_mount.mountpoint (the dentry in
the parent's filesystem that this mount covers), then
current_mount = current_mount.parent. Repeat from step 1 so that
".." is applied to the mountpoint within the parent mount.
3. If current_dentry != current_mount.root:
- Normal ".." within the mount's filesystem.
- current_dentry = current_dentry.parent.
4. Return (current_mount, current_dentry).
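The ".." rules can be exercised on a toy arena. The sketch below is illustrative and makes several assumptions: mounts and dentries are indices into slices, a filesystem root dentry is its own parent, and after crossing into the parent mount the walk loops back to the chroot check before taking the dentry's parent. `Pos`, `ToyMount`, and `ToyDentry` are hypothetical types.

```rust
/// A (mount, dentry) position, both as indices into toy arenas.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Pos {
    pub mount: usize,
    pub dentry: usize,
}

pub struct ToyMount {
    pub root: usize,           // root dentry of this mount
    pub parent: Option<usize>, // parent mount, None for the namespace root
    pub mountpoint: usize,     // dentry in the parent that this mount covers
}

pub struct ToyDentry {
    pub parent: usize, // a filesystem root is its own parent
}

/// Resolve ".." from `cur`, never escaping `fs_root` (the chroot boundary).
pub fn resolve_dotdot(
    mounts: &[ToyMount],
    dentries: &[ToyDentry],
    fs_root: Pos,
    mut cur: Pos,
) -> Pos {
    loop {
        if cur == fs_root {
            return cur; // at the chroot root: stay put
        }
        let m = &mounts[cur.mount];
        if cur.dentry != m.root {
            // Ordinary ".." inside one filesystem.
            cur.dentry = dentries[cur.dentry].parent;
            return cur;
        }
        match m.parent {
            None => return cur, // namespace root: cannot go above "/"
            // Take the child's mountpoint first, then switch to the parent.
            Some(p) => cur = Pos { mount: p, dentry: m.mountpoint },
        }
    }
}
```

Reading `m.mountpoint` before switching mounts is the important detail: the mountpoint dentry belongs to the child mount's record but lives in the parent's filesystem.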
14.6.16 Performance Characteristics¶
| Operation | Cost | Notes |
|---|---|---|
| Mount hash lookup (RCU read) | ~5-15 ns | SipHash + 1-2 pointer chases, no locks, no atomics. Occurs on every mount-point crossing during path resolution. |
| DCACHE_MOUNTED check | ~1 ns | Single atomic load of dentry flags. Occurs on every path component — the gate that avoids hash lookup on non-mount-point dentries. |
| Mount (new filesystem) | ~1-10 us | Dominated by filesystem driver's mount() (superblock creation). Mount tree insertion is ~200 ns under lock. |
| Unmount | ~500 ns - 5 us | Hash removal + propagation. Filesystem unmount() cost varies (ext4 journal flush vs. tmpfs instant). |
| Bind mount | ~300 ns | Mount node clone + hash insertion. No filesystem I/O. |
| Bind mount (recursive, N sub-mounts) | ~300*N ns | Linear in subtree size. |
| Propagation (mount, M peers) | ~300*M ns | One clone per peer. Propagation to slaves adds per-slave overhead. |
| /proc/PID/mountinfo generation | ~50 ns/mount | One line per mount. 100-mount namespace: ~5 us total. |
| copy_tree (CLONE_NEWNS, N mounts) | ~500*N ns | Clone all mounts. 100-mount namespace: ~50 us. |
| pivot_root | ~1 us | Two hash table mutations + RCU publish. |
Memory overhead per mount: ~320 bytes for the Mount struct (including
all intrusive list nodes and propagation fields) plus ~16 bytes for the hash
table entry, ~336 bytes in total. A container with 100 mounts consumes
~33 KiB of mount tree metadata. A system with 10,000 containers (1 million
mounts total) consumes ~320 MiB, proportional to the actual number of
mounts, not pre-allocated.
14.6.17 Cross-References¶
- Section 3.5 (Lock Hierarchy): `MOUNT_LOCK` at level 20, between `DENTRY_LOCK` (19) and `EVM_LOCK` (22).
- Section 9.1 (Capabilities): `CAP_MOUNT` (bit 70) gates all mount operations. `CAP_SYS_ADMIN` (bit 21) required for `pivot_root` and `MNT_LOCKED` override.
- Section 14.1 (VFS Architecture): `FileSystemOps::mount()` creates the superblock consumed by `do_mount()`. `FileSystemOps::unmount()` is called by `do_umount()` after tree removal.
- Section 14.1 (Dentry Cache): `DCACHE_MOUNTED` flag triggers mount hash table lookup during path resolution.
- Section 14.1 (Path Resolution): RCU-walk and ref-walk mount crossing detailed in Section 14.6.
- Section 14.1 (Mount Namespace and Capability-Gated Mounting): The capability table and propagation type summary specified there are implemented by the data structures in this section.
- Section 14.8 (overlayfs): `OverlayFs::mount()` creates an `OverlaySuperBlock` consumed via the standard `do_mount()` path.
- Section 17.1 (Namespace Implementation): `NamespaceSet.mount_ns: Arc<MountNamespace>` provides access to the full mount tree rather than just a capability handle to the root VFS node. The `NamespaceSet` is per-task (`Task.nsproxy`), not per-process.
- Section 17.1 (pivot_root): The step-by-step algorithm there is superseded by the precise `Mount`-struct-based algorithm in Section 14.6.
- Section 17.1 (Namespace Inheritance): `CLONE_NEWNS` triggers `copy_tree()` (Section 14.6).
14.7 Distribution-Aware VFS Extensions¶
When filesystems are shared across cluster nodes (Section 15.14), the VFS must handle cache validity, locking granularity, and metadata coherence across node boundaries. Linux's VFS was designed for local filesystems with network filesystem support bolted on afterward, resulting in several systemic performance problems. UmkaOS's VFS addresses these by integrating with the Distributed Lock Manager (Section 15.15).
| Linux Problem | Impact | UmkaOS Fix |
|---|---|---|
| Dentry cache assumes local validity | Remote rename/unlink leaves stale dentries on other nodes | Callback-based invalidation: DLM lock downgrade (Section 15.15) triggers targeted dentry invalidation for affected directory entries only |
| `d_revalidate()` on every lookup for network FS | Extra round-trip per path component on NFS/CIFS/GFS2 | Lease-attached dentries: dentry is valid while parent directory DLM lock is held (Section 15.15); zero revalidation cost during lease period |
| Inode-level locking forces false sharing | Two nodes writing to different byte ranges of the same file serialize on the inode lock | Range locks in VFS: DLM byte-range lock resources (Section 15.15) allow concurrent operations on different ranges of the same file |
| No concurrent directory operations | `mkdir` and `create` in the same directory serialize globally | Per-bucket directory locks: hash-based directory formats (ext4 htree, GFS2 leaf blocks) use separate DLM resources per hash bucket |
| `readdir()` + `stat()` = 2N round-trips for N files | `ls -l` on a 1000-file remote directory requires 2001 operations | `getdents_plus()` returning attributes with directory entries (analogous to NFS READDIRPLUS but in-kernel, avoiding the userspace/kernel boundary per entry). `getdents_plus()` is an UmkaOS VFS-internal operation (not a new syscall): the VFS's readdir implementation populates both the directory entry and its `InodeAttr` in a single filesystem callback, caching the attributes for immediate use by a subsequent `getattr()` / `stat()` call. Userspace accesses this via the standard `getdents64(2)` + `statx(2)` syscalls; the optimization is transparent and eliminates redundant disk or DLM round-trips inside the kernel. |
| Full inode cache invalidation on lock drop | Dropping a DLM lock on an inode discards all cached metadata, even fields that haven't changed | Per-field inode validity: mtime/size read from DLM Lock Value Block (Section 15.15); permissions and ownership from local capability cache; only stale fields refreshed on lock reacquire |
Integration with Section 15.15 DLM:
- Dentry lease binding: When the VFS caches a dentry for a clustered filesystem, it records the DLM lock resource that protects the parent directory. The dentry remains valid as long as that lock is held at CR (Concurrent Read) mode or stronger. When the DLM downgrades or releases the lock (due to contention from another node), the VFS receives a callback and invalidates only the affected dentries — not the entire dentry subtree.
/// Per-dentry lease tracking for distributed VFS.
/// Stored in `Dentry::d_fsdata` for clustered filesystems.
pub struct DentryLeaseInfo {
/// DLM lock resource ID protecting the parent directory of this dentry.
pub dlm_resource: DlmResourceId,
/// Lease sequence counter. Incremented by the DLM callback when the
/// lease is invalidated (lock downgrade or release). VFS path walk
/// compares the dentry's cached `lease_seq` against the current
/// directory DLM lock's `lease_seq`: if they differ, the dentry is
/// treated as stale and re-validated.
///
/// Type: u64 (50-year rule: at 10M invalidations/sec, wraps in ~58K years).
pub lease_seq: u64,
/// The DLM lock mode at which this dentry was validated.
pub validated_at_mode: DlmLockMode,
}
- Range-aware writeback: When a process holds a DLM byte-range lock and writes to pages within that range, the VFS tracks dirty pages per lock range (not per inode). On lock downgrade, only dirty pages within the lock's range are flushed (Section 15.15). This eliminates the Linux problem where dropping a lock on a 100 GB file requires flushing all dirty pages, even if only 4 KB was modified.
- Attribute caching via LVB: The VFS reads frequently-accessed inode attributes (`i_size`, `i_mtime`, `i_blocks`) from the DLM Lock Value Block (Section 15.15) rather than performing a disk read on every lock acquire. The LVB is updated by the last writer on lock release, so readers always get current values at the cost of a single RDMA operation (~3-4 μs) instead of a disk I/O (~10-15 μs for NVMe).
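The `lease_seq` comparison described for `DentryLeaseInfo` can be sketched in a few lines. This is a simplified model, not the real data layout: `DirLease` and `CachedDentry` are hypothetical types holding only the sequence counter, and the memory orderings are one reasonable choice rather than something the text above specifies.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Toy stand-in for the parent directory's DLM lock state.
pub struct DirLease {
    pub lease_seq: AtomicU64,
}

/// Toy stand-in for a cached dentry's lease record.
pub struct CachedDentry {
    pub lease_seq: u64,
}

impl DirLease {
    pub fn new() -> Self {
        DirLease { lease_seq: AtomicU64::new(0) }
    }

    /// Record the current sequence when a dentry is validated.
    pub fn validate(&self) -> CachedDentry {
        CachedDentry { lease_seq: self.lease_seq.load(Ordering::Acquire) }
    }

    /// Called from the (hypothetical) DLM callback on downgrade or release.
    pub fn invalidate(&self) {
        self.lease_seq.fetch_add(1, Ordering::Release);
    }
}

/// A dentry is usable without revalidation only if the sequence matches.
pub fn dentry_is_fresh(d: &CachedDentry, dir: &DirLease) -> bool {
    d.lease_seq == dir.lease_seq.load(Ordering::Acquire)
}
```

Bumping one counter invalidates every dentry validated under the old lease without touching the dentries themselves; each one is detected as stale lazily on its next lookup.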
14.7.1.1 Lease Invalidation and In-Flight I/O Synchronization¶
When the DLM downgrades a dentry or byte-range lock (due to contention from another node), in-flight I/O operations that depend on the lease must be coordinated to prevent data corruption. The synchronization protocol:
1. DLM blocking callback received: The VFS receives a `dlm_ast_blocking()` callback indicating that another node requests the lock at a conflicting mode.
2. In-flight I/O barrier: The VFS increments the per-inode `invalidation_seq: AtomicU64` counter (Acquire ordering). All new VFS operations targeting this inode check `invalidation_seq` before proceeding; if it has changed since the operation began, the operation must re-validate its cached dentry/inode state after re-acquiring the lock.
3. Drain in-flight operations: The VFS waits for all in-flight operations that hold a reference to the current lock grant to complete. This uses a per-lock-resource `inflight_count: AtomicU32` reference counter:
   - Each VFS operation that depends on a DLM lock increments `inflight_count` (Acquire) at operation start and decrements it (Release) at completion.
   - The invalidation path waits on a per-lock-resource WaitQueue with a 30-second timeout (matching GFS2's `gfs2_glock_wait_for_demote` timeout). The WaitQueue is signaled by each VFS operation upon completion (after decrementing `inflight_count`). If the timeout expires, the lock is downgraded forcibly (the remote node's request takes priority to avoid cluster-wide deadlocks).
4. Flush dirty data: For byte-range locks, dirty pages within the lock's range are flushed to disk (Section 15.15) before the lock is downgraded.
5. Invalidate caches: Dentry cache entries protected by the lock are invalidated. Page cache pages within the byte range are invalidated (discarded if clean, flushed then discarded if dirty).
6. Downgrade/release the lock: The DLM lock is downgraded to the requested mode (or released entirely). The `dlm_ast_completion()` callback notifies the requesting node that the lock is available.
Ordering guarantee: Steps 2-5 are atomic with respect to the lock: no new operation
can acquire the lock between the barrier (step 2) and the downgrade (step 6) because
the lock's grant state is set to LOCK_INVALIDATING during this window.
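The drain-with-timeout step can be sketched in userspace terms. Note the substitution: the text above describes an `AtomicU32` counter plus a kernel WaitQueue, whereas this sketch uses a `Mutex<u32>` with a `Condvar` from the standard library, which gives the same observable behaviour for illustration. `LockResource`, `op_begin`, `op_end`, and `drain` are hypothetical names.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Toy per-lock-resource state: in-flight operation count plus a wait queue.
pub struct LockResource {
    inflight: Mutex<u32>,
    drained: Condvar,
}

impl LockResource {
    pub fn new() -> Self {
        LockResource { inflight: Mutex::new(0), drained: Condvar::new() }
    }

    /// A VFS operation that depends on the lock starts.
    pub fn op_begin(&self) {
        *self.inflight.lock().unwrap() += 1;
    }

    /// The operation completes; wake the invalidation path if it was the last.
    pub fn op_end(&self) {
        let mut n = self.inflight.lock().unwrap();
        *n -= 1;
        if *n == 0 {
            self.drained.notify_all();
        }
    }

    /// Wait until no operations are in flight. Returns `true` if drained,
    /// `false` if the timeout expired (the caller then downgrades forcibly).
    pub fn drain(&self, timeout: Duration) -> bool {
        let guard = self.inflight.lock().unwrap();
        let (_guard, result) = self
            .drained
            .wait_timeout_while(guard, timeout, |n| *n > 0)
            .unwrap();
        !result.timed_out()
    }
}
```

In the protocol above the timeout would be 30 seconds; a `false` return corresponds to the forced-downgrade branch rather than an error.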
14.8 overlayfs: Union Filesystem for Containers¶
Use case: Container image layering. Docker, containerd, Podman, and Kubernetes all use overlayfs as their primary storage driver. A container image is a stack of read-only filesystem layers; overlayfs merges them with a writable upper layer to present a unified view. Without overlayfs, container runtimes fall back to copy-the-entire-layer approaches (VFS copy, naive snapshots), which are orders of magnitude slower for image pull and container startup.
Tier: Tier 1 (runs in the VFS isolation domain alongside umka-vfs).
Rationale for Tier 1 (not Tier 2): overlayfs is a stacking filesystem — it sits between the VFS and the underlying filesystem drivers (ext4, XFS, btrfs, tmpfs). Every path lookup, readdir, and file open in a container traverses overlayfs. Placing it in Tier 2 (Ring 3, process boundary) would add two domain crossings per VFS operation inside every container, roughly doubling the path resolution overhead. Since overlayfs delegates all storage I/O to the underlying filesystem (which is itself a Tier 1 driver), overlayfs never touches hardware directly — it is a pure VFS client. Its code complexity is moderate (~3,000 SLOC in Linux) and auditable. The crash containment boundary is the VFS domain: if overlayfs panics, the VFS recovery protocol (Section 14.1) handles it.
Container setup ordering: During container creation, the overlayfs mount must
complete before pivot_root() changes the container's root filesystem. The
sequence is: (1) mount overlayfs at the target path, (2) mount pseudo-filesystems
(/proc, /sys, /dev) on top, (3) pivot_root() to switch the container root
to the overlayfs mount. Reversing steps (1) and (3) would leave the container with
no root filesystem. This ordering matches the OCI runtime specification and is
enforced by the umka-sysapi container setup helpers.
pivot_root() namespace validation: pivot_root() validates that new_root
is a mount point in the calling task's mount namespace. If new_root was mounted in
a different namespace (e.g., parent), pivot_root() returns EINVAL. This prevents
namespace-crossing pivots that would create incoherent mount state.
Design: overlayfs implements FileSystemOps, InodeOps, FileOps, and
DentryOps from the VFS trait system (Section 14.1). It does not introduce new
VFS abstractions — it composes existing ones.
14.8.1 Mount Options and Configuration¶
/// Mount options parsed from the `data` parameter of `FileSystemOps::mount()`.
/// Encoded as comma-separated key=value pairs in the `data: &[u8]` slice,
/// matching Linux's overlayfs mount option syntax exactly.
///
/// Example mount command:
/// ```
/// mount -t overlay overlay \
/// -o lowerdir=/lower2:/lower1,upperdir=/upper,workdir=/work \
/// /merged
/// ```
///
/// For read-only overlays (no upperdir/workdir), only lowerdir is required.
/// This is used for container image inspection without a writable layer.
pub struct OverlayMountOptions {
/// Colon-separated list of lower layer paths, ordered from topmost to
/// bottommost. At least one lower layer is required. Maximum 500 layers
/// (matching Linux's limit, which Docker/containerd never approach —
/// typical images have 5-20 layers).
///
/// Each path must be an existing directory on a mounted filesystem.
/// The VFS resolves each path to an `InodeId` at mount time and holds
/// a reference to the underlying superblock for the mount's lifetime.
///
/// Heap-allocated rather than inline (`ArrayVec<_, 500>` would be up to
/// 4000 bytes on the stack). The 500-layer maximum is enforced at mount
/// validation time. Mount processing is a rare, non-hot-path operation
/// where heap allocation is acceptable.
pub lower_dirs: Box<[InodeId]>,
/// Upper layer directory (read-write). `None` for read-only overlays.
/// Must reside on a filesystem that supports: xattr (for whiteouts and
/// metacopy markers), rename with RENAME_WHITEOUT, and mknod (for
/// character-device whiteouts). The upper filesystem must be writable.
pub upper_dir: Option<InodeId>,
/// Work directory for atomic copy-up staging. Required if `upper_dir`
/// is set. Must be on the **same filesystem** as `upper_dir` (same
/// superblock) — copy-up uses rename(2) from workdir to upperdir,
/// which requires same-device semantics. The VFS verifies this at
/// mount time by comparing `SuperBlock` identity.
///
/// The workdir must be empty at mount time. overlayfs creates a `work/`
/// subdirectory inside it for staging, and an `index/` subdirectory
/// for NFS export handles (if enabled).
pub work_dir: Option<InodeId>,
/// Enable metadata-only copy-up. When true, operations that modify
/// only metadata (chmod, chown, utimes, setxattr) copy only the
/// inode metadata to the upper layer, deferring data copy until the
/// first write. Dramatically reduces container startup I/O: a
/// `chmod` on a 200 MB binary copies ~4 KB of metadata instead of
/// 200 MB of data.
///
/// Default: true (matches Docker/containerd default since Linux 5.11+
/// with kernel config `OVERLAY_FS_METACOPY=y`).
///
/// Security restriction: this option is silently forced to `false`
/// when the mount is user-namespace-influenced (i.e., when the caller
/// does not hold `CAP_SYS_ADMIN` in the initial user namespace). In
/// such mounts the upper layer uses `user.overlay.*` xattrs, which
/// are writable by the file owner without privilege; a forged
/// metacopy xattr could redirect reads to arbitrary lower-layer files.
/// See [Section 14.8](#overlayfs-union-filesystem-for-containers--metacopy-trust-model-and-security-constraints)
/// for the complete trust model and enforcement mechanism.
pub metacopy: bool,
/// Directory rename/redirect handling.
///
/// - `On`: Enable redirect xattrs for directory renames. Required
/// for rename(2) on merged directories to succeed (without this,
/// rename of a directory that exists in a lower layer returns EXDEV).
/// - `Follow`: Follow existing redirect xattrs but do not create new
/// ones. Safe for mounting layers created by a trusted system.
/// - `NoFollow`: Ignore redirect xattrs entirely. Most restrictive.
/// - `Off`: Disable redirect handling; directory renames return EXDEV.
///
/// Default: `On` (required by Docker/containerd for correct semantics).
pub redirect_dir: RedirectDirMode,
/// Volatile mode. When enabled, overlayfs skips all fsync/sync_fs calls
/// to the upper filesystem. A crash or power loss may leave the upper
/// layer in an inconsistent state (workdir staging artifacts, partial
/// copy-ups). The overlay refuses to remount if it detects a previous
/// volatile session that was not cleanly unmounted.
///
/// Docker uses volatile mode for ephemeral containers where persistence
/// is not needed (CI runners, build containers, test environments).
///
/// Default: false.
pub volatile: bool,
/// Use `user.overlay.*` xattr namespace instead of `trusted.overlay.*`.
/// Required for unprivileged (rootless) overlayfs mounts where the
/// calling process lacks CAP_SYS_ADMIN in the initial user namespace.
/// The `user.*` xattr namespace is writable by the file owner without
/// special capabilities.
///
/// Default: false (use `trusted.overlay.*`).
pub userxattr: bool,
/// Extended inode number mode. Controls how overlayfs composes inode
/// numbers to guarantee uniqueness across layers.
///
/// - `On`: Compose inode numbers using upper bits for layer index.
/// Requires underlying filesystems to use <32-bit inode numbers
/// (ext4, XFS with `inode32` mount option).
/// - `Off`: Use raw underlying inode numbers. Risk of collisions
/// across layers (two files on different layers may share an ino).
/// - `Auto`: Enable if all underlying filesystems have small enough
/// inode numbers; disable otherwise.
///
/// Default: `Auto`.
pub xino: XinoMode,
/// NFS export support. When enabled, overlayfs maintains an index
/// directory (inside workdir) that maps NFS file handles to overlay
/// dentries. Required if the overlay mount will be exported via NFS.
///
/// Default: false (NFS export of container filesystems is uncommon).
pub nfs_export: bool,
/// fs-verity digest validation for lower layer files. When enabled,
/// overlayfs verifies that lower-layer files have valid fs-verity
/// digests matching the expected values stored in the upper layer's
/// metacopy xattr. Provides content integrity for container image
/// layers without requiring dm-verity on the entire block device.
///
/// - `Off`: No verity checking.
/// - `On`: Verify if digest is present; allow files without digest.
/// - `Require`: Reject files that lack a valid fs-verity digest.
///
/// Default: `Off`.
pub verity: VerityMode,
}
/// Redirect directory mode.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum RedirectDirMode {
/// Create and follow redirect xattrs.
On,
/// Follow existing redirect xattrs but do not create new ones.
Follow,
/// Do not follow redirect xattrs.
NoFollow,
/// Disable redirect handling; directory renames return EXDEV.
Off,
}
/// Extended inode number composition mode.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum XinoMode {
/// Always compose inode numbers.
On,
/// Never compose inode numbers.
Off,
/// Compose if underlying inode numbers fit.
Auto,
}
/// fs-verity enforcement mode for lower layer files.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum VerityMode {
/// No verity checking.
Off,
/// Verify if digest present; allow files without digest.
On,
/// Reject lower files without valid fs-verity digest.
Require,
}
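The user-namespace restriction described in the option comments can be sketched as a pure function over the requested options. This is a minimal illustration, not the mount code: `Opts` and `effective_opts` are hypothetical names, only three fields are modeled, and the downgrade of `redirect_dir=On` to `NoFollow` is an assumption (the text above only says the option is disabled).

```rust
/// Simplified stand-in for the redirect mode enum above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RedirectDirMode { On, Follow, NoFollow, Off }

/// Hypothetical subset of the mount options relevant to the restriction.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Opts {
    pub metacopy: bool,
    pub redirect_dir: RedirectDirMode,
    pub userxattr: bool,
}

/// Derive the options actually enforced for a mount.
/// ASSUMPTION: `On` falls back to `NoFollow`; the exact fallback mode is
/// not stated in the text and is chosen here for illustration only.
pub fn effective_opts(requested: Opts, userns_influenced: bool) -> Opts {
    if !userns_influenced {
        return requested;
    }
    Opts {
        // Forged metacopy xattrs must never be honored on such mounts.
        metacopy: false,
        redirect_dir: match requested.redirect_dir {
            RedirectDirMode::On => RedirectDirMode::NoFollow,
            other => other,
        },
        // `user.overlay.*` is mandatory when the mount is userns-influenced.
        userxattr: true,
    }
}
```

A privileged mount gets its options back unchanged; a userns-influenced mount always ends up with `metacopy == false` and `userxattr == true`, whatever was requested.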
14.8.2 Core Data Structures
/// Overlay filesystem superblock state. One instance per overlay mount.
/// Created by `OverlayFs::mount()` and stored in the `SuperBlock`'s
/// filesystem-private field.
pub struct OverlaySuperBlock {
/// Lower layer inodes (topmost first). Index 0 is the highest-priority
/// lower layer (searched first after upper). These are directory inodes
/// on the underlying filesystems, held for the mount's lifetime.
///
/// Heap-allocated rather than inline (`ArrayVec<_, 500>` would exceed
/// the safe stack frame budget — each `OverlayLayer` contains an
/// `InodeId`, a `SuperBlock` reference, and a `u16` index). The
/// 500-layer maximum is enforced at mount validation time. Mount
/// processing is a rare, non-hot-path operation where heap allocation
/// is acceptable.
pub lower_layers: Box<[OverlayLayer]>,
/// Upper layer state. `None` for read-only overlay mounts.
pub upper_layer: Option<OverlayLayer>,
/// Work directory inode on the upper filesystem. Used as a staging
/// area for atomic copy-up operations.
pub work_dir: Option<InodeId>,
/// Index directory inode (inside workdir). Used for NFS export file
/// handle resolution and hard link tracking across copy-up.
pub index_dir: Option<InodeId>,
/// Parsed mount options (immutable after mount).
pub config: OverlayMountOptions,
/// The xattr prefix used for overlay-private xattrs. Either
/// `"trusted.overlay."` (privileged) or `"user.overlay."` (userxattr
/// mode). Stored once to avoid branching on every xattr operation.
pub xattr_prefix: &'static [u8],
/// Volatile session marker. If volatile mode is enabled, this is set
/// to true after creating the `$workdir/work/incompat/volatile`
/// sentinel directory. On mount, if the sentinel exists from a
/// previous unclean session, mount fails with EINVAL.
pub volatile_active: bool,
/// True if this overlay was mounted from within a user namespace or
/// if the upper layer's filesystem mount is owned by a non-initial
/// user namespace. When true, `metacopy` and `redirect_dir=on` are
/// disabled regardless of mount options, `userxattr` mode is
/// mandatory, and data-only lower layers are rejected.
///
/// Set once at `OverlayFs::mount()` time by checking whether the
/// calling process's user namespace differs from the initial user
/// namespace (`current_user_ns() != &init_user_ns`) or the upper
/// layer's mount is owned by a non-initial user namespace.
/// Immutable thereafter.
///
/// See Section 14.8.6.1 for the full security model.
pub userns_influenced: bool,
}
/// A single layer in the overlay stack.
pub struct OverlayLayer {
/// Root directory inode of this layer on its underlying filesystem.
pub root: InodeId,
/// Superblock of the underlying filesystem. Arc reference held for
/// the overlay mount's lifetime to prevent the underlying FS from
/// being unmounted while the overlay is active.
pub sb: Arc<SuperBlock>,
/// Layer index (0 = upper or topmost lower; increases downward).
/// Used for xino composition and for identifying which layer an
/// overlay inode's data resides on.
pub index: u16,
}
/// Atomic optional value using a sentinel for the `None` state.
/// `InodeId` of 0 represents `None` (inode 0 is never valid in any filesystem).
/// Provides lock-free read access via `Acquire` load and one-time write
/// via `compare_exchange` (for copy-up transitions from None -> Some).
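pub struct AtomicOption<T: Into<u64> + From<u64>> {
value: AtomicU64, // 0 = None, non-zero = Some(T)
// Ties the otherwise unused `T` to the struct (`core::marker::PhantomData`);
// without it the unused type parameter is rejected by the compiler.
_marker: PhantomData<T>,
}
impl AtomicOption<InodeId> {
pub fn none() -> Self { Self { value: AtomicU64::new(0), _marker: PhantomData } }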
pub fn load(&self) -> Option<InodeId> {
match self.value.load(Ordering::Acquire) {
0 => None,
v => Some(InodeId(v)),
}
}
/// Atomically transition from None to Some. Returns Err if already set.
pub fn set_once(&self, val: InodeId) -> Result<(), InodeId> {
self.value.compare_exchange(0, val.0, Ordering::AcqRel, Ordering::Acquire)
.map(|_| ())
.map_err(|v| InodeId(v))
}
}
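The one-shot None-to-Some transition can be exercised in isolation with the standard library atomics. The following is a minimal sketch, assuming `InodeId` is a `u64` newtype and using `std` in place of the kernel's own atomic types; it is generic over `T` rather than specialized to `InodeId`, which is a choice made for the example.

```rust
use std::marker::PhantomData;
use std::sync::atomic::{AtomicU64, Ordering};

/// Assumed shape of `InodeId` for this sketch: a plain u64 newtype.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct InodeId(pub u64);

impl From<u64> for InodeId { fn from(v: u64) -> Self { InodeId(v) } }
impl From<InodeId> for u64 { fn from(v: InodeId) -> u64 { v.0 } }

pub struct AtomicOption<T: Into<u64> + From<u64>> {
    value: AtomicU64,        // 0 = None, non-zero = Some(T)
    _marker: PhantomData<T>, // keeps the type parameter in use
}

impl<T: Into<u64> + From<u64>> AtomicOption<T> {
    pub fn none() -> Self {
        Self { value: AtomicU64::new(0), _marker: PhantomData }
    }

    /// Lock-free read: Acquire pairs with the Release half of `set_once`.
    pub fn load(&self) -> Option<T> {
        match self.value.load(Ordering::Acquire) {
            0 => None,
            v => Some(T::from(v)),
        }
    }

    /// Transition None -> Some exactly once; returns the existing value on
    /// failure so the caller can see who won.
    pub fn set_once(&self, val: T) -> Result<(), T> {
        let raw: u64 = val.into();
        debug_assert!(raw != 0, "0 is reserved as the None sentinel");
        self.value
            .compare_exchange(0, raw, Ordering::AcqRel, Ordering::Acquire)
            .map(|_| ())
            .map_err(T::from)
    }
}
```

A second `set_once` call fails and hands back the value already stored, which is how a losing thread learns which upper inode was committed.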
/// Per-inode overlay state. Tracks which layers contribute to a merged
/// view of this inode.
///
/// An `OverlayInode` is created on first lookup and cached in the VFS
/// inode cache. It is the filesystem-private data attached to the VFS
/// inode via `InodeId`.
pub struct OverlayInode {
/// Inode in the upper layer. `Some` if the entry exists in upper
/// (either originally or after copy-up). `None` if the entry exists
/// only in lower layers.
///
/// Protected by `copy_up_lock`: transitions from `None` to `Some`
/// exactly once during copy-up. Once set, never changes back.
/// Reads after copy-up are lock-free (Acquire load of the
/// sentinel-encoded value; 0 means `None`).
pub upper: AtomicOption<InodeId>,
/// Inode in the topmost lower layer that contains this entry.
/// `None` if the entry exists only in upper (newly created file).
pub lower: Option<LowerInodeRef>,
/// 1 if this inode is a metacopy-only upper entry (metadata
/// copied, data still in lower layer). Cleared to 0 after full
/// data copy-up completes. Uses AtomicU8 (not AtomicBool) to avoid
/// the bool validity invariant — Tier 1 intra-domain memory
/// corruption from a co-domain module could write a non-0/1 value,
/// which would be undefined behavior for AtomicBool.
/// 0 = no metacopy, 1 = metacopy.
pub metacopy: AtomicU8,
/// True if this is an opaque directory. An opaque directory hides
/// all entries from lower layers — readdir and lookup do not
/// descend into lower layers below this point.
pub opaque: bool,
/// Redirect path for directory renames. When a merged directory is
/// renamed in the upper layer, this field stores the original lower
/// path so that lookups can find the renamed directory's lower
/// contents. `None` for non-redirected entries.
pub redirect: Option<Box<OsStr>>,
/// Lock serializing copy-up operations on this inode. Only one
/// thread may copy-up a given inode at a time. Other threads
/// attempting to modify the same lower-layer file block on this
/// lock until copy-up completes, then proceed against the upper copy.
///
/// This is a `Mutex`, not an `RwLock`, because copy-up is an
/// exclusive state transition (None -> Some). Read paths check
/// `upper` with an Acquire load and only take the lock if they
/// need to trigger copy-up.
pub copy_up_lock: Mutex<()>,
/// Overlay inode type. Needed because the overlay may present a
/// different view than the underlying filesystem (e.g., a whiteout
/// character device appears as "entry does not exist").
pub inode_type: OverlayInodeType,
}
/// Reference to a lower-layer inode.
pub struct LowerInodeRef {
/// Inode ID on the lower layer's filesystem.
pub inode: InodeId,
/// Which lower layer this inode resides on (index into
/// `OverlaySuperBlock::lower_layers`).
pub layer_index: u16,
}
/// Overlay inode type classification.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum OverlayInodeType {
/// Regular file (may be metacopy).
Regular,
/// Directory (may be merged or opaque).
Directory,
/// Symbolic link.
Symlink,
/// Character device, block device, FIFO, or socket.
Special,
/// Whiteout entry (exists in upper layer to mark deletion of a
/// lower-layer entry). Not visible to userspace — lookups return
/// ENOENT. Internally represented as either a character device
/// with major:minor 0:0 or a zero-size file with the
/// `trusted.overlay.whiteout` xattr.
Whiteout,
}
14.8.3 Overlay Dentry Operations
overlayfs requires custom DentryOps to handle the dynamic nature of the
merged filesystem view. Copy-up changes which layer serves a file, so cached
dentries must be revalidated.
/// overlayfs dentry operations.
impl DentryOps for OverlayDentryOps {
/// Revalidate a cached overlay dentry.
///
/// Returns `false` (forcing re-lookup) in these cases:
/// 1. The overlay inode has been copied up since the dentry was cached
/// (detected by checking if `OverlayInode::upper` transitioned from
/// None to Some since the last lookup).
/// 2. The underlying filesystem's dentry has been invalidated (delegates
/// to the underlying filesystem's `d_revalidate` if it implements one,
/// e.g., for NFS lower layers).
/// 3. A whiteout has been created or removed in the upper layer for this
/// name (detected by checking upper-layer lookup result against cached
/// overlay state).
///
/// Returns `true` (dentry is still valid) in all other cases.
fn d_revalidate(&self, parent: InodeId, name: &OsStr) -> Result<bool>;
/// Overlay dentries use the default VFS hash (SipHash-1-3).
fn d_hash(&self, _name: &OsStr) -> Option<u64> {
None
}
/// Overlay dentries are always eligible for LRU caching.
fn d_delete(&self, _inode: InodeId, _name: &OsStr) -> bool {
true
}
/// On dentry release, drop the overlay inode's references to
/// underlying filesystem inodes.
fn d_release(&self, inode: InodeId, name: &OsStr);
}
Dentry cache interaction: When a copy-up occurs, the overlay must invalidate the affected dentry in the VFS dentry cache (Section 14.1) so that subsequent lookups see the upper-layer inode instead of the stale lower-layer reference. The invalidation sequence:
1. Copy-up completes (new file exists in upper layer).
2. OverlayInode::upper is set via an atomic Release store.
3. The overlay calls d_invalidate() on the parent directory's dentry for the affected name. This removes the dentry from the hash table and marks it for re-lookup.
4. The next lookup for this name calls OverlayInodeOps::lookup(), which now finds the upper-layer entry and returns the updated OverlayInode.
Negative dentry handling: Negative dentries (cached ENOENT results) in the overlay dentry cache are invalidated when:
- A new file is created in the upper layer (the negative dentry for that name must be purged).
- A whiteout is removed (the previously-hidden lower-layer entry becomes visible again).
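The effect of invalidation on both positive and negative dentries can be illustrated with a toy cache. This is a sketch under simplifying assumptions: `DentryCache`, `lookup_or_fill`, and the `(parent, name)` key are hypothetical, inode numbers are bare `u64` values, and no LRU or hashing policy is modeled.

```rust
use std::collections::HashMap;

/// Toy dentry cache: `Some(ino)` is a positive dentry, `None` a negative one.
#[derive(Default)]
pub struct DentryCache {
    entries: HashMap<(u64, String), Option<u64>>,
}

impl DentryCache {
    /// Return the cached result, or resolve via `lookup` and cache it.
    pub fn lookup_or_fill(
        &mut self,
        parent: u64,
        name: &str,
        lookup: impl FnOnce() -> Option<u64>,
    ) -> Option<u64> {
        *self
            .entries
            .entry((parent, name.to_string()))
            .or_insert_with(lookup)
    }

    /// Drop the entry so the next lookup re-resolves (models d_invalidate()).
    pub fn invalidate(&mut self, parent: u64, name: &str) {
        self.entries.remove(&(parent, name.to_string()));
    }
}
```

Without the `invalidate` call, a lookup after copy-up keeps returning the cached lower-layer inode; with it, the next lookup re-resolves and sees the upper-layer inode. The same call purges a negative entry when a file is created.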
14.8.4 Lookup Algorithm
overlayfs lookup implements the layer search order:
OverlayInodeOps::lookup(parent: InodeId, name: &OsStr) -> Result<InodeId>:
let overlay_parent = get_overlay_inode(parent)
// Step 1: Search upper layer (if writable overlay).
if let Some(upper_dir) = overlay_parent.upper.load(Acquire) {
match underlying_lookup(upper_dir, name) {
Ok(upper_inode) => {
// Check if this is a whiteout.
if is_whiteout(upper_inode) {
// Entry was deleted. Do NOT search lower layers.
// Cache a negative dentry.
return Err(ENOENT)
}
// Check if this is an opaque directory.
let opaque = is_opaque_dir(upper_inode)
// Found in upper. If directory and not opaque, may need
// to merge with lower layers.
if is_directory(upper_inode) && !opaque {
// Merged directory: upper exists, also search lower
// for the merge view.
let lower = find_in_lower_layers(overlay_parent, name)
return create_overlay_inode(Some(upper_inode), lower, ...)
}
// Non-directory or opaque directory: upper is authoritative.
return create_overlay_inode(Some(upper_inode), None, ...)
}
Err(ENOENT) => {
// Not in upper, fall through to lower layers.
}
Err(e) => return Err(e), // Propagate I/O errors.
}
}
// Step 2: Search lower layers (topmost first).
// If parent directory has a redirect, follow it.
for (layer_idx, lower_layer) in lower_layers_for(overlay_parent) {
match underlying_lookup(lower_dir_at(lower_layer, overlay_parent), name) {
Ok(lower_inode) => {
if is_whiteout(lower_inode) {
// Whiteout in this lower layer. Stop searching.
return Err(ENOENT)
}
return create_overlay_inode(None, Some(LowerInodeRef {
inode: lower_inode,
layer_index: layer_idx,
}), ...)
}
Err(ENOENT) => continue, // Try next lower layer.
Err(e) => return Err(e),
}
}
// Not found in any layer.
Err(ENOENT)
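The search order above can be modeled with in-memory layers. This sketch covers only the non-directory case (no merged directories, opaque markers, or redirects); `Layer`, `Entry`, and `Found` are hypothetical types introduced for the example, and each layer is reduced to a flat name-to-entry map for one directory.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Entry { File(u64), Whiteout }

/// One layer's contents for a single directory: name -> entry.
pub type Layer = HashMap<&'static str, Entry>;

#[derive(Debug, PartialEq, Eq)]
pub enum Found {
    Upper(u64),
    Lower { inode: u64, layer_index: usize },
}

/// Search upper first, then lower layers topmost-first. A whiteout in any
/// layer stops the search and hides everything beneath it.
pub fn lookup(upper: Option<&Layer>, lowers: &[Layer], name: &str) -> Option<Found> {
    if let Some(up) = upper {
        match up.get(name) {
            Some(Entry::Whiteout) => return None,
            Some(Entry::File(ino)) => return Some(Found::Upper(*ino)),
            None => {} // not in upper, fall through to lower layers
        }
    }
    for (layer_index, layer) in lowers.iter().enumerate() {
        match layer.get(name) {
            Some(Entry::Whiteout) => return None,
            Some(Entry::File(ino)) => {
                return Some(Found::Lower { inode: *ino, layer_index })
            }
            None => continue, // try the next lower layer
        }
    }
    None
}
```

A name that is whited out in upper resolves to `None` even though a lower layer still contains it, while the same name resolves normally on a read-only overlay with no upper layer.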
Whiteout detection: An upper-layer entry is a whiteout if either:
- It is a character device with major:minor 0:0 (traditional format), OR
- It is a zero-size regular file with the trusted.overlay.whiteout (or
user.overlay.whiteout in userxattr mode) xattr set.
Both formats are supported for compatibility with existing container images.
UmkaOS creates whiteouts using the xattr format by default (avoids requiring
mknod capability for character device creation in unprivileged containers).
Opaque directory detection: A directory is opaque if it has the xattr
trusted.overlay.opaque (or user.overlay.opaque) set to "y". An opaque
directory hides all entries from lower layers — lookups do not descend past it.
This is used when an entire directory is deleted and recreated in the upper layer.
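The two detection rules can be written as predicates over a file's type, size, and xattrs. The sketch below is illustrative only: `Meta`, `Kind`, and the slice-of-pairs xattr representation are hypothetical stand-ins for whatever the underlying filesystem's getattr and getxattr calls return, and `prefix` plays the role of `OverlaySuperBlock::xattr_prefix`.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Kind {
    Regular,
    Directory,
    CharDev { major: u32, minor: u32 },
}

/// Hypothetical snapshot of an underlying inode's attributes.
pub struct Meta<'a> {
    pub kind: Kind,
    pub size: u64,
    /// (name, value) pairs as returned by the underlying filesystem.
    pub xattrs: &'a [(&'a str, &'a [u8])],
}

/// Find the xattr named `prefix` + `suffix`, if present.
fn xattr<'a>(m: &Meta<'a>, prefix: &str, suffix: &str) -> Option<&'a [u8]> {
    m.xattrs
        .iter()
        .find(|(k, _)| k.strip_prefix(prefix) == Some(suffix))
        .map(|(_, v)| *v)
}

/// Whiteout: char device 0:0, or zero-size regular file with the xattr.
pub fn is_whiteout(m: &Meta, prefix: &str) -> bool {
    match m.kind {
        Kind::CharDev { major: 0, minor: 0 } => true,
        Kind::Regular => m.size == 0 && xattr(m, prefix, "whiteout").is_some(),
        _ => false,
    }
}

/// Opaque: a directory whose `opaque` xattr is exactly "y".
pub fn is_opaque_dir(m: &Meta, prefix: &str) -> bool {
    m.kind == Kind::Directory && xattr(m, prefix, "opaque") == Some(b"y".as_slice())
}
```

Passing the prefix explicitly shows why a `user.overlay.whiteout` marker is ignored on a mount that uses the `trusted.overlay.` prefix.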
14.8.5 Copy-Up Protocol
Copy-up is the central operation of overlayfs. When a lower-layer file must be modified, its contents (and/or metadata) are first copied to the upper layer. The copy-up must be atomic from the perspective of concurrent readers: at no point should a reader see a partially-copied file.
Full copy-up algorithm (for regular files when metacopy is disabled, or on first write to a metacopy-only file):
copy_up(overlay_inode: &OverlayInode) -> Result<InodeId>:
// Fast path: already copied up.
if let Some(upper) = overlay_inode.upper.load(Acquire) {
if overlay_inode.metacopy.load(Acquire) == 0 { // AtomicU8: 0 = no metacopy
return Ok(upper) // Fully copied up already.
}
// Metacopy exists but needs full data copy. Fall through.
}
// Slow path: take copy-up lock.
let _guard = overlay_inode.copy_up_lock.lock()
// Double-check after acquiring lock (another thread may have completed
// copy-up while we waited).
if let Some(upper) = overlay_inode.upper.load(Acquire) {
if overlay_inode.metacopy.load(Acquire) == 0 { // AtomicU8: 0 = no metacopy
return Ok(upper)
}
}
let lower = overlay_inode.lower.as_ref().expect("copy-up requires lower");
let sb = overlay_super_block()
// Step 1: Ensure parent directory exists in upper layer.
// Recursively copy-up parent directories if needed.
let upper_parent = ensure_upper_parent(overlay_inode)
// Step 2: Create temporary file in workdir (same filesystem as upper).
// The workdir is on the same device as upperdir, enabling atomic rename.
let tmp_name = generate_temp_name() // e.g., "#overlay.XXXXXXXX"
let tmp_inode = underlying_create(sb.work_dir, tmp_name, lower_mode)
// Step 3: Copy metadata from lower to tmp.
let lower_attr = underlying_getattr(lower.inode)
// CVE-2023-0386 mitigation: Verify that the source file's UID/GID are valid
// in the overlay mount's user namespace. A setuid file in the lower layer
// whose UID has no mapping in the overlay's userns must NOT be copied up
// with elevated privileges. Reject with EOVERFLOW if unmappable.
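// Use the non-munged lookup: in the Linux helpers these names mirror, the
// munged variant substitutes the overflow ID for unmappable IDs and so
// never reports an invalid result.
if !from_kuid(overlay_mnt_userns, lower_attr.uid).is_valid()
|| !from_kgid(overlay_mnt_userns, lower_attr.gid).is_valid() {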
underlying_unlink(sb.work_dir, tmp_name) // clean up temp file
return Err(EOVERFLOW)
}
underlying_setattr(tmp_inode, &lower_attr) // owner, mode, timestamps
// Step 4: Copy xattrs from lower to tmp.
// Filter out overlay-private xattrs (trusted.overlay.* or user.overlay.*, per sb.xattr_prefix).
copy_xattrs_filtered(lower.inode, tmp_inode, sb.xattr_prefix)
// Step 5: Copy file data (skip if metacopy mode and this is a
// metadata-only copy-up triggered by chmod/chown/utimes).
if !metacopy_only {
copy_file_data_chunked(lower.inode, tmp_inode, &overlay_inode.copy_up_lock)
// Uses chunked I/O with periodic lock release. See "Chunked Copy-Up
// and Cgroup I/O Throttling" below for the algorithm.
} else {
// Set metacopy xattr on the tmp file. This marks it as containing
// metadata only — data will be copied on first write.
underlying_setxattr(tmp_inode,
concat(sb.xattr_prefix, "metacopy"), b"", 0)
// If the lower file is itself a metacopy (nested overlay), follow
// the redirect chain to find the actual data source.
if let Some(origin) = get_metacopy_origin(lower.inode) {
underlying_setxattr(tmp_inode,
concat(sb.xattr_prefix, "origin"), &encode_fh(origin), 0)
}
}
// Step 6: Set security context on tmp file.
// Copy security.* xattrs that the security framework requires.
// Step 7: Atomic rename from workdir to upperdir.
// This is the commit point. Before this rename, the copy-up is invisible
// to other processes. After this rename, the upper-layer file is live.
underlying_rename(sb.work_dir, tmp_name, upper_parent, target_name,
RenameFlags::RENAME_NOREPLACE)
// Step 8: Update overlay inode state.
let upper_inode = underlying_lookup(upper_parent, target_name)
// set_once(): CAS from None to Some. The copy_up_lock guarantees a single
// writer, so set_once always succeeds; the expect() catches invariant violations.
overlay_inode.upper.set_once(upper_inode)
.expect("copy_up_lock guarantees single writer");
if metacopy_only {
overlay_inode.metacopy.store(1, Release) // AtomicU8: 1 = metacopy
}
// Step 9: Invalidate the dentry cache entry for this name.
// Forces subsequent lookups to see the upper-layer version.
d_invalidate(upper_parent, target_name)
Ok(upper_inode)
Atomicity guarantee: The rename in Step 7 is the single atomic commit point.
If the system crashes before Step 7, the temporary file in workdir is orphaned and
cleaned up on next mount (overlayfs scans workdir for stale temporaries during
mount() and removes them). If the system crashes after Step 7, the upper-layer
file is complete and consistent.
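The stage-then-rename pattern behind this guarantee can be reproduced in userspace with `std::fs`. This is a sketch, not the kernel path: `copy_up_file` is a hypothetical helper, it assumes `workdir` and `upperdir` are on the same filesystem, and because `std::fs::rename` has no RENAME_NOREPLACE equivalent the up-front existence check is only a racy stand-in for that flag.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Stage `lower` in `workdir`, then rename into `upperdir` as the commit point.
pub fn copy_up_file(
    lower: &Path,
    workdir: &Path,
    upperdir: &Path,
    name: &str,
) -> io::Result<()> {
    let target = upperdir.join(name);
    // Stand-in for RENAME_NOREPLACE (not atomic with the rename below).
    if target.exists() {
        return Err(io::Error::new(io::ErrorKind::AlreadyExists, "already in upper"));
    }
    let tmp = workdir.join(format!("#overlay.{}", name));
    // Create the temporary and copy data plus permission bits. A failure
    // here leaves upperdir untouched; remove the partial temporary.
    if let Err(e) = fs::copy(lower, &tmp) {
        let _ = fs::remove_file(&tmp);
        return Err(e);
    }
    // Commit point. On failure, drop the complete-but-uncommitted temporary.
    if let Err(e) = fs::rename(&tmp, &target) {
        let _ = fs::remove_file(&tmp);
        return Err(e);
    }
    Ok(())
}
```

At every point before the rename the upper directory is unchanged, and after it the file is complete. The lower file is only ever read.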
Error recovery (runtime failures): Each step that can fail must clean up all prior steps before returning an error to the caller. The copy-up state machine tracks progress through four states:
/// Copy-up state machine. Tracks the current phase of a copy-up operation
/// for error recovery. Stored on the stack (not persistent — crash recovery
/// uses workdir scan, not state machine replay).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum CopyUpState {
/// Step 2: Creating the temporary file in workdir.
/// Cleanup on failure: none (nothing created yet).
Creating,
/// Steps 3-6: Copying metadata, xattrs, data, and security context
/// to the temporary file.
/// Cleanup on failure: unlink temporary file from workdir.
Copying,
/// Step 7: Atomic rename from workdir to upperdir.
/// Cleanup on failure: unlink temporary file from workdir.
Renaming,
/// Steps 8-9: Rename succeeded. Updating overlay inode state and
/// invalidating dentry cache.
/// No cleanup needed — the upper file is committed and will be
/// found on retry if post-rename steps fail.
Complete,
}
Error recovery protocol (driven by CopyUpState):
| CopyUpState when error occurs | Cleanup action | Returned error | Lower file status |
|---|---|---|---|
| Creating (Step 2 fails) | None (nothing was created) | EIO / ENOSPC | Unchanged |
| Copying (Steps 3-6 fail) | underlying_unlink(workdir, tmp_name): remove partial temp file | EIO / ENOSPC | Unchanged |
| Renaming (Step 7 fails) | underlying_unlink(workdir, tmp_name): remove complete-but-uncommitted temp file | EIO | Unchanged |
| Complete (Steps 8-9 fail) | None: rename already committed; upper file is live and will be discovered on retry | EIO (rare) | Now shadowed by upper |
The error recovery path:
copy_up_with_recovery(overlay_inode) -> Result<InodeId>:
let mut state = CopyUpState::Creating;
let result = (|| -> Result<InodeId> {
// Step 2: Create temp file.
let tmp = underlying_create(sb.work_dir, tmp_name, mode)?;
state = CopyUpState::Copying;
// Steps 3-6: Copy metadata, xattrs, data, security context.
copy_metadata(lower, tmp)?;
copy_xattrs_filtered(lower, tmp, prefix)?;
if !metacopy_only {
match copy_file_data_chunked(lower, tmp, &copy_up_lock) {
Ok(()) => {}
Err(EALREADY) => {
// Another thread completed copy-up while we released
// the lock between chunks. Clean up our temp file and
// return the already-committed upper inode.
let _ = underlying_unlink(sb.work_dir, tmp_name);
return Ok(overlay_inode.upper.load(Acquire).unwrap())
}
Err(e) => return Err(e),
}
}
copy_security_context(lower, tmp)?;
state = CopyUpState::Renaming;
// Step 7: Atomic rename.
underlying_rename(sb.work_dir, tmp_name, upper_parent, name,
RENAME_NOREPLACE)?;
state = CopyUpState::Complete;
// Steps 8-9: Update overlay state.
let upper = underlying_lookup(upper_parent, name)?;
overlay_inode.upper.set_once(upper)
.expect("copy_up_lock guarantees single writer");
d_invalidate(upper_parent, name);
Ok(upper)
})();
if let Err(e) = &result {
match state {
CopyUpState::Creating => {
// Nothing to clean up.
}
CopyUpState::Copying | CopyUpState::Renaming => {
// Best-effort cleanup: remove the temporary file.
if let Err(unlink_err) = underlying_unlink(sb.work_dir, tmp_name) {
// Unlink failed — orphan left in workdir.
// mount() workdir scan will remove it.
log_warn!("copy-up cleanup failed: {:?}", unlink_err);
}
}
CopyUpState::Complete => {
// Rename committed. Upper file is live. No cleanup.
}
}
}
result
Key invariant: The original lower-layer file is never modified or removed
during copy-up. If any step fails before CopyUpState::Complete, the lower file
remains intact and serves subsequent reads. The temporary file in workdir is
either successfully cleaned up or left as an orphan for the next mount scan.
If cleanup of the temporary file itself fails (i.e., underlying_unlink() returns
an error during recovery), the orphaned temporary is left in workdir and will be
removed by the next mount() scan. The original copy-up failure is still returned
to the caller as an error. The orphaned file does not affect correctness because
the rename (Step 7) did not complete.
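The mapping from failure state to cleanup action can be checked mechanically. The sketch below replays a list of steps and reports whether the temporary file would need unlinking. `Step`, `Outcome`, and `run_copy_up` are hypothetical names, and each step is reduced to a function pointer that either succeeds or fails.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum CopyUpState { Creating, Copying, Renaming, Complete }

/// A step is modeled as a plain function that succeeds or fails.
pub type Step = fn() -> Result<(), &'static str>;

#[derive(Debug, PartialEq, Eq)]
pub struct Outcome {
    pub failed_in: Option<CopyUpState>,
    pub tmp_unlinked: bool,
}

/// Whether a failure in `state` requires unlinking the workdir temporary.
pub fn needs_tmp_unlink(state: CopyUpState) -> bool {
    matches!(state, CopyUpState::Copying | CopyUpState::Renaming)
}

/// Run the steps in order; on the first failure, apply the cleanup rule
/// for the state that was active when the failure occurred.
pub fn run_copy_up(steps: &[(CopyUpState, Step)]) -> Outcome {
    for &(state, step) in steps {
        if step().is_err() {
            return Outcome {
                failed_in: Some(state),
                tmp_unlinked: needs_tmp_unlink(state),
            };
        }
    }
    Outcome { failed_in: None, tmp_unlinked: false }
}
```

Only failures in `Copying` and `Renaming` trigger an unlink. `Creating` has nothing to remove, and `Complete` must not remove anything because the rename has already committed.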
Parent directory copy-up: Directories are copied up recursively. When copying
up /a/b/c/file.txt, if /a/b/c/ does not exist in upper, overlayfs creates
/a/, then /a/b/, then /a/b/c/ in upper (each with appropriate metadata and
the trusted.overlay.origin xattr pointing to the lower original). Only then does
the file copy-up proceed. Each directory copy-up is itself atomic (created in
workdir, renamed to upper).
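The creation order for parent directories can be derived from the file path alone. This sketch assumes Unix-style paths and a caller-supplied predicate `exists_in_upper` (hypothetical) that reports whether a directory is already present in the upper layer. It relies on the property that if a directory exists in upper, so do all of its ancestors.

```rust
use std::path::{Path, PathBuf};

/// Return the ancestor directories of `file` that are missing from upper,
/// ordered shallowest-first (the order in which they must be created).
pub fn dirs_to_copy_up(
    file: &Path,
    exists_in_upper: impl Fn(&Path) -> bool,
) -> Vec<PathBuf> {
    let mut missing: Vec<PathBuf> = file
        .ancestors()
        .skip(1) // skip the file itself
        // The overlay root always exists in upper; ignore "/" and "".
        .filter(|p| !p.as_os_str().is_empty() && *p != Path::new("/"))
        // Walk upward until the first ancestor that already exists.
        .take_while(|p| !exists_in_upper(*p))
        .map(Path::to_path_buf)
        .collect();
    missing.reverse(); // deepest-first -> shallowest-first
    missing
}
```

For `/a/b/c/file.txt` with only `/a` present in upper, the result is `/a/b` followed by `/a/b/c`, which matches the top-down creation order described above.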
Hard link handling on copy-up: If a lower-layer file has multiple hard links (nlink > 1), all names referencing the same lower inode must resolve to the same upper inode after copy-up. The overlay maintains an index directory (inside workdir) that maps lower file handles to upper inodes. On copy-up, the overlay checks the index first:
- If an index entry exists, the file was already copied up via another name. Create a hard link in upper rather than copying data again.
- If no index entry exists, perform a full copy-up and record the mapping.
This index is also used for NFS export (mapping file handles across copy-up).
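The index check reduces to a map lookup keyed by the lower file handle. The following sketch uses an in-memory `HashMap` in place of the on-disk index directory. `Index`, `CopyUp`, and the sequential upper inode numbering are hypothetical and exist only to make the behavior observable.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum CopyUp {
    /// Data was copied and a new upper inode created.
    Copied(u64),
    /// An existing upper inode was hard-linked under the new name.
    Linked(u64),
}

/// Toy index: lower file handle -> upper inode number.
#[derive(Default)]
pub struct Index {
    map: HashMap<u64, u64>,
    next_upper: u64,
}

impl Index {
    pub fn copy_up(&mut self, lower_handle: u64) -> CopyUp {
        // Already copied up via another name: link instead of copying.
        if let Some(&upper) = self.map.get(&lower_handle) {
            return CopyUp::Linked(upper);
        }
        // First copy-up of this lower inode: copy and record the mapping.
        self.next_upper += 1;
        let upper = self.next_upper;
        self.map.insert(lower_handle, upper);
        CopyUp::Copied(upper)
    }
}
```

Two names that share one lower inode end up on the same upper inode: the first copy-up copies, and the second only links.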
14.8.5.1 Chunked Copy-Up and Cgroup I/O Throttling
Copy-up data transfer (Step 5) can be arbitrarily large — container images routinely
contain multi-gigabyte database files, ML model weights, or log archives. A naive
single-pass copy_file_data() holds copy_up_lock for the entire transfer duration.
Under cgroup io.max throttling (Section 17.2), this
duration is further extended because each write is subject to the cgroup's byte-rate
and IOPS limits. The result: all other threads attempting any metadata or write
operation on the same file block on copy_up_lock for the full throttled transfer
time — potentially minutes for a large file under tight io.max limits.
Solution: Chunked copy-up with periodic lock release. The data copy is split into
fixed-size chunks. Between chunks, the copy_up_lock is released, allowing blocked
threads to observe the in-progress copy-up and either wait briefly or proceed (if
the copy-up completed during their wait). The temporary file in workdir is not
visible to other overlay operations until the final atomic rename (Step 7), so
releasing the lock during data copy does not expose partial state.
/// Chunk size for copy-up data transfer. 2 MiB balances:
/// - Throughput: large enough to amortize splice/sendfile setup overhead.
/// - Latency: small enough that lock-hold time per chunk is bounded (~2ms
/// at 1 GB/s disk throughput, longer under io.max throttling).
/// - Memory: the chunk is transferred in-place (splice) or via a bounded
/// kernel buffer, not allocated as a contiguous 2 MiB region.
const COPY_UP_CHUNK_SIZE: u64 = 2 * 1024 * 1024; // 2 MiB
copy_file_data_chunked(
lower: InodeId,
tmp: InodeId,
lock: &Mutex<()>,
) -> Result<()>:
let file_size = underlying_getattr(lower).size
let mut offset: u64 = 0
while offset < file_size {
let chunk_len = min(COPY_UP_CHUNK_SIZE, file_size - offset)
// Copy one chunk. Uses splice/sendfile for zero-copy where the
// underlying filesystem supports it; falls back to read+write.
// The write side is subject to the calling task's cgroup io.max
// throttling — cgroup_io_throttle() is called inside the block
// layer's submit_bio() path, which may sleep if the cgroup's
// byte-rate or IOPS budget is exhausted.
copy_file_range(lower, tmp, offset, chunk_len)?
offset += chunk_len
if offset < file_size {
// Release the copy-up lock between chunks. This allows other
// threads blocked on copy_up_lock to wake and re-check state.
// They will find upper still None (the rename has not happened)
// and re-acquire the lock. The first thread to re-acquire
// continues the copy from where it left off.
//
// SAFETY of releasing mid-copy: the temporary file is in workdir,
// invisible to overlay lookups. No concurrent thread can observe
// partial data. The only visible state change is the lock
// becoming briefly available, which lets waiters check for
// completion and yields the CPU to higher-priority tasks.
lock.unlock()
// Yield point: allow the scheduler to run higher-priority tasks
// (especially relevant when this copy-up is in a low-priority
// cgroup). Also allows signal delivery (SIGKILL check).
if signal_pending(current_task()) {
// Copy-up interrupted by fatal signal. The temporary file
// will be cleaned up by the error recovery path (CopyUpState::Copying).
return Err(EINTR)
}
// Re-acquire the lock before the next chunk.
lock.lock()
// Double-check: another thread may have completed the copy-up
// while we released the lock (race between two writers on the
// same metacopy file). If so, abandon our temp file; the caller
// unlinks it when it sees EALREADY.
if overlay_inode.upper.load(Acquire).is_some()
&& overlay_inode.metacopy.load(Acquire) == 0 {
return Err(EALREADY) // caller detects this and returns Ok
}
}
}
Ok(())
Interaction with io.max throttling: Each chunk's write passes through the block
layer's submit_bio() path. If the calling task's cgroup has io.max limits, the
throttle check (cgroup_io_throttle() in Section 17.2) sleeps until the
cgroup's token bucket replenishes. This sleep occurs while holding the copy-up lock
for the current chunk only — at most COPY_UP_CHUNK_SIZE / wbps seconds per chunk
hold. Between chunks, the lock is released, so other threads are unblocked.
Worst-case lock-hold time per chunk: COPY_UP_CHUNK_SIZE / min(disk_throughput,
io.max.wbps). At the minimum practical io.max setting of 1 MB/s, a 2 MiB chunk
holds the lock for ~2 seconds. This is acceptable because:
1. Threads waiting on copy_up_lock are already blocked on a slow-path mutation.
2. The 2-second hold is bounded and predictable, unlike the unbounded hold of a
single-pass copy of a multi-gigabyte file.
3. Under normal (unthrottled) I/O, the hold time per chunk is ~2ms at 1 GB/s.
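The per-chunk bound is simple arithmetic and can be checked directly. This sketch assumes rates are given in bytes per second and treats an absent `io.max` write limit as unthrottled; `chunk_hold_secs` is a hypothetical helper name.

```rust
pub const COPY_UP_CHUNK_SIZE: u64 = 2 * 1024 * 1024; // 2 MiB

/// Worst-case lock-hold time for one chunk, in seconds:
/// COPY_UP_CHUNK_SIZE / min(disk_throughput, io.max.wbps).
/// `wbps` is the cgroup write limit in bytes/s (`None` = unthrottled).
pub fn chunk_hold_secs(disk_bps: u64, wbps: Option<u64>) -> f64 {
    let rate = wbps.map_or(disk_bps, |w| w.min(disk_bps));
    COPY_UP_CHUNK_SIZE as f64 / rate as f64
}
```

With a 1 GB/s disk, a 1 MB/s `wbps` limit gives 2,097,152 / 1,000,000, roughly 2.1 seconds per chunk. Unthrottled, the same chunk takes roughly 2.1 milliseconds.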
EALREADY sentinel: When copy_file_data_chunked returns Err(EALREADY), the
caller (copy_up_with_recovery) recognizes this as a benign race: another thread
completed the copy-up. The caller unlinks its temporary file in the EALREADY
match arm and returns Ok with the already-committed upper inode, so the
CopyUpState::Copying error recovery branch does not run for this case.
Signal handling: The signal_pending() check between chunks allows SIGKILL to
abort a long-running copy-up promptly (within one chunk transfer time) instead of
only after the entire file is copied. The error recovery path cleans up the partial
temporary file.
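In Rust the lock release between chunks is naturally expressed by dropping and re-taking a `MutexGuard`. The sketch below models this with `std::sync::Mutex` over in-memory buffers. It is an illustration under stated assumptions: `copy_chunked` and `CopyErr` are hypothetical, the `done` flag stands in for "upper is set and metacopy is cleared", and signal checks and I/O throttling are omitted.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Mutex, MutexGuard};

#[derive(Debug, PartialEq, Eq)]
pub enum CopyErr {
    /// Another thread committed the copy-up while the lock was released.
    Already,
}

/// Copy `src` into `dst` in `chunk`-byte pieces, dropping and re-acquiring
/// the guard between chunks. On success the guard is returned so the caller
/// still holds the lock for the rename step.
pub fn copy_chunked<'a>(
    lock: &'a Mutex<()>,
    mut guard: MutexGuard<'a, ()>,
    done: &AtomicBool,
    src: &[u8],
    dst: &mut Vec<u8>,
    chunk: usize,
) -> Result<MutexGuard<'a, ()>, CopyErr> {
    assert!(chunk > 0, "chunk size must be non-zero");
    let mut offset = 0;
    while offset < src.len() {
        let end = (offset + chunk).min(src.len());
        dst.extend_from_slice(&src[offset..end]);
        offset = end;
        if offset < src.len() {
            drop(guard); // let waiters run and re-check state
            guard = lock.lock().unwrap(); // re-acquire before the next chunk
            if done.load(Ordering::Acquire) {
                return Err(CopyErr::Already); // guard is dropped on return
            }
        }
    }
    Ok(guard)
}
```

Passing the guard in and out makes the lock state on return explicit: held on success, released on `Already`.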
14.8.6 Metacopy Mode
Metacopy is the performance-critical optimization for container startup. Without metacopy, any metadata operation (chmod, chown, utimes) on a lower-layer file triggers a full data copy. With metacopy enabled, only metadata is copied, and data copy is deferred until the file is opened for writing.
Metacopy lifecycle:
State transitions for a file in metacopy mode:
[Lower-only]
│
│ chmod/chown/utimes/setxattr
▼
[Metacopy in upper] ← metadata copied, data in lower
│ upper has trusted.overlay.metacopy xattr
│ open(O_WRONLY/O_RDWR) or truncate
▼
[Full copy-up] ← data + metadata in upper
trusted.overlay.metacopy xattr removed
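The lifecycle diagram can be restated as a transition function. This is a sketch with hypothetical `State` and `Op` enums. It assumes that a further metadata change on a metacopy file leaves it in the metacopy state, which the diagram does not spell out.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State { LowerOnly, Metacopy, FullCopy }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Op {
    /// chmod / chown / utimes / setxattr
    MetadataChange,
    /// open(O_WRONLY) or open(O_RDWR)
    OpenForWrite,
    Truncate,
    OpenReadOnly,
}

pub fn next_state(state: State, op: Op, metacopy_enabled: bool) -> State {
    match (state, op) {
        // Reads never trigger copy-up.
        (s, Op::OpenReadOnly) => s,
        // Metadata-only copy-up when the mount has metacopy enabled.
        (State::LowerOnly, Op::MetadataChange) if metacopy_enabled => State::Metacopy,
        // ASSUMPTION: more metadata changes do not force a data copy.
        (State::Metacopy, Op::MetadataChange) => State::Metacopy,
        // Everything else needs (or already has) the data in upper.
        _ => State::FullCopy,
    }
}
```

With `metacopy_enabled` set to false, the first metadata change on a lower-only file goes straight to `FullCopy`, which is the cost that metacopy mode avoids.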
Realfile mechanism: When an overlayfs file is opened, the OpenFile's
f_mapping (AddressSpace pointer) is redirected to the underlying real inode's
AddressSpace (upper if exists, else lower). This ensures page cache operations go
directly to the real filesystem. On copy-up, f_mapping is redirected from lower to
upper. Overlayfs has no page cache of its own; it delegates entirely to the underlying
filesystems.
Read path for metacopy files: When a metacopy file is opened for reading
(O_RDONLY), data is served from the lower layer. The OverlayFileOps::read()
implementation checks overlay_inode.metacopy and dispatches to the lower-layer
FileOps::read() with the lower inode. No data copy occurs.
Concurrent metacopy copy-up: If two tasks trigger copy-up for the same metacopy
inode simultaneously, the first task to acquire the inode's copy_up_lock mutex
(see OverlayInode::copy_up_lock) performs the copy-up. The second task waits on
copy_up_lock and, after acquiring it, checks whether copy-up already completed
(oi.metacopy is now 0; oi.upper was already Some for the metacopy stub). If so, it
redirects f_mapping to the upper inode and proceeds without repeating the copy-up.
Write trigger: When a metacopy file is opened for writing (O_WRONLY,
O_RDWR) or truncated, the overlay triggers a full data copy-up before allowing
the write:
impl FileOps for OverlayFileOps {
fn open(&self, inode: InodeId, flags: OpenFlags) -> Result<u64> {
let oi = get_overlay_inode(inode);
// If opening for write and file is metacopy-only, trigger
// full data copy-up before returning the fd.
if flags.is_writable() && oi.metacopy.load(Acquire) != 0 {
copy_up_data(oi)?;
// copy_up_data() copies file data from lower to upper,
// removes the metacopy xattr, and clears oi.metacopy.
}
// Delegate open to the appropriate underlying filesystem.
if let Some(upper) = oi.upper.load(Acquire) {
underlying_open(upper, flags)
} else {
// Read-only open on a lower-only file. No copy-up needed.
underlying_open(oi.lower.unwrap().inode, flags)
}
}
}
14.8.6.1 Metacopy Trust Model and Security Constraints¶
The metacopy mechanism is only safe when the kernel can trust that
trusted.overlay.metacopy (or user.overlay.metacopy in userxattr mode) was
written by the overlay itself during a copy-up, not forged by a process with
write access to the upper layer. If forged, an attacker could create a file
whose upper stub has a redirect xattr pointing to an arbitrary path in a lower
layer, then set the metacopy xattr to tell the kernel to serve lower-layer data
through the stub — exposing files the attacker would not otherwise be able to read
via the overlay's merged view.
Xattr namespace privilege boundary
The trusted. xattr namespace is the primary safeguard. The kernel checks
CAP_SYS_ADMIN via capable() — which verifies the capability against the
initial user namespace — not via ns_capable() (which would accept a user
namespace root). This means:
A process that holds CAP_SYS_ADMIN only within a user namespace (i.e., container root mapped to an unprivileged host UID) cannot set or read trusted.* xattrs on the host filesystem. Only a process with CAP_SYS_ADMIN in the initial user namespace can write trusted.overlay.* xattrs.
This provides complete protection for overlayfs mounts created in the initial
user namespace: container processes cannot forge trusted.overlay.metacopy or
trusted.overlay.redirect xattrs because they lack the required capability on
the host filesystem.
Privileged container caveat: A container that runs with CAP_SYS_ADMIN in
the initial user namespace (not just the container's user namespace) CAN write
trusted.* xattrs and could forge metacopy stubs. This is a known trust
boundary: granting CAP_SYS_ADMIN in the initial namespace to a container is
equivalent to granting root on the host. Operators who grant this should not
rely on overlayfs metacopy for isolation.
User-namespace-influenced mounts: the attack surface
Since Linux 5.11, overlayfs can be mounted from within a user namespace
(CAP_SYS_ADMIN in the user namespace that owns the mount namespace suffices to
call mount("overlay", ...)). Such mounts are required to use userxattr mode
(-o userxattr), which substitutes the user.overlay.* xattr namespace for
trusted.overlay.*. Unlike trusted.*, the user.* namespace is writable by
the file owner without any privilege — specifically, the unprivileged host UID
that the container root maps to can set user.overlay.metacopy and
user.overlay.redirect xattrs on files in the upper layer.
A user-namespace-influenced mount is defined as any overlayfs mount where either:
- The overlayfs mount() call was made from within a user namespace (the calling process's user namespace is not the initial user namespace), or
- The upper directory's owning user namespace differs from the initial user
namespace (detected by comparing the user namespace of the mount namespace
that created the upper directory's filesystem mount against
init_user_ns).
Enforcement: metacopy disabled for user-namespace-influenced mounts
UmkaOS enforces the following rule at mount time and at metacopy lookup time:
Mount-time enforcement: When OverlayFs::mount() is called from a process
not in the initial user namespace, the metacopy and redirect_dir options are
forced to off regardless of what the caller requested. The mount proceeds with
these features disabled. The kernel logs:
overlayfs: metacopy and redirect_dir disabled for user-namespace mount (CVE mitigation, Section 14.8.6.1)
This matches Linux's behaviour (since kernel 5.11, user-namespace overlayfs
mounts are restricted to userxattr mode and metacopy is not permitted unless
the caller has CAP_SYS_ADMIN in the initial user namespace).
The OverlaySuperBlock records whether the mount is user-namespace-influenced:
pub struct OverlaySuperBlock {
    // ... existing fields ...
    /// True if this overlay was mounted from within a user namespace (the
    /// mounting process's user namespace is not the initial user namespace)
    /// or if the upper layer's filesystem mount is owned by a non-initial
    /// user namespace. When true, metacopy and redirect_dir are disabled
    /// regardless of mount options, and userxattr mode is mandatory.
    ///
    /// Set once at mount time; immutable thereafter.
    pub userns_influenced: bool,
}
Lookup-time enforcement: Even if metacopy is enabled in the mount options,
the metacopy lookup path checks userns_influenced before reading or acting on
any metacopy xattr:
/// Attempt to read a metacopy stub from the given upper-layer dentry.
/// Returns `None` (treat as a regular upper file) if:
///   - The mount is user-namespace-influenced, or
///   - No metacopy xattr is present, or
///   - The xattr value fails validation.
fn ovl_lookup_metacopy(dentry: &Dentry, sb: &OverlaySuperBlock) -> Option<OverlayMetacopy> {
    // Never trust metacopy xattrs from user-namespace-influenced mounts.
    // The xattr namespace used by such mounts (user.overlay.*) is writable
    // by the file owner without privilege, so any metacopy xattr present
    // must be treated as potentially forged.
    if sb.userns_influenced {
        return None;
    }
    // Read the metacopy xattr from the upper-layer file.
    let xattr_name = concat_static(sb.xattr_prefix, "metacopy");
    let xattr = dentry.get_xattr(xattr_name)?;
    // Validate xattr value. The Linux-compatible format is either empty
    // (legacy, no digest) or a 4+N byte structure: 4-byte header followed
    // by an optional fs-verity SHA-256 digest (32 bytes). Reject anything
    // that does not match either form.
    validate_metacopy_xattr(xattr)
}
The lookup-time check is defence-in-depth: the mount-time enforcement already
prevents metacopy=on from reaching OverlaySuperBlock::config on
user-namespace mounts, so ovl_lookup_metacopy would not be called. The
redundant check in ovl_lookup_metacopy protects against future code paths that
might bypass the mount-time gate.
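The length check performed by validate_metacopy_xattr can be sketched as follows. This is an illustration under stated assumptions, not the actual function: `MetacopySketch` and `validate_metacopy_sketch` are hypothetical names, the header bytes are carried through without interpretation, and only the three lengths implied by the format description (0, 4, and 4 + 32 bytes) are accepted.

```rust
/// Size of the fixed metacopy header, per the format described above.
const HEADER_LEN: usize = 4;
/// Size of a SHA-256 fs-verity digest.
const SHA256_LEN: usize = 32;

/// Hypothetical parsed form of a metacopy xattr value.
#[derive(Debug, PartialEq, Eq)]
pub enum MetacopySketch {
    /// Empty value: legacy format, no digest.
    Legacy,
    /// Header only, no digest.
    Header([u8; HEADER_LEN]),
    /// Header followed by a SHA-256 digest.
    WithDigest {
        header: [u8; HEADER_LEN],
        digest: [u8; SHA256_LEN],
    },
}

/// Accepts only the lengths named in the format description; anything
/// else is rejected. Header fields are not interpreted in this sketch.
pub fn validate_metacopy_sketch(value: &[u8]) -> Option<MetacopySketch> {
    let total = HEADER_LEN + SHA256_LEN;
    if value.is_empty() {
        return Some(MetacopySketch::Legacy);
    }
    if value.len() != HEADER_LEN && value.len() != total {
        return None;
    }
    let mut header = [0u8; HEADER_LEN];
    header.copy_from_slice(&value[..HEADER_LEN]);
    if value.len() == HEADER_LEN {
        return Some(MetacopySketch::Header(header));
    }
    let mut digest = [0u8; SHA256_LEN];
    digest.copy_from_slice(&value[HEADER_LEN..]);
    Some(MetacopySketch::WithDigest { header, digest })
}
```

A production validator would presumably also check the header's version and algorithm fields; those checks are omitted here because the field layout is not specified in this section.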
Userxattr mode and data-only layers
When userxattr=on is set (required for user-namespace mounts), user.overlay.*
xattrs are used throughout. The user.overlay.redirect xattr controls directory
rename semantics and, in data-only layer configurations, points metacopy stubs to
their data sources. Because user.* xattrs are writable by the file owner, and
because data-only layer configurations allow a metacopy file in one lower layer to
redirect to a file in a data-only lower layer via user.overlay.redirect:
- redirect_dir=on is disallowed for user-namespace-influenced mounts (forced to off at mount time).
- Data-only lower layers are disallowed for user-namespace-influenced mounts: OverlayFs::mount() returns EPERM if any lower layer path is specified with the :: data-only separator syntax when userns_influenced is true.
These restrictions prevent the user.overlay.redirect xattr from being used to
point a metacopy stub in one layer at a file in another layer that the container
would not otherwise be able to access.
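The mount-time policy for user-namespace-influenced mounts can be summarized as a pure function over the parsed options. The sketch below is illustrative only: `MountOptsSketch`, `MountErrorSketch`, and `apply_userns_policy` are hypothetical names, and `redirect_dir` is reduced to a boolean rather than the RedirectDirMode enum.

```rust
/// Hypothetical, reduced view of the parsed overlay mount options.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MountOptsSketch {
    pub metacopy: bool,
    pub redirect_dir: bool,
    pub userxattr: bool,
    /// True if any lower layer was given with the `::` data-only separator.
    pub has_data_only_layers: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum MountErrorSketch {
    Eperm,
}

/// Applies the restrictions described above. Mounts from the initial
/// user namespace pass through unchanged.
pub fn apply_userns_policy(
    mut opts: MountOptsSketch,
    userns_influenced: bool,
) -> Result<MountOptsSketch, MountErrorSketch> {
    if !userns_influenced {
        return Ok(opts);
    }
    // userxattr mode is mandatory: trusted.* is unreachable from a user NS.
    if !opts.userxattr {
        return Err(MountErrorSketch::Eperm);
    }
    // Data-only layers rely on owner-writable redirect xattrs: reject.
    if opts.has_data_only_layers {
        return Err(MountErrorSketch::Eperm);
    }
    // Forced off regardless of what the caller requested.
    opts.metacopy = false;
    opts.redirect_dir = false;
    Ok(opts)
}
```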
Summary of security invariants
| Condition | trusted.overlay.* metacopy | user.overlay.* metacopy |
|---|---|---|
| Initial user namespace mount, metacopy=on | Trusted (forging requires host CAP_SYS_ADMIN) | N/A (userxattr not used in privileged mounts by default) |
| User-namespace mount | N/A (trusted.* inaccessible from user NS) | Disabled (forced off at mount time; ovl_lookup_metacopy returns None) |
| User-namespace mount, userxattr=on, data-only layers | N/A | Rejected at mount time (EPERM) |
14.8.7 Directory Operations¶
Readdir merge: Reading a merged directory (one that exists in both upper and lower layers) requires combining entries from all layers, excluding whiteouts and applying opaque directory semantics.
OverlayFileOps::readdir(inode, private, offset, emit) -> Result<()>:
    let oi = get_overlay_inode(inode)

    // Phase 1: Collect entries from upper layer.
    let mut seen: HashSet<OsString> = HashSet::new()
    if let Some(upper) = oi.upper.load(Acquire) {
        underlying_readdir(upper, |entry_inode, entry_off, ftype, name| {
            // Skip whiteout entries: they indicate deleted lower entries.
            if is_whiteout_entry(entry_inode) {
                seen.insert(name.to_owned())  // Track for lower suppression.
                return true  // Continue iteration.
            }
            seen.insert(name.to_owned())
            emit(overlay_inode_for(entry_inode), entry_off, ftype, name)
        })
    }

    // Phase 2: If directory is opaque, stop here. Lower entries are hidden.
    if oi.opaque {
        return Ok(())
    }

    // Phase 3: Collect entries from lower layers, skipping duplicates.
    for lower_ref in lower_dirs_for(oi) {
        underlying_readdir(lower_ref.inode, |entry_inode, entry_off, ftype, name| {
            // Skip entries already seen in upper or higher lower layers.
            if seen.contains(name) {
                return true
            }
            // Skip whiteout entries from lower layers too.
            if is_whiteout_entry(entry_inode) {
                seen.insert(name.to_owned())
                return true
            }
            seen.insert(name.to_owned())
            emit(overlay_inode_for(entry_inode), entry_off, ftype, name)
        })
    }
    Ok(())
Deduplication uses byte-exact filename comparison. Case-insensitive upper/lower layer combinations may produce duplicate entries with different casing. This matches Linux overlayfs behavior.
Readdir caching: The merged directory listing is cached in the overlay file's
private state (returned by open()) for the lifetime of the open directory file
descriptor. This matches Linux's behavior: the merge is computed once per
opendir() and subsequent readdir() calls return entries from the cache. The
cache is invalidated on rewinddir() (seek to offset 0).
Performance note on seen HashSet: The HashSet<OsString> in the pseudocode above
is allocated once per opendir() call (during the initial merge), not once per
readdir() call. The cache stores the deduplicated entry list; subsequent readdir()
calls walk the already-merged cache without re-allocating or re-hashing. For large
directories (>10,000 entries), the initial opendir() merge is O(N) with one
allocation per distinct entry name (stored in the HashSet during merge, then released
when the merge completes and entries are stored in a flat Vec in the file private
state). The hot path — repeated readdir() calls iterating through the cached Vec —
is O(entries) with zero heap allocations.
Bound: The HashSet is bounded by the sum of directory entries across all layers
for this single directory (typically <10,000 in container images; capped by the
filesystem's max directory entries). This is a warm-path allocation (once per
opendir(), bounded by directory size) and is acceptable per the collection usage policy
(Section 3.13).
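The one-time merge that fills the readdir cache can be modelled in plain Rust. This is a simplified sketch, not the kernel code: `EntrySketch` and `merge_layers` are hypothetical, names are `String` rather than `OsString`, and each layer is an in-memory slice instead of an underlying readdir callback.

```rust
use std::collections::HashSet;

/// Hypothetical directory entry as seen in a single layer.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EntrySketch {
    pub name: String,
    pub whiteout: bool,
}

/// Merges one upper layer and any number of lower layers (topmost first)
/// into the flat list that would be cached for the open directory.
pub fn merge_layers(
    upper: &[EntrySketch],
    upper_opaque: bool,
    lowers: &[Vec<EntrySketch>],
) -> Vec<String> {
    // `seen` lives only for the duration of the merge.
    let mut seen: HashSet<String> = HashSet::new();
    let mut merged = Vec::new();

    let mut absorb = |layer: &[EntrySketch]| {
        for e in layer {
            // First occurrence wins; a whiteout also claims the name so
            // that lower layers cannot re-introduce it.
            if !seen.insert(e.name.clone()) {
                continue;
            }
            if !e.whiteout {
                merged.push(e.name.clone());
            }
        }
    };

    absorb(upper);
    // An opaque upper directory hides every lower entry.
    if !upper_opaque {
        for layer in lowers {
            absorb(layer);
        }
    }
    merged
}
```

Subsequent readdir() calls would then iterate the returned Vec by offset without touching the HashSet again.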
Directory rename (redirect_dir=on): When a merged directory is renamed,
overlayfs cannot rename the lower-layer directory (it is read-only). Instead:
- Create the new directory name in the upper layer.
- Set the trusted.overlay.redirect xattr on the new upper directory, containing the absolute path (from the overlay root) of the original lower directory. Maximum redirect path: 256 bytes. Encoding: raw bytes (the underlying filesystem's filename encoding, typically UTF-8). No escaping; path components are separated by /. Paths exceeding 256 bytes cause the rename to fall back to full copy-up of the directory tree (no redirect xattr is set; the renamed directory becomes an opaque copy). This 256-byte limit is an UmkaOS implementation choice (Linux has no specific limit). The fallback to full directory copy-up preserves correctness.
- Lookups for the renamed directory follow the redirect: when searching lower layers, use the redirect path instead of the current name.
- Create a whiteout at the old name to hide the lower-layer original.
Opaque directory creation (rmdir + mkdir of same name):
- Create whiteout or opaque directory in upper layer.
- Set trusted.overlay.opaque xattr to "y" on the new upper directory.
- All lower-layer entries under this path are hidden.
14.8.8 Whiteout and Deletion¶
When a file or directory is deleted from a merged view, overlayfs must hide the lower-layer entry without modifying the lower layer:
File deletion (unlink on a merged file):
1. If the file exists in upper: remove the upper entry via underlying_unlink().
2. If the file exists in any lower layer: create a whiteout in the upper layer
at the same path.
3. Invalidate the dentry cache entry.
Directory deletion (rmdir on a merged directory):
1. Verify the merged view of the directory is empty (no entries from any layer
that are not whiteouts). Return ENOTEMPTY if non-empty.
2. If the directory exists in upper: remove it.
3. If the directory exists in lower: create an opaque whiteout in upper.
Whiteout creation:
/// Create a whiteout entry in the upper layer.
///
/// UmkaOS uses the xattr-based whiteout format by default: a zero-size
/// regular file with the overlay whiteout xattr set. This avoids
/// requiring mknod(2) capability (character device 0:0 creation
/// requires CAP_MKNOD in the filesystem's user namespace).
///
/// For compatibility, the character-device whiteout format is also
/// recognized on read (lookup).
fn create_whiteout(upper_parent: InodeId, name: &OsStr) -> Result<()> {
    let sb = overlay_super_block();
    // Create zero-size regular file.
    let whiteout = underlying_create(upper_parent, name,
        FileMode::regular(0o000))?;
    // Set the whiteout xattr.
    underlying_setxattr(whiteout,
        concat(sb.xattr_prefix, "whiteout"), b"y", XattrFlags::CREATE)?;
    Ok(())
}
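Because both whiteout formats must be recognized on lookup, the read side needs a predicate that accepts either one. The following is a hedged sketch of such a check: `FileKindSketch`, `UpperEntrySketch`, and `is_whiteout_sketch` are hypothetical, and the xattr value is assumed to be already fetched from the overlay-private namespace.

```rust
/// Hypothetical, reduced file type for an upper-layer entry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileKindSketch {
    Regular,
    CharDevice { major: u32, minor: u32 },
    Other,
}

/// Hypothetical snapshot of the attributes needed to classify an entry.
#[derive(Debug, Clone)]
pub struct UpperEntrySketch {
    pub kind: FileKindSketch,
    pub size: u64,
    /// Value of the overlay whiteout xattr, if present.
    pub whiteout_xattr: Option<Vec<u8>>,
}

/// Returns true for either whiteout format described above.
pub fn is_whiteout_sketch(e: &UpperEntrySketch) -> bool {
    match e.kind {
        // Legacy format: character device with device number 0:0.
        FileKindSketch::CharDevice { major: 0, minor: 0 } => true,
        // Default format: zero-size regular file whose whiteout xattr is "y".
        FileKindSketch::Regular => {
            e.size == 0 && e.whiteout_xattr.as_deref() == Some(&b"y"[..])
        }
        _ => false,
    }
}
```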
RENAME_WHITEOUT integration: The VFS rename() with RENAME_WHITEOUT flag
(already supported in InodeOps::rename(), Section 14.1) atomically renames a
file and creates a whiteout at the old name. overlayfs uses this during copy-up
of directory entries: when a file is copied from lower to upper, the old lower
path is hidden by a whiteout created atomically with the rename.
14.8.9 Volatile Mode¶
Volatile mode disables all durability guarantees for the upper layer. This is a deliberate trade-off for ephemeral container workloads.
Behavior:
- fsync(), fdatasync(), and sync_fs() on overlay files are no-ops (return
success without calling the underlying filesystem's sync).
- On mount with volatile=true, create the sentinel directory
$workdir/work/incompat/volatile/.
- On unmount, remove the sentinel directory (clean shutdown).
- On next mount, if the sentinel exists, return EINVAL with a diagnostic
message: the previous volatile session was not cleanly unmounted, and the
upper/work directories may be inconsistent. The operator must delete upper
and work directories and recreate them.
- After any writeback error on the upper filesystem, subsequent fsync() calls
on overlay files return EIO persistently (matching Linux's error stickiness
behavior from Section 15.1).
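The sentinel lifecycle can be modelled from userspace with std::fs. This sketch only mirrors the directory layout and the clean/unclean decision; `volatile_mount` and `volatile_unmount` are hypothetical helpers, and `io::ErrorKind::InvalidInput` stands in for EINVAL.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Location of the sentinel relative to the overlay workdir.
fn sentinel_path(workdir: &Path) -> PathBuf {
    workdir.join("work").join("incompat").join("volatile")
}

/// Models the mount-time check: refuse if a previous volatile session
/// left its sentinel behind, otherwise create the sentinel.
pub fn volatile_mount(workdir: &Path) -> io::Result<()> {
    let sentinel = sentinel_path(workdir);
    if sentinel.exists() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "previous volatile session was not cleanly unmounted",
        ));
    }
    fs::create_dir_all(&sentinel)
}

/// Models a clean unmount: remove the sentinel directory.
pub fn volatile_unmount(workdir: &Path) -> io::Result<()> {
    fs::remove_dir(sentinel_path(workdir))
}
```

A crash between `volatile_mount` and `volatile_unmount` leaves the sentinel in place, which is exactly what the next mount attempt detects.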
Container runtime usage: Docker enables volatile mode for containers started
with --storage-opt overlay2.volatile=true. This is common for CI/CD runners,
build containers, and test environments where container state is discarded after
each run.
14.8.10 Extended Attribute Handling¶
overlayfs must handle xattrs carefully because it uses private xattrs for internal bookkeeping (whiteouts, metacopy, redirects, opaque markers) and must pass through user-visible xattrs correctly.
Xattr namespace partitioning:
| Namespace | Behavior |
|---|---|
| trusted.overlay.* (or user.overlay.* in userxattr mode) | Internal: overlay-private. Not visible to userspace via listxattr()/getxattr(). Used for whiteout, opaque, metacopy, redirect, origin markers. |
| security.* | Pass-through with copy-up: Copied from lower to upper during copy-up. setxattr() triggers copy-up. Includes security.selinux, security.capability (file caps), security.ima. |
| system.posix_acl_access, system.posix_acl_default | Pass-through with copy-up: POSIX ACLs are copied during copy-up. setfacl triggers copy-up. |
| user.* (excluding user.overlay.* in userxattr mode) | Pass-through with copy-up: User-defined xattrs. Copied during copy-up. |
| trusted.* (excluding trusted.overlay.*) | Pass-through with copy-up: Only accessible to CAP_SYS_ADMIN processes. Copied during copy-up. |
getxattr/setxattr dispatch:
OverlayInodeOps::getxattr(inode, name, buf) -> Result<usize>:
    // Block access to overlay-private xattrs.
    if name.starts_with(overlay_xattr_prefix()) {
        return Err(ENODATA)
    }
    // Serve from upper if available, otherwise from lower.
    let target = upper_or_lower(inode)
    underlying_getxattr(target, name, buf)

OverlayInodeOps::setxattr(inode, name, value, flags) -> Result<()>:
    // Block writes to overlay-private xattrs.
    if name.starts_with(overlay_xattr_prefix()) {
        return Err(EPERM)
    }
    // setxattr triggers copy-up (xattr must be set on upper).
    let upper = copy_up(inode)?
    underlying_setxattr(upper, name, value, flags)

OverlayInodeOps::listxattr(inode, buf) -> Result<usize>:
    // List xattrs from upper (if exists) or lower.
    // Filter out overlay-private xattrs from the result.
    let target = upper_or_lower(inode)
    let raw = underlying_listxattr(target, buf)?
    filter_out_overlay_xattrs(buf, raw)
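The filtering step in listxattr operates on the usual listxattr(2) buffer layout: attribute names concatenated, each terminated by a NUL byte. A minimal sketch of that filter follows; `filter_overlay_names` is a hypothetical helper that returns a new buffer, whereas the pseudocode above appears to filter in place.

```rust
/// Removes every name that starts with `prefix` from a listxattr-style
/// buffer (NUL-terminated names, concatenated). Returns the filtered
/// buffer in the same layout.
pub fn filter_overlay_names(raw: &[u8], prefix: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(raw.len());
    for name in raw.split(|&b| b == 0).filter(|n| !n.is_empty()) {
        // Overlay-private xattrs are never exposed to userspace.
        if name.starts_with(prefix) {
            continue;
        }
        out.extend_from_slice(name);
        out.push(0);
    }
    out
}
```

The length of the returned buffer is the value listxattr would report to the caller.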
Nested overlayfs: When overlayfs is mounted on top of another overlayfs
(nested container images, uncommon but valid), the inner overlay's xattrs must
not collide with the outer overlay's. Linux handles this via "xattr escaping":
the inner overlay stores its xattrs under trusted.overlay.overlay.* instead
of trusted.overlay.*. UmkaOS implements the same escaping mechanism. This is
transparent to the filesystem — the inner overlay simply uses a longer prefix.
14.8.11 statfs Behavior¶
OverlayFs::statfs() returns statistics from the upper layer's filesystem (if
present). For read-only overlays (no upper), statistics from the topmost lower
layer are returned. This matches Linux behavior and ensures that df on a
container's root filesystem shows the available space on the writable layer.
14.8.12 Inode Number Composition (xino)¶
To guarantee unique inode numbers across the merged view, overlayfs composes inode numbers from the underlying filesystem's inode number and the layer index:
Where xino_bits is the number of bits available for the underlying inode
(typically 32 for ext4 with default inode sizes). This ensures that
stat() returns unique inode numbers for files from different layers that
happen to share the same underlying inode number (common when layers are on
the same filesystem).
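The composition formula itself is not reproduced in this text, so the following sketch shows one plausible reading under a stated assumption: the layer index occupies the bits above xino_bits and the underlying inode number occupies the low xino_bits bits. `compose_xino` is a hypothetical helper; returning `None` models the fallback case where the underlying inode number does not fit.

```rust
/// Composes an overlay inode number from a layer index and the
/// underlying inode number. Assumed layout (not confirmed by the text):
/// high bits = layer index, low `xino_bits` bits = underlying inode.
pub fn compose_xino(layer_index: u32, real_ino: u64, xino_bits: u32) -> Option<u64> {
    if xino_bits == 0 || xino_bits >= 64 {
        return None;
    }
    // The underlying inode number must fit in the low bits.
    if real_ino >> xino_bits != 0 {
        return None;
    }
    // The layer index must fit in the remaining high bits.
    if (layer_index as u64) >> (64 - xino_bits) != 0 {
        return None;
    }
    Some(((layer_index as u64) << xino_bits) | real_ino)
}
```

With `xino_bits = 32`, inode 42 on layer 0 and inode 42 on layer 1 map to distinct 64-bit values, which is the uniqueness property described above.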
When xino=off or when underlying inode numbers exceed the available bit width,
overlayfs falls back to using the underlying inode numbers directly. In this mode,
st_dev differs between upper and lower files (the VFS assigns a unique device
number per overlay mount), but st_ino may collide across layers. Applications
that rely on (st_dev, st_ino) pairs for file identity (e.g., tar, rsync,
find -inum) may exhibit incorrect behavior. xino=auto avoids this by
enabling composition only when it is safe.
14.8.13 Mount and Unmount Flow¶
Mount:
OverlayFs::mount(source, flags, data) -> Result<SuperBlock>:
1. Parse mount options from `data` into `OverlayMountOptions`.
2. Determine user-namespace influence (security policy, Section 14.8.6.1):
userns_influenced = (current_user_ns() != &init_user_ns)
If userns_influenced:
a. Force options.metacopy = false.
Force options.redirect_dir = RedirectDirMode::Off.
Log: "overlayfs: metacopy and redirect_dir disabled for
user-namespace mount (CVE mitigation, Section 14.8.6.1)"
b. Require options.userxattr == true. If not set, return EPERM.
(User-namespace mounts cannot use trusted.overlay.* xattrs.)
c. If any lower_dir entry uses the data-only '::' separator syntax:
return EPERM. (Data-only layers with userxattr are disallowed
because user.overlay.redirect is owner-writable.)
3. Resolve each lower_dir path to an InodeId via VFS path lookup.
Verify each is a directory. Hold references for mount lifetime.
4. If upper_dir is set:
a. Resolve upper_dir to InodeId. Verify it is a writable directory.
b. Resolve work_dir to InodeId. Verify same superblock as upper_dir.
c. Check work_dir is empty.
d. Create `$workdir/work/` subdirectory if it does not exist.
e. If volatile mode:
- Check for `$workdir/work/incompat/volatile/` sentinel.
If exists: return EINVAL ("previous volatile session unclean").
- Create the sentinel directory.
f. If nfs_export: create `$workdir/index/` subdirectory.
g. Clean stale temporary files from workdir (names starting with
`#overlay.`). These are remnants of interrupted copy-ups.
5. Verify upper filesystem supports required operations:
- xattr support (getxattr/setxattr succeed with overlay prefix).
- rename with RENAME_WHITEOUT (test with a dummy file in workdir).
6. Construct `OverlaySuperBlock` with userns_influenced as determined
in step 2, and `SuperBlock`.
7. Register overlay dentry ops with the VFS.
8. Emit mount options for /proc/mounts via show_options().
Unmount:
OverlayFs::unmount(sb) -> Result<()>:
1. If volatile mode: remove sentinel directory
`$workdir/work/incompat/volatile/`.
2. Flush and release upper layer (must happen FIRST):
a. Sync all dirty pages and metadata in the upper filesystem's
writeback queue. This ensures that any copy-up data, whiteouts,
and metadata changes written to the upper layer are on stable
storage before the upper SuperBlock reference is released.
Uses `sync_filesystem(upper_sb)` which issues a full barrier.
b. Release the upper directory InodeId reference. This decrements
the upper SuperBlock's mount reference count.
c. Release the workdir InodeId reference (same SuperBlock as upper).
The upper layer MUST be flushed and released before the lower layers
because:
- Dirty data in the upper layer may reference inodes from lower
layers (metacopy files whose data still resides on a lower layer).
If a lower SuperBlock were dropped first, its block device could
be detached, making the lower data unreachable and causing I/O
errors during upper flush.
- The upper filesystem's journal commit may reference lower-layer
block addresses (in filesystems like ext4 where the journal
records physical block numbers). Releasing the lower device
before journal commit would corrupt the journal.
3. Release lower layer references (in reverse stacking order,
topmost first):
a. For each lower layer (from layer N down to layer 1):
- Release the lower directory InodeId reference.
- Decrement the lower SuperBlock's mount reference count.
b. Reverse order ensures that if multiple lower layers share a
SuperBlock (uncommon but valid), the last reference is released
on the final iteration, not mid-traversal.
c. Lower layers are read-only — no flush is needed. Their data
is immutable for the lifetime of the overlay mount.
4. Drop the OverlaySuperBlock (overlay's own VFS superblock metadata).
At this point, all underlying filesystem references have been
released. If this was the last mount referencing an underlying
filesystem, that filesystem's kill_sb() is triggered, which
flushes its own metadata and releases the block device.
Race with concurrent unmount of underlying filesystems: The VFS mount
reference counting prevents an underlying filesystem from being unmounted
while the overlay holds references to its inodes. An umount of the lower
or upper filesystem while the overlay is mounted returns EBUSY (the
overlay's InodeId references pin the underlying SuperBlock). This is
identical to Linux behavior.
MountNamespace teardown cleanup order: During MountNamespace teardown,
overlayfs cleanup follows the same ordering as explicit unmount: (1) flush
pending copy-ups, (2) remove workdir temporary files, (3) unmount upper layer,
(4) unmount lower layers. Workdir cleanup MUST precede upper unmount — the
workdir is on the upper filesystem. If the workdir is cleaned after upper
unmount, the workdir files become inaccessible and leak storage until the next
fsck on the underlying filesystem.
14.8.14 Performance Characteristics¶
| Operation | Overhead vs. direct filesystem access | Notes |
|---|---|---|
| Path lookup (cached) | +1 hash lookup per component | Overlay dentry points to underlying dentry |
| Read (lower-only file) | ~0% | Direct delegation to lower filesystem |
| Read (upper file) | ~0% | Direct delegation to upper filesystem |
| Read (metacopy file) | ~0% | Reads from lower, same as lower-only |
| Write (upper file) | ~0% | Direct delegation to upper filesystem |
| Write (first write, copy-up) | O(file_size) one-time | Sequential read+write of file data |
| Write (metacopy first write) | O(file_size) one-time | Deferred from container startup |
| chmod/chown (metacopy) | O(1) ~10μs | Metadata-only copy-up (no data copy) |
| chmod/chown (no metacopy) | O(file_size) | Full copy-up triggered |
| readdir (merged) | O(entries × layers) | Hash-based dedup over all layers |
| stat (cached) | ~0% | Overlay inode cached in VFS |
Container startup optimization: With metacopy enabled, pulling and starting a container image avoids copying any file data during the initial setup phase (only metadata operations occur: chmod, chown, symlink creation for the container's init process). Data is copied lazily on first write. For typical container images (200-500 MB of layers), this reduces container start time from seconds to tens of milliseconds for the filesystem setup phase.
14.8.15 dm-verity Integration for Container Image Layers¶
Read-only lower layers in a container overlay can be protected by dm-verity (Section 9.3). The container runtime mounts each image layer's block device with dm-verity verification, then stacks them as overlayfs lower layers:
Container image mount sequence:
1. Pull image layers: layer1.img, layer2.img, ..., layerN.img
2. For each layer:
a. Set up dm-verity on the layer's block device (Merkle tree
verification, Section 9.2.6)
b. Mount the verified block device read-only (ext4/XFS)
3. Mount overlayfs:
mount -t overlay overlay \
-o lowerdir=/mnt/layerN:...:/mnt/layer1,upperdir=...,workdir=... \
/container/rootfs
This provides block-level integrity verification for all read-only container layers. The writable upper layer is covered by IMA (Section 9.5) for runtime integrity measurement of modified files. Together, dm-verity (lower layers) + IMA (upper layer) provide complete integrity coverage for container filesystems.
The optional verity=require mount option (Section 14.8) provides an
additional layer of verification at the overlayfs level using fs-verity digests,
independent of dm-verity block device verification.
14.8.16 Linux Compatibility¶
overlayfs is compatible with Linux's overlayfs at the mount interface and xattr format level:
- Upper and lower directories created by Linux overlayfs are mountable by UmkaOS and vice versa. The xattr format (trusted.overlay.* names and values) is identical.
- Mount option syntax matches Linux exactly (-o lowerdir=...,upperdir=...,workdir=...).
- Whiteout formats (both character device 0:0 and xattr-based) are recognized.
- Metacopy xattr format is compatible: layers created with metacopy=on on Linux work on UmkaOS.
- redirect_dir xattr format and path encoding match Linux.
- /proc/mounts output format matches Linux for container introspection tools.
- /sys/module/overlay/parameters/* is not emulated (UmkaOS does not use kernel modules); per-mount options in the mount command are the sole configuration mechanism.
Docker/containerd/Podman compatibility: These runtimes interact with
overlayfs exclusively through the mount(2) syscall and standard file operations.
They do not use any overlayfs-specific ioctls or sysfs interfaces. UmkaOS's
implementation of mount("overlay", ...) with the standard option string is
sufficient for full compatibility. The overlay2 storage driver in Docker and
the overlayfs snapshotter in containerd are fully supported.
14.9 binfmt_misc — Arbitrary Binary Format Registration¶
binfmt_misc is a VFS-level mechanism that allows userspace to register handlers
for arbitrary binary formats, identified by magic bytes or file extension. When the
kernel's exec path attempts to start a file and neither the native ELF handler nor
the #! script handler matches, the kernel delegates to a registered binfmt_misc
interpreter. The registered interpreter binary is invoked with the original file
path as an additional argument.
Critical use cases:
- Multi-architecture containers: qemu-aarch64-static is registered as the interpreter for AArch64 ELF binaries, identified by the AArch64 ELF magic header. This allows running unmodified ARM64 Docker images on an x86-64 host without hardware virtualisation.
- Java: .jar files executed as if they were executables via a registration that maps the .jar extension to /usr/bin/java -jar.
- .NET: PE32+ executables identified by the MZ magic bytes are mapped to dotnet exec.
- Wine: 16-bit and 32-bit Windows PE files mapped to wine.
14.9.1 Data Structures¶
/// A single registered binfmt_misc entry.
/// Kernel-internal, not KABI or wire format. `Option<[u8; N]>` fields use
/// the discriminant to distinguish "magic match" from "extension match"
/// without requiring sentinel values in the array.
pub struct BinfmtMiscEntry {
    /// Registration name. Shown as the filename under the binfmt_misc mount.
    /// Alphanumeric, hyphen, and underscore only. NUL-terminated.
    pub name: [u8; 64],
    /// Matching strategy: magic bytes or file extension.
    pub match_type: BinfmtMatch,
    /// Magic bytes to compare against file content (BinfmtMatch::Magic only).
    /// Maximum 128 bytes. Length of `magic` and `mask` must be equal.
    /// Comparison is byte-by-byte (no endianness interpretation): each
    /// byte in the file at `magic_offset + i` is ANDed with `mask[i]` and
    /// compared to `magic[i]`. Multi-byte values embedded in magic patterns
    /// must be specified in the byte order they appear in the file.
    pub magic: Option<[u8; 128]>,
    /// Length of the valid portion of `magic` and `mask` arrays.
    pub magic_len: u8,
    /// Bitmask applied to each file byte before comparison with `magic`.
    /// A mask byte of `0xff` means "match exactly"; `0x00` means "ignore".
    pub mask: Option<[u8; 128]>,
    /// Byte offset within the file at which `magic` is compared.
    pub magic_offset: u16,
    /// File extension string (BinfmtMatch::Extension only).
    /// Case-sensitive. Does not include the leading `.`. NUL-terminated.
    /// 32 bytes: 31 chars + NUL. Covers all reasonable extensions.
    pub extension: Option<[u8; 32]>,
    /// Absolute path to the interpreter binary.
    pub interpreter: [u8; PATH_MAX],
    /// Behavioural flags.
    pub flags: BinfmtFlags,
    /// Whether this entry participates in exec matching.
    pub enabled: AtomicBool,
}

/// How the entry identifies matching binaries.
pub enum BinfmtMatch {
    /// Match by magic bytes at a fixed offset within the file.
    Magic,
    /// Match by the file extension of the executed path.
    Extension,
}

bitflags! {
    /// Behavioural flags for a binfmt_misc entry.
    pub struct BinfmtFlags: u32 {
        /// Pass the original filename as argv[0] to the interpreter instead
        /// of substituting the interpreter path.
        const PRESERVE_ARGV0 = 0x01;
        /// Open the binary file and pass it to the interpreter as an open fd
        /// (via `/proc/self/fd/N`). Required when the binary is not
        /// world-readable and the interpreter runs without elevated privilege.
        const OPEN_BINARY = 0x02;
        /// Use the credentials (uid, gid, capabilities) of the interpreter
        /// binary rather than those of the executed file. Equivalent to
        /// setuid execution for the interpreter.
        const CREDENTIALS = 0x04;
        /// Fix binary: the interpreter is not itself subject to further
        /// binfmt_misc or personality transformation. Prevents recursion.
        const FIX_BINARY = 0x08;
        /// Secure: do not grant elevated credentials even when the interpreter
        /// binary is setuid. Overrides CREDENTIALS for privilege de-escalation.
        const SECURE = 0x10;
    }
}

/// Maximum registered binfmt_misc entries. Real systems have fewer than 64;
/// this bound makes the table fixed-size and avoids heap allocation on the
/// exec hot path.
pub const MAX_BINFMT_MISC: usize = 64;
The global entry table is an RcuCell<ArrayVec<Arc<BinfmtMiscEntry>, MAX_BINFMT_MISC>>.
The exec path reads the table under an RCU read guard (lock-free) and performs a
bounded scan (at most MAX_BINFMT_MISC entries). Registration, enable/disable, and
removal are cold-path operations: the writer clones the current ArrayVec, applies
the modification, and publishes the new version via RcuCell::update() (RCU grace
period). The list is short in practice (fewer than 64 entries on any real system),
so O(N) scan cost is negligible relative to exec overhead.
14.9.2 Registration Interface¶
The binfmt_misc filesystem is mounted at /proc/sys/fs/binfmt_misc (also accessible
at /sys/kernel/umka/binfmt_misc/ via the umkafs namespace — see
Section 20.5). It exposes:
| Path | Type | Description |
|---|---|---|
| register | write-only file | Register a new entry |
| status | read/write file | 1 = all entries active; 0 = all disabled globally |
| <name>/enabled | read/write file | 1 enable, 0 disable, -1 remove this entry |
| <name> | read-only file | Shows entry details (flags, interpreter, magic/extension) |
Writing to register or any <name>/enabled file requires Capability::SysAdmin
in the caller's capability set.
Registration format (written as a single line to register):
:name:type:offset:magic:mask:interpreter:flags
Fields are separated by the same delimiter character as the leading :. Any
printable non-alphanumeric character may be used as the delimiter (allowing paths
that contain colons).
| Field | Description |
|---|---|
| name | Identifier: alphanumeric, -, _. Maximum 63 characters. |
| type | M for magic-byte match; E for extension match. |
| offset | Decimal byte offset for magic comparison (type M). 0 for most formats. |
| magic | Hex-escaped bytes for type M (e.g., \x7fELF). Extension string for type E. |
| mask | Hex-escaped bitmask for type M; same length as magic. Empty for type E. |
| interpreter | Absolute path to the interpreter binary. Must exist at registration time. |
| flags | Subset of POCFS: P = PRESERVE_ARGV0, O = OPEN_BINARY, C = CREDENTIALS, F = FIX_BINARY, S = SECURE. |
Example — registering QEMU user-mode for AArch64 ELF binaries on an x86-64 host:
:qemu-aarch64:M:0:\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00::qemu-aarch64-static:OC
- Type M, offset 0: compare 20 magic bytes starting at file byte 0.
- No mask: all bytes compared exactly (\xff mask is implied).
- O (OPEN_BINARY): interpreter receives file as fd, not path, for cross-uid access.
- C (CREDENTIALS): interpreter's credentials govern setuid semantics.
Parsing algorithm:
parse_registration(line: &[u8]) -> Result<BinfmtMiscEntry>:
1. delimiter = line[0]
2. Split line on delimiter into fields: [name, type, offset, magic_or_ext,
mask, interpreter, flags_str].
3. Validate name: alphanumeric + '-' + '_', length 1–63.
4. Parse type: 'M' → BinfmtMatch::Magic, 'E' → BinfmtMatch::Extension.
5. For type M:
a. Parse offset as decimal u16.
b. Decode hex-escaped bytes into magic array (max 128 bytes).
c. If mask non-empty: decode hex-escaped bytes; decoded length must equal magic.len().
d. If mask empty: fill mask with 0xff bytes (exact match).
6. For type E:
a. Validate extension: printable ASCII, no '/', no '.', max 31 chars.
b. Store extension without leading '.'.
7. Validate interpreter: starts with '/', exists in VFS (path lookup),
is a regular file with execute permission for at least one uid.
8. Parse flags_str: accept 'P', 'O', 'C', 'F', 'S' in any order.
9. Construct BinfmtMiscEntry with enabled = AtomicBool::new(true).
10. Clone current ArrayVec from RcuCell; reject if name already exists.
11. Push Arc<BinfmtMiscEntry> to cloned table; RCU-publish via RcuCell::update().
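Steps 1 to 8 of the parsing algorithm can be sketched in plain Rust as follows. This is a minimal illustration under stated assumptions, not the UmkaOS implementation: `ParsedEntry`, `hex_val`, and `decode_hex_escaped` are hypothetical names, the VFS lookup of the interpreter (step 7) is reduced to an absolute-path check, and the table update (steps 9 to 11) is left out.

```rust
/// Hypothetical, simplified result of parsing one registration line.
#[derive(Debug, PartialEq)]
pub struct ParsedEntry {
    pub name: String,
    pub is_magic: bool,
    pub offset: u16,
    pub magic: Vec<u8>, // magic bytes (type M) or extension bytes (type E)
    pub mask: Vec<u8>,  // same length as `magic` for type M, empty for type E
    pub interpreter: String,
    pub flags: String,
}

fn hex_val(c: u8) -> Option<u8> {
    (c as char).to_digit(16).map(|d| d as u8)
}

/// Decode `\xNN` escapes; every other byte is copied through unchanged.
fn decode_hex_escaped(s: &str) -> Result<Vec<u8>, &'static str> {
    let b = s.as_bytes();
    let (mut out, mut i) = (Vec::new(), 0);
    while i < b.len() {
        if b[i] != b'\\' {
            out.push(b[i]);
            i += 1;
            continue;
        }
        if i + 3 >= b.len() || b[i + 1] != b'x' {
            return Err("malformed escape");
        }
        match (hex_val(b[i + 2]), hex_val(b[i + 3])) {
            (Some(hi), Some(lo)) => out.push((hi << 4) | lo),
            _ => return Err("bad hex digit"),
        }
        i += 4;
    }
    Ok(out)
}

pub fn parse_registration(line: &str) -> Result<ParsedEntry, &'static str> {
    let line = line.trim_end_matches('\n');
    // Step 1: the first character selects the delimiter.
    let delim = line.chars().next().ok_or("empty line")?;
    if delim.is_ascii_alphanumeric() || !delim.is_ascii_graphic() {
        return Err("bad delimiter");
    }
    // Step 2: exactly seven fields follow the leading delimiter.
    let f: Vec<&str> = line[1..].split(delim).collect();
    if f.len() != 7 {
        return Err("expected 7 fields");
    }
    // Step 3: name is 1..=63 characters of [A-Za-z0-9_-].
    let name_ok = !f[0].is_empty()
        && f[0].len() <= 63
        && f[0].bytes().all(|c| c.is_ascii_alphanumeric() || c == b'-' || c == b'_');
    if !name_ok {
        return Err("bad name");
    }
    // Step 4: match type.
    let is_magic = match f[1] {
        "M" => true,
        "E" => false,
        _ => return Err("bad type"),
    };
    let (offset, magic, mask) = if is_magic {
        // Step 5: offset, magic, and mask (an empty mask means exact match).
        let offset: u16 = f[2].parse().map_err(|_| "bad offset")?;
        let magic = decode_hex_escaped(f[3])?;
        if magic.is_empty() || magic.len() > 128 {
            return Err("bad magic length");
        }
        let mask = if f[4].is_empty() { vec![0xff; magic.len()] } else { decode_hex_escaped(f[4])? };
        if mask.len() != magic.len() {
            return Err("mask length differs from magic length");
        }
        (offset, magic, mask)
    } else {
        // Step 6: extension without the leading '.'.
        let ext_ok = !f[3].is_empty()
            && f[3].len() <= 31
            && f[3].bytes().all(|c| c.is_ascii_graphic() && c != b'/' && c != b'.');
        if !ext_ok {
            return Err("bad extension");
        }
        (0, f[3].as_bytes().to_vec(), Vec::new())
    };
    // Step 7 (reduced): only the absolute-path rule is checked here.
    if !f[5].starts_with('/') {
        return Err("interpreter must be an absolute path");
    }
    // Step 8: flags in any order.
    if !f[6].chars().all(|c| "POCFS".contains(c)) {
        return Err("bad flag");
    }
    Ok(ParsedEntry {
        name: f[0].to_string(), is_magic, offset, magic, mask,
        interpreter: f[5].to_string(), flags: f[6].to_string(),
    })
}
```

Returning a plain `Result` with a static message keeps the sketch short; a kernel implementation would presumably map each failure to an errno and decode into fixed-size arrays rather than `Vec`.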
14.9.3 Exec Path Integration¶
During do_execve (Section 8.1), after the ELF handler and the
#! script handler both decline the binary (return ENOEXEC), the kernel calls
binfmt_misc_load_binary(file, argv, envp).
Matching algorithm:
binfmt_misc_load_binary(file, argv, envp) -> Result<()>:
1. Acquire RCU read guard on global entry table (lock-free).
2. If global status is disabled: return ENOEXEC.
3. Read a probe buffer of min(128 + max_magic_offset, 256) bytes from
offset 0 of `file`. This single read covers all registered magic ranges.
4. For each entry in table order (bounded by MAX_BINFMT_MISC = 64):
a. If !entry.enabled.load(Relaxed): skip.
b. If entry.match_type == Magic:
i. end = entry.magic_offset as usize + entry.magic_len as usize.
ii. If end > probe_buffer.len(): skip (file too short).
iii.For each byte i in 0..magic_len:
file_byte = probe[magic_offset + i] & mask[i]
if file_byte != magic[i] & mask[i]: break → no match
iv. If all bytes matched: entry is selected.
c. If entry.match_type == Extension:
i. Extract filename from argv[0] (last path component).
ii. If filename ends with '.' + extension (case-sensitive): entry is selected.
5. If no entry matched: drop RCU guard; return ENOEXEC.
6. Clone the matched entry (Arc clone, no copy of byte arrays).
7. Drop RCU read guard.
8. Build new argv:
a. If PRESERVE_ARGV0 set: new_argv = [interpreter, argv[0], argv[1..]]
b. Else: new_argv = [interpreter, original_file_path, argv[1..]]
c. If OPEN_BINARY set: pass file as open fd; prepend "/proc/self/fd/<N>"
in place of original_file_path.
9. If CREDENTIALS set: use interpreter binary's uid/gid/caps for the new exec.
10. If SECURE set: clear any setuid bits that CREDENTIALS would have applied.
11. Invoke do_execve recursively with interpreter path and new_argv.
If FIX_BINARY set: skip binfmt_misc matching in the recursive exec
(set a per-exec flag to prevent re-entry into this function).
Step 11's recursive do_execve processes the interpreter itself through the
normal ELF handler. QEMU user-mode binaries are statically linked ELF executables,
so the recursion terminates in one level.
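The per-entry comparison in step 4 reduces to two small predicates. The sketch below assumes the probe buffer has already been read; `magic_matches` and `extension_matches` are illustrative names rather than functions defined elsewhere in this chapter.

```rust
/// True when `probe[offset .. offset + magic.len()]` equals `magic` under
/// `mask`. A probe that is too short never matches (step 4.b.ii).
pub fn magic_matches(probe: &[u8], offset: usize, magic: &[u8], mask: &[u8]) -> bool {
    if magic.len() != mask.len() {
        return false;
    }
    let end = match offset.checked_add(magic.len()) {
        Some(end) if end <= probe.len() => end,
        _ => return false,
    };
    probe[offset..end]
        .iter()
        .zip(magic)
        .zip(mask)
        .all(|((&file, &want), &m)| (file & m) == (want & m))
}

/// True when the last path component ends with "." followed by `ext`
/// (case-sensitive, step 4.c).
pub fn extension_matches(path: &str, ext: &str) -> bool {
    let file_name = path.rsplit('/').next().unwrap_or(path);
    file_name
        .strip_suffix(ext)
        .map_or(false, |rest| rest.ends_with('.'))
}
```

Masking both sides of the comparison means bits cleared in the mask are ignored even if the registered magic has them set, which matches the `magic[i] & mask[i]` form used in the algorithm.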
14.9.4 The binfmt_misc Filesystem¶
binfmt_misc_fs is a minimal VFS filesystem type (FsType::BinfmtMisc) with the
following FsOps implementation:
impl FsOps for BinfmtMiscFs {
fn mount(&self, flags: MountFlags, _data: &[u8]) -> Result<Arc<SuperBlock>>;
fn statfs(&self, sb: &SuperBlock) -> Result<StatFs>;
}
impl InodeOps for BinfmtMiscDir {
fn lookup(&self, name: &OsStr) -> Result<Arc<Dentry>>;
fn iterate_dir(&self, ctx: &mut DirContext) -> Result<()>;
}
impl FileOps for BinfmtMiscRegister {
fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>; // parse_registration
}
impl FileOps for BinfmtMiscStatus {
fn read(&self, buf: &mut [u8], _offset: u64) -> Result<usize>; // "enabled\n" or "disabled\n"
fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>; // "1" / "0"
}
impl FileOps for BinfmtMiscEntryFile {
fn read(&self, buf: &mut [u8], _offset: u64) -> Result<usize>; // entry details
fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>; // "1" / "0" / "-1"
}
The filesystem has no on-disk backing store. All state lives in the in-kernel
RcuCell<ArrayVec<Arc<BinfmtMiscEntry>, MAX_BINFMT_MISC>>. Directory inodes are
synthesised dynamically: lookup reads the entry table under an RCU guard, scans
for a matching name, and returns a synthetic inode. iterate_dir emits register,
status, and all current entry names.
RCU synchronization on handler removal: When an entry is removed (write -1 to
the entry file), the removal sequence is:
1. Acquire the global binfmt_misc_lock (spinlock, serializes writers).
2. Create a new ArrayVec with the entry removed.
3. Publish via RcuCell::update() (rcu_assign_pointer semantics).
4. Release binfmt_misc_lock.
5. Call synchronize_rcu() to wait for all readers to complete.
6. Drop the old ArrayVec (releases the Arc<BinfmtMiscEntry>).
Step 5 is critical: without it, a concurrent execve() holding an RCU read lock
could still be matching against the removed entry. The synchronize_rcu() ensures
that by the time the Arc is dropped (and potentially the interpreter binary's
file reference released), all in-flight binfmt_misc_load_binary() calls have either
completed or moved past the entry table read. For FIX_BINARY entries (where the
interpreter file is pinned at registration time), the pinned file reference is
released only after the RCU grace period completes.
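The clone, modify, publish pattern behind the removal sequence can be illustrated with standard-library types. This is a sketch under loud assumptions: `Table` is a hypothetical stand-in for RcuCell, an `RwLock<Arc<...>>` replaces the RCU pointer, and the `Arc` reference count plays the role of the grace period (the old table is freed when the last reader drops its snapshot). It does not model synchronize_rcu() itself.

```rust
use std::sync::{Arc, Mutex, RwLock};

pub struct Entry {
    pub name: String,
}

/// Stand-in for the RCU-protected entry table.
pub struct Table {
    current: RwLock<Arc<Vec<Arc<Entry>>>>,
    writer_lock: Mutex<()>, // plays the role of binfmt_misc_lock
}

impl Table {
    pub fn new(entries: Vec<Arc<Entry>>) -> Self {
        Table { current: RwLock::new(Arc::new(entries)), writer_lock: Mutex::new(()) }
    }

    /// Reader side: take a cheap snapshot and scan it without further locking.
    pub fn snapshot(&self) -> Arc<Vec<Arc<Entry>>> {
        self.current.read().unwrap().clone()
    }

    /// Writer side: build a new table without `name` and publish it.
    /// Returns false when no entry with that name exists.
    pub fn remove(&self, name: &str) -> bool {
        let _writer = self.writer_lock.lock().unwrap(); // serialize writers
        let old = self.snapshot();
        if !old.iter().any(|e| e.name == name) {
            return false;
        }
        let new: Vec<Arc<Entry>> = old.iter().filter(|e| e.name != name).cloned().collect();
        *self.current.write().unwrap() = Arc::new(new); // publish
        true
        // `old` is dropped here; the removed entry stays alive for as long
        // as any reader still holds a snapshot that references it.
    }
}
```

A reader that took its snapshot before the removal keeps matching against the old table, which is the situation step 5 of the removal sequence has to wait out.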
Multiple mounts of the binfmt_misc filesystem share the same global entry table
(identical to Linux semantics). Unmounting does not clear registrations; entries
persist until explicitly removed via echo -1 > /proc/sys/fs/binfmt_misc/<name>/enabled
or until the kernel reboots.
Mount point: The standard location is /proc/sys/fs/binfmt_misc, mounted by
systemd-binfmt.service at early boot before loading entries from
/etc/binfmt.d/*.conf and /usr/lib/binfmt.d/*.conf.
14.9.5 Persistence and systemd Integration¶
The kernel holds registrations only in memory. Registrations are lost on reboot.
The systemd-binfmt.service unit re-registers all entries at each boot by reading
configuration files with the format:
# /etc/binfmt.d/qemu-aarch64.conf
:qemu-aarch64:M:0:\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00::qemu-aarch64-static:OC
Each non-comment, non-empty line is written verbatim to
/proc/sys/fs/binfmt_misc/register. Drop-in files in /usr/lib/binfmt.d/ are
processed first, then /etc/binfmt.d/ (higher priority). Conflicting entries with
the same name are rejected by the kernel (duplicate-name check in parse_registration).
14.9.6 Security Model¶
- Privilege: Writing to register or any enabled file requires Capability::SysAdmin. Unprivileged processes cannot add or modify entries.
- Interpreter credentials: By default (no CREDENTIALS flag), the interpreter runs with the calling process's credentials. The setuid bits of the interpreter binary are ignored. This prevents privilege escalation via a crafted binary whose magic bytes happen to match a setuid interpreter's registration.
- CREDENTIALS flag: Explicitly opts in to interpreter-binary credential inheritance. Should only be set for fully trusted interpreters.
- SECURE flag: When set alongside CREDENTIALS, strips any elevated privilege that would have been inherited. Useful for sandboxed interpreters.
- OPEN_BINARY flag: The kernel opens the binary file before constructing the new argv, so the interpreter receives an already-open fd. This allows the interpreter to read the file even when the binary is not world-readable (e.g., chmod 700 user-owned binaries run through QEMU on a shared host). The fd is passed as a /proc/self/fd/N path to remain compatible with interpreters that accept a file path argument.
- Recursion guard: The FIX_BINARY flag, combined with the per-exec recursion flag set in step 11 of Section 14.9.3, prevents pathological interpreter chains where an interpreter is itself a binfmt_misc-dispatched binary.
14.10 autofs — Kernel Automount Trigger¶
autofs is the kernel side of the automount subsystem. Its role is narrow: detect
access to a path that has not yet been mounted, suspend the filesystem lookup, notify
a userspace daemon, and resume the lookup after the daemon has performed the mount.
The kernel does not decide what to mount or where it comes from — that is
entirely the daemon's responsibility.
Used extensively by systemd through .automount units: lazy NFS home directories
(/home/$user), removable media (/media/disk), and network shares that should
only connect on demand.
14.10.1 Architecture¶
autofs registers a VFS filesystem type (FsType::Autofs). An autofs filesystem
instance covers a single mount point. Inside that mount point, the kernel may see
directory entries that are not yet backed by a real mount. When path resolution
(Section 14.1) traverses one of these
directories and finds DCACHE_NEED_AUTOMOUNT set on its dentry, it calls the
dentry's d_automount operation.
The two fundamental mount modes are:
| Mode | Description |
|---|---|
| indirect | autofs mount covers a directory; lookups of subdirectories trigger mounts. /nfs is autofs; accessing /nfs/fileserver triggers a mount of fileserver:/export onto /nfs/fileserver. |
| direct | The autofs mount point IS the trigger. Accessing the exact path (e.g., /mnt/backup) triggers the mount. |
14.10.2 Data Structures¶
/// State for one autofs filesystem instance (one mount point).
pub struct AutofsMount {
/// Pipe to the automount daemon. Kernel writes AutofsPacket messages here.
pub pipe: Arc<Pipe>,
/// Protocol version negotiated with the daemon (UmkaOS implements v5).
/// The daemon declares the range it accepts via the `minproto`/`maxproto`
/// mount options and reads the negotiated value back via the
/// `AUTOFS_IOC_PROTOVER` ioctl on the autofs mount fd. If the daemon's
/// maximum is < 5, the kernel responds with v4 compatibility packets (no
/// UID/GID/PID fields). If the daemon's maximum is > 5, the kernel uses v5
/// (the kernel never speaks a protocol newer than it implements). Version mismatch
/// logging: "autofs: daemon v{N}, kernel v5 — using v{min(N,5)}".
pub proto_version: u32,
/// Whether the daemon has declared itself gone (catatonic state).
pub catatonic: AtomicBool,
/// Idle timeout in seconds after which expire packets are sent.
pub timeout_secs: AtomicU32,
/// All outstanding lookup requests waiting for daemon response.
/// Keyed by token (u64). XArray provides O(1) lookup with internal
/// xa_lock for write serialization, replacing the external Mutex.
pub pending: XArray<Arc<AutofsPendingRequest>>,
/// Monotonically increasing token counter. Internal counter is u64
/// (exhaustion-proof: at 100 tokens/sec, wraps in 5.8 billion years).
/// The Linux ABI wire protocol (`AutofsPacketMissing::wait_queue_token`)
/// carries the low 32 bits only, so the wire token is the low 32 bits of
/// the counter, and the XArray is keyed by that wire token (u32,
/// zero-extended to u64 for XArray indexing). Lookup on daemon response is
/// O(1) via `pending.get(wire_token as u64)`. Collisions do not occur in
/// practice: at 100 tokens/sec the u32 space covers about 497 days of
/// tokens (2^32 / 100 seconds), while pending requests time out within
/// `timeout_secs` (typically 30-300 seconds).
pub next_token: AtomicU64,
/// Mount type: indirect or direct.
pub mount_type: AutofsMountType,
}
pub enum AutofsMountType {
Indirect,
Direct,
Offset, // Internal: used for sub-mounts within a multi-mount map.
}
/// One outstanding automount request.
pub struct AutofsPendingRequest {
/// Token echoed back in the daemon's IOC_READY / IOC_FAIL ioctl.
pub token: u32,
/// Path component that triggered the lookup (indirect) or full path (direct).
pub name: CString,
/// Sleeping callers blocked on this mount.
pub waitq: WaitQueue,
/// Result set by the daemon: Ok(()) on success, Err(errno) on failure.
pub result: OnceLock<Result<()>>,
}
/// Packet written to the daemon pipe for a missing mount (protocol v5).
/// Layout matches Linux `struct autofs_v5_packet`. 304 bytes on all
/// UmkaOS-supported architectures: the `ino: u64` field has 8-byte alignment
/// on all targets (ARMv7 AAPCS, PPC32 System V ABI, and all 64-bit ABIs),
/// so trailing padding is always 4 bytes (300 named → 304 aligned).
/// The daemon reads `mem::size_of::<AutofsPacketMissing>()` bytes from the pipe.
#[repr(C)]
pub struct AutofsPacketMissing {
pub hdr: AutofsPacketHdr, // offset 0, size 8
/// Token for AUTOFS_IOC_READY / AUTOFS_IOC_FAIL.
pub wait_queue_token: u32, // offset 8, size 4
/// Device number of the autofs mount.
pub dev: u32, // offset 12, size 4
/// Inode number of the trigger dentry.
pub ino: u64, // offset 16, size 8
/// UID of the process that triggered the lookup.
pub uid: u32, // offset 24, size 4
/// GID of the process that triggered the lookup.
pub gid: u32, // offset 28, size 4
/// PID of the thread that triggered the lookup (the thread group ID is
/// carried separately in `tgid`). Autofs wire protocol uses __u32 (not pid_t).
pub pid: u32, // offset 32, size 4
/// TGID of the triggering process. Autofs wire protocol uses __u32.
pub tgid: u32, // offset 36, size 4
/// Length of `name` (not including NUL). Autofs wire: __u32.
pub len: u32, // offset 40, size 4
/// Name of the missing directory component (NUL-terminated).
pub name: [u8; NAME_MAX + 1], // offset 44, size 256
// Named fields: 8+4+4+8+4+4+4+4+4+256 = 300 bytes.
// u64 alignment on all UmkaOS targets → 4 bytes trailing padding → 304.
}
// 304 on all UmkaOS-supported architectures: u64 has 8-byte alignment on
// ARMv7 (AAPCS), PPC32 (System V ABI), and all 64-bit ABIs. The 32-bit
// value of 300 would only apply on i386 (4-byte u64 alignment), which
// UmkaOS does not support.
const_assert!(size_of::<AutofsPacketMissing>() == 304);
/// Packet written to the daemon pipe requesting expiry of an idle mount.
/// Layout matches Linux `struct autofs_v5_packet` (304 bytes on all
/// UmkaOS-supported architectures, identical to AutofsPacketMissing). In
/// Linux, `autofs_packet_expire_indirect_t` and `autofs_packet_expire_direct_t`
/// are typedef aliases for `struct autofs_v5_packet`.
#[repr(C)]
pub struct AutofsPacketExpire {
pub hdr: AutofsPacketHdr,
pub wait_queue_token: u32,
pub dev: u32,
pub ino: u64,
pub uid: u32,
pub gid: u32,
pub pid: u32,
pub tgid: u32,
pub len: u32, // Autofs wire: __u32.
pub name: [u8; NAME_MAX + 1],
}
// Same reasoning as AutofsPacketMissing: 304 on all UmkaOS targets.
const_assert!(size_of::<AutofsPacketExpire>() == 304);
/// Common packet header.
/// Field types match Linux's `struct autofs_packet_hdr` exactly:
/// both `proto_version` and `type` are `int` (i32) in the Linux C struct.
#[repr(C)]
pub struct AutofsPacketHdr {
pub proto_version: i32,
pub packet_type: i32,
}
const_assert!(size_of::<AutofsPacketHdr>() == 8);
/// Autofs packet type constants. Values match Linux `auto_fs.h`.
/// The header's `packet_type` field is `i32` (not enum repr) for C ABI
/// compatibility. These constants are used for matching:
///
/// | Value | Name | Protocol | Description |
/// |-------|------|----------|-------------|
/// | 0 | Missing | v1 | Legacy missing (v1/v2 only) |
/// | 1 | Expire | v1 | Legacy expire (v1/v2 only) |
/// | 2 | ExpireMulti | v4 | Multi-mount expire |
/// | 3 | MissingIndirect | v5 | Indirect mount trigger |
/// | 4 | ExpireIndirect | v5 | Indirect mount expiry |
/// | 5 | MissingDirect | v5 | Direct mount trigger |
/// | 6 | ExpireDirect | v5 | Direct mount expiry |
///
/// Types 3-6 are required for v5 protocol. systemd dispatches on these values.
/// UmkaOS uses types 3-6 for v5 operation (types 0-1 only for v4 compat).
#[repr(i32)]
pub enum AutofsPacketType {
Missing = 0,
Expire = 1,
ExpireMulti = 2,
MissingIndirect = 3,
ExpireIndirect = 4,
MissingDirect = 5,
ExpireDirect = 6,
}
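The 300 to 304 byte padding argument can be checked mechanically. The sketch below re-declares the packet with the same field order purely for illustration; `NAMED_BYTES`, `trailing_padding`, `padded_size`, and `name_offset` are hypothetical helpers, and the expected values assume a target where `u64` is 8-byte aligned, as the comments above state for all UmkaOS targets.

```rust
use std::mem::{align_of, size_of};

pub const NAME_MAX: usize = 255;

#[repr(C)]
pub struct AutofsPacketHdr {
    pub proto_version: i32,
    pub packet_type: i32,
}

#[repr(C)]
pub struct AutofsPacketMissing {
    pub hdr: AutofsPacketHdr,
    pub wait_queue_token: u32,
    pub dev: u32,
    pub ino: u64,
    pub uid: u32,
    pub gid: u32,
    pub pid: u32,
    pub tgid: u32,
    pub len: u32,
    pub name: [u8; NAME_MAX + 1],
}

/// Sum of the named field sizes, ignoring padding:
/// hdr + token + dev + ino + five u32 fields + name.
pub const NAMED_BYTES: usize = 8 + 4 + 4 + 8 + 4 * 5 + (NAME_MAX + 1);

/// Bytes the compiler appends so the size is a multiple of the alignment.
pub fn trailing_padding() -> usize {
    size_of::<AutofsPacketMissing>() - NAMED_BYTES
}

/// `NAMED_BYTES` rounded up to the struct's alignment.
pub fn padded_size() -> usize {
    let align = align_of::<AutofsPacketMissing>();
    (NAMED_BYTES + align - 1) / align * align
}

/// Byte offset of `name`, measured on a live value.
pub fn name_offset() -> usize {
    let p = AutofsPacketMissing {
        hdr: AutofsPacketHdr { proto_version: 5, packet_type: 3 },
        wait_queue_token: 0, dev: 0, ino: 0, uid: 0, gid: 0, pid: 0, tgid: 0,
        len: 0,
        name: [0; NAME_MAX + 1],
    };
    (&p.name as *const _ as usize) - (&p as *const _ as usize)
}
```

On a target with 4-byte `u64` alignment the same code would report 300 bytes and no trailing padding, which is the i386 case the comments above exclude.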
14.10.3 Packetized Pipe Protocol¶
The autofs kernel-to-daemon communication channel is a packetized pipe: each
write() from the kernel writes exactly one complete packet
(mem::size_of::<AutofsPacketMissing>() bytes — 304 on all UmkaOS-supported
architectures), and each read() from the daemon must read exactly that many bytes
to consume one packet. The pipe is opened with O_DIRECT semantics (Linux pipe
O_DIRECT flag, since Linux 3.4) so that each write is delivered as one discrete
packet. A partial write never occurs as long as the packet size (304 bytes) does
not exceed PIPE_BUF, the POSIX atomicity threshold for pipe writes (4096 bytes on
Linux; POSIX requires at least 512). See Section 14.17 for the UmkaOS
pipe implementation.
If the pipe buffer is full (all slots occupied), the kernel's write() returns
-EAGAIN (the pipe fd is set to non-blocking mode by the daemon at setup). The
autofs trigger path converts this to -ENOMEM and returns to the caller — the
daemon is overloaded and cannot accept new mount requests.
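From the daemon's side, one read() yields one 304-byte packet that can be decoded with the offsets listed in Section 14.10.2. The following sketch is illustrative only: `decode_packet`, `Decoded`, and `PacketKind` are hypothetical names, fields are read in native byte order, and only the v5 packet types are accepted.

```rust
pub const PACKET_SIZE: usize = 304;

#[derive(Debug, PartialEq)]
pub enum PacketKind {
    MissingIndirect,
    ExpireIndirect,
    MissingDirect,
    ExpireDirect,
}

#[derive(Debug, PartialEq)]
pub struct Decoded {
    pub proto_version: i32,
    pub kind: PacketKind,
    pub token: u32,
    pub name: Vec<u8>,
}

/// Read a native-endian u32 at byte offset `off`.
fn u32_at(buf: &[u8], off: usize) -> u32 {
    u32::from_ne_bytes([buf[off], buf[off + 1], buf[off + 2], buf[off + 3]])
}

/// Decode one packet as read from the autofs pipe.
pub fn decode_packet(buf: &[u8]) -> Result<Decoded, &'static str> {
    // A packetized pipe delivers whole packets, so any other length is an error.
    if buf.len() != PACKET_SIZE {
        return Err("unexpected packet length");
    }
    let proto_version = u32_at(buf, 0) as i32; // hdr.proto_version, offset 0
    let kind = match u32_at(buf, 4) as i32 {   // hdr.packet_type, offset 4
        3 => PacketKind::MissingIndirect,
        4 => PacketKind::ExpireIndirect,
        5 => PacketKind::MissingDirect,
        6 => PacketKind::ExpireDirect,
        _ => return Err("not a v5 packet type"),
    };
    let token = u32_at(buf, 8);                // wait_queue_token, offset 8
    let len = u32_at(buf, 40) as usize;        // len, offset 40
    if len > 255 {
        return Err("name length exceeds NAME_MAX");
    }
    let name = buf[44..44 + len].to_vec();     // name, offset 44
    Ok(Decoded { proto_version, kind, token, name })
}
```

A real daemon would normally cast the buffer to the C struct from `<linux/auto_fs.h>`; explicit offsets are used here so the example stays free of `unsafe`.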
14.10.4 Automount Protocol¶
Trigger sequence (the fast path through VFS path resolution):
autofs_d_automount(dentry, path) -> Result<Option<Arc<Mount>>>:
Precondition: called from REF-walk (never RCU-walk; see Section 14.6.6).
1. Obtain the AutofsMount for this dentry's superblock.
2. If catatonic: return Err(ENOENT) immediately.
3. Check if `dentry` is already a mount point (DCACHE_MOUNTED set):
return Ok(None) — another thread raced and completed the mount.
4. Allocate token = next_token.fetch_add(1, Relaxed).
5. Construct AutofsPacketMissing with v5 packet type:
- indirect mode: `packet_type = AutofsPacketType::MissingIndirect`
- direct mode: `packet_type = AutofsPacketType::MissingDirect`
Set `{ hdr: { proto_version: 5, packet_type }, token, name = dentry.name or full path }`.
6. Insert Arc<AutofsPendingRequest> into pending table under token.
7. Write packet to pipe (non-blocking; if pipe is full, remove the request
from the pending table and return ENOMEM: the daemon is overloaded).
8. Sleep on the request's waitq with timeout = timeout_secs seconds.
9. On wake:
a. Remove request from pending table.
b. If result is Ok(()):
- Verify dentry is now a mount point (DCACHE_MOUNTED).
- Return Ok(None) (VFS follow_mount() will handle the new mount).
c. If result is Err(e): return Err(e).
10. On timeout:
a. Remove request from pending table.
b. Return Err(ETIMEDOUT).
Daemon response (via ioctl on the autofs pipe fd or mount point fd):
AUTOFS_IOC_READY(token: u32):
1. Acquire pending lock; look up token.
2. If not found: return ENXIO (stale token; request already timed out).
3. Set request.result = Ok(()).
4. Wake all waiters on request.waitq.
5. Remove from pending table.
AUTOFS_IOC_FAIL(token: u32):
1. Acquire pending lock; look up token.
2. If not found: return ENXIO.
3. Set request.result = Err(ENOENT).
4. Wake all waiters.
5. Remove from pending table.
Multiple callers may race to access the same missing path simultaneously. All of
them find the same AutofsPendingRequest in the pending table (inserted by the
first caller) and sleep on the same waitq. When the daemon responds, all waiters
wake together.
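The token handshake between the trigger path and the READY/FAIL ioctls can be modelled in user space. In this sketch, which is not the kernel implementation, a `HashMap` behind a `Mutex` stands in for the XArray, a `Condvar` stands in for the wait queue, timeouts are omitted, and `PendingTable`, `Pending`, and `wire_token` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

pub const ENOENT: i32 = 2;
pub const ENXIO: i32 = 6;

/// The wire token is the low 32 bits of the 64-bit counter.
pub fn wire_token(counter: u64) -> u32 {
    counter as u32
}

pub struct Pending {
    result: Mutex<Option<Result<(), i32>>>,
    wake: Condvar,
}

impl Pending {
    /// Block until the daemon answers (timeout handling omitted).
    pub fn wait(&self) -> Result<(), i32> {
        let mut guard = self.result.lock().unwrap();
        while guard.is_none() {
            guard = self.wake.wait(guard).unwrap();
        }
        (*guard).expect("result is set before the loop exits")
    }
}

pub struct PendingTable {
    next_token: Mutex<u64>,
    map: Mutex<HashMap<u32, Arc<Pending>>>,
}

impl PendingTable {
    pub fn new() -> Self {
        PendingTable { next_token: Mutex::new(0), map: Mutex::new(HashMap::new()) }
    }

    /// Trigger path, steps 4 and 6: allocate a token and register the request.
    pub fn begin(&self) -> (u32, Arc<Pending>) {
        let mut counter = self.next_token.lock().unwrap();
        let token = wire_token(*counter);
        *counter += 1;
        let req = Arc::new(Pending { result: Mutex::new(None), wake: Condvar::new() });
        self.map.lock().unwrap().insert(token, req.clone());
        (token, req)
    }

    /// AUTOFS_IOC_READY (Ok) or AUTOFS_IOC_FAIL (Err): set the result, wake
    /// all waiters, and remove the request. Unknown tokens yield ENXIO.
    pub fn complete(&self, token: u32, result: Result<(), i32>) -> Result<(), i32> {
        let req = self.map.lock().unwrap().remove(&token).ok_or(ENXIO)?;
        *req.result.lock().unwrap() = Some(result);
        req.wake.notify_all();
        Ok(())
    }
}
```

Because every waiter holds its own `Arc<Pending>`, removing the request from the table before the waiters run is safe: the result stays readable until the last waiter drops its reference.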
14.10.5 Control Interface¶
All autofs control operations are performed via ioctl(2) on the file descriptor
of the autofs pipe (passed to the kernel at mount time via the fd=N mount option)
or on a file descriptor opened on the autofs mount point itself.
| ioctl | Direction | Description |
|---|---|---|
| AUTOFS_IOC_READY | daemon→kernel | Mount succeeded for token. |
| AUTOFS_IOC_FAIL | daemon→kernel | Mount failed for token. |
| AUTOFS_IOC_CATATONIC | daemon→kernel | Daemon is exiting; all future lookups fail with ENOENT. |
| AUTOFS_IOC_PROTOVER | kernel→daemon | Returns protocol version (5 for UmkaOS). |
| AUTOFS_IOC_SETTIMEOUT | daemon→kernel | Sets idle expiry timeout in seconds. |
| AUTOFS_IOC_EXPIRE | kernel→daemon | Requests daemon to expire (unmount) one idle subtree. |
| AUTOFS_IOC_EXPIRE_MULTI | kernel→daemon | Requests daemon to expire up to N idle subtrees. |
| AUTOFS_IOC_EXPIRE_INDIRECT | kernel→daemon | Like EXPIRE but limited to indirect-mode subtrees. |
| AUTOFS_IOC_EXPIRE_DIRECT | kernel→daemon | Like EXPIRE but limited to direct-mode mount points. |
| AUTOFS_IOC_PROTOSUBVER | kernel→daemon | Returns protocol sub-version (UmkaOS: 6, matching Linux 5.4+). |
| AUTOFS_IOC_ASKUMOUNT | daemon→kernel | Query whether the autofs mount point can be unmounted. |
14.10.6 Expiry¶
After an autofs-triggered mount has been idle for timeout_secs seconds, the
kernel initiates expiry. Expiry is cooperative: the kernel asks the daemon to
consider unmounting; the daemon decides whether conditions are met (no processes
have open files under the mount, no active chdir into it) and issues umount(2)
if appropriate.
autofs_expire_run(mount: &AutofsMount):
Executed from a kernel timer callback at intervals of timeout_secs / 4.
1. Walk all mounts that are children of this autofs mount point.
2. For each child mount M:
a. Compute idle_time = now - M.last_access_time.
b. If idle_time < timeout_secs: skip.
c. If any process has an open fd into M's subtree (check mount's
open-file reference count): skip.
d. Allocate token = next_token.fetch_add(1, Relaxed).
e. Write AutofsPacketExpire { hdr: { proto_version: 5,
packet_type: ExpireIndirect (indirect) or ExpireDirect (direct) },
token, name = M.mountpoint_name } to pipe.
f. Insert AutofsPendingRequest into pending table.
g. Daemon calls AUTOFS_IOC_READY(token) after umount(2) succeeds, or
AUTOFS_IOC_FAIL(token) if the mount is still busy.
3. The timer reschedules itself unless the mount is in catatonic state.
The expiry path does not sleep in the kernel; it is fire-and-forget from the kernel's perspective. The daemon drives the actual unmount.
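The selection logic in steps 2a to 2c amounts to a filter over the child mounts. A minimal sketch, assuming a simplified `ChildMount` record (hypothetical) in place of the real mount structure, and assuming the scan interval is clamped to at least one second, which the text above does not specify:

```rust
/// Hypothetical, simplified view of one mount below the autofs mount point.
pub struct ChildMount {
    pub name: &'static str,
    pub last_access_secs: u64,
    pub open_files: u32,
}

/// Names of the child mounts for which an expire packet would be sent.
pub fn expire_candidates(
    children: &[ChildMount],
    now_secs: u64,
    timeout_secs: u64,
) -> Vec<&'static str> {
    children
        .iter()
        // Step 2b: skip mounts that have not been idle for the full timeout.
        .filter(|m| now_secs.saturating_sub(m.last_access_secs) >= timeout_secs)
        // Step 2c: skip mounts that still have open files.
        .filter(|m| m.open_files == 0)
        .map(|m| m.name)
        .collect()
}

/// Timer period: a quarter of the timeout (clamped to 1 s as an assumption).
pub fn scan_interval_secs(timeout_secs: u64) -> u64 {
    std::cmp::max(timeout_secs / 4, 1)
}
```

Scanning at a quarter of the timeout bounds how long an idle mount can outlive its deadline before the daemon is asked to expire it.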
14.10.7 VFS Integration¶
autofs inserts itself into the VFS path walk at the d_automount dentry operation
hook, which is called by follow_automount() inside the path resolution loop
(Section 14.1):
follow_automount(path, nd) -> Result<()>:
1. Verify nd.flags does not include LOOKUP_NO_AUTOMOUNT.
2. Call dentry.ops.d_automount(dentry, path) → new_mnt (may be None).
3. If new_mnt is Some(mnt): call do_add_mount(mnt, path).
4. Continue path walk over the now-mounted subtree.
RCU-walk downgrade: RCU-walk must not sleep, but d_automount has to sleep to
wait for the daemon response. Therefore, if the path walk is in RCU mode (the
optimistic lockless fast path), it is downgraded to REF-walk before
d_automount is called. The downgrade is performed by unlazy_walk(), which
acquires reference counts on the path components traversed so far. Once in
REF-walk, the kernel can sleep safely in autofs_d_automount.
LOOKUP_NO_AUTOMOUNT: Certain operations (stat, openat with
O_NOFOLLOW | O_PATH, utimensat with AT_SYMLINK_NOFOLLOW) set this flag to
avoid triggering automounts on stat-only access. This matches Linux semantics.
14.10.8 Mount Options¶
autofs is mounted by the daemon at startup with options passed via the data
argument to mount(2):
| Option | Description |
|---|---|
| fd=N | File descriptor of the daemon-side pipe end. Required. |
| uid=N | UID of the daemon process. Used for permission checks on expire. |
| gid=N | GID of the daemon process. |
| minproto=N | Minimum acceptable protocol version (daemon's minimum). |
| maxproto=N | Maximum acceptable protocol version (daemon's maximum). |
| indirect | Mount in indirect mode (default). |
| direct | Mount in direct mode. |
| offset | Mount in offset mode (internal; used by the daemon for sub-mounts). |
UmkaOS implements autofs protocol version 5, sub-version 6 (AUTOFS_PROTO_SUBVERSION = 6),
matching the version supported by Linux kernel 5.4+ and systemd's automount daemon v252+. The protocol version
is negotiated at mount time: the kernel picks min(maxproto, UMKA_PROTO_VERSION)
and returns it via AUTOFS_IOC_PROTOVER.
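Option parsing and version negotiation can be sketched together. The code below is an illustration under assumptions: `AutofsOpts`, `Mode`, `parse_opts`, and `negotiate` are hypothetical names, missing minproto/maxproto default to 5, uid/gid values are validated but not stored, and rejecting a mount whose minproto exceeds the chosen version is an assumed policy (the text above only defines the min(maxproto, UMKA_PROTO_VERSION) rule).

```rust
pub const UMKA_PROTO_VERSION: u32 = 5;

#[derive(Debug, PartialEq)]
pub enum Mode {
    Indirect,
    Direct,
    Offset,
}

#[derive(Debug, PartialEq)]
pub struct AutofsOpts {
    pub fd: i32,
    pub minproto: u32,
    pub maxproto: u32,
    pub mode: Mode,
}

/// Parse the comma-separated `data` argument of mount(2).
pub fn parse_opts(data: &str) -> Result<AutofsOpts, String> {
    let mut fd = None;
    let (mut minproto, mut maxproto) = (UMKA_PROTO_VERSION, UMKA_PROTO_VERSION);
    let mut mode = Mode::Indirect; // default mode
    for opt in data.split(',').filter(|s| !s.is_empty()) {
        let (key, val) = match opt.split_once('=') {
            Some((k, v)) => (k, Some(v)),
            None => (opt, None),
        };
        match (key, val) {
            ("fd", Some(v)) => fd = Some(v.parse::<i32>().map_err(|e| e.to_string())?),
            ("minproto", Some(v)) => minproto = v.parse().map_err(|_| format!("bad minproto: {}", v))?,
            ("maxproto", Some(v)) => maxproto = v.parse().map_err(|_| format!("bad maxproto: {}", v))?,
            ("uid", Some(v)) | ("gid", Some(v)) => {
                v.parse::<u32>().map_err(|_| format!("bad id: {}", v))?;
            }
            ("indirect", None) => mode = Mode::Indirect,
            ("direct", None) => mode = Mode::Direct,
            ("offset", None) => mode = Mode::Offset,
            _ => return Err(format!("unknown option: {}", opt)),
        }
    }
    let fd = fd.ok_or_else(|| "fd=N is required".to_string())?;
    Ok(AutofsOpts { fd, minproto, maxproto, mode })
}

/// Kernel picks min(maxproto, UMKA_PROTO_VERSION); None when that falls
/// below the daemon's minimum (assumed policy).
pub fn negotiate(minproto: u32, maxproto: u32) -> Option<u32> {
    let chosen = std::cmp::min(maxproto, UMKA_PROTO_VERSION);
    if minproto <= chosen {
        Some(chosen)
    } else {
        None
    }
}
```

A daemon that advertises maxproto=4 therefore ends up on v4, while one that advertises a range above 5 is still served v5 as long as its minimum permits it.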
14.10.9 systemd Integration¶
A systemd .automount unit creates an autofs mount point at the path specified
by Where=, paired with a .mount unit of the same name. systemd acts as the
automount daemon:
- At unit activation, systemd calls mount("autofs", Where, "autofs", 0, "fd=N,...").
- When AutofsPacketMissing arrives on the pipe, systemd activates the corresponding .mount unit (which runs mount(2) for the real filesystem).
- On success, systemd calls AUTOFS_IOC_READY(token); on failure, AUTOFS_IOC_FAIL(token).
- TimeoutIdleSec= in the .automount unit maps directly to AUTOFS_IOC_SETTIMEOUT.
- After the idle timeout, systemd receives AutofsPacketExpire and issues umount(2) if the mount is not busy, then calls AUTOFS_IOC_READY(token).
Example unit (/etc/systemd/system/home.automount):
[Unit]
Description=Automount /home via NFS
[Automount]
Where=/home
TimeoutIdleSec=300
[Install]
WantedBy=multi-user.target
Paired with /etc/systemd/system/home.mount which specifies the NFS source and
options. systemd creates the autofs mount point when the .automount unit starts
and tears it down when the unit stops.
14.10.10 Linux Compatibility¶
UmkaOS's autofs implementation is wire-compatible with Linux autofs4:
- Protocol version 5, sub-version 6 — matches Linux kernel 5.4+.
- All ioctl numbers are identical to Linux (AUTOFS_IOC_* from <linux/auto_fs.h>).
- AutofsPacketMissing and AutofsPacketExpire structs are #[repr(C)] and match the Linux kernel ABI exactly.
- Mount option string format (fd=N,uid=N,...) matches Linux.
- systemd's automount daemon, autofs(5) userspace tools, and mount.autofs all operate without modification against UmkaOS's autofs implementation.
14.11 FUSE — Filesystem in Userspace¶
FUSE allows user-space processes to implement complete filesystems. A FUSE filesystem
daemon opens /dev/fuse (character device, major 10, minor 229), mounts a filesystem
whose superblock carries FUSE_SUPER_MAGIC, and serves kernel VFS calls by reading and writing structured
FUSE messages over the device fd. Any FUSE protocol-compliant daemon runs without
modification on UmkaOS.
14.11.1 Architecture¶
User Process (e.g., sshfs, rclone, glusterfs-fuse)
│ write(fuse_fd, fuse_out_header + reply)
│ read(fuse_fd, fuse_in_header + args)
▼
/dev/fuse (character device, major 10 minor 229)
│
┌────┴────────────────────────────────────────┐
│ FuseConn: pending request queue │
│ FuseInode: nodeid → dentry mapping │
└────┬────────────────────────────────────────┘
│ VFS callbacks → fuse_request dispatch
▼
UmkaOS VFS layer (lookup, read, write, open, ...)
│
POSIX application
The FUSE connection object (FuseConn) is the central coordination point. It
maintains two queues: pending (requests waiting for the daemon to pick up) and
processing (requests sent to the daemon, awaiting reply). Each VFS thread that
triggers a FUSE operation enqueues a request and blocks until the daemon writes
the corresponding reply.
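The movement of a request between the two queues can be shown with a single-threaded model. This is a sketch, not the FuseConn implementation: `Conn` and `Request` are hypothetical types, a `VecDeque` and a `HashMap` stand in for the lock-free ring and the XArray, the opcode values are arbitrary, and blocking is replaced by return values.

```rust
use std::collections::{HashMap, VecDeque};

pub struct Request {
    pub unique: u64,
    pub opcode: u32,
    pub reply: Option<Result<Vec<u8>, i32>>,
}

pub struct Conn {
    next_unique: u64,
    capacity: usize,
    pending: VecDeque<Request>,        // waiting for the daemon to read
    processing: HashMap<u64, Request>, // read by the daemon, awaiting reply
}

impl Conn {
    pub fn new(capacity: usize) -> Self {
        Conn { next_unique: 1, capacity, pending: VecDeque::new(), processing: HashMap::new() }
    }

    /// VFS side: enqueue a request, or report backpressure when full.
    pub fn submit(&mut self, opcode: u32) -> Result<u64, &'static str> {
        if self.pending.len() >= self.capacity {
            return Err("pending queue full");
        }
        let unique = self.next_unique;
        self.next_unique += 1;
        self.pending.push_back(Request { unique, opcode, reply: None });
        Ok(unique)
    }

    /// Daemon read(): move the oldest request from pending to processing.
    pub fn daemon_read(&mut self) -> Option<u64> {
        let req = self.pending.pop_front()?;
        let unique = req.unique;
        self.processing.insert(unique, req);
        Some(unique)
    }

    /// Daemon write(): match the reply to its request by `unique` and hand
    /// the completed request back to the waiting VFS thread.
    pub fn daemon_write(&mut self, unique: u64, reply: Result<Vec<u8>, i32>) -> Option<Request> {
        let mut req = self.processing.remove(&unique)?;
        req.reply = Some(reply);
        Some(req)
    }
}
```

Keying the processing set by `unique` lets the daemon answer requests out of order, which matters once several worker threads serve the same connection.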
14.11.2 Core Data Structures¶
/// Maximum pending FUSE requests per connection. Prevents unbounded kernel
/// memory growth from a slow or misbehaving FUSE daemon.
const FUSE_MAX_PENDING: usize = 4096;
/// One FUSE connection — shared between all fds opened on this mount.
pub struct FuseConn {
/// Pending requests waiting for the daemon to read.
/// Lock-free bounded MPMC ring (defined in Section 3.1.11). VFS
/// operations push requests (producer side); the FUSE daemon reads
/// from the ring via `/dev/fuse` (consumer side). `try_push()`
/// returns `Err(Full)` for backpressure — foreground callers block
/// on `waitq` until the daemon drains entries; background callers
/// receive `EAGAIN`.
/// Per-request overhead: ~60-80 cycles for Arc refcount operations
/// (4 atomic ops across pending ring and processing XArray). Acceptable:
/// each FUSE request involves a user-kernel round-trip (~2-10 us),
/// making the ~30-40 ns refcount overhead <2%.
pub pending: BoundedMpmcRing<Arc<FuseRequest>, FUSE_MAX_PENDING>,
/// Number of currently outstanding background (async) requests.
/// Incremented when a background request is submitted; decremented on reply.
pub num_background: AtomicU32,
/// Maximum background requests before blocking new submissions.
/// Default: 12 (matching Linux `FUSE_DEFAULT_MAX_BACKGROUND`).
/// Negotiated via FUSE_INIT: the daemon may set `max_background` in
/// `FuseInitOut` to override the default.
pub max_background: u32,
/// When `num_background >= congestion_threshold`, the VFS marks the
/// backing device as congested, causing writeback to throttle.
/// Default: 9 (matching Linux `FUSE_DEFAULT_CONGESTION_THRESHOLD`,
/// which is `max_background * 3 / 4`).
///
/// **Units note**: Both `max_background` and `congestion_threshold` are
/// measured in **request count** (not pages or bytes). Each background
/// FUSE request may transfer a variable number of pages (e.g., a single
/// WRITE request carries up to `max_write` bytes, default 128 KiB = 32
/// pages). The request-count limit provides coarse backpressure; memory
/// consumption is bounded by `max_background * max_write`.
pub congestion_threshold: u32,
/// Wait queue for tasks blocked due to backpressure (background request
/// count exceeding `max_background`).
pub bg_waitq: WaitQueue,
/// Requests sent to the daemon, awaiting reply. Keyed by monotonic
/// request ID (u64). XArray provides O(1) lookup with native RCU reads
/// and internal xa_lock for write serialization, eliminating the need
/// for an external Mutex on the lookup structure.
pub processing: XArray<Arc<FuseRequest>>,
/// Wait queue: daemon blocked in read() waiting for new requests.
pub waitq: WaitQueue,
/// Connection options negotiated via FUSE_INIT.
pub opts: FuseConnOpts,
/// Next unique request ID (monotonically increasing).
pub next_unique: AtomicU64,
/// True after the daemon has exchanged FUSE_INIT.
/// Intra-domain (FuseConn lives entirely within umka-vfs Tier 1).
/// AtomicBool validity maintained by Rust type safety.
pub initialized: AtomicBool,
/// True when the connection is shutting down. Intra-domain.
pub destroyed: AtomicBool,
/// Maximum write size negotiated (from FUSE_INIT reply).
pub max_write: u32,
/// Maximum read size.
pub max_read: u32,
}
/// A single FUSE request/reply pair.
pub struct FuseRequest {
/// Monotonic ID — matches `FuseInHeader.unique` and `FuseOutHeader.unique`.
pub unique: u64,
pub opcode: FuseOpcode,
/// Serialized FUSE input args (everything after the `FuseInHeader`).
/// **Collection policy exception**: Vec<u8> on a warm/hot path. FUSE input
/// args are variable-length (path names up to PATH_MAX, write data up to
/// max_write). A fixed-size buffer would waste memory for small ops or
/// truncate large ones. Allocation is bounded by max_write (negotiated
/// at FUSE_INIT, typically 128 KiB) and occurs once per FUSE operation.
pub in_args: Vec<u8>,
pub reply: Mutex<FuseReply>,
/// Woken when `reply` transitions to `Done`.
pub waker: WaitEntry,
}
/// State of a request's reply.
pub enum FuseReply {
/// Not yet answered by the daemon.
Pending,
/// Reply bytes, or a negative errno on error.
/// Collection policy exception: Vec<u8> on warm/hot path. FUSE replies
/// are variable-length (stat: ~100 bytes, read data: up to max_read,
/// readdir: variable). Allocation bounded by max_read (negotiated at
/// FUSE_INIT, typically 128 KiB). The FUSE userspace round-trip (~2-10 us)
/// dominates; Vec allocation (~50-100 ns) adds at most ~5% overhead.
Done(Result<Vec<u8>, i32>),
}
/// FUSE connection options negotiated during FUSE_INIT.
pub struct FuseConnOpts {
pub max_write: u32,
pub max_read: u32,
pub max_pages: u16,
/// Capabilities declared by the daemon (server side).
pub capable: FuseInitFlags,
/// Capabilities the kernel requests (client side).
pub want: FuseInitFlags,
/// Timestamp granularity in nanoseconds (0 = 1 ns, i.e., full precision).
pub time_gran: u32,
pub writeback_cache: bool,
pub parallel_dirops: bool,
pub async_dio: bool,
pub posix_acl: bool,
pub default_permissions: bool,
pub allow_other: bool,
}
FuseConn is reference-counted via Arc and held by:
- The superblock of the mounted filesystem.
- Every open file descriptor on /dev/fuse belonging to that mount.
When the last daemon fd is closed, FuseConn.destroyed is set and all further
VFS operations return EIO. The mount point must then be explicitly unmounted
with fusermount -u or umount.
14.11.2.1 Request Backpressure¶
FUSE distinguishes foreground requests (synchronous VFS operations: lookup, open, read, write) from background requests (async writeback, readahead, background FUSE_NOTIFY replies). Backpressure is applied to background requests to prevent a slow daemon from causing unbounded kernel memory growth:
fuse_submit_background(conn, request):
    loop:
        n = conn.num_background.load(Acquire)
        if n < conn.max_background:
            if conn.num_background.compare_exchange(n, n + 1, AcqRel, Acquire).is_ok():
                break
        else:
            // Block until the daemon processes a reply and decrements num_background.
            // Non-blocking callers (e.g., writeback from kthread) get EAGAIN instead.
            if request.is_nonblocking():
                return Err(EAGAIN)
            conn.bg_waitq.wait_until(|| conn.num_background.load(Acquire) < conn.max_background)
    // Congestion marking: when background requests exceed the threshold,
    // inform the VFS writeback layer so it throttles dirty page generation.
    if conn.num_background.load(Acquire) >= conn.congestion_threshold:
        set_bdi_congested(conn.backing_dev_info)
    conn.pending.try_push(request) // lock-free; returns Err(Full) if ring is full
    conn.waitq.wake_one() // wake daemon blocked in read(/dev/fuse)
fuse_complete_background(conn):
    prev = conn.num_background.fetch_sub(1, AcqRel)
    if prev <= conn.congestion_threshold:
        clear_bdi_congested(conn.backing_dev_info)
    if prev <= conn.max_background:
        conn.bg_waitq.wake_one()
Foreground requests are not subject to max_background — they always enter
the pending ring (bounded by FUSE_MAX_PENDING = 4096). If try_push() returns
Err(Full), the foreground caller blocks on conn.waitq until the daemon drains
entries. This matches Linux semantics where synchronous FUSE operations
never return EAGAIN (except with O_NONBLOCK on the file, which is handled
at the VFS layer above FUSE).
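The admission logic above can be sketched in plain Rust with `std` atomics. This is a minimal model under stated assumptions, not the UmkaOS implementation: `BgGate` and `Admit` are hypothetical names, the wait queue is replaced by a `WouldBlock` result, and the congestion decision is derived from the value the CAS just installed rather than from a second load.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for the background counters in `FuseConn`.
pub struct BgGate {
    pub num_background: AtomicU32,
    pub max_background: u32,
    pub congestion_threshold: u32,
}

/// Outcome of a non-blocking admission attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum Admit {
    /// Slot reserved; `congested` mirrors the `set_bdi_congested` decision.
    Granted { congested: bool },
    /// Limit reached; a blocking caller would sleep on `bg_waitq` here.
    WouldBlock,
}

impl BgGate {
    pub fn new(max_background: u32, congestion_threshold: u32) -> Self {
        Self {
            num_background: AtomicU32::new(0),
            max_background,
            congestion_threshold,
        }
    }

    /// CAS loop: reserve one background slot without exceeding the limit.
    pub fn try_admit(&self) -> Admit {
        let mut n = self.num_background.load(Ordering::Acquire);
        loop {
            if n >= self.max_background {
                return Admit::WouldBlock;
            }
            match self.num_background.compare_exchange(
                n,
                n + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => {
                    return Admit::Granted {
                        congested: n + 1 >= self.congestion_threshold,
                    }
                }
                // Lost a race with another submitter; retry with the fresh value.
                Err(seen) => n = seen,
            }
        }
    }

    /// Release one slot. Returns `(clear_congestion, wake_waiter)`, matching
    /// the two `prev <= ...` checks in `fuse_complete_background`.
    pub fn complete(&self) -> (bool, bool) {
        let prev = self.num_background.fetch_sub(1, Ordering::AcqRel);
        (prev <= self.congestion_threshold, prev <= self.max_background)
    }
}
```

With the defaults quoted earlier (12 and 9), the ninth admitted request is the first to report congestion and the thirteenth is refused.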
14.11.3 Wire Protocol¶
All FUSE communication is framed with fixed headers. The kernel writes a request header followed by opcode-specific arguments; the daemon writes a reply header followed by opcode-specific data.
/// Fixed header preceding every FUSE request (kernel → daemon).
#[repr(C)]
pub struct FuseInHeader {
/// Total request length (this header + opcode args).
pub len: u32,
/// Opcode (FuseOpcode value).
pub opcode: u32,
/// Unique request ID; matched by the reply.
pub unique: u64,
/// Target inode number (0 for FUSE_INIT / FUSE_STATFS).
pub nodeid: u64,
/// Effective UID of the calling process.
pub uid: u32,
/// Effective GID of the calling process.
pub gid: u32,
/// PID of the calling process.
pub pid: u32,
/// Length of the extended request data appended after the standard opcode
/// arguments, in 8-byte units (protocol 7.36+). Zero when no extensions are
/// present. Used by FUSE_SECURITY_CTX, FUSE_CREATE_SUPP_GROUP.
pub total_extlen: u16,
pub padding: u16,
}
const_assert!(size_of::<FuseInHeader>() == 40);
/// Fixed header preceding every FUSE reply (daemon → kernel).
#[repr(C)]
pub struct FuseOutHeader {
/// Total reply length (this header + reply data).
pub len: u32,
/// 0 on success; negative errno on error (e.g., -ENOENT = -2).
pub error: i32,
/// Matches the `unique` field from the corresponding `FuseInHeader`.
pub unique: u64,
}
const_assert!(size_of::<FuseOutHeader>() == 16);
Requests and replies are variable-length. The daemon must read exactly
FuseInHeader.len bytes per request and must write exactly FuseOutHeader.len
bytes per reply. A short read or write is a protocol error and terminates the
connection.
FUSE_FORGET and FUSE_BATCH_FORGET are the only opcodes that carry no reply;
the daemon must not write a reply for them.
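The framing rule for replies can be illustrated with a small encoder and parser. This is a sketch, assuming host byte order on the wire (the text above does not state the byte order); `OutHeader`, `FrameError`, `encode_reply`, and `parse_reply` are hypothetical names and not part of umka-vfs.

```rust
/// Size of `FuseOutHeader` on the wire (16 bytes, per the const_assert above).
pub const OUT_HEADER_LEN: usize = 16;

/// Decoded copy of the reply header fields.
#[derive(Debug, PartialEq, Eq)]
pub struct OutHeader {
    pub len: u32,
    pub error: i32,
    pub unique: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FrameError {
    /// Fewer than 16 bytes were written.
    ShortHeader,
    /// `len` does not equal the number of bytes actually written.
    LengthMismatch,
}

/// Copy a fixed number of bytes out of a slice of exactly that length.
fn take<const N: usize>(src: &[u8]) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(src);
    out
}

/// Build one reply frame: header followed by opcode-specific data.
pub fn encode_reply(unique: u64, error: i32, payload: &[u8]) -> Vec<u8> {
    let len = (OUT_HEADER_LEN + payload.len()) as u32;
    let mut buf = Vec::with_capacity(len as usize);
    buf.extend_from_slice(&len.to_ne_bytes());
    buf.extend_from_slice(&error.to_ne_bytes());
    buf.extend_from_slice(&unique.to_ne_bytes());
    buf.extend_from_slice(payload);
    buf
}

/// Parse one reply frame and enforce "exactly `len` bytes per reply".
pub fn parse_reply(buf: &[u8]) -> Result<(OutHeader, &[u8]), FrameError> {
    if buf.len() < OUT_HEADER_LEN {
        return Err(FrameError::ShortHeader);
    }
    let hdr = OutHeader {
        len: u32::from_ne_bytes(take::<4>(&buf[0..4])),
        error: i32::from_ne_bytes(take::<4>(&buf[4..8])),
        unique: u64::from_ne_bytes(take::<8>(&buf[8..16])),
    };
    if hdr.len as usize != buf.len() {
        return Err(FrameError::LengthMismatch);
    }
    Ok((hdr, &buf[OUT_HEADER_LEN..]))
}
```

A real connection would terminate on either error; the sketch only reports it.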
14.11.4 FUSE Opcodes¶
The direction column records who initiates the message: K→D = kernel to daemon (a VFS call from a user process), D→K = daemon to kernel (a notify or retrieve reply with no corresponding VFS initiator).
| Opcode | Value | Direction | Description |
|---|---|---|---|
| FUSE_LOOKUP | 1 | K→D | Lookup a name within a directory |
| FUSE_FORGET | 2 | K→D | Decrement inode reference count (no reply) |
| FUSE_GETATTR | 3 | K→D | Fetch inode attributes |
| FUSE_SETATTR | 4 | K→D | Modify inode attributes |
| FUSE_READLINK | 5 | K→D | Read the target of a symbolic link |
| FUSE_SYMLINK | 6 | K→D | Create a symbolic link |
| (reserved) | 7 | — | Reserved (unused in FUSE protocol; sequence intentionally skips from 6 to 8) |
| FUSE_MKNOD | 8 | K→D | Create a special or regular file |
| FUSE_MKDIR | 9 | K→D | Create a directory |
| FUSE_UNLINK | 10 | K→D | Remove a file |
| FUSE_RMDIR | 11 | K→D | Remove a directory |
| FUSE_RENAME | 12 | K→D | Rename a file (v1; same mount) |
| FUSE_LINK | 13 | K→D | Create a hard link |
| FUSE_OPEN | 14 | K→D | Open a file |
| FUSE_READ | 15 | K→D | Read file data |
| FUSE_WRITE | 16 | K→D | Write file data |
| FUSE_STATFS | 17 | K→D | Query filesystem statistics |
| FUSE_RELEASE | 18 | K→D | Close file (last close releases the handle) |
| (reserved) | 19 | — | Unassigned in FUSE protocol (intentionally skipped) |
| FUSE_FSYNC | 20 | K→D | Sync file data to stable storage |
| FUSE_SETXATTR | 21 | K→D | Set an extended attribute |
| FUSE_GETXATTR | 22 | K→D | Get an extended attribute value |
| FUSE_LISTXATTR | 23 | K→D | List all extended attribute names |
| FUSE_REMOVEXATTR | 24 | K→D | Remove an extended attribute |
| FUSE_FLUSH | 25 | K→D | Flush on close (sent before FUSE_RELEASE) |
| FUSE_INIT | 26 | K→D | Initialize connection (first message exchanged) |
| FUSE_OPENDIR | 27 | K→D | Open a directory |
| FUSE_READDIR | 28 | K→D | Read directory entries |
| FUSE_RELEASEDIR | 29 | K→D | Close a directory |
| FUSE_FSYNCDIR | 30 | K→D | Sync directory metadata to stable storage |
| FUSE_GETLK | 31 | K→D | Test a POSIX byte-range lock |
| FUSE_SETLK | 32 | K→D | Acquire or release a POSIX lock (non-blocking) |
| FUSE_SETLKW | 33 | K→D | Acquire a POSIX lock (blocking) |
| FUSE_ACCESS | 34 | K→D | Check access (used only when default_permissions is false) |
| FUSE_CREATE | 35 | K→D | Atomically create and open a file |
| FUSE_INTERRUPT | 36 | K→D | Cancel a pending request |
| FUSE_BMAP | 37 | K→D | Map logical file block to device block |
| FUSE_DESTROY | 38 | K→D | Tear down the connection |
| FUSE_IOCTL | 39 | K→D | Forward an ioctl to the userspace filesystem |
| FUSE_POLL | 40 | K→D | Poll a file for readiness events |
| FUSE_NOTIFY_REPLY | 41 | D→K | Kernel message carrying the page-cache data requested by FUSE_NOTIFY_RETRIEVE; listed as D→K because the daemon initiates the exchange |
| FUSE_BATCH_FORGET | 42 | K→D | Drop references for multiple inodes at once |
| FUSE_FALLOCATE | 43 | K→D | Pre-allocate or de-allocate file space |
| FUSE_READDIRPLUS | 44 | K→D | Read directory entries together with their attributes |
| FUSE_RENAME2 | 45 | K→D | Rename with RENAME_EXCHANGE or RENAME_NOREPLACE |
| FUSE_LSEEK | 46 | K→D | Seek with SEEK_DATA or SEEK_HOLE |
| FUSE_COPY_FILE_RANGE | 47 | K→D | Server-side copy (copy_file_range) |
| FUSE_SETUPMAPPING | 48 | K→D | Set up a DAX direct memory mapping |
| FUSE_REMOVEMAPPING | 49 | K→D | Remove a DAX mapping |
| FUSE_SYNCFS | 50 | K→D | Sync the entire filesystem |
| FUSE_TMPFILE | 51 | K→D | Create an unnamed temporary file (O_TMPFILE) |
| FUSE_STATX | 52 | K→D | Extended stat (statx(2)) |
| FUSE_COPY_FILE_RANGE_64 | 53 | K→D | Server-side copy (64-bit variant, returns bytes_copied via fuse_copy_file_range_out). Added in FUSE protocol 7.45. |
Notify messages (daemon → kernel, unsolicited; the kernel sends no reply, except
that FUSE_NOTIFY_RETRIEVE is answered with FUSE_NOTIFY_REPLY):
| Notify code | Value | Description |
|---|---|---|
| FUSE_NOTIFY_POLL | 1 | Wake all pollers on the specified file handle |
| FUSE_NOTIFY_INVAL_INODE | 2 | Invalidate cached attributes and, optionally, a byte range of page cache |
| FUSE_NOTIFY_INVAL_ENTRY | 3 | Invalidate a specific dentry in a parent directory |
| FUSE_NOTIFY_STORE | 4 | Pre-populate a byte range of the page cache |
| FUSE_NOTIFY_RETRIEVE | 5 | Request the kernel to send page-cache contents back to the daemon |
| FUSE_NOTIFY_DELETE | 6 | Remove a dentry without a round-trip FUSE_LOOKUP failure |
| FUSE_NOTIFY_RESEND | 7 | Daemon notification that a previously interrupted request should be resent. Paired with HAS_RESEND capability flag (bit 39). Protocol 7.41+, Linux 6.12+ |
| FUSE_NOTIFY_INC_EPOCH | 8 | Increment the kernel-side epoch counter for cache invalidation coordination |
| FUSE_NOTIFY_PRUNE | 9 | Request the kernel to prune (evict) dentries from a directory |
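The opcode numbering above, including the holes at 7 and 19 and the two no-reply opcodes, can be classified with a single match. This is an illustrative sketch: `Decoded` and `decode_opcode` are hypothetical names, and treating 0 and values above 53 as unknown is an assumption of the sketch rather than something the tables state.

```rust
pub const FUSE_FORGET: u32 = 2;
pub const FUSE_BATCH_FORGET: u32 = 42;
/// Highest opcode listed in the table (FUSE_COPY_FILE_RANGE_64).
pub const FUSE_MAX_OPCODE: u32 = 53;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decoded {
    /// A defined opcode; `expects_reply` is false only for the FORGET pair.
    Known { value: u32, expects_reply: bool },
    /// 7 and 19 are intentionally skipped in the numbering.
    Reserved,
    /// Not listed in the table (assumption: rejected by the receiver).
    Unknown,
}

/// Classify the raw `opcode` field of a request header.
pub fn decode_opcode(raw: u32) -> Decoded {
    match raw {
        7 | 19 => Decoded::Reserved,
        FUSE_FORGET | FUSE_BATCH_FORGET => Decoded::Known {
            value: raw,
            expects_reply: false,
        },
        1..=FUSE_MAX_OPCODE => Decoded::Known {
            value: raw,
            expects_reply: true,
        },
        _ => Decoded::Unknown,
    }
}
```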
14.11.5 FUSE_INIT Handshake¶
FUSE_INIT is always the first message exchanged. The kernel sends
FuseInitIn and the daemon replies with FuseInitOut. The two sides negotiate
protocol version and capability flags; the connection runs at the lower of the
two minor versions.
/// FUSE_INIT request body (kernel → daemon).
#[repr(C)]
pub struct FuseInitIn {
/// FUSE major protocol version (kernel sends 7).
pub major: u32,
/// FUSE minor protocol version (kernel sends 45 for Linux 6.14+ equivalent).
pub minor: u32,
pub max_readahead: u32,
/// Capability bitmask the kernel supports (low 32 bits of FuseInitFlags).
/// Wire format: flags = FuseInitFlags bits 0-31 (low 32 bits).
pub flags: u32,
/// Extended capability flags (protocol minor ≥ 36, FUSE_INIT_EXT must be set in flags).
/// Wire format: flags2 = FuseInitFlags bits 32-63 shifted down 32 bits.
/// This matches the FUSE protocol extension for large flag sets (kernel 5.13+).
pub flags2: u32,
pub unused: [u32; 11],
}
// Layout: 5 × u32 + 11 × u32 = 16 × 4 = 64 bytes.
const_assert!(size_of::<FuseInitIn>() == 64);
/// FUSE_INIT reply body (daemon → kernel).
#[repr(C)]
pub struct FuseInitOut {
pub major: u32,
pub minor: u32,
pub max_readahead: u32,
/// Capabilities the daemon acknowledges and enables (low 32 bits of FuseInitFlags).
/// Wire format: flags = FuseInitFlags bits 0-31 (low 32 bits).
pub flags: u32,
/// Maximum number of outstanding background requests.
pub max_background: u16,
/// Congestion threshold: kernel slows down at this many background requests.
pub congestion_threshold: u16,
/// Maximum bytes per WRITE request.
pub max_write: u32,
/// Timestamp granularity in nanoseconds (0 = 1 ns, i.e., full precision).
pub time_gran: u32,
/// Maximum scatter-gather page count per request.
pub max_pages: u16,
/// Alignment required for DAX mappings.
pub map_alignment: u16,
/// Extended flags (protocol minor ≥ 36, requires FUSE_INIT_EXT set in flags).
/// Wire format: flags2 = FuseInitFlags bits 32-63 shifted down 32 bits.
/// This matches the FUSE protocol extension for large flag sets (kernel 5.13+).
pub flags2: u32,
pub max_stack_depth: u32,
/// Negotiated request timeout in seconds. Valid when `FUSE_REQUEST_TIMEOUT`
/// (bit 42) is set in the negotiated flags. 0 = no timeout. Matches Linux
/// `include/uapi/linux/fuse.h` field `request_timeout`.
pub request_timeout: u16,
pub unused: [u16; 11],
}
// Layout: 4+4+4+4+2+2+4+4+2+2+4+4+2+22 = 64 bytes.
// (8×u32 = 32) + (4×u16 = 8) + (11×u16 = 22) + (request_timeout u16 = 2) = 64.
const_assert!(size_of::<FuseInitOut>() == 64);
bitflags! {
/// Capability flags exchanged during FUSE_INIT.
pub struct FuseInitFlags: u64 {
/// Daemon supports asynchronous read requests.
const ASYNC_READ = 1 << 0;
/// Daemon handles POSIX advisory byte-range locks.
const POSIX_LOCKS = 1 << 1;
/// Daemon uses file handles returned in open replies.
const FILE_OPS = 1 << 2;
/// Daemon handles O_TRUNC atomically in open.
const ATOMIC_O_TRUNC = 1 << 3;
/// Filesystem supports NFS export (node IDs are stable across reboots).
const EXPORT_SUPPORT = 1 << 4;
/// Daemon supports writes larger than 4 KiB.
const BIG_WRITES = 1 << 5;
/// Kernel should not apply the process umask to create operations.
const DONT_MASK = 1 << 6;
/// Daemon supports splice(2)-based writes.
const SPLICE_WRITE = 1 << 7;
/// Daemon supports splice(2)-based moves.
const SPLICE_MOVE = 1 << 8;
/// Daemon supports splice(2)-based reads.
const SPLICE_READ = 1 << 9;
/// Daemon handles BSD flock() locking.
const FLOCK_LOCKS = 1 << 10;
/// Daemon supports ioctl on directories.
const HAS_IOCTL_DIR = 1 << 11;
/// Kernel auto-invalidates cached data on attribute changes.
const AUTO_INVAL_DATA = 1 << 12;
/// Kernel uses FUSE_READDIRPLUS instead of FUSE_READDIR.
const DO_READDIRPLUS = 1 << 13;
/// Kernel switches adaptively between READDIRPLUS and READDIR.
const READDIRPLUS_AUTO = 1 << 14;
/// Daemon supports asynchronous direct I/O.
const ASYNC_DIO = 1 << 15;
/// Daemon supports writeback caching (batched dirty page writeback).
const WRITEBACK_CACHE = 1 << 16;
/// Daemon does not need FUSE_OPEN (open is a no-op).
const NO_OPEN_SUPPORT = 1 << 17;
/// Parallel directory operations are safe (no serialization needed).
const PARALLEL_DIROPS = 1 << 18;
/// Kernel clears setuid/setgid bits on write (v1).
const HANDLE_KILLPRIV = 1 << 19;
/// Daemon supports POSIX ACLs.
const POSIX_ACL = 1 << 20;
/// Daemon sets error on abort rather than returning EIO.
const ABORT_ERROR = 1 << 21;
/// `max_pages` field in FuseInitOut is valid.
const MAX_PAGES = 1 << 22;
/// Daemon caches symlink targets.
const CACHE_SYMLINKS = 1 << 23;
/// Daemon does not need FUSE_OPENDIR.
const NO_OPENDIR_SUPPORT = 1 << 24;
/// Daemon explicitly invalidates data (FUSE_NOTIFY_INVAL_INODE).
const EXPLICIT_INVAL_DATA = 1 << 25;
/// `map_alignment` field in FuseInitOut is valid.
const MAP_ALIGNMENT = 1 << 26;
/// Daemon is aware of submount semantics.
const SUBMOUNTS = 1 << 27;
/// Kernel clears setuid/setgid bits on write (v2, extended semantics).
const HANDLE_KILLPRIV_V2 = 1 << 28;
/// Extended setxattr arguments (flags field present).
const SETXATTR_EXT = 1 << 29;
/// `flags2` fields in FuseInitIn/Out are valid.
const INIT_EXT = 1 << 30;
const INIT_RESERVED = 1 << 31;
// --- flags2 bits (require INIT_EXT set in flags) ---
// Wire format: flags2 = FuseInitFlags bits 32-63 shifted down 32 bits.
// All bit positions match Linux `include/uapi/linux/fuse.h` (torvalds/linux master).
/// Security context support (protocol 7.36+). Linux 6.0+.
/// Daemon can receive security context (e.g., SELinux label) with
/// create/mkdir/mknod requests via extended headers (total_extlen).
const SECURITY_CTX = 1 << 32;
/// Per-inode DAX hint (protocol 7.36+). Linux 6.0+.
/// Daemon can set per-inode DAX mode via FUSE_ATTR_DAX.
const HAS_INODE_DAX = 1 << 33;
/// Supplementary group support in create (protocol 7.38+). Linux 6.6+.
const CREATE_SUPP_GROUP = 1 << 34;
/// Expire-only entry invalidation (protocol 7.38+). Linux 6.6+.
const HAS_EXPIRE_ONLY = 1 << 35;
/// Allow mmap on direct-I/O files (protocol 7.39+). Linux 6.6+.
const DIRECT_IO_ALLOW_MMAP = 1 << 36;
/// I/O passthrough to backing file (protocol 7.40+). Linux 6.9+.
const PASSTHROUGH = 1 << 37;
/// Opt out of NFS export support (protocol 7.40+). Linux 6.9+.
const NO_EXPORT_SUPPORT = 1 << 38;
/// Daemon supports request resend on interrupted operations.
/// When set, the kernel may resend a FUSE request that was interrupted
/// (e.g., by a signal) if the daemon has not yet replied. The daemon must
/// handle duplicate `unique` IDs idempotently (protocol 7.41+). Linux 6.12+.
const HAS_RESEND = 1 << 39;
/// ID-mapped FUSE mounts (protocol 7.42+). Linux 6.12+.
const ALLOW_IDMAP = 1 << 40;
/// io_uring-based FUSE request transport (protocol 7.43+). Linux 6.14+.
/// When negotiated, requests are submitted and completed via io_uring
/// SQEs/CQEs instead of read()/write() on `/dev/fuse`, eliminating
/// two syscalls per FUSE operation.
const OVER_IO_URING = 1 << 41;
/// Per-request timeout (protocol 7.45+). Linux 6.14+.
const REQUEST_TIMEOUT = 1 << 42;
}
}
If the daemon returns a minor version lower than what the kernel sent, the kernel downconverts: fields that did not exist in the older protocol minor are ignored. If the daemon sends a major version other than 7, the kernel closes the connection.
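The version rule and the `flags` / `flags2` wire split can be sketched as three small functions. The helper names (`negotiate`, `split_flags`, `join_flags`) are hypothetical, and ignoring `flags2` when `INIT_EXT` is clear is this sketch's reading of the "`flags2` fields are valid" comment on that bit.

```rust
pub const KERNEL_MAJOR: u32 = 7;
pub const KERNEL_MINOR: u32 = 45;
/// Bit 30 of the 64-bit flag set: `flags2` is valid.
pub const INIT_EXT: u64 = 1 << 30;

/// Returns the minor version the connection runs at, or `None` when the
/// major versions differ and the kernel closes the connection.
pub fn negotiate(daemon_major: u32, daemon_minor: u32) -> Option<u32> {
    if daemon_major != KERNEL_MAJOR {
        return None;
    }
    Some(daemon_minor.min(KERNEL_MINOR))
}

/// Split the 64-bit flag set into the wire fields `(flags, flags2)`.
pub fn split_flags(all: u64) -> (u32, u32) {
    (all as u32, (all >> 32) as u32)
}

/// Rebuild the 64-bit flag set from the wire fields. `flags2` is honoured
/// only when INIT_EXT is set in `flags`.
pub fn join_flags(flags: u32, flags2: u32) -> u64 {
    let low = flags as u64;
    if low & INIT_EXT != 0 {
        low | ((flags2 as u64) << 32)
    } else {
        low
    }
}
```

For example, REQUEST_TIMEOUT (bit 42) travels as bit 10 of `flags2`.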
14.11.6 VFS Integration¶
FUSE registers filesystem type "fuse" with superblock magic FUSE_SUPER_MAGIC =
0x65735546. Mounting proceeds as follows:
mount(2) path
- User invokes mount -t fuse -o fd=N,... or uses the fusermount3 helper.
- The kernel parses the fd=N mount option and resolves the fd to an open /dev/fuse file.
- A FuseConn is allocated and attached to the fd and the new superblock.
- The kernel sends FUSE_INIT and waits for the daemon's reply; on success, FuseConn.initialized is set and the mount completes.
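The option-parsing step of the mount path might look like the following. This is a hedged sketch: `FuseMountOpts` and `parse_mount_opts` are hypothetical names, only the options mentioned in this chapter are recognised, and silently skipping unknown options is a simplification of the sketch, not documented behaviour.

```rust
/// Subset of mount options discussed in this chapter (hypothetical type).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct FuseMountOpts {
    pub fd: Option<i32>,
    pub allow_other: bool,
    pub default_permissions: bool,
}

/// Parse a comma-separated option string such as "fd=5,allow_other".
/// `fd=N` is mandatory and must be a non-negative integer.
pub fn parse_mount_opts(s: &str) -> Result<FuseMountOpts, String> {
    let mut opts = FuseMountOpts::default();
    for item in s.split(',').filter(|i| !i.is_empty()) {
        match item.split_once('=') {
            Some(("fd", v)) => {
                let fd: i32 = v
                    .parse()
                    .map_err(|_| format!("bad fd value: {}", v))?;
                if fd < 0 {
                    return Err(format!("negative fd: {}", fd));
                }
                opts.fd = Some(fd);
            }
            None if item == "allow_other" => opts.allow_other = true,
            None if item == "default_permissions" => opts.default_permissions = true,
            // Other options are skipped here; a real parser would handle them.
            _ => {}
        }
    }
    if opts.fd.is_none() {
        return Err("missing fd=N option".to_string());
    }
    Ok(opts)
}
```

Resolving the parsed fd to an open /dev/fuse file and allocating the `FuseConn` are outside the scope of this sketch.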
VFS → FUSE dispatch
For every VFS operation on a FUSE mount (lookup, read, write, getattr, etc.) the kernel:
- Allocates a FuseRequest with a fresh unique ID.
- Serializes the opcode-specific arguments into in_args.
- Appends the request to FuseConn.pending and wakes the daemon's wait queue.
- Blocks on FuseRequest.waker until the daemon writes a reply.
- Deserializes the reply from FuseRequest.reply and returns to the VFS caller.
The daemon loop is simply:
loop {
    bytes = read(fuse_fd, buf) // blocks until a request is pending
    handle_opcode(parse(buf))
    write(fuse_fd, reply_bytes) // unblocks the kernel thread
}
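The request lifecycle behind that loop (pending, then processing, then replied) can be modelled without threads. This is a simplified sketch under stated assumptions: `DispatchModel` is a hypothetical type, wait queues are omitted, and ordinary `std` collections stand in for the lock-free ring and the XArray.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Single-threaded model of the FUSE request lifecycle.
#[derive(Default)]
pub struct DispatchModel {
    next_unique: u64,
    /// Requests queued for the daemon: (unique, serialized in_args).
    pending: VecDeque<(u64, Vec<u8>)>,
    /// Requests the daemon has read but not yet answered.
    processing: HashSet<u64>,
    /// Completed replies waiting for the blocked caller to collect them.
    replies: HashMap<u64, Result<Vec<u8>, i32>>,
}

impl DispatchModel {
    /// Kernel side: assign a fresh unique ID and enqueue the request.
    pub fn submit(&mut self, in_args: Vec<u8>) -> u64 {
        self.next_unique += 1;
        let unique = self.next_unique;
        self.pending.push_back((unique, in_args));
        unique
    }

    /// Daemon side of `read(fuse_fd)`: take the oldest pending request
    /// and move it to the processing set.
    pub fn daemon_read(&mut self) -> Option<(u64, Vec<u8>)> {
        let (unique, in_args) = self.pending.pop_front()?;
        self.processing.insert(unique);
        Some((unique, in_args))
    }

    /// Daemon side of `write(fuse_fd)`: a reply is accepted only for a
    /// request that is currently being processed.
    pub fn daemon_write(&mut self, unique: u64, reply: Result<Vec<u8>, i32>) -> bool {
        if !self.processing.remove(&unique) {
            return false;
        }
        self.replies.insert(unique, reply);
        true
    }

    /// Kernel side: the caller collects its reply once it is available.
    pub fn take_reply(&mut self, unique: u64) -> Option<Result<Vec<u8>, i32>> {
        self.replies.remove(&unique)
    }
}
```

Rejecting a reply whose `unique` is not in the processing set is also what makes late replies to interrupted requests harmless in this model.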
Interrupt handling
If the calling thread receives a fatal signal while waiting for a FUSE reply,
the kernel enqueues a FUSE_INTERRUPT message targeting the original request's
unique ID. It then waits a short grace period (default 20 milliseconds). If the
daemon does not abort the request and send a reply within that window, the kernel
forcibly removes the request from FuseConn.processing and returns EINTR to
the caller. The kernel discards any reply the daemon later sends for
the interrupted unique.
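The grace-period wait can be approximated with a channel and a timed receive. This is only a sketch: `wait_after_interrupt` is a hypothetical helper, a `std::sync::mpsc` channel stands in for `FuseRequest.waker`, and the error is returned as a negative errno to match the `FuseReply::Done` convention.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Linux errno value for an interrupted call.
pub const EINTR: i32 = 4;
/// Default grace period quoted in the text.
pub const INTERRUPT_GRACE: Duration = Duration::from_millis(20);

/// Called after FUSE_INTERRUPT has been enqueued: wait one grace period
/// for the daemon's reply, then give up with -EINTR.
pub fn wait_after_interrupt(reply_rx: &Receiver<Vec<u8>>) -> Result<Vec<u8>, i32> {
    match reply_rx.recv_timeout(INTERRUPT_GRACE) {
        // The daemon answered (or aborted with a reply) in time.
        Ok(bytes) => Ok(bytes),
        // No reply within the window, or the daemon side went away.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => Err(-EINTR),
    }
}
```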
Writeback cache (WRITEBACK_CACHE flag)
When this capability is negotiated, dirty pages accumulate in the kernel page
cache and are written to the daemon in larger batches via FUSE_WRITE. Without
it, every write(2) to a FUSE file generates an immediate, synchronous
FUSE_WRITE to the daemon, serializing all write traffic. Most performance-
sensitive FUSE filesystems negotiate WRITEBACK_CACHE.
Connection death
When the last daemon fd is closed (the daemon exits, crashes, or closes
/dev/fuse explicitly):
- FuseConn.destroyed is set atomically.
- All requests in FuseConn.processing are completed with error ENOTCONN.
- All requests in FuseConn.pending are discarded.
- Subsequent VFS operations on the mount return EIO.
- The mount point persists in the namespace; an explicit umount or fusermount -u is required to remove it.
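The teardown sequence above can be sketched as a small state model. `TeardownModel` is a hypothetical type; requests are reduced to their `unique` IDs, and errors are negative errno values as elsewhere in this chapter.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

pub const EIO: i32 = 5;
pub const ENOTCONN: i32 = 107;

/// Minimal model of a connection's queues and its `destroyed` flag.
#[derive(Default)]
pub struct TeardownModel {
    destroyed: AtomicBool,
    pending: VecDeque<u64>,
    processing: Vec<u64>,
}

impl TeardownModel {
    /// VFS side: refused with -EIO once the connection is dead.
    pub fn submit(&mut self, unique: u64) -> Result<(), i32> {
        if self.destroyed.load(Ordering::Acquire) {
            return Err(-EIO);
        }
        self.pending.push_back(unique);
        Ok(())
    }

    /// Daemon side: move the oldest pending request to `processing`.
    pub fn daemon_read(&mut self) -> Option<u64> {
        let unique = self.pending.pop_front()?;
        self.processing.push(unique);
        Some(unique)
    }

    /// Last daemon fd closed: mark dead, fail in-flight requests with
    /// -ENOTCONN, and discard everything still pending.
    pub fn on_last_fd_closed(&mut self) -> Vec<(u64, i32)> {
        self.destroyed.store(true, Ordering::Release);
        let failed = self.processing.drain(..).map(|u| (u, -ENOTCONN)).collect();
        self.pending.clear();
        failed
    }

    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }
}
```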
14.11.7 Security Model¶
Mount-owner restriction (default)
Unless the allow_other mount option is passed, only the UID that opened
/dev/fuse and performed the mount may access the filesystem. All other UIDs
receive EACCES from the UmkaOS VFS layer before the request reaches the daemon,
regardless of the file mode bits the daemon returns.
allow_other option
Permits any UID to access the filesystem subject to normal Unix permission
checks. Because allow_other exposes the daemon process to arbitrary user
requests, it requires either:
- The SysAdmin capability in the mount namespace, or
- The /proc/sys/fs/fuse/user_allow_other sysctl set to 1 (off by default).
default_permissions option
When set, the kernel enforces standard Unix permission checks (owner, group,
other; st_mode, st_uid, st_gid) against the attributes the daemon returns
in FUSE_GETATTR. The kernel never sends FUSE_ACCESS in this mode. Without
default_permissions, the daemon is responsible for its own access control and
receives FUSE_ACCESS for every access check.
Privilege requirement for mounting
Unprivileged FUSE mounts (without SysAdmin) are permitted only through
fusermount3, which is installed setuid-root and validates that the user owns
the target mountpoint. Direct mount(2) requires SysAdmin in the current
user namespace.
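The two access rules in this section reduce to short predicates. This is a sketch with hypothetical helper names; the `-EACCES` result follows the text, while returning `-EPERM` when `allow_other` is not permitted is an assumption of the sketch, since the text does not name that error.

```rust
pub const EPERM: i32 = 1;
pub const EACCES: i32 = 13;

/// Mount-owner restriction: without `allow_other`, only the UID that
/// performed the mount may access the filesystem.
pub fn may_access(mount_owner_uid: u32, caller_uid: u32, allow_other: bool) -> Result<(), i32> {
    if allow_other || caller_uid == mount_owner_uid {
        Ok(())
    } else {
        Err(-EACCES)
    }
}

/// `allow_other` needs either the SysAdmin capability or the
/// `user_allow_other` sysctl. The -EPERM value is an assumption.
pub fn may_use_allow_other(has_sys_admin: bool, user_allow_other: bool) -> Result<(), i32> {
    if has_sys_admin || user_allow_other {
        Ok(())
    } else {
        Err(-EPERM)
    }
}
```

When `may_access` succeeds for a foreign UID, the normal Unix permission checks described under `default_permissions` still apply afterwards.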
14.11.8 io_uring FUSE¶
UmkaOS supports the io_uring-based FUSE I/O path (OVER_IO_URING feature, equivalent
to Linux 6.14+). The daemon opts in by negotiating the OVER_IO_URING capability
during FUSE_INIT and then submitting SQEs of type IORING_OP_URING_CMD to the
/dev/fuse fd rather than using blocking read/write.
Benefits over the classic blocking I/O path:
- Asynchronous request handling: the daemon can have many requests in flight simultaneously without blocking threads.
- Reduced syscall overhead: requests are batched via io_uring_submit; one syscall drains or fills multiple queue slots.
- CPU affinity: the daemon can pin io_uring workers to specific CPUs, reducing cross-socket latency for NUMA-aware FUSE filesystems.
The FUSE daemon registers a fixed buffer pool at startup. The kernel delivers
requests into pre-registered buffers, and the daemon submits replies via the same
ring. The wire format (FuseInHeader, FuseOutHeader, opcode bodies) is
unchanged; only the transport mechanism differs.
Capability requirement: OVER_IO_URING negotiation requires the daemon process
to hold CAP_IPC_LOCK (needed for the io_uring fixed buffer registration, which
pins user pages via IORING_REGISTER_BUFFERS). If the daemon lacks CAP_IPC_LOCK,
the OVER_IO_URING capability is silently cleared from the FUSE_INIT response and
the connection falls back to the classic blocking I/O path.
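The silent fallback can be expressed as a mask over the negotiated flags. This is a sketch; `effective_init_flags` and `uses_io_uring` are hypothetical helpers, and how the kernel learns whether the daemon holds CAP_IPC_LOCK is outside its scope.

```rust
/// Bit 41 of the 64-bit FUSE_INIT flag set.
pub const OVER_IO_URING: u64 = 1 << 41;

/// Clear OVER_IO_URING when the daemon lacks CAP_IPC_LOCK; every other
/// negotiated capability is left untouched.
pub fn effective_init_flags(negotiated: u64, has_cap_ipc_lock: bool) -> u64 {
    if has_cap_ipc_lock {
        negotiated
    } else {
        negotiated & !OVER_IO_URING
    }
}

/// True when the connection uses the io_uring transport.
pub fn uses_io_uring(flags: u64) -> bool {
    flags & OVER_IO_URING != 0
}
```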
14.11.9 Linux Compatibility¶
- /dev/fuse device node (major 10, minor 229): identical to Linux.
- FUSE protocol version 7.45 (Linux 6.14+ equivalent) is the highest version the kernel negotiates. With a daemon that supports a higher minor, the connection still runs at 7.45.
- libfuse3 (3.x series): works without modification.
- fusermount3 and the fuse.ko-equivalent path: built into the UmkaOS VFS layer; no kernel module is required.
- All widely deployed FUSE filesystems run without modification: sshfs, rclone mount, glusterfs-fuse, ceph-fuse, bindfs, s3fs-fuse, encfs, gvfs, ntfs-3g.
- DAX (FUSE_SETUPMAPPING / FUSE_REMOVEMAPPING) is supported on persistent memory-backed FUSE mounts, providing zero-copy access to file data.
14.11.10 VFS Service Provider¶
Provider model: VFS service is always a host-native provider (the host's
kernel manages the filesystem). Device-native Tier M providers do not apply here —
storage devices provide BLOCK_STORAGE, not FILESYSTEM. The VFS service
provider runs on the host that mounts the filesystem and serves it to remote peers.
Sharing model: multiple remote peers mount the same export simultaneously
(close-to-open consistency, DLM-coordinated locking).
A host can provide mounted filesystems as cluster services via the peer protocol. Remote peers mount the export as a local filesystem and perform file operations transparently — the VFS dispatches operations to the remote host, which executes them against its local filesystem.
This is the VFS instantiation of the capability service provider model described in Section 5.7. In a uniform UmkaOS cluster, the VFS service provider offers file sharing without requiring NFS, nfsd, portmapper, idmapd, or any external daemon. The cluster infrastructure (peer protocol, DLM, PeerRegistry, heartbeat) provides everything needed.
// umka-vfs/src/service_provider.rs
/// Provides a local mount point as a cluster service to remote peers.
pub struct VfsServiceProvider {
/// The local mount point being served (e.g., "/data").
mount: MountHandle,
/// Unique service identifier. Used as the DLM lock namespace for all
/// file locks on this service (Section 14.7.10.3).
service_id: ExportId,
/// Transport endpoint for receiving remote VFS operations.
endpoint: PeerEndpoint,
/// Lease duration for metadata caching (default: 30 seconds).
/// Remote peers cache metadata (stat, readdir) for this duration.
/// On expiry, they must re-validate with the server.
lease_duration_ms: u32,
/// Maximum concurrent remote operations.
max_inflight: u32,
/// Connected clients, tracked for lease invalidation and recovery.
/// Keyed by PeerId (u64). XArray provides O(1) lookup with native
/// RCU-protected reads (no read-side locking) and ordered iteration.
clients: XArray<ServiceClientState>,
}
/// Per-client state on the server side. Tracks leases and open files
/// for recovery after client disconnect/reconnect.
pub struct ServiceClientState {
/// Peer ID of the connected client.
peer_id: PeerId,
/// PeerRegistry generation at last sync. Used to detect stale clients.
last_registry_gen: u64,
/// Active inode leases held by this client. Keyed by InodeId (u64);
/// value is `()` (presence-only tracking). XArray per collection policy
/// (integer-keyed mapping, warm path — lease grant/revoke).
leases: XArray<()>,
/// Open file handles (for recovery after server reboot). Keyed by
/// FileHandle (u64). XArray per collection policy (integer-keyed mapping).
/// Maximum 4096 open files per client (enforced at Open time; server
/// returns -EMFILE if exceeded).
open_files: XArray<OpenFileRecord>,
}
/// Recovery metadata for one open file on a remote client.
/// Used during server reboot recovery (grace period) to validate
/// client reclaim requests. Stored in `ServiceClientState::open_files`,
/// keyed by `FileHandle` (u64). The server populates this on every
/// successful `Open` and removes it on `Release`.
pub struct OpenFileRecord {
/// Server-assigned file handle.
handle: FileHandle,
/// Inode this file handle refers to.
inode_id: InodeId,
/// Open flags (O_RDONLY, O_WRONLY, O_RDWR, etc.).
flags: u32,
/// Client's UID at open time (for permission re-verification on reclaim).
uid: u32,
/// Client's GID at open time.
gid: u32,
}
// `VfsServiceOp` (defined further below) is a VFS operation forwarded from a
// remote peer. It is modeled after FUSE opcodes but uses native UmkaOS VFS
// types, not the FUSE wire format.
//
// Every mutating operation carries the caller's `uid` and `gid` for
// permission checking on the server (Section 14.7.10.2).
/// Fixed-size filename for wire protocol. NUL-padded, max 255 bytes.
#[repr(C)]
pub struct FileName {
/// Actual name length in bytes (excluding NUL).
pub len: u8,
/// NUL-padded name bytes (only first `len` bytes are significant).
pub bytes: [u8; 255],
}
// Layout: 1 + 255 = 256 bytes.
const_assert!(size_of::<FileName>() == 256);
/// Fixed-size xattr name for wire protocol (same layout as FileName).
pub type XattrName = FileName;
/// Xattr value descriptor. Values <= 224 bytes are inlined; larger values
/// use bulk transfer via a shared memory region.
#[repr(C)]
pub struct XattrValue {
/// Actual value length in bytes.
pub len: u32,
/// Inline data (valid for first `min(len, 224)` bytes).
pub inline_data: [u8; 224],
/// Explicit padding after inline_data (offset 228) to align bulk_offset (u64, align 8).
/// 228 % 8 = 4, need 4 bytes. CLAUDE.md rule 11.
pub _pad: [u8; 4],
/// Non-zero if value was transferred via bulk region (offset into
/// the shared data region). Zero if fully inlined.
pub bulk_offset: u64,
}
// Layout: len(4) + inline_data(224) + _pad(4) + bulk_offset(8) = 240 bytes.
// All padding explicit.
const_assert!(size_of::<XattrValue>() == 240);
/// Wire protocol discriminant. Append-only: new operations are added at
/// the end with incrementing values. Do not reorder or remove variants.
#[repr(C, u16)]
pub enum VfsServiceOp {
Lookup { parent: InodeId, name: FileName, uid: u32, gid: u32 },
Getattr { inode: InodeId },
/// `attrs` is a `SetAttrMask` bitmask specifying which attributes to set.
Setattr { inode: InodeId, attrs: SetAttrMask, uid: u32, gid: u32 },
Open { inode: InodeId, flags: u32, uid: u32, gid: u32 },
Read { handle: FileHandle, offset: u64, len: u32 },
Write { handle: FileHandle, offset: u64, data_region_offset: u64, data_len: u32 },
Release { handle: FileHandle },
/// `offset` is an opaque server-assigned cookie (NOT a byte offset or
/// entry index). Value 0 starts from the beginning of the directory.
/// Each Readdir response includes the cookie for the next batch.
/// This matches the NFS cookie model and avoids issues with
/// concurrent directory mutations invalidating positional offsets.
Readdir { inode: InodeId, offset: u64, uid: u32, gid: u32 },
Create { parent: InodeId, name: FileName, mode: u32, flags: u32, uid: u32, gid: u32 },
Unlink { parent: InodeId, name: FileName, uid: u32, gid: u32 },
Mkdir { parent: InodeId, name: FileName, mode: u32, uid: u32, gid: u32 },
Rmdir { parent: InodeId, name: FileName, uid: u32, gid: u32 },
Rename { src_parent: InodeId, src_name: FileName,
dst_parent: InodeId, dst_name: FileName, uid: u32, gid: u32 },
Fsync { handle: FileHandle, datasync: u8 }, // 0 = fsync, 1 = fdatasync. u8 for wire safety.
Statfs,
/// File locking operations. Lock state is managed by the DLM
/// (Section 14.7.10.3); these ops coordinate with the server's
/// local filesystem lock state.
Lock { handle: FileHandle, lock_type: LockType, start: u64, len: u64, uid: u32 },
Unlock { handle: FileHandle, start: u64, len: u64 },
/// Create a symbolic link. `target` is the symlink destination path.
Symlink { parent: InodeId, name: FileName, target: FileName, uid: u32, gid: u32 },
/// Read the target of a symbolic link.
Readlink { inode: InodeId },
/// Create a hard link. `inode` is the existing file; `new_parent`/`new_name`
/// specify the new directory entry pointing to it.
Link { inode: InodeId, new_parent: InodeId, new_name: FileName, uid: u32, gid: u32 },
/// Get an extended attribute value.
Getxattr { inode: InodeId, name: XattrName, uid: u32, gid: u32 },
/// Set an extended attribute. `flags` follows Linux semantics:
/// `XATTR_CREATE` (1) = fail if exists, `XATTR_REPLACE` (2) = fail if absent.
Setxattr { inode: InodeId, name: XattrName, value: XattrValue, flags: u32, uid: u32, gid: u32 },
/// List all extended attribute names on an inode.
Listxattr { inode: InodeId, uid: u32, gid: u32 },
/// Remove an extended attribute.
Removexattr { inode: InodeId, name: XattrName, uid: u32, gid: u32 },
}
// VfsServiceOp is #[repr(C, u16)]: overall alignment = 8 (from u64 fields).
// Discriminant u16 at offset 0 (2 bytes), 6 bytes padding to offset 8 for
// first field alignment.
//
// Largest variant: Rename { src_parent: InodeId(8), src_name: FileName(256),
// dst_parent: InodeId(8), dst_name: FileName(256), uid: u32(4), gid: u32(4) }
// Layout: offset 8..16(InodeId) + 16..272(FileName) + 272..280(InodeId)
// + 280..536(FileName) + 536..540(u32) + 540..544(u32) = 544 bytes total.
//
// Runner-up: Symlink { parent(8), name(256), target(256), uid(4), gid(4) }
// = 2(discrim) + 6(discrim pad) + 8+256+256+4+4 = 536 bytes.
// Setxattr { inode(8), name(256), value(240), flags(4), uid(4), gid(4) }
// = 2(discrim) + 6(discrim pad) + 8+256+240+4+4+4 + 4(trailing align) = 528 bytes.
//
// EVERY variant on the wire takes 544 bytes, even a Getattr (16 bytes of
// actual data). Acceptable for a KABI ring (not a network wire protocol);
// consider a header+opcode+payload redesign if ring bandwidth is a concern.
const_assert!(size_of::<VfsServiceOp>() == 544);
/// Xattr encoding rules:
///
/// - `XattrName`: up to 255 bytes (`XATTR_NAME_MAX`, matching Linux). Sent
/// inline in the `ServiceMessage` payload.
/// - `XattrValue`: up to 65536 bytes (`XATTR_SIZE_MAX`, matching Linux).
/// Values <= 224 bytes are sent inline in the `ServiceMessage`. Values
/// > 224 bytes use bulk transfer via the peer transport: the client
/// writes the value into a bounce buffer and sends `Setxattr` with
/// `data_region_offset` pointing to the buffer; the server fetches the
/// value via remote fetch through the peer transport.
/// - `Listxattr` returns a null-separated list of attribute names. If the
/// total list exceeds 224 bytes, it is transferred via bulk push from
/// the server to the client's bounce buffer.
///
/// The 224-byte inline threshold is chosen to fit within a single
/// `ServiceMessage` payload (256 bytes minus header overhead), avoiding
/// a separate bulk transfer for small xattr values (the common case for
/// security labels, POSIX ACLs, and user attributes).
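/// SetAttrMask bitmask specifying which attributes to set. Bits 0-5 match
/// both Linux `ATTR_*` and FUSE `FATTR_*`. `CTIME` (bit 6) follows Linux
/// `ATTR_CTIME`, while `ATIME_NOW`/`MTIME_NOW` (bits 7-8) follow the
/// `FATTR_*_NOW` flags used by `fuse_setattr_in`. The mask is therefore not
/// bit-identical to either interface above bit 5.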
bitflags! {
pub struct SetAttrMask: u32 {
/// Set file mode (permissions).
const MODE = 1 << 0;
/// Set owner UID.
const UID = 1 << 1;
/// Set owner GID.
const GID = 1 << 2;
/// Set file size (truncate).
const SIZE = 1 << 3;
/// Set access time to a specific value.
const ATIME = 1 << 4;
/// Set modification time to a specific value.
const MTIME = 1 << 5;
/// Set change time (server updates ctime automatically on any change;
/// this flag is for explicit ctime override when restoring backups).
const CTIME = 1 << 6;
/// Set access time to current server time.
const ATIME_NOW = 1 << 7;
/// Set modification time to current server time.
const MTIME_NOW = 1 << 8;
}
}
/// Server-assigned file handle. Opaque to the client. Maps to an open
/// file descriptor on the server's VFS. Valid for the lifetime of the
/// client connection (or until Release).
pub type FileHandle = u64;
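The 544-byte size asserted for `VfsServiceOp` above can be checked mechanically. The following is a minimal sketch, not the real definition: `OpLayoutProbe` and the stand-in types `InodeId`, `FileName`, `XattrName`, and `XattrValue` are assumptions sized to match the layout comment (8, 256, 256, and 240 bytes), and only the three largest variants plus one field-less variant are reproduced.

```rust
use std::mem::{align_of, size_of};

// Stand-in types (assumed sizes taken from the layout comment, not the real definitions).
#[repr(C)]
pub struct InodeId(pub u64); // 8 bytes, align 8
#[repr(C)]
pub struct FileName(pub [u8; 256]); // 256 bytes, align 1
#[repr(C)]
pub struct XattrName(pub [u8; 256]); // 256 bytes, align 1
#[repr(C)]
pub struct XattrValue(pub [u8; 240]); // 240 bytes, align 1

// Hypothetical probe enum: same representation attribute as VfsServiceOp,
// reduced to the variants that determine the overall size.
#[repr(C, u16)]
pub enum OpLayoutProbe {
    Statfs,
    Rename {
        src_parent: InodeId,
        src_name: FileName,
        dst_parent: InodeId,
        dst_name: FileName,
        uid: u32,
        gid: u32,
    },
    Symlink { parent: InodeId, name: FileName, target: FileName, uid: u32, gid: u32 },
    Setxattr {
        inode: InodeId,
        name: XattrName,
        value: XattrValue,
        flags: u32,
        uid: u32,
        gid: u32,
    },
}

// u16 tag at offset 0, payload union at offset 8 (u64 alignment),
// largest payload (Rename) is 536 bytes, so the total is 8 + 536.
pub const PROBE_SIZE: usize = size_of::<OpLayoutProbe>();
pub const PROBE_ALIGN: usize = align_of::<OpLayoutProbe>();
```

Every value of the enum occupies `PROBE_SIZE` bytes regardless of variant, which is the fixed per-message cost the layout comment describes.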
InodeId scope: InodeId values are scoped to a single VFS service export
(one server mount point). They are NOT globally unique across the cluster. The
combination (server_peer_id, service_id, inode_id) is globally unique.
Clients must not compare InodeId values across different peerfs mounts.
Wire protocol: operation forwarding over the peer protocol. Data transfers (Read, Write) use remote write/read via the peer transport for zero-copy. Metadata and control operations use ring pair send/recv.
Server side: the VFS service provider receives operations, dispatches them to the local VFS layer (which invokes the local filesystem — ext4, XFS, etc.), and returns results. The server is entirely in-kernel — no userspace daemon involved (unlike NFS's nfsd or FUSE daemons). Permission checks use the caller's UID/GID against the local filesystem's ownership and mode bits.
Client side: remote peers mount the export using a dedicated filesystem
type (mount -t peerfs host_peer_id:/data /mnt/remote). The client
filesystem translates local VFS operations into VfsServiceOp messages and
sends them to the server peer.
14.11.10.1 Scope and Relationship to NFS¶
VFS service provider is designed for uniform UmkaOS clusters — all nodes run UmkaOS, share a consistent UID/GID namespace, and trust each other at the kernel level (mutual peer authentication via the capability system). In this environment, it replaces NFS entirely: no RPC/XDR, no portmapper, no separate daemon, no exports file. File sharing is a native cluster capability.
VFS service provider does not cover the full NFS feature set:
| Feature | VFS Export | NFS v4.2 |
|---|---|---|
| Wire protocol | Native peer protocol (RDMA) | RPC/XDR (+ optional RDMA) |
| Authentication | Peer capability + UID pass-through | Kerberos/GSSAPI, AUTH_SYS |
| UID mapping | None (consistent namespace assumed) | idmapd, Kerberos principal |
| File locking | DLM (Section 15.15) | Built-in NFSv4 locks (no separate NLM) |
| Consistency | Close-to-open + leases | Close-to-open + delegations |
| Server recovery | DLM lock recovery + heartbeat | Grace period + reclaim |
| Parallel data | Not supported | pNFS |
| Referrals | Not supported | v4.1 referrals |
| ACL model | POSIX ACLs (from underlying FS) | NFSv4 ACLs |
| Configuration | Zero (auto-discovered via PeerRegistry) | exports file, mount options |
| Daemons required | None | nfsd, mountd, idmapd, gssproxy |
For mixed environments (Linux clients, Windows clients, NAS appliances, or clusters where per-user Kerberos authentication is required), NFS (Section 15.14) remains the right choice.
14.11.10.2 Identity and Permission Model¶
VFS service provider uses UID/GID pass-through: the client sends the calling process's UID and GID with every operation that requires permission checking. The server applies standard POSIX permission checks (owner, group, other, POSIX ACLs) against the local filesystem using the received UID/GID.
Assumptions (non-negotiable for VFS service provider):
1. All peers in the cluster share a consistent UID/GID namespace.
(Same /etc/passwd, LDAP, or equivalent directory service on all nodes.)
2. The client peer is authenticated via the capability system (Section 9.1).
UID/GID are trusted because the client kernel is trusted.
3. Root squash: optional, configurable per-export. When enabled, uid 0 from
remote peers is mapped to nobody (65534). Default: enabled.
This is intentionally simple. UmkaOS clusters are managed as a single system with a single identity domain. Cross-domain authentication (Kerberos, GSSAPI) is the job of NFS, not the VFS service provider.
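The pass-through and root squash rules above amount to a small mapping applied by the server before its permission check. This is a minimal sketch under stated assumptions: `Cred`, `NOBODY_UID`, and `effective_cred` are hypothetical names, and leaving the GID unchanged when squashing is an assumption, since the text only specifies the UID mapping.

```rust
/// UID that squashed root requests are mapped to (value from the text above).
pub const NOBODY_UID: u32 = 65534;

/// Credentials as received on the wire (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Cred {
    pub uid: u32,
    pub gid: u32,
}

/// Map wire credentials to the credentials used for the POSIX permission check.
/// Assumption: only the UID is squashed; the GID is passed through unchanged.
pub fn effective_cred(wire: Cred, root_squash: bool) -> Cred {
    if root_squash && wire.uid == 0 {
        Cred { uid: NOBODY_UID, gid: wire.gid }
    } else {
        wire
    }
}
```

With root squash enabled (the default), a request carrying uid 0 is checked as uid 65534; every other UID passes through untouched.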
14.11.10.3 File Locking via DLM¶
File locks on exported filesystems are managed by the cluster's DLM (Section 15.15). The DLM already provides distributed lock coordination, deadlock detection (Section 15.15), and lock recovery on node failure.
/// DLM lock resource name for a file lock on an exported filesystem.
/// Composed from export ID and inode number — globally unique within
/// the cluster.
pub struct VfsLockResource {
pub service_id: ExportId,
pub inode_id: InodeId,
}
/// Lock type for VFS service provider locks. Maps directly to POSIX lock types.
#[repr(u8)]
pub enum LockType {
/// flock() shared lock or fcntl() F_RDLCK.
Shared = 0,
/// flock() exclusive lock or fcntl() F_WRLCK.
Exclusive = 1,
}
Lock flow (process on Host B locks file on Host A's export):
- Process calls flock(fd, LOCK_EX) on a file mounted via peerfs.
- Client VFS sends Lock { handle, Exclusive, 0, WHOLE_FILE, uid } to the server peer.
- Server acquires a DLM lock on VfsLockResource { service_id, inode_id } in exclusive mode. If the DLM lock is already held by another peer, the request blocks (or returns EWOULDBLOCK for LOCK_NB).
- After DLM grant, server also acquires the local filesystem lock (so local processes and remote clients see consistent lock state).
- Server responds with success. Client flock() returns.
fcntl() byte-range locks: supported. The DLM lock resource includes
the byte range: VfsLockResource { service_id, inode_id, start, len }.
DLM handles range overlap and splitting.
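The overlap rule that byte-range locking depends on can be illustrated as follows. This is a sketch of the conflict test only, not of the DLM: `ByteRangeLock`, `overlaps`, and `conflicts` are hypothetical names, and modelling a whole-file lock as `len = u64::MAX` is an assumption made here because the value of `WHOLE_FILE` is not given in this section.

```rust
/// Mirrors the LockType enum above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LockType {
    Shared = 0,
    Exclusive = 1,
}

/// One held or requested byte-range lock (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
pub struct ByteRangeLock {
    pub owner: u64,
    pub lock_type: LockType,
    pub start: u64,
    pub len: u64,
}

/// Exclusive end of the range; saturates so that huge lengths do not overflow.
fn end(l: &ByteRangeLock) -> u64 {
    l.start.saturating_add(l.len)
}

/// Two half-open ranges [start, end) overlap if each starts before the other ends.
pub fn overlaps(a: &ByteRangeLock, b: &ByteRangeLock) -> bool {
    a.len > 0 && b.len > 0 && a.start < end(b) && b.start < end(a)
}

/// Locks conflict when different owners hold overlapping ranges
/// and at least one of the two locks is exclusive.
pub fn conflicts(a: &ByteRangeLock, b: &ByteRangeLock) -> bool {
    a.owner != b.owner
        && overlaps(a, b)
        && (a.lock_type == LockType::Exclusive || b.lock_type == LockType::Exclusive)
}
```

Adjacent ranges such as [0, 10) and [10, 20) do not conflict, and two shared locks never conflict regardless of overlap.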
Deadlock detection: the DLM's WaitForGraph
(Section 15.15) detects cross-node deadlocks.
If a deadlock cycle is found, one holder receives EDEADLK.
Lock recovery on client failure: when a client peer is declared Dead (heartbeat timeout, Section 5.8), the DLM releases all locks held by that peer. Server-side open file state for the dead client is cleaned up.
14.11.10.4 Consistency Model: Close-to-Open¶
VFS service provider uses close-to-open consistency, the same model as NFS. This is simple, well-understood, and sufficient for the vast majority of workloads.
Rules:
1. CLOSE flushes: when a client closes a file (Release op), all dirty data
and metadata are flushed to the server before Release returns. The server
commits to the underlying filesystem (fsync if O_SYNC, writeback otherwise).
2. OPEN validates: when a client opens a file (Open op), the client
invalidates all cached attributes for that inode and fetches fresh
metadata from the server. If the file was modified by another client
since the last open, the new data is visible.
3. Between open and close: the client may cache read data and buffer
writes. Concurrent access to the same file from multiple clients
without locking has undefined ordering (same as NFS). Use flock()
or fcntl() for coordination.
4. Lease-assisted invalidation: the server sends lease invalidation to
clients holding metadata leases when an inode is modified. This
provides best-effort visibility between open/close cycles. Delivery is
not guaranteed, but changes typically become visible within 1-2 lease durations.
fsync() semantics: Fsync operation is forwarded to the server, which
calls fsync() on the underlying filesystem. Returns only after data is
persistent on the server's storage.
14.11.10.5 Metadata Caching and Leases¶
The client caches stat and readdir results for the lease duration
(default: 30 seconds, configurable per-export). Leases are per-inode,
granted implicitly on Lookup and Getattr responses.
Lease invalidation: when the server modifies an inode (local write, unlink, rename, chmod, etc.), it sends invalidation messages to all clients holding leases for that inode. Clients receiving invalidation drop their cached attributes; the next access re-fetches from the server.
Unreachable client: if a client is unreachable during invalidation, the lease expires naturally after the lease duration. The server does NOT block on client acknowledgment — invalidation is best-effort, and close-to-open consistency provides the correctness backstop.
Read caching: file data may be cached on the client for the lease duration. On lease invalidation, cached data for the invalidated inode is also dropped. This provides reasonable read performance for read-heavy workloads without complex delegation machinery.
Write buffering: dirty writes are buffered on the client and flushed on
close(), fsync(), or when the write buffer exceeds a threshold (default:
1 MiB per file). The server may send an early flush request if another client
opens the same file (to make close-to-open consistency work without waiting
for the first client's close).
14.11.10.5.1 Lease Invalidation Wire Protocol¶
The server sends a ServiceMessage with opcode LEASE_INVALIDATE (0x0100)
containing a batch of InodeId values to invalidate (up to 32 per message,
batched to reduce round trips).
/// Lease invalidation message payload. Sent from server to client when
/// inodes held in the client's lease set are modified on the server.
#[repr(C)]
pub struct LeaseInvalidation {
/// Number of inodes in this batch (1-32).
count: u16,
_pad: [u8; 6],
/// Array of inode IDs to invalidate. Only the first `count` entries
/// are valid; remaining entries are undefined.
inodes: [InodeId; 32],
}
// Layout: 2 + 6(pad) + 32×8 = 264 bytes.
const_assert!(size_of::<LeaseInvalidation>() == 264);
Delivery: fire-and-forget (no ACK required). The server does NOT block on delivery — invalidation is best-effort. If the ring pair send fails (client unreachable), the message is discarded. Close-to-open consistency (Section 14.11) provides the correctness backstop: any client that opens a file will always see fresh data regardless of whether invalidation was delivered.
Server-side invalidation trigger: any local operation that modifies an
inode (write, truncate, chmod, chown, rename, unlink, link,
setxattr) scans the clients XArray for leases containing that inode_id
and batches invalidation messages. The scan uses RCU-protected reads on the
per-client leases XArray — no locking required on the read side. Batching
collects up to 32 inodes per message before sending, with a 1 ms flush
timer to bound latency when fewer than 32 inodes are pending.
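The batching rule (send at 32 inodes, or after 1 ms if fewer are pending) can be sketched as a small state machine. This is an illustration under stated assumptions: `InvalidationBatcher` and its methods are hypothetical, time is passed in explicitly as microseconds so the logic is deterministic, and a `Vec` stands in for the fixed `[InodeId; 32]` array that a kernel implementation would use to avoid allocation.

```rust
/// Maximum inodes per LEASE_INVALIDATE message (from the text above).
pub const MAX_BATCH: usize = 32;
/// Flush timer: 1 ms, expressed in microseconds.
pub const FLUSH_AFTER_US: u64 = 1_000;

/// Collects inode IDs for one client until a batch is ready (hypothetical type).
pub struct InvalidationBatcher {
    pending: Vec<u64>,
    /// Time at which the oldest pending inode was queued, if any.
    first_queued_us: Option<u64>,
}

impl InvalidationBatcher {
    pub fn new() -> Self {
        Self { pending: Vec::with_capacity(MAX_BATCH), first_queued_us: None }
    }

    /// Queue one inode. Returns a batch to send if this push filled it.
    pub fn push(&mut self, inode_id: u64, now_us: u64) -> Option<Vec<u64>> {
        if self.pending.is_empty() {
            self.first_queued_us = Some(now_us);
        }
        self.pending.push(inode_id);
        if self.pending.len() == MAX_BATCH {
            Some(self.take())
        } else {
            None
        }
    }

    /// Timer tick. Returns a partial batch once the oldest entry has waited 1 ms.
    pub fn poll(&mut self, now_us: u64) -> Option<Vec<u64>> {
        match self.first_queued_us {
            Some(t0) if now_us.saturating_sub(t0) >= FLUSH_AFTER_US => Some(self.take()),
            _ => None,
        }
    }

    /// Hand out the pending batch and reset the timer.
    fn take(&mut self) -> Vec<u64> {
        self.first_queued_us = None;
        std::mem::replace(&mut self.pending, Vec::with_capacity(MAX_BATCH))
    }
}
```

The sketch does not deduplicate inode IDs within a batch; whether the real batcher does is not stated in this section.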
14.11.10.6 Server Reboot Recovery¶
When the exporting host reboots, connected clients detect the reboot via the heartbeat protocol (generation change in PeerRegistry, Section 5.2).
Recovery protocol:
- Server reboots, re-joins the cluster with an incremented generation number. Re-exports its filesystems.
- Server enters a grace period (default: 45 seconds). During the grace period, only reclaim operations (Open and Lock with the RECLAIM flag) are accepted; new opens and mutations are rejected. This prevents new clients from acquiring locks that conflict with locks held by recovering clients.
- Clients detect the generation change, reconnect to the server, and reclaim their open file state:
  - Re-send Open for each file that was open before the reboot.
  - Re-acquire DLM locks (the DLM recovery protocol handles this automatically; see Section 15.15).
- After the grace period, normal operations resume. Clients that did not reclaim within the grace period have their open files and locks invalidated.
Grace period operation filtering: during the grace period, the server
classifies each incoming VfsServiceOp:
RECLAIM operations (allowed during grace):
- Open with RECLAIM flag set (client re-opening a file it had open
before reboot). Server validates against OpenFileRecord from the
client's pre-reboot state (persisted in the recovery log or re-sent
by the client in the reclaim Open message).
- Lock with RECLAIM flag (DLM lock reclaim — handled by DLM subsystem,
see [Section 15.15](15-storage.md#distributed-lock-manager)).
NON-RECLAIM operations (rejected during grace with -EAGAIN):
- Any Open without RECLAIM flag.
- Create, Mkdir, Unlink, Rmdir, Rename, Setattr, Setxattr.
- Write (new writes, not reclaim of buffered data).
- Readdir, Getattr, Lookup (metadata reads also deferred — server
state may be inconsistent during recovery).
RECLAIM flag encoding:
const OPEN_RECLAIM: u32 = 1 << 31;
Set in the VfsServiceOp::Open.flags field. The server masks this bit
before passing flags to the local filesystem open.
After the grace period expires (default 45 seconds, configurable via
per-export grace_period_s mount option), the RECLAIM flag is ignored
and all operations proceed normally. Clients that did not reclaim within
the window have their handles invalidated — subsequent operations on
stale handles return -ESTALE.
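The grace-period filter described above can be sketched as a single admission function. The names `Op`, `admit`, and the collapsed `Other` variant are hypothetical simplifications of `VfsServiceOp`. The `reclaim: bool` field on `Lock` is also an assumption, because this section does not say how the RECLAIM flag is carried on lock requests. Rejecting non-reclaim `Lock` requests and every other operation during grace follows the "reclaim only" rule rather than an explicit list entry.

```rust
/// RECLAIM bit in VfsServiceOp::Open.flags (from the text above).
pub const OPEN_RECLAIM: u32 = 1 << 31;
/// Linux errno value for EAGAIN.
pub const EAGAIN: i32 = 11;

/// Reduced operation set for this sketch (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Open { flags: u32 },
    Lock { reclaim: bool },
    /// Any other operation (Create, Write, Getattr, Lookup, ...).
    Other,
}

/// Decide whether an operation may proceed. On success the returned Open
/// has the RECLAIM bit masked off, ready to pass to the local filesystem.
pub fn admit(op: Op, in_grace: bool) -> Result<Op, i32> {
    match op {
        Op::Open { flags } => {
            if in_grace && (flags & OPEN_RECLAIM) == 0 {
                return Err(-EAGAIN);
            }
            // Outside the grace period the bit is ignored, so it is masked either way.
            Ok(Op::Open { flags: flags & !OPEN_RECLAIM })
        }
        Op::Lock { reclaim } => {
            if in_grace && !reclaim {
                Err(-EAGAIN)
            } else {
                Ok(op)
            }
        }
        Op::Other => {
            if in_grace {
                Err(-EAGAIN)
            } else {
                Ok(op)
            }
        }
    }
}
```

A client that receives `-EAGAIN` during the grace window is expected to retry after the window closes, which matches the brief stall described for client-side handling.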
Client-side handling: processes with open files on the rebooting server
experience a brief stall (grace period duration) followed by normal
operation. No EIO unless the server remains unreachable beyond the
heartbeat dead threshold.
14.11.10.7 Capability Gating and Discovery¶
Remote filesystem access requires CAP_FS_REMOTE
(Section 9.1). The
capability is checked per-connection (at mount time), not per-operation.
Discovery: hosts exporting filesystems advertise FILESYSTEM in their
PeerRegistry capabilities (Section 5.2).
Remote peers discover available exports by querying
PeerRegistry::peers_with_cap(FILESYSTEM), then requesting an export list
from the serving peer. No /etc/exports file, no showmount — discovery
is automatic via the cluster membership protocol.
14.11.10.8 PeerFS Client Filesystem Implementation¶
PeerFS is the client-side kernel filesystem that mounts a remote VFS service
export as a local filesystem. It translates VFS operations into VfsServiceOp
messages, manages a local inode cache, integrates with the page cache for
data caching, and handles server reconnection transparently.
Tier assignment: Tier 1 (hardware memory domain isolated). PeerFS runs in the VFS isolation domain alongside other filesystem drivers. Network I/O is delegated to the peer protocol transport in Core.
Phase: 2 (core cluster filesystem, required for multi-node operation).
14.11.10.8.1 PeerFs Struct¶
/// Client-side filesystem for mounting remote VFS service exports.
///
/// Registered with VFS as filesystem type `"peerfs"`. Each mount creates
/// one `PeerFs` instance attached to the superblock via `s_fs_info`.
///
/// **Superblock magic**: `PEERFS_SUPER_MAGIC = 0x50454552` (`"PEER"` in ASCII).
///
/// **Lifecycle**: Created during `fill_super`. Destroyed on unmount after
/// all open files are released and dirty data flushed to the server.
pub struct PeerFs {
/// Peer ID of the serving host.
pub server_peer_id: PeerId,
/// Export path on the server (e.g., "/data").
pub export_path: ArrayString<256>,
/// Service connection established via ServiceBind
/// ([Section 5.7](05-distributed.md#network-portable-capabilities--capability-service-providers)).
/// Provides the peer queue pair and control channel to the server.
pub conn: PeerFsConn,
/// Local cache of remote inodes. Keyed by server-assigned `InodeId` (u64).
/// XArray provides O(1) lookup with RCU-protected reads on the hot path.
pub inode_cache: XArray<Arc<PeerFsInode>>,
/// Number of cached inodes. Used for LRU eviction decisions.
pub nr_cached_inodes: AtomicU64,
/// Maximum cached inodes before LRU eviction begins (default: 65536).
pub max_cached_inodes: u64,
/// LRU list for inode cache eviction. Head = most recently used.
/// Protected by a dedicated spinlock (not the inode cache XArray lock)
/// to avoid contention between lookups (read-side RCU on XArray) and
/// LRU reordering. Eviction walks from tail (least recently used).
pub inode_lru: SpinLock<IntrusiveList<PeerFsInode>>,
/// Lease duration in milliseconds. Cached attributes and data are valid
/// for this duration after the server grants the lease. Default: 30000 (30s).
pub lease_duration_ms: u32,
/// Write buffer flush threshold per file in bytes. When buffered dirty
/// data for a single file exceeds this, writeback is triggered without
/// waiting for close. Default: 1 MiB (1_048_576).
pub writeback_threshold: u32,
/// Maximum concurrent in-flight operations to the server. Backpressure:
/// new operations block when this limit is reached. Default: 256.
pub max_inflight: u32,
/// Current in-flight operation count.
pub inflight: AtomicU32,
/// Wait queue for tasks blocked on inflight limit.
pub inflight_waitq: WaitQueue,
/// Retry timeout for server-unreachable in milliseconds. Operations
/// block for this duration before returning -EIO. Default: 60000 (60s).
pub retry_timeout_ms: u32,
/// Server generation (from PeerRegistry). Used to detect server reboots.
pub server_generation: AtomicU64,
/// True when the server is in grace period (lock reclaim only).
pub in_grace_period: AtomicBool,
/// Root squash: when true, uid 0 from this client is mapped to
/// nobody (65534) by the server. Default: true.
pub root_squash: bool,
/// Read-only mount. When true, all mutating operations return -EROFS.
pub read_only: bool,
}
/// Connection state to the remote VFS service provider.
pub struct PeerFsConn {
/// Peer protocol endpoint for control messages (ServiceBind channel).
pub endpoint: PeerEndpoint,
/// Data region registered with the peer transport at ServiceBind time.
/// Covers the client's bounce buffer pool for zero-copy bulk transfers.
pub data_region: ServiceDataRegion,
/// Bounce buffer pool for bulk data transfers. Pre-allocated at mount
/// time. Data capacity: `max_inflight * 128 KiB` (covers max concurrent I/O);
/// each `BounceBuffer` occupies 132 KiB once metadata and page alignment
/// are included. Slab-backed, no hot-path allocation.
pub bounce_pool: SlabPool<BounceBuffer>,
/// Connection state.
pub state: AtomicU8, // PeerFsConnState discriminant
/// Sequence number for request/response matching.
pub next_seq: AtomicU64,
}
/// Pre-allocated bounce buffer for bulk data transfers. Fixed-size,
/// slab-allocated from `PeerFsConn::bounce_pool`. One buffer per in-flight
/// I/O operation. Never heap-allocated on the hot path — the pool is
/// sized at mount time to `max_inflight` entries. Bounce buffer pool uses
/// vmalloc-backed allocation (not buddy allocator) to avoid high-order
/// page allocation failures. Each buffer is page-aligned within the
/// vmalloc region.
#[repr(C, align(4096))]
pub struct BounceBuffer {
/// Data area. Size: 128 KiB (covers the maximum single read/write
/// transfer size). Page-aligned for transport registration requirements.
pub data: [u8; 131072],
/// Transport-local key for this buffer (from the registered data region).
pub local_key: u32,
/// Explicit padding after local_key (u32 at offset 131072, ending at
/// 131076) to align region_offset (u64, align 8). 131076 % 8 = 4, so
/// 4 bytes are needed. CLAUDE.md rule 11.
pub _pad: [u8; 4],
/// Offset of this buffer within the registered data region. Used to
/// compute the remote-accessible address for bulk transfers.
pub region_offset: u64,
}
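// Layout: data(131072) + local_key(4) + _pad(4) + region_offset(8) = 131088 bytes.
// align(4096) rounds the size up to 135168 (33 pages). Interior padding is
// explicit; the 4080 trailing bytes are tail padding implied by the alignment.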
const_assert!(size_of::<BounceBuffer>() == 135168);
/// Connection state machine.
#[repr(u8)]
pub enum PeerFsConnState {
/// ServiceBind in progress.
Connecting = 0,
/// Normal operation.
Connected = 1,
/// Server unreachable, retrying.
Reconnecting = 2,
/// Server rebooted, in grace period (reclaim only).
GracePeriod = 3,
/// Unmounting, draining in-flight operations.
Draining = 4,
/// Terminal: connection dead, all ops return -EIO.
Dead = 5,
}
14.11.10.8.2 FileSystemOps Implementation¶
PeerFS implements the FileSystemOps trait (Section 14.1) to
register as filesystem type "peerfs". It does not require a block device
(FS_REQUIRES_DEV is not set).
mount / fill_super:
peerfs_mount(source: &str, flags: MountFlags, data: &[u8]) -> Result<SuperBlock>:
1. Parse `source` as "<peer_id>:<export_path>".
- peer_id: decimal u64 or hostname resolved via PeerRegistry.
- export_path: absolute path on the server.
If parse fails: return -EINVAL.
2. Parse mount options from `data`:
- lease_duration=N (seconds, default 30, range 1-3600)
- writeback_threshold=N (bytes, default 1048576, range 4096-67108864)
- max_inflight=N (default 256, range 16-4096)
- retry_timeout=N (seconds, default 60, range 5-600)
- ro (read-only)
- norootsquash (disable root squash)
3. Resolve server_peer_id via PeerRegistry
([Section 5.2](05-distributed.md#cluster-topology-model--peer-registry)).
Check FILESYSTEM flag in server's PeerCapFlags. If absent: return -ENOENT.
4. Check CAP_FS_REMOTE capability on calling task
([Section 9.1](09-security.md#capability-based-foundation)). If absent: return -EPERM.
5. ServiceBind to the server's VFS service
([Section 5.7](05-distributed.md#network-portable-capabilities--capability-service-providers)).
Payload includes export_path and client's lease_duration preference.
Server may adjust lease_duration downward. On failure: return -ECONNREFUSED.
6. Register data region with peer transport for bounce buffer pool.
7. Allocate PeerFs struct, populate all fields.
8. Send VfsServiceOp::Lookup { parent: ROOT_INODE, name: "." } to
fetch root inode attributes from server.
9. Allocate SuperBlock:
- s_type = "peerfs"
- s_blocksize = server-reported block size (from Statfs)
- s_maxbytes = i64::MAX
- s_flags = flags | MS_NOSUID | MS_NODEV
- s_fs_info = PeerFs pointer
- s_root = dentry for root inode
- s_magic = PEERFS_SUPER_MAGIC (0x50454552)
- s_bdev = None (no local block device)
- s_bdi = None (no local backing device; writeback managed by peerfs)
10. Register heartbeat callback for server_peer_id to detect reboots.
Return SuperBlock.
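Steps 1 and 2 of the mount sequence (source parsing and option validation) can be sketched as follows. This is an illustration under stated assumptions: `MountOpts`, `parse_source`, and `parse_opts` are hypothetical names, options are assumed to arrive as a comma-separated string, unknown options are assumed to be rejected, and hostname resolution through PeerRegistry is omitted, so only decimal peer IDs are accepted here.

```rust
/// Linux errno value for EINVAL.
pub const EINVAL: i32 = 22;

/// Parsed mount options with the defaults listed in step 2 (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MountOpts {
    pub lease_duration_s: u32,
    pub writeback_threshold: u32,
    pub max_inflight: u32,
    pub retry_timeout_s: u32,
    pub read_only: bool,
    pub root_squash: bool,
}

impl Default for MountOpts {
    fn default() -> Self {
        Self {
            lease_duration_s: 30,
            writeback_threshold: 1_048_576,
            max_inflight: 256,
            retry_timeout_s: 60,
            read_only: false,
            root_squash: true,
        }
    }
}

/// Step 1: split "<peer_id>:<export_path>" and validate both halves.
pub fn parse_source(source: &str) -> Result<(u64, &str), i32> {
    let (peer, path) = source.split_once(':').ok_or(-EINVAL)?;
    let peer_id: u64 = peer.parse().map_err(|_| -EINVAL)?;
    if !path.starts_with('/') {
        return Err(-EINVAL); // export path must be absolute
    }
    Ok((peer_id, path))
}

/// Parse a decimal value and check it against an inclusive range.
fn ranged(v: &str, lo: u32, hi: u32) -> Result<u32, i32> {
    match v.parse::<u32>() {
        Ok(n) if (lo..=hi).contains(&n) => Ok(n),
        _ => Err(-EINVAL),
    }
}

/// Step 2: parse "key=value,flag,..." into MountOpts, enforcing the documented ranges.
pub fn parse_opts(data: &str) -> Result<MountOpts, i32> {
    let mut o = MountOpts::default();
    for item in data.split(',').filter(|s| !s.is_empty()) {
        match item.split_once('=') {
            Some(("lease_duration", v)) => o.lease_duration_s = ranged(v, 1, 3600)?,
            Some(("writeback_threshold", v)) => {
                o.writeback_threshold = ranged(v, 4096, 67_108_864)?
            }
            Some(("max_inflight", v)) => o.max_inflight = ranged(v, 16, 4096)?,
            Some(("retry_timeout", v)) => o.retry_timeout_s = ranged(v, 5, 600)?,
            None if item == "ro" => o.read_only = true,
            None if item == "norootsquash" => o.root_squash = false,
            _ => return Err(-EINVAL),
        }
    }
    Ok(o)
}
```

For example, `parse_source("42:/data")` yields peer 42 and path `/data`, and `parse_opts("lease_duration=10,ro")` overrides two fields while keeping the remaining defaults.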
statfs: Forwards VfsServiceOp::Statfs to the server. Returns the
server's StatFs values directly (total/free/available blocks and inodes).
Cached for lease_duration_ms to avoid round-trips on repeated df calls.
sync_fs: Flushes all dirty pages for all open files on this mount to
the server, then sends VfsServiceOp::Fsync for each dirty inode. Blocks
until all server acknowledgments arrive.
unmount: Flushes dirty data (sync_fs), releases all DLM locks,
sends VfsServiceOp::Release for all open files, tears down the
data region, and disconnects from the server.
show_options: Emits peer=<peer_id>,export=<path>,lease=<N> for
/proc/mounts.
14.11.10.8.3 InodeOps and FileOps¶
All VFS inode and file operations are translated to VfsServiceOp messages
and forwarded to the server. The mapping is direct:
| VFS Operation | VfsServiceOp | Notes |
|---|---|---|
| lookup | Lookup | Populates local PeerFsInode cache on hit |
| getattr | Getattr | Served from cache if lease valid |
| setattr | Setattr | Invalidates cached attrs on success |
| create | Create | Returns new inode + open handle |
| mkdir | Mkdir | |
| unlink | Unlink | Invalidates parent dir cache |
| rmdir | Rmdir | Invalidates parent dir cache |
| rename | Rename | Invalidates both parent dir caches |
| readdir | Readdir | Populates inode cache for returned entries |
| open | Open | Invalidates cached attrs (close-to-open) |
| read | Read (remote fetch) | Zero-copy from server's page cache |
| write | Buffered, Write (remote push) | Flushed on close/fsync/threshold |
| release | Release | Flush dirty pages first |
| fsync | Fsync | Flush dirty pages, then forward |
| symlink | Symlink | Creates symlink; returns new inode |
| readlink | Readlink | Returns symlink target path |
| link | Link | Creates hard link; invalidates parent dir cache |
| getxattr | Getxattr | Phase 2 (required for POSIX ACLs) |
| setxattr | Setxattr | Phase 2 (required for POSIX ACLs) |
| listxattr | Listxattr | Phase 2 |
| removexattr | Removexattr | Phase 2 |
Read path (hot):
peerfs_read(file, buf, offset, len):
    inode = file.inode
    pi = inode.i_private as &PeerFsInode
    // Check page cache first (lease must be valid).
    if pi.lease_valid():
        pages = page_cache_lookup(inode, offset, len)
        if pages.is_complete():
            copy_to_user(buf, pages)
            return len
    // Cache miss or lease expired: fetch from server via peer transport.
    bounce = conn.bounce_pool.alloc() // pre-allocated, no heap alloc
    send VfsServiceOp::Read { handle, offset, len } to server
    // Server responds with region offset for the data.
    // Client fetches data from remote region into bounce buffer.
    transport_fetch(bounce, server_region_offset, len)
    insert_into_page_cache(inode, offset, bounce.data, len)
    copy_to_user(buf, bounce.data, len)
    conn.bounce_pool.free(bounce)
    return len
Write path (warm — buffered, flushed asynchronously):
peerfs_write(file, buf, offset, len):
    if self.read_only: return -EROFS
    pi = file.inode.i_private as &PeerFsInode
    // Buffer in page cache. Mark pages dirty.
    copy_from_user_to_page_cache(file.inode, offset, buf, len)
    // Relaxed ordering: dirty_bytes is a best-effort counter for writeback
    // triggering, not a synchronization mechanism. Races between concurrent
    // writers may cause slight over- or under-counting, but writeback
    // correctness does not depend on exact counts; the page cache dirty
    // flags are the source of truth. Relaxed avoids unnecessary fence cost
    // on the write hot path (~2-5ns saved per write on weakly-ordered archs).
    pi.dirty_bytes.fetch_add(len, Relaxed)
    // Trigger writeback if threshold exceeded.
    if pi.dirty_bytes.load(Relaxed) >= self.writeback_threshold:
        peerfs_writeback(file.inode)
    return len
peerfs_writeback(inode):
    pi = inode.i_private as &PeerFsInode
    // Collect dirty pages, push to server via peer transport.
    for each dirty page range (offset, data, len):
        bounce = conn.bounce_pool.alloc()
        copy_pages_to_bounce(bounce, data, len)
        send VfsServiceOp::Write { handle, offset, data_region_offset: bounce.region_offset, data_len: len } to server
        // Server fetches data from client's bounce buffer via remote read.
        wait_for_server_ack()
        conn.bounce_pool.free(bounce)
        clear_page_dirty(inode, offset, len)
    // Reset the best-effort counter once all ranges are flushed.
    pi.dirty_bytes.store(0, Relaxed)
Release (close) path: enforces close-to-open consistency.
peerfs_release(file):
    // Flush all dirty pages to server before releasing handle.
    peerfs_writeback(file.inode)
    send VfsServiceOp::Release { handle }
    wait_for_server_ack()
    // Do NOT invalidate cached attrs here: other opens may hold leases.
Open path: enforces close-to-open consistency (revalidation side).
peerfs_open(inode, file, flags):
    pi = inode.i_private as &PeerFsInode
    // Close-to-open: invalidate cached attrs and page cache on open.
    pi.invalidate_attrs()
    invalidate_inode_pages(inode) // drop cached read data
    send VfsServiceOp::Open { inode: pi.remote_inode_id, flags, uid, gid }
    reply = wait_for_reply()
    // Store server-assigned handle in file->private_data.
    file.private_data = reply.handle
    // Refresh cached attrs from Open reply. The expiry uses the same
    // monotonic clock that lease_valid() compares against.
    pi.update_attrs(reply.attrs, current_monotonic_ms() + self.lease_duration_ms)
14.11.10.8.4 PeerFsInode: Remote Inode Cache¶
/// Local representation of a remote inode. Attached to the VFS `Inode`
/// via `i_private`. Caches attributes received from the server to avoid
/// round-trips on `stat()` / `getattr()` within the lease window.
pub struct PeerFsInode {
/// Server-assigned inode ID. Matches `InodeId` on the server's filesystem.
pub remote_inode_id: InodeId,
/// Cached attributes (mode, uid, gid, size, mtime, ctime, nlink).
pub attrs: SpinLock<PeerFsInodeAttrs>,
/// Absolute time (monotonic ms) when the cached attrs expire.
/// Operations after this time must re-fetch from the server.
pub lease_expiry_ms: AtomicU64,
/// Server generation at the time this inode was cached. If the server
/// reboots (generation changes), all cached inodes are stale.
pub server_generation: u64,
/// Dirty bytes buffered in the page cache for this inode. Used to
/// trigger writeback when exceeding `PeerFs::writeback_threshold`.
pub dirty_bytes: AtomicU64,
/// Server-assigned file handle for the current open. Zero if not open.
/// Multiple opens share the same PeerFsInode but get distinct handles
/// (handles are stored in `OpenFile::private_data`, not here).
/// This field tracks the *last known* handle for recovery purposes.
pub last_handle: AtomicU64,
/// LRU linkage for inode cache eviction.
pub lru_node: IntrusiveListNode,
}
/// Cached attributes for a remote inode.
pub struct PeerFsInodeAttrs {
pub mode: u32,
pub uid: u32,
pub gid: u32,
pub size: u64,
pub nlink: u32,
pub mtime_sec: u64,
pub mtime_nsec: u32,
pub ctime_sec: u64,
pub ctime_nsec: u32,
pub blocks: u64,
pub blksize: u32,
}
impl PeerFsInode {
/// Returns true if cached attributes are still within the lease window.
pub fn lease_valid(&self) -> bool {
current_monotonic_ms() < self.lease_expiry_ms.load(Acquire)
}
/// Invalidate cached attributes. Next getattr will fetch from server.
pub fn invalidate_attrs(&self) {
self.lease_expiry_ms.store(0, Release);
}
/// Update cached attributes from a server response.
pub fn update_attrs(&self, attrs: PeerFsInodeAttrs, new_expiry_ms: u64) {
*self.attrs.lock() = attrs;
self.lease_expiry_ms.store(new_expiry_ms, Release);
}
}
Cache eviction: triggered when nr_cached_inodes exceeds
max_cached_inodes (default 65536). Eviction runs in a background kthread
(peerfs_evictor), not on the hot lookup path.
peerfs_evict_inodes(pfs: &PeerFs):
1. Acquire inode_lru spinlock.
2. Walk the LRU list from tail (least recently used).
3. For each candidate inode:
a. Skip if VFS reference count > 0 (inode has active opens).
b. Skip if dirty_bytes > 0 (must writeback first — schedule
async writeback via peerfs_writeback() and re-check on next
eviction pass).
c. Remove from LRU list and inode_cache XArray.
d. Drop the Arc<PeerFsInode> (deallocates if refcount reaches 0).
e. Decrement nr_cached_inodes.
4. Stop when nr_cached_inodes <= max_cached_inodes * 7/8 (hysteresis
to avoid thrashing — evict 12.5% below threshold before stopping).
5. Release inode_lru spinlock.
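The eviction pass above, including the 7/8 hysteresis target, can be sketched with a `VecDeque` standing in for the intrusive LRU list. This is an illustration under stated assumptions: `CachedInode` and `evict` are hypothetical names, the front of the deque is the most recently used entry, and locking, the XArray removal, and the scheduling of writeback for dirty inodes are all omitted.

```rust
use std::collections::VecDeque;

/// Minimal per-inode state needed for the eviction decision (hypothetical type).
pub struct CachedInode {
    pub id: u64,
    /// VFS reference count; non-zero means the inode has active users.
    pub refcount: u32,
    /// Bytes buffered in the page cache that have not reached the server.
    pub dirty_bytes: u64,
}

/// Walk from the tail (least recently used) and evict clean, unreferenced
/// inodes until the cache holds at most 7/8 of `max_cached`.
/// Returns the IDs of evicted inodes in eviction order.
pub fn evict(lru: &mut VecDeque<CachedInode>, max_cached: usize) -> Vec<u64> {
    let mut evicted = Vec::new();
    // Eviction only starts once the cache exceeds its configured maximum.
    if lru.len() <= max_cached {
        return evicted;
    }
    let target = max_cached * 7 / 8;
    let mut i = lru.len();
    while i > 0 && lru.len() > target {
        i -= 1;
        let candidate = &lru[i];
        // Skip inodes that are in use or still need writeback.
        if candidate.refcount > 0 || candidate.dirty_bytes > 0 {
            continue;
        }
        // Removing index i shifts only later entries, so smaller indices stay valid.
        let victim = lru.remove(i).expect("index is within bounds");
        evicted.push(victim.id);
    }
    evicted
}
```

With the default `max_cached_inodes` of 65536, the target works out to 57344 entries, so a pass that starts just above the limit frees roughly 8192 inodes before stopping.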
Memory budget: 65536 inodes x ~128 bytes per PeerFsInode = 8 MiB.
Configurable via mount option max_inodes=N.
Lease-driven invalidation: The server sends lease invalidation messages when a remote inode is modified by another client. On receipt:
on_lease_invalidation(inode_id: InodeId):
    pi = inode_cache.load(inode_id)
    if pi.is_some():
        pi.invalidate_attrs()
        invalidate_inode_pages(pi.vfs_inode) // drop cached read data
This provides best-effort visibility between open/close cycles. The close-to-open protocol is the correctness backstop — invalidation is an optimization that improves freshness for long-lived opens.
Revalidation: When a VFS operation accesses an inode with an expired
lease, the client sends VfsServiceOp::Getattr and updates the cache.
This is transparent to the caller.
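A minimal sketch of lease-gated attribute caching with transparent revalidation is shown below. It assumes a reduced attribute struct (`Attrs`), a caller-supplied clock value, and a `fetch` closure standing in for the VfsServiceOp::Getattr round trip; `LeasedAttrs` and `get_or_revalidate` are illustrative names, not part of the specification.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical, reduced attribute set; the real PeerFsInodeAttrs is larger.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct Attrs {
    pub size: u64,
    pub mtime_ms: u64,
}

pub struct LeasedAttrs {
    attrs: Mutex<Attrs>,
    lease_expiry_ms: AtomicU64, // 0 means "invalidated"
}

impl LeasedAttrs {
    pub fn new() -> Self {
        Self {
            attrs: Mutex::new(Attrs::default()),
            lease_expiry_ms: AtomicU64::new(0),
        }
    }

    /// The lease is valid strictly before its expiry time.
    pub fn is_valid(&self, now_ms: u64) -> bool {
        now_ms < self.lease_expiry_ms.load(Ordering::Acquire)
    }

    /// Lease invalidation: force the next access to revalidate.
    pub fn invalidate(&self) {
        self.lease_expiry_ms.store(0, Ordering::Release);
    }

    /// Return cached attributes, calling `fetch` (standing in for a
    /// Getattr round trip) only when the lease has expired.
    pub fn get_or_revalidate(
        &self,
        now_ms: u64,
        lease_ms: u64,
        fetch: impl FnOnce() -> Attrs,
    ) -> Attrs {
        if self.is_valid(now_ms) {
            return *self.attrs.lock().unwrap();
        }
        let fresh = fetch();
        *self.attrs.lock().unwrap() = fresh;
        self.lease_expiry_ms.store(now_ms + lease_ms, Ordering::Release);
        fresh
    }
}
```

The caller never sees whether the value came from the cache or from the server, which is the sense in which revalidation is transparent.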
14.11.10.8.5 Page Cache Integration¶
PeerFS uses the standard VFS page cache (Section 14.1) for read and write caching. The caching policy is deliberately simple — no delegations, no complex cache consistency state machines.
Design principles (lessons from NFS client bugs):
- No "delegation" concept. Leases are the only cache validity mechanism.
- Page cache validity is tied to the inode's lease. When the lease expires or is invalidated, all cached pages for that inode are dropped.
- No speculative cache retention after lease loss. This wastes some bandwidth on re-reads but eliminates stale data bugs.
- Dirty pages are ALWAYS flushed before releasing a file handle. No deferred or lazy flush that could lose data.
Read caching:
- On read(), check page cache first. If the page is present AND the
inode's lease is valid, serve from cache (zero network I/O).
- If the page is absent or the lease is expired, fetch from the server
via remote fetch (peer transport) and insert into the page cache.
- Readahead: the VFS readahead infrastructure (Section 14.1)
drives prefetch. PeerFS implements AddressSpaceOps::readahead() to
batch multiple pages into a single remote fetch.
Write caching:
- Writes go to the page cache. Pages are marked dirty.
- Dirty pages are flushed to the server on:
1. close() (mandatory — close-to-open consistency).
2. fsync() (explicit sync request).
3. Per-file dirty bytes exceeding writeback_threshold.
4. Server sends early-flush request (another client opened the file).
5. Memory pressure (VFS shrinker callback).
mmap support:
- mmap() is supported. Page faults trigger the read path (fetch from
server if not cached). Dirty mmap pages are tracked via the page cache
dirty mechanism and flushed on msync(), munmap(), or close().
No AddressSpaceOps::direct_IO: PeerFS does not support O_DIRECT.
All I/O goes through the page cache. This simplifies the implementation
and avoids the NFS O_DIRECT coherency bugs. Applications requiring
direct server access should use the DLM-based locking path.
Readdir caching:
- Directory entries are cached as a serialized readdir buffer in the page
cache, using the same page cache machinery as regular file data. Pages
are keyed by (directory inode, byte offset into the readdir stream).
- Cached readdir data is valid for lease_duration_ms. After expiry, the
next readdir() re-fetches from the server.
- Any directory mutation (Create, Unlink, Mkdir, Rmdir, Rename
affecting the directory as source or destination) invalidates the
directory's cached readdir pages immediately.
- No negative dentry caching. Caching "file not found" results is
error-prone with multiple clients modifying the same directory
concurrently. A lookup miss always goes to the server.
14.11.10.8.6 Mount Syntax¶
| Option | Type | Default | Range | Description |
|---|---|---|---|---|
| lease_duration | seconds | 30 | 1-3600 | Metadata/data cache validity period |
| writeback_threshold | bytes | 1048576 | 4096-64M | Per-file dirty data flush threshold |
| max_inflight | count | 256 | 16-4096 | Max concurrent server operations |
| retry_timeout | seconds | 60 | 5-600 | Timeout before -EIO on server unreachable |
| ro | flag | off | | Read-only mount |
| norootsquash | flag | off | | Disable root squash (uid 0 not mapped) |
| max_inodes | count | 65536 | 1024-1048576 | Maximum cached inodes before LRU eviction |
Examples:
mount -t peerfs 42:/data /mnt/remote
mount -t peerfs 42:/home /mnt/home -o lease_duration=60,ro
mount -t peerfs 7:/scratch /mnt/scratch -o writeback_threshold=4194304,max_inflight=512
Peer ID can also be a hostname if DNS/mDNS resolution is available in the cluster. The PeerRegistry resolves hostnames to PeerId values.
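As an illustration of how the option table could be enforced at mount time, the sketch below parses a comma-separated `-o` string and applies the documented defaults and ranges. `PeerFsOpts`, `ranged`, and `parse_opts` are hypothetical names, errors are plain strings rather than kernel errnos, and the `64M` upper bound is assumed to mean 64 MiB.

```rust
/// Hypothetical parsed form of the mount options listed above.
#[derive(Debug, PartialEq)]
pub struct PeerFsOpts {
    pub lease_duration_s: u64,
    pub writeback_threshold: u64,
    pub max_inflight: u64,
    pub retry_timeout_s: u64,
    pub max_inodes: u64,
    pub read_only: bool,
    pub no_root_squash: bool,
}

impl Default for PeerFsOpts {
    fn default() -> Self {
        Self {
            lease_duration_s: 30,
            writeback_threshold: 1_048_576,
            max_inflight: 256,
            retry_timeout_s: 60,
            max_inodes: 65_536,
            read_only: false,
            no_root_squash: false,
        }
    }
}

/// Parse a numeric value and check it against an inclusive range.
fn ranged(key: &str, val: &str, lo: u64, hi: u64) -> Result<u64, String> {
    let n: u64 = val.parse().map_err(|_| format!("{key}: not a number"))?;
    if (lo..=hi).contains(&n) {
        Ok(n)
    } else {
        Err(format!("{key}: out of range {lo}-{hi}"))
    }
}

pub fn parse_opts(s: &str) -> Result<PeerFsOpts, String> {
    let mut o = PeerFsOpts::default();
    for tok in s.split(',').filter(|t| !t.is_empty()) {
        match tok.split_once('=') {
            None if tok == "ro" => o.read_only = true,
            None if tok == "norootsquash" => o.no_root_squash = true,
            Some(("lease_duration", v)) => o.lease_duration_s = ranged("lease_duration", v, 1, 3600)?,
            // Assumption: "64M" in the table means 64 MiB.
            Some(("writeback_threshold", v)) => o.writeback_threshold = ranged("writeback_threshold", v, 4096, 64 << 20)?,
            Some(("max_inflight", v)) => o.max_inflight = ranged("max_inflight", v, 16, 4096)?,
            Some(("retry_timeout", v)) => o.retry_timeout_s = ranged("retry_timeout", v, 5, 600)?,
            Some(("max_inodes", v)) => o.max_inodes = ranged("max_inodes", v, 1024, 1_048_576)?,
            _ => return Err(format!("unknown option: {tok}")),
        }
    }
    Ok(o)
}
```

For example, `parse_opts("lease_duration=60,ro")` yields a read-only configuration with a 60-second lease and every other field at its default.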
14.11.10.8.7 Error Handling¶
| Condition | Behavior |
|---|---|
| Server unreachable | Operations block for retry_timeout (default 60s), then return -EIO. Connection transitions to Reconnecting. |
| Server rebooted | Detected via PeerRegistry generation change. Connection transitions to GracePeriod. Client reclaims open files and DLM locks. Normal operations resume after grace period. See Section 14.11. |
| Stale file handle | Server returns -ESTALE. Client re-looks up the inode from the parent directory: it walks up to the nearest cached valid parent, then re-looks up each path component below it. If the re-lookup succeeds, the operation is retried with the new handle. If the file was deleted, -ESTALE is returned to the application. |
| Server returns error | Passed through to the application as-is (e.g., -EACCES, -ENOSPC, -ENOENT). |
| Transport failure | Connection transitions to Reconnecting. Re-register data region after reconnect. In-flight operations receive -EIO. |
| Client-side memory pressure | VFS shrinker evicts clean cached inodes (LRU). Dirty inodes are written back first. |
| Inflight limit reached | New operations block on inflight_waitq until an in-flight operation completes. |
Stale inode recovery detail:
peerfs_handle_estale(inode, parent_path_components):
// Walk up the cached path to find the nearest valid ancestor.
for component in parent_path_components.reverse():
ancestor = lookup_cached(component)
if ancestor.is_valid():
// Re-lookup from this ancestor downward.
for child in path_from(ancestor, inode):
result = send VfsServiceOp::Lookup { parent: ancestor.id, name: child }
if result.is_err():
return result // file was deleted or renamed
ancestor = result.inode
// Update local inode cache with fresh server state.
update_inode_cache(inode, ancestor)
return Ok(())
return Err(ESTALE) // entire path is stale
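The recovery walk can be sketched in Rust as follows. This is a simplified model under stated assumptions: `CachedComponent` and `handle_estale` are hypothetical, the `lookup` closure stands in for a VfsServiceOp::Lookup round trip, the cached path is passed root first, and `ESTALE` uses the common Linux value.

```rust
pub const ESTALE: i32 = 116; // Linux errno value on most architectures

pub type InodeId = u64;

/// One cached path element, root first. `valid` mirrors a lease check.
pub struct CachedComponent<'a> {
    pub name: &'a str,
    pub id: InodeId,
    pub valid: bool,
}

/// Re-resolve the last element of `path` after an ESTALE response.
/// Returns the fresh inode ID, or the errno of the first failing lookup.
pub fn handle_estale(
    path: &[CachedComponent<'_>],
    lookup: impl Fn(InodeId, &str) -> Result<InodeId, i32>,
) -> Result<InodeId, i32> {
    // Deepest ancestor (excluding the stale target itself) that is still valid.
    let parents = &path[..path.len().saturating_sub(1)];
    let start = parents.iter().rposition(|c| c.valid).ok_or(ESTALE)?;
    // Re-lookup every component below that ancestor, threading the parent ID
    // returned by each lookup into the next one.
    path[start + 1..]
        .iter()
        .try_fold(path[start].id, |dir, c| lookup(dir, c.name))
}
```

A failing lookup short-circuits the fold, so a deleted or renamed component surfaces its errno without issuing further requests.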
14.11.10.8.8 Advantages Over NFS¶
| Aspect | PeerFS | NFS v4.2 |
|---|---|---|
| Wire overhead | Native structs over peer transport — no RPC/XDR marshaling | Sun RPC + XDR encoding/decoding on every operation |
| Daemons | Zero (pure in-kernel) | nfsd, mountd, idmapd, gssproxy, rpc.statd, rpcbind |
| Configuration | mount -t peerfs peer:/path /mnt | /etc/exports, /etc/fstab, Kerberos keytabs, idmapd.conf |
| Discovery | Automatic via PeerRegistry (Section 5.2) | Manual: must know server IP and export path |
| Locking | DLM (Section 15.15) — already running for DFS | NLM (v3) or built-in (v4) — separate protocol, separate recovery |
| Data transfer | Zero-copy via peer transport (bounce buffer pool, no kernel copy on large I/O; ~3-5μs on RDMA, ~50-200μs on TCP) | TCP or RPC-over-RDMA (still has XDR framing overhead) |
| Caching model | Leases only — simple, predictable, few bugs | Delegations — complex state machine, known source of client bugs |
| Recovery | DLM lock recovery + PeerRegistry heartbeat | Grace period + reclaim + edge cases around delegation return |
14.11.10.8.9 Intentional Non-Goals¶
The following NFS features are not implemented in PeerFS. Each omission is deliberate, not a gap:
- pNFS (parallel NFS): UmkaOS clusters use the block service provider (Section 15.14) for parallel storage access. PeerFS serves the simpler "one server, one export" use case.
- Referrals: PeerRegistry discovery handles service location. No need for filesystem-level redirect.
- NFSv4 ACLs: PeerFS uses POSIX ACLs from the underlying filesystem. The server enforces them using the passed UID/GID.
- Client-side deduplication: The server's local filesystem handles this.
- Kerberos/GSSAPI: PeerFS assumes a uniform trust domain (all peers mutually authenticated via Section 9.1). Cross-domain authentication requires NFS.
- O_DIRECT: All I/O goes through the page cache for simplicity. Applications needing direct access use DLM locks and RDMA directly.
14.12 configfs — Kernel Object Configuration Filesystem¶
configfs is a RAM-resident pseudo-filesystem (similar to sysfs) that allows
user-space to create, configure, and destroy kernel objects by manipulating
directories and files under a single mount point. The key distinction from sysfs
is direction of control: sysfs exports kernel-managed objects to user-space,
while configfs gives user-space the power to instantiate new kernel objects via
mkdir.
configfs is used by:
- LIO iSCSI / NVMe-oF target (/sys/kernel/config/target/, /sys/kernel/config/nvmet/) — see Section 11 for the block-layer and NVMe-oF protocol details.
- USB gadget framework (/sys/kernel/config/usb_gadget/)
- 9pnet and netconsole subsystems
14.12.1 Architecture¶
User Space
mkdir / rmdir / cat / echo
│
/sys/kernel/config/
│ (VFS operations)
┌────────────┴────────────────────────┐
│ configfs VFS layer │
│ ConfigfsSubsystem → ConfigGroup │
│ ConfigItem → ConfigAttribute │
└────────────┬────────────────────────┘
│ callbacks
Kernel subsystem
(LIO, nvmet, USB gadget, ...)
User-space operates exclusively with POSIX filesystem primitives. No ioctl or dedicated syscall is needed. The kernel subsystem registers callback functions that the configfs VFS layer invokes in response to standard filesystem operations.
14.12.2 Data Structures¶
/// A configfs subsystem, registered by a kernel module at init time.
pub struct ConfigfsSubsystem {
/// Directory name created under /sys/kernel/config/.
pub name: &'static str,
/// Root group of this subsystem.
pub root: Arc<ConfigGroup>,
}
/// A configfs group — a directory that may contain items, subgroups, and
/// attributes. Groups may also carry a set of default child groups that are
/// created automatically when the group itself is created.
pub struct ConfigGroup {
pub item: ConfigItem,
/// Active children (items and subgroups) keyed by name.
/// Bounded by the subsystem's make_item/make_group callbacks (return
/// ENOSPC when subsystem-specific limits are exceeded). All configfs
/// mutation operations require CAP_SYS_ADMIN.
///
/// **Lock ordering**: Parent group's `children` lock MUST be acquired
/// before any child group's `children` lock (top-down ordering). This
/// is deadlock-free by the tree structure: no cycle can form because
/// locks are always acquired in root-to-leaf order. Concurrent mkdir
/// operations on sibling groups do not conflict (different RwLock
/// instances). Lock level: LOCK_LEVEL_CONFIGFS_CHILDREN (within the
/// VFS lock ordering hierarchy, after VFS inode lock, before
/// attribute file I/O locks).
pub children: RwLock<BTreeMap<String, ConfigChild>>,
/// Type descriptor controlling allowed operations on this group.
pub item_type: Arc<ConfigItemType>,
/// Subgroups automatically created alongside this group (not user-removable).
/// Bounded: typically 1-4 default groups per subsystem (e.g., target_core_mod
/// creates "alua" and "statistics"). Cold path (group creation).
/// Enforced: registration fails with ENOSPC if len >= CONFIGFS_MAX_DEFAULT_GROUPS.
pub default_groups: Vec<Arc<ConfigGroup>>,
}
const CONFIGFS_MAX_DEFAULT_GROUPS: usize = 16;
/// Discriminated union of group children.
pub enum ConfigChild {
Item(Arc<ConfigItem>),
Group(Arc<ConfigGroup>),
}
/// A configfs item — the leaf directory representing one kernel object.
pub struct ConfigItem {
/// Item name within its parent group. Set at creation time (mkdir);
/// immutable thereafter — no lock required. Inline storage avoids
/// heap allocation for typical names (container IDs, device names).
pub name: ArrayString<256>,
/// Reference count; item is dropped when it reaches zero.
/// AtomicU64 for consistent width across 32-bit and 64-bit platforms
/// (per project policy — avoids usize width variation).
pub kref: AtomicU64,
pub parent: Weak<ConfigGroup>,
pub item_type: Arc<ConfigItemType>,
}
/// Type descriptor: defines the callbacks and attributes for an item or group.
pub struct ConfigItemType {
pub name: &'static str,
/// Called when the item's reference count drops to zero.
pub release: fn(&ConfigItem),
/// Attribute files exposed in every instance of this item type.
pub attrs: &'static [&'static dyn ConfigAttribute],
/// Returns additional child groups (used for complex multi-level objects).
pub groups: Option<fn(&ConfigItem) -> Vec<Arc<ConfigGroup>>>,
/// Create a new leaf item inside this group (triggered by mkdir).
pub make_item: Option<fn(group: &ConfigGroup, name: &str)
-> Result<Arc<ConfigItem>, KernelError>>,
/// Create a new subgroup inside this group (triggered by mkdir).
pub make_group: Option<fn(group: &ConfigGroup, name: &str)
-> Result<Arc<ConfigGroup>, KernelError>>,
/// Notify the subsystem before an item is removed (triggered by rmdir).
pub drop_item: Option<fn(group: &ConfigGroup, item: &ConfigItem)>,
}
/// A single configfs attribute — a regular file in the item directory.
pub trait ConfigAttribute: Send + Sync {
/// File name within the item directory.
fn name(&self) -> &str;
/// Unix permission bits (typically 0644 for read-write, 0444 for read-only).
fn mode(&self) -> u32;
/// Populate `buf` with a text representation of the attribute value.
/// Returns the number of bytes written.
fn show(&self, item: &ConfigItem, buf: &mut [u8]) -> Result<usize, KernelError>;
/// Parse `buf` and apply the new attribute value.
/// Returns the number of bytes consumed.
fn store(&self, item: &ConfigItem, buf: &[u8]) -> Result<usize, KernelError>;
}
Lifetimes and reference counting mirror those of the objects the subsystem
manages. A ConfigItem is kept alive as long as the directory exists in the
configfs namespace. Removal (rmdir) calls drop_item, decrements the kref,
and invokes release when the count reaches zero.
14.12.3 Mount Point and Directory Layout¶
configfs is mounted at boot by configfs_init() and exposed at
/sys/kernel/config. User-space may also mount it manually:
mount -t configfs none /sys/kernel/config
Illustrative layout showing the NVMe-oF and iSCSI target subsystems (see Section 11 for full protocol details):
/sys/kernel/config/
├── target/ ← LIO iSCSI / generic target subsystem
│ ├── core/
│ │ └── iblock_0/ ← mkdir: create iblock backstore group
│ │ └── lio_disk0/ ← mkdir: create a new block device object
│ │ ├── dev ← echo /dev/sda > dev
│ │ ├── udev_path ← echo /dev/sda > udev_path
│ │ └── enable ← echo 1 > enable
│ └── iscsi/
│ └── iqn.2024-01.com.example:storage/ ← mkdir: create iSCSI target IQN
│ └── tpgt_1/ ← mkdir: create target portal group
│ ├── enable
│ ├── lun/
│ │ └── lun_0 → ../../core/iblock_0/lio_disk0 ← symlink
│ ├── acls/
│ │ └── iqn.2024-01.com.client:host1/
│ │ ├── auth/
│ │ └── mapped_lun0/
│ └── fabric_statistics/
├── nvmet/ ← NVMe-oF target subsystem
│ ├── subsystems/
│ │ └── nqn.2024-01.com.example:nvme-ssd/ ← mkdir: create NVMe subsystem NQN
│ │ ├── attr_allow_any_host
│ │ └── namespaces/
│ │ └── 1/ ← mkdir: create namespace ID 1
│ │ ├── device_path ← echo /dev/nvme0n1 > device_path
│ │ └── enable ← echo 1 > enable
│ └── ports/
│ └── 1/ ← mkdir: create NVMe-oF port
│ ├── addr_trtype ← echo tcp > addr_trtype
│ ├── addr_traddr ← echo 192.0.2.1 > addr_traddr
│ ├── addr_trsvcid ← echo 4420 > addr_trsvcid
│ └── subsystems/
│ └── nqn.2024-01.com.example:nvme-ssd ← symlink
└── usb_gadget/ ← USB gadget framework
└── g1/ ← mkdir: create a gadget instance
├── idVendor
├── idProduct
└── functions/
└── mass_storage.0/
└── lun.0/
└── file ← echo /dev/sdb > file
The directory hierarchy encodes object relationships. Symlinks express associations between independently-created objects (e.g., linking a LUN to its backing store, or attaching a subsystem to a port).
14.12.4 VFS Operations¶
configfs maps the following filesystem operations onto subsystem callbacks:
mkdir(path)
The parent directory's ConfigItemType is consulted. If make_group is
defined, a new ConfigGroup is allocated and returned as a subdirectory dentry.
If make_item is defined, a new ConfigItem is allocated and returned. Only one
of the two may be non-null for a given group type; attempting mkdir on a group
that defines neither returns EPERM. Default child groups are created
automatically alongside any new group.
rmdir(path)
The directory must be empty (no user-created children; default children are
exempt from this check and are removed automatically). drop_item is invoked
on the parent's ConfigItemType, then the item's kref is decremented. If the
kref reaches zero, release is called. Attempting to remove a non-empty
directory returns ENOTEMPTY.
open(attr_path) / read(attr_fd)
The fd is associated with the specific ConfigAttribute. read(2) invokes
ConfigAttribute::show(), which populates the kernel buffer with a text
representation. The output is always \n-terminated for shell compatibility.
open(attr_path) / write(attr_fd)
write(2) invokes ConfigAttribute::store() with the user-supplied buffer.
The subsystem parses and validates the value; on error it returns a negative
errno. Writes larger than PAGE_SIZE (4 KiB) are rejected with EINVAL to
prevent unbounded allocations.
symlink(src, dst)
Used to express dependencies between items: for example, associating a LUN
directory with a backstore object, or adding a subsystem to a port's subscriber
list. configfs validates that both the source and destination are within the
same configfs mount before creating the link. The subsystem's ConfigItemType
may reject symlinks by returning EPERM from an optional allow_link callback.
readdir
Returns all children of a group: items, subgroups, attribute files, and symlinks.
Attribute names are synthesized from ConfigItemType.attrs; no inode backing
store is needed.
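The mkdir dispatch rule and the attribute write-size limit described above can be illustrated with a small sketch. The types here are deliberately reduced and hypothetical: callbacks return a `String` instead of `Arc<ConfigItem>` or `Arc<ConfigGroup>`, errors are raw errno integers, and `echo_name` is an example callback only.

```rust
pub const EPERM: i32 = 1;
pub const EINVAL: i32 = 22;
pub const PAGE_SIZE: usize = 4096;

/// Simplified, hypothetical mirror of the mkdir-related callbacks.
pub struct ItemType {
    pub make_item: Option<fn(&str) -> Result<String, i32>>,
    pub make_group: Option<fn(&str) -> Result<String, i32>>,
}

#[derive(Debug, PartialEq)]
pub enum Created {
    Item(String),
    Group(String),
}

/// Example callback: accepts any name and returns it as the new object.
pub fn echo_name(name: &str) -> Result<String, i32> {
    Ok(name.to_string())
}

/// mkdir: dispatch to whichever callback the parent's type defines.
pub fn configfs_mkdir(ty: &ItemType, name: &str) -> Result<Created, i32> {
    match (ty.make_group, ty.make_item) {
        (Some(make_group), _) => make_group(name).map(Created::Group),
        (None, Some(make_item)) => make_item(name).map(Created::Item),
        (None, None) => Err(EPERM), // group type allows no user-created children
    }
}

/// write(2) on an attribute: reject buffers larger than one page.
pub fn check_store_len(buf: &[u8]) -> Result<usize, i32> {
    if buf.len() > PAGE_SIZE {
        Err(EINVAL)
    } else {
        Ok(buf.len())
    }
}
```

Checking `make_group` first is an arbitrary choice in this sketch; the text requires that at most one of the two callbacks is set, so the order should not matter in a valid type descriptor.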
14.12.5 Linux Compatibility¶
- /sys/kernel/config/ mount point and directory layout: byte-for-byte identical to Linux configfs (kernel 5.0+).
- The ConfigAttribute read/write text format (newline-terminated strings, echo value > file idiom) matches Linux.
- LIO iSCSI target tools (targetcli, targetcli-fb, rtslib-fb) work without modification.
- NVMe-oF target tools (nvmetcli) work without modification; see Section 11 for NVMe-oF transport configuration details.
- USB gadget framework (configfs-gadget, libusbgx) works without modification.
- Symlink semantics (cross-item dependencies) are identical to Linux: both source and destination must reside within the same configfs mount.
14.13 File Notification System¶
UmkaOS implements inotify and fanotify with full Linux syscall and wire-format compatibility. Internal delivery uses typed structured channels rather than raw fd-write protocols; the external syscall interfaces are byte-for-byte identical to Linux.
Two interfaces are provided:
- inotify: informational events (IN_CREATE, IN_MODIFY, etc.), delivered asynchronously via a file descriptor readable with read(2).
- fanotify: superset of inotify, plus permission events (FAN_OPEN_PERM, FAN_ACCESS_PERM, FAN_OPEN_EXEC_PERM) that block the originating syscall until userspace responds with allow or deny. Used by malware scanners, file integrity monitors, and backup software.
Both are implemented in umka-vfs. Event delivery hooks are called from within the VFS
operation dispatch paths — after permission checks pass, before returning to userspace.
14.13.1 inotify¶
14.13.1.1 In-Kernel Objects¶
/// Per-inotify-instance state. Created by inotify_init() / inotify_init1().
/// Exposed to userspace as a file descriptor (the fd is backed by a synthetic
/// inode in the anonymous inode filesystem; read(2) on it drains the event queue).
pub struct InotifyInstance {
/// Watch descriptors: maps wd → InotifyWatch.
/// WatchDescriptor is the Linux `wd` (i32 cast to u32 index). XArray
/// provides O(1) lookup with native RCU-protected reads (no read-side
/// locking) and internal xa_lock for write serialization, replacing the
/// external RwLock. Watch addition/removal is infrequent (warm path).
pub watches: XArray<Arc<InotifyWatch>>,
/// Monotonically increasing allocator for watch descriptors.
/// WDs are 1-based positive integers per inotify_add_watch(2) contract.
///
/// **Longevity**: i32 allows ~2.1 billion watch additions per inotify fd.
/// At 1000 add/remove cycles per second, wraps in ~24.8 days. In practice,
/// applications rarely exceed thousands of watches. If a long-lived daemon
/// exhausts the space, the application must close and re-open the inotify fd
/// (matches Linux behavior — Linux also does not recycle WDs).
/// WD recycling (reusing released WDs via a free list) would extend
/// the practical lifetime but is deferred: Linux does not recycle WDs,
/// and changing this would be a behavioral difference.
pub next_wd: AtomicI32,
/// Per-instance event queue. Pre-allocated at `inotify_init()` time with
/// capacity `max_queued_events` (sysctl, default 16384). Fixed capacity avoids
/// heap allocation under spinlock. Overflow policy: **newest events are dropped**
/// when the queue is full (the newest event that would be enqueued is discarded,
/// not an existing queued event). This matches Linux inotify behavior: the
/// `overflow` flag is set and a synthetic `IN_Q_OVERFLOW` event is prepended
/// to the next `read(2)` response.
pub event_queue: SpinLock<BoundedRing<InotifyEventBuf>>,
/// Set when the event queue overflowed since the last read(2). A synthetic
/// `IN_Q_OVERFLOW` event is prepended to the next read response and this flag
/// is cleared. Separate from the queue to avoid occupying a queue slot.
pub overflow: AtomicBool,
/// Wait queue for poll()/select()/epoll() on this instance.
pub wait_queue: WaitQueueHead,
/// Flags from inotify_init1() (IN_CLOEXEC, IN_NONBLOCK).
pub flags: u32,
}
/// One inotify watch: a single inode being monitored for specific events.
pub struct InotifyWatch {
/// Watch descriptor (the value returned to userspace by inotify_add_watch).
pub wd: WatchDescriptor,
/// The inode being watched. Holds an Arc reference to prevent premature eviction
/// while the watch is active.
pub inode: Arc<Inode>,
/// Bitmask of watched events (IN_CREATE | IN_MODIFY | IN_CLOSE_WRITE | etc.).
pub mask: u32,
/// Back-reference to the owning InotifyInstance (weak to avoid cycles).
pub instance: Weak<InotifyInstance>,
}
/// Event delivered to userspace via read(2) on the inotify fd.
/// Matches the Linux inotify_event ABI exactly.
#[repr(C)]
pub struct InotifyEvent {
/// Watch descriptor that fired.
pub wd: i32,
/// Event type (IN_CREATE, IN_MODIFY, IN_DELETE, etc.).
pub mask: u32,
/// Links related IN_MOVED_FROM and IN_MOVED_TO events (same cookie = same rename).
pub cookie: u32,
/// Length of the name[] field in bytes, including the null terminator and any
/// trailing padding bytes. 0 if no filename is associated with this event
/// (e.g., IN_ATTRIB on a non-directory inode).
pub len: u32,
// Followed immediately by name[len]: null-terminated filename, valid only for
// events on directory inodes. Padded to a 4-byte boundary.
}
const_assert!(size_of::<InotifyEvent>() == 16);
/// Internal buffer holding a complete inotify event + filename bytes.
/// Uses a fixed-size array instead of `Vec<u8>` to avoid heap allocation
/// inside the spinlock-protected event queue. NAME_MAX is 255; with a null
/// terminator the maximum name is 256 bytes. Padded to 4-byte alignment,
/// the worst case is 256 bytes (255 + 1 null, already 4-byte aligned).
pub struct InotifyEventBuf {
pub header: InotifyEvent,
/// The filename, null-terminated and padded to a 4-byte boundary.
/// Only the first `header.len` bytes are valid; the rest is unused.
/// Fixed capacity avoids heap allocation under the event_queue SpinLock.
///
/// **Memory budget**: 272 * max_queued_events bytes per instance
/// (default ~4.25 MiB). With max_user_instances=128, worst-case per-user
/// kernel memory: ~544 MiB. This is bounded by per-user rlimits and
/// is comparable to Linux's inotify memory consumption for the same
/// event depth. Phase 3 optimization: variable-length event entries
/// reduce typical memory by 10-20x.
pub name: [u8; 256],
}
14.13.1.2 VFS Integration Hooks¶
inotify events are generated from dentry/inode operation call sites within the VFS dispatch layer. The fast path check costs a single pointer load:
| VFS operation | Event(s) generated |
|---|---|
| create, mkdir, mknod, symlink | IN_CREATE on parent dir inode |
| unlink, rmdir | IN_DELETE on parent dir; IN_DELETE_SELF on the target inode |
| rename (source side) | IN_MOVED_FROM on old parent + cookie |
| rename (destination side) | IN_MOVED_TO on new parent + same cookie |
| open | IN_OPEN on the inode |
| read, readdir | IN_ACCESS on the inode |
| write, truncate, fallocate | IN_MODIFY on the inode |
| setattr (chmod/chown/utimes) | IN_ATTRIB on the inode |
| close (file was written) | IN_CLOSE_WRITE on the inode |
| close (read-only open) | IN_CLOSE_NOWRITE on the inode |
| inotify watch removed (inode evicted or inotify_rm_watch) | IN_IGNORED on the watch descriptor |
Each Inode carries an inotify_watches field:
/// Maximum number of inotify watches that can be attached to a single inode.
/// Bounded to avoid unbounded heap allocation on the per-inode watch list,
/// which is scanned under a SpinLock on every VFS event delivery. 128 is
/// generous — even heavily-monitored inodes rarely exceed a handful of
/// watchers (one per inotify instance). The system-wide per-user limit
/// (`max_user_watches` sysctl, default 8192) bounds the total; this
/// per-inode cap prevents pathological concentration on a single inode.
const MAX_WATCHES_PER_INODE: usize = 128;
/// Per-inode inotify watch list. Uninitialized when no watch has ever been added (the common case).
/// This field is checked on every relevant VFS operation; the check is a single
/// load plus a well-predicted branch for the vast majority of inodes.
///
/// Uses `OnceLock` for the `None` → `Some` transition: the first `inotify_add_watch`
/// calls `inotify_watches.get_or_init(|| SpinLock::new(ArrayVec::new()))`.
/// `OnceLock` provides internal synchronization for the initialization race —
/// if two threads add the first watch concurrently, only one performs the init.
/// Subsequent accesses are a simple pointer load (no locking overhead).
/// Reverting to "no watches" does NOT clear the OnceLock (the empty SpinLock
/// persists, consuming only the lock + ArrayVec header — ~24 bytes); this avoids
/// an ABA race on the pointer.
pub inotify_watches: OnceLock<SpinLock<ArrayVec<Arc<InotifyWatch>, MAX_WATCHES_PER_INODE>>>,
When the field is None (no watches active), the check is a single load and
comparison, which adds negligible overhead on the fast path for the vast majority of inodes.
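The lazy-initialization pattern can be demonstrated with the standard library's `OnceLock`. The sketch below is a user-space approximation: `Mutex<Vec<u32>>` replaces the kernel `SpinLock<ArrayVec<...>>`, plain `u32` watch IDs replace `Arc<InotifyWatch>`, and the `Inode` type is a hypothetical stand-in.

```rust
use std::sync::{Mutex, OnceLock};

pub const MAX_WATCHES_PER_INODE: usize = 128;

/// Hypothetical stand-in for an inode; u32 IDs replace Arc<InotifyWatch>.
#[derive(Default)]
pub struct Inode {
    watches: OnceLock<Mutex<Vec<u32>>>,
}

impl Inode {
    /// Slow path: the first caller initializes the list, later callers reuse it.
    pub fn add_watch(&self, wd: u32) -> Result<(), &'static str> {
        let mut list = self
            .watches
            .get_or_init(|| Mutex::new(Vec::new()))
            .lock()
            .unwrap();
        if list.len() >= MAX_WATCHES_PER_INODE {
            return Err("per-inode watch limit reached");
        }
        list.push(wd);
        Ok(())
    }

    /// Fast path: no lock is taken if no watch was ever added to this inode.
    pub fn has_watches(&self) -> bool {
        match self.watches.get() {
            None => false,
            Some(list) => !list.lock().unwrap().is_empty(),
        }
    }
}
```

Once initialized, the cell is never reset, which matches the design note that an empty list persists after the last watch is removed.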
14.13.1.3 Event Delivery Algorithm¶
fsnotify_inode_event(inode, event_mask, name, cookie):
watches_opt = inode.inotify_watches.get() // single load, no locking
if watches_opt is None: return // fast path: no watches on this inode
watches = watches_opt.lock()
for watch in watches.iter():
fired_mask = watch.mask & event_mask
if fired_mask == 0: continue
if let Some(instance) = watch.instance.upgrade():
buf = InotifyEventBuf {
header: InotifyEvent { wd: watch.wd, mask: fired_mask, cookie, len: name.len() + padding },
name: name_bytes_padded_to_4_bytes,
}
queue = instance.event_queue.lock()
if !queue.is_full():
queue.push(buf)
else:
// Queue overflow: set the overflow flag so that the next read(2) prepends
// a synthetic IN_Q_OVERFLOW event. The AtomicBool lives outside the spinlock;
// the store is done while still holding the lock so that any reader that
// subsequently drains the queue observes the flag.
instance.overflow.store(true, Ordering::Release)
drop(queue)
instance.wait_queue.wake_up_one() // unblock read()/poll()
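The mask filtering and drop-newest overflow policy from the algorithm above can be sketched as follows. This is a reduced model: `Instance` and `Event` are hypothetical, a `Mutex<VecDeque>` replaces the spinlock-protected `BoundedRing`, event names and wakeups are omitted, and the two mask constants use their Linux values.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

pub const IN_MODIFY: u32 = 0x0000_0002;
pub const IN_CREATE: u32 = 0x0000_0100;

/// Hypothetical, reduced event record (the real buffer also carries a name).
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub wd: i32,
    pub mask: u32,
}

pub struct Instance {
    queue: Mutex<VecDeque<Event>>,
    capacity: usize,
    overflow: AtomicBool,
}

impl Instance {
    pub fn new(capacity: usize) -> Self {
        Self {
            queue: Mutex::new(VecDeque::with_capacity(capacity)),
            capacity,
            overflow: AtomicBool::new(false),
        }
    }

    /// Deliver `event_mask` to a watch with `watch_mask`. Returns true if an
    /// event was queued, false if it was filtered out or dropped.
    pub fn deliver(&self, wd: i32, watch_mask: u32, event_mask: u32) -> bool {
        let fired = watch_mask & event_mask;
        if fired == 0 {
            return false; // the watch is not interested in this event
        }
        let mut q = self.queue.lock().unwrap();
        if q.len() < self.capacity {
            q.push_back(Event { wd, mask: fired });
            true
        } else {
            // The newest event is dropped; the flag is set while the lock is held.
            self.overflow.store(true, Ordering::Release);
            false
        }
    }

    pub fn overflowed(&self) -> bool {
        self.overflow.load(Ordering::Acquire)
    }

    pub fn queued(&self) -> usize {
        self.queue.lock().unwrap().len()
    }
}
```

Events already in the queue are never displaced; only the incoming event is lost when the queue is full.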
14.13.1.4 Syscall Implementations¶
inotify_add_watch(fd, path, mask) → wd:
1. Resolve path → inode using normal path resolution.
2. Look up fd → InotifyInstance.
3. Scan instance.watches for an existing watch on this inode:
- If found: update watch.mask = mask (OR behavior if IN_MASK_ADD flag is set;
replace otherwise). Return the existing wd.
4. Enforce max_user_watches: count total watches across all inotify instances
for the calling user's real UID. If total >= sysctl.max_user_watches, return
ENOSPC (matches Linux errno for this limit). The per-user watch count is
tracked via an AtomicU32 in the per-user credential structure for O(1) checking.
5. Allocate a new WatchDescriptor from instance.next_wd.fetch_add(1).
If the result is negative (wrapped past i32::MAX), return ENOSPC —
the WD space is exhausted. The application must close and re-open the
inotify fd. (Linux also wraps without checking; UmkaOS adds the guard
for 50-year uptime correctness at negligible cost.)
6. Construct InotifyWatch { wd, inode: inode.clone(), mask, instance: Arc::downgrade(&instance) }.
7. Initialize inode.inotify_watches if it was None.
8. Insert the watch into both inode.inotify_watches and instance.watches.
9. Increment the per-user watch count.
10. Return wd.
inotify_rm_watch(fd, wd) → 0:
1. Look up fd → InotifyInstance.
2. Remove the watch from instance.watches by wd. Return EINVAL if not found.
3. Remove the corresponding entry from inode.inotify_watches.
4. If inode.inotify_watches is now empty, the OnceLock persists with an
empty ArrayVec (not cleared — see OnceLock design note above). The ~24-byte
overhead avoids ABA races on the pointer.
5. Deliver an IN_IGNORED event to the instance.
6. Drop the Arc<InotifyWatch>.
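Step 5 of inotify_add_watch (watch descriptor allocation with a wrap guard) can be sketched in a few lines. `alloc_wd` is a hypothetical helper, the errno is a raw integer using the Linux value for ENOSPC, and the sketch assumes the counter starts at 1.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

pub const ENOSPC: i32 = 28;

/// Allocate the next watch descriptor. `fetch_add` wraps on overflow, so once
/// the counter passes i32::MAX the returned value is negative and the
/// descriptor space is treated as exhausted.
pub fn alloc_wd(next_wd: &AtomicI32) -> Result<i32, i32> {
    let wd = next_wd.fetch_add(1, Ordering::Relaxed);
    if wd < 1 {
        Err(ENOSPC) // wrapped: valid descriptors are 1-based positive integers
    } else {
        Ok(wd)
    }
}
```

The guard costs one comparison per allocation; `i32::MAX` itself is still handed out, and the first call after it fails.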
14.13.1.5 Mandatory Event Coalescing¶
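Coalescing rule (mandatory): Before enqueuing a new event, the delivery path
checks whether the tail of the instance's event_queue is an identical event. If
so, the new event is discarded (coalesced) rather than enqueued. Two events are
identical if and only if:
fn events_are_identical(a: &InotifyEventBuf, b: &InotifyEventBuf) -> bool {
    a.header.wd == b.header.wd &&
    a.header.mask == b.header.mask &&
    a.header.cookie == b.header.cookie &&
    a.header.len == b.header.len &&
    // Byte-for-byte name comparison over the valid prefix only; the name
    // lives in InotifyEventBuf, not in the InotifyEvent header.
    a.name[..a.header.len as usize] == b.name[..b.header.len as usize]
}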
The check is against the tail only (O(1)), not the entire queue. Events are coalesced only when consecutive and identical — non-consecutive duplicates are not coalesced (ordering is preserved for different events between duplicates).
IN_MOVED_FROM / IN_MOVED_TO cookie pairing: Cookie values are assigned by
a per-VFS-instance AtomicU32 cookie_counter. Consecutive rename operations
get consecutive cookie values. Coalescing does NOT apply to cookie-bearing
events (mask has IN_MOVED_FROM or IN_MOVED_TO set) — rename pairs must always
be delivered in full.
IN_Q_OVERFLOW: When the fixed-capacity BoundedRing is full and a new event
cannot be enqueued (even after attempting coalescing), the InotifyInstance.overflow
AtomicBool is set to true. On the next read(2), the read path checks this flag
first: if set, it clears the flag and prepends a synthetic IN_Q_OVERFLOW event
(wd=-1, mask=IN_Q_OVERFLOW, cookie=0, name="") before draining normal events.
This keeps the overflow sentinel out of the ring buffer itself, preserving all
max_queued_events slots for real events. The queue is never silently dropped
without this sentinel.
Performance: Under cargo build workloads (10k+ file writes), inotify
watchers on the build directory receive IN_MODIFY storms. Coalescing reduces
queue pressure by 10-100x for write-heavy workloads where the application
re-reads the file on any change (editor reload, build system).
Linux compatibility: Linux inotify performs the same tail-coalescing. UmkaOS mandates it (Linux specifies it informally). The IN_Q_OVERFLOW sentinel behaviour is identical to Linux.
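A compact sketch of tail-only coalescing, including the exemption for rename events, is given below. The `Event` type and `push_coalesced` are hypothetical; a `Vec` replaces the bounded ring and a heap-allocated name is used only to keep the example short. The mask constants use their Linux values.

```rust
pub const IN_MODIFY: u32 = 0x0000_0002;
pub const IN_MOVED_FROM: u32 = 0x0000_0040;
pub const IN_MOVED_TO: u32 = 0x0000_0080;

#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub wd: i32,
    pub mask: u32,
    pub cookie: u32,
    pub name: Vec<u8>, // heap-backed only to keep the sketch short
}

/// Push `ev` unless it duplicates the current tail. Returns true if queued.
/// Only the tail is inspected, so the check is O(1).
pub fn push_coalesced(queue: &mut Vec<Event>, ev: Event) -> bool {
    // Rename halves are never coalesced: both must reach userspace.
    let is_rename = (ev.mask & (IN_MOVED_FROM | IN_MOVED_TO)) != 0;
    if !is_rename && queue.last() == Some(&ev) {
        return false; // identical to the tail: coalesce
    }
    queue.push(ev);
    true
}
```

A duplicate that is separated from its twin by any other event is queued normally, which preserves ordering relative to the intervening event.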
14.13.1.6 inotify Sysctls (/proc/sys/fs/inotify/)¶
| Sysctl | Default | Enforced at | Description |
|---|---|---|---|
| max_user_instances | 128 | inotify_init() / inotify_init1() | Maximum inotify file descriptors per real UID. Returns EMFILE when exceeded. |
| max_user_watches | 8192 | inotify_add_watch() | Maximum watches across all inotify instances per real UID. Returns ENOSPC when exceeded. |
| max_queued_events | 16384 | Event enqueue (§14.13.1.3) | Maximum pending events per inotify instance before IN_Q_OVERFLOW. Set at inotify_init() time. |
Note on max_user_watches default: Linux kernels 5.11+ dynamically increase
this limit based on available memory (up to 1048576). UmkaOS uses the static
default of 8192 (matching the historical Linux default) but allows runtime tuning
via the sysctl. The event queue capacity per instance is max_queued_events (not the
compile-time generic parameter — the BoundedRing is allocated with capacity
max_queued_events at inotify_init() time).
Enforcement:
- inotify_init() / inotify_init1(): check per-user instance count against
max_user_instances. If exceeded, return EMFILE.
- inotify_add_watch(): check per-user watch count against max_user_watches
(step 4 in §14.13.1.4). If exceeded, return ENOSPC.
- Event enqueue: when the per-instance event queue reaches max_queued_events,
new events are dropped and the overflow flag is set (§14.13.1.5).
14.13.1.7 read(2) Serialization Protocol¶
Events are packed contiguously in the user buffer provided to read(2):
- Each event consists of struct inotify_event (16 bytes: wd i32 + mask u32 + cookie u32 + len u32) followed by len bytes of filename data.
- The filename is null-terminated and padded with additional null bytes to align the total event size (sizeof(inotify_event) + len) to the next 4-byte boundary. The len field includes all null bytes (terminator + padding).
- For events without a filename (e.g., IN_ATTRIB on a non-directory), len is 0 and no name bytes follow the header.
- If the user buffer is smaller than sizeof(inotify_event) (16 bytes), read(2) returns EINVAL (matching Linux ≥ 2.6.21 behavior).
- Partial events are never returned: if the next event in the queue does not fit in the remaining buffer space, read(2) stops and returns the number of bytes written so far. If no events have been written yet (first event does not fit), return EINVAL.
- If the overflow flag is set, a synthetic IN_Q_OVERFLOW event (wd=-1, mask=IN_Q_OVERFLOW, cookie=0, len=0, total 16 bytes) is prepended before draining normal events. The flag is cleared after prepending.
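The packing rules above can be sketched in a few lines. This is an illustrative userspace model under the rules as stated in this section (4-byte alignment, no partial events); `padded_name_len`, `encode_event`, and `pack_events` are hypothetical helper names, and events are modelled as plain tuples.

```rust
/// Size of the fixed `struct inotify_event` header (wd, mask, cookie, len).
pub const HEADER_LEN: usize = 16;
pub const EINVAL: i32 = 22;

/// Value of the `len` field: name bytes plus the NUL terminator, rounded up
/// to a multiple of 4. Events without a name have len == 0.
pub fn padded_name_len(name: &[u8]) -> usize {
    if name.is_empty() { 0 } else { (name.len() + 1 + 3) & !3 }
}

/// Append one serialized event (header + padded name) to `out`.
pub fn encode_event(out: &mut Vec<u8>, wd: i32, mask: u32, cookie: u32, name: &[u8]) {
    let len = padded_name_len(name);
    out.extend_from_slice(&wd.to_ne_bytes());
    out.extend_from_slice(&mask.to_ne_bytes());
    out.extend_from_slice(&cookie.to_ne_bytes());
    out.extend_from_slice(&(len as u32).to_ne_bytes());
    out.extend_from_slice(name);
    // Terminator and alignment padding are all NUL bytes.
    out.resize(out.len() + (len - name.len()), 0);
}

/// Pack as many whole events as fit into a user buffer of `buf_len` bytes.
/// Returns the bytes that read(2) would copy out, or Err(EINVAL).
pub fn pack_events(buf_len: usize, events: &[(i32, u32, u32, &[u8])]) -> Result<Vec<u8>, i32> {
    if buf_len < HEADER_LEN {
        return Err(EINVAL);
    }
    let mut out = Vec::new();
    for &(wd, mask, cookie, name) in events {
        let size = HEADER_LEN + padded_name_len(name);
        if out.len() + size > buf_len {
            // Never return a partial event. If even the first event does
            // not fit, the caller's buffer is too small.
            if out.is_empty() {
                return Err(EINVAL);
            }
            break;
        }
        encode_event(&mut out, wd, mask, cookie, name);
    }
    Ok(out)
}
```

For a one-character name the record is 16 + 4 = 20 bytes: one name byte, one terminator, and two padding bytes.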
14.13.2 fanotify¶
fanotify extends inotify with:
- Filesystem-wide and mount-wide marks (not just per-inode): a single mark can cover an entire mount point or filesystem, eliminating the need to add per-inode watches for directories being monitored for new file creation.
- Permission events (FAN_OPEN_PERM, FAN_ACCESS_PERM, FAN_OPEN_EXEC_PERM): the originating syscall blocks until the fanotify daemon responds with allow or deny, subject to a mandatory per-group timeout (default 5000ms) to prevent system-wide I/O stalls.
14.13.2.1 Data Structures¶
/// Per-fanotify-instance state. Created by fanotify_init().
pub struct FanotifyInstance {
/// Mark tables: one XArray per mark type, keyed by the object's u64 ID.
/// Three separate XArrays (matching the QuotaCache precedent in
/// disk-quota-subsystem.md) instead of a single `BTreeMap<FanotifyMarkKey>`:
/// (1) XArray provides O(1) lookup with RCU-compatible reads,
/// (2) avoids BTreeMap with enum-wrapped integer keys (collection policy),
/// (3) allows independent locking per mark type.
/// Mark management (fanotify_mark() syscall) is warm-path. Event delivery
/// traverses per-inode/mount/sb mark lists (attached when marks are added),
/// not these central tables.
pub inode_marks: XArray<Arc<FanotifyMark>>,
pub mount_marks: XArray<Arc<FanotifyMark>>,
pub sb_marks: XArray<Arc<FanotifyMark>>,
/// Informational event queue (non-permission events).
/// Uses a pre-allocated fixed-capacity ring buffer (`BoundedRing`) to avoid
/// heap allocation under spinlock. Capacity is set at `fanotify_init()` time
/// (default 16384, matching Linux `FANOTIFY_DEFAULT_MAX_EVENTS`).
/// Events beyond capacity are dropped and a `FAN_Q_OVERFLOW` synthetic event
/// is generated (matching Linux behavior).
pub event_queue: SpinLock<BoundedRing<FanotifyEvent>>,
/// Set when the event queue overflowed since the last read(2). A synthetic
/// `FAN_Q_OVERFLOW` event is prepended to the next read response and this
/// flag is cleared. Kept outside the ring to avoid occupying an event slot.
pub overflow: AtomicBool,
/// Pending permission requests: keyed by a unique request ID (u64)
/// assigned at creation. Entries are removed when the daemon writes a
/// response. XArray provides O(1) lookup with internal xa_lock for write
/// serialization, replacing the external SpinLock.
pub perm_queue: XArray<Arc<FanotifyPermRequest>>,
/// Next permission request ID (monotonically increasing).
pub next_perm_id: AtomicU64,
/// Wait queue for poll()/select()/epoll() on this instance.
pub wait_queue: WaitQueueHead,
/// Notification class: determines permission event delivery order when multiple
/// fanotify instances watch the same inode.
/// FAN_CLASS_NOTIF=0x00000000: informational only.
/// FAN_CLASS_CONTENT=0x00000004: content scanners (see file after open).
/// FAN_CLASS_PRE_CONTENT=0x00000008: DLP / integrity monitors (see file before open).
/// Higher class is notified first. Within the same class, order is unspecified.
pub class: FanotifyClass,
/// Flags from fanotify_init() (FAN_CLOEXEC, FAN_NONBLOCK, FAN_REPORT_FID, etc.).
pub flags: u32,
/// Maximum time to wait for a permission event response.
/// Default: 5000ms. Configurable per group at fanotify_init() time via
/// FANOTIFY_INIT_PERM_TIMEOUT_MS (UmkaOS extension, not in Linux).
/// A value of 0 means: use the system default from
/// /proc/sys/fs/fanotify/perm_timeout_ms.
pub perm_timeout: Duration,
/// Action taken when a permission request times out:
/// - PermTimeoutAction::Deny: return EPERM to the originating syscall (safe default)
/// - PermTimeoutAction::Allow: allow the operation (permissive mode for monitoring-only daemons)
pub perm_timeout_action: PermTimeoutAction,
}
pub enum PermTimeoutAction {
Deny, // Return EPERM to originating syscall on timeout (default)
Allow, // Allow the operation on timeout (for monitoring daemons that tolerate loss)
}
/// Mark type discriminant for fanotify_mark() dispatch. Determines which
/// XArray (`inode_marks`, `mount_marks`, or `sb_marks`) to use for the
/// mark operation. The u64 ID carried by the variant is used as the
/// XArray key directly.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum FanotifyMarkKey {
/// Inode mark: watches a specific file or directory.
/// `inode_id` is the filesystem-wide inode number.
Inode { inode_id: u64 },
/// Mount mark: watches all files under a mount point.
/// `mount_id` is the unique mount ID from the mount tree.
Mount { mount_id: u64 },
/// Filesystem mark: watches all files on a filesystem (superblock scope).
/// `sb_id` is a unique identifier for the superblock (device number).
Filesystem { sb_id: u64 },
}
/// A single fanotify mark: attaches event interest to an inode, mount, or superblock.
pub struct FanotifyMark {
pub mark_type: FanotifyMarkType, // FAN_MARK_INODE, FAN_MARK_MOUNT, FAN_MARK_FILESYSTEM
/// Object identifier: inode_id (for inode marks), mount_id (for mount marks),
/// or sb_id (for filesystem marks). Same value as in `FanotifyMarkKey`.
pub object_id: u64,
/// Event mask this mark is listening for.
pub mask: u64,
/// Ignore mask: events matching this mask are suppressed even if mask is set.
pub ignored_mask: u64,
pub instance: Weak<FanotifyInstance>,
}
/// A pending permission request: holds the event plus the response channel.
pub struct FanotifyPermRequest {
/// The event as delivered to userspace via read(2) on the fanotify fd.
pub event: FanotifyEvent,
/// Unique request ID (matches the fd-based identification in the response).
pub request_id: u64,
/// Set to FAN_ALLOW or FAN_DENY by the daemon's write(2) response.
/// Uses OnceLock<u32> (first-writer-wins via `.set()`): the first
/// writer (daemon's write() or timeout handler) wins; the loser's
/// `.set()` returns Err. This enforces the semantic at the type level.
pub response: OnceLock<u32>,
/// Wakes the blocked originating syscall when response becomes Some.
pub waker: WaitQueueHead,
}
/// Event delivered to userspace via read(2) on the fanotify fd.
/// Matches Linux's fanotify_event_metadata ABI.
#[repr(C)]
pub struct FanotifyEvent {
pub event_len: u32, // Total length of this event record (including variable info records)
pub vers: u8, // FANOTIFY_METADATA_VERSION (always 3)
pub reserved: u8,
pub metadata_len: u16, // sizeof(FanotifyEvent)
pub mask: u64, // Event type bitmask
pub fd: i32, // Opened fd for the file (or -1 with FAN_REPORT_FID)
pub pid: i32, // PID of the process that triggered the event
}
// mask field matches Linux `__aligned_u64` — naturally 8-byte aligned at offset 8.
const_assert!(size_of::<FanotifyEvent>() == 24);
/// Variable-length event information record header.
/// Appended after `FanotifyEvent` when `FAN_REPORT_FID` is set in the
/// fanotify group flags. Multiple info records may follow a single event
/// (e.g., `FAN_REPORT_FID | FAN_REPORT_DFID_NAME` produces two records).
/// Linux ABI: `struct fanotify_event_info_header` (linux/fanotify.h).
#[repr(C)]
pub struct FanotifyEventInfoHeader {
/// Info record type. Determines the layout of the data following
/// this header. Values:
/// - `FAN_EVENT_INFO_TYPE_FID` (1): file handle info.
/// - `FAN_EVENT_INFO_TYPE_DFID_NAME` (2): directory + name.
/// - `FAN_EVENT_INFO_TYPE_DFID` (3): directory file handle only.
/// - `FAN_EVENT_INFO_TYPE_PIDFD` (4): pidfd info (Linux 5.15+).
/// - `FAN_EVENT_INFO_TYPE_ERROR` (5): filesystem error info (Linux 5.16+).
pub info_type: u8,
/// Padding for alignment.
pub pad: u8,
/// Total length of this info record (header + payload), in bytes.
/// Must be a multiple of 4 (aligned to u32 boundary).
pub len: u16,
}
/// File identifier info record. Follows `FanotifyEventInfoHeader` when
/// `info_type == FAN_EVENT_INFO_TYPE_FID` (or DFID/DFID_NAME).
/// Linux ABI: `struct fanotify_event_info_fid` (linux/fanotify.h).
///
/// The file handle bytes follow immediately after this struct. The total
/// record size is `sizeof(FanotifyEventInfoFid)` (which already includes the
/// header) plus the 8-byte `struct file_handle` header plus `handle_bytes`,
/// padded to a 4-byte boundary.
#[repr(C)]
pub struct FanotifyEventInfoFid {
/// Header identifying this record type and total length.
pub hdr: FanotifyEventInfoHeader,
/// Filesystem identifier (same as `statfs.f_fsid`). Allows userspace
/// to correlate the file handle with a specific mounted filesystem.
pub fsid: FsId,
// Variable-length file handle follows this struct (plain comment: a doc
// comment here would have no field to attach to). The first 4 bytes are the
// handle length (matching `struct file_handle.handle_bytes`), followed by
// the handle type (4 bytes) and the handle data. The total length
// including padding is recorded in `hdr.len`.
// Followed by: struct file_handle { u32 handle_bytes; i32 handle_type; u8 f_handle[]; }
}
/// Filesystem identifier (matches `__kernel_fsid_t` / `statfs.f_fsid`).
#[repr(C)]
pub struct FsId {
pub val: [i32; 2],
}
const_assert!(size_of::<FanotifyEventInfoHeader>() == 4);
const_assert!(size_of::<FsId>() == 8);
const_assert!(size_of::<FanotifyEventInfoFid>() == 12);
/// fanotify event info type constants. Match Linux `FAN_EVENT_INFO_TYPE_*`.
pub const FAN_EVENT_INFO_TYPE_FID: u8 = 1;
pub const FAN_EVENT_INFO_TYPE_DFID_NAME: u8 = 2;
pub const FAN_EVENT_INFO_TYPE_DFID: u8 = 3;
pub const FAN_EVENT_INFO_TYPE_PIDFD: u8 = 4;
pub const FAN_EVENT_INFO_TYPE_ERROR: u8 = 5;
/// Byte-range info for pre-content events (fanotify pre-content scanning).
/// Linux 6.14+.
pub const FAN_EVENT_INFO_TYPE_RANGE: u8 = 6;
/// Mount ID info for mount-aware fanotify. Linux 6.14+.
pub const FAN_EVENT_INFO_TYPE_MNT: u8 = 7;
// Types 8, 9 reserved by Linux.
/// Source directory+name for rename events (FAN_RENAME). Linux 5.17+.
pub const FAN_EVENT_INFO_TYPE_OLD_DFID_NAME: u8 = 10;
// Type 11 reserved by Linux.
/// Destination directory+name for rename events (FAN_RENAME). Linux 5.17+.
pub const FAN_EVENT_INFO_TYPE_NEW_DFID_NAME: u8 = 12;
pub enum FanotifyMarkType { Inode, Mount, Filesystem }
pub enum FanotifyClass {
Notif = 0x0000_0000, // FAN_CLASS_NOTIF
Content = 0x0000_0004, // FAN_CLASS_CONTENT
PreContent = 0x0000_0008, // FAN_CLASS_PRE_CONTENT
}
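As a cross-check of the ABI sizes above, the following sketch computes the length of one FID info record and of a complete event. It assumes a record consists of the 12-byte FanotifyEventInfoFid, the 8-byte struct file_handle header, and handle_bytes of opaque data, rounded up to 4 bytes. The struct names are shortened local mirrors, and `fid_record_len` / `event_len` are hypothetical helpers.

```rust
use std::mem::size_of;

/// Local mirror of FanotifyEvent (Linux `fanotify_event_metadata`).
#[repr(C)]
pub struct EventMetadata {
    pub event_len: u32,
    pub vers: u8,
    pub reserved: u8,
    pub metadata_len: u16,
    pub mask: u64,
    pub fd: i32,
    pub pid: i32,
}

/// Local mirror of FanotifyEventInfoHeader.
#[repr(C)]
pub struct InfoHeader {
    pub info_type: u8,
    pub pad: u8,
    pub len: u16,
}

/// Local mirror of FanotifyEventInfoFid (header + fsid).
#[repr(C)]
pub struct InfoFid {
    pub hdr: InfoHeader,
    pub fsid: [i32; 2],
}

/// `struct file_handle` header: u32 handle_bytes + i32 handle_type.
pub const FILE_HANDLE_HEADER: usize = 8;

/// Length of one FID-style info record, as stored in `hdr.len`.
pub fn fid_record_len(handle_bytes: usize) -> usize {
    let raw = size_of::<InfoFid>() + FILE_HANDLE_HEADER + handle_bytes;
    (raw + 3) & !3 // round up to a 4-byte boundary
}

/// Value of `event_len` for an event carrying the given FID records.
pub fn event_len(handle_sizes: &[usize]) -> usize {
    size_of::<EventMetadata>() + handle_sizes.iter().map(|&h| fid_record_len(h)).sum::<usize>()
}
```

With an 8-byte handle, one record is 12 + 8 + 8 = 28 bytes and the whole event is 24 + 28 = 52 bytes.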
14.13.2.2 Permission Event Flow¶
When a VFS operation triggers a permission-event mask bit (e.g., FAN_OPEN_PERM on
open(2)):
fanotify_perm_event(inode, event_type, opener_pid):
    // Collect all matching fanotify instances in class order (PreContent first).
    matching = collect_matching_marks(inode, event_type)
    if matching is empty: return Ok(()) // fast path
    for instance in matching sorted by class descending:
        id = instance.next_perm_id.fetch_add(1, Ordering::Relaxed)
        event_fd = open_file_for_fanotify(inode) // opens fd for daemon to inspect
        event = FanotifyEvent { mask: event_type, fd: event_fd, pid: opener_pid, ... }
        // response starts unset; the first .set() (daemon or timeout handler) wins.
        req = Arc::new(FanotifyPermRequest { event, request_id: id, response: OnceLock::new(), waker })
        instance.perm_queue.insert(id, req.clone()) // XArray serializes writers via xa_lock
        queue = instance.event_queue.lock()
        if !queue.is_full():
            queue.push(event)
        else:
            // Queue overflow: drop event, set FAN_Q_OVERFLOW flag (matching Linux)
            instance.overflow.store(true, Ordering::Release)
        instance.wait_queue.wake_up_one()
        // Block with mandatory timeout; never block indefinitely.
        // Sleeps on req.waker until req.response is set or the timeout expires.
        match req.waker.wait_timeout(|| req.response.get(), instance.perm_timeout):
            Ok(response):
                if response == FAN_ALLOW: close(event_fd); continue // allow: close fd, check next instance
                else: close(event_fd); return Err(EPERM)
            Err(Timeout):
                // Log timeout: fanotify daemon too slow
                log_warn!("fanotify: perm request timed out after {:?}, action={:?}",
                    instance.perm_timeout, instance.perm_timeout_action)
                // Increment per-group timeout counter (visible in /proc/PID/fdinfo/<fafd>)
                instance.timeout_count.fetch_add(1, Ordering::Relaxed)
                match instance.perm_timeout_action:
                    PermTimeoutAction::Deny → close(event_fd); return Err(EPERM)
                    PermTimeoutAction::Allow → close(event_fd); continue // allow on timeout
    return Ok(()) // all instances allowed
Timeout vs late response race: If the daemon responds after the timeout
fires but before the requesting thread fully unblocks, the response is discarded.
The req.response uses an OnceLock<u32> first-writer-wins pattern: the first writer (either
the daemon's write() or the timeout handler) wins. The loser's write is a no-op.
This prevents both double-free of the event fd and contradictory
allow-then-deny sequences. The daemon's late response is logged at DEBUG level
for diagnostic purposes.
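The first-writer-wins rule can be demonstrated with the standard library's `OnceLock`. This is a userspace sketch only: `PermRequest` is a hypothetical reduction of FanotifyPermRequest, and a `Mutex` + `Condvar` pair stands in for the kernel's WaitQueueHead.

```rust
use std::sync::{Condvar, Mutex, OnceLock};
use std::time::Duration;

pub const FAN_ALLOW: u32 = 0x01;
pub const FAN_DENY: u32 = 0x02;

/// Hypothetical reduction of FanotifyPermRequest to the parts that matter
/// for the race: the write-once verdict and a way to wake the waiter.
pub struct PermRequest {
    response: OnceLock<u32>,
    lock: Mutex<()>,
    cv: Condvar,
}

impl PermRequest {
    pub fn new() -> Self {
        Self { response: OnceLock::new(), lock: Mutex::new(()), cv: Condvar::new() }
    }

    /// Daemon side. Returns true if this verdict was recorded, false if the
    /// request had already been decided (for example by the timeout).
    pub fn respond(&self, verdict: u32) -> bool {
        let won = self.response.set(verdict).is_ok();
        // Take the lock before notifying so the wakeup cannot slip in
        // between the waiter's predicate check and its sleep.
        let _guard = self.lock.lock().unwrap();
        self.cv.notify_all();
        won
    }

    /// Blocked-syscall side. Waits up to `timeout`; on expiry it competes as
    /// a writer with `on_timeout`. Returns the verdict that actually won.
    pub fn wait(&self, timeout: Duration, on_timeout: u32) -> u32 {
        let guard = self.lock.lock().unwrap();
        let (_guard, _timed_out) = self
            .cv
            .wait_timeout_while(guard, timeout, |_| self.response.get().is_none())
            .unwrap();
        // No-op if the daemon already answered; otherwise the timeout wins.
        let _ = self.response.set(on_timeout);
        *self.response.get().expect("set above or by the daemon")
    }
}
```

Whichever side calls `set()` first fixes the verdict. The other side sees `Err` from `set()` and can log the late response without changing the outcome.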
Mandatory permission event timeout: Permission events (FAN_OPEN_PERM, FAN_ACCESS_PERM, FAN_OPEN_EXEC_PERM) have a mandatory response timeout to prevent system-wide I/O stalls.
System-wide timeout knob: /proc/sys/fs/fanotify/perm_timeout_ms (default: 5000). Can be set to 0 to disable timeout (not recommended; requires CAP_SYS_ADMIN).
Monitoring: /proc/sys/fs/fanotify/perm_timeout_count — system-wide count of permission request timeouts (monotonic counter, reset on boot). Per-group count in /proc/PID/fdinfo/<fafd> as perm_timeout_count: N.
Linux compatibility note: Linux fanotify has no timeout on permission events (daemon death causes permanent block — requires daemon restart or fanotify fd close). UmkaOS's timeout is an improvement over Linux; existing fanotify daemons work unchanged (they don't set FANOTIFY_INIT_PERM_TIMEOUT_MS, so they get the 5s default with Deny on timeout). Tools like systemd-oomd, CrowdStrike Falcon, and audit daemons that use fanotify will benefit automatically from the safety timeout.
Userspace daemon writes FAN_ALLOW / FAN_DENY:
write(fanotify_fd, &fanotify_response { fd: event_fd, response: FAN_ALLOW_or_DENY }):
    // Match the response to a pending request by event_fd.
    // Linux ABI compatibility: the userspace `fanotify_response` struct uses `fd` as
    // the matching key. Internally, the kernel maps fd → request using the per-group
    // fd-to-request XArray (O(1) lookup). An internal request_id is used only for
    // kernel-side tracking and logging; it is never exposed to userspace.
    req = find_perm_request_by_fd(instance.perm_queue, event_fd)
    if req is None: return Err(EINVAL) // stale or already answered
    // First writer wins: if the timeout handler already set the response,
    // .set() returns Err and this late verdict is discarded.
    req.response.set(FAN_ALLOW_or_DENY)
    req.waker.wake_up_one() // unblock the blocked syscall
UmkaOS improvement over Linux fanotify: Linux matches responses to pending permission
requests by the fd number inside the fanotify_response struct, which becomes ambiguous
if the daemon closes and reopens fds in the event window. UmkaOS tracks each request
as a typed FanotifyPermRequest keyed internally by a monotonically increasing
request_id. Because the request is held in an Arc<FanotifyPermRequest>, it stays
valid until both the blocked syscall and the pending-request table have dropped
their references, so a late or duplicate response never touches freed state.
14.13.3 UmkaOS-Native File Watch Capabilities¶
UmkaOS provides a capability-based file watching API as a modern alternative to
inotify. Unlike inotify (global watch descriptor namespace, process-scoped),
FileWatchCap watches are:
- Capability-scoped: unforgeable, revocable, auditable
- Memory-bounded: each watch is a capability slot (no global state)
- Automatically revoked: when the capability is dropped or the process exits
- Ring-delivered: events go to a typed UmkaOS ring buffer, not a read() queue
- Composable: multiple watches can share one ring
inotify remains fully supported for Linux compatibility. FileWatchCap is the
recommended API for new UmkaOS code.
/// A capability granting the holder the right to watch a specific inode for
/// specific events. Cannot be forged; issued by the kernel only.
/// Revocable via the standard capability revocation path (Section 9.1).
pub struct FileWatchCap {
/// The inode to watch. Kernel-internal reference — not a path (immune to rename).
inode: Arc<Inode>,
/// Events to deliver (subset of InotifyMask).
mask: InotifyMask,
/// Watch children of this directory (if inode is a directory).
watch_children: bool,
/// Watch children recursively (deep watch — UmkaOS extension, not in inotify).
watch_recursive: bool,
}
/// Subscribe to inode events via a capability.
/// Events are delivered to `ring` as typed `FileWatchEvent` structs.
///
/// Returns a `WatchHandle` — dropping the handle unregisters the watch.
pub fn inode_watch(
cap: FileWatchCap,
ring: Arc<EventRing<FileWatchEvent>>,
) -> Result<WatchHandle, WatchError>;
/// A single file watch event, delivered to the ring.
/// C-compatible layout: uses explicit length + fixed array instead of
/// `Option<ArrayString<255>>` (which has Rust-internal layout).
// kernel-internal, not KABI
#[repr(C)]
pub struct FileWatchEvent {
pub event_type: FileWatchEventType, // enum (see below)
pub cookie: u32, // for rename pairs (FROM/TO share cookie)
pub inode_id: u64, // stable inode number
pub name_len: u8, // 0 = no name; >0 = first `name_len` bytes valid
pub name: [u8; 255], // filename (for directory events), NUL-padded
pub timestamp: MonotonicInstant, // UmkaOS extension: not in inotify
}
// FileWatchEvent layout: event_type(u32=4) + cookie(u32=4) + inode_id(u64=8) +
// name_len(u8=1) + name([u8;255]=255) + timestamp(MonotonicInstant(u64)=8).
// After name_len+name: offset = 4+4+8+1+255 = 272. 272 % 8 = 0, no padding.
// Total: 272 + 8 = 280 bytes.
const_assert!(core::mem::size_of::<FileWatchEvent>() == 280);
/// Monotonic timestamp (nanoseconds since boot, from CLOCK_MONOTONIC).
/// Used for UmkaOS extensions where wall-clock time is not needed.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct MonotonicInstant(pub u64);
#[repr(u32)]
pub enum FileWatchEventType {
Access, // File was read
Modify, // File was written
Attrib, // Metadata changed (chmod, chown, timestamps)
CloseWrite, // File opened for writing was closed
CloseNoWrite, // File opened read-only was closed
Open, // File was opened
MovedFrom, // File moved out (cookie matches MovedTo)
MovedTo, // File moved in (cookie matches MovedFrom)
Create, // File created in watched directory
Delete, // File deleted from watched directory
DeleteSelf, // Watched file itself was deleted
MoveSelf, // Watched file itself was moved
Unmount, // Filesystem containing watched file was unmounted
}
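The layout arithmetic above can be checked mechanically. The sketch below mirrors FileWatchEvent with a plain u64 in place of MonotonicInstant and adds two hypothetical helpers, `new` and `name`, to show how the length-prefixed fixed array is meant to be read and written.

```rust
use std::mem::size_of;

#[repr(u32)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FileWatchEventType {
    Access, Modify, Attrib, CloseWrite, CloseNoWrite, Open, MovedFrom,
    MovedTo, Create, Delete, DeleteSelf, MoveSelf, Unmount,
}

#[repr(C)]
pub struct FileWatchEvent {
    pub event_type: FileWatchEventType, // offset 0
    pub cookie: u32,                    // offset 4
    pub inode_id: u64,                  // offset 8
    pub name_len: u8,                   // offset 16
    pub name: [u8; 255],                // offset 17..272
    pub timestamp: u64,                 // offset 272 (MonotonicInstant is a transparent u64)
}

// Compile-time check of the 280-byte layout.
const _: () = assert!(size_of::<FileWatchEvent>() == 280);

impl FileWatchEvent {
    /// Hypothetical constructor: rejects names longer than 255 bytes and
    /// NUL-pads the rest of the array.
    pub fn new(event_type: FileWatchEventType, inode_id: u64, name: &[u8], timestamp: u64) -> Option<Self> {
        if name.len() > 255 {
            return None;
        }
        let mut buf = [0u8; 255];
        buf[..name.len()].copy_from_slice(name);
        Some(Self { event_type, cookie: 0, inode_id, name_len: name.len() as u8, name: buf, timestamp })
    }

    /// The valid prefix of `name`, or None when the event carries no name.
    pub fn name(&self) -> Option<&[u8]> {
        if self.name_len == 0 { None } else { Some(&self.name[..self.name_len as usize]) }
    }
}
```

Consumers should always slice by `name_len` rather than scanning for a NUL, since a 255-byte name fills the array with no terminator.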
Deep watch (watch_recursive: true): watches a directory tree recursively.
UmkaOS maintains a kernel-side tree of watch registrations, automatically adding
watches for new subdirectories as they are created (IN_CREATE on a directory).
inotify has no recursive watch; tools like inotifywait -r simulate it by walking
the tree in userspace and adding one watch per directory, which races with
concurrent directory creation. UmkaOS's deep watch is race-free.
Deep watch resource limits: Recursive watches can consume significant kernel
memory on deep directory trees (e.g., a deep watch on / with millions of
directories). Limits are enforced per-user to prevent denial-of-service:
- max_deep_watches_per_user: default 128 (sysctl fs.inotify.max_deep_watches).
Exceeding this returns ENOSPC.
- max_watch_entries_per_deep_watch: default 65536. If the directory tree contains
more subdirectories than this, the kernel stops adding new watches beyond the limit
and delivers IN_Q_OVERFLOW to signal the user that coverage is incomplete.
- Memory accounting: each internal watch node costs ~128 bytes. A deep watch on a
tree with 65536 directories consumes ~8 MB of kernel memory, charged to the
user's RLIMIT_MEMLOCK limit.
- CAP_SYS_ADMIN can override max_deep_watches_per_user up to the system-wide
hard limit of 1024.
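The per-deep-watch entry limit and its memory charge can be modelled as a simple budget. This is an illustrative sketch: `DeepWatch` and its methods are hypothetical, and the 128-byte node cost is the approximate figure quoted above.

```rust
/// Approximate cost of one internal watch node, per the figure above.
pub const WATCH_NODE_BYTES: usize = 128;
/// Default max_watch_entries_per_deep_watch.
pub const MAX_WATCH_ENTRIES: usize = 65_536;

/// Hypothetical bookkeeping for one deep watch.
pub struct DeepWatch {
    entries: usize,
    max_entries: usize,
    incomplete: bool,
}

impl DeepWatch {
    pub fn new(max_entries: usize) -> Self {
        Self { entries: 0, max_entries, incomplete: false }
    }

    /// Called when a subdirectory appears under the watched tree. Returns
    /// true if a watch node was added, false if the limit was hit (in which
    /// case the watcher must be told that coverage is incomplete).
    pub fn add_subdir(&mut self) -> bool {
        if self.entries == self.max_entries {
            self.incomplete = true;
            return false;
        }
        self.entries += 1;
        true
    }

    /// Kernel memory charged to the user for this deep watch.
    pub fn charged_bytes(&self) -> usize {
        self.entries * WATCH_NODE_BYTES
    }

    pub fn is_incomplete(&self) -> bool {
        self.incomplete
    }
}
```

At the default limit the charge is 65536 × 128 bytes = 8 MiB, which matches the figure in the list above.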
Obtaining a FileWatchCap: capability is issued via:
/// Open a FileWatchCap for a path (requires read permission on the path).
pub fn open_watch_cap(
dirfd: DirFd,
path: &Path,
mask: InotifyMask,
watch_children: bool,
watch_recursive: bool,
) -> Result<FileWatchCap, WatchError>;
Revocation: WatchHandle::drop() unregisters the watch. When the process
exits, all WatchHandles are dropped automatically — no cleanup required.
Capability revocation (Section 9.1) also revokes all file watches derived from
the revoked capability.
Linux compatibility: FileWatchCap is an UmkaOS-only API. inotify_init(),
inotify_add_watch(), inotify_rm_watch() work identically to Linux.
FileWatchCap is intended for new UmkaOS applications; existing Linux software
uses inotify unchanged.
14.13.4 Cross-References¶
- Section 14.1 (VFS Traits): inotify/fanotify hooks are inserted at the VFS operation dispatch layer, after InodeOps/FileOps call sites complete successfully.
- Section 17.1 (Namespace Implementation): fanotify marks survive CLONE_NEWNS and remain attached to the underlying inode/mount, not to a specific mount namespace. Marks set in a parent namespace remain visible in child namespaces for the same underlying mount.
- Section 9.1 (Security): fanotify_init(FAN_CLASS_CONTENT) and fanotify_init(FAN_CLASS_PRE_CONTENT) require CAP_SYS_ADMIN. Informational fanotify (FAN_CLASS_NOTIF) requires no capability (Linux 5.13+, unprivileged fanotify); UmkaOS follows the same requirement for compatibility.
14.14 Local File Locking (flock / fcntl POSIX Locks / OFD Locks)¶
UmkaOS provides three advisory file locking interfaces, each with distinct semantics:
| Interface | Granularity | Lock scope | Inherited on fork | Released on |
|---|---|---|---|---|
| flock(2) | Whole file | Per open-file-description | Yes (child shares the open file description, so the same lock is shared; either process can release it) | Last close of the description |
| fcntl F_SETLK | Byte-range (POSIX) | Per process (PID) | No | Process exit OR any close of the file |
| fcntl F_OFD_SETLK | Byte-range (OFD) | Per open-file-description | Yes | Last close of the description |
All three are advisory: a process can read and write a file regardless of locks held
by other processes. Locks only prevent other processes from acquiring conflicting locks.
Mandatory locking (Linux MS_MANDLOCK) is deliberately not implemented: Linux
removed support for it in 5.15, and it is incompatible with modern VFS semantics.
14.14.1 Data Structures¶
/// A single file lock entry. Stored in the per-inode `FileLockTree`.
pub struct FileLock {
/// Lock type: read (shared) or write (exclusive).
pub lock_type: FileLockType,
/// Byte range: [start, end] inclusive. 0..=u64::MAX represents the whole file.
/// For flock locks, start=0 and end=u64::MAX always.
pub start: u64,
pub end: u64,
/// For POSIX locks: the PID of the owning process.
/// All POSIX locks held by a process are released when it exits OR when
/// any file descriptor for the file is closed (POSIX semantics).
/// For OFD locks: None. The lock is owned by the open-file-description.
/// For flock locks: None. The lock is owned by the open-file-description.
pub owner_pid: Option<Pid>,
/// The open-file-description that created this lock.
/// Weak reference: if the description is dropped (last fd closed), the lock
/// is released. For POSIX locks, `owner_pid` is the primary ownership token
/// and `owner_fd` is advisory for conflict matching.
pub owner_fd: Weak<OpenFile>,
/// Wait queue: tasks blocked waiting for this lock to be released sleep here.
pub wait_queue: WaitQueueHead,
}
pub enum FileLockType {
/// Shared (read) lock. Multiple readers can hold simultaneously.
Read,
/// Exclusive (write) lock. No other lock may be held concurrently.
Write,
}
/// Per-inode lock state. Present only on inodes that have had locks acquired;
/// None on inodes that have never been locked (zero overhead on the fast path).
pub struct InodeLocks {
/// Augmented interval tree of active locks (POSIX, flock, and OFD locks).
/// Sorted by `l_start`; each node carries `subtree_max: u64` = maximum
/// `l_end` in its subtree. This enables O(log n) range overlap queries.
/// See Section 14.14.3 for the full algorithm specification.
pub locks: FileLockTree,
/// Protects the lock tree. Operations must be atomic with respect to each other.
pub lock: SpinLock<()>,
}
/// Augmented interval tree for file lock conflict detection.
/// Red-black tree sorted by `l_start`, augmented with `subtree_max` for
/// O(log n) range overlap queries.
pub struct FileLockTree {
/// Root of the red-black tree. None when no locks are held.
root: Option<Box<FileLockNode>>,
/// Number of locks currently in the tree.
count: usize,
}
/// FileLockNode allocation uses a dedicated slab cache (`file_lock_slab`)
/// with per-CPU magazines, matching Linux's `file_lock_cache`. This provides
/// bounded warm-path allocation without general-heap contention.
pub struct FileLockNode {
pub lock: FileLock,
/// Maximum `l_end` value in this node's subtree (including this node).
/// Updated on every insert/delete along the path to the root.
pub subtree_max: u64,
pub left: Option<Box<FileLockNode>>,
pub right: Option<Box<FileLockNode>>,
pub color: RbColor,
}
pub enum RbColor { Red, Black }
14.14.2 Conflict Detection¶
Two locks conflict if:
1. At least one is a write lock (FileLockType::Write).
2. Their byte ranges overlap: !(lock_a.end < lock_b.start || lock_b.end < lock_a.start).
3. They have different owners:
- For POSIX locks: different PIDs.
- For OFD/flock locks: different Weak<OpenFile> pointers.
- A POSIX lock can upgrade/replace an existing POSIX lock from the same PID without conflict.
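The three conditions translate directly into a predicate. The sketch below is a simplified model: `Lock` and `Owner` are hypothetical types, and a numeric id stands in for `Weak<OpenFile>` pointer identity.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LockType {
    Read,
    Write,
}

/// Hypothetical owner token: a PID for POSIX locks, or an id standing in for
/// the open-file-description pointer for OFD/flock locks.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Owner {
    Pid(u32),
    OpenFile(u64),
}

#[derive(Clone, Copy, Debug)]
pub struct Lock {
    pub lock_type: LockType,
    pub start: u64, // inclusive
    pub end: u64,   // inclusive
    pub owner: Owner,
}

/// True if `a` and `b` cannot be held at the same time.
pub fn conflicts(a: &Lock, b: &Lock) -> bool {
    // 1. At least one side is exclusive.
    let one_writer = a.lock_type == LockType::Write || b.lock_type == LockType::Write;
    // 2. The inclusive byte ranges overlap.
    let overlap = !(a.end < b.start || b.end < a.start);
    // 3. The owners differ (same-owner requests replace or upgrade instead).
    one_writer && overlap && a.owner != b.owner
}
```

Note that ranges which merely touch, such as [0, 9] and [10, 19], do not overlap under the inclusive test and therefore never conflict.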
14.14.3 Locking Algorithm¶
UmkaOS uses an augmented interval tree (red-black tree with subtree_max
augmentation) for O(log n) file lock conflict detection. This is the correct
data structure; there is no O(n) fallback. Linux's fs/locks.c keeps POSIX locks
in per-inode linked lists and scans them linearly; UmkaOS starts with the
interval tree.
FileLockTree structure:
- Sorted by l_start (range start)
- Each node carries subtree_max: u64 = maximum l_end in its subtree
- This augmentation enables O(log n) range overlap queries
Conflict query for range [req_start, req_end] (inclusive, matching FileLock):
Walk the tree: at each node, if node.subtree_max < req_start, the entire
subtree has no overlapping locks, so prune it. Otherwise recurse into the left
child, check the node itself, and recurse into the right child only if
node.lock.start <= req_end (every lock in the right subtree starts at or after
the node's start). O(log n + k) where k = number of conflicts found.
Insert/delete: O(log n) standard red-black tree operations, plus O(log n)
subtree_max recomputation on the path to root. During red-black tree
rotations (left-rotate, right-rotate), subtree_max is recomputed for the
two rotated nodes: node.subtree_max = max(node.lock.end,
left_child_max(node), right_child_max(node)). This is the standard
augmented red-black tree technique (CLRS §14.2).
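The pruning rule is easiest to see on a stripped-down tree. The sketch below uses a plain unbalanced binary search tree rather than a red-black tree, so it illustrates only the subtree_max augmentation and the overlap query, not rebalancing. `Node`, `insert`, and `query` are hypothetical names, and ranges are inclusive.

```rust
pub struct Node {
    pub start: u64,
    pub end: u64,         // inclusive
    pub subtree_max: u64, // maximum `end` in this node's subtree
    pub left: Option<Box<Node>>,
    pub right: Option<Box<Node>>,
}

/// Insert [start, end] keyed by `start`, updating subtree_max on the way
/// back up. (No rebalancing: a real implementation would rotate here and
/// recompute subtree_max for the rotated nodes.)
pub fn insert(root: Option<Box<Node>>, start: u64, end: u64) -> Box<Node> {
    match root {
        None => Box::new(Node { start, end, subtree_max: end, left: None, right: None }),
        Some(mut n) => {
            if start < n.start {
                n.left = Some(insert(n.left.take(), start, end));
            } else {
                n.right = Some(insert(n.right.take(), start, end));
            }
            n.subtree_max = n.subtree_max.max(end);
            n
        }
    }
}

/// Collect every stored range overlapping [qs, qe], in order of `start`.
pub fn query(node: &Option<Box<Node>>, qs: u64, qe: u64, out: &mut Vec<(u64, u64)>) {
    if let Some(n) = node {
        if n.subtree_max < qs {
            return; // nothing in this subtree reaches the query range
        }
        query(&n.left, qs, qe, out);
        if n.start <= qe && qs <= n.end {
            out.push((n.start, n.end));
        }
        // Everything to the right starts at or after n.start, so it can
        // only overlap if n.start itself is not past the query end.
        if n.start <= qe {
            query(&n.right, qs, qe, out);
        }
    }
}
```

A production tree would also filter results by owner and lock type (the conflict rules of the previous section) before reporting them.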
fcntl_setlk(fd, lock_type, start, end, wait: bool) → Result:
    inode = fd.inode()
    ensure inode.locks is initialized
    inode.locks.lock.lock()
    loop:
        // O(log n + k) interval tree query for conflicting locks in [start, end].
        for existing in inode.locks.locks.query_conflicts(start, end, lock_type, &fd):
            if !wait:
                inode.locks.lock.unlock()
                return Err(EAGAIN) // F_SETLK: fail immediately
            // F_SETLKW: deadlock detection before sleeping
            if would_deadlock(current_pid, existing.owner_pid):
                inode.locks.lock.unlock()
                return Err(EDEADLK)
            inode.locks.lock.unlock()
            existing.wait_queue.wait_event(|| !lock_conflicts_anymore(...))
            inode.locks.lock.lock()
            continue loop // re-check after wakeup (spurious wakeup safe)
        // No conflict: coalesce adjacent/overlapping locks of the same type and owner,
        // then insert the new lock. O((k+1) log n).
        coalesce_and_insert(inode, fd, lock_type, start, end)
        inode.locks.lock.unlock()
        return Ok(())
Lock Coalescing Algorithm (Greedy Interval Merge)
The following batch coalescing algorithm is used during lock migration and
crash recovery (when multiple lock requests are replayed). The per-call
coalescing path is coalesce_and_insert() below, which operates on the
interval tree directly with no Vec allocation.
Input: a set of pending lock requests sorted by (offset, len).
Output: a minimal set of merged lock requests covering the same byte ranges.
Data structure:
Algorithm (O(n log n) for n requests):
1. Collect all pending requests into Vec<PendingLockRequest>.
2. Sort by offset (ascending), then by len (descending) as tiebreaker.
3. Sweep left to right:
   - Start with current = requests[0].
   - For each subsequent request r:
     - If r.offset <= current.offset + current.len (overlapping or adjacent) AND r.op == current.op (same lock type): current.len = max(current.offset + current.len, r.offset + r.len) - current.offset
     - Otherwise: emit current, set current = r.
4. Emit final current.
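The sweep above can be sketched as follows. The field names offset, len, and op come from the algorithm text. The `PendingLockRequest` layout, the `LockOp` enum, and the `coalesce` function name are assumptions made for illustration, since the original data structure definition is not shown here. Saturating arithmetic is used so that a whole-file request does not overflow.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LockOp {
    Read,
    Write,
}

/// Assumed shape of a pending request: half-open range [offset, offset + len).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PendingLockRequest {
    pub offset: u64,
    pub len: u64,
    pub op: LockOp,
}

/// Greedy interval merge: O(n log n) for the sort, O(n) for the sweep.
pub fn coalesce(mut reqs: Vec<PendingLockRequest>) -> Vec<PendingLockRequest> {
    // Step 2: offset ascending, then len descending as tiebreaker.
    reqs.sort_by(|a, b| a.offset.cmp(&b.offset).then(b.len.cmp(&a.len)));

    let mut out = Vec::with_capacity(reqs.len());
    let mut iter = reqs.into_iter();
    let mut current = match iter.next() {
        Some(first) => first,
        None => return out, // nothing to merge
    };

    // Step 3: sweep left to right, extending `current` while requests touch
    // or overlap it and ask for the same lock type.
    for r in iter {
        let current_end = current.offset.saturating_add(current.len);
        if r.offset <= current_end && r.op == current.op {
            let r_end = r.offset.saturating_add(r.len);
            current.len = current_end.max(r_end) - current.offset;
        } else {
            out.push(current);
            current = r;
        }
    }

    // Step 4: emit the last run.
    out.push(current);
    out
}
```

Requests of different types are never merged, even when they overlap. Resolving that overlap is left to the conflict check, not to coalescing.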
Rationale: coalescing reduces the number of kernel lock table entries for byte-range
locking (POSIX fcntl(F_SETLK)), avoiding fragmentation in the per-file lock list.
coalesce_and_insert(new_lock), called after the conflict check passes:
- Query the interval tree for all locks with the same owner (new_lock.owner_pid for POSIX locks) and the same type as new_lock that overlap or are adjacent to new_lock's range [start, end] (adjacent = existing.end + 1 == new_lock.start or vice versa).
- Compute the union range: min(all.start) to max(all.end).
- Remove all found locks from the interval tree (O(k log n)).
- Insert a single merged lock covering the union range (O(log n)).

Complexity: O((k+1) log n) where k = number of locks merged. Coalescing reduces tree size over time for processes that acquire many adjacent byte-range locks (common in database file locking patterns).
14.14.4 Deadlock Detection¶
Wait-For Graph data structure:
/// Directed graph of lock-wait relationships between threads.
/// Edge (A → B) means thread A is currently blocked waiting for a byte-range
/// lock held by thread B. The graph is maintained in the lock manager and
/// updated on each lock acquisition attempt that would block.
///
/// Bounded at compile time; exceeding `LOCK_GRAPH_MAX_THREADS` causes
/// `detect_deadlock` to abort with `LockError::DeadlockDetectionOverflow`.
pub struct WaitForGraph {
/// Sparse adjacency list: (waiting_thread, [holder_threads]).
/// Each entry represents one blocked thread and the set of threads
/// it is waiting on (typically 1, but can be multiple for range locks).
edges: ArrayVec<(ThreadId, ArrayVec<ThreadId, LOCK_GRAPH_MAX_HOLDERS>), LOCK_GRAPH_MAX_THREADS>,
}
// WaitForGraph is ~20 KiB inline (256 * 80 bytes), which exceeds safe kernel
// stack depth (8-16 KiB). It MUST NOT be allocated on the kernel stack.
// Allocation strategy: per-CPU pre-allocated buffer. Only one deadlock
// detection runs per CPU at a time because InodeLocks.lock is a SpinLock,
// which disables preemption. No other thread on this CPU can enter a locking
// path while the SpinLock is held.
static LOCK_DEADLOCK_GRAPH: PerCpu<WaitForGraph> = PerCpu::new(WaitForGraph::new);
impl WaitForGraph {
pub const fn new() -> Self { Self { edges: ArrayVec::new() } }
/// Record that `waiter` is blocked on a lock held by `holder`.
pub fn add_edge(&mut self, waiter: ThreadId, holder: ThreadId) { /* ... */ }
/// Remove all outgoing edges from `waiter` (called on lock release or
/// wakeup so stale edges don't pollute subsequent detections).
pub fn remove_waiter(&mut self, waiter: ThreadId) { /* ... */ }
/// Return an iterator over all threads that currently hold locks that
/// `waiter` is blocked on. O(n) scan over the edge list.
pub fn holders_of(&self, waiter: ThreadId) -> impl Iterator<Item = ThreadId> + '_ {
self.edges.iter()
.find(|(tid, _)| *tid == waiter)
.map(|(_, holders)| holders.iter().copied())
.into_iter()
.flatten()
}
}
/// Maximum number of threads tracked simultaneously in the wait-for graph.
/// This is the per-inode concurrent contention limit, not a system-wide
/// thread limit. 256 concurrent threads blocked on the same inode's lock
/// tree is well beyond any realistic workload. Overflow is treated
/// conservatively (EDEADLK) — correctness is preserved at the cost of
/// a false deadlock report.
pub const LOCK_GRAPH_MAX_THREADS: usize = 256;
/// Maximum number of holders per waiting thread (range locks can be split).
pub const LOCK_GRAPH_MAX_HOLDERS: usize = 8;
Deadlock Detection: Wait-For Graph DFS (3-Color)
Each lock holder is a node; each blocked waiter is a directed edge (waiter → holder).
Node state per thread:
- WHITE: not yet visited in current DFS
- GRAY: currently in the DFS recursion stack (potential cycle node)
- BLACK: fully explored, no cycle reachable from here
Constants:
Algorithm (invoked before blocking on a contested lock):
fn detect_deadlock(start: ThreadId, graph: &WaitForGraph) -> bool:
// Stack-allocated: WaitForGraph limits to VFS_LOCK_MAX_DEPTH threads,
// so linear scan on ≤64 entries is faster than HashMap heap allocation.
color = ArrayVec<(ThreadId, Color), VFS_LOCK_MAX_DEPTH>::new()
return dfs(start, &mut color, graph, depth=0)
fn dfs(node: ThreadId, color: &mut ArrayVec, graph: &WaitForGraph, depth: usize) -> bool:
if depth > VFS_LOCK_MAX_DEPTH:
return true // treat as deadlock (conservative)
color[node] = GRAY
for each holder in graph.holders_of(node):
match color.get(holder):
GRAY => return true // back-edge: cycle detected
BLACK => continue // already explored, safe
WHITE | None:
color[holder] = WHITE
if dfs(holder, color, graph, depth+1): return true
color[node] = BLACK
return false
On true return: the blocking call returns Err(LockError::Deadlock) / EDEADLK.
The caller must release all currently held locks and retry with a backoff.
The graph is constructed on-demand per lock request and is not persisted. Returning
true on depth overflow is safe: it causes the lock request to fail with EDEADLK,
which is better than silently allowing a potential deadlock. The depth limit prevents
deadlock detection from becoming a denial-of-service vector in pathological chains.
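The 3-color DFS above can be sketched in runnable Rust as follows. This is a minimal illustration, not the kernel implementation: it uses `std::collections::HashMap` in place of the bounded `ArrayVec` structures, `ThreadId` is assumed to be a plain `u32`, and `MAX_DEPTH` is a placeholder for `VFS_LOCK_MAX_DEPTH`, whose real value is not given in this section.

```rust
use std::collections::HashMap;

/// Assumption: a plain integer stands in for the kernel's `ThreadId`.
type ThreadId = u32;

/// Placeholder for `VFS_LOCK_MAX_DEPTH`; the real constant is not shown here.
const MAX_DEPTH: usize = 64;

/// WHITE is represented by absence from the color map.
#[derive(Clone, Copy, PartialEq)]
enum Color {
    Gray,
    Black,
}

/// Returns true if a cycle is reachable from `start`, or if the depth bound
/// is exceeded (conservative: treated as a deadlock).
fn detect_deadlock(start: ThreadId, edges: &HashMap<ThreadId, Vec<ThreadId>>) -> bool {
    let mut color: HashMap<ThreadId, Color> = HashMap::new();
    dfs(start, edges, &mut color, 0)
}

fn dfs(
    node: ThreadId,
    edges: &HashMap<ThreadId, Vec<ThreadId>>,
    color: &mut HashMap<ThreadId, Color>,
    depth: usize,
) -> bool {
    if depth > MAX_DEPTH {
        return true;
    }
    color.insert(node, Color::Gray);
    if let Some(holders) = edges.get(&node) {
        for &holder in holders {
            match color.get(&holder).copied() {
                // Back-edge into the current recursion stack: cycle.
                Some(Color::Gray) => return true,
                // Already fully explored: no cycle through this node.
                Some(Color::Black) => continue,
                // WHITE: descend.
                None => {
                    if dfs(holder, edges, color, depth + 1) {
                        return true;
                    }
                }
            }
        }
    }
    color.insert(node, Color::Black);
    false
}
```

A chain 1 → 2 → 3 reports no deadlock; adding the edge 3 → 1 closes the cycle and the check returns true. A cycle-free chain longer than the depth bound also returns true, which is the conservative EDEADLK behavior described above.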
14.14.5 Lock Release on File Description Close¶
When an OpenFile's reference count drops to zero (the last file descriptor
pointing to it is closed):
- OFD locks: all locks where owner_fd matches this description are removed.
- flock locks: the flock lock associated with this description (if any) is removed.
- POSIX locks: all locks where owner_pid == current_process.pid are removed. This is the POSIX-mandated behavior: closing any file descriptor for a file releases all POSIX locks the process holds on that file, regardless of which fd was used to acquire them.
After removing locks, wake all tasks in the wait_queue of each removed lock so they
can retry acquisition.
14.14.6 memfd Sealing (F_ADD_SEALS / F_GET_SEALS)¶
memfd_create(2) returns an anonymous file (backed by tmpfs, with no pathname).
Seals are write-once restrictions placed on the file's mutation capabilities:
/// Seal flags for memfd files. Once set, seals cannot be removed.
/// SEAL_SEAL prevents any further seals from being added.
bitflags! {
pub struct SealFlags: u32 {
/// Prevent any further seals from being added.
const SEAL_SEAL = 0x0001;
/// Prevent the file from shrinking (ftruncate to a smaller size returns EPERM).
const SEAL_SHRINK = 0x0002;
/// Prevent the file from growing (writes past EOF, ftruncate to larger size return EPERM).
const SEAL_GROW = 0x0004;
/// Prevent all writes: write(2) returns EPERM, mmap(PROT_WRITE) returns EPERM.
const SEAL_WRITE = 0x0008;
/// Prevent future mmap(PROT_WRITE) but allow existing writable mappings to remain.
const SEAL_FUTURE_WRITE = 0x0010;
}
}
fcntl(fd, F_ADD_SEALS, seals): add the specified seals atomically via a
compare_exchange on the inode's AtomicU32 seal field. Fails with EPERM if
SEAL_SEAL is already set. Fails with EBUSY if SEAL_WRITE is being added while
a writable mmap exists on the file.
fcntl(fd, F_GET_SEALS): return the current seal set (atomic load, lock-free).
Seal enforcement in VFS paths:
- write(2) and pwrite64(2): check SEAL_WRITE.
- ftruncate(2) to smaller size: check SEAL_SHRINK.
- ftruncate(2) to larger size: check SEAL_GROW.
- mmap(PROT_WRITE): check SEAL_WRITE | SEAL_FUTURE_WRITE.
UmkaOS improvement: seals are stored as an AtomicU32 in the memfd's inode — seal reads
are lock-free (a single atomic load), which is important because the seal check appears
on every write(2) and mmap(2) call for sealed fds.
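The F_ADD_SEALS update described above can be sketched with a compare-exchange loop on an `AtomicU32`. This is an illustrative user-space model only: the `Seals` wrapper, the `SealError` enum, and the `has_writable_mmap` parameter are hypothetical names introduced for the sketch, and the real kernel path would obtain the mapping state from the inode rather than from an argument.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const SEAL_SEAL: u32 = 0x0001;
const SEAL_SHRINK: u32 = 0x0002;
const SEAL_GROW: u32 = 0x0004;
const SEAL_WRITE: u32 = 0x0008;
const SEAL_FUTURE_WRITE: u32 = 0x0010;

/// Hypothetical error type: `Perm` models EPERM, `Busy` models EBUSY.
#[derive(Debug, PartialEq)]
enum SealError {
    Perm,
    Busy,
}

/// Hypothetical stand-in for the seal word stored in a memfd inode.
struct Seals(AtomicU32);

impl Seals {
    fn new() -> Self {
        Seals(AtomicU32::new(0))
    }

    /// Add `new` seals; returns the resulting seal set on success.
    fn add(&self, new: u32, has_writable_mmap: bool) -> Result<u32, SealError> {
        let mut cur = self.0.load(Ordering::Acquire);
        loop {
            // Once SEAL_SEAL is set, the seal set is frozen.
            if (cur & SEAL_SEAL) != 0 {
                return Err(SealError::Perm);
            }
            // SEAL_WRITE cannot be added while a writable mapping exists.
            if (new & SEAL_WRITE) != 0 && has_writable_mmap {
                return Err(SealError::Busy);
            }
            match self.0.compare_exchange_weak(
                cur,
                cur | new,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Ok(cur | new),
                // Another thread changed the seals: re-run the checks.
                Err(observed) => cur = observed,
            }
        }
    }

    /// F_GET_SEALS equivalent: a single atomic load.
    fn get(&self) -> u32 {
        self.0.load(Ordering::Acquire)
    }
}
```

Re-running the SEAL_SEAL check inside the loop matters: if a concurrent caller sets SEAL_SEAL between the load and the exchange, the retry observes it and fails with the EPERM equivalent instead of adding seals to a frozen set.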
14.14.7 Cross-References¶
- Section 15.15 (Distributed Lock Manager): the DLM provides cluster-wide advisory locks that extend the local flock/POSIX lock semantics across nodes. Local file locks (this section) are node-local only.
- Section 14.1 (VFS Architecture): FileOps::release() is the call site where OFD and flock locks are released when the last fd to a file description is closed.
- Section 17.1 (Containers): POSIX lock ownership is per-PID-namespace-PID. Within a container's PID namespace, lock ownership semantics are unchanged.
14.14.8 Lock Semantics Mode (POSIX Default / OFD Opt-in)¶
UmkaOS keeps POSIX semantics as the default for F_SETLK to preserve full Linux
binary compatibility. Applications and deployments that want the correct OFD
semantics as default can opt in at three levels, with the highest-priority source
winning:
Priority order (highest first):
1. Per-call explicit constant
2. Per-process prctl
3. Per-user-namespace sysctl
4. System global default: POSIX
14.14.8.1.1 Per-call explicit (always available, no mode setting needed)¶
F_OFD_SETLK // Always OFD semantics (Linux 3.15+, UmkaOS supported)
F_OFD_SETLKW // Always OFD semantics, blocking
F_SETLK_POSIX // UmkaOS extension: always POSIX semantics, explicit
F_SETLKW_POSIX // UmkaOS extension: always POSIX semantics, blocking
F_SETLK_POSIX exists so code inside an OFD-default process can still request
POSIX semantics for specific locks (e.g., a bundled library that requires
process-death lock release for crash detection).
14.14.8.1.2 Per-process opt-in¶
prctl(PR_SET_LOCK_SEMANTICS, LOCK_SEM_OFD) // F_SETLK means OFD for this process
prctl(PR_SET_LOCK_SEMANTICS, LOCK_SEM_POSIX) // Explicit POSIX (escape hatch)
prctl(PR_GET_LOCK_SEMANTICS, 0, 0, 0, 0) // Query current mode
Stored in Task.lock_semantics: LockSemanticsMode (per-thread but inherited from
the process — all threads in a process share the same mode via Process.lock_semantics).
Inheritance rules:
- fork(): child inherits parent's lock_semantics
- exec(): inherited (sticky) — a container runtime sets it once; all descendant
processes inherit
- exec() of setuid/setgid binary: reset to the user-namespace sysctl default
(security: a privilege-elevating binary must not blindly inherit)
14.14.8.1.3 Per-user-namespace sysctl¶
Values: posix (default) | ofd
This sysctl is per-user-namespace, not global. Each container has its own
user namespace and therefore its own file_lock_default. The container runtime
sets it at container creation:
/// Per-user-namespace lock semantics default.
/// Stored in UserNamespace.file_lock_default.
pub enum LockSemanticsMode {
Posix = 0, // F_SETLK uses POSIX semantics (default)
Ofd = 1, // F_SETLK uses OFD semantics
Unset = 2, // Not explicitly configured; falls through to namespace/global default
}
Requires CAP_SYS_ADMIN in the target user namespace to change.
Affects new processes only — running processes keep their current mode.
14.14.8.1.4 Deployment model¶
| Scenario | Recommended config |
|---|---|
| Host with legacy software | sysctl = posix (default), no change needed |
| UmkaOS-native container | runtime sets sysctl = ofd in container's user namespace |
| Mixed container (some legacy binaries) | sysctl = posix, UmkaOS-native apps use prctl |
| Wine / NFS lockd / old SQLite | prctl(LOCK_SEM_POSIX) in launch wrapper |
14.14.8.1.5 Internal resolution¶
fn effective_lock_semantics(
task: &Task,
cmd: FcntlCmd,
) -> LockSemanticsMode {
match cmd {
FcntlCmd::OfdSetLk | FcntlCmd::OfdSetLkW => LockSemanticsMode::Ofd,
FcntlCmd::SetLkPosix | FcntlCmd::SetLkWPosix => LockSemanticsMode::Posix,
FcntlCmd::SetLk | FcntlCmd::SetLkW => {
// Resolve: per-process > per-namespace sysctl > global POSIX
if task.process.lock_semantics != LockSemanticsMode::Unset {
task.process.lock_semantics
} else {
task.user_namespace.file_lock_default
}
}
_ => LockSemanticsMode::Posix,
}
}
Linux compatibility: existing binaries calling F_SETLK on a system where
no mode is set get identical POSIX behaviour to Linux. F_OFD_SETLK was added
in Linux 3.15 and is already supported. F_SETLK_POSIX and
PR_SET_LOCK_SEMANTICS are UmkaOS extensions with no Linux equivalent.
14.15 Disk Quota Subsystem (quotactl)¶
Disk quotas enforce per-user, per-group, and per-project limits on filesystem space and inode usage. Required for multi-tenant storage environments and Linux compatibility.
14.15.1 Data Structures¶
/// Internal kernel quota accounting structure. NOT the UAPI struct — see
/// `IfDqblk` below for the Linux-compatible quotactl(2) wire format.
/// This struct extends the UAPI layout with `bgrace` and `igrace` fields
/// for in-kernel grace period tracking (not exposed to userspace directly).
pub struct DiskQuota {
/// Hard block limit (bytes). 0 = no limit. Writes that would exceed this
/// are rejected with EDQUOT immediately, regardless of grace period.
pub bhardlimit: u64,
/// Soft block limit (bytes). Exceeding this triggers a grace period timer.
/// Once the grace period expires, further writes are rejected with EDQUOT.
pub bsoftlimit: u64,
/// Current block usage (bytes). Updated on every successful write and truncate.
pub bcurrent: u64,
/// Hard inode limit. 0 = no limit. File creation that would exceed this
/// is rejected with EDQUOT.
pub ihardlimit: u64,
/// Soft inode limit. Exceeding this triggers an inode grace period.
pub isoftlimit: u64,
/// Current inode count (files + directories + symlinks owned by this subject).
pub icurrent: u64,
/// Quota grace period expiry deadline for blocks: the absolute timestamp
/// (seconds since epoch) at which the block soft limit grace period expires.
/// Set to `now + bgrace` when the soft block limit is first exceeded.
/// 0 if the soft limit has not been exceeded.
/// After this deadline, writes that would keep usage above `bsoftlimit`
/// are rejected with EDQUOT (same enforcement as the hard limit).
/// Matches Linux `dqb_btime` semantics ("time limit for excessive disk use").
/// Type is u64, matching the Linux UAPI `struct if_dqblk.dqb_btime` (`__u64`).
pub btime: u64,
/// Quota grace period expiry deadline for inodes: the absolute timestamp
/// (seconds since epoch) at which the inode soft limit grace period expires.
/// Semantics mirror `btime` but for inode counts instead of block usage.
/// 0 if the soft inode limit has not been exceeded.
/// Type is u64, matching the Linux UAPI `struct if_dqblk.dqb_itime` (`__u64`).
pub itime: u64,
/// Grace period for the block soft limit, in seconds. Default: 7 days (604800).
pub bgrace: u32,
/// Grace period for the inode soft limit, in seconds. Default: 7 days (604800).
pub igrace: u32,
}
/// Linux UAPI quota structure for quotactl(2). Matches `struct if_dqblk`
/// from `<linux/quota.h>` exactly — this is the struct that userspace tools
/// (quota, repquota, edquota) read and write via Q_GETQUOTA / Q_SETQUOTA.
///
/// Field order and sizes must match the Linux definition exactly:
/// __u64 dqb_bhardlimit, dqb_bsoftlimit, dqb_curspace,
/// __u64 dqb_ihardlimit, dqb_isoftlimit, dqb_curinodes,
/// __u64 dqb_btime, dqb_itime,
/// __u32 dqb_valid
#[repr(C)]
pub struct IfDqblk {
pub dqb_bhardlimit: u64,
pub dqb_bsoftlimit: u64,
pub dqb_curspace: u64,
pub dqb_ihardlimit: u64,
pub dqb_isoftlimit: u64,
pub dqb_curinodes: u64,
/// Grace period expiry deadline for blocks (seconds since epoch).
/// 0 if the soft block limit has not been exceeded.
pub dqb_btime: u64,
/// Grace period expiry deadline for inodes (seconds since epoch).
pub dqb_itime: u64,
/// Bitmask of QIF_* flags indicating which fields are valid.
/// QIF_BLIMITS=1, QIF_SPACE=2, QIF_ILIMITS=4, QIF_INODES=8,
/// QIF_BTIME=16, QIF_ITIME=32, QIF_ALL=0x3F.
pub dqb_valid: u32,
// repr(C) adds 4 bytes implicit trailing padding for u64 alignment,
// matching Linux's `struct if_dqblk` exactly (9 fields, 72 bytes).
// No explicit `_pad` field — Linux has 9 fields, not 10. The implicit
// padding is zero-initialized by the kernel before copy_to_user().
}
// Layout: 8×u64 + u32 + 4(implicit pad) = 64 + 4 + 4 = 72 bytes.
const_assert!(size_of::<IfDqblk>() == 72);
/// Conversion between internal `DiskQuota` and UAPI `IfDqblk`:
/// - Q_GETQUOTA: kernel reads `DiskQuota` from cache, converts to `IfDqblk`,
/// copies to userspace. `dqb_valid` is set to `QIF_ALL` (all fields valid).
/// - Q_SETQUOTA: kernel copies `IfDqblk` from userspace, updates only the
/// `DiskQuota` fields indicated by `dqb_valid` in the cache.
/// Quota subject type.
pub enum QuotaType {
User = 0, // USRQUOTA
Group = 1, // GRPQUOTA
Project = 2, // PRJQUOTA
}
/// Quota operations implemented by filesystems that support quotas.
/// Optional — filesystems without quota support omit this and quotactl(2) returns ENOSYS.
pub trait QuotaOps: Send + Sync {
/// Enable quota enforcement for the given type, reading limits from `quota_file`.
fn quota_on(&self, quota_type: QuotaType, quota_file: &str) -> Result<(), VfsError>;
/// Disable quota enforcement for the given type.
fn quota_off(&self, quota_type: QuotaType) -> Result<(), VfsError>;
/// Read the quota entry for subject `id` (UID, GID, or project ID).
fn get_quota(&self, quota_type: QuotaType, id: u32) -> Result<DiskQuota, VfsError>;
/// Set limits and accounting for subject `id`. Requires CAP_SYS_ADMIN.
fn set_quota(&self, quota_type: QuotaType, id: u32, quota: &DiskQuota) -> Result<(), VfsError>;
/// Read global quota state (grace periods, flags) for the given type.
fn get_info(&self, quota_type: QuotaType) -> Result<QuotaInfo, VfsError>;
/// Set global quota state (grace periods). Requires CAP_SYS_ADMIN.
fn set_info(&self, quota_type: QuotaType, info: &QuotaInfo) -> Result<(), VfsError>;
/// Flush in-memory quota accounting to the quota database file.
fn sync_quota(&self, quota_type: QuotaType) -> Result<(), VfsError>;
}
/// Global quota state (grace periods and enabled flags) for a single quota type.
pub struct QuotaInfo {
/// Block grace period in seconds.
pub bgrace: u32,
/// Inode grace period in seconds.
pub igrace: u32,
/// Quota flags (QIF_FLAGS: quota enabled, quota accounting-only, etc.).
pub flags: u32,
}
14.15.2 quotactl(2) Dispatch¶
The quotactl(2) syscall encodes both the quota command and the quota type in a
single 32-bit cmd argument: bits [31:8] are the command
(Q_QUOTAON=0x800002, Q_QUOTAOFF=0x800003, Q_GETQUOTA=0x800007,
Q_SETQUOTA=0x800008, Q_GETINFO=0x800005, Q_SETINFO=0x800006,
Q_SYNC=0x800001) and bits [7:0] are the quota type (USRQUOTA=0, GRPQUOTA=1,
PRJQUOTA=2). This matches the Linux QCMD(cmd, type) = (cmd << 8) | type macro.
Subcmd values range up to 0x800009 (Q_GETNEXTQUOTA).
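The packing can be illustrated with a small encode/decode pair. The constant values are the ones listed above; the helper names `qcmd` and `split_qcmd` and the `SUBCMD_*` constants are chosen for this sketch, and `cmd` is treated as an unsigned 32-bit value so that the right shift does not sign-extend.

```rust
const SUBCMD_SHIFT: u32 = 8;
const SUBCMD_MASK: u32 = 0x00ff;

const Q_SYNC: u32 = 0x80_0001;
const Q_QUOTAON: u32 = 0x80_0002;
const Q_GETQUOTA: u32 = 0x80_0007;

const USRQUOTA: u32 = 0;
const GRPQUOTA: u32 = 1;
const PRJQUOTA: u32 = 2;

/// Pack a quota command and a quota type into the single `cmd` argument.
fn qcmd(cmd: u32, qtype: u32) -> u32 {
    (cmd << SUBCMD_SHIFT) | (qtype & SUBCMD_MASK)
}

/// Split a packed `cmd` argument back into (command, quota type).
fn split_qcmd(packed: u32) -> (u32, u32) {
    (packed >> SUBCMD_SHIFT, packed & SUBCMD_MASK)
}
```

For example, `qcmd(Q_GETQUOTA, GRPQUOTA)` yields `0x80000701`, and `split_qcmd` recovers the original pair.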
quotactl(cmd, dev, id, addr):
qt_cmd = cmd >> 8
qt_type = QuotaType::from(cmd & 0xff) // USRQUOTA/GRPQUOTA/PRJQUOTA
sb = resolve_superblock_from_device_path(dev)
if sb.quota_ops is None: return Err(ENOSYS)
// Capability check for mutating operations
if qt_cmd in [Q_QUOTAON, Q_QUOTAOFF, Q_SETQUOTA, Q_SETINFO]:
check_capability(CAP_SYS_ADMIN)?
match qt_cmd:
Q_QUOTAON → sb.quota_ops.quota_on(qt_type, addr_as_path)
Q_QUOTAOFF → sb.quota_ops.quota_off(qt_type)
Q_GETQUOTA → quota = sb.quota_ops.get_quota(qt_type, id)?; uapi = quota.to_if_dqblk(); copy_to_user(addr, uapi)
Q_SETQUOTA → uapi = copy_from_user::<IfDqblk>(addr)?; quota = DiskQuota::from_if_dqblk(&uapi); sb.quota_ops.set_quota(qt_type, id, &quota)
Q_GETINFO → info = sb.quota_ops.get_info(qt_type)?; copy_to_user(addr, info)
Q_SETINFO → info = copy_from_user(addr)?; sb.quota_ops.set_info(qt_type, &info)
Q_SYNC → sb.quota_ops.sync_quota(qt_type)
_ → return Err(EINVAL)
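The Q_SETQUOTA rule that only the fields flagged in dqb_valid are applied can be sketched as a masked update. This is a simplified model under stated assumptions: `Quota` is a trimmed stand-in for `DiskQuota` (grace-period fields omitted), `SetRequest` is a hypothetical stand-in for the decoded `IfDqblk`, and any unit conversion between the wire format and the internal byte counts is deliberately left out.

```rust
const QIF_BLIMITS: u32 = 1;
const QIF_SPACE: u32 = 2;
const QIF_ILIMITS: u32 = 4;
const QIF_INODES: u32 = 8;
const QIF_BTIME: u32 = 16;
const QIF_ITIME: u32 = 32;

/// Trimmed stand-in for the document's `DiskQuota`.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Quota {
    bhardlimit: u64,
    bsoftlimit: u64,
    bcurrent: u64,
    ihardlimit: u64,
    isoftlimit: u64,
    icurrent: u64,
    btime: u64,
    itime: u64,
}

/// Hypothetical decoded request: new values plus the validity mask.
#[derive(Clone, Copy, Debug, Default)]
struct SetRequest {
    values: Quota,
    valid: u32,
}

/// Copy only the field groups whose QIF_* bit is set in `req.valid`.
fn apply_setquota(cached: &mut Quota, req: &SetRequest) {
    let v = &req.values;
    if (req.valid & QIF_BLIMITS) != 0 {
        cached.bhardlimit = v.bhardlimit;
        cached.bsoftlimit = v.bsoftlimit;
    }
    if (req.valid & QIF_SPACE) != 0 {
        cached.bcurrent = v.bcurrent;
    }
    if (req.valid & QIF_ILIMITS) != 0 {
        cached.ihardlimit = v.ihardlimit;
        cached.isoftlimit = v.isoftlimit;
    }
    if (req.valid & QIF_INODES) != 0 {
        cached.icurrent = v.icurrent;
    }
    if (req.valid & QIF_BTIME) != 0 {
        cached.btime = v.btime;
    }
    if (req.valid & QIF_ITIME) != 0 {
        cached.itime = v.itime;
    }
}
```

A request with only QIF_BLIMITS set changes the two block limits and leaves usage counters and inode limits as they were, even if the request carries other values in those fields.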
14.15.3 VFS Enforcement Hooks¶
On every write(2), fallocate, create, mkdir, mknod, and symlink call, the
VFS checks quotas for all three subject types:
vfs_quota_check_blocks(inode, bytes_requested) → Result:
creds = current_task().creds
for qt in [QuotaType::User, QuotaType::Group, QuotaType::Project]:
id = match qt:
User → creds.fsuid
Group → creds.fsgid
Project → inode.project_id // stored in the inode's native i_projid field (set via FS_IOC_FSSETXATTR)
quota = inode.sb.quota_ops.get_quota(qt, id)? // from in-memory quota cache
new_usage = quota.bcurrent + bytes_requested
if new_usage > quota.bhardlimit && quota.bhardlimit != 0:
return Err(EDQUOT) // hard limit exceeded: reject immediately
if new_usage > quota.bsoftlimit && quota.bsoftlimit != 0:
now = current_time_secs()
// The get_quota() → check btime → update_quota_cache() sequence
// must be serialized per (qt, id) to prevent a TOCTOU race: two
// concurrent writers could both read btime == 0 and both set btime,
// with the second overwriting the first. Serialization is provided
// by the per-quota-entry SpinLock in the quota cache (acquired by
// get_quota() and held until update_quota_cache() completes).
if quota.btime == 0:
quota.btime = now + quota.bgrace as u64 // start grace period timer
update_quota_cache(qt, id, &quota)
elif now > quota.btime:
return Err(EDQUOT) // grace period expired: reject
// else: within grace period, allow the write
return Ok(())
vfs_quota_check_inodes(inode, count) → Result:
// Identical structure to vfs_quota_check_blocks but uses icurrent/isoftlimit/ihardlimit.
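For a single quota subject, the hard-limit, soft-limit, and grace-period logic above reduces to a small pure function. The sketch below is illustrative only: `BlockQuota` and `Edquot` are hypothetical names, locking and cache write-back are omitted, and the function (like the pseudocode) does not itself charge the usage.

```rust
/// Hypothetical error value modelling EDQUOT.
#[derive(Debug, PartialEq)]
struct Edquot;

/// Block-related subset of the quota state for one subject.
struct BlockQuota {
    hard: u64,       // 0 = no hard limit
    soft: u64,       // 0 = no soft limit
    current: u64,    // current usage in bytes
    btime: u64,      // grace deadline (seconds), 0 = not started
    grace_secs: u32, // grace period length
}

/// Decide whether a request for `bytes` more space is allowed at time `now`.
fn check_blocks(q: &mut BlockQuota, bytes: u64, now: u64) -> Result<(), Edquot> {
    let new_usage = q.current.saturating_add(bytes);
    // Hard limit: reject immediately.
    if q.hard != 0 && new_usage > q.hard {
        return Err(Edquot);
    }
    if q.soft != 0 && new_usage > q.soft {
        if q.btime == 0 {
            // First time over the soft limit: start the grace timer.
            q.btime = now + u64::from(q.grace_secs);
        } else if now > q.btime {
            // Grace period expired: reject.
            return Err(Edquot);
        }
        // Otherwise still within the grace period: allow.
    }
    Ok(())
}
```

With hard = 100, soft = 50, and a 10-second grace period, the first request that crosses the soft limit at t = 1000 is allowed and sets the deadline to 1010; a similar request at t = 1011 is rejected, and any request that would exceed 100 bytes is rejected regardless of the timer.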
14.15.4 In-Memory Quota Cache¶
Quota accounting state is kept in a per-filesystem in-memory cache to avoid hitting
the quota database file on every write. The cache structure mirrors DiskQuota with an
additional dirty: bool field. Cache entries are written back to the quota file
asynchronously via sync_quota(), which is called:
- Periodically by the writeback daemon (default interval: 30 seconds).
- On quotactl(Q_SYNC).
- On filesystem unmount.
- On sync(2)/syncfs(2) when the filesystem's quota is dirty.
The cache uses three per-filesystem XArrays — one per QuotaType — keyed by
subject ID (u32 UID, GID, or project ID). Quota checks on the write(2) hot
path use RCU read guards (lock-free, no contention between concurrent writers).
Updates (usage accounting, limit changes via quotactl(Q_SETQUOTA)) acquire the
XArray's internal lock on the affected entry only.
/// Per-filesystem in-memory quota cache.
///
/// Three XArrays partition by quota type so that user, group, and project
/// lookups are fully independent (no false contention). XArray provides
/// O(1) lookup by integer key with native RCU read support.
pub struct QuotaCache {
/// User quota cache, keyed by UID.
pub user: XArray<QuotaCacheEntry>,
/// Group quota cache, keyed by GID.
pub group: XArray<QuotaCacheEntry>,
/// Project quota cache, keyed by project ID.
pub project: XArray<QuotaCacheEntry>,
}
pub struct QuotaCacheEntry {
pub quota: DiskQuota,
/// True if this entry has been modified since the last writeback.
pub dirty: bool,
}
impl QuotaCache {
/// Look up a quota entry. RCU read — no lock, no allocation.
/// Called on every write(2), fallocate, create, mkdir — hot path.
pub fn get(&self, qt: QuotaType, id: u32) -> Option<RcuRef<QuotaCacheEntry>> {
self.array_for(qt).get_rcu(id as u64)
}
/// Insert or update a quota entry. Acquires the XArray's internal lock
/// for the affected slot only — does not block concurrent reads.
pub fn set(&self, qt: QuotaType, id: u32, entry: QuotaCacheEntry) {
self.array_for(qt).store(id as u64, entry);
}
fn array_for(&self, qt: QuotaType) -> &XArray<QuotaCacheEntry> {
match qt {
QuotaType::User => &self.user,
QuotaType::Group => &self.group,
QuotaType::Project => &self.project,
}
}
}
This replaces the previous RwLock<HashMap<(QuotaType, u32), DiskQuota>> design,
which had three problems: (1) HashMap with integer keys violates collection policy
(§3.1.13); (2) the global RwLock serialises all quota checks across all subjects;
(3) the composite (QuotaType, u32) key prevents independent access by quota type.
The XArray design gives O(1) lookup, lock-free RCU reads, and natural partitioning.
14.15.5 Linux Compatibility¶
- quotactl(2) with all seven commands (Q_QUOTAON, Q_QUOTAOFF, Q_GETQUOTA, Q_SETQUOTA, Q_GETINFO, Q_SETINFO, Q_SYNC) is fully implemented.
- The UAPI IfDqblk structure matches the Linux struct if_dqblk layout exactly (9 fields: dqb_bhardlimit through dqb_valid). The internal DiskQuota struct extends this with bgrace/igrace fields for in-kernel grace period tracking.
- Quota tools (quota, quotacheck, repquota, edquota) work without modification.
- ext4, XFS, and tmpfs quota implementations are in scope for the initial release.
- Project quotas (PRJQUOTA) are supported; project IDs are stored in the inode's i_projid field (set via FS_IOC_FSSETXATTR).
14.15.6 Cross-References¶
- Section 14.1 (VFS Architecture): quota checks are inserted into the VFS dispatch layer at write, create, mkdir, mknod, and fallocate call sites.
- Section 17.1 (Containers): cgroup v2 io.max and memory.max provide resource controls complementary to quota; quota enforces per-UID/GID storage limits while cgroups enforce per-container I/O and memory limits.
- Section 15.1 (Storage): ext4, XFS, and btrfs filesystem drivers implement QuotaOps as part of their SuperBlock initialization.
14.16 Extended Attributes (xattr)¶
Extended attributes are name-value pairs associated with inodes, providing metadata beyond the standard POSIX file attributes (owner, group, mode, timestamps). They are the storage mechanism for POSIX ACLs (Section 9.2), SELinux labels, IMA hashes (Section 9.5), overlayfs whiteouts (Section 14.8), file capabilities (Section 9.9), and user-defined metadata.
UmkaOS implements the complete Linux xattr ABI: identical syscall numbers, identical
namespace rules, identical size limits, and identical wire format for POSIX ACLs
stored in system.posix_acl_access / system.posix_acl_default.
14.16.1 Syscall Interface¶
Twelve syscalls implement four operations across three path resolution variants:
| Operation | Path-based (follows symlinks) | Link-based (no follow) | FD-based |
|---|---|---|---|
| Get | getxattr(path, name, value, size) | lgetxattr(path, name, value, size) | fgetxattr(fd, name, value, size) |
| Set | setxattr(path, name, value, size, flags) | lsetxattr(path, name, value, size, flags) | fsetxattr(fd, name, value, size, flags) |
| List | listxattr(path, list, size) | llistxattr(path, list, size) | flistxattr(fd, list, size) |
| Remove | removexattr(path, name) | lremovexattr(path, name) | fremovexattr(fd, name) |
Return values: getxattr returns the number of bytes written to value (or the
required buffer size if size == 0). listxattr returns the total length of the
name list, in which each name is terminated by a NUL byte (or the required size
if size == 0). setxattr and removexattr return 0 on success.
Error codes: ENODATA (attribute not found), EEXIST (CREATE flag, attribute
already exists), ERANGE (buffer too small), EPERM (namespace permission denied),
ENOTSUP (filesystem does not support xattrs or namespace not valid for this inode type).
The l-variants operate on the symlink inode itself rather than following the symlink
target. The f-variants use an open file descriptor, bypassing path resolution entirely.
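The size-probing convention and the layout of the listxattr buffer can be illustrated with a small user-space emulation. Nothing here calls the real syscalls: `listxattr_emulated`, `split_names`, and `ListError` are hypothetical names for this sketch, and the emulation assumes each name is followed by a single NUL byte.

```rust
/// Hypothetical error value modelling ERANGE (buffer too small).
#[derive(Debug, PartialEq)]
enum ListError {
    Range,
}

/// Emulate listxattr: an empty buffer probes for the required size,
/// a too-small buffer fails, otherwise names are written NUL-terminated.
fn listxattr_emulated(names: &[&str], out: &mut [u8]) -> Result<usize, ListError> {
    let needed: usize = names.iter().map(|n| n.len() + 1).sum();
    if out.is_empty() {
        return Ok(needed);
    }
    if out.len() < needed {
        return Err(ListError::Range);
    }
    let mut pos = 0;
    for n in names {
        out[pos..pos + n.len()].copy_from_slice(n.as_bytes());
        out[pos + n.len()] = 0;
        pos += n.len() + 1;
    }
    Ok(needed)
}

/// Split a returned list buffer into the individual attribute names.
fn split_names(buf: &[u8]) -> Vec<&[u8]> {
    buf.split(|&b| b == 0).filter(|s| !s.is_empty()).collect()
}
```

A caller typically probes with a zero-length buffer, allocates the reported size, and calls again; because the attribute set can change between the two calls, the second call may still fail with ERANGE and should be retried.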
14.16.2 XattrFlags¶
bitflags! {
/// Flags for setxattr / lsetxattr / fsetxattr. Matches Linux XATTR_CREATE
/// and XATTR_REPLACE from <linux/xattr.h>.
// kernel-internal, not KABI
#[repr(C)]
pub struct XattrFlags: u32 {
/// Fail with EEXIST if the attribute already exists.
const CREATE = 0x1;
/// Fail with ENODATA if the attribute does not exist.
const REPLACE = 0x2;
// 0 (no flags) = create or replace unconditionally.
}
}
Setting both CREATE | REPLACE simultaneously is invalid and returns EINVAL.
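The flag rules can be written as a single pre-check that runs before the filesystem is asked to store the value. This is a sketch under assumptions: `check_set_flags` and `SetFlagError` are hypothetical names, the `exists` argument stands in for a lookup the filesystem would perform, and rejecting unknown flag bits with the EINVAL equivalent is an assumption modelled on Linux rather than something this section states.

```rust
const XATTR_CREATE: u32 = 0x1;
const XATTR_REPLACE: u32 = 0x2;

/// Hypothetical error type: Inval = EINVAL, Exist = EEXIST, NoData = ENODATA.
#[derive(Debug, PartialEq)]
enum SetFlagError {
    Inval,
    Exist,
    NoData,
}

/// Validate setxattr flags against whether the attribute already exists.
fn check_set_flags(flags: u32, exists: bool) -> Result<(), SetFlagError> {
    // Assumption: bits other than CREATE/REPLACE are rejected.
    if (flags & !(XATTR_CREATE | XATTR_REPLACE)) != 0 {
        return Err(SetFlagError::Inval);
    }
    // CREATE and REPLACE are mutually exclusive.
    if flags == (XATTR_CREATE | XATTR_REPLACE) {
        return Err(SetFlagError::Inval);
    }
    if (flags & XATTR_CREATE) != 0 && exists {
        return Err(SetFlagError::Exist);
    }
    if (flags & XATTR_REPLACE) != 0 && !exists {
        return Err(SetFlagError::NoData);
    }
    // flags == 0: create or replace unconditionally.
    Ok(())
}
```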
14.16.3 Namespace Prefixes¶
Extended attribute names are partitioned into four namespaces by their prefix string. Each namespace has independent permission semantics:
14.16.3.1 user.*¶
User-defined attributes. No capability required.
| Operation | Requirement |
|---|---|
| Get | Read permission on the file |
| Set | Write permission on the file |
Inode type restriction: user.* xattrs are permitted only on regular files and
directories. Attempts to set user.* on symlinks, device nodes, pipes, or sockets
return EPERM. Rationale: symlinks must be transparent (a symlink's xattrs should not
be confused with those of its target); device node xattrs would create ambiguity between
the device file and the underlying device.
14.16.3.2 trusted.*¶
Trusted attributes for kernel subsystems and privileged daemons.
| Operation | Requirement |
|---|---|
| Get | CAP_SYS_ADMIN |
| Set | CAP_SYS_ADMIN |
Stored on disk and persistent across reboots. Examples:
- trusted.overlay.opaque: overlayfs opaque directory marker (Section 14.8)
- trusted.overlay.redirect: overlayfs rename redirect
14.16.3.3 security.*¶
Security labels written by LSMs and integrity subsystems.
| Operation | Requirement |
|---|---|
| Set | Delegated to LSM hooks. Default (commoncap): CAP_SYS_ADMIN for generic security.* attributes; security.capability requires CAP_SETFCAP (checked in cap_convert_nscap()). SELinux/AppArmor may impose additional type enforcement rules via lsm_call_inode_security(Setxattr, ...). |
| Get | Varies by LSM; SELinux allows read by any process with appropriate type enforcement |
Examples:
- security.selinux: SELinux security context (Section 9.8)
- security.ima: IMA measurement hash (Section 9.5)
- security.capability: file capabilities (VFS_CAP_REVISION_3) (Section 9.9)
- security.evm: EVM HMAC over protected xattrs (Section 9.5)
14.16.3.4 system.*¶
System attributes for kernel-managed metadata. Two attributes are defined:
- system.posix_acl_access: POSIX access ACL (Section 9.2)
- system.posix_acl_default: POSIX default ACL (directories only)
Permission model: read follows normal file permission checks; set requires write
permission plus ownership (uid == i_uid) or CAP_FOWNER.
14.16.4 Size Limits¶
/// Maximum length of an extended attribute name, including the namespace prefix
/// (e.g., "user." is 5 bytes of the 255-byte budget). Matches Linux XATTR_NAME_MAX.
pub const XATTR_NAME_MAX: usize = 255;
/// Maximum size of an extended attribute value in bytes (64 KiB).
/// Matches Linux XATTR_SIZE_MAX.
pub const XATTR_SIZE_MAX: usize = 65536;
/// Maximum total size of a listxattr() output buffer in bytes (64 KiB).
/// Matches Linux XATTR_LIST_MAX.
pub const XATTR_LIST_MAX: usize = 65536;
These are hard limits enforced by the VFS layer before dispatching to filesystem code.
Individual filesystems may impose smaller limits (e.g., ext4 inline xattr space is
limited by the inode size minus i_extra_isize).
14.16.5 VFS Dispatch Pipeline¶
All xattr syscalls route through InodeOps methods defined in
Section 14.1. The VFS layer performs namespace permission checks
and LSM hooks before dispatching to the filesystem:
- Parse namespace prefix: extract "user.", "trusted.", "security.", or "system." from the attribute name. Unknown prefixes return EOPNOTSUPP.
- Validate name length: reject if name.len() > XATTR_NAME_MAX.
- Validate value size: reject if value.len() > XATTR_SIZE_MAX (set operations).
- Check namespace permissions: verify the caller holds the required capability for the namespace (see tables above). Check inode type restriction for user.*.
- Call LSM hooks: lsm_call_inode_security(Setxattr | Getxattr | Removexattr | Listxattr, ...) (Section 9.8). LSMs may deny the operation (e.g., SELinux type enforcement) or intercept security.* writes.
- Dispatch to filesystem: call the InodeOps::getxattr / setxattr / listxattr / removexattr method on the filesystem driver.
- EVM re-computation (set/remove of security.* xattrs only): after the filesystem write succeeds, trigger EVM HMAC re-computation (Section 9.5).
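The first three pipeline steps (prefix parsing, name length, value size) need no kernel state and can be sketched as a pure pre-check. The names `Namespace`, `PrecheckError`, `parse_namespace`, and `precheck_set` are hypothetical, and the error variants are deliberately descriptive rather than errno values, because this section does not say which errno the length checks return.

```rust
const XATTR_NAME_MAX: usize = 255;
const XATTR_SIZE_MAX: usize = 65536;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Namespace {
    User,
    Trusted,
    Security,
    System,
}

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
enum PrecheckError {
    UnknownNamespace, // EOPNOTSUPP in the pipeline above
    NameTooLong,
    ValueTooLarge,
}

const PREFIXES: [(&str, Namespace); 4] = [
    ("user.", Namespace::User),
    ("trusted.", Namespace::Trusted),
    ("security.", Namespace::Security),
    ("system.", Namespace::System),
];

/// Step 1: map the attribute name to its namespace by prefix.
fn parse_namespace(name: &str) -> Result<Namespace, PrecheckError> {
    PREFIXES
        .iter()
        .find(|(prefix, _)| name.starts_with(*prefix))
        .map(|&(_, ns)| ns)
        .ok_or(PrecheckError::UnknownNamespace)
}

/// Steps 1 to 3 for a set operation, in pipeline order.
fn precheck_set(name: &str, value_len: usize) -> Result<Namespace, PrecheckError> {
    let ns = parse_namespace(name)?;
    if name.len() > XATTR_NAME_MAX {
        return Err(PrecheckError::NameTooLong);
    }
    if value_len > XATTR_SIZE_MAX {
        return Err(PrecheckError::ValueTooLarge);
    }
    Ok(ns)
}
```

The returned namespace would then drive the permission check and LSM hook in the later steps, which depend on credentials and inode state and are not modelled here.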
14.16.6 Per-Filesystem Storage¶
Each filesystem implements xattr storage according to its on-disk format. The VFS layer
is agnostic to the storage mechanism; it delegates entirely to InodeOps.
| Filesystem | Storage mechanism | Inline capacity | Overflow strategy |
|---|---|---|---|
| ext4 | Inode body (after i_extra_isize) or external xattr block | ~100 bytes (256-byte inode default) | Separate 4 KiB block, shared across inodes via block refcount |
| XFS | Inode attribute fork (shortform, leaf, or B-tree) | ~256 bytes (shortform) | B-tree of 4 KiB attr leaf blocks |
| Btrfs | Xattr items in the filesystem B-tree (same tree as data extent refs) | ~3900 bytes (single leaf item) | Additional B-tree items (no single-xattr limit, tree grows) |
| tmpfs | XArray per-inode, keyed by FNV-1a hash of xattr name | Memory-only, no disk limit | Bounded by tmpfs size limit and system memory |
| ZFS | System Attributes (SA) in dnode bonus buffer or ZAP objects | ~48 KiB (bonus buffer) | Fat ZAP (on-disk hash table) |
tmpfs xattr storage: tmpfs has no backing disk, so xattrs are stored in memory.
Each inode with xattrs carries an XArray<TmpfsXattrEntry> keyed by fnv1a(name) as u64
with open-addressing collision resolution (same triangular probing scheme as
Section 14.18). On collision, the probe
sequence h, h+1, h+3, h+6, … is followed, comparing TmpfsXattrEntry.name at each
occupied slot. This gives O(1) lookup in the common case (no collisions) and O(k) in
the worst case, where k is the number of entries that collide on a given hash.
/// tmpfs xattr entry. Stored in the per-inode XArray.
pub struct TmpfsXattrEntry {
/// Full attribute name including namespace prefix (e.g., "user.mime_type").
/// Heap-allocated because xattr names are variable-length.
pub name: Box<[u8]>,
/// Attribute value. Heap-allocated, up to XATTR_SIZE_MAX bytes.
pub value: Box<[u8]>,
}
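The triangular probing scheme described for tmpfs xattrs can be sketched as follows. This is a user-space illustration under several assumptions: a `BTreeMap<u64, Entry>` stands in for the sparse XArray, the hash function is injectable so that collisions can be forced in a test, and removal is left out because deleting from an open-addressed table needs tombstones or re-insertion, which this section does not specify. The FNV-1a constants are the standard 64-bit offset basis and prime.

```rust
use std::collections::BTreeMap;

/// Standard 64-bit FNV-1a.
fn fnv1a(name: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // offset basis
    for &b in name {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

struct Entry {
    name: Vec<u8>,
    value: Vec<u8>,
}

/// Hypothetical per-inode table; `slots` models the sparse XArray.
struct XattrTable {
    slots: BTreeMap<u64, Entry>,
    hash: fn(&[u8]) -> u64,
}

impl XattrTable {
    fn new(hash: fn(&[u8]) -> u64) -> Self {
        XattrTable { slots: BTreeMap::new(), hash }
    }

    /// i-th probe position: h, h+1, h+3, h+6, ... (triangular numbers).
    fn probe(h: u64, i: u64) -> u64 {
        h.wrapping_add(i * (i + 1) / 2)
    }

    fn set(&mut self, name: &[u8], value: &[u8]) {
        let h = (self.hash)(name);
        let mut i = 0u64;
        loop {
            let key = Self::probe(h, i);
            match self.slots.get_mut(&key) {
                // Same name: replace the value in place.
                Some(e) if e.name.as_slice() == name => {
                    e.value = value.to_vec();
                    return;
                }
                // Occupied by a different name: try the next probe.
                Some(_) => i += 1,
                // Free slot: claim it.
                None => {
                    self.slots.insert(key, Entry { name: name.to_vec(), value: value.to_vec() });
                    return;
                }
            }
        }
    }

    fn get(&self, name: &[u8]) -> Option<&[u8]> {
        let h = (self.hash)(name);
        let mut i = 0u64;
        loop {
            match self.slots.get(&Self::probe(h, i)) {
                Some(e) if e.name.as_slice() == name => return Some(e.value.as_slice()),
                Some(_) => i += 1,
                // An empty slot ends the probe sequence: not present.
                None => return None,
            }
        }
    }
}
```

In normal use the table would be built with `XattrTable::new(fnv1a)`. Passing a constant hash instead forces every name onto the same probe sequence, which makes the h, h+1, h+3 slot pattern directly observable.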
14.16.7 POSIX ACL Wire Format¶
The POSIX draft ACL (Section 9.2)
is stored on disk as the value of system.posix_acl_access (access ACL) and
system.posix_acl_default (default ACL, directories only). The wire format is
identical to Linux <linux/posix_acl_xattr.h>:
/// POSIX ACL xattr header. Appears once at the start of the xattr value.
/// All fields are little-endian on disk (Le32/Le16 wrappers enforce
/// explicit conversion at read/write boundaries).
#[repr(C, packed)]
pub struct PosixAclXattrHeader {
/// ACL format version. Must be POSIX_ACL_XATTR_VERSION (0x0002).
pub version: Le32,
}
// Packed layout: 4 bytes.
const_assert!(size_of::<PosixAclXattrHeader>() == 4);
/// POSIX_ACL_XATTR_VERSION — the only version defined by the POSIX draft standard.
pub const POSIX_ACL_XATTR_VERSION: u32 = 0x0002;
/// A single ACL entry. Follows the header; repeated N times.
/// All fields are little-endian on disk.
#[repr(C, packed)]
pub struct PosixAclXattrEntry {
/// ACL entry tag identifying the entry type.
pub tag: Le16,
/// Permission bits: ACL_READ (0x04) | ACL_WRITE (0x02) | ACL_EXECUTE (0x01).
pub perm: Le16,
/// Qualifier: uid for ACL_USER, gid for ACL_GROUP.
/// ACL_UNDEFINED_ID (0xFFFFFFFF) for USER_OBJ, GROUP_OBJ, MASK, OTHER.
pub id: Le32,
}
// Packed layout: 2 + 2 + 4 = 8 bytes.
const_assert!(size_of::<PosixAclXattrEntry>() == 8);
/// ACL entry tag values.
pub const ACL_USER_OBJ: u16 = 0x01;
pub const ACL_USER: u16 = 0x02;
pub const ACL_GROUP_OBJ: u16 = 0x04;
pub const ACL_GROUP: u16 = 0x08;
pub const ACL_MASK: u16 = 0x10;
pub const ACL_OTHER: u16 = 0x20;
/// Sentinel value for entries that do not reference a specific uid/gid.
pub const ACL_UNDEFINED_ID: u32 = 0xFFFF_FFFF;
Wire layout: 4-byte header followed by N 8-byte entries. Total size =
4 + 8 * N bytes.
Minimum ACL: 3 entries (USER_OBJ, GROUP_OBJ, OTHER) = 28 bytes. This
is the "minimal ACL" equivalent to standard POSIX mode bits.
Extended ACL: When named users or groups are present, a MASK entry is
mandatory. The mask defines the maximum permissions for ACL_USER, ACL_GROUP,
and ACL_GROUP_OBJ entries (the "effective permissions" are entry.perm & mask.perm).
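The wire layout (4-byte header, then N 8-byte little-endian entries) can be exercised with a small encoder and decoder. This is a sketch rather than the kernel's parser: `AclEntry`, `encode_acl`, and `decode_acl` are hypothetical names, plain integers replace the `Le16`/`Le32` wrappers, and the decoder checks only the version and length, not ACL validity (ordering, required entries, duplicate qualifiers).

```rust
const POSIX_ACL_XATTR_VERSION: u32 = 0x0002;
const ACL_USER_OBJ: u16 = 0x01;
const ACL_GROUP_OBJ: u16 = 0x04;
const ACL_OTHER: u16 = 0x20;
const ACL_UNDEFINED_ID: u32 = 0xFFFF_FFFF;

/// In-memory form of one ACL entry (host-endian).
#[derive(Clone, Copy, Debug, PartialEq)]
struct AclEntry {
    tag: u16,
    perm: u16,
    id: u32,
}

/// Serialize to the xattr value: version header plus 8 bytes per entry.
fn encode_acl(entries: &[AclEntry]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + 8 * entries.len());
    out.extend_from_slice(&POSIX_ACL_XATTR_VERSION.to_le_bytes());
    for e in entries {
        out.extend_from_slice(&e.tag.to_le_bytes());
        out.extend_from_slice(&e.perm.to_le_bytes());
        out.extend_from_slice(&e.id.to_le_bytes());
    }
    out
}

/// Parse an xattr value; returns None on a bad version or truncated input.
fn decode_acl(buf: &[u8]) -> Option<Vec<AclEntry>> {
    if buf.len() < 4 || (buf.len() - 4) % 8 != 0 {
        return None;
    }
    let version = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    if version != POSIX_ACL_XATTR_VERSION {
        return None;
    }
    Some(
        buf[4..]
            .chunks_exact(8)
            .map(|c| AclEntry {
                tag: u16::from_le_bytes([c[0], c[1]]),
                perm: u16::from_le_bytes([c[2], c[3]]),
                id: u32::from_le_bytes([c[4], c[5], c[6], c[7]]),
            })
            .collect(),
    )
}
```

Encoding the minimal three-entry ACL produces exactly 4 + 8 × 3 = 28 bytes, matching the size stated above.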
14.16.8 chmod / ACL Mask Interaction¶
When chmod() is called on a file that has a POSIX access ACL, the ACL must be
updated to reflect the new mode bits. The POSIX draft standard defines this mapping:
| Mode bits | ACL entry updated |
|---|---|
| Owner bits (mode >> 6) & 0o7 | ACL_USER_OBJ.perm |
| Group bits (mode >> 3) & 0o7 | ACL_MASK.perm (NOT ACL_GROUP_OBJ) |
| Other bits (mode) & 0o7 | ACL_OTHER.perm |
The group bits of the file mode always correspond to ACL_MASK, not ACL_GROUP_OBJ.
This is a common source of confusion but is required by POSIX.1e: the mask entry is
the upper bound on group-class permissions, and ls -l displays the mask as the group
permission bits.
Conversely, when an ACL is set via setxattr("system.posix_acl_access", ...), the
file's mode bits are updated to match: owner bits from ACL_USER_OBJ.perm, group
bits from ACL_MASK.perm, other bits from ACL_OTHER.perm.
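The two directions of this mapping can be sketched as a pair of small functions. This is an illustrative sketch only: `AclEntry`, `acl_apply_chmod`, and `mode_from_acl` are hypothetical names, and the fallback to ACL_GROUP_OBJ when the ACL has no MASK entry (a minimal ACL) is an assumption taken from Linux behavior rather than from the text above.

```rust
pub const ACL_USER_OBJ: u16 = 0x01;
pub const ACL_USER: u16 = 0x02;
pub const ACL_GROUP_OBJ: u16 = 0x04;
pub const ACL_MASK: u16 = 0x10;
pub const ACL_OTHER: u16 = 0x20;

/// Host-endian ACL entry (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AclEntry {
    pub tag: u16,
    pub perm: u16,
    pub id: u32,
}

/// chmod direction: push the new mode bits into the access ACL.
pub fn acl_apply_chmod(acl: &mut [AclEntry], mode: u32) {
    let owner = ((mode >> 6) & 0o7) as u16;
    let group = ((mode >> 3) & 0o7) as u16;
    let other = (mode & 0o7) as u16;
    let has_mask = acl.iter().any(|e| e.tag == ACL_MASK);
    for e in acl.iter_mut() {
        match e.tag {
            ACL_USER_OBJ => e.perm = owner,
            // Group bits land on the mask, not on GROUP_OBJ.
            ACL_MASK => e.perm = group,
            // Assumption: a minimal ACL has no mask, so GROUP_OBJ stands in.
            ACL_GROUP_OBJ if !has_mask => e.perm = group,
            ACL_OTHER => e.perm = other,
            _ => {}
        }
    }
}

/// setxattr direction: derive the permission bits from the access ACL.
pub fn mode_from_acl(acl: &[AclEntry]) -> u32 {
    let perm_of = |tag: u16| acl.iter().find(|e| e.tag == tag).map(|e| e.perm as u32);
    let owner = perm_of(ACL_USER_OBJ).unwrap_or(0);
    let group = perm_of(ACL_MASK)
        .or_else(|| perm_of(ACL_GROUP_OBJ))
        .unwrap_or(0);
    let other = perm_of(ACL_OTHER).unwrap_or(0);
    (owner << 6) | (group << 3) | other
}
```

After `chmod 0640` on an extended ACL, the GROUP_OBJ entry keeps its stored permissions; only the mask shrinks, which is what limits the effective rights of named users and groups.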
14.16.9 Default ACL Inheritance¶
When creating a new inode in a directory that has system.posix_acl_default set:
New file creation:
1. The directory's default ACL becomes the new file's access ACL.
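2. The ACL_USER_OBJ, ACL_MASK, and ACL_OTHER entries are ANDed with the owner,
group, and other bits of the requested creation mode (if no ACL_MASK is present,
ACL_GROUP_OBJ takes its place). As on Linux, the umask is not applied when a
default ACL exists.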
3. The file's mode bits are set from the resulting ACL (owner from USER_OBJ,
group from MASK, other from OTHER).
4. The new file does NOT receive a default ACL (only directories inherit defaults).
New directory creation:
1. Same as file creation for the access ACL.
2. Additionally, the parent's default ACL is copied as the new directory's own
default ACL, ensuring recursive inheritance for all future children.
No default ACL: If the parent directory has no system.posix_acl_default xattr,
standard umask-based permission inheritance applies and no ACL is created on the
new inode.
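The new-file steps can be sketched as one pass over the parent's default ACL. This is a hedged illustration, not the VFS code: `AclEntry` and `inherit_access_acl` are hypothetical names, and the sketch assumes Linux-compatible masking, where each of USER_OBJ, MASK (or GROUP_OBJ when no mask exists), and OTHER is ANDed with the matching bits of the mode supplied for the create call.

```rust
pub const ACL_USER_OBJ: u16 = 0x01;
pub const ACL_USER: u16 = 0x02;
pub const ACL_GROUP_OBJ: u16 = 0x04;
pub const ACL_MASK: u16 = 0x10;
pub const ACL_OTHER: u16 = 0x20;

/// Host-endian ACL entry (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AclEntry {
    pub tag: u16,
    pub perm: u16,
    pub id: u32,
}

/// Derive a new file's access ACL and permission bits from the parent's
/// default ACL and the creation mode. Returns `(access_acl, mode_bits)`.
pub fn inherit_access_acl(default_acl: &[AclEntry], create_mode: u32) -> (Vec<AclEntry>, u32) {
    let owner = ((create_mode >> 6) & 0o7) as u16;
    let group = ((create_mode >> 3) & 0o7) as u16;
    let other = (create_mode & 0o7) as u16;
    let has_mask = default_acl.iter().any(|e| e.tag == ACL_MASK);

    // Step 1: the default ACL becomes the access ACL.
    let mut acl = default_acl.to_vec();
    let mut mode = 0u32;
    for e in acl.iter_mut() {
        match e.tag {
            // Steps 2 and 3: restrict by the creation mode, then read the
            // resulting permission bits back into the file mode.
            ACL_USER_OBJ => {
                e.perm &= owner;
                mode |= (e.perm as u32) << 6;
            }
            ACL_MASK => {
                e.perm &= group;
                mode |= (e.perm as u32) << 3;
            }
            ACL_GROUP_OBJ if !has_mask => {
                e.perm &= group;
                mode |= (e.perm as u32) << 3;
            }
            ACL_OTHER => {
                e.perm &= other;
                mode |= e.perm as u32;
            }
            // Named entries and a masked GROUP_OBJ are inherited unchanged.
            _ => {}
        }
    }
    (acl, mode)
}
```

Step 4 needs no code here: the caller simply does not attach a default ACL to a non-directory inode.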
14.16.10 EVM Integration¶
EVM (Extended Verification Module) protects security-critical xattrs against offline tampering (Section 9.5).
Protected xattr set: security.selinux, security.ima, security.capability,
and any other security.* xattr registered with EVM at boot.
Flow on security.* xattr modification:
1. VFS calls InodeOps::setxattr() to persist the new value.
2. On success, VFS acquires the per-inode evm_lock (spinlock).
3. VFS concatenates the inode number and all protected xattr values in a
deterministic order.
4. VFS computes HMAC-SHA3-256 over the concatenation using the boot-derived
EVM key.
5. VFS writes the resulting HMAC as the value of security.evm.
6. VFS releases evm_lock.
Lock ordering: The evm_lock is acquired AFTER the inode's i_rwsem (which
protects xattr storage). The ordering is: i_rwsem (exclusive, acquired by
setxattr() VFS path) → evm_lock (spinlock, acquired in step 2). Reversing
this order risks deadlock. evm_lock protects only the HMAC recomputation
(steps 3-5), not the underlying xattr storage write (step 1). Concurrent
getxattr("security.evm") reads do NOT acquire evm_lock — they read the
stored HMAC value directly. This is safe because setxattr holds i_rwsem
exclusive, which prevents concurrent setxattr (but not concurrent getxattr,
which takes i_rwsem shared). A getxattr concurrent with step 5 may see
either the old or new HMAC — both are valid (the old HMAC matches the old
xattr set; the new HMAC matches the new xattr set).
Appraisal on file open: When a file is opened, EVM re-computes the HMAC and
compares it against the stored security.evm value. Mismatch returns EINTEGRITY
(or EPERM when evm_mode is set to enforce).
14.16.11 Performance Budget¶
| Operation | Path class | Typical cost | Notes |
|---|---|---|---|
| getxattr (inline, ext4) | Warm | ~500 ns | Inode already in page cache; inline scan of i_extra_isize region |
| getxattr (external block, ext4) | Cold | ~5 us | Requires reading the shared xattr block from disk |
| setxattr (inline, ext4) | Warm | ~1 us | Journal transaction for inode update |
| listxattr | Cold | ~2 us | Iterates all xattr entries in inode + overflow |
| LSM hook overhead per xattr op | Hot | ~20 ns | Static dispatch through LSM hook array (Section 9.8) |
| EVM HMAC re-computation | Warm | ~3 us | HMAC-SHA3-256 over protected xattr set; dominated by hash computation |
| tmpfs getxattr (XArray lookup) | Warm | ~100 ns | In-memory XArray traversal, no disk I/O |
Hot-path note: xattr operations are not on the per-packet or per-syscall hot path.
The most frequent xattr consumer is LSM label checks during file_open, which cache
the resolved label in the inode's LSM blob and do not re-read the xattr on every access.
The performance budget above reflects the actual xattr syscall cost, not the cached
LSM check cost (which is ~5 ns via the blob pointer).
14.17 Pipes and FIFOs¶
Pipes (pipe(2), pipe2(2)) are anonymous unidirectional byte streams; named
FIFOs (mkfifo(2)) expose the same stream through a filesystem path. They are
the oldest and most widely used IPC primitive in UNIX.
14.17.1 Pipe Data Buffer¶
The pipe data buffer uses the page-array model defined in
Section 17.3. Each pipe holds an array of page references
(PipePage), supporting partial-page writes for small messages and zero-copy
page gifting for vmsplice(SPLICE_F_GIFT). The PipeBuffer struct provides:
- Default 65536 bytes (16 pages), matching Linux default
- Lock-free single-writer fast path with active_writer CAS
- Mutex-protected multi-writer slow path for POSIX atomicity
- Inline storage for the common case (16 pages), heap fallback for fcntl(F_SETPIPE_SZ) beyond 64 KiB
- Seqlock-based resize safety for concurrent fcntl(F_SETPIPE_SZ)
Key PipeBuffer fields (summary; full definition in Section 17.3):
- pages: ArrayVec<PipePage, 16> — page ring (inline for default 64 KiB)
- r_idx: u32, w_idx: u32 — read/write indices into the page ring
- capacity: u32 — current capacity in bytes
- active_writer: AtomicU64 — CAS-based single-writer detection
See Section 17.3 for the complete struct definition, write/read algorithms, memory ordering rationale, and resize protocol.
14.17.2 Capacity and fcntl(F_SETPIPE_SZ)¶
Default pipe capacity: 65536 bytes (matches Linux default).
fcntl(F_SETPIPE_SZ, size) resizes the pipe:
- Rounds up to the next power of 2 (minimum 4096 bytes)
- Maximum: /proc/sys/fs/pipe-max-size (default 1MB, same as Linux)
- Requires CAP_SYS_RESOURCE to exceed /proc/sys/fs/pipe-max-size
- Data currently in the pipe is preserved (pages migrated to new array)
- If the new size is smaller than current content: EBUSY
fcntl(F_GETPIPE_SZ) returns the current capacity.
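The resize rules above reduce to a small validation function. The sketch below is illustrative: `pipe_resize_target` and `PipeResizeError` are hypothetical names, and returning EPERM for an unprivileged request above the limit is an assumption borrowed from Linux (the rules above state only that CAP_SYS_RESOURCE is required).

```rust
pub const PAGE_SIZE: u32 = 4096;
/// Default value of /proc/sys/fs/pipe-max-size (1 MiB).
pub const DEFAULT_PIPE_MAX_SIZE: u32 = 1 << 20;

#[derive(Debug, PartialEq, Eq)]
pub enum PipeResizeError {
    /// EPERM: above pipe-max-size without CAP_SYS_RESOURCE (assumed errno).
    Perm,
    /// EBUSY: new capacity would not hold the data already in the pipe.
    Busy,
    /// EINVAL: the rounded size does not fit in a u32.
    Invalid,
}

/// Compute the capacity that F_SETPIPE_SZ would install, or the error it
/// would return. Pure function: no pipe state is modified.
pub fn pipe_resize_target(
    requested: u32,
    bytes_in_pipe: u32,
    pipe_max_size: u32,
    has_cap_sys_resource: bool,
) -> Result<u32, PipeResizeError> {
    // Round up to the next power of two, with a floor of one page.
    let rounded = requested
        .max(PAGE_SIZE)
        .checked_next_power_of_two()
        .ok_or(PipeResizeError::Invalid)?;
    if rounded > pipe_max_size && !has_cap_sys_resource {
        return Err(PipeResizeError::Perm);
    }
    if rounded < bytes_in_pipe {
        return Err(PipeResizeError::Busy);
    }
    Ok(rounded)
}
```

A request for 70000 bytes therefore yields a 131072-byte pipe, and the value reported by F_GETPIPE_SZ is the rounded capacity rather than the requested one.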
14.17.3 MPSC Pipes (Multiple Writers)¶
When more than one process/thread writes to the same pipe (e.g., shell
{ cmd1; cmd2; } | cmd3), the single-writer fast path cannot be used.
UmkaOS detects multiple writers via PipeBuffer.writer_count: AtomicU32:
- writer_count == 1: lock-free single-writer fast path (CAS on active_writer)
- writer_count > 1: writer acquires PipeBuffer.ring_lock: Mutex<()> before writing
Writes <= PIPE_BUF (4096 bytes) are always atomic (no interleaving with
other writers) -- same guarantee as POSIX.
14.17.4 O_DIRECT Pipe Mode¶
pipe2(O_DIRECT): each write() is a discrete message; read() returns
exactly one message. Implemented by prepending a 4-byte length header before
each message in the page array:
/// O_DIRECT pipe message header (4 bytes, native endian).
/// Followed immediately by `len` bytes of payload.
/// Alignment: none required (page data is byte-addressable).
/// packed is defensive: ensures no trailing padding if fields are added later.
#[repr(C, packed)]
pub struct PipeMessageHdr {
pub len: u32,
}
const_assert!(size_of::<PipeMessageHdr>() == 4);
Maximum message size: PIPE_BUF (4096 bytes) for atomic writes.
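The framing can be illustrated with a byte queue standing in for the page array. This is a user-space sketch under stated assumptions: `write_message` and `read_message` are hypothetical helpers, a `VecDeque<u8>` replaces the real page ring, and the sketch simply refuses payloads larger than PIPE_BUF, which may differ from how the kernel handles oversized writes.

```rust
use std::collections::VecDeque;

pub const PIPE_BUF: usize = 4096;
const HDR_LEN: usize = 4;

/// Append one message: 4-byte native-endian length header, then payload.
/// Returns false (and writes nothing) if the payload exceeds PIPE_BUF.
pub fn write_message(stream: &mut VecDeque<u8>, payload: &[u8]) -> bool {
    if payload.len() > PIPE_BUF {
        return false;
    }
    let hdr = (payload.len() as u32).to_ne_bytes();
    stream.extend(hdr.iter().copied());
    stream.extend(payload.iter().copied());
    true
}

/// Remove and return exactly one message, or None if no complete message
/// is buffered. Message boundaries are preserved across reads.
pub fn read_message(stream: &mut VecDeque<u8>) -> Option<Vec<u8>> {
    if stream.len() < HDR_LEN {
        return None;
    }
    // Peek at the header without consuming it yet.
    let mut hdr = [0u8; HDR_LEN];
    for (dst, src) in hdr.iter_mut().zip(stream.iter()) {
        *dst = *src;
    }
    let len = u32::from_ne_bytes(hdr) as usize;
    if stream.len() < HDR_LEN + len {
        return None;
    }
    drop(stream.drain(..HDR_LEN));
    Some(stream.drain(..len).collect())
}
```

Two back-to-back writes of 5 and 6 bytes come back as two separate reads of 5 and 6 bytes, never as one 11-byte read, which is the property that distinguishes this mode from a plain byte-stream pipe.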
14.17.5 Named FIFOs (mkfifo)¶
Named FIFOs use the same PipeBuffer struct, but with a VFS inode for
pathname lookup:
- mkfifo(path, mode): creates a VFS inode of type InodeKind::Fifo
- open(path, O_RDONLY): blocks until a writer opens (unless O_NONBLOCK)
- open(path, O_WRONLY): blocks until a reader opens (unless O_NONBLOCK)
- Once both ends are open: identical semantics to anonymous pipe
A FIFO is a VFS node (VfsNode) that, when opened, creates a reference to
an existing PipeBuffer or creates a new one. Multiple readers and writers
can open a FIFO; the reader_count and writer_count fields track
opens/closes. Writers use the multi-writer slow path when concurrent writes
are detected. When the last reader and last writer close, the buffer is freed.
14.17.6 Splice and Zero-Copy¶
UmkaOS's page-array pipe model (PipeBuffer) uses the same fundamental design
as Linux's struct pipe_buffer array: each pipe page is a reference to a
physical page with offset and length. This enables true zero-copy splice
operations via page-reference transfer.
Pipe-to-pipe splice (splice(pipe_fd_in, pipe_fd_out)): Transfers page
references from the source pipe to the destination pipe. The source pipe's
PipePage entries are moved (not copied) to the destination, incrementing the
underlying page refcount. No data copy occurs -- only metadata (page pointer,
offset, length) is transferred. This matches Linux's pipe-to-pipe splice
semantics exactly.
File-to-pipe splice (splice(file_fd, pipe_fd)): The filesystem's
FileOps::splice_read populates pipe pages directly from page cache pages,
transferring page references (incrementing refcount) without copying data.
The pipe's PipePage entries point directly into the page cache.
Pipe-to-socket splice (splice(pipe_fd, socket_fd)): The network stack
receives page references from the pipe and uses scatter-gather DMA to transmit
directly from the pipe's pages (zero kernel-side copy). Each PipePage
maps to a scatter-gather entry for the NIC.
Pipe-to-file splice (splice(pipe_fd, file_fd)): The filesystem's
FileOps::splice_write transfers page references from the pipe into the
page cache (for filesystems that support it) or copies data from pipe pages
to page cache pages.
vmsplice zero-copy (vmsplice(pipe_fd, iov, SPLICE_F_GIFT)): When
SPLICE_F_GIFT is set, the user pages described by the iovec are unmapped
from the sender's address space and gifted to the pipe as PipePage entries
with is_gifted == true. The reader can then access the data without any
copy. Without SPLICE_F_GIFT, data is copied from user pages into pipe pages.
14.17.7 Linux Compatibility¶
- Default capacity 65536 bytes: identical to Linux
F_SETPIPE_SZ/F_GETPIPE_SZ: identical semanticsPIPE_BUF= 4096 bytes: POSIX required, identicalO_DIRECTpipe mode: identical to Linux 3.4+pipe2(O_CLOEXEC | O_NONBLOCK | O_DIRECT): all flags supported- Splice semantics: identical to Linux (page-reference transfer)
/proc/sys/fs/pipe-max-size: identical default (1MB), same permission model- Signal on broken pipe:
SIGPIPE+EPIPEon write to pipe with no readers select()/poll()/epoll():EPOLLINwhen data available,EPOLLOUTwhen space available,EPOLLHUPon last writer close
14.18 Pseudo-Filesystems¶
Pseudo-filesystems are RAM-resident virtual filesystems that expose kernel state to userspace. Unlike disk-backed filesystems, they have no persistent storage — all content is generated dynamically by the kernel on read and consumed on write. UmkaOS provides a common registration framework and six standard pseudo-filesystems required for Linux compatibility.
procfs and sysfs are specified in Section 14.19. tmpfs and devtmpfs are built into the VFS core (Section 14.1). cgroupfs is specified in Section 17.2, and configfs in Section 14.12. This section covers the remaining pseudo-filesystems needed for complete Linux workload support: debugfs, tracefs, hugetlbfs, bpffs, securityfs, and efivarfs.
14.18.1 Common Registration Framework¶
All pseudo-filesystems share a uniform registration path through the VFS layer
(Section 14.1). Each pseudo-fs defines a static PseudoFsType
descriptor and calls register_filesystem() during kernel init
(Section 2.3).
/// Registration descriptor for a pseudo-filesystem type.
///
/// Each pseudo-fs defines exactly one static instance. The VFS layer stores
/// registered types in an XArray keyed by a monotonic u64 registration
/// sequence number. Name-based lookup (mount -t <name>) walks the XArray
/// linearly — acceptable because the total number of filesystem types is
/// small (<50) and registration/mount are cold-path operations.
pub struct PseudoFsType {
/// Filesystem type name as it appears in mount(2) and /proc/filesystems.
/// e.g., "debugfs", "tracefs", "hugetlbfs", "bpf", "securityfs", "efivarfs".
pub name: &'static str,
/// Filesystem flags. Pseudo-filesystems typically set NODEV | NOEXEC | NOSUID
/// to prevent device node creation, executable mapping, and setuid escalation.
pub fs_flags: FsFlags,
/// Filesystem magic number returned by statfs(2). Each pseudo-fs has a
/// unique magic defined by the Linux UAPI (include/uapi/linux/magic.h).
pub magic: u32,
/// Populate the superblock and create the root inode. Called once per mount.
/// The implementation creates the root directory inode with appropriate mode
/// and ownership, then populates any initial directory structure.
pub fill_super: fn(&mut SuperBlock) -> Result<(), VfsError>,
/// Accepted mount options. The VFS parses `-o key=value` pairs from mount(2)
/// and validates them against this table before calling `fill_super`.
pub mount_opts: &'static [MountOptDesc],
}
bitflags! {
/// Filesystem-level flags applied at mount time.
pub struct FsFlags: u32 {
/// No device special files may be created or accessed on this filesystem.
const NODEV = 1 << 0;
/// No files on this filesystem may be executed (mmap PROT_EXEC denied).
const NOEXEC = 1 << 1;
/// Setuid and setgid bits are ignored for all files on this filesystem.
const NOSUID = 1 << 2;
/// This filesystem may be mounted inside a non-initial user namespace.
/// Only set for filesystems that are safe for unprivileged mounting
/// (e.g., tmpfs, procfs subset). None of the six pseudo-filesystems
/// in this section set this flag — they all require initial namespace.
const USERNS_MOUNT = 1 << 3;
}
}
/// Descriptor for a single mount option accepted by a pseudo-filesystem.
pub struct MountOptDesc {
/// Option name (e.g., "pagesize", "mode", "uid").
pub name: &'static str,
/// Type of the option value.
pub kind: MountOptKind,
}
/// Value type for a mount option.
pub enum MountOptKind {
/// Boolean flag (present = true, absent = false). Example: "noexec".
Flag,
/// Unsigned 64-bit integer. Example: "size=1073741824".
U64,
/// Unsigned 32-bit integer. Example: "uid=1000", "mode=0700".
U32,
/// String value. Example: (currently unused by pseudo-fs, reserved).
Str,
}
Registration rejects duplicates: calling register_filesystem() with a name that is
already registered returns EBUSY. Unregistration is not supported for built-in
pseudo-filesystems (they live for the kernel's lifetime).
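The registration path can be sketched with a map keyed by a monotonic sequence number. This is a simplified illustration: `FsRegistry` and `RegError` are hypothetical names, a `BTreeMap` stands in for the kernel XArray, and `PseudoFsType` is trimmed to the two fields the sketch needs.

```rust
use std::collections::BTreeMap;

/// Trimmed-down descriptor (the real one also carries flags, fill_super,
/// and the mount option table).
pub struct PseudoFsType {
    pub name: &'static str,
    pub magic: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RegError {
    /// EBUSY: a filesystem type with this name is already registered.
    Busy,
}

pub struct FsRegistry {
    next_seq: u64,
    types: BTreeMap<u64, &'static PseudoFsType>,
}

impl FsRegistry {
    pub fn new() -> Self {
        FsRegistry { next_seq: 0, types: BTreeMap::new() }
    }

    /// Register a type; returns its sequence number.
    pub fn register_filesystem(&mut self, fs: &'static PseudoFsType) -> Result<u64, RegError> {
        if self.lookup(fs.name).is_some() {
            return Err(RegError::Busy);
        }
        let seq = self.next_seq;
        self.next_seq += 1;
        self.types.insert(seq, fs);
        Ok(seq)
    }

    /// Name lookup for `mount -t <name>`: a linear walk, which is fine
    /// for a few dozen entries on a cold path.
    pub fn lookup(&self, name: &str) -> Option<&'static PseudoFsType> {
        self.types.values().copied().find(|t| t.name == name)
    }
}

pub static DEBUGFS: PseudoFsType = PseudoFsType { name: "debugfs", magic: 0x6462_6720 };
pub static TRACEFS: PseudoFsType = PseudoFsType { name: "tracefs", magic: 0x7472_6163 };
```

Because the key is a sequence number rather than the name, iteration order matches registration order, which is a natural fit for listing types in /proc/filesystems.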
14.18.1.1 PseudoInode and File Operations¶
All pseudo-filesystems share a simplified inode representation for RAM-backed directory trees. Unlike disk-backed inodes (Section 14.1), pseudo-inodes carry no block mappings, no page cache association, and no filesystem-specific opaque data.
/// Simplified inode for RAM-backed pseudo-filesystems.
///
/// Pseudo-inodes are allocated on first access and freed when the
/// dentry is removed. They are never written to disk.
pub struct PseudoInode {
/// Inode number. Allocated from a per-superblock `AtomicU64`
/// counter starting at 2 (inode 1 is reserved for the root).
/// u64 counters never wrap within the kernel's operational
/// lifetime (50+ years at billions of allocations per second).
pub ino: u64,
/// POSIX file mode (type + permission bits).
pub mode: u32,
/// Owner UID.
pub uid: Uid,
/// Owner GID.
pub gid: Gid,
/// Access time.
pub atime: Timespec64,
/// Modification time.
pub mtime: Timespec64,
/// Status change time.
pub ctime: Timespec64,
/// Inode content, determined by the file type.
pub data: PseudoInodeData,
}
/// Content discriminant for pseudo-inodes.
pub enum PseudoInodeData {
/// Directory: children stored in `HashedXArray<DirEntry>`, a wrapper
/// that encapsulates FNV-1a hashing + triangular probing + tombstone
/// protocol on top of XArray. This is necessary because XArray's built-in
/// `xa_store(key, value)` overwrites existing entries at the same key —
/// using raw `xa_store(fnv1a(name), entry)` would silently lose entries
/// on hash collision.
///
/// `HashedXArray<V>` provides:
/// - `insert(name: &[u8], value: V) -> Result<(), Exists>`
/// - `lookup(name: &[u8]) -> Option<&V>`
/// - `remove(name: &[u8]) -> Option<V>`
///
/// Implementation: on insert, hash `h = fnv1a(name) as u64`. If the slot
/// at `h` is occupied by a different name, probe `h+1, h+3, h+6, ...`
/// (triangular probing: the k-th probe offset is k*(k+1)/2). Lookup uses
/// the same probe sequence, comparing `DirEntry.name` at each occupied slot.
/// Remove stores a tombstone sentinel (a DirEntry with empty name) so
/// the probe chain is not broken. Periodic compaction removes tombstones
/// when the tombstone ratio exceeds 25%.
///
/// Common-case (no collision): single XArray lookup, O(1). With 64-bit
/// FNV-1a, collisions are rare for directories far below the ~2^32-entry
/// birthday bound, which covers every realistic pseudo-fs directory.
Directory(HashedXArray<DirEntry>),
/// Regular file: read/write behavior defined by the `PseudoFileOps`
/// implementation provided at file creation time.
RegularFile(&'static dyn PseudoFileOps),
/// Symbolic link: target path stored inline.
Symlink(Box<[u8]>),
}
/// Directory entry within a pseudo-filesystem directory.
pub struct DirEntry {
/// Entry name (variable-length, heap-allocated).
pub name: Box<[u8]>,
/// Inode number of the target.
pub ino: u64,
/// File type for getdents64 optimization. Values match Linux UAPI
/// `include/uapi/linux/dirent.h`:
/// DT_UNKNOWN = 0, DT_FIFO = 1, DT_CHR = 2, DT_DIR = 4,
/// DT_BLK = 6, DT_REG = 8, DT_LNK = 10, DT_SOCK = 12.
pub d_type: u8,
}
/// Callback trait for pseudo-filesystem regular files.
///
/// Each file in a pseudo-filesystem implements this trait to define its
/// read/write behavior. Implementations are typically stateless — they
/// read from or write to kernel data structures referenced through the
/// inode's subsystem-specific context.
pub trait PseudoFileOps: Send + Sync {
/// Read data from this pseudo-file into `buf` starting at `offset`.
/// Returns the number of bytes written to `buf`.
fn read(
&self,
inode: &PseudoInode,
buf: &mut [u8],
offset: u64,
) -> Result<usize, Errno>;
/// Write data from `buf` to this pseudo-file at `offset`.
/// Returns the number of bytes consumed from `buf`.
fn write(
&self,
inode: &PseudoInode,
buf: &[u8],
offset: u64,
) -> Result<usize, Errno>;
/// Called when the file is opened. Optional initialization.
/// Default: no-op (returns Ok).
fn open(&self, _inode: &PseudoInode) -> Result<(), Errno> {
Ok(())
}
/// Called when the last fd referencing this file is closed.
/// Default: no-op.
fn release(&self, _inode: &PseudoInode) {}
}
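A typical implementor of this trait exposes one kernel value as text. The following user-space sketch shows the offset handling that `read` needs so that a second `read()` at end-of-file returns 0. It is an illustration only: `CounterFile` is hypothetical, `PseudoInode` and `Errno` are reduced stand-ins, the trait is repeated without its default methods, and `format!` is used for brevity where kernel code would format into a fixed buffer.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Reduced stand-ins for the types defined above.
pub struct PseudoInode {
    pub ino: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    Einval,
}

pub trait PseudoFileOps: Send + Sync {
    fn read(&self, inode: &PseudoInode, buf: &mut [u8], offset: u64) -> Result<usize, Errno>;
    fn write(&self, inode: &PseudoInode, buf: &[u8], offset: u64) -> Result<usize, Errno>;
}

/// Hypothetical pseudo-file exposing a single counter as decimal ASCII.
pub struct CounterFile {
    value: AtomicU64,
}

impl CounterFile {
    pub const fn new(initial: u64) -> Self {
        CounterFile { value: AtomicU64::new(initial) }
    }
}

impl PseudoFileOps for CounterFile {
    fn read(&self, _inode: &PseudoInode, buf: &mut [u8], offset: u64) -> Result<usize, Errno> {
        // Content is regenerated on every read; `offset` selects the tail.
        let text = format!("{}\n", self.value.load(Ordering::Relaxed));
        let bytes = text.as_bytes();
        let start = if offset > bytes.len() as u64 { bytes.len() } else { offset as usize };
        let n = (bytes.len() - start).min(buf.len());
        buf[..n].copy_from_slice(&bytes[start..start + n]);
        Ok(n)
    }

    fn write(&self, _inode: &PseudoInode, buf: &[u8], offset: u64) -> Result<usize, Errno> {
        // Only whole-value writes at offset 0 are accepted in this sketch.
        if offset != 0 {
            return Err(Errno::Einval);
        }
        let text = std::str::from_utf8(buf).map_err(|_| Errno::Einval)?;
        let parsed: u64 = text.trim().parse().map_err(|_| Errno::Einval)?;
        self.value.store(parsed, Ordering::Relaxed);
        Ok(buf.len())
    }
}
```

Because `new` is a `const fn`, an instance can live in a `static`, which is what the `&'static dyn PseudoFileOps` stored in `PseudoInodeData::RegularFile` requires.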
14.18.1.2 Helper Functions¶
Convenience functions for creating and removing entries in pseudo-filesystem directories. Used internally by debugfs, tracefs, securityfs, and bpffs.
/// Create a regular file as a child of `parent`.
///
/// Allocates a new `PseudoInode` with `PseudoInodeData::RegularFile(ops)`,
/// inserts a `DirEntry` into the parent's `HashedXArray`, and creates a VFS
/// dentry linking the two.
///
/// # Errors
/// - `EEXIST`: a child with the same name already exists.
/// - `ENOMEM`: inode or dentry allocation failed.
pub fn pseudo_create_file(
parent: &PseudoInode,
name: &[u8],
mode: u16,
ops: &'static dyn PseudoFileOps,
) -> Result<Arc<PseudoInode>, VfsError>;
/// Create a subdirectory as a child of `parent`.
///
/// Allocates a new `PseudoInode` with `PseudoInodeData::Directory(HashedXArray::new())`.
/// The new directory starts empty.
pub fn pseudo_create_dir(
parent: &PseudoInode,
name: &[u8],
mode: u16,
) -> Result<Arc<PseudoInode>, VfsError>;
/// Remove a child entry from `parent` by name.
///
/// Looks up the child in the parent's directory `HashedXArray`. If the target is a
/// directory, it must be empty (returns `ENOTEMPTY` otherwise).
///
/// **Tombstone protocol**: Because the directory uses open-addressing
/// collision resolution, naive deletion (clearing a slot to empty) would
/// break probe chains for entries inserted after the deleted entry via
/// collision. On removal, the slot is set to a tombstone sentinel
/// (`DirEntry::TOMBSTONE`) that:
/// - Is treated as "occupied" during probing (probe chains remain intact)
/// - Is skipped during name matching (not returned by lookup)
/// - Is overwritten by new inserts (reclaimed lazily)
///
/// When the directory's tombstone count exceeds 25% of occupied slots,
/// the directory is compacted: a new `HashedXArray` is allocated, all
/// live entries are inserted into it, the directory's `HashedXArray`
/// pointer is atomically swapped via `rcu_assign_pointer`, and the old
/// `HashedXArray` is freed after an RCU grace period. Concurrent RCU
/// readers always see a consistent directory (either old or new, never
/// partially rebuilt). Cold path, under the parent inode's directory lock.
pub fn pseudo_remove(
parent: &PseudoInode,
name: &[u8],
) -> Result<(), VfsError>;
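The probing and tombstone rules can be exercised in a small user-space model. This is a sketch under stated assumptions, not the kernel structure: a `BTreeMap<u64, Slot<V>>` stands in for the XArray, the hash function is injectable through a hypothetical `with_hasher` constructor so that collisions can be forced, and compaction and RCU publication are omitted.

```rust
use std::collections::BTreeMap;

/// 64-bit FNV-1a.
pub fn fnv1a(name: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in name {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

enum Slot<V> {
    Live(Box<[u8]>, V),
    Tombstone,
}

pub struct HashedXArray<V> {
    slots: BTreeMap<u64, Slot<V>>,
    hash: fn(&[u8]) -> u64,
}

impl<V> HashedXArray<V> {
    pub fn new() -> Self {
        Self::with_hasher(fnv1a)
    }

    /// Hypothetical hook used here only to force collisions.
    pub fn with_hasher(hash: fn(&[u8]) -> u64) -> Self {
        HashedXArray { slots: BTreeMap::new(), hash }
    }

    /// k-th probe index: h + k*(k+1)/2, giving h, h+1, h+3, h+6, ...
    fn probe(h: u64, k: u64) -> u64 {
        h.wrapping_add(k.wrapping_mul(k + 1) / 2)
    }

    /// Insert; on a duplicate name the value is handed back as `Err`.
    pub fn insert(&mut self, name: &[u8], value: V) -> Result<(), V> {
        let h = (self.hash)(name);
        let mut reuse: Option<u64> = None; // first tombstone on the chain
        let mut k = 0u64;
        let target = loop {
            let idx = Self::probe(h, k);
            match self.slots.get(&idx) {
                None => break reuse.unwrap_or(idx), // end of probe chain
                Some(Slot::Tombstone) => {
                    if reuse.is_none() {
                        reuse = Some(idx);
                    }
                }
                Some(Slot::Live(n, _)) if &**n == name => return Err(value),
                Some(Slot::Live(..)) => {}
            }
            k += 1;
        };
        self.slots.insert(target, Slot::Live(name.into(), value));
        Ok(())
    }

    pub fn lookup(&self, name: &[u8]) -> Option<&V> {
        let h = (self.hash)(name);
        let mut k = 0u64;
        loop {
            match self.slots.get(&Self::probe(h, k)) {
                None => return None,
                Some(Slot::Live(n, v)) if &**n == name => return Some(v),
                Some(_) => k += 1, // other name or tombstone: keep probing
            }
        }
    }

    /// Remove by replacing the slot with a tombstone so later entries on
    /// the same chain stay reachable.
    pub fn remove(&mut self, name: &[u8]) -> Option<V> {
        let h = (self.hash)(name);
        let mut k = 0u64;
        loop {
            let idx = Self::probe(h, k);
            let hit = match self.slots.get(&idx) {
                None => return None,
                Some(Slot::Live(n, _)) => &**n == name,
                Some(Slot::Tombstone) => false,
            };
            if hit {
                return match self.slots.insert(idx, Slot::Tombstone) {
                    Some(Slot::Live(_, v)) => Some(v),
                    _ => None,
                };
            }
            k += 1;
        }
    }
}
```

Note that `insert` scans to the end of the chain before reusing a tombstone; stopping at the first tombstone would allow a duplicate name to be inserted ahead of an existing live entry.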
14.18.2 debugfs — Kernel Debug Filesystem¶
Mount point: /sys/kernel/debug (mount -t debugfs debugfs /sys/kernel/debug)
Magic: 0x64626720 (DEBUGFS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID
debugfs exposes kernel-internal debugging data. It carries no stable ABI guarantee: files may appear, disappear, or change format between kernel versions. Userspace tools must handle missing files gracefully.
Access control: mounted with mode=0700 by default, restricting access to
CAP_SYS_ADMIN (Section 9.9). Distributions may remount
with mode=0755 for read-only debug access, but this is a policy decision outside
kernel scope.
14.18.2.1 debugfs Registration¶
static DEBUGFS_TYPE: PseudoFsType = PseudoFsType {
name: "debugfs",
fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
magic: 0x64626720,
fill_super: debugfs_fill_super,
mount_opts: &[
MountOptDesc { name: "uid", kind: MountOptKind::U32 },
MountOptDesc { name: "gid", kind: MountOptKind::U32 },
MountOptDesc { name: "mode", kind: MountOptKind::U32 },
],
};
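How a `mount_opts` table like the one above might be applied to an option string is sketched below. This is illustrative only: `parse_mount_opts`, `MountOptValue`, and `MountOptError` are hypothetical names, the types are re-declared so the example stands alone, and treating a leading `0` as octal and `0x` as hex (so that `mode=0700` parses as octal) is an assumption modeled on C `strtoul` with base 0.

```rust
pub enum MountOptKind {
    Flag,
    U64,
    U32,
    Str,
}

pub struct MountOptDesc {
    pub name: &'static str,
    pub kind: MountOptKind,
}

#[derive(Debug, PartialEq, Eq)]
pub enum MountOptValue {
    Flag,
    U64(u64),
    U32(u32),
    Str(String),
}

#[derive(Debug, PartialEq, Eq)]
pub enum MountOptError {
    Unknown(String),
    BadValue(String),
}

/// Parse a number with C-style base prefixes (assumed convention).
fn parse_num(s: &str) -> Option<u64> {
    if let Some(hex) = s.strip_prefix("0x").or_else(|| s.strip_prefix("0X")) {
        u64::from_str_radix(hex, 16).ok()
    } else if s.len() > 1 && s.starts_with('0') {
        u64::from_str_radix(&s[1..], 8).ok()
    } else {
        s.parse().ok()
    }
}

fn bad(key: &str) -> MountOptError {
    MountOptError::BadValue(key.to_string())
}

/// Validate a comma-separated `key=value` list against the table.
pub fn parse_mount_opts(
    table: &[MountOptDesc],
    opts: &str,
) -> Result<Vec<(&'static str, MountOptValue)>, MountOptError> {
    let mut out = Vec::new();
    for item in opts.split(',').filter(|s| !s.is_empty()) {
        let (key, val) = match item.split_once('=') {
            Some((k, v)) => (k, Some(v)),
            None => (item, None),
        };
        let desc = table
            .iter()
            .find(|d| d.name == key)
            .ok_or_else(|| MountOptError::Unknown(key.to_string()))?;
        let value = match (&desc.kind, val) {
            (MountOptKind::Flag, None) => MountOptValue::Flag,
            (MountOptKind::U64, Some(v)) => {
                MountOptValue::U64(parse_num(v).ok_or_else(|| bad(key))?)
            }
            (MountOptKind::U32, Some(v)) => {
                let n = parse_num(v).ok_or_else(|| bad(key))?;
                if n > u32::MAX as u64 {
                    return Err(bad(key));
                }
                MountOptValue::U32(n as u32)
            }
            (MountOptKind::Str, Some(v)) => MountOptValue::Str(v.to_string()),
            // Flag with a value, or a typed option without one.
            _ => return Err(bad(key)),
        };
        out.push((desc.name, value));
    }
    Ok(out)
}
```

Rejecting unknown keys before `fill_super` runs keeps each pseudo-filesystem's mount code free of option-string parsing.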
14.18.2.2 debugfs Kernel API¶
Kernel subsystems create debugfs entries during their initialization. All functions are no-ops (returning a dummy handle) if debugfs is not mounted, ensuring subsystem init never fails due to debugfs unavailability.
/// Handle to a debugfs directory. Opaque to callers.
/// Internally holds the dentry reference for the directory inode.
pub struct DebugfsDir {
dentry: Arc<Dentry>,
}
/// Handle to a debugfs file or value entry. Opaque to callers.
pub struct DebugfsEntry {
dentry: Arc<Dentry>,
}
/// Create a directory under the debugfs root (or under `parent`).
/// Returns `DebugfsDir` used as the parent for subsequent entries.
pub fn debugfs_create_dir(
name: &str,
parent: Option<&DebugfsDir>,
) -> Result<DebugfsDir, VfsError>;
/// Create a file with custom read/write operations.
/// `mode` is the POSIX permission bits (e.g., 0o444 for read-only).
/// `fops` provides the read/write/open/release callbacks.
pub fn debugfs_create_file(
name: &str,
parent: &DebugfsDir,
mode: u16,
fops: &'static FileOps,
) -> Result<DebugfsEntry, VfsError>;
/// Create a file that reads/writes a single `AtomicU32` value.
/// Read returns the decimal ASCII representation; write parses decimal ASCII.
pub fn debugfs_create_u32(
name: &str,
parent: &DebugfsDir,
mode: u16,
value: &'static AtomicU32,
) -> Result<DebugfsEntry, VfsError>;
/// Create a file that reads/writes a single `AtomicU64` value.
pub fn debugfs_create_u64(
name: &str,
parent: &DebugfsDir,
mode: u16,
value: &'static AtomicU64,
) -> Result<DebugfsEntry, VfsError>;
/// Create a file that reads/writes a single `AtomicBool` value.
/// Read returns "Y\n" or "N\n"; write accepts "1"/"Y"/"y" or "0"/"N"/"n".
pub fn debugfs_create_bool(
name: &str,
parent: &DebugfsDir,
mode: u16,
value: &'static AtomicBool,
) -> Result<DebugfsEntry, VfsError>;
/// Remove a single debugfs entry (file or empty directory).
pub fn debugfs_remove(entry: DebugfsEntry);
/// Remove a directory and all entries beneath it recursively.
/// Safe to call from module teardown — removes all entries created
/// by the subsystem in one call.
pub fn debugfs_remove_recursive(dir: DebugfsDir);
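The text format used by the boolean helper can be sketched as a pair of functions over an `AtomicBool`. This is an illustration, not the debugfs implementation: `bool_file_read`, `bool_file_write`, and `BoolParseError` are hypothetical names, and tolerating a single trailing newline on write (as produced by `echo`) is an assumption.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq, Eq)]
pub struct BoolParseError;

/// Read side: "Y\n" when set, "N\n" when clear.
pub fn bool_file_read(value: &AtomicBool) -> &'static str {
    if value.load(Ordering::Relaxed) {
        "Y\n"
    } else {
        "N\n"
    }
}

/// Write side: accepts "1"/"Y"/"y" or "0"/"N"/"n".
/// Returns the number of bytes consumed.
pub fn bool_file_write(value: &AtomicBool, input: &[u8]) -> Result<usize, BoolParseError> {
    // Assumption: one trailing newline is ignored.
    let trimmed = match input {
        [rest @ .., b'\n'] => rest,
        other => other,
    };
    let parsed = match trimmed {
        [b'1'] | [b'Y'] | [b'y'] => true,
        [b'0'] | [b'N'] | [b'n'] => false,
        _ => return Err(BoolParseError),
    };
    value.store(parsed, Ordering::Relaxed);
    Ok(input.len())
}
```

The read and write formats are deliberately asymmetric: reads always produce `Y` or `N`, while writes also accept the numeric forms.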
14.18.2.3 Lockdown Integration¶
When kernel lockdown (Section 9.3) is active, debugfs access is restricted based on the lockdown level:
| Lockdown Level | debugfs Behavior |
|---|---|
| none | Full read/write access (subject to mount permissions) |
| integrity | Read-only: writes to all debugfs files return EPERM |
| confidentiality | Fully disabled: mount returns EPERM; all reads/writes return EPERM |
The debugfs=off boot parameter disables debugfs entirely (equivalent to
confidentiality lockdown for debugfs). When disabled, debugfs_create_*
functions return dummy handles and all file operations are no-ops, ensuring
subsystem initialization never fails due to debugfs unavailability.
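The lockdown table maps directly onto a single access check. The sketch below is a hypothetical rendering of that table: `LockdownLevel`, `DebugfsOp`, and `debugfs_access_check` are names invented for illustration, and mount-permission checks are assumed to happen separately.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LockdownLevel {
    None,
    Integrity,
    Confidentiality,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DebugfsOp {
    Mount,
    Read,
    Write,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    Eperm,
}

/// Gate a debugfs operation on the current lockdown level.
pub fn debugfs_access_check(level: LockdownLevel, op: DebugfsOp) -> Result<(), Errno> {
    match (level, op) {
        // No lockdown: everything is allowed at this layer.
        (LockdownLevel::None, _) => Ok(()),
        // Integrity: the filesystem is read-only.
        (LockdownLevel::Integrity, DebugfsOp::Write) => Err(Errno::Eperm),
        (LockdownLevel::Integrity, _) => Ok(()),
        // Confidentiality: mount, read, and write are all refused.
        (LockdownLevel::Confidentiality, _) => Err(Errno::Eperm),
    }
}
```

Keeping the policy in one function means the per-file callbacks never need to know the lockdown level.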
14.18.2.4 Standard debugfs Directories¶
Created at boot by their respective subsystems:
| Directory | Subsystem | Content |
|---|---|---|
| /sys/kernel/debug/block/ | Block layer (Section 15.2) | Per-device I/O stats, request queue state |
| /sys/kernel/debug/dma_buf/ | DMA subsystem (Section 4.14) | DMA-buf allocation tracking |
| /sys/kernel/debug/clk/ | Clock framework (Section 2.24) | Clock tree rates, enable counts |
| /sys/kernel/debug/regulator/ | Regulator framework (Section 13.27) | Voltage/current state per regulator |
| /sys/kernel/debug/ieee80211/ | WiFi subsystem (Section 13.15) | Per-PHY/per-STA debug counters |
| /sys/kernel/debug/bluetooth/ | Bluetooth (Section 13.14) | HCI trace data |
14.18.3 tracefs — Tracing Filesystem¶
Mount point: /sys/kernel/tracing (historically /sys/kernel/debug/tracing;
UmkaOS creates a compatibility symlink at the legacy path)
Magic: 0x74726163 (TRACEFS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID
tracefs exposes the tracepoint event catalog and ftrace ring buffers to userspace tracing tools (perf, bpftrace, trace-cmd). It is the primary interface for Section 20.2.
Access control: CAP_SYS_ADMIN or CAP_PERFMON for most operations. Reading
available_events and event format files requires only read permission on the
tracefs mount (allows unprivileged discovery of available tracepoints without
enabling them).
14.18.3.1 tracefs Registration¶
static TRACEFS_TYPE: PseudoFsType = PseudoFsType {
name: "tracefs",
fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
magic: 0x74726163,
fill_super: tracefs_fill_super,
mount_opts: &[
MountOptDesc { name: "uid", kind: MountOptKind::U32 },
MountOptDesc { name: "gid", kind: MountOptKind::U32 },
MountOptDesc { name: "mode", kind: MountOptKind::U32 },
],
};
14.18.3.2 tracefs Directory Structure¶
/sys/kernel/tracing/
  available_events        # one "subsystem:event" per line
  available_tracers       # "nop function function_graph"
  current_tracer          # write tracer name to activate
  trace                   # human-readable trace output (snapshot)
  trace_pipe              # streaming trace output (blocks on read)
  tracing_on              # "1"=enabled, "0"=disabled (write to toggle)
  buffer_size_kb          # per-CPU ring buffer size (write to resize)
  events/                 # per-subsystem event directories
    sched/                # scheduler events
      sched_switch/
        enable            # "1"=trace, "0"=disable
        filter            # BPF-style filter expression
        format            # printf-style field description
        id                # tracepoint numeric ID (u32)
    syscalls/             # syscall enter/exit events
    net/                  # networking events
    block/                # block I/O events
    irq/                  # interrupt events
  per_cpu/                # per-CPU trace data
    cpu0/
      trace               # CPU-specific trace snapshot
      trace_pipe          # CPU-specific streaming trace
      stats               # entries, overrun, commit overrun, bytes
  instances/              # named trace instances (independent buffers)
14.18.3.3 tracefs Ring Buffer¶
Each CPU has a dedicated ring buffer (per-CPU, no lock contention). The buffer
size defaults to 1408 KB per CPU (matching Linux default) and is configurable
via buffer_size_kb. Ring buffer allocation uses the page allocator
(Section 4.2); each buffer is a set of linked pages, not a
single contiguous allocation (avoids high-order allocation failures on fragmented
systems).
/// Trace ring buffer set. One instance per trace instance; the per-CPU
/// state lives in the `PerCpu` fields below.
pub struct TraceRingBuffer {
/// Per-CPU buffer pages. Each entry is a page-sized ring segment.
/// Pages are linked in a circular list for wraparound.
pub pages: PerCpu<TraceBufferPages>,
/// Buffer size in KB (per CPU). Default: 1408.
pub size_kb: AtomicU32,
/// Overrun counter: events dropped due to full buffer (per CPU, u64).
pub overrun: PerCpu<AtomicU64>,
/// Total events written (per CPU, u64).
pub entries: PerCpu<AtomicU64>,
}
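The sizing arithmetic and the overrun counter can be modeled for a single CPU as follows. This is a loose user-space sketch: `pages_for` and `CpuRing` are hypothetical, events are reduced to `u64` values in a `VecDeque` rather than variable-length records in linked pages, a 4 KiB page size is assumed, and the choice to drop the oldest event when full (overwrite mode) is an assumption.

```rust
use std::collections::VecDeque;

pub const PAGE_SIZE: usize = 4096;

/// Number of pages needed for a per-CPU buffer of `size_kb` kilobytes.
pub fn pages_for(size_kb: u32) -> usize {
    let bytes = size_kb as usize * 1024;
    (bytes + PAGE_SIZE - 1) / PAGE_SIZE
}

/// Single-CPU ring with the two counters from `TraceRingBuffer`.
pub struct CpuRing {
    events: VecDeque<u64>,
    capacity: usize,
    /// Events dropped because the buffer was full.
    pub overrun: u64,
    /// Total events written.
    pub entries: u64,
}

impl CpuRing {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "ring must hold at least one event");
        CpuRing {
            events: VecDeque::with_capacity(capacity),
            capacity,
            overrun: 0,
            entries: 0,
        }
    }

    /// Record one event, discarding the oldest if the ring is full.
    pub fn push(&mut self, event: u64) {
        if self.events.len() == self.capacity {
            self.events.pop_front();
            self.overrun += 1;
        }
        self.events.push_back(event);
        self.entries += 1;
    }

    /// Consume all buffered events in order, as a trace_pipe reader would.
    pub fn drain(&mut self) -> Vec<u64> {
        self.events.drain(..).collect()
    }
}
```

With the default of 1408 KB and 4 KiB pages, `pages_for` gives 352 pages per CPU, each of which can be allocated independently.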
14.18.3.4 Tracepoint Integration¶
Each tracepoint registered via the DECLARE_TRACEPOINT! macro
(Section 20.2) automatically gets an events/<subsystem>/<name>/
directory in tracefs. The id file provides a stable u32 identifier that
perf_event_open() (Section 20.8) uses to attach to
tracepoints programmatically.
14.18.3.5 Trace Instances¶
Named trace instances (mkdir /sys/kernel/tracing/instances/mytracer) create
independent ring buffers with their own events/, trace, trace_pipe, and
per_cpu/ directories. This allows multiple concurrent tracing sessions (e.g.,
one for system-wide scheduler tracing, another for application-specific I/O
tracing) without interference.
14.18.4 hugetlbfs — Huge Page Filesystem¶
Mount point: /dev/hugepages (or any user-chosen mount point)
Magic: 0x958458f6 (HUGETLBFS_MAGIC)
Flags: NODEV | NOSUID
hugetlbfs provides huge page-backed file mappings for applications requiring large contiguous physical pages: databases (Oracle, PostgreSQL shared buffers), DPDK, HPC/AI workloads, and GPU pinned memory.
14.18.4.1 hugetlbfs Registration¶
static HUGETLBFS_TYPE: PseudoFsType = PseudoFsType {
name: "hugetlbfs",
fs_flags: FsFlags::NODEV | FsFlags::NOSUID,
magic: 0x958458f6,
fill_super: hugetlbfs_fill_super,
mount_opts: &[
MountOptDesc { name: "pagesize", kind: MountOptKind::U64 },
MountOptDesc { name: "size", kind: MountOptKind::U64 },
MountOptDesc { name: "min_size", kind: MountOptKind::U64 },
MountOptDesc { name: "nr_inodes", kind: MountOptKind::U64 },
MountOptDesc { name: "uid", kind: MountOptKind::U32 },
MountOptDesc { name: "gid", kind: MountOptKind::U32 },
MountOptDesc { name: "mode", kind: MountOptKind::U32 },
],
};
14.18.4.2 Mount Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| pagesize | bytes | architecture default | Huge page size. Platform-dependent (see table below). |
| size | bytes | all available | Maximum total size of files on this mount. |
| min_size | bytes | 0 | Guaranteed reservation: this many bytes of huge pages are reserved at mount time and cannot be stolen by other mounts. |
| nr_inodes | count | unlimited | Maximum number of inodes (files + directories). |
| uid | uid_t | 0 | UID of the root directory. |
| gid | gid_t | 0 | GID of the root directory. |
| mode | octal | 01777 | Permissions of the root directory. |
14.18.4.3 Supported Huge Page Sizes¶
| Architecture | Default | Available Sizes |
|---|---|---|
| x86-64 | 2 MiB | 2 MiB (PMD), 1 GiB (PUD) |
| AArch64 | 2 MiB | 64 KiB (cont PTE), 2 MiB (PMD), 32 MiB (cont PMD), 1 GiB (PUD) |
| ARMv7 | 2 MiB | 2 MiB (section) |
| RISC-V 64 | 2 MiB | 2 MiB (PMD), 1 GiB (PUD) |
| PPC32 | 4 MiB | 4 MiB (depends on MMU variant) |
| PPC64LE | 2 MiB | 2 MiB, 1 GiB (radix); 16 MiB (HPT mode) |
| s390x | 1 MiB | 1 MiB (segment table large page) |
| LoongArch64 | 2 MiB | 2 MiB (PMD), 1 GiB (PUD) |
Sizes are discovered at boot from the hardware page table capabilities and
reported in /proc/meminfo (Hugepagesize) and /sys/kernel/mm/hugepages/.
14.18.4.4 File Operations¶
| Syscall | Behavior |
|---|---|
| open() / creat() | Creates a file backed by huge pages. No physical pages allocated yet. |
| mmap() | Maps huge pages into the process address space. Each VMA page fault allocates a single huge page from the pool. MAP_POPULATE pre-faults all pages. |
| read() / write() | Returns EINVAL. hugetlbfs files are mmap-only. |
| unlink() | Removes the directory entry. Huge pages are returned to the pool when the last mapping is removed (reference-counted). |
| fallocate() | mode=0: pre-allocate huge pages without mapping. FALLOC_FL_PUNCH_HOLE: release allocated pages for the given range. |
| ftruncate() | Resize the file. Shrinking releases pages beyond the new size. |
14.18.4.5 Huge Page Pool Management¶
The system-wide huge page pool is managed via:
- /proc/sys/vm/nr_hugepages: persistent huge pages (survive memory pressure)
- /proc/sys/vm/nr_overcommit_hugepages: surplus pages (reclaimed under pressure)
- Per-NUMA node: /sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/nr_hugepages
The hugetlbfs pool is independent of THP (Section 4.7): hugetlbfs uses an explicit reservation pool while THP uses buddy allocator promotion. They do not compete for the same pages.
14.18.4.6 memfd_create Integration¶
memfd_create() with MFD_HUGETLB (Section 4.15) creates an
anonymous file descriptor backed by hugetlbfs. The optional MFD_HUGE_2MB /
MFD_HUGE_1GB flags select the page size. This is the preferred mechanism for
applications that need huge pages without a visible filesystem mount.
14.18.5 bpffs — BPF Filesystem¶
Mount point: /sys/fs/bpf (mount -t bpf bpffs /sys/fs/bpf)
Magic: 0xcafe4a11 (BPF_FS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID
bpffs persists BPF objects (programs, maps, links) beyond the lifetime of the loading process. Required by Cilium (Kubernetes CNI), systemd, bpftool, and any infrastructure that loads BPF programs at boot and expects them to survive across process restarts.
14.18.5.1 bpffs Registration¶
static BPFFS_TYPE: PseudoFsType = PseudoFsType {
name: "bpf",
fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
magic: 0xcafe4a11,
fill_super: bpffs_fill_super,
mount_opts: &[
MountOptDesc { name: "mode", kind: MountOptKind::U32 },
MountOptDesc { name: "delegate_cmds", kind: MountOptKind::U64 },
MountOptDesc { name: "delegate_maps", kind: MountOptKind::U64 },
MountOptDesc { name: "delegate_progs", kind: MountOptKind::U64 },
MountOptDesc { name: "delegate_attachs", kind: MountOptKind::U64 },
],
};
14.18.5.2 BPF Object Pinning¶
/// Pin a BPF object (program, map, or link) to a path in bpffs.
/// The object's kernel reference count is incremented. The object
/// remains alive as long as at least one pin or fd references it.
///
/// Called via bpf(BPF_OBJ_PIN, { fd, pathname }).
///
/// # Errors
/// - `EEXIST`: path already exists.
/// - `EINVAL`: fd does not refer to a BPF object.
/// - `EACCES`: caller lacks CAP_BPF or write permission on parent directory.
/// - `ENOSPC`: bpffs inode limit reached (if configured).
pub fn bpf_obj_pin(fd: BpfFd, pathname: &Path) -> Result<(), SyscallError>;
/// Retrieve a previously pinned BPF object by path, returning a new fd.
/// The caller receives a new file descriptor referencing the pinned object.
///
/// Called via bpf(BPF_OBJ_GET, { pathname }).
///
/// # Errors
/// - `ENOENT`: path does not exist.
/// - `EACCES`: caller lacks CAP_BPF or read permission on the path.
pub fn bpf_obj_get(pathname: &Path) -> Result<BpfFd, SyscallError>;
14.18.5.3 Object Lifecycle¶
A BPF object is freed when all references are removed:
- All userspace file descriptors closed.
- All bpffs pins removed (via unlink()).
- All kernel-internal references released (e.g., a BPF program attached to a network hook holds an internal reference; detaching releases it).
Only when the reference count reaches zero does the kernel free the BPF program bytecode and map memory.
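The three reference kinds above can be modeled with ordinary shared ownership. The following is a minimal sketch, not the kernel's implementation: `BpfObject`, `BpfRef`, and `lifecycle_demo` are hypothetical names, and `Arc` stands in for the kernel reference count.

```rust
use std::sync::{Arc, Weak};

/// Hypothetical stand-in for a BPF program or map (illustration only).
pub struct BpfObject {
    pub id: u32,
}

/// One reference holder per category listed above. Each variant owns
/// one strong reference to the object.
pub enum BpfRef {
    /// Userspace file descriptor; dropped on close() or process exit.
    Fd(Arc<BpfObject>),
    /// bpffs pin; dropped by unlink().
    Pin(Arc<BpfObject>),
    /// Kernel-internal reference (e.g., attachment); dropped on detach.
    KernelAttach(Arc<BpfObject>),
}

/// True while at least one strong reference keeps the object alive.
pub fn is_alive(probe: &Weak<BpfObject>) -> bool {
    probe.upgrade().is_some()
}

/// Loads, pins, and attaches an object, then releases the references
/// one by one. Returns the liveness observed after each release.
pub fn lifecycle_demo() -> [bool; 3] {
    let obj = Arc::new(BpfObject { id: 1 });
    let probe = Arc::downgrade(&obj);

    let fd = BpfRef::Fd(Arc::clone(&obj));
    let pin = BpfRef::Pin(Arc::clone(&obj));
    let attach = BpfRef::KernelAttach(obj); // moves the original reference

    drop(fd); // loader closes its fd: pin and attachment remain
    let after_close = is_alive(&probe);
    drop(pin); // unlink() of the pinned path: attachment remains
    let after_unlink = is_alive(&probe);
    drop(attach); // detach: last reference gone, object is freed
    let after_detach = is_alive(&probe);

    [after_close, after_unlink, after_detach]
}
```

In this model the object survives the loader's exit and the removal of its pin, and is freed only when the attachment is released.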
14.18.5.4 Directory Structure¶
bpffs supports arbitrary directory hierarchies via mkdir(2). Conventions used
by standard tools:
| Path | Creator | Content |
|---|---|---|
| /sys/fs/bpf/tc/globals/ | iproute2 | Shared maps for TC BPF programs |
| /sys/fs/bpf/cilium/ | Cilium | Datapath programs and maps |
| /sys/fs/bpf/xdp/ | xdp-tools | XDP programs |
| /sys/fs/bpf/ip/ | iproute2 | BPF programs for ip rule |
Standard VFS operations: mkdir(), rmdir(), unlink(), readdir() for
namespace management. Only unlink() on a pinned object file removes the pin;
rmdir() requires the directory to be empty.
14.18.5.5 BPF Token Delegation¶
BPF tokens (Linux 6.9+) allow unprivileged processes to perform specific BPF
operations within the scope of a bpffs mount. A privileged process creates
a token by calling bpf(BPF_TOKEN_CREATE) on a bpffs file descriptor; the
token inherits delegation rights from the mount options.
/// A BPF token granting scoped BPF permissions to an unprivileged process.
/// Created via bpf(BPF_TOKEN_CREATE, { bpffs_fd }).
///
/// The token inherits its allowed operations from the mount-time
/// delegation options of the bpffs instance.
pub struct BpfToken {
/// BPF commands this token permits (e.g., BPF_PROG_LOAD, BPF_MAP_CREATE).
pub allowed_cmds: BpfCmdSet,
/// Map types this token permits creating.
pub allowed_map_types: BpfMapTypeSet,
/// Program types this token permits loading.
pub allowed_prog_types: BpfProgTypeMask,
/// Attach types this token permits.
pub allowed_attach_types: BpfAttachTypeSet,
}
Mount-time delegation options control what a token created on this mount may grant:
| Mount Option | Effect |
|---|---|
| delegate_cmds=0x1f | Bitmask of BPF commands the token may delegate |
| delegate_maps=0xff | Bitmask of map types the token may delegate |
| delegate_progs=0x3f | Bitmask of program types the token may delegate |
| delegate_attachs=0x7f | Bitmask of attach types the token may delegate |
Without delegation mount options, BPF_TOKEN_CREATE returns ENOENT — the
mount does not support token creation.
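A rough sketch of how token creation might derive its rights from the mount options follows. It makes two assumptions that this section does not state: that bit N of a delegation mask corresponds to command number N, and that a mount counts as having no delegation options when all four masks are zero. `BpffsDelegation`, `TokenSketch`, `token_create`, and `token_allows_cmd` are hypothetical names; the real `BpfToken` uses typed sets rather than raw masks.

```rust
/// Mount-time delegation masks; field names mirror the mount options.
#[derive(Clone, Copy, Default)]
pub struct BpffsDelegation {
    pub delegate_cmds: u64,
    pub delegate_maps: u64,
    pub delegate_progs: u64,
    pub delegate_attachs: u64,
}

/// Simplified token holding raw bitmasks (illustration only).
#[derive(Debug, PartialEq)]
pub struct TokenSketch {
    pub cmds: u64,
    pub maps: u64,
    pub progs: u64,
    pub attachs: u64,
}

#[derive(Debug, PartialEq)]
pub enum TokenError {
    /// Maps to ENOENT: the mount carries no delegation options.
    NoEnt,
}

/// Creates a token that inherits the mount's delegation masks.
/// Assumption: "no delegation options" means all four masks are zero.
pub fn token_create(d: &BpffsDelegation) -> Result<TokenSketch, TokenError> {
    let any = d.delegate_cmds | d.delegate_maps | d.delegate_progs | d.delegate_attachs;
    if any == 0 {
        return Err(TokenError::NoEnt);
    }
    Ok(TokenSketch {
        cmds: d.delegate_cmds,
        maps: d.delegate_maps,
        progs: d.delegate_progs,
        attachs: d.delegate_attachs,
    })
}

/// Checks whether the token permits command number `cmd`.
/// Assumption: bit N of the mask stands for command number N.
pub fn token_allows_cmd(token: &TokenSketch, cmd: u32) -> bool {
    cmd < 64 && (token.cmds >> cmd) & 1 == 1
}
```

Under these assumptions, a mount with delegate_cmds=0x1f yields a token that permits command numbers 0 through 4 and rejects everything else.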
14.18.5.6 Access Control¶
Standard POSIX permissions on directories and files control visibility. Creating
or retrieving pinned objects additionally requires CAP_BPF
(Section 9.9). Programs operating within a BPF
token scope (Section 19.2) may pin objects
without CAP_BPF if the token grants the appropriate permissions.
14.18.6 securityfs — Security Module Filesystem¶
Mount point: /sys/kernel/security
Magic: 0x73636673 (SECURITYFS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID
securityfs provides per-LSM configuration and status interfaces. The overall LSM framework and the content exposed under securityfs are specified in Section 9.8. This section formalizes the filesystem registration and the kernel API for creating securityfs entries.
14.18.6.1 securityfs Registration¶
static SECURITYFS_TYPE: PseudoFsType = PseudoFsType {
name: "securityfs",
fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
magic: 0x73636673,
fill_super: securityfs_fill_super,
mount_opts: &[],
};
14.18.6.2 securityfs Kernel API¶
/// Handle to a securityfs directory. Opaque to callers.
pub struct SecurityfsDir {
dentry: Arc<Dentry>,
}
/// Handle to a securityfs file. Opaque to callers.
pub struct SecurityfsEntry {
dentry: Arc<Dentry>,
}
/// Create a directory under the securityfs root (or under `parent`).
/// Each LSM creates its top-level directory during LSM init.
pub fn securityfs_create_dir(
name: &str,
parent: Option<&SecurityfsDir>,
) -> Result<SecurityfsDir, VfsError>;
/// Create a file with custom read/write callbacks.
/// `mode` is POSIX permission bits. LSMs typically use 0o444 for
/// status files and 0o600 or 0o200 for policy write interfaces.
pub fn securityfs_create_file(
name: &str,
parent: &SecurityfsDir,
mode: u16,
fops: &'static FileOps,
) -> Result<SecurityfsEntry, VfsError>;
/// Remove a securityfs entry. Called during LSM teardown (if supported)
/// or during live kernel evolution ([Section 13.18](13-device-classes.md#live-kernel-evolution)).
pub fn securityfs_remove(entry: SecurityfsEntry);
14.18.6.3 Standard securityfs Layout¶
| Path | LSM | Content |
|---|---|---|
| /sys/kernel/security/lsm | Core | Comma-separated list of active LSMs (read-only) |
| /sys/kernel/security/apparmor/ | AppArmor | Profile management, policy load |
| /sys/kernel/security/selinux/ | SELinux | Enforce mode, policy, booleans, AVC stats |
| /sys/kernel/security/ima/ | IMA | Measurement log, policy (Section 9.5) |
| /sys/kernel/security/evm/ | EVM | EVM mode, status |
| /sys/kernel/security/landlock/ | Landlock | ABI version (Section 9.8) |
Access control varies by LSM: reading status files is typically unrestricted, while policy writes require CAP_MAC_ADMIN (Section 9.9).
14.18.7 efivarfs — EFI Variable Filesystem¶
Mount point: /sys/firmware/efi/efivars
Magic: 0xde5e81e4 (EFIVARFS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID
efivarfs exposes UEFI firmware variables to userspace for reading and writing.
Required for boot manager configuration (efibootmgr), Secure Boot key management,
and firmware diagnostics.
Availability: UEFI systems only. On non-UEFI platforms (most ARMv7, PPC32,
PPC64LE), the filesystem is not registered and mount -t efivarfs returns ENODEV.
| Architecture | EFI Support | efivarfs Available |
|---|---|---|
| x86-64 | Yes (UEFI standard) | Yes |
| AArch64 | Yes (UEFI standard) | Yes |
| ARMv7 | Rare (U-Boot) | Only if EFI runtime services present |
| RISC-V 64 | Emerging (UEFI spec) | When EFI runtime services present |
| PPC32 | No (Open Firmware / DTB) | No |
| PPC64LE | No (OPAL / SLOF) | No |
| s390x | No (z/VM / LPAR IPL) | No |
| LoongArch64 | Emerging (UEFI standard on Loongson 3A5000+) | When EFI runtime services present |
14.18.7.1 efivarfs Registration¶
static EFIVARFS_TYPE: PseudoFsType = PseudoFsType {
name: "efivarfs",
fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
magic: 0xde5e81e4,
fill_super: efivarfs_fill_super,
mount_opts: &[],
};
14.18.7.2 File Naming Convention¶
Each file in efivarfs represents one UEFI variable, named as
{VariableName}-{VendorGUID} where the GUID is in standard
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx format:
- Boot0001-8be4df61-93ca-11d2-aa0d-00e098032b8c: Boot entry 1 (EFI Global Variable GUID)
- BootOrder-8be4df61-93ca-11d2-aa0d-00e098032b8c: Boot order sequence
- SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c: Secure Boot state
- dbx-d719b2cb-3d3a-4596-a3bc-dad00e67656f: Secure Boot forbidden signature DB
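Because a variable name may itself contain hyphens, a file name has to be split from the end, where the GUID has a fixed 36-character shape. The sketch below illustrates one way to do that; `split_efivar_name` and `is_guid_text` are hypothetical helpers, not part of the efivarfs API described here.

```rust
/// Length of a GUID in xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx form.
const GUID_TEXT_LEN: usize = 36;

/// Returns true if `s` has the 8-4-4-4-12 hexadecimal GUID layout.
fn is_guid_text(s: &str) -> bool {
    s.len() == GUID_TEXT_LEN
        && s.bytes().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

/// Splits an efivarfs file name into (variable name, GUID text).
/// Returns None if the name does not follow {VariableName}-{VendorGUID}.
pub fn split_efivar_name(file_name: &str) -> Option<(&str, &str)> {
    // At least one name character, the separating '-', and the GUID.
    if file_name.len() < GUID_TEXT_LEN + 2 {
        return None;
    }
    let split = file_name.len() - GUID_TEXT_LEN;
    if !file_name.is_char_boundary(split) {
        return None;
    }
    let (head, guid) = file_name.split_at(split);
    // `head` still ends with the '-' that separates name and GUID.
    let name = head.strip_suffix('-')?;
    if name.is_empty() || !is_guid_text(guid) {
        return None;
    }
    Some((name, guid))
}
```

For the examples above, "BootOrder-8be4df61-93ca-11d2-aa0d-00e098032b8c" splits into the name "BootOrder" and the EFI Global Variable GUID.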
14.18.7.3 File Format¶
// Wire format for efivarfs file content (userspace ABI). This matches the
// Linux efivarfs file format exactly: the first 4 bytes are the EFI variable
// attributes (little-endian u32); the remainder is the variable value.
// Total size is attributes(4) + value(N).
//
// Read: returns attributes (4 bytes LE) + value.
// Write: caller provides attributes (4 bytes LE) + new value.
//
// The content is dynamically sized (DST), so no const_assert on its size
// applies. The helper functions below are kernel-internal, not KABI.
/// Parse an efivarfs file buffer into (attributes, value_slice).
/// Returns `Err(EINVAL)` if the buffer is too short (< 4 bytes).
fn parse_efivar(buf: &[u8]) -> Result<(EfiVariableAttributes, &[u8]), Errno> {
if buf.len() < 4 {
return Err(EINVAL);
}
let attrs = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
// from_bits_truncate drops any attribute bits not defined in the bitflags.
Ok((EfiVariableAttributes::from_bits_truncate(attrs), &buf[4..]))
}
/// Build an efivarfs file buffer from (attributes, value).
/// Writes attributes as little-endian u32 prefix followed by value bytes.
/// Returns the number of bytes written, or `Err(EINVAL)` if `out` is
/// shorter than 4 + value.len() bytes.
fn build_efivar(attrs: EfiVariableAttributes, value: &[u8], out: &mut [u8]) -> Result<usize, Errno> {
let total = 4 + value.len();
if out.len() < total {
return Err(EINVAL);
}
out[..4].copy_from_slice(&attrs.bits().to_le_bytes());
out[4..total].copy_from_slice(value);
Ok(total)
}
bitflags! {
/// EFI variable attributes. Matches the UEFI specification (Table 14).
pub struct EfiVariableAttributes: u32 {
/// Variable persists across resets.
const NON_VOLATILE = 0x0000_0001;
/// Variable accessible during boot services.
const BOOTSERVICE_ACCESS = 0x0000_0002;
/// Variable accessible at OS runtime.
const RUNTIME_ACCESS = 0x0000_0004;
/// Hardware error record (separate NVRAM region on some firmware).
const HARDWARE_ERROR_RECORD = 0x0000_0008;
/// Only authenticated writes accepted (deprecated by UEFI 2.8+).
const AUTHENTICATED_WRITE_ACCESS = 0x0000_0010;
/// Time-based authenticated variable (used by Secure Boot db/dbx).
const TIME_BASED_AUTHENTICATED_WRITE_ACCESS = 0x0000_0020;
/// Append-only writes: new data is appended, existing data unchanged.
const APPEND_WRITE = 0x0000_0040;
}
}
14.18.7.4 Operations¶
| Syscall | Behavior |
|---|---|
| read() | Returns attributes (4 bytes LE) followed by the variable value. Calls EFI GetVariable() runtime service (Section 2.20). |
| write() | Caller provides attributes (4 bytes LE) + new value. Calls EFI SetVariable(). Returns EIO on firmware error, ENOSPC if NVRAM is full. |
| creat() | Creates a new EFI variable. File name must follow the {Name}-{GUID} convention. Calls EFI SetVariable() with the new name. |
| unlink() | Deletes the EFI variable by calling SetVariable() with DataSize=0. Returns EPERM if the variable is immutable. |
| readdir() | Enumerates all EFI variables via GetNextVariableName(). Results are cached in memory after first enumeration; invalidated on any write. |
14.18.7.5 Immutable Variable Protection¶
Certain variables are critical to system boot and must not be accidentally deleted:
/// Variables marked immutable (FS_IMMUTABLE_FL) by the kernel.
/// Users cannot unlink or write to these without first clearing
/// the immutable flag (requires CAP_LINUX_IMMUTABLE).
const IMMUTABLE_VARS: &[&str] = &[
"SecureBoot",
"SetupMode",
"PK", // Platform Key
"KEK", // Key Exchange Key
"AuditMode",
"DeployedMode",
];
The kernel sets FS_IMMUTABLE_FL on these files at mount time. Modifying them
requires chattr -i first (which requires CAP_LINUX_IMMUTABLE), providing a
two-step safeguard against accidental firmware corruption.
14.18.7.6 NVRAM Wear Protection¶
EFI NVRAM has limited write endurance (typically 100K-1M cycles per flash block). The kernel rate-limits writes to prevent userspace from wearing out NVRAM:
/// NVRAM write rate limiter. Shared across all efivarfs writes.
/// Uses a token bucket algorithm: one token per write, refilled at one
/// token per `NVRAM_REFILL_INTERVAL_NS`, maximum burst of `NVRAM_BUCKET_SIZE`.
pub struct EfiNvramRateLimiter {
/// Current token count. Bounded gauge: range [0, NVRAM_BUCKET_SIZE].
/// AtomicU32 is sufficient: max value is 64 (NVRAM_BUCKET_SIZE),
/// never incremented beyond the bucket ceiling by the refill logic.
pub tokens: AtomicU32,
/// Last refill timestamp (nanoseconds, monotonic clock).
pub last_refill_ns: AtomicU64,
}
/// Maximum burst writes before throttling.
const NVRAM_BUCKET_SIZE: u32 = 64;
/// Sustained write rate: 1 write per 100ms (10 writes/sec).
/// At this rate, a 100K-cycle flash block that absorbed every write
/// would reach its rated endurance after ~2.8 hours of continuous
/// maximum-rate writes (100,000 writes / 10 per second = 10,000 s).
/// Practical workloads (boot manager changes, key rotations) are many
/// orders of magnitude below this rate.
const NVRAM_REFILL_INTERVAL_NS: u64 = 100_000_000;
When the bucket is empty, write() returns EBUSY. The caller (typically
efibootmgr or mokutil) retries after a short delay.
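The refill and acquire steps of such a token bucket can be sketched as follows. This is an illustrative model rather than the kernel's code: `RateLimiterSketch`, `refill`, and `try_acquire` are hypothetical names, the caller supplies the monotonic timestamp, and the bucket is assumed to start full.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

pub const NVRAM_BUCKET_SIZE: u32 = 64;
pub const NVRAM_REFILL_INTERVAL_NS: u64 = 100_000_000;

/// Simplified model of the limiter; same two fields as EfiNvramRateLimiter.
pub struct RateLimiterSketch {
    tokens: AtomicU32,
    last_refill_ns: AtomicU64,
}

impl RateLimiterSketch {
    /// Assumption: the bucket starts full at creation time.
    pub fn new(now_ns: u64) -> Self {
        Self {
            tokens: AtomicU32::new(NVRAM_BUCKET_SIZE),
            last_refill_ns: AtomicU64::new(now_ns),
        }
    }

    /// Credits one token per elapsed interval, capped at the bucket size.
    fn refill(&self, now_ns: u64) {
        let last = self.last_refill_ns.load(Ordering::Acquire);
        let earned = now_ns.saturating_sub(last) / NVRAM_REFILL_INTERVAL_NS;
        if earned == 0 {
            return;
        }
        // Advance by whole intervals only, so leftover time carries over.
        let new_last = last + earned * NVRAM_REFILL_INTERVAL_NS;
        if self
            .last_refill_ns
            .compare_exchange(last, new_last, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            return; // another caller already credited this interval
        }
        let add = earned.min(NVRAM_BUCKET_SIZE as u64) as u32;
        let _ = self.tokens.fetch_update(Ordering::AcqRel, Ordering::Acquire, |t| {
            Some(t.saturating_add(add).min(NVRAM_BUCKET_SIZE))
        });
    }

    /// Takes one token. Returns false when the bucket is empty, which the
    /// write path would report as EBUSY.
    pub fn try_acquire(&self, now_ns: u64) -> bool {
        self.refill(now_ns);
        self.tokens
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |t| t.checked_sub(1))
            .is_ok()
    }
}
```

With these constants, a caller can issue 64 writes back to back, after which one further write becomes possible every 100 ms. A long idle period refills the bucket only up to the 64-token ceiling.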
14.18.8 Boot Initialization Order¶
All pseudo-filesystems register during Phase 5 (VFS initialization) of the boot sequence (Section 2.3). The ordering reflects dependency constraints:
| Order | Filesystem | Dependency | Registration Guard |
|---|---|---|---|
| 1 | debugfs | VFS core initialized | None |
| 2 | tracefs | debugfs mounted (for legacy symlink) | debugfs mount point exists |
| 3 | hugetlbfs | Physical memory allocator + huge page pool (Section 4.2) | Huge page pool initialized |
| 4 | bpffs | eBPF verifier initialized (Section 19.2) | BPF subsystem ready |
| 5 | securityfs | LSM framework initialized (Section 9.8) | At least one LSM registered |
| 6 | efivarfs | EFI runtime services available (Section 2.20) | efi.runtime_services != null (skipped on non-UEFI) |
After registration, each filesystem is mounted at its standard mount point by
the init process (PID 1). The kernel does not auto-mount pseudo-filesystems —
mount commands come from userspace init (systemd .mount units or /etc/fstab).
The exceptions are the root filesystem and devtmpfs, which are mounted by the kernel
before init executes.
14.19 procfs and sysfs¶
procfs and sysfs are the two primary pseudo-filesystems through which the kernel
exposes runtime state to userspace. procfs (/proc) is process-oriented: per-PID
directories, global memory/CPU statistics, and writable sysctl tunables. sysfs
(/sys) is device-oriented: it mirrors the kernel's device model as a directory
hierarchy with one-value-per-file attributes. Both are required for glibc, systemd,
udev, ps, top, htop, lscpu, lsblk, and virtually every Linux system management
tool.
Both filesystems build on the common pseudo-filesystem registration framework defined in Section 14.18.
14.19.1 procfs — Process Information Filesystem¶
Mount point: /proc (type proc)
Magic: 0x9fa0 (PROC_SUPER_MAGIC)
Flags: NODEV | NOEXEC | NOSUID | USERNS_MOUNT
procfs is the kernel's primary process-to-userspace information channel. It is
mounted automatically during early init (before PID 1 executes) and is required
for correct glibc operation (/proc/self/), systemd (cgroup discovery, mount
enumeration, process introspection), and standard POSIX process tools.
14.19.1.1 procfs Registration¶
static PROCFS_TYPE: PseudoFsType = PseudoFsType {
name: "proc",
fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID
| FsFlags::USERNS_MOUNT,
magic: 0x9fa0,
fill_super: proc_fill_super,
mount_opts: &[
MountOptDesc { name: "hidepid", kind: MountOptKind::U32 },
MountOptDesc { name: "gid", kind: MountOptKind::U32 },
MountOptDesc { name: "subset", kind: MountOptKind::Str },
],
};
Mount options:
| Option | Values | Default | Description |
|---|---|---|---|
| hidepid | 0, 1, 2, 4 | 0 | Process visibility. 0 = world-readable. 1 = hide cmdline/status for other users' PIDs. 2 = invisible /proc/PID/ for non-owned processes. 4 = show only /proc/PID/ directories of processes the caller may ptrace (Linux hidepid=ptraceable). |
| gid | GID | 0 | GID that bypasses hidepid restrictions. Allows monitoring daemons (e.g., monit, Prometheus node_exporter) to see all processes without running as root. |
| subset | pid | none | Mount only per-PID entries (no global files). Used for container /proc mounts that only need process information. |
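The interaction between hidepid levels 0 to 2 and the gid bypass can be illustrated with a small decision function. This is a simplified sketch under stated assumptions: `ProcMountOpts`, `Reader`, `Visibility`, and `pid_visibility` are hypothetical names, ownership is reduced to a UID comparison, capability checks are omitted, and hidepid=4 is left out of the model.

```rust
/// What a reader may see of another task's /proc/PID/ directory.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Visibility {
    /// Directory and all of its files are accessible.
    Full,
    /// Directory is listed, but cmdline/status and similar files are hidden.
    DirOnly,
    /// Directory does not appear at all.
    Hidden,
}

/// The two mount options that matter for this decision.
pub struct ProcMountOpts {
    pub hidepid: u32,
    pub gid: u32,
}

/// Identity of the process reading /proc (simplified).
pub struct Reader<'a> {
    pub uid: u32,
    pub groups: &'a [u32],
}

/// Decides visibility of a task owned by `owner_uid` for `reader`.
/// Assumption: owning the task or being a member of the `gid` option's
/// group lifts all hidepid restrictions.
pub fn pid_visibility(opts: &ProcMountOpts, reader: &Reader<'_>, owner_uid: u32) -> Visibility {
    let exempt = reader.uid == owner_uid || reader.groups.contains(&opts.gid);
    match opts.hidepid {
        1 if !exempt => Visibility::DirOnly,
        2 if !exempt => Visibility::Hidden,
        _ => Visibility::Full,
    }
}
```

A monitoring daemon added to the group named by the gid option sees every process even under hidepid=2, which is the use case the option exists for.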
14.19.1.2 ProcEntry Trait¶
Every file or directory in procfs is backed by an implementation of ProcEntry.
Subsystems register entries during init; per-PID entries are instantiated lazily
on first lookup.
/// Trait for procfs file content generation.
///
/// Implementations are typically zero-sized types that read kernel state
/// on demand. No per-file heap allocation occurs — the `ProcEntry` is a
/// `&'static dyn` reference.
pub trait ProcEntry: Send + Sync {
/// Read content into `buf` starting at byte `offset`.
/// Returns the number of bytes written to `buf`.
///
/// For fixed-format files (e.g., `/proc/PID/stat`), the implementation
/// generates the entire content into an internal buffer on the first
/// read (offset=0) and serves subsequent reads from that snapshot.
/// This ensures a consistent view even if the process state changes
/// between read() calls.
fn read(&self, ctx: &ProcReadCtx, buf: &mut [u8], offset: u64) -> Result<usize, Errno>;
/// Write data from `buf`. Returns bytes consumed.
/// Most procfs files are read-only and return `EACCES`.
fn write(&self, ctx: &ProcWriteCtx, buf: &[u8]) -> Result<usize, Errno> {
Err(Errno::EACCES)
}
/// Poll for readability/writability/events. Returns the events that are
/// currently ready. Default: not pollable (empty events).
///
/// This is required for `/proc/sys/` files: programs that `poll()` on
/// sysctl files (e.g., `poll(/proc/sys/vm/overcommit_memory, POLLPRI)`)
/// need notification when the value changes. The sysctl implementation
/// calls `proc_sys_notify()` on write, which wakes pollers.
fn poll(&self, _ctx: &ProcReadCtx, _events: PollEvents) -> PollEvents {
PollEvents::empty()
}
}
/// Notify pollers of a procfs sysctl entry that the value has changed.
/// Called by the sysctl write path after updating the kernel parameter.
/// Wakes all processes polling the file with `POLLPRI`.
fn proc_sys_notify(entry: &dyn ProcEntry) {
// Implementation: the sysctl ProcEntry stores a WaitQueue. poll()
// registers the caller on this WaitQueue. proc_sys_notify() does
// wake_up_all(&entry.waitqueue, POLLPRI). Specific sysctl entries
// that support notification override poll() to register on the waitqueue.
}
/// Context passed to ProcEntry::read().
pub struct ProcReadCtx<'a> {
/// The PID this entry belongs to (None for global entries).
pub pid: Option<Pid>,
/// Credentials of the reading process (for permission checks).
/// RCU-protected reference: credentials can change during a task's
/// lifetime (setuid, capset), so a static reference would be unsound.
pub cred: RcuRef<'a, Credentials>,
}
/// Context passed to ProcEntry::write().
pub struct ProcWriteCtx<'a> {
/// The PID this entry belongs to (None for global entries).
pub pid: Option<Pid>,
/// Credentials of the writing process.
/// RCU-protected reference — see `ProcReadCtx::cred` for rationale.
pub cred: RcuRef<'a, Credentials>,
}
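The snapshot behavior described for `ProcEntry::read()` comes down to serving byte ranges from a buffer generated once. A minimal sketch of that offset handling is shown below; `read_from_snapshot` is a hypothetical helper, and where the snapshot is stored between calls is left open.

```rust
/// Copies the part of `snapshot` that starts at `offset` into `buf`.
/// Returns the number of bytes copied; 0 signals end of file.
pub fn read_from_snapshot(snapshot: &[u8], buf: &mut [u8], offset: u64) -> usize {
    // Offsets at or past the end of the snapshot read as EOF.
    if offset >= snapshot.len() as u64 {
        return 0;
    }
    let start = offset as usize;
    let n = buf.len().min(snapshot.len() - start);
    buf[..n].copy_from_slice(&snapshot[start..start + n]);
    n
}
```

A caller with a small buffer reads the same snapshot in several pieces by advancing `offset` by each return value, so the pieces always join into one consistent line even if the task's state changed in between.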
14.19.1.3 SeqFile Protocol¶
Large procfs files that enumerate variable-length records (e.g., /proc/net/tcp,
/proc/mounts) use the SeqFile protocol to handle partial reads across multiple
read() syscalls without missing or duplicating entries.
/// Sequential file generator. Each call to `show()` produces one logical
/// record. The SeqFile infrastructure handles buffering, partial reads,
/// and seek.
pub trait SeqFileOps: Send + Sync {
/// Position the iterator at the beginning.
/// `pos` is the logical position (0 on first read, or a saved position
/// from a previous lseek). Returns an opaque iterator state, or None
/// if `pos` is beyond the last record.
fn start(&self, ctx: &ProcReadCtx, pos: u64) -> Option<SeqIterState>;
/// Emit one record into `buf`. Returns bytes written.
/// The SeqFile core calls `show()` repeatedly, appending output to an
/// internal page-sized buffer. When the buffer is full, it is returned
/// to the `read()` caller and the remaining records are served on the
/// next `read()`.
fn show(&self, state: &SeqIterState, buf: &mut [u8]) -> Result<usize, Errno>;
/// Advance to the next record. Returns the updated iterator state,
/// or None if iteration is complete.
fn next(&self, state: SeqIterState) -> Option<SeqIterState>;
/// Release any resources held by the iterator.
/// Called when the file is closed or the read sequence is abandoned.
fn stop(&self, state: SeqIterState);
}
/// Opaque iterator state for SeqFile traversal.
/// Subsystems typically store an index, a pointer to the current object,
/// and an RCU read-lock guard or reference count.
pub struct SeqIterState {
/// Logical position counter (incremented by `next()`).
/// Stored across read() calls for correct resume-after-partial-read.
pub index: u64,
/// Bytes emitted so far in the current read() call.
pub count: usize,
/// Subsystem-private state (cast from a concrete type).
pub private: u64,
}
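The way the SeqFile core drives start/show/next/stop across partial reads can be sketched with a reduced model. Everything here is a simplification for illustration: `SeqOps` drops the context and state types of `SeqFileOps` and uses a bare record index, `seq_read` returns a fresh `Vec` per call instead of filling a page-sized buffer, and `Lines` is a hypothetical record source.

```rust
/// Reduced stand-in for SeqFileOps: iterator state is just a record index.
pub trait SeqOps {
    fn start(&self, pos: u64) -> Option<u64>;
    fn show(&self, index: u64, out: &mut Vec<u8>);
    fn next(&self, index: u64) -> Option<u64>;
    fn stop(&self, _index: Option<u64>) {}
}

/// Serves one read(): emits whole records until `cap` bytes would be
/// exceeded, and leaves `*pos` at the first record not yet emitted.
/// A single record larger than `cap` is still emitted on its own.
pub fn seq_read(ops: &dyn SeqOps, pos: &mut u64, cap: usize) -> Vec<u8> {
    let mut out = Vec::new();
    let mut cur = ops.start(*pos);
    while let Some(index) = cur {
        let mut record = Vec::new();
        ops.show(index, &mut record);
        if !out.is_empty() && out.len() + record.len() > cap {
            break; // does not fit: the next read() resumes at this record
        }
        out.extend_from_slice(&record);
        cur = ops.next(index);
        // Remember where to resume; past-the-end once iteration finishes.
        *pos = cur.unwrap_or(index + 1);
    }
    ops.stop(cur);
    out
}

/// Hypothetical record source: one line of text per record.
pub struct Lines(pub Vec<&'static str>);

impl SeqOps for Lines {
    fn start(&self, pos: u64) -> Option<u64> {
        if pos < self.0.len() as u64 { Some(pos) } else { None }
    }
    fn show(&self, index: u64, out: &mut Vec<u8>) {
        out.extend_from_slice(self.0[index as usize].as_bytes());
        out.push(b'\n');
    }
    fn next(&self, index: u64) -> Option<u64> {
        self.start(index + 1)
    }
}
```

Because `*pos` counts records rather than bytes, a reader that issues several small reads receives every record exactly once, which is the property the protocol is meant to guarantee.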
14.19.1.4 procfs Registration API¶
/// Register a global procfs entry (e.g., `/proc/meminfo`).
///
/// `name`: entry name (e.g., "meminfo").
/// `mode`: POSIX permission bits (e.g., 0o444 for read-only).
/// `parent`: parent directory (None = procfs root).
/// `ops`: implementation providing read/write behavior.
///
/// Returns a handle used for removal (live kernel evolution only;
/// built-in entries are never removed).
///
/// # Errors
/// - `EEXIST`: name already registered under parent.
/// - `ENOMEM`: allocation failure.
pub fn proc_create(
name: &'static str,
mode: u16,
parent: Option<&ProcDir>,
ops: &'static dyn ProcEntry,
) -> Result<ProcHandle, VfsError>;
/// Register a global procfs entry backed by SeqFile.
/// Convenience wrapper: creates a ProcEntry that delegates to SeqFileOps.
pub fn proc_create_seq(
name: &'static str,
mode: u16,
parent: Option<&ProcDir>,
ops: &'static dyn SeqFileOps,
) -> Result<ProcHandle, VfsError>;
/// Create a subdirectory under `parent`, or under the procfs root when
/// `parent` is None (e.g., `/proc/net/`).
pub fn proc_mkdir(
name: &'static str,
parent: Option<&ProcDir>,
) -> Result<ProcDir, VfsError>;
/// Opaque handle to a registered procfs directory.
pub struct ProcDir {
dentry: Arc<Dentry>,
}
/// Opaque handle to a registered procfs entry (file or directory).
/// Dropping this handle does NOT remove the entry — entries persist
/// for the kernel's lifetime. Explicit removal via `proc_remove()`
/// is for live kernel evolution only.
pub struct ProcHandle {
dentry: Arc<Dentry>,
}
/// Remove a previously registered procfs entry.
pub fn proc_remove(handle: ProcHandle);
14.19.1.5 Per-PID Directory Lifecycle¶
A /proc/PID/ directory becomes visible when a task is allocated
(Section 8.1) and disappears when the task is reaped
(after wait() collects the zombie). The directory is populated lazily: its child
dentries are instantiated on first lookup, not at task creation time. This avoids
per-fork allocation overhead for the ~20 entries per PID. The lookup callback
(pid_dir_lookup()) resolves the PID to a Task, creates a dentry backed by a
ProcInode, and populates it on demand. No dentries exist for PIDs that have not been
accessed.
For thread-group leaders, /proc/PID/task/ contains subdirectories for each thread
in the group. Each thread subdirectory mirrors the per-PID layout.
14.19.1.6 Mandatory /proc/PID/ Entries¶
These entries are required for glibc, systemd, ps, top, htop, and container runtimes. Format descriptions are normative — field order, separator characters, and field types must match Linux exactly for binary compatibility.
| Entry | Mode | Format | Description |
|---|---|---|---|
| status | 0o444 | Key:\tValue\n lines | Human-readable status. Fields: Name, Umask, State, Tgid, Ngid, Pid, PPid, TracerPid, Uid (4 fields), Gid (4 fields), FDSize, Groups, NStgid, NSpid, NSpgid, NSsid, Kthread, VmPeak, VmSize, VmLck, VmPin, VmHWM, VmRSS, RssAnon, RssFile, RssShmem, VmData, VmStk, VmExe, VmLib, VmPTE, VmSwap, HugetlbPages, CoreDumping, THP_enabled, Threads, SigQ, SigPnd, ShdPnd, SigBlk, SigIgn, SigCgt, CapInh, CapPrm, CapEff, CapBnd, CapAmb, NoNewPrivs, Seccomp, Speculation_Store_Bypass, SpeculationIndirectBranch, Cpus_allowed, Cpus_allowed_list, Mems_allowed, Mems_allowed_list, voluntary_ctxt_switches, nonvoluntary_ctxt_switches. |
| stat | 0o444 | Single line, space-separated | 52 fields: pid (comm) state ppid pgrp session tty_nr tpgid flags minflt cminflt majflt cmajflt utime stime cutime cstime priority nice num_threads itrealvalue starttime vsize rss rsslim startcode endcode startstack kstkesp kstkeip signal blocked sigignore sigcatch wchan nswap cnswap exit_signal processor rt_priority policy delayacct_blkio_ticks guest_time cguest_time start_data end_data start_brk arg_start arg_end env_start env_end exit_code. Matches Linux fs/proc/array.c do_task_stat(); field 52 is exit_code. Note: core_dumping and thp_enabled are in /proc/PID/status (Key:Value format), not stat. |
| statm | 0o444 | Single line, 7 space-separated page counts | size resident shared text lib data dt |
| cmdline | 0o444 | NUL-separated argv bytes | Empty for zombie/kernel threads. |
| environ | 0o400 | NUL-separated envp bytes | Requires ptrace_may_access() or same UID. Returns empty for kernel threads. |
| maps | 0o444 | One line per VMA | start-end perms offset dev inode pathname. Hex addresses, rwxp perms. |
| smaps | 0o444 | Multi-line per VMA | Same header as maps, followed by Size, KernelPageSize, MMUPageSize, Rss, Pss, Pss_Dirty, Shared_Clean, Shared_Dirty, Private_Clean, Private_Dirty, Referenced, Anonymous, LazyFree, AnonHugePages, ShmemPmdMapped, FilePmdMapped, Shared_Hugetlb, Private_Hugetlb, Swap, SwapPss, Locked, THPeligible, VmFlags lines. |
| fd/ | 0o500 | Directory of symlinks | Each entry is a decimal fd number; readlink returns the path of the open file. |
| fdinfo/ | 0o500 | One file per fd | Lines: pos:, flags:, mnt_id:. Additional fields for epoll, eventfd, inotify, fanotify, timerfd, signalfd fds. |
| cgroup | 0o444 | hierarchy-ID:controller-list:cgroup-path per line | For cgroups v2: single line 0::/path. |
| mountinfo | 0o444 | SeqFile, one line per mount | Fields: mount_id, parent_id, major:minor, root, mount_point, mount_options, optional_fields, separator(-), fs_type, mount_source, super_options. See Section 14.6. |
| ns/ | 0o500 | Directory of symlinks | Entries: cgroup, ipc, mnt, net, pid, pid_for_children, time, time_for_children, user, uts, ima. Readlink returns <type>:[<inode>]. The ima entry links to the task's IMA namespace; IMA measurement log output (e.g., /sys/kernel/security/ima/ascii_runtime_measurements) is scoped to the reading task's IMA namespace, so a process in container namespace A sees only namespace A's measurement log entries. See Section 17.1, Section 9.5. |
| oom_score | 0o444 | Single integer (0-2000) | OOM badness score. See Section 4.5. |
| oom_score_adj | 0o644 | Single integer (-1000 to 1000) | OOM adjustment. -1000 = never kill. Writing requires CAP_SYS_RESOURCE for values < 0. |
| limits | 0o444 | Table format | Columns: Limit, Soft Limit, Hard Limit, Units. Rows: Max cpu time, Max file size, Max data size, Max stack size, Max core file size, Max resident set, Max processes, Max open files, Max locked memory, Max address space, Max file locks, Max pending signals, Max msgqueue size, Max nice priority, Max realtime priority, Max realtime timeout. See Section 8.7. |
| io | 0o400 | key: value lines | Fields: rchar, wchar, syscr, syscw, read_bytes, write_bytes, cancelled_write_bytes. Requires ptrace_may_access() or same UID. |
| task/ | 0o555 | Directory of per-thread subdirectories | Each subdirectory is named by TID and contains the same entries as the parent /proc/PID/ (stat, status, maps, etc.) scoped to that thread. |
| exe | n/a | Symlink | Points to the executable file. If the binary has been deleted, the symlink text reads <path> (deleted), and that path no longer resolves (ENOENT). |
| root | n/a | Symlink | Points to the process's root directory (as set by chroot()). |
| cwd | n/a | Symlink | Points to the process's current working directory. |
| comm | 0o644 | Single line, max 16 bytes | Executable name (truncated to TASK_COMM_LEN). Writable: echo newname > /proc/PID/comm sets the task comm (requires same UID or CAP_SYS_PTRACE). |
| wchan | 0o444 | Single symbol name | Wait channel: the kernel function the task is sleeping in, or "0" if running. |
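One consequence of the stat format is worth illustrating from the consumer side: the comm field may itself contain spaces and parentheses, so a parser cannot simply split the line on whitespace. The sketch below shows the usual approach of locating the last closing parenthesis; `StatHead` and `parse_stat_head` are hypothetical names, and only the first four fields are extracted.

```rust
/// First four fields of a /proc/PID/stat line.
#[derive(Debug, PartialEq)]
pub struct StatHead<'a> {
    pub pid: u32,
    pub comm: &'a str,
    pub state: char,
    pub ppid: u32,
}

/// Parses pid, comm, state, and ppid from a stat line.
/// comm is everything between the first '(' and the LAST ')', because
/// the task name may contain spaces and parentheses of its own.
pub fn parse_stat_head(line: &str) -> Option<StatHead<'_>> {
    let open = line.find('(')?;
    let close = line.rfind(')')?;
    if close < open {
        return None;
    }
    let pid = line[..open].trim().parse().ok()?;
    let comm = &line[open + 1..close];
    // The remaining fields are plain space-separated tokens.
    let mut rest = line[close + 1..].split_ascii_whitespace();
    let state = rest.next()?.chars().next()?;
    let ppid = rest.next()?.parse().ok()?;
    Some(StatHead { pid, comm, state, ppid })
}
```

This is why the field order and the parenthesized comm must match exactly: existing tools rely on this shape to find field 3 onward.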
14.19.1.7 Mandatory Global /proc Entries¶
These entries are required for system monitoring tools, container runtimes, and standard POSIX utilities.
| Entry | Mode | Format | Primary Consumers |
|---|---|---|---|
| meminfo | 0o444 | Key: value kB lines | free, top, htop, systemd, Prometheus node_exporter. Fields: MemTotal, MemFree, MemAvailable, Buffers, Cached, SwapCached, Active, Inactive, Active(anon), Inactive(anon), Active(file), Inactive(file), Unevictable, Mlocked, SwapTotal, SwapFree, Zswap, Zswapped, Dirty, Writeback, AnonPages, Mapped, Shmem, KReclaimable, Slab, SReclaimable, SUnreclaim, KernelStack, PageTables, SecPageTables, NFS_Unstable, Bounce, WritebackTmp, CommitLimit, Committed_AS, VmallocTotal, VmallocUsed, VmallocChunk, Percpu, HardwareCorrupted, AnonHugePages, ShmemHugePages, ShmemPmdMapped, FileHugePages, FilePmdMapped, CmaTotal, CmaFree, HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb, DirectMap4k, DirectMap2M, DirectMap1G. |
| cpuinfo | 0o444 | Per-arch key-value blocks | lscpu, nproc, /proc/cpuinfo parsers. Per-CPU block separated by blank line. Format is architecture-specific (x86: processor/vendor_id/model name/flags; ARM: processor/BogoMIPS/Features; RISC-V: hart/isa/mmu). |
| stat | 0o444 | Multi-line | top, mpstat, vmstat. Lines: cpu (aggregate), cpu0..cpuN (per-CPU: user nice system idle iowait irq softirq steal guest guest_nice), intr (per-IRQ counts), ctxt (context switches), btime (boot time epoch), processes (forks since boot), procs_running, procs_blocked, softirq (per-softirq counts). |
| loadavg | 0o444 | Single line | uptime, w, shell prompts. Format: 1min 5min 15min running/total last_pid. |
| uptime | 0o444 | Single line | uptime. Format: seconds_since_boot idle_seconds (both with centisecond precision). |
| version | 0o444 | Single line | uname, build identification. Format: UmkaOS version <version> (<build>) (<compiler>) #<build_number> <config> <date>. |
| filesystems | 0o444 | One per line | mount auto-detection. Format: [nodev]\tfstype. nodev prefix for pseudo-filesystems that have no backing device. |
| self | — | Symlink | Points to /proc/[getpid()]. Resolved per-access to the calling task's TGID. Required by glibc (/proc/self/exe, /proc/self/fd/). |
| thread-self | — | Symlink | Points to /proc/[getpid()]/task/[gettid()]. Resolved per-access to the calling thread's TID. Required for per-thread procfs access (e.g., /proc/thread-self/attr/current for SELinux). |
| mounts | — | Symlink | Points to self/mounts. |
| partitions | 0o444 | Table | lsblk, fdisk. Columns: major, minor, #blocks, name. |
| diskstats | 0o444 | One line per device | iostat, sar. 20 fields per line: major minor name reads_completed reads_merged sectors_read ms_reading writes_completed writes_merged sectors_written ms_writing ios_in_progress ms_io weighted_ms_io discards_completed discards_merged sectors_discarded ms_discarding flush_count ms_flushing. |
| net/ | 0o555 | Directory | Network pseudo-files (per-net-namespace). Entries: dev (interface stats), tcp (TCP sockets), tcp6, udp, udp6, unix (UNIX sockets), route (IPv4 routing), ipv6_route, if_inet6, arp, snmp, snmp6, netstat, sockstat, sockstat6, raw, raw6, packet, protocols, wireless. Each file uses SeqFile. |
| sys/ | 0o555 | Directory tree | Sysctl interface. Writable tunables organized as kernel/, vm/, fs/, net/, dev/. Each leaf file reads/writes one value. See Section 20.9 for the sysctl registration framework. |
| interrupts | 0o444 | Table | Per-CPU IRQ counts. Columns: IRQ number, per-CPU counts, IRQ chip name, hardware IRQ, action name. |
| softirqs | 0o444 | Table | Per-CPU softirq counts. One row per softirq type (HI, TIMER, NET_TX, NET_RX, BLOCK, IRQ_POLL, TASKLET, SCHED, HRTIMER, RCU). |
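To make the "Key: value kB" format of meminfo concrete, the following is a minimal consumer-side sketch of how a monitoring tool might parse it. The helper name `parse_meminfo` is hypothetical and is not part of UmkaOS; the sketch assumes only the line format stated in the table.

```rust
use std::collections::HashMap;

/// Parse `Key: value kB` lines as found in /proc/meminfo.
///
/// Returns a map from field name to its numeric value. Fields with a
/// `kB` suffix are reported in kB; unit-less counters such as
/// `HugePages_Total` are returned unchanged. Malformed lines are skipped.
/// Hypothetical helper, for illustration only.
pub fn parse_meminfo(text: &str) -> HashMap<String, u64> {
    let mut fields = HashMap::new();
    for line in text.lines() {
        // Split "MemTotal:       16384 kB" at the first colon.
        let Some((key, rest)) = line.split_once(':') else { continue };
        // The first whitespace-separated token after the colon is the number.
        let Some(num) = rest.split_whitespace().next() else { continue };
        if let Ok(value) = num.parse::<u64>() {
            fields.insert(key.trim().to_string(), value);
        }
    }
    fields
}
```

A caller would typically read the whole file with `std::fs::read_to_string("/proc/meminfo")` and look up individual keys such as `MemAvailable` in the returned map.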
14.19.1.8 procfs Namespace Awareness¶
Each PID namespace (Section 17.1) has its own view of /proc:
processes in a child PID namespace see only PIDs visible within that namespace.
When procfs is mounted inside a container, fill_super binds the superblock to
the caller's PID namespace. /proc/1/ inside the container refers to the
container's init process, not the host PID 1.
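The per-namespace view can be sketched as a lookup keyed by the namespace the procfs superblock was bound to. This is a simplified model, not the UmkaOS implementation: the `Task` layout, the numeric namespace ids, and both function names are assumptions made for illustration.

```rust
/// Hypothetical task record: the task's PID in every PID namespace in
/// which it is visible, stored as (namespace id, pid) pairs.
pub struct Task {
    pub pids: Vec<(u32, u32)>,
}

/// PID of `task` as seen from namespace `ns`, or None if the task is
/// not visible there.
pub fn pid_in_ns(task: &Task, ns: u32) -> Option<u32> {
    task.pids.iter().find(|(id, _)| *id == ns).map(|(_, pid)| *pid)
}

/// PIDs that would appear as /proc/<pid>/ directories for a procfs
/// superblock bound to namespace `sb_ns`.
pub fn proc_root_pids(tasks: &[Task], sb_ns: u32) -> Vec<u32> {
    tasks.iter().filter_map(|t| pid_in_ns(t, sb_ns)).collect()
}
```

In this model a container's init has an entry such as `(7, 1)` alongside its host-side entry, so a procfs instance bound to namespace 7 lists it as /proc/1/ while the host instance lists it under its host PID.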
14.19.1.9 procfs Permission Model¶
| hidepid | Effect |
|---|---|
| 0 | All /proc/PID/ directories visible to all users (Linux default) |
| 1 | /proc/PID/{cmdline,sched,status} restricted to owner; directory is visible |
| 2 | /proc/PID/ directory invisible to non-owners (opendir returns ENOENT) |
| 4 | Same as 2, plus thread IDs hidden from /proc/PID/task/ |
The gid mount option exempts a group from hidepid restrictions. Processes whose
supplementary groups include the specified GID see all PIDs regardless of hidepid.
This allows monitoring tools to run without CAP_SYS_PTRACE.
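The table and the gid exemption combine into a small decision function. The sketch below is one possible reading of those rules; the type and function names are hypothetical, capability checks (such as CAP_SYS_PTRACE) are deliberately omitted, and the additional thread-ID hiding of hidepid=4 is not modeled.

```rust
/// Outcome of a lookup of /proc/PID by a given caller.
#[derive(Debug, PartialEq, Eq)]
pub enum PidDirAccess {
    /// Directory and all of its files are accessible.
    Full,
    /// Directory is visible, but cmdline, sched, and status are owner-only.
    Restricted,
    /// Directory does not appear to exist (lookup yields ENOENT).
    Hidden,
}

/// Hypothetical caller credentials; field names are illustrative.
pub struct Caller<'a> {
    pub uid: u32,
    pub groups: &'a [u32],
}

/// Decide how /proc/PID (owned by `owner_uid`) appears to `caller`
/// under the given `hidepid` level and optional `gid=` mount option.
pub fn pid_dir_access(
    hidepid: u8,
    exempt_gid: Option<u32>,
    caller: &Caller,
    owner_uid: u32,
) -> PidDirAccess {
    // Owners always see their own processes.
    if caller.uid == owner_uid {
        return PidDirAccess::Full;
    }
    // Members of the gid= group are exempt from hidepid restrictions.
    if let Some(gid) = exempt_gid {
        if caller.groups.contains(&gid) {
            return PidDirAccess::Full;
        }
    }
    match hidepid {
        0 => PidDirAccess::Full,
        1 => PidDirAccess::Restricted,
        // 2 and 4 both hide the directory; unknown values are treated
        // as the most restrictive case in this sketch.
        _ => PidDirAccess::Hidden,
    }
}
```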
14.19.2 sysfs — Device Model Filesystem¶
Mount point: /sys (type sysfs)
Magic: 0x62656572 (SYSFS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID
sysfs mirrors the kernel's device model hierarchy as a directory tree. Every
registered bus, device, driver, and class gets a directory. Attributes are files
that expose exactly one value each (the "sysfs one-value-per-file rule"). This
filesystem is required for udev device discovery, systemd device management,
and all /sys-reading tools (lspci, lsusb, lsblk, ip link, etc.).
14.19.2.1 sysfs Registration¶
static SYSFS_TYPE: PseudoFsType = PseudoFsType {
name: "sysfs",
fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
magic: 0x62656572,
fill_super: sysfs_fill_super,
mount_opts: &[],
};
14.19.2.2 Kobject Model¶
Every kernel object that participates in sysfs embeds or owns a Kobject. A kobject
represents one directory in the sysfs tree. The parent-child relationship between
kobjects defines the directory hierarchy.
/// Kernel object — the unit of representation in sysfs.
///
/// Every kobject corresponds to exactly one directory under `/sys`.
/// Kobjects form a tree: each kobject has at most one parent.
/// The root kobjects (no parent) appear directly under `/sys`.
/// **Cycle detection**: `sysfs_create_kobject()` walks the parent chain
/// (max depth 32) to verify the new parent is not a descendant of the
/// new kobject. If a cycle is detected, creation fails with `ELOOP` and
/// an FMA warning identifies the offending driver.
pub struct Kobject {
/// Name of this object (= directory name in sysfs).
pub name: Box<[u8]>,
/// Parent kobject. None for top-level directories.
pub parent: Option<Arc<Kobject>>,
/// Attribute groups attached to this kobject.
/// Each group's attributes appear as files in this directory.
pub attr_groups: ArrayVec<&'static SysfsGroup, 8>,
/// Kset membership (if any). A kset is a collection of related
/// kobjects that share a uevent domain.
pub kset: Option<Arc<Kset>>,
/// Reference count. Kobject is freed when refcount reaches zero.
pub refcount: AtomicU64,
/// Uevent state: whether this kobject has been announced to userspace.
pub uevent_sent: AtomicBool,
}
/// A kset groups related kobjects and provides the uevent emission
/// context. Bus, class, and device collections are ksets.
pub struct Kset {
/// The kset's own kobject (its sysfs directory).
pub kobj: Kobject,
/// Uevent filter: if set, called before emitting uevents for
/// member kobjects. Returns false to suppress the event.
pub uevent_filter: Option<fn(&Kobject) -> bool>,
}
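Because each kobject names exactly one directory and the parent chain defines the hierarchy, a path relative to /sys can be derived by walking parents up to the root. The sketch below uses a reduced stand-in for `Kobject` (only `name` and `parent`, with `String` instead of `Box<[u8]>`), and the function name `kobject_path` is hypothetical. Reusing the depth bound of 32 from the cycle-detection note is an assumption of this example.

```rust
use std::sync::Arc;

/// Reduced stand-in for the document's `Kobject`: only the two fields
/// needed to derive a path. Not the real structure.
pub struct Kobject {
    pub name: String,
    pub parent: Option<Arc<Kobject>>,
}

/// Depth bound borrowed from the cycle-detection note (assumption).
const MAX_DEPTH: usize = 32;

/// Build the path relative to /sys by walking the parent chain.
/// Returns None if the chain holds more than MAX_DEPTH kobjects.
pub fn kobject_path(kobj: &Kobject) -> Option<String> {
    let mut names = Vec::new();
    let mut cur = Some(kobj);
    while let Some(k) = cur {
        if names.len() == MAX_DEPTH {
            return None;
        }
        names.push(k.name.as_str());
        cur = k.parent.as_deref();
    }
    // Names were collected leaf-first; the path is root-first.
    names.reverse();
    Some(format!("/{}", names.join("/")))
}
```

The result has the same shape as the `DEVPATH` value carried in uevents, for example `/devices/pci0000:00/...` for a PCI device.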
14.19.2.3 SysfsAttribute Trait¶
/// Trait for sysfs file attributes.
///
/// Each attribute is a single file in a kobject's directory.
/// Implementations MUST follow the one-value-per-file rule:
/// `show()` returns exactly one scalar, string, or enumeration value.
/// `store()` parses exactly one value.
pub trait SysfsAttribute: Send + Sync {
/// Attribute file name.
fn name(&self) -> &'static str;
/// POSIX permission bits (e.g., 0o444 read-only, 0o644 read-write).
fn mode(&self) -> u16;
/// Read the attribute value into `buf`. Returns bytes written.
/// The output MUST end with a newline (`\n`) for shell compatibility.
/// Maximum output: PAGE_SIZE - 1 bytes (4095 with 4 KiB pages).
fn show(&self, kobj: &Kobject, buf: &mut [u8]) -> Result<usize, Errno>;
/// Write a new value from `buf`. Returns bytes consumed.
/// Returns `EACCES` for read-only attributes (mode without write bits).
/// Returns `EINVAL` if the value cannot be parsed.
fn store(&self, kobj: &Kobject, buf: &[u8]) -> Result<usize, Errno> {
Err(Errno::EACCES)
}
}
/// A named collection of attributes applied to a kobject together.
///
/// Groups provide atomic attachment: all attributes in a group are
/// created or destroyed as a unit. A kobject may have multiple groups;
/// each group can optionally create a subdirectory.
pub struct SysfsGroup {
/// Subdirectory name. If Some, attributes appear in a named
/// subdirectory of the kobject's directory. If None, attributes
/// appear directly in the kobject's directory.
pub name: Option<&'static str>,
/// Attributes in this group.
pub attrs: &'static [&'static dyn SysfsAttribute],
/// Visibility filter. Called once per attribute at group creation
/// time. Returns the effective mode (0 = skip this attribute).
/// Allows conditional attributes based on hardware capabilities.
pub is_visible: Option<fn(&Kobject, &dyn SysfsAttribute) -> u16>,
}
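As an illustration of the one-value-per-file rule, the following sketch implements a read-write attribute holding a single integer. To keep the example self-contained it redeclares minimal stand-ins for `Kobject`, `Errno`, and the `SysfsAttribute` trait; the attribute `PollIntervalMs` and its file name are hypothetical, and `format!` from std is used for brevity where kernel code would likely format into the buffer directly.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Minimal stand-ins so the sketch compiles on its own.
pub struct Kobject;

#[derive(Debug, PartialEq)]
pub enum Errno {
    EACCES,
    EINVAL,
}

pub trait SysfsAttribute: Send + Sync {
    fn name(&self) -> &'static str;
    fn mode(&self) -> u16;
    fn show(&self, kobj: &Kobject, buf: &mut [u8]) -> Result<usize, Errno>;
    fn store(&self, _kobj: &Kobject, _buf: &[u8]) -> Result<usize, Errno> {
        Err(Errno::EACCES)
    }
}

/// Hypothetical read-write attribute exposing one integer value.
pub struct PollIntervalMs(pub AtomicU32);

impl SysfsAttribute for PollIntervalMs {
    fn name(&self) -> &'static str {
        "poll_interval_ms"
    }

    fn mode(&self) -> u16 {
        0o644
    }

    /// Emit exactly one value, terminated by a newline.
    fn show(&self, _kobj: &Kobject, buf: &mut [u8]) -> Result<usize, Errno> {
        let text = format!("{}\n", self.0.load(Ordering::Relaxed));
        let bytes = text.as_bytes();
        if bytes.len() > buf.len() {
            return Err(Errno::EINVAL);
        }
        buf[..bytes.len()].copy_from_slice(bytes);
        Ok(bytes.len())
    }

    /// Parse exactly one value; surrounding whitespace is ignored.
    fn store(&self, _kobj: &Kobject, buf: &[u8]) -> Result<usize, Errno> {
        let text = std::str::from_utf8(buf).map_err(|_| Errno::EINVAL)?;
        let value: u32 = text.trim().parse().map_err(|_| Errno::EINVAL)?;
        self.0.store(value, Ordering::Relaxed);
        Ok(buf.len())
    }
}
```

A read-only attribute would return a mode such as 0o444 and rely on the default `store()`, which yields `EACCES`.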
14.19.2.4 sysfs Kernel API¶
/// Create a kobject directory in sysfs and attach its attribute groups.
///
/// The directory is created under the parent's directory (or the sysfs
/// root if parent is None). All attribute groups are instantiated
/// atomically: if any attribute creation fails, the entire directory
/// is rolled back.
///
/// Typically called indirectly through `device_register()`, `bus_register()`,
/// or `class_register()` rather than directly by subsystem code.
pub fn sysfs_create_kobject(kobj: &Arc<Kobject>) -> Result<(), VfsError>;
/// Attach an additional attribute group to an existing kobject.
/// Used when subsystems add attributes after initial registration
/// (e.g., driver-specific attributes added at probe time).
pub fn sysfs_create_group(
kobj: &Arc<Kobject>,
group: &'static SysfsGroup,
) -> Result<(), VfsError>;
/// Remove an attribute group from a kobject.
pub fn sysfs_remove_group(
kobj: &Arc<Kobject>,
group: &'static SysfsGroup,
);
/// Create a symbolic link in sysfs. Used for cross-references:
/// e.g., `/sys/class/net/eth0/device` → `/sys/devices/pci0000:00/...`.
///
/// `kobj`: the directory where the symlink is created.
/// `target`: the kobject the symlink points to.
/// `name`: the symlink file name.
pub fn sysfs_create_link(
kobj: &Arc<Kobject>,
target: &Arc<Kobject>,
name: &'static str,
) -> Result<(), VfsError>;
/// Remove a symbolic link.
pub fn sysfs_remove_link(kobj: &Arc<Kobject>, name: &'static str);
/// Notify userspace that an attribute value has changed.
/// Wakes any process blocked in `poll()` / `select()` on the
/// attribute file. Used by the thermal subsystem, battery driver,
/// and power management to notify udev/systemd of state changes.
pub fn sysfs_notify(kobj: &Arc<Kobject>, attr_name: &'static str);
/// Remove a kobject and all its attribute groups from sysfs.
/// Called during device removal or driver unbind.
pub fn sysfs_remove_kobject(kobj: &Arc<Kobject>);
14.19.2.5 Mandatory /sys Hierarchies¶
These top-level directories are required for udev, systemd, and standard Linux device management tools.
14.19.2.5.1 /sys/devices/ — Physical Device Tree¶
The canonical device hierarchy. Every device registered via
Section 11.4 appears here, organized by physical
topology: system/cpu/, pci0000:00/, platform/, virtual/. Each device
directory contains:
- Standard attributes: uevent, power/ (runtime PM state), driver (symlink), subsystem (symlink to bus/class).
- Device-specific attributes added by the device driver at probe time.
- Child device directories for hierarchical devices (e.g., PCI bridge children).
14.19.2.5.2 /sys/bus/ — Bus Type Directories¶
One directory per registered bus type (pci, usb, platform, i2c, spi, etc.). Each bus directory contains:
| Subdirectory | Content |
|---|---|
| devices/ | Symlinks to /sys/devices/... for every device on this bus. |
| drivers/ | One subdirectory per registered driver. Each driver directory contains bind (write device name to force-bind), unbind (write device name to force-unbind), and new_id (write vendor:device to add dynamic ID). |
| drivers_probe | Write a device name to trigger re-probe. |
| drivers_autoprobe | 1 = auto-probe new devices (default), 0 = manual binding only. |
14.19.2.5.3 /sys/class/ — Device Class Directories¶
One directory per device class. Each class directory contains symlinks to the
device directories in /sys/devices/. Classes group devices by function rather
than bus topology:
| Class | Content | Primary Tool |
|---|---|---|
| net/ | Network interfaces (eth0, wlan0, lo) | ip link, NetworkManager |
| block/ | Block devices (sda, nvme0n1, dm-0) | lsblk |
| tty/ | Terminal devices (ttyS0, ttyUSB0, pts/0) | stty, getty |
| input/ | Input devices (event0, mice) | libinput, evtest |
| hwmon/ | Hardware monitoring (temperatures, fans, voltages) | sensors, lm-sensors |
| thermal/ | Thermal zones and cooling devices | thermald |
| power_supply/ | Battery and AC adapter status | upower |
| backlight/ | Display backlight | xrandr, brightnessctl |
| leds/ | LED devices | ledctl |
| sound/ | ALSA sound devices | aplay, alsamixer |
| drm/ | DRM/KMS display devices | modetest, Xorg |
| misc/ | Miscellaneous devices | Various |
14.19.2.5.4 /sys/block/ — Block Devices¶
Symlinks into /sys/devices/ for all block devices. Each block device directory
contains queue/ (scheduler, nr_requests, read_ahead_kb), stat (I/O counters),
size (sectors), and partition subdirectories.
14.19.2.5.5 /sys/fs/ — Filesystem-Specific Controls¶
| Directory | Content |
|---|---|
| cgroup/ | cgroup filesystem mount controls (Section 17.2) |
| ext4/ | Per-mount ext4 tuning (when ext4 driver is active) |
| fuse/ | FUSE connection controls (Section 14.11) |
| selinux/ | SELinux policy interface (duplicate of securityfs for compat) |
14.19.2.5.6 /sys/kernel/ — Kernel Parameters¶
| Directory | Content |
|---|---|
| mm/ | Memory management controls (transparent_hugepage/, hugepages/, ksm/) |
| debug/ | debugfs mount point (when debugfs is mounted here) |
| security/ | securityfs mount point. security/ima/ascii_runtime_measurements output is scoped to the reading task's IMA namespace (Section 9.5). |
| tracing/ | tracefs mount point |
| irq/ | Default IRQ affinity |
| uevent_seqnum | Monotonic uevent sequence number (u64, for udev ordering) |
| kexec_loaded | Always 0 in UmkaOS. Live kernel evolution (Section 13.18) replaces kexec; the traditional kexec path is not implemented. Retained for compatibility with tools that read this file. |
| kexec_crash_loaded | Always 0 in UmkaOS. Crash recovery uses the live evolution infrastructure, not kexec-based kdump. Retained for compatibility. |
| vmcoreinfo | Crash dump layout information |
14.19.2.5.7 /sys/module/ — Loaded Modules and Parameters¶
One directory per loaded module (including built-in modules that have parameters). Each module directory contains:
| Entry | Content |
|---|---|
| parameters/ | One file per module parameter. Read returns current value; write changes it (if the parameter is writable). |
| refcnt | Module reference count. |
| coresize | Size of the module's core section in bytes. |
| initsize | Size of the module's init section (0 after init completes). |
| holders/ | Symlinks to modules that depend on this one. |
14.19.2.5.8 /sys/power/ — Power Management¶
| Entry | Mode | Description |
|---|---|---|
| state | 0o644 | Write mem, disk, freeze to trigger system suspend. Read returns available states. See Section 7.5. |
| wakeup_count | 0o644 | Wakeup event counter for suspend synchronization. |
| mem_sleep | 0o644 | Preferred suspend-to-RAM variant (s2idle, shallow, deep). |
| disk | 0o644 | Hibernate method (platform, shutdown, reboot, suspend). |
| pm_async | 0o644 | 1 = async device suspend/resume (default), 0 = sequential. |
| image_size | 0o644 | Maximum hibernate image size in bytes. |
| resume | 0o200 | Write MAJOR:MINOR to set the resume device for hibernation. |
14.19.2.6 uevent Mechanism¶
Kobject state changes generate uevents that are delivered to userspace via two
channels: a netlink multicast socket (NETLINK_KOBJECT_UEVENT, group 1) and the
/sys/*/uevent file.
/// Uevent actions. Each action generates a NETLINK_KOBJECT_UEVENT
/// message and updates the kobject's `uevent` sysfs file.
#[derive(Clone, Copy, Debug)]
pub enum KobjAction {
/// Device/kobject added. udev creates /dev nodes and runs rules.
Add,
/// Device/kobject removed. udev removes /dev nodes.
Remove,
/// Device state changed (e.g., firmware loaded, link state changed).
Change,
/// Kobject moved to a different parent or renamed.
Move,
/// Device brought online (e.g., CPU, memory block).
Online,
/// Device taken offline.
Offline,
/// Device bound to a driver.
Bind,
/// Device unbound from a driver.
Unbind,
}
/// Environment variables included in every uevent message.
pub struct UeventEnv {
/// Key-value pairs. Standard keys:
/// - `ACTION`: "add", "remove", "change", "move", "online",
/// "offline", "bind", "unbind"
/// - `DEVPATH`: sysfs path relative to /sys (e.g., "/devices/pci0000:00/...")
/// - `SUBSYSTEM`: bus or class name (e.g., "pci", "net", "block")
/// - `SEQNUM`: monotonic u64 sequence number (never wraps in 50+ years
/// at 10 billion events/sec)
/// - `DEVTYPE`: device type within subsystem (e.g., "disk", "partition")
/// - `DRIVER`: driver name (present for bind/unbind)
/// - `MAJOR`, `MINOR`: device numbers (present for char/block devices)
/// - `DEVNAME`: device name for /dev (e.g., "sda", "ttyS0")
///
/// Additional subsystem-specific keys are appended by the device's
/// `uevent()` callback.
pub vars: ArrayVec<UeventVar, 32>,
}
/// Single uevent environment variable.
pub struct UeventVar {
pub key: &'static str,
pub value: ArrayVec<u8, 256>,
}
/// Emit a uevent for a kobject.
///
/// 1. Increment the global uevent sequence number (AtomicU64).
/// 2. Build the UeventEnv: standard keys + subsystem-specific keys
/// from the kobject's `uevent()` callback.
/// 3. Format the netlink message: `ACTION@DEVPATH\0KEY=VALUE\0...`
/// 4. Multicast via NETLINK_KOBJECT_UEVENT to all listeners (udevd).
/// 5. Write the formatted uevent to `/sys/<devpath>/uevent` for
/// manual re-trigger (`echo add > /sys/devices/.../uevent`).
pub fn kobject_uevent(kobj: &Arc<Kobject>, action: KobjAction) -> Result<(), Errno>;
/// Emit a uevent with additional environment variables.
/// Used when the subsystem needs to include extra key-value pairs
/// beyond what the standard `uevent()` callback provides.
pub fn kobject_uevent_env(
kobj: &Arc<Kobject>,
action: KobjAction,
extra_env: &[UeventVar],
) -> Result<(), Errno>;
14.19.2.7 uevent Delivery Path¶
- Driver or subsystem calls kobject_uevent(kobj, action).
- Kernel increments global UEVENT_SEQNUM (AtomicU64, visible at /sys/kernel/uevent_seqnum).
- UeventEnv is built: ACTION, DEVPATH, SUBSYSTEM, SEQNUM, plus subsystem-specific variables from the kobject's uevent() method.
- Netlink multicast: the formatted message is sent to NETLINK_KOBJECT_UEVENT group 1. udevd receives the message, matches it against udev rules, and creates/removes /dev nodes, loads firmware, sets permissions, runs RUN commands, etc.
- The uevent file in the kobject's sysfs directory is updated. Writing an action name to this file re-triggers the uevent (e.g., echo change > /sys/devices/.../uevent forces udev to re-process the device).
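The netlink payload layout `ACTION@DEVPATH\0KEY=VALUE\0...` can be sketched as a small formatter. This is an illustrative model only: `format_uevent` and `push_var` are hypothetical names, the enum is a local copy of `KobjAction`, and a heap-allocated `Vec<u8>` stands in for whatever fixed-capacity buffer the kernel would use.

```rust
/// Local copy of the uevent actions for this self-contained sketch.
#[derive(Clone, Copy, Debug)]
pub enum KobjAction {
    Add,
    Remove,
    Change,
    Move,
    Online,
    Offline,
    Bind,
    Unbind,
}

impl KobjAction {
    /// Lower-case action name used in the `ACTION` key and the header.
    pub fn as_str(self) -> &'static str {
        match self {
            KobjAction::Add => "add",
            KobjAction::Remove => "remove",
            KobjAction::Change => "change",
            KobjAction::Move => "move",
            KobjAction::Online => "online",
            KobjAction::Offline => "offline",
            KobjAction::Bind => "bind",
            KobjAction::Unbind => "unbind",
        }
    }
}

/// Append one NUL-terminated `KEY=VALUE` record.
fn push_var(msg: &mut Vec<u8>, key: &str, value: &str) {
    msg.extend_from_slice(key.as_bytes());
    msg.push(b'=');
    msg.extend_from_slice(value.as_bytes());
    msg.push(0);
}

/// Build the payload: header, standard keys, then subsystem extras.
pub fn format_uevent(
    action: KobjAction,
    devpath: &str,
    subsystem: &str,
    seqnum: u64,
    extra: &[(&str, &str)],
) -> Vec<u8> {
    let mut msg = Vec::new();
    // Header: "ACTION@DEVPATH\0".
    msg.extend_from_slice(action.as_str().as_bytes());
    msg.push(b'@');
    msg.extend_from_slice(devpath.as_bytes());
    msg.push(0);
    // Standard keys present in every uevent.
    push_var(&mut msg, "ACTION", action.as_str());
    push_var(&mut msg, "DEVPATH", devpath);
    push_var(&mut msg, "SUBSYSTEM", subsystem);
    push_var(&mut msg, "SEQNUM", &seqnum.to_string());
    // Subsystem-specific keys follow the standard ones.
    for (key, value) in extra {
        push_var(&mut msg, key, value);
    }
    msg
}
```

A listener such as udevd splits the payload on NUL bytes: the first record identifies the event and the remaining records form the environment.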
14.19.2.8 sysfs Namespace Awareness¶
sysfs is network-namespace-aware for /sys/class/net/: each network namespace
(Section 17.1) sees only its own interfaces. Other sysfs hierarchies
are shared across all namespaces (devices, buses, and classes other than net/ are
global). This matches Linux behavior. Device entries under /sys/class/net/ are
tagged with their owning network namespace; readdir() filters entries to show only
devices in the caller's net namespace.
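The readdir() filtering described above amounts to comparing each entry's namespace tag with the caller's namespace. The sketch below models this with plain integers as namespace ids; `NetClassEntry` and `readdir_class_net` are hypothetical names, not UmkaOS types.

```rust
/// Hypothetical tagged entry under /sys/class/net/.
pub struct NetClassEntry {
    /// Interface name (directory entry name).
    pub name: &'static str,
    /// Identifier of the owning network namespace.
    pub netns_id: u64,
}

/// Names visible to a caller in network namespace `caller_netns`.
/// Entries owned by other namespaces are skipped entirely.
pub fn readdir_class_net(entries: &[NetClassEntry], caller_netns: u64) -> Vec<&'static str> {
    entries
        .iter()
        .filter(|e| e.netns_id == caller_netns)
        .map(|e| e.name)
        .collect()
}
```

Two namespaces can each hold an interface named lo without conflict, since each caller sees only the entry tagged with its own namespace.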
14.19.2.9 sysfs Binary Attributes¶
For attributes that are not human-readable text (firmware blobs, ACPI tables, PCIe config space), sysfs provides binary attributes:
/// Binary attribute: arbitrary-length read/write with offset support.
/// Used for firmware upload, PCI config space, ACPI tables, etc.
pub trait SysfsBinAttribute: Send + Sync {
/// Attribute file name.
fn name(&self) -> &'static str;
/// POSIX permission bits.
fn mode(&self) -> u16;
/// Maximum file size (for `stat()` reporting and write bounds checking).
fn size(&self) -> usize;
/// Read into `buf` at `offset`. Returns bytes read.
fn read(&self, kobj: &Kobject, buf: &mut [u8], offset: u64) -> Result<usize, Errno>;
/// Write from `buf` at `offset`. Returns bytes written.
fn write(&self, kobj: &Kobject, buf: &[u8], offset: u64) -> Result<usize, Errno> {
Err(Errno::EACCES)
}
}
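Offset handling is the main difference from text attributes. The following sketch implements a read-only binary attribute backed by a static byte slice; it redeclares minimal stand-ins for `Kobject`, `Errno`, and the trait so that it compiles on its own, and `StaticBlob` is a hypothetical type. Returning 0 bytes for reads at or past the end is an assumption modeled on ordinary file semantics.

```rust
/// Minimal stand-ins so the sketch compiles on its own.
pub struct Kobject;

#[derive(Debug, PartialEq)]
pub enum Errno {
    EACCES,
}

pub trait SysfsBinAttribute: Send + Sync {
    fn name(&self) -> &'static str;
    fn mode(&self) -> u16;
    fn size(&self) -> usize;
    fn read(&self, kobj: &Kobject, buf: &mut [u8], offset: u64) -> Result<usize, Errno>;
    fn write(&self, _kobj: &Kobject, _buf: &[u8], _offset: u64) -> Result<usize, Errno> {
        Err(Errno::EACCES)
    }
}

/// Hypothetical read-only blob, e.g. a snapshot of a firmware table.
pub struct StaticBlob {
    pub name: &'static str,
    pub data: &'static [u8],
}

impl SysfsBinAttribute for StaticBlob {
    fn name(&self) -> &'static str {
        self.name
    }

    fn mode(&self) -> u16 {
        0o444
    }

    fn size(&self) -> usize {
        self.data.len()
    }

    fn read(&self, _kobj: &Kobject, buf: &mut [u8], offset: u64) -> Result<usize, Errno> {
        // Clamp the offset to the blob length; the result fits in usize
        // because it never exceeds data.len().
        let start = offset.min(self.data.len() as u64) as usize;
        // Copy as much as fits in the caller's buffer (0 at end of data).
        let n = buf.len().min(self.data.len() - start);
        buf[..n].copy_from_slice(&self.data[start..start + n]);
        Ok(n)
    }
}
```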
14.19.3 Boot Initialization Order¶
procfs and sysfs are among the earliest pseudo-filesystems mounted, as many subsequent boot steps depend on them:
| Order | Filesystem | Dependency | Notes |
|---|---|---|---|
| 1 | sysfs | VFS core initialized | Mounted before device probing begins; bus/class directories must exist for device registration. |
| 2 | procfs | VFS core + PID allocator | Mounted before PID 1 executes; glibc requires /proc/self/. |
| 3 | devtmpfs | sysfs | Device nodes reference sysfs kobjects. |
procfs and sysfs are kernel-mounted (not user-mounted): the kernel mounts them during early init before transferring control to PID 1. This differs from debugfs, tracefs, and other pseudo-filesystems, which are mounted by userspace init.