Chapter 4: Memory Management¶

Physical allocator, virtual memory, page tables, slab, NUMA, compression tier, page cache

Memory management runs entirely within UmkaOS Core. The common case (page fault handling, allocation, deallocation) involves zero protection domain crossings.

The allocator and VMM are factored into non-replaceable data primitives and replaceable policy layers for 50-year uptime (Section 4.2, Section 4.8).

4.1 Boot Allocator¶

The boot allocator provides memory to early-init code before the buddy allocator and slab are operational. It initialises from firmware-provided memory maps (ACPI/E820 on x86, Device Tree on ARM/RISC-V) and is retired once the buddy allocator takes over.

Design: A single-pass, forward-bumping allocator over a sorted list of free physical ranges. No free operation — allocations are permanent until the buddy allocator is initialised. This is intentional: boot-time data (PerCpu arrays, NUMA topology, RCU state, GDT/IDT, early page tables) has kernel-lifetime and is never freed.

4.1.1 Kernel Address Types¶

/// Virtual address newtype. Wraps `usize` — matches the architecture's
/// native virtual address width (32-bit on ARMv7/PPC32, 64-bit on others).
/// Prevents accidental mixing of virtual and physical addresses.
///
/// **ABI boundary rule**: `VirtAddr` is a kernel-internal type. ABI-crossing
/// structs (syscall parameters, driver SDK types) use raw `u64` or `Le64`.
#[derive(Clone, Copy, Eq, PartialEq, Ord, PartialOrd, Hash, Debug)]
#[repr(transparent)]
pub struct VirtAddr(pub usize);

/// Physical address newtype. Wraps `u64` on ALL architectures because
/// ARMv7 LPAE and PPC32 have >32-bit physical addresses (up to 40 bits).
/// Using `u64` uniformly avoids truncation on 32-bit systems with large
/// physical address spaces.
#[derive(Clone, Copy, Eq, PartialEq, Ord, PartialOrd, Hash, Debug)]
#[repr(transparent)]
pub struct PhysAddr(pub u64);

/// Common address operations shared by VirtAddr and PhysAddr.
pub trait Address: Copy {
    fn as_usize(self) -> usize;
    fn as_u64(self) -> u64;
    fn page_aligned(self) -> bool;
    fn page_offset(self) -> usize;
    fn align_up(self, align: usize) -> Self;
    fn align_down(self, align: usize) -> Self;
}

impl VirtAddr {
    pub const MAX: Self = Self(usize::MAX);
    pub fn new(val: usize) -> Self { Self(val) }
    pub fn as_usize(self) -> usize { self.0 }
    pub fn as_u64(self) -> u64 { self.0 as u64 }
}

impl PhysAddr {
    pub const MAX: Self = Self(u64::MAX);
    pub fn new(val: u64) -> Self { Self(val) }
    pub fn as_u64(self) -> u64 { self.0 }
    /// Convert to `usize`. **WARNING**: On 32-bit architectures (ARMv7,
    /// PPC32), this truncates physical addresses above 4 GB because `usize`
    /// is 32 bits. Prefer `as_u64()` for LPAE-safe code that may handle
    /// addresses in the >4 GB range. The debug assertion catches truncation
    /// during development.
    pub fn as_usize(self) -> usize {
        debug_assert!(
            self.0 <= usize::MAX as u64,
            "PhysAddr::as_usize() truncation on 32-bit: {:#x}",
            self.0,
        );
        self.0 as usize
    }
}

// Arithmetic: Add<usize>, Sub<usize> for both types.
// Sub<Self> returns usize (distance). Add<Self> is NOT implemented
// (adding two addresses is meaningless).
// Debug-only: VirtAddr canonical address checking on x86-64 (sign extension),
// AArch64 tag bit checking.

/// Boot-phase allocator. Initialised from firmware memory tables before any other
/// kernel allocator. Retired (replaced by buddy allocator) during mem_init().
/// All allocations are permanent (no free). Single-threaded use only (pre-SMP).
/// Maximum number of firmware memory map entries the boot allocator can hold.
/// 256 exceeds the maximum observed E820/UEFI memory map entries (Linux caps
/// at 128; CXL expanders add ~10-20 entries per device). Increase this value
/// at build time if a system requires more entries.
pub const BOOT_MEM_MAX_RANGES: usize = 256;

pub struct BootAlloc {
    /// Sorted list of free physical memory ranges from firmware.
    /// Populated from E820 (x86), UEFI memory map, or Device Tree /memory nodes.
    /// NOTE: `BOOT_MEM_MAX_RANGES` entries is a boot-time constraint for the
    /// initial bump allocator that operates before dynamic allocation is
    /// available. Static arrays are necessary at this stage.
    ///
    /// **Overflow**: If `parse_firmware_memory_map()` encounters more than
    /// `BOOT_MEM_MAX_RANGES` entries, the kernel panics with
    /// "boot: memory map exceeds BOOT_MEM_MAX_RANGES entries". This is
    /// a hard boot failure — there is no fallback.
    ranges: [PhysRange; BOOT_MEM_MAX_RANGES],
    /// Number of valid entries in `ranges`.
    nr_ranges: usize,
    /// Next allocation pointer within the current range.
    current_top: PhysAddr,
    /// Index of the range being served.
    current_range: usize,
}

/// A contiguous physical memory range.
pub struct PhysRange {
    pub base: PhysAddr,
    pub end:  PhysAddr,  // exclusive
}

impl BootAlloc {
    /// Allocate `size` bytes with `align` alignment (must be power of two).
    /// Searches ranges in order, advancing to the next free range when the
    /// current one is exhausted. Panics with "boot: out of memory" if all
    /// ranges are exhausted without satisfying the request (i.e.,
    /// `current_range >= nr_ranges` after scanning all entries).
    /// Returns a physical address; caller maps it into the kernel virtual window.
    ///
    /// # Panics
    ///
    /// Panics if `align` is not a power of two or if `size` is zero.
    pub fn alloc(&mut self, size: usize, align: usize) -> PhysAddr {
        debug_assert!(align.is_power_of_two(), "boot alloc: align must be power of two");
        debug_assert!(size > 0, "boot alloc: zero-size allocation");
        // ... align_up + bump logic ...
    }

    /// Mark a physical range as reserved (used by ACPI tables, initramfs, etc.)
    /// so it is excluded from allocatable ranges. Must be called before alloc().
    pub fn reserve(&mut self, base: PhysAddr, size: usize);

    /// Hand off all remaining free ranges to the buddy allocator.
    /// Called once during mem_init(). After this call, BootAlloc is inert
    /// and further `alloc()` calls panic.
    ///
    /// **Precondition**: Must be called exactly once, after the buddy
    /// allocator is constructed but before slab_init(). A second call
    /// panics (`assert!(!self.retired)`).
    pub fn hand_off_to_buddy(&mut self, buddy: &mut BuddyAllocator);
}

Initialisation sequence (arch-independent):

0. BSS pre-allocator — before any dynamic allocation is possible, the kernel uses
   BOOTSTRAP_BUF, a 64 KB static buffer in .bss, for the earliest allocations
   (initial page tables, early serial console structs). This is a trivial bump
   allocator over a fixed array; it is superseded by BootAlloc in step 2 and
   never used after that point.
1. arch_early_init() — establish identity map of first 1 GB, enable MMU/paging.
2. parse_firmware_memory_map() — read E820/UEFI/DT, build BootAlloc.ranges[].
      After copying entries from firmware, sanitize the range list:
      a. Panic if any entry has `base >= end` (degenerate range).
      b. Sort `ranges[]` by `base` address in ascending order.
      c. Merge overlapping or adjacent free entries (same memory type).
      d. Clip any free entry that overlaps a reserved region.
      This sanitization is required because UEFI/BIOS firmware occasionally
      produces overlapping or unsorted memory map entries (Linux's
      `e820__range_add()` in `arch/x86/kernel/e820.c` performs equivalent
      sanitization). The bump allocator's `alloc()` function relies on
      sorted, non-overlapping ranges; without sanitization, overlapping
      entries cause double-allocation and the buddy hand-off (step 7b)
      would double-free pages.
3. reserve_kernel_image() — mark kernel .text/.data/.bss as reserved.
4. reserve_initramfs() — mark initramfs region as reserved.
5. reserve_acpi_tables() — mark RSDP/XSDT/MADT/SRAT/HMAT regions as reserved.
6. BootAlloc available for use by all early init code.
7. mem_init():
   a. Allocate per-NUMA BuddyAllocator structures from BootAlloc.
   b. Call boot_alloc.hand_off_to_buddy() — all remaining free pages enter buddy.
   c. Mark BootAlloc as retired. Further alloc() panics.
8. slab_init() — build slab caches on top of buddy.
9. Normal allocation (Box::new, Arc::new, etc.) is now available.

Key differences from Linux memblock: No deferred boot-time allocation tracking, no late reservations after step 6, no mirror/hotplug bookkeeping at this stage (handled by NUMA topology after buddy is up). Simpler invariant: the boot allocator is a one-pass bump allocator that retires cleanly.

4.2 Physical Memory Allocator¶

Per-CPU Free Page Pools (hot path, no locks)
    |
    v
Per-NUMA-Node Buddy Allocator (warm path, per-node lock)
    |
    v
Global Buddy Allocator (cold path, rare fallback)

Per-CPU free page pools: Each CPU maintains a private pool of free pages. Allocation and deallocation from this pool require no locking and no atomic operations.
Per-NUMA buddy allocator: When per-CPU pools are exhausted, pages are allocated from the NUMA-local buddy allocator. This uses a per-node lock (minimal contention because per-CPU pools absorb most traffic).
Page order: Buddy allocator manages orders 0-10 (4KB to 4MB).
NUMA awareness: Allocations prefer the requesting CPU's NUMA node. Cross-node allocation is a fallback with configurable policy.

See also: Section 2.23 (Hardware Memory Safety) hooks into the physical and slab allocators to assign MTE tags on allocation and clear them on free, providing hardware-assisted use-after-free and buffer overflow detection.

GfpFlags — Memory Allocation Flags:

bitflags! {
    /// Memory allocation flags passed to the physical page allocator and slab allocator.
    ///
    /// **Context rules** — must be respected to avoid deadlocks:
    /// - In interrupt context or under spinlock: use `GFP_ATOMIC` or `GFP_NOWAIT`.
    /// - In normal schedulable kernel context: use `GFP_KERNEL`.
    /// - From within filesystem code (holding inode/page locks): add `GFP_NOFS`.
    /// - From within block driver or I/O path: add `GFP_NOIO`.
    ///
    /// **UmkaOS vs Linux**: Linux uses a raw `gfp_t` typedef (u32) with scattered
    /// `#define` constants. UmkaOS uses `bitflags!` for compile-time type safety —
    /// invalid flag combinations (e.g., GFP_ATOMIC | GFP_KERNEL) are caught at
    /// the call site rather than producing silent runtime bugs.
    pub struct GfpFlags: u32 {
        // --- Base zone modifiers (where to allocate from) ---
        //
        // Zone boundaries (architecture-specific):
        //
        //   x86-64:     DMA [0, 16 MB)  |  DMA32 [16 MB, 4 GB)  |  Normal [4 GB, max_phys)
        //   AArch64:    DMA [0, device-specific IOMMU limit]  |  Normal [above DMA]
        //   ARMv7:      DMA [0, device-specific]  |  Normal [above DMA]  |  HighMem [above lowmem_limit]
        //   RISC-V:     DMA [0, device-specific]  |  Normal [above DMA]
        //
        // Zone fallback order: GFP_DMA constrains allocation to the DMA zone only.
        // GFP_DMA32 allows DMA + DMA32 zones. Default (no zone flag) falls back
        // through Normal → DMA32 → DMA until a suitable page is found.

        /// Allocate from the DMA zone (typically <16 MB on x86; architecture-specific).
        /// Required for ISA-DMA-capable devices. Use `GFP_DMA32` for 32-bit DMA.
        /// Bit positions match Linux `include/linux/gfp_types.h` zone modifier
        /// bits for tracepoint compatibility (perf, ftrace decode raw gfp_t values).
        const DMA              = 0x0000_0001;  // bit 0, ___GFP_DMA
        /// Allocate from HIGHMEM zone if available (32-bit systems only).
        /// Not meaningful on 64-bit architectures where all RAM is directly mapped.
        const HIGHMEM          = 0x0000_0002;  // bit 1, ___GFP_HIGHMEM
        /// Allocate from the DMA32 zone (below 4 GB physical address).
        /// Use for PCI devices that cannot address >4 GB.
        const DMA32            = 0x0000_0004;  // bit 2, ___GFP_DMA32
        /// Page is movable by the memory compactor (can be migrated without
        /// affecting correctness from the kernel's perspective).
        const MOVABLE          = 0x0000_0008;  // bit 3, ___GFP_MOVABLE

        // --- Reclaim / sleeping modifiers ---

        /// May invoke the page reclaimer and sleep waiting for memory.
        /// The standard flag for all normal sleepable kernel allocations.
        const RECLAIM          = 0x0000_0010;
        /// May sleep (yield the CPU) while waiting for memory.
        const IO               = 0x0000_0020;
        /// Disable filesystem callbacks in the reclaim path. Use when the
        /// caller holds filesystem locks (writepage, readpage, etc.) to
        /// prevent deadlock via re-entrant filesystem calls.
        const FS               = 0x0000_0040;
        /// Disable all I/O (including swap writeback) in the reclaim path.
        /// Use from block drivers or storage paths that must not block on I/O.
        const WRITE            = 0x0000_0080;
        /// Allow kswapd reclaim (can wake kswapd if below WMARK_LOW).
        const KSWAPD_RECLAIM   = 0x0000_0100;
        /// Use the emergency reserve pool. Interrupt-safe; may not sleep.
        /// Higher priority than NOWAIT; intended for true interrupt context.
        const ATOMIC           = 0x0000_0200;
        /// For user-visible pages (anonymous memory, file-backed pages
        /// accessed by userspace). Enables reclaimable migration.
        const USER             = 0x0000_0400;
        /// Zero the allocated memory before returning. Caller gets clean pages.
        const ZERO             = 0x0000_0800;

        // --- Warning / failure reporting modifiers ---

        /// Suppress allocation failure warnings. When set, allocation failures
        /// do NOT produce kernel log messages or FMA diagnostic events. Without
        /// this flag, the allocator emits an FMA advisory event (level INFO) on
        /// allocation failure in debug/diagnostic builds.
        ///
        /// **Two-level allocation failure reporting (UmkaOS design)**:
        /// - **Without NOWARN**: allocation failure emits an FMA advisory event
        ///   ([Section 20.1](20-observability.md#fault-management-architecture)) at INFO level. The event includes GfpFlags,
        ///   caller context, and allocation order. FMA is rate-limited per
        ///   call-site (token bucket, 1 event/sec burst, 0.1 events/sec
        ///   sustained) to prevent log flooding from retry loops.
        /// - **With NOWARN**: allocation failure is silently swallowed. No log,
        ///   no FMA event. Used in paths where failure is expected and benign
        ///   (e.g., `GFP_NOWAIT` networking buffer refills, speculative
        ///   readahead, page cache pre-allocation).
        ///
        /// This is an improvement over Linux's `__GFP_NOWARN`, which is a
        /// binary suppress-or-emit with no rate limiting on the emit path.
        /// UmkaOS's FMA coordination ensures that even the non-NOWARN path
        /// cannot flood the kernel log under memory pressure.
        const NOWARN           = 0x0000_2000;

        // --- Composite flags for common use cases ---

        /// Standard kernel allocation: sleepable, may reclaim, may do I/O.
        /// Use for all normal kernel allocations in process context.
        const KERNEL = Self::RECLAIM.bits() | Self::IO.bits() | Self::FS.bits()
                     | Self::WRITE.bits() | Self::KSWAPD_RECLAIM.bits();

        /// Non-blocking allocation: no direct reclaim, no sleep, but wakes kswapd
        /// for background reclaim. For contexts where failure is acceptable
        /// (e.g., caches). Preferred over GFP_ATOMIC for non-interrupt contexts
        /// that simply must not sleep. Linux equivalent: `GFP_NOWAIT =
        /// __GFP_KSWAPD_RECLAIM | __GFP_NOWARN` (wakes kswapd at low watermark
        /// so background reclaim runs even when the caller cannot block;
        /// suppresses allocation failure warnings since failure is expected).
        ///
        /// **Why not zero?** A zero value is indistinguishable from `GfpFlags::empty()`,
        /// making `flags.contains(GfpFlags::NOWAIT)` always true — a logic bug.
        /// More importantly, zero means "no flags at all": no kswapd wake, no
        /// reclaim of any kind. Linux's GFP_NOWAIT deliberately includes
        /// `__GFP_KSWAPD_RECLAIM` so that even non-blocking allocations trigger
        /// background page freeing when memory is low.
        const NOWAIT = Self::KSWAPD_RECLAIM.bits() | Self::NOWARN.bits();

        /// Interrupt-safe allocation using reserve pool. Use only in IRQ context
        /// or under spinlock where GFP_NOWAIT's failure rate is unacceptable.
        const ATOMIC_ALLOC = Self::ATOMIC.bits() | Self::KSWAPD_RECLAIM.bits();

        /// Like GFP_KERNEL but zeroes the memory. For kernel objects that must
        /// start in a known-zero state (prevents info leaks from recycled pages).
        const KERNEL_ZEROED = Self::KERNEL.bits() | Self::ZERO.bits();

        /// Like GFP_KERNEL but disables filesystem reclaim. Safe from VFS/fs code.
        const KERNEL_NOFS = Self::RECLAIM.bits() | Self::IO.bits()
                          | Self::WRITE.bits() | Self::KSWAPD_RECLAIM.bits();

        /// Like GFP_KERNEL but disables I/O and filesystem reclaim. Safe from
        /// block drivers and filesystem code. Linux equivalent: GFP_NOIO.
        /// Does NOT include IO or FS — prevents reclaim from re-entering
        /// the block layer or filesystem, avoiding deadlocks in drivers that
        /// allocate memory under device locks.
        const KERNEL_NOIO = Self::RECLAIM.bits()
                          | Self::KSWAPD_RECLAIM.bits();

        /// For user-visible pages with high-memory zone support and movability.
        const HIGHUSER_MOVABLE = Self::KERNEL.bits() | Self::USER.bits()
                               | Self::HIGHMEM.bits() | Self::MOVABLE.bits();

        /// Never fail. The allocator retries indefinitely (sleeping between
        /// retries) until memory is available. Use with extreme caution:
        /// callers must tolerate unbounded latency. Typically used for
        /// page table allocations and other paths where failure is
        /// unrecoverable. If a cgroup memory limit is hit under NOFAIL,
        /// the allocation is charged to the root cgroup instead of failing.
        const NOFAIL           = 0x0000_1000;
    }
}

4.2.1 BuddyAllocator¶

The buddy allocator is the central physical page allocator. One BuddyAllocator instance exists per NUMA node, managing all free physical pages for that node. It operates on power-of-two page blocks (orders 0 through MAX_ORDER) and merges adjacent free blocks to reduce fragmentation.

4.2.1.1 Constants¶

/// Maximum allocation order (inclusive). A block of order MAX_ORDER contains
/// 2^MAX_ORDER contiguous pages (4 MB with 4 KB pages).
/// Allocations larger than 2^MAX_ORDER pages are not supported by the buddy
/// allocator — callers requiring contiguous regions beyond this size must use
/// the CMA (Contiguous Memory Allocator) or boot-time reservations.
///
/// **Semantics**: MAX_ORDER uses Linux 6.4+ **inclusive** semantics (matching
/// the renamed `MAX_PAGE_ORDER` in `include/linux/mmzone.h`). `free_lists`
/// has `MAX_ORDER + 1` entries (orders 0 through MAX_ORDER inclusive, i.e.,
/// 11 free lists per migration type). This is NOT the old pre-6.4 exclusive
/// semantics where `MAX_ORDER = 11` meant orders 0-10.
/// Linux also defines `NR_PAGE_ORDERS = MAX_PAGE_ORDER + 1 = 11`.
pub const MAX_ORDER: usize = 10;

/// Number of pages to batch-transfer between Per-CPU Free Page Pools and the
/// buddy allocator in a single lock acquisition. Chosen to match Linux's
/// default: `max(zone_managed_pages / 1024 / 4, 1)` capped at
/// `PAGE_SHIFT * 8 = 31` for order-0. 31 amortizes buddy lock cost while
/// keeping per-operation latency bounded.
pub const PCP_BATCH_SIZE: usize = 31;

/// Default high watermark for Per-CPU Free Page Pools. When a PCP pool
/// accumulates more than `PCP_HIGH_DEFAULT` pages, the excess is drained
/// back to the buddy allocator. Value matches Linux default (batch * 6 = 186).
/// Tunable at runtime via `/proc/sys/vm/percpu_pagelist_high_fraction`.
pub const PCP_HIGH_DEFAULT: u32 = 186; // 31 * 6 = 186

/// Stack-allocation capacity for NUMA node arrays. Used for bounded
/// stack-allocated containers (ArrayVec) in hot/warm paths where heap
/// allocation is forbidden or undesirable. This is NOT a system-wide
/// NUMA node limit — the kernel discovers the actual node count from
/// firmware (SRAT/DTB) at boot. If a system has more than 64 local
/// NUMA nodes (unlikely — covers 8 sockets + 16 CXL expanders with
/// 2x headroom), the kernel panics at boot with a clear message.
/// Cluster peers and DSM remote nodes are NOT local NUMA nodes.
pub const NUMA_NODES_STACK_CAP: usize = 64;

4.2.1.2 Migration Types¶

Each free block is tagged with a migration type that groups pages by mobility. This prevents unmovable kernel allocations from scattering across the address space and blocking compaction of movable (user) pages.

/// Migration type for a free page block. Determines which free list the
/// block resides on within a given order. The allocator maintains separate
/// per-order free lists for each migration type.
#[repr(u8)]
pub enum MigrateType {
    /// Unmovable pages: kernel slab objects, page tables, DMA buffers.
    /// Once allocated, these pages cannot be relocated.
    Unmovable  = 0,
    /// Movable pages: anonymous user memory, page cache. The compactor
    /// can relocate these pages to defragment physical memory.
    Movable    = 1,
    /// Reclaimable pages: kernel caches (dentry, inode) that can be freed
    /// under memory pressure without moving them.
    Reclaimable = 2,
    /// Number of migration types (not a valid type itself).
    Count      = 3,
}

4.2.1.3 `BuddyFreeList` — Per-Order Free List¶

/// A doubly-linked list of free page blocks at a specific allocation order.
/// Each `BuddyAllocator` contains `(MAX_ORDER + 1) * MigrateType::Count`
/// free lists: one per (order, migration type) pair.
///
/// The list is intrusive: free `Page` descriptors contain `prev`/`next`
/// pointers that link them into the list. Only the *head page* of a free
/// compound block is linked; its `order` field records the block size.
///
/// **Invariant**: `Page.lru` serves as buddy free-list linkage when `PG_BUDDY`
/// is set in `Page.flags`, and as LRU generation linkage when the page is
/// allocated and in the page cache. These uses are mutually exclusive;
/// `PG_BUDDY` is the discriminant. `insert_free_list` and `remove_from_free_list`
/// operate on `page.lru.next` / `page.lru.prev` to link/unlink pages.
pub struct BuddyFreeList {
    /// Head of the intrusive doubly-linked list of free page blocks.
    /// Each node is the head `Page` descriptor of a 2^order-page block.
    /// Null when the list is empty.
    pub head: *mut Page,
    /// Tail of the list. Enables O(1) append for FIFO-order allocation
    /// (reduces page reuse predictability — security benefit).
    pub tail: *mut Page,
    /// Number of free blocks at this order. Non-atomic: always accessed under
    /// `BuddyAllocator.lock`. Distinguished from the zone-wide aggregate
    /// `BuddyAllocator.nr_free` (AtomicU64) which is read without the lock
    /// for watermark checks in the allocation fast path.
    pub nr_free: u64,
}

4.2.1.4 `PcpPagePool` — Per-CPU Free Page Pool¶

Per-CPU free page pools eliminate lock contention for the overwhelmingly common case: single-page (order-0) allocation and deallocation. Each CPU owns a private pool accessed with IRQs disabled (local_irq_save) — no locks, no atomics on the fast path. IRQ-disable (not merely preempt-disable) is required because IRQ handlers can allocate memory (e.g., GFP_ATOMIC in networking receive paths); preempt-only would allow an IRQ to corrupt the per-CPU pool mid-access.

Design note: why PerCpu<PcpPagePool> and not CpuLocalBlock? CpuLocalBlock (Section 3.2) is a compact, fixed-layout struct embedded in a dedicated register (GS-base on x86-64, TPIDR_EL1 on AArch64). It holds small, frequently-accessed scheduler/RCU state (~256 bytes total). PcpPagePool is too large for CpuLocalBlock (256 × 8 bytes = 2 KB for the pages array alone, plus watermarks and counters). Placing it in CpuLocalBlock would blow the L1 cache budget for the entire block. Instead, PcpPagePool lives in PerCpu (heap-allocated per-CPU area), accessed via PerCpu::get() with IRQs disabled (local_irq_save). The extra indirection (one pointer dereference) is negligible: the pool array itself is cache-hot after the first access, and allocation frequency (~100K/sec/CPU) is well below the scheduler tick rate.

/// Per-CPU pool of free order-0 pages. Absorbs the majority of allocation
/// and deallocation traffic, allowing the buddy allocator lock to be acquired
/// only when the pool needs refilling or draining.
///
/// Accessed exclusively by the owning CPU with IRQs disabled (`local_irq_save`).
/// No lock is needed: IRQ-disable guarantees single-threaded access even when
/// IRQ handlers allocate memory (e.g., `GFP_ATOMIC` in network receive).
///
/// **Hot path**: `alloc_pages(order=0)` → pop from `pages[]` (no lock).
/// **Refill**: when empty, acquire buddy lock once, transfer `batch` pages.
/// **Drain**: when `pages.len() > high`, return `batch` pages to buddy.
pub struct PcpPagePool {
    /// Cached free pages on this CPU. Stack discipline (LIFO): the most
    /// recently freed page is allocated first, maximizing cache warmth.
    /// Capacity is `PCP_CAPACITY` (256 pages = 1 MB with 4 KB pages).
    /// This is the maximum number of pages a single CPU can cache; the
    /// actual working set is governed by the `high` watermark.
    ///
    /// **Design note: `*mut Page` (not `PhysAddr`)**: UmkaOS stores `*mut Page`
    /// pointers directly, matching Linux's `struct page *` in per-CPU page
    /// lists. This avoids a MEMMAP lookup (`MEMMAP[pa >> PAGE_SHIFT]`) on
    /// every order-0 alloc/free — a ~1-3 cycle cost per operation that adds
    /// up on the hottest path in the kernel. The `select_block()` buddy
    /// helper already returns `*mut Page`, so PCP refill stores the pointer
    /// directly without conversion. PCP drain passes `*mut Page` straight to
    /// `buddy_merge_and_insert()`. The only cost is that `*mut Page` carries
    /// a validity assumption (the Page descriptor must remain mapped), which
    /// is trivially satisfied because MEMMAP is a boot-time permanent mapping
    /// that is never unmapped.
    pub pages: ArrayVec<*mut Page, PCP_CAPACITY>,
    /// High watermark. When `pages.len()` exceeds this value after a free,
    /// drain `batch` pages back to the buddy allocator. Prevents one CPU
    /// from hoarding too many pages while other CPUs starve.
    /// Default: dynamically computed as
    /// `max(batch * 6, zone_low_watermark / nr_cpus_on_node)`.
    /// On a typical 64 GB node with 32 CPUs: high ≈ 186.
    /// Clamped to `PCP_CAPACITY` on all systems.
    pub high: u32,
    /// Number of pages to transfer in a single refill or drain operation.
    /// Amortizes the buddy lock cost. Default: 31. Tunable via
    /// `/proc/sys/vm/percpu_pagelist_high_fraction` (adjusts both `high`
    /// and `batch` proportionally).
    pub batch: u32,
    /// Pool validity flag. Set to `true` once this pool is fully initialized
    /// for its CPU; cleared during CPU hotplug teardown. Used by the
    /// `PhysAllocPolicy` quiescence protocol to determine whether a CPU's
    /// PCP pool can be drained safely (e.g., during policy hot-swap or
    /// NUMA rebalancing). The policy layer checks `pcp_valid.load(Acquire)`
    /// before touching `pages`; if `false`, the pool is skipped.
    pub pcp_valid: AtomicBool,
}

/// **Initialization**: PCP pools are populated during Phase 1.1 boot
/// (after buddy allocator hand-off from the boot allocator). For each
/// online CPU: `pcp.high` and `pcp.batch` are computed from zone size
/// and CPU count, then `pcp.pcp_valid.store(true, Release)` enables
/// the fast path. On secondary CPU bringup (SMP): the AP's PCP pool
/// is initialized and `pcp_valid` set in `secondary_cpu_init()`.
/// On CPU hotplug offline: `pcp_valid` is cleared BEFORE draining
/// the pool, ensuring no concurrent fast-path access.

/// Maximum capacity of the PCP page pool (ArrayVec size).
/// 256 pages = 1 MB at 4 KB page size. Provides headroom for dynamic
/// high watermarks on large-memory systems (e.g., 512 GB with 8 CPUs
/// yields high ≈ 240). The actual working set is much smaller on most
/// systems (governed by `high`, typically 100-200).
///
/// **Scaling**: On extreme systems (>4 TB RAM, >128 CPUs), the dynamically
/// computed `high` watermark may exceed PCP_CAPACITY. When this occurs,
/// `high` is clamped to PCP_CAPACITY (line 275). The effect is slightly
/// more frequent buddy lock contention — acceptable because such systems
/// benefit more from NUMA-local allocation than from larger PCP pools.
pub const PCP_CAPACITY: usize = 256;

/// Default batch size for PCP pool refill/drain operations.
/// Default batch size — must equal `PCP_BATCH_SIZE`. Defined separately
/// for contexts that need a `u32` value. 31 pages per batch amortizes
/// buddy lock acquisition cost while keeping per-operation latency bounded.
pub const PCP_BATCH_DEFAULT: u32 = PCP_BATCH_SIZE as u32;

Tuning: The high and batch values are recalculated when /proc/sys/vm/percpu_pagelist_high_fraction is written. The formula is: high = max(batch * 6, total_node_pages / fraction / nr_cpus_on_node), batch = max(1, high / 6). This matches Linux behavior and ensures that high-memory systems do not over-cache pages on individual CPUs.

4.2.1.5 `BuddyAllocator` — Per-NUMA-Node Allocator¶

/// Per-NUMA-node buddy allocator. Manages all free physical pages belonging
/// to a single NUMA node across orders 0 through MAX_ORDER. One instance per
/// NUMA node, allocated from the boot allocator during `mem_init()`.
///
/// **Locking protocol**:
/// - `lock` protects all `free_lists` entries. Must be held for split, merge,
///   and any direct free-list manipulation.
/// - `pcpu_pools` are lock-free (IRQ-disable via `local_irq_save`). The buddy
///   `lock` is acquired only when a PCP pool needs refilling or draining.
/// - `nr_free` is updated atomically (outside the lock) to provide a
///   consistent snapshot for `/proc/meminfo` and watermark checks without
///   requiring the buddy lock on every read.
///
/// **Performance**: On a typical server workload, >95% of order-0 allocations
/// are served from `pcpu_pools` without touching `lock`. The buddy lock is
/// contended only during bulk refill/drain (amortized by `batch`) and
/// higher-order allocations (uncommon in steady state).
pub struct BuddyAllocator {
    /// Per-order, per-migration-type free page lists.
    /// Indexed as `free_lists[order][migrate_type]`.
    /// Order 0 = single page (4 KB), order MAX_ORDER = 2^MAX_ORDER pages (4 MB).
    pub free_lists: [[BuddyFreeList; MigrateType::Count as usize]; MAX_ORDER + 1],
    /// Spinlock protecting all `free_lists` entries during split/merge.
    /// IRQ-safe: acquired with `lock.lock_irqsave()` because page allocation
    /// can occur in interrupt context (GFP_ATOMIC).
    pub lock: SpinLock<()>,
    /// Total free pages across all orders and migration types on this node.
    /// Updated with `Ordering::Relaxed` after every alloc/free; the lock
    /// serializes the actual free-list mutation, so the counter is eventually
    /// consistent. Used for watermark checks and `/proc/meminfo` reporting.
    pub nr_free: AtomicU64,
    /// NUMA node ID this allocator serves. Immutable after initialization.
    /// Matches the index in `BUDDY_ALLOCATORS`.
    pub node_id: u32,
    /// Per-CPU free page pools for fast order-0 allocation. Each CPU has
    /// its own pool, accessed with IRQs disabled (`local_irq_save`).
    pub pcpu_pools: PerCpu<PcpPagePool>,
    /// Watermarks for this node (pages). Reclaim decisions reference these
    /// thresholds. Set during boot from zone size and tunable via
    /// `/proc/sys/vm/min_free_kbytes`.
    pub wmark_min: u64,
    /// Low watermark: kswapd wakes when `nr_free` drops below this.
    pub wmark_low: u64,
    /// High watermark: kswapd sleeps when `nr_free` rises above this.
    pub wmark_high: u64,
    /// Replaceable allocation policy. All algorithmic decisions (block
    /// selection, split strategy, merge policy, NUMA fallback, watermark
    /// tuning, compaction trigger) are dispatched through this trait object.
    /// The policy can be live-replaced via the evolution framework
    /// ([Section 13.18](13-device-classes.md#live-kernel-evolution)).
    ///
    /// **State spill**: All mutable state (free_lists, nr_free, pcpu_pools,
    /// watermarks) is owned by BuddyAllocator, not by the policy. The
    /// policy is a stateless algorithm dispatcher — same pattern as
    /// `IoSchedOps` ([Section 16.21](16-networking.md#traffic-control-and-queue-disciplines--state-ownership-for-live-evolution))
    /// and `QdiscOps` ([Section 16.21](16-networking.md#traffic-control-and-queue-disciplines--state-ownership-for-live-evolution)).
    ///
    /// **Post-swap handler**: After `AtomicPtr` swap, the evolution
    /// framework MUST call `new_policy.recalc_watermarks(buddy, node_pages)`
    /// for every NUMA node before resuming allocations. Watermarks encode
    /// policy decisions (min_free thresholds, low/high ratios) that differ
    /// between policies. Without recalculation, the new policy operates
    /// with stale watermark values from the old policy, potentially
    /// triggering spurious OOM or delaying kswapd wakeup.
    ///
    /// **Quiescence protocol**: After swapping the allocation policy, the
    /// swap initiator sends an IPI to all CPUs to drain per-CPU page
    /// caches (PCP) and recalculate zone watermarks using the new policy.
    /// During recalculation, allocation fast paths are temporarily
    /// disabled (fall through to slow path with zone lock). This ensures
    /// no allocations proceed with stale watermarks. The sequence is:
    /// 1. `AtomicPtr::swap(new_policy, Release)` — publish new policy.
    /// 2. `smp_call_function_all(drain_pcp_and_recalc)` — IPI all CPUs.
    ///    `smp_call_function_all()` is synchronous: the calling CPU waits
    ///    until all target CPUs have executed the callback and returned
    ///    (via per-CPU completion flags polled by the caller).
    /// 3. Each CPU: drain its PCP back to buddy **for every NUMA node
    ///    whose BuddyAllocator references this CPU's PCP** (not only the
    ///    CPU's "local" node — NUMA mempolicy or `numa_fill` may have
    ///    populated PCP pools on remote nodes), then call
    ///    `new_policy.recalc_watermarks()` for **all** node zones.
    /// 4. Each CPU: re-enable fast path (`pcp_valid.store(true, Release)`)
    ///    on every PCP pool that was drained.
    ///    After `smp_call_function_all()` returns, all CPUs have drained
    ///    their PCP pools across all nodes.
    /// Between steps 1 and 2, PCP fast-path allocations proceed normally
    /// (policy is not consulted on the fast path). Slow-path allocations
    /// read the new policy via the already-swapped AtomicPtr. After step 2
    /// (synchronous IPI), all CPUs have drained PCP and recalculated
    /// watermarks under the new policy.
    ///
    /// **In-flight slow-path allocations during policy swap**: Between
    /// `AtomicPtr::swap` (step 1) and PCP drain completion (step 4),
    /// slow-path allocations may be executing under the old policy. This
    /// is safe because: (1) the old policy vtable memory is retained for
    /// the 5-second watchdog window (same as all evolution swaps), freed
    /// via `call_rcu()` after an RCU grace period. (2) Slow-path
    /// allocations read the policy once at entry (`policy.load(Acquire)`)
    /// and complete without re-reading -- the policy pointer is captured
    /// in a local variable for the duration of the allocation. (3) After
    /// step 4 (all CPUs drained and `pcp_valid` re-enabled), no new
    /// slow-path calls can use the old pointer because the `AtomicPtr`
    /// was already swapped in step 1 with Release ordering, and the IPI
    /// in step 2 acts as a full memory barrier on each receiving CPU.
    ///
    /// Active allocation policy. Replaced atomically via RCU during live
    /// policy evolution. Readers load via `policy.load(Acquire)` under
    /// RCU read lock; the evolution primitive swaps via `policy.swap(new, AcqRel)`
    /// after quiescence.
    ///
    /// **Why `AtomicPtr<PhysAllocPolicyVtable>` not `AtomicPtr<dyn PhysAllocPolicy>`**:
    /// Trait objects (`dyn Trait`) are unsized — `AtomicPtr` requires `Sized`.
    /// Instead, the policy is represented as a `#[repr(C)]` vtable struct
    /// (`PhysAllocPolicyVtable`) containing function pointers. This is the
    /// same pattern as `VtableHeader` in live-kernel-evolution and all KABI
    /// vtable structs. The `AtomicPtr` stores a pointer to the concrete
    /// vtable, which can be atomically swapped.
    pub policy: AtomicPtr<PhysAllocPolicyVtable>,

    /// TEE (Trusted Execution Environment) capability of this NUMA node.
    /// Populated during boot from hardware discovery
    /// ([Section 2.15](02-boot-hardware.md#numa-topology-discovery--tee-capability-discovery)).
    /// The tiering engine checks this before migrating confidential pages:
    /// pages belonging to a `ConfidentialContext`
    /// ([Section 9.7](09-security.md#confidential-computing)) must never be migrated to a node
    /// where `tee_info.tee_capable == false` (e.g., CXL memory expanders
    /// without hardware encryption support).
    pub tee_info: NumaNodeTeeInfo,

    /// Allocation wait queue. Tasks that fail page allocation in this node's
    /// slow path (after reclaim, compaction, and OOM) sleep here. Woken by
    /// `free_pages()` when pages are returned to this node's buddy free
    /// lists AND `nr_free` crosses above `wmark_min`. The throttle
    /// (nr_free > wmark_min check) prevents thundering-herd wakeups on
    /// every individual page free.
    ///
    /// Per-node (not global) to avoid waking tasks blocked on node A when
    /// node B frees pages (which would just cause the woken task to fail
    /// allocation again and re-sleep). The OOM killer wakes this queue
    /// (Step 7) after killing a victim, allowing allocating tasks to retry
    /// with TIF_MEMDIE reserve access.
    ///
    /// See [Section 4.5](#oom-killer) for the OOM retry mechanism.
    pub alloc_waitqueue: WaitQueue,
}

4.2.1.5.1 NUMA Node TEE Info¶

/// Per-NUMA-node TEE capability, populated during boot from hardware discovery.
/// See [Section 9.7](09-security.md#confidential-computing--tee-capable-numa-nodes) for the full design
/// and [Section 2.15](02-boot-hardware.md#numa-topology-discovery--tee-capability-discovery) for the per-architecture
/// discovery procedure.
pub struct NumaNodeTeeInfo {
    /// Whether this NUMA node's memory controller supports hardware encryption.
    /// True for: AMD SEV-SNP (ASID-keyed), Intel TDX (MKTME-keyed), ARM CCA (GPC-protected).
    /// False for: CXL Type 3 expanders without TEE firmware, standard DRAM on non-TEE platforms.
    pub tee_capable: bool,
    /// Maximum number of concurrent encryption key IDs supported by this node's controller.
    /// 0 if !tee_capable.
    pub max_key_ids: u32,
}

Confidential page migration gate: The tiering engine's page migration path (ML-guided tier migration, kswapd demotion, and explicit migrate_pages()) must check target_node.tee_info.tee_capable before migrating any page that belongs to a ConfidentialContext. If the target node is not TEE-capable, the migration is rejected with MigrationError::TargetNotTeeCapable. This prevents confidential pages from being demoted to CXL-attached memory nodes that lack hardware encryption, which would break confidential computing guarantees. The NUMA fallback order in alloc_pages() applies the same gate: confidential allocations skip non-TEE-capable nodes in the fallback chain.

4.2.1.6 Global Allocator Table¶

/// Global table of per-NUMA-node buddy allocators. Indexed by NUMA node ID
/// (0-based). Initialized once during `mem_init()` after NUMA topology
/// discovery ([Section 4.11](#numa-topology-and-policy)).
/// After initialization, the slice reference is immutable — nodes are not
/// hot-added at runtime (CXL memory hotplug adds pages to an existing node's
/// buddy allocator, not new nodes).
pub static BUDDY_ALLOCATORS: OnceCell<&'static [BuddyAllocator]> = OnceCell::new();

4.2.1.7 `alloc_pages()` — Page Allocation¶

/// Allocate 2^order contiguous physical pages from the specified NUMA node.
///
/// Returns the head `Page` descriptor of the allocated block on success, or
/// `AllocError` if the system is out of memory after all recovery attempts.
///
/// # Arguments
/// - `order`: Allocation order (0..=MAX_ORDER). Order 0 = 1 page (4 KB),
///   order 10 = 1024 pages (4 MB).
/// - `gfp`: Allocation flags controlling sleep/reclaim behavior and zone
///   selection. See `GfpFlags` above.
/// - `nid`: Preferred NUMA node. `NumaNodeId::ANY` uses the calling CPU's
///   local node.
///
/// # Fast Path (order 0, non-atomic)
/// 1. Save and disable IRQs (`local_irq_save`).
/// 1a. Check `pcp.pcp_valid.load(Acquire)`. If `false`, the pool is being
///     drained (policy hot-swap or CPU hotplug teardown) — skip PCP and
///     fall through to the slow path (buddy allocator under zone lock).
/// 2. Pop a page from the calling CPU's `PcpPagePool`.
/// 3. Restore IRQs (`local_irq_restore`). Return the page.
///
/// If the PCP pool is empty, refill it (see [Section 4.2](#physical-memory-allocator--pcp-refill) below),
/// then retry step 2.
///
/// # Slow Path (order > 0, or PCP miss requiring buddy)
/// 1. Acquire `buddy.lock` with IRQs saved.
/// 2. Walk orders from `order` up to `MAX_ORDER`:
///    a. Check `free_lists[o][preferred_migrate_type]` for a free block.
///    b. If found, remove the block. If `o > order`, split (see
///       [Section 4.2](#physical-memory-allocator--split-algorithm)). Release lock. Return page.
/// 3. If no block found in preferred migration type, fall back to other
///    migration types: `Reclaimable → Movable → Unmovable` (steal policy).
/// 4. Release `buddy.lock`.
///
/// # NUMA Fallback
/// If the preferred node is exhausted:
/// 1. Walk NUMA distance table in ascending distance order.
/// 2. Attempt allocation from each node using the slow path above.
/// 3. If all nodes exhausted, proceed to reclaim.
///
/// # TIF_MEMDIE / PF_MEMALLOC Reserve Bypass
/// Before entering reclaim, check if the current task has `TIF_MEMDIE` set
/// (OOM victim exiting) or `PF_MEMALLOC` (reclaim context). If so, bypass
/// watermark checks and allocate from reserves below `wmark_min`. This
/// prevents the OOM victim's exit path from deadlocking — `do_exit()` needs
/// memory for page table teardown, fd close, etc. In Linux, this is
/// implemented via `gfp_to_alloc_flags()` which sets `ALLOC_NO_WATERMARKS`
/// when `TIF_MEMDIE` is set.
///
/// ```rust
/// if current_task().thread_flags.load(Relaxed) & TIF_MEMDIE != 0 {
///     // Bypass watermarks, allocate from deep reserves.
///     return alloc_pages_no_watermark(order, gfp, nid);
/// }
/// ```
///
/// # Reclaim Path (GFP_KERNEL / GFP_RECLAIM)
/// 1. Wake kswapd for background reclaim.
/// 2. Attempt direct reclaim: scan LRU inactive list, evict clean pages,
///    compress reclaimable pages ([Memory Compression Tier](memory-compression-tier.md)).
/// 3. Retry allocation once.
/// 4. If still failing and `gfp` allows OOM: invoke OOM killer, retry.
/// 5. Return `AllocError::OutOfMemory` if all attempts fail.
///
/// # GFP_ATOMIC Path
/// Skip all reclaim and sleeping. If the buddy allocator has no pages at
/// the requested order on any node, return `AllocError::OutOfMemory`
/// immediately. The caller must handle failure gracefully (e.g., drop
/// packets, fail the I/O request).
///
/// # __GFP_ZERO Handling
/// If `gfp` contains `ZERO`, the allocated page(s) are zeroed via
/// `arch::current::mm::zero_page()` before return. On architectures with
/// hardware-accelerated zeroing (ARM DC ZVA, x86 REP STOSB with ERMS),
/// this is done with the optimal instruction sequence.
///
/// # Cgroup Memory Charge (warm path)
///
/// After a page is obtained from the buddy allocator (PCP refill or slow path)
/// and before it is returned to the caller, the allocator charges the page to
/// the calling task's memory cgroup via `mem_cgroup_charge()`. This is a warm
/// path — it runs only when the PCP pool missed (~1 in 31 allocations for
/// order 0) or on every higher-order allocation. The hot-path PCP pop does NOT
/// call `mem_cgroup_charge()` because PCP pages were already charged when they
/// were originally allocated from the buddy.
///
/// ```rust
/// // After page allocation from buddy allocator (PCP refill or slow path):
/// if let Err(e) = mem_cgroup_charge(page, gfp) {
///     if !gfp.contains(GfpFlags::NOFAIL) {
///         buddy_free_pages(page, order);
///         return Err(AllocError::CgroupLimit);
///     }
///     // GFP_NOFAIL: charge to root cgroup and proceed.
///     // The root cgroup has no limit — this never fails.
///     mem_cgroup_charge_root(page);
/// }
/// ```
///
/// `mem_cgroup_charge()` performs:
/// 1. Read the calling task's `task.cgroup` (Acquire load).
/// 2. If the cgroup has no memory controller (`cgroup.memory.is_none()`),
///    return `Ok(())` — no accounting overhead.
/// 3. Deduct from the calling CPU's `MemCgroupStock`
///    (see [Section 17.2](17-containers.md#control-groups--per-cpu-memory-charge-batching-memcgroupstock)).
///    On the common path (stock hit for the current cgroup), this is a local
///    memory store with no atomic operations — eliminating the scalability
///    bottleneck on 128+ CPU systems. On a stock miss, perform a single
///    `usage.fetch_add(STOCK_SIZE)` to refill the per-CPU cache, then deduct
///    locally. Only the refill touches the global `MemController::usage` atomic.
/// 4. If `usage > memory.max`: attempt per-cgroup reclaim. If reclaim fails
///    and `gfp` does not contain `NOFAIL`, return `Err(CgroupLimit)`.
///    If `usage > memory.high`: mark the task for throttling on return to
///    userspace (the throttle is deferred, not inline).
/// 5. Store the cgroup reference in `page.mem_cgroup` for later uncharge.
///
/// On `free_pages()`, the inverse `mem_cgroup_uncharge(page)` decrements
/// `usage` and clears `page.mem_cgroup`. This runs only on the buddy free
/// slow path (when the PCP pool drains excess pages back to the buddy
/// allocator, or when a higher-order page is freed directly). The PCP
/// push/pop hot path does NOT call `mem_cgroup_uncharge()` — PCP is a
/// per-CPU cache that merely delays the page's return to the buddy. The
/// cgroup charge persists while the page sits in PCP. This mirrors the
/// charge path: charge on buddy allocation (PCP refill or slow path),
/// uncharge on buddy free (PCP drain or slow path). PCP push/pop are
/// purely cache operations with no cgroup accounting overhead.
pub fn alloc_pages(order: u32, gfp: GfpFlags, nid: NumaNodeId) -> Result<Page, AllocError>;

4.2.1.8 `free_pages()` — Page Deallocation¶

/// Return 2^order contiguous physical pages to the allocator.
///
/// # Arguments
/// - `page`: Head `Page` descriptor of the block being freed. Must have been
///   returned by a prior `alloc_pages()` call with the same `order`.
/// - `order`: The order that was used when allocating this block. Passing the
///   wrong order is undefined behavior (debug builds panic; release builds
///   corrupt the free lists).
///
/// # Fast Path (order 0)
/// 1. Save and disable IRQs (`local_irq_save`).
/// 2. Push the page onto the calling CPU's `PcpPagePool` (LIFO).
/// 3. If `pool.pages.len() > pool.high`: drain `pool.batch` pages back to
///    the buddy allocator (see [Section 4.2](#physical-memory-allocator--pcp-drain) below).
/// 4. Restore IRQs (`local_irq_restore`).
///
/// # Slow Path (order > 0)
/// 1. Acquire `buddy.lock` with IRQs saved.
/// 2. Attempt to merge with the buddy block (see [Section 4.2](#physical-memory-allocator--merge-algorithm)).
/// 3. Insert the (possibly coalesced) block into the appropriate free list.
/// 4. Update per-list `nr_free` and zone-wide `buddy.nr_free` (atomic).
/// 5. Release `buddy.lock`.
///
/// # Safety
/// - `page` must be a valid, currently-allocated page. Double-free is detected
///   in debug builds by checking the page's `PG_BUDDY` flag (set when a page
///   is on a free list, cleared on allocation).
/// - The caller must not hold any reference to the page contents after this call.
pub fn free_pages(page: Page, order: u32);

4.2.1.9 Split Algorithm¶

When the allocator needs a block of order R but the smallest available block is at order N (where N > R), the block must be split:

split_block(block, current_order=N, requested_order=R):
    while current_order > requested_order:
        current_order -= 1
        // Split: the block at current_order+1 becomes two blocks at current_order.
        // The upper half (buddy) goes onto the free list; the lower half continues.
        upper_half = block + (PAGE_SIZE * 2^current_order)
        upper_half.order = current_order
        set_page_flag(upper_half, PG_BUDDY)
        insert_free_list(free_lists[current_order][migrate_type], upper_half)
        free_lists[current_order][migrate_type].nr_free += 1  // upper half added to free list
    // 'block' is now at the requested order. Return it to the caller.
    clear_page_flag(block, PG_BUDDY)
    block.order = requested_order
    return block

Example: Request order 1 (8 KB), smallest available is order 3 (32 KB). 1. Split order 3 → two order 2 blocks. Upper half (16 KB) → free_lists[2]. 2. Split lower order 2 → two order 1 blocks. Upper half (8 KB) → free_lists[1]. 3. Return lower order 1 block to caller.

Result: 1 block allocated (8 KB), 2 blocks added to free lists (8 KB + 16 KB). Total: 8 + 8 + 16 = 32 KB accounted for.

4.2.1.10 Merge Algorithm¶

When a block is freed, the allocator attempts to coalesce it with its buddy (the adjacent block at the same order) to form a larger block:

merge_block(page, order):
    while order < MAX_ORDER:
        buddy_pfn = page_to_pfn(page) ^ (1 << order)
        buddy = pfn_to_page(buddy_pfn)
        // Check merge conditions:
        // 1. Buddy must be free (PG_BUDDY flag set).
        // 2. Buddy must be at the same order (buddy.order == order).
        // 3. Both must be in the same zone (no cross-zone merging).
        // 4. Both must have the same migration type.
        if !is_buddy_free(buddy, order):
            break
        // Remove buddy from its free list.
        remove_from_free_list(free_lists[order][migrate_type], buddy)
        clear_page_flag(buddy, PG_BUDDY)
        free_lists[order][migrate_type].nr_free -= 1  // buddy removed from free list
        // Merge: take the lower-addressed page as the new block head.
        page = min(page, buddy)   // by physical address
        order += 1
    // Insert the merged block into the free list at the final order.
    page.order = order
    set_page_flag(page, PG_BUDDY)
    insert_free_list(free_lists[order][migrate_type], page)
    free_lists[order][migrate_type].nr_free += 1  // merged block added

Buddy address calculation: For a block at physical frame number pfn of order o, the buddy's PFN is pfn XOR (1 << o). This is the fundamental property of the buddy system: a block and its buddy always differ in exactly one bit of the PFN, determined by the order.

Merge terminates when: 1. The buddy is not free (still allocated), or 2. The buddy is at a different order (was split and only partially freed), or 3. The merged block would exceed MAX_ORDER, or 4. The buddy is in a different zone or has a different migration type.

4.2.1.11 PCP Refill¶

When a CPU's PcpPagePool is empty and an order-0 allocation is requested:

pcp_refill(pool, buddy):
    acquire buddy.lock (irqsave)
    count = 0
    while count < pool.batch:
        // Select a free block of order 0 from the buddy system and remove it.
        // Equivalent to select_block(buddy, 0, migrate_type) which walks the
        // free_lists starting at order 0 and splits higher-order blocks if needed.
        page = select_block(buddy, /*order=*/0, migrate_type)
        if page is None:
            break   // buddy exhausted at order 0; caller falls back to NUMA
        pool.pages.push(page)
        count += 1
    buddy.nr_free.fetch_sub(count, Relaxed)
    release buddy.lock

One lock acquisition transfers up to batch (default PCP_BATCH_SIZE = 31) pages. This amortizes the lock cost: subsequent order-0 allocations are lock-free until the pool drains again.

4.2.1.12 PCP Drain¶

When a CPU's PcpPagePool exceeds high after a free:

pcp_drain(pool, buddy):
    acquire buddy.lock (irqsave)
    count = 0
    while count < pool.batch and pool.pages.len() > 0:
        page = pool.pages.pop()
        buddy_merge_and_insert(buddy, page, order=0)
        count += 1
    buddy.nr_free.fetch_add(count, Relaxed)
    release buddy.lock

Draining returns pages to the buddy allocator where they can be merged into higher-order blocks, reducing fragmentation. The drain is also triggered by: - CPU offline: all pages from the departing CPU's pool are drained. - Memory pressure: kswapd can request a drain of all CPU pools on a node to reclaim pages pinned in PCP pools. - Explicit flush: /proc/sys/vm/drop_caches or compact_memory triggers a full PCP drain to maximize merge opportunities.

4.2.1.13 GfpFlags Integration Summary¶

Flag	`alloc_pages()` behavior
`GFP_KERNEL`	May sleep, may invoke direct reclaim and kswapd, may trigger OOM killer. Standard for process-context allocations.
`GFP_ATOMIC` / `ATOMIC_ALLOC`	Must not sleep. Uses emergency reserve pool (below `wmark_min`). Returns `AllocError` immediately if no pages available. For IRQ context or under spinlock.
`KERNEL_NOIO`	May sleep and reclaim, but reclaim path must not start I/O. Prevents deadlock when called from block drivers holding device locks.
`KERNEL_NOFS`	May sleep and reclaim, but reclaim path must not re-enter filesystem. Prevents deadlock when called from filesystem code holding inode locks.
`ZERO`	Zero-fill the allocated page(s) before returning. Uses `arch::current::mm::zero_page()` for hardware-optimal zeroing.
`MOVABLE`	Allocate from `MigrateType::Movable` free list. The compactor may later relocate these pages. Used for user anonymous memory and page cache.
`DMA` / `DMA32`	Restrict allocation to the DMA zone (<16 MB) or DMA32 zone (<4 GB). Required for devices with limited address bus width.
`HIGHUSER_MOVABLE`	Composite: user-visible, movable, reclaimable. The default for `mmap` anonymous pages.
`NOFAIL`	Never return `AllocError`. Retries indefinitely, sleeping between attempts. If a cgroup memory limit blocks the allocation, the page is charged to the root cgroup instead of failing. Used for page table allocations and other unrecoverable paths.

Lock-hierarchy allocation constraints:

The lock hierarchy (Section 3.5) creates implicit constraints on which GfpFlags may be used when holding specific locks:

Held lock	Required GfpFlags	Rationale
`I_RWSEM` (level 80) or `INODE_LOCK` (level 160)	`KERNEL_NOFS`	Reclaim may call `writepage()` → re-enters filesystem → acquires `I_RWSEM` → deadlock. `KERNEL_NOFS` suppresses filesystem callbacks in the reclaim path.
`FS_SB_LOCK` (level 150)	`KERNEL_NOFS`	Same reason: reclaim must not re-enter any filesystem path.
`WRITEBACK_LOCK` (level 170)	`KERNEL_NOFS`	Writeback completion must not trigger further writeback.
Block driver device locks	`KERNEL_NOIO`	Reclaim may call `submit_bio()` → re-enters block layer → acquires device lock → deadlock. `KERNEL_NOIO` suppresses all I/O in the reclaim path.
Any `SpinLock` (IRQs disabled)	`ATOMIC_ALLOC` or `NOWAIT`	Must not sleep — spinlock contract violation.

This is not a new mechanism — GFP_KERNEL_NOFS and GFP_KERNEL_NOIO already exist (lines above). This table documents which locks mandate which flags, making the constraint explicit for implementers. Violations are detected in debug builds via lockdep-equivalent runtime checking: if a GFP_KERNEL allocation is attempted while holding a lock at filesystem level or above, a BUG() fires with the lock chain.

4.2.1.14 Initialization Sequence¶

During mem_init(), after NUMA topology is known:

mem_init():
    let num_nodes = numa_topology.num_nodes()
    // Allocate BuddyAllocator array from boot allocator.
    let allocators = boot_alloc.alloc_array::<BuddyAllocator>(num_nodes)
    for nid in 0..num_nodes:
        allocators[nid].node_id = nid
        allocators[nid].lock = SpinLock::new()
        allocators[nid].nr_free = AtomicU64::new(0)
        // Initialize all free lists to empty.
        for order in 0..=MAX_ORDER:
            for mt in 0..MigrateType::Count:
                allocators[nid].free_lists[order][mt] = BuddyFreeList::empty()
        // Initialize PCP pools for each CPU on this node.
        for cpu in cpus_on_node(nid):
            allocators[nid].pcpu_pools[cpu] = PcpPagePool {
                pages: ArrayVec::new(),
                high: PCP_HIGH_DEFAULT,
                batch: PCP_BATCH_SIZE,  // 31
            }
        // Calculate watermarks from node memory size.
        let node_pages = numa_topology.node_pages(nid)
        allocators[nid].wmark_min = min_free_kbytes_to_pages(nid)
        allocators[nid].wmark_low = wmark_min * 5 / 4
        allocators[nid].wmark_high = wmark_min * 3 / 2
    // Hand off boot allocator's free ranges to buddy.
    boot_alloc.hand_off_to_buddy(&mut allocators)
    BUDDY_ALLOCATORS.set(allocators).expect("buddy already initialized")

hand_off_to_buddy() iterates the boot allocator's remaining free ranges and inserts each page-aligned chunk into the appropriate node's buddy free lists at the highest possible order (maximizing initial merge state). After this call, the boot allocator is retired and all physical memory management flows through alloc_pages() / free_pages().

4.2.1.15 OOM Killer¶

When the physical allocator cannot satisfy an allocation after all reclaim attempts are exhausted (direct reclaim, compaction, kswapd), the OOM killer is invoked as a last resort. It selects a victim process, sends SIGKILL, and grants the victim a temporary memory allocation bypass so its exit path can complete and release pages.

/// Out-of-memory killer. Invoked by `alloc_pages()` when memory is exhausted
/// after reclaim attempts fail. Selects a victim task and kills it to free
/// memory for the faulting allocation.
pub struct OomKiller;

/// Constraint for OOM victim selection scope. Determines which tasks are
/// eligible for killing.
pub enum OomConstraint {
    /// System-wide OOM: select from all eligible tasks across the entire system.
    /// Triggered when global reclaim fails for a `GFP_KERNEL` allocation.
    Global,
    /// Cgroup OOM: select from tasks within this cgroup subtree only.
    /// Triggered when `memory.current > memory.max` for a memory cgroup
    /// ([Section 17.2](17-containers.md#control-groups)).
    Cgroup(Arc<Cgroup>),
    /// Cpuset-constrained: select from tasks whose cpuset's NUMA node mask
    /// overlaps the allocation's requested NUMA node set. Prevents killing a
    /// task on node 0 to free memory on node 3 when the allocator only needs
    /// pages from node 3.
    Cpuset(CpuSet),
}

impl OomKiller {
    /// Select the best victim task to kill.
    ///
    /// **Scoring**: UmkaOS uses dual OOM scoring:
    /// - **Linux-compatible** (`oom_score_compat()`): matches Linux's
    ///   `proc_oom_score()` normalization for `/proc/[pid]/oom_score`.
    /// - **Internal** (`oom_score_internal()`): enhanced scorer using
    ///   FMA data (memory pressure trends, I/O amplification, reclaim
    ///   efficiency) for the actual victim selection.
    ///
    /// Both use the base memory footprint (matching modern Linux `oom_badness()`):
    ///   `base = rss_pages + swap_pages + pgtable_pages`
    /// (`dirty_kb` is NOT included — dirty pages are already counted in RSS.
    ///  `child_rss` is NOT included — removed from Linux in the 2010 OOM rewrite.)
    /// See the canonical definition in [Section 4.5](#oom-killer) for the full formula.
    ///
    /// - `oom_score_adj`: per-task tunable in range [-1000, 1000], set via
    ///   `/proc/[pid]/oom_score_adj`. Positive values increase kill priority;
    ///   negative values decrease it. Value -1000 guarantees OOM immunity
    ///   (used by init, critical system daemons).
    ///
    /// The task with the **highest** score is selected as the victim.
    ///
    /// **Exclusions** (never selected — handled internally by `oom_badness()`
    /// returning `i64::MIN`, see [Section 4.5](#oom-killer)):
    /// - PID 1 (init): `oom_score_adj` defaults to -1000. Exception: if init
    ///   is the *only* task in a cgroup OOM, it is killed (the cgroup's memory
    ///   limit is absolute).
    /// - Kernel threads (`Task::flags & PF_KTHREAD`): not killable (mm: None).
    /// - Tasks with `PF_OOM_VICTIM` set: already selected for OOM kill.
    /// - Tasks with `MMF_OOM_SKIP` set on their mm: already reaped.
    /// - Tasks in `in_vfork()`: share parent's mm, killing them doesn't free.
    /// - Tasks already receiving `SIGKILL` (`Task::signal_pending(SIGKILL)`):
    ///   skip — they are already dying and will release memory shortly.
    /// - Tasks with `oom_score_adj == -1000`: OOM-immune (unless sole occupant
    ///   of a cgroup OOM).
    ///
    /// **Constraint enforcement**: When `constraint` is `Cgroup`, only tasks
    /// whose `cgroup` is a descendant of (or equal to) the specified cgroup
    /// are considered. When `constraint` is `Cpuset`, only tasks whose cpuset
    /// NUMA mask intersects the allocation target nodes are considered.
    ///
    /// Returns `None` if no eligible victim exists (all tasks are immune or
    /// already dying). In this case, `alloc_pages()` returns `AllocError::OutOfMemory`.
    pub fn select_victim(constraint: OomConstraint) -> Option<Arc<Task>>;

    /// Kill the selected victim. Canonical ordering is defined in
    /// [Section 4.5](#oom-killer) (kill sequence). Summary:
    ///
    /// 1. Log OOM event.
    /// 2. Set `PF_OOM_VICTIM` in `Task::flags` — exclusion flag, prevents
    ///    re-selection. Must be set FIRST.
    /// 3. Send `SIGKILL` (via `force_sig(SIGKILL, &task)` from
    ///    [Section 8.5](08-process.md#signal-handling)). Unblockable — wakes TASK_KILLABLE sleepers.
    /// 4. Set `TIF_MEMDIE` — grants the victim's exit path unconditional
    ///    access to memory reserves (below `wmark_min`).
    /// 5. Enqueue the victim's mm into the OOM reaper:
    ///    ```rust
    ///    if let Some(ref mm) = task.process.mm {
    ///        OOM_REAPER.victims.push(Arc::clone(mm));
    ///        OOM_REAPER.wake.wake_one();
    ///    }
    ///    ```
    /// 6. If `memory.oom.group == 1`: kill ALL tasks in the cgroup subtree
    ///    (applying PF_OOM_VICTIM → SIGKILL → TIF_MEMDIE to each). The
    ///    victim from `select_victim()` is included in the group iteration.
    ///
    /// `TIF_MEMDIE` is cleared when the victim's `mm_struct` is torn down
    /// (in `exit_mm()`), not when SIGKILL is delivered. This ensures the
    /// reservation bypass remains active for the entire exit path.
    pub fn kill_victim(task: Arc<Task>);
}

Integration points:

Trigger	Path	Constraint
`alloc_pages(GFP_KERNEL)` after reclaim + compact + retry exhausted	`alloc_pages()` slow path, step 5	`OomConstraint::Global`
Cgroup memory controller: `memory.current > memory.max`	Cgroup charge path (`try_charge()`)	`OomConstraint::Cgroup(cg)`
Page fault: anonymous `handle_anon_fault()` calls `alloc_pages()` which fails	Page fault handler → `alloc_pages()` → OOM	`OomConstraint::Global` (or `Cgroup` if the faulting task is in a memory cgroup with a limit)

Page fault OOM recovery: When a page fault triggers OOM and the faulting task is not the selected victim, the faulting task retries the allocation after the victim's memory is freed. If the faulting task is the victim, it receives SIGKILL and never returns to userspace.

memory.oom.group: When set to 1 on a cgroup, the OOM killer kills all tasks in the cgroup subtree rather than just the highest-scoring individual. This is essential for container workloads where killing a single task leaves the application in an inconsistent state (e.g., a database with a dead writer but live readers).

Cross-references: - Memory cgroup controller: Section 17.2 - Signal delivery: Section 8.5 - FMA fault reporting: Section 20.1 - Cleanup tokens (run on OOM kill): Section 8.1

4.2.1.16 OOM Reaper¶

When the OOM killer sends SIGKILL to a victim, the victim may be blocked in an uninterruptible sleep — waiting for I/O, holding mmap_lock, or stalled in a driver. The victim cannot process the signal until it wakes, but the system needs memory now. The OOM reaper solves this by proactively reclaiming the victim's anonymous pages from a dedicated kernel thread, without waiting for the victim to exit.

Deadlock prevention: The reaper exists specifically to break the following cycle: 1. Task A holds mmap_lock, tries to allocate memory, triggers OOM. 2. OOM kills Task B, but Task B is blocked waiting for mmap_lock held by A. 3. Without reaper: deadlock — A waits for memory, B waits for A's lock. 4. With reaper: reaper unmaps B's anonymous pages via a raw page table walk, freeing memory so A can proceed.

/// OOM reaper kthread — asynchronously reclaims memory from OOM victims.
///
/// Runs as a single kernel thread (`oom_reaper`), woken after each OOM kill.
/// Walks the victim's page tables and unmaps anonymous pages without holding
/// `mmap_lock`, breaking potential deadlock cycles between the OOM victim
/// and the faulting allocator.
pub struct OomReaper {
    /// Wake channel — signaled by `OomKiller::kill_victim()` after enqueueing
    /// a new victim.
    wake: WaitQueue,
    /// Victim queue — `OomKiller::kill_victim()` enqueues here after sending
    /// SIGKILL. Bounded: max 8 pending victims. If the queue is full, the
    /// oldest entry is dropped — that victim is likely already exiting and
    /// releasing memory through normal `exit_mm()`.
    ///
    /// **Dropped victim observability**: When a victim is dropped from the
    /// queue (ring full), the reaper emits a tracepoint
    /// `trace_oom_reaper_drop(pid, comm)` and increments the per-zone
    /// `oom_reap_dropped` counter (exposed via `/proc/zoneinfo`). The
    /// dropped victim still receives SIGKILL and will release memory
    /// through normal `do_exit()` → `exit_mm()`. No explicit recovery
    /// action is needed — dropping only skips the accelerated reaping.
    /// MPMC ring used for uniform ring API; actual concurrency pattern is
    /// MPSC (multiple OOM kill sites produce, single oom_reaper thread consumes).
    /// The single-consumer CAS overhead is negligible on this cold path.
    victims: BoundedMpmcRing<Arc<MmStruct>, 8>,
}

impl OomReaper {
    /// Main reaper loop. Runs in a dedicated kthread, sleeping on `self.wake`
    /// until a victim is enqueued.
    ///
    /// For each victim mm:
    /// 1. Walk the victim's `mm->pgd` page tables.
    /// 2. For each present, anonymous PTE: unmap the page, decrement RSS,
    ///    and return the page to the physical allocator via `free_pages()`.
    /// 3. **Skip file-backed pages** — those belong to the page cache and
    ///    will be reclaimed through normal page cache eviction.
    /// 4. Issue `mmu_notifier_invalidate_range_start()` / `_end()` around
    ///    each VMA's worth of unmapping. This ensures KVM secondary page
    ///    tables and IOMMU mappings are kept coherent with the primary page
    ///    tables during reaping.
    /// 5. After the walk completes, set `MMF_OOM_SKIP` on the victim's
    ///    `mm->flags`. This flag tells `select_victim()` to skip this mm
    ///    on future OOM invocations — its memory has already been reaped.
    ///
    /// **Synchronization with do_exit()**: The reaper sets `MMF_UNSTABLE`
    /// on the victim's `mm->flags` (AtomicU64) BEFORE starting the page
    /// table walk. The victim's `exit_mm()` path checks `MMF_UNSTABLE`:
    /// if set, `exit_mm()` skips `unmap_vmas()` because the reaper is
    /// already handling page table teardown. This prevents double-free
    /// of physical pages (both reaper and do_exit trying to unmap the
    /// same PTE simultaneously).
    ///
    /// The reaper uses atomic PTE compare-and-swap (CAS) to unmap entries:
    /// `cmpxchg(pte, present_value, 0)`. If the CAS fails (do_exit already
    /// cleared the PTE), the reaper skips that entry. This makes the race
    /// between reaper and exit_mm benign even without mmap_lock.
    ///
    /// The reaper does **not** hold `mmap_lock`. It operates on the raw
    /// page table entries with a special "reap mode" that tolerates
    /// concurrent VMA modifications. If the victim holds `mmap_lock`
    /// write-locked (VMA split/merge in progress), the reaper retries
    /// after a short backoff (1ms initial, doubling up to 10ms, max 5
    /// retries). If all retries are exhausted, the reaper sets
    /// `MMF_OOM_SKIP` anyway — the victim will eventually exit and
    /// release pages through the normal path.
    ///
    /// **Statistics** (per-zone, exposed via `/proc/zoneinfo`):
    /// - `oom_reap_pages`: total pages reclaimed by the reaper.
    /// - `oom_reap_attempts`: number of times the reaper attempted to
    ///   reap a victim (including retries).
    /// - `oom_reap_failures`: number of victims where reaping was skipped
    ///   entirely (retry budget exhausted).
    pub fn reap_loop(&self);
}

4.2.1.17 OOM Score¶

The OOM killer uses a per-task score to rank candidates for killing. UmkaOS implements dual scoring: a Linux-compatible scorer for /proc/[pid]/oom_score (ABI obligation) and an improved internal scorer using FMA data for actual victim selection. Both share the same base memory footprint formula. The canonical scoring formula is defined in Section 4.5:

base = rss_pages + swap_pages + pgtable_pages

(dirty_kb is NOT included — dirty pages are already counted in RSS. child_rss is NOT included — removed from Linux in the 2010 OOM rewrite.)

For /proc/[pid]/oom_score (Linux-compatible, matches proc_oom_score()):

let badness = oom_badness(task, totalpages);  // raw page count
let score = (1000 + badness * 1000 / totalpages as i64) * 2 / 3;

The task with the highest internal score is selected as the victim. See Section 4.5 for the full oom_score_compat() and oom_score_internal() functions including cgroup-aware scoring, memory.min/low protection, and the OOM-immune (oom_score_adj = -1000) fast path.

Procfs interfaces:

Path	Access	Range	Description
`/proc/[pid]/oom_score`	Read-only	0–2000	Computed score, evaluated on each read.
`/proc/[pid]/oom_score_adj`	Read-write (`CAP_SYS_RESOURCE`)	-1000 to +1000	Tunable adjustment. `-1000` = OOM-immune (init, critical daemons). `0` = default (pure RSS-proportional). `+1000` = always kill first.
`/proc/[pid]/oom_adj`	Read-write (legacy)	-17 to +15	Deprecated Linux interface, maintained for compatibility. Mapped to `oom_score_adj`: positive values as `oom_adj * 1000 / 15`, and special value `-17` maps to `-1000` (OOM-immune). Writing `oom_adj` updates `oom_score_adj`; reading `oom_score_adj` is authoritative.

4.2.1.18 Cgroup OOM Events¶

The memory cgroup controller (Section 17.2) exposes OOM event counters in memory.events, allowing userspace monitors (systemd, container runtimes, orchestrators) to track OOM activity per cgroup without polling dmesg.

Counters in memory.events:

Key	Type	Description
`oom`	`u64` monotonic	Incremented each time the OOM killer is invoked for this cgroup (an OOM condition occurred, regardless of whether a task was actually killed — e.g., the OOM may have been resolved by a concurrent free).
`oom_kill`	`u64` monotonic	Number of tasks actually killed by the OOM killer within this cgroup. Always `<= oom`.
`oom_group_kill`	`u64` monotonic	Number of group OOM kills — when `memory.oom.group=1`, all tasks in the cgroup subtree are killed as a unit. Each such event increments this counter by 1 (not by the number of tasks killed).

All counters are u64 and monotonically increasing. At the maximum plausible OOM rate of 1 event per second, a u64 counter overflows in approximately 585 billion years — well within the 50-year uptime requirement.

Event notification: Userspace monitors register an eventfd on cgroup.events to receive edge-triggered notifications when any counter in memory.events changes. This avoids polling: systemd-oomd, container runtimes, and Kubernetes kubelet all use this mechanism to react to OOM events within milliseconds.

Cross-references: - Cgroup v2 memory controller and memory.events format: Section 17.2 - /proc/[pid] interface and task lifecycle: Section 8.1 - FMA OOM telemetry and fault correlation: Section 20.1

4.2.1.19 ML Policy Integration — Intelligent OOM Victim Selection¶

The default OOM scoring heuristic (Section 4.5) is a crude proxy for "least important task." It knows nothing about workload semantics: a database writer and a log rotator with equal RSS get equal scores, even though killing the writer has catastrophically higher impact. The ML policy framework (Section 23.1) enables learning-based victim selection that factors in restart cost, dependency structure, and SLA tiers.

Integration point: select_victim() emits an OomVictimSelection observation before making its decision and consults the ML policy for an adjusted score:

/// OOM observation — emitted at each OOM invocation for ML training.
/// Subsystem: MemoryManager, obs_type: MemObs::OomVictimSelection.
///
/// Feature vector (packed into KernelObservation.features[0..9]):
///   [0] constraint_type: OomConstraint discriminant (0=Global, 1=Cgroup, 2=Cpuset)
///   [1] candidate_count: number of eligible tasks considered
///   [2] victim_pid: PID of the selected victim
///   [3] victim_rss_pages: victim's RSS at selection time
///   [4] victim_swap_pages: victim's swap usage
///   [5] victim_oom_score: computed OOM score (0–2000)
///   [6] victim_cgroup_id: cgroup ID of the victim
///   [7] total_free_pages: system free pages at OOM time
///   [8] pressure_stall_us: total PSI memory stall in last 10s (µs)
///   [9] reserved (zero)

/// ML-adjusted OOM scoring. Called by select_victim() when ML policy is active.
///
/// The ML model receives each candidate's feature vector and returns a
/// score adjustment in range [-500, +500]. The final score is:
///   ml_score = base_score + oom_score_adj + ml_adjustment
///
/// The adjustment is bounded to prevent the ML model from overriding
/// administrator-set oom_score_adj by more than 25% of the total range.
/// OOM-immune tasks (oom_score_adj == -1000) are NEVER overridden by ML.
pub fn ml_adjusted_oom_score(task: &Task, constraint: OomConstraint) -> i32 {
    let base = oom_score(task);  // canonical formula from [Section 4.5](#oom-killer)
    // Consult ML policy if registered for MemoryManager subsystem.
    let ml_adj = ml_policy_query(
        ParamId::MemOomScoreAdjustment,
        &OomCandidateFeatures {
            rss_pages: task.mm().rss_pages(),
            swap_pages: task.mm().swap_pages(),
            cgroup_id: task.cgroup_id(),
            restart_count_24h: task.restart_counter_24h(),
            dependency_depth: task.service_dependency_depth(),
            sla_tier: task.cgroup_sla_tier(),
            uptime_s: task.uptime_seconds(),
            mem_growth_rate: task.mm().rss_growth_rate_pages_per_sec(),
        },
    ).unwrap_or_else(|| {
        // Fallback: no adjustment if ML unavailable. Increment counter
        // for observability so operators know OOM scoring has degraded
        // to pure heuristic mode.
        ML_OOM_FALLBACK_COUNT.fetch_add(1, Relaxed);
        0
    })
    (base as i32 + ml_adj).clamp(0, 2000)
}

Candidate feature vector — each OOM candidate is described by:

Feature	Type	Source	Rationale
`rss_pages`	u64	`task.mm().rss_stat`	Current memory footprint
`swap_pages`	u64	`task.mm().swap_stat`	Swap usage (total memory pressure contribution)
`cgroup_id`	u64	`task.cgroup_id()`	Identifies workload group (u64: UmkaOS 50-year rule — cgroup IDs are u64 per Section 17.2)
`restart_count_24h`	u16	FMA restart telemetry	High restart count → killing this task is cheap
`dependency_depth`	u8	Service dependency graph	Tasks depended on by many others should be spared
`sla_tier`	u8	cgroup `sla.tier` attribute	0=best-effort, 1=standard, 2=premium, 3=critical
`uptime_s`	u32	`task.start_time`	Long-running tasks accumulate state; killing them is expensive
`mem_growth_rate`	i32	RSS derivative (pages/sec)	Fast-growing tasks are likely the leak source

Feedback signal — after an OOM kill, the kernel emits a follow-up observation (MemObs::OomOutcome) recording the actual outcome:

/// OOM outcome observation — emitted 5 seconds after OOM kill.
/// Subsystem: MemoryManager, obs_type: MemObs::OomOutcome.
///
/// Feature vector (packed into KernelObservation.features[0..9]):
///   [0] victim_pid: PID of the killed task
///   [1] pages_freed: pages actually freed by victim's exit (measured)
///   [2] time_to_recovery_ms: ms until memory pressure resolved (PSI stall < 1%)
///   [3] service_restart_ms: ms until a replacement task with same cgroup appeared
///   [4] cascading_kills: number of additional OOM kills within 10s of this one
///   [5] cgroup_oom_count_after: memory.events.oom counter 10s after kill
///   [6] victim_was_ml_adjusted: 1 if ML adjusted the score, 0 if pure heuristic
///   [7] baseline_score: what the score would have been without ML adjustment
///   [8] reserved
///   [9] reserved

This creates a complete closed loop: 1. Observe: OOM invoked → emit OomVictimSelection with candidate features 2. Decide: ML model adjusts scores → victim selected 3. Act: victim killed 4. Measure: 5s later, OomOutcome records pages freed, recovery time, cascade impact 5. Learn: policy service correlates (selection features, outcome) pairs for training

Ground truth for training: The optimal OOM kill maximizes pages_freed while minimizing time_to_recovery_ms + cascading_kills * penalty + service_restart_ms. The policy service computes a reward signal:

reward = pages_freed / target_free_pages
       - α * (time_to_recovery_ms / 5000)
       - β * cascading_kills
       - γ * (service_restart_ms / 30000)

where α, β, γ are hyperparameters tuned per deployment (default: α=0.3, β=0.5, γ=0.2).

Safety rails: - ML adjustment is bounded to [-500, +500] — cannot override admin intent - oom_score_adj == -1000 (OOM-immune) is absolute — ML cannot override - If ML policy service crashes or times out (>1ms), fallback to pure heuristic - Decay: ML adjustments decay to zero over 30s if the policy service stops updating - All ML-influenced kills are tagged in the dmesg log with [ml-adjusted]

Cross-references: - ML policy framework: Section 23.1 - Observation bus and feature extractors: Section 23.1 - FMA restart telemetry: Section 20.1 - cgroup SLA tiers: Section 7.10

4.2.2 Allocator Replaceability — 50-Year Uptime Design¶

The physical memory allocator is factored into non-replaceable data and replaceable policy to support live kernel evolution over multi-decade uptimes. This follows the same state-spill pattern proven by the I/O scheduler (Section 16.21) and the qdisc subsystem (Section 16.21).

4.2.2.1 Non-Replaceable Data Structures¶

The following structures are non-replaceable — they are the physical memory map itself. They are Priority 1 verification targets (Section 24.4):

Structure	Reason	Lifetime
`PageArray` (vmemmap)	Page descriptors; indexed by PFN on every page operation	Kernel lifetime
`BuddyAllocator` free lists	Linked lists of free blocks; mutated under `lock`	Kernel lifetime
`PcpPagePool`	Per-CPU fast-path pool; accessed with IRQs disabled (`local_irq_save`)	Kernel lifetime
Watermarks (`wmark_min/low/high`)	Threshold values; read by reclaim heuristics	Kernel lifetime

4.2.2.2 PageArray — vmemmap-Backed Page Descriptor Array¶

The MEMMAP global (Section 4.3) is a flat &'static [Page] array allocated from the boot allocator. For 50-year uptime with CXL memory hot-add, this must become a vmemmap-backed sparse virtual mapping so new Page descriptors can be added at runtime without relocating the existing array.

/// vmemmap-backed page descriptor array. Maps the Page array into a
/// contiguous virtual address range, but only backs it with physical pages
/// where actual memory exists. Holes in the physical address space leave
/// vmemmap PTEs unmapped — accessing a Page descriptor for a non-existent
/// PFN triggers a page fault caught by the kernel exception handler.
///
/// **Layout**: The vmemmap occupies a fixed virtual address range starting
/// at `VMEMMAP_BASE` (architecture-specific). At 64 bytes per Page
/// descriptor and 4 KB physical pages, 1 TB of vmemmap VA space covers
/// 64 TB of physical address space — sufficient for any foreseeable server
/// including multi-chassis CXL expansion.
///
/// **Backing**: Only PFN ranges with actual physical memory have their
/// vmemmap pages backed by real physical pages. One 4 KB vmemmap page
/// holds 64 Page descriptors (64 B × 64 = 4096 B), covering 256 KB of
/// physical memory.
///
/// **Hot-add**: When CXL memory is hot-added
/// ([Section 5.9](05-distributed.md#cxl-30-fabric-integration)),
/// the kernel maps additional vmemmap pages for the new PFN range via
/// `extend_for_hotadd()`. Cost: one `alloc_pages()` call per 64 new Page
/// descriptors. Zero performance impact on existing pages.
///
/// **phys_to_page() cost**: `phys_to_page(paddr)` becomes
/// `&*(VMEMMAP_BASE + (paddr / PAGE_SIZE) * 64)`. This is the same single
/// indexed load instruction as the flat array — the compiler folds the
/// vmemmap base address into the displacement. Zero runtime cost difference.
///
/// **NON-REPLACEABLE**: PageArray is a data structure, not policy. Its
/// layout (Page struct fields, vmemmap base address, PFN indexing) is fixed
/// for the kernel's lifetime. Policy (which pages to allocate, watermarks,
/// compaction decisions) is in `PhysAllocPolicy` (replaceable).
pub struct PageArray {
    /// Base virtual address of the vmemmap region.
    pub base: VirtAddr,
    /// Maximum PFN currently covered by the vmemmap allocation.
    /// Set at boot to `max_phys_addr / PAGE_SIZE`. Extended on memory
    /// hot-add. Reads use `Ordering::Acquire`; writes use `Ordering::Release`
    /// to ensure newly mapped vmemmap pages are visible before `max_pfn`
    /// is incremented.
    pub max_pfn: AtomicU64,
}

impl PageArray {
    /// Look up the Page descriptor for a physical address.
    /// Returns None if the PFN exceeds the current vmemmap range.
    #[inline(always)]
    pub fn get(&self, paddr: PhysAddr) -> Option<&'static Page> {
        let pfn = paddr.as_u64() / PAGE_SIZE as u64;
        if pfn >= self.max_pfn.load(Acquire) {
            return None;
        }
        // SAFETY: Three invariants must hold for this dereference to be sound:
        // 1. The PFN is within range: checked above (pfn < max_pfn).
        // 2. The vmemmap backing page is physically mapped: guaranteed by
        //    mem_init() (boot) or extend_for_hotadd() (hotplug), which map
        //    vmemmap pages before updating max_pfn with Release ordering.
        //    Our Acquire load of max_pfn ensures we see the mapping.
        // 3. The Page struct at this address is initialised: mem_init() and
        //    extend_for_hotadd() zero all Page descriptors before publishing
        //    max_pfn, so no uninitialised read is possible.
        Some(unsafe { &*((self.base.as_u64() + pfn * 64) as *const Page) })
    }

    /// Extend the vmemmap for newly hot-added physical memory.
    /// Maps vmemmap pages for the PFN range `[pfn_start, pfn_end)`.
    /// Called by the CXL hot-add handler ([Section 5.9](05-distributed.md#cxl-30-fabric-integration))
    /// and the ACPI memory hotplug handler.
    ///
    /// # Protocol
    /// 1. For each 4 KB vmemmap page needed (each covers 64 Page descriptors):
    ///    - `alloc_pages(0, GFP_KERNEL, nid)` from the buddy allocator (nid =
    ///      the NUMA node of the hot-added memory).
    ///    - Map the allocated page at `VMEMMAP_BASE + pfn * 64` in the kernel
    ///      page table.
    /// 2. Zero-initialise all new Page descriptors (refcount=0, flags=0).
    /// 3. Update `max_pfn` with `Release` ordering AFTER all PTEs are installed.
    ///
    /// # Errors
    /// Returns `AllocError` if vmemmap backing pages cannot be allocated.
    /// On failure, no partial extension is visible — `max_pfn` is not updated.
    pub fn extend_for_hotadd(
        &self,
        pfn_start: u64,
        pfn_end: u64,
    ) -> Result<(), AllocError> {
        // 1. Compute the vmemmap page range.
        // 2. Allocate physical pages for each vmemmap page.
        // 3. Map into kernel page table at VMEMMAP_BASE + pfn * 64.
        // 4. Zero-initialise Page descriptors.
        // 5. self.max_pfn.store(pfn_end, Release);
        Ok(())
    }
}

4.2.2.3 MEMMAP to vmemmap Migration¶

At early boot the physical memory allocator uses MEMMAP — a flat &'static [Page] array allocated from the boot allocator (BootVec-backed, one per NUMA zone). This array is sufficient for the buddy allocator hand-off (Phase 1.1) and slab init (Phase 1.2), but it cannot grow at runtime: it occupies a contiguous boot-time allocation that cannot be extended for CXL hot-add or memory hotplug.

PageArray is the permanent vmemmap-backed sparse mapping (defined above) that replaces MEMMAP once the slab allocator is operational. The migration from MEMMAP to PageArray is a one-time, boot-only operation that must be completed before any memory hot-add event can occur.

Migration point: Phase 1.2.1 — immediately after slab_init() completes (Phase 1.2) and before page_cache_init() (Phase 1.4). At this point the buddy allocator and slab allocator are both operational, so alloc_pages() is available to back the vmemmap.

memmap_to_vmemmap() — Migration protocol:

memmap_to_vmemmap():
    // Phase A: Allocate and map vmemmap backing pages.
    //
    // Compute the PFN range from the boot-time memory map. For each 4 KB
    // vmemmap page (covers 64 Page descriptors = 256 KB of physical memory):
    //   1. alloc_pages(0, GFP_KERNEL, nid) — allocate one backing page on
    //      the NUMA node that owns that physical range.
    //   2. Map the backing page at VMEMMAP_BASE + (pfn_group * size_of::<Page>()) in
    //      the kernel page table.
    //
    // Holes in the physical address space are skipped — no vmemmap page is
    // allocated for PFN ranges with no physical memory. Attempting to access
    // a Page descriptor in an unmapped hole triggers a kernel page fault
    // caught by the FMA framework ([Section 20.1](20-observability.md#fault-management-architecture)).
    let max_pfn = boot_memmap.max_pfn()
    let vmemmap_base = arch::VMEMMAP_BASE

    for pfn_group in (0..max_pfn).step_by(64):
        if !boot_memmap.has_memory_in_range(pfn_group, pfn_group + 64):
            continue   // hole — skip
        let nid = boot_memmap.pfn_to_nid(pfn_group)
        let page = alloc_pages(0, GFP_KERNEL, nid)?
        kernel_page_table.map(
            vmemmap_base + pfn_group * 64,  // virtual address
            page.phys_addr(),               // physical address
            PAGE_SIZE,                      // size
            PTE_PRESENT | PTE_WRITE | PTE_NX,
        )

    // Phase B: Copy descriptors from MEMMAP to vmemmap.
    //
    // For each valid PFN, copy the Page descriptor from the boot-time MEMMAP
    // array to the vmemmap location. Both source and destination are 64-byte
    // aligned, so this is a simple memcpy per descriptor. The copy preserves
    // all fields: refcount, flags, mapping, lru linkage, node_id, zone_id.
    //
    // Ordering: no concurrent access is possible — this runs on the BSP
    // before SMP bringup (Phase 3.1), so no locks or atomics are needed.
    let old_memmap = MEMMAP.get().expect("MEMMAP not initialised")
    for pfn in 0..max_pfn:
        if !boot_memmap.has_memory_at_pfn(pfn):
            continue
        let src: &Page = &old_memmap[pfn as usize]
        let dst: *mut Page = (vmemmap_base + pfn * 64) as *mut Page
        // SAFETY: dst is mapped (Phase A), src is valid (boot MEMMAP),
        // no concurrent access (BSP only, pre-SMP).
        unsafe { core::ptr::copy_nonoverlapping(src as *const Page, dst, 1) }

    // Phase C: Activate PageArray and retire MEMMAP.
    //
    // 1. Construct the global PageArray singleton with the vmemmap base and
    //    max_pfn. Store it in the global `PAGE_ARRAY` OnceCell.
    // 2. Update every zone's page_array pointer from the old MEMMAP slice to
    //    the new PageArray. This is a single pointer-width store per zone.
    //    After this point, all phys_to_page() calls go through vmemmap.
    // 3. Free the boot allocator's MEMMAP backing memory back to the buddy
    //    allocator. The BootVec regions become regular free pages.
    //    This reclaims the MEMMAP overhead (e.g., 4 MB on a 256 GB system).
    let page_array = PageArray {
        base: VirtAddr::new(vmemmap_base),
        max_pfn: AtomicU64::new(max_pfn),
    }
    PAGE_ARRAY.set(page_array).expect("PAGE_ARRAY already set")

    // Update zone page_array pointers.
    for zone in all_zones():
        zone.page_array = PAGE_ARRAY.get().unwrap()

    // Free boot MEMMAP memory.
    boot_alloc.free_memmap_regions()

Post-migration invariants:

PAGE_ARRAY is the sole global page descriptor array. All code paths use PAGE_ARRAY.get().get(paddr) for page descriptor lookup.
MEMMAP is invalid — the OnceCell still holds the stale slice pointer, but the backing memory has been returned to the buddy allocator. Any access to MEMMAP after migration is a kernel bug. Debug builds poison the MEMMAP OnceCell to trigger an immediate panic on stale access.
The vmemmap is extensible via PageArray::extend_for_hotadd() for CXL hot-add events occurring after boot.

Cost: The migration allocates one 4 KB page per 64 Page descriptors (same physical memory as the boot MEMMAP, just in a sparse virtual mapping). On a 256 GB system: 4 MB of vmemmap pages, allocated in ~65,536 alloc_pages() calls (~1 ms total at ~15 ns per call). The descriptor copy is ~4 MB of memcpy, completing in <1 ms. Total migration time: <5 ms — negligible relative to the overall boot sequence.

Cross-references: - Boot MEMMAP (Page struct, MEMMAP global): Section 4.3 - Boot init phase ordering: Section 2.3 - CXL memory hot-add: Section 5.9 - FMA framework (page fault on unmapped vmemmap hole): Section 20.1

4.2.2.4 Replaceable Physical Allocator Policy¶

/// Replaceable policy trait for the physical allocator. Governs block
/// selection, splitting, merging, NUMA fallback, watermarks, and compaction
/// decisions. The trait methods are called only on warm/cold paths — the
/// per-CPU page pool (PCP) hot path never invokes policy.
///
/// Replaced via `EvolvableComponent` ([Section 13.18](13-device-classes.md#live-kernel-evolution)). The
/// default implementation matches Linux 6.x heuristics; replacements may
/// use ML-predicted access patterns
/// ([Section 23.1](23-ml-policy.md#aiml-policy-framework-closed-loop-kernel-intelligence)).
pub trait PhysAllocPolicy: Send + Sync {
    /// Select a free block of at least `order` from the buddy allocator.
    /// Called on PCP refill (batch of 31 pages) and higher-order allocations.
    fn select_block(
        &self,
        buddy: &BuddyAllocator,
        order: u32,
        migrate: MigrateType,
        gfp: GfpFlags,
    ) -> Option<(*mut Page, u32)>; // (page_ptr, actual_order)

    /// Split a higher-order block down to the requested order.
    /// The split algorithm (see [Section 4.2](#physical-memory-allocator--split-algorithm)) is
    /// mechanical, but the policy controls which half to allocate and how
    /// to tag free blocks for anti-fragmentation grouping.
    fn split_block(
        &self,
        buddy: &BuddyAllocator,
        page: *mut Page,
        current_order: u32,
        target_order: u32,
        migrate: MigrateType,
    );

    /// Merge a freed block with its buddy. Returns the final merged order.
    /// Policy may choose to defer merging under certain conditions (e.g.,
    /// keeping order-0 blocks readily available during memory pressure to
    /// avoid immediate re-splitting).
    fn merge_block(
        &self,
        buddy: &BuddyAllocator,
        page: *mut Page,
        order: u32,
        migrate: MigrateType,
    ) -> u32; // final merged order

    /// Determine NUMA fallback order for a given allocation.
    /// Default: walk distance table in ascending order.
    /// Replacement: might consider memory tier (CXL vs DRAM), current
    /// pressure per node, or ML-predicted access patterns.
    fn numa_fallback_order(
        &self,
        preferred_nid: u32,
        gfp: GfpFlags,
    ) -> ArrayVec<u32, NUMA_NODES_STACK_CAP>;

    /// Recalculate watermarks for a node. Called on sysctl write
    /// (`/proc/sys/vm/min_free_kbytes`), memory hot-add, or immediately
    /// after `PhysAllocPolicy` live replacement. Cold path.
    fn recalc_watermarks(
        &self,
        buddy: &BuddyAllocator,
        node_pages: u64,
    ) -> (u64, u64, u64); // (wmark_min, wmark_low, wmark_high)

    /// Decide whether to trigger memory compaction. Called by kswapd
    /// when watermarks are breached. Returns true if compaction should
    /// run before falling back to direct reclaim.
    fn should_compact(
        &self,
        buddy: &BuddyAllocator,
        order: u32,
        gfp: GfpFlags,
    ) -> bool;

    /// PCP batch size tuning. Called when `percpu_pagelist_high_fraction`
    /// is written via sysctl. Returns `(high, batch)` for the given node.
    fn pcp_watermarks(
        &self,
        node_pages: u64,
        nr_cpus_on_node: u32,
        fraction: u32,
    ) -> (u32, u32); // (high, batch)
}

PhysAllocPolicyVtable — KABI-compatible vtable struct:

The PhysAllocPolicy trait is dispatched at runtime through a #[repr(C)] vtable struct, enabling live policy replacement via AtomicPtr swap without trait object indirection. This follows the same pattern as all KABI vtable structs (Section 12.1).

/// C-ABI-compatible vtable for `PhysAllocPolicy`. Stored behind
/// `BuddyAllocator::policy: AtomicPtr<PhysAllocPolicyVtable>`.
/// Each function pointer corresponds to a `PhysAllocPolicy` trait method
/// with the self-reference replaced by an opaque `*const ()` context pointer
/// (the concrete policy object, cast by the caller).
///
/// **Header fields** (`vtable_size`, `kabi_version`) follow the standard
/// KABI vtable header convention: the evolution framework checks
/// `vtable_size` to determine which function pointers are present (forward
/// compatibility), and `kabi_version` to reject incompatible policy modules.
// kernel-internal, not KABI
#[repr(C)]
pub struct PhysAllocPolicyVtable {
    /// Size of this vtable struct in bytes. Used by the evolution framework
    /// to detect newer vtable layouts that append additional function pointers.
    /// Must be set to `core::mem::size_of::<PhysAllocPolicyVtable>() as u64`.
    pub vtable_size: u64,

    /// KABI version of this policy module. Checked at registration time
    /// against the kernel's expected version range. Incompatible versions
    /// (major mismatch) are rejected; minor additions are tolerated via
    /// `vtable_size` probing.
    pub kabi_version: u64,

    /// See `PhysAllocPolicy::select_block`.
    pub select_block: unsafe extern "C" fn(
        ctx: *const (),
        buddy: *const BuddyAllocator,
        order: u32,
        migrate: MigrateType,
        gfp: GfpFlags,
        out_page: *mut *mut Page,
        out_order: *mut u32,
    ) -> bool, // true = found, false = no block available

    /// See `PhysAllocPolicy::split_block`.
    pub split_block: unsafe extern "C" fn(
        ctx: *const (),
        buddy: *const BuddyAllocator,
        page: *mut Page,
        current_order: u32,
        target_order: u32,
        migrate: MigrateType,
    ),

    /// See `PhysAllocPolicy::merge_block`. Returns final merged order.
    pub merge_block: unsafe extern "C" fn(
        ctx: *const (),
        buddy: *const BuddyAllocator,
        page: *mut Page,
        order: u32,
        migrate: MigrateType,
    ) -> u32,

    /// See `PhysAllocPolicy::numa_fallback_order`.
    /// Writes node IDs into `out_buf[0..out_len]`, returns count written.
    pub numa_fallback_order: unsafe extern "C" fn(
        ctx: *const (),
        preferred_nid: u32,
        gfp: GfpFlags,
        out_buf: *mut u32,
        out_len: u32,
    ) -> u32,

    /// See `PhysAllocPolicy::recalc_watermarks`.
    /// Returns (wmark_min, wmark_low, wmark_high) via out-pointers.
    pub recalc_watermarks: unsafe extern "C" fn(
        ctx: *const (),
        buddy: *const BuddyAllocator,
        node_pages: u64,
        out_min: *mut u64,
        out_low: *mut u64,
        out_high: *mut u64,
    ),

    /// See `PhysAllocPolicy::should_compact`.
    pub should_compact: unsafe extern "C" fn(
        ctx: *const (),
        buddy: *const BuddyAllocator,
        order: u32,
        gfp: GfpFlags,
    ) -> bool,

    /// See `PhysAllocPolicy::pcp_watermarks`.
    /// Returns (high, batch) via out-pointers.
    pub pcp_watermarks: unsafe extern "C" fn(
        ctx: *const (),
        node_pages: u64,
        nr_cpus_on_node: u32,
        fraction: u32,
        out_high: *mut u32,
        out_batch: *mut u32,
    ),

    /// Opaque context pointer passed as the first argument to all
    /// function pointers above. Points to the concrete policy object.
    /// The kernel does not interpret this — it is owned and freed by
    /// the policy module on unload.
    pub ctx: *const (),
}

4.2.2.5 Performance Budget — Policy Indirection Cost¶

Path	Frequency	PhysAllocPolicy called?	Cost
`alloc_pages(order=0)` — PCP hit	>95% of all allocations	No	~8-18 cycles (`local_irq_save` + stack pop + `local_irq_restore`)
`alloc_pages(order=0)` — PCP miss, refill	~once per 31 allocs	Yes: `select_block()` × batch	Amortized: <1 cycle/alloc
`alloc_pages(order>0)`	<5% of allocations	Yes: `select_block()`, `split_block()`	~50-200 cycles (under lock)
`free_pages(order=0)` — PCP push	>95% of all frees	No	~8-18 cycles (`local_irq_save` + stack push + `local_irq_restore`)
`free_pages(order=0)` — PCP drain	~once per 186+ frees	Yes: `merge_block()` × batch	Amortized: <1 cycle/free
Watermark recalc	sysctl write, hot-add	Yes: `recalc_watermarks()`	Cold path, negligible
Compaction decision	kswapd check	Yes: `should_compact()`	Background thread

Key invariant: The hot paths (PcpPagePool pop/push) never call through PhysAllocPolicy. These paths are fixed code operating on fixed data structures with IRQs disabled (local_irq_save). Policy is invoked only on warm/cold paths — PCP refill/drain (amortized over 31+ allocs/frees per PCP_BATCH_SIZE), higher-order allocation (uncommon), and background decisions (kswapd, watermark recalc).

4.2.2.6 PageExtArray — Per-Page Extension Metadata¶

The Page struct is frozen at 64 bytes (one cache line) for the kernel's lifetime — growing it would break every consumer and waste cache bandwidth for fields that most subsystems don't need. When new per-page metadata is required (memory tiering, CXL fabric locality, hardware tagging), it goes into a parallel extension array indexed by PFN — Pattern 1 (Extension Array) from the Data Format Evolution Framework (Section 13.18).

/// Per-page extension metadata. Parallel to PageArray, indexed by PFN.
/// Allocated via vmemmap at `PAGE_EXT_BASE` (a separate VA range from
/// the main `VMEMMAP_BASE`). Only backed by physical pages when
/// extension features are active — zero memory cost when unused.
///
/// **Size**: 16 bytes per page (one quarter cache line). At 16 B per
/// 4 KB page, a 1 TB system requires 4 GB of extension metadata.
/// For systems that don't use any extension features, no physical
/// pages are allocated — the VA range exists but faults on access
/// (triggering a kernel OOPS, caught by the FMA framework).
///
/// **NON-REPLACEABLE DATA**: same status as PageArray. Layout changes
/// require a Data Format Evolution payload (Pattern 2:
/// Shadow-and-Migrate) — the evolution framework allocates a new
/// vmemmap region with the enlarged struct, migrates entries in the
/// background, and swaps pointers atomically during Phase B.
// kernel-internal, not KABI — PageExt layout is opaque to Tier 1 drivers.
#[repr(C, align(16))]
pub struct PageExt {
    /// Memory tiering heat counter. Incremented by the page replacement
    /// policy on access; decayed by the kswapd background scanner.
    /// Used by the ML-guided tier migration policy (Section 23.1)
    /// to decide DRAM ↔ CXL tier placement.
    /// Initial value: 0 (cold). Range: 0..=65535.
    ///
    /// **Saturation analysis**: This is a saturating metric, not a monotonic
    /// counter — it is incremented on access and periodically decayed (halved
    /// by kswapd every scan period). Saturation at 65535 means "maximally hot"
    /// and does not lose discrimination: pages at saturation are all
    /// hot-tier candidates regardless of their exact heat. The 50-year
    /// counter rule does not apply (this is not a monotonic identifier).
    pub heat: AtomicU16,
    /// CXL fabric locality tag. Identifies the CXL switch/port topology
    /// path to this page's physical device. Set during CXL hot-add
    /// (Section 5.10) and used by the NUMA fallback policy for
    /// latency-aware placement. 0 = local DRAM (no CXL).
    pub cxl_locality: u16,
    /// Hardware memory tag (MTE on AArch64, future tagging on other
    /// architectures). Stored here rather than in Page.flags to keep
    /// Page at 64 bytes. 0 = untagged.
    pub hw_tag: u8,
    /// Extension feature flags. Bit 0: heat counter active.
    /// Bit 1: CXL locality valid. Bit 2: hw_tag valid.
    /// Bit 3: wb_fail_count valid.
    /// Bits 4-7: reserved (must be zero).
    pub ext_flags: u8,
    /// Writeback failure count. Incremented by `writeback_end_io()` on
    /// I/O error. When >= 3, the page is marked `PageFlags::PERMANENT_ERROR`
    /// and excluded from future writeback attempts. Reset to 0 on
    /// successful writeback. AtomicU8 with Relaxed ordering. Writer:
    /// writeback completion path (holds page lock). Readers: writeback
    /// scanner and reclaim (may check PERMANENT_ERROR or read wb_fail_count
    /// without the page lock — atomic required for Rust soundness).
    ///
    /// Cross-references: [Section 4.6](#writeback-subsystem), [Section 15.2](15-storage.md#block-io-and-volume-management).
    pub wb_fail_count: AtomicU8,
    /// Reserved for future per-page extensions. Must be zero.
    /// When a new field is needed, carve it from _reserved and
    /// assign a new ext_flags bit. When _reserved is exhausted,
    /// use Pattern 2 (Shadow-and-Migrate) to grow PageExt.
    pub _reserved: [u8; 9],
}

// Static assertion: 2 + 2 + 1 + 1 + 1 + 9 = 16 bytes.
const _PAGE_EXT_SIZE_CHECK: () = assert!(
    core::mem::size_of::<PageExt>() == 16,
    "PageExt must be exactly 16 bytes"
);

Per-architecture PageExt vmemmap base addresses:

Architecture	`PAGE_EXT_BASE`	Coverage
x86-64	`0xFFFF_E100_0000_0000`	256 TB of physical memory
AArch64	`0xFFFF_8100_0000_0000`	256 TB of physical memory
RISC-V (Sv48)	`0xFFFF_C100_0000_0000`	128 TB of physical memory
PPC64LE	`0xC001_0000_0000_0000`	64 TB of physical memory
ARMv7	Flat array (limited VA)	≤4 GB physical memory
PPC32	Flat array (limited VA)	≤4 GB physical memory
s390x	`0x0002_0000_0000_0000`	512 TB of physical memory (Region-1 table coverage)
LoongArch64	`0x9000_0100_0000_0000`	256 TB of physical memory (DMW window)

/// Global PageExtArray. Initialized lazily on first use (e.g., CXL hot-add,
/// MTE policy enablement). Until initialized, `PAGE_EXT_ARRAY.get()` returns
/// `None` and no physical memory is consumed.
pub static PAGE_EXT_ARRAY: OnceCell<PageExtArray> = OnceCell::new();

pub struct PageExtArray {
    /// Base VA of the extension vmemmap.
    pub base: VirtAddr,
    /// Maximum PFN covered. Must match `PAGE_ARRAY.max_pfn`.
    /// Updated atomically on hot-add (same protocol as PageArray).
    pub max_pfn: AtomicU64,
}

impl PageExtArray {
    /// Get the extension metadata for a physical page.
    /// Returns None if PFN is out of range.
    #[inline]
    pub fn get(&self, pfn: u64) -> Option<&PageExt> {
        if pfn >= self.max_pfn.load(Ordering::Acquire) {
            return None;
        }
        // SAFETY: vmemmap pages for PFNs < max_pfn are backed.
        // Use u64 arithmetic (consistent with PageArray::get) to avoid
        // truncation on 32-bit targets where pfn * 16 could overflow usize.
        Some(unsafe { &*((self.base.as_u64() + pfn * 16) as *const PageExt) })
    }

    /// Extend the extension vmemmap for newly hot-added physical memory.
    /// Must be called after `PageArray::extend_for_hotadd()` for the same
    /// PFN range — the extension array tracks the same physical memory.
    pub fn extend_for_hotadd(&self, pfn_start: u64, pfn_end: u64) {
        // Same protocol as PageArray::extend_for_hotadd():
        // 1. Allocate physical pages for extension vmemmap in range.
        // 2. Map at PAGE_EXT_BASE + (pfn * 16) with zero-fill.
        // 3. Update max_pfn with Release ordering.
    }
}

Performance: Accessing PageExt costs one additional cache-line load (~3-5 ns L1 hit, ~10-15 ns L2 hit) beyond the Page access. Components that don't use extensions pay nothing — they never dereference PAGE_EXT_ARRAY. The extension array is optional infrastructure; its overhead is pay-for-use only.

Cross-references: - Live kernel evolution table: Section 13.18 - Data format evolution framework: Section 13.18 - I/O scheduler state-spill pattern: Section 16.21 - Qdisc state-spill pattern: Section 16.21 - Page frame descriptor (Page struct, MEMMAP): Section 4.3 - CXL memory hot-add: Section 5.9 - ML-guided allocation policy: Section 23.1 - Formal verification targets: Section 24.4

4.3 Slab Allocator¶

For kernel object allocation (capabilities, file descriptors, inodes, etc.):

Per-CPU slab caches with magazine-based design
Per-NUMA partial slab lists
Object constructors/destructors for complex types
SLUB-style merging of similarly-sized caches

Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules. &self methods use interior mutability for mutation. Atomic fields use .store()/.load(). See CLAUDE.md §Spec Pseudocode Quality Gates. Pointer types: Object pointers in the allocator pseudocode use *mut u8 (raw byte pointer) uniformly. The caller casts to the target type: let obj = kmalloc(size, gfp)? as *mut T. This matches the C-style void* return convention for generic allocators. Within the slab internals, SlabPage references use &SlabPage (immutable borrow under the node_partial lock); the lock guard provides mutable access to the freelist via interior mutability (UnsafeCell<Freelist>).

4.3.1.1 `kmalloc()` / `kfree()` Public API¶

Every kernel subsystem that allocates variable-sized objects uses kmalloc() and kfree(). These are thin wrappers that map byte sizes to slab size classes and perform pointer-to-cache reverse lookups.

/// Size class table. Maps a byte size to the slab size class index.
/// Size classes: 8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 1536, 2048,
/// 3072, 4096, 5120, 6144, 7168, 8192, 9216, 10240, 11264, 12288, 13312,
/// 14336, 15360, 16384.
/// Classes 0-3 are powers of two (8..64). Class 4 is 96 (3*32 — matches
/// Linux kmalloc-96 for 3-pointer structs). Class 5 is 128. Class 6 is 192
/// (3*64 — matches Linux kmalloc-192). Classes 7-9 are 256, 512, 1024.
/// Classes 10-11 are 512-byte steps (1536, 2048). Classes 12-25 are
/// 1024-byte steps (3072..16384).
/// Total: 26 size classes (SLAB_SIZE_CLASSES = 26).
///
/// The 96 and 192 byte classes avoid 33% internal fragmentation for
/// common kernel objects (task_struct fields, inode caches, dentry caches)
/// that are 3x a power-of-two. Linux includes both in kmalloc_info[].
const SLAB_SIZE_CLASSES: usize = 26;
const KMALLOC_SIZE_TABLE: [u32; SLAB_SIZE_CLASSES] = [
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 1536, 2048, 3072,
    4096, 5120, 6144, 7168, 8192, 9216, 10240, 11264, 12288,
    13312, 14336, 15360, 16384,
];

/// Maximum number of objects in a single magazine. Each per-CPU
/// `MagazinePair` contains two magazine pointers (`Option<NonNull<SlabMagazine>>`
/// for loaded + spare), each magazine holding up to MAGAZINE_SIZE object pointers. The constant is referenced
/// throughout the slab allocator for batch-fill sizes, depot exchange
/// counts, and total_allocated drift bounds.
///
/// Value 64 balances per-CPU memory overhead against depot exchange
/// frequency. Linux SLUB uses a similar batch size for per-CPU freelists.
pub const MAGAZINE_SIZE: usize = 64;

/// Convert a requested byte size to the smallest size class index that
/// can satisfy it. Returns `None` if size > 16384 (caller must use
/// `alloc_pages()` directly for large allocations).
///
/// For sizes <= 1024: linear search through the first 10 size classes
/// (8, 16, 32, 64, 96, 128, 192, 256, 512, 1024). The non-power-of-two
/// classes (96, 192) prevent using a pure bit-manipulation index.
/// For sizes 1025..=16384: linear search in the step region (classes 10-25).
///
/// Hot path: called on every `kmalloc()`. The linear search is bounded
/// to at most 26 iterations (SLAB_SIZE_CLASSES). In practice, most kernel
/// allocations are <= 256 bytes, so the search terminates in 8 iterations.
#[inline]
pub fn kmalloc_index(size: usize) -> Option<u32> {
    if size == 0 {
        return None; // zero-size: handled by kmalloc() directly
    }
    if size > 16384 {
        return None; // too large for slab — use alloc_pages()
    }
    // Linear scan: find the smallest size class >= requested size.
    for i in 0..SLAB_SIZE_CLASSES as u32 {
        if KMALLOC_SIZE_TABLE[i as usize] as usize >= size {
            return Some(i);
        }
    }
    None
}

/// Allocate `size` bytes of kernel memory.
///
/// This is the primary kernel allocation interface. Maps the requested
/// byte size to a slab size class and calls `slab_alloc()`. For
/// allocations larger than 16384 bytes, falls back to `alloc_pages()`
/// directly (with the result tracked via a large-alloc XArray for
/// `kfree()` reverse lookup).
///
/// # Arguments
/// - `size`: Number of bytes to allocate (may be zero -- returns a
///    valid non-null pointer to a zero-size allocation).
/// - `gfp`: Allocation flags controlling reclaim behavior
///    ([Section 4.2](#physical-memory-allocator--gfpflags-integration-summary)).
///
/// # Returns
/// Pointer to at least `size` bytes of memory (may be more due to
/// size-class rounding). The memory is uninitialized unless
/// `GFP_ZERO` is set in `gfp`.
///
/// # Errors
/// Returns `AllocError` if memory cannot be allocated.
#[inline]
pub fn kmalloc(size: usize, gfp: GfpFlags) -> Result<*mut u8, AllocError> {
    // Zero-size allocation: return a Rust-native ZST sentinel.
    // NonNull::dangling() is non-null, non-dereferenceable, and requires
    // no real allocation. kfree() checks for this at entry and no-ops.
    // This matches the intent of Linux's ZERO_SIZE_PTR using Rust idiom.
    if size == 0 {
        return Ok(NonNull::<u8>::dangling().as_ptr());
    }
    match kmalloc_index(size) {
        Some(sc) => slab_alloc(sc, gfp),
        None => {
            // Large allocation: fall back to buddy allocator.
            let order = pages_order_for_size(size);
            let page = alloc_pages(order, gfp, numa_mem_id())?;
            let ptr = page_to_virt(page);
            // Track in the large-alloc XArray for kfree() reverse lookup.
            LARGE_ALLOCS.insert(ptr as u64, LargeAlloc { order });
            Ok(ptr)
        }
    }
}

/// Free memory previously allocated by `kmalloc()`.
///
/// Determines the owning slab cache from the pointer via the
/// Page.mapping backpointer, then calls the slab free path. For
/// large allocations (backed directly by buddy pages), frees via
/// `free_pages()`.
///
/// # Safety
/// `ptr` must have been returned by a previous `kmalloc()` call and
/// must not have been freed already (double-free is detected in debug
/// builds via the slab freelist canary).
pub fn kfree(ptr: *mut u8) {
    if ptr.is_null() {
        return; // kfree(NULL) is a no-op (matches Linux)
    }
    // Zero-size allocation sentinel: NonNull::dangling() from kmalloc(0).
    // No real allocation was made — no-op free.
    if ptr == NonNull::<u8>::dangling().as_ptr() {
        return;
    }

    // Check PG_SLAB first — >99% of kfree() calls are slab-backed.
    // Checking LARGE_ALLOCS first would penalize every kfree with an
    // unnecessary XArray remove on the cold large-alloc table.
    let page = virt_to_page(ptr);
    if page.flags.load(Relaxed) & PG_SLAB != 0 {
        // Slab-backed allocation: reverse-lookup via Page.mapping.
        // SAFETY: The mapping pointer is valid for the slab's lifetime.
        // The slab cache cannot be GC'd while objects are outstanding.
        let cache = unsafe { &*(page.mapping.load(Acquire) as *const SlabCache) };
        let sc = cache.size_class;
        // NOTE: total_allocated is NOT decremented here in the fast path.
        // It is decremented in slab_free_slow() to match the symmetry with
        // slab_alloc_slow() where it is incremented. The fast path (magazine
        // push) does not touch the counter — the count lags by up to
        // MAGAZINE_SIZE * num_cpus, acceptable for GC eligibility checks.
        cache.last_activity_ns.store(now_ns(), Relaxed);
        free(ptr, sc);
        return;
    }

    // Large allocation (backed by buddy pages): lookup in LARGE_ALLOCS.
    if let Some(large) = LARGE_ALLOCS.remove(ptr as u64) {
        free_pages(page, large.order);
        return;
    }

    // Neither PG_SLAB nor LARGE_ALLOCS — invalid pointer.
    debug_assert!(false, "kfree: ptr is not a slab or large allocation");
}

/// Large allocation tracking for kfree() reverse lookup. Keyed by
/// virtual address (integer key -- XArray per collection policy).
/// Only used for allocations >16384 bytes that bypass slab caching.
/// Cold path: large allocations are rare in kernel code.
static LARGE_ALLOCS: XArray<LargeAlloc> = XArray::new();

struct LargeAlloc {
    order: u32,
}

4.3.1.2 Internal Slab Structures¶

SlabList — doubly-linked list of slab objects within a slab cache:

/// Doubly-linked list of `Slab` objects. Used for the `partial` list within
/// each NUMA node's slab cache — slabs that have some free objects but are
/// not completely empty. The allocator draws from the partial list when the
/// per-CPU magazine is empty: it pops the head slab, allocates an object
/// from it, and (if the slab is now full) removes it from the list. When
/// an object is freed back to a full slab, that slab is pushed onto the
/// partial list.
///
/// Invariant: `count` always equals the number of nodes reachable from
/// `head` by following `next` pointers. `head.prev == null` and
/// `tail.next == null`.
/// SAFETY: `head` and `tail` are null when the list is empty (count == 0).
/// When non-null, they point to valid `Slab` instances allocated from the
/// buddy allocator's slab metadata pages. All slabs on the list are owned
/// by the same `SlabCache`. Access is protected by the `SlabCache.node_partial`
/// SpinLock — no unsynchronized reads or writes.
pub struct SlabList {
    /// First slab in the list, or null if empty.
    pub head: *mut Slab,
    /// Last slab in the list, or null if empty.
    pub tail: *mut Slab,
    /// Number of slabs in the list.
    pub count: usize,
}

/// Intrusive list links embedded in each `Slab` header. These fields are
/// only valid when the slab is on a `SlabList` (partial list). A slab that
/// is on the per-CPU magazine or is completely free does not use these links.
///
/// Each `Slab` represents one page (or compound page for large objects)
/// subdivided into fixed-size object slots.
// (embedded in Slab struct):
//   pub prev: *mut Slab,
//   pub next: *mut Slab,

Slab — Single slab page descriptor:

/// A single slab page: a contiguous memory region divided into `capacity`
/// fixed-size objects. The freelist is stored **out-of-band** (in a separate
/// metadata page, not within the object region) for security and debuggability.
///
/// **UmkaOS vs Linux**: Linux stores the freelist inside free objects themselves
/// (each free slot contains a pointer to the next free slot). This enables
/// use-after-free exploitation (an attacker who writes to a freed object can
/// corrupt the freelist). UmkaOS's out-of-band freelist prevents this class of
/// bug from being exploitable at the cost of ~2 bytes of metadata per object.
///
/// **Metadata pool amortization**: The OOB freelist pointer pool allocates
/// metadata entries in chunks of 64 from a dedicated 128-byte slab cache. Each
/// chunk serves 64 slab objects, amortizing the per-object metadata overhead to
/// ~2 bytes/object (128 bytes / 64 entries). The pool is refilled lazily on slab
/// page allocation, not on every object allocation.
///
/// Typical slab sizes: 8-object slab for large objects (≥4 KB/8 = 512 B each),
/// up to 512 objects for small objects (8-16 bytes each in a 4 KB page).
pub struct Slab {
    /// Physical address of the first byte of the object region.
    /// Always page-aligned. The object region occupies `capacity * obj_size` bytes.
    pub base: PhysAddr,

    /// Size of each object in bytes, including alignment padding.
    /// Constant for the lifetime of the slab (set at slab creation).
    pub obj_size: u32,

    /// Maximum number of objects this slab can hold. Computed as
    /// `usable_page_bytes / obj_size` at slab creation.
    pub capacity: u16,

    /// Number of currently free (unallocated) objects in this slab.
    pub free_count: u16,

    /// Compact out-of-band freelist. Stores indices (0..capacity) of free object
    /// slots. The top `free_count` entries are valid free slot indices.
    /// Allocation: pop `freelist[free_count - 1]`, decrement `free_count`.
    /// Deallocation: push slot index to `freelist[free_count]`, increment `free_count`.
    ///
    /// Double-free detection: before pushing, verify the slot is not already
    /// in `freelist[0..free_count]` (O(free_count), acceptable for debug builds;
    /// skipped in release with a KVA-guarded canary approach).
    ///
    /// Memory: allocated from the buddy allocator as a separate page (or from
    /// a small metadata pool for slabs with capacity ≤ 256). Never overlaps
    /// with the object region. Freed alongside the slab page during GC
    /// Phase 5 (return pages to buddy) — the GC must free both the object
    /// pages and the freelist metadata page for each reclaimed slab.
    ///
    /// SAFETY: `freelist` points to a contiguous `[u16; capacity]` array
    /// allocated from the slab metadata pool (buddy allocator). Valid for
    /// the slab's lifetime: allocated in `slab_page_init()`, freed in
    /// `slab_shrink()` (GC Phase 5). Never null for an initialized slab
    /// (allocated before the slab is inserted into the partial list).
    /// Only accessed under the owning `SlabCache.node_partial` SpinLock.
    pub freelist: *mut u16,

    /// Pointer to the `SlabCache` that owns this slab. Used for accounting
    /// and to return the slab to the appropriate partial/empty list.
    /// SAFETY: `cache` pointer is valid for the slab's lifetime (a slab is always
    /// owned by a cache and freed before the cache is destroyed).
    pub cache: *const SlabCache,

    /// Which list this slab is currently on (drives cache list management).
    pub list_state: SlabState,

    /// Slab generation counter. Incremented on each alloc+free cycle.
    /// Used by the sanitizer to detect stale pointers (optional, debug builds).
    ///
    /// **Longevity**: u32 wraps after ~4.3 billion cycles. At 100K alloc/free
    /// cycles per second per slab, wraps in ~12 hours. Acceptable because:
    /// (1) debug-only — not compiled in release builds; (2) stale-pointer
    /// detection is probabilistic, not relied upon for safety invariants.
    #[cfg(debug_assertions)]
    pub generation: u32,
}

/// Which free-list within a `SlabCache` this slab is currently linked into.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SlabState {
    /// All `capacity` objects are allocated (no free slots).
    Full,
    /// Some objects allocated, some free. The common steady-state.
    Partial,
    /// No objects allocated. Eligible for return to the buddy allocator.
    Empty,
}

SlabMagazine — Fixed-capacity object pointer array:

/// A magazine holds up to MAGAZINE_SIZE object pointers for fast per-CPU
/// allocation/deallocation. Allocated from a dedicated small slab cache
/// (or via buddy pages for the bootstrap case). NOT from the cache it
/// serves — this avoids recursion.
///
/// Magazines are referenced via `NonNull<SlabMagazine>` in `MagazinePair`
/// (loaded and spare). The depot holds stacks of full/empty magazines.
pub struct SlabMagazine {
    /// Number of valid object pointers in `objects[0..count]`.
    pub count: u32,
    /// Object pointer storage. Indices `0..count` are valid pointers.
    /// Indices `count..MAGAZINE_SIZE` are undefined.
    pub objects: [*mut u8; MAGAZINE_SIZE],
}

impl SlabMagazine {
    /// Create a new empty magazine (count = 0, all slots undefined).
    pub fn empty() -> Self {
        Self {
            count: 0,
            objects: [core::ptr::null_mut(); MAGAZINE_SIZE],
        }
    }

    pub fn is_empty(&self) -> bool { self.count == 0 }
    pub fn is_full(&self) -> bool { self.count as usize == MAGAZINE_SIZE }

    /// Pop an object pointer from the magazine. Returns None if empty.
    pub fn pop(&mut self) -> Option<*mut u8> {
        if self.count == 0 { return None; }
        self.count -= 1;
        Some(self.objects[self.count as usize])
    }

    /// Push an object pointer. Returns Err if full.
    pub fn push(&mut self, ptr: *mut u8) -> Result<(), *mut u8> {
        if self.count as usize >= MAGAZINE_SIZE { return Err(ptr); }
        self.objects[self.count as usize] = ptr;
        self.count += 1;
        Ok(())
    }
}

/// Deallocate a magazine struct. Returns the magazine's memory to the
/// magazine slab cache (a small dedicated cache separate from the object
/// caches to avoid recursion). Called when GC drains a depot or when a
/// per-CPU magazine is reclaimed during CPU offline.
fn dealloc_magazine(mag: NonNull<SlabMagazine>) {
    // SAFETY: mag was allocated from MAGAZINE_SLAB. The caller has
    // exclusive ownership (magazine was removed from depot/pair).
    unsafe { MAGAZINE_SLAB.free(mag.as_ptr() as *mut u8); }
}

SlabCache — Slab cache descriptor (global/per-NUMA state only):

Per-CPU slab magazines are defined in Section 3.2 (MagazinePair in CpuLocalBlock). The slab allocator accesses them via the per-CPU register — see Section 3.1.2 for the data structure definition and access protocol. SlabCache does not embed per-CPU state; it holds only the global and per-NUMA structures that back the per-CPU magazines.

// SLAB_SIZE_CLASSES is defined above in the kmalloc() / kfree() Public API
// section (= 26, including 96 and 192 byte classes). Not redeclared here.

/// A slab cache for objects of a fixed size and alignment. Created at boot
/// for common kernel types (capabilities, inodes, etc.) and on demand for
/// driver-specific types.
///
/// Per-CPU fast-path state lives in `CpuLocalBlock.slab_magazines[size_class]`
/// (Section 3.1.2), NOT in this struct. `SlabCache` owns the shared slow-path
/// structures: per-NUMA partial slab lists and the magazine depot (full/empty
/// magazine stacks used to refill or drain per-CPU magazines).
pub struct SlabCache {
    /// Size class index (0..SLAB_SIZE_CLASSES).
    /// Maps directly to `CpuLocalBlock.slab_magazines[size_class]`.
    pub size_class: u32,
    /// Object size in bytes (rounded up to alignment).
    pub object_size: u32,
    /// Object alignment requirement.
    pub align: u32,
    /// Buddy allocator order for slab pages (2^slab_order pages per slab).
    /// Most caches use order 0 (single 4 KB page). Large objects (>2 KB)
    /// use order 1 or higher to fit at least 2 objects per slab. Computed
    /// once at cache creation based on `object_size` and `align`:
    ///   order 0: object_size ≤ PAGE_SIZE / 2  (≤2048 B on 4 KB pages)
    ///   order 1: object_size ≤ PAGE_SIZE       (≤4096 B)
    ///   order 2: object_size ≤ PAGE_SIZE * 2   (≤8192 B)
    /// Maximum order is 3 (32 KB slab). Objects larger than 16 KB are
    /// allocated directly from the buddy allocator (no slab caching).
    pub slab_order: u32,
    /// Per-NUMA-node partial slab lists. Indexed by NUMA node ID.
    /// Length = `num_online_nodes()` at boot; allocated once from the boot-time
    /// allocator during `slab_init()` and never resized. The `&'static` lifetime
    /// is correct because this allocation lives for the kernel's entire lifetime
    /// (boot-time caches with `PERMANENT` flag are never destroyed; driver-created
    /// caches with `DESTROYABLE` flag are garbage-collected when empty and idle —
    /// see [Section 4.3](#slab-allocator--slab-cache-garbage-collection)).
    /// Dynamically sized based on discovered
    /// NUMA node count (no compile-time MAX_NUMA_NODES constant — see
    /// Section 4.9). Follows the same dynamic sizing pattern as `PerCpu<T>`
    /// (Section 3.1.1) and `NumaTopology` (Section 4.9).
    ///
    /// Allocated during `slab_init()` (Phase 1.2) from the buddy allocator
    /// (Phase 1.1 `hand_off_to_buddy()` completes before `slab_init()`).
    /// NUMA topology discovery (Section 4.9) must complete before
    /// `slab_init()` so the node count is known. The buddy allocator
    /// provides a `&'static [SpinLock<SlabList>]` by allocating a
    /// contiguous region via `alloc_pages()` and leaking it (the
    /// allocation is permanent, so no memory is actually leaked).
    /// For the complete kernel init phase ordering, see
    /// [Section 2.3](02-boot-hardware.md#boot-init-cross-arch).
    pub node_partial: &'static [SpinLock<SlabList>],
    /// Flags controlling cache lifecycle.
    pub flags: SlabCacheFlags,
    /// Timestamp (monotonic ns) of last alloc or free on this cache.
    /// Updated on the **slow path only** (magazine miss/overflow): the
    /// hot path (magazine pop/push) does NOT touch this field. The slab
    /// GC idle threshold (300 seconds) is orders of magnitude larger than
    /// the magazine batch interval, so the imprecision is immaterial.
    pub last_activity_ns: AtomicU64,
    /// GC state machine. See [Section 4.3](#slab-allocator--slab-cache-garbage-collection).
    pub gc_state: AtomicU8,
    /// Kernel-unique cache identifier. Monotonically assigned at creation time.
    /// Used as the key in `SLAB_CACHES` XArray. Never reused.
    pub cache_id: u64,
    /// Approximate count of allocated (live) objects across all slabs.
    /// Updated on the SLOW PATH only (depot exchange, slab_grow, slab_shrink).
    /// The hot path (magazine pop/push) does NOT update this counter.
    /// The count may lag by up to `MAGAZINE_SIZE * num_cpus` objects, which
    /// is acceptable for GC eligibility checks (GC runs on caches idle 300+ seconds).
    pub total_allocated: AtomicU64,
    /// Magazine depot: stacks of full and empty magazines.
    /// When a per-CPU `MagazinePair` needs a fresh loaded magazine, it returns
    /// its empty magazine to `depot.empties` and pops a full one from
    /// `depot.fulls`. When both per-CPU magazines are full on free, the spare
    /// is returned to `depot.fulls` and an empty is popped from `depot.empties`.
    /// Protected by a single spinlock (contention is rare — depot access is the
    /// slow path, hit only when both magazines miss).
    /// **Lock ordering**: `buddy.zone_lock(120) < depot.lock(125) < node_partial(130)`.
    /// In practice, these locks are NEVER held simultaneously during normal
    /// operation — each phase acquires and releases its lock before the next:
    /// - Alloc slow path: depot.lock -> [release] -> node_partial -> [release]
    ///   -> slab_grow() -> buddy.lock -> [release] -> node_partial -> [release].
    /// - Free slow path: depot.lock -> [release] ->
    ///   drain_magazine_to_slabs() acquires node_partial per NUMA node ->
    ///   [release node_partial] -> slab_shrink() acquires buddy.lock -> [release].
    ///   NOTE: the depot lock is dropped BEFORE drain_objects_to_slabs() runs.
    ///   The spare magazine's object pointers are copied to a stack-local
    ///   `ArrayVec` under the depot lock, the magazine struct is zeroed and
    ///   reused in-place, and the drain proceeds without holding the lock.
    ///   This eliminates contention: other CPUs can access the depot while
    ///   the drain proceeds.
    /// - GC: Phase 2 depot.lock -> [release] -> Phase 5 buddy.lock -> [release].
    ///
    /// The ordering rule exists to prevent inversions with the allocation path.
    /// The depot lock is acquired only on the magazine-miss slow path;
    /// the fast path (magazine hit) is lock-free.
    ///
    /// **`depot.lock < node_partial.lock`**: The alloc slow path may hold
    /// `depot.lock` while re-checking per-CPU magazines, then fall through
    /// to `node_partial.lock` on depot miss. Reversing this ordering
    /// (acquiring `node_partial` first, then `depot`) is forbidden —
    /// it would create an ABBA deadlock. The complete slab lock ordering
    /// chain is: `buddy.zone_lock < depot.lock < node_partial.lock`.
    pub depot: SpinLock<MagazineDepot>,
    /// Constructor called **once per slab page population** (when `slab_grow()`
    /// initializes a freshly-allocated slab), NOT on every `slab_alloc()` return.
    /// Magazine-recycled objects are NOT re-constructed — they retain state from
    /// the previous `kfree()`. This matches Linux SLUB behavior: `ctor` runs in
    /// `__slab_alloc()` only when a new slab page is allocated, not on
    /// per-object fast-path returns.
    ///
    /// Receives a pointer to `object_size` bytes of uninitialized memory.
    /// Must initialize all fields required for the object's invariants.
    /// Called under no lock — must not sleep if the slab cache serves
    /// `GFP_ATOMIC` allocations. The pointer is valid for the slab's lifetime
    /// and always points within the object region (never into freelist metadata).
    ///
    /// **GFP_ZERO + constructor interaction**: `GFP_ZERO` zeroes the object AFTER
    /// the constructor has run (since GFP_ZERO is applied in `slab_alloc()`,
    /// which wraps the slow path where the constructor ran during slab_grow).
    /// This destroys the constructor's initialization — a logic error.
    /// `slab_alloc()` debug_asserts that `GFP_ZERO` and `ctor` are not both
    /// set (matching Linux SLUB's `WARN_ON_ONCE(s->ctor && __GFP_ZERO)`).
    pub ctor: Option<fn(*mut u8)>,
    /// Destructor called before returning objects to the page allocator
    /// during slab cache GC (Phase 5) or `SlabCache::destroy()`. Receives
    /// a pointer to a fully-initialized object and must clean up any
    /// resources held by the object (e.g., drop Arc references, release
    /// hardware resources). Must not sleep. Called for every allocated
    /// object in the slab before the slab's backing pages are freed.
    pub dtor: Option<fn(*mut u8)>,
    /// Name for debugging and /proc/slabinfo.
    pub name: &'static str,
}

/// Magazine depot: global pool of pre-filled and empty magazines.
/// Shared across all CPUs for a given size class. Accessed under `SlabCache.depot` lock.
///
/// The depot capacity is computed at `slab_init()` as `max(128, 2 * num_online_cpus())`.
/// This ensures every CPU can exchange a magazine without falling through to the
/// buddy slow path, even under worst-case simultaneous exhaustion (all CPUs drain
/// their magazines in the same tick).
///
/// The depot uses a heap-allocated `Vec` (one allocation per size class at boot —
/// negligible). This avoids a fixed `ArrayVec` capacity limit that would degrade
/// on >64-CPU systems (e.g., IBM POWER10 with 240 SMT-8 cores = 1920 threads).
///
/// Memory cost: `2 * depot_capacity * 8` bytes per depot (two Vecs of `NonNull`
/// pointers) per size class. For 256 CPUs and 26 size classes
/// (SLAB_SIZE_CLASSES = 26): `2 * 512 * 8 * 26 = 208 KB` — negligible.
pub struct MagazineDepot {
    /// Stack of full magazines (each contains MAGAZINE_SIZE objects).
    /// Refill source for per-CPU loaded magazines.
    ///
    /// # Safety
    /// All pointers in `fulls` are valid, aligned, slab-allocated
    /// `SlabMagazine` instances with provenance from `alloc_slab_page()`.
    /// They are freed when the owning `SlabCache` is destroyed (GC Phase 5).
    pub fulls: Vec<NonNull<SlabMagazine>>,
    /// Stack of empty magazines (count == 0).
    /// Drain target for per-CPU magazines that are full.
    ///
    /// # Safety
    /// Same invariants as `fulls`.
    pub empties: Vec<NonNull<SlabMagazine>>,
}

/// Minimum depot capacity per stack. The actual capacity is
/// `max(DEPOT_MIN_CAPACITY, 2 * num_online_cpus())`, computed at `slab_init()`.
/// 128 magazines x 64 objects x 8 bytes = 64 KB of cached objects
/// per size class — fits comfortably in L2 on all architectures.
/// The `Vec` is allocated once at `slab_init()` per size class with capacity
/// `max(DEPOT_MIN_CAPACITY, 2 * num_online_cpus())` (cold path); all subsequent
/// depot operations are push/pop within the pre-allocated capacity.
///
/// **Heap allocation safety**: `Vec::push` will reallocate if `len >= capacity`,
/// which would trigger `kmalloc()` under `depot.lock` (SpinLock with IRQs
/// disabled), risking re-entry into the slab allocator. The depot capacity is
/// sized so that the total magazine population (bounded by `2 * num_cpus` active
/// + circulation) never exceeds it. Every `push` is preceded by:
/// ```rust
/// debug_assert!(self.fulls.len() < self.fulls.capacity(),
///     "depot fulls exceeded pre-allocated capacity");
/// ```
/// This catches any violation during development. In production, the invariant
/// holds because the total number of magazines is bounded by the initial
/// allocation pool.
pub const DEPOT_MIN_CAPACITY: usize = 128;

SLAB_CACHES — global slab cache registry:

/// Global registry of all slab caches. Keyed by monotonically increasing
/// `cache_id: u64` assigned at creation time. Provides O(1) lookup for cache
/// GC scanning and `/proc/slabinfo` enumeration.
///
/// XArray is used because the key is an integer (`cache_id`) — per collection
/// policy, integer-keyed mappings always use XArray. RCU-protected reads allow
/// the GC scanner and `/proc/slabinfo` to iterate without blocking allocations.
/// Writers (cache create/destroy) are serialized by `SLAB_CACHES_LOCK`.
///
/// A separate name-to-id deduplication map (`SLAB_CACHE_NAMES: HashMap<&'static str, u64>`)
/// is maintained under `SLAB_CACHES_LOCK` for `kmem_cache_create()` dedup.
/// This is a cold-path-only HashMap (cache creation is boot or driver init).
///
/// **Why not a name hash as key**: A u64 hash of the cache name has birthday-problem
/// collision risk. With ~100-500 caches the probability is negligible (~10^-14),
/// but the consequence of a collision (silently overwriting a slab cache in the
/// XArray, causing permanent memory leaks) is unacceptable for 50-year uptime.
/// A monotonic u64 cache_id eliminates collision risk entirely.
///
/// Populated during `slab_init()` with boot-time caches (PERMANENT), then
/// extended at runtime by driver-created caches (DESTROYABLE) via the KABI
/// slab creation interface.
/// XArray's internal RCU protection allows lock-free reads by GC scanner and
/// `/proc/slabinfo`. Writers hold `SLAB_CACHES_LOCK` (Mutex) for serialization.
/// No SpinLock wrapper — SpinLock would defeat XArray's RCU read capability.
pub static SLAB_CACHES: XArray<Arc<SlabCache>> = XArray::new();

/// Name-to-cache-id dedup map. Cold path only (cache creation/lookup by name).
/// Protected by `SLAB_CACHES_LOCK` (same lock as SLAB_CACHES writers).
static SLAB_CACHE_NAMES: Mutex<HashMap<&'static str, u64>> = Mutex::new(HashMap::new());

/// Next cache ID. Monotonically increasing. Assigned under `SLAB_CACHES_LOCK`.
/// u64 — at 10^9 cache creations/sec (impossible), wraps in ~584 years.
static NEXT_CACHE_ID: AtomicU64 = AtomicU64::new(0);

/// A fixed-size array that is mutable during initialization and becomes
/// permanently immutable after `seal()`. Reads (hot path) are bare
/// `AtomicPtr::load(Acquire)` — zero overhead vs a raw array. Writes
/// check `debug_assert!(!self.sealed)` — elided in release builds.
/// The Nucleus evolution primitive may `unseal()` → modify → `seal()`
/// during the quiesced Phase B of live evolution ([Section 13.18](13-device-classes.md#live-kernel-evolution)).
///
/// `SealedArray` has exactly one user (SLAB_CACHES_BY_CLASS). If other
/// subsystems need a "sealed after boot" pattern, extract to a shared
/// utility. The KABI subsystem uses a similar pattern for
/// `KabiProviderIndex` ([Section 12.7](12-kabi.md#kabi-service-dependency-resolution)).
pub struct SealedArray<T, const N: usize> {
    data: UnsafeCell<[T; N]>,
    sealed: AtomicBool,
}

// SAFETY: SealedArray is Sync because:
// - After seal(), only atomic reads occur (T: Sync).
// - Before seal(), writes are single-threaded (boot-time init on BSP).
// - The sealed flag is AtomicBool with Release/Acquire ordering.
unsafe impl<T: Sync, const N: usize> Sync for SealedArray<T, N> {}

impl<T, const N: usize> SealedArray<T, N> {
    /// Create a new unsealed array with the given initial values.
    pub const fn new(data: [T; N]) -> Self {
        Self { data: UnsafeCell::new(data), sealed: AtomicBool::new(false) }
    }

    /// Read element at `idx`. In release builds: bare array index, zero overhead.
    /// Panics if `idx >= N`.
    #[inline(always)]
    pub fn get(&self, idx: usize) -> &T {
        // SAFETY: After seal(), no mutation occurs. Before seal(), only the
        // BSP init thread writes (single-threaded boot). The AtomicBool fence
        // ensures visibility.
        unsafe { &(*self.data.get())[idx] }
    }

    /// Set element at `idx`. Only valid before `seal()`.
    /// Panics in debug builds if already sealed.
    pub fn set(&self, idx: usize, val: T) {
        debug_assert!(!self.sealed.load(Ordering::Acquire),
            "SealedArray::set() called after seal()");
        // SAFETY: Only called during single-threaded boot init (BSP).
        unsafe { (*self.data.get())[idx] = val; }
    }

    /// Seal the array. After this, `set()` panics in debug builds.
    /// All prior writes are made visible to readers via Release ordering.
    pub fn seal(&self) {
        self.sealed.store(true, Ordering::Release);
    }

    /// Unseal the array for live evolution. Restricted to the Nucleus
    /// evolution primitive during the quiesced Phase B (all CPUs halted).
    ///
    /// # Safety
    /// Caller must guarantee exclusive access (evolution quiescence).
    pub unsafe fn unseal(&self) {
        self.sealed.store(false, Ordering::Release);
    }
}

/// Flat array for O(1) size-class → SlabCache lookup on the allocation fast path.
/// Indexed by size-class index (0..SLAB_SIZE_CLASSES). Each entry is an `Option`
/// containing a raw pointer to the corresponding `SlabCache` (same object stored
/// in `SLAB_CACHES` XArray). Populated during `slab_init()` alongside `SLAB_CACHES`.
///
/// **Why a separate array**: The hot-path `slab_alloc(sc)` / `slab_free(sc)` needs
/// to reach the `SlabCache` from a size-class index in O(1) without taking any lock.
/// `SLAB_CACHES` is keyed by cache_id (for deduplication and `/proc/slabinfo`
/// enumeration), not by size-class index — an XArray lookup by cache_id on every
/// allocation would add unnecessary overhead.
/// `SLAB_CACHES_BY_CLASS` provides a lock-free, cache-line-friendly direct index.
///
/// Entries are set exactly once during `slab_init()` and sealed immediately
/// after. Driver-created caches are accessed via `SLAB_CACHES` XArray (by name),
/// NOT this array. `SLAB_CACHES_BY_CLASS` contains only PERMANENT boot-time
/// kmalloc size-class caches and is immutable after `slab_init()` returns.
/// Reads are unsynchronized (pointer load is `Acquire`) — safe because entries
/// are never modified after the seal.
///
/// **Safety invariant**: Each non-null pointer in `SLAB_CACHES_BY_CLASS` points to
/// a `SlabCache` whose backing `Arc` in `SLAB_CACHES` has a reference count > 0.
/// For PERMANENT caches: the `Arc` is never removed from `SLAB_CACHES`, so the raw
/// pointer is valid for the kernel's lifetime. For DESTROYABLE caches: they are
/// never placed in `SLAB_CACHES_BY_CLASS` (only PERMANENT boot-time caches appear
/// here). During live kernel evolution, the Nucleus evolution primitive may
/// temporarily unseal to update size-class cache pointers — the quiescence
/// guarantee ensures no concurrent readers during the swap
/// ([Section 13.18](13-device-classes.md#live-kernel-evolution)).
pub static SLAB_CACHES_BY_CLASS: SealedArray<AtomicPtr<SlabCache>, SLAB_SIZE_CLASSES> =
    SealedArray::new([const { AtomicPtr::new(core::ptr::null_mut()) }; SLAB_SIZE_CLASSES]);

/// Serializes slab cache creation and destruction. Held only on the cold path
/// (cache create/destroy); never held during allocation or free.
pub static SLAB_CACHES_LOCK: Mutex<()> = Mutex::new(());

SlabCache::alloc() — allocation algorithm:

Fast path accesses CpuLocalBlock.slab_magazines[sc] directly via the per-CPU register. No PerCpu<T> wrapper overhead. No borrow-state CAS on the slab fast path. Direct register-based access via CpuLocal (Section 3.2).

The gfp parameter controls reclaim behavior and initialization. On the fast path (magazine hit), gfp is checked only for GFP_ZERO (object zeroing). On the slow path (magazine miss → depot miss → partial list miss → slab_grow()), gfp is propagated to alloc_pages() to constrain whether the allocation may sleep, trigger reclaim, or invoke the OOM killer.

See Section 4.2 for the full GfpFlags table (flag semantics, lock-hierarchy constraints, and bitflags definition).

slab_alloc(size_class: sc, gfp: GfpFlags) -> Result<*mut u8, AllocError>:
    // GFP_ZERO is applied HERE, wrapping ALL return paths (fast and slow).
    // Previous versions had GFP_ZERO only on the fast path, leaving 7 slow-path
    // return sites unzeroed — an information-leak vulnerability (CWE-200).
    let obj = slab_alloc_inner(sc, gfp)?;
    if gfp.contains(GFP_ZERO):
        // Zero the entire size-class allocation (not just the requested size,
        // to prevent info leak from the rounding overhead bytes).
        // Magazine-recycled objects and slow-path objects both contain stale
        // data from previous use. Linux SLUB handles this via
        // slab_want_init_on_alloc() which checks __GFP_ZERO at the same point.
        unsafe { core::ptr::write_bytes(obj, 0, KMALLOC_SIZE_TABLE[sc as usize] as usize) }
    // GFP_ZERO + constructor interaction: Linux SLUB issues
    // WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO)). Zeroing after construction
    // overwrites constructor state. UmkaOS rejects this combination.
    debug_assert!(
        !(gfp.contains(GFP_ZERO) && cache_has_constructor(sc)),
        "GFP_ZERO with constructor is invalid: zeroing destroys constructed state"
    );
    // Update total_allocated on slow path only (fast path is too hot).
    // The counter is approximate (may lag by MAGAZINE_SIZE * num_cpus),
    // acceptable for GC eligibility checks.
    // NOTE: total_allocated is incremented in slab_alloc_slow's return paths.
    // last_activity_ns is updated at the same points.
    return Ok(obj)

slab_alloc_inner(sc, gfp: GfpFlags) -> Result<*mut u8, AllocError>:
    flags = local_irq_save()

    // magazine_active check: during CPU hotplug, magazines are drained and
    // invalidated. If magazine_active is false, bypass the magazine entirely
    // and fall through to the depot slow path. See §CPU hotplug slab quiescence.
    if !CpuLocalBlock.magazine_active.load(Relaxed):
        local_irq_restore(flags)
        return slab_alloc_slow(sc, gfp)

    // SAFETY: Access via raw pointer from CpuLocal register base (not get_mut()).
    // IRQs disabled — no concurrent non-atomic access on this CPU.
    // The `&raw mut` produces a `*mut MagazinePair`. We immediately convert
    // to `&mut MagazinePair` since IRQ-disable guarantees exclusive access.
    let pair: &mut MagazinePair = unsafe { &mut *(&raw mut (*CpuLocal::as_ptr()).slab_magazines[sc]) };

    // Fast path: pop from loaded magazine.
    // SAFETY: On the fast path (magazine_active == true), loaded is always Some.
    // unwrap() is elided in release builds. as_mut() yields &mut SlabMagazine.
    let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
    if loaded.count > 0:
        loaded.count -= 1;
        let obj: *mut u8 = loaded.objects[loaded.count as usize];
        local_irq_restore(flags)
        return Ok(obj)

    // Loaded empty — swap loaded ↔ spare (swaps the Option<NonNull> values).
    core::mem::swap(&mut pair.loaded, &mut pair.spare);
    // SAFETY: spare was also Some on the fast path; now it is the new loaded.
    let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
    if loaded.count > 0:
        loaded.count -= 1;
        let obj: *mut u8 = loaded.objects[loaded.count as usize];
        local_irq_restore(flags)
        return Ok(obj)

    // Both magazines empty — slow path (under lock)
    local_irq_restore(flags)
    return slab_alloc_slow(sc, gfp)

slab_alloc_slow(sc, gfp: GfpFlags) -> Result<*mut u8, AllocError>:
    cache = SLAB_CACHES_BY_CLASS[sc].load(Acquire)  // O(1) size-class → SlabCache

    // GC_DRAINING check: reject allocations for caches being garbage-collected.
    // Between GC Phase 3 (verify empty) and Phase 5 (free pages), new allocations
    // must be rejected to prevent use-after-free of pages about to be freed.
    // For DESTROYABLE caches only — PERMANENT caches never enter GC_DRAINING.
    if cache.gc_state.load(Acquire) != SlabGcState::Active as u8 {
        return Err(AllocError::CacheDraining)
    }

    // Try depot: swap empty loaded for a full magazine from depot.
    // SpinLock::lock() saves IRQ state and disables IRQs automatically
    // (see locking-strategy.md §SpinLock<T>), so no nested local_irq_save/restore is needed.
    let depot_guard = cache.depot.lock()   // saves IRQ state + disables IRQs + acquires lock

    // MIGRATION SAFETY: The task may have migrated between the fast path's
    // local_irq_restore() and this depot.lock acquisition. After acquiring the
    // spinlock (which disables IRQs, pinning us to the current CPU), we MUST
    // re-check the current CPU's magazine state. CpuLocal::as_ptr() returns
    // the CURRENT CPU's CpuLocalBlock, which may differ from the fast-path
    // CPU due to migration. We re-check this CPU's magazines because they
    // may have objects available. Linux SLUB handles this by re-checking
    // the per-CPU freelist under preemption-disable in ___slab_alloc.
    let pair: &mut MagazinePair = unsafe { &mut *(&raw mut (*CpuLocal::as_ptr()).slab_magazines[sc]) };
    // SAFETY: slow path entered from fast path where magazine_active == true,
    // so both loaded and spare are Some. unwrap() + as_mut() is safe.
    let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
    if loaded.count > 0:
        // Current CPU's loaded magazine has objects — use it directly.
        loaded.count -= 1;
        let obj: *mut u8 = loaded.objects[loaded.count as usize];
        drop(depot_guard)
        cache.total_allocated.fetch_add(1, Relaxed);
        cache.last_activity_ns.store(now_ns(), Relaxed);
        return Ok(obj)
    core::mem::swap(&mut pair.loaded, &mut pair.spare);
    let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
    if loaded.count > 0:
        loaded.count -= 1;
        let obj: *mut u8 = loaded.objects[loaded.count as usize];
        drop(depot_guard)
        cache.total_allocated.fetch_add(1, Relaxed);
        cache.last_activity_ns.store(now_ns(), Relaxed);
        return Ok(obj)
    // Both magazines genuinely empty on the current CPU. Proceed with depot exchange.

    if let Some(full_mag) = depot_guard.fulls.pop():
        // pair.loaded is confirmed empty (count==0) — return it to empties.
        // pair.loaded is Option<NonNull<SlabMagazine>>; depot.empties is Vec<NonNull>.
        // unwrap() is safe: loaded was Some (verified by magazine_active invariant).
        debug_assert!(depot_guard.empties.len() < depot_guard.empties.capacity(),
            "depot empties exceeded pre-allocated capacity");
        depot_guard.empties.push(pair.loaded.unwrap())   // return empty magazine
        pair.loaded = Some(full_mag)
        let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
        loaded.count -= 1;
        let obj: *mut u8 = loaded.objects[loaded.count as usize];
        drop(depot_guard)                // releases lock + restores IRQ state
        cache.total_allocated.fetch_add(1, Relaxed);
        cache.last_activity_ns.store(now_ns(), Relaxed);
        return Ok(obj)
    drop(depot_guard)                    // releases lock + restores IRQ state

    // Depot empty — refill from per-NUMA partial slab list.
    // IMPORTANT: node_partial.lock is dropped BEFORE calling slab_grow(),
    // because slab_grow() re-acquires node_partial.lock internally to insert
    // the newly allocated slab. Holding node_partial.lock across slab_grow()
    // would deadlock (ABBA with itself). This follows the Linux SLUB pattern.
    node = current_numa_node()
    {
        let partial = cache.node_partial[node].lock()
        if let Some(slab) = partial.pop_head():
            // Batch-fill a magazine from the slab's freelist.
            // Pop up to MAGAZINE_SIZE objects from the slab's freelist into
            // the loaded magazine. Each object is at a known offset within the
            // slab page (computed from object index * object_size + red_zone).
            // SAFETY: pair.loaded is Some (magazine_active invariant, no GC concurrent).
            let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
            let count = slab.free_count.min(MAGAZINE_SIZE);
            for i in 0..count {
                loaded.objects[i] = slab.freelist_pop();
            }
            loaded.count = count;
            // Update slab state based on remaining free objects.
            if slab.free_count > 0 {
                // Slab still has free objects — keep in partial list.
                partial.push_head(slab);
            } else {
                // Slab is now fully allocated — transition to Full state.
                // Without this, drain_magazine_to_slabs' Full->Partial
                // transition check never matches, leaking the slab.
                slab.list_state = SlabState::Full;
                // Full slabs are NOT in the partial list — they are
                // tracked implicitly via the Page.mapping pointer back
                // to the SlabCache. They re-enter the partial list when
                // an object is freed (slab_free_slow path).
            }
            // Return one object from the freshly-filled magazine.
            loaded.count -= 1;
            let obj: *mut u8 = loaded.objects[loaded.count as usize];
            drop(partial)
            cache.total_allocated.fetch_add(1, Relaxed);
            cache.last_activity_ns.store(now_ns(), Relaxed);
            return Ok(obj)
        // Lock DROPPED here — partial guard goes out of scope
    }

    // Partial list empty — grow the cache WITHOUT holding node_partial.lock.
    // gfp is propagated here to control reclaim behavior.
    // slab_grow() acquires node_partial.lock internally to insert the new slab.
    slab_grow(cache, node, gfp)?

    // Re-acquire node_partial.lock to pop from the freshly-inserted slab.
    // RETRY LOOP: another CPU may have stolen the freshly-grown slab between
    // slab_grow()'s internal insertion and our re-acquisition of node_partial.
    // Bounded retry (3 attempts) before returning AllocError.
    for _attempt in 0..3 {
        let partial = cache.node_partial[node].lock()
        if let Some(slab) = partial.pop_head():
            // MIGRATION SAFETY: Re-read `pair` from CpuLocal for the CURRENT
            // CPU. The task may have migrated during `slab_grow()` above (which
            // can sleep via direct reclaim under GFP_KERNEL). The SpinLock
            // acquisition disables IRQs, pinning us to the current CPU, so this
            // CpuLocal read is stable for the rest of the critical section.
            // Without this re-read, `pair` would still point to the ORIGINAL
            // CPU's MagazinePair, causing a data race (two CPUs accessing the
            // same non-atomic MagazinePair fields without synchronization).
            let pair: &mut MagazinePair = unsafe {
                &mut *(&raw mut (*CpuLocal::as_ptr()).slab_magazines[sc])
            };
            // SAFETY: pair.loaded is Some (magazine_active invariant).
            let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
            let count = slab.free_count.min(MAGAZINE_SIZE);
            for i in 0..count {
                loaded.objects[i] = slab.freelist_pop();
            }
            loaded.count = count;
            if slab.free_count > 0 {
                partial.push_head(slab);
            } else {
                slab.list_state = SlabState::Full;
            }
            loaded.count -= 1;
            let obj: *mut u8 = loaded.objects[loaded.count as usize];
            drop(partial)
            cache.total_allocated.fetch_add(1, Relaxed);
            cache.last_activity_ns.store(now_ns(), Relaxed);
            return Ok(obj)
        drop(partial)
        // Another CPU consumed our slab. Grow again.
        slab_grow(cache, node, gfp)?
    }
    return Err(AllocError)  // Exhausted retries — severe contention or OOM.

4.3.1.3 Slab-to-Buddy Page Allocation¶

When both per-CPU magazines and per-NUMA partial lists are exhausted, the slab allocator must obtain a fresh slab page from the buddy allocator. This is the only point where the slab subsystem interacts with the physical page allocator.

/// Allocate a new slab page from the buddy allocator.
///
/// Called when both per-CPU magazines and per-NUMA partial lists are
/// exhausted for the given cache. Allocates 2^`cache.slab_order` contiguous
/// pages, initialises slab metadata and the out-of-band freelist, and
/// inserts the new slab into the per-NUMA partial list.
///
/// # Arguments
/// - `cache`: The slab cache that needs a new slab.
/// - `nid`: Preferred NUMA node, typically `numa_mem_id()` (the NUMA node
///   of the CPU that triggered the allocation).
/// - `gfp`: Allocation flags propagated from the original allocation request.
///
/// A `Page` used as a slab backing page. The `index_or_freelist` field
/// holds the slab freelist head pointer, and `mapping` points to the
/// owning `SlabCache`.
pub type SlabPage = Page;

/// # Errors
/// Returns `AllocError` if the buddy allocator cannot satisfy the request
/// after all reclaim and fallback attempts.
fn slab_grow(
    cache: &SlabCache,
    nid: NumaNodeId,
    gfp: GfpFlags,
) -> Result<(), AllocError> {
    // Allocate from the buddy allocator on the preferred NUMA node.
    // __GFP_MOVABLE is NOT set — slab pages are pinned (not compactable).
    // The migration type is Unmovable (see MigrateType in Section 4.2).
    // alloc_pages returns &'static Page (a reference into the global MEMMAP
    // array). The Page struct is the canonical metadata — it is NOT a by-value
    // copy. The physical page allocator spec ([Section 4.2](#physical-memory-allocator))
    // defines alloc_pages() → Result<&'static Page, AllocError>. If the return
    // type were Page-by-value, slab metadata initialization would modify a
    // stack copy, not the MEMMAP entry — slab-allocator invariants would break.
    let page: &'static Page = alloc_pages(cache.slab_order, gfp, nid)?;

    // Initialise slab metadata on the allocated page(s).
    let slab = SlabPage::init(page, cache);

    // Carve objects from the slab page and build the out-of-band freelist.
    // Each slot index (0..capacity) is pushed onto the freelist array.
    slab.init_freelist(cache.object_size, cache.align);

    // Insert into per-NUMA partial list (slab has all objects free).
    cache.node_partial[nid].lock().push(slab);

    Ok(())
}

Core slab helper functions used by slab_grow(), slab_alloc_slow(), and slab_free_slow():

/// Convert a virtual address to its owning `Page` struct reference.
/// Uses the direct-map offset: `MEMMAP[(va - DIRECT_MAP_BASE) >> PAGE_SHIFT]`.
fn virt_to_page(va: *const u8) -> &'static Page {
    let pfn = (va as usize - DIRECT_MAP_BASE) >> PAGE_SHIFT;
    &MEMMAP[pfn]
}

/// Convert a `Page` reference back to its virtual address in the direct map.
fn page_to_virt(page: &Page) -> *mut u8 {
    ((page.pfn() << PAGE_SHIFT) + DIRECT_MAP_BASE) as *mut u8
}

/// Convert a virtual address to a physical address (direct-map arithmetic).
///
/// **32-bit LPAE note (ARMv7, PPC32)**: On architectures where physical RAM
/// starts at a non-zero base address (e.g., ARMv7 LPAE with RAM at
/// `0x2_00000000`), `DIRECT_MAP_BASE` is defined as `PAGE_OFFSET - PHYS_OFFSET`
/// in `arch/*/mm.rs`, making this formula produce the correct physical address
/// without an explicit `PHYS_OFFSET` term. On 64-bit architectures,
/// `PHYS_OFFSET` is always 0, so `DIRECT_MAP_BASE == PAGE_OFFSET`. See
/// `arch::current::mm::DIRECT_MAP_BASE` for per-arch definitions and
/// `arch::current::mm::PHYS_OFFSET` for the physical memory base.
fn virt_to_phys(va: *const u8) -> PhysAddr {
    PhysAddr((va as usize - DIRECT_MAP_BASE) as u64)
}

/// Map a `Page` to its `SlabPage` metadata. The SlabPage metadata is stored
/// out-of-band in the Page struct's `slab` union variant (overlay of the
/// `mapping` and `index` fields, reinterpreted when `PG_SLAB` is set).
/// This avoids per-slab metadata allocation — the Page struct IS the metadata.
fn page_to_slab(page: &Page) -> &SlabPage {
    // SAFETY: PG_SLAB is set on this page (checked by caller).
    // The slab union variant is valid when PG_SLAB is set.
    unsafe { &*(page as *const Page as *const SlabPage) }
}

impl SlabPage {
    /// Initialize a freshly-allocated page as a slab page.
    /// Sets PG_SLAB flag, stores the owning cache pointer, initializes
    /// free_count to the full object capacity.
    fn init(page: &'static Page, cache: &SlabCache) -> &'static SlabPage {
        page.flags.fetch_or(PG_SLAB, Release);
        page.mapping.store(cache as *const SlabCache as *mut (), Release);
        let slab = page_to_slab(page);
        // SAFETY: We have exclusive access (page just allocated from buddy).
        unsafe {
            let s = &mut *(slab as *const SlabPage as *mut SlabPage);
            s.free_count = (PAGE_SIZE << cache.slab_order) as u16
                / cache.object_size as u16;
            s.list_state = SlabState::Partial;
        }
        slab
    }

    /// Build the initial freelist for all object slots in this slab.
    /// Each slot is pushed onto the freelist (a contiguous array of u16
    /// indices stored in the page's trailer). After init, all objects are free.
    fn init_freelist(&self, object_size: u32, align: u32) {
        let capacity = self.free_count as usize;
        let base = page_to_virt(self.page());
        for i in 0..capacity {
            // freelist[i] = i (all slots free, allocated in order 0..N-1).
            self.freelist_push(i as u16);
        }
    }

    /// Pop a free object slot index from the slab's freelist.
    /// Returns the object's virtual address.
    fn freelist_pop(&self) -> *mut u8 {
        // SAFETY: Caller holds node_partial lock. free_count > 0 checked by caller.
        let idx = unsafe { &mut *(self.freelist.get()) }.pop().unwrap();
        let offset = idx as usize * self.object_size() as usize;
        (page_to_virt(self.page()) as usize + offset) as *mut u8
    }

    /// Push a freed object slot index back onto the freelist.
    fn freelist_push(&self, slot_idx: u16) {
        // SAFETY: Caller holds node_partial lock.
        unsafe { &mut *(self.freelist.get()) }.push(slot_idx);
    }

    /// Check whether a specific slot index is free (in the freelist).
    fn slab_slot_is_free(&self, slot_idx: u16) -> bool {
        // SAFETY: Caller holds node_partial lock.
        unsafe { &*(self.freelist.get()) }.contains(&slot_idx)
    }
}

NUMA node selection. nid is set to numa_mem_id() — the NUMA node of the CPU that triggered the allocation. This ensures slab objects are allocated on the local node for cache affinity. The caller passes nid through from slab_alloc_slow(), which obtains it via current_numa_node().

NUMA fallback. If the preferred node is exhausted, alloc_pages() internally falls back to nearby nodes using the NUMA distance table (Section 4.11). The slab allocator does NOT implement its own NUMA fallback — it relies entirely on the buddy allocator's fallback chain.

Error path. If alloc_pages() returns AllocError, the slab allocation fails and propagates the error to the caller. The caller must handle allocation failure:

Return ENOMEM to userspace (normal case).
Retry with GFP_KERNEL_NOIO if the original allocation was from I/O completion context and the first attempt used GFP_KERNEL.
For GFP_ATOMIC callers (IRQ context), failure is expected to be rare — the emergency reserve pool is the last resort. Callers must have a non-allocating fallback (e.g., drop the packet, fail the I/O request).

Slab page freeing. When a slab page becomes completely empty (all objects freed back to it and free_count == capacity), it is collected for deferred freeing. The caller removes the empty slab from the partial list while holding the node_partial lock, then releases the lock before calling free_pages(). This avoids a lock ordering violation: node_partial.lock (level 130) must never be held while acquiring buddy.zone_lock (level 120).

/// Return a completely empty slab to the buddy allocator.
///
/// **Precondition**: The slab has been removed from the `node_partial`
/// list by the caller. The `node_partial` lock is NOT held when this
/// function is called. This is critical for lock ordering:
/// `buddy.zone_lock(120)` must not be acquired while `node_partial(130)`
/// is held.
///
/// Called from `drain_magazine_to_slabs()` after collecting empty slabs,
/// and from GC Phase 5 (`return_slab_pages_to_buddy`).
fn slab_shrink(cache: &SlabCache, slab: *mut SlabPage) {
    // Free the out-of-band freelist metadata page.
    // SAFETY: slab is no longer on any list; exclusive access.
    slab.destroy_freelist();
    // Defensive destructor loop: call dtor for any non-free objects.
    // In current call paths (drain_magazine_to_slabs and GC Phase 5), all
    // objects are free by the time slab_shrink() runs (free_count == capacity),
    // so this loop finds nothing to destroy. It exists as a safety net for
    // potential future forced-destroy paths (e.g., emergency cache teardown).
    if let Some(dtor) = cache.dtor {
        if slab.free_count < slab.capacity {
            for slot_idx in 0..slab.capacity {
                if !slab_slot_is_free(slab, slot_idx) {
                    let obj_ptr = slab.base.as_ptr::<u8>().add(slot_idx as usize * cache.object_size as usize);
                    dtor(obj_ptr);
                }
            }
        }
    }
    // Return the slab page(s) to the buddy allocator.
    // buddy.zone_lock(120) is acquired inside free_pages() -- safe because
    // no slab locks (125, 130) are held.
    free_pages(slab.page(), cache.slab_order);
}

GfpFlags propagation. The GfpFlags used for slab_grow() are propagated from the original allocation request through to the buddy allocator. Common combinations:

Caller context	`GfpFlags`	Behaviour
Process context (default)	`GFP_KERNEL`	May sleep, may trigger direct reclaim and kswapd, may invoke OOM killer
IRQ / softirq (Section 3.8) / under spinlock	`GFP_ATOMIC_ALLOC`	Non-sleepable, draws from emergency reserve pool, returns `AllocError` immediately if exhausted
I/O completion path	`GFP_KERNEL_NOIO`	May sleep, but suppresses I/O and filesystem callbacks in reclaim to prevent writeback recursion
Filesystem code (holding inode locks)	`GFP_KERNEL_NOFS`	May sleep, suppresses filesystem callbacks in reclaim to prevent deadlock

Cross-reference: See Section 4.2 for alloc_pages() / free_pages() API and GfpFlags definitions. See Section 4.11 for NUMA distance tables and fallback chain.

SlabCache::free() — deallocation algorithm:

free(ptr: *mut u8, size_class: sc):
    flags = local_irq_save()

    // magazine_active check: during CPU hotplug, magazines are drained.
    // If inactive, bypass magazine and go directly to depot slow path.
    if !CpuLocalBlock.magazine_active.load(Relaxed):
        local_irq_restore(flags)
        slab_free_slow(ptr, sc)
        return

    // SAFETY: Access via raw pointer from CpuLocal register base (not get_mut()).
    // IRQs disabled — no concurrent non-atomic access on this CPU.
    // Convert raw pointer to `&mut` since IRQ-disable guarantees exclusive access.
    let pair: &mut MagazinePair = unsafe { &mut *(&raw mut (*CpuLocal::as_ptr()).slab_magazines[sc]) };

    // Fast path: push to loaded magazine.
    // SAFETY: On the fast path (magazine_active == true), loaded is always Some.
    let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
    if loaded.count < MAGAZINE_SIZE:
        loaded.objects[loaded.count as usize] = ptr;
        loaded.count += 1;
        local_irq_restore(flags)
        return

    // Loaded full — swap loaded ↔ spare (swaps the Option<NonNull> values).
    core::mem::swap(&mut pair.loaded, &mut pair.spare);
    // SAFETY: spare was also Some on the fast path; now it is the new loaded.
    let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
    if loaded.count < MAGAZINE_SIZE:
        loaded.objects[loaded.count as usize] = ptr;
        loaded.count += 1;
        local_irq_restore(flags)
        return

    // Both magazines full — slow path (under lock)
    local_irq_restore(flags)
    slab_free_slow(ptr, sc)

slab_free_slow(ptr, sc):
    cache = SLAB_CACHES_BY_CLASS[sc].load(Acquire)  // O(1) size-class → SlabCache

    // Decrement total_allocated on the slow free path (symmetric with
    // the increment in slab_alloc_slow). The fast path does NOT touch
    // this counter — the count lags by up to MAGAZINE_SIZE * num_cpus.
    cache.total_allocated.fetch_sub(1, Relaxed);

    // Return full spare magazine to depot, get an empty one.
    // SpinLock::lock() saves IRQ state and disables IRQs automatically
    // (see locking-strategy.md §SpinLock<T>), so no nested local_irq_save/restore is needed.
    let depot_guard = cache.depot.lock()   // saves IRQ state + disables IRQs + acquires lock

    // MIGRATION SAFETY: The task may have migrated between the fast path's
    // local_irq_restore() and this depot.lock acquisition. After acquiring the
    // spinlock (which disables IRQs, pinning us to the current CPU), we MUST
    // re-check the current CPU's magazine state. The magazines on the new CPU
    // may have space. Without this check, we would push pair.spare (the new
    // CPU's spare, which may NOT be full) to depot.fulls, violating the
    // invariant that depot.fulls contains only full magazines. Subsequent CPUs
    // popping from fulls would read uninitialized pointers.
    let pair: &mut MagazinePair = unsafe { &mut *(&raw mut (*CpuLocal::as_ptr()).slab_magazines[sc]) };
    // SAFETY: slow path entered from fast path where magazine_active == true,
    // so both loaded and spare are Some. unwrap() + as_mut() is safe.
    let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
    if loaded.count < MAGAZINE_SIZE:
        // Current CPU's loaded magazine has space — just push directly.
        loaded.objects[loaded.count as usize] = ptr;
        loaded.count += 1;
        drop(depot_guard)
        return
    core::mem::swap(&mut pair.loaded, &mut pair.spare);
    let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
    if loaded.count < MAGAZINE_SIZE:
        loaded.objects[loaded.count as usize] = ptr;
        loaded.count += 1;
        drop(depot_guard)
        return
    // Both magazines genuinely full on the current CPU. Proceed with depot exchange.
    //
    // drain_buf: if the depot has no empties, we copy the spare magazine's
    // objects into a stack-local buffer for deferred draining OUTSIDE the
    // depot lock. The magazine struct itself is reused (count zeroed) as an
    // empty magazine installed in pair.spare. This avoids allocating a new
    // magazine under the depot lock.
    let mut drain_buf: ArrayVec<*mut u8, MAGAZINE_SIZE> = ArrayVec::new();

    if let Some(empty_mag) = depot_guard.empties.pop():
        // pair.spare is confirmed full (count==MAGAZINE_SIZE) — safe to push to fulls.
        // unwrap() is safe: spare is Some (magazine_active invariant).
        debug_assert!(depot_guard.fulls.len() < depot_guard.fulls.capacity(),
            "depot fulls exceeded pre-allocated capacity");
        depot_guard.fulls.push(pair.spare.unwrap())   // return full spare as NonNull
        pair.spare = Some(empty_mag)                  // replace with empty from depot
    else:
        // No empties in depot — drain spare's objects for deferred return to slabs.
        // Copy spare's object pointers into a stack buffer (bounded by MAGAZINE_SIZE).
        // SAFETY: pair.spare is Some (magazine_active invariant).
        let spare_mag: &mut SlabMagazine = unsafe { pair.spare.unwrap().as_mut() };
        for i in 0..spare_mag.count {
            drain_buf.push(spare_mag.objects[i]);
        }
        // Zero the magazine's count — it is now an empty magazine in-place.
        // The magazine struct stays allocated and pointed to by pair.spare
        // (still Some). No None state needed for this path.
        spare_mag.count = 0;

    // Swap loaded ↔ spare so loaded is the empty magazine (from depot or zeroed).
    //
    // State trace (depot-has-empties path):
    //   Before: loaded=Some(full_A), spare=Some(empty_from_depot)
    //   After swap: loaded=Some(empty_from_depot), spare=Some(full_A)
    //   Push ptr into loaded — correct. Both fields remain Some.
    //
    // State trace (no-empties drain path):
    //   Before: loaded=Some(full_A), spare=Some(zeroed_B) (B's objects in drain_buf)
    //   After swap: loaded=Some(zeroed_B), spare=Some(full_A)
    //   Push ptr into loaded — correct. Both fields remain Some.
    core::mem::swap(&mut pair.loaded, &mut pair.spare);

    // SAFETY: pair.loaded is Some in both paths (verified by state traces above).
    let loaded: &mut SlabMagazine = unsafe { pair.loaded.unwrap().as_mut() };
    // Now push the freed object into loaded (the empty magazine).
    loaded.objects[loaded.count as usize] = ptr;
    loaded.count += 1;
    drop(depot_guard)                // releases lock + restores IRQ state

    // Drain the copied objects OUTSIDE the depot lock.
    // This is the critical optimisation: drain_magazine_to_slabs() acquires
    // node_partial(130) per NUMA node, so running it without depot.lock(125)
    // held eliminates contention on the depot for other CPUs.
    let mut empty_slabs = ArrayVec::<_, MAGAZINE_SIZE>::new();
    if !drain_buf.is_empty() {
        empty_slabs = drain_objects_to_slabs(&drain_buf, cache);
    }

    // Deferred buddy free: return any empty slabs to the buddy allocator
    // AFTER releasing all slab locks. See slab_shrink().
    for slab in empty_slabs:
        slab_shrink(cache, slab)

drain_magazine_to_slabs() / drain_objects_to_slabs() — Return objects to their owning slabs:

Two drain functions share the same core logic (reverse-lookup, freelist push, state transitions) but differ in input source:

drain_magazine_to_slabs(&mut SlabMagazine, ...) — takes a magazine pointer, iterates mag.objects[0..mag.count]. Used by GC Phase 2 (drain_depot_to_partial) where full magazines are popped from the depot and drained.
drain_objects_to_slabs(&[*mut u8], ...) — takes a pre-copied slice of object pointers from a stack-local ArrayVec. Used by slab_free_slow() when the depot has no empty magazines: the spare magazine's objects are copied to a stack buffer (under depot lock), the magazine struct is reused in-place as an empty magazine, and the copied objects are drained after the depot lock is released.

Both return empty slabs for deferred buddy freeing.

/// Drain all objects from a magazine back to their owning slabs.
///
/// For each object in the magazine:
/// 1. Reverse-lookup: ptr -> phys_to_page() -> Page.mapping -> SlabCache
///    -> Slab (via Page.index_or_freelist for the slab metadata pointer).
/// 2. Return the object to the slab's freelist under node_partial lock.
/// 3. If the slab transitions from Full to Partial, re-insert it into
///    the node_partial list.
/// 4. If the slab transitions to Empty (free_count == capacity), remove
///    it from the node_partial list and collect it for deferred free.
///
/// # Lock ordering
/// Called with the depot lock (level 125) NOT held. The caller drops
/// the depot lock before calling this function (see `slab_free_slow()`
/// and `drain_depot_to_partial()`).
/// This function acquires node_partial lock (level 130) per NUMA node.
/// No lock ordering constraint applies because only one lock is held
/// at a time — the depot lock was released before entry.
/// Does NOT call free_pages() — empty slabs are returned to the caller
/// for deferred buddy freeing after all slab locks are released.
///
/// # Performance optimization
/// Objects are grouped by NUMA node before lock acquisition. This reduces
/// the number of node_partial lock acquire/release cycles from O(MAGAZINE_SIZE)
/// (one per object) to O(num_numa_nodes) (one per NUMA node with objects).
/// The common case (all objects from the same node) requires exactly one
/// lock acquisition.
///
/// # Returns
/// A local ArrayVec of empty slabs to be freed by the caller via
/// slab_shrink() after dropping all slab locks.
fn drain_magazine_to_slabs(
    mag: &mut SlabMagazine,
    cache: &SlabCache,
) -> ArrayVec<*mut SlabPage, MAGAZINE_SIZE> {
    drain_objects_to_slabs(&mag.objects[..mag.count], cache)
    // Note: caller is responsible for zeroing mag.count if the magazine
    // struct is reused (e.g., depot drain frees the magazine entirely).
}

/// Drain a slice of object pointers back to their owning slabs.
/// Same logic as `drain_magazine_to_slabs` but operates on a pre-copied
/// object slice rather than a magazine reference. Used by `slab_free_slow()`
/// when the spare magazine's objects are extracted into a stack buffer.
fn drain_objects_to_slabs(
    objects: &[*mut u8],
    cache: &SlabCache,
) -> ArrayVec<*mut SlabPage, MAGAZINE_SIZE> {
    let mut empty_slabs: ArrayVec<*mut SlabPage, MAGAZINE_SIZE> = ArrayVec::new();

    // Group objects by NUMA node to minimize lock acquisitions.
    // ArrayVec<(nid, [(obj, slab, slot_idx)])> — bounded by MAGAZINE_SIZE.
    let mut per_node: [ArrayVec<(*mut u8, *mut SlabPage, usize), MAGAZINE_SIZE>;
                       NUMA_NODES_STACK_CAP] = Default::default();
    for &obj in objects {
        let paddr = virt_to_phys(obj);
        let page = phys_to_page(paddr);
        let slab = page_to_slab(page);
        let nid = page.node_id as usize;
        let slot_idx = ((obj as usize) - slab.base.as_usize()) / cache.object_size as usize;
        per_node[nid].push((obj, slab, slot_idx));
    }

    // Process each NUMA node: one lock acquisition per node.
    for nid in 0..num_possible_nodes() {
        if per_node[nid].is_empty() { continue; }
        let mut partial_guard = cache.node_partial[nid].lock();

        for &(obj, slab, slot_idx) in &per_node[nid] {
            // Return the object to the slab's freelist.
            // SAFETY: slot_idx < capacity, object was allocated from this slab.
            slab.freelist_push(slot_idx as u16);
            slab.free_count += 1;

            if slab.free_count == slab.capacity {
                // Slab is now empty — remove from partial list, collect for
                // deferred buddy free. Do NOT call free_pages() here (would
                // acquire buddy.zone_lock(120) while holding node_partial(130)).
                partial_guard.remove(slab);
                slab.list_state = SlabState::Empty;
                empty_slabs.push(slab);
            } else if slab.free_count == 1 && slab.list_state == SlabState::Full {
                // Slab was Full, now has a free slot — move to Partial list.
                slab.list_state = SlabState::Partial;
                partial_guard.push(slab);
            }
            // else: slab was already Partial, stays in the list. No action.
        }
        drop(partial_guard);
    }

    empty_slabs
}

Cost model: The slab fast path uses CpuLocal (Section 3.2) for magazine access, matching Linux's this_cpu_* pattern: ~1-10 cycles depending on architecture (single instruction on x86-64 via gs: prefix, 2-3 instructions on AArch64 via TPIDR_EL1). No PerCpu<T> wrapper is involved. The two-magazine swap design (Section 3.1.2, MagazinePair) ensures that tight alloc/free loops stay entirely on the fast path without touching the depot or partial lists. General per-CPU data accessed through PerCpu<T> uses a debug-only borrow-state CAS (Section 3.3): ~3-8 cycles in release builds, ~20-30 cycles in debug builds.

Cost breakdown by path: Magazine pop (fast path): ~5-15 cycles (~95% of allocations), dominated by local_irq_save/local_irq_restore (~3-8 cycles on x86-64, ~5-12 on AArch64) plus the stack pop. Magazine miss → depot exchange: ~50-100 cycles. Depot empty → buddy fallback: ~450-650 cycles. IRQ-disable is required (not merely preempt-disable) because IRQ handlers can allocate memory; the depot and buddy paths are progressively rarer.

SlabRef<T> — Owned handle to a slab-allocated object:

/// A reference to a slab-allocated object. Provides O(1) allocation
/// and deallocation from the owning slab cache. Implements `Deref<Target = T>`
/// for transparent access. `Drop` returns the object to the slab cache.
/// Not `Clone` — each `SlabRef` represents unique ownership.
///
/// **IRQ and cross-CPU safety**: `SlabRef` can be held across
/// preemption points and migrated between CPUs. On `drop()`, the object
/// is returned to the *current* CPU's magazine (with IRQs disabled via
/// `local_irq_save` for the push operation). This means an object allocated on CPU A may be
/// freed to CPU B's magazine — this is correct and expected. The magazine
/// depot mechanism (Section 4.3) rebalances magazines across CPUs as they
/// fill and empty, so cross-CPU free does not cause persistent imbalance.
///
/// No caller action is required for preemption safety. The slab allocator
/// internally disables preemption for the duration of the magazine push
/// (a single pointer write, <10ns).
pub struct SlabRef<T: ?Sized> {
    ptr: NonNull<T>,
    cache: *const SlabCache,
}

4.3.2 Page Frame Descriptor¶

The Page struct is the fundamental physical memory tracking unit. One Page exists per physical page frame in the system, allocated as a contiguous array (memmap) at boot time, indexed by PFN (physical frame number). The memmap is allocated from the boot allocator during mem_init() before the buddy allocator is operational.

This is an internal kernel data structure with no Linux ABI exposure. /proc/meminfo, /proc/vmstat, and similar interfaces read from global atomic counters, not from Page structs directly.

/// Physical page frame descriptor. One per physical page in the system.
/// Allocated in a contiguous array at boot (memmap), indexed by PFN.
/// Size: 64 bytes (one cache line) to minimize memmap overhead.
///
/// At 64 bytes per page, a 256 GB system requires 4 MB of memmap space
/// (256 GB / 4 KB × 64 B = 4 MB). This is 0.0015% of total RAM — negligible.
///
/// Design note: Linux's `struct page` is 64 bytes and uses unions extensively
/// to overlay fields for different page types (slab, compound, page-cache, etc.).
/// UmkaOS uses the same 64-byte budget but with explicit fields and a simpler
/// layout. Fields that are mutually exclusive (e.g., `mapping` vs slab pointer)
/// share the same offset via Rust enums where needed rather than raw unions.
// kernel-internal, not KABI
#[repr(C, align(64))]
pub struct Page {
    /// Reference count. Matches Linux's `_refcount` convention:
    /// - 0 = free (in buddy allocator, not allocated to any subsystem).
    /// - 1 = allocated (returned by `alloc_pages()`), not yet additionally
    ///   referenced. This is the base state after allocation.
    /// - N > 1 = N-1 additional references beyond the base allocation
    ///   (e.g., page cache + DMA mapping + additional `get_page()` calls).
    ///
    /// `alloc_pages()` sets `refcount` to 1 via `set_page_refcounted()`.
    /// `get_page()` increments. `put_page()` decrements; when it reaches
    /// 0, the page is returned to the buddy allocator's free list.
    ///
    /// **Invariant**: `refcount > 0` means the page is in use. Code that
    /// checks "is this page allocated?" tests `refcount > 0`, not
    /// `refcount != 0`. There is no -1 sentinel.
    ///
    /// Manipulated atomically; never accessed under a lock.
    pub refcount: AtomicI32,
    /// Mapping count: number of page table entries pointing to this page.
    /// - 0 = not mapped in any page table (allocated but unmapped, e.g.,
    ///   slab page, DMA buffer, page cache page with no active mmap).
    /// - N > 0 = mapped by N PTEs (shared or COW mappings).
    ///
    /// Separate from `refcount` because a page can be referenced (held in
    /// page cache, pinned by DMA) without being mapped in any PTE, and the
    /// reclaim path needs O(1) "is this page mapped?" checks without
    /// reverse-mapping walks. Also needed by WEA section lifecycle and
    /// DSM remote references.
    ///
    /// Matches Linux's separate `_mapcount` field in `struct folio`.
    pub mapcount: AtomicI32,
    /// Page flags (locked, dirty, uptodate, writeback, slab, compound, etc.).
    /// See `PageFlags` bitflags for the complete set.
    pub flags: AtomicU32,
    /// Explicit padding to align `mapping` to 8 bytes.
    /// Offset 12, size 4. Zero-initialized during memmap init.
    /// kernel-internal padding — not KABI, no info disclosure risk.
    pub _pad1: [u8; 4],
    /// For pages in the page cache: pointer to the address_space (inode mapping).
    /// For anonymous pages: pointer to the anon_vma reverse-mapping structure.
    /// For slab pages: pointer to the slab cache descriptor (`SlabCache`).
    /// Null for free pages.
    pub mapping: AtomicPtr<u8>,
    /// For page cache pages: offset within the file (in page-sized units).
    /// For slab pages: **unused in UmkaOS** (the freelist is out-of-band in
    /// the `Slab` struct, not in-object like Linux's SLUB). Retained for
    /// layout compatibility with the `Page` struct definition in
    /// [Section 4.2](#physical-memory-allocator). Interpretation depends on
    /// `flags & SLAB`.
    pub index_or_freelist: u64,
    /// LRU list linkage (for MGLRU generations). Intrusive doubly-linked list node.
    /// Free pages use this for buddy allocator free-list linkage.
    /// Size: 16 bytes (two pointers: prev + next).
    pub lru: IntrusiveListNode,
    /// Compound page order (0 for single pages, log2(nr_pages) for compound heads).
    /// Only meaningful on head pages; tail pages store a back-pointer to the head
    /// page via `mapping` (with a flag bit to distinguish from address_space pointers).
    pub compound_order: u8,
    /// NUMA node ID where this page was allocated.
    /// Discovered at boot from SRAT/SLIT tables (x86) or device tree (ARM/RISC-V).
    pub node_id: u8,
    /// Zone index within the node (DMA, DMA32, Normal, Movable).
    pub zone_id: u8,
    /// Explicit padding to align `mem_cgroup` to 8 bytes.
    /// Offset 51, size 5. Zero-initialized during memmap init.
    /// kernel-internal padding — not KABI, no info disclosure risk.
    pub _pad2: [u8; 5],
    /// Pointer to the memory cgroup that charged this page. Used by cgroup
    /// memory accounting to uncharge on free. Read on the page-free hot path;
    /// written on alloc. Null for pages not associated with any cgroup (e.g.,
    /// kernel slab pages charged to root, or pages allocated before cgroup init).
    ///
    /// **RCU lifetime contract**: The `mem_cgroup` pointer is protected by RCU
    /// on the read side. The page-free path reads this field under
    /// `rcu_read_lock()` to safely dereference the `MemCgroup` and perform
    /// uncharging. The `MemCgroup` struct is not freed until all charged pages
    /// have been uncharged — `MemCgroup.res` (resource counter) tracks the
    /// total charge, and `css_release()` defers `kfree_rcu(memcg)` until
    /// `res.usage == 0`. This ensures the pointer remains valid for the
    /// duration of any RCU read-side critical section that observes it.
    /// Writers (charge path at page allocation) set this field before the
    /// page becomes visible to other CPUs (via page cache insertion or PTE
    /// installation), so no additional synchronization is needed on the
    /// write side. See [Section 17.2](17-containers.md#control-groups) for `MemCgroup` lifecycle.
    pub mem_cgroup: *const MemCgroup,
    // **50-year design note**: This struct is frozen at 64 bytes. New
    // per-page metadata goes into `PageExtArray` (Pattern 1: Extension
    // Array) — a parallel vmemmap-backed array indexed by PFN, defined
    // in [Section 4.2](#physical-memory-allocator--pageextarray--per-page-extension-metadata).
    // If `PageExt` itself needs to grow beyond 16 bytes, Pattern 2
    // (Shadow-and-Migrate) reallocates the extension vmemmap in the
    // background — see the Data Format Evolution Framework in
    // [Section 13.18](13-device-classes.md#live-kernel-evolution--data-format-evolution).
    // The `Page` struct itself NEVER grows beyond 64 bytes.
}
// Layout analysis (64-bit):
//   refcount: AtomicI32          offset  0, size 4
//   mapcount: AtomicI32          offset  4, size 4
//   flags:    AtomicU32          offset  8, size 4
//   _pad1:    [u8; 4]            offset 12, size 4 (explicit, aligns mapping to 8)
//   mapping:  AtomicPtr<u8>      offset 16, size 8
//   index_or_freelist: u64       offset 24, size 8
//   lru: IntrusiveListNode       offset 32, size 16 (two 8-byte pointers)
//   compound_order: u8           offset 48, size 1
//   node_id: u8                  offset 49, size 1
//   zone_id: u8                  offset 50, size 1
//   _pad2:    [u8; 5]            offset 51, size 5 (explicit, aligns mem_cgroup to 8)
//   mem_cgroup: *const MemCgroup offset 56, size 8
//   Total fields: 64 bytes. mem_cgroup ends at offset 64, which equals the
//   alignment — no implicit tail padding. The struct is exactly one cache line.
//   Padding at offsets 12-15 and 51-55 is now explicit via `_pad1` and `_pad2`,
//   zero-initialized during memmap initialization (Phase B of memmap_to_vmemmap).
//   These padding fields do NOT cross KABI boundaries (Page is kernel-internal,
//   not KABI) and are never read by userspace, so they do not pose an information
//   disclosure risk. They are zero-initialized for deterministic behavior under
//   debug tools and crash dump analyzers.
//
// On 32-bit (ARMv7, PPC32): pointer fields are 4 bytes instead of 8, and
// IntrusiveListNode is 8 bytes (two 4-byte pointers). The field byte sum differs,
// but #[repr(C, align(64))] guarantees sizeof(Page) == 64 on ALL architectures
// via compiler-inserted tail padding.
const_assert!(core::mem::size_of::<Page>() == 64);

/// Global memmap: PFN-indexed array of page descriptors.
/// Initialised during mem_init() from the boot allocator. On NUMA systems,
/// each node's portion of the memmap is allocated from that node's memory
/// (node-local memmap) to minimize cross-node accesses during page operations.
pub static MEMMAP: OnceCell<&'static [Page]> = OnceCell::new();

/// Convert a physical address to a Page reference.
///
/// **Boot/post-boot dispatch**: During early boot (before `memmap_to_vmemmap()`
/// completes in Phase 1.2.1), this uses the flat `MEMMAP` array allocated by
/// the boot allocator. After vmemmap migration, it dispatches through
/// `PAGE_ARRAY` (the vmemmap-backed sparse mapping). The `PAGE_ARRAY` OnceCell
/// serves as the flag: if set, vmemmap is active; if not, fall back to MEMMAP.
///
/// Panics if the address is outside the physical memory range.
#[inline]
pub fn phys_to_page(paddr: PhysAddr) -> &'static Page {
    // Post-migration: use vmemmap-backed PageArray (hot path).
    if let Some(pa) = PAGE_ARRAY.get() {
        return pa.get(paddr).expect("PFN exceeds vmemmap range");
    }
    // Pre-migration: use boot-time flat MEMMAP.
    // Use as_u64() to avoid truncation on 32-bit LPAE (PhysAddr > 4 GB).
    let pfn = paddr.as_u64() / PAGE_SIZE as u64;
    // pfn as usize: safe because MEMMAP only covers the direct-map region,
    // which is bounded by virtual address space on 32-bit. Physical
    // addresses above the direct-map are handled by vmemmap (PAGE_ARRAY path).
    &MEMMAP.get().expect("memmap not initialised")[pfn as usize]
}

4.3.3 Slab Cache Garbage Collection¶

4.3.3.1 Problem¶

The slab allocator specification states that "slab caches are never destroyed." On systems with 50-year uptime targets, this becomes a memory leak: drivers that create custom slab caches (via the KABI slab creation interface) permanently leak the SlabCache descriptor (~256 bytes), the per-CPU magazine depot, and any empty slabs when the driver is unloaded. Over decades at ~100 driver load/unload cycles per day, this accumulates ~1.8 million orphaned caches — potentially hundreds of MB of wasted kernel metadata.

4.3.3.2 Design: kslab_gc Background Thread¶

A background kthread (kslab_gc) periodically scans all slab caches and destroys those meeting all four conditions:

Empty: No allocated objects across all CPUs and NUMA nodes.
Unreferenced: No live SlabRef<T> or SlabCache pointer held by any kernel component.
DESTROYABLE: The cache was created with SlabCacheFlags::DESTROYABLE (not a boot-time PERMANENT cache).
Idle: The last_activity_ns timestamp is older than idle_threshold_ms (default: 300 seconds = 5 minutes).

Built-in caches (task_struct, inode_cache, dentry_cache, vma_cache, etc.) are allocated at boot with SlabCacheFlags::PERMANENT and are never GC-eligible. Only caches created by drivers via the KABI slab creation interface carry DESTROYABLE.

/// Slab cache garbage collector configuration.
///
/// Runs as a background kthread (`kslab_gc`) at `SCHED_IDLE` priority.
/// Zero impact on latency-sensitive paths — the thread only runs when
/// the CPU is otherwise idle.
pub struct SlabGc {
    /// Scan interval. Default: 60 seconds. Tunable via
    /// `/proc/sys/vm/slab_gc_interval_ms`.
    pub interval_ms: u64,
    /// Minimum idle time before a cache becomes GC-eligible.
    /// A cache is "idle" when its `last_activity_ns` is older than this
    /// threshold. Default: 300,000 ms (5 minutes). Tunable via
    /// `/proc/sys/vm/slab_gc_idle_threshold_ms`.
    pub idle_threshold_ms: u64,
}

bitflags! {
    /// Flags controlling slab cache lifecycle.
    pub struct SlabCacheFlags: u32 {
        /// Cache is permanent (boot-time). Never eligible for GC.
        /// Set for: task_struct, inode_cache, dentry_cache, vma_cache,
        /// mm_struct_cache, signal_cache, and all other kernel-internal caches.
        const PERMANENT     = 0x0001;
        /// Cache was created by a driver via KABI. Eligible for GC when
        /// empty, unreferenced, and idle.
        const DESTROYABLE   = 0x0002;
        /// Cache is currently being drained for GC. No new allocations
        /// from this cache are permitted (alloc returns ENOMEM with a
        /// fallback hint to use the general-purpose kmalloc cache).
        const GC_DRAINING   = 0x0004;
    }
}

/// GC state machine for a slab cache.
/// Transitions: Active → Draining → Destroyed.
/// Only forward transitions are valid.
#[repr(u8)]
pub enum SlabGcState {
    /// Normal operation. Allocations and frees proceed normally.
    Active = 0,
    /// GC initiated. No new allocations; magazines being drained.
    Draining = 1,
    /// All objects freed, all slab pages returned to buddy allocator,
    /// SlabCache descriptor freed. Terminal state.
    Destroyed = 2,
}

4.3.3.3 GC Lock Ordering¶

Slab GC acquires locks in a strict partial order to prevent deadlocks. The ordering is consistent with the normal allocation path:

Phase	Lock acquired	Ordering constraint
Phase 1 (IPI magazine drain)	`depot.lock` per cache (acquired inside IPI handler per-CPU). The IPI handler runs on each target CPU with IRQs disabled, extracts magazines via `.take()` (setting loaded/spare to `None`), and pushes the `NonNull<SlabMagazine>` pointers to `depot.fulls` or `depot.empties` as appropriate. `magazine_active` is already `false` before the IPI.	`depot.lock(125)` only.
Phase 2 (depot drain)	`depot.lock` per cache	`depot.lock` is a leaf lock during normal allocation (acquired on magazine-miss slow path, no other lock held).
Phase 3 (partial list cleanup)	`node_partial.lock` per NUMA node	Acquired after `depot.lock` is released. Same ordering as the allocation path (`slab_alloc_slow()` acquires `node_partial.lock` without holding `depot.lock`).
Phase 5 (page return to buddy)	`buddy.zone_lock` per zone	Acquired inside `slab_shrink()` / `free_pages()`. `depot.lock` is released BEFORE `buddy.zone_lock` is acquired — no nested hold.

Lock ordering rule: buddy.zone_lock(120) < depot.lock(125) < node_partial.lock(130). There is NO actual nesting between these locks in current code. drain_objects_to_slabs() / drain_magazine_to_slabs() acquire node_partial.lock(130) but only AFTER the depot lock has been released (the spare magazine's objects are copied to a stack buffer under the depot lock, then drained without holding it). All lock interactions are sequential (acquire-release-acquire): slab_grow() acquires buddy.zone_lock (via alloc_pages()) then releases it, then acquires node_partial.lock to insert the new slab. GC Phase 2 acquires depot.lock, releases it, then Phase 5 acquires buddy.zone_lock. The buddy.zone_lock(120) level is assigned conservatively to prevent future code from accidentally nesting it with slab locks. The ordering rule prevents inversions with the allocation and free paths.

4.3.3.4 GC Protocol¶

The kslab_gc thread executes the following sequence for each candidate cache:

kslab_gc_scan():
  for each cache in SLAB_CACHES:
    if cache.flags & PERMANENT:
      continue          // Boot-time cache, skip
    if cache.gc_state.load(Relaxed) != Active:
      continue          // Already draining or destroyed
    if now_ns() - cache.last_activity_ns.load(Relaxed) < idle_threshold_ns:
      continue          // Still active recently
    if total_allocated_objects(cache) > 0:
      continue          // Has live objects, skip

    // Candidate for GC. Attempt to transition to Draining.
    if cache.gc_state.compare_exchange(Active, Draining, AcqRel, Relaxed).is_err():
      continue          // Race with another operation, skip

    // Phase 1: Drain all per-CPU magazines.
    //   Send IPI to each CPU requesting magazine drain for this cache.
    //   Each CPU's IPI handler:
    //     1. Runs with IRQs disabled (IPI entry disables them).
    //     2. Acquires depot.lock, extracts loaded/spare via .take()
    //        (setting both to None), pushes NonNull pointers to
    //        depot.fulls or depot.empties as appropriate.
    //     3. Drops depot.lock.
    //   Caller waits for all IPIs to complete.
    drain_all_cpu_magazines(cache)

    // Phase 2: Drain depot magazines back to partial lists.
    drain_depot_to_partial(cache)

    // Phase 3: Verify all slabs are empty.
    //   Walk node_partial lists; every slab must have free_count == capacity.
    if !all_slabs_empty(cache):
      // Objects appeared between the check and drain (race with late free).
      // Abort: return to Active state. Will retry on next GC scan.
      // Convergence: the GC succeeds only when the cache has zero live
      // objects and no concurrent free operations. The 300-second idle
      // threshold ensures the GC only attempts collection on truly
      // abandoned caches. For caches with rare but ongoing activity,
      // the idle threshold prevents futile GC attempts.
      cache.gc_state.store(Active, Release)
      continue

    // Phase 4: RCU barrier — wait for all RCU readers to complete.
    //   SlabRef<T> tokens may be held under RCU read-side locks. If we
    //   return slab pages to the buddy allocator while an RCU reader still
    //   holds a SlabRef pointing into these pages, the reader dereferences
    //   freed memory. rcu_barrier() ensures all RCU callbacks (including
    //   deferred frees from kfree_rcu) have completed before page return.
    //   This runs in workqueue context (may sleep), which is why slab GC
    //   cannot run in IRQ or atomic context.
    rcu_barrier()

    // Phase 5: Return all slab pages to the buddy allocator.
    //
    // Slab page return constraints:
    // - NUMA locality: pages are returned to the buddy allocator of the NUMA
    //   node they were originally allocated from (tracked via
    //   PageArray[pfn].node_id). This preserves NUMA affinity for future
    //   allocations from the same zone.
    // - Migration type: UNMOVABLE (slab pages are allocated as UNMOVABLE and
    //   returned with the same type). The buddy allocator does NOT reclassify.
    // - Context: workqueue (may sleep — required for rcu_barrier() in Phase 4).
    //   MUST NOT hold depot.lock when calling buddy free (see lock ordering).
    // - The depot lock is NOT held during this phase. Phase 3 already drained
    //   the depot under its lock; Phase 5 operates on the collected page list.
    return_slab_pages_to_buddy(cache)

    // Phase 6: Remove from global registries and free descriptor.
    // Note: DESTROYABLE caches are NOT in SLAB_CACHES_BY_CLASS — that array
    // contains only PERMANENT boot-time kmalloc size-class caches and is
    // sealed after slab_init(). Only the SLAB_CACHES XArray entry is removed.
    SLAB_CACHES.remove(cache.cache_id)
    // Also remove from the name dedup map:
    SLAB_CACHE_NAMES.lock().remove(cache.name);
    cache.gc_state.store(Destroyed, Release)
    free_cache_descriptor(cache)

GC helper function definitions:

/// Count total allocated (live) objects across the cache. Uses a per-cache
/// `AtomicU64` counter (`cache.total_allocated`) that is updated on the
/// SLOW PATH only (magazine miss/overflow, depot exchange). The hot-path
/// magazine pop/push does NOT update this counter — the count is approximate
/// (may lag by up to MAGAZINE_SIZE * num_cpus objects). This is acceptable
/// because GC only runs on caches idle for 300+ seconds, where approximate
/// zero means truly zero.
///
/// NOT a list walk — O(1) atomic load.
fn total_allocated_objects(cache: &SlabCache) -> u64 {
    cache.total_allocated.load(Relaxed)
}

/// Drain all per-CPU magazines for this cache via IPI. Each target CPU's
/// handler runs with preemption disabled and moves loaded + spare magazine
/// contents to the depot (empties/fulls as appropriate). No cross-CPU lock
/// acquisition — each CPU drains its own magazines.
///
/// **depot.lock under IPI**: The IPI handler acquires `depot.lock` (a SpinLock)
/// from IPI context. This is safe because SpinLock disables IRQs on the local
/// CPU. No deadlock: the IPI handler runs with IRQs disabled (IPI entry disables
/// them), and no other code path holds depot.lock with IRQs enabled during GC
/// (the GC kthread also acquires depot.lock with IRQs disabled via SpinLock).
fn drain_all_cpu_magazines(cache: &SlabCache) {
    // Build a closure that drains this cache's magazine pair:
    //   pair = &mut CpuLocalBlock.slab_magazines[cache.size_class]
    //   depot_guard = cache.depot.lock()
    //   For each of loaded/spare: if Some and count > 0, push to depot.fulls.
    //   If Some and count == 0, push to depot.empties.
    //   Set both fields to None (GC is draining; magazine_active is already false).
    //   drop(depot_guard)
    smp_call_function_many(
        &cpu_online_mask(),
        |_cpu| {
            // SAFETY: IPI handler runs on target CPU with IRQs disabled.
            let pair: &mut MagazinePair = unsafe {
                &mut *(&raw mut (*CpuLocal::as_ptr()).slab_magazines[cache.size_class])
            };
            let mut depot_guard = cache.depot.lock();
            // Extract loaded magazine via .take() — leaves pair.loaded = None.
            // GC has already set magazine_active = false, so None is safe:
            // all subsequent alloc/free on this CPU bypasses the magazine.
            if let Some(loaded_ptr) = pair.loaded.take() {
                // SAFETY: loaded_ptr is a valid NonNull<SlabMagazine>.
                let loaded_count = unsafe { loaded_ptr.as_ref().count };
                if loaded_count > 0 {
                    debug_assert!(depot_guard.fulls.len() < depot_guard.fulls.capacity());
                    depot_guard.fulls.push(loaded_ptr);
                } else {
                    debug_assert!(depot_guard.empties.len() < depot_guard.empties.capacity());
                    depot_guard.empties.push(loaded_ptr);
                }
            }
            // Extract spare magazine via .take() — leaves pair.spare = None.
            if let Some(spare_ptr) = pair.spare.take() {
                let spare_count = unsafe { spare_ptr.as_ref().count };
                if spare_count > 0 {
                    debug_assert!(depot_guard.fulls.len() < depot_guard.fulls.capacity());
                    depot_guard.fulls.push(spare_ptr);
                } else {
                    debug_assert!(depot_guard.empties.len() < depot_guard.empties.capacity());
                    depot_guard.empties.push(spare_ptr);
                }
            }
            drop(depot_guard);
        },
    );
    // Wait for all IPIs to complete (smp_call_function_many is synchronous).
}

/// Drain depot magazines back to per-NUMA partial lists. Acquires
/// depot.lock, pops all full magazines, releases depot.lock, then
/// for each magazine calls drain_magazine_to_slabs() to return objects
/// to their owning slabs. Empty slabs are collected for Phase 5.
fn drain_depot_to_partial(cache: &SlabCache) {
    let mut depot_guard = cache.depot.lock();
    let full_mags: Vec<_> = depot_guard.fulls.drain(..).collect();
    let empty_mags: Vec<_> = depot_guard.empties.drain(..).collect();
    drop(depot_guard);
    for mut mag_ptr in full_mags {
        // SAFETY: mag_ptr is a valid NonNull<SlabMagazine> from depot.fulls.
        // as_mut() yields &mut SlabMagazine for drain_magazine_to_slabs.
        let mag: &mut SlabMagazine = unsafe { mag_ptr.as_mut() };
        drain_magazine_to_slabs(mag, cache);
        // After draining, the magazine struct is deallocated.
        dealloc_magazine(mag_ptr);
    }
    // Empty magazines are freed (deallocate the magazine struct itself).
    for mag in empty_mags {
        dealloc_magazine(mag);
    }
}

/// Walk all per-NUMA partial lists and verify every slab has
/// free_count == capacity (all objects free). Returns false if any
/// slab has allocated objects.
fn all_slabs_empty(cache: &SlabCache) -> bool {
    for node in 0..num_possible_nodes() {
        let partial = cache.node_partial[node].lock();
        for slab in partial.iter() {
            if slab.free_count < slab.capacity {
                return false;
            }
        }
    }
    true
}

/// Collect all slab pages from all per-NUMA partial lists and return
/// them to the buddy allocator via slab_shrink(). Must NOT hold
/// depot.lock or node_partial.lock when calling buddy free.
fn return_slab_pages_to_buddy(cache: &SlabCache) {
    for node in 0..num_possible_nodes() {
        let mut slabs_to_free = Vec::new();
        {
            let mut partial = cache.node_partial[node].lock();
            while let Some(slab) = partial.pop_head() {
                slabs_to_free.push(slab);
            }
        } // node_partial lock released
        for slab in slabs_to_free {
            slab_shrink(cache, slab);
        }
    }
}

/// Free the SlabCache descriptor struct itself. Deallocates the depot
/// Vec storage, per-NUMA partial list heads, and the SlabCache struct
/// from its meta-slab (or boot allocator for boot-time caches).
fn free_cache_descriptor(cache: &SlabCache) {
    // depot.fulls and depot.empties Vecs are dropped automatically.
    // The SlabCache struct is freed back to its slab cache
    // (SLAB_CACHES is a meta-cache created during slab_init).
    kfree(cache as *const SlabCache as *mut u8);
}

4.3.3.5 Performance Impact¶

Steady state: Zero. The GC thread is SCHED_IDLE and only scans the cache list every 60 seconds. The scan itself is O(N) in the number of caches (~100-500 on a typical system), with each check being a few atomic loads — total scan time is ~10-50 μs.

During drain: The IPI to drain magazines costs ~10-50 μs across all CPUs (same mechanism used by existing slab_shrink for memory pressure). This happens only for caches that have been idle for 5+ minutes with zero live objects — extremely rare in practice. The drain affects only the target cache; other caches continue operating normally.

Hot path: Unaffected. Magazine pop/push never checks gc_state. The GC state check occurs at the entry of slab_alloc_slow() (magazine miss → slow path entry), where gc_state != Active causes the allocation to return AllocError::CacheDraining with a hint to fall back to kmalloc. This prevents allocations from racing with GC Phases 3-5 (verify empty → RCU barrier → free pages).

NMI safety: NMI handlers MUST NOT allocate from slab (use the emergency reserve pool — a pre-allocated per-CPU ArrayVec<*mut u8, 16> of 4 KiB buffers populated at boot). Enforcement: debug builds insert assert!(!in_nmi()) at the slab alloc() entry point. In release builds, this invariant is verified by static analysis during CI (the #[deny(slab_alloc_in_nmi)] lint). The slab GC magazine drain IPI is safe with respect to NMI observation because: (1) Magazine pop/push execute with IRQs disabled on the local CPU (via local_irq_save), and NMI handlers MUST NOT allocate from slab (they use the emergency reserve pool) — so there is no concurrent access to a CPU's magazine from both the normal path and an NMI. The IPI-based drain is serialized: the target CPU executes the drain handler with preemption disabled, and NMI handlers do not touch slab magazines. (2) The GC_DRAINING flag is read on the slow path only; an NMI hitting the slow path returns ENOMEM (falls back to emergency pool). (3) The rcu_barrier() in Phase 4 runs in workqueue context (sleepable) and cannot be preempted by NMI in a way that deadlocks — rcu_barrier() posts callbacks and waits; NMI does not interact with RCU callback processing.

CPU hotplug slab quiescence: When a CPU goes offline, its slab magazine pair is drained back to the depot (similar to PCP pool drainage in the physical allocator). A per-CPU magazine_active: AtomicBool flag (analogous to pcp_valid in Section 4.2) is cleared before draining. While magazine_active == false, allocations on this CPU fall through to the depot slow path. The flag is set back to true when the CPU comes back online and fresh magazines are allocated from the depot.

SlabRef::drop() when magazine_active == false: If a SlabRef is dropped on a CPU whose magazine is inactive (CPU going offline or not yet re-initialized), drop() bypasses the local magazine and returns the object directly to the depot via the depot lock path — the same slow path used when the magazine is full. This is safe because the depot lock is not CPU-affine.

Evolvability note: The slab allocator's hot path (magazine pop/push) is lock-free per-CPU code that cannot tolerate indirection through a replaceable vtable — every additional branch on the allocation fast path costs ~2-5 ns, violating the performance budget. However, the warm/cold-path policy decisions (NUMA node selection, slab growth, GC eligibility) are factored into a stateless SlabAllocPolicy trait that is replaceable via EvolvableComponent (PolicyPointId::SlabAllocPolicy = 9):

/// Slab allocator policy — warm/cold path only. The 10th stateless policy (index 9).
/// The hot-path magazine pop/push is NOT dispatched through this trait.
pub trait SlabAllocPolicy: Send + Sync {
    /// Select the preferred NUMA node for a new slab page allocation.
    fn select_node(&self, cache: &SlabCache, gfp: GfpFlags) -> NumaNode;
    /// Whether to grow the cache by allocating a new slab page on this node.
    fn should_grow(&self, cache: &SlabCache, node: NumaNode) -> bool;
    /// Whether this cache is eligible for GC after being idle for `idle_ms`.
    fn gc_eligible(&self, cache: &SlabCache, idle_ms: u64) -> bool;
}

The GC parameters (idle threshold, scan interval) are also tunable via /proc/sys/vm/slab_gc_* sysctl knobs for operational tuning without policy replacement.

4.3.3.6 Driver-Created Cache Merging and Per-CPU Magazine Mapping¶

kmem_cache_create() allows drivers and kernel subsystems to create dedicated slab caches with custom object sizes, alignment, and constructor/destructor functions. The merging algorithm determines whether a new cache is merged into an existing standard kmalloc size class or receives its own independent cache:

Merging conditions (all must hold for merge into standard kmalloc-N cache):

object_size, rounded up to the next size class boundary, matches an existing kmalloc size class (classes 0-23, 8..16384 bytes).
align <= natural alignment of the size class (e.g., 256-byte class has natural alignment of at least 8 bytes; custom alignment of 128 or 256 requires a dedicated cache).
ctor == None — constructor functions are incompatible with merging because objects from different caches may interleave in the same slab.
dtor == None — destructor functions are incompatible with merging because kfree() cannot determine which destructor to call for a generic kmalloc object.
flags does not include SLAB_NO_MERGE (an opt-out flag for caches that require isolation for debugging or security reasons).

If merged: The new cache is an alias. kmem_cache_create() returns a handle to the existing kmalloc-N SlabCache. Allocations use the standard per-CPU CpuLocalBlock.slab_magazines[sc] (fixed 26-entry array). No new per-CPU magazine slots are needed.

If NOT merged (custom ctor/dtor, non-standard alignment, or opt-out):

A dedicated SlabCache is created with its own depot, per-NUMA partial lists, and per-CPU magazine pair. The per-CPU magazines for non-standard caches are allocated from a per-CPU extensible magazine XArray keyed by cache_id:

/// Per-CPU extension for driver-created caches that cannot merge into
/// standard kmalloc size classes. Stored in PerCpu<T> (not CpuLocalBlock
/// registers — only the 26 standard size classes occupy the fast register
/// path). Accessed under IRQ-disable (same as standard magazines).
///
/// XArray keyed by cache_id (integer key — per collection policy).
/// Lazy initialization: the XArray entry is created on the first
/// alloc/free for this cache on this CPU.
pub static PERCPU_CUSTOM_MAGAZINES: PerCpu<XArray<MagazinePair>> =
    PerCpu::new(XArray::new);

The custom magazine path is the warm path (one XArray lookup per alloc/free, ~20-50ns overhead vs the register-based fast path). This is acceptable because driver-created caches are typically accessed less frequently than core kmalloc paths. The fast-path CpuLocalBlock.slab_magazines[24] remains reserved exclusively for the 26 standard kmalloc size classes.

kmem_cache_create() signature:

/// Create or look up a slab cache for objects of the given size and alignment.
///
/// If the requested parameters match an existing kmalloc size class (no
/// ctor/dtor, standard alignment), the existing cache is returned (merged).
/// Otherwise, a dedicated cache is created with its own depot and per-CPU
/// magazine pairs.
///
/// # Arguments
/// - `name`: Cache name for `/proc/slabinfo` and debugging. Must be unique
///   (or match an existing cache for intentional dedup).
/// - `object_size`: Size of each object in bytes.
/// - `align`: Minimum alignment in bytes (0 = natural alignment for object_size).
/// - `flags`: `SLAB_ZERO_INIT`, `SLAB_NO_MERGE`, `SLAB_PERMANENT`, etc.
/// - `ctor`: Optional constructor called on each newly-allocated object.
/// - `dtor`: Optional destructor called before returning objects to buddy.
///
/// # Returns
/// A `SlabCacheHandle` that can be passed to `kmem_cache_alloc()` / `kmem_cache_free()`.
pub fn kmem_cache_create(
    name: &'static str,
    object_size: usize,
    align: usize,
    flags: SlabCacheFlags,
    ctor: Option<fn(*mut u8)>,
    dtor: Option<fn(*mut u8)>,
) -> Result<SlabCacheHandle, AllocError>;

Cross-references: - Physical memory allocator (buddy pages returned here): Section 4.2 - Per-CPU magazines: Section 3.2 - Live kernel evolution (slab GC enables driver reload without slab leaks): Section 13.18 - KABI slab creation interface: Section 12.1

4.4 Page Cache¶

RCU-protected radix tree for page lookups (lock-free reads)
NUMA-aware page placement: Pages cached on the node closest to the requesting CPU
Writeback: Per-device writeback threads, dirty page ratio thresholds
Page reclaim: Generational LRU (Section 4.4.1) with per-CPU drain batching
Transparent huge pages: Automatic promotion of 4KB pages to 2MB when aligned runs are available, automatic splitting under memory pressure

PageRef — Lightweight page frame reference:

/// A lightweight, non-owning reference to a physical page frame (`Page`).
///
/// `PageRef` wraps a raw pointer to a `Page` descriptor. Unlike `Arc<Page>`,
/// it does NOT perform atomic reference counting on every lookup — the page
/// cache's RCU read-side protection guarantees the `Page` is not freed while
/// a reader holds the RCU read lock. Explicit `page_ref_inc()` / `page_ref_dec()`
/// calls manage the refcount only on the insert and eviction paths (warm path),
/// keeping the per-lookup hot path free of atomics.
///
/// # Safety
/// A `PageRef` is valid only while one of these conditions holds:
/// - The caller holds an RCU read lock (page cache lookup path).
/// - The caller holds a counted reference obtained via `page_ref_inc()`.
/// - The caller holds the page lock (`PageFlags::LOCKED`).
///
/// Dereferencing a `PageRef` outside these conditions is undefined behavior.
///
/// **RCU freeing protocol**: When a page is evicted from the page cache, the
/// `Page` descriptor is NOT immediately returned to the slab. Instead:
/// 1. The XArray entry is removed (xa_erase under xa_lock).
/// 2. `page_ref_dec()` drops the cache's counted reference.
/// 3. If refcount reaches 0, the page is freed via `call_rcu(page.i_rcu, page_free_rcu)`.
/// 4. `page_free_rcu` runs after all RCU readers have completed, then returns the
///    Page to the page allocator free list.
/// This ensures that any `PageRef` held under an RCU read lock remains valid
/// until the reader exits the RCU critical section — even if the page was
/// concurrently evicted. Readers that need the page beyond the RCU section
/// must call `page_ref_inc()` before exiting the RCU read lock.
#[derive(Clone, Copy)]
pub struct PageRef {
    ptr: *const Page,
}

PageCache — Per-inode page cache:

/// Per-inode page cache: maps file page offsets to their in-memory page frames.
///
/// Embedded inside the VFS `AddressSpace` wrapper
/// ([Section 14.1](14-vfs.md#virtual-filesystem-layer--addressspace-page-cache-mapping)), which is itself embedded
/// in each `Inode`. The hot-path dereference chain is
/// `inode → i_mapping.page_cache.pages → Page`.
///
/// `PageCache` is the canonical page **storage** backend (`XArray` with RCU-safe
/// lockless reads and per-slot updates). `AddressSpace` adds VFS-layer concerns:
/// writeback coordination (`writeback_lock`, `writeback_in_progress`), error
/// tracking (`wb_err: ErrSeq`), filesystem callbacks (`ops: &dyn AddressSpaceOps`),
/// and eviction flags.
///
/// **UmkaOS design**: Uses `XArray<PageEntry>` — a 64-way radix tree with per-slot
/// RCU publish semantics. This is the same data structure class Linux has used since
/// 5.0 (migrated from the older radix_tree). The choice follows the §3.1.13
/// collection usage policy, which mandates XArray for integer-keyed hot-path
/// mappings. The page cache is the single hottest integer-keyed data structure
/// in the kernel — every `read()`, `write()`, `mmap()` page fault, readahead,
/// writeback, and reclaim operation traverses it.
///
/// **Performance characteristics**:
/// - **Lookup**: O(1) — 3-4 radix levels for a 64-bit key space with 64-way fanout.
///   Each level is a single indexed load. Lock-free under `RcuReadGuard`.
/// - **Insert/remove**: O(1) per slot. Writers hold `xa_lock` (per-XArray, i.e.,
///   per-inode — no cross-file contention). Node allocation uses slab; no bulk
///   cloning or copying.
/// - **Range scan** (readahead): O(k) after O(1) start lookup, where k is the number
///   of pages in the range. Walking adjacent slots in a 64-way radix tree is
///   sequential node traversal — not k independent lookups.
/// - **Dirty marking**: Atomic flag update on the existing `PageEntry` — no tree
///   mutation, no lock acquisition on the read-modify-write fast path. Writers only
///   hold `xa_lock` for structural changes (insert/remove), not for flag updates.
/// - **Memory overhead**: One 64-entry radix node (~576 bytes) per populated range.
///   Sparse files only allocate nodes for pages that exist. Dense sequential files
///   achieve ~9 bytes/page overhead (one node per 64 pages).
///
/// **Thread safety**: Reads (page lookup on page fault, readahead) take an
/// `RcuReadGuard` — lock-free, wait-free on the read path. Writes (page insertion,
/// removal, eviction) hold `xa_lock`. The `xa_lock` is per-`PageCache` instance
/// (per-inode), so concurrent writes to different files never contend.
pub struct PageCache {
    /// XArray mapping from `PageIndex` → `PageEntry`. RCU-protected for lock-free
    /// reads. Writers hold the XArray's internal `xa_lock` for structural mutations.
    /// Per-inode (no cross-file contention).
    pub pages: XArray<PageEntry>,

    /// Total number of pages currently in the cache.
    /// Updated atomically on insert/remove.
    pub nr_pages: AtomicU64,

    /// Number of pages currently marked Dirty (awaiting writeback).
    /// Incremented by `mark_page_dirty()`. Decremented in writeback
    /// completion (step 11 of [Section 4.6](#writeback-subsystem)).
    ///
    /// **Sync contract**: Both `PageCache.nr_dirty` (per-inode) and
    /// `BdiWriteback.nr_dirty` (per-BDI, [Section 4.6](#writeback-subsystem)) are
    /// updated in the same code path (`__set_page_dirty()` /
    /// `end_page_writeback()`) but NOT atomically — the per-inode
    /// increment runs under xa_lock while the per-BDI increment runs
    /// outside it. Transient inconsistency (per-inode incremented but
    /// per-BDI not yet) is tolerable because `balance_dirty_pages()`
    /// uses smoothed bandwidth estimates, not exact instantaneous counts.
    /// The invariant is: never increment one without the other in the
    /// same code path (prevents systematic drift — a class of bugs Linux
    /// has experienced with `mapping->nrpages` vs `wb_stat(WB_DIRTIED)`).
    pub nr_dirty: AtomicU64,

    /// Nanoseconds-since-boot deadline: the earliest time at which a dirty page
    /// in this inode MUST be written back (30 seconds after first dirty mark,
    /// per the default `dirty_expire_centisecs = 3000`). Zero if no dirty pages.
    ///
    /// Set by `mark_inode_dirty()` as `now_ns + dirty_expire_centisecs * 10_000_000`
    /// when the first page is dirtied. The kupdate algorithm
    /// ([Section 4.6](#writeback-subsystem--kupdate-algorithm)) checks `now_ns >= deadline`
    /// to determine whether this inode's dirty pages have aged past the expiration
    /// threshold and should be scheduled for writeback. This is the forward-looking
    /// complement of `Inode.dirtied_when` (which records the backward-looking
    /// timestamp of when dirtying occurred). Both are set at the same time;
    /// `writeback_deadline_ns = dirtied_when + dirty_expire_centisecs * 10_000_000`.
    pub writeback_deadline_ns: AtomicU64,

    // Writeback coordination (writeback_in_progress sentinel and writeback_lock)
    // is owned by the VFS AddressSpace wrapper
    // ([Section 14.1](14-vfs.md#virtual-filesystem-layer--addressspace-page-cache-mapping)).
    // PageCache is purely the page storage backend; all writeback state lives in
    // AddressSpace to keep a single source of truth for writeback serialisation.
}

impl PageCache {
    /// Returns the backing device info for this page cache's owning inode.
    /// Navigates: `self → AddressSpace (container) → Inode → SuperBlock → s_bdi`.
    /// Used by `set_page_dirty()` to add the inode to the BDI's dirty list.
    pub fn bdi(&self) -> &BackingDevInfo { /* container_of → inode → sb.s_bdi */ }

    /// Returns a reference to the owning inode.
    /// Navigates: `self → AddressSpace (container) → Inode` via `container_of`.
    /// Used by `set_page_dirty()` and writeback scheduling.
    pub fn inode(&self) -> &Inode { /* container_of(self, AddressSpace, page_cache).host */ }
}

/// An entry in the page cache: a reference to a physical page frame plus state flags.
pub struct PageEntry {
    /// Reference to the backing page frame. The page may be in CpuRam, compressed
    /// (in the CompressPool), or being migrated — the PageLocation tracks this.
    /// PageRef avoids atomic refcount on every lookup. RCU protects the page
    /// reference lifetime. Explicit get/put calls manage the refcount on
    /// insert/eviction paths.
    pub page: PageRef,
    /// Current state of this page relative to its backing file.
    pub flags: PageFlags,
    /// Content generation counter. Incremented on every page content mutation:
    /// - `write()` / `write_iter()` completing into this page (via `generic_perform_write`).
    /// - Writeback completion clearing DIRTY (generation advances so DSM peers
    ///   can distinguish pre-writeback from post-writeback content).
    /// - DSM invalidation clearing UPTODATE (generation advances so stale RDMA
    ///   probes are detected — see [Section 6.11](06-dsm.md#dsm-distributed-page-cache--race-condition-handling)).
    /// - Page migration completion (content moved to new physical frame).
    ///
    /// **Primary consumer**: The DSM cooperative cache probe protocol uses
    /// `generation` to detect stale page data. When a responder RDMA-Writes a
    /// page to a requester, it includes the page's current `generation` in the
    /// `DsmProbeResponse.page_generation` field. The requester compares this
    /// against the `generation` of the `PageEntry` after insertion; a mismatch
    /// means a concurrent write invalidated the data between the RDMA Write and
    /// the response, and the requester must discard and re-fetch.
    ///
    /// **Cost**: one `AtomicU64::fetch_add(1, Release)` per write completion.
    /// This is on the warm path (write I/O completion), not the read hot path.
    /// Reads of `generation` use `Acquire` ordering to pair with the writer's
    /// `Release`.
    pub generation: AtomicU64,
}

bitflags! {
    /// State flags for a cached page. Inspected by the page fault handler,
    /// writeback thread, and reclaim path.
    pub struct PageFlags: u32 {
        /// Page data is current (consistent with backing storage or anon zero-page).
        /// Cleared when the page is evicted; set after readahead or writeback completes.
        const UPTODATE    = 0x0001;
        /// Page has been written to since last writeback. Drives writeback scheduling.
        const DIRTY       = 0x0002;
        /// Writeback to storage is currently in progress for this page.
        /// Set by the writeback thread; cleared on I/O completion.
        const WRITEBACK   = 0x0004;
        /// Page is locked (I/O in progress, compaction, or migration).
        /// Processes waiting for unlock sleep on a per-page wait queue.
        const LOCKED      = 0x0008;
        /// Page has been accessed since it was last placed at the tail of the LRU.
        /// Set by the page fault handler; cleared by the reclaim scanner.
        const ACCESSED    = 0x0010;
        /// Page is mapped into at least one process page table (has live PTEs).
        /// Used by the reclaim path to skip TLB shootdown for unmapped pages.
        const MAPPED      = 0x0020;
        /// Page is in the swap cache (backed by a swap entry, not a file).
        const SWAPCACHE   = 0x0040;
        /// Page belongs to a Tier 1 driver's DMA region (must not be reclaimed).
        const DMA_PINNED  = 0x0080;
        /// Page is under readahead (being speculatively loaded by the readahead engine).
        const READAHEAD   = 0x0100;
        /// Page is being migrated between memory tiers (DRAM → CXL, CXL → compressed).
        /// Set by tiering migration thread; cleared on completion. Page fault handler
        /// blocks on the page's wait queue until migration completes.
        const TIER_MIGRATING    = 0x0200;
        /// An I/O error occurred during writeback for this page. Set by the bio
        /// completion callback; consumed by `end_page_writeback()` which propagates
        /// the error to `AddressSpace.wb_err` (see writeback error chain below).
        const ERROR             = 0x0400;
        /// Page is part of the DSM cooperative cache. Reclaim must
        /// coordinate with DSM coherence protocol before evicting.
        /// DSM generation field in `PageEntry.generation`: see
        /// [Section 6.11](06-dsm.md#dsm-distributed-page-cache) for increment semantics and
        /// cooperative cache probe protocol.
        const DSM               = 0x0800;
        /// Page has suffered 3+ consecutive writeback failures and is permanently
        /// excluded from writeback. Set by the writeback engine after the retry
        /// limit is exhausted ([Section 4.6](#writeback-subsystem)). The writeback scanner
        /// skips pages with this flag: `if page.flags.contains(PERMANENT_ERROR)
        /// { continue; }` in the per-inode dirty page scan. An FMA event
        /// `WritePermanentError { pfn, inode }` is emitted when the flag is set.
        /// The page remains dirty (PG_DIRTY set) so that fsync() returns -EIO.
        const PERMANENT_ERROR   = 0x1000;
    }
}

/// File page offset: byte_offset / PAGE_SIZE. Used as XArray key for O(1) lookup.
pub type PageIndex = u64;

set_page_dirty() — mark a page as dirty in the page cache:

/// Mark a page as dirty. Called from `generic_perform_write()`, `write_iter()`,
/// and `page_mkwrite()` (CoW fault on a mapped file page).
///
/// Returns `true` if the page was not already dirty (i.e., this call actually
/// transitioned the page from clean to dirty).
///
/// This function is the single canonical entry point for dirtying a cached page.
/// It is referenced as `__set_page_dirty()` in the writeback subsystem
/// ([Section 4.6](#writeback-subsystem)) where it triggers `balance_dirty_pages()`.
///
/// # Hot path — no allocation, no sleeping locks.
///
/// **Atomicity invariant**: The page DIRTY flag and the XArray DIRTY tag must
/// be set together under `xa_lock`. Without this, the writeback scanner
/// (`xa_for_each_tagged(XA_TAG_DIRTY)`) can miss a page that has DIRTY set
/// on the page flags but whose XArray tag has not yet been applied — the page
/// would remain dirty indefinitely until a future `sync()` forces a full scan.
/// This matches the locking discipline of Linux `__set_page_dirty_nobuffers()`.
pub fn set_page_dirty(page: &PageRef, cache: &PageCache) -> bool {
    // Acquire the XArray lock to make flag+tag update atomic w.r.t.
    // writeback scanner reads. The lock is a SpinLock (no sleeping),
    // fine for this hot path.
    let _guard = cache.pages.xa_lock();

    // get_locked(): look up the XArray entry at the given index while the
    // xa_lock is held. Returns the entry value directly (no RCU needed since
    // the lock provides mutual exclusion). Equivalent to xa_load() under lock.
    let entry = cache.pages.get_locked(page.index());
    let old = entry.flags.fetch_or(PageFlags::DIRTY.bits(), Ordering::AcqRel);
    if old & PageFlags::DIRTY.bits() != 0 {
        return false; // Already dirty — no state change.
    }

    // Increment per-inode dirty page counter (pairs with writeback decrement).
    cache.nr_dirty.fetch_add(1, Ordering::Relaxed);

    // Tag the XArray slot so the writeback scanner can find dirty pages
    // via xa_for_each_tagged() without scanning the entire tree.
    // Both the flag set (above) and this tag set happen under xa_lock,
    // ensuring the writeback scanner never observes a half-dirty state.
    cache.pages.set_tag_locked(page.index(), XA_TAG_DIRTY);

    // xa_lock dropped here (_guard). The per-BDI counter and
    // bdi_dirty_inode() are both outside the lock — they use independent
    // atomics and do not need the per-inode xa_lock for correctness.
    // Moving them outside reduces lock hold time on the hot dirty path.
    drop(_guard);

    // Increment per-BDI dirty page counter (pairs with writeback decrement
    // in writeback_end_io() / Tier 0 step 11). Both counters track the same
    // events at different aggregation levels and MUST be updated together to
    // prevent drift — see the sync contract on PageCache.nr_dirty above.
    cache.bdi().wb.nr_dirty.fetch_add(1, Ordering::Relaxed);

    // Add the owning inode to the BDI's dirty inode list if not already
    // present. The BDI writeback thread iterates this list to schedule
    // writeback for all inodes with dirty pages.
    // bdi_dirty_inode() is idempotent — calling it when the inode is
    // already on the list is a no-op (checks I_DIRTY_PAGES flag).
    bdi_dirty_inode(cache.bdi(), cache.inode());

    // Dirty extent reservation: if the filesystem implements dirty extent
    // tracking ([Section 14.1](14-vfs.md#virtual-filesystem-layer--copy-on-write-and-redirect-on-write-infrastructure)),
    // call `vfs_dirty_extent_reserve()` to reserve metadata space for the
    // dirty page. This is called from the first-dirty path (when DIRTY
    // transitions from 0 to 1), not on already-dirty pages (which return
    // false above). The dirty extent protocol ensures that the VFS crash
    // recovery path knows which pages have uncommitted dirty data.
    // Call the filesystem's dirty_extent callback to reserve metadata
    // space for crash recovery journaling. page.index() is the page
    // index; the filesystem converts to byte offset internally.
    // offset = page.index() * PAGE_SIZE, len = PAGE_SIZE.
    let offset = page.index() * PAGE_SIZE as u64;
    cache.ops.dirty_extent(cache, offset, PAGE_SIZE as u64);

    true
}

Writeback error propagation chain (end-to-end):

Block driver sets bio.status to negative errno (e.g., -EIO).
Bio completion callback (IRQ/softirq context — see Section 3.8, Section 15.2):
Sets PageFlags::ERROR on the page (atomic bitfield update — permitted in IRQ).
Schedules deferred work on the blk-io workqueue for page cache state updates. Bio completion callbacks MUST NOT call page cache methods directly (they may acquire sleeping locks); all steps below run in workqueue context.
Deferred writeback completion (blk-io workqueue, process context):
end_page_writeback() clears PageFlags::WRITEBACK.
Wakes any task waiting on the page's wait queue.
Observes ERROR on the page.
Increments AddressSpace.wb_err generation counter via ErrSeq::set_err(errno) (Section 14.1).
Sets AS_EIO or AS_ENOSPC in AddressSpace.flags.
fsync() path:
Compares file.f_wb_err (snapshot from open time) with mapping.wb_err.
If different: returns the error to userspace, updates file.f_wb_err.
If same: no new errors, returns Ok(0).

See also: fsync error reporting (Section 14.4).

4.4.1 Generational LRU Page Reclaim¶

UmkaOS uses a generational LRU design, inspired by Linux's Multi-Gen LRU (MGLRU, merged in Linux 6.1) but redesigned from first principles. The goals are:

Accurate age estimation without per-page LRU movement (no lock-per-access).
Separate aging policy for file-backed and anonymous pages.
Per-cgroup reclaim priority — cgroup pressure is resolved before global reclaim.
Efficient operation under NUMA with per-node generation lists.
Refault distance tracking to protect frequently-accessed file pages.

Data structures:

/// Number of LRU generations maintained per zone (tunable, default 4).
/// Pages age through generations 0..N_GENS-1. Generation 0 is the newest
/// (just accessed), generation N_GENS-1 is the oldest (eviction candidate).
pub const N_GENS: usize = 4;

/// Doubly-linked intrusive list of `Page` structs within an LRU generation.
/// Uses `Page.lru: IntrusiveListNode` for linkage. O(1) insert/remove,
/// O(n) iteration. Protected by the zone's LRU lock.
pub type PageList = IntrusiveList<Page>;

/// Splits each LRU generation into anonymous and file-backed page pools,
/// matching Linux MGLRU's `lrugen` layout. Anon pages and file pages have
/// different cost models (swap I/O vs. discard/re-read), so they are
/// reclaimed with separate policies.
pub struct LruGeneration {
    /// File-backed pages (page cache, mmap MAP_SHARED).
    pub file: PageList,
    /// Anonymous pages (heap, stack, MAP_PRIVATE after CoW).
    pub anon: PageList,
}

/// Per-NUMA-zone generational LRU state.
///
/// # Design rationale
///
/// Linux's original LRU (active/inactive two-list per zone) has well-known
/// problems:
/// - thrashing: a single large sequential read evicts the entire working set
/// - inaccurate aging: every page access requires taking the LRU lock to move
///   the page to the head of the active list
/// - poor scan efficiency: reclaim must scan many active pages before finding
///   inactive candidates
///
/// MGLRU fixes these by coarsening page age into generations (not exact
/// timestamps). UmkaOS adopts the same approach but adds:
/// - `cgroup_pressure` integration (cgroup-first reclaim)
/// - refault distance tracking per-cgroup (Section 4.4.1)
/// - generation advancement driven by mm_walk (page table scan), not
///   lru_add/lru_deactivate calls
pub struct ZoneLru {
    /// Generational lists, indexed [generation_index][anon/file].
    /// Oldest generation is at `(oldest_gen % N_GENS)`.
    pub generations: [LruGeneration; N_GENS],
    /// Index of the oldest generation (mod N_GENS). Reclaim starts here.
    pub oldest_gen: u64,
    /// Index of the youngest generation. New pages start here.
    pub youngest_gen: u64,
    /// Protects generation list manipulation. Not held during mm_walk.
    pub lock: SpinLock<()>,
    /// Per-CPU page drain buffers: pages are added to per-CPU buffers and
    /// drained to the zone LRU under lock in batches (default: 32 pages).
    /// Eliminates per-page lock acquisitions on the hot path.
    pub percpu_drain: PerCpu<LruDrainBuffer>,
    /// Number of pages currently undergoing writeback (includes DSM writeback).
    /// Used by the reclaim scanner to avoid evicting pages mid-writeback and by
    /// the DSM coherence layer to track outstanding remote flushes.
    pub nr_pages_writeback_pending: AtomicU64,
    /// Refault tracking shadow entries (Section 4.4.1 — refault distance).
    pub shadow_entries: XArray<ShadowEntry>,
}

/// Per-CPU drain buffer to batch LRU insertions (avoids per-page LRU lock).
pub struct LruDrainBuffer {
    pub add_file: [PagePtr; 32],
    pub add_anon: [PagePtr; 32],
    pub file_count: u8,
    pub anon_count: u8,
}

/// Shadow entry stored in the page cache radix tree slot after a page is
/// evicted. Encodes the eviction generation so that a subsequent fault can
/// compute the refault distance and decide whether to insert the page into a
/// younger generation (protecting it from immediate re-eviction).
///
/// Encoded as a tagged pointer in the XArray: low 2 bits = tag `0b11`
/// (distinguishes shadow from live page pointer), bits 2..34 = generation
/// counter at eviction time.
///
/// **64-bit note**: On 64-bit architectures, bits 34-63 are reserved for
/// future use (e.g., XArray internal tags, additional eviction metadata
/// such as per-tier refault tracking or MGLRU generation type hints).
/// The 32-bit generation counter at bits 2..34 wraps after 2^32 MGLRU
/// generation advancements (not per-slot evictions). MGLRU advances
/// max_seq at most once per aging scan (~seconds to minutes under load).
/// At 1 advance/second, 2^32 generations provide ~136 years before wrap —
/// well beyond the 50-year uptime target.
///
/// **32-bit architecture note (ARMv7, PPC32)**: On 32-bit architectures,
/// `usize` is 32 bits. The shadow entry encoding is adjusted: bits 2..32
/// carry 30 bits of generation counter (not 32). This provides ~1 billion
/// (2^30) unique generations per XArray slot. At a worst-case refault rate
/// of 10,000 evictions/sec per slot, 30 bits provide ~29.8 hours before
/// wrap — sufficient for refault distance tracking, which only compares
/// generation deltas within a single MGLRU cycle (~seconds to minutes).
/// The `ShadowEntry` type is `usize`-sized to match XArray's pointer-sized
/// slot entries — `usize` on both 32-bit (4 bytes) and 64-bit (8 bytes) targets.
pub struct ShadowEntry(usize);

Generation advancement — mm_walk-based:

Pages do not move in the LRU list on each access. Instead, access bits in the page table (Accessed/AF/Reference flag, per-architecture) are examined by a periodic mm_walk scan:

Generation advancement algorithm (runs as kswapd wakeup or background thread):

1. For each zone needing reclaim:
   a. Walk all PTEs of processes mapped to this zone (using mm_walk,
      which visits pagetable leaves without holding any mm lock beyond
      mmap_read_lock for the duration of each VMA).
   b. For each PTE with Accessed=1:
      - Clear the Accessed bit (set to 0 for future aging).
      - Promote the referenced page: move it from its current generation
        to `youngest_gen` (generation 0 equivalent).
      - For file pages: also check the page cache Accessed flag (set by
        read(2)/mmap fault). Clear it.
   c. Pages not promoted during this walk remain in their current generation.
      After N_GENS walk cycles without promotion, a page reaches generation
      N_GENS-1 and becomes an eviction candidate.

2. Generation counter: after each complete zone walk, increment `youngest_gen`.
   Wrap: (youngest_gen % N_GENS). The oldest generation is automatically
   advanced: `oldest_gen = youngest_gen - (N_GENS - 1)`.

3. Per-NUMA locality: mm_walk for a zone only visits PTEs of tasks whose
   memory policy prefers that node. Cross-NUMA promotions are allowed but
   penalised (the page is promoted to generation 1, not generation 0).

Reclaim algorithm:

kswapd page reclaim (per zone, triggered when free pages < watermark_low):

1. Cgroup-first scan: if any cgroup has crossed memory.high (soft limit):
   a. Reclaim from that cgroup's pages first, targeting its per-cgroup LRU
      subset (CgroupLru, see below) before touching global lists.
   b. If cgroup is at memory.max (hard limit): trigger cgroup OOM
      ([Section 4.5](#oom-killer--oom-killer-policy)) before global reclaim.

2. Scan oldest generation (generation N_GENS-1):
   a. File pages first: these are cheaper to evict (no swap I/O — just
      discard if clean, writeback if dirty).
      - Clean file pages: immediately freed. Shadow entry written to
        radix tree for refault tracking.
      - Dirty file pages: queued for writeback. Recounted as clean once
        writeback completes.
   b. Anon pages next: these require swap I/O.
      - Compute swap slot, issue async write, mark page SwapCache.
      - Page remains mapped (PTEs updated to swap entry) until write
        completes, then page freed.

3. If oldest generation exhausted before watermark_high is reached:
   a. Age: advance oldest_gen by 1 (drop the oldest generation's lists).
   b. Resume from new oldest generation.
   c. If all generations exhausted without reaching watermark_high:
      trigger OOM resolution ([Section 4.5](#oom-killer--oom-killer-policy)).

4. Per-CPU drain: before scanning, drain all percpu_drain buffers to
   their zone LRU lists (batch under lock, 32 pages at a time).

5. Skip non-reclaimable pages: pages with DMA_PINNED flag (IOMMU-mapped
   by a Tier 1/2 driver or VFIO passthrough) are skipped — they cannot be
   reclaimed until the IOMMU mapping is torn down by the owning driver or VM.

6. DSM pages: pages with PageFlags::DSM are part of the Distributed Shared
   Memory subsystem. Before evicting a DSM page, call
   `dsm_subscriber.on_eviction_candidate(&page)` — if the subscriber returns
   `EvictionDecision::KeepRemote`, skip this page (DSM wants to transfer
   ownership to another node first; the page will be reclaimable after the
   transfer completes). If `EvictionDecision::AllowEvict`, proceed to
   `dsm_evict_page()` which notifies the coherence protocol to invalidate
   remote copies and update the directory.
   See [Section 6.11](06-dsm.md#dsm-distributed-page-cache).

7. DSM subscriber notification: after step 6's coherence protocol completes
   but before the page frame is freed, call
   `dsm_subscriber_on_eviction(page)`. This callback notifies the local
   DSM subscriber ([Section 6.12](06-dsm.md#dsm-subscriber-controlled-caching--subscriber-trait)) so
   it can:
   (a) Update the page location tracker (directory entry) to reflect that
       this node no longer holds the page.
   (b) Optionally transfer ownership to another node that has expressed
       interest (via the DSM interest bitmap), avoiding a full cache miss
       on the next remote access. The transfer is asynchronous — it is
       queued to the DSM writeback workqueue and does not block reclaim.
   (c) If no other node wants the page, mark the directory entry as
       "evicted" so future lookups trigger a storage read rather than a
       stale remote fetch.
   The callback is O(1): a single atomic update to the directory entry
   plus an optional RDMA write to the transfer target. It does not block
   on remote acknowledgment — the subscriber uses fire-and-forget
   semantics for eviction notifications.

Refault distance tracking:

When a page is evicted, a shadow entry is stored in the page cache radix tree at the same index. The shadow entry encodes the generation counter at eviction time. When the same page faults back in:

refault_distance = (current_generation - shadow_generation)

if refault_distance < N_GENS / 2:
    # page was re-accessed quickly — it has a hot working set entry
    # insert into youngest generation (protect from immediate re-eviction)
    insert_at_generation(page, youngest_gen)
    # Short refault: increment cgroup refault credit, which reduces
    # future reclaim aggressiveness for that cgroup.
    cgroup_increment_refault_credit(page.memcg)
else:
    # page was cold before re-fault — treat as new (one generation back)
    insert_at_generation(page, youngest_gen - 1)

Shadow entries are evicted from the XArray by a background cleaner when memory pressure is low, bounded by a maximum of nr_file_pages / 8 shadow entries per zone (prevents shadow entries from consuming significant memory).

Better than Linux LRU in these specific ways:

Linux LRU behaviour	UmkaOS improvement
Per-page LRU lock on each access (lru_add_drain)	mm_walk batch scan — no per-access lock
Two-list (active/inactive) — binary aging	N_GENS generations — finer aging granularity
Global reclaim first (cgroup OOM is reactive)	Cgroup-first reclaim — cgroup OOM before global pressure
Shadow entries lost after inode reclaim	XArray-based shadow tracking survives inode cache pressure
No penalty for cross-NUMA promotions	Cross-NUMA pages promoted to generation 1, not 0

Per-cgroup MGLRU (CgroupLru):

Each memory cgroup maintains its own full MGLRU generation structure, matching Linux's per-memcg lru_gen_folio design. MGLRU is the ONLY LRU implementation in UmkaOS — there is no legacy active/inactive dual-list fallback. This full commitment eliminates the dual-path complexity that prevents Linux from making MGLRU the default (as of 6.12+, Linux still supports both).

Why MGLRU-only is correct for UmkaOS: - Per-page access cost: MGLRU = ZERO (hardware PTE Accessed bit only). Active/inactive lists require an atomic flag update + deferred list_move under lock. MGLRU is faster on the hot path. - Under memory pressure: MGLRU uses page table walks (good cache locality, bloom-filter pruned) instead of rmap walks (pointer chasing, poor locality). O(referenced_pages) vs O(total_pages) scan cost. - Eviction quality: 4-generation gradient vs binary hot/cold. Benchmarks (Yu Zhao, Google fleet): +13-47% database throughput, -40% kswapd CPU, -85% low-memory kills. - UmkaOS targets containers at ~80% memory utilization — pressure is the NORMAL state, not the exception. MGLRU's advantages are maximized here. - Memory overhead: ~500 bytes/cgroup/node (acceptable for production servers).

/// Number of MGLRU generations. Must be power of 2 for masking.
/// Linux uses MAX_NR_GENS = 4.
pub const N_GENS: usize = 4;
/// Number of page types tracked per generation.
pub const ANON_AND_FILE: usize = 2;  // 0 = anon, 1 = file

/// Maximum number of memory zones. Matches Linux `include/linux/mmzone.h`
/// zone ordering and the ZoneType enum defined in [Section 4.2](#physical-memory-allocator):
/// DMA (0), DMA32 (1), Normal (2), HighMem (3), Movable (4).
/// HighMem is only populated on 32-bit targets (ARMv7, PPC32) where physical
/// memory exceeds the kernel direct-map region. On 64-bit targets, HighMem is
/// empty. The array size is always MAX_ZONES for uniform indexing.
pub const MAX_ZONES: usize = 5;

/// Per-memcg MGLRU tracking. One instance per `(memcg, node)` pair.
/// Linked from the `MemCgroup` via `lru_gen: ArrayVec<CgroupLru, NUMA_NODES_STACK_CAP>`
/// ([Section 4.2](#physical-memory-allocator)).
///
/// Matches Linux's `struct lru_gen_folio` in `include/linux/mmzone.h`.
/// Each memcg maintains the full N_GENS generation structure independently,
/// enabling O(1) generation aging without cross-cgroup lock acquisition.
///
/// **Replaceability**: The struct layout (field types, sizes, offsets) is
/// Nucleus — frozen data definition. The aging algorithm (generation
/// advancement, bloom filter construction, page table walk strategy) is
/// Evolvable — can be hot-swapped to improve aging heuristics without reboot.
pub struct CgroupLru {
    /// Highest completed aging generation sequence number. Monotonically
    /// increasing; used modulo N_GENS to index into the `folios` array.
    pub max_seq: AtomicU64,
    /// Lowest generation sequence number still in use, per type (anon, file).
    /// `min_seq[type] <= max_seq` always. Pages in generation `min_seq` are
    /// the oldest and are eviction candidates.
    pub min_seq: [AtomicU64; ANON_AND_FILE],
    /// Timestamps (jiffies) of the last aging pass for each generation.
    /// Used by the aging algorithm to determine which generations are
    /// stale and need promotion scanning.
    pub timestamps: [AtomicU64; N_GENS],
    /// Per-generation, per-type, per-zone page lists. Pages are linked via
    /// their `Page.lru` intrusive list node. A page belongs to exactly one
    /// generation list (not dual-linked into both global and cgroup lists).
    ///
    /// Index: `folios[gen % N_GENS][type][zone]` where type is 0=anon, 1=file.
    pub folios: [[[IntrusiveList<Page>; MAX_ZONES]; ANON_AND_FILE]; N_GENS],
    /// Page counts per generation per type per zone. Used for proportional
    /// reclaim decisions without walking the lists.
    pub nr_pages: [[[AtomicI64; MAX_ZONES]; ANON_AND_FILE]; N_GENS],
    /// Per-tier refault tracking. Tiers represent access recency within a
    /// generation (derived from PTE access patterns). Used to distinguish
    /// pages that were accessed via page table walks from those accessed
    /// only via direct I/O or prefetch.
    pub refaulted: [[AtomicU64; N_GENS]; ANON_AND_FILE],
    /// Refault credit: incremented on short-distance refaults, decremented
    /// on aging passes. Positive credit makes kswapd less aggressive toward
    /// this cgroup (its pages are being re-accessed quickly).
    pub refault_credit: AtomicI64,
    /// Lock protecting the `folios` list mutations. Per-cgroup per-node
    /// granularity ensures generation aging for one cgroup does not contend
    /// with another cgroup's reclaim. IRQ-safe because reclaim may run in
    /// softirq context ([Section 3.8](03-concurrency.md#interrupt-handling--softirq-deferred-interrupt-processing)).
    pub lock: SpinLock<()>,
}

When memory.high is crossed for a cgroup, kswapd targets that cgroup's oldest generation (min_seq) first, reclaiming its oldest pages before touching other cgroups. This provides proportional reclaim: cgroups exceeding their soft limit are reclaimed more aggressively, while cgroups within budget are protected. The per-cgroup generation structure enables precise aging — promoting a page within a cgroup's MGLRU requires only the cgroup's own lock, not a global LRU lock or a cross-cgroup lock.

4.4.1.1 Page Reclaim Replaceability — 50-Year Uptime Design¶

The page reclaim subsystem follows the same data/policy state-spill pattern as the physical memory allocator (PhysAllocPolicy, Section 4.2), VMM (VmmPolicy, Section 4.8), and capability system (CapPolicy, Section 9.1).

Non-replaceable data (struct layout and manipulation code frozen, verified): - ZoneLru — generation lists, oldest/youngest counters, per-CPU drain buffers, shadow entry XArray. Values change continuously (every reclaim cycle). - CgroupLru — per-cgroup MGLRU generation lists (folios, nr_pages, max_seq, min_seq), refault tracking, per-tier refault counters. - LruGeneration — file + anon page lists within each generation. - ShadowEntry — eviction-time generation encoding in XArray slots. - LruDrainBuffer — per-CPU batching buffers for lock-free LRU insertion.

Replaceable policy (stateless algorithm dispatcher, swapped via AtomicPtr):

/// Read-only snapshot of reclaim pressure metrics, passed to
/// PageReclaimPolicy methods. All fields are computed from non-replaceable
/// data before the policy call — the policy cannot mutate them.
pub struct ReclaimPressure {
    /// Free pages in this zone (current).
    pub nr_free: u64,
    /// Zone watermarks (from PhysAllocPolicy::recalc_watermarks()).
    pub wmark_min: u64,
    pub wmark_low: u64,
    pub wmark_high: u64,
    /// Whether this is a direct reclaim (blocking allocation) or
    /// background reclaim (kswapd).
    pub is_direct: bool,
    /// GFP flags of the allocation that triggered reclaim (if direct).
    pub gfp: GfpFlags,
    /// Pages scanned in this reclaim cycle so far.
    pub pages_scanned: u64,
    /// Pages reclaimed in this reclaim cycle so far.
    pub pages_reclaimed: u64,
    /// Allocation order that triggered reclaim (for compaction decisions).
    pub order: u32,
}

/// Page reclaim policy — replaceable decision layer for the generational LRU
/// page replacement subsystem. All mutable state (ZoneLru generation lists,
/// CgroupLru MGLRU generation lists, shadow entries, drain buffers) is owned
/// by the non-replaceable data structures defined above.
///
/// **State spill**: PageReclaimPolicy is stateless — all inputs are passed as
/// read-only references to the non-replaceable data. Replacement swaps only
/// the global policy pointer (~1 us stop-the-world). No state export/import
/// needed.
///
/// **When called**: kswapd wakeup, direct reclaim entry, mm_walk scheduling,
/// refault insertion, cgroup reclaim. Never called on the page access hot
/// path (page access only sets the hardware Accessed bit — no kernel code
/// runs).
pub trait PageReclaimPolicy: Send + Sync {
    /// Determine the anon-to-file reclaim ratio for this zone/cgroup.
    /// Returns the fraction of reclaim effort directed at anonymous pages
    /// (0 = file-only, 100 = anon-only). Linux calls this "swappiness."
    ///
    /// Default: 60 (same as Linux default swappiness — modest swap pressure).
    /// Replacement: workload-adaptive ratio based on refault rates, swap
    /// bandwidth utilization, or ML-predicted working set composition
    /// ([Section 23.1](23-ml-policy.md#aiml-policy-framework-closed-loop-kernel-intelligence)).
    ///
    /// Called once per reclaim cycle per zone (warm path).
    fn anon_file_ratio(
        &self,
        zone: &ZoneLru,
        cgroup: Option<&CgroupLru>,
        pressure: &ReclaimPressure,
    ) -> u8; // 0..=100

    /// Decide how many PTEs to scan in this mm_walk aging pass.
    /// Controls the CPU overhead vs. aging accuracy tradeoff.
    ///
    /// Default: min(zone_total_pages / 4, 32768) PTEs per pass.
    /// Replacement: adaptive based on pressure level and CPU utilization.
    ///
    /// Called once per mm_walk invocation (warm path, kswapd or direct
    /// reclaim).
    fn scan_budget(
        &self,
        zone: &ZoneLru,
        pressure: &ReclaimPressure,
    ) -> u64; // max PTEs to scan

    /// Compute the refault distance threshold for page promotion decisions.
    /// Pages re-faulting within this many generations are considered "hot"
    /// and promoted to the youngest generation; beyond this threshold they
    /// are inserted one generation back.
    ///
    /// Default: N_GENS / 2 (2 generations with default N_GENS=4).
    /// Replacement: adaptive — tighten during memory pressure (protect only
    /// truly hot pages), loosen when memory is plentiful.
    ///
    /// Called once per refault (warm path — refaults are already expensive).
    fn refault_threshold(
        &self,
        zone: &ZoneLru,
        pressure: &ReclaimPressure,
    ) -> u64; // generation distance

    /// Select the generation to assign to a cross-NUMA promoted page.
    /// Cross-NUMA faults bring in a page from a remote node; the penalty
    /// generation controls how quickly it can be evicted if not re-accessed.
    ///
    /// Default: youngest_gen - 1 (one generation back from newest).
    /// Replacement: ML-based NUMA locality prediction — pages with strong
    /// remote affinity might get youngest_gen (no penalty).
    ///
    /// Called once per cross-NUMA fault (warm path).
    fn cross_numa_promotion_gen(
        &self,
        zone: &ZoneLru,
        src_nid: u32,
        dst_nid: u32,
    ) -> u64; // target generation index

    /// Maximum shadow entries to retain per zone. Shadow entries track
    /// evicted pages for refault distance computation. Too few = lost
    /// refault tracking. Too many = memory waste.
    ///
    /// Default: nr_file_pages / 8.
    /// Replacement: adaptive based on refault rates — zones with high
    /// refault benefit from larger shadow pools.
    ///
    /// Called by background shadow entry cleaner (cold path).
    fn shadow_entry_budget(
        &self,
        zone: &ZoneLru,
        nr_file_pages: u64,
    ) -> u64;

    /// Per-cgroup reclaim aggressiveness. Returns a priority multiplier
    /// (100 = normal, >100 = more aggressive, <100 = less aggressive).
    /// Applied to the number of pages scanned from this cgroup's LRU lists.
    ///
    /// Default: 100 if cgroup below memory.high; 200 if at memory.high;
    /// 400 if at memory.max. Modulated by cgroup refault_credit.
    /// Replacement: workload-aware — latency-sensitive cgroups get lower
    /// aggressiveness, batch cgroups get higher.
    ///
    /// Called once per cgroup per reclaim cycle (warm path).
    fn cgroup_reclaim_priority(
        &self,
        cgroup: &CgroupLru,
        memcg_usage: u64,
        memcg_high: u64,
        memcg_max: u64,
    ) -> u32; // priority multiplier (100 = normal)

    /// Decide whether to trigger a generation advancement (aging) pass
    /// for a zone. Called on kswapd wakeup and direct reclaim entry.
    ///
    /// Default: advance if oldest generation has fewer eviction candidates
    /// than WMARK_LOW or if no candidates exist (all pages in youngest gen).
    /// Replacement: proactive aging based on allocation rate prediction.
    ///
    /// Called once per reclaim entry per zone (warm path).
    fn should_advance_generation(
        &self,
        zone: &ZoneLru,
        pressure: &ReclaimPressure,
    ) -> bool;
}

Global page reclaim policy instance:

/// Global page reclaim policy. Replaceable via live evolution.
/// Used by kswapd and direct reclaim paths — never on the page access hot
/// path. One indirect call per policy decision during reclaim (~5-10 ns
/// each on a reclaim cycle that takes microseconds per page evicted).
pub static RECLAIM_POLICY: RcuCell<&'static dyn PageReclaimPolicy> =
    RcuCell::new(&DefaultPageReclaimPolicy);

4.4.1.2 Performance Budget — Page Reclaim Policy Indirection Cost¶

Path	Frequency	PageReclaimPolicy called?	Cost
Page access (read/write)	Per memory access	No	0 (hardware Accessed bit only)
Page fault	Per fault (~1-2 us)	No (VmmPolicy handles faults)	0
kswapd wakeup	Per watermark breach (~ms cadence)	Yes: `should_advance_generation()`, `anon_file_ratio()`, `scan_budget()`	~15-30 ns (3 indirect calls)
mm_walk aging pass	Per zone per reclaim cycle	Yes: `scan_budget()`	~5 ns
Eviction candidate selection	Per page evicted	No (generation membership is data)	0
Refault page insertion	Per refault	Yes: `refault_threshold()`	~5 ns
Cross-NUMA page promotion	Per cross-NUMA fault	Yes: `cross_numa_promotion_gen()`	~5 ns
Cgroup reclaim scan	Per cgroup per reclaim cycle	Yes: `cgroup_reclaim_priority()`	~5 ns
Shadow entry cleanup	Background periodic	Yes: `shadow_entry_budget()`	Cold path

Key invariant: PageReclaimPolicy is never called on the page access hot path. Page access is pure hardware — the MMU sets the Accessed/AF bit with no kernel involvement. The policy is called only on reclaim paths (kswapd, direct reclaim), which are inherently warm-path operations spending microseconds per page evicted. The 5-30 ns of indirect call overhead is <0.1% of a typical reclaim cycle. The per-page eviction decision itself (which page to evict from the oldest generation) is not a policy call — it is a mechanical scan of the generation list, owned by non-replaceable code.

OOM Killer and Process Memory Hibernation have been moved to Section 4.5. OOM killer policy, victim selection, /dev/oom notification, process memory hibernation with MADV_DISCARDABLE / MADV_CRITICAL hints, and the cgroupfs hibernation interface are specified there.

Writeback Subsystem has been moved to Section 4.6. Per-device writeback model, BackingDevInfo, BdiWriteback, dirty page thresholds, balance_dirty_pages() throttling, inode dirty state machine, dirty page enumeration, congestion backpressure, kupdate algorithm, and the Tier 0/Tier 1 writeback domain crossing protocol are specified there.

4.4.1.3 `filemap_fault()` — Generic Page Cache Fault Handler¶

The default VmOperations::fault() implementation for file-backed VMAs. Called by handle_file_fault() (Section 4.8) when a page fault occurs on a file-backed VMA that does not override the fault method (most filesystems).

/// Generic page cache fault handler. Looks up the faulting page index in the
/// page cache, triggers synchronous readahead if not present, waits for the
/// page lock, and returns a `PageRef` for PTE installation.
///
/// # Sequence
/// 1. Compute page index: `vmf.pgoff = (vmf.address - vma.vm_start) / PAGE_SIZE + vma.vm_pgoff`.
/// 2. Page cache lookup under RCU read lock (`find_get_page(mapping, index)`).
/// 3. **Cache hit**: If the page is present and uptodate, return it immediately.
///    This is the fast path (~100ns).
/// 4. **Cache miss**: Trigger synchronous readahead
///    (`page_cache_sync_readahead(mapping, ra, index)`). This submits I/O
///    for the faulting page plus readahead window pages.
/// 5. Re-lookup the page in the cache (readahead may have populated it).
/// 6. If still not present: allocate a new page, add to the page cache, and
///    submit a single-page read via `AddressSpaceOps::read_page()`.
/// 7. Wait for `PageFlags::LOCKED` to clear (the I/O completion handler
///    unlocks the page and sets `UPTODATE`).
/// 8. If `!UPTODATE` after unlock: I/O error — return `FaultError::Sigbus`.
/// 9. Return the `PageRef` for PTE installation by the caller.
///
/// # Error cases
/// - `FaultError::Sigbus`: I/O error reading the page (disk corruption, NFS timeout).
/// - `FaultError::Oom`: Cannot allocate a page frame for the cache miss path.
/// - `FaultError::Retry`: `mmap_lock` was dropped during I/O wait and the VMA
///   may have changed; caller must re-lookup the VMA and retry the fault.
pub fn filemap_fault(vma: &Vma, vmf: &VmFault) -> Result<PageRef, FaultError> {
    let mapping = vma.file.as_ref().unwrap().address_space();
    let index = vmf.pgoff;

    // Fast path: RCU-protected page cache lookup.
    let _rcu = rcu_read_lock();
    if let Some(page) = mapping.pages.get(index) {
        if page.flags.load(Acquire) & (PG_UPTODATE | PG_LOCKED) == PG_UPTODATE {
            page_ref_inc(&page);
            return Ok(PageRef::from_page(&page));
        }
    }
    drop(_rcu);

    // Slow path: trigger readahead, re-lookup, or allocate + read.
    page_cache_sync_readahead(mapping, &vma.file.as_ref().unwrap().ra, index);
    let page = find_or_create_page(mapping, index, GFP_HIGHUSER_MOVABLE)?;

    // Wait for I/O completion (page lock).
    lock_page(&page);
    if page.flags.load(Acquire) & PG_UPTODATE == 0 {
        unlock_page(&page);
        page_ref_dec(&page);
        return Err(FaultError::Sigbus);
    }
    unlock_page(&page);
    Ok(PageRef::from_page(&page))
}

4.4.2 Readahead Engine¶

Sequential file reads benefit enormously from prefetching pages before the application requests them. The readahead engine detects sequential access patterns and pre-populates the page cache, converting random I/O latency (~100μs SSD, ~10ms HDD) into cache hits (~100ns).

Per-file readahead state:

/// Readahead state tracked per open file (embedded in struct File).
/// Tracks the readahead window: which pages have been prefetched and
/// how large the next readahead should be.
pub struct FileRaState {
    /// Start of the current readahead window (page index).
    pub start: u64,
    /// Size of the current readahead window (pages).
    /// Grows exponentially on sequential access, up to ra_pages.
    pub size: u32,
    /// Async readahead trigger point: when the application reads a page
    /// at `start + size - async_size`, the next readahead is initiated
    /// asynchronously (before the application blocks on a cache miss).
    pub async_size: u32,
    /// Maximum readahead window (pages). From BDI ra_pages (default 128KB
    /// = 32 pages at 4KB, configurable via /sys/block/<dev>/queue/read_ahead_kb).
    pub ra_pages: u32,
    /// Mmap miss counter: tracks page faults on mmap'd files that miss
    /// the readahead window. When mmap_miss > MMAP_LOTSAMISS (100),
    /// readahead is disabled for this file (access pattern is random).
    pub mmap_miss: u32,
    /// Previous read position (byte offset). Used to detect sequential
    /// access: if `current_offset == prev_pos + bytes_read`, the access
    /// is sequential.
    pub prev_pos: i64,
}

Mmap fault readahead: Page faults on mmap'd file regions use a per-VMA FileRaState (stored in vma.file.ra_state) to drive readahead, mirroring the read() path. On each fault, readahead_on_fault() consults this state: if the faulting page index is sequential with respect to the previous fault, the readahead window is expanded (same doubling algorithm as read()). If mmap_miss exceeds MMAP_LOTSAMISS (100), the access pattern is deemed random and readahead is suppressed for that VMA. This avoids polluting the page cache with speculative I/O for random-access mmap workloads (e.g., database index traversals).

Sequential detection and window sizing:

First read (cold start):
  → Initial readahead size = get_init_ra_size(req_size, max)
    = min(roundup_pow_of_two(req_size) * 4, max)  (if small request)
    = min(roundup_pow_of_two(req_size) * 2, max)  (if medium)
    = max                                           (if large)
  → Typical: first 4KB read → initial window = 16KB (4 pages)

Subsequent sequential reads:
  → Detected by: current page index == ra.start + ra.size (contiguous)
  → Grow window: get_next_ra_size(ra, max)
    = ra.size * 4  (if ra.size < max/16 — aggressive ramp-up)
    = ra.size * 2  (if ra.size < max/2  — moderate growth)
    = max          (otherwise — capped at device maximum)
  → Maximum window: ra_pages (default 32 pages = 128KB)
    Configurable: /sys/block/<dev>/queue/read_ahead_kb

Async readahead trigger:
  → Page at index (ra.start + ra.size - ra.async_size) has
    PG_READAHEAD flag set when inserted into the page cache.
  → When the application reads this flagged page, the NEXT
    readahead window is initiated asynchronously — the I/O is
    submitted before the application finishes processing the
    current window. This pipelines I/O and computation.

Random access:
  → If the read position is not contiguous with ra.prev_pos,
    reset the readahead window to a single page.
  → MADV_RANDOM: force ra.ra_pages = 0, disabling readahead entirely.
  → MADV_SEQUENTIAL: set ra.ra_pages = 2 * default (256KB).

Page cache readahead algorithm (ondemand_readahead):

/// Called from the page fault handler (file-backed VMA) and from
/// generic_file_read_iter() on a page cache miss.
fn page_cache_sync_readahead(
    mapping: &AddressSpace,
    ra: &mut FileRaState,
    file: &File,
    index: u64,      // page index requested
    req_count: u64,   // pages requested
) {
    // (1) If MADV_RANDOM or O_RANDOM: single-page read, no readahead.
    if file.f_mode & FMODE_RANDOM != 0 {
        force_page_cache_ra(mapping, ra, index, 1);
        return;
    }

    // (2) Check if this is the start of a sequential read.
    //     "Sequential" = index is within 1 page of the end of the last read.
    if index == ra.start + ra.size {
        // Contiguous with previous readahead — grow the window.
        ra.start = index;
        ra.size = get_next_ra_size(ra, ra.ra_pages);
        ra.async_size = ra.size;
    } else if index == 0 || index == ra.prev_pos / PAGE_SIZE + 1 {
        // First read in a new file, or sequential from prev_pos.
        ra.start = index;
        ra.size = get_init_ra_size(req_count, ra.ra_pages);
        ra.async_size = ra.size / 2; // Trigger async at halfway.
    } else {
        // Non-sequential — check page cache history for a pattern.
        // If the last N pages before `index` are cached (prior
        // readahead landed), this might be an interleaved reader.
        let miss = page_cache_prev_miss(mapping, index, 8);
        if miss <= 2 {
            // Likely interleaved sequential — resume readahead.
            ra.start = index;
            ra.size = get_init_ra_size(req_count, ra.ra_pages);
            ra.async_size = ra.size;
        } else {
            // Random access — single page, no readahead.
            ra.start = index;
            ra.size = req_count;
            ra.async_size = 0;
        }
    }

    // (3) Submit readahead I/O.
    ra_submit(mapping, ra);
}

/// Async readahead: triggered when the application hits a page with
/// PG_READAHEAD flag. Initiates the NEXT readahead window while the
/// application processes the current one.
fn page_cache_async_readahead(
    mapping: &AddressSpace,
    ra: &mut FileRaState,
    index: u64,
) {
    // The current window is being consumed — start the next one.
    ra.start += ra.size;
    ra.size = get_next_ra_size(ra, ra.ra_pages);
    ra.async_size = ra.size;
    ra_submit(mapping, ra);
}

ra_submit() — readahead I/O submission bridge:

ra_submit() bridges the page cache readahead engine and the filesystem's I/O submission path. It allocates pages, builds a ReadaheadControl, and delegates to the filesystem.

/// Readahead control block passed to filesystem's readahead() method.
/// Contains the page range to read and helpers for bio construction.
pub struct ReadaheadControl<'a> {
    /// Address space (file mapping) to read into.
    pub mapping: &'a AddressSpace,
    /// Starting page index of the readahead window.
    pub start: u64,
    /// Number of pages in the readahead window.
    pub nr_pages: u32,
    /// Pages already allocated in the page cache for this range.
    /// The filesystem builds bios targeting these pages.
    /// Uses PageRef (lightweight refcounted page pointer) rather than Arc<Page>
    /// because these pages are already held in the page cache XArray — the
    /// readahead control borrows them for the duration of bio construction.
    pub pages: &'a [PageRef],
}

/// Submit a readahead request. Called by the page cache when sequential
/// access is detected (via FileRaState heuristics).
///
/// 1. Allocate page cache pages for [ra.start .. ra.start + ra.size]
/// 2. Build ReadaheadControl with the allocated pages
/// 3. Call mapping.ops.readahead(&rc)
/// 4. Filesystem's readahead() implementation builds Bio(s) from rc.pages
///    and submits them via bio_submit()
///
/// If the filesystem does not implement readahead(), falls back to
/// per-page read_folio() calls.
pub fn page_cache_readahead(
    mapping: &AddressSpace,
    ra: &mut FileRaState,
    nr_pages: u32,
)

Detailed ra_submit() flow:

Page allocation: For each page index in [ra.start, ra.start + ra.size), check the page cache radix tree. If the page is already cached (prior readahead or demand read), skip it. Otherwise, allocate a new page from the page allocator (NUMA-local to the reading CPU), insert it into the page cache in PageFlags::LOCKED state, and add it to the pages array. Pages at the async trigger point (ra.start + ra.size - ra.async_size) get the PageFlags::READAHEAD flag set so that the next read hitting that page triggers the next async readahead window.
ReadaheadControl construction: Build the control block with the mapping, start index, page count, and the array of allocated pages.
Filesystem dispatch: Call mapping.ops.readahead(&rc). The filesystem translates logical page indices to physical block addresses and builds Bio requests.
Fallback: If the filesystem does not implement readahead() (returns None from the AddressSpaceOps method), ra_submit() falls back to calling mapping.ops.read_folio() for each page individually. This is slower (one I/O per page, no coalescing) but ensures all filesystems work.

Filesystem readahead() implementation pattern:

// Example: ext4 readahead implementation
fn readahead(&self, rc: &ReadaheadControl) {
    // Map logical pages to physical blocks via the extent tree
    let extents = self.map_blocks(rc.start, rc.nr_pages);
    for extent in extents {
        let bio = Bio::new(extent.block_dev, extent.start_sector);
        for page in &rc.pages[extent.page_range()] {
            bio.add_page(page);
        }
        bio.set_end_io(readahead_end_io); // unlocks pages on completion
        bio_submit(bio);
    }
}

The readahead_end_io callback runs on I/O completion (in softirq or workqueue context). For each page in the bio: if I/O succeeded, set PageFlags::UPTODATE and clear PageFlags::LOCKED (waking any waiters); if I/O failed, set PageFlags::ERROR and clear PageFlags::LOCKED. Read-path I/O error handling: on error, UPTODATE is NOT set. The page remains in the page cache (not evicted). Subsequent access to this page returns -EIO. The page can be retried by invalidating and re-reading (e.g., after device recovery). Failed readahead pages are removed from the page cache on the next access attempt (the demand read path retries via read_folio()).

Demand read encountering a non-UPTODATE page: When filemap_get_pages() finds a page in the cache that is not UPTODATE (either I/O-failed readahead or a truncated page), the demand read handler: 1. Locks the page (lock_page()). 2. If the page was truncated (no longer in the page cache after lock), restart lookup. 3. If PageFlags::ERROR is set, remove the page from the cache via delete_from_page_cache(), unlock, and return -EIO to the caller. 4. If the page is simply not uptodate (I/O not yet completed), wait for I/O completion (wait_on_page_locked()), then re-check UPTODATE. This handles the race where readahead I/O is still in-flight when the demand read arrives.

KABI readahead interface: For Tier 1/2 filesystem drivers accessed via KABI, the readahead engine dispatches via the VFS ring protocol (Section 14.2): a Readahead ring message (opcode 61) carries start_index, nr_pages, and a DmaBufferHandle for the entire readahead window. The driver submits I/O for all pages in a single Bio batch. This replaces the per-page ReadPage path (opcode 60) for readahead scenarios, enabling NVMe multi-segment Bio optimization. Filesystems that return EOPNOTSUPP for Readahead fall back to sequential ReadPage ring messages (one per page).

mmap readahead: File-backed page faults (handle_file_fault() in Section 4.8) also trigger readahead via page_cache_sync_readahead(). However, mmap access patterns are harder to predict (random access via pointer arithmetic). If ra.mmap_miss exceeds 100 (MMAP_LOTSAMISS), readahead is disabled for mmap faults on this file and only the faulting page is loaded.

Cross-references: - Page cache and Page struct: §4.4 - File-backed page fault handler: Section 4.8 (handle_file_fault()) - AddressSpaceOps::read_page(): Section 14.1 - Block I/O batching: Section 15.2

4.5 OOM Killer and Process Memory Hibernation¶

4.5.1 OOM Killer Policy¶

Overview:

The OOM (Out-of-Memory) killer is the last resort when the system cannot reclaim enough pages to satisfy an allocation. UmkaOS's OOM resolution is explicitly ordered, predictable, and userspace-notifiable — addressing well-known deficiencies in Linux's OOM killer (wrong process killed, oom_score_adj abuse, no advance warning, per-NUMA races).

Detection:

OOM is declared when all of the following are true:

1. alloc_pages(order=N) fails (buddy allocator returned null).
2. kswapd has completed at least one full reclaim cycle without reaching
   watermark_low.
3. Memory compaction ([Section 4.7](#transparent-huge-page-promotion-and-memory-compaction))
   was attempted and could not create a block of the requested order.
4. No swap space is available or all swap slots are in use.

For cgroup OOM, the trigger is: a cgroup has exceeded memory.max and reclaim within the cgroup subtree failed to bring usage below the limit.

Cgroup reclaim exhaustion is defined as: all reclaimable pages in the cgroup's LRU lists have been scanned (a full rotation of both active and inactive lists), AND no pages were successfully reclaimed, AND swap is either full or the cgroup's swap limit (memory.swap.max) is reached. This is stricter than global reclaim (which scans multiple times with increasing priority) — cgroup OOM triggers faster to contain the resource hog without affecting other cgroups.

Resolution order:

Step 0: Hibernate background cgroups (before any kill)
  - Before killing any process, the memory pressure manager checks whether
    any cgroups in "background" state (memory.hibernate_priority > 0) can
    be hibernated to free memory.
  - Hibernation freezes cgroup tasks and reclaims their anonymous pages
    (except MADV_CRITICAL regions) without killing the process.
  - See [Section 4.5](#oom-killer--process-memory-hibernation) for the full
    hibernation mechanism.
  - Only if hibernation cannot free enough memory does resolution proceed
    to Step 1.

Step 1: Cgroup OOM (if applicable)
  - If the failing allocation is charged to a cgroup at memory.max,
    and per-cgroup reclaim failed: resolve within that cgroup.
  - Select victim from the cgroup's process set only.
  - Never escalate to global OOM if cgroup OOM resolves the pressure.

Step 2: MPOL_BIND OOM (if applicable)
  - If the failing allocation has a NUMA memory policy that binds it to
    a specific set of nodes, and those nodes are exhausted: resolve OOM
    only across processes whose allocations are bound to the same node set.
  - Prevents global OOM from being triggered by a single NUMA-bound process.

Step 3: Global OOM
  - Select the victim from all non-exempt processes system-wide.
  - Emit a log entry with: victim PID, victim name, oom_score, total RSS,
    swap used, all process memory usage at time of decision.

Victim selection:

/// Internal OOM badness score (raw page count). Matches Linux `oom_badness()`
/// in `mm/oom_kill.c`: RSS + swap + page tables, adjusted by oom_score_adj.
/// dirty_kb is NOT included — dirty pages are already counted in RSS
/// (they are resident pages that happen to be modified — adding them would
/// double-count).
///
/// **Linux compatibility note**: The `child_rss/8` term was removed from
/// Linux `oom_badness()` in the 2010 OOM rewrite by David Rientjes
/// (commit `a63d83f427fb`). Modern Linux (6.x) does NOT iterate
/// `task_struct->children` or include child RSS. UmkaOS matches the
/// current Linux formula exactly to ensure `/proc/[pid]/oom_score`
/// produces values consistent with what `systemd-oomd` and `earlyoom`
/// expect. Fork-heavy workloads (PostgreSQL, Apache prefork) get
/// correct (non-inflated) scores.
///
/// Returns `i64::MIN` for unkillable tasks (oom_score_adj = -1000, init, etc.).
fn oom_badness(p: &Task, totalpages: u64) -> i64 {
    // Access oom_score_adj from the Task (per-thread field, but conventionally
    // all threads in a thread group share the same value — /proc/PID/oom_score_adj
    // writes propagate to all threads via the thread group iterator).
    let adj_val = p.oom_score_adj.load(Relaxed);
    if adj_val == -1000 { return i64::MIN; }

    // Skip tasks already marked as OOM victims (prevents double-selection).
    if p.flags.load(Relaxed) & PF_OOM_VICTIM != 0 {
        return i64::MIN;
    }

    // Access mm from Process. Kernel threads have mm: None.
    // Process.mm is `ArcSwap<MmStruct>` (atomic reference-counted swap),
    // ensuring that OOM scoring reads do not race with exec's mm
    // replacement. The `load()` returns a snapshot Arc that cannot be
    // concurrently freed.
    let mm_guard = p.process.mm.load();
    let mm = match mm_guard.as_ref() {
        Some(mm) => mm,
        None => return i64::MIN,  // kernel thread — unkillable
    };

    // Skip tasks whose memory has already been reaped (MMF_OOM_SKIP set by
    // the OOM reaper after page table teardown). Also skip vfork tasks
    // (shares parent's mm — killing them doesn't free memory).
    // Matches Linux oom_badness() checks in mm/oom_kill.c.
    //
    // MmFlags is a bitflags! type wrapping the raw AtomicU64 value.
    // Use MmFlags::from_bits_truncate() to convert the loaded u64 to the
    // bitflags type, then use .contains() for flag checks.
    let flags = MmFlags::from_bits_truncate(mm.flags.load(Relaxed));
    if flags.contains(MmFlags::MMF_OOM_SKIP) {
        return i64::MIN;
    }
    // in_vfork(): returns true if the task has a pending vfork_done
    // (vfork child that has not yet called exec/exit). Killing a vfork
    // child does not free memory (it shares the parent's mm).
    // Implementation: `!self.vfork_done.load(Acquire).is_null()` — matches Linux
    // mm/oom_kill.c oom_badness() check for `task->vfork_done`.
    if p.in_vfork() {
        return i64::MIN;
    }

    // RSS: sum raw PerCpu<AtomicI64> slots. Drift is bounded to
    // concurrent in-flight page faults (handful of pages at most) —
    // negligible for OOM scoring.
    // See [Section 4.8](#virtual-memory-manager--mmstruct--per-process-address-space).
    let rss = mm_sum_rss(mm);

    // Swap entries charged to this mm.
    let swap = mm.swap_count.load(Relaxed).max(0) as u64;

    // Page table pages (bytes / PAGE_SIZE). pgtable_bytes is AtomicI64
    // (signed to allow transient negative deltas during concurrent PTE ops).
    // .max(0) clamps the transient negative values.
    let pgtbl = (mm.pgtable_bytes.load(Relaxed).max(0) as u64) / PAGE_SIZE as u64;

    let points = (rss + swap + pgtbl) as i64;
    let adj = (adj_val as i64 * totalpages as i64) / 1000;

    points + adj
}

/// Sum per-CPU RSS counters for an address space. Iterates all CPU
/// slots of the raw `PerCpu<AtomicI64>` and sums the atomic counters.
///
/// `mm.rss` is `PerCpu<AtomicI64>` (NOT `PerCpuCounter<i64>`) — see
/// [Section 4.8](#virtual-memory-manager--mmstruct--per-process-address-space).
/// RSS uses raw per-CPU atomics because it is the hottest counter in the
/// kernel (updated on every page fault). `PerCpuCounter<i64>` would add
/// ~5-10 cycles of batch-threshold overhead per page fault — unacceptable.
///
/// The read here iterates all possible CPUs and sums with `Relaxed`
/// ordering. Drift is bounded to the number of concurrent page fault
/// handlers mid-`fetch_add` — in practice a handful of pages at most.
/// This is negligible for OOM scoring (proportional comparison against
/// memory limits in the hundreds-of-MiB-to-GiB range).
fn mm_sum_rss(mm: &MmStruct) -> u64 {
    let mut total: i64 = 0;
    // Iterate all POSSIBLE CPUs, not just online. A CPU that was online when
    // it incremented its counter may be offline at read time. Linux uses
    // `for_each_possible_cpu()` in `get_mm_rss()` for the same reason.
    for cpu in 0..num_possible_cpus() {
        total += mm.rss.get_cpu(cpu).load(Relaxed);
    }
    total.max(0) as u64
}

/// Linux-compatible OOM score for `/proc/[pid]/oom_score`.
/// This is the public `oom_score()` interface — called by procfs
/// `proc_oom_score()` handler and by the OOM event log (Step 1).
/// Matches the `proc_oom_score()` normalization in Linux `fs/proc/base.c`:
///
/// ```c
/// // Linux proc_oom_score():
/// if (badness != LONG_MIN)
///     points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3;
/// ```
///
/// Produces range [0, ~1333] (practical maximum when `oom_score_adj = +1000`).
/// Linux documents the ABI range as [0, 2000] for historical compatibility
/// (the comment in Linux says "scale the badness value into [0, 2000] range
/// which we have been exporting for a long time so userspace might depend on
/// it"), but the formula's mathematical maximum is ~1333.
///
/// `oom_score_adj` range is [-1000, 1000] (same as Linux).
/// `oom_score_adj = -1000` means "never kill" (exempt from OOM selection).
/// `oom_score_adj = +1000` means "kill first" (always highest priority victim).
pub fn oom_score_compat(p: &Task) -> u64 {
    let totalpages = total_ram_pages() + total_swap_pages();
    debug_assert!(totalpages > 0 && totalpages <= i64::MAX as u64);
    let badness = oom_badness(p, totalpages);
    if badness == i64::MIN { return 0; }  // unkillable
    // Matches Linux proc_oom_score() formula semantics.
    // badness is in range [-(totalpages), totalpages].
    // The +1000 offset and *2/3 scaling match Linux's historical ABI.
    //
    // Overflow analysis: `badness * 1000` overflows i64 when
    // totalpages > i64::MAX / 1000 ≈ 9.2 × 10^15, corresponding to
    // ~37.6 PB of RAM+swap. Linux has the same vulnerability (silent
    // wrap in C, UB technically). UmkaOS uses saturating_mul to produce
    // a clamped-but-monotonic score on hypothetical >37.6 PB systems,
    // rather than panicking (debug) or wrapping (release). The output
    // may differ from Linux by ±1 at the saturation boundary, but
    // oom_score is advisory (procfs ABI) and exact match is not required
    // for systems that do not exist yet. For /proc/[pid]/oom_score
    // Linux compatibility on current hardware (<37.6 PB), the result
    // is identical to Linux.
    //
    // UmkaOS-native OOM scoring (oom_score_internal) uses a different
    // formula that does not have this overflow. The native score is
    // exposed via umkafs, not procfs.
    let score = (1000i64 + badness.saturating_mul(1000) / totalpages as i64) * 2 / 3;
    score.max(0) as u64
}

/// Internal OOM scorer using FMA-enriched signals. Used for actual victim
/// selection (not exposed to userspace). Incorporates memory pressure
/// gradient, writeback queue depth, reclaim efficiency, and I/O amplification
/// in addition to the raw memory footprint. Higher score = more likely killed.
pub fn oom_score_internal(p: &Task) -> i64 {
    let totalpages = total_ram_pages() + total_swap_pages();
    let base = oom_badness(p, totalpages);
    if base == i64::MIN { return i64::MIN; }  // unkillable

    // FMA bonus: tasks causing disproportionate memory pressure get a boost.
    // The FMA subsystem ([Section 20.1](20-observability.md#fault-management-architecture)) tracks per-cgroup
    // reclaim efficiency and I/O amplification. Tasks in cgroups with poor
    // reclaim efficiency (high scan-to-reclaim ratio) or high writeback
    // pressure are ranked higher for kill selection.
    let fma_bonus = fma_oom_pressure_score(p);

    base + fma_bonus
}

/// Per-cgroup FMA (Fault Management Architecture) memory pressure statistics.
/// Stored in the memory cgroup descriptor (`MemCgroup`). Updated by the
/// page reclaim scanner ([Section 4.4](#page-cache--generational-lru-page-reclaim)) at
/// the end of each reclaim pass. Read by the OOM internal scorer and by
/// `/sys/fs/cgroup/<path>/memory.pressure` for userspace PSI consumers.
///
/// **Thread safety**: Each field is `AtomicU64` — updated by kswapd or
/// direct reclaim (single writer per cgroup per scan pass) and read by
/// the OOM scorer (multiple readers). `Relaxed` ordering suffices because
/// the OOM scorer tolerates stale-by-one-scan-pass values.
///
/// **Cgroup accessor**: `Cgroup::fma_stats()` returns `Option<&FmaCgroupStats>`.
/// Returns `None` for cgroups that have never triggered reclaim (the
/// `FmaCgroupStats` is allocated on first reclaim pass, not at cgroup
/// creation — avoids memory waste for cgroups that never hit pressure).
pub struct FmaCgroupStats {
    /// Pages scanned per page successfully reclaimed in the last scan pass.
    /// Updated at the end of `shrink_node_memcg()` as
    /// `scanned_pages / max(reclaimed_pages, 1)`. A ratio > 10 indicates
    /// severe reclaim inefficiency (most scanned pages were pinned,
    /// recently accessed, or dirty).
    pub scan_to_reclaim_ratio: AtomicU64,
    /// Cumulative nanoseconds this cgroup's tasks spent stalled waiting
    /// for writeback I/O to complete during the last measurement window
    /// (reset every 10 seconds by the PSI accounting subsystem). Tracks
    /// `wait_on_page_writeback()` time charged to this memcg.
    pub writeback_stall_ns: AtomicU64,
}

/// FMA-enriched OOM pressure score. Returns a non-negative bonus added to
/// `oom_badness()` for internal kill decisions. Returns 0 when the FMA
/// subsystem is inactive or the task's cgroup has no pressure data.
///
/// The score is derived from the task's cgroup memory pressure metrics:
/// - `scan_to_reclaim_ratio`: pages scanned per page reclaimed (higher = worse).
///   A ratio > 10 indicates severe reclaim inefficiency.
/// - `writeback_stall_ns`: cumulative nanoseconds the cgroup spent stalled on
///   writeback during the last measurement window.
///
/// Formula: `bonus = (scan_to_reclaim_ratio.min(100) * 10) + (writeback_stall_ms)`
/// where `writeback_stall_ms = writeback_stall_ns / 1_000_000`.
/// Bounded to [0, 2000] to prevent FMA from dominating the base RSS score.
///
/// See [Section 20.1](20-observability.md#fault-management-architecture) for the source of these metrics.
fn fma_oom_pressure_score(p: &Task) -> i64 {
    // FMA service check — returns 0 if FMA is not active (static key gated).
    if !fma_active() { return 0; }

    // Task.cgroup is ArcSwap<Cgroup> — use .load() to obtain a snapshot
    // Guard. See [Section 17.2](17-containers.md#control-groups) for the ArcSwap migration protocol.
    let cg = p.cgroup.load();
    let fma = match cg.fma_stats() {
        Some(stats) => stats,
        None => return 0,
    };

    let scan_ratio = fma.scan_to_reclaim_ratio.load(Relaxed).min(100) as i64;
    let wb_stall_ms = (fma.writeback_stall_ns.load(Relaxed) / 1_000_000) as i64;
    (scan_ratio * 10 + wb_stall_ms).min(2000)
}

/// OOM victim exemptions (score -1000 is never selected):
/// - kernel threads (no mm)
/// - init (pid 1 of PID namespace)
/// - processes with oom_score_adj = -1000
/// - processes holding a mandatory kernel resource (marked OOM_EXEMPT at
///   creation, e.g., the memory-pressure notification daemon)

Notification before kill:

UmkaOS provides advance OOM notification via a /dev/oom character device. Processes interested in memory pressure events can open("/dev/oom", O_RDONLY) and poll() on it. Events delivered as fixed-size structs on read():

#[repr(C)]
pub struct OomNotification {
    /// Type of pressure event.
    pub kind: OomKind,
    /// Struct size in bytes. Userspace checks `notif.size >= offsetof(field) +
    /// sizeof(field)` before accessing fields added in future kernel versions
    /// (same extensibility pattern as `perf_event_attr`). Current value: 64.
    pub size: u32,
    /// Cgroup ID (from cgroup.id) whose memory exceeded the threshold,
    /// or 0 for global OOM notification. Uses u64 cgroup ID (not PID)
    /// because cgroups are identified by their unique hierarchy ID,
    /// not by any process PID.
    pub cgroup_id: u64,
    /// Current free pages across all zones.
    pub free_pages: u64,
    /// Total pages in system.
    pub total_pages: u64,
    /// Notification flags bitfield.
    /// - `OOM_FLAG_DROPPED` (0x01): one or more notifications were dropped
    ///   due to ring overflow before this notification was delivered.
    ///   Userspace should treat this as a "you missed events" indicator.
    pub flags: u32,
    /// Reserved for future fields (ABI stability). Must be zero-initialized.
    /// When new fields are added, they consume bytes from this reservation
    /// and `size` is updated to the new struct size.
    pub _reserved: [u8; 24],
    // Layout: kind(4) + size(4) + cgroup_id(8) + free_pages(8) +
    //   total_pages(8) + flags(4) + _reserved(24) + _tail_pad(4) = 64 bytes.
    /// Explicit tail padding to prevent kernel info leak (4 bytes to align
    /// struct end to 8-byte boundary). Must be zero-initialized.
    pub _tail_pad: u32,
}
const_assert!(core::mem::size_of::<OomNotification>() == 64);

/// Notification flag: one or more notifications were lost to ring overflow.
pub const OOM_FLAG_DROPPED: u32 = 0x01;

/// `read()` semantics for /dev/oom:
/// - If the caller's buffer `count` < `size_of::<OomNotification>()` (64 bytes),
///   return `-EINVAL`. This prevents partial reads of the notification struct.
/// - On success, each `read()` returns exactly one `OomNotification` (64 bytes).
/// - `poll()` / `epoll()`: returns `POLLIN` when at least one unread notification
///   is available. The internal ring holds up to 64 notifications; overflow drops
///   the oldest entry and sets a `DROPPED` flag on the next delivered notification.

#[repr(u32)]
pub enum OomKind {
    /// Cgroup at memory.high: soft pressure warning (no kill yet).
    CgroupHigh  = 1,
    /// Cgroup at memory.max and reclaim failed: cgroup OOM imminent.
    CgroupMax   = 2,
    /// Global OOM: system-wide memory exhausted, kill imminent.
    GlobalOom   = 3,
}

Kill sequence (canonical ordering — physical-memory-allocator.md references this definition via Section 4.5):

1. Log OOM event to kernel log (PID, name, oom_score, RSS, reason).
2. Set PF_OOM_VICTIM in Task.flags — exclusion flag, prevents re-selection
   by concurrent OOM invocations. Must be set FIRST: granting reserves
   (TIF_MEMDIE) or sending SIGKILL before exclusion creates a window for
   double-selection.
3. Send SIGKILL to the selected victim via the process-directed path:
   `do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID)`.
   This delivers SIGKILL to the thread group's shared pending queue and
   wakes the most responsive eligible thread (preferring
   TASK_INTERRUPTIBLE over TASK_KILLABLE). Process-directed delivery is
   critical for multi-threaded victims: it ensures the signal reaches a
   wakeable thread even if the OOM-selected task is in TASK_UNINTERRUPTIBLE
   (D-state). See [Section 8.5](08-process.md#signal-handling) for `do_send_sig_info()` definition.
   NOTE: Do NOT use `force_sig(SIGKILL, task)` — that is thread-directed
   and targets a single thread which may be in D-state.
4. Set TIF_MEMDIE — grants the victim's exit path unconditional access to
   memory reserves (below wmark_min) so that exit() can allocate memory
   for cleanup (page table teardown, fd close, etc.) without deadlocking.
   Set AFTER SIGKILL so the kill is already in-flight when reserves open.
5. Enqueue the victim's mm into the OOM reaper for asynchronous page
   reclamation:
   ```rust
   if let Some(ref mm) = task.process.mm {
       OOM_REAPER.victims.push(Arc::clone(mm));
       OOM_REAPER.wake.wake_one();
   }
   ```
   The reaper proactively unmaps anonymous pages without holding mmap_lock,
   breaking deadlock cycles where the victim holds mmap_lock. See
   [Section 4.2](#physical-memory-allocator--oom-reaper) for the full reaper spec.
6. Cgroup OOM with `memory.oom.group=1` ([Section 17.2](17-containers.md#control-groups)): skip steps
   2-5 for the individual victim. Instead, SIGKILL **all tasks** in the
   cgroup subtree simultaneously (applying the PF_OOM_VICTIM → SIGKILL →
   TIF_MEMDIE sequence to each). Each killed task's mm is enqueued in the
   reaper. The victim selected by `select_victim()` is included in the
   group iteration — no separate kill step needed. Without
   `memory.oom.group` (default), only the single highest-scoring victim
   is killed (steps 2-5 above).
6a. **Cgroup OOM event counter increments** (after victim selection, before
   step 7). Three distinct counters in the memcg descriptor:
   - `memcg.events_oom.fetch_add(1, Relaxed)`: Always incremented when
     cgroup OOM is invoked (even if no victim is found — counts invocations).
   - `memcg.events_oom_kill.fetch_add(1, Relaxed)`: Incremented per victim
     actually killed (may differ from `events_oom` if cgroup OOM fails to
     find a victim).
   - `memcg.events_oom_group_kill.fetch_add(1, Relaxed)`: Incremented once
     per group kill (step 6), not per task killed within the group.
   These counters are exposed via `memory.events` cgroupfs file and are
   read by `systemd-oomd` and other cgroup-aware OOM managers.
   For global OOM (no memcg), no cgroup counters are incremented.
7. After kill: wake all waiters blocked in alloc_pages, retry allocation.
   The allocator retries because TIF_MEMDIE grants reserve access and the
   reaper is asynchronously freeing pages. No explicit 500ms wait — the
   allocating task retries the allocation (matching Linux behavior where
   the OOM killer returns and the allocator retries with reserve access).
8. If the retry still fails and a new OOM event triggers: `select_victim()`
   skips the previous victim (PF_OOM_VICTIM check) and selects the next
   highest-scoring task.

out_of_memory() — top-level OOM entry point:

/// Top-level OOM entry point. Called from the page allocator's slow path
/// when allocation fails after exhausting all reclaim and compaction
/// options (see detection criteria above). Serialized by `OOM_LOCK`
/// (global Mutex) — only one OOM resolution runs at a time.
///
/// # Arguments
/// - `oc`: OOM context describing the failing allocation — contains the
///   GFP flags, allocation order, NUMA node preference, originating task,
///   and optional memcg constraint (for cgroup OOM).
///
/// # Returns
/// `true` if a victim was selected and killed (caller should retry the
/// allocation — the victim's exit path will free memory). `false` if no
/// eligible victim was found (caller returns -ENOMEM to userspace or
/// panics if the allocation is mandatory, e.g., GFP_NOFAIL).
///
/// # Algorithm
/// Implements the resolution order described above (Step 0 → Step 3).
pub fn out_of_memory(oc: &OomContext) -> bool {
    let _guard = OOM_LOCK.lock();

    // Step 0: Try hibernating background cgroups (if any eligible).
    if try_hibernate_background_cgroups(oc) {
        return true;  // Memory freed via hibernation — no kill needed.
    }

    // Step 1: Cgroup OOM (if the failing allocation is charged to a memcg).
    if let Some(ref memcg) = oc.memcg {
        if let Some(victim) = select_victim(oc, Some(memcg)) {
            oom_kill_process(oc, &victim);
            // Increment per-cgroup OOM event counter.
            memcg.events.oom_kill.fetch_add(1, Relaxed);
            return true;
        }
    }

    // Step 2: MPOL_BIND OOM (if the allocation has a NUMA binding policy).
    if oc.nodemask.is_some() {
        if let Some(victim) = select_victim(oc, None) {
            oom_kill_process(oc, &victim);
            return true;
        }
    }

    // Step 3: Global OOM.
    if let Some(victim) = select_victim(oc, None) {
        oom_kill_process(oc, &victim);
        return true;
    }

    false  // No eligible victim — caller gets -ENOMEM.
}

/// OOM context describing the failing allocation. Passed through the
/// OOM resolution chain.
pub struct OomContext {
    /// GFP flags of the failing allocation.
    pub gfp_mask: GfpFlags,
    /// Allocation order (0 = single page, up to MAX_ORDER).
    pub order: u32,
    /// NUMA node mask constraint (from MPOL_BIND), or None for any node.
    pub nodemask: Option<NodemaskRef>,
    /// Memory cgroup constraint: if Some, the allocation is charged to
    /// this memcg and cgroup OOM resolution should be attempted first.
    pub memcg: Option<Arc<MemCgroup>>,
    /// The task that triggered the allocation failure (current_task()).
    pub task: Arc<Task>,
}

/// Select the highest-scoring OOM victim from the eligible task set.
///
/// # Arguments
/// - `oc`: OOM context (GFP flags, NUMA constraints).
/// - `memcg`: If Some, restrict victim selection to tasks in this memcg
///   subtree (cgroup OOM). If None, consider all tasks (global OOM).
///
/// # Returns
/// The selected victim Task, or None if no eligible victim exists
/// (all tasks are unkillable or already OOM victims).
fn select_victim(
    oc: &OomContext,
    memcg: Option<&Arc<MemCgroup>>,
) -> Option<Arc<Task>> {
    let totalpages = total_ram_pages() + total_swap_pages();
    let mut best_score = i64::MIN;
    let mut best_task: Option<Arc<Task>> = None;

    // Iterate candidate tasks. For cgroup OOM: tasks in memcg subtree.
    // For global OOM: all tasks via for_each_process().
    let iter = match memcg {
        Some(cg) => cg.task_iter(),
        None => TaskIterator::all_processes(),
    };

    for p in iter {
        let score = oom_score_internal(&p);
        if score == i64::MIN { continue; }  // unkillable
        if score > best_score {
            best_score = score;
            best_task = Some(Arc::clone(&p));
        }
    }

    best_task
}

/// Execute the OOM kill sequence (Steps 1-8 from the kill sequence above).
/// `victim` is the task selected by `select_victim()`.
fn oom_kill_process(oc: &OomContext, victim: &Arc<Task>) {
    // 1. Log.
    oom_log_event(oc, victim);
    // 2. Set PF_OOM_VICTIM (exclusion flag — must be FIRST).
    victim.flags.fetch_or(PF_OOM_VICTIM, Release);
    // 3. Send SIGKILL via process-directed delivery.
    do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
    // 4. Set TIF_MEMDIE (grant reserve access).
    victim.thread_info.set_flag(TIF_MEMDIE);
    // 5. Enqueue mm into OOM reaper.
    if let Some(mm) = victim.process.mm.load().as_ref() {
        OOM_REAPER.victims.push(Arc::clone(mm));
        OOM_REAPER.wake.wake_one();
    }
    // 6. If memory.oom.group=1: SIGKILL all tasks in cgroup subtree.
    if let Some(ref memcg) = oc.memcg {
        if memcg.oom_group.load(Relaxed) != 0 {
            oom_kill_cgroup_group(memcg, oc);
        }
    }
    // 7. Emit /dev/oom notification.
    oom_notify(oc, victim);
}

/proc/PID/oom_score_adj interface: Range [-1000, 1000]. Inherited by children. Written by process itself (any value <= current adj) or by root (any value). Default: 0. Setting -1000 exempts a process from OOM selection (not from SIGKILL if sent explicitly — only OOM selection).

/proc/PID/oom_score (read-only): Current Linux-compatible OOM score (calls oom_score_compat() on read). systemd-oomd, earlyoom, and other procfs-reading tools see expected values.

oom_score_internal() (kernel-internal): Used for actual OOM kill decisions. Enriches the base score with FMA signals: writeback queue depth (from BdiWriteback.nr_dirty), memory pressure gradient (from PSI memory.pressure), and cgroup reclaim stall time. Exposed via umkafs for debugging, not procfs.

Design improvements over Linux OOM killer:

Linux OOM problem	UmkaOS resolution
Kills wrong process (oom_score inaccurate for swap-heavy workloads)	Linux-compat formula for `/proc/pid/oom_score` (RSS + swap + pgtable, matching modern Linux `oom_badness()`); internal scorer uses FMA writeback queue depth and memory pressure gradient for kill decisions
No advance warning before kill	`/dev/oom` notification device with `OomKind::CgroupHigh` at 80% pressure
Global OOM triggered by single cgroup	Cgroup OOM resolved first, never escalates unless cgroup OOM fails
OOM panic on kernel allocation failure	No panic — log + retry + select new victim if needed
NUMA OOM selects from wrong node set	MPOL_BIND OOM resolved within bound node set only
Per-NUMA `oom_zonelist` races	Explicit NUMA-aware step 2 in resolution order

4.5.1.1.1 DSM-Awareness in OOM Scoring¶

Processes holding DSM page ownership (Section 6.2) require special consideration during OOM selection because killing a DSM-owning process has side-effects beyond local memory reclamation.

DSM page count in oom_badness():

/// Adjust the OOM badness score for DSM-owning processes.
/// Called from `oom_score_internal()` (not `oom_badness()` — the Linux-compat
/// score intentionally does NOT include DSM adjustment, since Linux has no
/// concept of DSM pages and procfs consumers expect the standard formula).
///
/// The adjustment is a weighted count of DSM-exclusive (Modified) pages:
///   `bonus = dsm_exclusive_pages * DSM_OOM_WEIGHT`
///
/// DSM_OOM_WEIGHT defaults to 0 (no penalty): the OOM killer treats DSM pages
/// the same as local pages. This is correct for the common case where DSM
/// pressure is not the root cause of local memory exhaustion. Administrators
/// can increase the weight when DSM pressure IS the cause:
///   /proc/sys/vm/dsm_oom_weight  (range: 0-100, default: 0)
///
/// When DSM_OOM_WEIGHT > 0, DSM-heavy processes get higher scores, making them
/// preferred kill targets. This reduces the latency of OOM recovery because
/// killing a non-DSM process frees memory immediately (no ownership transfer).
pub static DSM_OOM_WEIGHT: AtomicU64 = AtomicU64::new(0);

fn dsm_oom_adjustment(p: &Task) -> i64 {
    let weight = DSM_OOM_WEIGHT.load(Relaxed);
    if weight == 0 { return 0; }  // fast path: no DSM adjustment
    // Query the DSM directory for the count of pages in Modified state
    // owned by this task's mm. The DSM directory maintains per-mm
    // ownership counters (see [Section 6.3](06-dsm.md#dsm-page-ownership-model--per-mm-ownership-tracking)).
    // This is a warm-path read (one atomic load per mm, no network I/O).
    let mm_guard = p.process.mm.load();
    let mm = match mm_guard.as_ref() {
        Some(mm) => mm,
        None => return 0,
    };
    let dsm_exclusive = mm.dsm_exclusive_pages.load(Relaxed);
    (dsm_exclusive as i64).saturating_mul(weight as i64)
}

Kill coordination for DSM-owning victims:

When oom_kill_process() kills a task that holds DSM exclusive pages, the kill sequence extends beyond the standard Steps 1-8:

SIGKILL is sent immediately (Step 3). No delay for DSM.
In parallel with the reaper (Step 5), dsm_bulk_release(mm) initiates bulk ownership release to home nodes for all Modified pages. This sends RDMA invalidation messages in batch (DsmBulkRelease wire message, Section 6.6) to each home node.
dsm_bulk_release() runs asynchronously on a kworker thread �� the OOM killer does NOT block on RDMA completions. It returns to Step 7 (wake allocators) immediately after enqueueing the release.
RDMA completions arrive asynchronously. Timeout: 100ms. If a home node does not acknowledge within 100ms, the local DSM directory marks those pages as LOST and the home node reconstructs ownership from its directory via the heartbeat-based lost-peer protocol (Section 5.8).
Non-DSM pages in the victim's mm are reclaimed immediately by the OOM reaper. DSM pages transition from Modified to Invalid locally and are freed as RDMA completions arrive (or after timeout).

Net effect: DSM-owning victims add ~0-100ms of asynchronous tail latency to memory reclamation for the DSM pages only. Non-DSM pages are freed at the same speed as non-DSM victims. The OOM killer itself is never blocked.

Post-OOM retry mechanism: After OOM kill, the killed task's exit path frees memory (VMAs, page tables, anonymous pages). Other tasks waiting for memory are on the per-node allocation wait queue, woken by the buddy allocator when pages become available. The faulting task retries the allocation after being woken.

/// Per-NUMA-node allocation wait queue. Tasks that fail page allocation
/// and enter the OOM path are added to this WaitQueue. The buddy allocator
/// wakes waiters when pages are freed (via `wake_all_allocators()`).
///
/// This is per-node (not per-zone) because UmkaOS uses a flat buddy
/// allocator per NUMA node, not per-zone. The node-level granularity
/// matches the buddy allocator's lock domain — a freed page on any order
/// wakes ALL waiters on that node, not just waiters for a specific zone.
/// This is coarser than Linux's per-zone approach but simpler and
/// sufficient: OOM is an ice-cold path, and the extra wakeups are
/// negligible compared to the OOM kill overhead.
///
/// Located in `BuddyAllocator` (one per NUMA node).
/// Woken by: `BuddyAllocator::free_pages()` after returning pages to the
/// free list, if `nr_free` crossed above `wmark_min` (throttle prevents
/// thundering-herd wakeups on every single page free).
pub alloc_waitqueue: WaitQueue,

OOM lock ordering: The OOM killer acquires OOM_LOCK (static OOM_LOCK: Mutex<()>, global — serializes ALL OOM kill sequences, both global and per-cgroup). Per-cgroup locking would add deadlock risk with nested cgroups and thundering-herd over-kill. OOM is an ice-cold path — global serialization adds no measurable latency. OOM_LOCK is NOT held during the actual kill (SIGKILL delivery). Lock ordering: mmap_lock < OOM_LOCK (the allocation path may hold mmap_lock when triggering OOM). The OOM killer never re-acquires mmap_lock — it reads RSS via per-CPU atomic counters without locking. See Section 3.4 for the Mutex ordering table.

mmap_lock + OOM_LOCK deadlock prevention: The allocation path may hold mmap_lock (read or write) when alloc_pages() fails and triggers OOM. The lock ordering mmap_lock (level 50) < OOM_LOCK (level 55) permits this: the OOM killer acquires OOM_LOCK while mmap_lock is already held. The critical invariant is that out_of_memory() NEVER acquires mmap_lock — it reads RSS via per-CPU AtomicI64 counters (mm_sum_rss()) which require no locking, and reads MmStruct.flags via AtomicU64::load(Relaxed). This avoids the classic Linux deadlock where the OOM killer blocks on mmap_lock held by the very task it needs to kill. The OOM reaper (Section 4.2) also avoids mmap_lock — it unmaps pages via direct page table walking without the mm lock.

Cgroup reclaim lock ordering: LRU lock (per-zone SpinLock) < mmap_lock. The reclaim scanner holds LRU lock while scanning pages. It never acquires mmap_lock (page table walks use lockless traversal via rmap). mmap_lock is only held by the allocating task (which triggered reclaim), not by the reclaim scanner itself.

4.5.2 Process Memory Hibernation¶

Motivation:

Traditional Linux/Android memory reclamation under pressure has one instrument: kill the process. iOS's Jetsam daemon takes a fundamentally different approach — it freezes background apps and reclaims their memory without killing them. When the user switches back, the app resumes from exactly the state it was in, with only a brief repopulation stall instead of a full cold start.

Android devices historically required more RAM than iPhones because Android's only recourse under pressure was to kill background processes (requiring full restart, ~1-3s), while iOS's freeze+reclaim allowed the same working set to fit in less physical memory (~50-200ms warm resume).

UmkaOS solves this at the kernel level with process memory hibernation: a composable cgroup-based mechanism that freezes a process group and reclaims its memory without killing it. Applications opt in to efficient hibernation via new madvise() hints. Processes that do not use the hints still benefit — the kernel reclaims all non-critical pages via compression or swap — it just can't guarantee which pages are safe to zero-fill on resume.

New madvise() hints:

/* MADV_DISCARDABLE (new): Anonymous pages in this region may be discarded
 * at any time under memory pressure, including during cgroup hibernation.
 * On next access after discard, pages are zero-filled — the application
 * is responsible for reinitializing their content.
 *
 * Use cases: GC-managed heaps (GC will reinitialize objects on first use),
 * compiled code caches (JIT can recompile on demand), image decode buffers,
 * pre-computed table data.
 *
 * Stronger than MADV_FREE: MADV_FREE pages are freed lazily under global
 * pressure. MADV_DISCARDABLE pages are freed eagerly when the owning
 * cgroup transitions to hibernating state.
 *
 * Note: discarded pages are NOT tracked individually. The application must
 * be designed to reinitialize any page in the region on first access —
 * there is no per-page "was this discarded?" query.
 */
madvise(addr, len, MADV_DISCARDABLE)  /* 256 — UmkaOS extension namespace starts here */

/* MADV_CRITICAL (new): Anonymous pages in this region must never be
 * evicted — not to swap, not during hibernation, not under any memory
 * pressure. Pages are wired (pinned) in physical RAM.
 *
 * Use cases: UI state (last rendered frame, scroll position), input
 * event queues, cryptographic key material, file descriptor tables for
 * critical IPC channels.
 *
 * Subject to a per-process quota enforced by the memory cgroup:
 *   memory.critical_limit (default: 64 MB per cgroup leaf)
 * Exceeding the quota: madvise() returns ENOMEM.
 *
 * Interaction with fork: MADV_CRITICAL is NOT inherited across fork().
 * Child processes start with no critical regions.
 */
madvise(addr, len, MADV_CRITICAL)     /* 257 — UmkaOS extension */

Memory page classification for hibernation:

When a cgroup enters hibernating state, every anonymous page in the cgroup's address spaces is classified into one of four categories:

Category	Condition	Hibernation action	Resume action
Critical	Covered by `MADV_CRITICAL`	Keep in RAM (wired)	Instant access
Discardable	Covered by `MADV_DISCARDABLE`	Free immediately (no swap I/O)	Zero-fill fault
Compressible	All other anonymous pages	Compress into compress pool (Section 4.12)	Decompress on fault
Uncompressible	Large anonymous pages that don't compress	Swap to disk	Swap-in on fault

File-backed pages (mmap MAP_SHARED, page cache) are handled by the normal LRU reclaim path (Section 4.4) — they do not need special treatment.

Data structures:

/// Per-VMA annotation for hibernation hints. Stored as flags in the VMA
/// descriptor ([Section 4.8](#virtual-memory-manager)). Multiple ranges within a VMA
/// can have different hints via VMA splitting (same mechanism as mprotect()).
bitflags! {
    pub struct VmaHibernateFlags: u8 {
        const DISCARDABLE = 0x01;  // madvise(MADV_DISCARDABLE)
        const CRITICAL    = 0x02;  // madvise(MADV_CRITICAL)
    }
}

/// Per-cgroup hibernation state, stored in the memcg descriptor.
pub enum HibernateState {
    /// Normal operation — tasks running, pages managed by LRU.
    Active,
    /// Transitioning to hibernated — tasks frozen, pages being reclaimed.
    /// Reads of memory.hibernate_state return "hibernating" during this phase.
    Hibernating,
    /// All non-critical pages reclaimed — tasks still frozen, minimal RSS.
    Hibernated,
    /// Transitioning back to active — tasks unfrozen, pages being warmed.
    Thawing,
}

/// Per-cgroup hibernation configuration. Exposed via cgroupfs memory.* files.
pub struct CgroupHibernate {
    pub state: HibernateState,
    /// Priority: 0 = not eligible for hibernation; 1-100 = eligible
    /// (higher = hibernated sooner under pressure). Set by orchestrator
    /// (e.g., ActivityManager, systemd-oomd).
    pub priority: u8,
    /// Maximum bytes of MADV_CRITICAL pages allowed in this cgroup subtree.
    /// Default: 64 MB. Enforced at madvise(MADV_CRITICAL) time.
    pub critical_limit: u64,
    /// Bytes of critical pages currently wired in this cgroup subtree.
    pub critical_used: AtomicU64,
    /// Pages discarded (zero-fill) in last hibernation cycle.
    pub discarded_pages: AtomicU64,
    /// Pages compressed in last hibernation cycle.
    pub compressed_pages: AtomicU64,
    /// Pages swapped in last hibernation cycle.
    pub swapped_pages: AtomicU64,
}

Hibernation algorithm:

hibernate_cgroup(cgroup):

Phase 1 — Freeze (synchronous, <1ms for typical app):
  1. Write cgroup.freeze = 1 (kernel cgroup freezer, not SIGSTOP).
     This is transparent to the frozen tasks — they do not observe
     the freeze transition. Wait for all tasks to reach FROZEN state.
  2. Disable the cgroup's memory.max enforcement temporarily
     (prevents OOM kill racing with our intentional hibernation).

Phase 2 — Discard (synchronous, ~1-5ms for typical 256 MB DISCARDABLE region):
  3. Walk all VMAs in all tasks of the cgroup.
  4. For each VMA with DISCARDABLE flag:
     a. Walk PTEs. For each Present PTE:
        - Unmap the PTE (mark not-present, flush TLB).
        - If the page is exclusively owned (mapcount == 0 AND refcount == 1):
          free immediately. The mapcount check ensures no other PTE maps
          this page (CoW sibling or shared mapping); the refcount check
          ensures no in-flight I/O holds a reference (DMA, page cache
          writeback). Both checks use the page descriptor's atomic fields
          ([Section 4.2](#physical-memory-allocator--page-descriptor)).
        - If shared (mapcount > 0: CoW parent or mmap MAP_SHARED) or
          pinned (refcount > 1: in-flight DMA, get_user_pages):
          leave for LRU reclaim — hibernation cannot safely free pages
          with external references. The page remains resident until
          the normal reclaim path handles it (which may swap it out
          if the LRU scanner reaches it).
     b. Do NOT write shadow entries — discarded pages are intentionally
        zeroed on resume; no refault tracking needed.
  5. Flush TLB shootdowns in batch (one IPI burst for all CPUs that
     had the cgroup's tasks scheduled recently).

Phase 3 — Compress remaining anonymous pages (asynchronous, background):
  6. Queue all remaining anonymous (non-CRITICAL, non-DISCARDABLE) VMA
     pages for compression via the compress pool path
     ([Section 4.2](#physical-memory-allocator)).
     - This runs asynchronously: the cgroup is already FROZEN and using
       zero CPU, so background compression does not compete with foreground
       work.
     - Priority: lower than foreground compression jobs.
  7. Pages that do not compress below 75% of original size: queue for swap.

Phase 4 — State transition:
  8. When all non-CRITICAL pages are compressed or swapped:
     transition cgroup.hibernate_state to HIBERNATED.
  9. RSS of the cgroup at this point: only CRITICAL pages + kernel structures
     (task_struct, page tables — typically <512 KB per process).
  10. Re-enable memory.max enforcement.

Thaw (resume) algorithm:

thaw_cgroup(cgroup):

Phase 1 — Unfreeze (synchronous, <1ms):
  1. Write cgroup.freeze = 0. Tasks are immediately runnable.
  2. Transition state to THAWING.

Phase 2 — Warm prefetch (asynchronous, ~50-200ms for typical app):
  3. CRITICAL pages: already present — zero latency.
  4. DISCARDABLE pages: populated with zero-fill on first fault.
     No I/O required. Fault latency: ~1-3us per page (same as demand
     zero anonymous fault). Total for 256 MB: ~65ms worst case if all
     pages are faulted simultaneously; typical UI path <10ms.
  5. Compressed pages: decompressed on fault by the thread that touches
     them (inline decompression, ~1-2us per page, LZ4 from compress pool).
     Background prefetcher also reads ahead: when a task first faults a
     page from a compressed region, the prefetcher decompresses the next
     16 pages of that VMA in a background kworker.
  6. Swapped pages: swap-in on fault. Background swap prefetcher issues
     readahead for sequential access patterns.

Phase 3 — State transition:
  7. After all swap I/O has been issued (not necessarily completed):
     transition state to ACTIVE. The cgroup is now fully active.
     Remaining pages trickle in on demand.

Cgroupfs interface (all under the memory controller):

memory.hibernate_state    rw  "active" | "hibernating" | "hibernated" | "thawing"
                              Write "hibernating" to trigger hibernation.
                              Write "active" to trigger thaw.

memory.hibernate_priority rw  0-100. 0 = not eligible (default).
                              Orchestrator sets this for background app cgroups.
                              The OOM resolution Step 0
                              ([Section 4.5](#oom-killer--oom-killer-policy)) uses this to
                              select which cgroups to hibernate first.

memory.critical_limit     rw  Maximum bytes of MADV_CRITICAL pages.
                              Default: 67108864 (64 MB).

memory.critical_current   ro  Current bytes of MADV_CRITICAL pages wired.

memory.hibernate_stats    ro  Lines: discarded_pages, compressed_pages,
                              swapped_pages, thaw_faults. Reset on each
                              thaw cycle.

/proc/PID/smaps extensions:

Each VMA entry in /proc/PID/smaps gains two new fields:

MadvDiscardable: <kb>    # bytes of this VMA covered by MADV_DISCARDABLE
MadvCritical:    <kb>    # bytes of this VMA covered by MADV_CRITICAL

Performance targets:

Metric	Target	Basis
Freeze latency (Phase 1)	< 1ms	cgroup freezer is synchronous
Discard phase (Phase 2, 256 MB DISCARDABLE)	< 5ms	TLB shootdown + page free, no I/O
Memory freed (active -> hibernated, 512 MB app)	400-480 MB	80-95% of anon freed
Warm resume CRITICAL path	< 5ms	pages in RAM, just unfreeze
Warm resume first-paint (compressed pages)	< 200ms	LZ4 decompression on fault
Cold resume (all pages swapped)	500ms - 2s	swap I/O latency

Comparison with existing approaches:

Approach	Memory freed	App restart cost	App changes needed
Android LMKD (kill)	100% of app	Cold start: 1-3s	None
iOS Jetsam (OS-managed)	70-90%	Warm resume: 50-500ms	None (OS decides)
UmkaOS hibernation (no hints)	~80% (compress/swap)	Warm resume: 200ms-2s	None
UmkaOS hibernation (with hints)	85-95% (discard + compress)	Warm resume: <200ms	madvise() calls

Backwards compatibility:

madvise(MADV_DISCARDABLE) and madvise(MADV_CRITICAL) are new hints. Linux apps that do not call them work unchanged; hibernation falls back to compress-everything behavior (still better than kill).
memory.hibernate_state is a new cgroupfs attribute. Orchestrators written for vanilla Linux simply don't write to it; behavior is identical to Linux.
The cgroup freeze interface (cgroup.freeze) is compatible with Linux cgroup-v2 freeze semantics.
This mechanism is designed for adoption: if it proves effective, the MADV_DISCARDABLE / MADV_CRITICAL hints and the memcg attributes are straightforward to propose for upstream Linux inclusion.

4.6 Writeback Subsystem¶

4.6.1 Writeback Thread Organization¶

Dirty pages must eventually be written to their backing store (filesystem, block device). UmkaOS uses a per-device writeback model: each backing device (each block device / filesystem superblock) has its own writeback context with an associated workqueue task. This avoids the Linux problem of a single global flusher becoming a bottleneck on systems with many storage devices.

Writeback contexts:

/// Maximum number of pending writeback work items per BDI. Bounds the
/// `BdiWriteback.work_list` to prevent unbounded heap allocation under
/// sustained fsync() load. 128 is sufficient for the highest observed
/// fsync concurrency (database workloads: ~100 concurrent fsync waiters
/// per device, with the writeback thread draining ~50 items/sec).
pub const WORK_LIST_MAX: usize = 128;

/// Backing device information — the kernel's abstraction for any entity
/// that can receive writeback I/O. Every block device and every network
/// filesystem superblock that supports dirty-page writeback has exactly one
/// `BackingDevInfo`.
///
/// Created during device/superblock registration (`register_bdev()` for block
/// devices, `sget()` for filesystem superblocks), destroyed on unregister.
/// The VFS references it via `SuperBlock.s_bdi` and the block layer via
/// `BlockDevice.bdi`. Reclaim and writeback throttling operate on the BDI,
/// not on individual inodes — this ensures that a single slow device does
/// not starve writeback for other devices.
///
/// **Init ordering**: `bdi_init()` is called as part of block device registration
/// (Phase 4.5+ in boot sequence). The mount path (`do_mount()`) verifies
/// `sb.s_bdi.is_some()` before allowing dirty pages. `mark_inode_dirty()`
/// asserts `inode.sb.s_bdi.is_some()` — a BDI-less superblock cannot have
/// dirty inodes (this catches pseudo-filesystems like procfs/sysfs that should
/// never enter the writeback path).
pub struct BackingDevInfo {
    /// Writeback state (dirty lists, bandwidth estimation, worker thread).
    pub wb: BdiWriteback,
    /// Congestion flag — set when device queue is full. Checked by
    /// `balance_dirty_pages()` to throttle writers and by kswapd to
    /// avoid reclaiming pages whose backing device cannot accept I/O.
    pub congested: AtomicBool,
    /// Maximum number of in-flight I/O requests before congestion.
    /// Set from the block device's hardware queue depth at registration
    /// time; updated if the device reports a queue depth change.
    pub queue_depth: AtomicU32,
    /// Current number of in-flight write requests. Incremented when a
    /// bio is submitted to the device, decremented in the completion
    /// callback. Used by `bdi_write_congested()` to detect congestion.
    pub inflight_writes: AtomicU32,
    /// Readahead window size in pages. Tunable via
    /// `/sys/block/<dev>/queue/read_ahead_kb` (default: 128 KB = 32 pages
    /// on 4 KB page size). The readahead algorithm
    /// ([Section 4.4](#page-cache--readahead-engine)) uses this as the maximum window.
    pub ra_pages: u32,
    /// Associated block device. `None` for network filesystems (NFS, CIFS)
    /// and pseudo-filesystems (tmpfs, procfs) that perform writeback via
    /// their own I/O path rather than the block layer.
    pub bdev: Option<Arc<dyn BlockDeviceOps>>,
}

/// Per-backing-device writeback state. One per block device or filesystem
/// superblock that can have dirty pages.
///
/// Reference-counted handle to an `Inode`. Used in writeback dirty lists
/// (`b_dirty`, `b_io`, `b_more_io`, `b_dirty_time`) where the inode must
/// remain alive while the writeback subsystem holds a reference.
/// Internally an `Arc<Inode>` — the writeback subsystem never takes ownership.
pub type InodeRef = Arc<Inode>;

/// Reference-counted handle to a `BdiWriteback` instance. Stored in
/// `Inode.i_wb` to track which writeback context (and thus which cgroup)
/// an inode is attributed to. Multiple inodes in the same cgroup share
/// one `BdiWritebackRef`.
pub type BdiWritebackRef = Arc<BdiWriteback>;

/// Canonical home: embedded in `BackingDevInfo` (the kernel's abstraction
/// for any entity that can receive writeback I/O).
pub struct BdiWriteback {
    /// Back-pointer to the owning BDI.
    // SAFETY: BackingDevInfo outlives all BdiWriteback instances. On device
    // unregister, writeback is drained (inflight_io reaches zero) and all
    // inode BdiWritebackRef references are released before BDI is freed.
    // This ordering is enforced by the block device teardown sequence
    // ({ref:block-storage-layer}  <!-- UNRESOLVED -->). During live kernel evolution, the BDI
    // teardown ordering guarantee is preserved by the Evolvable component's
    // drain protocol — all in-flight writeback completes before the old
    // Evolvable image is unloaded ([Section 13.18](13-device-classes.md#live-kernel-evolution)).
    pub bdi: NonNull<BackingDevInfo>,

    /// Dirty inode lists (protected by list_lock):
    /// - b_dirty: inodes with dirty pages, not yet scheduled for writeback.
    ///   Ordered by dirtied_when (oldest first).
    /// - b_io: inodes currently scheduled for writeback in this cycle.
    ///   Moved from b_dirty when writeback starts.
    /// - b_more_io: inodes that could not be fully written in this cycle
    ///   (e.g., congestion, partial write). Retried in the next cycle.
    /// - b_dirty_time: inodes with only dirty timestamps (I_DIRTY_TIME),
    ///   written lazily (dirtytime_expire_interval, default 12 hours).
    ///
    /// **Design rationale**: Inode-level dirty lists enable O(inodes) iteration
    /// for periodic writeback (scan dirty inodes, write their pages). Page-level
    /// XArray DIRTY tags enable O(dirty_pages) iteration for range writeback
    /// (e.g., `sync_file_range()`). Both are needed for different access patterns.
    /// Intrusive lists using `Inode::i_wb_link` (IntrusiveLink embedded in each
    /// inode). Zero per-insertion allocation — no heap allocation under list_lock.
    pub b_dirty: IntrusiveList<Inode>,
    pub b_io: IntrusiveList<Inode>,
    pub b_more_io: IntrusiveList<Inode>,
    pub b_dirty_time: IntrusiveList<Inode>,
    pub list_lock: SpinLock<()>,

    /// Number of dirty pages owned by this writeback context.
    /// Incremented by `__set_page_dirty()` ([Section 4.4](#page-cache)).
    /// Decremented by `end_page_writeback()` on writeback success
    /// (step 11 of the writeback completion protocol below).
    /// Used by balance_dirty_pages() to compute per-BDI dirty position and
    /// by the global dirty throttle to calculate proportional limits.
    pub nr_dirty: AtomicU64,

    /// Bandwidth estimation (for dirty throttling).
    /// Updated every BANDWIDTH_INTERVAL (200ms).
    pub write_bandwidth: AtomicU64,      // bytes/sec, smoothed
    pub avg_write_bandwidth: AtomicU64,  // long-term average
    pub dirty_ratelimit: AtomicU64,      // pages/sec allowed for dirtiers
    pub balanced_dirty_ratelimit: AtomicU64,

    /// Workqueue-based writeback. NOT a dedicated thread — uses the
    /// system workqueue with delayed scheduling. The delayed_work fires
    /// every dirty_writeback_interval (default 5 seconds, configurable
    /// via /proc/sys/vm/dirty_writeback_centisecs).
    pub dwork: DelayedWork,

    /// Work items queued for this BDI (from sync(), fsync(), memory pressure).
    /// Bounded to `WORK_LIST_MAX` entries (128) with backpressure: callers
    /// block on `work_list_wq` when the queue is full. This prevents unbounded
    /// heap allocation under sustained fsync() load (e.g., database workloads
    /// with 100K transactions/sec). The ArrayVec is contiguous and cache-friendly.
    ///
    /// **Backpressure**: When `work_list` is full, `queue_writeback_work()` adds
    /// the calling task to `work_list_wq` (WaitQueue) and sleeps until a slot
    /// is freed by `writeback_single_inode()` completing a work item.
    pub work_list: SpinLock<ArrayVec<WritebackWork, WORK_LIST_MAX>>,
    /// WaitQueue for callers blocked on a full work_list.
    pub work_list_wq: WaitQueue,

    /// Cgroup ID for this writeback context (cgroup v2 io controller).
    /// For the root BdiWriteback (bdi.wb), this is 0 (root cgroup).
    /// For per-cgroup BdiWriteback instances, this identifies the cgroup
    /// whose dirty pages this wb is responsible for writing.
    pub cgroup_id: CgroupId,
}

**Per-cgroup writeback attribution** (cgroup v2 `io.max` enforcement):

Each inode tracks which cgroup's writeback context it belongs to via
`Inode.i_wb: Option<BdiWritebackRef>`. When a process first dirties an inode,
`inode_attach_wb()` binds the inode to the process's cgroup's BdiWriteback:

```rust
/// Attach an inode to the appropriate per-cgroup BdiWriteback.
/// Called lazily on the first dirty-page write to this inode.
/// The inode is attributed to the writing process's cgroup for the
/// duration of its dirty lifetime (until all dirty pages are written).
///
/// For overlayfs upper inodes: the inode inherits the writing process's
/// cgroup, not the overlayfs mount's cgroup. This ensures that `io.max`
/// limits are enforced against the container performing the write, not
/// the host overlayfs mount. Without this, a container writing through
/// overlayfs could bypass its io.max limit entirely.
fn inode_attach_wb(inode: &Inode, page: &Page) {
    if inode.i_wb.is_some() {
        return; // already attached
    }
    let css = current_task().cgroup.io_css();
    let wb = bdi_get_or_create_wb(inode.sb.bdi, css.cgroup_id);
    inode.i_wb.store(Some(wb), Release);
}

The balance_dirty_pages() throttling function uses inode.i_wb to enforce per-cgroup dirty limits. If inode.i_wb is None (inode not yet dirtied), the root BdiWriteback is used as a fallback.

/// A writeback work item — describes what needs to be written. pub struct WritebackWork { /// Maximum pages to write in this work item. pub nr_pages: i64, /// Only write inodes on this superblock (None = all). pub sb: Option>, /// Sync mode: None (background), WB_SYNC_ALL (fsync/sync). pub sync_mode: WritebackSyncMode, /// Reason this writeback was initiated. pub reason: WritebackReason, /// Range-cyclic: start where the last writeback left off. pub range_cyclic: bool, /// For kupdate: only write pages dirty for longer than /// dirty_expire_interval (default 30 seconds). pub for_kupdate: bool, /// For background writeback: triggered by crossing dirty_background_ratio. pub for_background: bool, }

[repr(u8)]¶

pub enum WritebackReason { /// Background writeback (dirty ratio exceeded background threshold). Background = 0, /// Memory reclaim needs pages freed. VmScan = 1, /// Explicit sync() or fsync(). Sync = 2, /// Periodic kupdate (dirty_writeback_interval timer). Periodic = 3, /// Laptop mode (aggressive writeback to spin down disk). LaptopTimer = 4, /// Filesystem needs free space. FsFreeSpace = 5, }

### Writeback Domain Crossing (Tier 0 -> Tier 1)

The page cache and `BdiWriteback` state live in Core (Tier 0). Filesystem drivers that
implement `AddressSpaceOps::writepage()` / `writepages()` run in Tier 1 (isolated via
MPK/POE). The writeback thread must cross the domain boundary to invoke the filesystem's
writepage implementation.

UmkaOS uses a **writeback request ring** for this crossing. The workqueue-based writeback
thread (Tier 0) posts batch requests to the VFS driver's KABI ring; the VFS driver
(Tier 1) processes them and posts completions back.

```rust
/// A writeback request message posted from Tier 0 (writeback workqueue)
/// to a Tier 1 VFS driver's KABI command ring.
///
/// One request per inode batch. The writeback thread collects dirty pages
/// for an inode (via the `writeback_inode_pages()` algorithm described above),
/// then posts a single `WritebackRequest` covering the contiguous range.
/// For non-contiguous dirty ranges, multiple requests are posted (one per
/// contiguous segment).
///
/// **Two-level batching**: This ring batch minimizes domain switch cost
/// (32-128 pages per ring submission, one Tier 0→Tier 1 crossing per batch).
/// The bio layer ([Section 15.2](15-storage.md#block-io-and-volume-management)) performs a separate
/// merge: contiguous blocks are merged into multi-segment bios for device-
/// level I/O efficiency. These are complementary, not redundant.
///
/// `#[repr(C)]` ensures a stable ABI layout for the KABI ring message.
/// Total size: 48 bytes with #[repr(C)] layout:
///   inode_id(8) + sb_dev(4) + _pad1(4) + offset_pages(8) + dma_base(8)
///   + nr_pages(4) + sync_mode(4) + _pad(8) = 48.
/// Note: `DmaHandle` is an alias for `DmaAddr` (u64); see [Section 4.14](#dma-subsystem).
#[repr(C)]
pub struct WritebackRequest {
    /// Inode identifier (stable across crash/reload).
    pub inode_id: u64,
    /// Device ID of the inode's superblock ([Section 14.5](14-vfs.md#device-node-framework)).
    /// Identifies the target block device so that Core (Tier 0) can resolve
    /// the device for crash recovery bypass without needing VFS-domain state.
    /// Core maintains `DEVICE_REGISTRY: XArray<Arc<dyn BlockDeviceOps>>` keyed
    /// by `DevId`. DevId is 4 bytes (u32 encoded major:minor).
    pub sb_dev: DevId,
    /// Explicit padding between sb_dev (u32, 4 bytes) and offset_pages (u64, 8 bytes)
    /// to satisfy u64 alignment. Without this, repr(C) inserts 4 bytes of implicit
    /// padding, which leaks uninitialized kernel memory across the ring boundary.
    pub _pad1: [u8; 4],
    /// Starting page offset within the inode's address space.
    pub offset_pages: u64,
    /// Base DMA handle for the page data region. The writeback thread
    /// (`dma_map_page()` per page in the batch) maps dirty pages into a
    /// contiguous PKEY_SHARED region before posting this request. The Tier 1
    /// VFS driver reads page data at `dma_base + (page_offset * PAGE_SIZE)`
    /// for each page in the range `[offset_pages, offset_pages + nr_pages)`.
    /// The mapping uses `DMA_TO_DEVICE` direction — the Tier 1 driver reads
    /// (not writes) the page data. Tier 0 unmaps after receiving the
    /// corresponding `WritebackResponse`. For architectures without coherent
    /// DMA (ARMv7, some RISC-V), `dma_map_page()` performs a cache clean
    /// before the domain switch.
    pub dma_base: DmaHandle,
    /// Number of contiguous dirty pages to write back.
    /// Clamped from `WritebackControl.nr_to_write` (i64). Values exceeding
    /// `u32::MAX` (including `i64::MAX` "write everything") are clamped to
    /// `u32::MAX` (~16 TB at 4KB pages), sufficient for any single inode.
    pub nr_pages: u32,
    /// Synchronization mode for this writeback request.
    pub sync_mode: WritebackSyncMode,
    /// Reserved for future fields. Must be zero-initialized. Without this
    /// field, the struct would be 40 bytes (8-byte aligned, zero tail
    /// padding). The 8 bytes of explicit padding ensure ABI stability at
    /// 48 bytes for forward-compatible field additions. Making padding
    /// explicit prevents kernel info leaks to the KABI ring.
    pub _pad: [u8; 8],
}
const_assert!(core::mem::size_of::<WritebackRequest>() == 48);

/// Writeback synchronization mode. Determines how the writeback thread
/// and the VFS driver coordinate completion signaling.
#[repr(u32)]
pub enum WritebackSyncMode {
    /// No synchronization — fire-and-forget. Used for background writeback
    /// when dirty_background_ratio is exceeded. The writeback thread does
    /// not wait for completion before moving to the next inode batch.
    /// Named `Background` instead of `None` to avoid shadowing Rust's
    /// `Option::None` in `use WritebackSyncMode::*` contexts.
    /// Linux equivalent: `WB_SYNC_NONE`.
    Background = 0,
    /// Normal synchronization — the writeback thread tracks completion
    /// but does not block the dirtying process. Used for periodic kupdate
    /// writeback (dirty_writeback_centisecs timer). Completion updates
    /// bandwidth estimation counters.
    Normal = 1,
    /// Synchronous wait — the caller (fsync/sync) blocks until all pages
    /// in this request have been written to stable storage and the VFS
    /// driver has posted a `WritebackResponse`. Used for `fsync()`,
    /// `fdatasync()`, and `sync()` system calls.
    Wait = 2,
}

/// Per-invocation writeback control block, passed from the writeback thread
/// (or sync/fsync caller) through `writeback_inode_pages()` down to each
/// `AddressSpaceOps::writepage()` / `writepages()` implementation.
///
/// This struct is **not** an ABI type — it lives entirely within Tier 0
/// (core writeback) and Tier 1 (VFS driver) address spaces. It is never
/// serialised onto a KABI ring. The ring-level equivalent is
/// `WritebackRequest` (above), which carries only the subset needed for
/// cross-domain messaging.
///
/// Callers initialise `WritebackControl` before invoking writeback;
/// callees update `pages_written` and may inspect all other fields to
/// make I/O scheduling decisions (e.g., whether to submit async bios).
pub struct WritebackControl {
    /// Synchronization mode governing how completion is signaled.
    /// `WritebackSyncMode::Background` — background, fire-and-forget.
    /// `WritebackSyncMode::Normal` — periodic kupdate, track bandwidth.
    /// `WritebackSyncMode::Wait` — fsync/sync, block until stable storage.
    pub sync_mode: WritebackSyncMode,

    /// Maximum number of pages the caller wants written in this pass.
    /// Decremented by `writeback_inode_pages()` as pages are dispatched.
    /// When this reaches 0 the writeback loop stops and records
    /// `cyclic_start` for the next invocation.
    /// Set to `i64::MAX` for "write everything dirty".
    pub nr_to_write: i64,

    /// Pages actually written during this pass. Updated by the writeback
    /// loop (not by the filesystem). Callers inspect this after return to
    /// update bandwidth estimation counters.
    pub pages_written: u64,

    /// Byte-range start (inclusive). Limits writeback to pages overlapping
    /// `[range_start, range_end]`. Set to 0 for whole-file writeback.
    /// `fdatasync()` / `sync_file_range()` narrow this to the dirty extent.
    pub range_start: u64,

    /// Byte-range end (inclusive). Set to `u64::MAX` for whole-file writeback.
    pub range_end: u64,

    /// Range-cyclic mode: resume where the previous writeback pass stopped
    /// instead of always starting at page index 0. Used by background and
    /// kupdate writeback to spread I/O evenly across large files.
    pub range_cyclic: bool,

    /// Resume index for range-cyclic mode. Set by the writeback loop when
    /// `nr_to_write` is exhausted; read on the next invocation as the
    /// starting page index. Reset to 0 when the iterator wraps around.
    pub cyclic_start: u64,

    /// Periodic writeback initiated by the kupdate timer
    /// (`dirty_writeback_centisecs`). When `true`, only pages that have been
    /// dirty longer than `dirty_expire_centisecs` are eligible.
    pub for_kupdate: bool,

    /// Background writeback triggered by crossing `dirty_background_ratio`.
    /// The writeback thread writes until the dirty page count drops below
    /// the background threshold, then stops.
    pub for_background: bool,

    /// Writeback initiated by the page reclaimer (`kswapd` / direct reclaim).
    /// When `true`, the writeback path avoids any allocation that could
    /// re-enter reclaim (no `GFP_KERNEL`, no unbounded bio chains).
    pub for_reclaim: bool,

    /// When `true`, only write pages that were tagged `TOWRITE` at the
    /// start of the writeback pass. Pages dirtied after tagging are skipped,
    /// preventing unbounded writeback loops where new dirty pages are
    /// continuously appended. Used by `sync()` and `fsync()` to guarantee
    /// a finite write set.
    pub tagged_writepages: bool,
}

/// Completion response from Tier 1 VFS driver back to Tier 0.
/// Posted on the KABI completion ring after the filesystem has submitted
/// all I/O for the requested pages.
///
/// **Partial write correlation**: Tier 0 correlates partial writes by
/// checking which pages still have `PageFlags::WRITEBACK` set after
/// processing this response. Pages with `nr_written < original nr_pages`
/// will have WRITEBACK cleared only for the successfully written prefix;
/// remaining pages retain WRITEBACK and are re-submitted on the next
/// writeback cycle. Tier 0 does not need the original `nr_pages` in
/// the response because the per-page WRITEBACK flag is the tracking
/// mechanism.
///
/// `#[repr(C)]` for stable ABI layout on the KABI ring.
/// Total size: 32 bytes (8 + 8 + 4 + 4 + 4 + 4).
#[repr(C)]
pub struct WritebackResponse {
    /// Inode identifier (matches the request).
    pub inode_id: u64,
    /// Starting page offset (matches the request's `offset_pages`).
    pub offset_pages: u64,
    /// Number of pages successfully written. u32 because this is a
    /// per-request count bounded by `WritebackRequest.nr_pages` (also u32),
    /// not a monotonic counter. A single writeback request will never
    /// write more than u32::MAX pages.
    pub nr_written: u32,
    /// Error code: 0 on success, negative errno on failure (e.g., -EIO).
    /// On error, the writeback thread marks the affected pages with
    /// `PageFlags::ERROR` and propagates via the writeback error chain
    /// (`AddressSpace.wb_err`).
    pub error: i32,
    /// Explicit padding after `error` to maintain 32-byte struct size.
    /// Must be zero-initialized to prevent kernel info leak.
    pub _pad: [u8; 8],
}
const_assert!(core::mem::size_of::<WritebackResponse>() == 32);

Crash recovery bypass (VFS crash -> direct block write):

When a Tier 1 VFS driver crashes or is restarting, the writeback thread cannot post WritebackRequest messages to the VFS KABI ring — the ring consumer is gone. Core (Tier 0) detects the crash via the KABI health monitor (missed heartbeat or ring overflow) and activates the crash recovery bypass path:

Core reads the crashed VFS driver's BdiWriteback dirty lists (b_dirty, b_io, b_more_io) — these are Tier 0 owned structures, accessible without the VFS domain.
For each dirty inode, Core iterates DirtyIntentList.entries.
For committed entries (block_addr is Some): Core issues direct block I/O via entry.block_dev.submit_bio() (or, if block_dev is None, resolves the device from entry.sb_dev via DEVICE_REGISTRY). This writes dirty page data directly to the block device, bypassing the VFS driver entirely.
For Phase 1 entries (block_addr is None): these have dirty pages but no committed block address. Core cannot flush them. They are flagged as "potentially inconsistent" — the filesystem's journal will handle them on the next mount after the VFS driver reloads (Section 14.1).
After all committed extents are flushed, Core clears the WRITEBACK flags on the affected pages and proceeds with VFS driver reload.

This bypass path ensures that committed dirty data is never lost due to a VFS driver crash. The sb_dev field in both WritebackRequest and DirtyIntentEntry (Section 14.1) is the critical link that lets Core resolve the block device without VFS cooperation.

Writeback domain crossing sequence:

Tier 0 (writeback workqueue):
  1. Lock inode AddressSpace::writeback_lock (shared).
  2. Walk dirty page range, set PageFlags::WRITEBACK on each page.
  3. DMA-map dirty pages into PKEY_SHARED region (zero-copy; pages are pinned by
     WRITEBACK flag). The Tier 1 VFS driver reads page data at `dma_base +
     (page_offset * PAGE_SIZE)` without a memory copy.
  4. Post WritebackRequest { inode_id, sb_dev, offset_pages, nr_pages, sync_mode }
     to the VFS driver's KABI command ring.
  5. Domain switch: Tier 0 -> Tier 1 (~23 cycles).

Tier 1 (VFS driver):
  6. Dequeue WritebackRequest from KABI command ring.
  7. Call AddressSpaceOps::writepages() for the inode range.
     - The filesystem reads page data from the KABI shared buffer
       (PKEY_SHARED, readable by both tiers).
     - Submits block I/O to the block layer (bio_submit via KABI ring
       to the block device driver).
  8. On I/O completion: post WritebackResponse { inode_id, offset_pages,
     nr_written, error } to KABI completion ring.
  9. Domain switch: Tier 1 -> Tier 0.

Tier 0 (writeback completion — **sole owner of DIRTY→clean for Tier 1**):
  10. Dequeue WritebackResponse.
  11. For each successfully written page:
      - Clear PageFlags::WRITEBACK and PageFlags::DIRTY.
      - Decrement AddressSpace.nr_dirty.
      **Counter ownership**: For Tier 1 filesystems, this step is the SOLE
      owner of the `DIRTY → clean` transition and `nr_dirty` decrement.
      `writeback_end_io()` is NOT called for Tier 1 writeback bios — the
      bio completion runs inside the Tier 1 domain, and the Tier 0 response
      processing here handles all page cache updates. For Tier 0 (in-kernel)
      filesystems, `writeback_end_io()` owns the transition instead (this
      step is not reached). This single-owner-per-tier design prevents the
      double-decrement bug that would occur if both paths cleared DIRTY.
      **Counter maintenance**: This step decrements `AddressSpace.nr_dirty`
      (balancing the increment in `__set_page_dirty()`). If this decrement
      is omitted, `balance_dirty_pages()` will eventually throttle all
      writes to zero.
  12. For error pages:
      a. Increment `PageExt.wb_fail_count` (u8, stored in the per-page
         extension array, [Section 4.2](#physical-memory-allocator--pageextarray--per-page-extension-metadata)).
      b. Set `PageFlags::ERROR` on the page. Propagate to
         `AddressSpace.wb_err` via `mapping_set_error()` (see Writeback
         Error Propagation below). This step runs unconditionally for all
         error pages — both retryable and permanent.
      c. If `wb_fail_count >= 3`: set `PageFlags::PERMANENT_ERROR` on
         the page. Do NOT re-dirty — the page is permanently excluded
         from writeback. Emit FMA event
         `FaultEvent::WritebackPermanentError { inode, page_offset }`.
         The page remains dirty (`PG_DIRTY` stays set) so that
         `fsync()` returns `-EIO`.
      d. If `wb_fail_count < 3`: re-set `PageFlags::DIRTY` so the page
         remains in the writeback pipeline for retry on the next cycle
         (unless the filesystem has been remounted read-only, in which
         case the page is NOT re-dirtied). Clear `PageFlags::ERROR` on
         the page after the propagation to `wb_err` in step 12b — ERROR
         is transient per-writeback-cycle and is only used to propagate
         errors to `fsync()`. The next writeback attempt starts clean.
  13. If sync_mode == Wait: wake the blocked fsync()/sync() caller.
  14. Release writeback_lock.

Batching and amortization: The domain switch cost (~23 cycles per crossing) is amortized over the entire writeback batch. A typical writepages() call writes 32-128 pages per batch, so the per-page overhead is <1 cycle. For synchronous writeback (fsync()), the domain switch is negligible compared to the storage I/O latency (~10us for NVMe, ~5ms for HDD).

Writeback Error Propagation:

Bio completion errors arrive asynchronously in IRQ context (completion workqueue) after the writeback thread has moved on to the next inode. To avoid losing errors, each AddressSpace carries a generation-counter error field:

/// Writeback error sequence counter. Uses the canonical `ErrSeq` type defined in
/// [Section 14.4](14-vfs.md#vfs-fsync-and-cow) — a two-field structure
/// with a monotonic `seq: AtomicU32` (bits [31:1] = counter, bit [0] = "seen"
/// flag) and a `last_errno: AtomicI32` for reporting. Each open fd snapshots the
/// current `ErrSeq` value; `fsync()` calls `ErrSeq::check_and_advance()` to
/// detect errors since the snapshot.
pub wb_err: ErrSeq,

Error flow: 1. Bio completion in IRQ context: on I/O error, call mapping_set_error(mapping, errno): wb_err.set_err(errno) (increments seq counter by 2, clears "seen" bit, stores errno). Page lifetime guarantee: The bio holds an Arc<Page> reference (incremented by the writeback thread before bio submission). This prevents the page from being freed between bio submission and IRQ completion — even if the writeback thread moves to the next inode and the page is removed from the page cache by a concurrent truncate_inode_pages(). The bio completion handler decrements the Arc after clearing PG_WRITEBACK and updating wb_err. 2. PageFlags::ERROR is set on the affected page (informational — not used for error delivery to userspace). 3. fsync(file): filemap_check_errors(file) calls mapping.wb_err.check_and_advance(&mut file.f_wb_err) — compares the current ErrSeq generation against the fd's snapshot (taken at open() or last successful fsync()). If the generation has advanced with a non-zero errno, returns that errno and updates the fd's snapshot. See Section 14.4. 4. I_WRITEBACK flag is NOT used for error tracking — it tracks I/O in-flight state only. Error state is entirely in wb_err (ErrSeq).

This guarantees that: (a) no error is silently lost even if the writeback thread has moved to a different inode, (b) each fsync() caller sees errors that occurred since its last successful check, (c) multiple concurrent fsync() callers each get the error independently (counter-based, not flag-based).

End-to-end error propagation from writeback to fsync():

Bio completion sets bio.status = -EIO (or other I/O error).
end_page_writeback() checks bio.status; if error, calls mapping_set_error(mapping, error) which atomically increments mapping.wb_err (ErrSeq counter).
fsync() calls filemap_check_errors(mapping, &file.f_wb_err). If mapping.wb_err != file.f_wb_err, the error is returned and file.f_wb_err is advanced.
Each open fd sees each error exactly once. The error code from the original bio is preserved through the entire chain.

Dirty page thresholds and throttling:

The kernel maintains global dirty page limits to prevent unbounded dirty memory accumulation (which would cause massive write bursts and memory pressure):

/proc/sys/vm/dirty_background_ratio  = 10    (% of total memory)
/proc/sys/vm/dirty_ratio             = 20    (% of total memory)
/proc/sys/vm/dirty_writeback_centisecs = 500  (5 seconds — periodic timer)
/proc/sys/vm/dirty_expire_centisecs  = 3000   (30 seconds — max dirty age)

Threshold	Dirty pages below	Action
Background ratio (10%)	< 10% of RAM dirty	No writeback, no throttling
Background ratio (10%)	>= 10% of RAM dirty	Wake writeback workqueue (async)
Dirty ratio (20%)	>= 20% of RAM dirty	Throttle dirtiers — `balance_dirty_pages()` forces writing processes to sleep
Freerun	< (background + dirty) / 2	No throttling at all

balance_dirty_pages() throttling algorithm:

When a process dirties pages, the VFS calls balance_dirty_pages(). If the global dirty page count exceeds the freerun threshold, the function calculates a pause duration proportional to how far above the threshold the dirty count is. The process sleeps for pause milliseconds (max 200ms per sleep), allowing writeback to catch up. The throttling is smooth (not step-function) — the pause increases gradually as dirty pages approach dirty_ratio.

/// Throttle a page-dirtying process to keep dirty pages within limits.
/// Called from __set_page_dirty() when a page transitions to dirty.
fn balance_dirty_pages(wb: &BdiWriteback, pages_dirtied: u64) {
    let thresh = global_dirty_thresh();
    let bg_thresh = global_dirty_background_thresh();
    let dirty = nr_dirty_pages();

    if dirty < (bg_thresh + thresh) / 2 {
        return; // Freerun — no throttling needed.
    }

    if dirty >= bg_thresh {
        // Wake the writeback workqueue to start flushing.
        wb_wakeup(wb);
    }

    // Per-BDI congestion check: if this device's queue is full, throttle
    // more aggressively regardless of global dirty ratio. This prevents a
    // single slow device from consuming the entire dirty budget while its
    // queue backs up.
    let bdi_congested = wb.bdi.congested.load(Relaxed);
    let bdi_limit = if bdi_congested {
        // Congested: reduce this BDI's fair share to 50%, pushing writers
        // toward other devices or blocking them earlier.
        wb.bdi_dirty_limit() / 2
    } else {
        wb.bdi_dirty_limit()
    };
    let bdi_dirty = wb.nr_dirty.load(Relaxed);

    if dirty >= thresh || bdi_dirty >= bdi_limit {
        // Calculate proportional pause. The closer to thresh, the longer.
        let pos_ratio = pos_ratio_calc(dirty, thresh, bg_thresh);
        // Adjust for per-BDI congestion: if over bdi_limit, increase pause.
        let bdi_ratio = if bdi_dirty >= bdi_limit {
            pos_ratio / 2 // Double the effective pause for congested BDI.
        } else {
            pos_ratio
        };
        // Guard against divide-by-zero: pos_ratio can reach 0 when dirty
        // pages are at or above the limit. Linux `mm/page-writeback.c`
        // `balance_dirty_pages()` explicitly guards:
        //   if (unlikely(task_ratelimit == 0)) { pause = max_pause; goto pause; }
        // UmkaOS matches this: when bdi_ratio is zero, sleep for MAX_PAUSE.
        if bdi_ratio == 0 {
            sleep_ms(MAX_PAUSE);
        } else {
            let pause_ms = (BANDWIDTH_INTERVAL / bdi_ratio).min(MAX_PAUSE);
            sleep_ms(pause_ms); // MAX_PAUSE = 200ms
        }
    }
}

Per-BDI bandwidth estimation: Each BdiWriteback tracks its write bandwidth to proportionally distribute the global dirty limit across multiple devices. A fast NVMe device gets a larger share of the dirty budget than a slow USB stick. The bandwidth is estimated every 200ms from the writeback completion rate.

BdiWriteback.nr_dirty counter lifecycle: BdiWriteback.nr_dirty: AtomicU64 tracks the total number of dirty pages across all inodes attributed to this writeback context. The full increment/decrement lifecycle:

Increment: __set_page_dirty() → fetch_add(1, Relaxed) when a page transitions from clean to dirty (first dirty mark only; re-dirtying a page that is already PG_DIRTY does not increment).
Decrement on success: end_page_writeback() → fetch_sub(1, Relaxed) when writeback completes successfully and PG_DIRTY is cleared (step 11).
No decrement on error: When writeback fails and the page is re-dirtied (step 12c), nr_dirty is NOT decremented — the page remains dirty and the count is still accurate.
Decrement on truncate: truncate_inode_pages() calls cancel_dirty_page() which decrements nr_dirty for each dirty page being removed from the page cache.

No lock is needed — the counter uses atomic ops only. DirtyIntentList operations (Section 14.1) use i_rwsem for list integrity; the nr_dirty update is separate and independent. Global nr_dirty_pages() sums per-BDI wb.nr_dirty values (no global lock — each BDI counter is independently atomic).

Inode dirty state flags:

The VFS Inode.i_state: AtomicU32 field tracks writeback-relevant state using the following bit flags. These flags are manipulated atomically (CAS loops) and drive the dirty inode list placement and writeback scheduling decisions.

InodeStateFlags is defined canonically in Section 14.1 (VFS owns the Inode struct). The flags relevant to writeback scheduling are:

Flag	Bit	Writeback meaning
`I_DIRTY_SYNC`	1	Metadata dirty — placed on `b_dirty`
`I_DIRTY_DATASYNC`	2	Data-bearing metadata dirty — placed on `b_dirty`
`I_DIRTY_PAGES`	3	Page cache pages dirty — placed on `b_dirty`
`I_FREEING`	4	Being evicted — `mark_inode_dirty()` is no-op
`I_DIRTY_TIME`	5	Timestamps dirty — placed on `b_dirty_time`
`I_WRITEBACK`	7	Writeback in progress — skip re-queue

Combined masks: I_DIRTY = I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES (any non-timestamp dirty). I_DIRTY_ALL = I_DIRTY | I_DIRTY_TIME (any dirty including timestamps).

writeback_single_inode() — single-inode writeback entry point:

Referenced by the inode eviction path, fsync(), and the writeback engine. Handles the full lifecycle of writing back one inode's dirty pages and metadata.

/// Write back all dirty data and metadata for a single inode.
///
/// # Arguments
///
/// * `inode` — The inode to write back.
/// * `wbc` — Writeback control (sync mode, page budget, range).
///
/// # Algorithm
///
/// 1. Check SB_RDONLY: if the superblock is read-only, return immediately
///    (no I/O to a read-only filesystem). This prevents writeback after
///    emergency remount-ro.
///
/// 2. Atomically transition I_DIRTY → I_WRITEBACK via CAS loop on
///    `inode.i_state`. If I_WRITEBACK is already set (another thread is
///    writing this inode), behavior depends on sync mode:
///    - WB_SYNC_NONE: skip this inode (return Ok(0)), let the other
///      writer finish.
///    - WB_SYNC_ALL: spin-wait on I_WRITEBACK clear (the other thread
///      will complete writeback; we then re-check dirtiness).
///
/// 3. Dirty page writeback: if I_DIRTY_PAGES was set, call
///    `ops.writepages(mapping, wbc)` if the filesystem implements it.
///    Fallback: iterate dirty pages via `writeback_inode_pages()` (see
///    above) and call `ops.writepage()` per page.
///
/// 4. Metadata writeback: if I_DIRTY_SYNC or I_DIRTY_DATASYNC was set,
///    call `ops.write_inode(inode, wbc)` to flush inode metadata to disk.
///
/// 5. Inode list management:
///    a. If all dirty flags cleared and no pages remain in writeback
///       (`mapping.nrwriteback == 0`): atomically clear I_WRITEBACK
///       from i_state. Remove inode from b_io. Inode is now clean.
///    b. If inode was re-dirtied during writeback (I_DIRTY_* set while
///       I_WRITEBACK still active): clear I_WRITEBACK, move inode from
///       b_io back to b_dirty (re-queued for next writeback cycle).
///    c. If writeback is partial (congestion, budget exhaustion): clear
///       I_WRITEBACK, move inode to b_more_io for retry next cycle.
///
/// 6. Error handling: if `writepages()` / `writepage()` / `write_inode()`
///    returns an error, record it in `mapping.wb_err` via
///    `mapping_set_error()`. The I_WRITEBACK flag is still cleared
///    (we don't retry automatically — the next writeback cycle will
///    re-attempt). The error propagates to `fsync()` callers via
///    `file_check_and_advance_wb_err()`.
///
/// # Returns
///
/// `Ok(pages_written)` on success (including partial success where some
/// pages were written before an error on a later page).
/// `Err(errno)` if writeback failed entirely (e.g., SB_RDONLY, I/O error
/// on first page).
fn writeback_single_inode(
    inode: &Inode,
    wbc: &mut WritebackControl,
) -> Result<u64, Errno> {
    // 1. Read-only check.
    if inode.i_sb.flags.load(Acquire) & SB_RDONLY != 0 {
        return Err(Errno::EROFS);
    }

    // 2. CAS: I_DIRTY → I_WRITEBACK.
    // Declare `old` outside the loop so its value is available after break.
    let mut old;
    loop {
        old = inode.i_state.load(Acquire);
        if old & I_WRITEBACK != 0 {
            if wbc.sync_mode == WB_SYNC_NONE {
                return Ok(0); // Another writer active, skip.
            }
            // WB_SYNC_ALL: wait for the other writer to finish.
            wait_on_inode_writeback(inode);
            // Re-check: if inode is now clean, nothing to do.
            if inode.i_state.load(Acquire) & I_DIRTY_ALL == 0 {
                return Ok(0);
            }
            continue; // Retry CAS.
        }
        let new = (old & !I_DIRTY_ALL) | I_WRITEBACK;
        if inode.i_state.compare_exchange_weak(
            old, new, AcqRel, Acquire
        ).is_ok() {
            break;
        }
    }

    // Use the pre-CAS `old` value, not a fresh load. Re-reading i_state here
    // would introduce a TOCTOU race: another thread could clear dirty bits
    // between our successful CAS and the re-read.
    let dirty = old;
    let mapping = &inode.i_mapping;
    let mut pages_written: u64 = 0;

    // 3. Dirty pages.
    if dirty & I_DIRTY_PAGES != 0 {
        match mapping.ops.writepages(mapping, wbc) {
            Ok(n) => pages_written += n,
            Err(e) if e.errno() == Errno::EOPNOTSUPP => {
                pages_written += writeback_inode_pages(mapping, wbc)?;
            }
            Err(e) => {
                mapping_set_error(mapping, e);
                // Clear I_WRITEBACK before returning.
                inode.i_state.fetch_and(!I_WRITEBACK, Release);
                return Err(e);
            }
        }
    }

    // 4. Metadata.
    if dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC) != 0 {
        if let Err(e) = mapping.ops.write_inode(inode, wbc) {
            mapping_set_error(mapping, e);
            inode.i_state.fetch_and(!I_WRITEBACK, Release);
            return Err(e);
        }
    }

    // 5. List management (under wb.list_lock).
    let current = inode.i_state.load(Acquire);
    if current & I_DIRTY_ALL != 0 {
        // Re-dirtied during writeback → back to b_dirty.
        inode.i_state.fetch_and(!I_WRITEBACK, Release);
        wb_move_inode_to_dirty(inode);
    } else if mapping.nrwriteback.load(Relaxed) > 0 && wbc.sync_mode == WB_SYNC_NONE {
        // Pages still in flight, partial writeback → b_more_io.
        inode.i_state.fetch_and(!I_WRITEBACK, Release);
        wb_move_inode_to_more_io(inode);
    } else {
        // Clean.
        inode.i_state.fetch_and(!I_WRITEBACK, Release);
        wb_remove_inode_from_lists(inode);
    }

    Ok(pages_written)
}

Inode dirty state transitions:

                       mark_inode_dirty(I_DIRTY_PAGES)
    +----------+     ---------------------------------------->     +--------------+
    |  CLEAN   |                                                   |  I_DIRTY_*   |
    | (i_state |     <----------------------------------------     | on b_dirty   |
    |  = 0)    |       writeback completes, all clean              | or b_dirty_  |
    +----------+                                                   |    time      |
                                                                   +------+-------+
                                                                          |
                                         writeback thread picks           |
                                         inode from b_dirty              |
                                                                          v
                                                                   +--------------+
                                                                   | I_WRITEBACK  |
                                                                   | on b_io      |
                                                                   +------+-------+
                                                                          |
                             +-----------------+------------------+
                             |                 |                  |
                             v                 v                  v
                       all pages         congestion/         re-dirtied
                       written           partial write       during wb
                             |                 |                  |
                             v                 v                  v
                       CLEAN            b_more_io          I_WRITEBACK |
                       (remove from     (retry next        I_DIRTY_*
                        all lists)       cycle)            (stays on b_io,
                                                           re-queued to
                                                           b_dirty after
                                                           wb completes)

If an inode is re-dirtied while I_WRITEBACK is set (e.g., a write() occurs during writeback), both I_WRITEBACK and the relevant I_DIRTY_* flags are set simultaneously. When the current writeback completes, the inode is moved back to b_dirty (not removed) because the I_DIRTY_* flags are still set.

mark_inode_dirty() — Transition a clean inode to dirty:

This function is the sole entry point for adding an inode to a BDI's dirty inode list. It is called from the VFS write path (__generic_file_write_iter), the page fault CoW path, chmod/chown/utimensat, and filesystem-internal metadata updates.

/// Mark an inode as dirty and enqueue it for writeback.
///
/// `flags` specifies which dirty bits to set. Common combinations:
/// - `I_DIRTY_PAGES`: a data page was dirtied (write path).
/// - `I_DIRTY_SYNC`: metadata changed (chmod, chown, link count).
/// - `I_DIRTY_DATASYNC`: data-affecting metadata changed (truncate, fallocate).
/// - `I_DIRTY_TIME`: timestamps updated (read with relatime, write).
/// - `I_DIRTY_PAGES | I_DIRTY_DATASYNC`: write extended file (both data and size).
///
/// # Locking
///
/// Acquires `wb.list_lock` internally. Caller must NOT hold `wb.list_lock`.
/// Safe to call from any context (process, softirq) because `list_lock`
/// is IRQ-safe.
///
/// # Behaviour
///
/// 1. Atomically OR `flags` into `inode.i_state`.
/// 2. If the inode was already dirty (any `I_DIRTY_ALL` bits set before this
///    call), returns immediately — the inode is already on a dirty list.
/// 3. If `I_FREEING` is set in `i_state`, returns immediately (inode is being
///    evicted; dirtying would be a use-after-free).
/// 4. If `I_WRITEBACK` is set: sets the dirty flags but does NOT add to
///    `b_dirty` — the writeback completion path will re-queue the inode.
/// 5. Otherwise: records `dirtied_when = now_ns()` on the inode, acquires
///    `wb.list_lock`, and adds the inode to the appropriate dirty list:
///    - `I_DIRTY_TIME` only -> `wb.b_dirty_time`
///    - Any `I_DIRTY` flag -> `wb.b_dirty` (tail, maintaining oldest-first order)
/// 6. If this transitions the BDI from "no dirty inodes" to "has dirty inodes",
///    schedules the BDI's writeback `DelayedWork` to fire after
///    `dirty_writeback_centisecs`.
fn mark_inode_dirty(inode: &Inode, flags: InodeStateFlags) {
    // Fast path: inode already dirty — just OR in the new flags.
    let old = inode.i_state.fetch_or(flags.bits(), Ordering::AcqRel);
    if old & InodeStateFlags::I_DIRTY_ALL.bits() != 0 {
        return;
    }

    // Bail if inode is being freed.
    if old & InodeStateFlags::I_FREEING.bits() != 0 {
        return;
    }

    // Bail if writeback in progress — completion path will re-enqueue.
    if old & InodeStateFlags::I_WRITEBACK.bits() != 0 {
        return;
    }

    // First dirty transition: record dirty timestamp and enqueue.
    inode.dirtied_when.store(now_ns(), Ordering::Release);

    let wb = bdi_writeback_for(inode);
    let _guard = wb.list_lock.lock();

    if flags.contains(InodeStateFlags::I_DIRTY) {
        // Data or metadata dirty — add to b_dirty (oldest-first order).
        wb.b_dirty.push_back(inode.wb_link());
    } else {
        // Timestamps only — add to b_dirty_time (lazy writeback).
        wb.b_dirty_time.push_back(inode.wb_link());
    }

    // If this is the first dirty inode for this BDI, schedule writeback.
    if wb.b_dirty.len() + wb.b_dirty_time.len() == 1 {
        wb.dwork.schedule_delayed(dirty_writeback_interval());
    }
}

The dirtied_when field (nanoseconds-since-boot timestamp) on the inode records when it first became dirty. It is set once in mark_inode_dirty() and not updated on subsequent dirty calls (preserving oldest-first ordering in b_dirty). This timestamp drives the kupdate expiration check (see kupdate algorithm below).

Inode writeback algorithm:

Periodic timer fires (every dirty_writeback_interval):
  1. Move inodes from b_dirty to b_io (oldest first, up to nr_pages worth).
  2. For each inode in b_io:
     a. Lock inode's AddressSpace::writeback_lock.
     b. Atomically set I_WRITEBACK in inode.i_state (CAS loop).
     c. Walk the inode's page cache XArray for dirty pages (see dirty page
        enumeration below).
     d. For each dirty page: call AddressSpaceOps::writepage().
        -> Filesystem builds Bio -> bio_submit() to block layer.
     e. Check congestion: if bdi_write_congested() returns true, move inode
        to b_more_io and break (see congestion backpressure below).
     f. If all dirty pages written: atomically clear I_WRITEBACK and all
        I_DIRTY_* flags from i_state. Remove inode from b_io.
        If inode was re-dirtied during writeback (I_DIRTY_* still set after
        clearing I_WRITEBACK): move inode back to b_dirty.
     g. Release writeback_lock.
  3. After processing b_io, move b_more_io back to b_io for next cycle.
  4. Reschedule timer for next dirty_writeback_interval.

Dirty page enumeration within an inode:

When the writeback thread processes an inode, it must efficiently find all dirty pages in the inode's page cache. UmkaOS uses XArray tagged iteration with the DIRTY tag for O(k) enumeration (where k = number of dirty pages), skipping clean pages entirely.

/// Batch size for dirty page enumeration. 16 pages per batch balances
/// lock hold time (XArray node traversal under RCU) against per-batch
/// overhead (function call, writepage dispatch). Matches Linux's
/// PAGEVEC_SIZE used in write_cache_pages().
const WRITEBACK_BATCH_SIZE: usize = 16;

/// Enumerate dirty pages in an inode's page cache in ascending page index
/// order and submit them for writeback.
///
/// # Arguments
///
/// * `mapping` — The inode's AddressSpace (contains the PageCache).
/// * `wbc` — Writeback control state (tracks progress, page budget, cyclic start).
///
/// # Algorithm
///
/// Uses XArray tagged iteration (`xa_for_each_tagged`) with the `XA_TAG_DIRTY`
/// tag. The XArray maintains a per-entry tag bitmap; setting `PageFlags::DIRTY`
/// on a page also sets the XArray DIRTY tag on that slot. Tagged iteration
/// skips entire 64-entry radix nodes that have no tagged entries, making the
/// scan O(k) in the number of dirty pages rather than O(n) in total pages.
///
/// **DSM page exclusion**: Pages with `PageFlags::DSM` set are checked against
/// the DSM dirty bitmap before inclusion in the writeback dirty page walk.
/// Pages where the `DsmDirtyBitmap` bit is set (i.e., modified via RDMA
/// coherence and pending DSM writeback) are skipped — they have their own
/// RDMA-based writeback path managed by the DSM subsystem
/// ([Section 6.11](06-dsm.md#dsm-distributed-page-cache)). DSM pages where the dirty bitmap bit
/// is clear (locally clean from the DSM perspective) are eligible for normal
/// filesystem writeback. The writeback scanner checks
/// `page.flags & PageFlags::DSM && dsm_dirty_bitmap.test(page.index)` and
/// skips only those pages — they are not counted against the writeback page
/// budget (`wbc.nr_to_write`) and are not submitted to the filesystem's
/// `writepage` callback. This prevents double-writeback (local block I/O +
/// RDMA) while still allowing local writeback for clean DSM pages.
///
/// Pages are collected into batches of `WRITEBACK_BATCH_SIZE` (16) before being
/// submitted. Batching amortises the per-page lock/unlock and function call
/// overhead. Within each batch, pages are processed in ascending `PageIndex`
/// order to produce sequential I/O patterns (critical for HDD performance,
/// beneficial for SSD write combining).
///
/// # XArray Tag Clearing Locking Protocol
///
/// `clear_tag(index, XA_TAG_DIRTY)` is called under the per-page lock, NOT under
/// `xa_lock`. This is safe because XArray tag operations use internal CAS-based
/// slot updates that are atomic with respect to concurrent RCU readers. The
/// per-page lock provides exclusion against concurrent `mark_page_dirty()` for the
/// same page index (which sets the tag). Concurrent insertion/deletion by another
/// CPU on a DIFFERENT index is safe — XArray RCU iteration tolerates concurrent
/// structural modification (the iterator may skip or revisit entries, but never
/// crashes or corrupts). The `xa_lock` is required only for structural operations
/// (insert, erase, split/join) that modify the radix tree topology; tag operations
/// modify in-place bitmaps within existing nodes and use atomic bitops.
///
/// # Return
///
/// Returns `Ok(pages_written)` on success, `Err(errno)` if a writepage call
/// fails (the error is also propagated through the writeback error chain).
fn writeback_inode_pages(
    mapping: &AddressSpace,
    wbc: &mut WritebackControl,
) -> Result<u64, Errno> {
    let mut pages_written: u64 = 0;
    let mut batch: ArrayVec<(PageIndex, Arc<Page>), WRITEBACK_BATCH_SIZE> =
        ArrayVec::new();

    // Start index: for range_cyclic mode, resume from where the last writeback
    // left off (wbc.cyclic_start). For full-range mode, start at 0.
    let start = if wbc.range_cyclic { wbc.cyclic_start } else { 0 };

    // Tagged iteration: only visits slots with XA_TAG_DIRTY set.
    let mut iter = mapping.page_cache.pages.iter_tagged(start, XA_TAG_DIRTY);

    // Check writepages() availability BEFORE batch collection to avoid
    // wasting XArray iteration when the filesystem provides its own
    // writepages() implementation (which does its own dirty page enumeration).
    // This matches Linux's `do_writepages()` pattern: delegate to
    // `a_ops->writepages` first, fall back to `write_cache_pages()` only
    // when the filesystem does not provide writepages().
    if mapping.ops.has_writepages() {
        // Fast path: delegate to filesystem's writepages() directly.
        // The filesystem handles dirty page enumeration, bio construction,
        // extent merging, and I/O submission internally. wbc.nr_to_write
        // is decremented by the filesystem for each page it processes.
        match mapping.ops.writepages(mapping, wbc) {
          Ok(written) => {
            pages_written += written;
          }
          Err(e) => return Err(e),
        }
    } else {
        // Slow path: collect batch, fall back to per-page writepage().
        loop {
            batch.clear();
            while batch.len() < WRITEBACK_BATCH_SIZE {
                match iter.next() {
                    Some((index, entry)) => batch.push((index, entry.page.clone())),
                    None => break,
                }
            }

            if batch.is_empty() {
                break;
            }

            // Per-page fallback: submit each page individually via writepage().
            for &(index, ref page) in &batch {
                // Lock the page to serialise with concurrent faulters.
                page.lock();
                // Clear the DIRTY flag and XArray tag atomically.
                page.flags.fetch_and(!PageFlags::DIRTY.bits(), Ordering::Release);
                mapping.page_cache.pages.clear_tag(index, XA_TAG_DIRTY);
                // Set WRITEBACK flag (prevents reclaim during I/O).
                page.flags.fetch_or(PageFlags::WRITEBACK.bits(), Ordering::Release);
                // Increment nrwriteback: this page is now in-flight.
                mapping.nrwriteback.fetch_add(1, Ordering::Relaxed);
                page.unlock();

                // Dispatch to filesystem.
                mapping.ops.writepage(mapping, page, wbc)?;
                pages_written += 1;

                // Respect page budget.
                if wbc.nr_to_write > 0 {
                    wbc.nr_to_write -= 1;
                    if wbc.nr_to_write <= 0 {
                        wbc.cyclic_start = index + 1;
                        return Ok(pages_written);
                    }
                }
            }
        }
    }

    // Wrapped around to start — reset cyclic position.
    if wbc.range_cyclic {
        wbc.cyclic_start = 0;
    }

    Ok(pages_written)
}

Congestion backpressure:

When a backing device's I/O queue is full, submitting more writeback I/O would either block the writeback thread (stalling all other inodes on this BDI) or cause excessive memory pressure from queued bios. UmkaOS detects congestion and defers work to avoid these problems.

/// Check whether the backing device's write queue is congested.
///
/// Congestion is detected by comparing the device's in-flight write count
/// against its queue depth. If in-flight writes exceed 75% of the queue
/// depth, the device is considered congested.
///
/// # Returns
///
/// `true` if the device write queue is congested and the caller should back
/// off; `false` if more I/O can be submitted.
fn bdi_write_congested(bdi: &BackingDevInfo) -> bool {
    let inflight = bdi.inflight_writes.load(Ordering::Relaxed);
    let queue_depth = bdi.queue_depth.load(Ordering::Relaxed);
    // 75% threshold: leave headroom for fsync and priority writes.
    inflight > queue_depth * 3 / 4
}

When bdi_write_congested() returns true during inode writeback:

The writeback thread stops submitting pages for the current inode.
The inode is moved from b_io to b_more_io (preserving its position relative to other congestion-deferred inodes).
The writeback thread continues to the next inode in b_io (if any).
After all b_io inodes are processed, b_more_io inodes are moved back to b_io for the next writeback cycle.
A congestion wait is inserted: the writeback thread sleeps for CONGESTION_WAIT_MS (100ms) before the next cycle, giving the device time to drain its queue.

This ensures that a single slow device does not cause the writeback thread to spin on a congested queue. The 100ms backoff is short enough to maintain throughput on devices that drain quickly (NVMe), while preventing CPU waste on devices with deep queues (HDD RAID).

Kupdate algorithm:

The kupdate (kernel update) writeback mode is the periodic background mechanism that ensures dirty data does not remain in memory indefinitely. It runs on the same DelayedWork as background writeback but with different inode selection criteria.

Kupdate writeback (periodic, every dirty_writeback_centisecs = 500 centisecs = 5s):

  1. Timer fires. Build a WritebackWork with for_kupdate = true.

  2. Compute the expiration deadline:
       writeback_deadline_ns = now_ns() - dirty_expire_centisecs * 10_000_000
     (Default dirty_expire_centisecs = 3000 -> 30 seconds. An inode with
     dirtied_when older than 30 seconds ago is eligible for writeback.)

  3. Scan b_dirty from head (oldest dirtied_when first):
     For each inode in b_dirty:
       a. If inode.dirtied_when > writeback_deadline_ns:
            Stop — all remaining inodes are newer than the deadline
            (b_dirty is ordered by dirtied_when, oldest first).
       b. If inode.dirtied_when <= writeback_deadline_ns:
            Inode has been dirty for longer than dirty_expire_centisecs.
            Move it from b_dirty to b_io for writeback.

  4. Scan b_dirty_time with dirtytime expiration:
       dirtytime_deadline_ns = now_ns() - dirtytime_expire_interval * 1_000_000_000
     (Default dirtytime_expire_interval = 43200 seconds = 12 hours.)
     For each inode in b_dirty_time:
       a. If inode.dirtied_when <= dirtytime_deadline_ns:
            Promote: set I_DIRTY_SYNC in i_state (timestamps become metadata
            dirty). Move inode from b_dirty_time to b_io.
       b. Otherwise: stop (ordered by dirtied_when).

  5. Process b_io as in the standard inode writeback algorithm (lock, enumerate
     dirty pages, submit writepage, handle congestion).

  6. Reschedule the kupdate timer for the next dirty_writeback_centisecs
     interval. The timer is unconditional — it fires even if no inodes were
     eligible, ensuring prompt writeback when inodes age past the threshold.

Interaction between kupdate and background writeback:

Both kupdate and background writeback use the same BdiWriteback structure and the same b_dirty / b_io / b_more_io lists. They are distinguished by the WritebackWork.for_kupdate and WritebackWork.for_background flags:

Mode	Trigger	Inode selection	Page budget
Kupdate	Periodic timer (5s)	Only inodes dirty for > `dirty_expire_centisecs` (30s)	Unlimited (write all expired)
Background	`dirty_pages > dirty_background_ratio`	All dirty inodes on `b_dirty` (oldest first)	Proportional to `write_bandwidth`
Sync	`sync()` / `fsync()` call	All inodes (sync) or one inode (fsync)	Unlimited (write everything)

When both kupdate and background writeback are active simultaneously (e.g., dirty ratio exceeded and some inodes are also expired), they are serialised by wb.list_lock — only one writeback work item processes b_io at a time. The background writeback work item subsumes kupdate's work because it writes all dirty inodes, not just expired ones.

4.6.1.1 `end_page_writeback()` — Writeback I/O Completion¶

/// Called from the bio completion callback (via the `blk-io` workqueue in
/// process context) for each page in a completed writeback bio. Performs
/// the per-page state transitions described in step 11/12 above.
///
/// # Arguments
/// - `page`: the page whose writeback I/O has completed.
/// - `error`: `None` on success, `Some(errno)` on I/O error.
///
/// # Actions
/// 1. Clear `PageFlags::WRITEBACK`.
/// 2. Wake tasks blocked on `wait_on_page_writeback()` (fsync, WB_SYNC_ALL).
/// 3. On success: clear `PageFlags::DIRTY`, decrement `BdiWriteback.nr_dirty`.
/// 4. On error: increment `PageExt.wb_fail_count`. If >= 3, set
///    `PERMANENT_ERROR` (do not re-dirty). Otherwise re-dirty the page.
///    Call `mapping_set_error(mapping, error)` to record in `wb_err`.
pub fn end_page_writeback(page: &Page, error: Option<Errno>);

Cross-references: - Inode struct (i_state: AtomicU32): Section 14.1 - AddressSpace and writeback_lock: Section 14.1 - fsync end-to-end flow: Section 14.4 - Block I/O and bio_submit(): Section 15.2 - Workqueue framework: Section 3.11 - Dirty page thresholds and balance_dirty_pages(): see above in this section - Writeback I/O completion (writeback_end_io / end_page_writeback): defined in Section 15.2. When a writeback bio completes (IRQ context), bio_complete() invokes the bio's end_io callback (writeback_end_io_deferred), which schedules writeback_end_io() on the blk-io workqueue. That function calls end_page_writeback() on each page in the bio. This clears PG_WRITEBACK, decrements AddressSpace.nrwriteback, and wakes any tasks blocked on wait_on_page_writeback() (used by WB_SYNC_ALL paths and fsync()). Errors are recorded in AddressSpace.wb_err via mapping_set_error().

4.7 Transparent Huge Page Promotion and Memory Compaction¶

khugepaged — background THP promotion:

The kernel runs a background thread (khugepaged) that scans process address spaces for opportunities to promote 512 contiguous 4KB pages (2MB aligned) into a single 2MB transparent huge page. This reduces TLB pressure: a single 2MB TLB entry replaces 512 × 4KB entries, and modern CPUs have dedicated 2MB TLB slots (Intel: 32-64 entries; AMD: 64 entries; ARM: 32-48 entries depending on core).

Promotion flow:
  1. khugepaged scans VMAs with THP enabled (default for anonymous memory).
  2. For each 2MB-aligned range: check if all 512 base pages are present,
     anonymous (not file-backed), and writable.
  3. If yes: allocate a compound page (order-9), copy 512 base pages into it,
     update PTEs atomically under the page table lock, free the 512 base pages.
  4. If allocation fails (no contiguous 2MB block): skip and try next range.
     Memory compaction (below) may create the block for a future scan cycle.

The THP promotion decision is controlled by VmmPolicy::should_promote_thp() (Section 4.8), which is a replaceable policy method. The default policy promotes when all 512 base pages in the huge page range are present and the VMA allows huge pages.

Configuration:

/sys/kernel/mm/transparent_hugepage/enabled     — always / madvise / never
/sys/kernel/mm/transparent_hugepage/defrag      — always / defer / defer+madvise / madvise / never
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs  — interval (default: 10000ms)
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan         — pages per cycle (default: 4096)

Memory compaction:

When the buddy allocator cannot satisfy a high-order allocation (e.g., 2MB for THP or 1GB for explicit huge pages), the kernel triggers memory compaction: a process that migrates movable pages to create contiguous free regions.

Compaction algorithm (simplified):
  1. A "migration scanner" walks from the bottom of the zone upward, finding
     movable pages (LRU-resident, not pinned, not DMA-mapped).
  2. A "free scanner" walks from the top of the zone downward, finding free pages.
  3. When both scanners meet: the movable page is migrated (allocated at the free
     page's location, content copied, PTE updated), freeing a contiguous block at
     the migration scanner's position.
  4. Compaction stops when a block of the requested order is available or the
     scanners have exhausted the zone.

Page Mobility Classification for Compaction:

The migration scanner only moves movable pages. Pinned pages are skipped in place — compaction works around them, accepting that pinned pages create holes that limit contiguous block formation.

Movable (can be migrated): - Anonymous pages not locked by mlock() or mmap(MAP_LOCKED) - Page cache pages (clean: simply freed + re-read; dirty: writeback first) - Pages allocated with GFP_MOVABLE (standard for anonymous and file pages) - slab pages marked movable (used for large object caches only)

Pinned (cannot be migrated): - Pages locked via mlock() or mlockall() — physical address is fixed - Pages with active pin_user_pages() / get_user_pages(FOLL_PIN) references (GUP-pinned for ongoing DMA or direct I/O; pincount tracked separately from refcount via the PageFlags::PIN_COUNT field) - Pages mapped for DMA (registered with IOMMU; physical address in hardware) - Pages in driver vmap/vmalloc mappings (physical address fixed for MMIO) - Per-CPU data pages and kernel stack pages of currently running tasks - Slab pages not marked as movable (kernel object caches with raw pointers) - madvise(MADV_CRITICAL) pages (Section 4.5) — wired, never migrated

Note: "refcount > 1 beyond the page table mapping" alone does NOT make a page pinned. A page cache page may have refcount > 1 (page table + page cache + buffer_head) yet still be migratable — the migration code replaces all references atomically. Only explicit pin-count (GUP pin) or mlock prevents migration. The migration scanner checks page_maybe_dma_pinned() (true if pin_count > 0) to skip genuinely pinned pages.

Compaction behavior on pinned pages: The migration scanner encounters a pinned page and skips it, advancing the scanner by one page. The free scanner may still find free pages beyond the pinned page, but the resulting free block will be non-contiguous with pages before the pinned page. If the zone has many scattered pinned pages, compaction may fail to form a 2MB block even after a full scan.

Design implication: Drivers should use GFP_MOVABLE for their page allocations wherever possible to avoid becoming obstacles to THP formation and compaction. DMA mappings are inherently pinned and should be concentrated in IOMMU-mapped regions rather than scattered through general memory.

Latency impact: Compaction involves page migration (TLB shootdown + memcpy + PTE update). Per-page migration costs vary by scope: - Local NUMA migration (same socket, memcpy via CPU): ~200-500 ns per 4KB page - Cross-socket migration (cache-coherent interconnect, TLB shootdown included): ~1-10 μs per page - RDMA-based DSM migration (cross-node, see Section 6.2): ~2-50 μs depending on distance

Compaction stall time during synchronous operation accumulates these costs across all migrated pages. The defer defrag mode (default in UmkaOS) avoids synchronous compaction on page faults — instead, khugepaged and kcompactd run in the background, and allocation failures fall back to 4KB pages without blocking. The always defrag mode triggers synchronous compaction on every THP-eligible fault, which maximizes THP coverage but can cause multi-millisecond stalls — suitable only for throughput-oriented batch workloads, not latency-sensitive applications.

Disable option: For hard real-time workloads (isolcpus + nohz_full), THP promotion and compaction should be disabled entirely (transparent_hugepage/enabled=never) to eliminate background scanning and migration-induced latency jitter. These workloads should pre-allocate explicit huge pages at boot (umka.hugepages=<count>) for deterministic TLB behavior.

4.7.1 Struct Definitions¶

Nucleus/Evolvable classification:

Component	Classification	Rationale
`KhugepagedConfig`	Nucleus (data)	Struct layout — stores tunables read by the scan loop.
`CompactionControl`	Nucleus (data)	Struct layout — parameterizes a single compaction run.
`CompactionScanner`	Nucleus (data)	Struct layout — tracks scanner cursor state across calls.
khugepaged scan loop	Evolvable	Algorithm deciding which VMAs to promote. ML-tunable via `ParamId::ThpScanIntervalMs` and `ParamId::ThpPagesToScan`.
`should_promote_thp()`	Evolvable	Policy method on `VmmPolicy` (Section 4.8). Replaceable at runtime.
Compaction migration loop	Evolvable	Algorithm deciding page migration order and bail-out heuristics. ML-tunable via `ParamId::CompactionProactiveness`.

/// Configuration for the khugepaged background promotion thread.
/// Exposed to userspace via sysfs
/// (`/sys/kernel/mm/transparent_hugepage/khugepaged/*`).
///
/// **Phase**: Phase 2 (THP promotion is an optimization, not required for
/// correctness — base 4KB pages always work).
pub struct KhugepagedConfig {
    /// Milliseconds between scan cycles. Default: 10_000 (10 seconds).
    /// Sysfs: `scan_sleep_millisecs`. Lower values increase THP coverage
    /// at the cost of CPU overhead. ML-tunable via
    /// `ParamId::ThpScanIntervalMs` with bounds [100, 600_000].
    pub scan_sleep_ms: u32,

    /// Maximum pages to scan per cycle before yielding. Default: 4096.
    /// Sysfs: `pages_to_scan`. Controls scan granularity — larger values
    /// promote more aggressively per cycle. ML-tunable via
    /// `ParamId::ThpPagesToScan` with bounds [1, 65536].
    pub pages_to_scan: u32,

    /// Defrag mode controlling when synchronous compaction is triggered.
    /// Sysfs: `/sys/kernel/mm/transparent_hugepage/defrag`.
    pub defrag: ThpDefragMode,

    /// Whether THP is enabled globally. Sysfs:
    /// `/sys/kernel/mm/transparent_hugepage/enabled`.
    pub enabled: ThpEnabledMode,
}

/// THP defrag mode — controls synchronous compaction behavior.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThpDefragMode {
    /// Always trigger synchronous compaction on THP-eligible faults.
    /// Maximum THP coverage but can cause multi-ms stalls.
    Always      = 0,
    /// Defer compaction to khugepaged/kcompactd background threads.
    /// Faults fall back to 4KB pages without blocking. **UmkaOS default.**
    Defer       = 1,
    /// Defer for all faults except madvise(MADV_HUGEPAGE) regions.
    DeferMadvise = 2,
    /// Synchronous compaction only for madvise(MADV_HUGEPAGE) regions.
    Madvise     = 3,
    /// Never trigger compaction for THP (rely on opportunistic promotion
    /// when free 2MB blocks happen to exist).
    Never       = 4,
}

/// THP global enable mode.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThpEnabledMode {
    /// THP promotion enabled for all anonymous VMAs.
    Always  = 0,
    /// THP promotion only for VMAs marked with madvise(MADV_HUGEPAGE).
    Madvise = 1,
    /// THP disabled — all pages remain 4KB.
    Never   = 2,
}

/// Parameters for a single compaction run. Created by kcompactd or by
/// the synchronous compaction path when a high-order allocation fails.
// Kernel-internal struct, not KABI/wire — bool is safe.
pub struct CompactionControl {
    /// Zone being compacted (one compaction run operates on a single zone).
    ///
    /// SAFETY: `zone` points to a valid `Zone` instance within the buddy
    /// allocator's zone array. The zone outlives the compaction run (zones
    /// are never freed). Never null: a CompactionControl is always created
    /// for a specific zone. Only accessed under the zone's memory_hotplug_lock.
    pub zone: *const Zone,

    /// Requested allocation order (e.g., 9 for 2MB THP, 18 for 1GB
    /// explicit huge page). Compaction stops when a free block of this
    /// order exists.
    pub order: u8,

    /// Compaction mode — synchronous (caller blocks) or async (kcompactd).
    pub mode: CompactionMode,

    /// Migration type filter: only migrate pages of this mobility type
    /// (usually MOVABLE). Pages of other types are skipped.
    pub migratetype: MigrateType,

    /// Whether to use whole-pageblock migration (true) or individual
    /// page migration (false). Whole-pageblock is faster but may
    /// transiently increase fragmentation.
    pub whole_pageblock: bool,
}

/// Compaction mode.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CompactionMode {
    /// Synchronous — caller blocks until compaction completes or fails.
    /// Used by direct compaction on allocation failure.
    Sync = 0,
    /// Asynchronous — kcompactd background thread. Does not block the
    /// allocating task.
    Async = 1,
}

/// Persistent scanner state for a zone's compaction. Stored per-zone so
/// that successive compaction runs resume where the last run left off,
/// avoiding redundant re-scanning of already-compacted regions.
pub struct CompactionScanner {
    /// Migration scanner position: PFN of the next page frame to examine
    /// for movable pages (scans upward from zone start).
    pub migrate_pfn: u64,

    /// Free scanner position: PFN of the next page frame to examine
    /// for free pages (scans downward from zone end).
    pub free_pfn: u64,

    /// Result of the last compaction attempt on this zone.
    pub last_result: CompactionResult,
}

/// Outcome of a compaction run.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CompactionResult {
    /// Not attempted yet (initial state).
    NotAttempted     = 0,
    /// Success — a block of the requested order is available.
    Success          = 1,
    /// Partial — some pages migrated but the target order was not achieved.
    /// Scanners can resume from their current positions.
    Partial          = 2,
    /// Skipped — zone watermarks indicate compaction is not useful.
    Skipped          = 3,
    /// Contention — compaction yielded due to lock contention or
    /// excessive migration failures. Will retry on next trigger.
    Contention       = 4,
    /// No suitable pages — scanners exhausted the zone without forming
    /// a block. Further compaction on this zone is futile until pages
    /// are freed or unmapped.
    NoSuitablePages  = 5,
}

THP page count constants:

/// Size (in bytes) of a PMD-level transparent huge page.
/// Architecture-dependent: derived from the PMD shift, which is the number
/// of bits covered by one PMD entry.
/// - 4KB-page architectures (x86-64, AArch64, ARMv7, RISC-V, PPC32, s390x,
///   LoongArch64): 2 MB (PMD_SHIFT = 21, 1 << 21 = 2_097_152).
/// - 64KB-page architectures (PPC64LE): 16 MB (PMD_SHIFT = 24,
///   1 << 24 = 16_777_216).
pub const HPAGE_PMD_SIZE: usize = 1 << arch::current::mm::PMD_SHIFT;

/// Number of base pages in a PMD-level transparent huge page.
/// Computed from `HPAGE_PMD_SIZE / PAGE_SIZE` at compile time.
/// - 4KB-page architectures: 2 MB / 4 KB = 512.
/// - 64KB-page architectures (PPC64LE): 16 MB / 64 KB = 256.
///
/// Used by `should_promote_thp()` and `count_present_ptes()` to determine
/// the number of base pages that must be present for THP promotion.
pub const HPAGE_PMD_NR: usize = HPAGE_PMD_SIZE / PAGE_SIZE;

VmmPolicy::should_promote_thp() — THP promotion policy method:

This is an Evolvable method on VmmPolicy (Section 4.8). The default implementation promotes when all 512 base pages are present and the VMA allows huge pages:

impl VmmPolicy {
    /// Decide whether to promote the 512 base pages at `addr` (2MB-aligned)
    /// into a transparent huge page.
    ///
    /// Returns:
    /// - `0` — do not promote (pages missing, VMA disallows THP, etc.)
    /// - `1` — promote now (all pages present, VMA allows THP)
    /// - `2` — defer (some pages present but promotion is not urgent;
    ///   khugepaged will retry on the next scan cycle)
    ///
    /// Return type is `u8` (not `bool`) because this crosses the Evolvable
    /// KABI boundary (CLAUDE.md rule 8: no `bool` in KABI structs).
    ///
    /// **ML tunability**: The ML policy framework can replace this method
    /// to consider additional signals: memory pressure (defer promotion
    /// under pressure to avoid compaction), VMA access frequency (promote
    /// hot VMAs first), and NUMA locality (prefer promotion on the local
    /// node).
    pub fn should_promote_thp(&self, vma: &Vma, addr: VirtAddr) -> u8 {
        if !vma.flags.contains(VmFlags::HUGEPAGE_ELIGIBLE) {
            return 0; // VMA disallows THP (madvise(MADV_NOHUGEPAGE) or never mode)
        }

        // Check that all 512 base pages in the 2MB range are present.
        let base_pages_present = count_present_ptes(vma.mm, addr, HPAGE_PMD_NR);
        if base_pages_present < HPAGE_PMD_NR {
            return 2; // Defer — some pages not yet faulted in.
        }

        // NUMA affinity check: if >= 75% of the base pages are on the local
        // NUMA node, promote on the local node. Otherwise defer — the pages
        // may be in the process of being migrated by NUMA balancing, and
        // promoting now would create a remote huge page that degrades locality.
        let local_node = numa_node_of_cpu(current_cpu_id());
        let local_pages = count_pages_on_node(vma.mm, addr, HPAGE_PMD_NR, local_node);
        if local_pages < (HPAGE_PMD_NR * 3) / 4 {
            return 2; // Defer — insufficient NUMA locality for huge page.
        }

        1 // All 512 pages present, VMA allows THP, NUMA-local → promote.
    }
}

4.8 Virtual Memory Manager¶

Maple tree for VMA (Virtual Memory Area) management (same as Linux 6.1+)
Demand paging: Pages allocated on first access (page fault handler in UmkaOS Core)
Copy-on-write (COW): Fork shares pages read-only, copies on write fault
Memory-mapped files: mmap backed by page cache, supports MAP_SHARED and MAP_PRIVATE
Huge pages: Both explicit (mmap with MAP_HUGETLB) and transparent (THP)
ASLR: Full address space layout randomization for user processes

4.8.1 MmStruct — Per-Process Address Space¶

Every process has exactly one MmStruct that owns the entire virtual address space: the VMA tree, page table root, brk heap state, and the locking that coordinates concurrent faults, mmap operations, and device page migrations. Kernel threads that borrow a user address space (lazy TLB mode, Section 4.9) hold a reference to the owning process's MmStruct without incrementing the user-visible reference count.

/// Per-process virtual address space descriptor. One per `Process`; shared by
/// all threads in a thread group. Threads created with `CLONE_VM` (pthreads)
/// share the same `MmStruct`; `fork()` creates a new `MmStruct` with COW-copied
/// VMAs and a fresh page table tree.
///
/// Accessed as `process.mm` (or `current().mm()` from task context).
/// The `Process.address_space` field in [Section 7](process-and-task-management.md)
/// holds an `Arc<MmStruct>`.
pub struct MmStruct {
    /// Active user-thread reference count. Incremented by fork (CLONE_VM
    /// not set), decremented by exit_mm(). When this reaches 0, the address
    /// space is torn down (unmap_vmas, free_pgtables). Distinct from the
    /// Arc<MmStruct> refcount which tracks lazy TLB and other kernel references.
    ///
    /// The distinction: `users > 0` means "some thread is actively using this mm";
    /// `Arc` refcount > 0 means "someone still holds a reference" (could be a
    /// lazy TLB entry or an mm_struct pinned for `/proc/PID/maps`).
    ///
    /// Linux equivalent: `mm_struct->mm_users` (atomic_t).
    ///
    /// **Lifecycle**:
    /// - Initialized to 1 at MmStruct creation (fork or exec).
    /// - Incremented by: `CLONE_VM` thread creation (pthreads share the mm),
    ///   `kthread_use_mm()` (kernel threads borrowing a user address space).
    /// - Decremented by: `exit_mm()` in `do_exit()`, `kthread_unuse_mm()`.
    /// - When `users` drops to 0: `exit_mmap()` is called -- all VMAs are
    ///   unmapped, page tables freed, TLB flushed. The MmStruct itself stays
    ///   alive until the `Arc` refcount also reaches 0.
    pub users: AtomicU32,

    /// VMA tree — all virtual memory areas for this address space.
    /// Maple tree with RCU-safe lock-free reads and COW mutations.
    /// See [Section 4.8](#virtual-memory-manager--maple-tree-vma-management) below.
    pub vma_tree: MapleTree<Vma>,

    /// Page table root (PGD) for this address space. Architecture-specific:
    /// x86-64 = CR3 physical address, AArch64 = TTBR0_EL1 value,
    /// RISC-V = satp value (mode + ASID + PPN).
    /// Loaded into the hardware register on context switch.
    pub pgd: PhysAddr,

    /// Reader-writer lock protecting VMA tree mutations.
    ///
    /// - **Read lock** (`mmap_lock.read()`): taken on VMA lookups that cannot use
    ///   the per-VMA fast path ([Section 4.8](#virtual-memory-manager--per-vma-locking)),
    ///   HMM device page migration ([Section 22.2](accelerator-memory-and-p2p-dma.md)),
    ///   and `/proc/<pid>/maps` reads. Page faults use `vm_lock.read()` on the
    ///   fast path and only fall back to `mmap_lock.read()` on VMA removal races.
    ///   Multiple readers proceed concurrently.
    /// - **Write lock** (`mmap_lock.write()`): taken on `mmap`, `munmap`,
    ///   `mprotect`, `mremap`, `brk`, VMA split/merge. Exclusive access.
    ///
    /// **Relationship to `MapleTree::write_lock`**: `mmap_lock.write()` must be
    /// held *before* acquiring `MapleTree::write_lock`. The `mmap_lock` serialises
    /// high-level VMM operations (which may need to allocate pages, update
    /// `total_vm`, check rlimits); the `MapleTree::write_lock` serialises the
    /// tree-internal COW node replacement. Acquiring `MapleTree::write_lock`
    /// without `mmap_lock.write()` is a lock-order violation.
    pub mmap_lock: RwLock<()>,

    /// Total mapped virtual bytes (sum of all VMA sizes). Updated atomically
    /// on every `mmap()`, `munmap()`, and `mremap()`. Enables O(1) `RLIMIT_AS`
    /// checks without walking the VMA tree
    /// ([Resource Limits](resource-limits-and-accounting.md)).
    pub total_vm: AtomicU64,

    /// Resident set size in pages (pages currently present in the page tables).
    /// Incremented on page fault resolution, decremented on unmap / swap-out /
    /// page migration. Split into per-CPU counters to avoid cache-line
    /// contention on multi-threaded workloads; summed lazily for
    /// `/proc/<pid>/status` reporting.
    ///
    /// **Type**: `PerCpu<AtomicI64>` — raw per-CPU atomic, NOT `PerCpuCounter<i64>`.
    ///
    /// **Why not `PerCpuCounter<i64>`?** RSS is the hottest counter in the kernel:
    /// updated on every page fault (millions/sec under memory-intensive workloads).
    /// `PerCpuCounter<i64>` adds ~5-10 cycles per update for batch threshold
    /// checking and approximate-read bookkeeping. On the page fault path, those
    /// cycles are unacceptable. Raw `PerCpu<AtomicI64>` costs exactly one atomic
    /// `fetch_add` — zero overhead beyond the op itself.
    ///
    /// **When to use which** (design decision AI-036, Option C):
    /// - `PerCpu<AtomicI64>`: HOT PATH counters (RSS — per page fault). Raw,
    ///   minimal. Caller manages PreemptGuard and CPU slot selection directly.
    /// - `PerCpuCounter<i64>`: WARM PATH counters (dirty page counts, free
    ///   block counts). Provides batch accumulation + approximate read.
    ///   Used by `balance_dirty_pages`, superblock writer counts, etc.
    ///
    /// **Access protocol** (hot path — page fault handler):
    /// ```
    /// let guard = preempt_disable();
    /// mm.rss.get(&guard).fetch_add(1, Relaxed);
    /// // guard dropped — preemption re-enabled
    /// ```
    /// The `PreemptGuard` identifies the current CPU's slot. Since the inner
    /// type is `AtomicI64`, preemption between guard acquisition and the atomic
    /// op is benign — the worst case is updating a different CPU's counter
    /// after migration, which is within the documented drift tolerance.
    ///
    /// **Read protocol** (cold path — OOM killer, `/proc/<pid>/status`):
    /// ```
    /// let mut total: i64 = 0;
    /// for cpu in 0..num_possible_cpus() {
    ///     total += mm.rss.get_cpu(cpu).load(Relaxed);
    /// }
    /// ```
    /// Iterates all possible CPUs (not just online — a CPU that incremented
    /// its counter may be offline at read time). See `mm_sum_rss()` in
    /// [Section 4.5](#oom-killer).
    ///
    /// **Per-CPU drift tolerance**: Because reads are non-atomic across CPUs,
    /// the sum may drift from the true value by up to ±N pages (where N is
    /// the number of concurrent page fault handlers actively between their
    /// `fetch_add` and the reader's `load` of that CPU's slot). In practice
    /// this is bounded to a handful of pages — far smaller than the ±128 page
    /// drift of `PerCpuCounter`'s batch threshold. This drift is acceptable
    /// because: (a) the OOM killer scores at cgroup granularity by summing all
    /// member processes' RSS, which averages out per-CPU drift; (b) the scoring
    /// uses proportional comparison (RSS / memory.max), where a few pages are
    /// negligible relative to typical cgroup memory limits (hundreds of MiB to
    /// GiB).
    ///
    /// Note: RSS counters should be padded to cache-line granularity to
    /// avoid false sharing on high-CPU-count systems. Use
    /// `PerCpu<CacheAligned<AtomicI64>>` where `CacheAligned<T>` is
    /// `#[repr(align(64))]`, or verify that the `PerCpu` allocator's
    /// per-CPU data region is already cache-line-aligned per CPU.
    ///
    /// Note on 32-bit architectures (ARMv7, PPC32): AtomicI64 on 32-bit targets
    /// may require a lock-based fallback (no native 64-bit atomic on ARMv7 without
    /// LPAE). The overhead is acceptable because the per-CPU design eliminates
    /// contention — contended atomics are the expensive case, and per-CPU
    /// partitioning ensures each CPU's slot is uncontended.
    /// On 64-bit targets, AtomicI64 compiles to a single native atomic instruction.
    pub rss: PerCpu<AtomicI64>,

    /// Number of VMAs in the Maple tree. Checked against `sysctl_max_map_count`
    /// (default 65530, matching Linux) on every `mmap()` / VMA split to prevent
    /// VMA-count denial-of-service.
    pub map_count: AtomicU32,

    /// Start of the brk heap region (set by `execve()` from the ELF `p_memsz`
    /// of the last `PT_LOAD` segment, page-aligned up).
    pub brk_start: VirtAddr,

    /// Current brk end (the value returned by `brk(0)`). Grows upward via
    /// `brk()` / `sbrk()`. Protected by `mmap_lock.write()` for modifications.
    pub brk_end: AtomicUsize,

    /// Stack start address (top of the initial thread's stack, set by `execve()`).
    pub stack_start: VirtAddr,

    /// Bitmask of CPUs that may have TLB entries for this address space.
    /// Used by the TLB shootdown protocol
    /// ([Section 4.8](#virtual-memory-manager--tlb-invalidation-by-architecture)) to determine
    /// which CPUs need IPIs (x86-64, RISC-V) or to detect lazy TLB mode
    /// (AArch64, PPC). Set on context-switch-in, cleared on context-switch-out
    /// or lazy TLB detach.
    pub mm_cpumask: AtomicCpuMask,

    /// Per-mm architecture-specific context: allocated PCID (x86-64), ASID
    /// (AArch64, ARMv7, RISC-V), or PID (PPC). Managed by the PCID/ASID
    /// allocator ([Section 4.9](#pcid-asid-management)).
    /// Updated on ASID generation rollover (full TLB flush + re-allocation).
    pub context: ArchMmContext,

    /// Base address of the mmap region (randomized by ASLR at `execve()`).
    /// Non-MAP_FIXED `mmap()` calls search downward from `mmap_base` (top-down
    /// layout, the default on x86-64 and AArch64) or upward from `mmap_base`
    /// (legacy bottom-up layout for `ADDR_COMPAT_LAYOUT` personality). Set by
    /// `arch_pick_mmap_layout()` during `execve()` based on stack `RLIMIT_STACK`,
    /// ASLR entropy, and personality flags.
    pub mmap_base: VirtAddr,

    /// End of the mmap region (upper bound for top-down search, lower bound
    /// for bottom-up search). Typically `TASK_SIZE - stack_guard_gap`.
    pub mmap_end: VirtAddr,

    /// Highest VMA end address currently in the address space. Used as a
    /// fallback hint for bottom-up allocation when top-down search fails.
    /// Updated on VMA insert/remove.
    pub highest_vm_end: AtomicUsize,

    /// RSS sub-counters for swap and page table accounting. These are
    /// per-mm atomic counters (not per-CPU like `rss`) because they are
    /// updated less frequently and accuracy matters for OOM scoring.
    ///
    /// `swap_count`: number of swap entries charged to this mm (incremented
    /// on swap-out, decremented on swap-in or unmap of swapped page).
    /// Used by `oom_badness()` as `MM_SWAPENTS`.
    pub swap_count: AtomicI64,

    /// `pgtable_bytes`: total bytes of page table pages allocated for this mm.
    /// Incremented when PTEs/PMDs/PUDs/PGDs are allocated, decremented on
    /// teardown. Used by `oom_badness()` as `MM_PGTABLES_BYTES / PAGE_SIZE`.
    pub pgtable_bytes: AtomicI64,

    /// Per-mm flags (bitflags). The raw value is `AtomicU64`; use
    /// `MmFlags::from_bits_truncate(flags.load(...))` to convert to the
    /// typed bitflags, and `flags.fetch_or(MmFlags::XXX.bits(), ...)` to
    /// set flags atomically. Two access patterns are used in the spec:
    /// - Raw bitwise: `flags.load(Relaxed) & MMF_OOM_SKIP != 0` (legacy)
    /// - Typed: `MmFlags::from_bits_truncate(flags.load(Acquire)).contains(MmFlags::MMF_UNSTABLE)`
    /// Both are correct; typed access is preferred for new code.
    pub flags: AtomicU64,

    // MmFlags bitflags — see definition below `MmStruct`.

    /// MMU notifier subscribers. Used by KVM (EPT/SLAT invalidation),
    /// HMM ([Section 22.4](22-accelerators.md#accelerator-memory-and-p2p-dma)), and IOMMU-aware subsystems
    /// that maintain secondary page tables derived from this address space.
    /// Protected by `mmu_notifier_lock` (see lock ordering below).
    ///
    /// Registration is a cold path (device attach, VM creation), so a Vec
    /// with initial capacity 4 (one heap allocation on first registration)
    /// is acceptable. Linux uses a dynamically-sized hlist with no static
    /// limit. UmkaOS matches this: no artificial capacity cap.
    ///
    /// Realistic workloads may need >4 notifiers (KVM + IOMMU + multi-GPU
    /// HMM + CXL + RDMA), so a fixed ArrayVec<4> would cause hard ENOMEM
    /// failures. Vec grows as needed with no hard limit.
    pub mmu_notifiers: SpinLock<Vec<Arc<dyn MmuNotifier>>>,
}

4.8.1.1 MMU Notifier¶

Secondary page table consumers (KVM EPT/NPT, GPU page tables, IOMMU domains) must be notified when the primary page tables change so they can invalidate stale entries. The MmuNotifier trait provides this callback interface.

/// Trait implemented by subsystems that maintain secondary mappings derived
/// from a process's primary page tables. Registered on a per-MmStruct basis.
///
/// **Lock ordering**: callers hold `mmap_lock.read()` (or `.write()`) before
/// invoking any callback. Implementations must NOT acquire `mmap_lock`.
/// The callback may acquire per-subsystem locks (e.g., KVM's `kvm->mmu_lock`,
/// IOMMU domain lock) but must not block on page allocation or I/O.
pub trait MmuNotifier: Send + Sync {
    /// Called before PTEs in `[start, end)` are cleared. The implementation
    /// must ensure that no secondary page table entries map addresses in the
    /// range after this call returns. This is the "prepare" half — the caller
    /// has NOT yet cleared the primary PTEs, so the implementation can still
    /// read the old PTE values if needed (e.g., to determine dirty state).
    ///
    /// `start` and `end` are page-aligned virtual addresses within the mm.
    fn invalidate_range_start(&self, mm: &MmStruct, start: VirtAddr, end: VirtAddr);

    /// Called after PTEs in `[start, end)` have been cleared and the TLB
    /// shootdown (if any) has completed. The implementation may release
    /// resources associated with the range (e.g., unpin backing pages).
    fn invalidate_range_end(&self, mm: &MmStruct, start: VirtAddr, end: VirtAddr);

    // Historical note: Linux's early MMU notifier versions included `change_pte()`
    // which was nonfunctional since 2012 — KVM unmaps the sPTE during
    // `invalidate_range_start`, making `change_pte` find nothing to update.
    // UmkaOS uses only `invalidate_range_start/end`. No `change_pte` method exists.

    /// Called when the MmStruct is about to be destroyed (process exit).
    /// The implementation must drop all secondary mappings and unregister.
    fn release(&self, mm: &MmStruct);
}

/// Register a notifier on an address space. Called by KVM at VM creation
/// (for EPT/NPT invalidation) and by HMM at device registration
/// (for GPU page table coherence).
///
/// Returns `Err(ENOMEM)` only on allocation failure (no static capacity limit).
/// Registration is a cold path (device attach, VM creation).
pub fn mmu_notifier_register(mm: &MmStruct, notifier: Arc<dyn MmuNotifier>)
    -> Result<(), Error>;

/// Unregister a previously registered notifier. Called on KVM VM teardown
/// or HMM device detach. Must be called with `mmap_lock.write()` held.
pub fn mmu_notifier_unregister(mm: &MmStruct, notifier: &Arc<dyn MmuNotifier>);

Callsites (where the VMM invokes notifier callbacks):

Operation	Callback	When
`munmap()` / `mremap()` shrink	`invalidate_range_start/end`	Before/after PTE clearing
COW fault (write to shared page)	`invalidate_range_start/end`	Before/after PTE update
Fork COW demotion (`copy_page_tables`)	`invalidate_range_start/end`	Around writable→read-only PTE demotion (see below)
THP split (2MB → 512 × 4KB)	`invalidate_range_start/end`	Around the PTE rewrite loop
Page migration (NUMA balancing)	`invalidate_range_start/end`	Before unmapping source, after mapping dest
`mprotect()` (permission change)	`invalidate_range_start/end`	Before/after permission PTE rewrite
Process exit (`exit_mmap()`)	`release`	Before page table teardown

VFIO interaction: Pages pinned via FOLL_LONGTERM (VFIO passthrough) are excluded from migration and compaction — they are never the subject of invalidate_range_* callbacks. If IOMMU domains need to be notified on process exit, they register an MmuNotifier with only the release callback populated.

Lock ordering summary:

Lock	Acquired by	Must hold before
`i_rwsem`	`truncate`, `write`, `fallocate`, `setattr`	`mmap_lock` (both read and write)
`mmap_lock.read()`	Page fault handler, VMA lookups, HMM migration, `/proc/<pid>/maps`	(none — read lock is compatible with other readers)
`mmap_lock.write()`	`mmap`, `munmap`, `mprotect`, `mremap`, `brk`	`MapleTree::write_lock`, `mmu_notifiers` (via callbacks)

Critical: i_rwsem must always be acquired before mmap_lock. Violating this order causes deadlock between truncate (which holds i_rwsem exclusive and needs mmap_lock to invalidate page table entries) and the page fault handler (which holds vm_lock read and needs i_rwsem shared via filemap_fault()). | mmu_notifiers SpinLock | mmu_notifier_register/unregister, callback iteration | Per-subsystem locks (KVM mmu_lock, IOMMU domain lock) | | MapleTree::write_lock | Tree-internal COW mutations (insert, remove, split, merge) | PTE-level spinlocks (PTL) |

4.8.1.1.1 MmuNotifier RAII Guard¶

When a single kernel operation triggers multiple mmu_notifier_invalidate_range() callbacks (e.g., munmap spanning multiple VMAs, KVM Stage-2 teardown, or THP splitting), each callback independently triggers a TLB invalidation on all registered subscribers. On AArch64, each Stage-2 invalidation requires a TLBI IPAS2E1IS + DSB ISH sequence — broadcasting to all cores with an inner-shareable domain barrier. Multiple sequential invalidations for adjacent or overlapping ranges create an IPI storm.

The RAII guard API ensures correct invalidate_range_start/invalidate_range_end pairing by construction: the guard calls invalidate_range_start on creation and invalidate_range_end on Drop. PTEs are cleared inside the guard's lifetime, guaranteeing subscribers are notified before PTEs are modified and released after the operation completes. This is strictly better than Linux's manual start/end pairing (source of CVE-2021-47639, CVE-2022-48991).

/// RAII guard that brackets PTE invalidation with mmu_notifier start/end calls.
/// `invalidate_range_start` is called on construction; `invalidate_range_end`
/// is called on Drop. PTEs must be cleared only while this guard is alive.
///
/// **Hot path**: This struct is stack-allocated — no heap allocation.
///
/// **Invariant**: `start < end` and both are page-aligned. The guard holds no
/// locks itself — it merely enforces the callback pairing contract.
pub struct MmuNotifierGuard<'a> {
    /// The address space whose subscribers are being notified.
    mm: &'a MmStruct,
    /// Start of the invalidated virtual address range (inclusive, page-aligned).
    start: VirtAddr,
    /// End of the invalidated virtual address range (exclusive, page-aligned).
    end: VirtAddr,
}

impl<'a> MmuNotifierGuard<'a> {
    /// Create a guard, calling `invalidate_range_start` on all registered
    /// MmuNotifier subscribers. Returns `Err(ENOMEM)` if a subscriber's
    /// `invalidate_range_start` fails (e.g., KVM cannot allocate shadow PT).
    ///
    /// **Precondition**: `start < end`, both page-aligned.
    pub fn new(mm: &'a MmStruct, start: VirtAddr, end: VirtAddr) -> Result<Self, Error> {
        mmu_notifier_invalidate_range_start(mm, start, end)?;
        Ok(Self { mm, start, end })
    }
}

impl Drop for MmuNotifierGuard<'_> {
    fn drop(&mut self) {
        mmu_notifier_invalidate_range_end(self.mm, self.start, self.end);
    }
}

For multi-VMA operations (e.g., munmap spanning N VMAs), a single guard covers the union range:

/// Accumulates invalidation ranges and creates a single MmuNotifierGuard
/// covering the union. Stack-allocated, O(1) per add_range() call.
pub struct MmuNotifierRangeBuilder {
    start: VirtAddr,
    end: VirtAddr,
    count: u32,
    active: bool,
}

impl MmuNotifierRangeBuilder {
    pub fn new() -> Self {
        Self { start: VirtAddr::MAX, end: VirtAddr::new(0), count: 0, active: false }
    }

    /// Record an invalidation range. Expands the builder to cover [start, end).
    pub fn add_range(&mut self, start: VirtAddr, end: VirtAddr) {
        self.start = self.start.min(start);
        self.end = self.end.max(end);
        self.count += 1;
        self.active = true;
    }

    /// Create the RAII guard for the accumulated union range.
    /// Returns `None` if no ranges were added.
    pub fn build<'a>(self, mm: &'a MmStruct) -> Result<Option<MmuNotifierGuard<'a>>, Error> {
        if !self.active { return Ok(None); }
        Ok(Some(MmuNotifierGuard::new(mm, self.start, self.end)?))
    }
}

Usage pattern (in munmap spanning multiple VMAs):

let mut builder = MmuNotifierRangeBuilder::new();
for vma in affected_vmas {
    builder.add_range(vma.start, vma.end);
}
let _guard = builder.build(mm)?;  // invalidate_range_start called here
for vma in affected_vmas {
    // ... clear PTEs (inside guard lifetime) ...
}
// _guard dropped here → invalidate_range_end called automatically

Performance: For a munmap spanning N VMAs, this reduces subscriber-side TLB invalidations from N to 1. On AArch64 KVM with 8 vCPU cores, a 10-VMA munmap drops from 10 × (TLBI IPAS2E1IS + DSB ISH) sequences to 1, saving ~9 × inner-shareable broadcast barriers (~200-500 cycles each depending on core count).

Interaction with invalidate_range_start/end pairing: The RAII guard issues a single start/end pair for the union range. Subscribers that track per-range state (e.g., KVM's mmu_notifier_count for blocking page faults during invalidation) see one start/end bracket, not N. The RAII pattern makes mispairing impossible — invalidate_range_end is always called exactly once per guard, even on early return or panic (Rust's drop guarantee). | PTE spinlock (PTL) | PTE installation (cmpxchg), COW fault resolution | (leaf — no further locks) |

The read-side fast path (page fault VMA lookup) acquires only the per-VMA vm_lock.read() via lockless RCU-protected maple tree lookup — bypassing mmap_lock entirely (Section 4.8). Zero exclusive locks, zero atomic RMW operations on shared cache lines. This is critical for multi-threaded workloads where dozens of threads may fault concurrently on the same address space.

4.8.2 Maple Tree VMA Management¶

The VMA index is a B-tree variant (Maple tree, as in Linux 6.1+) where each node stores ranges of VMAs sorted by virtual address. Unlike a red-black tree, Maple tree nodes are 256-byte dense nodes (4 cache lines), store multiple ranges per node (fanout ~16 for range nodes), and provide O(log n) lookup with far better cache behavior than pointer-chasing rb-trees. The reduced pointer chasing means fewer cache misses per lookup — a 3-level Maple tree touches ~12 cache lines (4 per level) versus 45-60 cache lines for an equivalent rb-tree traversal on a VMA set of ~100,000 entries (15-20 nodes × 3-4 cache lines each).

Vma — virtual memory area:

/// A Virtual Memory Area — a contiguous range of virtual addresses with uniform
/// protection and backing. The `MapleTree` stores one `Vma` per mapped region.
///
/// `Vma` structs are allocated from the slab allocator and stored in leaf nodes
/// of the `MapleTree`. The tree's range keys are `[vm_start, vm_end)`.
pub struct Vma {
    /// Inclusive start virtual address (page-aligned).
    pub vm_start: VirtAddr,
    /// Exclusive end virtual address (page-aligned).
    pub vm_end: VirtAddr,
    /// Protection and mapping flags (`PROT_READ`, `PROT_WRITE`, `PROT_EXEC`,
    /// `MAP_SHARED`, `MAP_PRIVATE`, `MAP_ANONYMOUS`, `MAP_HUGETLB`, etc.).
    /// Stored as a `VmFlags` bitfield; mirrors Linux `vm_flags`.
    pub vm_flags: VmFlags,
    /// Page offset within the backing file (in pages). Zero for anonymous VMAs.
    pub vm_pgoff: u64,
    /// Backing file, if any. `None` for anonymous VMAs.
    /// `Some` for file-backed mappings (`mmap` of a regular file, device, or
    /// shared memory object). `FileRef` is a type alias for `OpenFile`
    /// ([Section 14.1](14-vfs.md#virtual-filesystem-layer--openfile--per-open-file-state)); the
    /// `Arc<FileRef>` provides access to `file.inode.i_mapping` (the
    /// `AddressSpace` used by the page cache for this file).
    pub file: Option<Arc<FileRef>>,

    /// VM operations table. Allows filesystems and device drivers to
    /// customize fault handling, writeback notification, and VMA lifecycle.
    /// `None` for simple anonymous VMAs. Set during `do_mmap()` from the
    /// file's `f_op.mmap()` callback (for file-backed mappings) or from
    /// device driver `mmap` implementations (for device mappings).
    ///
    /// Key callbacks:
    /// - `fault()`: custom page fault handler (e.g., GPU buffer objects).
    /// - `page_mkwrite()`: called before a shared file page transitions
    ///   from read-only to writable (filesystem delayed allocation hook).
    /// - `close()`: called on VMA teardown (`munmap`, `exit_mmap`).
    /// - `open()`: called when a VMA is duplicated (`fork`).
    ///
    /// The `'static` lifetime is correct because `VmOperations` trait objects
    /// are registered as module-lifetime vtables (filesystem or driver module).
    pub vm_ops: Option<&'static dyn VmOperations>,

    /// Reverse-mapping linkage for anonymous pages. Links this VMA to the
    /// `AnonVma` structure that tracks all VMAs (across fork COW children)
    /// sharing the same set of anonymous pages. Required for:
    /// - Page reclaim (unmap all PTEs pointing to a physical page).
    /// - Page migration (NUMA balancing, compaction).
    /// - THP split (update all mappings of a huge page).
    ///
    /// `None` for file-backed VMAs (they use `AddressSpace.i_mmap` interval
    /// tree for reverse mapping) and for newly-created anonymous VMAs before
    /// the first page fault. Set lazily on the first anonymous page fault
    /// (`handle_anon_fault()`) by `anon_vma_prepare()`.
    ///
    /// After `fork()`, the child VMA shares the parent's `AnonVma` via
    /// `Arc::clone()`. The `AnonVma` holds a list of all connected VMAs
    /// (`anon_vma_chain`) for reverse-mapping walks.
    pub anon_vma: Option<Arc<AnonVma>>,

    /// DSM region reference for VMAs backed by distributed shared memory.
    /// `Some` when `vm_flags` contains `VM_DSM`. Provides the bridge from
    /// the VMM page fault path to the DSM subsystem: the fault handler uses
    /// this reference to look up per-page metadata, determine the home node,
    /// and dispatch coherence protocol messages. `None` for all non-DSM VMAs.
    /// Set during `dsm_mmap()` when a process maps a DSM region.
    /// See [Section 6.12](06-dsm.md#dsm-subscriber-controlled-caching) for the subscriber
    /// interface and [Section 6.5](06-dsm.md#dsm-page-fault-flow) for the full fault path.
    pub dsm_region: Option<Arc<DsmRegion>>,

    /// Per-VMA reader-writer lock. Allows page faults to proceed without
    /// acquiring the process-wide `mmap_lock`, dramatically reducing contention
    /// on multi-threaded workloads (databases, JVMs, container runtimes).
    /// See [Section 4.8](#virtual-memory-manager--per-vma-locking) for the full protocol.
    ///
    /// **Lock ordering**: `mmap_lock` nests outside `vm_lock`
    /// ([Section 3.4](03-concurrency.md#cumulative-performance-budget--lock-ordering)). A thread holding
    /// `vm_lock.read()` must never acquire `mmap_lock` in any mode.
    pub vm_lock: RwLock<()>,
    /// RCU head for deferred reclamation. After the VMA is removed from the tree
    /// and the RCU grace period elapses, the slab slot is returned.
    pub rcu: RcuHead,
}

impl Vma {
    /// Compute the PTE protection flags for this VMA.
    /// Translates `vm_flags` into architecture-specific PTE bits.
    pub fn pte_flags(&self) -> PteFlags {
        arch::current::mm::vma_to_pte_flags(self.vm_flags)
    }

    /// Length of the VMA in bytes.
    ///
    /// Returns `usize`, matching the `Address` trait's `Sub<Self> -> usize`
    /// definition. On 32-bit architectures (ARMv7, PPC32), VMA lengths
    /// cannot exceed 4 GB (the virtual address space is 32-bit).
    pub fn len(&self) -> usize {
        self.vm_end.0 - self.vm_start.0
    }
}

/// VM operations table for customizing VMA behavior. Filesystems and device
/// drivers implement this trait and install it via `vma.vm_ops` during
/// `f_op.mmap()`. The trait uses `&self` with interior mutability for
/// mutation (per KABI vtable conventions -- no `&mut self` across boundaries).
pub trait VmOperations: Send + Sync {
    /// Custom page fault handler. Called by `handle_file_fault()` when the
    /// VMA has `vm_ops`. Returns a page to install in the PTE, or an error.
    /// Default: delegates to `filemap_fault()` (generic page cache fault).
    fn fault(&self, vma: &Vma, vmf: &VmFault) -> Result<PageRef, FaultError> {
        filemap_fault(vma, vmf)
    }

    /// Called before a shared file page transitions from read-only to writable.
    /// The filesystem can perform delayed allocation (e.g., ext4 allocates
    /// blocks on write fault, not on mmap). Returns `Ok(())` to proceed with
    /// the write permission upgrade, or `Err` to send SIGBUS.
    fn page_mkwrite(&self, vma: &Vma, page: &Page) -> Result<(), FaultError> {
        Ok(())
    }

    /// Called when a VMA is duplicated during `fork()` / `clone(CLONE_VM=0)`.
    /// The filesystem can increment per-VMA reference counts.
    fn open(&self, _vma: &Vma) {}

    /// Called when a VMA is removed (`munmap`, `exit_mmap`, `mremap` shrink).
    /// The filesystem can release per-VMA resources.
    fn close(&self, _vma: &Vma) {}
}

/// Reverse-mapping structure for anonymous pages. Links all VMAs (across
/// fork COW children) that share the same set of anonymous page origins.
/// Used by the page reclaim scanner to find all PTEs mapping a given
/// physical page so they can be unmapped for reclaim or migration.
///
/// The `anon_vma` forms a tree: after `fork()`, the child VMA's `anon_vma`
/// is linked to the parent's `anon_vma` via `anon_vma_chain`. The rmap
/// walker traverses the tree to find all connected VMAs.
pub struct AnonVma {
    /// Lock protecting the `anon_vma_chain` list and the interval tree
    /// of VMAs. Taken in read mode by rmap walkers (reclaim, migration),
    /// write mode by VMA create/destroy/fork operations.
    pub rwsem: RwLock<()>,
    /// Root of the anon_vma tree. After fork, child `AnonVma` references
    /// the parent's root. The root is the anon_vma of the process that
    /// originally created the anonymous mapping (before any fork).
    pub root: *const AnonVma,
    /// Reference count. Incremented on VMA attach, decremented on VMA
    /// detach. When it reaches zero, the AnonVma is freed via RCU.
    pub refcount: AtomicU32,
}

/// Type alias: `FileRef` is `OpenFile` from the VFS layer
/// ([Section 14.1](14-vfs.md#virtual-filesystem-layer)). The `Vma.file` field uses
/// `Arc<FileRef>` to reference the open file description. Access path for
/// file-backed faults: `vma.file.as_ref().unwrap().inode.i_mapping` yields
/// the `AddressSpace`.
pub type FileRef = OpenFile;

4.8.2.1 `do_mmap()` -- mmap Syscall Implementation¶

The do_mmap() function is the core implementation of the mmap() system call. It validates parameters, selects an address, creates a VMA, and inserts it into the Maple tree. All error paths roll back partial state.

Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules. &self methods use interior mutability for mutation. Atomic fields use .store()/.load(). See CLAUDE.md §Spec Pseudocode Quality Gates.

/// Create a new virtual memory mapping. Called from the `mmap()` syscall
/// dispatch, the ELF loader ([Section 8.3](08-process.md#elf-loader)), and internal kernel
/// mapping paths (e.g., shmem, hugetlbfs).
///
/// # Arguments
/// - `mm`: The target address space.
/// - `addr`: Hint address (or exact address for MAP_FIXED/MAP_FIXED_NOREPLACE).
/// - `len`: Requested mapping length in bytes (rounded up to page boundary).
/// - `prot`: Protection flags (PROT_READ, PROT_WRITE, PROT_EXEC).
/// - `flags`: Mapping flags (MAP_SHARED, MAP_PRIVATE, MAP_ANONYMOUS,
///    MAP_FIXED, MAP_FIXED_NOREPLACE, MAP_POPULATE, MAP_LOCKED,
///    MAP_HUGETLB, MAP_NORESERVE, etc.).
/// - `file`: Backing file for file-backed mappings, or `None` for anonymous.
/// - `offset`: Byte offset within the file (must be page-aligned). Ignored
///    for anonymous mappings.
///
/// # Returns
/// The start address of the new mapping on success, or an error.
///
/// # Errors
/// - `EINVAL`: Invalid flag combination, misaligned offset, zero length,
///    or overflow (addr + len wraps around).
/// - `ENOMEM`: Address space limit exceeded (RLIMIT_AS), VMA count limit
///    exceeded (sysctl_max_map_count), or VMA allocation failed.
/// - `EEXIST`: MAP_FIXED_NOREPLACE and the requested range overlaps an
///    existing mapping.
/// - `EACCES`: File protection check failed (LSM or file permissions).
/// - `EPERM`: MAP_FIXED at a kernel-restricted address.
///
/// # Lock ordering
/// Acquires `mm.mmap_lock.write()` for the entire operation. Calls
/// `MapleTree::write_lock` internally during VMA insertion. May trigger
/// LSM hooks before VMA creation.
pub fn do_mmap(
    mm: &MmStruct,
    addr: usize,
    len: usize,
    prot: u32,
    flags: u32,
    file: Option<Arc<FileRef>>,
    offset: u64,
) -> Result<VirtAddr, Error> {
    // -- Step 1: Parameter validation --

    // Round length up to page boundary.
    let len = page_align_up(len);
    if len == 0 {
        return Err(EINVAL);
    }

    // Check for address overflow.
    if addr.checked_add(len).is_none() {
        return Err(ENOMEM);
    }

    // File offset must be page-aligned.
    if file.is_some() && (offset % PAGE_SIZE as u64 != 0) {
        return Err(EINVAL);
    }

    // Invalid flag combinations:
    // - MAP_SHARED and MAP_PRIVATE are mutually exclusive.
    // - MAP_FIXED_NOREPLACE implies fixed address semantics (no clobber).
    if (flags & MAP_SHARED != 0) && (flags & MAP_PRIVATE != 0) {
        return Err(EINVAL);
    }
    if (flags & MAP_SHARED == 0) && (flags & MAP_PRIVATE == 0) {
        return Err(EINVAL);  // one of SHARED/PRIVATE is required
    }

    // Compute VmFlags from prot + flags.
    let vm_flags = prot_flags_to_vm_flags(prot, flags);

    // -- Step 1b: MAP_ANONYMOUS validation --
    // When MAP_ANONYMOUS is set, file/offset are ignored (Linux behavior).
    // When file is None and MAP_ANONYMOUS is not set, the call is invalid.
    let file = if flags & MAP_ANONYMOUS != 0 {
        None  // ignore fd/offset for anonymous mappings
    } else {
        if file.is_none() {
            return Err(EBADF);
        }
        file
    };

    // -- Step 1c: TASK_SIZE validation --
    // Prevent mappings above the user address space boundary.
    if addr.saturating_add(len) > arch::current::TASK_SIZE {
        return Err(ENOMEM);
    }

    // -- Step 2: LSM security check --
    // Before creating any VMA or modifying the address space.
    // Uses `lsm_call!(mmap_file, ...)` — a dedicated mmap hook that receives
    // prot/flags context, matching Linux's `security_mmap_file()` hook name.
    // Separate from `file_security` because mmap needs `prot`/`flags` parameters
    // that generic file ops do not carry. SELinux policy rules reference
    // `mmap_file` by name.
    if let Some(ref f) = file {
        lsm_call!(mmap_file, f, prot, flags)?;
    }

    // -- Step 3: Acquire mmap_lock.write() --
    let _mmap_guard = mm.mmap_lock.write();

    // -- Step 4: Resource limit checks --
    // RLIMIT_AS: total virtual memory limit.
    // `Relaxed` is safe: `mmap_lock.write()` provides the ordering barrier.
    let current_total = mm.total_vm.load(Relaxed);
    if current_total + len as u64 > current_task().process.rlimits.limits[RLIMIT_AS].soft {
        return Err(ENOMEM);
    }

    // sysctl_max_map_count: VMA count limit (default 65530).
    let current_count = mm.map_count.load(Relaxed);
    if current_count >= sysctl_max_map_count() {
        return Err(ENOMEM);
    }

    // Step 4b: Overcommit accounting for private writable mappings.
    // Modes: 0 = heuristic, 1 = always allow, 2 = strict.
    // See [Section 4.2](#physical-memory-allocator--overcommit-accounting).
    if (vm_flags.contains(VM_WRITE) && !vm_flags.contains(VM_SHARED))
        && !(flags & MAP_NORESERVE != 0)
    {
        check_overcommit(mm, len / PAGE_SIZE)?;
    }

    // Step 4c: RLIMIT_MEMLOCK pre-check for MAP_LOCKED.
    // This check MUST happen before VMA creation (Step 8) to avoid a
    // resource leak: if we create the VMA first and then fail the
    // RLIMIT_MEMLOCK check, the VMA exists but is incorrectly not locked,
    // leaving committed virtual memory that cannot be reclaimed without
    // explicit munmap. Matches Linux's `mlock_future_ok()` placement.
    if flags & MAP_LOCKED != 0 && !capable(CAP_IPC_LOCK) {
        check_and_add_locked(&current_task().process, len / PAGE_SIZE)?;
    }

    // -- Step 5: Address selection --
    let map_addr;
    let is_fixed = (flags & MAP_FIXED != 0) || (flags & MAP_FIXED_NOREPLACE != 0);

    if is_fixed {
        // MAP_FIXED or MAP_FIXED_NOREPLACE: use the exact address.
        if addr % PAGE_SIZE != 0 {
            return Err(EINVAL);
        }
        map_addr = VirtAddr::new(addr);

        if flags & MAP_FIXED_NOREPLACE != 0 {
            // Check for overlap without unmapping.
            if maple_find_in_range(&mm.vma_tree, map_addr, map_addr + len).is_some() {
                return Err(EEXIST);
            }
        } else {
            // MAP_FIXED: unmap any existing mappings in the range.
            // Note: this is committed -- if the subsequent VMA insert fails,
            // the hole remains. This matches Linux behavior (MAP_FIXED may
            // create a hole on error). The failure case is exceedingly rare
            // (only possible on ENOMEM during VMA slab allocation after
            // unmapping).
            do_munmap(mm, map_addr, len)?;
        }
    } else {
        // Non-fixed: search for a free region.
        // The hint defaults to mm.mmap_base (ASLR-randomized) when addr=0.
        let hint = if addr != 0 {
            VirtAddr::new(addr)
        } else {
            mm.mmap_base
        };
        map_addr = maple_find_gap(&mm.vma_tree, len, PAGE_SIZE, hint)?;
    }

    // -- Step 6: MAP_HUGETLB dispatch --
    if flags & MAP_HUGETLB != 0 {
        // Extract huge page size from flags (MAP_HUGE_SHIFT bits).
        // Delegate to hugetlbfs-specific mmap path which handles:
        // - Huge page pool reservation and subpool accounting
        // - Creating an anonymous hugetlbfs file (for MAP_ANONYMOUS|MAP_HUGETLB)
        // See [Section 14.18](14-vfs.md#pseudo-filesystems--hugetlbfs-huge-page-filesystem).
        return do_mmap_hugetlb(mm, map_addr, len, vm_flags, flags, file, offset);
    }

    // -- Step 7: Allocate VMA from slab --
    // `slab_alloc_typed` returns a `&'static mut MaybeUninit<Vma>` from the
    // VMA slab cache. The caller must initialize all fields before use.
    // See [Section 4.3](#slab-allocator--typed-allocation-api).
    let vma_mem = slab_alloc_typed::<Vma>(GFP_KERNEL)?;

    // Initialize VMA fields via `MaybeUninit::write()` to construct a valid Vma.
    let vma = vma_mem.write(Vma {
        vm_start: map_addr,
        vm_end: map_addr + len,
        vm_flags,
        vm_pgoff: if file.is_some() { offset / PAGE_SIZE as u64 } else { 0 },
        vm_ops: None,       // set below for file-backed mappings
        anon_vma: None,     // lazily initialized on first anon fault
        file: None,
        vm_lock: RwLock::new(()),
        rcu: RcuHead::new(),
    });

    // -- Step 8: File mapping setup --
    // The file's mmap() callback MUST run BEFORE the merge attempt (Step 8b)
    // because filesystems may modify vm_flags (e.g., ext4 clears VM_MAYWRITE
    // for DAX, tmpfs sets VM_NORESERVE). Merging before the callback would
    // use pre-callback flags, potentially merging incompatible VMAs.
    // This matches Linux mm/mmap.c mmap_region() ordering: call_mmap() runs
    // before vma_merge().
    if let Some(ref f) = file {
        vma.file = Some(Arc::clone(f));

        // Call the file's mmap operation to let the filesystem customize
        // the VMA. The `file_mmap()` method receives the VMA by mutable
        // reference, allowing the filesystem to set `vma.vm_ops` to its
        // own VmOperations (e.g., ext4_file_vm_ops, shmem_vm_ops) and
        // adjust `vm_flags`. This matches Linux's `f_op->mmap(file, vma)`
        // in-kernel calling convention.
        //
        // Note: The VFS `FileOps::mmap` trait method is defined with a
        // decomposed signature for KABI ring transport. For the Tier 0
        // in-kernel `do_mmap()` path, we call the direct-call variant
        // which accepts `&mut Vma`. See [Section 14.1](14-vfs.md#virtual-filesystem-layer--file-operations).
        f.f_op.mmap(f, &mut vma)?;

        // Re-read vm_flags after the callback — the filesystem may have
        // modified them (e.g., set VM_IO, cleared VM_MAYWRITE).
        let vm_flags = vma.vm_flags;

        // For MAP_SHARED: register the VMA with the file's AddressSpace
        // interval tree for reverse mapping (used by writeback, truncation,
        // page migration — rmap_walk_file() needs to find all VMAs mapping
        // a given file offset range).
        if vm_flags.contains(VM_SHARED) {
            let mapping = &f.inode.i_mapping;
            let _imap_guard = mapping.i_mmap.write();
            mapping.i_mmap.tree.insert(
                vma.vm_pgoff,
                vma.vm_pgoff + (len / PAGE_SIZE) as u64,
                &vma,
            );
            // _imap_guard drops here, releasing the RwLock.
        }
    } else {
        // Anonymous mapping: vm_ops stays None. anon_vma is set lazily
        // on first page fault (anon_vma_prepare()).
        // MAP_GROWSDOWN: stack-like VMA that grows downward on fault.
        if flags & MAP_GROWSDOWN != 0 {
            vma.vm_flags |= VM_GROWSDOWN;
            // Stack guard gap: do not allow mappings within STACK_GUARD_GAP
            // pages below the VMA. The page fault handler checks this.
        }
    }

    // -- Step 8b: VMA merge attempt (AFTER file mmap callback) --
    // Now that vm_flags are finalized (the filesystem's mmap() callback has
    // run and may have modified them), attempt to merge with an adjacent VMA
    // that has identical protection, flags, file, and offset. This is critical
    // for sequential mmap() calls (JVMs, dynamic linkers) and keeps map_count
    // below sysctl_max_map_count.
    //
    // The merge attempt uses the FINAL vm_flags (post-callback). Merging
    // before the callback would use pre-callback flags, potentially merging
    // VMAs with incompatible flags (e.g., one with VM_IO set by the driver,
    // another without). See Linux mm/mmap.c vma_merge().
    if let Some(merged_vma) = vma_merge(mm, map_addr, map_addr + len,
                                         vma.vm_flags, &vma.file,
                                         vma.vm_pgoff) {
        // Merge succeeded: the existing VMA was extended. The new VMA is
        // redundant — free it. map_count is NOT incremented (no new VMA).
        // vma_merge() already updated the i_mmap interval tree for
        // file-backed MAP_SHARED VMAs (remove old + reinsert with extended
        // range). Only total_vm needs updating.
        //
        // If we registered in the i_mmap tree above (Step 8, MAP_SHARED),
        // unregister the new VMA since the merged VMA covers the range.
        if let Some(ref f) = vma.file {
            if vma.vm_flags.contains(VM_SHARED) {
                let mapping = &f.inode.i_mapping;
                let _imap_guard = mapping.i_mmap.write();
                mapping.i_mmap.tree.remove(&vma);
            }
        }
        slab_free_typed(vma);
        mm.total_vm.fetch_add(len as u64, Relaxed);
        if map_addr.0 + len > mm.highest_vm_end.load(Relaxed) {
            mm.highest_vm_end.store(map_addr.0 + len, Relaxed);
        }
        // MAP_LOCKED and MAP_POPULATE handled below (Step 11-12).
        // Fall through to the post-insert steps.
    } else {

    // -- Step 9: Insert VMA into Maple tree --
    match maple_insert(&mm.vma_tree, &vma) {
        Ok(()) => {}
        Err(e) => {
            // Rollback: free the VMA slab allocation. If file-backed,
            // unregister from the AddressSpace interval tree.
            // Note: under mmap_lock.write(), maple_insert can only fail
            // due to ENOMEM (slab exhaustion during tree node splitting),
            // not address-in-use (the range is guaranteed free at this point).
            if let Some(ref f) = vma.file {
                if vma.vm_flags.contains(VM_SHARED) {
                    let mapping = &f.inode.i_mapping;
                    let _imap_guard = mapping.i_mmap.write();
                    mapping.i_mmap.tree.remove(&vma);
                    // _imap_guard drops here, releasing the RwLock.
                }
                if let Some(ref ops) = vma.vm_ops {
                    ops.close(&vma);
                }
            }
            slab_free_typed(vma);
            return Err(e.into());
        }
    }

    // -- Step 10: Update mm accounting --
    mm.total_vm.fetch_add(len as u64, Relaxed);
    mm.map_count.fetch_add(1, Relaxed);
    // Safe: mmap_lock.write() provides exclusion for this load-compare-store.
    if map_addr.0 + len > mm.highest_vm_end.load(Relaxed) {
        mm.highest_vm_end.store(map_addr.0 + len, Relaxed);
    }

    } // end of else-branch for vma_merge

    // -- Step 11: MAP_LOCKED handling --
    // The RLIMIT_MEMLOCK check was already performed at Step 4c (before VMA
    // creation). For CAP_IPC_LOCK holders, the locked_pages accounting is
    // done unconditionally here since they bypass the limit check.
    if flags & MAP_LOCKED != 0 && capable(CAP_IPC_LOCK) {
        current_task().process.locked_pages.fetch_add(
            (len / PAGE_SIZE) as u64, Relaxed,
        );
    }

    // -- Step 12: MAP_POPULATE / MAP_LOCKED pre-fault --
    // Locked pages must be resident; MAP_POPULATE explicitly requests pre-fault.
    if (flags & MAP_POPULATE != 0) || (flags & MAP_LOCKED != 0) {
        // Drop mmap_lock.write() before pre-faulting. Pre-fault invokes
        // the page fault handler which needs only mmap_lock.read() (or the
        // per-VMA lock). Dropping the write lock allows concurrent faults.
        drop(_mmap_guard);

        // mm_populate() walks the VMA and faults in each page eagerly.
        // Allocation failures during pre-fault are silently ignored
        // (partial population is not an error -- matches Linux behavior).
        // mmap_lock is NOT held during populate (dropped above) to avoid
        // holding the write lock across page faults.
        mm_populate(mm, map_addr, len);
        // mm_populate pseudocode:
        //   fn mm_populate(mm: &MmStruct, start: VirtAddr, len: usize) {
        //       let end = start + len;
        //       let mut addr = start;
        //       while addr < end {
        //           // Re-acquire mmap_lock for read to look up VMA.
        //           let _guard = mm.mmap_lock.read();
        //           let vma = find_vma(mm, addr);
        //           if vma.is_none() { break; }
        //           let vma = vma.unwrap();
        //           drop(_guard);
        //           // Fault in one page — handle_mm_fault re-acquires
        //           // locks as needed (VMA lock or mmap_lock read).
        //           let _ = handle_mm_fault(mm, vma, addr, AccessType::Read);
        //           addr += PAGE_SIZE;
        //       }
        //   }
    }
    // When MAP_POPULATE is NOT set, _mmap_guard drops at function return,
    // releasing mmap_lock.write().

    Ok(map_addr)
}

MAP_NORESERVE: When MAP_NORESERVE is set, the kernel does not reserve swap space for the mapping. The overcommit accounting (Section 4.2) skips the reservation for this VMA. The vm_flags include VM_NORESERVE which is checked by the overcommit logic during page fault charge operations.

4.8.2.2 `prot_flags_to_vm_flags()` — Translate mmap flags to VmFlags¶

/// Translate `PROT_*` and `MAP_*` flags from the mmap syscall into the
/// internal `VmFlags` bitfield stored on the VMA. This function is the
/// canonical mapping between userspace-visible flags and kernel-internal
/// VMA properties.
pub fn prot_flags_to_vm_flags(prot: u32, flags: u32) -> VmFlags {
    let mut vm = VmFlags::empty();

    // Protection bits.
    if prot & PROT_READ  != 0 { vm |= VM_READ; }
    if prot & PROT_WRITE != 0 { vm |= VM_WRITE; }
    if prot & PROT_EXEC  != 0 { vm |= VM_EXEC; }

    // "May" bits — the maximum permissions that mprotect() can later grant.
    // Controlled by the file's open mode. For anonymous mappings, all
    // may-bits corresponding to the requested prot are set.
    if prot & PROT_READ  != 0 { vm |= VM_MAYREAD; }
    if prot & PROT_WRITE != 0 { vm |= VM_MAYWRITE; }
    if prot & PROT_EXEC  != 0 { vm |= VM_MAYEXEC; }

    // Mapping type.
    if flags & MAP_SHARED != 0 {
        vm |= VM_SHARED;
        vm |= VM_MAYSHARE;
    }

    // Behavioral flags.
    if flags & MAP_GROWSDOWN  != 0 { vm |= VM_GROWSDOWN; }
    if flags & MAP_LOCKED     != 0 { vm |= VM_LOCKED; }
    if flags & MAP_NORESERVE  != 0 { vm |= VM_NORESERVE; }
    if flags & MAP_DONTEXPAND != 0 { vm |= VM_DONTEXPAND; }
    if flags & MAP_HUGETLB    != 0 { vm |= VM_HUGETLB; }
    // MAP_STACK is currently a no-op hint (matches Linux).
    // MAP_32BIT (x86-64): handled by address selection, not VmFlags.
    // MAP_SYNC (DAX): handled by the filesystem mmap hook.

    vm
}

VmFlags bitfield: The complete set of VM_* flags with Linux-compatible values:

Flag	Bit	Description
`VM_READ`	0	Readable
`VM_WRITE`	1	Writable
`VM_EXEC`	2	Executable
`VM_SHARED`	3	Shared (vs private/COW)
`VM_MAYREAD`	4	mprotect may set VM_READ
`VM_MAYWRITE`	5	mprotect may set VM_WRITE
`VM_MAYEXEC`	6	mprotect may set VM_EXEC
`VM_MAYSHARE`	7	mprotect may set VM_SHARED
`VM_GROWSDOWN`	8	Stack-like, grows downward
`VM_UFFD_MISSING`	9	Reserved: userfaultfd missing-page tracking (set by `UFFDIO_REGISTER` with `UFFDIO_REGISTER_MODE_MISSING`)
`VM_PFNMAP`	10	Pure PFN mapping (no struct page)
`VM_LOCKED`	11	Pages locked in memory
`VM_IO`	12	Memory-mapped I/O
`VM_SEQ_READ`	13	Sequential readahead hint
`VM_RAND_READ`	14	Random access hint
`VM_DONTCOPY`	15	Do not copy on fork
`VM_DONTEXPAND`	16	Cannot mremap to expand
`VM_NORESERVE`	19	No overcommit reservation
`VM_HUGETLB`	20	Backed by huge pages
`VM_DONTDUMP`	22	Exclude from core dump
`VM_ACCOUNT`	24	Charge to memory cgroup
`VM_NOHUGEPAGE`	25	Disable THP
`VM_HUGEPAGE`	26	Enable THP (advice)
`VM_DSM`	56	UmkaOS: DSM-managed region

4.8.2.3 `vma_merge()` — Merge a New Mapping with Adjacent VMAs¶

Attempts to merge a newly created mapping region with adjacent VMAs that have compatible properties. This reduces VMA count for sequential mmap() calls (JVMs, dynamic linkers), keeping map_count below sysctl_max_map_count.

Three merge cases exist, following Linux mm/vma.c vma_merge_new_range():

Case	Condition	Result
Merge-both	Predecessor VMA ends at `start` AND successor VMA starts at `end`, AND both have identical `vm_flags`, `file`, `vm_pgoff` continuity	Predecessor is extended to cover `[pred.vm_start, succ.vm_end)`. Successor is removed and freed.
Merge-prev	Predecessor VMA ends at `start` with identical `vm_flags`, `file`, `vm_pgoff` continuity	Predecessor `vm_end` is extended to `end`.
Merge-next	Successor VMA starts at `end` with identical `vm_flags`, `file`, `vm_pgoff` continuity	Successor `vm_start` is shrunk to `start`.

Merge criteria (all must match between the new region and the adjacent VMA): - vm_flags — exact bitwise match (post-filesystem-callback flags) - vm_ops — same pointer (same VmOperations vtable) - file — same Arc<File> or both None - vm_pgoff continuity — for file-backed VMAs, the page offset must be contiguous (pred.vm_pgoff + (pred.vm_end - pred.vm_start) / PAGE_SIZE == new.vm_pgoff) - anon_vma — either both None, or mergeable (same root anon_vma) - No VM_SPECIAL flags that prohibit merging (VM_PFNMAP, VM_IO, VM_DONTEXPAND)

/// Attempt to merge a new mapping with adjacent VMAs.
///
/// # Arguments
/// - `mm`: the address space (mmap_lock must be held for write)
/// - `start`: start address of the new region
/// - `end`: end address of the new region (exclusive)
/// - `vm_flags`: finalized VmFlags (AFTER filesystem mmap callback)
/// - `file`: optional file backing
/// - `pgoff`: page offset (file-backed) or 0 (anonymous)
///
/// # Returns
/// `Some(&mut Vma)` pointing to the merged VMA on success, `None` if no
/// merge is possible. On success the Maple tree is updated in-place
/// (the merged VMA's range is extended). The caller must NOT insert a
/// new VMA into the tree.
///
/// # Preconditions
/// - `mm.mmap_lock` is held for write.
/// - The address range `[start, end)` does not overlap any existing VMA
///   (the gap was verified in Step 5).
///
/// # i_mmap interval tree update
/// For file-backed MAP_SHARED VMAs, the merged VMA's entry in the
/// `AddressSpace.i_mmap` interval tree is updated (remove + reinsert
/// with the extended pgoff range) so that reverse-mapping walks
/// (`rmap_walk_file()`) find the correct range. The i_mmap RwLock is
/// acquired internally.
/// Attempt to merge a new mapping with adjacent VMAs.
///
/// # Mutability model
/// This function takes `mm: &MmStruct` (shared reference) and returns
/// `Option<&mut Vma>`. The `&mut Vma` is sound because:
/// 1. The caller holds `mmap_lock` in write mode (exclusive access to VMA tree).
/// 2. Maple tree nodes are slab-allocated — each VMA is a separate allocation.
///    The `&mut Vma` references disjoint allocations, not overlapping array elements.
/// 3. The `&MmStruct` grants read access to the maple tree structure; the write
///    lock grants exclusive access to the VMA contents within.
///
/// In implementation, `MmStruct.vma_tree` uses interior mutability
/// (`UnsafeCell<MapleTree>`) with the SAFETY invariant: mmap_lock write mode
/// is held for all mutations. The `maple_remove`, `maple_update_range`, and
/// VMA field mutations below are all protected by this invariant.
fn vma_merge(
    mm: &MmStruct,
    start: VirtAddr,
    end: VirtAddr,
    vm_flags: VmFlags,
    file: &Option<Arc<File>>,
    pgoff: u64,
) -> Option<&mut Vma> {
    // Cannot merge VMAs with special flags.
    if vm_flags.intersects(VM_PFNMAP | VM_IO | VM_DONTEXPAND) {
        return None;
    }

    // Look up the predecessor (VMA ending at or before `start`) and
    // successor (VMA starting at or after `end`) via Maple tree walk.
    let prev = maple_find_prev(&mm.vma_tree, start);
    let next = maple_find_next(&mm.vma_tree, end);

    let can_merge_prev = prev.map_or(false, |p| {
        p.vm_end == start
        && p.vm_flags == vm_flags
        && file_eq(&p.file, file)
        && pgoff_contiguous(p, start, pgoff)
        && anon_vma_compatible(p, file)
        && vm_ops_eq(p.vm_ops.as_deref(), None) // new VMA has no vm_ops yet if anon
    });

    let can_merge_next = next.map_or(false, |n| {
        n.vm_start == end
        && n.vm_flags == vm_flags
        && file_eq(&n.file, file)
        && pgoff == n.vm_pgoff - ((end - start).as_usize() / PAGE_SIZE) as u64
        && anon_vma_compatible(n, file)
    });

    if can_merge_prev && can_merge_next {
        // Case 1: merge-both — extend prev to cover [prev.start, next.end),
        // remove next from the tree.
        let prev = prev.unwrap();
        let next = next.unwrap();

        // Update i_mmap interval tree for file-backed shared VMAs.
        if let Some(ref f) = prev.file {
            if prev.vm_flags.contains(VM_SHARED) {
                let mapping = &f.inode.i_mapping;
                let _imap_guard = mapping.i_mmap.write();
                // Remove both old entries, insert one covering the full range.
                mapping.i_mmap.tree.remove(prev);
                mapping.i_mmap.tree.remove(next);
                // Will reinsert after updating prev.vm_end.
            }
        }

        prev.vm_end = next.vm_end;
        maple_remove(&mm.vma_tree, next.vm_start, next.vm_end);

        // Reinsert the extended predecessor into i_mmap.
        if let Some(ref f) = prev.file {
            if prev.vm_flags.contains(VM_SHARED) {
                let mapping = &f.inode.i_mapping;
                let _imap_guard = mapping.i_mmap.write();
                mapping.i_mmap.tree.insert(
                    prev.vm_pgoff,
                    prev.vm_pgoff + ((prev.vm_end - prev.vm_start).as_usize() / PAGE_SIZE) as u64,
                    prev,
                );
            }
        }

        // Update the Maple tree entry for prev (range changed).
        maple_update_range(&mm.vma_tree, prev);
        // Free the successor VMA.
        slab_free_typed(next);
        mm.map_count.fetch_sub(1, Relaxed);
        return Some(prev);
    }

    if can_merge_prev {
        // Case 2: merge-prev — extend prev.vm_end to `end`.
        let prev = prev.unwrap();

        if let Some(ref f) = prev.file {
            if prev.vm_flags.contains(VM_SHARED) {
                let mapping = &f.inode.i_mapping;
                let _imap_guard = mapping.i_mmap.write();
                mapping.i_mmap.tree.remove(prev);
            }
        }

        prev.vm_end = end;

        if let Some(ref f) = prev.file {
            if prev.vm_flags.contains(VM_SHARED) {
                let mapping = &f.inode.i_mapping;
                let _imap_guard = mapping.i_mmap.write();
                mapping.i_mmap.tree.insert(
                    prev.vm_pgoff,
                    prev.vm_pgoff + ((prev.vm_end - prev.vm_start).as_usize() / PAGE_SIZE) as u64,
                    prev,
                );
            }
        }

        maple_update_range(&mm.vma_tree, prev);
        return Some(prev);
    }

    if can_merge_next {
        // Case 3: merge-next — shrink next.vm_start to `start`.
        let next = next.unwrap();

        if let Some(ref f) = next.file {
            if next.vm_flags.contains(VM_SHARED) {
                let mapping = &f.inode.i_mapping;
                let _imap_guard = mapping.i_mmap.write();
                mapping.i_mmap.tree.remove(next);
            }
        }

        next.vm_start = start;
        next.vm_pgoff = pgoff;

        if let Some(ref f) = next.file {
            if next.vm_flags.contains(VM_SHARED) {
                let mapping = &f.inode.i_mapping;
                let _imap_guard = mapping.i_mmap.write();
                mapping.i_mmap.tree.insert(
                    next.vm_pgoff,
                    next.vm_pgoff + ((next.vm_end - next.vm_start).as_usize() / PAGE_SIZE) as u64,
                    next,
                );
            }
        }

        maple_update_range(&mm.vma_tree, next);
        return Some(next);
    }

    None
}

/// Helper: check that page offsets are contiguous for merge-prev.
fn pgoff_contiguous(prev: &Vma, new_start: VirtAddr, new_pgoff: u64) -> bool {
    if prev.file.is_none() {
        return true; // anonymous VMAs do not have pgoff continuity constraints
    }
    let prev_pages = (prev.vm_end - prev.vm_start).as_usize() / PAGE_SIZE;
    prev.vm_pgoff + prev_pages as u64 == new_pgoff
}

/// Helper: compare file references for merge eligibility.
fn file_eq(a: &Option<Arc<File>>, b: &Option<Arc<File>>) -> bool {
    match (a, b) {
        (None, None) => true,
        (Some(fa), Some(fb)) => Arc::ptr_eq(fa, fb),
        _ => false,
    }
}

/// Helper: compare VmOperations pointers (same vtable = same ops).
fn vm_ops_eq(a: Option<&dyn VmOperations>, b: Option<&dyn VmOperations>) -> bool {
    match (a, b) {
        (None, None) => true,
        (Some(va), Some(vb)) => core::ptr::eq(va as *const _, vb as *const _),
        _ => false,
    }
}

/// Helper: check anon_vma compatibility for merge.
fn anon_vma_compatible(vma: &Vma, file: &Option<Arc<File>>) -> bool {
    // File-backed VMAs: anon_vma is None until COW fault, always compatible.
    // Anonymous VMAs: either both None or share the same anon_vma root.
    // For new mappings (anon_vma == None), always compatible.
    vma.anon_vma.is_none() || file.is_some()
}

4.8.2.4 `do_munmap()` — Unmap a Virtual Address Range¶

/// Unmap a range of virtual addresses. Called by `munmap()` syscall dispatch,
/// `do_mmap()` for MAP_FIXED (to clear existing mappings), and `exec_mmap()`
/// (to tear down the old address space).
///
/// # Algorithm
/// 1. Find all VMAs overlapping `[addr, addr + len)` via maple tree range query.
/// 2. If the first overlapping VMA starts before `addr`, split it (create new
///    VMA for the portion before `addr`, adjust the original).
/// 3. If the last overlapping VMA extends beyond `addr + len`, split it (create
///    new VMA for the portion after `addr + len`).
/// 4. For each VMA fully contained in the range:
///    a. Notify MmuNotifier callbacks (KVM, DSM, RDMA need to unmap their
///       translations). See [Section 4.8](#virtual-memory-manager--mmu-notifier-integration).
///    b. `unmap_page_range()`: walk PTEs in the range, clear each PTE,
///       decrement page refcounts, add pages to the TLB batch.
///    c. If file-backed + MAP_SHARED: remove from `AddressSpace.i_mmap`
///       interval tree (under i_mmap RwLock).
///    d. If `vma.vm_ops.close` is set, call `ops.close(&vma)`.
///    e. Remove VMA from maple tree.
///    f. Free VMA slab object (RCU-deferred if concurrent page faults are
///       possible — the per-VMA lock protocol requires VMA to survive one
///       RCU grace period after tree removal).
/// 5. `tlb_flush_range(addr, len)` — batch TLB flush for all cleared PTEs.
/// 6. Update `mm.total_vm`, `mm.map_count`, `process.locked_pages` accounting.
///
/// # Lock ordering
/// Caller must hold `mm.mmap_lock.write()`. The function acquires per-VMA
/// `vm_lock.write()` for each affected VMA before modification.
///
/// # Errors
/// - `EINVAL`: Misaligned address.
/// - `ENOMEM`: VMA split requires slab allocation and it fails.
pub fn do_munmap(mm: &MmStruct, addr: VirtAddr, len: usize) -> Result<(), Error>;

4.8.2.5 `maple_find_in_range()` — Check for VMA overlap¶

/// Check whether any VMA overlaps the address range `[start, start + len)`.
/// Returns `Some(vma_ref)` if an overlapping VMA is found, `None` otherwise.
///
/// Uses the maple tree's pivot-based range search. O(log n) where n is the
/// number of VMAs. Does NOT modify the tree.
///
/// Used by MAP_FIXED_NOREPLACE to detect address conflicts without unmapping.
pub fn maple_find_in_range(
    tree: &MapleTree,
    start: VirtAddr,
    end: VirtAddr,
) -> Option<&Vma>;

4.8.2.6 `slab_alloc_typed<T>()` — Typed Slab Allocation¶

/// Allocate a slab object of type `T` from the appropriate size-class cache.
///
/// Looks up the SizeClass for `size_of::<T>()` with `align_of::<T>()`.
/// Returns a reference to uninitialized memory. The caller MUST initialize
/// all fields before the object is made visible to other threads.
///
/// # Returns
/// `&'static mut MaybeUninit<T>` — the slab-backed memory has 'static lifetime
/// (freed only when explicitly returned via `slab_free_typed()`).
///
/// # Errors
/// - `AllocError` if the slab cache and buddy allocator are both exhausted.
///
/// # Relationship to slab_alloc
/// This is a typed wrapper around `slab_alloc(sc, gfp) -> Result<*mut u8>`.
/// It computes the SizeClass from `size_of::<T>()` and casts the result.
pub fn slab_alloc_typed<T>(gfp: GfpFlags) -> Result<&'static mut MaybeUninit<T>, AllocError>;

/// Free a slab object previously allocated by `slab_alloc_typed<T>()`.
/// The caller must ensure the object is not referenced after this call.
pub fn slab_free_typed<T>(obj: &'static mut T);

4.8.2.7 `do_mmap_hugetlb()` — Hugetlb mmap Path¶

Delegates to the hugetlbfs-specific mmap implementation. The hugetlb path differs from regular mmap in that it checks the huge page pool, handles subpool accounting, creates an anonymous hugetlbfs file for MAP_ANONYMOUS | MAP_HUGETLB, and may allocate pages upfront rather than on-demand. See Section 14.18 for the full specification.

4.8.2.8 `lsm_call!(mmap_file, ...)` — LSM mmap hook¶

The mmap security check uses the uniform lsm_call! macro dispatch:

lsm_call!(mmap_file, file, prot, flags)?;

This expands to iterate all registered SecurityModule instances and call mmap_file(file, prot, flags) on each. If any returns Err(SecurityDenial), the mmap is rejected. The hook name mmap_file matches Linux's security_mmap_file() — SELinux policy rules reference this name.

This is separate from lsm_call!(file_security, ...) because mmap needs prot and flags parameters that generic file ops do not carry (e.g., SELinux denies PROT_EXEC on untrusted files). Each operation with distinct parameters gets its own hook method in the SecurityModule trait.

The SecurityModule trait method:

/// LSM mmap file hook. Called from `do_mmap()` before creating a VMA
/// for a file-backed mapping. Modules check the `prot`/`flags` combination
/// against the file's security context (e.g., SELinux type enforcement).
///
/// See [Section 9.8](09-security.md#linux-security-module-framework--mmap-security-hook).
fn mmap_file(
    &self,
    file: &FileRef,
    prot: u32,
    flags: u32,
) -> Result<(), SecurityDenial> {
    Ok(()) // Default: permit
}

4.8.2.9 `check_overcommit()` — Overcommit accounting¶

/// Global committed pages counter. Single AtomicI64. No locks. No per-CPU
/// batching. mmap is a warm path — global atomic contention is negligible
/// compared to mmap_lock contention.
static COMMITTED_PAGES: AtomicI64 = AtomicI64::new(0);

/// Check and account for committed pages. Uses optimistic fetch_add with
/// rollback on failure.
///
/// Three overcommit modes (sysctl vm.overcommit_memory):
/// - Mode 0 (heuristic): reject if committed > (total_ram * overcommit_ratio/100 + swap)
/// - Mode 1 (always): always permit
/// - Mode 2 (strict): reject if committed > (total_ram * overcommit_ratio/100 + swap)
///
/// Mode 0 and 2 differ in that mode 0 also considers reclaimable pages;
/// mode 2 is the strict upper bound.
pub fn check_overcommit(mm: &MmStruct, pages: u64) -> Result<(), KernelError> {
    let mode = sysctl_overcommit_memory();
    if mode == OVERCOMMIT_ALWAYS {
        return Ok(());
    }

    // Optimistic commit: add first, check second.
    let prev = COMMITTED_PAGES.fetch_add(pages as i64, Relaxed);
    let committed = prev + pages as i64;

    let limit = compute_commit_limit(mode);
    if committed > limit {
        // Over-limit: rollback and reject.
        COMMITTED_PAGES.fetch_sub(pages as i64, Relaxed);
        return Err(KernelError::ENOMEM);
    }

    // LSM veto: some security modules restrict virtual memory commitments.
    if lsm_call!(vm_enough_memory, pages).is_err() {
        COMMITTED_PAGES.fetch_sub(pages as i64, Relaxed);
        return Err(KernelError::ENOMEM);
    }

    Ok(())
}

MapleNode — single node in the Maple tree:

/// Node type determines interpretation of the children/slots union.
/// Dense nodes store up to 16 entries indexed by offset; Range nodes
/// store ranges delimited by pivot values.
#[repr(u8)]
pub enum MapleNodeType {
    /// Dense node: slots are indexed by position (for small, dense ranges).
    /// Used when the address range is compact and contiguous.
    Dense = 0,
    /// Range node: pivots delimit address ranges, children/slots are keyed
    /// by range. Used for the general case of sparse VMA layouts.
    Range = 1,
}

/// A single node in the Maple tree. ~256 bytes (4 cache lines), sized for
/// fanout-16 B-tree nodes. Internal nodes store child pointers; leaf nodes
/// store VMA pointers. The fanout of 16 is chosen to balance tree height
/// (3 levels for ~4096 VMAs) against node size. Nodes are NUMA-allocated
/// from the slab allocator and RCU-protected for lock-free reads.
///
/// Approximate layout:
///   - 1 byte:  node_type + 1 byte nr_entries + 6 bytes padding + 8 bytes gap
///   - 120 bytes: pivots (15 × u64 range boundaries)
///   - 128 bytes: children (16 × *mut MapleNode, 8 bytes each on 64-bit)
///   - remaining: RcuHead + alignment padding
pub struct MapleNode {
    /// Discriminant: Dense or Range.
    pub node_type: MapleNodeType,
    /// Number of valid entries (pivots/children or slots). 0..=16.
    pub nr_entries: u8,
    /// Maximum free virtual address gap (in pages) in this subtree.
    /// Used by `maple_find_gap()` to prune subtrees that cannot satisfy
    /// an allocation request. u64 required: virtual address spaces on
    /// 64-bit architectures exceed 16 TB (x86-64: 128 TB, AArch64
    /// 48-bit: 256 TB, RISC-V Sv57: 64 PB), which overflows u32 page
    /// counts. Updated bottom-up on insert/remove.
    pub gap: u64,
    /// Range boundaries (for Range nodes). `pivots[i]` is the exclusive
    /// upper bound of range i. Entry i covers addresses
    /// `[pivots[i-1], pivots[i])` (with `pivots[-1]` implicitly 0 for
    /// the first entry). Up to 15 pivots delimit up to 16 ranges.
    pub pivots: [u64; 15],
    /// For internal nodes: child pointers (RCU-protected).
    /// For leaf nodes: VMA pointers.
    /// Exactly one of these is valid based on whether this is an internal
    /// or leaf node (determined by tree depth, not a per-node flag).
    /// Using a union avoids wasting space on an enum discriminant at
    /// every slot.
    pub children: [*mut MapleNode; 16],  // internal nodes
    // -- OR (union, leaf nodes) --
    // pub slots: [*mut Vma; 16],        // leaf nodes
    /// RCU head for deferred reclamation after copy-on-write replacement.
    pub rcu: RcuHead,
}

MapleTree — top-level tree descriptor:

/// The Maple tree VMA index. One per address space
/// ([Section 4.8](#virtual-memory-manager--mmstruct-per-process-address-space)).
///
/// Readers access the tree under `rcu_read_lock()` with no write-side
/// synchronization — the tree is always in a consistent state because
/// mutations use copy-on-write (new path from root to modified leaf,
/// then atomic root pointer swap).
///
/// Writers hold `write_lock` for mutual exclusion, then perform COW
/// mutations and publish via `rcu_assign_pointer()` on the root.
pub struct MapleTree {
    /// Root node pointer, RCU-protected. Readers load atomically via
    /// `rcu_dereference(root)`. Writers replace via `rcu_assign_pointer()`.
    /// NULL for an empty tree (no VMAs).
    pub root: RcuPtr<MapleNode>,
    /// Write-side serialization. Held during insert, remove, and split/merge
    /// operations. Readers never acquire this lock — they use RCU.
    pub write_lock: RwLock<()>,
    /// Highest virtual address mapped in this tree. Cached to avoid
    /// tree traversal for TASK_SIZE checks and stack growth limit
    /// enforcement. Updated on insert/remove.
    pub highest_addr: u64,
    /// Number of MapleNode objects in this tree (for memory accounting
    /// and diagnostics via `/proc/<pid>/status`).
    pub node_count: u32,
}

RCU read protocol:

Readers hold rcu_read_lock(), load root atomically via rcu_dereference(), and traverse the tree without acquiring any lock. All node pointers encountered during traversal are guaranteed valid for the duration of the RCU read-side critical section (writers use RCU-safe publish for all structural changes, and non-structural slot updates use rcu_assign_pointer() with smp_wmb() barriers so readers see either the old or new value, never a torn pointer).

Mutation protocol — two categories:

Non-structural mutations (slot update within an existing node, e.g., mprotect() changing VMA flags, madvise() updating VMA hints): The writer updates the affected slot in-place via a single rcu_assign_pointer(slots[offset], new_entry) on the existing leaf node. Zero slab allocations. The write-side smp_wmb() ensures the new VMA content is fully visible before the slot pointer update. RCU readers see either the old or new VMA, never a torn state. This is safe because multi-VMA consistency (e.g., mprotect across multiple VMAs) is serialized by mmap_lock.write(), not by tree-level atomicity.
Structural mutations (insert/remove entries requiring node split or merge, e.g., mmap(), munmap()): The writer allocates new node(s) via the slab allocator, copies unmodified children by pointer, and publishes the new subtree via rcu_assign_pointer() on the parent pointer. O(1) to O(h) slab allocations depending on whether splits propagate (h = tree height, typically 3-4). Old nodes are freed via rcu_call() after the current grace period.

This dual-path design eliminates O(depth) allocations per mprotect/madvise on JIT-heavy workloads (V8, JVM doing millions of mprotect() calls). Verified against Linux lib/maple_tree.c: mas_wr_slot_store() performs direct slot assignment via rcu_assign_pointer(slots[offset], wr_mas->entry) for non-structural mutations.

Operations:

/// Find the VMA containing `addr`, if any.
///
/// Walk from root comparing `addr` against pivots at each level to
/// select the correct child/slot. O(log n) with ~3 cache-line accesses
/// for typical VMA counts. Zero locks for readers (RCU read-side only).
///
/// # Arguments
/// * `tree` - The Maple tree to search.
/// * `addr` - Virtual address to look up.
///
/// # Returns
/// `Some(&Vma)` if `addr` falls within a mapped VMA, `None` otherwise.
/// The returned reference is valid for the duration of the caller's
/// `rcu_read_lock()` critical section.
pub fn maple_find(tree: &MapleTree, addr: u64) -> Option<&Vma>;

/// Insert a VMA into the tree.
///
/// Acquires `write_lock`. This is a structural mutation: allocates new
/// node(s) as needed, inserts the VMA, splits the leaf node if full
/// (promoting a pivot to the parent, splitting recursively if needed).
/// Updates gap fields bottom-up. Publishes changes via `rcu_assign_pointer()`.
/// O(1) to O(h) slab allocations where h is tree height (typically 3-4).
///
/// # Ownership
/// Takes `vma: Arc<Vma>` — the maple tree stores `Arc<Vma>` references.
/// This matches the ownership model: VMAs are slab-allocated, reference-counted
/// objects. The tree holds one Arc reference; the caller (e.g., `do_mmap` error
/// path) can hold another via `Arc::clone()` taken before insertion. If insertion
/// fails, the caller's clone remains valid for cleanup (e.g., `vma.file` unref).
///
/// # Mutability
/// Takes `tree: &MapleTree` (shared reference) because `MapleTree` uses interior
/// mutability (`UnsafeCell` internally). SAFETY invariant: mmap_lock write mode
/// is held. The previous signature (`tree: &mut MapleTree`) was inconsistent
/// with callers that access the tree through `&MmStruct`.
///
/// # Errors
/// Returns `Err(ErrAddrInUse)` if any part of the VMA's address range
/// `[vma.start, vma.end)` overlaps an existing VMA.
pub fn maple_insert(tree: &MapleTree, vma: Arc<Vma>) -> Result<(), ErrAddrInUse>;

/// Remove all VMAs overlapping the address range `[addr_start, addr_end)`.
///
/// Acquires `write_lock`. This is a structural mutation: allocates
/// replacement node(s), removes entries, merges underful nodes (nodes
/// with fewer than 25% of slots occupied are merged with a sibling).
/// Updates gap fields bottom-up. Publishes via `rcu_assign_pointer()`.
///
/// Removed VMAs are returned to the caller for cleanup (unmapping
/// page table entries, freeing backing pages). The old tree nodes
/// are freed via `rcu_call()` after the grace period.
pub fn maple_remove(tree: &mut MapleTree, addr_start: u64, addr_end: u64);

/// Find the lowest virtual address with at least `size` contiguous free
/// bytes, aligned to `align`, starting the search at `hint_addr`.
///
/// Uses the gap index: each internal node's `gap` field records the
/// maximum free gap in its subtree. The walker descends only into
/// subtrees whose `gap >= size` (in pages), pruning branches that
/// cannot possibly satisfy the request. This makes unmapped-area
/// search O(log n) instead of the O(n) linear VMA scan required
/// without gap tracking.
///
/// Used by `mmap()` without `MAP_FIXED` to find a suitable address.
/// The `hint_addr` (from the `addr` argument to mmap, or from the
/// per-mm free-area cursor) biases the search toward a preferred
/// region (typically above the current `brk` for bottom-up layouts
/// or below the stack for top-down layouts).
pub fn maple_find_gap(
    tree: &MapleTree,
    size: usize,
    align: usize,
    hint_addr: VirtAddr,
) -> Result<VirtAddr, Error>;
// Returns `Err(ENOMEM)` when no gap of `size` bytes with `align` alignment
// is available in the address space. `hint_addr` biases the search toward
// a preferred region (typically `mm.mmap_base` for ASLR).

Gap tracking:

Each internal node tracks the maximum free virtual address gap (in pages) within its subtree via the gap field. On every insert or remove, gaps are recalculated bottom-up from the modified leaf to the root.

Gap recalculation algorithm (bottom-up, O(height)):

recalc_gap(node):
  if node.is_leaf():
    # Leaf: gaps between stored VMAs plus the span before first and after last pivot
    node.gap = max_gap_in_leaf(node)
    # max_gap_in_leaf computes:
    #   span from node_min to pivot[0].vm_start,
    #   for each i: pivot[i].vm_end to pivot[i+1].vm_start,
    #   span from pivot[last].vm_end to node_max
    return
  # Internal node: maximum of all children's subtree gaps
  node.gap = 0
  for i in 0..=node.num_children:
    child_gap = node.children[i].gap   # already updated from recursive call below
    node.gap = max(node.gap, child_gap)

# Called after insert or remove modifies leaf L:
  Walk path from L to root.
  At each node on the path (bottom-up): recalc_gap(node)

Pivot boundary invariant: A node's gap represents the maximum contiguous free range (in pages) that fits entirely within the address span [node_min, node_max). A free range that spans a node boundary is split across child nodes — neither child records the full gap. This is acceptable: find_unmapped_area() checks child.gap >= requested_pages before descending, and if a cross-boundary free range exists, both adjacent children will have gap >= half the range, so the traversal will visit both and find the allocation.

Complexity: recalculation is O(height) = O(log n) node updates per insert/remove, where n is the number of VMAs. Each level updates exactly one node (the ancestor on the modification path).

This makes find_unmapped_area() (the core of mmap address selection) O(log n) instead of O(n): the walker descends only into subtrees with a sufficiently large gap, skipping entire branches of the tree that cannot satisfy the allocation. For a process with 10,000 VMAs, this reduces the search from ~10,000 VMA comparisons to ~4 node visits (tree height).

4.8.3 Page Fault Handler¶

The page fault handler is the most performance-critical path in the virtual memory subsystem — every demand-paged access, every COW fork, every compressed-page decompression, and every swap-in passes through it. The handler runs in UmkaOS Core (Tier 0) with zero domain crossings for the common case (anonymous page fault).

Fault entry:

Architecture-specific trap handlers (x86-64: #PF via IDT entry 14; AArch64: ESR_EL1 data/instruction abort; RISC-V: scause page fault exceptions) call the common entry point.

FaultError — error taxonomy for page fault handling:

Every handle_*_fault() function in this section returns Result<(), FaultError>. The variants map directly to signal delivery or retry actions. The caller (handle_page_fault) converts FaultError into the appropriate response: signal delivery for user-mode faults, kernel oops/panic for kernel-mode faults (after checking the exception fixup table).

/// Error taxonomy for page fault handling. Returned by `handle_*_fault()`
/// functions. Each variant maps to a specific signal or retry action.
///
/// Ordering: variants are arranged by severity (recoverable → fatal).
/// The `handle_page_fault` dispatcher uses this to decide the response:
///   - `Retry` → re-execute the faulting instruction (no signal).
///   - `Oom` → invoke the OOM killer, then retry or deliver SIGKILL.
///   - `Sigbus` / `Sigsegv` → deliver the corresponding signal.
///   - `HwPoison` / `MceRecoverable` → isolate the page, deliver SIGBUS
///     with `si_code = BUS_MCEERR_AR`.
///   - `SwapReadError` / `FileFault` → deliver SIGBUS (I/O error on
///     backing store).
pub enum FaultError {
    /// Out of memory — no pages available after reclaim + compaction.
    /// Triggers the OOM killer ([Section 4.2](#physical-memory-allocator--oom-killer)).
    /// If the OOM killer frees memory, the fault is retried. If the
    /// faulting task is itself selected for OOM kill, SIGKILL is delivered.
    ///
    /// **Signal delivery for user-mode OOM**: When `handle_page_fault()`
    /// receives `FaultError::Oom` for a user-mode fault, it invokes the OOM
    /// killer ([Section 4.5](#oom-killer)), which selects a victim task based on
    /// `oom_badness()` scoring. The victim (which may or may not be the
    /// faulting task) receives SIGKILL (force-delivered, non-maskable,
    /// non-catchable). If the OOM killer frees enough memory, the faulting
    /// task retries the fault. If the faulting task is itself selected as
    /// the OOM victim, it receives SIGKILL and does not retry.
    ///
    /// **Kernel-mode OOM during fault handling**: If a kernel-mode fault
    /// path (e.g., `copy_from_user` triggering a page fault for a
    /// user-mapped address, or a vmalloc fault) returns `FaultError::Oom`,
    /// the fault handler first checks the exception fixup table. If a fixup
    /// entry exists, control transfers to the fixup handler (which typically
    /// returns `-ENOMEM` to the caller). If no fixup entry exists, the
    /// kernel panics with the faulting address and allocation context
    /// (GFP flags, requested order, NUMA node) for post-mortem analysis.
    Oom,
    /// Bus error — access to a file region beyond EOF (truncated file),
    /// or a fault on a mapping backed by a device that returned an error.
    /// Delivers SIGBUS with `si_code = BUS_ADRERR`.
    Sigbus,
    /// Segmentation fault — access violation (write to read-only VMA,
    /// execute on non-exec VMA, or access to an address with no VMA).
    /// Delivers SIGSEGV with `si_code = SEGV_ACCERR` (permission) or
    /// `SEGV_MAPERR` (no mapping).
    Sigsegv,
    /// Transient conflict — retry the fault. Returned when a page was
    /// being migrated (NUMA balancing), a THP was being split, or a
    /// COW page was being resolved by another thread in the same
    /// address space. The caller drops the per-VMA lock (or `mmap_lock`
    /// on the slow path), yields briefly (`cond_resched()`), and retries.
    Retry,
    /// Hardware poison — uncorrectable memory error detected by EDAC
    /// ([Section 20.6](20-observability.md#edac-error-detection-and-correction-framework)).
    /// The faulting page is isolated (removed from all page tables,
    /// marked PG_hwpoison). Delivers SIGBUS with
    /// `si_code = BUS_MCEERR_AR` (action required).
    HwPoison,
    /// Machine check recoverable — MCE on the faulting page, but the
    /// error is correctable or the page can be offlined. Similar to
    /// `HwPoison` but the kernel may attempt transparent recovery
    /// (e.g., re-reading from backing store for file-backed pages).
    MceRecoverable,
    /// Swap read failed — I/O error reading a page from the swap
    /// device. The page remains on swap; the PTE is left as a swap
    /// entry. Delivers SIGBUS with `si_code = BUS_ADRERR`.
    SwapReadError(KernelError),
    /// File-backed fault failed — error from
    /// `AddressSpaceOps::read_page()` or iomap fault path. Typically
    /// an I/O error on the underlying block device or a network
    /// timeout for NFS. Delivers SIGBUS with `si_code = BUS_ADRERR`.
    FileFault(KernelError),
}

The KernelError type is the kernel's unified error enum (Section 3.14), which includes IoError and other I/O-related error codes. SwapReadError and FileFault carry the specific I/O error to enable detailed logging via FMA (Section 20.1).

/// Top-level page fault handler, called from architecture-specific trap code.
///
/// # Arguments
/// * `addr` - Faulting virtual address (from CR2 on x86, FAR_EL1 on AArch64,
///   stval on RISC-V).
/// * `access` - Type of access that caused the fault.
/// * `user_mode` - Whether the fault occurred in user mode (Ring 3 / EL0).
///
/// # Returns
/// `Ok(())` if the fault was resolved (page installed, execution resumes).
/// `Err(FaultError)` if the fault is fatal (signal delivery or kernel panic).
pub fn handle_page_fault(
    addr: VirtAddr,
    access: AccessType,
    user_mode: bool,
) -> Result<(), FaultError> {
    let mm = current_mm();

    // Enter NOFS allocation context — suppress filesystem writeback
    // re-entry for all allocations in the fault path.
    let _nofs = MemallocNofsGuard::new();

    // Step 1: VMA lookup (per-VMA lock fast path or mmap_lock slow path).
    // Details: see Per-VMA Locking section below.
    let vma = match find_vma_lockless(mm, addr) {
        Some(vma) => vma,
        None => {
            // No VMA at this address.
            if user_mode {
                return Err(FaultError::Segv(SEGV_MAPERR));
            } else {
                // Kernel fault on unmapped address — oops/panic.
                return Err(FaultError::KernelOops);
            }
        }
    };

    // Step 2: Permission check.
    if access == AccessType::Write && !vma.vm_flags.contains(VmFlags::VM_WRITE) {
        return Err(FaultError::Segv(SEGV_ACCERR));
    }
    if access == AccessType::Exec && !vma.vm_flags.contains(VmFlags::VM_EXEC) {
        return Err(FaultError::Segv(SEGV_ACCERR));
    }

    // Step 3: Read the PTE to determine the fault type.
    let pte = read_pte(mm.pgd(), addr);

    // Step 4: Dispatch based on fault type.
    if !pte.is_present() {
        // Page not present — demand paging.
        if vma.file.is_some() {
            // File-backed VMA: page cache fault.
            handle_file_fault(vma, addr, access)
        } else {
            // Anonymous VMA: allocate zero page.
            default_handle_anon_fault(mm, vma, addr, access)
        }
    } else if access == AccessType::Write && !pte.is_writable() {
        // Write to read-only PTE — COW fault.
        handle_cow_fault(vma, addr, pte.pfn())
    } else if pte.is_swap_entry() {
        // Swap entry — bring page back from swap.
        handle_swap_fault(mm, vma, addr, pte)
    } else {
        // PTE is present and has correct permissions — spurious fault
        // (TLB stale entry). Nothing to do, return success.
        Ok(())
    }
}

/// Access type that caused the page fault.
#[repr(u8)]
pub enum AccessType {
    /// Read access (load instruction).
    Read  = 0,
    /// Write access (store instruction).
    Write = 1,
    /// Instruction fetch (execute).
    Exec  = 2,
}

4.8.3.1.1 Per-VMA Locking¶

Linux 6.4 introduced per-VMA locks to eliminate mmap_lock contention on the page fault path. UmkaOS adopts this design from day one.

The VMA struct includes a per-VMA reader-writer lock:

/// Per-VMA lock. Allows page faults to proceed without acquiring
/// the process-wide `mmap_lock`, dramatically reducing contention
/// on multi-threaded workloads (databases, JVMs, container runtimes).
pub vm_lock: RwLock<()>,

Page fault fast path (per-VMA lock):

1. vma = maple_tree_lookup_lockless(mm, fault_addr)
   // Lockless RCU-protected lookup in the maple tree.
   // Returns None if VMA was concurrently removed.

2. if vma.is_none() {
       // VMA not found in lockless lookup — concurrent removal.
       // Fall back to mmap_lock slow path (step 6).
       goto slow_path;
   }

3. guard = vma.vm_lock.read()
   // Acquire per-VMA read lock. Multiple faulting threads on the
   // same VMA proceed in parallel (read locks are shared).

4. // Re-validate VMA after lock acquisition:
   if vma.is_detached() {
       // VMA was removed between lookup and lock. Fall back.
       drop(guard);
       goto slow_path;
   }

5. handle_fault(vma, fault_addr, flags)
   // Process the fault (demand paging, COW, file-backed read).
   // The per-VMA lock protects against concurrent VMA modification
   // (mprotect, mremap) but NOT against concurrent faults on the
   // same VMA (those are handled by page table locks).
   drop(guard);
   return;

slow_path:
6. mmap_lock.read()
   // Full process-wide lock. Used only when per-VMA fast path fails
   // (VMA removal race) or for operations that span multiple VMAs.
7. vma = maple_tree_lookup(mm, fault_addr)
8. handle_fault(vma, fault_addr, flags)
   mmap_lock.read_unlock()

VMA-modifying operations (mmap, munmap, mprotect, mremap):

1. mmap_lock.write()           // Exclude all readers (both mmap_lock and per-VMA)
2. for each affected VMA:
       vma.vm_lock.write()     // Exclude per-VMA fault handlers
3. Perform the VMA modification (split, merge, remove, change permissions)
4. for each affected VMA:
       vma.vm_lock.write_unlock()
       // If VMA was removed: mark as detached before unlocking,
       // so concurrent fault handlers in step 4 above detect it.
5. mmap_lock.write_unlock()

Lock ordering: mmap_lock nests outside vm_lock (see Section 3.4). A thread holding vm_lock.read() must never acquire mmap_lock in any mode.

Scalability impact: On a 256-core system running a multi-threaded database with 1M+ page faults/sec, per-VMA locking reduces mmap_lock contention from the primary bottleneck (~40% of fault latency on 128+ cores) to negligible. The mmap_lock write path is only taken for VMA structural modifications (mmap/munmap), which are orders of magnitude less frequent than page faults.

4.8.4 Page Fault Metadata by Architecture¶

When a page fault occurs, the hardware delivers fault information in architecture-specific registers before the trap handler can call the common handle_page_fault() entry point. UmkaOS's fault entry stubs (in umka-core/src/arch/*/mm.rs) read these registers and normalise them into the PageFaultInfo struct before dispatching to architecture-independent code:

/// Architecture-normalised page fault descriptor. Populated by the arch-specific
/// fault entry stub from hardware registers (CR2/ESR, FAR_EL1/ESR_EL1, stval/scause,
/// DAR/DSISR, DEAR/ESR) before `handle_page_fault` is called.
pub struct PageFaultInfo {
    /// Faulting virtual address.
    pub addr:  VirtAddr,
    /// True if the faulting access was a store (write).
    pub write: bool,
    /// True if the fault occurred at user privilege level (EL0 / Ring 3 / U-mode).
    pub user:  bool,
    /// True if the fault was an instruction fetch (NX / XN violation).
    pub exec:  bool,
}

The hardware registers that supply these fields differ by architecture:

Architecture	Fault Address	Fault Reason
x86-64	CR2 (linear address that caused the fault)	Error code pushed on stack: bit 0 = present (protection fault vs. not-present), bit 1 = write, bit 2 = user mode, bit 3 = reserved-bit write, bit 4 = instruction fetch, bit 5 = protection-key violation
AArch64	FAR_EL1 (Fault Address Register, EL1)	ESR_EL1: EC field = 0x24 (data abort from lower EL) or 0x25 (data abort same EL), 0x20 (instruction abort from lower EL) or 0x21 (instruction abort same EL); DFSC field (Data Fault Status Code): 0b000100 = translation fault L0, 0b000101 = L1, 0b000110 = L2, 0b000111 = L3, 0b001101 = permission fault
ARMv7	DFAR (Data Fault Address Register, CP15 c6 c0 0)	DFSR (Data Fault Status Register, CP15 c5 c0 0): status bits encode translation fault, permission fault, or alignment fault; WnR bit indicates write
RISC-V	stval CSR (supervisor trap value) = faulting virtual address	scause CSR: 12 = instruction page fault, 13 = load page fault, 15 = store/AMO page fault
PPC32	DEAR (Data Exception Address Register, SPR 61) for data; SRR0 for instruction faults	ESR (Exception Syndrome Register, SPR 62): ST bit indicates store vs. load; separate instruction-access exception (IABR) for instruction faults
PPC64LE	DAR (Data Address Register, SPR 19) for data faults; SRR0 for instruction faults	DSISR (Data Storage Interrupt Status Register, SPR 18): bit 25 = translation fault, bit 27 = protection fault, bit 26 = store
s390x	Translation-exception identification (TEID, lowcore offset 0xA0, 8 bytes): bits 0-51 = faulting virtual address (page-aligned), bits 52-53 = access type (00 = fetch, 01 = store, 10 = LAE)	Program interrupt code in lowcore at offset 0x8E (2 bytes): code 0x0010 = segment-translation exception, 0x0011 = page-translation exception, 0x003B = region-first-translation exception, 0x003C = region-second-translation exception, 0x003D = region-third-translation exception; translation-exception code in TEID bits 56-63: 04 = protection, 10 = page translation, 11 = segment translation
LoongArch64	CSR.BADV (Bad Virtual Address) = faulting virtual address	CSR.ESTAT (Exception Status): ECODE field (bits 21-16): 0x01 = TLB Refill (handled via CSR.TLBRENTRY), 0x02 = Page Invalid for Load (PIL), 0x03 = Page Invalid for Store (PIS), 0x04 = Page Invalid for Fetch (PIF), 0x07 = Page Privilege Violation (PPE), 0x08 = Page Modified Exception (PME — COW trigger)

The arch stub reads these registers in the trap entry path (before re-enabling interrupts) and fills PageFaultInfo. From that point on, all fault-handling code is architecture-independent and operates solely on PageFaultInfo, the VMA tree, and the physical allocator.

Lookup sequence (uses per-VMA locking — see Section 4.8):

VMA lookup: vma = maple_tree_lookup_lockless(current.vm_map, addr) — O(log n) lockless RCU-protected lookup (Section 4.8). If the lockless lookup succeeds, acquire vma.vm_lock.read() and re-validate (per-VMA fast path). If the VMA was concurrently removed or the lockless lookup fails, fall back to mmap_lock.read() (slow path). No write-side lock needed for the common case.
No VMA found: The address is not mapped in the process's address space. Deliver SIGSEGV with si_code = SEGV_MAPERR (bad address). If user_mode is false, check the kernel exception fixup table (__ex_table) first — see kernel fault handling below.
Permission check: The VMA exists but the access violates its protection bits (e.g., write to a PROT_READ-only mapping, or execute of a non-PROT_EXEC mapping). Deliver SIGSEGV with si_code = SEGV_ACCERR (protection fault).
Determine fault type based on the VMA and PTE state:

Condition	Fault Type	Handler
PTE not present, anonymous VMA	Anonymous fault	Allocate zero page, install PTE
PTE not present, file-backed VMA, `AS_DAX` set	DAX fault	`dax_iomap_fault()` — map persistent memory directly into PTE (Section 15.16)
PTE not present, file-backed VMA, not DAX	File fault	Look up page cache; if miss, submit I/O
PTE not present, VMA has `VM_DSM` flag	DSM read fault	Call `dsm_handle_fault(vma, addr, DsmFaultType::Read)` (see bridge function below).
PTE present, read-only, VMA has `VM_DSM` + writable, write fault	DSM write-upgrade fault	The page is in `SharedReader` state (MOESI S). Initiate ownership transfer: look up `DsmPageMeta` for this page (from the DSM region's per-page metadata slab, keyed by page frame number within the region — see Section 6.1); if state is `SharedReader`, send `Upgrade` to home node (no data transfer needed — Node A already has the data); if state is `NotPresent` (I), send `GetM` to home node (data transfer needed). Wait on `DsmFetchCompletion`. After exclusive ownership is granted: acquire PTL (level 185), upgrade PTE to read-write, release PTL. `tlb_flush_page(addr)` — required because the old read-only PTE may be cached in TLB (architecture-dependent: x86-64 may skip for RO→RW upgrade; AArch64/RISC-V/PPC require TLBI/sfence.vma/tlbie). Set `DsmPageState = Exclusive` (dirty). This row MUST be checked before the COW row — a DSM page with a write fault is an ownership transfer, not a copy-on-write.
PTE present, read-only, VMA is writable + COW, NOT `VM_DSM`	Copy-on-write fault	Copy page, update PTE
PTE swap entry, compressed bit set	Compressed fault	Decompress from CompressPool (Section 4.12)
PTE swap entry, compressed bit clear	Swap fault	Read from swap device, install PTE

4.8.4.1 `dsm_handle_fault` — VMM-to-DSM Bridge Function¶

/// Bridge between the VMM page fault path and the DSM coherence subsystem.
/// Called from the fault dispatch table when the faulting VMA has `VM_DSM`.
///
/// This function:
/// 1. Extracts the `DsmRegion` from `vma.dsm_region` (guaranteed `Some` because
///    the caller checks `VM_DSM` flag before dispatching here).
/// 2. Computes the page-aligned `region_id` and `page_index` from the fault
///    address and region base.
/// 3. Translates the VMM-level `AccessType` to `DsmFaultType` (Read/Write).
/// 4. Locates the subscriber for this region via `region.subscriber`.
/// 5. Calls `subscriber.on_page_fault(region_id, va, fault_type)` to obtain
///    a `DsmFaultHint` (subscriber decision: default fetch, prefetch, reject,
///    etc.). See [Section 6.12](06-dsm.md#dsm-subscriber-controlled-caching) for the trait.
/// 6. Dispatches the `DsmFaultHint` to the appropriate DSM protocol handler
///    (described in [Section 6.5](06-dsm.md#dsm-page-fault-flow)).
///
/// # Arguments
/// - `vma`: the VMA containing the faulting address (has `VM_DSM` flag).
/// - `addr`: the faulting virtual address (page-aligned by caller).
/// - `access`: the access type (Read or Write).
///
/// # Returns
/// `Ok(())` on success (PTE installed by the DSM fault handler).
/// `Err(FaultError::Oom)` if page frame allocation fails.
/// `Err(FaultError::Sigbus)` if the subscriber rejects the fault, or if
/// the home node is unreachable after retries.
fn dsm_handle_fault(
    vma: &Vma,
    addr: VirtAddr,
    access: AccessType,
) -> Result<(), FaultError> {
    let region = vma.dsm_region.as_ref()
        .expect("VM_DSM VMA has no dsm_region");
    let fault_type = match access {
        AccessType::Read => DsmFaultType::Read,
        AccessType::Write => DsmFaultType::Write,
    };
    let hint = region.subscriber.on_page_fault(
        region.region_id, addr.as_usize() as u64, fault_type,
    );
    match hint {
        DsmFaultHint::Default => {
            // Standard demand fetch: GetS for read, GetM/Upgrade for write.
            // Full protocol in [Section 6.5](06-dsm.md#dsm-page-fault-flow).
            dsm_demand_fetch(region, vma, addr, access)
        }
        DsmFaultHint::PrefetchAhead(n) => {
            // Fetch this page + n subsequent pages.
            dsm_demand_fetch(region, vma, addr, access)?;
            dsm_prefetch_ahead(region, vma, addr, n);
            Ok(())
        }
        DsmFaultHint::FetchRemote { source_node } => {
            // Fetch from specified peer (hint -- directory lookup may override).
            dsm_demand_fetch_from(region, vma, addr, access, source_node)
        }
        DsmFaultHint::FetchExclusive => {
            // Fetch with write intent via GetM even for read faults.
            dsm_demand_fetch(region, vma, addr, AccessType::Write)
        }
        DsmFaultHint::Reject => {
            Err(FaultError::Sigbus)
        }
        DsmFaultHint::MigrateThread { .. } => {
            // Phase 4+ feature. Fall back to default demand fetch.
            // Log warning for observability.
            log::warn!("DsmFaultHint::MigrateThread not yet implemented, falling back to Default");
            dsm_demand_fetch(region, vma, addr, access)
        }
    }
}

4.8.4.2 `install_pte` — Page Table Entry Installation¶

/// Install a page table entry mapping `addr` → `pfn` with the given flags.
///
/// Walks the four-level page table rooted at `pgd`, allocating intermediate
/// tables (PUD, PMD, PT) on demand from the page allocator with GfpFlags::KERNEL.
/// The final PTE is written with a Release store to ensure all prior page
/// content writes are visible before the mapping becomes active.
///
/// # Memory ordering
/// - **PTE write**: `Release` — ensures page zeroing/copying completes before
///   the mapping is visible to other CPUs via TLB fill.
/// - **Intermediate table stores**: `Release` — parent entry must be visible
///   before child entries are accessed.
///
/// # TLB considerations
/// - **New mapping** (PTE was !PRESENT): no TLB flush required — the TLB
///   cannot cache a not-present entry (x86, ARM, RISC-V, PPC all guarantee this).
/// - **Replacing existing mapping** (COW, migration): caller must issue
///   `tlb_flush_page(addr)` AFTER install_pte returns. install_pte does NOT
///   flush the TLB — the caller knows whether a flush is needed.
///
/// # Per-architecture PTE format
/// The `flags` parameter uses the architecture-neutral `PteFlags` bitflags.
/// `arch::current::mm::encode_pte(pfn, flags)` translates to the hardware
/// PTE format (x86-64: 64-bit PTE with NX; AArch64: stage-1 descriptor with
/// UXN/PXN; RISC-V: Sv48 PTE; PPC: HPTE or Radix PTE).
///
/// # Mutability
/// Takes `pgd: &PageGlobalDir` (shared reference) because `PageGlobalDir`
/// uses interior mutability for its page table entries. Each PTE is an
/// atomic value (hardware-atomically updated on all architectures via
/// compare-and-swap or architecture-specific PTE update instructions).
/// The mmap_lock (read or write mode) provides high-level synchronization;
/// the per-PTE spinlock (`pte_lockptr()`) provides fine-grained PTE-level
/// serialization for concurrent page faults on different addresses within
/// the same page table.
///
/// # `current_pgd()` definition
/// `current_pgd()` returns a reference to the current process's top-level
/// page table:
/// ```rust
/// fn current_pgd() -> &PageGlobalDir {
///     // Load the mm from the current task's process via ArcSwap.
///     let mm = current_task().process.mm.load();
///     // The PGD is embedded in MmStruct and lives for the mm's lifetime.
///     // The ArcSwapGuard keeps the Arc<MmStruct> alive.
///     &mm.pgd
/// }
/// ```
/// The returned reference is valid for the duration of the page fault
/// handler (the mm cannot be swapped during a fault — mmap_lock is held).
///
/// # Errors
/// Returns `FaultError::Oom` if intermediate page table allocation fails.
fn install_pte(
    pgd: &PageGlobalDir,
    addr: VirtAddr,
    pfn: Pfn,
    flags: PteFlags,
) -> Result<(), FaultError>

PF_MEMALLOC_NOFS — task-level allocation context restriction:

When the page fault handler holds mmap_lock (read or write mode), any memory allocation performed during the fault (intermediate page table pages, anonymous page frames, page cache pages) must use GFP_NOFS semantics to prevent lock ordering violations with filesystem locks (I_RWSEM, level 80). The reclaim path under GFP_KERNEL may invoke writepage(), which acquires I_RWSEM — but the fault handler may itself have been called from a filesystem code path that already holds I_RWSEM (e.g., write() → page fault on user buffer → handle_page_fault()), creating an ABBA deadlock: task A holds I_RWSEM + waits for mmap_lock, task B holds mmap_lock + reclaim tries to acquire I_RWSEM.

Rather than requiring every allocation call site within the fault path to pass GFP_NOFS explicitly (error-prone — a single missed call site creates a deadlock), UmkaOS uses a task-level context flag (PF_MEMALLOC_NOFS) that restricts all allocations made by the current task:

/// Set the task's allocation context to suppress filesystem callbacks
/// in reclaim. All subsequent allocations by this task will have the
/// FS bit cleared from their GfpFlags, regardless of what the caller
/// passes. This is an RAII guard — `Drop` restores the previous state.
///
/// Equivalent to Linux's `memalloc_nofs_save()` / `memalloc_nofs_restore()`.
pub struct MemallocNofsGuard {
    prev_flags: u32,
}

impl MemallocNofsGuard {
    /// Enter NOFS allocation context. All allocations by the current
    /// task will behave as if `GfpFlags::FS` is cleared.
    pub fn new() -> Self {
        // Use fetch_or to atomically set the NOFS bit. The returned value
        // is the PREVIOUS flags (before the OR), which we save for restore.
        // This is a single atomic RMW — no load-then-swap TOCTOU window.
        let prev = current_task().alloc_flags.fetch_or(PF_MEMALLOC_NOFS, Relaxed);
        Self { prev_flags: prev }
    }
}

impl Drop for MemallocNofsGuard {
    fn drop(&mut self) {
        // Restore: clear the NOFS bit if it was not set before we entered.
        // Use fetch_and to atomically clear only the bit we set, preserving
        // any bits that were set by nested guards or other flags.
        if self.prev_flags & PF_MEMALLOC_NOFS == 0 {
            current_task().alloc_flags.fetch_and(!PF_MEMALLOC_NOFS, Relaxed);
        }
        // If NOFS was already set in prev_flags, we leave it set (nested guard).
    }
}

The page fault entry point (handle_page_fault()) acquires this guard before any allocation:

fn handle_page_fault(...) {
    let _nofs = MemallocNofsGuard::new();
    // All allocations below (page tables, anonymous pages, etc.)
    // are automatically NOFS-restricted.
    ...
}

The physical page allocator (Section 4.2) checks current_task().alloc_flags and clears GfpFlags::FS before entering the reclaim path if PF_MEMALLOC_NOFS is set. This is a single bitwise AND on the allocation hot path — zero overhead when the flag is not set.

Anonymous fault path (most common, must be fast):

/// Handle a page fault on an anonymous (non-file-backed) VMA.
/// This is the hottest fault path: first access to malloc'd memory,
/// stack growth, and post-fork private pages all land here.
///
/// This is the **default implementation** of `VmmPolicy::handle_anon_fault()`.
/// The VmmPolicy trait method delegates to this function unless a replacement
/// policy is installed. See the `VmmPolicy` trait definition for the full
/// trait signature and the Evolvable replacement model.
///
/// Cost: ~1-2 us (page allocation + zero-fill + PTE install + TLB invalidate).
pub fn default_handle_anon_fault(
    mm: &MmStruct,
    vma: &Vma,
    addr: VirtAddr,
    access: AccessType,
) -> Result<(), FaultError> {
    // Step 1: Ensure the VMA has an anon_vma for reverse-mapping.
    // anon_vma_prepare() allocates the anon_vma if this is the first
    // anonymous fault on this VMA. Without it, page_add_anon_rmap()
    // cannot register the page → the page becomes non-reclaimable by
    // vmscan AND the COW refcount optimization in handle_cow_fault()
    // fires incorrectly (mapcount stays 0 even with multiple PTEs).
    anon_vma_prepare(vma)?;

    // Step 2: Allocate a zeroed physical page.
    // HIGHUSER_MOVABLE: allocate from the highest zone available to userspace,
    // mark as movable (compaction-friendly). ZERO: zero-fill for security (no
    // information leak from stale data). HIGHUSER_MOVABLE includes RECLAIM,
    // allowing the allocator to invoke kswapd/direct reclaim if watermarks
    // are breached — GfpFlags::USER alone would not enable reclaim.
    let page = phys_alloc(GfpFlags::HIGHUSER_MOVABLE | GfpFlags::ZERO)?;

    // Step 3: Acquire PTL and install PTE.
    // PTL serializes concurrent faults on the same page table page.
    // Without PTL, two threads faulting on the same anonymous page can
    // both install PTEs — the first page is leaked (no PTE points to it,
    // but its refcount is 1), causing a permanent memory leak per
    // concurrent fault.
    let ptl = pte_lockptr(current_pgd(), addr);
    let _ptl_guard = ptl.lock();

    // Re-read PTE under lock — another CPU may have resolved this fault
    // between our initial fault and acquiring PTL.
    let existing = read_pte(current_pgd(), addr);
    if existing.is_present() {
        // Another thread already handled this fault. Free our unused page.
        drop(_ptl_guard);
        page_deref(page.pfn());
        return Ok(());
    }

    let pte = PteFlags::PRESENT | PteFlags::USER | vma.pte_flags();
    // Step 4: Register reverse mapping (rmap) BEFORE installing the PTE.
    // page_add_anon_rmap increments mapcount. Without this, COW's
    // page_refcount == 1 check fires incorrectly after fork: both parent
    // and child have PTEs to this page, but mapcount is 0, so COW
    // reuses the page instead of copying — data corruption.
    page_add_anon_rmap(&page, vma, addr);

    match install_pte(current_pgd(), addr, page.pfn(), pte) {
        Ok(()) => {
            // Step 5: Increment RSS after successful PTE installation.
            // RSS tracking is required for OOM killer scoring and
            // /proc/[pid]/status VmRSS reporting.
            mm_rss_inc(mm, RssType::Anon, 1);
            Ok(())
        }
        Err(e) => {
            // install_pte failed (e.g., page table allocation failure).
            // Clean up: remove rmap, free page to prevent leak.
            page_remove_anon_rmap(&page, vma, addr);
            drop(_ptl_guard);
            page_deref(page.pfn());
            Err(e)
        }
    }
}

Copy-on-write (COW) fault path:

/// Handle a write fault on a copy-on-write page.
/// Default implementation of `VmmPolicy::handle_cow_fault()`.
/// Occurs after fork() when a child or parent first writes to a shared page.
///
/// Optimization: if the page has only one reference (the other process has
/// already COW-faulted or exited), skip the copy and just mark writable.
/// This is critical for fork+exec patterns where the parent's pages are
/// never actually copied.
fn handle_cow_fault(
    vma: &Vma,
    addr: VirtAddr,
    old_pfn: Pfn,
) -> Result<(), FaultError> {
    // Lock ordering: PTL is a spinlock (cannot sleep). MmuNotifierGuard::new()
    // calls invalidate_range_start() which may sleep (secondary MMUs may need
    // to complete TLB shootdowns). Therefore: allocate + notifier FIRST (may
    // sleep), then acquire PTL (cannot sleep), then re-check and install.
    // This matches Linux mm/memory.c wp_page_copy() ordering.

    // Step 1: Pre-check under PTL to handle the race-free "already handled" case.
    {
        let ptl = pte_lockptr(current_pgd(), addr);
        let _ptl_guard = ptl.lock();
        let pte = read_pte(current_pgd(), addr);
        if pte.pfn() != old_pfn || pte.is_writable() {
            // PTE changed (another thread handled the COW fault). Nothing to do.
            return Ok(());
        }
        // Check reuse-optimization: if refcount == 1, no copy needed.
        if page_refcount(old_pfn) == 1 {
            // Notify secondary MMUs even for the reuse path — the PTE
            // permissions change from RO to RW, which affects KVM sPTEs.
            // But MmuNotifier may sleep, so drop PTL first.
        }
        // Drop PTL before sleeping operations.
    }

    // Step 2: Allocate new page outside PTL (may sleep via GFP_KERNEL reclaim).
    let new_page = phys_alloc(GfpFlags::HIGHUSER_MOVABLE)?;

    // Step 3: Notify secondary MMUs (KVM, HMM) — may sleep.
    // RAII guard ensures invalidate_range_end is always called.
    let _notifier_guard = MmuNotifierGuard::new(
        current_mm(), addr, addr + PAGE_SIZE
    )?;

    // Step 4: Re-acquire PTL and re-check PTE (it may have changed while
    // we were allocating and notifying).
    let ptl = pte_lockptr(current_pgd(), addr);
    let _ptl_guard = ptl.lock();
    let pte = read_pte(current_pgd(), addr);
    if pte.pfn() != old_pfn || pte.is_writable() {
        // Race: another thread handled it. Free the unused page.
        page_deref(new_page.pfn());
        return Ok(());
    }

    if page_refcount(old_pfn) == 1 {
        // Only reference — skip copy, just set writable. Free unused alloc.
        page_deref(new_page.pfn());
        set_pte_writable(current_pgd(), addr);
        return Ok(());
    }

    // Step 5: Copy and install new PTE under PTL.
    copy_page(new_page.pfn(), old_pfn);

    // rmap update: remove old page from rmap, add new page.
    // Without this, the old page's mapcount drifts (slow-burn leak:
    // mapcount never reaches 0, page is never freed by vmscan).
    page_remove_rmap(old_pfn, vma, addr);
    page_add_anon_rmap(&new_page, vma, addr);

    match install_pte(
        current_pgd(),
        addr,
        new_page.pfn(),
        PteFlags::PRESENT | PteFlags::WRITE | PteFlags::USER,
    ) {
        Ok(()) => {
            // _notifier_guard dropped here → invalidate_range_end called.
            page_deref(old_pfn); // Decrement refcount on the shared page.
            Ok(())
        }
        Err(e) => {
            // install_pte failed — clean up new page to prevent leak.
            page_remove_anon_rmap(&new_page, vma, addr);
            page_add_rmap(old_pfn, vma, addr); // restore old rmap
            drop(_ptl_guard);
            page_deref(new_page.pfn());
            Err(e)
        }
    }
}

4.8.4.3 COW Page Table Duplication at Fork¶

When do_fork() creates a new process (without CLONE_VM), the child receives a complete copy of the parent's virtual address space with all writable pages marked copy-on-write. This is performed by copy_page_tables(), called from do_fork() step 11 (see Section 8.1).

/// Duplicate the parent's page tables for the child process (fork COW setup).
///
/// # Arguments
/// - `parent_mm`: The parent's `MmStruct`, with `mmap_lock` held for read.
/// - `child_mm`: The child's freshly-allocated `MmStruct` (empty page tables).
///
/// # Algorithm
///
/// For each VMA in `parent_mm.vma_tree` (Maple tree iteration):
///
///   1. **Skip non-copyable VMAs**:
///      - `VM_IO` (memory-mapped I/O regions) — device registers must not be
///        shared; the child must map devices independently.
///      - `VM_PFNMAP` (raw PFN mappings without struct page) — no refcount
///        tracking possible, so COW semantics cannot be applied.
///      - `VM_DONTCOPY` (explicitly excluded by `madvise(MADV_DONTFORK)`) —
///        userspace has requested that this VMA not be inherited.
///
///   2. **Clone the VMA metadata**: Create a new `Vma` for the child with the
///      same `start`, `end`, `prot`, `flags`, and `file` (if file-backed, the
///      `File` refcount is incremented). Insert the new VMA into `child_mm.vma_tree`.
///
///   3. **Walk the parent's page table** for the VMA's address range. For each
///      valid PTE at the leaf level:
///
///      a. **PTE maps a physical page** (not a swap entry, not a migration entry):
///         - If the VMA is writable (`VM_WRITE`) and the PTE is currently writable:
///           **clear `PTE_WRITE`** in the parent's PTE. This marks the parent's
///           mapping as COW. The parent will fault on next write, triggering
///           `handle_cow_fault()`.
///         - Copy the PTE value (now read-only) into the child's page table at
///           the corresponding virtual address.
///         - Increment the physical page's `mapcount` (via `page_add_map(pfn)`).
///           This tracks the number of page table entries pointing to the page,
///           enabling the single-reference optimization in `handle_cow_fault()`.
///
///      b. **PTE is a swap entry** (page has been swapped out):
///         - Copy the swap entry PTE directly to the child's page table.
///         - Increment the swap entry's reference count (`swap_dup(entry)`).
///         - No physical page refcount change (the page is not resident).
///
///      c. **PTE is a migration entry** (page is being migrated by NUMA balancing):
///         - Copy the migration entry PTE to the child's page table.
///         - Increment the underlying page's refcount (the migration will
///           complete and install a real PTE; both parent and child need a ref).
///
///      d. **PTE is zero / not present and not a swap entry**:
///         - Skip (the child inherits the hole; demand-paging will handle it).
///
///   4. **Page table pages** (PGD, P4D, PUD, PMD levels) are newly allocated for
///      the child. Only leaf PTEs are copied — intermediate levels are fresh
///      allocations pointing to the child's own page table hierarchy. This ensures
///      parent and child have fully independent page table trees.
///
///   5. **Huge pages** (PMD-level mappings, 2 MiB on x86-64):
///      - If a PMD entry maps a huge page and the VMA is writable: split the huge
///        page into base pages before applying COW. This avoids a 2 MiB copy on
///        the first write; the COW granularity is always the base page (4 KiB).
///      - Read-only huge pages (e.g., read-only file mappings) are copied at PMD
///        level without splitting (both parent and child share the huge page with
///        an incremented refcount).
///
/// # MMU Notifier Invalidation
///
/// When `copy_page_tables` demotes writable PTEs to read-only for COW, any
/// secondary MMU that has cached writable mappings derived from the parent's
/// page tables must be notified. Without this notification, a secondary MMU
/// (KVM EPT, GPU page tables, RDMA MR, device IOMMU) could retain a stale
/// writable mapping, allowing writes that bypass COW and corrupt shared pages.
///
/// **Concrete failure scenario (KVM)**: A KVM guest has an EPT entry mapping
/// a host page as writable. The host process forks. `copy_page_tables` clears
/// `PTE_WRITE` in the host PTE, but the EPT still has the writable mapping.
/// The guest writes through the EPT → the write goes directly to the physical
/// page without triggering a host page fault → both parent and child see the
/// write, violating COW isolation.
///
/// **Protocol**: For each VMA that contains at least one writable PTE being
/// demoted, `copy_page_tables` issues MMU notifier callbacks:
///
/// ```rust
/// // Per-VMA iteration inside copy_page_tables:
/// for vma in parent_mm.vma_tree.iter() {
///     if vma.flags.contains(VM_WRITE) && !parent_mm.mmu_notifiers.is_empty() {
///         // Notify BEFORE clearing PTE_WRITE — secondary MMUs must invalidate
///         // their writable mappings before the primary PTE is demoted.
///         mmu_notifier_invalidate_range_start(parent_mm, vma.start, vma.end);
///     }
///
///     // ... walk PTEs, clear PTE_WRITE, copy to child (steps 3a-3d above) ...
///
///     if vma.flags.contains(VM_WRITE) && !parent_mm.mmu_notifiers.is_empty() {
///         // Notify AFTER PTE demotion + TLB flush — secondary MMUs may now
///         // release resources (e.g., unpin pages that were held for DMA).
///         mmu_notifier_invalidate_range_end(parent_mm, vma.start, vma.end);
///     }
/// }
/// ```
///
/// **Granularity**: Notifications are issued per-VMA, not per-PTE. This batches
/// the callback overhead (KVM's `kvm->mmu_lock` acquisition, IOMMU TLB flush)
/// across potentially thousands of PTEs per VMA. The secondary MMU invalidates
/// the entire VMA range — this is conservative but correct, and avoids O(n)
/// callbacks for n PTEs within a single VMA.
///
/// **Optimization**: The `mmu_notifiers.is_empty()` check is evaluated once
/// per VMA. For the common case (no KVM, no HMM, no IOMMU — i.e., the
/// notifier list is empty), no callback overhead is incurred. The check reads
/// the notifier array length under the spinlock, but since `mmap_lock.read()`
/// is held, no concurrent registration can occur (registration requires
/// `mmap_lock.write()`), so the length is stable.
///
/// **Interaction with TLB flush**: The TLB flush (below) must complete before
/// `invalidate_range_end` is called. This ensures that no CPU can access the
/// old writable PTE between the `_start` and `_end` callbacks. The sequence
/// is: `_start` → clear PTE_WRITE → TLB flush → `_end`.
///
/// # TLB Flush
///
/// After all PTEs are processed, the parent's TLB entries must be flushed for
/// the pages whose PTEs were downgraded from writable to read-only:
///
/// - **PCID/ASID-capable hardware** (x86-64 with PCID, AArch64 with ASID,
///   RISC-V with ASID): flush only the parent's ASID entries. The child has a
///   freshly-allocated ASID with no stale TLB entries. This is an O(modified_pages)
///   operation via `invlpg`/`tlbi vale1is`/`sfence.vma` per page, or a full-ASID
///   flush if the modified count exceeds a threshold (typically 32 pages).
///
/// - **Hardware without PCID/ASID** (ARMv7 without ASID, PPC32 without LPID):
///   full TLB flush (`tlbiall` / `tlbiel`). This is the conservative fallback.
///
/// # Errors
///
/// Returns `AllocError` if any page table page allocation fails. On error, all
/// page table pages allocated so far for the child are freed, and all mapcount
/// increments are reversed. The child's `MmStruct` is left empty (no VMAs, no
/// page tables). The caller (`do_fork()`) handles the overall fork rollback.
///
/// # Lock state
///
/// - `parent_mm.mmap_lock` held for **read** (prevents VMA mutations but allows
///   concurrent page faults in the parent — important for large address spaces).
/// - PTE-level spinlocks (PTL) are acquired per-PTE-page during the walk to
///   prevent races with concurrent `handle_cow_fault()` or `munmap()` in the
///   parent.
/// - `child_mm.mmap_lock` is not contended (no other thread has a reference to
///   the child's mm yet).
fn copy_page_tables(parent_mm: &MmStruct, child_mm: &MmStruct) -> Result<(), AllocError>

OOM handling during fork: If copy_page_tables() encounters a page allocation failure (e.g., allocating a page table page at any level — PGD, PUD, PMD, PTE), fork returns -ENOMEM to the parent. The OOM killer is NOT invoked during fork for two reasons:

Deadlock avoidance: The parent holds mmap_lock.read() during copy_page_tables(). The OOM killer's oom_reap_task() path acquires mmap_lock.write() on the victim's mm to unmap pages. If the OOM killer selects the forking process (or any process sharing its mm), it would deadlock waiting for mmap_lock.write() while the fork holds the read lock.
Partial state: At the point of failure, the child's address space is partially constructed (some VMAs copied, some not). The child is not yet linked into the process tree (step 16 in do_fork), so it is not a valid OOM candidate. Invoking OOM would add latency to the fork error path without benefit.

Instead, copy_page_tables() performs deterministic cleanup on allocation failure: - All page table pages allocated for the child so far are freed (reverse walk). - All mapcount increments on physical pages (from step 3a) are reversed. - All swap_dup() refcount increments (from step 3b) are reversed. - The child's MmStruct is left empty (no VMAs, no page tables). - AllocError is returned to do_fork(), which rolls back steps 1–10 (PID release, cgroup cancel, user_struct decrement) and returns -ENOMEM to the parent's fork() syscall.

The parent process receives -ENOMEM and can retry, reduce its memory footprint, or handle the failure in application-specific ways. The system's memory reclaim machinery (kswapd, direct reclaim) runs concurrently and may free pages for a subsequent fork attempt.

Performance considerations: For large address spaces (hundreds of VMAs, millions of PTEs), copy_page_tables() is the dominant cost of fork(). The key optimizations:

Lazy PTE copying: Only PTEs that are actually present are copied. Sparse address spaces (common in mmap-heavy applications) skip large ranges of unpopulated page table pages entirely.
Batched TLB flush: Rather than flushing after each PTE modification, the function accumulates dirty addresses and issues a single batched flush at the end (or a full-ASID flush if the batch exceeds the threshold).
No page data copying: Only metadata (PTEs, refcounts) is modified. Actual page content is never copied at fork time — that is deferred to the COW fault handler (handle_cow_fault() above).

4.8.4.4 InvalidateSeq — Lockless Truncation-Fault Coordination¶

Replaces the INVALIDATE_LOCK rwsem from Linux 5.15+ (mapping->invalidate_lock, commit 730633f0b7f9) with a lockless seqcount. Writers (truncate, hole-punch, collapse-range) increment the counter to odd before mutating the page cache + PTEs, then to even after. Readers (page fault) check the counter without acquiring any lock.

Why not an rwsem (Linux approach)? The rwsem creates the spec's ONLY lock ordering exception: the fault path must acquire VMA_LOCK(105, read) then INVALIDATE_LOCK(90, read) -- a descending-level violation. This required lock_read_unchecked(), a compile-time call-site cap, runtime debug_assert!, and mandatory // SAFETY: annotations. With a seqcount, the fault path performs two atomic loads instead of acquiring a lock: no lock = no level = no ordering violation = zero escape hatches in the lock ordering system.

Performance: The rwsem reader-side costs ~20-40 cycles (lock xadd on x86, CAS on ARM). Two seqcount loads cost ~4-12 cycles on all architectures. Net savings: 16-36 cycles per file-backed page fault -- a pure gain that contributes to the negative-overhead target.

Nucleus/Evolvable classification: - InvalidateSeq struct: Nucleus (data structure). The counter's parity invariant (odd = in-progress, even = idle) is a correctness property that must not change during live evolution. - Retry policy: Evolvable. When the reader observes an odd sequence, the choice of action (yield, return Retry, backoff) is a policy decision. The Evolvable component registers ParamId::INVALIDATE_RETRY_STRATEGY with bounds [0=yield, 1=retry, 2=backoff].

/// Per-AddressSpace sequence counter for truncation-fault coordination.
///
/// Writers (truncate, hole-punch, collapse-range) bracket page cache
/// mutations with `invalidate_begin()` / `invalidate_end()`. Readers
/// (page fault) call `read_begin()` before page cache lookup and
/// `read_check()` after PTE installation.
///
/// # Differences from `SeqLock<T>`
///
/// `SeqLock<T>` ([Section 3.6](03-concurrency.md#lock-free-data-structures--seqlockt--sequence-lock))
/// protects a `T: Copy` data value. `InvalidateSeq` protects an *operation
/// window* -- it signals "truncation in progress" without guarding a
/// specific data payload. There is no inner `T`. Writer mutual exclusion
/// is provided by `I_RWSEM(write)`, not by an internal spinlock.
///
/// # Counter Width: u64
///
/// At 1 billion truncations per second (impossibly high -- real-world
/// servers see ~10K/s), u64 wraps in ~292 years. The 50-year uptime
/// requirement is satisfied with 5.8x margin. u32 wraps in ~4.9 days
/// at 10K/s -- unacceptable. On ARMv7, `AtomicU64` requires
/// `LDREXD`/`STREXD` (2 extra cycles vs single-word CAS) -- acceptable
/// on a warm path.
///
/// # Memory Ordering (per architecture)
///
/// | Operation | x86-64 (TSO) | AArch64 | ARMv7 | RISC-V | PPC32/PPC64LE | s390x | LoongArch64 |
/// |-----------|-------------|---------|-------|--------|---------------|-------|-------------|
/// | `read_begin` (Acquire load) | MOV (compiler barrier) | `LDAR` | `LDREX` + `DMB ISH` | `fence r,rw` + load | `lwsync` + load | plain load | `dbar 0x14` + load |
/// | `read_check` (`fence(Acquire)` + Relaxed load) | MOV (compiler barrier) | `DMB ISHLD` + LDR | `DMB ISH` + LDR | `fence r,rw` + load | `lwsync` + load | plain load | `dbar 0x14` + load |
/// | `invalidate_begin` (Release store) | MOV + compiler fence | `STLR` | `DMB ISH` + `STR` | store + `fence rw,w` | store + `lwsync` | plain store | store + `dbar 0x12` |
/// | `invalidate_end` (Release store) | MOV + compiler fence | `STLR` | `DMB ISH` + `STR` | store + `fence rw,w` | store + `lwsync` | plain store | store + `dbar 0x12` |
///
/// **Reader cost on x86-64**: 2 plain MOV instructions (~2 cycles total).
/// TSO already provides load-load and load-store ordering; the fence in
/// `read_check` compiles to a compiler barrier (no hardware instruction).
///
/// **Reader cost on AArch64**: `LDAR` (~2-4 cycles) + `DMB ISHLD` + LDR
/// (~4-6 cycles) = ~6-10 cycles total. The `DMB ISHLD` in `read_check`
/// ensures all preceding memory operations (page cache reads, PTE writes)
/// complete before the seq re-read. An `LDAR` alone does NOT order
/// PRECEDING operations -- it only orders SUBSEQUENT operations.
///
/// **Writer cost**: Negligible. The writer already holds `I_RWSEM(write)`
/// (~40-120 cycles). Two Release stores add ~2-4 cycles total.
pub struct InvalidateSeq {
    /// Sequence counter.
    /// - Even (including 0): no truncation/hole-punch in progress.
    /// - Odd: truncation/hole-punch is actively modifying the page cache
    ///   and/or zapping PTEs for this address space.
    ///
    /// Initialized to 0 at AddressSpace creation.
    ///
    /// **Invariant**: Only modified while holding `I_RWSEM(write)` on the
    /// owning inode. Multiple concurrent writers are serialized by I_RWSEM;
    /// the seqcount does NOT provide writer mutual exclusion.
    seq: AtomicU64,
}

impl InvalidateSeq {
    /// Create a new InvalidateSeq with seq = 0 (no operation in progress).
    pub const fn new() -> Self {
        Self { seq: AtomicU64::new(0) }
    }

    /// Begin an invalidation window (truncate/hole-punch/collapse-range).
    ///
    /// # Preconditions
    /// - Caller MUST hold `I_RWSEM(write)` on the owning inode.
    /// - The previous `invalidate_end()` must have completed (seq is even).
    ///
    /// # Postconditions
    /// - seq is odd (visible to all readers after the Release store).
    /// - All page cache mutations and PTE zaps that follow are "inside"
    ///   the invalidation window.
    ///
    /// # Memory ordering
    /// The Relaxed load is safe because `I_RWSEM(write)` Acquire provides
    /// a happens-before edge with any previous `invalidate_end()` Release
    /// store, via the I_RWSEM Release/Acquire pair between writers. The
    /// Release store ensures all prior stores (e.g., i_size update) are
    /// visible before readers see the odd seq.
    ///
    /// # Why load+store, not fetch_add
    /// `fetch_add` would be defense-in-depth against a bug where I_RWSEM
    /// is not held. But `fetch_add` on x86 is `LOCK XADD` (~8-18 cycles)
    /// vs MOV+MOV (~2 cycles), and -- more importantly -- it MASKS bugs.
    /// With load+store, the `debug_assert!` catches double-odd or
    /// double-even states. With `fetch_add`, two racing writers silently
    /// produce correct increments, hiding the missing I_RWSEM.
    ///
    /// # Panics (debug only)
    /// Panics if seq is already odd (nested invalidation -- programming error).
    #[inline(always)]
    pub fn invalidate_begin(&self) {
        let s = self.seq.load(Ordering::Relaxed);
        debug_assert!(
            s & 1 == 0,
            "InvalidateSeq::invalidate_begin: seq {} is odd — \
             nested or overlapping invalidation (missing invalidate_end?)",
            s
        );
        self.seq.store(s + 1, Ordering::Release);
    }

    /// End an invalidation window.
    ///
    /// # Preconditions
    /// - Caller MUST hold `I_RWSEM(write)` (same as `invalidate_begin`).
    /// - All page cache mutations and PTE zaps for this truncation are
    ///   complete.
    ///
    /// # Postconditions
    /// - seq is even (readers will see a consistent state).
    #[inline(always)]
    pub fn invalidate_end(&self) {
        let s = self.seq.load(Ordering::Relaxed);
        debug_assert!(
            s & 1 == 1,
            "InvalidateSeq::invalidate_end: seq {} is even — \
             no invalidation in progress (double end?)",
            s
        );
        self.seq.store(s + 1, Ordering::Release);
    }

    /// Optimistic read-begin. Returns the current sequence value.
    ///
    /// If the returned value is odd, truncation is in progress and the
    /// caller should not proceed with page cache lookup / PTE installation.
    /// The caller's retry policy (yield, return `FaultError::Retry`,
    /// backoff) is an Evolvable decision.
    ///
    /// Cost: 1 Acquire load (~1-2 cycles on x86-64, ~2-4 on AArch64).
    #[inline(always)]
    pub fn read_begin(&self) -> u64 {
        self.seq.load(Ordering::Acquire)
    }

    /// Check if the page cache / PTE state may have been invalidated
    /// since `read_begin()` returned `start_seq`.
    ///
    /// Returns `true` if retry is needed (seq changed).
    ///
    /// # Memory ordering: fence(Acquire) + Relaxed load
    ///
    /// The `fence(Acquire)` before the Relaxed load ensures that ALL
    /// preceding memory operations (page cache reads, PTE writes, page
    /// flag checks) are ordered before the seq re-read. This is the
    /// standard seqcount reader pattern (Linux's `read_seqcount_retry()`
    /// uses `smp_rmb()` before the re-read -- same semantics).
    ///
    /// **Why not a plain Acquire load?** An Acquire load orders
    /// SUBSEQUENT operations after the load, NOT preceding operations
    /// before it. On weakly-ordered architectures (AArch64, RISC-V,
    /// PPC, ARMv7), the CPU could speculatively execute the second seq
    /// load BEFORE intervening page cache reads complete. The
    /// `fence(Acquire)` prevents this reordering.
    ///
    /// **On x86-64 (TSO)**: The fence compiles to a compiler barrier
    /// (no hardware instruction). TSO already orders all loads w.r.t.
    /// each other. Zero additional hardware cost vs a plain Acquire load.
    ///
    /// **On AArch64**: `DMB ISHLD` + LDR (~4-6 cycles) instead of
    /// `LDAR` (~2-4 cycles). Delta: +2 cycles per fault. Total reader
    /// cost ~6-10 cycles instead of ~4-8. Still vastly cheaper than the
    /// rwsem's ~30-40 cycles.
    #[inline(always)]
    pub fn read_check(&self, start_seq: u64) -> bool {
        core::sync::atomic::fence(Ordering::Acquire);
        let s = self.seq.load(Ordering::Relaxed);
        s != start_seq
    }
}

The AddressSpace struct (Section 14.1) contains the InvalidateSeq field:

    /// Sequence counter for truncation-fault coordination. Replaces the
    /// former `INVALIDATE_LOCK` rwsem.
    ///
    /// Readers (page fault path): call `invalidate_seq.read_begin()` before
    /// page cache lookup and `invalidate_seq.read_check(seq)` after PTE
    /// installation. No lock acquired -- two atomic loads only.
    ///
    /// Writers (truncate/hole-punch/collapse-range): call
    /// `invalidate_seq.invalidate_begin()` before page cache mutation and
    /// `invalidate_seq.invalidate_end()` after. Must hold `I_RWSEM(write)`.
    pub invalidate_seq: InvalidateSeq,

Writer protocol (truncate / hole-punch / collapse-range path):

Writer (truncate path):
    PRECONDITION: caller holds I_RWSEM(write) on the inode.
    I_RWSEM serializes writers — no two truncates can overlap on the same inode.

    1. mapping.invalidate_seq.invalidate_begin()
       // seq transitions: even -> odd (Release store).

    2. Update i_size (atomically, Release ordering).

    3. truncate_inode_pages_range(mapping, new_size, u64::MAX)
       // Removes pages from page cache XArray. For hole-punch:
       // truncate_inode_pages_range(mapping, start, end).
       // Each page removal: xa_lock → erase → xa_unlock → unmap_mapping_range
       // (zap PTEs, TLB shootdown, mmu_notifier callbacks).

    4. Filesystem-specific truncation (free blocks, update extent tree).
       // Tier 1 domain crossing for isolated filesystem drivers.

    5. mapping.invalidate_seq.invalidate_end()
       // seq transitions: odd -> even (Release store).

    6. Release I_RWSEM(write).

collapse_range note: FALLOC_FL_COLLAPSE_RANGE shifts page offsets (page at offset 200 becomes offset 100 after collapsing 100 pages starting at offset 100). InvalidateSeq detects this -- a fault that looked up offset 200 and found page P before the collapse will see read_check() fail after the collapse moves P to offset 100. The retry re-looks up offset 200, which now maps to different data.

File-backed fault path:

/// Handle a page fault on a file-backed VMA.
/// Dispatches to the page cache (normal files) or DAX path (persistent memory).
///
/// **Lock state on entry:**
/// - `VMA_LOCK(105, read)` held (per-VMA fast path) OR `mmap_lock(100, read)`
///   held (slow path fallback).
/// - No `I_RWSEM` held -- the fault path does NOT acquire the inode rwsem.
///   Truncation coordination is provided by `InvalidateSeq` (lockless seqcount).
/// - No page lock held.
///
/// **Lock chain (strictly ascending -- ZERO exceptions):**
/// `VMA_LOCK(105, read)` -> `PAGE_LOCK(180)` -> `PTL(185)`
///
/// **Allocation context:** `GFP_NOFS | __GFP_MEMALLOC` -- filesystem
/// writeback re-entry is suppressed (`MemallocNofsGuard` set by
/// `handle_page_fault`). `__GFP_MEMALLOC` permits dipping into memory
/// reserves to avoid deadlock under extreme pressure.
///
/// **Caller contract:** The caller (`handle_page_fault`) must have already
/// verified that `addr` falls within a valid VMA and that the access type
/// is permitted by VMA flags (SEGV_ACCERR delivered before reaching here).
///
/// **Pseudocode convention**: Code in this section uses Rust syntax and
/// follows Rust ownership, borrowing, and type rules. `&self` methods use
/// interior mutability for mutation. Atomic fields use `.store()`/`.load()`.
/// All `#[repr(C)]` structs have `const_assert!` size verification. See
/// CLAUDE.md Spec Pseudocode Quality Gates.
fn handle_file_fault(
    vma: &Vma,
    addr: VirtAddr,
    access: AccessType,
) -> Result<(), FaultError> {
    // vma.file is Option<Arc<OpenFile>>. For file-backed VMAs it is always
    // Some — the caller (handle_page_fault) only dispatches to handle_file_fault
    // for VMAs with file != None. Unwrap is safe by precondition.
    let file = vma.file.as_ref().expect("file-backed VMA must have file");
    let mapping = &file.inode.i_mapping;
    let pgoff = vma.vm_pgoff + ((addr - vma.vm_start) >> PAGE_SHIFT);

    // DAX dispatch: persistent memory is mapped directly, no page cache.
    // DAX faults still use InvalidateSeq for truncation detection AND
    // the per-offset dax_entry_lock for concurrent-fault serialization
    // (orthogonal concerns: InvalidateSeq detects truncation, dax_entry_lock
    // prevents duplicate PTE installations at the same offset).
    if mapping.flags.load(Relaxed) & AS_DAX != 0 {
        return dax_iomap_fault(vma, addr, pgoff, access);
    }

    let pc = mapping.page_cache.as_ref().unwrap();

    // ---- InvalidateSeq reader protocol ----
    // (1) Load the sequence counter. If odd, truncation is in progress.
    let seq = mapping.invalidate_seq.read_begin();
    if seq & 1 != 0 {
        // Truncation in progress. Retry policy is Evolvable:
        //   Default: yield (cond_resched) then return Retry.
        //   This avoids CPU spinning during long truncations (e.g.,
        //   truncating a 100 GB file takes 100ms+; immediate Retry
        //   would spin-fault 20K-100K times).
        cond_resched();
        return Err(FaultError::Retry);
    }

    // (2) Page cache lookup under RCU.
    //     The XArray RCU load returns a reference valid for the RCU
    //     read-side critical section. We must increment the page refcount
    //     (try_get_ref) before exiting RCU to prevent use-after-free if
    //     reclaim evicts the page concurrently.
    rcu_read_lock();
    let cache_result = pc.pages.load(pgoff);

    if let Some(entry) = cache_result {
        // Cache hit. Pin the page before exiting RCU.
        if !entry.page.try_get_ref() {
            // Page is being freed (refcount was 0). Treat as cache miss.
            rcu_read_unlock();
            // Fall through to cache-miss path below.
        } else {
            rcu_read_unlock();
            let page = &entry.page;

            // Mark accessed for LRU aging.
            page.flags_fetch_or(PageFlags::ACCESSED, Relaxed);

            // (2a) Wait for I/O completion if the page is not yet UPTODATE.
            //      IMPORTANT: wait_on_page_locked() sleeps -- this MUST be
            //      outside the RCU read-side critical section (sleeping in
            //      RCU prevents grace period advancement → deadlock).
            if !page.flags_load(Acquire).contains(PageFlags::UPTODATE) {
                lock_page(page);  // test-and-set + sleep if already locked
                if page.flags_load(Acquire).contains(PageFlags::ERROR) {
                    unlock_page(page);
                    page.put_ref();
                    return Err(FaultError::Sigbus);
                }
                if !page.flags_load(Acquire).contains(PageFlags::UPTODATE) {
                    unlock_page(page);
                    page.put_ref();
                    return Err(FaultError::Sigbus);
                }
                unlock_page(page);
            }

            // (2b) Verify page is still in the correct mapping after
            //      potential sleep (page may have been truncated and
            //      re-inserted by another inode).
            if page.mapping() != Some(mapping) || page.index() != pgoff {
                page.put_ref();
                return Err(FaultError::Retry);
            }

            // (2c) EOF check: verify the faulting offset is within the
            //      current file size. A concurrent truncate may have shrunk
            //      i_size below our offset.
            let i_size = file.inode.i_size.load(Acquire);
            let max_pgoff = (i_size + PAGE_SIZE as u64 - 1) / PAGE_SIZE as u64;
            if pgoff >= max_pgoff {
                page.put_ref();
                return Err(FaultError::Sigbus);
            }

            // For MAP_PRIVATE: read-only PTE (COW on write).
            let mut pte_flags = PteFlags::PRESENT | PteFlags::USER | vma.pte_flags();
            if !vma.vm_flags.contains(VmFlags::VM_SHARED) {
                pte_flags.remove(PteFlags::WRITE);
            }

            // (2d) page_mkwrite() for MAP_SHARED writable faults.
            // Notify the filesystem before installing a writable PTE so
            // delayed-allocation filesystems can allocate on-disk blocks.
            // Note: page_mkwrite() is called BEFORE read_check(), so the
            // filesystem does real work (block allocation, journal reservation)
            // that will be thrown away on retry. This is correct: we need the
            // PTE installed to detect the truncation race. The wasted work
            // only occurs under concurrent truncation (rare).
            if vma.vm_flags.contains(VmFlags::VM_SHARED)
                && access == AccessType::Write
                && pte_flags.contains(PteFlags::WRITE)
            {
                if let Some(ref vm_ops) = vma.vm_ops {
                    lock_page(page);  // filesystem expects a locked page
                    match vm_ops.page_mkwrite(page) {
                        Ok(()) => {
                            // Filesystem allocated blocks. Page remains locked
                            // until after PTE installation.
                        }
                        Err(_e) => {
                            unlock_page(page);
                            page.put_ref();
                            return Err(FaultError::Sigbus);
                        }
                    }
                    // unlock_page happens after install_pte below
                }
            }

            // (2e) Install PTE under PTL (page table lock).
            //      PTL serializes concurrent PTE modifications within a
            //      single page table page. Re-read PTE under lock: if
            //      already present (another CPU resolved the same fault),
            //      skip installation.
            let ptl = pte_lockptr(current_pgd(), addr);
            let _ptl_guard = ptl.lock();
            let existing_pte = read_pte(current_pgd(), addr);
            if !existing_pte.is_present() {
                install_pte(current_pgd(), addr, page.pfn(), pte_flags)?;
                // RSS increment + file rmap for cache-hit path.
                mm_rss_inc(current_mm(), RssType::File, 1);
                page_add_file_rmap(page, mapping);
            }
            drop(_ptl_guard);

            // Unlock page if we locked it for page_mkwrite.
            if vma.vm_flags.contains(VmFlags::VM_SHARED)
                && access == AccessType::Write
            {
                unlock_page(page);
            }

            // (2f) InvalidateSeq read_check: did truncation occur during
            //      our operation window?
            if mapping.invalidate_seq.read_check(seq) {
                // Truncation raced with our fault. The PTE we installed may
                // point to a page that was (or is being) truncated. Undo.
                let ptl = pte_lockptr(current_pgd(), addr);
                let _ptl_guard = ptl.lock();
                zap_pte(current_pgd(), addr);
                drop(_ptl_guard);
                page.put_ref();
                return Err(FaultError::Retry);
            }

            // Success. The PTE holds the mapping reference now. Drop the
            // fault path's extra reference (the page's mapcount was
            // incremented by install_pte).
            page.put_ref();
            return Ok(());
        }
    } else {
        rcu_read_unlock();
    }

    // ---- Cache miss path ----
    // Allocate a page, insert into page cache, initiate I/O.
    // Track whether we inserted a new page (for cleanup on retry).
    let new_page = phys_alloc(GfpFlags::NOFS | GfpFlags::MEMALLOC)?;
    lock_page(&new_page);  // Lock before insertion (I/O completion will unlock)

    match pc.pages.try_store(pgoff, PageEntry::new(&new_page)) {
        Ok(()) => {
            // Won race -- we own this page cache slot.
            // Initiate readahead and read the faulting page.
            readahead_on_fault(&vma.file.ra_state, mapping, pgoff);
            mapping.ops.read_page(mapping, pgoff, &new_page)?;

            // Wait for I/O completion. read_page() submits async block I/O;
            // the bio completion callback sets PG_UPTODATE and clears
            // PG_LOCKED, waking all waiters.
            // IMPORTANT: this sleep is outside any RCU critical section.
            wait_on_page_locked(&new_page);

            if new_page.flags_load(Acquire).contains(PageFlags::ERROR) {
                // I/O failed. Remove the page from the cache to prevent
                // serving an error page to future faults.
                pc.pages.try_remove(pgoff);
                new_page.put_ref();
                return Err(FaultError::FileFault(KernelError::IoError));
            }

            // EOF check after I/O completion.
            let i_size = vma.file.inode.i_size.load(Acquire);
            let max_pgoff = (i_size + PAGE_SIZE as u64 - 1) / PAGE_SIZE as u64;
            if pgoff >= max_pgoff {
                pc.pages.try_remove(pgoff);
                new_page.put_ref();
                return Err(FaultError::Sigbus);
            }

            // Build PTE flags.
            let mut pf = PteFlags::PRESENT | PteFlags::USER | vma.pte_flags();
            if !vma.vm_flags.contains(VmFlags::VM_SHARED) {
                pf.remove(PteFlags::WRITE); // MAP_PRIVATE: COW on write
            }

            // RSS increment: charge the new page to the mm's resident set.
            mm_rss_inc(current_mm(), RssType::File, 1);
            // file rmap: track this mapping for reverse-map page reclaim.
            page_add_file_rmap(&new_page, mapping);

            // Install PTE under PTL.
            let ptl = pte_lockptr(current_pgd(), addr);
            let _ptl_guard = ptl.lock();
            let existing_pte = read_pte(current_pgd(), addr);
            if !existing_pte.is_present() {
                install_pte(current_pgd(), addr, new_page.pfn(), pf)?;
            }
            drop(_ptl_guard);

            // InvalidateSeq check -- CRITICAL on cache-miss path.
            // If seq changed, a truncation raced past our offset. The page
            // we just inserted may contain garbage (blocks freed by truncate,
            // then read_page read from those freed blocks). We MUST remove
            // the page from the XArray -- unlike the cache-hit path where
            // we only zap the PTE, on cache-miss the page itself is stale
            // and must not remain in the page cache.
            if mapping.invalidate_seq.read_check(seq) {
                let ptl = pte_lockptr(current_pgd(), addr);
                let _ptl_guard = ptl.lock();
                zap_pte(current_pgd(), addr);
                drop(_ptl_guard);
                // Remove the stale page from the page cache.
                // Without this, hole-punch races leave garbage pages
                // that serve stale data to subsequent faults.
                pc.pages.try_remove(pgoff);
                new_page.put_ref();
                return Err(FaultError::Retry);
            }

            new_page.put_ref();
        }
        Err(_existing) => {
            // Lost race -- another thread already inserted a page for this
            // offset. Drop our allocation, use the existing page.
            drop(new_page);
            let existing = pc.pages.load(pgoff).unwrap();
            // Pin the existing page.
            existing.page.get_ref();
            let page = &existing.page;

            // Wait for the winner to finish I/O.
            wait_on_page_locked(page);

            if page.flags_load(Acquire).contains(PageFlags::ERROR) {
                page.put_ref();
                return Err(FaultError::FileFault(KernelError::IoError));
            }
            if !page.flags_load(Acquire).contains(PageFlags::UPTODATE) {
                page.put_ref();
                return Err(FaultError::Sigbus);
            }

            let mut pf = PteFlags::PRESENT | PteFlags::USER | vma.pte_flags();
            if !vma.vm_flags.contains(VmFlags::VM_SHARED) {
                pf.remove(PteFlags::WRITE); // MAP_PRIVATE: COW on write
            }

            // Install PTE under PTL.
            let ptl = pte_lockptr(current_pgd(), addr);
            let _ptl_guard = ptl.lock();
            let existing_pte = read_pte(current_pgd(), addr);
            if !existing_pte.is_present() {
                install_pte(current_pgd(), addr, page.pfn(), pf)?;
            }
            drop(_ptl_guard);

            // InvalidateSeq check for lost-race path too.
            if mapping.invalidate_seq.read_check(seq) {
                let ptl = pte_lockptr(current_pgd(), addr);
                let _ptl_guard = ptl.lock();
                zap_pte(current_pgd(), addr);
                drop(_ptl_guard);
                page.put_ref();
                return Err(FaultError::Retry);
            }

            page.put_ref();
        }
    }
    Ok(())
}

Iomap types — used by DAX and buffered I/O extent mapping:

/// Type of storage mapping returned by `iomap_begin`.
pub enum IomapKind {
    /// Sparse file hole — no storage allocated.
    Hole,
    /// Allocated and written — `phys_addr` is valid.
    Mapped { phys_addr: PhysAddr },
    /// Allocated but never written (pre-allocated extent).
    Unwritten,
    /// Delayed allocation — reserved but not yet on disk.
    Delalloc,
    /// Data stored inline within the inode itself.
    InlineData,
}

bitflags! {
    /// Flags passed to `iomap_begin` describing the requested operation.
    pub struct IomapFlags: u32 {
        const WRITE  = 1 << 0;
        const ZERO   = 1 << 1;
        const REPORT = 1 << 2;
        const FAULT  = 1 << 3;
        const DIRECT = 1 << 4;
        const NOWAIT = 1 << 5;
    }
}

/// Describes a contiguous range of storage backing a file region.
/// Returned by `IomapOps::iomap_begin()` for DAX direct-access and
/// buffered I/O extent mapping.
pub struct Iomap {
    /// Physical address of the mapped extent (valid for Mapped/Unwritten).
    pub phys_addr: PhysAddr,
    /// Length of the mapped extent in bytes.
    pub length: u64,
    /// Offset within the file that this mapping starts at.
    pub file_offset: u64,
    /// Type of mapping.
    pub kind: IomapKind,
    /// Filesystem-private flags (opaque to the VMM).
    pub flags: u16,
}

/// Filesystem extent-mapping interface for DAX and buffered I/O.
/// Filesystems that support DAX or iomap-based buffered I/O implement
/// this trait on their inode operations.
pub trait IomapOps: Send + Sync {
    /// Map a file region to a storage extent.
    fn iomap_begin(&self, inode: &Inode, offset: u64, length: u64,
                   flags: IomapFlags) -> Result<Iomap, FaultError>;
    /// Called after an iomap operation completes. Allows the filesystem
    /// to finalize metadata (e.g., convert unwritten → written).
    fn iomap_end(&self, inode: &Inode, offset: u64, length: u64,
                 written: u64, iomap: &Iomap) -> Result<(), FaultError>;
}

DAX fault path (Section 15.16):

/// Handle a page fault on a DAX-mapped file (persistent memory).
/// Instead of allocating a page and copying data, map the persistent
/// memory physical address directly into the process page table.
/// No page cache, no page allocation, no memcpy.
fn dax_iomap_fault(
    vma: &Vma,
    addr: VirtAddr,
    pgoff: u64,
    access: AccessType,
) -> Result<(), FaultError> {
    // (1) Ask the filesystem for the physical location of this file block.
    //     Returns an iomap describing the persistent memory region.
    let iomap = vma.file.inode.i_op.iomap_begin(pgoff, access)?;

    match iomap.kind {
        IomapKind::Mapped { phys_addr } => {
            // Direct map: persistent memory address → PTE.
            let pte_flags = PteFlags::PRESENT | PteFlags::USER
                | if access == AccessType::Write { PteFlags::WRITE | PteFlags::DIRTY }
                  else { PteFlags::empty() }
                | PteFlags::DEVMAP; // Mark as device-mapped (non-reclaimable).
            install_pte(current_pgd(), addr, phys_to_pfn(phys_addr), pte_flags)?;

            // For 2 MiB aligned regions, install a PMD mapping (huge page)
            // to reduce TLB pressure on large persistent memory files.
            // Only if both VMA and physical alignment permit.
        }
        IomapKind::Hole => {
            // Sparse file hole — map the zero page read-only.
            // On write: filesystem must allocate storage first (COW).
            if access == AccessType::Write {
                let iomap = vma.file.inode.i_op.iomap_begin(pgoff, access)?;
                // Retry with allocated block.
                return dax_iomap_fault(vma, addr, pgoff, access);
            }
            install_pte(current_pgd(), addr, ZERO_PAGE_PFN,
                        PteFlags::PRESENT | PteFlags::USER)?;
        }
        IomapKind::Unwritten => {
            // Pre-allocated but unwritten — zero the range on first write,
            // then convert to Mapped.
            pmem_zero(iomap.phys_addr, PAGE_SIZE);
            pmem_flush(iomap.phys_addr, PAGE_SIZE);
            vma.file.inode.i_op.iomap_end(pgoff, access)?;
            install_pte(current_pgd(), addr, phys_to_pfn(iomap.phys_addr),
                        PteFlags::PRESENT | PteFlags::USER | PteFlags::WRITE
                        | PteFlags::DIRTY | PteFlags::DEVMAP)?;
        }
    }
    Ok(())
}

4.8.4.5 handle_swap_fault — Swap-In from Swap Device¶

/// Handle a page fault on a PTE that contains a swap entry (the page was
/// swapped out to a swap device). Reads the page back from swap, installs
/// a PTE mapping the newly allocated physical page, and frees the swap slot.
///
/// Called from the page fault dispatch table when the PTE is a swap entry with
/// the compressed bit clear (compressed pages are handled by the separate
/// compressed fault path in [Section 4.12](#memory-compression-tier--decompression-path)).
///
/// # Arguments
/// - `mm`: The faulting process's address space.
/// - `vma`: The VMA covering `addr` (already validated by the caller).
/// - `addr`: The faulting virtual address (page-aligned by the caller).
/// - `entry`: The decoded swap entry from the PTE, containing the swap device
///   index and offset within that device.
///
/// # Returns
/// - `Ok(page)`: The physical page now backing `addr`, with PTE installed.
/// - `Err(FaultError::SwapReadFailed)`: I/O error reading from the swap device.
/// - `Err(FaultError::Oom)`: Page allocation failed (memory pressure too high
///   even with reclaim — the OOM killer may be invoked by the caller).
///
/// # Algorithm
///
/// 1. **Decode swap entry**: Extract `swap_type` (index into the global
///    `swap_info` array, identifying which swap device/file) and `swap_offset`
///    (page-sized offset within that device). The swap entry format is
///    architecture-neutral — `arch::current::mm::decode_swap_pte()` extracts
///    the type and offset from the hardware PTE encoding.
///
/// 2. **Swap cache lookup**: Check if the page is already in the swap cache
///    (`swap_cache: XArray<SwapCacheEntry>` keyed by `(swap_type, swap_offset)`).
///    The swap cache holds pages that are in transit (being read from or written
///    to swap) or recently swapped in and not yet removed from swap.
///
///    - **Cache hit, page locked (`PG_LOCKED` set)**: Another thread is already
///      reading this page from swap. Sleep on the page's wait queue
///      (`wait_on_page_locked()`) until the I/O completes. On wake: if
///      `PG_ERROR` is set, the read failed — return `SwapReadFailed`. Otherwise
///      the page is ready — skip to step 6.
///
///    - **Cache hit, page unlocked**: The page was recently swapped in (still in
///      swap cache from a previous fault or readahead). Skip to step 6.
///
///    - **Cache miss**: Proceed to step 3 (initiate swap read).
///
/// 3. **Allocate physical page**: `phys_alloc(GfpFlags::USER | GfpFlags::RECLAIM)`.
///    If allocation fails, return `Oom`. The `RECLAIM` flag allows synchronous
///    reclaim (evicting clean page cache pages) to satisfy the allocation under
///    moderate memory pressure.
///
/// 4. **Insert into swap cache and initiate I/O**:
///    a. Lock the page (`page.flags |= PG_LOCKED`).
///    b. Insert into swap cache: `swap_cache.try_store((swap_type, swap_offset), page)`.
///       If another thread raced and inserted first, drop our page and use theirs
///       (same concurrent-dedup protocol as `handle_file_fault`).
///    c. Submit asynchronous block I/O to the swap device:
///       `swap_info[swap_type].read_page(swap_offset, page)`. This enqueues a
///       bio to the block layer ([Section 15.2](15-storage.md#block-io-and-volume-management)).
///    d. **Swapin readahead**: Prefetch a cluster of `swap_readahead_pages`
///       (default 8, configurable via `/proc/sys/vm/page-cluster`) contiguous
///       swap slots around `swap_offset`. For each slot in the cluster:
///       - Skip if the slot's `swap_map` refcount is 0 (free slot).
///       - Skip if already in the swap cache (recently faulted or still in transit).
///       - Allocate a page, insert into swap cache, submit read I/O.
///       Readahead pages are not installed into any page table — they sit in the
///       swap cache until the corresponding fault occurs (at which point step 2
///       finds them as a cache hit). Readahead pages that are not faulted within
///       a reclaim cycle are evicted from the swap cache by the page reclaimer.
///
///       **Readahead pattern detection**: The readahead cluster uses a sequential
///       access heuristic. If the last N swap faults (tracked per-VMA in
///       `vma.swap_readahead_info`) were to contiguous swap offsets, the cluster
///       size is doubled (up to `max_readahead_pages`, default 32). If access is
///       random, the cluster size is reduced to 1 (no readahead). This matches
///       Linux's `swap_ra_info` + `swapin_nr_pages()` adaptive algorithm.
///
///    e. Wait for the faulting page's I/O to complete: `wait_on_page_locked(page)`.
///       On wake: if `PG_ERROR` is set, remove the page from swap cache, free it,
///       and return `SwapReadFailed`.
///
/// 5. **Verify page integrity** (optional, enabled by `swap_integrity` boot
///    parameter or per-device `discard_integrity` flag): Compare a SHA-256
///    checksum of the page contents against the checksum recorded at swap-out
///    time (stored in a per-swap-device integrity table, one 32-byte hash per
///    swap slot). If mismatch: log a KERN_ERR message with the swap device,
///    offset, expected/actual hashes; deliver `SIGBUS` (`BUS_MCEERR_AR`) to
///    the faulting process; remove the page from swap cache and return
///    `SwapReadFailed`. This catches silent data corruption from faulty swap
///    storage (bit rot, firmware bugs, cosmic rays).
///
/// 6. **Install PTE**: Map `addr` to the physical page with appropriate flags:
///    ```
///    flags = PTE_PRESENT | PTE_USER | PTE_YOUNG
///            | (if vma.is_writable() { PTE_WRITE | PTE_DIRTY } else { 0 })
///    ```
///    Use `cmpxchg` on the PTE slot to atomically replace the swap entry with
///    the new mapping. If `cmpxchg` fails (another thread already resolved this
///    fault), free the speculatively allocated page and return the winner's page.
///
///    Set `PTE_YOUNG` to give the page an initial accessed bit — the page was
///    just faulted in, so it should not be immediately reclaimed. Set `PTE_DIRTY`
///    for writable VMAs to avoid an immediate write-protect fault on the next write.
///
/// 7. **Free swap slot**: Decrement the swap map reference count for this slot:
///    `swap_info[swap_type].swap_map[swap_offset] -= 1`. If the refcount reaches
///    zero (no other process has a COW reference to this swap entry), the slot is
///    returned to the swap device's free list. If the refcount is > 0 (e.g., a
///    forked process still references the swap entry via a copied PTE), the slot
///    remains allocated — the other process will fault independently and perform
///    its own swap-in.
///
///    Remove the page from the swap cache (it is now exclusively owned by this
///    process's page table, not shared with swap).
///
/// 8. **Update RSS**: Increment `mm.rss` (resident set size) by one page
///    via direct `PerCpu<AtomicI64>` access:
///    `let guard = preempt_disable(); mm.rss.get(&guard).fetch_add(1, Relaxed);`
///    This is a single atomic op on the current CPU's slot — zero overhead
///    beyond the `fetch_add` itself. No batch threshold, no approximate-read
///    bookkeeping. See `MmStruct.rss` documentation for why RSS uses
///    `PerCpu<AtomicI64>` (hot path) rather than `PerCpuCounter<i64>` (warm path).
///    Decrement the per-cgroup swap usage counter.
///
/// 9. **Return** the page reference.
///
/// # Memory ordering
/// - The page content write (from block I/O DMA) is ordered before the PTE
///   store via `Release` on the PTE cmpxchg. This ensures other CPUs that
///   observe the new PTE via TLB fill will see the correct page contents.
/// - `swap_map` decrement uses `Release` to ensure the PTE installation is
///   visible before the swap slot is freed (prevents a race where the slot
///   is reallocated and overwritten before the PTE is installed).
///
/// # Concurrency
/// - Multiple threads faulting on the same swap entry are serialized by the
///   swap cache `PG_LOCKED` mechanism: the first thread initiates I/O, others
///   sleep on the page lock and share the result.
/// - The PTE `cmpxchg` handles races between threads faulting on different
///   virtual addresses that map the same swap entry (possible after fork).
fn handle_swap_fault(
    mm: &MmStruct,
    vma: &Vma,
    addr: VirtAddr,
    entry: SwapEntry,
) -> Result<PageRef, FaultError>

SwapEntry — decoded swap PTE:

/// Decoded swap entry extracted from a non-present PTE. Contains enough
/// information to locate the page on the swap device.
pub struct SwapEntry {
    /// Index into the global `swap_info` array (identifies the swap device
    /// or swap file). Valid range: 0..MAX_SWAP_TYPES (default 32).
    pub swap_type: u8,
    /// Page-aligned offset within the swap device/file. Identifies which
    /// swap slot holds the evicted page's contents.
    pub swap_offset: u64,
}

Locking protocol:

The fault path holds the mm read lock (mmap_read_lock) for VMA tree stability during the entire fault resolution. Multiple faults on the same address space can proceed concurrently (read lock allows shared access). The PTE installation uses cmpxchg (compare-and-swap on the PTE entry) to detect racing faults on the same virtual page — install the new PTE only if the slot is still empty (for anonymous faults) or still contains the expected old value (for COW faults). If the cmpxchg fails, another CPU has already resolved the fault; the current fault handler frees the speculatively allocated page and returns success.

This lock-free PTE installation avoids per-page spinlocks and allows the common case (non-overlapping faults on different pages within the same address space) to proceed with zero contention.

Device faults (HMM): Accelerator drivers trigger device page faults via hmm_range_fault(), which acquires mmap_read_lock and walks the VMA tree to populate a mirror page table for device-side address translation. The device fault path shares the same VMA lookup, permission check, and page cache integration as CPU faults above, but the resulting PTE mappings are installed in the device's page table (not the CPU page table). Device faults may also trigger CPU page faults (if the backing page is not yet present) which are resolved by the standard handlers above before the HMM mirror is populated. See Section 22.1 for HMM integration with the unified accelerator framework.

Kernel fault handling:

Faults in kernel mode fall into two categories:

Expected faults (during copy_to_user / copy_from_user): These access user-space memory that may be swapped out, unmapped, or protected. The kernel exception fixup table (__ex_table) maps each potentially-faulting instruction address to a fixup handler that returns -EFAULT to the syscall caller instead of panicking. The fault handler checks __ex_table before delivering a signal.
Unexpected faults (kernel bug): Any kernel-mode fault at an address not in __ex_table indicates a kernel bug (NULL pointer dereference, use-after-free, stack overflow). The handler triggers a kernel oops: dumps registers, stack trace, and the faulting instruction, then panics. On debug builds, this also triggers a breakpoint for QEMU/GDB debugging.

/// Check if a kernel-mode fault has a fixup entry.
/// Returns the fixup address if found, None if this is a bug.
fn kernel_fixup(fault_ip: VirtAddr) -> Option<VirtAddr> {
    // __ex_table is a sorted array of (insn_addr, fixup_addr) pairs,
    // generated by the linker from __ex_table section entries placed
    // by the copy_to_user/copy_from_user macros.
    EX_TABLE.binary_search(fault_ip)
}

4.8.5 Custom Fault Handler Registration¶

Certain kernel subsystems need to intercept page faults in specific address ranges before the normal VMA-based handler runs. The primary use case is KVM post-copy live migration (Section 18.1), where a guest page that has not yet been transferred from the source host must be fetched on demand when the guest accesses it.

/// Callback invoked when a page fault occurs in a registered range.
/// Returns `Ok(page)` with the resolved physical page to install in the
/// page table, or `Err(FaultError)` to propagate the error (typically
/// `FaultError::Sigbus` for unrecoverable migration failure).
///
/// The handler runs with `mmap_lock` held in read mode. It MUST NOT
/// acquire `mmap_lock` in write mode (deadlock). It may block on I/O
/// (e.g., RDMA fetch from source host).
pub type CustomFaultFn = fn(
    addr: VirtAddr,
    access: AccessType,
) -> Result<PhysAddr, FaultError>;

/// Registration entry for a custom fault handler.
struct CustomFaultEntry {
    /// Start of the intercepted VA range (page-aligned).
    start: VirtAddr,
    /// End of the intercepted VA range (exclusive, page-aligned).
    end: VirtAddr,
    /// Handler function.
    handler: CustomFaultFn,
}

/// Per-MmStruct list of custom fault handlers. Protected by `mmap_lock`.
/// Typically 0-1 entries (only KVM post-copy uses this). Stored as an
/// ArrayVec with a small fixed capacity — no heap allocation on the
/// fault path.
///
/// Lookup: on page fault, before VMA lookup, scan this list for a
/// matching range. O(N) where N ≤ 4 (bounded). If matched, invoke
/// the handler directly; skip the normal VMA fault path.
pub type CustomFaultTable = ArrayVec<CustomFaultEntry, 4>;

Registration API:

/// Register a custom page fault handler for an address range within
/// the given address space. The handler intercepts faults before the
/// normal VMA-based path.
///
/// # Arguments
/// * `mm` - Target address space.
/// * `start` - Start of the VA range (page-aligned).
/// * `end` - End of the VA range (exclusive, page-aligned).
/// * `handler` - Fault handler function.
///
/// # Returns
/// `Ok(())` on success, `Err` if the range overlaps an existing
/// registration or exceeds the maximum (4 concurrent registrations).
///
/// # Locking
/// Caller must hold `mmap_lock` in write mode.
pub fn register_fault_handler(
    mm: &MmStruct,
    start: VirtAddr,
    end: VirtAddr,
    handler: CustomFaultFn,
) -> Result<(), KernelError>;

/// Unregister a previously registered custom fault handler.
/// Any in-flight faults in the range complete normally (the handler
/// remains callable until `unregister` returns). After return, no
/// new invocations will occur.
///
/// # Locking
/// Caller must hold `mmap_lock` in write mode.
pub fn unregister_fault_handler(
    mm: &MmStruct,
    start: VirtAddr,
    end: VirtAddr,
) -> Result<(), KernelError>;

Integration with handle_page_fault: After reading the faulting address and acquiring mmap_lock.read(), the fault handler checks mm.custom_fault_table before find_vma(). If a custom handler matches, it is invoked directly. This adds one ArrayVec scan (≤4 entries, typically 0) to the fault path — negligible overhead.

4.8.5.1.1 Tier 1 Fault Handler Domain Crossing¶

CustomFaultFn is a direct function pointer, callable only from Core (Tier 0). When a Tier 1 subsystem (e.g., a filesystem driver implementing demand-paged file-backed mappings, or a network subsystem implementing RDMA page-fault handling) needs to participate in page fault resolution, it cannot be invoked via a raw function pointer — it runs in a separate hardware isolation domain (MPK/POE). Instead, Core dispatches fault requests to the Tier 1 handler through the standard KABI ring buffer.

/// Fault handler request sent from Core to a Tier 1 handler via the
/// KABI DomainRingBuffer. Posted by the Core fault path when a custom
/// fault registration identifies a Tier 1 handler as the target.
///
/// Size: 32 bytes (fits in a single ring slot without fragmentation).
// kernel-internal, not KABI
#[repr(C)]
pub struct FaultHandlerRequest {
    /// Faulting virtual address (page-aligned by the caller).
    pub fault_addr: VirtAddr,
    /// Access type that triggered the fault (read, write, execute).
    pub fault_flags: PageFaultFlags,
    /// Global TaskId of the faulting task. The Tier 1 handler may use
    /// this to look up per-task state (e.g., file-backed mapping metadata).
    pub faulting_task: TaskId,
    /// Opaque cookie identifying the registration. Allows the Tier 1
    /// handler to correlate the request with its internal mapping table
    /// without a global lookup.
    pub registration_cookie: u64,
}
// KABI ring message. Layout depends on VirtAddr, PageFaultFlags, TaskId sizes
// (all fixed-size newtypes). Verified per-architecture at build time.

/// Fault handler response from a Tier 1 handler back to Core.
/// Posted on the return ring after the handler resolves (or fails) the fault.
///
/// Size: 24 bytes.
// kernel-internal, not KABI
#[repr(C)]
pub struct FaultHandlerResponse {
    /// Fault resolution result.
    pub result: VmFaultResult,
    /// Physical page to install in the faulting PTE. `Some` on success;
    /// `None` on error (Core delivers SIGBUS/SIGSEGV to the task).
    /// The page's refcount has been incremented by the Tier 1 handler;
    /// Core adopts ownership and will decrement on unmap.
    pub page: Option<PageHandle>,
    /// Protection bits for the new PTE (read, write, execute).
    /// Core installs the PTE with these permissions. The Tier 1 handler
    /// may grant fewer permissions than the VMA allows (e.g., read-only
    /// for COW pages that will be upgraded on a subsequent write fault).
    pub pte_prot: PageProt,
}
// KABI ring response. Contains Option<PageHandle> which may use niche
// optimization — size verified per-architecture at build time.

/// Fault resolution status.
#[repr(u32)]
pub enum VmFaultResult {
    /// Page resolved successfully. `page` field contains the physical page.
    Ok = 0,
    /// Mapping does not exist for this address. Core delivers SIGSEGV.
    NoMapping = 1,
    /// I/O error reading the backing store. Core delivers SIGBUS.
    IoError = 2,
    /// Out of memory allocating the page. Core returns -ENOMEM to the
    /// fault path, which may trigger OOM or retry.
    OutOfMemory = 3,
    /// Handler needs more time (e.g., remote page fetch in progress).
    /// Core re-queues the faulting task on the fault handler's wait queue.
    /// The handler sends a second response when the page is ready.
    Retry = 4,
}

Dispatch protocol:

Core's handle_page_fault finds a CustomFaultEntry whose registration targets a Tier 1 domain (indicated by CustomFaultEntry.tier: IsolationTier field, added alongside the existing handler field).
Instead of calling handler() directly, Core posts a FaultHandlerRequest to the Tier 1 handler's KABI DomainRingBuffer inbound ring.
The faulting task is put to sleep on a per-registration wait queue (CustomFaultEntry.waitq: WaitQueue).
The Tier 1 handler dequeues the request, resolves the page (disk read, network fetch, decompression, etc.), and posts a FaultHandlerResponse on the outbound ring.
Core's ring drain handler wakes the faulting task, installs the PTE using the returned PageHandle and pte_prot, and the task resumes.

Timeout: If no response arrives within 5 seconds (configurable via vm.fault_handler_timeout_ms), Core treats it as VmFaultResult::IoError and delivers SIGBUS to the faulting task. The Tier 1 handler's health is checked via the standard watchdog mechanism (Section 11.9).

Performance: The ring-buffer round-trip adds ~200-500 ns to the fault path (one domain switch + ring enqueue + dequeue + domain switch back). This is acceptable because Tier 1 fault handlers serve cold-path faults (file I/O, network page fetch) where the backing operation takes microseconds to milliseconds. Hot-path anonymous page faults and page cache hits are resolved entirely within Core without any domain crossing.

4.8.6 Page Table Walk and User Page Pinning¶

/// Walk the page table for `addr` and return a reference to the leaf PTE.
/// Does NOT fault in missing pages — returns `Err(EFAULT)` if the PTE is
/// not present. Used by drivers and I/O paths that need to inspect PTEs
/// without triggering page faults (e.g., RDMA registration, KVM EPT sync).
///
/// # Safety
///
/// The caller must hold `mm.mmap_lock` in at least read mode to prevent
/// concurrent VMA teardown. The returned PTE reference is valid only while
/// `mmap_lock` is held — releasing it invalidates the reference.
///
/// # Arguments
/// - `mm`: The address space to walk.
/// - `addr`: Virtual address to resolve (page-aligned).
///
/// # Returns
/// A reference to the leaf PTE entry, or `EFAULT` if the address is
/// unmapped at any page table level.
pub fn follow_pte(mm: &Mm, addr: VirtAddr) -> Result<&PteEntry, Errno>;

/// Pin user pages for DMA ([Section 4.14](#dma-subsystem)) or direct kernel access. This is the fast-path
/// implementation that avoids taking `mmap_lock` by using an optimistic
/// lockless page table walk under RCU protection.
///
/// # Algorithm
/// 1. Disable preemption and enter RCU read-side critical section.
/// 2. Walk the page table locklessly (PGD → PUD → PMD → PTE).
///    Each level is read with a single atomic load. If any level is
///    not present, fall back to the slow path (`get_user_pages_unlocked`).
/// 3. For each resolved PTE: atomically increment `page._refcount`.
///    If the page is a compound page, increment the head page's refcount.
/// 4. Re-read the PTE to verify it was not changed concurrently (ABA check).
///    If changed, decrement the refcount and retry or fall back.
/// 5. Exit RCU, re-enable preemption, return pinned pages.
///
/// # Arguments
/// - `start`: Starting user virtual address (page-aligned).
/// - `nr_pages`: Number of consecutive pages to pin.
/// - `gup_flags`: `FOLL_WRITE` (pin for write access, triggers COW),
///   `FOLL_PIN` (use pin_count instead of refcount for long-term pins).
/// - `pages`: Output slice to receive pinned `PageRef` entries.
///
/// # Returns
/// Number of pages successfully pinned, or negative errno.
/// Partial success is possible — the caller must unpin the returned pages
/// on error via `unpin_user_pages()`.
///
/// # Performance
/// Fast path (all PTEs present, no concurrent modifications): ~50-100 ns
/// per page on x86-64 (no locks, no TLB flush, no mmap_lock).
pub fn get_user_pages_fast(
    start: VirtAddr,
    nr_pages: u32,
    gup_flags: GupFlags,
    pages: &mut [PageRef],
) -> Result<u32, Errno>;

/// Release pages pinned by `get_user_pages_fast()`. Decrements refcounts
/// (or pin_counts for FOLL_PIN pages) and allows the pages to be reclaimed.
pub fn unpin_user_pages(pages: &[PageRef], nr_pages: u32);

/// GUP (Get User Pages) flags controlling page pinning behavior.
bitflags! {
    pub struct GupFlags: u32 {
        /// Pin pages for write access. Triggers COW if the page is shared.
        const WRITE     = 0x01;
        /// Long-term pin: uses the page's pin_count (separate from refcount)
        /// to signal that the page may be pinned for an extended duration
        /// (e.g., RDMA registration). The memory reclaimer will not attempt
        /// to migrate or swap out pinned pages.
        const LONGTERM  = 0x02;
        /// Use pin_count instead of refcount for accounting. Required for
        /// CMA/movable-zone pages that must not be migrated while pinned.
        const PIN       = 0x04;
    }
}

4.8.7 TLB Invalidation by Architecture¶

4.8.7.1 TLB Safety Invariant (All Architectures)¶

Universal invariant — no exceptions:

Unmap PTE entry → TLB flush (architecture-specific) → drop page reference

The page backing a PTE must not be freed, reused, or returned to any allocator until the TLB flush has completed on every CPU that may have cached the mapping. Violating this ordering creates a use-after-free window where a CPU with a stale TLB entry can read or write a page that has been reallocated to a different context — this is CVE-2018-18281 (Linux mremap race).

This invariant applies identically to all page types (anonymous, file-backed, device, DMA) and all unmap operations (munmap, mremap, mprotect with permission downgrade, process exit, memory reclaim). There are no fast-path exemptions.

ARM64 repeat TLBI requirement: On Cortex-A55 (erratum 2441007), Cortex-A76 (erratum 1286807), Cortex-A510 (erratum 2441009), and Neoverse-N1 (erratum 1542419), a single TLBI instruction may not fully invalidate all TLB entries due to speculative translation races. On these cores, the TLB flush sequence must be: DSB ISH → TLBI → DSB ISH → TLBI → DSB ISH → ISB (repeat TLBI+DSB). Per-CPU errata flags (Aarch64Errata bits in Section 2.16) gate this double-invalidation on big.LITTLE systems where only some cores are affected.

x86 Alder Lake/Raptor Lake PCID issue: On these hybrid CPUs (errata flag X86Errata::PCID_INVLPG_GLOBAL), INVLPG fails to flush Global TLB entries when PCID is enabled. UmkaOS disables PCID on affected steppings (detected at boot via CPUID family/model/stepping match) and uses full CR3 reload for TLB invalidation.

TLB invalidation is one of the most architecture-specific operations in the memory subsystem. UmkaOS abstracts it behind arch::current::mm::tlb_flush_* functions, but the underlying mechanisms differ substantially across ISAs in their broadcast model, granularity, and completion semantics.

x86-64:

Single page (local): INVLPG [addr] — invalidates the TLB entry for one virtual address on the current CPU, across all PCIDs.
Full flush (local): write CR3 with the NOFLUSH bit clear — flushes all non-global TLB entries for the current PCID.
PCID-preserving switch (preferred): write CR3 with bit 63 (NOFLUSH) set — switches address space without flushing any TLB entries (requires CR4.PCIDE = 1). This is the normal context-switch path when the target PCID is still valid.
Tagged shootdown: INVPCID instruction — type 0 = single-address for one PCID, type 1 = all entries for one PCID, type 2 = all non-global entries, type 3 = all entries including global. Requires CR4.PCIDE = 1.
Cross-CPU shootdown: send a fixed-vector IPI via the LAPIC to each CPU that may hold the stale mapping (tracked via mm_cpumask); the target CPU's IPI handler executes INVLPG or INVPCID then acknowledges.
Broadcast invalidation (AMD Zen 3+): INVLPGB (Invalidate Page Broadcast) — broadcasts a TLB invalidation to all CPUs without requiring IPIs. Supported on AMD Zen 3 (Milan) and later, detected via CPUID Fn8000_0008_EBX[INVLPGB]. The instruction encodes the virtual address, ASID, PCID, and page count in a single operation. Completion is checked via TLBSYNC — a serializing instruction that blocks until all preceding INVLPGB operations have completed on all CPUs. Sequence: INVLPGB (broadcast) → TLBSYNC (wait for completion) → proceed. This eliminates the IPI overhead (send IPI, wait for ACK per CPU) for TLB shootdown on large-core-count AMD systems. UmkaOS uses INVLPGB when available (checked via X86Features::INVLPGB bit), falling back to IPI-based shootdown on Intel and older AMD CPUs. INVLPGB can invalidate up to 64 pages per invocation (count field in RCX); ranges larger than 64 pages require multiple invocations or fallback to full ASID invalidation.

AArch64:

Single page (system-wide): TLBI VAE1IS, Xt — invalidate by virtual address, EL1, Inner Shareable domain. Xt encodes VA[55:12] in bits [43:0] and optionally the ASID in bits [63:48]. The IS (Inner Shareable) suffix broadcasts the operation to all CPUs in the coherency domain — no software IPI is required.
ASID-scoped flush: TLBI ASIDE1IS, Xt — invalidate all TLB entries for a specific ASID. Used on context switch when the target ASID has been recycled.
Full flush: TLBI VMALLE1IS — invalidate all EL1 entries for all ASIDs. Reserved for extreme cases (address space teardown with many ASIDs).
Completion sequence: DSB ISH (ensure preceding stores to page tables are visible) → TLBI VAE1IS → DSB ISH (ensure TLB invalidation is complete on all CPUs) → ISB (ensure subsequent instruction fetches use the new mapping).
No explicit IPI needed: the IS variants broadcast through the coherency domain's hardware interconnect. ARM's TLB maintenance operations with IS are the equivalent of x86's IPI + INVLPG in a single instruction.

ARMv7:

Single page (system-wide): MCR p15, 0, Rt, c8, c3, 3 (TLBIMVAAIS) — invalidate by modified virtual address and ASID, all CPUs.
Full flush (system-wide): MCR p15, 0, Rt, c8, c3, 0 (TLBIALLIS) — invalidate all TLB entries, all CPUs.
Completion: DSB before the MCR to ensure page table writes are visible; DSB after to ensure the invalidation has propagated; ISB to prevent stale instruction fetch.

RISC-V:

Single page (local): SFENCE.VMA rs1, rs2 — rs1 holds the virtual address (zero = all pages), rs2 holds the ASID (zero = all ASIDs). With both specified, invalidates the specific VA for the specific ASID on the current hart only.
Full flush (local): SFENCE.VMA x0, x0 — flushes all TLB entries on the current hart.
Cross-hart shootdown: no hardware broadcast mechanism. UmkaOS sends an IPI to each hart that may have cached the mapping (tracked via mm_cpumask), and each target hart executes SFENCE.VMA in the IPI handler. The initiating hart waits for all acknowledgments before returning.
Hypervisor extensions: HFENCE.VVMA (invalidate guest virtual TLB entries) and HFENCE.GVMA (invalidate guest physical TLB entries), used by the RISC-V KVM path (Section 18.1). Critical encoding note: HFENCE.GVMA takes the guest physical address right-shifted by 2 (GPA >> 2) in rs1, NOT the raw GPA. This matches the RISC-V H-extension spec where guest physical page numbers are encoded as GPA[55:2] in bits [53:0] of the register. Using an unshifted GPA causes invalidation of the wrong TLB entries. Compile-time assertion: static_assert(hfence_gvma_addr == (gpa >> 2)) in the KVM TLB flush path.
Lazy TLB mode (x86-64, RISC-V, LoongArch64): When a kernel thread runs, it borrows the previous user task's address space (page tables remain loaded in CR3/SATP). The kernel thread never accesses user addresses, so the stale mappings are harmless. However, if another CPU invalidates a user mapping via TLB shootdown IPI, the kernel thread's CPU must either flush or switch to init_mm — otherwise, when the kernel thread is descheduled and a user task resumes on that CPU, it will have stale TLB entries from before the invalidation.

4.8.7.1.1 Lazy TLB Enter/Exit Protocol¶

Two per-CPU flags (both plain bool in PerCpu<TlbState>, accessed with preemption disabled — IRQ-safe because only the local CPU writes them):

in_lazy_tlb: bool — true while a kernel thread is running with a borrowed mm.
tlb_flush_pending: bool — true when a TLB shootdown IPI arrived during lazy mode.

Event	Action
`context_switch()` to kernel thread	Set `in_lazy_tlb = true`. Leave CR3/SATP/TTBR unchanged (borrow previous mm).
TLB shootdown IPI while `in_lazy_tlb`	Set `tlb_flush_pending = true`. Skip the actual TLB flush (no user mappings accessed).
`context_switch()` to user task	Clear `in_lazy_tlb`. If `tlb_flush_pending` is set: perform a full TLB flush (arch-specific: `INVPCID_ALL` on x86-64, `SFENCE.VMA x0,x0` on RISC-V, `INVTLB 0x0,r0,r0` on LoongArch64), then clear `tlb_flush_pending`. If not set: normal PCID/ASID switch (no extra flush).
TLB shootdown IPI while NOT in lazy mode	Process normally (flush the specific VA range).

This avoids the cost of full flushes during lazy TLB while ensuring correctness on resume. The flags are not Atomic because they are only written by the local CPU with preemption disabled; the IPI handler runs on the same CPU. - ASID/PCID rollover: ASIDs (AArch64, RISC-V, s390x) and PCIDs (x86-64) are finite resources (8-16 bit). When all ASIDs have been assigned, a rollover occurs: a global generation counter is incremented, all CPUs are IPI'd to flush their TLBs, and ASID assignment restarts from 0. The fast-path check on context switch compares the task's ASID generation against the current global generation — if they match, the ASID is still valid and no TLB flush is needed. If they differ, a new ASID is assigned from the current generation's pool. The generation counter is AtomicU64 (never wraps in practice — 2^64 generations at 1M context switches/sec = 584,942 years). The ASID allocator uses a per-CPU bitmap to track free ASIDs, avoiding contention on the global counter except during rollover events.

PPC32 / PPC64LE:

Single page (local): tlbiel (invalidate local, BookS) — invalidates the TLB entry for the given effective address on the current CPU only.
Single page (global): tlbie (invalidate entry, BookS) — on POWER hardware, tlbie is automatically broadcast to all CPUs by the hardware interconnect. No software IPI is needed for cross-CPU shootdown on shared-memory multiprocessors.
Completion: ptesync before tlbie to ensure page table stores are visible; tlbsync + sync after tlbie to ensure the invalidation has completed on all CPUs.
BookE (PPC32 embedded): uses tlbivax (invalidate by address) + tlbsync instead of tlbie.

s390x:

Single page (local): IPTE (Invalidate Page Table Entry) — takes a pointer to the page table entry and optionally a virtual address. IPTE physically reads and invalidates the page table entry atomically. On z/Architecture multiprocessors, IPTE automatically broadcasts the invalidation to all CPUs within the configuration — no software IPI needed.
ASCE-scoped flush: IDTE (Invalidate DAT Table Entry) — invalidates all TLB entries derived from a specific region/segment table entry. Used for large-scale unmaps and address space teardown. Also broadcasts automatically.
Full flush: PTLB (Purge TLB) — invalidates all TLB entries on the local CPU. Primarily used at ASCE switch (equivalent to loading CR3 on x86). Not broadcast; each CPU executes its own PTLB when switching address spaces.
Completion: IPTE and IDTE are serializing — when the instruction completes, the invalidation has been applied on all CPUs. No additional barrier is needed. The CSP (Compare and Swap and Purge) instruction combines a CAS on the page table with TLB invalidation atomically (used for COW PTE updates).

LoongArch64:

Single page (local): INVTLB op, rj, rk — invalidate TLB entries matching specified criteria. op selects the mode: 0x0 = invalidate all, 0x1 = invalidate all STLB entries, 0x4 = invalidate by ASID (rj = ASID), 0x5 = invalidate by ASID and VA (rj = ASID, rk = virtual address), 0x6 = invalidate by global flag and VA. These operations are local to the executing CPU.
Cross-CPU shootdown: no hardware broadcast mechanism. UmkaOS sends an IPI to each CPU in mm_cpumask; the target CPU's IPI handler executes INVTLB with the appropriate operands. The initiating CPU waits for all acknowledgments.
Completion: DBAR 0 (data barrier) before INVTLB to ensure page table writes are visible; the INVTLB instruction itself is completion-synchronous on the local CPU. After remote IPI-based invalidation, the acknowledgment protocol provides ordering.

UmkaOS shootdown protocol summary:

Architecture	Cross-CPU TLB broadcast	Software IPI needed
x86-64	None (no hardware broadcast)	Yes — LAPIC IPI to each CPU in `mm_cpumask`
AArch64	`TLBI *IS` broadcasts via interconnect	No — hardware handles it
ARMv7	`TLBIALLIS` broadcasts via interconnect	No — hardware handles it
RISC-V	None (no hardware broadcast)	Yes — IPI to each hart in `mm_cpumask`
PPC64LE	`tlbie` broadcasts via interconnect	No — hardware handles it
PPC32 BookE	None	Yes — IPI required
s390x	`IPTE`/`IDTE` broadcast via hardware (serializing instructions)	No — hardware handles it
LoongArch64	None (no hardware broadcast)	Yes — IPI to each CPU in `mm_cpumask`

On architectures with hardware broadcast (AArch64, PPC64, s390x), UmkaOS still tracks mm_cpumask for lazy TLB mode (Section 4.9) but does not send IPIs for TLB invalidation — the hardware performs the broadcast through its coherency fabric. On x86, RISC-V, and LoongArch64, mm_cpumask is used both to determine which CPUs need IPIs and to detect lazy TLB mode. The initiating CPU marks pages dirty, sends IPIs to the target set, and waits for acknowledgment (via a per-CPU completion flag) before returning to the caller.

4.8.8 VMM Replaceability — 50-Year Uptime Design¶

The virtual memory manager is factored into non-replaceable hardware primitives and replaceable policy to support live kernel evolution over multi-decade uptimes. This follows the same state-spill pattern proven by the I/O scheduler (Section 16.21) and the qdisc subsystem (Section 16.21).

4.8.8.1 Non-Replaceable Hardware Primitives¶

The following are non-replaceable — they are architecture-specific PTE manipulation instructions:

Primitive	Location	Reason
`install_pte()`	`arch::current::mm`	Single PTE write; ~3-5 arch instructions
`read_pte()`	`arch::current::mm`	Single PTE read + decode
`encode_pte()`	`arch::current::mm`	Flag encoding per arch PTE format
PTE format constants	`arch::current::mm`	Bit positions for Present/RW/NX/User etc.
TLB flush instructions	Section 4.8	Hardware instructions (`INVLPG`, `TLBI`, `SFENCE.VMA`, etc.)

These are already confined to arch/*/mm modules per the Platform Abstraction Rules (CLAUDE.md). No changes needed — they are inherently non-replaceable.

4.8.8.2 VmmPolicy — Replaceable VMM Policy Layer¶

/// Replaceable virtual memory management policy. Controls page fault
/// handling strategy, THP promotion decisions, TLB flush optimization,
/// PCID/ASID eviction policy, and file readahead window sizing.
///
/// **Separation principle**: Hardware PTE manipulation (install_pte,
/// read_pte, PTE format encoding/decoding) is in arch::current::mm
/// (NON-REPLACEABLE, ~10 architecture-specific instructions). VMA tree
/// operations (Maple Tree insert/remove/find) are non-replaceable data
/// structure operations. VmmPolicy controls the DECISIONS:
///   - Which page to allocate on fault (anonymous, file, compressed)
///   - Whether to attempt huge page promotion
///   - When to issue individual TLB flush vs full ASID flush
///   - PCID/ASID eviction order on context switch
///   - COW copy strategy (immediate vs deferred batch)
///   - File readahead window sizing
///
/// **Replaceability compliance**: VmmPolicy is called on WARM paths
/// only — eviction decisions, THP promotion, TLB flush strategy,
/// PCID eviction, readahead sizing, and memory pressure response.
/// The per-fault PTE installation hot path (page allocation + zero-fill
/// + `install_pte()` + TLB fill) is fixed Nucleus code: direct function
/// calls with no trait dispatch. VmmPolicy methods are invoked AROUND
/// the fault (deciding allocation strategy before the fault, deciding
/// promotion/flush strategy after) but never on the PTE write itself.
/// This satisfies the replaceability rule: "hot path must never call
/// through a replaceable trait."
///
/// **State spill**: All mutable state (MmStruct, Vma tree, page tables,
/// Page descriptors) is owned by non-replaceable structures. VmmPolicy
/// is a stateless set of algorithms. Replacement swaps only the global
/// policy pointer (~1 us stop-the-world). No state export/import needed.
pub trait VmmPolicy: Send + Sync {
    /// Decide allocation strategy for an anonymous page fault. Returns
    /// the allocated page (NUMA node selection, speculative multi-page,
    /// compressed-page-aware). The caller (fixed Nucleus code) performs
    /// the actual PTE installation via `install_pte()` — a direct call,
    /// not through this trait.
    ///
    /// `FaultError` is defined earlier in this section (the `FaultError` enum)
    /// and enumerates all possible page fault outcomes (OOM, SIGBUS, retry, etc.).
    fn handle_anon_fault(
        &self,
        mm: &MmStruct,
        vma: &Vma,
        addr: VirtAddr,
        access: AccessType,
    ) -> Result<(), FaultError>;

    /// Handle a COW (copy-on-write) fault. Default: copy page if refcount
    /// > 1, else re-protect. Replacement: might implement deferred-copy
    /// for large forks, or speculative batch-copy for sequential access.
    fn handle_cow_fault(
        &self,
        mm: &MmStruct,
        vma: &Vma,
        addr: VirtAddr,
        old_pfn: Pfn,
    ) -> Result<(), FaultError>;

    /// Handle a file-backed page fault. Dispatches between page cache
    /// hit, readahead initiation, and DAX (direct access) paths.
    fn handle_file_fault(
        &self,
        mm: &MmStruct,
        vma: &Vma,
        addr: VirtAddr,
        access: AccessType,
    ) -> Result<(), FaultError>;

    /// Decide whether to attempt THP promotion for a given fault.
    /// Called on anonymous faults at huge-page-aligned addresses when THP
    /// is enabled. Default: attempt if all base pages in the huge page
    /// range are present and the VMA allows it.
    /// Replacement: might use access frequency heuristics, ML prediction
    /// ([Section 23.1](23-ml-policy.md#aiml-policy-framework-closed-loop-kernel-intelligence)),
    /// or cgroup-aware THP budgets.
    fn should_promote_thp(
        &self,
        mm: &MmStruct,
        addr: VirtAddr,
    ) -> bool;

    /// Determine TLB flush strategy for a given unmap/remap operation.
    /// Returns the recommended flush granularity.
    /// Default: single-page flush for small unmaps, full ASID flush
    /// for large unmaps (>32 pages on x86, >16 on ARM).
    /// Replacement: adaptive threshold based on TLB miss rate monitoring.
    fn tlb_flush_strategy(
        &self,
        mm: &MmStruct,
        addr_start: VirtAddr,
        addr_end: VirtAddr,
        nr_pages: u64,
    ) -> TlbFlushGranularity;

    /// PCID/ASID allocation and eviction policy. Called on context switch
    /// when no PCID/ASID is assigned to the target process.
    /// Default: LRU eviction
    /// ([Section 4.7](pcid-asid-management.md)).
    /// Replacement: frequency-based eviction or priority-aware allocation.
    fn pcid_evict_candidate(
        &self,
        active_pcids: &[PcidEntry],
        nr_active: usize,
    ) -> usize; // index of entry to evict

    /// File readahead window sizing. Called on file fault cache miss.
    /// Default: doubling window (4, 8, 16, 32 pages).
    /// Replacement: ML-tuned readahead based on per-file access patterns.
    fn readahead_window(
        &self,
        ra_state: &FileRaState,
        offset: u64,
    ) -> u32; // pages to read ahead

    /// Invalidate any cached decisions that depend on the old policy's
    /// parameters. Called as a post-swap callback during live evolution
    /// ([Section 13.18](13-device-classes.md#live-kernel-evolution)). Specifically: flush cached VMA merge
    /// decisions (speculative merge hints stored per-VMA that may differ
    /// under the new policy's THP promotion or readahead strategy).
    /// Cold path — called once per policy swap.
    fn flush_policy_caches(&self, mm_list: &MmList);
}

/// TLB flush granularity recommendation from VmmPolicy.
pub enum TlbFlushGranularity {
    /// Flush individual pages (INVLPG / TLBI VAE1IS per page).
    /// Preferred when nr_pages is small relative to TLB capacity.
    PerPage,
    /// Flush entire PCID/ASID. Cheaper than N individual flushes when
    /// N exceeds the crossover threshold (architecture-dependent).
    FullAsid,
    /// Defer flush: mark pages for lazy invalidation on next access.
    /// Used for batch munmap where most pages may never be re-accessed
    /// by this address space.
    Deferred,
}

Global VMM policy instance:

/// Global VMM policy. Replaceable via live evolution.
/// Used by handle_page_fault() on every fault. One indirect call per page
/// fault (~3-5 ns overhead on a ~1000-2000 ns operation = <0.3%).
pub static VMM_POLICY: RcuCell<&'static dyn VmmPolicy> =
    RcuCell::new(&DefaultVmmPolicy);

4.8.8.3 Performance Budget — VMM Policy Indirection Cost¶

Path	Path class	VmmPolicy called?	Cost
PTE install (`install_pte`)	Hot (per fault)	No — fixed Nucleus code, direct call	0 ns (no indirection)
Anonymous fault strategy	Warm (per fault, before PTE)	Yes: `handle_anon_fault()`	+3-5 ns (<0.3%)
COW fault strategy	Warm (per COW fault)	Yes: `handle_cow_fault()`	+3-5 ns (<0.3%)
File fault strategy	Warm (per file fault)	Yes: `handle_file_fault()`	+3-5 ns (<0.05%)
THP promotion check	Warm (per huge-aligned fault)	Yes: `should_promote_thp()`	+3-5 ns
TLB flush	Warm (per unmap operation)	Yes: `tlb_flush_strategy()`	+3-5 ns
Context switch (PCID miss)	Warm (~1 in 4096 switches)	Yes: `pcid_evict_candidate()`	+5-10 ns (cold)
Readahead sizing	Warm (per file cache miss)	Yes: `readahead_window()`	+3-5 ns

Key invariant: The PTE installation hot path is fixed Nucleus code with no trait dispatch. VmmPolicy is called on warm paths only (strategy decisions around the fault, eviction, readahead). Page faults are inherently expensive (1000-2000 ns minimum for anonymous, longer for file-backed). The 3-5 ns overhead of one indirect call on the warm decision path is unmeasurable in any workload benchmark. This design satisfies the replaceability rule: replaceable policy is never on the hot path.

4.8.8.4 vmalloc VA Space Defragmentation¶

The kernel vmalloc region provides virtually-contiguous mappings backed by potentially non-contiguous physical pages. Over multi-decade uptime, repeated vmalloc/vfree cycles fragment the VA space into small holes, eventually causing large vmalloc requests to fail despite ample physical memory.

/// vmalloc virtual address space manager. Manages the kernel's vmalloc
/// region used for vmalloc(), vmap(), and ioremap() allocations.
///
/// **50-year problem**: Each vmalloc allocation reserves a VA range. Freed
/// ranges become holes. Over decades of driver load/unload and module churn,
/// the VA space fragments. Eventually, large contiguous VA requests fail.
///
/// **Solution**: Two-level management with background VA compaction.
pub struct VmallocManager {
    /// VA range → VmallocArea mapping. Integer-keyed by VA start address
    /// (per [Section 3.13](03-concurrency.md#collection-usage-policy):
    /// XArray is mandatory for integer-keyed lookups). RCU-protected for
    /// lock-free VA lookup on the page fault path (vmalloc page faults)
    /// and kfree_rcu path.
    pub areas: XArray<VmallocArea>,
    /// Total VA space available for vmalloc (architecture-specific).
    /// x86-64: 32 TB. AArch64: 16 TB. RISC-V: varies by Sv mode.
    pub total_va: u64,
    /// Currently used VA space (sum of all live allocations).
    pub used_va: AtomicU64,
    /// Largest contiguous free VA range. Updated on alloc/free.
    /// Used for fast-path rejection of impossible allocation requests
    /// before scanning the XArray.
    pub largest_free: AtomicU64,
}

/// A single vmalloc allocation's metadata.
pub struct VmallocArea {
    /// Start of the virtual address range.
    pub va_start: usize,
    /// Size of the allocation in bytes.
    pub size: usize,
    /// Physical page descriptors backing this VA range.
    /// Lazy-populated: individual pages mapped on first access.
    /// Length = `size / PAGE_SIZE`. Heap-allocated because vmalloc areas
    /// range from one page to hundreds of MB — no fixed upper bound exists.
    pub pages: Box<[Option<PageRef>]>,
    /// Allocation flags (GFP-like).
    pub flags: VmallocFlags,
    /// Caller return address (for debugging / leak detection).
    pub caller: usize,
}

Related subsystems: - Live kernel evolution: Section 13.18 - I/O scheduler state-spill pattern: Section 16.21 - Qdisc state-spill pattern: Section 16.21 - PCID/ASID management: Section 4.9 - THP promotion: Section 4.7 - ML-guided VMM policy: Section 23.1 - Formal verification targets: Section 24.4 - Collection usage policy (XArray for integer keys): Section 3.13

4.9 PCID / ASID Management¶

Process Context IDentifiers tag TLB entries with a process identifier, avoiding TLB flushes on context switches. The mechanism is architecture-specific but the policy is shared:

Architecture	Mechanism	ID Width	Usable IDs	Register
x86-64	PCID	12 bits	4096	CR3[11:0]
AArch64	ASID	8 or 16 bits	256 or 65536	TTBR0_EL1[63:48] or TCR_EL1.AS=1 for 16-bit
ARMv7	ASID (CONTEXTIDR)	8 bits	256	CONTEXTIDR[7:0]
RISC-V	ASID	0-16 bits (WARL)	0-65536	satp[59:44] (same position for Sv39, Sv48, Sv57; width is implementation-defined, discovered at boot; RISC-V satp.ASID is WARL — implementations may support 0 bits (no ASIDs, full TLB flush on context switch), though most provide 9-16 bits)
PPC32	MMU PID	8 bits (e500v2); up to 14-bit TID on e6500	256 (e500v2)	PID SPR (via `mtspr`) — the MMU PID register, not the OS process ID. UmkaOS targets e500v2 (QEMU ppce500) with 8-bit PID
PPC64LE	PID (Radix) / LPIDR (HPT)	20 bits (Radix) / 12 bits (HPT)	~1M (Radix) / 4096 (HPT)	PIDR SPR (POWER9+) / LPIDR (POWER8)
s390x	ASCE (Address Space Control Element)	N/A (implicit)	N/A	CR1/CR7/CR13 (primary/secondary/home ASCE)
LoongArch64	ASID	implementation-defined (ASIDBITS at CSR.ASID[20:16]; 10 on current hardware)	up to 1024 (current)	CSR.ASID[9:0]

Common policy across all architectures: - LRU-based ID allocation: least-recently-used ID is evicted when all slots are full. x86 has 4096 slots; UmkaOS uses the full space with LRU eviction. The PCID/ASID eviction policy is controlled by VmmPolicy::pcid_evict_candidate() (Section 4.8), which is a replaceable policy method. The default policy uses LRU eviction. - TLB flush avoidance: context switch to a process that still has a valid ID requires zero TLB flushes. - Isolation domain switches require no ID change (same address space, different permissions — MPK/POE/DACR operate independently of address space IDs; page-table isolation on RISC-V uses separate page-table mappings within the same address space).

Architecture-specific notes: - AArch64: 16-bit ASID (TCR_EL1.AS=1) requires ID_AA64MMFR0_EL1.ASIDBits == 0b0010 (available on most ARMv8.2+ cores; some ARMv8.0 cores only support 8-bit ASID). UmkaOS discovers ASID width at boot and falls back to 8-bit with more frequent TLB flushes. 16-bit ASID gives 65536 IDs — effectively unlimited. 8-bit ASID (256 IDs) requires more aggressive LRU eviction. TLB maintenance uses TLBI ASIDE1IS for targeted invalidation by ASID. TCR_EL1.A1 selects whether TTBR0_EL1 or TTBR1_EL1 provides the ASID value. UmkaOS uses TTBR0_EL1 (TCR_EL1.A1=0, the reset default). This is set during early boot and never changed. - ARMv7: Only 8-bit ASID via CONTEXTIDR, limiting to 256 concurrent address spaces. TLB maintenance uses MCR p15, 0, Rd, c8, c7, 2 (TLBIASID). The DACR domain mechanism (Section 11.2) is orthogonal to ASID — domain switches never invalidate TLB. - RISC-V: ASID occupies bits [63:60]=mode, [59:44]=ASID in the satp CSR (same position for Sv39, Sv48, and Sv57). The ASID width is implementation-defined and discovered at boot (by writing all-ones to satp.ASID and reading back). SiFive U74: 9-bit (512 IDs). Other implementations may support up to 16-bit. TLB flush via sfence.vma with ASID argument for targeted invalidation.

RISC-V ASID availability: The RISC-V specification allows implementations with ASID width of 0 (the satp.ASID field is WARL — writes of non-zero values may be ignored). UmkaOS detects the available ASID width at boot by writing all-ones to satp.ASID and reading back the value. If ASID width is 0, UmkaOS falls back to full TLB flush on every context switch (issuing sfence.vma with rs1=x0, rs2=x0 after every satp write performs a full TLB flush). This is functionally correct but incurs higher context-switch overhead on ASID-less implementations. The ASID width is stored in a boot-time constant (RISCV_ASID_BITS: u32) and checked by the TLB management code to select between ASID-tagged invalidation (sfence.vma with ASID) and global invalidation (sfence.vma with rs1=x0, rs2=x0). - PPC32: 8-bit PID via the PID SPR, supporting 256 concurrent address spaces. TLB management uses tlbie (invalidate by effective address) or tlbia (invalidate all). The 16 segment registers provide an orthogonal isolation mechanism independent of PID. - PPC64LE: On POWER9+ with Radix MMU, 20-bit PID via the PIDR SPR supports up to ~1M concurrent address spaces — effectively unlimited. On POWER8 with HPT, the 12-bit LPIDR provides 4096 logical partition IDs. TLB management uses tlbie with targeting by PID (Radix) or tlbie with LPID (HPT). - s390x: s390x does not use an explicit ASID register. Address space identity is determined by the Address Space Control Element (ASCE) loaded in control registers CR1 (primary), CR7 (secondary), or CR13 (home address space). Switching address spaces changes the ASCE, which implicitly invalidates the TLB context for the old space. TLB invalidation uses the PTLB (Purge TLB) instruction for local invalidation; cross-CPU invalidation is triggered via SIGP (Signal Processor) orders. There is no lightweight per-entry TLB invalidation — PTLB purges all TLB entries on the local CPU, and CSP/CSPG (Compare and Swap and Purge) can be used for atomic page table updates with TLB coherence. ASCE-reuse optimization: When switching back to a previously-used address space (e.g., task A → task B → task A), UmkaOS checks if the ASCE value being loaded matches the current CR1 value. If they match, the LCTL (Load Control) instruction is skipped entirely — the TLB entries from the previous context are still valid (s390x TLB entries are tagged by ASCE origin, not by an explicit ASID). This is equivalent to PCID/ASID reuse on other architectures. The optimization applies to switch_mm() when the previous and next tasks share the same mm_struct (kernel threads borrowing process mm, or context switches within the same thread group). - LoongArch64: Implementation-defined ASID width via the CSR.ASID register. The ASID occupies CSR.ASID[9:0]; the implemented width is reported by the read-only ASIDBITS field at CSR.ASID[20:16]. Current implementations (3A5000, 3A6000) set ASIDBITS=10, supporting 1024 concurrent address spaces. UmkaOS reads ASIDBITS at boot and sizes the LRU pool accordingly. Unlike RISC-V's write-all-ones-and-read-back probing, LoongArch provides a dedicated read-only field for width discovery. TLB invalidation uses the INVTLB instruction, which supports multiple invalidation modes: by ASID, by virtual address, by ASID + VA, or global invalidation. INVTLB is local-only — cross-CPU TLB shootdown requires explicit IPI signaling followed by INVTLB on each target CPU. With 1024 IDs, LRU eviction frequency is moderate — significantly better than 8-bit architectures (256) but well below AArch64 16-bit (65536) or PPC64LE Radix (~1M).

4.9.1 Lazy TLB Mode for Kernel Threads¶

When the scheduler context-switches to a kernel thread (kworker, ksoftirqd, RCU callback thread) that has no user-space address space (mm == NULL), a TLB flush is wasteful — the kernel thread will never access user-space addresses, so the previous process's TLB entries can remain loaded harmlessly. Flushing them only to reload them on the next switch back to a user-space process wastes hundreds of cycles (full TLB invalidation: ~200-1000 cycles depending on TLB size and architecture).

UmkaOS implements lazy TLB mode (matching Linux's approach):

Enter lazy mode: When switching to a kernel thread, the scheduler does not write a new value to CR3 (x86), TTBR0_EL1 (AArch64), or satp (RISC-V). The previous process's page tables remain loaded, and its PCID/ASID remains active. The kernel thread runs exclusively in the kernel half of the address space (TTBR1 on AArch64, high-half on x86), which is shared across all processes.
Exit lazy mode: When switching from a kernel thread back to a user-space process, the scheduler checks whether the target process's PCID/ASID matches what is currently loaded. If yes: zero TLB flushes (the entries are still valid). If no: normal PCID/ASID switch with targeted invalidation.
TLB shootdown during lazy mode: If another CPU sends a TLB shootdown IPI for the user-space address range currently loaded in lazy mode, the lazy CPU must either process the shootdown (flush the stale entries) or mark its lazy state as "needs flush on exit." UmkaOS uses the deferred approach: the lazy CPU sets a per-CPU tlb_flush_pending flag and skips the actual flush. When the CPU exits lazy mode (switches to a user process), it checks the flag and performs a full TLB flush if set, then clears tlb_flush_pending (Section 4.8). This avoids IPIs waking idle CPUs unnecessarily.

Impact: On syscall-heavy workloads with many kernel threads (web servers, database engines), lazy TLB eliminates ~30-50% of TLB flushes. The benefit is proportional to the ratio of kernel-thread context switches to total context switches.

4.9.2 Observability¶

Per-CPU PCID/ASID statistics are exposed via /proc/vmstat for performance monitoring:

/// Per-CPU PCID/ASID allocation statistics. The owning CPU writes
/// with Relaxed ordering (single-writer — no contention). Cross-CPU
/// aggregation for /proc/vmstat reads with Relaxed ordering (approximate
/// is acceptable for statistics). AtomicU64 is required for Rust soundness:
/// cross-CPU reads without a lock need atomic access. On all architectures,
/// Relaxed atomics generate plain load/store instructions — zero overhead
/// vs non-atomic u64.
pub struct PcidStats {
    /// Total PCID/ASID allocations (fresh + recycled).
    pub allocations: AtomicU64,
    /// Total evictions (LRU-evicted to make room for a new process).
    pub evictions: AtomicU64,
    /// Context switches that hit a cached PCID/ASID (zero-cost TLB reuse).
    pub tlb_flush_avoided: AtomicU64,
    /// Context switches that required a full TLB flush (no cached ID).
    pub tlb_flush_full: AtomicU64,
}

/// **Architecture-specific counter semantics**:
///
/// - **s390x**: Address space identity is implicit via ASCE, so there are
///   no explicit ASID allocations or evictions. `allocations` and `evictions`
///   are always zero. The context switch path loads a new ASCE via `lctlg`
///   without calling the PCID/ASID allocator (no ASID concept on s390x).
///   Counters remain at zero because the allocator is never invoked.
///   `tlb_flush_avoided` tracks ASCE-reuse cases (same ASCE
///   still loaded after context switch). `tlb_flush_full` tracks `PTLB`
///   operations (full local TLB purge).
///
/// - **RISC-V with ASID width 0**: Implementations that support zero ASID
///   bits (`satp.ASID` is WARL and reads back as 0) require a full TLB
///   flush on every context switch. `allocations` and `evictions` are always
///   zero; `tlb_flush_full` equals the total number of context switches;
///   `tlb_flush_avoided` is always zero. The context switch path writes
///   `satp` with ASID=0 and issues `sfence.vma` (full TLB flush).
///   The PCID allocator's `allocate()` early-returns with
///   `PcidResult::FullFlush`, incrementing only `tlb_flush_full`.

4.10 Memory Tagging (Hardware-Assisted)¶

Memory tagging detects use-after-free and buffer overflow bugs by associating metadata tags with memory allocations and checking them on every access. UmkaOS supports hardware-accelerated tagging on ARM (MTE) and Intel (LAM), with a software shadow-memory fallback (KASAN-equivalent) for architectures without hardware support.

Specification is split across two focused sections:

ARM MTE subsystem (730 lines — full production spec): Section 10.4. Covers FEAT_MTE/MTE2/MTE3 feature levels, three tag-check-fault modes (sync/async/asymmetric), MteTaskConfig per-thread state, allocator integration (tag generation via IRG, adjacent-object tag separation, deallocation poisoning), context switch register save/restore, kernel-entry async fault delivery (SEGV_MTEAERR/SEGV_MTESERR), userspace interface (mmap(PROT_MTE), prctl(PR_SET_TAGGED_ADDR_CTRL)), ptrace tag access, core dump support (PT_AARCH64_MEMTAG_MTE), per-tier default modes, hardware availability table, and Linux ABI compatibility.
Hardware memory safety integration (608 lines — cross-platform design): Section 2.23. Covers MTE/LAM integration with the slab and buddy allocators, Intel LAM (LAM_U48/LAM_U57) with LASS security requirement, CHERI future-proofing, and the tag-aware allocator interface shared across all mechanisms.

Software fallback (KASAN-equivalent): On architectures without MTE or LAM (x86-64 without LAM, RISC-V, PPC, s390x, LoongArch64), debug builds use software shadow memory tagging. The shadow memory maps each 8-byte aligned region to a 1-byte shadow value tracking allocation state (allocated, freed, red zone). This provides the same bug detection as MTE but with ~2-3x runtime overhead, making it suitable for development and CI only.

4.11 NUMA Topology and Policy¶

Modern servers are NUMA (Non-Uniform Memory Access): memory access latency depends on which CPU socket is requesting and which physical memory bank is being accessed. A 2-socket server has ~80ns local DRAM access and ~150ns remote access (via QPI/UPI interconnect). At 4+ sockets or with CXL-attached memory (Section 5.1), the penalty grows further. The kernel must be NUMA-aware at every level — allocation, placement, scheduling, and rebalancing.

4.11.1 Topology Discovery¶

At boot, UmkaOS parses platform firmware tables to build a NUMA distance matrix:

Architecture	Firmware Source	Tables Parsed
x86-64	ACPI	SRAT (System Resource Affinity), SLIT (System Locality Information), HMAT (Heterogeneous Memory Attributes)
AArch64	ACPI or Device Tree	SRAT/SLIT (ACPI servers), `numa-node-id` property (DT-based SoCs)
ARMv7	Device Tree	`numa-node-id` property (rare; most ARMv7 is UMA)
RISC-V 64	Device Tree	`numa-node-id` property per memory and CPU node
PPC32	Device Tree	`numa-node-id` property (rare; most PPC32 is UMA)
PPC64LE	Device Tree	`ibm,associativity` property per CPU and memory node (PAPR NUMA)

Topology Source Precedence

When multiple firmware sources describe topology, UmkaOS applies the following precedence order (highest to lowest authority):

ACPI SRAT (System Resource Affinity Table): the authoritative source for node-to-physical-address mapping and CPU-to-node affinity. UmkaOS treats SRAT as ground truth for which physical address ranges belong to which NUMA node.
ACPI HMAT (Heterogeneous Memory Attribute Table, if present): provides precise read/write bandwidth and access latency in picoseconds for each initiator/target pair. HMAT takes precedence over SLIT for performance attributes when both are present, because it is more precise.
ACPI SLIT (System Locality Information Table): provides the inter-node distance matrix used for scheduling and migration cost estimates. SLIT distances are normalized per ACPI specification 6.5 §5.2.17: local distance = 10, remote distances are proportional (cross-socket is typically 20-30). SLIT is used where HMAT is absent or does not cover a particular node pair.
Device Tree (/cpu-map and numa-node-id properties): used on systems without ACPI — bare-metal RISC-V, PPC32, ARMv7, and some AArch64 SoCs. On ARM systems that have both ACPI and a device tree, ACPI HMAT is preferred over PPTT for performance attributes.
Single-node fallback: if none of the above sources are present, UmkaOS treats the system as a single NUMA node (distance matrix is a 1×1 matrix with value 10).

Override order: HMAT > SLIT > DTB numa-node-id > default (single-node).

The result is a directed distance matrix where distance[i][j] represents the relative access cost from node i to node j. The matrix may be asymmetric: the ACPI SLIT specification (section 5.2.17) explicitly allows different distances in each direction, and this occurs on real hardware with CXL-attached memory where CPU-to-CXL and CXL-to-CPU latencies differ due to different interconnect paths. Linux (drivers/acpi/numa/srat.c slit_valid()) accepts asymmetric SLIT tables without modification. Callers must use directed distance: distance(source_node, target_node) where source is where data currently resides and target is where it will be accessed from. Local access is always distance 10 (by ACPI convention). Cross-socket is typically 20-30. CXL-attached memory tiers (Section 5.1) appear as higher-distance NUMA nodes.

/// NUMA topology, populated at boot from SRAT/SLIT or device tree.
///
/// All arrays are dynamically sized at boot based on the number of NUMA nodes
/// discovered from firmware tables. No compile-time `MAX_NUMA_NODES` constant —
/// the kernel adapts to the hardware, following the same pattern as `PerCpu<T>`.
/// Allocation uses the boot allocator ([Section 4.1](#boot-allocator)) during early init.
/// If the boot allocator cannot satisfy the allocation (insufficient physical
/// memory for the topology structures), the kernel panics — this is fatal
/// because NUMA topology is required for all subsequent allocations. The
/// `&'static` slices are never freed (kernel-lifetime data).
///
/// On a system with CXL-attached memory (Section 5.1), each CXL memory device
/// appears as an additional NUMA node. A 4-socket server with 8 CXL memory pools
/// may have 12+ NUMA nodes. The dynamic sizing handles this without recompilation.
///
/// When the distributed kernel is active (Section 5.2.9), the topology graph
/// incorporates these NUMA distances as local edges for end-to-end cost
/// computation across peers. Local subsystems (allocator, scheduler) use
/// `NumaTopology::distance()` directly — the topology graph is for cross-peer only.
pub struct NumaTopology {
    /// Number of NUMA nodes discovered.
    pub nr_nodes: usize,
    /// Distance matrix: distance[i * nr_nodes + j] = relative access cost from
    /// node i to node j. distance[i * nr_nodes + i] = 10 (local). Higher = slower.
    /// Allocated as a flat array of size nr_nodes * nr_nodes from the boot allocator.
    ///
    /// Uses `u16` to accommodate both SLIT (0-255 range) and HMAT (0-65535 range)
    /// representations. ACPI HMAT (Heterogeneous Memory Attribute Table) uses
    /// values 0-65535 for memory latency/bandwidth attributes, which exceeds the
    /// u8 range of SLIT. When both SLIT and HMAT are present, SLIT-sourced
    /// distances are scaled: `slit_val * (hmat_local_latency_ns / SLIT_LOCAL_DISTANCE)`
    /// where `hmat_local_latency_ns` is the HMAT read latency for the diagonal
    /// entry (same-node access) **in nanoseconds** (post-normalization, matching
    /// Linux `drivers/acpi/numa/hmat.c` `hmat_normalize()` which converts raw
    /// HMAT picosecond entries to nanoseconds), and `SLIT_LOCAL_DISTANCE = 10`
    /// is the SLIT local reference value. This normalization factor
    /// (`HMAT_NORM_FACTOR`) is computed once during boot and stored as a global
    /// `u16`. With typical values (80 ns local DRAM, SLIT local = 10),
    /// `HMAT_NORM_FACTOR = 8`, and max distance = `255 * 8 = 2040` which fits
    /// `u16`. A `debug_assert!(HMAT_NORM_FACTOR <= 255)` guard at boot catches
    /// pathological firmware reporting anomalously high local latency values.
    /// When only SLIT is available, the factor is 1 (identity). HMAT is Phase 3.
    pub distance: &'static [u16],
    /// Per-node memory ranges (physical address start, length).
    /// Allocated as an array of nr_nodes entries from the boot allocator.
    /// Each entry holds up to 8 memory ranges (ArrayVec is stack-allocated).
    /// 8 is sufficient after SRAT range coalescing (adjacent physical ranges
    /// on the same node are merged). Systems with MMIO holes, CXL devices,
    /// and interleaved DIMMs have at most 4-6 non-contiguous ranges per node
    /// after coalescing. Overflow panics at boot with a clear message.
    pub node_mem: &'static [ArrayVec<MemRange, 8>],
    /// Per-node CPU sets.
    pub node_cpus: &'static [CpuMask],
    /// Per-node type classification (CPU memory, accelerator VRAM, CXL expander, etc.).
    /// Allocated as an array of nr_nodes entries from the boot allocator.
    ///
    /// `NumaNodeType` is defined canonically in
    /// [Section 22.4](22-accelerators.md#accelerator-memory-and-p2p-dma)
    /// because accelerator and CXL memory node types are the primary motivation.
    /// GPU VRAM and other accelerator-local memory exposed as NUMA nodes use
    /// `NumaNodeType::AcceleratorMemory` — such memory is managed by the device
    /// driver (via the AccelBase KABI vtable), not the kernel buddy allocator.
    pub node_type: &'static [NumaNodeType],
}

impl NumaTopology {
    /// Look up the distance between two NUMA nodes.
    /// Returns `u16::MAX` if either node ID is out of range (defensive against
    /// user-supplied values via set_mempolicy or mbind). Callers treat
    /// `u16::MAX` as "unreachable" — the migration cost check will never
    /// migrate to an unreachable node.
    pub fn distance(&self, from: NumaNodeId, to: NumaNodeId) -> u16 {
        let f = from.0 as usize;
        let t = to.0 as usize;
        if f >= self.nr_nodes || t >= self.nr_nodes {
            return u16::MAX;
        }
        self.distance[f * self.nr_nodes + t]
    }
}

4.11.2 Memory Allocation Policy¶

Per-process and per-VMA memory policies control which NUMA nodes are used for page allocation. These are set via the set_mempolicy(2) and mbind(2) syscalls (Linux-compatible):

Policy	Behavior	Typical Use
`MPOL_DEFAULT`	Allocate on the faulting CPU's local node	General-purpose (default)
`MPOL_BIND`	Restrict allocation to specified node set; OOM if all are full	Database buffer pools, pinned workloads
`MPOL_INTERLEAVE`	Round-robin page allocation across specified nodes	Hash tables, large shared mappings
`MPOL_PREFERRED`	Try specified node first, fall back to others if full	Soft affinity
`MPOL_LOCAL`	Always the local node (explicit, not inherited)	Latency-sensitive paths

MPOL_INTERLEAVE distributes pages across nodes at page granularity (4KB or 2MB for huge pages), amortizing bandwidth across all memory controllers. This is optimal for large data structures accessed uniformly (e.g., hash maps, columnar stores).

4.11.3 Automatic NUMA Balancing¶

UmkaOS implements automatic NUMA balancing (same approach as Linux's numa_balancing):

Disable option: Automatic NUMA balancing can be disabled at boot (umka.numa_balancing=0) or at runtime (/proc/sys/kernel/numa_balancing=0). This is appropriate for workloads that are already properly pinned via cpuset cgroups or numactl --membind, where all memory is allocated on the correct NUMA node from the start. On these workloads, the NUMA scanner's periodic page table scanning (clearing present bits to induce NUMA faults) adds measurable overhead: - Each scanned page incurs a minor page fault (~1-5μs including TLB shootdown) - The scanner runs every 1-30 seconds (adaptive), touching thousands of PTEs per scan - For hard real-time workloads on isolcpus cores, NUMA-fault-induced jitter is unacceptable — disable NUMA balancing and pin memory explicitly

Default: enabled (matches Linux). Auto-disabled on single-node systems (no benefit).

Scan: A periodic scanner walks process page tables and clears the present bit on a fraction of pages (making them trigger faults on next access). Scan rate is adaptive: faster for processes with high cross-node access, slower for well-placed processes.
Trap: When a task accesses a not-present page, the NUMA fault handler records which CPU (and thus which NUMA node) caused the fault.
Decide: A cost-benefit analysis compares the expected savings from reduced remote access latency against the migration cost. Migration cost depends on scope: local NUMA migration (same socket, memcpy + TLB shootdown): ~200-500 ns per 4KB page; cross-socket migration (cache-coherent interconnect latency included): ~1-10 μs per page; RDMA-based DSM migration (cross-node, see Section 6.2): ~2-50 μs depending on distance. The page_migration_cost_ns() function (below) returns the empirically calibrated value for the specific source/destination pair. Migration proceeds only if the net benefit is positive over a configurable window (default: 10 accesses saved per migration cost).
Migrate: The page is migrated to the accessing CPU's node. During migration the PTE is updated atomically under the page table lock — the application sees no inconsistency.

Scan rate adaptation algorithm

The scanner samples pages by setting PTEs to PROT_NONE (clearing the present bit), causing NUMA faults on the next access. The fault handler records which node accessed each page, providing the cross-node access ratio used to adapt the scan rate.

Each process maintains a numa_scan_interval field (in milliseconds), initialized to the base rate. The NUMA fault handler updates this field after each scan window completes:

Base rate: 1 scan per 1000 ms per process VMA.
Speed up: If cross-node access ratio > 20% in the last scan window, double the scan rate (halve the interval). Minimum interval: 100 ms.
Slow down: If cross-node access ratio < 5% for 3 consecutive scan windows, halve the scan rate (double the interval). Maximum interval: 5000 ms.
Caps: min_scan_interval = 100 ms, max_scan_interval = 5000 ms per VMA.

/// Per-process NUMA scan state, stored in the task's memory descriptor.
///
/// **Synchronization**: `update_scan_interval()` is called from the NUMA
/// scan task_work, which runs single-threaded per mm (not from individual
/// fault handlers). The NUMA fault handler only increments per-mm atomic
/// fault counters (`numa_faults_local: AtomicU32`, `numa_faults_remote:
/// AtomicU32` — defined in the `MmStruct` alongside this state). The scan
/// period update runs as task_work at the end of each scan window,
/// single-threaded per mm, under `mmap_lock` read mode. Plain `u32` fields
/// are safe because only one scan task_work instance runs per mm at a time
/// (the scan kthread's per-mm work item is not re-enqueued until the
/// previous one completes).
pub struct NumaScanState {
    /// Current scan interval in milliseconds. Starts at 1000 ms.
    /// Range: [MIN_SCAN_INTERVAL_MS, MAX_SCAN_INTERVAL_MS].
    pub numa_scan_interval_ms: u32,
    /// Consecutive scan windows where cross-node ratio was below 5%.
    /// Reset to 0 whenever a window exceeds the 5% threshold.
    pub low_cross_node_streak: u32,
}

const BASE_SCAN_INTERVAL_MS: u32 = 1000;
const MIN_SCAN_INTERVAL_MS: u32 = 100;
const MAX_SCAN_INTERVAL_MS: u32 = 5000;
const CROSS_NODE_HIGH_THRESHOLD_PCT: u32 = 20;
const CROSS_NODE_LOW_THRESHOLD_PCT: u32 = 5;
const LOW_STREAK_REQUIRED: u32 = 3;

/// Called by the NUMA scan task_work after completing a scan window for a process.
/// NOT called from individual NUMA fault handlers — those only increment atomic
/// per-mm fault counters. The scan task_work reads the atomic counters, resets
/// them (via `fetch_and(0, AcqRel)` — atomic reset), and calls this function to
/// update the scan interval.
/// `cross_node_faults` and `total_faults` are counts from the just-completed window.
///
/// **Raciness note**: The counter read-then-reset in the scan task_work is
/// intentionally racy. Faults arriving between `fetch_and(0, AcqRel)` and the
/// function call are lost for this window but counted in the next. The two
/// counters (local, remote) are read separately — the sum may be slightly
/// inconsistent. Both effects produce at most one window of suboptimal scan
/// frequency, bounded by [`MIN_SCAN_INTERVAL_MS`, `MAX_SCAN_INTERVAL_MS`].
/// This matches Linux's `task_numa_work()` approach in `kernel/sched/fair.c`.
pub fn update_scan_interval(state: &mut NumaScanState, cross_node_faults: u32, total_faults: u32) {
    if total_faults == 0 {
        return; // No data — keep current interval.
    }
    let ratio_pct = (cross_node_faults * 100) / total_faults;

    // Thresholds are exclusive: exactly 5% or 20% falls into the "keep current" band.
    if ratio_pct > CROSS_NODE_HIGH_THRESHOLD_PCT {
        // High cross-node access: scan faster (halve interval).
        state.numa_scan_interval_ms = (state.numa_scan_interval_ms / 2)
            .max(MIN_SCAN_INTERVAL_MS);
        state.low_cross_node_streak = 0;
    } else if ratio_pct < CROSS_NODE_LOW_THRESHOLD_PCT {
        state.low_cross_node_streak += 1;
        if state.low_cross_node_streak >= LOW_STREAK_REQUIRED {
            // Sustained low cross-node access: scan slower (double interval).
            state.numa_scan_interval_ms = (state.numa_scan_interval_ms * 2)
                .min(MAX_SCAN_INTERVAL_MS);
            state.low_cross_node_streak = 0;
        }
    } else {
        // Between 5% and 20%: keep current rate, reset low streak.
        state.low_cross_node_streak = 0;
    }
}

/// Estimated cost in nanoseconds to migrate a page of the given order to a
/// remote NUMA node. Accounts for TLB shootdown, page copy overhead, and
/// PTE invalidation on all CPUs mapping the page.
///
/// Measured empirically at boot time via a calibration micro-benchmark:
/// allocate a page on the local node and a page on each remote node, copy
/// between them using `memcpy`, measure elapsed cycles, repeat 8 times
/// and take the median. Outlier runs (>3× median) are discarded to
/// account for interrupt jitter during calibration. If calibration fails
/// (all runs are outliers or allocation fails), a conservative default
/// of 500 ns per 4 KiB page is used. Stored in the NUMA distance table.
/// Values are order-dependent: a 2 MiB huge page (order 9) costs ~10-20× a
/// 4 KiB base page (order 0) due to higher copy bandwidth × more TLB entries.
///
/// Typical values: 4 KiB page → ~200-500 ns; 2 MiB page → ~3000-8000 ns.
pub fn page_migration_cost_ns(order: u32) -> u64;

/// Expected latency penalty in nanoseconds per remote memory access between
/// two NUMA nodes at the given distance (per `numa_distance()`).
///
/// `distance` is the ACPI SLIT value (10 = local, 11-254 = remote, higher =
/// farther). Per ACPI specification 6.5 §5.2.17, the local (diagonal) entry is
/// always 10; remote distances are proportionally higher (cross-socket typically
/// 20-30). The penalty is converted to nanoseconds using the empirically
/// calibrated base penalty per SLIT unit (~1-5 ns per unit on EPYC/Ice Lake).
/// Returns 0 if `distance` ≤ 10 (same node or same die cache domain).
///
/// Typical values: cross-socket (distance ~30) → ~30-50 ns; cross-NUMA
/// cluster (distance ~40) → ~50-80 ns.
pub fn remote_penalty_ns(distance: u16) -> u64;

/// NUMA balancing decision for a single page.
pub fn should_migrate_page(
    page: &Page,
    accessing_node: NumaNodeId,
    current_node: NumaNodeId,
    access_count: u32,
) -> bool {
    if accessing_node == current_node {
        return false; // Already local.
    }
    let distance = numa_distance(current_node, accessing_node);
    let migration_cost_ns = page_migration_cost_ns(page.order());
    let expected_savings_ns = access_count as u64 * remote_penalty_ns(distance);
    expected_savings_ns > migration_cost_ns * 2 // Conservative: require 2x payoff.
}

4.11.4 NUMA-Aware Kernel Allocations¶

The buddy allocator (Section 4.2) and slab allocator (Section 4.3) are NUMA-aware:

Buddy allocator: Per-NUMA-node free lists. Allocation prefers the local node; cross-node fallback uses the SLIT distance matrix to pick the nearest alternative.
Slab allocator: Per-node partial slab lists. Frequently allocated objects (inodes, dentries, socket buffers, capability entries) are served from node-local slabs, avoiding cross-node cache line bouncing on the hot allocation/free paths.
Per-CPU caches: Already NUMA-local by construction (each CPU's cache draws from its node's buddy allocator). No additional logic needed.

4.11.5 NUMA Balancing and Isolation Domain Memory¶

The NUMA scanner (automatic NUMA balancing above) must respect hardware isolation domain boundaries (Section 11.3, Tier 1 isolation):

Tier 1 driver memory: Pages mapped in a Tier 1 driver's isolation domain are tagged with the domain's protection key (MPK PKEY, ARM POE key, etc.). The NUMA scanner skips pages whose protection key does not match the current process's default domain — it will not clear the present bit on driver-private pages, because the resulting NUMA fault would fire in the wrong protection domain. Tier 1 driver memory is migrated only when the driver explicitly requests it via the driver_request_numa_migration() KABI call (fully specified in Section 11.6, memory_v1 KABI table — see 10-drivers.md for the complete C ABI signature, error codes, atomicity guarantees, and DMA pinning interaction), or when the driver's domain is active on the faulting CPU.
DMA buffers: Pages marked with PG_dma_pinned (allocated via the DMA API, Section 11.4) are unconditionally exempt from NUMA migration. Moving a DMA buffer while a device holds its physical address would cause DMA to the wrong location. The NUMA scanner checks this flag before clearing the present bit and skips pinned pages entirely.
Kernel-internal per-CPU structures: Per-CPU run queues, slab magazines, and PerCpu slots are allocated on their home node at boot and are never candidates for NUMA migration (they have no user-space PTE to scan).

4.11.6 Memory Tier Classification¶

UmkaOS classifies memory sources by semantic type via the TierKind enum. This allows code to reference memory tiers by purpose rather than by ordinal position (which shifts as tiers are added or removed, e.g., when distributed mode introduces remote DRAM or CXL-attached memory — see Section 6.2).

/// Semantic classification of a memory tier.
/// Used by the memory subsystem to identify tier types independent of their
/// ordinal position in the latency hierarchy.
/// Variant order matches the tiering hierarchy (best to worst latency).
/// Performance ranking is defined by `tier_ordinal()`, not variant order,
/// but matching order improves readability.
pub enum TierKind {
    /// Local DRAM on the same NUMA node as the accessing CPU.
    LocalDram,
    /// High Bandwidth Memory (HBM) on the same package as the CPU.
    /// Found on Intel Sapphire Rapids HBM, AMD MI300, and HPC GPUs.
    /// Higher bandwidth but similar or slightly higher latency than DRAM.
    /// Distinct from LocalDram to enable bandwidth-aware placement policies.
    Hbm,
    /// Remote DRAM on a different NUMA node (same physical machine).
    RemoteDram,
    /// CXL Type-3 memory expander (no compute). Latency: 200-500 ns (vs ~80 ns for
    /// local DRAM, ~1 μs for NVMe). Bandwidth: up to 50 GB/s per device. Used for
    /// capacity expansion — cold data that does not fit in DRAM but is too hot for
    /// NVMe. Placed between `LocalDram`/`RemoteDram` and `Compressed` in the tiering
    /// hierarchy. The memory tiering subsystem promotes pages hotter than
    /// `cxl_promote_threshold` to `LocalDram` and demotes pages colder than
    /// `cxl_demote_threshold` from `LocalDram` to `CxlExpander`.
    CxlExpander,
    /// Persistent memory (NVDIMM, Intel Optane DCPMM, CXL PMEM).
    PersistentMem,
    /// CXL Type-2 device-attached memory (compute + memory, e.g., smart NICs,
    /// inference accelerators with attached DRAM). Similar placement to `GpuVram`
    /// but for non-GPU accelerators. Managed via HMM (Heterogeneous Memory
    /// Management) like GPU VRAM — pages migrate between CPU DRAM and device memory
    /// under HMM control.
    CxlDeviceMemory,
    /// GPU VRAM (accessible via BAR or unified memory).
    GpuVram,
    /// Accelerator-attached memory (non-GPU: inference NPUs, FPGAs with local DDR).
    /// Managed as a NUMA node with high latency to CPU cores. HMM migrates pages
    /// between CPU DRAM and accelerator memory.
    /// See [Section 22.1](22-accelerators.md#unified-accelerator-framework--memory-management-and-hmm-integration).
    AcceleratorMemory,
    /// Compressed in-memory pages (zswap/zram tier).
    Compressed,
    /// Local swap (block device backed).
    Swap,
    /// Remote node DRAM via DSM (distributed shared memory, Section 5.6).
    DsmRemote,
}

Tiering hierarchy (best to worst latency/performance, all 11 TierKind variants):

LocalDram / Hbm → RemoteDram → CxlExpander → PersistentMem → CxlDeviceMemory → GpuVram / AcceleratorMemory → Compressed → Swap → DsmRemote

Hbm is co-located with LocalDram (same package) — higher bandwidth, similar latency. RemoteDram follows local DRAM — same technology but cross-socket latency (~150-300 ns). CxlExpander sits between remote DRAM and the compressed tier: slower than DRAM (200-500 ns) but faster than decompression (~1-2 μs) or NVMe swap (~10 μs). PersistentMem (NVDIMM/Optane) has similar latency to CXL but with persistence. CxlDeviceMemory is grouped with device-attached memory (alongside GpuVram and AcceleratorMemory) because its latency and bandwidth are device-specific and managed by the device driver rather than by the general tiering policy. AcceleratorMemory covers non-GPU accelerators (inference NPUs, FPGAs) that have local DDR managed via HMM, analogous to GpuVram for GPUs. DsmRemote is last in the static default — network latency (~5-50 μs RDMA, ~50-200 μs TCP) exceeds all local tiers including NVMe swap (~10-50 μs). Even when RDMA is faster in ideal conditions, DsmRemote depends on network health, coordinator availability, and remote node load, while local swap is always available and deterministic.

Note on HDD-swap systems: This static ordering assumes NVMe or SSD swap devices. On systems with HDD-only swap (>1 ms latency) and RDMA DSM (<50 μs), the runtime tier ordinal mapping should promote DsmRemote above Swap. The ML tiering policy (Section 23.1) adjusts tier ordinals based on measured latencies at runtime.

The memory subsystem maintains a runtime mapping from TierKind to the current ordinal tier number. Code that needs to compare tier performance uses TierKind and queries mem::tier_ordinal(kind) rather than hardcoding numeric values.

TierKind → NumaNodeType conversion: These two enums classify memory from different perspectives: TierKind is the performance tier (latency-ordered), NumaNodeType (Section 22.4) is the hardware topology type. The mapping is not 1:1 because multiple topology types can share a performance tier (e.g., CXL-attached DRAM and remote DRAM may be in the same latency tier). The conversion function:

/// Map a NUMA node's topology type to its memory tier kind, relative to
/// a source node. The `from_node` is needed to distinguish local DRAM
/// (distance <= 10, same socket) from remote DRAM (distance > 10, cross-socket).
///
/// For call sites that do not have a specific source node (e.g., global tier
/// enumeration), use the node itself as both `from_node` and `to_node` —
/// this produces `LocalDram` for CpuMemory nodes, which is the conservative
/// default. Note that the result is approximate in that context.
pub fn numa_node_type_to_tier_kind(
    ntype: &NumaNodeType,
    from_node: NumaNodeId,
    to_node: NumaNodeId,
) -> TierKind {
    match ntype {
        NumaNodeType::CpuMemory => {
            if from_node == to_node || numa_distance(from_node, to_node) <= 10 {
                TierKind::LocalDram
            } else {
                TierKind::RemoteDram
            }
        }
        NumaNodeType::AcceleratorMemory { .. } => TierKind::AcceleratorMemory,
        NumaNodeType::CxlMemory { .. } => TierKind::CxlExpander,
        NumaNodeType::CxlSharedPool { .. } => TierKind::CxlExpander,
    }
}

Note: GPU VRAM is represented as NumaNodeType::AcceleratorMemory with appropriate device_id and bandwidth_gbs — there is no separate GpuVram variant. The TierKind::GpuVram tier kind is used for tiering policy decisions and maps from AcceleratorMemory nodes whose device class is GPU (determined by querying the device registry for the device_id).

The reverse mapping (TierKind → NumaNodeType) is not unique — a single tier kind may correspond to multiple NUMA node types. Code that needs a specific NumaNodeType should query NUMA_TOPOLOGY.node_type[node_id] directly.

4.12 Memory Compression Tier¶

Inspired by: macOS (2013), Windows 10 (2015), Linux zswap/zram. IP status: Clean — academic concept from the 1990s, BSD-licensed algorithms.

4.12.1 Problem¶

When the system is under memory pressure, the page reclaim path must free pages. The options in Section 4.4 are:

ML tuning: Reclaim behavior parameters (reclaim_aggressiveness, prefetch_window_pages, compress_entropy_threshold, numa_migration_threshold, swap_local_ratio) are registered in the Kernel Tunable Parameter Store. The umka-ml-numa and umka-ml-compress Tier 2 services observe page fault and eviction events to tune these parameters per-cgroup at runtime. See Section 23.1 for the complete parameter catalog and observation types.

Evict clean page cache pages (free, but re-read from disk on next access)
Write dirty pages to swap (expensive: NVMe ~10us per 4KB, HDD ~5ms)

There is a third option, cheaper than swap: compress the page in memory. Modern CPUs running LZ4 compress a 4KB page in ~1-2 microseconds. If the page compresses to <2KB (typical for most workloads), you've freed 2KB without any I/O. Decompression on access is ~0.5 microseconds.

This is 5-10x faster than NVMe swap and 1000x faster than HDD swap.

4.12.2 Architecture¶

Insert a compressed tier between the LRU inactive list and swap:

Per-CPU Free Page Pools (hot path)
    |
    v
Per-NUMA Buddy Allocator
    |
    v
Page Cache (RCU radix tree)
    |
    v
LRU Active List --evict--> LRU Inactive List
                                |
                    +-----------+-----------+
                    |                       |
              [compress]              [swap out]
                    |                  (existing)
                    v
            Compressed Pool             Swap Device
            (compress pool in memory)           (NVMe/HDD)
                    |
              [decompress]
                    |
                    v
              Page restored
              to active LRU

4.12.3 Compressed Page Pool¶

The CompressPool uses three support types that are defined here before the main struct:

BootVec<T> — boot-time fixed-capacity vector:

/// A `Vec`-like container backed by the boot allocator. Allocated during early
/// boot before the slab allocator is available. Capacity is fixed at creation
/// and never reallocated — this is critical because `BootVec` is used for
/// structures that must remain stable under memory pressure (CompressPool regions,
/// NUMA topology arrays). After boot completes, `BootVec` contents are
/// typically used read-only (new entries fill pre-allocated slots but the
/// backing allocation never moves).
///
/// Allocated via `boot_alloc(size_of::<T>() * cap)` from the boot allocator
/// (Section 4.1). Panics if the boot allocator cannot satisfy the request
/// (fatal: the kernel cannot proceed without these structures).
pub struct BootVec<T> {
    /// Pointer to the first element. Allocated from the boot allocator.
    /// The allocation is `cap * size_of::<T>()` bytes, aligned to `align_of::<T>()`.
    ptr: *mut T,
    /// Number of initialized elements. Invariant: `len <= cap`.
    len: usize,
    /// Maximum capacity (fixed at creation, never changes).
    cap: usize,
}

PTE-embedded indexing — direct slot lookup via SwapEntry offset:

The CompressPool uses PTE-embedded indexing: the SwapEntry stored in the page table entry directly encodes the compressed slot location. No separate index data structure (hash table, XArray, or tree) is needed. This is the same approach used by macOS/XNU and Linux ZRAM.

SwapEntry encoding for compressed pages:

SwapEntry layout for SWP_COMPRESSED (type = 30):
  Bits [60:56] = SWP_COMPRESSED (30) — identifies this as a compressed page
  Bits [55:50] = NUMA node ID (6 bits, max 64 nodes = NUMA_NODES_STACK_CAP)
  Bits [49:0]  = slot index within the per-NUMA CompressPool (50 bits,
                 max ~1.1 × 10^15 slots — sufficient for petabyte-scale memory)

/// Shift to extract the NUMA node from a compressed-page SwapEntry offset.
const COMPRESS_NODE_SHIFT: u32 = 50;
/// Mask to extract the slot index from a compressed-page SwapEntry offset.
const COMPRESS_SLOT_MASK: u64 = (1u64 << COMPRESS_NODE_SHIFT) - 1;

When a page is compressed: 1. The compressed data is stored in a region slot on the page's NUMA node. 2. The slot location is encoded into the SwapEntry offset field as (numa_node << COMPRESS_NODE_SHIFT) | slot_index. 3. A SwapEntry is constructed: SwapEntry::new(SWP_COMPRESSED, (numa_node as u64) << COMPRESS_NODE_SHIFT | slot_index) where SWP_COMPRESSED = 30 identifies the page as compressed. See Section 4.13 for SwapEntry encoding. 4. The PTE is updated to contain this SwapEntry (same mechanism as swap-to-disk).

On decompression (page fault on a compressed page): 1. Read the SwapEntry from the PTE. 2. Confirm area_id == SWP_COMPRESSED (type 30) — routes to compression handler. 3. Extract NUMA node and slot index from the offset field: let numa_node = (offset >> COMPRESS_NODE_SHIFT) as usize; let slot_idx = offset & COMPRESS_SLOT_MASK; 4. Direct array lookup: compress_pools[numa_node].regions[slot_idx / slots_per_region] .slots[slot_idx % slots_per_region] — O(1), no hashing, no probing. 5. Decompress the data from the slot.

This eliminates: - The entire FixedHashTable (540-720 MB vmalloc pre-allocation on a 256 GB server) - All Robin Hood hashing complexity and its RCU-safety concerns - The runtime max_pool_percent resize restriction (no hash table to constrain)

CompressPoolRegion — backing memory region with size-segregated free lists:

/// Size-segregated free lists within a CompressPool region.
/// O(1) allocation for the 7 standard compressed-page size classes.
/// Each head points to a singly-linked list of free slots of that
/// size class, embedded intrusive (the slot's memory stores the next
/// pointer while free — zero overhead when in use).
pub struct SizeClassFreeList {
    /// Free list heads, one per size class.
    /// Index 0 = 64B, 1 = 128B, 2 = 256B, 3 = 512B,
    /// 4 = 1024B, 5 = 2048B, 6 = 3072B.
    heads: [*mut FreeSlotNode; 7],
    /// Count of free slots per size class (for diagnostics and
    /// compaction threshold checks).
    counts: [u32; 7],
}

/// Standard compressed page size classes. A compressed page is rounded up
/// to the nearest size class. Internal fragmentation is bounded: worst case
/// ~50% for a 65-byte page in a 128-byte slot. Average internal fragmentation
/// at the typical compression ratio (2.5-4.0x) is ~10-15%.
pub const COMPRESS_SIZE_CLASSES: [usize; 7] = [64, 128, 256, 512, 1024, 2048, 3072];

/// Node embedded at the start of each free slot. The node occupies the first
/// 8 bytes of the free slot itself (intrusive — zero overhead when in use).
pub struct FreeSlotNode {
    /// Pointer to the next free slot of the same size class, or null.
    pub next: *mut FreeSlotNode,
}

/// A backing memory region within the CompressPool. Each region is a contiguous
/// allocation from the buddy allocator (order 4-6, 64KB-256KB).
pub struct CompressPoolRegion {
    /// Physical address of the region's backing memory.
    ///
    /// SAFETY: Points to a valid buddy allocation of `size` bytes.
    /// Valid for the region's lifetime. Freed when the region is
    /// returned to the buddy allocator during compaction or pool shrink.
    pub base: PhysAddr,
    /// Region size in bytes (64KB, 128KB, or 256KB).
    pub size: usize,
    /// Per-size-class free lists for O(1) compressed page allocation.
    pub free_lists: SizeClassFreeList,
    /// Total number of slots in this region (derived from size and
    /// slot layout at init time).
    pub total_slots: u32,
    /// Number of currently occupied (live compressed page) slots.
    pub used_slots: AtomicU32,
    /// Fullness classification for compaction policy.
    pub fullness: RegionFullness,
    /// Timestamp of last allocation or deallocation (for compaction
    /// age-based triggering). Nanosecond monotonic clock.
    pub last_access_ns: AtomicU64,
    /// Lock-free Treiber stack for cross-CPU slot returns. When a CPU that
    /// does NOT own this region frees a slot, it pushes to this stack via
    /// CAS (`compare_exchange` on the head pointer). The owning CPU drains
    /// this stack during its next allocation from the region, batching the
    /// returned slots back into the appropriate size-class free lists.
    ///
    /// Ordering: push uses `Release` (publish the new node), drain uses
    /// `Acquire` (observe the complete chain). ABA prevention: generation
    /// counter in the high bits of the `AtomicPtr` (tagged pointer) on
    /// architectures with sufficient VA bits; on 32-bit, a separate
    /// `AtomicU32` generation counter alongside the head pointer.
    pub remote_free_list: AtomicPtr<FreeSlotNode>,
}

CompressPool struct:

// Kernel-internal (umka-core/src/mem/compress_pool.rs)

/// A pool of compressed pages stored in contiguous memory.
///
/// **Concurrency model — per-CPU current region + PTE-embedded direct index:**
///
/// The CompressPool achieves negative overhead vs Linux's zsmalloc by
/// eliminating cross-CPU contention on the hot compress path AND eliminating
/// the index data structure entirely (PTE-embedded indexing):
///
/// - **Hot path (compress)**: Each CPU has a CpuLocal `current_region` pointer.
///   Allocation is per-CPU region + size-segregated free list — zero cross-CPU
///   contention. The compress operation: (1) LZ4-compress into a per-CPU scratch
///   buffer (no lock), (2) allocate a slot from the per-CPU current region's
///   free list (no lock — CpuLocal access under preempt_disable), (3) memcpy
///   compressed data to slot, (4) construct SwapEntry with slot_offset and store
///   in PTE. NO index insertion needed — the PTE IS the index.
///   Compared to Linux's zsmalloc `pool->lock` global spinlock, this eliminates
///   the global contention bottleneck entirely.
///
/// - **Hot path (decompress)**: Read SwapEntry from PTE, extract slot_offset,
///   direct array lookup into region. O(1), zero overhead, no hashing, no probing.
///   Lock-free: the slot data is immutable once written (compressed pages are
///   never modified in place — they are freed and re-compressed on write).
///
/// - **Warm path (deallocation)**: Return the slot to the region. Deallocation
///   may occur on a different CPU than allocation (e.g., the allocating task
///   migrated, or a different thread unmaps the page). If the deallocating CPU
///   owns the region (same `current_region`): return directly to the per-CPU
///   free list (no lock, O(1)). If the deallocating CPU does NOT own the region:
///   push to the region's per-region atomic MPSC return list:
///   `region.remote_free_list: AtomicPtr<FreeSlotNode>` (lock-free Treiber stack).
///   The owning CPU drains the remote free list during its next allocation from
///   that region (batched, amortized). This avoids cross-CPU lock contention
///   on deallocation while preserving the per-CPU allocation fast path.
///
/// - **Cold path (compaction)**: Two-level compaction runs in a dedicated kthread.
///   See CompressPool Compaction Algorithm below. During slot migration, the
///   compaction kthread updates the PTE via `rmap_walk()` to point to the new
///   slot offset — same mechanism used by page migration.
///
/// **Per-NUMA sharding**: One CompressPool per NUMA node. Compressed pages stay on
/// the same NUMA node as their original allocation.
///
/// **No separate index structure**: The entire FixedHashTable (which would have been
/// 540-720 MB of vmalloc on a 256 GB server) is eliminated. The SwapEntry in the PTE
/// directly encodes the slot location. This is the same design as macOS/XNU compressed
/// memory and Linux ZRAM. Total index overhead: 0 bytes.
///
/// Read-only statistics (`stats` fields) use `AtomicU64` and can be read without
/// any lock. The `current_size` field is `AtomicUsize` for lockless reads by the
/// reclaim path throttle (approximate is acceptable there).
pub struct CompressPool {
    /// Per-CPU current region for allocation. Each CPU draws from its own
    /// region, eliminating cross-CPU lock contention on the compress path.
    /// When a CPU's current region is exhausted, the refill sequence is:
    ///   1. Re-enable preemption (drop the PreemptGuard).
    ///   2. Allocate a new region from buddy (GFP_NOIO — cannot recurse into
    ///      I/O which might re-enter zswap).
    ///   3. If allocation fails: return -ENOMEM to the compress path, which
    ///      falls back to writing the page to the disk-backed swap device.
    ///   4. On success: re-disable preemption. Read the CURRENT CPU's
    ///      CpuLocal `current_region` (the task may have migrated during
    ///      the buddy allocation in step 2).
    ///      - If `current_region` is already non-null (another task refilled
    ///        it while we slept, or we migrated to a CPU that already has a
    ///        valid region): push the newly allocated region to the global
    ///        `COMPRESS_SPARE_REGIONS: SpinLock<ArrayVec<*mut CompressPoolRegion,
    ///        SPARE_CAP>>`. The compaction kthread drains the spare list during
    ///        its regular sweep, returning spare regions to the buddy allocator.
    ///      - If `current_region` is null: assign the new region to this
    ///        CPU's CpuLocal slot.
    ///      Retry the slot allocation.
    ///
    ///   **Migration safety**: Without the null check in step 4, a task that
    ///   migrates between steps 1 and 4 would overwrite the target CPU's
    ///   existing region, leaking the old region's buddy pages (64KB-256KB
    ///   per leak). Under sustained memory pressure with frequent migrations,
    ///   this accumulates as a silent memory leak — violating the 50-year
    ///   uptime requirement.
    ///
    /// The preempt drop at step 1 is required because buddy allocation may
    /// sleep (via direct reclaim under GFP_NOIO). Holding preempt_disable
    /// across a sleeping allocation would panic.
    /// Accessed under preempt_disable (CpuLocal pattern).
    current_region: CpuLocal<*mut CompressPoolRegion>,

    /// Backing memory regions (allocated from buddy allocator).
    /// Each region is a contiguous allocation (order 4-6, 64KB-256KB).
    /// Fixed-capacity array sized at init time (max_regions = max_pool_bytes /
    /// min_region_size). The region array never grows after init — new regions
    /// are allocated from the buddy allocator and placed into pre-allocated
    /// slots. This avoids Vec reallocation under memory pressure, which is
    /// the exact scenario where CompressPool is most active.
    regions: BootVec<CompressPoolRegion>,

    /// Statistics.
    stats: CompressPoolStats,

    /// Compression algorithm (compile-time selected).
    /// LZ4 is default: ~1-2us compress, ~0.5us decompress per 4KB.
    algorithm: CompressionAlgorithm,

    /// Maximum pool size (fraction of total RAM, default 50%).
    /// Can be changed at runtime in both directions — no hash table
    /// constraint (hash table eliminated; direct SwapEntry offset indexing).
    /// Reducing max_pool_percent triggers background eviction until
    /// utilization falls below the new threshold.
    max_size: usize,

    /// Current pool size.
    current_size: AtomicUsize,

    /// NUMA-aware: one CompressPool per NUMA node.
    numa_node: NumaNodeId,
}

/// Per-slot header stored at the beginning of each occupied compressed slot.
/// The header is followed immediately by the compressed data. Total slot size
/// is `sizeof(CompressedSlotHeader) + compressed_size`, rounded up to the
/// nearest size class in `COMPRESS_SIZE_CLASSES`.
///
/// With PTE-embedded indexing, no separate index entry is needed — the
/// SwapEntry in the PTE directly encodes the slot's location (area_id +
/// slot_offset). This header provides the metadata needed for decompression
/// and integrity verification.
#[repr(C)]
pub struct CompressedSlotHeader {
    /// Compressed data size in bytes (at the default 1.5x threshold, pages
    /// exceeding ~2730 bytes are rejected; u16, max 65535, accommodates
    /// any compression policy).
    pub compressed_size: u16,
    /// Padding to align checksum to 4-byte boundary.
    pub _pad: u16,
    /// Original page checksum (CRC32C, for integrity verification on
    /// decompression). CRC32C is used because all supported architectures
    /// provide hardware acceleration: x86 SSE4.2 `crc32` instruction,
    /// ARMv8 CRC extension, RISC-V: software implementation (no ratified
    /// hardware CRC extension as of 2026).
    /// CRC32C is NOT a collision-free integrity guarantee. The threat model:
    /// if a hardware bit-flip corrupts the compressed data in RAM, CRC32C detects
    /// it with probability (1 - 2^-32). The birthday paradox analysis is not
    /// relevant here — this is not a hash table collision scenario. Each CRC32C
    /// is checked against its OWN original page data on decompression, not
    /// against other pages. A mismatch (due to bit-flip corruption) triggers
    /// SIGBUS to the owning process, not kernel-wide data loss. For systems
    /// requiring stronger guarantees, ECC RAM eliminates the bit-flip source.
    /// The storage layer provides per-block checksums ([Section 15.2](15-storage.md#block-io-and-volume-management))
    /// for on-disk integrity.
    pub checksum: u32,
}
const_assert!(core::mem::size_of::<CompressedSlotHeader>() == 8);

pub struct CompressPoolStats {
    /// Pages stored in compressed form.
    pub pages_stored: AtomicU64,
    /// Total compressed bytes (sum of compressed sizes).
    pub compressed_bytes: AtomicU64,
    /// Total original bytes (pages_stored * PAGE_SIZE).
    pub original_bytes: AtomicU64,
    /// Compression ratio = original_bytes / compressed_bytes.
    /// Typical: 2.5-4.0x for most workloads.

    /// Pages rejected (incompressible — ratio < 1.5x).
    pub pages_rejected: AtomicU64,
    /// Pages writeback to swap (pool full or evicted from pool).
    pub pages_writeback: AtomicU64,
    /// Decompressions (page faults on compressed pages).
    pub decompressions: AtomicU64,
}

4.12.4 Compression Policy¶

Not every page should be compressed. The policy:

pub struct CompressionPolicy {
    /// Minimum compression ratio to accept a page, encoded as
    /// ratio * 100 (fixed-point to avoid FPU use in kernel).
    /// Default: 150 (meaning 1.5x). A 4KB page must compress to
    /// fewer than PAGE_SIZE * 100 / min_ratio_x100 = 2730 bytes
    /// to be accepted. Pages that compress worse go directly to swap.
    pub min_ratio_x100: u32,

    /// Maximum compress pool size as percentage of total RAM (default 50).
    ///
    /// **Rationale for 50% default**: UmkaOS targets container workloads at 80%+
    /// memory utilization where memory pressure is the NORMAL operating state.
    /// At 50%, the pool can hold 100-150% of RAM worth of data (at 2-3x
    /// compression ratio), meaning swap I/O is rarely needed. The pool does NOT
    /// consume memory when empty — regions are allocated from the buddy allocator
    /// on demand and returned when emptied by compaction. The 50% sets a CEILING,
    /// not a reservation. Idle cost is ~8MB (region slot array only).
    ///
    /// Comparison: Linux zswap defaults to 20% (conservative), Android ZRAM uses
    /// ~50% of RAM. UmkaOS matches the more aggressive Android/ChromeOS approach
    /// because container workloads benefit more from deep compression buffers
    /// (fewer swap I/Os = better p99 latency under pressure).
    ///
    /// **ML-tunable**: Registered as `ParamId::CompressMaxPoolPercent` with range
    /// [10, 80]. The ML policy framework ([Section 23.1](23-ml-policy.md#aiml-policy-framework-closed-loop-kernel-intelligence))
    /// observes compress/decompress latency, swap I/O rates, and memory pressure
    /// signals to adjust per-cgroup. Database workloads (large buffer cache) benefit
    /// from lower values (~25%); container-dense workloads benefit from higher (~60%).
    /// Boot parameter: `umka.compress.max_pool_pct=50` (system-wide default).
    pub max_pool_percent: u32,

    /// When compress pool is full, evict oldest compressed pages to swap.
    /// This is the "writeback" path.
    pub writeback_on_full: bool,

    /// Page types eligible for compression.
    pub compress_anonymous: bool,       // true (default)
    pub compress_file_backed: bool,     // false (default — just evict clean pages)
    pub compress_shmem: bool,           // true (default — tmpfs/shm pages)

    /// Pages with active DMA mappings are never eligible for compression
    /// or swap-out. The page reclaim path checks the DMA pin count
    /// (tracked in the device registry's DeviceResources, Section 11.4)
    /// before attempting reclaim. This field is always false and exists
    /// as a documented invariant — it cannot be overridden.
    pub compress_dma_pinned: bool,      // false (invariant, never set to true)
}

Hugepages are never compressed. Hugepages (2 MB on x86-64/AArch64, 1 GB on x86-64) are allocated specifically for performance-critical workloads (databases, JVMs with large heaps, GPU buffer backing). Compressing them would defeat their purpose and introduce unpredictable latency. The specific rules are:

A page that is part of a hugepage compound (marked PageFlags::HUGEPAGE) is excluded from the compression scanner entirely. The reclaim path skips it without attempting compression.
If a hugepage must be reclaimed under severe memory pressure, it is handled in one of two ways: (a) swapped directly to swap as a compound block (512 contiguous 4 KB swap slots for a 2 MB hugepage), or (b) split into 4 KB base pages first via split_huge_page(), after which the resulting base pages are individually eligible for the normal compression path.
Transparent Huge Pages (THP) that have already been split (by split_huge_page()) lose the PageFlags::HUGEPAGE flag on their constituent pages. Those 4 KB pages are thereafter eligible for compression like any other anonymous page.
Explicitly locked huge pages (mlock()'d or mlockall()'d) are never reclaimed, split, or compressed — they are pinned in physical RAM by the mlock invariant.

Decision flow when the page reclaim path needs to free a page:

Page to reclaim:
  |
  Is it DMA-pinned (active device mapping)?
  |-- Yes -> Skip. Not eligible for reclaim. Done.
  |
  Is it a clean file-backed page?
  |-- Yes -> Simply evict (will re-read from disk). Done.
  |
  Is compression enabled for this page type?
  |-- No -> Write to swap (existing path). Done.
  |
  Attempt LZ4 compression:
  |
  Compressed to < (PAGE_SIZE * 100 / min_ratio_x100)?
  |-- No -> Incompressible. Write to swap. Done.
  |
  Is CompressPool below max_pool_percent?
  |-- No -> Evict oldest compressed pages to swap, then store. Done.
  |
  Store compressed page in CompressPool.
  Free original page frame.
  Done.

4.12.5 Decompression Path¶

When a page fault occurs on a compressed page:

1. Page fault handler reads the SwapEntry from the PTE.
2. Confirm area_id == SWP_COMPRESSED (30).
3. Extract NUMA node and slot index from the offset field:
   numa_node = offset >> COMPRESS_NODE_SHIFT
   slot_idx  = offset & COMPRESS_SLOT_MASK
4. Direct array lookup: compress_pools[numa_node].regions[slot_idx / slots_per_region]
   .slots[slot_idx % slots_per_region] — O(1), no hashing.
5. Allocate a fresh page frame.
6. Decompress from the located slot into the fresh page (LZ4).
7. Free the compressed slot (return to region's size-class free list).
8. Update PTE to point to the fresh page frame.
9. Return from page fault.

Total time: ~1-2us (page allocation + LZ4 decompress + TLB update).
Compare: swap read from NVMe = ~10-15us, from HDD = ~5-10ms.

4.12.6 NUMA Awareness¶

One CompressPool per NUMA node. Compressed pages stay on the same NUMA node as their original allocation. This avoids cross-NUMA decompression latency.

NUMA Node 0:                    NUMA Node 1:
  Buddy Allocator 0               Buddy Allocator 1
  Page Cache 0                    Page Cache 1
  LRU Lists 0                    LRU Lists 1
  CompressPool 0          <-local->     CompressPool 1
  (compresses node-0 pages)      (compresses node-1 pages)

4.12.7 Compression Algorithm Selection¶

#[repr(u32)]
pub enum CompressionAlgorithm {
    /// LZ4: ~1-2us compress, ~0.5us decompress. Best latency.
    /// Default choice.
    Lz4     = 0,
    /// Zstd (level 1): ~3-5us compress, ~1us decompress. Better ratio.
    /// Use when memory pressure is high and slightly more latency is OK.
    Zstd1   = 1,
    /// Zstd (level 3): ~5-10us compress, ~1us decompress. Best ratio.
    /// Use when swap I/O is very expensive (HDD) and CPU is available.
    Zstd3   = 2,
}

Both LZ4 and Zstd are BSD-licensed. We include our own no_std implementations (no external C library dependency in kernel).

4.12.8 Latency Spikes and Fragmentation¶

Transparent memory compression introduces two risks that must be explicitly managed: latency spikes during decompression and fragmentation within the compressed pool.

Decompression latency spikes — when a process accesses a compressed page, the page fault handler must decompress it before the access can proceed. This adds ~1-2μs (LZ4) or ~1-5μs (Zstd) to the page fault latency. While small in absolute terms, this can cause tail latency spikes for latency-sensitive workloads:

Worst case: a process accesses 100 compressed pages in rapid succession (e.g., scanning a large array that was mostly evicted). Each page fault takes ~1-2μs, totaling ~100-200μs of stall time. This is comparable to a single NVMe read but much better than swap (~10-15μs per page from NVMe × 100 pages = ~1-1.5ms).
Mitigation — prefetch on decompression: when decompressing page N, the kernel speculatively decompresses pages N+1 through N+3 (sequential readahead into the compressed pool). This converts 4 serial page faults into 1 fault + 3 prefetches, reducing total latency by ~75% for sequential access patterns.
Mitigation — per-cgroup opt-out: latency-sensitive cgroups (real-time workloads, databases) can disable compression entirely via memory.compress_pool.enabled = 0 in the cgroup controller. Their pages are never compressed — they go directly to swap (or are simply not reclaimed until OOM).
Mitigation — priority-aware compression: pages belonging to high-priority tasks (RT scheduling class, latency-sensitive cgroups) are placed last on the compression candidate list, ensuring they are compressed only under severe memory pressure.

CompressPool fragmentation — the compressed pool stores variable-size compressed pages (a 4KB page might compress to 500 bytes, or 2KB, or 3800 bytes). This creates internal fragmentation:

Size-class allocation within regions: the CompressPool divides each region (allocated as order-4 to order-6 pages from the buddy allocator, i.e., 64KB-256KB contiguous blocks) into variable-size slots. Slots are managed with per-size-class free lists (SizeClassFreeList) for O(1) allocation — see CompressPoolRegion above.
Fragmentation metric: the kernel tracks compressed_bytes / pool_total_size as the pool utilization ratio.

CompressPool Compaction Algorithm — Two-Level

CompressPool compaction uses a two-level approach inspired by Linux's zsmalloc fullness-group classification. Regions are classified by utilization into four groups: EMPTY (0%), LOW (<25%), MED (25-75%), HIGH (>75%), FULL (100%). No second pool allocation is ever required — compaction migrates entries between existing regions without allocating new memory.

Constants:

/// Minor compaction trigger: start when any region drops below this utilization.
pub const COMPRESS_POOL_MINOR_THRESHOLD_PCT: u8 = 25;
/// Major compaction trigger: start when overall pool fragmentation exceeds this.
pub const COMPRESS_POOL_MAJOR_THRESHOLD_PCT: u8 = 85;
/// Major compaction also triggers if the oldest active region has been idle this long.
pub const COMPRESS_POOL_COMPACTION_AGE_MS: u64 = 5_000;
/// Major compaction stops when utilization drops below this target.
pub const COMPRESS_POOL_COMPACTION_TARGET_PCT: u8 = 75;

Region fullness classification (zsmalloc-inspired):

#[repr(u8)]
pub enum RegionFullness {
    Empty = 0,  // 0% utilization — return to buddy immediately
    Low   = 1,  // <25% — evacuation candidate for major compaction
    Med   = 2,  // 25-75% — normal operation
    High  = 3,  // >75% — preferred allocation target
    Full  = 4,  // 100% — no free space, skip for allocation
}

Minor compaction (per-region, warm path):

Triggered: after each deallocation, if the freed region's utilization < MINOR_THRESHOLD.
Action: coalesce adjacent free slots within the region (defragment the
  size-segregated free list). O(slots_in_region). No cross-region data movement.
  Restores allocation locality within the region.
Cost: ~50-200ns per region (in-place free list rebuild).

Major compaction (cross-region, cold path):

fn major_compact(pool: &mut CompressPool):
  // Identify LOW-utilization regions as evacuation sources.
  sources = pool.regions
    .filter(|r| r.fullness() == RegionFullness::Low)
    .sort_by(|a, b| a.last_access_time.cmp(&b.last_access_time))

  for src_region in sources:
    if pool.overall_utilization_pct() <= COMPRESS_POOL_COMPACTION_TARGET_PCT:
      break  // compaction target reached

    // Migrate each live compressed page from src_region to a MED/HIGH region.
    for entry in src_region.live_entries():
      dst_slot = pool.alloc_from_med_or_high(entry.compressed_size)?
      memcpy(dst_slot, src_region.data_at(entry.offset), entry.compressed_size)
      // Update PTE via reverse mapping: old slot_offset → new slot_offset.
      // Uses rmap_walk() (same mechanism as page migration) to find all PTEs
      // pointing to the old SwapEntry and rewrite them with the new offset.
      rmap_walk_update_swap_entry(old_swap_entry, new_swap_entry)

    // Source region is now EMPTY — return its backing pages to buddy.
    pool.free_region(src_region)
    buddy_allocator::free_pages(src_region.backing_pages)

Major compaction runs in a dedicated kthread at priority SCHED_IDLE (below any real workload) and yields every 1ms to avoid interfering with I/O paths. It is also invoked synchronously (without yield) during allocation failure before OOM is triggered.

Size-segregated free lists: Each region maintains per-size-class free lists for O(1) slot allocation. Size classes: 64, 128, 256, 512, 1024, 2048, 3072 bytes. A compressed page is rounded up to the nearest size class. Internal fragmentation is bounded: worst case ~50% for a 65-byte page rounded to 128-byte slot. Average internal fragmentation at the typical compression ratio (2.5-4.0x) is ~10-15%.

Compaction cost: moving compressed pages requires updating the PTE via rmap_walk() to rewrite the SwapEntry with the new slot offset. This is ~100ns per page move. The kthread yields every 1ms, so it processes ~10,000 pages per second on average (~1ms of CPU time per second) and does not block page faults.
Worst case: highly heterogeneous compression ratios (some pages compress 10:1, others 2:1) create severe fragmentation. The compaction kthread keeps up with steady- state workloads but may fall behind during allocation bursts. If pool utilization exceeds 90% and compaction cannot reduce it, the kernel temporarily stops compressing new pages (sending them directly to swap) until compaction catches up.

Interaction with buddy allocator — the CompressPool allocates backing memory from the buddy allocator in large contiguous blocks (64KB-256KB). These allocations are infrequent (one allocation per ~16-64 compressed pages) and always order-4 or larger. This avoids polluting the buddy allocator's small-order freelists. When the CompressPool shrinks (memory pressure eases), regions are returned to the buddy allocator as whole contiguous blocks, avoiding external fragmentation.

Buddy allocator view:
  Order-4+ allocations → CompressPool regions (stable, long-lived)
  Order-0 allocations  → Regular page cache, anonymous pages (high churn)

These two allocation classes don't interfere: CompressPool uses large blocks,
everything else uses small blocks. The buddy allocator's per-order freelists
keep them naturally separated.

4.12.9 Linux Interface Exposure¶

procfs (compatible with Linux zswap interface):

    enabled                    #

href="#__codelineno-175-1">/proc/umka/compress_pool/ # "1" or "0" (read/write) algorithm # "lz4", "zstd1", "zstd3" (read/write) max_pool_percent # 50 (read/write, percentage of total RAM; range # [10, 80]. Can be changed at runtime in both # directions. PTE-embedded indexing means no # pre-allocated hash table constrains resizing. # Reducing triggers background eviction until # utilization falls below the new threshold. # New compressions are rejected while # utilization exceeds the limit.) pool_total_size # Current pool size in bytes stored_pages # Number of pages in pool compressed_bytes # Total compressed bytes original_bytes # Total original bytes compression_ratio # current ratio (e.g., "3.21") rejected_pages # Incompressible pages sent to swap writeback_pages # Pages evicted from pool to swap decompressions # Total decompress operations /sys/kernel/mm/compress_pool/ # Alternative sysfs path # Same attributes, for tools that prefer sysfs

/proc/meminfo additions (matching Linux zswap format):

Zswapped:        512000 kB     (original size of compressed pages)
Zswap:           180000 kB     (actual compressed size in memory)

These are additive fields. free, top, htop and other tools that parse /proc/meminfo simply ignore fields they don't recognize.

4.13 Swap Subsystem¶

Swap is the lowest tier in UmkaOS's memory hierarchy. Pages evicted from the compression tier (Section 4.12) that fail the compression ratio threshold are written to a swap backing store (block device or file). Swap provides a safety net against out-of-memory kills: when DRAM and the compressed pool are exhausted, anonymous pages can be written to persistent storage and reclaimed later on demand.

Memory hierarchy (top = fastest, bottom = slowest):

  Per-CPU Free Page Pools          ~5 ns
        |
  Per-NUMA Buddy Allocator         ~50 ns
        |
  Page Cache (RCU XArray)          ~30 ns lookup
        |
  LRU Active → Inactive lists
        |
  Compression Tier (LZ4)          ~1-2 us compress
        |
  *** Swap Subsystem ***           ~10 us NVMe, ~5 ms HDD

ML tuning: Swap behavior parameters (swappiness, page_cluster, swap_readahead_window) are registered in the Kernel Tunable Parameter Store. The umka-ml-swap Tier 2 service observes swap-in/swap-out rates, page fault patterns, and per-cgroup memory pressure events to tune these parameters at runtime. See Section 23.1 for the complete parameter catalog and observation types.

4.13.1 Swap Area¶

Each active swap target (device or file) is represented by a SwapArea. The kernel supports up to MAX_SWAP_AREAS (32) concurrently active swap areas, matching the Linux limit. Areas are identified by a u8 area ID (0-31); IDs are allocated at swapon(2) time and released at swapoff(2).

/// Maximum number of concurrently active swap areas.
pub const MAX_SWAP_AREAS: usize = 32;

/// Global swap area table. Indexed by area_id (0..MAX_SWAP_AREAS).
/// Slots are `None` when no swap area is active for that ID.
/// Protected by a global `SwapLock` (RwLock) for swapon/swapoff mutations;
/// read-side access during page-out/page-in uses RCU.
pub static SWAP_AREAS: RcuLock<[Option<Arc<SwapArea>>; MAX_SWAP_AREAS]> =
    RcuLock::new([None; MAX_SWAP_AREAS]);

/// Describes a single swap backing store (device or file).
pub struct SwapArea {
    /// Area identifier (0..MAX_SWAP_AREAS). Assigned at swapon, released at swapoff.
    pub area_id: u8,

    /// Backing store type and handles.
    pub backing: SwapBacking,

    /// Priority for allocation ordering. Range: -1 to 32767.
    /// Higher values are preferred. Areas with equal priority are used
    /// round-robin. Default priority is -1 (auto-assigned in swapon order).
    pub priority: i16,

    /// Total number of page-sized swap slots in this area.
    pub total_slots: u64,

    /// Number of currently free (unallocated) slots. Decremented atomically
    /// on allocation, incremented on free. Enables O(1) "is this area full?"
    /// checks without scanning the bitmap.
    pub free_slots: AtomicU64,

    /// Slot allocation bitmap. Bit set = free, bit clear = allocated.
    /// Length: `ceil(total_slots / 64)` entries. Allocated via vmalloc
    /// (for large bitmaps — a 1 TB swap area needs a 32 MB bitmap, too
    /// large for the buddy allocator). Boot-activated swap areas use the
    /// boot allocator. Runtime `swapon` uses vmalloc.
    /// **Ownership**: The bitmap is owned by the SwapArea and freed on
    /// swapoff after all references are drained (SWAP_AREAS RCU grace
    /// period + inflight_io == 0).
    pub slot_bitmap: VmallocSlice<AtomicU64>,

    /// Next cluster index to scan for SSD sequential allocation (see
    /// SwapSlotAllocator below). Wraps to 0 when it reaches the end.
    pub cluster_cursor: AtomicU64,

    /// Area flags (active, SSD, discard, encrypted). Stored as AtomicU32
    /// because `swap_slot_free()` reads flags concurrently with `swapoff()`
    /// modifying them. Load once with `flags.load(Acquire)` and construct
    /// `SwapAreaFlags::from_bits_truncate()` from the loaded value.
    pub flags: AtomicU32,

    /// Per-area swap slot allocator state.
    pub allocator: SwapSlotAllocator,

    /// In-flight I/O count. Incremented before submitting a swap read or
    /// write bio, decremented on completion. swapoff waits for this to
    /// reach zero before deactivating the area.
    pub inflight_io: AtomicU64,
}

/// Swap backing store variant.
pub enum SwapBacking {
    /// Swap on a raw block device (partition or whole disk).
    BlockDevice {
        /// Block device handle ([Section 15.2](15-storage.md#block-io-and-volume-management)).
        dev: BlockDevHandle,
        /// log2(sector_size / PAGE_SIZE) — used to convert slot offsets
        /// to sector addresses. For 512B sectors: shift = 3 (8 sectors
        /// per 4KB page). For 4KB sectors: shift = 0.
        sector_shift: u8,
    },

    /// Swap on a regular file (on any filesystem).
    File {
        /// VFS file handle ([Section 14.1](14-vfs.md#virtual-filesystem-layer)).
        vfs_file: Arc<OpenFile>,
        /// Extent map: contiguous runs of file blocks that back swap slots.
        /// Built at swapon time by walking the file's extent tree (via
        /// `FIEMAP` ioctl or `bmap()` fallback). The extent map is read-only
        /// after swapon — the file must not be modified while active as swap.
        /// Vec is acceptable here: swapon is a cold path (infrequent system
        /// admin operation), and the extent count is bounded by file
        /// fragmentation (typically <1000 extents).
        extents: Vec<SwapExtent>,
    },
}

/// A contiguous physical region backing a range of swap file pages.
pub struct SwapExtent {
    /// Offset in the swap file, in page units (0 = first page of the file).
    pub start_page: u64,
    /// Physical block number on the underlying device.
    pub start_block: u64,
    /// Number of contiguous pages in this extent.
    pub nr_pages: u64,
}

bitflags::bitflags! {
    /// Swap area state and capability flags.
    pub struct SwapAreaFlags: u32 {
        /// Area is active and accepting page-out writes. Cleared during
        /// swapoff drain to stop new allocations while existing pages
        /// are migrated back to memory.
        const WRITEOK    = 1 << 0;

        /// Issue TRIM/UNMAP commands to the device when swap slots are
        /// freed. Reduces write amplification on SSDs. Only meaningful
        /// for SwapBacking::BlockDevice with SSD firmware.
        const DISCARD    = 1 << 1;

        /// An asynchronous batch discard operation is in progress.
        /// New frees accumulate in a discard queue rather than issuing
        /// individual TRIM commands.
        const DISCARDING = 1 << 2;

        /// Backing device is non-rotational (SSD/NVMe). Enables cluster-
        /// based sequential allocation to reduce write amplification.
        /// Detected via BlockDeviceInfo.rotational at swapon time.
        const SSD        = 1 << 3;

        /// Encrypt pages before writing to swap. Per-area ephemeral key
        /// generated at swapon, zeroized at swapoff.
        /// See [Section 9.7](09-security.md#confidential-computing).
        const ENCRYPTED  = 1 << 4;
    }
}

4.13.2 Swap Entry¶

A SwapEntry is a packed 64-bit identifier that uniquely names a single page-sized slot across all swap areas. It encodes both the area ID and the slot offset within that area. SwapEntry values are stored in non-present PTEs, in the swap cache XArray, and in per-cgroup swap accounting structures.

/// Packed swap slot identifier.
///
/// Layout:
///   Bits [60:56] — swap area ID (5-bit field, MAX_SWAP_AREAS=32;
///                  bits [63:61] reserved zero; byte-aligned within [63:56])
///   Bits [55:0]  — slot offset within the area (56 bits, supporting up to
///                  2^56 pages = 256 PiB per area — well beyond any storage
///                  device foreseeable within the 50-year design horizon)
///
/// A SwapEntry of zero is the null sentinel (area 0, offset 0 is never
/// allocated — slot 0 of every area is reserved for the swap header).
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct SwapEntry(u64);

/// Bit-field constants for SwapEntry encoding.
///
/// **5-bit type width**: The kernel-internal `SwapEntry` uses a 5-bit type field
/// (max 32 swap types). This is sufficient for UmkaOS which defines types 28-30
/// (`SWP_HIBERNATION`, `SWP_REMOTE`, `SWP_COMPRESSED`) and reserves slots 0-27
/// for swap device areas (`MAX_SWAPFILES = 28`).
///
/// **LoongArch64 PTE conversion**: LoongArch64 hardware PTEs use 7-bit type fields
/// (128 swap types). The `SwapPteEncoding::encode()` zero-extends the 5-bit kernel
/// type into the 7-bit PTE field (lossless). The `SwapPteEncoding::decode_type()`
/// extracts 7 bits from the PTE and asserts `< 32`:
/// ```rust
/// fn decode_type(pte: RawPte) -> u8 {
///     let raw_type = ((pte >> TYPE_SHIFT) & 0x7F) as u8;
///     debug_assert!(raw_type < 32, "LoongArch64 swap PTE type exceeds 5-bit SwapEntry range");
///     raw_type
/// }
/// ```
/// If LoongArch64 ever requires >32 swap types, widen `SWAP_ENTRY_TYPE_MASK` to
/// `0x7F` and adjust `SWAP_ENTRY_TYPE_SHIFT` to 54 (reducing offset to 54 bits,
/// still sufficient at 64 PiB per swap area).
const SWAP_ENTRY_TYPE_SHIFT: u32 = 56;
const SWAP_ENTRY_TYPE_MASK: u64 = 0x1F;  // 5 bits [60:56]
const SWAP_ENTRY_OFFSET_MASK: u64 = (1u64 << 56) - 1;  // 56 bits [55:0]

impl SwapEntry {
    pub const NULL: Self = SwapEntry(0);

    /// Construct a SwapEntry from area ID and slot offset.
    /// Both values are masked to their respective bit widths to ensure
    /// round-trip consistency with `area_id()` and `offset()` extractors,
    /// even if a corrupted value is passed in a release build (where
    /// `debug_assert!` is elided).
    pub const fn new(area_id: u8, offset: u64) -> Self {
        debug_assert!(area_id < MAX_SWAP_AREAS as u8);
        debug_assert!(offset <= SWAP_ENTRY_OFFSET_MASK);
        // Defense-in-depth: area_id=0 offset=0 produces the NULL sentinel
        // SwapEntry(0). The allocator never returns offset 0 for any area
        // (slot 0 is the swap header, permanently marked allocated at swapon
        // time), but this assert catches accidental manual construction of
        // the NULL sentinel.
        debug_assert!(area_id != 0 || offset != 0,
            "SwapEntry::new(0, 0) would produce the NULL sentinel");
        SwapEntry(
            (((area_id as u64) & SWAP_ENTRY_TYPE_MASK) << SWAP_ENTRY_TYPE_SHIFT)
            | (offset & SWAP_ENTRY_OFFSET_MASK)
        )
    }

    /// Extract the swap area ID. Masked to 5 bits to prevent reserved
    /// bits [63:61] from corrupting the result on a corrupted PTE.
    pub const fn area_id(self) -> u8 {
        ((self.0 >> SWAP_ENTRY_TYPE_SHIFT) & SWAP_ENTRY_TYPE_MASK) as u8
    }

    /// Extract the slot offset within the area.
    pub const fn offset(self) -> u64 { self.0 & SWAP_ENTRY_OFFSET_MASK }

    /// Raw u64 value, used as the XArray key in SwapCache.
    pub const fn raw(self) -> u64 { self.0 }

    pub const fn is_null(self) -> bool { self.0 == 0 }
}

4.13.2.1 Per-Architecture PTE Swap Entry Encoding¶

When a page is swapped out, its PTE is set to a non-present format that encodes the SwapEntry. Each architecture has a different non-present PTE layout. The arch-specific encode_swap_pte / decode_swap_pte functions in umka-core/src/arch/*/mm.rs perform the conversion.

UmkaOS-native per-architecture encoding with tier hints. Swap PTE encoding is kernel-internal (never visible to userspace), so UmkaOS defines its own per-arch encoding optimized for the UmkaOS swap tier model. Each architecture implements the SwapPteEncoding trait in arch/*/mm.rs:

/// Per-architecture swap PTE encoding. Implemented in arch/*/mm.rs.
/// Abstracts the architecture-specific bit layout for non-present PTEs.
pub trait SwapPteEncoding {
    const TYPE_BITS: u32;    // 5 on most arches, 7 on LoongArch64
    const OFFSET_BITS: u32;  // Architecture-dependent
    const TIER_BITS: u32;    // 2 on 64-bit (Compressed/Local/Remote/Archive), 0 on 32-bit

    fn encode(swap_type: u8, offset: u64, tier: SwapTier, flags: SwapPteFlags) -> RawPte;
    fn decode_type(pte: RawPte) -> u8;
    fn decode_offset(pte: RawPte) -> u64;
    fn decode_tier(pte: RawPte) -> SwapTier;
    fn decode_flags(pte: RawPte) -> SwapPteFlags;
}

/// Swap tier hint embedded in the PTE (64-bit arches only).
/// On page fault, the tier hint allows direct branch to the tier-specific
/// fetch path, skipping the swap_info[type] lookup (saves one cache miss).
#[repr(u8)]
pub enum SwapTier {
    Compressed = 0,  // zswap/zram compressed page
    Local = 1,       // Local swap device (SSD/HDD)
    Remote = 2,      // DSM remote page / cluster swap
    Archive = 3,     // Tiered storage / cold page archive
}

Architecture	PTE Size	Type Bits	Tier Bits	Offset Bits	Special
x86-64	64-bit	5	2	50	soft-dirty, uffd-wp, anon-exclusive reserved bits
AArch64	64-bit	5	2	48	Upper bits reserved for hardware use
ARMv7	32-bit	5	0	25	No tier hint; fallback to swap_info lookup
RISC-V 64	64-bit	5	2	50
PPC32	32-bit	5	0	25	No tier hint; fallback to swap_info lookup
PPC64LE	64-bit	5	2	50
s390x	64-bit	5	2	50	Invalid-page bit encoding per DAT
LoongArch64	64-bit	7	2	48	7 type bits (128 swap types, matching hardware flexibility)

UmkaOS-specific internal swap types occupy reserved slots above MAX_SWAPFILES:

Type ID	Name	Purpose
30	`SWP_COMPRESSED`	zswap/zram compressed page (inline tier=Compressed). The 56-bit offset field is partitioned: bits [55:50] = NUMA node ID (6 bits), bits [49:0] = slot index within the per-NUMA CompressPool. See Section 4.12 for `COMPRESS_NODE_SHIFT` / `COMPRESS_SLOT_MASK`.
29	`SWP_REMOTE`	DSM remote page awaiting fetch (inline tier=Remote)
28	`SWP_HIBERNATION`	Hibernation snapshot page

Page fault fast path optimization: On 64-bit arches, the fault handler checks the 2-bit tier hint from the PTE first. If the tier is known (not Unknown), it branches directly to the tier-specific fetch routine (zswap decompress, local I/O, RDMA fetch), skipping the swap_info[type] array lookup. This saves one L1d cache miss (~3-5ns) on every swap-in fault. On 32-bit arches (no tier bits), the handler falls through to the standard swap_info[type] lookup (same cost as Linux).

32-bit architecture note: ARMv7 and PPC32 have 25 bits for the swap offset, limiting each swap area to 128 GiB of swap space (at 4 KB pages). This is adequate for 32-bit systems where total addressable RAM is 4 GiB (with LPAE: 8 GiB). The full 56-bit offset in SwapEntry is truncated to the architecture's PTE width when encoding; decode_swap_pte zero-extends back to 64 bits.

swapon() validation: Before activating a swap area, swapon() validates that the area's page count fits within the architecture's PTE swap offset capacity:

/// Maximum swap offset pages per architecture, derived from PTE swap offset bit widths.
/// swapon() rejects swap areas larger than this limit with EINVAL.
/// These values are compile-time constants per architecture (arch::MAX_SWAP_OFFSET_PAGES).
pub const MAX_SWAP_OFFSET_PAGES: [(&str, u64); 8] = [
    ("x86-64",      1 << 50),  // 50-bit offset field
    ("AArch64",     1 << 48),  // 48-bit offset field
    ("ARMv7",       1 << 25),  // 25-bit offset field (128 GiB at 4 KB pages)
    ("RISC-V 64",   1 << 50),  // 50-bit offset field
    ("PPC32",       1 << 25),  // 25-bit offset field (128 GiB at 4 KB pages)
    ("PPC64LE",     1 << 50),  // 50-bit offset field
    ("s390x",       1 << 50),  // 50-bit offset field
    ("LoongArch64", 1 << 48),  // 48-bit offset field (7-bit type + 2-bit tier uses more bits)
];

// In swapon():
if area_pages > arch::MAX_SWAP_OFFSET_PAGES {
    return Err(SyscallError::EINVAL);
}

This prevents silent truncation: without this check, a swap area larger than the PTE can encode would cause encode_swap_pte to lose high-order offset bits, leading to aliased swap slots (multiple pages mapping to the same swap slot = data corruption).

/// Architecture-provided swap PTE encoding (umka-core/src/arch/*/mm.rs).
///
/// `SwapPteOps` is the high-level arch-portable interface for converting
/// between `SwapEntry` (kernel-internal opaque handle) and raw PTE values.
/// It delegates to `SwapPteEncoding` internally. Generic code calls
/// `SwapPteOps`; only `arch/*/mm.rs` implements `SwapPteEncoding`.
///
/// The split exists because `SwapEntry` encapsulates `area_id + offset` as
/// an opaque unit, while `SwapPteEncoding` works with the individual
/// type/offset/tier/flags fields needed for the bit-packing.
pub trait SwapPteOps {
    /// Encode a SwapEntry into a non-present PTE value.
    fn encode_swap_pte(entry: SwapEntry) -> u64;

    /// Decode a non-present PTE value back into a SwapEntry.
    /// Panics (debug) if the PTE has the present bit set.
    fn decode_swap_pte(pte: u64) -> SwapEntry;
}

4.13.3 Swap Cache¶

The swap cache is a global XArray that maps SwapEntry values to in-memory Page references. It serves two purposes:

Deduplication: When multiple processes share a COW mapping of the same swap slot (e.g., after fork()), the swap cache ensures only one physical read is issued. The first fault reads the page from swap and inserts it into the cache; subsequent faults find it in the cache and map it directly.
Writeback coherency: A page being written to swap remains in the cache until the write completes. If a fault occurs during writeback, the in-memory copy is served immediately without waiting for the I/O.

/// Global swap cache. Keyed by SwapEntry::raw() (u64), values are Arc<Page>.
///
/// XArray is mandated by the collection usage policy
/// ([Section 3.13](03-concurrency.md#collection-usage-policy)) for all integer-keyed mappings. RCU-safe
/// reads allow the page fault handler to look up a swap cache entry without
/// acquiring any lock — the same pattern used by the page cache
/// ([Section 4.4](#page-cache)).
///
/// Insertion: under the XArray slot lock (per-slot, not global).
/// Deletion: RCU-deferred to ensure readers that obtained a reference
///           via rcu_read_lock see a consistent page.
pub static SWAP_CACHE: XArray<Arc<Page>> = XArray::new();

Swap cache lifecycle for a single page:

1. Page selected for swap-out by reclaim:
   → Allocate swap slot (SwapEntry)
   → Insert (SwapEntry, Arc<Page>) into SWAP_CACHE
   → Submit async write I/O to swap device
   → Update PTE to non-present swap entry

2. Write I/O completes:
   → Page remains in SWAP_CACHE (may still be mapped by other processes)
   → If no process has this SwapEntry in a PTE, remove from SWAP_CACHE
     and free the page frame

3. Page fault on a swap PTE:
   → Decode SwapEntry from PTE
   → rcu_read_lock()     // required: protects SWAP_CACHE XArray traversal
   → Lookup SWAP_CACHE[entry.raw()]
   → If found: take Arc::clone() on the page ref while under RCU
   → rcu_read_unlock()
   → If found: map the cached page, no I/O needed
   → If not found: allocate page, read from swap device, insert into
     SWAP_CACHE (under xa_lock), map page, update PTE to present

4. Last process unmaps the page:
   → Remove from SWAP_CACHE (RCU-deferred)
   → Free the swap slot back to the area bitmap
   → Drop Arc<Page> (page frame freed when refcount reaches zero)

4.13.4 Swap Slot Allocator¶

Each SwapArea has a dedicated slot allocator that converts the global bitmap into fast, contention-free slot allocation. The allocator uses two strategies based on the storage device type:

/// Per-area swap slot allocator.
pub struct SwapSlotAllocator {
    /// Cluster size in pages. SSD = 256 (1 MiB clusters), HDD = 1.
    /// Clusters group allocations sequentially to reduce SSD write
    /// amplification (FTL garbage collection sees sequential writes as
    /// a single erase unit). HDD uses single-page allocation because
    /// rotational latency dominates — sequential layout within a cluster
    /// provides no benefit when seeks are milliseconds.
    pub cluster_pages: u64,

    /// Per-CPU free slot caches. Each CPU caches a batch of pre-allocated
    /// slot offsets to avoid contention on the global bitmap. Cache size
    /// is 64 entries — enough to absorb bursts of page-out activity on a
    /// single CPU without depleting the global pool.
    pub per_cpu_cache: PerCpu<SwapSlotCache>,
}

/// Per-CPU cache of pre-allocated swap slot offsets.
/// Uses interior mutability (UnsafeCell/Cell) because PerCpu<T>.get()
/// returns `&T` (shared reference), and mutation under preempt_disable
/// is the standard per-CPU access pattern in UmkaOS.
pub struct SwapSlotCache {
    /// Cached slot offsets. Valid entries are at indices [0..count.get()).
    /// Cell: mutation under preempt_disable (single-writer per-CPU).
    pub slots: UnsafeCell<[u64; SWAP_SLOT_CACHE_SIZE]>,
    /// Number of valid cached slots. u8 suffices (range [0, 64]).
    pub count: Cell<u8>,
}

// SAFETY: SwapSlotCache is per-CPU and only accessed under preempt_disable.
// The owning CPU is the sole accessor. No cross-CPU sharing occurs.
unsafe impl Sync for SwapSlotCache {}

/// Number of swap slots cached per CPU.
pub const SWAP_SLOT_CACHE_SIZE: usize = 64;
const_assert!(SWAP_SLOT_CACHE_SIZE <= u8::MAX as usize);

Allocation algorithm:

swap_slot_alloc(area: &SwapArea) -> Option<SwapEntry>:
    // 1. Fast path: per-CPU cache (no atomics, no contention)
    //    Preemption must be disabled to prevent migration between cpu_id()
    //    and the cache access. PerCpu<T>.get() enforces this via PreemptGuard.
    let guard = preempt_disable();
    let cache = area.allocator.per_cpu_cache.get(&guard);
    let count = cache.count.get();
    if count > 0:
        cache.count.set(count - 1);
        // SAFETY: per-CPU access under preempt_disable, index < SWAP_SLOT_CACHE_SIZE
        let slots = unsafe { &*cache.slots.get() };
        return Some(SwapEntry::new(area.area_id, slots[(count - 1) as usize]))

    // 2. Refill: batch-allocate SWAP_SLOT_CACHE_SIZE slots from bitmap
    // SAFETY: per-CPU access under preempt_disable
    let slots = unsafe { &mut *cache.slots.get() };
    refill_count = bitmap_alloc_batch(
        &area.slot_bitmap,
        &area.cluster_cursor,
        area.allocator.cluster_pages,
        slots,
        SWAP_SLOT_CACHE_SIZE,
    );
    if refill_count == 0:
        return None  // area is full

    area.free_slots.fetch_sub(refill_count, Ordering::Relaxed);
    cache.count.set((refill_count - 1) as u8);  // return one, cache the rest
    let slots = unsafe { &*cache.slots.get() };
    return Some(SwapEntry::new(area.area_id, slots[(refill_count - 1) as usize]))

bitmap_alloc_batch(bitmap, cursor, cluster_pages, out, max_count):
    // Scan from cursor position, find set bits (free slots),
    // clear them atomically (CAS on AtomicU64 words).
    // For SSD: advance cursor by cluster_pages to maintain sequential layout.
    // For HDD: advance cursor by 1 (no clustering).
    // Wraps to bitmap start when end is reached.
    // Returns number of slots allocated (0 if bitmap is exhausted).

Deallocation:

swap_slot_free(area: &SwapArea, entry: SwapEntry):
    let offset = entry.offset();
    let word = offset / 64;
    let bit = offset % 64;

    // Set the bit back to 1 (free) with atomic OR.
    area.slot_bitmap[word].fetch_or(1u64 << bit, Ordering::Release);
    area.free_slots.fetch_add(1, Ordering::Relaxed);

    // If DISCARD flag is set and not already batching, queue a TRIM.
    // Load flags once to avoid TOCTOU race between the two checks.
    let flags = SwapAreaFlags::from_bits_truncate(area.flags.load(Ordering::Acquire));
    if flags.contains(SwapAreaFlags::DISCARD)
        && !flags.contains(SwapAreaFlags::DISCARDING):
        queue_swap_discard(area, offset);

4.13.5 swapon(2) / swapoff(2) Syscalls¶

4.13.5.1 swapon¶

int swapon(const char *path, int swapflags);

Flags:

bitflags::bitflags! {
    /// Flags for swapon(2). Linux ABI-compatible values.
    pub struct SwapOnFlags: u32 {
        /// Set custom priority. When this bit is set, bits [14:0] of swapflags
        /// contain the priority value: `prio = swapflags & SWAP_FLAG_PRIO_MASK`.
        const PREFER      = 0x8000;
        /// Enable TRIM/UNMAP on this swap area.
        const DISCARD     = 0x10000;
        /// Discard the entire swap area at swapon time (full TRIM).
        const DISCARD_ONCE = 0x20000;
        /// Discard freed swap pages in batches (background TRIM).
        const DISCARD_PAGES = 0x40000;
    }
}

/// Priority is encoded in bits [14:0] of swapflags when SWAP_FLAG_PREFER
/// (bit 15) is set. Linux decodes priority as:
///   `if (swap_flags & SWAP_FLAG_PREFER) prio = swap_flags & SWAP_FLAG_PRIO_MASK;`
/// There is no shift — `SWAP_FLAG_PRIO_SHIFT` does not exist in current
/// Linux source (`mm/swapfile.c`).
pub const SWAP_FLAG_PRIO_MASK: u32 = 0x7FFF;

swapon validation sequence:

Capability check: Caller must hold CAP_SYS_ADMIN (Section 9.9). Return -EPERM otherwise.
Path resolution: Resolve path to a VFS inode. If it is a block device, open via BlockDevHandle; if a regular file, open via OpenFile.
Header validation: Read the first page (4096 bytes). Verify the magic bytes SWAPSPACE2 at offset PAGE_SIZE - 10. Verify swap_header.version == 1. Verify swap_header.last_page * PAGE_SIZE <= device_or_file_size. Verify swap_header.pagesize == PAGE_SIZE (the swap partition must match the kernel's page size). Return -EINVAL on any mismatch.
Bad page scan: The swap header contains a bad-page bitmap. Mark any flagged slots as permanently allocated (clear bit in slot_bitmap) so they are never used.
Extent mapping (file-backed only): Walk the file's extent tree via FIEMAP to build the SwapExtent array. Verify that all extents are contiguous within their block ranges (holes in the file are not permitted — return -EINVAL).
Device detection: Query BlockDeviceInfo to determine rotational status. Set SwapAreaFlags::SSD for non-rotational devices.
Allocator init: Allocate slot_bitmap (one bit per slot, AtomicU64 words). Initialize per-CPU slot caches to empty. Set cluster_pages to 256 for SSD, 1 for HDD.
Activate: Insert SwapArea into SWAP_AREAS[area_id], set WRITEOK flag, publish via RCU. Log activation to the system event bus (Section 7.9).

4.13.5.2 swapoff¶

int swapoff(const char *path);

swapoff drain sequence:

Capability check: CAP_SYS_ADMIN required.
Clear WRITEOK: Atomically clear SwapAreaFlags::WRITEOK to prevent new allocations to this area.
Drain pages: For every allocated slot in the bitmap, read the page back into memory. If memory is insufficient, attempt to migrate the page to another active swap area. This is the expensive step — draining a large swap area may take seconds to minutes depending on I/O bandwidth.
Wait for in-flight I/O: Spin on inflight_io until it reaches zero.
Deactivate: Set SWAP_AREAS[area_id] = None, publish via RCU, then wait for an RCU grace period to ensure no reader holds a reference.
Free resources: Deallocate slot_bitmap, flush per-CPU caches, close the backing device or file handle.

4.13.6 Reclaim Integration¶

The swap subsystem is invoked by the page reclaim path as the final tier of eviction, after the compression tier has rejected a page or is full.

4.13.6.1 Three-Tier Eviction Pipeline¶

LRU Inactive List
    |
    v
[1] Page is clean file-backed?  --Yes--> Evict (drop, re-read later)
    |No
    v
[2] Compression eligible?  --Yes--> Attempt LZ4 compress
    |No                               |
    |                          Compress ratio OK?
    |                          |Yes           |No
    |                          v              |
    |                    Store in CompressPool |
    |                                         |
    +---<-------------------------------------+
    |
    v
[3] Swap writeback:
    → Allocate swap slot from highest-priority area with free slots
    → If SwapAreaFlags::ENCRYPTED: encrypt page (see Encrypted Swap below)
    → Insert page into SWAP_CACHE
    → Submit async bio write to swap device
    → Update PTE to non-present swap entry
    → On I/O completion: page frame is freed

4.13.6.2 Swappiness¶

The swappiness parameter controls the balance between reclaiming file-backed pages (dropping clean page cache) and swapping anonymous pages.

/// Global swappiness value. Range: 0-200.
///   0   = never swap anonymous pages (except under OOM pressure)
///   100 = equal weight for file cache and anonymous reclaim
///   200 = aggressively prefer swapping anonymous over file cache
///
/// Default: 60 (slight preference for keeping anonymous pages in memory,
/// matching the Linux default).
///
/// Accessed by kswapd and direct reclaim on every scan cycle.
pub static SWAPPINESS: AtomicU32 = AtomicU32::new(60);

The reclaim scanner uses swappiness to compute the scan ratio between the anonymous and file LRU lists:

anonymous_scan_ratio = swappiness
file_scan_ratio      = 200 - swappiness

// At swappiness=60 (default):  anon_ratio=60, file_ratio=140
//   → scan ~30% anonymous, ~70% file pages
// At swappiness=100:           anon_ratio=100, file_ratio=100
//   → equal scanning of anonymous and file pages
// At swappiness=200:           anon_ratio=200, file_ratio=0
//   → scan only anonymous pages (aggressive swapping, no file eviction)

Per-cgroup override: Each cgroup with the memory controller enabled can set memory.swap.max to limit per-cgroup swap usage, and the memory controller's local swappiness overrides the global value (Section 17.2).

Reclaim entry points:

kswapd: Background reclaim daemon (one per NUMA node). Wakes when free pages fall below the watermark_low threshold. Scans LRU lists and calls swap_writeback() for pages that reach the swap tier.
Direct reclaim: Synchronous reclaim in the page allocation slow path when free pages are below watermark_min. More aggressive than kswapd — may block the allocating task until pages are freed.

4.13.7 Swap Readahead¶

When reading a page from swap, the kernel speculatively reads nearby pages to exploit sequential I/O characteristics of the backing device.

/// Swap readahead configuration.
pub struct SwapReadaheadConfig {
    /// Readahead window in pages. Must be a power of two.
    /// Range: 1 (disabled) to 32.
    /// Controlled by sysctl vm.page-cluster (log2 of this value).
    /// Default: 8 pages (vm.page-cluster = 3).
    pub window: u32,

    /// Current adaptive window per swap area. Starts at `window`,
    /// increases on sequential access patterns (up to `window`),
    /// decreases to 1 on random access.
    /// Written under preempt_disable by the owning CPU only.
    /// `AtomicU32` (not `Cell<u32>`) because `/proc/vmstat` and the ML policy
    /// framework ([Section 23.1](23-ml-policy.md#aiml-policy-framework-closed-loop-kernel-intelligence))
    /// read all CPUs' adaptive windows with `Relaxed` ordering for statistics
    /// aggregation, without requiring an IPI to each CPU.
    pub adaptive_window: PerCpu<AtomicU32>,

    /// Last swap entry read on each CPU, used for sequential access detection.
    /// `access_is_sequential()` compares the current entry's offset with
    /// `last_swap_entry` offset: if delta is 1 (or within cluster stride),
    /// the access pattern is sequential and the window expands.
    /// Accessed under preempt_disable (PerCpu pattern).
    pub last_swap_entry: PerCpu<SwapEntry>,
}

Readahead algorithm:

swap_readahead(entry: SwapEntry, area: &SwapArea):
    let base_offset = entry.offset()
    let window = adaptive_window[cpu_id()].load()

    // Align to cluster boundary for SSD sequential locality.
    let cluster_base = base_offset & !(area.allocator.cluster_pages - 1)

    // Read slots [base_offset .. base_offset + window), clamped to area bounds.
    for offset in base_offset .. min(base_offset + window, area.total_slots):
        if offset == base_offset:
            continue  // the faulting page is read synchronously, not here

        let ra_entry = SwapEntry::new(area.area_id, offset)
        if SWAP_CACHE.get(ra_entry.raw()).is_some():
            continue  // already cached

        if !is_slot_allocated(area, offset):
            continue  // slot is free, no data to read

        // Submit async read bio for this slot. On completion, the page is
        // inserted into SWAP_CACHE (but not mapped into any process — it
        // will be found there on the next fault for this entry).
        submit_swap_read_async(area, ra_entry)

    // Adaptive window adjustment:
    if access_is_sequential(entry, last_swap_entry[cpu_id()]):
        adaptive_window[cpu_id()].store(
            min(window * 2, SWAP_READAHEAD_CONFIG.window), Ordering::Relaxed)
    else:
        adaptive_window[cpu_id()].store(1, Ordering::Relaxed)

    last_swap_entry[cpu_id()] = entry

The vm.page-cluster sysctl is the log2 of the readahead window. Default: 3 (window = 2^3 = 8 pages). Setting it to 0 disables readahead (window = 1).

4.13.8 Per-Cgroup Swap Accounting¶

Swap usage is charged to the cgroup that owns the page at the time of swap-out. The memory controller tracks swap independently from RSS to allow fine-grained limits.

/// Per-cgroup swap accounting state (embedded in MemController).
///
/// See [Section 17.2](17-containers.md#control-groups) for the MemController struct and the cgroup
/// hierarchy. These fields are stored inline in MemController (not
/// heap-allocated) for O(1) access from the swap-out hot path.
pub struct CgroupSwapState {
    /// Current swap pages charged to this cgroup. Incremented on swap-out,
    /// decremented on swap-in or swap slot free.
    pub swap_usage: AtomicU64,

    /// Best-effort limit on swap pages for this cgroup. If swap_usage would
    /// exceed this limit, the page is not swapped out — the reclaim path
    /// either finds a different page or triggers per-cgroup OOM.
    /// Read with Relaxed ordering in the swap-charge CAS loop, so a
    /// concurrent update to swap_max may take a few iterations to take effect.
    /// This matches Linux cgroup v2 behavior (not a hard real-time guarantee).
    /// Set via `memory.swap.max` (in bytes, converted to pages internally).
    /// Default: u64::MAX (unlimited).
    pub swap_max: AtomicU64,
}

Cgroup control files (cgroup v2 memory controller):

File	Type	Description
`memory.swap.current`	read-only	Current swap usage in bytes
`memory.swap.max`	read-write	Hard swap limit in bytes (`"max"` = unlimited)
`memory.swap.high`	read-write	Swap throttle threshold — reclaim pressure increases above this
`memory.swap.events`	read-only	Cumulative event counters: `max` (number of times swap.max hit)

Charge accounting flow:

swap_charge(cgroup: &Cgroup, nr_pages: u64) -> Result<()>:
    let state = &cgroup.mem_controller.swap_state;
    loop:
        let current = state.swap_usage.load(Ordering::Acquire);
        if current + nr_pages > state.swap_max.load(Ordering::Relaxed):
            return Err(SwapError::CgroupLimitExceeded)
        if state.swap_usage.compare_exchange_weak(
            current, current + nr_pages,
            Ordering::AcqRel, Ordering::Relaxed
        ).is_ok():
            return Ok(())

// Note: swap_usage may temporarily exceed swap_max by up to nr_pages due to the
// non-atomic check-then-CAS pattern. This matches Linux's page_counter_try_charge()
// behavior and is not a correctness issue for swap accounting.

swap_uncharge(cgroup: &Cgroup, nr_pages: u64):
    cgroup.mem_controller.swap_state.swap_usage
        .fetch_sub(nr_pages, Ordering::Release);

4.13.9 procfs Interface¶

The swap subsystem exposes status and statistics through standard Linux-compatible procfs and sysctl interfaces.

/proc/swaps — per-area status (Linux-compatible format):

Filename                Type        Size       Used       Priority
/dev/nvme0n1p3          partition   16777212   4521984    -2
/swapfile               file        8388604    0          -3

Fields: path, type (partition or file), total size in KiB, used size in KiB, priority. Sorted by activation order (same as Linux).

/proc/meminfo additions:

SwapTotal:       25165816 kB    (sum of all active swap areas)
SwapFree:        20643832 kB    (sum of free_slots * PAGE_SIZE / 1024)
SwapCached:        131072 kB    (pages in SWAP_CACHE that are also in memory)

/proc/vmstat additions:

pswpin   12345678     (cumulative pages read from swap since boot)
pswpout  23456789     (cumulative pages written to swap since boot)

Both pswpin and pswpout are u64 counters. At 10,000 pages/second continuous swap activity, a u64 counter overflows after ~58 million years — well within the 50-year uptime design target.

sysctl interface:

sysctl	Type	Default	Description
`vm.swappiness`	u32	60	Global swappiness (0-200)
`vm.page-cluster`	u32	3	log2 of swap readahead window
`vm.swap_readahead_enable`	bool	true	Enable/disable adaptive readahead

4.13.10 Encrypted Swap¶

When SwapAreaFlags::ENCRYPTED is set (either via a swapon flag or forced by system policy for confidential computing workloads), all pages are encrypted before DMA to the swap device and decrypted on read-back. This prevents swap contents from being recovered by physical media analysis.

/// Per-swap-area encryption context. Created at swapon time when
/// ENCRYPTED flag is set. Destroyed (key zeroized) at swapoff.
pub struct SwapCryptoCtx {
    /// AES-256-XTS key (512 bits: 256 for encryption + 256 for tweak).
    /// Generated from the kernel CSPRNG ([Section 10.1](10-security-extensions.md#kernel-crypto-api)) at
    /// swapon time. Stored in kernel memory with `NOKASAN` and `NOACCESS`
    /// page flags — only the crypto fast path can read it.
    pub key: Zeroizing<[u8; 64]>,

    /// Per-CPU pre-expanded AES key schedules to avoid re-expansion on
    /// every encrypt/decrypt. Populated at swapon from `key`.
    pub expanded_key: PerCpu<AesXtsExpandedKey>,
}

/// Pre-expanded AES-256-XTS key schedule for one CPU.
/// AES-256 uses 14 rounds; XTS requires two AES instances (encryption key
/// and tweak key). Total round keys: 2 * (14 + 1) * 16 = 480 bytes.
/// Aligned to cache line to prevent false sharing in per-CPU allocation.
///
/// On architectures with AES hardware acceleration (x86-64 AES-NI,
/// AArch64 ARMv8-CE, s390x CPACF), the expanded key layout matches
/// the hardware's expected format. On software-only architectures
/// (ARMv7, PPC32, RISC-V without Zkne), a generic table is used.
/// The crypto API ([Section 10.1](10-security-extensions.md#kernel-crypto-api)) abstracts this difference.
#[repr(C, align(64))]
pub struct AesXtsExpandedKey {
    /// Round keys for the encryption AES-256 instance (15 keys * 16 bytes).
    pub enc_round_keys: [[u8; 16]; 15],
    /// Round keys for the tweak AES-256 instance (15 keys * 16 bytes).
    pub tweak_round_keys: [[u8; 16]; 15],
}
const_assert!(size_of::<AesXtsExpandedKey>() == 512);
// 480 bytes of round keys + 32 bytes padding to cache-line-aligned 512.

Encryption flow (swap-out path):

1. Reclaim selects a page for swap-out to an ENCRYPTED area.
2. Allocate a temporary bounce page from the per-CPU bounce pool.
3. Encrypt the page contents into the bounce page:
   - Algorithm: AES-256-XTS
   - Tweak: SwapEntry.raw() (unique per slot, non-repeating)
   - The original page is read-only during this operation.
4. Submit the bounce page (not the original) for DMA write.
5. On I/O completion: return the bounce page to the pool, free the
   original page frame.

Decryption flow (swap-in path):

1. Page fault decodes SwapEntry from PTE.
2. Allocate a fresh page frame.
3. Submit read bio for the swap slot → DMA into the fresh page.
4. On I/O completion: decrypt in-place using the area's SwapCryptoCtx.
5. Insert decrypted page into page tables and SWAP_CACHE.

Per-architecture AES acceleration:

Architecture	Hardware AES	Instructions	Fallback
x86-64	AES-NI	`aesenc`, `aesdec`, `aeskeygenassist`	Software AES (FIPS-validated)
AArch64	ARMv8 Crypto Extension	`AESE`, `AESD`, `AESMC`, `AESIMC`	Software AES
ARMv7	NEON AES (if CE present)	`AESE.8`, `AESD.8`	Software AES
RISC-V 64	Zkne/Zknd (if ratified)	`aes64es`, `aes64ds`	Software AES
PPC32	None	N/A	Software AES
PPC64LE	VMX/VSX crypto	`vcipher`, `vcipherlast`	Software AES
s390x	CPACF (MSA)	`KM` (cipher message) with AES function codes	Software AES
LoongArch64	LSX/LASX crypto (if present)	LSX AES instructions	Software AES

The kernel queries the CPU feature flags at boot (Section 2.16) and selects the fastest available implementation. All implementations produce bit-identical output (NIST AES test vectors verified at boot in debug builds).

4.13.11 Performance Budget¶

Operation	Path	Cost	Notes
Swap slot alloc (per-CPU cache hit)	Hot	~20 ns	No atomics, array index decrement
Swap slot alloc (bitmap scan, 64-slot batch)	Warm	~200 ns	AtomicU64 CAS per bitmap word
Swap slot free (bitmap set + atomic add)	Hot	~15 ns	Single atomic OR + fetch_add
SwapCache lookup (XArray RCU read)	Hot	~30 ns	Lock-free radix tree traversal
SwapCache insert (XArray slot lock)	Warm	~80 ns	Per-slot spinlock, single CAS
Swap writeback (single page, NVMe)	Cold	~10 us	DMA + device write + completion IRQ
Swap writeback (single page, SATA SSD)	Cold	~50 us	Higher command overhead than NVMe
Swap writeback (single page, HDD)	Cold	~5 ms	Seek + rotational latency dominated
Swap read-in (single page, NVMe)	Cold	~10 us	DMA + device read + completion IRQ
Swap read-in (single page, HDD)	Cold	~5 ms	Seek + rotational latency dominated
Cluster readahead (8 pages, NVMe)	Cold	~15 us	Sequential: one command, 8 pages
Encrypted swap write (AES-NI)	Cold	~10.5 us	~0.5 us AES-XTS encrypt + ~10 us I/O
Encrypted swap write (software AES)	Cold	~14 us	~4 us software AES + ~10 us I/O
swapon (full init, NVMe device)	Cold	~1 ms	Header read, bitmap alloc, extent map
swapoff (16 GiB area, NVMe)	Cold	~2-10 s	Read back all pages, depends on I/O bw

4.14 DMA Subsystem¶

Direct Memory Access (DMA) allows hardware devices to transfer data to and from system memory without CPU involvement. The kernel DMA subsystem provides the abstractions that drivers use to allocate DMA-capable memory and map existing memory for device access.

Cross-references: IOMMU group formation (Section 11.5), device registry (Section 11.4), physical memory allocator (Section 4.2), cache coherency (Section 4.14).

4.14.1 Design: IOMMU-First DMA¶

Linux approach: Every device has a dma_ops void-pointer function table. SWIOTLB (software bounce buffering) is always initialized at boot, consuming 64 MB of low memory even on systems where no device needs it. dma_map_single accepts and returns void *.

UmkaOS approach:

IOMMU-first: When the system has an IOMMU (VT-d on x86-64, SMMU on AArch64, RISC-V IOMMU, etc.) all devices are assigned an IOMMU domain at probe time (Section 11.5). DMA addresses are IOVAs in the device's mapped address space. No physically-contiguous allocation is required, no bounce buffering is needed.
Typed API: DmaDevice is a Rust trait. dma_alloc_coherent returns CoherentDmaBuf (typed struct), not (void *, dma_addr_t). The type system prevents confusing a CPU address with a DMA address.
SWIOTLB early allocation: Always allocated early from low memory at Phase 1.2.2 (before IOMMU discovery), matching Linux behavior. After all device probing completes (Phase 5.x), the SWIOTLB pool may be released if full device coverage is confirmed. The release cannot happen at IOMMU init time (Phase 4.1) because no devices have been probed yet — the device registry is empty. On a modern x86-64 server with VT-d covering all devices, SWIOTLB memory is reclaimed post-boot.
Cache coherency explicit: The API requires callers to call dma_sync_for_cpu / dma_sync_for_device on streaming mappings. On cache-coherent architectures these are no-ops; on non-coherent architectures (ARMv7, bare RISC-V) they perform cache range operations. The explicitness prevents silent corruption bugs when porting to a non-coherent platform.

4.14.2 Core Types¶

/// A device-visible DMA address.
///
/// On systems with IOMMU: an IOVA (I/O Virtual Address) in the device's
/// mapped address space.
/// On systems without IOMMU (SWIOTLB path): the physical address directly.
/// Never a kernel virtual address. Do not dereference from CPU code.
pub type DmaAddr = u64;

/// Maximum entries in a single scatter-gather list.
/// Drivers that need to map more data must split into multiple DmaSgls.
pub const DMA_MAX_SGE: usize = 512;

/// A single scatter-gather entry: device-visible address + byte length.
#[repr(C)]
pub struct DmaScatterEntry {
    pub dma_addr: DmaAddr,
    pub length:   u32,
    _pad:         u32,  // Pad to 16 bytes for cache-friendly array stride and fast
                        // index arithmetic (power-of-two size). This struct is kernel-
                        // internal; device-specific SGL formats (NVMe PRP/SGL, VirtIO
                        // descriptor, AHCI PRD) are converted from DmaSgl in the
                        // device driver layer.
}
const_assert!(core::mem::size_of::<DmaScatterEntry>() == 16);

/// A scatter-gather list for a multi-segment DMA operation.
/// Backed by a fixed-size ArrayVec: no heap allocation on hot paths.
pub struct DmaSgl {
    pub entries: ArrayVec<DmaScatterEntry, DMA_MAX_SGE>,
}

impl DmaSgl {
    /// Construct a DmaSgl from a Bio's segment list (format conversion only).
    /// This does NOT perform DMA mapping — it copies the page/offset/len
    /// triplets from the Bio's segment array into DmaScatterEntry format.
    /// The caller must subsequently call `dma_map_sgl()` to obtain
    /// device-visible DMA addresses.
    ///
    /// Each `BioSegment { page: PageRef, offset: u32, len: u32 }` is
    /// converted to a `DmaScatterEntry { dma_addr: page.phys_addr() + offset,
    /// length: len }`. Adjacent segments with contiguous physical addresses
    /// are merged into a single entry (reducing SGL depth for the driver).
    pub fn from_bio_segments(bio: &Bio) -> Self {
        let mut sgl = DmaSgl { entries: ArrayVec::new() };
        for seg in bio.segments() {
            let phys = seg.page.phys_addr() + seg.offset as u64;
            let len = seg.len;
            // Merge with previous entry if physically contiguous.
            if let Some(last) = sgl.entries.last_mut() {
                if last.dma_addr + last.length as u64 == phys {
                    last.length += len;
                    continue;
                }
            }
            sgl.entries.push(DmaScatterEntry {
                dma_addr: phys,
                length: len,
                _pad: 0,
            });
        }
        sgl
    }
}

/// DMA transfer direction. Controls cache coherency operations on non-coherent
/// architectures and informs IOMMU access-permission bits.
#[repr(u8)]
pub enum DmaDirection {
    /// Data flows from DRAM to device (write DMA, e.g., NIC Tx, storage write).
    ToDevice      = 0,
    /// Data flows from device to DRAM (read DMA, e.g., NIC Rx, storage read).
    FromDevice    = 1,
    /// Bidirectional (used for control transfers, e.g., USB SETUP packets).
    Bidirectional = 2,
    /// DMA address is mapped but no transfer is in progress.
    None          = 3,
}

/// Convert a `DmaDirection` to the corresponding IOMMU access permission bits.
/// Used internally by `dma_map_single()`, `dma_map_sgl()`, and `iommu_map()` to
/// set the correct read/write permissions on IOMMU page table entries.
///
/// | DmaDirection | DmaProt | Rationale |
/// |---|---|---|
/// | `ToDevice` | `READ` | Device reads from DRAM (device-side read access). |
/// | `FromDevice` | `WRITE` | Device writes to DRAM (device-side write access). |
/// | `Bidirectional` | `READ \| WRITE` | Device may read and write. |
/// | `None` | `0` (empty) | No active transfer; mapping exists for address reservation only. |
///
/// Note: the direction is from the **device's** perspective for IOMMU permissions.
/// `ToDevice` means the CPU writes data then the device reads it via DMA — so the
/// IOMMU page table entry needs device-side READ permission.
impl From<DmaDirection> for DmaProt {
    fn from(dir: DmaDirection) -> DmaProt {
        match dir {
            DmaDirection::ToDevice      => DmaProt::READ,
            DmaDirection::FromDevice    => DmaProt::WRITE,
            DmaDirection::Bidirectional => DmaProt::READ | DmaProt::WRITE,
            DmaDirection::None          => DmaProt::empty(),
        }
    }
}

/// A coherent (consistent) DMA buffer.
///
/// Memory is visible to both CPU and device without explicit cache synchronization.
/// On cache-coherent architectures: allocated with uncached or write-combining mapping.
/// On non-coherent architectures: allocated with non-cacheable page table attributes.
///
/// Coherent buffers are used for: control ring descriptors, command queues, status
/// pages — any memory that is both written and read by both CPU and device.
///
/// On drop: the kernel calls `dma_free_coherent` automatically (RAII).
pub struct CoherentDmaBuf {
    /// Kernel virtual address for CPU access. Valid for the lifetime of this struct.
    pub cpu_addr: NonNull<u8>,
    /// Device-visible DMA address for programming into device registers.
    pub dma_addr: DmaAddr,
    /// Allocation size in bytes (page-aligned).
    pub size:     usize,
    /// Device that owns this allocation (for IOMMU unmap on drop).
    device_id:    DeviceId,
}

/// A streaming DMA mapping.
///
/// Maps an existing CPU buffer for temporary device access. Unlike `CoherentDmaBuf`,
/// the CPU must call `dma_sync_for_cpu` before reading and `dma_sync_for_device`
/// before the device reads. These are no-ops on cache-coherent architectures but
/// perform cache range operations on non-coherent architectures.
///
/// Streaming mappings are used for: packet data buffers, block I/O data pages —
/// memory that is either read OR written by the device (not both concurrently).
pub struct StreamingDmaMap {
    /// Kernel virtual address of the mapped buffer. Used by
    /// `dma_sync_for_cpu` / `dma_sync_for_device` on non-coherent
    /// architectures for cache line operations (which require CPU VA):
    /// AArch64 `dc civac, Xt`, ARMv7 `DCCIMVAC`, RISC-V `CBO.CLEAN`.
    /// Set at `dma_map_single()` time from the caller's `virt_addr`.
    pub cpu_addr: NonNull<u8>,
    pub dma_addr: DmaAddr,
    pub size:     usize,
    pub dir:      DmaDirection,
    device_id:    DeviceId,
}

/// Errors returned by DMA operations.
#[derive(Debug)]
pub enum DmaError {
    /// No physically-contiguous memory available within the device's DMA mask.
    /// May resolve after memory compaction ([Section 4.7](#transparent-huge-page-promotion-and-memory-compaction)). Caller may retry.
    OutOfMemory,
    /// IOMMU IOVA space exhausted or IOMMU returned a mapping fault.
    IommuMapFailed,
    /// Device's DMA mask is too restrictive for the requested physical region.
    /// Use SWIOTLB path or allocate from a lower memory zone.
    AddressRangeTooSmall,
    /// The scatter-gather list would exceed `DMA_MAX_SGE` entries.
    /// Caller must split the I/O operation into smaller chunks.
    ScatterTooLarge,
    /// The SWIOTLB bounce buffer pool has been released (post-boot, all
    /// devices have IOMMU coverage). Streaming DMA map falls through to
    /// direct mapping. If a device is hot-plugged that requires bounce
    /// buffering and no SWIOTLB pool exists, `dma_map_single` returns
    /// this error and the device cannot perform DMA until the operator
    /// provides IOMMU coverage or reboots with `umka.swiotlb_mb=N`.
    PoolReleased,
    /// The specified (address, size) range is invalid for the IOMMU domain
    /// (not mapped, or exceeds domain bounds). Returned by `dma_unmap_*`
    /// when the caller passes a DMA address that was never mapped, or by
    /// `dma_map_*` when the requested physical range falls outside the
    /// device's IOMMU domain window.
    InvalidRange,
}

4.14.3 DmaDevice Trait¶

Every device that can initiate DMA implements (or receives) a DmaDevice handle. The kernel creates this handle during device registration, populated with the device's IOMMU domain (if any) and DMA mask.

/// Concrete DMA handle created by the kernel for each DMA-capable device at probe time.
///
/// Constructed in `device_init()` ([Section 11.4](11-drivers.md#device-registry-and-bus-management--device-matching)):
/// 1. Kernel looks up or creates the device's IOMMU domain ([Section 11.5](11-drivers.md#iommu-and-dma-mapping--iommu-groups)).
/// 2. Kernel reads the device's DMA address mask from bus enumeration (PCI BAR config,
///    DT `dma-ranges`, or firmware table).
/// 3. Kernel allocates `DmaDeviceHandle` and passes it to the driver via the KABI
///    init function's `dma: &dyn DmaDevice` parameter.
///
/// Drivers never construct this struct; they receive it as a trait object reference.
// Kernel-internal, not KABI — drivers access via DmaDevice trait.
pub struct DmaDeviceHandle {
    /// IOMMU domain this device belongs to. `None` for devices behind no IOMMU
    /// (in which case SWIOTLB bounce buffering is used for addresses exceeding
    /// `dma_mask`). Assigned at probe time; normally stable for the device's
    /// lifetime.
    ///
    /// **VFIO reassignment**: When a device is assigned to a VM via VFIO
    /// ([Section 18.5](18-virtualization.md#vfio-and-iommufd-device-passthrough-framework)), its IOMMU domain
    /// changes from the host kernel domain to a VM-specific domain. To handle
    /// this safely, the domain reference is RCU-protected:
    /// - Readers (DMA map/unmap on the hot path) use `rcu_read_lock()` to
    ///   obtain a reference to the current domain. No locking overhead beyond
    ///   the RCU read-side (preempt_disable on non-PREEMPT_RT).
    /// - VFIO reassignment (cold path): publishes the new domain via
    ///   `rcu_assign_pointer()`, then calls `synchronize_rcu()` (blocking)
    ///   to wait for the grace period before dropping the old domain
    ///   reference. `synchronize_rcu()` is chosen over `call_rcu()` because
    ///   VFIO reassignment is a cold path where the old IOMMU domain must
    ///   be fully quiesced before the new domain handles DMA. All in-flight
    ///   DMA operations using the old domain complete before the grace
    ///   period ends (DMA map/unmap are bounded operations).
    pub iommu_domain: RcuCell<Option<Arc<IommuDomain>>>,
    /// Maximum DMA-addressable physical address (e.g., 0xFFFF_FFFF for 32-bit,
    /// 0xFFFF_FFFF_FFFF_FFFF for 64-bit). Discovered from bus configuration.
    pub dma_mask: u64,
    /// Coherent DMA mask — for allocations that must be simultaneously visible to
    /// CPU and device without explicit cache maintenance. Often the same as `dma_mask`,
    /// but may be smaller on devices with limited coherent addressing.
    pub coherent_dma_mask: u64,
    /// True if the IOMMU domain is configured in identity/passthrough mode
    /// (IOVA == physical address, no translation). When set,
    /// `streaming_zone_for_device()` must respect `dma_mask` directly
    /// because the IOMMU provides no address remapping benefit. Set at
    /// probe time from the IOMMU domain configuration; `false` when
    /// `iommu_domain` is `None`. Determined during device probe: if the
    /// IOMMU driver reports `IommuTranslationMode::Identity` for this
    /// device's IOMMU domain, set to `true`. Default: `false` (assume
    /// full translation until IOMMU probe confirms otherwise).
    pub iommu_identity_mapped: bool,
    /// Maximum size (bytes) of a single scatter-gather entry. The DMA mapping
    /// layer splits entries at this boundary, guaranteeing no single
    /// ScatterList.dma_length exceeds this value. This prevents u32 overflow
    /// on devices with large physically-contiguous regions (>4 GiB).
    /// Default: `SZ_64K` (65536) for most bus types (PCIe, USB, SCSI).
    /// NVMe/RDMA: `u32::MAX` (4 GiB - 1). Set during device probe from
    /// bus-specific configuration. Matches Linux `dma_max_seg_size`.
    pub max_segment_size: u32,
}

/// Implemented by the kernel DMA subsystem for each registered DMA-capable device.
///
/// Drivers do not implement this trait; they receive a reference to their device's
/// DMA handle through the driver framework ([Section 11.4](11-drivers.md#device-registry-and-bus-management)).
/// The concrete implementation is `DmaDeviceHandle` (above).
pub trait DmaDevice: Send + Sync {
    /// Allocate coherent DMA memory.
    ///
    /// Returns a buffer where CPU and device share coherent access. The `dma_addr`
    /// field of the returned struct is the value to program into hardware registers.
    ///
    /// Allocation policy:
    /// - If device has IOMMU: allocate from any physical zone; map into IOMMU domain.
    /// - If no IOMMU AND size fits in coherent_dma_mask: allocate physically-contiguous from
    ///   the appropriate zone (DMA32 for 32-bit devices, NORMAL otherwise).
    /// - If no IOMMU AND allocation would exceed coherent_dma_mask: fail with OutOfMemory;
    ///   coherent allocation cannot use SWIOTLB.
    fn dma_alloc_coherent(
        &self,
        size: usize,
        gfp:  GfpFlags,
    ) -> Result<CoherentDmaBuf, DmaError>;

    // Non-cacheable mapping for non-coherent architectures:
    //
    // On non-coherent architectures (ARMv7, some RISC-V implementations),
    // `dma_alloc_coherent` must ensure CPU-device coherence without
    // explicit cache flushes on every access. After allocating pages, the
    // per-architecture `DmaArchOps::setup_coherent_mapping()` sets the
    // page table attributes to non-cacheable (`pgprot_noncached()`).
    // This guarantees that all CPU writes are immediately visible to the
    // device without requiring explicit cache maintenance operations.
    //
    // On cache-coherent architectures (x86-64, most AArch64), this step
    // is a no-op — the hardware cache coherency protocol ensures
    // visibility automatically, and the pages retain normal cacheable
    // attributes for better CPU access performance.

    /// Free a coherent DMA buffer. Unmaps from IOMMU, returns pages to allocator.
    fn dma_free_coherent(&self, buf: CoherentDmaBuf);

    /// Map an existing kernel virtual buffer for streaming DMA.
    ///
    /// Does NOT copy data. On cache-coherent architectures: programs the IOMMU
    /// and returns the IOVA. On non-coherent architectures: additionally flushes
    /// cache lines for the buffer range (ToDevice direction) or invalidates them
    /// (FromDevice direction).
    ///
    /// The buffer must remain valid (not freed, not moved) until `dma_unmap_single`.
    fn dma_map_single(
        &self,
        virt_addr: NonNull<u8>,
        size:      usize,
        dir:       DmaDirection,
    ) -> Result<StreamingDmaMap, DmaError>;

    /// Unmap a streaming DMA mapping.
    ///
    /// After this call, the device MUST NOT access the previously-mapped memory.
    /// On non-coherent architectures: invalidates cache lines (FromDevice) or
    /// writes back dirty lines (ToDevice) before releasing the mapping.
    fn dma_unmap_single(&self, map: StreamingDmaMap);

    /// Map a scatter-gather list of physical pages for DMA.
    ///
    /// Each `PhysPage` is mapped into the device's IOMMU domain (or returned as
    /// its physical address on SWIOTLB systems). The resulting `DmaSgl` contains
    /// device-visible addresses for all segments.
    fn dma_map_sgl(
        &self,
        pages: &[PhysPage],
        dir:   DmaDirection,
    ) -> Result<DmaSgl, DmaError>;

    /// Unmap a scatter-gather list. All DmaAddr values in the SGL become invalid.
    fn dma_unmap_sgl(&self, sgl: DmaSgl, dir: DmaDirection);

    /// Synchronize a streaming mapping for CPU access.
    ///
    /// Must be called before the CPU reads from a `FromDevice` mapping or before
    /// the CPU writes to a `ToDevice` mapping (to avoid stale cache lines).
    ///
    /// No-op on: x86-64, AArch64+CCI, PPC64LE, s390x, LoongArch64 3A6000+ (cache-coherent).
    /// Performs cache range invalidation on: ARMv7, bare AArch64, RISC-V, LoongArch64 3A5000.
    fn dma_sync_for_cpu(&self, map: &StreamingDmaMap);

    /// Synchronize a streaming mapping for device access.
    ///
    /// Must be called after the CPU writes to a `ToDevice` mapping, before the
    /// device reads from it, to ensure the data is flushed from CPU caches.
    ///
    /// No-op on cache-coherent architectures. Cache writeback on non-coherent.
    fn dma_sync_for_device(&self, map: &StreamingDmaMap);

    /// The device's DMA address mask (e.g., 0xFFFF_FFFF for 32-bit DMA devices).
    /// Used to determine whether SWIOTLB bounce buffering is needed.
    fn dma_mask(&self) -> u64;

    /// True if this device's DMA is routed through an IOMMU domain.
    fn has_iommu(&self) -> bool;
}

4.14.3.1.1 `DmaArchOps` — Per-Architecture Cache Coherency Operations¶

Each architecture implements DmaArchOps to handle cache maintenance differences. On cache-coherent architectures (x86-64, most AArch64), these are no-ops. On non-coherent architectures (ARMv7, some RISC-V), they perform the necessary cache line flushes/invalidations and page table attribute changes.

/// Per-architecture DMA cache coherency operations. Implemented once per arch
/// in `umka_core::arch::{arch}::dma`. The DMA subsystem dispatches through
/// `arch::current::dma::DMA_ARCH_OPS` for all cache maintenance.
pub trait DmaArchOps: Send + Sync {
    /// Set up a coherent (non-cacheable) CPU mapping for DMA pages.
    /// On non-coherent architectures, this remaps the pages with
    /// `pgprot_noncached()` attributes. On coherent architectures (x86-64),
    /// this is a no-op — hardware cache coherency handles visibility.
    fn setup_coherent_mapping(
        &self,
        pages: &[Page],
        size: usize,
    ) -> Result<DmaAddr, DmaError>;

    /// Synchronize a DMA buffer for CPU access after device writes.
    /// On non-coherent architectures: invalidate cache lines covering
    /// the range so the CPU reads device-written data from memory.
    /// `cpu_addr` is the kernel virtual address (required for cache
    /// line operations: AArch64 `dc civac`, ARMv7 `DCCIMVAC`, RISC-V `CBO.INVAL`).
    /// On coherent architectures: no-op.
    fn sync_for_cpu(&self, cpu_addr: NonNull<u8>, size: usize, dir: DmaDirection);

    /// Synchronize a DMA buffer for device access after CPU writes.
    /// On non-coherent architectures: flush (clean) cache lines covering
    /// the range so the device reads CPU-written data from memory.
    /// `cpu_addr` is the kernel virtual address (required for cache operations).
    /// On coherent architectures: no-op.
    fn sync_for_device(&self, cpu_addr: NonNull<u8>, size: usize, dir: DmaDirection);
}

4.14.3.2 IOMMU Range Operations¶

/// Unmap a contiguous range of IOVA space from an IOMMU domain and
/// flush the IOMMU TLB to ensure no stale translations remain. Used
/// by KVM memslot removal and VFIO container teardown to revoke device
/// DMA access to freed host pages.
///
/// Internally performs: (1) walk the IOMMU page table, clear PTEs for
/// `[iova .. iova + size)`, (2) issue `flush_iotlb()` for the range,
/// (3) wait for flush completion (synchronous).
///
/// # Errors
/// Returns `DmaError::InvalidRange` if `[iova, iova+size)` is not
/// fully mapped in the domain.
pub fn iommu_unmap_range(
    domain: &IommuDomain,
    iova: u64,
    size: u64,
) -> Result<(), DmaError>;

4.14.3.3 DMA Zone Selection¶

All DMA buffer allocations must respect the device's addressing capabilities. Three zone selection functions serve different allocation paths:

/// Zone selection for coherent DMA allocations (dma_alloc_coherent).
/// Returns the GfpFlags zone modifier based on the device's coherent_dma_mask.
pub fn coherent_zone_for_device(dev: &DmaDevice) -> GfpFlags {
    // Less-than-or-equal matches Linux's dma_direct_optimal_gfp_mask():
    // a device whose mask exactly equals DMA_BIT_MASK(24) can only address
    // the DMA zone, so it MUST be constrained to that zone.
    if dev.coherent_dma_mask <= DMA_BIT_MASK(24) { GfpFlags::DMA }
    else if dev.coherent_dma_mask <= DMA_BIT_MASK(32) { GfpFlags::DMA32 }
    else { GfpFlags::empty() } // Normal zone
}

/// Zone selection for streaming DMA allocations.
/// Called by subsystems that pre-allocate buffers for streaming DMA
/// (e.g., network ring buffers, SCSI command buffers).
/// Uses dma_mask (not coherent_dma_mask) because streaming mappings
/// go through the IOMMU when available.
pub fn streaming_zone_for_device(dev: &DmaDevice) -> GfpFlags {
    // iommu_domain is an RcuCell (see DmaDevice field docs): VFIO reassignment
    // can change it concurrently. We must hold an RCU read guard to safely
    // dereference the current domain pointer. Without rcu_read_lock(), the
    // domain could be freed between our is_some() check and any subsequent
    // use — a use-after-free.
    let _rcu = rcu_read_lock();
    let has_iommu = dev.iommu_domain.borrow(&_rcu).is_some();
    if has_iommu && !dev.iommu_identity_mapped {
        GfpFlags::empty() // IOMMU translates — any zone is fine
    } else {
        // No IOMMU or identity-mapped: must respect dma_mask directly
        if dev.dma_mask <= DMA_BIT_MASK(24) { GfpFlags::DMA }
        else if dev.dma_mask <= DMA_BIT_MASK(32) { GfpFlags::DMA32 }
        else { GfpFlags::empty() }
    }
}

Identity-mapped IOMMU: When an IOMMU is configured in identity/passthrough mode (dev.iommu_identity_mapped == true), IOMMU translation is 1:1 (virtual == physical). In this case, the device's dma_mask MUST be respected directly — the IOMMU provides no address translation benefit. streaming_zone_for_device() handles this by falling through to the dma_mask check.

Call sites: coherent_zone_for_device() is called by dma_alloc_coherent() (see the DmaDevice trait above). streaming_zone_for_device() is called by NIC ring buffer allocation, SCSI command buffer allocation, and any subsystem pre-allocating DMA buffers.

4.14.4 Cache Coherency Per Architecture¶

Streaming DMA operations require explicit cache management on non-coherent architectures. dma_sync_for_cpu and dma_sync_for_device perform the following architecture-specific operations:

Architecture	Coherent?	`sync_for_cpu` action	`sync_for_device` action
x86-64	Yes	No-op (TSO + MESI coherency)	No-op
AArch64 + CCI	Yes	No-op (cache coherent interconnect)	No-op
AArch64 bare SoC	No	`dc civac` (clean+invalidate)	`dc civac` (writeback+invalidate)
ARMv7	No	`CP15 DCIMVAC` range (invalidate)	`CP15 DCCMVAC` range (clean)
RISC-V	No	`fence.i` + SBI or MMIO flush	`fence` + SBI or MMIO flush
PPC32	No	`dcbi` (invalidate range)	`dcbst + sync` (flush range)
PPC64LE	Yes	No-op (POWER MESI coherency)	No-op
s390x	Yes	No-op (TSO)	No-op
LoongArch64 3A6000	Yes	No-op (HW cache coherency)	No-op
LoongArch64 3A5000	No	Cache invalidate range	Cache writeback range

"AArch64 bare SoC" refers to AArch64 platforms without a cache-coherent interconnect (CCI or CMN), such as some Cortex-A55 clusters. The presence of CCI is detected at boot time from the device tree (compatible = "arm,cci-400" or equivalent).

On RISC-V, the cache flush mechanism depends on the SoC: systems that implement the Svpbmt extension can use page-table-based cache control; others require SBI or memory-mapped cache controller registers. The arch-specific dma_sync_* implementation selects the appropriate mechanism at boot.

s390x does not use DMA in the traditional sense. Native s390x devices use Channel I/O: the channel subsystem executes CCW programs (Channel Command Words) that specify memory addresses directly. The channel subsystem performs the data transfer — no IOMMU, no scatter-gather lists, and no DMA mapping API for native channel I/O devices. For virtio devices over CCW transport (virtio-ccw), the virtio-ccw layer handles address translation and the DmaDevice trait applies normally. Native channel I/O devices use the CCW program model instead (see Section 11.10). Cache coherency is not a concern: s390x is TSO (total store ordering) and data DMA through the channel subsystem is architecturally coherent. (I-cache is architecturally incoherent and requires explicit CSP/IPTE for code modification, but this does not affect data DMA paths.)

LoongArch64 uses standard DMA via PCIe with IOMMU support provided by the Loongson 7A1000/7A2000 bridge chip. On 3A6000 and later, DMA is hardware cache-coherent (no explicit cache maintenance needed). On 3A5000, DMA may be non-coherent and the dma_sync_* implementation performs explicit cache maintenance operations. The standard DmaDevice trait applies to all LoongArch64 devices. SWIOTLB fallback is available for devices with limited addressing (e.g., 32-bit DMA mask on a 64-bit system with DRAM above 4 GB).

4.14.4.1 DMA Descriptor Ring Memory Ordering¶

All DMA descriptor ring protocols (virtio virtqueues, NVMe submission/completion queues, network TX/RX rings, block I/O scatter-gather rings) must use explicit DMA memory barriers between descriptor writes and doorbell notifications:

// Producer (CPU → device): write descriptors, then signal device
write_descriptor(ring, desc);           // Store descriptor fields
dma_wmb();                              // Ensure descriptor stores are visible to device
write_doorbell(ring, new_tail);         // Notify device of new work

// Consumer (device → CPU): read doorbell, then read descriptors
let new_head = read_doorbell(ring);     // Check for completed work
dma_rmb();                              // Ensure descriptor reads see device's stores
let result = read_descriptor(ring, idx); // Read completion data

dma_wmb() orders CPU stores to DMA-visible memory. On x86-64 (TSO): compiler barrier only (hardware provides store-store ordering). On AArch64/ARMv7: DMB OSHST (outer-shareable store barrier). On RISC-V: fence ow,ow (orders device outputs and memory writes). On PPC64: lwsync (light-weight sync; not eieio — eieio does not order cacheable stores on POWER8+, Section 2.18). On PPC32 (e500): sync (lwsync causes an Illegal Instruction trap on e500v1/v2 cores; sync/msync must be used instead).

dma_rmb() orders CPU loads from DMA-visible memory. On x86-64: compiler barrier. On AArch64/ARMv7: DMB OSHLD (outer-shareable load barrier). On RISC-V: fence ir,ir (orders device inputs and memory reads). On PPC64: lwsync. On PPC32 (e500): sync (e500v1/v2 lack lwsync). On s390x: compiler barrier only (TSO — hardware provides total store ordering; all loads and stores are sequentially consistent with respect to other CPUs and I/O). On LoongArch64: dbar 0x0A for dma_wmb() (store-store completion-scoped barrier for DMA-visible memory), dbar 0x05 for dma_rmb() (load-load completion-scoped barrier). LoongArch dbar hint values encode the barrier type in bits [4:0], with bit 4 indicating the scope: 0 = completion-scoped (stronger, waits for prior operations to complete), 1 = ordering-scoped (weaker, only orders but does not wait for completion). 0x0A (c_w_w) is the completion-scoped store-store barrier; 0x05 (cr_r_) is the completion-scoped load-load barrier. The ordering-scoped equivalents 0x1A (o_w_w) and 0x15 (or_r_) are WEAKER (used by __smp_wmb() and __smp_rmb() in Linux). For DMA rings, completion-scoped barriers are correct because the CPU must ensure descriptor writes are visible to the device before signaling the doorbell (see LoongArch Reference Manual vol 1, §2.2.10.1; Linux arch/loongarch/include/asm/barrier.h). A full dbar 0 (unconditional barrier) is never needed for DMA descriptor rings because the ordered write-then-read pattern does not require load-store ordering. On 3A6000+ with hardware cache coherency, dbar hints compile to no-ops but are retained for forward compatibility with weaker-ordered future implementations.

Omitting these barriers causes descriptor corruption — the device may read partially written descriptors or the CPU may read stale completion data. This is a silent data corruption bug, not a crash, making it extremely difficult to diagnose.

4.14.5 SWIOTLB (Software IOMMU / Bounce Buffering)¶

SWIOTLB is activated only when at least one of the following conditions is true:

No IOMMU + restricted DMA mask: At least one device has no IOMMU coverage (bare physical address DMA) AND that device's dma_mask() is smaller than the top of physical RAM.
Tier 2 sub-page DMA: A Tier 2 (userspace) driver performs streaming DMA to an offset within a page where the DMA region does not start at a page boundary. IOMMU translation operates at page granularity — it cannot restrict device access to a sub-page region. Without bounce buffering, the device could read or write adjacent data within the same page that belongs to a different security context. SWIOTLB copies the sub-page region into a dedicated bounce buffer slot (aligned to SWIOTLB_SLOT_SIZE), maps only that slot for the device, and copies back on unmap. This ensures Tier 2 DMA isolation at sub-page granularity.

When sub-page bounce is needed: offset_within_page + size does not span the entire page AND the device is Tier 2. Tier 1 drivers are trusted with full page access (they share the kernel address space); only Tier 2 drivers require sub-page isolation.

This covers: legacy PCI devices on x86 without VT-d (32-bit DMA mask, >4 GB DRAM), ARMv7 SoCs with DRAM above 4 GB and no IOMMU, PCIe devices on early RISC-V boards without an IOMMU, and LoongArch64 PCIe devices with 32-bit DMA masks on systems with DRAM above 4 GB. On modern x86-64 server systems with Intel VT-d enabled, SWIOTLB is allocated at boot (Phase 1.2.2) but released post-boot after IOMMU enumeration confirms full device coverage, reclaiming the memory. On s390x, SWIOTLB is never needed: native devices use Channel I/O (no DMA mapping) and virtio-ccw handles its own address translation.

/// Software IOMMU pool: physically-contiguous memory in the lowest-addressable DRAM.
///
/// Allocated once at boot from zone DMA or DMA32. Size is configurable via
/// `umka.swiotlb_mb=N` (default: min(64, RAM_GB) MB).
pub struct SwiotlbPool {
    base_phys:   PhysAddr,
    size:        usize,
    /// Tracks free/used 2 KB slots. Allocation is first-fit; no coalescing needed
    /// because SWIOTLB allocations have bounded lifetime (unmap frees the slot).
    slot_bitmap: SpinLock<Bitmap>,
}

/// Minimum SWIOTLB slot size. Allocations are rounded up to this.
pub const SWIOTLB_SLOT_SIZE: usize = 2048;

pub static SWIOTLB: OnceCell<SwiotlbPool> = OnceCell::new();

impl SwiotlbPool {
    /// Release the SWIOTLB pool back to the buddy allocator if no device
    /// requires it. Called once after all device probing completes (Phase 5.x
    /// in [Section 2.3](02-boot-hardware.md#boot-init-cross-arch)), NOT at IOMMU enumeration time (Phase 4.1).
    ///
    /// # Why Phase 5.x, not Phase 4.1
    ///
    /// The release decision requires iterating all probed devices to check
    /// whether any lack IOMMU coverage. At Phase 4.1 (IOMMU init), no devices
    /// have been probed yet — `device_registry::iter_all()` returns an empty
    /// iterator, so the check would always conclude "no device needs SWIOTLB"
    /// and incorrectly release the pool. Devices are probed at Phase 4.4a
    /// (bus enumeration) and Phase 5.1 (Tier 0 drivers), so the release check
    /// must run after all probing completes.
    ///
    /// Boot sequence:
    ///   Phase 1.2.2: SWIOTLB pool allocated from low memory
    ///   Phase 4.1:   IOMMU initialized
    ///   Phase 4.4a:  Devices probed, IOMMU domains assigned
    ///   Phase 5.1:   Tier 0 drivers loaded
    ///   Phase 5.x:   swiotlb_release_if_unused() — iterate probed devices,
    ///                release pool if all have IOMMU coverage
    ///
    /// # Algorithm
    ///
    /// 1. Iterate all probed devices in the device registry.
    /// 2. For each device, call `device_needs_swiotlb()`. If any device
    ///    returns `true`, the pool is retained (at least one device lacks
    ///    full IOMMU coverage and may need bounce buffering).
    /// 3. If no device needs SWIOTLB: free the pool's physically-contiguous
    ///    pages back to the buddy allocator via `free_pages(base_phys, size)`.
    ///    The `OnceCell` remains set (to avoid race conditions with concurrent
    ///    callers) but the pool is marked as released — subsequent `alloc_slot()`
    ///    calls return `Err(DmaError::PoolReleased)`.
    /// 4. Log the result: either "SWIOTLB: released N MB (all devices have
    ///    IOMMU coverage)" or "SWIOTLB: retained N MB (M devices require
    ///    bounce buffering)".
    ///
    /// # Safety
    ///
    /// Must be called exactly once, after all device probing is complete and
    /// before any DMA mapping request could race with the release. The boot
    /// sequence guarantees this: device probing (Phase 4.4a + 5.1) completes
    /// before the SWIOTLB release check runs at Phase 5.x.
    pub fn release_if_unused(&self) -> bool {
        let any_needs_swiotlb = device_registry::iter_all()
            .any(|dev| device_needs_swiotlb(&dev));
        if any_needs_swiotlb {
            klog!(Info, "SWIOTLB: retained {} MB ({} devices require bounce buffering)",
                  self.size >> 20,
                  device_registry::iter_all().filter(|d| device_needs_swiotlb(d)).count());
            return false;
        }
        // No device needs SWIOTLB — return memory to the buddy allocator.
        free_pages(self.base_phys, self.size);
        self.slot_bitmap.lock().mark_released();
        klog!(Info, "SWIOTLB: released {} MB (all devices have IOMMU coverage)",
              self.size >> 20);
        true
    }
}

SWIOTLB pool release: After all device probing completes (Phase 5.x), the boot sequence calls SWIOTLB.get().map(|pool| pool.release_if_unused()). The release runs post-probe, not at IOMMU init time (Phase 4.1), because the decision requires iterating probed devices — at Phase 4.1 the device registry is empty. On modern x86-64 servers with Intel VT-d covering all devices, this reclaims 64 MB of low memory.

Hot-plug-capable systems: On systems where PCIe hot-plug is available (detected by the presence of a hot-plug controller capability in any root port or downstream port during PCI enumeration), the SWIOTLB pool is never released, even if all currently probed devices have IOMMU coverage. A legacy device hot-plugged later may lack IOMMU support and require bounce buffering. The release_if_unused() method checks pci_hotplug_capable() and returns false without freeing if hot-plug hardware is present. Release is only performed on embedded or fixed-topology systems (no hot-plug controller, no Thunderbolt/USB4 PCIe tunneling) where the device population is guaranteed static after boot.

On systems where release does occur: the pool is never re-allocated after release — if a device is somehow attached that requires bounce buffering and no SWIOTLB pool exists, dma_map_single returns Err(DmaError::PoolReleased) and the device cannot perform DMA until the operator provides IOMMU coverage or reboots with umka.swiotlb_mb=N.

SWIOTLB bounce algorithm for dma_map_single on a device without IOMMU:

1. If phys_addr + size <= device.dma_mask():
   — Buffer is already in the device-accessible range.
   — Return phys_addr as DmaAddr directly (no copy, no slot allocation).

2. Else (buffer is above the DMA mask):
   — Allocate a slot from SwiotlbPool (fails with OutOfMemory if pool is full).
   — If ToDevice or Bidirectional: copy CPU buffer → SWIOTLB slot.
   — Return slot.base_phys as DmaAddr.
   — On dma_unmap_single: if FromDevice or Bidirectional: copy slot → CPU buffer.
   — Free the SWIOTLB slot.

SWIOTLB pool exhaustion: If no slot is available, dma_map_single returns Err(DmaError::OutOfMemory). The driver must handle this (typically by returning an error to the I/O request, which is retried by the block/network layer). The pool size should be tuned to match the worst-case concurrent DMA of all attached legacy devices; the umka.swiotlb_mb=N parameter allows operator tuning.

4.14.5.1 SWIOTLB Activation Decision Tree¶

For any given DMA mapping request on a specific device, the following decision tree determines whether SWIOTLB bounce buffering is used. The decision is per-mapping (not per-device), because a device with a 32-bit DMA mask may have some buffers below 4 GB (direct) and others above (bounced):

dma_map_*(device, phys_addr, size, dir):
  │
  ├─ Device has IOMMU with full translation?
  │    │
  │    ├─ Yes → IOMMU allocates IOVA, maps phys page into device domain.
  │    │         No SWIOTLB needed regardless of phys_addr or dma_mask.
  │    │         (IOVA space is always within the device's addressable range.)
  │    │
  │    └─ No (identity-mapped IOMMU or no IOMMU)
  │         │
  │         ├─ phys_addr + size - 1 <= device.dma_mask?
  │         │    │
  │         │    ├─ Yes → Direct mapping: DmaAddr = phys_addr.
  │         │    │         No bounce buffer needed.
  │         │    │
  │         │    └─ No (buffer above device's addressable range)
  │         │         │
  │         │         ├─ SWIOTLB pool initialized?
  │         │         │    │
  │         │         │    ├─ Yes → Bounce: allocate SWIOTLB slot (always within
  │         │         │    │         dma_mask range), copy data if ToDevice/Bidir,
  │         │         │    │         return slot phys_addr as DmaAddr.
  │         │         │    │
  │         │         │    └─ No → Return Err(DmaError::AddressRangeTooSmall).
  │         │         │           (System misconfiguration: device with restricted
  │         │         │           mask but SWIOTLB not initialized at boot.)
  │         │
  │         └─ [Tier 2 sub-page DMA check — see condition 2 in SWIOTLB section above]
  │              If Tier 2 driver AND mapping does not cover full page(s):
  │              → Bounce via SWIOTLB for sub-page isolation.

Identity-mapped IOMMU: Some firmware configurations set up the IOMMU in pass-through / identity-map mode (DMA address = physical address). This occurs when: (a) the BIOS/UEFI configures VT-d in pass-through mode, (b) the operator sets umka.iommu=passthrough on the command line, or (c) the IOMMU driver detects that all devices are in a default domain with identity mapping. In this mode, the IOMMU provides no address translation — it effectively acts as if absent. SWIOTLB bounce is required for any device whose dma_mask is smaller than the highest physical address in the system, exactly as in the no-IOMMU case.

/// Per-device SWIOTLB activation predicate.
///
/// Called during device probe to determine if this device may require SWIOTLB.
/// If true, the global SWIOTLB pool must be initialized before this device
/// can perform DMA. Called by `device_init()` after DMA mask discovery.
pub fn device_needs_swiotlb(device: &DmaDeviceHandle) -> bool {
    let has_full_iommu = device.iommu_domain.read().as_ref()
        .map_or(false, |dom| dom.translation_mode() == IommuTranslationMode::Full);
    if has_full_iommu {
        return false;  // Full IOMMU translation — IOVA always within range.
    }
    // Identity-mapped IOMMU or no IOMMU: check if any physical memory
    // exceeds the device's DMA mask.
    let max_phys = phys_mem_top();  // Highest physical address with DRAM.
    device.dma_mask < max_phys
}

/// IOMMU translation mode for a device's domain.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum IommuTranslationMode {
    /// Full IOVA translation: device sees IOVAs, not physical addresses.
    /// SWIOTLB never needed.
    Full,
    /// Identity mapping: DMA address = physical address. IOMMU provides
    /// isolation (preventing device from accessing other devices' pages)
    /// but no address remapping. SWIOTLB needed if dma_mask < max phys.
    Identity,
}

4.14.6 Linux External ABI¶

/proc/iomem: SWIOTLB region appears as a SWIOTLB buffer entry (same as Linux).
/proc/swiotlb (present only if SWIOTLB is active): pool size, used slots, slot size.
dma-buf file-descriptor-based DMA buffer sharing: the DMA_BUF_IOCTL_SYNC ioctl maps to dma_sync_for_cpu (with DMA_BUF_SYNC_START) and dma_sync_for_device (with DMA_BUF_SYNC_END). DMA-BUF is used by the camera subsystem (Section 13.16), GPU (Section 13.5), and media pipeline (Section 13.7) for zero-copy buffer sharing between devices.
GFP_DMA / GFP_DMA32 flags (Section 4.2) continue to work as in Linux: they request allocation from the low-memory zones suitable for legacy DMA devices.

4.14.7 Tier 1 DMA Allocation Path¶

Tier 1 DMA allocation path: Tier 1 drivers allocate DMA buffers via the DeviceResources KABI table (Section 11.4). The allocation enters the DMA subsystem which selects the appropriate backing depending on the device's coherency model (cache-coherent: direct from buddy; non-coherent: SWIOTLB bounce buffer). DMA buffers are within the Tier 1 driver's memory domain but physically owned by the kernel — the domain's protection key grants read/write access without the driver controlling the physical address.

DMA-BUF is a kernel framework for sharing DMA buffers between device drivers via file descriptors. A buffer allocated by one driver (the exporter, e.g., a camera driver) can be imported by another driver (the importer, e.g., a GPU) without any data copy. The file descriptor acts as a cross-process, cross-driver handle to the underlying physical pages.

Primary consumers: Camera → GPU rendering, GPU → display, video decoder → display, GPU → network (RDMA), and any pipeline where data flows between distinct device drivers without touching userspace.

/// Userspace-visible handle to a DMA-BUF file descriptor.
///
/// This is the type used in all device-class trait interfaces (`CameraDevice`,
/// `GpuDevice`, `AccelDevice`, `MediaCodec`, `CryptoAccelDevice`,
/// `DisplayDevice`) to pass DMA buffer references. It wraps a raw file
/// descriptor number that refers to a DMA-BUF object in the kernel's file
/// descriptor table.
///
/// The handle is created by the exporter driver (e.g., camera: `VIDIOC_EXPBUF`,
/// GPU: `DRM_IOCTL_PRIME_HANDLE_TO_FD`) and passed to importers (e.g., GPU:
/// `DRM_IOCTL_PRIME_FD_TO_HANDLE`, display: atomic commit). Closing the fd
/// decrements the DMA-BUF reference count.
///
/// **Cross-process sharing**: `DmaBufHandle` values can be passed between
/// processes via `SCM_RIGHTS` on a unix socket. The receiving process gets a
/// new fd number that references the same underlying `DmaBuf` object.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DmaBufHandle(pub i32);

/// Scatter-gather list for DMA operations.
///
/// An array of (physical_address, length) entries describing a physically
/// discontiguous buffer. Used by `DmaBufOps::map_dma_buf()` and by the
/// streaming DMA API (`DmaDevice::dma_map_sg`).
///
/// Layout is Linux-compatible (`struct scatterlist` in `linux/scatterlist.h`).
pub struct ScatterList {
    /// Physical page backing this entry.
    pub page: PageRef,
    /// Byte offset within the page.
    pub offset: u32,
    /// Length of this entry in bytes (may span multiple pages if physically
    /// contiguous; the SG allocator merges adjacent pages).
    pub length: u32,
    /// DMA address (filled by `dma_map_sg`, invalid before mapping).
    pub dma_address: u64,
    /// DMA length (may differ from `length` if SWIOTLB bouncing is active).
    pub dma_length: u32,
}

/// Reservation object for implicit DMA fence synchronization.
///
/// Tracks pending DMA operations on a DMA-BUF so that importers can wait
/// for the exporter's writes to complete before reading. Shared between
/// the DRM/GPU, camera, media, and NPU subsystems.
///
/// Contains two fence slots:
/// - `exclusive_fence`: set by the current writer (e.g., GPU render, camera
///   capture). Only one exclusive fence may be active at a time.
/// - `shared_fences`: set by concurrent readers (e.g., multiple display
///   scanout engines reading the same framebuffer).
///
/// The fence values use the `DmaFence` timeline semaphore type
/// ([Section 13.5](13-device-classes.md#gpu-compute)) which works across all device types (GPU, NPU,
/// camera, crypto engine).
pub struct ReservationObject {
    /// Lock protecting fence slot updates. Readers can check fences under
    /// RCU; writers must hold this lock.
    pub lock: SpinLock<()>,
    /// Exclusive (write) fence. Signaled when the writer finishes.
    pub exclusive_fence: Option<DmaFence>,
    /// Shared (read) fences. All must be signaled before the next exclusive
    /// writer can proceed. Bounded to MAX_SHARED_FENCES (16) — if exceeded,
    /// the oldest signaled fences are garbage-collected.
    pub shared_fences: ArrayVec<DmaFence, 16>,
}

/// DMA-BUF kernel object. Represents a shareable buffer backed by physical pages.
///
/// Created by the exporter driver via `dma_buf_export()`. The returned file
/// descriptor can be passed to other processes (via SCM_RIGHTS unix socket)
/// or to other drivers (via ioctl on a device fd).
pub struct DmaBuf {
    /// Size of the buffer in bytes.
    pub size: usize,
    /// Exporter-provided operations for mapping, unmapping, and synchronization.
    pub ops: &'static DmaBufOps,
    /// Exporter-private data (e.g., the GPU BO, camera frame buffer struct).
    ///
    /// # Safety
    ///
    /// **Ownership**: `priv_data` is owned by the exporter driver. It must
    /// point to memory that outlives the `DmaBuf`. The exporter allocates
    /// this data and frees it in `DmaBufOps::release()`.
    ///
    /// **Tier 1 crash recovery**: On Tier 1 exporter crash, the DMA-BUF
    /// crash handler nullifies `priv_data` (sets to null) and redirects
    /// all `DmaBufOps` callbacks through no-op stubs that return `EIO`.
    /// Importers receive `EIO` on all subsequent operations. The `DmaBuf`
    /// ref is released via standard refcount drain. This prevents
    /// use-after-free: the old driver domain's memory is inaccessible
    /// after crash, so no valid dereference is possible.
    ///
    /// **Allocation domain**: Exporters SHOULD allocate `priv_data` from a
    /// slab cache in the kernel core domain (Tier 0), not in the driver's
    /// Tier 1 domain, when the DMA-BUF may outlive a driver reload cycle.
    /// This ensures the data survives domain teardown.
    pub priv_data: *mut (),
    /// Reference count. Incremented on `dma_buf_get(fd)`, decremented on close.
    ///
    /// **u64 policy exemption**: Refcounts are fundamentally different from
    /// monotonic counters — they go up and down, and their maximum value at
    /// any instant is bounded by the number of live references (not cumulative
    /// over time). Each reference requires an open fd or kernel-held Arc,
    /// both bounded by process fd limits and total DMA-buf count. Maximum
    /// simultaneous references cannot exceed a few thousand per DMA-BUF.
    /// u32 matches Linux `struct file` refcount width.
    pub refcount: AtomicU32,
    /// Reservation object for implicit fencing (shared with DRM subsystem).
    /// Tracks pending GPU/DMA operations so that importers wait for completion.
    pub resv: ReservationObject,
    /// Scatter-gather table: physical page addresses for the buffer.
    /// Lazily populated on first `dma_buf_map_attachment()`.
    /// NOTE: ScatterList (physical pages), not DmaSgl (IOVA-mapped).
    /// DmaBuf tracks the physical page backing; IOVA mapping is per-attachment.
    pub sg_table: Option<ScatterList>,
    /// Associated VFS file for the fd-based lifecycle.
    pub file: Arc<OpenFile>,
}

/// Exporter-provided operations for DMA-BUF lifecycle management.
///
/// **KABI versioning**: `DmaBufOps` follows standard KABI vtable versioning
/// ([Section 12.1](12-kabi.md#kabi-overview)). The `vtable_size` field is the forward-compatibility
/// discriminant: the kernel reads only `min(vtable_size, KERNEL_DMABUF_OPS_SIZE)`
/// bytes. New methods are appended; older drivers on newer kernels have trailing
/// methods read as `None` (null function pointer).
pub struct DmaBufOps {
    /// Bounds-safety check: byte count of this vtable.
    pub vtable_size: u64,
    /// KABI version this vtable was compiled against. Used by the kernel
    /// to select compatibility shims for older drivers.
    pub kabi_version: u64,
    /// Attach an importer device. Called when a driver calls `dma_buf_attach()`.
    /// The exporter can reject incompatible devices (e.g., devices on a different
    /// IOMMU domain that cannot share the buffer's physical pages).
    ///
    /// The `dev` parameter is the kernel-internal `DmaDevice` obtained from the
    /// device registry via `DeviceNodeId` lookup. Callers at the KABI boundary
    /// pass a `DeviceNodeId`; the kernel resolves it to `&DmaDevice` before
    /// invoking this callback. Drivers never construct `DmaDevice` directly.
    ///
    /// **RCU liveness guarantee**: The caller (kernel DMA-BUF core) holds an RCU
    /// read lock (`RcuReadGuard`) for the duration of this callback. The `dev`
    /// reference is guaranteed to remain valid for the callback's lifetime because
    /// device unregistration publishes a null pointer via `rcu_assign_pointer()`
    /// and waits for a grace period before freeing the `DmaDeviceHandle`. This
    /// prevents use-after-free during VFIO device reassignment: when a device is
    /// detached from a VM and reassigned, the old `DmaDeviceHandle` is not freed
    /// until all in-flight `attach()` callbacks (which are bounded, non-sleeping
    /// operations) complete their RCU read-side critical section.
    ///
    /// The C ABI uses `*const DmaDevice` (raw pointer) because `extern "C"` FFI
    /// cannot express Rust lifetime constraints. The RCU invariant is enforced by
    /// the kernel call site, not by the type system. Drivers MUST NOT stash the
    /// `dev` pointer beyond the callback return — they must copy any needed device
    /// metadata (DMA mask, IOMMU domain reference) into the `DmaBufAttachment`
    /// during `attach()`.
    pub attach: Option<unsafe extern "C" fn(
        buf: *mut DmaBuf,
        dev: *const DmaDevice,
    ) -> i32>,

    /// Detach an importer device.
    pub detach: Option<unsafe extern "C" fn(buf: *mut DmaBuf, dev: *const DmaDevice)>,

    /// Map the buffer for DMA by the importer device. Returns a scatter-gather
    /// table of DMA addresses. The exporter pins pages and creates IOMMU mappings.
    /// `direction`: DMA_TO_DEVICE, DMA_FROM_DEVICE, or DMA_BIDIRECTIONAL.
    pub map_dma_buf: unsafe extern "C" fn(
        attachment: *mut DmaBufAttachment,
        direction: DmaDirection,
    ) -> *mut DmaSgl,

    /// Unmap a previously mapped buffer. Unpins pages and tears down IOMMU mappings.
    pub unmap_dma_buf: unsafe extern "C" fn(
        attachment: *mut DmaBufAttachment,
        sgl: *mut DmaSgl,
        direction: DmaDirection,
    ),

    /// Release the DMA-BUF. Called when the last reference is dropped.
    /// The exporter frees the underlying physical pages.
    pub release: unsafe extern "C" fn(buf: *mut DmaBuf),

    /// CPU-side mmap of the buffer. Used by userspace to access the buffer
    /// contents directly (e.g., for software rendering fallback).
    /// Returns the VMA mapping flags.
    pub mmap: Option<unsafe extern "C" fn(buf: *mut DmaBuf, vma: *mut VmaStruct) -> i32>,

    /// Begin CPU access. Called by `DMA_BUF_IOCTL_SYNC` with `DMA_BUF_SYNC_START`.
    /// Ensures cache coherency (flushes device writes, invalidates CPU cache).
    pub begin_cpu_access: Option<unsafe extern "C" fn(
        buf: *mut DmaBuf,
        direction: DmaDirection,
    ) -> i32>,

    /// End CPU access. Called by `DMA_BUF_IOCTL_SYNC` with `DMA_BUF_SYNC_END`.
    /// Flushes CPU writes so they are visible to device DMA.
    pub end_cpu_access: Option<unsafe extern "C" fn(
        buf: *mut DmaBuf,
        direction: DmaDirection,
    ) -> i32>,
}

/// Per-importer attachment to a DMA-BUF.
pub struct DmaBufAttachment {
    /// The DMA-BUF being imported.
    pub dmabuf: *mut DmaBuf,
    /// The importing device.
    pub dev: *const DmaDevice,
    /// Importer-private data.
    pub priv_data: *mut (),
    /// Intrusive list linkage (DmaBuf maintains a list of all attachments).
    pub node: IntrusiveListNode,
}

4.14.8.1 ScatterList to DmaSgl Conversion Bridge¶

ScatterList (physical page descriptions) and DmaSgl (device-visible IOVA entries) represent different stages of the DMA mapping pipeline. ScatterList is the exporter's view (physical pages); DmaSgl is the importer's view (IOVA-mapped addresses for a specific device). The conversion occurs during dma_buf_map_attachment():

PhysPage[] → ScatterList (exporter allocates pages, records page + offset + length)
          → dma_map_sg_table() (maps each ScatterList entry through the importer's IOMMU)
          → DmaSgl (device-visible DmaAddr + length entries)

/// Map a ScatterList (physical pages) into a device's IOMMU domain, producing
/// a DmaSgl with device-visible IOVA entries.
///
/// Called by DMA-BUF `map_dma_buf` implementations and by any subsystem that
/// needs to convert a physical page list to device-accessible addresses.
///
/// Each ScatterList entry's `page` + `offset` is translated to a DmaAddr:
/// - IOMMU path: IOVA allocated from the device's IOMMU domain, page mapped.
/// - No-IOMMU path: physical address returned directly (or bounced via SWIOTLB
///   if phys_addr > device.dma_mask).
///
/// Adjacent physically-contiguous entries are merged into single DmaSgl entries
/// to minimize scatter-gather list length (reduces hardware DMA descriptor count).
pub fn dma_map_sg_table(
    device: &dyn DmaDevice,
    sg: &[ScatterList],
    dir: DmaDirection,
) -> Result<DmaSgl, DmaError> {
    let mut sgl = DmaSgl { entries: ArrayVec::new() };
    let mut prev_end: Option<DmaAddr> = None;
    for entry in sg {
        let phys = entry.page.to_phys_addr() + entry.offset as u64;
        let dma_addr = map_phys_to_dma(device, phys, entry.length as usize, dir)?;
        // Merge with previous entry if physically and IOVA-contiguous.
        if let Some(pe) = prev_end {
            if pe == dma_addr {
                if let Some(last) = sgl.entries.last_mut() {
                    // Guard against u32 overflow on merged length.
                    // Individual entries are page-sized but many contiguous
                    // pages can exceed 4 GiB when merged.
                    let merged = (last.length as u64) + (entry.length as u64);
                    if merged <= u32::MAX as u64 {
                        last.length = merged as u32;
                        prev_end = Some(dma_addr + entry.length as u64);
                        continue;
                    }
                    // Overflow: do not merge, fall through to push a new entry.
                }
            }
        }
        if sgl.entries.is_full() {
            return Err(DmaError::ScatterTooLarge);
        }
        sgl.entries.push(DmaScatterEntry {
            dma_addr,
            length: entry.length,
            _pad: 0,
        });
        prev_end = Some(dma_addr + entry.length as u64);
    }
    Ok(sgl)
}

/// Unmap a previously mapped scatter-gather table. Releases IOMMU mappings
/// (or SWIOTLB bounce buffer slots) for all entries.
///
/// Must be called before the ScatterList pages are freed or reused.
/// After this call, all DmaAddr values in the corresponding DmaSgl are invalid.
pub fn dma_unmap_sg_table(
    device: &dyn DmaDevice,
    sg: &[ScatterList],
    dir: DmaDirection,
) {
    for entry in sg {
        if entry.dma_address != 0 {
            unmap_dma(device, entry.dma_address, entry.dma_length as usize, dir);
        }
    }
}

/// Internal: map a single physical address range to a device-visible DMA address.
///
/// Decision tree:
/// 1. Device has IOMMU with full translation → allocate IOVA, map page in IOMMU.
/// 2. Device has IOMMU in identity-map mode → phys_addr IS the DMA address,
///    but if phys_addr > dma_mask → bounce via SWIOTLB.
/// 3. No IOMMU → phys_addr directly, but if phys_addr > dma_mask → SWIOTLB.
fn map_phys_to_dma(
    device: &dyn DmaDevice,
    phys: PhysAddr,
    size: usize,
    dir: DmaDirection,
) -> Result<DmaAddr, DmaError>;

Relationship summary: - ScatterList: physical view. Fields: page (PageRef), offset (u32), length (u32). Used by DMA-BUF exporters and the block I/O layer. - DmaSgl: device view. Fields: dma_addr (DmaAddr/u64), length (u32). Used by hardware DMA engines (NVMe SGL, network TX descriptors, HDA BDL). - dma_map_sg_table() bridges the two: takes ScatterList + DmaDevice, returns DmaSgl. The DmaDevice determines whether the mapping goes through IOMMU (IOVA allocation) or direct physical address (with optional SWIOTLB bounce).

Userspace API (Linux-compatible ioctls on the DMA-BUF file descriptor):

ioctl	Value	Description
`DMA_BUF_IOCTL_SYNC`	`0x40086200`	Begin/end CPU access (`DMA_BUF_SYNC_START` / `DMA_BUF_SYNC_END`). Argument is `struct dma_buf_sync` containing a `__u64 flags` field — always 8 bytes on all architectures (`__u64` is fixed-width).
`DMA_BUF_SET_NAME`	`0x40086201` (LP64) / `0x40046201` (ILP32)	Set a debug name (shown in `/proc/[pid]/fdinfo/N`). FIX-028: Defined as `_IOW('b', 1, const char )` — the ioctl number encodes `sizeof(const char )` in bits 16-29: 8 bytes on LP64, 4 bytes on ILP32 (ARMv7, PPC32). The kernel's ioctl dispatch must accept both values or use `compat_ioctl` for 32-bit userspace.

Import/export flow:

Exporter (camera driver):
  1. Allocate physical pages for frame buffer
  2. dma_buf_export(size, &camera_dma_buf_ops, priv) → DmaBuf
  3. dma_buf_fd(dmabuf, O_CLOEXEC) → fd
  4. Return fd to userspace (via V4L2 VIDIOC_EXPBUF ioctl)

Userspace:
  5. Pass fd to GPU driver (via DRM PRIME ioctl DRM_IOCTL_PRIME_FD_TO_HANDLE)

Importer (GPU driver):
  6. dma_buf_get(fd) → DmaBuf reference
  7. dma_buf_attach(dmabuf, gpu_device) → DmaBufAttachment
  8. dma_buf_map_attachment(attach, DMA_FROM_DEVICE) → DmaSgl
  9. Use scatter-gather addresses for GPU texture sampling
  10. dma_buf_unmap_attachment(attach, sgl) when done
  11. dma_buf_detach(dmabuf, attach)
  12. dma_buf_put(dmabuf) — decrement refcount

Fence synchronization: DMA-BUF integrates with the ReservationObject (dma-fence) subsystem for implicit synchronization. When the exporter submits a DMA operation (e.g., camera capture), it attaches a fence to the reservation object. Importers automatically wait for the fence before accessing the buffer. This prevents the GPU from reading a frame that the camera has not yet finished writing.

Fence wait API:

/// Block the calling task until a DMA fence signals or the timeout expires.
///
/// The fence is signaled by the device driver calling `dma_fence_signal()`
/// on I/O completion (typically from the interrupt completion handler or
/// a deferred workqueue). When the fence signals, all waiters are woken.
///
/// # Scheduler integration
///
/// The waiting task enters `TaskState::UNINTERRUPTIBLE` and is placed on
/// the fence's internal `WaitQueue`. The scheduler removes the task from
/// the run queue — it consumes no CPU time while waiting. When
/// `dma_fence_signal()` fires, the signal callback calls `wake_all()` on
/// each registered waiter, transitioning the task back to `RUNNABLE`.
/// The wakeup receives the standard EEVDF sleeper bonus
/// ([Section 7.1](07-scheduling.md#scheduler--activatetask-sleeping-to-runnable-transition)).
///
/// # Timeout
///
/// If `timeout_ns` elapses before the fence signals, the waiter is removed
/// from the WaitQueue and `Err(DmaFenceError::Timeout)` is returned. A
/// timeout of `0` performs a non-blocking poll (returns immediately with
/// `Ok(())` if already signaled, `Err(Timeout)` otherwise). A timeout of
/// `u64::MAX` waits indefinitely (not recommended — use with caution).
///
/// # Error conditions
///
/// - `DmaFenceError::Timeout`: deadline expired before signal.
/// - `DmaFenceError::DeviceLost`: the owning device crashed during the
///   wait. The fence is force-signaled with error status by the crash
///   recovery path ([Section 11.9](11-drivers.md#crash-recovery-and-state-preservation)).
///   Callers should propagate EIO to userspace.
///
/// # Concurrency
///
/// Multiple tasks may wait on the same fence concurrently. All waiters
/// are woken when the fence signals (broadcast wake). The maximum number
/// of concurrent waiters per fence is `MAX_FENCE_WAITERS` (64); exceeding
/// this limit returns `Err(DmaFenceError::TooManyWaiters)`.
pub fn dma_fence_wait(fence: &DmaFence, timeout_ns: u64) -> Result<(), DmaFenceError> {
    if fence.is_signaled() {
        return Ok(());
    }
    let waiter = WaitEntry::new(current_task());
    fence.waitqueue.add(&waiter)?;  // Err if MAX_FENCE_WAITERS exceeded
    current_task().set_state(TaskState::UNINTERRUPTIBLE);

    // Arm a timer for the timeout. On fire, removes the waiter and wakes
    // the task with a timeout flag.
    let timer = if timeout_ns < u64::MAX {
        Some(HrTimer::new_oneshot(timeout_ns, || {
            waiter.set_timed_out();
            waiter.wake();
        }))
    } else {
        None
    };

    // Yield to the scheduler. The task is descheduled until either:
    // (a) dma_fence_signal() wakes it, or (b) the timeout timer fires.
    schedule();

    // Cleanup: remove waiter from the fence's WaitQueue.
    fence.waitqueue.remove(&waiter);
    if let Some(t) = timer { t.cancel(); }

    if waiter.timed_out() {
        Err(DmaFenceError::Timeout)
    } else if fence.has_error() {
        Err(DmaFenceError::DeviceLost)
    } else {
        Ok(())
    }
}

/// Signal a DMA fence, waking all waiters.
///
/// Called by the device driver on I/O completion — typically from the
/// interrupt handler or a completion workqueue. Must not be called twice
/// on the same fence (double-signal is a driver bug and triggers a
/// debug assertion). Sets the fence's `SIGNALED` flag atomically, then
/// wakes all entries on the fence's WaitQueue via broadcast `wake_all()`.
///
/// # Context
///
/// Safe to call from interrupt context (hardirq). The WaitQueue wake
/// path does not allocate or acquire sleeping locks.
pub fn dma_fence_signal(fence: &DmaFence) {
    debug_assert!(!fence.is_signaled(), "double-signal on DmaFence");
    fence.status.store(FENCE_SIGNALED, Release);
    fence.waitqueue.wake_all();
}

#[repr(u32)]
pub enum DmaFenceError {
    /// Timeout expired before fence signaled.
    Timeout        = 1,
    /// Owning device lost (crashed or removed).
    DeviceLost     = 2,
    /// Too many concurrent waiters on this fence.
    TooManyWaiters = 3,
}

Cross-references: - Camera V4L2 EXPBUF integration: Section 13.16 - GPU DRM PRIME import: Section 13.5 - Media pipeline zero-copy: Section 13.7 - DMA cache coherency table: Section 4.14 - IOMMU containment: Section 11.3

P2P DMA: Peer-to-peer DMA between accelerators (GPU-to-GPU, GPU-to-NVMe) uses shared IOMMU kernel domains. Both devices must be in the same IOMMU group or have explicit P2P DMA permission granted via the IOMMU API. See Section 22.4 for the P2P DMA validation protocol, including BAR aperture validation, ACS (Access Control Services) checks, and the dma_p2p_distance() topology query that determines whether two devices can perform P2P transfers without CPU bounce buffering.

/// Query the PCIe topology distance between two devices for P2P DMA.
///
/// Returns the P2P routing distance: the number of PCIe switches between
/// `initiator` and `target` in the PCIe hierarchy. Used to determine
/// whether P2P DMA is feasible and what latency to expect.
///
/// # Returns
/// - `Ok(0)`: Devices are on the same PCIe switch or root port (optimal P2P).
/// - `Ok(n)`: Devices are `n` switches apart. P2P is possible but adds
///   `n * ~50-200ns` latency per transfer vs. same-switch.
/// - `Err(P2pError::DifferentDomain)`: Devices are in different IOMMU domains
///   — P2P DMA requires CPU bounce buffering (not true P2P).
/// - `Err(P2pError::AcsBlocked)`: ACS (Access Control Services) is enabled on
///   an intermediate switch, blocking peer-to-peer routing. ACS must be
///   disabled on all switches in the path for P2P to work.
/// - `Err(P2pError::NoBarAperture)`: The target device does not expose a
///   BAR aperture large enough for the requested P2P transfer size.
///
/// # Algorithm
/// 1. Walk the PCIe topology tree from `initiator` upward to find the
///    Lowest Common Ancestor (LCA) switch with `target`.
/// 2. At each intermediate switch, check ACS settings (`ACS_P2P_REQUEST_REDIRECT`,
///    `ACS_P2P_COMPLETION_REDIRECT`). If ACS is enabled, P2P is routed through
///    the root complex (effectively CPU bounce).
/// 3. Compute distance = depth(initiator, LCA) + depth(target, LCA).
/// 4. Validate target BAR aperture covers the requested P2P region.
pub fn dma_p2p_distance(
    initiator: &dyn DeviceNode,
    target: &dyn DeviceNode,
) -> Result<u32, P2pError> {
    // Walk PCIe topology tree, check ACS, compute distance.
}

P2P DMA mapping lifecycle: (1) Authorize via p2p_acl_check(src, dst) — verifies both devices are in the same IOMMU group or have explicit P2P permission. (2) Map: source device exports BAR region; destination device maps it via IOMMU. (3) Transfer: destination device DMAs directly from source's BAR aperture. (4) Unmap: release IOMMU mapping. The mapping persists until explicitly unmapped — it is not tied to a single transfer. The KABI dma_p2p_map/dma_p2p_unmap methods (Section 12.3) expose this lifecycle to Tier 1 drivers.

4.14.8.2 Kernel-Wide Memory Coherence Model¶

Domain pair	Mechanism	Ordering	See
CPU ↔ CPU (same NUMA)	Hardware cache coherence (MESI/MOESI)	TSO (x86, s390x), Release-Acquire (ARM, RISC-V, PPC, LoongArch)	Section 3.9
CPU ↔ CPU (cross-NUMA)	Hardware coherence + interconnect (QPI/UPI, CCIX, CXL.cache)	Same as above + ~50-100ns penalty	Section 4.11
CPU ↔ Device (DMA)	IOMMU + cache flush/snoop	DMA fence (Release on submit, Acquire on complete)	Section 4.14
CPU ↔ GPU/Accelerator	IOMMU snoop or explicit flush	DMA fence per operation	Section 22.4
GPU ↔ GPU	P2P BAR or staged DMA	Device-specific (NVLink: TSO; PCIe: relaxed)	Section 22.4
CPU ↔ Remote node (DSM)	RDMA + MOESI coherence protocol	Page-granularity, causal or strict	Section 6.6
CPU ↔ CXL memory	CXL.mem hardware coherence	Same as local NUMA (hardware-managed)	Section 5.9

All cross-domain data transfers use the DMA fence abstraction (Section 4.14) for ordering. The fence guarantees: all writes before dma_fence_signal() are visible to all readers after dma_fence_wait() returns.

4.15 Extended Memory Operations¶

This section specifies eight syscalls that extend or augment the core memory management interface. Each syscall is Linux ABI-compatible at the binary level (same numbers, same struct layouts, same flag values) while the internal implementation uses UmkaOS-native data structures and algorithms. Improvements over Linux are called out explicitly in each subsection.

Cross-references: virtual memory regions (Section 4.6), huge page management (Section 4.5), NUMA policy (Section 4.9), page cache (Section 4.4), CompressPool compression (Section 4.10).

4.15.1 mremap — Remap a Virtual Address Region¶

Syscall signature

void *mremap(void *old_addr, size_t old_size, size_t new_size,
             int flags, ... /* void *new_addr */);

Returns new_addr on success, MAP_FAILED (cast of (void *)-1) on error with errno set. The optional fifth argument new_addr is required when MREMAP_FIXED is set.

Flags

bitflags::bitflags! {
    /// Flags for mremap(2). Exact Linux values — binary ABI is stable.
    pub struct MremapFlags: u32 {
        /// Kernel may move the mapping to a different virtual address if it cannot
        /// grow in place. Without this flag, ENOMEM is returned instead of moving.
        const MAYMOVE     = 0x1;

        /// Move the mapping to exactly new_addr, unmapping whatever was there before.
        /// Requires MAYMOVE. Returns EPERM if MAYMOVE is absent.
        const FIXED       = 0x2;

        /// Do not unmap old_addr after moving. The old VA range is replaced with
        /// anonymous zero pages (as if freshly mmap'd MAP_ANONYMOUS). The PTEs from
        /// the old range are transferred to new_addr. Added in Linux 5.7.
        /// Used by the Go runtime garbage collector to preserve VA reservations.
        const DONTUNMAP   = 0x4;
    }
}

Three operational cases

Grow in place (new_size > old_size, no move): extend the VMA's end if the pages immediately above old_addr + old_size are unmapped. No PTE copying required; the new sub-range starts with no PTEs present (demand-fault on access). This is constant-time in the VMA tree.
Shrink (new_size < old_size): always succeeds. Unmap the tail sub-range [old_addr + new_size, old_addr + old_size) using unmap_vma_range(), which unmaps PTEs and drops page references. Returns old_addr.
Move (MAYMOVE set, grow in place impossible): find a free VA range of new_size, copy PTEs from the old range to the new range without copying physical pages, TLB-flush the old range, optionally install zero anonymous PTEs at the old range if DONTUNMAP, then free the old VMA (or update its length if DONTUNMAP).

Algorithm — detailed steps

mremap(old_addr, old_size, new_size, flags, new_addr):
  1. Validate old_addr is page-aligned. Return EINVAL if not.
  2. Validate old_size and new_size are non-zero. Return EINVAL if either is zero.
  3. Validate flags: FIXED requires MAYMOVE (return EPERM if FIXED && !MAYMOVE).
  4. Validate DONTUNMAP requires MAYMOVE and new_size == old_size.
     (Kernel 5.7 behaviour: size must match to preserve the hole invariant.)
     Return EINVAL if violated.
  5. Lock MM write-lock (mm.write_lock()).
  6. Look up source VMA: find_vma(mm, old_addr). Return EFAULT if not found or
     old_addr < vma.start. Verify [old_addr, old_addr+old_size) lies within one VMA
     (no spanning). Return EFAULT if it spans two VMAs.
  7. If shrink (new_size < old_size):
       a. Split VMA at old_addr + new_size if needed.
       b. Call unmap_vma_range(old_addr + new_size, old_size - new_size).
          This unmaps PTEs, drops struct page references, fires mmu_notifier.
       c. Return old_addr. (No TLB shootdown needed: pages are gone.)
  8. If grow in place possible (no VMA in [old_addr+old_size, old_addr+new_size)):
       a. Extend VMA end to old_addr + new_size.
       b. Return old_addr.
  9. If MAYMOVE not set: return ENOMEM (cannot grow in place, cannot move).
  10. Find destination VA range:
        - If FIXED: use new_addr. Verify new_addr is page-aligned (EINVAL if not).
          Unmap [new_addr, new_addr+new_size) if occupied.
        - Else: call find_free_vma(mm, new_size, hint=old_addr+old_size).
          Return ENOMEM if no free range found.
  11. Verify old and new ranges do not overlap (EINVAL if they do).
  12. Allocate new VMA struct, copy flags/prot/file/offset from source VMA.
  13. Call remap_pte_range(old_addr, new_addr, new_size):
        - Walks PTEs from old_addr using the page-table walker.
        - For each present PTE: atomically clear old PTE, set new PTE at new_addr
          offset. No physical page is copied.
        - For 2MB huge PTEs: transfers the huge PTE under the page table lock (PTL)
          within a single critical section (see UmkaOS improvement below).
        - Increments page map-count (vm_page_count) by 0 (count unchanged, just
          remapped).
  14. If DONTUNMAP:
        - Replace old VMA range with MAP_ANONYMOUS|MAP_PRIVATE zero VMA.
        - Install zero PTEs for old range (demand-fault semantics — actually leave
          PTEs absent; they will fault in as zeros on access).
        - VMA for old range remains in mm.vma_tree with anon backing.
      Else:
        - Remove old VMA from mm.vma_tree.
        - Call mmu_notifier_invalidate(mm, old_addr, old_size).
  15. Insert new VMA into mm.vma_tree.
  16. TLB flush of old VA range required: with DONTUNMAP the old mapping is
      preserved, but stale TLB entries pointing to the original physical pages
      must be invalidated. New accesses to the old VA range will fault and be
      handled by the kernel correctly only after the flush.
      Call tlb_flush_range(mm, old_addr, old_size).
  17. Unlock MM write-lock.
  18. Return new_addr.

UmkaOS improvement over Linux: atomic huge-page remap

Linux 5.x mremap splits 2MB huge pages at the source range boundaries before remapping, then reconstructs huge pages at the destination. This causes THP split overhead (up to ~80μs per split on large mappings) and temporary fragmentation of the huge-page pool.

UmkaOS's remap_pte_range() detects 2MB-aligned source ranges and transfers the huge PTE without splitting, using the page table lock (PTL) to serialize the operation:

Serialization via PTL, not memory-visible atomicity: The huge PTE transfer is not truly atomic in the memory-visible sense — it consists of two separate memory operations (clear source PMD entry, set destination PMD entry). Instead, the PTL is held for both the remove-from-source and insert-to-destination operations within a single critical section. No other CPU can observe the page in an inconsistent state while the PTL is held, because any concurrent page table walk must also acquire the PTL before modifying entries in the same PMD range.

Cross-PMD transfers: If the source and destination fall in different PMD ranges (and thus have different PTLs), both PTLs are acquired in canonical address order (lower virtual address first) to prevent ABBA deadlock, and the transfer is performed while both locks are held.

No unmapped window: There is no window where the page is mapped in neither the source nor the destination. The PTL critical section ensures that the destination PMD entry is written before the source PTL is released. Another CPU attempting to access the source address during the transfer will block on the PTL; by the time it acquires the lock, the TLB flush (step 16) will have invalidated any stale entries.

/// Remap PTEs from src_va to dst_va for `len` bytes.
/// Transfers huge PTEs (2MB) without splitting when both src and dst are
/// 2MB-aligned and len is a multiple of 2MB. Uses the page table lock (PTL)
/// to serialize the transfer — not memory-visible atomicity.
///
/// # Safety
/// Caller holds mm write-lock. src and dst ranges must not overlap.
/// src range PTEs must be valid (verified by caller before entry).
unsafe fn remap_pte_range(
    mm: &mut MemoryMap,
    src_va: VirtAddr,
    dst_va: VirtAddr,
    len: usize,
) {
    let mut offset = 0usize;
    while offset < len {
        let src = src_va + offset;
        let dst = dst_va + offset;
        // Check for 2MB-aligned huge page opportunity.
        if src.is_aligned(HUGE_PAGE_SIZE)
            && dst.is_aligned(HUGE_PAGE_SIZE)
            && len - offset >= HUGE_PAGE_SIZE
        {
            // Transfer huge PTE under PTL: acquire lock(s) for the source and
            // destination PMD ranges. If they fall under different PTLs, acquire
            // in canonical (lower-address-first) order to prevent deadlock.
            let ptl_src = mm.page_table.pmd_lock(src);
            let ptl_dst = mm.page_table.pmd_lock(dst);
            // `lock_pair_ordered(a, b)` checks `ptr::eq(a, b)` first. If both
            // arguments point to the same lock (src and dst in the same PMD),
            // it acquires only once. This handles same-PMD remap safely.
            let _guards = lock_pair_ordered(&ptl_src, &ptl_dst);

            // Within this critical section, no other CPU can observe a state where
            // the page is in neither source nor destination.
            let pmd_src = mm.page_table.pmd_entry_mut(src);
            let huge_pte = pmd_src.take(); // Clear source PMD entry.
            let pmd_dst = mm.page_table.pmd_entry_mut(dst);
            pmd_dst.set(huge_pte);         // Write destination PMD entry.
            // PTL guards drop here — locks released.
            offset += HUGE_PAGE_SIZE;
        } else {
            // Fall back to 4KB PTE transfer, batched per PTE page.
            // Acquire PTL pair once per PTE page (512 PTEs on 64-bit,
            // 1024 on 32-bit), NOT once per individual PTE. This matches
            // Linux's `move_ptes()` which acquires locks at the PMD level
            // via `pte_offset_map_lock()` and batch-moves all PTEs within
            // that page table page.
            //
            // Without batching, a 1 GB mremap would acquire and release
            // 2 × 262,144 spinlocks (~10ms of pure lock overhead). With
            // per-PTE-page batching, it is 2 × 512 lock operations — a
            // 512x reduction.
            let pte_page_size = PTES_PER_PAGE * PAGE_SIZE; // 512 * 4KB = 2MB on 64-bit
            let chunk_end = ((src + pte_page_size) & !(pte_page_size - 1)).min(src_va + len);

            let ptl_src = mm.page_table.pte_lock(src);  // PMD-level lock
            let ptl_dst = mm.page_table.pte_lock(dst);  // PMD-level lock
            let _guards = lock_pair_ordered(&ptl_src, &ptl_dst);

            // Move all PTEs within this PTE page under a single lock pair.
            let mut inner_offset = src;
            while inner_offset < chunk_end {
                let pte_src = mm.page_table.pte_entry_mut(inner_offset);
                let pte = pte_src.take();
                let dst_addr = dst + (inner_offset - src);
                let pte_dst = mm.page_table.pte_entry_mut(dst_addr);
                pte_dst.set(pte);
                inner_offset += PAGE_SIZE;
            }
            // PTL guards drop here — locks released for this PTE page.
            offset += chunk_end - src;
        }
    }
}

This avoids any THP split/reconstruct overhead. The huge PTE is transferred in two memory operations (clear + set) within a single PTL critical section, preserving dirty/accessed bits and avoiding the split_huge_page() path entirely.

Error cases

Error	Condition
`EFAULT`	`old_addr` not in any VMA, or range spans two VMAs
`EINVAL`	`old_addr` not page-aligned, `new_addr` not page-aligned (FIXED), zero sizes, DONTUNMAP with size mismatch, overlapping ranges
`ENOMEM`	MAYMOVE not set and cannot grow in place; or MAYMOVE set but no free VA range
`EPERM`	`MREMAP_FIXED` set without `MREMAP_MAYMOVE`

Linux compatibility: flag values, return value convention, errno codes, and DONTUNMAP semantics are identical to Linux 5.7+. The UmkaOS-specific huge-page optimisation is transparent to userspace.

4.15.2 mincore — Query Page Residency¶

Syscall signature

int mincore(void *addr, size_t length, unsigned char *vec);

Returns 0 on success. The output array vec contains one byte per page in [addr, addr + length). Bit 0 of each byte is 1 if the page is currently resident in physical RAM, 0 otherwise.

Output byte encoding

Standard Linux bit 0 is preserved for compatibility. UmkaOS extends the remaining bits to provide richer page-state information:

bitflags::bitflags! {
    /// Per-page status byte returned by mincore(2).
    /// Bit 0 is Linux-compatible. Bits 1-7 are UmkaOS extensions (zero on platforms
    /// that do not have the corresponding feature, or when queried via a
    /// compatibility shim that strips extension bits).
    pub struct MinCoreStatus: u8 {
        /// Page is resident in CPU DRAM (bit 0 — Linux-compatible).
        const RESIDENT        = 0x01;

        /// Page is in compressed memory (CompressPool). Accessible but not in raw DRAM.
        /// Retrieving it requires decompression (~1-2us: direct slot lookup ~10ns +
        /// LZ4 decompression ~0.5us + page allocation + PTE installation + TLB flush).
        /// Linux always returns 0 for compressed pages (it does not have a compression
        /// tier by default). See [Section 4.12](#memory-compression-tier) for decompression latency.
        const COMPRESSED      = 0x02;

        /// Page is on a remote RDMA node (DSM remote page). Accessing it triggers
        /// an RDMA fetch (~2-3μs). Only set when the distributed memory subsystem
        /// is active (Section 5).
        const REMOTE_RDMA     = 0x04;

        /// Page is write-protected by userfaultfd WP mode.
        const UFFD_WP         = 0x08;

        /// Page is huge (part of a 2MB THP). The byte at the first page of the huge
        /// page has this bit set; the remaining 511 bytes in the same huge page also
        /// have it set. Informational only.
        const HUGE_PAGE       = 0x10;

        // Bits 0x20, 0x40, 0x80 reserved for future use.
    }
}

Standard mincore() returns bytes with only bit 0 meaningful (as per Linux). The extended bits are populated but userspace that ignores them sees standard Linux semantics (bit 0 = resident).

Algorithm

mincore(addr, length, vec):
  1. Validate addr is page-aligned. Return EINVAL if not.
  2. Validate length > 0, addr + length does not overflow. Return EINVAL if overflow.
  3. num_pages = ceil(length / PAGE_SIZE).
  4. Verify vec pointer (num_pages bytes) is writable by the calling process.
     Return EFAULT if vec is not accessible.
  5. Acquire MM read-lock (mm.read_lock()).
  6. Walk [addr, addr + length) page by page:
     For each page address p in the range:
       a. Look up VMA containing p. If none found: unlock, return ENOMEM.
          (Linux returns ENOMEM for unmapped ranges — UmkaOS matches this.)
       b. Look up PTE for p in the page table (no lock needed beyond MM read-lock;
          PTE reads are inherently racy — this is acceptable per POSIX and Linux).
       c. status = 0.
       d. If PTE is present and not swapped-out: status |= RESIDENT.
       e. If PTE maps a compressed CompressPool page (UmkaOS PTE extension bit): status |= COMPRESSED.
       f. If PTE maps a remote DSM page: status |= REMOTE_RDMA.
       g. If PTE has the UFFD_WP soft-dirty bit set: status |= UFFD_WP.
       h. If the PTE is a PMD (huge page, 2MB): status |= HUGE_PAGE | RESIDENT.
          (A present huge PTE implies all 512 sub-pages are resident.)
       i. Write status byte to vec[page_index].
       j. Advance page_index.
       k. If p is at a 2MB-aligned boundary and this is a 2MB huge PTE:
          advance p by HUGE_PAGE_SIZE and fill all 512 vec bytes with status.
          (Avoids walking 512 individual 4KB PTEs for huge pages.)
  7. Release MM read-lock.
  8. Return 0.

Raciness note: mincore is inherently racy — a page may be swapped out between the PTE check and the return to userspace. This is by design (Linux documentation explicitly states this). UmkaOS does not attempt to serialise page reclaim during mincore.

UmkaOS improvement over Linux

Linux walks the page table at 4KB granularity even for huge pages, redundantly checking 512 sub-PTEs that all have the same state. UmkaOS detects huge PMD entries and bulk-fills the output vector for the entire 2MB range in a single operation, reducing mincore overhead by up to 512x for workloads with large THP mappings (e.g., QEMU guest RAM).

Additionally, the extension bits (COMPRESSED, REMOTE_RDMA) expose information that Linux provides no equivalent for, enabling userspace memory managers (e.g., the UmkaOS memory daemon) to make informed reclaim decisions without resorting to /proc/PID/smaps parsing.

Extended syscall: mincore_ex

For richer queries, UmkaOS provides mincore_ex as a forward-compat extension:

int mincore_ex(void *addr, size_t length, struct mincore_ex_page *vec,
               uint32_t flags);

/// Extended per-page record returned by mincore_ex(2).
/// Size is 8 bytes; the vec array must hold ceil(length/PAGE_SIZE) entries.
#[repr(C)]
pub struct MinCoreExPage {
    /// MinCoreStatus bits (see above).
    pub status: MinCoreStatus,
    /// NUMA node the physical page is on (0-255; 0xFF = unknown or remote).
    pub numa_node: u8,
    /// Page age in units of 0.1 seconds since last access (max 0xFFFE = 6553.4s;
    /// 0xFFFF = unknown). Derived from hardware Accessed bit recency tracking.
    pub age_deciseconds: u16,
    /// Physical frame number (PFN) mod 2^32. Zero for non-resident pages.
    /// **Security**: Populated only when the caller has `CAP_SYS_ADMIN`.
    /// For unprivileged callers, `pfn_lo32` is always zero. This prevents
    /// Rowhammer-class attacks that exploit physical address knowledge
    /// (CVE-2015-0565). Matches Linux's `/proc/PID/pagemap` restriction
    /// (kernel 4.0+, `fs/proc/task_mmu.c`: `pm.show_pfn = file_ns_capable(
    /// file, &init_user_ns, CAP_SYS_ADMIN)`).
    pub pfn_lo32: u32,
}
// Layout: MinCoreStatus(1) + numa_node(1) + age_deciseconds(2) + pfn_lo32(4) = 8 bytes.
const_assert!(size_of::<MinCoreExPage>() == 8);

mincore_ex is a new UmkaOS syscall number (not a Linux number); it has no Linux equivalent. Userspace that needs it links against the UmkaOS compat header.

pfn_lo32 capability gating: The implementation must check capabilities before populating the PFN field:

let show_pfn = current_task().has_capability(CAP_SYS_ADMIN);
// ... per-page loop:
entry.pfn_lo32 = if show_pfn { pfn as u32 } else { 0 };

Error cases

Error	Condition
`EFAULT`	`vec` pointer not writable, or `addr` faults (not in any VMA — but see ENOMEM)
`EINVAL`	`addr` not page-aligned; `addr + length` overflows; `length` is zero
`ENOMEM`	The queried range includes an unmapped region

Linux compatibility: standard mincore() syscall number and bit-0 semantics are exact. Extension bits in the output byte are always zero when called via the Linux compat path (UmkaOS compatibility shim strips them). mincore_ex is a new UmkaOS-only syscall.

4.15.3 membarrier — Expedite Memory Barrier¶

Syscall signature

int membarrier(int cmd, unsigned int flags, int cpu_id);

Returns 0 on success (or a bitmask for MEMBARRIER_CMD_QUERY), -1 on error.

Commands

Exact Linux values from <linux/membarrier.h>. All are implemented:

/// Commands for membarrier(2). Exact Linux values.
#[repr(i32)]
pub enum MembarrierCmd {
    /// Query supported commands. Returns bitmask of supported MembarrierCmd values.
    /// Bit N set means (1 << N) command is supported.
    Query                           = 0,

    /// Issue a full memory barrier on all CPUs (not just those running this process).
    /// Equivalent to sending an IPI to all online CPUs and waiting for each to
    /// execute a memory fence. Slow but unconditional.
    Global                          = 1,

    /// Barrier on all CPUs currently running threads from processes that have
    /// registered REGISTER_GLOBAL_EXPEDITED. Faster than GLOBAL when most processes
    /// are registered.
    GlobalExpedited                 = 2,

    /// Register this process for GlobalExpedited. Sets a flag in ProcessFlags that
    /// causes this process's CPUs to be included in future GlobalExpedited barriers.
    RegisterGlobalExpedited         = 4,

    /// Barrier on all CPUs running threads of the calling process. IPI sent only to
    /// CPUs with a thread of this process currently scheduled. Fastest for
    /// single-process use cases (e.g., JIT compiler memory visibility).
    PrivateExpedited                = 8,

    /// Register this process for PrivateExpedited barriers.
    RegisterPrivateExpedited        = 16,

    /// Full sync-core barrier: like PrivateExpedited but also serialises the
    /// instruction pipeline (CPUID on x86, ISB on ARM). Required for self-modifying
    /// code (JIT). After this barrier, all CPUs have seen the latest instruction
    /// stream.
    PrivateExpeditedSyncCore        = 32,

    /// Register for PrivateExpeditedSyncCore.
    RegisterPrivateExpeditedSyncCore = 64,

    /// Restart RSEQ (restartable sequences) on all threads of the calling process.
    /// Forces any in-flight RSEQ critical sections to abort and restart. Used to
    /// push updated RSEQ code into effect.
    PrivateExpeditedRseq            = 128,

    /// Register for PrivateExpeditedRseq.
    RegisterPrivateExpeditedRseq    = 256,

    /// Return a bitmask of which REGISTER_* commands the calling process
    /// currently has active. Added Linux 6.3. Implementation: read the
    /// process's `MembarrierState` bitmask and return the raw bits.
    GetRegistrations                = 512,
}

Internal data structures

bitflags::bitflags! {
    /// Per-process membarrier registration state. Stored in ProcessFlags.
    /// Cheap to check on the hot path (single bitmask load).
    pub struct MembarrierState: u32 {
        const REGISTERED_GLOBAL_EXPEDITED          = 1 << 0;
        const REGISTERED_PRIVATE_EXPEDITED         = 1 << 1;
        const REGISTERED_PRIVATE_EXPEDITED_SYNC    = 1 << 2;
        const REGISTERED_PRIVATE_EXPEDITED_RSEQ    = 1 << 3;
    }
}

The MembarrierState field lives in the Process struct alongside ProcessFlags. It is read atomically (Acquire load, paired with Release store in membarrier registration) on every IPI target check. No lock is needed for the read path.

Algorithm — GLOBAL

membarrier(GLOBAL, flags=0, cpu_id=0):
  1. For each online CPU c:
       Send IPI to c with handler: smp_mb() (full memory barrier instruction).
       Wait for IPI acknowledgment (c's handler has executed).
  2. Execute smp_mb() on the calling CPU.
  3. Return 0.

GLOBAL is expensive (~N microseconds for N CPUs) and should only be used at infrequent synchronisation points. It is the slowest command but requires no prior registration.

Algorithm — GLOBAL_EXPEDITED

membarrier(GLOBAL_EXPEDITED, flags=0, cpu_id=0):
  1. No caller registration required. Any process may call GLOBAL_EXPEDITED.
  2. Use PerCpu::for_each() to iterate over all CPUs:
       For each CPU c whose current process has REGISTERED_GLOBAL_EXPEDITED set:
         Enqueue a barrier IPI (using the IPI mechanism from Section 3).
  3. Wait for all enqueued IPIs to be acknowledged.
  4. smp_mb() on calling CPU.
  5. Return 0.

Algorithm — PRIVATE_EXPEDITED

membarrier(PRIVATE_EXPEDITED, flags=0, cpu_id=0):
  1. Verify caller registered. Return EPERM if not.
  2. Build CPU mask: for each CPU c, check if c is currently running a thread
     belonging to the calling process (check CpuLocal::current_task().process_id).
  3. Send barrier IPI to each CPU in the mask.
  4. Wait for acknowledgments.
  5. smp_mb() on calling CPU.
  6. Return 0.

Algorithm — PRIVATE_EXPEDITED_SYNC_CORE

Same as PRIVATE_EXPEDITED but the IPI handler executes an instruction-serialising fence:

x86-64: CPUID (serialising instruction, flushes instruction pipeline)
AArch64: ISB SY (instruction synchronisation barrier)
ARMv7: ISB via CP15
RISC-V: FENCE.I (instruction fence)
PPC64LE: isync + sync

This ensures that self-modifying code (JIT-compiled code installed by one thread) is visible to the instruction fetch unit on all other threads before execution continues.

Algorithm — PRIVATE_EXPEDITED_RSEQ

membarrier(PRIVATE_EXPEDITED_RSEQ, flags=0, cpu_id=0):
  1. For each thread T in the calling process currently running on some CPU c:
       If T's rseq_cs pointer is non-null (T is inside a restartable sequence):
         Send IPI to c requesting RSEQ abort: set T's rseq_cs to null and
         redirect T's PC to the abort_ip specified in the rseq struct.
  2. Return 0.

UmkaOS improvement over Linux

Linux's membarrier implementation acquires the scheduler runqueue lock on each target CPU to determine whether a thread of the target process is running. This can cause priority inversion and latency spikes on RT tasks.

UmkaOS avoids this by using CpuLocal::current_task_pid() (a register-based read, ~1-4 cycles, no lock) to check which process is on each CPU. The check is performed via PerCpu::for_each() which issues a non-locking load of each CPU's register-aliased task pointer. The IPI target set is built without acquiring any scheduler lock.

This reduces the overhead of PRIVATE_EXPEDITED from O(N_CPUs × lock_acquire_latency) to O(N_CPUs × register_read_latency), a roughly 10-50x improvement on large systems.

QUERY return value

MEMBARRIER_CMD_QUERY (0) returns a bitmask of all implemented commands. UmkaOS returns:

1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 = 1023 (all commands supported)

GET_REGISTRATIONS implementation

membarrier(GET_REGISTRATIONS, flags=0, cpu_id=0):
    return current_process().membarrier_state.bits() as i32

Error cases

Error	Condition
`EINVAL`	Unknown `cmd` value; non-zero `flags` (reserved, must be 0); invalid `cpu_id`
`EPERM`	`PRIVATE_EXPEDITED`, `PRIVATE_EXPEDITED_SYNC_CORE`, or `PRIVATE_EXPEDITED_RSEQ` called without prior `REGISTER_PRIVATE_EXPEDITED`; `GLOBAL_EXPEDITED` requires no registration

Linux compatibility: all command values, registration semantics, and error codes match Linux 5.10+. The UmkaOS optimisation is transparent to userspace.

4.15.4 userfaultfd — User-Space Page Fault Handling¶

Syscall signature

int userfaultfd(int flags);

Returns a file descriptor on success, -1 on error. The fd is polled for readability when fault messages are pending.

Flags

bitflags::bitflags! {
    /// Flags for userfaultfd(2).
    pub struct UffdFlags: i32 {
        /// Set close-on-exec on the returned fd.
        const CLOEXEC          = libc::O_CLOEXEC;   // 0x80000

        /// Set O_NONBLOCK on the returned fd. read() returns EAGAIN when no
        /// messages are pending instead of blocking.
        const NONBLOCK         = libc::O_NONBLOCK;  // 0x800

        /// Allow unprivileged (non-CAP_SYS_PTRACE) processes to create userfaultfd
        /// instances and handle faults for their own address space. Added Linux 5.11.
        /// Without this flag, unprivileged processes get EPERM.
        const USER_MODE_ONLY   = 0x1;
    }
}

Wire message format (exact Linux uffd_msg struct layout, binary-compatible):

/// Fault message delivered to userspace via read() on the userfaultfd fd.
/// Total size: 32 bytes. Layout matches Linux <linux/userfaultfd.h> uffd_msg.
#[repr(C)]
pub struct UffdMsg {
    /// Event type (UffdEvent enum). 8-bit field.
    pub event: u8,
    pub _reserved1: u8,
    pub _reserved2: u16,
    pub _reserved3: u32,
    /// Event-specific payload. Union in C; UmkaOS uses an enum for type safety
    /// internally, but serialises to the union wire format.
    pub arg: UffdMsgArg,
}
// Layout: 1 + 1 + 2 + 4 + 24 = 32 bytes.
const_assert!(size_of::<UffdMsg>() == 32);

/// 24-byte union (wire format). One variant active based on UffdMsg.event.
#[repr(C)]
pub union UffdMsgArg {
    /// For UFFD_EVENT_PAGEFAULT.
    pub pagefault: UffdMsgPagefault,
    /// For UFFD_EVENT_FORK.
    pub fork: UffdMsgFork,
    /// For UFFD_EVENT_REMAP.
    pub remap: UffdMsgRemap,
    /// For UFFD_EVENT_REMOVE.
    pub remove: UffdMsgRemove,
    /// Padding to 24 bytes.
    pub _pad: [u8; 24],
}
const_assert!(size_of::<UffdMsgArg>() == 24);

/// Pagefault event payload (event = UFFD_EVENT_PAGEFAULT = 0x12).
#[repr(C)]
pub struct UffdMsgPagefault {
    /// UFFD_PAGEFAULT_FLAG_WRITE (0x01): fault was a write. Else read fault.
    /// UFFD_PAGEFAULT_FLAG_WP   (0x02): write-protect fault (WP mode).
    /// UFFD_PAGEFAULT_FLAG_MINOR (0x04): minor fault (MINOR mode).
    pub flags: u64,
    /// Faulting virtual address.
    pub address: u64,
    /// Per-thread ID of the faulting task (value returned by gettid(),
    /// called `ptid` in Linux's uffd_msg). NOT the thread group ID
    /// (getpid() value). Only valid when UFFD_FEATURE_THREAD_ID was
    /// negotiated via UFFDIO_API.
    pub ptid: u32,
    /// Explicit tail padding to fill the 24-byte UffdMsgArg union.
    /// Must be zero-initialized to prevent kernel info leak to userspace.
    pub _pad: u32,
}
// Layout: 8 + 8 + 4 + 4 = 24 bytes.
const_assert!(size_of::<UffdMsgPagefault>() == 24);

#[repr(u8)]
pub enum UffdEvent {
    Pagefault = 0x12,
    Fork      = 0x13,
    Remap     = 0x14,
    Remove    = 0x15,
    Unmap     = 0x16,
}

Internal data structures

/// Number of shards in the pending fault table.
/// Power of 2 for cheap modulo via bitmask.
/// 64 shards = max 1 contended lock per 64-CPU group on a 4096-CPU system.
pub const UFFD_FAULT_SHARDS: usize = 64;

/// Maximum simultaneously pending faults per shard.
/// Total capacity per UffdInstance: UFFD_FAULT_SHARDS × UFFD_SHARD_SLOTS = 8192.
/// Sufficient for live migration of large Redis/database workloads on servers
/// with hundreds of cores. When a shard is full, `wait_for_resolution` returns
/// `Err(UffdError::Backpressure)`; the page fault handler yields and retries.
pub const UFFD_SHARD_SLOTS: usize = 128;
const_assert!(UFFD_SHARD_SLOTS <= u16::MAX as usize);

/// Maximum tasks simultaneously waiting on the same faulting page.
/// 8 covers typical thread-level parallelism faulting the same shared mapping.
pub const UFFD_MAX_WAITERS_PER_PAGE: usize = 8;

/// A waiter entry in the pending fault table. Stores the task reference and
/// the WaitQueue token needed to wake the task when the fault is resolved.
pub struct WaitEntry {
    /// The blocked task.
    pub task: Arc<Task>,
    /// Token for the wait queue (used by `wake_one` to resume the task).
    pub wq_token: WaitQueueToken,
}

/// One slot in the flat pending-fault table.
/// `addr == PageAlignedAddr(0)` means this slot is unoccupied (virtual address 0
/// is never a valid userfaultfd target: the zero page is mapped read-only by the
/// kernel and cannot be registered with UFFDIO_REGISTER).
// kernel-internal, not KABI
#[repr(C)]
pub struct PendingFaultSlot {
    /// Page address being waited on (0 = empty).
    pub addr: PageAlignedAddr,
    /// Number of valid entries in `waiters`.
    pub waiter_count: u8,
    /// Tasks waiting for UFFDIO_COPY or UFFDIO_ZEROPAGE to resolve this page.
    /// Indices 0..waiter_count are initialised; the rest are uninitialised.
    pub waiters: [MaybeUninit<WaitEntry>; UFFD_MAX_WAITERS_PER_PAGE],
}

/// Pre-allocated, bounded flat table for one uffd shard. No heap allocation
/// after UffdInstance creation. Slot allocation and lookup are O(1) via
/// `free_stack` (free-list) and `addr_to_slot` (XArray reverse index).
/// Each shard is independent, so concurrent faults on different shards
/// never contend.
pub struct PendingFaultShard {
    pub slots: [PendingFaultSlot; UFFD_SHARD_SLOTS],
    /// Count of occupied slots (addr != 0). Checked before scanning to
    /// provide O(1) full detection.
    pub occupied: u16,
    /// Stack of free slot indices. `pop()` = O(1) allocation,
    /// `push()` = O(1) deallocation. Initialised at UffdInstance creation
    /// with all indices 0..UFFD_SHARD_SLOTS. Replaces the linear scan
    /// for a free slot in `wait_for_resolution`.
    pub free_stack: ArrayVec<u16, UFFD_SHARD_SLOTS>,
    /// Reverse index: page_addr → slot index. Provides O(1) lookup of the
    /// slot holding a given faulting address. Replaces the `.iter_mut().find()`
    /// linear scans in both `wait_for_resolution` and `resolve_fault`.
    /// XArray uses slab-allocated nodes but never allocates in the fault path
    /// beyond the pre-warmed slab pool.
    pub addr_to_slot: XArray<u16>,
}

/// State for one userfaultfd instance. Allocated on fd creation.
pub struct UffdInstance {
    /// Registered virtual address ranges and their modes.
    /// IntervalTree provides O(log N) lookup on fault address.
    pub registered_ranges: RwLock<IntervalTree<UffdVmaRange>>,

    /// Lock-free MPSC ring for fault messages delivered to userspace.
    /// Producers: page fault handlers from MULTIPLE faulting threads. A process
    /// with N threads sharing the same `MmStruct` can generate concurrent page
    /// faults, all pushing to this ring. The SPSC model was incorrect — it
    /// assumed a single producer, but multi-threaded processes produce concurrent
    /// faults. `BoundedMpmcRing` (used as MPSC: multiple producers, single
    /// consumer) provides the required atomicity via per-slot sequence numbers.
    /// Consumer: userspace thread calling read() on the uffd fd.
    /// Capacity 2048: handles burst of faults before userspace drains the ring.
    pub message_queue: BoundedMpmcRing<UffdMsg, 2048>,

    /// Sharded pending fault table.
    ///
    /// Fault address is sharded by page-aligned address bits [17:12] (6 bits above the 12-bit page offset = 64 shards).
    /// This distributes concurrent faults across 64 independent locks.
    ///
    /// Each shard is a bounded flat table (`PendingFaultShard`) — no heap allocation
    /// after UffdInstance creation. Within a shard, slot allocation is O(1) via
    /// `free_stack` and address lookup is O(1) via `addr_to_slot` XArray.
    ///
    /// **Why not HashMap**: HashMap allocates on every new-key insert. In the page
    /// fault path (thread context but potentially deep in mm/ code), allocation
    /// failures would require complex retry logic. Fixed pre-allocated storage
    /// provides predictable, bounded behaviour with no allocation in the fault path.
    pub pending_faults: [Mutex<PendingFaultShard>; UFFD_FAULT_SHARDS],

    /// Tasks waiting for any message (read() on uffd fd).
    pub wakeup_queue: WaitQueue,

    /// Negotiated API version and feature bits (from UFFDIO_API ioctl).
    pub api_version: u64,
    pub features: UffdFeatures,

    /// Registration mode flags controlling which fault types this uffd handles.
    pub mode: UffdMode,
}

impl UffdInstance {
    /// Get the shard index for a page address.
    /// Uses bits above the page offset (bits [17:12] for 4KB pages) for
    /// distribution, ensuring different pages map to different shards.
    #[inline]
    fn shard_index(page_addr: PageAlignedAddr) -> usize {
        ((page_addr.0 >> 12) & (UFFD_FAULT_SHARDS as u64 - 1)) as usize
    }

    /// Block the current task until the fault at `page_addr` is resolved.
    /// Returns `Err(UffdError::Backpressure)` if the shard table is full;
    /// the caller must yield and retry. On `Backpressure`, the page fault
    /// handler returns `VM_FAULT_RETRY`. The MM subsystem drops VMA locks,
    /// allows the calling task to be rescheduled, and retries the fault on
    /// the next access. This matches Linux's userfaultfd retry behavior.
    pub fn wait_for_resolution(&self, page_addr: PageAlignedAddr) -> Result<(), UffdError> {
        let shard_idx = Self::shard_index(page_addr);
        let mut shard = self.pending_faults[shard_idx].lock();

        // O(1) lookup: check if a slot already exists for this address.
        let slot_idx = if let Some(&idx) = shard.addr_to_slot.get(page_addr.0) {
            idx as usize
        } else {
            // O(1) allocation: pop a free slot index from the stack.
            let idx = match shard.free_stack.pop() {
                Some(i) => i,
                None => return Err(UffdError::Backpressure),
            };
            shard.slots[idx as usize].addr = page_addr;
            shard.addr_to_slot.insert(page_addr.0, idx);
            shard.occupied += 1;
            idx as usize
        };
        let slot = &mut shard.slots[slot_idx];

        if slot.waiter_count as usize >= UFFD_MAX_WAITERS_PER_PAGE {
            return Err(UffdError::TooManyWaiters);
        }
        let waiter = WaitEntry::current_task();
        slot.waiters[slot.waiter_count as usize].write(waiter.clone());
        slot.waiter_count += 1;
        drop(shard); // Release the shard lock BEFORE sleeping

        waiter.wait().map_err(|_| UffdError::Interrupted)
    }

    /// Wake all tasks waiting on `page_addr` (called from UFFDIO_COPY handler).
    pub fn resolve_fault(&self, page_addr: PageAlignedAddr) {
        let shard_idx = Self::shard_index(page_addr);
        // Extract waiters under the lock, then wake them after releasing it.
        let mut waiters_buf = [MaybeUninit::<WaitEntry>::uninit(); UFFD_MAX_WAITERS_PER_PAGE];
        let waiter_count;
        {
            let mut shard = self.pending_faults[shard_idx].lock();
            // O(1) lookup via reverse index.
            let slot_idx = match shard.addr_to_slot.remove(page_addr.0) {
                None => return, // No waiters (racy resolve — benign)
                Some(idx) => idx as usize,
            };
            let slot = &mut shard.slots[slot_idx];
            waiter_count = slot.waiter_count as usize;
            // SAFETY: Three invariants ensure this read is valid:
            // 1. Initialization: slots[0..waiter_count] were initialized by
            //    `wait_for_resolution()` before the shard lock was released.
            // 2. No deinitialization: the shard lock (held above) prevents
            //    concurrent modification between write and this read.
            // 3. waiter_count: protected by the shard SpinLock and only
            //    incremented by `wait_for_resolution()` under that lock.
            for i in 0..waiter_count {
                waiters_buf[i].write(unsafe { slot.waiters[i].assume_init_read() });
            }
            slot.addr = PageAlignedAddr(0); // Mark slot free
            slot.waiter_count = 0;
            shard.occupied -= 1;
            // O(1) deallocation: return slot index to the free stack.
            shard.free_stack.push(slot_idx as u16);
        } // Shard lock released before waking

        // SAFETY: waiters_buf[0..waiter_count] were initialised above.
        for i in 0..waiter_count {
            unsafe { waiters_buf[i].assume_init_read() }.wake();
        }
    }
}

/// One registered VA range.
pub struct UffdVmaRange {
    pub start:  VirtAddr,
    pub end:    VirtAddr,
    /// Which fault types to intercept on this range.
    pub mode:   UffdMode,
}

bitflags::bitflags! {
    pub struct UffdMode: u64 {
        /// Intercept page-not-present faults (standard userfaultfd use).
        const MISSING    = 1 << 0;
        /// Intercept write faults on write-protected pages.
        const WRITEPROTECT = 1 << 1;
        /// Intercept minor faults (page exists but may need content update).
        const MINOR      = 1 << 2;
    }
}

bitflags::bitflags! {
    pub struct UffdFeatures: u64 {
        const PAGEFAULT_FLAG_WP            = 1 << 0;
        const EVENT_FORK                   = 1 << 1;
        const EVENT_REMAP                  = 1 << 2;
        const EVENT_REMOVE                 = 1 << 3;
        const MISSING_HUGETLBFS            = 1 << 4;
        const MISSING_SHMEM                = 1 << 5;
        const EVENT_UNMAP                  = 1 << 6;
        const SIGBUS                       = 1 << 7;
        const THREAD_ID                    = 1 << 8;
        const MINOR_HUGETLBFS              = 1 << 9;
        const MINOR_SHMEM                  = 1 << 10;
        const EXACT_ADDRESS                = 1 << 11;
        const WP_HUGETLBFS_SHMEM           = 1 << 12;
        const WP_UNPOPULATED               = 1 << 13;
        const POISON                       = 1 << 14;
        const WP_ASYNC                     = 1 << 15;
    }
}

ioctl operations

All UFFDIO_* ioctl values match Linux exactly.

UFFDIO_API (0xc018aa3f): Handshake. Userspace sends uffdio_api struct with api = UFFD_API (0xaa) and desired feature bits. Kernel validates API version, returns supported features and ioctls bitmasks. Stores agreed features in UffdInstance.features. Must be the first ioctl on the fd.
UFFDIO_REGISTER (0xc020aa00): Register a VA range. Struct uffdio_register with range.start, range.len, and mode (MISSING/WP/MINOR). Kernel inserts entry into registered_ranges IntervalTree. Returns the ioctls bitmask valid for this range. The VMA covering the range must already exist; returns EINVAL if not.
UFFDIO_UNREGISTER (0x8010aa01): Remove a registered range. Removes IntervalTree entries. Wakes any tasks blocked on pending faults in the range with EFAULT.
UFFDIO_COPY (0xc028aa03): Copy a page to resolve a missing fault. Struct uffdio_copy: dst (target VA), src (source VA in caller's address space), len (must be PAGE_SIZE or HUGE_PAGE_SIZE), mode flags, copy (out: bytes copied). Kernel copies the page from src to dst in the faulting process's address space, then wakes the faulting task. UmkaOS extension: if src is a memfd offset passed via the mode flags, maps the memfd page directly into the destination (zero physical copy).
UFFDIO_ZEROPAGE (0xc020aa04): Install a zero page at the faulted address. Wakes the faulting task. Equivalent to UFFDIO_COPY with a zero source page but without the data copy.
UFFDIO_WRITEPROTECT (0xc018aa06): Write-protect or unprotect a range of pages. Struct uffdio_writeprotect: range + mode (UFFDIO_WRITEPROTECT_MODE_WP to protect, 0 to unprotect). Updates PTEs to add/remove write permission, flushes TLB.
UFFDIO_CONTINUE (0xc020aa08): Continue a minor fault. The page already exists at the physical level; this ioctl installs the PTE and wakes the faulting task.
UFFDIO_POISON (0xc018aa0a): Mark a range as poisoned. Accessing any page in the range delivers SIGBUS to the accessing task. Implemented by installing a special "poison" PTE marker.

Page fault path

Page fault handler receives fault at address A in process P:
  1. Look up A in P's uffd registered_ranges IntervalTree.
     O(log N) lookup. If not registered: handle as normal fault.
  2. Determine fault type: MISSING (PTE not present), WP (write to write-protected
     page), MINOR (page exists in page cache but PTE not yet installed).
  3. If mode does not cover this fault type: handle as normal fault.
  4. Construct UffdMsg with event=PAGEFAULT, address=A, flags per fault type.
  5. Enqueue message into UffdInstance.message_queue (BoundedMpmcRing push,
     lock-free CAS on per-slot sequence numbers — multiple threads may fault
     concurrently in the same process).
  6. Wake any thread blocked in read() on the uffd fd (if needed).
  7. Allocate WaitEntry for the faulting task; call wait_for_resolution(A):
        compute shard = bits [17:12] of A, lock pending_faults[shard], insert WaitEntry.
  8. Block the faulting task: schedule_out(current_task, WaitState::UffdWait).
  --- userspace runs, reads the message, calls UFFDIO_COPY or UFFDIO_ZEROPAGE ---
  9. UFFDIO_COPY handler:
       a. Copy source page data into a newly allocated physical page for address A.
          UmkaOS improvement: if src is a memfd page, install a copy-on-write mapping
          to the memfd page rather than copying bytes (zero-copy for live migration).
       b. Install PTE for A in the faulting process's page table.
       c. Flush TLB for A on all CPUs running threads of P.
       d. Call resolve_fault(A): compute shard = bits [17:12] of A, lock
          pending_faults[shard], remove WaitEntry list for A, unlock shard.
       e. Wake all waiting tasks. They resume executing the faulting instruction.
  10. Faulting task resumes; instruction retries and succeeds.

UmkaOS improvement: zero-copy UFFDIO_COPY via memfd source

When a VM live migration engine wants to supply guest RAM pages via userfaultfd, the standard UFFDIO_COPY requires the migration engine to first read() the page from the network socket into a buffer, then call UFFDIO_COPY to copy from that buffer into the guest address space. This is two copies of the page data.

UmkaOS adds a UFFDIO_COPY_MODE_MEMFD flag to uffdio_copy.mode. When set, src is interpreted as a memfd page offset rather than a VA. The kernel maps the memfd page directly into the guest's page table with a copy-on-write mapping. No data is copied; the physical page is shared read-only until either side writes (triggering CoW). For read-heavy VM workloads, this eliminates all copy overhead.

Write-protect mode

WP mode intercepts writes to pages that have been write-protected via UFFDIO_WRITEPROTECT. The page fault handler detects a write to a WP page, delivers a UFFD_EVENT_PAGEFAULT with UFFD_PAGEFAULT_FLAG_WP set, and blocks the writing task. Userspace inspects the write, optionally modifies the page, then calls UFFDIO_WRITEPROTECT (with mode=0) to unprotect the page and wake the task. Used for dirty-page tracking in snapshot engines.

MADV_USERFAULTFD hint

madvise(addr, len, MADV_USERFAULTFD) (UmkaOS-specific advice constant) marks the range as a candidate for uffd monitoring. This is a hint to the kernel to pre-register the range in the VMA's uffd_flags, so that future page faults check the uffd IntervalTree without needing a full VMA scan. It does not replace UFFDIO_REGISTER.

Error cases

Error	Condition
`EPERM`	Caller lacks privilege and `USER_MODE_ONLY` not set in flags
`ENOMEM`	Kernel cannot allocate `UffdInstance`
`EINVAL`	`UFFDIO_API` with wrong API magic; `UFFDIO_REGISTER` on unmapped range; `UFFDIO_COPY` with bad length
`EFAULT`	Bad pointer in ioctl struct
`EEXIST`	`UFFDIO_REGISTER` on a range that is already registered

Linux compatibility: all ioctl numbers, struct layouts, event codes, and feature flags match Linux 5.14+. The UFFDIO_COPY_MODE_MEMFD extension uses a reserved flag bit in uffdio_copy.mode and is ignored by Linux (which returns EINVAL for unknown mode bits — UmkaOS accepts it). The MADV_USERFAULTFD advice is UmkaOS-only.

4.15.4.1 Custom Fault Handler Domain Crossing (Nucleus → Tier 1)¶

Problem statement: KVM post-copy live migration and device-backed memory require custom page fault handlers that run in Tier 1 subsystems (KVM, device drivers). The page fault handler itself runs in Core (Tier 0). A direct function call from Core's fault handler into a Tier 1 subsystem would violate isolation boundaries — a crash in the Tier 1 handler during fault resolution would take down Core.

UmkaOS solves this with a fault request ring protocol: Core posts a message to the registering Tier 1 subsystem's KABI ring and blocks the faulting task until a response arrives. This preserves isolation while keeping the latency path tight (one ring round-trip per fault).

Fault registration: A Tier 1 subsystem (e.g., KVM, a device driver) registers a custom fault handler for a VMA range by calling register_custom_fault() during VMA setup. The registration records the target KABI ring and the FaultType discriminant in the VMA's vm_ops metadata. Core's page fault handler checks for custom fault registrations before falling through to the normal anonymous/file-backed fault path.

/// A fault request message posted from Core (Tier 0) page fault handler
/// to a Tier 1 subsystem's KABI command ring.
///
/// Core posts this when it encounters a page fault on a VMA with a
/// registered custom fault handler. The faulting task blocks until the
/// Tier 1 subsystem posts a `FaultResponse`.
///
// kernel-internal, not KABI
/// `#[repr(C)]` for stable ABI layout on the KABI ring.
/// Total size: 24 bytes (8 + 4 + 4 + 4 + 4).
// kernel-internal, not KABI
#[repr(C)]
pub struct FaultRequest {
    /// Faulting virtual address (page-aligned by Core before posting).
    pub fault_addr: u64,
    /// Fault flags from the hardware exception (read/write/exec, user/kernel).
    /// Architecture-neutral encoding — Core translates from arch-specific
    /// error codes before posting.
    pub fault_flags: u32,
    /// Discriminant identifying the fault type. Lets the Tier 1 handler
    /// dispatch to the correct resolution path without parsing the VMA.
    pub fault_type: FaultType,
    /// PID of the faulting process. Tier 1 uses this to locate the
    /// correct guest context (KVM) or device mapping (device driver).
    pub source_pid: u32,
    /// Reserved padding. Must be zero.
    pub _pad: u32,
}

/// Custom fault type discriminant. Identifies the resolution path
/// that the Tier 1 handler should take.
#[repr(u32)]
pub enum FaultType {
    /// KVM post-copy migration: the guest accessed a page that has not
    /// yet been transferred from the migration source. The KVM handler
    /// fetches the page over the migration channel and installs it.
    PostCopyFetch = 0,
    /// Device-backed memory fault: the page is backed by a device
    /// (e.g., GPU VRAM, persistent memory device) and must be fetched
    /// or mapped by the device driver.
    DeviceFault = 1,
    /// Userfaultfd-like resolution for in-kernel consumers. Unlike
    /// userspace userfaultfd (which delivers messages via read() on an
    /// fd), this path delivers to a Tier 1 kernel subsystem's KABI ring.
    /// Used by in-kernel migration engines and checkpoint/restore.
    UserfaultfdResolve = 2,
}

/// Fault resolution response from Tier 1 back to Core (Tier 0).
/// Posted on the KABI completion ring after the Tier 1 handler has
/// resolved the fault (fetched the page, mapped device memory, etc.).
///
/// `#[repr(C)]` for stable ABI layout on the KABI ring.
/// Total size: 24 bytes (8 + 4 + 4 + 8).
#[repr(C)]
pub struct FaultResponse {
    /// The faulting address (matches `FaultRequest.fault_addr`).
    /// Core uses this to locate the blocked task's wait entry.
    pub fault_addr: u64,
    /// Result code: 0 on success, negative errno on failure.
    /// On failure, Core delivers SIGBUS to the faulting task
    /// (same as a hardware fault on an unmapped page).
    pub result: i32,
    /// Padding for alignment.
    pub _pad: u32,
    /// Physical address of the resolved page. Core installs a PTE
    /// mapping this frame into the faulting process's address space.
    /// For `PostCopyFetch`: the page fetched from the migration source.
    /// For `DeviceFault`: the device-mapped physical frame.
    /// For `UserfaultfdResolve`: the page supplied by the resolver.
    /// Zero when `result != 0`.
    pub page_phys: PhysAddr,
}
const_assert!(core::mem::size_of::<FaultRequest>() == 24);
const_assert!(core::mem::size_of::<FaultResponse>() == 24);

Fault request ring protocol (end-to-end):

Core (Tier 0) — page fault handler:
  1. Hardware exception delivers fault for address A in process P.
  2. Core looks up the VMA for A. VMA has `custom_fault: Some(registration)`.
  3. Core constructs FaultRequest { fault_addr: A, fault_flags, fault_type,
     source_pid: P.pid, _pad: 0 }.
  4. Post FaultRequest to the Tier 1 subsystem's KABI command ring
     (identified by registration.kabi_ring_id).
  5. Allocate WaitEntry for the faulting task:
     shard = bits [17:12] of A (same sharding as userfaultfd).
     Lock pending_custom_faults[shard], insert WaitEntry.
  6. Block the faulting task: schedule_out(current_task, WaitState::CustomFaultWait).

Tier 1 (KVM / device driver / resolver):
  7. Dequeue FaultRequest from KABI command ring.
  8. Dispatch on fault_type:
     - PostCopyFetch: send page request to migration source over the
       migration channel. Receive page data. Allocate a physical frame
       and copy data into it. Return its PhysAddr.
     - DeviceFault: map the device-backed physical frame. Return its
       PhysAddr.
     - UserfaultfdResolve: resolve per subsystem-specific logic.
  9. Post FaultResponse { fault_addr: A, result: 0, page_phys }
     to KABI completion ring.

Core (Tier 0) — fault completion:
  10. Dequeue FaultResponse for address A.
  11. If result == 0:
      a. Extract physical frame from page_handle.
      b. Install PTE for address A in process P's page table
         (permissions from VMA flags).
      c. Flush TLB for A on all CPUs running threads of P.
  12. If result != 0:
      a. Deliver SIGBUS to the faulting task (si_addr = A,
         si_code = BUS_ADRERR).
  13. Resolve the wait entry:
      shard = bits [17:12] of A, lock pending_custom_faults[shard],
      remove WaitEntry, unlock shard. Wake the faulting task.
  14. Faulting task resumes; instruction retries and succeeds (or
      the SIGBUS handler runs).

Latency analysis: The faulting task blocks for the full round-trip: KABI ring post (~50ns) + Tier 1 processing + KABI ring response (~50ns) + wake (~200ns). For KVM post-copy, the dominant cost is the network fetch (~50-500us depending on page size and network latency). The ring overhead is negligible (<1us). For device faults, the device mapping is typically <1us (MMIO BAR already mapped), so total fault latency is ~1-2us — comparable to a TLB miss on a cold page.

Crash handling: If the Tier 1 subsystem crashes while a FaultRequest is pending (Core is blocking), the KABI health monitor detects the crash and synthesizes a FaultResponse { result: -EIO } for all pending requests on the crashed ring. This wakes all blocked tasks with SIGBUS, preventing indefinite hangs. The crashed subsystem is restarted via the standard Tier 1 reload path (Section 11.9).

Cross-references: - Userfaultfd (userspace fault handling): this section (userfaultfd above) - KVM post-copy migration: Section 18.1 - KABI ring buffer protocol: Section 12.1 - Tier 1 crash recovery: Section 11.9 - PhysAddr: Section 4.2

4.15.5 memfd_create — Create Anonymous File¶

Syscall signature

int memfd_create(const char *name, unsigned int flags);

Returns a file descriptor on success, -1 on error.

Flags

bitflags::bitflags! {
    /// Flags for memfd_create(2). Exact Linux values.
    pub struct MemfdFlags: u32 {
        /// Set close-on-exec on the returned fd.
        const CLOEXEC        = 0x0001;

        /// Allow seals to be applied via fcntl(F_ADD_SEALS). Without this flag,
        /// F_ADD_SEALS returns EPERM.
        const ALLOW_SEALING  = 0x0002;

        /// Back the memfd with hugeTLB pages from the hugetlbfs pool.
        const HUGETLB        = 0x0004;

        /// Hugepage size selector (combined with HUGETLB): 2MB pages.
        /// Value: (21 << 26) = 0x54000000.
        const HUGE_2MB       = 0x54000000;

        /// Hugepage size selector (combined with HUGETLB): 1GB pages.
        /// Value: (30 << 26) = 0x78000000.
        const HUGE_1GB       = 0x78000000;

        /// Automatically apply F_SEAL_EXEC (prevents mmap with PROT_EXEC).
        /// Added in Linux 6.3. Implies ALLOW_SEALING.
        const NOEXEC_SEAL    = 0x0008;

        /// Allow the file to be mapped executable. When /proc/sys/vm/memfd_noexec
        /// is 2 (restrictive mode), this flag is required to map as PROT_EXEC.
        /// Added in Linux 6.3. Conflicts with NOEXEC_SEAL (returns EINVAL).
        const EXEC           = 0x0010;
    }
}

name parameter

The name string appears in /proc/PID/fd/N as memfd:name. Maximum length is 249 bytes (enforced by UmkaOS; Linux silently truncates at 249 bytes without returning an error — UmkaOS returns EINVAL for names exceeding 249 bytes to avoid silent data loss). The name is informational only; there is no filesystem lookup by name.

Internal representation

/// An anonymous in-memory file created by memfd_create(2).
/// Backed by UmkaOS's page cache; supports read, write, mmap, ftruncate, lseek.
pub struct AnonFile {
    /// Descriptive name for /proc/PID/fd/ display. Max 249 bytes.
    pub name: ArrayString<249>,

    /// Current file size in bytes. mmap() beyond this → SIGBUS.
    /// Changed by ftruncate(); cannot shrink below sealed size if F_SEAL_SHRINK set.
    pub size: AtomicU64,

    /// Page cache backing the file content.
    /// Pages are ordinary page-cache pages; they are evictable under memory pressure
    /// unless the file is sealed and mapped (sealing keeps pages pinned).
    pub page_cache: PageCache,

    /// Applied seals (F_SEAL_*). Append-only: seals can only be added, never removed.
    pub seals: AtomicU32,

    /// Flags from memfd_create call, retained for /proc inspection.
    pub flags: MemfdFlags,

    /// Hugetlb page size (0 if not HUGETLB). One of PAGE_SIZE, HUGE_2MB, HUGE_1GB.
    pub huge_page_size: usize,
}

Seal constants (from fcntl F_ADD_SEALS / F_GET_SEALS, exact Linux values):

bitflags::bitflags! {
    /// File seals for memfd. Applied via fcntl(fd, F_ADD_SEALS, seals).
    pub struct MemfdSeals: u32 {
        /// Prevent ftruncate() from shrinking the file.
        const SEAL_SHRINK  = 0x0001;
        /// Prevent ftruncate() from growing the file.
        const SEAL_GROW    = 0x0002;
        /// Prevent write() and mmap(PROT_WRITE) (shared writable mappings).
        const SEAL_WRITE   = 0x0004;
        /// Prevent any future seal modifications (seal the seals).
        const SEAL_SEAL    = 0x0008;
        /// Prevent mmap(PROT_EXEC) and mprotect(PROT_EXEC). Added Linux 6.3.
        const SEAL_EXEC    = 0x0010;
    }
}

File operations

The AnonFile is exposed through the VFS layer (Section 14.1) via a FileOps implementation:

read(offset, buf): copies page-cache pages into buf. Standard copy_to_user.
write(offset, buf): copies buf into page-cache pages, allocating pages on demand. Fails with EPERM if SEAL_WRITE is set.
mmap(offset, len, prot, flags):
MAP_SHARED | PROT_WRITE: fails if SEAL_WRITE set.
MAP_SHARED | PROT_EXEC or MAP_PRIVATE | PROT_EXEC: fails if SEAL_EXEC set.
With HUGETLB: allocates pages from the hugetlb pool (Section 4.5), uses huge PTEs.
Private mappings (MAP_PRIVATE) are copy-on-write from the page cache.
ftruncate(size):
If new size < old size and SEAL_SHRINK set: EPERM.
If new size > old size and SEAL_GROW set: EPERM.
Adjusts AnonFile.size. Pages beyond new size are freed if shrinking.
lseek(offset, whence): standard file seek semantics.
fcntl(F_ADD_SEALS, seals): requires ALLOW_SEALING (else EPERM). Adds seal bits atomically. Cannot remove seals. SEAL_SEAL prevents any further seal additions.
fcntl(F_GET_SEALS): returns current seal bitmask.

Algorithm — memfd_create

memfd_create(name, flags):
  1. Validate name: strlen(name) <= 249. Return EINVAL if too long.
  2. Validate flags: EXEC and NOEXEC_SEAL cannot both be set. Return EINVAL if both.
  3. Validate flags: only known bits set. Return EINVAL for unknown bits.
  4. If HUGETLB set: validate HUGE_2MB or HUGE_1GB is consistent. Return EINVAL
     if incompatible size bits set.
  5. Allocate AnonFile struct in slab. Return ENOMEM if allocation fails.
  6. Initialise AnonFile:
       name = copy of name param.
       size = 0.
       page_cache = PageCache::new(anonfile_ops).
       seals = 0.
       flags = flags param.
       huge_page_size = if HUGE_2MB: HUGE_2MB_SIZE elif HUGE_1GB: HUGE_1GB_SIZE else 0.
  7. If NOEXEC_SEAL: set seals |= SEAL_EXEC.
  8. Allocate a file descriptor in the current process's fd table.
     Return EMFILE if per-process fd limit (RLIMIT_NOFILE) reached.
  9. Install AnonFile into fd table as a FileDescriptor with the AnonFileOps vtable.
  10. If CLOEXEC: set FD_CLOEXEC on the fd.
  11. Return fd.

Use cases

JIT compilers: memfd_create + ftruncate + write code + F_SEAL_WRITE + mmap(PROT_EXEC). The seal prevents runtime modification after compilation.
systemd unit file passing: service manager writes config into memfd, passes fd to child via LISTEN_FDS. Sealed with F_SEAL_WRITE so the child cannot corrupt it.
Container rootfs: overlay layers backed by memfds for ephemeral container filesystems.
Anonymous shared memory: multiple processes share a memfd via fd passing (e.g., over Unix socket SCM_RIGHTS), replacing POSIX shm_open / SysV shmget.

Error cases

Error	Condition
`EINVAL`	`name` too long (>249 bytes); conflicting flags (EXEC + NOEXEC_SEAL); unknown flag bits; incompatible HUGE_* bits
`EMFILE`	Per-process fd limit reached
`ENFILE`	System-wide open file limit reached
`ENOMEM`	Kernel cannot allocate AnonFile struct

Linux compatibility: syscall number, flag values, seal constants, and all ioctl semantics match Linux 3.17+ (memfd_create introduction) through 6.3+ (NOEXEC_SEAL/EXEC). UmkaOS returns EINVAL on names >249 bytes rather than silently truncating; this is a deliberate deviation that produces correct behaviour for correct callers and surfaces bugs in callers that pass too-long names.

4.15.6 memfd_secret — Create a Secret Memory Region¶

Syscall signature

int memfd_secret(unsigned int flags);

Returns a file descriptor on success, -1 on error. The fd is then used with mmap() to create a virtual memory region that is inaccessible to the kernel itself.

Availability

UmkaOS enables memfd_secret unconditionally on x86-64 and AArch64 (where hardware assists are available or direct-map manipulation is safe). On ARMv7, RISC-V 64, PPC32, PPC64LE, s390x, and LoongArch64, UmkaOS returns ENOSYS — these architectures lack the address-space topology needed to safely excise regions from the kernel direct map without excessive TLB maintenance cost.

Flags

Only O_CLOEXEC (0x80000) is defined for memfd_secret. All other flag bits are reserved; returning EINVAL if set.

Security model

The kernel direct map (physmap / PAGE_OFFSET region) maps all physical RAM at a fixed virtual address in the kernel address space. Any code with kernel-mode execution can read any RAM via this mapping, including secret process data. memfd_secret defeats this by removing the direct-map PTEs for the selected physical pages.

After memfd_secret and mmap():

The physical pages backing the secret region are allocated normally.
The process page table has normal PTEs for the region (present, readable, writable).
The kernel direct-map page table has no PTEs for those physical frame numbers.
A per-kernel SecretRegions set tracks which PFNs have been excised.
Any kernel code path that would access those PFNs via the direct map (e.g., copy_to_user, copy_from_user, kmap, swap writeback) checks SecretRegions and returns EFAULT or refuses the operation.

Internal data structures

/// Global registry of PFNs that have been excised from the kernel direct map.
/// Kernel code must check this before accessing any PFN via the direct map.
/// XArray keyed by `Pfn::raw()` (integer key) — mandated by the collection
/// usage policy ([Section 3.13](03-concurrency.md#collection-usage-policy)) for all integer-keyed mappings.
/// RCU-safe reads allow `copy_from_user` checks without acquiring any lock.
/// Write side (add/remove PFN) holds the XArray's internal `xa_lock`.
static SECRET_REGIONS: XArray<SecretPageInfo> = XArray::new();

/// **Allocation strategy**: XArray uses slab-allocated nodes. `SECRET_REGIONS`
/// pre-warms the slab pool at boot for the estimated maximum secret page count
/// (`system_memory / PAGE_SIZE × 0.001`, capped at 64K entries). If insert
/// fails during `mmap(MAP_SECRET)` PTE excision (slab exhaustion under extreme
/// pressure), the fault handler undoes PTE clearing for already-processed PFNs
/// and returns `ENOMEM`.

pub struct SecretPageInfo {
    /// The virtual address in the owning process that maps this PFN.
    pub owner_va: VirtAddr,
    /// The owning process's PID, for diagnostics.
    pub owner_pid: Pid,
    /// Hardware encryption key ID (AMD SME), if hardware encryption active.
    /// 0 if not using hardware encryption.
    pub sme_key_id: u32,
}

/// Per-process set of secret VMAs. Tracked for munmap() and process-exit cleanup.
///
/// `Vec` allocation here is intentional and safe. `mmap(MAP_SECRET)` and `munmap()`
/// are **process context** operations (never called from interrupt context or from
/// within the memory reclaim path). `Vec::push` uses `GFP_KERNEL` semantics — it can
/// block and invoke the page reclaimer, but the caller (a user syscall) already holds
/// no locks that the reclaimer needs, so there is no deadlock risk.
/// A typical process has 0-5 secret VMAs; `Vec` is appropriate for this cardinality.
pub struct SecretVmaSet {
    pub entries: Vec<SecretVmaEntry>,
}

pub struct SecretVmaEntry {
    pub va_start: VirtAddr,
    pub va_end:   VirtAddr,
    pub pfns:     Vec<Pfn>,
}

Algorithm — mmap on a memfd_secret fd

mmap(addr, len, PROT_READ|PROT_WRITE, MAP_SHARED, secret_fd, 0):
  1. Only MAP_SHARED is permitted. MAP_PRIVATE returns EINVAL (secret pages cannot be
     CoW'd — a copy would not inherit the kernel direct-map excision).
  2. Only PROT_READ, PROT_WRITE, or both are permitted. PROT_EXEC returns EINVAL
     (executable secret regions create JIT-compiler attack surfaces).
  3. Validate len is PAGE_SIZE-aligned and non-zero.
  4. Allocate physical pages for the region:
       pages = buddy_alloc(len / PAGE_SIZE, GFP_KERNEL | GFP_ZERO)
       Return ENOMEM if allocation fails.
  5. Install process PTEs: for each allocated PFN, install a present, user-accessible
     PTE at the requested virtual address in the process page table.
  6. Excise from kernel direct map:
       For each PFN in the allocation:
         a. Clear the PTE at (KERNEL_PHYSMAP_BASE + pfn * PAGE_SIZE) in the kernel
            page table. This is a single store to a kernel PTE.
         b. Insert PFN into SECRET_REGIONS (acquires write spinlock, O(1)).
       Flush kernel TLB for all excised addresses on all CPUs:
         tlb_flush_kernel_range(phys_to_virt(pfn * PAGE_SIZE), len).
       This TLB flush is expensive (IPI to all CPUs) but occurs once at mmap() time,
       not on every secret region access.
  7. If AMD SME is available (detected via CPUID leaf 0x8000001F):
       a. Assign an ephemeral C-bit encryption key to these pages.
       b. Set the C-bit in the process PTEs (memory encryption enabled for these pages).
       c. Hardware encrypts all writes to these pages with the process-specific key.
       d. Store key_id in SecretPageInfo.
     If ARM64 Realm (CCA) is available:
       a. Assign these pages to the Realm granule table (Realm PA space).
       b. Normal world (kernel) cannot access Realm-assigned pages; accesses fault.
  8. Record the VMA in the process's SecretVmaSet for cleanup.
  9. Return the mapped VA.

Algorithm — munmap / process exit cleanup

On munmap(addr, len) covering a secret region:
  1. Look up SecretVmaEntry for [addr, addr+len).
  2. Restore kernel direct-map PTEs:
       For each PFN in the entry:
         a. Remove from SECRET_REGIONS.
         b. Re-install PTE at (KERNEL_PHYSMAP_BASE + pfn * PAGE_SIZE).
     Flush kernel TLB for restored range.
  3. Zero the physical pages (prevent data leakage before returning to buddy allocator).
  4. Remove process PTEs for the region.
  5. Return pages to buddy allocator.
  6. Remove SecretVmaEntry from the process's SecretVmaSet.

Limitations

Fork semantics with active memfd_secret regions: fork() succeeds unconditionally. The child inherits the memfd_secret fd (standard POSIX semantics). However, the child does NOT inherit the secret mapping -- the VMA has VM_DONTCOPY set, so dup_mmap() skips it. The child's address space does not contain the secret region. The child can call mmap() on the inherited fd to establish its own independent mapping of the same secret pages. If the parent set O_CLOEXEC at creation time, the fd is closed in the child on the next exec(), not on fork().
Cannot read() or write() via the fd: the fd has no data operations. Content is accessible only through the mmap'd region. read()/write() on the fd return EBADF.
Pages are pinned (cannot be swapped): swap requires the kernel to write the page to a swap device, which requires kernel direct-map access. Since the direct-map entries are excised, swap is impossible. The pages are PG_mlocked to prevent reclaim.
No userfaultfd on secret regions: uffd requires kernel access to copy pages (UFFDIO_COPY), which conflicts with the direct-map excision. Attempting to register a secret region with uffd returns EINVAL.
Single process only: the fd cannot be shared via SCM_RIGHTS (returns EACCES on the receiving end if the receiving process attempts to mmap it, since the kernel cannot install cross-process direct-map entries consistently).

Kernel code protection

All kernel paths that access user memory must check SECRET_REGIONS:

/// Illustrates the SECRET_REGIONS check integrated into copy_from_user.
/// This is NOT the complete copy_from_user implementation (which also handles
/// non-present pages, highmem on 32-bit, cross-page copies, SMAP/PAN, etc.).
/// The full implementation is in [Section 19.1](19-sysapi.md#syscall-interface).
///
/// **Important**: This illustration shows ONLY the SECRET_REGIONS guard.
/// The actual `copy_from_user` implementation accesses the user page via the
/// user virtual address (with SMAP/PAN temporarily enabled), NOT via the
/// kernel direct map. The PFN is computed only to check the SECRET_REGIONS
/// set; the data copy itself uses the user VA.
///
/// A `SECRET_REGIONS_ACTIVE: AtomicBool` static branch short-circuits the
/// check when no secret pages exist (the common case). When false, the
/// branch is predicted-not-taken (~0 cycles). When memfd_secret creates
/// the first secret region, the flag is set to true and the check is
/// activated. This matches Linux's `static_branch` pattern.
pub fn copy_from_user(kernel_dst: &mut [u8], user_va: VirtAddr) -> Result<(), Efault> {
    // Compute PFN from the user VA for the SECRET_REGIONS check.
    // Uses virt_to_phys() (defined in [Section 4.3](#slab-allocator--address-conversion-helpers)).
    let phys = virt_to_phys(user_va.as_ptr());
    let pfn = phys.0 >> PAGE_SHIFT;
    if SECRET_REGIONS_ACTIVE.load(Relaxed) && SECRET_REGIONS.contains(&pfn) {
        // SAFETY: We are intentionally refusing to access a secret page.
        return Err(Efault);
    }
    // The actual copy uses the user VA with SMAP/PAN, not the direct map.
    unsafe { copy_from_user_raw(kernel_dst, user_va) }
}

The XArray RCU-mode read-side check is O(log64 N) and lock-free (for 64K max secret pages, this is 3 radix tree levels — effectively constant time). On x86-64 with SME, the hardware C-bit provides a secondary enforcement layer: even if copy_from_user skips the software check (e.g., via a kernel exploit), the hardware will return encrypted garbage rather than plaintext.

Error cases

Error	Condition
`ENOSYS`	Architecture does not support `memfd_secret` (ARMv7, RISC-V, PPC, s390x, LoongArch64)
`EPERM`	Caller is unprivileged and `/proc/sys/vm/memfd_secret_allowed` is 0
`ENOMEM`	Cannot allocate physical pages for the secret region
`EINVAL`	Unknown flags; PROT_EXEC on mmap; MAP_PRIVATE on mmap; size not page-aligned
`EBADF`	`read()`/`write()` attempted on the secret fd

Linux compatibility: syscall number and basic fd semantics match Linux 5.14+. The AMD SME encryption and ARM64 Realm extensions are UmkaOS-specific enhancements that operate transparently (userspace sees the same interface; hardware provides additional enforcement).

4.15.7 process_vm_readv / process_vm_writev — Cross-Process Memory I/O¶

Syscall signatures

ssize_t process_vm_readv(pid_t pid,
                         const struct iovec *local_iov,  size_t liovcnt,
                         const struct iovec *remote_iov, size_t riovcnt,
                         unsigned long flags);

ssize_t process_vm_writev(pid_t pid,
                          const struct iovec *local_iov,  size_t liovcnt,
                          const struct iovec *remote_iov, size_t riovcnt,
                          unsigned long flags);

Return total bytes transferred on success, -1 on error. Partial transfers (when the remote range spans valid and invalid pages) return the count of bytes successfully transferred before the first fault.

Parameters

pid: target process identifier (raw PID, not pidfd — Linux API uses raw PID here).
local_iov[liovcnt]: scatter/gather descriptors in the caller's address space. iov_base is a VA in the current process; iov_len is the byte count.
remote_iov[riovcnt]: scatter/gather descriptors in the target process's address space.
flags: must be 0. All other values return EINVAL. Reserved for future extensions.

flags field

bitflags::bitflags! {
    /// Flags for process_vm_readv / process_vm_writev.
    /// Currently only the zero value is valid. Reserved bits are rejected with EINVAL.
    /// Uses `KernelULong` (= `usize`) to match Linux's `unsigned long` parameter:
    /// 32-bit on ILP32 architectures (ARMv7, PPC32), 64-bit on LP64 (x86-64, AArch64).
    /// The compat syscall layer zero-extends the 32-bit value on ILP32.
    pub struct ProcessVmFlags: KernelULong {
        // No flags defined. Field reserved for future use.
    }
}

Permission model

The caller must have one of: 1. PTRACE_MODE_ATTACH_REALCREDS permission over the target process (same as ptrace attach), checked via ptrace_may_access(target, PTRACE_MODE_ATTACH_REALCREDS). 2. The same real UID/GID as the target, the target has not set PR_SET_DUMPABLE to non-dumpable, and no LSM vetoes the access.

Cannot cross user namespace boundaries unless the caller has CAP_SYS_PTRACE in the target's user namespace.

The permission check uses the real credentials of the calling thread (current_real_cred), not the effective credentials, matching Linux's ptrace_may_access semantics.

Algorithm — process_vm_readv

process_vm_readv(pid, local_iov, liovcnt, remote_iov, riovcnt, flags):
  1. Validate flags == 0. Return EINVAL if not.
  2. Validate liovcnt and riovcnt are <= IOV_MAX (1024). Return EINVAL if exceeded.
  3. Copy local_iov and remote_iov arrays from userspace. Return EFAULT if either
     pointer is inaccessible.
  4. Compute transfer size: transfer_size = min(sum(local_iov[i].iov_len),
     sum(remote_iov[j].iov_len)). Local and remote totals need not match;
     the transfer proceeds up to the smaller of the two sums.
  5. Resolve target process: task_from_pid(pid). Return ESRCH if not found.
     Take a reference to prevent the target from exiting during the operation.
  6. Permission check: ptrace_may_access(target). Return EPERM if denied.
  7. Acquire target MM read-lock: target.mm.read_lock().
  8. Initialize local_cursor and remote_cursor (index + byte offset within current iov).
  9. total_copied = 0.
  10. While remote bytes remain:
        a. Get next remote segment: remote_va, remote_len from remote_iov[remote_cursor].
        b. Get next local segment: local_va, local_len from local_iov[local_cursor].
        c. chunk = min(remote_len, local_len).
        d. Check if remote_va is in SECRET_REGIONS (any PFN in [remote_va, remote_va+chunk)).
           If so: return EFAULT (cannot read secret pages cross-process).
        e. Pin remote pages: get_user_pages(target.mm, remote_va, chunk,
                                            FOLL_REMOTE | FOLL_GET, &pages).
           Returns number of pages successfully pinned.
           If 0 pages pinned: release target MM read-lock; return EFAULT.
        f. If chunk > LARGE_COPY_THRESHOLD (1MB) and chunk is PAGE_SIZE-aligned:
             Use remap_copy_path (see UmkaOS improvement below).
           Else:
             memcpy from pinned pages into local_va, byte by byte across page
             boundaries (kmap each page, copy, kunmap).
        g. Unpin pages: put_user_pages(pages, num_pages).
        h. total_copied += chunk.
        i. Advance local_cursor and remote_cursor by chunk.
  11. Release target MM read-lock.
  12. Drop target process reference.
  13. Return total_copied.

Algorithm — process_vm_writev

Identical to process_vm_readv but data flows from local to remote. The get_user_pages call uses FOLL_WRITE on remote pages to get writable mappings. The copy direction is reversed (local to remote pages).

UmkaOS improvement: efficient large transfers

For transfers larger than 1MB where both source and destination are page-aligned, UmkaOS substitutes a kmap-window path for the byte-copy path:

remap_copy_path(src_mm, src_va, dst_va, len):
  1. Allocate a temporary kernel VA window (using vmalloc area).
  2. For each PAGE_SIZE-aligned chunk of [src_va, src_va+len):
       a. Get the PFN of the source page (from src_mm's page tables).
       b. Map that PFN into the temporary kernel VA window.
       c. Map the target (local) address as a writable page in the current mm.
       d. Use SIMD-accelerated memcpy (rep movsb on x86, NEON on AArch64) to copy
          from the kernel window to the local page.
     Note: no intermediate heap buffer is allocated — data goes directly from the
     source physical page to the destination physical page via kmap/kunmap.
  3. Release temporary kernel VA window.

For workloads that regularly read >1MB from another process (e.g., a debugger inspecting a large heap), this avoids the overhead of allocating and freeing a large intermediate buffer for each transfer.

Thread safety and raciness

The target process may be running concurrently. Reading its memory while it modifies it is inherently racy. This is intentional and matches Linux semantics — process_vm_readv is documented as providing no atomicity guarantees. The target MM read-lock is held only to pin pages, not for the duration of the copy; the target can create or destroy VMAs while the copy is in progress (page pinning prevents the physical page from being reclaimed, but the VMA map may change).

Callers (debuggers, profilers, GC engines) that require consistent reads must arrange their own synchronisation with the target (e.g., ptrace SIGSTOP).

Cannot read memfd_secret regions

Steps 10d above checks SECRET_REGIONS. Any attempt to read or write a PFN in the secret set returns EFAULT. The caller cannot work around this by using different remote iov segments — the check is per-PFN, covering any sub-page access.

Error cases

Error	Condition
`ESRCH`	No process with the given PID, or process is a zombie
`EPERM`	Caller does not have ptrace permission over the target
`EFAULT`	Remote address faults (not mapped, not pinnable, or in secret region)
`EINVAL`	`flags` is non-zero; `liovcnt` or `riovcnt` > IOV_MAX; iov length sums mismatch
`ENOMEM`	Cannot pin target pages (target has too many pinned pages)

Linux compatibility: syscall numbers, argument order, iovec struct layout, return value semantics, and permission model match Linux 3.2+. The secret-region EFAULT and the large-transfer optimised path are UmkaOS extensions that are transparent to callers.

4.15.8 process_madvise — Batch madvise for Another Process¶

Syscall signature

ssize_t process_madvise(int pidfd, const struct iovec *iovec, size_t vlen,
                        int advice, unsigned int flags);

Returns total bytes advised on success (sum of iov_len for successfully processed iov entries), -1 on error. Like madvise(), this is advisory — the kernel may ignore the hints.

Why pidfd instead of raw PID

Using a pidfd (a file descriptor referring to a specific process, created by pidfd_open()) eliminates the TOCTOU race inherent in raw PID numbers: a raw PID may be recycled between the lookup and the advise operation, accidentally advising the wrong process. A pidfd refers to a specific process struct; even if the process exits, the pidfd remains valid (the kernel keeps the process struct alive) and subsequent operations on it return ESRCH cleanly. This matches the Linux 5.10+ pidfd-based API design.

Advice values

Linux-compatible remote advice whitelist:

Linux process_madvise (since 5.10) only permits these advice values for remote processes (i.e., when the target pidfd refers to another process):

/// Advice values permitted for remote process_madvise().
/// Values match Linux MADV_* constants. This whitelist matches Linux 6.13+.
#[repr(i32)]
pub enum RemoteMadviseAdvice {
    /// Mark pages as will-be-needed soon; kernel prefetches.
    Willneed      = 3,
    /// Mark pages as cold; candidates for reclaim before warmer pages.
    Cold          = 20,
    /// Request immediate reclaim of the specified pages.
    Pageout       = 21,
    /// Collapse small pages into transparent huge pages.
    Collapse      = 25,
}

When process_madvise is called with a self-pidfd (target == caller), any valid madvise advice value is permitted (same as calling madvise directly).

UmkaOS extensions for remote process_madvise:

The following advice values are UmkaOS extensions not available in Linux's remote process_madvise. They require CAP_SYS_ADMIN and are clearly outside the Linux ABI contract:

/// UmkaOS-only advice values for remote process_madvise.
/// Values ≥ 256 are the UmkaOS madvise extension namespace.
#[repr(i32)]
pub enum UmkaRemoteMadviseAdvice {
    /// Mark pages as not needed; kernel may free them.
    /// Linux rejects MADV_DONTNEED for remote process_madvise — UmkaOS permits it
    /// with CAP_SYS_ADMIN because it is useful for memory manager daemons.
    Dontneed      = 4,
    /// Mark pages as freeable (kernel may or may not free immediately).
    /// Same rationale as MADV_DONTNEED: rejected by Linux remotely, UmkaOS permits
    /// with CAP_SYS_ADMIN.
    Free          = 8,
    /// Mark pages as discardable (memory hibernation hint, §4.4.3).
    /// The kernel may compress or offload these pages under memory pressure.
    Discardable   = 256,
    /// Mark pages as critical (must not be reclaimed or compressed).
    /// Used by real-time and latency-sensitive allocations.
    Critical      = 257,
}

MADV_DISCARDABLE (256) and MADV_CRITICAL (257) are UmkaOS extensions. Values ≥ 256 are reserved as the UmkaOS madvise extension namespace. Linux's current allocation reaches MADV_COLLAPSE = 25, with MADV_HWPOISON = 100 and MADV_SOFT_OFFLINE = 101 as isolated arch-specific outliers. Using 256+ provides a clean separation that survives Linux's natural growth at ~2 hints per major release (a conflict would require ~115 years of Linux development). These hints integrate with the process memory hibernation subsystem described in Section 4.5 and allow a privileged memory manager daemon to set hibernation hints on behalf of application processes.

Permission model

Advice type                  Required capability
───────────────────────────  ──────────────────────────────────────
Linux-compatible remote:
  MADV_COLD, MADV_WILLNEED   CAP_SYS_NICE or same real UID as target
  MADV_PAGEOUT               CAP_SYS_NICE or same real UID as target
  MADV_COLLAPSE              CAP_SYS_NICE or same real UID as target
UmkaOS extensions (remote):
  MADV_DONTNEED, MADV_FREE   CAP_SYS_ADMIN (destructive — can cause data loss)
  MADV_DISCARDABLE,
    MADV_CRITICAL             CAP_SYS_NICE (UmkaOS extension)
Self-pidfd:
  Any valid madvise advice    No extra capability (same as madvise())

Destructive advice (MADV_DONTNEED, MADV_FREE) requires CAP_SYS_ADMIN for remote targets because it can cause data loss in the target process. Linux rejects these entirely for remote process_madvise; UmkaOS permits them with elevated privileges for memory manager daemons that need to reclaim specific process pages.

flags parameter

Must be 0. All other values return EINVAL. Reserved for future per-call options (e.g., PROCESS_MADVISE_ASYNC for non-blocking batch processing).

vlen limit

UmkaOS enforces a maximum of 1024 iov entries per call. Linux has no documented limit, creating a potential DoS vector where an attacker with CAP_SYS_ADMIN passes millions of tiny iov entries to pin the kernel in process_madvise indefinitely. The 1024-entry limit caps the worst-case kernel time at ~100μs (1024 VMAs × ~100ns per VMA lookup).

Internal data structures

No new persistent structures; process_madvise is a stateless operation that modifies VMA flags and page table entries in the target process.

Algorithm

process_madvise(pidfd, iovec, vlen, advice, flags):
  1. Validate flags == 0. Return EINVAL if not.
  2. Validate vlen <= 1024. Return EINVAL if exceeded.
  3. Validate advice is a known MadviseAdvice value. Return EINVAL if not.
  4. Resolve process from pidfd: pidfd_get_task(pidfd). Return EBADF if pidfd invalid.
     Return ESRCH if process has exited.
  5. Permission check: check_process_madvise_permission(current, target, advice).
     Return EPERM if not permitted.
  6. Copy iovec array from userspace (vlen entries). Return EFAULT if pointer bad.
  7. Validate each iov entry: iov_base page-aligned, iov_len non-zero.
     Return EINVAL if any entry fails validation.
  8. Acquire target MM read-lock.
  9. Pre-scan: collect all VMAs covering each iov range, validate none span unmapped
     holes. Collect set of CPUs with threads from target running (for TLB flush).
     (Pre-scan allows us to fail early without partial effects for advisory operations.
     For destructive advice, partial effects are acceptable and we proceed range-by-range.)
  10. total_bytes = 0.
  11. Collect pending_tlb_flush = TlbFlushSet::new().
  12. For each iov[i] in iovec:
        a. range = [iov[i].iov_base, iov[i].iov_base + iov[i].iov_len).
        b. Find VMA(s) covering range. Return ENOMEM if any sub-range is unmapped.
        c. Call madvise_vma(target.mm, vma, range, advice,
                            &mut pending_tlb_flush).
           This updates VMA flags and/or PTE bits as needed for the advice.
           For MADV_COLD: clear PTE accessed bits in range.
           For MADV_PAGEOUT: mark pages for immediate reclaim via page_reclaim_direct().
           For MADV_DONTNEED: unmap pages (free anon, drop file cache references).
           For MADV_FREE: mark pages with PG_lazyfree; reclaimed on memory pressure.
           For MADV_HUGEPAGE / MADV_NOHUGEPAGE: update VMA THP flags.
           For MADV_DISCARDABLE / MADV_CRITICAL: update VMA hibernation hint flags.
        d. total_bytes += iov[i].iov_len.
  13. UmkaOS improvement: issue a single batched TLB flush for all modified ranges.
        pending_tlb_flush.flush_all_cpus(target_cpu_mask).
        (Linux issues one IPI per range; UmkaOS coalesces all into one IPI per CPU.)
  14. Release target MM read-lock.
  15. Drop target process reference.
  16. Return total_bytes.

UmkaOS improvement: coalesced TLB flush

Linux's process_madvise calls tlb_gather_mmu and tlb_finish_mmu once per iov range, issuing one TLB flush IPI to each target CPU per range. For 1024 ranges on a 64-CPU system, this is 65,536 IPIs (1024 × 64). On a system where IPIs take ~2μs each, this is 131ms of IPI overhead for a single process_madvise call — unacceptable for a production memory manager daemon.

UmkaOS coalesces: all iov ranges are processed before the TLB flush, accumulating dirty page-table entries in a TlbFlushSet. After all ranges are processed, a single IPI per CPU flushes all modified entries in one shot. For 1024 ranges on 64 CPUs, this is 64 IPIs total — a 1024x reduction in IPI count.

/// Accumulated TLB flush work. Regions are added during madvise processing;
/// flushed in one batch at the end of process_madvise.
///
/// `ArrayVec<VaRange, 64>` stores up to 64 ranges inline (on the stack)
/// without heap allocation, covering the common case. ArrayVec never spills
/// to the heap — it is a fixed-capacity container. The 64-entry capacity
/// covers >99% of real workloads (madvise rarely touches more than 64
/// distinct VA ranges in a single call).
///
/// **Capacity overflow handling**: When the ArrayVec is full (64 entries),
/// the accumulated ranges are flushed immediately via `flush_all_cpus()`,
/// the `TlbFlushSet` is cleared, and processing continues. This converts
/// one coalesced flush into multiple flushes (worst case: one flush per
/// 64 ranges), which is correct but slower. The partial-flush fallback
/// ensures process_madvise never fails due to internal TLB bookkeeping.
/// Implementation: check `ranges.is_full()` before each push; if full,
/// call `flush_all_cpus()` then `ranges.clear()`.
pub struct TlbFlushSet {
    /// VA ranges that need TLB invalidation, accumulated across all iov entries.
    pub ranges: ArrayVec<VaRange, 64>,
    /// CPU mask: which CPUs need to receive the TLB flush IPI.
    pub cpu_mask: CpuMask,
}

impl TlbFlushSet {
    /// Issue a single TLB flush IPI to each CPU in cpu_mask, invalidating all
    /// accumulated VA ranges. The IPI handler on each CPU calls
    /// flush_tlb_multi_range(ranges) to invalidate all ranges at once.
    pub fn flush_all_cpus(&self, extra_mask: CpuMask) {
        let mask = self.cpu_mask | extra_mask;
        smp_send_ipi(mask, IpiKind::TlbFlushMulti(&self.ranges));
    }
}

The IPI handler on each target CPU receives the full list of VA ranges and calls the architecture-specific multi-range TLB invalidation:

x86-64: INVLPG per page in each range (or INVPCID with type=1 for PCID-aware invalidation).
AArch64: TLBI VAE1IS per page.
RISC-V: SFENCE.VMA with address operands per page.

MADV_PAGEOUT batching

When MADV_PAGEOUT appears across multiple iov ranges, UmkaOS collects all target pages into a single batch and calls reclaim_pages_batch(pages) once. This amortises the per-page reclaim overhead (LRU list manipulation, swap slot allocation) across all pages in the call. Linux calls reclaim_pages() once per VMA range.

MADV_DISCARDABLE and MADV_CRITICAL integration

These UmkaOS-specific advice values set flags in the VMA's vm_flags:

bitflags::bitflags! {
    pub struct VmFlags: u64 {
        // ... standard Linux VM_* flags (exact values) ...
        /// UmkaOS extension: pages in this VMA are discardable under memory pressure.
        /// The memory hibernation path (§4.4.3) treats these as low-priority.
        const VM_UMKA_DISCARDABLE = 1 << 56;
        /// UmkaOS extension: pages in this VMA must not be reclaimed or compressed.
        /// Real-time allocations use this to guarantee access latency.
        const VM_UMKA_CRITICAL    = 1 << 57;
    }
}

The memory reclaim path checks VM_UMKA_CRITICAL before evicting any page; pages in critical VMAs are skipped even under severe memory pressure. The swap/compression path checks VM_UMKA_DISCARDABLE to prioritise these pages for early compression or swap.

Error cases

Error	Condition
`EBADF`	`pidfd` is not a valid pidfd
`ESRCH`	Process referenced by `pidfd` has exited
`EPERM`	Insufficient privilege for the requested advice type
`EINVAL`	`flags` non-zero; `vlen` > 1024; unknown `advice`; non-aligned iov_base; zero iov_len
`EFAULT`	`iovec` pointer not readable
`ENOMEM`	iov range covers an unmapped region

Linux compatibility: syscall number, pidfd semantics, iovec struct, MADV_COLD and MADV_PAGEOUT advice values, and permission model match Linux 5.10+. MADV_DISCARDABLE and MADV_CRITICAL are UmkaOS extensions using values beyond the Linux-defined range. The 1024-entry vlen cap is an UmkaOS safety limit with no Linux equivalent.

4.15.9 Memory Page Offlining (memory_hotplug)¶

The FMA subsystem (Section 20.1) retires pages with memory errors (signalled by EDAC via fma_report_health()) by calling memory_hotplug::offline_page(). EDAC does NOT call offline_page() directly -- all page retirement decisions are mediated by FMA. This function is also used by the memory hotplug path when a DIMM is being removed.

/// Retire a single physical page frame, making it permanently unavailable
/// for allocation. Used by FMA (after EDAC reports correctable error threshold
/// exceeded or uncorrectable error) and memory hotplug (DIMM removal).
///
/// # Procedure
/// 1. If the page is in use (refcount > 0):
///    a. If page is a user page: migrate contents to a new frame via the
///       page migration subsystem (same NUMA node preferred, any node acceptable).
///    b. If page is a slab page: evict the slab object and free the page.
///    c. If page is a page-table page: cannot offline — return EBUSY.
///    d. If page is a kernel stack page: cannot offline — return EBUSY.
/// 2. Remove the page from the buddy allocator's free lists.
/// 3. Set page flags: PG_HWPOISON | PG_OFFLINE.
/// 4. Update zone accounting: zone.present_pages -= 1.
/// 5. Update NUMA node memory map: node.present_pages -= 1.
/// 6. If the page was in the page cache, invalidate the mapping and notify
///    the filesystem via address_space_operations.error_remove_page().
/// 7. Emit tracepoint: memory_page_offline(pfn, reason, migrate_target).
///
/// # Errors
/// - EBUSY: page is a kernel page-table or stack page (cannot migrate).
/// - ENOMEM: migration failed (no free frame available on any node).
/// - EINVAL: pfn is outside the valid physical memory range.
///
/// # Concurrency
/// Takes the per-zone memory_hotplug_lock (Mutex) to serialize concurrent
/// offline operations on the same zone. The migration path may sleep.
pub fn offline_page(pfn: PhysFrameNum, reason: OfflineReason) -> Result<(), OfflineError>;

pub enum OfflineReason {
    /// EDAC: correctable error count exceeded threshold for this page.
    EdacThreshold,
    /// EDAC: uncorrectable error detected on this page.
    EdacUncorrectable,
    /// Memory hotplug: DIMM being removed.
    HotRemove,
    /// Operator request via sysfs (echo offline > /sys/devices/system/memory/memoryN/state).
    ManualOffline,
    /// FMA subsystem: page retired due to predictive failure analysis
    /// (e.g., EDAC CE rate trending, machine-check patrol scrub hit,
    /// or ML-policy-driven preemptive retirement).
    FmaRetire,
}

pub enum OfflineError {
    /// Page is a kernel page-table or stack page; cannot be migrated.
    Busy,
    /// No free frame available for migration.
    NoMemory,
    /// PFN is outside the valid physical memory range.
    InvalidPfn,
}

/// **Idempotent behavior**: If the page is already poisoned (previously
/// offlined), `offline_page()` succeeds idempotently (returns `Ok`). The
/// page's poison state is not modified. This matches Linux's
/// `memory_failure()` which returns 0 (success) for already-poisoned
/// pages. No new error variant is needed.

Migration details: Page migration follows the same path as NUMA balancing migration (Section 4.8): unmap PTEs, copy page contents, remap PTEs to new frame, TLB flush via IPI. The difference is that offlined pages are never returned to the buddy allocator — they remain permanently poisoned.

Interaction with CMA: If the offlined page belongs to a CMA (Contiguous Memory Allocator) region, the CMA bitmap is updated to mark the page as unusable. Future CMA allocations skip this page.

Chapter 4: Memory Management¶

4.1 Boot Allocator¶

4.1.1 Kernel Address Types¶

4.2 Physical Memory Allocator¶

4.2.1 BuddyAllocator¶

4.2.1.1 Constants¶

4.2.1.2 Migration Types¶

4.2.1.3 BuddyFreeList — Per-Order Free List¶

4.2.1.4 PcpPagePool — Per-CPU Free Page Pool¶

4.2.1.5 BuddyAllocator — Per-NUMA-Node Allocator¶

4.2.1.5.1 NUMA Node TEE Info¶

4.2.1.6 Global Allocator Table¶

4.2.1.7 alloc_pages() — Page Allocation¶

4.2.1.8 free_pages() — Page Deallocation¶

4.2.1.9 Split Algorithm¶

4.2.1.10 Merge Algorithm¶

4.2.1.11 PCP Refill¶

4.2.1.12 PCP Drain¶

4.2.1.13 GfpFlags Integration Summary¶

4.2.1.14 Initialization Sequence¶

4.2.1.15 OOM Killer¶

4.2.1.16 OOM Reaper¶

4.2.1.17 OOM Score¶

4.2.1.18 Cgroup OOM Events¶

4.2.1.19 ML Policy Integration — Intelligent OOM Victim Selection¶

4.2.2 Allocator Replaceability — 50-Year Uptime Design¶

4.2.2.1 Non-Replaceable Data Structures¶

4.2.2.2 PageArray — vmemmap-Backed Page Descriptor Array¶

4.2.2.3 MEMMAP to vmemmap Migration¶

4.2.2.4 Replaceable Physical Allocator Policy¶

4.2.2.5 Performance Budget — Policy Indirection Cost¶

4.2.2.6 PageExtArray — Per-Page Extension Metadata¶

4.3 Slab Allocator¶

4.3.1.1 kmalloc() / kfree() Public API¶

4.3.1.2 Internal Slab Structures¶

4.3.1.3 Slab-to-Buddy Page Allocation¶

4.3.2 Page Frame Descriptor¶

4.3.3 Slab Cache Garbage Collection¶

4.3.3.1 Problem¶

4.3.3.2 Design: kslab_gc Background Thread¶

4.3.3.3 GC Lock Ordering¶

4.3.3.4 GC Protocol¶

4.3.3.5 Performance Impact¶

4.3.3.6 Driver-Created Cache Merging and Per-CPU Magazine Mapping¶

4.4 Page Cache¶

4.4.1 Generational LRU Page Reclaim¶

4.4.1.1 Page Reclaim Replaceability — 50-Year Uptime Design¶

4.4.1.2 Performance Budget — Page Reclaim Policy Indirection Cost¶

4.4.1.3 filemap_fault() — Generic Page Cache Fault Handler¶

4.4.2 Readahead Engine¶

4.5 OOM Killer and Process Memory Hibernation¶

4.5.1 OOM Killer Policy¶

4.5.1.1.1 DSM-Awareness in OOM Scoring¶

4.5.2 Process Memory Hibernation¶

4.6 Writeback Subsystem¶

4.6.1 Writeback Thread Organization¶

[repr(u8)]¶

4.6.1.1 end_page_writeback() — Writeback I/O Completion¶

4.7 Transparent Huge Page Promotion and Memory Compaction¶

4.7.1 Struct Definitions¶

4.8 Virtual Memory Manager¶

4.8.1 MmStruct — Per-Process Address Space¶

4.8.1.1 MMU Notifier¶

4.8.1.1.1 MmuNotifier RAII Guard¶

4.8.2 Maple Tree VMA Management¶

4.8.2.1 do_mmap() -- mmap Syscall Implementation¶

4.8.2.2 prot_flags_to_vm_flags() — Translate mmap flags to VmFlags¶

4.8.2.3 vma_merge() — Merge a New Mapping with Adjacent VMAs¶

4.8.2.4 do_munmap() — Unmap a Virtual Address Range¶

4.8.2.5 maple_find_in_range() — Check for VMA overlap¶

4.8.2.6 slab_alloc_typed<T>() — Typed Slab Allocation¶

4.8.2.7 do_mmap_hugetlb() — Hugetlb mmap Path¶

4.8.2.8 lsm_call!(mmap_file, ...) — LSM mmap hook¶

4.8.2.9 check_overcommit() — Overcommit accounting¶

4.8.3 Page Fault Handler¶

4.8.3.1.1 Per-VMA Locking¶

4.8.4 Page Fault Metadata by Architecture¶

4.8.4.1 dsm_handle_fault — VMM-to-DSM Bridge Function¶

4.8.4.2 install_pte — Page Table Entry Installation¶

4.8.4.3 COW Page Table Duplication at Fork¶

4.2.1.3 `BuddyFreeList` — Per-Order Free List¶

4.2.1.4 `PcpPagePool` — Per-CPU Free Page Pool¶

4.2.1.5 `BuddyAllocator` — Per-NUMA-Node Allocator¶

4.2.1.7 `alloc_pages()` — Page Allocation¶

4.2.1.8 `free_pages()` — Page Deallocation¶

4.3.1.1 `kmalloc()` / `kfree()` Public API¶

4.4.1.3 `filemap_fault()` — Generic Page Cache Fault Handler¶

4.6.1.1 `end_page_writeback()` — Writeback I/O Completion¶

4.8.2.1 `do_mmap()` -- mmap Syscall Implementation¶

4.8.2.2 `prot_flags_to_vm_flags()` — Translate mmap flags to VmFlags¶

4.8.2.3 `vma_merge()` — Merge a New Mapping with Adjacent VMAs¶

4.8.2.4 `do_munmap()` — Unmap a Virtual Address Range¶

4.8.2.5 `maple_find_in_range()` — Check for VMA overlap¶

4.8.2.6 `slab_alloc_typed<T>()` — Typed Slab Allocation¶

4.8.2.7 `do_mmap_hugetlb()` — Hugetlb mmap Path¶

4.8.2.8 `lsm_call!(mmap_file, ...)` — LSM mmap hook¶

4.8.2.9 `check_overcommit()` — Overcommit accounting¶

4.8.4.1 `dsm_handle_fault` — VMM-to-DSM Bridge Function¶

4.8.4.2 `install_pte` — Page Table Entry Installation¶

4.14.3.1.1 `DmaArchOps` — Per-Architecture Cache Coherency Operations¶