
Chapter 4: Memory Management

Physical allocator, virtual memory, page tables, slab, NUMA, compression tier, page cache


4.1 Memory Management

Memory management runs entirely within UmkaOS Core. The common case (page fault handling, allocation, deallocation) involves zero protection domain crossings.

4.1.0 Boot Allocator

The boot allocator provides memory to early-init code before the buddy allocator and slab are operational. It initialises from firmware-provided memory maps (ACPI/E820 on x86, Device Tree on ARM/RISC-V) and is retired once the buddy allocator takes over.

Design: A single-pass, forward-bumping allocator over a sorted list of free physical ranges. No free operation — allocations are permanent until the buddy allocator is initialised. This is intentional: boot-time data (PerCpu arrays, NUMA topology, RCU state, GDT/IDT, early page tables) has kernel-lifetime and is never freed.

/// Boot-phase allocator. Initialised from firmware memory tables before any other
/// kernel allocator. Retired (replaced by buddy allocator) during mem_init().
/// All allocations are permanent (no free). Single-threaded use only (pre-SMP).
pub struct BootAlloc {
    /// Sorted list of free physical memory ranges from firmware.
    /// Populated from E820 (x86), UEFI memory map, or Device Tree /memory nodes.
    /// Maximum 256 entries — sufficient for any real system (Linux caps at 128).
    ranges: [PhysRange; 256],
    /// Number of valid entries in `ranges`.
    nr_ranges: usize,
    /// Next allocation pointer within the current range.
    current_top: PhysAddr,
    /// Index of the range being served.
    current_range: usize,
}

/// A contiguous physical memory range.
pub struct PhysRange {
    pub base: PhysAddr,
    pub end:  PhysAddr,  // exclusive
}

impl BootAlloc {
    /// Allocate `size` bytes with `align` alignment (must be power of two).
    /// Searches ranges in order, advancing to the next free range when the
    /// current one is exhausted. Panics if no range has enough space.
    /// Returns a physical address; caller maps it into the kernel virtual window.
    pub fn alloc(&mut self, size: usize, align: usize) -> PhysAddr;

    /// Mark a physical range as reserved (used by ACPI tables, initramfs, etc.)
    /// so it is excluded from allocatable ranges. Must be called before alloc().
    pub fn reserve(&mut self, base: PhysAddr, size: usize);

    /// Hand off all remaining free ranges to the buddy allocator.
    /// Called once during mem_init(). After this call, BootAlloc is inert.
    pub fn hand_off_to_buddy(&mut self, buddy: &mut BuddyAllocator);
}

Initialisation sequence (arch-independent):

1. arch_early_init() — establish identity map of first 1 GB, enable MMU/paging.
2. parse_firmware_memory_map() — read E820/UEFI/DT, build BootAlloc.ranges[].
3. reserve_kernel_image() — mark kernel .text/.data/.bss as reserved.
4. reserve_initramfs() — mark initramfs region as reserved.
5. reserve_acpi_tables() — mark RSDP/XSDT/MADT/SRAT/HMAT regions as reserved.
6. BootAlloc available for use by all early init code.
7. mem_init():
   a. Allocate per-NUMA BuddyAllocator structures from BootAlloc.
   b. Call boot_alloc.hand_off_to_buddy() — all remaining free pages enter buddy.
   c. Mark BootAlloc as retired. Further alloc() panics.
8. slab_init() — build slab caches on top of buddy.
9. Normal allocation (Box::new, Arc::new, etc.) is now available.

Key differences from Linux memblock: No deferred boot-time allocation tracking, no late reservations after step 6, no mirror/hotplug bookkeeping at this stage (handled by NUMA topology after buddy is up). Simpler invariant: the boot allocator is a one-pass bump allocator that retires cleanly.
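The bump discipline in `alloc()` can be sketched in ordinary userspace Rust (a model, not kernel code: addresses are plain `usize`, a `Vec` stands in for the fixed array, and `BumpRanges` returns `Option` where the kernel version panics):

```rust
/// Userspace model of the boot allocator's forward-bumping logic.
struct BumpRanges {
    ranges: Vec<(usize, usize)>, // (base, end-exclusive), sorted by base
    current_range: usize,        // index of the range being served
    current_top: usize,          // next allocation pointer
}

impl BumpRanges {
    fn new(ranges: Vec<(usize, usize)>) -> Self {
        let top = ranges.first().map(|r| r.0).unwrap_or(0);
        BumpRanges { ranges, current_range: 0, current_top: top }
    }

    /// Allocate `size` bytes aligned to `align` (power of two), advancing
    /// to the next free range when the current one is exhausted.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        assert!(align.is_power_of_two());
        while self.current_range < self.ranges.len() {
            let (_, end) = self.ranges[self.current_range];
            // Bump current_top up to the requested alignment.
            let aligned = (self.current_top + align - 1) & !(align - 1);
            if aligned + size <= end {
                self.current_top = aligned + size;
                return Some(aligned);
            }
            // Current range exhausted: move to the next one.
            self.current_range += 1;
            if let Some(&(base, _)) = self.ranges.get(self.current_range) {
                self.current_top = base;
            }
        }
        None // the kernel panics here instead
    }
}
```

Because there is no free operation, the whole allocator is three words of state plus the sorted range list — which is why it retires so cleanly into the buddy hand-off.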

4.1.1 Physical Memory Allocator

Per-CPU Page Caches (hot path, no locks)
    |
    v
Per-NUMA-Node Buddy Allocator (warm path, per-node lock)
    |
    v
Global Buddy Allocator (cold path, rare fallback)
  • Per-CPU page caches: Each CPU maintains a private cache of free pages. Allocation and deallocation from this cache require no locking and no atomic operations.
  • Per-NUMA buddy allocator: When per-CPU caches are exhausted, pages are allocated from the NUMA-local buddy allocator. This uses a per-node lock (minimal contention because per-CPU caches absorb most traffic).
  • Page order: Buddy allocator manages orders 0-10 (4KB to 4MB).
  • NUMA awareness: Allocations prefer the requesting CPU's NUMA node. Cross-node allocation is a fallback with configurable policy.

See also: Section 2.3 (Hardware Memory Safety) hooks into the physical and slab allocators to assign MTE tags on allocation and clear them on free, providing hardware-assisted use-after-free and buffer overflow detection.
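The order range and the buddy relationship can be made concrete with a short sketch (`size_to_order` and `buddy_of` are illustrative helper names, not APIs from the text):

```rust
const PAGE_SIZE: usize = 4096;
const MAX_ORDER: u32 = 10; // 4 KB (order 0) .. 4 MB (order 10)

/// Smallest buddy order whose block covers `bytes`, or None if the
/// request exceeds the largest order (4 MB).
fn size_to_order(bytes: usize) -> Option<u32> {
    let pages = bytes.div_ceil(PAGE_SIZE).max(1);
    let order = pages.next_power_of_two().trailing_zeros();
    (order <= MAX_ORDER).then_some(order)
}

/// Address of the buddy of the order-`order` block at `addr`.
/// Splitting and coalescing both rely on this XOR relationship:
/// a block and its buddy differ in exactly one address bit.
fn buddy_of(addr: usize, order: u32) -> usize {
    addr ^ (PAGE_SIZE << order)
}
```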

GfpFlags — Memory Allocation Flags:

bitflags! {
    /// Memory allocation flags passed to the physical page allocator and slab allocator.
    ///
    /// **Context rules** — must be respected to avoid deadlocks:
    /// - In interrupt context or under spinlock: use `GFP_ATOMIC` or `GFP_NOWAIT`.
    /// - In normal schedulable kernel context: use `GFP_KERNEL`.
    /// - From within filesystem code (holding inode/page locks): add `GFP_NOFS`.
    /// - From within block driver or I/O path: add `GFP_NOIO`.
    ///
    /// **UmkaOS vs Linux**: Linux uses a raw `gfp_t` typedef (u32) with scattered
    /// `#define` constants. UmkaOS uses `bitflags!` for compile-time type safety —
    /// invalid flag combinations (e.g., GFP_ATOMIC | GFP_KERNEL) are caught at
    /// the call site rather than producing silent runtime bugs.
    pub struct GfpFlags: u32 {
        // --- Base zone modifiers (where to allocate from) ---

        /// Allocate from the DMA zone (typically <16 MB on x86; architecture-specific).
        /// Required for ISA-DMA-capable devices. Use `GFP_DMA32` for 32-bit DMA.
        const DMA              = 0x0000_0001;
        /// Allocate from the DMA32 zone (below 4 GB physical address).
        /// Use for PCI devices that cannot address >4 GB.
        const DMA32            = 0x0000_0002;
        /// Allocate from HIGHMEM zone if available (32-bit systems only).
        /// Not meaningful on 64-bit architectures where all RAM is directly mapped.
        const HIGHMEM          = 0x0000_0004;
        /// Page is movable by the memory compactor (can be migrated without
        /// affecting correctness from the kernel's perspective).
        const MOVABLE          = 0x0000_0008;

        // --- Reclaim / sleeping modifiers ---

        /// May invoke the page reclaimer and sleep waiting for memory.
        /// The standard flag for all normal sleepable kernel allocations.
        const RECLAIM          = 0x0000_0010;
        /// May start physical I/O (page-in, swap read) during reclaim.
        const IO               = 0x0000_0020;
        /// Allow filesystem callbacks in the reclaim path. Omit this bit
        /// (see `KERNEL_NOFS`) when the caller holds filesystem locks
        /// (writepage, readpage, etc.) to prevent deadlock via re-entrant
        /// filesystem calls.
        const FS               = 0x0000_0040;
        /// Allow writeback I/O (including swap writeback) in the reclaim path.
        /// Omit this bit (see `KERNEL_NOIO`) from block drivers or storage
        /// paths that must not block on I/O.
        const WRITE            = 0x0000_0080;
        /// Allow kswapd reclaim (can wake kswapd if below WMARK_LOW).
        const KSWAPD_RECLAIM   = 0x0000_0100;
        /// Use the emergency reserve pool. Interrupt-safe; may not sleep.
        /// Higher priority than NOWAIT; intended for true interrupt context.
        const ATOMIC           = 0x0000_0200;
        /// For user-visible pages (anonymous memory, file-backed pages
        /// accessed by userspace). Enables reclaimable migration.
        const USER             = 0x0000_0400;
        /// Zero the allocated memory before returning. Caller gets clean pages.
        const ZERO             = 0x0000_0800;

        // --- Composite flags for common use cases ---

        /// Standard kernel allocation: sleepable, may reclaim, may do I/O.
        /// Use for all normal kernel allocations in process context.
        const KERNEL = Self::RECLAIM.bits() | Self::IO.bits() | Self::FS.bits()
                     | Self::WRITE.bits() | Self::KSWAPD_RECLAIM.bits();

        /// Non-blocking allocation: no sleep, no reclaim. For contexts where
        /// failure is acceptable (e.g., caches). Preferred over GFP_ATOMIC for
        /// non-interrupt contexts that simply must not sleep.
        const NOWAIT = 0x0000_0000; // no modifiers — direct alloc only

        /// Interrupt-safe allocation using reserve pool. Use only in IRQ context
        /// or under spinlock where GFP_NOWAIT's failure rate is unacceptable.
        const ATOMIC_ALLOC = Self::ATOMIC.bits() | Self::KSWAPD_RECLAIM.bits();

        /// Like GFP_KERNEL but zeroes the memory. For kernel objects that must
        /// start in a known-zero state (prevents info leaks from recycled pages).
        const KERNEL_ZEROED = Self::KERNEL.bits() | Self::ZERO.bits();

        /// Like GFP_KERNEL but disables filesystem reclaim. Safe from VFS/fs code.
        const KERNEL_NOFS = Self::RECLAIM.bits() | Self::IO.bits()
                          | Self::WRITE.bits() | Self::KSWAPD_RECLAIM.bits();

        /// Like GFP_KERNEL but disables all I/O reclaim. Safe from block drivers.
        const KERNEL_NOIO = Self::RECLAIM.bits() | Self::FS.bits()
                          | Self::KSWAPD_RECLAIM.bits();

        /// For user-visible pages with high-memory zone support and movability.
        const HIGHUSER_MOVABLE = Self::KERNEL.bits() | Self::USER.bits()
                               | Self::HIGHMEM.bits() | Self::MOVABLE.bits();
    }
}
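The context rules above amount to a small validity predicate. A dependency-free sketch, with hand-rolled `u32` constants standing in for the `bitflags!` definitions and a hypothetical `flags_valid` checker (the text does not specify where such a check lives):

```rust
// Stand-ins for the bit values defined above.
const RECLAIM: u32        = 0x0000_0010;
const IO: u32             = 0x0000_0020;
const FS: u32             = 0x0000_0040;
const WRITE: u32          = 0x0000_0080;
const KSWAPD_RECLAIM: u32 = 0x0000_0100;
const ATOMIC: u32         = 0x0000_0200;

const GFP_KERNEL: u32       = RECLAIM | IO | FS | WRITE | KSWAPD_RECLAIM;
const GFP_ATOMIC_ALLOC: u32 = ATOMIC | KSWAPD_RECLAIM;

/// ATOMIC means "may not sleep"; RECLAIM means "may sleep in reclaim".
/// Combining them is exactly the GFP_ATOMIC | GFP_KERNEL class of bug
/// the doc comment above warns about.
fn flags_valid(flags: u32) -> bool {
    !(flags & ATOMIC != 0 && flags & RECLAIM != 0)
}
```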

4.1.2 Slab Allocator

For kernel object allocation (capabilities, file descriptors, inodes, etc.):

  • Per-CPU slab caches with magazine-based design
  • Per-NUMA partial slab lists
  • Object constructors/destructors for complex types
  • SLUB-style merging of similarly-sized caches

SlabList — doubly-linked list of slab objects within a slab cache:

/// Doubly-linked list of `Slab` objects. Used for the `partial` list within
/// each NUMA node's slab cache — slabs that have some free objects but are
/// not completely empty. The allocator draws from the partial list when the
/// per-CPU magazine is empty: it pops the head slab, allocates an object
/// from it, and (if the slab is now full) removes it from the list. When
/// an object is freed back to a full slab, that slab is pushed onto the
/// partial list.
///
/// Invariant: `count` always equals the number of nodes reachable from
/// `head` by following `next` pointers. `head.prev == null` and
/// `tail.next == null`.
pub struct SlabList {
    /// First slab in the list, or null if empty.
    pub head: *mut Slab,
    /// Last slab in the list, or null if empty.
    pub tail: *mut Slab,
    /// Number of slabs in the list.
    pub count: usize,
}

/// Intrusive list links embedded in each `Slab` header. These fields are
/// only valid when the slab is on a `SlabList` (partial list). A slab that
/// is on the per-CPU magazine or is completely free does not use these links.
///
/// Each `Slab` represents one page (or compound page for large objects)
/// subdivided into fixed-size object slots.
// (embedded in Slab struct):
//   pub prev: *mut Slab,
//   pub next: *mut Slab,

Slab — Single slab page descriptor:

/// A single slab page: a contiguous memory region divided into `capacity`
/// fixed-size objects. The freelist is stored **out-of-band** (in a separate
/// metadata page, not within the object region) for security and debuggability.
///
/// **UmkaOS vs Linux**: Linux stores the freelist inside free objects themselves
/// (each free slot contains a pointer to the next free slot). This enables
/// use-after-free exploitation (an attacker who writes to a freed object can
/// corrupt the freelist). UmkaOS's out-of-band freelist prevents this class of
/// bug from being exploitable at the cost of ~2 bytes of metadata per object.
///
/// Typical slab sizes: 8 objects per slab for large objects (512 B each in a
/// 4 KB page), up to 512 objects per slab for small objects (8 B each in a 4 KB page).
pub struct Slab {
    /// Physical address of the first byte of the object region.
    /// Always page-aligned. The object region occupies `capacity * obj_size` bytes.
    pub base: PhysAddr,

    /// Size of each object in bytes, including alignment padding.
    /// Constant for the lifetime of the slab (set at slab creation).
    pub obj_size: u32,

    /// Maximum number of objects this slab can hold. Computed as
    /// `usable_page_bytes / obj_size` at slab creation.
    pub capacity: u16,

    /// Number of currently free (unallocated) objects in this slab.
    pub free_count: u16,

    /// Compact out-of-band freelist. Stores indices (0..capacity) of free object
    /// slots. The top `free_count` entries are valid free slot indices.
    /// Allocation: pop `freelist[free_count - 1]`, decrement `free_count`.
    /// Deallocation: push slot index to `freelist[free_count]`, increment `free_count`.
    ///
    /// Double-free detection: before pushing, verify the slot is not already
    /// in `freelist[0..free_count]` (O(free_count), acceptable for debug builds;
    /// skipped in release with a KVA-guarded canary approach).
    ///
    /// Memory: allocated from the buddy allocator as a separate page (or from
    /// a small metadata pool for slabs with capacity ≤ 256). Never overlaps
    /// with the object region.
    pub freelist: *mut u16,

    /// Pointer to the `SlabCache` that owns this slab. Used for accounting
    /// and to return the slab to the appropriate partial/empty list.
    /// SAFETY: `cache` pointer is valid for the slab's lifetime (a slab is always
    /// owned by a cache and freed before the cache is destroyed).
    pub cache: *const SlabCache,

    /// Which list this slab is currently on (drives cache list management).
    pub list_state: SlabState,

    /// Slab generation counter. Incremented on each alloc+free cycle.
    /// Used by the sanitizer to detect stale pointers (optional, debug builds).
    #[cfg(debug_assertions)]
    pub generation: u32,
}

/// Which free-list within a `SlabCache` this slab is currently linked into.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SlabState {
    /// All `capacity` objects are allocated (no free slots).
    Full,
    /// Some objects allocated, some free. The common steady-state.
    Partial,
    /// No objects allocated. Eligible for return to the buddy allocator.
    Empty,
}
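The out-of-band freelist discipline (pop on alloc, push on free, debug-build double-free scan) can be modelled in userspace. `Freelist` here is an illustrative stand-in, with a `Vec<u16>` in place of the raw `*mut u16`:

```rust
/// Userspace model of the out-of-band slab freelist: an index stack kept
/// apart from the object region.
struct Freelist {
    slots: Vec<u16>,  // freelist storage, `capacity` entries
    free_count: u16,
}

impl Freelist {
    fn new(capacity: u16) -> Self {
        // A fresh slab starts with every slot free.
        Freelist { slots: (0..capacity).collect(), free_count: capacity }
    }

    /// Allocation: pop `freelist[free_count - 1]`, decrement `free_count`.
    fn alloc(&mut self) -> Option<u16> {
        if self.free_count == 0 { return None; }
        self.free_count -= 1;
        Some(self.slots[self.free_count as usize])
    }

    /// Deallocation: push the slot index, increment `free_count`.
    /// Returns Err on double free (the O(free_count) debug-build scan).
    fn free(&mut self, slot: u16) -> Result<(), &'static str> {
        if self.slots[..self.free_count as usize].contains(&slot) {
            return Err("double free");
        }
        self.slots[self.free_count as usize] = slot;
        self.free_count += 1;
        Ok(())
    }
}
```

Because the indices live outside the object region, a write to a freed object cannot redirect the freelist — the corruption the Linux in-band scheme is vulnerable to.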

SlabCache — Slab cache descriptor:

/// A slab cache for objects of a fixed size and alignment. Created at boot
/// for common kernel types (capabilities, inodes, etc.) and on demand for
/// driver-specific types.
pub struct SlabCache {
    /// Object size in bytes (including alignment padding).
    pub object_size: u32,
    /// Object alignment requirement.
    pub align: u32,
    /// Per-CPU magazine: hot objects for alloc/free without any lock.
    /// Magazine size = 2 × (PAGE_SIZE / object_size), capped at 64.
    pub percpu: PerCpu<SlabMagazine>,
    /// Per-NUMA-node partial slab lists. Indexed by NUMA node ID.
    /// Length = `num_online_nodes()` at boot; allocated once from the boot-time
    /// allocator during `slab_init()` and never resized. The `&'static` lifetime
    /// is correct because this allocation lives for the kernel's entire lifetime
    /// (slab caches are never destroyed). Dynamically sized based on discovered
    /// NUMA node count (no compile-time MAX_NUMA_NODES constant — see
    /// Section 4.1.8). Follows the same dynamic sizing pattern as `PerCpu<T>`
    /// (Section 3.1.1) and `NumaTopology` (Section 4.1.8).
    ///
    /// Allocated during slab subsystem init from the boot-time allocator before
    /// general-purpose allocation is available. NUMA topology discovery
    /// (Section 4.1.8) must complete before `slab_init()` so the node count is
    /// known. The boot allocator provides a `&'static [SpinLock<SlabList>]` by
    /// allocating a contiguous region and leaking it (the allocation is
    /// permanent, so no memory is actually leaked).
    pub partial: &'static [SpinLock<SlabList>],
    /// Optional constructor called on newly allocated objects.
    pub ctor: Option<fn(*mut u8)>,
    /// Optional destructor called before returning objects to the page allocator.
    pub dtor: Option<fn(*mut u8)>,
    /// Name for debugging and /proc/slabinfo.
    pub name: &'static str,
}

/// Per-CPU magazine for fast alloc/free (no lock, no atomic *within the magazine*
/// operation itself — push/pop are plain pointer moves). The slab fast path
/// accesses the magazine pointer via `CpuLocal` (Section 3.1.2), which uses the
/// architecture's per-CPU register (~1 cycle on x86-64, ~2-4 cycles on AArch64).
/// No borrow-state CAS is involved on the slab fast path — `CpuLocal` bypasses
/// `PerCpu<T>` entirely. See Section 3.1.2 for `CpuLocal` design, Section 3.1.3 for `PerCpu<T>`
/// debug-only borrow checking.
pub struct SlabMagazine {
    /// Stack of free object pointers. alloc() pops, free() pushes.
    pub objects: [*mut u8; 64],
    /// Number of valid entries in `objects`.
    pub count: u32,
}

SlabCache::alloc() fast path: Pop from per-CPU magazine via CpuLocal (Section 3.1.2) — the slab magazine pointer is a field in CpuLocalBlock, accessed through the architecture's per-CPU register. No lock, no atomic, no borrow-state check on the fast path. If magazine is empty, refill from per-NUMA partial list (one lock acquisition fills the entire magazine). If no partial slabs, allocate a new page from the buddy allocator.

SlabCache::free() fast path: Push to per-CPU magazine via CpuLocal. If magazine is full, flush half to the per-NUMA partial list.

Cost model: The slab fast path uses CpuLocal (Section 3.1.2) for magazine access, matching Linux's this_cpu_* pattern: ~1-10 cycles depending on architecture (single instruction on x86-64 via gs: prefix, 2-3 instructions on AArch64 via TPIDR_EL1). General per-CPU data accessed through PerCpu<T> adds a borrow-state check (Section 3.1.3) that compiles to a CAS only in debug builds: ~3-8 cycles in release builds, ~20-30 cycles in debug builds. The debug-mode CAS catches aliasing bugs during development; release builds trust the structural invariants (preemption disabled = CPU pinned, IRQs disabled = exclusive access).
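The magazine fast path described above can be modelled in userspace (`Vec<usize>` stands in for the pointer array, `PartialList` for the locked per-NUMA partial list; `mag_alloc`/`mag_free` are illustrative names):

```rust
/// Userspace model of the per-CPU magazine fast path: pop on alloc,
/// push on free, refill/flush in batches when empty/full.
const MAG_CAP: usize = 64;

struct Magazine {
    objects: Vec<usize>, // stands in for `[*mut u8; 64]` + count
}

struct PartialList {
    spare: Vec<usize>,   // stands in for the locked per-NUMA partial slabs
}

fn mag_alloc(mag: &mut Magazine, partial: &mut PartialList) -> Option<usize> {
    if mag.objects.is_empty() {
        // Slow path: one "lock acquisition" refills the whole magazine.
        let take = partial.spare.len().min(MAG_CAP);
        mag.objects.extend(partial.spare.drain(..take));
    }
    mag.objects.pop() // fast path: plain pointer move, no lock
}

fn mag_free(mag: &mut Magazine, partial: &mut PartialList, obj: usize) {
    if mag.objects.len() == MAG_CAP {
        // Magazine full: flush half back to the partial list.
        partial.spare.extend(mag.objects.drain(..MAG_CAP / 2));
    }
    mag.objects.push(obj);
}
```

The batching is the point: the per-NUMA lock is touched once per 64 operations in the worst case, not once per allocation.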

SlabRef<T> — Owned handle to a slab-allocated object:

/// A reference to a slab-allocated object. Provides O(1) allocation
/// and deallocation from the owning slab cache. Implements `Deref<Target = T>`
/// for transparent access. `Drop` returns the object to the slab cache.
/// Not `Clone` — each `SlabRef` represents unique ownership.
pub struct SlabRef<T: ?Sized> {
    ptr: NonNull<T>,
    cache: *const SlabCache,
}

4.1.3 Page Cache

  • RCU-protected BTreeMap for page lookups (lock-free reads; see PageCache below)
  • NUMA-aware page placement: Pages cached on the node closest to the requesting CPU
  • Writeback: Per-device writeback threads, dirty page ratio thresholds
  • Page reclaim: Generational LRU (Section 4.1.3.1) with per-CPU drain batching
  • Transparent huge pages: Automatic promotion of 4KB pages to 2MB when aligned runs are available, automatic splitting under memory pressure

PageCache — Per-inode page cache:

/// Per-inode page cache: maps file page offsets to their in-memory page frames.
///
/// Embedded directly in each `Inode` to eliminate the extra pointer dereference
/// on the page-fault hot path (`inode → page_cache.pages → Page`).
///
/// **UmkaOS design**: Uses `BTreeMap<PageIndex, PageEntry>` under RCU protection.
/// Linux uses XArray (a radix-tree variant, since 5.0) for O(log n) lookups and
/// lockless RCU reads. UmkaOS uses `BTreeMap` for equivalent algorithmic complexity
/// with a simpler, well-audited implementation. The RCU clone-and-swap write path
/// is acceptable for page insertions (rare relative to page lookups in hot-path
/// workloads like sequential I/O, where the page is inserted once and read many times).
///
/// **Thread safety**: Reads (page lookup on page fault, readahead) take an
/// `RcuReadGuard`. Writes (page insertion, dirty marking, page eviction) take
/// `write_lock` and clone-and-swap the `BTreeMap`.
pub struct PageCache {
    /// RCU-protected map from file page index to page entry.
    /// Key: page offset in units of PAGE_SIZE (byte_offset / PAGE_SIZE).
    /// Value: PageEntry wrapping the physical Page frame + state flags.
    ///
    /// Immutable snapshot after RCU publish: readers see a consistent map
    /// without holding any lock. Writers clone, modify, and atomically swap.
    pub pages: RcuCell<BTreeMap<PageIndex, PageEntry>>,

    /// Write-side serialization lock (RCU write pattern).
    /// Held only during page insertion/removal/state-transition writes.
    /// Never held during reads — the read path is lock-free.
    pub write_lock: Mutex<()>,

    /// Total number of pages currently in the cache (Uptodate + Dirty + Writeback).
    /// Updated atomically under `write_lock` on insert/remove.
    pub nr_pages: AtomicU64,

    /// Number of pages currently marked Dirty (awaiting writeback).
    pub nr_dirty: AtomicU64,

    /// Nanoseconds-since-boot deadline: the earliest time at which a dirty page
    /// in this inode MUST be written back (30 seconds after first dirty mark,
    /// per the default `dirty_expire_centisecs = 3000`). Zero if no dirty pages.
    /// Drives the writeback thread's per-inode scheduling.
    pub writeback_deadline_ns: AtomicU64,

    /// True while a writeback I/O is in progress for this inode. Set by the
    /// writeback thread; cleared on I/O completion. Prevents concurrent writebacks.
    pub writeback_in_progress: AtomicBool,
}

/// An entry in the page cache: a reference to a physical page frame plus state flags.
pub struct PageEntry {
    /// Reference to the backing page frame. The page may be in CpuRam, compressed
    /// (in the Zpool), or being migrated — the PageLocation tracks this.
    pub page: Arc<Page>,
    /// Current state of this page relative to its backing file.
    pub flags: PageFlags,
}

bitflags! {
    /// State flags for a cached page. Inspected by the page fault handler,
    /// writeback thread, and reclaim path.
    pub struct PageFlags: u32 {
        /// Page data is current (consistent with backing storage or anon zero-page).
        /// Cleared when the page is evicted; set after readahead or writeback completes.
        const UPTODATE    = 0x0001;
        /// Page has been written to since last writeback. Drives writeback scheduling.
        const DIRTY       = 0x0002;
        /// Writeback to storage is currently in progress for this page.
        /// Set by the writeback thread; cleared on I/O completion.
        const WRITEBACK   = 0x0004;
        /// Page is locked (I/O in progress, compaction, or migration).
        /// Processes waiting for unlock sleep on a per-page wait queue.
        const LOCKED      = 0x0008;
        /// Page has been accessed since it was last placed at the tail of the LRU.
        /// Set by the page fault handler; cleared by the reclaim scanner.
        const ACCESSED    = 0x0010;
        /// Page is mapped into at least one process page table (has live PTEs).
        /// Used by the reclaim path to skip TLB shootdown for unmapped pages.
        const MAPPED      = 0x0020;
        /// Page is in the swap cache (backed by a swap entry, not a file).
        const SWAPCACHE   = 0x0040;
        /// Page belongs to a Tier 1 driver's DMA region (must not be reclaimed).
        const DMA_PINNED  = 0x0080;
        /// Page is under readahead (being speculatively loaded by the readahead engine).
        const READAHEAD   = 0x0100;
    }
}

/// File page offset: byte_offset / PAGE_SIZE. Used as BTreeMap key for O(log n) lookup.
pub type PageIndex = u64;

4.1.3.1 Generational LRU Page Reclaim

UmkaOS uses a generational LRU design, inspired by Linux's Multi-Gen LRU (MGLRU, merged in Linux 6.1) but redesigned from first principles. The goals are:

  1. Accurate age estimation without per-page LRU movement (no lock-per-access).
  2. Separate aging policy for file-backed and anonymous pages.
  3. Per-cgroup reclaim priority — cgroup pressure is resolved before global reclaim.
  4. Efficient operation under NUMA with per-node generation lists.
  5. Refault distance tracking to protect frequently-accessed file pages.

Data structures:

/// Number of LRU generations maintained per zone (tunable, default 4).
/// Pages age through generations 0..N_GENS-1. Generation 0 is the newest
/// (just accessed), generation N_GENS-1 is the oldest (eviction candidate).
pub const N_GENS: usize = 4;

/// Splits each LRU generation into anonymous and file-backed page pools,
/// matching Linux MGLRU's `lrugen` layout. Anon pages and file pages have
/// different cost models (swap I/O vs. discard/re-read), so they are
/// reclaimed with separate policies.
pub struct LruGeneration {
    /// File-backed pages (page cache, mmap MAP_SHARED).
    pub file: PageList,
    /// Anonymous pages (heap, stack, MAP_PRIVATE after CoW).
    pub anon: PageList,
}

/// Per-NUMA-zone generational LRU state.
///
/// # Design rationale
///
/// Linux's original LRU (active/inactive two-list per zone) has well-known
/// problems:
/// - thrashing: a single large sequential read evicts the entire working set
/// - inaccurate aging: every page access requires taking the LRU lock to move
///   the page to the head of the active list
/// - poor scan efficiency: reclaim must scan many active pages before finding
///   inactive candidates
///
/// MGLRU fixes these by coarsening page age into generations (not exact
/// timestamps). UmkaOS adopts the same approach but adds:
/// - `cgroup_pressure` integration (cgroup-first reclaim)
/// - refault distance tracking per-cgroup (Section 4.1.3.1)
/// - generation advancement driven by mm_walk (page table scan), not
///   lru_add/lru_deactivate calls
pub struct ZoneLru {
    /// Generational lists, indexed [generation_index][anon/file].
    /// Oldest generation is at `(oldest_gen % N_GENS)`.
    pub generations: [LruGeneration; N_GENS],
    /// Index of the oldest generation (mod N_GENS). Reclaim starts here.
    pub oldest_gen: u64,
    /// Index of the youngest generation. New pages start here.
    pub youngest_gen: u64,
    /// Protects generation list manipulation. Not held during mm_walk.
    pub lock: SpinLock<()>,
    /// Per-CPU page drain buffers: pages are added to per-CPU buffers and
    /// drained to the zone LRU under lock in batches (default: 32 pages).
    /// Eliminates per-page lock acquisitions on the hot path.
    pub percpu_drain: PerCpu<LruDrainBuffer>,
    /// Refault tracking shadow entries (Section 4.1.3.1 — refault distance).
    pub shadow_entries: XArray<ShadowEntry>,
}

/// Per-CPU drain buffer to batch LRU insertions (avoids per-page LRU lock).
pub struct LruDrainBuffer {
    pub add_file: [PagePtr; 32],
    pub add_anon: [PagePtr; 32],
    pub file_count: u8,
    pub anon_count: u8,
}

/// Shadow entry recorded in the zone's `shadow_entries` XArray, keyed by the
/// evicted page's cache index. Encodes the eviction generation so that a
/// subsequent fault can compute the refault distance and decide whether to
/// insert the page into a younger generation (protecting it from immediate
/// re-eviction).
///
/// Encoded as a tagged pointer in the XArray: low 2 bits = tag `0b11`
/// (distinguishes shadow from live page pointer), bits 2..34 = generation
/// counter at eviction time.
pub struct ShadowEntry(u64);
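The tagged encoding described in the comment can be sketched directly (a userspace model of that layout; `encode`/`is_shadow`/`generation` are illustrative method names):

```rust
/// Tagged encoding from the comment above: low 2 bits = tag 0b11,
/// bits 2..34 = generation counter at eviction time.
#[derive(Debug, PartialEq)]
struct ShadowEntry(u64);

impl ShadowEntry {
    fn encode(generation: u32) -> ShadowEntry {
        ShadowEntry(((generation as u64) << 2) | 0b11)
    }

    /// Live page pointers are at least 4-byte aligned, so their low two
    /// bits are never both set — 0b11 unambiguously marks a shadow.
    fn is_shadow(raw: u64) -> bool {
        raw & 0b11 == 0b11
    }

    fn generation(&self) -> u32 {
        ((self.0 >> 2) & 0xFFFF_FFFF) as u32
    }
}
```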

Generation advancement — mm_walk-based:

Pages do not move in the LRU list on each access. Instead, access bits in the page table (Accessed/AF/Reference flag, per-architecture) are examined by a periodic mm_walk scan:

Generation advancement algorithm (runs as kswapd wakeup or background thread):

1. For each zone needing reclaim:
   a. Walk all PTEs of processes mapped to this zone (using mm_walk,
      which visits pagetable leaves without holding any mm lock beyond
      mmap_read_lock for the duration of each VMA).
   b. For each PTE with Accessed=1:
      - Clear the Accessed bit (set to 0 for future aging).
      - Promote the referenced page: move it from its current generation
        to `youngest_gen` (generation 0 equivalent).
      - For file pages: also check the page cache Accessed flag (set by
        read(2)/mmap fault). Clear it.
   c. Pages not promoted during this walk remain in their current generation.
      After N_GENS walk cycles without promotion, a page reaches generation
      N_GENS-1 and becomes an eviction candidate.

2. Generation counter: after each complete zone walk, increment `youngest_gen`.
   Wrap: (youngest_gen % N_GENS). The oldest generation is automatically
   advanced: `oldest_gen = youngest_gen - (N_GENS - 1)`.

3. Per-NUMA locality: mm_walk for a zone only visits PTEs of tasks whose
   memory policy prefers that node. Cross-NUMA promotions are allowed but
   penalised (the page is promoted to generation 1, not generation 0).
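The counter arithmetic in step 2 can be shown concretely (`advance` and `slot` are hypothetical helpers mirroring the wrap rule, with saturation at boot before N_GENS walks have completed):

```rust
const N_GENS: u64 = 4;

/// After each complete zone walk: increment `youngest_gen`; the oldest
/// generation trails it by N_GENS - 1 automatically.
fn advance(youngest_gen: u64) -> (u64, u64) {
    let youngest = youngest_gen + 1;
    let oldest = youngest.saturating_sub(N_GENS - 1);
    (youngest, oldest)
}

/// Which of the N_GENS list slots a generation counter maps to.
/// Counters increase monotonically; only the slot index wraps.
fn slot(generation: u64) -> usize {
    (generation % N_GENS) as usize
}
```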

Reclaim algorithm:

kswapd page reclaim (per zone, triggered when free pages < watermark_low):

1. Cgroup-first scan: if any cgroup has crossed memory.high (soft limit):
   a. Reclaim from that cgroup's pages first, targeting its per-cgroup LRU
      subset (CgroupLru, see below) before touching global lists.
   b. If cgroup is at memory.max (hard limit): trigger cgroup OOM
      (Section 4.1.3.2) before global reclaim.

2. Scan oldest generation (generation N_GENS-1):
   a. File pages first: these are cheaper to evict (no swap I/O — just
      discard if clean, writeback if dirty).
      - Clean file pages: immediately freed. Shadow entry written to
        radix tree for refault tracking.
      - Dirty file pages: queued for writeback. Recounted as clean once
        writeback completes.
   b. Anon pages next: these require swap I/O.
      - Compute swap slot, issue async write, mark page SwapCache.
      - Page remains mapped (PTEs updated to swap entry) until write
        completes, then page freed.

3. If oldest generation exhausted before watermark_high is reached:
   a. Age: advance oldest_gen by 1 (drop the oldest generation's lists).
   b. Resume from new oldest generation.
   c. If all generations exhausted without reaching watermark_high:
      trigger OOM resolution (Section 4.1.3.2).

4. Per-CPU drain: before scanning, drain all percpu_drain buffers to
   their zone LRU lists (batch under lock, 32 pages at a time).
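The file-before-anon cost ordering in step 2 can be sketched as a small classifier (`Candidate`, `ReclaimAction`, and `scan_order` are illustrative, not the kernel's types):

```rust
#[derive(Debug, PartialEq)]
enum ReclaimAction { FreeNow, Writeback, SwapOut }

struct Candidate { is_file: bool, dirty: bool }

/// Step 2's cost model: clean file pages are free to drop, dirty file
/// pages need writeback first, anonymous pages always need swap I/O.
fn classify(page: &Candidate) -> ReclaimAction {
    match (page.is_file, page.dirty) {
        (true, false) => ReclaimAction::FreeNow,
        (true, true)  => ReclaimAction::Writeback,
        (false, _)    => ReclaimAction::SwapOut,
    }
}

/// Scan order within the oldest generation: file pages before anon pages.
fn scan_order(pages: &mut Vec<Candidate>) {
    pages.sort_by_key(|p| !p.is_file); // false sorts first → file first
}
```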

Refault distance tracking:

When a page is evicted, a shadow entry is stored in the zone's shadow XArray at the same page index. The shadow entry encodes the generation counter at eviction time. When the same page faults back in:

refault_distance = current_generation - shadow_generation

if refault_distance < N_GENS / 2:
    # Page was re-accessed quickly — part of the hot working set.
    # Insert into the youngest generation (protect from immediate
    # re-eviction) and increment the owning cgroup's "refault credit",
    # which reduces future reclaim aggressiveness for that cgroup.
    insert_at_generation(page, youngest_gen)
else:
    # Page was cold before the re-fault — treat it as a new arrival.
    # Same physical insertion point (the youngest generation), but no
    # refault credit is granted for cgroup accounting.
    insert_at_generation(page, youngest_gen)

Shadow entries are evicted from the XArray by a background cleaner when memory pressure is low, bounded by a maximum of nr_file_pages / 8 shadow entries per zone (prevents shadow entries from consuming significant memory).

Better than Linux LRU in these specific ways:

  • Linux: per-page LRU lock on each access (lru_add_drain).
    UmkaOS: mm_walk batch scan — no per-access lock.
  • Linux: two-list (active/inactive) — binary aging.
    UmkaOS: N_GENS generations — finer aging granularity.
  • Linux: global reclaim first (cgroup OOM is reactive).
    UmkaOS: cgroup-first reclaim — cgroup OOM before global pressure.
  • Linux: shadow entries lost after inode reclaim.
    UmkaOS: XArray-based shadow tracking survives inode cache pressure.
  • Linux: no penalty for cross-NUMA promotions.
    UmkaOS: cross-NUMA pages promoted to generation 1, not 0.

4.1.3.2 OOM Killer Policy

Overview:

The OOM (Out-of-Memory) killer is the last resort when the system cannot reclaim enough pages to satisfy an allocation. UmkaOS's OOM resolution is explicitly ordered, predictable, and userspace-notifiable — addressing well-known deficiencies in Linux's OOM killer (wrong process killed, oom_score_adj abuse, no advance warning, per-NUMA races).

Detection:

OOM is declared when all of the following are true:

1. alloc_pages(order=N) fails (buddy allocator returned null).
2. kswapd has completed at least one full reclaim cycle without reaching
   watermark_low.
3. Memory compaction (Section 4.1.4) was attempted and could not create a
   block of the requested order.
4. No swap space is available or all swap slots are in use.

For cgroup OOM, the trigger is: a cgroup has exceeded memory.max and reclaim within the cgroup subtree failed to bring usage below the limit.
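The detection predicate can be sketched as a pure function over a snapshot of allocator state; `MemState` and its field names are illustrative, not the actual UmkaOS types:

```rust
/// Illustrative snapshot of allocator/reclaim state (not the real kernel types).
pub struct MemState {
    pub alloc_failed: bool,                // buddy allocator returned null for order-N
    pub kswapd_full_cycle_below_low: bool, // full reclaim cycle, still < watermark_low
    pub compaction_failed: bool,           // could not create a block of the order
    pub swap_slots_free: u64,              // 0 => no swap available / all slots used
}

/// Global OOM is declared only when every one of the four conditions holds.
pub fn global_oom(s: &MemState) -> bool {
    s.alloc_failed
        && s.kswapd_full_cycle_below_low
        && s.compaction_failed
        && s.swap_slots_free == 0
}

/// Cgroup OOM: usage exceeded memory.max and subtree reclaim failed.
pub fn cgroup_oom(usage: u64, memory_max: u64, subtree_reclaim_failed: bool) -> bool {
    usage > memory_max && subtree_reclaim_failed
}
```

Keeping the predicate pure makes the OOM entry condition easy to audit and test in isolation.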

Resolution order:

Step 0: Hibernate background cgroups (before any kill)
  - Before killing any process, the memory pressure manager checks whether
    any cgroups in "background" state (memory.hibernate_priority > 0) can
    be hibernated to free memory.
  - Hibernation freezes cgroup tasks and reclaims their anonymous pages
    (except MADV_CRITICAL regions) without killing the process.
  - See Section 4.1.3.3 for the full hibernation mechanism.
  - Only if hibernation cannot free enough memory does resolution proceed
    to Step 1.

Step 1: Cgroup OOM (if applicable)
  - If the failing allocation is charged to a cgroup at memory.max,
    and per-cgroup reclaim failed: resolve within that cgroup.
  - Select victim from the cgroup's process set only.
  - Never escalate to global OOM if cgroup OOM resolves the pressure.

Step 2: MPOL_BIND OOM (if applicable)
  - If the failing allocation has a NUMA memory policy that binds it to
    a specific set of nodes, and those nodes are exhausted: resolve OOM
    only across processes whose allocations are bound to the same node set.
  - Prevents global OOM from being triggered by a single NUMA-bound process.

Step 3: Global OOM
  - Select the victim from all non-exempt processes system-wide.
  - Emit a log entry with: victim PID, victim name, oom_score, total RSS,
    swap used, all process memory usage at time of decision.

Victim selection:

/// OOM score for process P. Higher score = more likely to be killed.
/// This formula is designed to select the largest memory consumer while
/// respecting operator intent (oom_score_adj) without allowing adj abuse
/// to protect processes that have consumed all memory.
///
/// Note: oom_score_adj range is [-1000, 1000] (same as Linux).
/// oom_score_adj = -1000 means "never kill" (exempt from OOM selection).
/// oom_score_adj = +1000 means "kill first" (always highest priority victim).
pub fn oom_score(p: &Task) -> i64 {
    let rss_kb    = p.mm.rss_pages * 4;        // RSS in KB
    let pgtable   = p.mm.pgtable_pages * 4;    // page table overhead
    let dirty_kb  = p.mm.dirty_pages * 4;       // dirty file pages charged to p
    let child_rss = p.children.iter().map(|c| c.mm.rss_pages * 4).sum::<u64>();

    // Base score: total memory footprint (RSS + page tables + 1/8 children).
    // Child RSS included at 1/8 weight: killing parent reclaims children too,
    // but children may survive independently (so not full weight).
    let base = (rss_kb + pgtable + dirty_kb + child_rss / 8) as i64;

    // adj_delta: shift base score. Range [-1000..+1000] mapped to [-base..+base].
    // A process at adj -1000 gets score 0 (never selected — see exempt check).
    // A process at adj +1000 gets score 2×base (always selected first).
    let adj_delta = (p.oom_score_adj as i64 * base) / 1000;

    base + adj_delta
}

/// OOM victim exemptions (score -1000 is never selected):
/// - kernel threads (no mm)
/// - init (pid 1 of PID namespace)
/// - processes with oom_score_adj = -1000
/// - processes holding a mandatory kernel resource (marked OOM_EXEMPT at
///   creation, e.g., the memory-pressure notification daemon)
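To make the adj endpoints concrete, here is a standalone restatement of the formula with plain integers (the real function reads these values from the Task and mm structures):

```rust
/// Simplified restatement of oom_score(): all inputs in KB, adj in [-1000, 1000].
pub fn oom_score(rss_kb: u64, pgtable_kb: u64, dirty_kb: u64,
                 child_rss_kb: u64, oom_score_adj: i64) -> i64 {
    // Base score: RSS + page tables + dirty pages + 1/8 of children's RSS.
    let base = (rss_kb + pgtable_kb + dirty_kb + child_rss_kb / 8) as i64;
    // adj maps [-1000..+1000] onto [-base..+base].
    let adj_delta = (oom_score_adj * base) / 1000;
    base + adj_delta
}
```

With RSS 1000 KB, page tables 16 KB, dirty 64 KB and child RSS 800 KB, base is 1180; adj -1000 drives the score to 0 (never selected) and adj +1000 doubles it to 2360 (selected first).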

Notification before kill:

UmkaOS provides advance OOM notification via a /dev/oom character device. Processes interested in memory pressure events can open("/dev/oom", O_RDONLY) and poll() on it. Events delivered as fixed-size structs on read():

#[repr(C)]
pub struct OomNotification {
    /// Type of pressure event.
    pub kind: OomKind,
    /// PID of the cgroup whose memory exceeded the threshold,
    /// or 0 for global OOM notification.
    pub cgroup_pid: u32,
    /// Current free pages across all zones.
    pub free_pages: u64,
    /// Total pages in system.
    pub total_pages: u64,
    /// Reserved for future fields (ABI stability).
    pub _reserved: [u8; 32],
}

#[repr(u32)]
pub enum OomKind {
    /// Cgroup at memory.high: soft pressure warning (no kill yet).
    CgroupHigh  = 1,
    /// Cgroup at memory.max and reclaim failed: cgroup OOM imminent.
    CgroupMax   = 2,
    /// Global OOM: system-wide memory exhausted, kill imminent.
    GlobalOom   = 3,
}
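A userspace consumer reads fixed-size records from /dev/oom; the decoding can be sketched as follows (assuming little-endian byte order and the repr(C) layout above; the helper names are hypothetical):

```rust
use std::convert::TryInto;

/// Wire size: kind(4) + cgroup_pid(4) + free_pages(8) + total_pages(8)
/// + reserved(32) = 56 bytes.
pub const OOM_NOTIFICATION_SIZE: usize = 56;

#[derive(Debug, PartialEq)]
pub struct OomEvent {
    pub kind: u32,       // 1 = CgroupHigh, 2 = CgroupMax, 3 = GlobalOom
    pub cgroup_pid: u32, // 0 for a global OOM notification
    pub free_pages: u64,
    pub total_pages: u64,
}

/// Decode one record as read() from /dev/oom would return it.
/// Returns None on short reads or an unknown `kind`.
pub fn parse_oom_event(buf: &[u8]) -> Option<OomEvent> {
    if buf.len() < OOM_NOTIFICATION_SIZE {
        return None;
    }
    let u32_at = |o: usize| u32::from_le_bytes(buf[o..o + 4].try_into().unwrap());
    let u64_at = |o: usize| u64::from_le_bytes(buf[o..o + 8].try_into().unwrap());
    let kind = u32_at(0);
    if !(1..=3).contains(&kind) {
        return None;
    }
    Some(OomEvent {
        kind,
        cgroup_pid: u32_at(4),
        free_pages: u64_at(8),
        total_pages: u64_at(16),
    })
}
```

The trailing 32 reserved bytes are ignored by the decoder, which is what lets the kernel append fields later without breaking old readers.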

Kill sequence:

1. Log OOM event to kernel log (PID, name, oom_score, RSS, reason).
2. Send SIGKILL to the selected victim.
3. Wait up to 500ms for the victim to exit (oom_wait_ms, configurable).
4. If victim has not exited after 500ms: log "OOM victim did not die" and
   continue — do NOT panic. The allocator retries; if still out of memory,
   select a new victim (next-highest oom_score).
5. Cgroup OOM: SIGKILL all tasks in the cgroup simultaneously (not just
   the highest-score task) to avoid partial kills leaving zombie cgroups.
6. After kill: wake all waiters blocked in alloc_pages, retry allocation.
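Steps 2-4 form a retry loop; a sketch with the kill and wait operations injected as closures, so the fallback-to-next-victim policy is visible in isolation (names are illustrative):

```rust
/// Sketch of steps 2-4: SIGKILL the highest-score victim, wait up to
/// oom_wait_ms, and fall back to the next-highest score if the victim
/// does not exit. `kill` and `wait_exit` are injected so the policy can
/// be exercised without a real process table.
pub fn oom_kill_loop<K, W>(
    victims_by_score_desc: &[u32], // PIDs, highest oom_score first
    mut kill: K,                   // send SIGKILL
    mut wait_exit: W,              // true if the PID exited within oom_wait_ms
) -> Option<u32>
where
    K: FnMut(u32),
    W: FnMut(u32) -> bool,
{
    for &pid in victims_by_score_desc {
        kill(pid);
        if wait_exit(pid) {
            return Some(pid); // memory will be reclaimed; retry allocation
        }
        // "OOM victim did not die": log and continue, never panic
    }
    None
}
```

Returning `None` (no victim died) leaves the decision to the caller's retry path rather than panicking, matching step 4.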

/proc/PID/oom_score_adj interface: Range [-1000, 1000]. Inherited by children. Written by the process itself (any value ≥ current adj — lowering the value requires privilege, which prevents a process from exempting itself) or by root (any value). Default: 0. Setting -1000 exempts a process from OOM selection (not from SIGKILL if sent explicitly — only OOM selection).

/proc/PID/oom_score (read-only): Current computed OOM score (calls oom_score() on read). Useful for monitoring and OOM debugging.

Design improvements over Linux OOM killer:

| Linux OOM problem | UmkaOS resolution |
|---|---|
| Kills wrong process (oom_score inaccurate for swap-heavy workloads) | dirty_kb and pgtable included in base score; child RSS weighted at 1/8 |
| No advance warning before kill | /dev/oom notification device with OomKind::CgroupHigh at 80% pressure |
| Global OOM triggered by single cgroup | Cgroup OOM resolved first, never escalates unless cgroup OOM fails |
| OOM panic on kernel allocation failure | No panic — log + retry + select new victim if needed |
| NUMA OOM selects from wrong node set | MPOL_BIND OOM resolved within bound node set only |
| Per-NUMA oom_zonelist races | Explicit NUMA-aware step 2 in resolution order |

4.1.3.3 Process Memory Hibernation

Motivation:

Traditional Linux/Android memory reclamation under pressure has one instrument: kill the process. iOS takes a fundamentally different approach: background apps are frozen and their memory is compressed or reclaimed without killing them, with Jetsam kills reserved as a last resort. When the user switches back, the app resumes from exactly the state it was in, with only a brief repopulation stall instead of a full cold start.

Android devices historically required more RAM than iPhones because Android's only recourse under pressure was to kill background processes (requiring full restart, ~1-3s), while iOS's freeze+reclaim allowed the same working set to fit in less physical memory (~50-200ms warm resume).

UmkaOS solves this at the kernel level with process memory hibernation: a composable cgroup-based mechanism that freezes a process group and reclaims its memory without killing it. Applications opt in to efficient hibernation via new madvise() hints. Processes that do not use the hints still benefit — the kernel reclaims all non-critical pages via compression or swap — it just can't guarantee which pages are safe to zero-fill on resume.

New madvise() hints:

/* MADV_DISCARDABLE (new): Anonymous pages in this region may be discarded
 * at any time under memory pressure, including during cgroup hibernation.
 * On next access after discard, pages are zero-filled — the application
 * is responsible for reinitializing their content.
 *
 * Use cases: GC-managed heaps (GC will reinitialize objects on first use),
 * compiled code caches (JIT can recompile on demand), image decode buffers,
 * pre-computed table data.
 *
 * Stronger than MADV_FREE: MADV_FREE pages are freed lazily under global
 * pressure. MADV_DISCARDABLE pages are freed eagerly when the owning
 * cgroup transitions to hibernating state.
 *
 * Note: discarded pages are NOT tracked individually. The application must
 * be designed to reinitialize any page in the region on first access —
 * there is no per-page "was this discarded?" query.
 */
madvise(addr, len, MADV_DISCARDABLE)  /* 256 — UmkaOS extension namespace starts here */

/* MADV_CRITICAL (new): Anonymous pages in this region must never be
 * evicted — not to swap, not during hibernation, not under any memory
 * pressure. Pages are wired (pinned) in physical RAM.
 *
 * Use cases: UI state (last rendered frame, scroll position), input
 * event queues, cryptographic key material, file descriptor tables for
 * critical IPC channels.
 *
 * Subject to a per-process quota enforced by the memory cgroup:
 *   memory.critical_limit (default: 64 MB per cgroup leaf)
 * Exceeding the quota: madvise() returns ENOMEM.
 *
 * Interaction with fork: MADV_CRITICAL is NOT inherited across fork().
 * Child processes start with no critical regions.
 */
madvise(addr, len, MADV_CRITICAL)     /* 257 — UmkaOS extension */

Memory page classification for hibernation:

When a cgroup enters hibernating state, every anonymous page in the cgroup's address spaces is classified into one of four categories:

| Category | Condition | Hibernation action | Resume action |
|---|---|---|---|
| Critical | Covered by MADV_CRITICAL | Keep in RAM (wired) | Instant access |
| Discardable | Covered by MADV_DISCARDABLE | Free immediately (no swap I/O) | Zero-fill fault |
| Compressible | All other anonymous pages | Compress into zpool (Section 4.2) | Decompress on fault |
| Uncompressible | Large anonymous pages that don't compress | Swap to disk | Swap-in on fault |

File-backed pages (mmap MAP_SHARED, page cache) are handled by the normal LRU reclaim path (Section 4.1.3.1) — they do not need special treatment.
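The four-way classification can be sketched as a pure function; the boolean inputs stand in for the VMA flag checks and the zpool trial compression (the 75% threshold is the one used by the compression phase of the hibernation algorithm below):

```rust
/// Hibernation category for an anonymous page (illustrative enum; the real
/// classifier walks VMAs and PTEs).
#[derive(Debug, PartialEq)]
pub enum HibernateClass { Critical, Discardable, Compressible, Uncompressible }

/// Classify one anonymous page from its VMA hints plus a compressibility
/// probe. `compresses_below_75pct` stands in for the zpool trial result.
pub fn classify_anon_page(
    critical: bool,               // VMA has MADV_CRITICAL
    discardable: bool,            // VMA has MADV_DISCARDABLE
    compresses_below_75pct: bool, // zpool trial compression result
) -> HibernateClass {
    if critical {
        HibernateClass::Critical // keep wired in RAM
    } else if discardable {
        HibernateClass::Discardable // free now, zero-fill on resume
    } else if compresses_below_75pct {
        HibernateClass::Compressible // into the zpool
    } else {
        HibernateClass::Uncompressible // queue for swap
    }
}
```

Note the precedence: CRITICAL wins over DISCARDABLE if both hints somehow cover the same page, so a wiring guarantee is never silently downgraded.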

Data structures:

/// Per-VMA annotation for hibernation hints. Stored as flags in the VMA
/// descriptor (Section 4.1.5). Multiple ranges within a VMA can have
/// different hints via VMA splitting (same mechanism as mprotect()).
bitflags! {
    pub struct VmaHibernateFlags: u8 {
        const DISCARDABLE = 0x01;  // madvise(MADV_DISCARDABLE)
        const CRITICAL    = 0x02;  // madvise(MADV_CRITICAL)
    }
}

/// Per-cgroup hibernation state, stored in the memcg descriptor.
pub enum HibernateState {
    /// Normal operation — tasks running, pages managed by LRU.
    Active,
    /// Transitioning to hibernated — tasks frozen, pages being reclaimed.
    /// Reads of memory.hibernate_state return "hibernating" during this phase.
    Hibernating,
    /// All non-critical pages reclaimed — tasks still frozen, minimal RSS.
    Hibernated,
    /// Transitioning back to active — tasks unfrozen, pages being warmed.
    Thawing,
}

/// Per-cgroup hibernation configuration. Exposed via cgroupfs memory.* files.
pub struct CgroupHibernate {
    pub state: HibernateState,
    /// Priority: 0 = not eligible for hibernation; 1-100 = eligible
    /// (higher = hibernated sooner under pressure). Set by orchestrator
    /// (e.g., ActivityManager, systemd-oomd).
    pub priority: u8,
    /// Maximum bytes of MADV_CRITICAL pages allowed in this cgroup subtree.
    /// Default: 64 MB. Enforced at madvise(MADV_CRITICAL) time.
    pub critical_limit: u64,
    /// Bytes of critical pages currently wired in this cgroup subtree.
    pub critical_used: AtomicU64,
    /// Pages discarded (zero-fill) in last hibernation cycle.
    pub discarded_pages: AtomicU64,
    /// Pages compressed in last hibernation cycle.
    pub compressed_pages: AtomicU64,
    /// Pages swapped in last hibernation cycle.
    pub swapped_pages: AtomicU64,
}
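Enforcement of critical_limit at madvise(MADV_CRITICAL) time can be sketched as a lock-free charge against critical_used; the function name is hypothetical:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Charge `bytes` of MADV_CRITICAL pages against the cgroup's quota.
/// A compare-exchange loop ensures concurrent madvise() calls cannot
/// jointly overshoot `limit`. Returns false (madvise then returns
/// ENOMEM) if the quota would be exceeded.
pub fn try_charge_critical(used: &AtomicU64, limit: u64, bytes: u64) -> bool {
    let mut cur = used.load(Ordering::Relaxed);
    loop {
        let new = match cur.checked_add(bytes) {
            Some(n) if n <= limit => n,
            _ => return false, // overflow or quota exceeded
        };
        match used.compare_exchange_weak(cur, new, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => return true,
            Err(observed) => cur = observed, // raced with another charge; retry
        }
    }
}
```

The uncharge path on munmap()/exit would be the mirror image, a `fetch_sub` of the wired bytes.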

Hibernation algorithm:

hibernate_cgroup(cgroup):

Phase 1 — Freeze (synchronous, <1ms for typical app):
  1. Write cgroup.freeze = 1 (kernel cgroup freezer, not SIGSTOP).
     This is transparent to the frozen tasks — they do not observe
     the freeze transition. Wait for all tasks to reach FROZEN state.
  2. Disable the cgroup's memory.max enforcement temporarily
     (prevents OOM kill racing with our intentional hibernation).

Phase 2 — Discard (synchronous, ~1-5ms for typical 256 MB DISCARDABLE region):
  3. Walk all VMAs in all tasks of the cgroup.
  4. For each VMA with DISCARDABLE flag:
     a. Walk PTEs. For each Present PTE:
        - Unmap the PTE (mark not-present, flush TLB).
        - If the page is exclusively owned (refcount == 1): free immediately.
        - If shared (CoW parent or mmap shared): leave for LRU reclaim.
     b. Do NOT write shadow entries — discarded pages are intentionally
        zeroed on resume; no refault tracking needed.
  5. Flush TLB shootdowns in batch (one IPI burst for all CPUs that
     had the cgroup's tasks scheduled recently).

Phase 3 — Compress remaining anonymous pages (asynchronous, background):
  6. Queue all remaining anonymous (non-CRITICAL, non-DISCARDABLE) VMA
     pages for compression via the zpool path (Section 4.2).
     - This runs asynchronously: the cgroup is already FROZEN and using
       zero CPU, so background compression does not compete with foreground
       work.
     - Priority: lower than foreground compression jobs.
  7. Pages that do not compress below 75% of original size: queue for swap.

Phase 4 — State transition:
  8. When all non-CRITICAL pages are compressed or swapped:
     transition cgroup.hibernate_state to HIBERNATED.
  9. RSS of the cgroup at this point: only CRITICAL pages + kernel structures
     (task_struct, page tables — typically <512 KB per process).
  10. Re-enable memory.max enforcement.

Thaw (resume) algorithm:

thaw_cgroup(cgroup):

Phase 1 — Unfreeze (synchronous, <1ms):
  1. Write cgroup.freeze = 0. Tasks are immediately runnable.
  2. Transition state to THAWING.

Phase 2 — Warm prefetch (asynchronous, ~50-200ms for typical app):
  3. CRITICAL pages: already present — zero latency.
  4. DISCARDABLE pages: populated with zero-fill on first fault.
     No I/O required. Fault latency: ~1-3μs per page (same as demand
     zero anonymous fault). Total for 256 MB: ~65ms worst case if all
     pages are faulted simultaneously; typical UI path <10ms.
  5. Compressed pages: decompressed on fault by the thread that touches
     them (inline decompression, ~10-50μs per page, LZ4 from zpool).
     Background prefetcher also reads ahead: when a task first faults a
     page from a compressed region, the prefetcher decompresses the next
     16 pages of that VMA in a background kworker.
  6. Swapped pages: swap-in on fault. Background swap prefetcher issues
     readahead for sequential access patterns.

Phase 3 — State transition:
  7. After all swap I/O has been issued (not necessarily completed):
     transition state to ACTIVE. The cgroup is now fully active.
     Remaining pages trickle in on demand.
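The next-16-pages readahead in Phase 2 step 5 reduces to a small window computation (a sketch, in page indices within the VMA):

```rust
/// Compute the background decompression window after a fault at page
/// index `fault_idx` within a compressed VMA of `vma_pages` pages: the
/// next 16 pages, clipped to the end of the VMA. Returns a half-open
/// range of page indices for the prefetch kworker.
pub fn thaw_prefetch_window(fault_idx: u64, vma_pages: u64) -> (u64, u64) {
    let start = (fault_idx + 1).min(vma_pages);
    let end = (fault_idx + 1 + 16).min(vma_pages);
    (start, end)
}
```

Clipping at the VMA boundary means the window can be shorter than 16 pages, or empty at the last page, so the prefetcher never touches a neighbouring mapping.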

Cgroupfs interface (all under the memory controller):

memory.hibernate_state    rw  "active" | "hibernating" | "hibernated" | "thawing"
                              Write "hibernating" to trigger hibernation.
                              Write "active" to trigger thaw.

memory.hibernate_priority rw  0-100. 0 = not eligible (default).
                              Orchestrator sets this for background app cgroups.
                              The OOM resolution Step 0 (Section 4.1.3.2) uses
                              this to select which cgroups to hibernate first.

memory.critical_limit     rw  Maximum bytes of MADV_CRITICAL pages.
                              Default: 67108864 (64 MB).

memory.critical_current   ro  Current bytes of MADV_CRITICAL pages wired.

memory.hibernate_stats    ro  Lines: discarded_pages, compressed_pages,
                              swapped_pages, thaw_faults. Reset on each
                              thaw cycle.

/proc/PID/smaps extensions:

Each VMA entry in /proc/PID/smaps gains two new fields:

MadvDiscardable: <kb>    # kB of this VMA covered by MADV_DISCARDABLE
MadvCritical:    <kb>    # kB of this VMA covered by MADV_CRITICAL

Performance targets:

| Metric | Target | Basis |
|---|---|---|
| Freeze latency (Phase 1) | < 1ms | cgroup freezer is synchronous |
| Discard phase (Phase 2, 256 MB DISCARDABLE) | < 5ms | TLB shootdown + page free, no I/O |
| Memory freed (active → hibernated, 512 MB app) | 400-480 MB | 80-95% of anon freed |
| Warm resume CRITICAL path | < 5ms | pages in RAM, just unfreeze |
| Warm resume first-paint (compressed pages) | < 200ms | LZ4 decompression on fault |
| Cold resume (all pages swapped) | 500ms - 2s | swap I/O latency |

Comparison with existing approaches:

| Approach | Memory freed | App restart cost | App changes needed |
|---|---|---|---|
| Android LMKD (kill) | 100% of app | Cold start: 1-3s | None |
| iOS Jetsam (OS-managed) | 70-90% | Warm resume: 50-500ms | None (OS decides) |
| UmkaOS hibernation (no hints) | ~80% (compress/swap) | Warm resume: 200ms-2s | None |
| UmkaOS hibernation (with hints) | 85-95% (discard + compress) | Warm resume: <200ms | madvise() calls |

Backwards compatibility:

  • madvise(MADV_DISCARDABLE) and madvise(MADV_CRITICAL) are new hints. Linux apps that do not call them work unchanged; hibernation falls back to compress-everything behavior (still better than kill).
  • memory.hibernate_state is a new cgroupfs attribute. Orchestrators written for vanilla Linux simply don't write to it; behavior is identical to Linux.
  • The cgroup freeze interface (cgroup.freeze) is compatible with Linux cgroup-v2 freeze semantics.
  • This mechanism is designed for adoption: if it proves effective, the MADV_DISCARDABLE / MADV_CRITICAL hints and the memcg attributes are straightforward to propose for upstream Linux inclusion.

4.1.4 Transparent Huge Page Promotion and Memory Compaction

khugepaged — background THP promotion:

The kernel runs a background thread (khugepaged) that scans process address spaces for opportunities to promote 512 contiguous 4KB pages (2MB aligned) into a single 2MB transparent huge page. This reduces TLB pressure: a single 2MB TLB entry replaces 512 × 4KB entries, and modern CPUs have dedicated 2MB TLB slots (Intel: 32-64 entries; AMD: 64 entries; ARM: 32-48 entries depending on core).

Promotion flow:
  1. khugepaged scans VMAs with THP enabled (default for anonymous memory).
  2. For each 2MB-aligned range: check if all 512 base pages are present,
     anonymous (not file-backed), and writable.
  3. If yes: allocate a compound page (order-9), copy 512 base pages into it,
     update PTEs atomically under the page table lock, free the 512 base pages.
  4. If allocation fails (no contiguous 2MB block): skip and try next range.
     Memory compaction (below) may create the block for a future scan cycle.
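Step 2's eligibility check reduces to a predicate over the 512 base pages; a sketch with an illustrative per-page summary struct:

```rust
/// Per-base-page state khugepaged inspects (illustrative; the real scan
/// walks PTEs under the page table lock).
#[derive(Clone, Copy)]
pub struct BasePage {
    pub present: bool,
    pub anonymous: bool,
    pub writable: bool,
}

/// A 2MB-aligned range is promotable only if all 512 base pages are
/// present, anonymous, and writable (step 2 of the promotion flow).
pub fn thp_promotable(range: &[BasePage; 512]) -> bool {
    range.iter().all(|p| p.present && p.anonymous && p.writable)
}
```

A single swapped-out, file-backed, or read-only page disqualifies the whole range, which is why khugepaged simply skips it and retries on a later scan cycle.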

Configuration:

/sys/kernel/mm/transparent_hugepage/enabled     — always / madvise / never
/sys/kernel/mm/transparent_hugepage/defrag      — always / defer / defer+madvise / madvise / never
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs  — interval (default: 10000ms)
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan         — pages per cycle (default: 4096)

Memory compaction:

When the buddy allocator cannot satisfy a high-order allocation (e.g., 2MB for THP or 1GB for explicit huge pages), the kernel triggers memory compaction: a process that migrates movable pages to create contiguous free regions.

Compaction algorithm (simplified):
  1. A "migration scanner" walks from the bottom of the zone upward, finding
     movable pages (LRU-resident, not pinned, not DMA-mapped).
  2. A "free scanner" walks from the top of the zone downward, finding free pages.
  3. When both scanners meet: the movable page is migrated (allocated at the free
     page's location, content copied, PTE updated), freeing a contiguous block at
     the migration scanner's position.
  4. Compaction stops when a block of the requested order is available or the
     scanners have exhausted the zone.
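The two-scanner loop can be modelled on a toy linear zone (a simplification: real compaction migrates whole pageblocks and honours mobility metadata):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Slot { Free, Movable, Pinned }

/// Toy two-scanner compaction over a linear "zone": the migration scanner
/// walks up from the bottom looking for movable pages, the free scanner
/// walks down from the top looking for free slots, and each movable page
/// is relocated into the highest free slot until the scanners meet.
/// Pinned pages are skipped in place. Returns the number of migrations.
pub fn compact(zone: &mut [Slot]) -> usize {
    let (mut lo, mut hi, mut moved) = (0usize, zone.len(), 0usize);
    while lo < hi {
        // advance migration scanner to the next movable page
        while lo < hi && zone[lo] != Slot::Movable { lo += 1; }
        // retreat free scanner to the next free slot
        while lo < hi && zone[hi - 1] != Slot::Free { hi -= 1; }
        if lo >= hi || lo >= hi - 1 { break; }
        zone.swap(lo, hi - 1); // migrate: copy content, update PTE, free source
        moved += 1;
    }
    moved
}
```

Running it on a zone with one pinned slot shows the hole it leaves: free slots accumulate at the bottom except where the pinned page sits, exactly the fragmentation effect discussed in the pinned-page section.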

Page Mobility Classification for Compaction:

The migration scanner only moves movable pages. Pinned pages are skipped in place — compaction works around them, accepting that pinned pages create holes that limit contiguous block formation.

Movable (can be migrated):
  • Anonymous pages not locked by mlock() or mmap(MAP_LOCKED)
  • Page cache pages (clean: simply freed + re-read; dirty: writeback first)
  • Pages allocated with GFP_MOVABLE (standard for anonymous and file pages)
  • Slab pages marked movable (used for large object caches only)

Pinned (cannot be migrated):
  • Pages locked via mlock() or mlockall() — physical address is fixed
  • Pages mapped for DMA (registered with IOMMU; physical address in hardware)
  • Pages in driver vmap/vmalloc mappings (physical address fixed for MMIO)
  • Per-CPU data pages and kernel stack pages of currently running tasks
  • Pages with refcount > 1 beyond the page table mapping (additional holders)
  • madvise(MADV_CRITICAL) pages (Section 4.1.3.3) — wired, never migrated

Compaction behavior on pinned pages: The migration scanner encounters a pinned page and skips it, advancing the scanner by one page. The free scanner may still find free pages beyond the pinned page, but the resulting free block will be non-contiguous with pages before the pinned page. If the zone has many scattered pinned pages, compaction may fail to form a 2MB block even after a full scan.

Design implication: Drivers should use GFP_MOVABLE for their page allocations wherever possible to avoid becoming obstacles to THP formation and compaction. DMA mappings are inherently pinned and should be concentrated in IOMMU-mapped regions rather than scattered through general memory.

Latency impact: Compaction involves page migration (TLB shootdown + memcpy + PTE update). Per-page migration costs vary by scope:
  • Local NUMA migration (same socket, memcpy via CPU): ~200-500 ns per 4KB page
  • Cross-socket migration (cache-coherent interconnect, TLB shootdown included): ~1-10 μs per page
  • RDMA-based DSM migration (cross-node, see Section 5.x): ~2-50 μs depending on distance

Compaction stall time during synchronous operation accumulates these costs across all migrated pages. The defer defrag mode (default in UmkaOS) avoids synchronous compaction on page faults — instead, khugepaged and kcompactd run in the background, and allocation failures fall back to 4KB pages without blocking. The always defrag mode triggers synchronous compaction on every THP-eligible fault, which maximizes THP coverage but can cause multi-millisecond stalls — suitable only for throughput-oriented batch workloads, not latency-sensitive applications.
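As a quick check of the stall arithmetic (using mid-range per-page costs from the list above as assumed inputs):

```rust
/// Back-of-envelope compaction stall: migrating one 2MB block is 512 page
/// migrations. Multiply by a per-page cost in nanoseconds, report microseconds.
pub fn stall_us(pages: u64, ns_per_page: u64) -> u64 {
    pages * ns_per_page / 1000
}
```

At 500 ns per page, a local 2MB block costs about 256 μs of stall; at 10 μs per page cross-socket it is about 5.1 ms, which is the "multi-millisecond stall" that makes the always defrag mode unsuitable for latency-sensitive work.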

Disable option: For hard real-time workloads (isolcpus + nohz_full), THP promotion and compaction should be disabled entirely (transparent_hugepage/enabled=never) to eliminate background scanning and migration-induced latency jitter. These workloads should pre-allocate explicit huge pages at boot (umka.hugepages=<count>) for deterministic TLB behavior.

4.1.5 Virtual Memory Manager

  • Maple tree for VMA (Virtual Memory Area) management (same as Linux 6.1+)
  • Demand paging: Pages allocated on first access (page fault handler in UmkaOS Core)
  • Copy-on-write (COW): Fork shares pages read-only, copies on write fault
  • Memory-mapped files: mmap backed by page cache, supports MAP_SHARED and MAP_PRIVATE
  • Huge pages: Both explicit (mmap with MAP_HUGETLB) and transparent (THP)
  • ASLR: Full address space layout randomization for user processes

4.1.5.1 Maple Tree VMA Management

The VMA index is a B-tree variant (Maple tree, as in Linux 6.1+) where each node stores ranges of VMAs sorted by virtual address. Unlike a red-black tree, Maple tree nodes are 256-byte dense nodes (4 cache lines), store multiple ranges per node (fanout ~16 for range nodes), and provide O(log n) lookup with far better cache behavior than pointer-chasing rb-trees. The reduced pointer chasing means fewer cache misses per lookup: a 4-level Maple tree (covering ~65,000 VMAs at fanout 16) touches ~16 cache lines (4 per level), versus 45-60 cache lines for an rb-tree over the same VMA set (log₂(65,000) ≈ 16 nodes × 3-4 cache lines each).

Vma — virtual memory area:

/// A Virtual Memory Area — a contiguous range of virtual addresses with uniform
/// protection and backing. The `MapleTree` stores one `Vma` per mapped region.
///
/// `Vma` structs are allocated from the slab allocator and stored in leaf nodes
/// of the `MapleTree`. The tree's range keys are `[vm_start, vm_end)`.
pub struct Vma {
    /// Inclusive start virtual address (page-aligned).
    pub vm_start: VirtAddr,
    /// Exclusive end virtual address (page-aligned).
    pub vm_end: VirtAddr,
    /// Protection and mapping flags (`PROT_READ`, `PROT_WRITE`, `PROT_EXEC`,
    /// `MAP_SHARED`, `MAP_PRIVATE`, `MAP_ANONYMOUS`, `MAP_HUGETLB`, etc.).
    /// Stored as a `VmFlags` bitfield; mirrors Linux `vm_flags`.
    pub vm_flags: VmFlags,
    /// Page offset within the backing file (in pages). Zero for anonymous VMAs.
    pub vm_pgoff: u64,
    /// Backing file, if any. `None` for anonymous VMAs.
    /// `Some` for file-backed mappings (`mmap` of a regular file, device, or
    /// shared memory object).
    pub file: Option<Arc<FileRef>>,
    /// RCU head for deferred reclamation. After the VMA is removed from the tree
    /// and the RCU grace period elapses, the slab slot is returned.
    pub rcu: RcuHead,
}

impl Vma {
    /// Compute the PTE protection flags for this VMA.
    /// Translates `vm_flags` into architecture-specific PTE bits.
    pub fn pte_flags(&self) -> PteFlags {
        arch::current::mm::vma_to_pte_flags(self.vm_flags)
    }

    /// Length of the VMA in bytes.
    pub fn len(&self) -> u64 {
        self.vm_end.0 - self.vm_start.0
    }
}

MapleNode — single node in the Maple tree:

/// Node type determines interpretation of the children/slots union.
/// Dense nodes store up to 16 entries indexed by offset; Range nodes
/// store ranges delimited by pivot values.
#[repr(u8)]
pub enum MapleNodeType {
    /// Dense node: slots are indexed by position (for small, dense ranges).
    /// Used when the address range is compact and contiguous.
    Dense = 0,
    /// Range node: pivots delimit address ranges, children/slots are keyed
    /// by range. Used for the general case of sparse VMA layouts.
    Range = 1,
}

/// A single node in the Maple tree. ~256 bytes (4 cache lines), sized for
/// fanout-16 B-tree nodes. Internal nodes store child pointers; leaf nodes
/// store VMA pointers. The fanout of 16 is chosen to balance tree height
/// (3 levels for ~4096 VMAs) against node size. Nodes are NUMA-allocated
/// from the slab allocator and RCU-protected for lock-free reads.
///
/// Approximate layout:
///   - 8 bytes:   node_type (u8) + nr_entries (u8) + 2 bytes padding + gap (u32)
///   - 120 bytes: pivots (15 × u64 range boundaries)
///   - 128 bytes: children (16 × *mut MapleNode, 8 bytes each on 64-bit)
///   - trailing:  RcuHead (cold; not touched on the lookup path)
pub struct MapleNode {
    /// Discriminant: Dense or Range.
    pub node_type: MapleNodeType,
    /// Number of valid entries (pivots/children or slots). 0..=16.
    pub nr_entries: u8,
    /// Maximum free virtual address gap (in pages) in this subtree.
    /// Used by `maple_find_gap()` to prune subtrees that cannot satisfy
    /// an allocation request. Stored in pages, so u32 covers gaps up to
    /// 2^32 * 4KB = 16 TB — sufficient for any realistic 64-bit address space
    /// (Linux x86-64 user ASLR range is ~128 TB; individual gaps are typically
    /// much smaller). Updated bottom-up on insert/remove.
    pub gap: u32,
    /// Range boundaries (for Range nodes). `pivots[i]` is the exclusive
    /// upper bound of range i. Entry i covers addresses
    /// `[pivots[i-1], pivots[i])` (with `pivots[-1]` implicitly 0 for
    /// the first entry). Up to 15 pivots delimit up to 16 ranges.
    pub pivots: [u64; 15],
    /// For internal nodes: child pointers (RCU-protected).
    /// For leaf nodes: VMA pointers.
    /// Exactly one of these is valid based on whether this is an internal
    /// or leaf node (determined by tree depth, not a per-node flag).
    /// Using a union avoids wasting space on an enum discriminant at
    /// every slot.
    pub children: [*mut MapleNode; 16],  // internal nodes
    // -- OR (union, leaf nodes) --
    // pub slots: [*mut Vma; 16],        // leaf nodes
    /// RCU head for deferred reclamation after copy-on-write replacement.
    pub rcu: RcuHead,
}

MapleTree — top-level tree descriptor:

/// The Maple tree VMA index. One per address space (`MmStruct`).
///
/// Readers access the tree under `rcu_read_lock()` with no write-side
/// synchronization — the tree is always in a consistent state because
/// mutations use copy-on-write (new path from root to modified leaf,
/// then atomic root pointer swap).
///
/// Writers hold `write_lock` for mutual exclusion, then perform COW
/// mutations and publish via `rcu_assign_pointer()` on the root.
pub struct MapleTree {
    /// Root node pointer, RCU-protected. Readers load atomically via
    /// `rcu_dereference(root)`. Writers replace via `rcu_assign_pointer()`.
    /// NULL for an empty tree (no VMAs).
    pub root: RcuPtr<MapleNode>,
    /// Write-side serialization. Held during insert, remove, and split/merge
    /// operations. Readers never acquire this lock — they use RCU.
    pub write_lock: RwLock<()>,
    /// Highest virtual address mapped in this tree. Cached to avoid
    /// tree traversal for TASK_SIZE checks and stack growth limit
    /// enforcement. Updated on insert/remove.
    pub highest_addr: u64,
    /// Number of MapleNode objects in this tree (for memory accounting
    /// and diagnostics via `/proc/<pid>/status`).
    pub node_count: u32,
}

RCU read protocol:

Readers hold rcu_read_lock(), load root atomically via rcu_dereference(), and traverse the tree without acquiring any lock. All node pointers encountered during traversal are guaranteed valid for the duration of the RCU read-side critical section (because writers never modify nodes in place — they COW).

Writers hold write_lock for mutual exclusion, then:
1. Copy-on-write the path from root to the modified leaf (allocate new nodes, copy unmodified children by pointer).
2. Atomically swap the new root via rcu_assign_pointer().
3. Schedule rcu_call() to free the old nodes after the current grace period (all readers that could have seen the old root have exited their critical sections).

This means a write operation allocates O(log n) new nodes (one per tree level), which is bounded by tree height (typically 3-4 for normal process address spaces with up to ~65,000 VMAs).

Operations:

/// Find the VMA containing `addr`, if any.
///
/// Walk from root comparing `addr` against pivots at each level to
/// select the correct child/slot. O(log n) with ~3 cache-line accesses
/// for typical VMA counts. Zero locks for readers (RCU read-side only).
///
/// # Arguments
/// * `tree` - The Maple tree to search.
/// * `addr` - Virtual address to look up.
///
/// # Returns
/// `Some(&Vma)` if `addr` falls within a mapped VMA, `None` otherwise.
/// The returned reference is valid for the duration of the caller's
/// `rcu_read_lock()` critical section.
pub fn maple_find(tree: &MapleTree, addr: u64) -> Option<&Vma>;
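The pivot comparison that maple_find() repeats at each level can be shown on a simplified single leaf node (a model for illustration, not the real MapleNode layout):

```rust
/// Single-level model of the pivot walk: `pivots[i]` is the exclusive
/// upper bound of slot i, and a slot holds `Some((vm_start, vm_end))`
/// for a VMA or `None` for an unmapped gap. The real tree repeats this
/// comparison at each level down to the leaf.
pub struct LeafNode {
    pub pivots: Vec<u64>,               // exclusive upper bounds, sorted
    pub slots: Vec<Option<(u64, u64)>>, // same length as pivots
}

/// Return the VMA range containing `addr`, if the covering slot is mapped.
pub fn leaf_find(node: &LeafNode, addr: u64) -> Option<(u64, u64)> {
    for (i, &pivot) in node.pivots.iter().enumerate() {
        if addr < pivot {
            // addr falls in range i; confirm the slot is a mapped VMA
            return node.slots[i].filter(|&(s, e)| addr >= s && addr < e);
        }
    }
    None // beyond the last pivot: unmapped
}
```

Because slots can be `None`, the same walk answers both "which VMA covers addr" and "is addr unmapped" with no extra bookkeeping.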

/// Insert a VMA into the tree.
///
/// Acquires `write_lock`. COW-copies the path from root to the
/// insertion point. Inserts the VMA, splits the leaf node if full
/// (promoting a pivot to the parent, splitting recursively if needed).
/// Updates gap fields bottom-up. Atomically swaps the new root.
///
/// # Errors
/// Returns `Err(ErrAddrInUse)` if any part of the VMA's address range
/// `[vma.start, vma.end)` overlaps an existing VMA.
pub fn maple_insert(tree: &mut MapleTree, vma: Vma) -> Result<(), ErrAddrInUse>;

/// Remove all VMAs overlapping the address range `[addr_start, addr_end)`.
///
/// Acquires `write_lock`. COW-copies the affected path. Removes
/// entries, merges underfull nodes (nodes with fewer than 25% of slots
/// occupied are merged with a sibling). Updates gap fields bottom-up.
/// Atomically swaps the new root.
///
/// Removed VMAs are returned to the caller for cleanup (unmapping
/// page table entries, freeing backing pages). The old tree nodes
/// are freed via `rcu_call()` after the grace period.
pub fn maple_remove(tree: &mut MapleTree, addr_start: u64, addr_end: u64);

/// Find the lowest virtual address with at least `size` contiguous free
/// bytes, aligned to `align`, starting the search at `hint_addr`.
///
/// Uses the gap index: each internal node's `gap` field records the
/// maximum free gap in its subtree. The walker descends only into
/// subtrees whose `gap >= size` (in pages), pruning branches that
/// cannot possibly satisfy the request. This makes unmapped-area
/// search O(log n) instead of the O(n) linear VMA scan required
/// without gap tracking.
///
/// Used by `mmap()` without `MAP_FIXED` to find a suitable address.
/// The `hint_addr` (from the `addr` argument to mmap, or from the
/// per-mm free-area cursor) biases the search toward a preferred
/// region (typically above the current `brk` for bottom-up layouts
/// or below the stack for top-down layouts).
pub fn maple_find_gap(
    tree: &MapleTree,
    size: u64,
    align: u64,
    hint_addr: u64,
) -> Option<u64>;
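The gap-pruned descent behind `maple_find_gap` can be illustrated on a toy tree (`GapNode` and `find_gap` are illustrative names, with explicit free ranges stored in leaves instead of pivot arithmetic, and no alignment or hint handling):

```rust
/// Toy gap-indexed tree node for sketching the pruned search.
pub enum GapNode {
    /// Leaf: sorted free ranges [start, end) within this node's span.
    Leaf(Vec<(u64, u64)>),
    /// Internal node: children plus the max free gap anywhere below.
    Inner { gap: u64, children: Vec<GapNode> },
}

/// Lowest start address of a free range of at least `size`, descending
/// only into subtrees whose recorded max gap can satisfy the request.
pub fn find_gap(node: &GapNode, size: u64) -> Option<u64> {
    match node {
        GapNode::Leaf(frees) => frees
            .iter()
            .find(|(s, e)| e - s >= size)
            .map(|(s, _)| *s),
        GapNode::Inner { gap, children } => {
            if *gap < size {
                return None; // prune: nothing in this subtree fits
            }
            // Children are in address order, so the first hit is lowest.
            children.iter().find_map(|c| find_gap(c, size))
        }
    }
}
```

Because children are visited in address order and oversized subtrees are skipped outright, the search touches O(log n) nodes rather than scanning every VMA.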

Gap tracking:

Each internal node tracks the maximum free virtual address gap (in pages) within its subtree via the gap field. On every insert or remove, gaps are recalculated bottom-up from the modified leaf to the root.

Gap recalculation algorithm (bottom-up, O(height)):

recalc_gap(node):
  if node.is_leaf():
    # Leaf: gaps between stored VMAs plus the span before first and after last pivot
    node.gap = max_gap_in_leaf(node)
    # max_gap_in_leaf computes:
    #   span from node_min to pivot[0].vm_start,
    #   for each i: pivot[i].vm_end to pivot[i+1].vm_start,
    #   span from pivot[last].vm_end to node_max
    return
  # Internal node: maximum of all children's subtree gaps
  node.gap = 0
  for i in 0..node.num_children:
    child_gap = node.children[i].gap   # already updated by the bottom-up walk below
    node.gap = max(node.gap, child_gap)

# Called after insert or remove modifies leaf L:
  Walk path from L to root.
  At each node on the path (bottom-up): recalc_gap(node)

Pivot boundary invariant: A node's gap represents the maximum contiguous free range (in pages) that fits entirely within the address span [node_min, node_max). A free range that spans a node boundary is split across child nodes — neither child records the full gap. This is acceptable: find_unmapped_area() checks child.gap >= requested_pages before descending, and if a cross-boundary free range exists, both adjacent children will have gap >= half the range, so the traversal will visit both and find the allocation.

Complexity: recalculation is O(height) = O(log n) node updates per insert/remove, where n is the number of VMAs. Each level updates exactly one node (the ancestor on the modification path).

This makes find_unmapped_area() (the core of mmap address selection) O(log n) instead of O(n): the walker descends only into subtrees with a sufficiently large gap, skipping entire branches of the tree that cannot satisfy the allocation. For a process with 10,000 VMAs, this reduces the search from ~10,000 VMA comparisons to ~4 node visits (tree height).
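The leaf-level computation from the pseudocode above can be sketched as a standalone function (a toy model: this `max_gap_in_leaf` takes the node span and a sorted slice of `[start, end)` VMA ranges directly, rather than walking real pivot arrays):

```rust
/// Maximum free gap in a leaf covering [node_min, node_max), given its
/// sorted, non-overlapping VMA spans (all values in pages).
pub fn max_gap_in_leaf(node_min: u64, node_max: u64, vmas: &[(u64, u64)]) -> u64 {
    if vmas.is_empty() {
        // Entire span is free.
        return node_max - node_min;
    }
    // Span before the first VMA.
    let mut gap = vmas[0].0 - node_min;
    // Gaps between consecutive VMAs.
    for pair in vmas.windows(2) {
        gap = gap.max(pair[1].0 - pair[0].1);
    }
    // Span after the last VMA.
    gap.max(node_max - vmas[vmas.len() - 1].1)
}
```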

4.1.5.2 Page Fault Handler

The page fault handler is the most performance-critical path in the virtual memory subsystem — every demand-paged access, every COW fork, every compressed-page decompression, and every swap-in passes through it. The handler runs in UmkaOS Core (Tier 0) with zero domain crossings for the common case (anonymous page fault).

Fault entry:

Architecture-specific trap handlers (x86-64: #PF via IDT entry 14; AArch64: ESR_EL1 data/instruction abort; RISC-V: scause page fault exceptions) call the common entry point:

/// Top-level page fault handler, called from architecture-specific trap code.
///
/// # Arguments
/// * `addr` - Faulting virtual address (from CR2 on x86, FAR_EL1 on AArch64,
///   stval on RISC-V).
/// * `access` - Type of access that caused the fault.
/// * `user_mode` - Whether the fault occurred in user mode (Ring 3 / EL0).
///
/// # Returns
/// `Ok(())` if the fault was resolved (page installed, execution resumes).
/// `Err(FaultError)` if the fault is fatal (signal delivery or kernel panic).
pub fn handle_page_fault(
    addr: VirtAddr,
    access: AccessType,
    user_mode: bool,
) -> Result<(), FaultError>;

/// Access type that caused the page fault.
#[repr(u8)]
pub enum AccessType {
    /// Read access (load instruction).
    Read  = 0,
    /// Write access (store instruction).
    Write = 1,
    /// Instruction fetch (execute).
    Exec  = 2,
}

4.1.5.3 Page Fault Metadata by Architecture

When a page fault occurs, the hardware delivers fault information in architecture-specific registers before the trap handler can call the common handle_page_fault() entry point. UmkaOS's fault entry stubs (in umka-core/src/arch/*/mm.rs) read these registers and normalise them into the PageFaultInfo struct before dispatching to architecture-independent code:

/// Architecture-normalised page fault descriptor. Populated by the arch-specific
/// fault entry stub from hardware registers (CR2/ESR, FAR_EL1/ESR_EL1, stval/scause,
/// DAR/DSISR, DEAR/ESR) before `handle_page_fault` is called.
pub struct PageFaultInfo {
    /// Faulting virtual address.
    pub addr:  VirtAddr,
    /// True if the faulting access was a store (write).
    pub write: bool,
    /// True if the fault occurred at user privilege level (EL0 / Ring 3 / U-mode).
    pub user:  bool,
    /// True if the fault was an instruction fetch (NX / XN violation).
    pub exec:  bool,
}

The hardware registers that supply these fields differ by architecture:

Architecture Fault Address Fault Reason
x86-64 CR2 (linear address that caused the fault) Error code pushed on stack: bit 0 = present (protection fault vs. not-present), bit 1 = write, bit 2 = user mode, bit 3 = reserved-bit write, bit 4 = instruction fetch, bit 5 = protection-key violation
AArch64 FAR_EL1 (Fault Address Register, EL1) ESR_EL1: EC field = 0x21 (data abort from lower EL) or 0x20 (instruction abort from lower EL); DFSC field (Data Fault Status Code): 0b000100 = translation fault L0, 0b000101 = L1, 0b000110 = L2, 0b000111 = L3, 0b001101 = permission fault
ARMv7 DFAR (Data Fault Address Register, CP15 c6 c0 0) DFSR (Data Fault Status Register, CP15 c5 c0 0): status bits encode translation fault, permission fault, or alignment fault; WnR bit indicates write
RISC-V stval CSR (supervisor trap value) = faulting virtual address scause CSR: 12 = instruction page fault, 13 = load page fault, 15 = store/AMO page fault
PPC32 DEAR (Data Exception Address Register, SPR 61) for data; SRR0 for instruction faults ESR (Exception Syndrome Register, SPR 62): ST bit indicates store vs. load; separate instruction-access exception (IABR) for instruction faults
PPC64LE DAR (Data Address Register, SPR 19) for data faults; SRR0 for instruction faults DSISR (Data Storage Interrupt Status Register, SPR 18): bit 25 = translation fault, bit 27 = protection fault, bit 26 = store

The arch stub reads these registers in the trap entry path (before re-enabling interrupts) and fills PageFaultInfo. From that point on, all fault-handling code is architecture-independent and operates solely on PageFaultInfo, the VMA tree, and the physical allocator.
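As one concrete instance of this normalisation, the x86-64 stub's decoding of the #PF error code (bit layout from the table above: bit 1 = write, bit 2 = user, bit 4 = instruction fetch) might look like the following sketch — `decode_x86_fault` is an illustrative name, and the struct is repeated to keep the example self-contained:

```rust
/// Faulting-access descriptor, mirroring `PageFaultInfo` above
/// (addresses narrowed to u64 for the sketch).
pub struct PageFaultInfo {
    pub addr: u64,
    pub write: bool,
    pub user: bool,
    pub exec: bool,
}

/// Normalise the x86-64 #PF error code plus CR2 into the common struct.
pub fn decode_x86_fault(cr2: u64, error_code: u64) -> PageFaultInfo {
    PageFaultInfo {
        addr: cr2,
        write: error_code & (1 << 1) != 0, // bit 1: write access
        user: error_code & (1 << 2) != 0,  // bit 2: user-mode fault
        exec: error_code & (1 << 4) != 0,  // bit 4: instruction fetch
    }
}
```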

Lookup sequence:

  1. VMA lookup: vma = maple_find(current.vm_map, addr) — O(log n) under RCU read lock (Section 4.1.5.1). No write-side lock needed.

  2. No VMA found: The address is not mapped in the process's address space. Deliver SIGSEGV with si_code = SEGV_MAPERR (bad address). If user_mode is false, check the kernel exception fixup table (__ex_table) first — see kernel fault handling below.

  3. Permission check: The VMA exists but the access violates its protection bits (e.g., write to a PROT_READ-only mapping, or execute of a non-PROT_EXEC mapping). Deliver SIGSEGV with si_code = SEGV_ACCERR (protection fault).

  4. Determine fault type based on the VMA and PTE state:

Condition Fault Type Handler
PTE not present, anonymous VMA Anonymous fault Allocate zero page, install PTE
PTE not present, file-backed VMA File fault Look up page cache; if miss, submit I/O
PTE present, read-only, VMA is writable + COW Copy-on-write fault Copy page, update PTE
PTE swap entry, compressed bit set Compressed fault Decompress from ZPool (Section 4.2.5)
PTE swap entry, compressed bit clear Swap fault Read from swap device, install PTE
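The dispatch in step 4 can be sketched as a pure classification function over the table above (`PteState`, `classify`, and the boolean VMA flags are illustrative stand-ins for the real PTE and VMA types):

```rust
/// Simplified PTE state for classification.
pub enum PteState {
    NotPresent,
    Present { writable: bool },
    SwapEntry { compressed: bool },
}

#[derive(Debug, PartialEq)]
pub enum FaultType { Anon, File, Cow, Compressed, Swap }

/// Map (PTE state, VMA kind) to a fault type per the table.
/// Returns None for a spurious fault (already resolved by another CPU)
/// — permission faults were rejected earlier in the lookup sequence.
pub fn classify(pte: PteState, file_backed: bool, cow_writable: bool) -> Option<FaultType> {
    Some(match pte {
        PteState::NotPresent if file_backed => FaultType::File,
        PteState::NotPresent => FaultType::Anon,
        PteState::Present { writable: false } if cow_writable => FaultType::Cow,
        PteState::SwapEntry { compressed: true } => FaultType::Compressed,
        PteState::SwapEntry { compressed: false } => FaultType::Swap,
        PteState::Present { .. } => return None,
    })
}
```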

Anonymous fault path (most common, must be fast):

/// Handle a page fault on an anonymous (non-file-backed) VMA.
/// This is the hottest fault path: first access to malloc'd memory,
/// stack growth, and post-fork private pages all land here.
///
/// Cost: ~1-2μs (page allocation + zero-fill + PTE install + TLB invalidate).
fn handle_anon_fault(vma: &Vma, addr: VirtAddr) -> Result<(), FaultError> {
    let page = phys_alloc(GfpFlags::ZERO | GfpFlags::USER)?;
    let pte = PteFlags::PRESENT | PteFlags::USER | vma.pte_flags();
    install_pte(current_pgd(), addr, page.pfn(), pte);
    Ok(())
}

Copy-on-write (COW) fault path:

/// Handle a write fault on a copy-on-write page.
/// Occurs after fork() when a child or parent first writes to a shared page.
///
/// Optimization: if the page has only one reference (the other process has
/// already COW-faulted or exited), skip the copy and just mark writable.
/// This is critical for fork+exec patterns where the parent's pages are
/// never actually copied.
fn handle_cow_fault(
    vma: &Vma,
    addr: VirtAddr,
    old_pfn: Pfn,
) -> Result<(), FaultError> {
    if page_refcount(old_pfn) == 1 {
        // Only reference — skip copy, just set writable.
        set_pte_writable(current_pgd(), addr);
        return Ok(());
    }
    let new_page = phys_alloc(GfpFlags::USER)?;
    copy_page(new_page.pfn(), old_pfn);
    install_pte(
        current_pgd(),
        addr,
        new_page.pfn(),
        PteFlags::PRESENT | PteFlags::WRITE | PteFlags::USER,
    );
    page_deref(old_pfn); // Decrement refcount on the shared page.
    Ok(())
}

Locking protocol:

The fault path holds the mm read lock (mmap_read_lock) for VMA tree stability during the entire fault resolution. Multiple faults on the same address space can proceed concurrently (read lock allows shared access). The PTE installation uses cmpxchg (compare-and-swap on the PTE entry) to detect racing faults on the same virtual page — install the new PTE only if the slot is still empty (for anonymous faults) or still contains the expected old value (for COW faults). If the cmpxchg fails, another CPU has already resolved the fault; the current fault handler frees the speculatively allocated page and returns success.

This lock-free PTE installation avoids per-page spinlocks and allows the common case (non-overlapping faults on different pages within the same address space) to proceed with zero contention.
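A minimal model of the lock-free install, using a plain `AtomicU64` as the PTE slot (`install_pte_if_empty` is an illustrative name; the real anonymous-fault path would free the speculatively allocated page when the compare-exchange loses the race):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// An empty PTE slot (not present, no flags).
const PTE_EMPTY: u64 = 0;

/// Install `new_pte` only if the slot is still empty.
/// Returns true if this CPU won the race; false means another CPU
/// already resolved the fault (caller frees its page and returns Ok).
pub fn install_pte_if_empty(slot: &AtomicU64, new_pte: u64) -> bool {
    slot.compare_exchange(PTE_EMPTY, new_pte, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```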

Kernel fault handling:

Faults in kernel mode fall into two categories:

  1. Expected faults (during copy_to_user / copy_from_user): These access user-space memory that may be swapped out, unmapped, or protected. The kernel exception fixup table (__ex_table) maps each potentially-faulting instruction address to a fixup handler that returns -EFAULT to the syscall caller instead of panicking. The fault handler checks __ex_table before delivering a signal.

  2. Unexpected faults (kernel bug): Any kernel-mode fault at an address not in __ex_table indicates a kernel bug (NULL pointer dereference, use-after-free, stack overflow). The handler triggers a kernel oops: dumps registers, stack trace, and the faulting instruction, then panics. On debug builds, this also triggers a breakpoint for QEMU/GDB debugging.

/// Check if a kernel-mode fault has a fixup entry.
/// Returns the fixup address if found, None if this is a bug.
fn kernel_fixup(fault_ip: VirtAddr) -> Option<VirtAddr> {
    // __ex_table is a sorted array of (insn_addr, fixup_addr) pairs,
    // generated by the linker from __ex_table section entries placed
    // by the copy_to_user/copy_from_user macros.
    EX_TABLE
        .binary_search_by_key(&fault_ip, |entry| entry.insn_addr)
        .ok()
        .map(|idx| EX_TABLE[idx].fixup_addr)
}

4.1.5.4 TLB Invalidation by Architecture

TLB invalidation is one of the most architecture-specific operations in the memory subsystem. UmkaOS abstracts it behind arch::current::mm::tlb_flush_* functions, but the underlying mechanisms differ substantially across ISAs in their broadcast model, granularity, and completion semantics.

x86-64:

  • Single page (local): INVLPG [addr] — invalidates the TLB entry for one virtual address on the current CPU, for the current PCID (global translations are invalidated regardless of PCID).
  • Full flush (local): write CR3 with the NOFLUSH bit clear — flushes all non-global TLB entries for the current PCID.
  • PCID-preserving switch (preferred): write CR3 with bit 63 (NOFLUSH) set — switches address space without flushing any TLB entries (requires CR4.PCIDE = 1). This is the normal context-switch path when the target PCID is still valid.
  • Tagged shootdown: INVPCID instruction — type 0 = single-address for one PCID, type 1 = all entries for one PCID, type 2 = all entries including global, type 3 = all non-global entries. Requires CR4.PCIDE = 1.
  • Cross-CPU shootdown: send a fixed-vector IPI via the LAPIC to each CPU that may hold the stale mapping (tracked via mm_cpumask); the target CPU's IPI handler executes INVLPG or INVPCID then acknowledges.

AArch64:

  • Single page (system-wide): TLBI VAE1IS, Xt — invalidate by virtual address, EL1, Inner Shareable domain. Xt encodes VA[55:12] in bits [43:0] and optionally the ASID in bits [63:48]. The IS (Inner Shareable) suffix broadcasts the operation to all CPUs in the coherency domain — no software IPI is required.
  • ASID-scoped flush: TLBI ASIDE1IS, Xt — invalidate all TLB entries for a specific ASID. Used on context switch when the target ASID has been recycled.
  • Full flush: TLBI VMALLE1IS — invalidate all EL1 entries for all ASIDs. Reserved for extreme cases (address space teardown with many ASIDs).
  • Completion sequence: DSB ISH (ensure preceding stores to page tables are visible) → TLBI VAE1IS → DSB ISH (ensure the TLB invalidation is complete on all CPUs) → ISB (ensure subsequent instruction fetches use the new mapping).
  • No explicit IPI needed: the IS variants broadcast through the coherency domain's hardware interconnect. ARM's TLB maintenance operations with IS are the equivalent of x86's IPI + INVLPG in a single instruction.

ARMv7:

  • Single page (system-wide): MCR p15, 0, Rt, c8, c3, 3 (TLBIMVAAIS) — invalidate by modified virtual address and ASID, all CPUs.
  • Full flush (system-wide): MCR p15, 0, Rt, c8, c3, 0 (TLBIALLIS) — invalidate all TLB entries, all CPUs.
  • Completion: DSB before the MCR to ensure page table writes are visible; DSB after to ensure the invalidation has propagated; ISB to prevent stale instruction fetch.

RISC-V:

  • Single page (local): SFENCE.VMA rs1, rs2 — rs1 holds the virtual address (zero = all pages), rs2 holds the ASID (zero = all ASIDs). With both specified, invalidates the specific VA for the specific ASID on the current hart only.
  • Full flush (local): SFENCE.VMA x0, x0 — flushes all TLB entries on the current hart.
  • Cross-hart shootdown: no hardware broadcast mechanism. UmkaOS sends an IPI to each hart that may have cached the mapping (tracked via mm_cpumask), and each target hart executes SFENCE.VMA in the IPI handler. The initiating hart waits for all acknowledgments before returning.
  • Hypervisor extensions: HFENCE.VVMA (invalidate guest virtual TLB entries) and HFENCE.GVMA (invalidate guest physical TLB entries), used by the RISC-V KVM path (Section 14).

PPC32 / PPC64LE:

  • Single page (local): tlbiel (invalidate local, BookS) — invalidates the TLB entry for the given effective address on the current CPU only.
  • Single page (global): tlbie (invalidate entry, BookS) — on POWER hardware, tlbie is automatically broadcast to all CPUs by the hardware interconnect. No software IPI is needed for cross-CPU shootdown on shared-memory multiprocessors.
  • Completion: ptesync before tlbie to ensure page table stores are visible; tlbsync + sync after tlbie to ensure the invalidation has completed on all CPUs.
  • BookE (PPC32 embedded): uses tlbivax (invalidate by address) + tlbsync instead of tlbie.

UmkaOS shootdown protocol summary:

Architecture Cross-CPU TLB broadcast Software IPI needed
x86-64 None (no hardware broadcast) Yes — LAPIC IPI to each CPU in mm_cpumask
AArch64 TLBI *IS broadcasts via interconnect No — hardware handles it
ARMv7 TLBIALLIS / TLBIMVAAIS broadcast via interconnect No — hardware handles it
RISC-V None (no hardware broadcast) Yes — IPI to each hart in mm_cpumask
PPC64LE tlbie broadcasts via interconnect No — hardware handles it
PPC32 BookE None Yes — IPI required

On architectures with hardware broadcast (AArch64, PPC64), UmkaOS still tracks mm_cpumask for lazy TLB mode (Section 4.1.6.1) but does not send IPIs for TLB invalidation — the hardware performs the broadcast through its coherency fabric. On x86 and RISC-V, mm_cpumask is used both to determine which CPUs need IPIs and to detect lazy TLB mode. The initiating CPU updates the affected PTEs, sends IPIs to the target set, and waits for acknowledgment (via a per-CPU completion flag) before returning to the caller.
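On x86-64 and RISC-V, the software-IPI protocol can be modelled with threads standing in for target CPUs (a sketch only: `shootdown` is an illustrative name, a shared counter stands in for the per-CPU completion flags, and the "IPI handler" body is where INVLPG or SFENCE.VMA would execute):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Initiator side of a software TLB shootdown: signal each target,
/// then spin until every target has flushed and acknowledged.
/// Returns the number of acknowledgments collected.
pub fn shootdown(targets: usize) -> usize {
    let acks = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..targets)
        .map(|_| {
            let acks = Arc::clone(&acks);
            thread::spawn(move || {
                // "IPI handler": the local TLB invalidation runs here,
                // then the target acknowledges.
                acks.fetch_add(1, Ordering::Release);
            })
        })
        .collect();
    // Initiator must not return until all targets have flushed.
    while acks.load(Ordering::Acquire) < targets {
        std::hint::spin_loop();
    }
    for h in handles {
        h.join().unwrap();
    }
    acks.load(Ordering::Acquire)
}
```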

4.1.6 PCID / ASID Management

Process Context IDentifiers tag TLB entries with a process identifier, avoiding TLB flushes on context switches. The mechanism is architecture-specific but the policy is shared:

Architecture Mechanism ID Width Usable IDs Register
x86-64 PCID 12 bits 4096 CR3[11:0]
AArch64 ASID 8 or 16 bits 256 or 65536 TTBR0_EL1[63:48] or TCR_EL1.AS=1 for 16-bit
ARMv7 ASID (CONTEXTIDR) 8 bits 256 CONTEXTIDR[7:0]
RISC-V ASID 0-16 bits (WARL) up to 65536 satp[59:44] (same position for Sv39, Sv48, Sv57; width is implementation-defined and discovered at boot — 0 bits means no ASIDs and a full TLB flush on every context switch)
PPC32 PID 8 bits 256 PID SPR (via mtspr)
PPC64LE PID (Radix) / LPIDR (HPT) 20 bits (Radix) / 12 bits (HPT) ~1M (Radix) / 4096 (HPT) PIDR SPR (POWER9+) / LPIDR (POWER8)

Common policy across all architectures:

  • LRU-based ID allocation: the least-recently-used ID is evicted when all slots are full. x86 has 4096 slots; UmkaOS uses the full space with LRU eviction.
  • TLB flush avoidance: a context switch to a process that still holds a valid ID requires zero TLB flushes.
  • Isolation domain switches require no ID change (same address space, different permissions — MPK/POE/DACR operate independently of address space IDs; page-table isolation on RISC-V uses separate page-table mappings within the same address space).
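The shared LRU policy can be sketched independently of any architecture (`IdAllocator` is an illustrative name; a real implementation would also record which address space owns each ID and flush that ID's tagged TLB entries on eviction):

```rust
use std::collections::VecDeque;

/// Minimal LRU allocator for PCID/ASID slots.
pub struct IdAllocator {
    /// IDs in LRU order: front = least recently used.
    lru: VecDeque<u16>,
}

impl IdAllocator {
    /// All IDs start free, in numeric order.
    pub fn new(nr_ids: u16) -> Self {
        Self { lru: (0..nr_ids).collect() }
    }

    /// Mark `id` most recently used (context switch to its owner).
    pub fn touch(&mut self, id: u16) {
        if let Some(pos) = self.lru.iter().position(|&x| x == id) {
            self.lru.remove(pos);
            self.lru.push_back(id);
        }
    }

    /// Take the least-recently-used ID for a new address space.
    /// The evicted ID's previous owner must flush before reuse.
    pub fn alloc(&mut self) -> u16 {
        let id = self.lru.pop_front().unwrap();
        self.lru.push_back(id); // newly assigned ID is now MRU
        id
    }
}
```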

Architecture-specific notes:

  • AArch64: 16-bit ASID (TCR_EL1.AS=1) requires ID_AA64MMFR0_EL1.ASIDBits == 0b0010 (available on most ARMv8.2+ cores; some ARMv8.0 cores only support 8-bit ASIDs). UmkaOS discovers the ASID width at boot and falls back to 8-bit with more frequent TLB flushes. A 16-bit ASID gives 65536 IDs — effectively unlimited. An 8-bit ASID (256 IDs) requires more aggressive LRU eviction. TLB maintenance uses TLBI ASIDE1IS for targeted invalidation by ASID.
  • ARMv7: Only an 8-bit ASID via CONTEXTIDR, limiting the system to 256 concurrent address spaces. TLB maintenance uses MCR p15, 0, Rd, c8, c7, 2 (TLBIASID). The DACR domain mechanism (Section 10.2) is orthogonal to the ASID — domain switches never invalidate the TLB.
  • RISC-V: The ASID occupies bits [63:60]=mode, [59:44]=ASID in the satp CSR (same position for Sv39, Sv48, and Sv57). The ASID width is implementation-defined and discovered at boot (by writing all-ones to satp.ASID and reading back). SiFive U74: 9-bit (512 IDs). Other implementations may support up to 16 bits. TLB flush via sfence.vma with an ASID argument for targeted invalidation.

RISC-V ASID availability: The RISC-V specification allows implementations with an ASID width of 0 (the satp.ASID field is WARL — writes of non-zero values may be ignored). UmkaOS detects the available ASID width at boot by writing all-ones to satp.ASID and reading back the value. If the ASID width is 0, UmkaOS falls back to a full TLB flush on every context switch (issuing sfence.vma with rs1=x0, rs2=x0 after every satp write). This is functionally correct but incurs higher context-switch overhead on ASID-less implementations. The ASID width is stored in a boot-time constant (RISCV_ASID_BITS: u32) and checked by the TLB management code to select between ASID-tagged invalidation (sfence.vma with an ASID) and global invalidation (sfence.vma with rs1=x0, rs2=x0).

  • PPC32: An 8-bit PID via the PID SPR, supporting 256 concurrent address spaces. TLB management uses tlbie (invalidate by effective address) or tlbia (invalidate all). The 16 segment registers provide an orthogonal isolation mechanism independent of the PID.
  • PPC64LE: On POWER9+ with the Radix MMU, a 20-bit PID via the PIDR SPR supports up to ~1M concurrent address spaces — effectively unlimited. On POWER8 with HPT, the 12-bit LPIDR provides 4096 logical partition IDs. TLB management uses tlbie targeted by PID (Radix) or by LPID (HPT).

4.1.6.1 Lazy TLB Mode for Kernel Threads

When the scheduler context-switches to a kernel thread (kworker, ksoftirqd, RCU callback thread) that has no user-space address space (mm == NULL), a TLB flush is wasteful — the kernel thread will never access user-space addresses, so the previous process's TLB entries can remain loaded harmlessly. Flushing them only to reload them on the next switch back to a user-space process wastes hundreds of cycles (full TLB invalidation: ~200-1000 cycles depending on TLB size and architecture).

UmkaOS implements lazy TLB mode (matching Linux's approach):

  • Enter lazy mode: When switching to a kernel thread, the scheduler does not write a new value to CR3 (x86), TTBR0_EL1 (AArch64), or satp (RISC-V). The previous process's page tables remain loaded, and its PCID/ASID remains active. The kernel thread runs exclusively in the kernel half of the address space (TTBR1 on AArch64, high-half on x86), which is shared across all processes.

  • Exit lazy mode: When switching from a kernel thread back to a user-space process, the scheduler checks whether the target process's PCID/ASID matches what is currently loaded. If yes: zero TLB flushes (the entries are still valid). If no: normal PCID/ASID switch with targeted invalidation.

  • TLB shootdown during lazy mode: If another CPU sends a TLB shootdown IPI for the user-space address range currently loaded in lazy mode, the lazy CPU must either process the shootdown (flush the stale entries) or mark its lazy state as "needs flush on exit." UmkaOS uses the deferred approach: the lazy CPU sets a per-CPU tlb_needs_flush flag and skips the actual flush. When the CPU exits lazy mode (switches to a user process), it checks the flag and performs a full TLB flush if set. This avoids IPIs waking idle CPUs unnecessarily.

Impact: On syscall-heavy workloads with many kernel threads (web servers, database engines), lazy TLB eliminates ~30-50% of TLB flushes. The benefit is proportional to the ratio of kernel-thread context switches to total context switches.
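The deferred-flush bookkeeping reduces to two per-CPU operations (`LazyTlb` and its method names are illustrative sketches of the per-CPU state described above):

```rust
/// Per-CPU lazy-TLB state: deferred shootdown handling.
#[derive(Default)]
pub struct LazyTlb {
    /// True while this CPU runs a kernel thread with borrowed page tables.
    pub lazy: bool,
    /// Set when a shootdown arrived during lazy mode.
    pub tlb_needs_flush: bool,
}

impl LazyTlb {
    /// A shootdown IPI targets this CPU. Returns true if the flush must
    /// happen now; false means it was deferred (CPU is in lazy mode).
    pub fn on_shootdown(&mut self) -> bool {
        if self.lazy {
            self.tlb_needs_flush = true;
            false
        } else {
            true
        }
    }

    /// Switching back to a user process: returns true if a full TLB
    /// flush is required because a shootdown was deferred while lazy.
    pub fn exit_lazy(&mut self) -> bool {
        self.lazy = false;
        std::mem::replace(&mut self.tlb_needs_flush, false)
    }
}
```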

4.1.7 Memory Tagging (Hardware-Assisted)

  • ARM MTE (Memory Tagging Extension): 4-bit tags on every 16-byte granule for use-after-free and buffer overflow detection in kernel allocations. Architecturally defined in ARMv8.5-A; first implemented in ARMv9 silicon (Cortex-A510, A710, A715, X3 and later) and Neoverse V2/V3. ARMv8.x cores prior to v8.5 (Cortex-A55, A75, A76, A77, A78, X1) do not have MTE. On hardware without MTE, UmkaOS falls back to software shadow tagging (KASAN-equivalent). The kernel probes for MTE at boot via the ID_AA64PFR1_EL1.MTE field.
  • Intel LAM (Linear Address Masking): Top address bits available for metadata
  • Used in debug/development builds and optionally in production for high-security deployments

See also: Section 2.3 (Hardware Memory Safety) provides the full MTE/LAM integration design including tag-aware allocators and CHERI future-proofing. Section 14.7 (Persistent Memory) adds DAX-mapped PMEM as an additional memory tier with crash-consistency guarantees.

4.1.8 NUMA Topology and Policy

Modern servers are NUMA (Non-Uniform Memory Access): memory access latency depends on which CPU socket is requesting and which physical memory bank is being accessed. A 2-socket server has ~80ns local DRAM access and ~150ns remote access (via QPI/UPI interconnect). At 4+ sockets or with CXL-attached memory (Section 5.1), the penalty grows further. The kernel must be NUMA-aware at every level — allocation, placement, scheduling, and rebalancing.

4.1.8.1 Topology Discovery

At boot, UmkaOS parses platform firmware tables to build a NUMA distance matrix:

Architecture Firmware Source Tables Parsed
x86-64 ACPI SRAT (System Resource Affinity), SLIT (System Locality Information), HMAT (Heterogeneous Memory Attributes)
AArch64 ACPI or Device Tree SRAT/SLIT (ACPI servers), numa-node-id property (DT-based SoCs)
ARMv7 Device Tree numa-node-id property (rare; most ARMv7 is UMA)
RISC-V 64 Device Tree numa-node-id property per memory and CPU node
PPC32 Device Tree numa-node-id property (rare; most PPC32 is UMA)
PPC64LE Device Tree ibm,associativity property per CPU and memory node (PAPR NUMA)

Topology Source Precedence

When multiple firmware sources describe topology, UmkaOS applies the following precedence order (highest to lowest authority):

  1. ACPI SRAT (System Resource Affinity Table): the authoritative source for node-to-physical-address mapping and CPU-to-node affinity. UmkaOS treats SRAT as ground truth for which physical address ranges belong to which NUMA node.

  2. ACPI HMAT (Heterogeneous Memory Attribute Table, if present): provides precise read/write bandwidth and access latency in picoseconds for each initiator/target pair. HMAT takes precedence over SLIT for performance attributes when both are present, because it is more precise.

  3. ACPI SLIT (System Locality Information Table): provides the inter-node distance matrix used for scheduling and migration cost estimates. SLIT distances are normalized per ACPI specification 6.5 §5.2.17: local distance = 10, remote distances are proportional (cross-socket is typically 20-30). SLIT is used where HMAT is absent or does not cover a particular node pair.

  4. Device Tree (/cpu-map and numa-node-id properties): used on systems without ACPI — bare-metal RISC-V, PPC32, ARMv7, and some AArch64 SoCs. On ARM systems that have both ACPI and a device tree, ACPI HMAT is preferred over PPTT for performance attributes.

  5. Single-node fallback: if none of the above sources are present, UmkaOS treats the system as a single NUMA node (distance matrix is a 1×1 matrix with value 10).

Override order: HMAT > SLIT > DTB numa-node-id > default (single-node).

The result is a symmetric distance matrix where distance[i][j] represents the relative access cost from node i to node j. Local access is always distance 10 (by ACPI convention). Cross-socket is typically 20-30. CXL-attached memory tiers (Section 5.1) appear as higher-distance NUMA nodes.

/// NUMA topology, populated at boot from SRAT/SLIT or device tree.
///
/// All arrays are dynamically sized at boot based on the number of NUMA nodes
/// discovered from firmware tables. No compile-time `MAX_NUMA_NODES` constant —
/// the kernel adapts to the hardware, following the same pattern as `PerCpu<T>`.
/// Allocation uses the boot allocator (Section 4.1.0) during early init.
///
/// On a system with CXL-attached memory (Section 5.1), each CXL memory device
/// appears as an additional NUMA node. A 4-socket server with 8 CXL memory pools
/// may have 12+ NUMA nodes. The dynamic sizing handles this without recompilation.
pub struct NumaTopology {
    /// Number of NUMA nodes discovered.
    pub nr_nodes: usize,
    /// Distance matrix: distance[i * nr_nodes + j] = relative access cost from
    /// node i to node j. distance[i * nr_nodes + i] = 10 (local). Higher = slower.
    /// Allocated as a flat array of size nr_nodes * nr_nodes from the boot allocator.
    ///
    /// Uses `u16` to accommodate both SLIT (0-255 range) and HMAT (0-65535 range)
    /// representations. ACPI HMAT (Heterogeneous Memory Attribute Table) uses
    /// values 0-65535 for memory latency/bandwidth attributes, which exceeds the
    /// u8 range of SLIT. SLIT-sourced distances are scaled by the HMAT
    /// normalization factor when both tables are present.
    pub distance: &'static [u16],
    /// Per-node memory ranges (physical address start, length).
    /// Allocated as an array of nr_nodes entries from the boot allocator.
    /// Each entry holds up to 4 memory ranges (ArrayVec stores them inline, no heap).
    pub node_mem: &'static [ArrayVec<MemRange, 4>],
    /// Per-node CPU sets.
    pub node_cpus: &'static [CpuMask],
}

impl NumaTopology {
    /// Look up the distance between two NUMA nodes.
    pub fn distance(&self, from: NumaNodeId, to: NumaNodeId) -> u16 {
        self.distance[from.0 as usize * self.nr_nodes + to.0 as usize]
    }
}

4.1.8.2 Memory Allocation Policy

Per-process and per-VMA memory policies control which NUMA nodes are used for page allocation. These are set via the set_mempolicy(2) and mbind(2) syscalls (Linux-compatible):

Policy Behavior Typical Use
MPOL_DEFAULT Allocate on the faulting CPU's local node General-purpose (default)
MPOL_BIND Restrict allocation to specified node set; OOM if all are full Database buffer pools, pinned workloads
MPOL_INTERLEAVE Round-robin page allocation across specified nodes Hash tables, large shared mappings
MPOL_PREFERRED Try specified node first, fall back to others if full Soft affinity
MPOL_LOCAL Always the local node (explicit, not inherited) Latency-sensitive paths

MPOL_INTERLEAVE distributes pages across nodes at page granularity (4KB or 2MB for huge pages), amortizing bandwidth across all memory controllers. This is optimal for large data structures accessed uniformly (e.g., hash maps, columnar stores).
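The round-robin step is trivial but worth pinning down (`interleave_node` is an illustrative name; the cursor would live in the VMA's policy state and advance once per allocated page):

```rust
/// MPOL_INTERLEAVE sketch: pick the next node from the allowed set,
/// round-robin at page granularity. `nodes` is the policy's node set,
/// `counter` the per-VMA interleave cursor.
pub fn interleave_node(nodes: &[u32], counter: &mut u64) -> u32 {
    let node = nodes[(*counter % nodes.len() as u64) as usize];
    *counter += 1; // the next page goes to the next node in the set
    node
}
```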

4.1.8.3 Automatic NUMA Balancing

UmkaOS implements automatic NUMA balancing (same approach as Linux's numa_balancing):

Disable option: Automatic NUMA balancing can be disabled at boot (umka.numa_balancing=0) or at runtime (/proc/sys/kernel/numa_balancing=0). This is appropriate for workloads that are already properly pinned via cpuset cgroups or numactl --membind, where all memory is allocated on the correct NUMA node from the start. On these workloads, the NUMA scanner's periodic page table scanning (clearing present bits to induce NUMA faults) adds measurable overhead:

  • Each scanned page incurs a minor page fault (~1-5μs including TLB shootdown)
  • The scanner runs every 1-30 seconds (adaptive), touching thousands of PTEs per scan
  • For hard real-time workloads on isolcpus cores, NUMA-fault-induced jitter is unacceptable — disable NUMA balancing and pin memory explicitly

Default: enabled (matches Linux). Auto-disabled on single-node systems (no benefit).

  1. Scan: A periodic scanner walks process page tables and clears the present bit on a fraction of pages (making them trigger faults on next access). Scan rate is adaptive: faster for processes with high cross-node access, slower for well-placed processes.
  2. Trap: When a task accesses a not-present page, the NUMA fault handler records which CPU (and thus which NUMA node) caused the fault.
  3. Decide: A cost-benefit analysis compares the expected savings from reduced remote access latency against the migration cost. Migration cost depends on scope: local NUMA migration (same socket, memcpy + TLB shootdown): ~200-500 ns per 4KB page; cross-socket migration (cache-coherent interconnect latency included): ~1-10 μs per page; RDMA-based DSM migration (cross-node, see Section 5.x): ~2-50 μs depending on distance. The page_migration_cost_ns() function (below) returns the empirically calibrated value for the specific source/destination pair. Migration proceeds only if the net benefit is positive over a configurable window (default: 10 accesses saved per migration cost).
  4. Migrate: The page is migrated to the accessing CPU's node. During migration the PTE is updated atomically under the page table lock — the application sees no inconsistency.

Scan rate adaptation algorithm

The scanner samples pages by setting PTEs to PROT_NONE (clearing the present bit), causing NUMA faults on the next access. The fault handler records which node accessed each page, providing the cross-node access ratio used to adapt the scan rate.

Each process maintains a numa_scan_interval field (in milliseconds), initialized to the base rate. The NUMA fault handler updates this field after each scan window completes:

  • Base rate: 1 scan per 1000 ms per process VMA.
  • Speed up: If cross-node access ratio > 20% in the last scan window, double the scan rate (halve the interval). Minimum interval: 100 ms.
  • Slow down: If cross-node access ratio < 5% for 3 consecutive scan windows, halve the scan rate (double the interval). Maximum interval: 5000 ms.
  • Caps: min_scan_interval = 100 ms, max_scan_interval = 5000 ms per VMA.
/// Per-process NUMA scan state, stored in the task's memory descriptor.
pub struct NumaScanState {
    /// Current scan interval in milliseconds. Starts at 1000 ms.
    /// Range: [MIN_SCAN_INTERVAL_MS, MAX_SCAN_INTERVAL_MS].
    pub numa_scan_interval_ms: u32,
    /// Consecutive scan windows where cross-node ratio was below 5%.
    /// Reset to 0 whenever a window exceeds the 5% threshold.
    pub low_cross_node_streak: u32,
}

const BASE_SCAN_INTERVAL_MS: u32 = 1000;
const MIN_SCAN_INTERVAL_MS: u32 = 100;
const MAX_SCAN_INTERVAL_MS: u32 = 5000;
const CROSS_NODE_HIGH_THRESHOLD_PCT: u32 = 20;
const CROSS_NODE_LOW_THRESHOLD_PCT: u32 = 5;
const LOW_STREAK_REQUIRED: u32 = 3;

/// Called by the NUMA fault handler after completing a scan window for a process.
/// `cross_node_faults` and `total_faults` are counts from the just-completed window.
pub fn update_scan_interval(state: &mut NumaScanState, cross_node_faults: u32, total_faults: u32) {
    if total_faults == 0 {
        return; // No data — keep current interval.
    }
    let ratio_pct = (cross_node_faults * 100) / total_faults;

    if ratio_pct > CROSS_NODE_HIGH_THRESHOLD_PCT {
        // High cross-node access: scan faster (halve interval).
        state.numa_scan_interval_ms = (state.numa_scan_interval_ms / 2)
            .max(MIN_SCAN_INTERVAL_MS);
        state.low_cross_node_streak = 0;
    } else if ratio_pct < CROSS_NODE_LOW_THRESHOLD_PCT {
        state.low_cross_node_streak += 1;
        if state.low_cross_node_streak >= LOW_STREAK_REQUIRED {
            // Sustained low cross-node access: scan slower (double interval).
            state.numa_scan_interval_ms = (state.numa_scan_interval_ms * 2)
                .min(MAX_SCAN_INTERVAL_MS);
            state.low_cross_node_streak = 0;
        }
    } else {
        // Between 5% and 20%: keep current rate, reset low streak.
        state.low_cross_node_streak = 0;
    }
}
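The adaptation rules can be walked through concretely. Below is a compact restatement of the update logic with the state flattened to plain values so the sketch stands alone (it mirrors update_scan_interval above; `step` is a name invented for this sketch):

```rust
const MIN_MS: u32 = 100;
const MAX_MS: u32 = 5000;

/// One scan-window update: takes the current interval, the low-cross-node
/// streak, and the window's cross-node ratio (percent); returns the new
/// (interval, streak) pair. Same policy as update_scan_interval above.
fn step(interval_ms: u32, streak: u32, cross_pct: u32) -> (u32, u32) {
    if cross_pct > 20 {
        // High cross-node access: halve the interval (scan faster).
        ((interval_ms / 2).max(MIN_MS), 0)
    } else if cross_pct < 5 {
        // Low cross-node access: double the interval after 3 such windows.
        if streak + 1 >= 3 {
            ((interval_ms * 2).min(MAX_MS), 0)
        } else {
            (interval_ms, streak + 1)
        }
    } else {
        // Between 5% and 20%: hold steady, reset the streak.
        (interval_ms, 0)
    }
}
```

Starting from the 1000 ms base rate, two windows at 40% cross-node access halve the interval twice (1000 → 500 → 250 ms); three consecutive windows below 5% then double it back to 500 ms.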
/// Estimated cost in nanoseconds to migrate a page of the given order to a
/// remote NUMA node. Accounts for TLB shootdown, page copy overhead, and
/// PTE invalidation on all CPUs mapping the page.
///
/// Measured empirically at boot time via a calibration micro-benchmark
/// (same-node copy vs cross-node copy). Stored in the NUMA distance table.
/// Values are order-dependent: a 2 MiB huge page (order 9) costs ~10-20× a
/// 4 KiB base page (order 0) due to higher copy bandwidth × more TLB entries.
///
/// Typical values: 4 KiB page → ~200-500 ns; 2 MiB page → ~3000-8000 ns.
pub fn page_migration_cost_ns(order: u32) -> u64;

/// Expected latency penalty in nanoseconds per remote memory access between
/// two NUMA nodes at the given distance (per `numa_distance()`).
///
/// `distance` is the ACPI SLIT value (10 = local, 11-254 = remote, higher =
/// farther). Per ACPI specification 6.5 §5.2.17, the local (diagonal) entry is
/// always 10; remote distances are proportionally higher (cross-socket typically
/// 20-30). The penalty is converted to nanoseconds using the empirically
/// calibrated base penalty per SLIT unit (~1-5 ns per unit on EPYC/Ice Lake).
/// Returns 0 if `distance` ≤ 10 (same node or same die cache domain).
///
/// Typical values: cross-socket (distance ~30) → ~30-50 ns; cross-NUMA
/// cluster (distance ~40) → ~50-80 ns.
pub fn remote_penalty_ns(distance: u8) -> u64;
/// NUMA balancing decision for a single page.
pub fn should_migrate_page(
    page: &Page,
    accessing_node: NumaNodeId,
    current_node: NumaNodeId,
    access_count: u32,
) -> bool {
    if accessing_node == current_node {
        return false; // Already local.
    }
    let distance = numa_distance(current_node, accessing_node);
    let migration_cost_ns = page_migration_cost_ns(page.order());
    let expected_savings_ns = access_count as u64 * remote_penalty_ns(distance);
    expected_savings_ns > migration_cost_ns * 2 // Conservative: require 2x payoff.
}
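A useful corollary of the 2× payoff rule is the break-even access count: migration pays off once access_count × remote_penalty exceeds 2 × migration_cost. A sketch of that arithmetic, using the typical numbers quoted above (the helper name and the concrete values are illustrative):

```rust
/// Smallest access count at which `should_migrate_page`'s 2x payoff rule
/// is satisfied: the first count where accesses * penalty > 2 * cost.
fn break_even_accesses(migration_cost_ns: u64, remote_penalty_ns: u64) -> u64 {
    // floor(2*cost / penalty) accesses only reach (not exceed) the bar,
    // so the first qualifying count is one more than that.
    2 * migration_cost_ns / remote_penalty_ns + 1
}
```

With a ~300 ns local migration cost and a ~40 ns cross-socket penalty, 16 recorded accesses justify migration; a ~5 μs cross-socket huge-page-scale cost at ~50 ns penalty needs just over 200.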

4.1.8.4 NUMA-Aware Kernel Allocations

The buddy allocator (Section 4.1.1) and slab allocator (Section 4.1.2) are NUMA-aware:

  • Buddy allocator: Per-NUMA-node free lists. Allocation prefers the local node; cross-node fallback uses the SLIT distance matrix to pick the nearest alternative.
  • Slab allocator: Per-node partial slab lists. Frequently allocated objects (inodes, dentries, socket buffers, capability entries) are served from node-local slabs, avoiding cross-node cache line bouncing on the hot allocation/free paths.
  • Per-CPU caches: Already NUMA-local by construction (each CPU's cache draws from its node's buddy allocator). No additional logic needed.

4.1.8.5 NUMA Balancing and Isolation Domain Memory

The NUMA scanner (automatic NUMA balancing above) must respect hardware isolation domain boundaries (Section 10.4, Tier 1 isolation):

  • Tier 1 driver memory: Pages mapped in a Tier 1 driver's isolation domain are tagged with the domain's protection key (MPK PKEY, ARM POE key, etc.). The NUMA scanner skips pages whose protection key does not match the current process's default domain — it will not clear the present bit on driver-private pages, because the resulting NUMA fault would fire in the wrong protection domain. Tier 1 driver memory is migrated only when the driver explicitly requests it via the driver_request_numa_migration() KABI call (fully specified in Section 10.5.9.4, memory_v1 KABI table — see 10-drivers.md for the complete C ABI signature, error codes, atomicity guarantees, and DMA pinning interaction), or when the driver's domain is active on the faulting CPU.

  • DMA buffers: Pages marked with PG_dma_pinned (allocated via the DMA API, Section 10.5.3.7) are unconditionally exempt from NUMA migration. Moving a DMA buffer while a device holds its physical address would cause DMA to the wrong location. The NUMA scanner checks this flag before clearing the present bit and skips pinned pages entirely.

  • Kernel-internal per-CPU structures: Per-CPU run queues, slab magazines, and PerCpu slots are allocated on their home node at boot and are never candidates for NUMA migration (they have no user-space PTE to scan).

4.1.8.6 Memory Tier Classification

UmkaOS classifies memory sources by semantic type via the TierKind enum. This allows code to reference memory tiers by purpose rather than by ordinal position (which shifts as tiers are added or removed, e.g., when distributed mode introduces remote DRAM or CXL-attached memory — see Section 5.6.1.1).

/// Semantic classification of a memory tier.
/// Used by the memory subsystem to identify tier types independent of their
/// ordinal position in the latency hierarchy.
pub enum TierKind {
    /// Local DRAM on the same NUMA node as the accessing CPU.
    LocalDram,
    /// High Bandwidth Memory (HBM) on the same package as the CPU.
    /// Found on Intel Sapphire Rapids HBM, AMD MI300, and HPC GPUs.
    /// Higher bandwidth but similar or slightly higher latency than DRAM.
    /// Distinct from LocalDram to enable bandwidth-aware placement policies.
    Hbm,
    /// Remote DRAM on a different NUMA node (same physical machine).
    RemoteDram,
    /// CXL Type-3 memory expander (no compute). Latency: 200-500 ns (vs ~80 ns for
    /// local DRAM, ~1 μs for NVMe). Bandwidth: up to 50 GB/s per device. Used for
    /// capacity expansion — cold data that does not fit in DRAM but is too hot for
    /// NVMe. Placed between `LocalDram`/`RemoteDram` and `Compressed` in the tiering
    /// hierarchy. The memory tiering subsystem promotes pages hotter than
    /// `cxl_promote_threshold` to `LocalDram` and demotes pages colder than
    /// `cxl_demote_threshold` from `LocalDram` to `CxlExpander`.
    CxlExpander,
    /// CXL Type-2 device-attached memory (compute + memory, e.g., smart NICs,
    /// inference accelerators with attached DRAM). Similar placement to `GpuVram`
    /// but for non-GPU accelerators. Managed via HMM (Heterogeneous Memory
    /// Management) like GPU VRAM — pages migrate between CPU DRAM and device memory
    /// under HMM control.
    CxlDeviceMemory,
    /// GPU VRAM (accessible via BAR or unified memory).
    GpuVram,
    /// Persistent memory (NVDIMM, Intel Optane DCPMM, CXL PMEM).
    PersistentMem,
    /// Remote node DRAM via DSM (distributed shared memory, Section 5.6).
    DsmRemote,
    /// Compressed in-memory pages (zswap/zram tier).
    Compressed,
    /// Local swap (block device backed).
    Swap,
}

Tiering hierarchy (best to worst latency/performance):

LocalDram → RemoteDram → CxlExpander → CxlDeviceMemory → GpuVram → Compressed → Swap → DsmRemote

CxlExpander sits between local DRAM and the compressed tier: it is slower than DRAM (200-500 ns vs ~80 ns) but faster than decompression (~1-2 μs) or NVMe swap (~10 μs). CxlDeviceMemory is grouped with device-attached memory (alongside GpuVram) because its latency and bandwidth are device-specific and managed by the device driver rather than by the general tiering policy.

The memory subsystem maintains a runtime mapping from TierKind to the current ordinal tier number. Code that needs to compare tier performance uses TierKind and queries mem::tier_ordinal(kind) rather than hardcoding numeric values.
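A sketch of that mapping (the ordinal values follow the hierarchy listed above; the static match body is illustrative — in the kernel the table is built at boot from detected hardware, and variants outside the default ordering such as Hbm and PersistentMem are omitted here for brevity):

```rust
enum TierKind {
    LocalDram,
    RemoteDram,
    CxlExpander,
    CxlDeviceMemory,
    GpuVram,
    Compressed,
    Swap,
    DsmRemote,
}

/// Ordinal position in the current tiering hierarchy (lower = faster).
/// Callers compare ordinals rather than hardcoding tier numbers, so the
/// ordering can shift when tiers are added or removed at runtime.
fn tier_ordinal(kind: TierKind) -> u32 {
    match kind {
        TierKind::LocalDram => 0,
        TierKind::RemoteDram => 1,
        TierKind::CxlExpander => 2,
        TierKind::CxlDeviceMemory => 3,
        TierKind::GpuVram => 4,
        TierKind::Compressed => 5,
        TierKind::Swap => 6,
        TierKind::DsmRemote => 7,
    }
}
```

Code that must decide "is tier A faster than tier B" compares ordinals, e.g. `tier_ordinal(TierKind::LocalDram) < tier_ordinal(TierKind::Compressed)`.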


4.2 Memory Compression Tier

Inspired by: macOS (2013), Windows 10 (2015), Linux zswap/zram. IP status: Clean — academic concept from the 1990s, BSD-licensed algorithms.

4.2.1 Problem

When the system is under memory pressure, the page reclaim path must free pages. The options in Section 4.1.3 are:

  1. Evict clean page cache pages (free, but re-read from disk on next access)
  2. Write dirty pages to swap (expensive: NVMe ~10 μs per 4KB, HDD ~5 ms)

ML tuning: Reclaim behavior parameters (reclaim_aggressiveness, prefetch_window_pages, compress_entropy_threshold, numa_migration_threshold, swap_local_ratio) are registered in the Kernel Tunable Parameter Store. The umka-ml-numa and umka-ml-compress Tier 2 services observe page fault and eviction events to tune these parameters per-cgroup at runtime. See Section 22.1.5 for the complete parameter catalog and observation types.

There is a third option, cheaper than swap: compress the page in memory. Modern CPUs running LZ4 compress a 4KB page in ~1-2 microseconds. If the page compresses to under 2KB (typical for most workloads), the original 4KB frame is freed and the compressed copy occupies less than 2KB in the pool — a net gain of at least 2KB with no I/O. Decompression on access is ~0.5 microseconds.

This is 5-10x faster than NVMe swap and 1000x faster than HDD swap.

4.2.2 Architecture

Insert a compressed tier between the LRU inactive list and swap:

Per-CPU Page Caches (hot path)
    |
    v
Per-NUMA Buddy Allocator
    |
    v
Page Cache (RCU radix tree)
    |
    v
LRU Active List --evict--> LRU Inactive List
                                |
                    +-----------+-----------+
                    |                       |
              [compress]              [swap out]
                    |                  (existing)
                    v
            Compressed Pool             Swap Device
            (zpool in memory)           (NVMe/HDD)
                    |
              [decompress]
                    |
                    v
              Page restored
              to active LRU

4.2.3 Compressed Page Pool

The ZPool uses three support containers (BootVec, FixedHashTable, FreeList) and one key type (CompressedPageKey), defined here before the main struct:

BootVec<T> — boot-time fixed-capacity vector:

/// A `Vec`-like container backed by the boot allocator. Allocated during early
/// boot before the slab allocator is available. Capacity is fixed at creation
/// and never reallocated — this is critical because `BootVec` is used for
/// structures that must remain stable under memory pressure (ZPool regions,
/// NUMA topology arrays). After boot completes, `BootVec` contents are
/// typically used read-only (new entries fill pre-allocated slots but the
/// backing allocation never moves).
///
/// Allocated via `boot_alloc(size_of::<T>() * cap)` from the boot allocator
/// (Section 4.1.1). Panics if the boot allocator cannot satisfy the request
/// (fatal: the kernel cannot proceed without these structures).
pub struct BootVec<T> {
    /// Pointer to the first element. Allocated from the boot allocator.
    /// The allocation is `cap * size_of::<T>()` bytes, aligned to `align_of::<T>()`.
    ptr: *mut T,
    /// Number of initialized elements. Invariant: `len <= cap`.
    len: usize,
    /// Maximum capacity (fixed at creation, never changes).
    cap: usize,
}
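The push and access paths follow directly from the fixed-capacity invariant. A user-space sketch (std::alloc stands in for the boot allocator; `with_capacity`, `push`, and `get` are illustrative helpers, not part of the spec above):

```rust
use std::alloc::{alloc, Layout};

/// User-space sketch of BootVec. The backing allocation is made once and,
/// mirroring the boot allocator's no-free semantics, never released or moved.
struct BootVec<T> {
    ptr: *mut T,
    len: usize,
    cap: usize,
}

impl<T> BootVec<T> {
    fn with_capacity(cap: usize) -> Self {
        assert!(cap > 0);
        let layout = Layout::array::<T>(cap).expect("capacity overflow");
        let ptr = unsafe { alloc(layout) } as *mut T;
        // In the kernel, boot allocation failure is fatal; panic mirrors that.
        assert!(!ptr.is_null(), "allocation failed");
        BootVec { ptr, len: 0, cap }
    }

    /// Append one element into the next pre-allocated slot. Panics when full:
    /// capacity is fixed at creation and the backing store never reallocates.
    fn push(&mut self, value: T) {
        assert!(self.len < self.cap, "BootVec full: fixed capacity, no realloc");
        unsafe { self.ptr.add(self.len).write(value) };
        self.len += 1;
    }

    fn get(&self, i: usize) -> &T {
        assert!(i < self.len);
        unsafe { &*self.ptr.add(i) }
    }

    fn len(&self) -> usize {
        self.len
    }
}
```

Because the pointer never changes, references handed out by `get` remain valid for the kernel's lifetime — the property the ZPool region array and NUMA topology tables rely on.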

FixedHashTable<K, V> — pre-allocated open-addressed hash table:

/// Pre-allocated, fixed-capacity open-addressed hash table with linear probing.
/// Never resizes — capacity is set at init time and the backing array is
/// allocated once (via `vmalloc` for large tables, boot allocator for small ones).
///
/// Designed for use under memory pressure where a growable `HashMap` would fail
/// to resize at the worst possible moment. The caller must check `count < capacity`
/// before insertion; inserting into a full table panics (this is a kernel bug,
/// not a recoverable error — the capacity calculation must account for maximum
/// occupancy).
///
/// Hash function: `FxHash` (fast, non-cryptographic). The ZPool index is not
/// security-sensitive (compressed page lookup, not user-facing), so collision
/// resistance is not required — only speed and distribution quality matter.
pub struct FixedHashTable<K, V> {
    /// Backing array of slots. Each slot is `None` (empty) or `Some((key, value))`.
    /// Allocated as a contiguous virtual allocation of `capacity` entries.
    entries: *mut [Option<(K, V)>],
    /// Total number of slots (fixed at init).
    capacity: usize,
    /// Number of occupied slots. Invariant: `count <= capacity`.
    count: usize,
    /// Bitmask for index computation: `capacity - 1` (capacity is always a
    /// power of two for fast modular arithmetic via bitwise AND).
    mask: usize,
}
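The probe sequence is worth seeing concretely. A standalone sketch of insert and lookup with linear probing (a safe Vec backing stands in for the kernel's raw vmalloc allocation, and an identity hash stands in for FxHash; u64 keys/values keep the sketch generic-free):

```rust
/// Simplified FixedHashTable: power-of-two capacity, linear probing,
/// no resize. Insertion into a full table panics, as in the spec above.
struct FixedTable {
    slots: Vec<Option<(u64, u64)>>,
    mask: usize,
    count: usize,
}

impl FixedTable {
    fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        FixedTable { slots: vec![None; capacity], mask: capacity - 1, count: 0 }
    }

    fn insert(&mut self, key: u64, val: u64) {
        assert!(self.count < self.slots.len(), "table full: kernel bug");
        let mut i = (key as usize) & self.mask; // identity hash stand-in
        loop {
            match &self.slots[i] {
                Some((k, _)) if *k == key => {
                    self.slots[i] = Some((key, val)); // update in place
                    return;
                }
                Some(_) => i = (i + 1) & self.mask, // linear probe
                None => {
                    self.slots[i] = Some((key, val));
                    self.count += 1;
                    return;
                }
            }
        }
    }

    fn get(&self, key: u64) -> Option<u64> {
        let mut i = (key as usize) & self.mask;
        loop {
            match &self.slots[i] {
                Some((k, v)) if *k == key => return Some(*v),
                Some(_) => i = (i + 1) & self.mask,
                None => return None, // empty slot terminates the probe chain
            }
        }
    }
}
```

Note that removal (needed when a page is decompressed and its index entry deleted) additionally requires tombstones or backward-shift deletion so probe chains are not broken; that machinery is omitted from this sketch.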

FreeList — intrusive singly-linked free list for ZPool region management:

/// Intrusive singly-linked free list for managing free space within ZPool
/// regions. Each free region embeds a `FreeListNode` at its start address
/// (the free memory itself stores the metadata — zero overhead when the
/// region is in use).
///
/// Allocation: first-fit — walk the list, find the first node with
/// `size >= request`, split if the remainder is large enough to hold a
/// `FreeListNode` (16 bytes minimum). O(n) worst case, but ZPool regions
/// are large (64KB-256KB each) so the free list per region is short
/// (typically <10 entries even under heavy fragmentation).
///
/// Deallocation: insert at head and coalesce with adjacent free regions
/// if they are contiguous in memory (address-ordered coalescing).
pub struct FreeList {
    /// Head of the free list, or null if no free space remains.
    pub head: *mut FreeListNode,
}

/// Node embedded at the start of each free region. The node occupies the
/// first 16 bytes of the free region itself (intrusive — no separate
/// allocation needed).
pub struct FreeListNode {
    /// Pointer to the next free region, or null if this is the last.
    pub next: *mut FreeListNode,
    /// Size of this free region in bytes (including the FreeListNode header).
    pub size: usize,
}
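The first-fit split rule can be sketched safely with indices: the kernel version is intrusive (nodes live inside the free memory itself), but a Vec of (offset, size) pairs shows the same allocation policy (`first_fit` is a name invented for this sketch):

```rust
/// FreeListNode header size: a remainder smaller than this cannot hold
/// a node, so the whole region is handed out instead of splitting.
const MIN_NODE: usize = 16;

/// First-fit within one region: take from the first free range big enough.
/// Returns (offset, granted_size); the grant may exceed the request when an
/// unsplittable tail is absorbed.
fn first_fit(free: &mut Vec<(usize, usize)>, request: usize) -> Option<(usize, usize)> {
    for i in 0..free.len() {
        let (off, size) = free[i];
        if size >= request {
            let remainder = size - request;
            if remainder >= MIN_NODE {
                // Split: the tail stays on the free list.
                free[i] = (off + request, remainder);
                return Some((off, request));
            } else {
                // Tail too small to hold a FreeListNode: grant it all.
                free.remove(i);
                return Some((off, size));
            }
        }
    }
    None // no free range large enough in this region
}
```

Deallocation (insert at head plus address-ordered coalescing of adjacent ranges) is the inverse operation and is omitted here.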

CompressedPageKey — handle to a compressed page in the zpool:

/// Key identifying a compressed page in the zpool.
/// Encodes the zpool slab address and the byte offset within the slab
/// of the compressed block. The layout is:
/// - bits 63:20 — slab base address (slabs are 1 MiB-aligned, so the low
///   20 bits of the base are zero; 44 significant bits)
/// - bits 19:0  — byte offset within the slab (max 1 MB slabs)
///
/// Zero is the null sentinel (no valid compressed block has address 0
/// because the allocator never returns NULL).
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct CompressedPageKey(pub u64);

impl CompressedPageKey {
    pub const NULL: Self = CompressedPageKey(0);
    pub fn is_null(&self) -> bool { self.0 == 0 }
    pub fn slab_addr(&self) -> u64 { self.0 & !0xFFFFF }
    pub fn byte_offset(&self) -> u32 { (self.0 & 0xFFFFF) as u32 }
}
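A round-trip check of the encoding (the struct and accessors are restated so the check stands alone; `pack` is an assumed helper, not part of the spec above — note the masking is lossless only because slab bases are 1 MiB-aligned):

```rust
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct CompressedPageKey(u64);

impl CompressedPageKey {
    /// Pack a 1 MiB-aligned slab base and a byte offset below 1 MiB
    /// into one u64 (assumed helper for this sketch).
    fn pack(slab_base: u64, offset: u32) -> Self {
        debug_assert_eq!(slab_base & 0xFFFFF, 0, "slab base must be 1 MiB-aligned");
        debug_assert!(offset < (1 << 20), "offset exceeds 1 MiB slab");
        CompressedPageKey(slab_base | offset as u64)
    }

    fn slab_addr(&self) -> u64 {
        self.0 & !0xFFFFF // clear the 20 offset bits, leaving the base
    }

    fn byte_offset(&self) -> u32 {
        (self.0 & 0xFFFFF) as u32
    }

    fn is_null(&self) -> bool {
        self.0 == 0
    }
}
```

Packing base 0x1230_0000 with offset 0x4_2000 and reading both fields back demonstrates the round trip; key 0 remains the null sentinel.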

ZPool struct:

// Kernel-internal (umka-core/src/mem/zpool.rs)

/// A pool of compressed pages stored in contiguous memory.
pub struct ZPool {
    /// Backing memory regions (allocated from buddy allocator).
    /// Each region is a contiguous allocation (order 4-6, 64KB-256KB).
    /// Fixed-capacity array sized at init time (max_regions = max_pool_bytes /
    /// min_region_size). Like the index table, the region array never grows after
    /// init — new regions are allocated from the buddy allocator and placed into
    /// pre-allocated slots. This avoids Vec reallocation under memory pressure,
    /// which is the exact scenario where zpool is most active.
    regions: BootVec<ZPoolRegion>,

    /// Index: maps (address_space_id, page_offset) -> compressed location.
    /// Pre-allocated open-addressed hash table with linear probing and a fixed
    /// capacity determined at init time: `capacity = max_size / PAGE_SIZE * 4 / 3`
    /// (load factor ~0.75 target). The table is backed by a virtually-contiguous
    /// allocation (`vmalloc`-style: contiguous in kernel virtual address space,
    /// physically scattered across individual pages) obtained during `ZPool::init()`
    /// before memory pressure exists. On a 256 GB server with `max_pool_percent=25`,
    /// the table may reach ~28M entries (~900 MB) — far exceeding the buddy
    /// allocator's maximum contiguous order (4 MB). Using vmalloc avoids this
    /// constraint while keeping O(1) index arithmetic via contiguous virtual
    /// addresses. The table never grows or reallocates — under severe memory
    /// pressure (the exact scenario where zpool is most active), a growable HashMap
    /// would fail to resize, causing the compression tier to fail precisely when
    /// it is needed most.
    ///
    /// **Performance note**: Linear probing degrades rapidly above ~70% load factor
    /// due to primary clustering. The 75% load factor target (capacity = max_entries × 4/3)
    /// keeps the table near but below the degradation threshold at maximum occupancy,
    /// with the reclaim path throttle (see below) ensuring the table never actually
    /// reaches 75% load. For hot-path hash tables (e.g., PFN lookup), use Robin Hood
    /// hashing or cuckoo hashing for better worst-case behavior. The ZPool index
    /// tolerates linear probing because lookups are on the page-fault path (~1-2μs
    /// total), where the hash table probe cost is a small fraction of the overall
    /// decompression latency.
    ///
    /// **Probe count at 75% load**: At 75% load, linear probing expects an average of
    /// **~2.5 probes per lookup** (theoretical: 1/(2(1-0.75)²) ≈ 2.5). This is the
    /// design target — acceptable for compression metadata access, which is not on
    /// the per-packet hot path. If pool load exceeds **80%** (detectable when probe
    /// count exceeds 5 on average, or when `fill_count / capacity > 0.80`), the pool
    /// triggers a background **compaction sweep** to reclaim fragmented free slots:
    ///
    /// ```
    /// ZPool compaction sweep:
    /// 1. Allocate a new ZPool of the same capacity.
    /// 2. Copy all live (non-empty) entries from old pool to new pool.
    /// 3. Atomically swap old and new pool pointers (under the pool's RwLock).
    /// 4. Free the old pool.
    /// Cost: O(capacity) memcpy. Triggered at most once per minute to avoid
    /// thrashing. Logged to observability as ZPoolCompaction event.
    /// ```
    ///
    /// **Capacity exhaustion handling**: The 4/3 sizing factor ensures the index has
    /// capacity for more entries than the pool can ever hold under `max_pool_percent`:
    /// at maximum pool occupancy, `pages_stored == max_size / PAGE_SIZE`, while the
    /// index capacity is `max_size / PAGE_SIZE * 4 / 3` — 33% larger. The index
    /// therefore never exhausts before the pool limit is reached. ZPool reaches
    /// `max_pool_percent` before the index reaches capacity, so `ZPoolError::IndexFull`
    /// is a defence-in-depth guard, not the normal control path. The primary control is
    /// the reclaim path throttle, which stops compressing when
    /// `pages_stored >= index_capacity * 3 / 4` (i.e., at `max_size / PAGE_SIZE`),
    /// keeping the index load factor at or below 75% and probe counts acceptable.
    index: FixedHashTable<CompressedPageKey, CompressedEntry>,

    /// Free space tracker (first-fit allocator within regions).
    free_list: FreeList,

    /// Statistics.
    stats: ZPoolStats,

    /// Compression algorithm (compile-time selected).
    /// LZ4 is default: ~1-2us compress, ~0.5us decompress per 4KB.
    algorithm: CompressionAlgorithm,

    /// Maximum pool size (fraction of total RAM, default 25%).
    max_size: usize,

    /// Current pool size.
    current_size: AtomicUsize,

    /// NUMA-aware: one ZPool per NUMA node.
    numa_node: NumaNodeId,
}

#[derive(Clone, Copy)]
pub struct CompressedEntry {
    /// Region index within the ZPool. u32 supports up to 4 billion regions;
    /// with 256KB regions, that is ~1 PB of ZPool capacity — well beyond any
    /// foreseeable server. (u16 would limit to 65536 × 256KB = 16GB, which
    /// overflows on servers with >64GB RAM at 25% ZPool allocation.)
    region: u32,
    /// Offset within region (max 256KB region, so u32 is generous).
    offset: u32,
    /// Compressed size in bytes (at the default 1.5x threshold, pages exceeding
    /// ~2730 bytes are rejected; the field itself is u16, max 65535,
    /// accommodating any compression policy).
    compressed_size: u16,
    /// Original page checksum (CRC32C, for integrity verification on
    /// decompression). CRC32C is used because most supported architectures
    /// provide hardware acceleration: the x86 SSE4.2 `crc32` instruction and
    /// the ARMv8 CRC extension; RISC-V falls back to a software implementation
    /// (no dedicated CRC instruction).
    /// By the birthday paradox, expected collisions ≈ k²/(2n) for k pages and
    /// n=2^32 hash space. For k=1,000,000 compressed pages: 10^12/(2×4,294,967,296)
    /// ≈ 116 expected collisions — each causing a single process crash on
    /// decompression of corrupt data, not kernel-wide data loss. This is an
    /// intentional tradeoff for compressed swap; the storage layer provides
    /// per-block checksums (Section 13.1) for stronger integrity where needed.
    checksum: u32,
}

pub struct ZPoolStats {
    /// Pages stored in compressed form.
    pub pages_stored: AtomicU64,
    /// Total compressed bytes (sum of compressed sizes).
    pub compressed_bytes: AtomicU64,
    /// Total original bytes (pages_stored * PAGE_SIZE).
    pub original_bytes: AtomicU64,
    // The compression ratio is not stored; it is derived on demand:
    // ratio = original_bytes / compressed_bytes (typical: 2.5-4.0x).

    /// Pages rejected (incompressible — ratio < 1.5x).
    pub pages_rejected: AtomicU64,
    /// Pages writeback to swap (pool full or evicted from pool).
    pub pages_writeback: AtomicU64,
    /// Decompressions (page faults on compressed pages).
    pub decompressions: AtomicU64,
}

4.2.4 Compression Policy

Not every page should be compressed. The policy:

pub struct CompressionPolicy {
    /// Minimum compression ratio to accept a page, encoded as
    /// ratio * 100 (fixed-point to avoid FPU use in kernel).
    /// Default: 150 (meaning 1.5x). A 4KB page must compress to
    /// fewer than PAGE_SIZE * 100 / min_ratio_x100 = 2730 bytes
    /// to be accepted. Pages that compress worse go directly to swap.
    pub min_ratio_x100: u32,

    /// Maximum zpool size as percentage of total RAM (default 25).
    pub max_pool_percent: u32,

    /// When zpool is full, evict oldest compressed pages to swap.
    /// This is the "writeback" path.
    pub writeback_on_full: bool,

    /// Page types eligible for compression.
    pub compress_anonymous: bool,       // true (default)
    pub compress_file_backed: bool,     // false (default — just evict clean pages)
    pub compress_shmem: bool,           // true (default — tmpfs/shm pages)

    /// Pages with active DMA mappings are never eligible for compression
    /// or swap-out. The page reclaim path checks the DMA pin count
    /// (tracked in the device registry's DeviceResources, Section 10.5)
    /// before attempting reclaim. This field is always false and exists
    /// as a documented invariant — it cannot be overridden.
    pub compress_dma_pinned: bool,      // false (invariant, never set to true)
}
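The acceptance threshold follows directly from `min_ratio_x100` using integer arithmetic (no FPU use in the kernel). A sketch of that calculation (`max_accepted_bytes` is a name invented here):

```rust
const PAGE_SIZE: u32 = 4096;

/// Largest compressed size (exclusive bound) accepted under the policy's
/// minimum ratio. A page must compress to fewer than this many bytes;
/// otherwise it bypasses the pool and goes directly to swap.
fn max_accepted_bytes(min_ratio_x100: u32) -> u32 {
    PAGE_SIZE * 100 / min_ratio_x100
}
```

At the default 150 (1.5x) the bound is 2730 bytes, matching the figure quoted above; tightening the policy to 200 (2.0x) lowers it to 2048 bytes.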

Hugepages are never compressed. Hugepages (2 MB and 1 GB on x86-64; 2 MB on AArch64 with the 4 KB granule) are allocated specifically for performance-critical workloads (databases, JVMs with large heaps, GPU buffer backing). Compressing them would defeat their purpose and introduce unpredictable latency. The specific rules are:

  • A page that is part of a hugepage compound (marked PageFlags::HUGEPAGE) is excluded from the compression scanner entirely. The reclaim path skips it without attempting compression.
  • If a hugepage must be reclaimed under severe memory pressure, it is handled in one of two ways: (a) swapped directly to swap as a compound block (512 contiguous 4 KB swap slots for a 2 MB hugepage), or (b) split into 4 KB base pages first via split_huge_page(), after which the resulting base pages are individually eligible for the normal compression path.
  • Transparent Huge Pages (THP) that have already been split (by split_huge_page()) lose the PageFlags::HUGEPAGE flag on their constituent pages. Those 4 KB pages are thereafter eligible for compression like any other anonymous page.
  • Explicitly locked huge pages (mlock()'d or mlockall()'d) are never reclaimed, split, or compressed — they are pinned in physical RAM by the mlock invariant.

Decision flow when the page reclaim path needs to free a page:

Page to reclaim:
  |
  Is it DMA-pinned (active device mapping)?
  |-- Yes -> Skip. Not eligible for reclaim. Done.
  |
  Is it a clean file-backed page?
  |-- Yes -> Simply evict (will re-read from disk). Done.
  |
  Is compression enabled for this page type?
  |-- No -> Write to swap (existing path). Done.
  |
  Attempt LZ4 compression:
  |
  Compressed to < (PAGE_SIZE * 100 / min_ratio_x100)?
  |-- No -> Incompressible. Write to swap. Done.
  |
  Is ZPool below max_pool_percent?
  |-- No -> Evict oldest compressed pages to swap, then store. Done.
  |
  Store compressed page in ZPool.
  Free original page frame.
  Done.
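The decision flow above can be expressed as a single function. A sketch under stated assumptions: the page predicates, the `ReclaimAction` enum, and the parameter shapes are inventions of this sketch (in the real path, compression is attempted lazily only after the earlier checks pass; here the LZ4 result size is passed in):

```rust
#[derive(Debug, PartialEq)]
enum ReclaimAction {
    Skip,               // DMA-pinned: not eligible for reclaim
    Evict,              // clean file-backed: drop, re-read on demand
    SwapOut,            // compression disabled or incompressible
    StoreCompressed,    // store in ZPool, free the frame
    WritebackThenStore, // pool full: evict oldest to swap, then store
}

struct PageInfo {
    dma_pinned: bool,
    clean_file_backed: bool,
    compressible_type: bool, // per-type policy (anonymous/shmem/file)
}

/// Mirror of the reclaim decision flow. `compressed_size` is the LZ4
/// output size for this page; `pool_below_limit` reflects max_pool_percent.
fn reclaim_decision(
    p: &PageInfo,
    compressed_size: u32,
    min_ratio_x100: u32,
    pool_below_limit: bool,
) -> ReclaimAction {
    if p.dma_pinned {
        return ReclaimAction::Skip;
    }
    if p.clean_file_backed {
        return ReclaimAction::Evict;
    }
    if !p.compressible_type {
        return ReclaimAction::SwapOut;
    }
    if compressed_size >= 4096 * 100 / min_ratio_x100 {
        return ReclaimAction::SwapOut; // incompressible at this policy
    }
    if pool_below_limit {
        ReclaimAction::StoreCompressed
    } else {
        ReclaimAction::WritebackThenStore
    }
}
```

An anonymous page compressing to 1000 bytes is stored; the same page at 3000 bytes (above the 2730-byte bound at the default 1.5x ratio) goes to swap instead.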

4.2.5 Decompression Path

When a page fault occurs on a compressed page:

1. Page fault handler (UmkaOS Core) checks if the faulting address
   maps to a compressed page (index lookup).
2. Allocate a fresh page frame.
3. Decompress from ZPool into the fresh page.
4. Update page tables to point to the fresh page.
5. Remove the compressed entry from ZPool.
6. Return from page fault.

Total time: ~1-2us (page allocation + LZ4 decompress + TLB update).
Compare: swap read from NVMe = ~10-15us, from HDD = ~5-10ms.

4.2.6 NUMA Awareness

One ZPool per NUMA node. Compressed pages stay on the same NUMA node as their original allocation. This avoids cross-NUMA decompression latency.

NUMA Node 0:                    NUMA Node 1:
  Buddy Allocator 0               Buddy Allocator 1
  Page Cache 0                    Page Cache 1
  LRU Lists 0                     LRU Lists 1
  ZPool 0          <-local->     ZPool 1
  (compresses node-0 pages)      (compresses node-1 pages)

4.2.7 Compression Algorithm Selection

#[repr(u32)]
pub enum CompressionAlgorithm {
    /// LZ4: ~1-2us compress, ~0.5us decompress. Best latency.
    /// Default choice.
    Lz4     = 0,
    /// Zstd (level 1): ~3-5us compress, ~1us decompress. Better ratio.
    /// Use when memory pressure is high and slightly more latency is OK.
    Zstd1   = 1,
    /// Zstd (level 3): ~5-10us compress, ~1us decompress. Best ratio.
    /// Use when swap I/O is very expensive (HDD) and CPU is available.
    Zstd3   = 2,
}

Both LZ4 and Zstd are BSD-licensed. We include our own no_std implementations (no external C library dependency in kernel).

4.2.8 Latency Spikes and Fragmentation

Transparent memory compression introduces two risks that must be explicitly managed: latency spikes during decompression and fragmentation within the compressed pool.

Decompression latency spikes — when a process accesses a compressed page, the page fault handler must decompress it before the access can proceed. This adds ~1-2μs (LZ4) or ~1-5μs (Zstd) to the page fault latency. While small in absolute terms, this can cause tail latency spikes for latency-sensitive workloads:

  • Worst case: a process accesses 100 compressed pages in rapid succession (e.g., scanning a large array that was mostly evicted). Each page fault takes ~1-2μs, totaling ~100-200μs of stall time. This is comparable to a single NVMe read but much better than swap (~10-15μs per page from NVMe × 100 pages = ~1-1.5ms).
  • Mitigation — prefetch on decompression: when decompressing page N, the kernel speculatively decompresses pages N+1 through N+3 (sequential readahead into the compressed pool). This converts 4 serial page faults into 1 fault + 3 prefetches, reducing total latency by ~75% for sequential access patterns.
  • Mitigation — per-cgroup opt-out: latency-sensitive cgroups (real-time workloads, databases) can disable compression entirely via memory.zpool.enabled = 0 in the cgroup controller. Their pages are never compressed — they go directly to swap (or are simply not reclaimed until OOM).
  • Mitigation — priority-aware compression: pages belonging to high-priority tasks (RT scheduling class, latency-sensitive cgroups) are placed last on the compression candidate list, ensuring they are compressed only under severe memory pressure.

ZPool fragmentation — the compressed pool stores variable-size compressed pages (a 4KB page might compress to 500 bytes, or 2KB, or 3800 bytes). This creates internal fragmentation:

  • Buddy-within-zpool: the ZPool divides each region (allocated as order-4 to order-6 pages from the buddy allocator, i.e., 64KB-256KB contiguous blocks) into variable-size slots. Slots are managed with a simple first-fit allocator within each region.
  • Fragmentation metric: the kernel tracks compressed_bytes / pool_total_size as the pool utilization ratio.
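
The utilization metric can be sketched directly. A minimal model (the struct and names are illustrative, not the UmkaOS API):

```rust
/// Minimal model of the pool utilization metric: compressed_bytes over the
/// buddy-allocated backing size. Illustrative names only.
pub struct ZPoolStats {
    pub compressed_bytes: u64, // sum of live compressed object sizes
    pub pool_total_size: u64,  // bytes of backing regions from the buddy allocator
}

impl ZPoolStats {
    /// Pool utilization in percent; the remainder is internal fragmentation
    /// plus free slot space.
    pub fn utilization_pct(&self) -> u64 {
        if self.pool_total_size == 0 {
            return 0;
        }
        self.compressed_bytes * 100 / self.pool_total_size
    }
}

fn main() {
    let stats = ZPoolStats { compressed_bytes: 150 << 10, pool_total_size: 200 << 10 };
    assert_eq!(stats.utilization_pct(), 75);
}
```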

ZPool Compaction Algorithm

ZPool compaction is triggered when the compressed memory pool becomes fragmented (many partially-filled slabs) and reclaims slab memory by migrating objects.

Constants:

/// Compaction trigger: start when pool utilization exceeds this threshold.
pub const ZPOOL_COMPACTION_THRESHOLD_PCT: u8 = 85;
/// Compaction also triggers if the oldest active slab has been idle this long.
pub const ZPOOL_COMPACTION_AGE_MS: u64 = 5_000;
/// Compaction stops when utilization drops below this target.
pub const ZPOOL_COMPACTION_TARGET_PCT: u8 = 75;
/// Evict slabs with occupancy below this fraction (25% = mostly empty).
pub const ZPOOL_MIN_OCCUPANCY_PCT: u8 = 25;

Trigger conditions (checked after each decompression or on allocation failure):

  1. Pool utilization (used_bytes / capacity_bytes) * 100 > ZPOOL_COMPACTION_THRESHOLD_PCT, OR
  2. monotonic_now() - oldest_active_slab.last_access_time > ZPOOL_COMPACTION_AGE_MS
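
The two trigger conditions reduce to a single predicate. A minimal sketch using the constants above; the time arguments stand in for monotonic_now() and the slab's last_access_time:

```rust
const ZPOOL_COMPACTION_THRESHOLD_PCT: u64 = 85;
const ZPOOL_COMPACTION_AGE_MS: u64 = 5_000;

/// Returns true when either compaction trigger fires.
fn should_compact(used_bytes: u64, capacity_bytes: u64,
                  now_ms: u64, oldest_access_ms: u64) -> bool {
    let utilization_pct = used_bytes * 100 / capacity_bytes;
    utilization_pct > ZPOOL_COMPACTION_THRESHOLD_PCT
        || now_ms.saturating_sub(oldest_access_ms) > ZPOOL_COMPACTION_AGE_MS
}

fn main() {
    assert!(should_compact(90, 100, 0, 0));         // 90% > 85% threshold
    assert!(should_compact(10, 100, 9_000, 1_000)); // idle 8s > 5s
    assert!(!should_compact(50, 100, 1_000, 500));  // neither trigger fires
}
```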

Compaction algorithm:

fn compact_zpool(pool: &mut ZPool):
  // Sort candidate slabs: oldest (by last-access time) and least-occupied first.
  candidates = pool.slabs
    .filter(|s| s.occupancy_pct() < ZPOOL_MIN_OCCUPANCY_PCT)
    .sort_by(|a, b| a.last_access_time.cmp(&b.last_access_time))

  for slab in candidates:
    if pool.utilization_pct() <= ZPOOL_COMPACTION_TARGET_PCT:
      break  // compaction target reached

    // Migrate each live object in the slab to a new allocation in the pool.
    for obj in slab.live_objects():
      new_loc = pool.alloc(obj.compressed_size)?
      pool.copy_object(obj, new_loc)
      pool.update_page_mapping(obj.original_pfn, new_loc)

    // Slab is now empty; return its backing pages to the buddy allocator.
    pool.free_slab(slab)
    buddy_allocator::free_pages(slab.backing_pages)

Compaction runs in a dedicated kthread at priority SCHED_IDLE (below any real workload) and yields every 1ms to avoid interfering with I/O paths. It is also invoked synchronously (without yield) during allocation failure before OOM is triggered.

  • Compaction cost: moving compressed pages requires updating the reverse-mapping index (which virtual address points to this compressed slot). This is ~100ns per page move. The kthread yields every 1ms, so it processes ~10,000 pages per second on average (~1ms of CPU time per second) and does not block page faults.
  • Worst case: highly heterogeneous compression ratios (some pages compress 10:1, others 2:1) create severe fragmentation. The compaction kthread keeps up with steady-state workloads but may fall behind during allocation bursts. If pool utilization exceeds 90% and compaction cannot reduce it, the kernel temporarily stops compressing new pages (sending them directly to swap) until compaction catches up.

Interaction with buddy allocator — the ZPool allocates backing memory from the buddy allocator in large contiguous blocks (64KB-256KB). These allocations are infrequent (one allocation per ~16-64 compressed pages) and always order-4 or larger. This avoids polluting the buddy allocator's small-order freelists. When the ZPool shrinks (memory pressure eases), regions are returned to the buddy allocator as whole contiguous blocks, avoiding external fragmentation.

Buddy allocator view:
  Order-4+ allocations → ZPool regions (stable, long-lived)
  Order-0 allocations  → Regular page cache, anonymous pages (high churn)

These two allocation classes don't interfere: ZPool uses large blocks,
everything else uses small blocks. The buddy allocator's per-order freelists
keep them naturally separated.

4.2.9 Linux Interface Exposure

procfs (compatible with Linux zswap interface):

/proc/umka/zpool/
    enabled             # "1" or "0" (read/write)
    algorithm           # "lz4", "zstd1", "zstd3" (read/write)
    max_pool_percent    # 25 (read/write, percentage of total RAM; can only
                        # decrease at runtime — increasing is rejected because
                        # the hash table index was sized at init for the
                        # original max_pool_percent and cannot grow)
    pool_total_size     # Current pool size in bytes
    stored_pages        # Number of pages in pool
    compressed_bytes    # Total compressed bytes
    original_bytes      # Total original bytes
    compression_ratio   # current ratio (e.g., "3.21")
    rejected_pages      # Incompressible pages sent to swap
    writeback_pages     # Pages evicted from pool to swap
    decompressions      # Total decompress operations

/sys/kernel/mm/zpool/          # Alternative sysfs path
                               # Same attributes, for tools that prefer sysfs

/proc/meminfo additions (matching Linux zswap format):

Zswapped:        512000 kB     (original size of compressed pages)
Zswap:           180000 kB     (actual compressed size in memory)

These are additive fields. free, top, htop and other tools that parse /proc/meminfo simply ignore fields they don't recognize.
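
For illustration, a tool that does want the new fields can scan them with ordinary key-value parsing. A sketch (the sample text mirrors the format above):

```rust
/// Return the numeric value (in kB) of a /proc/meminfo-style field, or None
/// if the field is absent; unknown fields are simply skipped, which is why
/// legacy tools are unaffected by the additions.
fn meminfo_field(text: &str, key: &str) -> Option<u64> {
    text.lines()
        .find(|line| line.starts_with(key))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|value| value.parse().ok())
}

fn main() {
    let sample = "MemTotal: 16384000 kB\nZswapped: 512000 kB\nZswap: 180000 kB\n";
    let original = meminfo_field(sample, "Zswapped:").unwrap();
    let compressed = meminfo_field(sample, "Zswap:").unwrap();
    // Effective compression ratio, as a consumer would compute it.
    assert!(original > compressed);
}
```

Note that "Zswapped:" does not match a lookup for "Zswap:" because the prefix test includes the colon.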


4.6 Extended Memory Operations

This section specifies eight syscalls that extend or augment the core memory management interface. Each syscall is Linux ABI-compatible at the binary level (same numbers, same struct layouts, same flag values) while the internal implementation uses UmkaOS-native data structures and algorithms. Improvements over Linux are called out explicitly in each subsection.

Cross-references: virtual memory regions (Section 4.1.5), huge page management (Section 4.1.4), NUMA policy (Section 4.1.8), page cache (Section 4.1.3), ZPool compression (Section 4.2).


4.6.1 mremap — Remap a Virtual Address Region

Syscall signature

void *mremap(void *old_addr, size_t old_size, size_t new_size,
             int flags, ... /* void *new_addr */);

Returns new_addr on success, MAP_FAILED (cast of (void *)-1) on error with errno set. The optional fifth argument new_addr is required when MREMAP_FIXED is set.

Flags

bitflags::bitflags! {
    /// Flags for mremap(2). Exact Linux values — binary ABI is stable.
    pub struct MremapFlags: u32 {
        /// Kernel may move the mapping to a different virtual address if it cannot
        /// grow in place. Without this flag, ENOMEM is returned instead of moving.
        const MAYMOVE     = 0x1;

        /// Move the mapping to exactly new_addr, unmapping whatever was there before.
        /// Requires MAYMOVE. Returns EINVAL if MAYMOVE is absent (matching Linux).
        const FIXED       = 0x2;

        /// Do not unmap old_addr after moving. The old VA range is replaced with
        /// anonymous zero pages (as if freshly mmap'd MAP_ANONYMOUS). The PTEs from
        /// the old range are transferred to new_addr. Added in Linux 5.7.
        /// Used by the Go runtime garbage collector to preserve VA reservations.
        const DONTUNMAP   = 0x4;
    }
}

Three operational cases

  1. Grow in place (new_size > old_size, no move): extend the VMA's end if the pages immediately above old_addr + old_size are unmapped. No PTE copying required; the new sub-range starts with no PTEs present (demand-fault on access). This is constant-time in the VMA tree.

  2. Shrink (new_size < old_size): always succeeds. Unmap the tail sub-range [old_addr + new_size, old_addr + old_size) using unmap_vma_range(), which unmaps PTEs and drops page references. Returns old_addr.

  3. Move (MAYMOVE set, grow in place impossible): find a free VA range of new_size, copy PTEs from the old range to the new range without copying physical pages, TLB-flush the old range, optionally install zero anonymous PTEs at the old range if DONTUNMAP, then free the old VMA (or update its length if DONTUNMAP).

Algorithm — detailed steps

mremap(old_addr, old_size, new_size, flags, new_addr):
  1. Validate old_addr is page-aligned. Return EINVAL if not.
  2. Validate old_size and new_size are non-zero. Return EINVAL if either is zero.
  3. Validate flags: FIXED requires MAYMOVE (return EINVAL if FIXED && !MAYMOVE,
     matching Linux).
  4. Validate DONTUNMAP requires MAYMOVE and new_size == old_size.
     (Kernel 5.7 behaviour: size must match to preserve the hole invariant.)
     Return EINVAL if violated.
  5. Lock MM write-lock (mm.write_lock()).
  6. Look up source VMA: find_vma(mm, old_addr). Return EFAULT if not found or
     old_addr < vma.start. Verify [old_addr, old_addr+old_size) lies within one VMA
     (no spanning). Return EFAULT if it spans two VMAs.
  7. If shrink (new_size < old_size):
       a. Split VMA at old_addr + new_size if needed.
       b. Call unmap_vma_range(old_addr + new_size, old_size - new_size).
          This unmaps PTEs, drops struct page references, fires mmu_notifier.
        c. Return old_addr. (unmap_vma_range performs the TLB flush for the
           removed tail before its pages are freed.)
  8. If grow in place possible (no VMA in [old_addr+old_size, old_addr+new_size)):
       a. Extend VMA end to old_addr + new_size.
       b. Return old_addr.
  9. If MAYMOVE not set: return ENOMEM (cannot grow in place, cannot move).
  10. Find destination VA range:
        - If FIXED: use new_addr. Verify new_addr is page-aligned (EINVAL if not).
          Unmap [new_addr, new_addr+new_size) if occupied.
        - Else: call find_free_vma(mm, new_size, hint=old_addr+old_size).
          Return ENOMEM if no free range found.
  11. Verify old and new ranges do not overlap (EINVAL if they do).
  12. Allocate new VMA struct, copy flags/prot/file/offset from source VMA.
  13. Call remap_pte_range(old_addr, new_addr, new_size):
        - Walks PTEs from old_addr using the page-table walker.
        - For each present PTE: atomically clear old PTE, set new PTE at new_addr
          offset. No physical page is copied.
        - For 2MB huge PTEs: transfers the huge PTE under the page table lock (PTL)
          within a single critical section (see UmkaOS improvement below).
        - Page map-counts (vm_page_count) are unchanged: the physical pages are
          remapped, not duplicated, so no references are added or dropped.
  14. If DONTUNMAP:
        - Replace old VMA range with a MAP_ANONYMOUS|MAP_PRIVATE zero VMA.
        - Leave PTEs for the old range absent (demand-fault semantics: no PTEs
          are installed eagerly; accesses fault in as zero pages).
        - VMA for old range remains in mm.vma_tree with anon backing.
      Else:
        - Remove old VMA from mm.vma_tree.
        - Call mmu_notifier_invalidate(mm, old_addr, old_size).
  15. Insert new VMA into mm.vma_tree.
  16. TLB flush of old VA range required: with DONTUNMAP the old mapping is
      preserved, but stale TLB entries pointing to the original physical pages
      must be invalidated. New accesses to the old VA range will fault and be
      handled by the kernel correctly only after the flush.
      Call tlb_flush_range(mm, old_addr, old_size).
  17. Unlock MM write-lock.
  18. Return new_addr.
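
Steps 1 through 4 are pure argument checks and can be sketched as a standalone validator. Illustrative names; per Linux, FIXED or DONTUNMAP without MAYMOVE is EINVAL:

```rust
const EINVAL: i32 = 22;
const PAGE_SIZE: usize = 4096;
const MREMAP_MAYMOVE: u32 = 0x1;
const MREMAP_FIXED: u32 = 0x2;
const MREMAP_DONTUNMAP: u32 = 0x4;

/// Validate mremap arguments (steps 1-4 of the algorithm above).
fn validate_mremap(old_addr: usize, old_size: usize, new_size: usize,
                   flags: u32) -> Result<(), i32> {
    if old_addr % PAGE_SIZE != 0 {                                // step 1
        return Err(EINVAL);
    }
    if old_size == 0 || new_size == 0 {                           // step 2
        return Err(EINVAL);
    }
    if flags & MREMAP_FIXED != 0 && flags & MREMAP_MAYMOVE == 0 { // step 3
        return Err(EINVAL);
    }
    if flags & MREMAP_DONTUNMAP != 0                              // step 4
        && (flags & MREMAP_MAYMOVE == 0 || new_size != old_size)
    {
        return Err(EINVAL);
    }
    Ok(())
}

fn main() {
    assert!(validate_mremap(0x1000, 4096, 8192, MREMAP_MAYMOVE).is_ok());
    assert_eq!(validate_mremap(0x1000, 4096, 8192, MREMAP_FIXED), Err(EINVAL));
    assert_eq!(validate_mremap(0x1001, 4096, 8192, 0), Err(EINVAL));
}
```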

UmkaOS improvement over Linux: atomic huge-page remap

Linux 5.x mremap splits 2MB huge pages at the source range boundaries before remapping, then reconstructs huge pages at the destination. This causes THP split overhead (up to ~80μs per split on large mappings) and temporary fragmentation of the huge-page pool.

UmkaOS's remap_pte_range() detects 2MB-aligned source ranges and transfers the huge PTE without splitting, using the page table lock (PTL) to serialize the operation:

Serialization via PTL, not memory-visible atomicity: The huge PTE transfer is not truly atomic in the memory-visible sense — it consists of two separate memory operations (clear source PMD entry, set destination PMD entry). Instead, the PTL is held for both the remove-from-source and insert-to-destination operations within a single critical section. No other CPU can observe the page in an inconsistent state while the PTL is held, because any concurrent page table walk must also acquire the PTL before modifying entries in the same PMD range.

Cross-PMD transfers: If the source and destination fall in different PMD ranges (and thus have different PTLs), both PTLs are acquired in canonical address order (lower virtual address first) to prevent ABBA deadlock, and the transfer is performed while both locks are held.

No unmapped window: There is no window where the page is mapped in neither the source nor the destination. The PTL critical section ensures that the destination PMD entry is written before the source PTL is released. Another CPU attempting to access the source address during the transfer will block on the PTL; by the time it acquires the lock, the TLB flush (step 16) will have invalidated any stale entries.

/// Remap PTEs from src_va to dst_va for `len` bytes.
/// Transfers huge PTEs (2MB) without splitting when both src and dst are
/// 2MB-aligned and len is a multiple of 2MB. Uses the page table lock (PTL)
/// to serialize the transfer — not memory-visible atomicity.
///
/// # Safety
/// Caller holds mm write-lock. src and dst ranges must not overlap.
/// src range PTEs must be valid (verified by caller before entry).
unsafe fn remap_pte_range(
    mm: &mut MemoryMap,
    src_va: VirtAddr,
    dst_va: VirtAddr,
    len: usize,
) {
    let mut offset = 0usize;
    while offset < len {
        let src = src_va + offset;
        let dst = dst_va + offset;
        // Check for 2MB-aligned huge page opportunity.
        if src.is_aligned(HUGE_PAGE_SIZE)
            && dst.is_aligned(HUGE_PAGE_SIZE)
            && len - offset >= HUGE_PAGE_SIZE
        {
            // Transfer huge PTE under PTL: acquire lock(s) for the source and
            // destination PMD ranges. If they fall under different PTLs, acquire
            // in canonical (lower-address-first) order to prevent deadlock.
            let ptl_src = mm.page_table.pmd_lock(src);
            let ptl_dst = mm.page_table.pmd_lock(dst);
            let _guards = lock_pair_ordered(&ptl_src, &ptl_dst);

            // Within this critical section, no other CPU can observe a state where
            // the page is in neither source nor destination.
            let pmd_src = mm.page_table.pmd_entry_mut(src);
            let huge_pte = pmd_src.take(); // Clear source PMD entry.
            let pmd_dst = mm.page_table.pmd_entry_mut(dst);
            pmd_dst.set(huge_pte);         // Write destination PMD entry.
            // PTL guards drop here — locks released.
            offset += HUGE_PAGE_SIZE;
        } else {
            // Fall back to 4KB PTE transfer, same PTL pattern at the PTE level.
            // (pte_lock is assumed to exist analogously to pmd_lock above.)
            let ptl_src = mm.page_table.pte_lock(src);
            let ptl_dst = mm.page_table.pte_lock(dst);
            let _guards = lock_pair_ordered(&ptl_src, &ptl_dst);
            let pte_src = mm.page_table.pte_entry_mut(src);
            let pte = pte_src.take();
            let pte_dst = mm.page_table.pte_entry_mut(dst);
            pte_dst.set(pte);
            offset += PAGE_SIZE;
        }
    }
}

This avoids any THP split/reconstruct overhead. The huge PTE is transferred in two memory operations (clear + set) within a single PTL critical section, preserving dirty/accessed bits and avoiding the split_huge_page() path entirely.

Error cases

Error    Condition
EFAULT   old_addr not in any VMA, or range spans two VMAs
EINVAL   old_addr not page-aligned; new_addr not page-aligned (FIXED); zero
         sizes; DONTUNMAP with size mismatch; overlapping ranges; MREMAP_FIXED
         or MREMAP_DONTUNMAP set without MREMAP_MAYMOVE (matching Linux)
ENOMEM   MAYMOVE not set and cannot grow in place; or MAYMOVE set but no free
         VA range

Linux compatibility: flag values, return value convention, errno codes, and DONTUNMAP semantics are identical to Linux 5.7+. The UmkaOS-specific huge-page optimisation is transparent to userspace.


4.6.2 mincore — Query Page Residency

Syscall signature

int mincore(void *addr, size_t length, unsigned char *vec);

Returns 0 on success. The output array vec contains one byte per page in [addr, addr + length). Bit 0 of each byte is 1 if the page is currently resident in physical RAM, 0 otherwise.

Output byte encoding

Standard Linux bit 0 is preserved for compatibility. UmkaOS extends the remaining bits to provide richer page-state information:

bitflags::bitflags! {
    /// Per-page status byte returned by mincore(2).
    /// Bit 0 is Linux-compatible. Bits 1-7 are UmkaOS extensions (zero on platforms
    /// that do not have the corresponding feature, or when queried via a
    /// compatibility shim that strips extension bits).
    pub struct MinCoreStatus: u8 {
        /// Page is resident in CPU DRAM (bit 0 — Linux-compatible).
        const RESIDENT        = 0x01;

        /// Page is in compressed memory (ZPool). Accessible but not in raw DRAM.
        /// Retrieving it requires decompression (~1-5μs; see the latency figures
        /// in Section 4.2). Linux always returns 0 for compressed pages (it does
        /// not have a compression tier by default).
        const COMPRESSED      = 0x02;

        /// Page is on a remote RDMA node (DSM remote page). Accessing it triggers
        /// an RDMA fetch (~2-3μs). Only set when the distributed memory subsystem
        /// is active (Section 5).
        const REMOTE_RDMA     = 0x04;

        /// Page is write-protected by userfaultfd WP mode.
        const UFFD_WP         = 0x08;

        /// Page is huge (part of a 2MB THP). The byte at the first page of the huge
        /// page has this bit set; the remaining 511 bytes in the same huge page also
        /// have it set. Informational only.
        const HUGE_PAGE       = 0x10;

        // Bits 0x20, 0x40, 0x80 reserved for future use.
    }
}

Standard mincore() returns bytes with only bit 0 meaningful (as per Linux). The extended bits are populated but userspace that ignores them sees standard Linux semantics (bit 0 = resident).
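
A sketch of what this means for consumers: masking with bit 0 recovers exact Linux semantics, while UmkaOS-aware tools can inspect the extension bits. The constants mirror MinCoreStatus above:

```rust
const RESIDENT: u8 = 0x01;
const COMPRESSED: u8 = 0x02;
const HUGE_PAGE: u8 = 0x10;

/// What a Linux-only tool sees: just the resident bit.
fn linux_view(status: u8) -> u8 {
    status & RESIDENT
}

fn main() {
    let status = RESIDENT | HUGE_PAGE;  // resident page inside a 2MB THP
    assert_eq!(linux_view(status), 1);  // legacy semantics preserved
    assert_eq!(status & COMPRESSED, 0); // not in the ZPool
}
```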

Algorithm

mincore(addr, length, vec):
  1. Validate addr is page-aligned. Return EINVAL if not.
  2. Validate length > 0, addr + length does not overflow. Return EINVAL if overflow.
  3. num_pages = ceil(length / PAGE_SIZE).
  4. Verify vec pointer (num_pages bytes) is writable by the calling process.
     Return EFAULT if vec is not accessible.
  5. Acquire MM read-lock (mm.read_lock()).
  6. Walk [addr, addr + length) page by page:
     For each page address p in the range:
       a. Look up VMA containing p. If none found: unlock, return ENOMEM.
          (Linux returns ENOMEM for unmapped ranges — UmkaOS matches this.)
       b. Look up PTE for p in the page table (no lock needed beyond MM read-lock;
          PTE reads are inherently racy, and Linux documents the same raciness
          for mincore).
       c. status = 0.
       d. If PTE is present and not swapped-out: status |= RESIDENT.
       e. If PTE maps a compressed ZPool page (UmkaOS PTE extension bit): status |= COMPRESSED.
       f. If PTE maps a remote DSM page: status |= REMOTE_RDMA.
       g. If the PTE has the userfaultfd write-protect bit set: status |= UFFD_WP.
       h. If the entry is a present huge PMD (2MB page): status |= HUGE_PAGE | RESIDENT.
          (A present huge PTE implies all 512 sub-pages are resident.)
       i. Write status byte to vec[page_index].
       j. Advance page_index.
       k. If p is at a 2MB-aligned boundary and this is a 2MB huge PTE:
          advance p by HUGE_PAGE_SIZE and fill all 512 vec bytes with status.
          (Avoids walking 512 individual 4KB PTEs for huge pages.)
  7. Release MM read-lock.
  8. Return 0.

Raciness note: mincore is inherently racy — a page may be swapped out between the PTE check and the return to userspace. This is by design (Linux documentation explicitly states this). UmkaOS does not attempt to serialise page reclaim during mincore.

UmkaOS improvement over Linux

Linux walks the page table at 4KB granularity even for huge pages, redundantly checking 512 sub-PTEs that all have the same state. UmkaOS detects huge PMD entries and bulk-fills the output vector for the entire 2MB range in a single operation, reducing mincore overhead by up to 512x for workloads with large THP mappings (e.g., QEMU guest RAM).
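
The bulk fill in step 6k can be sketched as a single slice write. An illustrative helper, not the kernel routine:

```rust
/// Number of 4KB pages covered by one 2MB huge page.
const PAGES_PER_HUGE: usize = 512;

/// Fill the output vector for an entire huge page in one pass and return the
/// next page index to process. Clamps at the end of the query range.
fn fill_huge(vec: &mut [u8], page_index: usize, status: u8) -> usize {
    let end = (page_index + PAGES_PER_HUGE).min(vec.len());
    for byte in &mut vec[page_index..end] {
        *byte = status;
    }
    end
}

fn main() {
    let mut vec = vec![0u8; 1024];
    assert_eq!(fill_huge(&mut vec, 0, 0x11), 512); // one write covers 512 entries
    assert_eq!((vec[0], vec[511], vec[512]), (0x11, 0x11, 0));
}
```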

Additionally, the extension bits (COMPRESSED, REMOTE_RDMA) expose information that Linux provides no equivalent for, enabling userspace memory managers (e.g., the UmkaOS memory daemon) to make informed reclaim decisions without resorting to /proc/PID/smaps parsing.

Extended syscall: mincore_ex

For richer queries, UmkaOS provides mincore_ex as a forward-compat extension:

int mincore_ex(void *addr, size_t length, struct mincore_ex_page *vec,
               uint32_t flags);
/// Extended per-page record returned by mincore_ex(2).
/// Size is 8 bytes; the vec array must hold ceil(length/PAGE_SIZE) entries.
#[repr(C)]
pub struct MinCoreExPage {
    /// MinCoreStatus bits (see above).
    pub status: MinCoreStatus,
    /// NUMA node the physical page is on (0-255; 0xFF = unknown or remote).
    pub numa_node: u8,
    /// Page age in units of 0.1 seconds since last access (max 0xFFFE = 6553.4s;
    /// 0xFFFF = unknown). Derived from hardware Accessed bit recency tracking.
    pub age_deciseconds: u16,
    /// Physical frame number (PFN) mod 2^32. Zero for non-resident pages.
    pub pfn_lo32: u32,
}

mincore_ex is a new UmkaOS syscall number (not a Linux number); it has no Linux equivalent. Userspace that needs it links against the UmkaOS compat header.
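
The 8-byte record size follows from the field layout: u8 + u8 + u16 + u32 packs without padding under repr(C). A checkable model with MinCoreStatus flattened to its underlying u8:

```rust
/// Model of MinCoreExPage with plain field types.
#[repr(C)]
struct MinCoreExPageModel {
    status: u8,
    numa_node: u8,
    age_deciseconds: u16,
    pfn_lo32: u32,
}

fn main() {
    // One entry per page, 8 bytes each, as the doc comment states.
    assert_eq!(std::mem::size_of::<MinCoreExPageModel>(), 8);
    assert_eq!(std::mem::align_of::<MinCoreExPageModel>(), 4);
}
```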

Error cases

Error    Condition
EFAULT   vec pointer not writable, or addr faults (not in any VMA — but see
         ENOMEM)
EINVAL   addr not page-aligned; addr + length overflows; length is zero
ENOMEM   The queried range includes an unmapped region

Linux compatibility: standard mincore() syscall number and bit-0 semantics are exact. Extension bits in the output byte are always zero when called via the Linux compat path (UmkaOS compatibility shim strips them). mincore_ex is a new UmkaOS-only syscall.


4.6.3 membarrier — Expedite Memory Barrier

Syscall signature

int membarrier(int cmd, unsigned int flags, int cpu_id);

Returns 0 on success (or a bitmask for MEMBARRIER_CMD_QUERY), -1 on error.

Commands

Exact Linux values from <linux/membarrier.h>. All are implemented:

/// Commands for membarrier(2). Exact Linux values.
#[repr(i32)]
pub enum MembarrierCmd {
    /// Query supported commands. Returns bitmask of supported MembarrierCmd values.
    /// Bit N set means (1 << N) command is supported.
    Query                           = 0,

    /// Issue a full memory barrier on all CPUs (not just those running this process).
    /// Equivalent to sending an IPI to all online CPUs and waiting for each to
    /// execute a memory fence. Slow but unconditional.
    Global                          = 1,

    /// Barrier on all CPUs currently running threads from processes that have
    /// registered REGISTER_GLOBAL_EXPEDITED. Faster than GLOBAL when most processes
    /// are registered.
    GlobalExpedited                 = 2,

    /// Register this process for GlobalExpedited. Sets a flag in ProcessFlags that
    /// causes this process's CPUs to be included in future GlobalExpedited barriers.
    RegisterGlobalExpedited         = 4,

    /// Barrier on all CPUs running threads of the calling process. IPI sent only to
    /// CPUs with a thread of this process currently scheduled. Fastest for
    /// single-process use cases (e.g., JIT compiler memory visibility).
    PrivateExpedited                = 8,

    /// Register this process for PrivateExpedited barriers.
    RegisterPrivateExpedited        = 16,

    /// Full sync-core barrier: like PrivateExpedited but also serialises the
    /// instruction pipeline (CPUID on x86, ISB on ARM). Required for self-modifying
    /// code (JIT). After this barrier, all CPUs have seen the latest instruction
    /// stream.
    PrivateExpeditedSyncCore        = 32,

    /// Register for PrivateExpeditedSyncCore.
    RegisterPrivateExpeditedSyncCore = 64,

    /// Restart RSEQ (restartable sequences) on all threads of the calling process.
    /// Forces any in-flight RSEQ critical sections to abort and restart. Used to
    /// push updated RSEQ code into effect.
    PrivateExpeditedRseq            = 128,

    /// Register for PrivateExpeditedRseq.
    RegisterPrivateExpeditedRseq    = 256,
}

Internal data structures

bitflags::bitflags! {
    /// Per-process membarrier registration state. Stored in ProcessFlags.
    /// Cheap to check on the hot path (single bitmask load).
    pub struct MembarrierState: u32 {
        const REGISTERED_GLOBAL_EXPEDITED          = 1 << 0;
        const REGISTERED_PRIVATE_EXPEDITED         = 1 << 1;
        const REGISTERED_PRIVATE_EXPEDITED_SYNC    = 1 << 2;
        const REGISTERED_PRIVATE_EXPEDITED_RSEQ    = 1 << 3;
    }
}

The MembarrierState field lives in the Process struct alongside ProcessFlags. It is read atomically (SeqCst load) on every IPI target check. No lock is needed for the read path.

Algorithm — GLOBAL

membarrier(GLOBAL, flags=0, cpu_id=0):
  1. For each online CPU c:
       Send IPI to c with handler: smp_mb() (full memory barrier instruction).
       Wait for IPI acknowledgment (c's handler has executed).
  2. Execute smp_mb() on the calling CPU.
  3. Return 0.

GLOBAL is expensive (~N microseconds for N CPUs) and should only be used at infrequent synchronisation points. It is the slowest command but requires no prior registration.

Algorithm — GLOBAL_EXPEDITED

membarrier(GLOBAL_EXPEDITED, flags=0, cpu_id=0):
  1. No caller registration required. Any process may call GLOBAL_EXPEDITED.
  2. Use PerCpu::for_each() to iterate over all CPUs:
       For each CPU c whose current process has REGISTERED_GLOBAL_EXPEDITED set:
         Enqueue a barrier IPI (using the IPI mechanism from Section 3).
  3. Wait for all enqueued IPIs to be acknowledged.
  4. smp_mb() on calling CPU.
  5. Return 0.

Algorithm — PRIVATE_EXPEDITED

membarrier(PRIVATE_EXPEDITED, flags=0, cpu_id=0):
  1. Verify caller registered. Return EPERM if not.
  2. Build CPU mask: for each CPU c, check if c is currently running a thread
     belonging to the calling process (check CpuLocal::current_task().process_id).
  3. Send barrier IPI to each CPU in the mask.
  4. Wait for acknowledgments.
  5. smp_mb() on calling CPU.
  6. Return 0.
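
Step 1's registration check is a single bitmask test against MembarrierState. A sketch; the constant mirrors REGISTERED_PRIVATE_EXPEDITED above, and EPERM's value is the standard Linux errno:

```rust
const REGISTERED_PRIVATE_EXPEDITED: u32 = 1 << 1;
const EPERM: i32 = 1;

/// Reject PRIVATE_EXPEDITED from a process that never registered.
fn check_private_expedited(state: u32) -> Result<(), i32> {
    if state & REGISTERED_PRIVATE_EXPEDITED == 0 {
        Err(EPERM)
    } else {
        Ok(())
    }
}

fn main() {
    assert_eq!(check_private_expedited(0), Err(EPERM));
    assert!(check_private_expedited(REGISTERED_PRIVATE_EXPEDITED).is_ok());
}
```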

Algorithm — PRIVATE_EXPEDITED_SYNC_CORE

Same as PRIVATE_EXPEDITED but the IPI handler executes an instruction-serialising fence:

  • x86-64: CPUID (serialising instruction, flushes instruction pipeline)
  • AArch64: ISB SY (instruction synchronisation barrier)
  • ARMv7: ISB via CP15
  • RISC-V: FENCE.I (instruction fence)
  • PPC64LE: isync + sync

This ensures that self-modifying code (JIT-compiled code installed by one thread) is visible to the instruction fetch unit on all other threads before execution continues.

Algorithm — PRIVATE_EXPEDITED_RSEQ

membarrier(PRIVATE_EXPEDITED_RSEQ, flags=0, cpu_id=0):
  1. For each thread T in the calling process currently running on some CPU c:
       If T's rseq_cs pointer is non-null (T is inside a restartable sequence):
         Send IPI to c requesting RSEQ abort: set T's rseq_cs to null and
         redirect T's PC to the abort_ip specified in the rseq struct.
  2. Return 0.

UmkaOS improvement over Linux

Linux's membarrier implementation acquires the scheduler runqueue lock on each target CPU to determine whether a thread of the target process is running. This can cause priority inversion and latency spikes on RT tasks.

UmkaOS avoids this by using CpuLocal::current_task_pid() (a register-based read, ~1-4 cycles, no lock) to check which process is on each CPU. The check is performed via PerCpu::for_each() which issues a non-locking load of each CPU's register-aliased task pointer. The IPI target set is built without acquiring any scheduler lock.

This reduces the overhead of PRIVATE_EXPEDITED from O(N_CPUs × lock_acquire_latency) to O(N_CPUs × register_read_latency), a roughly 10-50x improvement on large systems.

QUERY return value

MEMBARRIER_CMD_QUERY (0) returns a bitmask of all implemented commands. UmkaOS returns:

1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 = 511 (all commands supported)
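
The bitmask arithmetic checks out mechanically:

```rust
fn main() {
    // Every implemented command value OR'd together.
    let supported = [1u32, 2, 4, 8, 16, 32, 64, 128, 256];
    let mask: u32 = supported.iter().copied().fold(0, |m, c| m | c);
    assert_eq!(mask, 511);
}
```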

Error cases

Error    Condition
EINVAL   Unknown cmd value; non-zero flags (reserved, must be 0); invalid cpu_id
EPERM    PRIVATE_EXPEDITED, PRIVATE_EXPEDITED_SYNC_CORE, or PRIVATE_EXPEDITED_RSEQ
         called without the corresponding prior REGISTER_* command.
         (GLOBAL_EXPEDITED requires no registration.)

Linux compatibility: all command values, registration semantics, and error codes match Linux 5.10+. The UmkaOS optimisation is transparent to userspace.


4.6.4 userfaultfd — User-Space Page Fault Handling

Syscall signature

int userfaultfd(int flags);

Returns a file descriptor on success, -1 on error. The fd is polled for readability when fault messages are pending.

Flags

bitflags::bitflags! {
    /// Flags for userfaultfd(2).
    pub struct UffdFlags: i32 {
        /// Set close-on-exec on the returned fd.
        const CLOEXEC          = libc::O_CLOEXEC;   // 0x80000

        /// Set O_NONBLOCK on the returned fd. read() returns EAGAIN when no
        /// messages are pending instead of blocking.
        const NONBLOCK         = libc::O_NONBLOCK;  // 0x800

        /// Allow unprivileged (non-CAP_SYS_PTRACE) processes to create userfaultfd
        /// instances and handle faults for their own address space. Added Linux 5.11.
        /// Without this flag, unprivileged processes get EPERM.
        const USER_MODE_ONLY   = 0x1;
    }
}

Wire message format (exact Linux uffd_msg struct layout, binary-compatible):

/// Fault message delivered to userspace via read() on the userfaultfd fd.
/// Total size: 32 bytes. Layout matches Linux <linux/userfaultfd.h> uffd_msg.
#[repr(C)]
pub struct UffdMsg {
    /// Event type (UffdEvent enum). 8-bit field.
    pub event: u8,
    pub _reserved1: u8,
    pub _reserved2: u16,
    pub _reserved3: u32,
    /// Event-specific payload. Union in C; UmkaOS uses an enum for type safety
    /// internally, but serialises to the union wire format.
    pub arg: UffdMsgArg,
}

/// 24-byte union (wire format). One variant active based on UffdMsg.event.
#[repr(C)]
pub union UffdMsgArg {
    /// For UFFD_EVENT_PAGEFAULT.
    pub pagefault: UffdMsgPagefault,
    /// For UFFD_EVENT_FORK.
    pub fork: UffdMsgFork,
    /// For UFFD_EVENT_REMAP.
    pub remap: UffdMsgRemap,
    /// For UFFD_EVENT_REMOVE.
    pub remove: UffdMsgRemove,
    /// Padding to 24 bytes.
    pub _pad: [u8; 24],
}

/// Pagefault event payload (event = UFFD_EVENT_PAGEFAULT = 0x12).
#[repr(C)]
pub struct UffdMsgPagefault {
    /// UFFD_PAGEFAULT_FLAG_WRITE (0x01): fault was a write. Else read fault.
    /// UFFD_PAGEFAULT_FLAG_WP   (0x02): write-protect fault (WP mode).
    /// UFFD_PAGEFAULT_FLAG_MINOR (0x04): minor fault (MINOR mode).
    pub flags: u64,
    /// Faulting virtual address.
    pub address: u64,
    /// Thread ID of the faulting task (PTID in Linux, tid here).
    pub ptid: u32,
}

#[repr(u8)]
pub enum UffdEvent {
    Pagefault = 0x12,
    Fork      = 0x13,
    Remap     = 0x14,
    Remove    = 0x15,
    Unmap     = 0x16,
}
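
The 32-byte total is verifiable from the layout: an 8-byte header (u8 event plus three reserved fields) followed by the 24-byte argument union. A simplified model with the union flattened to a byte array:

```rust
/// Model of UffdMsg with the argument union flattened to its 24-byte padding.
#[repr(C)]
struct UffdMsgModel {
    event: u8,
    _reserved1: u8,
    _reserved2: u16,
    _reserved3: u32,
    arg: [u8; 24],
}

fn main() {
    // Matches the Linux uffd_msg wire size.
    assert_eq!(std::mem::size_of::<UffdMsgModel>(), 32);
}
```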

Internal data structures

/// Number of shards in the pending fault table.
/// Power of 2 for cheap modulo via bitmask.
/// 64 shards = at most 64 CPUs (one 64-CPU group of a 4096-CPU system) can
/// contend on any single shard lock.
pub const UFFD_FAULT_SHARDS: usize = 64;

/// Maximum simultaneously pending faults per shard.
/// Total capacity per UffdInstance: UFFD_FAULT_SHARDS × UFFD_SHARD_SLOTS = 8192.
/// Sufficient for live migration of large Redis/database workloads on servers
/// with hundreds of cores. When a shard is full, `wait_for_resolution` returns
/// `Err(UffdError::Backpressure)`; the page fault handler yields and retries.
pub const UFFD_SHARD_SLOTS: usize = 128;

/// Maximum tasks simultaneously waiting on the same faulting page.
/// 8 covers typical thread-level parallelism faulting the same shared mapping.
pub const UFFD_MAX_WAITERS_PER_PAGE: usize = 8;

/// One slot in the flat pending-fault table.
/// `addr == PageAlignedAddr(0)` means this slot is unoccupied (virtual address 0
/// is never a valid userfaultfd target: the zero page is mapped read-only by the
/// kernel and cannot be registered with UFFDIO_REGISTER).
#[repr(C)]
pub struct PendingFaultSlot {
    /// Page address being waited on (0 = empty).
    pub addr: PageAlignedAddr,
    /// Number of valid entries in `waiters`.
    pub waiter_count: u8,
    /// Tasks waiting for UFFDIO_COPY or UFFDIO_ZEROPAGE to resolve this page.
    /// Indices 0..waiter_count are initialised; the rest are uninitialised.
    pub waiters: [MaybeUninit<WaitEntry>; UFFD_MAX_WAITERS_PER_PAGE],
}

/// Pre-allocated, bounded flat table for one uffd shard. No heap allocation
/// after UffdInstance creation. Linear scan over UFFD_SHARD_SLOTS entries is
/// O(128) but cache-friendly and contention-free (each shard is independent).
pub struct PendingFaultShard {
    pub slots: [PendingFaultSlot; UFFD_SHARD_SLOTS],
    /// Count of occupied slots (addr != 0). Checked before scanning to
    /// provide O(1) full detection.
    pub occupied: u16,
}

/// State for one userfaultfd instance. Allocated on fd creation.
pub struct UffdInstance {
    /// Registered virtual address ranges and their modes.
    /// IntervalTree provides O(log N) lookup on fault address.
    pub registered_ranges: RwLock<IntervalTree<UffdVmaRange>>,

    /// Lock-free SPSC ring for fault messages delivered to userspace.
    /// Producer: page fault handler (interrupt context).
    /// Consumer: userspace thread calling read() on the uffd fd.
    /// Capacity 2048: handles burst of faults before userspace drains the ring.
    pub message_queue: SpscRing<UffdMsg, 2048>,

    /// Sharded pending fault table.
    ///
    /// Fault address is sharded by address bits [17:12] (the six bits directly above the 4 KiB page offset; 6 bits = 64 shards).
    /// This distributes concurrent faults across 64 independent locks.
    ///
    /// Each shard is a bounded flat table (`PendingFaultShard`) — no heap allocation
    /// after UffdInstance creation. Linear scan within a shard is O(UFFD_SHARD_SLOTS)
    /// but in practice scans ≤ 10 entries (concurrent faults per shard are rare).
    ///
    /// **Why not HashMap**: HashMap allocates on every new-key insert. In the page
    /// fault path (thread context but potentially deep in mm/ code), allocation
    /// failures would require complex retry logic. Fixed pre-allocated storage
    /// provides predictable, bounded behaviour with no allocation in the fault path.
    pub pending_faults: [Mutex<PendingFaultShard>; UFFD_FAULT_SHARDS],

    /// Tasks waiting for any message (read() on uffd fd).
    pub wakeup_queue: WaitQueue,

    /// Negotiated API version and feature bits (from UFFDIO_API ioctl).
    pub api_version: u64,
    pub features: UffdFeatures,

    /// Registration mode flags controlling which fault types this uffd handles.
    pub mode: UffdMode,
}

impl UffdInstance {
    /// Get the shard index for a page address.
    /// Uses bits [17:12], the low six bits of the page frame number, for distribution.
    /// (Bits [11:0] of a page-aligned address are always zero and carry no entropy.)
    #[inline]
    fn shard_index(page_addr: PageAlignedAddr) -> usize {
        ((page_addr.0 >> 12) & (UFFD_FAULT_SHARDS as u64 - 1)) as usize
    }

    /// Block the current task until the fault at `page_addr` is resolved.
    /// Returns `Err(UffdError::Backpressure)` if the shard table is full;
    /// the caller must yield and retry.
    pub fn wait_for_resolution(&self, page_addr: PageAlignedAddr) -> Result<(), UffdError> {
        let shard_idx = Self::shard_index(page_addr);
        let mut guard = self.pending_faults[shard_idx].lock();
        // Reborrow through the guard once so `slots` and `occupied` can be
        // borrowed as disjoint fields below (each field access through the
        // guard itself re-invokes DerefMut and would trip the borrow checker).
        let shard = &mut *guard;

        // Find existing slot for this address or claim a free slot.
        let slot = match shard.slots.iter_mut().find(|s| s.addr == page_addr) {
            Some(s) => s,
            None => {
                if shard.occupied as usize >= UFFD_SHARD_SLOTS {
                    return Err(UffdError::Backpressure);
                }
                let free = shard.slots.iter_mut()
                    .find(|s| s.addr.0 == 0)
                    .expect("occupied count inconsistency: counter != actual count");
                free.addr = page_addr;
                shard.occupied += 1;
                free
            }
        };

        if slot.waiter_count as usize >= UFFD_MAX_WAITERS_PER_PAGE {
            return Err(UffdError::TooManyWaiters);
        }
        let waiter = WaitEntry::current_task();
        slot.waiters[slot.waiter_count as usize].write(waiter.clone());
        slot.waiter_count += 1;
        drop(guard); // Release the shard lock BEFORE sleeping

        waiter.wait().map_err(|_| UffdError::Interrupted)
    }

    /// Wake all tasks waiting on `page_addr` (called from UFFDIO_COPY handler).
    pub fn resolve_fault(&self, page_addr: PageAlignedAddr) {
        let shard_idx = Self::shard_index(page_addr);
        // Extract waiters under the lock, then wake them after releasing it.
        let mut waiters_buf = [MaybeUninit::<WaitEntry>::uninit(); UFFD_MAX_WAITERS_PER_PAGE];
        let waiter_count;
        {
            let mut shard = self.pending_faults[shard_idx].lock();
            match shard.slots.iter_mut().find(|s| s.addr == page_addr) {
                None => return, // No waiters (racy resolve — benign)
                Some(slot) => {
                    waiter_count = slot.waiter_count as usize;
                    // SAFETY: slots 0..waiter_count were written in wait_for_resolution.
                    for i in 0..waiter_count {
                        waiters_buf[i].write(unsafe { slot.waiters[i].assume_init_read() });
                    }
                    slot.addr = PageAlignedAddr(0); // Mark slot free
                    slot.waiter_count = 0;
                    shard.occupied -= 1;
                }
            }
        } // Shard lock released before waking

        // SAFETY: waiters_buf[0..waiter_count] were initialised above.
        for i in 0..waiter_count {
            unsafe { waiters_buf[i].assume_init_read() }.wake();
        }
    }
}

/// One registered VA range.
pub struct UffdVmaRange {
    pub start:  VirtAddr,
    pub end:    VirtAddr,
    /// Which fault types to intercept on this range.
    pub mode:   UffdMode,
}

bitflags::bitflags! {
    pub struct UffdMode: u64 {
        /// Intercept page-not-present faults (standard userfaultfd use).
        const MISSING    = 1 << 0;
        /// Intercept write faults on write-protected pages.
        const WRITEPROTECT = 1 << 1;
        /// Intercept minor faults (page exists but may need content update).
        const MINOR      = 1 << 2;
    }
}

bitflags::bitflags! {
    pub struct UffdFeatures: u64 {
        const PAGEFAULT_FLAG_WP            = 1 << 0;
        const EVENT_FORK                   = 1 << 1;
        const EVENT_REMAP                  = 1 << 2;
        const EVENT_REMOVE                 = 1 << 3;
        const MISSING_HUGETLBFS            = 1 << 4;
        const MISSING_SHMEM                = 1 << 5;
        const EVENT_UNMAP                  = 1 << 6;
        const SIGBUS                       = 1 << 7;
        const THREAD_ID                    = 1 << 8;
        const MINOR_HUGETLBFS              = 1 << 9;
        const MINOR_SHMEM                  = 1 << 10;
        const EXACT_ADDRESS                = 1 << 11;
        const WP_HUGETLBFS_SHMEM           = 1 << 12;
        const WP_UNPOPULATED               = 1 << 13;
        const POISON                       = 1 << 14;
        const WP_ASYNC                     = 1 << 15;
    }
}

ioctl operations

All UFFDIO_* ioctl values match Linux exactly.

  • UFFDIO_API (0xc018aa3f): Handshake. Userspace sends uffdio_api struct with api = UFFD_API (0xaa) and desired feature bits. Kernel validates API version, returns supported features and ioctls bitmasks. Stores agreed features in UffdInstance.features. Must be the first ioctl on the fd.

  • UFFDIO_REGISTER (0xc020aa00): Register a VA range. Struct uffdio_register with range.start, range.len, and mode (MISSING/WP/MINOR). Kernel inserts entry into registered_ranges IntervalTree. Returns the ioctls bitmask valid for this range. The VMA covering the range must already exist; returns EINVAL if not.

  • UFFDIO_UNREGISTER (0x8010aa01): Remove a registered range. Removes IntervalTree entries. Wakes any tasks blocked on pending faults in the range with EFAULT.

  • UFFDIO_COPY (0xc028aa03): Copy a page to resolve a missing fault. Struct uffdio_copy: dst (target VA), src (source VA in caller's address space), len (must be PAGE_SIZE or HUGE_PAGE_SIZE), mode flags, copy (out: bytes copied). Kernel copies the page from src to dst in the faulting process's address space, then wakes the faulting task. UmkaOS extension: if src is a memfd offset passed via the mode flags, maps the memfd page directly into the destination (zero physical copy).

  • UFFDIO_ZEROPAGE (0xc020aa04): Install a zero page at the faulted address. Wakes the faulting task. Equivalent to UFFDIO_COPY with a zero source page but without the data copy.

  • UFFDIO_WRITEPROTECT (0xc018aa06): Write-protect or unprotect a range of pages. Struct uffdio_writeprotect: range + mode (UFFDIO_WRITEPROTECT_MODE_WP to protect, 0 to unprotect). Updates PTEs to add/remove write permission, flushes TLB.

  • UFFDIO_CONTINUE (0xc020aa07): Continue a minor fault. The page already exists at the physical level; this ioctl installs the PTE and wakes the faulting task.

  • UFFDIO_POISON (0xc020aa08): Mark a range as poisoned. Accessing any page in the range delivers SIGBUS to the accessing task. Implemented by installing a special "poison" PTE marker.

Page fault path

Page fault handler receives fault at address A in process P:
  1. Look up A in P's uffd registered_ranges IntervalTree.
     O(log N) lookup. If not registered: handle as normal fault.
  2. Determine fault type: MISSING (PTE not present), WP (write to write-protected
     page), MINOR (page exists in page cache but PTE not yet installed).
  3. If mode does not cover this fault type: handle as normal fault.
  4. Construct UffdMsg with event=PAGEFAULT, address=A, flags per fault type.
  5. Enqueue message into UffdInstance.message_queue (SPSC push, lock-free).
  6. Wake any thread blocked in read() on the uffd fd (if needed).
  7. Allocate WaitEntry for the faulting task; call wait_for_resolution(A):
        compute shard = bits [17:12] of A, lock pending_faults[shard], insert WaitEntry.
  8. Block the faulting task: schedule_out(current_task, WaitState::UffdWait).
  --- userspace runs, reads the message, calls UFFDIO_COPY or UFFDIO_ZEROPAGE ---
  9. UFFDIO_COPY handler:
       a. Copy source page data into a newly allocated physical page for address A.
          UmkaOS improvement: if src is a memfd page, install a copy-on-write mapping
          to the memfd page rather than copying bytes (zero-copy for live migration).
       b. Install PTE for A in the faulting process's page table.
       c. Flush TLB for A on all CPUs running threads of P.
        d. Call resolve_fault(A): compute shard = bits [17:12] of A, lock
           pending_faults[shard], remove WaitEntry list for A, unlock shard.
       e. Wake all waiting tasks. They resume executing the faulting instruction.
  10. Faulting task resumes; instruction retries and succeeds.

UmkaOS improvement: zero-copy UFFDIO_COPY via memfd source

When a VM live migration engine wants to supply guest RAM pages via userfaultfd, the standard UFFDIO_COPY requires the migration engine to first read() the page from the network socket into a buffer, then call UFFDIO_COPY to copy from that buffer into the guest address space. This is two copies of the page data.

UmkaOS adds a UFFDIO_COPY_MODE_MEMFD flag to uffdio_copy.mode. When set, src is interpreted as a memfd page offset rather than a VA. The kernel maps the memfd page directly into the guest's page table with a copy-on-write mapping. No data is copied; the physical page is shared read-only until either side writes (triggering CoW). For read-heavy VM workloads, this eliminates all copy overhead.

Write-protect mode

WP mode intercepts writes to pages that have been write-protected via UFFDIO_WRITEPROTECT. The page fault handler detects a write to a WP page, delivers a UFFD_EVENT_PAGEFAULT with UFFD_PAGEFAULT_FLAG_WP set, and blocks the writing task. Userspace inspects the write, optionally modifies the page, then calls UFFDIO_WRITEPROTECT (with mode=0) to unprotect the page and wake the task. Used for dirty-page tracking in snapshot engines.

MADV_USERFAULTFD hint

madvise(addr, len, MADV_USERFAULTFD) (UmkaOS-specific advice constant) marks the range as a candidate for uffd monitoring. This is a hint to the kernel to pre-register the range in the VMA's uffd_flags, so that future page faults check the uffd IntervalTree without needing a full VMA scan. It does not replace UFFDIO_REGISTER.

Error cases

Error Condition
EPERM Caller lacks privilege and USER_MODE_ONLY not set in flags
ENOMEM Kernel cannot allocate UffdInstance
EINVAL UFFDIO_API with wrong API magic; UFFDIO_REGISTER on unmapped range; UFFDIO_COPY with bad length
EFAULT Bad pointer in ioctl struct
EEXIST UFFDIO_REGISTER on a range that is already registered

Linux compatibility: all ioctl numbers, struct layouts, event codes, and feature flags match Linux 5.14+. The UFFDIO_COPY_MODE_MEMFD extension uses a reserved flag bit in uffdio_copy.mode and is ignored by Linux (which returns EINVAL for unknown mode bits — UmkaOS accepts it). The MADV_USERFAULTFD advice is UmkaOS-only.


4.6.5 memfd_create — Create Anonymous File

Syscall signature

int memfd_create(const char *name, unsigned int flags);

Returns a file descriptor on success, -1 on error.

Flags

bitflags::bitflags! {
    /// Flags for memfd_create(2). Exact Linux values.
    pub struct MemfdFlags: u32 {
        /// Set close-on-exec on the returned fd.
        const CLOEXEC        = 0x0001;

        /// Allow seals to be applied via fcntl(F_ADD_SEALS). Without this flag,
        /// F_ADD_SEALS returns EPERM.
        const ALLOW_SEALING  = 0x0002;

        /// Back the memfd with hugeTLB pages from the hugetlbfs pool.
        const HUGETLB        = 0x0004;

        /// Hugepage size selector (combined with HUGETLB): 2MB pages.
        /// Value: (21 << 26) = 0x54000000.
        const HUGE_2MB       = 0x54000000;

        /// Hugepage size selector (combined with HUGETLB): 1GB pages.
        /// Value: (30 << 26) = 0x78000000.
        const HUGE_1GB       = 0x78000000;

        /// Automatically apply F_SEAL_EXEC (prevents mmap with PROT_EXEC).
        /// Added in Linux 6.3. Implies ALLOW_SEALING.
        const NOEXEC_SEAL    = 0x0008;

        /// Allow the file to be mapped executable. When /proc/sys/vm/memfd_noexec
        /// is 2 (restrictive mode), this flag is required to map as PROT_EXEC.
        /// Added in Linux 6.3. Conflicts with NOEXEC_SEAL (returns EINVAL).
        const EXEC           = 0x0010;
    }
}

name parameter

The name string appears in /proc/PID/fd/N as memfd:name. Maximum length is 249 characters (MFD_NAME_MAX_LEN in Linux; names exceeding this are rejected with EINVAL, in both Linux and UmkaOS). The name is informational only; there is no filesystem lookup by name.

Internal representation

/// An anonymous in-memory file created by memfd_create(2).
/// Backed by UmkaOS's page cache; supports read, write, mmap, ftruncate, lseek.
pub struct AnonFile {
    /// Descriptive name for /proc/PID/fd/ display. Max 249 chars.
    pub name: ArrayString<249>,

    /// Current file size in bytes. mmap() beyond this → SIGBUS.
    /// Changed by ftruncate(); cannot shrink below sealed size if F_SEAL_SHRINK set.
    pub size: AtomicU64,

    /// Page cache backing the file content.
    /// Pages are ordinary page-cache pages; they are evictable under memory pressure
    /// unless the file is sealed and mapped (sealing keeps pages pinned).
    pub page_cache: PageCache,

    /// Applied seals (F_SEAL_*). Append-only: seals can only be added, never removed.
    pub seals: AtomicU32,

    /// Flags from memfd_create call, retained for /proc inspection.
    pub flags: MemfdFlags,

    /// Hugetlb page size (0 if not HUGETLB). One of PAGE_SIZE, HUGE_2MB, HUGE_1GB.
    pub huge_page_size: usize,
}

Seal constants (from fcntl F_ADD_SEALS / F_GET_SEALS, exact Linux values):

bitflags::bitflags! {
    /// File seals for memfd. Applied via fcntl(fd, F_ADD_SEALS, seals).
    pub struct MemfdSeals: u32 {
        /// Prevent any future seal modifications (seal the seals).
        const SEAL_SEAL    = 0x0001;
        /// Prevent ftruncate() from shrinking the file.
        const SEAL_SHRINK  = 0x0002;
        /// Prevent ftruncate() from growing the file.
        const SEAL_GROW    = 0x0004;
        /// Prevent write() and mmap(PROT_WRITE) (shared writable mappings).
        const SEAL_WRITE   = 0x0008;
        // 0x0010 is F_SEAL_FUTURE_WRITE in Linux; not modelled here.
        /// Prevent mmap(PROT_EXEC) and mprotect(PROT_EXEC). Added Linux 6.3.
        const SEAL_EXEC    = 0x0020;
    }
}

File operations

The AnonFile is exposed through the VFS layer (Section 10) via a FileOps implementation:

  • read(offset, buf): copies page-cache pages into buf. Standard copy_to_user.
  • write(offset, buf): copies buf into page-cache pages, allocating pages on demand. Fails with EPERM if SEAL_WRITE is set.
  • mmap(offset, len, prot, flags):
      • MAP_SHARED | PROT_WRITE: fails if SEAL_WRITE set.
      • MAP_SHARED | PROT_EXEC or MAP_PRIVATE | PROT_EXEC: fails if SEAL_EXEC set.
      • With HUGETLB: allocates pages from the hugetlb pool (Section 4.1.4), uses huge PTEs.
      • Private mappings (MAP_PRIVATE) are copy-on-write from the page cache.
  • ftruncate(size):
      • If new size < old size and SEAL_SHRINK set: EPERM.
      • If new size > old size and SEAL_GROW set: EPERM.
      • Adjusts AnonFile.size. Pages beyond new size are freed if shrinking.
  • lseek(offset, whence): standard file seek semantics.
  • fcntl(F_ADD_SEALS, seals): requires ALLOW_SEALING (else EPERM). Adds seal bits atomically. Cannot remove seals. SEAL_SEAL prevents any further seal additions.
  • fcntl(F_GET_SEALS): returns current seal bitmask.

Algorithm — memfd_create

memfd_create(name, flags):
  1. Validate name: strlen(name) <= 249. Return EINVAL if too long.
  2. Validate flags: EXEC and NOEXEC_SEAL cannot both be set. Return EINVAL if both.
  3. Validate flags: only known bits set. Return EINVAL for unknown bits.
  4. If HUGETLB set: validate HUGE_2MB or HUGE_1GB is consistent. Return EINVAL
     if incompatible size bits set.
  5. Allocate AnonFile struct in slab. Return ENOMEM if allocation fails.
  6. Initialise AnonFile:
       name = copy of name param.
       size = 0.
       page_cache = PageCache::new(anonfile_ops).
       seals = 0.
       flags = flags param.
       huge_page_size = if HUGE_2MB: HUGE_2MB_SIZE elif HUGE_1GB: HUGE_1GB_SIZE else 0.
  7. If NOEXEC_SEAL: set seals |= SEAL_EXEC.
  8. Allocate a file descriptor in the current process's fd table.
     Return EMFILE if per-process fd limit (RLIMIT_NOFILE) reached.
  9. Install AnonFile into fd table as a FileDescriptor with the AnonFileOps vtable.
  10. If CLOEXEC: set FD_CLOEXEC on the fd.
  11. Return fd.

Use cases

  • JIT compilers: memfd_create + ftruncate + write code + F_SEAL_WRITE + mmap(PROT_EXEC). The seal prevents runtime modification after compilation.
  • systemd unit file passing: service manager writes config into memfd, passes fd to child via LISTEN_FDS. Sealed with F_SEAL_WRITE so the child cannot corrupt it.
  • Container rootfs: overlay layers backed by memfds for ephemeral container filesystems.
  • Anonymous shared memory: multiple processes share a memfd via fd passing (e.g., over Unix socket SCM_RIGHTS), replacing POSIX shm_open / SysV shmget.

Error cases

Error Condition
EINVAL name too long (>249 chars); conflicting flags (EXEC + NOEXEC_SEAL); unknown flag bits; incompatible HUGE_* bits
EMFILE Per-process fd limit reached
ENFILE System-wide open file limit reached
ENOMEM Kernel cannot allocate AnonFile struct

Linux compatibility: syscall number, flag values, seal constants, and all ioctl semantics match Linux 3.17+ (memfd_create introduction) through 6.3+ (NOEXEC_SEAL/EXEC). As in Linux, names longer than 249 characters are rejected with EINVAL rather than silently truncated, which surfaces bugs in callers that pass too-long names.


4.6.6 memfd_secret — Create a Secret Memory Region

Syscall signature

int memfd_secret(unsigned int flags);

Returns a file descriptor on success, -1 on error. The fd is then used with mmap() to create a virtual memory region that is inaccessible to the kernel itself.

Availability

UmkaOS enables memfd_secret unconditionally on x86-64 and AArch64 (where hardware assists are available or direct-map manipulation is safe). On ARMv7, RISC-V 64, PPC32, and PPC64LE, UmkaOS returns ENOSYS — these architectures lack the address-space topology needed to safely excise regions from the kernel direct map without excessive TLB maintenance cost.

Flags

Only O_CLOEXEC (0x80000) is defined for memfd_secret. All other flag bits are reserved and return EINVAL if set.

Security model

The kernel direct map (physmap / PAGE_OFFSET region) maps all physical RAM at a fixed virtual address in the kernel address space. Any code with kernel-mode execution can read any RAM via this mapping, including secret process data. memfd_secret defeats this by removing the direct-map PTEs for the selected physical pages.

After memfd_secret and mmap():

  1. The physical pages backing the secret region are allocated normally.
  2. The process page table has normal PTEs for the region (present, readable, writable).
  3. The kernel direct-map page table has no PTEs for those physical frame numbers.
  4. A per-kernel SecretRegions set tracks which PFNs have been excised.
  5. Any kernel code path that would access those PFNs via the direct map (e.g., copy_to_user, copy_from_user, kmap, swap writeback) checks SecretRegions and returns EFAULT or refuses the operation.

Internal data structures

/// RCU-protected hash map for read-mostly data.
///
/// # Concurrency model
/// - **Read side**: takes an `RcuReadGuard` only (no spinlock). Reads the current
///   bucket array via an `RcuPointer` load-acquire, then traverses the bucket chain.
///   Multiple concurrent readers on different CPUs are always safe.
/// - **Write side**: acquires `write_lock` (a `SpinLock`), then performs the update
///   (insert, remove, or resize) before publishing the new state via an RCU
///   store-release. Concurrent writers are serialised by `write_lock`.
/// - **Consistency**: Readers see either the pre-update or the post-update state,
///   never a torn intermediate. Removed entries are freed only after an RCU grace
///   period, so readers that observed the entry before removal can finish safely.
///
/// # Resize policy
/// The bucket array doubles in capacity when the load factor exceeds 0.75. Resize
/// allocates a new `BucketArray`, rehashes all live entries, then publishes it with
/// `RcuPointer::store_release`. The old array is freed after the next RCU grace
/// period. Resizing is O(n) and serialised under `write_lock`.
///
/// # Type parameters
/// - `K`: Key type — must implement `Hash + Eq + Copy + Send + Sync`.
/// - `V`: Value type — must implement `Clone + Send + Sync`.
///
/// # Complexity
/// - Lookup: O(1) amortised, O(n) worst case (all keys hash to one bucket).
/// - Insert / remove: O(1) amortised under `write_lock`.
/// - Resize: O(n) amortised across n insertions (doubles capacity at 75% load).
pub struct RcuHashMap<K, V> {
    /// Current bucket array, replaced atomically on resize.
    /// Readers dereference this under an `RcuReadGuard`; writers replace it
    /// under `write_lock` using `store_release` to publish the new array.
    buckets: RcuPointer<BucketArray<K, V>>,
    /// Write-side serialisation lock. Only held during insert, remove, and resize.
    /// Must NOT be held while sleeping or while performing I/O.
    write_lock: SpinLock<()>,
    /// Approximate entry count. Updated under `write_lock`; read with `Relaxed`
    /// ordering by callers checking the load factor before triggering a resize.
    count: AtomicUsize,
}

/// Flat array of bucket heads. Replaced as a unit when the map is resized.
/// Allocated from the slab allocator as a single contiguous object; freed
/// (via `rcu_call`) after the current grace period expires.
///
/// The flexible-array field `buckets` contains exactly `len` entries.
/// `BucketArray` is always heap-allocated; it is never embedded in another struct.
struct BucketArray<K, V> {
    /// Number of buckets in this array. Always a power of two so that
    /// `hash & (len - 1)` is a valid O(1) bucket index.
    len: usize,
    /// Bucket heads — each is the head of an RCU-managed singly-linked list.
    /// Entries in the same bucket are linked via `HashEntry::next`.
    buckets: [RcuPointer<HashEntry<K, V>>],  // flexible array: len entries
}

/// One key-value entry in the hash map.
/// Entries are prepended to their bucket's list with a store-release;
/// stale entries are unlinked under `write_lock` and freed after a grace period.
struct HashEntry<K, V> {
    /// Full 64-bit hash of `key`, retained to accelerate resize rehashing
    /// (avoids recomputing the hash for every entry on every resize).
    hash: u64,
    pub key:   K,
    pub value: V,
    /// Next entry in this bucket's chain. `RcuPointer::null()` terminates the list.
    next: RcuPointer<HashEntry<K, V>>,
}

/// Global registry of PFNs that have been excised from the kernel direct map.
/// Kernel code must check this before accessing any PFN via the direct map.
/// RcuHashMap provides read-side lock-free access (O(1) amortised).
/// Write side (add/remove PFN) holds a spinlock.
static SECRET_REGIONS: RcuHashMap<Pfn, SecretPageInfo> = RcuHashMap::new();

pub struct SecretPageInfo {
    /// The virtual address in the owning process that maps this PFN.
    pub owner_va: VirtAddr,
    /// The owning process's PID, for diagnostics.
    pub owner_pid: Pid,
    /// Hardware encryption key ID (AMD SME), if hardware encryption active.
    /// 0 if not using hardware encryption.
    pub sme_key_id: u32,
}

/// Per-process set of secret VMAs. Tracked for munmap() and process-exit cleanup.
///
/// `Vec` allocation here is intentional and safe. `mmap(MAP_SECRET)` and `munmap()`
/// are **process context** operations (never called from interrupt context or from
/// within the memory reclaim path). `Vec::push` uses `GFP_KERNEL` semantics — it can
/// block and invoke the page reclaimer, but the caller (a user syscall) already holds
/// no locks that the reclaimer needs, so there is no deadlock risk.
/// A typical process has 0-5 secret VMAs; `Vec` is appropriate for this cardinality.
pub struct SecretVmaSet {
    pub entries: Vec<SecretVmaEntry>,
}

pub struct SecretVmaEntry {
    pub va_start: VirtAddr,
    pub va_end:   VirtAddr,
    pub pfns:     Vec<Pfn>,
}

Algorithm — mmap on a memfd_secret fd

mmap(addr, len, PROT_READ|PROT_WRITE, MAP_SHARED, secret_fd, 0):
  1. Only MAP_SHARED is permitted. MAP_PRIVATE returns EINVAL (secret pages cannot be
     CoW'd — a copy would not inherit the kernel direct-map excision).
  2. Only PROT_READ, PROT_WRITE, or both are permitted. PROT_EXEC returns EINVAL
     (executable secret regions create JIT-compiler attack surfaces).
  3. Validate len is PAGE_SIZE-aligned and non-zero.
  4. Allocate physical pages for the region:
       pages = buddy_alloc(len / PAGE_SIZE, GFP_KERNEL | GFP_ZERO)
       Return ENOMEM if allocation fails.
  5. Install process PTEs: for each allocated PFN, install a present, user-accessible
     PTE at the requested virtual address in the process page table.
  6. Excise from kernel direct map:
       For each PFN in the allocation:
         a. Clear the PTE at (KERNEL_PHYSMAP_BASE + pfn * PAGE_SIZE) in the kernel
            page table. This is a single store to a kernel PTE.
         b. Insert PFN into SECRET_REGIONS (acquires write spinlock, O(1)).
       Flush kernel TLB for all excised addresses on all CPUs:
         tlb_flush_kernel_range(phys_to_virt(pfn * PAGE_SIZE), len).
       This TLB flush is expensive (IPI to all CPUs) but occurs once at mmap() time,
       not on every secret region access.
  7. If AMD SME is available (detected via CPUID leaf 0x8000001F):
       a. Assign an ephemeral C-bit encryption key to these pages.
       b. Set the C-bit in the process PTEs (memory encryption enabled for these pages).
       c. Hardware encrypts all writes to these pages with the process-specific key.
       d. Store key_id in SecretPageInfo.
     If ARM64 Realm (CCA) is available:
       a. Assign these pages to the Realm granule table (Realm PA space).
       b. Normal world (kernel) cannot access Realm-assigned pages; accesses fault.
  8. Record the VMA in the process's SecretVmaSet for cleanup.
  9. Return the mapped VA.

Algorithm — munmap / process exit cleanup

On munmap(addr, len) covering a secret region:
  1. Look up SecretVmaEntry for [addr, addr+len).
  2. Restore kernel direct-map PTEs:
       For each PFN in the entry:
         a. Remove from SECRET_REGIONS.
         b. Re-install PTE at (KERNEL_PHYSMAP_BASE + pfn * PAGE_SIZE).
     Flush kernel TLB for restored range.
  3. Zero the physical pages (prevent data leakage before returning to buddy allocator).
  4. Remove process PTEs for the region.
  5. Return pages to buddy allocator.
  6. Remove SecretVmaEntry from the process's SecretVmaSet.

Limitations

  • Fork semantics with active memfd_secret regions: fork() succeeds unconditionally. The child process does NOT inherit the secret mapping — the secret fd is closed in the child (set to -1 in the child's fd table) and the secret VMA is absent from the child's address space. The secret region remains active in the parent only. If the child needs its own secret region, it creates a new memfd_secret fd after fork. This matches the principle of least surprise: fork gives the child a clean slate for sensitive memory, preventing accidental secret inheritance.
  • Cannot read() or write() via the fd: the fd has no data operations. Content is accessible only through the mmap'd region. read()/write() on the fd return EBADF.
  • Pages are pinned (cannot be swapped): swap requires the kernel to write the page to a swap device, which requires kernel direct-map access. Since the direct-map entries are excised, swap is impossible. The pages are PG_mlocked to prevent reclaim.
  • No userfaultfd on secret regions: uffd requires kernel access to copy pages (UFFDIO_COPY), which conflicts with the direct-map excision. Attempting to register a secret region with uffd returns EINVAL.
  • Single process only: the fd cannot be shared via SCM_RIGHTS (returns EACCES on the receiving end if the receiving process attempts to mmap it, since the kernel cannot install cross-process direct-map entries consistently).

Kernel code protection

All kernel paths that access user memory must check SECRET_REGIONS:

/// Safe wrapper for kernel reads from user VA. Returns Err(Efault) if the
/// physical page backing `user_va` is in the secret region set.
pub fn copy_from_user(kernel_dst: &mut [u8], user_va: VirtAddr) -> Result<(), Efault> {
    let pfn = va_to_pfn(user_va)?;
    if SECRET_REGIONS.contains(&pfn) {
        // Refuse: this page has been excised from the kernel direct map.
        return Err(Efault);
    }
    // SAFETY: pfn is accessible via direct map (not in SECRET_REGIONS).
    unsafe { copy_from_direct_map(kernel_dst, pfn) }
}

The RcuHashMap read-side check is O(1) and lock-free. On x86-64 with SME, the hardware C-bit provides a secondary enforcement layer: even if copy_from_user skips the software check (e.g., via a kernel exploit), the hardware will return encrypted garbage rather than plaintext.

Error cases

Error Condition
ENOSYS Architecture does not support memfd_secret (ARMv7, RISC-V, PPC)
EPERM Caller is unprivileged and /proc/sys/vm/memfd_secret_allowed is 0
ENOMEM Cannot allocate physical pages for the secret region
EINVAL Unknown flags; PROT_EXEC on mmap; MAP_PRIVATE on mmap; size not page-aligned
EBADF read()/write() attempted on the secret fd

Linux compatibility: syscall number and basic fd semantics match Linux 5.14+. The AMD SME encryption and ARM64 Realm extensions are UmkaOS-specific enhancements that operate transparently (userspace sees the same interface; hardware provides additional enforcement).


4.6.7 process_vm_readv / process_vm_writev — Cross-Process Memory I/O

Syscall signatures

ssize_t process_vm_readv(pid_t pid,
                         const struct iovec *local_iov,  size_t liovcnt,
                         const struct iovec *remote_iov, size_t riovcnt,
                         unsigned long flags);

ssize_t process_vm_writev(pid_t pid,
                          const struct iovec *local_iov,  size_t liovcnt,
                          const struct iovec *remote_iov, size_t riovcnt,
                          unsigned long flags);

Both calls return the total number of bytes transferred on success, or -1 on error. Partial transfers (when the remote range spans valid and invalid pages) return the count of bytes successfully transferred before the first fault.

Parameters

  • pid: target process identifier (raw PID, not pidfd — Linux API uses raw PID here).
  • local_iov[liovcnt]: scatter/gather descriptors in the caller's address space. iov_base is a VA in the current process; iov_len is the byte count.
  • remote_iov[riovcnt]: scatter/gather descriptors in the target process's address space.
  • flags: must be 0. All other values return EINVAL. Reserved for future extensions.

flags field

bitflags::bitflags! {
    /// Flags for process_vm_readv / process_vm_writev.
    /// Currently only the zero value is valid. Reserved bits are rejected with EINVAL.
    pub struct ProcessVmFlags: u64 {
        // No flags defined. Field reserved for future use.
    }
}

Permission model

The caller must have one of:

  1. PTRACE_MODE_ATTACH_REALCREDS permission over the target process (the same check as ptrace attach), evaluated via ptrace_may_access(target, PTRACE_MODE_ATTACH_REALCREDS).
  2. The same real UID/GID as the target, a target that has not set PR_SET_DUMPABLE to non-dumpable, and no LSM veto on the access.

Cannot cross user namespace boundaries unless the caller has CAP_SYS_PTRACE in the target's user namespace.

The permission check uses the real credentials of the calling thread (current_real_cred), not the effective credentials, matching Linux's ptrace_may_access semantics.
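The two-path permission gate can be sketched as follows. This is an illustrative model, not the authoritative implementation: the Task fields and the pre-resolved capability flag stand in for the real credential structures, and the LSM hook is elided.

```rust
/// Illustrative real-credential pair (field names are assumptions).
#[derive(Clone, Copy, PartialEq)]
pub struct RealCreds {
    pub ruid: u32,
    pub rgid: u32,
}

/// Minimal task model for the sketch.
pub struct Task {
    pub real_creds: RealCreds,
    pub dumpable: bool,
    /// CAP_SYS_PTRACE held in the *target's* user namespace
    /// (the namespace resolution itself is elided here).
    pub cap_sys_ptrace_in_target_ns: bool,
}

/// Returns true if `caller` may perform cross-process memory I/O on `target`.
pub fn may_access_vm(caller: &Task, target: &Task) -> bool {
    // Path 1: full ptrace-attach permission (PTRACE_MODE_ATTACH_REALCREDS),
    // which CAP_SYS_PTRACE in the target's user namespace always grants.
    if caller.cap_sys_ptrace_in_target_ns {
        return true;
    }
    // Path 2: same real UID/GID and a dumpable target (no LSM veto, elided).
    caller.real_creds == target.real_creds && target.dumpable
}
```

Note that the check intentionally reads real credentials, so a setuid binary cannot be used to widen access via its effective UID.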

Algorithm — process_vm_readv

process_vm_readv(pid, local_iov, liovcnt, remote_iov, riovcnt, flags):
  1. Validate flags == 0. Return EINVAL if not.
  2. Validate liovcnt and riovcnt are <= IOV_MAX (1024). Return EINVAL if exceeded.
  3. Copy local_iov and remote_iov arrays from userspace. Return EFAULT if either
     pointer is inaccessible.
  4. Compute transfer size: transfer_size = min(sum(local_iov[i].iov_len),
     sum(remote_iov[j].iov_len)). Local and remote totals need not match;
     the transfer proceeds up to the smaller of the two sums.
  5. Resolve target process: task_from_pid(pid). Return ESRCH if not found.
     Take a reference to prevent the target from exiting during the operation.
  6. Permission check: ptrace_may_access(target). Return EPERM if denied.
  7. Acquire target MM read-lock: target.mm.read_lock().
  8. Initialize local_cursor and remote_cursor (index + byte offset within current iov).
  9. total_copied = 0.
  10. While remote bytes remain:
        a. Get next remote segment: remote_va, remote_len from remote_iov[remote_cursor].
        b. Get next local segment: local_va, local_len from local_iov[local_cursor].
        c. chunk = min(remote_len, local_len).
        d. Check if remote_va is in SECRET_REGIONS (any PFN in [remote_va, remote_va+chunk)).
           If so: return EFAULT (cannot read secret pages cross-process).
        e. Pin remote pages: get_user_pages(target.mm, remote_va, chunk,
                                            FOLL_REMOTE | FOLL_GET, &pages).
           Returns number of pages successfully pinned.
           If 0 pages pinned: release target MM read-lock; return EFAULT.
        f. If chunk > LARGE_COPY_THRESHOLD (1MB) and chunk is PAGE_SIZE-aligned:
             Use remap_copy_path (see UmkaOS improvement below).
           Else:
             memcpy from pinned pages into local_va, splitting the copy at
             page boundaries (kmap each page, copy the in-page span, kunmap).
        g. Unpin pages: put_user_pages(pages, num_pages).
        h. total_copied += chunk.
        i. Advance local_cursor and remote_cursor by chunk.
  11. Release target MM read-lock.
  12. Drop target process reference.
  13. Return total_copied.
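The transfer budget from step 4 and the cursor advance from step 10 can be modelled in isolation. The sketch below, with illustrative names, emits the sequence of chunk sizes the copy loop would use; the real loop additionally pins pages and performs the copy for each chunk.

```rust
/// Given the local and remote iov_len sequences, return the chunk sizes the
/// copy loop produces. Sketch only: page pinning and copying are elided.
pub fn copy_chunks(local_lens: &[usize], remote_lens: &[usize]) -> Vec<usize> {
    // Step 4: transfer proceeds up to the smaller of the two totals.
    let budget: usize = local_lens.iter().sum::<usize>()
        .min(remote_lens.iter().sum());
    let (mut li, mut loff) = (0usize, 0usize); // local cursor: index + offset
    let (mut ri, mut roff) = (0usize, 0usize); // remote cursor: index + offset
    let mut done = 0;
    let mut chunks = Vec::new();
    while done < budget {
        // Advance past exhausted (or zero-length) segments.
        while local_lens[li] == loff { li += 1; loff = 0; }
        while remote_lens[ri] == roff { ri += 1; roff = 0; }
        // Step 10c: chunk = min(remaining local, remaining remote).
        let chunk = (local_lens[li] - loff)
            .min(remote_lens[ri] - roff)
            .min(budget - done);
        chunks.push(chunk);
        done += chunk;
        loff += chunk;
        roff += chunk;
    }
    chunks
}
```

For example, reading one 6000-byte remote range into two 4096-byte local buffers yields chunks of 4096 and 1904 bytes, and a transfer stops cleanly at the smaller total when the sums differ.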

Algorithm — process_vm_writev

Identical to process_vm_readv except that data flows from local to remote: the get_user_pages call adds FOLL_WRITE to obtain writable mappings of the remote pages, and the copy direction is reversed.

UmkaOS improvement: efficient large transfers

For transfers larger than 1MB where both source and destination are page-aligned, UmkaOS substitutes a direct remapping path (remap_copy_path) for the generic per-page copy path:

remap_copy_path(src_mm, src_va, dst_va, len):
  1. Allocate a temporary kernel VA window (using vmalloc area).
  2. For each PAGE_SIZE-aligned chunk of [src_va, src_va+len):
       a. Get the PFN of the source page (from src_mm's page tables).
       b. Map that PFN into the temporary kernel VA window.
       c. Map the target (local) address as a writable page in the current mm.
       d. Use the architecture's optimised memcpy (rep movsb with ERMS on
          x86-64, NEON on AArch64) to copy from the kernel window to the
          local page.
     Note: no intermediate heap buffer is allocated — data goes directly from the
     source physical page to the destination physical page via kmap/kunmap.
  3. Release temporary kernel VA window.

For workloads that regularly read >1MB from another process (e.g., a debugger inspecting a large heap), this avoids the overhead of allocating and freeing a large intermediate buffer for each transfer.

Thread safety and raciness

The target process may be running concurrently. Reading its memory while it modifies it is inherently racy. This is intentional and matches Linux semantics — process_vm_readv is documented as providing no atomicity guarantees. The target MM read-lock is held only to pin pages, not for the duration of the copy; the target can create or destroy VMAs while the copy is in progress (page pinning prevents the physical page from being reclaimed, but the VMA map may change).

Callers (debuggers, profilers, GC engines) that require consistent reads must arrange their own synchronisation with the target (e.g., ptrace SIGSTOP).

Cannot read memfd_secret regions

Step 10d of the algorithm above checks SECRET_REGIONS. Any attempt to read or write a PFN in the secret set returns EFAULT. The caller cannot work around this by splitting the access across different remote iov segments — the check is per-PFN, covering any sub-page access.

Error cases

Error Condition
ESRCH No process with the given PID, or process is a zombie
EPERM Caller does not have ptrace permission over the target
EFAULT Remote address faults (not mapped, not pinnable, or in secret region)
EINVAL flags is non-zero; liovcnt or riovcnt > IOV_MAX
ENOMEM Cannot pin target pages (target has too many pinned pages)

Linux compatibility: syscall numbers, argument order, iovec struct layout, return value semantics, and permission model match Linux 3.2+. The secret-region EFAULT and the large-transfer optimised path are UmkaOS extensions that are transparent to callers.


4.6.8 process_madvise — Batch madvise for Another Process

Syscall signature

ssize_t process_madvise(int pidfd, const struct iovec *iovec, size_t vlen,
                        int advice, unsigned int flags);

Returns total bytes advised on success (sum of iov_len for successfully processed iov entries), -1 on error. Like madvise(), this is advisory — the kernel may ignore the hints.

Why pidfd instead of raw PID

Using a pidfd (a file descriptor referring to a specific process, created by pidfd_open()) eliminates the TOCTOU race inherent in raw PID numbers: a raw PID may be recycled between the lookup and the advise operation, accidentally advising the wrong process. A pidfd refers to a specific process struct; even if the process exits, the pidfd remains valid (the kernel keeps the process struct alive) and subsequent operations on it return ESRCH cleanly. This matches the Linux 5.10+ pidfd-based API design.

Advice values

All MADV_* constants match Linux exactly:

/// Advice values for process_madvise(). Values match Linux MADV_* constants.
#[repr(i32)]
pub enum MadviseAdvice {
    /// Mark pages as will-be-needed soon; kernel prefetches.
    Willneed      = 3,
    /// Mark pages as not needed; kernel may free them.
    Dontneed      = 4,
    /// Mark pages as freeable (kernel may or may not free immediately).
    Free          = 8,
    /// Enable transparent huge pages on the range.
    Hugepage      = 14,
    /// Disable transparent huge pages on the range.
    Nohugepage    = 15,
    /// Mark pages as cold; they are candidates for reclaim before warmer pages.
    Cold          = 20,
    /// Request immediate reclaim of the specified pages.
    Pageout       = 21,
    /// UmkaOS extension: mark pages as discardable (memory hibernation hint, §4.1.3.3).
    /// The kernel may compress or offload these pages under memory pressure.
    /// Value 256 begins the UmkaOS madvise extension namespace (≥256 = UmkaOS-only).
    Discardable   = 256,
    /// UmkaOS extension: mark pages as critical (must not be reclaimed or compressed).
    /// Used by real-time and latency-sensitive allocations.
    Critical      = 257,
}

MADV_DISCARDABLE (256) and MADV_CRITICAL (257) are UmkaOS extensions. Values ≥ 256 are reserved as the UmkaOS madvise extension namespace. Linux's current allocation reaches MADV_COLLAPSE = 25, with MADV_HWPOISON = 100 and MADV_SOFT_OFFLINE = 101 as isolated arch-specific outliers. Using 256+ provides a clean separation that survives Linux's natural growth at ~2 hints per major release (a conflict would require ~115 years of Linux development). These hints integrate with the process memory hibernation subsystem described in Section 4.1.3.3 and allow a privileged memory manager daemon to set hibernation hints on behalf of application processes.

Permission model

Advice type                  Required capability
MADV_COLD, MADV_WILLNEED     CAP_SYS_NICE or same real UID as target
MADV_PAGEOUT, MADV_FREE,
  MADV_DONTNEED              CAP_SYS_ADMIN
MADV_HUGEPAGE,
  MADV_NOHUGEPAGE            CAP_SYS_NICE or same real UID as target
MADV_DISCARDABLE,
  MADV_CRITICAL              CAP_SYS_NICE (UmkaOS extension)

Destructive advice (MADV_DONTNEED, MADV_FREE, MADV_PAGEOUT) requires CAP_SYS_ADMIN because it can cause data loss in the target process.
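The capability table above maps onto a straightforward dispatch. The sketch below is illustrative: the capability and UID inputs are passed as booleans, where the real check_process_madvise_permission would read the caller's credential struct.

```rust
// MADV_* values from the MadviseAdvice enum above.
const MADV_WILLNEED: i32 = 3;
const MADV_DONTNEED: i32 = 4;
const MADV_FREE: i32 = 8;
const MADV_HUGEPAGE: i32 = 14;
const MADV_NOHUGEPAGE: i32 = 15;
const MADV_COLD: i32 = 20;
const MADV_PAGEOUT: i32 = 21;
const MADV_DISCARDABLE: i32 = 256; // UmkaOS extension
const MADV_CRITICAL: i32 = 257;    // UmkaOS extension

/// Sketch of the per-advice permission check.
pub fn madvise_permitted(advice: i32, cap_sys_admin: bool, cap_sys_nice: bool,
                         same_real_uid: bool) -> bool {
    match advice {
        // Destructive advice can lose target data: CAP_SYS_ADMIN only.
        MADV_DONTNEED | MADV_FREE | MADV_PAGEOUT => cap_sys_admin,
        // Non-destructive hints: CAP_SYS_NICE or same real UID as target.
        MADV_COLD | MADV_WILLNEED | MADV_HUGEPAGE | MADV_NOHUGEPAGE =>
            cap_sys_nice || same_real_uid,
        // UmkaOS hibernation hints require CAP_SYS_NICE.
        MADV_DISCARDABLE | MADV_CRITICAL => cap_sys_nice,
        // Unknown advice is rejected earlier with EINVAL (algorithm step 3).
        _ => false,
    }
}
```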

flags parameter

Must be 0. All other values return EINVAL. Reserved for future per-call options (e.g., PROCESS_MADVISE_ASYNC for non-blocking batch processing).

vlen limit

UmkaOS enforces a maximum of 1024 iov entries per call. Without a cap, a caller could pass millions of tiny iov entries to pin the kernel in process_madvise for an unbounded time; the 1024-entry limit caps the worst-case kernel time at ~100μs (1024 VMAs × ~100ns per VMA lookup).

Internal data structures

No new persistent structures; process_madvise is a stateless operation that modifies VMA flags and page table entries in the target process.

Algorithm

process_madvise(pidfd, iovec, vlen, advice, flags):
  1. Validate flags == 0. Return EINVAL if not.
  2. Validate vlen <= 1024. Return EINVAL if exceeded.
  3. Validate advice is a known MadviseAdvice value. Return EINVAL if not.
  4. Resolve process from pidfd: pidfd_get_task(pidfd). Return EBADF if pidfd invalid.
     Return ESRCH if process has exited.
  5. Permission check: check_process_madvise_permission(current, target, advice).
     Return EPERM if not permitted.
  6. Copy iovec array from userspace (vlen entries). Return EFAULT if pointer bad.
  7. Validate each iov entry: iov_base page-aligned, iov_len non-zero.
     Return EINVAL if any entry fails validation.
  8. Acquire target MM read-lock.
  9. Pre-scan: collect all VMAs covering each iov range, validate none span unmapped
     holes. Collect set of CPUs with threads from target running (for TLB flush).
     (Pre-scan allows us to fail early without partial effects for advisory operations.
     For destructive advice, partial effects are acceptable and we proceed range-by-range.)
  10. total_bytes = 0.
  11. Collect pending_tlb_flush = TlbFlushSet::new().
  12. For each iov[i] in iovec:
        a. range = [iov[i].iov_base, iov[i].iov_base + iov[i].iov_len).
        b. Find VMA(s) covering range. Return ENOMEM if any sub-range is unmapped.
        c. Call madvise_vma(target.mm, vma, range, advice,
                            &mut pending_tlb_flush).
           This updates VMA flags and/or PTE bits as needed for the advice.
           For MADV_COLD: clear PTE accessed bits in range.
           For MADV_PAGEOUT: mark pages for immediate reclaim via page_reclaim_direct().
           For MADV_DONTNEED: unmap pages (free anon, drop file cache references).
           For MADV_FREE: mark pages with PG_lazyfree; reclaimed on memory pressure.
           For MADV_HUGEPAGE / MADV_NOHUGEPAGE: update VMA THP flags.
           For MADV_DISCARDABLE / MADV_CRITICAL: update VMA hibernation hint flags.
        d. total_bytes += iov[i].iov_len.
  13. UmkaOS improvement: issue a single batched TLB flush for all modified ranges.
        pending_tlb_flush.flush_all_cpus(target_cpu_mask).
        (Linux issues one IPI per range; UmkaOS coalesces all into one IPI per CPU.)
  14. Release target MM read-lock.
  15. Drop target process reference.
  16. Return total_bytes.
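Steps 2 and 7 of the algorithm (the vlen cap and per-entry validation) can be sketched as a pure function. Entries are modelled as (iov_base, iov_len) pairs and the error strings stand in for errno values; both are assumptions of this sketch.

```rust
/// Validate a process_madvise iov array: at most 1024 entries, each with a
/// page-aligned base and a non-zero length. Returns the byte total that
/// becomes total_bytes on full success.
pub fn validate_iovs(iovs: &[(u64, u64)], page_size: u64)
    -> Result<u64, &'static str>
{
    if iovs.len() > 1024 {
        return Err("EINVAL: vlen > 1024"); // step 2
    }
    let mut total: u64 = 0;
    for &(base, len) in iovs {
        if base & (page_size - 1) != 0 {
            return Err("EINVAL: iov_base not page-aligned"); // step 7
        }
        if len == 0 {
            return Err("EINVAL: zero iov_len"); // step 7
        }
        total += len;
    }
    Ok(total)
}
```

Validating the whole array before touching the target MM is what lets advisory operations fail early without partial effects.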

UmkaOS improvement: coalesced TLB flush

Linux's process_madvise calls tlb_gather_mmu and tlb_finish_mmu once per iov range, issuing one TLB flush IPI to each target CPU per range. For 1024 ranges on a 64-CPU system, this is 65,536 IPIs (1024 × 64). On a system where IPIs take ~2μs each, this is 131ms of IPI overhead for a single process_madvise call — unacceptable for a production memory manager daemon.

UmkaOS coalesces: all iov ranges are processed before the TLB flush, accumulating dirty page-table entries in a TlbFlushSet. After all ranges are processed, a single IPI per CPU flushes all modified entries in one shot. For 1024 ranges on 64 CPUs, this is 64 IPIs total — a 1024x reduction in IPI count.

/// Accumulated TLB flush work. Regions are added during madvise processing;
/// flushed in one batch at the end of process_madvise.
pub struct TlbFlushSet {
    /// VA ranges that need TLB invalidation, accumulated across all iov entries.
    pub ranges: SmallVec<[VaRange; 64]>,
    /// CPU mask: which CPUs need to receive the TLB flush IPI.
    pub cpu_mask: CpuMask,
}

impl TlbFlushSet {
    /// Issue a single TLB flush IPI to each CPU in cpu_mask, invalidating all
    /// accumulated VA ranges. The IPI handler on each CPU calls
    /// flush_tlb_multi_range(ranges) to invalidate all ranges at once.
    pub fn flush_all_cpus(&self, extra_mask: CpuMask) {
        let mask = self.cpu_mask | extra_mask;
        smp_send_ipi(mask, IpiKind::TlbFlushMulti(&self.ranges));
    }
}

The IPI handler on each target CPU receives the full list of VA ranges and calls the architecture-specific multi-range TLB invalidation:

  • x86-64: INVLPG per page in each range (or INVPCID type 0, individual-address invalidation, when PCIDs are in use).
  • AArch64: TLBI VAE1IS per page.
  • RISC-V: SFENCE.VMA with address operands per page.
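The page walk inside the multi-range IPI handler can be sketched as follows. To keep the sketch testable it returns the page VAs that would be invalidated; the real handler issues the per-page invalidation instruction (INVLPG, TLBI VAE1IS, or SFENCE.VMA) at each step instead.

```rust
/// Enumerate the pages a multi-range TLB flush would invalidate.
/// Ranges are half-open [start, end) VA pairs, as accumulated in TlbFlushSet.
pub fn pages_to_flush(ranges: &[(u64, u64)], page_size: u64) -> Vec<u64> {
    let mut pages = Vec::new();
    for &(start, end) in ranges {
        let mut va = start & !(page_size - 1); // round down to page boundary
        while va < end {
            pages.push(va); // real handler: one invalidation instruction here
            va += page_size;
        }
    }
    pages
}
```

Because all accumulated ranges are walked inside a single IPI handler invocation, each CPU pays the interrupt entry/exit cost once per process_madvise call rather than once per iov range.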

MADV_PAGEOUT batching

When MADV_PAGEOUT appears across multiple iov ranges, UmkaOS collects all target pages into a single batch and calls reclaim_pages_batch(pages) once. This amortises the per-page reclaim overhead (LRU list manipulation, swap slot allocation) across all pages in the call. Linux calls reclaim_pages() once per VMA range.

MADV_DISCARDABLE and MADV_CRITICAL integration

These UmkaOS-specific advice values set flags in the VMA's vm_flags:

bitflags::bitflags! {
    pub struct VmFlags: u64 {
        // ... standard Linux VM_* flags (exact values) ...
        /// UmkaOS extension: pages in this VMA are discardable under memory pressure.
        /// The memory hibernation path (§4.1.3.3) treats these as low-priority.
        const VM_UMKA_DISCARDABLE = 1 << 56;
        /// UmkaOS extension: pages in this VMA must not be reclaimed or compressed.
        /// Real-time allocations use this to guarantee access latency.
        const VM_UMKA_CRITICAL    = 1 << 57;
    }
}

The memory reclaim path checks VM_UMKA_CRITICAL before evicting any page; pages in critical VMAs are skipped even under severe memory pressure. The swap/compression path checks VM_UMKA_DISCARDABLE to prioritise these pages for early compression or swap.
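How the reclaim path consults these two flags can be sketched as a small classification function. The bit positions match the VmFlags definition above; ReclaimClass and the precedence of the critical bit over the discardable bit are illustrative assumptions of this sketch.

```rust
// Bit positions from the VmFlags definition above.
const VM_UMKA_DISCARDABLE: u64 = 1 << 56;
const VM_UMKA_CRITICAL: u64 = 1 << 57;

/// Reclaim priority derived from a VMA's vm_flags (illustrative).
#[derive(Debug, PartialEq)]
pub enum ReclaimClass {
    /// Never evicted or compressed, even under severe memory pressure.
    Never,
    /// Prioritised for early compression or swap.
    Early,
    /// Default LRU-ordered reclaim.
    Normal,
}

pub fn reclaim_class(vm_flags: u64) -> ReclaimClass {
    if vm_flags & VM_UMKA_CRITICAL != 0 {
        // Critical wins if both bits are somehow set (assumed precedence).
        ReclaimClass::Never
    } else if vm_flags & VM_UMKA_DISCARDABLE != 0 {
        ReclaimClass::Early
    } else {
        ReclaimClass::Normal
    }
}
```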

Error cases

Error Condition
EBADF pidfd is not a valid pidfd
ESRCH Process referenced by pidfd has exited
EPERM Insufficient privilege for the requested advice type
EINVAL flags non-zero; vlen > 1024; unknown advice; non-aligned iov_base; zero iov_len
EFAULT iovec pointer not readable
ENOMEM iov range covers an unmapped region

Linux compatibility: syscall number, pidfd semantics, iovec struct, MADV_COLD and MADV_PAGEOUT advice values, and permission model match Linux 5.10+. MADV_DISCARDABLE and MADV_CRITICAL are UmkaOS extensions using values beyond the Linux-defined range. The 1024-entry vlen cap is an UmkaOS safety limit with no Linux equivalent.