Chapter 17: Containers and Namespaces¶
Namespace architecture (8 types), cgroups v2, POSIX IPC, OCI runtime
Type Definitions Used in This Part¶
/// Unique identifier for a schedulable task within the kernel.
/// Globally unique, never reused (monotonically increasing from boot).
/// Used for PID translation in PID namespaces.
pub type TaskId = u64;
/// Unique identifier for a process (thread group leader).
/// Globally unique, never reused (monotonically increasing from boot).
/// Corresponds to the TGID in Linux terminology; exposed as PID to userspace
/// via getpid(). Used in Process struct, parent/child tracking, and signal delivery.
pub type ProcessId = u64;
/// Handle to a physical page frame. Wraps the frame number.
/// Used by pipe buffers for zero-copy page gifting via vmsplice().
pub struct PhysPage {
/// Physical frame number (PFN).
pub pfn: u64,
}
/// Wait queue head for blocking operations.
/// Used by pipe buffers to block readers/writers.
/// Defined in Section 3.1.6 (umka-core/src/sync/wait.rs).
// WaitQueueHead is defined in Section 3.1.6.2 (03-concurrency.md).
// See that section for the full struct definition and wait/wake protocol.
pub use crate::sync::wait::WaitQueueHead;
/// RCU-protected cell for read-mostly data (defined in Section 3.1.6, 03-concurrency.md).
/// Readers acquire an RCU read guard and call `RcuCell::load()` — lock-free, no
/// cache-line bouncing. Writers call `RcuCell::update(new_value)` which atomically
/// replaces the pointer; the old value is deferred-freed after one RCU grace period.
pub struct RcuCell<T> { _phantom: core::marker::PhantomData<T> }
impl<T> RcuCell<T> {
pub const fn new_empty() -> Self { Self { _phantom: core::marker::PhantomData } }
/// Load current value under an active RCU read guard.
/// Atomically reads the internal pointer (Acquire); returns `None` if null.
/// Full specification: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths--rcu-read-copy-update).
pub fn load(&self) -> Option<Arc<T>> { /* atomic load + Arc::clone */ }
/// Swap to a new value; old value freed after RCU grace period.
/// Atomically replaces the pointer (Release); schedules deferred drop of the
/// previous value via `rcu_call` after all current readers release their guards.
/// Full specification: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths--rcu-read-copy-update).
pub fn update(&self, new: Arc<T>) { /* atomic swap + rcu_call(old) */ }
}
/// RCU-protected immutable pointer. Thinner than `RcuCell`; no internal `Arc`
/// wrapping. Suitable for single-owner data swapped atomically (e.g., UTS string table).
/// Specified here; the underlying RCU primitives are defined in
/// [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths--rcu-read-copy-update).
pub struct RcuPtr<T> { _phantom: core::marker::PhantomData<T> }
/// Integer-to-object ID allocator (analogous to Linux `struct idr`).
/// IDs are allocated monotonically; recycled IDs are never reused within a
/// boot session (prevents PID-reuse exploits). Thread-safe via internal spinlock.
/// Specified here; underlying RCU primitives in
/// [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths--rcu-read-copy-update).
pub struct Idr<T> { _phantom: core::marker::PhantomData<T> }
impl<T> Idr<T> {
/// Assign the next monotonic ID and store `value`. Never reuses IDs within a
/// boot session. Full specification: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths--rcu-read-copy-update).
pub fn allocate(&self, value: T) -> u32 { /* spinlock + radix insert */ }
/// Look up by ID. O(1) radix-tree walk under internal spinlock.
pub fn lookup(&self, id: u32) -> Option<&T> { /* radix lookup */ }
/// Remove and return the value for `id`. The ID is retired permanently.
pub fn remove(&self, id: u32) -> Option<T> { /* spinlock + radix remove */ }
}
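The never-reuse contract is easy to model in userspace: a monotonic counter issues IDs, and removal retires an ID permanently. A minimal sketch — `IdrModel` is a hypothetical stand-in that uses a `HashMap` instead of the radix tree and omits the internal spinlock:

```rust
use std::collections::HashMap;

/// Userspace model of the never-reuse Idr contract: IDs come from a
/// monotonic counter, so a removed ID is retired permanently within the
/// "boot session" (here, the lifetime of the allocator).
struct IdrModel<T> {
    next_id: u32,
    map: HashMap<u32, T>,
}

impl<T> IdrModel<T> {
    fn new() -> Self {
        Self { next_id: 0, map: HashMap::new() }
    }
    fn allocate(&mut self, value: T) -> u32 {
        let id = self.next_id;
        self.next_id += 1; // never decremented: removed IDs are never reissued
        self.map.insert(id, value);
        id
    }
    fn lookup(&self, id: u32) -> Option<&T> {
        self.map.get(&id)
    }
    fn remove(&mut self, id: u32) -> Option<T> {
        self.map.remove(&id)
    }
}

fn main() {
    let mut idr = IdrModel::new();
    let a = idr.allocate("task-a");
    assert_eq!(a, 0);
    idr.remove(a);
    let b = idr.allocate("task-b");
    assert_eq!(b, 1); // ID 0 is retired, not recycled
    assert!(idr.lookup(0).is_none());
}
```

This is exactly the property that prevents PID-reuse exploits: a stale ID held by an attacker can never resolve to a newly created object.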
/// RCU-protected variant of `Idr`. Lookups are lock-free (RCU read guard only);
/// insertions acquire an internal write mutex and publish via RCU swap.
/// Specified here; underlying RCU primitives in
/// [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths--rcu-read-copy-update).
pub struct RcuIdr<T> { _phantom: core::marker::PhantomData<T> }
/// RCU-protected hash map: lock-free reads under RCU guard, serialized writes.
/// Write path clones the bucket list and swaps atomically. Suitable for maps
/// that are read on hot paths (per-packet, per-task) but written rarely.
/// Specified here; underlying RCU primitives in
/// [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths--rcu-read-copy-update).
pub struct RcuHashMap<K, V> { _phantom: core::marker::PhantomData<(K, V)> }
/// RCU-protected growable vector: lock-free reads under RCU guard, serialized writes.
/// The backing array is heap-allocated and published as an atomic pointer. Readers
/// acquire an RCU read guard, load the pointer, and index into the array. Writers
/// allocate a new backing array, copy elements, swap the pointer, and defer-free
/// the old backing via `rcu_call()`. Used for KVM memslots and similar read-mostly
/// collections with rare structural modification.
/// Defined in Section 3.1.6 (03-concurrency.md).
pub struct RcuVec<T> {
ptr: AtomicPtr<RcuVecInner<T>>,
}
/// Internal backing for `RcuVec<T>`.
struct RcuVecInner<T> {
len: usize,
cap: usize,
data: [T], // trailing unsized array
}
impl<T: Clone> RcuVec<T> {
/// Read the current snapshot under an RCU read guard.
pub fn load<'g>(&self, _guard: &'g RcuReadGuard) -> &'g [T] {
// SAFETY: pointer valid while RCU guard held. Load the pointer once;
// a second load could observe a newer backing array with a different len.
let inner = self.ptr.load(Ordering::Acquire);
unsafe { &(*inner).data[..(*inner).len] }
}
/// Replace the backing array. Caller must hold external write mutex.
/// Old backing is deferred-freed after one RCU grace period.
pub fn update(&self, new_elements: &[T]) { /* clone into new backing, swap, rcu_call old */ }
}
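The publish/defer-free discipline of `RcuVec` can be illustrated in plain userspace Rust, with `Arc` reference counting standing in for the RCU grace period (`RcuVecModel` is a hypothetical stand-in, not the kernel type): as long as a reader holds its snapshot, the old backing stays alive even after a writer publishes a new one.

```rust
use std::sync::{Arc, Mutex};

/// Userspace model of the RcuVec publish pattern: readers clone an Arc
/// snapshot (standing in for the RCU read guard), writers build a new
/// backing and swap the shared pointer; the old backing is freed when the
/// last snapshot holder drops it (standing in for the grace period).
struct RcuVecModel<T> {
    current: Mutex<Arc<Vec<T>>>,
}

impl<T: Clone> RcuVecModel<T> {
    fn new(init: Vec<T>) -> Self {
        Self { current: Mutex::new(Arc::new(init)) }
    }
    /// Reader: take a snapshot. Holding the Arc pins the backing alive,
    /// like holding an RCU read guard pins the old array.
    fn load(&self) -> Arc<Vec<T>> {
        Arc::clone(&self.current.lock().unwrap())
    }
    /// Writer: build a new backing and publish it. Existing snapshots keep
    /// seeing the old array until they drop (deferred free).
    fn update(&self, new_elements: &[T]) {
        let new_backing = Arc::new(new_elements.to_vec());
        *self.current.lock().unwrap() = new_backing;
    }
}

fn main() {
    let v = RcuVecModel::new(vec![1, 2, 3]);
    let snapshot = v.load();                // reader enters "critical section"
    v.update(&[4, 5]);                      // writer publishes new backing
    assert_eq!(&*snapshot, &vec![1, 2, 3]); // reader still sees old data
    assert_eq!(&*v.load(), &vec![4, 5]);    // new readers see new data
}
```

The kernel version replaces the `Mutex` with an external write lock and the `Arc` drop with `rcu_call()`, but the visibility guarantees a reader observes are the same.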
/// Namespace type enumeration for hierarchy tracking.
/// UmkaOS implements all 8 Linux namespace types (see Section 8.1.6).
///
/// Uses sequential kernel-internal values (`#[repr(u8)]`). These do NOT
/// correspond to the CLONE_NEW* bitflags passed by userspace (e.g.,
/// CLONE_NEWPID = 0x20000000, CLONE_NEWNET = 0x40000000). Translation
/// from CLONE_NEW* bitflags happens at the syscall boundary via
/// `clone_flag_to_ns_type()` below.
#[repr(u8)]
pub enum NamespaceType {
Pid = 0,
Net = 1,
Mnt = 2,
Uts = 3,
Ipc = 4,
User = 5,
Cgroup = 6,
Time = 7, // Linux 5.6+
}
/// Convert a single CLONE_NEW* bitflag (from clone(2) / unshare(2) flags)
/// to the kernel-internal `NamespaceType`.
///
/// Callers must iterate over each set bit in the `clone_flags` word and
/// call this function once per bit. Returns `None` for bits that are not
/// namespace flags (e.g., CLONE_VM, CLONE_FILES).
pub fn clone_flag_to_ns_type(bit: u64) -> Option<NamespaceType> {
// CLONE_NEW* bit values match Linux's <sched.h>; defined locally because
// the kernel does not link against libc.
const CLONE_NEWTIME: u64 = 0x0000_0080;
const CLONE_NEWNS: u64 = 0x0002_0000;
const CLONE_NEWCGROUP: u64 = 0x0200_0000;
const CLONE_NEWUTS: u64 = 0x0400_0000;
const CLONE_NEWIPC: u64 = 0x0800_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const CLONE_NEWPID: u64 = 0x2000_0000;
const CLONE_NEWNET: u64 = 0x4000_0000;
match bit {
CLONE_NEWPID => Some(NamespaceType::Pid),
CLONE_NEWNET => Some(NamespaceType::Net),
CLONE_NEWNS => Some(NamespaceType::Mnt),
CLONE_NEWUTS => Some(NamespaceType::Uts),
CLONE_NEWIPC => Some(NamespaceType::Ipc),
CLONE_NEWUSER => Some(NamespaceType::User),
CLONE_NEWCGROUP => Some(NamespaceType::Cgroup),
CLONE_NEWTIME => Some(NamespaceType::Time),
_ => None,
}
}
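A caller translating a full `clone_flags` word walks the set bits one at a time, as the doc comment above requires. A minimal userspace sketch of that loop — the constants and a trimmed-down enum are reproduced locally so the example is self-contained, and `requested_namespaces` is a hypothetical helper, not part of the spec:

```rust
#[derive(Debug, PartialEq)]
enum NamespaceType {
    Pid,
    Net,
    User,
}

// Values match Linux's <sched.h>, as quoted above.
const CLONE_NEWUSER: u64 = 0x1000_0000;
const CLONE_NEWPID: u64 = 0x2000_0000;
const CLONE_NEWNET: u64 = 0x4000_0000;

fn clone_flag_to_ns_type(bit: u64) -> Option<NamespaceType> {
    match bit {
        CLONE_NEWPID => Some(NamespaceType::Pid),
        CLONE_NEWNET => Some(NamespaceType::Net),
        CLONE_NEWUSER => Some(NamespaceType::User),
        _ => None, // non-namespace bits (CLONE_VM, CLONE_FILES, ...)
    }
}

/// Collect the namespace types requested by a clone_flags word,
/// skipping non-namespace bits.
fn requested_namespaces(mut flags: u64) -> Vec<NamespaceType> {
    let mut out = Vec::new();
    while flags != 0 {
        let bit = flags & flags.wrapping_neg(); // isolate lowest set bit
        if let Some(t) = clone_flag_to_ns_type(bit) {
            out.push(t);
        }
        flags &= flags - 1; // clear lowest set bit
    }
    out
}

fn main() {
    const CLONE_VM: u64 = 0x0000_0100; // not a namespace flag: skipped
    let got = requested_namespaces(CLONE_NEWUSER | CLONE_NEWNET | CLONE_VM);
    assert_eq!(got, vec![NamespaceType::User, NamespaceType::Net]);
}
```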
Note on Capability<T> syntax: This document uses Capability<NetStack> and Capability<VfsNode> as type hints indicating what resource a capability references. The underlying Capability struct (Section 9.1) is non-generic; the target type is determined by the object_id field. This notation is for documentation clarity only.
17.1 Namespace Architecture¶
Linux namespaces isolate global system resources. In UmkaOS, namespaces are not primitive kernel objects; rather, they are synthesized from UmkaOS's native Capability Domains (Section 9.1) and Virtual Filesystem (VFS) mounts.
17.1.1 Capability Domain Mapping¶
When a process creates a new namespace via clone(CLONE_NEW*) or unshare(), UmkaOS allocates a new Capability Domain or modifies the existing one:
- `CLONE_NEWPID` (PID Namespace): Creates a new PID translation table in the process's Capability Domain. The `umka-sysapi` layer translates local PIDs (e.g., PID 1) to global UmkaOS task IDs.
- `CLONE_NEWNET` (Network Namespace): Creates an isolated network stack instance. On `clone(CLONE_NEWNET)`, the new network namespace's `user_ns` is set to `current_task().nsproxy.user_ns` — the creating task's user namespace owns the new net namespace. This determines which user namespace governs capability checks (`ns_capable(net_ns.user_ns, CAP_NET_ADMIN)` etc.) for the new network namespace.
  - The new namespace has no network interfaces except `lo` (loopback, 127.0.0.1/8)
  - No connectivity to the host or external network unless explicitly configured
  - Network interfaces (physical NICs, VETH pairs, bridges, VLANs) are owned by a specific namespace and cannot be accessed from other namespaces
  - Each namespace has its own routing table, iptables/nftables rules, and socket port space
  - Per-namespace network state is defined below; the `umka-net` subsystem (Section 16.1-38) implements the network stack that operates within these namespace boundaries:
/// Network interface table. Uses XArray for O(1) lookup on the
/// packet receive/transmit path (integer-keyed → XArray per collection policy).
/// Supports up to 2^32 interfaces; typical deployments have 2-16.
pub struct InterfaceTable {
/// XArray indexed by InterfaceIndex (u32). O(1) lookup, RCU-compatible
/// reads, ordered iteration for enumeration (netlink GETLINK, /proc/net/if_inet6).
table: XArray<Arc<NetDevice>>,
}
/// Per-namespace nftables rule set. Contains all firewall chains (input, output,
/// forward, prerouting, postrouting, plus user-defined chains) scoped to a single
/// network namespace. Rule evaluation is on the per-packet hot path and is
/// RCU-protected natively by XArray: readers (packet filter path) never block,
/// writers (rule updates via `nft` commands) use per-entry `xa_store()` / `xa_erase()`
/// under `NetNamespace.config_lock`.
///
/// The `generation` counter is incremented on every rule set mutation. It enables
/// userspace (`nft monitor`) and the connection tracking subsystem to detect stale
/// rule evaluations and re-evaluate affected conntrack entries.
pub struct FirewallRules {
/// All chains in this rule set, keyed by chain ID (u64). XArray is used
/// because the key is an integer (chain ID) — per collection policy,
/// integer-keyed mappings always use XArray. RCU-protected reads for
/// lock-free packet-path traversal.
pub chains: XArray<Arc<NfChain>>,
/// Monotonically increasing generation counter. Incremented (Release) on
/// every rule add/delete/replace. Read (Acquire) by conntrack re-evaluation
/// and `nft monitor` polling.
pub generation: AtomicU64,
}
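The generation-counter protocol can be shown in isolation: Release on mutation, Acquire around evaluation, and a before/after comparison to detect a stale rule evaluation. A minimal userspace model, assuming nothing beyond the doc comment above (`RuleSetModel` and `evaluate_stable` are hypothetical names):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Trimmed model of the FirewallRules generation counter.
struct RuleSetModel {
    generation: AtomicU64,
}

impl RuleSetModel {
    /// Writer side: mutate rules (elided), then bump the generation
    /// with Release so readers that Acquire-load it see the new rules.
    fn mutate_rules(&self) {
        // ... xa_store()/xa_erase() under config_lock would go here ...
        self.generation.fetch_add(1, Ordering::Release);
    }
    /// Reader side: sample the generation before and after evaluating.
    /// Returns true if no rule mutation happened while `evaluate` ran.
    fn evaluate_stable<F: FnOnce()>(&self, evaluate: F) -> bool {
        let before = self.generation.load(Ordering::Acquire);
        evaluate();
        before == self.generation.load(Ordering::Acquire)
    }
}

fn main() {
    let rules = RuleSetModel { generation: AtomicU64::new(0) };
    // Quiet evaluation: generation unchanged, result is trustworthy.
    assert!(rules.evaluate_stable(|| { /* packet evaluated */ }));
    // Concurrent rule change (simulated inline): evaluation flagged stale.
    assert!(!rules.evaluate_stable(|| rules.mutate_rules()));
}
```

Conntrack re-evaluation uses exactly this comparison: a stale result is discarded and the affected entries are re-run against the current chains.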
/// Per-namespace network state.
pub struct NetNamespace {
/// Namespace ID (unique across the system).
pub ns_id: u64,
/// Network interface table. Uses XArray for O(1) lookup on the
/// packet receive/transmit path (integer-keyed → XArray per collection policy).
///
/// **RCU-protected natively by XArray**: Interface lookup is on the per-packet
/// hot path (every incoming and outgoing packet resolves its interface). XArray
/// provides lock-free RCU reads natively — readers call `xa_load()` under
/// `rcu_read_lock()`. Writers (interface add/remove, rare) call `xa_store()`
/// / `xa_erase()` which publish entries via XArray's internal RCU mechanism.
/// No clone-and-swap needed — per-entry O(log₆₄ N) updates. This matches
/// Linux's RCU-protected `net_device` lookup exactly — lock-free reads,
/// serialized writes. The write path holds `config_lock` (below) for
/// serialization.
pub interfaces: InterfaceTable,
/// Loopback interface (always present, cannot be deleted).
pub loopback: Arc<NetDevice>,
/// Routing table (per-namespace, not shared).
///
/// **RCU-protected natively by FIB trie internals**: Route lookup is on the
/// per-packet forwarding path. The FIB trie uses per-entry RCU publishing
/// for lock-free reads — `ip route add/del` modifies individual trie nodes
/// with O(log N) cost, not O(total routes) clone-and-swap. The write path
/// holds `config_lock` (below) for serialization. Linux uses RCU for FIB
/// (Forwarding Information Base) lookup with the same per-entry pattern.
pub routes: RouteTable,
/// Firewall rules (iptables/nftables equivalent).
/// Rules are scoped to this namespace only.
///
/// **RCU-protected natively by XArray**: Rule evaluation is on the per-packet
/// filter path. The `chains` XArray provides lock-free RCU reads natively.
/// Writers (rule updates via `nft` commands) call `xa_store()` / `xa_erase()`
/// for per-entry O(log₆₄ N) updates under `config_lock`, then increment the
/// `generation` counter with Release ordering. Readers load `generation` with
/// Acquire ordering before evaluating chains. Linux uses RCU for netfilter
/// rule traversal.
pub firewall: FirewallRules,
/// Mutex for serializing configuration mutations (interface add/remove,
/// route updates, firewall rule changes). Only the write side holds this;
/// packet-path readers never touch it. Separating the write-side lock from
/// the read-side RCU ensures that configuration changes do not block
/// packet processing.
pub config_lock: Mutex<()>,
/// Port allocation bitmap (per-namespace).
/// Allows the same port number to be bound in different namespaces.
/// Mutex is correct here: port allocation happens on bind()/connect(),
/// not on the per-packet path. See `PortAllocator` below.
pub port_allocator: Mutex<PortAllocator>,
/// Owning user namespace. Required for capability checks
/// (CAP_NET_ADMIN, CAP_NET_RAW, CAP_NET_BIND_SERVICE) by networking
/// subsystems that call `ns_capable(net_ns.user_ns, cap)`.
pub user_ns: Arc<UserNamespace>,
/// Capability to this network stack (for delegation).
/// Processes in this namespace implicitly hold this capability.
pub stack_cap: Capability<NetStack>,
/// Per-namespace connection tracking table. In Linux, conntrack state is
/// per-network-namespace: each container's netns maintains its own connection
/// tracking entries, independent of the host and other containers. NAT rules,
/// stateful firewall decisions, and connection reuse all operate against this
/// namespace-scoped table.
pub conntrack: ConntrackTable,
/// All sockets belonging to this namespace. Used for teardown cleanup:
/// when the namespace is destroyed, all sockets in this list are closed
/// (RST for TCP, immediate free for UDP/raw). This ensures no orphaned
/// sockets survive namespace destruction.
///
/// Sockets register themselves on creation (`socket()` syscall) and
/// deregister on `close()`. The list is protected by `config_lock`
/// (write-side only; socket creation/destruction is not on the
/// per-packet hot path).
///
/// **Collection choice**: `Vec<Arc<Socket>>` -- cold path only (socket
/// teardown on namespace destruction). Unbounded `Vec` is acceptable per
/// collection policy for cold paths. The maximum size is bounded by
/// `ulimit -n` and the namespace's file descriptor count.
pub socket_list: Vec<Arc<Socket>>,
}
/// Per-namespace connection tracking state. Each network namespace maintains its
/// own conntrack table so that NAT translations, stateful firewall decisions, and
/// connection reuse are fully isolated between containers.
///
/// The `entries` map is RCU-protected: packet-path lookups (hot path) are lock-free;
/// conntrack entry creation/destruction (warm path) serializes under an internal
/// writer lock within `RcuHashMap`.
pub struct ConntrackTable {
/// Active connection tracking entries, keyed by 5-tuple hash.
pub entries: RcuHashMap<ConntrackKey, ConntrackEntry>,
/// Total number of active entries (for `/proc/sys/net/netfilter/nf_conntrack_count`
/// per-namespace accounting and early-drop threshold enforcement).
pub count: AtomicU64,
}
/// Per-namespace ephemeral port allocator. Covers the ephemeral port range
/// 1024-65535 (64,512 ports). Allocation uses a rotor (`next`) for O(1)
/// average-case and a bitmap for collision detection.
///
/// The bitmap has 1024 words of 64 bits each = 65,536 bits. Bit N corresponds
/// to port N. Ports 0-1023 (well-known) are always set in the bitmap at init
/// time and never allocated by the rotor. The `range` field allows
/// `/proc/sys/net/ipv4/ip_local_port_range` tunability per namespace.
pub struct PortAllocator {
/// Inclusive (low, high) ephemeral port range. Default: (32768, 60999).
/// Configurable via `/proc/sys/net/ipv4/ip_local_port_range`.
pub range: (u16, u16),
/// Next candidate port (rotor). Wraps within `range`.
/// **Lock-free algorithm**: `next` is atomically incremented via
/// `fetch_add(1, Relaxed)`. If the result exceeds `range.1`, it wraps
/// to `range.0` via modular arithmetic. The Mutex<PortAllocator> in
/// NetNamespace serializes the full allocate-and-set-bitmap operation
/// (not the rotor read), so two concurrent allocators may read the same
/// `next` value — the bitmap collision check resolves this (both try
/// to set the same bit; the Mutex serializes the CAS on the bitmap word).
pub next: AtomicU16,
/// Bitmap of in-use ports. 1024 words × 64 bits = 65,536 ports.
/// Bit set = port in use. Atomic words for lock-free read on bind-time
/// collision check (write serialized by `Mutex<PortAllocator>` in
/// `NetNamespace`).
pub bitmap: [AtomicU64; 1024],
}
PortAllocator provides namespace-scoped ephemeral port allocation. Per-protocol tables (UdpTable.ephemeral_next, TcpTable.ephemeral_next) use per-CPU fast paths within the PortAllocator's allocated range. The PortAllocator reserves ranges; per-protocol tables consume individual ports within those ranges.
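The rotor-plus-bitmap allocation described above can be sketched in simplified form — plain fields instead of atomics, since the real allocator sits behind `Mutex<PortAllocator>` anyway; the names and small test range here are illustrative:

```rust
/// Simplified model of the rotor-plus-bitmap ephemeral port allocator.
struct PortAllocator {
    range: (u16, u16), // inclusive (low, high)
    next: u16,         // rotor: next candidate port
    bitmap: Vec<u64>,  // 1024 words x 64 bits = 65,536 ports
}

impl PortAllocator {
    fn new(range: (u16, u16)) -> Self {
        Self { range, next: range.0, bitmap: vec![0u64; 1024] }
    }
    /// Test-and-set the bit for `port`. Returns false if already in use.
    fn test_and_set(&mut self, port: u16) -> bool {
        let (word, bit) = (port as usize / 64, port as usize % 64);
        let mask = 1u64 << bit;
        if self.bitmap[word] & mask != 0 {
            return false; // collision: port already bound
        }
        self.bitmap[word] |= mask;
        true
    }
    /// Advance the rotor, wrap within the range, skip in-use ports.
    /// Returns None (EADDRINUSE) when the whole range is exhausted.
    fn allocate(&mut self) -> Option<u16> {
        let span = (self.range.1 - self.range.0) as u32 + 1;
        for _ in 0..span {
            let candidate = self.next;
            self.next = if self.next == self.range.1 {
                self.range.0 // wrap rotor to the low end
            } else {
                self.next + 1
            };
            if self.test_and_set(candidate) {
                return Some(candidate);
            }
        }
        None
    }
    fn release(&mut self, port: u16) {
        self.bitmap[port as usize / 64] &= !(1u64 << (port as usize % 64));
    }
}

fn main() {
    let mut alloc = PortAllocator::new((32768, 32770)); // tiny range for demo
    assert_eq!(alloc.allocate(), Some(32768));
    assert_eq!(alloc.allocate(), Some(32769));
    assert_eq!(alloc.allocate(), Some(32770));
    assert_eq!(alloc.allocate(), None); // range exhausted
    alloc.release(32769);
    assert_eq!(alloc.allocate(), Some(32769)); // rotor finds the freed port
}
```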
A new socket inherits the creating thread's network namespace (current_task().nsproxy.net_ns). setns(CLONE_NEWNET) affects future socket creation but does not migrate existing sockets. Sockets bound to a destroyed namespace return EIO on all operations. getsockopt(SO_NETNS_COOKIE) returns the namespace's unique 64-bit cookie for identification.
See Section 16.13 for `NetDevice` lifecycle, and Section 16.6 for `RouteTable` internals.
/// Fixed-size interface name (matching Linux IFNAMSIZ = 16).
/// Prevents unbounded heap allocation and OOM attacks via long names.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct InterfaceName([u8; 16]);
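A plausible constructor for the fixed-size name, assuming the Linux rule of at most 15 name bytes plus a trailing NUL (IFNAMSIZ = 16). The `new`/`as_str` methods are illustrative; the spec above only defines the wrapper struct:

```rust
/// Fixed-size interface name (IFNAMSIZ = 16), as defined above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct InterfaceName([u8; 16]);

impl InterfaceName {
    /// Hypothetical validating constructor: reject empty names, names
    /// longer than 15 bytes (room for the trailing NUL), and embedded NULs.
    pub fn new(s: &str) -> Result<Self, ()> {
        let bytes = s.as_bytes();
        if bytes.is_empty() || bytes.len() > 15 || bytes.contains(&0) {
            return Err(());
        }
        let mut buf = [0u8; 16];
        buf[..bytes.len()].copy_from_slice(bytes);
        Ok(Self(buf))
    }
    /// View the name as a &str, stopping at the first NUL.
    pub fn as_str(&self) -> &str {
        let end = self.0.iter().position(|&b| b == 0).unwrap_or(16);
        core::str::from_utf8(&self.0[..end]).unwrap_or("")
    }
}

fn main() {
    let n = InterfaceName::new("veth0").unwrap();
    assert_eq!(n.as_str(), "veth0");
    // 16 bytes leaves no room for the NUL terminator: rejected.
    assert!(InterfaceName::new("0123456789abcdef").is_err());
}
```

Because the buffer is inline and fixed-size, a hostile caller cannot trigger unbounded heap allocation through the name field, which is the OOM-resistance property the doc comment calls out.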
VETH pairs for inter-namespace connectivity: A VETH (virtual ethernet) pair connects two namespaces. Creating a pair produces two virtual interfaces that are cross-linked:
- veth0 in the caller's namespace
- veth1 in the target namespace
- Packets sent to one end appear on the other (like a virtual patch cable)
Container networking flow:
1. Container runtime creates a new network namespace for the container
2. Creates a VETH pair: one end in host namespace (e.g., veth0), one in container (e.g., eth0)
3. Host end is attached to a bridge (e.g., docker0, cni0) for external connectivity
4. Container end is assigned an IP from the bridge's subnet
5. NAT/masquerading rules on the host allow container → external traffic
6. Port forwarding rules map host ports → container ports
Network namespace initial state: When a new network namespace is created
via clone(CLONE_NEWNET) or unshare(CLONE_NEWNET), the kernel initializes
the following state before returning to the caller:
- Loopback interface (`lo`): auto-created, brought UP, assigned addresses `127.0.0.1/8` (IPv4) and `::1/128` (IPv6). The loopback device is permanent and cannot be deleted or moved to another namespace. Registration steps, executed inside `net_ns_init_loopback(ns)`; the loopback driver is a Tier 0 software device (no hardware, no ring dispatch):

      let lo = Arc::new(NetDevice::new_loopback(ns.ns_id));
      lo.ifindex.store(1, Relaxed); // Always 1 within the namespace.
      lo.flags.store(IFF_LOOPBACK | IFF_UP | IFF_RUNNING, Release);
      // Add addresses: 127.0.0.1/8 on lo (IPv4), ::1/128 on lo (IPv6).
      lo.inet_addrs.push(Inet4Addr::new(127, 0, 0, 1, 8));
      lo.inet6_addrs.push(Inet6Addr::loopback(128));
      ns.interfaces.table.xa_store(1, Arc::clone(&lo));
      ns.loopback = lo;
      // Add loopback routes to ns.routes (see routing table init below).

  `NetDevice::new_loopback()` is defined in Section 16.13.

- Initial routing table: contains only loopback routes:
  - `127.0.0.0/8 → lo` (IPv4 loopback subnet)
  - `::1/128 → lo` (IPv6 loopback host route)
  - No default gateway. The container runtime or CNI plugin is responsible for adding routes after creating veth pairs and assigning addresses (typically `default via <bridge-ip> dev eth0`).

- Empty firewall rules: all nftables/iptables chains are created with default `ACCEPT` policy (INPUT, OUTPUT, FORWARD). No rules are pre-loaded. The container runtime may install restrictive rules after namespace setup.

- Empty socket table: no sockets exist. Sockets created in the new namespace bind to its port allocator (`NetNamespace.port_allocator`), independent of the host namespace's ports.

- Empty FIB (Forwarding Information Base): except for the loopback routes above, no entries exist. No neighbor cache entries (ARP/NDP tables are empty).

- Empty conntrack table: no connection tracking state. Connections established after namespace creation are tracked independently from the host namespace's conntrack.

- Sysctl defaults: per-namespace network sysctls (e.g., `net.ipv4.ip_forward`, `net.core.rmem_default`) are initialized to their kernel-default values, independent of the host namespace's sysctl settings. `ip_forward` defaults to 0 (disabled); the container runtime enables it if the container needs to forward traffic.
This design ensures that a new network namespace starts fully isolated with no connectivity. All external connectivity must be explicitly configured by the container runtime, matching Linux behavior exactly.
Tunnel device namespace scoping: Tunnel devices (GRE, VXLAN, GENEVE,
WireGuard, IPIP, SIT) are scoped to the network namespace in which they are
created. A tunnel's dev.net_ns field points to its owning namespace, and
the tunnel's encap/decap processing uses that namespace's routing table and
socket table. Moving a tunnel device between namespaces via ip link set
<dev> netns <ns> updates dev.net_ns and re-binds the tunnel's UDP
encapsulation socket (if any) in the target namespace. This follows the
same per-namespace InterfaceTable model as physical NICs and veth pairs
(Section 16.13).
- `CLONE_NEWNS` (Mount Namespace): Creates a private copy of the VFS mount tree for the process. Changes to this tree do not affect the parent domain unless explicitly marked shared. When `unshare(CLONE_NEWNS)` is called, a new mount namespace is created by COW-cloning the caller's mount tree. The task's `root` and `pwd` references are updated to point to the corresponding mount points in the new namespace. No filesystem data is copied — only the mount table is duplicated.
- `CLONE_NEWUTS` (UTS Namespace): Creates an isolated hostname/domainname state. Stored as a reference-counted `UtsNamespace` struct in the task's `NamespaceSet` (see Section 17.1).
- `CLONE_NEWIPC` (IPC Namespace): Isolates System V IPC objects and POSIX message queues.
- `CLONE_NEWUSER` (User Namespace): Creates a new UID/GID mapping table within the Capability Domain.
- `CLONE_NEWTIME` (Time Namespace): Creates isolated offsets for `CLOCK_MONOTONIC` and `CLOCK_BOOTTIME`. The container sees its own "boot time" starting from zero, independent of the host's actual boot time. The `TimeNamespace` struct with offset fields is defined in Section 17.1 below.
Namespace creation rollback on partial failure: When clone() or unshare()
requests multiple namespaces simultaneously (e.g., CLONE_NEWPID | CLONE_NEWNET |
CLONE_NEWNS), namespace creation must be atomic — either all requested namespaces
are created, or none are. If the Nth namespace allocation fails (ENOMEM, ENOSPC from
namespace limits), the previously-created (N-1) namespaces must be rolled back:
- Namespaces are created in a fixed order: user (if CLONE_NEWUSER, always first), then mount, PID, time, net, IPC, UTS, cgroup. The only ordering requirement for correctness is that CLONE_NEWUSER is processed first (see the CLONE_NEWUSER ordering requirement below). The non-NEWUSER ordering is arbitrary -- these namespaces are independent of each other. The order listed here is the canonical implementation order used consistently in the `sys_unshare()` and `do_fork()` namespace creation paths.
- Each successfully-created namespace is recorded in a local `ArrayVec<CreatedNs, 8>`.
- On failure at step K:
  a. Iterate the `ArrayVec` in reverse order.
  b. For each created namespace: decrement its refcount (`Arc::drop`). If the refcount reaches zero (no other task shares it), the namespace is fully destroyed (PID IDR freed, net stack torn down, mount tree dropped, etc.).
  c. Return the error to the caller. The task's `nsproxy` is unchanged.
- On success: atomically swap the task's `nsproxy` to the new `NamespaceSet` containing all newly-created namespaces. The old `nsproxy` is dropped (refcount decrement; shared namespaces remain alive via other tasks' references).
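The unwind protocol above can be sketched with `Drop` standing in for the refcount-reaches-zero teardown. `Ns`, `create_ns`, and `create_all` are hypothetical illustration names; real namespace construction is faked with strings:

```rust
/// Stand-in for a namespace object; Drop plays the role of the teardown
/// (PID IDR freed, net stack torn down, mount tree dropped, ...).
struct Ns {
    kind: &'static str,
}

impl Drop for Ns {
    fn drop(&mut self) {
        println!("destroyed {} namespace", self.kind);
    }
}

/// Fake allocator: fails with ENOMEM on demand to exercise the rollback.
fn create_ns(kind: &'static str, should_fail: bool) -> Result<Ns, &'static str> {
    if should_fail { Err("ENOMEM") } else { Ok(Ns { kind }) }
}

/// Create every requested namespace or none: on failure, the local Vec of
/// already-created namespaces is unwound in reverse creation order and the
/// error propagates; the caller's nsproxy is never touched.
fn create_all(
    kinds: &[&'static str],
    fail_kind: Option<&str>,
) -> Result<Vec<Ns>, &'static str> {
    let mut created: Vec<Ns> = Vec::with_capacity(8); // ArrayVec<CreatedNs, 8> in the spec
    for &k in kinds {
        match create_ns(k, fail_kind == Some(k)) {
            Ok(ns) => created.push(ns),
            Err(e) => {
                // Roll back in reverse order before returning the error.
                while let Some(ns) = created.pop() {
                    drop(ns);
                }
                return Err(e);
            }
        }
    }
    Ok(created) // success: caller swaps these into the new NamespaceSet
}

fn main() {
    // user first, then the canonical order for the rest.
    let order = ["user", "mount", "pid", "time", "net"];
    assert!(create_all(&order, None).is_ok());
    // Failure at "time" destroys pid, mount, user (in that order) and
    // returns the error untouched.
    assert!(matches!(create_all(&order, Some("time")), Err("ENOMEM")));
}
```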
CLONE_NEWUSER ordering requirement: When CLONE_NEWUSER is combined with other
CLONE_NEW* flags in a single clone() or unshare() call, UmkaOS MUST create
the user namespace first, before creating any other namespace. This ordering is
a correctness requirement, not an optimization:
- The new user namespace maps the caller's UID/GID to root (UID 0) inside the namespace, granting `CAP_SYS_ADMIN` within that namespace.
- Creating other namespaces (PID, NET, MNT, IPC, UTS, CGROUP, TIME) requires `CAP_SYS_ADMIN` — either in the caller's current user namespace or in the newly created one.
- If `CLONE_NEWUSER` is processed after other `CLONE_NEW*` flags, the capability check for those namespaces runs against the parent user namespace, which may deny the operation for unprivileged callers (rootless containers).
- If `CLONE_NEWUSER` is absent from the flags, all other namespace creation operations require `CAP_SYS_ADMIN` in the caller's current user namespace.
Implementation: do_fork() and sys_unshare() sort the namespace creation
order internally. Regardless of the bit order in the flags argument, the
processing sequence is: (1) CLONE_NEWUSER (if present), (2) all other
CLONE_NEW* flags in kernel-defined order (MNT, UTS, IPC, PID, CGROUP, NET, TIME).
This matches Linux kernel behavior (create_new_namespaces() always processes
CLONE_NEWUSER first).
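The internal reordering can be expressed as a small pure function (`processing_order` is an illustrative name; the constants match the CLONE_NEW* values quoted earlier in this chapter):

```rust
// CLONE_NEW* bit values, matching Linux's <sched.h>.
const CLONE_NEWTIME: u64 = 0x0000_0080;
const CLONE_NEWNS: u64 = 0x0002_0000;
const CLONE_NEWCGROUP: u64 = 0x0200_0000;
const CLONE_NEWUTS: u64 = 0x0400_0000;
const CLONE_NEWIPC: u64 = 0x0800_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const CLONE_NEWPID: u64 = 0x2000_0000;
const CLONE_NEWNET: u64 = 0x4000_0000;

/// Return the flags in processing order: CLONE_NEWUSER first (if present),
/// then the kernel-defined order MNT, UTS, IPC, PID, CGROUP, NET, TIME.
fn processing_order(flags: u64) -> Vec<u64> {
    const REST: [u64; 7] = [
        CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC,
        CLONE_NEWPID, CLONE_NEWCGROUP, CLONE_NEWNET, CLONE_NEWTIME,
    ];
    let mut order = Vec::new();
    if flags & CLONE_NEWUSER != 0 {
        order.push(CLONE_NEWUSER); // always processed first
    }
    order.extend(REST.iter().copied().filter(|&f| flags & f != 0));
    order
}

fn main() {
    // Caller sets NET before USER in the word; processing still puts USER first.
    let order = processing_order(CLONE_NEWNET | CLONE_NEWUSER | CLONE_NEWPID);
    assert_eq!(order, vec![CLONE_NEWUSER, CLONE_NEWPID, CLONE_NEWNET]);
}
```

Because the ordering is recomputed from the flags word, the capability checks for the non-user namespaces always run after the new user namespace (and its `CAP_SYS_ADMIN` grant) exists.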
17.1.1.1 sys_unshare() — Standalone Namespace Disassociation¶
/// Create new instances of the specified namespace types for the calling task
/// without creating a new process. Equivalent to the namespace-creation effects
/// of `clone(flags)` but applied to the calling thread itself.
///
/// # Arguments
/// - `flags`: Bitmask of `CLONE_NEW*` flags. Each flag creates the corresponding
/// namespace type: `CLONE_NEWPID`, `CLONE_NEWNET`, `CLONE_NEWNS`,
/// `CLONE_NEWIPC`, `CLONE_NEWUTS`, `CLONE_NEWUSER`, `CLONE_NEWCGROUP`,
/// `CLONE_NEWTIME`. Flags may be combined.
///
/// # Ordering
/// If `CLONE_NEWUSER` is present, it MUST be processed first (same ordering
/// as `clone()`). The implementation reorders internally regardless of the
/// caller's flag order.
///
/// # Semantics
/// - Creates new namespace instances for each specified flag.
/// - Replaces the calling task's `NamespaceSet` (`nsproxy`) with a new one
/// containing the newly created namespaces. Sibling threads are unaffected —
/// they retain the old `Arc<NamespaceSet>`.
/// - `CLONE_NEWPID`: Unlike `clone()`, the calling process does NOT enter the
/// new PID namespace itself. Instead, it is stored as `pending_pid_ns` and
/// future children will be created in the new namespace.
/// - Requires `CAP_SYS_ADMIN` in the caller's user namespace for all flags
/// except `CLONE_NEWUSER` (which is always allowed, subject to nesting limits).
///
/// # Returns
/// `Ok(0)` on success, or `Err(errno)` on failure:
/// - `EPERM`: Missing capability.
/// - `ENOSPC`: Namespace nesting depth exceeded (32 levels for both user and PID namespaces).
/// - `ENOMEM`: Insufficient memory to create namespace structures.
pub fn sys_unshare(flags: u64) -> Result<i64, Errno>;
sys_unshare() implementation algorithm:
1. Check permissions: `CAP_SYS_ADMIN` required for all flags except `CLONE_NEWUSER`. For `CLONE_NEWUSER`, check nesting depth (max 32 levels). If multi-threaded and `CLONE_NEWUSER` is set, return `EINVAL` (Linux requires single-threaded for user namespace unshare; returns EINVAL, not EPERM).
2. If `CLONE_NEWUSER` is present, PREPARE (but do not commit) credentials:
   a. Create a new `UserNamespace` (child of current).
   b. Prepare new credentials (`prepare_creds`) with `CAP_SYS_ADMIN` in the new user namespace. Do NOT call `commit_creds()` yet — defer until step 5 to ensure rollback is possible if namespace allocation fails.
3. Clone the current `NamespaceSet`: `let mut new_ns = old_ns.clone();`
4. For each `CLONE_NEW*` flag, create the namespace and update `new_ns`:
   - `CLONE_NEWNS`: Call `copy_tree()` to clone the mount tree. `copy_tree()` returns the new `Arc<MountNamespace>` and internally handles updating `fs.root`/`fs.pwd` in its step 6 (see Section 14.6 for the canonical algorithm including parameter list and all 6 steps). Signature:

         copy_tree(
             source_root_mount: &Arc<Mount>,
             source_root_dentry: &Arc<Dentry>,
         ) -> Result<Arc<MountNamespace>>

     The new `MountNamespace` is assigned to `new_ns.mount_ns`. The `fs.root`/`fs.pwd` update (translating old mount references to new mount references via the internal mount_map) is performed inside `copy_tree()` step 6, NOT by the caller. This avoids exposing the mount_map to the caller.
   - `CLONE_NEWPID`: Create a new PID namespace. Store it as `new_ns.pending_pid_ns` (replacing any existing pending value — the old value's refcount is decremented). The calling task does NOT enter the new PID namespace.
   - `CLONE_NEWTIME`: Create a new time namespace. Store it as `new_ns.pending_time_ns`. Like CLONE_NEWPID, the calling task does NOT enter the new time namespace — only future children will. This matches Linux's behavior, where `unshare(CLONE_NEWTIME)` stores the namespace as `time_ns_for_children`.
   - `CLONE_NEWNET`: Create an empty network namespace. Update BOTH `new_ns.net_ns` AND `new_ns.net_stack` (dual-field invariant — see the NamespaceSet definition).
   - `CLONE_NEWIPC`: Create an empty IPC namespace (no inherited IPC objects). Assign to `new_ns.ipc_ns`.
   - `CLONE_NEWUTS`: Create a UTS namespace copying the parent's hostname/domainname. Assign to `new_ns.uts_ns`.
   - `CLONE_NEWCGROUP`: Create a cgroup namespace. The child's cgroup root is the caller's current cgroup. Assign to `new_ns.cgroup_ns`. See the Inheritance Rules table below for per-type semantics.
5. Commit credentials and swap nsproxy atomically: If `CLONE_NEWUSER` was requested, `commit_creds()` and `nsproxy.store()` are performed together under `task_lock()` (matching the pattern used by `setns(CLONE_NEWUSER)` — see below). This prevents a window where credentials and nsproxy disagree. If `CLONE_NEWUSER` was not requested, simply: `task.nsproxy.store(Arc::new(new_ns));` (`ArcSwap::store()` provides its own internal synchronization -- no explicit `Ordering` parameter.)
6. Drop the old `Arc<NamespaceSet>` (decrements refcounts on all old namespaces).
If any step fails, roll back all namespaces created so far (same rollback protocol as
clone() — see Section 17.1).
PID 1 signal protection within PID namespaces:
The first process created in a PID namespace gets PID 1 and acts as the namespace's init. PID 1 has special signal handling (critical for container correctness):
- Default-disposition signals are silently dropped: Signals with default disposition (`SIG_DFL`) are NOT delivered to PID 1 unless PID 1 has explicitly installed a handler for that signal. This prevents accidental termination of the container init (e.g., `SIGTERM` with default disposition would kill a normal process but is dropped for namespace PID 1).
- SIGKILL/SIGSTOP from within the namespace are dropped: Processes inside the same PID namespace cannot kill or stop their init. This prevents a misbehaving container process from bringing down the container.
- Parent namespace CAN send any signal: The parent namespace (or any ancestor namespace) can send any signal including `SIGKILL` to PID 1 of a child namespace. This is how the container runtime stops a container — it sends `SIGKILL` from outside the namespace.
These rules are enforced in `send_signal()` by checking whether the target satisfies
`is_child_reaper(target)` (i.e., is PID 1 of its namespace) and whether the sender is
in the same or an ancestor namespace. See Section 8.5.
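The decision logic can be exercised as a small stand-alone model. The `NsModel` type and `signal_allowed_to_init` helper below are illustrative names, not the kernel's actual API (real enforcement lives in `send_signal()`, Section 8.5), and the model simplifies sibling namespaces, where the target PID would not even be visible:

```rust
// Toy model of PID 1 signal protection. Namespaces form a tree; a sender
// inside the target namespace (or a descendant) cannot SIGKILL/SIGSTOP the
// init, and default-disposition signals to init are dropped entirely.
use std::rc::Rc;

struct NsModel {
    parent: Option<Rc<NsModel>>,
}

/// True if `anc` is `ns` itself or an ancestor of `ns`.
fn is_self_or_ancestor(anc: &Rc<NsModel>, ns: &Rc<NsModel>) -> bool {
    let mut cur = Some(Rc::clone(ns));
    while let Some(n) = cur {
        if Rc::ptr_eq(&n, anc) {
            return true;
        }
        cur = n.parent.clone();
    }
    false
}

const SIGKILL: i32 = 9;
const SIGSTOP: i32 = 19;

/// May `sig` be delivered to the PID 1 of `target_ns`, sent from a task
/// whose PID namespace is `sender_ns`?
fn signal_allowed_to_init(
    sender_ns: &Rc<NsModel>,
    target_ns: &Rc<NsModel>,
    sig: i32,
    init_has_handler: bool,
) -> bool {
    // Sender outside the target namespace (e.g., an ancestor): anything goes.
    if !is_self_or_ancestor(target_ns, sender_ns) {
        return true;
    }
    // From inside: SIGKILL/SIGSTOP are dropped, and so is any signal for
    // which init has not installed a handler (SIG_DFL disposition).
    sig != SIGKILL && sig != SIGSTOP && init_has_handler
}

fn main() {
    let root = Rc::new(NsModel { parent: None });
    let child = Rc::new(NsModel { parent: Some(Rc::clone(&root)) });
    // Container process (inside `child`) cannot SIGKILL its own init.
    assert!(!signal_allowed_to_init(&child, &child, SIGKILL, true));
    // The runtime in the parent namespace can.
    assert!(signal_allowed_to_init(&root, &child, SIGKILL, false));
    // SIGTERM with SIG_DFL is dropped; with a handler installed it is delivered.
    assert!(!signal_allowed_to_init(&child, &child, 15, false));
    assert!(signal_allowed_to_init(&child, &child, 15, true));
    println!("ok");
}
```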
17.1.2 Namespace Implementation¶
Namespaces are implemented entirely within the umka-sysapi layer. The core microkernel (umka-core) is unaware of namespaces; it only understands Capability Domains and object access rights.
/// Per-task namespace proxy. Like Linux's `task_struct->nsproxy`, this is
/// owned by each `Task` (not `Process`) so that `setns(2)` and `unshare(2)`
/// can change namespaces for a single thread without affecting siblings.
/// Wrapped in `Arc<NamespaceSet>` in the Task struct; threads that share
/// namespaces share the same `Arc`. `unshare()`/`setns()` replaces the
/// calling task's `Arc` with a new one.
pub struct NamespaceSet {
/// PID namespace for this task. Determines the PID number space
/// visible to the task: `getpid()` returns this namespace's local
/// PID, not the global TaskId. The `PidNamespace` struct (defined
/// below) contains the per-namespace IDR allocation map (`pid_map`)
/// and the reverse map (global TaskId -> local pid_t).
///
/// Shared via `Arc` across all tasks in the same PID namespace.
/// A new `Arc<PidNamespace>` is created only by `clone(CLONE_NEWPID)`
/// or `unshare(CLONE_NEWPID)`.
pub pid_ns: Arc<PidNamespace>,
/// Pending PID namespace for future children (set by setns(CLONE_NEWPID)).
/// When set, fork()/clone() creates children in this namespace rather than
/// the current task's PID namespace. The task's own PID is unchanged.
///
/// Protected by SpinLock to ensure setns() and clone() don't race in
/// multi-threaded processes. clone() atomically reads and clears
/// pending_pid_ns, preventing stale values across multiple clone() calls.
pub pending_pid_ns: SpinLock<Option<Arc<PidNamespace>>>,
/// Mount namespace containing the mount tree, mount hash table,
/// and all mount metadata for this task's VFS view.
/// See [Section 14.1](14-vfs.md#virtual-filesystem-layer--mount-tree-data-structures-and-operations) for `MountNamespace` definition.
pub mount_ns: Arc<MountNamespace>,
/// Network stack capability handle — used to resolve the concrete namespace.
pub net_stack: Capability<NetStack>,
/// Cached resolved network namespace. Populated from `net_stack.cap_resolve()`
/// at creation time (clone/unshare/setns). Provides direct `Arc<NetNamespace>`
/// access without capability resolution on every socket/routing operation.
/// All networking code uses `nsproxy.net_ns` (not `net_stack`) for lookups.
///
/// **INVARIANT**: `net_ns` must always equal `net_stack.cap_resolve()`. Updated
/// in lockstep with `net_stack` at every NamespaceSet construction site:
/// `clone()`, `unshare()`, `setns(CLONE_NEWNET)`, and the `CLONE_NEWUSER` path.
/// A stale `net_ns` silently routes network operations (socket creation, routing
/// lookups, neighbor resolution) to the wrong namespace.
///
/// **Circular reference prevention**: `net_stack` (Capability<NetStack>) holds
/// an `Arc<NetNamespace>` internally via capability resolution. `net_ns` is a
/// separate `Arc<NetNamespace>` clone (not a second ownership path — both point
/// to the same allocation). `NamespaceSet` → `Arc<NetNamespace>` is a one-way
/// ownership edge: `NetNamespace` does NOT hold a reference back to
/// `NamespaceSet`, and `UserNamespace` holds no reference back to either.
/// The `Arc<UserNamespace>` in `NetNamespace.user_ns` is a forward reference
/// (child namespace pointing to parent user namespace), not a back-edge,
/// so no cycle exists.
pub net_ns: Arc<NetNamespace>,
/// UTS namespace (hostname, domainname).
pub uts_ns: Arc<UtsNamespace>,
/// IPC namespace (SysV semaphores, message queues, shared memory).
pub ipc_ns: Arc<IpcNamespace>,
/// Cgroup namespace (cgroup root view).
pub cgroup_ns: Arc<CgroupNamespace>,
/// Time namespace offsets (CLOCK_MONOTONIC, CLOCK_BOOTTIME).
pub time_ns: Arc<TimeNamespace>,
/// Pending time namespace for future children (set by setns(CLONE_NEWTIME)).
/// When set, fork()/clone() creates children with the target time offsets.
/// Follows Linux 5.8+ semantics where CLONE_NEWTIME affects children only.
///
/// Protected by SpinLock (same rationale as `pending_pid_ns` above).
pub pending_time_ns: SpinLock<Option<Arc<TimeNamespace>>>,
/// User namespace governing UID/GID mappings and capability scope.
pub user_ns: Arc<UserNamespace>,
/// IMA namespace (per-container integrity measurement policy and log).
/// Created alongside the user namespace. See Section 9.4.3 for ImaNamespace struct.
pub ima_ns: Arc<ImaNamespace>,
}
impl NamespaceSet {
/// Construct a "tombstone" NamespaceSet referencing the init namespaces.
/// Used by `do_exit()` Step 11 to detach the exiting task from its
/// namespaces without leaving dangling references. Each field points to
/// the system's init namespace instance (created at boot, never destroyed).
/// This matches Linux's `init_nsproxy` which references `init_pid_ns`,
/// `init_net`, `init_uts_ns`, etc.
pub fn empty() -> Self {
// Each INIT_*_NS is a `OnceCell<Arc<T>>`, not a bare `Arc<T>`.
// `.get().expect(...)` extracts the inner `&Arc<T>` from the
// initialized OnceCell. This panics only if called before
// `init_namespaces()` completes (which runs early in boot,
// before any task can call `do_exit()`).
NamespaceSet {
pid_ns: Arc::clone(INIT_PID_NS.get().expect("INIT_PID_NS not initialized")),
pending_pid_ns: SpinLock::new(None),
mount_ns: Arc::clone(INIT_MOUNT_NS.get().expect("INIT_MOUNT_NS not initialized")),
net_stack: INIT_NET_STACK.get().expect("INIT_NET_STACK not initialized").clone(),
net_ns: Arc::clone(INIT_NET_NS.get().expect("INIT_NET_NS not initialized")),
uts_ns: Arc::clone(INIT_UTS_NS.get().expect("INIT_UTS_NS not initialized")),
ipc_ns: Arc::clone(INIT_IPC_NS.get().expect("INIT_IPC_NS not initialized")),
cgroup_ns: Arc::clone(INIT_CGROUP_NS.get().expect("INIT_CGROUP_NS not initialized")),
time_ns: Arc::clone(INIT_TIME_NS.get().expect("INIT_TIME_NS not initialized")),
pending_time_ns: SpinLock::new(None),
user_ns: Arc::clone(INIT_USER_NS.get().expect("INIT_USER_NS not initialized")),
ima_ns: Arc::clone(INIT_IMA_NS.get().expect("INIT_IMA_NS not initialized")),
}
}
}
/// Clone implementation for NamespaceSet. For each Arc field, performs
/// `Arc::clone()` (cheap refcount increment). For SpinLock fields
/// (pending_pid_ns, pending_time_ns), acquires the lock, clones the inner
/// value (`Option<Arc<T>>`), and creates a new unlocked SpinLock protecting
/// the cloned value. The new NamespaceSet is an independent copy that can
/// be mutated without affecting the original.
impl Clone for NamespaceSet {
fn clone(&self) -> Self {
NamespaceSet {
pid_ns: Arc::clone(&self.pid_ns),
pending_pid_ns: SpinLock::new(self.pending_pid_ns.lock().clone()),
mount_ns: Arc::clone(&self.mount_ns),
net_stack: self.net_stack.clone(),
net_ns: Arc::clone(&self.net_ns),
uts_ns: Arc::clone(&self.uts_ns),
ipc_ns: Arc::clone(&self.ipc_ns),
cgroup_ns: Arc::clone(&self.cgroup_ns),
time_ns: Arc::clone(&self.time_ns),
pending_time_ns: SpinLock::new(self.pending_time_ns.lock().clone()),
user_ns: Arc::clone(&self.user_ns),
ima_ns: Arc::clone(&self.ima_ns),
}
}
}
17.1.2.1 Init Namespace Initialization¶
All init namespace instances are stored as static globals initialized once
at boot via OnceCell. Initialization order matters — user_ns must be first
because all other namespaces reference it via user_ns fields. Init namespaces
are never destroyed (their refcount never reaches zero).
/// Init (root) PID namespace. Level 0, no parent.
static INIT_PID_NS: OnceCell<Arc<PidNamespace>> = OnceCell::new();
/// Init mount namespace. Contains the root filesystem mount tree.
static INIT_MOUNT_NS: OnceCell<Arc<MountNamespace>> = OnceCell::new();
/// Init network namespace. Contains the host network stack.
static INIT_NET_NS: OnceCell<Arc<NetNamespace>> = OnceCell::new();
/// Init network stack capability handle.
static INIT_NET_STACK: OnceCell<NetStackHandle> = OnceCell::new();
/// Init UTS namespace. Contains the host's hostname and domainname.
static INIT_UTS_NS: OnceCell<Arc<UtsNamespace>> = OnceCell::new();
/// Init IPC namespace. Contains system-wide SysV/POSIX IPC objects.
static INIT_IPC_NS: OnceCell<Arc<IpcNamespace>> = OnceCell::new();
/// Init cgroup namespace. Root cgroup view.
static INIT_CGROUP_NS: OnceCell<Arc<CgroupNamespace>> = OnceCell::new();
/// Init time namespace. Zero offsets (no adjustment).
static INIT_TIME_NS: OnceCell<Arc<TimeNamespace>> = OnceCell::new();
/// Init user namespace. Root user namespace (uid_map = identity).
static INIT_USER_NS: OnceCell<Arc<UserNamespace>> = OnceCell::new();
/// Init IMA namespace.
static INIT_IMA_NS: OnceCell<Arc<ImaNamespace>> = OnceCell::new();
Initialization order (called from init_namespaces() during boot, after
memory allocator and slab are online):
1. INIT_USER_NS — first, because all others reference it.
2. INIT_PID_NS — PID 1 (init) is the child reaper.
3. INIT_MOUNT_NS — requires VFS and root filesystem to be mounted.
4. INIT_NET_NS, INIT_NET_STACK — network stack initialization.
5. INIT_UTS_NS, INIT_IPC_NS, INIT_CGROUP_NS, INIT_TIME_NS, INIT_IMA_NS
— order among these is arbitrary (no inter-dependencies).
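The ordering constraint in steps 1 and 2 can be sketched in user space, with std's `OnceLock` standing in for the kernel's `OnceCell` and toy `UserNs`/`PidNs` types as assumptions:

```rust
// Boot-time init order sketch: user_ns is published first so every other
// namespace can capture a reference to it at construction time.
use std::sync::{Arc, OnceLock};

struct UserNs;
struct PidNs {
    _user_ns: Arc<UserNs>, // every non-user namespace references its owner
}

static INIT_USER_NS: OnceLock<Arc<UserNs>> = OnceLock::new();
static INIT_PID_NS: OnceLock<Arc<PidNs>> = OnceLock::new();

fn init_namespaces() {
    // 1. User namespace first — all others reference it.
    //    (A second call is ignored here; the kernel would treat it as a bug.)
    let _ = INIT_USER_NS.set(Arc::new(UserNs));
    // 2. PID namespace captures the already-published user namespace.
    //    get() failing here would mean the ordering was violated.
    let user = Arc::clone(INIT_USER_NS.get().expect("init order violated"));
    let _ = INIT_PID_NS.set(Arc::new(PidNs { _user_ns: user }));
}

fn main() {
    init_namespaces();
    // After init, get() always succeeds — the tombstone NamespaceSet::empty()
    // relies on exactly this invariant.
    assert!(INIT_USER_NS.get().is_some());
    assert!(INIT_PID_NS.get().is_some());
    println!("init order ok");
}
```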
/// PID namespace. Each namespace has its own PID number space: a process visible
/// in a child namespace has a different pid_t than in the parent namespace.
///
/// # Nesting
/// PID namespaces form a tree. The root (init) namespace is the global root.
/// A process in namespace N with pid=5 may appear as pid=105 in namespace N's parent.
/// Translation traverses `parent` pointers up the tree.
///
/// # PID allocation
/// Each namespace allocates PIDs from an IDR (integer allocation map). The global PID
/// (used internally in the kernel) is always allocated from the root namespace.
/// Every namespace in the path from root to the process's namespace gets one entry
/// in the translation map.
///
/// # /proc visibility and PID enumeration
/// `/proc/[pid]` uses the PID from the reading process's namespace, not the global
/// TaskId. A process reading `/proc/5/status` in namespace N will see the task whose
/// local pid_t in N equals 5; the same task may have a different pid_t in the parent
/// namespace. If no task with that local pid exists in the reader's namespace,
/// the entry is absent from `/proc`.
///
/// **PID enumeration scoping**: `readdir("/proc")` enumerates only the PIDs visible
/// in the calling process's PID namespace. A container's init (PID 1) sees only its
/// own descendants; the host's root namespace sees all PIDs. The procfs `readdir`
/// implementation iterates `PidNamespace.pid_map` (below) for the caller's namespace
/// level, yielding only entries that have a valid local pid_t at that level.
pub struct PidNamespace {
/// Unique namespace identifier (for /proc/self/ns/pid).
pub ns_id: u64,
/// Owning user namespace. Required by the `Namespace` trait for
/// `setns()` capability checks. Set to `current_task().nsproxy.user_ns`
/// at creation time. Weak reference avoids cycles.
pub user_ns: Weak<UserNamespace>,
/// Reference to the namespace's init process (PID 1). Set when the
/// first task is created in this namespace (`do_fork` with `CLONE_NEWPID`).
/// Used by signal delivery to implement PID 1 signal protection.
///
/// **Weak reference**: `Weak<Task>` avoids a reference cycle:
/// `PidNamespace -> Arc<Task> -> ArcSwap<NamespaceSet> -> Arc<PidNamespace>`.
/// Using `Arc<Task>` would prevent both the PidNamespace and the init Task
/// from being freed while the init task is blocked (e.g., in `epoll_wait`),
/// causing a memory leak that accumulates over container lifecycles.
///
/// Signal delivery uses `child_reaper.upgrade()` to obtain a temporary
/// `Arc<Task>`. If `upgrade()` returns `None`, the init task has exited
/// (namespace teardown in progress via `zap_pid_ns_processes()`).
/// As a fast-path alternative, callers can look up PID 1 directly in
/// `pid_map` (which is updated atomically on task exit).
pub child_reaper: Option<Weak<Task>>,
/// Parent namespace. `None` only for the root PID namespace.
pub parent: Option<Arc<PidNamespace>>,
/// Nesting level. Root = 0; maximum = 32 (matches Linux `MAX_PID_NS_LEVEL`).
/// Exceeding the nesting limit for either PID or user namespaces returns
/// ENOSPC; Linux tracks the two limits with separate constants,
/// `MAX_PID_NS_LEVEL` and `MAX_USER_NS_LEVEL` (both 32).
pub level: u32,
/// PID allocation map for this namespace level.
/// Key: pid_t value in this namespace (allocated by IDR); Value: global TaskId.
///
/// IDR (integer-ID radix-tree allocator) provides O(log n) pid allocation with
/// RCU read-side protection. `pid_lookup()` is lock-free on the read path
/// (kill(), waitpid(), /proc/[pid] traversal). Writes (fork/exit) are serialized
/// by the Idr's internal SpinLock. Integrated next-ID allocation eliminates a
/// separate PID counter and separate "find a free PID" logic.
pub pid_map: Idr<TaskId>,
/// Reverse map: global TaskId → local pid_t in this namespace.
/// Used by `pid_nr()` (TaskId → local pid_t), called on signal delivery,
/// `/proc/[pid]` traversal, and `waitpid()` — all hot paths.
///
/// RCU-protected read path: `pid_nr()` takes only an RCU read guard (~1-3 cycles,
/// no spinning). Write path: insert at fork, remove at exit — both serialized by
/// `pid_map`'s existing SpinLock (held anyway for IDR allocation/deallocation).
/// This eliminates the separate `SpinLock<HashMap>` that previously serialized
/// every signal delivery on the read path.
///
/// Implementation: sparse radix tree (same Idr structure as pid_map) keyed on
/// the lower 32 bits of TaskId. **Longevity analysis**: At 1 million forks/sec
/// (sustained, far beyond any practical workload), the lower 32 bits of TaskId
/// wrap after ~4295 seconds (~72 minutes). However, the reverse_map only contains
/// LIVE tasks in this namespace — entries are removed at task exit. A collision
/// requires two live tasks in the same namespace whose TaskId lower-32 bits match,
/// which requires >4 billion cumulative forks within the namespace lifetime with
/// both tasks still alive. For short-lived containers this is impossible; for
/// long-running namespaces with extreme fork rates, the full 64-bit TaskId is
/// used for authoritative identification (the reverse_map is a fast-path cache).
///
/// **Collision fallback**: After `reverse_map.lookup(task_id.lower32())` returns
/// a candidate `pid_t`, the caller MUST verify `pid_map.lookup(pid_t).task_id ==
/// task_id` (full 64-bit check). On mismatch, fall back to linear scan of
/// `pid_map` entries. This ensures correctness even when two live tasks' lower-32
/// bits collide. The linear scan is O(N) where N = live tasks in the namespace,
/// but the mismatch case is astronomically rare under normal workloads.
pub reverse_map: RcuIdr<u32>,
/// Maximum PID value in this namespace (default: 4,194,304 = PID_MAX).
/// **Invariant**: pid_max <= i32::MAX (2,147,483,647). POSIX pid_t is signed
/// i32; values above i32::MAX would be interpreted as negative PIDs by
/// userspace (violating the `kill(-pid)` process-group convention). The
/// kernel clamps any sysctl write to `/proc/sys/kernel/pid_max` to this bound.
/// Reduced-max namespaces allow container runtimes to limit PID exhaustion attacks.
pub pid_max: u32,
/// Number of active tasks in this namespace.
/// **u32 justification**: Bounded by `pid_max` (max i32::MAX ≈ 2.1 billion),
/// which is well within u32 range. Unlike cgroup task counters (which use
/// AtomicU64 because cgroups can span multiple PID namespaces and accumulate
/// across namespace boundaries), a single PID namespace's task count is
/// strictly bounded by its `pid_max`.
pub nr_tasks: AtomicU32,
}
/// Translates a global `TaskId` to the local pid_t visible in `ns`.
/// Returns `None` if the task is not visible in `ns` (created in a sibling namespace).
///
/// Hot path: called on signal delivery, /proc traversal, waitpid(). Uses RCU
/// read-side guard — no spinning, no lock acquisition.
pub fn pid_nr(task_id: TaskId, ns: &PidNamespace) -> Option<u32> {
let guard = rcu_read_lock();
let candidate_pid = ns.reverse_map.lookup(task_id.lower32(), &guard)?;
// Full 64-bit verification: at 1M forks/sec the lower 32 bits wrap after
// ~72 minutes. If two live tasks collide on lower32, the reverse_map
// returns the wrong pid_t. Verify via the forward map.
let entry = ns.pid_map.lookup(candidate_pid, &guard)?;
if entry == task_id {
Some(candidate_pid)
} else {
// Collision: fall back to linear scan (cold path).
pid_nr_slow(task_id, ns, &guard)
}
}
/// Cold-path linear scan for pid_nr when lower-32 collision is detected.
/// Iterates all entries in the pid_map to find the one matching the full
/// 64-bit TaskId. O(N) where N = live tasks in this namespace, bounded
/// by pid_max. Expected frequency: near-zero under normal workloads.
#[cold]
fn pid_nr_slow(task_id: TaskId, ns: &PidNamespace, _guard: &RcuReadGuard) -> Option<u32> {
for (pid, tid) in ns.pid_map.iter() {
if tid == task_id {
return Some(pid);
}
}
None
}
/// Translates a task's PID as seen from an arbitrary target namespace.
///
/// This is the primary cross-namespace PID translation function, equivalent to
/// Linux's `task_pid_nr_ns()`. Used by:
/// - `kill()` to translate the target PID from the caller's namespace
/// - `waitpid()` to report child PIDs in the caller's namespace
/// - `getppid()` to report the parent's PID in the caller's namespace
/// - `/proc/[pid]/status` fields (PPid, NSpid, etc.)
/// - `io_uring` PID namespace resolution
///
/// # Algorithm
/// 1. If `ns.level > task_ns.level`: the task was created in a parent namespace
/// of `ns` — it is not visible in `ns`. Return `None`.
/// 2. Walk `task_ns` upward via `.parent` until reaching `ns.level`.
/// 3. If the walked namespace is the same object as `ns`: look up the task in
/// `ns.reverse_map`. Return the local pid_t.
/// 4. If the walked namespace differs from `ns`: the task is in a sibling
/// namespace at the same level — it is not visible. Return `None`.
///
/// Hot path: O(depth) where depth = task_ns.level - ns.level, typically 0-2.
pub fn task_pid_nr_ns(task: &Task, ns: &PidNamespace) -> Option<u32> {
// Bind the ArcSwap guard to an explicit variable so the Arc<NamespaceSet>
// lives for the duration of the function. Without this, the temporary
// guard would be dropped at the end of the statement, and `task_ns`
// would become a dangling reference. Rust's temporary lifetime extension
// rules DO extend the guard in `let task_ns = &expr.field` patterns, but
// the explicit binding is clearer and avoids a subtle lifetime footgun.
let ns_guard = task.nsproxy.load();
let task_ns = &ns_guard.pid_ns;
if ns.level > task_ns.level {
return None;
}
// Walk up to the target level.
let mut walk = Arc::clone(task_ns);
while walk.level > ns.level {
walk = walk.parent.as_ref()?.clone();
}
// Check same namespace object by comparing allocation addresses.
// (`ns` is a plain `&PidNamespace`, so raw-pointer equality is used here;
// `Arc::ptr_eq` would require two `&Arc<PidNamespace>` handles.)
// `walk` has been advanced to `ns.level` via the parent chain above.
// If `walk` and `ns` name the same namespace object, the task is visible
// in `ns`. If they differ, `ns` is a sibling namespace at the same depth —
// the task is not visible.
if !core::ptr::eq(Arc::as_ptr(&walk), ns as *const PidNamespace) {
return None; // sibling namespace — not visible
}
pid_nr(task.task_id, ns)
}
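The level-walk (steps 1–4 above) can be exercised in isolation with a simplified model; the `Ns` type and `visible_in` helper below are illustrative stand-ins, not the kernel's `PidNamespace`/`task_pid_nr_ns` themselves:

```rust
// Simplified model of the task_pid_nr_ns() level-walk: climb the task's
// namespace chain to the target's level, then compare Arc identity.
use std::sync::Arc;

struct Ns {
    level: u32,
    parent: Option<Arc<Ns>>,
}

/// Is a task created in `task_ns` visible in `target`?
fn visible_in(task_ns: &Arc<Ns>, target: &Arc<Ns>) -> bool {
    // Step 1: a task from a shallower (ancestor) namespace is never
    // visible in a deeper namespace's PID number space.
    if target.level > task_ns.level {
        return false;
    }
    // Step 2: walk up from task_ns to the target's level.
    let mut walk = Arc::clone(task_ns);
    while walk.level > target.level {
        walk = match &walk.parent {
            Some(p) => Arc::clone(p),
            None => return false,
        };
    }
    // Steps 3-4: same Arc => visible; different Arc => sibling, invisible.
    Arc::ptr_eq(&walk, target)
}

fn main() {
    let root = Arc::new(Ns { level: 0, parent: None });
    let a = Arc::new(Ns { level: 1, parent: Some(Arc::clone(&root)) });
    let b = Arc::new(Ns { level: 1, parent: Some(Arc::clone(&root)) });
    assert!(visible_in(&a, &root));  // container task visible from the host
    assert!(!visible_in(&root, &a)); // host init invisible inside a container
    assert!(!visible_in(&a, &b));    // siblings cannot see each other
    assert!(visible_in(&a, &a));
    println!("visibility ok");
}
```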
/// UTS namespace state.
///
/// Hostname and domainname are read on every `uname()` syscall (glibc calls
/// this once per process, but short-lived processes — container health checks,
/// shell scripts — call it frequently). Writes (`sethostname`, `setdomainname`)
/// are rare (typically once at container creation). RCU gives lock-free reads.
pub struct UtsNamespace {
/// Unique namespace ID (same value as the nsfs inode number).
/// Allocated from a global `AtomicU64` counter at creation time.
pub ns_id: u64,
/// Owning user namespace. Required by the `Namespace` trait for
/// `setns()` capability checks. Weak reference avoids cycles.
pub user_ns: Weak<UserNamespace>,
/// Current hostname and domainname. Read lock-free via RCU on the
/// `uname()` path; updated via clone-and-swap under `update_lock`.
/// Arc allows uname() readers to hold a reference beyond the RCU grace
/// period during copy_to_user (which may sleep on page fault).
pub strings: RcuPtr<Arc<UtsStrings>>,
/// Serializes `sethostname()` / `setdomainname()` updates.
pub update_lock: Mutex<()>,
}
/// UTS string pair (hostname + domainname). Immutable once published;
/// updates create a new `UtsStrings` and swap the RCU pointer.
pub struct UtsStrings {
/// Hostname (max 64 bytes, NUL-terminated).
pub hostname: [u8; 65],
/// NIS domain name (max 64 bytes, NUL-terminated).
pub domainname: [u8; 65],
}
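A minimal clone-and-swap sketch for the `sethostname()` path, with `RwLock<Arc<_>>` standing in for the RCU pointer (std has no RCU; the kernel's read path is lock-free) and a trimmed `UtsNs` as an assumed stand-in type:

```rust
// Clone-and-swap update for UtsNamespace.strings: writers never mutate a
// published UtsStrings; they copy it, modify the copy, and swap the pointer.
// Readers that grabbed the old Arc keep using it safely (grace-period analog).
use std::sync::{Arc, Mutex, RwLock};

pub struct UtsStrings {
    pub hostname: [u8; 65],
    pub domainname: [u8; 65],
}

pub struct UtsNs {
    strings: RwLock<Arc<UtsStrings>>, // RCU-pointer stand-in
    update_lock: Mutex<()>,           // serializes writers, like the real field
}

impl UtsNs {
    fn sethostname(&self, name: &[u8]) {
        let _w = self.update_lock.lock().unwrap(); // one writer at a time
        // Clone current strings, modify the copy, swap the pointer.
        let cur = self.strings.read().unwrap().clone();
        let mut next = UtsStrings { hostname: [0; 65], domainname: cur.domainname };
        next.hostname[..name.len()].copy_from_slice(name);
        *self.strings.write().unwrap() = Arc::new(next);
        // Old Arc is freed when the last reader drops it.
    }
    fn snapshot(&self) -> Arc<UtsStrings> {
        // uname() analog: take a reference that outlives the swap.
        self.strings.read().unwrap().clone()
    }
}

fn main() {
    let ns = UtsNs {
        strings: RwLock::new(Arc::new(UtsStrings { hostname: [0; 65], domainname: [0; 65] })),
        update_lock: Mutex::new(()),
    };
    let before = ns.snapshot(); // reader holds the old snapshot
    ns.sethostname(b"box");
    assert_eq!(&ns.snapshot().hostname[..3], &b"box"[..]);
    assert_eq!(before.hostname[0], 0); // old snapshot unchanged after the swap
    println!("uts ok");
}
```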
**IPC namespace**: Per-IPC-namespace state for SysV and POSIX IPC objects.
Canonical definition with full field documentation:
[Section 17.3](#posix-ipc--ipc-namespace-dispatch-sysv-ipc). Created by `clone(CLONE_NEWIPC)` or
`unshare(CLONE_NEWIPC)`. Integer-keyed SysV maps use `Idr<T>` (O(1),
RCU-protected reads); POSIX message queues use `RwLock<BTreeMap>` (string keys,
cold path).
/// SysV shared memory segment (shmget/shmat/shmctl).
pub struct ShmSegment {
/// Unique key (from shmget; IPC_PRIVATE = 0 means anonymous).
pub key: i32,
/// Segment identifier (returned by shmget). Signed i32 matching
/// Linux shmid_ds.shm_perm.id and the shmget() return type.
pub shmid: i32,
/// Size in bytes (rounded up to page boundary at creation).
pub size: usize,
/// Physical pages backing this segment (reference-counted).
pub pages: Arc<PhysPages>,
/// Owner UID/GID at creation time.
pub uid: u32,
pub gid: u32,
/// Permission mode bits (lower 9 bits, like file mode).
pub mode: u16,
/// Attachment count (number of active shmat() mappings).
pub nattach: AtomicU32,
/// Creation time and last attach/detach timestamps (monotonic nanoseconds).
pub ctime: u64,
pub atime: u64,
pub dtime: u64,
}
/// Physical page backing for SysV shared memory segments.
///
/// Shared between all processes that `shmat()` the segment. Each `shmat()`
/// creates a VMA in the calling process's address space that maps these
/// shared physical pages. The same `Pfn` appears in multiple page tables
/// simultaneously.
///
/// # Lifecycle
///
/// 1. **Creation** (`shmget`): `PhysPages` is allocated with `pages` pre-sized
/// to `ceil(size / PAGE_SIZE)` entries, all initialized to `None`.
/// If `SHM_HUGETLB` is set, huge pages are allocated eagerly at creation.
/// 2. **First attach** (`shmat`): for non-hugetlb segments, physical pages are
/// allocated lazily on first page fault (demand paging). The faulting task
/// acquires `pages.lock`, checks if `pages[idx]` is `None`, allocates a
/// zeroed page frame, and stores the `Pfn`. The lock is held only for the
/// duration of the page allocation (cold path, not per-access).
/// 3. **Detach** (`shmdt`): removes the VMA from the calling process. Does NOT
/// free physical pages (other processes may still be attached).
/// 4. **Destruction** (`shmctl IPC_RMID`): marks the segment for deletion.
/// The segment ID is removed from the IPC namespace's segment table, preventing
/// new `shmat()` calls. Actual page freeing occurs when `nattach` in the parent
/// `ShmSegment` reaches 0 (last detach after IPC_RMID).
///
/// # Design note
///
/// `Vec<Option<Pfn>>` inside `SpinLock` is acceptable because:
/// - The `Vec` is pre-allocated to full size at segment creation (no realloc
/// under the lock). `Vec::with_capacity(npages)` followed by `resize(npages, None)`.
/// - The lock protects only the page-present state during fault handling.
/// - Hot-path page table lookups do NOT acquire this lock — they read the PTE
/// directly. The lock is only needed when populating a previously-absent page.
pub struct PhysPages {
/// Physical page frames backing this shared memory segment.
/// Indexed by page offset within the segment. `None` = not yet faulted in.
/// Pre-allocated to `ceil(size / PAGE_SIZE)` entries at creation.
pub pages: SpinLock<Vec<Option<Pfn>>>,
/// Total size in bytes (rounded up to page boundary at creation).
pub size: usize,
}
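The lazy-populate protocol from lifecycle step 2 can be sketched in user space; `Mutex` stands in for the kernel SpinLock, and `alloc_zeroed_frame` is a hypothetical allocator stub, not a real kernel API:

```rust
// Demand-paging sketch for PhysPages: the fault handler takes the lock,
// re-checks the slot (a concurrent faulter may have won the race), and
// populates it only if still absent.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

static NEXT_PFN: AtomicU64 = AtomicU64::new(0x1000);

/// Hypothetical frame-allocator stub: hands out unique "frame numbers".
fn alloc_zeroed_frame() -> u64 {
    NEXT_PFN.fetch_add(1, Ordering::Relaxed)
}

struct PhysPages {
    pages: Mutex<Vec<Option<u64>>>, // None = not yet faulted in
}

impl PhysPages {
    fn new(npages: usize) -> Self {
        // Pre-sized at shmget() time: no realloc ever happens under the lock.
        PhysPages { pages: Mutex::new(vec![None; npages]) }
    }

    /// Resolve the frame backing page `idx`, allocating on first touch.
    fn fault(&self, idx: usize) -> u64 {
        let mut pages = self.pages.lock().unwrap();
        // Check under the lock; allocate only if the slot is still None.
        *pages[idx].get_or_insert_with(alloc_zeroed_frame)
    }
}

fn main() {
    let seg = PhysPages::new(4);
    let first = seg.fault(2);
    // A second fault on the same page returns the same frame, not a new one.
    assert_eq!(seg.fault(2), first);
    assert_ne!(seg.fault(3), first);
    println!("demand paging ok");
}
```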
/// SysV semaphore set (semget/semop/semctl).
pub struct SemSet {
pub key: i32,
pub semid: i32, // Linux ABI: semget() returns int (i32)
/// Number of semaphores in this set (1–SEMMSL; Linux default SEMMSL=32000).
pub nsems: u16,
/// Semaphore values (one per semaphore in the set).
pub sems: Box<[AtomicU16]>,
/// Undo table: pending undo entries referencing this set, applied when
/// their owning process exits (undos are per-process, shared by all
/// threads — see `SemUndo` below).
/// SpinLock because the undo list is accessed in the process-exit path.
pub undo_list: SpinLock<Vec<SemUndo>>,
pub uid: u32,
pub gid: u32,
pub mode: u16,
pub ctime: u64,
pub otime: u64,
}
/// SysV semaphore undo entry. Tracks adjustments that must be reversed
/// on process exit to prevent semaphore value leakage.
///
/// **Per-process sharing**: SysV semaphore undos are per-process (thread
/// group), not per-thread. All threads in a thread group share a single
/// `undo_list` (stored in `Process.sysvsem_undo`, not `Task`). This matches
/// Linux's `struct sem_undo_list` which is shared across all threads via
/// `current->sysvsem.undo_list`. `exit_sem()` runs once when the thread
/// group leader exits (last thread in the group), applying all accumulated
/// undo adjustments atomically against each referenced semaphore set.
///
/// When a `semop()` call includes `SEM_UNDO`, the kernel records the
/// inverse of each semaphore adjustment in a `SemUndo` entry associated
/// with the calling process. On process exit (`exit_sem()`), the kernel
/// iterates the process's `undo_list` and applies all recorded adjustments
/// atomically per semaphore set, restoring semaphore values to their
/// pre-operation state.
///
/// # Storage design
///
/// Uses a sparse representation: only semaphores that were actually modified
/// with `SEM_UNDO` are tracked. This avoids allocating a dense array of
/// 32000 entries (Linux SEMMSL) for the common case where a process touches
/// only a few semaphores in a set.
pub struct SemUndo {
/// PID of the process that owns this undo entry. Required for
/// `exit_sem()` to identify entries belonging to the dying process
/// when scanning from the semaphore-set side (`SemSet.undo_list`).
/// The per-process `sysvsem_undo` list provides the reverse index
/// (O(1) traversal from the process side).
pub process_id: ProcessId,
/// Semaphore set this undo applies to (same i32 ID space as `SemSet.semid`).
pub sem_id: i32,
/// Sparse list of (semaphore_index, adjustment) pairs.
/// Only semaphores modified with SEM_UNDO are tracked. On process exit,
/// `semval[idx] += adj` is applied for each `(idx, adj)` entry.
///
/// **Capacity**: Capped at `semset.sem_nsems` (the semaphore count of the
/// owning set), matching Linux behavior where the undo array is sized to
/// SEMMSL. Since `sem_nsems <= SEMMSL` (Linux default 32000), this is the
/// natural upper bound. If the limit is reached, subsequent SEM_UNDO
/// operations on new semaphore indices in this set return ENOSPC.
///
/// **Allocation**: `Vec` instead of `ArrayVec<256>` — the `semop()` path is
/// warm (bounded frequency), so heap allocation is acceptable per collection
/// policy (Ch 3.1.13). Each push calls `memcg_charge()` against the calling
/// task's memory cgroup, preventing unprivileged processes from consuming
/// unbounded kernel memory via SEM_UNDO. The cap is Evolvable policy (not a
/// Nucleus type parameter): a live evolution can adjust the limit without
/// data structure migration.
pub adjustments: Vec<(u16, i16)>,
}
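A minimal sketch of the exit-time undo application, under simplifying assumptions: `i32` semaphore values instead of the kernel's `AtomicU16` with `i16` adjustments, and the per-set locking elided:

```rust
// exit_sem() sketch: apply a dying process's sparse SEM_UNDO adjustments
// to the semaphore values, restoring them to their pre-operation state.
use std::sync::atomic::{AtomicI32, Ordering};

struct SemUndo {
    adjustments: Vec<(usize, i32)>, // sparse (semaphore index, adjustment)
}

fn exit_sem(sems: &[AtomicI32], undo: &SemUndo) {
    // Each adjustment is applied atomically; the real kernel also holds the
    // set's lock so waiters observe a consistent post-undo state.
    for &(idx, adj) in &undo.adjustments {
        sems[idx].fetch_add(adj, Ordering::SeqCst);
    }
}

fn main() {
    let sems = [AtomicI32::new(1), AtomicI32::new(0), AtomicI32::new(3)];
    // The process did semop(sem 0, -1) and semop(sem 2, +2) with SEM_UNDO;
    // the recorded inverse adjustments are +1 and -2.
    sems[0].fetch_sub(1, Ordering::SeqCst);
    sems[2].fetch_add(2, Ordering::SeqCst);
    let undo = SemUndo { adjustments: vec![(0, 1), (2, -2)] };
    exit_sem(&sems, &undo);
    // Values are restored to their pre-semop state.
    assert_eq!(sems[0].load(Ordering::SeqCst), 1);
    assert_eq!(sems[2].load(Ordering::SeqCst), 3);
    println!("undo ok");
}
```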
/// SysV message queue (msgget/msgsnd/msgrcv).
///
/// **Allocation strategy**: Message bodies are allocated from a per-IPC-namespace
/// slab cache (`msg_slab`) BEFORE acquiring the message queue SpinLock. The slab
/// cache uses fixed-size buckets (64, 256, 1024, 4096, `MSGMAX` bytes) to avoid
/// heap fragmentation. The caller allocates a `SysVMessage` from the slab, copies
/// the userspace data, then acquires `lock` and pushes the pre-allocated message
/// into the ring. This ensures zero heap allocation under the SpinLock.
pub struct MsgQueue {
pub key: i32,
pub msqid: i32,
pub uid: u32,
pub gid: u32,
pub mode: u16,
/// Mutable state protected by the SpinLock. All fields that change
/// after msgget() live inside `MsgQueueInner` so the lock actually
/// protects the data it guards (no `SpinLock<()>` anti-pattern).
pub inner: SpinLock<MsgQueueInner>,
}
/// Mutable MsgQueue state, protected by `MsgQueue.inner` SpinLock.
pub struct MsgQueueInner {
/// Messages stored in the queue (FIFO order).
/// Bounded by `max_bytes` (default MSGMNB = 16384).
/// BoundedRing is pre-allocated at msgget() time — no allocation under lock.
pub messages: BoundedRing<SysVMessage>,
/// Current total size of all messages in bytes.
pub current_bytes: usize,
/// Maximum bytes in queue (default: MSGMNB = 16384).
pub max_bytes: usize,
/// Tasks waiting to send (queue full).
pub send_wait: WaitQueue,
/// Tasks waiting to receive (queue empty or no matching type).
pub recv_wait: WaitQueue,
/// Last msgsnd time (monotonic nanoseconds).
pub stime: u64,
/// Last msgrcv time (monotonic nanoseconds).
pub rtime: u64,
/// Last msgctl IPC_SET or msgget creation time (monotonic nanoseconds).
pub ctime: u64,
}
/// A single SysV message.
/// Allocated from the per-IPC-namespace `msg_slab` cache before acquiring
/// the MsgQueue SpinLock. Freed back to the slab after msgrcv() returns.
pub struct SysVMessage {
/// Message type (from msgsnd mtype; must be > 0).
pub mtype: i64,
/// Message data — slab-allocated fixed-size buffer. `data_len` holds
/// the actual message length; the slab bucket may be larger.
pub data: SlabBox<[u8]>,
/// Actual data length in bytes (≤ slab bucket size).
pub data_len: usize,
}
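The allocate-before-lock ordering described above can be sketched in user space. `VecDeque` plus a byte counter stand in for `BoundedRing` and `MsgQueueInner`; the slab cache, wait queues, and type matching are elided:

```rust
// msgsnd() ordering sketch: allocate and fill the message BEFORE taking
// the queue lock, so the critical section is only a bounds check + push.
use std::collections::VecDeque;
use std::sync::Mutex;

struct Msg {
    #[allow(dead_code)]
    mtype: i64,
    data: Vec<u8>,
}

struct Queue {
    max_bytes: usize,                     // MSGMNB analog
    inner: Mutex<(VecDeque<Msg>, usize)>, // (messages, current_bytes)
}

fn msgsnd(q: &Queue, mtype: i64, payload: &[u8]) -> Result<(), &'static str> {
    // 1. Allocate + copy outside the lock (slab allocation in the kernel).
    let msg = Msg { mtype, data: payload.to_vec() };
    // 2. Under the lock: only a capacity check and a push — no allocation.
    let mut inner = q.inner.lock().unwrap();
    if inner.1 + msg.data.len() > q.max_bytes {
        return Err("EAGAIN"); // the real kernel would sleep on send_wait
    }
    inner.1 += msg.data.len();
    inner.0.push_back(msg);
    Ok(())
}

fn main() {
    let q = Queue { max_bytes: 16, inner: Mutex::new((VecDeque::new(), 0)) };
    assert!(msgsnd(&q, 1, &[0u8; 10]).is_ok());
    assert!(msgsnd(&q, 2, &[0u8; 10]).is_err()); // 20 > 16: queue full
    assert_eq!(q.inner.lock().unwrap().0.len(), 1);
    println!("msgsnd ok");
}
```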
/// POSIX message queue (mq_open/mq_send/mq_receive; mqueue filesystem).
/// Linux-compatible: /dev/mqueue filesystem, mq_notify(3) supported.
pub struct PosixMqueue {
/// Queue name (from mq_open; unique within the mqueue namespace).
pub name: ArrayString<256>,
/// Attributes: max messages, max message size, current count.
pub attr: MqueueAttr,
/// Priority queue: messages ordered by descending priority, then FIFO within
/// equal priority. Stored as a BinaryHeap; see `PosixMessage::cmp` for the
/// ordering that provides POSIX-required FIFO stability at equal priority.
/// Pre-allocated at mq_open() time with `BinaryHeap::with_capacity(attr.mq_maxmsg)`.
/// This ensures no heap allocation occurs under the SpinLock during mq_send().
/// Maximum memory per queue: mq_maxmsg * (size_of::<PosixMessage>() + mq_msgsize).
/// Bounded by RLIMIT_MSGQUEUE per-user limit (default: 819200 bytes on Linux).
/// The mq_open() syscall validates that the requested queue size does not
/// exceed the caller's remaining RLIMIT_MSGQUEUE allowance.
/// Mutable state protected by a single SpinLock (same pattern as `MsgQueue`
/// above — no `SpinLock<()>` anti-pattern). All fields that are modified
/// during mq_send()/mq_receive() are inside this lock.
pub inner: SpinLock<PosixMqueueInner>,
/// Tasks blocked on mq_receive (queue empty).
pub recv_waiters: WaitQueue,
/// Tasks blocked on mq_send (queue full).
pub send_waiters: WaitQueue,
/// Notification registration (mq_notify).
pub notify: Option<MqueueNotify>,
pub uid: u32,
pub gid: u32,
pub mode: u16,
}
/// Interior mutable state of a POSIX message queue, protected by `PosixMqueue.inner`.
pub struct PosixMqueueInner {
/// Priority queue: messages ordered by descending priority, then FIFO within
/// equal priority. Pre-allocated at mq_open() time with
/// `BinaryHeap::with_capacity(attr.mq_maxmsg)`.
pub queue: BinaryHeap<PosixMessage>,
/// Monotonically increasing sequence counter. Assigned to each message on
/// mq_send() to provide stable FIFO ordering within equal-priority messages.
pub next_seq: u64,
}
pub struct MqueueAttr {
/// Maximum number of messages (mq_maxmsg; default 10, max 65536).
pub maxmsg: u32,
/// Maximum message size in bytes (mq_msgsize; default 8192, max 16 MiB —
/// Linux's HARD_MSGSIZEMAX, matching the bound documented on `PosixMessage.data`).
pub msgsize: u32,
/// Current number of messages in the queue. Snapshot reported by
/// mq_getattr(); the authoritative count is `PosixMqueueInner::queue.len()`
/// read under the `inner` lock.
pub curmsgs: u32,
}
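The RLIMIT_MSGQUEUE accounting described above can be sketched as plain arithmetic. The helper name `mq_charge` and the 48-byte per-message bookkeeping overhead are illustrative assumptions, not the kernel's actual constants; the real check also charges the caller's other open queues against the limit.

```rust
// Sketch of the RLIMIT_MSGQUEUE check performed at mq_open() time
// (hypothetical helper; overhead value is an assumption for illustration).
fn mq_charge(maxmsg: u64, msgsize: u64, per_msg_overhead: u64) -> u64 {
    // Maximum memory per queue: maxmsg * (bookkeeping + payload capacity).
    maxmsg * (per_msg_overhead + msgsize)
}

fn main() {
    let rlimit_msgqueue: u64 = 819_200; // Linux default, in bytes

    // Default queue (10 messages x 8192 bytes) fits comfortably.
    let charge = mq_charge(10, 8192, 48);
    assert_eq!(charge, 82_400);
    assert!(charge <= rlimit_msgqueue); // mq_open() succeeds

    // A maximal request blows well past the allowance and is rejected.
    assert!(mq_charge(65_536, 8192, 48) > rlimit_msgqueue);
    println!("ok");
}
```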
/// A POSIX message queue message. Ordering is by descending priority, then
/// ascending sequence number (FIFO within equal priority), as required by
/// POSIX.1-2017 mq_receive(3).
///
/// # Ordering (BinaryHeap is a max-heap)
/// ```rust
/// impl Ord for PosixMessage {
/// fn cmp(&self, other: &Self) -> Ordering {
/// self.priority.cmp(&other.priority)
/// .then(other.seq.cmp(&self.seq)) // reverse seq: lower seq = older = wins
/// }
/// }
/// ```
/// A higher priority beats a lower priority. Within the same priority, the
/// message with the smaller `seq` (sent earlier) has a *larger* `Ord` value
/// so the max-heap dequeues it first — preserving FIFO order.
pub struct PosixMessage {
/// Priority (0–MQ_PRIO_MAX-1; higher = delivered first).
pub priority: u32,
/// Per-queue send sequence number. Assigned from `PosixMqueueInner::next_seq`
/// at mq_send() time. Breaks ties within equal-priority messages: lower
/// seq means the message was sent earlier and must be dequeued first.
pub seq: u64,
/// Message payload. `Box<[u8]>` is a warm-path allocation: one allocation
/// per `mq_send()` call. Size is bounded by `PosixMqueue.attr.mq_msgsize`
/// (max 16 MiB per Linux, default 8192 bytes). The BinaryHeap owns the
/// Box; deallocation happens on `mq_receive()` when the message is consumed.
pub data: Box<[u8]>,
}
/// Process identifier. Matches Linux `pid_t` (always `i32` on all architectures).
/// See [Section 8.1](08-process.md#process-and-task-management) for the kernel-internal PID allocation model.
pub type Pid = i32;
/// POSIX message queue notification registration (mq_notify(3)).
///
/// At most one process may register for notification on a given queue at any
/// time. Notification fires **once** when a message arrives on a previously
/// empty queue, then auto-deregisters (matching POSIX.1-2017 semantics).
/// The process must call `mq_notify()` again to re-register.
///
/// Registration: `mq_notify(mqd, &sigevent)` with `sigev_notify` = SIGEV_SIGNAL
/// or SIGEV_THREAD. Passing a null sigevent pointer deregisters the current
/// notification. Returns EBUSY if another process is already registered.
pub struct MqueueNotify {
/// Notification delivery mode.
pub mode: MqueueNotifyMode,
/// PID of the process that registered for notification.
/// Used to target the signal or thread creation.
pub pid: Pid,
}
/// Notification delivery mode for POSIX message queues.
///
/// Matches the Linux `SIGEV_SIGNAL` / `SIGEV_THREAD` semantics from
/// `<signal.h>` sigevent structure.
pub enum MqueueNotifyMode {
/// Deliver a signal to the registered process.
/// `si_code` is set to `SI_MESGQ`, `si_value` carries the `sigev_value`
/// from the original `mq_notify()` registration.
Signal {
/// Signal number to deliver (e.g., SIGRTMIN+0).
signo: u8,
/// User-provided value passed back in siginfo_t.si_value.
sigev_value: usize,
},
/// Spawn a new thread in the registered process.
/// The kernel creates a thread (via internal clone) with the specified
/// entry point and attributes. This matches SIGEV_THREAD semantics;
/// glibc's `mq_notify()` wrapper typically implements this in userspace,
/// but the kernel must support it for direct syscall users.
Thread {
/// Thread entry function address in the registered process.
notify_fn: usize,
/// Pointer to pthread_attr_t (or null for defaults) in the
/// registered process's address space.
notify_attr: usize,
/// User-provided value passed as the thread function argument.
sigev_value: usize,
},
}
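The fire-once, auto-deregister semantics can be reduced to a tiny state machine. This is a userspace sketch with hypothetical simplified types (`Queue`, a bare PID instead of `MqueueNotify`); the point is the `Option::take()` on the empty-to-non-empty transition.

```rust
// Sketch of the fire-once mq_notify state machine (hypothetical types;
// the real kernel path delivers a signal or spawns a thread here).
struct Queue { msgs: usize, notify: Option<u32 /* registered pid */> }

impl Queue {
    /// mq_send path: returns Some(pid) iff a notification should fire.
    fn send(&mut self) -> Option<u32> {
        let was_empty = self.msgs == 0;
        self.msgs += 1;
        // take() fires at most once and auto-deregisters, per POSIX.1-2017.
        if was_empty { self.notify.take() } else { None }
    }
}

fn main() {
    let mut q = Queue { msgs: 0, notify: Some(42) };
    assert_eq!(q.send(), Some(42)); // empty -> non-empty: fires once
    assert_eq!(q.send(), None);     // queue already non-empty, and deregistered
    q.msgs = 0;
    assert_eq!(q.send(), None);     // must call mq_notify() again to re-register
    println!("ok");
}
```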
/// Cgroup namespace state.
pub struct CgroupNamespace {
/// Unique namespace ID (same value as the nsfs inode number).
pub ns_id: u64,
/// Owning user namespace. Required by the `Namespace` trait for
/// `setns()` capability checks.
pub user_ns: Weak<UserNamespace>,
/// Root cgroup directory visible to processes in this namespace.
/// Processes see this as "/" in /sys/fs/cgroup.
pub cgroup_root: Arc<Cgroup>,
}
/// Time namespace state (Linux 5.6+).
pub struct TimeNamespace {
/// Unique namespace ID (same value as the nsfs inode number).
pub ns_id: u64,
/// Owning user namespace. Required by the `Namespace` trait for
/// `setns()` capability checks.
pub user_ns: Weak<UserNamespace>,
/// Offset added to CLOCK_MONOTONIC for processes in this namespace.
/// Allows containers to see a "boot time" starting from 0.
pub monotonic_offset_ns: AtomicI64,
/// Offset added to CLOCK_BOOTTIME for processes in this namespace.
pub boottime_offset_ns: AtomicI64,
}
17.1.3 Container Root Filesystem: pivot_root(2)¶
Container runtimes (runc, containerd, crun) require a mechanism to change the root filesystem
after setting up the mount namespace. UmkaOS implements the standard pivot_root(2) syscall:
/// pivot_root(new_root: &CStr, put_old: &CStr) -> Result<()>
///
/// Atomically swaps the root mount with another mount point. Required for
/// OCI-compliant container creation.
///
/// # Prerequisites (checked by syscall)
/// - new_root must be a mount point
/// - put_old must be at or under new_root
/// - Caller must be in a mount namespace (CLONE_NEWNS or unshare(CLONE_NEWNS))
/// - Caller must have CAP_SYS_ADMIN in its user namespace
///
/// # Operation
/// 1. Attach new_root to the root of the mount namespace
/// 2. Move the old root to put_old
/// 3. The process's root directory is now new_root
/// 4. Subsequent umount(put_old) removes the old root from the namespace
///
/// # Container Runtime Usage
/// ```
/// // Standard OCI container creation sequence:
/// unshare(CLONE_NEWNS); // New mount namespace
/// mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL); // Make all private
/// mount("/var/lib/container/rootfs", "/var/lib/container/rootfs",
/// NULL, MS_BIND | MS_REC, NULL); // Bind-mount rootfs onto itself
/// pivot_root("/var/lib/container/rootfs", "/var/lib/container/rootfs/.oldroot");
/// chdir("/"); // Ensure we're in new root
/// umount2("/.oldroot", MNT_DETACH); // Detach old root
/// // Process now sees container rootfs as /
/// ```
///
/// # Difference from chroot(2)
/// pivot_root is fundamentally different from chroot:
/// - chroot only affects the process's view of the root directory
/// - pivot_root actually moves the mount point, affecting all processes in the namespace
/// - chroot can be escaped via mount namespace tricks; pivot_root cannot
/// - Container runtimes MUST use pivot_root for secure isolation
///
/// # Error codes
/// - EBUSY: new_root is not a mount point, or put_old is not under new_root
/// - EINVAL: new_root and put_old are the same
/// - ENOENT: path component does not exist
/// - ENOTDIR: path component is not a directory
/// - EPERM: Caller lacks CAP_SYS_ADMIN, or not in mount namespace
/// - ENOSYS: Not implemented (will not occur in UmkaOS)
SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, const char __user *, put_old)
Mandatory umount2(put_old, MNT_DETACH) after pivot_root: After pivot_root()
succeeds, the host filesystem is still mounted at put_old inside the container's
mount namespace. This is a security requirement — the container init process MUST
call umount2(put_old, MNT_DETACH) to detach the host filesystem before executing the
container entrypoint. Without this step:
- The entire host filesystem remains visible and traversable inside the container at the
  put_old mount point (e.g., /.oldroot/etc/shadow, /.oldroot/proc).
- A container process with sufficient capabilities could read host secrets, modify host
  files, or escape the container entirely by chdir-ing into the host filesystem tree.
MNT_DETACH (lazy unmount) is used instead of a synchronous unmount because in-flight
path lookups may still hold references to the old root mount; lazy unmount detaches the
mount from the namespace immediately (invisible to new lookups), and the actual cleanup
occurs after the last reference is released (RCU grace period).
OCI-compliant container runtimes (runc, containerd, crun) all perform this step. UmkaOS
does not enforce it automatically (the kernel cannot know when the container setup
sequence is complete), but the container creation documentation, examples, and the
pivot_root(2) man page MUST document this as a mandatory step. The container creation
sequence in the seccomp-bpf section below reflects this ordering.
Effect on other processes: pivot_root only affects processes whose root is the old
root mount within the same mount namespace. Processes in other mount namespaces are
unaffected. Within the same mount namespace, processes whose root directory points to the
old root mount will see the new root after the RCU-published pointer swap (step 7 of the
implementation notes below).
Processes that have already chroot-ed to a subdirectory of the old root are also
unaffected because their root is not the namespace root mount.
Interaction with other namespaces:
- pivot_root operates on the caller's mount namespace
- The root change is visible to all processes sharing that mount namespace
- Combined with PID namespace: the container's init (PID 1) sees only the new root
- Combined with User namespace: unprivileged processes can pivot_root within their
own user namespace if they have CAP_SYS_ADMIN there
Implementation notes:
The VFS layer (Section 14.1) handles the mount tree manipulation. The Mount struct,
MountNamespace, mount hash table, and the complete pivot_root algorithm using
these types are defined in Section 14.6 (13-vfs.md). The summary below is
retained for context; the authoritative specification is Section 14.6.
1. Lookup new_root and verify it's a mount point
2. Lookup put_old and verify it's under new_root
3. Lock the mount tree for modification (holds mount_lock)
4. Detach the current root from the namespace's mount list
5. Attach new_root as the new namespace root
6. Reattach the old root at the put_old position
7. Publish the new root via RCU: rcu_assign_pointer(namespace->root, new_root)
8. Unlock the mount tree
Atomicity with respect to path lookups:
Steps 4–6 are performed while holding mount_lock, and the old root pointer remains valid in the RCU-published slot until step 7 overwrites it. Path lookups (open(), stat(), readlink(), etc.) take an RCU read-side reference to the namespace root at the start of lookup via rcu_dereference(namespace->root). This ensures:
- In-flight path lookups that started before pivot_root complete with the old root (consistent view)
- New path lookups that start after step 7 see the new root
- No path lookup can see a partially-updated state (no torn reads, no null pointer)
- Between steps 4–6, the data structures are modified under mount_lock, but lookups still see the old root via RCU
The RCU grace period after step 7 ensures that by the time umount(put_old) completes, no in-flight lookups hold references to the old root.
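The snapshot-then-publish discipline can be modeled in userspace with an atomic `Arc` swap. This is a sketch under stated assumptions: `Mutex<Arc<..>>` stands in for the RCU-published pointer, and the helper names (`grab_root`, `publish_root`) are hypothetical analogues of `rcu_dereference` and `rcu_assign_pointer`.

```rust
use std::sync::{Arc, Mutex};

// Userspace model of pivot_root's step-7 publish: lookups snapshot the
// namespace root once at lookup start, so each lookup sees either the old
// or the new root in full — never a torn state.
struct Mount { path: &'static str }
struct MountNs { root: Mutex<Arc<Mount>> }

impl MountNs {
    /// Lookup side: snapshot the root (rcu_dereference analogue).
    fn grab_root(&self) -> Arc<Mount> { self.root.lock().unwrap().clone() }
    /// pivot_root step 7: swap the published root, returning the old one
    /// (rcu_assign_pointer analogue).
    fn publish_root(&self, new_root: Arc<Mount>) -> Arc<Mount> {
        std::mem::replace(&mut *self.root.lock().unwrap(), new_root)
    }
}

fn main() {
    let ns = MountNs { root: Mutex::new(Arc::new(Mount { path: "/" })) };
    let in_flight = ns.grab_root(); // lookup that started before pivot_root
    let old = ns.publish_root(Arc::new(Mount { path: "/var/lib/container/rootfs" }));

    assert_eq!(in_flight.path, "/"); // in-flight lookup completes on the old root
    assert_eq!(ns.grab_root().path, "/var/lib/container/rootfs"); // new lookups: new root
    // The old root is freed only after the last reference drops — the
    // grace-period analogue that makes the later umount(put_old) safe.
    assert_eq!(Arc::strong_count(&old), 2); // `old` + `in_flight`
    drop(in_flight);
    assert_eq!(Arc::strong_count(&old), 1);
    println!("ok");
}
```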
17.1.4 Joining Namespaces: setns(2) and nsenter¶
Container operations like docker exec require joining an existing namespace. UmkaOS implements
setns(2) for this purpose:
/// setns(fd: RawFd, nstype: c_int) -> Result<()>
///
/// Reassociates the calling thread with the namespace referenced by fd.
///
/// # Parameters
/// - fd: File descriptor referring to a namespace (obtained from /proc/[pid]/ns/[type])
/// - nstype: Namespace type (CLONE_NEW* constant) or 0 to auto-detect from fd
///
/// # Prerequisites
/// - Caller must have CAP_SYS_ADMIN in the target namespace's owning user namespace
/// - For PID namespaces: No restriction (affects future children only, per Linux 3.8+)
/// - For user namespaces: Caller must not be in a chroot environment
/// - The namespace must still exist (owning process hasn't exited)
///
/// # Container Runtime Usage (docker exec)
/// ```
/// // Join a running container's namespaces:
/// int fd = open("/proc/[container_pid]/ns/mnt", O_RDONLY | O_CLOEXEC);
/// setns(fd, CLONE_NEWNS); // Join mount namespace
/// close(fd);
///
/// fd = open("/proc/[container_pid]/ns/net", O_RDONLY | O_CLOEXEC);
/// setns(fd, CLONE_NEWNET); // Join network namespace
/// close(fd);
///
/// // PID namespace must be joined via clone(), not setns()
/// // (kernel limitation: can't change PID namespace of running process)
/// // exec() into container: now running in container's namespaces
/// execve("/bin/sh", ["/bin/sh"], envp);
/// ```
///
/// # User Namespace Ordering (Implementation Detail)
/// When a single setns() call joins a user namespace alongside other namespace types,
/// joining the user namespace first is required internally because it changes the
/// caller's effective capabilities. UmkaOS handles this transparently: the kernel
/// internally processes any user namespace transition before other namespace
/// transitions in the same call sequence, regardless of the order in which the
/// caller passes fds. The caller may pass namespace fds in any order; UmkaOS's
/// implementation reorders them as needed. This matches Linux behavior, where
/// `nsenter --all` and `unshare` may call setns() in arbitrary order without error.
///
/// # Namespace file descriptors
/// Each namespace type is exposed via /proc/[pid]/ns/:
/// ```
/// /proc/[pid]/ns/cgroup → Cgroup namespace
/// /proc/[pid]/ns/ipc → IPC namespace
/// /proc/[pid]/ns/mnt → Mount namespace
/// /proc/[pid]/ns/net → Network namespace
/// /proc/[pid]/ns/pid → PID namespace (current)
/// /proc/[pid]/ns/pid_for_children → PID namespace for future children (after setns)
/// /proc/[pid]/ns/time → Time namespace (Linux 5.6+)
/// /proc/[pid]/ns/time_for_children → Time namespace for future children (Linux 5.8+)
/// /proc/[pid]/ns/user → User namespace
/// /proc/[pid]/ns/uts → UTS namespace
/// ```
///
/// The `*_for_children` symlinks reveal the pending namespace set by
/// `setns(CLONE_NEWPID)` or `setns(CLONE_NEWTIME)`. They differ from the
/// main symlinks when a process has called `setns()` but not yet forked.
/// Container introspection tools (`lsns`, `nsenter --target`) use these.
///
/// These are magic links: reading them returns the namespace type, and
/// opening them gives a file descriptor that can be passed to setns().
///
/// # Error codes
/// - EBADF: Invalid fd
/// - EINVAL: fd does not refer to a namespace, nstype doesn't match fd type,
/// or (for PID namespace) caller has other threads in its thread group
/// - EPERM: Caller lacks CAP_SYS_ADMIN in target namespace's user namespace
/// - ENOENT: Namespace has been destroyed
SYSCALL_DEFINE2(setns, int, fd, int, nstype)
PID namespace special case:
A process cannot change its own PID namespace via setns() — the process's PID in its
original namespace remains unchanged. However, setns(fd, CLONE_NEWPID) is valid
since Linux 3.8: it sets the PID namespace for future children created by
fork()/clone(). The caller's own PID view is unchanged, but newly created children
will be in the target PID namespace.
This is why docker exec uses nsenter with --fork flag: it joins other namespaces
via setns(), sets the target PID namespace for children, then forks a child that
inherits all joined namespaces and has the correct PID view.
TOCTOU safety: setns() acquires a reference count on the target namespace before
validating it, then holds that reference across the join operation. The namespace cannot
be destroyed while setns() holds its reference — this prevents the use-after-free TOCTOU
that would otherwise exist between checking namespace validity and joining it. The reference
is released after the join completes or if validation fails.
Implementation:
fn sys_setns(fd: RawFd, nstype: c_int) -> Result<()> {
let file = current_task().files.get(fd)?;
let ns_inode = file.inode.downcast_ref::<NsInode>()
.ok_or(Errno::EINVAL)?;
// Verify nstype matches (if specified)
if nstype != 0 && ns_inode.nstype != nstype {
return Err(Errno::EINVAL);
}
// Internal reordering: if this fd is for a user namespace and the caller has
// already registered other namespace fds in this setns() sequence, the user
// namespace transition is applied first before those pending transitions are
// processed. This is an implementation detail — the caller may pass fds in
// any order (Linux-compatible API). No EINVAL is returned for ordering.
// Check CAP_SYS_ADMIN in target namespace's user namespace
let target_user_ns = ns_inode.namespace.user_ns.upgrade().ok_or(Errno::ENOENT)?;
if !ns_capable(current_task(), &target_user_ns, CAP_SYS_ADMIN) {
return Err(Errno::EPERM);
}
// Join the namespace. Like Linux's switch_task_namespaces(), we clone
// the current NamespaceSet, replace the target namespace field, and
// swap the task's nsproxy Arc to point to the new set. This is per-task:
// sibling threads are unaffected (they hold their own Arc<NamespaceSet>).
let task = current_task();
// RCU load + structural clone. This clones the entire NamespaceSet even
// though setns() only modifies one field. The clone cost is ~8 Arc::clone
// operations (one per namespace field) + SpinLock init for pending_pid_ns
// and pending_time_ns. This is acceptable for setns() (cold path, ~1-10
// calls per container lifecycle). A future optimization could use a
// COW NamespaceSet that defers cloning until the second mutation, but
// the complexity is not justified given the low call frequency.
let old_ns = task.nsproxy.load().as_ref().clone();
// Build a new NamespaceSet from the clone. All arms modify `new_ns`
// in-place rather than using struct update syntax (`..old_ns`), which
// avoids Rust borrow-checker issues: struct update moves the source,
// making it incompatible with arms that need `old_ns` intact (TIME, PID).
//
// **SpinLock::Clone contract**: Cloning a `NamespaceSet` clones the
// protected DATA inside each SpinLock field (e.g., pending_pid_ns,
// pending_time_ns), not the lock state. The new `SpinLock` is a fresh,
// unlocked instance protecting a clone of the inner value. Calling
// `.lock().replace()` on `new_ns.pending_pid_ns` below acquires the
// NEW lock (which is uncontended — `new_ns` is a local variable with
// no concurrent accessors). The lock acquisition is technically
// vacuous for the local variable, but it is required by the SpinLock
// API and ensures the code compiles correctly with the same type
// signatures used when operating on a shared NamespaceSet.
let mut new_ns = old_ns;
match ns_inode.nstype {
CLONE_NEWNS => {
// Mount namespace join requires CAP_SYS_ADMIN in the caller's
// own user namespace AND CAP_SYS_CHROOT in the caller's own
// user namespace (because joining a mount namespace resets
// root/pwd, equivalent to a chroot). The CAP_SYS_ADMIN check
// in the target namespace's user_ns was already performed above.
// This matches Linux kernel behavior (verified via setns(2) man page).
let caller_user_ns = &task.nsproxy.load().user_ns;
if !ns_capable(current_task(), caller_user_ns, CAP_SYS_ADMIN) {
return Err(Errno::EPERM);
}
if !ns_capable(current_task(), caller_user_ns, CAP_SYS_CHROOT) {
return Err(Errno::EPERM);
}
// Switch to the target mount namespace. After updating
// nsproxy.mnt_ns, reset task.fs.root and task.fs.pwd to the
// new namespace's root mount point. This matches Linux behavior:
// entering a mount namespace adopts its root. Both root and pwd
// are reset together under fs.write() to ensure consistency.
//
// Open file descriptors are NOT affected — they retain their
// original dentry/vfsmount references (not re-resolved). Only
// future path resolutions (open, stat, etc.) use the new root.
let target_mnt = ns_inode.namespace.as_mnt_ns()
.expect("MNT namespace");
let new_root = target_mnt.root_mount();
{
// Acquire fs.write() lock — this serializes both field
// updates atomically. Concurrent readers (via task.fs.read())
// see either both old values or both new values, never a
// mixed state. This is benign by design: the RwLock ensures
// root and pwd are always consistent with each other.
let mut fs = task.fs.write();
fs.root = new_root.clone();
fs.pwd = new_root;
}
new_ns.mount_ns = target_mnt;
}
CLONE_NEWNET => {
let target_net = ns_inode.namespace.as_net_ns().expect("NET namespace");
let target_stack = target_net.stack_cap.clone();
new_ns.net_ns = target_net;
new_ns.net_stack = target_stack;
}
CLONE_NEWUTS => {
new_ns.uts_ns = ns_inode.namespace.as_uts_ns().expect("UTS namespace");
}
CLONE_NEWIPC => {
new_ns.ipc_ns = ns_inode.namespace.as_ipc_ns().expect("IPC namespace");
}
CLONE_NEWUSER => {
// Chroot'd processes cannot join user namespaces (could escape chroot)
if task.is_chrooted() {
return Err(Errno::EPERM);
}
// Multi-threaded processes cannot switch user namespaces: all threads
// share task.cred, so the credential swap below is only safe when the
// thread group has exactly one member (EINVAL, matching Linux and the
// multi-threaded restriction documented below).
if task.thread_group_count() > 1 {
return Err(Errno::EINVAL);
}
let target_user = ns_inode.namespace.as_user_ns().expect("USER namespace");
// Update IMA namespace alongside user namespace. IMA measurement
// policy is scoped per user namespace — switching user_ns without
// updating ima_ns would cause IMA policy lookups to resolve against
// the wrong namespace, potentially bypassing container-specific
// integrity requirements or logging measurements to the wrong log.
let target_ima = target_user.ima_ns.clone();
// Atomicity: nsproxy.user_ns and task.cred MUST be updated together
// under a single task_lock() hold. Without this, a window exists where
// nsproxy.user_ns points to the new namespace but cred.user_ns still
// references the old namespace (or vice versa). During that window,
// ns_capable() checks would resolve against the wrong namespace,
// potentially granting or denying capabilities incorrectly.
//
// Protocol (Solution B — true atomic swap):
// 1. Prepare new credentials with cred.user_ns = target_user.
// 2. Pre-allocate new nsproxy Arc BEFORE taking task_lock.
// 3. Under task_lock(): commit_creds(new_cred) AND swap nsproxy
// atomically. Both are O(1) pointer swaps, safe under spinlock.
// 4. Drop old_nsproxy OUTSIDE lock scope (Arc refcount decrement
// may involve deallocation, which must not happen under spinlock).
//
// Invariant: commit_creds() must not sleep or acquire locks that
// nest outside alloc_lock (confirmed: UmkaOS commit_creds is
// rcu_assign_pointer only).
//
// Better than Linux: zero-width window. Linux accepts a brief
// cred/nsproxy disagreement and relies on the soft invariant that
// ns_capable() reads cred.user_ns. Our approach is defense-in-depth.
let mut new_cred = prepare_creds(&task.cred);
new_cred.user_ns = target_user.clone();
new_cred.cap_effective = CAP_FULL_SET;
new_cred.cap_permitted = CAP_FULL_SET;
new_cred.cap_inheritable = 0;
new_cred.cap_bounding = CAP_FULL_SET;
new_cred.cap_ambient = 0;
new_cred.uid = target_user.map_uid_from_parent(task.cred.uid);
new_cred.gid = target_user.map_gid_from_parent(task.cred.gid);
new_ns.user_ns = target_user;
new_ns.ima_ns = target_ima;
// Pre-allocate nsproxy before taking the lock (allocation may sleep).
let new_nsproxy = Arc::new(new_ns);
let old_nsproxy;
{
let _guard = task.task_lock();
commit_creds(new_cred); // RCU pointer swap, O(1)
// Task.nsproxy is ArcSwap<NamespaceSet> — interior mutability
// allows mutation through &Task. The ArcSwap::store() is an atomic
// pointer exchange (O(1)), safe under task_lock.
old_nsproxy = task.nsproxy.swap(new_nsproxy);
}
drop(old_nsproxy); // Arc refcount decrement OUTSIDE lock
return Ok(()); // early return — skip common nsproxy swap below
}
CLONE_NEWCGROUP => {
new_ns.cgroup_ns = ns_inode.namespace.as_cgroup_ns()
.expect("CGROUP namespace");
}
CLONE_NEWTIME => {
// Time namespace affects future children, not the caller (Linux 5.8+ semantics).
// Set pending_time_ns so fork()/clone() children use the target time offsets.
// pending_time_ns is SpinLock-protected — interior mutability.
new_ns.pending_time_ns.lock()
.replace(ns_inode.namespace.as_time_ns().expect("TIME namespace"));
}
CLONE_NEWPID => {
// PID namespace affects future children, not the caller.
// Set pending_pid_ns so fork()/clone() creates children in target NS.
// pending_pid_ns is SpinLock-protected — interior mutability.
new_ns.pending_pid_ns.lock()
.replace(ns_inode.namespace.as_pid_ns().expect("PID namespace"));
}
_ => return Err(Errno::EINVAL),
};
// Atomically replace the task's nsproxy under task_lock to prevent
// concurrent setns() races. Without locking, two concurrent setns()
// calls (thread A sets NET, thread B sets UTS) could race — B's store
// would overwrite A's net_ns change because both cloned from the same
// old nsproxy. task_lock serializes the entire clone-modify-swap.
let new_nsproxy = Arc::new(new_ns);
{
let _guard = task.task_lock();
task.nsproxy.store(new_nsproxy);
}
Ok(())
}
chroot + setns interaction:
When setns(CLONE_NEWNS) switches the task's mount namespace, task.fs.root and
task.fs.pwd are reset to the new namespace's root mount (lines above). This
effectively escapes any previous chroot boundary -- the task's root is now the
new namespace's root, not the chroot directory. This is Linux-compatible behavior
(man 2 setns: "A process reassociating itself with a new mount namespace... will
have its root and current working directory reset to the root of the mount namespace").
For user namespaces, setns(CLONE_NEWUSER) is denied for chrooted tasks (returns
EPERM) to prevent chroot escapes via user namespace capability grants.
setns(CLONE_NEWUSER) credential transformation:
When setns(fd, CLONE_NEWUSER) is called, the kernel performs a credential
transformation to grant the caller capabilities within the target user namespace:
1. Validation: Verify the caller has CAP_SYS_ADMIN in the target user namespace's
   parent namespace. This prevents unprivileged users from joining arbitrary user
   namespaces — only a process that is already privileged in the parent context can
   adopt a child user namespace's identity.
2. Credential update (commit_creds path — used by nsenter --user): When
   setns(CLONE_NEWUSER) is called directly (not via pending+fork), the kernel
   performs an immediate credential transformation:

   let mut new_cred = prepare_creds(&current_task().cred);
   new_cred.user_ns = target_ns.clone();
   // Grant full capability set within the target namespace.
   // These capabilities are namespace-scoped: ns_capable() checks
   // resolve against cred.user_ns, so CAP_FULL_SET here does NOT
   // grant capabilities in the parent or init namespace.
   new_cred.cap_effective = CAP_FULL_SET;
   new_cred.cap_permitted = CAP_FULL_SET;
   new_cred.cap_inheritable = 0;
   // Reset bounding set to full — all capabilities are permitted
   // in the new namespace (matching Linux behavior).
   new_cred.cap_bounding = CAP_FULL_SET;
   // Clear ambient capabilities — ambient caps do not cross user
   // namespace boundaries (matching Linux 4.8+ behavior).
   new_cred.cap_ambient = 0;
   // UID/GID mapping: the caller's UID/GID are resolved through
   // the target namespace's uid_map/gid_map. If no mapping exists
   // for the caller's UID, it appears as overflow_uid (65534).
   new_cred.uid = target_ns.map_uid_from_parent(current_cred().uid);
   new_cred.gid = target_ns.map_gid_from_parent(current_cred().gid);
   commit_creds(new_cred);

3. Multi-threaded restriction: setns(CLONE_NEWUSER) fails with EINVAL if the calling
   process has more than one thread (thread_group_count > 1). This matches Linux
   behavior: changing the user namespace affects credential resolution for all threads
   (they share task.cred), so it is only safe when the process is single-threaded.
   Multi-threaded processes must use clone(CLONE_NEWUSER) to create a child in the
   new namespace instead.
Fork/clone handling of pending namespaces:
When fork() or clone() creates a child process, it reads (but does NOT consume)
the pending PID and time namespaces. The lock().clone() pattern ensures that
concurrent setns() + clone() in a multi-threaded process cannot race. The pending
value is NOT consumed on fork -- this matches Linux's pid_ns_for_children semantics
where unshare(CLONE_NEWPID) affects ALL future children, not just the next one.
The pending value is only cleared by a subsequent setns() that replaces it, or by
the parent's exit:
// In copy_namespaces() during fork/clone:
let pending_pid = current_task().nsproxy.load().pending_pid_ns.lock().clone();
let child_pid_ns = pending_pid.unwrap_or_else(|| Arc::clone(&current_task().nsproxy.load().pid_ns));
let pending_time = current_task().nsproxy.load().pending_time_ns.lock().clone();
let child_time_ns = pending_time.unwrap_or_else(|| Arc::clone(&current_task().nsproxy.load().time_ns));
17.1.5 Namespace Hierarchy and Inheritance¶
Namespaces form a hierarchical tree with parent-child relationships. When a process creates a new namespace via clone() or unshare(), the new namespace is a child of the caller's namespace:
Root Namespace (init)
├── PID NS 1 (container A) ← child of root PID NS
│ └── PID NS 1.1 (nested container) ← child of PID NS 1
├── PID NS 2 (container B) ← child of root PID NS
└── User NS 1 (unprivileged container)
└── User NS 1.1 (child of User NS 1)
Parent-child link semantics:
/// Per-namespace-type hierarchy tracking.
pub struct NamespaceHierarchy {
/// Pointer to parent namespace (None for root).
/// The parent reference is weak (Weak<dyn Namespace>) to prevent reference cycles.
/// When the parent is dropped (all processes exited), child namespaces
/// become orphans but remain functional until their own processes exit.
pub parent: Option<Weak<dyn Namespace>>,
/// Children of this namespace (weak references).
/// Weak references allow children to be destroyed independently of the parent.
/// This matches Linux behavior: a parent namespace can be destroyed while
/// children still exist (children become orphans but remain functional).
///
/// **Eager cleanup**: when a child namespace is dropped, its `Drop`
/// implementation acquires the parent's `children` lock and removes itself
/// via `retain(|w| !Weak::ptr_eq(w, &self_weak))`. This prevents dead
/// `Weak` refs from accumulating — the Vec length always equals the number
/// of living children. The Vec is unbounded (K8s may create thousands of
/// network namespaces) but acceptable per collection policy §3.1.13:
/// namespace creation is warm-path (per-container, not per-syscall).
/// Eager cleanup is O(N) per child drop, serialized by the Mutex.
/// Under mass teardown of N sibling namespaces, total cost is O(N^2).
/// At N=100 (typical K8s pod), this is <100us. At N=1000+, batch cleanup
/// via deferred GC is an optimization opportunity for extreme cases but is
/// not implemented in Phase 2; the O(N^2) bound at N<=1000 (<=10ms) is
/// acceptable for all practical K8s deployments.
pub children: Mutex<Vec<Weak<dyn Namespace>>>,
/// Namespace type (PID, NET, MNT, UTS, IPC, USER, CGROUP, TIME).
pub ns_type: NamespaceType,
/// Inode number for this namespace (for /proc/PID/ns/*).
/// Generated from a global counter, unique across all namespace types.
pub ns_id: u64,
}
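The eager-cleanup protocol described on `children` can be demonstrated with `std` primitives alone. A sketch (the `Drop` logic is inlined into `main` for brevity; `Ns` is a hypothetical stand-in):

```rust
use std::sync::{Arc, Mutex, Weak};

// Sketch of eager cleanup: a dying child removes its own Weak entry from
// the parent's children list via Weak::ptr_eq + retain.
struct Ns { children: Mutex<Vec<Weak<Ns>>> }

fn main() {
    let parent = Arc::new(Ns { children: Mutex::new(Vec::new()) });
    let child = Arc::new(Ns { children: Mutex::new(Vec::new()) });
    parent.children.lock().unwrap().push(Arc::downgrade(&child));
    assert_eq!(parent.children.lock().unwrap().len(), 1);

    // What the child's Drop impl would run: ptr_eq identifies the child's
    // own entry by allocation, even after its strong count reaches zero.
    let self_weak = Arc::downgrade(&child);
    drop(child);
    parent.children.lock().unwrap().retain(|w| !Weak::ptr_eq(w, &self_weak));
    assert_eq!(parent.children.lock().unwrap().len(), 0); // no dead Weak left behind
    println!("ok");
}
```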
/// Inode backing `/proc/[pid]/ns/*` entries in the nsfs pseudo-filesystem.
///
/// Each namespace type is exposed as a magic symlink under `/proc/[pid]/ns/`.
/// Opening such a symlink returns a file descriptor backed by an `NsInode`.
/// This fd can be passed to `setns(fd, nstype)` to join the namespace, or
/// held open to keep the namespace alive (preventing destruction even after
/// all member processes have exited).
///
/// The nsfs pseudo-filesystem is mounted internally at boot and is not
/// visible in the mount tree. Its sole purpose is to provide inode objects
/// for namespace file descriptors.
pub struct NsInode {
/// Namespace type (Pid, Net, Mnt, Uts, Ipc, User, Cgroup, Time).
/// Used by `setns()` to verify the `nstype` argument matches the fd
/// (referenced as `ns_inode.nstype` in sys_setns above).
pub nstype: NamespaceType,
/// Reference to the actual namespace object. Downcasted to the concrete
/// type (`Arc<NetNamespace>`, `Arc<PidNamespace>`, etc.) by `setns()`
/// and other namespace operations. Holding this `Arc` keeps the namespace
/// alive as long as the fd is open.
pub namespace: Arc<dyn Namespace>,
/// Inode number = namespace ID (unique u64, assigned at namespace creation
/// from a global atomic counter). This is the value returned by `stat()`
/// on `/proc/[pid]/ns/*` and used by `lsns(1)` to identify namespaces.
pub ino: u64,
}
/// Trait implemented by all namespace types. Provides type identification
/// and the unique namespace ID for nsfs inode generation.
pub trait Namespace: Send + Sync {
/// Returns the type of this namespace.
fn ns_type(&self) -> NamespaceType;
/// Returns the unique namespace ID (same value as the nsfs inode number).
fn id(&self) -> u64;
/// Returns the owning user namespace (for capability checks in `setns()`).
fn user_ns(&self) -> Weak<UserNamespace>;
/// Downcast to MountNamespace. Default returns None; overridden by MountNamespace.
fn as_mnt_ns(&self) -> Option<Arc<MountNamespace>> { None }
/// Downcast to PidNamespace. Default returns None; overridden by PidNamespace.
fn as_pid_ns(&self) -> Option<Arc<PidNamespace>> { None }
/// Downcast to NetNamespace. Default returns None; overridden by NetNamespace.
fn as_net_ns(&self) -> Option<Arc<NetNamespace>> { None }
/// Downcast to UtsNamespace. Default returns None; overridden by UtsNamespace.
fn as_uts_ns(&self) -> Option<Arc<UtsNamespace>> { None }
/// Downcast to IpcNamespace. Default returns None; overridden by IpcNamespace.
fn as_ipc_ns(&self) -> Option<Arc<IpcNamespace>> { None }
/// Downcast to UserNamespace. Default returns None; overridden by UserNamespace.
fn as_user_ns(&self) -> Option<Arc<UserNamespace>> { None }
/// Downcast to CgroupNamespace. Default returns None; overridden by CgroupNamespace.
fn as_cgroup_ns(&self) -> Option<Arc<CgroupNamespace>> { None }
/// Downcast to TimeNamespace. Default returns None; overridden by TimeNamespace.
fn as_time_ns(&self) -> Option<Arc<TimeNamespace>> { None }
}
Inheritance rules:
| Namespace Type | Child Inherits | Modification Scope |
|---|---|---|
| CLONE_NEWPID | No (child starts fresh with PID 1) | Child's PID 1 = child init process |
| CLONE_NEWNET | No (child gets isolated network stack) | Child has no interfaces except loopback |
| CLONE_NEWNS | Yes (copy-on-write mount tree) | Child's mounts are private unless marked shared |
| CLONE_NEWUTS | Yes (copies parent's hostname/domainname) | Container runtimes typically overwrite via sethostname() |
| CLONE_NEWIPC | No (child gets empty IPC namespace) | Child has isolated SysV/POSIX IPC |
| CLONE_NEWUSER | No (child starts with empty UID/GID mappings) | Parent must write /proc/PID/uid_map and gid_map to grant subordinate ranges |
| CLONE_NEWCGROUP | No (child gets own cgroup root) | Child's cgroup is a child of caller's cgroup |
| CLONE_NEWTIME | No (child gets zero offsets) | Child's time offsets are independent |
Note on CLONE_NEWUTS: The child namespace initially inherits the parent's hostname and domainname (copy, not reference). Container runtimes (runc, containerd) typically overwrite this immediately with the container ID via sethostname().
Note on CLONE_NEWUSER: A newly created user namespace starts with empty UID/GID mappings — all UIDs/GIDs resolve to nobody/nogroup (65534) until mappings are written to /proc/PID/uid_map and /proc/PID/gid_map by a privileged process in the parent namespace. This is a critical security property: children do not automatically inherit the parent's full UID range. Instead, the parent explicitly grants a subordinate range (typically from /etc/subuid and /etc/subgid).
User namespace nesting limit: User namespaces can be nested to a maximum depth of 32 (matching Linux's compile-time limit). clone() or unshare() with CLONE_NEWUSER returns ENOSPC if the nesting depth would exceed 32. This prevents resource exhaustion attacks via deeply nested namespaces.
Namespace reference counting: Each namespace is reference-counted via Arc<Namespace>. A namespace is destroyed when:
1. All processes in the namespace have exited (process count → 0)
2. All file descriptors referring to /proc/PID/ns/* are closed
3. All bind mounts of the namespace file have been unmounted
Reference chain for bind-mounted namespace files: When a namespace file
(/proc/PID/ns/net, etc.) is bind-mounted to keep the namespace alive, the
reference chain is: Mount → Dentry → NsInode → Arc<Namespace>. Each
link holds an Arc reference. Unmounting drops the Mount, which drops the
Dentry reference, which drops the NsInode, which decrements the
Arc<Namespace> refcount. Only when ALL three conditions above are met does
the refcount reach zero, triggering the destruction protocol below.
Note: Namespace destruction is independent of parent/child relationships. A child namespace can outlive its parent (it becomes an orphan but remains functional), and a parent can be destroyed while children exist.
Namespace Destruction Protocol:
Each namespace is destroyed independently when its own Arc refcount drops
to zero. Namespaces do not all reach zero simultaneously — a task's nsproxy
drop decrements each namespace's refcount independently. When a single task
exits and drops its Arc<NamespaceSet>, each contained Arc<FooNamespace> is
decremented. If that decrement brings a namespace's refcount to zero, that
namespace's cleanup callback runs immediately. Other namespaces in the same
nsproxy may still have non-zero refcounts (held by other tasks, bind mounts,
or open /proc/PID/ns/* fds).
The "reverse creation order" below applies to the callbacks registered in a
single nsproxy release (where multiple namespaces may drop to zero in the same
drop(Arc<NamespaceSet>) call), not to independent namespace lifecycle events.
Destruction sequence (reverse of creation order):
1. Reference count drops to zero, triggering `namespace_put(ns)`:
   - Check: `refcount.fetch_sub(1) == 1` (last reference).
   - If not: return (namespace still has users).
2. Notify namespace-aware subsystems (in reverse creation order): The cleanup callbacks are registered at namespace creation time and called in LIFO (stack) order. Creation order is: USER, MNT, UTS, IPC, PID, CGROUP, NET, TIME. Therefore destruction order is:
   - Time namespace (timens): release the time offset record.
   - Network namespace (netns): tear down all virtual interfaces, routing tables, conntrack tables. Close all sockets bound to this ns.
   - Cgroup namespace (cgroupns): detach from the cgroup hierarchy view.
   - PID namespace (pidns): send SIGKILL to all remaining processes in the ns. Wait for them to exit (a PID namespace cannot be destroyed while it has living processes — init reaping ensures all descendant PIDs are cleaned up).
   - IPC namespace (ipcns): destroy all System V IPC objects (semaphores, message queues, shared memory segments) and POSIX IPC objects.
   - UTS namespace (utsns): free hostname and domainname strings.
   - Mount namespace (mntns): umount all mounts in the namespace's mount tree (reverse of mount order). Release all struct Mount references.
   - User namespace (userns): no active revocation is needed. Capabilities granted within a user namespace are scoped to that namespace's `Arc<UserNamespace>`. When the last reference drops (this destruction path), the UID/GID mapping tables and the capability scope are freed by the `Arc` destructor. Processes that held `CAP_SYS_ADMIN` within this user namespace lose that capability implicitly because the namespace object no longer exists to validate against.
3. Free the namespace struct: Drop the `Arc<Namespace>` reference (`drop(ns)`). The Arc destructor handles deallocation once the refcount reaches zero. At this point all subsystem state has been cleaned up.
PID namespace destruction special case: A PID namespace cannot be destroyed while any process inside it is alive. If the namespace init (PID 1 of the namespace) exits, all other processes in the namespace receive SIGKILL. The namespace destruction waits for all processes to exit before proceeding to step 2.
Mount namespace destruction special case: Lazy unmounts (MNT_DETACH): mounts that were detached but still have open file descriptors remain alive until all fds are closed. The mount namespace destructor marks these as "orphan mounts" — they remain accessible to current holders but new opens are rejected. The last file close triggers the final mount cleanup.
Overlayfs cleanup ordering during namespace teardown: see Section 14.8. The workdir must be cleaned before upper layer unmount.
Ordering guarantee: The reverse-creation-order cleanup ensures that inner namespaces (created from within an outer namespace) are cleaned up before their parent resources. This prevents use-after-free in cross-namespace references.
Cgroup vs network namespace destruction ordering: When a task's last reference drops and both its cgroup membership and network namespace are being cleaned up, network namespace cleanup runs first (sockets closed, conntrack purged, interfaces torn down). Cgroup accounting finalizes after — outstanding I/O byte counters are drained and memory charges released only once all network I/O has ceased. This ordering is guaranteed by the reverse-creation-order sequence above: network namespace (early in step 2) precedes cgroup namespace (near-last in step 2).
17.1.6 User Namespace UID/GID Mapping Security¶
User namespaces allow unprivileged users to have "root" (UID 0) within a namespace while mapping to an unprivileged UID outside. This is the foundation of rootless containers.
Security model:
/// A single contiguous range in a UID or GID mapping.
/// Maps `count` IDs starting at `inner_start` (inside namespace) to
/// `outer_start` (in parent namespace).
pub struct IdMapEntry {
pub inner_start: u32,
pub outer_start: u32,
pub count: u32,
}
/// Maximum ID mapping entries per user namespace (matches Linux's limit of 340
/// per /proc/PID/uid_map and /proc/PID/gid_map).
const MAX_ID_MAPPINGS: usize = 340;
/// User namespace: defines UID/GID translation mappings and capability scope.
/// Each user namespace has an owner (the uid in the parent namespace of the
/// process that created it) and an ordered list of ID mappings. Capabilities
/// held by a process are relative to its user namespace — CAP_SYS_ADMIN in
/// a child user namespace does not grant privilege in the parent.
///
/// **Write-once ID mappings (lock-free reads):** Linux enforces that
/// `/proc/PID/uid_map` and `/proc/PID/gid_map` can each be written **exactly
/// once** per user namespace lifetime. UmkaOS mirrors this: `uid_map` and
/// `gid_map` use a write-once-then-frozen model. Before the map is written,
/// all UIDs/GIDs resolve to `nobody`/`nogroup` (65534). After the single
/// write, the map is frozen and all subsequent reads are **lock-free** — a
/// plain pointer dereference to an immutable `IdMapArray`. No RwLock, no
/// atomic RMW, zero contention on the hottest path in the kernel (`stat()`,
/// `open()`, `access()`, `kill()`, every permission check).
///
/// The write path uses `map_lock` to serialize the single write and publish
/// the frozen map via a Release store on the `Arc` pointer. Reads use an
/// Acquire load — on x86 this compiles to a plain `MOV` (TSO).
///
/// /proc/PID/uid_map write restrictions:
/// - Writer must be in the parent user namespace OR have CAP_SETUID in parent
/// - Writer must have CAP_SYS_ADMIN in the target namespace (or be the target)
/// - Mapped UIDs must be valid in the parent namespace
/// - Total mapped range cannot exceed /proc/sys/kernel/uid_max (or configured limit)
/// - Can only be written ONCE; second write returns EPERM
// kernel-internal, not KABI — contains Arc, Option, SpinLock. Never crosses a boundary.
pub struct UserNamespace {
/// Unique namespace identifier, monotonically increasing. Used for
/// cross-namespace permission checks, procfs display, and uevent
/// attribution. Allocated from a global `AtomicU64` counter.
pub ns_id: u64,
/// Parent user namespace (None for init_user_ns).
parent: Option<Arc<UserNamespace>>,
/// Nesting depth of this user namespace. The initial (root) user namespace
/// has `level = 0`. Each child increments by 1. The kernel enforces a
/// maximum nesting depth of 32 levels (`level < 32`); `unshare(CLONE_NEWUSER)`
/// and `clone(CLONE_NEWUSER)` return `-ENOSPC` if `parent.level >= 31`.
/// This limit prevents unbounded recursion in `is_same_or_ancestor()` checks
/// and caps the O(depth) capability lookup chain used during cross-namespace
/// permission checks (see `compute_effective_caps` below).
pub level: u32,
/// Frozen UID mappings. `None` before `/proc/PID/uid_map` is written
/// (all UIDs resolve to 65534). `Some(...)` after the single write —
/// immutable thereafter. Reads are lock-free (Acquire load on the
/// `Option` discriminant). The `Arc` ensures the backing array lives
/// as long as any namespace referencing it.
///
/// `OnceCell<T>`: write-once cell. `get()` returns `Option<&T>` via
/// Acquire load (lock-free). `set(value)` initializes exactly once
/// (returns `Err` if already set). Equivalent to `std::sync::OnceLock`
/// but `no_std`-compatible.
uid_map: OnceCell<Arc<IdMapArray>>,
/// Frozen GID mappings. Same write-once-then-frozen semantics as uid_map.
gid_map: OnceCell<Arc<IdMapArray>>,
/// Serializes the single write to uid_map/gid_map. Only held during
/// the `/proc/PID/uid_map` write path (once per namespace lifetime).
/// Never contended after initialization.
map_lock: Mutex<()>,
/// Owner's UID in the parent namespace.
owner_uid: u32,
/// Owner's GID in the parent namespace.
owner_gid: u32,
/// Capability set: what capabilities processes in this namespace can hold.
/// Uses SystemCaps (u128) to accommodate UmkaOS-native capabilities in bits 64-127
/// ([Section 9.1](09-security.md#capability-based-foundation)). Starts with full caps in a new user namespace;
/// reduced by setuid, prctl, etc.
cap_permitted: SystemCaps,
}
/// Frozen, immutable ID mapping array. Created once when `/proc/PID/uid_map`
/// (or `gid_map`) is written, never modified thereafter.
pub struct IdMapArray {
/// Sorted by `inner_start` for binary search on large maps.
/// Typical container: 1 entry (e.g., 0-65535 → 100000-165535).
/// For ≤5 entries, linear scan is faster than binary search.
entries: ArrayVec<IdMapEntry, MAX_ID_MAPPINGS>,
/// Cached: true if mapping is 1:1 identity (inner == outer for all ranges).
/// Enables fast-path bypass: return the input UID unchanged.
is_identity: bool,
}
Capability interactions with user namespaces:
- A process with UID 0 inside a user namespace has full capabilities within that namespace
- Capabilities are NOT granted against resources owned by ancestor namespaces
- Example: A process with "root" in User NS 1 cannot `mount()` a filesystem from the host
- The `cap_effective` mask is computed at syscall entry time based on:
  - The process's current UID within its user namespace
  - The target object's owning user namespace
  - The intersection of the process's capability bounding set with capabilities valid for the target
Determining the owning user namespace for kernel objects:
| Object Type | Owning User Namespace | Mechanism |
|---|---|---|
| File (VFS inode) | User namespace of the mount | Each mount has mnt_user_ns set at mount time. Files inherit from their mount. |
| Socket | User namespace of the creating process | Stored in sock->sk_user_ns at socket creation |
| IPC object (shm, sem, msg) | User namespace of the creating namespace | IPC namespace → User namespace mapping at IPC NS creation |
| Capability token | User namespace of the issuing process | Stored in capability header |
| Process (for signals) | User namespace of the process | Stored in task_struct->user_ns |
| Device node | User namespace of the initial mount | Device nodes are always in the initial namespace |
cap_effective computation algorithm:
The effective capability set for a process operating on an object is the intersection of:
1. The process's current effective capabilities (cap_effective)
2. The capabilities valid for the target object's namespace
This ensures that a process which has dropped capabilities via capset() does not regain them when accessing child namespace objects.
compute_effective_caps(process, object):
1. proc_ns = process.user_namespace
2. obj_ns = object.owning_user_namespace
3. proc_caps = process.cap_effective // NOT cap_bounding — use current effective set
4. // Check if process's NS is an ancestor of object's NS (or same NS)
5. if is_same_or_ancestor(proc_ns, obj_ns):
6. // Process is in a parent (or same) namespace — capabilities apply
7. // Return intersection of process's effective caps and caps valid for target
8. return intersection(proc_caps, capabilities_valid_for(obj_ns))
9. // Check if process's NS is a descendant of object's NS
10. if is_ancestor(obj_ns, proc_ns):
11. // Process is in a child namespace — no capabilities against parent objects
12. return EMPTY_CAP_SET
13. // Unrelated namespaces (neither ancestor nor descendant)
14. // This happens with sibling containers
15. return EMPTY_CAP_SET
is_same_or_ancestor(potential_ancestor, potential_descendant):
    // Walk up the hierarchy from potential_descendant toward root.
    // Return true if potential_ancestor is encountered (including if they're the same).
    // The parent link is a strong Arc<UserNamespace>, so every ancestor in the
    // chain is guaranteed alive while the descendant exists; no failed upgrade
    // of a destroyed ancestor is possible on this path. The walk is bounded by
    // the nesting limit (level < 32), so it terminates in at most 32 steps.
    cursor = potential_descendant
    while cursor != None:
        if cursor == potential_ancestor:
            return true
        cursor = cursor.parent  // Arc<UserNamespace> — parent is always alive while child exists
    return false
capabilities_valid_for(namespace):
    // Returns the set of capabilities valid for objects in this namespace.
    // Capabilities are restricted based on namespace ownership rules:
    let mut valid = ALL_CAPS
    // CAP_SYS_ADMIN operations that affect global kernel state (e.g., swapon,
    // mount --bind outside the mount namespace) are not valid in non-init namespaces.
    if namespace.is_non_init_user_ns():
        valid &= ~CAP_SYS_ADMIN_GLOBAL  // Remove host-affecting subset
        // Distributed/cluster-wide capabilities are stripped for non-init user
        // namespaces. Containers must not issue cluster-wide operations —
        // a compromised container should not be able to join/leave the cluster,
        // create DSM regions, or manage peer membership.
        valid &= ~CAP_CLUSTER_ADMIN  // Cluster join/leave, topology changes
        valid &= ~CAP_DSM_CREATE     // Create/destroy DSM regions
        valid &= ~CAP_PEER_MANAGE    // Peer membership, capability delegation
    // CAP_NET_ADMIN is only valid in the owning network namespace.
    if namespace.net_ns != target_object.net_ns:
        valid &= ~CAP_NET_ADMIN
    return valid
/// `CAP_SYS_ADMIN_GLOBAL` — a real UmkaOS capability bit (bit 87), defined in
/// [Section 9.2](09-security.md#permission-and-acl-model). It authorizes operations with cluster-wide or
/// host-global scope that go beyond what `CAP_SYS_ADMIN` permits.
///
/// In namespace context: tasks in non-init user namespaces have
/// `CAP_SYS_ADMIN_GLOBAL` stripped from their effective set (see
/// `capabilities_valid_for` above), preventing them from exercising
/// host-affecting operations regardless of other capabilities held.
///
/// Operations requiring `CAP_SYS_ADMIN_GLOBAL` (forbidden in non-init user namespaces):
/// - Creating new user namespaces when `user_namespaces_max` system limit is exceeded
/// - Mounting filesystems with `MS_STRICTATIME` in any namespace other than init
/// - Modifying kernel parameters via `sysctl(2)` outside the init namespace
/// - Attaching to another process's PID, UTS, or IPC namespace via `setns(2)`
/// (user namespace setns: allowed; other namespace types: init-only)
/// - Cluster-wide DLM lockspace management, shared overlay topology changes
Key invariant: The intersection() at line 8 ensures that if a process drops CAP_NET_ADMIN via capset(), it cannot exercise CAP_NET_ADMIN against any object, including objects in child namespaces. This upholds the guarantee in Section 9.9: "a dropped privilege can never be regained."
File capability interpretation:
File capabilities (set via setcap) are interpreted relative to the file's owning user namespace:
1. When execve() loads a binary with file capabilities, the kernel checks if the file's owning user namespace is the same as or an ancestor of the process's user namespace.
2. If the file is in a descendant namespace (i.e., the file was created inside a child namespace), its capability bits are ignored when executed from the parent — prevents a child namespace from granting capabilities in the parent.
3. If the file is in the same or an ancestor namespace, the file's capabilities are added to the process's permitted/effective sets, subject to the usual cap_bounding restrictions. This matches Linux semantics: a host binary with file caps is honored inside a container, but a container binary with file caps is not honored on the host.
Setuid/setgid binary behavior in nested namespaces:
| Binary Location | Setuid Behavior | Rationale |
|---|---|---|
| Initial namespace (host) | UID changes in initial namespace | Traditional Unix behavior |
| Child namespace | UID changes within child namespace only | Cannot escalate to parent namespace UIDs |
| Mounted from host into container | Setuid bit ignored | Prevents host→container privilege escalation |
Privilege escalation prevention:
- A process in a child user namespace cannot modify the parent's UID mappings
- `setuid()` inside a user namespace only affects the inner UID, not the outer UID
- File capability bits (setcap) are interpreted relative to the file's owning user namespace
- Signals from a less-privileged namespace to a more-privileged namespace are blocked unless explicitly allowed
17.1.7 User Namespace Mount Restrictions¶
Not all filesystem types are safe to mount from within an unprivileged user namespace.
A filesystem that reads raw block device data (ext4, XFS, Btrfs) could exploit a crafted
disk image to trigger kernel vulnerabilities. UmkaOS restricts which filesystems are
mountable in non-init user namespaces via the FS_USERNS_MOUNT flag on the filesystem
type registration.
bitflags! {
/// Filesystem type flags, set at fs_type registration time.
pub struct FsTypeFlags: u32 {
/// This filesystem is safe to mount in a non-init user namespace.
/// Only filesystems that do NOT interpret raw block device data
/// and cannot be used to escalate privileges should set this flag.
const FS_USERNS_MOUNT = 1 << 0;
/// Filesystem does not require a backing device (pseudo-fs).
const FS_NO_DEV = 1 << 1;
/// Filesystem requires a block device as source.
const FS_REQUIRES_DEV = 1 << 2;
}
}
Mount permission check in do_mount():
do_mount(source, target, fs_type, flags):
1. Resolve fs_type by name from the registered filesystem table.
2. If the calling task is NOT in the init user namespace:
a. Check: fs_type.flags contains FS_USERNS_MOUNT.
If not: return EPERM — this filesystem cannot be mounted in
a non-init user namespace.
b. Check: ns_capable(current.nsproxy.mount_ns.user_ns, CAP_MOUNT).
The caller must have CAP_MOUNT (bit 70) within the mount namespace's
owning user namespace (not the init user namespace).
See [Section 14.6](14-vfs.md#mount-tree-data-structures-and-operations--domount-mount-a-filesystem) for the
canonical algorithm.
3. If the calling task IS in the init user namespace:
Check: capable(CAP_MOUNT).
4. Proceed with mount.
Filesystems with FS_USERNS_MOUNT (safe for unprivileged user namespace mounts):
| Filesystem | Rationale |
|---|---|
| proc | Virtual; no raw device access; per-PID-namespace view |
| sysfs | Virtual; read-only for non-init namespaces |
| tmpfs | Memory-backed; no device access |
| overlayfs | Layer composition; lower layers already mounted |
| FUSE | Userspace filesystem; kernel only relays operations |
| devpts | PTY slave filesystem; namespace-scoped |
| mqueue | POSIX message queue filesystem; namespace-scoped |
| cgroup2 | Cgroup filesystem; namespace-scoped view |
Filesystems WITHOUT FS_USERNS_MOUNT (require CAP_SYS_ADMIN in init user namespace):
| Filesystem | Rationale |
|---|---|
| ext4, xfs, btrfs | Parse untrusted on-disk data structures |
| nfs | Network filesystem; requires kernel credential management |
| zfs | Complex on-disk format with kernel-level decompression |
| fat, exfat, ntfs | Parse untrusted on-disk data |
| iso9660 | Parse untrusted on-disk data |
This matches Linux behavior (user namespace mount restrictions, introduced in Linux 3.8) and is required for rootless container runtimes (Podman rootless, Docker rootless) to mount proc/sysfs/tmpfs inside unprivileged containers while preventing privilege escalation via crafted filesystem images.
17.1.8 Devtmpfs Namespace Awareness¶
Devtmpfs (Section 14.5) is a
kernel-managed tmpfs that auto-populates /dev with device nodes. In a namespace-aware
kernel, containers must not see all host devices.
Design: Devtmpfs itself is a single global instance (the kernel needs exactly one
authoritative device registry). Container isolation of /dev is achieved through the
mount namespace and device cgroup mechanism, not by creating per-namespace
devtmpfs instances:
- Mount namespace filtering: The container runtime creates a new mount namespace (`CLONE_NEWNS`), mounts a fresh `tmpfs` on `/dev` inside the container, and bind-mounts only the specific device nodes the container needs from the host's devtmpfs. Typically: `/dev/null`, `/dev/zero`, `/dev/random`, `/dev/urandom`, `/dev/full`, `/dev/tty`, `/dev/ptmx`, and any explicitly granted devices.
- Device cgroup enforcement: The `BPF_PROG_TYPE_CGROUP_DEVICE` program (Section 17.2) attached to the container's cgroup denies `open()` on device nodes not in the allow-list. Even if a container process `mknod`s a device node it has no cgroup permission for, the open will be denied by the BPF hook at `chrdev_open()`/`blkdev_open()` time.
- Net effect: A container sees a minimal `/dev` with only bind-mounted devices, and the device cgroup prevents access to any device not explicitly granted. This matches Docker/Kubernetes behavior exactly: `docker run` creates a restricted `/dev` with ~15 entries from the default device allow-list.
17.1.9 Security Policy Integration¶
Container isolation requires multiple defense layers beyond namespaces and capabilities. UmkaOS integrates with security policy mechanisms at specific points in the container lifecycle:
seccomp-bpf (Syscall Filtering): OCI-compliant container runtimes (Docker, containerd, CRI-O) require seccomp-bpf to restrict the syscall surface available to containerized processes. UmkaOS's seccomp implementation is part of the eBPF subsystem described in Section 19.2, which covers eBPF program types including seccomp-bpf for per-process syscall filtering. The typical container creation sequence is:
--- Privileged path (root or CAP_SYS_ADMIN) ---
1. clone(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWUSER | ...)
— CLONE_NEWUSER is processed first internally (see ordering requirement above).
The child inherits full capabilities within the new user namespace.
--- Rootless (unprivileged) path ---
1. unshare(CLONE_NEWUSER)
— Creates a new user namespace FIRST. The calling process gains CAP_SYS_ADMIN
within the new user namespace, enabling subsequent namespace creation.
1a. Write UID/GID mappings to /proc/self/uid_map and /proc/self/gid_map.
1b. clone(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | ...)
— Now succeeds because the calling process has CAP_SYS_ADMIN in its
(new) user namespace.
--- Common steps (both paths, in the child process) ---
2. mount("overlay", new_root, "overlay",
"lowerdir=<image_layers>,upperdir=<container_rw>,workdir=<overlay_work>")
— Mount overlayfs with lowerdir=image layers (read-only), upperdir=container
writable layer, workdir=overlay work directory. This assembles the container's
root filesystem from the OCI image layer stack before pivot_root.
3. pivot_root(new_root, put_old) — change filesystem root
4. umount2(put_old, MNT_DETACH) — MANDATORY: detach host filesystem (security)
5. Place the container init process in its cgroup. Two approaches:
a. **Two-step (legacy)**: Write the PID to `cgroup.procs` in the target cgroup
hierarchy, moving it from the parent's cgroup to the container-specific cgroup.
b. **`CLONE_INTO_CGROUP` (preferred, Linux 5.7+)**: Pass the target cgroup fd
via `clone3(CLONE_INTO_CGROUP)` at step 1, which places the child directly
into the target cgroup at fork time — no post-fork migration needed. See
[Section 8.1](08-process.md#process-and-task-management) for `CLONE_INTO_CGROUP` specification.
Both approaches enforce per-container resource limits (memory, CPU, I/O) from
this point forward.
6. seccomp(SECCOMP_SET_MODE_FILTER, ...) — install syscall filter
7. drop_capabilities() — reduce capability set
8. execve() — exec container entrypoint
drop_capabilities() specification:
drop_capabilities() restricts the calling task's capability sets to only those
capabilities permitted by the container's configuration (OCI runtime spec
process.capabilities). This is the last privilege reduction step before execve()
and is critical for container security: without it, a container process retains all
capabilities inherited from the container runtime (which runs as root).
/// Drop capabilities not in the container's allowed set.
///
/// OCI container runtimes (runc, crun) call this after seccomp filter
/// installation and before execve(). The `allowed` set is derived from
/// the OCI runtime spec's `process.capabilities` object, which specifies
/// five capability sets independently: bounding, effective, inheritable,
/// permitted, and ambient.
///
/// # Steps
///
/// 1. **Bounding set**: For each capability NOT in `allowed.bounding`,
/// call `prctl(PR_CAPBSET_DROP, cap)`. This permanently removes the
/// capability from the bounding set — it cannot be regained even via
/// setuid binaries. The bounding set limits which capabilities can
/// appear in the permitted set after execve().
///
/// 2. **Permitted set**: Intersect the current permitted set with
/// `allowed.permitted`. Capabilities not in the intersection are
/// dropped via `capset()`. Once dropped from the permitted set, a
/// capability cannot be re-acquired (the bounding set prevents it).
///
/// 3. **Effective set**: Set to `allowed.effective`. Typically identical
/// to the permitted set for containers that need capabilities, or
/// empty for unprivileged containers (capabilities are in permitted
/// but not effective, requiring explicit cap_raise).
///
/// 4. **Inheritable set**: Set to `allowed.inheritable`. Controls which
/// capabilities survive execve() when combined with file capabilities.
/// Docker's default: empty (no inheritable caps). Kubernetes may set
/// specific inheritable caps for init containers.
///
/// 5. **Ambient set**: Set to `allowed.ambient`. Ambient capabilities
/// are automatically added to the permitted and effective sets on
/// execve() without requiring file capabilities. Used by containers
/// that need specific capabilities in the entrypoint without setuid
/// or file caps. Each ambient capability must also be in both the
/// permitted and inheritable sets (kernel enforced).
///
/// # Errors
///
/// Returns `EPERM` if the calling task does not own the capabilities
/// it is trying to retain (i.e., `allowed` contains capabilities not
/// in the current permitted set). This indicates a container runtime
/// bug — the runtime should not request capabilities it does not have.
///
/// # Security invariant
///
/// After `drop_capabilities()` returns successfully, the task's
/// effective capability set is a subset of the OCI-configured allowed
/// set. No capability outside the allowed set can be regained by the
/// task or any of its descendants (guaranteed by the bounding set).
pub fn drop_capabilities(allowed: &OciCapabilities) -> Result<(), Errno>;
/// OCI capability configuration. Maps 1:1 to the `process.capabilities`
/// object in the OCI runtime specification v1.1+.
pub struct OciCapabilities {
/// Capabilities retained in the bounding set.
pub bounding: CapabilitySet,
/// Capabilities in the effective set after drop.
pub effective: CapabilitySet,
/// Capabilities in the inheritable set.
pub inheritable: CapabilitySet,
/// Capabilities in the permitted set.
pub permitted: CapabilitySet,
/// Capabilities in the ambient set.
pub ambient: CapabilitySet,
}
/// Bitfield of Linux capabilities and UmkaOS-native extensions.
/// Bits 0-63: Linux-compatible capabilities (CAP_CHOWN through
/// CAP_CHECKPOINT_RESTORE; Linux defines 41 as of 6.1).
/// Bits 64-127: UmkaOS-native capabilities (matching SystemCaps layout
/// from [Section 9.1](09-security.md#capability-based-foundation)). u128 ensures the OCI capability
/// set can represent both Linux-compatible and UmkaOS-native capability bits.
pub struct CapabilitySet(u128);
Docker's default capability set retains 14 of the 41 capabilities: CAP_CHOWN,
CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_MKNOD, CAP_NET_RAW,
CAP_SETGID, CAP_SETUID, CAP_SETFCAP, CAP_SETPCAP, CAP_NET_BIND_SERVICE,
CAP_SYS_CHROOT, CAP_KILL, CAP_AUDIT_WRITE. All other capabilities are dropped.
Kubernetes restricted PodSecurityStandard drops all capabilities and allows only
NET_BIND_SERVICE to be added back.
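To make this concrete, the sketch below builds Docker's default set as a `u128` in the `CapabilitySet` bit layout above. The bit numbers are the Linux capability numbers from `<linux/capability.h>`; the constants restate them here so the example is self-contained.

```rust
// Illustration: Docker's default capability set as a u128 bitmask in the
// CapabilitySet layout. Bit numbers are the Linux capability numbers.
const CAP_CHOWN: u32 = 0;
const CAP_DAC_OVERRIDE: u32 = 1;
const CAP_FOWNER: u32 = 3;
const CAP_FSETID: u32 = 4;
const CAP_KILL: u32 = 5;
const CAP_SETGID: u32 = 6;
const CAP_SETUID: u32 = 7;
const CAP_SETPCAP: u32 = 8;
const CAP_NET_BIND_SERVICE: u32 = 10;
const CAP_NET_RAW: u32 = 13;
const CAP_SYS_CHROOT: u32 = 18;
const CAP_SYS_ADMIN: u32 = 21; // NOT in the default set
const CAP_MKNOD: u32 = 27;
const CAP_AUDIT_WRITE: u32 = 29;
const CAP_SETFCAP: u32 = 31;

fn docker_default_caps() -> u128 {
    [
        CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_MKNOD,
        CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETFCAP, CAP_SETPCAP,
        CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_KILL, CAP_AUDIT_WRITE,
    ]
    .iter()
    .fold(0u128, |set, &cap| set | (1u128 << cap))
}

fn main() {
    let set = docker_default_caps();
    assert_eq!(set.count_ones(), 14); // exactly the 14 retained capabilities
    assert_eq!(set & (1u128 << CAP_SYS_ADMIN), 0); // CAP_SYS_ADMIN is dropped
}
```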
The seccomp filter must be installed before execve() so that the filter applies to the container's entrypoint and all its descendants. Docker's default seccomp profile blocks ~44 dangerous syscalls (e.g., kexec_load, reboot, mount). Kubernetes PodSecurityStandards mandate seccomp profiles for restricted workloads.
Seccomp Filter Stacking and Composition:
Nested containers require multiple independent seccomp filters to coexist on a single thread: the OCI runtime installs a broad container policy (filter F1) during container setup, and the container workload may subsequently install its own application-specific filter (filter F2) via prctl(PR_SET_SECCOMP) or seccomp(SECCOMP_SET_MODE_FILTER, ...). UmkaOS implements Linux-compatible stacking semantics so that existing container runtimes (runc, containerd) work without modification.
Stacking rules:
- Filters stack: each `seccomp(SECCOMP_SET_MODE_FILTER, ...)` call appends a new filter to the thread's filter list. All previously installed filters remain active. Filters cannot be removed.
- Evaluation order: on each syscall entry, filters are evaluated in reverse installation order — newest filter first, oldest filter last. All installed filters are evaluated; there is no short-circuit on `SECCOMP_RET_ALLOW`. Exception: `SECCOMP_RET_KILL_PROCESS` and `SECCOMP_RET_KILL_THREAD` cause immediate thread or process termination without evaluating any remaining filters. This matches Linux behavior.
- Action priority: when multiple filters return different actions for the same syscall, the highest-severity action wins regardless of evaluation order:

| Priority | Action | Effect |
|---|---|---|
| 1 (highest) | `SECCOMP_RET_KILL_PROCESS` | Terminate entire process |
| 2 | `SECCOMP_RET_KILL_THREAD` | Terminate calling thread |
| 3 | `SECCOMP_RET_TRAP` | Deliver SIGSYS |
| 4 | `SECCOMP_RET_ERRNO` | Return specified errno |
| 5 | `SECCOMP_RET_USER_NOTIF` | Notify supervisor via fd |
| 6 | `SECCOMP_RET_TRACE` | Notify ptrace tracer |
| 7 | `SECCOMP_RET_LOG` | Allow and log |
| 8 (lowest) | `SECCOMP_RET_ALLOW` | Allow syscall |
Example: if F1 returns SECCOMP_RET_ALLOW and F2 returns SECCOMP_RET_ERRNO(EPERM), the syscall is blocked with EPERM. A workload-installed filter can make the effective policy strictly more restrictive than the runtime-installed filter, but never less restrictive.
Unknown actions: If a filter returns an action value not in the table above
(e.g., a future SECCOMP_RET_* value that UmkaOS does not yet recognize), UmkaOS
treats it with SECCOMP_RET_KILL_THREAD-level priority (restrictive default). This
matches Linux's signed-comparison semantics where unknown low numeric values get
high priority. Unknown actions are never treated as SECCOMP_RET_ALLOW. See
seccomp_action_priority() in Section 10.3 for the
implementation.
- NO_NEW_PRIVS requirement: a thread must have `no_new_privs = 1` (set via `prctl(PR_SET_NO_NEW_PRIVS, 1)`) before installing a seccomp filter unless it holds `CAP_SYS_ADMIN`. This is identical to Linux. Container runtimes set `no_new_privs` as part of their standard setup sequence.
- Maximum filter count: 512 filters per thread. Linux limits total BPF instruction count (`MAX_INSNS_PER_PATH = 32768`), not filter count; UmkaOS imposes an explicit filter-count ceiling at 512 (matching Section 10.3). Attempting to install a 513th filter returns `E2BIG`.
- Filter inheritance: child processes created via `fork()` or `clone()` inherit the parent's complete filter stack. The inherited filters are immutable in the child — the child may only append further filters, never remove inherited ones.
- Memory lifecycle: each compiled seccomp filter is reference-counted via `Arc<SeccompFilter>`. On `fork()`, the child increments the refcount of every filter in its inherited stack. On task exit, the task drops its `Arc` references to all filters in its stack; when the last `Arc` reference to a filter drops, the filter's BPF bytecode and compiled representation are freed. The per-task bookkeeping cost with 512 stacked filters is bounded: 512 `Arc` increments on fork, 512 `Arc` decrements on exit.
Nested container policy: when an OCI runtime installs filter F1 (broad allow-list, blocking dangerous syscalls) and the container workload subsequently installs filter F2 (narrow application allow-list), both filters are active simultaneously. The effective policy is the union of restrictions from both filters: a syscall is allowed only if both F1 and F2 allow it. This composability property is what makes layered container security correct — deeper container nesting cannot relax an outer filter's restrictions.
UmkaOS implementation note: UmkaOS compiles the filter stack into a single BPF program at installation time. When a new filter is added to an existing stack, the kernel combines the compiled representation of the existing stack with the new filter's BPF bytecode and recompiles the result into a single executable program. This single-program approach is semantically identical to sequential per-filter evaluation (the action priority table above is preserved exactly) but eliminates repeated per-filter dispatch overhead at syscall entry. The recompilation occurs once at seccomp(SECCOMP_SET_MODE_FILTER, ...) time, not on each syscall.
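The priority rule and the kill-action short-circuit can be modeled in a few lines. `SeccompAction` and `resolve` below are illustrative names, not the kernel's internal API; the discriminants mirror the priority column above (lower value = higher severity).

```rust
// Model of the stacking semantics: evaluate newest-first, kill actions
// short-circuit, otherwise the highest-severity (numerically lowest)
// action wins across all filters.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum SeccompAction {
    KillProcess = 0, // priority 1 (highest)
    KillThread = 1,
    Trap = 2,
    Errno = 3,
    UserNotif = 4,
    Trace = 5,
    Log = 6,
    Allow = 7, // priority 8 (lowest)
}

fn resolve(actions_newest_first: &[SeccompAction]) -> SeccompAction {
    let mut winner = SeccompAction::Allow;
    for &a in actions_newest_first {
        if matches!(a, SeccompAction::KillProcess | SeccompAction::KillThread) {
            return a; // immediate termination, remaining filters not evaluated
        }
        winner = winner.min(a); // lower discriminant = higher severity
    }
    winner
}

fn main() {
    use SeccompAction::*;
    // F1 (oldest) allows, F2 (newest) returns ERRNO → blocked with errno.
    assert_eq!(resolve(&[Errno, Allow]), Errno);
    assert_eq!(resolve(&[KillProcess, Allow]), KillProcess);
    assert_eq!(resolve(&[Allow, Allow]), Allow);
}
```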
LSM Integration:
UmkaOS supports pluggable Linux Security Modules (AppArmor, SELinux profiles). Container runtimes can specify an LSM profile via OCI annotations, which UmkaOS applies at execve() time. The integrity measurement framework (Section 9.5, 09-security.md) provides the foundation for policy enforcement. The full LSM framework — hook table, security blob allocation, module registration, and AND-logic stacking — is specified in Section 9.8.
17.1.10 Cross-Node Namespace ID Translation¶
In a distributed UmkaOS cluster (Section 5.1), each node maintains its own independent PID, UID, and mount namespace hierarchies. When a capability or IPC message crosses node boundaries, namespace-scoped identifiers (PIDs, UIDs, GIDs) must be translated.
Protocol: Cross-node operations use cluster-global identifiers rather than translating between per-node namespace IDs:
- PID namespace: Each task has a cluster-unique `ClusterTaskId = (node_id: u16, local_pid: u32)`. Cross-node `kill()` and `waitpid()` operate on `ClusterTaskId`, not raw PIDs. The receiving node translates `ClusterTaskId` to a local PID within the init PID namespace before dispatching the signal. Tasks in non-init PID namespaces are not directly addressable from remote nodes — the remote node must target the task's init-namespace PID, and the local kernel applies namespace visibility rules.
- User namespace (UID/GID): Cross-node operations assume a shared UID/GID directory service (LDAP, `/etc/passwd` synchronisation). The wire protocol carries raw `uid_t`/`gid_t` values. The receiving node interprets them in its init user namespace. Non-init user namespace UID mappings are strictly node-local and are NOT translated across nodes. A container's mapped UIDs are meaningful only on the node hosting that container.
- Mount namespace: Mount namespaces are strictly node-local. Cross-node filesystem access uses the capability-based VFS service provider protocol (Section 14.1), not mount namespace sharing. A remote file access carries a `(node_id, inode_id, fs_id)` triple — the receiving node resolves this against its own mount table.
- Network namespace: Cross-node network namespace awareness is limited to `ClusterTaskId`-scoped socket operations. The networking stack on each node operates independently; cross-node traffic uses the RDMA transport layer (Section 5.4), which bypasses per-node network namespaces.
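One plausible wire encoding of the `(node_id, local_pid)` pair is a packed `u64`. The field layout below is an illustration, not a normative part of the cross-node protocol.

```rust
// Illustrative wire packing for ClusterTaskId: node_id in bits 32..=47,
// local_pid in bits 0..=31 (bits 48..=63 reserved/zero).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ClusterTaskId {
    node_id: u16,
    local_pid: u32,
}

impl ClusterTaskId {
    fn to_wire(self) -> u64 {
        ((self.node_id as u64) << 32) | self.local_pid as u64
    }
    fn from_wire(w: u64) -> Self {
        Self { node_id: (w >> 32) as u16, local_pid: w as u32 }
    }
}

fn main() {
    let id = ClusterTaskId { node_id: 7, local_pid: 4242 };
    assert_eq!(ClusterTaskId::from_wire(id.to_wire()), id); // lossless round-trip
}
```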
See also:
- Section 19.2: eBPF subsystem including seccomp-bpf
- Section 9.5 (09-security.md): Runtime Integrity Measurement (IMA)
- Section 9.9: Credential model and capability dropping
17.2 Control Groups (Cgroups v2)¶
Linux cgroups v2 provide hierarchical resource allocation and limiting. UmkaOS implements the unified cgroup v2 interface, mapping controller semantics to UmkaOS's native scheduler, memory manager, and I/O subsystems.
Cgroup v1 compatibility shim: Docker (Moby) and older systemd versions (pre-247) require cgroup v1 hierarchy paths. UmkaOS provides a read-mostly v1 compatibility shim that:
- Exposes `/sys/fs/cgroup/{cpu,memory,pids,blkio,...}` mount points
- Translates v1 control file reads/writes to v2 equivalents (e.g., `memory.limit_in_bytes` → `memory.max`, `cpu.shares` → `cpu.weight`)
- Supports the 4 most common v1 controllers: `cpu`, `memory`, `blkio`, `pids`
- Returns `-ENOSYS` for v1-only features with no v2 equivalent (e.g., `cpuacct` separate hierarchy, `net_cls`, `net_prio`)
- Multi-hierarchy emulation: each v1 controller appears as a separate mount, but all are backed by the single v2 unified hierarchy

Specification scope: The v1 shim control file format details (exact file paths, value format, error responses) are deferred to Phase 4. The core v2 implementation below is the authoritative resource control mechanism. Until Phase 4, Moby/systemd v1 compatibility cannot be integration-tested — only the semantic translation (which v1 files → which v2 controls) is validated.
Note: Modern Docker (Moby ≥ 20.10), Kubernetes (≥ 1.25), and systemd (≥ 247) all work natively with cgroup v2. The v1 shim is needed only for legacy container runtimes. Systems running current versions of these tools can operate entirely on the v2 interface without the shim.
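The `cpu.shares` → `cpu.weight` translation mentioned above is not a straight copy: v1 shares span [2, 262144] while v2 weight spans [1, 10000]. The sketch below uses the linear map that runc and systemd apply for this conversion; whether the UmkaOS shim uses exactly this formula is an assumption.

```rust
// v1 cpu.shares [2, 262144] → v2 cpu.weight [1, 10000]: the linear map
// used by runc/systemd, assumed (not confirmed) for the UmkaOS shim.
fn shares_to_weight(shares: u64) -> u64 {
    1 + shares.saturating_sub(2).min(262_142) * 9999 / 262_142
}

fn main() {
    assert_eq!(shares_to_weight(2), 1);            // v1 minimum → v2 minimum
    assert_eq!(shares_to_weight(1024), 39);        // v1 default value
    assert_eq!(shares_to_weight(262_144), 10_000); // v1 maximum → v2 maximum
}
```

Note that the v1 default (1024 shares) does not map to the v2 default weight of 100 — a known quirk of the standard conversion.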
Phase 3 gate: The cgroup v2 procfs/sysfs detection surface (`/sys/fs/cgroup/cgroup.controllers`, `/proc/self/cgroup` in `0::/` format, `cgroup2` mount type) is a Phase 3 exit requirement. Without correct v2 detection responses, Docker/runc falls back to cgroup v1 mode. See Section 24.2 for the full checklist.
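As a concrete illustration of the detection surface, the single-entry `0::<path>` format of `/proc/self/cgroup` is what runc keys on for v2 detection. A minimal validator for that line format (the helper name is illustrative, not part of the spec):

```rust
// Validates the cgroup v2 /proc/self/cgroup line format "0::<path>".
// v1 entries look like "12:memory:/docker/abc" and must be rejected.
fn is_cgroup_v2_line(line: &str) -> bool {
    let mut parts = line.trim_end().splitn(3, ':');
    matches!(
        (parts.next(), parts.next(), parts.next()),
        (Some("0"), Some(""), Some(path)) if path.starts_with('/')
    )
}

fn main() {
    assert!(is_cgroup_v2_line("0::/")); // v2 unified hierarchy root
    assert!(is_cgroup_v2_line("0::/system.slice/docker.service"));
    assert!(!is_cgroup_v2_line("12:memory:/docker/abc")); // v1-style entry
}
```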
17.2.1 Core Data Structures¶
17.2.1.1 Cgroup Node¶
The Cgroup struct is the central object in the cgroup v2 hierarchy. Every directory under
/sys/fs/cgroup/ corresponds to one Cgroup node. The hierarchy is a tree; the root node
is owned by CgroupRoot.
UmkaOS's cgroup design avoids two sources of complexity present in Linux's implementation:
- No multi-hierarchy: cgroup v1's per-controller separate hierarchies are gone; the single
v2 unified hierarchy is the only model. The v1 shim (see above) re-exposes v2 state through
legacy paths at the cgroupfs layer without creating second hierarchies inside the kernel.
- No cgroup_subsys indirection: Linux routes every controller operation through a
cgroup_subsys vtable, adding an indirect call on every resource charge. UmkaOS embeds
controller state directly in Cgroup as Option<ControllerState> fields; disabled
controllers are None and add no overhead.
/// Maximum cgroup nesting depth (stack-safety bound). Linux uses `INT_MAX`,
/// i.e., effectively unlimited; 256 levels are sufficient for all practical
/// deployments including deeply nested container orchestrators.
pub const CGROUP_MAX_DEPTH: usize = 256;
/// Alias for cgroup tree nodes. All cgroup references are `Arc<Cgroup>` —
/// tree ownership flows downward (parent → children); parent pointers are `Weak`.
pub type CgroupNode = Arc<Cgroup>;
/// A cgroup node in the cgroup v2 unified hierarchy.
///
/// The hierarchy is a tree rooted at `CgroupRoot.root`. Tasks are assigned
/// to leaf or intermediate cgroups. Resource controllers operate per-cgroup.
///
/// # Memory layout note
/// Controller state structs are stored inline (`Option<T>`) rather than
/// heap-allocated so that null-pointer checks for disabled controllers
/// compile to a branch on a locally cached discriminant — no extra
/// pointer dereference on the hot path.
pub struct Cgroup {
/// Unique cgroup ID (assigned at creation, never reused).
/// Also used as the inode number of the cgroupfs directory.
pub id: u64,
/// Parent cgroup. `None` only for the root cgroup (id == 1).
/// `Weak` avoids reference cycles: the tree is owned downward
/// (`CgroupRoot → Arc<Cgroup> → Arc<Cgroup> children`); the
/// parent pointer is a non-owning back-edge.
pub parent: Option<Weak<Cgroup>>,
/// Child cgroups. RCU-protected for lockless read traversal
/// (`for_each_descendant`, cgroupfs directory listing, recursive
/// accounting). Writers (mkdir, rmdir) acquire both `hierarchy_lock`
/// in `CgroupRoot` and `children_lock`, then publish changes via RCU:
/// 1. Clone the Vec under the SpinLock.
/// 2. Modify the clone (insert or remove child).
/// 3. Swap the RcuCell to point to the new Vec.
/// 4. Old Vec is freed after the RCU grace period.
/// Readers call `children.read()` under `rcu_read_lock()` — no lock
/// acquisition, no contention. The Vec is unbounded (K8s may create
/// hundreds of cgroups under `system.slice`) but acceptable per
/// collection policy §3.1.13: cgroup creation is cold-path.
///
/// **Performance note**: Clone-and-swap is O(N) per mkdir/rmdir where
/// N = number of siblings. At 500 siblings, the clone copies 500 * 8 =
/// 4000 bytes (~1us). For extreme cgroup counts (>10K siblings), XArray
/// migration would improve scalability, but K8s pod creation (~1-10/sec)
/// is well within the cold-path budget.
pub children: RcuCell<Vec<Arc<Cgroup>>>,
/// SpinLock protecting `children` writes. Only held during structural
/// modifications (mkdir/rmdir) — never on the read path.
pub children_lock: SpinLock<()>,
/// Depth from the root cgroup (root = 0, root's children = 1, etc.).
/// Used by `cgroup_lca()` to compute the Lowest Common Ancestor
/// during task migration. Set at `cgroup_mkdir()` time as
/// `parent.depth + 1`. Maximum: `CGROUP_MAX_DEPTH` (256).
pub depth: u32,
/// Name of this cgroup relative to parent (max 255 bytes, no '/').
/// Fixed-size inline storage avoids heap allocation for short names
/// (typical names: "docker", "system.slice", container IDs ≤ 64 bytes).
pub name: CgroupName,
/// Tasks directly assigned to this cgroup (not descendants).
/// Written by task migration; read by cgroupfs `cgroup.procs` output.
///
/// Uses `RwLock<XArray<()>>` keyed by `TaskId` (integer key) for O(1) insert,
/// remove, and membership test. Per the collection policy, integer-keyed
/// membership sets use XArray (not HashMap/FxHashSet). The unit value `()`
/// means this is a pure membership set — only presence/absence matters.
/// Readers (cgroupfs `cgroup.procs` output) take a read lock; writers
/// (task migration) take a write lock. Task migration is serialized by the
/// global task-migration lock anyway, so write contention is negligible.
pub tasks: RwLock<XArray<()>>,
/// Number of tasks in this cgroup and all descendants.
/// Updated atomically during task migration (O(depth) walk, done once
/// per migration, not per tick). Used by `pids.current` propagation and
/// for efficient "is this cgroup populated?" checks.
pub population: AtomicU64,
// ── Resource controller state ────────────────────────────────────────
// Each field is `None` when the controller is disabled for this cgroup.
// Controller state is only present when listed in the parent's
// `subtree_control` mask (or for the root cgroup, in `cgroup.controllers`).
/// CPU bandwidth controller (`cpu.weight`, `cpu.max`, `cpu.guarantee`).
pub cpu: Option<CpuController>,
/// Memory controller (`memory.max`, `memory.high`, `memory.current`, etc.).
pub memory: Option<MemController>,
/// Block I/O controller (`io.max`, `io.weight`).
pub io: Option<IoController>,
/// PID controller (`pids.max`, `pids.current`).
pub pids: Option<PidsController>,
/// CPU affinity controller (`cpuset.cpus`, `cpuset.mems`, partition mode).
pub cpuset: Option<CpusetController>,
/// RDMA/InfiniBand resource controller (`rdma.max`).
pub rdma: Option<RdmaController>,
/// Huge page controller (`hugetlb.<size>.max`).
pub hugetlb: Option<HugetlbController>,
/// Miscellaneous resource controller (`misc.max`; e.g., SGX EPC pages).
pub misc: Option<MiscController>,
/// perf_event cgroup controller. Limits per-cgroup PMU resource usage
/// to prevent container perf_event exhaustion.
pub perf_event: Option<PerfEventController>,
// ── Network bandwidth note ───────────────────────────────────────────
// UmkaOS (like Linux cgroup v2) has NO dedicated network bandwidth
// controller. Network bandwidth limiting is achieved through
// BPF_PROG_TYPE_CGROUP_SKB programs attached via `BPF_CGROUP_INET_EGRESS`
// / `BPF_CGROUP_INET_INGRESS` hooks combined with TC qdiscs
// ([Section 16.21](16-networking.md#traffic-control-and-queue-disciplines)). This is the standard
// approach used by Cilium, systemd, and modern container runtimes.
// The v1 `net_cls` and `net_prio` controllers are NOT implemented.
// ── Hierarchy control ────────────────────────────────────────────────
/// Which controllers are enabled for this cgroup's children.
/// Written to `cgroup.subtree_control`; read on every child mkdir.
/// `ControllerMask` is a bitmask — O(1) enable/disable.
pub subtree_control: ControllerMask,
// ── Freeze state ─────────────────────────────────────────────────────
/// Per-cgroup freeze request: set to `true` when userspace writes `1` to
/// `cgroup.freeze`; cleared when userspace writes `0`.
/// See Section 17.2.8 for the freeze/thaw protocol.
pub freeze: AtomicBool,
/// Effective freeze state: `true` iff `self.freeze || parent.e_freeze`.
/// Computed by propagating downward on freeze/thaw writes:
/// `child.e_freeze = child.freeze || parent.e_freeze`
/// A task remains frozen until ALL ancestor cgroups are thawed AND its own
/// `freeze` is cleared. This two-boolean model matches Linux's
/// `cgroup_freezer_state` (`bool freeze` + `bool e_freeze` in
/// `include/linux/cgroup-defs.h`).
pub e_freeze: AtomicBool,
/// Lifecycle state for zero-residual destruction. See
/// [Section 17.2](#control-groups--cgroup-zero-residual-destruction).
pub lifecycle: AtomicU8,
// ── Generation counter for walk-free limit propagation ───────────────
/// Incremented whenever any resource limit in this cgroup or any
/// ancestor changes. Each task caches the generation value at the
/// time its limits were last computed. On the next resource charge,
/// the task compares its cached generation against this field. On
/// mismatch, the task re-walks from its cgroup to the root to
/// recompute its effective limits, then updates the cache.
///
/// This makes limit changes O(1) to publish (one atomic increment)
/// and amortizes the re-walk cost to the next resource operation on
/// each task — no per-tick accounting, no broadcast, no lock convoy.
pub generation: AtomicU64,
// ── cgroupfs integration ──────────────────────────────────────────────
/// Inode for this cgroup's directory in the cgroupfs pseudo-filesystem.
/// `None` before the cgroupfs is mounted. The inode is allocated at
/// cgroup creation time and freed at cgroup destruction.
pub inode: Option<Arc<Inode>>,
// ── BPF integration ──────────────────────────────────────────────────
/// Attached BPF programs for this cgroup (ingress/egress/device/sysctl).
/// Max 64 programs per cgroup (matching Linux BPF_CGROUP_MAX_PROGS).
pub bpf_progs: SpinLock<ArrayVec<BpfCgroupLink, 64>>,
}
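The generation-counter revalidation pattern described for the `generation` field can be sketched in isolation. `LimitCache` and the `recompute` closure below are stand-ins for the per-task cache and the ancestor re-walk; they are not names from the spec.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Per-task cache: the effective limit plus the generation it was computed at.
struct LimitCache {
    cached_gen: AtomicU64,
    cached_limit: AtomicU64,
}

// Charge-path revalidation: O(1) generation compare on every charge;
// the ancestor re-walk (the `recompute` closure) runs only on mismatch.
fn effective_limit(cache: &LimitCache, cgroup_gen: &AtomicU64, recompute: impl Fn() -> u64) -> u64 {
    let g = cgroup_gen.load(Ordering::Acquire);
    if cache.cached_gen.load(Ordering::Relaxed) != g {
        cache.cached_limit.store(recompute(), Ordering::Relaxed);
        cache.cached_gen.store(g, Ordering::Release);
    }
    cache.cached_limit.load(Ordering::Relaxed)
}

fn main() {
    let cgroup_gen = AtomicU64::new(1);
    let cache = LimitCache {
        cached_gen: AtomicU64::new(0),
        cached_limit: AtomicU64::new(0),
    };
    assert_eq!(effective_limit(&cache, &cgroup_gen, || 100), 100); // stale → re-walk
    assert_eq!(effective_limit(&cache, &cgroup_gen, || 999), 100); // fresh → cached
    cgroup_gen.fetch_add(1, Ordering::Release); // limit change: one atomic increment
    assert_eq!(effective_limit(&cache, &cgroup_gen, || 50), 50);   // mismatch → re-walk
}
```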
/// Fixed-size inline cgroup name (max 255 bytes, not NUL-terminated).
/// Avoids heap allocation for the common case (names ≤ 255 bytes).
pub struct CgroupName {
/// Number of valid bytes in `data`.
len: u8,
/// Raw UTF-8 bytes. Characters '/' and '\0' are rejected at creation.
data: [u8; 255],
}
/// Bitmask of resource controllers. One bit per controller type.
/// Used for `subtree_control` (enabled-for-children) and for
/// `cgroup.controllers` (available on the system).
#[derive(Clone, Copy, Default)]
pub struct ControllerMask(pub u32);
impl ControllerMask {
pub const CPU: u32 = 1 << 0;
pub const MEMORY: u32 = 1 << 1;
pub const IO: u32 = 1 << 2;
pub const PIDS: u32 = 1 << 3;
pub const CPUSET: u32 = 1 << 4;
pub const RDMA: u32 = 1 << 5;
pub const HUGETLB: u32 = 1 << 6;
pub const MISC: u32 = 1 << 7;
pub const PERF_EVENT: u32 = 1 << 8;
/// Returns `true` if the given controller bit is set.
pub fn has(self, bit: u32) -> bool { self.0 & bit != 0 }
/// Returns the union of two masks (enabling controllers from both).
pub fn union(self, other: ControllerMask) -> ControllerMask {
ControllerMask(self.0 | other.0)
}
}
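A standalone copy of the mask operations shows how `subtree_control` composes: a child may enable a controller only when the corresponding bit is set in its parent's mask. Values mirror the impl above.

```rust
// Standalone restatement of ControllerMask, demonstrating union/has
// semantics for subtree_control composition.
#[derive(Clone, Copy, Default)]
struct ControllerMask(u32);

impl ControllerMask {
    const CPU: u32 = 1 << 0;
    const MEMORY: u32 = 1 << 1;
    const PIDS: u32 = 1 << 3;
    const CPUSET: u32 = 1 << 4;

    fn has(self, bit: u32) -> bool {
        self.0 & bit != 0
    }
    fn union(self, other: ControllerMask) -> ControllerMask {
        ControllerMask(self.0 | other.0)
    }
}

fn main() {
    // Parent enables cpu+memory for children; pids is added later.
    let parent = ControllerMask(ControllerMask::CPU | ControllerMask::MEMORY);
    let subtree = parent.union(ControllerMask(ControllerMask::PIDS));
    assert!(subtree.has(ControllerMask::CPU));
    assert!(subtree.has(ControllerMask::PIDS));
    assert!(!subtree.has(ControllerMask::CPUSET)); // not enabled for children
}
```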
/// Per-task cgroup migration state. Stored in the task struct as an
/// `AtomicU8` to allow lock-free reads during `cgroup.procs` enumeration.
/// The three-phase migration protocol (see Task Migration steps 4, 9, and
/// charge reconciliation) uses this to ensure a task is always visible in
/// exactly one cgroup and that charge accounting is deferred during migration.
///
/// State transitions: None(0) → Migrating(1) on `cgroup_migrate_prepare`,
/// Migrating(1) → Complete(2) on `cgroup_migrate_finish`,
/// Complete(2) → None(0) on charge reconciliation completion.
/// The `cgroup_migration_state: AtomicU8` field is stored in the `Task` struct
/// ([Section 8.1](08-process.md#process-and-task-management--task-struct-definition)).
#[repr(u8)]
pub enum CgroupMigrationState {
/// Task is a normal member of its cgroup (steady state).
None = 0,
/// Task is being migrated: still in the source cgroup's task list
/// but its `task.cgroup` pointer may already point to the target.
/// Charge accounting is deferred during this state. Readers of
/// `cgroup.procs` include MIGRATING tasks in their source cgroup
/// for consistency.
Migrating = 1,
/// Migration finished but charge reconciliation is pending.
/// The task has been moved to the target cgroup's task list,
/// but the charge transfer (memory, CPU bandwidth) between
/// source and target cgroups has not yet completed. The scheduler
/// skips cgroup bandwidth enforcement during this state.
/// Transitions to None(0) after charge reconciliation completes.
Complete = 2,
}
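The three transitions documented above form a strict cycle, which an `AtomicU8` compare-exchange can enforce. The `advance` helper below is illustrative, not the kernel's migration code.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const NONE: u8 = 0;      // steady state
const MIGRATING: u8 = 1; // set by cgroup_migrate_prepare
const COMPLETE: u8 = 2;  // set by cgroup_migrate_finish

// CAS-based transition: succeeds only for a legal edge taken from the
// documented state machine, and only if the current state matches `from`.
fn advance(state: &AtomicU8, from: u8, to: u8) -> bool {
    let legal = matches!(
        (from, to),
        (NONE, MIGRATING) | (MIGRATING, COMPLETE) | (COMPLETE, NONE)
    );
    legal && state
        .compare_exchange(from, to, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

fn main() {
    let s = AtomicU8::new(NONE);
    assert!(advance(&s, NONE, MIGRATING));      // cgroup_migrate_prepare
    assert!(!advance(&s, NONE, MIGRATING));     // double-prepare rejected
    assert!(advance(&s, MIGRATING, COMPLETE));  // cgroup_migrate_finish
    assert!(advance(&s, COMPLETE, NONE));       // charge reconciliation done
}
```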
/// Iterate over all descendants of a cgroup in pre-order (parent before children).
///
/// Used by subsystem-state walkers (e.g., `css_for_each_descendant` in
/// Section 23.1.3 for ML-policy parameter propagation). The traversal holds
/// an RCU read-side reference, so the caller must be in an RCU read-side
/// critical section. Structural modifications (mkdir/rmdir) are blocked by
/// hierarchy_lock but do not block this iterator — concurrent rmdir leaves
/// the cgroup visible until the RCU grace period completes.
///
/// Worst-case complexity: O(N) where N = number of descendants.
/// Cgroup trees in practice are shallow (depth ≤ 8) and narrow (breadth
/// ≤ hundreds), so this is bounded by the total cgroup count.
impl Cgroup {
/// Walk all descendants in pre-order. The callback receives each
/// descendant cgroup (excluding `self`). Returns early if `f` returns
/// `ControlFlow::Break`.
///
/// Must be called within an RCU read-side critical section.
/// **Lifecycle safety**: Callers that must not operate on cgroups
/// undergoing destruction MUST check `cg.lifecycle.load(Acquire) ==
/// CgroupLifecycle::Active` for each visited descendant. Concurrent
/// `cgroup_rmdir()` leaves the cgroup visible in the tree until
/// the RCU grace period completes, so the iterator may yield cgroups
/// that are mid-teardown.
pub fn for_each_descendant<F>(&self, f: F)
where
F: FnMut(&Arc<Cgroup>) -> core::ops::ControlFlow<()>,
{
// Implementation: iterative pre-order DFS using sibling-then-child
// traversal with O(depth) stack space. The stack stores the current
// position at each depth level, NOT all children at any level. This
// avoids the breadth explosion problem: K8s `kubepods.slice` may have
// 500+ child cgroups at one level, but the tree is at most
// `CGROUP_MAX_DEPTH` (256) levels deep.
//
// No heap allocation — this runs inside an RCU read-side critical
// section where sleeping (and therefore demand-paging a heap
// allocation) is forbidden. Reads `children` via `RcuCell::read()`
// under `rcu_read_lock()` — no lock acquisition. The RcuCell
// guarantees a consistent snapshot; structural modifications
// (mkdir/rmdir) publish a new Vec via RCU, so the old Vec remains
// valid for the read-side grace period.
//
// `CGROUP_MAX_DEPTH` bound is validated at `cgroup_mkdir()` time
// (depth check rejects nesting beyond the limit), so the stack
// cannot overflow during a well-formed traversal.
//
// Pseudocode (sibling-then-child traversal):
// // Each entry: (children_snapshot, next_sibling_index)
// let mut stack: ArrayVec<(RcuRef<Vec<Arc<Cgroup>>>, usize), CGROUP_MAX_DEPTH>
// = ArrayVec::new();
// let root_children = self.children.read(&guard);
// if root_children.is_empty() { return; }
// stack.push((root_children, 0));
//
// while let Some((children, idx)) = stack.last_mut() {
// if *idx >= children.len() {
// stack.pop(); // exhausted this level, backtrack
// continue;
// }
// let cg = Arc::clone(&children[*idx]);
// *idx += 1; // advance to next sibling for the next iteration
//
// if f(&cg) == Break { return; }
//
// let grandchildren = cg.children.read(&guard);
// if !grandchildren.is_empty() {
// stack.push((grandchildren, 0)); // descend
// }
// }
let _ = f;
}
}
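The sibling-then-child walk sketched in the pseudocode above can be modeled standalone: a plain `Vec`-based tree stands in for the RCU-protected `children` cells, and the explicit O(depth) stack holds (children-slice, next-sibling-index) pairs exactly as described.

```rust
// Standalone model of the pre-order, O(depth)-stack descendant walk.
struct Node {
    id: u32,
    children: Vec<Node>,
}

fn preorder(root: &Node) -> Vec<u32> {
    let mut out = Vec::new();
    // Each entry: (children snapshot, next sibling index).
    let mut stack: Vec<(&[Node], usize)> = vec![(&root.children, 0)];
    while let Some(top) = stack.last_mut() {
        let (children, idx) = (top.0, top.1);
        if idx >= children.len() {
            stack.pop(); // exhausted this level, backtrack
            continue;
        }
        top.1 += 1; // advance to next sibling for the next iteration
        let cg = &children[idx];
        out.push(cg.id); // visit (root itself excluded, as in for_each_descendant)
        if !cg.children.is_empty() {
            stack.push((&cg.children, 0)); // descend
        }
    }
    out
}

fn main() {
    let root = Node { id: 0, children: vec![
        Node { id: 1, children: vec![
            Node { id: 3, children: vec![] },
            Node { id: 4, children: vec![] },
        ]},
        Node { id: 2, children: vec![] },
    ]};
    assert_eq!(preorder(&root), vec![1, 3, 4, 2]); // parent before children
}
```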
/// Free function wrapper for subsystem-state iteration.
/// Called as `css_for_each_descendant(&root_css)` in Section 23.1.3.
///
/// Equivalent to `root.cgroup().for_each_descendant(...)`, performing
/// a pre-order RCU-protected walk over all descendant cgroups and
/// yielding a reference to the subsystem-specific state in each.
///
/// Must be called within an RCU read-side critical section.
pub fn css_for_each_descendant<'a>(
    root: &'a impl CgroupSubsystemState,
) -> CssDescendantIter<'a> {
    // Adapter over `Cgroup::for_each_descendant`: the returned iterator
    // drives the pre-order walk lazily and yields each descendant's
    // subsystem-specific state. `for_each_descendant` is callback-based,
    // so the iterator holds the root cgroup and walks on demand rather
    // than capturing a callback result.
    CssDescendantIter {
        root: root.cgroup(),
        _marker: PhantomData,
    }
}
/// Trait for subsystem-specific cgroup state objects.
/// Each resource controller (CPU, memory, IO, ML-policy, etc.) that needs
/// per-cgroup state implements this trait. The `cgroup()` method returns
/// the owning `Cgroup`, enabling generic tree walks.
pub trait CgroupSubsystemState {
/// Return the cgroup that owns this subsystem state.
fn cgroup(&self) -> &Arc<Cgroup>;
}
17.2.1.2 CPU Controller State¶
/// CPU controller state, present when the `cpu` controller is enabled
/// for this cgroup (listed in parent's `subtree_control`).
///
/// Maps to `cpu.weight`, `cpu.max`, `cpu.guarantee`, and `cpu.stat`
/// cgroupfs files. See Section 17.2.3 for the integration with UmkaOS's
/// EEVDF scheduler and CBS bandwidth enforcement.
pub struct CpuController {
/// `cpu.weight`: relative CPU share among siblings (1..=10000, default 100).
/// Used directly as the EEVDF task-group weight.
pub weight: AtomicU32,
/// `cpu.max` quota: microseconds of CPU time allowed per `period_us`.
/// `u64::MAX` means unlimited (no throttling — the default). Uses a sentinel
/// value instead of `Option` to avoid branching overhead on the hot path
/// (every scheduler tick checks this field). Matches the representation in
/// `CpuBandwidthThrottle.quota_us`.
pub max_us: AtomicU64,
/// `cpu.max` period in microseconds (default 100,000 = 100 ms).
/// Always set even when `max_us` is `u64::MAX` (holds the configured period
/// for when a quota is later added).
pub period_us: AtomicU64,
/// `cpu.max` bandwidth throttle state (quota, period, runtime pool, stats).
/// This is the single source of truth for all bandwidth throttling accounting
/// — `cpu.stat` reads for nr_periods, nr_throttled, and throttled_time are
/// served from this struct. See [Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees--cpumax-ceiling-enforcement-bandwidth-throttling).
pub bandwidth: CpuBandwidthThrottle,
/// CBS (Constant Bandwidth Server) configuration for `cpu.guarantee`
/// and `cpu.max` enforcement. Per-CPU servers (`CbsCpuServer`) are
/// allocated lazily on each CPU's runqueue; this struct holds the
/// cgroup-wide parameters they read at replenishment time.
/// See [Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees) for the per-CPU CBS model.
pub cbs: Option<CbsGroupConfig>,
// ── Heterogeneous CPU scheduling (big.LITTLE / P-core/E-core) ──────
/// Preferred CPU core type for tasks in this cgroup. On heterogeneous
/// platforms (ARM big.LITTLE, Intel Alder Lake+), the scheduler
/// preferentially places tasks in this cgroup on cores of the specified
/// type. `None` means no preference (scheduler uses its default
/// energy-aware policy). Exposed via `cpu.core_type` cgroupfs file.
/// Values: "performance" (big/P-core), "efficiency" (LITTLE/E-core).
pub core_type: Option<CpuCoreType>,
/// `cpu.capacity.min`: minimum CPU capacity required (0..=1024).
/// The scheduler will not place tasks from this cgroup on CPUs whose
/// normalized capacity is below this value. ARM defines CPU capacity as
/// a DMIPS/MHz-normalized value (0-1024) where 1024 = the most capable
/// core in the system. Default: 0 (no minimum — tasks may run on any core).
/// Used by Android-style EAS (Energy Aware Scheduling) and heterogeneous
/// server workloads (e.g., "this cgroup needs big cores").
/// **UmkaOS extension**: This knob is NOT present in upstream Linux cgroups v2.
/// Linux exposes capacity via `sched_setattr(SCHED_FLAG_UTIL_CLAMP_MIN)` per-task
/// only. UmkaOS elevates it to a per-cgroup knob for container-level capacity
/// pinning. Tools that do not recognize `cpu.capacity.min` will ignore it.
pub capacity_min: AtomicU32,
/// `cpu.capacity.max`: maximum CPU capacity allowed (0..=1024).
/// The scheduler will not place tasks from this cgroup on CPUs whose
/// normalized capacity exceeds this value. Default: 1024 (no maximum —
/// tasks may run on the most capable cores). Setting this below the
/// system's maximum capacity constrains the cgroup to efficiency cores,
/// useful for background/batch workloads that should not contend with
/// latency-sensitive workloads for high-performance cores.
/// **UmkaOS extension**: Same as `capacity_min` — not in upstream Linux
/// cgroups v2. Linux equivalent is per-task `SCHED_FLAG_UTIL_CLAMP_MAX`.
pub capacity_max: AtomicU32,
// ── Latency tuning ──────────────────────────────────────────────────
/// `cpu.latency_nice`: per-cgroup latency-nice hint (-20 to +19, default 0).
///
/// **UmkaOS-original extension** — NOT a Linux feature. `latency_nice` was
/// proposed on LKML (Vincent Guittot / Parth Shah, 2022-2024) but never
/// merged into `torvalds/linux` mainline. As of Linux 6.17+, there is no
/// `latency_nice` field in `struct sched_attr`, no `SCHED_FLAG_LATENCY_NICE`
/// bit, and no per-cgroup `cpu.latency_nice` knob. Applications and cgroup
/// configurations using this feature are UmkaOS-only.
///
/// Shifts the EEVDF eligibility window for all tasks in this cgroup:
/// - Negative values (e.g., -20): earlier eligibility → lower scheduling
/// latency (latency-sensitive workloads: databases, interactive UIs).
/// - Positive values (e.g., +19): later eligibility → higher throughput
/// but increased scheduling latency (batch, background workloads).
/// - Zero: no adjustment (default EEVDF behavior).
///
/// Written via the `cpu.latency_nice` cgroupfs file. When a task's
/// per-task `latency_nice` (set via `sched_setattr(2)`) differs from the
/// cgroup value, the more latency-sensitive (lower) value wins — the
/// effective latency-nice is `min(task.latency_nice, cgroup.latency_nice)`.
///
/// The value is propagated to the scheduler via the same two-phase
/// mechanism as `cpu.weight`: atomic store here, then per-CPU IPI to
/// update all `SchedEntity` instances in the cgroup's task list.
///
/// Cross-reference: [Section 7.1](07-scheduling.md#scheduler) for EEVDF latency-nice integration
/// and the `LATENCY_NICE_TO_WEIGHT` table.
pub latency_nice: AtomicI32,
// ── Accumulated statistics (read via `cpu.stat`) ─────────────────────
/// Total CPU time consumed (microseconds). Monotonically increasing.
pub usage_us: AtomicU64,
// NOTE: Bandwidth throttling stats (nr_periods, nr_throttled, throttled_time_us)
// are maintained in `CpuBandwidthThrottle` — the single source of truth for
// all cpu.max bandwidth accounting. See [Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees--cpumax-ceiling-enforcement-bandwidth-throttling).
// `cpu.stat` reads are served by reading from `bandwidth.nr_periods` etc.
}
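The min() resolution rule for `latency_nice` described above can be sketched as follows. This is a minimal illustration under stated assumptions: `effective_latency_nice` and its signature are hypothetical helpers, not part of the kernel API defined in this chapter.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

/// Illustrative sketch: resolve the effective latency-nice for a task.
/// Per the rule above, the more latency-sensitive (lower) of the
/// per-task and per-cgroup values wins.
pub fn effective_latency_nice(task_ln: i32, cgroup_ln: &AtomicI32) -> i32 {
    // Clamp both inputs to the valid nice range before combining.
    let cg = cgroup_ln.load(Ordering::Relaxed).clamp(-20, 19);
    let t = task_ln.clamp(-20, 19);
    t.min(cg)
}
```

A task that requested `-10` via `sched_setattr(2)` keeps `-10` even in a cgroup configured at `+5`; a task at `+10` in that same cgroup is pulled down to `+5`.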
/// CPU core type classification for heterogeneous scheduling.
/// Determined at boot time from ACPI CPPC (Collaborative Processor Performance
/// Control) or device tree `capacity-dmips-mhz` properties.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CpuCoreType {
/// High-performance core (ARM Cortex-X/A7x series, Intel P-core).
/// Highest capacity value in the system.
Performance = 0,
/// Energy-efficient core (ARM Cortex-A5x series, Intel E-core).
/// Lower capacity, lower power consumption.
Efficiency = 1,
}
Per-CPU CBS Budget Tracking
`CpuController.cbs` stores the cgroup-wide CBS configuration (`Option<CbsGroupConfig>`).
`None` means no CBS guarantee has been configured for this cgroup (the default).
Actual budget enforcement is per-CPU via `CbsCpuServer` structures, one per cgroup
per CPU that has runnable tasks. The per-CPU model eliminates global pool lock
contention — all budget operations are per-CPU atomics or CAS on sibling CPUs.
See Section 7.6 for the complete per-CPU CBS design, including:
- `CbsGroupConfig` (cgroup-wide parameters: quota, period, burst, total_weight)
- `CbsCpuServer` (per-CPU: budget, deadline, throttled state, local_weight)
- Proportional share replenishment (no global timer, per-CPU timers)
- Atomic steal protocol (exhaust → steal from NUMA-local siblings first)
- Task migration handling (weight transfer, lazy proportional rebalance)
Throttling mechanics: when `CbsCpuServer.throttled` is set, the scheduler's
`pick_next_task()` skips tasks in the throttled cgroup. Tasks already running
when the budget expires are preempted at the next scheduler tick. On per-CPU
timer replenishment, throttled servers are un-throttled and their tasks'
`OnRqState` transitions from `CbsThrottled` back to `Queued`, re-entering the
EEVDF runqueue.
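The budget/throttle state machine above can be sketched as follows. The field and method names (`CbsCpuServerSketch`, `account`, `replenish`) are illustrative, not the real `CbsCpuServer` layout from Section 7.6.

```rust
/// Minimal single-CPU sketch of the CBS budget/throttle cycle.
struct CbsCpuServerSketch {
    budget_ns: u64,
    throttled: bool,
}

impl CbsCpuServerSketch {
    /// Charge `delta_ns` of consumed CPU time; throttle on exhaustion.
    fn account(&mut self, delta_ns: u64) {
        self.budget_ns = self.budget_ns.saturating_sub(delta_ns);
        if self.budget_ns == 0 {
            // pick_next_task() will now skip this cgroup's tasks.
            self.throttled = true;
        }
    }

    /// Per-CPU timer replenishment: refill the local share, un-throttle.
    /// (The real design computes local_share from cgroup weight; elided.)
    fn replenish(&mut self, local_share_ns: u64) {
        self.budget_ns = local_share_ns;
        self.throttled = false; // tasks move CbsThrottled -> Queued
    }
}
```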
17.2.1.3 Memory Controller State¶
/// Memory controller state, present when the `memory` controller is enabled.
///
/// Maps to `memory.current`, `memory.high`, `memory.max`, `memory.swap.max`,
/// `memory.oom.group`, and `memory.events` cgroupfs files.
/// See Section 17.2.4 for the integration with the physical memory allocator.
pub struct MemController {
/// `memory.current`: total bytes of memory charged to this cgroup.
/// Updated on every page charge/uncharge (one atomic add per page fault
/// or page table manipulation). Monotonically tracks live usage.
pub usage: AtomicU64,
/// `memory.high`: soft limit in bytes. When `usage` exceeds this, the
/// cgroup's tasks are throttled (sleeping in the allocator path) and
/// reclaim is prioritized for pages belonging to this cgroup.
/// `u64::MAX` means unlimited (default).
pub high: AtomicU64,
/// `memory.max`: hard limit in bytes. When `usage` would exceed this,
/// the per-cgroup OOM killer is invoked before the allocation completes.
/// `u64::MAX` means unlimited (default).
pub max: AtomicU64,
/// `memory.swap.max`: swap usage hard limit in bytes.
/// `u64::MAX` means unlimited (default).
/// Controls how much of this cgroup's memory may be swapped out.
pub swap_max: AtomicU64,
/// `memory.min`: absolute minimum memory guarantee (bytes). Memory below
/// this threshold is NEVER reclaimed, even under global OOM pressure.
/// This provides a hard guarantee for critical workloads. The page scanner
/// unconditionally skips pages belonging to cgroups whose `usage` is at or
/// below `memory_min`. Default: 0 (no guarantee).
///
/// Effective value propagation: a cgroup's effective `memory.min` is
/// `min(memory_min, parent.effective_memory_min * memory_min / siblings_sum)`.
/// This ensures children cannot collectively claim more protection than
/// the parent offers.
pub memory_min: AtomicU64,
/// `memory.low`: best-effort memory protection (bytes). Memory below this
/// threshold is protected from reclaim unless there is no other reclaimable
/// memory in the system. Provides softer protection than `memory_min`:
/// the page scanner deprioritizes pages in cgroups under their `memory.low`
/// threshold, but will still reclaim them as a last resort before invoking
/// the OOM killer. Default: 0 (no protection).
///
/// Effective value propagation: same formula as `memory_min`. The reclaim
/// path computes `effective_low` by walking the cgroup hierarchy from the
/// target cgroup to the root, distributing the parent's protection budget
/// proportionally among siblings based on their configured `memory.low`
/// values.
pub memory_low: AtomicU64,
/// `memory.oom.group`: when `true`, the OOM killer kills **all tasks**
/// in the cgroup rather than selecting a single victim. Useful for
/// atomically terminating a container that has overrun its memory budget.
pub oom_group: AtomicBool,
/// `memory.events` counters. These seven counters map to the fields
/// in the `memory.events` cgroupfs file (Linux cgroup v2 interface).
/// `memory.events.low`: number of times the cgroup was reclaimed below
/// `memory.low` (i.e., the cgroup's guaranteed minimum was breached).
pub events_low: AtomicU64,
/// `memory.events.high`: number of times `usage` exceeded `memory.high`,
/// triggering throttling and priority reclaim.
pub events_high: AtomicU64,
/// `memory.events.max`: number of times `usage` hit `memory.max` and
/// allocation attempts were stalled or failed.
pub events_max: AtomicU64,
/// `memory.events.oom`: number of times the OOM killer was invoked
/// for this cgroup (regardless of whether a kill actually happened).
pub events_oom: AtomicU64,
/// `memory.events.oom_kill`: number of OOM kills triggered for this cgroup.
/// Incremented each time the OOM killer selects a victim in this cgroup.
pub oom_kill: AtomicU64,
/// `memory.events.oom_group_kill`: number of times this cgroup was killed
/// as a group due to `memory.oom.group = 1` being set. Distinct from
/// `oom_kill` which counts individual process kills.
pub oom_group_kill: AtomicU64,
/// `memory.events.sock_throttled`: number of times network sockets
/// associated with this cgroup are throttled due to memory pressure.
pub sock_throttled: AtomicU64,
/// LRU list of pages charged to tasks in this cgroup.
/// The reclaim path consults this list to find candidate pages when
/// `memory.high` is exceeded or when global reclaim pressure is high.
/// `CgroupLru` uses MGLRU generation lists (not legacy active/inactive
/// two-list LRU). Pages are distributed across `N_GENS` generations
/// (default 4), split into file-backed and anonymous pools per
/// generation, matching the per-zone `ZoneLru` design in
/// [Section 4.4](04-memory.md#page-cache--generational-lru-page-reclaim). The oldest generation
/// is the primary reclaim target.
/// SpinLock required: direct reclaim may call into cgroup LRU from atomic
/// context (e.g., page allocator under memory pressure with IRQs disabled).
/// Mutex would sleep in that context, causing a scheduling-while-atomic BUG.
pub lru: SpinLock<CgroupLru>,
}
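The effective `memory.min` propagation formula documented on the `memory_min` field above can be written out as a small worked function. This is a sketch with explicit arguments (`effective_memory_min` is a hypothetical free function; the real path walks the cgroup hierarchy):

```rust
/// Effective memory.min: a child's protection is its configured value,
/// capped by its proportional share of the parent's effective protection
/// (so siblings cannot collectively claim more than the parent offers).
fn effective_memory_min(
    configured_min: u64,
    parent_effective_min: u64,
    siblings_min_sum: u64, // sum of configured memory.min over all siblings (incl. self)
) -> u64 {
    if siblings_min_sum == 0 {
        return 0; // no sibling requested protection
    }
    // Proportional share of the parent's budget; u128 avoids overflow
    // for multi-terabyte byte counts.
    let share = (parent_effective_min as u128 * configured_min as u128
        / siblings_min_sum as u128) as u64;
    configured_min.min(share)
}
```

With a 1 GiB parent budget and two siblings each configured at 512 MiB, each gets its full 512 MiB; if the siblings together request 2 GiB, each is scaled down proportionally.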
17.2.1.3.1 Per-CPU Memory Charge Batching (MemCgroupStock)¶
On systems with 128+ CPUs, every page allocation contends on the memory controller's
global usage: AtomicU64 counter. To eliminate this hot-path atomic contention, UmkaOS
uses a per-CPU charge cache modeled on Linux's memcg_stock_pcp:
/// Per-CPU charge cache for the memory controller.
/// Amortizes global AtomicU64 contention on `MemController::usage`.
/// Stored in the `CpuLocalBlock` (see {ref:per-cpu-data-and-preemption-control} <!-- UNRESOLVED -->).
// kernel-internal, not KABI — per-CPU charge cache, never crosses a boundary.
#[repr(C)]
pub struct MemCgroupStock {
/// Pre-charged bytes available for allocation without touching the global counter.
cached_charge: u64,
/// The cgroup this stock is cached for. `None` = stock is empty.
/// Note: kernel-internal struct (per-CPU, never crosses KABI or wire
/// boundary). `Option<CgroupId>` layout is stable within a single
/// compilation. Consider `NonZeroU64` for `CgroupId` to get niche
/// optimization (`Option<CgroupId>` = 8 bytes, zero = None).
cached_cgroup: Option<CgroupId>,
}
Operation:
- **Charge (hot path)**: `mem_cgroup_charge(cg, size)` acquires a `PreemptGuard` (via `CpuLocal::preempt_disable()`) then checks the local CPU's `MemCgroupStock`. If `cached_cgroup == cg` and `cached_charge >= size`, deduct locally — zero atomics, zero contention. The `PreemptGuard` MUST be held for the entire read-check-deduct sequence; without it, the task could be migrated to a different CPU mid-sequence, accessing the wrong CPU's stock. Linux achieves the same guarantee via `local_irq_save()`/`local_irq_restore()` in `consume_stock()`.
- **Refill**: When the stock is empty or for a different cgroup, perform a single `usage.fetch_add(STOCK_SIZE)` on the global counter, caching `STOCK_SIZE` bytes locally. Default `STOCK_SIZE = 32 * PAGE_SIZE` (128 KiB on 4K pages; 2 MiB on AArch64/PPC64 with 64K pages). The proportional scaling is intentional: larger pages mean larger minimum allocation granularity, so the batching stock scales proportionally to avoid excessive global counter traffic.
- **Drain triggers**: The per-CPU stock is drained (returned to the global counter) on:
  - CPU offline (`cpu_dead` notifier)
  - Task migration to a different cgroup (the old cgroup's stock is flushed)
  - `memory.high` breach detection (all CPUs' stocks for that cgroup are drained via IPI to get an accurate reading)
  - `memory.max` limit check (drain before comparing against limit)
- **Accuracy**: The global `usage` counter may be up to `nr_cpus * STOCK_SIZE` below the true usage. Limit enforcement (`memory.max`) drains all stocks before rejecting an allocation, ensuring limits are respected exactly. Soft limits (`memory.high`) tolerate the imprecision — the reclaim signal may fire slightly late, which is acceptable.
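The charge and refill paths can be sketched single-threaded as follows. `StockSketch` and `charge` are illustrative names; the real path holds a `PreemptGuard` throughout, and drains a mismatched cgroup's leftover stock before refilling (both elided here):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const STOCK_SIZE: u64 = 32 * 4096; // 32 pages on a 4K-page system

struct StockSketch {
    cached_charge: u64,
    cached_cgroup: Option<u64>, // CgroupId stand-in; None = stock empty
}

/// Hot-path charge: serve from the local stock when it matches,
/// otherwise refill with a single global fetch_add and charge from
/// the fresh stock.
fn charge(stock: &mut StockSketch, global_usage: &AtomicU64, cg: u64, size: u64) {
    if stock.cached_cgroup == Some(cg) && stock.cached_charge >= size {
        stock.cached_charge -= size; // zero atomics on this path
        return;
    }
    // Refill: one global atomic covers the next STOCK_SIZE bytes.
    // (Draining the previous cgroup's remainder is elided.)
    let refill = STOCK_SIZE.max(size);
    global_usage.fetch_add(refill, Ordering::Relaxed);
    stock.cached_cgroup = Some(cg);
    stock.cached_charge = refill - size;
}
```

Two consecutive page-sized charges to the same cgroup touch the global counter exactly once.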
17.2.1.4 PID Controller State¶
/// PID controller state, present when the `pids` controller is enabled.
///
/// Maps to `pids.current`, `pids.max`, and `pids.events` cgroupfs files.
/// See Section 17.2.6 for fork-bomb prevention semantics.
pub struct PidsController {
/// `pids.current`: number of tasks (threads + processes) currently in
/// this cgroup subtree. Incremented by fork/clone, decremented by exit.
pub current: AtomicU64,
/// `pids.max`: maximum tasks allowed in this cgroup subtree.
/// `u64::MAX` means unlimited (default). `fork()`/`clone()` checks
/// `current < max` before allocating a new task; returns `EAGAIN` on failure.
pub max: AtomicU64,
/// The `max` counter in `pids.events`: number of fork/clone calls that were
/// rejected because `current` reached `max`. Monotonically increasing.
pub events_max: AtomicU64,
}
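Since `pids.current` counts the whole subtree, a fork must charge every ancestor and roll back on the first level that is at its limit. A minimal sketch of that all-or-nothing charge (the `PidsLevel` struct and `pids_try_charge` helper are hypothetical; the real code walks `Cgroup` parent links):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One level of the pids hierarchy, root first.
struct PidsLevel {
    current: AtomicU64,
    max: u64, // u64::MAX = unlimited
}

/// Fork-time charge: optimistically increment each ancestor's counter,
/// rolling back every charged level on the first one at its limit.
fn pids_try_charge(path_root_to_leaf: &[PidsLevel]) -> Result<(), &'static str> {
    for (i, lvl) in path_root_to_leaf.iter().enumerate() {
        let prev = lvl.current.fetch_add(1, Ordering::Relaxed);
        if prev >= lvl.max {
            // Undo this level and every level already charged.
            for l in &path_root_to_leaf[..=i] {
                l.current.fetch_sub(1, Ordering::Relaxed);
            }
            return Err("EAGAIN");
        }
    }
    Ok(())
}
```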
17.2.1.5 I/O Controller State¶
/// I/O controller state, present when the `io` controller is enabled.
///
/// Maps to `io.max`, `io.weight`, `io.stat`, and `io.pressure` cgroupfs files.
/// See [Section 15.18](15-storage.md#io-priority-and-scheduling) for integration with the block I/O scheduler.
pub struct IoController {
/// `io.weight`: relative I/O weight among siblings (1..=10000, default 100).
/// Used by the I/O scheduler to compute per-cgroup I/O bandwidth shares.
/// Higher weight = larger share of available I/O bandwidth relative to siblings.
/// At I/O dispatch time, the block layer's I/O scheduler reads the
/// cgroup's `io.weight` via `cgroup_io_weight()` to compute proportional
/// share. Weight is NOT applied at `submit_bio()` time — tagging at
/// submission would not account for request merging and reordering.
///
/// **Block-layer consumer**: The KABI I/O scheduler dispatches bios with
/// a priority derived from `io.weight`. At dispatch time, the scheduler
/// calls `cgroup_io_weight(bio.task.cgroup)` and maps the weight to
/// a proportional share of the device's dispatch budget. The mapping is:
/// `dispatch_share = cgroup_weight / sum(sibling_weights)`. Bios from
/// higher-weight cgroups are dequeued proportionally more often during each
/// dispatch round. This is implemented in the block layer's dispatch loop
/// ([Section 15.2](15-storage.md#block-io-and-volume-management--bio-submission-and-completion)),
/// which evaluates cgroup weight when selecting the next request to issue
/// to the device.
pub weight: AtomicU32,
/// `io.latency` target in microseconds per device. When the cgroup's average
/// I/O completion latency exceeds this target, the I/O scheduler throttles
/// other cgroups competing for the same device to protect this cgroup's latency.
/// `0` means no latency target (disabled, default). Written via the
/// `io.latency` cgroupfs file (`MAJ:MIN target=<usec>`).
///
/// **Enforcement mechanism**: On each bio completion, the block layer updates
/// the cgroup's per-device exponential moving average (EMA) of completion
/// latency: `ema = ema * 7/8 + actual_latency * 1/8`. When `ema > lat_target_us`,
/// the cgroup is marked "latency-protected" and the I/O scheduler applies
/// backpressure to sibling cgroups: their dispatch budget is reduced to
/// `max(1, normal_budget * lat_target_us / sibling_ema)`. This throttles
/// competing cgroups proportionally to how much they exceed the protected
/// cgroup's target, without starving them completely. The EMA window (8
/// samples) provides hysteresis to avoid oscillation. Protected status is
/// cleared when `ema <= lat_target_us` for 8 consecutive samples.
pub lat_target_us: AtomicU64,
/// Per-device rate limits (write-side authoritative copy).
/// Configuration writes (`io.max` writes from cgroupfs) acquire the
/// Mutex, modify the Vec, then publish a new immutable snapshot to
/// `devices_snapshot`. Only the cgroupfs write path touches this.
/// Bounded by unique `(major, minor)` device pairs. Cap: 1024 per cgroup
/// (enforced by cgroupfs write handler; returns ENOSPC if exceeded).
/// Typical servers have 10-50 block devices; 1024 provides ample headroom.
pub devices_config: Mutex<Vec<IoDeviceLimits>>,
/// Per-device rate limits (read-side snapshot for hot path).
/// XArray keyed by `dev_t` (u64 integer key) — O(1) lookup per bio.
/// Published via RCU clone-and-swap when `devices_config` changes.
/// `cgroup_io_throttle()` reads this under `rcu_read_lock()` —
/// no Mutex contention, no heap allocation on the per-I/O path.
///
/// **`cgroup_io_throttle()` lookup path** (called from the block layer
/// bio submission path):
/// ```
/// let cgroup = CGROUP_TABLE.get(bio.cgroup_id);
/// let io_ctl = cgroup.subsystems.io.as_ref();
/// io_ctl.throttle(bio);
/// ```
/// Inside `io_ctl.throttle(bio)`: acquires `rcu_read_lock()`, reads
/// `devices_snapshot` via RCU dereference, looks up `bio.dev` in the
/// XArray (O(1)), applies rate limit checks (rbps/wbps/riops/wiops),
/// and sleeps if the bio would exceed the current token bucket balance.
/// The RCU read lock is dropped before sleeping.
pub devices_snapshot: RcuCell<XArray<IoDeviceLimits>>,
/// PSI (Pressure Stall Information) for this cgroup's I/O subsystem.
/// Exposed as `io.pressure`. Tracks the fraction of time tasks in this
/// cgroup are stalled waiting for I/O completions.
pub psi: PsiState,
}
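The `io.latency` EMA tracking and sibling backpressure described on `lat_target_us` can be sketched as follows. `LatencyTracker` and its methods are illustrative names; the 7/8 decay factor and the 8-sample hysteresis match the documented formula.

```rust
/// Per-device latency protection state for one cgroup (sketch).
struct LatencyTracker {
    ema_us: u64,
    target_us: u64, // 0 = latency target disabled
    protected: bool,
    under_target_streak: u32,
}

impl LatencyTracker {
    /// Called on each bio completion with the observed latency.
    fn on_bio_complete(&mut self, actual_latency_us: u64) {
        self.ema_us = self.ema_us * 7 / 8 + actual_latency_us / 8;
        if self.target_us == 0 {
            return;
        }
        if self.ema_us > self.target_us {
            self.protected = true;
            self.under_target_streak = 0;
        } else {
            self.under_target_streak += 1;
            if self.under_target_streak >= 8 {
                self.protected = false; // hysteresis: 8 consecutive good samples
            }
        }
    }

    /// Backpressure on a sibling: its dispatch budget shrinks in
    /// proportion to how far its own EMA exceeds the protected target.
    fn sibling_budget(&self, normal_budget: u64, sibling_ema_us: u64) -> u64 {
        if !self.protected || sibling_ema_us == 0 {
            return normal_budget;
        }
        (normal_budget * self.target_us / sibling_ema_us).max(1)
    }
}
```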
/// Block device identifier encoded as (major, minor) pair.
/// Linux ABI: `dev_t` is a 64-bit value with major in bits 8-19 and 32-63,
/// minor in bits 0-7 and 20-31 (glibc encoding). This struct stores the
/// decoded components; `from_dev_t` / `to_dev_t` handle the bit packing.
#[repr(C)]
pub struct DeviceNumber {
/// Major device number (identifies the driver).
pub major: u32,
/// Minor device number (identifies the device instance within the driver).
pub minor: u32,
}
// kernel-internal, not KABI. Layout: 4 + 4 = 8 bytes.
const_assert!(size_of::<DeviceNumber>() == 8);
impl DeviceNumber {
/// Decode a Linux `dev_t` (glibc encoding) into (major, minor).
pub fn from_dev_t(dev: u64) -> Self {
Self {
major: ((dev >> 8) & 0xFFF | (dev >> 32) & !0xFFF) as u32,
minor: (dev & 0xFF | (dev >> 12) & !0xFF) as u32,
}
}
/// Encode as Linux `dev_t` (glibc encoding).
pub fn to_dev_t(&self) -> u64 {
let maj = self.major as u64;
let min = self.minor as u64;
((maj & 0xFFF) << 8) | ((maj & !0xFFF) << 32)
| (min & 0xFF) | ((min & !0xFF) << 12)
}
}
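A round-trip check of the glibc `dev_t` bit packing above (the definitions are repeated as free functions so the example is self-contained; only `#[derive(Debug, PartialEq)]` is added for the comparison):

```rust
#[derive(Debug, PartialEq)]
struct DeviceNumber { major: u32, minor: u32 }

/// Decode a Linux dev_t (glibc encoding) into (major, minor).
fn from_dev_t(dev: u64) -> DeviceNumber {
    DeviceNumber {
        major: ((dev >> 8) & 0xFFF | (dev >> 32) & !0xFFF) as u32,
        minor: (dev & 0xFF | (dev >> 12) & !0xFF) as u32,
    }
}

/// Encode as Linux dev_t (glibc encoding).
fn to_dev_t(d: &DeviceNumber) -> u64 {
    let (maj, min) = (d.major as u64, d.minor as u64);
    ((maj & 0xFFF) << 8) | ((maj & !0xFFF) << 32)
        | (min & 0xFF) | ((min & !0xFF) << 12)
}
```

Small devices like `sda5` (8, 5) land entirely in the low 16 bits (`0x805`); majors and minors beyond 12/8 bits spill into the high words and still round-trip.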
/// Per-device I/O limits for one block device within a cgroup.
/// All limit fields use `u64::MAX` as the "unlimited" sentinel, consistent with
/// `MemController.max`, `MemController.high`, `PidsController.max`, and `swap_max`.
/// This avoids the 32-byte discriminant overhead of `Option<u64>` (no niche
/// optimization for `u64`) and eliminates the per-field branch on the bio
/// submission hot path. Read under RCU via `devices_snapshot`.
pub struct IoDeviceLimits {
/// Block device identified by (major, minor) numbers.
pub dev: DeviceNumber,
/// Read bandwidth limit in bytes per second. `u64::MAX` = unlimited.
pub rbps: u64,
/// Write bandwidth limit in bytes per second. `u64::MAX` = unlimited.
pub wbps: u64,
/// Read I/O operations per second limit. `u64::MAX` = unlimited.
pub riops: u64,
/// Write I/O operations per second limit. `u64::MAX` = unlimited.
pub wiops: u64,
}
17.2.1.6 Additional Controller State Structs¶
The following structs back the rdma, hugetlb, misc, cpuset, and shared PSI/LRU
fields referenced in Cgroup above. They are defined here rather than inline so that the
Cgroup struct definition in Section 17.2.1.1 remains readable.
/// RDMA cgroup controller. Limits RDMA/InfiniBand resource usage per cgroup.
/// Controls: MR (memory regions), MW (memory windows), PD (protection domains),
/// AH (address handles), QP (queue pairs), SRQ (shared receive queues).
/// Mirrors Linux's `rdma` cgroup subsystem (kernel 4.11+).
pub struct RdmaController {
/// Per-device RDMA resource limits. Key: RDMA device index (u32).
/// XArray: O(1) lookup by integer device index, RCU-compatible reads.
pub limits: XArray<RdmaDeviceLimit>,
/// Current RDMA resource usage. Key: RDMA device index (u32).
/// XArray: O(1) lookup by integer device index, RCU-compatible reads.
pub usage: XArray<RdmaDeviceUsage>,
}
/// Per-device RDMA resource limits for one cgroup.
pub struct RdmaDeviceLimit {
/// Max memory regions.
pub max_mr: u32,
/// Max memory windows.
pub max_mw: u32,
/// Max protection domains.
pub max_pd: u32,
/// Max address handles.
pub max_ah: u32,
/// Max queue pairs.
pub max_qp: u32,
/// Max shared receive queues.
pub max_srq: u32,
}
/// Current RDMA resource usage for one cgroup on one device.
pub struct RdmaDeviceUsage {
pub mr: AtomicU32,
pub mw: AtomicU32,
pub pd: AtomicU32,
pub ah: AtomicU32,
pub qp: AtomicU32,
pub srq: AtomicU32,
}
/// Huge-page cgroup controller. Limits huge page usage per cgroup per page size.
/// Maps to Linux's `hugetlb` cgroup subsystem.
/// Key: huge page size in bytes (2MB = 2097152, 1GB = 1073741824, etc.).
pub struct HugetlbController {
/// Maximum huge-page bytes allowed per page size. Value `u64::MAX` = unlimited.
/// Keyed by `HugePageSize` (page size in bytes); XArray provides O(1) lookup
/// per collection policy (integer keys → XArray, not BTreeMap).
pub limits: XArray<u64>,
/// Current huge-page bytes in use per page size, keyed by `HugePageSize`.
pub usage: XArray<AtomicU64>,
}
/// Huge page size in bytes (2 MiB, 1 GiB, etc.).
pub type HugePageSize = u64;
/// Miscellaneous cgroup controller (Linux 5.13+). Provides per-resource usage limits
/// for resources that do not fit into other controllers (e.g., UHID, eudbus entries).
pub struct MiscController {
/// Per-resource limits. Key: resource name (e.g., "uhid", "eudbus").
/// Value: limit and live usage counter.
/// Warm path: misc resource charge/uncharge on device open/close.
/// BTreeMap<Box<str>> is appropriate: string keys (non-integer), dynamic
/// (module-registered resource types), bounded (typically <10 resource
/// types system-wide).
pub resources: BTreeMap<Box<str>, MiscResource>,
}
/// One named miscellaneous resource tracked by `MiscController`.
pub struct MiscResource {
/// Maximum units allowed. `u64::MAX` = unlimited.
pub max: u64,
/// Current units in use.
pub usage: AtomicU64,
}
/// Cpuset cgroup controller. Pins tasks in a cgroup to specific CPUs and NUMA nodes.
/// Maps to Linux's `cpuset` subsystem (cgroup v2: `cpuset.cpus`, `cpuset.mems`).
pub struct CpusetController {
/// CPUs this cgroup's tasks are allowed to run on. Empty = inherit from parent.
pub allowed_cpus: CpuMask,
/// NUMA memory nodes this cgroup's tasks may allocate from. Empty = any node.
pub allowed_mems: NodeMask,
/// If true, enforce CPU affinity even during load balancing (exclusive cpuset).
pub cpu_exclusive: bool,
/// If true, enforce NUMA node affinity for memory allocation.
pub mem_exclusive: bool,
/// If true, allow migration of tasks off their cpuset during hotplug events.
pub mem_migrate: bool,
}
/// NUMA node affinity mask. Bit N = NUMA node N is allowed.
///
/// Supports up to 1024 NUMA nodes (`[u64; 16]` = 128 bytes), matching Linux's
/// `MAX_NUMNODES` configuration default. For systems with fewer nodes, only the
/// first `ceil(nr_nodes / 64)` words are meaningful; the remainder are zero.
///
/// 1024 nodes is sufficient for the largest production systems (SGI UV3000: 256
/// nodes; HPE Superdome Flex: 32 nodes). `NodeMask` is used in cgroup cpuset
/// configuration (cold path), so the 128-byte size is acceptable — it is never
/// allocated per-page or per-task on the hot path.
pub struct NodeMask {
pub bits: [u64; 16],
}
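Basic operations on the `[u64; 16]` layout can be sketched as follows. The document only specifies the field layout; the `set`/`test`/`is_empty` methods here are illustrative helpers.

```rust
/// NUMA node affinity mask: bit N = node N is allowed (up to 1024 nodes).
struct NodeMask { bits: [u64; 16] }

impl NodeMask {
    const fn empty() -> Self { NodeMask { bits: [0; 16] } }

    /// Allow node `node` (word = node / 64, bit = node % 64).
    fn set(&mut self, node: usize) {
        self.bits[node / 64] |= 1u64 << (node % 64);
    }

    fn test(&self, node: usize) -> bool {
        self.bits[node / 64] & (1u64 << (node % 64)) != 0
    }

    /// Empty mask = "any node", per the `allowed_mems` documentation.
    fn is_empty(&self) -> bool {
        self.bits.iter().all(|&w| w == 0)
    }
}
```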
/// Pressure Stall Information (PSI) state for one resource (CPU, memory, or I/O).
/// Exposed via /sys/fs/cgroup/<cgroup>/cpu.pressure, memory.pressure, io.pressure.
/// Matches Linux's `psi_group_cpu`/`psi_group_mem`/`psi_group_io` layout.
pub struct PsiState {
/// Exponentially-weighted moving average of stall time, in units of 0.01%.
/// Index 0 = 10-second window, 1 = 60-second window, 2 = 300-second window.
/// `some_avg`: at least one task stalled (partial stall).
/// Protected by `CgroupRstatLock` — both the timer callback (writer) and the
/// cgroupfs read path (`cpu.pressure`, `memory.pressure`, `io.pressure`) acquire
/// this lock before accessing these fields. Using `AtomicU32` for defense-in-depth
/// so that a missed lock acquisition does not produce a data race (undefined
/// behavior on non-x86 architectures where u32 reads are not naturally atomic).
pub some_avg: [AtomicU32; 3],
/// `full_avg`: all tasks stalled (full stall).
/// Same locking and atomicity rationale as `some_avg`.
pub full_avg: [AtomicU32; 3],
/// Cumulative stall time in microseconds since cgroup creation.
pub some_total: AtomicU64,
pub full_total: AtomicU64,
/// Timestamp of the last PSI measurement (nanoseconds since boot).
pub last_update_ns: AtomicU64,
}
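One PSI averaging step can be sketched as below: the stall time observed during a sampling interval is converted into the 0.01%-unit running averages for the three windows. This is a simplified EWMA under assumed constants (`SAMPLE_PERIOD_S`, integer alpha = period/window); the real implementation uses precomputed exponential decay factors per window.

```rust
/// PSI window lengths in seconds: 10 s, 60 s, 300 s (as documented above).
const PSI_WINDOWS_S: [u64; 3] = [10, 60, 300];
/// Illustrative sampling period (assumption, not from the spec).
const SAMPLE_PERIOD_S: u64 = 2;

/// Fold one sample of stall time into the three running averages
/// (units of 0.01%, matching some_avg / full_avg).
fn psi_update(avgs: &mut [u32; 3], stalled_us: u64, period_us: u64) {
    // Fraction of the period spent stalled, in units of 0.01%.
    let sample = (stalled_us * 10_000 / period_us) as u32;
    for (i, avg) in avgs.iter_mut().enumerate() {
        // EWMA with alpha = sample period / window length.
        let window_periods = (PSI_WINDOWS_S[i] / SAMPLE_PERIOD_S) as u32;
        *avg = *avg - *avg / window_periods + sample / window_periods;
    }
}
```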
/// Per-cgroup LRU for memory reclaim ordering (MGLRU-based).
/// Tracks pages owned by this cgroup to enable cgroup-aware reclaim
/// (reclaim targets the cgroup that is over its memory limit first).
///
/// Uses generational LRU matching the per-zone `ZoneLru` design in
/// [Section 4.4](04-memory.md#page-cache--generational-lru-page-reclaim). Each generation splits
/// pages into file-backed and anonymous pools for differentiated reclaim
/// cost modeling. The oldest generation is the primary reclaim target.
///
/// This replaces the legacy active/inactive two-list LRU that Linux used
/// prior to MGLRU (Linux 6.1+). UmkaOS adopts generational LRU from the
/// start; the two-list model is never used.
pub struct CgroupLru {
/// Generational lists, indexed by `generation_index % N_GENS`.
/// Each generation contains separate file-backed and anonymous page lists.
/// See `LruGeneration` in [Section 4.4](04-memory.md#page-cache--generational-lru-page-reclaim).
pub generations: [LruGeneration; N_GENS],
/// Index of the oldest generation (mod N_GENS). Reclaim starts here.
pub oldest_gen: u64,
/// Index of the youngest generation. New pages start here.
pub youngest_gen: u64,
/// Pages currently under writeback (not reclaimable until writeback completes).
pub writeback: IntrusiveList<Page>,
/// Total number of pages across all generations + writeback.
pub nr_pages: AtomicU64,
/// Pages that have been reclaimed since last check (for memory.stat reporting).
pub nr_reclaimed: AtomicU64,
}
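The `oldest_gen`/`youngest_gen` index arithmetic can be sketched as follows: generation indices are logical monotone counters, and the physical slot is `index % N_GENS`. The `try_age`/`retire_oldest` protocol here is an illustrative assumption consistent with the fields above; see Section 4.4 for the authoritative `ZoneLru` design.

```rust
const N_GENS: usize = 4;

/// Sketch of generation bookkeeping in CgroupLru.
struct GenIndices { oldest: u64, youngest: u64 }

impl GenIndices {
    /// Aging: open a new youngest generation, provided the window of
    /// live generations still fits the N_GENS physical slots.
    fn try_age(&mut self) -> bool {
        if self.youngest - self.oldest + 1 < N_GENS as u64 {
            self.youngest += 1;
            true
        } else {
            false // all slots live: reclaim must drain the oldest first
        }
    }

    /// Reclaim drained the oldest generation: advance it, freeing a slot.
    fn retire_oldest(&mut self) {
        debug_assert!(self.oldest < self.youngest);
        self.oldest += 1;
    }

    /// Physical slot in `generations` for a logical index.
    fn slot(index: u64) -> usize { (index % N_GENS as u64) as usize }
}
```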
17.2.1.7 Hierarchy Root¶
/// Root of the cgroup v2 unified hierarchy. One instance per system.
///
/// UmkaOS has a single `CgroupRoot` (no per-controller separate hierarchies —
/// those were the v1 design that UmkaOS eliminates). The root cgroup has `id == 1`
/// and no parent.
pub struct CgroupRoot {
/// The root cgroup node. All other cgroups are reachable from here via
/// `children` links. `Arc` because `CgroupNamespace` instances hold
/// per-namespace root references into this tree (at arbitrary subtree nodes).
pub root: Arc<Cgroup>,
/// Lock protecting hierarchy structure changes (mkdir, rmdir, task migration).
/// Held in **write mode** during cgroup creation and destruction (mkdir/rmdir).
/// Held in **read mode** during task migration (steps 2-14 of the migration
/// protocol; concurrent migrations allowed). **Not** held during resource
/// charging (those operations use per-cgroup atomics).
///
/// `RwLock` (level 210 in the lock ordering table): concurrent hierarchy
/// traversals (cgroupfs readdir, population-count propagation) and task
/// migrations hold read locks; mkdir/rmdir hold write locks.
pub hierarchy_lock: RwLock<()>,
/// Fast O(1) lookup from cgroup ID to `Arc<Cgroup>`.
/// Used by cgroupfs to resolve inode numbers back to cgroup nodes,
/// and by the `CLONE_NEWCGROUP` implementation to find the anchor node
/// for a new cgroup namespace.
///
/// `XArray`: O(1) lookup by integer cgroup ID with native RCU-protected
/// reads (no explicit locking for readers). Entries are inserted at cgroup
/// creation and removed at destruction (after a grace period, since
/// cgroupfs inodes may hold references). Serialized writes via XArray's
/// internal lock.
pub id_map: XArray<Arc<Cgroup>>,
/// Monotonically increasing ID counter. Assigned at cgroup creation;
/// never reused (even after cgroup destruction). `AtomicU64` allows
/// lock-free ID allocation at mkdir time.
pub next_id: AtomicU64,
/// VFS mount point for the cgroupfs pseudo-filesystem.
/// `None` if cgroupfs has not yet been mounted (early boot).
/// After mount, this is the `Mount` (Section 14.3) returned by
/// `mount("cgroup2", "/sys/fs/cgroup", "cgroup2", 0, NULL)`.
pub mount: Option<Mount>,
}
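The `next_id` allocation and `id_map` registration at mkdir time can be sketched as follows. A `HashMap` stands in for the XArray and a `String` for `Arc<Cgroup>`; the names `RootSketch` and `cgroup_alloc_id` are illustrative.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for CgroupRoot: the root cgroup has id 1, so next_id
/// starts at 2. IDs are never reused, even after rmdir.
struct RootSketch {
    next_id: AtomicU64,
    id_map: HashMap<u64, String>,
}

/// mkdir-time ID allocation: lock-free fetch_add hands out a fresh,
/// monotonically increasing ID, then the node is registered for O(1)
/// lookup by cgroupfs.
fn cgroup_alloc_id(root: &mut RootSketch, name: &str) -> u64 {
    let id = root.next_id.fetch_add(1, Ordering::Relaxed);
    root.id_map.insert(id, name.to_string());
    id
}
```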
17.2.1.8 Task Migration (cgroup.procs write)¶
Writing a PID to cgroup.procs atomically moves all threads in the thread
group to the target cgroup. This is required by the cgroup v2 ABI — Docker,
systemd, and Kubernetes expect that writing a TGID moves all threads. Writing
to cgroup.threads (if threaded mode is enabled) moves a single thread.
The migration protocol is O(depth × threads) where depth is the cgroup tree height from source/target to their Lowest Common Ancestor (LCA) and threads is the thread group size. Limit recomputation is deferred lazily via generation counters (not done during migration itself).
Threadgroup atomicity: To prevent concurrent fork() from creating new
threads during migration (which would escape the target cgroup), the migrating
task acquires process.threadgroup_rwsem (read-write semaphore on the Process
struct, level 205) in write mode BEFORE step 2. This serializes with fork()
which acquires it in read mode. All-or-nothing semantics: if migration fails
for any thread, ALL threads are rolled back to the source cgroup.
In the common case (migration within the same subtree, depth ≤ 4), the LCA walk touches ≤ 8 nodes per thread (~400-800 ns). Worst case (cross-subtree at depth 256, 1000 threads): ~25-50 ms, bounded by the maximum nesting depth × thread count.
Migration steps for `write(fd_cgroup_procs, pid_str)`:
1. Resolve PID to TaskId using the writer's PID namespace. Obtain
the `Process` struct to access the thread group. Compute the LCA
(Lowest Common Ancestor) of source and target cgroups:
```rust
/// Compute the Lowest Common Ancestor of two cgroups.
/// Requires both cgroups to have a `depth` field (distance from root).
///
/// NOTE: `cg.parent` is `Option<Weak<Cgroup>>`. Each parent step upgrades the
/// `Weak` to an `Arc` via `.expect()`; the upgrade cannot fail because
/// `population > 0` on every ancestor guarantees liveness while descendants exist.
fn cgroup_lca(a: &Cgroup, b: &Cgroup) -> Arc<Cgroup> {
    // Upgrade helper: every non-root ancestor is kept alive by its
    // descendants' population counts, so the Weak upgrade cannot fail.
    fn parent(cg: &Arc<Cgroup>) -> Arc<Cgroup> {
        cg.parent.as_ref()
            .expect("non-root cgroup has no parent")
            .upgrade()
            .expect("ancestor population > 0 guarantees liveness -- see hierarchy_lock")
    }
    let mut wa: Arc<Cgroup> = a.self_ref();
    let mut wb: Arc<Cgroup> = b.self_ref();
    // Equalize depths by walking the deeper one up.
    while wa.depth > wb.depth { wa = parent(&wa); }
    while wb.depth > wa.depth { wb = parent(&wb); }
    // Walk both up in lock-step until they meet.
    while !Arc::ptr_eq(&wa, &wb) {
        wa = parent(&wa);
        wb = parent(&wb);
    }
    wa
}
```
1a. Acquire `process.threadgroup_rwsem` in write mode (level 205).
This prevents concurrent `fork()` from creating new threads during
migration. `fork()` acquires this in read mode (step 1 of do_fork).
2. Acquire `root.hierarchy_lock` (RwLock, read mode). This serializes with
concurrent `cgroup_mkdir()`/`cgroup_rmdir()` (which hold write mode) but
allows concurrent task migrations (both hold read mode). The lock ordering
table ([Section 3.4](03-concurrency.md#cumulative-performance-budget)) documents this at level 210.
Released after step 12 (before step 13's RQ_LOCK acquisition). **Lock
ordering**: `HIERARCHY_LOCK` (level 210) is ABOVE `RQ_LOCK` (level 50) in
the hierarchy, so they must NEVER be held simultaneously. The migration
protocol enforces this via a two-phase structure: Phase 1 (steps 2-12)
holds `HIERARCHY_LOCK` for cgroup tree operations; Phase 2 (steps 13-15)
acquires `RQ_LOCK` for runqueue dequeue/enqueue. The lock is released
between phases (end of step 12).
3. For EACH thread in the thread group (`process.thread_group.tasks.iter()`),
mark the task as migrating via CAS (two-phase task list protocol):
`cgroup_migration_state.compare_exchange(0, 1, Acquire, Relaxed)`.
If any thread's CAS fails (already migrating), release all acquired
locks and return `EAGAIN`. This is the CAS-based trylock documented
in the `CgroupMigrationState` enum — `CAS(None=0, Migrating=1)` to
acquire, `store(None=0, Release)` to release. The CAS atomically marks
the task as in-transit; no separate "mark migrating" step is needed
beyond this CAS. The `source.tasks` RwLock is NOT held here; it is
acquired later in the task list move step. The task remains in
`source.tasks` but is marked as in-transit. Readers of `cgroup.procs`
include MIGRATING tasks in their source cgroup for consistency,
ensuring the task is always visible in exactly one cgroup.
4. Check the source and target cgroups for controller constraints:
- **No-internal-processes rule (cgroup v2)**: If the target cgroup has
children with controllers enabled (`target.subtree_control != 0`),
processes cannot be placed directly in it. Return `EBUSY`. This
prevents the "internal node with both tasks and child cgroups"
state that breaks resource distribution guarantees.
- If target has a `PidsController`: verify (current + thread_count) <= max;
return EAGAIN if over limit.
- If target has a `MemController`: verify the task's current RSS would not
immediately exceed memory.max in the target. If over-limit, return ENOMEM.
5. Update the task's cgroup pointer:
task.cgroup.swap(Arc::clone(&target));
The `ArcSwap::swap()` atomically replaces the cgroup reference.
`task.cgroup` is of type `ArcSwap<Cgroup>` (not bare `Arc`), enabling
atomic replacement with concurrent readers via RCU-like semantics.
The swap pairs with `ArcSwap::load()` in the resource-charge path,
ensuring that subsequent charges from this task are credited to the target.
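A minimal single-threaded illustration of the swap semantics, assuming an `AtomicPtr`-based cell (the real `ArcSwap`/RCU machinery exists precisely to make `load` safe against a concurrent `swap`, which this sketch deliberately does not attempt):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Arc;

/// Illustrative ArcSwap-style cell. WARNING: `load` is only sound when
/// no `swap` runs concurrently (there is a window between reading the
/// pointer and bumping the refcount). Single-threaded demo only; it
/// also leaks the final Arc (no Drop impl).
pub struct ArcCell<T> {
    ptr: AtomicPtr<T>,
}

impl<T> ArcCell<T> {
    pub fn new(v: Arc<T>) -> Self {
        Self { ptr: AtomicPtr::new(Arc::into_raw(v) as *mut T) }
    }

    /// Atomically replace the stored Arc, returning the previous one --
    /// the shape of the step-5 `task.cgroup.swap(...)` operation.
    pub fn swap(&self, new: Arc<T>) -> Arc<T> {
        let old = self.ptr.swap(Arc::into_raw(new) as *mut T, Ordering::AcqRel);
        // SAFETY: `old` came from Arc::into_raw and is now exclusively ours.
        unsafe { Arc::from_raw(old) }
    }

    pub fn load(&self) -> Arc<T> {
        let p = self.ptr.load(Ordering::Acquire);
        // SAFETY: sound only without a concurrent swap (see warning above).
        unsafe {
            Arc::increment_strong_count(p);
            Arc::from_raw(p)
        }
    }
}
```

The returned old `Arc` is what lets the caller keep the source cgroup alive long enough to transfer charges away from it.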
7. Transfer memory charges from source to target via `mem_cgroup_migrate()`:
```rust
/// Transfer memory charges from source to target cgroup during
/// task migration. Charges are moved as bulk counter adjustments
/// (not per-page transfers) for efficiency.
///
/// # Arguments
/// - `task`: The task being migrated (provides RSS via process.mm).
/// - `source`: Source cgroup (charges subtracted).
/// - `target`: Target cgroup (charges added).
/// - `lca`: Lowest Common Ancestor (charges above LCA are unchanged).
///
/// # Error handling
/// Returns `Err(ENOMEM)` if adding charges to the target would exceed
/// `memory.max`. On error, any partially transferred charges are
/// rolled back before returning.
fn mem_cgroup_migrate(
task: &Task,
source: Arc<Cgroup>,
target: Arc<Cgroup>,
lca: Arc<Cgroup>,
) -> Result<(), Errno> {
// RSS is shared among threads via Process.mm — get the process-level
// RSS, not per-task (threads share address space).
// ArcSwap::load() returns an ArcSwapGuard that extends the Arc lifetime.
// Single load to avoid TOCTOU: rss_pages and rss_bytes must be
// consistent. A concurrent page fault between two separate loads
// could cause a one-page divergence.
let mm = task.process.mm.load();
let rss_pages = mm.rss_pages.load(Relaxed);
let rss_bytes = rss_pages as u64 * PAGE_SIZE as u64;
// Check target capacity before transferring.
// Note: check-then-charge race exists. Concurrent migrations or
// allocations may temporarily push usage above memory.max between
// this check and the fetch_add below. This is benign — the
// per-cgroup OOM killer handles the overshoot (same pattern as
// Linux mem_cgroup_charge). No lock is needed because the atomics
// ensure no lost updates; the overshoot is bounded to the sum of
// concurrent in-flight migration RSS values.
if let Some(ref mem) = target.memory {
let new_usage = mem.usage.load(Relaxed) + rss_bytes;
if new_usage > mem.max.load(Relaxed) {
return Err(Errno::ENOMEM);
}
}
// Walk from source up to LCA, subtracting charges.
// Hold an owned Arc<Cgroup> (not a &Cgroup borrow) across iterations:
// each upgraded parent Arc must outlive the loop body so the node stays
// alive while its `memory` controller is accessed.
let mut cg_arc = Arc::clone(&source);
while !Arc::ptr_eq(&cg_arc, &lca) {
if let Some(ref mem) = cg_arc.memory {
mem.usage.fetch_sub(rss_bytes, Relaxed);
// Anon vs file-backed distinction: both are transferred
// as bulk counter adjustments. Per-page type tracking is
// maintained by the existing charge_type counters.
}
cg_arc = cg_arc.parent.as_ref().unwrap().upgrade().unwrap();
}
// Walk from target up to LCA, adding charges.
let mut cg_arc = Arc::clone(&target);
while !Arc::ptr_eq(&cg_arc, &lca) {
if let Some(ref mem) = cg_arc.memory {
mem.usage.fetch_add(rss_bytes as u64, Relaxed);
}
cg_arc = cg_arc.parent.as_ref().unwrap().upgrade().unwrap();
}
Ok(())
}
```
This ordering (pointer update before charge transfer) ensures correctness:
new allocations after step 5 are charged to the target, and freed memory
from allocations made before migration is correctly unaccounted from the
source (which still holds the charge until this step transfers it).
**Bulk vs per-page model**: The bulk RSS charge transfer (fetch_sub/fetch_add
on `mem.usage`) diverges from Linux's per-page `mem_cgroup_charge()`/
`mem_cgroup_uncharge()` model. This is an intentional UmkaOS simplification
that avoids iterating every mapped page during migration. The trade-off:
the `memory.max` check above uses the full RSS as a single quantum, which
means a migration can temporarily overshoot `memory.max` by up to the
entire process RSS (not just one page). This is acceptable because:
(1) the per-cgroup OOM killer handles overshoots, (2) the overshoot is
bounded by `process.mm.rss_pages` (not unbounded), and (3) the bulk
transfer is O(1) vs O(RSS_pages) for the per-page model.
6. Drain `MemCgroupStock` on the current CPU if it is cached for the source
cgroup: add `cached_charge` back to `source.memory.usage` and clear the
stock. This ensures accurate RSS accounting before the charge transfer in
step 7 ([Section 17.2](#control-groups--per-cpu-memory-charge-batching-memcgroupstock)).
**OOM during migration (steps 5-7)**: If `mem_cgroup_migrate()` fails mid-transfer
(e.g., the target cgroup's `memory.max` would be exceeded by the transferred charges),
the kernel rolls back in reverse order: step 7 memory charges partial undo (reverse
any transferred charges), step 5 cgroup pointer (`task.cgroup.swap(source)`), step 3
migration state reset. Step 8 (PID counts) has NOT yet executed at step-7 failure time
and is NOT rolled back.
The task's `cgroup_migration_state` is set back to `CgroupMigrationState::None` and
the `cgroup.procs` write returns `-ENOMEM`. The task remains in the source cgroup
with all its original charges intact. No partial migration is ever visible to userspace.
8. Charge PID controllers:
- Decrement `PidsController::current` on source (and all ancestors up to LCA).
- Increment `PidsController::current` on target (and all ancestors up to LCA).
9. CBS weight adjustment for source and target cgroups
([Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees--cpumax-ceiling-enforcement-bandwidth-throttling)):
Because step 1 specifies that writing a PID to `cgroup.procs` moves ALL
threads in the thread group, the weight adjustment MUST sum across all
threads, not just the leader. The per-thread iteration scope is the
`process.thread_group.tasks` list:
```rust
let total_delta: u32 = process.thread_group.tasks.iter()
.map(|thread| thread.weight)
.sum();
```
- If the source cgroup has a CBS-guaranteed `CpuController`, adjust
`CbsGroupConfig.total_weight -= total_delta`. The cgroup's total
`cpu.guarantee` does NOT change — it is a cgroup property, not
divisible per-task. Only `total_weight` (used for proportional
budget distribution at replenishment) is adjusted.
- If the target cgroup has a CBS-guaranteed `CpuController`, adjust
`CbsGroupConfig.total_weight += total_delta`.
- If the destination cgroup's total CBS guarantee commitments would
exceed the system's `cpu.guarantee` admission limit, return ENOSPC.
Roll back steps 8, 7, 5, and 3 **in reverse order** (PID counts
restore, memory charge reverse transfer, cgroup pointer restore,
migration state reset) before returning.
- If neither source nor target has a CpuController, this step is a no-op.
10. Complete the task list move (phase 2 of two-phase protocol):
- Acquire source.tasks and target.tasks write locks **in cgroup-ID order**
(lower CgroupId first) to prevent ABBA deadlock when two tasks migrate
in opposite directions between the same pair of cgroups concurrently.
The `tasks` RwLock is at lock level 215 (between `HIERARCHY_LOCK` at 210
and `SOCK_LOCK` at 230; same-level tiebreaker: cgroup-ID order, parallel
to `RQ_LOCK`'s CPU-ID ordering convention).
- Remove TaskId from source.tasks.
- Insert TaskId into target.tasks.
- Set task.cgroup_migration_state = CgroupMigrationState::Complete
(charge reconciliation pending; transitions to None in step 16).
- Release both locks (higher CgroupId first).
Because the task was marked Migrating in step 3, it was continuously
visible in source.tasks throughout the migration. It now becomes
visible in target.tasks.
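The ordered dual-lock acquisition can be sketched with plain mutexes (illustrative stand-ins for the `tasks` RwLocks; assumes the two cgroups are distinct, since IDs are unique):

```rust
use std::sync::Mutex;

pub struct CgroupTasks {
    pub id: u64,                // CgroupId; the lower ID is locked first
    pub tasks: Mutex<Vec<u64>>, // stand-in for the tasks RwLock
}

/// Move `task_id` from `source.tasks` to `target.tasks`, taking both
/// locks in ascending cgroup-ID order so that two opposite-direction
/// migrations between the same pair of cgroups can never deadlock.
/// Assumes source.id != target.id (distinct cgroups).
pub fn move_task(source: &CgroupTasks, target: &CgroupTasks, task_id: u64) {
    let (first, second) = if source.id < target.id {
        (source, target)
    } else {
        (target, source)
    };
    let mut g1 = first.tasks.lock().unwrap();
    let mut g2 = second.tasks.lock().unwrap();
    // Resolve which guard is source and which is target after the
    // ordered acquisition.
    let (src, dst) = if source.id < target.id {
        (&mut *g1, &mut *g2)
    } else {
        (&mut *g2, &mut *g1)
    };
    src.retain(|&t| t != task_id);
    dst.push(task_id);
}
```

Both directions funnel through the same global order (lower ID first), so the classic ABBA interleaving cannot arise.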
11. Update population counts along the path from source/target to their LCA:
- Decrement `source.population` and each ancestor up to LCA.
- Increment `target.population` and each ancestor up to LCA.
12. Propagate generation counters:
- Increment `source.generation` (Relaxed — any observer that sees the
task's new cgroup will also observe the updated generation).
- Increment `target.generation`.
Tasks that cached effective limits from either cgroup will detect the
mismatch on the next resource charge and re-walk to recompute limits.
13. Release the migration lock.
14. Release `root.hierarchy_lock` (read mode). This completes the hierarchy
phase. The RQ_LOCK acquisition below must NOT be nested under
HIERARCHY_LOCK (acquiring RQ_LOCK at level 50 while holding
HIERARCHY_LOCK at level 210 would be a descending acquisition,
violating the ascending lock ordering invariant).
15. Acquire RQ_LOCK on the task's current CPU's runqueue via the lockfree
`lock_task_rq()` protocol ([Section 7.1](07-scheduling.md#scheduler--task-to-runqueue-lookup-protocol-lockfree-cpuid-retry)).
**If task == rq.curr** (currently running):
The migration MUST use the full `put_prev_task` / `set_next_task`
protocol via `SchedClassOps` trait methods to ensure PELT, WaiterCount,
and CBS accumulators are updated correctly. Direct tree manipulation
would bypass accumulator flushing and corrupt scheduling state.
a. `put_prev_task(rq, curr)` — remove from EEVDF tree, flush vruntime
delta, update PELT load averages, flush CBS charge accumulators.
b. `dequeue_task(rq, task)` — adjust source cgroup GroupEntity weight,
update source WaiterCount.
c. Update scheduler's view of task cgroup: the scheduler uses the
task's cgroup reference from `task.cgroup` (already swapped in step 5).
No redundant `task.cgroup.store()` is needed here -- step 5's
`ArcSwap::swap()` already made the target cgroup visible. The
dequeue/enqueue pair in steps (b)/(d) is what actually reassigns
the task's GroupEntity tree position.
d. `enqueue_task(rq, task)` — add to target cgroup GroupEntity, update
target WaiterCount.
e. `set_next_task(rq, task)` — re-set as curr with new cgroup context,
reset accounting accumulators for the new cgroup.
**If task != rq.curr** (runnable but not currently executing):
a. `dequeue_task(rq, task)` — standard dequeue via SchedClassOps.
b. The scheduler reads `task.cgroup` (already swapped in step 5) to
determine the target cgroup's GroupEntity. No redundant store.
c. `enqueue_task(rq, task)` — standard enqueue via SchedClassOps.
**CBS server migration** (same-CPU, cross-cgroup) is handled within
the dequeue/enqueue SchedClassOps paths:
- If source cgroup has CBS: dequeue_task removes task from source
`CbsCpuServer.tree`, updates `source_server.local_weight -= task.weight`.
If this was the last task, the server becomes a steal donor.
- If target cgroup has CBS: enqueue_task finds or creates `CbsCpuServer`
for target on this CPU. Sets `target_server.local_weight += task.weight`.
Enqueues task in `target_server.tree`.
- Cross CBS/non-CBS: migration between CBS server tree and main EEVDF
tree is handled by the dequeue (from source) + enqueue (to target)
pair. See [Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees--cpumax-ceiling-enforcement-bandwidth-throttling).
Release RQ_LOCK.
If the task is currently running on a tickless core
([Section 7.1](07-scheduling.md#scheduler--architecture)), send a reschedule IPI to that CPU.
This forces the scheduler to re-evaluate the task with updated
GroupEntity parameters from the target cgroup. Without this IPI,
the task continues running with stale cpu.weight indefinitely
(tickless cores have no periodic tick to trigger rescheduling).
16. Set `task.cgroup_migration_state = CgroupMigrationState::None` (Release).
This completes the migration and re-enables bandwidth enforcement.
The `Complete -> None` transition is performed here, not deferred,
because bandwidth enforcement should resume immediately after the
runqueue reassignment in step 15.
The LCA (Lowest Common Ancestor) walk in steps 7–8 and 11 is bounded by the
maximum cgroup nesting depth (`CGROUP_MAX_DEPTH` in UmkaOS). In the
common case (migration within the same subtree, depth ≤ 4), the walk touches
≤ 8 nodes.
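The LCA itself can be found by equalizing depths and then walking both paths up in lockstep. A toy sketch over an index-based tree (the `depth` and `parent` fields are assumptions about bookkeeping, not the actual `Cgroup` layout):

```rust
/// Toy cgroup node: parent link plus cached depth (root has parent None).
pub struct Node {
    pub parent: Option<usize>,
    pub depth: usize,
}

/// Lowest common ancestor of nodes `a` and `b` in `tree`.
/// O(depth), bounded by CGROUP_MAX_DEPTH in the real hierarchy.
pub fn lca(tree: &[Node], mut a: usize, mut b: usize) -> usize {
    // Walk the deeper node up until both sit at the same depth.
    while tree[a].depth > tree[b].depth {
        a = tree[a].parent.expect("non-root node has a parent");
    }
    while tree[b].depth > tree[a].depth {
        b = tree[b].parent.expect("non-root node has a parent");
    }
    // Walk both up in lockstep until they meet.
    while a != b {
        a = tree[a].parent.expect("paths meet at the root at latest");
        b = tree[b].parent.expect("paths meet at the root at latest");
    }
    a
}
```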
Population propagation (step 11) uses a spinlock-free path: `population` is
an `AtomicU64` updated with `fetch_add`/`fetch_sub`. The LCA walk does not
need to hold `hierarchy_lock` because cgroup destruction requires the
population to be zero (enforced before rmdir proceeds).
Cgroup Filesystem and Hierarchy¶
Cgroups are exposed via a pseudo-filesystem mounted at /sys/fs/cgroup:
/sys/fs/cgroup/
├── cgroup.controllers # Available controllers (cpu cpuset io memory pids rdma hugetlb misc perf_event)
├── cgroup.subtree_control # Controllers enabled for children
├── cgroup.type # "domain" (default) or "threaded" (thread-mode subtree)
├── cgroup.procs # TGIDs in this cgroup (writes move all threads)
├── cgroup.threads # TIDs in this cgroup (threaded mode only; writes move single thread)
├── system.slice/ # Systemd system services
├── user.slice/ # User sessions
└── docker/ # Container cgroups
└── <container-id>/
├── cpu.max
├── cpu.weight
├── cpu.guarantee # UmkaOS extension ([Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees))
├── memory.max
├── memory.current
├── io.max
├── pids.max
└── cpuset.cpus
cgroup_mkdir() — cgroup creation algorithm:
cgroup_mkdir(parent: &Cgroup, name: &str) -> Result<Arc<Cgroup>, CgroupError>:
1. Validate name (≤255 bytes, no '/', no '..' traversal).
2. Acquire parent.children_lock (SpinLock) + root.hierarchy_lock (RwLock write).
3. Check: parent is not Draining/Dead. Return ENODEV if so.
4. Allocate Cgroup struct:
- id = root.next_id.fetch_add(1, Relaxed).
- parent = Some(Weak::clone(&parent)).
- name = ArrayString::from(name).
- lifecycle = Active.
5. Initialize subsystem controllers with DEFAULT values (NOT copies
of parent state) for each controller enabled in parent.subtree_control:
- cpu: cpu.weight = CGROUP_WEIGHT_DFL (100), cpu.max = "max 100000"
- memory: memory.max = u64::MAX (unlimited), memory.high = u64::MAX
- io: io.weight = 100
- pids: pids.max = u64::MAX (unlimited)
- cpuset: cpuset.cpus = empty, cpuset.mems = empty
- perf_event: no per-cgroup limit by default
6. Insert into hierarchy: clone parent.children Vec, push new Arc<Cgroup>,
publish via RcuCell swap (old Vec freed after RCU grace period).
7. Insert into root.id_map XArray: id_map.store(id, Arc::clone(&cg)).
8. Create cgroupfs directory: mkdir in the parent's cgroupfs inode,
populate control files (cgroup.procs, cgroup.subtree_control,
per-controller knobs).
9. Release locks.
10. Return Ok(cg).
Hierarchy delegation: A cgroup can delegate control to a subtree by enabling controllers in cgroup.subtree_control. Only controllers enabled in the parent's subtree_control are available in child cgroups. This matches Linux semantics for unprivileged container runtimes.
Cgroup namespace integration: CLONE_NEWCGROUP creates a new cgroup namespace where the process's current cgroup becomes the root of its view. Processes see /sys/fs/cgroup/ starting from their namespace's cgroup root, enabling rootless container runtimes to manage their own cgroup hierarchy.
/proc/[pid]/cgroup path relativity: When a process reads /proc/[pid]/cgroup,
the kernel computes the cgroup path relative to the reader's cgroup namespace root,
not the global cgroup root:
```rust
/// Compute the cgroup path as seen by the reading process.
///
/// The path is relative to the reader's cgroup namespace root.
/// If the target task's cgroup is outside the reader's cgroupns
/// subtree, the path is shown as "/../<path>" (indicating the task
/// is in an ancestor cgroup — visible but not navigable).
///
/// This function backs /proc/[pid]/cgroup and /proc/[pid]/cgroup.controllers.
fn cgroup_path_from_reader(
    target_cgroup: &Cgroup,
    reader_cgroupns_root: &Cgroup,
) -> ArrayString<256> {
    // Walk from target_cgroup toward root, collecting path components.
    // Stop when we reach reader_cgroupns_root.
    // If target is a descendant of reader's root: path is relative (e.g., "/app/worker").
    // If target IS reader's root: path is "/".
    // If target is NOT a descendant: path starts with "/.." (ancestor).
    kernfs_path_from_node(target_cgroup.kn, reader_cgroupns_root.kn)
}
```
Effect on container processes:
- A container process whose cgroup namespace root is /sys/fs/cgroup/system.slice/docker-abc123.scope sees /proc/self/cgroup as 0::/ (it thinks it is at the cgroup root).
- A process in a child cgroup within the container (e.g., .../docker-abc123.scope/app) sees 0::/app.
- A host process reading the same container process's cgroup sees the full path from the global root: 0::/system.slice/docker-abc123.scope.
This is essential for container compatibility: systemd inside a container expects to see 0::/ and then create child cgroups relative to that root. Without cgroupns path relativity, systemd would see the host's full cgroup hierarchy and fail to manage cgroups correctly.
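The three path cases can be sketched over plain strings (a simplification: the real `kernfs_path_from_node` walks kernfs nodes, not string paths):

```rust
/// Compute a cgroup path as seen from a cgroupns root, both given as
/// absolute paths from the global root
/// (e.g. "/system.slice/docker-abc123.scope").
/// Mirrors the three cases of cgroup_path_from_reader().
pub fn relative_cgroup_path(target: &str, ns_root: &str) -> String {
    if target == ns_root {
        return "/".to_string(); // target IS the ns root
    }
    let prefix = if ns_root == "/" { String::new() } else { ns_root.to_string() };
    if let Some(rest) = target.strip_prefix(prefix.as_str()) {
        if ns_root == "/" || rest.starts_with('/') {
            return rest.to_string(); // descendant: relative path
        }
    }
    // Outside the reader's subtree: visible but not navigable.
    format!("/..{}", target)
}
```

Run against the examples above: a container-root reader sees `/`, a child cgroup shows as `/app`, and the host (whose ns root is the global root) sees the full path.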
17.2.2 CPU Controller Integration¶
The cgroup cpu controller maps to the UmkaOS scheduler:
- `cpu.weight`: Hierarchical group proportional share (range 1-10000, default 100). A cgroup's effective share of CPU time is `(my_weight / sum_sibling_weights) × parent_effective_share`. This is a group-level multiplier, not a direct EEVDF per-entity weight — the scheduler distributes the group's CPU share among its member tasks using their individual nice-derived weights. When a cgroup with weight 200 competes with a sibling at weight 100, it receives approximately 2× as much CPU time at the group level, but individual task scheduling within the group follows standard EEVDF lag/vruntime mechanics.
- `cpu.max`: Sets the bandwidth ceiling using CFS-style throttling. Format: `"<quota> <period>"` (both in microseconds). Example: `"400000 1000000"` limits the cgroup to 40% CPU (400ms per 1000ms period). This is a maximum limit, not a guarantee. When throttled, the cgroup's tasks are removed from the run queue until the next period begins. This matches standard Linux cgroup v2 semantics.
- `cpu.guarantee`: (UmkaOS extension, see Section 7.6) Sets the bandwidth floor using Constant Bandwidth Server (CBS). Format: `"<budget> <period>"`. Guarantees minimum CPU time regardless of other load. This is distinct from `cpu.max`: a cgroup can have both a guarantee (floor) and a limit (ceiling).
Relationship between cpu.max and cpu.guarantee:
| Setting | Effect | Use Case |
|---------|--------|----------|
| cpu.max only | Limits maximum, no minimum | Prevent runaway containers |
| cpu.guarantee only | Guarantees minimum, no maximum | RT workloads that need bounded latency |
| Both | Guarantees minimum AND limits maximum | Mixed workloads with SLA |
When a cgroup is throttled (by either mechanism), the scheduler removes its tasks from the EEVDF tree until the next period or until budget is replenished.
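The hierarchical share formula can be checked numerically with a toy calculation (not kernel code):

```rust
/// effective_share = (my_weight / sum_sibling_weights) * parent_effective_share
/// `sibling_weight_sum` is the sum over all siblings at this level,
/// including the cgroup itself.
pub fn effective_share(my_weight: u32, sibling_weight_sum: u32, parent_share: f64) -> f64 {
    my_weight as f64 / sibling_weight_sum as f64 * parent_share
}
```

Two top-level siblings at weights 200 and 100 split the machine 2/3 vs 1/3; nesting multiplies the fractions down the tree.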
17.2.2.1 cpu.weight Write Handler¶
When userspace writes a new value to cpu.weight, the cgroup subsystem validates the
value, updates the CpuController.weight atomic, and then calls the scheduler to
propagate the new weight to all per-CPU GroupEntity instances
(Section 7.2).
```rust
/// cgroupfs write handler for `cpu.weight`.
/// Called from the cgroupfs VFS write path with the cgroup's mutex held.
pub fn cpu_weight_write(cgroup: &CgroupNode, buf: &[u8]) -> Result<usize, KernelError> {
    let value = parse_u32(buf)?;
    if !(1..=10000).contains(&value) {
        return Err(KernelError::InvalidArgument);
    }
    let cpu_ctrl = cgroup.cpu.as_ref()
        .ok_or(KernelError::ControllerNotEnabled)?;
    // Update the authoritative weight in the CpuController.
    // Relaxed ordering is sufficient: the scheduler reads this field under
    // per-CPU runqueue locks, and sched_group_set_weight() acquires those
    // locks with Acquire semantics after this store.
    cpu_ctrl.weight.store(value, Ordering::Relaxed);
    // Propagate to all per-CPU GroupEntity instances. This iterates all
    // online CPUs, acquires each runqueue lock, and updates the GroupEntity's
    // weight + vdeadline. On tickless cores with a running task from this
    // cgroup, a reschedule IPI is sent.
    sched_group_set_weight(cgroup.id, value);
    Ok(buf.len())
}
```
This two-phase design (atomic store + per-CPU propagation) ensures that:
1. A concurrent fork()/wake_up() that creates a new GroupEntity on a CPU
not yet visited by the propagation loop reads the new weight from
CpuController.weight (the authoritative source for new entity creation).
2. Existing GroupEntity instances are updated in-place with proper vdeadline
recalculation, preserving accumulated lag for fairness continuity.
3. Tickless cores are explicitly poked so the running task is rescheduled under
the new weight — without the IPI, a tickless core would run indefinitely at
the stale weight.
17.2.3 Memory Controller Integration¶
The memory controller tracks physical page allocations per cgroup:
- `memory.current`: The sum of all pages charged to this cgroup (in bytes).
- `memory.max`: Hard limit (bytes). When exceeded, the per-cgroup OOM killer is invoked.
- `memory.high`: Soft limit (bytes). When exceeded, the cgroup is throttled and its pages are prioritized for reclaim, but no OOM occurs.
- `memory.low`: Memory protection (bytes). Pages below this threshold are protected from reclaim unless the system is under severe pressure.
- `memory.swap.max`: Limits swap usage for this cgroup (Section 22.7).
Per-cgroup OOM killer: When memory.current exceeds memory.max, the OOM killer selects a victim within the cgroup subtree only — processes outside this cgroup are not affected. This is independent of global OOM (Section 4.5): per-cgroup OOM can trigger even when global memory is not exhausted. Per-cgroup OOM victim selection runs under the global OOM_LOCK (same lock as global OOM — no per-cgroup OOM lock).

Victim selection uses the canonical oom_badness() formula defined in Section 4.5: points = rss_pages + swap_pages + pgtable_pages; score = points + (oom_score_adj * totalpages) / 1000. Note: dirty_kb is NOT included (dirty pages are already counted in RSS — double-counting would inflate scores); child_rss/8 is NOT included (removed from Linux oom_badness() in the 2010 OOM rewrite, commit a63d83f427fb). The only difference from global OOM is scope: per-cgroup OOM considers only tasks within the cgroup subtree. Tasks with oom_score_adj == -1000 are OOM-immune (matching Linux semantics).

When memory.oom.group is set to true, the OOM killer kills all tasks in the cgroup atomically rather than selecting a single victim (Linux memory.oom.group semantics). Without this flag (default), only the highest-scoring victim is killed.
try_charge(cgroup, nr_pages) flow: (1) Atomic add to cgroup.memory.usage.
(2) If usage > memory.max: (a) try direct reclaim within cgroup. (b) If reclaim
insufficient: acquire global OOM_LOCK (Section 3.4), then
invoke OOM killer scoped to this cgroup (OomConstraint::Cgroup(cgroup_id)).
(c) If OOM frees enough: release OOM_LOCK, retry charge. (d) If still over:
release OOM_LOCK, return -ENOMEM (task gets SIGKILL or allocation fails).
All OOM paths — global and per-cgroup — use the single global OOM_LOCK to prevent
concurrent over-kill. See Section 4.5 for the canonical kill sequence.
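The flow can be sketched with the reclaim and OOM steps stubbed out as closures that report bytes freed (illustrative only; the real paths are described above and in Section 4.5):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

pub struct MemCounter {
    pub usage: AtomicU64, // bytes currently charged
    pub max: AtomicU64,   // memory.max
}

#[derive(Debug, PartialEq)]
pub enum ChargeError {
    Enomem,
}

/// Charge `bytes` to `mem`: optimistic fetch_add, then scoped reclaim,
/// then per-cgroup OOM kill, then roll back and fail. The `reclaim` and
/// `oom_kill` hooks stand in for the real paths; they are expected to
/// uncharge `mem.usage` as they free memory.
pub fn try_charge(
    mem: &MemCounter,
    bytes: u64,
    reclaim: impl Fn(&MemCounter, u64) -> u64,
    oom_kill: impl Fn(&MemCounter) -> u64,
) -> Result<(), ChargeError> {
    let max = mem.max.load(Ordering::Relaxed);
    // (1) Optimistic atomic add.
    if mem.usage.fetch_add(bytes, Ordering::Relaxed) + bytes <= max {
        return Ok(());
    }
    // (2a) Over limit: direct reclaim scoped to this cgroup.
    let over = mem.usage.load(Ordering::Relaxed).saturating_sub(max);
    reclaim(mem, over);
    if mem.usage.load(Ordering::Relaxed) <= max {
        return Ok(());
    }
    // (2b/2c) Per-cgroup OOM kill (under the global OOM_LOCK in the
    // real kernel), then re-check the charge.
    oom_kill(mem);
    if mem.usage.load(Ordering::Relaxed) <= max {
        return Ok(());
    }
    // (2d) Still over: undo the charge and fail with ENOMEM.
    mem.usage.fetch_sub(bytes, Ordering::Relaxed);
    Err(ChargeError::Enomem)
}
```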
memory.min / memory.low enforcement: memory.min — reclaim scanner
unconditionally skips pages belonging to cgroups where usage <= memory.min (hard
protection). memory.low — reclaim scanner deprioritizes pages from cgroups where
usage <= memory.low (soft protection — reclaimed only when no other reclaimable
memory is available).
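The scanner's per-cgroup decision reduces to a three-way comparison, sketched here with illustrative names:

```rust
#[derive(Debug, PartialEq)]
pub enum ReclaimDecision {
    Skip,         // usage <= memory.min: hard protection, never scan
    Deprioritize, // usage <= memory.low: scan only as a last resort
    Scan,         // unprotected
}

/// Reclaim-scanner protection check for one cgroup (sketch of the
/// memory.min / memory.low rules described above).
pub fn protection(usage: u64, min: u64, low: u64) -> ReclaimDecision {
    if usage <= min {
        ReclaimDecision::Skip
    } else if usage <= low {
        ReclaimDecision::Deprioritize
    } else {
        ReclaimDecision::Scan
    }
}
```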
Cgroup-scoped vs global reclaim: Cgroup-scoped reclaim scans only the cgroup's
LRU lists (not global). kswapd is NOT invoked for cgroup-scoped reclaim — only
direct reclaim in the charging task's context. Global kswapd runs only for global
memory pressure (zone watermarks).
Memory accounting has low overhead (~1 atomic increment per page charge); the charge operation piggybacks on existing page table allocation routines in umka-core.
17.2.4 I/O Controller Integration¶
The io controller limits block I/O bandwidth and IOPS per cgroup:
- `io.max`: Per-device limits. Format: `"<major>:<minor> rbps=<bytes> wbps=<bytes> riops=<ops> wiops=<ops>"`. Example: `"8:0 rbps=10485760 wbps=5242880"` limits reads to 10 MB/s and writes to 5 MB/s on device 8:0.
- `io.weight`: Proportional weight for best-effort I/O scheduling (1-10000, default 100).
Device identity for stacking filesystems: overlayfs, device-mapper, and other stacking layers do not have their own block device identity for io.max purposes. The <major>:<minor> always refers to the underlying physical block device (e.g., /dev/sda = 8:0). The block I/O layer resolves I/O from stacking filesystems to the backing device before applying cgroup throttling, matching Linux behavior.
The block I/O subsystem (Section 15.2) integrates with cgroup accounting: each bio (block I/O request) is tagged with its originating cgroup, and the I/O scheduler enforces per-cgroup limits.
17.2.5 PIDs Controller (Fork Bomb Prevention)¶
- `pids.max`: Maximum number of tasks (threads + processes) in the cgroup subtree. Prevents fork bombs from exhausting system-wide PID space. `fork()`/`clone()` returns `EAGAIN` when the limit is reached.
- `pids.current`: Current number of tasks in the cgroup.
This is critical for container isolation: a misbehaving container cannot exhaust the host's PID space.
17.2.5.1 Cgroup Fork Hooks¶
The cgroup subsystem participates in fork()/clone() via two hook functions called
by the process creation path (Section 8.1):
cgroup_can_fork(task, clone_flags) — Called before the child task struct is
fully initialized. Performs pre-allocation checks and counter reservations:
```rust
/// Pre-fork cgroup check. Walks the parent's cgroup hierarchy and invokes
/// each controller's can_fork hook. Returns Ok(()) if all controllers
/// approve, or Err(EAGAIN) if any controller rejects.
///
/// On failure, already-approved controllers are rolled back via
/// cgroup_cancel_fork().
pub fn cgroup_can_fork(parent: &Task) -> Result<(), KernelError> {
    // For each CSS in parent's cgroup set, call controller.can_fork().
    // Currently only the pids controller has a can_fork hook.
    pids_can_fork(parent)?;
    Ok(())
}

/// Pids controller fork check. Atomically increments pids.current for
/// the task's cgroup and every ancestor up to the root.
///
/// If any ancestor's pids.current exceeds pids.max after increment,
/// the increment is rolled back for all already-incremented cgroups,
/// pids.events_max is bumped on the failing cgroup, and EAGAIN is returned.
fn pids_can_fork(parent: &Task) -> Result<(), KernelError> {
    let mut incremented: ArrayVec<&Cgroup, 256> = ArrayVec::new();
    // Walk from the task's cgroup to the root, incrementing each.
    // NOTE: cg.parent is Weak<Cgroup>. upgrade() is elided in this pseudocode.
    // The invariant that parents are alive is guaranteed by population counting:
    // a cgroup with population > 0 (true during fork) keeps all ancestors alive.
    let mut cg = parent.cgroup();
    loop {
        if let Some(ref pids) = cg.pids {
            let prev = pids.current.fetch_add(1, Ordering::Relaxed);
            let max = pids.max.load(Ordering::Relaxed);
            if max != u64::MAX && prev + 1 > max {
                // Over limit — roll back this increment and all prior ones.
                pids.current.fetch_sub(1, Ordering::Relaxed);
                pids.events_max.fetch_add(1, Ordering::Relaxed);
                // Roll back the cgroups already incremented on the walk
                // up. `incremented` holds the cgroups BELOW the failing
                // one on the path (descendants, not ancestors).
                for lower in &incremented {
                    if let Some(ref lpids) = lower.pids {
                        lpids.current.fetch_sub(1, Ordering::Relaxed);
                    }
                }
                return Err(KernelError::Again); // EAGAIN
            }
            incremented.push(cg);
        }
        match cg.parent {
            Some(ref p) => cg = p,
            None => break,
        }
    }
    Ok(())
}

/// Rollback cgroup_can_fork() reservations when fork fails after the
/// cgroup checks succeeded but a later step (RLIMIT_NPROC, PID alloc,
/// memory) failed.
pub fn cgroup_cancel_fork(parent: &Task) {
    // NOTE: cg.parent is Weak<Cgroup>. upgrade() is elided in this pseudocode.
    // Same invariant as pids_can_fork: population > 0 guarantees ancestor liveness.
    let mut cg = parent.cgroup();
    loop {
        if let Some(ref pids) = cg.pids {
            pids.current.fetch_sub(1, Ordering::Relaxed);
        }
        match cg.parent {
            Some(ref p) => cg = p,
            None => break,
        }
    }
}
```
cgroup_post_fork(child, parent, target_cgroup) — Called after the child task
struct is initialized but before wake_up_new_task(). Attaches the child to the
resolved cgroup (either the parent's cgroup, or the target specified by
CLONE_INTO_CGROUP).
Failure and rollback semantics:
The fork path has a three-phase cgroup protocol:
1. `cgroup_can_fork()` — Atomically increments `pids.current` for the parent's cgroup and all ancestors. If any ancestor exceeds `pids.max`, the increment is rolled back and the fork fails with `EAGAIN` before the child task is allocated.
2. `cgroup_post_fork()` — Attaches the child to the cgroup's task set. This step is infallible by design: the pids reservation was secured in step 1, and the task set insertion (RwLock-protected `XArray::insert`) cannot fail (XArray uses slab-allocated radix tree nodes, which is acceptable per collection policy). No re-check of `pids.max` is performed here because the counter was already incremented atomically in step 1.
3. If any fork step between `cgroup_can_fork()` and `cgroup_post_fork()` fails (e.g., PID allocation exhaustion, `RLIMIT_NPROC`, memory allocation for the child task struct), the fork path calls `cgroup_cancel_fork()` which decrements `pids.current` for the parent's cgroup and all ancestors, reversing the reservation made in step 1. The child task is never added to the cgroup's task set, and no cgroup state is leaked.
TOCTOU note: An administrator may lower pids.max between cgroup_can_fork()
and cgroup_post_fork(), causing pids.current > pids.max temporarily. This is
benign and matches Linux behavior: the lowered limit prevents new forks but does
not kill existing tasks. The overshoot is bounded (at most one task per concurrent
fork) and resolves when tasks exit.
```rust
/// Post-fork cgroup attachment. Adds the child task to the resolved
/// cgroup set across all controllers. Must complete before the child
/// is made runnable.
///
/// This function is infallible: the pids.current counter was already
/// incremented by cgroup_can_fork(), so no counter update or limit
/// check is needed here. If the child must not be added (fork failure
/// after cgroup_can_fork()), the caller invokes cgroup_cancel_fork()
/// instead — cgroup_post_fork() is never called for failed forks.
///
/// The `target_cgroup` parameter is set by do_fork step 15 for
/// CLONE_INTO_CGROUP; otherwise it is None and the child inherits
/// the parent's cgroup. This function does NOT overwrite child.cgroup
/// if it was already set — the target_cgroup parameter communicates
/// the resolved cgroup from the fork path.
///
/// The cgroup's task set is updated atomically (under the cgroup's
/// tasks RwLock).
pub fn cgroup_post_fork(
    child: &Task,
    parent: &Task,
    target_cgroup: Option<&Arc<Cgroup>>,
) {
    // Determine which cgroup to attach the child to.
    let cg: Arc<Cgroup> = match target_cgroup {
        Some(tgt) => Arc::clone(tgt),
        None => parent.cgroup.load(), // ArcSwap load
    };
    // Store the resolved cgroup on the child task.
    child.cgroup.store(Arc::clone(&cg)); // ArcSwap store
    // Add child to the cgroup's task set. This set is used by:
    // - cgroup.procs reads (enumerate tasks)
    // - Freezer (iterate tasks to set TASK_FROZEN)
    // - OOM killer (select victim within cgroup)
    // - Migration (remove from old cgroup's set, add to new)
    cg.tasks.write().insert(child.tid);
}
```
Limit change propagation: When a cgroup's cpu.max or cpu.guarantee is
changed, the new limit takes effect lazily via generation counters (incremented
on write). A child task forked before the limit change may temporarily exceed the
new limit for up to one scheduler slice (~4ms at HZ=250). This one-slice transient
overshoot is acceptable: the child is re-evaluated at its first scheduler tick.
The generation counter is checked on the warm path (wakeup / replenishment), not
the hot path (context switch), to avoid per-switch overhead.
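The generation-counter pattern above can be sketched in a few lines. This is a userspace model with hypothetical names (`CpuLimit`, `TaskBandwidth`), not the kernel's actual fields: the writer bumps a generation on each `cpu.max` write, and a task re-reads the limit on the warm path only when its cached generation is stale.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical cgroup CPU limit cell: the limit value plus a generation
/// counter incremented on every write to cpu.max.
struct CpuLimit {
    quota_us: AtomicU64,
    generation: AtomicU64,
}

impl CpuLimit {
    /// Writer side (cpu.max write): publish the new quota, then bump the
    /// generation so tasks notice the change lazily.
    fn set_quota(&self, quota_us: u64) {
        self.quota_us.store(quota_us, Ordering::Release);
        self.generation.fetch_add(1, Ordering::Release);
    }
}

/// Per-task cached view, refreshed on the warm path (wakeup/replenishment),
/// never on the context-switch hot path.
struct TaskBandwidth {
    cached_quota_us: u64,
    seen_generation: u64,
}

impl TaskBandwidth {
    /// Called at wakeup: refresh only if the cgroup limit changed.
    /// Returns true if the cached quota was updated.
    fn refresh_if_stale(&mut self, limit: &CpuLimit) -> bool {
        let gen = limit.generation.load(Ordering::Acquire);
        if gen == self.seen_generation {
            return false; // fast path: generation unchanged, nothing to do
        }
        self.cached_quota_us = limit.quota_us.load(Ordering::Acquire);
        self.seen_generation = gen;
        true
    }
}
```

A task forked before `set_quota` runs with the old cached quota until its next `refresh_if_stale` call, which is exactly the one-slice transient overshoot described above.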
Exit path: When a task exits, cgroup_exit(task) performs the inverse of
the cgroup_can_fork() + cgroup_post_fork() pair, ensuring zero counter
drift over the kernel's lifetime:
cgroup_exit(task):
1. let cg = task.cgroup.load(); // ArcSwap load of current cgroup
2. For each subsystem controller attached to the cgroup hierarchy:
a. controller.exit(task) // controller-specific cleanup:
- pids: pids.current.fetch_sub(1, Release) for task's cgroup
and every ancestor up to the root.
- memory: uncharge residual kernel memory (see zero-residual
drain in [Section 17.2](#control-groups--memory-controller)).
- cpu: remove task from CFS bandwidth tracking.
- io: remove task from blkio weight accounting.
3. Remove task from the cgroup's task list:
cg.tasks.write().remove(task.tid); // same tid inserted by cgroup_post_fork()
4. Update population count:
cg.population.fetch_sub(1, Release);
// Walk ancestors to root, decrementing population.
5. Set task.cgroup.store(root_cgroup);
// Point to root cgroup, preventing use-after-free if any subsystem
// accesses the task's cgroup after exit but before Task struct free.
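Step 2a (the pids uncharge walk) is the exact inverse of the charge taken by cgroup_can_fork(). A minimal sketch, using mock types (`PidsCgroup`, `pids_uncharge` are illustrative, not the kernel structs):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Mock cgroup node: the pids.current counter plus a parent link.
struct PidsCgroup {
    current: AtomicU64,
    parent: Option<Arc<PidsCgroup>>,
}

/// Uncharge one exiting task from `cg` and every ancestor up to the root,
/// mirroring cgroup_exit() step 2a. Because cgroup_can_fork() charged the
/// same chain, this walk restores every counter exactly (zero drift).
fn pids_uncharge(cg: &Arc<PidsCgroup>) {
    let mut node = Some(Arc::clone(cg));
    while let Some(c) = node {
        c.current.fetch_sub(1, Ordering::Release);
        node = c.parent.clone();
    }
}
```

Pairing every charge walk with an identical uncharge walk is what makes the "zero counter drift over the kernel's lifetime" invariant checkable: each counter's net delta per fork/exit pair is zero.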
Zero-residual drain (memory controller): When the last task exits a cgroup,
residual kernel memory charges (slab objects, page tables, socket buffers) may
still reference the cgroup's mem_cgroup. These are drained asynchronously:
the cgroup transitions to a "dying" state where new charges are rejected but
existing charges are allowed to drain naturally as their owning objects are
freed. The mem_cgroup is freed only when charge_count reaches zero. A
watchdog timer logs a warning if a dying cgroup's charges do not drain within
60 seconds (potential leak).
17.2.6 Cpuset Controller (CPU and NUMA Pinning)¶
- cpuset.cpus: CPUs allowed for tasks in this cgroup. Format: "0-3,8-11" (CPU list).
- cpuset.mems: NUMA nodes allowed for memory allocation. Format: "0,2" (node list).
- cpuset.cpus.partition: Partition mode (root, member, isolated). Isolated partitions have exclusive CPU access.
The scheduler respects cpuset constraints when selecting a CPU for a task. NUMA-aware allocation (Section 4.1) respects the cpuset.mems mask.
17.2.7 Freezer (Cgroup Pause/Resume)¶
- cgroup.freeze: Write 1 to freeze all tasks in the cgroup subtree; write 0 to thaw.
- cgroup.events: Contains frozen 0/1 indicating the current frozen state.
Frozen tasks are removed from the run queue and cannot be scheduled. Used by docker pause and checkpoint/restore.
Freeze/thaw algorithm:
1. Freeze initiation: Write 1 to cgroup.freeze.
   - Kernel walks the cgroup's task list (under cgroup_mutex).
   - For each task: set the TASK_FROZEN (0x0000_8000) bit in task.state via set_special_state(). If the task is currently on a CPU, send a reschedule IPI; the task observes the frozen bit at the next preemption point or syscall return and calls schedule(), where the scheduler sees TASK_FROZEN and does not re-enqueue.
   - For tasks already in INTERRUPTIBLE/UNINTERRUPTIBLE sleep: set TASK_FROZEN directly. The task remains sleeping; the scheduler will not select it.
   - Report rcu_note_context_switch() for each frozen task's CPU (implicit quiescent state — frozen tasks cannot hold RCU read locks).
   - Set cgroup.events.frozen = 1 after all tasks have entered TASK_FROZEN.
2. Wakeup-during-freeze: If a wakeup event arrives for a TASK_FROZEN task (e.g., I/O completion, timer expiry), the task remains frozen. The wakeup is recorded as a pending wakeup flag so the task is immediately schedulable when thawed.
3. Thaw: Write 0 to cgroup.freeze.
   - Kernel walks the cgroup's task list.
   - For each task: clear the TASK_FROZEN bit. If the task has a pending wakeup, call try_to_wake_up() to enqueue it on the appropriate run queue with its preserved vruntime and lag.
   - A task in OnRqState::Deferred (EEVDF lag compensation) is removed from the EEVDF tree during freeze, transitioning to OnRqState::Off + TASK_FROZEN. On thaw, it re-enters as Deferred (not Queued) to preserve EEVDF fairness: the task's lag value is restored so it does not gain an unfair scheduling advantage from being frozen.
   - Set cgroup.events.frozen = 0.
4. Nested freeze: A cgroup can be frozen by both its own freeze field and an ancestor cgroup's freeze field. The effective frozen state (e_freeze) is propagated downward: child.e_freeze = child.freeze || parent.e_freeze. A task remains frozen until ALL ancestor cgroups are thawed AND the task's own cgroup has freeze = false.
5. Fatal signals while frozen: Processes in TASK_FROZEN can be killed by SIGKILL. The signal delivery path checks for TASK_FROZEN and transitions the task to TASK_RUNNING to allow exit processing. This ensures kill -9 always works regardless of cgroup freeze state.
RCU Interaction. Frozen tasks cannot execute code and therefore cannot report RCU quiescent states. To prevent RCU grace periods from blocking indefinitely, UmkaOS's RCU subsystem treats entry into TASK_FROZEN as an implicit quiescent state: the cgroup freezer calls rcu_report_dead() (the same hook used at task exit) on behalf of the frozen task's CPU at the moment the task is frozen. This is safe because a frozen task holds no RCU read-side critical sections — it is not executing, so it cannot be inside rcu_read_lock(). When the task is thawed it re-enters the normal quiescent-state reporting cycle with no special handling required. This design ensures that container pause/resume, whole-cgroup SIGSTOP, and checkpoint-restore operations never stall RCU grace periods regardless of freeze duration.
CBS timer interaction: When a cgroup enters the Frozen state, all CBS replenishment timers for tasks in that cgroup are cancelled (hrtimer_cancel). When the cgroup thaws, CBS timers are re-armed with fresh budgets — the throttled flag is cleared and budget is set to the full CBS period to avoid spurious throttling on resume.
17.2.7.1 Freeze and Network Interaction¶
- In-flight TX packets: When a cgroup is frozen, VETH TX packets already enqueued in the peer's RX ring are delivered normally. The receiving side processes them. If the receiver is also frozen, the delivery wakeup is recorded as pending.
- TCP mid-send: If a task is frozen during sendmsg():
  - Partially-queued data remains in the socket send buffer (not discarded).
  - TCP retransmit timers continue. Timer expiry wakeups are recorded as pending.
  - On thaw, the task resumes. The retransmit timer may have expired; TCP handles this via standard timeout/retransmit.
- UDP receives: Queued in the socket receive buffer. Wakeup recorded as pending. On thaw, data is available immediately.
- VETH cross-namespace freeze: Packets flow in both directions regardless of freeze state. The freeze is per-cgroup-per-task, not per-interface. VETH pair connectivity is unaffected; only task scheduling is frozen.
- Cleanup on cgroup destroy after freeze: If a frozen cgroup is destroyed, all frozen tasks receive SIGKILL. Socket close triggers TCP RST/FIN. No packet leaks.
17.2.7.2 Cgroup Zero-Residual Destruction¶
17.2.7.2.1 Problem¶
When rmdir removes a cgroup directory, the cgroup's population is zero (no tasks),
but residual resource charges may remain:
- Memory controller: Pages on LRU lists still charged to the cgroup's MemController::usage. These pages were allocated by tasks that have since migrated out or exited, but the page charge persists until the page is freed (reclaimed, munmap'd, or process exits).
- I/O controller: In-flight block I/O requests tagged with the cgroup's ID. These drain naturally when I/O completes (~10 ms typical).
- Hugetlb controller: Huge pages still charged. These may persist indefinitely if a process in another cgroup holds a mapping to a page originally charged here.
In Linux, these zombie cgroups persist until all charged pages are freed, which may be never (a process in cgroup B holding a page allocated while it was in cgroup A keeps cgroup A's kernel metadata alive). Over 50-year uptime, this leaks ~2-8 KB of kernel memory per zombie cgroup. At 100 cgroup creates/destroys per hour (typical Kubernetes pod churn), this accumulates ~875 MB over 50 years — significant on memory-constrained systems.
17.2.7.2.2 Design: Bounded Residual Drain¶
UmkaOS enforces a zero-residual invariant: after rmdir, all residual resource
charges are drained within a bounded deadline. The cgroup's kernel memory (Arc<Cgroup>
and all controller state) is freed within this deadline, guaranteed.
/// Drain timeout for residual memory charges after cgroup rmdir.
/// Default: 30 seconds. Configurable via
/// `/proc/sys/kernel/cgroup_drain_timeout_ms` (range: 1000-300000).
pub const CGROUP_MEM_DRAIN_TIMEOUT_MS: u64 = 30_000;
/// Lifecycle state machine for cgroup destruction.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum CgroupLifecycle {
/// Normal operation. Tasks may be present.
Active = 0,
/// `rmdir` called, population == 0, draining residual charges.
/// No new tasks can be migrated into this cgroup (`cgroup.procs`
/// write returns ENODEV; mkdir in parent does not find it).
Draining = 1,
/// All charges drained (or force-drained). Kernel memory freed.
Dead = 2,
}
17.2.7.2.3 Drain Protocol¶
The drain runs as a kworker task (cgroup_drain_residual) triggered after rmdir
confirms population == 0:
cgroup_drain_residual(cgroup):
Precondition: population == 0, lifecycle == Draining.
Phase 1 — REPARENT memory charges to the parent cgroup:
Walk all pages on the cgroup's per-node LRU lists.
For each page:
parent.memory.usage.fetch_add(PAGE_SIZE, Relaxed)
self.memory.usage.fetch_sub(PAGE_SIZE, Relaxed)
page.mem_cgroup = parent // Under page lock
Bounded: at most `self.memory.usage / PAGE_SIZE` pages.
At 1 GB residual (extreme): ~256K pages, ~12 ms wall time
(50 ns per page: one fetch_add + one fetch_sub + one pointer store).
Phase 2 — REPARENT hugetlb charges (same pattern, fewer pages).
Phase 3 — DETACH all BPF programs attached to the cgroup.
Detach BPF_CGROUP_INET_INGRESS, BPF_CGROUP_INET_EGRESS,
BPF_CGROUP_INET_SOCK_CREATE, BPF_CGROUP_DEVICE, BPF_CGROUP_SYSCTL,
BPF_CGROUP_GETSOCKOPT, BPF_CGROUP_SETSOCKOPT.
Precondition: task list is empty (population == 0).
Programs are detached BEFORE releasing memory charges (Phase 5)
to prevent BPF callbacks from accessing freed cgroup state.
Detach is O(n) in number of attached programs (typically 0-3).
Phase 4 — WAIT for in-flight I/O to complete.
Bounded by I/O timeout (typically < 30 seconds).
Check: no in-flight bios tagged with this cgroup's ID.
Phase 5 — VERIFY all controller usages are zero:
memory.usage.load(Acquire) == 0
io: no in-flight bios tagged with this cgroup ID
hugetlb: usage == 0
Phase 6 — If verification passes:
lifecycle.store(Dead, Release)
Drop Arc<Cgroup> → frees Cgroup struct and all controller state.
Phase 7 — If verification fails after CGROUP_MEM_DRAIN_TIMEOUT_MS:
Force-reparent any remaining charges to the parent.
Log FMA warning: "cgroup {id} force-drained after timeout;
{n} bytes reparented to parent {parent_id}."
lifecycle.store(Dead, Release)
Drop Arc<Cgroup>.
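Phase 1's per-page reparent can be modelled in isolation. The sketch below uses mock usage counters only (the real kernel also retargets `page.mem_cgroup` under the page lock, which is omitted here); `MemUsage` and `reparent_charges` are illustrative names:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const PAGE_SIZE: u64 = 4096;

/// Mock memory-controller usage counter for one cgroup.
struct MemUsage {
    usage: AtomicU64,
}

/// Phase 1 of cgroup_drain_residual: move every residual page charge from
/// the dying cgroup to its parent, one page at a time. Returns the number
/// of pages reparented. Bounded by usage / PAGE_SIZE iterations.
fn reparent_charges(dying: &MemUsage, parent: &MemUsage) -> u64 {
    let mut moved = 0;
    while dying.usage.load(Ordering::Acquire) >= PAGE_SIZE {
        // Charge the parent first, then uncharge the child, so the total
        // charged amount never transiently underflows.
        parent.usage.fetch_add(PAGE_SIZE, Ordering::Relaxed);
        dying.usage.fetch_sub(PAGE_SIZE, Ordering::Relaxed);
        moved += 1;
    }
    moved
}
```

After this loop the dying cgroup's memory usage is exactly zero, which is what Phase 5's verification checks before the `Dead` transition.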
17.2.7.2.4 Performance Impact¶
Steady state: Zero. The lifecycle field is never checked on the hot path
(task scheduling, memory allocation, I/O submission). It is only read during
cgroup.procs writes (to reject migration into a Draining cgroup) and during
rmdir processing — both cold paths.
During drain: The per-page reparent is ~50 ns/page (one fetch_add + one
fetch_sub + one pointer store under page lock). At 256K pages (1 GB residual),
total drain time is ~12 ms. The page lock is the same lock used by reclaim — no
new lock is introduced. Other cgroups are unaffected.
Invariant: After cgroup_drain_residual returns, the Cgroup struct has
exactly zero residual resource charges. No zombie cgroups persist beyond the drain
timeout. Over 50-year uptime, zero memory is leaked to cgroup metadata.
17.2.8 Additional Controllers¶
| Controller | Key Interface | Description |
|---|---|---|
| hugetlb | hugetlb.<size>.max | Limits huge page allocations per cgroup |
| rdma | rdma.max | Limits RDMA/InfiniBand resources |
| misc | misc.max | Limits miscellaneous resources (e.g., SGX EPC) |
| accel | accel.devices, accel.memory.max, accel.compute.max, accel.priority | Accelerator compute and memory limits per cgroup (Section 22.5) |
17.2.8.1 Device Access Control (BPF-based)¶
Linux cgroup v2 replaces the v1 devices controller with a BPF-based enforcement
mechanism. UmkaOS implements the same model: device access decisions are made by
a BPF_PROG_TYPE_CGROUP_DEVICE program attached to the cgroup.
/// BPF context passed to BPF_PROG_TYPE_CGROUP_DEVICE programs.
/// The BPF program returns 0 (deny) or 1 (allow).
#[repr(C)]
pub struct BpfCgroupDevCtx {
/// Access type: BPF_DEVCG_ACC_MKNOD=1, BPF_DEVCG_ACC_READ=2, BPF_DEVCG_ACC_WRITE=4.
pub access_type: u32,
/// Device type: BPF_DEVCG_DEV_BLOCK=1, BPF_DEVCG_DEV_CHAR=2.
pub dev_type: u32,
/// Major device number.
pub major: u32,
/// Minor device number.
pub minor: u32,
}
// BPF UAPI context struct. Layout: 4 × u32 = 16 bytes.
const_assert!(size_of::<BpfCgroupDevCtx>() == 16);
Enforcement hook point:
- chrdev_open() / blkdev_open() in VFS calls cgroup_bpf_run(BPF_CGROUP_DEVICE, &ctx) before granting access to the device.
- mknod() syscall calls the same hook before creating the device node in the filesystem.
- Default (no BPF program attached): allow all device access. This matches the principle of least surprise — cgroups without explicit device policy impose no restrictions.
Legacy v1 shim translation (see the v1→v2 translation table below): The v1 devices.allow / devices.deny interface is translated into BPF programs attached to the cgroup:
- devices.allow a *:* rwm → generates and attaches a permissive BPF program (unconditional return 1).
- devices.deny a → attaches a deny-all BPF program (unconditional return 0).
- Specific rules (e.g., devices.allow c 1:3 rwm) → generates a BPF program that matches the (dev_type, major, minor, access_type) tuple and returns 1 for matching entries, 0 otherwise. Multiple rules accumulate into a single match-list BPF program that is atomically replaced on each devices.allow / devices.deny write.
Hierarchy enforcement: BPF device programs are evaluated bottom-up from the task's cgroup to the root. Access is allowed only if every ancestor's BPF program (if any) returns 1. This ensures a parent cgroup can restrict device access for all descendants regardless of their individual policies.
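The bottom-up evaluation and the match-list shim can be sketched together. The rule representation here (a vector of `(dev_type, major, minor, access_mask)` tuples standing in for a compiled eBPF program) is an illustrative mock, not the actual BPF machinery:

```rust
/// UAPI constants mirroring the BpfCgroupDevCtx documentation above.
const BPF_DEVCG_ACC_READ: u32 = 2;
const BPF_DEVCG_DEV_CHAR: u32 = 2;

/// Context fields mirror BpfCgroupDevCtx.
struct DevCtx { access_type: u32, dev_type: u32, major: u32, minor: u32 }

/// Stand-in for an attached match-list device program:
/// (dev_type, major, minor, access_mask) allow entries.
struct DeviceProgram { allow: Vec<(u32, u32, u32, u32)> }

impl DeviceProgram {
    /// Return 1 (allow) if any rule covers the full requested access mask,
    /// else 0 (deny) — the BPF program's return convention.
    fn run(&self, ctx: &DevCtx) -> u32 {
        let hit = self.allow.iter().any(|&(dt, maj, min, acc)| {
            dt == ctx.dev_type && maj == ctx.major && min == ctx.minor
                && (acc & ctx.access_type) == ctx.access_type
        });
        hit as u32
    }
}

/// One level of the cgroup hierarchy; `None` = no program attached (allow).
struct CgroupLevel { prog: Option<DeviceProgram> }

/// Bottom-up evaluation: `chain` is ordered task's cgroup first, root last.
/// Access is allowed only if every level with a program returns 1.
fn device_allowed(chain: &[CgroupLevel], ctx: &DevCtx) -> bool {
    chain.iter().all(|cg| cg.prog.as_ref().map_or(true, |p| p.run(ctx) == 1))
}
```

Because `device_allowed` short-circuits on the first deny, a restrictive parent program vetoes any descendant's permissive policy, as the hierarchy rule requires.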
17.2.8.2 perf_event Cgroup Controller¶
The perf_event controller limits per-cgroup PMU (Performance Monitoring Unit)
resource usage. Without limits, a container running perf record can exhaust all
available hardware performance counters, denying monitoring to other containers and
the host.
The Cgroup struct includes:
/// perf_event cgroup controller. Limits per-cgroup PMU resource usage
/// to prevent container perf_event exhaustion.
pub perf_event: Option<PerfEventController>,
pub struct PerfEventController {
/// Maximum number of concurrent perf events for this cgroup subtree.
/// Default: unlimited (0 = no limit). Write `perf_event.max` to set.
pub max_events: AtomicU32,
/// Current active perf event count in this cgroup subtree.
pub nr_events: AtomicU32,
}
Enforcement: The perf_event_open() syscall, after validating the event
attributes, checks the calling task's cgroup:
1. Read cgroup.perf_event.max_events. If 0, no limit — allow unconditionally.
2. Otherwise, atomically increment nr_events via fetch_add(1, Relaxed).
3. If the post-increment value exceeds max_events, decrement back and return -ENOSPC.
4. On perf_event_release() (close of the perf event fd), decrement nr_events.
Hierarchy: Child cgroups inherit the parent's limit as an upper bound. A child
can set a lower perf_event.max but cannot exceed the parent's value. The
effective limit for any cgroup is min(own max_events, parent effective limit).
When evaluating the limit, the kernel walks from the task's cgroup to the root,
checking each ancestor's nr_events < max_events (short-circuiting on the first
failure).
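The charge/rollback step for a single level can be sketched as follows (a simplified model: the ancestor walk and cross-level rollback described above are omitted, and `try_charge` is an illustrative name):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Mock per-cgroup perf_event controller (max_events == 0 means unlimited).
struct PerfCtl {
    max_events: AtomicU32,
    nr_events: AtomicU32,
}

/// Charge one perf event against one cgroup level, as in steps 1-3 of the
/// enforcement description: optimistically increment, then undo and fail
/// with -ENOSPC if the post-increment count exceeds the limit.
fn try_charge(ctl: &PerfCtl) -> Result<(), i32> {
    const ENOSPC: i32 = 28;
    let max = ctl.max_events.load(Ordering::Relaxed);
    let prev = ctl.nr_events.fetch_add(1, Ordering::Relaxed);
    if max != 0 && prev + 1 > max {
        // Limit hit: roll back our increment before failing.
        ctl.nr_events.fetch_sub(1, Ordering::Relaxed);
        return Err(-ENOSPC);
    }
    Ok(())
}

/// Step 4: perf_event_release() uncharges.
fn uncharge(ctl: &PerfCtl) {
    ctl.nr_events.fetch_sub(1, Ordering::Relaxed);
}
```

The optimistic increment-then-check keeps the common (under-limit) case to a single atomic RMW; only the failure path pays for the rollback.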
cgroupfs interface:
- perf_event.max: read/write, integer or "max" (unlimited). Default: "max".
- perf_event.current: read-only, current number of active perf events in this subtree.
17.2.9 Cgroup v1 Compatibility Translation¶
UmkaOS exposes cgroup v2 exclusively inside the kernel. For userspace processes that set
cgroup v1 knobs (Docker Engine ≤20.10, systemd pre-247, legacy orchestrators), UmkaOS
provides a v1-to-v2 translation shim implemented as a virtual filesystem
(cgroupv1fs) that mounts the legacy hierarchy paths at /sys/fs/cgroup/cpu,
/sys/fs/cgroup/memory, etc. The full shim specification is in
Section 19.1. This
section documents the authoritative translation table and the cpu.shares → cpu.weight
formula that the shim applies.
Translation table (v1 write → v2 equivalent):
| Subsystem | v1 knob | v2 equivalent | Conversion formula |
|---|---|---|---|
| memory | memory.limit_in_bytes | memory.max | Direct (bytes); -1 → "max" |
| memory | memory.soft_limit_in_bytes | memory.high | Direct (bytes) |
| memory | memory.memsw.limit_in_bytes | memory.swap.max | swap_max = memsw - mem |
| memory | memory.kmem.limit_in_bytes | (no v2 equivalent) | Silently ignored (kmem tracking removed in v2) |
| memory | memory.oom_control (disable OOM) | (no direct v2 equivalent) | Incompatible: v1 oom_kill_disable=1 means "do not kill tasks in this cgroup"; v2 has no equivalent knob. memory.oom.group is a different feature (kills all tasks atomically on OOM rather than selecting one victim). The shim silently drops oom_kill_disable writes and logs a compat warning. |
| cpu | cpu.shares | cpu.weight | weight = clamp(1 + (shares − 2) × 9999 / 262142, 1, 10000) |
| cpu | cpu.cfs_quota_us + cpu.cfs_period_us | cpu.max | "$quota $period" (µs); quota=-1 → "max $period" |
| cpuacct | cpuacct.usage | cpu.stat (usage_usec) | Read-only; ns→µs unit conversion |
| blkio | blkio.throttle.read_bps_device | io.max (rbps=N) | MAJ:MIN rbps=N |
| blkio | blkio.throttle.write_bps_device | io.max (wbps=N) | MAJ:MIN wbps=N |
| blkio | blkio.throttle.read_iops_device | io.max (riops=N) | MAJ:MIN riops=N |
| blkio | blkio.throttle.write_iops_device | io.max (wiops=N) | MAJ:MIN wiops=N |
| blkio | blkio.weight | io.weight | v1 range 10–1000 → v2 range 100–10000 via weight × 10 |
| freezer | freezer.state | cgroup.freeze | FROZEN → "1", THAWED → "0" |
| net_cls | net_cls.classid | (no v2 equivalent) | Logged and ignored; use eBPF for network classification |
| net_prio | net_prio.ifpriomap | (no v2 equivalent) | Logged and ignored |
| pids | pids.max | pids.max | Direct |
| devices | devices.allow / devices.deny | BPF_PROG_TYPE_CGROUP_DEVICE | Translated to eBPF program attached to the cgroup |
| hugetlb | hugetlb.Xm.limit_in_bytes | hugetlb.Xm.max | Direct |
| rdma | rdma.max | rdma.max | Direct |
| (any v1) | cgroup.event_control | (no v2 equivalent) | Silently ignored — v1-only eventfd notification mechanism, replaced by inotify on cgroupfs in v2 |
| (any v1) | notify_on_release | (no v2 equivalent) | Silently ignored — v1-only automatic agent notification, no v2 equivalent (cgroup v2 uses systemd scope/slice lifecycle) |
cpu.shares formula derivation: Linux cpu.shares range is [2, 262144]; cpu.weight
range is [1, 10000]. The formula is a linear interpolation that maps the full v1 range
onto the full v2 range:
weight = clamp(1 + (shares − 2) × 9999 / 262142, 1, 10000)
This is the formula used by runc (the OCI reference runtime), containerd, and crun as of 2025. Key values:
| v1 cpu.shares | v2 cpu.weight |
|---|---|
| 2 (minimum) | 1 |
| 1024 (Docker default) | ~39 |
| 262144 (maximum) | 10000 |
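The table values can be checked directly against the interpolation (a standalone sketch of the same clamped integer arithmetic the translation function uses):

```rust
/// cpu.shares → cpu.weight linear interpolation, mapping the v1 range
/// [2, 262144] onto the v2 range [1, 10000] with integer division.
fn shares_to_weight(shares: u64) -> u64 {
    (1 + shares.saturating_sub(2) * 9999 / 262142).clamp(1, 10000)
}
```

Integer division truncates, so shares = 1024 yields 1 + (1022 × 9999) / 262142 = 1 + 38 = 39, matching the Docker-default row above.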
Systemd divergence: systemd (≥247) writes cpu.weight directly when operating in
cgroup v2 mode, using its own unit mapping (default weight = 100) rather than the runc
formula. When systemd writes v2 files natively, those writes bypass the shim entirely
and go straight to the cgroupfs. The shim translates only raw v1 cgroupfs writes from
programs that open the legacy v1 paths directly (older Docker daemons, legacy
orchestrators).
Implementation: The shim is implemented in umka-sysapi as cgroupv1fs, a VFS
pseudo-filesystem that mounts legacy controller directories. Writes to v1 paths invoke
the translation function below and apply the result to the v2 cgroupfs. Reads return
translated v2 values in v1 format.
/// Result of translating a cgroup v1 write to its v2 equivalent.
pub struct CgroupV2Write {
/// Relative path of the v2 control file (e.g., "memory.max", "cpu.weight").
pub path: &'static str,
/// Value to write (already formatted for v2 semantics).
pub value: ArrayString<64>,
}
/// Translate a cgroup v1 write to its v2 equivalent.
///
/// Returns `None` if the v1 knob has no v2 equivalent (silently ignored).
/// The caller is responsible for applying the returned `CgroupV2Write` to the
/// actual v2 cgroupfs node for the same cgroup.
pub fn cgroupv1_translate(
subsystem: CgroupV1Subsystem,
knob: &str,
value: &[u8],
) -> Option<CgroupV2Write> {
match (subsystem, knob) {
(CgroupV1Subsystem::Memory, "memory.limit_in_bytes") => {
let bytes = parse_bytes_or_max(value)?;
Some(CgroupV2Write { path: "memory.max", value: format_bytes_or_max(bytes) })
}
(CgroupV1Subsystem::Cpu, "cpu.shares") => {
let shares: u64 = parse_u64(value).ok()?;
let weight = 1u64.saturating_add(
shares.saturating_sub(2).saturating_mul(9999) / 262142
).clamp(1, 10000);
Some(CgroupV2Write { path: "cpu.weight", value: weight.to_string() })
}
// net_cls.classid, net_prio.ifpriomap, memory.kmem.limit_in_bytes:
// no v2 equivalent — return None (logged by caller, not propagated).
(CgroupV1Subsystem::NetCls, _)
| (CgroupV1Subsystem::NetPrio, _)
| (CgroupV1Subsystem::Memory, "memory.kmem.limit_in_bytes") => None,
// Full table follows the translation table above for all other knobs.
_ => cgroupv1_translate_full(subsystem, knob, value),
}
}
17.3 POSIX Inter-Process Communication (IPC)¶
UmkaOS supports standard POSIX IPC mechanisms, optimized using UmkaOS's native zero-copy primitives where possible.
17.3.1 AF_UNIX Sockets¶
Local domain sockets (AF_UNIX) are heavily used in containerized environments (e.g., Docker, Kubernetes).
Zero-Copy Process-to-Process Rings: For SOCK_STREAM sockets, UmkaOS maps the connection to a pair of single-producer/single-consumer (SPSC) ring buffers shared directly between the two processes. These are distinct from the kernel-domain KABI ring buffers (Section 11.7), which are fixed-size command/completion rings for Tier 0/Tier 1 communication. The AF_UNIX ring buffer is:
/// Cache-line-aligned wrapper to prevent false sharing between fields
/// accessed by different CPUs or different producer/consumer threads.
///
/// The alignment is platform-dependent, defined by `CACHE_LINE_SIZE`:
/// - x86-64: 64 bytes (but spatial prefetcher pairs → 128-byte effective)
/// - AArch64: 64 or 128 bytes (Neoverse V2 / Apple M-series)
/// - ARMv7: 32 or 64 bytes
/// - RISC-V: 64 bytes (typical)
/// - PPC32: 32 bytes
/// - PPC64LE: 128 bytes (POWER9/10)
/// - s390x: 256 bytes (z13+)
/// - LoongArch64: 64 bytes (3A5000/6000)
///
/// `CACHE_LINE_SIZE` is a compile-time constant per target:
/// 64 — x86-64, AArch64 (default), ARMv7, RISC-V, LoongArch64
/// 128 — PPC64LE, AArch64 (Apple/Neoverse V2 build config)
/// 256 — s390x
///
/// The `CacheAligned` wrapper uses the target's `CACHE_LINE_SIZE` to ensure
/// no false sharing on any supported platform.
///
/// Implementation: Rust's `#[repr(align(N))]` requires a literal, so we use
/// `cfg_attr` to select the correct alignment per target:
#[cfg_attr(target_arch = "s390x", repr(C, align(256)))]
#[cfg_attr(target_arch = "powerpc64", repr(C, align(128)))]
#[cfg_attr(target_arch = "powerpc", repr(C, align(32)))]
#[cfg_attr(not(any(target_arch = "s390x", target_arch = "powerpc64", target_arch = "powerpc")), repr(C, align(64)))]
pub struct CacheAligned<T>(pub T);
/// Platform-dependent cache line size constant.
/// Must agree with the `CacheAligned` alignment above.
#[cfg(target_arch = "s390x")]
pub const CACHE_LINE_SIZE: usize = 256;
#[cfg(target_arch = "powerpc64")]
pub const CACHE_LINE_SIZE: usize = 128;
#[cfg(target_arch = "powerpc")]
pub const CACHE_LINE_SIZE: usize = 32;
#[cfg(not(any(target_arch = "s390x", target_arch = "powerpc64", target_arch = "powerpc")))]
pub const CACHE_LINE_SIZE: usize = 64;
/// Process-to-process SPSC ring for AF_UNIX SOCK_STREAM zero-copy.
/// Mapped into both processes' address spaces at connection time.
///
/// # Safety
///
/// `buffer` is a raw pointer to a kernel-managed shared memory region:
/// - **Allocation**: Kernel allocates `capacity` bytes from the page allocator
/// at `connect()` time. Pages are pinned (not swappable).
/// - **Mapping**: Mapped read-write into the sender's address space, read-only
/// into the receiver's. The kernel creates both mappings atomically during
/// `connect()`.
/// - **Lifetime**: The buffer outlives both process mappings. Refcounted via
/// `Arc<UserSpscRing>`; the last `Arc` drop (on socket close) unmaps both
/// sides and returns pages to the page allocator.
/// - **Deallocation**: On socket `close()` (fd release), the kernel unmaps the
/// buffer from both processes. If one process exits while the other holds a
/// reference, the surviving mapping remains valid until its socket is closed.
pub struct UserSpscRing {
/// Ring buffer memory, shared between sender and receiver.
/// Mapped read-write in sender, read-only in receiver.
/// See struct-level SAFETY for lifetime guarantees.
pub buffer: *mut u8,
/// Total buffer size in bytes (power of 2 for efficient masking).
pub capacity: usize,
/// Write position (updated by sender, read by receiver).
/// Stored in a separate cache line to avoid false sharing.
/// Note: CacheAligned uses platform-specific alignment (32/64/128/256 bytes
/// depending on architecture; PPC32=32, x86-64/AArch64/ARMv7/RISC-V/LoongArch64=64,
/// PPC64LE=128, s390x=256) to prevent false sharing.
///
/// **Memory ordering**: Sender stores with `Release` after writing data
/// into the ring buffer. Receiver loads with `Acquire` before reading
/// data — this ensures the receiver sees the sender's data writes
/// before observing the advanced write_pos.
pub write_pos: CacheAligned<AtomicU64>,
/// Read position (updated by receiver, read by sender).
///
/// **Memory ordering**: Receiver stores with `Release` after consuming
/// data from the ring buffer. Sender loads with `Acquire` before
/// checking available space — this ensures the sender sees the
/// receiver's consumption before reusing ring buffer slots.
pub read_pos: CacheAligned<AtomicU64>,
/// Futex word for blocking when ring is full (sender waits) or empty (receiver waits).
///
/// Protocol:
/// Value 0 = IDLE (no waiter blocked).
/// Value 1 = SENDER_WAITING (sender blocked because ring is full).
/// Value 2 = RECEIVER_WAITING (receiver blocked because ring is empty).
///
/// Sender path:
/// 1. Check space: if (write_pos - read_pos) < capacity, write data, advance write_pos.
/// 2. If ring is full: store(1, Release) into futex_word, then FUTEX_WAIT(&futex_word, 1).
/// 3. When receiver advances read_pos, it checks futex_word. If == 1, store(0) + FUTEX_WAKE.
///
/// Receiver path:
/// 1. Check data: if write_pos != read_pos, read data, advance read_pos.
/// 2. If ring is empty: store(2, Release) into futex_word, then FUTEX_WAIT(&futex_word, 2).
/// 3. When sender advances write_pos, it checks futex_word. If == 2, store(0) + FUTEX_WAKE.
pub futex_word: AtomicU32,
}
- The umka-sysapi layer intercepts send() and recv() calls and translates them into ring buffer enqueues/dequeues.
- Data is copied twice: once from the sender's buffer into the shared ring, once from the ring into the receiver's buffer. This eliminates the traditional kernel-buffer intermediate copy, reducing the path from 3 copies to 2.
- The kernel is only invoked via futex when a ring is full/empty and the process must block.
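The index protocol of UserSpscRing can be modelled without the shared mapping or futex. The sketch below keeps the free-running power-of-two positions and the Acquire/Release pairing documented on write_pos/read_pos; `try_send`/`try_recv` return 0 where the real path would block on the futex word (names and the `Vec<AtomicU8>` backing store are illustrative):

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

/// Model of the UserSpscRing index protocol: power-of-two capacity,
/// free-running u64 positions masked on access.
struct SpscRing {
    buf: Vec<AtomicU8>,
    capacity: u64,
    write_pos: AtomicU64,
    read_pos: AtomicU64,
}

impl SpscRing {
    fn new(capacity: u64) -> Self {
        assert!(capacity.is_power_of_two());
        Self {
            buf: (0..capacity).map(|_| AtomicU8::new(0)).collect(),
            capacity,
            write_pos: AtomicU64::new(0),
            read_pos: AtomicU64::new(0),
        }
    }

    /// Sender side: returns bytes enqueued (0 = ring full, would FUTEX_WAIT).
    fn try_send(&self, data: &[u8]) -> usize {
        let w = self.write_pos.load(Ordering::Relaxed);
        let r = self.read_pos.load(Ordering::Acquire); // see receiver's consumption
        let free = (self.capacity - (w - r)) as usize;
        let n = free.min(data.len());
        for i in 0..n {
            let idx = ((w + i as u64) & (self.capacity - 1)) as usize;
            self.buf[idx].store(data[i], Ordering::Relaxed);
        }
        self.write_pos.store(w + n as u64, Ordering::Release); // publish data
        n
    }

    /// Receiver side: returns bytes dequeued (0 = ring empty, would FUTEX_WAIT).
    fn try_recv(&self, out: &mut [u8]) -> usize {
        let r = self.read_pos.load(Ordering::Relaxed);
        let w = self.write_pos.load(Ordering::Acquire); // see sender's data writes
        let avail = (w - r) as usize;
        let n = avail.min(out.len());
        for i in 0..n {
            let idx = ((r + i as u64) & (self.capacity - 1)) as usize;
            out[i] = self.buf[idx].load(Ordering::Relaxed);
        }
        self.read_pos.store(r + n as u64, Ordering::Release); // free slots
        n
    }
}
```

The Release store on write_pos paired with the Acquire load in try_recv is what guarantees the receiver never reads a slot before the sender's byte stores to it are visible.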
SOCK_SEQPACKET message boundaries:
SOCK_SEQPACKET requires preserving message boundaries — recv() must return exactly one message per call. The ring buffer protocol includes a 4-byte length header before each message:
/// Message format in SOCK_SEQPACKET ring:
/// | msg_len: u32 | data: [u8; msg_len] | msg_len: u32 | data: ... |
///
/// The receiver reads msg_len, then reads exactly that many bytes.
/// Short reads (buffer smaller than msg_len) discard the remainder of the message.
For SOCK_DGRAM AF_UNIX sockets, a similar framed protocol is used, but the ring is unidirectional (no connection, just a receive queue per socket).
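The 4-byte length-prefix framing can be sketched over a plain byte buffer rather than the shared ring. Little-endian header order and the `frame`/`deframe` names are assumptions for this sketch (the spec above fixes only the header size):

```rust
/// Append one SEQPACKET-style frame: 4-byte length header, then the payload.
fn frame(msg: &[u8], out: &mut Vec<u8>) {
    out.extend_from_slice(&(msg.len() as u32).to_le_bytes());
    out.extend_from_slice(msg);
}

/// Pop exactly one message, preserving message boundaries as recv() must.
/// Returns None if the buffer holds only a partial frame (header or body).
fn deframe(buf: &mut Vec<u8>) -> Option<Vec<u8>> {
    if buf.len() < 4 {
        return None; // incomplete header
    }
    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    if buf.len() < 4 + len {
        return None; // incomplete body
    }
    let msg = buf[4..4 + len].to_vec();
    buf.drain(..4 + len);
    Some(msg)
}
```

Each deframe call consumes one whole frame, so two enqueued messages come back as two distinct recv() results rather than a merged byte stream.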
17.3.2 Pipes and FIFOs¶
Standard pipes are implemented as bounded in-memory buffers managed by the VFS.
- For high-throughput scenarios, applications can use vmsplice() to zero-copy data from a pipe into a memory-mapped region.
- Internally, a pipe is a specialized VfsNode that maintains a wait queue for readers and writers.
17.3.2.1 Pipe Buffer¶
/// Default pipe capacity: 16 pages × 4 KB = 64 KB, matching Linux's default
/// since kernel 2.6.11. This is the inline fast-path capacity.
pub const PIPE_DEFAULT_PAGES: usize = 16;
/// Pipe buffer: inline storage for the common case (≤16 pages = 64KB default pipe),
/// with heap fallback for pipes expanded via fcntl(F_SETPIPE_SZ).
///
/// Allocated when pipe(2) or pipe2(2) is called.
///
/// The default buffer size is 65536 bytes (64 KB). The size is configurable via
/// fcntl(F_SETPIPE_SZ) up to /proc/sys/fs/pipe-max-size (default 1 MB; root with
/// CAP_SYS_RESOURCE may raise further, hard limit 2^31 bytes per Linux
/// `round_pipe_size()`).
///
/// **Zero-copy optimization**: When a pipe page is "gifted" via vmsplice()
/// with SPLICE_F_GIFT, the page is transferred to the pipe without copying.
/// The gifted page is unmapped from the sender's address space and becomes
/// owned by the pipe until read. This enables zero-copy data pipelines.
///
/// **Allocation model**: The inline `pages_small` array covers the standard 64 KB
/// default pipe (16 pages × 4 KB). When `fcntl(F_SETPIPE_SZ)` sets capacity
/// beyond 16 pages, the buffer transitions to `pages_large` (a heap-allocated
/// `Vec<PipePage>`). This hybrid approach keeps the struct compact (384 bytes of
/// inline page storage vs. the previous 6144 bytes) while supporting the full
/// Linux pipe size range.
pub struct PipeBuffer {
// === First cache line(s): hot-path lock-free atomic fields ===
// These fields are accessed on every read/write syscall without holding
// any lock. Placing them first ensures they occupy the initial cache
// lines of the heap-allocated struct, minimising cache misses on the
// common single-reader/single-writer path.
/// Index of the first page with data (read cursor).
pub read_idx: AtomicU32,
/// Index of the first empty page (write cursor).
pub write_idx: AtomicU32,
/// Byte offset within pages[read_idx] for partial reads.
pub read_offset: AtomicU32,
/// Byte offset within pages[write_idx] for partial writes.
pub write_offset: AtomicU32,
/// Total bytes currently in the pipe (atomic for lock-free size check).
pub len: AtomicU32,
/// Total pipe capacity in bytes (set by fcntl F_SETPIPE_SZ, default 65536).
pub capacity: AtomicU32,
/// Seqlock for detecting concurrent fcntl(F_SETPIPE_SZ) during lock-free writes.
/// Uses the `SeqLock` protocol defined in
/// [Section 3.6](03-concurrency.md#lock-free-data-structures--seqlockt--sequence-lock): odd values indicate
/// resize in progress; even values indicate stable. Writers read before and after;
/// if changed, retry. Incremented twice per F_SETPIPE_SZ resize. At 1 resize/sec
/// (unrealistic), wraps in ~292 billion years with u64.
pub resize_seq: AtomicU64,
/// Count of active single-writer fast-path operations.
/// fcntl(F_SETPIPE_SZ) waits for this to reach 0 before resizing.
pub active_writer: AtomicU32,
// === Warm fields: reader/writer reference counts and page count ===
/// Number of readers (for detecting write-side SIGPIPE).
/// When this drops to 0, write() returns EPIPE.
pub reader_count: AtomicU32,
/// Number of writers (for detecting read-side EOF).
/// When this drops to 0 and the pipe is empty, read() returns 0.
pub writer_count: AtomicU32,
/// Count of valid entries in `pages_small` (0 when `pages_large` is in use).
/// Maximum value: `PIPE_DEFAULT_PAGES` (16).
pub small_len: AtomicU8,
// === Cold fields: locks, wait queues, and page storage ===
// Only accessed on blocked paths (empty/full) and on resize.
/// Wait queue for blocked readers (pipe empty).
pub read_wait: WaitQueueHead,
/// Wait queue for blocked writers (pipe full).
pub write_wait: WaitQueueHead,
/// Lock for modifying the page ring (growing/shrinking) and multi-writer path.
/// The lock-free single-writer path does not hold this lock.
/// Lock level: PIPE_LOCK (14), above SLAB_LOCK (13). This ensures that
/// pipe operations that allocate pages (pipe_write when expanding) cannot
/// deadlock with the slab allocator.
pub ring_lock: Mutex<()>,
/// Reader-writer lock for multi-reader coordination on FIFOs.
/// When multiple readers exist, readers acquire this in shared mode
/// (concurrent readers don't contend with each other). Each reader
/// atomically claims bytes via `read_idx.fetch_add()`, then reads from
/// its claimed range without holding any lock. Writers do NOT acquire
/// this lock — it coordinates readers only. The write path uses
/// `ring_lock` (multi-writer) or the lock-free single-writer path.
pub read_lock: RwLock<()>,
/// Heap-allocated pages for pipes expanded beyond 16 pages.
/// `None` until the first `fcntl(F_SETPIPE_SZ)` exceeding 16 pages.
/// Allocated from the general kernel heap (not slab) because expanded
/// pipe buffers are rare and size varies. Accessed only while holding
/// `ring_lock`.
pub pages_large: Option<Vec<PipePage>>,
/// Inline storage for ≤ 16 pages (covers the 64 KB default pipe size).
/// Zero-allocation fast path for the common case.
/// `MaybeUninit` avoids initialization cost for unused slots while
/// keeping stack safety — only `small_len` entries are valid.
/// Placed last so the hot atomic fields above occupy the initial cache lines.
pub pages_small: [MaybeUninit<PipePage>; PIPE_DEFAULT_PAGES],
}
> **Design rationale**: `PipeBuffer` is heap-allocated (not stack-allocated). The hot-path
> atomic counters (`read_idx`, `write_idx`, `len`, `resize_seq`, `active_writer`) are placed
> **first** so they occupy the initial cache lines of the allocation; the 384-byte
> `pages_small` array is placed **last** so it does not evict the hot counters on
> lock-free read/write paths.
>
> **Inline vs. heap page storage**: `pages_small` covers the standard 64 KB pipe
> (16 pages x 4 KB). The previous design used a 256-slot inline array
> (`[PipePage; 256]` = ~6144 bytes) sized for the maximum possible pipe (1 MB via
> `F_SETPIPE_SZ`), which is rarely reached in practice — default Linux pipes are 64 KB
> (16 pages), and most pipes never exceed this. The 16-slot inline array reduces the
> struct's page-storage footprint from 6144 bytes to 384 bytes (16 x 24B), a 16x
> reduction that dramatically improves slab allocator cache density.
>
> **Transition to heap**: When `fcntl(F_SETPIPE_SZ)` sets capacity beyond 16 pages
> (> 64 KB), the buffer transitions to `pages_large` (`Vec<PipePage>`) and `small_len`
> is set to 0. This transition is uncommon in production workloads. The `Vec` is
> allocated from the general kernel heap (not slab) because expanded pipe sizes vary.
>
> **`MaybeUninit` wrapper**: The `MaybeUninit<PipePage>` wrapper avoids initialization
> cost for unused inline slots while maintaining stack safety. Only the first
> `small_len` entries contain valid data; the remainder are uninitialised memory.
/// A single page in the pipe buffer.
pub struct PipePage {
/// Physical page containing the data.
/// Allocated from the page allocator or gifted via vmsplice.
pub page: PhysPage,
/// Number of valid bytes in this page (0 = empty, PAGE_SIZE = full).
/// For gifted pages, this is the full page; for standard writes,
/// partial pages are possible.
pub len: AtomicUsize,
/// True if this page was gifted via vmsplice(SPLICE_F_GIFT).
/// Gifted pages are unmapped from the sender and transferred to
/// the reader; standard pages are copied.
pub is_gifted: AtomicBool,
}
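Both algorithms below derive the current ring size via `pipe.page_count()`. A minimal user-space sketch of that helper, assuming the hybrid storage described above (the name `PipePages` and the `u64` stand-in for `PipePage` are illustrative, not the real definitions):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Illustrative stand-in for the page-storage portion of PipeBuffer.
struct PipePages {
    /// Valid entries in the inline array (0 when the heap Vec is in use).
    small_len: AtomicU8,
    /// Heap storage after expansion beyond 16 pages (u64 stands in for PipePage).
    pages_large: Option<Vec<u64>>,
}

impl PipePages {
    /// Number of pages currently in the ring: the heap Vec's length when the
    /// buffer has transitioned past 16 pages, else the inline count.
    fn page_count(&self) -> usize {
        match &self.pages_large {
            Some(v) => v.len(),
            None => self.small_len.load(Ordering::Acquire) as usize,
        }
    }
}
```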
Pipe write algorithm (lock-free fast path):
Note: This algorithm assumes single-writer semantics for the lock-free fast path. POSIX pipes allow multiple concurrent writers, but guarantee write atomicity only for writes no larger than PIPE_BUF (4096 bytes). The lock-free algorithm below therefore admits exactly one concurrent writer; multi-writer scenarios fall back to the mutex-protected slow path described next.
Multi-writer slow path (POSIX atomicity guarantee for writes ≤ PIPE_BUF):
When multiple writers are detected (via writer_count.load(Acquire) > 1), all
writers acquire ring_lock (a mutex) before writing. Under ring_lock:
1. The writer checks available space (same as step 3 of the fast path).
2. If remaining ≤ PIPE_BUF (4096), the write is performed atomically: all
bytes are written to contiguous pages before write_idx is advanced. If
insufficient contiguous space exists, the writer sleeps on the pipe's
wait queue until space is available (matching Linux POSIX behaviour).
3. If remaining > PIPE_BUF, POSIX does not guarantee atomicity. The write
proceeds page-by-page under the mutex (may interleave with other large writes).
4. write_idx and len are updated under the mutex, then ring_lock is released.
The interaction with the lock-free reader is safe because the reader only reads
committed pages (visible via len.load(Acquire)), and the reader's read_idx
advancement is atomic. The resize_seq seqlock interaction is the same as
the fast path — fcntl(F_SETPIPE_SZ) acquires ring_lock and waits for
active writers.
Multi-reader coordination: When multiple readers exist on a FIFO, readers
acquire read_lock (the reader-writer lock) in shared mode, so concurrent
readers do not contend with each other; ring_lock (the mutex) is reserved
for the multi-writer and resize paths. Each reader atomically claims bytes
via read_idx.fetch_add(), then reads from its claimed range without holding
any lock. This allows concurrent readers to make progress on different
regions of the pipe buffer.
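The reader-side claim step can be sketched as follows — a simplification that claims whole pages from a monotonically increasing cursor and wraps on use, whereas the real path also tracks byte offsets within pages:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Each reader atomically claims the next page index. fetch_add returns the
/// PREVIOUS value, so the returned index belongs exclusively to this reader;
/// concurrent readers receive disjoint indices without any lock.
fn claim_page(read_idx: &AtomicU32, num_pages: u32) -> u32 {
    read_idx.fetch_add(1, Ordering::AcqRel) % num_pages
}
```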
Resize safety: The lock-free write path uses a resize_seq: AtomicU64 seqlock to detect concurrent fcntl(F_SETPIPE_SZ) operations. Before starting the write loop, the writer reads the seqlock; after completing each page, it re-checks. If the seqlock changed, the writer retries from the beginning with the new page count. fcntl(F_SETPIPE_SZ) acquires ring_lock, waits for in-flight single-writers via an active_writer count, increments resize_seq, performs the resize (potentially transitioning from pages_small to pages_large when expanding beyond 16 pages), and increments resize_seq again. This ensures the lock-free path never observes an inconsistent buffer size.
write(pipe, data, len):
0. remaining = len; written = 0
1. seq_start = resize_seq.load(Acquire) // Capture resize generation
2. If reader_count.load(Acquire) == 0: return EPIPE (SIGPIPE to caller)
// TOCTOU note: reader may close between this check and write. This is
// acceptable per POSIX — data written to a pipe with no readers is simply
// discarded, and the next write() will observe reader_count == 0 and
// return EPIPE. The pipe remains consistent; no data corruption occurs.
3. // Try to claim fast path via compare-and-swap
if !active_writer.compare_exchange(0, 1, Acquire, Relaxed).is_ok():
// Another writer active — take slow path with ring_lock
return write_slow_path(pipe, data, len)
4. current_num_pages = pipe.page_count() // Derive from small_len or pages_large.len()
5. while remaining > 0:
a. If len.load() >= capacity.load():
// Pipe full — block on write_wait
active_writer.store(0, Release) // Release during wait
wait_event_interruptible(write_wait, len.load() < capacity)
if interrupted: return written // bytes successfully written before interrupt
// Re-acquire fast path and re-check
if !active_writer.compare_exchange(0, 1, Acquire, Relaxed).is_ok():
// Lost to another writer during wait — take slow path
return written + write_slow_path(pipe, &data[written..], remaining)
if reader_count.load(Acquire) == 0:
active_writer.store(0, Release)
return EPIPE
// Check for resize during wait
if resize_seq.load(Acquire) != seq_start:
active_writer.store(0, Release)
goto 1 // Retry with new seq_start; written/remaining preserved
b. write_idx_val = write_idx.load(Relaxed)
c. write_off = write_offset.load(Relaxed)
d. available = min(PAGE_SIZE - write_off, remaining)
e. copy data[written:written+available] to pages[write_idx_val][write_off:]
e'. pages[write_idx_val].len.store(write_off + available, Release) // Update per-page len
f. write_offset.store(write_off + available, Release)
g. len.fetch_add(available, Release) // Publishing barrier for data in step e
h. If write_offset == PAGE_SIZE:
// Page full, advance to next — but first check for concurrent resize
if resize_seq.load(Acquire) != seq_start:
active_writer.store(0, Release)
goto 1 // Retry with new seq_start; written/remaining preserved
write_idx.store((write_idx_val + 1) % current_num_pages, Release)
write_offset.store(0, Release)
i. written += available; remaining -= available
6. active_writer.store(0, Release) // Release fast-path lock
7. wake_up(read_wait) // Notify any blocked readers
8. return written
Memory ordering rationale for write path: The Release on len.fetch_add() (step g) is the publishing barrier that synchronizes with the reader's Acquire load of global len. This ensures all prior stores (the data memcpy in step e, the per-page len update in step e') are visible to the reader before it observes the new len value. The reader must use the global len Acquire→per-page len Acquire chain.
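The single-writer gate from step 3 can be sketched in isolation. This is a minimal user-space model of the claim/release pattern; the orderings match the pseudocode above:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// compare_exchange(0 -> 1) admits exactly one writer to the lock-free
/// fast path; a failed exchange means another writer is active, and the
/// caller must fall back to the ring_lock slow path.
fn try_claim_fast_path(active_writer: &AtomicU32) -> bool {
    active_writer
        .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

/// Release store pairs with the next claimant's Acquire, publishing all
/// writes performed while the gate was held.
fn release_fast_path(active_writer: &AtomicU32) {
    active_writer.store(0, Ordering::Release);
}
```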
fcntl(F_SETPIPE_SZ) implementation:
fcntl_setpipe_sz(pipe, new_size):
1. ring_lock.lock()
2. // Wait for active single-writers to complete using futex
while active_writer.load(Acquire) > 0:
// Use futex wait instead of busy-spin to avoid priority inversion
futex_wait(&active_writer, expected=1, timeout=1ms)
3. resize_seq.fetch_add(1, Release) // Start resize
4. old_pages = pages // Save pointer to old pages array
5. // Perform resize: if new_pages > 16, transition to pages_large (Vec);
// copy data from old pages, update small_len or pages_large
6. resize_seq.fetch_add(1, Release) // End resize
7. rcu_call(old_pages, free_pages_callback) // Defer freeing old pages array
8. ring_lock.unlock()
> **Design note — lock ordering during resize**: The resize path holds `ring_lock` while
> waiting for `active_writer` to drain (with a 1 ms timeout). This prevents permanent
> deadlock but creates a retry loop if the writer is blocked on an unrelated resource.
> The implementation SHOULD drop `ring_lock` before the futex wait, re-acquire it after
> wake-up, and re-validate the resize preconditions. This two-phase approach
> (validate → release → wait → re-acquire → re-validate) eliminates the lock-while-wait
> pattern at the cost of one extra validation pass.
>
> **Memory safety during resize**: When `fcntl(F_SETPIPE_SZ)` replaces the pages array,
> the OLD pages array is freed via `rcu_call()` (deferred until the next RCU grace
> period). This ensures that any concurrent reader in steps 4a-4e, which executes under
> an implicit RCU read-side critical section (preemption disabled during the pipe read
> fast path), will not access freed memory. The seqlock (`resize_seq`) detects that a
> resize occurred and triggers a retry, but the deferred freeing guarantees that the
> stale pointer remains valid for the duration of the read attempt.
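A user-space sketch of the recommended two-phase drain, using a std `Mutex` + `Condvar` in place of `ring_lock` + futex (`WriterDrain` and its field names are illustrative). `Condvar::wait` atomically releases the lock while sleeping and re-acquires it before returning, so the lock is never held across the wait, and the loop re-validates the precondition after every wake-up:

```rust
use std::sync::{Condvar, Mutex};

/// Illustrative model: a count of in-flight fast-path writers plus a
/// condition variable the resizer waits on.
struct WriterDrain {
    active_writers: Mutex<u32>,
    drained: Condvar,
}

impl WriterDrain {
    /// Block until all fast-path writers have finished. The mutex is
    /// released during the wait and re-acquired on wake-up; the while
    /// loop re-checks the precondition (handles spurious wake-ups too).
    fn wait_for_writers(&self) {
        let mut count = self.active_writers.lock().unwrap();
        while *count > 0 {
            count = self.drained.wait(count).unwrap();
        }
        // Lock held here with count == 0: safe to perform the resize.
    }
}
```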
Multi-writer support: When multiple threads write to the same pipe concurrently, the lock-free path cannot be used. The kernel detects multi-writer scenarios using a compare-and-swap pattern: a writer performs active_writer.compare_exchange(0, 1, Acquire, Relaxed). If successful (previous value was 0), it proceeds on the fast path. If it fails (another writer is active), it acquires ring_lock and takes the slow path. This ensures exactly one writer can be on the fast path at a time, preserving POSIX atomic write guarantees for writes ≤ PIPE_BUF.
Pipe read algorithm (lock-free, requires single reader or mutex for multi-reader):
read(pipe, buffer, len):
0. seq_start = resize_seq.load(Acquire) // Capture resize generation
1. If len.load(Acquire) == 0:
// Pipe empty — check for EOF or block
if writer_count.load(Acquire) == 0:
return 0 // EOF — all writers closed
wait_event_interruptible(read_wait, len.load(Acquire) > 0 || writer_count.load(Acquire) == 0)
if interrupted: return EINTR // no bytes read yet; returning 0 would falsely signal EOF
if len.load(Acquire) == 0 && writer_count.load(Acquire) == 0:
return 0 // EOF after wakeup
// Check for resize during wait
if resize_seq.load(Acquire) != seq_start:
goto 0 // Retry with new parameters
2. bytes_read = 0
3. current_num_pages = pipe.page_count() // Derive from small_len or pages_large.len()
4. while bytes_read < len && len.load(Acquire) > 0:
a. // Check for concurrent resize BEFORE accessing pages[] array.
// If resize occurred, the old pages[] pointer may be deallocated.
if resize_seq.load(Acquire) != seq_start:
seq_start = resize_seq.load(Acquire)
current_num_pages = pipe.page_count()
b. read_idx_val = read_idx.load(Acquire) // Acquire to see writer's stores
c. read_off = read_offset.load(Acquire)
d. // Determine bytes available in current page
page_len = pages[read_idx_val].len.load(Acquire)
available = min(page_len - read_off, len - bytes_read)
e. // Copy data from page to user buffer (Acquire ensures data is visible)
copy pages[read_idx_val][read_off:read_off+available] to buffer[bytes_read:]
f. // Post-copy validation: if a resize raced with the copy, the data
// may be stale. Discard this iteration and retry.
if resize_seq.load(Acquire) != seq_start:
seq_start = resize_seq.load(Acquire)
current_num_pages = pipe.page_count()
// Re-read page index and offset — resize may have moved data
// to different page indices or changed the page array size.
// Without this, stale read_idx_val may index a different page
// (data corruption) or exceed current_num_pages (OOB access).
read_idx_val = read_idx.load(Acquire)
read_off = read_offset.load(Acquire)
continue // Retry — do not commit read_offset or len changes
g. read_offset.store(read_off + available, Release)
h. If (read_off + available) >= page_len:
// Page consumed — advance index BEFORE decrementing len. This ensures
// a concurrent writer observing free space (via len) sees the updated
// read_idx and does not overwrite the page the reader just finished.
read_idx.store((read_idx_val + 1) % current_num_pages, Release)
read_offset.store(0, Release)
i. len.fetch_sub(available, Release) // Must be AFTER read_idx advance
j. bytes_read += available
5. wake_up(write_wait) // Notify any blocked writers
6. return bytes_read
Memory ordering rationale: The reader uses Acquire loads on len, read_idx, read_offset, and pages[].len to synchronize with the writer's Release stores. This ensures the reader observes all data written before the writer updated these indices. On weakly-ordered architectures (AArch64, RISC-V, ARMv7, PPC), this ordering is critical to prevent the reader from seeing stale data.
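The `resize_seq` retry protocol shared by both paths can be modeled generically. This sketch wraps one unit of work in the seqlock read protocol; the actual pipe paths re-check per page iteration rather than wrapping the whole operation, but the capture/work/re-check shape is the same:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Seqlock read protocol: capture the sequence, do the work, re-check.
/// An odd value means a resize is in progress; a changed value means a
/// resize raced with `work`, so the result is discarded and retried.
fn read_stable<T>(seq: &AtomicU64, mut work: impl FnMut() -> T) -> T {
    loop {
        let start = seq.load(Ordering::Acquire);
        if start % 2 == 1 {
            continue; // resize in progress — wait for a stable (even) value
        }
        let result = work();
        if seq.load(Ordering::Acquire) == start {
            return result; // no resize raced with `work`
        }
        // Sequence changed: retry with fresh state.
    }
}
```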
pipe_poll implementation: Polls a pipe for readiness without blocking:
fn pipe_poll(pipe: &PipeBuffer, events: PollEvents, pt: &mut PollTable) -> PollEvents {
poll_wait(&pipe.read_wait, pt);
poll_wait(&pipe.write_wait, pt);
let mut ready = PollEvents::empty();
let avail = pipe.len.load(Relaxed);
if avail > 0 { ready |= EPOLLIN | EPOLLRDNORM; }
if pipe.writer_count.load(Acquire) == 0 { ready |= EPOLLHUP; }
if avail < pipe.capacity.load(Relaxed) { ready |= EPOLLOUT | EPOLLWRNORM; }
if pipe.reader_count.load(Acquire) == 0 { ready |= EPOLLERR; }
ready & events
}
FIFOs (named pipes): A FIFO is a VFS node (VfsNode) that, when opened, creates a reference to an existing PipeBuffer or creates a new one. Multiple readers and writers can open a FIFO; the reader_count and writer_count fields track opens/closes. Writers use the multi-writer slow path when concurrent writes are detected. When the last reader and last writer close, the buffer is freed.
17.3.3 Shared Memory (POSIX and SysV)¶
- POSIX `shm_open()`: Implemented as a memory-mapped file (mmap) backed by a hidden `tmpfs` instance.
- SysV `shmget()`: Maps to the same underlying physical memory allocation mechanism, but managed via the `CLONE_NEWIPC` namespace tables.
POSIX Message Queues (mq_open): POSIX message queues are implemented as special files
in a per-IPC-namespace mqueue filesystem (mounted at /dev/mqueue). Each queue is a VFS
inode backed by an in-memory priority-sorted message store. The implementation reuses the
VFS file operation dispatch path rather than a dedicated PosixMqueue struct — mq_open()
creates a file in mqueuefs, mq_send()/mq_receive() are implemented as VFS
write()/read() with priority-sorted insertion. Queue attributes (mq_maxmsg,
mq_msgsize) are stored as extended attributes on the inode. See
Section 14.18 for the mqueuefs mount integration.
Both mechanisms result in direct page table entries (PTEs) mapping the same physical frames into multiple Capability Domains.
17.3.4 IPC Namespace Dispatch (SysV IPC)¶
SysV IPC objects (shared memory segments, semaphore arrays, and message queues) are
isolated per IPC namespace. Each IPC namespace maintains independent key-to-ID mappings
so that the same key_t value in two different containers refers to two entirely
separate IPC objects.
Dispatch path: All SysV IPC syscalls resolve the IPC namespace from the calling
task's NamespaceSet before performing any lookup or creation:
/// Per-namespace SysV IPC resource limits. Matches Linux defaults from
/// `include/uapi/linux/ipc.h` and `/proc/sys/kernel/`. Each limit is
/// independently tunable via sysctl within the IPC namespace.
pub struct IpcLimits {
// --- Shared memory limits ---
/// Maximum size of a single shared memory segment (bytes).
/// Linux default: SHMMAX = ULONG_MAX - (1UL << 24) ≈ 16 EiB on 64-bit.
/// UmkaOS default: same (u64::MAX - (1 << 24)).
pub shmmax: u64,
/// Maximum total shared memory pages (system-wide within this namespace).
/// Linux default: SHMALL = ULONG_MAX - (1UL << 24).
pub shmall: u64,
/// Maximum number of shared memory segments (system-wide within this namespace).
/// Linux default: SHMMNI = 4096.
pub shmmni: u32,
// --- Semaphore limits ---
/// Maximum semaphores per semaphore set.
/// Linux default: SEMMSL = 32000.
pub semmsl: u32,
/// Maximum semaphores system-wide within this namespace.
/// Linux default: SEMMNS = 1024000000.
pub semmns: u32,
/// Maximum operations per semop() call.
/// Linux default: SEMOPM = 500.
pub semopm: u32,
/// Maximum semaphore value.
/// Linux default: SEMVMX = 32767.
pub semvmx: u32,
/// Maximum number of semaphore sets (system-wide within this namespace).
/// Linux default: SEMMNI = 32000.
pub semmni: u32,
// --- Message queue limits ---
/// Maximum size of a single message (bytes).
/// Linux default: MSGMAX = 8192.
pub msgmax: u32,
/// Maximum total bytes in a single message queue.
/// Linux default: MSGMNB = 16384.
pub msgmnb: u32,
/// Maximum number of message queues (system-wide within this namespace).
/// Linux default: MSGMNI = 32000.
pub msgmni: u32,
}
impl Default for IpcLimits {
/// Returns Linux-compatible defaults.
fn default() -> Self {
Self {
shmmax: u64::MAX - (1 << 24),
shmall: u64::MAX - (1 << 24),
shmmni: 4096,
semmsl: 32000,
semmns: 1_024_000_000,
semopm: 500,
semvmx: 32767,
semmni: 32000,
msgmax: 8192,
msgmnb: 16384,
msgmni: 32000,
}
}
}
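A sketch of how these limits gate creation, modeled on Linux `shmget()` error semantics (EINVAL for a zero or over-limit size, ENOSPC when the namespace is at its segment cap); the `ShmLimits` subset and `check_shm_create` helper are illustrative names, not part of the real interface:

```rust
/// Illustrative subset of IpcLimits relevant to shmget().
struct ShmLimits {
    shmmax: u64, // max size of one segment (bytes)
    shmmni: u32, // max number of segments in this namespace
}

/// Validate a shmget(IPC_CREAT) request against per-namespace limits.
fn check_shm_create(size: u64, nsegments: u32, limits: &ShmLimits) -> Result<(), &'static str> {
    if size == 0 || size > limits.shmmax {
        return Err("EINVAL"); // zero-sized or over the per-segment cap
    }
    if nsegments >= limits.shmmni {
        return Err("ENOSPC"); // namespace already holds shmmni segments
    }
    Ok(())
}
```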
/// Per-IPC-namespace state. One instance per IPC namespace, created by
/// clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC).
///
/// **Locking**: Each IPC type (shm, sem, msg) has a single `RwLock<IpcIdTable>`
/// that protects BOTH the ID allocator (Idr) and the key-to-ID map (XArray)
/// atomically. This matches Linux's single `ids->rwsem` per IPC type (see
/// `ipc/util.c ipcget_public()`). Using separate locks would create a TOCTOU
/// race: two threads could both see "key not present" then both allocate,
/// producing duplicate IPC objects for the same key — violating POSIX semantics
/// (`shmget(key, size, IPC_CREAT|IPC_EXCL)` must return EEXIST if key exists).
///
/// `RwLock` (not SpinLock) allows concurrent `shmat()`/`shmdt()` read-side
/// access without writer contention.
pub struct IpcNamespace {
/// Unique namespace ID.
pub ns_id: u64,
/// SysV shared memory: ID allocator + key-to-ID map under single lock.
/// key_t is i32 on Linux; XArray key uses `key as u32 as u64` to avoid
/// sign-extension overlap.
pub shm: RwLock<IpcIdTable<ShmSegment>>,
/// SysV semaphore sets: ID allocator + key-to-ID map under single lock.
pub sem: RwLock<IpcIdTable<SemSet>>,
/// SysV message queues: ID allocator + key-to-ID map under single lock.
pub msg: RwLock<IpcIdTable<MsgQueue>>,
/// System-wide limits (per-namespace, matching Linux defaults).
/// Configurable via sysctl within the namespace.
pub limits: IpcLimits,
/// Owning user namespace (for permission checks).
pub user_ns: Arc<UserNamespace>,
/// POSIX message queues (name -> queue handle). Each entry references the
/// mqueuefs-backed queue state described in Section 17.3.3. String-keyed;
/// BTreeMap is appropriate since mq_open() is a setup-time operation, not
/// per-message.
pub posix_mqueues: RwLock<BTreeMap<String, Arc<PosixMqueue>>>,
}
/// Bundled IPC ID allocator and key-to-ID map. Protected by a single
/// `RwLock` in `IpcNamespace` to prevent TOCTOU between key lookup and
/// ID allocation. Matches Linux's `struct ipc_ids` which holds both the
/// IDR and the key hash table under one `rwsem`.
pub struct IpcIdTable<T> {
/// ID allocator. O(1) lookup by IPC ID on the shmat/semop/msgsnd path.
pub ids: Idr<T>,
/// Key-to-ID reverse map. XArray keyed by `key_t as u32 as u64`.
/// Used by shmget/semget/msgget to find existing objects by key.
pub key_map: XArray<i32>,
}
/// Resolve the IPC namespace for a SysV IPC syscall.
///
/// Called at the entry of shmget/semget/msgget/shmctl/semctl/msgctl/
/// shmat/shmdt/semop/msgsnd/msgrcv.
fn current_ipc_ns() -> &Arc<IpcNamespace> {
&current_task().nsproxy.ipc_ns
}
Syscall dispatch:
shmget(key, size, flags):
1. ipc_ns = current_ipc_ns()
2. let mut table = ipc_ns.shm.write(); // single RwLock covers both ids + key_map
3. If key == IPC_PRIVATE or (flags & IPC_CREAT) and key not in table.key_map:
Allocate new ShmSegment in table.ids; insert into table.key_map.
4. Else if key exists in table.key_map: return EEXIST if both IPC_CREAT and
   IPC_EXCL are set; otherwise resolve key → shmid and return shmid.
   If the key is absent and IPC_CREAT is not set: return ENOENT.
5. Permission check uses ipc_ns.user_ns for ns_capable().
semget(key, nsems, flags):
1. ipc_ns = current_ipc_ns()
2. let mut table = ipc_ns.sem.write();
3. Same pattern: IPC_PRIVATE → allocate; existing key → lookup in table.key_map.
4. SemSet allocated with nsems semaphores (each AtomicU16).
msgget(key, flags):
1. ipc_ns = current_ipc_ns()
2. let mut table = ipc_ns.msg.write();
3. Same pattern: IPC_PRIVATE → allocate; existing key → lookup in table.key_map.
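The get-or-create pattern shared by all three dispatchers can be sketched with std types standing in for `Idr`/`XArray` (`IdTable`, `next_id`, and `ipc_get_or_create` are illustrative names). The point is that lookup and allocation happen under ONE write lock, so two racing `shmget(key, IPC_CREAT)` calls cannot both create an object for the same key:

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

/// Illustrative stand-in for IpcIdTable: a monotonic ID allocator plus
/// the key-to-ID map, both guarded by the same RwLock.
struct IdTable {
    next_id: i32,
    key_map: BTreeMap<i32, i32>, // key_t -> IPC id
}

/// shmget/semget/msgget core: atomic lookup-or-create under one write lock.
fn ipc_get_or_create(table: &RwLock<IdTable>, key: i32, create: bool) -> Result<i32, &'static str> {
    let mut t = table.write().unwrap(); // covers BOTH lookup and allocation
    if let Some(&id) = t.key_map.get(&key) {
        return Ok(id); // existing object (EEXIST here if IPC_EXCL was set)
    }
    if !create {
        return Err("ENOENT"); // no object and no IPC_CREAT
    }
    let id = t.next_id;
    t.next_id += 1;
    t.key_map.insert(key, id);
    Ok(id)
}
```

Splitting this into two locks (one for the allocator, one for the key map) would reintroduce the TOCTOU window described in the `IpcNamespace` locking note.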
Isolation guarantee: A process in IPC namespace A cannot access or even
detect the existence of IPC objects in namespace B. The key-to-ID mappings are
entirely disjoint. ipcs inside a container shows only that container's IPC
objects. This matches Linux IPC namespace semantics required by Docker and
Kubernetes pod isolation (shareProcessNamespace: false implies separate IPC
namespaces).