Chapter 16: Containers and Namespaces
Namespace architecture (8 types), cgroups v2, POSIX IPC, OCI runtime
Type Definitions Used in This Part
/// Unique identifier for a schedulable task within the kernel.
/// Globally unique, never reused (monotonically increasing from boot).
/// Used for PID translation in PID namespaces.
pub type TaskId = u64;
/// Handle to a physical page frame. Wraps the frame number.
/// Used by pipe buffers for zero-copy page gifting via vmsplice().
pub struct PhysPage {
/// Physical frame number (PFN).
pub pfn: u64,
}
/// Wait queue head for blocking operations.
/// Used by pipe buffers to block readers/writers.
/// Defined in Section 3.1.6 (umka-core/src/sync/wait.rs).
// WaitQueueHead is defined in Section 3.1.6.2 (03-concurrency.md).
// See that section for the full struct definition and wait/wake protocol.
pub use crate::sync::wait::WaitQueueHead;
/// Namespace type enumeration for hierarchy tracking.
/// UmkaOS implements all 8 Linux namespace types (see Section 7.1.6).
///
/// Uses sequential kernel-internal values (`#[repr(u8)]`). These do NOT
/// correspond to the CLONE_NEW* bitflags passed by userspace (e.g.,
/// CLONE_NEWPID = 0x20000000, CLONE_NEWNET = 0x40000000). Translation
/// from CLONE_NEW* bitflags happens at the syscall boundary via
/// `clone_flag_to_ns_type()` below.
#[repr(u8)]
pub enum NamespaceType {
Pid = 0,
Net = 1,
Mnt = 2,
Uts = 3,
Ipc = 4,
User = 5,
Cgroup = 6,
Time = 7, // Linux 5.6+
}
/// Convert a single CLONE_NEW* bitflag (from clone(2) / unshare(2) flags)
/// to the kernel-internal `NamespaceType`.
///
/// Callers must iterate over each set bit in the `clone_flags` word and
/// call this function once per bit. Returns `None` for bits that are not
/// namespace flags (e.g., CLONE_VM, CLONE_FILES).
pub fn clone_flag_to_ns_type(bit: u64) -> Option<NamespaceType> {
    // libc's CLONE_NEW* constants are c_int (i32); narrow the candidate bit
    // first so a set bit above bit 31 can never alias a namespace flag.
    let bit = i32::try_from(bit).ok()?;
    match bit {
        libc::CLONE_NEWPID => Some(NamespaceType::Pid),
        libc::CLONE_NEWNET => Some(NamespaceType::Net),
        libc::CLONE_NEWNS => Some(NamespaceType::Mnt),
        libc::CLONE_NEWUTS => Some(NamespaceType::Uts),
        libc::CLONE_NEWIPC => Some(NamespaceType::Ipc),
        libc::CLONE_NEWUSER => Some(NamespaceType::User),
        libc::CLONE_NEWCGROUP => Some(NamespaceType::Cgroup),
        libc::CLONE_NEWTIME => Some(NamespaceType::Time),
        _ => None,
    }
}
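The caller-side iteration over set bits described above can be sketched in userspace. This is an illustrative sketch: the Linux CLONE_NEW* values are written out as local constants, the enum mirrors the one defined earlier, and `requested_namespaces` is a hypothetical helper, not part of the UmkaOS API:

```rust
/// Kernel-internal namespace type (mirrors the enum defined above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NamespaceType { Pid, Net, Mnt, Uts, Ipc, User, Cgroup, Time }

// Linux CLONE_NEW* bitflag values from clone(2).
const CLONE_NEWTIME: u64 = 0x0000_0080;
const CLONE_NEWNS: u64 = 0x0002_0000;
const CLONE_NEWCGROUP: u64 = 0x0200_0000;
const CLONE_NEWUTS: u64 = 0x0400_0000;
const CLONE_NEWIPC: u64 = 0x0800_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const CLONE_NEWPID: u64 = 0x2000_0000;
const CLONE_NEWNET: u64 = 0x4000_0000;

fn clone_flag_to_ns_type(bit: u64) -> Option<NamespaceType> {
    match bit {
        CLONE_NEWPID => Some(NamespaceType::Pid),
        CLONE_NEWNET => Some(NamespaceType::Net),
        CLONE_NEWNS => Some(NamespaceType::Mnt),
        CLONE_NEWUTS => Some(NamespaceType::Uts),
        CLONE_NEWIPC => Some(NamespaceType::Ipc),
        CLONE_NEWUSER => Some(NamespaceType::User),
        CLONE_NEWCGROUP => Some(NamespaceType::Cgroup),
        CLONE_NEWTIME => Some(NamespaceType::Time),
        _ => None,
    }
}

/// Walk every set bit in a clone(2) flags word and collect the namespace
/// requests, silently skipping non-namespace bits (CLONE_VM, CLONE_FILES, ...).
pub fn requested_namespaces(mut flags: u64) -> Vec<NamespaceType> {
    let mut out = Vec::new();
    while flags != 0 {
        let bit = flags & flags.wrapping_neg(); // isolate lowest set bit
        if let Some(ns) = clone_flag_to_ns_type(bit) {
            out.push(ns);
        }
        flags &= flags - 1; // clear lowest set bit
    }
    out
}
```

Processing bits lowest-first gives a deterministic order, which keeps the syscall boundary's namespace-creation sequence reproducible.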
Note on Capability<T> syntax: This document uses Capability<NetStack> and Capability<VfsNode> as type hints indicating what resource a capability references. The underlying Capability struct (Section 8.1.1) is non-generic; the target type is determined by the object_id field. This notation is for documentation clarity only.
16.1 Namespace Architecture
Linux namespaces isolate global system resources. In UmkaOS, namespaces are not primitive kernel objects; rather, they are synthesized from UmkaOS's native Capability Domains (Section 8.1) and Virtual Filesystem (VFS) mounts.
16.1.1 Capability Domain Mapping
When a process creates a new namespace via clone(CLONE_NEW*) or unshare(), UmkaOS allocates a new Capability Domain or modifies the existing one:
- CLONE_NEWPID (PID Namespace): Creates a new PID translation table in the process's Capability Domain. The umka-compat layer translates local PIDs (e.g., PID 1) to global UmkaOS task IDs.
- CLONE_NEWNET (Network Namespace): Creates an isolated network stack instance:
  - The new namespace has no network interfaces except lo (loopback, 127.0.0.1/8)
  - No connectivity to the host or external network unless explicitly configured
  - Network interfaces (physical NICs, VETH pairs, bridges, VLANs) are owned by a specific namespace and cannot be accessed from other namespaces
  - Each namespace has its own routing table, iptables/nftables rules, and socket port space
  - Per-namespace network state is defined below; the umka-net subsystem (Section 15.1-38) implements the network stack that operates within these namespace boundaries:
/// Network interface table. Uses a hash map for O(1) lookup on the
/// packet receive/transmit path.
/// Maximum 64 interfaces (typical deployments have 2-16).
/// FxHash for fast integer key hashing with minimal code overhead.
pub struct InterfaceTable {
/// Hash table indexed by InterfaceIndex. Probe distance ~1 for
/// typical deployments with ≤16 interfaces.
table: FixedHashMap<InterfaceIndex, Arc<NetInterface>, 64>,
/// Ordered list for enumeration (netlink GETLINK, /proc/net/if_inet6).
/// Maintained in parallel with the hash table.
ordered: Vec<InterfaceIndex>,
}
/// Per-namespace network state.
pub struct NetNamespace {
/// Namespace ID (unique across the system).
pub ns_id: u64,
/// Network interface table. Uses a hash map for O(1) lookup on the
/// packet receive/transmit path.
/// Maximum 64 interfaces (typical deployments have 2-16).
/// FxHash for fast integer key hashing with minimal code overhead.
///
/// BTreeMap replaced with FixedHashMap: O(log n) BTreeMap on the per-packet
/// path (netif_receive_skb, dev_queue_xmit) was a performance regression.
/// Hash table provides O(1) lookup with ~1-2 probe distance for typical
/// ≤16 interface deployments.
///
/// **RCU-protected**: Interface lookup is on the per-packet hot path (every
/// incoming and outgoing packet resolves its interface). Readers call
/// `rcu_read_lock()` and traverse the map without any lock acquisition.
/// Writers (interface add/remove, rare) clone-and-swap via `RcuCell`:
/// build a new InterfaceTable, atomically publish via `RcuCell::update()`,
/// defer freeing the old table to an RCU grace period. This matches Linux's
/// RCU-protected `net_device` lookup exactly — lock-free reads, serialized
/// writes. The write path holds `config_lock` (below) for serialization.
pub interfaces: RcuCell<InterfaceTable>,
/// Loopback interface (always present, cannot be deleted).
pub loopback: Arc<NetInterface>,
/// Routing table (per-namespace, not shared).
///
/// **RCU-protected**: Route lookup is on the per-packet forwarding path.
/// Same pattern as interfaces: lock-free RCU reads, clone-and-swap writes.
/// Linux uses RCU for FIB (Forwarding Information Base) lookup.
pub routes: RcuCell<RouteTable>,
/// Firewall rules (iptables/nftables equivalent).
/// Rules are scoped to this namespace only.
///
/// **RCU-protected**: Rule evaluation is on the per-packet filter path.
/// Same pattern: lock-free RCU reads, clone-and-swap writes on rule update.
/// Linux uses RCU for netfilter rule traversal.
pub firewall: RcuCell<FirewallRules>,
/// Mutex for serializing configuration mutations (interface add/remove,
/// route updates, firewall rule changes). Only the write side holds this;
/// packet-path readers never touch it. Separating the write-side lock from
/// the read-side RCU ensures that configuration changes do not block
/// packet processing.
pub config_lock: Mutex<()>,
/// Port allocation bitmap (per-namespace).
/// Allows the same port number to be bound in different namespaces.
/// Mutex is correct here: port allocation happens on bind()/connect(),
/// not on the per-packet path.
pub port_allocator: Mutex<PortAllocator>,
/// Capability to this network stack (for delegation).
/// Processes in this namespace implicitly hold this capability.
pub stack_cap: Capability<NetStack>,
}
/// Fixed-size interface name (matching Linux IFNAMSIZ = 16).
/// Prevents unbounded heap allocation and OOM attacks via long names.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct InterfaceName([u8; 16]);
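Construction of such a fixed-size name typically validates the length and copies into the array; a minimal sketch (the `new` constructor and its error type are illustrative, not part of the spec):

```rust
/// Fixed-size interface name (IFNAMSIZ = 16, including the NUL terminator).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct InterfaceName([u8; 16]);

impl InterfaceName {
    /// Validate and copy a name. At most IFNAMSIZ-1 = 15 bytes are accepted
    /// so the trailing NUL always fits; empty names and embedded NULs are
    /// rejected, matching the bounded-allocation goal stated above.
    pub fn new(name: &str) -> Result<Self, ()> {
        let bytes = name.as_bytes();
        if bytes.is_empty() || bytes.len() > 15 || bytes.contains(&0) {
            return Err(());
        }
        let mut buf = [0u8; 16]; // zero-filled: remainder is NUL padding
        buf[..bytes.len()].copy_from_slice(bytes);
        Ok(InterfaceName(buf))
    }

    /// Length of the name up to the first NUL.
    pub fn len(&self) -> usize {
        self.0.iter().position(|&b| b == 0).unwrap_or(16)
    }
}
```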
VETH pairs for inter-namespace connectivity:
A VETH (virtual ethernet) pair connects two namespaces. Creation:
ip link add veth0 type veth peer name veth1
ip link set veth1 netns <target-namespace>
In UmkaOS, this creates two virtual interfaces that are cross-linked:
- veth0 in the caller's namespace
- veth1 in the target namespace
- Packets sent to one end appear on the other (like a virtual patch cable)
Container networking flow:
1. Container runtime creates a new network namespace for the container
2. Creates a VETH pair: one end in host namespace (e.g., veth0), one in container (e.g., eth0)
3. Host end is attached to a bridge (e.g., docker0, cni0) for external connectivity
4. Container end is assigned an IP from the bridge's subnet
5. NAT/masquerading rules on the host allow container → external traffic
6. Port forwarding rules map host ports → container ports
- CLONE_NEWNS (Mount Namespace): Creates a private copy of the VFS mount tree for the process. Changes to this tree do not affect the parent domain unless explicitly marked shared.
- CLONE_NEWUTS (UTS Namespace): Creates an isolated hostname/domainname state. Stored as a reference-counted UtsNamespace struct in the process's NamespaceSet (see Section 16.1.2).
- CLONE_NEWIPC (IPC Namespace): Isolates System V IPC objects and POSIX message queues.
- CLONE_NEWUSER (User Namespace): Creates a new UID/GID mapping table within the Capability Domain.
- CLONE_NEWTIME (Time Namespace): Creates isolated offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. The container sees its own "boot time" starting from zero, independent of the host's actual boot time. The TimeNamespace struct with offset fields is defined in Section 16.1.2 below.
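The VETH "virtual patch cable" described above can be modeled in a few lines: each endpoint holds a weak link to its peer, and transmit on one end is delivered to the other's receive queue. This is a toy userspace model, not the umka-net implementation:

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

/// Toy model of one VETH endpoint: frames transmitted here are delivered to
/// the peer's receive queue, like a virtual patch cable between namespaces.
pub struct VethEnd {
    pub name: String,
    pub rx_queue: RefCell<Vec<Vec<u8>>>,
    peer: RefCell<Weak<VethEnd>>, // Weak avoids an Rc reference cycle
}

impl VethEnd {
    fn new(name: &str) -> Rc<VethEnd> {
        Rc::new(VethEnd {
            name: name.to_string(),
            rx_queue: RefCell::new(Vec::new()),
            peer: RefCell::new(Weak::new()),
        })
    }

    /// Transmit: the frame appears on the peer's receive queue, if the peer
    /// (the other namespace's end) still exists.
    pub fn transmit(&self, frame: &[u8]) {
        if let Some(peer) = self.peer.borrow().upgrade() {
            peer.rx_queue.borrow_mut().push(frame.to_vec());
        }
    }
}

/// Create a cross-linked pair, as `ip link add veth0 type veth peer name veth1`.
pub fn veth_pair(a: &str, b: &str) -> (Rc<VethEnd>, Rc<VethEnd>) {
    let (e0, e1) = (VethEnd::new(a), VethEnd::new(b));
    *e0.peer.borrow_mut() = Rc::downgrade(&e1);
    *e1.peer.borrow_mut() = Rc::downgrade(&e0);
    (e0, e1)
}
```

Deleting one end (dropping its Rc) leaves the survivor's transmits silently discarded, mirroring the kernel behavior where destroying one VETH end tears down the link.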
16.1.2 Namespace Implementation
Namespaces are implemented entirely within the umka-compat layer. The core microkernel (umka-core) is unaware of namespaces; it only understands Capability Domains and object access rights.
pub struct NamespaceSet {
/// PID translation table (Local PID -> Global Task ID).
///
/// Radix-tree-based integer ID allocator (IDR), same structure Linux uses
/// in `pid_namespace` (`struct idr`). Provides O(1) average-case lookup
/// (1-2 cache-line accesses) on the `kill()`, `waitpid()`, and `/proc/[pid]`
/// hot paths. Reads are RCU-protected (lock-free under `rcu_read_lock`);
/// writes (fork/exit) are serialized by the Idr's internal SpinLock.
/// Integrated next-ID allocation eliminates a separate PID counter.
pub pid_map: Idr<TaskId>,
/// Pending PID namespace for future children (set by setns(CLONE_NEWPID)).
/// When set, fork()/clone() creates children in this namespace rather than
/// the current process's PID namespace. The process's own PID is unchanged.
pub pending_pid_ns: Option<Arc<PidNamespace>>,
/// Mount namespace containing the mount tree, mount hash table,
/// and all mount metadata for this process's VFS view.
/// See Section 13.2 (13-vfs.md) for `MountNamespace` definition.
pub mount_ns: Arc<MountNamespace>,
/// Network stack instance capability
pub net_stack: Capability<NetStack>,
/// UTS namespace state (hostname, domainname).
pub uts_state: Arc<UtsNamespace>,
/// IPC namespace (SysV semaphores, message queues, shared memory).
pub ipc_state: Arc<IpcNamespace>,
/// Cgroup namespace (cgroup root view).
pub cgroup_state: Arc<CgroupNamespace>,
/// Time namespace offsets (CLOCK_MONOTONIC, CLOCK_BOOTTIME).
pub time_state: Arc<TimeNamespace>,
/// Pending time namespace for future children (set by setns(CLONE_NEWTIME)).
/// When set, fork()/clone() creates children with the target time offsets.
/// Follows Linux 5.8+ semantics where CLONE_NEWTIME affects children only.
pub pending_time_ns: Option<Arc<TimeNamespace>>,
/// User namespace governing UID/GID mappings and capability scope.
pub user_ns: Arc<UserNamespace>,
/// IMA namespace (per-container integrity measurement policy and log).
/// Created alongside the user namespace. See Section 8.4.3 for ImaNamespace struct.
pub ima_state: Arc<ImaNamespace>,
}
/// PID namespace. Each namespace has its own PID number space: a process visible
/// in a child namespace has a different pid_t than in the parent namespace.
///
/// # Nesting
/// PID namespaces form a tree. The root (init) namespace is the global root.
/// A process in namespace N with pid=5 may appear as pid=105 in namespace N's parent.
/// Translation traverses `parent` pointers up the tree.
///
/// # PID allocation
/// Each namespace allocates PIDs from an IDR (integer allocation map). The global PID
/// (used internally in the kernel) is always allocated from the root namespace.
/// Every namespace in the path from root to the process's namespace gets one entry
/// in the translation map.
///
/// # /proc visibility
/// `/proc/[pid]` uses the PID from the reading process's namespace, not the global
/// TaskId. A process reading `/proc/5/status` in namespace N will see the task whose
/// local pid_t in N equals 5; the same task may have a different pid_t in the parent
/// namespace. If no task with that local pid exists in the reader's namespace,
/// the entry is absent from `/proc`.
pub struct PidNamespace {
/// Unique namespace identifier (for /proc/self/ns/pid).
pub ns_id: u64,
/// Parent namespace. `None` only for the root PID namespace.
pub parent: Option<Arc<PidNamespace>>,
/// Nesting level. Root = 0; maximum = 32 (matches Linux MAX_PID_NS_LEVEL).
pub level: u32,
/// PID allocation map for this namespace level.
/// Key: pid_t value in this namespace (allocated by IDR); Value: global TaskId.
///
/// IDR (integer-ID radix-tree allocator) provides O(log n) pid allocation with
/// RCU read-side protection. `pid_lookup()` is lock-free on the read path
/// (kill(), waitpid(), /proc/[pid] traversal). Writes (fork/exit) are serialized
/// by the Idr's internal SpinLock. Integrated next-ID allocation eliminates a
/// separate PID counter and separate "find a free PID" logic.
pub pid_map: Idr<TaskId>,
/// Reverse map: global TaskId → local pid_t in this namespace.
/// Used by `pid_nr()` (TaskId → local pid_t), called on signal delivery,
/// `/proc/[pid]` traversal, and `waitpid()` — all hot paths.
///
/// RCU-protected read path: `pid_nr()` takes only an RCU read guard (~1-3 cycles,
/// no spinning). Write path: insert at fork, remove at exit — both serialized by
/// `pid_map`'s existing SpinLock (held anyway for IDR allocation/deallocation).
/// This eliminates the separate `SpinLock<HashMap>` that previously serialized
/// every signal delivery on the read path.
///
/// Implementation: sparse radix tree (same Idr structure as pid_map) keyed on
/// the lower 32 bits of TaskId. TaskId is a global monotonic counter, so two
/// live tasks can only collide after more than 2^32 task creations since boot.
pub reverse_map: RcuIdr<u32>,
/// Maximum PID value in this namespace (default: 4,194,304 = PID_MAX_LIMIT).
/// Reduced-max namespaces allow container runtimes to limit PID exhaustion attacks.
pub pid_max: u32,
/// Number of active tasks in this namespace.
pub nr_tasks: AtomicU32,
}
/// Translates a global `TaskId` to the local pid_t visible in `ns`.
/// Returns `None` if the task is not visible in `ns` (created in a sibling namespace).
///
/// Hot path: called on signal delivery, /proc traversal, waitpid(). Uses RCU
/// read-side guard — no spinning, no lock acquisition.
pub fn pid_nr(task_id: TaskId, ns: &PidNamespace) -> Option<u32> {
    let guard = rcu_read_lock();
    // TaskId is a plain u64 alias; the reverse map is keyed on its lower 32 bits.
    ns.reverse_map.lookup(task_id as u32, &guard)
}
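The two-map translation scheme can be illustrated with plain HashMaps standing in for the Idr/RcuIdr structures; `PidNsModel` is a userspace toy, but the lookup logic mirrors the description above:

```rust
use std::collections::HashMap;

/// Userspace model of one PID namespace level. HashMaps stand in for the
/// Idr / RcuIdr structures; the translation logic is the same.
pub struct PidNsModel {
    pid_map: HashMap<u32, u64>,     // local pid_t -> global TaskId
    reverse_map: HashMap<u64, u32>, // global TaskId -> local pid_t
}

impl PidNsModel {
    pub fn new() -> Self {
        PidNsModel { pid_map: HashMap::new(), reverse_map: HashMap::new() }
    }

    /// fork() bookkeeping: the task gets one local pid at this level.
    /// (A task in a nested namespace is inserted at every level from the
    /// root down to its own namespace.)
    pub fn insert(&mut self, local_pid: u32, task: u64) {
        self.pid_map.insert(local_pid, task);
        self.reverse_map.insert(task, local_pid);
    }

    /// pid_nr(): local pid visible in this namespace, None if not visible.
    pub fn pid_nr(&self, task: u64) -> Option<u32> {
        self.reverse_map.get(&task).copied()
    }

    /// pid_lookup(): resolve a local pid back to the global TaskId.
    pub fn pid_lookup(&self, local_pid: u32) -> Option<u64> {
        self.pid_map.get(&local_pid).copied()
    }
}
```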
/// UTS namespace state.
///
/// Hostname and domainname are read on every `uname()` syscall (glibc calls
/// this once per process, but short-lived processes — container health checks,
/// shell scripts — call it frequently). Writes (`sethostname`, `setdomainname`)
/// are rare (typically once at container creation). RCU gives lock-free reads.
pub struct UtsNamespace {
/// Current hostname and domainname. Read lock-free via RCU on the
/// `uname()` path; updated via clone-and-swap under `update_lock`.
pub strings: RcuPtr<Arc<UtsStrings>>,
/// Serializes `sethostname()` / `setdomainname()` updates.
pub update_lock: Mutex<()>,
}
/// UTS string pair (hostname + domainname). Immutable once published;
/// updates create a new `UtsStrings` and swap the RCU pointer.
pub struct UtsStrings {
/// Hostname (max 64 bytes, NUL-terminated).
pub hostname: [u8; 65],
/// NIS domain name (max 64 bytes, NUL-terminated).
pub domainname: [u8; 65],
}
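The clone-and-swap discipline can be sketched in portable Rust with `RwLock<Arc<..>>` standing in for `RcuPtr` (readers take a cheap Arc clone instead of an RCU read guard; writers publish a whole new `UtsStrings`). A sketch under that substitution:

```rust
use std::sync::{Arc, Mutex, RwLock};

pub struct UtsStrings {
    pub hostname: [u8; 65],
    pub domainname: [u8; 65],
}

/// Copy a string into a NUL-padded fixed buffer (caller guarantees <= 64 bytes).
fn fill(buf: &mut [u8; 65], s: &str) {
    let b = s.as_bytes();
    assert!(b.len() <= 64);
    buf[..b.len()].copy_from_slice(b);
}

/// Userspace stand-in for the RCU pattern: RwLock<Arc<T>> in place of RcuPtr.
pub struct UtsNamespace {
    strings: RwLock<Arc<UtsStrings>>,
    update_lock: Mutex<()>,
}

impl UtsNamespace {
    pub fn new(hostname: &str) -> Self {
        let mut h = [0u8; 65];
        fill(&mut h, hostname);
        UtsNamespace {
            strings: RwLock::new(Arc::new(UtsStrings { hostname: h, domainname: [0; 65] })),
            update_lock: Mutex::new(()),
        }
    }

    /// uname() read path: snapshot the current strings.
    pub fn snapshot(&self) -> Arc<UtsStrings> {
        self.strings.read().unwrap().clone()
    }

    /// sethostname() write path: clone-and-swap under the update lock.
    pub fn sethostname(&self, name: &str) {
        let _g = self.update_lock.lock().unwrap();
        let old = self.snapshot();
        let mut h = [0u8; 65];
        fill(&mut h, name);
        let fresh = Arc::new(UtsStrings { hostname: h, domainname: old.domainname });
        *self.strings.write().unwrap() = fresh;
    }
}
```

A reader's snapshot stays valid after a concurrent `sethostname`, which is the analogue of an RCU grace period keeping the old copy alive for in-flight readers.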
/// IPC namespace state (SysV IPC objects).
///
/// Integer-keyed maps (SysV IPC) use `Idr<T>` — a radix-tree-based integer ID
/// allocator matching Linux's `struct idr` in `ipc/util.c`. SysV IPC IDs are
/// small integers assigned sequentially, ideal for radix indexing: O(1) lookup
/// vs O(log n) for tree-based maps. Reads are RCU-protected (lock-free under
/// `rcu_read_lock`); writes are serialized by the Idr's internal SpinLock.
/// Integrated ID allocation provides atomically-assigned unique IDs per type.
///
/// POSIX message queues use string keys (`/queue_name`), so `RwLock<BTreeMap>`
/// is appropriate — O(log n) by queue count, and `mq_open()` is a cold path
/// (called once at setup, not per-message). Linux uses a filesystem mount for
/// POSIX mqueues; the BTreeMap serves as the namespace directory equivalent.
pub struct IpcNamespace {
/// SysV shared memory segments (shmid -> segment).
pub shm_segments: Idr<Arc<ShmSegment>>,
/// SysV semaphore sets (semid -> semaphore set).
pub sem_sets: Idr<Arc<SemSet>>,
/// SysV message queues (msqid -> queue).
pub msg_queues: Idr<Arc<MsgQueue>>,
/// POSIX message queues (name -> queue). String-keyed; BTreeMap is
/// appropriate since mq_open() is a setup-time operation, not per-message.
pub posix_mqueues: RwLock<BTreeMap<String, Arc<PosixMqueue>>>,
}
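The integrated allocate-and-insert behavior the Idr provides can be mimicked with a `BTreeMap` (a functional stand-in only; the real Idr is a radix tree with RCU read-side protection):

```rust
use std::collections::BTreeMap;

/// Functional stand-in for Idr<T>: allocates the lowest free integer id at
/// insert time, like SysV IPC id assignment. (Real Idr: radix tree + RCU.)
pub struct ToyIdr<T> {
    slots: BTreeMap<u32, T>,
}

impl<T> ToyIdr<T> {
    pub fn new() -> Self {
        ToyIdr { slots: BTreeMap::new() }
    }

    /// Allocate the lowest free id and store `value` there.
    pub fn alloc(&mut self, value: T) -> u32 {
        let mut id = 0;
        // Keys iterate in sorted order; the first gap is the lowest free id.
        for &k in self.slots.keys() {
            if k != id { break; }
            id += 1;
        }
        self.slots.insert(id, value);
        id
    }

    pub fn lookup(&self, id: u32) -> Option<&T> {
        self.slots.get(&id)
    }

    pub fn remove(&mut self, id: u32) -> Option<T> {
        self.slots.remove(&id)
    }
}
```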
/// SysV shared memory segment (shmget/shmat/shmctl).
pub struct ShmSegment {
/// Unique key (from shmget; IPC_PRIVATE = 0 means anonymous).
pub key: i32,
/// Segment identifier (returned by shmget).
pub shmid: u32,
/// Size in bytes (rounded up to page boundary at creation).
pub size: usize,
/// Physical pages backing this segment (reference-counted).
pub pages: Arc<PhysPages>,
/// Owner UID/GID at creation time.
pub uid: u32,
pub gid: u32,
/// Permission mode bits (lower 9 bits, like file mode).
pub mode: u16,
/// Attachment count (number of active shmat() mappings).
pub nattach: AtomicU32,
/// Creation time and last attach/detach timestamps (monotonic nanoseconds).
pub ctime: u64,
pub atime: u64,
pub dtime: u64,
}
/// SysV semaphore set (semget/semop/semctl).
pub struct SemSet {
pub key: i32,
pub semid: u32,
/// Number of semaphores in this set (1–SEMMSL; Linux default SEMMSL=32000).
pub nsems: u16,
/// Semaphore values (one per semaphore in the set).
pub sems: Box<[AtomicI16]>,
/// Undo table: per-task pending undos (restored on task exit).
/// SpinLock because semop undo list is accessed in task-exit path.
pub undo_list: SpinLock<Vec<SemUndo>>,
pub uid: u32,
pub gid: u32,
pub mode: u16,
pub ctime: u64,
pub otime: u64,
}
/// SysV message queue (msgget/msgsnd/msgrcv).
pub struct MsgQueue {
pub key: i32,
pub msqid: u32,
/// Messages stored in the queue (FIFO order).
/// Bounded by `msg_qbytes` (max queue size in bytes).
pub messages: VecDeque<SysVMessage>,
/// Current total size of all messages in bytes.
pub current_bytes: usize,
/// Maximum bytes in queue (default: MSGMNB = 65536).
pub max_bytes: usize,
/// Lock protects messages, current_bytes, send_wait, recv_wait.
pub lock: SpinLock<()>,
/// Tasks waiting to send (queue full).
pub send_wait: WaitQueue,
/// Tasks waiting to receive (queue empty or no matching type).
pub recv_wait: WaitQueue,
pub uid: u32,
pub gid: u32,
pub mode: u16,
pub stime: u64, // last msgsnd time (monotonic nanoseconds)
pub rtime: u64, // last msgrcv time (monotonic nanoseconds)
pub ctime: u64,
}
/// A single SysV message.
pub struct SysVMessage {
/// Message type (from msgsnd mtype; must be > 0).
pub mtype: i64,
/// Message data (zero-copy: Box<[u8]> avoids double-allocation).
pub data: Box<[u8]>,
}
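msgrcv(2) selects a message by its `msgtyp` argument: 0 returns the FIFO head, a positive value returns the first message of exactly that type, and a negative value returns the message with the lowest type not exceeding |msgtyp|. That selection rule can be written directly over the deque; a sketch assuming the `SysVMessage` layout above (`msgrcv_pick` is an illustrative helper):

```rust
use std::collections::VecDeque;

pub struct SysVMessage {
    pub mtype: i64,
    pub data: Box<[u8]>,
}

/// Index of the message msgrcv(2) would return for `msgtyp`:
///   0  -> first message in FIFO order
///   >0 -> first message whose mtype == msgtyp
///   <0 -> message with the lowest mtype <= |msgtyp| (FIFO among ties)
pub fn msgrcv_pick(queue: &VecDeque<SysVMessage>, msgtyp: i64) -> Option<usize> {
    if msgtyp == 0 {
        return if queue.is_empty() { None } else { Some(0) };
    }
    if msgtyp > 0 {
        return queue.iter().position(|m| m.mtype == msgtyp);
    }
    // Negative selector: lowest mtype <= |msgtyp|; earliest wins on ties.
    let bound = msgtyp.checked_neg().unwrap_or(i64::MAX);
    queue
        .iter()
        .enumerate()
        .filter(|(_, m)| m.mtype <= bound)
        .min_by_key(|(i, m)| (m.mtype, *i))
        .map(|(i, _)| i)
}
```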
/// POSIX message queue (mq_open/mq_send/mq_receive; mqueue filesystem).
/// Linux-compatible: /dev/mqueue filesystem, mq_notify(3) supported.
pub struct PosixMqueue {
/// Queue name (from mq_open; unique within the mqueue namespace).
pub name: ArrayString<256>,
/// Attributes: max messages, max message size, current count.
pub attr: MqueueAttr,
/// Priority queue: messages ordered by descending priority, then FIFO within
/// equal priority. Stored as a BinaryHeap; see `PosixMessage::cmp` for the
/// ordering that provides POSIX-required FIFO stability at equal priority.
pub queue: BinaryHeap<PosixMessage>,
/// Monotonically increasing sequence counter. Assigned to each message on
/// mq_send() to provide stable FIFO ordering within equal-priority messages.
/// `Ordering::Relaxed` is sufficient: the SpinLock below provides the
/// happens-before edge; seq is only compared within the same queue.
pub next_seq: AtomicU64,
/// Lock protecting queue, next_seq, and waiters.
pub lock: SpinLock<()>,
/// Tasks blocked on mq_receive (queue empty).
pub recv_waiters: WaitQueue,
/// Tasks blocked on mq_send (queue full).
pub send_waiters: WaitQueue,
/// Notification registration (mq_notify).
pub notify: Option<MqueueNotify>,
pub uid: u32,
pub gid: u32,
pub mode: u16,
}
pub struct MqueueAttr {
/// Maximum number of messages (mq_maxmsg; default 10, max 65536).
pub maxmsg: u32,
/// Maximum message size in bytes (mq_msgsize; default 8192, max 1MB).
pub msgsize: u32,
/// Current number of messages in the queue.
pub curmsgs: u32,
}
/// A POSIX message queue message. Ordering is by descending priority, then
/// ascending sequence number (FIFO within equal priority), as required by
/// POSIX.1-2017 mq_receive(3).
///
/// # Ordering (BinaryHeap is a max-heap)
/// ```rust
/// impl Ord for PosixMessage {
/// fn cmp(&self, other: &Self) -> Ordering {
/// self.priority.cmp(&other.priority)
/// .then(other.seq.cmp(&self.seq)) // reverse seq: lower seq = older = wins
/// }
/// }
/// ```
/// A higher priority beats a lower priority. Within the same priority, the
/// message with the smaller `seq` (sent earlier) has a *larger* `Ord` value
/// so the max-heap dequeues it first — preserving FIFO order.
pub struct PosixMessage {
/// Priority (0–MQ_PRIO_MAX-1; higher = delivered first).
pub priority: u32,
/// Per-queue send sequence number. Assigned from `PosixMqueue::next_seq`
/// at mq_send() time. Breaks ties within equal-priority messages: lower
/// seq means the message was sent earlier and must be dequeued first.
pub seq: u64,
pub data: Box<[u8]>,
}
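The ordering can be checked end to end by pushing messages into a `BinaryHeap` and draining it; this self-contained version spells out the trait impls (with `PartialEq` defined from `cmp` so equality stays consistent with the ordering):

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

pub struct PosixMessage {
    pub priority: u32,
    pub seq: u64,
    pub data: Box<[u8]>,
}

impl Ord for PosixMessage {
    // Max-heap: higher priority wins; within equal priority the *lower* seq
    // (sent earlier) compares greater so it is dequeued first (POSIX FIFO).
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority
            .cmp(&other.priority)
            .then(other.seq.cmp(&self.seq))
    }
}

impl PartialOrd for PosixMessage {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for PosixMessage {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for PosixMessage {}

/// Drain the heap in delivery order, returning (priority, seq) pairs.
pub fn delivery_order(mut heap: BinaryHeap<PosixMessage>) -> Vec<(u32, u64)> {
    let mut out = Vec::new();
    while let Some(m) = heap.pop() {
        out.push((m.priority, m.seq));
    }
    out
}
```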
/// Cgroup namespace state.
pub struct CgroupNamespace {
/// Root cgroup directory visible to processes in this namespace.
/// Processes see this as "/" in /sys/fs/cgroup.
pub cgroup_root: Arc<Cgroup>,
}
/// Time namespace state (Linux 5.6+).
pub struct TimeNamespace {
/// Offset added to CLOCK_MONOTONIC for processes in this namespace.
/// Allows containers to see a "boot time" starting from 0.
pub monotonic_offset_ns: AtomicI64,
/// Offset added to CLOCK_BOOTTIME for processes in this namespace.
pub boottime_offset_ns: AtomicI64,
}
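Applying the offsets is one signed addition on the clock read path; a sketch, where the host monotonic reading is passed in as a plain value (`clock_monotonic_ns` and `offset_for_fresh_boot` are illustrative helpers):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

pub struct TimeNamespace {
    pub monotonic_offset_ns: AtomicI64,
    pub boottime_offset_ns: AtomicI64,
}

/// clock_gettime(CLOCK_MONOTONIC) read path: the host reading plus the
/// per-namespace offset. A negative offset makes the container's clock
/// start near zero even when the host has been up for a long time.
pub fn clock_monotonic_ns(host_monotonic_ns: i64, ns: &TimeNamespace) -> i64 {
    host_monotonic_ns + ns.monotonic_offset_ns.load(Ordering::Relaxed)
}

/// Offset chosen at namespace creation so the container sees t = 0 "at boot".
pub fn offset_for_fresh_boot(host_monotonic_ns: i64) -> i64 {
    -host_monotonic_ns
}
```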
16.1.3 Container Root Filesystem: pivot_root(2)
Container runtimes (runc, containerd, crun) require a mechanism to change the root filesystem
after setting up the mount namespace. UmkaOS implements the standard pivot_root(2) syscall:
/// pivot_root(new_root: &CStr, put_old: &CStr) -> Result<()>
///
/// Atomically swaps the root mount with another mount point. Required for
/// OCI-compliant container creation.
///
/// # Prerequisites (checked by syscall)
/// - new_root must be a mount point
/// - put_old must be at or under new_root
/// - Caller must be in a mount namespace (CLONE_NEWNS or unshare(CLONE_NEWNS))
/// - Caller must have CAP_SYS_ADMIN in its user namespace
///
/// # Operation
/// 1. Attach new_root to the root of the mount namespace
/// 2. Move the old root to put_old
/// 3. The process's root directory is now new_root
/// 4. Subsequent umount(put_old) removes the old root from the namespace
///
/// # Container Runtime Usage
/// ```
/// // Standard OCI container creation sequence:
/// unshare(CLONE_NEWNS); // New mount namespace
/// mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL); // Make all private
/// mount("/var/lib/container/rootfs", "/var/lib/container/rootfs",
/// NULL, MS_BIND | MS_REC, NULL); // Bind-mount rootfs onto itself
/// pivot_root("/var/lib/container/rootfs", "/var/lib/container/rootfs/.oldroot");
/// chdir("/"); // Ensure we're in new root
/// umount2("/.oldroot", MNT_DETACH); // Detach old root
/// // Process now sees container rootfs as /
/// ```
///
/// # Difference from chroot(2)
/// pivot_root is fundamentally different from chroot:
/// - chroot only affects the process's view of the root directory
/// - pivot_root actually moves the mount point, affecting all processes in the namespace
/// - chroot can be escaped via mount namespace tricks; pivot_root cannot
/// - Container runtimes MUST use pivot_root for secure isolation
///
/// # Error codes
/// - EBUSY: new_root or put_old is on the current root mount
/// - EINVAL: new_root is not a mount point, put_old is not at or underneath
///   new_root, or the current root mount has MS_SHARED propagation
/// - ENOENT: path component does not exist
/// - ENOTDIR: path component is not a directory
/// - EPERM: Caller lacks CAP_SYS_ADMIN in its user namespace
/// - ENOSYS: Not implemented (will not occur in UmkaOS)
SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, const char __user *, put_old)
Interaction with other namespaces:
- pivot_root operates on the caller's mount namespace
- The root change is visible to all processes sharing that mount namespace
- Combined with PID namespace: the container's init (PID 1) sees only the new root
- Combined with User namespace: unprivileged processes can pivot_root within their
own user namespace if they have CAP_SYS_ADMIN there
Implementation notes:
The VFS layer (Section 13.1) handles the mount tree manipulation. The Mount struct,
MountNamespace, mount hash table, and the complete pivot_root algorithm using
these types are defined in Section 13.2 (13-vfs.md). The summary below is
retained for context; the authoritative specification is Section 13.2.11.2.
1. Lookup new_root and verify it's a mount point
2. Lookup put_old and verify it's under new_root
3. Lock the mount tree for modification (holds mount_lock)
4. Detach the current root from the namespace's mount list
5. Attach new_root as the new namespace root
6. Reattach the old root at the put_old position
7. Publish the new root via RCU: rcu_assign_pointer(namespace->root, new_root)
8. Unlock the mount tree
Atomicity with respect to path lookups:
Steps 4–6 are performed while holding mount_lock, and the old root pointer remains valid in the RCU-published slot until step 7 overwrites it. Path lookups (open(), stat(), readlink(), etc.) take an RCU read-side reference to the namespace root at the start of lookup via rcu_dereference(namespace->root). This ensures:
- In-flight path lookups that started before pivot_root complete with the old root (consistent view)
- New path lookups that start after step 7 see the new root
- No path lookup can see a partially-updated state (no torn reads, no null pointer)
- Between steps 4–6, the data structures are modified under mount_lock, but lookups still see the old root via RCU
The RCU grace period after step 7 ensures that by the time umount(put_old) completes, no in-flight lookups hold references to the old root.
16.1.4 Joining Namespaces: setns(2) and nsenter
Container operations like docker exec require joining an existing namespace. UmkaOS implements
setns(2) for this purpose:
/// setns(fd: RawFd, nstype: c_int) -> Result<()>
///
/// Reassociates the calling thread with the namespace referenced by fd.
///
/// # Parameters
/// - fd: File descriptor referring to a namespace (obtained from /proc/[pid]/ns/[type])
/// - nstype: Namespace type (CLONE_NEW* constant) or 0 to auto-detect from fd
///
/// # Prerequisites
/// - Caller must have CAP_SYS_ADMIN in the target namespace's owning user namespace
/// - For PID namespaces: CAP_SYS_ADMIN is still required; the join affects
///   future children only (Linux 3.8+ semantics)
/// - For user namespaces: Caller must not be in a chroot environment
/// - The namespace must still exist (owning process hasn't exited)
///
/// # Container Runtime Usage (docker exec)
/// ```
/// // Join a running container's namespaces:
/// int fd = open("/proc/[container_pid]/ns/mnt", O_RDONLY | O_CLOEXEC);
/// setns(fd, CLONE_NEWNS); // Join mount namespace
/// close(fd);
///
/// fd = open("/proc/[container_pid]/ns/net", O_RDONLY | O_CLOEXEC);
/// setns(fd, CLONE_NEWNET); // Join network namespace
/// close(fd);
///
/// // setns(fd, CLONE_NEWPID) only affects future children; the caller's
/// // own PID namespace cannot change, so the runtime forks before exec
/// // exec() into container: now running in container's namespaces
/// execve("/bin/sh", ["/bin/sh"], envp);
/// ```
///
/// # User Namespace Ordering (Implementation Detail)
/// When a single setns() call joins a user namespace alongside other namespace types,
/// joining the user namespace first is required internally because it changes the
/// caller's effective capabilities. UmkaOS handles this transparently: the kernel
/// internally processes any user namespace transition before other namespace
/// transitions in the same call sequence, regardless of the order in which the
/// caller passes fds. The caller may pass namespace fds in any order; UmkaOS's
/// implementation reorders them as needed. This matches Linux behavior, where
/// `nsenter --all` and `unshare` may call setns() in arbitrary order without error.
///
/// # Namespace file descriptors
/// Each namespace type is exposed via /proc/[pid]/ns/:
/// ```
/// /proc/[pid]/ns/cgroup → Cgroup namespace
/// /proc/[pid]/ns/ipc → IPC namespace
/// /proc/[pid]/ns/mnt → Mount namespace
/// /proc/[pid]/ns/net → Network namespace
/// /proc/[pid]/ns/pid → PID namespace (current)
/// /proc/[pid]/ns/pid_for_children → PID namespace for future children (after setns)
/// /proc/[pid]/ns/time → Time namespace (Linux 5.6+)
/// /proc/[pid]/ns/time_for_children → Time namespace for future children (Linux 5.8+)
/// /proc/[pid]/ns/user → User namespace
/// /proc/[pid]/ns/uts → UTS namespace
/// ```
///
/// The `*_for_children` symlinks reveal the pending namespace set by
/// `setns(CLONE_NEWPID)` or `setns(CLONE_NEWTIME)`. They differ from the
/// main symlinks when a process has called `setns()` but not yet forked.
/// Container introspection tools (`lsns`, `nsenter --target`) use these.
///
/// These are magic links: reading them returns the namespace type, and
/// opening them gives a file descriptor that can be passed to setns().
///
/// # Error codes
/// - EBADF: Invalid fd
/// - EINVAL: fd does not refer to a namespace, nstype doesn't match fd type,
/// or (for PID namespace) caller has other threads in its thread group
/// - EPERM: Caller lacks CAP_SYS_ADMIN in target namespace's user namespace
/// - ENOENT: Namespace has been destroyed
SYSCALL_DEFINE2(setns, int, fd, int, nstype)
PID namespace special case:
A process cannot change its own PID namespace via setns() — the process's PID in its
original namespace remains unchanged. However, setns(fd, CLONE_NEWPID) is valid
since Linux 3.8: it sets the PID namespace for future children created by
fork()/clone(). The caller's own PID view is unchanged, but newly created children
will be in the target PID namespace.
This is why nsenter forks after joining a PID namespace (it does so by default;
--no-fork suppresses it): it joins the other namespaces via setns(), sets the
target PID namespace for children, then forks a child that inherits all joined
namespaces and has the correct PID view.
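The deferral semantics reduce to: `setns(CLONE_NEWPID)` records a pending target, and `fork()` consults it. A toy model (the names echo `NamespaceSet`, but the types here are illustrative):

```rust
/// Toy model of the setns(CLONE_NEWPID) deferral: the caller keeps its own
/// PID namespace; only children created after the call land in the target.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct NsId(pub u64);

pub struct Task {
    pub pid_ns: NsId,
    pub pending_pid_ns: Option<NsId>,
}

/// setns(fd, CLONE_NEWPID): record the target, do not move the caller.
pub fn setns_newpid(task: &mut Task, target: NsId) {
    task.pending_pid_ns = Some(target);
}

/// fork(): the child's namespace is the pending target if one was set,
/// otherwise the parent's own; the pending slot does not inherit further.
pub fn fork(parent: &Task) -> Task {
    Task {
        pid_ns: parent.pending_pid_ns.unwrap_or(parent.pid_ns),
        pending_pid_ns: None,
    }
}
```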
TOCTOU safety: setns() acquires a reference count on the target namespace before
validating it, then holds that reference across the join operation. The namespace cannot
be destroyed while setns() holds its reference — this prevents the use-after-free TOCTOU
that would otherwise exist between checking namespace validity and joining it. The reference
is released after the join completes or if validation fails.
Implementation:
fn sys_setns(fd: RawFd, nstype: c_int) -> Result<()> {
let file = current_task().fd_table.get(fd)?;
let ns_inode = file.inode.downcast_ref::<NsInode>()
.ok_or(Errno::EINVAL)?;
// Verify nstype matches (if specified)
if nstype != 0 && ns_inode.nstype != nstype {
return Err(Errno::EINVAL);
}
// Internal reordering: if this fd is for a user namespace and the caller has
// already registered other namespace fds in this setns() sequence, the user
// namespace transition is applied first before those pending transitions are
// processed. This is an implementation detail — the caller may pass fds in
// any order (Linux-compatible API). No EINVAL is returned for ordering.
// Check CAP_SYS_ADMIN in target namespace's user namespace
let target_user_ns = ns_inode.namespace.user_ns.upgrade().ok_or(Errno::ENOENT)?;
if !has_cap_sys_admin_in(current_task(), &target_user_ns) {
return Err(Errno::EPERM);
}
// Join the namespace (type-specific switch)
match ns_inode.nstype {
CLONE_NEWNS => current_task().ns_state.switch_mount(&ns_inode.namespace),
CLONE_NEWNET => current_task().ns_state.switch_net(&ns_inode.namespace),
CLONE_NEWUTS => current_task().ns_state.switch_uts(&ns_inode.namespace),
CLONE_NEWIPC => current_task().ns_state.switch_ipc(&ns_inode.namespace),
CLONE_NEWUSER => {
// Chroot'd processes cannot join user namespaces (could escape chroot)
if current_task().is_chrooted() {
return Err(Errno::EPERM);
}
current_task().ns_state.switch_user(&ns_inode.namespace)?;
}
CLONE_NEWCGROUP => current_task().ns_state.switch_cgroup(&ns_inode.namespace),
CLONE_NEWTIME => {
// Time namespace affects future children, not the caller (Linux 5.8+ semantics).
// Set pending_time_ns so fork()/clone() children use the target time offsets.
current_task().ns_state.pending_time_ns =
Some(ns_inode.namespace.as_time_ns().expect("TIME namespace"));
}
CLONE_NEWPID => {
// PID namespace affects future children, not the caller.
// Set pending_pid_ns so fork()/clone() creates children in target NS.
current_task().ns_state.pending_pid_ns =
Some(ns_inode.namespace.as_pid_ns().expect("PID namespace"));
}
_ => return Err(Errno::EINVAL),
}
Ok(())
}
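The type-specific switch above splits namespace types into two join modes: most types take effect immediately for the caller, while PID and TIME only affect future children. A minimal userspace-testable sketch of that classification (the `JoinMode` enum and the `join_mode` helper are illustrative, not the kernel's actual types; the constants mirror Linux's CLONE_NEW* bitflags):

```rust
// Illustrative sketch: how sys_setns dispatches each namespace type.
// Constant values mirror Linux's CLONE_NEW* flags.
const CLONE_NEWTIME: u64 = 0x0000_0080;
const CLONE_NEWNS: u64 = 0x0002_0000;
const CLONE_NEWCGROUP: u64 = 0x0200_0000;
const CLONE_NEWUTS: u64 = 0x0400_0000;
const CLONE_NEWIPC: u64 = 0x0800_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const CLONE_NEWPID: u64 = 0x2000_0000;
const CLONE_NEWNET: u64 = 0x4000_0000;

#[derive(Debug, PartialEq)]
enum JoinMode {
    /// The caller itself switches namespaces immediately.
    Immediate,
    /// Only children created after setns() land in the target namespace.
    PendingChildren,
    /// Not a namespace flag: setns() returns EINVAL.
    Invalid,
}

fn join_mode(nstype: u64) -> JoinMode {
    match nstype {
        CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC
        | CLONE_NEWUSER | CLONE_NEWCGROUP => JoinMode::Immediate,
        // PID and TIME namespaces apply only to future fork()/clone() children.
        CLONE_NEWPID | CLONE_NEWTIME => JoinMode::PendingChildren,
        _ => JoinMode::Invalid,
    }
}

fn main() {
    assert_eq!(join_mode(CLONE_NEWNET), JoinMode::Immediate);
    assert_eq!(join_mode(CLONE_NEWPID), JoinMode::PendingChildren);
    // CLONE_THREAD (0x00010000) is not a namespace flag.
    assert_eq!(join_mode(0x0001_0000), JoinMode::Invalid);
    println!("ok");
}
```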
16.1.5 Namespace Hierarchy and Inheritance
Namespaces form a hierarchical tree with parent-child relationships. When a process creates a new namespace via clone() or unshare(), the new namespace is a child of the caller's namespace:
Root Namespace (init)
├── PID NS 1 (container A) ← child of root PID NS
│ └── PID NS 1.1 (nested container) ← child of PID NS 1
├── PID NS 2 (container B) ← child of root PID NS
└── User NS 1 (unprivileged container)
└── User NS 1.1 (child of User NS 1)
Parent-child link semantics:
/// Per-namespace-type hierarchy tracking.
pub struct NamespaceHierarchy {
/// Pointer to parent namespace (None for root).
/// The parent reference is weak (Weak<Namespace>) to prevent reference cycles.
/// When the parent is dropped (all processes exited), child namespaces
/// become orphans but remain functional until their own processes exit.
pub parent: Option<Weak<Namespace>>,
/// Children of this namespace (weak references).
/// Weak references allow children to be destroyed independently of the parent.
/// This matches Linux behavior: a parent namespace can be destroyed while
/// children still exist (children become orphans but remain functional).
/// The Vec is cleaned up lazily when iterating (dead Weak refs are removed).
pub children: Mutex<Vec<Weak<Namespace>>>,
/// Namespace type (PID, NET, MNT, UTS, IPC, USER, CGROUP, TIME).
pub ns_type: NamespaceType,
/// Inode number for this namespace (for /proc/PID/ns/*).
/// Generated from a global counter, unique across all namespace types.
pub ns_id: u64,
}
Inheritance rules:
| Namespace Type | Child Inherits | Modification Scope |
|---|---|---|
| CLONE_NEWPID | No (child starts fresh with PID 1) | Child's PID 1 = child init process |
| CLONE_NEWNET | No (child gets isolated network stack) | Child has no interfaces except loopback |
| CLONE_NEWNS | Yes (copy-on-write mount tree) | Child's mounts are private unless marked shared |
| CLONE_NEWUTS | Yes (copies parent's hostname/domainname) | Container runtimes typically overwrite via sethostname() |
| CLONE_NEWIPC | No (child gets empty IPC namespace) | Child has isolated SysV/POSIX IPC |
| CLONE_NEWUSER | No (child starts with empty UID/GID mappings) | Parent must write /proc/PID/uid_map and gid_map to grant subordinate ranges |
| CLONE_NEWCGROUP | No (child gets own cgroup root) | Child's cgroup is a child of caller's cgroup |
| CLONE_NEWTIME | No (child gets zero offsets) | Child's time offsets are independent |
Note on CLONE_NEWUTS: The child namespace initially inherits the parent's hostname and domainname (copy, not reference). Container runtimes (runc, containerd) typically overwrite this immediately with the container ID via sethostname().
Note on CLONE_NEWUSER: A newly created user namespace starts with empty UID/GID mappings — all UIDs/GIDs resolve to nobody/nogroup (65534) until mappings are written to /proc/PID/uid_map and /proc/PID/gid_map by a privileged process in the parent namespace. This is a critical security property: children do not automatically inherit the parent's full UID range. Instead, the parent explicitly grants a subordinate range (typically from /etc/subuid and /etc/subgid).
User namespace nesting limit: User namespaces can be nested to a maximum depth of 32 (matching Linux's compile-time limit). clone() or unshare() with CLONE_NEWUSER returns ENOSPC if the nesting depth would exceed 32. This prevents resource exhaustion attacks via deeply nested namespaces. (Version note: Linux 3.11–4.8 returned EUSERS for this condition; Linux 4.9+ returns ENOSPC. UmkaOS follows the Linux 4.9+ semantics.)
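The mapping grant described above is performed with a plain text write: each line of /proc/PID/uid_map has the form `inner_start outer_start count`. A small sketch of formatting a typical rootless-container grant (the `format_uid_map_line` helper is illustrative; in practice the write is performed by a privileged helper such as newuidmap):

```rust
/// Format one /proc/PID/uid_map (or gid_map) line: "inner outer count".
/// Illustrative helper only; the actual write must come from a process
/// with the appropriate privileges in the parent user namespace.
fn format_uid_map_line(inner_start: u32, outer_start: u32, count: u32) -> String {
    format!("{} {} {}\n", inner_start, outer_start, count)
}

fn main() {
    // Typical grant from /etc/subuid: UID 0..65535 inside the container
    // maps to 100000..165535 in the parent namespace.
    let line = format_uid_map_line(0, 100_000, 65_536);
    assert_eq!(line, "0 100000 65536\n");
    println!("{}", line.trim_end());
}
```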
Namespace reference counting: Each namespace is reference-counted via Arc<Namespace>. A namespace is destroyed when:
1. All processes in the namespace have exited (process count → 0)
2. All file descriptors referring to /proc/PID/ns/* are closed
3. All bind mounts of the namespace file have been unmounted
Note: Namespace destruction is independent of parent/child relationships. A child namespace can outlive its parent (it becomes an orphan but remains functional), and a parent can be destroyed while children exist.
Namespace Destruction Protocol:
A namespace is destroyed when its reference count drops to zero. This occurs
when the last process in the namespace exits (or when the namespace was created
as an orphan with unshare() and the creating process exits without cloning
into it).
Destruction sequence (reverse of creation order):
1. Reference count drops to zero, triggering namespace_put(ns): check refcount.fetch_sub(1) == 1 (last reference). If not: return (namespace still has users).
2. Notify namespace-aware subsystems in reverse creation order. The cleanup callbacks are registered at namespace creation time and called in LIFO (stack) order:
   - Network namespace (netns): tear down all virtual interfaces, routing tables, conntrack tables. Close all sockets bound to this ns.
   - Mount namespace (mntns): umount all mounts in the namespace's mount tree (reverse of mount order). Release all struct Mount references.
   - PID namespace (pidns): send SIGKILL to all remaining processes in the ns. Wait for them to exit (pid namespace cannot be destroyed while it has living processes — init reaping ensures all descendant PIDs are cleaned up).
   - IPC namespace (ipcns): destroy all System V IPC objects (semaphores, message queues, shared memory segments) and POSIX IPC objects.
   - UTS namespace (utsns): free hostname and domainname strings.
   - User namespace (userns): revoke all capabilities granted within the ns.
   - Cgroup namespace (cgroupns): detach from the cgroup hierarchy view.
   - Time namespace (timens): release the time offset record.
3. Free the namespace struct: drop the Arc<Namespace> reference (drop(ns)). The Arc destructor handles deallocation once the refcount reaches zero. At this point all subsystem state has been cleaned up.
PID namespace destruction special case: A PID namespace cannot be destroyed while any process inside it is alive. If the namespace init (PID 1 of the namespace) exits, all other processes in the namespace receive SIGKILL. The namespace destruction waits for all processes to exit before proceeding to step 2.
Mount namespace destruction special case: Lazy unmounts (MNT_DETACH): mounts that were detached but still have open file descriptors remain alive until all fds are closed. The mount namespace destructor marks these as "orphan mounts" — they remain accessible to current holders but new opens are rejected. The last file close triggers the final mount cleanup.
Ordering guarantee: The reverse-creation-order cleanup ensures that inner namespaces (created from within an outer namespace) are cleaned up before their parent resources. This prevents use-after-free in cross-namespace references.
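The LIFO callback registry described above can be modeled in isolation: subsystems push a cleanup callback at namespace creation, and destruction pops and runs them in reverse order. A minimal sketch (all names here are illustrative, not the kernel's actual registry types):

```rust
/// Illustrative model of the LIFO cleanup-callback registry.
struct CleanupStack {
    // Callbacks are pushed in creation order and run in reverse (LIFO).
    callbacks: Vec<Box<dyn FnOnce(&mut Vec<String>)>>,
}

impl CleanupStack {
    fn new() -> Self { Self { callbacks: Vec::new() } }

    /// Register a subsystem's teardown at namespace creation time.
    fn register(&mut self, f: impl FnOnce(&mut Vec<String>) + 'static) {
        self.callbacks.push(Box::new(f));
    }

    /// Run all callbacks in reverse creation order (stack pop).
    fn destroy(mut self, log: &mut Vec<String>) {
        while let Some(cb) = self.callbacks.pop() {
            cb(log);
        }
    }
}

fn main() {
    let mut stack = CleanupStack::new();
    // Creation order: net, then mnt, then pid.
    stack.register(|log| log.push("netns torn down".into()));
    stack.register(|log| log.push("mntns unmounted".into()));
    stack.register(|log| log.push("pidns reaped".into()));

    let mut log = Vec::new();
    stack.destroy(&mut log);
    // Teardown runs in reverse: pid, then mnt, then net.
    assert_eq!(log, vec!["pidns reaped", "mntns unmounted", "netns torn down"]);
    println!("{:?}", log);
}
```

The stack discipline is what gives the ordering guarantee for free: an inner namespace created later is always torn down before the outer resources it references.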
16.1.6 User Namespace UID/GID Mapping Security
User namespaces allow unprivileged users to have "root" (UID 0) within a namespace while mapping to an unprivileged UID outside. This is the foundation of rootless containers.
Security model:
/// A single contiguous range in a UID or GID mapping.
/// Maps `count` IDs starting at `inner_start` (inside namespace) to
/// `outer_start` (in parent namespace).
pub struct IdMapEntry {
pub inner_start: u32,
pub outer_start: u32,
pub count: u32,
}
/// Maximum ID mapping entries per user namespace (matches Linux's limit of 340
/// per /proc/PID/uid_map and /proc/PID/gid_map).
const MAX_ID_MAPPINGS: usize = 340;
/// User namespace: defines UID/GID translation mappings and capability scope.
/// Each user namespace has an owner (the uid in the parent namespace of the
/// process that created it) and an ordered list of ID mappings. Capabilities
/// held by a process are relative to its user namespace — CAP_SYS_ADMIN in
/// a child user namespace does not grant privilege in the parent.
///
/// **Write-once ID mappings (lock-free reads):** Linux enforces that
/// `/proc/PID/uid_map` and `/proc/PID/gid_map` can each be written **exactly
/// once** per user namespace lifetime. UmkaOS mirrors this: `uid_map` and
/// `gid_map` use a write-once-then-frozen model. Before the map is written,
/// all UIDs/GIDs resolve to `nobody`/`nogroup` (65534). After the single
/// write, the map is frozen and all subsequent reads are **lock-free** — a
/// plain pointer dereference to an immutable `IdMapArray`. No RwLock, no
/// atomic RMW, zero contention on the hottest path in the kernel (`stat()`,
/// `open()`, `access()`, `kill()`, every permission check).
///
/// The write path uses `map_lock` to serialize the single write and publish
/// the frozen map via a Release store on the `Arc` pointer. Reads use an
/// Acquire load — on x86 this compiles to a plain `MOV` (TSO).
///
/// /proc/PID/uid_map write restrictions:
/// - Writer must be in the parent user namespace OR have CAP_SETUID in parent
/// - Writer must have CAP_SYS_ADMIN in the target namespace (or be the target)
/// - Mapped UIDs must be valid in the parent namespace
/// - Total mapped range cannot exceed /proc/sys/kernel/uid_max (or configured limit)
/// - Can only be written ONCE; second write returns EPERM
#[repr(C)]
pub struct UserNamespace {
/// Parent user namespace (None for init_user_ns).
parent: Option<Arc<UserNamespace>>,
/// Frozen UID mappings. `None` before `/proc/PID/uid_map` is written
/// (all UIDs resolve to 65534). `Some(...)` after the single write —
/// immutable thereafter. Reads are lock-free (Acquire load on the
/// `Option` discriminant). The `Arc` ensures the backing array lives
/// as long as any namespace referencing it.
///
/// `OnceCell<T>`: write-once cell. `get()` returns `Option<&T>` via
/// Acquire load (lock-free). `set(value)` initializes exactly once
/// (returns `Err` if already set). Equivalent to `std::sync::OnceLock`
/// but `no_std`-compatible.
uid_map: OnceCell<Arc<IdMapArray>>,
/// Frozen GID mappings. Same write-once-then-frozen semantics as uid_map.
gid_map: OnceCell<Arc<IdMapArray>>,
/// Serializes the single write to uid_map/gid_map. Only held during
/// the `/proc/PID/uid_map` write path (once per namespace lifetime).
/// Never contended after initialization.
map_lock: Mutex<()>,
/// Owner's UID in the parent namespace.
owner_uid: u32,
/// Owner's GID in the parent namespace.
owner_gid: u32,
/// Capability set: what capabilities processes in this namespace can hold.
/// Uses SystemCaps (u128) to accommodate UmkaOS-native capabilities in bits 64-127
/// ([Section 8.1](08-security.md#81-capability-based-foundation)). Starts with full caps in a new user namespace;
/// reduced by setuid, prctl, etc.
cap_permitted: SystemCaps,
}
/// Frozen, immutable ID mapping array. Created once when `/proc/PID/uid_map`
/// (or `gid_map`) is written, never modified thereafter.
pub struct IdMapArray {
/// Sorted by `inner_start` for binary search on large maps.
/// Typical container: 1 entry (e.g., 0-65535 → 100000-165535).
/// For ≤5 entries, linear scan is faster than binary search.
entries: ArrayVec<IdMapEntry, MAX_ID_MAPPINGS>,
/// Cached: true if mapping is 1:1 identity (inner == outer for all ranges).
/// Enables fast-path bypass: return the input UID unchanged.
is_identity: bool,
}
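Given a frozen IdMapArray, translating an in-namespace UID to a parent-namespace UID is a range lookup with the 65534 (nobody) fallback described above for unwritten or unmapped IDs. A simplified userspace sketch (a plain slice instead of ArrayVec, and a free function instead of a method):

```rust
/// Simplified mirror of IdMapEntry for this sketch.
struct IdMapEntry { inner_start: u32, outer_start: u32, count: u32 }

const OVERFLOW_UID: u32 = 65534; // nobody/nogroup

/// Translate an inner (namespace-local) UID to the parent namespace.
/// `map = None` models a user namespace whose uid_map was never written:
/// every UID resolves to the overflow UID.
fn translate_uid(map: Option<&[IdMapEntry]>, inner: u32) -> u32 {
    let entries = match map {
        None => return OVERFLOW_UID,
        Some(e) => e,
    };
    for e in entries {
        // Range check: inner in [inner_start, inner_start + count), written
        // to avoid u32 overflow on inner_start + count.
        if inner >= e.inner_start && inner - e.inner_start < e.count {
            return e.outer_start + (inner - e.inner_start);
        }
    }
    OVERFLOW_UID // valid map, but this UID is unmapped
}

fn main() {
    // Typical single-entry container map: 0..65535 -> 100000..165535.
    let map = [IdMapEntry { inner_start: 0, outer_start: 100_000, count: 65_536 }];
    assert_eq!(translate_uid(Some(&map), 0), 100_000);           // container root
    assert_eq!(translate_uid(Some(&map), 1000), 101_000);        // ordinary user
    assert_eq!(translate_uid(Some(&map), 70_000), OVERFLOW_UID); // unmapped
    assert_eq!(translate_uid(None, 0), OVERFLOW_UID);            // map never written
    println!("ok");
}
```

The `is_identity` fast path in IdMapArray would short-circuit this lookup entirely by returning the input unchanged.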
Capability interactions with user namespaces:
- A process with UID 0 inside a user namespace has full capabilities within that namespace
- Capabilities are NOT granted against resources owned by ancestor namespaces
- Example: A process with "root" in User NS 1 cannot mount() a filesystem from the host
- The cap_effective mask is computed at syscall entry time based on:
  - The process's current UID within its user namespace
  - The target object's owning user namespace
  - The intersection of the process's capability bounding set with capabilities valid for the target
Determining the owning user namespace for kernel objects:
| Object Type | Owning User Namespace | Mechanism |
|---|---|---|
| File (VFS inode) | User namespace of the mount | Each mount has mnt_user_ns set at mount time. Files inherit from their mount. |
| Socket | User namespace of the creating process | Stored in sock->sk_user_ns at socket creation |
| IPC object (shm, sem, msg) | User namespace of the creating namespace | IPC namespace → User namespace mapping at IPC NS creation |
| Capability token | User namespace of the issuing process | Stored in capability header |
| Process (for signals) | User namespace of the process | Stored in task_struct->user_ns |
| Device node | User namespace of the initial mount | Device nodes are always in the initial namespace |
cap_effective computation algorithm:
The effective capability set for a process operating on an object is the intersection of:
1. The process's current effective capabilities (cap_effective)
2. The capabilities valid for the target object's namespace
This ensures that a process which has dropped capabilities via capset() does not regain them when accessing child namespace objects.
compute_effective_caps(process, object):
1. proc_ns = process.user_namespace
2. obj_ns = object.owning_user_namespace
3. proc_caps = process.cap_effective // NOT cap_bounding — use current effective set
4. // Check if process's NS is an ancestor of object's NS (or same NS)
5. if is_same_or_ancestor(proc_ns, obj_ns):
6. // Process is in a parent (or same) namespace — capabilities apply
7. // Return intersection of process's effective caps and caps valid for target
8. return intersection(proc_caps, capabilities_valid_for(obj_ns))
9. // Check if process's NS is a descendant of object's NS
10. if is_ancestor(obj_ns, proc_ns):
11. // Process is in a child namespace — no capabilities against parent objects
12. return EMPTY_CAP_SET
13. // Unrelated namespaces (neither ancestor nor descendant)
14. // This happens with sibling containers
15. return EMPTY_CAP_SET
is_same_or_ancestor(potential_ancestor, potential_descendant):
// Walk up the hierarchy from potential_descendant toward root.
// Return true if potential_ancestor is encountered (including if they're the same).
cursor = potential_descendant
while cursor != None:
if cursor == potential_ancestor:
return true
cursor = cursor.parent
return false
capabilities_valid_for(namespace):
// Returns the set of capabilities valid for objects in this namespace.
// Capabilities are restricted based on namespace ownership rules:
let mut valid = ALL_CAPS
// CAP_SYS_ADMIN operations that affect global kernel state (e.g., swapon,
// mount --bind outside the mount namespace) are not valid in non-init namespaces.
if namespace.is_non_init_user_ns():
valid &= ~CAP_SYS_ADMIN_GLOBAL // Remove host-affecting subset
// CAP_NET_ADMIN is only valid in the owning network namespace.
if namespace.net_ns != target_object.net_ns:
valid &= ~CAP_NET_ADMIN
return valid
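The ancestor walk in the pseudocode above can be sketched with Rc-linked namespace nodes and pointer-identity comparison (Rc::ptr_eq), mirroring the walk from descendant toward the root. The `Ns` type here is an illustrative stand-in, not the kernel's namespace struct:

```rust
use std::rc::Rc;

/// Minimal stand-in for a user namespace node: just the parent back-edge.
struct Ns { parent: Option<Rc<Ns>> }

/// Walk up from `descendant` toward the root; return true if `ancestor`
/// is encountered (including when they are the same namespace).
fn is_same_or_ancestor(ancestor: &Rc<Ns>, descendant: &Rc<Ns>) -> bool {
    let mut cursor = Some(descendant.clone());
    while let Some(ns) = cursor {
        if Rc::ptr_eq(&ns, ancestor) {
            return true;
        }
        cursor = ns.parent.clone();
    }
    false
}

fn main() {
    // root -> child -> grandchild, plus an unrelated sibling of child.
    let root = Rc::new(Ns { parent: None });
    let child = Rc::new(Ns { parent: Some(root.clone()) });
    let grandchild = Rc::new(Ns { parent: Some(child.clone()) });
    let sibling = Rc::new(Ns { parent: Some(root.clone()) });

    assert!(is_same_or_ancestor(&root, &grandchild));     // ancestor
    assert!(is_same_or_ancestor(&child, &child));          // same namespace
    assert!(!is_same_or_ancestor(&grandchild, &root));     // descendant, not ancestor
    assert!(!is_same_or_ancestor(&sibling, &grandchild));  // sibling branch
    println!("ok");
}
```

The sibling case is the one that matters for container isolation: two unrelated containers are neither ancestor nor descendant of each other, so compute_effective_caps returns the empty capability set.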
/// `CAP_SYS_ADMIN_GLOBAL` — a pseudo-capability synthesized by the namespace
/// validation layer. It is NOT a real Linux capability bit; it represents the
/// combination of: `CAP_SYS_ADMIN` + held in the **initial user namespace**
/// (`user_ns == &init_user_ns`).
///
/// Operations requiring `CAP_SYS_ADMIN_GLOBAL` (forbidden in non-init user namespaces
/// regardless of CAP_SYS_ADMIN possession):
/// - Creating new user namespaces when `user_namespaces_max` system limit is exceeded
/// - Mounting filesystems with `MS_STRICTATIME` in any namespace other than init
/// - Modifying kernel parameters via `sysctl(2)` outside the init namespace
/// - Attaching to another process's PID, UTS, or IPC namespace via `setns(2)`
/// (user namespace setns: allowed; other namespace types: init-only)
///
/// Implementation: `capabilities_valid_for(op, task, ns)` returns `true` iff:
/// `task.cap_effective` contains `CAP_SYS_ADMIN`
/// AND (`task.user_ns == &init_user_ns` OR the operation does NOT require GLOBAL)
///
/// This marker constant documents the concept; it is not stored as a capability bit.
pub const CAP_SYS_ADMIN_GLOBAL_REQUIRED: &str = "CAP_SYS_ADMIN in init_user_ns";
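The check described in the implementation note reduces to a two-condition predicate. A sketch with illustrative types (membership in the initial user namespace is modeled as a plain boolean rather than a pointer comparison against &init_user_ns):

```rust
/// Illustrative sketch of the CAP_SYS_ADMIN_GLOBAL predicate described
/// above. Real kernel code would compare user-namespace pointers against
/// &init_user_ns; here that fact is a plain boolean field.
struct TaskCaps {
    has_cap_sys_admin: bool,
    in_init_user_ns: bool,
}

/// Returns true iff the task may perform the operation.
/// `requires_global` marks operations that affect global kernel state
/// (swapon, sysctl writes, cross-type setns, etc.).
fn sys_admin_check(task: &TaskCaps, requires_global: bool) -> bool {
    task.has_cap_sys_admin && (task.in_init_user_ns || !requires_global)
}

fn main() {
    let host_root = TaskCaps { has_cap_sys_admin: true, in_init_user_ns: true };
    let container_root = TaskCaps { has_cap_sys_admin: true, in_init_user_ns: false };

    // Both may perform namespace-local CAP_SYS_ADMIN operations...
    assert!(sys_admin_check(&host_root, false));
    assert!(sys_admin_check(&container_root, false));
    // ...but only init-namespace root may perform GLOBAL operations.
    assert!(sys_admin_check(&host_root, true));
    assert!(!sys_admin_check(&container_root, true));
    println!("ok");
}
```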
Key invariant: The intersection() at line 8 ensures that if a process drops CAP_NET_ADMIN via capset(), it cannot exercise CAP_NET_ADMIN against any object, including objects in child namespaces. This upholds the guarantee in Section 8.8.9: "a dropped privilege can never be regained."
File capability interpretation:
File capabilities (set via setcap) are interpreted relative to the file's owning user namespace:
1. When execve() loads a binary with file capabilities, the kernel checks if the file's owning user namespace is the same as or an ancestor of the process's user namespace.
2. If the file is in a descendant namespace (i.e., the file was created inside a child namespace), its capability bits are ignored when executed from the parent — prevents a child namespace from granting capabilities in the parent.
3. If the file is in the same or an ancestor namespace, the file's capabilities are added to the process's permitted/effective sets, subject to the usual cap_bounding restrictions. This matches Linux semantics: a host binary with file caps is honored inside a container, but a container binary with file caps is not honored on the host.
Setuid/setgid binary behavior in nested namespaces:
| Binary Location | Setuid Behavior | Rationale |
|---|---|---|
| Initial namespace (host) | UID changes in initial namespace | Traditional Unix behavior |
| Child namespace | UID changes within child namespace only | Cannot escalate to parent namespace UIDs |
| Mounted from host into container | Setuid bit ignored | Prevents host→container privilege escalation |
Privilege escalation prevention:
- A process in a child user namespace cannot modify the parent's UID mappings
- setuid() inside a user namespace only affects the inner UID, not the outer UID
- File capability bits (setcap) are interpreted relative to the file's owning user namespace
- Signals from a less-privileged namespace to a more-privileged namespace are blocked unless explicitly allowed
16.1.7 Security Policy Integration
Container isolation requires multiple defense layers beyond namespaces and capabilities. UmkaOS integrates with security policy mechanisms at specific points in the container lifecycle:
seccomp-bpf (Syscall Filtering): OCI-compliant container runtimes (Docker, containerd, CRI-O) require seccomp-bpf to restrict the syscall surface available to containerized processes. UmkaOS's seccomp implementation is part of the eBPF subsystem described in Section 18.1.4, which covers eBPF program types including seccomp-bpf for per-process syscall filtering. The typical container creation sequence is:
1. clone(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | ...)
2. unshare(CLONE_NEWUSER) — if rootless
3. pivot_root() — change filesystem root
4. seccomp(SECCOMP_SET_MODE_FILTER, ...) — install syscall filter
5. drop_capabilities() — reduce capability set
6. execve() — exec container entrypoint
The seccomp filter must be installed before execve() so that the filter applies to the container's entrypoint and all its descendants. Docker's default seccomp profile blocks ~44 dangerous syscalls (e.g., kexec_load, reboot, mount). Kubernetes PodSecurityStandards mandate seccomp profiles for restricted workloads.
Seccomp Filter Stacking and Composition:
Nested containers require multiple independent seccomp filters to coexist on a single thread: the OCI runtime installs a broad container policy (filter F1) during container setup, and the container workload may subsequently install its own application-specific filter (filter F2) via prctl(PR_SET_SECCOMP) or seccomp(SECCOMP_SET_MODE_FILTER, ...). UmkaOS implements Linux-compatible stacking semantics so that existing container runtimes (runc, containerd) work without modification.
Stacking rules:
- Filters stack: each seccomp(SECCOMP_SET_MODE_FILTER, ...) call appends a new filter to the thread's filter list. All previously installed filters remain active. Filters cannot be removed.
- Evaluation order: on each syscall entry, filters are evaluated in reverse installation order — newest filter first, oldest filter last. All installed filters are evaluated; there is no short-circuit on SECCOMP_RET_ALLOW.
  Exception: SECCOMP_RET_KILL_PROCESS and SECCOMP_RET_KILL_THREAD cause immediate thread or process termination without evaluating any remaining filters. This matches Linux behavior.
- Action priority: when multiple filters return different actions for the same syscall, the highest-severity action wins regardless of evaluation order:
| Priority | Action | Effect |
|---|---|---|
| 1 (highest) | SECCOMP_RET_KILL_PROCESS | Terminate entire process |
| 2 | SECCOMP_RET_KILL_THREAD | Terminate calling thread |
| 3 | SECCOMP_RET_TRAP | Deliver SIGSYS |
| 4 | SECCOMP_RET_ERRNO | Return specified errno |
| 5 | SECCOMP_RET_USER_NOTIF | Notify supervisor via fd |
| 6 | SECCOMP_RET_TRACE | Notify ptrace tracer |
| 7 | SECCOMP_RET_LOG | Allow and log |
| 8 (lowest) | SECCOMP_RET_ALLOW | Allow syscall |
Example: if F1 returns SECCOMP_RET_ALLOW and F2 returns SECCOMP_RET_ERRNO(EPERM), the syscall is blocked with EPERM. A workload-installed filter can make the effective policy strictly more restrictive than the runtime-installed filter, but never less restrictive.
- NO_NEW_PRIVS requirement: a thread must have no_new_privs = 1 (set via prctl(PR_SET_NO_NEW_PRIVS, 1)) before installing a seccomp filter unless it holds CAP_SYS_ADMIN. This is identical to Linux. Container runtimes set no_new_privs as part of their standard setup sequence.
- Maximum filter count: 512 filters per thread. Linux limits total BPF instruction count (MAX_INSNS_PER_PATH = 32768), not filter count; UmkaOS imposes an explicit filter-count ceiling at 512 (matching Section 9.3). Attempting to install a 513th filter returns E2BIG.
- Filter inheritance: child processes created via fork() or clone() inherit the parent's complete filter stack. The inherited filters are immutable in the child — the child may only append further filters, never remove inherited ones.
Nested container policy: when an OCI runtime installs filter F1 (broad allow-list, blocking dangerous syscalls) and the container workload subsequently installs filter F2 (narrow application allow-list), both filters are active simultaneously. The effective policy is the union of restrictions from both filters: a syscall is allowed only if both F1 and F2 allow it. This composability property is what makes layered container security correct — deeper container nesting cannot relax an outer filter's restrictions.
UmkaOS implementation note: UmkaOS compiles the filter stack into a single BPF program at installation time. When a new filter is added to an existing stack, the kernel combines the compiled representation of the existing stack with the new filter's BPF bytecode and recompiles the result into a single executable program. This single-program approach is semantically identical to sequential per-filter evaluation (the action priority table above is preserved exactly) but eliminates repeated per-filter dispatch overhead at syscall entry. The recompilation occurs once at seccomp(SECCOMP_SET_MODE_FILTER, ...) time, not on each syscall.
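The action-priority rule can be implemented as a fold over every filter's return action, keeping the most severe action seen. A sketch with an illustrative enum whose discriminant order matches the priority table (this is not the raw SECCOMP_RET_* bit encoding):

```rust
/// Seccomp return actions ordered by severity: lower discriminant means
/// higher priority, matching the priority table. Illustrative enum, not
/// the raw SECCOMP_RET_* encoding.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
enum Action {
    KillProcess, // 1 (highest)
    KillThread,  // 2
    Trap,        // 3
    Errno,       // 4
    UserNotif,   // 5
    Trace,       // 6
    Log,         // 7
    Allow,       // 8 (lowest)
}

/// Combine the actions returned by every installed filter for one
/// syscall: the highest-severity (smallest) action wins, regardless of
/// filter evaluation order.
fn resolve(actions: &[Action]) -> Action {
    actions
        .iter()
        .copied()
        .fold(Action::Allow, |acc, a| if a < acc { a } else { acc })
}

fn main() {
    // F1 allows, F2 (workload filter) blocks with an errno: errno wins.
    assert_eq!(resolve(&[Action::Allow, Action::Errno]), Action::Errno);
    // A kill action dominates everything else.
    assert_eq!(resolve(&[Action::Errno, Action::KillProcess]), Action::KillProcess);
    // No filters installed: syscall is allowed.
    assert_eq!(resolve(&[]), Action::Allow);
    println!("ok");
}
```

Because the fold is order-independent, a workload filter can only tighten the effective policy, never loosen it, which is exactly the composability property the section relies on.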
LSM Integration:
UmkaOS supports pluggable Linux Security Modules (AppArmor, SELinux profiles). Container runtimes can specify an LSM profile via OCI annotations, which UmkaOS applies at execve() time. The integrity measurement framework (Section 8.4, 08-security.md) provides the foundation for policy enforcement. The full LSM framework — hook table, security blob allocation, module registration, and AND-logic stacking — is specified in Section 8.7.
See also: - Section 18.1.4: eBPF subsystem including seccomp-bpf - Section 8.4 (08-security.md): Runtime Integrity Measurement (IMA) - Section 8.8: Credential model and capability dropping
16.2 Control Groups (Cgroups v2)
Linux cgroups v2 provide hierarchical resource allocation and limiting. UmkaOS implements the unified cgroup v2 interface, mapping controller semantics to UmkaOS's native scheduler, memory manager, and I/O subsystems.
Cgroup v1 compatibility shim: Docker (Moby) and older systemd versions (pre-247) require cgroup v1 hierarchy paths. UmkaOS provides a read-mostly v1 compatibility shim that:
- Exposes /sys/fs/cgroup/{cpu,memory,pids,blkio,...} mount points
- Translates v1 control file reads/writes to v2 equivalents (e.g., memory.limit_in_bytes → memory.max, cpu.shares → cpu.weight)
- Supports the 4 most common v1 controllers: cpu, memory, blkio, pids
- Returns -ENOSYS for v1-only features with no v2 equivalent (e.g., cpuacct separate hierarchy, net_cls, net_prio)
- Multi-hierarchy emulation: each v1 controller appears as a separate mount, but all are backed by the single v2 unified hierarchy

Specification scope: The v1 shim control file format details (exact file paths, value format, error responses) are deferred to Phase 4. The core v2 implementation below is the authoritative resource control mechanism. Until Phase 4, Moby/systemd v1 compatibility cannot be integration-tested — only the semantic translation (which v1 files → which v2 controls) is validated.
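As a concrete example of the semantic translation the shim performs: v1 cpu.shares (default 1024) and v2 cpu.weight (default 100) express the same relative-share concept on different scales. One plausible conversion anchors the two defaults on each other, as below; the exact rounding rule UmkaOS's shim uses is a Phase 4 detail, so treat these helpers as illustrative:

```rust
/// Translate v1 cpu.shares to v2 cpu.weight, anchoring the defaults
/// (shares 1024 <-> weight 100) and clamping to cpu.weight's 1..=10000
/// range. Illustrative only: the shim's exact rounding is a Phase 4 detail.
fn shares_to_weight(shares: u64) -> u64 {
    (shares * 100 / 1024).clamp(1, 10_000)
}

/// Inverse direction, for writes to the v1 cpu.shares file,
/// clamped to cpu.shares' historical 2..=262144 range.
fn weight_to_shares(weight: u64) -> u64 {
    (weight * 1024 / 100).clamp(2, 262_144)
}

fn main() {
    assert_eq!(shares_to_weight(1024), 100); // both defaults line up
    assert_eq!(shares_to_weight(2048), 200); // double share = double weight
    assert_eq!(weight_to_shares(100), 1024); // round-trips the default
    assert_eq!(weight_to_shares(10_000), 102_400);
    println!("ok");
}
```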
16.2.1 Core Data Structures
16.2.1.1 Cgroup Node
The Cgroup struct is the central object in the cgroup v2 hierarchy. Every directory under
/sys/fs/cgroup/ corresponds to one Cgroup node. The hierarchy is a tree; the root node
is owned by CgroupRoot.
UmkaOS's cgroup design avoids two sources of complexity present in Linux's implementation:
- No multi-hierarchy: cgroup v1's per-controller separate hierarchies are gone; the single
v2 unified hierarchy is the only model. The v1 shim (see above) re-exposes v2 state through
legacy paths at the cgroupfs layer without creating second hierarchies inside the kernel.
- No cgroup_subsys indirection: Linux routes every controller operation through a
cgroup_subsys vtable, adding an indirect call on every resource charge. UmkaOS embeds
controller state directly in Cgroup as Option<ControllerState> fields; disabled
controllers are None and add no overhead.
/// A cgroup node in the cgroup v2 unified hierarchy.
///
/// The hierarchy is a tree rooted at `CgroupRoot.root`. Tasks are assigned
/// to leaf or intermediate cgroups. Resource controllers operate per-cgroup.
///
/// # Memory layout note
/// Controller state structs are stored inline (`Option<T>`) rather than
/// heap-allocated so that null-pointer checks for disabled controllers
/// compile to a branch on a locally cached discriminant — no extra
/// pointer dereference on the hot path.
pub struct Cgroup {
/// Unique cgroup ID (assigned at creation, never reused).
/// Also used as the inode number of the cgroupfs directory.
pub id: u64,
/// Parent cgroup. `None` only for the root cgroup (id == 1).
/// `Weak` avoids reference cycles: the tree is owned downward
/// (`CgroupRoot → Arc<Cgroup> → Arc<Cgroup> children`); the
/// parent pointer is a non-owning back-edge.
pub parent: Option<Weak<Cgroup>>,
/// Child cgroups. Protected by `hierarchy_lock` in `CgroupRoot`
/// for structural modifications (mkdir, rmdir); reads during task
/// migration hold an RCU read-side reference instead.
pub children: Mutex<Vec<Arc<Cgroup>>>,
/// Name of this cgroup relative to parent (max 255 bytes, no '/').
/// Fixed-size inline storage avoids heap allocation for short names
/// (typical names: "docker", "system.slice", container IDs ≤ 64 bytes).
pub name: CgroupName,
/// Tasks directly assigned to this cgroup (not descendants).
/// Written by task migration; read by cgroupfs `cgroup.procs` output.
///
/// Uses a `RwLock<FxHashSet<TaskId>>` for O(1) insert, remove, and membership
/// test (replacing the O(n) `Vec` scan used for migration checks). Readers
/// (cgroupfs `cgroup.procs` output) take a read lock; writers (task migration)
/// take a write lock. Task migration is serialized by the global task-migration
/// lock anyway, so write contention is negligible. The `FxHashSet` (rustc's
/// FxHashMap hasher) provides O(1) lookup without heap allocation per operation.
pub tasks: RwLock<FxHashSet<TaskId>>,
/// Number of tasks in this cgroup and all descendants.
/// Updated atomically during task migration (O(depth) walk, done once
/// per migration, not per tick). Used by `pids.current` propagation and
/// for efficient "is this cgroup populated?" checks.
pub population: AtomicU64,
// ── Resource controller state ────────────────────────────────────────
// Each field is `None` when the controller is disabled for this cgroup.
// Controller state is only present when listed in the parent's
// `subtree_control` mask (or for the root cgroup, in `cgroup.controllers`).
/// CPU bandwidth controller (`cpu.weight`, `cpu.max`, `cpu.guarantee`).
pub cpu: Option<CpuController>,
/// Memory controller (`memory.max`, `memory.high`, `memory.current`, etc.).
pub memory: Option<MemController>,
/// Block I/O controller (`io.max`, `io.weight`).
pub io: Option<IoController>,
/// PID controller (`pids.max`, `pids.current`).
pub pids: Option<PidsController>,
/// CPU affinity controller (`cpuset.cpus`, `cpuset.mems`, partition mode).
pub cpuset: Option<CpusetController>,
/// RDMA/InfiniBand resource controller (`rdma.max`).
pub rdma: Option<RdmaController>,
/// Huge page controller (`hugetlb.<size>.max`).
pub hugetlb: Option<HugetlbController>,
/// Miscellaneous resource controller (`misc.max`; e.g., SGX EPC pages).
pub misc: Option<MiscController>,
// ── Hierarchy control ────────────────────────────────────────────────
/// Which controllers are enabled for this cgroup's children.
/// Written to `cgroup.subtree_control`; read on every child mkdir.
/// `ControllerMask` is a bitmask — O(1) enable/disable.
pub subtree_control: ControllerMask,
// ── Freeze state ─────────────────────────────────────────────────────
/// When `true`, all tasks in this cgroup (including descendants) are
/// frozen: removed from the scheduler run-queue and prevented from
/// being scheduled. Set/cleared by writing to `cgroup.freeze`.
/// See Section 16.2.8 for the freeze/thaw protocol.
pub frozen: AtomicBool,
// ── Generation counter for walk-free limit propagation ───────────────
/// Incremented whenever any resource limit in this cgroup or any
/// ancestor changes. Each task caches the generation value at the
/// time its limits were last computed. On the next resource charge,
/// the task compares its cached generation against this field. On
/// mismatch, the task re-walks from its cgroup to the root to
/// recompute its effective limits, then updates the cache.
///
/// This makes limit changes O(1) to publish (one atomic increment)
/// and amortizes the re-walk cost to the next resource operation on
/// each task — no per-tick accounting, no broadcast, no lock convoy.
pub generation: AtomicU64,
// ── cgroupfs integration ──────────────────────────────────────────────
/// Inode for this cgroup's directory in the cgroupfs pseudo-filesystem.
/// `None` before the cgroupfs is mounted. The inode is allocated at
/// cgroup creation time and freed at cgroup destruction.
pub inode: Option<Arc<Inode>>,
}
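The generation-counter protocol described in the field comments above can be sketched as follows. `Cg`, `LimitCache`, and the single-controller limit are illustrative simplifications, not kernel API: the real re-walk visits every ancestor on the path to the root and recomputes the full per-controller limit set.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative cgroup node: only the two fields the protocol needs.
struct Cg {
    generation: AtomicU64,
    pids_max: AtomicU64,
}

/// Task-side cache: the effective limit plus the generation it was
/// computed against.
struct LimitCache {
    cached_gen: u64,
    pids_max: u64,
}

impl LimitCache {
    /// Called on every resource charge. O(1) when nothing changed;
    /// re-walks the ancestor path (leaf first, root last) only on a
    /// generation mismatch, taking the tightest limit along the way.
    fn effective_pids_max(&mut self, path: &[&Cg]) -> u64 {
        let cur = path[0].generation.load(Ordering::Acquire);
        if cur != self.cached_gen {
            self.pids_max = path
                .iter()
                .map(|c| c.pids_max.load(Ordering::Relaxed))
                .min()
                .unwrap_or(u64::MAX);
            self.cached_gen = cur;
        }
        self.pids_max
    }
}

fn main() {
    let cg = Cg { generation: AtomicU64::new(0), pids_max: AtomicU64::new(512) };
    let mut cache = LimitCache { cached_gen: u64::MAX, pids_max: 0 };
    assert_eq!(cache.effective_pids_max(&[&cg]), 512);

    // Admin lowers the limit: publishing is one atomic increment.
    cg.pids_max.store(128, Ordering::Relaxed);
    cg.generation.fetch_add(1, Ordering::Release);
    assert_eq!(cache.effective_pids_max(&[&cg]), 128);
}
```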
/// Fixed-size inline cgroup name (max 255 bytes, not NUL-terminated).
/// Avoids heap allocation for the common case (names ≤ 255 bytes).
pub struct CgroupName {
/// Number of valid bytes in `data`.
len: u8,
/// Raw UTF-8 bytes. Characters '/' and '\0' are rejected at creation.
data: [u8; 255],
}
/// Bitmask of resource controllers. One bit per controller type.
/// Used for `subtree_control` (enabled-for-children) and for
/// `cgroup.controllers` (available on the system).
#[derive(Clone, Copy, Default)]
pub struct ControllerMask(pub u32);
impl ControllerMask {
pub const CPU: u32 = 1 << 0;
pub const MEMORY: u32 = 1 << 1;
pub const IO: u32 = 1 << 2;
pub const PIDS: u32 = 1 << 3;
pub const CPUSET: u32 = 1 << 4;
pub const RDMA: u32 = 1 << 5;
pub const HUGETLB: u32 = 1 << 6;
pub const MISC: u32 = 1 << 7;
/// Returns `true` if the given controller bit is set.
pub fn has(self, bit: u32) -> bool { self.0 & bit != 0 }
/// Returns the union of two masks (enabling controllers from both).
pub fn union(self, other: ControllerMask) -> ControllerMask {
ControllerMask(self.0 | other.0)
}
}
16.2.1.2 CPU Controller State
/// CPU controller state, present when the `cpu` controller is enabled
/// for this cgroup (listed in parent's `subtree_control`).
///
/// Maps to `cpu.weight`, `cpu.max`, `cpu.guarantee`, and `cpu.stat`
/// cgroupfs files. See Section 16.2.3 for the integration with UmkaOS's
/// EEVDF scheduler and CBS bandwidth enforcement.
pub struct CpuController {
/// `cpu.weight`: relative CPU share among siblings (1..=10000, default 100).
/// Used directly as the EEVDF task-group weight.
pub weight: AtomicU32,
/// `cpu.max` quota: microseconds of CPU time allowed per `period_us`.
/// `None` means unlimited (no throttling).
pub max_us: Option<AtomicU64>,
/// `cpu.max` period in microseconds (default 100,000 = 100 ms).
/// Always set even when `max_us` is `None` (holds the configured period
/// for when a quota is later added).
pub period_us: AtomicU64,
/// CBS (Constant Bandwidth Server) state for `cpu.guarantee` enforcement.
/// Present even when `cpu.guarantee` is not set (idle state then).
/// See [Section 6.3](06-scheduling.md#63-constant-bandwidth-server-cbs) for
/// CBS semantics and the `CbsGroupServer` struct definition.
pub cbs: CbsGroupServer,
// ── Accumulated statistics (read via `cpu.stat`) ─────────────────────
/// Total CPU time consumed (microseconds). Monotonically increasing.
pub usage_us: AtomicU64,
/// Number of throttling periods that have elapsed.
pub nr_periods: AtomicU64,
/// Number of periods in which this cgroup was throttled (quota exhausted).
pub nr_throttled: AtomicU64,
/// Total time spent throttled (microseconds).
pub throttled_us: AtomicU64,
}
16.2.1.3 Memory Controller State
/// Memory controller state, present when the `memory` controller is enabled.
///
/// Maps to `memory.current`, `memory.high`, `memory.max`, `memory.swap.max`,
/// `memory.oom.group`, and `memory.events` cgroupfs files.
/// See Section 16.2.4 for the integration with the physical memory allocator.
pub struct MemController {
/// `memory.current`: total bytes of memory charged to this cgroup.
/// Updated on every page charge/uncharge (one atomic add per page fault
/// or page table manipulation). Monotonically tracks live usage.
pub usage: AtomicU64,
/// `memory.high`: soft limit in bytes. When `usage` exceeds this, the
/// cgroup's tasks are throttled (sleeping in the allocator path) and
/// reclaim is prioritized for pages belonging to this cgroup.
/// `u64::MAX` means unlimited (default).
pub high: AtomicU64,
/// `memory.max`: hard limit in bytes. When `usage` would exceed this,
/// the per-cgroup OOM killer is invoked before the allocation completes.
/// `u64::MAX` means unlimited (default).
pub max: AtomicU64,
/// `memory.swap.max`: swap usage hard limit in bytes.
/// `u64::MAX` means unlimited (default).
/// Controls how much of this cgroup's memory may be swapped out.
pub swap_max: AtomicU64,
/// `memory.oom.group`: when `true`, the OOM killer kills **all tasks**
/// in the cgroup rather than selecting a single victim. Useful for
/// atomically terminating a container that has overrun its memory budget.
pub oom_group: AtomicBool,
/// `memory.events` counter: number of OOM kills triggered for this cgroup.
/// Incremented each time the OOM killer selects a victim in this cgroup.
pub oom_kill: AtomicU64,
/// LRU list of pages charged to tasks in this cgroup.
/// The reclaim path consults this list to find candidate pages when
/// `memory.high` is exceeded or when global reclaim pressure is high.
/// `CgroupLru` is a two-list (active/inactive) LRU matching Linux's
/// per-cgroup LRU structure; pages are moved between lists by the
/// page-access tracking machinery in `umka-core`.
pub lru: Mutex<CgroupLru>,
}
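A minimal sketch of the hierarchical charge path that keeps `usage` consistent with `max`: one optimistic `fetch_add` per level, rolled back on a hard-limit breach. `Mem` and `try_charge` are hypothetical names for illustration; the real path additionally handles `memory.high` throttling and invokes the per-cgroup OOM killer when the charge fails.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal stand-in for MemController: just usage and a hard limit.
struct Mem { usage: AtomicU64, max: u64 }

/// Charge `bytes` against a cgroup and all its ancestors (leaf-first
/// slice, root last). On any hard-limit breach, roll back the charges
/// already made and fail — the caller would then invoke the per-cgroup
/// OOM killer for the offending level.
fn try_charge(path: &[&Mem], bytes: u64) -> Result<(), usize> {
    for (i, m) in path.iter().enumerate() {
        let old = m.usage.fetch_add(bytes, Ordering::Relaxed);
        if old + bytes > m.max {
            // Roll back this level and every level already charged.
            for m2 in &path[..=i] {
                m2.usage.fetch_sub(bytes, Ordering::Relaxed);
            }
            return Err(i); // index of the cgroup that hit memory.max
        }
    }
    Ok(())
}

fn main() {
    let child = Mem { usage: AtomicU64::new(0), max: 4096 };
    let root = Mem { usage: AtomicU64::new(0), max: u64::MAX };
    assert!(try_charge(&[&child, &root], 4096).is_ok());
    // A second page would exceed the child's memory.max: rolled back.
    assert_eq!(try_charge(&[&child, &root], 4096), Err(0));
    assert_eq!(child.usage.load(Ordering::Relaxed), 4096);
}
```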
16.2.1.4 PID Controller State
/// PID controller state, present when the `pids` controller is enabled.
///
/// Maps to `pids.current`, `pids.max`, and `pids.events` cgroupfs files.
/// See Section 16.2.6 for fork-bomb prevention semantics.
pub struct PidsController {
/// `pids.current`: number of tasks (threads + processes) currently in
/// this cgroup subtree. Incremented by fork/clone, decremented by exit.
pub current: AtomicU64,
/// `pids.max`: maximum tasks allowed in this cgroup subtree.
/// `u64::MAX` means unlimited (default). `fork()`/`clone()` checks
/// `current < max` before allocating a new task; returns `EAGAIN` on failure.
pub max: AtomicU64,
/// `pids.events max`: number of fork/clone calls that were rejected
/// because `current` reached `max`. Monotonically increasing.
pub events_max: AtomicU64,
}
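The fork-time admission check can be sketched as one optimistic increment with rollback, keeping the hot path to a single atomic RMW. `Pids` and `pids_try_fork` are illustrative names; the real check repeats this test at every ancestor up to the root, as described in the field comments above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal stand-in for PidsController.
struct Pids { current: AtomicU64, max: AtomicU64, events_max: AtomicU64 }

const EAGAIN: i32 = 11;

/// fork()/clone()-time admission check: returns Err(EAGAIN) when the
/// subtree is already at pids.max, and counts the rejection.
fn pids_try_fork(p: &Pids) -> Result<(), i32> {
    let old = p.current.fetch_add(1, Ordering::Relaxed);
    if old >= p.max.load(Ordering::Relaxed) {
        p.current.fetch_sub(1, Ordering::Relaxed);
        p.events_max.fetch_add(1, Ordering::Relaxed);
        return Err(EAGAIN);
    }
    Ok(())
}

fn main() {
    let p = Pids {
        current: AtomicU64::new(0),
        max: AtomicU64::new(2),
        events_max: AtomicU64::new(0),
    };
    assert!(pids_try_fork(&p).is_ok());
    assert!(pids_try_fork(&p).is_ok());
    assert_eq!(pids_try_fork(&p), Err(EAGAIN)); // third task rejected
    assert_eq!(p.events_max.load(Ordering::Relaxed), 1);
}
```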
16.2.1.5 I/O Controller State
/// I/O controller state, present when the `io` controller is enabled.
///
/// Maps to `io.max`, `io.weight`, `io.stat`, and `io.pressure` cgroupfs files.
/// See Section 16.2.5 for integration with the block I/O scheduler.
pub struct IoController {
/// Per-device rate limits. Each entry specifies bandwidth (bytes/s) and
/// IOPS limits for one block device identified by major:minor number.
/// Protected by a Mutex because limit changes (`io.max` writes) are rare
/// configuration events; the per-request hot path reads limits under a
/// short RCU read-side reference rather than holding this lock.
pub devices: Mutex<Vec<IoDeviceLimits>>,
/// PSI (Pressure Stall Information) for this cgroup's I/O subsystem.
/// Exposed as `io.pressure`. Tracks the fraction of time tasks in this
/// cgroup are stalled waiting for I/O completions.
pub psi: PsiState,
}
/// Per-device I/O limits for one block device within a cgroup.
pub struct IoDeviceLimits {
/// Block device identified by (major, minor) numbers.
pub dev: DeviceNumber,
/// Read bandwidth limit in bytes per second. `None` = unlimited.
pub rbps: Option<u64>,
/// Write bandwidth limit in bytes per second. `None` = unlimited.
pub wbps: Option<u64>,
/// Read I/O operations per second limit. `None` = unlimited.
pub riops: Option<u64>,
/// Write I/O operations per second limit. `None` = unlimited.
pub wiops: Option<u64>,
}
16.2.1.5a Additional Controller State Structs
The following structs back the rdma, hugetlb, misc, cpuset, and shared PSI/LRU
fields referenced in Cgroup above. They are defined here rather than inline so that the
Cgroup struct definition in Section 16.2.1.1 remains readable.
/// RDMA cgroup controller. Limits RDMA/InfiniBand resource usage per cgroup.
/// Controls: MR (memory regions), MW (memory windows), PD (protection domains),
/// AH (address handles), QP (queue pairs), SRQ (shared receive queues).
/// Mirrors Linux's `rdma` cgroup subsystem (kernel 4.11+).
pub struct RdmaController {
/// Per-device RDMA resource limits. Key: RDMA device index.
pub limits: BTreeMap<u32, RdmaDeviceLimit>,
/// Current RDMA resource usage. Key: RDMA device index.
pub usage: BTreeMap<u32, RdmaDeviceUsage>,
}
/// Per-device RDMA resource limits for one cgroup.
pub struct RdmaDeviceLimit {
/// Max memory regions.
pub max_mr: u32,
/// Max memory windows.
pub max_mw: u32,
/// Max protection domains.
pub max_pd: u32,
/// Max address handles.
pub max_ah: u32,
/// Max queue pairs.
pub max_qp: u32,
/// Max shared receive queues.
pub max_srq: u32,
}
/// Current RDMA resource usage for one cgroup on one device.
pub struct RdmaDeviceUsage {
pub mr: AtomicU32,
pub mw: AtomicU32,
pub pd: AtomicU32,
pub ah: AtomicU32,
pub qp: AtomicU32,
pub srq: AtomicU32,
}
/// Huge-page cgroup controller. Limits huge page usage per cgroup per page size.
/// Maps to Linux's `hugetlb` cgroup subsystem.
/// Key: huge page size in bytes (2MB = 2097152, 1GB = 1073741824, etc.).
pub struct HugetlbController {
/// Maximum huge-page bytes allowed per page size. Value `u64::MAX` = unlimited.
pub limits: BTreeMap<HugePageSize, u64>,
/// Current huge-page bytes in use per page size.
pub usage: BTreeMap<HugePageSize, AtomicU64>,
}
/// Huge page size in bytes (2 MiB, 1 GiB, etc.).
pub type HugePageSize = u64;
/// Miscellaneous cgroup controller (Linux 5.13+). Provides per-resource usage
/// limits for resources that do not fit into other controllers (e.g., SEV
/// ASIDs, SGX EPC pages — the same resources named for `misc.max` in 16.2.1.1).
pub struct MiscController {
/// Per-resource limits. Key: resource name (e.g., "sev", "sgx_epc").
/// Value: limit and live usage counter.
pub resources: BTreeMap<Box<str>, MiscResource>,
}
/// One named miscellaneous resource tracked by `MiscController`.
pub struct MiscResource {
/// Maximum units allowed. `u64::MAX` = unlimited.
pub max: u64,
/// Current units in use.
pub usage: AtomicU64,
}
/// Cpuset cgroup controller. Pins tasks in a cgroup to specific CPUs and NUMA nodes.
/// Maps to Linux's `cpuset` subsystem (cgroup v2: `cpuset.cpus`, `cpuset.mems`).
pub struct CpusetController {
/// CPUs this cgroup's tasks are allowed to run on. Empty = inherit from parent.
pub allowed_cpus: CpuMask,
/// NUMA memory nodes this cgroup's tasks may allocate from. Empty = any node.
pub allowed_mems: NodeMask,
/// If true, enforce CPU affinity even during load balancing (exclusive cpuset).
pub cpu_exclusive: bool,
/// If true, enforce NUMA node affinity for memory allocation.
pub mem_exclusive: bool,
/// If true, migrate already-allocated pages to the new node set when
/// `allowed_mems` changes (v1 `cpuset.memory_migrate` semantics).
pub mem_migrate: bool,
}
/// NUMA node affinity mask. Bit N = NUMA node N is allowed.
/// Up to 128 NUMA nodes (two u64 words).
pub struct NodeMask {
pub bits: [u64; 2],
}
/// Pressure Stall Information (PSI) state for one resource (CPU, memory, or I/O).
/// Exposed via /sys/fs/cgroup/<cgroup>/cpu.pressure, memory.pressure, io.pressure.
/// Matches Linux's `psi_group_cpu`/`psi_group_mem`/`psi_group_io` layout.
pub struct PsiState {
/// Exponentially-weighted moving average of stall time, in units of 0.01%.
/// Index 0 = 10-second window, 1 = 60-second window, 2 = 300-second window.
/// `some_avg`: at least one task stalled (partial stall).
pub some_avg: [u32; 3],
/// `full_avg`: all tasks stalled (full stall).
pub full_avg: [u32; 3],
/// Cumulative stall time in microseconds since cgroup creation.
pub some_total: AtomicU64,
pub full_total: AtomicU64,
/// Timestamp of the last PSI measurement (nanoseconds since boot).
pub last_update_ns: AtomicU64,
}
/// Per-cgroup LRU for memory reclaim ordering.
/// Tracks pages owned by this cgroup to enable cgroup-aware reclaim
/// (reclaim targets the cgroup that is over its memory limit first).
pub struct CgroupLru {
/// Active LRU list: recently accessed pages. Reclaim scans tail first.
pub active: IntrusiveList<Page>,
/// Inactive LRU list: pages not recently accessed. Primary reclaim target.
pub inactive: IntrusiveList<Page>,
/// Pages currently under writeback (not reclaimable until writeback completes).
pub writeback: IntrusiveList<Page>,
/// Total number of pages on all lists (active + inactive + writeback).
pub nr_pages: AtomicU64,
/// Pages that have been reclaimed since last check (for memory.stat reporting).
pub nr_reclaimed: AtomicU64,
}
16.2.1.6 Hierarchy Root
/// Root of the cgroup v2 unified hierarchy. One instance per system.
///
/// UmkaOS has a single `CgroupRoot` (no per-controller separate hierarchies —
/// those were the v1 design that UmkaOS eliminates). The root cgroup has `id == 1`
/// and no parent.
pub struct CgroupRoot {
/// The root cgroup node. All other cgroups are reachable from here via
/// `children` links. `Arc` because `CgroupNamespace` instances hold
/// per-namespace root references into this tree (at arbitrary subtree nodes).
pub root: Arc<Cgroup>,
/// Lock protecting hierarchy structure changes (mkdir, rmdir).
/// Held briefly during cgroup creation and destruction; **not** held during
/// task migration or resource charging (those operations use per-cgroup
/// locks and atomics).
///
/// `RwLock`: concurrent hierarchy traversals (e.g., cgroupfs readdir, kernel
/// population-count propagation) hold read locks; mkdir/rmdir hold write locks.
pub hierarchy_lock: RwLock<()>,
/// Fast O(1) lookup from cgroup ID to `Arc<Cgroup>`.
/// Used by cgroupfs to resolve inode numbers back to cgroup nodes,
/// and by the `CLONE_NEWCGROUP` implementation to find the anchor node
/// for a new cgroup namespace.
///
/// `RcuHashMap`: lock-free reads under RCU guard, serialized writes.
/// Entries are inserted at cgroup creation and removed at destruction
/// (after a grace period, since cgroupfs inodes may hold references).
pub id_map: RcuHashMap<u64, Arc<Cgroup>>,
/// Monotonically increasing ID counter. Assigned at cgroup creation;
/// never reused (even after cgroup destruction). `AtomicU64` allows
/// lock-free ID allocation at mkdir time.
pub next_id: AtomicU64,
/// VFS mount point for the cgroupfs pseudo-filesystem.
/// `None` if cgroupfs has not yet been mounted (early boot).
/// After mount, this is the `VfsMount` returned by
/// `mount("cgroup2", "/sys/fs/cgroup", "cgroup2", 0, NULL)`.
pub mount: Option<VfsMount>,
}
16.2.1.7 Task Migration (cgroup.procs write)
Writing a PID to cgroup.procs atomically moves the task to the target cgroup.
The migration protocol is O(1) in tree depth: limits are recomputed lazily on
the next resource operation, not during migration itself.
Migration steps for write(fd_cgroup_procs, pid_str):
1. Resolve PID to TaskId using the writer's PID namespace.
2. Acquire the task's migration lock (per-task SpinLock, prevents concurrent
migration of the same task from two writers).
3. Check the source and target cgroups for controller constraints:
- If target has a `PidsController`: verify (current + 1) <= max; return
EAGAIN if over limit.
- If target has a `MemController`: verify the task's current RSS would not
immediately exceed memory.max in the target. If over-limit, return ENOMEM.
4. Charge source cgroup controllers (subtract):
- Decrement `PidsController::current` on source (and all ancestors up to LCA).
- Subtract task's RSS from `MemController::usage` on source (and ancestors).
5. Charge target cgroup controllers (add):
- Increment `PidsController::current` on target (and all ancestors up to LCA).
- Add task's RSS to `MemController::usage` on target (and ancestors).
6. Update the task's cgroup pointer:
task.cgroup.store(Arc::clone(&target), Ordering::Release);
The Release store pairs with Acquire loads in the resource-charge path,
ensuring that subsequent charges from this task are credited to the target.
7. Update task lists:
- Remove TaskId from source.tasks (O(1) hash set remove under write lock).
- Insert TaskId into target.tasks (O(1) hash set insert under write lock).
8. Update population counts along the path from source/target to their LCA:
- Decrement `source.population` and each ancestor up to LCA.
- Increment `target.population` and each ancestor up to LCA.
9. Propagate generation counters:
- Increment `source.generation` (Relaxed — any observer that sees the
task's new cgroup will also observe the updated generation).
- Increment `target.generation`.
Tasks that cached effective limits from either cgroup will detect the
mismatch on the next resource charge and re-walk to recompute limits.
10. Release the migration lock.
The LCA (Lowest Common Ancestor) walk in steps 4–5 and 8 is bounded by the
maximum cgroup nesting depth (256 in UmkaOS, matching Linux's limit). In the
common case (migration within the same subtree, depth ≤ 4), the walk touches
≤ 8 nodes.
Population propagation (step 8) uses a spinlock-free path: `population` is
an `AtomicU64` updated with `fetch_add`/`fetch_sub`. The LCA walk does not
need to hold `hierarchy_lock` because cgroup destruction requires the
population to be zero (enforced before rmdir proceeds).
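The LCA walk referenced in steps 4–5 and 8 can be sketched with parent pointers and cached depths: equalize depths first, then walk both sides in lock step. The arena representation (`Node`, index-based parents) is purely illustrative; the kernel walks `Arc` parent links, bounded by the 256-level nesting limit.

```rust
/// Arena-style cgroup tree node for illustration: parent index + depth.
struct Node { parent: Option<usize>, depth: u32 }

/// Lowest common ancestor by depth equalization, then lock-step walk.
/// Bounded by the maximum nesting depth (256 in this design).
fn lca(tree: &[Node], mut a: usize, mut b: usize) -> usize {
    while tree[a].depth > tree[b].depth { a = tree[a].parent.unwrap(); }
    while tree[b].depth > tree[a].depth { b = tree[b].parent.unwrap(); }
    while a != b {
        a = tree[a].parent.unwrap();
        b = tree[b].parent.unwrap();
    }
    a
}

fn main() {
    // root(0) ── sys(1) ── a(3)
    //        └── usr(2) ── b(4)
    let tree = vec![
        Node { parent: None, depth: 0 },    // 0: root
        Node { parent: Some(0), depth: 1 }, // 1: sys
        Node { parent: Some(0), depth: 1 }, // 2: usr
        Node { parent: Some(1), depth: 2 }, // 3: a
        Node { parent: Some(2), depth: 2 }, // 4: b
    ];
    assert_eq!(lca(&tree, 3, 4), 0); // different subtrees meet at root
    assert_eq!(lca(&tree, 3, 1), 1); // an ancestor is its own LCA
}
```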
16.2.2 Cgroup Filesystem and Hierarchy
Cgroups are exposed via a pseudo-filesystem mounted at /sys/fs/cgroup:
/sys/fs/cgroup/
├── cgroup.controllers # Available controllers (cpu memory io pids cpuset rdma hugetlb misc)
├── cgroup.subtree_control # Controllers enabled for children
├── cgroup.procs # PIDs in this cgroup
├── system.slice/ # Systemd system services
├── user.slice/ # User sessions
└── docker/ # Container cgroups
└── <container-id>/
├── cpu.max
├── cpu.weight
├── cpu.guarantee # UmkaOS extension (Section 6.3)
├── memory.max
├── memory.current
├── io.max
├── pids.max
└── cpuset.cpus
Hierarchy delegation: A cgroup can delegate control to a subtree by enabling controllers in cgroup.subtree_control. Only controllers enabled in the parent's subtree_control are available in child cgroups. This matches Linux semantics for unprivileged container runtimes.
Cgroup namespace integration: CLONE_NEWCGROUP creates a new cgroup namespace where the process's current cgroup becomes the root of its view. Processes see /sys/fs/cgroup/ starting from their namespace's cgroup root, enabling rootless container runtimes to manage their own cgroup hierarchy.
16.2.3 CPU Controller Integration
The cgroup cpu controller maps to the UmkaOS scheduler:
- cpu.weight: Mapped directly to the EEVDF task weight (range 1-10000, default 100). A higher weight grants a proportionally larger share of CPU time relative to sibling cgroups with lower weights.
- cpu.max: Sets the bandwidth ceiling using CFS-style throttling. Format: "<quota> <period>" (both in microseconds). Example: "400000 1000000" limits the cgroup to 40% CPU (400 ms per 1000 ms period). This is a maximum limit, not a guarantee. When throttled, the cgroup's tasks are removed from the run queue until the next period begins. This matches standard Linux cgroup v2 semantics.
- cpu.guarantee: (UmkaOS extension, see Section 6.3) Sets the bandwidth floor using Constant Bandwidth Server (CBS). Format: "<budget> <period>". Guarantees minimum CPU time regardless of other load. This is distinct from cpu.max: a cgroup can have both a guarantee (floor) and a limit (ceiling).
Relationship between cpu.max and cpu.guarantee:
| Setting | Effect | Use Case |
|---------|--------|----------|
| cpu.max only | Limits maximum, no minimum | Prevent runaway containers |
| cpu.guarantee only | Guarantees minimum, no maximum | RT workloads that need bounded latency |
| Both | Guarantees minimum AND limits maximum | Mixed workloads with SLA |
When a cgroup is throttled (by either mechanism), the scheduler removes its tasks from the EEVDF tree until the next period or until budget is replenished.
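A sketch of parsing the cpu.max format described above ("<quota> <period>", with the literal string "max" meaning unlimited). Treating the period as optional and falling back to a caller-supplied default is an assumption modeled on Linux's cpu.max write semantics, not a confirmed UmkaOS behavior.

```rust
/// Parse a cpu.max write. Returns (quota_us, period_us) where
/// quota_us == None means "max" (unlimited); returns None on a
/// malformed string (the syscall path would map that to EINVAL).
fn parse_cpu_max(s: &str, default_period: u64) -> Option<(Option<u64>, u64)> {
    let mut it = s.split_whitespace();
    let quota = match it.next()? {
        "max" => None,
        q => Some(q.parse().ok()?),
    };
    let period = match it.next() {
        Some(p) => p.parse().ok()?,
        None => default_period, // period omitted: keep the configured one
    };
    if it.next().is_some() { return None; } // trailing garbage
    Some((quota, period))
}

fn main() {
    // "400000 1000000" → 40% of one CPU, as in the example above.
    assert_eq!(parse_cpu_max("400000 1000000", 100_000),
               Some((Some(400_000), 1_000_000)));
    assert_eq!(parse_cpu_max("max", 100_000), Some((None, 100_000)));
    assert_eq!(parse_cpu_max("nonsense 5", 100_000), None);
}
```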
16.2.4 Memory Controller Integration
The memory controller tracks physical page allocations per cgroup:
- memory.current: The sum of all pages charged to this cgroup (in bytes).
- memory.max: Hard limit (bytes). When exceeded, the per-cgroup OOM killer is invoked.
- memory.high: Soft limit (bytes). When exceeded, the cgroup is throttled and its pages are prioritized for reclaim, but no OOM occurs.
- memory.low: Memory protection (bytes). Pages below this threshold are protected from reclaim unless the system is under severe pressure.
- memory.swap.max: Limits swap usage for this cgroup (Section 4.2.1).
Per-cgroup OOM killer: When memory.current exceeds memory.max, the OOM killer selects a victim within the cgroup subtree only — processes outside this cgroup are not affected. This is independent of global OOM (Section 4.1): per-cgroup OOM can trigger even when global memory is not exhausted. Victim selection criteria:
1. Select the task with the largest RSS within the cgroup subtree
2. Respect per-process oom_score_adj values: processes with OOM_SCORE_ADJ=-1000 are exempt from per-cgroup OOM (matching Linux semantics)
3. The final score is RSS + oom_score_adj_factor, with -1000 meaning "exempt"
This differs from the global OOM heuristic (Section 4.1, not Section 3.2.3 which covers panic handling) in scope: per-cgroup OOM selects victims only within the cgroup subtree, while global OOM considers all processes. Both respect oom_score_adj for compatibility with container workloads that use it to protect critical processes.
Memory accounting has low overhead (~1 atomic increment per page charge); the charge operation piggybacks on existing page table allocation routines in umka-core.
16.2.5 I/O Controller Integration
The io controller limits block I/O bandwidth and IOPS per cgroup:
- io.max: Per-device limits. Format: "<major>:<minor> rbps=<bytes> wbps=<bytes> riops=<ops> wiops=<ops>". Example: "8:0 rbps=10485760 wbps=5242880" limits reads to 10 MB/s and writes to 5 MB/s on device 8:0.
- io.weight: Proportional weight for best-effort I/O scheduling (1-10000, default 100).
The block I/O subsystem (Section 14.3) integrates with cgroup accounting: each bio (block I/O request) is tagged with its originating cgroup, and the I/O scheduler enforces per-cgroup limits.
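One plausible enforcement mechanism for the io.max byte-rate limits is a per-device, per-direction token bucket, sketched below. `TokenBucket` and its one-second burst cap are illustrative assumptions, not the scheduler's documented algorithm; the real integration happens per-bio as described above.

```rust
/// Token-bucket sketch for one direction of io.max (e.g. rbps).
/// Tokens are bytes; refill is proportional to elapsed time.
struct TokenBucket {
    limit_bps: u64, // bytes per second (io.max rbps or wbps)
    tokens: u64,    // currently available bytes
    last_ns: u64,   // timestamp of the last refill
}

impl TokenBucket {
    /// Refill for elapsed time, then try to admit a request of `bytes`.
    /// Returns false when the bio must be queued until tokens accrue.
    fn admit(&mut self, now_ns: u64, bytes: u64) -> bool {
        let elapsed = now_ns - self.last_ns;
        self.last_ns = now_ns;
        let refill = self.limit_bps.saturating_mul(elapsed) / 1_000_000_000;
        // Cap the bucket at one second's worth of tokens (burst bound).
        self.tokens = (self.tokens + refill).min(self.limit_bps);
        if bytes <= self.tokens {
            self.tokens -= bytes;
            true
        } else {
            false
        }
    }
}

fn main() {
    // 10 MB/s read limit, matching the io.max example above.
    let mut tb = TokenBucket { limit_bps: 10_485_760, tokens: 0, last_ns: 0 };
    assert!(!tb.admit(0, 4096));            // empty bucket: queue the bio
    assert!(tb.admit(1_000_000_000, 4096)); // one second later: admitted
}
```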
16.2.6 PIDs Controller (Fork Bomb Prevention)
- pids.max: Maximum number of tasks (threads + processes) in the cgroup subtree. Prevents fork bombs from exhausting system-wide PID space. fork()/clone() returns EAGAIN when the limit is reached.
- pids.current: Current number of tasks in the cgroup.
This is critical for container isolation: a misbehaving container cannot exhaust the host's PID space.
16.2.7 Cpuset Controller (CPU and NUMA Pinning)
- cpuset.cpus: CPUs allowed for tasks in this cgroup. Format: "0-3,8-11" (CPU list).
- cpuset.mems: NUMA nodes allowed for memory allocation. Format: "0,2" (node list).
- cpuset.cpus.partition: Partition mode (root, member, isolated). Isolated partitions have exclusive CPU access.
The scheduler respects cpuset constraints when selecting a CPU for a task. NUMA-aware allocation (Section 4.1) respects the cpuset.mems mask.
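Parsing the cpuset.cpus list format ("0-3,8-11") can be sketched as follows. Returning a `Vec<u32>` is purely illustrative — the kernel would set bits in a `CpuMask` bitmap instead.

```rust
/// Parse a cpuset list string like "0-3,8-11" into a sorted CPU vector.
fn parse_cpu_list(s: &str) -> Option<Vec<u32>> {
    let mut cpus = Vec::new();
    if s.trim().is_empty() {
        return Some(cpus); // empty = inherit from parent
    }
    for part in s.trim().split(',') {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let (lo, hi): (u32, u32) = (lo.parse().ok()?, hi.parse().ok()?);
                if lo > hi { return None; } // inverted range is malformed
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.parse().ok()?),
        }
    }
    Some(cpus)
}

fn main() {
    assert_eq!(parse_cpu_list("0-3,8-11"),
               Some(vec![0, 1, 2, 3, 8, 9, 10, 11]));
    assert_eq!(parse_cpu_list("5"), Some(vec![5]));
    assert_eq!(parse_cpu_list("3-1"), None);
}
```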
16.2.8 Freezer (Cgroup Pause/Resume)
- cgroup.freeze: Write 1 to freeze all tasks in the cgroup subtree; write 0 to thaw.
- cgroup.events: Contains frozen 0/1 indicating current frozen state.
Frozen tasks are removed from the run queue and cannot be scheduled. Used by docker pause and checkpoint/restore.
RCU Interaction. Frozen tasks cannot execute code and therefore cannot report RCU quiescent states. To prevent RCU grace periods from blocking indefinitely, UmkaOS's RCU subsystem treats entry into TASK_FROZEN as an implicit quiescent state: the cgroup freezer calls rcu_report_dead() (the same hook used at task exit) on behalf of the frozen task's CPU at the moment the task is frozen. This is safe because a frozen task holds no RCU read-side critical sections — it is not executing, so it cannot be inside rcu_read_lock(). When the task is thawed it re-enters the normal quiescent-state reporting cycle with no special handling required. This design ensures that container pause/resume, whole-cgroup SIGSTOP, and checkpoint-restore operations never stall RCU grace periods regardless of freeze duration.
16.2.9 Additional Controllers
| Controller | Key Interface | Description |
|---|---|---|
| hugetlb | hugetlb.<size>.max | Limits huge page allocations per cgroup |
| rdma | rdma.max | Limits RDMA/InfiniBand resources |
| misc | misc.max | Limits miscellaneous resources (e.g., SGX EPC) |
16.2.10 Cgroup v1 Compatibility Translation
UmkaOS exposes cgroup v2 exclusively inside the kernel. For userspace processes that set
cgroup v1 knobs (Docker Engine ≤20.10, systemd pre-247, legacy orchestrators), UmkaOS
provides a v1-to-v2 translation shim implemented as a virtual filesystem
(cgroupv1fs) that mounts the legacy hierarchy paths at /sys/fs/cgroup/cpu,
/sys/fs/cgroup/memory, etc. The full shim specification is in
Section 18.1.7. This
section documents the authoritative translation table and the cpu.shares → cpu.weight
formula that the shim applies.
Translation table (v1 write → v2 equivalent):
| Subsystem | v1 knob | v2 equivalent | Conversion formula |
|---|---|---|---|
| memory | memory.limit_in_bytes | memory.max | Direct (bytes); -1 → "max" |
| memory | memory.soft_limit_in_bytes | memory.high | Direct (bytes) |
| memory | memory.memsw.limit_in_bytes | memory.swap.max | swap_max = memsw - mem |
| memory | memory.kmem.limit_in_bytes | (no v2 equivalent) | Silently ignored (kmem tracking removed in v2) |
| memory | memory.oom_control (disable OOM) | memory.oom.group | oom_kill_disable=1 → oom.group=1 |
| cpu | cpu.shares | cpu.weight | weight = clamp(1 + (shares − 2) × 9999 / 262142, 1, 10000) |
| cpu | cpu.cfs_quota_us + cpu.cfs_period_us | cpu.max | "$quota $period" (µs); quota=-1 → "max $period" |
| cpuacct | cpuacct.usage | cpu.stat (usage_usec) | Read-only; ns→µs unit conversion |
| blkio | blkio.throttle.read_bps_device | io.max (rbps=N) | MAJ:MIN rbps=N |
| blkio | blkio.throttle.write_bps_device | io.max (wbps=N) | MAJ:MIN wbps=N |
| blkio | blkio.throttle.read_iops_device | io.max (riops=N) | MAJ:MIN riops=N |
| blkio | blkio.throttle.write_iops_device | io.max (wiops=N) | MAJ:MIN wiops=N |
| blkio | blkio.weight | io.weight | Direct (1–1000 range) |
| freezer | freezer.state | cgroup.freeze | FROZEN → "1", THAWED → "0" |
| net_cls | net_cls.classid | (no v2 equivalent) | Logged and ignored; use eBPF for network classification |
| net_prio | net_prio.ifpriomap | (no v2 equivalent) | Logged and ignored |
| pids | pids.max | pids.max | Direct |
| devices | devices.allow / devices.deny | BPF_PROG_TYPE_CGROUP_DEVICE | Translated to eBPF program attached to the cgroup |
| hugetlb | hugetlb.Xm.limit_in_bytes | hugetlb.Xm.max | Direct |
| rdma | rdma.max | rdma.max | Direct |
cpu.shares formula derivation: Linux cpu.shares range is [2, 262144]; cpu.weight
range is [1, 10000]. The formula is a linear interpolation that maps the full v1 range
onto the full v2 range:
weight = clamp(1 + (shares - 2) × 9999 / 262142, 1, 10000)
This is the formula used by runc (the OCI reference runtime), containerd, and crun as of 2025. Key values:
| v1 cpu.shares | v2 cpu.weight |
|---|---|
| 2 (minimum) | 1 |
| 1024 (Docker default) | ~39 |
| 262144 (maximum) | 10000 |
Systemd divergence: systemd (≥247) writes cpu.weight directly when operating in
cgroup v2 mode, using its own unit mapping (default weight = 100) rather than the runc
formula. When systemd writes v2 files natively, those writes bypass the shim entirely
and go straight to the cgroupfs. The shim translates only raw v1 cgroupfs writes from
programs that open the legacy v1 paths directly (older Docker daemons, legacy
orchestrators).
Implementation: The shim is implemented in umka-compat as cgroupv1fs, a VFS
pseudo-filesystem that mounts legacy controller directories. Writes to v1 paths invoke
the translation function below and apply the result to the v2 cgroupfs. Reads return
translated v2 values in v1 format.
/// Translate a cgroup v1 write to its v2 equivalent.
///
/// Returns `None` if the v1 knob has no v2 equivalent (silently ignored).
/// The caller is responsible for applying the returned `CgroupV2Write` to the
/// actual v2 cgroupfs node for the same cgroup.
pub fn cgroupv1_translate(
subsystem: CgroupV1Subsystem,
knob: &str,
value: &[u8],
) -> Option<CgroupV2Write> {
match (subsystem, knob) {
(CgroupV1Subsystem::Memory, "memory.limit_in_bytes") => {
let bytes = parse_bytes_or_max(value)?;
Some(CgroupV2Write { path: "memory.max", value: format_bytes_or_max(bytes) })
}
(CgroupV1Subsystem::Cpu, "cpu.shares") => {
let shares: u64 = parse_u64(value).ok()?;
let weight = 1u64.saturating_add(
shares.saturating_sub(2).saturating_mul(9999) / 262142
).clamp(1, 10000);
Some(CgroupV2Write { path: "cpu.weight", value: weight.to_string() })
}
// net_cls.classid, net_prio.ifpriomap, memory.kmem.limit_in_bytes:
// no v2 equivalent — return None (logged by caller, not propagated).
(CgroupV1Subsystem::NetCls, _)
| (CgroupV1Subsystem::NetPrio, _)
| (CgroupV1Subsystem::Memory, "memory.kmem.limit_in_bytes") => None,
// Full table follows the translation table above for all other knobs.
_ => cgroupv1_translate_full(subsystem, knob, value),
}
}
16.3 POSIX Inter-Process Communication (IPC)
UmkaOS supports standard POSIX IPC mechanisms, optimized using UmkaOS's native zero-copy primitives where possible.
16.3.1 AF_UNIX Sockets
Local domain sockets (AF_UNIX) are heavily used in containerized environments (e.g., Docker, Kubernetes).
Zero-Copy Process-to-Process Rings: For SOCK_STREAM sockets, UmkaOS maps the connection to a pair of single-producer/single-consumer (SPSC) ring buffers shared directly between the two processes. These are distinct from the kernel-domain KABI ring buffers (Section 10.6), which are fixed-size command/completion rings for Tier 0/Tier 1 communication. The AF_UNIX ring buffer is:
/// Process-to-process SPSC ring for AF_UNIX SOCK_STREAM zero-copy.
/// Mapped into both processes' address spaces at connection time.
pub struct UserSpscRing {
/// Ring buffer memory, shared between sender and receiver.
/// Mapped read-write in sender, read-only in receiver.
pub buffer: *mut u8,
/// Total buffer size in bytes (power of 2 for efficient masking).
pub capacity: usize,
/// Write position (updated by sender, read by receiver).
/// Stored in a separate cache line to avoid false sharing.
/// Note: CacheAligned uses 128-byte alignment (max of x86/ARM/PPC cache lines)
/// to prevent false sharing on all 6 target architectures.
pub write_pos: CacheAligned<AtomicU64>,
/// Read position (updated by receiver, read by sender).
pub read_pos: CacheAligned<AtomicU64>,
/// Futex word for blocking when ring is full (sender waits) or empty (receiver waits).
///
/// Protocol:
/// Value 0 = IDLE (no waiter blocked).
/// Value 1 = SENDER_WAITING (sender blocked because ring is full).
/// Value 2 = RECEIVER_WAITING (receiver blocked because ring is empty).
///
/// Sender path:
/// 1. Check space: if (write_pos - read_pos) < capacity, write data, advance write_pos.
/// 2. If ring is full: store(1, Release) into futex_word, then FUTEX_WAIT(&futex_word, 1).
/// 3. When receiver advances read_pos, it checks futex_word. If == 1, store(0) + FUTEX_WAKE.
///
/// Receiver path:
/// 1. Check data: if write_pos != read_pos, read data, advance read_pos.
/// 2. If ring is empty: store(2, Release) into futex_word, then FUTEX_WAIT(&futex_word, 2).
/// 3. When sender advances write_pos, it checks futex_word. If == 2, store(0) + FUTEX_WAKE.
pub futex_word: AtomicU32,
}
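The lock-free index arithmetic from the struct comments above can be sketched single-threaded and without the futex slow path. In this sketch, `Ring` is a heap-local stand-in for the shared mapping (the real ring is mmap'd into both processes), and `write_pos`/`read_pos` are free-running u64 counters: with a power-of-two capacity, `pos & (capacity - 1)` yields the buffer offset and no wrap handling is ever needed.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct Ring {
    buf: Vec<u8>, // shared mapping in the real design
    write_pos: AtomicU64,
    read_pos: AtomicU64,
}

impl Ring {
    fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Ring { buf: vec![0; capacity],
               write_pos: AtomicU64::new(0), read_pos: AtomicU64::new(0) }
    }

    /// Sender path, step 1: copy what fits, advance write_pos (Release
    /// pairs with the receiver's Acquire load so the bytes are visible).
    /// Returns bytes written; 0 means "ring full — fall back to the
    /// futex wait described in the protocol above".
    fn send(&mut self, data: &[u8]) -> usize {
        let w = self.write_pos.load(Ordering::Relaxed);
        let r = self.read_pos.load(Ordering::Acquire);
        let free = self.buf.len() - (w - r) as usize;
        let n = data.len().min(free);
        let mask = self.buf.len() - 1;
        for i in 0..n {
            self.buf[(w as usize + i) & mask] = data[i];
        }
        self.write_pos.store(w + n as u64, Ordering::Release);
        n
    }

    /// Receiver path: the mirror image of send().
    fn recv(&mut self, out: &mut [u8]) -> usize {
        let r = self.read_pos.load(Ordering::Relaxed);
        let w = self.write_pos.load(Ordering::Acquire);
        let avail = (w - r) as usize;
        let n = out.len().min(avail);
        let mask = self.buf.len() - 1;
        for i in 0..n {
            out[i] = self.buf[(r as usize + i) & mask];
        }
        self.read_pos.store(r + n as u64, Ordering::Release);
        n
    }
}

fn main() {
    let mut ring = Ring::new(8);
    assert_eq!(ring.send(b"hello!"), 6);
    assert_eq!(ring.send(b"xyz"), 2); // only 2 bytes of space remain
    let mut out = [0u8; 8];
    assert_eq!(ring.recv(&mut out), 8);
    assert_eq!(&out, b"hello!xy");
}
```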
- The umka-compat layer intercepts send() and recv() calls and translates them into ring buffer enqueues/dequeues.
- Data is copied twice: once from the sender's buffer into the shared ring, once from the ring into the receiver's buffer. This eliminates the traditional kernel-buffer intermediate copy, reducing the path from 3 copies to 2.
- The kernel is only invoked via futex when a ring is full/empty and the process must block.
SOCK_SEQPACKET message boundaries:
SOCK_SEQPACKET requires preserving message boundaries — recv() must return exactly one message per call. The ring buffer protocol includes a 4-byte length header before each message:
/// Message format in SOCK_SEQPACKET ring:
/// | msg_len: u32 | data: [u8; msg_len] | msg_len: u32 | data: ... |
///
/// The receiver reads msg_len, then reads exactly that many bytes.
/// Short reads (buffer smaller than msg_len) discard the remainder of the message.
For SOCK_DGRAM AF_UNIX sockets, a similar framed protocol is used, but the ring is unidirectional (no connection, just a receive queue per socket).
16.3.2 Pipes and FIFOs
Standard pipes are implemented as bounded in-memory buffers managed by the VFS.
- For high-throughput scenarios, applications can use vmsplice() with SPLICE_F_GIFT to move pages from user memory into a pipe without copying.
- Internally, a pipe is a specialized VfsNode that maintains a wait queue for readers and writers.
Pipe data structure:
/// Default pipe capacity: 16 pages × 4 KB = 64 KB, matching Linux's default
/// since kernel 2.6.11. This is the inline fast-path capacity.
pub const PIPE_DEFAULT_PAGES: usize = 16;
/// Pipe buffer: inline storage for the common case (≤16 pages = 64KB default pipe),
/// with heap fallback for pipes expanded via fcntl(F_SETPIPE_SZ).
///
/// Allocated when pipe(2) or pipe2(2) is called.
///
/// The default buffer size is 65536 bytes (64 KB). The size is configurable via
/// fcntl(F_SETPIPE_SZ) up to /proc/sys/fs/pipe-max-size (default 1 MB; root with
/// CAP_SYS_RESOURCE may raise further, hard limit 2^31 bytes per Linux
/// `round_pipe_size()`).
///
/// **Zero-copy optimization**: When a pipe page is "gifted" via vmsplice()
/// with SPLICE_F_GIFT, the page is transferred to the pipe without copying.
/// The gifted page is unmapped from the sender's address space and becomes
/// owned by the pipe until read. This enables zero-copy data pipelines.
///
/// **Allocation model**: The inline `pages_small` array covers the standard 64 KB
/// default pipe (16 pages × 4 KB). When `fcntl(F_SETPIPE_SZ)` sets capacity
/// beyond 16 pages, the buffer transitions to `pages_large` (a heap-allocated
/// `Vec<PipePage>`). This hybrid approach keeps the struct compact (384 bytes of
/// inline page storage vs. the previous 6144 bytes) while supporting the full
/// Linux pipe size range.
pub struct PipeBuffer {
// === First cache line(s): hot-path lock-free atomic fields ===
// These fields are accessed on every read/write syscall without holding
// any lock. Placing them first ensures they occupy the initial cache
// lines of the heap-allocated struct, minimising cache misses on the
// common single-reader/single-writer path.
/// Index of the first page with data (read cursor).
pub read_idx: AtomicU32,
/// Index of the first empty page (write cursor).
pub write_idx: AtomicU32,
/// Byte offset within pages[read_idx] for partial reads.
pub read_offset: AtomicU32,
/// Byte offset within pages[write_idx] for partial writes.
pub write_offset: AtomicU32,
/// Total bytes currently in the pipe (atomic for lock-free size check).
pub len: AtomicU32,
/// Total pipe capacity in bytes (set by fcntl F_SETPIPE_SZ, default 65536).
pub capacity: AtomicU32,
/// Seqlock for detecting concurrent fcntl(F_SETPIPE_SZ) during lock-free writes.
/// Odd values indicate resize in progress; even values indicate stable.
/// Writers read before and after; if changed, retry.
pub resize_seq: AtomicU32,
/// Count of active single-writer fast-path operations.
/// fcntl(F_SETPIPE_SZ) waits for this to reach 0 before resizing.
pub active_writer: AtomicU32,
// === Warm fields: reader/writer reference counts and page count ===
/// Number of readers (for detecting write-side SIGPIPE).
/// When this drops to 0, write() returns EPIPE.
pub reader_count: AtomicU32,
/// Number of writers (for detecting read-side EOF).
/// When this drops to 0 and the pipe is empty, read() returns 0.
pub writer_count: AtomicU32,
/// Count of valid entries in `pages_small` (0 when `pages_large` is in use).
/// Maximum value: `PIPE_DEFAULT_PAGES` (16).
pub small_len: AtomicU8,
// === Cold fields: locks, wait queues, and page storage ===
// Only accessed on blocked paths (empty/full) and on resize.
/// Wait queue for blocked readers (pipe empty).
pub read_wait: WaitQueueHead,
/// Wait queue for blocked writers (pipe full).
pub write_wait: WaitQueueHead,
/// Lock for modifying the page ring (growing/shrinking) and multi-writer path.
/// The lock-free single-writer path does not hold this lock.
pub ring_lock: Mutex<()>,
/// Reader-writer lock for multi-reader coordination on FIFOs.
/// When multiple readers exist, readers acquire this in shared mode
/// (concurrent readers don't contend with each other). Each reader
/// atomically claims bytes via `read_idx.fetch_add()`, then reads from
/// its claimed range without holding any lock. Writers do NOT acquire
/// this lock — it coordinates readers only. The write path uses
/// `ring_lock` (multi-writer) or the lock-free single-writer path.
pub read_lock: RwLock<()>,
/// Heap-allocated pages for pipes expanded beyond 16 pages.
/// `None` until the first `fcntl(F_SETPIPE_SZ)` exceeding 16 pages.
/// Allocated from the general kernel heap (not slab) because expanded
/// pipe buffers are rare and size varies. Accessed only while holding
/// `ring_lock`.
pub pages_large: Option<Vec<PipePage>>,
/// Inline storage for ≤ 16 pages (covers the 64 KB default pipe size).
/// Zero-allocation fast path for the common case.
/// `MaybeUninit` avoids initialization cost for unused slots; only the
/// first `small_len` entries are valid and may be read.
/// Placed last so the hot atomic fields above occupy the initial cache lines.
pub pages_small: [MaybeUninit<PipePage>; PIPE_DEFAULT_PAGES],
}
> **Design rationale**: `PipeBuffer` is heap-allocated (not stack-allocated). The hot-path
> atomic counters (`read_idx`, `write_idx`, `len`, `resize_seq`, `active_writer`) are placed
> **first** so they occupy the initial cache lines of the allocation; the 384-byte
> `pages_small` array is placed **last** so it does not evict the hot counters on
> lock-free read/write paths.
>
> **Inline vs. heap page storage**: `pages_small` covers the standard 64 KB pipe
> (16 pages × 4 KB). The previous design used a 256-slot inline array
> (`[PipePage; 256]` = ~6144 bytes) sized for the maximum possible pipe (1 MB via
> `F_SETPIPE_SZ`), which is rarely reached in practice — default Linux pipes are 64 KB
> (16 pages), and most pipes never exceed this. The 16-slot inline array reduces the
> struct's page-storage footprint from 6144 bytes to 384 bytes (16 × 24 B), a 16×
> reduction that dramatically improves slab allocator cache density.
>
> **Transition to heap**: When `fcntl(F_SETPIPE_SZ)` sets capacity beyond 16 pages
> (> 64 KB), the buffer transitions to `pages_large` (`Vec<PipePage>`) and `small_len`
> is set to 0. This transition is uncommon in production workloads. The `Vec` is
> allocated from the general kernel heap (not slab) because expanded pipe sizes vary.
>
> **`MaybeUninit` wrapper**: The `MaybeUninit<PipePage>` wrapper avoids initialization
> cost for unused inline slots. Only the first `small_len` entries contain valid
> data; the remainder is uninitialised memory and must never be read.
/// A single page in the pipe buffer.
pub struct PipePage {
/// Physical page containing the data.
/// Allocated from the page allocator or gifted via vmsplice.
pub page: PhysPage,
/// Number of valid bytes in this page (0 = empty, PAGE_SIZE = full).
/// For gifted pages, this is the full page; for standard writes,
/// partial pages are possible.
pub len: AtomicUsize,
/// True if this page was gifted via vmsplice(SPLICE_F_GIFT).
/// Gifted pages are unmapped from the sender and transferred to
/// the reader; standard pages are copied.
pub is_gifted: AtomicBool,
}
Pipe write algorithm (lock-free fast path):
Note: This algorithm assumes single-writer semantics for the lock-free fast path. POSIX pipes allow multiple concurrent writers, but guarantee write atomicity only for writes of at most PIPE_BUF (4096) bytes. For UmkaOS's high-performance path, the lock-free algorithm below requires exactly one concurrent writer — multi-writer scenarios fall back to the mutex-protected slow path described below.
Multi-writer slow path (POSIX atomicity guarantee for writes ≤ PIPE_BUF):
When multiple writers are detected (via writer_count.load(Acquire) > 1), all
writers acquire ring_lock (a mutex) before writing. Under ring_lock:
1. The writer checks available space (same as step 5a of the fast path).
2. If remaining ≤ PIPE_BUF (4096), the write is performed atomically: all
bytes are written to contiguous pages before write_idx is advanced. If
insufficient contiguous space exists, the writer sleeps on the pipe's
wait queue until space is available (matching Linux POSIX behaviour).
3. If remaining > PIPE_BUF, POSIX does not guarantee atomicity. The write
proceeds page-by-page under the mutex (may interleave with other large writes).
4. write_idx and len are updated under the mutex, then ring_lock is released.
The interaction with the lock-free reader is safe because the reader only reads
committed pages (visible via len.load(Acquire)), and the reader's read_idx
advancement is atomic. The resize_seq seqlock interaction is the same as
the fast path — fcntl(F_SETPIPE_SZ) acquires ring_lock and waits for
active writers.
Multi-reader coordination: When multiple readers exist on a FIFO, readers
acquire read_lock (a reader-writer lock, separate from the ring_lock mutex) in
shared mode, so concurrent readers do not contend with each other. Each reader
then atomically claims bytes via read_idx.fetch_add() and reads from its claimed
range without holding any lock. This allows concurrent readers to make progress
on different regions of the pipe buffer.
Resize safety: The lock-free write path uses a resize_seq: AtomicU32 seqlock to detect concurrent fcntl(F_SETPIPE_SZ) operations. Before starting the write loop, the writer reads the seqlock; after completing each page, it re-checks. If the seqlock changed, the writer retries from the beginning with the new page count. fcntl(F_SETPIPE_SZ) acquires ring_lock, waits for in-flight single-writers via an active_writer count, increments resize_seq, performs the resize (potentially transitioning from pages_small to pages_large when expanding beyond 16 pages), and increments resize_seq again. This ensures the lock-free path never observes an inconsistent buffer size.
write(pipe, data, len):
0. remaining = len; written = 0
1. seq_start = resize_seq.load(Acquire) // Capture resize generation
2. If reader_count.load(Acquire) == 0: return EPIPE (SIGPIPE to caller)
// TOCTOU note: reader may close between this check and write. This is
// acceptable per POSIX — data written to a pipe with no readers is simply
// discarded, and the next write() will observe reader_count == 0 and
// return EPIPE. The pipe remains consistent; no data corruption occurs.
3. // Try to claim fast path via compare-and-swap
if active_writer.compare_exchange(0, 1, Acquire, Relaxed).is_err():
// Another writer active — take slow path with ring_lock
return write_slow_path(pipe, data, len)
4. current_num_pages = pipe.page_count() // Derive from small_len or pages_large.len()
5. while remaining > 0:
a. If len.load() >= capacity.load():
// Pipe full — block on write_wait
active_writer.store(0, Release) // Release during wait
wait_event_interruptible(write_wait, len.load() < capacity)
if interrupted: return written if written > 0 else EINTR // partial count, or EINTR if nothing written
// Re-acquire fast path and re-check
if active_writer.compare_exchange(0, 1, Acquire, Relaxed).is_err():
// Lost to another writer during wait — take slow path
return written + write_slow_path(pipe, &data[written..], remaining)
if reader_count.load(Acquire) == 0:
active_writer.store(0, Release)
return EPIPE
// Check for resize during wait
if resize_seq.load(Acquire) != seq_start:
active_writer.store(0, Release)
goto 1 // Retry with new seq_start; written/remaining preserved
b. write_idx_val = write_idx.load(Relaxed)
c. write_off = write_offset.load(Relaxed)
d. available = min(PAGE_SIZE - write_off, remaining)
e. copy data[written:written+available] to pages[write_idx_val][write_off:]
e'. pages[write_idx_val].len.store(write_off + available, Release) // Update per-page len
f. write_offset.store(write_off + available, Release)
g. len.fetch_add(available, Release) // Publishing barrier for data in step e
h. If write_offset == PAGE_SIZE:
// Page full, advance to next — but first check for concurrent resize
if resize_seq.load(Acquire) != seq_start:
active_writer.store(0, Release)
goto 1 // Retry with new seq_start; written/remaining preserved
write_idx.store((write_idx_val + 1) % current_num_pages, Release)
write_offset.store(0, Release)
i. written += available; remaining -= available
6. active_writer.store(0, Release) // Release fast-path lock
7. wake_up(read_wait) // Notify any blocked readers
8. return written
Memory ordering rationale for write path: The Release on len.fetch_add() (step g) is the publishing barrier that synchronizes with the reader's Acquire load of global len. This ensures all prior stores (the data memcpy in step e, the per-page len update in step e') are visible to the reader before it observes the new len value. The reader must use the global len Acquire→per-page len Acquire chain.
fcntl(F_SETPIPE_SZ) implementation:
fcntl_setpipe_sz(pipe, new_size):
1. ring_lock.lock()
2. // Wait for active single-writers to complete using futex
while active_writer.load(Acquire) > 0:
// Use futex wait instead of busy-spin to avoid priority inversion
futex_wait(&active_writer, expected=1, timeout=1ms)
3. resize_seq.fetch_add(1, Release) // Start resize
4. old_pages = pages // Save pointer to old pages array
5. // Perform resize: if new_pages > 16, transition to pages_large (Vec);
// copy data from old pages, update small_len or pages_large
6. resize_seq.fetch_add(1, Release) // End resize
7. rcu_call(old_pages, free_pages_callback) // Defer freeing old pages array
8. ring_lock.unlock()
Design note — lock ordering during resize: The resize path holds ring_lock while waiting for active_writer to drain (with a 1 ms timeout). This prevents permanent deadlock but creates a retry loop if the writer is blocked on an unrelated resource. The implementation SHOULD drop ring_lock before the futex wait, re-acquire it after wake-up, and re-validate the resize preconditions. This two-phase approach (validate → release → wait → re-acquire → re-validate) eliminates the lock-while-wait pattern at the cost of one extra validation pass.
Memory safety during resize: When fcntl(F_SETPIPE_SZ) replaces the pages array, the OLD pages array is freed via rcu_call() (deferred until the next RCU grace period). This ensures that any concurrent reader in steps 4a–4e, which executes under an implicit RCU read-side critical section (preemption disabled during the pipe read fast path), will not access freed memory. The seqlock (resize_seq) detects that a resize occurred and triggers a retry, but the deferred freeing guarantees that the stale pointer remains valid for the duration of the read attempt.
Multi-writer support: When multiple threads write to the same pipe concurrently, the lock-free path cannot be used. The kernel detects multi-writer scenarios using a compare-and-swap pattern: a writer performs active_writer.compare_exchange(0, 1, Acquire, Relaxed). If successful (previous value was 0), it proceeds on the fast path. If it fails (another writer is active), it acquires ring_lock and takes the slow path. This ensures exactly one writer can be on the fast path at a time, preserving POSIX atomic write guarantees for writes ≤ PIPE_BUF.
Pipe read algorithm (lock-free, requires single reader or mutex for multi-reader):
read(pipe, buffer, len):
0. seq_start = resize_seq.load(Acquire) // Capture resize generation
1. If len.load(Acquire) == 0:
// Pipe empty — check for EOF or block
if writer_count.load(Acquire) == 0:
return 0 // EOF — all writers closed
wait_event_interruptible(read_wait, len.load(Acquire) > 0 || writer_count.load(Acquire) == 0)
if interrupted: return EINTR // no bytes read yet; returning 0 would falsely signal EOF
if len.load(Acquire) == 0 && writer_count.load(Acquire) == 0:
return 0 // EOF after wakeup
// Check for resize during wait
if resize_seq.load(Acquire) != seq_start:
goto 0 // Retry with new parameters
2. bytes_read = 0
3. current_num_pages = pipe.page_count() // Derive from small_len or pages_large.len()
4. while bytes_read < len && len.load(Acquire) > 0:
a. // Check for concurrent resize BEFORE accessing pages[] array.
// If resize occurred, the old pages[] pointer may be deallocated.
if resize_seq.load(Acquire) != seq_start:
seq_start = resize_seq.load(Acquire)
current_num_pages = pipe.page_count()
b. read_idx_val = read_idx.load(Acquire) // Acquire to see writer's stores
c. read_off = read_offset.load(Acquire)
d. // Determine bytes available in current page
page_len = pages[read_idx_val].len.load(Acquire)
available = min(page_len - read_off, len - bytes_read)
e. // Copy data from page to user buffer (Acquire ensures data is visible)
copy pages[read_idx_val][read_off:read_off+available] to buffer[bytes_read:]
f. // Post-copy validation: if a resize raced with the copy, the data
// may be stale. Discard this iteration and retry.
if resize_seq.load(Acquire) != seq_start:
seq_start = resize_seq.load(Acquire)
current_num_pages = pipe.page_count()
continue // Retry — do not commit read_offset or len changes
g. read_offset.store(read_off + available, Release)
h. If read_offset.load(Relaxed) >= page_len:
// Page consumed — advance index BEFORE decrementing len. This ensures
// a concurrent writer observing free space (via len) sees the updated
// read_idx and does not overwrite the page the reader just finished.
read_idx.store((read_idx_val + 1) % current_num_pages, Release)
read_offset.store(0, Release)
i. len.fetch_sub(available, Release) // Must be AFTER read_idx advance
j. bytes_read += available
5. wake_up(write_wait) // Notify any blocked writers
6. return bytes_read
Memory ordering rationale: The reader uses Acquire loads on len, read_idx, read_offset, and pages[].len to synchronize with the writer's Release stores. This ensures the reader observes all data written before the writer updated these indices. On weakly-ordered architectures (AArch64, RISC-V, ARMv7, PPC), this ordering is critical to prevent the reader from seeing stale data.
FIFOs (named pipes): A FIFO is a VFS node (VfsNode) that, when opened, creates a reference to an existing PipeBuffer or creates a new one. Multiple readers and writers can open a FIFO; the reader_count and writer_count fields track opens/closes. Writers use the multi-writer slow path when concurrent writes are detected. When the last reader and last writer close, the buffer is freed.
16.3.3 Shared Memory (POSIX and SysV)
- POSIX shm_open(): Implemented as a memory-mapped file (mmap) backed by a hidden tmpfs instance.
- SysV shmget(): Maps to the same underlying physical memory allocation mechanism, but managed via the CLONE_NEWIPC namespace tables.
Both mechanisms result in direct page table entries (PTEs) mapping the same physical frames into multiple Capability Domains.