
Chapter 7: Process and Task Management

Task/Process structs, fork/exec/exit, signals, process groups, sessions, real-time guarantees


7.1 Process and Task Management

This section defines how UmkaOS represents runnable entities, creates and destroys processes, loads programs, and manages the virtual address space operations that user space relies on. The scheduler that decides when tasks run is in Section 6.1; this section covers what tasks are and how they are born, transformed, and reaped.

7.1.1 Task Model

UmkaOS uses the task as its schedulable unit. Tasks are grouped into processes that share an address space, capability table, and file descriptor table. This mirrors the Linux thread-group model: a "process" is a collection of tasks created with CLONE_VM | CLONE_FILES | CLONE_SIGHAND, and a "thread" is simply another task within the same process.

Task descriptor:

bitflags! {
    /// Task scheduling state flags. Compound states are formed by ORing base flags.
    /// The scheduler checks these to decide when sleeping tasks can be woken.
    ///
    /// **Linux ABI compatibility**: The numeric values match Linux's TASK_* constants
    /// for /proc/[pid]/status "State:" field and ptrace compatibility.
    pub struct TaskState: u32 {
        /// Task is on a run queue and eligible to be scheduled (or currently running).
        /// Not a bit — the zero value. A task with no other flag set is RUNNING.
        const RUNNING           = 0x0000_0000;
        /// Sleeping; woken by signals, explicit wake_up(), or timer expiry.
        const INTERRUPTIBLE     = 0x0000_0001;
        /// Sleeping in uninterruptible wait; only woken by explicit wake_up().
        /// Used for I/O waits that must not be interrupted by signals.
        const UNINTERRUPTIBLE   = 0x0000_0002;
        /// Stopped by SIGSTOP or group stop. Resumed by SIGCONT.
        const STOPPED           = 0x0000_0004;
        /// Being ptraced; stopped at a ptrace event.
        const TRACED            = 0x0000_0008;
        /// Exit in progress; task_struct still exists for wait4() reaping.
        const ZOMBIE            = 0x0000_0010;
        /// All resources freed; task_struct about to be released.
        const DEAD              = 0x0000_0020;
        /// Modifier: woken by fatal signals (SIGKILL) even while UNINTERRUPTIBLE.
        /// Combine with UNINTERRUPTIBLE: KILLABLE = UNINTERRUPTIBLE | WAKEKILL.
        const WAKEKILL          = 0x0000_0100;
        /// Convenient alias: UNINTERRUPTIBLE that can be killed.
        const KILLABLE          = Self::UNINTERRUPTIBLE.bits() | Self::WAKEKILL.bits();
        /// Modifier: task should receive a wake-up IPI on the next tick
        /// (used by WEA fibers to avoid dedicated wakeup IPIs on short waits).
        const WAKEUP_DEFERRED   = 0x0000_0200;
        /// Task is being migrated between CPUs; temporarily off any run queue.
        const MIGRATING         = 0x0000_0400;
    }
}
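The interaction between the sleep states and the WAKEKILL modifier can be sketched as a std-only model. `can_wake_by_signal` is a hypothetical helper for illustration, not the kernel's actual wake path; the flag values mirror the definition above.

```rust
// Minimal model of the wake rules encoded by the TaskState flags.
const INTERRUPTIBLE: u32 = 0x0000_0001;
const UNINTERRUPTIBLE: u32 = 0x0000_0002;
const WAKEKILL: u32 = 0x0000_0100;
const KILLABLE: u32 = UNINTERRUPTIBLE | WAKEKILL;

/// Would a signal (fatal = SIGKILL) wake a task sleeping in `state`?
fn can_wake_by_signal(state: u32, fatal: bool) -> bool {
    if state & INTERRUPTIBLE != 0 {
        true // any signal wakes an INTERRUPTIBLE sleep
    } else if state & UNINTERRUPTIBLE != 0 {
        // only a fatal signal, and only when the WAKEKILL modifier is set
        fatal && (state & WAKEKILL != 0)
    } else {
        false // RUNNING, STOPPED, etc. are handled by other paths
    }
}
```

Note that plain UNINTERRUPTIBLE ignores even SIGKILL; only the KILLABLE combination is woken by fatal signals.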

/// Maximum task comm name length including null terminator. Linux-compatible value.
pub const TASK_COMM_LEN: usize = 16;

pub struct Task {
    /// Kernel-unique task identifier.
    pub tid: TaskId,
    /// Owning process (shared with sibling tasks).
    pub process: Arc<Process>,
    /// Task name (thread/process name). Set to the executable basename on exec()
    /// (truncated to 15 bytes + null terminator). Updated by prctl(PR_SET_NAME)
    /// and pthread_setname_np(). Exposed via /proc/[pid]/comm and
    /// /proc/[pid]/status (Name: field).
    ///
    /// **External ABI**: Must be [u8; TASK_COMM_LEN] for Linux /proc compatibility.
    /// The null terminator at index 15 is always maintained.
    pub comm: [u8; TASK_COMM_LEN],
    /// Scheduling state machine (see the `TaskState` flags above).
    pub state: TaskState,
    /// Scheduler bookkeeping (vruntime, deadline params, etc.).
    pub sched_entity: SchedEntity,
    /// Which CPUs this task may run on.
    pub cpu_affinity: CpuSet,
    /// Blocked and pending signal masks.
    pub signal_mask: SignalSet,
    /// Architecture-specific saved CPU context for context switching.
    /// Defined per-architecture in `umka-core/src/arch/*/context.rs`.
    /// On x86-64: contains saved RBP, RSP, RBX, R12-R15, RIP (return address),
    ///   SSE/AVX state pointer, and FPU state. (~96 bytes)
    /// On AArch64: contains saved X19-X28, X29 (FP), X30 (LR), SP, and
    ///   TPIDR_EL0. (~88 bytes)
    /// On RISC-V 64: saved s0-s11, ra, sp, and thread pointer. (~104 bytes)
    /// On ARMv7: saved r4-r11, lr, sp. (~40 bytes)
    /// The architecture module's `context_switch(prev: &mut Task, next: &Task)`
    /// saves the current registers into `prev.context` and restores from `next.context`.
    /// `ArchContext` is an opaque type alias: `pub type ArchContext = arch::current::context::SavedContext;`
    pub context: ArchContext,
    /// Per-task capability restriction handle. Acts as a per-thread restriction
    /// mask: can only **narrow**, never widen, the process-wide CapSpace
    /// ([Section 8.1.1](09-security-extensions.md#93-capability-based-foundation)). A thread that restricts its own
    /// capabilities cannot re-grant them to itself or other threads in the same
    /// process. Used by sandboxed thread models (e.g., renderer threads that drop
    /// filesystem capabilities after startup).
    pub capabilities: CapHandle,
    /// Embedded futex waiter node (Section 18.2.1). A task can block on at
    /// most one futex at a time, so a single embedded node is sufficient.
    /// Linked into a FutexHashTable bucket when the task calls futex_wait;
    /// unlinked on wake or timeout. Uses intrusive linking to avoid heap
    /// allocation under the bucket spinlock.
    ///
    /// **Task exit safety**: When a task exits (including SIGKILL) while
    /// linked into a futex bucket, the exit path (`do_exit`) must unlink
    /// this node before freeing the Task struct. The cleanup sequence is:
    ///   1. Acquire the futex bucket spinlock for this waiter's bucket.
    ///   2. If `futex_waiter.is_linked()`, remove the node from the bucket
    ///      list (O(1) for intrusive doubly-linked list).
    ///   3. Release the bucket spinlock.
    /// This runs before address space teardown (step 3 of `do_exit`) and
    /// before the Task struct is freed. The bucket spinlock serializes
    /// against concurrent `futex_wake()` operations — a wakeup that races
    /// with task exit will either see the node (and wake it, which is
    /// harmless for a dying task) or see it already removed.
    pub futex_waiter: FutexWaiter,
    /// Scheduler upcall re-entrancy guard (Section 7.1.7.2).
    ///
    /// Set to `true` by the kernel immediately before transferring control to
    /// the userspace scheduler upcall handler, and cleared when the handler
    /// calls `SYS_scheduler_upcall_resume()` or `SYS_scheduler_upcall_block()`.
    ///
    /// While `in_upcall` is `true`:
    /// - A new blocking condition does **not** trigger a nested upcall. Instead,
    ///   the kernel falls back to standard 1:1 blocking for the duration of the
    ///   handler's own blocking operations.
    /// - Blocking syscalls entered from within the upcall handler return normally
    ///   when the I/O or wait completes; no re-entrancy into the upcall stack.
    /// - If a scheduling event occurs that would ordinarily issue an upcall,
    ///   `upcall_pending` is set instead, and the upcall is delivered as soon
    ///   as the current handler exits.
    ///
    /// `AtomicBool` with `Relaxed` load/store is sufficient: only the kernel
    /// modifies this field on behalf of the owning task, and signal/interrupt
    /// delivery always observes the correct value because the architecture
    /// guarantees single-copy atomicity for aligned byte reads.
    pub in_upcall: AtomicBool,
    /// Deferred upcall flag (Section 7.1.7.2).
    ///
    /// Set to `true` by the kernel when a scheduling event occurs while
    /// `in_upcall` is already `true` (i.e., the upcall handler is currently
    /// executing). The pending upcall cannot be delivered immediately because
    /// nesting would overwrite the `UpcallFrame` at the top of the upcall stack,
    /// corrupting the original fiber's saved register state.
    ///
    /// When the in-flight upcall handler calls `SYS_scheduler_upcall_resume()`
    /// or `SYS_scheduler_upcall_block()`:
    /// 1. `in_upcall` is cleared.
    /// 2. If `upcall_pending` is `true`:
    ///    a. Clear `upcall_pending`.
    ///    b. Re-examine the scheduling state and, if a blocking event is still
    ///       pending, deliver the deferred upcall immediately before returning
    ///       to user space. This is equivalent to a normal upcall delivery.
    ///
    /// At most one pending upcall is ever deferred: if multiple blocking
    /// conditions arise while the handler is executing, they merge into a single
    /// pending flag. The handler will re-examine the scheduler state on the next
    /// delivery and observe all accumulated events.
    pub upcall_pending: AtomicBool,
    /// Per-thread nonce for UpcallFrame integrity verification (Section 7.1.7.3).
    ///
    /// Generated from the hardware RNG (`RDRAND` on x86-64, `RNDR` on AArch64,
    /// `seed` CSR on RISC-V with Zkr, platform entropy source on PPC) at thread
    /// creation time (in `do_fork` / `do_clone`). Stored exclusively in kernel
    /// memory — never exposed to userspace through any ABI, register, or memory
    /// mapping.
    ///
    /// When the kernel builds an `UpcallFrame` on the upcall stack, it writes
    /// `magic_cookie = UPCALL_FRAME_MAGIC ^ upcall_frame_nonce` into the frame.
    /// On `SYS_scheduler_upcall_resume`, the kernel verifies that
    /// `frame.magic_cookie ^ upcall_frame_nonce == UPCALL_FRAME_MAGIC`. A frame
    /// not written by the kernel's own upcall dispatch code will fail this check,
    /// preventing a malicious fiber scheduler from forging frames with arbitrary
    /// register state.
    pub upcall_frame_nonce: u64,

    /// File descriptor table reference.
    ///
    /// Threads within the same process that were created with `CLONE_FILES`
    /// share a single `FdTable` instance via this `Arc`. Threads created
    /// without `CLONE_FILES` (e.g., after `fork()` without `CLONE_FILES`,
    /// or via `unshare(CLONE_FILES)`) own their own independent clone of
    /// the table.
    ///
    /// All fd operations (open, close, dup, read, write) go through this
    /// reference. The `FdTable` itself is lock-protected; the `Arc` wrapper
    /// allows the reference to be cloned cheaply at `fork()` / `clone()`.
    ///
    /// On `exec()`, the table is kept but all close-on-exec (`O_CLOEXEC`)
    /// file descriptors are closed atomically before the new binary gains
    /// control.
    pub files: Arc<FdTable>,

    // Resource limits are stored in `self.process.rlimits` (the `Process` struct).
    // All threads in a thread group share one `Process` and therefore one `RlimitSet`
    // automatically — no separate `Arc<RlimitSet>` layer needed. Enforcement code
    // reads `task.process.rlimits` directly. Writes (`setrlimit`, `prlimit64`) acquire
    // `task.process.rlimit_lock`. Fork copies the entire `Process::rlimits` into the
    // new child's `Process` struct — no shared Arc means no COW complexity.

    /// Intrusive sibling node linking this task into its parent's child list.
    ///
    /// Each process (specifically, each thread-group leader) is linked into
    /// its parent process's `children` list via this node. The node is inserted
    /// at `fork()` (parent acquires `Process::lock`, appends child) and removed
    /// on `do_exit()` after reparenting orphans to the init process.
    ///
    /// Using an intrusive node embedded in `Task` avoids per-child heap
    /// allocation in the `fork()` fast path. The node links by `TaskId`
    /// rather than `*mut Task` to avoid raw-pointer aliasing across the
    /// parent-child boundary; lookups resolve `TaskId` → `Task` via the
    /// global task table.
    ///
    /// Only the thread-group leader's `sibling_node` is linked; non-leader
    /// threads within the same process do NOT appear in the parent's child
    /// list. The node is in an unlinked state (`IntrusiveNode::is_linked()`
    /// returns `false`) for all non-leader tasks.
    pub sibling_node: IntrusiveNode<TaskId>,
}
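The truncation rule documented on `comm` can be sketched as follows. `set_comm` is an illustrative helper (the real update path sits behind prctl and exec), but the invariant it maintains matches the field's documentation: at most 15 name bytes, with index 15 always null.

```rust
// Maximum comm length including the null terminator (Linux-compatible).
const TASK_COMM_LEN: usize = 16;

/// Write `name` into a comm buffer, truncating to 15 bytes and
/// NUL-padding the remainder so index 15 is always 0.
fn set_comm(comm: &mut [u8; TASK_COMM_LEN], name: &str) {
    let n = name.as_bytes().len().min(TASK_COMM_LEN - 1);
    comm[..n].copy_from_slice(&name.as_bytes()[..n]);
    for b in &mut comm[n..] {
        *b = 0; // NUL-pad; the terminator at index 15 is preserved
    }
}
```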

/// All tasks (threads) sharing the same TGID, i.e., the same `Process`.
///
/// Embedded inside `Process`. A task is a member of exactly one `ThreadGroup`.
/// The list is protected by the containing `Process::lock` (a `SpinLock` guarding
/// `thread_group` and `children`).
///
/// Design note: Linux embeds its per-group state in `signal_struct`. UmkaOS keeps it
/// in `ThreadGroup` to make the ownership boundary explicit: `Process` owns the
/// group metadata, not individual tasks.
pub struct ThreadGroup {
    /// Number of live (non-zombie) threads in the group.
    /// Decremented at thread exit (before zombie state), incremented at `clone()`.
    pub count: AtomicU32,
    /// Exit code set by the first `exit_group()` call or a fatal signal delivered
    /// to any thread in the group. Encoded as a Linux-compatible wait status word
    /// (WEXITSTATUS / WTERMSIG format) so that `waitid()` can return it directly.
    /// `u32::MAX` while the group is alive (sentinel for "not yet exited").
    pub exit_code: AtomicU32,
    /// All `Task` IDs in this group. Intrusive list — each `Task` embeds a
    /// `group_node: IntrusiveListNode` for O(1) insert/remove.
    pub tasks: IntrusiveList<TaskId>,
}
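The `exit_code` sentinel protocol can be sketched with a compare-exchange: the first `exit_group()` (or fatal signal) claims the slot, and later callers lose the race so the original status word is preserved. The helper name is illustrative.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Attempt to record the group exit status. Returns true if this caller
/// was first; false if another thread already set the status.
fn set_group_exit(exit_code: &AtomicU32, status: u32) -> bool {
    exit_code
        .compare_exchange(u32::MAX, status, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```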

pub struct Process {
    /// Kernel-unique process identifier.
    pub pid: ProcessId,
    /// Page tables and VMA tree (Section 4.1.5).
    pub address_space: AddressSpace,
    /// Process-wide capability table (Section 8.1.1).
    pub cap_table: CapSpace,
    /// Open file descriptor table. Member tasks reference it through
    /// `Task::files` (an `Arc` clone of this handle), so there is a single
    /// source of truth per CLONE_FILES sharing group.
    pub fd_table: Arc<FdTable>,
    /// Parent process (None for init).
    pub parent: Option<ProcessId>,
    /// Children of this process. Intrusive doubly-linked list provides O(1)
    /// insertion at fork() and O(1) removal at child exit. Each child is
    /// linked through its thread-group leader's `Task::sibling_node`, which
    /// carries a `TaskId` rather than a raw pointer (see `Task::sibling_node`),
    /// avoiding per-child heap allocation in the fork() fast path and
    /// raw-pointer aliasing across the parent-child boundary. No reference
    /// cycle exists because the reverse edge (`parent`) is a scalar
    /// `Option<ProcessId>`, not a strong reference.
    pub children: IntrusiveList<TaskId>,
    /// All tasks sharing this address space.
    pub thread_group: ThreadGroup,
    /// Namespace membership (Section 7.1.6).
    pub namespaces: NamespaceSet,
    /// uid, gid, supplementary groups (Section 8.1.2).
    pub cred: Credentials,
    /// Shared signal handler table for this thread group.
    ///
    /// All threads created with `CLONE_SIGHAND` share the same `SignalHandlers`
    /// instance. `sigaction(2)` modifies the shared table under `SignalHandlers::lock`;
    /// signal delivery reads entries lock-free (array index by signal number is
    /// naturally atomic for pointer-sized fields on all supported architectures).
    ///
    /// On `exec()` the handler table is replaced: a fresh `SignalHandlers` is
    /// allocated with all dispositions reset to `SigHandler::Default` (except
    /// `SIG_IGN` entries, which are preserved per POSIX).
    ///
    /// On `fork()` the child receives a *copy* of the parent's table (COW semantics:
    /// a new `SignalHandlers` with the same entries). Threads created with
    /// `CLONE_SIGHAND` share the parent's existing `Arc<SignalHandlers>` directly.
    pub sighand: Arc<SignalHandlers>,
    /// Controlling terminal for this process's session.
    ///
    /// `None` for processes that have no controlling terminal — typically daemons
    /// that called `setsid()` and have not yet opened a terminal, or processes
    /// started without a terminal (e.g., launched by a service manager).
    ///
    /// Set by the TTY layer when a session leader opens a terminal without
    /// `O_NOCTTY`, or explicitly via `TIOCSCTTY`. Cleared when the terminal
    /// hangs up or when `TIOCNOTTY` is called by the session leader.
    ///
    /// Protected by the session lock (`Session::lock`). The `Tty` type is fully
    /// defined in Chapter 20 (User I/O); this field holds an `Arc` reference so
    /// that the terminal device persists as long as any process retains it as its
    /// controlling terminal, even after the last file-descriptor reference is closed.
    pub tty: Option<Arc<Tty>>,
}

/// Per-process signal handler table, shared across the thread group.
///
/// POSIX requires that `sigaction(2)` affects the whole process: all threads
/// observe the updated disposition. This is achieved by sharing a single
/// `SignalHandlers` instance (via `Arc`) among all threads created with
/// `CLONE_SIGHAND` — the same sharing flag used for the fd table in the
/// Linux/POSIX thread model.
///
/// # Locking discipline
///
/// - **Reads** (signal delivery, `sigpending(2)`): lock-free. The `action`
///   array is accessed by signal number index. Each `SigAction` entry is
///   written atomically as a unit only while `lock` is held; readers observe
///   either the old or new value, never a torn intermediate state, because
///   `SigAction` fits within a single cache line and pointer-sized stores are
///   atomic on all supported architectures.
/// - **Writes** (`sigaction(2)`): acquire `lock`, update the entry, release.
///   Writers are rare (handler installation at program startup); the spinlock
///   is never contended on the fast path.
pub struct SignalHandlers {
    /// Signal dispositions, indexed by signal number (1-indexed; index 0 unused).
    ///
    /// Valid indices: 1 through `SIGMAX` (64 inclusive).
    /// `action[0]` is reserved and always contains `SigAction::default()`.
    ///
    /// Invariants enforced by the kernel:
    /// - `action[SIGKILL].handler` is always `SigHandler::Default`.
    /// - `action[SIGSTOP].handler` is always `SigHandler::Default`.
    ///   `sigaction(2)` rejects attempts to change either (returns `EINVAL`).
    ///   (The array is 1-indexed by signal number, so no `- 1` offset applies.)
    pub action: [SigAction; SIGMAX + 1],
    /// Spinlock protecting writes to `action`. Not held during reads (signal
    /// delivery). Contention is extremely rare: only `sigaction(2)` writes,
    /// which typically occurs only during process startup.
    pub lock: SpinLock<()>,
}

/// Maximum signal number supported. RT signal range is 34–64; SIGMAX = 64.
/// Matches Linux SIGRTMAX for 64-bit architectures (glibc uses signals 32–33
/// internally for NPTL; application-visible SIGRTMAX = 64).
pub const SIGMAX: usize = 64;
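The write path and its invariants can be sketched std-only, with `Mutex` standing in for the kernel `SpinLock` and `HandlerTable` / `Handler` as illustrative stand-ins for the kernel types: the table is 1-indexed (index 0 unused) and changes to SIGKILL or SIGSTOP are rejected.

```rust
use std::sync::Mutex;

const SIGMAX: usize = 64;
const SIGKILL: usize = 9;
const SIGSTOP: usize = 19;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Handler { Default, Custom(usize) }

struct HandlerTable {
    action: Mutex<[Handler; SIGMAX + 1]>, // index 0 reserved
}

impl HandlerTable {
    fn new() -> Self {
        HandlerTable { action: Mutex::new([Handler::Default; SIGMAX + 1]) }
    }
    /// Install a disposition; Err(()) models the EINVAL cases.
    fn sigaction(&self, signo: usize, h: Handler) -> Result<(), ()> {
        if signo == 0 || signo > SIGMAX { return Err(()); }
        if signo == SIGKILL || signo == SIGSTOP { return Err(()); }
        self.action.lock().unwrap()[signo] = h; // write under the lock
        Ok(())
    }
}
```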

/// Architecture-specific saved context. Aliased from `arch::current::context::SavedContext`.
/// See `umka-core/src/arch/*/context.rs` for per-architecture definitions.
/// Minimum fields required by all architectures:
/// - Callee-saved general-purpose registers (per ABI)
/// - Stack pointer
/// - Return address / program counter
/// - Thread-local pointer (if used for CpuLocal)
pub type ArchContext = arch::current::context::SavedContext;

// Forward declaration: `Tty` is fully defined in Chapter 20 (User I/O).
//
// A `Tty` represents an open terminal device — either a real hardware serial
// terminal or one side of a pseudo-terminal (PTY) pair. The controlling
// terminal is attached to a session via `TIOCSCTTY` or by the session leader
// opening the first terminal device after `setsid()` without `O_NOCTTY`.
//
// The field `Process::tty` holds `Option<Arc<Tty>>` so that the terminal
// device's reference count stays elevated for as long as any process holds it
// as a controlling terminal, independently of how many file descriptors point
// to the same device.
pub struct Tty; // forward declaration — see Chapter 20

// ---------------------------------------------------------------------------
// Supporting type definitions referenced by Task and Process
// ---------------------------------------------------------------------------

/// Scheduler entity embedded in `Task`. Carries all EEVDF scheduling state
/// for a task: virtual runtime, virtual deadline, eligibility, lag, and
/// deferred-dequeue status.
///
/// Full definition is in [Section 6.1 (Scheduler Data Structures)](06-scheduling.md#61-scheduler),
/// as `EevdfTask`. The type alias `SchedEntity = EevdfTask` is used here for
/// clarity: in the process/task context, "scheduling entity" is the familiar
/// term. The scheduler chapter uses "EevdfTask" to emphasize the algorithm.
///
/// Embedded directly in `Task` (not heap-allocated) so that the scheduler
/// hot path can access it without pointer indirection.
pub type SchedEntity = EevdfTask; // defined in Section 6.1

/// File descriptor table — maps non-negative integer file descriptor numbers
/// to open file descriptions.
///
/// One `FdTable` may be shared among multiple tasks (threads) that were
/// created with `CLONE_FILES`. Each task holds an `Arc<FdTable>` reference;
/// the table's internal `SpinLock` serializes concurrent fd operations.
///
/// # Locking discipline
/// - All reads and writes to the fd array require holding `inner.lock`.
/// - `max_fds` is an `AtomicU32` updated under the lock; it may be read
///   without the lock for a conservative upper bound on valid fd indices.
///   Callers that need an exact bound must hold the lock.
///
/// # Lifecycle
/// - Created empty at process start (or after `unshare(CLONE_FILES)`).
/// - On `fork()` without `CLONE_FILES`: copied (COW — the copy is a fresh
///   `FdTable` with the same fd → `Arc<OpenFile>` entries, each `Arc`
///   cloned to bump the open-file reference count).
/// - On `fork()` with `CLONE_FILES` (thread creation): the `Arc` is cloned
///   (zero copy), both tasks share the same `FdTable`.
/// - On `exec()`: `O_CLOEXEC` fds are closed atomically under the lock
///   before the new binary's entry point runs. Non-cloexec fds remain open.
pub struct FdTable {
    /// Lock-protected inner state.
    pub inner: SpinLock<FdTableInner>,
    /// Current highest allocated fd index plus one. Read with `Relaxed`
    /// for a fast upper bound. The authoritative count of open fds is
    /// the number of `Some` entries in `inner.fds`.
    pub max_fds: AtomicU32,
    /// Total number of currently open file descriptors. Updated under
    /// `inner.lock`. Used by RLIMIT_NOFILE enforcement (Section 7.5.4).
    pub count: AtomicU32,
}

/// Lock-protected contents of `FdTable`.
pub struct FdTableInner {
    /// Sparse fd-to-file mapping. `fds[n] = Some(f)` means fd `n` is open
    /// and refers to open file description `f`. `None` means the slot is
    /// closed (or not yet allocated).
    ///
    /// The vector grows on demand; it is never shrunk (to avoid reallocation
    /// under the lock). The vector length is always at most
    /// `RLIMIT_NOFILE.hard` for the owning process.
    pub fds: Vec<Option<Arc<OpenFile>>>,
    /// Bitmap of close-on-exec descriptors. Bit `n` is set if fd `n` should
    /// be closed at `exec()`. Maintained in sync with `fds`: when a slot is
    /// cleared (closed), the corresponding cloexec bit is also cleared.
    ///
    /// Separate from the `OpenFile` to avoid a per-file atomic on the
    /// exec-close fast path (a single range clear on the bitmap is faster
    /// than iterating each OpenFile's flags).
    pub cloexec: BitVec,
}
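The exec-time close-on-exec sweep over the bitmap can be sketched as a single pass under the table lock, with `Vec<bool>` standing in for `BitVec` and `u32` handles standing in for `Arc<OpenFile>` (an illustrative helper, not the kernel function):

```rust
/// Close every fd whose cloexec bit is set; clear the bit as the slot
/// is cleared, keeping the bitmap in sync with the fd array.
/// Returns the number of descriptors actually closed.
fn close_on_exec(fds: &mut Vec<Option<u32>>, cloexec: &mut Vec<bool>) -> usize {
    let mut closed = 0;
    for (n, slot) in fds.iter_mut().enumerate() {
        if cloexec.get(n).copied().unwrap_or(false) {
            if slot.take().is_some() {
                closed += 1; // dropping the handle releases the open file
            }
            cloexec[n] = false;
        }
    }
    closed
}
```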

/// An open file description (not a file descriptor — one description may be
/// referenced by multiple fds and multiple processes after `dup(2)` or `fork()`).
///
/// Forward declaration — fully defined in [Chapter 13 (VFS)](13-vfs.md).
pub struct OpenFile; // forward declaration — see Chapter 13

Each task holds its own CapHandle that can further restrict the process-wide CapSpace but never widen it. This allows individual threads to voluntarily drop privileges -- for example, a worker thread that processes untrusted input can shed network capabilities before entering its main loop.
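The narrow-only rule can be sketched as set intersection. `ThreadCaps` and the capability names are illustrative stand-ins, not the kernel types: restricting to a set the thread does not fully hold simply drops the capabilities it lacks, so the set can only shrink.

```rust
use std::collections::HashSet;

struct ThreadCaps(HashSet<&'static str>);

impl ThreadCaps {
    /// Restrict to `keep`: the result is the intersection with the
    /// current set, so capabilities can be dropped but never regained.
    fn restrict(&mut self, keep: &HashSet<&'static str>) {
        let narrowed: HashSet<&'static str> =
            self.0.intersection(keep).copied().collect();
        self.0 = narrowed;
    }
}
```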

7.1.2 Process Creation

Linux problem: fork() copies the entire process state -- page tables, file descriptor table, signal handlers, credentials -- then the child almost always immediately calls exec(), discarding everything that was just copied. The clone() syscall provides fine-grained control via a combinatorial flag space (CLONE_VM, CLONE_FILES, CLONE_FS, CLONE_SIGHAND, CLONE_NEWPID, ...) that is powerful but difficult to use correctly.

UmkaOS native model: Capability-based spawn(). A new process is created with an explicit set of capabilities, an address space, and an entry point. Nothing is inherited implicitly -- the parent must grant each resource (memory regions, file descriptors, capabilities) that the child should receive. This makes the child's authority set visible and auditable at creation time.

pub struct SpawnArgs {
    /// ELF binary or entry point address.
    pub entry: EntrySpec,
    /// Capabilities to grant to the child (subset of caller's CapSpace).
    /// If the Vec allocation fails, `create_process()` returns
    /// `Err(KernelError::OutOfMemory)` — same failure path as all other spawn
    /// errors; there is no silent failure.
    pub granted_caps: Vec<CapHandle>,
    /// File descriptors to pass (remapped into child's fd table).
    pub fds: Vec<(Fd, Fd)>,
    /// Initial address space configuration.
    pub address_space: AddressSpaceSpec,
    /// CPU affinity for the initial task.
    pub cpu_affinity: CpuSet,
    /// Namespace set (inherit parent's or create new).
    pub namespaces: NamespaceSpec,
}
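The `fds: Vec<(Fd, Fd)>` remapping can be sketched as follows, with `u32` handles standing in for `Arc<OpenFile>` entries. The helper is illustrative; a real spawn would fail with EBADF on an unknown parent fd rather than skip it.

```rust
/// Build a child fd table from explicit (parent_fd, child_fd) grant
/// pairs. Nothing is inherited implicitly: only listed fds appear.
fn remap_fds(parent: &[Option<u32>], pairs: &[(usize, usize)]) -> Vec<Option<u32>> {
    let max_child = pairs.iter().map(|&(_, c)| c + 1).max().unwrap_or(0);
    let mut child = vec![None; max_child];
    for &(pfd, cfd) in pairs {
        if let Some(Some(f)) = parent.get(pfd) {
            child[cfd] = Some(*f); // clone the open-file reference
        }
    }
    child
}
```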

Linux compatibility: fork() and clone() are implemented in the compat layer (Section 18.1) by translating to the underlying task/process primitives:

  • fork() = clone(SIGCHLD) = COW address space copy + fd table copy + signal handler copy. Page table entries are marked read-only and reference-counted; the actual page copy is deferred to the write-fault handler (Section 4.1.5).
  • clone(CLONE_VM | CLONE_FILES | ...) = create a new task within the same process (i.e., a thread). No address space copy, no fd table copy.
  • vfork() = parent blocks until the child calls exec() or _exit(). The child temporarily shares the parent's address space (no COW overhead). Implemented by setting a completion flag that the parent waits on.
  • clone3() = the modern extensible version. Supported with the same struct clone_args layout as Linux 5.3+.
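The first two translations above reduce to flag checks in the compat layer. A sketch (flag values match Linux; `Inherit` and the helpers are illustrative):

```rust
const CLONE_VM: u64 = 0x0000_0100;
const CLONE_FILES: u64 = 0x0000_0400;

#[derive(Debug, PartialEq)]
enum Inherit { Share, Copy }

/// fork() passes no CLONE_VM, so the address space is COW-copied;
/// thread creation passes CLONE_VM and shares it.
fn address_space(flags: u64) -> Inherit {
    if flags & CLONE_VM != 0 { Inherit::Share } else { Inherit::Copy }
}

/// Same decision for the fd table, keyed on CLONE_FILES.
fn fd_table(flags: u64) -> Inherit {
    if flags & CLONE_FILES != 0 { Inherit::Share } else { Inherit::Copy }
}
```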

7.1.3 Program Execution (exec)

execve() replaces the current task's address space with a new program image. The previous mappings are discarded, caught signal dispositions are reset to their defaults (SIG_IGN entries are preserved), and pending signals are cleared; the file descriptor table is preserved (minus CLOEXEC descriptors).

ELF loading sequence:

  1. Parse the ELF header and program headers from the file.
  2. Verify the ELF machine type matches the running architecture.
  3. For each PT_LOAD segment: create a VMA with the specified permissions and map the file region (demand-paged via the page cache).
  4. If an ELF interpreter is specified (PT_INTERP), map it as well. This is typically ld-linux-x86-64.so.2 or ld-musl-x86_64.so.1.
  5. Allocate a new user stack. Push auxv (auxiliary vector), envp, and argv onto the stack in the standard layout expected by the C runtime.
  6. Set the instruction pointer to the interpreter's entry point (or the binary's entry point if statically linked).
  7. Return to user space.
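Step 5's stack layout can be sketched with `u64` words standing in for user pointers (an illustrative helper, not the loader's code): argc, then the argv pointers with a NULL terminator, then envp with its terminator, then the auxv (type, value) pairs ending in AT_NULL.

```rust
const AT_NULL: u64 = 0;

/// Lay out the initial stack words as seen from the stack pointer
/// upward, in the order the C runtime expects.
fn build_initial_stack(argv: &[u64], envp: &[u64], auxv: &[(u64, u64)]) -> Vec<u64> {
    let mut s = vec![argv.len() as u64]; // argc
    s.extend_from_slice(argv);           // argv pointers
    s.push(0);                           // argv NULL terminator
    s.extend_from_slice(envp);           // envp pointers
    s.push(0);                           // envp NULL terminator
    for &(t, v) in auxv {                // auxv (type, value) pairs
        s.push(t);
        s.push(v);
    }
    s.push(AT_NULL);                     // auxv terminator pair
    s.push(0);
    s
}
```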

Capability grants on exec replace the traditional setuid/setgid mechanism (Section 8.1.6). Instead of running the new program as a different UID, the kernel consults a per-binary capability grant table and adds the specified capabilities to the task's CapHandle. The process never gains uid 0 -- it gains precisely the capabilities the binary needs (e.g., CAP_NET_BIND_SERVICE for a web server on port 80).

Security cleanup on exec:

  • File descriptors with CLOEXEC are closed.
  • Signal dispositions are reset to SIG_DFL (except SIG_IGN).
  • Pending signals are cleared.
  • The process dumpable flag is re-evaluated (non-dumpable if capabilities were gained).
  • Address space layout randomization (ASLR) re-randomizes all base addresses.

7.1.4 Task Exit and Resource Cleanup

A task exits via exit() (single task) or exit_group() (all tasks in the process). The latter is what the C library's exit() actually calls.

Cleanup order for a single-task exit:

  1. Cancel pending asynchronous I/O (io_uring SQEs, AIO requests).
  2. If this is the last task in the process, proceed to process cleanup (below). Otherwise, release per-task resources (stack, ArchContext) and remove from the thread group.

Process cleanup (when the last task exits):

  1. Close all file descriptors in the fd table.
  2. Release all capabilities in the CapSpace.
  3. Tear down the address space: unmap all VMAs, release page table pages, decrement page reference counts.
  4. Deliver SIGCHLD to the parent process.
  5. Reparent children: any surviving child processes are reparented to the nearest subreaper (a process that set PR_SET_CHILD_SUBREAPER) or to init (pid 1).
  6. Transition to the zombie state. The task remains in the task table with its exit status until the parent calls wait() / waitpid() / waitid().

Zombie reaping: The zombie consumes only a small task-table slot (no address space, no fd table, no capabilities). The parent retrieves the exit status and resource usage via wait4() or waitid(), which frees the slot. If the parent exits without reaping, the reparented-to ancestor (init or subreaper) is responsible.
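The status word the parent retrieves uses the classic Linux/POSIX encoding (the same WEXITSTATUS / WTERMSIG format ThreadGroup::exit_code stores), sketched here with illustrative helpers:

```rust
/// Normal exit: the exit code occupies bits 8..16.
fn encode_exit(code: u8) -> u32 { (code as u32) << 8 }

/// Killed by a signal: the signal number occupies bits 0..7.
fn encode_signaled(signo: u8) -> u32 { (signo as u32) & 0x7f }

/// The decoding side, matching the wait() macros.
fn wifexited(status: u32) -> bool { status & 0x7f == 0 }
fn wexitstatus(status: u32) -> u8 { ((status >> 8) & 0xff) as u8 }
fn wtermsig(status: u32) -> u8 { (status & 0x7f) as u8 }
```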

Session and process group lifecycle: When a session leader exits, SIGHUP is delivered to the foreground process group of the controlling terminal. The terminal is disassociated from the session. This matches POSIX semantics required by sshd, tmux, and shell job control.

UmkaOS Process Exit Cleanup Tokens

Problem with atexit() and signal handlers for cleanup: Resource cleanup on process death depends on atexit() (runs handlers synchronously during exit, can block or be skipped on SIGKILL), signal handlers for SIGTERM (async, the process may crash before the handler runs), or close() callbacks on file descriptors (limited expressiveness). None of these mechanisms fire on SIGKILL or OOM kill, and a handler that blocks stalls the entire exit.

UmkaOS exit cleanup tokens: A kernel-managed cleanup mechanism tied to process lifetime. When the process exits for any reason — normal exit(), SIGKILL, unhandled fault, or OOM kill — the kernel executes registered cleanup actions after the process's address space is torn down, using kernel-internal state that is independent of the process's stack and heap.

/// A handle to a registered process exit cleanup action.
///
/// Dropping the handle (or calling `cancel()`) unregisters the action so
/// it will not run when the process exits.
///
/// Obtained from `umka_register_exit_cleanup()`.
pub struct ExitCleanupHandle {
    /// Opaque identifier assigned at registration time.
    id: CleanupId,
    /// Weak reference so the handle does not extend process lifetime.
    process: Weak<Process>,
}

impl ExitCleanupHandle {
    /// Cancel this cleanup action.
    ///
    /// After this call the action will not execute on process exit.
    /// Safe to call from any thread that owns the handle.
    pub fn cancel(self) {
        // Consuming self triggers Drop, which removes the entry from the
        // process's cleanup list under the process lock.
    }
}

/// The kernel action to execute when the owning process exits.
pub enum ExitCleanupAction {
    /// Revoke a capability held by the kernel on behalf of this process.
    /// Equivalent to calling `cap_revoke()` after the process has gone.
    /// Used by resource managers to release kernel resources without
    /// relying on heartbeat polling.
    RevokeCap(Cap),

    /// Unlink a filesystem path (e.g., a pidfile or UNIX socket file).
    /// Executes as `unlinkat(dirfd, path, 0)` in the kernel cleanup thread.
    /// `dirfd` is `None` for absolute paths; such paths are resolved
    /// against the root mount at exit time, not the process's (now gone) cwd.
    UnlinkPath {
        path: PathBuf,
        dirfd: Option<DirFd>,
    },

    /// Send a signal to another process.
    /// Executes as `kill(target_pid, signo)` from the kernel cleanup thread.
    /// Permission check uses the exiting process's saved credentials.
    SendSignal {
        target_pid: Pid,
        signo: Signal,
    },

    /// Increment an eventfd counter to notify watchers that this process
    /// has exited. Executes as `write(fd, &value.to_ne_bytes(), 8)`.
    NotifyEventFd {
        fd: OwnedFd,
        value: u64,
    },
}

/// Maximum number of cleanup actions that may be registered per process.
/// `umka_register_exit_cleanup()` returns `EMFILE` when this limit is reached.
pub const UMKA_MAX_EXIT_CLEANUPS: usize = 64;

/// Register a cleanup action to run when the current process exits.
///
/// The action runs in a dedicated kernel cleanup thread after the process's
/// address space and file descriptors are closed, and before the zombie is
/// made visible to `waitpid()`. The handle must be kept alive for the action
/// to remain registered; dropping the handle cancels the action.
///
/// Returns `EMFILE` if `UMKA_MAX_EXIT_CLEANUPS` actions are already registered.
/// Returns `EINVAL` if the action references an invalid fd, cap, or path.
///
/// New syscall: `umka_register_exit_cleanup(2)` — x86-64 syscall number 1024.
pub fn umka_register_exit_cleanup(
    action: ExitCleanupAction,
) -> Result<ExitCleanupHandle>;

Execution ordering within do_exit():

  1. Process calls exit() or is killed — do_exit() begins.
  2. Per-task cancellation: pending async I/O is cancelled, signal handlers are permanently disabled (no new signals can be delivered).
  3. Address space torn down: all VMAs unmapped, page tables freed, fd table closed, capabilities released. From this point the process cannot access memory or kernel resources it owned.
  4. Cleanup phase: the kernel cleanup thread (one per CPU socket, pre-started at boot) dequeues and executes all ExitCleanupActions registered for this process, in registration order. Each action runs with a 1-second timeout: if an action blocks beyond it, the kernel logs a WARN-level message identifying the action type and process, then skips to the next action. Cleanup actions must not block; the timeout is a safety net, not a design budget.
  5. Process transitions to the zombie state and becomes visible to waitpid().
  6. SIGCHLD is delivered to the parent.

Why run cleanup after MM teardown?: Cleanup actions are intentionally deferred until after the process's own resources are gone, for two reasons. First, RevokeCap and UnlinkPath are safe to execute unconditionally because the process can no longer race with them — it has no address space or fd table. Second, SendSignal after MM teardown sends a notification to an observer rather than interacting with a still-running process, which is a well-defined operation with no ordering ambiguity.

Comparison with existing cleanup mechanisms:

| Property | atexit() / C++ destructors | Signal handlers (SIGTERM) | UmkaOS exit cleanup tokens |
|---|---|---|---|
| Runs on SIGKILL | No | No | Yes |
| Runs on OOM kill | No | No | Yes |
| Runs on unhandled fault | No | No | Yes |
| Can block exit | Yes | Yes | No (1 s timeout, kernel-managed) |
| Can crash and skip cleanup | Yes | Yes | No (runs in kernel thread) |
| Requires process address space | Yes | Yes | No (runs after MM torn down) |
| Integrates with capabilities | No | No | Yes (RevokeCap action) |
| Max registered actions | Unlimited (heap) | 1 per signal | 64 per process |

Linux compatibility: atexit(), on_exit(), and C++ destructors are fully supported via the userspace runtime — they are unchanged. Exit cleanup tokens are an UmkaOS extension exposed through the umka_register_exit_cleanup(2) syscall (x86-64 number 1024, in the UmkaOS-private syscall range starting at 1024). Existing Linux binaries do not use this mechanism and are not affected by it.

Use cases:

  • Container runtimes: unlink socket files and pidfiles when a container process dies, without needing to poll for liveness.
  • Resource managers: release kernel resources (close capabilities, free reserved bandwidth) when a client process exits, replacing the heartbeat polling pattern common in distributed systems.
  • Language runtimes (Go, JVM, .NET): guarantee cleanup of native resources even when the runtime is killed before its own shutdown hooks can run.
  • Service monitors: write to an eventfd to notify a watchdog when a worker process exits, with lower latency than polling /proc or pidfd_poll.

7.1.5 Address Space Operations

User space manipulates its address space through these syscalls, all of which go through capability checks on the calling process's CapSpace:

  • mmap() / munmap(): Create and destroy virtual memory regions. Anonymous mappings allocate from the physical allocator on demand (Section 4.1.1). File-backed mappings go through the page cache (Section 4.1.3). MAP_SHARED mappings are backed by a shared page cache entry; MAP_PRIVATE mappings COW on write.
  • mprotect(): Change page permissions on an existing mapping. On x86-64, this integrates with hardware domain isolation (Section 10.2) -- Tier 1 driver memory regions can be made accessible only when the driver's protection key is active.
  • brk() / sbrk(): Legacy heap expansion interface. Supported for compatibility with applications that do not use mmap-based allocators. Implemented as a resizable anonymous VMA at the process's break address.

7.1.5.1 mprotect(addr, len, prot) — Change Memory Region Permissions

Changes the access permissions of the virtual address range [addr, addr+len). Equivalent to Linux mprotect(2), and to pkey_mprotect(2) when prot carries a memory-protection-key flag (or PROT_MTE on AArch64).

Algorithm:

  1. Validate range: addr must be page-aligned; len is rounded up to the next page boundary. If addr + len overflows or exceeds TASK_SIZE, return EINVAL. If len == 0, return 0 immediately (no-op, Linux-compatible).

  2. Look up VMAs: Walk the Maple tree (process.mm.vmas) to find all VMAs that overlap [addr, addr+len). If any gap exists in the range (unmapped pages), return ENOMEM. If the range spans multiple VMAs, each is updated independently; VMAs that are only partially covered are split (see vma_split()).

  3. Permission check: For each VMA:
     • If adding execute permission (PROT_EXEC) and VM_NOEXEC is set on the VMA (e.g., from a MAP_NOEXEC file mount), return EACCES.
     • If adding write permission on a shared file mapping, check that the file was opened with write access; return EACCES otherwise.
     • PROT_GROWSUP / PROT_GROWSDOWN: adjust VMA bounds before the permission update.

  4. Update PTEs and VMA flags: For each page in the range:
     • Update VmaFlags in the VMA descriptor (VM_READ, VM_WRITE, VM_EXEC).
     • Walk the page table and update PTEs to match the new permissions, using the architecture's pte_modify(pte, newprot) helper.
     • If removing write permission from a dirty page, flush the page's dirty bit to the page cache (writeback accounting).

  5. TLB shootdown: After updating PTEs, send a TLB invalidation IPI to all CPUs that have the process mapped (derived from process.mm.cpu_mask), and wait for all CPUs to acknowledge before returning. This is mandatory for correctness — stale TLB entries with old permissions could allow reads/writes to unprotected pages. On x86-64 with INVLPG batching: issue INVLPG for each page in the range on the local CPU; remote CPUs receive a single IPI and invalidate via their local INVLPG.

Return value: 0 on success; negative errno on failure (EINVAL, ENOMEM, EACCES).

Struct used: No new struct; operates on existing Vma, VmaFlags, and page table structures defined in Section 4.1.5.

Capability-mediated memory sharing provides a secure alternative to POSIX shared memory. Instead of a global namespace (/dev/shm/name), memory regions are shared by explicitly granting a capability to the target process:

/// Grant access to a memory region to another process.
/// Returns a transferable capability handle.
pub fn mem_grant(
    target: ProcessId,
    region: VmaId,
    perms: Permissions,
) -> Result<CapHandle>;

/// Map a previously granted region into the current address space.
/// The grant capability is consumed (single-use) or retained
/// depending on the grant's delegation policy.
pub fn mem_map(
    grant: CapHandle,
    hint_addr: Option<usize>,
) -> Result<*mut u8>;

This model has several advantages over POSIX shm_open:

  • No global namespace: Shared regions are not visible to unrelated processes.
  • Fine-grained permissions: The granter specifies read, write, or execute -- not just "owner/group/other" file modes.
  • Revocable: The granter can revoke the capability (Section 8.1.1 generation counter), and the next access by the target faults.
  • Auditable: Every grant and map operation flows through capability checks and can be logged.

POSIX shared memory (shm_open / mmap MAP_SHARED) is implemented on top of this mechanism via the compat layer, with a tmpfs-backed /dev/shm namespace that translates names to capability grants.

7.1.6 Namespaces

Each process belongs to a set of namespaces that isolate its view of system resources. UmkaOS implements all 8 Linux namespace types (see Section 16.1 in 16-containers.md for full details):

| Namespace | Isolates | Key syscall flags |
|---|---|---|
| pid | Process ID space | CLONE_NEWPID |
| net | Network stack, interfaces, routes | CLONE_NEWNET |
| mnt | Mount table | CLONE_NEWNS |
| user | UID/GID mappings | CLONE_NEWUSER |
| uts | Hostname, domainname | CLONE_NEWUTS |
| ipc | SysV IPC, POSIX message queues | CLONE_NEWIPC |
| cgroup | Cgroup root directory | CLONE_NEWCGROUP |
| time | CLOCK_MONOTONIC / CLOCK_BOOTTIME offsets | CLONE_NEWTIME |

Capability gating: Creating a new namespace requires the appropriate capability -- the UmkaOS equivalent of CAP_SYS_ADMIN (for most namespaces) or an unprivileged CLONE_NEWUSER followed by capabilities within the new user namespace. This matches Linux semantics so that rootless containers (Podman, Docker rootless mode) work unmodified.

Process creation integration: clone3() and the native spawn() both accept a NamespaceSpec that specifies whether each namespace is inherited from the parent or freshly created. Namespaces are reference-counted; they are destroyed when the last process in the namespace exits and no external references (bind mounts, open /proc/[pid]/ns/* file descriptors) remain.

See also: Section 16.1 (16-containers.md) provides the full namespace implementation details and container runtime compatibility requirements.

7.1.7 User-Mode Scheduling (Fibers and M:N Threading)

UmkaOS provides an opt-in scheduler upcall mechanism that enables userspace libraries to implement M:N threading — multiplexing many lightweight fibers (cooperative coroutines) onto fewer OS threads, with correct behaviour when a fiber blocks in a syscall. This is the only kernel-level primitive needed for fibers; the fiber context switch itself (saving/restoring registers, swapping stacks) is purely a userspace library operation and requires no syscall.

This feature is native-UmkaOS-only. It does not exist on Linux. Applications that use it are not portable to Linux without a shim. Existing Linux-compatible applications that do not opt in are completely unaffected.

7.1.7.1 Motivation

A fiber (cooperative coroutine, user-mode thread) is a save/restore of the integer and FPU register state plus a stack pointer swap. No kernel involvement is needed for the switch itself. The hard problem is a fiber calling a blocking syscall: without kernel cooperation, the entire OS thread blocks, starving all other fibers running on it.

Three approaches exist:

  1. Async-only I/O (restrict fibers to io_uring/epoll): Works, but requires all callees to be async-aware. Incompatible with legacy synchronous code.
  2. One OS thread per fiber (1:1): Works, but eliminates the efficiency advantage of fibers and limits parallelism to the thread count.
  3. Scheduler upcalls (this design): The kernel calls a registered userspace function before blocking, allowing the fiber scheduler to park the current fiber and immediately run another. The OS thread never actually blocks while runnable fibers exist.

This is the scheduler activations model (Anderson et al., SOSP 1992), implemented in Solaris LWPs, early NetBSD, and macOS pthreads internals. It is the correct kernel primitive for M:N scheduling.

7.1.7.2 Scheduler Upcall Registration

A thread registers an upcall handler via a new UmkaOS syscall:

/// Register a scheduler upcall handler for the calling thread.
///
/// When the calling thread is about to enter a blocking state (blocking
/// syscall, futex wait, page fault that requires I/O), the kernel saves
/// the thread's full register state into `upcall_stack_top - sizeof(UpcallFrame)`
/// and transfers control to `handler`.
///
/// The handler runs on `upcall_stack` (a separate dedicated stack of
/// `upcall_stack_size` bytes) to avoid corrupting the fiber's stack.
///
/// # Arguments
/// - `handler`:          Upcall entry point (see UpcallFrame below).
/// - `upcall_stack`:     Userspace-allocated stack for upcall execution.
/// - `upcall_stack_size`: Size of that stack in bytes (minimum 8 KiB).
///
/// # Returns
/// `Ok(())` on success. `EINVAL` if `upcall_stack` is not mapped writable
/// or `upcall_stack_size` is below the minimum.
SYS_register_scheduler_upcall(
    handler:           extern "C" fn(*mut UpcallFrame),
    upcall_stack:      *mut u8,
    upcall_stack_size: usize,
) -> Result<()>;

/// Deregister the upcall handler. Thread reverts to standard 1:1 blocking.
SYS_deregister_scheduler_upcall() -> Result<()>;

Re-entrancy protection: The Task struct carries two atomic flags for upcall re-entrancy: in_upcall (set while the handler is executing) and upcall_pending (deferred trigger for events that arrive while in_upcall is true). These fields are defined in the Task struct in Section 7.1.1.

The full protocol for upcall delivery is:

Step 1 — Before delivering an upcall to userspace:
    Check task.in_upcall.load(Relaxed):
    - false → proceed to step 2.
    - true  → the upcall handler is already executing on the upcall stack.
               Nesting would overwrite UpcallFrame, destroying the original
               fiber's register state. Instead:
               a. Set task.upcall_pending = true.
               b. For blocking-syscall events: block the OS thread directly
                  (standard 1:1 blocking), as if no upcall handler were registered.
                  The syscall completes normally; the handler will be re-entered
                  for the next event once upcall_pending is processed.
               c. For page-fault I/O events: block synchronously until the page
                  is populated; do not invoke the handler.
               Return from step 1 — do NOT proceed to step 2.

Step 2 — Set task.in_upcall = true (Relaxed store; the architecture guarantees
    single-copy atomicity for aligned byte writes).

Step 3 — Build the UpcallFrame on the upcall stack and transfer control to the
    registered handler.

Step 4 — When the handler calls SYS_scheduler_upcall_resume() or
    SYS_scheduler_upcall_block():
    a. Set task.in_upcall = false.
    b. Check task.upcall_pending:
       - false → return to user space normally (resume the selected fiber or
                 block the OS thread waiting for completions).
       - true  → clear upcall_pending, re-examine the current scheduling state,
                 and if a blocking event is still pending, deliver a fresh upcall
                 immediately (loop back to step 2). This coalesces all deferred
                 events into a single handler invocation.

Blocking inside the upcall handler:
    The upcall handler must not make blocking syscalls that would stall the OS
    thread indefinitely — doing so would starve all fibers on that thread. If the
    handler invokes a blocking syscall (e.g., futex_wait, read on a blocking fd),
    the kernel permits it (task.in_upcall remains true, so no nested upcall is
    issued), and the OS thread blocks until the syscall completes. task.upcall_pending
    is set if any scheduling event arrives during that block. The handler should
    use non-blocking or io_uring-based I/O on its own internal data structures
    (run queue, completion ring) to avoid this. EUCLWAIT is NOT returned; the
    kernel does not prohibit blocking syscalls from within the handler — it simply
    defers the next upcall via upcall_pending rather than nesting.

This is analogous to how POSIX signal handlers mask the same signal during delivery to prevent re-entrant corruption: the handler is protected from re-entry, while incoming events are deferred rather than dropped.

/// Saved register state of the fiber that triggered the upcall.
/// Passed by pointer to the upcall handler; the fiber is resumed by
/// restoring these registers (see SYS_scheduler_upcall_resume).
#[repr(C)]
pub struct UpcallFrame {
    /// Saved general-purpose registers (architecture-specific layout).
    pub regs:        ArchRegs,
    /// Why the fiber is blocking.
    pub reason:      BlockReason,
    /// Opaque kernel handle — pass back to SYS_scheduler_upcall_resume
    /// or SYS_scheduler_upcall_block.
    pub fiber_token: u64,
    /// Integrity cookie: `UPCALL_FRAME_MAGIC ^ task.upcall_frame_nonce`.
    /// Written by the kernel when building the frame; verified by
    /// `SYS_scheduler_upcall_resume` before restoring registers.
    /// See Section 7.1.7.3 UpcallFrame Validation, step 5.
    pub magic_cookie: u64,
}

#[repr(u32)]
pub enum BlockReason {
    /// Entering a blocking syscall (e.g., read, write, futex_wait).
    BlockingSyscall = 1,
    /// Page fault requiring disk I/O (demand paging).
    PageFaultIo     = 2,
    /// Waiting for a kernel lock (unlikely; most kernel waits are brief).
    KernelLock      = 3,
}

7.1.7.3 Upcall Handler Flow

Fiber A calls read(fd, buf, len)  →  would block
  ↓
Kernel saves Fiber A registers into UpcallFrame on upcall stack
  ↓
Kernel transfers control to handler(frame) on upcall stack
  ↓
Handler (fiber scheduler):
  - Parks Fiber A: stores frame->fiber_token, records Fiber A as "blocked on read"
  - Submits read to io_uring for non-blocking completion
    (Note: the wakeup path uses io_uring completion rings, not eventfd.
    The io_uring CQE provides both the completion signal and the result
    data in a single shared-memory read, avoiding the extra syscall
    overhead of eventfd notification.)
  - Picks Fiber B from the run queue
  - Calls SYS_scheduler_upcall_resume(fiber_b_frame) to restore Fiber B
  ↓
Fiber B runs on the OS thread
  ↓
io_uring completion arrives → event loop wakes Fiber A
  - Handler receives io_uring completion
  - Reconstructs Fiber A's UpcallFrame with the result
  - Calls SYS_scheduler_upcall_resume(fiber_a_frame) to restore Fiber A
  ↓
Fiber A resumes with read() returning the result

Two new syscalls control fiber resumption:

/// Restore a fiber that was parked by an upcall.
/// Restores the registers from `frame` and returns to the fiber's PC.
/// The `result` value is placed in the return register (rax / x0 / a0).
/// On success this call never returns to the caller — control goes to the
/// fiber. It returns an error only when frame validation fails (see
/// "UpcallFrame Validation" in Section 7.1.7.3).
SYS_scheduler_upcall_resume(frame: *const UpcallFrame, result: i64) -> Result<Infallible>;

/// Tell the kernel it is safe to block the OS thread now.
/// Used when all fibers are waiting and there is nothing to run.
/// The thread blocks until any previously registered io_uring completion,
/// futex wake, or signal arrives. On return, the handler is called again
/// with the newly unblocked fiber.
SYS_scheduler_upcall_block() -> !;

UpcallFrame Validation -- SYS_scheduler_upcall_resume performs the following checks on the frame pointer before restoring any register state. All checks must pass; failure returns an error to the caller (which is possible because the call has not yet transferred control to the fiber).

  1. Frame pointer bounds check. The frame pointer must lie within the calling thread's user-space stack bounds, checked against task.stack_base and task.stack_size from the kernel's Task struct. The entire UpcallFrame (from frame to frame + size_of::<UpcallFrame>()) must fit within the range [stack_base, stack_base + stack_size). If out of bounds: return EFAULT.

  2. Saved instruction pointer check. The saved program counter in the frame (frame.regs.rip on x86-64, frame.regs.pc on AArch64/ARMv7, frame.regs.sepc on RISC-V, frame.regs.srr0 on PPC) must point to user-space: its value must be less than USER_ADDR_LIMIT. A saved PC pointing into kernel address space would allow the fiber scheduler to redirect execution into the kernel. If the PC is at or above USER_ADDR_LIMIT: return EPERM.

  3. Saved stack pointer check. The saved stack pointer in the frame (frame.regs.rsp on x86-64, frame.regs.sp on AArch64/ARMv7/RISC-V/PPC) must be within the calling thread's user-space stack bounds (same range as check 1). If out of bounds: return EFAULT.

  4. Segment/privilege-level register check (architecture-specific):
     • x86-64: frame.regs.cs must equal USER_CS (typically 0x2B, Ring 3 code segment) and frame.regs.ss must equal USER_SS (typically 0x23, Ring 3 stack segment). If either is wrong: return EINVAL.
     • AArch64: frame.regs.pstate & PSTATE_EL_MASK must indicate EL0. If the saved PSTATE specifies EL1 or higher: return EINVAL.
     • ARMv7: frame.regs.cpsr & MODE_MASK must indicate USR mode (0x10). If it specifies any privileged mode: return EINVAL.
     • RISC-V: frame.regs.sstatus & SPP_MASK must indicate U-mode (SPP=0). If SPP indicates S-mode: return EINVAL.
     • PPC32/PPC64LE: frame.regs.msr & MSR_PR must be set (problem state / user mode). If PR is clear (supervisor mode): return EINVAL.

  5. Magic cookie integrity check. The frame.magic_cookie field must satisfy frame.magic_cookie ^ task.upcall_frame_nonce == UPCALL_FRAME_MAGIC, where UPCALL_FRAME_MAGIC is a compile-time constant (e.g., 0x55504341_4C4C464D -- "UPCALLFM" in ASCII) and task.upcall_frame_nonce is the per-thread nonce stored in the kernel's Task struct (see field documentation above). The nonce is generated from the hardware RNG at thread creation time and is never accessible to userspace. A frame not written by the kernel's own upcall dispatch code will fail this check because the attacker cannot know the nonce value. If the cookie does not match: return EINVAL.

When the kernel builds an UpcallFrame (Step 3 of Section 7.1.7.2), it writes magic_cookie = UPCALL_FRAME_MAGIC ^ task.upcall_frame_nonce into the frame before transferring control to the upcall handler.

All five checks are performed in order; the first failure terminates validation and returns the corresponding error code. Only after all checks pass does the kernel restore the register state from the frame and transfer control to the fiber.

7.1.7.4 Interaction with io_uring

For BlockingSyscall upcalls, the handler typically converts the blocking operation to a non-blocking io_uring submission (IORING_OP_READ, IORING_OP_WRITE, IORING_OP_FUTEX_WAIT, etc.) and calls SYS_scheduler_upcall_block() when the run queue is empty. The io_uring completion ring provides the wakeup. This combination fully replaces the blocking syscall with an async equivalent, transparent to the fiber.

For PageFaultIo upcalls, the handler typically has no alternative — a page must be faulted in from disk. The handler parks the faulting fiber and runs others, then calls SYS_scheduler_upcall_block() until a wakeup arrives.

Page fault completion wakeup: When the kernel completes the I/O for a demand page fault, it posts a synthetic completion event to the thread's registered io_uring completion queue (if present) or writes an 8-byte counter increment to the thread's registered eventfd (if one was configured at registration time). The completion event carries the fiber_token of the faulting fiber, allowing the handler to identify which parked fiber is now runnable. If neither io_uring nor eventfd is registered, the kernel wakes the thread from SYS_scheduler_upcall_block() directly and issues a new upcall with reason = PageFaultIo and the original fiber_token, allowing the handler to resume the fiber.

7.1.7.5 Fiber Library Design

UmkaOS ships a userspace fiber library (umka-fiber) in the standard library. The library provides:

// umka-fiber (userspace library, not kernel code)

pub struct Fiber { /* stack, register save area, FLS slot table */ }
pub struct FiberScheduler { /* per-OS-thread run queue, upcall stack */ }

impl FiberScheduler {
    /// Initialize the scheduler on the current OS thread.
    /// Allocates an upcall stack and calls SYS_register_scheduler_upcall.
    pub fn init() -> Self;

    /// Create a fiber with the given entry point and stack size.
    pub fn spawn(&self, f: impl FnOnce() + 'static, stack_size: usize) -> FiberId;

    /// Cooperatively yield to the next runnable fiber.
    /// If no other fiber is runnable, returns immediately.
    pub fn yield_now(&self);

    /// Run the scheduler loop. Returns when all fibers have completed.
    pub fn run(&mut self);
}

Fiber Local Storage (FLS): Each Fiber has a private FLS table (array of *mut () slots, analogous to Windows FlsAlloc/FlsGetValue). The library swaps the FLS table pointer on every SwitchToFiber — no kernel involvement. Thread-local storage (#[thread_local]) continues to work as normal and is shared across all fibers on the same OS thread (matching Windows TLS semantics; FLS is distinct).

7.1.7.6 WEA Integration

Windows Fiber support in WEA (Section 18.4.6) maps directly onto this mechanism:

  • ConvertThreadToFiber()FiberScheduler::init() on the calling thread.
  • CreateFiber(size, fn, param)FiberScheduler::spawn(...).
  • SwitchToFiber(fiber) → cooperative yield to a specific fiber; pure userspace register swap, no syscall.
  • FlsAlloc / FlsGetValue / FlsSetValue → read/write into the current fiber's FLS slot table; implemented in ntdll by WINE, no WEA syscall needed.
  • Blocking syscall inside a fiber → scheduler upcall converts to io_uring; the OS thread runs other fibers while waiting.

The TEB NtTib.FiberData field is updated by WINE on every SwitchToFiber call (userspace write to the TEB in user address space). The kernel's role is only to provide the fast NtCurrentTeb() path via the per-thread GS base mapping (Section 18.4.5) and the scheduler upcall mechanism above.


7.2 Real-Time Guarantees

7.2.1 Beyond CBS

Section 6.3 provides CPU bandwidth guarantees via CBS (Constant Bandwidth Server). This ensures average bandwidth. Real-time workloads need worst-case latency bounds: interrupt-to-response always under a specific ceiling.

7.2.2 Design: Bounded Latency Paths

// umka-core/src/rt/mod.rs

/// Real-time configuration (system-wide, set at boot or runtime).
pub struct RtConfig {
    /// Maximum interrupt latency guarantee (nanoseconds).
    /// The kernel guarantees that ISR entry occurs within this bound
    /// after the interrupt fires.
    /// Default: 50_000 (50 μs). Achievable on x86 with careful design.
    pub max_irq_latency_ns: u64,

    /// Maximum scheduling latency for SCHED_DEADLINE tasks (nanoseconds).
    /// The kernel guarantees that a runnable DEADLINE task is scheduled
    /// within this bound.
    /// Default: 100_000 (100 μs).
    pub max_sched_latency_ns: u64,

    /// Preemption model.
    pub preemption: PreemptionModel,
}

#[repr(u32)]
pub enum PreemptionModel {
    /// Voluntary preemption (default). Preemption at explicit preempt points.
    /// Lowest overhead, highest latency variance.
    Voluntary   = 0,

    /// Full preemption. Preemptible everywhere except hard critical sections.
    /// Moderate overhead, good latency bounds.
    /// Equivalent to Linux PREEMPT (non-RT).
    Full        = 1,

    /// RT preemption. All `spinlock_t` and `rwlock_t` instances become sleeping locks
    /// (mapped to `rt_mutex`). `raw_spinlock_t` remains a true spinning lock with
    /// interrupts disabled, used for scheduler internals, interrupt handling, and
    /// hardware access paths that must not sleep.
    /// Interrupts are threaded. Maximum preemptibility.
    /// Equivalent to Linux PREEMPT_RT.
    /// Highest overhead (~2-5% throughput), tightest latency bounds.
    Realtime    = 2,
}

RT Wakeup Latency Budget (x86-64, target ≤ 100 μs):

Component                           Worst case    Basis
──────────────────────────────────────────────────────────────────────
Hardware interrupt delivery          ~1 μs        LAPIC delivery latency
IRQ handler + EOI                    ~3 μs        Minimal ISR: ACK + flag set
Scheduler wakeup (try_to_wake_up)    ~2 μs        Runqueue lock + enqueue
Context switch overhead              ~3 μs        Register save/restore, FPU
TLB flush (if ASID switch)           ~2 μs        CR3 write + pipeline flush
WRPKRU domain switch                 ~0.1 μs      23 cycles @ 3 GHz
Cache warm (MADV_CRITICAL pages)     ~10 μs       8 pages × 64 lines × ~20 ns/fill
Cross-socket IPI (if needed)         ~5 μs        LAPIC IPI round-trip
──────────────────────────────────────────────────────────────────────
Total, same-socket                   ~21 μs       within 100 μs budget
Total, cross-socket                  ~26 μs       within 100 μs budget

Caveats and configuration requirements:

  • This budget assumes the RT task pins its working set with madvise(MADV_CRITICAL) (Section 4.1.3.3) and mlock() to prevent page faults on the RT path.
  • SMI (System Management Interrupt) from firmware can add 50-500 μs stalls and is outside UmkaOS's control. For hard RT: configure isolcpus=N nohz_full=N and ensure BIOS/UEFI does not issue SMIs on isolated CPUs.
  • Memory compaction (Section 4.1.4) is disabled on nohz_full CPUs.
  • The 100 μs target is for SCHED_RT tasks. SCHED_DEADLINE tasks additionally have their CBS deadline enforcement; actual latency depends on declared parameters.

7.2.3 Key Design Decisions for RT

1. Threaded interrupts (when PreemptionModel::Realtime):
   All hardware interrupts are handled by kernel threads.
   Threads are schedulable — RT tasks can preempt interrupt handlers.
   Cost: ~1 μs additional interrupt latency (thread switch).
   Linux PREEMPT_RT does the same.

2. Priority inheritance for RtMutex locks:
   When a low-priority task holds an RtMutex needed by a high-priority task,
   the low-priority task inherits the high-priority task's priority.
   Prevents priority inversion (classic RT problem).
   Cost: ~5-10 cycles per lock acquire (check/update priority).
   Linux PREEMPT_RT does the same.

3. No unbounded loops in kernel paths:
   Every loop has a bounded iteration count.
   Memory allocation in RT context: from pre-allocated pools (no reclaim).
   Page fault in RT context: fails immediately (no I/O wait).
   Enforced by coding guidelines + Verus verification (Section 23.10).

4. Deadline admission control (Section 6.1.4):
   SCHED_DEADLINE tasks declare (runtime, period, deadline).
   Kernel admits the task ONLY if it can guarantee the deadline.
   If admission would violate existing guarantees: returns -EBUSY.
   Same semantics as Linux SCHED_DEADLINE.

7.2.3.1 Priority Inheritance Protocol

Priority inheritance applies exclusively to RtMutex (the real-time mutex with PI support). Standard SpinLock and Mutex do not use PI: SpinLock is non-preemptible (no scheduling occurs while spinning), and sleeping Mutex uses its own priority boosting. RtMutex is the kernel primitive that replaces SpinLock and Mutex throughout the kernel when PreemptionModel::Realtime is active.

/// Real-time mutex with priority inheritance support.
///
/// # Data structure choice: intrusive linked list, not BinaryHeap
///
/// The waiter list is stored as a **priority-sorted intrusive doubly-linked
/// list** with nodes embedded directly in `Task` (`Task::rt_waiter`), not as
/// a heap-allocated `BinaryHeap<RtMutexWaiter>`.
///
/// Rationale:
/// - `RtMutex::lock` is a `RawSpinLock` (held with preemption disabled).
///   Heap allocation inside a raw spinlock is prohibited: the allocator may
///   attempt to acquire a lock that is already held, causing deadlock.
/// - An intrusive list node (`RtMutexWaiter` embedded in `Task`) requires
///   zero heap allocation — the node storage comes from the blocked task's
///   own stack frame.
/// - Tasks cannot be freed while they are waiting (the task pins itself by
///   not returning from `rt_mutex_lock`), so intrusive node lifetimes are safe.
/// - A sorted doubly-linked list gives O(n) insert (n = waiter count, typically
///   1-3 in production RT workloads) and O(1) remove-top (next owner on unlock),
///   which is optimal for the actual workload distribution.
///
/// # Invariants
/// - `owner` is null if and only if the mutex is unlocked.
/// - `waiters` is sorted by descending `effective_priority` (highest priority
///   waiter is at the list head — it will be handed ownership first).
/// - All waiter priorities satisfy: priority ≤ effective priority of `owner`
///   (maintained by `pi_propagate` on lock, `rt_mutex_unlock` on unlock).
/// - Every `RtMutexWaiter` node in `waiters` is embedded in a `Task` that is
///   currently blocked on this specific mutex.
pub struct RtMutex {
    /// Current owner task (`null` if unlocked).
    /// Written only while `lock` is held; readable with Acquire load.
    pub owner: AtomicPtr<Task>,
    /// Priority-sorted intrusive doubly-linked list of blocked waiters.
    /// List head = highest-priority waiter (next owner on unlock).
    /// Protected by `lock`.
    pub waiters: IntrusiveList<RtMutexWaiter>,
    /// Internal spinlock protecting `waiters` and `owner` transitions.
    /// This is a raw spin (never yields, never a sleeping lock), held for
    /// at most ~10–30 instructions during ownership handoff.
    pub lock: RawSpinLock,
}

/// Priority inheritance waiter node. One node per `Task`; embedded directly
/// in `Task` as `Task::rt_waiter: Option<RtMutexWaiter>`.
///
/// Using an intrusive node embedded in `Task` instead of a heap-allocated
/// entry avoids all allocation inside `RawSpinLock` critical sections.
/// The node is valid for exactly the duration that the task is blocked on
/// an `RtMutex`; it is initialized before the lock attempt and cleared on
/// acquisition or timeout/signal.
pub struct RtMutexWaiter {
    /// Intrusive list links (prev/next pointers into the mutex's waiter list).
    pub links: IntrusiveListLinks,
    /// The task that owns this waiter node (back-pointer for PI chain walk).
    pub task: NonNull<Task>,
    /// The task's effective priority when enqueued.
    /// Updated in-place by `pi_propagate` if the task's priority changes while waiting.
    pub effective_priority: u32,
}

/// Fields added to `Task` to support PI chain traversal and intrusive waiter nodes.
pub struct Task {
    // ... (existing fields from Section 7.1.1) ...

    /// Embedded waiter node for the RtMutex this task is currently waiting on.
    /// `Some(node)` while blocked; `None` otherwise.
    /// The node is initialized before `rt_mutex_lock()` sleeps and cleared on
    /// acquisition, timeout, or signal delivery. Initialized inside the RtMutex's
    /// `lock` critical section, so no additional synchronization is needed.
    pub rt_waiter: Option<RtMutexWaiter>,

    /// The RtMutex this task is currently blocked waiting to acquire, or
    /// `None` if the task is not blocked on any RtMutex.
    /// Written under the RtMutex's internal spinlock before the task sleeps.
    /// Used by `pi_propagate` to follow the ownership chain.
    pub blocked_on_rt_mutex: Option<NonNull<RtMutex>>,

    /// The task's current effective priority.
    ///
    /// Normally equals `base_priority` (the scheduler-assigned priority).
    /// Raised by PI when this task holds an `RtMutex` that a higher-priority
    /// task is waiting on. Lowered back to `base_priority` (or the maximum of
    /// all remaining held-mutex waiter priorities) when the blocking task
    /// acquires the mutex and this task releases it.
    pub effective_priority: u32,
}

PI propagation algorithm (chain walk):

Constants:
  MAX_PI_CHAIN_DEPTH = 10   // maximum ownership chain length before
                            // declaring deadlock (matches Linux rt_mutex)

rt_mutex_lock(mutex, current_task):
    mutex.lock.lock_spin()
    if mutex.owner.load(Acquire).is_null():
        // Uncontended: claim ownership directly.
        mutex.owner.store(current_task, Release)
        mutex.lock.unlock_spin()
        return Ok(())

    // Contended: initialize the intrusive waiter node embedded in current_task
    // and insert it into the mutex's priority-sorted waiter list. Zero allocation.
    current_task.rt_waiter = Some(RtMutexWaiter {
        links: IntrusiveListLinks::new(),
        task: NonNull::from(current_task),
        effective_priority: current_task.effective_priority,
    })
    // Record the blocking edge while still holding `lock`, so a concurrent
    // pi_propagate on another CPU sees a consistent ownership chain
    // (per the documented invariant on Task::blocked_on_rt_mutex).
    current_task.blocked_on_rt_mutex = Some(mutex)
    // Insert in priority order (O(n) but n is typically 1-3 in RT workloads).
    mutex.waiters.insert_sorted(&mut current_task.rt_waiter.as_mut().unwrap().links)
    if pi_propagate(mutex, depth=0) == Err(LockError::Deadlock):
        // Deadlock detected: undo the enqueue and report it to the caller.
        mutex.waiters.remove(&mut current_task.rt_waiter.as_mut().unwrap().links)
        current_task.blocked_on_rt_mutex = None
        current_task.rt_waiter = None
        mutex.lock.unlock_spin()
        return Err(LockError::Deadlock)
    mutex.lock.unlock_spin()

    // Sleep until woken by rt_mutex_unlock.
    current_task.sleep(WaitReason::RtMutex)
    current_task.blocked_on_rt_mutex = None
    current_task.rt_waiter = None   // clear the intrusive node after wake
    // On wake: we are the new owner (rt_mutex_unlock hands ownership directly).
    return Ok(())

pi_propagate(mutex, depth):
    if depth > MAX_PI_CHAIN_DEPTH:
        // Cycle detected in the ownership chain: deadlock.
        // rt_mutex_lock returns Err(LockError::Deadlock).
        return Err(LockError::Deadlock)

    owner_ptr = mutex.owner.load(Acquire)
    if owner_ptr.is_null():
        return Ok(())   // mutex became unlocked concurrently; nothing to boost

    owner = &mut *owner_ptr
    // The list head is the highest-priority waiter (sorted descending on insert).
    max_waiter_prio = mutex.waiters.front().map_or(0, |w| w.effective_priority)

    if owner.effective_priority >= max_waiter_prio:
        return Ok(())   // owner is already at or above the required priority; done

    // Boost the owner.
    owner.effective_priority = max_waiter_prio
    scheduler::update_priority(owner)   // reposition in runqueue if running/runnable

    // Chain propagation: if the owner is itself blocked on another RtMutex,
    // propagate the boosted priority to that mutex's owner as well.
    if let Some(upstream_mutex) = owner.blocked_on_rt_mutex:
        upstream_mutex.lock.lock_spin()
        // Update the owner's waiter entry in the upstream mutex with the new priority.
        upstream_mutex.waiters.update_priority(owner, max_waiter_prio)
        result = pi_propagate(upstream_mutex, depth + 1)
        upstream_mutex.lock.unlock_spin()
        return result

    return Ok(())

Priority restoration on unlock:

rt_mutex_unlock(mutex, current_task):
    mutex.lock.lock_spin()

    // Remove current_task from the owner role.
    mutex.owner.store(null, Release)

    // Restore current_task's effective priority to the maximum of:
    //   (a) its base_priority (scheduler-assigned), and
    //   (b) the maximum waiter priority across all *other* RtMutexes it holds.
    // This correctly handles the case where the task holds multiple mutexes.
    current_task.effective_priority = current_task.compute_effective_priority()
    scheduler::update_priority(current_task)

    // Hand ownership to the highest-priority waiter (if any).
    if let Some(next_waiter) = mutex.waiters.pop_front():   // list head = highest priority
        next_task = &mut *next_waiter.task
        mutex.owner.store(next_task, Release)
        // Wake the new owner. It will find itself the owner on return from sleep.
        scheduler::wake_task(next_task)

    mutex.lock.unlock_spin()

compute_effective_priority() iterates the task's list of currently held RtMutex instances (a per-task Vec<NonNull<RtMutex>> updated on each lock/unlock, protected by the task's scheduler lock) and returns max(base_priority, max over held mutexes of their max waiter priority). The list is bounded by the lock nesting depth (MAX_PI_CHAIN_DEPTH), so this is O(MAX_PI_CHAIN_DEPTH) in the worst case.
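That computation can be sketched as a pure function — the slice below stands in for the per-task held-mutex list, with each entry carrying that mutex's maximum waiter priority (0 when it has no waiters); names are illustrative:

```rust
/// Illustrative sketch of Task::compute_effective_priority().
/// `held_max_waiter_prios` stands in for iterating the per-task list of
/// held RtMutex instances and reading each one's top waiter priority.
fn compute_effective_priority(base_priority: u32, held_max_waiter_prios: &[u32]) -> u32 {
    // max(base_priority, max over held mutexes of their max waiter priority)
    held_max_waiter_prios.iter().copied().fold(base_priority, u32::max)
}
```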

Deadlock detection: PI chain propagation in pi_propagate tracks depth. At depth == MAX_PI_CHAIN_DEPTH, if the ownership chain has not terminated, a cycle is inferred and rt_mutex_lock returns Err(LockError::Deadlock). The kernel logs the chain (owner PIDs and mutex addresses) at WARN level before returning the error to the caller. This matches the Linux rt_mutex deadlock detection heuristic.

Scope: Standard SpinLock and Mutex in UmkaOS do not use PI. They are not sleeping locks in the PREEMPT_RT sense: SpinLock disables preemption, so no scheduling event can cause priority inversion while it is held. RtMutex is used wherever a kernel lock may be held across a context switch, which in PreemptionModel::Realtime is most kernel mutexes (converted from SpinLock by the RT build).

7.2.3.2 Threaded IRQ Thread Mapping

When PreemptionModel::Realtime is active, all device interrupt handlers run in dedicated kernel threads rather than in hard-IRQ context. This allows RT tasks to preempt interrupt handlers and gives the scheduler full visibility over IRQ handler execution time. The mapping from IRQ number to handler thread is defined by IrqThread.

/// Kernel thread that services a single threaded hardware interrupt.
///
/// One `IrqThread` exists per `IrqAction` registered with `IRQF_THREAD`.
/// The thread sleeps on `wait` between interrupt deliveries. The hard-IRQ
/// top half sets `pending` and wakes the thread; the thread calls the
/// device's thread function and re-enables the IRQ line.
pub struct IrqThread {
    /// The kernel task backing this IRQ thread.
    /// Named `"irq/{irq}/{name}"` (e.g., `"irq/42/eth0"`).
    pub task: Arc<Task>,
    /// IRQ number this thread services.
    pub irq: u32,
    /// Descriptive name of the IRQ handler (from `request_irq`).
    pub name: Arc<str>,
    /// The registered IRQ action (contains the thread function pointer).
    pub action: Arc<IrqAction>,
    /// Wait queue: the thread sleeps here until the hard-IRQ top half wakes it.
    pub wait: WaitQueue,
    /// Set by the hard-IRQ top half; cleared by the thread bottom half.
    /// `AtomicBool` allows lock-free set from IRQ context and load from
    /// thread context.
    pub pending: AtomicBool,
    /// Scheduling priority of this IRQ thread.
    /// Default: `SCHED_FIFO` priority 50. Adjustable via `chrt(1)` or by
    /// passing `IRQF_THREAD_PRIORITY(n)` in `request_irq` flags.
    /// Priority 50 places IRQ threads above all `SCHED_OTHER` tasks (static
    /// priority 0) and above low-priority RT tasks (1–49), but below
    /// high-priority RT tasks (priority 51–99).
    pub priority: u32,
    /// CPU affinity mask: which CPUs may run this IRQ thread.
    /// Initialised from `/proc/irq/{irq}/smp_affinity` at registration time;
    /// adjustable at runtime via the same sysfs path.
    pub affinity: CpuSet,
}

Thread creation — invoked by request_irq(irq, handler, flags, name, dev_id) when flags includes IRQF_THREAD:

1. Allocate a new IrqThread with:
     - task name: "irq/{irq}/{name}" (truncated to TASK_COMM_LEN = 15 chars)
     - scheduling class: SCHED_FIFO, priority 50 (or IRQF_THREAD_PRIORITY(n))
     - CPU affinity: all CPUs (matches current smp_affinity; adjustable later)
     - pending: false
     - wait: empty WaitQueue
2. Start the kernel thread. The thread body immediately executes:
       loop {
           wait_event(&irq_thread.wait, irq_thread.pending.load(Acquire))
           irq_thread.pending.store(false, Release)
           action.thread_fn(irq, action.dev_id)   // device bottom-half handler
           irq_chip.irq_unmask(irq)               // re-enable the IRQ line
       }
3. Register the IrqThread in the global irq_thread_table[irq].
   If IRQF_SHARED is set and another IrqAction already exists for this IRQ,
   each action gets its own IrqThread (one thread per registered handler).

Hard-IRQ → thread handoff (the two-phase split):

Phase 1: Hard-IRQ top half (runs in interrupt context, preemption disabled):
    1. Acknowledge the interrupt at the interrupt controller (mask the line,
       send EOI, or equivalent — platform-specific).
    2. Run the "primary handler" (the fast part of the ISR: read a status
       register, record the event, clear a flag). Return IRQ_WAKE_THREAD.
    3. Set IrqThread::pending = true  (Relaxed store; the wake_up below
       provides the Release barrier via the WaitQueue spinlock).
    4. wake_up(&irq_thread.wait)      (wakes the sleeping IrqThread task).
    5. Hard-IRQ exits; preemption re-enabled.

Phase 2: IRQ thread bottom half (runs as a schedulable kernel task):
    1. Woken by WaitQueue::wake_up.
    2. Loads IrqThread::pending (Acquire); confirms it is true.
    3. Clears IrqThread::pending (Release store).
    4. Calls action.thread_fn(irq, dev_id) — the device's actual handler
       (DMA buffer processing, packet reception, block I/O completion, etc.).
       This may block (sleep on mutexes, allocate memory) — it is a normal
       kernel thread context.
    5. Calls irq_chip.irq_unmask(irq) to re-enable the hardware IRQ line.
    6. Returns to the wait_event loop at the top.
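The two-phase handoff above can be modeled in user-space Rust, with an atomic `pending` flag and a `Mutex`/`Condvar` pair standing in for the kernel `WaitQueue`. This is a behavioral sketch only — the kernel version runs the top half in IRQ context with preemption disabled, which no user-space model reproduces:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Stand-in for IrqThread: pending flag plus a wait-queue substitute.
struct FakeIrqThread {
    pending: AtomicBool,
    wait: Mutex<bool>,   // "wake requested" flag guarded by the condvar
    cv: Condvar,
    handled: AtomicBool, // set by the stand-in device handler
}

/// Phase 1: hard-IRQ top half — mark pending, wake the thread.
fn top_half(t: &FakeIrqThread) {
    t.pending.store(true, Ordering::Release);
    let mut woken = t.wait.lock().unwrap();
    *woken = true;
    t.cv.notify_one(); // wake_up(&irq_thread.wait)
}

/// Phase 2: bottom half — wait for the wake, clear pending, run the handler.
fn bottom_half(t: &FakeIrqThread) {
    let mut woken = t.wait.lock().unwrap();
    while !*woken {
        woken = t.cv.wait(woken).unwrap(); // wait_event(...)
    }
    *woken = false;
    drop(woken);
    if t.pending.swap(false, Ordering::AcqRel) {
        // action.thread_fn(irq, dev_id) would run here...
        t.handled.store(true, Ordering::Release);
        // ...followed by irq_chip.irq_unmask(irq).
    }
}

/// Deliver one simulated interrupt; returns true if the handler ran.
fn run_one_interrupt() -> bool {
    let t = Arc::new(FakeIrqThread {
        pending: AtomicBool::new(false),
        wait: Mutex::new(false),
        cv: Condvar::new(),
        handled: AtomicBool::new(false),
    });
    let t2 = Arc::clone(&t);
    let th = thread::spawn(move || bottom_half(&t2));
    top_half(&t);
    th.join().unwrap();
    t.handled.load(Ordering::Acquire)
}
```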

Thread lifecycle:

Event Action
request_irq(IRQF_THREAD) Create and start IrqThread; register in irq_thread_table
free_irq() Call kthread_stop(irq_thread.task): set stop flag, wake the thread; thread exits its wait loop on the next iteration
IRQF_SHARED (multiple handlers) Each IrqAction gets an independent IrqThread; all threads are woken by the hard-IRQ top half and each tests its own pending flag
CPU hotplug (CPU offline) IRQ affinity is updated to exclude the offline CPU; if the IRQ thread is running on that CPU, the scheduler migrates it before the CPU is taken offline
PreemptionModel not Realtime IRQ threads are not created; handlers run in hard-IRQ context as usual

Scheduling: IRQ threads run at SCHED_FIFO priority 50 by default. This priority was chosen so that:

  • IRQ threads run before all SCHED_OTHER tasks (static priority 0) and before low-priority RT tasks (1–49): device events are processed promptly.
  • IRQ threads yield to high-priority RT tasks (SCHED_FIFO priority 51–99): an RT control loop at priority 80 can preempt any IRQ thread.
  • The priority can be raised (e.g., to 70 for a network card in a soft-RT path) via chrt -f 70 $(pgrep irq/42/eth0) or the IRQF_THREAD_PRIORITY(n) flag in request_irq. Raising above 99 is rejected (EINVAL).

/proc/irq/{n}/smp_affinity: The affinity mask of IrqThread::task is updated when the sysfs file is written, using sched_setaffinity() on the kernel thread. This is the standard Linux interface; irqbalance(8) and tuna(8) work without modification.

7.2.4 RT + Domain Isolation Interaction

The raw WRPKRU instruction takes ~23 cycles on modern Intel microarchitectures (~6ns at 4 GHz). On KABI call boundaries, the domain switch is unconditional (the caller always needs to switch to the callee's domain), so the switch cost is the raw WRPKRU cost: ~23 cycles. The performance budget (Section 1.2) uses this figure: 4 switches × ~23 cycles = ~92 cycles per I/O round-trip.

RT jitter analysis: In a tight RT control loop making KABI calls at 10kHz, each call requires a round-trip domain switch (out and back = ~46 cycles = ~12ns), accumulating to ~120μs/sec of jitter (10,000 × 12ns). See I/O path analysis below.

RT latency policy:
  Tier 0 drivers: run in Core isolation domain. Zero transition cost.
    RT-critical paths (interrupt handlers, timer callbacks) use Tier 0.
  Tier 1 drivers: one-way domain switch adds ~6ns per WRPKRU
    (~23 cycles at 4 GHz). Round-trip (out+back) = ~12ns.
    Acceptable for soft-RT (audio, video). Not for hard-RT (<10μs).
    RT tasks requiring <10μs determinism should only use Tier 0 paths.
  Tier 2 drivers: user-space. Context switch cost (~1μs). Not for RT.

Why domain isolation does not cause priority inversion: Unlike mutex-based isolation, domain switching is a single unprivileged instruction (WRPKRU) that executes in constant time with no blocking, no lock acquisition, and no kernel involvement. A high-priority RT task switching domains cannot be blocked by a lower-priority task holding the domain. This is fundamentally different from process-based isolation (Tier 2) where IPC involves a context switch that can be delayed by scheduling.

Shared ring buffers in RT paths: When an RT task communicates with a Tier 1 driver via a shared ring buffer, the ring buffer memory is tagged with the shared PKEY (readable/writable by both core and driver domains). Accessing the ring buffer does not require a domain switch — the shared PKEY is always accessible. Only direct access to driver-private memory requires WRPKRU. Therefore, the typical RT I/O path is:

RT task → write command to shared ring buffer (no WRPKRU)
       → doorbell write to MMIO (requires WRPKRU → driver domain → WRPKRU back: ~12ns)
       → poll completion from shared ring buffer (no WRPKRU)

Total domain switch overhead per I/O op: ~12ns (one domain round-trip: two
WRPKRU instructions at ~23 cycles each = ~46 cycles = ~12ns at 4 GHz).
At 10kHz: 120μs/sec. At 1kHz: 12μs/sec. Negligible.
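The budget arithmetic above can be checked with a small helper (illustrative names, not a kernel interface; note the text's ~12 ns figure rounds 46 cycles / 4 GHz = 11.5 ns):

```rust
/// Accumulated domain-switch jitter, in microseconds per second of wall
/// time, for `call_hz` I/O ops/sec where each op costs `cycles` CPU cycles
/// on a core running at `ghz` GHz. Illustrative helper only.
fn domain_switch_jitter_us_per_sec(call_hz: u64, cycles: u64, ghz: f64) -> f64 {
    let ns_per_op = cycles as f64 / ghz; // GHz = cycles per nanosecond
    call_hz as f64 * ns_per_op / 1_000.0 // ns/sec -> us/sec
}
```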

Preemption during domain switch: WRPKRU is a single instruction that cannot be preempted mid-execution. If a timer interrupt arrives between two WRPKRU instructions (e.g., switch to driver domain, then switch back), the interrupt handler saves and restores PKRU as part of the register context. The RT task resumes with its PKRU intact. No special handling is needed — this is the same as any register save/restore on interrupt.

7.2.5 CPU Isolation for Hard RT

Standard Linux RT practice — fully supported:

isolcpus=2-3       Reserve CPUs 2-3: no normal tasks, no load balancing.
nohz_full=2-3      Tickless on CPUs 2-3: no timer interrupts when idle
                    or running a single RT task.
rcu_nocbs=2-3      RCU callbacks offloaded from CPUs 2-3: no RCU
                    processing on isolated CPUs.

Isolated CPUs have: no timer ticks, no RCU callbacks, no workqueues, no kernel threads (except pinned ones). This is required for hard-RT workloads (LinuxCNC, IEEE 1588 PTP, audio with <1ms latency). Cross-reference: Section 6.1.5.11 lists isolcpus and nohz_full as supported.

7.2.6 Driver Crash During RT-Critical Path

If a Tier 1 driver crashes while an RT task depends on it:

Policy: immediate error notification, NOT wait for recovery.

1. Domain fault detected → crash recovery starts (Section 10.8).
2. RT task blocked on the driver gets IMMEDIATE unblock with error:
   - Pending I/O returns -EIO.
   - Pending KABI calls return CapError::DriverCrashed.
   - Signal SIGBUS delivered if task is in a blocking syscall.
3. RT task handles the error (application-specific failsafe mode).
4. Driver recovery (~100ms) happens in background.
5. RT task can resume normal operation after driver reloads.

Rationale: RT guarantees are more important than waiting for recovery.
An RT task must ALWAYS get a response within its deadline, even if that
response is an error. Blocking an RT task for 100ms violates the RT contract.

7.2.7 Linux Compatibility

Real-time interfaces are standard Linux:

SCHED_FIFO, SCHED_RR:       sched_setscheduler() — supported
SCHED_DEADLINE:              sched_setattr() — supported
/proc/sys/kernel/sched_rt_*: RT scheduler tunables — supported
/sys/kernel/realtime:        "1" when PREEMPT_RT is active — supported
clock_nanosleep(TIMER_ABSTIME): deterministic wakeup — supported
mlockall(MCL_CURRENT|MCL_FUTURE): prevent page faults — supported

Existing RT applications (JACK audio, ROS2, LinuxCNC, PTP/IEEE 1588) work without modification.

NUMA-Aware RT Memory:

Hard real-time tasks must avoid remote NUMA access (unpredictable latency). Standard Linux practice applies:

numactl --membind=0 --cpunodebind=0 ./rt_application

The kernel enforces: when a process has SCHED_DEADLINE or SCHED_FIFO priority AND is bound to a NUMA node via set_mempolicy(MPOL_BIND), the memory allocator does NOT fall back to remote nodes on allocation failure — it returns ENOMEM instead. This prevents unpredictable remote-access latency spikes.
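The enforcement rule condenses to a small decision function on the allocator's local-node-failure path — a sketch with illustrative names (`Policy`, `AllocFallback`):

```rust
/// Scheduling-policy subset relevant to the strict-bind rule. Illustrative.
enum Policy { Other, Fifo, Deadline }

/// Outcome when node-local allocation fails. Illustrative names.
#[derive(Debug, PartialEq)]
enum AllocFallback {
    RemoteNode, // normal tasks: fall back to another NUMA node
    Enomem,     // RT + MPOL_BIND: fail fast, never touch a remote node
}

/// RT tasks bound via set_mempolicy(MPOL_BIND) get ENOMEM instead of a
/// remote-node fallback; everyone else falls back as usual.
fn on_local_alloc_failure(policy: &Policy, mpol_bind: bool) -> AllocFallback {
    let is_rt = matches!(policy, Policy::Fifo | Policy::Deadline);
    if is_rt && mpol_bind { AllocFallback::Enomem } else { AllocFallback::RemoteNode }
}
```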

Additionally, NUMA balancing (automatic page migration based on access patterns) is disabled for RT-priority tasks. Automatic page migration adds unpredictable latency (~50-200μs per migrated page). RT tasks pin their memory explicitly.

7.2.8 Performance Impact

When PreemptionModel::Voluntary (default): zero overhead vs Linux. Same model.

When PreemptionModel::Full: ~1% throughput reduction. Same as Linux PREEMPT.

When PreemptionModel::Realtime: ~2-5% throughput reduction. Same as Linux PREEMPT_RT. This is the unavoidable cost of deterministic scheduling — the same cost any RT OS pays.

The preemption model is configurable at boot. Debian servers use Voluntary (default). Embedded/RT deployments use Realtime.

7.2.9 Hardware Resource Determinism

Software scheduling (EEVDF, CBS, threaded IRQs, priority inheritance) guarantees CPU execution time. However, on modern multi-core SoCs, shared hardware resources — L3 caches, memory controllers, interconnect bandwidth — introduce unpredictable latency spikes that violate hard real-time deadlines regardless of CPU priority.

A high-priority RT task running on an isolated CPU can still miss its deadline if a background batch job on another core evicts the RT task's data from the shared L3 cache, or saturates the memory controller with streaming writes. This is the "noisy neighbor" problem at the hardware level.

UmkaOS addresses this by extending its Capability Domain model to physically partition shared hardware resources, using platform QoS extensions where available.

7.2.9.1 Cache Partitioning (Intel RDT / ARM MPAM)

Modern server CPUs expose hardware Quality of Service (QoS) mechanisms that allow the OS to assign cache and memory bandwidth quotas per workload:

  • Intel Resource Director Technology (RDT): Available on Xeon Skylake-SP and later. Provides Cache Allocation Technology (CAT) for L3 partitioning and Memory Bandwidth Allocation (MBA) for memory controller throttling. Controlled via MSRs (IA32_PQR_ASSOC, IA32_L3_MASK_n). Up to 16 Classes of Service (CLOS).
  • ARM Memory Partitioning and Monitoring (MPAM): Optional extension introduced in ARMv8.4-A (FEAT_MPAM). Provides Cache Portion Partitioning (CPP) and Memory Bandwidth Partitioning (MBP). Thread-to-partition assignment is via system registers (MPAM0_EL1, MPAM1_EL1) which set the PARTID for the executing thread. Actual resource limits (cache way bitmasks, bandwidth caps) are configured via MMIO registers in each Memory System Component (MSC) — e.g., MPAMCFG_CPBM for cache portions and MPAMCFG_MBW_MAX for bandwidth limits. Up to 256 Partition IDs (PARTIDs).

UmkaOS integrates these into the Capability Domain model (Section 8.1). Each Capability Domain can optionally carry a ResourcePartition constraint:

// umka-core/src/rt/resource_partition.rs

/// Hardware resource partition assigned to a Capability Domain.
/// Only meaningful when the platform provides QoS extensions (RDT, MPAM).
/// On platforms without QoS support, this struct is ignored.
pub struct ResourcePartition {
    /// L3 Cache Allocation bitmask.
    /// Each set bit grants the domain access to one cache "way."
    /// On Intel RDT, this maps to IA32_L3_MASK_n for the assigned CLOS.
    /// On ARM MPAM, this maps to the cache portion bitmap for the assigned PARTID.
    /// Example: 0x000F = ways 0-3 (exclusive to this domain).
    pub l3_cache_mask: u32,

    /// Memory Bandwidth Allocation percentage (1-100).
    /// Throttles memory controller traffic generated by this domain.
    /// On Intel MBA, this maps to the delay value for the assigned CLOS.
    /// On ARM MPAM, this maps to the MBW_MAX control for the assigned PARTID.
    /// 100 = no throttling. 50 = limit to ~50% of peak bandwidth.
    pub mem_bandwidth_pct: u8,

    /// Whether this partition is exclusive (no overlap with other domains).
    /// When true, the kernel verifies that no other domain's l3_cache_mask
    /// overlaps with this one. Allocation fails with EBUSY if overlap detected.
    pub exclusive: bool,
}

Determinism strategy:

  1. RT Domains: Granted exclusive L3 cache ways (e.g., ways 0-3 on a 16-way cache). Their hot data is never evicted by other workloads. Memory bandwidth set to 100%.
  2. Best-Effort Domains: Restricted to the remaining L3 cache ways (e.g., ways 4-15) and throttled via MBA during contention (e.g., limited to 50% bandwidth).
  3. Discovery at boot: The kernel queries CPUID (Intel) or MPAM system registers (ARM) to discover the number of available cache ways and CLOS/PARTID slots. If the hardware does not support RDT/MPAM, the ResourcePartition constraint is silently ignored and a warning is logged.
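The exclusivity check behind step 1 (EBUSY when an exclusive partition's l3_cache_mask overlaps an existing domain's mask) amounts to a pairwise bitwise-AND test — a sketch with illustrative names:

```rust
/// Verify that a new exclusive L3 cache-way bitmask does not overlap any
/// mask already assigned to another domain. Returns Err(()) — the analogue
/// of EBUSY — on overlap. Illustrative helper, not the UmkaOS API.
fn check_exclusive(new_mask: u32, assigned_masks: &[u32]) -> Result<(), ()> {
    if assigned_masks.iter().any(|m| m & new_mask != 0) {
        Err(()) // at least one cache way is already claimed
    } else {
        Ok(())
    }
}
```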

Cache monitoring integration: RDT also provides Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM), which report per-CLOS cache occupancy and bandwidth usage. UmkaOS exposes these counters via the observability framework (Section 19.2) as stable tracepoints, enabling operators to verify that RT workloads remain within their allocated cache partition.

7.2.9.2 Strict Memory Pinning for RT Domains

Hard real-time tasks cannot tolerate page faults. A single page fault in a 10 kHz control loop adds 3-50 μs of jitter (TLB miss + page table walk + potential disk I/O), which can exceed the entire deadline budget.

UmkaOS provides strict memory pinning semantics for RT Capability Domains:

  • Eager allocation: When an RT task calls mmap() or loads an executable, all physical frames are allocated and page table entries populated immediately (equivalent to MAP_POPULATE). No demand paging.
  • Pre-faulted stacks: The kernel pre-faults the full stack allocation for RT threads at clone() time. Stack guard pages are still present but the usable stack region is fully backed by physical memory.
  • Exempt from reclaim: Pages owned by an RT domain are never targeted by kswapd page reclaim (Section 4.1), never compressed by ZRAM (Section 4.2), and never swapped. The OOM killer will target non-RT domains first; it will only kill an RT task as a last resort after all non-RT tasks have been considered.
  • NUMA-local enforcement: When an RT task is bound to a NUMA node via set_mempolicy(MPOL_BIND), the allocator returns ENOMEM rather than falling back to remote NUMA nodes. Remote NUMA access adds 50-200 ns of unpredictable latency per cache miss — unacceptable for hard RT. NUMA auto-balancing (automatic page migration) is disabled for RT-priority tasks.

These properties are activated automatically when a task has SCHED_FIFO, SCHED_RR, or SCHED_DEADLINE policy AND is assigned to a Capability Domain with an RtConfig (Section 7.2.2). They can also be requested explicitly via mlockall(MCL_CURRENT | MCL_FUTURE), which is the standard Linux RT practice.

7.2.9.3 Time-Sensitive Networking (TSN)

For distributed real-time systems (industrial control, automotive Ethernet, robotics), determinism must extend beyond the CPU to the network. UmkaOS integrates with hardware Time-Sensitive Networking (IEEE 802.1) features via umka-net (Section 15.1):

  • IEEE 802.1Qbv (Time-Aware Shaper): NICs with hardware TSN support expose gate control lists (GCLs) that schedule packet transmission at precise microsecond intervals. UmkaOS bypasses the software Qdisc layer for TSN-tagged traffic classes, programming the NIC's hardware scheduler directly via KABI. RT packets are never queued in software — they are placed directly in a hardware TX ring whose transmission gate opens at the scheduled time.

  • IEEE 802.1AS (Generalized Precision Time Protocol): Hardware PTP timestamps from the NIC's clock are fed directly to the timekeeping subsystem (Section 6.5). The CLOCK_TAI system clock is synchronized to the PTP grandmaster with sub-microsecond accuracy. The CBS scheduler (Section 6.3) uses this PTP-synchronized timebase to align RT task wakeups with hardware transmission windows — the task wakes up, computes, and its output packet hits the NIC exactly when the 802.1Qbv gate is open.

  • IEEE 802.1Qci (Per-Stream Filtering and Policing): Ingress traffic is filtered in hardware by stream ID. Non-RT traffic arriving on an RT-reserved stream is dropped at the NIC before it reaches the CPU, preventing interference with RT packet processing.

Architecture note: TSN support requires Tier 1 NIC drivers that implement the TSN KABI extensions (gate control list programming, PTP clock read, stream filter configuration). Standard NICs without TSN hardware operate normally but cannot provide network-level determinism. The umka-net stack detects TSN capability at driver registration via the device registry (Section 10.5).


7.3 Signal Handling

Signals are the primary asynchronous notification mechanism inherited from POSIX. UmkaOS implements the full Linux-compatible signal model: 31 standard signals (numbers 1–31), kernel-reserved real-time slots 32–33, and 31 user-visible real-time signals SIGRTMIN–SIGRTMAX (numbers 34–64 when glibc NPTL reserves two slots). The Task struct carries per-task signal state in its signal_mask field (Section 7.1.1); process-wide signal disposition is stored in Process::sighand.

7.3.1 Signal Table

Every signal has a default action and optionally a user-installed handler.

Num Name Default Description
1 SIGHUP TERM Hangup on controlling terminal or parent process death
2 SIGINT TERM Keyboard interrupt (Ctrl+C)
3 SIGQUIT CORE Keyboard quit (Ctrl+\)
4 SIGILL CORE Illegal CPU instruction
5 SIGTRAP CORE Trace/breakpoint trap
6 SIGABRT CORE abort(3) call
7 SIGBUS CORE Bus error (misaligned or unmapped access)
8 SIGFPE CORE Floating-point / arithmetic exception
9 SIGKILL TERM Unconditional termination (unblockable, uncatchable)
10 SIGUSR1 TERM User-defined signal 1
11 SIGSEGV CORE Invalid virtual-memory reference
12 SIGUSR2 TERM User-defined signal 2
13 SIGPIPE TERM Write to pipe with no reader
14 SIGALRM TERM alarm(2) real-time timer expiry
15 SIGTERM TERM Graceful termination request
16 SIGSTKFLT TERM Coprocessor stack fault (legacy x86; rarely generated)
17 SIGCHLD IGN Child stopped, continued, or terminated
18 SIGCONT CONT Resume stopped process
19 SIGSTOP STOP Unconditional stop (unblockable, uncatchable)
20 SIGTSTP STOP Keyboard stop (Ctrl+Z); catchable
21 SIGTTIN STOP Background process read from controlling terminal
22 SIGTTOU STOP Background process write to controlling terminal (if TOSTOP)
23 SIGURG IGN Out-of-band data on socket
24 SIGXCPU CORE CPU time limit exceeded (setrlimit(RLIMIT_CPU))
25 SIGXFSZ CORE File size limit exceeded (setrlimit(RLIMIT_FSIZE))
26 SIGVTALRM TERM Virtual timer (user-time only) expiry
27 SIGPROF TERM Profiling timer expiry (user + system time)
28 SIGWINCH IGN Terminal window-size change
29 SIGIO / SIGPOLL TERM I/O now possible on fd (same number)
30 SIGPWR TERM Power failure / UPS notification
31 SIGSYS CORE Invalid system call argument (seccomp violation)
32–33 (reserved) NPTL-internal RT signals (pthread_cancel, SIGSETXID); use RT queuing; not application-usable
34 SIGRTMIN TERM First user-visible real-time signal
35–63 SIGRTMIN+1 … SIGRTMAX-1 TERM Real-time signals; no predefined meaning
64 SIGRTMAX TERM Last user-visible real-time signal

Signals 32 and 33 are within the RT signal range (32–64) and therefore use RT queuing semantics (per Section 7.3.3: append SigInfo to per-signal queue). However, they are allocated internally to NPTL: signal 32 (the kernel's SIGRTMIN) is used for pthread_cancel delivery, and signal 33 (kernel SIGRTMIN+1) is used for SIGSETXID (thread credential synchronization). They are not available for application use via SIGRTMIN + N calculations — sigrtmin() returns 34 as the first application-usable RT signal.

Default action codes:

  • TERM — terminate the process via do_exit().
  • CORE — terminate and write a core dump (format matches Linux ELF core; respects RLIMIT_CORE and /proc/sys/kernel/core_pattern-equivalent umkafs path).
  • STOP — place the process in TaskState::STOPPED; notify parent with SIGCHLD.
  • CONT — if currently stopped, resume execution; otherwise ignore.
  • IGN — discard silently.

SIGKILL and SIGSTOP cannot be caught, blocked, or ignored. All other signals may have their disposition changed via sigaction().

Real-time signals (34–64) differ from standard signals in three ways:

  1. Queued delivery: multiple instances of the same RT signal are individually queued; standard signals collapse to at most one pending instance.
  2. Ordered delivery: when multiple distinct RT signals are pending, the lowest-numbered signal is delivered first (SIGRTMIN has highest priority).
  3. Value attachment: RT signals sent via sigqueue() carry a SigVal payload (integer or pointer), delivered to SA_SIGINFO handlers in SigInfo::si_value.
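
The coalesce-vs-queue distinction can be exercised with a minimal standalone model. The `Pending` struct and its methods here are hypothetical illustrations, not the kernel's actual `SignalQueue` (Section 7.3.3):

```rust
// Minimal model of standard-signal coalescing vs RT-signal queuing.
use std::collections::VecDeque;

const SIGRT_FIRST: u8 = 32; // signals >= 32 use RT queuing semantics

struct Pending {
    // Payload per queued instance; outer index = signal number (0 unused).
    queues: Vec<VecDeque<i32>>,
}

impl Pending {
    fn new() -> Self {
        Pending { queues: (0..65).map(|_| VecDeque::new()).collect() }
    }

    /// Returns true if the instance was stored, false if coalesced away.
    fn send(&mut self, sig: u8, payload: i32) -> bool {
        let q = &mut self.queues[sig as usize];
        if sig < SIGRT_FIRST && !q.is_empty() {
            return false; // standard signal: at most one pending instance
        }
        q.push_back(payload); // RT signal: every instance is queued (FIFO)
        true
    }

    fn pending_count(&self, sig: u8) -> usize {
        self.queues[sig as usize].len()
    }
}

fn main() {
    let mut p = Pending::new();
    // SIGUSR1 (10) coalesces: three sends leave one pending instance.
    for _ in 0..3 { p.send(10, 0); }
    assert_eq!(p.pending_count(10), 1);
    // SIGRTMIN (34) queues: three sends leave three pending instances.
    for v in 0..3 { p.send(34, v); }
    assert_eq!(p.pending_count(34), 3);
    println!("ok");
}
```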

7.3.2 Signal Data Structures

SigAction

SigAction describes the disposition of a single signal, corresponding to the POSIX struct sigaction. It is stored in the per-process signal handler table (Process::sighand), indexed by signal number minus 1.

/// Per-signal disposition record.
///
/// # Invariants
/// - When `sa_flags` contains `SA_RESETHAND`, the disposition is reset to
///   `SigHandler::Default` at delivery time, before the user handler is
///   invoked (POSIX one-shot semantics).
/// - SIGKILL and SIGSTOP always have `handler == SigHandler::Default`; the
///   kernel enforces this and rejects `sigaction()` calls that attempt to
///   change them.
pub struct SigAction {
    /// Signal handler or default/ignore disposition.
    pub handler: SigHandler,
    /// Signals to add to the thread's signal mask during handler execution.
    /// The delivered signal itself is also masked unless SA_NODEFER is set.
    pub sa_mask: SignalSet,
    /// Modifier flags.
    pub sa_flags: SaFlags,
    /// Optional user-space trampoline (calls `sigreturn`). If None, the
    /// kernel uses the vsyscall/vDSO trampoline.
    pub sa_restorer: Option<unsafe extern "C" fn()>,
}

/// Signal handler variant.
pub enum SigHandler {
    /// Perform the signal's default action (see Section 7.3.1 table).
    Default,
    /// Discard the signal.
    Ignore,
    /// Classic signal handler: receives signal number only.
    Handler(unsafe extern "C" fn(sig: i32)),
    /// Extended handler: receives signal number, siginfo pointer, and
    /// ucontext pointer. Enabled by SA_SIGINFO.
    SigAction(unsafe extern "C" fn(sig: i32, info: *mut SigInfo, ctx: *mut UContext)),
}

bitflags! {
    /// Flags for `SigAction::sa_flags`.
    pub struct SaFlags: u32 {
        /// Do not send SIGCHLD to the parent when a child stops
        /// (SIGSTOP/SIGTSTP/SIGTTIN/SIGTTOU) or resumes, only when it terminates.
        const SA_NOCLDSTOP = 0x0000_0001;
        /// Do not create zombies: children are reaped automatically on
        /// termination; `wait()` blocks until all children have exited, then
        /// fails with ECHILD.
        const SA_NOCLDWAIT = 0x0000_0002;
        /// Deliver extended siginfo to a `SigAction` handler.
        const SA_SIGINFO   = 0x0000_0004;
        /// Invoke the handler on the alternate signal stack (see `sigaltstack`).
        const SA_ONSTACK   = 0x0800_0000;
        /// Restart slow syscalls interrupted by this signal instead of
        /// returning EINTR. See Section 7.3.6 for the list of restartable
        /// syscalls.
        const SA_RESTART   = 0x1000_0000;
        /// Do not automatically mask the signal during its own handler.
        const SA_NODEFER   = 0x4000_0000;
        /// Reset handler to SIG_DFL after the signal is delivered (POSIX
        /// one-shot semantics).
        const SA_RESETHAND = 0x8000_0000;
    }
}

SignalSet

A 64-bit bitmask representing a set of signal numbers. Bit n-1 (zero-indexed) represents signal n. This layout is identical to the Linux sigset_t for 64-bit architectures, ensuring ABI compatibility for sigprocmask(), sigaction(), and sigwaitinfo().

/// Bitmask of signals (bit i-1 = signal i). Matches Linux 64-bit sigset_t.
#[repr(transparent)]
#[derive(Clone, Copy, Default, PartialEq, Eq)]
pub struct SignalSet(pub u64);

impl SignalSet {
    /// Return a set containing exactly signal `sig` (1-indexed).
    ///
    /// # Panics
    /// Panics if `sig` is 0 or > 64.
    pub fn mask(sig: u8) -> Self {
        assert!(sig >= 1 && sig <= 64, "signal number out of range");
        Self(1u64 << (sig - 1))
    }

    pub fn empty() -> Self { Self(0) }
    pub fn full() -> Self { Self(u64::MAX) }

    pub fn contains(self, sig: u8) -> bool {
        self.0 & (1u64 << (sig - 1)) != 0
    }

    pub fn insert(&mut self, sig: u8) { self.0 |= 1u64 << (sig - 1); }
    pub fn remove(&mut self, sig: u8) { self.0 &= !(1u64 << (sig - 1)); }

    pub fn union(self, other: Self) -> Self { Self(self.0 | other.0) }
    pub fn intersect(self, other: Self) -> Self { Self(self.0 & other.0) }
    pub fn complement(self) -> Self { Self(!self.0) }

    /// True if any signal in this set is pending and unblocked.
    pub fn has_pending_unblocked(self, mask: SignalSet) -> bool {
        self.intersect(mask.complement()).0 != 0
    }

    /// Return the lowest signal number in the set, or None if empty.
    pub fn lowest(self) -> Option<u8> {
        if self.0 == 0 { None } else { Some(self.0.trailing_zeros() as u8 + 1) }
    }
}
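
A quick standalone exercise of the set operations above; the subset of `SignalSet` needed is reproduced here so the snippet compiles on its own:

```rust
// Reduced copy of SignalSet (bit i-1 = signal i) for demonstration.
#[derive(Clone, Copy, PartialEq, Eq)]
struct SignalSet(u64);

impl SignalSet {
    fn mask(sig: u8) -> Self {
        assert!(sig >= 1 && sig <= 64, "signal number out of range");
        SignalSet(1u64 << (sig - 1))
    }
    fn empty() -> Self { SignalSet(0) }
    fn insert(&mut self, sig: u8) { self.0 |= 1u64 << (sig - 1); }
    fn intersect(self, o: Self) -> Self { SignalSet(self.0 & o.0) }
    fn complement(self) -> Self { SignalSet(!self.0) }
    fn lowest(self) -> Option<u8> {
        if self.0 == 0 { None } else { Some(self.0.trailing_zeros() as u8 + 1) }
    }
    fn has_pending_unblocked(self, mask: SignalSet) -> bool {
        self.intersect(mask.complement()).0 != 0
    }
}

fn main() {
    let mut pending = SignalSet::empty();
    pending.insert(15); // SIGTERM
    pending.insert(34); // SIGRTMIN
    // With SIGTERM blocked, the lowest *unblocked* pending signal is 34.
    let blocked = SignalSet::mask(15);
    assert!(pending.has_pending_unblocked(blocked));
    assert_eq!(pending.intersect(blocked.complement()).lowest(), Some(34));
    // With nothing blocked, SIGTERM (15) is delivered before SIGRTMIN (34).
    assert_eq!(pending.lowest(), Some(15));
    println!("ok");
}
```

The last assertion is exactly the lowest-number-first delivery order that Section 7.3.3 implements with a trailing-zero count.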

SigInfo

SigInfo matches the Linux siginfo_t ABI exactly. It is passed to SA_SIGINFO handlers and enqueued for RT signal delivery.

/// Signal origin information (matches Linux siginfo_t ABI).
///
/// The active union variant is determined by the signal number and `si_code`:
/// - SIGCHLD → `sigchld` field
/// - SIGSEGV, SIGBUS, SIGILL, SIGFPE → `sigfault` field
/// - SIGPOLL/SIGIO → `sigpoll` field
/// - RT signals sent via `sigqueue()` → `rt` field
/// - Signals sent via `kill()` or `tgkill()` → `kill` field
/// - POSIX timers → `timer` field
#[repr(C)]
pub struct SigInfo {
    pub si_signo: i32,
    pub si_errno: i32,
    pub si_code: i32,
    pub _pad: i32,
    pub _union: SigInfoUnion,
}

/// `si_code` constants (source of the signal).
pub mod si_code {
    pub const SI_USER:    i32 = 0;      // kill() or raise()
    pub const SI_KERNEL:  i32 = 0x80;   // sent by kernel
    pub const SI_QUEUE:   i32 = -1;     // sigqueue()
    pub const SI_TIMER:   i32 = -2;     // POSIX timer
    pub const SI_MESGQ:   i32 = -3;     // POSIX message queue
    pub const SI_ASYNCIO: i32 = -4;     // AIO completion
    // SIGSEGV codes
    pub const SEGV_MAPERR: i32 = 1;     // address not mapped
    pub const SEGV_ACCERR: i32 = 2;     // permission denied
    pub const SEGV_BNDERR: i32 = 3;     // failed address bounds check
    // SIGBUS codes
    pub const BUS_ADRALN: i32 = 1;      // invalid address alignment
    pub const BUS_ADRERR: i32 = 2;      // non-existent physical address
    pub const BUS_OBJERR: i32 = 3;      // object-specific hardware error
    // SIGILL codes
    pub const ILL_ILLOPC: i32 = 1;      // illegal opcode
    pub const ILL_ILLOPN: i32 = 2;      // illegal operand
    pub const ILL_ILLADR: i32 = 3;      // illegal addressing mode
    pub const ILL_ILLTRP: i32 = 4;      // illegal trap
    pub const ILL_PRVOPC: i32 = 5;      // privileged opcode
    pub const ILL_COPROC: i32 = 8;      // coprocessor error
    // SIGFPE codes
    pub const FPE_INTDIV: i32 = 1;      // integer divide by zero
    pub const FPE_INTOVF: i32 = 2;      // integer overflow
    pub const FPE_FLTDIV: i32 = 3;      // floating-point divide by zero
    pub const FPE_FLTOVF: i32 = 4;      // floating-point overflow
    pub const FPE_FLTUND: i32 = 5;      // floating-point underflow
    pub const FPE_FLTRES: i32 = 6;      // floating-point inexact result
    pub const FPE_FLTINV: i32 = 7;      // floating-point invalid operation
    // SIGCHLD codes
    pub const CLD_EXITED:    i32 = 1;
    pub const CLD_KILLED:    i32 = 2;
    pub const CLD_DUMPED:    i32 = 3;
    pub const CLD_TRAPPED:   i32 = 4;
    pub const CLD_STOPPED:   i32 = 5;
    pub const CLD_CONTINUED: i32 = 6;
    // SIGPOLL codes
    pub const POLL_IN:  i32 = 1;
    pub const POLL_OUT: i32 = 2;
    pub const POLL_MSG: i32 = 3;
    pub const POLL_ERR: i32 = 4;
    pub const POLL_PRI: i32 = 5;
    pub const POLL_HUP: i32 = 6;
}

#[repr(C)]
pub union SigInfoUnion {
    /// For kill()/tgkill(): sender PID and UID.
    pub kill: SigInfoKill,
    /// For POSIX timers.
    pub timer: SigInfoTimer,
    /// For sigqueue() / RT signals.
    pub rt: SigInfoRt,
    /// For SIGCHLD.
    pub sigchld: SigInfoSigchld,
    /// For SIGSEGV, SIGBUS, SIGILL, SIGFPE.
    pub sigfault: SigInfoSigfault,
    /// For SIGPOLL/SIGIO.
    pub sigpoll: SigInfoSigpoll,
    /// Raw padding to match Linux siginfo_t size (128 bytes total).
    pub _pad: [u8; 112],
}

#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoKill   { pub si_pid: u32, pub si_uid: u32 }

#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoTimer  {
    pub si_timerid: i32, pub si_overrun: i32, pub si_value: SigVal,
}

#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoRt     { pub si_pid: u32, pub si_uid: u32, pub si_value: SigVal }

#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoSigchld {
    pub si_pid: u32, pub si_uid: u32, pub si_status: i32,
    pub si_utime: i64, pub si_stime: i64,
}

#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoSigfault {
    /// Fault address.
    pub si_addr: usize,
    /// LSB of the faulting address (for SIGBUS BUS_MCEERR_AR).
    pub si_addr_lsb: i16,
    /// For bounds-checked faults: lower and upper bound.
    pub si_lower: usize,
    pub si_upper: usize,
}

#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoSigpoll { pub si_band: i64, pub si_fd: i32 }

/// Value carried by RT signals and POSIX timers.
#[repr(C)] #[derive(Clone, Copy)]
pub union SigVal {
    pub sival_int: i32,
    pub sival_ptr: usize,
}

7.3.3 Signal Delivery Algorithm

Pending Signal State

Each Task maintains two sets:

  • pending_task: SignalSet — signals pending for this specific thread.
  • pending_process: SignalSet — signals pending for any thread in the process (process-directed signals).

Additionally, RT signals that may be queued multiple times are stored in a per-process SigQueue (a fixed-capacity ring buffer of SigInfo records). When a standard signal is sent and already pending, the duplicate is dropped. When an RT signal is sent, it is always enqueued unless the queue has reached RLIMIT_SIGPENDING items (EAGAIN is returned to the sender).

Per-Signal-Type Bucketed Queue

Problem with a global signal queue: RLIMIT_SIGPENDING limits the total number of queued signals across all signal types for a UID. If process A floods process B with SIGCHLD (rapid child spawning), the global queue fills and B can no longer receive SIGUSR1, SIGTERM, or any other signal — even one-time signals that should always get through.

UmkaOS per-signal-type buckets: Each signal number has an independent sub-queue with its own limit. A SIGCHLD flood cannot prevent delivery of other signals.

/// Per-task pending signal state.
pub struct PendingSignals {
    /// Per-signal-number queues.
    /// Index = signal number (1–64). Index 0 is unused.
    /// Each queue is independent: exhausting one does not affect the others.
    queues: [SignalQueue; 65],

    /// Total pending count across all queues.
    /// Used for O(1) RLIMIT_SIGPENDING enforcement without summing all queues.
    total_count: u32,

    /// Bitmask: bit N-1 is set iff `queues[N]` is non-empty — the same bit
    /// layout as `SignalSet` / `sigset_t` (bit i-1 = signal i), so it can be
    /// intersected with the blocked mask and reported by sigpending(2)
    /// without translation.
    /// Enables O(1) "find first pending signal" via trailing-zero-count.
    pending_mask: AtomicU64,
}

/// Maximum queued RT signals per signal number per task.
/// Linux default: min(RLIMIT_SIGPENDING, 1000). UmkaOS: 32 per signal.
pub const RT_SIGQUEUE_MAX: usize = 32;

pub struct SignalQueue {
    /// Pending signal instances for this signal number.
    ///
    /// Standard signals (1–31): at most 1 entry (non-queuing semantics). Any
    /// duplicate sent while one is already pending is silently discarded per
    /// POSIX at-most-one-pending rules.
    ///
    /// Real-time signals (32–64): up to `RT_SIGQUEUE_MAX` entries (queuing
    /// semantics). The oldest entry is delivered first (FIFO).
    ///
    /// RT signal queue is bounded at `RT_SIGQUEUE_MAX` entries per signal
    /// number. Overflow returns `EAGAIN` to `sigqueue()` callers (POSIX
    /// behavior). The fixed array avoids heap allocation under the signal
    /// queue spinlock.
    entries: [Option<SigInfo>; RT_SIGQUEUE_MAX],
    head: u8,
    tail: u8,
    len: u8,
}

/// Per-signal-number limit for real-time signal queues.
/// Standard signals (1–31) are capped at 1 pending instance (coalesced).
/// RT signals (32–64) are queued up to this limit per signal number.
/// The per-queue cap prevents a single RT signal type from consuming the
/// entire UID budget, while `PendingSignals::total_count` enforces the
/// aggregate RLIMIT_SIGPENDING across all queues.
pub const PER_SIGNAL_RT_LIMIT: usize = RT_SIGQUEUE_MAX;

Finding the next pending signal is O(1):

/// Return the lowest-numbered unblocked pending signal, or None.
/// Uses a single hardware CTZ instruction — no iteration over queues.
fn next_pending_signal(pending: &PendingSignals, mask: &SignalSet) -> Option<u32> {
    // Mask out blocked signals. Both bitmasks use the sigset_t layout
    // (bit i-1 = signal i), so no translation is needed.
    let eligible = pending.pending_mask.load(Ordering::Acquire) & !mask.0;
    if eligible == 0 {
        return None;
    }
    // Lowest-numbered eligible signal: bit i-1 = signal i, so add 1 to the
    // bit position.
    Some(eligible.trailing_zeros() + 1)
}

RLIMIT_SIGPENDING enforcement: total_count is incremented on every send_signal() call that enqueues a signal and checked against the UID's sigpending_count limit (identical to Linux). The per-queue bucketing does not change the aggregate cap: a process with RLIMIT_SIGPENDING=100 can have at most 100 pending signals total across all 64 signal numbers. What changes is that a SIGCHLD flood consumes budget from total_count but cannot consume entries in other signals' per-type queues. EAGAIN is returned to the sender only when total_count reaches the UID limit, or when an RT signal's individual queue has reached PER_SIGNAL_RT_LIMIT entries.
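
The dual-limit admission check described above can be sketched as follows. `Limits`, `Enqueue`, and `try_enqueue` are hypothetical names for illustration; the real kernel performs this under the signal spinlock with the structures defined earlier:

```rust
// Sketch of the per-queue + aggregate admission check (single-threaded model).
const RT_SIGQUEUE_MAX: u32 = 32; // per-signal-number cap for RT signals
const SIGRT_FIRST: u8 = 32;

struct Limits {
    total: u32,              // models PendingSignals::total_count
    uid_limit: u32,          // models RLIMIT_SIGPENDING
    per_queue: [u32; 65],    // models per-signal queue lengths
}

#[derive(Debug, PartialEq)]
enum Enqueue { Stored, Coalesced, Eagain }

fn try_enqueue(l: &mut Limits, sig: u8) -> Enqueue {
    let q = &mut l.per_queue[sig as usize];
    if sig < SIGRT_FIRST {
        if *q >= 1 { return Enqueue::Coalesced; }   // standard: at most one pending
    } else if *q >= RT_SIGQUEUE_MAX {
        return Enqueue::Eagain;                     // per-signal RT cap
    }
    if l.total >= l.uid_limit {
        return Enqueue::Eagain;                     // aggregate RLIMIT_SIGPENDING
    }
    *q += 1;
    l.total += 1;
    Enqueue::Stored
}

fn main() {
    let mut l = Limits { total: 0, uid_limit: 100, per_queue: [0; 65] };
    // Flooding one RT signal (35) exhausts only its own bucket...
    for _ in 0..1000 { try_enqueue(&mut l, 35); }
    assert_eq!(l.per_queue[35], RT_SIGQUEUE_MAX);
    assert_eq!(try_enqueue(&mut l, 35), Enqueue::Eagain);
    // ...while other signals still get through.
    assert_eq!(try_enqueue(&mut l, 34), Enqueue::Stored);    // SIGRTMIN
    assert_eq!(try_enqueue(&mut l, 15), Enqueue::Stored);    // SIGTERM
    assert_eq!(try_enqueue(&mut l, 15), Enqueue::Coalesced); // standard: coalesced
    println!("ok");
}
```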

Signal delivery priority within the eligible set:

  • Standard signals (1–31) have higher delivery priority than RT signals (32–64), matching Linux POSIX behavior.
  • Within standard signals: lowest signal number delivered first (SIGHUP=1 before SIGINT=2, etc.).
  • Within RT signals: lowest signal number delivered first (SIGRTMIN before SIGRTMIN+1, etc.), which preserves the Linux ordering for equal-priority RT signals.

dequeue_signal() is updated to pop from queues[signo].entries instead of a global linked list, clears the signal's bit in pending_mask when its queue empties, and decrements total_count.

sigpending(2) return value: reports pending_mask cast to a sigset_t (O(1) — no iteration required).

Linux compatibility: signal delivery order and coalescing semantics are identical to Linux. The bucketed storage is an internal implementation detail invisible to userspace. The only observable difference: a SIGCHLD flood no longer prevents other signal types from being queued — an improvement Linux could make only by reworking its task_struct.pending flat-list design.

Sending a Signal

Signal delivery originates from send_signal(target, sig, info, scope):

  1. Validate: reject signal 0 (used only for kill() permission check). Reject numbers > 64.
  2. Permission check: the sending task must have CAP_KILL in its capability domain, or euid/uid of sender must match uid/suid of target process. SIGCONT may always be sent within the same session.
  3. SIGKILL/SIGSTOP fast path: if sig is SIGKILL or SIGSTOP, wake every thread in the target process immediately; these signals bypass the blocked mask.
  4. Check if ignored: if the signal's disposition is SigHandler::Ignore and the signal is not SIGCHLD with SA_NOCLDWAIT semantics, drop it immediately. Exception: signals that originated from the kernel (e.g. SIGSEGV from a fault) are delivered forcibly regardless of disposition.
  5. Enqueue:
     • Standard signals (1–31): set the corresponding bit in pending_task or pending_process. If already set, do nothing (at-most-one-pending semantics).
     • RT signals (32–64): append a SigInfo record to SigQueue, then set the bit in pending_task or pending_process to indicate non-empty.
  6. Wake the target: if the target thread is sleeping in an interruptible wait (TaskState::INTERRUPTIBLE — e.g. poll, read, nanosleep) and the signal is not blocked by its signal_mask, reschedule the thread. The syscall's slow path will return EINTR (or restart, per SA_RESTART — see Section 7.3.6).
  7. Set TIF_SIGPENDING: mark a per-CPU flag on the target's CPU (or set it in the thread's flags if it is not running). The flag is checked on every kernel → user transition.
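
The disposition check in step 4 is easy to get wrong, so here is its decision logic in isolation. `should_drop` and `Disposition` are hypothetical names for this sketch, not kernel symbols:

```rust
// Step 4 in isolation: when may an incoming signal be dropped at send time?
#[derive(Clone, Copy, PartialEq)]
enum Disposition { Default, Ignore, Handler }

fn should_drop(sig: u8, disp: Disposition, from_kernel_fault: bool) -> bool {
    if sig == 9 || sig == 19 {
        return false; // SIGKILL/SIGSTOP are never dropped
    }
    if from_kernel_fault {
        return false; // forced delivery (e.g. SIGSEGV from a page fault)
    }
    disp == Disposition::Ignore // ignored signals are discarded at send time
}

fn main() {
    assert!(should_drop(10, Disposition::Ignore, false));   // ignored SIGUSR1: dropped
    assert!(!should_drop(10, Disposition::Handler, false)); // handled: delivered
    assert!(!should_drop(11, Disposition::Ignore, true));   // faulting SIGSEGV: forced
    assert!(!should_drop(9, Disposition::Ignore, false));   // SIGKILL: always delivered
    println!("ok");
}
```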

Thread Selection for Process-Directed Signals

A process-directed signal (sent to a PID rather than a TID) can be delivered to any thread that does not have it blocked. The selection algorithm is:

  1. Prefer a thread that is sleeping in an interruptible state (TASK_INTERRUPTIBLE) and has the signal unblocked.
  2. Among multiple eligible threads, select the first one found in thread-group order (deterministic but not specified to user space).
  3. If all threads have the signal blocked, the signal remains pending in pending_process until a thread unblocks it via sigprocmask() or pthread_sigmask().
  4. If execve() is in progress, defer until the exec completes.
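
Steps 1–3 above amount to a two-pass scan over the thread group, which can be sketched as follows (the `Thread` struct and `pick_target` function are illustrative, not the kernel's types):

```rust
// Thread selection for a process-directed signal (simplified view).
struct Thread {
    tid: u32,
    interruptible_sleep: bool, // in TASK_INTERRUPTIBLE
    blocked: u64,              // signal mask, sigset_t layout (bit i-1 = signal i)
}

fn pick_target(threads: &[Thread], sig: u8) -> Option<u32> {
    let bit = 1u64 << (sig - 1);
    // 1. Prefer an interruptible sleeper with the signal unblocked.
    if let Some(t) = threads.iter()
        .find(|t| t.interruptible_sleep && t.blocked & bit == 0)
    {
        return Some(t.tid);
    }
    // 2. Otherwise the first thread (in group order) with the signal unblocked.
    threads.iter().find(|t| t.blocked & bit == 0).map(|t| t.tid)
    // 3. None => the signal stays pending in pending_process.
}

fn main() {
    let group = [
        Thread { tid: 1, interruptible_sleep: false, blocked: 0 },
        Thread { tid: 2, interruptible_sleep: true,  blocked: 0 },
    ];
    // The sleeper is preferred over the runnable thread, even though it is
    // later in group order.
    assert_eq!(pick_target(&group, 15), Some(2));

    let all_blocked = [Thread { tid: 1, interruptible_sleep: true, blocked: 1 << 14 }];
    assert_eq!(pick_target(&all_blocked, 15), None); // remains process-pending
    println!("ok");
}
```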

Checking and Delivering Pending Signals

The kernel checks for pending signals at every kernel → user-mode return:

  • Syscall return path: after do_syscall() completes and before SYSRET/IRET.
  • Interrupt return path: after the interrupt handler runs, before returning to user mode.
  • schedule() return: after a context switch, when returning to the resumed task.

The check is gated on TIF_SIGPENDING. If set, do_signal() is called:

fn do_signal(regs: &mut ArchRegs) {
    loop {
        let sig = dequeue_signal();   // lowest-numbered pending unblocked signal
        match sig {
            None => break,
            Some((signum, info)) => handle_signal(signum, info, regs),
        }
    }
}

dequeue_signal() checks pending_task first (thread-directed), then pending_process. For RT signals, it dequeues the oldest SigInfo for the chosen signal from SigQueue.

Signal Handler Invocation

handle_signal(signum, info, regs) performs the following steps:

  1. Determine action: look up Process::sighand[signum - 1].
  2. Default action: execute the default disposition:
     • TERM: call do_exit(signal_exit_status(signum)).
     • CORE: write core dump, then do_exit().
     • STOP: set TaskState::STOPPED, send SIGCHLD to parent, call schedule().
     • CONT: resume all stopped threads in the process.
     • IGN: return.
  3. User handler: build a signal frame on the user stack (Section 7.3.4), then redirect the kernel → user return to the handler entry point. Three modifiers apply:
     • Update mask: add sa_mask to signal_mask, plus SignalSet::mask(signum) unless SA_NODEFER is set.
     • SA_RESETHAND: if set, reset the disposition to SigHandler::Default before invoking the handler.
     • SA_ONSTACK: if set and the task has an alternate signal stack (sigaltstack) registered and not already in use, switch the user RSP to the alt stack.

7.3.4 Signal Frame Layout (x86-64)

On x86-64 the kernel pushes an rt_sigframe onto the user stack (or the alternate signal stack if SA_ONSTACK is set and active). The frame is 16-byte aligned at the point where RSP is set for the handler. The layout matches the Linux rt_sigframe ABI so that glibc signal trampolines work unmodified.

/// User-stack frame built by the kernel when delivering a signal (x86-64).
///
/// Stack grows downward. The kernel writes this struct below the current RSP,
/// aligns the resulting RSP to 16 bytes, then subtracts 8 (simulating a
/// `call` instruction's return address push) before setting RSP for the handler.
///
/// # ABI note
/// This layout is fixed by the Linux x86-64 signal ABI. glibc's `__restore_rt`
/// trampoline (the default `sa_restorer`) issues `syscall` with rax = SYS_rt_sigreturn
/// (15) to re-enter the kernel. The kernel then reads `uc` from this frame to
/// restore all registers and the signal mask.
#[repr(C)]
pub struct RtSigFrame {
    /// Return address pushed by the handler call: points to `sa_restorer`
    /// (or the vDSO `__restore_rt` trampoline if `sa_restorer` is None).
    pub pretcode: *const u8,
    /// Extended signal information (only meaningful when SA_SIGINFO is set;
    /// present in the frame regardless).
    pub info: SigInfo,
    /// Saved user-space execution context, restored by `sigreturn`.
    pub uc: UContext,
    // Architecture-private FP/XSAVE state follows immediately in memory,
    // pointed to by uc.uc_mcontext.fpstate. It is not part of this struct
    // because its size is runtime-determined (XSAVE area size from CPUID).
}

/// User context saved on signal entry, restored on `sigreturn`.
#[repr(C)]
pub struct UContext {
    pub uc_flags:  u64,
    pub uc_link:   *mut UContext,
    pub uc_stack:  SigAltStack,
    pub uc_mcontext: MContext,
    pub uc_sigmask: SignalSet,
}

/// Alternate signal stack descriptor (`struct stack_t`).
#[repr(C)]
pub struct SigAltStack {
    pub ss_sp:    *mut u8,
    pub ss_flags: i32,
    pub ss_size:  usize,
}

/// Machine context: all general-purpose registers plus segment and FP state.
/// Matches Linux `struct sigcontext` / `mcontext_t` for x86-64.
#[repr(C)]
pub struct MContext {
    pub r8:      u64,
    pub r9:      u64,
    pub r10:     u64,
    pub r11:     u64,
    pub r12:     u64,
    pub r13:     u64,
    pub r14:     u64,
    pub r15:     u64,
    pub rdi:     u64,
    pub rsi:     u64,
    pub rbp:     u64,
    pub rbx:     u64,
    pub rdx:     u64,
    pub rax:     u64,
    pub rcx:     u64,
    pub rsp:     u64,
    pub rip:     u64,
    pub eflags:  u64,
    pub cs:      u16,
    pub gs:      u16,
    pub fs:      u16,
    pub ss:      u16,
    pub err:     u64,
    pub trapno:  u64,
    pub oldmask: u64,
    pub cr2:     u64,
    /// Pointer to XSAVE area (or null if no FP state).
    pub fpstate: *mut FpState,
    pub _reserved: [u64; 8],
}

/// Header of the XSAVE state area (variable-length; size from `CPUID.(EAX=0Dh,ECX=0)`).
#[repr(C)]
pub struct FpState {
    pub cwd:    u16,  // x87 control word
    pub swd:    u16,  // x87 status word
    pub twd:    u16,  // x87 tag word
    pub fop:    u16,  // last FP instruction opcode
    pub rip:    u64,  // last FP instruction RIP
    pub rdp:    u64,  // last FP data RIP
    pub mxcsr:  u32,
    pub mxcsr_mask: u32,
    // 8×16-byte st/mm registers, 16×16-byte XMM registers, optional AVX/AVX-512 state
    // follow via the standard XSAVE layout.
}

Frame construction sequence:

  1. Compute the new RSP: subtract size_of::<RtSigFrame>() from current user RSP.
  2. Align down to 16 bytes, subtract 8 (ABI: stack must be 16-byte aligned at handler entry, simulating the call that pushed a return address).
  3. Write RtSigFrame fields: pretcode = sa_restorer (or vDSO trampoline), info = the SigInfo, uc.uc_mcontext = all current user registers, uc.uc_sigmask = task's current signal mask before adding sa_mask.
  4. Save XSAVE state: call the arch XSAVE routine to write FP/vector state into the area immediately following the frame; set uc.uc_mcontext.fpstate.
  5. Set uc.uc_stack to the alternate signal stack descriptor (so the handler can inspect it).
  6. Modify the kernel → user return: set rip = handler entry point, rdi = signum, rsi = &frame.info (if SA_SIGINFO), rdx = &frame.uc (if SA_SIGINFO), rsp = computed stack pointer from step 2.
  7. Set uc.uc_flags = 0 (reserved for future use by the kernel; glibc checks it).
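
The alignment arithmetic from steps 1–2 can be checked in isolation: after the frame is carved out, RSP must satisfy rsp % 16 == 8 at handler entry (16-byte aligned before the simulated call, minus the 8-byte return-address slot). `signal_frame_rsp` and the frame size below are illustrative, not kernel symbols:

```rust
// Steps 1-2 of frame construction: carve out the frame, align, simulate `call`.
fn signal_frame_rsp(user_rsp: u64, frame_size: u64) -> u64 {
    let below = user_rsp - frame_size; // 1. subtract the frame size
    (below & !0xf) - 8                 // 2. align down to 16, then subtract 8
}

fn main() {
    // Arbitrary user RSP values with assorted misalignments; 0x4c8 is a
    // hypothetical total frame size.
    for rsp in [0x7fff_ffff_e000u64, 0x7fff_ffff_e123, 0x7fff_ffff_e7b9] {
        let h = signal_frame_rsp(rsp, 0x4c8);
        // SysV x86-64 ABI: rsp ≡ 8 (mod 16) at function entry.
        assert_eq!(h % 16, 8);
        assert!(h < rsp); // frame lies below the original stack pointer
    }
    println!("ok");
}
```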

sigreturn: When the handler returns, the trampoline executes SYS_rt_sigreturn (syscall number 15 on x86-64). The kernel reads RtSigFrame from RSP, restores all registers from uc.uc_mcontext, restores the signal mask from uc.uc_sigmask (blocking signals that the kernel added for handler execution), and restores FP state from the XSAVE area. Execution resumes at the saved rip.

AArch64 and RISC-V: The signal frame layout for AArch64 and RISC-V 64 differs in the register set and XSAVE/FPSIMD save area. Those layouts are specified in the arch-specific sections of Section 2.2.

7.3.5 Signal System Calls

kill(pid, sig) → Result<(), Errno>

Send signal sig to a target specified by pid:

  • pid > 0: send to the process with that PID.
  • pid == 0: send to every process in the sender's process group.
  • pid == -1: send to every process the sender has permission to signal, except PID 1 (init).
  • pid < -1: send to process group |pid|.

Returns ESRCH if no target process exists, EPERM if the caller lacks permission. Signal 0 performs a permission check only (no signal delivered); this is used to test process existence.
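
The four pid encodings factor into a small dispatcher. `KillScope` and `kill_scope` are illustrative names for this sketch:

```rust
// Decode kill(2)'s pid argument into a delivery scope.
#[derive(Debug, PartialEq)]
enum KillScope {
    Pid(i32),     // pid > 0: a single process
    OwnGroup,     // pid == 0: the sender's process group
    AllPermitted, // pid == -1: every permitted process except init
    Group(i32),   // pid < -1: process group |pid|
}

fn kill_scope(pid: i32) -> KillScope {
    match pid {
        p if p > 0 => KillScope::Pid(p),
        0          => KillScope::OwnGroup,
        -1         => KillScope::AllPermitted,
        p          => KillScope::Group(-p), // remaining cases are pid < -1
    }
}

fn main() {
    assert_eq!(kill_scope(1234), KillScope::Pid(1234));
    assert_eq!(kill_scope(0), KillScope::OwnGroup);
    assert_eq!(kill_scope(-1), KillScope::AllPermitted);
    assert_eq!(kill_scope(-5678), KillScope::Group(5678));
    println!("ok");
}
```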

tgkill(tgid, tid, sig) → Result<(), Errno>

Send sig to thread tid within thread group tgid. This is the correct way to direct a signal to a specific thread. Returns ESRCH if tid does not exist within tgid, preventing the race in tkill() where the TID may have been recycled.

tkill(tid, sig) → Result<(), Errno>

Legacy form: send sig to thread tid without verifying its thread group. Retained for compatibility; new code should use tgkill().

sigqueue(pid, sig, value) → Result<(), Errno>

Send a real-time signal sig to process pid with an attached SigVal payload value. Semantics are the same as kill() for the pid argument. The value is stored in SigInfo::si_value (union: sival_int or sival_ptr), and si_code is set to SI_QUEUE. Only meaningful for RT signals (34–64); for standard signals, the payload is carried in the queued SigInfo but only one instance is queued.

sigaction(sig, act, oldact) → Result<(), Errno>

Install a new disposition act for signal sig, returning the previous disposition in oldact (if non-null). Rejects attempts to change SIGKILL or SIGSTOP (EINVAL). The disposition is process-wide and shared among all threads. The act.sa_mask is sanitized: SIGKILL and SIGSTOP bits are cleared.

sigprocmask(how, set, oldset) → Result<(), Errno>

Modify the calling thread's signal mask. how is one of:

  • SIG_BLOCK: add set to mask.
  • SIG_UNBLOCK: remove set from mask.
  • SIG_SETMASK: replace mask with set.

SIGKILL and SIGSTOP bits in set are silently ignored. After modification, if any previously blocked signal is now unblocked and pending, TIF_SIGPENDING is set so the signal is delivered at the next kernel exit.
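
The three operations plus the sanitization rule fit in a few lines. `apply_sigprocmask` and `How` are illustrative names; the mask is the sigset_t layout (bit i-1 = signal i):

```rust
// sigprocmask() core: three mask operations with SIGKILL/SIGSTOP sanitized.
const UNBLOCKABLE: u64 = (1 << 8) | (1 << 18); // bits for SIGKILL (9), SIGSTOP (19)

#[derive(Clone, Copy)]
enum How { Block, Unblock, SetMask }

fn apply_sigprocmask(mask: u64, how: How, set: u64) -> u64 {
    let set = set & !UNBLOCKABLE; // SIGKILL/SIGSTOP bits are silently ignored
    match how {
        How::Block   => mask | set,
        How::Unblock => mask & !set,
        How::SetMask => set,
    }
}

fn main() {
    let sigterm = 1u64 << 14; // SIGTERM (15) → bit 14
    let sigkill = 1u64 << 8;  // SIGKILL (9)  → bit 8

    let m = apply_sigprocmask(0, How::Block, sigterm | sigkill);
    assert_eq!(m, sigterm); // SIGKILL cannot be blocked

    let m = apply_sigprocmask(m, How::Unblock, sigterm);
    assert_eq!(m, 0);

    // SIG_SETMASK with only SIGKILL requested leaves the mask empty.
    assert_eq!(apply_sigprocmask(m, How::SetMask, sigkill), 0);
    println!("ok");
}
```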

sigaltstack(ss, old_ss) → Result<(), Errno>

Register an alternate signal stack. If ss is non-null, sets the alternate stack to the region [ss.ss_sp, ss.ss_sp + ss.ss_size) with flags ss.ss_flags. Flag SS_DISABLE disables the alternate stack; SS_AUTODISARM clears the alternate stack flag on signal entry (prevents recursive use without explicit re-arm). Minimum size is MINSIGSTKSZ (architecture-dependent; typically 2 KiB for x86-64, defined in the umka-compat header shim).

sigwaitinfo(set, info) / sigtimedwait(set, info, timeout)

Synchronously wait for any signal in set to become pending. Atomically dequeues and returns it in info. These are thread-directed: only signals pending on the calling thread or its process are considered. sigtimedwait adds a timespec timeout; returns EAGAIN if it expires.

7.3.6 SA_RESTART and EINTR

When a signal interrupts a blocking syscall:

  • If the installed handler has SA_RESTART set: the kernel automatically restarts the syscall by setting RIP back to the syscall instruction and re-entering the syscall handler. This is transparent to the user-space process.
  • Otherwise: the syscall returns -EINTR. The process must restart manually (or use TEMP_FAILURE_RETRY-style looping).

Restartable syscalls (SA_RESTART causes automatic restart): read, readv, write, writev, ioctl (when marked restartable by the driver), open and openat (if blocking on a FIFO or device), wait4, waitpid, waitid, flock, fcntl (F_SETLKW), futex(FUTEX_WAIT, without a timeout), and the socket calls accept, accept4, connect, recvfrom, recvmsg, recvmmsg, sendto, sendmsg, sendmmsg when no socket timeout has been set.

Non-restartable syscalls (always return EINTR even with SA_RESTART): the signal-wait calls pause, sigsuspend, sigtimedwait, sigwaitinfo (restarting a call whose purpose is to wait for a signal would never return); the multiplexing calls poll, ppoll, select, pselect6, epoll_wait, epoll_pwait; the sleep calls nanosleep and clock_nanosleep (the remaining time is reported so user space can resume the sleep); System V IPC msgrcv, msgsnd, semop, semtimedop; socket calls with a receive or send timeout set; io_getevents; io_uring_enter (with IORING_ENTER_GETEVENTS when interrupted before any completion). These rules match the Linux signal(7) restart semantics.

The distinction reflects whether the syscall can safely re-enter from the top without corrupting partial progress. Syscalls that have already partially consumed data (e.g. a partial read) complete and return the partial count; they are not restarted.
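
From user space, the non-restartable case is handled with a TEMP_FAILURE_RETRY-style loop. This sketch mocks the syscall with a closure (a real interrupted syscall needs a live signal handler); `retry_on_eintr` is a hypothetical helper:

```rust
// Retry loop around a syscall-like operation that may fail with EINTR.
const EINTR: i32 = 4; // Linux errno value for "interrupted system call"

fn retry_on_eintr(mut syscall: impl FnMut() -> Result<usize, i32>) -> Result<usize, i32> {
    loop {
        match syscall() {
            Err(e) if e == EINTR => continue, // interrupted: re-issue from the top
            other => return other,           // success or a real error
        }
    }
}

fn main() {
    // Mock syscall: fail with EINTR twice, then succeed with 42 bytes.
    let mut attempts = 0;
    let r = retry_on_eintr(|| {
        attempts += 1;
        if attempts < 3 { Err(EINTR) } else { Ok(42) }
    });
    assert_eq!(r, Ok(42));
    assert_eq!(attempts, 3);
    println!("ok");
}
```

Note the caveat from the paragraph above: this pattern is only safe for calls that can re-enter from the top; a partial read should use its returned byte count, not be blindly re-issued.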

7.3.7 Signal Inheritance Across fork() and exec()

fork() / clone(): The child inherits a complete copy of the parent's signal disposition table (sighand), the current signal mask of the cloning thread, and the alternate signal stack descriptor. The child's pending signal sets are cleared: signals pending in the parent are not delivered to the child. This prevents fork-bomb-style cascades and matches POSIX semantics.

exec(): When a process calls execve(), all caught signals are reset to their default disposition. Signals set to SIG_IGN remain ignored (POSIX leaves the treatment of an ignored SIGCHLD across exec unspecified; UmkaOS follows Linux and keeps it ignored). The signal mask is unchanged. Pending signals that were sent to the old executable are retained and delivered to the new image after exec completes. The alternate signal stack is cleared (the old stack region is no longer mapped).
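
The exec-time reset rule reduces to one pass over the disposition table. `Disp` and `reset_on_exec` are illustrative stand-ins for `SigHandler` and the real table walk over Process::sighand:

```rust
// exec(): caught handlers fall back to Default; Ignore survives.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Disp { Default, Ignore, Handler }

fn reset_on_exec(table: &mut [Disp; 64]) {
    for d in table.iter_mut() {
        if *d == Disp::Handler {
            // Handler addresses point into the old image and are meaningless
            // in the new one, so they must be reset.
            *d = Disp::Default;
        }
        // Ignore and Default carry no addresses and are preserved.
    }
}

fn main() {
    let mut t = [Disp::Default; 64];
    t[14] = Disp::Handler; // SIGTERM (15) handler installed, index = signum - 1
    t[16] = Disp::Ignore;  // SIGCHLD (17) ignored
    reset_on_exec(&mut t);
    assert_eq!(t[14], Disp::Default); // handler reset
    assert_eq!(t[16], Disp::Ignore);  // ignore preserved
    println!("ok");
}
```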

Thread creation (clone with CLONE_SIGHAND): The new thread inherits the process's signal handler table (shared, not copied) and begins with an empty pending set. The new thread's signal mask is copied from the creating thread's mask. The new thread has no alternate signal stack; it must call sigaltstack() independently.

7.3.8 SIGCHLD and wait()

SIGCHLD is sent to a parent process whenever a child:

  • Terminates (normal exit or killed by signal).
  • Stops due to a job control signal (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU), unless the parent has set SA_NOCLDSTOP in its SIGCHLD disposition.
  • Continues after being stopped, if the parent is listening for SIGCHLD with SA_NOCLDSTOP cleared and WCONTINUED semantics.

wait4(pid, status, options, rusage) / waitpid(pid, status, options):

These syscalls block until a child matching pid changes state:

  • pid > 0: wait for the specific child.
  • pid == -1: wait for any child.
  • pid == 0: wait for any child in the same process group.
  • pid < -1: wait for any child in process group |pid|.

Options: WNOHANG (non-blocking; returns 0 if no child is waitable), WUNTRACED (report stopped children), WCONTINUED (report continued children), __WALL (wait for any child regardless of clone flags).

When a child transitions to a waitable state, the kernel:

  1. Sets the child's exit status in Process::exit_status.
  2. Sends SIGCHLD to the parent (unless the parent set SA_NOCLDWAIT).
  3. Wakes any parent blocked in wait4() / waitpid() / waitid().
  4. If the child is a zombie (terminated, not yet reaped), the wait() call consumes the zombie and releases the child's PID.

If the parent has set SA_NOCLDWAIT in its SIGCHLD action, or has set the SIGCHLD disposition to SIG_IGN, children are automatically reaped on termination without becoming zombies. A subsequent wait() blocks until all children have terminated, then fails with ECHILD (POSIX semantics).

waitid(idtype, id, infop, options): Extended form that fills a SigInfo struct with child status information (exit code, signal, stop/continue cause) rather than encoding it in a status word. WNOWAIT option peeks at the waitable child without consuming it.
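
For contrast, here is the classic wait() status-word encoding that waitid() avoids. The helpers mirror the conventional Unix/Linux layout that macros like WIFEXITED decode; the function names themselves are illustrative:

```rust
// Classic wait() status word: exit code in bits 8-15, signal info in bits 0-7.
fn encode_exited(code: u8) -> i32 { (code as i32) << 8 }
fn encode_signaled(sig: u8, dumped: bool) -> i32 {
    sig as i32 | (if dumped { 0x80 } else { 0 })
}
fn encode_stopped(sig: u8) -> i32 { ((sig as i32) << 8) | 0x7f }

// Decoders corresponding to the W* macros.
fn wifexited(s: i32) -> bool { s & 0x7f == 0 }
fn wexitstatus(s: i32) -> i32 { (s >> 8) & 0xff }
fn wifsignaled(s: i32) -> bool { ((((s & 0x7f) + 1) as i8) >> 1) > 0 }
fn wtermsig(s: i32) -> i32 { s & 0x7f }
fn wifstopped(s: i32) -> bool { s & 0xff == 0x7f }

fn main() {
    let s = encode_exited(3);
    assert!(wifexited(s));
    assert_eq!(wexitstatus(s), 3);

    let s = encode_signaled(9, false); // killed by SIGKILL, no core
    assert!(wifsignaled(s) && !wifexited(s));
    assert_eq!(wtermsig(s), 9);

    let s = encode_stopped(19); // stopped by SIGSTOP
    assert!(wifstopped(s) && !wifsignaled(s));
    println!("ok");
}
```

Packing three distinct states into one i32 is exactly why the decoders need the bit tricks above; waitid() instead reports signal, status, and cause in separate SigInfo fields.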


7.4 Process Groups and Sessions

Process groups and sessions implement the POSIX job control model: the mechanism by which a shell manages sets of processes, routes terminal I/O signals, and controls which process group has foreground access to the controlling terminal.

7.4.1 Structures

ProcessGroup

A process group is a collection of processes that share a process group ID (PGID). Every process belongs to exactly one process group.

/// A collection of processes sharing a common process group ID.
///
/// # Invariants
/// - `pgid` equals the PID of the process group leader at the time the group
///   was created. The leader may subsequently exit; the group persists until
///   all members have exited.
/// - `session` never changes after creation.
/// - Every task linked via `members` corresponds to a live process whose
///   `Process::pgid` field equals this group's `pgid`.
pub struct ProcessGroup {
    /// Process group identifier (PGID).
    pub pgid: Pid,
    /// Session this group belongs to.
    pub session: Arc<Session>,
    /// Members of this process group as an intrusive linked list.
    /// Uses `Task::pid_group_node` embedded in each `Task` struct —
    /// no heap allocation occurs under the spinlock.
    pub members: SpinLock<IntrPidList>,
    /// True if this group is the foreground process group of its session's
    /// controlling terminal.
    pub foreground: AtomicBool,
}

ProcessGroup and Session use intrusive linked lists for member tracking: Task contains a pid_group_node: ListNode field that links it into its process group's member chain. IntrPidList is a singly-linked intrusive list through this node. The spinlock protects list pointer manipulation only; no heap allocation occurs under the lock. Nodes are embedded in Task structs (slab-allocated at task creation) and unlinked at task exit before the task's slab memory is freed.

Session

A session is a collection of process groups sharing a controlling terminal and a session ID (SID). The SID equals the PID of the session leader (the process that called setsid()).

/// A collection of process groups sharing a controlling terminal and session ID.
///
/// # Invariants
/// - `sid` equals the PID of the process that created this session via `setsid()`.
/// - `leader` is None if the session leader has exited; the session persists
///   as long as any member process is alive.
/// - At most one process group in `process_groups` has `foreground == true`.
/// - `controlling_terminal` is None until `TIOCSCTTY` succeeds or a session
///   leader opens a terminal that becomes the controlling terminal.
pub struct Session {
    /// Session identifier.
    pub sid: Pid,
    /// PID of the session leader, or None if it has exited.
    pub leader: Option<Pid>,
    /// The controlling terminal for this session, if any.
    pub controlling_terminal: Option<Arc<Tty>>,
    /// All process groups in this session, keyed by PGID.
    ///
    /// `HashMap` under `SpinLock` is acceptable here because process group changes
    /// are rare: they occur only on `setpgid()`, `setsid()`, and process exit.
    /// A typical session has O(1)–O(10) groups. The spinlock is never held across
    /// a blocking operation; contention is negligible in practice.
    pub process_groups: SpinLock<HashMap<Pid, Arc<ProcessGroup>>>,
}

The Process struct (Section 7.1.1) carries two additional fields for group/session membership:

pub struct Process {
    // ... (existing fields from Section 7.1.1) ...
    /// Process group ID.
    pub pgid: Pid,
    /// Session ID.
    pub sid: Pid,
}

The global kernel state holds two lookup tables, both protected by the session lock:

/// All live process groups, keyed by PGID.
static PROCESS_GROUPS: RwLock<HashMap<Pid, Arc<ProcessGroup>>>;
/// All live sessions, keyed by SID.
static SESSIONS: RwLock<HashMap<Pid, Arc<Session>>>;

7.4.2 System Calls

setpgid(pid, pgid) → Result<(), Errno>

Move process pid into process group pgid. If pid is 0, the caller is the target. If pgid is 0, the target's own PID is used as the new PGID (creating a new process group with the target as leader).

Preconditions enforced by the kernel:
  • The target process must be either the caller itself, or a child of the caller that has not yet called execve() (EACCES is returned after exec).
  • The target process must be in the same session as the caller (EPERM if not).
  • If pgid refers to an existing group, that group must be in the same session (EPERM if not).
  • A session leader cannot change its own PGID (EPERM).
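These checks can be expressed as a pure predicate over the fields involved. A hedged sketch with a flattened argument list rather than real Process references; the EPERM result for a target that is neither the caller nor a child is an assumption (the rules above do not cover that case):

```rust
#[derive(Debug, PartialEq)]
enum Errno { Eacces, Eperm }

/// Validate the setpgid() preconditions listed above. Every argument is an
/// illustrative flattening of state the kernel would read from the caller,
/// the target, and the group tables.
fn validate_setpgid(
    target_is_caller: bool,
    target_is_child: bool,
    target_has_execed: bool,
    same_session: bool,                        // target in caller's session
    existing_group_same_session: Option<bool>, // None => pgid creates a new group
    target_is_session_leader: bool,
) -> Result<(), Errno> {
    if !target_is_caller {
        if !target_is_child {
            return Err(Errno::Eperm); // assumption: not covered by the rules above
        }
        if target_has_execed {
            return Err(Errno::Eacces);
        }
    }
    if !same_session {
        return Err(Errno::Eperm);
    }
    if existing_group_same_session == Some(false) {
        return Err(Errno::Eperm); // existing group lives in a different session
    }
    if target_is_session_leader {
        return Err(Errno::Eperm);
    }
    Ok(())
}

fn main() {
    assert!(validate_setpgid(true, false, false, true, None, false).is_ok());
    assert_eq!(validate_setpgid(false, true, true, true, None, false), Err(Errno::Eacces));
    assert_eq!(validate_setpgid(true, false, false, false, None, false), Err(Errno::Eperm));
    assert_eq!(validate_setpgid(true, false, false, true, Some(false), false), Err(Errno::Eperm));
    assert_eq!(validate_setpgid(true, false, false, true, None, true), Err(Errno::Eperm));
}
```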

Procedure:
  1. Validate the preconditions above.
  2. If pgid refers to an existing ProcessGroup, add the target PID to its members.
  3. If pgid names no existing group, create a new ProcessGroup with pgid = requested PGID, inheriting the target's session, and insert it into PROCESS_GROUPS; add the target PID to its members.
  4. Remove the target PID from its previous ProcessGroup::members. If the old group is now empty, remove it from PROCESS_GROUPS and Session::process_groups.
  5. Update Process::pgid.

getpgid(pid) → Result<Pid, Errno>

Return the PGID of process pid. If pid is 0, return the caller's PGID. Returns ESRCH if pid does not exist.

setsid() → Result<Pid, Errno>

Create a new session with the caller as session leader.

Precondition: The caller must not already be a process group leader (EPERM if it is, because allowing it would create a session whose SID conflicts with an existing PGID in another session).

Procedure:
  1. Create a new Session with sid = caller's PID, leader = caller's PID, controlling_terminal = None.
  2. Create a new ProcessGroup with pgid = caller's PID, session = the new session.
  3. Add the new group to the new session's process_groups.
  4. Remove the caller from its old process group (see setpgid step 4).
  5. Update the caller's Process::pgid = caller's PID and Process::sid = caller's PID.
  6. Insert the session and group into SESSIONS and PROCESS_GROUPS.
  7. Return the new SID.

The caller is now isolated from its former controlling terminal: no controlling terminal is associated with the new session. The caller must open a terminal device and acquire it as the controlling terminal via TIOCSCTTY if required.
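The identifier updates can be modeled on a toy process record. A sketch under the assumption that group-leader status is simply pid == pgid; the struct and field names are illustrative:

```rust
#[derive(Debug, PartialEq)]
struct Proc { pid: u32, pgid: u32, sid: u32 }

/// Model of setsid()'s effect on the caller's identifiers.
/// Returns the new SID, or Err("EPERM") for a process group leader.
fn setsid(p: &mut Proc) -> Result<u32, &'static str> {
    if p.pid == p.pgid {
        return Err("EPERM"); // already a process group leader
    }
    p.pgid = p.pid; // new group with the caller as leader
    p.sid = p.pid;  // new session with the caller as leader
    Ok(p.sid)
}

fn main() {
    let mut p = Proc { pid: 100, pgid: 50, sid: 10 };
    assert_eq!(setsid(&mut p), Ok(100));
    assert_eq!(p, Proc { pid: 100, pgid: 100, sid: 100 });

    // A process group leader cannot call setsid().
    let mut leader = Proc { pid: 50, pgid: 50, sid: 10 };
    assert_eq!(setsid(&mut leader), Err("EPERM"));
}
```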

getsid(pid) → Result<Pid, Errno>

Return the SID of process pid. If pid is 0, return the caller's SID. Returns ESRCH if pid does not exist. Some implementations return EPERM if pid is in a different session; UmkaOS returns the SID unconditionally (matches Linux behavior).

tcsetpgrp(fd, pgid) → Result<(), Errno>

Set the foreground process group of the terminal referred to by fd to pgid. fd must refer to the controlling terminal of the calling process's session.

Preconditions:
  • fd must be an open file descriptor for a terminal (ENOTTY otherwise).
  • The terminal must be the controlling terminal of the calling process's session (ENOTTY otherwise).
  • The process group pgid must exist and be in the same session (EPERM if not).
  • If the calling process is not in the foreground group and SIGTTOU is neither blocked nor ignored, SIGTTOU is first sent to the calling process's group.

Procedure:
  1. Clear foreground on the current foreground process group (if any).
  2. Set foreground on the ProcessGroup with the given pgid.
  3. Update Tty::foreground_pgid (see Section 20.1).

tcgetpgrp(fd) → Result<Pid, Errno>

Return the PGID of the foreground process group of the terminal fd. Returns ENOTTY if fd is not a terminal or not the session's controlling terminal.

7.4.3 Job Control Signals

Job control signals mediate access between process groups and the controlling terminal. The TTY layer (Section 20.1) generates these signals in response to hardware events or process I/O attempts.

SIGTSTP (signal 20)

Sent to the foreground process group of the controlling terminal when the terminal's ISIG flag is set and the user types the SUSP character (typically Ctrl+Z, character code 26). All processes in the foreground group receive SIGTSTP simultaneously.

Default action: STOP. Processes may catch SIGTSTP to perform cleanup before stopping (e.g., restore terminal settings). A handler that catches SIGTSTP must still stop for real, typically by restoring the default disposition and re-raising SIGTSTP, or by raising SIGSTOP; otherwise the shell cannot detect the stop.

SIGTTIN (signal 21)

Sent to a process group when one of its members attempts to read() from the session's controlling terminal while the group is not the foreground process group of that terminal.

If the process group is orphaned (Section 7.4.4) or SIGTTIN is blocked/ignored in the reading process, read() returns EIO instead of stopping the process.
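The background-read decision reduces to a small function. A sketch; the enum and argument names are illustrative, not the kernel's TTY-layer API:

```rust
#[derive(Debug, PartialEq)]
enum ReadOutcome {
    Proceed, // caller is in the foreground group
    Stop,    // SIGTTIN is sent to the caller's process group
    Eio,     // read() fails with EIO instead of stopping the process
}

/// Decide what happens when a process read()s from its controlling terminal.
fn tty_read_check(in_foreground: bool, group_orphaned: bool,
                  sigttin_blocked_or_ignored: bool) -> ReadOutcome {
    if in_foreground {
        ReadOutcome::Proceed
    } else if group_orphaned || sigttin_blocked_or_ignored {
        ReadOutcome::Eio
    } else {
        ReadOutcome::Stop
    }
}

fn main() {
    assert_eq!(tty_read_check(true, false, false), ReadOutcome::Proceed);
    assert_eq!(tty_read_check(false, true, false), ReadOutcome::Eio);
    assert_eq!(tty_read_check(false, false, true), ReadOutcome::Eio);
    assert_eq!(tty_read_check(false, false, false), ReadOutcome::Stop);
}
```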

SIGTTOU (signal 22)

Sent to a process group when a background process attempts to write() to the controlling terminal, but only when the terminal's TOSTOP local-mode flag is set (via tcsetattr()). If TOSTOP is not set, background writes proceed without a signal.

Same orphan and block/ignore exceptions as SIGTTIN apply: EIO is returned instead of stopping an orphaned or signal-ignoring process.

SIGCONT (signal 18)

Resumes a stopped process group. Sent explicitly by the shell (via kill -CONT or the fg/bg builtins) or by the kernel as part of the SIGTSTP handling path.

When SIGCONT is delivered to a stopped process:
  1. All tasks in the process group transition from TaskState::STOPPED to TaskState::RUNNING.
  2. SIGCHLD is sent to the parent process (with si_code = CLD_CONTINUED) unless the parent has set SA_NOCLDSTOP.
  3. Any pending SIGSTOP or SIGTSTP for the same process is discarded (SIGCONT cancels pending stop signals within the same delivery cycle).

SIGCONT may be sent from any process that has permission to signal the target, not only the session leader or TTY.

Interaction with SIGKILL and SIGSTOP

SIGKILL terminates a stopped process without resuming it first. SIGSTOP can be sent to any process group regardless of foreground status; it bypasses the job control machinery and cannot be caught or ignored.

7.4.4 Orphaned Process Groups

A process group is orphaned (per POSIX.1-2017 definition) if it has no member process whose parent is in a different process group within the same session.

In other words, a group G in session S is orphaned when every process p in G has its parent either:
  • Outside session S entirely (e.g. p's original parent exited and p was reparented to init, which is in a different session), or
  • Also inside group G.

If any process in G has its parent in a different group G' ≠ G within the same session S, then G is non-orphaned and the parent group is responsible for signaling it.

Why orphaning matters for job control: An orphaned process group has no parent process in the session that could deliver SIGCONT to resume it. A stopped orphaned group would be permanently stuck. POSIX therefore requires the kernel to send SIGHUP followed immediately by SIGCONT to the orphaned group, giving processes a chance to handle the hangup or resume.

Detection Algorithm

The check runs in two situations:

  1. On exit(): when a process P exits, examine each process group G in the same session that contains a child of P. If G was non-orphaned because P belonged to a different group within the session, G becomes orphaned after P's exit, and any process in G is stopped, then deliver SIGHUP followed by SIGCONT to every process in G.

  2. On setpgid(): when a process leaves a group, perform the same check for any group that may now have become orphaned due to the membership change.

Implementation in do_exit():

fn check_orphaned_pgrps_on_exit(exiting: &Process) {
    let session = get_session(exiting.sid);
    for child_pid in exiting.children.iter() {
        let child = get_process(child_pid);
        let child_pgrp = get_pgroup(child.pgid);
        // Only examine groups in the same session.
        if child.sid != exiting.sid { continue; }
        // Only care about groups where the exiting process was the "anchor"
        // (i.e., exiting was in a *different* group within the session).
        if exiting.pgid == child.pgid { continue; }
        // After exit, is the group now orphaned?
        if is_orphaned(&child_pgrp, &session) {
            // If any member is stopped, deliver SIGHUP + SIGCONT.
            if group_has_stopped_member(&child_pgrp) {
                send_signal_to_pgrp(child.pgid, SIGHUP);
                send_signal_to_pgrp(child.pgid, SIGCONT);
            }
        }
    }
}

fn is_orphaned(pgrp: &ProcessGroup, session: &Session) -> bool {
    let members = pgrp.members.lock();
    for &pid in members.iter() {
        let proc = get_process(pid);
        if let Some(parent_pid) = proc.parent {
            let parent = get_process(parent_pid);
            // If parent is in the same session but a different group,
            // this group is NOT orphaned.
            if parent.sid == session.sid && parent.pgid != pgrp.pgid {
                return false;
            }
        }
    }
    true
}

A group with no stopped members that becomes orphaned is not signaled — there is no need, because it will not be permanently stuck. The SIGHUP+SIGCONT pair is sent only to prevent unrecoverable stop.

7.4.5 Controlling Terminal Association

A controlling terminal is a TTY device associated with a session. At most one terminal may be the controlling terminal for a given session; a given terminal may be the controlling terminal for at most one session.

Acquisition

A terminal becomes the controlling terminal of a session by one of two means:

  1. Implicit acquisition (non-POSIX extension, enabled by default on Linux and UmkaOS): When a session leader opens a terminal device that does not already have a controlling terminal, and the open flag O_NOCTTY is not set, the terminal is automatically assigned as the session's controlling terminal.

  2. Explicit acquisition via TIOCSCTTY ioctl: The session leader sends TIOCSCTTY (with argument 1 to steal from another session, or 0 for a non-stealing assignment) on an open terminal file descriptor. Only the session leader may issue TIOCSCTTY. If another session currently controls the terminal:
     • Argument 0: returns EPERM.
     • Argument 1 and caller has CAP_SYS_ADMIN: sends SIGHUP to the former controlling session's foreground process group, then transfers the terminal.
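The TIOCSCTTY decision table can be sketched as follows. The enum names are illustrative, and treating argument 1 without CAP_SYS_ADMIN as EPERM is an assumption not stated explicitly above:

```rust
#[derive(Debug, PartialEq)]
enum Ctty {
    Acquire,          // terminal was free; assign it to this session
    AcquireWithSteal, // SIGHUP the former session's foreground group, then transfer
    Eperm,
}

/// Outcome of TIOCSCTTY for a session leader, per the rules above.
/// `arg` is the ioctl argument (0 or 1).
fn tiocsctty(tty_owned_by_other_session: bool, arg: u64, cap_sys_admin: bool) -> Ctty {
    if !tty_owned_by_other_session {
        Ctty::Acquire
    } else if arg == 1 && cap_sys_admin {
        Ctty::AcquireWithSteal
    } else {
        Ctty::Eperm // arg 0, or arg 1 without CAP_SYS_ADMIN (assumed)
    }
}

fn main() {
    assert_eq!(tiocsctty(false, 0, false), Ctty::Acquire);
    assert_eq!(tiocsctty(true, 1, true), Ctty::AcquireWithSteal);
    assert_eq!(tiocsctty(true, 0, true), Ctty::Eperm);
    assert_eq!(tiocsctty(true, 1, false), Ctty::Eperm);
}
```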

Disassociation via setsid()

setsid() always disassociates the calling process from its current controlling terminal. After setsid():
  • The new session has no controlling terminal (Session::controlling_terminal = None).
  • The old session retains its controlling terminal unchanged.
  • The caller can no longer send or receive job control signals via the old terminal.

Disassociation on Terminal Hangup

When a controlling terminal is closed by its last opener (e.g., a modem hangs up, or the terminal emulator closes the PTY master):
  1. SIGHUP is sent to the session's foreground process group.
  2. SIGCONT is sent to the foreground process group (to resume stopped processes so they can handle SIGHUP).
  3. Session::controlling_terminal is set to None.
  4. The session no longer has a controlling terminal; processes that subsequently call tcgetpgrp() on the former terminal receive ENOTTY.

The TTY layer initiates this sequence via tty_hangup() (Section 20.1).

TIOCNOTTY ioctl

A process in the session (not necessarily the leader) may call ioctl(fd, TIOCNOTTY) to disassociate the calling process's session from its controlling terminal. This is the conventional (BSD-derived) way for a daemon to relinquish its controlling terminal after forking from a session leader. After TIOCNOTTY:
  • If the caller was the session leader: same effect as the hangup procedure above (SIGHUP + SIGCONT to the foreground group, terminal disassociated from the session).
  • If the caller was not the session leader: the call succeeds but has no effect on the session's controlling terminal (matches Linux behavior).


7.5 Resource Limits and Accounting

Resource limits (rlimit) and resource usage accounting (rusage) provide the POSIX-standard mechanism for constraining per-process resource consumption and for reporting how much of each resource a process has consumed. UmkaOS implements the full Linux rlimit/rusage interface with wire-compatible struct layouts, exact signal semantics, and /proc/PID/limits output format.

Internally UmkaOS improves on Linux in two respects:

  1. Lock-free accounting: all RusageAccum fields are AtomicU64, updated in the scheduler hot path without taking any lock (Linux uses task_lock() for some rusage paths).
  2. UID-level enforcement via atomics: RLIMIT_NPROC, RLIMIT_SIGPENDING, and RLIMIT_MSGQUEUE are enforced against per-UID atomic counters in the user namespace rather than scanning process lists.

7.5.1 Resource Limit Types

UmkaOS supports all 16 standard Linux resource limit types. The numeric values match Linux exactly so that getrlimit/setrlimit/prlimit64 wire calls are binary-compatible.

| Constant          | Value | Resource                   | Unit         | Enforcement point                    |
|-------------------|-------|----------------------------|--------------|--------------------------------------|
| RLIMIT_CPU        | 0     | CPU time                   | seconds      | Scheduler tick                       |
| RLIMIT_FSIZE      | 1     | Max file size              | bytes        | vfs_write()                          |
| RLIMIT_DATA       | 2     | Data segment size          | bytes        | brk() / mmap()                       |
| RLIMIT_STACK      | 3     | Stack size                 | bytes        | Page fault handler                   |
| RLIMIT_CORE       | 4     | Core dump size             | bytes        | Core dump path                       |
| RLIMIT_RSS        | 5     | Resident set size          | bytes        | Advisory; cgroup integration         |
| RLIMIT_NPROC      | 6     | Processes/threads per UID  | count        | do_fork()                            |
| RLIMIT_NOFILE     | 7     | Open file descriptors      | count        | alloc_fd()                           |
| RLIMIT_MEMLOCK    | 8     | Locked memory              | bytes        | mlock() / mmap(MAP_LOCKED)           |
| RLIMIT_AS         | 9     | Virtual address space      | bytes        | mmap(), mremap(), brk()              |
| RLIMIT_LOCKS      | 10    | File locks (obsolete)      | count        | Always RLIM_INFINITY                 |
| RLIMIT_SIGPENDING | 11    | Pending signals per UID    | count        | send_signal()                        |
| RLIMIT_MSGQUEUE   | 12    | POSIX MQ bytes per UID     | bytes        | MQ create/open                       |
| RLIMIT_NICE       | 13    | Maximum nice value         | value        | Scheduler (min nice = 20 - rlim_cur) |
| RLIMIT_RTPRIO     | 14    | Max RT scheduling priority | priority     | Scheduler                            |
| RLIMIT_RTTIME     | 15    | Max RT CPU time            | microseconds | RT scheduler tick                    |

Signal semantics for exceeded limits:

  • RLIMIT_CPU soft: SIGXCPU is delivered repeatedly (every second) after the soft limit is crossed. At the hard limit: SIGKILL is delivered (matches Linux behavior).
  • RLIMIT_FSIZE: SIGXFSZ is delivered and vfs_write() returns EFBIG. Both the signal and the error are returned, matching Linux exactly.
  • RLIMIT_RTTIME soft: SIGXCPU. Hard: SIGKILL.
  • RLIMIT_STACK: the page fault handler rejects stack growth beyond the limit; the task receives SIGSEGV (stack overflow).
  • All other limits: the syscall returns EAGAIN or ENOMEM as appropriate; no signal is delivered for non-CPU/non-file limits.

RLIMIT_LOCKS (value 10) is present for ABI completeness but is always RLIM_INFINITY in UmkaOS. The Linux file lock limit was never meaningfully enforced in modern kernels and UmkaOS does not implement it.

RLIM_INFINITY on the wire is u64::MAX (0xffff_ffff_ffff_ffff), matching Linux.

7.5.2 Wire Format and Syscalls

struct rlimit wire layout

/* Wire format — binary-identical to Linux struct rlimit64 */
pub struct RlimitWire {
    pub rlim_cur: u64,  // soft limit (RLIM_INFINITY = u64::MAX)
    pub rlim_max: u64,  // hard limit (RLIM_INFINITY = u64::MAX)
}

The 32-bit struct rlimit (with unsigned long fields) is handled by the getrlimit/setrlimit compat path; the 64-bit form is used by prlimit64 and internally.
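The compat conversion can be sketched as follows; the constant name and the clamp-on-return behavior for values unrepresentable in 32 bits are assumptions about the compat path, not confirmed details:

```rust
/// 32-bit RLIM_INFINITY as seen in the legacy struct rlimit (assumed).
const RLIM32_INFINITY: u32 = u32::MAX;

/// Widen a 32-bit limit from the compat path to the internal u64 form.
fn rlim_from_compat(v: u32) -> u64 {
    if v == RLIM32_INFINITY { u64::MAX } else { v as u64 }
}

/// Narrow an internal u64 limit for a 32-bit getrlimit caller, clamping
/// unrepresentable values to the 32-bit infinity (assumed behavior).
fn rlim_to_compat(v: u64) -> u32 {
    if v >= RLIM32_INFINITY as u64 { RLIM32_INFINITY } else { v as u32 }
}

fn main() {
    assert_eq!(rlim_from_compat(u32::MAX), u64::MAX);
    assert_eq!(rlim_from_compat(4096), 4096);
    assert_eq!(rlim_to_compat(u64::MAX), u32::MAX);
    assert_eq!(rlim_to_compat(5_000_000_000), u32::MAX); // clamped
    assert_eq!(rlim_to_compat(1024), 1024);
}
```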

Syscalls

getrlimit(resource: u32, rlim: *mut RlimitWire) -> Result<(), Errno>

Returns the soft and hard limits for resource from the calling process's RlimitSet. No lock is taken on the read path: rlimit_lock is held only during writes. Each u64 field is naturally aligned, so an individual load cannot tear on any supported architecture. A read that races with setrlimit may observe a soft value from one update and a hard value from another; UmkaOS accepts this as benign because each field is independently valid.

Returns EINVAL if resource >= 16.

setrlimit(resource: u32, rlim: *const RlimitWire) -> Result<(), Errno>

Sets the soft and hard limits for resource. Rules:

  1. The soft limit must not exceed the hard limit.
  2. A non-privileged process (!CAP_SYS_RESOURCE) may only lower the hard limit, not raise it. A lowered hard limit is irreversible.
  3. A process with CAP_SYS_RESOURCE may raise both limits up to the system maximum (/proc/sys/fs/nr_open for RLIMIT_NOFILE, etc.).
  4. RLIMIT_NOFILE hard limit may not exceed NR_OPEN (1,048,576 by default, tunable).
  5. RLIMIT_NPROC hard limit may not exceed PID_MAX_LIMIT (4,194,304).

Returns EINVAL for invalid resource, bad limit ordering, or out-of-range values. Returns EPERM for privilege violations.
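Rules 1–5 reduce to a short validation function. A sketch; `system_max` stands in for the per-resource ceiling (e.g. NR_OPEN for RLIMIT_NOFILE), and the names are illustrative:

```rust
#[derive(Debug, PartialEq)]
enum Errno { Einval, Eperm }

/// Validate a setrlimit() request against the rules above.
fn validate_setrlimit(new_soft: u64, new_hard: u64, old_hard: u64,
                      cap_sys_resource: bool, system_max: u64) -> Result<(), Errno> {
    if new_soft > new_hard {
        return Err(Errno::Einval);  // rule 1: soft must not exceed hard
    }
    if new_hard > old_hard && !cap_sys_resource {
        return Err(Errno::Eperm);   // rule 2: unprivileged may only lower the hard limit
    }
    if new_hard > system_max {
        return Err(Errno::Einval);  // rules 4-5: per-resource system ceiling
    }
    Ok(())
}

fn main() {
    assert_eq!(validate_setrlimit(10, 5, 100, false, 1000), Err(Errno::Einval));
    assert_eq!(validate_setrlimit(10, 200, 100, false, 1000), Err(Errno::Eperm));
    assert!(validate_setrlimit(10, 200, 100, true, 1000).is_ok()); // rule 3
    assert_eq!(validate_setrlimit(10, 2000, 100, true, 1000), Err(Errno::Einval));
    assert!(validate_setrlimit(5, 50, 100, false, 1000).is_ok());  // lowering
}
```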

prlimit64(pid: pid_t, resource: u32, new_limit: *const RlimitWire, old_limit: *mut RlimitWire) -> Result<(), Errno>

The prlimit64 syscall (Linux 2.6.36+, x86-64 syscall number 302) extends getrlimit/setrlimit with a pid argument for operating on another process.

  • pid == 0: operates on the calling process (identical to getrlimit/setrlimit).
  • pid != 0: the caller must either have CAP_SYS_PTRACE with PTRACE_MODE_ATTACH_REALCREDS, or the caller's real/effective UID must match the target process's real/saved UID and the caller must not be cross-namespace.
  • new_limit == NULL: read-only (equivalent to getrlimit on the target).
  • old_limit == NULL: write-only (equivalent to setrlimit on the target).
  • Both new_limit and old_limit non-NULL: atomic read-then-write under rlimit_lock.

Returns ESRCH if pid does not name a live process. Returns EPERM if the credential check fails.

getrusage(who: i32, rusage: *mut RusageWire) -> Result<(), Errno>

Returns accumulated resource usage. who values:

| Constant        | Value | Meaning                                    |
|-----------------|-------|--------------------------------------------|
| RUSAGE_SELF     | 0     | Current process (sum of all live threads)  |
| RUSAGE_CHILDREN | -1    | Sum of all waited-for (reaped) children    |
| RUSAGE_THREAD   | 1     | Current thread only (Linux extension)      |
| RUSAGE_BOTH     | -2    | Self + children (used internally by wait4) |

RUSAGE_BOTH is not a valid who value from user space; getrusage returns EINVAL for it. It is used internally by the wait4/waitid exit path to atomically collect and add child usage to the parent's children_rusage accumulator.

times(buf: *mut Tms) -> Result<clock_t, Errno>

Legacy POSIX interface. Returns elapsed real time (in clock ticks since an arbitrary epoch) and fills buf with:

pub struct Tms {
    pub tms_utime: clock_t,   // user time of calling process, in USER_HZ ticks
    pub tms_stime: clock_t,   // system time of calling process, in USER_HZ ticks
    pub tms_cutime: clock_t,  // user time of waited-for children, in USER_HZ ticks
    pub tms_cstime: clock_t,  // system time of waited-for children, in USER_HZ ticks
}

USER_HZ = 100. Values are derived from RusageAccum.utime_ns / RusageAccum.stime_ns divided by 1_000_000_000 / USER_HZ = 10_000_000.
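The nanosecond-to-tick conversion is a single integer division:

```rust
const USER_HZ: u64 = 100;
const NS_PER_TICK: u64 = 1_000_000_000 / USER_HZ; // 10_000_000 ns per tick

/// Convert an accumulated nanosecond counter to USER_HZ clock ticks,
/// as times() reports them in struct Tms.
fn ns_to_ticks(ns: u64) -> u64 {
    ns / NS_PER_TICK
}

fn main() {
    assert_eq!(ns_to_ticks(10_000_000), 1);       // exactly one tick
    assert_eq!(ns_to_ticks(2_500_000_000), 250);  // 2.5 s of CPU time
    assert_eq!(ns_to_ticks(9_999_999), 0);        // sub-tick time truncates
}
```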

7.5.3 Internal Structures

RlimitSet — per-process limit storage

/// Per-process resource limit set.
///
/// Stored in `Process`. Inherited verbatim on `fork()`. Not modified by `exec()`.
/// Protected by `Process.rlimit_lock` for write access; read access is unsynchronized
/// (see Section 7.5.2 for correctness argument).
pub struct RlimitSet {
    /// Indexed by RLIMIT_* constant (0–15).
    pub limits: [RlimitPair; 16],
}

/// A single soft/hard limit pair.
pub struct RlimitPair {
    /// Soft limit. RLIM_INFINITY = u64::MAX.
    pub soft: u64,
    /// Hard limit. RLIM_INFINITY = u64::MAX. Invariant: soft <= hard.
    pub hard: u64,
}

Process is extended with the following fields (additions to the struct shown in Section 7.1.1):

pub struct Process {
    // ... existing fields ...

    /// Resource limits for this process (Section 7.5).
    pub rlimits: RlimitSet,
    /// Held only during setrlimit / prlimit64 writes.
    pub rlimit_lock: Mutex<()>,
    /// Locked memory byte count for RLIMIT_MEMLOCK enforcement.
    pub locked_pages: AtomicU64,
    /// Accumulated resource usage of waited-for children (updated on wait).
    pub children_rusage: RusageAccum,
}

RusageAccum — per-task accounting

/// Lock-free resource usage accumulator, one per Task (thread).
///
/// All fields are AtomicU64 updated in the scheduler hot path and page fault handler
/// without taking any lock. Process-level totals are computed by summing across all
/// live threads plus the per-process children_rusage accumulator.
pub struct RusageAccum {
    /// Accumulated user-mode CPU time in nanoseconds.
    pub utime_ns: AtomicU64,
    /// Accumulated kernel-mode CPU time in nanoseconds.
    pub stime_ns: AtomicU64,
    /// Minor (non-I/O) page faults.
    pub minflt: AtomicU64,
    /// Major (I/O-requiring) page faults.
    pub majflt: AtomicU64,
    /// Voluntary context switches (task called schedule() explicitly).
    pub nvcsw: AtomicU64,
    /// Involuntary context switches (preempted by scheduler).
    pub nivcsw: AtomicU64,
    /// Block device read operations (incremented by the block layer).
    pub inblock: AtomicU64,
    /// Block device write operations (incremented by the block layer).
    pub oublock: AtomicU64,
    /// Peak RSS in kilobytes. Updated with fetch_max() when RSS grows.
    pub peak_rss_kb: AtomicU64,
}

Task is extended with:

pub struct Task {
    // ... existing fields ...

    /// Per-thread resource usage accumulator (Section 7.5).
    pub rusage: RusageAccum,
    /// Tick at which last SIGXCPU was delivered. 0 = never delivered.
    /// Used to ensure SIGXCPU is sent at most once per second (POSIX requirement).
    pub last_sigxcpu_tick: u64,
}

Update discipline

  • utime_ns / stime_ns: updated at every context switch. The scheduler records the timestamp at switch-out (Task.last_switched_in: Instant), computes elapsed time, and adds it to either utime_ns (if the task was in user mode at switch-out) or stime_ns (if in kernel mode). Addition uses fetch_add with Relaxed ordering; the value is only read by getrusage, which is not on a critical path.
  • minflt / majflt: incremented in the page fault handler with fetch_add(1, Relaxed).
  • nvcsw / nivcsw: incremented in schedule(). Voluntary switches come from explicit schedule() calls (blocked on I/O, futex, etc.); involuntary from timer preemption or yield_to_scheduler().
  • inblock / oublock: incremented by the block I/O completion path (Section 14.3) after each read/write bio completes.
  • peak_rss_kb: updated in the page fault handler when a new page is faulted in and the resulting RSS (in KB) exceeds the current peak. Uses fetch_max(new_rss_kb, Relaxed).
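The fetch_add / fetch_max discipline maps directly onto standard atomics. A user-space sketch of the two update patterns on a trimmed-down accumulator (only two fields shown):

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

struct RusageAccum {
    utime_ns: AtomicU64,
    peak_rss_kb: AtomicU64,
}

impl RusageAccum {
    /// Charge elapsed user time at context switch-out.
    fn charge_utime(&self, elapsed_ns: u64) {
        self.utime_ns.fetch_add(elapsed_ns, Relaxed);
    }
    /// Record an RSS high-water mark; fetch_max never lowers the peak.
    fn note_rss(&self, rss_kb: u64) {
        self.peak_rss_kb.fetch_max(rss_kb, Relaxed);
    }
}

fn main() {
    let a = RusageAccum {
        utime_ns: AtomicU64::new(0),
        peak_rss_kb: AtomicU64::new(0),
    };
    a.charge_utime(5_000);
    a.charge_utime(7_000);
    a.note_rss(128);
    a.note_rss(64); // lower RSS does not move the peak
    assert_eq!(a.utime_ns.load(Relaxed), 12_000);
    assert_eq!(a.peak_rss_kb.load(Relaxed), 128);
}
```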

7.5.4 Enforcement Points

Each limit is enforced at a specific point in the kernel. This section documents the exact call site, the check performed, and the consequence of exceeding the limit.

RLIMIT_NOFILE — file descriptor count

Checked in alloc_fd() before a slot is assigned in the process FdTable. The check is performed without holding rlimit_lock:

fn alloc_fd(process: &Process) -> Result<Fd, Errno> {
    let soft = process.rlimits.limits[RLIMIT_NOFILE].soft;
    let current_count = process.fd_table.count.load(Relaxed);
    if current_count >= soft {
        return Err(Errno::EMFILE);
    }
    // ... allocate slot ...
}

The FdTable.count atomic is the authoritative open-fd count. Reading rlimits.soft and fd_table.count separately (without a combined lock) can permit a small race window during concurrent open() calls; UmkaOS accepts this because Linux has the same behavior and the practical consequence is that the limit may be exceeded by at most (number of concurrent open() calls - 1) descriptors, which is bounded and harmless.

RLIMIT_NPROC — process/thread count per UID

Checked in do_fork() before a new task is created. The count is maintained in the UserEntry for the calling task's UID:

fn do_fork(parent: &Task, flags: CloneFlags) -> Result<Arc<Task>, Errno> {
    let uid = parent.process.cred.uid;
    let user_entry = parent.process.namespaces.user_ns.get_user_entry(uid);
    let count = user_entry.task_count.fetch_add(1, AcqRel);
    let soft = parent.process.rlimits.limits[RLIMIT_NPROC].soft;
    if count >= soft && !parent.has_capability(CAP_SYS_ADMIN)
                     && !parent.has_capability(CAP_SYS_RESOURCE) {
        user_entry.task_count.fetch_sub(1, Relaxed);
        return Err(Errno::EAGAIN);
    }
    // ... create task ...
    // On failure after this point, decrement task_count before returning.
}

task_count is decremented in do_exit() as each task exits; the final decrement for a process happens when the last thread of the thread group calls do_exit().

RLIMIT_AS — virtual address space

Checked in mmap(), mremap(), and brk() before committing a new VMA:

The AddressSpace struct maintains a running total of all mapped VMA sizes for O(1) limit checking:

pub struct AddressSpace {
    // ... other fields ...
    /// Running total of all mapped VMA sizes in bytes.
    /// Updated atomically on every mmap(), munmap(), and mremap().
    /// Enables O(1) RLIMIT_AS checks without walking the VMA tree.
    pub vm_total_bytes: AtomicUsize,
}

The RLIMIT_AS check is O(1): compare addr_space.vm_total_bytes.load(Acquire) + new_size against task.rlimit[RLIMIT_AS]. No VMA walk required. vm_total_bytes is updated atomically (with fetch_add/fetch_sub) on every mmap(), munmap(), and mremap().

fn check_rlimit_as(process: &Process, add_bytes: usize) -> Result<(), Errno> {
    let soft = process.rlimits.limits[RLIMIT_AS].soft;
    if soft == u64::MAX {
        return Ok(());
    }
    let current_as = process.address_space.vm_total_bytes.load(Acquire);
    if current_as + add_bytes > soft as usize {
        return Err(Errno::ENOMEM);
    }
    Ok(())
}

RLIMIT_MEMLOCK — locked memory

mlock() and mmap(MAP_LOCKED) increment Process.locked_pages atomically. The check and update are:

fn check_and_add_locked(process: &Process, add_bytes: u64) -> Result<(), Errno> {
    let soft = process.rlimits.limits[RLIMIT_MEMLOCK].soft;
    if soft == u64::MAX {
        return Ok(());
    }
    let prev = process.locked_pages.fetch_add(add_bytes, AcqRel);
    if prev + add_bytes > soft {
        process.locked_pages.fetch_sub(add_bytes, Relaxed);
        return Err(Errno::ENOMEM);
    }
    Ok(())
}

munlock() and mmap(MAP_LOCKED) region unmaps decrement locked_pages by the region size. locked_pages never goes negative; the decrement is guarded by a debug assertion.

RLIMIT_SIGPENDING — pending signal count per UID

Checked in send_signal() for real-time signals (signals 34–64) and for sigqueue(). Standard signals (1–31) are not subject to this limit because they are not queued per occurrence. The check uses the UserEntry.sigpending_count atomic:

fn queue_rt_signal(target_uid: Uid, ns: &UserNamespace, ...) -> Result<(), Errno> {
    let user = ns.get_user_entry(target_uid);
    let soft = /* resolved from target process's RlimitSet[RLIMIT_SIGPENDING] */;
    let prev = user.sigpending_count.fetch_add(1, AcqRel);
    if prev >= soft {
        user.sigpending_count.fetch_sub(1, Relaxed);
        return Err(Errno::EAGAIN);
    }
    // enqueue signal ...
}

sigpending_count is decremented when the signal is consumed by the target task in dequeue_signal().

RLIMIT_MSGQUEUE — POSIX MQ bytes per UID

Tracked in UserNamespace.users per UID via UserEntry.mq_bytes: AtomicU64. Checked when a new message queue is created or when a message is sent that would increase the total byte count. The limit applies to the creator's UID:

fn mq_check_and_add(uid: Uid, ns: &UserNamespace,
                    add_bytes: u64, soft: u64) -> Result<(), Errno> {
    let user = ns.get_user_entry(uid);
    let prev = user.mq_bytes.fetch_add(add_bytes, AcqRel);
    if prev + add_bytes > soft {
        user.mq_bytes.fetch_sub(add_bytes, Relaxed);
        return Err(Errno::EMFILE); // matches Linux (EMFILE for mq_open)
    }
    Ok(())
}

mq_bytes is decremented when a message is received (dequeued) or when a queue is unlinked and drained.

RLIMIT_CPU — CPU time

Checked at every scheduler tick in the per-CPU run loop. The check compares accumulated CPU time (in seconds) against the soft and hard limits:

fn tick_check_rlimit_cpu(task: &mut Task) {
    let cpu_secs = (task.rusage.utime_ns.load(Relaxed)
                  + task.rusage.stime_ns.load(Relaxed)) / 1_000_000_000;
    let soft = task.process.rlimits.limits[RLIMIT_CPU].soft;
    let hard = task.process.rlimits.limits[RLIMIT_CPU].hard;
    if soft != u64::MAX && cpu_secs >= soft {
        // SIGXCPU is delivered at most once per second after soft limit crossing
        // (POSIX/Linux behavior). `last_sigxcpu_tick` prevents signal flooding
        // on every timer tick.
        let now = current_tick();
        if now.saturating_sub(task.last_sigxcpu_tick) >= TICKS_PER_SECOND {
            send_signal_to_process(&task.process, SIGXCPU);
            task.last_sigxcpu_tick = now;
        }
    }
    if hard != u64::MAX && cpu_secs >= hard {
        send_signal_to_process(&task.process, SIGKILL);
    }
}

SIGXCPU is delivered repeatedly (once per second after the soft limit is crossed) until the process responds (catches the signal and reduces CPU usage, or is killed by reaching the hard limit). This matches Linux behavior.

RLIMIT_RTTIME — RT CPU time

Checked in the RT scheduler tick path for tasks with SchedPolicy::Fifo or SchedPolicy::RoundRobin. The rt_runtime_us field in SchedEntity accumulates real-time CPU time in microseconds. At each RT tick:

fn rt_tick_check_rlimit(task: &Task) {
    let rt_us = task.sched_entity.rt_runtime_us.load(Relaxed);
    let soft = task.process.rlimits.limits[RLIMIT_RTTIME].soft;
    let hard = task.process.rlimits.limits[RLIMIT_RTTIME].hard;
    if soft != u64::MAX && rt_us >= soft {
        send_signal_to_process(&task.process, SIGXCPU);
    }
    if hard != u64::MAX && rt_us >= hard {
        send_signal_to_process(&task.process, SIGKILL);
    }
}

rt_runtime_us resets to zero when the task transitions out of an RT scheduling class (e.g., via sched_setscheduler() to SCHED_OTHER).

RLIMIT_FSIZE — maximum file size

Checked in vfs_write() before each write that would extend a file beyond its current size. When the file size after the write would exceed the soft limit:

  1. SIGXFSZ is delivered to the calling process.
  2. vfs_write() returns EFBIG.

Both the signal and the error are returned, matching Linux exactly. The hard limit is treated the same as the soft limit (there is no separate behavior at the hard limit for file size — SIGXFSZ is delivered regardless).
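The decision logic for the two steps above can be sketched as a pure check; this is a userland model of the test only (signal delivery and the kernel's `Errno` type are elided, and `check_fsize` is a hypothetical name):

```rust
const RLIM_INFINITY: u64 = u64::MAX;

#[derive(Debug, PartialEq)]
enum WriteCheck {
    /// Write may proceed.
    Ok,
    /// Caller must deliver SIGXFSZ to the process and return EFBIG.
    TooBig,
}

/// Only writes that would *extend* the file are subject to RLIMIT_FSIZE;
/// overwriting existing bytes is never limited.
fn check_fsize(offset: u64, len: u64, file_size: u64, soft: u64) -> WriteCheck {
    let end = offset.saturating_add(len);
    if end > file_size && soft != RLIM_INFINITY && end > soft {
        return WriteCheck::TooBig;
    }
    WriteCheck::Ok
}
```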

RLIMIT_STACK — stack size

The page fault handler checks RLIMIT_STACK when expanding the stack region downward (the stack grows down on all supported UmkaOS targets). If growing the stack to cover the faulting address would push its size past the soft limit:

  1. The fault is not satisfied (page is not mapped).
  2. The task receives SIGSEGV.

This is the only limit enforced by the fault handler rather than a syscall entry point. The stack VMA is annotated with VmaFlags::STACK so the fault handler can identify it.
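The size test performed by the fault handler reduces to a simple comparison; a userland sketch under the stated assumptions (stack grows down, `stack_growth_ok` is an illustrative name):

```rust
/// Would extending the stack VMA down to cover `fault_addr` keep the stack
/// within RLIMIT_STACK? `stack_top` is the fixed upper end of the stack VMA.
/// A soft limit of u64::MAX means unlimited.
fn stack_growth_ok(fault_addr: u64, stack_top: u64, soft: u64) -> bool {
    let new_size = stack_top - fault_addr; // stack grows down
    soft == u64::MAX || new_size <= soft
}
```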

RLIMIT_DATA — data segment

Checked in brk() when expanding the heap:

fn do_brk(process: &Process, new_brk: usize) -> Result<usize, Errno> {
    let soft = process.rlimits.limits[RLIMIT_DATA].soft;
    if soft != u64::MAX {
        let data_size = (new_brk - process.address_space.start_data) as u64;
        if data_size > soft {
            return Err(Errno::ENOMEM);
        }
    }
    // ... extend heap VMA ...
}

RLIMIT_DATA is also checked in mmap(MAP_ANONYMOUS | MAP_PRIVATE) when the mapping is created in the data/heap region (below the stack, above the text segment). It is not checked for file-backed mappings or shared mappings.

7.5.5 Inheritance Across fork() and exec()

Fork: the child inherits an exact byte-for-byte copy of the parent's RlimitSet. No limits are reset or modified. Both the rlimits array and the locked_pages counter start as copies of the parent's values. The child gets its own rlimit_lock.

Thread creation (clone with CLONE_VM): threads of the same process share the process RlimitSet directly (all threads reference the same Process struct via task.process). There is no per-thread limit set; all threads in a process share one RlimitSet and one rlimit_lock.

Exec: execve() does not modify any resource limits. Limits survive across exec. This matches POSIX and Linux.

Exception — stack limit on exec: if the program's initial stack size (as determined by the ELF loader and the kernel's stack setup) would exceed RLIMIT_STACK, the exec fails with ENOMEM before the new address space is committed. This check occurs after the new address space is built but before the old address space is torn down, so a failed exec leaves the calling process intact.

UID counter on fork: UserEntry.task_count is incremented in do_fork() (for the new task) and decremented in do_exit() (after the last thread of the process exits). If the process changes its real UID between fork and exit (e.g., via setuid() or exec of a set-user-ID binary), the decrement targets the UserEntry for the UID in effect when do_exit() runs, not the UID at fork time.

7.5.6 UID-Level Accounting

Several limits (RLIMIT_NPROC, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE) are enforced per-UID rather than per-process. The counters live in UserEntry objects stored in the user namespace:

/// Per-UID accounting entry within a user namespace.
pub struct UserEntry {
    /// Number of live tasks (threads) whose real UID matches this entry.
    /// Incremented in do_fork(), decremented in do_exit().
    /// u64 to correctly compare against RLIMIT values which are u64; u32
    /// would silently pass the limit check if soft > 4 billion.
    pub task_count: AtomicU64,
    /// Number of queued real-time signals for tasks with this real UID.
    /// Incremented in queue_rt_signal(), decremented in dequeue_signal().
    /// u64 to correctly compare against RLIMIT_SIGPENDING which is u64.
    pub sigpending_count: AtomicU64,
    /// Total bytes allocated for POSIX message queues owned by this UID.
    /// Incremented on mq_open/msgsnd, decremented on mq_unlink/msgrcv.
    pub mq_bytes: AtomicU64,
}

UserEntry objects are stored in UserNamespace.users: RcuHashMap<Uid, Arc<UserEntry>>. Lookup is O(1) average under RCU read lock (no blocking). New entries are created on first task creation for a UID and are removed when task_count reaches zero and all associated resources are released.

The RcuHashMap uses the UmkaOS RCU implementation (Section 3.1) for reads: readers acquire an RcuReadGuard (a single CpuLocal flag write, ~1-3 cycles), look up the entry, clone the Arc, and release the guard. Writers take a per-map mutex, insert or remove, publish the new table under RCU, then wait for a grace period before freeing the old table.
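The lookup-or-create-on-first-use contract can be modeled in user space. The sketch below substitutes a std RwLock for the kernel's RcuHashMap (so it models the contract, not the lock-free reader mechanics), and the `Users` wrapper is an illustrative name:

```rust
use std::collections::HashMap;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, RwLock};

type Uid = u32;

/// Per-UID accounting entry, mirroring the UserEntry fields from the text.
#[derive(Default)]
struct UserEntry {
    task_count: AtomicU64,
    sigpending_count: AtomicU64,
    mq_bytes: AtomicU64,
}

/// Stand-in for UserNamespace.users. In the kernel this is an RcuHashMap
/// with lock-free readers; here a RwLock models the same semantics:
/// fast-path lookup, slow-path insert on first use for a UID.
struct Users {
    map: RwLock<HashMap<Uid, Arc<UserEntry>>>,
}

impl Users {
    fn get_user_entry(&self, uid: Uid) -> Arc<UserEntry> {
        // Fast path: entry already exists; clone the Arc and return.
        if let Some(e) = self.map.read().unwrap().get(&uid) {
            return Arc::clone(e);
        }
        // Slow path: first task for this UID creates the entry.
        let mut w = self.map.write().unwrap();
        Arc::clone(w.entry(uid).or_insert_with(|| Arc::new(UserEntry::default())))
    }
}
```

All callers for the same UID see the same shared entry, so the atomic counters aggregate correctly across processes.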

7.5.7 getrusage Wire Format

The rusage struct written to user space is binary-identical to the Linux struct rusage:

/// Wire format for getrusage(2). Binary-identical to Linux struct rusage.
pub struct RusageWire {
    /// User CPU time.
    pub ru_utime: Timeval,
    /// System CPU time.
    pub ru_stime: Timeval,
    /// Peak RSS in kilobytes (max over the process lifetime).
    pub ru_maxrss: i64,
    /// Integral shared memory size. Not tracked by UmkaOS; always 0.
    pub ru_ixrss: i64,
    /// Integral unshared data size. Not tracked by UmkaOS; always 0.
    pub ru_idrss: i64,
    /// Integral unshared stack size. Not tracked by UmkaOS; always 0.
    pub ru_isrss: i64,
    /// Minor (non-I/O) page faults.
    pub ru_minflt: i64,
    /// Major (I/O-requiring) page faults.
    pub ru_majflt: i64,
    /// Number of times the process was swapped out. Not tracked; always 0.
    pub ru_nswap: i64,
    /// Block input operations (reads from block devices).
    pub ru_inblock: i64,
    /// Block output operations (writes to block devices).
    pub ru_oublock: i64,
    /// IPC messages sent. Not tracked by UmkaOS; always 0.
    pub ru_msgsnd: i64,
    /// IPC messages received. Not tracked by UmkaOS; always 0.
    pub ru_msgrcv: i64,
    /// Signals received. Not tracked by UmkaOS; always 0.
    pub ru_nsignals: i64,
    /// Voluntary context switches.
    pub ru_nvcsw: i64,
    /// Involuntary context switches.
    pub ru_nivcsw: i64,
}

pub struct Timeval {
    pub tv_sec: i64,
    pub tv_usec: i64,
}

Fields marked "always 0" correspond to metrics that Linux also does not reliably populate (ru_ixrss, ru_idrss, ru_isrss, ru_nswap, ru_msgsnd, ru_msgrcv, ru_nsignals). Returning zero for these fields matches Linux behavior and is correct for RUSAGE_SELF, RUSAGE_CHILDREN, and RUSAGE_THREAD.

Assembly from RusageAccum:

fn fill_rusage_wire(accum: &RusageAccum, out: &mut RusageWire) {
    let utime_us = accum.utime_ns.load(Relaxed) / 1_000;
    let stime_us = accum.stime_ns.load(Relaxed) / 1_000;
    out.ru_utime = Timeval { tv_sec: (utime_us / 1_000_000) as i64,
                             tv_usec: (utime_us % 1_000_000) as i64 };
    out.ru_stime = Timeval { tv_sec: (stime_us / 1_000_000) as i64,
                             tv_usec: (stime_us % 1_000_000) as i64 };
    out.ru_maxrss  = accum.peak_rss_kb.load(Relaxed) as i64;
    out.ru_minflt  = accum.minflt.load(Relaxed) as i64;
    out.ru_majflt  = accum.majflt.load(Relaxed) as i64;
    out.ru_inblock = accum.inblock.load(Relaxed) as i64;
    out.ru_oublock = accum.oublock.load(Relaxed) as i64;
    out.ru_nvcsw   = accum.nvcsw.load(Relaxed) as i64;
    out.ru_nivcsw  = accum.nivcsw.load(Relaxed) as i64;
    // Unused fields already zero-initialized.
}

For RUSAGE_SELF, the kernel iterates all live threads in the process thread group, sums their RusageAccum fields into a temporary accumulator, then calls fill_rusage_wire. For RUSAGE_CHILDREN, it reads Process.children_rusage directly (already a summed accumulator updated atomically on each wait()). For RUSAGE_THREAD, it uses only the calling thread's Task.rusage.
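The RUSAGE_SELF summation and the nanosecond-to-timeval conversion above can be sketched as follows (a userland model; `ThreadTimes` is an illustrative stand-in for the utime/stime part of RusageAccum):

```rust
/// Per-thread CPU time in nanoseconds, as accumulated in each Task.rusage.
#[derive(Default, Clone, Copy)]
struct ThreadTimes {
    utime_ns: u64,
    stime_ns: u64,
}

/// RUSAGE_SELF path: sum every live thread's accumulator into one temporary
/// before conversion to the wire format.
fn sum_thread_group(threads: &[ThreadTimes]) -> ThreadTimes {
    threads.iter().fold(ThreadTimes::default(), |acc, t| ThreadTimes {
        utime_ns: acc.utime_ns + t.utime_ns,
        stime_ns: acc.stime_ns + t.stime_ns,
    })
}

/// Nanoseconds -> (tv_sec, tv_usec), matching fill_rusage_wire's arithmetic.
fn ns_to_timeval(ns: u64) -> (i64, i64) {
    let us = ns / 1_000;
    ((us / 1_000_000) as i64, (us % 1_000_000) as i64)
}
```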

7.5.8 /proc/PID/limits Format

UmkaOS generates /proc/PID/limits in the exact format that Linux uses, enabling unmodified tools (bash ulimit -a, prlimit(1), container runtimes) to read it.

Format specification:

  • Header line (fixed width): "Limit                     Soft Limit           Hard Limit           Units"
  • One data line per resource (in ascending RLIMIT order, 0 through 15).
  • Column layout (left-aligned fixed-width fields):
      • Limit column: 26 characters
      • Soft Limit column: 21 characters (value or "unlimited")
      • Hard Limit column: 21 characters (value or "unlimited")
      • Units column: remainder

Example output:

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             31672                62193                processes
Max open files            1024                 1048576              files
Max locked memory         67108864             67108864             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       31672                31672                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

The display names, ordering, and units strings must match Linux exactly. The umkafs virtual filesystem handler for /proc/PID/limits (Section 19.1) reads the process RlimitSet under rlimit_lock (taking a consistent snapshot of all 16 soft/hard pairs) and formats the output using a static table:

static RLIMIT_DISPLAY: [RlimitDisplay; 16] = [
    RlimitDisplay { name: "Max cpu time",          unit: "seconds",   resource: RLIMIT_CPU },
    RlimitDisplay { name: "Max file size",         unit: "bytes",     resource: RLIMIT_FSIZE },
    RlimitDisplay { name: "Max data size",         unit: "bytes",     resource: RLIMIT_DATA },
    RlimitDisplay { name: "Max stack size",        unit: "bytes",     resource: RLIMIT_STACK },
    RlimitDisplay { name: "Max core file size",    unit: "bytes",     resource: RLIMIT_CORE },
    RlimitDisplay { name: "Max resident set",      unit: "bytes",     resource: RLIMIT_RSS },
    RlimitDisplay { name: "Max processes",         unit: "processes", resource: RLIMIT_NPROC },
    RlimitDisplay { name: "Max open files",        unit: "files",     resource: RLIMIT_NOFILE },
    RlimitDisplay { name: "Max locked memory",     unit: "bytes",     resource: RLIMIT_MEMLOCK },
    RlimitDisplay { name: "Max address space",     unit: "bytes",     resource: RLIMIT_AS },
    RlimitDisplay { name: "Max file locks",        unit: "locks",     resource: RLIMIT_LOCKS },
    RlimitDisplay { name: "Max pending signals",   unit: "signals",   resource: RLIMIT_SIGPENDING },
    RlimitDisplay { name: "Max msgqueue size",     unit: "bytes",     resource: RLIMIT_MSGQUEUE },
    RlimitDisplay { name: "Max nice priority",     unit: "",          resource: RLIMIT_NICE },
    RlimitDisplay { name: "Max realtime priority", unit: "",          resource: RLIMIT_RTPRIO },
    RlimitDisplay { name: "Max realtime timeout",  unit: "us",        resource: RLIMIT_RTTIME },
];

Values of u64::MAX are rendered as "unlimited". All other values are rendered as decimal integers. The output is generated on every read of the /proc file; no caching is performed (the file is small and infrequently read).
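One data line can be rendered directly from the column widths in the format specification (26 / 21 / 21 characters, left-aligned, then the units string). A sketch of the formatter, assuming the u64::MAX-means-unlimited convention described above (`fmt_limit` and `limits_line` are illustrative names):

```rust
/// Render one rlimit value: u64::MAX displays as "unlimited",
/// everything else as a decimal integer.
fn fmt_limit(v: u64) -> String {
    if v == u64::MAX { "unlimited".into() } else { v.to_string() }
}

/// One /proc/PID/limits data line: name padded to 26 columns, soft and hard
/// values padded to 21 columns each, units string last.
fn limits_line(name: &str, soft: u64, hard: u64, unit: &str) -> String {
    format!("{:<26}{:<21}{:<21}{}", name, fmt_limit(soft), fmt_limit(hard), unit)
}
```

For example, `limits_line("Max stack size", 8388608, u64::MAX, "bytes")` reproduces the "Max stack size" row from the sample output above.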

7.5.9 /proc/PID/stat Field Mapping

UmkaOS's /proc/PID/stat output is generated directly from Task struct fields, maintaining Linux field order and encoding for compatibility with tools like ps, top, htop, and procps.

Linux /proc/PID/stat has 52 space-separated fields in a fixed order (per man 5 proc). The table below documents the authoritative mapping from each field to the UmkaOS Task struct field that populates it (field numbers are 1-indexed):

| Field # | Name | UmkaOS Task field | Notes |
|---|---|---|---|
| 1 | pid | task.pid | Process ID |
| 2 | comm | task.comm | Command name, truncated to 15 chars, surrounded by () |
| 3 | state | task.state → char | R=Running, S=Sleeping, D=Disk sleep, Z=Zombie, T=Stopped, t=Tracing, X=Dead |
| 4 | ppid | task.parent.pid | Parent PID (0 for init) |
| 5 | pgrp | task.pgrp | Process group ID |
| 6 | session | task.session | Session ID |
| 7 | tty_nr | task.tty | Controlling terminal (encoded as (major<<8)\|minor), 0 if none |
| 8 | tpgid | task.tty.foreground_pgrp | Foreground process group of controlling terminal |
| 9 | flags | task.flags | Kernel flags bitmask (PF_* values, Linux-compatible) |
| 10 | minflt | task.min_faults | Minor page faults |
| 11 | cminflt | task.children_min_faults | Minor faults of waited-for children |
| 12 | majflt | task.maj_faults | Major page faults |
| 13 | cmajflt | task.children_maj_faults | Major faults of waited-for children |
| 14 | utime | task.utime_ticks | User mode time in clock ticks |
| 15 | stime | task.stime_ticks | Kernel mode time in clock ticks |
| 16 | cutime | task.children_utime | User time of waited-for children |
| 17 | cstime | task.children_stime | Kernel time of waited-for children |
| 18 | priority | task.prio | Kernel scheduling priority (nice + 20 for normal tasks; negative for RT tasks) |
| 19 | nice | task.nice | Nice value: -20 (high priority) to 19 (low priority) |
| 20 | num_threads | task.thread_group.count | Number of threads in thread group |
| 21 | itrealvalue | 0 | Always 0 (obsolete) |
| 22 | starttime | task.start_time_ticks | Start time after boot in clock ticks |
| 23 | vsize | task.mm.total_vm * PAGE_SIZE | Virtual memory size in bytes |
| 24 | rss | task.mm.rss_pages | Resident set size in pages |
| 25 | rsslim | task.rlimit[RLIMIT_RSS] | RSS soft limit in bytes |
| 38 | exit_signal | task.exit_signal | Signal sent to parent on death |
| 39 | processor | task.cpu_id | Last CPU on which task ran |
| 40 | rt_priority | task.rt_priority | RT priority 1-99 (0 for non-RT) |
| 41 | policy | task.sched_policy | Scheduling policy (SCHED_NORMAL=0, SCHED_FIFO=1, etc.) |

Fields 26-37 and 42-52 are block I/O, signal, and cgroup fields — populated from task.io_accounting, task.pending_signals, and task.cgroup_id respectively.

7.5.10 Linux Compatibility Notes

| Topic | Detail |
|---|---|
| RLIMIT_* numeric values | Identical to Linux (0–15) |
| RLIM_INFINITY wire value | u64::MAX on 64-bit; u32::MAX on the 32-bit compat path |
| prlimit64 syscall number | 302 on x86-64 (verified against Linux 6.x) |
| getrlimit / setrlimit | Syscall numbers 97 / 160 on x86-64 |
| getrusage syscall number | 98 on x86-64 |
| times syscall number | 100 on x86-64 |
| RUSAGE_SELF / RUSAGE_CHILDREN / RUSAGE_THREAD | Values 0, -1, 1 match Linux |
| struct rusage layout | Binary-identical, including unused zero fields |
| struct rlimit64 layout | Binary-identical (two u64 fields) |
| USER_HZ | 100 (matches Linux default; not runtime-configurable) |
| NR_OPEN default | 1,048,576 (matches fs.nr_open default in Linux) |
| PID_MAX_LIMIT | 4,194,304 (matches Linux pid_max hard ceiling) |
| RLIMIT_LOCKS | Always RLIM_INFINITY; enforcement not implemented (matches modern Linux) |
| /proc/PID/limits format | Character-for-character identical to Linux |
| prlimit64 credential check | Requires PTRACE_MODE_ATTACH_REALCREDS or matching real/saved UID |
| RLIMIT_RSS enforcement | Advisory in Linux; UmkaOS integrates with the cgroup memory controller |
| RLIMIT_NICE encoding | 20 - rlim_cur = minimum nice value; rlim_cur 0 → nice min 20 (no privilege), rlim_cur 20 → nice min 0 |
| SIGXCPU on CPU limit | Delivered repeatedly (once per second) after soft limit; SIGKILL at hard limit |
| SIGXFSZ on file size | Signal delivered and EFBIG returned, matching Linux exactly |