
Chapter 8: Security Architecture

Capabilities, credentials, LSM framework, verified boot, TPM, IMA, post-quantum cryptography, confidential computing


8.1 Capability-Based Foundation

8.1.1 Capability Tokens

Every resource in UmkaOS is accessed through unforgeable capability tokens. This is the native security model -- not a bolt-on.

pub struct Capability {
    /// Unique identifier of the target object
    pub object_id: ObjectId,

    /// Bitfield of permitted operations
    pub permissions: PermissionBits,

    /// Monotonically increasing generation counter for revocation.
    /// When an object's generation advances, all capabilities with
    /// older generations become invalid.
    pub generation: u64,

    /// Additional constraints on this capability
    pub constraints: CapConstraints,

    /// Delegation depth: 0 = root capability from kernel, max 16.
    /// Incremented each time this capability is delegated to another context.
    /// `cap_delegate()` rejects the call with `CapError::DelegationDepthExceeded`
    /// when this field equals `CAP_MAX_DELEGATION_DEPTH`.
    pub delegation_depth: u8,
    // Delegation children list NOT stored here — see CapEntry below.
    // The Capability struct is the immutable validation token; it is
    // fixed-size and contains no heap references. The mutable delegation
    // list (children derived via cap_delegate) lives in CapEntry, which is
    // stored in the CapTable and accessed only in syscall context.
}

/// Per-CapId storage in the CapTable. Every issued capability has one CapEntry.
/// The `cap` field is the immutable validation token; `children` is the mutable
/// delegation audit list. The two are separated so that hot-path validation
/// (which only reads `cap`) never touches the `children` Mutex.
pub struct CapEntry {
    /// Immutable capability token. Fixed-size; no heap allocation. Copied into
    /// `ValidatedCap` tokens on first validation (see Section 8.1.2).
    pub cap: Capability,

    /// Child capabilities derived from this one via `cap_delegate()`.
    /// Accessed only during `cap_delegate()` and `cap_revoke()` — never in
    /// the hot validation path. Heap allocation here is acceptable (syscall
    /// context). Bounded by `CAP_MAX_DELEGATIONS` (256); `cap_delegate()`
    /// returns `CapError::DelegationLimitReached` if the limit is reached.
    pub children: Mutex<Vec<DelegationRecord>>,
}

/// Records a single delegation of a capability to another domain.
/// Stored in the parent capability's delegation list so that revocation
/// can propagate to all delegate holders.
pub struct DelegationRecord {
    /// The delegated capability ID (child capability).
    pub delegated_cap: CapId,
    /// The domain that received the delegation.
    pub target_domain: DomainId,
    /// TSC timestamp of the delegation.
    pub timestamp_tsc: u64,
    /// Rights granted to the delegate (subset of parent rights).
    pub granted_rights: Rights,
}

pub struct CapConstraints {
    /// Capability expires after this timestamp (0 = no expiry)
    pub expires_at: u64,

    /// Can this capability be delegated to other processes?
    pub delegatable: bool,

    /// Maximum delegation depth (0 = no further delegation)
    pub max_delegation_depth: u32,

    /// Restrict to specific CPU set. Uses the same `CpuMask` type as the
    /// scheduler (Section 6.1) and NUMA topology (Section 4.1.8).
    pub cpu_affinity_mask: CpuMask,
}

/// CPU bitmask type. Sized at boot to fit the actual CPU count discovered from
/// firmware (MADT/DTB). Stored as a fixed-size array of u64 words allocated once
/// in a static per-subsystem arena. A default-constructed (all-zeros) mask means
/// "any CPU" (no affinity constraint).
///
/// The word count is `(nr_cpus + 63) / 64`, making `CpuMask` a compile-time-
/// unknown but boot-time-fixed size. All CpuMask instances in the same kernel
/// boot have the same word count. The backing storage is dynamically sized:
/// at boot, the kernel discovers the actual CPU count and allocates a
/// `CpuMaskPool` with the correct word count. Subsystems that store CpuMask
/// by value use a thin wrapper around a pool-allocated bitmask. For
/// frequently-embedded cases (e.g., `Capability.cpu_affinity_mask`), an
/// inline representation is available when the CPU count fits in 2 words
/// (≤128 CPUs) for const-initialization convenience; larger systems use the
/// pool-backed variant to avoid placing large bitmasks on the stack.
pub struct CpuMask {
    /// Bitmask storage. Either inline (for small CPU counts) or a pointer
    /// to pool-allocated storage.
    storage: CpuMaskStorage,
}

impl CpuMask {
    /// Return an empty mask (all bits clear).
    ///
    /// Uses the inline storage variant with `active_words = 2`, which covers
    /// up to 128 CPUs. For systems with more than 128 CPUs, callers must use
    /// `CpuMaskPool` to allocate a pool-backed instance of the correct size.
    /// This method is `const` so it can initialise `static` fields at
    /// compile time.
    pub const fn empty() -> Self {
        CpuMask {
            storage: CpuMaskStorage::Inline { bits: [0u64; 2], active_words: 2 },
        }
    }

    /// Return a mask with all bits set for CPUs 0 .. `nr_cpus` (exclusive).
    ///
    /// Bits for non-existent CPUs beyond `nr_cpus` are left clear. The last
    /// partial word is filled using a shift mask so that only the exact
    /// `nr_cpus % 64` low-order bits are set. Panics in debug builds if
    /// `nr_cpus` would require more than 2 inline words (i.e., `nr_cpus > 128`);
    /// callers targeting large systems must use pool-backed allocation.
    pub fn full(nr_cpus: u32) -> Self {
        debug_assert!(nr_cpus <= 128, "inline CpuMask covers at most 128 CPUs");
        let mut mask = Self::empty();
        let full_words = (nr_cpus / 64) as usize;
        let remainder  = nr_cpus % 64;
        let bits = match &mut mask.storage {
            CpuMaskStorage::Inline { bits, .. } => bits,
            CpuMaskStorage::Pool   { .. }       => unreachable!(),
        };
        for i in 0..full_words.min(2) {
            bits[i] = u64::MAX;
        }
        if remainder > 0 && full_words < 2 {
            bits[full_words] = (1u64 << remainder) - 1;
        }
        mask
    }

    /// Set the bit for CPU `cpu` (mark it as present in the mask).
    ///
    /// `word = cpu / 64`, `bit = cpu % 64`. Out-of-bounds CPUs (beyond the
    /// active word count) are silently ignored so that callers do not need to
    /// guard against topology changes during bringup.
    pub fn set(&mut self, cpu: u32) {
        let word = (cpu / 64) as usize;
        let bit  = cpu % 64;
        match &mut self.storage {
            CpuMaskStorage::Inline { bits, active_words } => {
                if word < *active_words as usize {
                    bits[word] |= 1u64 << bit;
                }
            }
            CpuMaskStorage::Pool { bits, word_count } => {
                if word < *word_count as usize {
                    // SAFETY: `word < word_count` and pool memory is valid for
                    // the kernel lifetime; pool is exclusively owned through
                    // the CpuMask wrapper.
                    unsafe { *bits.add(word) |= 1u64 << bit; }
                }
            }
        }
    }

    /// Clear the bit for CPU `cpu` (remove it from the mask).
    ///
    /// `word = cpu / 64`, `bit = cpu % 64`. Out-of-bounds CPUs are silently
    /// ignored (same policy as `set`).
    pub fn clear(&mut self, cpu: u32) {
        let word = (cpu / 64) as usize;
        let bit  = cpu % 64;
        match &mut self.storage {
            CpuMaskStorage::Inline { bits, active_words } => {
                if word < *active_words as usize {
                    bits[word] &= !(1u64 << bit);
                }
            }
            CpuMaskStorage::Pool { bits, word_count } => {
                if word < *word_count as usize {
                    // SAFETY: `word < word_count` and pool memory is valid for
                    // the kernel lifetime; pool is exclusively owned through
                    // the CpuMask wrapper.
                    unsafe { *bits.add(word) &= !(1u64 << bit); }
                }
            }
        }
    }

    /// Return `true` if CPU `cpu` is set in the mask.
    ///
    /// `word = cpu / 64`, `bit = cpu % 64`. Returns `false` for any CPU
    /// whose index falls outside the active word range.
    pub fn test(&self, cpu: u32) -> bool {
        let word = (cpu / 64) as usize;
        let bit  = cpu % 64;
        match &self.storage {
            CpuMaskStorage::Inline { bits, active_words } => {
                word < *active_words as usize && (bits[word] >> bit) & 1 == 1
            }
            CpuMaskStorage::Pool { bits, word_count } => {
                if word >= *word_count as usize { return false; }
                // SAFETY: `word < word_count` and pool memory is valid for
                // the kernel lifetime; accessed read-only here.
                (unsafe { *bits.add(word) } >> bit) & 1 == 1
            }
        }
    }

    /// Return the index of the lowest-numbered set CPU, or `None` if empty.
    ///
    /// Scans words from lowest to highest. Within each word uses
    /// `u64::trailing_zeros()` (a single BSF/CTZ instruction on all
    /// supported architectures). Returns `None` when all words are zero.
    pub fn first_set(&self) -> Option<u32> {
        self.next_set(0)
    }

    /// Return the index of the first set CPU with index `>= start`, or `None`.
    ///
    /// `word = start / 64`, `bit = start % 64`. The first word is masked to
    /// ignore bits below `start % 64`. Subsequent words are searched in full.
    /// Returns `None` if no set bit is found at or after `start`.
    pub fn next_set(&self, start: u32) -> Option<u32> {
        let start_word = (start / 64) as usize;
        let start_bit  = start % 64;
        let (words, word_count) = self.words_and_count();
        for w in start_word..word_count {
            let mut word = words[w];
            if w == start_word {
                // Mask off bits below `start_bit` so we do not return a CPU
                // that is before `start`.
                word &= u64::MAX.wrapping_shl(start_bit);
            }
            if word != 0 {
                let bit = word.trailing_zeros();
                return Some((w as u32) * 64 + bit);
            }
        }
        None
    }

    /// Return the number of CPUs set in the mask (popcount).
    ///
    /// Sums `u64::count_ones()` over all active words. On x86-64 CPUs with
    /// `popcnt` (Nehalem and later) this compiles to a single POPCNT per
    /// word. On platforms without hardware popcount the compiler generates an
    /// efficient software equivalent.
    pub fn count(&self) -> u32 {
        let (words, word_count) = self.words_and_count();
        words[..word_count].iter().map(|w| w.count_ones()).sum()
    }

    /// Return `true` if no CPUs are set in the mask.
    ///
    /// Cheaper than `count() == 0` because it short-circuits on the first
    /// non-zero word rather than accumulating the full popcount.
    pub fn is_empty(&self) -> bool {
        let (words, word_count) = self.words_and_count();
        words[..word_count].iter().all(|&w| w == 0)
    }

    /// Return the bitwise complement of `self`, restricted to CPUs 0 .. `nr_cpus`.
    ///
    /// Bits for non-existent CPUs beyond `nr_cpus` are forced clear in the
    /// result even if they were set in `self`. This prevents the complement
    /// from appearing to contain phantom CPUs that do not exist in the
    /// topology.
    pub fn complement(&self, nr_cpus: u32) -> Self {
        let mut result = Self::full(nr_cpus);
        let (src_words, src_wc) = self.words_and_count();
        let (dst_words, dst_wc) = result.words_and_count_mut();
        let wc = src_wc.min(dst_wc);
        for i in 0..wc {
            dst_words[i] &= !src_words[i];
        }
        result
    }

    /// Return the bitwise union (`self | other`) of two masks.
    ///
    /// If the two masks have different word counts (e.g., during topology
    /// bringup before all masks are resized), the shorter mask is zero-extended
    /// so that bits present only in the longer mask are preserved in the result.
    /// The result uses inline storage (capacity 128 CPUs); callers on systems
    /// with more than 128 CPUs must use pool-backed masks.
    pub fn union(&self, other: &Self) -> Self {
        let mut result = Self::empty();
        let (a, a_wc) = self.words_and_count();
        let (b, b_wc) = other.words_and_count();
        let (r, r_wc) = result.words_and_count_mut();
        // Zero-extend the shorter operand: copy each operand's words up to
        // the result's capacity so that bits present only in the longer mask
        // are preserved, as documented above.
        for i in 0..a_wc.min(r_wc) {
            r[i] = a[i];
        }
        for i in 0..b_wc.min(r_wc) {
            r[i] |= b[i];
        }
        result
    }

    /// Return the bitwise intersection (`self & other`) of two masks.
    ///
    /// Only bits set in both operands appear in the result. If the masks have
    /// different word counts, the shorter extent is used (unrepresented bits
    /// are treated as zero in the shorter mask, so they do not appear in the
    /// intersection). The result uses inline storage (capacity 128 CPUs).
    pub fn intersection(&self, other: &Self) -> Self {
        let mut result = Self::empty();
        let (a, a_wc) = self.words_and_count();
        let (b, b_wc) = other.words_and_count();
        let (r, r_wc) = result.words_and_count_mut();
        let wc = a_wc.min(b_wc).min(r_wc);
        for i in 0..wc {
            r[i] = a[i] & b[i];
        }
        result
    }

    /// Iterate over the indices of all set CPUs in ascending order.
    ///
    /// Uses `next_set()` internally, advancing by one past each returned
    /// index to find the next. The iterator yields `u32` CPU indices.
    /// Yields no items for an empty mask.
    ///
    /// Example:
    /// ```
    /// let mut mask = CpuMask::empty();
    /// mask.set(0);
    /// mask.set(3);
    /// mask.set(65);
    /// assert_eq!(mask.iter().collect::<Vec<_>>(), vec![0, 3, 65]);
    /// ```
    pub fn iter(&self) -> impl Iterator<Item = u32> + '_ {
        let mut next = self.first_set();
        core::iter::from_fn(move || {
            let current = next?;
            next = self.next_set(current + 1);
            Some(current)
        })
    }

    // --- Private helpers -------------------------------------------------

    /// Return a shared slice of the backing words and the active word count.
    fn words_and_count(&self) -> (&[u64], usize) {
        match &self.storage {
            CpuMaskStorage::Inline { bits, active_words } => {
                (bits.as_slice(), *active_words as usize)
            }
            CpuMaskStorage::Pool { bits, word_count } => {
                // SAFETY: pool memory is valid for the kernel lifetime and
                // `word_count` accurately reflects the allocated length.
                let slice = unsafe {
                    core::slice::from_raw_parts(*bits, *word_count as usize)
                };
                (slice, *word_count as usize)
            }
        }
    }

    /// Return a mutable slice of the backing words and the active word count.
    fn words_and_count_mut(&mut self) -> (&mut [u64], usize) {
        match &mut self.storage {
            CpuMaskStorage::Inline { bits, active_words } => {
                let wc = *active_words as usize;
                (bits.as_mut_slice(), wc)
            }
            CpuMaskStorage::Pool { bits, word_count } => {
                let wc = *word_count as usize;
                // SAFETY: pool memory is valid for the kernel lifetime and
                // exclusive access is guaranteed by the `&mut self` borrow.
                let slice = unsafe {
                    core::slice::from_raw_parts_mut(*bits, wc)
                };
                (slice, wc)
            }
        }
    }
}

/// CpuMask storage: inline variant for const-initialization convenience, or
/// pool-backed for systems with >128 CPUs.
/// The variant is selected once at boot and fixed for the system lifetime.
///
/// Note: as a Rust enum, both variants occupy the same stack space (the union
/// reserves space for the largest variant). The inline variant does NOT save
/// memory relative to the pool variant — it exists for ergonomic
/// const-initialization without a pool allocation, not for memory savings.
/// For systems with >128 CPUs, the pool variant avoids placing the full bitmask
/// (potentially hundreds of bytes) on the stack; that is where pool-backing
/// provides a genuine space benefit.
enum CpuMaskStorage {
    /// Inline: up to 128 CPUs (2 × u64). Provided for const-initialization
    /// convenience; both enum variants occupy the same stack space.
    Inline { bits: [u64; 2], active_words: u32 },
    /// Pool: any CPU count. Points into a boot-allocated CpuMaskPool.
    /// The pool is contiguous memory, so cache behaviour is good.
    Pool { bits: *mut u64, word_count: u32 },
}

/// Boot-time global: number of u64 words needed for the discovered CPU count.
/// Set once during SMP bringup, read-only thereafter.
static CPU_MASK_WORDS: AtomicU32 = AtomicU32::new(0);

Permission Bits:

The PermissionBits type defines fine-grained access rights that can be granted on any capability. Different object types interpret these bits differently, but the base set is universal:

bitflags! {
    /// Fine-grained permission bits that can be set on any capability.
    /// These are orthogonal to SystemCaps (administrative capabilities) —
    /// PermissionBits control what operations a specific capability permits
    /// on its target object, while SystemCaps control what system-wide
    /// operations a process can perform.
    pub struct PermissionBits: u32 {
        /// READ: Read access to the target object's data or state.
        /// For memory objects: read data.
        /// For files: read file contents.
        /// For processes: read registers, memory, status.
        const READ = 1 << 0;

        /// WRITE: Write access to the target object's data or state.
        /// For memory objects: write data.
        /// For files: write file contents, append.
        /// For processes: modify registers, memory.
        const WRITE = 1 << 1;

        /// EXECUTE: Execute access or control flow modification.
        /// For memory objects: execute code from this region.
        /// For files: execute as a program.
        /// For processes: single-step, continue execution.
        const EXECUTE = 1 << 2;

        /// DEBUG: Debugging access to the target object.
        /// For processes: attach via ptrace, inspect/modify state.
        /// For /proc entries: access to private (non-public) fields.
        /// Implies READ unless explicitly excluded. See Section 19.3.1.
        const DEBUG = 1 << 3;

        /// SYSCALL_TRACE: Trace syscall entry/exit for a process.
        /// Used with PTRACE_SYSCALL. Requires DEBUG to also be set.
        /// See Section 19.3.1.
        const SYSCALL_TRACE = 1 << 4;

        /// DELEGATE: Delegate this capability to another process.
        /// Subject to CapConstraints::delegatable and max_delegation_depth.
        const DELEGATE = 1 << 5;

        /// ADMIN: Administrative access to the target object.
        /// Object-specific meaning: for devices, may allow reset;
        /// for filesystems, may allow unmount; etc.
        const ADMIN = 1 << 6;

        /// MAP_READ: Map the object read-only into address space.
        /// Used for memory objects, file-backed mappings, DMA buffers.
        const MAP_READ = 1 << 7;

        /// MAP_WRITE: Map the object read-write into address space.
        const MAP_WRITE = 1 << 8;

        /// MAP_EXECUTE: Map the object with execute permission.
        const MAP_EXECUTE = 1 << 9;

        /// KERNEL_READ: Read kernel-side data associated with a userspace object.
        /// For processes: read kernel stack trace (/proc/pid/stack).
        /// Requires CAP_DEBUG on the target in addition to this bit.
        /// See Section 19.3.5.
        const KERNEL_READ = 1 << 10;
    }
}

Permission composition: Multiple bits can be combined. For example, a debugging capability on a process might have DEBUG | READ | WRITE | SYSCALL_TRACE to allow full ptrace-style debugging including syscall interception.
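This composition can be sketched with plain u32 constants mirroring the PermissionBits values defined above (the kernel uses the bitflags! type); the has_all helper is illustrative, not part of the kernel API:

```rust
// Stand-in constants matching the PermissionBits definitions above.
const READ: u32 = 1 << 0;
const WRITE: u32 = 1 << 1;
const DEBUG: u32 = 1 << 3;
const SYSCALL_TRACE: u32 = 1 << 4;

/// Returns true if every bit in `requested` is present in `granted`.
fn has_all(granted: u32, requested: u32) -> bool {
    granted & requested == requested
}

fn main() {
    // Full ptrace-style debugging capability, as described in the text.
    let debug_cap = DEBUG | READ | WRITE | SYSCALL_TRACE;
    // An access check: are all requested bits granted?
    assert!(has_all(debug_cap, DEBUG | SYSCALL_TRACE));
    // ADMIN (bit 6) was never granted, so the check fails.
    assert!(!has_all(debug_cap, 1 << 6));
}
```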

Capabilities are stored in kernel-managed capability tables, indexed by per-process capability handles. User space never sees raw capability data -- only opaque handles.

Capability Revocation Semantics:

UmkaOS supports two revocation mechanisms, chosen per-object-type:

  1. Generation-based (for bulk invalidation): Each object has its own monotonic generation counter (object.generation). Each capability records the generation of the object at the time the capability was created (cap.generation). Validation checks cap.generation == object.generation (exact match). Revoking all capabilities for an object increments object.generation; all existing capabilities now have a stale generation and fail validation. O(1), no table scanning.

Slot persistence: The generation counter lives in the kernel's object registry slot, not in the object itself. When an object is freed, its registry slot retains the last generation value. When a new object is allocated in the same slot, the slot's generation is incremented before the new object becomes visible. This prevents ABA vulnerabilities: an old capability with generation == N cannot validate against a new object in the same slot at generation == N+1. Object IDs are slot indices — they may be reused, but the generation makes each (ObjectId, generation) pair unique over time. See Section 8.1.1.3 for the full object registry data structure specification.
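The slot-reuse behaviour can be sketched as follows; the Slot type and its methods are hypothetical illustrations of the registry described above, not the kernel's actual data structure:

```rust
/// Illustrative registry slot: the generation lives in the slot and
/// survives the object it currently holds.
struct Slot {
    generation: u64,
    live: bool,
}

impl Slot {
    /// Free the current object; the generation value is retained.
    fn free(&mut self) {
        self.live = false;
    }

    /// Allocate a new object in this slot. The generation is bumped
    /// BEFORE the new object becomes visible, so old capabilities can
    /// never match the new occupant.
    fn allocate(&mut self) -> u64 {
        self.generation += 1;
        self.live = true;
        self.generation
    }

    /// Exact-match validation against the slot's current generation.
    fn validate(&self, cap_generation: u64) -> bool {
        self.live && cap_generation == self.generation
    }
}

fn main() {
    let mut slot = Slot { generation: 7, live: true };
    let old_cap_gen = slot.generation; // capability minted at gen 7
    slot.free();
    let new_cap_gen = slot.allocate(); // same slot index, now gen 8
    assert!(!slot.validate(old_cap_gen)); // ABA prevented
    assert!(slot.validate(new_cap_gen));
}
```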

Trade-off: this is per-object all-or-nothing. Incrementing the generation for one PhysMemory object invalidates only capabilities pointing to that object — it does NOT affect capabilities for other PhysMemory objects (they have independent generation counters). If a single object needs fine-grained revocation (revoke one capability without affecting others for the same object), use indirection-based revocation instead.

  2. Indirection-based (for fine-grained control): Capabilities point to an indirection entry in a per-object revocation table. Revoking sets the entry to "revoked." Allows individual revocation without affecting other capabilities for the same object. Cost: one extra pointer dereference per validation (~2-3ns).

Synchronization: Indirection entries are RCU-protected. The validate path runs inside an RCU read-side critical section: it dereferences the indirection pointer, checks the "revoked" flag, and proceeds — all under rcu_read_lock(). The revoke path sets the entry to "revoked" (atomic store), then defers entry reclamation to an RCU grace period via rcu_call(). This closes the TOCTOU window: a thread that has already dereferenced the indirection entry and sees "not revoked" is guaranteed to be within an RCU read-side critical section, so the entry cannot be freed until that thread exits the critical section. The indirection entry itself is never freed while any reader may hold a reference — only after the RCU grace period completes.
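The individual-revocation property can be sketched with std types standing in for the kernel's RCU machinery; here Arc plus AtomicBool replace the RCU-protected indirection entry, and all names are illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for the per-capability indirection entry. In the kernel this
/// is RCU-protected and reclaimed only after a grace period.
struct IndirectionEntry {
    revoked: AtomicBool,
}

/// A capability that validates through its own indirection entry.
struct IndirectCap {
    entry: Arc<IndirectionEntry>,
}

impl IndirectCap {
    /// Hot path: one extra dereference plus a flag check.
    fn validate(&self) -> bool {
        !self.entry.revoked.load(Ordering::Acquire)
    }

    /// Revoke path: atomic store; the kernel would then defer entry
    /// reclamation via rcu_call().
    fn revoke(&self) {
        self.entry.revoked.store(true, Ordering::Release);
    }
}

fn main() {
    // Two capabilities to the SAME object, each with its own entry.
    let a = IndirectCap { entry: Arc::new(IndirectionEntry { revoked: AtomicBool::new(false) }) };
    let b = IndirectCap { entry: Arc::new(IndirectionEntry { revoked: AtomicBool::new(false) }) };
    a.revoke();
    assert!(!a.validate()); // only the revoked capability dies
    assert!(b.validate()); // siblings on the same object are unaffected
}
```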

  Object type      Revocation model          Rationale
  PhysMemory       Generation (per-object)   Bulk revocation on free — all caps for this region die
  DeviceIo         Generation (per-object)   Device reset invalidates all handles for this device
  FileDescriptor   Indirection               Individual fd close must not affect other fds to the same inode
  IpcChannel       Indirection               Can revoke one endpoint independently
  Process          Generation (per-object)   Process exit invalidates all handles to this process

Validation rule: is_valid() in umka-core/src/cap/mod.rs must use exact-match: self.generation == object.generation. A capability is valid only when its generation matches the object's current generation exactly. Less-than-or-equal (<=) is incorrect because it would allow stale capabilities from earlier generations to pass validation — a capability from generation 3 must not be valid when the object is at generation 5.
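A minimal illustration of the exact-match rule, using stand-in Object and Cap types (field names follow the structs above; this is not the umka-core implementation itself):

```rust
/// Illustrative stand-ins for the registry object and capability token.
struct Object { generation: u64 }
struct Cap { generation: u64 }

/// Exact-match validation. Using `<=` here would wrongly accept stale
/// capabilities minted at earlier generations.
fn is_valid(cap: &Cap, obj: &Object) -> bool {
    cap.generation == obj.generation
}

fn main() {
    let mut obj = Object { generation: 3 };
    let cap = Cap { generation: 3 };
    assert!(is_valid(&cap, &obj));
    obj.generation += 1; // revoke-all: bump the object's generation
    assert!(!is_valid(&cap, &obj)); // stale generation fails exactly as required
}
```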

Distributed Capability Revocation

Capabilities can be delegated across cluster nodes via the distributed capability protocol in Section 5.1. When a capability is revoked on the originating node, all copies on all remote nodes must be invalidated. This requires a distributed protocol with bounded latency to prevent stale capabilities from granting access after revocation.

Revocation mechanism — epoch-based generation increment:

Each capability carries a generation: u64 field (part of the CapHandle). The capability table on each node additionally stores the current valid generation per capability slot. Revocation works by incrementing the generation:

fn revoke_capability(cap: CapHandle) -> Result<(), RevokeError> {
    // Validate that the handle refers to a live entry before revoking.
    rcu_read_lock(|| cap_table.get(cap.index))?;
    // 1. Increment generation in the local capability table.
    //    All local lookups now fail (generation mismatch).
    cap_table.increment_generation(cap.index);

    // 2. Broadcast REVOKE message to all cluster nodes that may hold a copy.
    //    The message carries cap.index and the new generation.
    let nodes = cap.delegation_set(); // nodes this cap was delegated to
    for node in nodes {
        cluster_send(node, ClusterMsg::RevokeCapability {
            index: cap.index,
            new_generation: cap_table.generation(cap.index),
        });
    }

    // 3. Wait for ACK from all nodes (bounded by REVOKE_TIMEOUT_MS).
    //    Nodes that do not ACK within the timeout are considered failed
    //    and are fenced (removed from the cluster).
    wait_for_acks(nodes, REVOKE_TIMEOUT_MS)
}

Latency bounds: Revocation completes within 2 * network_RTT + processing_time. For local cluster (PCIe P2P or 100GbE RDMA), RTT is approximately 1-10us, so revocation completes in < 100us. For wide-area clusters with RTT up to 10ms, revocation takes < 50ms. This is the hard latency bound — processes holding stale capabilities see rejection errors after this window.

Grace period for in-flight operations: Operations already in progress when revocation is issued complete normally (they entered the kernel before generation increment). New operations using the old generation fail immediately. This is equivalent to RCU's read-side grace period: existing readers complete, new readers see the new generation.

Revocation of delegated capabilities: When a parent capability is revoked, cap_revoke() first iterates the parent's CapEntry.children list and sends a CapRevoke IPC message to each target_domain recorded there. Domains receiving the revocation must call cap_drop(delegated_cap) to invalidate their copy. Only after all delegates acknowledge (or REVOKE_TIMEOUT_MS = 1000ms elapses) does the parent capability's generation increment take effect and the parent become invalid. Delegates that do not acknowledge within the timeout are forcibly invalidated and the domain is flagged for audit. Linux's POSIX capability model has no analogue: process capabilities there carry no delegation tracking at all.

The CapEntry.children list records every cap_delegate() call made FROM a given capability. Each entry holds the child CapId, the receiving DomainId, a TSC timestamp, and the granted rights subset. The list is stored in the CapTable's CapEntry (not in the Capability token itself) so that the hot validation path never touches it. Heap allocation during delegation is acceptable (syscall context). The maximum children per capability is CAP_MAX_DELEGATIONS (256); cap_delegate() returns CapError::DelegationLimitReached if exceeded. Attempting to delegate when delegation_depth == CAP_MAX_DELEGATION_DEPTH (16) returns CapError::DelegationDepthExceeded.

Capability tombstones: After revocation, the capability slot is kept as a tombstone (generation incremented, type = Revoked) for one epoch. This prevents ABA races where a new capability is allocated to the same slot index before all in-flight messages referencing the old generation have been processed.

8.1.1.4 Cluster Revocation Wire Protocol

Capability revocation in a cluster requires:

  1. Notifying all nodes that hold a delegated copy of the revoked capability.
  2. Ensuring all nodes have revoked before the revoking node considers the operation complete.
  3. Fencing in-flight operations that may have passed the capability check but not yet completed.

Message format (CapRevocationMsg):

#[repr(C)]
pub struct CapRevocationMsg {
    /// Protocol version (currently 1).
    pub version:    u8,
    /// Message type (1 = REVOKE_REQUEST, 2 = REVOKE_ACK, 3 = REVOKE_COMPLETE).
    pub msg_type:   u8,
    pub _pad:       [u8; 2],
    /// The capability object ID being revoked.
    pub object_id:  ObjectId,
    /// Generation of the capability being revoked (capabilities with older
    /// generations are already invalid; this revokes the current generation).
    pub generation: u64,
    /// Epoch counter for ordering (sender increments on each revocation batch).
    pub epoch:      u64,
    /// Cryptographic nonce (128-bit random) for replay prevention.
    pub nonce:      [u8; 16],
    /// Sender node ID.
    pub sender:     NodeId,
}

Transport: Sent over the cluster's ClusterTransport (RDMA or TCP fallback, Section 5.1). Uses send_reliable() for REVOKE_REQUEST and REVOKE_COMPLETE messages.

Protocol:

  1. Revoker increments capability.generation atomically (makes old-gen caps invalid locally).
  2. Revoker sends REVOKE_REQUEST to all nodes in capability.delegation_list concurrently.
  3. Each recipient node:
     a. Removes all capabilities with (object_id, generation) from its local capability table.
     b. Waits for any in-flight operations using the capability to complete (drains the per-capability operation counter to zero; bounded by the max operation timeout = 5s).
     c. Sends REVOKE_ACK to the revoker.
  4. Revoker waits for ACKs from all notified nodes (timeout: 10s per node).
  5. On all-ACKs: sends REVOKE_COMPLETE broadcast; revocation is final.
  6. On timeout: the revoker marks the node as "pending-revocation" and isolates it from new capability grants. Cluster health monitoring (Section 5.1) forces the isolated node to rejoin, which triggers re-synchronization of its capability table from the root.

Fencing: In-flight operations are tracked by CapOperationGuard (RAII type; dropped when the operation that used the capability completes). The drain at step 3b waits for all CapOperationGuard instances on the revoked capability to drop. Maximum drain time: 5s (matches the max RPC timeout).
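The RAII counting that the drain relies on can be sketched with std atomics in place of kernel primitives; only the CapOperationGuard name comes from the text above, and the OpCounter type is illustrative:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in for the per-capability in-flight operation counter.
struct OpCounter(AtomicUsize);

/// RAII guard: constructed when an operation passes the capability check,
/// dropped when the operation completes.
struct CapOperationGuard {
    counter: Arc<OpCounter>,
}

impl CapOperationGuard {
    fn new(counter: Arc<OpCounter>) -> Self {
        counter.0.fetch_add(1, Ordering::AcqRel);
        CapOperationGuard { counter }
    }
}

impl Drop for CapOperationGuard {
    fn drop(&mut self) {
        // Step 3b drains by waiting for this count to reach zero.
        self.counter.0.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let counter = Arc::new(OpCounter(AtomicUsize::new(0)));
    {
        let _guard = CapOperationGuard::new(counter.clone());
        assert_eq!(counter.0.load(Ordering::Acquire), 1); // operation in flight
    } // guard dropped here: operation complete
    assert_eq!(counter.0.load(Ordering::Acquire), 0); // drained; revocation may proceed
}
```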

Replay prevention: The nonce + epoch pair prevents replayed revocations. Each node keeps a 60s window of seen nonces.
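A minimal sketch of the per-node nonce window, assuming a simple timestamp map; ReplayFilter and its method are hypothetical, with the 60s window taken from the text:

```rust
use std::collections::HashMap;

/// Window length: 60s, expressed in nanoseconds.
const NONCE_WINDOW_NS: u64 = 60_000_000_000;

/// Illustrative replay filter: remembers nonces seen within the window.
struct ReplayFilter {
    seen: HashMap<[u8; 16], u64>, // nonce -> arrival timestamp (ns)
}

impl ReplayFilter {
    /// Accept a message's nonce, or reject it as a replay.
    fn accept(&mut self, nonce: [u8; 16], now_ns: u64) -> bool {
        // Expire entries older than the window.
        self.seen.retain(|_, &mut t| now_ns.saturating_sub(t) < NONCE_WINDOW_NS);
        // A nonce already in the window is a replay.
        if self.seen.contains_key(&nonce) {
            return false;
        }
        self.seen.insert(nonce, now_ns);
        true
    }
}

fn main() {
    let mut filter = ReplayFilter { seen: HashMap::new() };
    let nonce = [0xAB; 16];
    assert!(filter.accept(nonce, 0)); // first delivery
    assert!(!filter.accept(nonce, 1_000_000_000)); // replay within the window
    assert!(filter.accept(nonce, 61_000_000_000)); // original entry has expired
}
```

Expiring old entries lazily on each accept keeps the sketch simple; a kernel implementation would more likely bucket nonces by epoch to bound memory deterministically.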

8.1.1.5 Capability Delegation API

/// Delegate a capability to another isolation domain.
///
/// Creates a child capability with a subset of the parent's rights,
/// registers the delegation in the parent capability's provenance chain
/// (for revocation propagation), and returns a token that can be
/// transferred to `target_domain`.
///
/// # Arguments
/// - `cap`: The parent capability to delegate.
/// - `permitted_rights`: The rights to grant to the delegate. Must be
///   a subset of `cap.rights` — you cannot grant rights you don't have.
/// - `target_domain`: The isolation domain receiving the delegation.
///
/// # Returns
/// A `DelegatedCap` containing the new child capability ID and metadata.
///
/// # Errors
/// - `CapError::InsufficientRights`: `permitted_rights` includes rights
///   not present in `cap.rights`.
/// - `CapError::DomainNotFound`: `target_domain` is not a valid domain.
/// - `CapError::DelegationLimitReached`: The parent capability already has
///   the maximum number of active delegations (256). Revoke some first.
/// - `CapError::DelegationDepthExceeded`: `cap.delegation_depth` equals
///   `CAP_MAX_DELEGATION_DEPTH` (16). The delegation chain has reached the
///   system-wide depth ceiling; no further sub-delegation is permitted.
///
/// # Revocation
/// Calling `cap_revoke(cap)` will:
/// 1. Look up the `CapEntry` in the CapTable, lock `entry.children`, and
///    send `CapRevoke` IPC to each `target_domain` in the children list.
/// 2. Each target domain must call `cap_drop(delegated_cap)` within
///    `REVOKE_TIMEOUT_MS` (1000ms default).
/// 3. After all delegates acknowledge (or timeout): the parent cap is
///    invalidated.
///
/// # Design note
/// Unlike POSIX capabilities (which are per-process and non-delegatable),
/// UmkaOS capabilities form an explicit delegation tree. This enables:
/// - Precise revocation (revoke from root, all delegates are notified)
/// - Audit trail (delegation records include timestamp + domain ID)
/// - Least-privilege enforcement (delegates cannot exceed parent rights)
pub fn cap_delegate(
    cap: CapRef,
    permitted_rights: Rights,
    target_domain: DomainId,
) -> Result<DelegatedCap, CapError>;

/// Token returned by `cap_delegate()`. Must be sent to `target_domain`
/// via IPC; the target uses it to call `cap_accept(delegated_cap)`.
pub struct DelegatedCap {
    /// ID of the newly created child capability.
    pub cap_id: CapId,
    /// ID of the parent capability (for audit and revocation).
    pub parent_cap_id: CapId,
    /// The domain authorized to accept this cap.
    pub target_domain: DomainId,
    /// Rights granted (subset of parent rights).
    pub rights: Rights,
    /// Expiry: TSC value after which this token is invalid (0 = no expiry).
    pub expiry_tsc: u64,
}

/// Maximum active delegations per capability.
pub const CAP_MAX_DELEGATIONS: u32 = 256;

/// Maximum delegation chain depth. Root capabilities issued directly by the
/// kernel have `delegation_depth = 0`. Each call to `cap_delegate()` produces
/// a child with `delegation_depth = parent.delegation_depth + 1`. When a
/// capability's `delegation_depth` equals this constant, `cap_delegate()`
/// returns `CapError::DelegationDepthExceeded` and no child is created.
///
/// The value 16 is a pragmatic ceiling: it comfortably covers all realistic
/// delegation chains (e.g., kernel → service → container runtime → container
/// → sandbox → helper = 5 levels) while preventing unbounded kernel memory
/// growth from pathological delegation chains.
pub const CAP_MAX_DELEGATION_DEPTH: u8 = 16;

/// Revocation timeout: how long UmkaOS Core waits for delegates to
/// acknowledge a revocation before forcibly invalidating the capability.
pub const REVOKE_TIMEOUT_MS: u64 = 1000;

8.1.1.0a Delegation Depth Limit

UmkaOS capabilities are attenuated: a delegated capability carries a strict subset of the parent's rights. Because rights can only be removed and never added, delegation is monotonically weakening — a context that already holds a parent capability cannot gain additional authority by receiving a child delegation of it. This property means cycle detection is unnecessary: delegating to a context that already holds the parent results in the child having the lesser of the two, which is harmless.

Despite the impossibility of privilege cycles, unbounded delegation chains would allow an adversary to consume unbounded kernel memory (one Capability kernel object per link in the chain). The depth limit closes this vector.

Depth limit: 16 levels. Root capabilities — issued directly by the kernel at object creation, open(), etc. — have delegation_depth = 0. Each call to cap_delegate() produces a child with delegation_depth = parent.delegation_depth + 1. When delegation_depth == CAP_MAX_DELEGATION_DEPTH (16), cap_delegate() returns CapError::DelegationDepthExceeded; no child capability is created and no kernel object is allocated.

Memory bound: the depth limit bounds the length of any single delegation chain at 16 links — at most 16 × sizeof(Capability) of kernel memory per chain, which also bounds the cost of walking a provenance chain during revocation. The depth limit does not by itself bound fan-out: each capability may have up to CAP_MAX_DELEGATIONS (256) children, so a delegation tree rooted at one capability is bounded by depth 16 and fan-out 256, not by 16 objects. The practical total bound comes from the per-process table limit: each process holds at most CAP_MAX_CAPABILITIES_PER_PROCESS (1024) active capabilities, so total kernel capability memory is bounded by the number of processes × 1024 × sizeof(CapEntry).

CapConstraints::max_delegation_depth (the per-capability policy field, value 0–16) is an additional constraint: if set, cap_delegate() also rejects when parent.delegation_depth >= parent.constraints.max_delegation_depth. This allows issuers to impose a tighter limit than the system-wide maximum on capabilities they create (e.g., max_delegation_depth = 1 to allow exactly one level of sub-delegation). The system-wide CAP_MAX_DELEGATION_DEPTH = 16 is an absolute hard ceiling that cannot be overridden upward by any capability constraint.

8.1.1.1 Capability Validation Amortization: ValidatedCap<'dispatch>

Capability validation (generation check or indirection dereference) costs ~5-10 cycles per check. On a typical KABI call path, the driver invokes 3-5 kernel services per I/O operation (e.g., DMA map, ring buffer push, completion post), each requiring a capability check. This adds ~15-50 cycles per I/O — measurable at NVMe-class throughput (millions of IOPS).

UmkaOS amortizes this cost with a validate-once token pattern. This is a core design decision implemented from day one — the KABI dispatch path produces ValidatedCap tokens that downstream kernel services accept without re-validation.

/// A capability that has been validated within the current KABI dispatch.
/// The lifetime `'dispatch` is tied to the KABI dispatch guard, ensuring
/// the token cannot outlive the dispatch scope. Within this scope, the
/// capability is guaranteed valid:
/// - The generation matched at validation time.
/// - The indirection entry (if applicable) was not revoked.
/// - The permission bits include the required set.
///
/// # Safety invariant
///
/// `ValidatedCap` is only constructed by `cap_validate()` inside a
/// `KabiDispatchGuard` scope. The guard holds an RCU read-side lock,
/// preventing capability revocation from completing during the dispatch.
/// This means: if validation succeeds at the start of a KABI call,
/// the capability remains valid for the entire call — revocation cannot
/// race because the RCU grace period cannot complete until the guard drops.
pub struct ValidatedCap<'dispatch> {
    /// The validated capability handle (opaque, for passing to sub-services).
    handle: CapHandle,
    /// The permission bits that were validated. Sub-services can check
    /// specific permission bits against this cached copy without re-reading
    /// the capability table.
    permissions: PermissionBits,
    /// Object ID that this capability refers to.
    object_id: ObjectId,
    /// Tie lifetime to the KABI dispatch scope.
    _dispatch: PhantomData<&'dispatch KabiDispatchGuard>,
}

impl<'dispatch> ValidatedCap<'dispatch> {
    /// Check whether this already-validated capability has a specific
    /// permission bit. This is a local bitmask test (~1 cycle), NOT a
    /// capability table lookup.
    pub fn has_permission(&self, perm: PermissionBits) -> bool {
        self.permissions.contains(perm)
    }

    /// Return the object ID. Used by kernel services to locate the target
    /// object without a second capability table dereference.
    pub fn object_id(&self) -> ObjectId {
        self.object_id
    }
}

KABI dispatch integration: The KABI trampoline validates the capability once and passes the ValidatedCap token to the handler:

/// KABI dispatch trampoline (generated by kabi-compiler for each interface method).
fn kabi_dispatch_submit_io(
    ctx: *mut c_void,
    cap_handle: CapHandle,
    op: IoOp,
    /* ... */
) -> IoResult {
    // Acquire KABI dispatch guard (holds RCU read lock, pins capability validity).
    let dispatch = KabiDispatchGuard::enter();

    // Validate once — generation check + permission check (~5-10 cycles).
    let vcap = match cap_validate(cap_handle, PermissionBits::WRITE, &dispatch) {
        Ok(validated) => validated,
        Err(e) => return IoResult::from_error(e),
    };

    // All subsequent kernel service calls accept &ValidatedCap instead of raw
    // CapHandle. No re-validation needed — the dispatch guard guarantees validity.
    let dma_addr = dma_map_buffer(&vcap, buf)?;   // Accepts &ValidatedCap: ~1 cycle permission check
    ring_push_command(&vcap, op, dma_addr)?;       // Accepts &ValidatedCap: no cap lookup
    completion_register(&vcap, req_handle)?;        // Accepts &ValidatedCap: no cap lookup

    // dispatch guard drops here → RCU read unlock → revocations can now complete.
    IoResult::Success
}

Cost savings per KABI call:

  • Without amortization: 3-5 capability validations × ~5-10 cycles each = ~15-50 cycles.
  • With ValidatedCap: 1 validation (~5-10 cycles) + 2-4 bitmask checks (~1 cycle each) = ~7-14 cycles.

On an NVMe path with one KABI dispatch per I/O, this saves ~8-36 cycles per operation (~0.03-0.15% of a 10μs NVMe read). The savings compound on paths with multiple KABI calls per I/O (e.g., TCP TX: socket cap + route cap + device cap).

Revocation safety: The KabiDispatchGuard holds an RCU read-side critical section for the duration of the KABI call. Since capability revocation uses RCU-deferred reclamation (see "Capability Revocation Semantics" above), a revocation that starts during a KABI dispatch cannot complete until the dispatch finishes. This means ValidatedCap cannot become stale within its lifetime scope.

8.1.1.2 Capability Table Lifecycle and Garbage Collection

Per-process capability tables (CapSpace) are small arrays indexed by local handle (typically <256 entries per process). Lifecycle:

  • Allocation: Capability table entries are allocated from the process's CapSpace when a capability is created (e.g., open() → new FileDescriptor capability) or received via IPC delegation. Each entry is reference-counted: the CapEntry holds a strong reference to the underlying kernel object.

  • Deallocation: When a capability handle is explicitly closed (e.g., close(fd)), the entry is removed from the CapSpace, and the kernel object's reference count is decremented. If the reference count reaches zero, the kernel object is freed.

  • Process exit: On process exit (do_exit), the kernel iterates the process's CapSpace and drops every entry. For generation-based objects, this decrements the reference count (the generation counter is untouched — other processes' capabilities remain valid if the object is still alive). For indirection-based objects, the indirection entry is marked "revoked" and scheduled for RCU-deferred reclamation.

  • Table exhaustion prevention: CapSpace has a configurable per-process maximum (default: 65536 entries, on the order of commonly configured RLIMIT_NOFILE hard limits on Linux). Attempts to allocate beyond this limit fail with -EMFILE. The system-wide total of live capability entries is bounded by the slab allocator's memory pressure feedback — under memory pressure, capability creation fails with -ENOMEM like any other kernel allocation. There is no unbounded capability table growth.

8.1.1.3 Object Registry

The object registry maps ObjectId values to kernel objects. It is the central data structure enabling capability validation and revocation.

/// Global object registry. Maps ObjectId → kernel object with generation-based
/// revocation. The registry is partitioned per-CPU for allocation (reducing
/// contention) but globally readable for validation (via RCU).
///
/// Slot layout: each slot holds an object pointer, a generation counter, and
/// a type tag. ObjectId encodes both the slot index and the expected generation.
/// Validation compares the ObjectId's generation against the slot's current
/// generation — a mismatch means the capability has been revoked.

/// Unique identifier for a kernel object. Encodes a slot index and generation
/// counter. Two ObjectIds with the same slot index but different generations
/// refer to different objects (the slot was recycled).
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
#[repr(C)]
pub struct ObjectId {
    /// Slot index in the object registry (lower 32 bits).
    /// Maximum 2^32 concurrent objects (~4 billion). In practice, typical
    /// systems use far fewer (tens of thousands of open files, sockets, etc.).
    slot: u32,
    /// Generation counter. Incremented each time a slot is recycled.
    /// Prevents stale capabilities from accessing new objects allocated in
    /// the same slot. u64 to match `Capability.generation` and the slot's
    /// `AtomicU64`; wrapping is not a practical concern (see security
    /// analysis below).
    generation: u64,
}

/// A single slot in the object registry.
struct ObjectSlot {
    /// Pointer to the kernel object. NULL when the slot is free.
    /// The pointer type is erased; the `type_tag` field identifies the
    /// actual type for safe downcasting.
    object: AtomicPtr<()>,
    /// Current generation counter. Incremented on each free/reuse cycle.
    /// Capabilities holding an older generation are automatically invalid.
    /// u64 to match `Capability.generation` and `ObjectId.generation`.
    generation: AtomicU64,
    /// Type discriminant for safe downcasting. Matches the `ObjectType` enum.
    type_tag: AtomicU8,
    /// Free list link. When the slot is free, this holds the index of the
    /// next free slot (forming a per-CPU free list). When allocated, unused.
    next_free: AtomicU32,
}

/// Object type tags for safe downcasting from erased pointers.
#[repr(u8)]
pub enum ObjectType {
    None = 0,       // Slot is free
    File = 1,
    Socket = 2,
    Process = 3,
    Thread = 4,
    MemoryRegion = 5,
    Device = 6,
    IpcChannel = 7,
    Timer = 8,
    Signal = 9,
    // Extensible: new types added as subsystems are implemented.
}

/// Registry of kernel objects accessible via capabilities.
/// Initialized from the boot allocator (Section 4.1.0) before the
/// slab allocator is ready. The `slots` array is allocated via
/// `BootAlloc::alloc_array::<ObjectSlot>(capacity)` at early boot
/// and has kernel lifetime — it is never freed.
///
/// Capacity is determined at boot from the platform's maximum
/// expected concurrent kernel objects (default: 65536 slots,
/// ~2MB at 32 bytes/slot). This is a boot-time parameter, not
/// a compile-time constant.
pub struct ObjectRegistry {
    /// Boot-allocator allocated array of slots. Length = `capacity`.
    /// Raw pointer because the allocation predates the type system's
    /// allocator infrastructure. Slots are RCU-readable (no lock for
    /// lookup) and locked for mutation.
    ///
    /// Capacity tiers (determined from discovered system memory at boot):
    /// - ≤1 GB: 65,536 slots
    /// - ≤16 GB: 262,144 slots
    /// - ≤256 GB: 1,048,576 slots
    /// - >256 GB: 4,194,304 slots
    slots: *mut ObjectSlot,
    /// Number of slots (set at init time, never changes).
    capacity: u32,
    /// Number of currently used slots.
    count: AtomicU32,
    /// Freelist head (index into slots, u32::MAX = empty).
    freelist_head: AtomicU32,
    /// Per-CPU free lists for lock-free allocation. Each CPU maintains a
    /// local free list head. When empty, the CPU steals a batch from the
    /// global free list (under a lock).
    per_cpu_free: PerCpu<AtomicU32>,
    /// Global free list for overflow. Protected by a spinlock, accessed
    /// only when a per-CPU list is exhausted.
    global_free_head: SpinLock<u32>,
}

// # Initialization
// ObjectRegistry::init(boot_alloc, capacity) is called from umka_core::early_init()
// before slab_init(). It uses BootAlloc to allocate the slots array.
// After slab_init(), no further BootAlloc allocations are needed by ObjectRegistry.

Operations:

impl ObjectRegistry {
    /// Allocate a slot and register an object. Returns an ObjectId that
    /// encodes the slot index and current generation.
    ///
    /// Called when creating files, sockets, processes, etc.
    /// Allocation is lock-free on the fast path (per-CPU free list).
    pub fn register<T: KernelObject>(&self, object: *mut T, type_tag: ObjectType) -> Result<ObjectId> {
        let guard = preempt_disable();
        let slot_idx = self.alloc_slot(&guard)?;
        // SAFETY: alloc_slot returns an index < self.capacity; the slots
        // array has kernel lifetime (never freed).
        let slot = unsafe { &*self.slots.add(slot_idx as usize) };
        let gen = slot.generation.load(Ordering::Relaxed);
        slot.object.store(object as *mut (), Ordering::Release);
        slot.type_tag.store(type_tag as u8, Ordering::Release);
        Ok(ObjectId { slot: slot_idx, generation: gen })
    }

    /// Validate an ObjectId: check that the slot's generation matches and
    /// the object is still alive. Called from `cap_validate()` under
    /// `rcu_read_lock()`.
    ///
    /// Returns the object pointer if valid, or CapError::Revoked if the
    /// generation has advanced (capability was revoked).
    pub fn lookup(&self, id: ObjectId) -> Result<(*mut (), ObjectType), CapError> {
        if id.slot >= self.capacity {
            return Err(CapError::Revoked);
        }
        // SAFETY: slot index bounds-checked against capacity above; the
        // slots array has kernel lifetime.
        let slot = unsafe { &*self.slots.add(id.slot as usize) };
        let gen = slot.generation.load(Ordering::Acquire);
        if gen != id.generation {
            return Err(CapError::Revoked);
        }
        let ptr = slot.object.load(Ordering::Acquire);
        if ptr.is_null() {
            return Err(CapError::Revoked);
        }
        let tag = slot.type_tag.load(Ordering::Acquire);
        Ok((ptr, ObjectType::from_u8(tag)))
    }

    /// Revoke all capabilities referring to an object. Increments the slot's
    /// generation counter, making all existing ObjectIds for this slot invalid.
    /// The actual object is freed via RCU deferred callback after a grace period
    /// (ensuring no concurrent `lookup()` readers see a dangling pointer).
    pub fn revoke(&self, id: ObjectId) -> Result<()> {
        // SAFETY: callers pass ObjectIds minted by register(), so the slot
        // index is < capacity; the slots array has kernel lifetime.
        let slot = unsafe { &*self.slots.add(id.slot as usize) };
        // Capture the type tag before clearing it — the deferred destructor
        // needs it to downcast the erased pointer to the concrete type.
        let tag = ObjectType::from_u8(slot.type_tag.load(Ordering::Acquire));
        // Increment generation — all existing capabilities become invalid.
        slot.generation.fetch_add(1, Ordering::Release);
        // Null the pointer (new lookups will fail immediately).
        let old_ptr = slot.object.swap(core::ptr::null_mut(), Ordering::AcqRel);
        slot.type_tag.store(ObjectType::None as u8, Ordering::Release);
        // Defer actual object destruction until RCU grace period completes.
        // This ensures no concurrent reader (in rcu_read_lock) sees a dangling pointer.
        rcu_call(move || {
            // SAFETY: old_ptr was valid when registered; the RCU grace period
            // ensures no readers hold references. drop_erased dispatches on
            // the type tag to run the concrete type's destructor — dropping a
            // Box<()> reconstructed from the erased pointer would not.
            unsafe { drop_erased(old_ptr, tag); }
        });
        // Return slot to per-CPU free list.
        self.free_slot(id.slot);
        Ok(())
    }
}

Security analysis: Generation counter wrapping: a u64 generation wraps after 2^64 reuses of the same slot (~1.8×10^19). At an extreme hypothetical rate of 10 million slot reuses per second, wrap occurs after ~58,000 years — not a practical concern. The u64 width (matching Capability.generation) eliminates generation wrap as a security consideration, at the cost of 4 additional bytes per ObjectId and 4 bytes per ObjectSlot compared to a u32 design (negligible given typical slot counts).

8.1.2 Linux Permission Emulation

Traditional Unix permissions (UIDs, GIDs, file modes, POSIX capabilities) are emulated on top of the UmkaOS capability model. The translation is transparent to applications:

  • UIDs/GIDs: Mapped to capability sets. uid == 0 grants a broad (but still bounded) capability set, not unlimited access.
  • File modes: Translated to per-file capability checks at open time.
  • POSIX capabilities (CAP_NET_RAW, CAP_SYS_ADMIN, etc.): Each maps to a specific set of UmkaOS capabilities.
  • Supplementary groups: Expand the effective capability set.

Applications see standard getuid(), stat(), access() behavior.

8.1.3 System Administration Capabilities

UmkaOS defines a set of core capabilities that map to Linux's POSIX capabilities. These are the native UmkaOS capabilities that form the basis for permission checks:

bitflags! {
    /// Core system administration and access control capabilities.
    /// These are the UmkaOS-native capabilities that correspond to Linux's
    /// POSIX capability set, but with a more granular and typed design.
    /// **Backing type**: `u128` provides 128 bits — enough for all Linux POSIX
    /// capabilities (currently 41, bits 0-40) and UmkaOS-native capabilities,
    /// with room for growth. Linux has added ~1 capability per release cycle;
    /// 128 bits gives decades of headroom. On 64-bit platforms, `u128` is
    /// two registers — capability checks remain fast (two AND+CMP operations).
    ///
    /// **Layout**: Bits 0-63 are reserved for POSIX-compatible capabilities
    /// (matching Linux numbering exactly — bit N = Linux `CAP_*` number N).
    /// Bits 64-127 are UmkaOS-native capabilities that have no Linux equivalent.
    /// Bits 41-63 are reserved for future Linux capabilities.
    pub struct SystemCaps: u128 {
        // ===== POSIX capabilities (bits 0-40, matching Linux exactly) =====
        // These bit positions are 1:1 with Linux's capability numbers.
        // The syscall compat layer (Section 18.1) passes Linux cap numbers
        // through directly: capable(N) → has_cap(1 << N). No translation needed.

        /// CAP_CHOWN (Linux 0): Allow arbitrary file ownership changes.
        const CAP_CHOWN = 1 << 0;
        /// CAP_DAC_OVERRIDE (Linux 1): Bypass all DAC checks.
        const CAP_DAC_OVERRIDE = 1 << 1;
        /// CAP_DAC_READ_SEARCH (Linux 2): Bypass file read and directory search checks.
        const CAP_DAC_READ_SEARCH = 1 << 2;
        /// CAP_FOWNER (Linux 3): Bypass file-owner permission checks.
        const CAP_FOWNER = 1 << 3;
        /// CAP_FSETID (Linux 4): Set set-user-ID and set-group-ID bits.
        const CAP_FSETID = 1 << 4;
        /// CAP_KILL (Linux 5): Send signals to arbitrary processes.
        const CAP_KILL = 1 << 5;
        /// CAP_SETGID (Linux 6): Set arbitrary group IDs.
        const CAP_SETGID = 1 << 6;
        /// CAP_SETUID (Linux 7): Set arbitrary user IDs.
        const CAP_SETUID = 1 << 7;
        /// CAP_SETPCAP (Linux 8): Modify capability sets.
        const CAP_SETPCAP = 1 << 8;
        /// CAP_LINUX_IMMUTABLE (Linux 9): Set immutable and append-only file attributes.
        const CAP_LINUX_IMMUTABLE = 1 << 9;
        /// CAP_NET_BIND_SERVICE (Linux 10): Bind to privileged ports (< 1024).
        const CAP_NET_BIND_SERVICE = 1 << 10;
        /// CAP_NET_BROADCAST (Linux 11): Socket broadcasting and multicast.
        const CAP_NET_BROADCAST = 1 << 11;
        /// CAP_NET_ADMIN (Linux 12): Network administration operations.
        const CAP_NET_ADMIN = 1 << 12;
        /// CAP_NET_RAW (Linux 13): Raw and packet sockets.
        const CAP_NET_RAW = 1 << 13;
        /// CAP_IPC_LOCK (Linux 14): Lock memory (mlock, mlockall, SHM_LOCK).
        const CAP_IPC_LOCK = 1 << 14;
        /// CAP_IPC_OWNER (Linux 15): Override IPC ownership checks.
        const CAP_IPC_OWNER = 1 << 15;
        /// CAP_SYS_MODULE (Linux 16): Load and unload kernel modules.
        const CAP_SYS_MODULE = 1 << 16;
        /// CAP_SYS_RAWIO (Linux 17): Raw I/O operations.
        const CAP_SYS_RAWIO = 1 << 17;
        /// CAP_SYS_CHROOT (Linux 18): Use chroot(2).
        const CAP_SYS_CHROOT = 1 << 18;
        /// CAP_SYS_PTRACE (Linux 19): Trace arbitrary processes.
        /// Note: UmkaOS also provides CAP_DEBUG (bit 68) as the native debugging
        /// capability. The compat layer maps CAP_SYS_PTRACE checks to CAP_DEBUG.
        const CAP_SYS_PTRACE = 1 << 19;
        /// CAP_SYS_PACCT (Linux 20): Configure process accounting.
        const CAP_SYS_PACCT = 1 << 20;
        /// CAP_SYS_ADMIN (Linux 21): Broad system administration.
        /// Note: UmkaOS also provides CAP_ADMIN (bit 64) as the native admin
        /// capability. The compat layer: `capable(CAP_SYS_ADMIN)` checks
        /// `has_cap(CAP_SYS_ADMIN) || has_cap(CAP_ADMIN)`.
        const CAP_SYS_ADMIN = 1 << 21;
        /// CAP_SYS_BOOT (Linux 22): Use reboot(2) and kexec_load(2).
        const CAP_SYS_BOOT = 1 << 22;
        /// CAP_SYS_NICE (Linux 23): Set scheduling policies, nice values.
        const CAP_SYS_NICE = 1 << 23;
        /// CAP_SYS_RESOURCE (Linux 24): Override resource limits (RLIMIT).
        const CAP_SYS_RESOURCE = 1 << 24;
        /// CAP_SYS_TIME (Linux 25): Set system clock and adjtime.
        const CAP_SYS_TIME = 1 << 25;
        /// CAP_SYS_TTY_CONFIG (Linux 26): Configure virtual terminal settings.
        const CAP_SYS_TTY_CONFIG = 1 << 26;
        /// CAP_MKNOD (Linux 27): Create special files using mknod(2).
        const CAP_MKNOD = 1 << 27;
        /// CAP_LEASE (Linux 28): Establish leases on arbitrary files.
        const CAP_LEASE = 1 << 28;
        /// CAP_AUDIT_WRITE (Linux 29): Write records to audit log.
        const CAP_AUDIT_WRITE = 1 << 29;
        /// CAP_AUDIT_CONTROL (Linux 30): Configure audit rules.
        const CAP_AUDIT_CONTROL = 1 << 30;
        /// CAP_SETFCAP (Linux 31): Set file capabilities.
        const CAP_SETFCAP = 1 << 31;
        /// CAP_MAC_OVERRIDE (Linux 32): Override MAC enforcement.
        const CAP_MAC_OVERRIDE = 1 << 32;
        /// CAP_MAC_ADMIN (Linux 33): MAC configuration changes.
        const CAP_MAC_ADMIN = 1 << 33;
        /// CAP_SYSLOG (Linux 34): Privileged syslog operations.
        const CAP_SYSLOG = 1 << 34;
        /// CAP_WAKE_ALARM (Linux 35): Set system wakeup alarms.
        const CAP_WAKE_ALARM = 1 << 35;
        /// CAP_BLOCK_SUSPEND (Linux 36): Prevent system suspending.
        const CAP_BLOCK_SUSPEND = 1 << 36;
        /// CAP_AUDIT_READ (Linux 37): Read audit log messages.
        const CAP_AUDIT_READ = 1 << 37;
        /// CAP_PERFMON (Linux 38): Performance monitoring and observability.
        const CAP_PERFMON = 1 << 38;
        /// CAP_BPF (Linux 39): Load eBPF programs and create maps.
        const CAP_BPF = 1 << 39;
        /// CAP_CHECKPOINT_RESTORE (Linux 40): Checkpoint/restore operations.
        const CAP_CHECKPOINT_RESTORE = 1 << 40;

        // Bits 41-63: Reserved for future Linux capabilities.

        // ===== UmkaOS-native capabilities (bits 64-127) =====
        // These have no Linux equivalent. They provide finer-grained
        // control for UmkaOS-specific subsystems.

        /// CAP_ADMIN: UmkaOS-native administrative capability. Grants broad
        /// administrative access including: mounting/unmounting filesystems,
        /// modifying firewall/routing, loading kernel modules, sysctl changes,
        /// privilege escalation, cgroup/namespace administration, device
        /// management. The compat layer maps `capable(CAP_SYS_ADMIN)` to
        /// check both CAP_SYS_ADMIN (bit 21) and CAP_ADMIN (bit 64) — a
        /// process holding either one passes the check. CAP_ADMIN is the
        /// preferred capability for new UmkaOS-native code paths; CAP_SYS_ADMIN
        /// exists purely for Linux application compatibility.
        const CAP_ADMIN = 1 << 64;

        /// CAP_P2P_DMA: Peer-to-peer DMA operations between devices.
        /// Required for drivers that initiate P2P transactions.
        const CAP_P2P_DMA = 1 << 65;

        /// CAP_NET_LOOKUP: BPF socket/connection table lookups via
        /// bpf_sk_lookup(). Required for BPF load balancers. See Section 15.2.2.
        const CAP_NET_LOOKUP = 1 << 66;

        /// CAP_NET_ROUTE_READ: BPF FIB (routing table) lookups via
        /// bpf_fib_lookup(). Required for XDP forwarding. See Section 15.2.2.
        const CAP_NET_ROUTE_READ = 1 << 67;

        /// CAP_DEBUG: Debug arbitrary processes via ptrace or /proc/pid.
        /// UmkaOS-native debugging capability. CAP_SYS_PTRACE (bit 19) is
        /// the Linux compat alias. See Section 19.3.1 for ptrace capability model.
        const CAP_DEBUG = 1 << 68;

        /// CAP_NS_TRAVERSE: Traverse namespace boundaries for cross-namespace
        /// operations. Each boundary crossing requires this capability.
        /// Combined with CAP_DEBUG, enables cross-namespace ptrace. See Section 19.3.1.
        const CAP_NS_TRAVERSE = 1 << 69;

        /// CAP_MOUNT: Mount/unmount filesystems. Required for mount(2),
        /// umount(2), pivot_root(2). Scoped to caller's mount namespace. See [Section 13.2](13-vfs.md#132-mount-tree-data-structures-and-operations).
        const CAP_MOUNT = 1 << 70;

        /// CAP_VMX: Execute VMX instructions for KVM host operation.
        /// Maps to /dev/kvm open and VMXON/VMXOFF. See Section 17.1.
        const CAP_VMX = 1 << 71;

        /// CAP_CGROUP_ADMIN: Manage cgroup hierarchies — create, modify,
        /// destroy within delegated subtree. See Section 18.1.5.
        const CAP_CGROUP_ADMIN = 1 << 72;

        /// CAP_TPM_SEAL: Seal/unseal data to TPM. See Section 8.3.2.
        const CAP_TPM_SEAL = 1 << 73;

        /// CAP_NET_CONNTRACK: BPF conntrack state query/modify via
        /// bpf_ct_lookup(), bpf_ct_insert(), bpf_ct_set_nat(). See Section 15.2.2.
        const CAP_NET_CONNTRACK = 1 << 74;

        /// CAP_NET_REDIRECT: XDP packet redirect to another interface's
        /// isolation domain. See Section 15.2.2.
        const CAP_NET_REDIRECT = 1 << 75;

        /// CAP_TTY_DIRECT: Zero-copy PTY mode (PtyRingPage mapped into both
        /// master/slave). For container logging optimization. See Section 20.1.2.
        const CAP_TTY_DIRECT = 1 << 76;

        /// CAP_ACCEL_ADMIN: Accelerator device admin (firmware update, reset,
        /// perf counter reset, privileged debug). See Section 21.1.2.2.
        const CAP_ACCEL_ADMIN = 1 << 77;

        // ZFS-specific capabilities (scoped to datasets via object_id, Section 14.2.2).
        /// CAP_ZFS_MOUNT: Mount a ZFS dataset as a filesystem.
        const CAP_ZFS_MOUNT = 1 << 78;
        /// CAP_ZFS_SNAPSHOT: Create and destroy snapshots.
        const CAP_ZFS_SNAPSHOT = 1 << 79;
        /// CAP_ZFS_SEND: Generate a send stream for replication.
        const CAP_ZFS_SEND = 1 << 80;
        /// CAP_ZFS_RECV: Receive a send stream into a dataset.
        const CAP_ZFS_RECV = 1 << 81;
        /// CAP_ZFS_CREATE: Create child datasets within a parent.
        const CAP_ZFS_CREATE = 1 << 82;
        /// CAP_ZFS_DESTROY: Destroy a dataset (highest ZFS privilege).
        const CAP_ZFS_DESTROY = 1 << 83;

        // DLM (Distributed Lock Manager) capabilities. See Section 14.6.14.
        /// CAP_DLM_LOCK: Acquire, convert, release locks in permitted lockspaces.
        const CAP_DLM_LOCK = 1 << 84;
        /// CAP_DLM_ADMIN: Create/destroy lockspaces, configure, view cluster-wide.
        const CAP_DLM_ADMIN = 1 << 85;
        /// CAP_DLM_CREATE: Create new lock resources (app-level via /dev/dlm).
        const CAP_DLM_CREATE = 1 << 86;

        /// CAP_SYS_ADMIN_GLOBAL (bit 87): Cluster-wide system administration.
        /// Extends CAP_SYS_ADMIN to authorize operations with cluster-wide scope:
        /// creating/destroying DLM lockspaces visible to all nodes, applying
        /// cluster-wide sysctl changes, modifying shared overlay network topology,
        /// and bootstrapping cluster membership. Required by cluster management
        /// daemons (e.g., pacemaker, corosync). The compat layer does not map
        /// any Linux capability to CAP_SYS_ADMIN_GLOBAL — it is UmkaOS-native only.
        /// Held in `TaskCredential.cap_permitted`; checked by DLM and cluster IPC.
        const CAP_SYS_ADMIN_GLOBAL = 1 << 87;

        // Bits 88-127: Reserved for future UmkaOS-native capabilities.
    }
}

Key design notes:

  • Bit layout: Bits 0-40 match Linux's capability numbering exactly (bit N = Linux CAP_* number N). This means the compat layer needs no translation for POSIX capability checks — capable(N) simply checks caps & (1 << N). Bits 41-63 are reserved for future Linux capabilities. Bits 64-127 are UmkaOS-native.

  • CAP_ADMIN vs CAP_SYS_ADMIN: CAP_SYS_ADMIN (bit 21) is the Linux-compatible capability at its exact Linux bit position. CAP_ADMIN (bit 64) is the UmkaOS-native administrative capability. The compat layer checks both: capable(CAP_SYS_ADMIN) succeeds if the process holds either CAP_SYS_ADMIN or CAP_ADMIN. New UmkaOS-native code should check CAP_ADMIN; the POSIX bits exist purely for unmodified Linux application compatibility.

  • CAP_SYS_PTRACE vs CAP_DEBUG: Similar pattern. CAP_SYS_PTRACE (bit 19) is the Linux compat bit. CAP_DEBUG (bit 68) is the UmkaOS-native debugging capability. The compat layer maps ptrace capability checks to CAP_SYS_PTRACE || CAP_DEBUG.

  • Granularity over blanket permissions: Individual capabilities (e.g., CAP_NET_ADMIN, CAP_DAC_OVERRIDE) allow least-privilege assignment. A web server needs only CAP_NET_BIND_SERVICE and CAP_SETUID, not CAP_ADMIN.

  • No root-equivalent unlimited access: Even CAP_ADMIN is bounded. It grants the operations listed above but does not bypass hardware isolation domains or allow arbitrary code execution in umka-core. There is no capability that grants "ignore all security checks" — administrative operations are still mediated through the capability system.

POSIX capability mapping: The syscall entry point (Section 18.1) converts Linux capability checks (capable(N)) to SystemCaps bit checks. For POSIX capabilities (0-40), this is a direct 1 << N check with no translation. For caps that have UmkaOS-native equivalents (e.g., CAP_SYS_ADMIN/CAP_ADMIN), the check is has_cap(1 << N) || has_cap(UMKA_NATIVE_EQUIVALENT).
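
The check described above can be sketched as follows, with a plain u128 standing in for the SystemCaps bitfield and only the two dual-mapped capabilities shown (constant names mirror the text; this is an illustration, not the compat-layer source):

```rust
// Bit positions from the design notes above: POSIX bits at their Linux
// numbers, UmkaOS-native equivalents in the 64-127 range.
const CAP_SYS_PTRACE: u32 = 19; // Linux compat bit
const CAP_SYS_ADMIN: u32 = 21;  // Linux compat bit
const CAP_ADMIN: u32 = 64;      // UmkaOS-native administrative capability
const CAP_DEBUG: u32 = 68;      // UmkaOS-native debugging capability

/// Map a Linux capability number to its UmkaOS-native equivalent, if any.
fn native_equivalent(linux_cap: u32) -> Option<u32> {
    match linux_cap {
        CAP_SYS_ADMIN => Some(CAP_ADMIN),
        CAP_SYS_PTRACE => Some(CAP_DEBUG),
        _ => None,
    }
}

/// capable(N): a direct `1 << N` check, OR'd with the native equivalent.
fn capable(caps: u128, linux_cap: u32) -> bool {
    let has = |bit: u32| caps & (1u128 << bit) != 0;
    has(linux_cap) || native_equivalent(linux_cap).map_or(false, has)
}

fn main() {
    // A process holding only native CAP_ADMIN passes a Linux
    // CAP_SYS_ADMIN check made by an unmodified application.
    let caps = 1u128 << CAP_ADMIN;
    assert!(capable(caps, CAP_SYS_ADMIN));
    assert!(!capable(caps, CAP_SYS_PTRACE));
}
```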

8.1.4 Dual ACL Model: POSIX Draft ACLs + NFSv4 ACLs

UmkaOS supports two ACL models as first-class citizens, designed into the VFS layer from the start:

POSIX Draft ACLs (IEEE 1003.1e/1003.2c):

  • Standard Linux ACL model (setfacl/getfacl)
  • ACL_USER, ACL_GROUP, ACL_MASK, ACL_OTHER entry types
  • Default ACLs for directory inheritance
  • Required for ext4, XFS, btrfs compatibility

NFSv4 ACLs (RFC 7530 / RFC 8881):

  • Richer model used by NFS, ZFS, and most modern Unix systems (FreeBSD, Solaris/illumos)
  • Explicit ALLOW/DENY ACE ordering (access control entries processed in order)
  • Fine-grained permissions: READ_DATA, WRITE_DATA, APPEND_DATA, READ_NAMED_ATTRS, WRITE_NAMED_ATTRS, EXECUTE, DELETE_CHILD, READ_ATTRIBUTES, WRITE_ATTRIBUTES, DELETE, READ_ACL, WRITE_ACL, WRITE_OWNER, SYNCHRONIZE
  • Inheritance flags: FILE_INHERIT, DIRECTORY_INHERIT, NO_PROPAGATE_INHERIT, INHERIT_ONLY
  • Automatic inheritance tracking (for efficient subtree ACL changes)

VFS ACL Abstraction:

pub enum AclModel {
    PosixDraft,  // POSIX.1e draft ACLs
    Nfsv4,       // NFSv4/ZFS-style rich ACLs
}

pub trait VfsAcl {
    /// Which ACL model this filesystem uses
    fn acl_model(&self) -> AclModel;
    /// Get the effective ACL for an inode
    fn get_acl(&self, inode: InodeId, acl_type: AclType) -> Result<Acl, Error>;
    /// Set an ACL on an inode
    fn set_acl(&self, inode: InodeId, acl_type: AclType, acl: &Acl) -> Result<(), Error>;
    /// Check access (called by the permission check path)
    fn check_access(&self, inode: InodeId, who: &Principal, mask: AccessMask) -> Result<(), Error>;
}

Filesystem support:

  • ext4, XFS, btrfs: POSIX draft ACLs (native) + NFSv4 via translation layer
  • ZFS: NFSv4 ACLs (native)
  • NFS client: NFSv4 ACLs (native, passed through to server)
  • tmpfs, procfs, sysfs: POSIX draft ACLs (simple model sufficient)

Translation layer: When a filesystem natively uses one model but the user/application requests the other, a translation layer converts between them. The translation is lossy in some edge cases (NFSv4 DENY entries have no POSIX equivalent), but covers common use cases.

POSIX draft ACL → NFSv4 ACL (always lossless for ALLOW-only POSIX ACLs):

For each POSIX ACL entry in order:
  ACL_USER_OBJ  → NFSv4 ALLOW ACE for OWNER@  with rwxp permissions from entry
  ACL_USER(uid) → NFSv4 ALLOW ACE for user:uid with rwxp permissions from entry
  ACL_GROUP_OBJ → NFSv4 ALLOW ACE for GROUP@  with (permissions & MASK) from entry
  ACL_GROUP(gid)→ NFSv4 ALLOW ACE for group:gid with (permissions & MASK) from entry
  ACL_MASK      → Not translated as its own ACE; applied as mask to GROUP entries above
  ACL_OTHER     → NFSv4 ALLOW ACE for EVERYONE@ with permissions from entry

Default ACL (directory inheritance):
  Each entry above additionally receives FILE_INHERIT | DIRECTORY_INHERIT flags.
  ACL_DEFAULT entries with ACL_USER_OBJ / ACL_GROUP_OBJ / ACL_OTHER receive
  INHERIT_ONLY if the access ACL already covers owner/group/other.
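
The ALLOW-entry translation above can be sketched as follows. The types are illustrative stand-ins for the UmkaOS VFS ACL types, with permissions reduced to three rwx bits; the point is the per-entry mapping and the MASK being applied to group-class entries rather than emitted as its own ACE:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Who { OwnerAt, GroupAt, EveryoneAt, User(u32), Group(u32) }

#[derive(Clone, Copy)]
enum PosixTag { UserObj, User(u32), GroupObj, Group(u32), Mask, Other }

struct PosixEntry { tag: PosixTag, perms: u8 } // rwx bits: 0b100/0b010/0b001

#[derive(Debug, PartialEq)]
struct AllowAce { who: Who, perms: u8 }

/// POSIX draft ACL -> NFSv4 ALLOW ACEs, per the mapping above.
fn posix_to_nfs4(acl: &[PosixEntry]) -> Vec<AllowAce> {
    // ACL_MASK is not emitted as an ACE; it caps group-class entries.
    let mask = acl.iter()
        .find_map(|e| matches!(e.tag, PosixTag::Mask).then(|| e.perms))
        .unwrap_or(0b111);
    acl.iter().filter_map(|e| match e.tag {
        PosixTag::UserObj   => Some(AllowAce { who: Who::OwnerAt,    perms: e.perms }),
        PosixTag::User(uid) => Some(AllowAce { who: Who::User(uid),  perms: e.perms }),
        PosixTag::GroupObj  => Some(AllowAce { who: Who::GroupAt,    perms: e.perms & mask }),
        PosixTag::Group(g)  => Some(AllowAce { who: Who::Group(g),   perms: e.perms & mask }),
        PosixTag::Other     => Some(AllowAce { who: Who::EveryoneAt, perms: e.perms }),
        PosixTag::Mask      => None,
    }).collect()
}

fn main() {
    let acl = [
        PosixEntry { tag: PosixTag::UserObj,  perms: 0b111 },
        PosixEntry { tag: PosixTag::GroupObj, perms: 0b111 },
        PosixEntry { tag: PosixTag::Mask,     perms: 0b101 },
        PosixEntry { tag: PosixTag::Other,    perms: 0b100 },
    ];
    let aces = posix_to_nfs4(&acl);
    assert_eq!(aces.len(), 3);        // MASK emits no ACE of its own
    assert_eq!(aces[1].perms, 0b101); // GROUP@ capped by the mask
}
```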

NFSv4 ACL → POSIX draft ACL (lossy — DENY ACEs are discarded):

Pass 1 — extract ALLOW ACEs only (DENY ACEs have no POSIX equivalent):
  OWNER@  ALLOW → ACL_USER_OBJ  with the ACE permissions
  GROUP@  ALLOW → ACL_GROUP_OBJ with the ACE permissions
  EVERYONE@ ALLOW → ACL_OTHER   with the ACE permissions
  user:uid ALLOW → ACL_USER(uid) entry
  group:gid ALLOW → ACL_GROUP(gid) entry

Pass 2 — compute MASK:
  MASK = union of all GROUP@ and ACL_GROUP(gid) ALLOW permissions.
  This matches the POSIX MASK semantics (effective group permission limit).

Pass 3 — inheritance:
  ACEs with FILE_INHERIT | DIRECTORY_INHERIT → contribute to default ACL.
  ACEs with INHERIT_ONLY → contribute only to default ACL, not access ACL.

Lossy cases:
  DENY ACE                    → logged and discarded (no POSIX representation)
  Ordered DENY before ALLOW   → first-match ordering is lost; effective access
                                  may differ between models for complex rule sets
  NFSv4-only permission bits  → READ_NAMED_ATTRS, WRITE_NAMED_ATTRS, SYNCHRONIZE
    (not in POSIX model)         are silently dropped in translation

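Pass 2 of the reverse mapping reduces to a union over group-class ALLOW entries; a minimal sketch with illustrative types (DENY ACEs are filtered out exactly as Pass 1 prescribes):

```rust
#[derive(Clone, Copy)]
enum Who { OwnerAt, GroupAt, EveryoneAt, User(u32), Group(u32) }

#[derive(Clone, Copy)]
enum AceType { Allow, Deny }

struct Ace { ace_type: AceType, who: Who, perms: u8 } // rwx bits

/// Pass 2: MASK = union of all GROUP@ and group:gid ALLOW permissions.
fn compute_mask(acl: &[Ace]) -> u8 {
    acl.iter()
        .filter(|a| matches!(a.ace_type, AceType::Allow))           // DENY discarded
        .filter(|a| matches!(a.who, Who::GroupAt | Who::Group(_)))  // group class only
        .fold(0, |m, a| m | a.perms)
}

fn main() {
    let acl = [
        Ace { ace_type: AceType::Allow, who: Who::OwnerAt,  perms: 0b111 },
        Ace { ace_type: AceType::Deny,  who: Who::Group(7), perms: 0b010 }, // dropped
        Ace { ace_type: AceType::Allow, who: Who::GroupAt,  perms: 0b100 },
        Ace { ace_type: AceType::Allow, who: Who::Group(7), perms: 0b001 },
    ];
    assert_eq!(compute_mask(&acl), 0b101);
}
```
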
POSIX→NFSv4 ACL mapping follows RFC 7530 §6.2.1 (NFS Version 4 Protocol). The mapping algorithm is deterministic:

  • POSIX mode bits {r, w, x} for {owner, group, other} map to ACE4_ACCESS_ALLOWED_ACE_TYPE entries with the corresponding NFSv4 access mask bits.
  • POSIX default ACL entries: emit Mask-ACEs first, Access-ACEs second.
  • Reference implementation: nfs4_acl_posix_to_nfs4() in Linux fs/nfsd/nfs4acl.c (for cross-reference; UmkaOS implements independently per RFC 7530).

The translation implementation is cross-referenced with the NFSv4 ACL mapping used by ZFS (PSARC 2006/496) and FreeBSD's acl_nfs4_posix.c.

Linux compatibility: The getxattr/setxattr syscalls support both system.posix_acl_access/system.posix_acl_default (POSIX) and system.nfs4_acl (NFSv4) extended attribute names. Tools like nfs4_getfacl/nfs4_setfacl work unmodified.

8.1.5 Driver Sandboxing

Each driver tier has a different sandboxing model:

Tier 1 (domain-isolated):

  • Memory isolation via hardware domain keys (cannot access core or other drivers)
  • DMA fencing via IOMMU (cannot DMA to arbitrary physical addresses)
  • Capability restrictions: driver only receives capabilities for its own devices
  • No access to: page tables, capability tables, scheduler state, other driver state

Tier 2 (process-isolated):

  • Full address space isolation (separate page tables)
  • IOMMU DMA fencing
  • seccomp-like syscall filtering: driver can only invoke KABI syscalls, not arbitrary Linux syscalls
  • Resource limits: memory, CPU time, file descriptors
  • Mandatory capability-based access control

All drivers:

  • Least-privilege capability grants based on driver manifest and system policy
  • Cryptographic signature verification (optional, configurable)
  • Audit logging of all capability-mediated operations (when audit is enabled)

See also: Section 8.6 (Confidential Computing) extends capability-based isolation to TEE enclaves with hardware-encrypted memory (SEV-SNP, TDX, ARM CCA). Section 8.5 (Post-Quantum Cryptography) ensures capability tokens and distributed credentials remain secure against quantum attacks via algorithm-agile crypto abstractions.

8.1.6 Security by Default

Linux problem: MAC (SELinux/AppArmor) exists but is complex and usually permissive by default. POSIX capabilities were supposed to replace setuid but are clunky and poorly adopted.

UmkaOS design:

  • The capability model is the foundation, not an add-on. Every resource access goes through capability checks — there is no "bypass" path.
  • Default-deny: Processes start with an empty capability set. They receive capabilities explicitly from their parent or from exec-time capability grants (replacing setuid).
  • No setuid binaries: The setuid bit on executables maps to "grant these capabilities on exec" — the process never actually runs as uid 0 with unlimited power. This is transparent to applications that check geteuid() == 0 (the compat layer handles this).
  • LSM hooks at all security-relevant points: Pre-integrated, always active, not optional kernel config. SELinux and AppArmor policy engines are loadable modules that attach to these hooks.
  • Application profiles: Ship default confinement profiles for common services (sshd, nginx, postgres) — applications are sandboxed out of the box.
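
The default-deny and exec-time grant rules can be sketched as follows. All names are hypothetical (the real grant would come from the binary's signed manifest); the point is that a child's capability set is exactly what it inherits plus what is explicitly granted, and is empty by default:

```rust
#[derive(Default, Clone, Copy)]
struct CapSet(u128);

/// Exec-time grant attached to a binary, replacing the setuid bit.
struct ExecGrant { caps: u128 }

/// Compute the capability set of the new process image. Default-deny:
/// with no grant and no inheritance, the child starts empty.
fn caps_on_exec(parent: CapSet, inheritable: u128, grant: Option<&ExecGrant>) -> CapSet {
    let inherited = parent.0 & inheritable;
    let granted = grant.map_or(0, |g| g.caps);
    CapSet(inherited | granted)
}

fn main() {
    const CAP_NET_BIND_SERVICE: u32 = 10;
    let parent = CapSet(0); // unprivileged parent
    let grant = ExecGrant { caps: 1 << CAP_NET_BIND_SERVICE };
    // A web server binary is granted exactly what it needs, nothing more.
    let child = caps_on_exec(parent, 0, Some(&grant));
    assert_eq!(child.0, 1 << CAP_NET_BIND_SERVICE);
    // No grant, no inheritance: empty capability set.
    assert_eq!(caps_on_exec(parent, 0, None).0, 0);
}
```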


8.2 Verified Boot Chain

Inspired by: ChromeOS Verified Boot, Android dm-verity, UEFI Secure Boot. IP status: Clean — industry standards (UEFI Secure Boot, TCG TPM), standard cryptographic constructions (Merkle trees, 1979). Public specifications.

8.2.1 Problem

Section 2.1 defines the boot sequence but does not address boot integrity. A compromised bootloader or kernel image is the most dangerous attack vector — it undermines all runtime security.

Additionally, in production deployments (cloud, embedded, enterprise), operators need assurance that the running kernel is exactly what they deployed, with no tampering.

8.2.2 Boot Chain Verification

Firmware (UEFI Secure Boot)
    |
    | Verifies: GRUB/bootloader signature
    v
Bootloader (GRUB2 / systemd-boot)
    |
    | Verifies: UmkaOS image signature
    v
UmkaOS Boot Stub (early Rust/asm)
    |
    | Verifies: initramfs signature
    v
UmkaOS Core Initialization
    |
    | Verifies: Tier 0 driver integrity (embedded in kernel image)
    | Verifies: Tier 1 driver signatures (loaded from initramfs/rootfs)
    v
Running System
    |
    | dm-verity: runtime block-level integrity verification
    v
Verified Root Filesystem

Each step verifies the next. A break at any point halts boot (or falls back to a known good configuration).

8.2.3 Kernel Image Signing

The UmkaOS image (vmlinuz-umka-VERSION) is signed during the build process:

// Build-time signature structure appended to kernel image.
#[repr(C)]
pub struct KernelSignature {
    /// Magic: "IKSIG\0\0\0"
    pub magic: [u8; 8],
    /// Signature algorithm ID (matches SignatureAlgorithm enum, Section 8.5.2):
    /// 0x0001 = Ed25519, 0x0002 = RSA-4096-PSS,
    /// 0x0100 = ML-DSA-44, 0x0101 = ML-DSA-65, 0x0110 = SLH-DSA-128f,
    /// 0x0200 = hybrid Ed25519 + ML-DSA-65.
    /// Algorithm ID 0 is reserved/invalid — a zero value indicates an
    /// unsigned or corrupt image.
    pub algorithm: u32,
    /// Length of actual signature data within the signature buffer.
    pub sig_len: u32,
    /// SHA-256 hash of the unsigned kernel image (informational).
    /// This field records the hash at signing time for offline auditing
    /// and debugging (e.g., comparing against a known-good manifest).
    /// It is NOT used during boot verification — the boot stub computes
    /// a fresh hash (step 3 below) and verifies the signature over that,
    /// eliminating TOCTOU attacks on a stored hash value.
    pub image_hash: [u8; 32],
    /// Signature buffer. Sized for the largest supported algorithm
    /// rounded up to 512-byte alignment:
    /// ML-DSA-65 = 3,309 bytes, SLH-DSA-128f = 17,088 bytes,
    /// hybrid = Ed25519 (64) + ML-DSA-65 (3,309) = 3,373 bytes.
    /// SLH-DSA-128f is the worst case at 17,088 bytes.
    /// Buffer = ceil(17,088 / 512) * 512 = 17,408 bytes.
    /// Only sig_len bytes are meaningful; the rest is zero-padded.
    pub signature: [u8; 17_408],
    /// Public key fingerprint (SHA-256 of the public key).
    pub key_fingerprint: [u8; 32],
}

Early boot verification memory: PQC signature verification requires scratch memory for intermediate computations during early boot (before the slab allocator is initialized). This scratch space is provided by a statically-allocated .bss section buffer: static VERIFY_SCRATCH: [u8; 20_480]. The buffer is sized for the worst-case algorithm used at boot: SLH-DSA-128f verification requires up to ~17,088 bytes of working space (matching the signature size), plus SHA-256 streaming state (~200 bytes) and alignment padding, totalling under 18KB. The 20,480-byte buffer (20KB, a multiple of 4096) provides sufficient headroom for all supported algorithms: ML-DSA-65 (~4KB), SLH-DSA-128f (~17,088 bytes), and hybrid Ed25519+ML-DSA-65 (~4KB). The boot stub uses this fixed buffer exclusively; it is released after the kernel signature is verified. The slab allocator path (SignatureData::Heap) is used only for runtime driver signature verification after boot.
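
The sizing argument above can be captured as compile-time checks; the constants mirror the figures in the text and the names are illustrative, not the boot stub's actual identifiers:

```rust
// Figures from the text above (illustrative constant names).
const SLH_DSA_128F_WORKSPACE: usize = 17_088; // worst-case verify working space
const SHA256_STREAM_STATE: usize = 200;       // streaming hash state, approx.
const ALIGN_PAD: usize = 64;                  // alignment padding allowance
const VERIFY_SCRATCH_LEN: usize = 20_480;     // 20 KB static .bss buffer

// Reject at compile time any configuration where the worst case stops fitting,
// and keep the buffer a whole number of 4096-byte pages.
const _: () = assert!(SLH_DSA_128F_WORKSPACE + SHA256_STREAM_STATE + ALIGN_PAD <= VERIFY_SCRATCH_LEN);
const _: () = assert!(VERIFY_SCRATCH_LEN % 4096 == 0);

fn main() {
    // Headroom remaining after worst-case (SLH-DSA-128f) use.
    let headroom = VERIFY_SCRATCH_LEN - (SLH_DSA_128F_WORKSPACE + SHA256_STREAM_STATE + ALIGN_PAD);
    assert!(headroom >= 3_000);
    println!("headroom = {} bytes", headroom);
}
```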

Verification flow in the boot stub:

1. Boot stub finds KernelSignature at the end of the image.
2. Reads the public key from:
   a. UEFI Secure Boot db (if UEFI boot), OR
   b. Embedded in bootloader (if BIOS boot), OR
   c. Kernel command line: umka.verify_key=<fingerprint>
      (only honored in umka.verify=warn or umka.verify=off modes;
       ignored in umka.verify=enforce mode — see security restriction below)
3. Computes SHA-256 of the image (excluding signature) → fresh_hash.
4. Verifies signature over fresh_hash (not over stored image_hash).
   This eliminates the TOCTOU window: the hash used for signature
   verification is the one just computed, not a stored value an
   attacker could tamper with between steps.
5. If verification fails:
   a. If umka.verify=enforce (default in production): halt boot.
   b. If umka.verify=warn: log warning, continue (development mode).
   c. If umka.verify=off: skip (testing only).

Security restriction: In umka.verify=enforce mode, the umka.verify_key command-line parameter is ignored — the verification key MUST come from the firmware (UEFI db variable or DTB /chosen/umka,verify-key node), which is itself part of the measured/verified boot chain. The command-line parameter is only honored in umka.verify=warn (development) and umka.verify=off (debugging) modes. This prevents an attacker who controls the boot loader command line from substituting their own verification key.
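
A minimal sketch of steps 3-5 above, using a toy 64-bit FNV-1a hash in place of SHA-256 and equality against a trusted digest in place of real signature verification (all names hypothetical). The point is the control flow: the check is made over the freshly computed hash, never over the stored image_hash:

```rust
#[derive(Clone, Copy)]
enum VerifyMode { Enforce, Warn, Off } // umka.verify= values

// Toy FNV-1a hash standing in for SHA-256.
fn fnv1a(data: &[u8]) -> u64 {
    data.iter().fold(0xcbf29ce484222325u64, |h, b| (h ^ *b as u64).wrapping_mul(0x100000001b3))
}

/// Returns Ok(()) if boot may proceed. `trusted_digest` models a signature
/// that commits to the genuine image's hash.
fn verify_image(image: &[u8], trusted_digest: u64, mode: VerifyMode) -> Result<(), &'static str> {
    let fresh_hash = fnv1a(image);         // step 3: fresh hash, no TOCTOU window
    let ok = fresh_hash == trusted_digest; // step 4: verify over fresh_hash
    match (ok, mode) {
        (true, _) => Ok(()),
        (false, VerifyMode::Enforce) => Err("verification failed: halting boot"),
        (false, VerifyMode::Warn) => { eprintln!("WARN: unverified image"); Ok(()) }
        (false, VerifyMode::Off) => Ok(()),
    }
}

fn main() {
    let image = b"umka kernel image";
    let digest = fnv1a(image);
    assert!(verify_image(image, digest, VerifyMode::Enforce).is_ok());
    assert!(verify_image(b"tampered", digest, VerifyMode::Enforce).is_err());
    assert!(verify_image(b"tampered", digest, VerifyMode::Warn).is_ok());
}
```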

8.2.4 Hibernate Secret

The hibernate subsystem (Section 17.2, 17-virtualization.md) uses a hibernate secret — a 256-bit random key generated at boot from the hardware RNG (RDRAND/RNDR) — to HMAC-authenticate hibernate images against tampering. The key is stored only in kernel memory and destroyed on shutdown.

  • TPM systems: The hibernate secret is sealed to the TPM via TPM2_Create() bound to the current PCR policy (PCRs 0-12). PCR 12 is included because it contains Tier 1 driver measurements (Section 8.3.1) — excluding it would allow a compromised or swapped storage driver to load a tampered hibernate image while still passing the TPM seal policy. Only a boot configuration matching the original (including the exact set of loaded Tier 1 drivers) can unseal the key.
  • Non-TPM systems: The hibernate secret does not survive a cold reboot. Hibernate images are only valid for warm suspend/resume cycles where the bootloader preserves the kernel's reserved memory region. Caveat: this mechanism requires a cooperative bootloader (e.g., GRUB with memmap= or a custom UEFI stub that avoids reclaiming the reserved region). On systems where the bootloader does not preserve kernel memory, non-TPM hibernate provides no cryptographic protection against image tampering — it only provides crash-recovery semantics (detecting corruption, not preventing forgery). Deployments requiring hibernate image authentication should use TPM-based sealing.

This key is NOT related to the kernel verification key (Section 8.2.3) or the IMA HMAC key (Section 8.4) — each subsystem derives its own independent secret.

8.2.5 Driver Signature Verification

Tier 1 drivers are signed (Section 10.4 mentions "cryptographically signed drivers can be granted Tier 1"). The device registry (Section 10.5) enforces this during the Loading → Initializing transition.

// In driver ELF binary, `.kabi_sig` section.
#[repr(C)]
pub struct DriverSignature {
    pub magic: [u8; 8],         // "NDSIG\0\0\0"
    /// Algorithm IDs: same as KernelSignature.algorithm (SignatureAlgorithm
    /// enum, Section 8.5.2). Driver signatures typically use ML-DSA-44
    /// (2,420 bytes).
    pub algorithm: u32,
    pub sig_len: u32,
    pub binary_hash: [u8; 32],  // SHA-256 of driver ELF (excluding .kabi_sig)
    /// Sized for worst-case algorithm (SLH-DSA-128f = 17,088 bytes,
    /// rounded up to 512-byte alignment = 17,408 bytes). This ensures
    /// the struct is algorithm-agile: the same binary layout works for
    /// any supported signature algorithm without recompilation.
    /// ML-DSA-44 (common case for drivers) uses 2,420 bytes; rest zero-padded.
    /// sig_len indicates how many bytes are meaningful.
    pub signature: [u8; 17_408],
    pub key_fingerprint: [u8; 32],
    pub signer_name: [u8; 64],  // Human-readable signer identity
}

Policy (integrates with existing tier trust model):

Verification         Tier 0                         Tier 1                            Tier 2
Signature required?  Always (part of kernel image)  Configurable (default: required)  Optional
Unsigned behavior    Cannot exist unsigned          Demoted to Tier 2                 Allowed (user-space isolation)
Tampered binary      Kernel does not boot           Loading rejected, alert           Loading rejected, alert
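
The policy table reduces to a small decision function; a sketch with illustrative types (Tier 0 is omitted since it cannot be loaded separately from the kernel image):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier { Tier1, Tier2 }

#[derive(Clone, Copy)]
enum SigStatus { Valid, Missing, Tampered }

#[derive(Debug, PartialEq)]
enum LoadDecision { Load(Tier), DemoteToTier2, Reject }

/// Apply the tier policy table to a driver load request.
fn decide(requested: Tier, sig: SigStatus, tier1_requires_sig: bool) -> LoadDecision {
    match (requested, sig) {
        // Tampered binary: rejected at every tier (and an alert is raised).
        (_, SigStatus::Tampered) => LoadDecision::Reject,
        // Unsigned Tier 1 request under the default policy: demote.
        (Tier::Tier1, SigStatus::Missing) if tier1_requires_sig => LoadDecision::DemoteToTier2,
        // Valid signature, or unsigned where policy allows it.
        (t, _) => LoadDecision::Load(t),
    }
}

fn main() {
    assert_eq!(decide(Tier::Tier1, SigStatus::Valid, true), LoadDecision::Load(Tier::Tier1));
    assert_eq!(decide(Tier::Tier1, SigStatus::Missing, true), LoadDecision::DemoteToTier2);
    assert_eq!(decide(Tier::Tier2, SigStatus::Missing, true), LoadDecision::Load(Tier::Tier2));
    assert_eq!(decide(Tier::Tier1, SigStatus::Tampered, true), LoadDecision::Reject);
}
```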

8.2.6 Runtime Filesystem Integrity (dm-verity)

For production deployments, the root filesystem can be verified on every block read using Merkle tree verification. This detects tampering of any file at read time.

Root Filesystem Device           Hash Tree Device
+------------------+             +------------------+
| Block 0          | --hash-->   | Leaf hash 0      |
| Block 1          | --hash-->   | Leaf hash 1      |
| Block 2          | --hash-->   | Leaf hash 2      |
| Block 3          | --hash-->   | Leaf hash 3      |
| ...              |             | ...              |
+------------------+             +------------------+
                                         |
                                   +-----+-----+
                                   |           |
                                 Node 0-1    Node 2-3
                                   |           |
                                   +-----+-----+
                                         |
                                    Root Hash
                                   (signed, stored
                                    in kernel cmdline
                                    or bootloader)

Implementation in UmkaOS Block I/O layer:

// umka-block/src/verity.rs

/// Maximum supported hash tree depth. With 4096-byte hash blocks and
/// 32-byte hashes (128-way fanout), a depth of 20 covers 128^20 blocks,
/// far exceeding any physical device.
const VERITY_MAX_TREE_DEPTH: usize = 20;

pub struct VerityTarget {
    /// Underlying data device.
    data_device: BlockDeviceHandle,

    /// Hash tree device (can be same device, appended).
    hash_device: BlockDeviceHandle,

    /// Hash algorithm: SHA-256 (default).
    hash_algorithm: HashAlgorithm,

    /// Block size for hashing (default: 4096).
    hash_block_size: u32,

    /// Root hash (trusted). Provenance (exactly one must succeed):
    ///  1. Embedded in the signed kernel image — verified transitively by the
    ///     kernel signature check in Section 8.2.3. No separate signature needed.
    ///  2. Passed as a kernel command-line parameter (`umka.verity.root_hash=`)
    ///     when the cmdline itself is measured into TPM PCR 11 (Section 8.3.1). The
    ///     TPM attestation chain verifies the cmdline has not been tampered with.
    ///  3. Loaded from initramfs (file `/etc/verity/root_hash`) when the
    ///     initramfs is IMA-appraised (Section 8.4) and measured into TPM PCR 10.
    /// An unauthenticated root hash from an unverified kernel cmdline is
    /// rejected unless `umka.verify=off`.
    root_hash: [u8; 32],

    /// Hash tree depth (computed from device size).
    tree_depth: u32,

    /// Pre-computed hash tree level offsets.
    /// A typical dm-verity device has at most ~20 hash tree levels
    /// (a 20-level tree covers 4096^20 blocks, far exceeding any
    /// physical device). A fixed-size array avoids heap allocation
    /// entirely, which is essential during early boot before the
    /// general-purpose allocator is initialized.
    level_offsets: [u64; VERITY_MAX_TREE_DEPTH],

    /// Number of valid entries in level_offsets (always <= VERITY_MAX_TREE_DEPTH).
    level_count: u32,

    /// Error behavior on verification failure.
    error_behavior: VerityErrorBehavior,

    /// Cache of verified hashes (avoid re-hashing on repeated reads).
    /// Capacity capped at 64 Ki entries (~2 MB) regardless
    /// of device size, covering the hot working set of recently-accessed blocks.
    /// None until the slab allocator is online; verification operates
    /// without caching. Initialized to Some(...) during block subsystem
    /// init.
    hash_cache: Option<LruCache<u64, [u8; 32]>>,
}

#[repr(u32)]
pub enum VerityErrorBehavior {
    /// Return -EIO for the failed block (default).
    Eio     = 0,
    /// Kernel panic (for security-critical deployments).
    Panic   = 1,
    /// Log and continue (for debugging).
    Ignore  = 2,
}
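
A sketch of the per-block read path: hash the data block, consult the cache, then compare against the trusted leaf hash. A toy FNV-1a hash stands in for SHA-256 and a plain HashMap for the LRU cache; the Option models "None until the slab allocator is online":

```rust
use std::collections::HashMap;

// Toy FNV-1a hash standing in for SHA-256.
fn fnv1a(data: &[u8]) -> u64 {
    data.iter().fold(0xcbf29ce484222325u64, |h, b| (h ^ *b as u64).wrapping_mul(0x100000001b3))
}

struct Verity {
    leaf_hashes: Vec<u64>,            // hash tree leaves, trusted via the root hash
    cache: Option<HashMap<u64, u64>>, // block index -> verified hash; None pre-allocator
}

impl Verity {
    /// Verify one data block on read. Err models VerityErrorBehavior::Eio.
    fn verify_block(&mut self, index: u64, data: &[u8]) -> Result<(), &'static str> {
        let h = fnv1a(data);
        if let Some(cache) = &self.cache {
            if cache.get(&index) == Some(&h) {
                return Ok(()); // recently verified: skip the tree walk
            }
        }
        if self.leaf_hashes[index as usize] != h {
            return Err("verity mismatch: -EIO");
        }
        if let Some(cache) = &mut self.cache {
            cache.insert(index, h); // remember the verified hash
        }
        Ok(())
    }
}

fn main() {
    let block = b"block 0 contents".to_vec();
    let mut v = Verity { leaf_hashes: vec![fnv1a(&block)], cache: Some(HashMap::new()) };
    assert!(v.verify_block(0, &block).is_ok());      // verified, cached
    assert!(v.verify_block(0, b"tampered").is_err()); // tampering detected at read time
}
```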

Activation via kernel command line (standard dm-verity syntax):

root=/dev/dm-0
dm-mod.create="umka-root,,,ro,0 <size> verity 1 /dev/sda1 /dev/sda2 4096 4096 <blocks> <blocks> sha256 <root_hash> <salt>"

Or via the device-mapper ioctl interface (DM_TABLE_LOAD), which veritysetup from cryptsetup uses. Existing tools work unmodified.

8.2.7 Key Revocation

Driver signature verification checks against a Key Revocation List (KRL) to handle compromised signing keys:

  • The KRL is a signed, append-only list of revoked key fingerprints, embedded in the initramfs and updated with each kernel/initramfs build.
  • On boot, the KRL is loaded and checked before any driver loading. Any driver signed with a revoked key is rejected, regardless of tier.
  • At runtime, a new KRL can be loaded via /proc/umka/security/krl (requires admin capability). This allows revoking keys without rebooting.
  • UEFI Secure Boot dbx provides bootloader/kernel-level revocation (existing standard mechanism). The KRL extends this to cover driver-level revocation.

// Kernel-internal

/// Key revocation list.
///
/// **Lifetime management**: Both the `KeyRevocationList` struct and the
/// `revoked_keys` array it points to are allocated together as a single
/// slab allocation (struct header + trailing key array). Boot-time KRLs
/// are bump-allocated and have true `'static` lifetime. Runtime-loaded
/// KRLs are slab-allocated and managed via RCU: the old KRL (struct +
/// key array) remains valid until all readers complete their RCU
/// read-side critical section, then the entire slab slot is freed.
/// Readers must access the struct and its `revoked_keys` only within
/// `rcu_read_lock()` / `rcu_read_unlock()`.
#[repr(C)]
pub struct KeyRevocationList {
    /// Signature over the KRL itself (prevents tampering).
    /// Sized for worst-case PQC signature (SLH-DSA-128f = 17,088 bytes,
    /// rounded up to 512-byte alignment = 17,408 bytes).
    pub signature: [u8; 17_408],
    /// Actual length of the signature in `signature[]`.
    /// Algorithms produce different-length signatures (Ed25519: 64 bytes,
    /// ML-DSA-65: 3,309 bytes, SLH-DSA-128f: 17,088 bytes).
    pub sig_len: u32,
    /// Algorithm hint for this KRL's signature. Matches `SignatureAlgorithm`
    /// enum (Section 8.5.2). This field is a parser optimization hint only —
    /// the verifier MUST NOT trust it as authoritative. The actual verification
    /// algorithm is determined by the signing key: each public key in the
    /// kernel's trusted keyring has a fixed algorithm, and the verifier enforces
    /// that `algorithm` matches the key's algorithm. A mismatch causes
    /// verification failure (`-EKEYREJECTED`), preventing algorithm-confusion
    /// attacks where an attacker specifies a weaker algorithm in a forged KRL.
    pub algorithm: u32,
    // **Trust anchor**: KRL signatures are verified against the platform's
    // **Secure Boot key hierarchy**. Specifically, the KRL signing key must
    // chain to a certificate in the UEFI `db` (Signature Database) or, for
    // systems without UEFI, to the kernel's built-in trusted keyring
    // (`CONFIG_SYSTEM_TRUSTED_KEYRING` equivalent). This is the same trust
    // anchor used for kernel module signatures, ensuring a single chain of
    // trust from firmware to revocation policy.
    /// Version (monotonically increasing, prevents rollback).
    /// Anti-rollback enforcement: on every KRL load, the kernel
    /// compares this version against the last-seen KRL version stored
    /// in TPM NV index 0x01800200 (a monotonic counter). If the new
    /// version is less than the stored value, the KRL is rejected
    /// (prevents an attacker from replaying an older KRL with fewer
    /// revocations). On successful load, the TPM NV counter is updated
    /// to the new version. On systems without a TPM, the last-seen
    /// version is stored in a UEFI authenticated variable
    /// (umka-krl-version); this is weaker (vulnerable to firmware-level
    /// attacks) but still prevents userspace rollback.
    ///
    /// **First-boot initialization**: On a system where NV index `0x01800200` does not exist,
    /// the boot stub creates it during the first secure boot sequence:
    ///
    /// 1. **Detect first boot**: `TPM2_NV_ReadPublic(0x01800200)` returns
    ///    `TPM_RC_NV_UNDEFINED` — no prior KRL has been loaded.
    /// 2. **Define index**: `TPM2_NV_DefineSpace` with attributes
    ///    `TPMA_NV_COUNTER | TPMA_NV_AUTHREAD | TPMA_NV_AUTHWRITE |
    ///    TPMA_NV_NO_DA | TPMA_NV_PLATFORMCREATE`, size 8 bytes. The
    ///    `PLATFORMCREATE` attribute ensures only firmware (not OS) can delete
    ///    the index.
    /// 3. **Initialize**: `TPM2_NV_Increment` to set the counter to 1.
    /// 4. **Load KRL**: The first KRL must carry `krl_version ≥ 1` to pass the
    ///    anti-rollback check.
    ///
    /// Subsequent boots read the counter and compare against the KRL's embedded
    /// version. Systems that have been provisioned but later have their TPM
    /// cleared (e.g., ownership change) will fail KRL loading until re-provisioned
    /// via the secure enrollment process (Section 8.3 key ceremony).
    pub version: u64,
    /// Public key fingerprints (SHA-256 hash of the raw public key bytes),
    /// sorted for binary search. Algorithm-agnostic: the fingerprint is a
    /// SHA-256 hash regardless of whether the key is Ed25519, ML-DSA-44,
    /// ML-DSA-65, or any future algorithm — all produce a 32-byte fingerprint.
    /// Uses `RcuSlice<[u8; 32]>` — a zero-cost wrapper around the raw pointer
    /// that can only be dereferenced within an `RcuReadGuard` scope, preventing
    /// use-after-free if the backing slab allocation is freed after a grace period.
    /// Boot-time KRLs are bump-allocated (effectively 'static); runtime-loaded
    /// KRLs are slab-allocated and freed after an RCU grace period.
    /// Length is given by `revoked_count`.
    pub revoked_keys: RcuSlice<[u8; 32]>,
    /// Number of entries pointed to by `revoked_keys`.
    /// Required because `RcuSlice` does not carry length information.
    pub revoked_count: u32,
    /// SHA-256 hash of the serialized KRL (for TPM PCR extend).
    pub digest: [u8; 32],
}

/// **Lifetime safety**: The `revoked_keys` field uses `RcuSlice<[u8; 32]>`,
/// a zero-cost wrapper around `*const [u8; 32]` that implements `Deref` only
/// when an `RcuReadGuard` is in scope. `RcuSlice` does NOT implement `Send`
/// or `Sync` directly — the containing `KeyRevocationList` is shared via
/// `RcuCell<KeyRevocationList>`, which provides the necessary lifetime
/// guarantees (readers hold `RcuReadGuard`, preventing grace period completion
/// and slab freeing). The blanket `unsafe impl Send/Sync` on `KeyRevocationList`
/// is safe because the struct is only accessible through `RcuCell`, never
/// directly shared.
// SAFETY: `revoked_keys` is an `RcuSlice` whose pointee remains valid for the
// lifetime of any `RcuReadGuard` that can reach this struct. The struct is
// read-only after construction (immutable once published via RCU). Access
// outside an RCU read-side critical section is prevented by `RcuSlice`'s API,
// making cross-thread send and share safe.
unsafe impl Send for KeyRevocationList {}
unsafe impl Sync for KeyRevocationList {}

The verification flow during driver loading:

1. Extract key_fingerprint from DriverSignature.
2. Binary search KRL for the fingerprint.
3. If found: reject driver load with -EKEYREVOKED, emit alert.
4. If not found: proceed with normal signature verification.
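
Steps 2-4 can be sketched directly with Rust's slice binary search (illustrative types; the real lookup operates on the RCU-protected revoked_keys array):

```rust
#[derive(Debug, PartialEq)]
enum KrlCheck { Revoked, NotRevoked }

/// Step 2: binary search over the sorted fingerprint array;
/// steps 3-4: map the result to a load decision.
fn check_krl(revoked_keys: &[[u8; 32]], fingerprint: &[u8; 32]) -> KrlCheck {
    match revoked_keys.binary_search(fingerprint) {
        Ok(_) => KrlCheck::Revoked,     // step 3: reject load with -EKEYREVOKED
        Err(_) => KrlCheck::NotRevoked, // step 4: proceed to signature verification
    }
}

fn main() {
    let mut krl = vec![[0x11u8; 32], [0xaau8; 32], [0x42u8; 32]];
    krl.sort(); // KRL invariant: fingerprints sorted for binary search
    assert_eq!(check_krl(&krl, &[0xaa; 32]), KrlCheck::Revoked);
    assert_eq!(check_krl(&krl, &[0x00; 32]), KrlCheck::NotRevoked);
}
```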

8.2.7.1 First-Boot KRL Counter Pre-Increment Prevention

Attack: Between NV_DefineSpace (creates the NV counter at index 0x01800200) and the first NV_Increment (records the initial KRL epoch), an attacker with platform auth could call NV_Increment to advance the counter. The initial KRL entry would be bound to epoch 1, not epoch 0, making epoch 0 appear as a rollback.

Defense: UmkaOS uses a two-phase commit for first-boot NV counter provisioning:

  1. NV_DefineSpace with NV_ORDERLY attribute: Creates the counter. The NV_ORDERLY flag marks the space as "provisioning in progress" until explicitly cleared.

  2. Immediate NV_Increment in the same TPM command session: NV_DefineSpace and the first NV_Increment are submitted within a single authorization session (TPM2_StartAuthSession → TPM2_NV_DefineSpace → TPM2_NV_Increment → TPM2_FlushContext). The TPM processes these atomically within the session. No external caller can interleave between NV_DefineSpace and NV_Increment within a session.

  3. Platform auth lock: Once the provisioning sequence completes, UmkaOS calls TPM2_HierarchyControl(TPM_RH_PLATFORM, disable=true) to disable the platform hierarchy for the remainder of the boot, preventing further platform-auth commands against the PLATFORMCREATE index. The platform auth secret required for provisioning is available only to UmkaOS's trusted boot firmware; the hierarchy remains disabled until the next boot.

  4. Binding check: The first KRL entry is written with generation = NV_counter_value_after_first_increment (expected: 1). If the counter reads any value other than 1 after provisioning, provisioning aborts and triggers a measured boot failure event (extends PCR[10] with a failure marker).

Non-TPM path: On systems without TPM, the rollback counter is stored in NVRAM (UEFI SetVariable with EFI_VARIABLE_BOOTSERVICE_ACCESS | EFI_VARIABLE_NON_VOLATILE). The same two-phase approach applies: UEFI variable creation and initial-write happen in the same UEFI Boot Services call sequence before ExitBootServices().

8.2.8 RCU Read-Side Timeout (DoS Mitigation)

Attack vector: RCU-based KRL management has a theoretical denial-of-service vulnerability. An attacker could keep an RCU read-side critical section open indefinitely by repeatedly calling a syscall that reads from the KRL (e.g., driver loading queries) without allowing the critical section to exit. This would prevent the RCU grace period from ever completing, blocking KRL updates and preventing revocation from taking effect.

Mitigation: Preemptible RCU with priority boosting. UmkaOS uses preemptible RCU for KRL read-side critical sections — readers can be preempted by higher-priority tasks, preventing a single reader from indefinitely blocking grace period completion. When a grace period has been pending for longer than KRL_RCU_BOOST_MS (default: 1ms), the RCU subsystem boosts the priority of all in-progress KRL readers to SCHED_FIFO priority 99, expediting their completion. This is the same mechanism used by Linux's CONFIG_RCU_BOOST — it respects RCU semantics (readers complete naturally rather than being forcibly terminated) and avoids the use-after-free risk of forcibly calling rcu_read_unlock() from outside the critical section.

Implementation notes:

  • KRL readers are expected to be brief: a binary search over the revoked_keys array is O(log n) and does not perform blocking operations. Any syscall that blocks while holding the KRL RCU read lock is a bug.
  • Priority boosting is a safety net, not a normal-path mechanism. Under normal operation, KRL lookups complete in <10μs — well before the 1ms boost threshold.
  • On boost trigger, the kernel logs a warning with the CPU ID, task PID, and execution context. Repeated boosts from the same task trigger rate limiting (log suppression) to avoid log flooding.
  • Syscall rate limiting: The driver-load syscall (umka_driver_load) is rate-limited to at most 10 invocations per second per UID (using a per-UID token bucket). This prevents a malicious userspace process from abusing repeated driver load requests to keep KRL RCU read-side critical sections active across many CPUs and delay grace period completion. Excess requests fail with -EAGAIN. This is the primary defense; priority boosting is the secondary safety net.
  • This mitigation is defensive: a compromised kernel (or a bug in a Tier 1 driver) could bypass it, but that is outside the threat model. The RCU boost + rate limiting protects against userspace-triggered DoS only.
  • Scope: Tier 1 crash containment (Section 10.2) is the primary defense against intentionally malicious or buggy drivers. The RCU timeout mechanism guards against accidental long RCU holds (e.g., a Tier 1 driver that forgot to call rcu_read_unlock() on an error path). It is not designed to stop a driver that deliberately holds RCU forever — Tier 1 crash recovery handles that by isolating and restarting the driver domain.
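
The per-UID token bucket described above can be sketched as follows (illustrative; the kernel would keep per-UID state alongside the credential table and use a monotonic clock source):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Bucket { tokens: f64, last: Instant }

struct RateLimiter {
    buckets: HashMap<u32, Bucket>, // per-UID state
    rate: f64,                     // tokens refilled per second
    burst: f64,                    // bucket capacity
}

impl RateLimiter {
    fn new(rate: f64, burst: f64) -> Self {
        RateLimiter { buckets: HashMap::new(), rate, burst }
    }

    /// Returns false to model failing the syscall with -EAGAIN.
    fn try_acquire(&mut self, uid: u32, now: Instant) -> bool {
        let b = self.buckets.entry(uid).or_insert(Bucket { tokens: self.burst, last: now });
        // Refill proportionally to elapsed time, capped at the burst size.
        let elapsed = now.duration_since(b.last).as_secs_f64();
        b.tokens = (b.tokens + elapsed * self.rate).min(self.burst);
        b.last = now;
        if b.tokens >= 1.0 { b.tokens -= 1.0; true } else { false }
    }
}

fn main() {
    let mut rl = RateLimiter::new(10.0, 10.0); // 10 loads/second per UID
    let t0 = Instant::now();
    // Ten immediate requests succeed; the eleventh fails with -EAGAIN.
    assert!((0..10).all(|_| rl.try_acquire(1000, t0)));
    assert!(!rl.try_acquire(1000, t0));
    // Another UID has its own bucket.
    assert!(rl.try_acquire(1001, t0));
    // After 200 ms, tokens have been refilled.
    assert!(rl.try_acquire(1000, t0 + Duration::from_millis(200)));
}
```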


8.3 TPM Runtime Services

8.3.1 Measured Boot (TPM)

For environments with TPM (Trusted Platform Module):

PCR Assignment Table (consolidated for security review; authoritative detail at Section 2.1.8.2 in 02-boot-hardware.md):

PCR Measured by Content
0 UEFI firmware Firmware code and EFI drivers
1 UEFI firmware Platform configuration (NVRAM, PCI config)
2 UEFI firmware Option ROM code
3 UEFI firmware Option ROM data
4 UEFI/bootloader Boot manager code (shim, GRUB executable)
5 UEFI/bootloader Boot manager data + GPT partition table
6 UEFI firmware Resume-from-S4 / hibernate image
7 UEFI firmware Secure Boot policy (db, dbx, PK, KEK state)
8 Bootloader (GRUB) GRUB command line
9 Bootloader (GRUB) Kernel image (UmkaOS PE/COFF or bzImage)
10 Bootloader → IMA initramfs (boot-time); IMA runtime measurements (both coexist, distinguished by event log type)
11 Bootloader (GRUB) Kernel command line
12 UmkaOS Core Tier 1 driver images + IMA policy signing key
13–15 (available) Reserved for OS or application use
Measured boot sequence:
1. UEFI firmware self-measures into PCR 0–7 (standard, not UmkaOS-controlled).
2. Bootloader (GRUB/shim) measures: PCR 8 (GRUB command line), PCR 9 (kernel
   image), PCR 10 (initramfs), PCR 11 (kernel command line).
3. UmkaOS Core measures loaded Tier 1 drivers and IMA policy key → PCR 12.
4. IMA extends PCR 10 at runtime on each measured file open (coexists with
   boot-time initramfs measurement via distinct event log entry types).
5. Remote attestation server verifies the full boot chain via TPM quotes
   covering PCRs 0–12.

Key PCR assignments for IMA and driver measurement (for quick reference; the full table above is authoritative):

PCR Purpose Extends When
PCR 7 UEFI Secure Boot state (EFI variables, dbx revocation list) Boot, before handoff
PCR 10 IMA runtime file measurements Each file open/exec in IMA policy
PCR 12 UmkaOS Core static boot-time digest of loaded Tier 1 drivers Driver load, before first domain grant

Note: PCR 10 and PCR 12 both extend on Tier 1 driver load: IMA measures the driver binary file (PCR 10); UmkaOS Core pre-measures the driver's immutable code+data region before granting it an isolation domain (PCR 12). These are complementary measurements — PCR 10 tracks what was loaded, PCR 12 locks the Tier 1 driver set into the TPM attestation chain before execution begins.

This is additive — TPM integration is optional and does not affect the boot path for systems without TPM.

8.3.1.1 PCR[7] Measurement Without Secure Boot

PCR[7] in the UEFI Secure Boot architecture records the Secure Boot policy and the authority (certificate chain) used to authorize boot components. Its content and interpretation depend on whether Secure Boot is enabled.

What UEFI always measures into PCR[7] (per TCG PC Client Platform Firmware Profile Specification, implemented in EDK II Tcg2Dxe.c):

UEFI firmware unconditionally measures the following EFI variables into PCR[7] using EV_EFI_VARIABLE_DRIVER_CONFIG events, regardless of whether Secure Boot is active: SecureBoot, PK, KEK, db, dbx. If a variable is absent (e.g., no keys are enrolled), the firmware measures a zero-size variable entry for that variable. The dbt and dbr variables are measured only if present. An EV_SEPARATOR event is then extended into PCR[7] to delimit configuration from authority measurements.

When Secure Boot is active and successfully verifies a boot binary, each EFI_SIGNATURE_DATA record used for verification is also extended into PCR[7] as an EV_EFI_VARIABLE_AUTHORITY event. These authority events do not occur when Secure Boot is disabled.

Condition PCR[7] event sequence Binding strength
Secure Boot enabled, keys enrolled EV_EFI_VARIABLE_DRIVER_CONFIG for SecureBoot=1, PK, KEK, db, dbx; EV_SEPARATOR; EV_EFI_VARIABLE_AUTHORITY for each certificate used Strong: value binds to the specific certificate set that authorized the boot
Secure Boot disabled, keys enrolled EV_EFI_VARIABLE_DRIVER_CONFIG for SecureBoot=0, PK, KEK, db, dbx; EV_SEPARATOR; no authority events Intermediate: binds to the key database state but not to any boot authorization act
Secure Boot disabled, no keys enrolled EV_EFI_VARIABLE_DRIVER_CONFIG for SecureBoot=0 plus zero-size entries for absent PK/KEK/db/dbx; EV_SEPARATOR; no authority events Weak: deterministic but signals absence of any policy — machine-bound only

The resulting PCR[7] value is always deterministic for a given Secure Boot configuration state. A system that had Secure Boot disabled and then re-enabled will produce the same PCR[7] value as before the change (assuming the key databases are unchanged), because PCRs are reset at each power cycle and rebuilt from scratch on every boot.

UmkaOS policy for non-SB systems:

When UmkaOS detects that Secure Boot is not active (via the EFI_GLOBAL_VARIABLE SecureBoot variable being absent or zero, or via UEFI event log inspection at boot time):

  1. TPM sealing policy: Secrets sealed to PCR[7] on a non-SB system are treated as "machine-bound, not policy-bound" — they reflect the key database state at sealing time, but since no cryptographic authority event was recorded, they offer no assurance that the boot chain was verified. Operators are warned via the boot log: umka: TPM PCR[7] in no-SB mode; sealed secrets are machine-bound only.

  2. IMA policy adjustment: Without Secure Boot, the IMA policy falls back from secure_boot_appraise (which requires SB-signed files) to ima_appraise=fix mode, allowing measurement without strict appraisal. Files are still measured into the IMA log for audit purposes.

  3. PCR[7] state capture: UmkaOS records the UEFI event log entries for PCR[7] during early boot (before any further extensions) and exposes the Secure Boot status via Pcr7BootState. This allows later software to distinguish "SB active with these certificates" from "SB inactive" without re-reading the PCR or relying on the (mutable) SecureBoot EFI variable at runtime.

/// PCR[7] state recorded at UmkaOS boot time, captured from the UEFI event log
/// before the boot manager runs. Populated during early boot from the
/// EFI_TCG2_PROTOCOL event log.
pub struct Pcr7BootState {
    /// The SHA-256 hash in PCR[7] as read from the TPM after UEFI firmware
    /// completes its measurements. Value is deterministic per boot configuration.
    pub pcr7_value: [u8; 32],
    /// True if UEFI Secure Boot was active when UmkaOS booted.
    /// Derived from the `SecureBoot` EFI variable
    /// (GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c) as measured by UEFI
    /// into the TCG event log (EV_EFI_VARIABLE_DRIVER_CONFIG), not from
    /// runtime EFI variable access (which is mutable post-ExitBootServices).
    pub secure_boot_active: bool,
    /// True if at least one EV_EFI_VARIABLE_AUTHORITY event was recorded
    /// in the PCR[7] event log, indicating that Secure Boot actually
    /// verified at least one boot component. A system can have
    /// `secure_boot_active = true` but `authority_events_present = false`
    /// if no EFI binaries were verified (unusual but firmware-possible).
    pub authority_events_present: bool,
    /// Number of distinct EV_EFI_VARIABLE_AUTHORITY events recorded in
    /// the PCR[7] log. Zero if Secure Boot is disabled.
    pub authority_event_count: u16,
    pub _pad: [u8; 3],
}
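
The binding-strength column of the condition table above can be derived mechanically from recorded boot state. A sketch: the `Pcr7Binding` enum and the separate `keys_enrolled` input (PK/KEK/db non-empty, derivable from the same event log) are illustrative additions, not fields of `Pcr7BootState`:

```rust
/// Binding strength of a PCR[7]-sealed secret, per the table above.
#[derive(Debug, PartialEq)]
enum Pcr7Binding {
    /// SB enabled with EV_EFI_VARIABLE_AUTHORITY events: the value binds
    /// to the certificate set that authorized the boot.
    Strong,
    /// Key databases measured, but no authority event recorded: binds to
    /// key database state, not to any boot authorization act.
    Intermediate,
    /// No policy present: deterministic but machine-bound only.
    MachineBoundOnly,
}

fn classify(secure_boot_active: bool, authority_events_present: bool, keys_enrolled: bool) -> Pcr7Binding {
    match (secure_boot_active, authority_events_present, keys_enrolled) {
        (true, true, _) => Pcr7Binding::Strong,
        // Covers both "SB disabled, keys enrolled" and the unusual
        // firmware-possible case of SB active with zero authority events.
        (_, _, true) => Pcr7Binding::Intermediate,
        _ => Pcr7Binding::MachineBoundOnly,
    }
}

fn main() {
    assert_eq!(classify(true, true, true), Pcr7Binding::Strong);
    assert_eq!(classify(false, false, true), Pcr7Binding::Intermediate);
    assert_eq!(classify(false, false, false), Pcr7Binding::MachineBoundOnly);
}
```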

Non-UEFI boot paths: On systems that boot via BIOS/MBR, DeviceTree (ARM), OpenSBI (RISC-V), or other non-UEFI mechanisms, PCR[7] is not defined by the UEFI Secure Boot specification. UmkaOS sets secure_boot_active = false and pcr7_value = [0u8; 32] on these platforms, reflecting that no UEFI-based PCR[7] measurement occurred. Platform-specific measured boot chains (e.g., ARM TrustZone OP-TEE measurements) are handled separately and do not map to PCR[7].

8.3.2 TPM Runtime Services

Section 8.3.1 covers TPM as a boot measurement device. But TPM 2.0 is also a runtime crypto engine — a hardware security module built into every modern server and laptop. UmkaOS integrates TPM as a first-class runtime resource.

Linux TPM interfaces — Linux exposes /dev/tpm0 (raw character device for direct TPM command submission) and /dev/tpmrm0 (resource-managed access that multiplexes TPM sessions). UmkaOS provides both via the syscall interface for existing userspace tools (tpm2-tools, clevis, systemd-cryptenroll).

Key sealing and unsealing — TPM can seal secrets (encryption keys, credentials) to a specific PCR state. The secret can only be recovered if the PCRs match — meaning the system booted the expected kernel, with the expected drivers, in the expected configuration. If any component in the boot chain is modified or compromised, unsealing fails and the secret is protected. UmkaOS integrates sealing with its capability system: the CAP_TPM_SEAL capability is required to create sealed objects, and the resulting sealed blob is itself a capability-protected kernel object.

NV Indices (NVRAM) — TPM provides persistent, tamper-resistant storage called NV indices. These store small secrets (disk encryption keys, network credentials, device identity certificates) that survive reboots and are protected by TPM authorization policies. UmkaOS abstracts NV indices as capability-protected objects — reading/writing an NV index requires the appropriate capability token, and the authorization policy (password, HMAC, or PCR-bound) is enforced by the TPM hardware itself.

Hardware random number generator — TPM includes a hardware RNG (certified to NIST SP 800-90A). UmkaOS's entropy pool (which also draws from CPU RDRAND/RDSEED) mixes in TPM random data when a TPM is present. This provides defense-in-depth: if the CPU RNG is compromised (as has happened with certain microcode bugs), the TPM RNG provides independent entropy.

Authorization policies — TPM 2.0 supports rich authorization:
  • PCR-bound: seal/unseal only if PCRs match expected values (boot integrity)
  • Password-based: simple passphrase authorization
  • HMAC-based: proof of possession of a shared secret
  • Policy-based: compound policies combining PCR state, time-of-day, locality, NV counter values, and external authorization (e.g., "unseal only if PCR 11 matches AND a remote server approves")

UmkaOS's policy engine maps these to capability gates — a sealed key's authorization policy determines which capability holders can trigger unsealing.

TPM-backed disk encryption — dm-crypt volume keys sealed to TPM PCR state. On boot, UmkaOS's initramfs unseals the volume key from the TPM (no passphrase required if the boot chain is trusted). This replaces the Linux approach of requiring systemd-cryptenroll or clevis daemons with kernel-native TPM key management. If the boot chain changes (different kernel version, modified initramfs), the TPM refuses to unseal and the system falls back to passphrase entry.

Resource manager — TPM 2.0 has limited internal session/object slots (typically 3 loaded objects + 3 loaded sessions simultaneously, per TCG TPM 2.0 Part 1 §32). Linux requires either the in-kernel tpm_tis/tpm_crb resource manager or the userspace tpm2-abrmd daemon to multiplex access. UmkaOS's TPM driver includes a kernel-native resource manager that transparently handles context swapping, so multiple concurrent TPM users (disk encryption, IMA measurements, remote attestation, application key storage) never contend for TPM slots. No userspace daemon required.

Resource manager data structures:

/// A virtualized TPM handle. Callers always use VirtTpmHandle; the resource
/// manager maps it to a real TPM handle when the object is loaded.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct VirtTpmHandle(pub u32);

/// State of one virtualized TPM object or session.
pub enum TpmSlotState {
    /// Currently loaded in the TPM — real_handle is valid.
    Loaded { real_handle: u32 },
    /// Context-saved to TPM NV storage by ContextSave.
    /// context_blob is retrieved by ContextLoad to reload.
    Saved { context_blob: TpmContextBlob },
    /// Never loaded (newly created, awaiting first use).
    Pending,
}

/// Opaque blob returned by TPM2_ContextSave. Variable length (≤ 2048 bytes).
pub struct TpmContextBlob {
    pub data: ArrayVec<u8, 2048>,
}

/// Per-handle entry in the resource manager table.
pub struct TpmHandleEntry {
    pub virt_handle: VirtTpmHandle,
    pub slot_state: TpmSlotState,
    /// LRU clock tick when last used. Used for eviction ordering.
    pub last_used: u64,
}

/// Kernel-native TPM resource manager.
///
/// Maintains an LRU cache of loaded TPM handles. When a caller needs to use
/// a handle and the TPM is full, the LRU loaded handle is context-saved
/// (TPM2_ContextSave) to free a slot, then the requested handle is
/// context-loaded (TPM2_ContextLoad) into the freed slot.
pub struct TpmResourceManager {
    /// All virtualized handles, keyed by VirtTpmHandle.
    pub handles: HashMap<VirtTpmHandle, TpmHandleEntry>,
    /// Currently loaded handles, in LRU order (front = most recent).
    pub loaded: ArrayDeque<VirtTpmHandle, 6>,  // 3 objects + 3 sessions max
    /// Maximum loaded slots (queried from TPM2_GetCapability at init).
    pub max_loaded: u8,
    /// Monotonic tick counter for LRU ordering.
    pub tick: u64,
    /// Serializes all TPM commands (TPM is single-threaded).
    pub tpm_lock: Mutex<()>,
}

Resource manager algorithm — ensure_loaded(virt):

Acquire tpm_lock.
If virt is already Loaded:
  Update virt.last_used = tick++.
  Return real_handle.
If loaded.len() == max_loaded:
  evict = loaded.pop_back()  // LRU handle
  TPM2_ContextSave(evict.real_handle) → context_blob
  evict.slot_state = Saved { context_blob }
  loaded.remove(evict)
If virt is Saved:
  TPM2_ContextLoad(virt.context_blob) → real_handle
Else (Pending — never loaded, no context blob yet):
  real_handle comes from the creating command (e.g. TPM2_Load,
  TPM2_StartAuthSession), submitted now that a slot is free
virt.slot_state = Loaded { real_handle }
virt.last_used = tick++
loaded.push_front(virt)
Release tpm_lock.
Return real_handle.

Every TPM command path calls ensure_loaded(virt) to obtain the real handle before submitting the command. The caller is unaware of whether a context swap occurred. Context save/load adds ~1-5 ms per swap (one TPM2_ContextSave + one TPM2_ContextLoad round-trip); swaps are rare because the LRU policy keeps the most-used handles loaded.
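
The LRU behavior of ensure_loaded() can be exercised with a small userspace model in which the TPM2_ContextSave/TPM2_ContextLoad round-trips are replaced by counters. `RmModel` and its fields are illustrative stand-ins for the kernel's `TpmResourceManager` above:

```rust
use std::collections::{HashMap, VecDeque};

/// Userspace model of the resource manager: `saves`/`loads` count the
/// simulated ContextSave/ContextLoad round-trips.
struct RmModel {
    loaded: VecDeque<u32>,        // front = most recently used
    saved: HashMap<u32, Vec<u8>>, // virt handle -> stand-in context blob
    max_loaded: usize,
    saves: u32,
    loads: u32,
}

impl RmModel {
    fn new(max_loaded: usize) -> Self {
        Self { loaded: VecDeque::new(), saved: HashMap::new(), max_loaded, saves: 0, loads: 0 }
    }

    fn ensure_loaded(&mut self, virt: u32) {
        if let Some(pos) = self.loaded.iter().position(|&h| h == virt) {
            // Already loaded: move to MRU position, no TPM traffic.
            let h = self.loaded.remove(pos).unwrap();
            self.loaded.push_front(h);
            return;
        }
        if self.loaded.len() == self.max_loaded {
            // TPM full: context-save the LRU handle to free a slot.
            let evict = self.loaded.pop_back().unwrap();
            self.saved.insert(evict, vec![0u8; 256]); // ContextSave stand-in
            self.saves += 1;
        }
        self.saved.remove(&virt); // ContextLoad stand-in (consumes the blob)
        self.loads += 1;
        self.loaded.push_front(virt);
    }
}

fn main() {
    let mut rm = RmModel::new(3); // 3 object slots, as in the struct above
    for h in [1, 2, 3] { rm.ensure_loaded(h); } // fills all slots, no eviction
    rm.ensure_loaded(4); // evicts handle 1 (LRU)
    assert_eq!(rm.saves, 1);
    rm.ensure_loaded(1); // reload 1, evicting handle 2
    assert_eq!((rm.saves, rm.loads), (2, 5));
}
```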

Eviction Policy:

The TPM has limited internal context slots (TPM2_PT_HR_LOADED_SESSION: typically 3; TPM2_PT_ACTIVE_SESSIONS: typically 64 on current hardware). When all slots are occupied and a new context is needed:

  1. LRU eviction: The resource manager maintains an LRU list of active TPM context handles. When a new slot is needed, the least-recently-used non-pinned context is evicted via TPM2_CC_ContextSave (serializes the context to a TPMS_CONTEXT blob, ~256-2048 bytes per context) and stored in kernel memory.

  2. Pinned contexts: Contexts actively executing an operation (ContextState::InUse) are pinned and cannot be evicted. If all active contexts are pinned and a new slot is needed, the requesting operation blocks on tpm_context_wait_queue until a context completes and unpins.

  3. Re-loading: When a previously evicted context is needed, the resource manager calls TPM2_CC_ContextLoad to reload it into the TPM, potentially evicting another LRU context.

  4. Context blob storage: Saved context blobs are stored in a pre-allocated kernel pool (TpmContextPool), sized at boot to MAX_TPM_CONTEXTS × MAX_CONTEXT_SIZE = 64 × 2048 = 128 KiB. This pool is never grown after boot (no allocation in TPM I/O path).

  5. Context size limits: Each saved context blob is bounded by TPM2_PT_CONTEXT_SYM_SIZE (hardware-reported). The TpmContextPool slot size is set to max(reported_size, 2048) for safety.

  6. Eviction metrics: The resource manager tracks eviction_count, reload_count, and max_concurrent_active in the FMA health struct for the TPM device, allowing operators to tune MAX_TPM_CONTEXTS if thrashing is observed.

Performance considerations — TPM 2.0 operations are inherently slow, with latency varying dramatically by operation type:

Operation category Latency Examples
Simple reads/extends ~1-10 ms TPM2_PCR_Extend, TPM2_GetRandom, TPM2_PCR_Read
Sealing/unsealing ~20-80 ms TPM2_Create (sealed key), TPM2_Unseal
Asymmetric signing ~100-600 ms ECDSA P256 (p50 ~200ms), RSA 2048 (slower)
Key generation ~200-1000 ms TPM2_CreatePrimary (RSA 2048)

The TPM is single-threaded — all commands serialize internally regardless of concurrent callers. Under load (e.g., multiple containers requesting attestation quotes simultaneously), effective per-operation latency increases linearly with queue depth. UmkaOS's async I/O model ensures TPM operations never block the caller synchronously. TPM commands are submitted via the ring buffer interface and completed asynchronously. For latency-sensitive paths (e.g., IMA measurement during file open), UmkaOS caches measurement results and only re-measures when file content changes. The caching strategy is essential: without it, a file-intensive workload would saturate the TPM with PCR extend operations, each taking 1-10ms.
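
The caching strategy can be sketched as a cache keyed by (inode, content version): the digest is recomputed, and the PCR extended, only when the version bumps. The key shape and counter are illustrative, not the kernel's actual types:

```rust
use std::collections::HashMap;

/// Hypothetical measurement cache: a hit skips both hashing and the
/// 1-10 ms TPM2_PCR_Extend round-trip.
struct MeasurementCache {
    cache: HashMap<(u64, u64), [u8; 32]>, // (inode, version) -> digest
    tpm_extends: u32,                      // counts simulated TPM operations
}

impl MeasurementCache {
    fn new() -> Self {
        Self { cache: HashMap::new(), tpm_extends: 0 }
    }

    fn measure(&mut self, inode: u64, version: u64, compute_digest: impl Fn() -> [u8; 32]) -> [u8; 32] {
        if let Some(d) = self.cache.get(&(inode, version)) {
            return *d; // cache hit: no hashing, no TPM traffic
        }
        let d = compute_digest();
        self.tpm_extends += 1; // stand-in for the TPM2_PCR_Extend round-trip
        self.cache.insert((inode, version), d);
        d
    }
}

fn main() {
    let mut mc = MeasurementCache::new();
    mc.measure(42, 1, || [0xAB; 32]);
    mc.measure(42, 1, || [0xAB; 32]); // hit: same inode, same content version
    assert_eq!(mc.tpm_extends, 1);
    mc.measure(42, 2, || [0xAB; 32]); // content changed: re-measure, re-extend
    assert_eq!(mc.tpm_extends, 2);
}
```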

8.3.3 Anti-Rollback Counter (TOCTOU-Safe Initialization)

UmkaOS uses a dedicated TPM NV monotonic counter to prevent rollback to older kernel images. This counter is separate from the KRL version counter (NV index 0x01800200, Section 8.2.7) and tracks the kernel boot generation.

NV Index: UMKA_ROLLBACK_NV_INDEX = 0x01500001

First-boot TPM anti-rollback counter initialization (TOCTOU-safe):

1. TPM2_NV_DefineSpace(
     nvIndex      = UMKA_ROLLBACK_NV_INDEX,  // 0x01500001
     size         = 8,                        // 8-byte counter
     attributes   = TPMA_NV_COUNTER
                  | TPMA_NV_WRITE_STCLEAR    // write-locked after power cycle
                  | TPMA_NV_WRITEDEFINE      // cannot be redefined after first write
                  | TPMA_NV_PPWRITE          // requires physical presence to write
                  | TPMA_NV_AUTHREAD,        // readable with auth only
     authPolicy   = platform_auth            // requires platform authorization
   )
   If this returns TPM_RC_NV_DEFINED: the index already exists from a
   previous boot — skip to step 3 (do not re-initialize).

2. TPM2_NV_Increment(nvIndex, platformAuth)
   // First increment activates the counter (per TPM 2.0, a fresh counter
   // starts no lower than the largest value any previously deleted counter
   // reached, preventing delete-and-recreate rollback) and locks the
   // definition (TPMA_NV_WRITEDEFINE prevents future TPM2_NV_DefineSpace
   // for this index).

3. Read counter: TPM2_NV_Read(nvIndex) → current_rollback_generation.
   Compare against the kernel's compiled-in UMKA_MIN_ROLLBACK_GENERATION
   (this kernel's rollback generation):
   - If the kernel's generation is lower than the counter, this kernel
     predates the newest generation already booted on this machine:
     boot is rejected (anti-rollback enforcement).
   - If the kernel's generation is higher, TPM2_NV_Increment is called
     until the counter matches, raising the floor for all future boots.

Security properties:

  • TPMA_NV_WRITEDEFINE: once written, the NV space definition cannot be changed even with platform auth — prevents an attacker from deleting and re-creating the index with weaker attributes.
  • TPMA_NV_PPWRITE: physical presence required — cannot be manipulated by software alone.
  • The atomicity of step 1 is guaranteed by TPM hardware — concurrent TPM2_NV_DefineSpace calls are serialized by the TPM's internal command sequencer. If step 1 returns TPM_RC_NV_DEFINED, the index was created by a concurrent or prior boot; step 3 reads the existing counter safely. There is no window between "check if index exists" and "create index" because the DefineSpace command is itself the creation — either it succeeds (first boot) or returns TPM_RC_NV_DEFINED (already exists). This eliminates the TOCTOU race that would exist if existence were checked with a separate TPM2_NV_ReadPublic command before DefineSpace.

Rollback generation increment: When a new kernel version is released with security fixes that must not be rolled back, UMKA_MIN_ROLLBACK_GENERATION is incremented in the kernel build. On the first boot of the new kernel, step 3 advances the TPM counter to the new generation; older kernel binaries (with lower generations) then fail the step-3 comparison and can no longer boot on this system.
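
Steps 1–3 and the generation comparison can be modeled in a few lines. `FakeNvCounter` is a stand-in for the TPM NV index (the real counter lives in tamper-resistant NV storage), and the error string approximates the TPM_RC_NV_DEFINED return code:

```rust
/// Stand-in for the TPM NV monotonic counter. None = index not yet defined.
struct FakeNvCounter {
    value: Option<u64>,
}

impl FakeNvCounter {
    fn define_space(&mut self) -> Result<(), &'static str> {
        if self.value.is_some() {
            return Err("TPM_RC_NV_DEFINED"); // already exists: skip to read
        }
        self.value = Some(0);
        Ok(())
    }
    fn increment(&mut self) {
        let v = self.value.get_or_insert(0);
        *v += 1; // monotonic: the TPM never allows decrement
    }
    fn read(&self) -> u64 {
        self.value.unwrap_or(0)
    }
}

/// Boot-time anti-rollback check: reject a kernel whose generation is
/// below the stored floor; a newer kernel raises the floor.
fn boot_check(nv: &mut FakeNvCounter, kernel_generation: u64) -> bool {
    if nv.define_space().is_ok() {
        nv.increment(); // first boot: activate and lock the definition
    }
    if kernel_generation < nv.read() {
        return false; // rollback attempt: kernel older than the floor
    }
    while nv.read() < kernel_generation {
        nv.increment(); // newer kernel advances the floor
    }
    true
}

fn main() {
    let mut nv = FakeNvCounter { value: None };
    assert!(boot_check(&mut nv, 1));  // first boot, generation 1
    assert!(boot_check(&mut nv, 3));  // upgraded kernel raises floor to 3
    assert!(!boot_check(&mut nv, 2)); // rolled-back kernel is rejected
}
```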


8.4 Runtime Integrity Measurement (IMA)

Measured boot (Section 8.3) ensures the system booted a trusted kernel and drivers. But what about runtime — every binary executed, every library loaded, every config file read after boot? Runtime integrity measurement extends the trust chain from boot into ongoing operation.

Linux IMA measures files into TPM PCR 10 (or a designated PCR) at access time. It was bolted onto the VFS layer after the fact, with a standalone policy language and limited integration with the rest of the security stack. UmkaOS provides equivalent functionality, deeply integrated with the capability system and driver loading.

Measurement policy — Rules specify what to measure:
  • All executable files (mmap PROT_EXEC)
  • All shared libraries (dlopen)
  • Specific configuration files (e.g., /etc/fstab, /etc/passwd)
  • All files opened by processes holding specific capabilities
  • All files in specific filesystem subtrees

Policy rules are expressed as capability predicates — "measure all files opened by any process holding CAP_NET_ADMIN" — making the policy composable with UmkaOS's existing security architecture.

8.4.1 Policy Rule Grammar

IMA policy rules follow a line-oriented grammar. Each rule is a single line specifying an action, a condition set, and optional flags. The grammar (EBNF):

rule         = action SP condition_list [SP flag_list] LF
action       = "measure" | "dont_measure" | "appraise" | "dont_appraise" | "audit"
condition_list = condition *( SP condition )
condition    = key "=" value
key          = "func" | "mask" | "fsmagic" | "fsuuid" | "fspath"
             | "uid" | "euid" | "cap" | "obj_type" | "obj_user"
             | "subtype" | "fsname"
value        = quoted_string | bare_word
flag_list    = flag *( SP flag )
flag         = "permit_directio" | "appraise_type=imasig"
             | "appraise_type=sigv2" | "appraise_algo=" hash_algo
             | "template=" template_name
hash_algo    = "sha256" | "sha384" | "sha512"
template_name = "ima-ng" | "ima-sig" | "ima-modsig"

Condition keys (evaluated in order, ALL conditions in a rule must match):

Key Value type Semantics
func FILE_CHECK, MMAP_CHECK, BPRM_CHECK, MODULE_CHECK, FIRMWARE_CHECK, KEXEC_CHECK, KABI_CHECK Hook point where measurement occurs. KABI_CHECK is UmkaOS-specific: fires on every KABI driver load (Tier 1 and Tier 2).
mask MAY_EXEC, MAY_WRITE, MAY_READ, MAY_APPEND File access mode filter. MAY_EXEC covers both execve() and mmap(PROT_EXEC).
fspath glob pattern Filesystem path. Supports * (single component) and ** (recursive). Example: fspath=/usr/lib/**
fsmagic hex value Filesystem magic number (e.g., 0xEF53 for ext4). Used to exempt pseudo-filesystems.
fsuuid UUID string Filesystem UUID. Allows per-partition policies.
fsname string Filesystem type name (e.g., ext4, btrfs, tmpfs).
uid integer Real UID of the process accessing the file.
euid integer Effective UID of the process.
cap capability name UmkaOS capability held by the process. Example: cap=CAP_NET_ADMIN. Matches if the process's effective capability set includes the named capability.
obj_type string SELinux/UmkaOS label type of the file (for label-based policies).

Example built-in policy (compiled into the kernel):

 # Measure all executables and shared libraries
measure  func=BPRM_CHECK  mask=MAY_EXEC
measure  func=MMAP_CHECK  mask=MAY_EXEC
measure  func=MODULE_CHECK
measure  func=FIRMWARE_CHECK
measure  func=KABI_CHECK

 # Appraise (enforce) all executables
appraise func=BPRM_CHECK  mask=MAY_EXEC  appraise_type=imasig
appraise func=MMAP_CHECK  mask=MAY_EXEC  appraise_type=imasig
appraise func=MODULE_CHECK                appraise_type=imasig
appraise func=KABI_CHECK                  appraise_type=imasig

 # Exempt pseudo-filesystems from measurement
dont_measure  fsmagic=0x9FA0    # procfs
dont_measure  fsmagic=0x62656572  # sysfs
dont_measure  fsmagic=0x64626720  # debugfs
dont_measure  fsmagic=0x01021994  # tmpfs
dont_measure  fsmagic=0x1CD1     # devpts

Rule evaluation: Rules are evaluated top-to-bottom; first matching rule wins. If no rule matches, the default action is dont_measure / dont_appraise (fail-open for measurement, fail-closed for appraisal in umka.ima=enforce mode — unmeasured files that require appraisal are denied).
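
First-match evaluation over the grammar above can be sketched with the rule set reduced to the two most common condition keys. This is an illustration only; the kernel's rule type carries the full key set from the condition table:

```rust
/// Actions from the grammar's `action` production.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Action { Measure, DontMeasure, Appraise, DontAppraise, Audit }

/// A much-reduced rule: `None` means the key is absent (matches anything).
struct Rule {
    action: Action,
    func: Option<&'static str>,
    mask: Option<&'static str>,
}

/// Top-to-bottom scan; ALL conditions in a rule must match; first match
/// wins; no match falls through to the fail-open measurement default.
fn evaluate(rules: &[Rule], func: &str, mask: &str) -> Action {
    for r in rules {
        let func_ok = r.func.map_or(true, |f| f == func);
        let mask_ok = r.mask.map_or(true, |m| m == mask);
        if func_ok && mask_ok {
            return r.action; // first matching rule wins
        }
    }
    Action::DontMeasure // default when no rule matches
}

fn main() {
    // Mirrors two lines of the built-in policy above.
    let rules = [
        Rule { action: Action::Measure, func: Some("BPRM_CHECK"), mask: Some("MAY_EXEC") },
        Rule { action: Action::Measure, func: Some("KABI_CHECK"), mask: None },
    ];
    assert_eq!(evaluate(&rules, "BPRM_CHECK", "MAY_EXEC"), Action::Measure);
    assert_eq!(evaluate(&rules, "KABI_CHECK", "MAY_READ"), Action::Measure);
    assert_eq!(evaluate(&rules, "FILE_CHECK", "MAY_READ"), Action::DontMeasure);
}
```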

8.4.2 Algorithm Agility

IMA supports multiple hash algorithms via the appraise_algo flag:

  • SHA-256 (default): 32-byte digests. Quantum-safe for preimage resistance (Grover's algorithm gives ~128-bit effective security), standard across all deployments.
  • SHA-384: 48-byte digests. Higher security margin for long-lived signatures.
  • SHA-512: 64-byte digests. Maximum security, slightly higher measurement cost.

The hash algorithm is selected per-rule, allowing different algorithms for different file categories (e.g., SHA-512 for kernel modules, SHA-256 for general executables).

Algorithm migration: When upgrading the hash algorithm (e.g., SHA-256 → SHA-512), the kernel accepts both old and new algorithm digests during a transition window (configurable via umka.ima.transition_algos=sha256,sha512). This allows gradual re-signing of file extended attributes without breaking running systems. After transition, the old algorithm can be removed from the accepted list.

The extended attribute storing the signed hash includes an algorithm identifier prefix: security.ima xattr format is { algorithm_id: u8, signature_data: [u8] }, where algorithm_id maps to: 0x04 = SHA-256, 0x05 = SHA-384, 0x06 = SHA-512 (matching the Linux IMA enum hash_algo values for compatibility).
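
Parsing the xattr header described above is worth pinning down precisely. A sketch that handles only the algorithm-id prefix (signature verification is out of scope here); the IDs match Linux's enum hash_algo values as stated:

```rust
#[derive(Debug, PartialEq)]
enum HashAlgo { Sha256, Sha384, Sha512 }

/// Parse the security.ima xattr layout { algorithm_id: u8, signature_data: [u8] }.
/// Returns the algorithm and the remaining signature bytes.
fn parse_ima_xattr(xattr: &[u8]) -> Result<(HashAlgo, &[u8]), &'static str> {
    let (&id, sig) = xattr.split_first().ok_or("empty xattr")?;
    let algo = match id {
        0x04 => HashAlgo::Sha256, // matches Linux enum hash_algo
        0x05 => HashAlgo::Sha384,
        0x06 => HashAlgo::Sha512,
        _ => return Err("unknown algorithm_id"),
    };
    Ok((algo, sig))
}

fn main() {
    let blob = [0x04, 0xde, 0xad]; // SHA-256, two signature bytes
    let (algo, sig) = parse_ima_xattr(&blob).unwrap();
    assert_eq!(algo, HashAlgo::Sha256);
    assert_eq!(sig, &[0xde, 0xad][..]);
    assert!(parse_ima_xattr(&[0xff]).is_err()); // unknown id is rejected
}
```

Rejecting unknown algorithm identifiers (rather than ignoring them) is what makes the transition-window mechanism safe: a file signed with an algorithm outside umka.ima.transition_algos fails appraisal instead of silently passing.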

8.4.3 Container and Namespace Interaction

IMA operates at the filesystem inode level, which is inherently global — the same inode has the same measurement regardless of which namespace accesses it. However, container isolation requires per-namespace policy control:

Policy namespacing — Each user namespace MAY have its own IMA policy overlay:

  1. The global policy (loaded at boot, signed, append-only) applies to all namespaces. It cannot be weakened by a container.
  2. A container runtime (e.g., umka-containerd) can request additional measurement rules for its namespace via securityfs in the container's IMA namespace (/proc/<pid>/ns/ima). These rules can only add measurements, never remove or weaken global rules.
  3. The IMA namespace is created alongside the user namespace. It inherits the global policy and can be extended but not reduced.

Measurement log namespacing — Each IMA namespace maintains its own measurement log, allowing per-container attestation. A remote verifier can request a TPM quote scoped to a specific container's measurements by specifying the namespace ID.

/// Per-namespace IMA state. Created alongside the user namespace (Section 16.1.2).
/// The global (init) IMA namespace is statically allocated; container
/// namespaces are dynamically allocated via `unshare(CLONE_NEWUSER)`.
pub struct ImaNamespace {
    /// Owning user namespace. Determines which UIDs can extend the
    /// local policy. Only `ns_capable(user_ns, CAP_SYS_ADMIN)` can
    /// write to this namespace's `securityfs` IMA policy file.
    pub user_ns: Arc<UserNamespace>,

    /// Global policy reference (read-only, shared across all namespaces).
    /// These rules cannot be removed or weakened by any container.
    /// Points to the init namespace's policy (loaded at boot, signed).
    pub global_policy: &'static ImaPolicy,

    /// Namespace-local additional rules (append-only). Container runtimes
    /// add rules via securityfs; rules can only strengthen measurement
    /// requirements, never weaken global policy. RCU-protected to allow
    /// lockless reads.
    ///
    /// **Read path** (every `file_open` / `mmap` / `exec`): rule evaluation
    /// reads the list under `rcu_read_lock()` — lock-free, zero contention.
    ///
    /// **Write path** (`securityfs` policy write, typically once at container
    /// startup): clone the current `Vec<ImaRule>`, append the new rule, swap
    /// the `Arc` under `rules_update_lock`; the old Vec is freed after the
    /// RCU grace period. This O(N) clone-on-update is an accepted tradeoff:
    /// IMA policy is written at most once per container lifecycle (at
    /// container startup), never at runtime. Deployments requiring more than
    /// ~1,000 dynamic policy updates could use an RCU-protected skip list;
    /// that optimization is not needed for current UmkaOS use cases.
    pub local_rules: RcuPtr<Arc<Vec<ImaRule>>>,

    /// Serializes rule list updates. Only held during the cold-path
    /// `securityfs` policy write — never on the file-open hot path.
    pub rules_update_lock: Mutex<()>,

    /// Per-namespace measurement log. Append-only, hash-chained: each
    /// entry includes `H(previous_entry)` for tamper detection. Entries
    /// record (pcr_index, algorithm, hash, filename_hint, timestamp).
    /// Read via `/proc/<pid>/ns/ima/ascii_runtime_measurements`.
    pub measurement_log: Mutex<ImaMeasurementLog>,

    /// Software PCR bank for container attestation. Mirrors the hardware
    /// TPM PCR extend model but in software — each measurement extends
    /// `virtual_pcr[index] = H(virtual_pcr[index] || measurement)`.
    /// Remote verifiers requesting per-container attestation receive
    /// these virtual PCR values (signed by the kernel's attestation key)
    /// rather than the global TPM PCRs. The global (init) namespace
    /// extends both the hardware TPM PCRs and its virtual PCR bank.
    pub virtual_pcrs: [AtomicPcrValue; NUM_VIRTUAL_PCRS],
}

/// A single IMA measurement log entry.
pub struct ImaMeasurementEntry {
    /// PCR index this measurement was extended into (virtual PCR bank).
    pub pcr_index: u8,
    /// Hash algorithm used (SHA-256, SHA-384, SHA-512).
    pub algorithm: HashAlgorithm,
    /// File content hash.
    pub digest: [u8; 64],  // max SHA-512; actual length from algorithm
    /// File path hint (for human-readable log output). Truncated to 255 bytes.
    pub filename_hint: [u8; 256],
    /// Measurement timestamp (monotonic nanoseconds).
    pub timestamp_ns: u64,
    /// Hash of the previous log entry (chain integrity).
    pub previous_hash: [u8; 32],  // SHA-256 chain
}
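
The hash-chained log and the virtual-PCR extend model can be sketched together. NOTE: `DefaultHasher` below is a non-cryptographic stand-in used only so the sketch runs on the standard library alone; the kernel uses SHA-256, and the real chain hashes the full entry rather than just the digest field:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Non-cryptographic stand-in for SHA-256 (illustration only).
fn toy_hash(data: &[u8]) -> [u8; 8] {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish().to_be_bytes()
}

struct LogEntry {
    digest: Vec<u8>,
    previous_hash: [u8; 8], // H(previous entry's digest): chain integrity
}

struct MeasurementLog {
    entries: Vec<LogEntry>,
    virtual_pcr: [u8; 8], // one virtual PCR slot for the sketch
}

impl MeasurementLog {
    fn new() -> Self {
        Self { entries: Vec::new(), virtual_pcr: [0; 8] }
    }

    /// Append a measurement: chain it to the previous entry and extend the
    /// virtual PCR as pcr = H(pcr || measurement).
    fn append(&mut self, digest: &[u8]) {
        let previous_hash = match self.entries.last() {
            Some(e) => toy_hash(&e.digest),
            None => [0; 8], // first entry chains to the zero value
        };
        let mut buf = self.virtual_pcr.to_vec();
        buf.extend_from_slice(digest);
        self.virtual_pcr = toy_hash(&buf);
        self.entries.push(LogEntry { digest: digest.to_vec(), previous_hash });
    }

    /// Each entry's previous_hash must match the recomputed predecessor hash.
    fn verify_chain(&self) -> bool {
        self.entries.windows(2).all(|w| w[1].previous_hash == toy_hash(&w[0].digest))
    }
}

fn main() {
    let mut log = MeasurementLog::new();
    log.append(b"/usr/bin/ssh");
    log.append(b"/usr/lib/libc.so");
    assert!(log.verify_chain());
    // Tampering with an earlier entry breaks the chain.
    log.entries[0].digest = b"evil".to_vec();
    assert!(!log.verify_chain());
}
```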

Container image verification — For container images deployed as read-only filesystem layers (overlayfs lower layers), dm-verity (Section 8.2.6) is the primary integrity mechanism. IMA covers the writable upper layer and any files modified at runtime.

Cross-namespace file access — When a process in namespace A accesses a file whose inode was already measured in namespace B, the measurement is reused from the global cache (same inode → same hash). The measurement is recorded in namespace A's log but not re-computed. This prevents DoS via repeated cross-namespace measurement requests.

8.4.4 Signed Hash Lifecycle (Package Updates)

IMA appraisal requires every file to carry a signed hash. This section specifies how signed hashes are generated, distributed, and updated during the software lifecycle.

Signing infrastructure:

Signing Key Hierarchy:
  Root CA (offline, HSM-protected)
   └── IMA Signing Key (per-distribution or per-organization)
        └── signs file hashes stored in security.ima xattr

Key distribution:
  - IMA signing key's public half is compiled into the kernel image
    (verified by Secure Boot, measured into PCR 9)
  - Alternatively: public key stored in a keyring loaded from a signed
    policy file (measured into PCR 12)

Package manager integration:

  1. Build time: The distribution's build system computes SHA-256 hashes of all files in a package and signs them with the IMA signing key. The signed hashes are stored as security.ima extended attributes in the package archive (RPM %__ima_sign, DEB dpkg-sig, or equivalent UmkaOS package format).

  2. Install time: The package manager (apt, dnf, or UmkaOS's native package tool) extracts files to disk including their security.ima xattrs. The package manager itself must be IMA-appraised (bootstrapping: the package manager binary is signed during initial system image creation).

  3. Update time: When updating a file (e.g., /usr/bin/ssh version upgrade):
     a. Package manager writes new file contents to a temporary path.
     b. Package manager sets the new security.ima xattr (from the package archive).
     c. Package manager atomically renames the temp file over the old file (rename(2)).
     d. IMA invalidates the measurement cache for the old inode (detected via inode version counter increment on the rename target).
     e. Next access to the file triggers re-measurement and appraisal against the new signed hash. The file is usable immediately — no reboot required.

  4. Rollback: If a package update fails mid-install, the old file retains its old security.ima xattr (the rename never happened). The system remains in a consistent state. If an attacker modifies a file without updating its xattr, appraisal fails and the file is denied.

Files without signed hashes: In umka.ima=enforce mode, files that require appraisal (per policy rules) but lack a valid security.ima xattr are denied. In umka.ima=log mode, they are measured and logged but not denied. This allows gradual rollout: start in log mode to identify unsigned files, then switch to enforce.
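
The enforce/log decision above can be condensed into a small decision function; `ImaMode`, `AppraisalResult`, and `appraise` are illustrative names for this sketch, not kernel API:

```rust
/// Sketch of the appraisal decision for files with and without a
/// valid security.ima xattr, under umka.ima=enforce vs umka.ima=log.
#[derive(Debug, PartialEq)]
pub enum ImaMode { Enforce, Log }

#[derive(Debug, PartialEq)]
pub enum AppraisalResult { Allow, Deny, LogOnly }

/// `policy_requires_appraisal`: does an appraisal policy rule match
/// this file? `xattr_valid`: does the file carry a security.ima xattr
/// whose signature verifies against the IMA public key?
pub fn appraise(mode: ImaMode, policy_requires_appraisal: bool,
                xattr_valid: bool) -> AppraisalResult {
    if !policy_requires_appraisal {
        // Measurement-only rules never deny access.
        return AppraisalResult::Allow;
    }
    match (xattr_valid, mode) {
        (true, _) => AppraisalResult::Allow,
        (false, ImaMode::Enforce) => AppraisalResult::Deny,
        // Log mode: measure and record the unsigned file but permit
        // access — used during gradual rollout to find unsigned files.
        (false, ImaMode::Log) => AppraisalResult::LogOnly,
    }
}
```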

IMA appraisal (enforcement) — Beyond measurement (recording what was accessed), IMA appraisal enforces integrity. Each file has a signed hash (stored as an extended attribute or in a separate manifest). On access, the kernel computes the file's hash, verifies it against the signed reference, and refuses to execute/load the file if they don't match. This blocks tampered binaries at runtime, not just at audit time.

IMA and UmkaOS driver loading — Every Tier 1 driver load is already measured into PCR 12 (Section 8.3). IMA extends this to:

  • Tier 2 driver binaries (userspace drivers loaded via KABI)
  • Userspace helper executables invoked by drivers
  • Firmware blobs loaded by drivers
  • Configuration files that affect driver behavior

IMA policy storage and loading — The IMA policy defines which files are measured and/or appraised. UmkaOS loads the IMA policy from one of these sources, in priority order:

  1. Built-in policy: A compiled-in default policy that measures all executables, shared libraries, and kernel modules. This is always available as a baseline.
  2. Signed policy file: /etc/umka/ima-policy.signed — a policy file signed with the IMA policy signing key (an RSA-4096 or Ed25519 key whose public half is compiled into the kernel or measured into TPM PCR 12 at boot). The kernel verifies the signature before loading. This is the recommended production mechanism.
  3. securityfs write: A privileged process (CAP_MAC_ADMIN) can write policy rules to /sys/kernel/security/ima/policy at runtime. Rules are append-only — once written, they cannot be removed without reboot. This is used for development and debugging only. In production configurations, umka.ima=enforce disables securityfs policy writes entirely.

Policy changes take effect immediately for newly opened files. Files already open and measured retain their existing measurement until closed and reopened.

Audit log — All measurements are logged to UmkaOS's audit subsystem. The log is tamper-evident (hash-chained) and optionally signed. A remote attestation server can request a TPM quote over the measurement PCR plus the audit log, verifying not just "the system booted correctly" but "the system has only executed known-good software since boot."

LSM policy changes are always IMA-audited. Any modification to the security policy is recorded in the IMA measurement log as a tamper-evident audit entry, satisfying audit trail requirements for security-sensitive deployments. The following events generate an IMA audit record:

  • Loading a new AppArmor profile or policy namespace (e.g., via apparmor_parser -r)
  • Loading a new SELinux policy module (e.g., via semodule -i)
  • Removing an LSM policy module
  • Changing the enforcement mode (permissive to enforcing or enforcing to permissive)
  • Loading a new UmkaOS-native LSM module via kabi_load_lsm()

Each audit record contains the following fields:

| Field | Description |
| --- | --- |
| event_type | IMA_LSM_POLICY_LOAD, IMA_LSM_POLICY_REMOVE, or IMA_LSM_MODE_CHANGE |
| module_name | Policy module name (e.g., "docker-default" for AppArmor, "container_t" for SELinux) |
| module_version | Policy module version string, if available |
| policy_hash | SHA-256 hash of the policy binary (for load events; absent for remove and mode-change events) |
| caller_uid | Real UID of the process that triggered the policy change |
| caller_pid | PID of the calling process |
| caller_comm | Process name (comm) of the caller |
| timestamp_mono | Monotonic clock timestamp (nanoseconds since boot) |
| timestamp_wall | Wall clock timestamp (nanoseconds since Unix epoch) |
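
A sketch of the record layout carrying these fields. The name ImaLsmPolicyEntry comes from this section; the Rust field types here are plausible assumptions, not the normative definition:

```rust
/// Illustrative layout of an LSM policy audit record. Optional fields
/// model the table above: policy_hash is present only for load events.
#[derive(Debug)]
pub enum LsmEventType {
    ImaLsmPolicyLoad,
    ImaLsmPolicyRemove,
    ImaLsmModeChange,
}

#[derive(Debug)]
pub struct ImaLsmPolicyEntry {
    pub event_type: LsmEventType,
    pub module_name: String,
    pub module_version: Option<String>,
    /// SHA-256 of the policy binary; None for remove/mode-change events.
    pub policy_hash: Option<[u8; 32]>,
    pub caller_uid: u32,
    pub caller_pid: u32,
    pub caller_comm: String,
    pub timestamp_mono: u64, // ns since boot
    pub timestamp_wall: u64, // ns since Unix epoch
}
```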

LSM policy audit records use the same ImaLsmPolicyEntry log structure as other IMA entries and are extended into TPM PCR 10 via the same async-extend path described below. This means a remote attestation quote covering PCR 10 captures not only which files were executed, but also every security policy change made since boot.

EVM (Extended Verification Module) — Protects file metadata (permissions, ownership, extended attributes, ACLs) against offline tampering. Without EVM, an attacker with physical access could mount the disk on another system and modify file permissions (e.g., make /usr/bin/su setuid-root) without changing file contents. EVM stores an HMAC over file metadata as an extended attribute, verified on every access. The HMAC key is sealed to the TPM, so offline modification is detectable.

Performance impact — IMA measurement cost is dominated by SHA-256 hashing of file contents and scales linearly with file size. For typical small files (config files, shared libraries <1 MB): ~10-50 us. For large binaries (100 MB): ~200 ms at ~500 MB/s single-threaded SHA-256 throughput. The measurement cache (below) amortizes this cost -- after the first measurement, re-measurement occurs only on content change, so the large-file cost is paid once. Mitigations:

  • Measurement cache: after the first measurement, the result is cached. Re-measurement occurs only if the file's content changes (detected via the VFS inode version counter, SB_I_VERSION — supported by ext4, Btrfs, XFS, and tmpfs since Linux 4.16+/Layton rework). For filesystems that do not implement SB_I_VERSION (e.g., some FUSE backends, legacy network filesystems), the cache falls back to mtime+ctime comparison, which is coarser-grained but avoids bypassing appraisal entirely.
  • Policy exemptions: transient paths (/tmp, /dev, /proc, /sys) are configurable exemptions — there is no point measuring pseudo-filesystems.
  • Async measurement: for large files, measurement can be performed asynchronously during non-security-critical reads (e.g., data files opened for read-only access). However, security-critical paths are always synchronous: any file opened for execution (mmap PROT_EXEC, dlopen, driver loading) is measured synchronously before the content is used. This prevents a TOCTOU vulnerability where file content could change between measurement and execution. The synchronous path holds an exclusive file lock (preventing concurrent writes) while computing the hash and comparing it against the signed reference. The async path is used only for audit/measurement-only policy rules where appraisal enforcement is not required.
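
The cache-validity rule (SB_I_VERSION cookie, with mtime+ctime fallback) can be sketched as follows. `MeasurementCache` and `ChangeCookie` are illustrative names for this sketch, not the kernel's types:

```rust
use std::collections::HashMap;

/// Change-detection cookie stored alongside a cached measurement.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum ChangeCookie {
    /// SB_I_VERSION counter (ext4, Btrfs, XFS, tmpfs).
    IVersion(u64),
    /// Coarser fallback for filesystems without SB_I_VERSION.
    Times { mtime_ns: u64, ctime_ns: u64 },
}

/// Measurement cache keyed by inode number. A cached hash is valid
/// only while the stored cookie matches the file's current cookie.
pub struct MeasurementCache {
    entries: HashMap<u64, (ChangeCookie, [u8; 32])>,
}

impl MeasurementCache {
    pub fn new() -> Self { Self { entries: HashMap::new() } }

    /// Returns the cached hash if the file is unchanged; None forces
    /// re-measurement (first access, or content changed).
    pub fn lookup(&self, inode: u64, current: ChangeCookie) -> Option<[u8; 32]> {
        match self.entries.get(&inode) {
            Some((cookie, hash)) if *cookie == current => Some(*hash),
            _ => None,
        }
    }

    pub fn record(&mut self, inode: u64, cookie: ChangeCookie, hash: [u8; 32]) {
        self.entries.insert(inode, (cookie, hash));
    }
}
```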

Integration with dm-verity — For read-only filesystems (container images, system partitions), dm-verity (Section 8.2.6) provides block-level integrity verification. IMA is complementary: it covers read-write filesystems where dm-verity cannot apply (dm-verity requires immutable block devices). Together, they provide complete coverage: dm-verity for immutable system images, IMA for mutable data and user-installed software.

Asynchronous PCR Extension Queue

IMA measures files on open, which requires extending a TPM PCR (PCR 10 for file measurements). TPM 2.0 PCR extend commands take 5-50ms depending on TPM firmware. Doing this synchronously in the VFS open path would add 5-50ms to every cold-open (first open of a file not in the IMA cache). For a container startup loading 1000 files, this adds 5-50 seconds — clearly unacceptable.

Solution: Measurement log + async PCR extend

IMA splits measurement into two phases:

  1. Synchronous (on file open): Hash the file content and record the hash in the IMA measurement log (an in-memory ring buffer). This takes 1-10us (SHA-256 of cached pages). File open proceeds immediately.
  2. Asynchronous (background thread): A dedicated ima-tpm-extend kernel thread batches PCR extends, processing them from the measurement log at ~20-100ms intervals (or on explicit sync request).

pub struct ImaMeasurementLog {
    /// Ring buffer of pending measurements. SHA-256 digests (32 bytes each).
    pending: SpscRing<ImaMeasurement, 4096>,
    /// Measurements already extended into TPM (for audit export).
    committed: RcuVec<ImaMeasurement>,
    /// The ima-tpm-extend thread wakes on this condvar.
    extend_needed: Condvar,
    /// Total measurements since boot (for log sequence numbers).
    total: AtomicU64,
}

pub struct ImaMeasurement {
    /// SHA-256 hash of the file content.
    pub file_hash: [u8; 32],
    /// PCR to extend into (default: 10).
    pub pcr: u8,
    /// Hash of (hash_algo || file_hash || filename).
    pub template_hash: [u8; 32],
    /// Timestamp of the measurement (nanoseconds since boot).
    pub timestamp: u64,
    /// Monotonic sequence number within this log.
    pub sequence: u64,
}

ima-tpm-extend thread:

fn ima_tpm_extend_thread() {
    loop {
        // Batch up to 64 measurements per TPM session.
        let batch = measurement_log.pending.drain_batch(64);
        if batch.is_empty() {
            measurement_log.extend_needed.wait_timeout(Duration::from_millis(100));
            continue;
        }
        // Open a single TPM session and extend all pending measurements.
        let session = tpm_open_session();
        for m in &batch {
            tpm_pcr_extend(session, m.pcr, &m.template_hash); // 5-50ms each, but batched
        }
        tpm_close_session(session);
        // Move to committed log.
        for m in batch {
            measurement_log.committed.push(m);
        }
    }
}

Consistency guarantee: Applications that require all measurements to be committed before proceeding can call ima_sync() (via ioctl(IMA_SYNC) or syncfs()), which blocks until ima-tpm-extend has processed all pending measurements. systemd uses this before pivoting to the real root to ensure boot measurements are finalized.
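
The blocking semantics of ima_sync() can be sketched with a pair of sequence counters and a condvar. Userspace std primitives stand in for kernel ones here, and `SyncState` is an illustrative name:

```rust
use std::sync::{Condvar, Mutex};

/// Sketch of the ima_sync() guarantee: callers block until every
/// measurement pending at call time has been extended into the TPM.
pub struct SyncState {
    /// (submitted, committed) sequence counters.
    counters: Mutex<(u64, u64)>,
    committed_cv: Condvar,
}

impl SyncState {
    pub fn new() -> Self {
        Self { counters: Mutex::new((0, 0)), committed_cv: Condvar::new() }
    }

    /// Open path: called after appending to the measurement log.
    pub fn submitted(&self) { self.counters.lock().unwrap().0 += 1; }

    /// ima-tpm-extend thread: called after each PCR extend completes.
    pub fn committed(&self) {
        self.counters.lock().unwrap().1 += 1;
        self.committed_cv.notify_all();
    }

    /// Measurements submitted but not yet extended into the TPM.
    pub fn pending(&self) -> u64 {
        let g = self.counters.lock().unwrap();
        g.0 - g.1
    }

    /// ima_sync(): block until everything submitted so far is committed.
    /// Later submissions need not be waited on — we snapshot the target.
    pub fn sync(&self) {
        let mut g = self.counters.lock().unwrap();
        let target = g.0;
        while g.1 < target {
            g = self.committed_cv.wait(g).unwrap();
        }
    }
}
```

Snapshotting the target at call time is what keeps ima_sync() bounded: a steady stream of new measurements cannot starve a waiting caller.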

Latency impact: File open latency = SHA-256 hash time (~1-10us for cached pages) + measurement log append (~100ns). TPM extend cost is moved off the open path entirely. Container startup loading 1000 files adds ~1-10ms total (hashing), not 5-50 seconds.

Queue Bounds and Overflow Policy

The pending ring in ImaMeasurementLog is bounded at 4096 entries. This provides ~50 seconds of buffering at 80 PCR extends/second (typical for a busy IMA workload during container startup or large package installation). The async extend thread drains the queue continuously, so under normal operation the queue stays well below capacity.

Overflow behavior — When the queue is full, the policy depends on the extend type:

  • IMA file measurement (runtime): The measurement is dropped — the file hash is not recorded in the measurement log and the corresponding PCR extend does not occur. The ima_measurements_dropped counter (AtomicU64) is incremented. A KERN_WARNING message is logged at most once per 60 seconds (rate-limited via a last_drop_warn: AtomicU64 timestamp to avoid log flooding). The file access proceeds normally — availability takes priority over integrity for already-running processes. The drop is visible to remote attestation verifiers via the counter and the gap in measurement log sequence numbers.

  • Boot-time PCR extends (security-critical): The caller blocks until queue space is available. Boot-time extends (kernel image, initramfs, Tier 1 drivers into PCR 9/10/12) are rare (typically < 20 total) and occur before the system accepts user workloads, so blocking is bounded by TPM throughput (~5-50ms per extend) and does not affect runtime latency.
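
A minimal sketch of the push-side policy, with the queue depth shrunk to 4 for illustration. Blocking is modeled as a `WouldBlock` return value, since a userspace sketch has no scheduler to park on; all names are illustrative:

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_DEPTH: usize = 4; // 4096 in the real queue

#[derive(Debug, PartialEq)]
pub enum PushResult { Queued, Dropped, WouldBlock }

/// Drop-on-overflow for runtime IMA extends, backpressure for
/// boot-critical extends, with a rate-limited drop warning.
pub struct ExtendQueue {
    queue: VecDeque<u64>, // sequence numbers stand in for TpmExtendOp
    pub drops: AtomicU64,
    last_drop_warn_ns: AtomicU64,
}

impl ExtendQueue {
    pub fn new() -> Self {
        Self { queue: VecDeque::new(), drops: AtomicU64::new(0),
               last_drop_warn_ns: AtomicU64::new(0) }
    }

    pub fn push(&mut self, seq: u64, boot_critical: bool, now_ns: u64) -> PushResult {
        if self.queue.len() < MAX_DEPTH {
            self.queue.push_back(seq);
            return PushResult::Queued;
        }
        if boot_critical {
            // Real code blocks until the drain thread makes space.
            return PushResult::WouldBlock;
        }
        // Runtime measurement: drop, count, warn at most once per 60s.
        self.drops.fetch_add(1, Ordering::Relaxed);
        let last = self.last_drop_warn_ns.load(Ordering::Relaxed);
        if now_ns.saturating_sub(last) >= 60_000_000_000 {
            self.last_drop_warn_ns.store(now_ns, Ordering::Relaxed);
            // printk(KERN_WARNING ...) would go here.
        }
        PushResult::Dropped
    }
}
```

Because dropped entries never receive a sequence number slot in the committed log, a remote verifier sees the gap directly, as described above.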

TPM Timeout Handling

Each queued extend operation has a 5-second deadline. If the TPM does not complete the TPM2_PCR_Extend command within 5 seconds (indicating a hung or extremely slow TPM):

  1. The tpm_extend_timeouts counter (AtomicU64) is incremented.
  2. A KERN_ERR message is logged with the TPM device name, PCR index, and operation sequence number.
  3. For IMA runtime extends: The measurement is dropped (same as queue overflow). The ima_measurements_dropped counter is incremented.
  4. For boot-time PCR extends: The operation is retried once with a fresh 5-second deadline. If the retry also times out, the TPM is declared degraded: tpm_degraded is set to true, and all further PCR extend operations are skipped. Existing measurements already committed to the TPM remain valid — degraded mode only stops new extends. A KERN_CRIT message is logged indicating TPM degradation. The kernel continues in software-only measurement mode: the IMA measurement log is still maintained (for local audit), but TPM-backed remote attestation is no longer available until the next reboot.
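
The retry-then-degrade policy can be expressed as a small state machine. `TpmHealth` is an illustrative name, and the closure stands in for issuing TPM2_PCR_Extend with a 5-second deadline (returning true on completion):

```rust
/// Sketch of the timeout policy: runtime extends drop on timeout;
/// boot-critical extends retry once, then mark the TPM degraded.
pub struct TpmHealth {
    pub timeouts: u64,
    pub dropped: u64,
    pub degraded: bool,
}

impl TpmHealth {
    pub fn new() -> Self { Self { timeouts: 0, dropped: 0, degraded: false } }

    pub fn extend<F: FnMut() -> bool>(&mut self, boot_critical: bool,
                                      mut try_extend: F) {
        if self.degraded {
            return; // software-only measurement mode: skip the TPM
        }
        if try_extend() {
            return; // completed within the deadline
        }
        self.timeouts += 1; // step 1 (step 2, KERN_ERR logging, elided)
        if !boot_critical {
            self.dropped += 1; // step 3: runtime extend dropped
            return;
        }
        if try_extend() {
            return; // step 4: boot-critical retry succeeded
        }
        self.timeouts += 1;
        self.degraded = true; // second timeout: TPM declared degraded
    }
}
```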

TpmExtendQueue

The low-level queue that backs both ImaMeasurementLog.pending and the timeout tracking:

/// Maximum pending TPM PCR extend operations.
const MAX_TPM_QUEUE_DEPTH: usize = 4096;

/// A single queued PCR extend operation.
pub struct TpmExtendOp {
    /// PCR index to extend.
    pub pcr: u8,
    /// Template hash to extend into the PCR.
    pub template_hash: [u8; 32],
    /// Monotonic sequence number (for gap detection on drop).
    pub sequence: u64,
    /// Deadline: absolute time (nanoseconds since boot) by which the
    /// TPM must complete this extend. Default: submission time + 5s.
    pub deadline_ns: u64,
    /// true if this is a boot-time extend (blocks on full, retries on timeout).
    pub boot_critical: bool,
}

/// Bounded async queue for TPM PCR extend operations.
///
/// Provides backpressure (blocking) for boot-critical extends and
/// drop-on-overflow for runtime IMA extends. Tracks timeouts and
/// degraded state for TPM health monitoring.
pub struct TpmExtendQueue {
    /// Pending extend operations, bounded at MAX_TPM_QUEUE_DEPTH.
    queue: ArrayDeque<TpmExtendOp, MAX_TPM_QUEUE_DEPTH>,
    /// Counts dropped extends due to queue full or timeout.
    pub drops: AtomicU64,
    /// Counts extends that timed out waiting for TPM response.
    pub timeouts: AtomicU64,
    /// Set when TPM is non-responsive; disables further extends.
    pub degraded: AtomicBool,
    /// Last time a drop warning was logged (nanos since boot).
    /// Used to rate-limit KERN_WARNING to once per 60 seconds.
    last_drop_warn_ns: AtomicU64,
}

8.5 Post-Quantum Cryptography

8.5.1 Why This Cannot Wait

NIST finalized post-quantum cryptography standards in 2024:

  • ML-KEM (Kyber): key encapsulation (replaces ECDH/RSA key exchange)
  • ML-DSA (Dilithium): digital signatures (replaces RSA/Ed25519)
  • SLH-DSA (SPHINCS+): hash-based signatures (stateless, conservative)

"Harvest now, decrypt later" attacks mean data encrypted today with classical algorithms is vulnerable to future quantum computers. Migration timelines are 5-10 years — starting now is not optional.

UmkaOS uses cryptography in:

| Component | Current Algorithm | PQC Replacement | Section |
| --- | --- | --- | --- |
| Verified boot (kernel signature) | RSA-4096 / Ed25519 | ML-DSA-65 or SLH-DSA-128f | Section 8.2.3 |
| Driver signatures | Ed25519 | ML-DSA-44 | Section 8.2.5 |
| dm-verity Merkle tree | SHA-256 | SHA-256 (quantum-safe as-is) | Section 8.2.6 |
| Distributed capabilities | Ed25519 | ML-DSA-44 | Section 5.8 |
| Cluster node authentication | X25519 key exchange + Ed25519 auth | ML-KEM-768 + ML-DSA-65 | Section 5.2 |
| TPM PCR measurements | SHA-256 | SHA-256 (quantum-safe) | Section 8.3 |

8.5.2 Design: Algorithm-Agile Crypto Abstraction

PQC algorithm specifications follow NIST FIPS publications verbatim:

  • ML-KEM-768 (key encapsulation): FIPS 203, "Module-Lattice-Based Key-Encapsulation Mechanism Standard," NIST, August 2024. Parameter set: ML-KEM-768 (security level 3, 1184-byte public key, 1088-byte ciphertext).

  • ML-DSA-44/65/87 (digital signatures): FIPS 204, "Module-Lattice-Based Digital Signature Standard," NIST, August 2024.

  • ML-DSA-44: driver signing (security level 2, 1312-byte public key)
  • ML-DSA-65: kernel signing (security level 3, 1952-byte public key)
  • ML-DSA-87: long-term identity keys (security level 5, 2592-byte public key)

  • SLH-DSA-128f (stateless hash-based signatures): FIPS 205, "Stateless Hash-Based Digital Signature Standard," NIST, August 2024. Used for: root of trust, boot verification (stateless = no state synchronization needed across reboots).

No custom parameter variations. Implementation MUST pass the NIST Known-Answer Tests (KAT) vectors from the respective FIPS publications.

The critical design requirement: never hardcode key/signature sizes. PQC signatures are much larger than classical:

| Algorithm | Signature Size | Public Key Size |
| --- | --- | --- |
| Ed25519 (current) | 64 bytes | 32 bytes |
| ML-DSA-44 (PQC) | 2,420 bytes | 1,312 bytes |
| ML-DSA-65 (PQC) | 3,309 bytes | 1,952 bytes |
| SLH-DSA-128f (PQC) | 17,088 bytes | 32 bytes |

Algorithm selection criteria:

  • ML-DSA (default): Lattice-based. Fast signing/verification, moderate signature size (~2.4-3.3KB). Use for: driver signatures, capabilities, cluster authentication. This is the standard choice unless there is a specific reason to avoid lattice assumptions.
  • SLH-DSA (paranoid mode): Hash-based (stateless). Conservative — survives even if lattice mathematical assumptions are broken by future cryptanalysis. Huge signatures (~17KB). Use for: kernel image signatures (verified once at boot, size doesn't matter). Configurable: umka.crypto.boot_algorithm=slh-dsa-128f overrides default ML-DSA-65.
  • Hybrid mode: Both Ed25519 + ML-DSA. Use during transition period (2025-2035). Both must verify. Provides defense if either algorithm family is broken.

If capability tokens (Section 5.8) have a fixed 64-byte signature field, PQC won't fit. The signature field must be variable-length or large enough for the biggest PQC algorithm.

// umka-core/src/crypto/mod.rs

/// Signature algorithm identifier.
/// Used for verified boot, driver signatures, capabilities, and
/// any context where digital signatures are created or verified.
/// Signatures and KEMs are separate enums because they serve
/// fundamentally different purposes: signatures prove authenticity
/// (sign/verify over a message), while KEMs establish shared secrets
/// (encapsulate/decapsulate). They have different field requirements
/// (signature data vs. ciphertext/shared secret) and must not be
/// conflated in data structures.
#[repr(u32)]
pub enum SignatureAlgorithm {
    // === Classical (pre-quantum) ===
    Ed25519             = 0x0001,
    Rsa4096Pss          = 0x0002,

    // === Post-quantum (NIST standards) ===
    MlDsa44             = 0x0100,   // FIPS 204, security level 2
    MlDsa65             = 0x0101,   // FIPS 204, security level 3
    MlDsa87             = 0x0102,   // FIPS 204, security level 5
    SlhDsa128f          = 0x0110,   // FIPS 205, fast variant
    SlhDsa128s          = 0x0111,   // FIPS 205, small variant

    // === Hybrid (classical + PQC, transition period) ===
    Ed25519PlusMlDsa44  = 0x0200,   // Both signatures, both must verify
    Rsa4096PlusMlDsa65  = 0x0201,
    Ed25519PlusMlDsa65  = 0x0202,   // Ed25519 + ML-DSA-65 (kernel image default)
}

/// Key Encapsulation Mechanism (KEM) algorithm identifier.
/// Used for key exchange in cluster node authentication (Section 5.2)
/// and any context where shared secrets are established.
/// Separate from SignatureAlgorithm because KEMs produce
/// (ciphertext, shared_secret) pairs, not signatures.
#[repr(u32)]
pub enum KemAlgorithm {
    // === Post-quantum (NIST standards) ===
    MlKem768            = 0x0120,   // FIPS 203, security level 3
    MlKem1024           = 0x0121,   // FIPS 203, security level 5
}
// Note: Hybrid signature algorithm IDs start at 0x0200, which exceeds u8 range.
// Network-portable structures (e.g., DistributedCapability in Section 5.8)
// that carry a sig_algorithm field MUST use at least u16 (or the full u32
// encoding) to represent the complete SignatureAlgorithm ID space.
// Cross-reference: DistributedCapability.sig_algorithm in Section 5.8 (05-distributed.md
// Section 5.8.2) IS already defined as u16 — this requirement is satisfied.

/// Variable-length signature.
/// Avoids hardcoding signature size in data structures.
pub struct Signature {
    /// Algorithm that produced this signature.
    pub algorithm: SignatureAlgorithm,
    /// Signature bytes (length depends on algorithm).
    pub data: SignatureData,
}

/// Signature data — inline for classical (hot path), boxed for PQC (cold path).
pub enum SignatureData {
    /// Ed25519: 64 bytes. Fits inline. No allocation.
    /// Used in capability verification cache (hot-path lookups).
    Inline64([u8; 64]),
    /// PQC signatures: heap-allocated via Box<[u8]>.
    /// PQC signatures are 2.4KB–17KB. Heap allocation is acceptable
    /// because PQC signing/verification occurs only on cold paths
    /// (boot, driver load, capability creation, cluster join), all of
    /// which run after the kernel slab allocator is initialized
    /// (post-Phase 2 boot, Section 4.1.2). During early boot (before
    /// the heap is available), signature verification uses the
    /// fixed-size buffers in KernelSignature and DriverSignature
    /// structs directly, bypassing this type entirely.
    Heap(Box<[u8]>),
}

// Note on cache pressure: cached PQC capabilities are ~2.5KB per entry
// vs 128 bytes for Ed25519. For 1,000 active capabilities: 2.5MB vs 128KB.
// Both fit in L3 cache on any modern system. If the working set grows
// beyond this, LRU eviction of cold capabilities mitigates pressure.

8.5.3 Impact on Distributed Capabilities

Distributed capabilities (Section 5.8) carry a signature for network verification. With PQC:

Current DistributedCapability:
  Base fields:     ~64 bytes
  Ed25519 signature: 64 bytes
  Total: ~128 bytes per capability (classical-only mode)

  With hybrid Ed25519 + ML-DSA-65 credential (production default):
  ~3.6 KB (including pre-allocated PQC signature buffer — see Section 5.8.2
  for the canonical struct definition with full field breakdown)

With ML-DSA-44:
  Base fields:     ~64 bytes
  ML-DSA-44 signature: 2,420 bytes
  Total: ~2,484 bytes per capability

Impact:
  - Capability token is ~20x larger.
  - RDMA bandwidth for capability exchange: negligible (capabilities
    are exchanged once and cached, not per-operation).
  - Verification time: Ed25519 verify is ~25-50 μs.
    ML-DSA-44 verify is ~50-120 μs (cached after first verification).
    ML-DSA-44 is typically 2-3x slower than Ed25519, but both are
    negligible on cold paths (driver load, cluster join).
  - Memory for cached capabilities: 2.5KB vs 128 bytes per entry.
    For a cluster with 1,000 active capabilities: 2.5 MB vs 128 KB.
    Negligible on any modern system.

8.5.4 Hybrid Mode (Transition Period)

During the transition period (2025-2035), sign with BOTH classical and PQC algorithms:

Kernel image signature:
  Ed25519 signature: 64 bytes    (verifiable by old bootloaders)
  ML-DSA-65 signature: 3,309 bytes (verifiable by PQC-aware bootloaders)

  Verification responsibility:
  - Pre-UmkaOS bootloaders (UEFI Secure Boot, GRUB) do NOT understand
    UmkaOS's KernelSignature format. Backward compatibility requires
    that the kernel image also carry a standard signature in the
    format the bootloader expects (e.g., Authenticode PE signature
    for UEFI Secure Boot, GPG detached signature for GRUB). The
    UmkaOS KernelSignature (containing hybrid Ed25519 + ML-DSA-65)
    is appended separately and is invisible to these bootloaders.
  - UmkaOS-aware bootloaders that predate PQC support verify only
    the classical component of the hybrid signature.

    **Hybrid signature wire format**: Hybrid signatures use a
    length-prefixed layout, not a fixed-offset convention. Format:
    `[classical_len: u16 | classical_sig: [u8; classical_len] |
    pqc_sig: [u8; remaining]]`. The verifier reads `classical_len`
    to split the signature buffer. For Ed25519 + ML-DSA-65 hybrids,
    `classical_len = 64`. For RSA-4096-PSS + ML-DSA-65 hybrids,
    `classical_len = 512`. This format accommodates any classical
    algorithm size without breaking the wire protocol.

    An older UmkaOS-aware verifier that does not recognize algorithm
    ID 0x0200 reads `classical_len` from the first two bytes of the
    signature buffer, extracts `classical_sig[0..classical_len]`,
    and verifies that component only, ignoring the trailing PQC
    signature data. To enable this, older UmkaOS-aware verifiers MUST
    treat any algorithm ID in the range 0x0200-0x02FF as "hybrid:
    extract classical_len from bytes [0..2), verify classical_sig
    from bytes [2..2+classical_len)". Legacy verifiers that only
    understand algorithm IDs in the range 0x0000-0x00FF ignore the
    hybrid field entirely. Algorithm IDs outside recognized ranges
    cause verification to fail (unknown algorithm = untrusted).
  - The UmkaOS boot stub, once loaded, verifies BOTH the Ed25519 and
    ML-DSA signatures before proceeding. If either fails, boot halts.

    **Downgrade protection**: The UmkaOS boot stub always verifies both
    signature components (classical + PQC) of a hybrid signature.
    To prevent downgrade attacks where an older boot stub is substituted
    (one that only verifies classical signatures), the boot chain itself
    must be protected: (1) the UEFI `dbx` (Forbidden Signature Database)
    should blacklist known pre-PQC boot stub hashes, and (2) the TPM
    anti-rollback counter (Section 8.2.7) covers the boot stub version, preventing
    rollback to pre-PQC firmware. Systems that rely on the classical-only
    fallback path (legacy BIOS) accept this reduced security posture as
    a documented trade-off.

  Algorithm unavailability:
  - If hardware acceleration for one algorithm is unavailable, software
    fallback is used. Current UmkaOS boot stubs always verify both signatures
    — there is no single-algorithm fallback mode within a current boot stub,
    as this would defeat the purpose of hybrid signatures. Legacy boot stubs
    that only verify the classical component operate under the documented
    reduced security posture described above (mitigated by dbx blacklisting
    and TPM anti-rollback). Systems running a current UmkaOS boot stub that
    cannot verify both algorithms refuse to boot.
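
The verifier's split step for the length-prefixed layout can be sketched as follows. The u16 is read little-endian here, which is an assumption (the section does not pin byte order), and `split_hybrid` is an illustrative name:

```rust
/// Split a hybrid signature buffer laid out as
/// [classical_len: u16 | classical_sig | pqc_sig].
/// Returns (classical_sig, pqc_sig), or None if malformed.
pub fn split_hybrid(sig: &[u8]) -> Option<(&[u8], &[u8])> {
    if sig.len() < 2 {
        return None;
    }
    let classical_len = u16::from_le_bytes([sig[0], sig[1]]) as usize;
    let body = &sig[2..];
    if classical_len > body.len() {
        return None; // malformed length prefix: fail closed
    }
    // Older UmkaOS-aware verifiers verify only the classical half;
    // current boot stubs verify both halves and halt if either fails.
    Some((&body[..classical_len], &body[classical_len..]))
}
```

For an Ed25519 + ML-DSA-65 kernel signature, `classical_len` is 64 and the remaining 3,309 bytes are the ML-DSA-65 component.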

Driver signature (.kabi_sig section):
  Ed25519: 64 bytes
  ML-DSA-44: 2,420 bytes
  Total overhead per driver binary: ~2.5 KB. Negligible.

Cluster capability signature:
  Ed25519PlusMlDsa44: verify both. Total verify time: ~50-170 μs.
  Cached after first verification. Zero hot-path impact.

8.5.5 Linux Compatibility

The Linux crypto API (/proc/crypto, AF_ALG sockets) is the standard interface for userspace access to kernel cryptography. PQC algorithms are registered as additional entries:

$ cat /proc/crypto | grep -A2 ml-dsa
name         : ml-dsa-44
driver       : ml-dsa-44-generic
module       : kernel
type         : akcipher

Existing Linux tools (cryptsetup, dm-crypt, etc.) use the crypto API.
Adding PQC algorithms is additive — existing algorithms remain available.

PQC Signature Size and IPC Ring Buffers:

PQC signatures (2.4KB–17KB) are much larger than Ed25519 (64 bytes). However, IPC ring buffers (Section 10.6) do not carry signatures — capabilities are verified at connect() time, not per-message. The SignatureData::Heap variant handles large PQC signatures without affecting ring buffer sizing. No ring buffer resize is needed.

Side-Channel Resistance:

All PQC implementations in the kernel MUST be constant-time. Variable-time operations leak secret key material through timing and cache side channels. CI enforcement: ctgrind-style testing (Valgrind memcheck with secret memory marked as undefined) runs on every commit touching crypto code. Non-constant-time code paths are rejected.

Benchmark Reference:

Approximate times on a modern x86 core (single-threaded, AVX2-enabled). References: ed25519-dalek 4.x, pqcrypto-dilithium (AVX2), liboqs 0.10+:

  • Ed25519: ~25-50 μs (verify, depending on hardware and key schedule caching), ~15-20 μs (sign)
  • ML-DSA-44: ~50-120 μs (verify), ~80-200 μs (sign)
  • ML-DSA-65: ~100-200 μs (verify), ~250-460 μs (sign)
  • ML-KEM-768: ~90 μs (encapsulate), ~80 μs (decapsulate)

Ranges reflect implementation quality (optimized AVX2 vs. reference C) and CPU microarchitecture. ML-DSA-44 verification is typically 2-3x slower than Ed25519 in absolute terms, though both are fast enough to be negligible on cold-path operations (boot, driver load, cluster join). The performance difference is irrelevant for operations that occur at most a few times per second.

8.5.6 Performance Impact

Zero hot-path impact. Cryptographic operations occur only on cold paths:

  • Boot: once (kernel signature verification)
  • Driver load: once per driver (~50 drivers at boot)
  • Cluster join: once per node
  • Capability creation: infrequent (not per-operation)

Ed25519 verification (~25-50 μs) and ML-DSA-44 verification (~50-120 μs) are both fast on modern hardware; ML-DSA-65 (~100-200 μs) is somewhat slower. All are fast enough for cold-path operations. ML-KEM encapsulation is faster than ECDH. The performance difference is negligible for operations that occur once per boot, driver load, or cluster join.

8.5.7 PQC Key Management

PQC private keys must be stored securely. TPM 2.0 PQC support is emerging: TCG TPM 2.0 Library Specification 2.0 (V184, March 2025) is the latest published revision. V185 RC4 (December 2025) adds PQC algorithm profiles and completed public review on February 10, 2026. Until V185 is formally published, it should be treated as near-final draft, not as a normative reference. Initial PQC-capable TPM hardware samples were announced in late 2025 (e.g., SEALSQ). Production availability expected 2026-2027 (optimistic — historical TPM spec-to-silicon timelines are 18–36 months; plan for 2028 as fallback).

Key storage options (in priority order):

1. TPM 2.0 with PQC profile (near-term):
   TPM stores ML-DSA/ML-KEM private keys in hardware. Key never
   leaves TPM. This is the ideal solution.
   Status: as above — TCG TPM 2.0 V184 (March 2025) is the latest
   published revision; V185 (PQC algorithm profiles) is near-final
   after RC4; PQC-capable TPM hardware samples announced (SEALSQ,
   late 2025), with production availability expected 2026-2027
   (plan for 2028 as the conservative fallback).

2. HSM / external key store:
   Enterprise deployments use network HSMs (Thales Luna, AWS CloudHSM)
   that already support ML-DSA/ML-KEM. Kernel communicates via PKCS#11
   or vendor-specific interface. Key never in kernel memory.

3. UEFI Secure Variable storage:
   Private key stored in a custom UEFI authenticated variable
   (NOT db/dbx — those store public keys and certificates for
   signature verification). Protected by Secure Boot chain: only
   authenticated updates can modify the variable. Key loaded into
   kernel memory at boot, used for signing, then zeroed. Vulnerable
   to cold-boot attacks. Acceptable for non-TEE systems.

4. Software key in kernel memory (fallback):
   IMPORTANT: Only the VERIFICATION (public) key is embedded in
   kernel .rodata. This is standard practice (Linux embeds its
   module signing public key the same way). The public key allows
   the kernel to verify signatures but cannot create them. Exposure
   of the public key via a memory read vulnerability does NOT
   compromise the signing process — an attacker who reads the
   public key can verify signatures (which is public information)
   but cannot forge them. The PRIVATE (signing) key must NEVER
   be present in the running kernel image. It exists only in the
   build environment (offline signing server, HSM, or developer
   workstation) and is used during the build process to produce
   signatures that are then appended to kernel/driver images.
   For cluster identity keys (which require a private key at
   runtime for mutual authentication), the private key is loaded
   from disk at boot and held in kernel memory. Protected by:
   - Kernel ASLR (address space randomization)
   - Confidential computing (if running in TEE, key is hardware-encrypted)
   - Memory zeroing on key rotation
   Weakest option for runtime private keys. Acceptable for
   development and non-critical deployments.
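The "memory zeroing on key rotation" mitigation can be sketched as follows. This is a hypothetical helper, not spec API; the point is that the wipe must use a volatile write so the compiler cannot elide it as a dead store:

```rust
/// Hypothetical key-rotation helper for the software-key fallback:
/// volatile-zero the old key material before installing the new key,
/// so stale secrets never linger in reused memory.
fn rotate_key(key: &mut [u8; 32], new_key: [u8; 32]) {
    for b in key.iter_mut() {
        // write_volatile: the wipe must survive optimization even
        // though the value is overwritten immediately afterwards.
        unsafe { std::ptr::write_volatile(b as *mut u8, 0) };
    }
    *key = new_key;
}
```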

Root of trust for verified boot (Section 8.2):
  Phase 1: Classical Ed25519 public key in UEFI db (existing infrastructure).
  Phase 2: Hybrid Ed25519 + ML-DSA-65 public keys in UEFI db.
  Phase 3: ML-DSA-65 key in TPM 2.0 PQC profile (when available).
  Each phase is a strict superset — old keys continue to work.
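A minimal sketch of how the three phases map to verification requirements (enum and function names are hypothetical). The key property is that the hybrid phase requires BOTH signatures to verify, so forging an image means breaking Ed25519 and ML-DSA-65 simultaneously:

```rust
/// Hypothetical rollout phases mirroring the plan above.
#[derive(Clone, Copy)]
enum BootKeyPhase {
    Classical, // Phase 1: Ed25519 only
    Hybrid,    // Phase 2: Ed25519 AND ML-DSA-65
    PqcOnly,   // Phase 3: ML-DSA-65 via TPM PQC profile
}

/// Returns (ed25519_required, ml_dsa_65_required).
fn required_signatures(phase: BootKeyPhase) -> (bool, bool) {
    match phase {
        BootKeyPhase::Classical => (true, false),
        BootKeyPhase::Hybrid    => (true, true), // AND, not OR
        BootKeyPhase::PqcOnly   => (false, true),
    }
}
```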

8.6 Confidential Computing

8.6.1 Why This Cannot Wait

AMD SEV-SNP, Intel TDX, and ARM CCA are shipping today. Every major cloud provider offers confidential VMs. The fundamental shift: the kernel becomes untrusted by its own workloads.

Traditional trust model (current design):
  Hardware → Kernel → Processes
  The kernel can read all process memory.
  The kernel is the root of trust.

Confidential computing trust model:
  Hardware → TEE (Trusted Execution Environment) → Workload
  The kernel CANNOT read TEE memory. Hardware enforces encryption.
  The kernel is OUTSIDE the trust boundary.
  The workload trusts only hardware + its own code.

This directly conflicts with several kernel subsystems if not designed in:

Page migration (HMM, Section 21.2):
  Conflict:   Kernel can't read encrypted pages to copy them.
  Resolution: Hardware-assisted migration (SEV-SNP page copy with re-encryption).

Memory compression (Section 4.2):
  Conflict:   Kernel can't compress what it can't read.
  Resolution: Skip compression for confidential pages (metadata-only eviction).

DSM (Section 5.6):
  Conflict:   Can't RDMA encrypted pages — remote node can't decrypt.
  Resolution: TEE-to-TEE RDMA with shared key negotiation.

In-kernel inference (Section 21.4):
  Conflict:   Can't observe page content of encrypted memory; access patterns
              (page faults, accessed/dirty bits) remain fully observable
              (Section 8.6.7).
  Resolution: Train models on access patterns and scheduling metadata only;
              content-based features unavailable for TEE workloads
              (~10-20% quality reduction, Section 8.6.7).

Crash recovery (Section 10.8):
  Conflict:   Can't inspect process state to reconstruct after driver crash.
  Resolution: Preserve TEE context across driver reload (hardware feature).

FMA telemetry (Section 19.1):
  Conflict:   Health events may leak information about confidential workloads.
  Resolution: Aggregate-only telemetry for confidential contexts.

8.6.2 Architectural Requirements

// umka-core/src/security/confidential.rs

/// Confidential computing context.
/// Attached to a process or VM that runs inside a TEE.
pub struct ConfidentialContext {
    /// Hardware TEE type.
    pub tee_type: TeeType,

    /// Hardware-managed encryption key ID.
    /// Each TEE context has a unique key. The kernel never sees the key.
    /// The hardware memory controller encrypts/decrypts transparently.
    pub key_id: u32,

    /// VMPL level (0-3). Only meaningful when tee_type == AmdSevSnpVmpl.
    /// 0 = paravisor/SVSM (most privileged), 1+ = guest OS.
    /// For non-VMPL TEE types, this field is 0 and ignored.
    pub vmpl_level: u8,

    /// Attestation state.
    pub attestation: AttestationState,

    /// Memory policy: how the kernel handles this context's pages.
    pub memory_policy: ConfidentialMemoryPolicy,
}

#[repr(u32)]
pub enum TeeType {
    /// No TEE. Standard process. Kernel can access all memory.
    None        = 0,
    /// AMD SEV-SNP (Secure Encrypted Virtualization - Secure Nested Paging).
    /// Implicitly runs at VMPL0 (most privileged level within the VM).
    AmdSevSnp   = 1,
    /// Intel TDX (Trust Domain Extensions).
    IntelTdx    = 2,
    /// ARM CCA (Confidential Compute Architecture).
    ArmCca      = 3,
    /// AMD SEV-SNP with VMPL (Virtual Machine Privilege Level) at a
    /// non-default level. VMPL allows multiple trust levels within a
    /// single SEV-SNP VM: VMPL0 = paravisor/SVSM (most privileged),
    /// VMPL1+ = guest OS. Required for AMD's SVSM architecture.
    /// The VMPL level is stored in ConfidentialContext.vmpl_level (u8).
    AmdSevSnpVmpl = 4,
}

// Note: The VMPL level for AmdSevSnpVmpl is stored in ConfidentialContext
// as a separate field, not embedded in the enum variant. This preserves
// #[repr(u32)] compatibility (Rust requires all variants of a repr(inttype)
// enum to be unit variants with explicit discriminants).

#[repr(u32)]
pub enum ConfidentialMemoryPolicy {
    /// All pages are encrypted. Kernel cannot read content.
    /// Kernel can still manage page tables, migration metadata.
    FullEncryption  = 0,
    /// Shared pages (explicitly marked by guest) are readable by kernel.
    /// Private pages are encrypted. Used for I/O buffers.
    HybridShared    = 1,
}

/// Attestation lifecycle state.
pub enum AttestationState {
    /// No attestation attempted yet.
    Unattested,
    /// Challenge issued, awaiting hardware report.
    PendingChallenge { nonce: [u8; 32] },
    /// Successfully attested. Report can be verified by remote party.
    Attested {
        /// Hardware-generated attestation report (opaque, TEE-specific).
        /// Attestation report sizes vary by platform:
        /// AMD SEV-SNP: ~1,184 bytes (ATTESTATION_REPORT structure),
        /// Intel TDX: 1,024 bytes (TDREPORT),
        /// ARM CCA: Realm token (Realm Measurement and Attestation Token, RMAT)
        /// as defined in the ARM CCA Security Model specification.
        /// The 4,096-byte buffer provides headroom for all platforms
        /// including any future format extensions.
        /// Only report_len bytes are meaningful.
        report: [u8; 4096],
        /// Number of valid bytes in the report buffer.
        report_len: u16,
        /// SHA-384 hash of the report (for quick identity checks).
        report_hash: [u8; 48],
        /// When attestation completed (monotonic clock).
        timestamp: u64,
    },
    /// Attestation failed (hardware error or measurement mismatch).
    Failed { reason: AttestationError },
}

#[repr(u32)]
pub enum AttestationError {
    /// Hardware does not support attestation.
    HardwareUnsupported = 0,
    /// Measurement does not match expected value.
    MeasurementMismatch = 1,
    /// Hardware reported an internal error.
    HardwareError       = 2,
    /// Certificate chain validation failed.
    CertificateInvalid  = 3,
}

Attestation Verification:

Hardware-rooted attestation allows remote parties to verify the integrity of a confidential workload. AMD SEV-SNP uses a Versioned Chip Endorsement Key (VCEK) signed by AMD's Key Distribution Service (KDS). Intel TDX uses a Quoting Enclave (QE) that produces quotes verifiable via Intel's Provisioning Certification Service (PCS). ARM CCA uses platform attestation tokens signed by the CCA platform key.

UmkaOS exposes Linux-compatible attestation device nodes:
  - /dev/sev-guest — AMD SEV-SNP guest attestation reports
  - /dev/tdx_guest — Intel TDX guest attestation reports (via TDX_CMD_GET_REPORT)

These are implemented in umka-compat with identical ioctl semantics to Linux.
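The AttestationState lifecycle above can be sketched as a pair of transition helpers. This is a simplified, hypothetical mirror: report buffers and error reasons are elided, keeping only the nonce-binding logic:

```rust
/// Simplified mirror of AttestationState (Section 8.6.2).
#[derive(Debug, PartialEq)]
enum State {
    Unattested,
    PendingChallenge([u8; 32]),
    Attested,
    Failed,
}

/// Issue a fresh challenge nonce. Re-attestation from Failed is allowed.
fn issue_challenge(s: State, nonce: [u8; 32]) -> State {
    match s {
        State::Unattested | State::Failed => State::PendingChallenge(nonce),
        other => other, // already pending or attested: no-op
    }
}

/// Accept a hardware report only if it echoes the outstanding nonce;
/// a mismatched nonce indicates a replayed or stale report.
fn accept_report(s: State, echoed_nonce: [u8; 32]) -> State {
    match s {
        State::PendingChallenge(n) if n == echoed_nonce => State::Attested,
        State::PendingChallenge(_) => State::Failed,
        other => other,
    }
}
```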

8.6.3 Design Approach: Opaque Page Handles

The key design principle: kernel subsystems must be able to manage pages without reading their contents. Every subsystem that touches page data needs a "confidential-aware" path.

// umka-core/src/mem/page.rs

/// A physical page handle. The kernel always has this.
/// Whether the kernel can read the page DATA depends on the page's
/// confidentiality state.
pub struct PageHandle {
    /// Physical frame number.
    pub pfn: u64,

    /// Ownership and state metadata (kernel can always read this).
    pub state: PageState,

    /// Is this page's content readable by the kernel?
    /// false for pages owned by a ConfidentialContext.
    pub kernel_readable: bool,

    /// For confidential pages: hardware key ID for re-encryption
    /// during migration. The kernel uses this to instruct hardware
    /// to re-encrypt the page for the destination context.
    pub encryption_key_id: Option<u32>,
}

Subsystem-by-subsystem adaptation:

Memory compression (Section 4.2):
  if page.kernel_readable:
    compress normally (LZ4/zstd)
  else:
    skip compression — evict to swap encrypted
    (hardware encrypts at rest, no kernel involvement)

Page migration (Section 21.2, GPU VRAM):
  if page.kernel_readable:
    DMA copy (standard path)
  else:
    hardware-assisted encrypted copy
    (SEV-SNP: SNP_PAGE_MOVE firmware command via PSP, TDX: TDH.MEM.PAGE.RELOCATE SEAMCALL)

DSM (Section 5.6):
  if both endpoints are in the same TEE trust domain:
    Key negotiation protocol:
      1. Both nodes produce attestation reports (hardware-rooted).
      2. Reports are mutually verified (each node checks the other's
         measurement, firmware version, security policy).
      3. Key exchange via ML-KEM-768 (PQC, Section 8.5) authenticated by
         attestation-bound identity.
      4. Shared symmetric key established for RDMA encryption.
      5. RDMA transfers encrypted with shared key (AES-GCM).
    Latency: ~5ms for initial key negotiation (once per node pair).
    Steady-state RDMA: same performance as non-TEE (hardware AES).
    Key rotation: every 1 hour or 2^32 RDMA operations (whichever
    comes first). Nonce construction: 4-byte connection ID || 8-byte
    per-connection counter (deterministic, never reused within a key
    lifetime). The 2^32 limit is a conservative policy: counter-based
    nonces can safely reach 2^64, but frequent rotation limits the
    exposure window of any single key.
    Re-attestation: every 24 hours or on firmware update. If
    re-attestation fails (measurement changed), the shared key is
    revoked and all RDMA connections to that node are torn down.
    DSM pages cached from that node are invalidated (Section 5.6).
  else (different trust domains):
    pages must be re-encrypted for each endpoint
    → higher latency, but functionally correct
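The nonce construction and rotation policy described above can be sketched as follows (hypothetical helper functions; the 12-byte nonce layout and the rotation thresholds are taken directly from the policy text):

```rust
/// 12-byte AES-GCM nonce: 4-byte connection ID || 8-byte per-connection
/// counter. Deterministic; never reused within one key lifetime because
/// the counter is strictly increasing.
fn rdma_nonce(conn_id: u32, counter: u64) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    nonce[..4].copy_from_slice(&conn_id.to_be_bytes());
    nonce[4..].copy_from_slice(&counter.to_be_bytes());
    nonce
}

/// Rotate at 1 hour or 2^32 operations, whichever comes first.
/// Counter nonces are safe to 2^64; 2^32 is a deliberately
/// conservative exposure-window limit.
fn key_needs_rotation(ops: u64, key_age_secs: u64) -> bool {
    ops >= 1u64 << 32 || key_age_secs >= 3600
}
```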

Hardware memory tagging (MTE, Section 2.3):
  MTE tags are stored in a separate physical memory region (tag RAM).
  For TEE-encrypted pages, tag RAM may also be encrypted.
  The kernel cannot set/clear tags on confidential pages.
  Policy: confidential pages are allocated UNTAGGED (tag = 0).
  MTE checking is disabled for pages owned by a ConfidentialContext.
  Rationale: MTE protects against kernel bugs accessing the page.
  For confidential pages, hardware encryption already prevents
  kernel access — MTE is redundant.

In-kernel inference (Section 21.4):
  Can observe: page fault frequency, access pattern (from page table
  accessed bits, not page content), memory pressure signals.
  Cannot observe: page content.
  → Page prefetching model trains on access patterns, not content.
  → Works fine. Content is irrelevant for prefetch decisions.

8.6.4 Guest Mode: UmkaOS as a Confidential Guest

When UmkaOS runs inside a confidential VM (as a guest on a hypervisor):

UmkaOS as TDX/SEV-SNP guest:
  1. All guest physical memory is encrypted by hardware.
  2. The hypervisor cannot read guest memory.
  3. UmkaOS must:
     a. Use SWIOTLB (bounce buffer) for device I/O with the hypervisor.
     b. Explicitly mark shared pages for virtio/MMIO communication.
     c. Validate all data received from the hypervisor (untrusted host).
     d. Support remote attestation so users can verify the guest.

  This requires:
  - Memory manager: distinguish "private" (encrypted) vs "shared" (plaintext) pages.
  - I/O path: bounce buffers for virtio when running as confidential guest.
  - io_uring: bounce buffer pool for DMA data payloads (SQE/CQE rings stay encrypted).
    See [Section 18.1.5.1](18-compat.md#18151-iouring-under-sev-snp-confidential-guest-mode)
    for the complete io_uring + SEV-SNP bounce buffer specification.
  - Attestation: kernel provides attestation report via sysfs/ioctl.

8.6.4.1 Confidential VM Live Migration

Live migration of a confidential VM requires that:
  1. The destination platform is attested as trustworthy before any guest
     memory is sent.
  2. Guest memory is transported under a migration transport key established
     via attestation-authenticated key exchange — the source host cannot read
     it in transit.
  3. The migration request is authenticated to prevent rogue hosts from
     diverting a confidential VM to an untrusted platform.

AMD SEV-SNP migration protocol (KVM's KVM_SEV_SEND_* / KVM_SEV_RECEIVE_* ioctls):

Phase 1 — Destination attestation:
  1. Destination PSP generates a migration ephemeral key pair:
       (migr_pub, migr_priv) = ML-KEM-768 keypair
  2. Destination PSP issues an attestation report that includes migr_pub
     in the REPORT_DATA field (binds the key to the platform measurement).
  3. Attestation report is signed by the VCEK (Versioned Chip Endorsement Key)
     which chains to AMD's KDS root.
  4. Destination sends: (attestation_report, migr_pub) to the migration controller.

Phase 2 — Source verifies destination:
  1. Migration controller verifies attestation_report against AMD KDS root.
  2. Verifies: PLATFORM_VERSION ≥ minimum required, POLICY matches expected.
  3. Extracts migr_pub from REPORT_DATA (proves key ownership by attested platform).
  4. Source PSP generates transport key:
       transport_key = ML-KEM-768 Encapsulate(migr_pub) → (ciphertext, shared_secret)
  5. AES-256-GCM key derived from shared_secret via HKDF-SHA-384:
       enc_key = HKDF(shared_secret, "UmkaOS-SEV-MIGRATION-ENC-v1", 32)

Phase 3 — Memory transfer:
  1. Source calls PSP: SNP_PAGE_COPY(page_list, dst_platform_guestkey_id)
     PSP re-encrypts each page from source key to transport key.
  2. Encrypted page blobs + ciphertext sent to destination over TLS 1.3.
     (TLS authenticates the management channel; SEV encryption protects page data.)
  3. Destination calls PSP: SNP_PAGE_RECEIVE(encrypted_page_blob)
     PSP decrypts with migr_priv → re-encrypts with destination platform key.

Phase 4 — Guest state transfer:
  1. vCPU registers, VMSA (VM Save Area) encrypted with transport key.
     VMSA includes all vCPU state: GPRs, MSRs, VMCB fields.
  2. Destination PSP verifies VMSA measurement matches expected boot digest.
  3. Guest resumes on destination only after PSP approves state transfer.

Request authenticity:
  The migration request (source → destination address, VM ID, timestamp) is signed
  with the source host's Ed25519 cluster identity key ([Section 5.9.2.1]).
  Destination verifies signature before accepting connection. This prevents MITM
  attacks where an untrusted third party diverts migration traffic.

Intel TDX migration protocol (TDX Module SEAMCALL interface):

Phase 1 — Destination attestation:
  1. Destination generates TDX migration TD (a special TrustDomain for migration).
  2. TDREPORT generated by TDH.REPORT SEAMCALL — includes migration TD measurement.
  3. Report verified via Intel PCS (Provisioning Certification Service).

Phase 2 — Key establishment:
  TDH.EXPORT.MPTABLE + TDH.EXPORT.TRACK on source.
  TDH.IMPORT.MPTABLE + TDH.IMPORT.TRACK on destination.
  Intel MKTME (Multi-Key Total Memory Encryption) handles re-keying.

Phase 3 — Page transfer:
  TDH.EXPORT.PAGEMD / TDH.IMPORT.PAGEMD for each page.
  Pages are re-encrypted from source KeyID to destination KeyID.

Latency: additional ~5-20ms for attestation (Phase 1) compared to
non-confidential migration. Memory transfer latency is identical
(hardware AES-256 encryption is in the memory controller, ~0.1% overhead).

8.6.5 Host Mode: UmkaOS Hosting Confidential VMs

When UmkaOS is the host running umka-kvm with confidential guests:

UmkaOS as TDX/SEV-SNP host:
  1. umka-kvm (Tier 1 driver, Section 10.4) creates confidential VM contexts.
  2. Kernel allocates encrypted physical pages for guest.
  3. Kernel CANNOT read guest memory (hardware enforces).
  4. Kernel CAN:
     - Schedule vCPUs on physical CPUs (scheduling, Section 6.1)
     - Manage guest memory mappings (page tables, metadata only)
     - Migrate guest pages between NUMA nodes (hardware re-encryption)
     - Apply cgroup limits (memory, CPU, accelerator)
  5. Kernel CANNOT:
     - Read guest page contents
     - Inject code into guest
     - Modify guest register state (except via controlled VM entry/exit)

8.6.6 Linux Compatibility

Existing Linux confidential computing interfaces:

/dev/sev (AMD SEV-SNP):
  KVM_SEV_INIT, KVM_SEV_LAUNCH_START, KVM_SEV_LAUNCH_UPDATE,
  KVM_SEV_LAUNCH_MEASURE, KVM_SEV_LAUNCH_FINISH, etc.
  → Implemented in umka-kvm, same ioctl numbers.

/dev/tdx_guest (Intel TDX):
  TDX_CMD_GET_REPORT, TDX_CMD_EXTEND_RTMR, etc.
  → Implemented in umka-compat, same ioctl numbers.

KVM ioctls (generic):
  KVM_CREATE_VM, KVM_SET_MEMORY_ATTRIBUTES (private vs shared), etc.
  → Implemented in umka-kvm.

QEMU, libvirt, cloud-hypervisor: all use these ioctls.
Binary compatibility preserved.

8.6.7 TEE Observability Degradation Model

When workloads run inside TEEs, some kernel optimization signals are unavailable. This is an inherent hardware constraint, not a software limitation. The kernel must degrade gracefully, not silently fail.

Signal                       Non-TEE    TEE (host observing guest)    Notes
────────────────────────────  ─────────  ───────────────────────────  ──────────────
Page fault patterns           ✓ Full     ✓ Full                      Kernel manages page tables
Page table accessed/dirty     ✓ Full     ✓ Full                      Hardware sets bits, kernel reads
I/O request patterns          ✓ Full     ✓ Full                      I/O goes through kernel
CPU scheduling (PELT)         ✓ Full     ✓ Full                      Kernel schedules vCPUs
Memory allocation patterns    ✓ Full     ✓ Full                      Kernel allocates pages
Memory pressure signals       ✓ Full     ✓ Full                      Kernel tracks pressure
Page content (for compress)   ✓ Full     ✗ Unavailable               Hardware encryption
Hardware perf counters        ✓ Full     ◐ Restricted                TDX: host reads limited set
                                                                     SEV-SNP: guest-only by default
                                                                     CCA: realm-restricted
In-kernel inference           ✓ Full     ◐ Reduced                   Trains on access patterns (ok)
                                                                     Cannot use content features
Memory compression            ✓ Full     ✗ Disabled                  Can't compress encrypted pages
MTE tagging                   ✓ Full     ✗ Disabled                  Tags on confidential pages = 0

Degraded features for TEE workloads:
  - Memory compression: disabled (pages evicted encrypted, no compression gain)
  - Learned prefetch: works on access patterns (page faults, accessed bits)
    but quality may be ~10-20% lower than non-TEE due to missing perf counters
  - Intent I/O scheduling: works (sees I/O requests), same quality as non-TEE
  - Power budgeting: works (RAPL is host-side, unaffected by TEE)
  - MTE safety: disabled for TEE pages (hardware encryption is the safety mechanism)

This degradation is identical to Linux. Linux has the same restricted visibility into TEE guests. No kernel can observe hardware-encrypted memory contents.

8.6.8 Performance Impact

When confidential computing is not used: zero overhead. No code runs, no checks happen. The page.kernel_readable flag is always true. All code paths take the standard branch.

When confidential computing is used: same overhead as Linux. The cost is hardware memory encryption (~1-5% memory bandwidth reduction), which is identical regardless of kernel implementation.

8.6.9 Confidential VM Live Migration

Confidential VMs require special migration handling because the hypervisor cannot read encrypted guest memory.

AMD SEV-SNP Migration:

The Platform Security Processor (PSP) manages migration. The source PSP exports encrypted guest pages with a transport key negotiated between source and destination PSPs. The destination allocates a new ASID and imports pages, re-encrypting with the destination's memory encryption key. Guest state (VMSA) is included in the encrypted export. The guest observes a brief pause but no data loss.

Intel TDX Migration:

TDX uses TD-Preserving migration via SEAMCALL instructions (TDH.EXPORT.MEM, TDH.IMPORT.MEM, TDH.EXPORT.STATE.IMMUTABLE, TDH.IMPORT.STATE.IMMUTABLE, TDH.EXPORT.STATE.VP, TDH.IMPORT.STATE.VP). The source TDX module exports TD pages (via TDH.EXPORT.MEM), immutable TD-scope metadata (via TDH.EXPORT.STATE.IMMUTABLE), and per-vCPU state (via TDH.EXPORT.STATE.VP) in an encrypted migration stream. The destination TDX module imports and re-keys via the corresponding IMPORT SEAMCALLs. Migration is transparent to the TD guest.

ARM CCA Migration:

CCA realm live migration is under active development by ARM but the migration protocol is not yet standardized as of early 2026. The RMM specification (DEN0137) has progressed to RMI version 2.0 (2.0-alp19, November 2025), which introduced breaking RMI changes but kept RSI at v1.1. The anticipated migration flow involves RMM-mediated secure export/sealing of realm memory and state on the source host, with attestation-verified re-initialization on the destination — the hypervisor never sees realm private data. UmkaOS reserves the CCA migration interface and will implement it when ARM publishes the finalized migration commands in a future RMM spec revision.

ASID/Key Exhaustion:

umka-kvm tracks hardware encryption key IDs (ASIDs for SEV-SNP, HKIDs for TDX). When all IDs are allocated, KVM_CREATE_VM with confidential flags returns -ENOSPC. The administrator must terminate existing confidential VMs to free IDs. Typical limits: SEV-SNP ASIDs are platform-dependent (e.g., 509 on Milan, 1006+ on Genoa), discovered at runtime via CPUID Fn8000001F[ECX]; TDX supports ~64 HKIDs (hardware-dependent).

The Genoa ASID count (1006) is higher than Milan's (509) because Genoa's PSP (Platform Security Processor) has an expanded key cache. Both values match AMD's SEV-SNP API specification; CPUID Fn8000001F[ECX] reports the actual count on each platform, so the kernel never hardcodes these limits.
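The key-ID bookkeeping can be sketched as a simple pool (hypothetical; umka-kvm's actual allocator is not specified here). `alloc` returning None corresponds to KVM_CREATE_VM failing with -ENOSPC:

```rust
/// Hypothetical encryption-key-ID pool (ASIDs on SEV-SNP, HKIDs on
/// TDX). `limit` is discovered at runtime (CPUID Fn8000001F[ECX] on
/// AMD), e.g. 509 on Milan or 1006 on Genoa.
struct KeyIdPool {
    limit: u32,
    next_fresh: u32,
    freed: Vec<u32>,
}

impl KeyIdPool {
    fn new(limit: u32) -> Self {
        KeyIdPool { limit, next_fresh: 0, freed: Vec::new() }
    }

    /// None = all IDs in use → KVM_CREATE_VM returns -ENOSPC.
    fn alloc(&mut self) -> Option<u32> {
        if let Some(id) = self.freed.pop() {
            return Some(id); // reuse an ID freed by a terminated VM
        }
        if self.next_fresh < self.limit {
            self.next_fresh += 1;
            Some(self.next_fresh - 1)
        } else {
            None
        }
    }

    /// Called when a confidential VM is terminated.
    fn release(&mut self, id: u32) {
        self.freed.push(id);
    }
}
```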

GHCB Protocol (SEV-ES/SNP):

SEV-ES and SEV-SNP guests cannot expose register state to the host on VMEXIT (the VMSA is encrypted, so the host cannot read guest registers). The Guest-Hypervisor Communication Block (GHCB) is a shared page where the guest places exit information for the hypervisor. umka-kvm implements the GHCB protocol (v2) for handling #VC (VMM Communication) exceptions.

TDX TDCALL:

TDX guests communicate with the TDX module via TDCALL instruction (not VMEXIT). umka-kvm's TDX backend handles TDG.VP.VMCALL for guest-initiated exits.

8.6.10 TEE Key Negotiation Wire Format (UmkaOS-TEE-v1)

Used to establish a shared secret between UmkaOS Core and a TEE enclave (SEV-SNP, TDX, or ARM CCA) for encrypted inter-domain communication.

Handshake message format (sent by enclave to UmkaOS Core):

#[repr(C)]
pub struct TeeHelloMessage {
    /// Protocol version. Current: 1.
    pub version: u16,
    /// Key agreement algorithm:
    ///   1 = X25519 (32-byte public key)
    ///   2 = ML-KEM-768 encapsulation key (1184 bytes — post-quantum)
    pub algorithm: u16,
    /// Random nonce generated by the enclave's CSPRNG
    /// (RDRAND on x86, RNDR on AArch64, or TPM2_GetRandom).
    /// Must be unique per session; verified against a nonce replay log.
    pub nonce: [u8; 32],
    /// Enclave's public key or encapsulation key.
    /// For X25519: bytes 0..32. For ML-KEM-768: bytes 0..1184.
    /// Unused bytes are zero.
    pub public_key: [u8; 1184],
    /// TEE attestation report (SEV-SNP Report, TDX Quote, or CCA Token).
    /// UmkaOS Core verifies the report before accepting the key material.
    pub attestation_report: [u8; 4096],
    _pad: [u8; 4],  // align to 8 bytes. Total: 5320 bytes.
}
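The 5,320-byte total can be checked mechanically. The struct below mirrors the field layout above (with repr(C) and a maximum field alignment of 2, the size is exactly the sum of the fields):

```rust
/// Field-for-field mirror of TeeHelloMessage for layout verification.
#[repr(C)]
struct TeeHelloMessageLayout {
    version: u16,                   //    2 bytes
    algorithm: u16,                 //    2 bytes
    nonce: [u8; 32],                //   32 bytes
    public_key: [u8; 1184],         // 1184 bytes
    attestation_report: [u8; 4096], // 4096 bytes
    _pad: [u8; 4],                  //    4 bytes → 5320 total
}

// Compile-time layout check.
const _: () = assert!(std::mem::size_of::<TeeHelloMessageLayout>() == 5320);
```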

UmkaOS Core response:

#[repr(C)]
pub struct TeeCoreResponse {
    pub version: u16,
    pub algorithm: u16,
    /// UmkaOS Core's nonce. XOR with enclave nonce for KDF input.
    pub core_nonce: [u8; 32],
    /// For X25519: UmkaOS Core's public key (32 bytes).
    /// For ML-KEM-768: ciphertext (1088 bytes).
    pub response_key: [u8; 1088],
    _pad: [u8; 12],  // align to 8 bytes. Total: 1136 bytes.
}

Key derivation:

shared_secret = X25519(core_private, enclave_public)
             OR ML-KEM-768.Decapsulate(core_private, enclave_ciphertext)

session_key = HKDF-SHA256(
    ikm    = shared_secret,
    salt   = enclave_nonce XOR core_nonce,   // 32 bytes
    info   = b"UmkaOS-TEE-v1",
    length = 32                              // 256-bit AES key
)
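The salt above is the byte-wise XOR of the two session nonces; a minimal sketch (hypothetical helper name):

```rust
/// KDF salt = enclave_nonce XOR core_nonce (byte-wise). Mixing both
/// nonces ensures neither party can unilaterally fix the salt value.
fn kdf_salt(enclave_nonce: &[u8; 32], core_nonce: &[u8; 32]) -> [u8; 32] {
    let mut salt = [0u8; 32];
    for i in 0..32 {
        salt[i] = enclave_nonce[i] ^ core_nonce[i];
    }
    salt
}
```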

Anti-replay: UmkaOS Core maintains a nonce log (last 1024 nonces) per TEE instance type. Any incoming TeeHelloMessage.nonce matching a logged nonce is rejected with EEXIST.
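The anti-replay log can be sketched as a bounded FIFO with a membership check (hypothetical structure; the spec fixes only the 1024-entry window and the EEXIST rejection):

```rust
use std::collections::VecDeque;

/// Hypothetical replay log: remembers the last 1024 session nonces.
struct NonceLog {
    seen: VecDeque<[u8; 32]>,
}

impl NonceLog {
    fn new() -> Self {
        NonceLog { seen: VecDeque::with_capacity(1024) }
    }

    /// Returns false (→ reject with EEXIST) on a replayed nonce.
    fn check_and_insert(&mut self, nonce: [u8; 32]) -> bool {
        if self.seen.contains(&nonce) {
            return false; // replay detected
        }
        if self.seen.len() == 1024 {
            self.seen.pop_front(); // evict the oldest entry
        }
        self.seen.push_back(nonce);
        true
    }
}
```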

Attestation verification: UmkaOS Core verifies the TEE attestation report using the platform's attestation root of trust:
  - SEV-SNP: VCEK certificate chain — AMD ARK/ASK
  - TDX: Intel's DCAP quote verification service (offline-capable with cached CRL)
  - ARM CCA: CCA realm token — platform RVIM certificate


8.7 Linux Security Module (LSM) Framework

UmkaOS provides a pluggable security module framework compatible with Linux's LSM infrastructure. The framework defines hook points, a stacking model, and a blob allocation scheme. Specific policy engines (SELinux, AppArmor) are implementation details outside this specification; this section specifies the framework they plug into.

Inspired by: Linux LSM (Wright et al., 2002), LSM stacking (Schaufler, 2018+). IP status: Clean -- LSM is a well-documented public kernel interface. The hook taxonomy is derived from the public Linux UAPI headers and kernel documentation.

8.7.1 Design Goals

  1. Linux compatibility: Unmodified AppArmor, SELinux, and Landlock policy semantics must be reproducible. Container runtimes (Docker, containerd, CRI-O) that set LSM profiles via OCI annotations must work without modification.

  2. Zero overhead when disabled: If no LSM module is loaded, every hook point compiles to a NOP via static keys (same mechanism as tracepoints, Section 19.2.4). There is no function pointer dereference, no branch, no cache line touch on the hot path.

  3. Grouped hook interface: Linux defines approximately 220 individual hook functions. UmkaOS groups these into approximately 15-20 trait methods organized by object category. Each method receives a discriminant identifying the specific operation, preserving the per-operation granularity that policy engines need while keeping the trait implementable. This is a framework design choice, not a semantic change -- every Linux LSM hook has a corresponding UmkaOS entry point.

  4. AND-logic stacking: Multiple LSM modules can be active simultaneously (e.g., IMA + AppArmor + Landlock). All active modules must permit an operation for it to proceed. Any module returning Err(SecurityDenial) blocks the operation. This matches Linux's LSM stacking model (mainlined incrementally since Linux 5.1).

8.7.2 The SecurityModule Trait

/// Pluggable Linux Security Module interface.
///
/// Each LSM (AppArmor, SELinux, Landlock, IMA, etc.) implements this trait.
/// The framework calls each registered module's hooks in order; all must
/// return Ok(()) for the operation to proceed. Any Err(SecurityDenial)
/// causes the operation to be denied.
///
/// Method grouping: hooks are grouped by kernel object category. Each method
/// receives an operation discriminant and the relevant kernel objects. This
/// provides the same granularity as Linux's ~220 individual hook functions
/// while keeping the trait manageable for implementation.
///
/// Default implementations return Ok(()) (permit). An LSM overrides only
/// the categories it cares about:
/// - AppArmor: file, task, socket, capable (path-based MAC)
/// - SELinux: all categories (label-based MAC)
/// - IMA: file only (integrity measurement)
/// - Landlock: file, socket (unprivileged sandboxing)
pub trait SecurityModule: Send + Sync {
    /// Human-readable module name (e.g., "apparmor", "selinux", "landlock").
    /// Used for /sys/kernel/security/lsm, audit logs, and error messages.
    fn name(&self) -> &'static str;

    /// Module initialization priority. Lower values initialize first.
    /// IMA = 10 (integrity must be first), MAC modules = 20, Landlock = 30.
    fn priority(&self) -> u32;

    /// Size in bytes of the security blob this module needs allocated in
    /// each kernel object (credential, inode, superblock, socket, etc.).
    /// Called once during module registration. Returns per-object-type sizes.
    fn blob_sizes(&self) -> LsmBlobSizes;

    // ===== Hook categories =====
    // Each method defaults to Ok(()) -- LSMs override only relevant hooks.

    /// File (open file descriptor) security checks.
    /// Called on operations that act on an open file.
    ///
    /// Operations:
    ///   Permission  -- check access mode (read/write/exec) on open file
    ///   Open        -- file being opened (after VFS lookup, before data access)
    ///   Receive     -- file descriptor received via SCM_RIGHTS (unix socket)
    ///   Lock        -- flock/fcntl locking operation
    ///   Mmap        -- file being memory-mapped (includes PROT_EXEC check)
    ///   Mprotect    -- mmap protection change on a file-backed mapping
    ///   Ioctl       -- ioctl on a file descriptor
    fn file_security(
        &self,
        op: FileSecurityOp,
        cred: &TaskCredential,
        file: &FileRef,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, file);
        Ok(())
    }

    /// Inode (on-disk object) security checks.
    /// Called on operations that create, modify, or query inodes.
    ///
    /// Operations:
    ///   Create      -- creating a new regular file
    ///   Link        -- creating a hard link
    ///   Unlink      -- removing a directory entry
    ///   Symlink     -- creating a symbolic link
    ///   Mkdir       -- creating a directory
    ///   Rmdir       -- removing a directory
    ///   Mknod       -- creating a special file (device, FIFO, socket)
    ///   Rename      -- renaming/moving a directory entry
    ///   Permission  -- inode access permission check (called by VFS layer)
    ///   Setattr     -- modifying inode attributes (chmod, chown, utimes)
    ///   Getattr     -- reading inode attributes (stat)
    ///   Setxattr    -- setting an extended attribute
    ///   Getxattr    -- reading an extended attribute
    ///   Removexattr -- removing an extended attribute
    ///   Listxattr   -- listing extended attributes
    fn inode_security(
        &self,
        op: InodeSecurityOp,
        cred: &TaskCredential,
        inode: &InodeRef,
        context: &InodeOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, inode, context);
        Ok(())
    }

    /// Superblock (filesystem instance) security checks.
    /// Called on filesystem-level operations.
    ///
    /// Operations:
    ///   Mount        -- mounting a filesystem
    ///   Umount       -- unmounting a filesystem
    ///   Remount      -- remounting with changed options
    ///   SetMntOpts   -- setting mount security options (e.g., SELinux context=)
    ///   Statfs       -- filesystem statistics query
    fn superblock_security(
        &self,
        op: SuperblockSecurityOp,
        cred: &TaskCredential,
        sb: &SuperBlockRef,
        context: &SbOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, sb, context);
        Ok(())
    }

    /// Task (process/thread) security checks.
    /// Called on operations that affect task state.
    ///
    /// Operations:
    ///   Alloc        -- new task being created (clone/fork)
    ///   Free         -- task being destroyed (cleanup hook, cannot deny)
    ///   Exec         -- execve() transition (profile change, label change)
    ///   SetPgid      -- changing process group
    ///   Setnice      -- changing scheduling priority
    ///   Setioprio    -- changing I/O priority
    ///   Setscheduler -- changing scheduling policy
    ///   Kill         -- sending a signal
    ///   Prctl        -- prctl() operation
    ///   Ptrace       -- ptrace attach/access (cross-reference: Section 19.3.1)
    fn task_security(
        &self,
        op: TaskSecurityOp,
        cred: &TaskCredential,
        target: Option<&TaskCredential>,
        context: &TaskOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, target, context);
        Ok(())
    }

    /// Credential security checks.
    /// Called on credential allocation, preparation, and commitment.
    ///
    /// Operations:
    ///   Alloc        -- new credential being allocated (prepare_creds)
    ///   Free         -- credential being freed (cleanup, cannot deny)
    ///   Prepare      -- credential clone for modification
    ///   Commit       -- credential being published (commit_creds)
    ///   Transfer     -- credential being transferred to another task
    fn cred_security(
        &self,
        op: CredSecurityOp,
        old: &TaskCredential,
        new: Option<&TaskCredential>,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, old, new);
        Ok(())
    }

    /// Capability override hook.
    /// Called by ns_capable() (Section 8.8.3) after the bitfield check
    /// passes. Allows LSMs to deny capabilities that the task technically
    /// holds but the policy forbids (e.g., SELinux denying CAP_NET_ADMIN
    /// to a confined domain).
    fn capable(
        &self,
        cred: &TaskCredential,
        target_ns: &UserNamespace,
        cap: SystemCaps,
        audit: CapableAudit,
    ) -> Result<(), SecurityDenial> {
        let _ = (cred, target_ns, cap, audit);
        Ok(())
    }

    /// Socket security checks.
    /// Called on socket lifecycle and data operations.
    ///
    /// Operations:
    ///   Create       -- socket() syscall
    ///   Bind         -- bind() to an address
    ///   Connect      -- connect() to a remote address
    ///   Listen       -- listen() for incoming connections
    ///   Accept       -- accept() an incoming connection
    ///   Sendmsg      -- sending a message
    ///   Recvmsg      -- receiving a message
    ///   Shutdown     -- shutdown() a connection
    ///   Setsockopt   -- setting a socket option
    ///   Getsockopt   -- reading a socket option
    fn socket_security(
        &self,
        op: SocketSecurityOp,
        cred: &TaskCredential,
        sock: Option<&SocketRef>,
        context: &SocketOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, sock, context);
        Ok(())
    }

    /// IPC security checks (SysV shared memory, semaphores, message queues).
    ///
    /// Operations:
    ///   Alloc        -- creating an IPC object
    ///   Free         -- destroying an IPC object
    ///   Permission   -- permission check on IPC access
    ///   Associate    -- connecting to an existing IPC object
    fn ipc_security(
        &self,
        op: IpcSecurityOp,
        cred: &TaskCredential,
        ipc: &IpcObjRef,
        context: &IpcOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, ipc, context);
        Ok(())
    }

    /// Key management security checks (kernel keyring).
    ///
    /// Operations:
    ///   Alloc        -- allocating a new key
    ///   Free         -- freeing a key
    ///   Permission   -- permission check on key access
    fn key_security(
        &self,
        op: KeySecurityOp,
        cred: &TaskCredential,
        context: &KeyOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, context);
        Ok(())
    }

    /// BPF security checks.
    ///
    /// Operations:
    ///   ProgLoad     -- loading a BPF program
    ///   ProgAttach   -- attaching a BPF program to a hook
    ///   MapCreate    -- creating a BPF map
    ///   MapAccess    -- accessing a BPF map (read/write/delete)
    fn bpf_security(
        &self,
        op: BpfSecurityOp,
        cred: &TaskCredential,
        context: &BpfOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, context);
        Ok(())
    }

    /// Namespace security checks.
    /// UmkaOS-specific: called on namespace creation, joining, and
    /// cross-namespace operations.
    ///
    /// Operations:
    ///   Create       -- creating a new namespace (clone/unshare)
    ///   Join         -- setns() into an existing namespace
    ///   Destroy      -- namespace being destroyed (cleanup, cannot deny)
    fn namespace_security(
        &self,
        op: NamespaceSecurityOp,
        cred: &TaskCredential,
        ns_type: NamespaceType,
        context: &NsOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, ns_type, context);
        Ok(())
    }

    // ===== Blob initialization hooks =====
    // Called when kernel objects are created to initialize the LSM's
    // portion of the security blob.

    /// Initialize the LSM's blob region in a new credential.
    fn cred_init_blob(&self, cred_blob: &mut [u8]) {
        let _ = cred_blob;
    }

    /// Initialize the LSM's blob region in a new inode.
    fn inode_init_blob(&self, inode_blob: &mut [u8]) {
        let _ = inode_blob;
    }

    /// Initialize the LSM's blob region in a new superblock.
    fn superblock_init_blob(&self, sb_blob: &mut [u8]) {
        let _ = sb_blob;
    }

    /// Initialize the LSM's blob region in a new socket.
    fn socket_init_blob(&self, sock_blob: &mut [u8]) {
        let _ = sock_blob;
    }
}
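
To make the hook shape concrete, here is a minimal, self-contained sketch of a module in the spirit of Yama's ptrace scoping. The trait and types are simplified stand-ins for the definitions above (the real hooks take context structs and cover more operations); the `PtraceScope` name and the same-UID rule are illustrative, not part of this specification.

```rust
// Simplified stand-ins for the trait and types defined above.
#[derive(Clone, Copy, PartialEq)]
pub enum TaskSecurityOp { Exec, Kill, Ptrace }

pub struct TaskCredential { pub uid: u32 }

pub struct SecurityDenial { pub module: &'static str, pub errno: i32 }

pub trait SecurityModule {
    fn name(&self) -> &'static str;

    /// Default implementation permits: modules override only the
    /// categories they police.
    fn task_security(
        &self,
        op: TaskSecurityOp,
        cred: &TaskCredential,
        target: Option<&TaskCredential>,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, target);
        Ok(())
    }
}

/// Yama-style restriction: a task may only ptrace targets with its own UID.
pub struct PtraceScope;

impl SecurityModule for PtraceScope {
    fn name(&self) -> &'static str { "ptrace_scope" }

    fn task_security(
        &self,
        op: TaskSecurityOp,
        cred: &TaskCredential,
        target: Option<&TaskCredential>,
    ) -> Result<(), SecurityDenial> {
        // Only police ptrace; all other task operations fall through to permit.
        if op != TaskSecurityOp::Ptrace {
            return Ok(());
        }
        match target {
            Some(t) if t.uid == cred.uid => Ok(()),
            _ => Err(SecurityDenial { module: self.name(), errno: 1 /* EPERM */ }),
        }
    }
}
```

Because the default methods permit, a module like this composes cleanly with the AND-logic dispatch: it can only remove access, never add it.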

8.7.3 Operation Discriminants

Each hook category uses an enum to identify the specific operation. Policy engines match on these discriminants to apply per-operation rules:

/// File-level security operations.
#[repr(u32)]
pub enum FileSecurityOp {
    Permission  = 0,
    Open        = 1,
    Receive     = 2,
    Lock        = 3,
    Mmap        = 4,
    Mprotect    = 5,
    Ioctl       = 6,
}
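
Policy engines typically translate each discriminant into the permission bits their rule tables must grant. A hedged sketch with invented `MAY_*` bits (not the AppArmor or SELinux encodings):

```rust
#[derive(Clone, Copy)]
pub enum FileSecurityOp { Permission, Open, Receive, Lock, Mmap, Mprotect, Ioctl }

// Illustrative rule bits; real engines use their own policy encodings.
pub const MAY_READ: u32  = 1 << 0;
pub const MAY_OPEN: u32  = 1 << 1;
pub const MAY_LOCK: u32  = 1 << 2;
pub const MAY_MAP: u32   = 1 << 3;
pub const MAY_IOCTL: u32 = 1 << 4;

/// Translate a hook discriminant into the permission bits a rule must
/// grant for the operation to proceed.
pub fn required_perms(op: FileSecurityOp) -> u32 {
    match op {
        FileSecurityOp::Permission => MAY_READ,
        FileSecurityOp::Open => MAY_OPEN,
        FileSecurityOp::Receive => MAY_OPEN, // fd received over IPC
        FileSecurityOp::Lock => MAY_LOCK,
        FileSecurityOp::Mmap | FileSecurityOp::Mprotect => MAY_MAP,
        FileSecurityOp::Ioctl => MAY_IOCTL,
    }
}

/// Allow iff the matched rule grants every required bit.
pub fn check(rule_perms: u32, op: FileSecurityOp) -> bool {
    let need = required_perms(op);
    rule_perms & need == need
}
```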

/// Inode-level security operations.
#[repr(u32)]
pub enum InodeSecurityOp {
    Create      = 0,
    Link        = 1,
    Unlink      = 2,
    Symlink     = 3,
    Mkdir       = 4,
    Rmdir       = 5,
    Mknod       = 6,
    Rename      = 7,
    Permission  = 8,
    Setattr     = 9,
    Getattr     = 10,
    Setxattr    = 11,
    Getxattr    = 12,
    Removexattr = 13,
    Listxattr   = 14,
}

/// Superblock-level security operations.
#[repr(u32)]
pub enum SuperblockSecurityOp {
    Mount       = 0,
    Umount      = 1,
    Remount     = 2,
    SetMntOpts  = 3,
    Statfs      = 4,
}

/// Task-level security operations.
#[repr(u32)]
pub enum TaskSecurityOp {
    Alloc       = 0,
    Free        = 1,
    Exec        = 2,
    SetPgid     = 3,
    Setnice     = 4,
    Setioprio   = 5,
    Setscheduler = 6,
    Kill        = 7,
    Prctl       = 8,
    Ptrace      = 9,
}

/// Credential-level security operations.
#[repr(u32)]
pub enum CredSecurityOp {
    Alloc       = 0,
    Free        = 1,
    Prepare     = 2,
    Commit      = 3,
    Transfer    = 4,
}

/// Socket-level security operations.
#[repr(u32)]
pub enum SocketSecurityOp {
    Create      = 0,
    Bind        = 1,
    Connect     = 2,
    Listen      = 3,
    Accept      = 4,
    Sendmsg     = 5,
    Recvmsg     = 6,
    Shutdown    = 7,
    Setsockopt  = 8,
    Getsockopt  = 9,
}

/// IPC-level security operations.
#[repr(u32)]
pub enum IpcSecurityOp {
    Alloc       = 0,
    Free        = 1,
    Permission  = 2,
    Associate   = 3,
}

/// Key management security operations.
#[repr(u32)]
pub enum KeySecurityOp {
    Alloc       = 0,
    Free        = 1,
    Permission  = 2,
}

/// BPF security operations.
#[repr(u32)]
pub enum BpfSecurityOp {
    ProgLoad    = 0,
    ProgAttach  = 1,
    MapCreate   = 2,
    MapAccess   = 3,
}

/// Namespace security operations (UmkaOS-specific).
#[repr(u32)]
pub enum NamespaceSecurityOp {
    Create      = 0,
    Join        = 1,
    Destroy     = 2,
}

/// Audit control for capable() hook.
#[repr(u32)]
pub enum CapableAudit {
    /// Generate an audit record for this check.
    Audit       = 0,
    /// Suppress audit record (used for speculative checks like
    /// "would this operation succeed?" without generating noise).
    NoAudit     = 1,
}

/// Denial returned by LSM hooks. Includes the denying module name
/// and an operation-specific reason code for audit logging.
pub struct SecurityDenial {
    /// Which LSM denied the operation.
    pub module: &'static str,
    /// errno to return to userspace (typically EACCES or EPERM).
    pub errno: i32,
    /// Optional reason string for the audit log (e.g., "apparmor: DENIED
    /// file_open /etc/shadow for profile /usr/bin/nginx").
    /// The `&'static str` points at static module/policy data; dynamic
    /// detail is formatted into a per-CPU scratch buffer at audit time,
    /// so the denial path never heap-allocates.
    pub reason: Option<&'static str>,
}

8.7.4 Security Blob Model

Each kernel object that LSMs can attach state to carries an opaque security blob. The blob is a single contiguous allocation divided into regions, one per registered LSM. This avoids per-LSM heap allocations and provides cache-friendly access.

/// Per-object-type blob size requirements, reported by each LSM during
/// registration via `SecurityModule::blob_sizes()`.
pub struct LsmBlobSizes {
    /// Bytes needed in each TaskCredential's blob.
    pub cred: usize,
    /// Bytes needed in each inode's blob.
    pub inode: usize,
    /// Bytes needed in each superblock's blob.
    pub superblock: usize,
    /// Bytes needed in each socket's blob.
    pub socket: usize,
    /// Bytes needed in each IPC object's blob.
    pub ipc: usize,
    /// Bytes needed in each key's blob.
    pub key: usize,
}
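
The per-type totals used to size the slab caches are plain field-wise sums over the registered modules. A sketch of the accumulation, with an illustrative `accumulate` helper and a grand-total `total()` (helper names assumed for illustration):

```rust
pub struct LsmBlobSizes {
    pub cred: usize,
    pub inode: usize,
    pub superblock: usize,
    pub socket: usize,
    pub ipc: usize,
    pub key: usize,
}

impl LsmBlobSizes {
    pub const ZERO: LsmBlobSizes = LsmBlobSizes {
        cred: 0, inode: 0, superblock: 0, socket: 0, ipc: 0, key: 0,
    };

    /// Field-wise accumulation: fold one module's requirements into the
    /// running total while iterating registered modules at boot.
    pub fn accumulate(&mut self, m: &LsmBlobSizes) {
        self.cred += m.cred;
        self.inode += m.inode;
        self.superblock += m.superblock;
        self.socket += m.socket;
        self.ipc += m.ipc;
        self.key += m.key;
    }

    /// Grand total across all object types (reporting/debug; the slab
    /// caches themselves are sized per object type, not from this sum).
    pub fn total(&self) -> usize {
        self.cred + self.inode + self.superblock + self.socket + self.ipc + self.key
    }
}
```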

/// Opaque security blob attached to kernel objects (cred, inode,
/// superblock, socket, IPC object, key). One blob per object.
/// Layout: [ LSM_0 data | LSM_1 data | LSM_2 data | ... ]
/// Each LSM's region starts at its registered offset (assigned during
/// LSM registration, fixed for the boot lifetime).
///
/// Blob memory is slab-allocated from a per-object-type slab cache
/// (e.g., `cred_lsm_blob_cache`, `inode_lsm_blob_cache`). The slab
/// object size is the sum of all registered LSMs' blob sizes for that
/// object type, computed once during LSM initialization. The blob is
/// referenced from the containing object via a pointer (e.g.,
/// `TaskCredential.lsm_blob` is a `NonNull<LsmBlob>`).
///
/// Access pattern: each LSM calls `lsm_blob.get_mut::<T>(offset)`, where
/// `offset` is the byte offset of that LSM's private region (from
/// `LsmBlobOffsets`).
pub struct LsmBlob {
    /// Total size of the blob data in bytes (sum of all active LSMs' requirements).
    /// Stored so that debug tooling can inspect blob contents without consulting
    /// the LSM registry.
    pub data_len: u32,
    pub _pad: u32,
    /// Blob data begins immediately after this header in the slab allocation.
    /// Access via `lsm_blob_data_ptr(blob)` which returns `*mut u8` to the
    /// byte immediately after this struct header. Not a Rust reference —
    /// lifetimes are managed by the slab allocator and the owning object.
    pub _data: [u8; 0],
}

/// Global LSM registry state. Initialized during boot, immutable after
/// all LSMs are registered (before the first user process runs).
pub struct LsmRegistry {
    /// Registered LSM modules, in evaluation order.
    /// Order: IMA first (integrity), then MAC (SELinux/AppArmor), then
    /// Landlock (unprivileged sandboxing). Within each priority tier,
    /// modules run in registration order (determined by the kernel
    /// command line `lsm=` parameter or the compiled-in default).
    pub modules: ArrayVec<&'static dyn SecurityModule, MAX_LSM_MODULES>,

    /// Per-LSM blob offsets for each object type. Indexed by LSM
    /// registration index (0..modules.len()). Each entry records the
    /// byte offset within the LsmBlob where this LSM's region begins.
    pub blob_offsets: ArrayVec<LsmBlobOffsets, MAX_LSM_MODULES>,

    /// Total blob size per object type (sum of all LSMs' requirements).
    /// Used to size the slab caches.
    pub total_blob_sizes: LsmBlobSizes,

    /// Static key: true if at least one LSM is registered.
    /// When false, all hook call sites are NOPs (zero overhead).
    /// Patched to true during LSM registration (one-time, at boot).
    pub any_lsm_active: StaticKey,

    /// Per-hook-category static keys. Each is true if at least one
    /// registered LSM overrides that category. Allows per-category
    /// zero-overhead bypass.
    pub file_hooks_active: StaticKey,
    pub inode_hooks_active: StaticKey,
    pub task_hooks_active: StaticKey,
    pub cred_hooks_active: StaticKey,
    pub capable_hooks_active: StaticKey,
    pub socket_hooks_active: StaticKey,
    pub ipc_hooks_active: StaticKey,
    pub key_hooks_active: StaticKey,
    pub bpf_hooks_active: StaticKey,
    pub namespace_hooks_active: StaticKey,
    pub superblock_hooks_active: StaticKey,
}

/// Per-LSM blob offset table.
pub struct LsmBlobOffsets {
    pub cred: usize,
    pub inode: usize,
    pub superblock: usize,
    pub socket: usize,
    pub ipc: usize,
    pub key: usize,
}

/// Maximum number of simultaneously registered LSM modules.
/// Linux supports up to ~10; UmkaOS allows 8 (IMA + SELinux/AppArmor +
/// Landlock + Yama + LoadPin + SafeSetID + Lockdown + BPF-LSM).
const MAX_LSM_MODULES: usize = 8;

Blob access pattern (used by LSM implementations to read/write their blob region):

impl LsmBlob {
    /// Get a typed reference to this LSM's region within the blob.
    /// `offset` is the byte offset assigned to this LSM for the containing
    /// object type (from `LSM_REGISTRY.blob_offsets[lsm_index]`).
    /// `T` is the LSM's private per-object state struct.
    ///
    /// # Safety
    /// The caller must ensure that `T` matches the blob layout the LSM
    /// registered, that `offset` is the offset assigned to that LSM for
    /// this object type, and that `size_of::<T>()` equals the blob size
    /// the LSM requested for this object type.
    pub unsafe fn get<T>(&self, offset: usize) -> &T {
        // SAFETY: offset and size were validated during LSM registration
        &*(self._data.as_ptr().add(offset) as *const T)
    }

    /// Mutable variant of get(). Used during blob initialization and
    /// by LSMs that mutate per-object state (e.g., SELinux label cache).
    pub unsafe fn get_mut<T>(&mut self, offset: usize) -> &mut T {
        // SAFETY: offset and size were validated during LSM registration
        &mut *(self._data.as_mut_ptr().add(offset) as *mut T)
    }
}
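
The offset arithmetic behind `get`/`get_mut` can be exercised in isolation. A userspace sketch simulating a two-LSM cred blob in an aligned byte buffer (the state structs and their field choices are hypothetical):

```rust
use std::mem::size_of;

// Hypothetical per-LSM cred-blob states (illustrative field choices).
#[repr(C)]
struct SelinuxCredState { sid: u32, flags: u32 }   // 8 bytes

#[repr(C)]
struct LandlockCredState { domain_depth: u32 }     // 4 bytes

/// Offset-based typed access, mirroring LsmBlob::get_mut. The caller
/// guarantees the offset/type pairing, exactly as LSM registration does.
unsafe fn blob_get_mut<'a, T>(base: *mut u8, offset: usize) -> &'a mut T {
    &mut *(base.add(offset) as *mut T)
}

fn demo() -> (u32, u32) {
    // Backing store: 16 bytes, 8-byte aligned (stands in for the slab object).
    let mut backing = [0u64; 2];
    let base = backing.as_mut_ptr() as *mut u8;

    // Offsets as assigned at registration: running sum of earlier sizes.
    let off_a = 0;
    let off_b = size_of::<SelinuxCredState>(); // = 8

    unsafe {
        blob_get_mut::<SelinuxCredState>(base, off_a).sid = 42;
        blob_get_mut::<LandlockCredState>(base, off_b).domain_depth = 3;
        (
            blob_get_mut::<SelinuxCredState>(base, off_a).sid,
            blob_get_mut::<LandlockCredState>(base, off_b).domain_depth,
        )
    }
}
```

The alignment caveat matters in the real kernel too: region offsets must satisfy the alignment of each LSM's state struct, which the registration-time validation is assumed to enforce.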

8.7.5 Hook Dispatch and Stacking

The framework dispatches hooks to all registered modules. The dispatch uses static keys for zero-overhead bypass when no module is registered for a given category.

// Macro-generated hook dispatch (conceptual expansion):

lsm_call_file_security(op, cred, file) -> Result<(), SecurityDenial>:
  // Static key check: compiled to NOP when no file hooks registered
  if !static_key_enabled(LSM_REGISTRY.file_hooks_active):
      return Ok(())

  // Iterate registered modules in priority order
  for module in LSM_REGISTRY.modules:
      result = module.file_security(op, cred, file)
      if result.is_err():
          // AND logic: first denial wins. Log the denial via audit
          // subsystem (Section 19.1) if auditing is enabled.
          audit_lsm_denial(module.name(), op, cred, file)
          return result
  Ok(())

Stacking order: Modules are called in priority order (lower priority value = earlier):

Priority Module category Examples Rationale
11 Integrity IMA Must measure/appraise before MAC decisions
21 Mandatory Access Control SELinux, AppArmor Primary policy enforcement
30 Supplementary Landlock, Yama, LoadPin Additional restrictions
40 BPF-LSM User-loaded BPF security programs Most dynamic, least trusted

Within the same priority tier, modules are called in the order specified by the lsm= kernel command line parameter (e.g., lsm=integrity,selinux,landlock,bpf). If no lsm= parameter is provided, the compiled-in default order is used.

AND-logic semantics: ALL modules must return Ok(()) for the operation to proceed. If module A returns Ok(()) but module B returns Err(SecurityDenial), the operation is denied. The first denial is returned to the caller; subsequent modules are NOT called (short-circuit evaluation). This is safe because LSM hooks only restrict (deny), never grant -- skipping later modules after a denial cannot cause a false-allow.
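
The AND/short-circuit semantics fit in a few lines. A standalone sketch in which closures stand in for registered modules and a bare errno stands in for the full denial context:

```rust
pub struct SecurityDenial { pub module: &'static str, pub errno: i32 }

/// Dispatch one hook across registered modules in priority order.
/// First denial wins and later modules are skipped: hooks can only
/// restrict, so short-circuiting after a denial cannot false-allow.
pub fn dispatch_hook(
    modules: &[(&'static str, &dyn Fn() -> Result<(), i32>)],
) -> Result<(), SecurityDenial> {
    for &(name, hook) in modules {
        if let Err(errno) = hook() {
            return Err(SecurityDenial { module: name, errno });
        }
    }
    Ok(())
}
```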

8.7.6 Per-Namespace LSM Profiles

Container runtimes set LSM profiles on a per-task basis, typically at execve() time. The profile is stored in the task's credential LSM blob.

AppArmor: Profile name stored in the credential blob. Changed via:
- apparmor_parser writing to /sys/kernel/security/apparmor/.replace
- aa_change_onexec() in the task's own code
- the OCI runtime setting the profile before execve() of the container entrypoint

SELinux: Security context (label) stored in the credential blob. Changed via:
- setcon() / setexeccon() before execve()
- policy-defined type transitions on exec

Profile application at container creation:

container_exec_with_lsm_profile(task, binary, profile_spec):
  1. new_cred = prepare_creds(task)
  2. For each active LSM:
       result = module.task_security(Exec, task.cred, None, &TaskOpContext {
           binary: binary,
           profile: profile_spec,  // OCI annotation: "apparmor=docker-default"
       })
       if result.is_err():
           abort_creds(new_cred)
           return result
  3. // LSM updates its blob region in new_cred with the new profile
  4. commit_creds(task, new_cred)
  5. execve(binary)

Landlock: unprivileged sandboxing:

Landlock is unique among LSMs: it does not require CAP_MAC_ADMIN or any capability. Any process can restrict itself via landlock_create_ruleset(), landlock_add_rule(), landlock_restrict_self(). This is self-restriction only -- a process can reduce its own access rights but cannot affect other processes.

/// Landlock ruleset, stored per-credential in the LSM blob.
/// Each landlock_restrict_self() call creates a new layer that
/// further restricts the previous layer (stacking via intersection).
pub struct LandlockCredBlob {
    /// Stack of restriction layers. Each layer is an immutable ruleset.
    /// Access is permitted only if ALL layers permit it (AND logic
    /// within the Landlock module, on top of the inter-module AND logic).
    pub domain: Option<Arc<LandlockDomain>>,
}

pub struct LandlockDomain {
    /// Parent domain (from the previous landlock_restrict_self() call).
    /// None for the outermost restriction layer.
    pub parent: Option<Arc<LandlockDomain>>,
    /// Ruleset defining what this layer permits.
    pub ruleset: LandlockRuleset,
}
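
Layer stacking via intersection can be sketched by reducing each ruleset to a single permitted-access bitmask (the real ruleset maps filesystem objects to access rights; the bit values and helper names here are illustrative):

```rust
use std::sync::Arc;

/// Simplified: each layer reduced to one permitted-access bitmask.
pub struct LandlockDomain {
    pub parent: Option<Arc<LandlockDomain>>,
    pub permitted: u64,
}

impl LandlockDomain {
    /// landlock_restrict_self() analogue: push a new restriction layer
    /// on top of an existing domain.
    pub fn restrict(parent: &Arc<LandlockDomain>, permitted: u64) -> Arc<LandlockDomain> {
        Arc::new(LandlockDomain { parent: Some(Arc::clone(parent)), permitted })
    }

    /// Access is allowed only if EVERY layer permits it (intersection).
    pub fn allows(&self, access: u64) -> bool {
        let mut cur: &LandlockDomain = self;
        loop {
            if cur.permitted & access != access {
                return false;
            }
            match &cur.parent {
                Some(p) => cur = p.as_ref(),
                None => return true,
            }
        }
    }
}
```

Note the monotonicity this structure enforces: a child layer can only remove bits from the effective set, never restore bits a parent dropped.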

8.7.7 Hook Integration Points

This section specifies WHERE in the kernel codebase hook calls are inserted. Each entry maps a kernel code path to the LSM hook it invokes. Implementing agents use this table to know exactly where to place lsm_call_*() invocations.

Kernel code path Hook call When
vfs_open() (umka-vfs) lsm_call_file_security(Open, ...) After successful lookup, before returning fd
vfs_read() / vfs_write() lsm_call_file_security(Permission, ...) Before data transfer
do_mmap() (umka-core, Section 4.1.5) lsm_call_file_security(Mmap, ...) Before creating VMA for file-backed mapping
do_mprotect() (umka-core) lsm_call_file_security(Mprotect, ...) Before changing VMA permissions
vfs_create() (umka-vfs, InodeOps) lsm_call_inode_security(Create, ...) Before calling filesystem's create()
vfs_link() lsm_call_inode_security(Link, ...) Before calling filesystem's link()
vfs_unlink() lsm_call_inode_security(Unlink, ...) Before calling filesystem's unlink()
vfs_mkdir() lsm_call_inode_security(Mkdir, ...) Before calling filesystem's mkdir()
vfs_rename() lsm_call_inode_security(Rename, ...) Before calling filesystem's rename()
vfs_setattr() lsm_call_inode_security(Setattr, ...) Before calling filesystem's setattr()
vfs_setxattr() lsm_call_inode_security(Setxattr, ...) Before calling filesystem's setxattr()
vfs_getxattr() lsm_call_inode_security(Getxattr, ...) Before returning xattr data
do_mount() (umka-vfs, Section 13.2) lsm_call_superblock_security(Mount, ...) Before calling filesystem's mount()
copy_process() (Section 7.1.2) lsm_call_task_security(Alloc, ...) During fork/clone, before task becomes runnable
do_execve() (Section 7.1.3) lsm_call_task_security(Exec, ...) After ELF loading, before credential commit
do_kill() / do_tkill() lsm_call_task_security(Kill, ...) Before signal delivery
sys_ptrace() (Section 19.3.1) lsm_call_task_security(Ptrace, ...) Before granting ptrace access
commit_creds() (Section 8.8.2) lsm_call_cred_security(Commit, ...) Before publishing new credentials
ns_capable() (Section 8.8.3) lsm_call_capable(...) After bitfield check passes, before returning true
__sys_socket() (Section 15.1.1) lsm_call_socket_security(Create, ...) Before allocating socket
sys_bind() lsm_call_socket_security(Bind, ...) Before binding address
sys_connect() lsm_call_socket_security(Connect, ...) Before initiating connection
sys_listen() lsm_call_socket_security(Listen, ...) Before marking socket as listener
sys_sendmsg() / sys_sendto() lsm_call_socket_security(Sendmsg, ...) Before sending data
ipc_permission() lsm_call_ipc_security(Permission, ...) Before granting IPC access
bpf_prog_load() (Section 18.1.4) lsm_call_bpf_security(ProgLoad, ...) Before loading BPF program
create_namespace() (Section 16.1) lsm_call_namespace_security(Create, ...) Before creating namespace
do_setns() lsm_call_namespace_security(Join, ...) Before joining namespace

8.7.8 /sys/kernel/security Interface

The LSM framework exposes state via the securityfs pseudo-filesystem:

/sys/kernel/security/
    lsm                         # Comma-separated list of active LSMs
                                # (e.g., "integrity,apparmor,landlock")
    apparmor/                   # AppArmor-specific (if loaded)
        profiles                # Loaded profile list
        .replace                # Profile upload interface
        .remove                 # Profile removal interface
    selinux/                    # SELinux-specific (if loaded)
        enforce                 # Enforcement mode (0=permissive, 1=enforce)
        policy                  # Binary policy load interface
        booleans/               # Policy booleans
    ima/                        # Already specified in Section 8.4
        ascii_runtime_measurements
        policy
    landlock/                   # Landlock ABI version
        abi_version             # Integer, currently 4 (Linux 6.7+)

8.7.9 LSM Registration and Boot Sequence

lsm_init() -- called during UmkaOS core initialization, after the slab
  allocator is available but before any user process runs:

  1. Parse kernel command line for lsm= parameter
     (default: "integrity,apparmor,landlock" or "integrity,selinux,landlock"
      depending on compile-time config)

  2. For each requested LSM, in order:
     a. Call module.blob_sizes() to collect per-object blob size requirements
     b. Record blob offset = running sum of previous modules' sizes
     c. Add module to LSM_REGISTRY.modules

  3. Compute total_blob_sizes (sum per object type)

  4. Create slab caches for each object type's blob:
     - cred_lsm_blob_cache: total_blob_sizes.cred bytes per object
     - inode_lsm_blob_cache: total_blob_sizes.inode bytes per object
     - etc.
     If total is 0 for a type (no LSM needs that blob), skip slab creation

  5. Enable static keys:
     - LSM_REGISTRY.any_lsm_active = true (if modules.len() > 0)
     - Per-category keys based on which modules override which methods
       (determined by checking if the method implementation is the default)

  6. Initialize each module (call module-specific init, e.g.,
     apparmor loads compiled-in policy, selinux initializes AVC)
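
Steps 2(b) and 3 reduce to a running-sum scan per object type. A sketch over a single object type (function name assumed for illustration):

```rust
/// Offset assignment for one object type's blob (e.g., cred): each
/// module's region starts at the running sum of the sizes requested by
/// earlier modules. Returns (per-module offsets, total blob size); the
/// total sizes that object type's slab cache.
pub fn assign_blob_offsets(requested: &[usize]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(requested.len());
    let mut total = 0usize;
    for &size in requested {
        offsets.push(total);
        total += size;
    }
    (offsets, total)
}
```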

Performance impact (measured per operation):

Condition Overhead per hook call
No LSM loaded 0 cycles (NOP via static key)
Single LSM (AppArmor) ~50-100 cycles (one vtable call + policy lookup)
Two LSMs (IMA + AppArmor) ~100-200 cycles (two vtable calls)
Three LSMs (IMA + AppArmor + Landlock) ~150-250 cycles (three vtable calls)

On an NVMe 4KB read (~10 microseconds), the read path crosses multiple hook sites; at ~250 cycles per invocation, three invocations total ~750 cycles, about 250 ns at 3 GHz, or ~2.5% of the I/O time. This is within the project's 5% overhead budget (Section 10.1) when combined with the isolation domain switch cost.

Note on scope: The cycle counts above cover hook dispatch overhead only (vtable call + static-key branch). Policy evaluation — AppArmor rule matching, SELinux AVC lookup, or Landlock restriction tree traversal — adds ~200-2000 cycles per hook depending on policy complexity and AVC cache state. A fully-loaded three-LSM configuration with cache-cold policy can reach 1000-3000 total cycles per hook invocation. The dispatch overhead is bounded; the policy evaluation overhead depends on policy complexity and is expected to dominate in production configurations.

Cross-references:
- Section 8.1.1 (08-security.md): UmkaOS native capability model
- Section 8.1.3 (08-security.md): SystemCaps definition (used by capable() hook)
- Section 18.1: Syscall dispatch (LSM hook at capable())
- Section 18.1.4: eBPF including BPF-LSM program type
- Section 8.4 (08-security.md): IMA (first LSM in evaluation order)
- Section 13.1 (13-vfs.md): VFS traits (hook integration at VFS layer)
- Section 15.1.1 (15-networking.md): SocketOps (hook integration at socket layer)
- Section 19.2.4 (19-observability.md): Static keys (zero-overhead mechanism)
- Section 19.3.1 (19-observability.md): ptrace (task_security Ptrace hook)
- Section 16.1 (16-containers.md): Namespaces (namespace_security hooks)
- Section 16.1.7 (16-containers.md): Security policy integration (references this section)
- Section 8.8 (08-security.md): Credential model (TaskCredential.lsm_blob)

8.7.10 Policy Format Compatibility

UmkaOS's LSM framework accepts policy in the same binary format as the upstream AppArmor and SELinux policy compilers, ensuring that existing policy toolchains work unmodified. Operators do not need to recompile, reformat, or convert policy artifacts when moving workloads to UmkaOS.

AppArmor: UmkaOS accepts AppArmor policy in the compiled binary format produced by apparmor_parser. The same .aa source files and compiled profiles used on Linux work on UmkaOS without recompilation. The policy ABI version supported is AppArmor 3.x (the current upstream format, as shipped in Ubuntu 22.04+ and Debian 12+). Profiles are loaded via the same securityfs interface: writing compiled profile data to /sys/kernel/security/apparmor/.replace (replace existing) or .load (add new). The kernel validates the ABI version field in the binary header and rejects profiles compiled for incompatible ABI versions with EINVAL.

SELinux: UmkaOS accepts SELinux policy modules in the binary .pp (policy package) format produced by semodule and checkmodule. CIL (Common Intermediate Language) source policies compiled with secilc are also accepted. The policy database format version supported is SELinux policy version 33, which is the current upstream version as of the Linux 6.x series. Policy is loaded via /sys/kernel/security/selinux/policy (full policy binary) or via semodule -i (individual .pp modules, which the kernel assembles into the running policy database). The enforce and booleans/ interfaces in securityfs behave identically to the Linux upstream SELinux implementation.

Custom LSM hooks for UmkaOS-specific events: UmkaOS extends the standard LSM hook set with hooks for events that do not exist in Linux (capability delegation, DebugCap issuance, accelerator context creation, KABI driver load). These hooks are invisible to AppArmor and SELinux policy written against upstream compilers: they pass through as SECURITY_ALLOW (permit) for any LSM module that does not explicitly implement the UmkaOS-specific hook variants. UmkaOS-native LSM modules (implemented against the full SecurityModule trait defined in this section) can hook these extended points to enforce site-specific policy over UmkaOS-specific resources. AppArmor and SELinux policy correctness is unaffected by the existence of these additional hooks.

8.8 Credential Model and Capabilities

The Linux credential model (UIDs, GIDs, supplementary groups) is a legacy access-control mechanism, implemented for POSIX compatibility. UmkaOS maps it onto its native Capability System (Section 8.1).

8.8.1 Credential Structure

Linux uses copy-on-write credentials (struct cred) shared between tasks via RCU. UmkaOS follows the same pattern: credentials are immutable once published. Modification follows a prepare-modify-commit sequence that atomically swaps the credential pointer visible to other tasks.
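
The prepare-modify-commit sequence can be modeled in userspace with an `Arc` swap standing in for the RCU-published pointer. A sketch only: the kernel reclaims old credentials via RCU grace periods rather than `Arc` drops, and readers use an RCU read-side section rather than a lock.

```rust
use std::sync::{Arc, RwLock};

#[derive(Clone)]
pub struct TaskCredential { pub uid: u32, pub euid: u32 }

pub struct Task {
    /// Stand-in for the RCU-protected `current->cred` pointer.
    pub cred: RwLock<Arc<TaskCredential>>,
}

impl Task {
    /// prepare_creds(): clone the current credential for private mutation.
    pub fn prepare_creds(&self) -> TaskCredential {
        (**self.cred.read().unwrap()).clone()
    }

    /// commit_creds(): atomically publish the new credential. Readers that
    /// already hold a reference keep seeing the old values (RCU analogue).
    pub fn commit_creds(&self, new: TaskCredential) {
        *self.cred.write().unwrap() = Arc::new(new);
    }

    pub fn current_cred(&self) -> Arc<TaskCredential> {
        self.cred.read().unwrap().clone()
    }
}
```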

/// Per-task credential state. Analogous to Linux's `struct cred`.
///
/// Credentials are immutable after publication. To modify credentials,
/// a task calls `prepare_creds()` to obtain a mutable clone, modifies
/// the clone, then calls `commit_creds()` to atomically publish it.
/// The old credential is freed after an RCU grace period (readers that
/// already hold a reference via `rcu_read_lock()` see the old values
/// until they exit the read-side critical section).
///
/// **Memory layout**: The credential struct is slab-allocated from a
/// dedicated `cred_cache` slab (fixed size, no fragmentation). The
/// `Arc` provides reference counting for sharing between `prepare_creds`
/// clones and cross-task references (e.g., `/proc/PID/status` reads
/// another task's credentials). RCU protects the pointer swap; `Arc`
/// protects the struct lifetime.
///
/// **Hot-path optimization**: Syscall entry reads `current->cred` under
/// an implicit RCU read-side section (preempt-disabled on syscall entry).
/// The `SystemCaps` checks (`capable()`, `ns_capable()`) access
/// `cred.cap_effective` directly -- no indirection beyond the RCU
/// dereference of the cred pointer itself.
#[repr(C)]
pub struct TaskCredential {
    // ===== POSIX identity fields =====

    /// Real user ID. Set at login, inherited across fork/exec unless
    /// modified by setuid()/setreuid()/setresuid().
    pub uid: u32,
    /// Real group ID. Analogous to uid for group identity.
    pub gid: u32,
    /// Effective user ID. Used for permission checks (file access,
    /// signal delivery). May differ from uid after setuid binary exec
    /// or seteuid() call.
    pub euid: u32,
    /// Effective group ID. Used for permission checks.
    pub egid: u32,
    /// Saved set-user-ID. Preserved across exec for setuid binaries.
    /// Allows switching between real and saved UIDs via setreuid().
    pub suid: u32,
    /// Saved set-group-ID. Analogous to suid for groups.
    pub sgid: u32,
    /// Filesystem user ID. Used exclusively for filesystem permission
    /// checks (open, stat, chown). Normally tracks euid; can be set
    /// independently via setfsuid(). Exists for NFS server implementations
    /// that need to check permissions as a different user without changing
    /// signal delivery identity (euid).
    pub fsuid: u32,
    /// Filesystem group ID. Analogous to fsuid.
    pub fsgid: u32,

    // ===== Linux capability sets (5 sets, matching Linux kernel exactly) =====
    // All five sets use SystemCaps (u128) to hold both POSIX capabilities
    // (bits 0-63, matching Linux numbering) and UmkaOS-native capabilities
    // (bits 64-127). See Section 8.1.3 for the full SystemCaps definition.
    //
    // The five sets interact as follows:
    // - cap_permitted: upper bound on what the task CAN have
    // - cap_effective: what the task CURRENTLY has (checked by capable())
    // - cap_inheritable: what survives across execve()
    // - cap_bounding: limits what file capabilities can grant
    // - cap_ambient: auto-raised on execve() for capability-dumb binaries

    /// Effective capability set. This is the set checked by capable() and
    /// ns_capable() on every privileged operation. A capability must be in
    /// this set for the task to exercise it.
    pub cap_effective: SystemCaps,

    /// Permitted capability set. Upper bound on cap_effective. A task can
    /// raise a capability into cap_effective only if it is in cap_permitted.
    /// Dropping a capability from cap_permitted is permanent (cannot be
    /// re-added). cap_effective is always a subset of cap_permitted.
    pub cap_permitted: SystemCaps,

    /// Inheritable capability set. Capabilities that can be inherited across
    /// execve() if the executed file also has the capability in its file
    /// inheritable set. Used by capability-aware programs that explicitly
    /// manage which capabilities survive exec.
    pub cap_inheritable: SystemCaps,

    /// Bounding set. Limits which capabilities can be gained through file
    /// capabilities during execve(). A capability not in the bounding set
    /// cannot appear in cap_permitted after execve(), even if the file has
    /// it in its file permitted set. Can only be reduced (via
    /// prctl(PR_CAPBSET_DROP)), never expanded. Inherited across fork().
    pub cap_bounding: SystemCaps,

    /// Ambient capability set. Capabilities automatically added to
    /// cap_permitted and cap_effective on execve() of a non-setuid,
    /// non-file-capability binary (a "capability-dumb" binary). This
    /// allows unprivileged programs to inherit capabilities without
    /// requiring file capability xattrs. Added in Linux 4.3.
    ///
    /// Invariant: cap_ambient is always a subset of both cap_permitted
    /// and cap_inheritable. If a capability is dropped from either
    /// cap_permitted or cap_inheritable, it is automatically dropped
    /// from cap_ambient.
    pub cap_ambient: SystemCaps,

    // ===== Supplementary groups =====

    /// Supplementary group list. Set by setgroups(2) or initgroups(3).
    /// Used by permission checks alongside gid/egid. Maximum 65536 groups
    /// (matching Linux NGROUPS_MAX). Stored sorted for binary search during
    /// permission checks (in_group_p() on the hot path for file access).
    ///
    /// Shared via Arc for COW efficiency: fork() shares the list; only
    /// setgroups() triggers a clone. The common case (no setgroups after
    /// login) means the list is shared across the entire process tree
    /// spawned by a session.
    pub supplementary_groups: Arc<SortedGidList>,

    // ===== Security control flags =====

    /// PR_SET_NO_NEW_PRIVS flag. Once set to true, cannot be unset.
    /// Inherited across fork() and preserved across execve().
    /// When true, execve() will NOT:
    ///   - Honor setuid/setgid bits on the executed file
    ///   - Grant file capabilities from security.capability xattr
    ///   - Allow LSM transitions that increase privilege
    /// Required for unprivileged seccomp (SECCOMP_SET_MODE_FILTER
    /// without CAP_SYS_ADMIN requires no_new_privs == true).
    pub no_new_privs: bool,

    /// Securebits flags. Control the special handling of UID 0 (root).
    /// See SecureBits definition below. Set via prctl(PR_SET_SECUREBITS).
    /// Requires CAP_SETPCAP in the caller's effective set.
    pub securebits: SecureBits,

    // ===== Namespace and LSM =====

    /// User namespace this credential operates in. Determines the
    /// scope of UID/GID mappings and the ceiling for capability checks
    /// (see compute_effective_caps in Section 16.1.6).
    pub user_ns: Arc<UserNamespace>,

    /// Opaque LSM security blob. Allocated and managed by the active
    /// LSM modules (Section 8.7). Each LSM gets a contiguous region
    /// within this blob, indexed by the LSM's registered blob offset.
    /// None if no LSM is loaded (zero overhead -- no allocation).
    /// See Section 8.7.4 for the blob layout specification.
    pub lsm_blob: Option<Arc<LsmBlob>>,
}

/// Sorted supplementary group ID list. Stored sorted for O(log n) lookup
/// via binary search. NGROUPS_MAX = 65536 matches Linux.
pub struct SortedGidList {
    /// Sorted group IDs. Binary search for in_group_p() checks.
    groups: ArrayVec<u32, NGROUPS_MAX>,
}

/// Maximum number of supplementary groups per credential.
/// Matches Linux NGROUPS_MAX (65536 since Linux 2.6.4).
const NGROUPS_MAX: usize = 65536;

impl SortedGidList {
    /// Check if the given GID is in the supplementary group list.
    /// O(log n) binary search. Called on every file permission check
    /// when the file's group matches neither egid nor fsgid.
    pub fn contains(&self, gid: u32) -> bool {
        self.groups.binary_search(&gid).is_ok()
    }
}
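setgroups() (Section 8.8.6) hands this type an arbitrary user-supplied slice via to_sorted_deduped(), so a sorting, deduplicating constructor is the natural counterpart to contains(). A minimal sketch using Vec in place of ArrayVec; the from_unsorted name is illustrative, not part of the spec:

```rust
/// Sketch of SortedGidList over Vec; the real type uses ArrayVec<u32, NGROUPS_MAX>.
pub struct SortedGidList {
    groups: Vec<u32>,
}

impl SortedGidList {
    /// Build from an arbitrary slice: sort, then drop duplicates, so that
    /// contains() can use binary search. O(n log n) once at setgroups() time.
    pub fn from_unsorted(gids: &[u32]) -> Self {
        let mut groups = gids.to_vec();
        groups.sort_unstable();
        groups.dedup();
        SortedGidList { groups }
    }

    /// O(log n) membership test, as on the file-permission hot path.
    pub fn contains(&self, gid: u32) -> bool {
        self.groups.binary_search(&gid).is_ok()
    }
}
```

Paying the sort cost once at setgroups() time keeps the hot path (in_group_p() during every file permission check) branch-light and allocation-free.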

bitflags! {
    /// Securebits flags. These control the special treatment of UID 0.
    /// Each flag has a corresponding LOCKED variant that prevents
    /// clearing the flag (one-way escalation of restriction).
    ///
    /// Bit layout matches Linux exactly (include/uapi/linux/securebits.h):
    ///   bit 0: SECBIT_NOROOT
    ///   bit 1: SECBIT_NOROOT_LOCKED
    ///   bit 2: SECBIT_NO_SETUID_FIXUP
    ///   bit 3: SECBIT_NO_SETUID_FIXUP_LOCKED
    ///   bit 4: SECBIT_KEEP_CAPS
    ///   bit 5: SECBIT_KEEP_CAPS_LOCKED
    ///   bit 6: SECBIT_NO_CAP_AMBIENT_RAISE
    ///   bit 7: SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED
    pub struct SecureBits: u32 {
        /// SECBIT_NOROOT: When set, UID 0 does NOT automatically gain
        /// capabilities. Without this flag (the default), a process
        /// that transitions to euid 0 via setuid() gains full
        /// cap_permitted (the "root is special" legacy behavior).
        /// Required for rootless containers where UID 0 inside the
        /// container maps to an unprivileged host UID.
        const NOROOT = 1 << 0;
        /// SECBIT_NOROOT_LOCKED: Prevents clearing NOROOT.
        const NOROOT_LOCKED = 1 << 1;

        /// SECBIT_NO_SETUID_FIXUP: When set, the kernel does NOT
        /// adjust cap_effective/cap_permitted/cap_ambient when
        /// euid changes (via setuid/seteuid/setreuid/setresuid).
        /// Without this flag, transitioning from euid 0 to non-zero
        /// clears cap_effective, and transitioning to euid 0 raises
        /// cap_effective to cap_permitted.
        const NO_SETUID_FIXUP = 1 << 2;
        /// SECBIT_NO_SETUID_FIXUP_LOCKED: Prevents clearing NO_SETUID_FIXUP.
        const NO_SETUID_FIXUP_LOCKED = 1 << 3;

        /// SECBIT_KEEP_CAPS: When set, a setuid() from euid 0 to
        /// non-zero does NOT clear cap_permitted. Normally, dropping
        /// root (setuid(non-zero)) clears cap_permitted entirely.
        /// This flag preserves capabilities across the UID change.
        /// Note: this flag is automatically cleared on execve().
        const KEEP_CAPS = 1 << 4;
        /// SECBIT_KEEP_CAPS_LOCKED: Prevents clearing KEEP_CAPS.
        const KEEP_CAPS_LOCKED = 1 << 5;

        /// SECBIT_NO_CAP_AMBIENT_RAISE: When set, prevents adding
        /// capabilities to the ambient set. Existing ambient caps
        /// are preserved but no new ones can be added.
        const NO_CAP_AMBIENT_RAISE = 1 << 6;
        /// SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED: Prevents clearing
        /// NO_CAP_AMBIENT_RAISE.
        const NO_CAP_AMBIENT_RAISE_LOCKED = 1 << 7;
    }
}

8.8.2 Credential Lifecycle (Copy-on-Write via RCU)

The credential is shared immutably between tasks. The pointer from the task struct to the credential is RCU-protected:

/// In the Task struct (Section 7.1.1):
pub struct Task {
    // ... existing fields ...

    /// RCU-protected pointer to the task's current credentials.
    /// Readers (other tasks inspecting this task's creds, e.g.,
    /// /proc/PID/status, kill() permission check, ptrace attach)
    /// dereference under rcu_read_lock().
    /// The owning task reads its own creds without RCU (it is the
    /// only writer, and commit_creds() is serialized per-task).
    pub cred: RcuPtr<Arc<TaskCredential>>,
}

Credential modification protocol:

prepare_creds(current_task) -> MutableCred:
  1. old = rcu_dereference(current_task.cred)  // or direct read for self
  2. new = Arc::new(old.deep_clone())           // clone all fields
  3. return MutableCred(new)                    // caller owns mutable access

// Caller modifies new.cap_effective, new.euid, etc.

commit_creds(current_task, new: MutableCred):
  1. Validate invariants (see below)
  2. old = current_task.cred
  3. rcu_assign_pointer(current_task.cred, new.into_arc())
  4. // old Arc ref is decremented; if refcount reaches 0,
  5. // the old TaskCredential is freed after RCU grace period
  6. // via call_rcu(drop_cred, old)

abort_creds(new: MutableCred):
  // Called if the modification is abandoned (e.g., error path)
  // Simply drops the Arc, freeing the cloned credential
  drop(new)
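The prepare/commit/abort protocol above can be sketched without the RCU machinery: clone under Arc, mutate the private copy, publish it with a pointer swap. A single-threaded sketch with a hypothetical MiniCred type (the real kernel publishes via RcuPtr and defers the free to a grace period with call_rcu):

```rust
use std::sync::Arc;

/// Toy credential; the real TaskCredential has many more fields.
#[derive(Clone)]
struct MiniCred {
    euid: u32,
    no_new_privs: bool,
}

struct Task {
    cred: Arc<MiniCred>, // real kernel: RcuPtr<Arc<TaskCredential>>
}

/// prepare_creds: deep-clone the current credential into a private copy.
fn prepare_creds(task: &Task) -> MiniCred {
    (*task.cred).clone()
}

/// commit_creds: validate invariants, then publish by swapping the pointer.
/// The old Arc is dropped here; the real kernel frees it only after an
/// RCU grace period. Only the no_new_privs invariant is shown.
fn commit_creds(task: &mut Task, new: MiniCred) -> Result<(), &'static str> {
    if task.cred.no_new_privs && !new.no_new_privs {
        return Err("EPERM: no_new_privs is one-way");
    }
    task.cred = Arc::new(new);
    Ok(())
}

/// abort_creds: abandon the modification; the clone is simply dropped.
fn abort_creds(new: MiniCred) {
    drop(new);
}
```

Because readers only ever see either the complete old Arc or the complete new one, no per-field locking is needed on the read side.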

Invariants enforced by commit_creds():

  1. cap_effective is a subset of cap_permitted
  2. cap_ambient is a subset of both cap_permitted AND cap_inheritable
  3. If no_new_privs was true in the old credential, it must be true in the new one
  4. Locked securebits in the old credential cannot be cleared in the new one
  5. cap_bounding can only shrink (new is a subset of old), never grow

Violation of any invariant causes commit_creds() to return Err(EPERM) without modifying the task's credential pointer.
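Treating SystemCaps as the u128 bitmask of Section 8.1.3, the capability-set invariants reduce to subset tests, where `a` is a subset of `b` iff `a & !b == 0`. A standalone sketch of the validator covering invariants 1, 2, and 5 (invariants 3 and 4 need the old no_new_privs flag and securebits, elided here):

```rust
type SystemCaps = u128;

/// True iff every bit set in `a` is also set in `b`.
fn is_subset(a: SystemCaps, b: SystemCaps) -> bool {
    a & !b == 0
}

/// Subset invariants checked by commit_creds() before publishing.
fn validate_cap_invariants(
    effective: SystemCaps,
    permitted: SystemCaps,
    inheritable: SystemCaps,
    ambient: SystemCaps,
    old_bounding: SystemCaps,
    new_bounding: SystemCaps,
) -> bool {
    is_subset(effective, permitted)              // 1: effective ⊆ permitted
        && is_subset(ambient, permitted)         // 2: ambient ⊆ permitted
        && is_subset(ambient, inheritable)       // 2: ambient ⊆ inheritable
        && is_subset(new_bounding, old_bounding) // 5: bounding only shrinks
}
```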

8.8.3 The capable() and ns_capable() Check Functions

All privileged operations in the kernel go through one of these two functions:

capable(cap: SystemCaps) -> bool:
  // Check if the current task has the given capability against the
  // init user namespace (i.e., "global" capability).
  return ns_capable(&INIT_USER_NS, cap)

ns_capable(target_ns: &UserNamespace, cap: SystemCaps) -> bool:
  1. cred = current_task.cred   // no RCU needed for own creds
  2. if !cred.cap_effective.contains(cap):
       return false             // task doesn't have the cap at all
  3. // Check namespace hierarchy: task's user_ns must be same as or
  //  ancestor of target_ns (from compute_effective_caps, Section 16.1.6)
  4. if !is_same_or_ancestor(cred.user_ns, target_ns):
       return false             // cap not valid in target namespace
  5. // LSM hook: security_capable() -- allows LSMs to deny even if
  //  capability bits permit (e.g., SELinux policy denies CAP_NET_ADMIN
  //  to a confined domain even if the process holds it)
  6. if lsm_deny_capable(cred, target_ns, cap):  // Section 8.7
       return false
  7. return true

Cross-reference: ns_capable() at step 4 uses the is_same_or_ancestor() function defined in the compute_effective_caps() algorithm (Section 16.1.6). The namespace hierarchy walk is the same; ns_capable() is the per-operation entry point, while compute_effective_caps() is the bulk computation used when constructing the effective set.
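The is_same_or_ancestor() check at step 4 is a parent-pointer climb from the target namespace toward the root. A sketch over a toy Ns node (the real UserNamespace lives behind Arc and carries UID/GID maps; Linux caps user-namespace nesting at 32 levels, so the walk is short):

```rust
use std::rc::Rc;

/// Toy namespace node; only the parent link matters for the walk.
struct Ns {
    parent: Option<Rc<Ns>>,
}

/// True iff `candidate` equals `target` or is an ancestor of it.
/// Walks target's parent chain until it finds candidate or hits the root.
fn is_same_or_ancestor(candidate: &Rc<Ns>, target: &Rc<Ns>) -> bool {
    let mut cur = Some(Rc::clone(target));
    while let Some(ns) = cur {
        if Rc::ptr_eq(candidate, &ns) {
            return true;
        }
        cur = ns.parent.clone();
    }
    false
}
```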

8.8.4 The execve() Capability Transformation

When a task calls execve(), the capability sets are recomputed based on the executed file's properties. This is the most security-critical credential transformation in the kernel.

File capability structure (stored in security.capability xattr):

/// File capabilities stored as an extended attribute on the binary.
/// Format matches Linux VFS_CAP_REVISION_3 (Linux 4.14+).
/// Read from security.capability xattr during execve().
#[repr(C)]
pub struct FileCapabilities {
    /// File permitted set: capabilities the file can grant to the
    /// task's cap_permitted. Subject to cap_bounding restriction.
    pub permitted: SystemCaps,

    /// File inheritable set: capabilities the file allows to be
    /// inherited from the task's cap_inheritable into cap_permitted.
    pub inheritable: SystemCaps,

    /// Effective flag (single bit, not a full set). When set,
    /// cap_effective is raised to equal cap_permitted after the
    /// transformation. When clear, cap_effective is set to
    /// cap_ambient only (capability-aware binaries manage their
    /// own effective set via capset()).
    pub effective_flag: bool,

    /// Root UID for namespace-scoped file capabilities (VFS_CAP_REVISION_3).
    /// File capabilities are interpreted relative to this UID's user namespace.
    /// 0 = initial namespace (host file caps). Non-zero = namespaced file caps.
    pub rootid: u32,
}
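For concreteness, the on-disk payload this struct is parsed from (Linux's struct vfs_cap_data at VFS_CAP_REVISION_3) packs six little-endian u32s: magic_etc (revision | effective flag), then permitted/inheritable in two 32-bit halves each, then rootid, for 24 bytes total. A hedged parsing sketch into the u128-backed SystemCaps; constants follow the Linux UAPI header, error handling is minimal, and revisions 1/2 are elided:

```rust
use std::convert::TryInto;

type SystemCaps = u128;

const VFS_CAP_REVISION_3: u32 = 0x0300_0000;
const VFS_CAP_REVISION_MASK: u32 = 0xFF00_0000;
const VFS_CAP_FLAGS_EFFECTIVE: u32 = 0x0000_0001;

struct FileCapabilities {
    permitted: SystemCaps,
    inheritable: SystemCaps,
    effective_flag: bool,
    rootid: u32,
}

/// Parse a 24-byte revision-3 security.capability xattr payload.
fn parse_file_capabilities(buf: &[u8]) -> Option<FileCapabilities> {
    if buf.len() != 24 {
        return None;
    }
    let u32_at = |i: usize| u32::from_le_bytes(buf[i..i + 4].try_into().unwrap());
    let magic = u32_at(0);
    if magic & VFS_CAP_REVISION_MASK != VFS_CAP_REVISION_3 {
        return None; // earlier xattr revisions not handled in this sketch
    }
    // The 64 POSIX bits occupy the low half of SystemCaps (bits 0-63).
    let permitted = (u32_at(4) as u128) | ((u32_at(12) as u128) << 32);
    let inheritable = (u32_at(8) as u128) | ((u32_at(16) as u128) << 32);
    Some(FileCapabilities {
        permitted,
        inheritable,
        effective_flag: magic & VFS_CAP_FLAGS_EFFECTIVE != 0,
        rootid: u32_at(20),
    })
}
```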

Transformation algorithm (executed atomically during execve(), after ELF loading succeeds but before returning to user space):

execve_transform_caps(task, file):
  old = task.cred  // current credential
  new = prepare_creds(task)

  // Step 0: Check no_new_privs gate
  is_setuid = file.is_setuid() && !old.no_new_privs
  is_setgid = file.is_setgid() && !old.no_new_privs
  has_file_caps = file.has_security_capability_xattr()
                  && !old.no_new_privs

  // Step 1: UID/GID transitions (setuid/setgid bits)
  if is_setuid:
      new.euid = file.owner_uid
      new.suid = file.owner_uid
      // fsuid tracks euid by default
      new.fsuid = file.owner_uid
  if is_setgid:
      new.egid = file.owner_gid
      new.sgid = file.owner_gid
      new.fsgid = file.owner_gid

  // Step 2: Load file capabilities
  F = if has_file_caps:
          parse_file_capabilities(file)  // from security.capability xattr
      else:
          FileCapabilities::EMPTY

  // Step 2a: Namespace-scoped file capability validation
  // File caps are only honored if the file's rootid namespace is the
  // same as or an ancestor of the task's user namespace (Section 16.1.6).
  if F != EMPTY:
      file_ns = namespace_for_rootid(F.rootid)
      if !is_same_or_ancestor(file_ns, old.user_ns):
          F = FileCapabilities::EMPTY  // silently ignore

  // Step 3: Determine if this is a "privileged" exec
  // A binary is "privileged" if it has file capabilities or is setuid-root
  is_privileged = (F != EMPTY)
                  || (is_setuid && file.owner_uid == 0
                      && !old.securebits.contains(NOROOT))

  // Step 4: Ambient set transformation
  // Ambient caps are cleared if this is a privileged exec
  new.cap_ambient = if is_privileged {
      SystemCaps::empty()
  } else {
      old.cap_ambient
  }

  // Step 5: Permitted set transformation
  // P'(permitted) = (P(inheritable) & F(inheritable))
  //               | (F(permitted) & cap_bounding)
  //               | P'(ambient)
  new.cap_permitted =
      (old.cap_inheritable & F.inheritable)
      | (F.permitted & old.cap_bounding)
      | new.cap_ambient

  // Step 5a: SECBIT_NOROOT interaction
  // If NOROOT is NOT set and the binary is setuid-root (or euid becomes 0),
  // grant full capabilities within the bounding set (legacy root behavior)
  if !old.securebits.contains(NOROOT) && new.euid == 0:
      new.cap_permitted = new.cap_permitted | old.cap_bounding

  // Step 6: Effective set transformation
  // If F.effective_flag is set (capability-dumb binary or setuid-root),
  // raise effective to equal permitted.
  // Otherwise (capability-aware binary), effective = ambient only.
  new.cap_effective = if F.effective_flag || (new.euid == 0
                        && !old.securebits.contains(NOROOT)) {
      new.cap_permitted
  } else {
      new.cap_ambient
  }

  // Step 7: Inheritable set is unchanged across execve()
  new.cap_inheritable = old.cap_inheritable

  // Step 8: Bounding set is unchanged across execve()
  new.cap_bounding = old.cap_bounding

  // Step 9: KEEP_CAPS is always cleared on execve()
  new.securebits.remove(KEEP_CAPS)
  // All other securebits (including locked variants) are preserved

  // Step 10: no_new_privs is preserved (one-way flag)
  new.no_new_privs = old.no_new_privs

  // Step 11: LSM transition hook
  // LSM hooks (AppArmor/SELinux) may deny the exec or apply domain transitions.
  // Denials are always enforced regardless of no_new_privs.
  // When no_new_privs is set, LSM additionally rejects privilege-increasing
  // domain transitions (the LSM hook checks bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS).
  lsm_result = lsm_task_exec_transition(old, new, file)  // Section 8.7
  if lsm_result.is_err():
      abort_creds(new)
      return lsm_result

  // Step 12: Validate and commit
  commit_creds(task, new)
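The Step 5 formula is three masked ORs on the u128 SystemCaps. A worked sketch of just that step, with arbitrary bit positions for illustration:

```rust
type SystemCaps = u128;

/// Step 5 of execve_transform_caps():
/// P'(permitted) = (P(inheritable) & F(inheritable))
///               | (F(permitted)   & cap_bounding)
///               | P'(ambient)
fn transform_permitted(
    old_inheritable: SystemCaps,
    old_bounding: SystemCaps,
    file_permitted: SystemCaps,
    file_inheritable: SystemCaps,
    new_ambient: SystemCaps,
) -> SystemCaps {
    (old_inheritable & file_inheritable) // caps the task AND file both inherit
        | (file_permitted & old_bounding) // file grants, clipped by bounding
        | new_ambient                     // ambient survives into permitted
}
```

Note how the bounding set clips only the file-granted term: a capability present solely in F.permitted but absent from cap_bounding never reaches the new permitted set, which is exactly the guarantee prctl(PR_CAPBSET_DROP) relies on.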

Interaction table (how securebits affect the transformation):

  NOROOT set:
      UID 0 does NOT auto-gain caps at step 5a. Setuid-root binaries still
      change euid but receive only file-granted caps, not full cap_bounding.
  NOROOT clear (default):
      UID 0 gets cap_bounding added to cap_permitted (legacy root behavior).
  NO_SETUID_FIXUP set:
      No effect on execve(); only affects setuid()/seteuid() UID
      transitions (see Section 8.8.5 below).
  KEEP_CAPS set:
      Cleared unconditionally at step 9. Only affects setuid(), not execve().
  NO_CAP_AMBIENT_RAISE set:
      No effect on the execve() transform itself, but prevents
      prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE) beforehand.

8.8.5 UID Transition Capability Adjustments

When a task changes its euid via setuid(), seteuid(), setreuid(), or setresuid(), the capability sets are adjusted unless SECBIT_NO_SETUID_FIXUP is set:

fixup_caps_after_setuid(old_euid, new_euid, cred):
  if cred.securebits.contains(NO_SETUID_FIXUP):
      return  // no adjustment

  // Case 1: Transition FROM euid 0 to non-zero
  if old_euid == 0 && new_euid != 0:
      if !cred.securebits.contains(KEEP_CAPS):
          // Drop all capabilities (leaving root)
          cred.cap_permitted = SystemCaps::empty()
          cred.cap_effective = SystemCaps::empty()
          cred.cap_ambient = SystemCaps::empty()
      else:
          // KEEP_CAPS: permitted preserved, effective cleared
          cred.cap_effective = SystemCaps::empty()

  // Case 2: Transition TO euid 0 from non-zero
  if old_euid != 0 && new_euid == 0:
      if !cred.securebits.contains(NOROOT):
          // Gaining root: raise effective to permitted
          cred.cap_effective = cred.cap_permitted

  // In all cases, maintain ambient invariant
  cred.cap_ambient = cred.cap_ambient
      & cred.cap_permitted
      & cred.cap_inheritable
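The fixup above, restated as a function over u128 sets. The Cred struct here is a hypothetical slice of TaskCredential holding only the fields the fixup touches, with the relevant securebits flattened to bools:

```rust
type SystemCaps = u128;

struct Cred {
    cap_permitted: SystemCaps,
    cap_effective: SystemCaps,
    cap_inheritable: SystemCaps,
    cap_ambient: SystemCaps,
    keep_caps: bool,       // SECBIT_KEEP_CAPS
    noroot: bool,          // SECBIT_NOROOT
    no_setuid_fixup: bool, // SECBIT_NO_SETUID_FIXUP
}

/// Mirror of the fixup_caps_after_setuid() pseudocode above.
fn fixup_caps_after_setuid(old_euid: u32, new_euid: u32, cred: &mut Cred) {
    if cred.no_setuid_fixup {
        return; // NO_SETUID_FIXUP: leave all sets untouched
    }
    // Case 1: leaving root. Effective is cleared in both branches;
    // permitted and ambient survive only under KEEP_CAPS.
    if old_euid == 0 && new_euid != 0 {
        if !cred.keep_caps {
            cred.cap_permitted = 0;
            cred.cap_ambient = 0;
        }
        cred.cap_effective = 0;
    }
    // Case 2: gaining root. Raise effective to permitted unless NOROOT.
    if old_euid != 0 && new_euid == 0 && !cred.noroot {
        cred.cap_effective = cred.cap_permitted;
    }
    // Maintain the ambient invariant in all cases.
    cred.cap_ambient &= cred.cap_permitted & cred.cap_inheritable;
}
```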

8.8.6 setgroups() and Supplementary Group Management

setgroups(task, new_groups: &[u32]) -> Result<()>:
  1. Check: ns_capable(task.cred.user_ns, CAP_SETGID)
     // OR: task has written to /proc/PID/gid_map (user namespace case)
  2. if new_groups.len() > NGROUPS_MAX:
       return Err(EINVAL)
  3. sorted = new_groups.to_sorted_deduped()
  4. new_cred = prepare_creds(task)
  5. new_cred.supplementary_groups = Arc::new(SortedGidList::from(sorted))
  6. commit_creds(task, new_cred)

// Permission check helper used by VFS (file open, directory access):
in_group_p(cred: &TaskCredential, gid: u32) -> bool:
  cred.fsgid == gid || cred.supplementary_groups.contains(gid)

User namespace interaction: Inside a user namespace, setgroups() is denied by default until /proc/PID/gid_map has been written (matching Linux behavior since Linux 3.19). This prevents an unprivileged process in a new user namespace from using setgroups() to drop groups it does not want to be checked against, which would be a privilege escalation. The gid_map_written flag is tracked on the UserNamespace struct.

8.8.7 prctl() Credential Operations

prctl_set_no_new_privs(task):
  // One-way flag: once set, never cleared
  new_cred = prepare_creds(task)
  new_cred.no_new_privs = true
  commit_creds(task, new_cred)
  // Always succeeds (no capability required -- this is self-restriction)

prctl_set_securebits(task, new_bits: u32) -> Result<()>:
  1. Check: ns_capable(task.cred.user_ns, CAP_SETPCAP)
  2. old_bits = task.cred.securebits
  3. // Cannot clear a locked bit
     for each (flag, locked) pair in SecureBits:
       if old_bits.contains(locked) && !new_bits.contains(flag):
           return Err(EPERM)  // locked bit prevents clearing
  4. // Cannot clear a lock bit
     for each locked bit:
       if old_bits.contains(locked) && !new_bits.contains(locked):
           return Err(EPERM)
  5. // Validate: no unknown bits set
     if new_bits & !SecureBits::all().bits() != 0:
         return Err(EINVAL)
  6. new_cred = prepare_creds(task)
  7. new_cred.securebits = SecureBits::from_bits_truncate(new_bits)
  8. commit_creds(task, new_cred)
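Because each lock bit sits one position above its flag (per the bit layout quoted in the SecureBits definition), steps 3-5 reduce to three mask tests. A sketch of just those checks; the SECBITS_* constants are assumptions of this sketch, not the kernel's identifiers:

```rust
const SECBITS_ALL: u32 = 0xFF;          // bits 0-7 are defined
const SECBITS_LOCKS: u32 = 0b1010_1010; // the four *_LOCKED bits (odd positions)

/// Steps 3-5 of prctl_set_securebits(): returns the accepted new bits
/// or an errno-style error.
fn check_securebits_change(old: u32, new: u32) -> Result<u32, &'static str> {
    // Step 3: a flag whose lock bit is set may not be cleared.
    let locked_flags = (old & SECBITS_LOCKS) >> 1;
    if locked_flags & old & !new != 0 {
        return Err("EPERM"); // attempt to clear a locked flag
    }
    // Step 4: a set lock bit may never be cleared.
    if old & SECBITS_LOCKS & !new != 0 {
        return Err("EPERM");
    }
    // Step 5: no unknown bits.
    if new & !SECBITS_ALL != 0 {
        return Err("EINVAL");
    }
    Ok(new)
}
```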

prctl_cap_ambient(task, op, cap) -> Result<u64>:
  // Returns 0 on success; PR_CAP_AMBIENT_IS_SET returns 0 or 1.
  match op:
    PR_CAP_AMBIENT_RAISE:
      if task.cred.securebits.contains(NO_CAP_AMBIENT_RAISE):
          return Err(EPERM)
      if !task.cred.cap_permitted.contains(cap):
          return Err(EPERM)
      if !task.cred.cap_inheritable.contains(cap):
          return Err(EPERM)
      new_cred = prepare_creds(task)
      new_cred.cap_ambient |= cap
      commit_creds(task, new_cred)
    PR_CAP_AMBIENT_LOWER:
      new_cred = prepare_creds(task)
      new_cred.cap_ambient &= !cap
      commit_creds(task, new_cred)
    PR_CAP_AMBIENT_CLEAR_ALL:
      new_cred = prepare_creds(task)
      new_cred.cap_ambient = SystemCaps::empty()
      commit_creds(task, new_cred)
    PR_CAP_AMBIENT_IS_SET:
      return Ok(task.cred.cap_ambient.contains(cap) as u64)
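The RAISE preconditions are the interesting part: a capability enters the ambient set only if it is already in both permitted and inheritable, and only while NO_CAP_AMBIENT_RAISE is unset. This is what keeps the ambient-subset invariant true by construction. A sketch of just that gate:

```rust
type SystemCaps = u128;

/// Gate for PR_CAP_AMBIENT_RAISE. On success returns the new ambient set.
fn ambient_raise(
    permitted: SystemCaps,
    inheritable: SystemCaps,
    ambient: SystemCaps,
    cap: SystemCaps,            // single-bit mask to raise
    no_cap_ambient_raise: bool, // SECBIT_NO_CAP_AMBIENT_RAISE
) -> Result<SystemCaps, &'static str> {
    if no_cap_ambient_raise {
        return Err("EPERM"); // securebit forbids adding ambient caps
    }
    if permitted & cap == 0 || inheritable & cap == 0 {
        return Err("EPERM"); // must hold cap in BOTH permitted and inheritable
    }
    Ok(ambient | cap)
}
```

Since every raise is checked against permitted and inheritable, and dropping a cap from either set also drops it from ambient (Section 8.8.1 invariant), no separate re-validation of the ambient set is ever needed.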

prctl_capbset_drop(task, cap) -> Result<()>:
  1. Check: ns_capable(task.cred.user_ns, CAP_SETPCAP)
  2. new_cred = prepare_creds(task)
  3. new_cred.cap_bounding &= !cap
  4. // Maintain ambient invariant: ambient must be subset of
  //  both permitted and inheritable. Bounding doesn't directly
  //  constrain ambient, but dropping from bounding prevents
  //  future file-cap-based elevation.
  5. commit_creds(task, new_cred)

Cross-references:

  - Section 8.1.1 (08-security.md): Capability-based foundation (UmkaOS native model)
  - Section 8.1.3 (08-security.md): SystemCaps bitflags definition
  - Section 7.1.1 (07-process.md): Task and Process structs
  - Section 7.1.3 (07-process.md): execve() ELF loading sequence
  - Section 18.1: Syscall dispatch (where capable() is called)
  - Section 8.7 (08-security.md): LSM framework (security blobs, hook callouts)
  - Section 16.1.6 (16-containers.md): compute_effective_caps() and namespace hierarchy

8.8.8 Credential Translation

When a Linux binary executes, umka-compat constructs a Capability Set based on the file's ownership and the process's current credentials:

  1. Authentication: The process presents its UID/GID.
  2. Translation: umka-compat translates the namespace-local UID/GID to the parent namespace via UserNamespace::uid_map / gid_map. This is a lock-free read (the maps are frozen after the single write to /proc/PID/uid_map). For the common case of a single mapping entry (e.g., container UID 0-65535 → host UID 100000-165535), translation is a single range check + addition. If is_identity is set, the UID is returned unchanged (zero-cost fast path).
  3. Capability Grant: If the UID is 0 (root) in the initial namespace, the process is granted a wide set of administrative capabilities (e.g., CAP_SYS_ADMIN equivalent).
  4. File Access: When accessing a file, umka-compat checks the file's VFS mode bits against the process's UID/GID. If access is allowed, a native UmkaOS Capability<VfsNode> is generated and returned as a file descriptor.
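The single-range fast path in step 2 is a bounds check plus an offset. A sketch with an assumed UidMapEntry layout mirroring one uid_map line ("inside_start outside_start count"); the real UserNamespace map may hold several entries:

```rust
/// One uid_map line: "inside_start outside_start count".
struct UidMapEntry {
    inside_start: u32,
    outside_start: u32,
    count: u32,
}

/// Translate a namespace-local UID to the parent namespace.
/// For the common one-entry case this is a single range check + addition.
fn translate_uid(map: &[UidMapEntry], uid: u32) -> Option<u32> {
    for e in map {
        if uid >= e.inside_start && uid - e.inside_start < e.count {
            return Some(e.outside_start + (uid - e.inside_start));
        }
    }
    None // unmapped UID: no identity in the parent namespace
}
```

With the document's example mapping (container UID 0-65535 onto host UID 100000-165535), container root translates to host UID 100000, which is why step 3's capability grant is scoped to the initial namespace rather than keyed on the raw UID 0.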

8.8.9 Bounding Sets and Dropping Privileges

Linux capabilities (e.g., CAP_NET_BIND_SERVICE) map 1:1 to specific UmkaOS capability flags. When a process calls capset() to drop privileges, umka-compat permanently revokes the corresponding UmkaOS capabilities from the process's Capability Domain. Because UmkaOS capabilities are unforgeable, a dropped privilege can never be regained, ensuring strict containment.