
Chapter 9: Security Architecture

Capabilities, credentials, LSM framework, verified boot, TPM, IMA, post-quantum cryptography, confidential computing


Security is capability-based at the foundation: every kernel object is accessed through unforgeable capability handles with a 16-level delegation hierarchy and O(1) revocation. Capabilities are network-portable for cluster-wide access control. Verified boot (UEFI Secure Boot + TPM measured boot) establishes a hardware root of trust. Post-quantum cryptography (hybrid Ed25519 + ML-DSA-65 for boot, ML-KEM-768 for key exchange) protects the kernel from the boot chain through cluster communication. The crypto API is INTERNAL_HASH-aware for correct PQC signature verification. The LSM framework supports SELinux and AppArmor as loadable policy modules. Confidential computing (Intel TDX, AMD SEV-SNP, ARM CCA) is first-class.

9.1 Capability-Based Foundation

UmkaOS uses three complementary permission systems:

| System | Scope | Hot Path? | Used By |
|--------|-------|-----------|---------|
| Capability (this section) | Fine-grained object permissions | No | Resource access control |
| SystemCaps (Section 9.2) | Linux-compatible POSIX caps | No | Syscall permission checks |
| CapValidationToken (Section 12.3) | Amortized KABI dispatch token | Yes | Driver cross-domain calls |

KABI dispatch checks PermissionBits on ValidatedCap, not SystemCaps (Section 9.2).

Security check ordering (complete chain for any resource access):

| Step | Check | Fails with | Defined in |
|------|-------|------------|------------|
| 1 | DAC: file permissions / owner check | EACCES | VFS layer (Section 14.1) |
| 2 | Capability bits: `cap_effective` contains required cap? | EPERM | Section 9.2 |
| 3 | Namespace hierarchy: `is_same_or_ancestor(cred.user_ns, target_ns)`? | EPERM | Section 9.9 |
| 4 | LSM hooks: `security_capable()` — SELinux/AppArmor/Landlock (priority-ordered, AND-logic: all must allow) | EACCES | Section 9.8 |
| 5 | KABI: `validate_cap()` — generation, expiry, type, permission bits, then LSM `lsm_check_cap_validate()` | EACCES | Section 12.3 |
| 6 | Dispatch: CapValidationToken generation re-checked every call (~12-25ns) | EACCES (via `KabiError::CapRevoked`) | Section 12.3 |

LSM hooks always run after capability checks (not before). First LSM denial short-circuits. Capability revocation takes effect at the next KABI dispatch (step 6 generation check).
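The ordering above can be sketched as a single fall-through chain. This is an illustrative model, not the kernel's actual code: `AccessRequest`, `Denial`, and `evaluate_chain` are hypothetical names, and each boolean stands in for the real check at that step.

```rust
/// Illustrative denial codes matching the "Fails with" column.
#[derive(Debug, PartialEq)]
pub enum Denial {
    Eacces,
    Eperm,
}

/// One flag per step of the six-step chain (hypothetical; each flag
/// stands in for the real check at that step).
#[derive(Clone, Copy)]
pub struct AccessRequest {
    pub dac_ok: bool,              // step 1: DAC file permissions
    pub has_cap_bit: bool,         // step 2: cap_effective contains required cap
    pub ns_is_ancestor: bool,      // step 3: namespace hierarchy
    pub lsm_all_allow: bool,       // step 4: AND-logic across all LSMs
    pub kabi_cap_valid: bool,      // step 5: validate_cap()
    pub token_generation_ok: bool, // step 6: dispatch generation re-check
}

/// Evaluate the chain in order; the first failing step short-circuits
/// with its documented error, mirroring the table above.
pub fn evaluate_chain(r: AccessRequest) -> Result<(), Denial> {
    if !r.dac_ok { return Err(Denial::Eacces); }
    if !r.has_cap_bit { return Err(Denial::Eperm); }
    if !r.ns_is_ancestor { return Err(Denial::Eperm); }
    if !r.lsm_all_allow { return Err(Denial::Eacces); }
    if !r.kabi_cap_valid { return Err(Denial::Eacces); }
    if !r.token_generation_ok { return Err(Denial::Eacces); }
    Ok(())
}
```

Note that the error code depends on which step fails: a missing capability bit (step 2) surfaces as EPERM, while an LSM or KABI denial (steps 4-6) surfaces as EACCES.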

Architecture-dependent enforcement note: On architectures without hardware Tier 1 isolation (RISC-V, s390x, LoongArch64), Tier 1 drivers are promoted to Tier 0. In this configuration, capability validation (steps 5-6 of the security check chain) is the SOLE enforcement boundary — there is no hardware memory domain to contain a driver that bypasses validation. Every KABI vtable dispatch still validates capabilities via the T0-validated transport (Section 11.2). This is the critical security difference between promoted-T0 and native-T0: native Tier 0 code (APIC, timer) is statically verified and skips capability checks; promoted-T0 drivers are dynamically loaded and MUST pass capability validation on every dispatch.

Applicability to Tier 1 drivers: All six steps of the security check chain apply identically to Tier 1 drivers. Tier 1 runs in Ring 0 with hardware memory domain isolation (MPK/POE/DACR), but the Ring 0 execution context does not bypass capability validation. The KABI dispatch trampoline (Section 2.21) sits in Nucleus and executes kabi_dispatch_with_vcap() (Section 12.3) for every cross-domain call — Tier 1 drivers invoke kernel services through the same KABI vtable interface as Tier 2 drivers, with the same CapValidationToken generation check (Step 6, ~12-25ns), the same PermissionBits validation (Step 5), and the same SystemCaps capability-bits check (Step 2). The only difference is the domain-crossing mechanism (hardware register write for Tier 1 vs. address-space switch for Tier 2); the capability validation path is shared.

A Tier 1 driver that fails any of the six steps receives the same error (EACCES or EPERM) as a Tier 2 driver would. Capability revocation is reported as EACCES to userspace (mapped from KabiError::CapRevoked).

9.1.1 Capability Token Model

Every resource in UmkaOS is accessed through unforgeable capability tokens. This is the native security model — not a bolt-on.

pub struct Capability {
    /// Unique identifier of the target object
    pub object_id: ObjectId,

    /// Bitfield of permitted operations
    pub permissions: PermissionBits,

    /// Monotonically increasing generation counter for revocation.
    /// When an object's generation advances, all capabilities with
    /// older generations become invalid. u64 overflow policy:
    /// [Section 12.3](12-kabi.md#kabi-bilateral-capability-exchange--generation-counter-wrap-policy) documents the
    /// canonical analysis — 584 years at 1 billion increments/sec.
    pub generation: u64,

    /// Additional constraints on this capability
    pub constraints: CapConstraints,

    /// Identifies the user namespace scope for this capability.
    /// Required for distributed delegation: when a capability is delegated
    /// across nodes, the receiving node must know which user namespace
    /// governs the capability's permission checks (e.g., `ns_capable()`).
    ///
    /// Set to `current_task().cred.user_ns.ns_id` at capability creation
    /// time (`cap_create()`). This is a `Relaxed` load from the immutable
    /// `user_ns` field of the task's RCU-protected credential — the
    /// namespace reference does not change after credential installation.
    /// Immutable thereafter — the capability records its originating
    /// namespace regardless of subsequent holder namespace changes.
    /// Value 0 identifies the init user namespace (root namespace).
    /// All pre-namespace capabilities and boot-time capabilities have
    /// `user_ns_id = 0`. Value 0 does NOT mean "unrestricted" — it means
    /// "scoped to init_user_ns."
    ///
    /// **Namespace interaction**: `capabilities_valid_for()` in
    /// [Section 17.1](17-containers.md#namespace-architecture) uses this field during `setns()` and
    /// `clone()` to strip namespace-inappropriate capabilities.
    ///
    /// **Fork/clone behavior**: Child processes share the same `CapEntry`
    /// objects via `Arc` (reference-counted). The `user_ns_id` in each
    /// `CapEntry` is immutable and reflects the namespace where the
    /// capability was originally created — not the child's current
    /// namespace.
    ///
    /// **`unshare(CLONE_NEWUSER)` behavior**: Existing capabilities retain
    /// their original `user_ns_id`. New capabilities created after
    /// `unshare` use the new user namespace's `ns_id`. This is correct —
    /// a capability's namespace scope is determined at creation time, not
    /// at use time.
    pub user_ns_id: u64,

    /// Delegation depth: 0 = root capability from kernel, max 16.
    /// Incremented each time this capability is delegated to another context.
    /// `cap_delegate()` rejects the call with `CapError::DelegationDepthExceeded`
    /// when this field equals `CAP_MAX_DELEGATION_DEPTH` (16).
    ///
    /// Invariant: `delegation_depth <= CAP_MAX_DELEGATION_DEPTH` (enforced
    /// by `cap_delegate()` at runtime). The u8 range [17, 255] is unreachable.
    pub(crate) delegation_depth: u8,
    // Delegation children list NOT stored here — see CapEntry below.
    // The Capability struct is the immutable validation token; it is
    // fixed-layout. The `CpuMask` field may contain a pointer to
    // pool-backed storage with kernel lifetime (see CpuMask). The mutable delegation
    // list (children derived via cap_delegate) lives in CapEntry, which is
    // stored in the CapTable and accessed only in syscall context.
}
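The generation counter above is what makes revocation O(1): advancing the object's counter invalidates every outstanding capability at once, with no walk over holders. A minimal sketch of that comparison, assuming a hypothetical `ObjectState` on the object side (the names here are illustrative, not the kernel's actual API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Object-side generation counter. Each issued Capability records the
/// counter's value at creation time in its `generation` field.
pub struct ObjectState {
    generation: AtomicU64,
}

impl ObjectState {
    pub fn new() -> Self {
        Self { generation: AtomicU64::new(0) }
    }

    /// Revoke all outstanding capabilities in O(1): advance the counter.
    /// No per-holder bookkeeping is touched; stale tokens simply stop
    /// validating.
    pub fn revoke_all(&self) {
        self.generation.fetch_add(1, Ordering::Release);
    }

    /// A capability token is valid only if the generation it recorded at
    /// creation still matches the object's current generation.
    pub fn is_current(&self, cap_generation: u64) -> bool {
        cap_generation == self.generation.load(Ordering::Acquire)
    }
}
```

Capabilities issued after a `revoke_all()` record the new generation and validate normally; only tokens minted against an older generation fail.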

/// Per-CapId storage in the CapTable. Every issued capability has one CapEntry.
/// The `cap` field is the immutable validation token; `children` is the mutable
/// delegation audit list. The two are separated so that hot-path validation
/// (which only reads `cap`) never touches the `children` Mutex.
pub struct CapEntry {
    /// The CapId under which this entry is stored in the CapTable.
    /// Immutable after creation. Used by `CapOperationGuard::try_new()`
    /// to reference the entry's identity.
    pub id: CapId,

    /// Immutable capability token. Fixed-size; no heap allocation. Copied into
    /// `ValidatedCap` tokens on first validation (see Section 9.1.2).
    pub cap: Capability,

    /// Active in-flight operation counter for revocation fencing.
    /// Bit 63 is the REVOKED_FLAG; bits 0-62 count outstanding
    /// `CapOperationGuard` instances. See `drain()` and `CapOperationGuard`.
    ///
    /// **Width rationale**: u64 is required because 256 cores × 1M ops/core/sec
    /// saturates a 31-bit counter in ~8.4 seconds. With 63 counter bits, overflow
    /// requires over 1,100 years at 256M ops/sec — safely beyond the 50-year
    /// uptime target.
    pub active_ops: AtomicU64,

    /// Child capabilities derived from this one via `cap_delegate()`.
    /// Accessed only during `cap_delegate()` and `cap_revoke()` — never in
    /// the hot validation path. Bounded at compile time by `CAP_MAX_DELEGATIONS`
    /// (256); `cap_delegate()` returns `CapError::DelegationLimitReached` if
    /// the limit is reached. `ArrayVec` avoids heap allocation; `SpinLock`
    /// is used instead of `Mutex` because no sleeping is needed in this path.
    pub children: SpinLock<ArrayVec<DelegationRecord, CAP_MAX_DELEGATIONS>>,

    /// WaitQueue for `drain()` Phase 2. After setting the REVOKED_FLAG,
    /// `drain()` spin-polls `active_ops` briefly, then waits on this
    /// WaitQueue. The last `CapOperationGuard::Drop` that decrements
    /// `active_ops` to `REVOKED_FLAG | 0` calls
    /// `self.revocation_waitq.wake_one()`.
    pub revocation_waitq: WaitQueue,
}

/// Wake the revocation waiter for a capability entry.
/// Called from `CapOperationGuard::Drop` when `active_ops` reaches
/// `REVOKED_FLAG | 0` (all in-flight operations complete, revocation pending).
fn wake_revocation_waiter(cap_id: CapId) {
    if let Some(entry) = cap_table_lookup(cap_id) {
        entry.revocation_waitq.wake_one();
    }
}
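The `active_ops` protocol described above (bit 63 as REVOKED_FLAG, bits 0-62 as the in-flight count) can be sketched as bare atomic operations. In the real design this lives inside `CapOperationGuard` (RAII) and `drain()`; the free functions `enter_op`, `exit_op`, and `begin_drain` here are illustrative names for the three transitions:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const REVOKED_FLAG: u64 = 1 << 63;

/// Guard construction: increment the in-flight count, but refuse if a
/// revocation is already pending (REVOKED_FLAG set). CAS loop so the
/// check and increment are one atomic step.
pub fn enter_op(active_ops: &AtomicU64) -> bool {
    let mut cur = active_ops.load(Ordering::Acquire);
    loop {
        if cur & REVOKED_FLAG != 0 {
            return false; // revocation pending; no new operations admitted
        }
        match active_ops.compare_exchange_weak(
            cur, cur + 1, Ordering::AcqRel, Ordering::Acquire)
        {
            Ok(_) => return true,
            Err(actual) => cur = actual,
        }
    }
}

/// Guard drop: decrement the count. Returns true when this was the last
/// in-flight operation of a pending revocation — the point at which the
/// real Drop impl would call wake_revocation_waiter().
pub fn exit_op(active_ops: &AtomicU64) -> bool {
    let prev = active_ops.fetch_sub(1, Ordering::AcqRel);
    prev == (REVOKED_FLAG | 1)
}

/// drain() Phase 1: set REVOKED_FLAG so no new guards can be created.
/// Returns the number of operations still in flight that the drainer
/// must wait out (spin-poll, then WaitQueue).
pub fn begin_drain(active_ops: &AtomicU64) -> u64 {
    let prev = active_ops.fetch_or(REVOKED_FLAG, Ordering::AcqRel);
    prev & !REVOKED_FLAG
}
```

The key property: once `begin_drain` sets the flag, `enter_op` can never succeed again, so the in-flight count only falls, and the drainer needs to wait for exactly the count `begin_drain` returned.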

/// Identifies an isolation domain that can hold capabilities.
///
/// This is an alias for `DriverDomainId` ([Section 12.3](12-kabi.md#kabi-bilateral-capability-exchange)),
/// which uses a `u64` opaque identifier assigned by the KABI domain registry.
/// Process-context domains use `pid` as the domain ID; Tier 1/2 driver domains
/// receive a unique ID from the domain allocator at driver load time.
pub type DomainId = DriverDomainId;

/// Records a single delegation of a capability to another domain.
/// Stored in the parent capability's delegation list so that revocation
/// can propagate to all delegate holders.
pub struct DelegationRecord {
    /// The delegated capability ID (child capability).
    pub delegated_cap: CapId,
    /// The domain that received the delegation.
    pub target_domain: DomainId,
    /// TSC timestamp of the delegation.
    pub timestamp_tsc: u64,
    /// Rights granted to the delegate (subset of parent rights).
    pub granted_rights: Rights,
}
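Delegation is bounded in two independent ways: per-capability depth (`delegation_depth`, max 16) and per-entry fan-out (`children`, max 256). A small sketch of the two checks `cap_delegate()` performs, using the error variants named above; the free function `check_delegation` is illustrative, not the kernel's actual signature:

```rust
pub const CAP_MAX_DELEGATION_DEPTH: u8 = 16;
pub const CAP_MAX_DELEGATIONS: usize = 256;

#[derive(Debug, PartialEq)]
pub enum CapError {
    DelegationDepthExceeded,
    DelegationLimitReached,
}

/// Validate a delegation request against both bounds. On success,
/// returns the child capability's delegation_depth (parent + 1).
pub fn check_delegation(
    parent_depth: u8,
    parent_child_count: usize,
) -> Result<u8, CapError> {
    // Depth bound: a capability at depth 16 cannot delegate further.
    if parent_depth >= CAP_MAX_DELEGATION_DEPTH {
        return Err(CapError::DelegationDepthExceeded);
    }
    // Fan-out bound: the parent's children ArrayVec is full at 256.
    if parent_child_count >= CAP_MAX_DELEGATIONS {
        return Err(CapError::DelegationLimitReached);
    }
    Ok(parent_depth + 1)
}
```

Note the two limits fail differently: depth exhaustion is a property of the delegation chain, fan-out exhaustion is a property of one parent entry, and a caller can recover from the latter by revoking unused children.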

pub struct CapConstraints {
    /// Capability expires after this timestamp (0 = no expiry)
    pub expires_at: u64,

    /// Can this capability be delegated to other processes?
    /// Set at delegation time and immutable thereafter. Policy evolution
    /// affects NEW delegations, not existing ones — changing delegatability
    /// retroactively would break existing delegation trees.
    pub delegatable: bool,

    /// Can this capability be revoked by the issuer or kernel?
    /// When true, the kernel may revoke this capability at any time
    /// (e.g., on credential change, resource teardown, or explicit
    /// `cap_revoke()` by a holder with ADMIN rights on the parent).
    /// Immutable after creation — changing revocability retroactively
    /// would break security invariants (a non-revocable grant becomes
    /// revocable, violating the original contract).
    pub revocable: bool,

    /// If true, revocation of this capability is time-critical
    /// (e.g., security credential revocation). The revocation protocol
    /// uses an expedited path with shorter grace periods.
    pub urgent_revoke: bool,

    /// Maximum delegation depth (0 = no further delegation, max 16).
    /// u8 matches `Capability::delegation_depth` and `CAP_MAX_DELEGATION_DEPTH`.
    /// Immutable after creation — same rationale as delegatable/revocable.
    /// `cap_create()` clamps this to `min(value, CAP_MAX_DELEGATION_DEPTH)`
    /// at creation time; values > 16 are silently reduced to the system limit.
    pub max_delegation_depth: u8,

    /// Restrict to specific CPU set. Uses the same `CpuMask` type as the
    /// scheduler (Section 7.1) and NUMA topology (Section 4.9).
    /// CpuMask is safe to duplicate (Copy) in this context because capability
    /// constraints are immutable after creation — all access is read-only.
    /// The pool-backed variant's memory has kernel lifetime (never freed).
    /// Wire serialization excludes CpuMask; see CapConstraintsWire which uses
    /// NodeAffinityHint instead. Both Inline and Pool variants are 24 bytes
    /// (enum sizing).
    ///
    /// # Safety (Send/Sync)
    ///
    /// `CpuMask` contains a raw pointer in its `Pool` variant. The following
    /// unsafe impls are sound because:
    /// ```rust
    /// // INVARIANT: Pool-backed CpuMask memory is allocated from a global arena
    /// // with kernel lifetime (never freed). This makes CpuMask safely Copy even
    /// // though it contains a raw pointer. All CpuMask instances created via
    /// // CpuMaskPool outlive all Capability instances that reference them.
    /// unsafe impl Send for CpuMask {}
    /// unsafe impl Sync for CpuMask {}
    /// ```
    pub cpu_affinity_mask: CpuMask,
}
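Two of the constraint rules above are easy to get wrong and worth pinning down: `expires_at == 0` means "no expiry" (not "expired at epoch"), and `cap_create()` clamps `max_delegation_depth` rather than rejecting oversized values. A minimal sketch of both, with illustrative function names; the exact expiry boundary (inclusive vs. exclusive) is an assumption here:

```rust
pub const CAP_MAX_DELEGATION_DEPTH: u8 = 16;

/// Expiry check. Zero is the sentinel for "never expires"; the >= boundary
/// (expired exactly at the deadline) is assumed for illustration.
pub fn is_expired(expires_at: u64, now: u64) -> bool {
    expires_at != 0 && now >= expires_at
}

/// Creation-time clamp: values above the system limit are silently
/// reduced to 16, as documented for cap_create().
pub fn clamp_max_delegation_depth(requested: u8) -> u8 {
    requested.min(CAP_MAX_DELEGATION_DEPTH)
}
```

The clamp (rather than an error) keeps `cap_create()` infallible with respect to this field: a caller asking for "as deep as possible" simply gets the system limit.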

/// CPU bitmask type. Sized at boot to fit the actual CPU count discovered from
/// firmware (MADT/DTB). Stored as a fixed-size array of u64 words allocated once
/// in a static per-subsystem arena. A default-constructed (all-zeros) mask means
/// "any CPU" (no affinity constraint).
///
/// The word count is `(nr_cpus + 63) / 64`, making `CpuMask` a compile-time-
/// unknown but boot-time-fixed size. All CpuMask instances in the same kernel
/// boot have the same word count. The backing storage is dynamically sized:
/// at boot, the kernel discovers the actual CPU count and allocates a
/// `CpuMaskPool` with the correct word count. Subsystems that store CpuMask
/// by value use a thin wrapper around a pool-allocated bitmask. For
/// frequently-embedded cases (e.g., `Capability.cpu_affinity_mask`), an
/// inline representation is available when the CPU count fits in 2 words
/// (≤128 CPUs) for const-initialization convenience; larger systems use the
/// pool-backed variant to avoid placing large bitmasks on the stack.
pub struct CpuMask {
    /// Bitmask storage. Either inline (for small CPU counts) or a pointer
    /// to pool-allocated storage.
    storage: CpuMaskStorage,
}

impl CpuMask {
    /// Return an empty mask (all bits clear).
    ///
    /// Uses the inline storage variant with `active_words = 2`, which covers
    /// up to 128 CPUs. For systems with more than 128 CPUs, callers must use
    /// `CpuMaskPool` to allocate a pool-backed instance of the correct size.
    /// This method is `const` so it can initialise `static` fields at
    /// compile time.
    ///
    /// **Truncation guard**: On systems with > 128 CPUs, using the inline
    /// variant would silently discard bits for CPUs 128+. To prevent this:
    /// - `CpuMask::empty()` is safe on all systems (all-zero is "any CPU").
    /// - `CpuMask::full(nr_cpus)` panics if `nr_cpus > 128` and the inline
    ///   variant is used (see below).
    /// - All `set(cpu)` and `clear(cpu)` calls check
    ///   `cpu < active_words * 64` and return `false` if the bit index
    ///   exceeds the mask capacity — never silent truncation. `test(cpu)`
    ///   likewise returns `false` for out-of-range indices.
    /// - At boot, `cpu_features_freeze()` asserts that the system-wide
    ///   `CpuMask` word count (`(nr_cpus + 63) / 64`) is compatible with the
    ///   inline variant; if not, all subsystems must use pool-backed masks and
    ///   the kernel logs `KERN_INFO: CpuMask using pool-backed storage (nr_cpus=N)`.
    pub const fn empty() -> Self {
        CpuMask {
            storage: CpuMaskStorage::Inline { bits: [0u64; 2], active_words: 2 },
        }
    }

    /// Return a mask with all bits set for CPUs 0 .. `nr_cpus` (exclusive).
    ///
    /// Bits for non-existent CPUs beyond `nr_cpus` are left clear. The last
    /// partial word is filled using a shift mask so that only the exact
    /// `nr_cpus % 64` low-order bits are set.
    ///
    /// **Panics** (unconditionally, all builds) if `nr_cpus > 128` and the
    /// inline variant is used. Callers targeting large systems must use
    /// `CpuMaskPool::alloc_full(nr_cpus)` to get a pool-backed mask.
    /// The panic is unconditional (`assert!`, not `debug_assert!`) because
    /// silent truncation on >128 CPU production systems would discard CPUs
    /// 128+ from affinity masks — a correctness violation, not a debug aid.
    pub fn full(nr_cpus: u32) -> Self {
        assert!(
            nr_cpus <= 128,
            "CpuMask::full() inline variant cannot represent {} CPUs (max 128); \
             use CpuMaskPool::alloc_full() for large systems",
            nr_cpus
        );
        let mut mask = Self::empty();
        let full_words = (nr_cpus / 64) as usize;
        let remainder  = nr_cpus % 64;
        let bits = match &mut mask.storage {
            CpuMaskStorage::Inline { bits, .. } => bits,
            CpuMaskStorage::Pool   { .. }       => unreachable!(),
        };
        for i in 0..full_words {
            bits[i] = u64::MAX;
        }
        if remainder > 0 {
            bits[full_words] = (1u64 << remainder) - 1;
        }
        mask
    }

    /// Set the bit for CPU `cpu` (mark it as present in the mask).
    ///
    /// `word = cpu / 64`, `bit = cpu % 64`. Returns `false` if `cpu` is
    /// beyond the active word count (out-of-bounds); returns `true` on success.
    /// Callers that must detect out-of-bounds (e.g., `sched_setaffinity`
    /// validating userspace input) check the return value. Callers during
    /// bringup can ignore it — topology may grow before the mask is used.
    pub fn set(&mut self, cpu: u32) -> bool {
        let word = (cpu / 64) as usize;
        let bit  = cpu % 64;
        match &mut self.storage {
            CpuMaskStorage::Inline { bits, active_words } => {
                if word < *active_words as usize {
                    bits[word] |= 1u64 << bit;
                    return true;
                }
            }
            CpuMaskStorage::Pool { bits, word_count } => {
                if word < *word_count as usize {
                    // SAFETY: `word < word_count` and pool memory is valid for
                    // the kernel lifetime; pool is exclusively owned through
                    // the CpuMask wrapper.
                    unsafe { *bits.add(word) |= 1u64 << bit; }
                    return true;
                }
            }
        }
        false // out-of-bounds
    }

    /// Clear the bit for CPU `cpu` (remove it from the mask).
    ///
    /// `word = cpu / 64`, `bit = cpu % 64`. Returns `false` if `cpu` is
    /// beyond the active word count (out-of-bounds); returns `true` on success.
    /// Same policy as `set()` — both report out-of-bounds via return value.
    pub fn clear(&mut self, cpu: u32) -> bool {
        let word = (cpu / 64) as usize;
        let bit  = cpu % 64;
        match &mut self.storage {
            CpuMaskStorage::Inline { bits, active_words } => {
                if word < *active_words as usize {
                    bits[word] &= !(1u64 << bit);
                    return true;
                }
            }
            CpuMaskStorage::Pool { bits, word_count } => {
                if word < *word_count as usize {
                    // SAFETY: `word < word_count` and pool memory is valid for
                    // the kernel lifetime; pool is exclusively owned through
                    // the CpuMask wrapper.
                    unsafe { *bits.add(word) &= !(1u64 << bit); }
                    return true;
                }
            }
        }
        false // out-of-bounds
    }

    /// Return `true` if CPU `cpu` is set in the mask.
    ///
    /// `word = cpu / 64`, `bit = cpu % 64`. Returns `false` for any CPU
    /// whose index falls outside the active word range.
    pub fn test(&self, cpu: u32) -> bool {
        let word = (cpu / 64) as usize;
        let bit  = cpu % 64;
        match &self.storage {
            CpuMaskStorage::Inline { bits, active_words } => {
                word < *active_words as usize && (bits[word] >> bit) & 1 == 1
            }
            CpuMaskStorage::Pool { bits, word_count } => {
                if word >= *word_count as usize { return false; }
                // SAFETY: `word < word_count` and pool memory is valid for
                // the kernel lifetime; accessed read-only here.
                (unsafe { *bits.add(word) } >> bit) & 1 == 1
            }
        }
    }

    /// Return the index of the lowest-numbered set CPU, or `None` if empty.
    ///
    /// Scans words from lowest to highest. Within each word uses
    /// `u64::trailing_zeros()` (a single BSF/CTZ instruction on all
    /// supported architectures). Returns `None` when all words are zero.
    pub fn first_set(&self) -> Option<u32> {
        self.next_set(0)
    }

    /// Return the index of the first set CPU with index `>= start`, or `None`.
    ///
    /// `word = start / 64`, `bit = start % 64`. The first word is masked to
    /// ignore bits below `start % 64`. Subsequent words are searched in full.
    /// Returns `None` if no set bit is found at or after `start`.
    pub fn next_set(&self, start: u32) -> Option<u32> {
        let start_word = (start / 64) as usize;
        let start_bit  = start % 64;
        let (words, word_count) = self.words_and_count();
        for w in start_word..word_count {
            let mut word = words[w];
            if w == start_word {
                // Mask off bits below `start_bit` so we do not return a CPU
                // that is before `start`.
                word &= u64::MAX.wrapping_shl(start_bit);
            }
            if word != 0 {
                let bit = word.trailing_zeros();
                return Some((w as u32) * 64 + bit);
            }
        }
        None
    }

    /// Return the number of CPUs set in the mask (popcount).
    ///
    /// Sums `u64::count_ones()` over all active words. On x86-64 with
    /// `popcnt` (universally available) this compiles to a single POPCNT per
    /// word. On platforms without hardware popcount the compiler generates an
    /// efficient software equivalent.
    pub fn count(&self) -> u32 {
        let (words, word_count) = self.words_and_count();
        words[..word_count].iter().map(|w| w.count_ones()).sum()
    }

    /// Return `true` if no CPUs are set in the mask.
    ///
    /// Cheaper than `count() == 0` because it short-circuits on the first
    /// non-zero word rather than accumulating the full popcount.
    pub fn is_empty(&self) -> bool {
        let (words, word_count) = self.words_and_count();
        words[..word_count].iter().all(|&w| w == 0)
    }

    /// Return the bitwise complement of `self`, restricted to CPUs 0 .. `nr_cpus`.
    ///
    /// Bits for non-existent CPUs beyond `nr_cpus` are forced clear in the
    /// result even if they were set in `self`. This prevents the complement
    /// from appearing to contain phantom CPUs that do not exist in the
    /// topology.
    pub fn complement(&self, nr_cpus: u32) -> Self {
        let mut result = Self::full(nr_cpus);
        let (src_words, src_wc) = self.words_and_count();
        let (dst_words, dst_wc) = result.words_and_count_mut();
        let wc = src_wc.min(dst_wc);
        for i in 0..wc {
            dst_words[i] &= !src_words[i];
        }
        result
    }

    /// Return the bitwise union (`self | other`) of two masks.
    ///
    /// If the two masks have different word counts (e.g., during topology
    /// bringup before all masks are resized), the shorter mask is zero-extended
    /// so that bits present only in the longer mask are preserved in the result.
    ///
    /// On systems with >128 CPUs, the result is pool-backed to avoid silent
    /// truncation. The implementation uses the maximum word count of the two
    /// operands to size the result.
    pub fn union(&self, other: &Self) -> Self {
        let (a, a_wc) = self.words_and_count();
        let (b, b_wc) = other.words_and_count();
        let max_wc = a_wc.max(b_wc);
        // Allocate result large enough for the wider operand.
        // If both fit inline (<=128 CPUs), use inline. Otherwise pool-backed.
        let mut result = if max_wc <= 2 {
            Self::empty()
        } else {
            Self::pool_alloc(max_wc as u32)
        };
        let (r, r_wc) = result.words_and_count_mut();
        // r_wc >= max_wc by construction.
        let wc = max_wc.min(r_wc);
        for i in 0..wc {
            let va = if i < a_wc { a[i] } else { 0 };
            let vb = if i < b_wc { b[i] } else { 0 };
            r[i] = va | vb;
        }
        result
    }

    /// Return the bitwise intersection (`self & other`) of two masks.
    ///
    /// Only bits set in both operands appear in the result. If the masks have
    /// different word counts, the shorter extent is used (unrepresented bits
    /// are treated as zero in the shorter mask, so they do not appear in the
    /// intersection).
    ///
    /// On systems with >128 CPUs, the result is pool-backed when needed to
    /// avoid silent truncation. The result word count is `min(a_wc, b_wc)`
    /// (intersection cannot produce bits beyond the shorter operand).
    pub fn intersection(&self, other: &Self) -> Self {
        let (a, a_wc) = self.words_and_count();
        let (b, b_wc) = other.words_and_count();
        let min_wc = a_wc.min(b_wc);
        // Allocate result large enough for the shorter operand.
        // If both fit inline (<=128 CPUs), use inline. Otherwise pool-backed.
        let mut result = if min_wc <= 2 {
            Self::empty()
        } else {
            Self::pool_alloc(min_wc as u32)
        };
        let (r, r_wc) = result.words_and_count_mut();
        let wc = min_wc.min(r_wc);
        for i in 0..wc {
            r[i] = a[i] & b[i];
        }
        result
    }

    /// Iterate over the indices of all set CPUs in ascending order.
    ///
    /// Uses `next_set()` internally, advancing by one past each returned
    /// index to find the next. The iterator yields `u32` CPU indices.
    /// Yields no items for an empty mask.
    ///
    /// Example:
    /// ```
    /// let mut mask = CpuMask::empty();
    /// mask.set(0);
    /// mask.set(3);
    /// mask.set(65);
    /// assert_eq!(mask.iter().collect::<Vec<_>>(), vec![0, 3, 65]);
    /// ```
    pub fn iter(&self) -> impl Iterator<Item = u32> + '_ {
        let mut next = self.first_set();
        core::iter::from_fn(move || {
            let current = next?;
            next = self.next_set(current + 1);
            Some(current)
        })
    }

    // --- Private helpers -------------------------------------------------

    /// Return a shared slice of the backing words and the active word count.
    fn words_and_count(&self) -> (&[u64], usize) {
        match &self.storage {
            CpuMaskStorage::Inline { bits, active_words } => {
                (bits.as_slice(), *active_words as usize)
            }
            CpuMaskStorage::Pool { bits, word_count } => {
                // SAFETY: pool memory is valid for the kernel lifetime and
                // `word_count` accurately reflects the allocated length.
                let slice = unsafe {
                    core::slice::from_raw_parts(*bits, *word_count as usize)
                };
                (slice, *word_count as usize)
            }
        }
    }

    /// Return a mutable slice of the backing words and the active word count.
    fn words_and_count_mut(&mut self) -> (&mut [u64], usize) {
        match &mut self.storage {
            CpuMaskStorage::Inline { bits, active_words } => {
                let wc = *active_words as usize;
                (bits.as_mut_slice(), wc)
            }
            CpuMaskStorage::Pool { bits, word_count } => {
                let wc = *word_count as usize;
                // SAFETY: pool memory is valid for the kernel lifetime and
                // exclusive access is guaranteed by the `&mut self` borrow.
                let slice = unsafe {
                    core::slice::from_raw_parts_mut(*bits, wc)
                };
                (slice, wc)
            }
        }
    }
}

/// CpuMask storage: inline variant for const-initialization convenience, or
/// pool-backed for systems with >128 CPUs.
/// The variant is selected once at boot and fixed for the system lifetime.
///
/// Note: as a Rust enum, both variants occupy the same stack space (the union
/// reserves space for the largest variant). The inline variant does NOT save
/// memory relative to the pool variant — it exists for ergonomic
/// const-initialization without a pool allocation, not for memory savings.
/// For systems with >128 CPUs, the pool variant avoids placing the full bitmask
/// (potentially hundreds of bytes) on the stack; that is where pool-backing
/// provides a genuine space benefit.
enum CpuMaskStorage {
    /// Inline: up to 128 CPUs (2 × u64). Provided for const-initialization
    /// convenience; both enum variants occupy the same stack space.
    Inline { bits: [u64; 2], active_words: u32 },
    /// Pool: any CPU count. Points into a boot-allocated CpuMaskPool.
    /// The pool is contiguous memory, so cache behaviour is good.
    Pool { bits: *mut u64, word_count: u32 },
}

/// Boot-time global: number of u64 words needed for the discovered CPU count.
/// Set once during SMP bringup, read-only thereafter.
static CPU_MASK_WORDS: AtomicU32 = AtomicU32::new(0);

Permission Bits:

The PermissionBits type defines fine-grained access rights that can be granted on any capability. Different object types interpret these bits differently, but the base set is universal:

bitflags! {
    /// Fine-grained permission bits that can be set on any capability.
    /// These are orthogonal to SystemCaps (administrative capabilities) —
    /// PermissionBits control what operations a specific capability permits
    /// on its target object, while SystemCaps control what system-wide
    /// operations a process can perform.
    ///
    /// Width: u64 to match the wire encoding (`Le64` at offset 12 in the capability wire format)
    /// and to provide headroom for future permission types over a 50-year
    /// kernel lifetime. Currently bits 0-12 are defined; bits 13-63 are
    /// reserved for future use.
    pub struct PermissionBits: u64 {
        /// READ: Read access to the target object's data or state.
        /// For memory objects: read data.
        /// For files: read file contents.
        /// For processes: read registers, memory, status.
        const READ = 1 << 0;

        /// WRITE: Write access to the target object's data or state.
        /// For memory objects: write data.
        /// For files: write file contents, append.
        /// For processes: modify registers, memory.
        const WRITE = 1 << 1;

        /// EXECUTE: Execute access or control flow modification.
        /// For memory objects: execute code from this region.
        /// For files: execute as a program.
        /// For processes: single-step, continue execution.
        const EXECUTE = 1 << 2;

        /// DEBUG: Debugging access to the target object.
        /// For processes: attach via ptrace, inspect/modify state.
        /// For /proc entries: access to private (non-public) fields.
        /// Implies READ unless explicitly excluded. See "Capability-Gated ptrace" in debugging-and-process-inspection.
        const DEBUG = 1 << 3;

        /// SYSCALL_TRACE: Trace syscall entry/exit for a process.
        /// Used with PTRACE_SYSCALL. Requires DEBUG to also be set.
        /// See "Capability-Gated ptrace" in debugging-and-process-inspection.
        const SYSCALL_TRACE = 1 << 4;

        /// DELEGATE: Delegate this capability to another process.
        /// Subject to CapConstraints::delegatable and max_delegation_depth.
        const DELEGATE = 1 << 5;

        /// ADMIN: Administrative access to the target object.
        /// Object-specific meaning: for devices, may allow reset;
        /// for filesystems, may allow unmount; etc.
        const ADMIN = 1 << 6;

        /// MAP_READ: Map the object read-only into address space.
        /// Used for memory objects, file-backed mappings, DMA buffers.
        const MAP_READ = 1 << 7;

        /// MAP_WRITE: Map the object read-write into address space.
        const MAP_WRITE = 1 << 8;

        /// MAP_EXECUTE: Map the object with execute permission.
        const MAP_EXECUTE = 1 << 9;

        /// KERNEL_READ: Read kernel-side data associated with a userspace object.
        /// For processes: read kernel stack trace (/proc/pid/stack).
        /// Requires CAP_DEBUG on the target in addition to this bit.
        /// See "/proc/pid Interface" in debugging-and-process-inspection.
        const KERNEL_READ = 1 << 10;

        /// RDMA_REGISTER_MR: Register a memory region for RDMA access on the
        /// target device. Required for ibv_reg_mr() and kernel ULP MR
        /// registration. Without this bit, the device capability cannot be
        /// used to expose host memory to the RDMA NIC.
        const RDMA_REGISTER_MR = 1 << 11;

        /// RDMA_CREATE_QP: Create a Queue Pair on the target RDMA device.
        /// Required for ibv_create_qp() and kernel ULP QP creation.
        /// A QP is the fundamental communication endpoint for RDMA;
        /// controlling its creation gates all RDMA data transfer.
        const RDMA_CREATE_QP = 1 << 12;

        /// ALL: Union of all defined permission bits.
        /// Used by root_grant to synthesize capabilities with full access.
        const ALL = (1u64 << 13) - 1;  // bits 0..=12
    }
}

/// Type alias: `Rights` is the permission mask used in delegation and access
/// control. It is the same bitflags type as `PermissionBits` — the alias
/// exists for readability in delegation contexts (e.g., "granted rights"
/// vs "permission bits").
pub type Rights = PermissionBits;

Permission composition: Multiple bits can be combined. For example, a debugging capability on a process might have DEBUG | READ | WRITE | SYSCALL_TRACE to allow full ptrace-style debugging including syscall interception.
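The subset test behind permission composition reduces to a mask comparison. A minimal sketch, using raw u64 masks in place of the PermissionBits bitflags type (the bit positions match the definitions above):

```rust
// Illustrative only: raw u64 masks stand in for the PermissionBits type.
const READ: u64 = 1 << 0;
const WRITE: u64 = 1 << 1;
const DEBUG: u64 = 1 << 3;
const SYSCALL_TRACE: u64 = 1 << 4;

/// Does `granted` permit every bit in `required`?
fn permits(granted: u64, required: u64) -> bool {
    granted & required == required
}

/// A full ptrace-style debugger capability, as composed above.
const DEBUGGER: u64 = DEBUG | READ | WRITE | SYSCALL_TRACE;
```

With the real type, `PermissionBits::contains()` performs the same check.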

Capabilities are stored in kernel-managed capability tables, indexed by per-process capability handles. User space never sees raw capability data -- only opaque handles.

Capability Revocation Semantics:

UmkaOS supports two revocation mechanisms, chosen per-object-type:

  1. Generation-based (for bulk invalidation): Each object has its own monotonic generation counter (object.generation). Each capability records the generation of the object at the time the capability was created (cap.generation). Validation checks cap.generation == object.generation (exact match). Revoking all capabilities for an object increments object.generation; all existing capabilities now have a stale generation and fail validation. O(1), no table scanning.

Slot persistence: The generation counter lives in the kernel's object registry slot, not in the object itself. When an object is freed, its registry slot retains the last generation value. When a new object is allocated in the same slot, the slot's generation is incremented before the new object becomes visible. This prevents ABA vulnerabilities: an old capability with generation == N cannot validate against a new object in the same slot at generation == N+1. Object IDs are slot indices — they may be reused, but the generation makes each (ObjectId, generation) pair unique over time. See Section 9.1.1.3 for the full object registry data structure specification.

Trade-off: this is per-object all-or-nothing. Incrementing the generation for one PhysMemory object invalidates only capabilities pointing to that object — it does NOT affect capabilities for other PhysMemory objects (they have independent generation counters). If a single object needs fine-grained revocation (revoke one capability without affecting others for the same object), use indirection-based revocation instead.

  2. Indirection-based (for fine-grained control): Capabilities point to an indirection entry in a per-object revocation table. Revoking sets the entry to "revoked." Allows individual revocation without affecting other capabilities for the same object. Cost: one extra pointer dereference per validation (~2-3ns).

Synchronization: Indirection entries are RCU-protected. The validate path runs inside an RCU read-side critical section: it dereferences the indirection pointer, checks the "revoked" flag, and proceeds — all under rcu_read_lock(). The revoke path sets the entry to "revoked" (atomic store), then defers entry reclamation to an RCU grace period via rcu_call(). This closes the TOCTOU window: a thread that has already dereferenced the indirection entry and sees "not revoked" is guaranteed to be within an RCU read-side critical section, so the entry cannot be freed until that thread exits the critical section. The indirection entry itself is never freed while any reader may hold a reference — only after the RCU grace period completes.
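The validate/revoke split can be sketched as follows. This is illustrative only: the real entry is RCU-managed, readers dereference it under rcu_read_lock(), and reclamation is deferred via rcu_call(); none of that is modeled here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Illustrative indirection entry (simplified; not the kernel's type).
struct IndirectionEntry {
    revoked: AtomicBool,
}

impl IndirectionEntry {
    /// Validate path: one extra pointer dereference plus this flag load
    /// (~2-3ns), performed inside the RCU read-side critical section.
    fn is_valid(&self) -> bool {
        !self.revoked.load(Ordering::Acquire)
    }

    /// Revoke path: atomic store. Entry reclamation (not modeled here)
    /// waits for an RCU grace period, closing the TOCTOU window.
    fn revoke(&self) {
        self.revoked.store(true, Ordering::Release);
    }
}
```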

| Object type | Revocation model | Rationale |
|---|---|---|
| PhysMemory | Generation (per-object) | Bulk revocation on free — all caps for this region die |
| DeviceIo | Generation (per-object) | Device reset invalidates all handles for this device |
| FileDescriptor | Indirection | Individual fd close must not affect other fds to the same inode |
| IpcChannel | Indirection | Can revoke one endpoint independently |
| Process | Generation (per-object) | Process exit invalidates all handles to this process |

Validation rule: is_valid() in umka-core/src/cap/mod.rs must use exact-match: self.generation == object.generation. A capability is valid only when its generation matches the object's current generation exactly. Less-than-or-equal (<=) is incorrect because it would allow stale capabilities from earlier generations to pass validation — a capability from generation 3 must not be valid when the object is at generation 5.
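The exact-match rule and the slot-persistence behaviour combine as sketched below. `RegistrySlot` and its methods are illustrative names, not the kernel's actual registry types:

```rust
/// Illustrative registry slot: the generation survives object free and is
/// bumped on reallocation, so exact-match validation rejects both revoked
/// capabilities and stale (ABA) capabilities against a reused slot.
struct RegistrySlot {
    generation: u64,
    occupied: bool,
}

impl RegistrySlot {
    /// Free the object; the slot retains its last generation.
    fn free_object(&mut self) {
        self.occupied = false;
    }

    /// Allocate a new object in this slot: bump the generation BEFORE the
    /// object becomes visible, and return the generation new caps record.
    fn allocate_object(&mut self) -> u64 {
        self.generation += 1;
        self.occupied = true;
        self.generation
    }

    /// Exact match only: `<=` would wrongly accept stale generations.
    fn is_valid(&self, cap_generation: u64) -> bool {
        self.occupied && cap_generation == self.generation
    }
}
```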

9.1.1.1.1 Distributed Capability Revocation

Capabilities can be delegated across cluster nodes via the distributed capability protocol in Section 5.1. When a capability is revoked on the originating node, all copies on all remote nodes must be invalidated. This requires a distributed protocol with bounded latency to prevent stale capabilities from granting access after revocation.

Revocation mechanism — epoch-based generation increment:

Each capability carries a generation: u64 field (part of the CapHandle). The capability table on each node additionally stores the current valid generation per capability slot. Revocation works by incrementing the generation:

/// Errors from capability revocation.
/// Revocation is a best-effort broadcast — `RevokeError` indicates the
/// specific failure mode so callers can decide whether to retry or escalate.
pub type RevokeError = CapError;

fn revoke_capability(cap: CapHandle) -> Result<(), RevokeError> {
    let entry = rcu_read_lock(|| cap_table.get(cap.index))?;
    // 1. Increment generation in the local capability table.
    //    All local lookups now fail (generation mismatch).
    cap_table.increment_generation(cap.index);

    // 2. Check if this capability was ever delegated to remote nodes.
    //    Local-only capabilities (never delegated) complete immediately —
    //    the generation increment in step 1 is sufficient.
    let nodes = cap.delegation_set(); // nodes this cap was delegated to
    if nodes.is_empty() {
        return Ok(());  // Fast path: local-only cap, no broadcast needed
    }

    // 3. Select revocation tier based on capability criticality.
    //    See [Section 5.7](05-distributed.md#network-portable-capabilities) for the three-tier model.
    let tier = select_revocation_tier(&entry.cap.header(), &current_cred());
    let new_gen = cap_table.generation(cap.index);

    match tier {
        RevocationTier::RemoteUrgent => {
            // Tier 2: dedicated REVOKE_URGENT per affected node.
            for node in &nodes {
                cluster_send(node, ClusterMsg::RevokeUrgent {
                    index: cap.index,
                    new_generation: new_gen,
                });
            }
            // 4. Wait for ACK from all nodes (bounded by REVOKE_TIMEOUT_MS).
            //    Nodes that do not ACK within the timeout are considered failed
            //    and are fenced (removed from the cluster).
            wait_for_acks(&nodes, REVOKE_TIMEOUT_MS)?;
        }
        RevocationTier::RemoteBatched => {
            // Tier 3: piggyback on next heartbeat — no immediate send.
            revocation_log_append(cap.index, new_gen, &nodes);
        }
        RevocationTier::LocalSync => {
            // Should not reach here (local-only caps returned in step 2).
            unreachable!("local-sync tier with non-empty delegation set");
        }
    }
    Ok(())
}

Latency bounds: Revocation completes within 2 * network_RTT + processing_time. For local cluster (PCIe P2P or 100GbE RDMA), RTT is approximately 1-10us, so revocation completes in < 100us. For wide-area clusters with RTT up to 10ms, revocation takes < 50ms. This is the hard latency bound — processes holding stale capabilities see rejection errors after this window.
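The bound is simple arithmetic; a sketch with assumed processing-time figures (the 50us/5ms processing values below are illustrative, not specified):

```rust
/// Illustrative computation of the revocation latency bound:
/// 2 * network_RTT + processing_time.
fn revocation_bound_us(rtt_us: u64, processing_us: u64) -> u64 {
    2 * rtt_us + processing_us
}
```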

Grace period for in-flight operations: Operations already in progress when revocation is issued complete normally (they entered the kernel before generation increment). New operations using the old generation fail immediately. This is equivalent to RCU's read-side grace period: existing readers complete, new readers see the new generation.

Revocation of delegated capabilities — two-phase breadth-first protocol:

Capability revocation traverses the delegation tree without recursive spinlock acquisition. The protocol uses two phases to decouple the latency-critical "stop new operations" step from the "propagate to children" step:

  • Phase 1 (lock-free, immediate): cap_revoke() sets REVOKED_FLAG on the target capability's active_ops: AtomicU64 via a single fetch_or(REVOKED_FLAG, Release). This is lock-free and completes in ~1-5 cycles. From this instant, all new CapOperationGuard::try_new() calls on the target capability fail with CapError::Revoked, and the KABI dispatch trampoline rejects new dispatches (see the trampoline check below). No spinlock is acquired.

  • Phase 2 (workqueue, breadth-first): cap_revoke() enqueues a CapRevocationWork item to the per-CPU workqueue (Section 3.11). The work item:

    • Acquires the target CapEntry's children spinlock.
    • Iterates the children list (at most CAP_MAX_DELEGATIONS = 256 entries).
    • For each child: sets REVOKED_FLAG on the child's active_ops (lock-free fetch_or), and if the child itself has children, enqueues a new CapRevocationWork for the child to the workqueue. The child's children spinlock is not acquired in this iteration — it is deferred to the child's own workqueue item.
    • Releases the target's children spinlock.
    • Sends CapRevoke IPC to each target_domain listed in the children entries. Domains receiving the revocation must call cap_drop(delegated_cap) within REVOKE_TIMEOUT_MS (1000ms default).

The traversal is breadth-first: each workqueue pass processes one level of the delegation tree, then enqueues the next level. Maximum delegation depth is CAP_MAX_DELEGATION_DEPTH (16), so at most 16 workqueue passes are needed to mark the entire subtree.

/// Work item for breadth-first capability revocation.
/// Enqueued to the per-CPU workqueue during Phase 2 of `cap_revoke()`.
/// Each item processes one node in the delegation tree: marks its
/// children as revoked and enqueues their sub-trees for processing.
pub struct CapRevocationWork {
    /// The capability entry whose children need to be revoked.
    target_cap_id: CapId,
    /// Generation at revocation time (for stale-work detection).
    revocation_gen: u64,
}

  • Drain: After all nodes in the subtree are marked with REVOKED_FLAG, the revoker waits for active_ops counters on all affected nodes to reach zero (the existing drain() mechanism). Delegates that do not acknowledge within the timeout are forcibly invalidated and the domain is flagged for audit.

Spinlock hold time invariant: The children spinlock on any single CapEntry is held for at most O(CAP_MAX_DELEGATIONS) = O(256) iterations per workqueue item. No recursive spinlock acquisition occurs — each level of the tree is processed by a separate workqueue item. The worst-case total revocation latency for a full tree (depth 16, 256 children per node) is bounded by 16 × workqueue_schedule_overhead + 16 × 256 × per_child_cost, but the spinlock hold time per node remains O(256) regardless of tree depth.
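The breadth-first marking pass can be sketched as follows. `Node` and the Vec-based queue are illustrative stand-ins for CapEntry and the per-CPU workqueue; in the kernel, each pass holds only the target's children spinlock:

```rust
/// Illustrative breadth-first revocation marking (simplified types).
const CAP_MAX_DELEGATIONS: usize = 256;

struct Node {
    revoked: bool,
    children: Vec<usize>, // indices into the delegation-tree arena
}

/// Process one work item: mark the target's direct children revoked and
/// enqueue a new work item per child that has its own subtree. The
/// children's own locks are never taken here — each level of the tree
/// is handled by its own work item, so there is no recursive locking.
fn process_revocation_work(tree: &mut Vec<Node>, target: usize, queue: &mut Vec<usize>) {
    let children = tree[target].children.clone();
    assert!(children.len() <= CAP_MAX_DELEGATIONS);
    for child in children {
        tree[child].revoked = true; // lock-free fetch_or in the kernel
        if !tree[child].children.is_empty() {
            queue.push(child);
        }
    }
}
```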

This is a structural improvement over Linux's POSIX capability model, which has no delegation tracking at all — Linux cannot enumerate, audit, or selectively revoke delegated authority.

KABI dispatch trampoline check: In addition to the drain() mechanism described above (which prevents new CapOperationGuard instances), the REVOKED_FLAG is also checked at KABI dispatch entry — immediately after the CapValidationToken generation check and before ValidatedCap<'dispatch> creation. This closes the TOCTOU window between generation validation and RCU read lock acquisition. Cost: one AtomicU64::load(Relaxed) (~1-5 cycles). See Section 12.3 for the full dispatch-entry validation sequence.

The CapEntry.children list records every cap_delegate() call made FROM a given capability. Each entry holds the child CapId, the receiving DomainId, a TSC timestamp, and the granted rights subset. The list is stored in the CapTable's CapEntry (not in the Capability token itself) so that the hot validation path never touches it. Heap allocation during delegation is acceptable (syscall context). The maximum children per capability is CAP_MAX_DELEGATIONS (256); cap_delegate() returns CapError::DelegationLimitReached if exceeded. Attempting to delegate when delegation_depth == CAP_MAX_DELEGATION_DEPTH (16) returns CapError::DelegationDepthExceeded.

Capability tombstones: After revocation, the capability slot is kept as a tombstone (generation incremented, type = Revoked) for one epoch. This prevents ABA races where a new capability is allocated to the same slot index before all in-flight messages referencing the old generation have been processed.
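A sketch of the tombstone lifecycle, as a simplified state machine with illustrative names:

```rust
/// Illustrative tombstone states (real slots live in the CapTable).
#[derive(Debug, PartialEq)]
enum SlotState {
    Live,
    Revoked, // tombstone: the slot must not be reallocated this epoch
    Free,
}

struct CapSlot {
    state: SlotState,
    generation: u64,
}

impl CapSlot {
    /// Revoke: bump the generation and leave a tombstone.
    fn revoke(&mut self) {
        self.generation += 1;
        self.state = SlotState::Revoked;
    }

    /// End of epoch: the tombstone is reclaimed. The retained generation
    /// still protects against ABA when the slot index is later reused.
    fn epoch_end(&mut self) {
        if self.state == SlotState::Revoked {
            self.state = SlotState::Free;
        }
    }
}
```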

9.1.1.2 Cluster Revocation Wire Protocol

Capability revocation in a cluster requires:

  1. Notifying all nodes that hold a delegated copy of the revoked capability.
  2. Ensuring all nodes have revoked before the revoking node considers the operation complete.
  3. Fencing in-flight operations that may have passed the capability check but not yet completed.

Message format (CapRevocationMsg):

#[repr(C)]
pub struct CapRevocationMsg {
    /// Protocol version (currently 1).
    pub version:    u8,
    /// Message type (1 = REVOKE_REQUEST, 2 = REVOKE_ACK, 3 = REVOKE_COMPLETE).
    pub msg_type:   u8,
    pub _pad:       [u8; 2],
    /// Explicit padding so `object_id` begins at wire offset 8. The Le*
    /// types are byte-array-backed with alignment 1, so this field keeps
    /// the wire layout naturally aligned without relying on
    /// compiler-inserted padding.
    pub _pad1:      Le32,
    /// The capability object ID being revoked (wire-format: little-endian).
    pub object_id:  ObjectIdWire,
    /// Generation of the capability being revoked (capabilities with older
    /// generations are already invalid; this revokes the current generation).
    pub generation: Le64,
    /// Epoch counter for ordering (sender increments on each revocation batch).
    pub epoch:      Le64,
    /// Cryptographic nonce (128-bit random) for replay prevention.
    pub nonce:      [u8; 16],
    /// Sender node ID (wire-format: little-endian).
    pub sender:     Le64,
}

/// Wire-format variant of ObjectId with explicit little-endian fields.
/// Le32/Le64 are byte-array-backed (`[u8; 4]`/`[u8; 8]`) with alignment 1,
/// so no implicit padding exists between fields. Total size: 12 bytes.
#[repr(C)]
pub struct ObjectIdWire {
    pub slot:       Le32,
    pub generation: Le64,
}
const_assert!(core::mem::size_of::<ObjectIdWire>() == 12);

// CapRevocationMsg total size: version(1) + msg_type(1) + _pad(2) + _pad1(4) +
// object_id(12) + generation(8) + epoch(8) + nonce(16) + sender(8) = 60 bytes.
// All Le* types are [u8; N] with alignment 1 — no implicit padding.
const_assert!(core::mem::size_of::<CapRevocationMsg>() == 60);

Transport: Sent over the cluster's ClusterTransport (RDMA or TCP fallback, Section 5.1). Uses send_reliable() for REVOKE_REQUEST and REVOKE_COMPLETE messages.

Protocol:

  1. Revoker increments capability.generation atomically (makes old-gen caps invalid locally).
  2. Revoker sends REVOKE_REQUEST to all nodes in capability.delegation_list concurrently.
  3. Each recipient node:
     a. Removes all capabilities with (object_id, generation) from its local capability table.
     b. Waits for any in-flight operations using the capability to complete (drains the per-capability operation counter to zero; bounded by the max operation timeout = 5s).
     c. Sends REVOKE_ACK to the revoker.
  4. Revoker waits for ACKs from all notified nodes (timeout: 10s per node).
  5. On all-ACKs: sends REVOKE_COMPLETE broadcast; revocation is final.
  6. On timeout: the revoker marks the node as "pending-revocation" and isolates it from new capability grants. Cluster health monitoring (Section 5.1) forces the isolated node to rejoin, which triggers re-synchronization of its capability table from the root.
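The recipient side of step 3 can be sketched as follows (simplified table; the drain at step 3b is elided here — it is the CapOperationGuard mechanism described under Fencing):

```rust
/// Illustrative local capability record (simplified; not kernel types).
#[derive(Debug, PartialEq)]
struct LocalCap {
    object_slot: u32,
    generation: u64,
}

/// Handle a REVOKE_REQUEST: remove matching caps, drain, then ACK.
fn handle_revoke_request(table: &mut Vec<LocalCap>, object_slot: u32, generation: u64) -> bool {
    // 3a. Remove every capability matching (object_id, generation).
    table.retain(|c| !(c.object_slot == object_slot && c.generation == generation));
    // 3b. Drain in-flight operations (elided; bounded by the 5s max
    //     operation timeout).
    // 3c. Signal the caller to send REVOKE_ACK.
    true
}
```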

Fencing: In-flight operations are tracked by CapOperationGuard (RAII type; dropped when the operation that used the capability completes). The drain at step 3b waits for all CapOperationGuard instances on the revoked capability to drop. Maximum drain time: 5s (matches the max RPC timeout).

/// RAII guard that tracks in-flight operations on a capability.
/// Created when a validated capability is used for an operation;
/// dropped when the operation completes. Revocation drains all
/// outstanding guards before finalizing.
pub struct CapOperationGuard<'a> {
    /// Capability being used by this operation.
    cap_id: CapId,
    /// Reference to the per-capability active operation counter in the
    /// owning CapEntry. Shared across all guards for the same capability.
    /// Lifetime `'a` is tied to the CapEntry that issued this guard.
    active_ops: &'a AtomicU64,
}

impl<'a> CapOperationGuard<'a> {
    /// Create a new operation guard. Fails if the capability has been
    /// marked REVOKED (the REVOKED flag is bit 63 of active_ops).
    ///
    /// Uses fetch_add with post-check-and-undo instead of a CAS loop.
    /// This converges in O(1) regardless of contention: one atomic
    /// increment, one branch, and (on the revoked cold path only) one
    /// atomic decrement. The CAS loop approach requires O(N) retries
    /// under N-way contention, causing unbounded spinning on 256-core
    /// NUMA systems.
    ///
    /// # Correctness invariants
    ///
    /// - **No overflow**: `active_ops` lower 63 bits count concurrent guards.
    ///   Max concurrent guards bounded by max threads (~2^22 on typical systems).
    ///   63-bit counter cannot overflow within a single grace period.
    /// - **Revocation ordering**: fetch_add(AcqRel) ensures the REVOKED flag
    ///   check sees the flag if it was set by any core before the add. The
    ///   undo path uses Release to ensure the decrement is visible to the
    ///   revocation drain waiter.
    /// - **Lifetime**: `&'a CapEntry` ensures the guard cannot outlive the
    ///   capability entry. The guard borrows `active_ops` immutably — the
    ///   `AtomicU64` provides interior mutability.
    pub fn try_new(entry: &'a CapEntry) -> Result<Self, CapError> {
        // Single atomic: increment first, check after.
        let prev = entry.active_ops.fetch_add(1, Ordering::AcqRel);
        if prev & REVOKED_FLAG != 0 {
            // Capability was revoked — undo the increment.
            // This undo only fires on revoked capabilities, which are
            // cold by definition (revocation is a rare administrative event).
            entry.active_ops.fetch_sub(1, Ordering::Release);
            return Err(CapError::Revoked);
        }
        Ok(Self { cap_id: entry.id, active_ops: &entry.active_ops })
    }
}

impl Drop for CapOperationGuard<'_> {
    fn drop(&mut self) {
        // Release semantics: all operation writes are visible before
        // the counter decrement.
        let prev = self.active_ops.fetch_sub(1, Ordering::Release);
        // If we were the last guard and REVOKED flag is set, wake
        // the revocation waiter.
        if prev == (REVOKED_FLAG | 1) {
            wake_revocation_waiter(self.cap_id);
        }
    }
}

/// Bit 63 of `active_ops`: set by `drain()` to prevent new guards.
const REVOKED_FLAG: u64 = 1 << 63;

/// Drain all in-flight operations on a capability. Called during
/// revocation (step 3b of the cluster protocol).
///
/// 1. Sets REVOKED_FLAG (bit 63) — `try_new()` will fail for new guards.
/// 2. Spins until the counter (bits 0-62) reaches zero, meaning all
///    existing guards have been dropped.
/// 3. Timeout: 100ms for local operations (5s for cluster revocation).
///    On timeout: force-revoke — the capability is invalidated and any
///    in-flight operation holding a guard will receive an error on its
///    next kernel interaction (CapError::Revoked).
pub fn drain(entry: &CapEntry, timeout: Duration) -> Result<(), DrainError> {
    entry.active_ops.fetch_or(REVOKED_FLAG, Ordering::AcqRel);
    let deadline = monotonic_now() + timeout;
    // Phase 1: Brief spin (microsecond-scale). Most drains complete here
    // because the guard is dropped when the current operation finishes.
    for _ in 0..100 {
        let val = entry.active_ops.load(Ordering::Acquire);
        if val & !REVOKED_FLAG == 0 {
            return Ok(());
        }
        core::hint::spin_loop();
    }
    // Phase 2: WaitQueue-based wait. CapOperationGuard::Drop wakes
    // `entry.revocation_waitq` when the last guard is dropped, so we
    // do not need to busy-poll for the remaining timeout.
    loop {
        let val = entry.active_ops.load(Ordering::Acquire);
        if val & !REVOKED_FLAG == 0 {
            return Ok(());
        }
        let remaining = deadline.saturating_duration_since(monotonic_now());
        if remaining.is_zero() {
            return Err(DrainError::Timeout {
                remaining_ops: val & !REVOKED_FLAG,
            });
        }
        entry.revocation_waitq.wait_timeout(remaining);
    }
}

Replay prevention: The nonce + epoch pair prevents replayed revocations. Each node keeps a 60s window of seen nonces.
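A sketch of the replay-window check (the Vec and second-granularity timestamps are illustrative stand-ins; the kernel keys the window off monotonic time):

```rust
/// Illustrative replay window: a revocation message is accepted only if
/// its nonce has not been seen within the last `window_secs` seconds.
struct ReplayWindow {
    window_secs: u64,
    seen: Vec<(u64, [u8; 16])>, // (arrival time in seconds, nonce)
}

impl ReplayWindow {
    fn check_and_record(&mut self, now_secs: u64, nonce: [u8; 16]) -> bool {
        // Expire nonces older than the window.
        let cutoff = now_secs.saturating_sub(self.window_secs);
        self.seen.retain(|&(t, _)| t >= cutoff);
        // Reject replays seen within the window.
        if self.seen.iter().any(|&(_, n)| n == nonce) {
            return false;
        }
        self.seen.push((now_secs, nonce));
        true
    }
}
```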

9.1.1.3 Capability Delegation API

/// A reference to a capability in the caller's CapSpace. This is the
/// handle value that the caller passes to kernel APIs; the kernel resolves
/// it to the underlying `CapEntry` via the caller's `CapSpace`.
pub type CapRef = CapHandle;

/// Error conditions for capability operations (delegation, access, revocation).
pub enum CapError {
    /// Requested rights exceed the parent capability's rights.
    InsufficientPermissions,
    /// Target isolation domain does not exist or has been torn down.
    DomainNotFound,
    /// Parent capability already has the maximum number of child delegations (256).
    DelegationLimitReached,
    /// Delegation chain depth exceeds `CAP_MAX_DELEGATION_DEPTH` (16).
    DelegationDepthExceeded,
    /// Capability has been revoked (generation mismatch or explicit revocation).
    Revoked,
    /// The driver backing this capability has crashed and not yet reloaded.
    DriverCrashed,
    /// Capability is non-transferable (e.g., debug capabilities).
    NonTransferable,
    /// CapSpace handle table is full (per-process limit exceeded).
    HandleExhausted,
    /// CapHandle does not map to a valid entry in the caller's CapSpace.
    InvalidHandle,
    /// Target isolation tier exceeds `CapConstraints::max_delegation_tier`.
    /// The capability's tier restriction prevents delegation to the requested tier.
    TierRestriction,
    /// Tier M delegation requested but the cluster messaging layer is not
    /// initialized (single-node system or cluster not yet joined).
    ClusterUnavailable,
    /// Signing the capability for Tier M delegation failed (cluster key pair
    /// not available or crypto operation error).
    SigningFailed,
    /// The target peer did not acknowledge capability delivery within the
    /// timeout (`CLUSTER_CAP_DELIVER_TIMEOUT_MS`). The peer may be
    /// unreachable, partitioned, or overloaded.
    PeerUnreachable,
    /// The remote node rejected the capability (invalid signature, expired,
    /// or policy rejection on the receiving side).
    RemoteRejection,
    /// The capability has a finite lifetime (`expires_ns` field) and the
    /// current monotonic time exceeds it. The caller must re-acquire a fresh
    /// capability from the granting authority. Used by KABI validation
    /// ([Section 12.3](12-kabi.md#kabi-bilateral-capability-exchange)).
    Expired,
    /// The capability's type does not match the target driver domain's
    /// expected service type. Used by KABI validation
    /// ([Section 12.3](12-kabi.md#kabi-bilateral-capability-exchange)).
    WrongType,
    /// The caller's user namespace does not match the capability's owning
    /// user namespace. Capabilities are scoped to a user namespace; cross-
    /// namespace delegation requires explicit `CAP_SYS_ADMIN` in the target
    /// namespace or Tier M cluster-level delegation.
    NamespaceMismatch,
    /// An LSM module (SELinux, AppArmor, BPF-LSM) denied the capability
    /// operation via the `security_cap_check()` hook. The LSM verdict takes
    /// precedence over capability permission checks — even if the capability
    /// grants the requested rights, the LSM can still deny the operation.
    LsmDenied,
    /// The receiving node's SystemCaps do not include the capability required
    /// by this distributed operation. Returned by `cluster_deliver_signed_cap()`
    /// when the target peer lacks the system-level capability (e.g.,
    /// `CAP_DSM_CREATE`, `CAP_DLM_ADMIN`) needed to accept the delegated
    /// capability. The sender should verify the receiver's advertised
    /// SystemCaps before attempting delegation.
    ReceiverLacksSystemCap,
}

/// Delegate a capability to another isolation domain.
///
/// Creates a child capability with a subset of the parent's rights,
/// registers the delegation in the parent capability's provenance chain
/// (for revocation propagation), and returns a token that can be
/// transferred to `target_domain`.
///
/// # Arguments
/// - `cap`: The parent capability to delegate.
/// - `permitted_rights`: The rights to grant to the delegate. Must be
///   a subset of `cap.rights` — you cannot grant rights you don't have.
/// - `target_domain`: The isolation domain receiving the delegation.
///
/// # Returns
/// A `DelegatedCap` containing the new child capability ID and metadata.
///
/// # Errors
/// - `CapError::InsufficientPermissions`: `permitted_rights` includes rights
///   not present in `cap.rights`.
/// - `CapError::DomainNotFound`: `target_domain` is not a valid domain.
/// - `CapError::DelegationLimitReached`: The parent capability already has
///   the maximum number of active delegations (256). Revoke some first.
/// - `CapError::DelegationDepthExceeded`: `cap.delegation_depth` equals
///   `CAP_MAX_DELEGATION_DEPTH` (16). The delegation chain has reached the
///   system-wide depth ceiling; no further sub-delegation is permitted.
///
/// # Revocation
/// Calling `cap_revoke(cap)` uses a two-phase breadth-first protocol:
/// 1. **Phase 1 (lock-free)**: Set `REVOKED_FLAG` on the target's
///    `active_ops` via `fetch_or(REVOKED_FLAG, Release)`. New operations
///    immediately fail.
/// 2. **Phase 2 (workqueue)**: Enqueue `CapRevocationWork` to the per-CPU
///    workqueue. The work item acquires `entry.children` spinlock, iterates
///    children (max 256), marks each child revoked, and enqueues sub-tree
///    revocations — no recursive spinlock acquisition.
/// 3. After all nodes are marked, `drain()` waits for `active_ops`
///    counters to reach zero. Delegates must call `cap_drop()` within
///    `REVOKE_TIMEOUT_MS` (1000ms default).
///
/// # Design note
/// Unlike POSIX capabilities (which are per-process and non-delegatable),
/// UmkaOS capabilities form an explicit delegation tree. This enables:
/// - Precise revocation (revoke from root, all delegates are notified)
/// - Audit trail (delegation records include timestamp + domain ID)
/// - Least-privilege enforcement (delegates cannot exceed parent rights)
pub fn cap_delegate(
    cap: CapRef,
    permitted_rights: Rights,
    target_domain: DomainId,
) -> Result<DelegatedCap, CapError>;

/// Token returned by `cap_delegate()`. Must be sent to `target_domain`
/// via IPC; the target uses it to call `cap_accept(delegated_cap)`.
pub struct DelegatedCap {
    /// ID of the newly created child capability.
    pub cap_id: CapId,
    /// ID of the parent capability (for audit and revocation).
    pub parent_cap_id: CapId,
    /// The domain authorized to accept this cap.
    pub target_domain: DomainId,
    /// Rights granted (subset of parent rights).
    pub rights: Rights,
    /// Expiry: TSC value after which this token is invalid (0 = no expiry).
    pub expiry_tsc: u64,
}

/// Maximum active delegations per capability.
pub const CAP_MAX_DELEGATIONS: usize = 256;

/// Maximum delegation chain depth. Root capabilities issued directly by the
/// kernel have `delegation_depth = 0`. Each call to `cap_delegate()` produces
/// a child with `delegation_depth = parent.delegation_depth + 1`. When a
/// capability's `delegation_depth` equals this constant, `cap_delegate()`
/// returns `CapError::DelegationDepthExceeded` and no child is created.
///
/// Value 16 is sufficient for all realistic delegation chains (e.g.,
/// kernel → service → container runtime → container → sandbox → helper
/// = 5 levels) and prevents unbounded memory growth from pathological
/// delegation chains.
pub const CAP_MAX_DELEGATION_DEPTH: u8 = 16;

/// Revocation timeout: how long UmkaOS Core waits for delegates to
/// acknowledge a revocation before forcibly invalidating the capability.
pub const REVOKE_TIMEOUT_MS: u64 = 1000;

/// Scope for bulk capability revocation.
pub enum RevocationScope {
    /// Revoke ALL outstanding capabilities on this node.
    /// Used by `setenforce(1)` to immediately invalidate all KABI tokens.
    AllLocal,
    /// Revoke ALL capabilities on a specific remote node (cluster security:
    /// compromised node isolation).
    ForNode(NodeId),
}

/// Revoke ALL outstanding capabilities in the specified scope.
///
/// Used by `setenforce(1)` to immediately invalidate all KABI tokens,
/// and by cluster security to revoke all caps on a compromised node.
///
/// **AllLocal**: Increments `GLOBAL_CAP_GENERATION` (AtomicU64, AcqRel).
/// All `CapValidationToken`s fail the generation check on their next
/// dispatch — no per-capability iteration required. This is an O(1)
/// operation regardless of the number of outstanding capabilities.
///
/// **ForNode**: Broadcasts a `RevokeAll` message to the specified node.
/// The remote node increments its local `GLOBAL_CAP_GENERATION` on receipt,
/// achieving the same O(1) invalidation remotely.
pub fn cap_revoke_all(scope: RevocationScope) {
    match scope {
        RevocationScope::AllLocal => {
            GLOBAL_CAP_GENERATION.fetch_add(1, AcqRel);
            // All CapValidationTokens fail generation check on next dispatch
        }
        RevocationScope::ForNode(node_id) => {
            broadcast_revoke_all(node_id);
            // Remote node increments its local generation on receipt
        }
    }
}

/// Global capability generation counter. Incremented by `cap_revoke_all()`
/// to invalidate all outstanding `CapValidationToken`s in O(1).
/// `CapValidationToken` captures this value at creation; KABI dispatch
/// compares the token's generation against the current value.
static GLOBAL_CAP_GENERATION: AtomicU64 = AtomicU64::new(0);
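The O(1) invalidation property follows directly from the generation comparison. A minimal userspace sketch (stand-in types named after the kernel objects above; not the kernel implementation):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the kernel-global generation counter described above.
static GLOBAL_CAP_GENERATION: AtomicU64 = AtomicU64::new(0);

struct CapValidationToken {
    /// Generation captured at token creation time.
    generation: u64,
}

impl CapValidationToken {
    fn new() -> Self {
        Self { generation: GLOBAL_CAP_GENERATION.load(Ordering::Acquire) }
    }

    /// Dispatch-time check: the token is valid only while no revoke-all
    /// has occurred since it was minted. One atomic load + one compare.
    fn is_valid(&self) -> bool {
        self.generation == GLOBAL_CAP_GENERATION.load(Ordering::Acquire)
    }
}

/// AllLocal revocation: a single atomic increment invalidates every
/// outstanding token, with no per-capability iteration.
fn cap_revoke_all_local() {
    GLOBAL_CAP_GENERATION.fetch_add(1, Ordering::AcqRel);
}
```

The cost of `cap_revoke_all()` is therefore independent of how many tokens exist; the work is deferred to each token's next dispatch-time check.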

9.1.1.4 Process Capability Query

/// Query the effective capability set of a target process.
/// Requires CAP_SYS_PTRACE in the caller's user namespace.
/// Used by the D-Bus bridge for polkit-style authorization decisions.
pub fn cap_query_by_pid(
    target_pid: u32,
    ns: &PidNamespace,
) -> Result<SystemCaps, CapError>;
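The permission gate can be modeled in isolation. A hypothetical sketch of the check order (stand-in types; bit 19 matches Linux's capability numbering for `CAP_SYS_PTRACE`):

```rust
use std::collections::HashMap;

/// Bit 19, matching Linux's CAP_SYS_PTRACE numbering.
const CAP_SYS_PTRACE: u64 = 1 << 19;

#[derive(Debug, PartialEq)]
enum CapError { PermissionDenied, NoSuchPid }

/// Model of cap_query_by_pid(): gate on the caller's effective caps first,
/// then look up the target's effective set (here a plain map stands in for
/// the PID-namespace-scoped process table).
fn cap_query_by_pid_model(
    caller_effective: u64,
    per_pid_caps: &HashMap<u32, u64>,
    target_pid: u32,
) -> Result<u64, CapError> {
    if caller_effective & CAP_SYS_PTRACE == 0 {
        return Err(CapError::PermissionDenied);
    }
    per_pid_caps.get(&target_pid).copied().ok_or(CapError::NoSuchPid)
}
```

The D-Bus bridge would call this with the requesting client's credentials, denying the polkit-style query before any target lookup occurs.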

9.1.1.5 Delegation Depth Limit

UmkaOS capabilities are attenuated: a delegated capability carries a strict subset of the parent's rights. Because rights can only be removed and never added, delegation is monotonically weakening — a context that already holds a parent capability cannot gain additional authority by receiving a child delegation of it. This property means cycle detection is unnecessary: delegating to a context that already holds the parent results in the child having the lesser of the two, which is harmless.

Despite the impossibility of privilege cycles, unbounded delegation chains would allow an adversary to consume unbounded kernel memory (one Capability kernel object per link in the chain). The depth limit closes this vector.

Depth limit: 16 levels. Root capabilities — issued directly by the kernel at object creation, open(), etc. — have delegation_depth = 0. Each call to cap_delegate() produces a child with delegation_depth = parent.delegation_depth + 1. When delegation_depth == CAP_MAX_DELEGATION_DEPTH (16), cap_delegate() returns CapError::DelegationDepthExceeded; no child capability is created and no kernel object is allocated.

Memory bound: any single delegation chain contains at most 16 delegated capability objects (one per level), so the kernel memory cost of one chain is bounded at 16 × sizeof(Capability). Because a capability may be delegated to multiple children (up to CAP_MAX_DELEGATIONS = 256), the delegation structure is a tree rather than a single chain; total growth is bounded per holder by CAP_MAX_CAPABILITIES_PER_PROCESS (1024) active capabilities per process.

CapConstraints::max_delegation_depth (the per-capability policy field, value 0–16) is an additional constraint: if set, cap_delegate() also rejects when parent.delegation_depth >= parent.constraints.max_delegation_depth. This allows issuers to impose a tighter limit than the system-wide maximum on capabilities they create (e.g., max_delegation_depth = 1 to allow exactly one level of sub-delegation). The system-wide CAP_MAX_DELEGATION_DEPTH = 16 is an absolute hard ceiling that cannot be overridden upward by any capability constraint.
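Both depth checks, the system-wide hard ceiling and the optional `CapConstraints::max_delegation_depth`, can be sketched together (stand-in types; not the kernel's full `cap_delegate()` path):

```rust
const CAP_MAX_DELEGATION_DEPTH: u8 = 16;

#[derive(Debug, PartialEq)]
enum CapError { DelegationDepthExceeded }

#[derive(Clone)]
struct Capability {
    delegation_depth: u8,
    /// Stand-in for CapConstraints::max_delegation_depth, if set by the issuer.
    max_delegation_depth: Option<u8>,
}

/// Depth portion of cap_delegate(): reject at the hard ceiling, then at the
/// issuer-imposed tighter limit, then mint the child one level deeper.
fn cap_delegate_depth_check(parent: &Capability) -> Result<Capability, CapError> {
    if parent.delegation_depth >= CAP_MAX_DELEGATION_DEPTH {
        return Err(CapError::DelegationDepthExceeded);
    }
    if let Some(max) = parent.max_delegation_depth {
        if parent.delegation_depth >= max {
            return Err(CapError::DelegationDepthExceeded);
        }
    }
    Ok(Capability {
        delegation_depth: parent.delegation_depth + 1,
        ..parent.clone()
    })
}
```

Note the ordering: both rejections happen before any child object is allocated, so a failed delegation costs no kernel memory.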

9.1.1.6 Cross-Tier Capability Delegation

Capabilities must be safely delegated across isolation tier boundaries (Tier 0 → Tier 1, Tier 0 → Tier 2, Tier 1 → Tier 2, any → Tier M) without allowing forgery, amplification, or stale references. The kernel mediates all cross-tier delegation — drivers cannot directly transfer capabilities between tiers.

/// Extended constraint for cross-tier delegation.
/// Added to `CapConstraints` to control how far down the isolation
/// hierarchy a capability may be delegated.
impl CapConstraints {
    /// Maximum isolation tier this capability can be delegated to.
    /// `None` = unrestricted (can reach any tier, including Tier M).
    /// `Some(IsolationTier::Tier2)` = local tiers only (no cluster delegation).
    /// `Some(IsolationTier::Tier1)` = Tier 0 and Tier 1 only.
    /// `Some(IsolationTier::Tier0)` = Tier 0 only (non-delegatable to isolated drivers).
    ///
    /// Enforcement: `cap_delegate()` checks `target_tier <= max_delegation_tier`
    /// before creating the child capability. Violation returns
    /// `CapError::TierRestriction`.
    pub max_delegation_tier: Option<IsolationTier>,
}

/// Isolation tier levels, ordered from most-privileged to least-privileged.
/// Tier 0 < Tier 1 < Tier 2 < Tier M (numerically: 0, 1, 2, 3).
///
/// Tier M (Machine-boundary) represents a capability holder on a remote
/// cluster node. It is the least-privileged tier because it crosses a
/// network boundary — all capability operations must be validated via
/// cryptographic signatures and are subject to expiry
/// ([Section 5.7](05-distributed.md#network-portable-capabilities)).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(u8)]
pub enum IsolationTier {
    /// Tier 0: in-kernel, statically linked or loadable but sharing Core address space.
    Tier0 = 0,
    /// Tier 1: Ring 0, hardware memory domain isolated (MPK/POE/DACR).
    Tier1 = 1,
    /// Tier 2: Ring 3, full process isolation + IOMMU.
    Tier2 = 2,
    /// Tier M: remote cluster node, network-isolated. Capabilities delegated
    /// to Tier M are cryptographically signed and time-bounded. The remote
    /// node validates the signature locally and checks expiry on every use.
    /// Revocation requires cluster-wide broadcast
    /// ([Section 9.1](#capability-based-foundation--distributed-capability-revocation)).
    ///
    /// Phase 3 implementation scope: the enum variant and delegation path
    /// are specified here. The full wire protocol for Tier M capability
    /// transfer uses the cluster messaging layer
    /// ([Section 5.1](05-distributed.md#distributed-kernel-architecture)) and the `CapabilityHeader` /
    /// `CapabilitySignature` types from [Section 5.7](05-distributed.md#network-portable-capabilities).
    TierM = 3,
}
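Because `PartialOrd`/`Ord` are derived, tier comparisons follow declaration order, which is exactly what the `target_tier <= max_delegation_tier` check relies on. A standalone sketch of the gate (the helper name `tier_allows` is illustrative):

```rust
/// Mirror of the enum above; derived Ord follows declaration order,
/// so Tier0 < Tier1 < Tier2 < TierM.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(u8)]
enum IsolationTier { Tier0 = 0, Tier1 = 1, Tier2 = 2, TierM = 3 }

/// The delegate_to_tier() gate: None = unrestricted, Some(m) = delegation
/// permitted only to tiers numerically at or below m.
fn tier_allows(max_delegation_tier: Option<IsolationTier>, target: IsolationTier) -> bool {
    match max_delegation_tier {
        None => true,
        Some(m) => target <= m,
    }
}
```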

Tier-specific delegation transport:

  • Tier 0 → Tier 1 (MPK-isolated): the capability handle is passed via a KABI descriptor in the shared memory region between Core and the Tier 1 domain. The kernel stores the full CapEntry in its CapTable; the Tier 1 driver receives an opaque CapHandle (u64 index into the per-domain handle table). All permission checks are kernel-mediated — the Tier 1 driver calls kabi_cap_check(handle, required_rights) which traps into Core, validates the handle, and returns Ok(ValidatedCap) or Err(CapError).

Per-domain handle table: Each isolation domain (Tier 1 driver instance) has a DomainCapTable: XArray<CapHandle> mapping local handle indices to global CapIds in the kernel-wide CapTable. The XArray is keyed by the domain-local handle value (u64). Created at domain init; destroyed on domain teardown (all entries revoked). kabi_cap_check() resolves the domain-local handle to a global CapId, validates generation and rights, and returns a ValidatedCap token:

/// Validate a domain-local capability handle. Called from the KABI
/// dispatch trampoline when a Tier 1 driver presents a CapHandle.
/// Resolves through the per-domain handle table, then validates
/// against the global CapTable.
pub fn kabi_cap_check(
    domain: DomainId,
    handle: CapHandle,
    required_rights: PermissionBits,
) -> Result<ValidatedCap<'static>, CapError>;
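A minimal model of the two-step resolution (domain-local handle to global CapId, then generation and rights validation), using maps as stand-ins for the XArrays:

```rust
use std::collections::HashMap;

type CapHandle = u64; // domain-local index
type CapId = u64;     // global CapTable key

#[derive(Debug, PartialEq)]
enum CapError { BadHandle, Revoked, InsufficientRights }

struct CapEntry { generation: u64, permissions: u64 }

/// Stand-in for DomainCapTable: local handle -> global CapId.
struct DomainCapTable { map: HashMap<CapHandle, CapId> }

fn kabi_cap_check_model(
    domain: &DomainCapTable,
    global: &HashMap<CapId, CapEntry>,
    object_generation: u64,
    handle: CapHandle,
    required_rights: u64,
) -> Result<CapId, CapError> {
    // 1. Resolve the domain-local handle to a global CapId.
    let cap_id = *domain.map.get(&handle).ok_or(CapError::BadHandle)?;
    // 2. Validate against the global table: generation, then rights.
    let entry = global.get(&cap_id).ok_or(CapError::BadHandle)?;
    if entry.generation != object_generation {
        return Err(CapError::Revoked);
    }
    if entry.permissions & required_rights != required_rights {
        return Err(CapError::InsufficientRights);
    }
    Ok(cap_id) // stands in for the ValidatedCap token
}
```

A Tier 1 driver never sees the global CapId; it holds only the domain-local handle, so a handle leaked to another domain resolves to nothing there.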

ValidatedCap vs CapValidationToken naming: ValidatedCap<'dispatch> is a short-lived, lifetime-scoped token valid for a single KABI dispatch call (scoped to the KabiDispatchGuard RCU guard). CapValidationToken is a long-lived, cross-call token stored by drivers between dispatches. Both amortize capability validation, but at different timescales. See Section 12.3 for the two-level validation design.

  • Tier 0 → Tier 2 (Ring 3 process): the capability is passed via a sealed IPC handle. The handle is a userspace-opaque token (64-bit value) that the kernel validates on every use via the standard cap_validate() path. The Tier 2 process cannot modify, forge, or inspect the token's internal structure — it can only pass it back to the kernel in syscall arguments.

  • Tier 1 → Tier 2: not permitted directly. A Tier 1 driver that needs to grant a capability to a Tier 2 process must request the delegation through Core (cap_delegate() with target_domain pointing to the Tier 2 domain). Core creates the child capability and delivers the sealed IPC handle to the Tier 2 process.

  • Any local tier → Tier M (remote cluster node): the capability is converted to a CapabilityHeader + CapabilitySignature (Section 5.7) and transmitted via the cluster messaging layer (Section 5.1). The issuing node signs the capability with its cluster key pair. The remote node validates the signature locally and caches the capability in its local cap table with a bounded TTL (default: 5 minutes, renewable while the issuer is alive). Remote operations using the delegated capability are validated locally on the remote node (signature check + expiry check + PermissionBits check). Revocation is propagated via the epoch-based distributed revocation protocol (Section 9.1).

  • Tier M → local tier: not permitted directly. A remote node that holds a Tier M capability cannot further delegate it to a local driver on the receiving node. The remote node may request the issuing node to create a new delegation (a "re-delegation request" via ClusterMsg::RedelegateCapability), which the issuer processes through its local cap_delegate() path with full policy checks.

Delegation validation:

/// Delegate a capability to a driver in a specific isolation tier.
///
/// The kernel mediates all cross-tier delegation — drivers cannot
/// directly transfer capabilities between tiers.
///
/// # Tier-specific handle delivery
///
/// - **Tier 1** (MPK-isolated): capability handle passed via KABI descriptor.
///   The kernel stores the full `CapEntry`; the driver receives an opaque
///   `CapHandle` (u64 index). All permission checks are kernel-mediated.
///
/// - **Tier 2** (Ring 3 process): capability passed via sealed IPC handle.
///   The handle is a userspace-opaque token that the kernel validates
///   on every use. The Tier 2 process cannot modify or forge the token.
///
/// - **Tier M** (remote cluster node): capability is serialized into a
///   `CapabilityHeader` + `CapabilitySignature`, signed with the local
///   node's cluster key pair, and transmitted via the cluster messaging
///   layer. The remote node validates the signature and caches the
///   capability locally with a bounded TTL.
///
/// # Revocation across tiers
///
/// When a capability is revoked (`CapEntry::revoke()`), the kernel
/// invalidates all delegated handles across all tiers:
/// - **Tier 1**: the KABI descriptor is marked invalid. The next
///   `kabi_cap_check()` call returns `CapError::Revoked` (the error code
///   is `ECAPREVOKED` in the KABI error namespace).
/// - **Tier 2**: the IPC handle is invalidated. The next syscall using
///   the handle returns `EBADF`.
/// - **Tier M**: a `ClusterMsg::RevokeCapability` message is broadcast to
///   all nodes in the capability's `delegation_set()`. Remote nodes
///   increment the generation in their local cache, invalidating the
///   capability. Bounded by `REVOKE_TIMEOUT_MS`.
/// - **Crash recovery**: all capabilities delegated to a crashed driver's
///   domain are automatically revoked as part of domain cleanup
///   ([Section 11.4](11-drivers.md#device-registry-and-bus-management--state-machine)). The domain's
///   handle table is walked and every `CapEntry` referencing the crashed
///   domain has its `DelegationRecord` removed from the parent's children
///   list.
///
/// # Errors
/// - `CapError::TierRestriction`: `target_tier` exceeds
///   `cap.constraints.max_delegation_tier`.
/// - `CapError::ClusterUnavailable`: Tier M delegation requested but the
///   cluster messaging layer is not initialized (single-node system).
/// - All errors from `cap_delegate()` also apply (rights, depth, limit).
pub fn delegate_to_tier(
    cap: &CapEntry,
    target_tier: IsolationTier,
    target_domain: DomainId,
) -> Result<CapHandle, CapError> {
    // 1. Check tier restriction.
    if let Some(max_tier) = cap.cap.constraints.max_delegation_tier {
        if target_tier > max_tier {
            return Err(CapError::TierRestriction);
        }
    }
    // 2. Delegate via the standard cap_delegate() path (rights attenuation,
    //    depth check, children list update).
    let delegated = cap_delegate(
        CapRef(cap),
        cap.cap.permissions, // full parent rights; caller narrows separately
        target_domain,
    )?;
    // 3. Deliver the handle via the tier-appropriate transport.
    match target_tier {
        IsolationTier::Tier0 => {
            // Tier 0: direct CapHandle in shared Core memory.
            Ok(CapHandle::from_cap_id(delegated.cap_id))
        }
        IsolationTier::Tier1 => {
            // Tier 1: KABI descriptor write into the domain's handle table.
            kabi_deliver_cap_handle(target_domain, delegated.cap_id)
        }
        IsolationTier::Tier2 => {
            // Tier 2: sealed IPC handle delivered via the domain's IPC channel.
            ipc_deliver_sealed_handle(target_domain, delegated.cap_id)
        }
        IsolationTier::TierM => {
            // Tier M: cryptographically-signed capability sent via cluster
            // messaging. The target_domain encodes the remote PeerId.
            let peer_id = PeerId::from_domain(target_domain);
            cluster_deliver_signed_cap(peer_id, &delegated)
        }
    }
}

CapError::TierRestriction is added to the CapError enum (above) for this purpose. The tier restriction is enforced independently of the delegation depth limit — both checks must pass for the delegation to succeed.

Tier M delegation — cluster_deliver_signed_cap:

/// Deliver a capability to a remote cluster node via signed capability token.
///
/// Converts the delegated `CapEntry` into a network-portable `CapabilityHeader` +
/// `CapabilitySignature` ([Section 5.7](05-distributed.md#network-portable-capabilities)), signs it with the
/// local node's cluster key pair, and transmits it via `ClusterTransport::send_reliable()`.
///
/// The remote node receives the signed capability, validates the signature against
/// the issuer's public key (obtained during cluster join via
/// [Section 5.2](05-distributed.md#cluster-topology-model--membership-and-topology)), and inserts it into its
/// local capability cache with a TTL of `CAP_REMOTE_TTL_NS` (default: 5 minutes).
///
/// # Errors
/// - `CapError::ClusterUnavailable`: cluster transport is not initialized.
/// - `CapError::SigningFailed`: cluster key pair is not available or signing failed.
/// - `CapError::PeerUnreachable`: the target peer did not ACK within
///   `CLUSTER_CAP_DELIVER_TIMEOUT_MS` (default: 1000ms).
fn cluster_deliver_signed_cap(
    peer_id: PeerId,
    delegated: &CapEntry,
) -> Result<CapHandle, CapError> {
    // 1. Verify cluster transport is available.
    let transport = cluster_transport()
        .ok_or(CapError::ClusterUnavailable)?;

    // 2. Construct CapabilityHeader from the delegated CapEntry.
    let mut header = CapabilityHeader {
        object_id: delegated.cap.object_id,
        permissions: delegated.cap.permissions,
        creation_epoch: delegated.cap.generation,
        constraints: delegated.cap.constraints,
        issuer_peer: local_peer_id(),
        issued_at_ns: cluster_clock_ns(),
        expires_at_ns: cluster_clock_ns() + CAP_REMOTE_TTL_NS,
        sig_algorithm: current_cluster_sig_algorithm(),
        // The user namespace ID scopes this capability on the receiving node.
        // Without it, the remote node cannot enforce namespace-scoped access
        // control and would either reject (NamespaceMismatch) or over-grant.
        user_ns_id: delegated.cap.user_ns_id,
        signature_slot: CapSigSlotId(0), // placeholder; assigned after signing
    };

    // 3. Sign the header with the local node's cluster key pair.
    //    Signature is allocated from cap_sig_slab (dedicated slab pool).
    //    Returns a CapSigSlotId indexing into the slab pool.
    let slot = sign_capability_header(&header)?;
    header.signature_slot = slot;

    // 4. Serialize header + signature (looked up via signature_slot)
    //    into a ClusterMsg::GrantCapability and send via reliable transport
    //    (waits for ACK).
    transport.send_reliable(
        peer_id,
        ClusterMsg::GrantCapability { header, sig_slot: slot },
        CLUSTER_CAP_DELIVER_TIMEOUT_MS,
    )?;

    // 5. Return a local CapHandle that tracks the remote delegation.
    //    The local cap table records that this cap was delegated to
    //    peer_id, enabling revocation broadcast on cap_revoke().
    Ok(CapHandle::from_cap_id(delegated.cap_id))
}

/// Default TTL for capabilities delegated to remote cluster nodes.
/// 5 minutes in nanoseconds. Renewable: the remote node sends a
/// `ClusterMsg::RenewCapability` request before expiry, and the
/// issuing node extends the TTL if the capability is still valid.
const CAP_REMOTE_TTL_NS: u64 = 5 * 60 * 1_000_000_000;

/// Timeout for capability delivery to a remote peer.
/// If the peer does not ACK within this timeout, the delivery fails
/// and the caller receives `CapError::PeerUnreachable`.
const CLUSTER_CAP_DELIVER_TIMEOUT_MS: u32 = 1000;

Remote capability validation on the receiving node:

When the remote node receives a ClusterMsg::GrantCapability, it:

  1. Validates the CapabilitySignature against the issuer's public key (cached from the cluster join handshake). Rejects if signature is invalid.
  2. Checks expires_at_ns > cluster_clock_ns() (with 1ms grace for clock skew). Rejects if expired.
  3. Inserts the CapabilityHeader into its local capability cache (an XArray keyed by (issuer_peer, object_id) pair, compressed to a u64 hash). The cache entry includes the expiry time for periodic cleanup.
  4. Subsequent operations on the remote node that require this capability check the local cache: the signature was already validated at insertion time, so the per-operation check is only generation == expected plus expires_at_ns > now (~3-5ns, no crypto on the hot path).
  5. When expires_at_ns approaches, the remote node sends ClusterMsg::RenewCapability to the issuer. The issuer re-validates the capability (generation check, constraint check) and responds with a new expires_at_ns or CapError::Revoked.
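The split between insertion-time and per-operation checks can be sketched as follows (stand-in cache types; signature verification is elided, and applying the 1ms skew grace only at insertion is an assumption of this sketch):

```rust
use std::collections::HashMap;

/// 1ms grace for clock skew, applied at insertion time.
const CLOCK_SKEW_GRACE_NS: u64 = 1_000_000;

struct CachedCap { generation: u64, expires_at_ns: u64 }

/// Stand-in for the remote node's capability cache, keyed by the
/// hashed (issuer_peer, object_id) pair.
struct RemoteCapCache { entries: HashMap<u64, CachedCap> }

impl RemoteCapCache {
    /// Insertion-time check. The CapabilitySignature is assumed to have
    /// been verified already; only the expiry gate is modeled here.
    fn insert(&mut self, key: u64, cap: CachedCap, now_ns: u64) -> bool {
        if cap.expires_at_ns + CLOCK_SKEW_GRACE_NS <= now_ns {
            return false; // already expired, reject
        }
        self.entries.insert(key, cap);
        true
    }

    /// Hot-path per-operation check: two compares, no crypto.
    fn check(&self, key: u64, expected_gen: u64, now_ns: u64) -> bool {
        match self.entries.get(&key) {
            Some(c) => c.generation == expected_gen && c.expires_at_ns > now_ns,
            None => false,
        }
    }
}
```

This is what keeps the per-operation cost at ~3-5ns: all cryptographic work is front-loaded into the GrantCapability insertion path.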

Remote capability revocation:

Handled by the existing epoch-based distributed revocation protocol (Section 9.1). The revoke_capability() function broadcasts ClusterMsg::RevokeCapability to all nodes in the delegation_set(), which includes Tier M peers. Remote nodes remove the capability from their local cache and ACK the revocation.
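The broadcast shape can be sketched with stand-in types (real delivery uses ClusterMsg::RevokeCapability with per-peer ACK tracking bounded by REVOKE_TIMEOUT_MS):

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for a remote node's local capability cache (cap_ids it holds).
struct Node { cache: HashSet<u64> }

/// Model of revoke propagation: walk the delegation set, have each
/// reachable node drop the capability from its cache, and count ACKs.
fn revoke_remote(
    cap_id: u64,
    delegation_set: &[u32],
    nodes: &mut HashMap<u32, Node>,
) -> usize {
    let mut acks = 0;
    for peer in delegation_set {
        if let Some(node) = nodes.get_mut(peer) {
            node.cache.remove(&cap_id); // idempotent: fine if not cached
            acks += 1;                  // models the peer's ACK
        }
    }
    acks
}
```

An issuer comparing the ACK count against the delegation set size knows whether to fall back to the forcible invalidation path after REVOKE_TIMEOUT_MS.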

9.1.1.7 Capability Validation Amortization: ValidatedCap<'guard>

Capability validation (generation check or indirection dereference) costs ~5-10 cycles per check. On a typical KABI call path, the driver invokes 3-5 kernel services per I/O operation (e.g., DMA map, ring buffer push, completion post), each requiring a capability check. This adds ~15-50 cycles per I/O — measurable at NVMe-class throughput (millions of IOPS).

UmkaOS amortizes this cost with a validate-once token pattern. This is a core design decision implemented from day one — the KABI dispatch path produces ValidatedCap tokens that downstream kernel services accept without re-validation.

/// A capability that has been validated within the current KABI dispatch.
/// The lifetime `'dispatch` is tied to the KABI dispatch guard, ensuring
/// the token cannot outlive the dispatch scope. Within this scope, the
/// capability is guaranteed valid:
/// - The generation matched at validation time.
/// - The indirection entry (if applicable) was not revoked.
/// - The permission bits include the required set.
///
/// # Safety invariant
///
/// `ValidatedCap` is only constructed by `cap_validate()` inside a
/// `KabiDispatchGuard` scope. The guard holds an RCU read-side lock,
/// preventing capability revocation from completing during the dispatch.
/// This means: if validation succeeds at the start of a KABI call,
/// the capability remains valid for the entire call — revocation cannot
/// race because the RCU grace period cannot complete until the guard drops.
pub struct ValidatedCap<'dispatch> {
    /// The validated capability handle (opaque, for passing to sub-services).
    handle: CapHandle,
    /// The permission bits that were validated. Sub-services can check
    /// specific permission bits against this cached copy without re-reading
    /// the capability table.
    permissions: PermissionBits,
    /// Object ID that this capability refers to.
    object_id: ObjectId,
    /// Tie lifetime to the KABI dispatch scope.
    _dispatch: PhantomData<&'dispatch KabiDispatchGuard>,
}

impl<'dispatch> ValidatedCap<'dispatch> {
    /// Check whether this already-validated capability has a specific
    /// permission bit. This is a local bitmask test (~1 cycle), NOT a
    /// capability table lookup.
    pub fn has_permission(&self, perm: PermissionBits) -> bool {
        self.permissions.contains(perm)
    }

    /// Return the object ID. Used by kernel services to locate the target
    /// object without a second capability table dereference.
    pub fn object_id(&self) -> ObjectId {
        self.object_id
    }
}

KABI dispatch integration: The KABI trampoline validates the capability once and passes the ValidatedCap token to the handler:

/// KABI dispatch trampoline (generated by kabi-compiler for each interface method).
fn kabi_dispatch_submit_io(
    ctx: *mut c_void,
    cap_handle: CapHandle,
    op: BioOp,
    /* ... */
) -> IoResult {
    // Acquire KABI dispatch guard (holds RCU read lock, pins capability validity).
    let dispatch = KabiDispatchGuard::enter();

    // Validate once — generation check + permission check (~5-10 cycles).
    let vcap = match cap_validate(cap_handle, PermissionBits::WRITE, &dispatch) {
        Ok(validated) => validated,
        Err(e) => return IoResult::from_error(e),
    };

    // All subsequent kernel service calls accept &ValidatedCap instead of raw
    // CapHandle. No re-validation needed — the dispatch guard guarantees validity.
    let dma_addr = dma_map_buffer(&vcap, buf)?;   // Accepts &ValidatedCap: ~1 cycle permission check
    ring_push_command(&vcap, op, dma_addr)?;       // Accepts &ValidatedCap: no cap lookup
    completion_register(&vcap, req_handle)?;        // Accepts &ValidatedCap: no cap lookup

    // dispatch guard drops here → RCU read unlock → revocations can now complete.
    IoResult::Success
}

Cost savings per KABI call:

| Without amortization | With `ValidatedCap` |
| --- | --- |
| 3-5 cap validations × ~5-10 cycles = ~15-50 cycles | 1 validation (~5-10 cycles) + 2-4 bitmask checks (~1 cycle each) = ~7-14 cycles |
| ~15-50 cycles total | ~7-14 cycles total |

On an NVMe path with one KABI dispatch per I/O, this saves ~8-36 cycles per operation (~0.03-0.15% of a 10μs NVMe read). The savings compound on paths with multiple KABI calls per I/O (e.g., TCP TX: socket cap + route cap + device cap).

Revocation safety: The KabiDispatchGuard holds an RCU read-side critical section for the duration of the KABI call. Since capability revocation uses RCU-deferred reclamation (see "Capability Revocation Semantics" above), a revocation that starts during a KABI dispatch cannot complete until the dispatch finishes. This means ValidatedCap cannot become stale within its lifetime scope.
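The invariant can be modeled without RCU by counting in-flight dispatches (a userspace analogue only; the kernel uses an RCU read-side critical section, not a shared counter):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Number of KABI dispatches currently in flight.
static ACTIVE_DISPATCHES: AtomicU64 = AtomicU64::new(0);

struct KabiDispatchGuard;

impl KabiDispatchGuard {
    /// Entering a dispatch pins capability validity (analogue of
    /// rcu_read_lock() at the top of the trampoline).
    fn enter() -> Self {
        ACTIVE_DISPATCHES.fetch_add(1, Ordering::Acquire);
        KabiDispatchGuard
    }
}

impl Drop for KabiDispatchGuard {
    /// Dropping the guard is the analogue of rcu_read_unlock().
    fn drop(&mut self) {
        ACTIVE_DISPATCHES.fetch_sub(1, Ordering::Release);
    }
}

/// A revocation may only *complete* (reclaim the CapEntry) once no dispatch
/// is in flight: the analogue of waiting out the RCU grace period.
fn revocation_can_complete() -> bool {
    ACTIVE_DISPATCHES.load(Ordering::Acquire) == 0
}
```

The key property carried over from the kernel design: revocation is never blocked from *starting* during a dispatch, only from *completing*, so a ValidatedCap can trust its capability for exactly the guard's lifetime and no longer.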

9.1.1.8 Capability System Data/Policy Split

The capability system follows the same data/policy split pattern as the physical memory allocator (PhysAllocPolicy, Section 4.2) and page table manager (VmmPolicy, Section 4.8). Non-replaceable data operations stay in Nucleus; replaceable policy logic lives in Evolvable behind an AtomicPtr<dyn CapPolicy>.

Rationale: Over a 50-year uptime window, the capability system will inevitably need changes — new permission types, new delegation models for future hardware, post-quantum delegation signatures, new namespace types, distributed capability tokens, new LSM integration, new security model requirements. Verification proves the data layer correct; live evolution keeps the policy layer current.

9.1.1.8.1 CapTable — Global Capability Registry

The CapTable is the kernel-global registry of all live capabilities. It is a Nucleus data structure — non-replaceable, formally verified, and accessed on the KABI hot path. Each issued capability (whether from open(), cap_delegate(), or driver device_init()) has exactly one CapEntry in the global CapTable.

/// Global capability table. One per kernel instance. Stores all live
/// `CapEntry` records keyed by `CapId`.
///
/// **Hot-path access**: `cap_lookup()` performs a single XArray index read
/// (~5 instructions). The XArray's 64-way radix fanout ensures O(1) lookup
/// for typical capability counts (<1M entries) with at most 3-4 cache-line
/// accesses.
///
/// **Capacity**: The XArray grows dynamically using slab-allocated nodes.
/// Maximum capacity is bounded by physical memory. The `next_id` counter
/// is `u64`, so ID exhaustion is not a concern (584-year longevity at
/// 1 billion allocations/sec).
///
/// **Concurrency**: Readers use RCU (`rcu_read_lock()`) for lock-free
/// lookup. Writers (create, revoke) acquire `write_lock` to serialize
/// structural modifications. The `next_id` counter uses `AtomicU64` for
/// lock-free ID allocation.
pub struct CapTable {
    /// Integer-keyed XArray mapping CapId -> CapEntry. XArray is mandated
    /// for all integer-keyed mappings ([Section 3.13](03-concurrency.md#collection-usage-policy)).
    /// RCU-protected reads; writer lock for insert/remove.
    pub entries: XArray<CapId, CapEntry>,

    /// Monotonic capability ID counter. Incremented atomically on each
    /// `cap_create()`. Never wraps (u64). IDs are never reused — a
    /// revoked capability's ID is permanently retired. This simplifies
    /// revocation: any stale `CapHandle` holding an old ID finds either
    /// a `None` slot or a generation mismatch.
    pub next_id: AtomicU64,

    /// Writer serialization lock. Held during `cap_create()`, `cap_revoke()`,
    /// and `cap_delegate()` to serialize structural modifications to the
    /// XArray. Not held during reads (RCU provides lock-free read access).
    ///
    /// **Allocation policy**: XArray node allocation under this SpinLock
    /// uses `GFP_ATOMIC` (non-sleeping). The XArray pre-allocates slab
    /// nodes to minimize allocation failures. If allocation fails under
    /// the lock, the operation returns `CapError::NoMemory`.
    pub write_lock: SpinLock<()>,
}

9.1.1.8.2 Nucleus — Capability Data (non-replaceable, ~2-3 KB)

The following operations are pure data manipulation with no policy decisions. They are formally verified (Section 24.4) and frozen after Phase 2 exit:

/// Nucleus capability data operations — non-replaceable, formally verified.
/// These are the only capability operations that execute on the KABI hot path.

/// Lookup a capability by handle. Pure XArray index lookup.
/// ~5 instructions: bounds check + XArray slot read + null check.
#[inline(always)]
pub fn cap_lookup(table: &CapTable, handle: CapHandle) -> Option<&CapEntry> {
    table.entries.get(handle.index())
}

/// Generation check: has this capability been revoked?
/// Single integer compare. Nucleus data — no policy involved.
#[inline(always)]
pub fn cap_generation_valid(entry: &CapEntry, object_gen: u64) -> bool {
    entry.cap.generation == object_gen
}

/// Permission check: does this capability grant the requested rights?
/// Single AND + compare. Nucleus data — no policy involved.
#[inline(always)]
pub fn cap_has_rights(entry: &CapEntry, required: Rights) -> bool {
    entry.cap.permissions.contains(required)
}

/// Revocation flag check on active_ops. Bit 63 = REVOKED_FLAG.
#[inline(always)]
pub fn cap_is_revoked(entry: &CapEntry) -> bool {
    entry.active_ops.load(Ordering::Acquire) & REVOKED_FLAG != 0
}

/// CapOperationGuard: RAII increment/decrement of active_ops counter.
/// Prevents revocation from completing while operations are in flight.
/// Pure atomic arithmetic — no policy.

Hot-path impact: ZERO. The KABI dispatch path (Section 12.3) only calls cap_lookup() + cap_generation_valid() + cap_has_rights() — all Nucleus data operations. The CapPolicy trait is never invoked on the hot path.

KABI dispatch hot path (Nucleus only, ~10 instructions):
  cap_lookup(table, handle)           → XArray slot read
  cap_generation_valid(entry, gen)    → 1 compare
  cap_has_rights(entry, required)     → 1 AND + 1 compare
  active_ops.fetch_add(1, Acquire)    → 1 atomic increment
  → ValidatedCap issued              (no policy call)
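The four steps compose into a single validation function. An illustrative, self-contained model (an `Option` lookup standing in for the XArray slot read):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Bit 63 of active_ops, as in the Nucleus operations above.
const REVOKED_FLAG: u64 = 1 << 63;

struct CapInner { generation: u64, permissions: u64 }
struct CapEntry { cap: CapInner, active_ops: AtomicU64 }

#[derive(Debug, PartialEq)]
enum CapError { BadHandle, Revoked, InsufficientRights }

/// Model of the hot path: cap_lookup -> cap_generation_valid ->
/// cap_has_rights -> active_ops pin. No policy call anywhere.
fn cap_validate_model<'a>(
    entry: Option<&'a CapEntry>, // result of cap_lookup()
    object_gen: u64,
    required: u64,
) -> Result<&'a CapEntry, CapError> {
    let entry = entry.ok_or(CapError::BadHandle)?;
    if entry.cap.generation != object_gen {
        return Err(CapError::Revoked);
    }
    if entry.cap.permissions & required != required {
        return Err(CapError::InsufficientRights);
    }
    // CapOperationGuard increment: pin against concurrent revocation.
    if entry.active_ops.fetch_add(1, Ordering::Acquire) & REVOKED_FLAG != 0 {
        entry.active_ops.fetch_sub(1, Ordering::Release);
        return Err(CapError::Revoked);
    }
    Ok(entry) // stands in for the issued ValidatedCap
}
```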

9.1.1.8.3 Evolvable — Capability Policy (replaceable, ~5-6 KB)

All decision-making logic is in the CapPolicy trait, loaded in Evolvable and replaceable via AtomicPtr swap (same mechanism as PhysAllocPolicy and VmmPolicy):

/// Global capability policy — replaceable via atomic pointer swap.
/// Initialized during Boot Phase 5 (Evolvable init). Swapped via the
/// live evolution framework ([Section 13.18](13-device-classes.md#live-kernel-evolution)) with MonotonicVerifier gate.
/// (Implementation note: `AtomicPtr<T>` requires `T: Sized`, so the cell in
/// practice holds a thin pointer to a `&'static dyn CapPolicy`; it is written
/// as `AtomicPtr<dyn CapPolicy>` here for brevity.)
static CAP_POLICY: AtomicPtr<dyn CapPolicy> = AtomicPtr::new(ptr::null_mut());

/// Replaceable capability policy trait.
///
/// Controls all decision-making logic. Nucleus data structures (CapTable,
/// CapEntry, generation/permission checks) are non-replaceable.
///
/// **Cold-path only.** None of these methods are called during KABI dispatch.
/// They are invoked only on syscall-frequency operations:
///   - capable() / ns_capable(): ~1000/s typical (syscall permission checks)
///   - cap_delegate(): ~10/s (driver loading, service binding)
///   - cap_revoke_subtree(): ~1/s (teardown, credential change)
///   - cap_inherit_exec(): ~10/s (process exec)
///   - evaluate_constraints(): only when CapConstraints are non-default
pub trait CapPolicy: Send + Sync {
    /// Check whether a process holds a system capability.
    /// Called by the SysAPI layer on `capable(N)` / `ns_capable(ns, N)`.
    /// Default implementation: `creds.effective_caps.contains(cap)`.
    fn capable(
        &self,
        creds: &Credentials,
        cap: SystemCaps,
        ns: Option<&UserNamespace>,
    ) -> bool;

    /// Decide whether a delegation is permitted. Called by `cap_delegate()`.
    /// Enforces: rights attenuation, delegation depth limits, tier restrictions,
    /// constraint propagation. Returns child constraints on success.
    fn delegate_check(
        &self,
        parent: &CapEntry,
        requested_rights: Rights,
        target_domain: DomainId,
        target_tier: IsolationTier,
    ) -> Result<CapConstraints, CapError>;

    /// Compute revocation propagation order for a capability subtree.
    /// Called by `cap_revoke()` Phase 2 workqueue items. Returns child
    /// CapIds in traversal order (default: breadth-first via workqueue
    /// scheduling, or custom strategy). The iterator yields one level
    /// at a time — each level's children are processed under a single
    /// spinlock hold, then sub-trees are enqueued to the workqueue.
    fn revocation_order<'a>(
        &self,
        root: &'a CapEntry,
    ) -> RevocationIterator<'a>;

    /// Determine capability inheritance across exec().
    /// Called during execve() for each capability in the process's table.
    /// Returns None to drop the capability, Some(new_constraints) to keep.
    fn inherit_on_exec(
        &self,
        cap: &CapEntry,
        new_creds: &Credentials,
        exec_flags: ExecFlags,
    ) -> Option<CapConstraints>;

    /// Evaluate capability constraints (time bounds, CPU mask, custom).
    /// Called during `cap_validate()` only when constraints are non-default
    /// (the fast path skips this entirely for unconstrained capabilities).
    ///
    /// **Dual expiry enforcement**: When both `CapConstraints.expires_at` and
    /// `CapabilityHeader.expires_at_ns` ([Section 5.7](05-distributed.md#network-portable-capabilities)) are
    /// present, enforcement uses the minimum (earlier) of the two timestamps.
    /// `CapConstraints.expires_at` is the policy-level expiry (set by the granting
    /// authority). `CapabilityHeader.expires_at_ns` is the cryptographic validity
    /// period (set at signing time). Both must be checked.
    fn evaluate_constraints(
        &self,
        constraints: &CapConstraints,
        context: &CapCheckContext,
    ) -> Result<(), CapError>;

    /// Map SystemCaps to PermissionBits for a specific object class.
    /// Called during ValidatedCap creation when the caller provides
    /// SystemCaps instead of a capability handle.
    fn syscaps_to_permissions(
        &self,
        caps: SystemCaps,
        target_class: ObjectClass,
    ) -> PermissionBits;

    /// LSM hook: called after Nucleus permission check succeeds.
    /// The LSM may deny the operation even if capability permissions allow it.
    /// Called only when an LSM is active (SELinux, AppArmor, etc.).
    fn lsm_cap_check(
        &self,
        creds: &Credentials,
        cap: &CapEntry,
        operation: CapOperation,
    ) -> Result<(), CapError>;

    /// Invalidate the delegation check cache. Called as a post-swap callback
    /// during live evolution ([Section 13.18](13-device-classes.md#live-kernel-evolution)). The new CapPolicy
    /// may have different constraint propagation rules, so stale cache entries
    /// could grant or deny capabilities incorrectly. Cold path — once per swap.
    fn rebuild_delegation_cache(&self, cap_table: &CapTable);

    /// Distributed capability operations — cluster revocation protocol.
    /// Called when a revocation must propagate to Tier M peers or remote
    /// nodes in a distributed cluster. Returns when all remote peers
    /// have acknowledged the revocation (or timed out).
    fn cluster_revoke(
        &self,
        cap_id: CapId,
        peer_set: &PeerSet,
    ) -> Result<(), CapError>;
}
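The dual expiry enforcement documented on `evaluate_constraints()` reduces to taking the earlier of the two timestamps. A minimal sketch, with illustrative free-function names (the real check lives inside `evaluate_constraints()` implementations); timestamps are nanoseconds, `None` means no expiry:

```rust
/// Hypothetical helper: the effective expiry is the minimum of the
/// policy-level expiry (CapConstraints.expires_at) and the cryptographic
/// validity period (CapabilityHeader.expires_at_ns), when both exist.
fn effective_expiry_ns(constraint_expiry: Option<u64>, header_expiry: Option<u64>) -> Option<u64> {
    match (constraint_expiry, header_expiry) {
        (Some(a), Some(b)) => Some(a.min(b)), // both present: earlier wins
        (Some(a), None) => Some(a),
        (None, Some(b)) => Some(b),
        (None, None) => None, // unconstrained: never expires
    }
}

fn is_expired(now_ns: u64, constraint_expiry: Option<u64>, header_expiry: Option<u64>) -> bool {
    effective_expiry_ns(constraint_expiry, header_expiry).map_or(false, |e| now_ns >= e)
}
```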

Why each method is in policy (not data):

| Method | Why replaceable | 50-year scenario |
|---|---|---|
| `capable()` | Permission decision logic | New permission models, conditional capabilities |
| `delegate_check()` | Delegation rules | New hardware tiers, PQC delegation signatures |
| `revocation_order()` | Traversal strategy (default: breadth-first via workqueue) | Optimized revocation for new topology patterns |
| `inherit_on_exec()` | Inheritance rules | New exec models (WASM, unikernel), new security boundaries |
| `evaluate_constraints()` | Constraint evaluation | New constraint types (GPU affinity, memory tier, energy budget) |
| `syscaps_to_permissions()` | Mapping logic | New SystemCaps bits, new object classes |
| `lsm_cap_check()` | LSM integration | New security modules, new MAC models |
| `cluster_revoke()` | Distributed protocol | New fabric types, new consensus protocols |
| `rebuild_delegation_cache()` | Cache invalidation on policy swap | New constraint propagation rules |

Delegation cache: delegate_check() results are cached in a per-CapTable DelegationCache to avoid repeated constraint evaluation on hot delegation paths (e.g., service binding during driver probe):

/// Cache of recent delegate_check() results. Keyed by
/// (parent_cap_id, requested_rights, target_tier).
/// Invalidated on policy swap via `rebuild_delegation_cache()`.
/// Memory footprint: ~80 bytes per entry * 64 entries = ~5 KB per domain.
/// Acceptable for warm-path caching (one DelegationCache per isolation domain).
pub struct DelegationCache {
    /// LRU cache, 64 entries. Hot path does NOT use this — only warm
    /// delegation paths (cap_delegate syscall, ~10/s).
    entries: ArrayVec<DelegationCacheEntry, 64>,
}

pub struct DelegationCacheEntry {
    pub parent_id: CapId,
    pub rights: Rights,
    pub tier: IsolationTier,
    pub result: Result<CapConstraints, CapError>,
    /// Policy generation at time of caching. Stale if != current.
    pub policy_gen: u64,
}
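The lookup-or-evaluate flow, including the `policy_gen` staleness filter, can be sketched as follows (types simplified to stand-ins: `()` for `CapConstraints`, `u32` for `CapError`, a `Vec` for the fixed 64-entry LRU; names are illustrative):

```rust
#[derive(Clone, PartialEq)]
struct Key { parent_id: u64, rights: u64, tier: u8 }

struct Entry { key: Key, result: Result<(), u32>, policy_gen: u64 }

struct Cache { entries: Vec<Entry>, current_gen: u64 }

impl Cache {
    /// Return a cached delegate_check() result, or evaluate and cache it.
    fn lookup_or(&mut self, key: Key, eval: impl FnOnce() -> Result<(), u32>) -> Result<(), u32> {
        // Hit only if the key matches AND the entry was cached under the
        // current policy generation — entries from a pre-swap policy are
        // ignored, so a stale policy can never grant or deny incorrectly.
        if let Some(e) = self.entries.iter().find(|e| e.key == key && e.policy_gen == self.current_gen) {
            return e.result;
        }
        let result = eval(); // warm path: full delegate_check() evaluation
        // LRU eviction elided; the real cache is a fixed 64-entry ArrayVec.
        self.entries.push(Entry { key, result, policy_gen: self.current_gen });
        result
    }
}
```

A policy swap bumps `current_gen` (the effect of `rebuild_delegation_cache()`), which invalidates every cached entry at once without walking the cache.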

Policy replacement protocol (same as PhysAllocPolicy):

/// Replace the active capability policy via the live evolution framework.
/// Called during Phase B (stop-the-world) of an Evolvable evolution.
///
/// The MonotonicVerifier ensures the new policy never ALLOWS something
/// the old policy DENIED — live swaps can only tighten security.
pub fn cap_policy_evolve(
    new_policy: &'static dyn CapPolicy,
    cap_table: &CapTable,
) -> Result<(), CapEvolutionError> {
    // 1. Build monotonic test suite from current state.
    let verifier = MonotonicVerifier::build_from_current(cap_table);
    // 2. Verify new policy is monotonically tighter.
    verifier.verify_policy(new_policy)?;
    // 3. Atomic swap (under stop-the-world — no concurrent readers).
    CAP_POLICY.store(new_policy as *const _ as *mut _, Ordering::Release);
    Ok(())
}

MonotonicVerifier — ensures that capability policy evolution only tightens security (new policy never allows what old policy denied):

/// Verification gate for capability policy live replacement.
///
/// Before swapping the policy, the MonotonicVerifier runs a test suite
/// of (CapEntry, Rights, Context) → expected_result pairs. The new
/// policy must:
///   1. DENY everything the old policy denied (monotonic tightening).
///   2. May additionally DENY things the old policy allowed.
///   3. Must NOT ALLOW anything the old policy denied.
///
/// This guarantees that a live swap never opens a security hole.
/// The test suite is generated from the kernel's current capability
/// table — every active capability is tested with its granted rights
/// and with rights it does NOT have.
pub struct MonotonicVerifier {
    /// Test cases: (capability snapshot, required rights, old result).
    test_cases: ArrayVec<MonotonicTestCase, 256>,
}

pub struct MonotonicTestCase {
    pub entry_snapshot: CapEntry,
    pub required_rights: Rights,
    pub context: CapCheckContext,
    /// What the CURRENT (old) policy returns for `evaluate_constraints()`.
    pub old_result: Result<(), CapError>,
    /// Credential snapshot for `capable()` testing.
    pub cred_snapshot: Credentials,
    /// SystemCaps bit to test with `capable()`.
    pub syscap: SystemCaps,
    /// Namespace context for `capable()` testing.
    pub cap_ns: Option<UserNamespaceRef>,
    /// Old `capable()` result (true = allowed, false = denied).
    pub old_capable_result: bool,
    /// Delegation target domain for `delegate_check()`.
    pub target_domain: DomainId,
    /// Delegation target tier for `delegate_check()`.
    pub target_tier: IsolationTier,
    /// Old `delegate_check()` result.
    pub old_delegate_result: Result<CapConstraints, CapError>,
    /// Credential snapshot for `inherit_on_exec()` testing.
    pub exec_creds: Credentials,
    /// ExecFlags snapshot for `inherit_on_exec()` testing.
    pub exec_flags: ExecFlags,
    /// Old `inherit_on_exec()` result (true = inherited).
    pub old_inherit_result: bool,
}

pub struct MonotonicViolation {
    pub entry: ObjectId,
    pub rights: Rights,
    /// Which policy method produced the violation.
    pub method: &'static str,
}

impl MonotonicVerifier {
    /// Build test cases from the current capability table.
    /// Called during Phase A of live evolution.
    pub fn build_from_current(cap_table: &CapTable) -> Self { /* ... */ }

    /// Verify that `new_policy` is monotonically tighter than the current one.
    /// Tests ALL policy methods that make security decisions:
    ///   1. `evaluate_constraints()` — capability constraint evaluation
    ///   2. `capable()` — SystemCaps permission check
    ///   3. `delegate_check()` — delegation permission
    ///   4. `inherit_on_exec()` — execve inheritance filter
    ///   5. LSM hooks (via `lsm_check_cap_validate()`) — if LSM is active
    ///
    /// Each test case exercises ALL five methods with the same (entry, rights,
    /// context) tuple. A monotonic violation in ANY method is a swap-blocking error.
    pub fn verify_policy(
        &self,
        new_policy: &dyn CapPolicy,
    ) -> Result<(), MonotonicViolation> {
        for tc in &self.test_cases {
            // 1. Constraint evaluation
            let new_result = new_policy.evaluate_constraints(
                &tc.entry_snapshot.cap.constraints,
                &tc.context,
            );
            if tc.old_result.is_err() && new_result.is_ok() {
                return Err(MonotonicViolation {
                    entry: tc.entry_snapshot.cap.object_id,
                    rights: tc.required_rights,
                    method: "evaluate_constraints",
                });
            }
            // 2. capable() — test with the entry's SystemCaps
            let old_cap = tc.old_capable_result;
            let new_cap = new_policy.capable(
                &tc.cred_snapshot, tc.syscap, tc.cap_ns.as_deref(),
            );
            if !old_cap && new_cap {
                return Err(MonotonicViolation {
                    entry: tc.entry_snapshot.cap.object_id,
                    rights: tc.required_rights,
                    method: "capable",
                });
            }
            // 3. delegate_check() — delegation permission
            let old_del = &tc.old_delegate_result;
            let new_del = new_policy.delegate_check(
                &tc.entry_snapshot, tc.required_rights,
                tc.target_domain, tc.target_tier,
            );
            if old_del.is_err() && new_del.is_ok() {
                return Err(MonotonicViolation {
                    entry: tc.entry_snapshot.cap.object_id,
                    rights: tc.required_rights,
                    method: "delegate_check",
                });
            }
            // 4. inherit_on_exec() — execve filter
            let old_inh = tc.old_inherit_result;
            let new_inh = new_policy.inherit_on_exec(
                &tc.entry_snapshot, &tc.exec_creds, tc.exec_flags,
            ).is_some();
            // Old-allowed → new-blocked is a valid tightening; the violation
            // is the reverse: old policy blocked inheritance, new allows it.
            if !old_inh && new_inh {
                return Err(MonotonicViolation {
                    entry: tc.entry_snapshot.cap.object_id,
                    rights: tc.required_rights,
                    method: "inherit_on_exec",
                });
            }
        }
        Ok(())
    }
}

Performance analysis:

| Operation | Path | Policy calls | Cost |
|---|---|---|---|
| KABI dispatch (`cap_validate`) | Hot | 0 | Zero — Nucleus data only |
| `cap_validate` with constraints | Warm | 1 (`evaluate_constraints`) | ~3-5 ns (indirect call on ~1-2 μs fault path = <0.3%) |
| `capable()` / `ns_capable()` | Cold | 1 (`capable`) | ~3-5 ns per syscall permission check |
| `cap_delegate()` | Cold | 1 (`delegate_check`) | ~10-20 ns on a ~1-10 μs delegation path |
| `cap_revoke()` | Cold | 1 (`revocation_order`) + N (`cluster_revoke`) | Phase 1: ~1-5 cycles (lock-free). Phase 2: ~1-100 μs total (workqueue, breadth-first). Spinlock hold per node: O(256) iterations max. |
| `execve()` inheritance | Cold | N (`inherit_on_exec`) | ~50-500 ns per exec (N = caps in table) |

Cross-reference: The data/policy split follows the same pattern documented in Section 13.18 for the memory allocator and page table manager. Agentic development workflow benefits from runtime policy replacement in Section 25.17.

9.1.1.9 Capability Table Lifecycle and Garbage Collection

/// Default per-process capability limit (matches Linux `RLIMIT_NOFILE` hard limit).
const CAP_SPACE_DEFAULT_MAX: usize = 65536;

/// Per-process capability table. Maps local `CapHandle` values to `CapEntry`
/// objects. Indexed by the handle's slot number (0-based).
///
/// **Two-level design** (same approach as Linux `fdtable`):
/// - Level 0: inline array of `CAP_SPACE_INLINE` (64) slots — covers >99% of
///   processes without heap allocation. Size: 64 * 8 = 512 bytes on 64-bit
///   (safe for `cap_space_fork()` stack allocation on 16 KiB kernel stacks).
/// - Level 1: if more slots are needed (e.g., high-fd-count servers), a slab-
///   allocated `CapSpaceExtension` is linked. The extension array grows in
///   powers of 2 up to `max_entries`.
///
/// The per-process maximum is adjustable via `setrlimit(RLIMIT_NOFILE)`.
pub struct CapSpace {
    /// Inline slot array for the first 64 handles (avoids heap alloc for
    /// typical processes). `None` = free slot. Uses `Arc<CapEntry>` so
    /// parent and child processes share the same CapEntry object after fork,
    /// enabling shared revocation via `AtomicU64 REVOKED_FLAG` (DD-03).
    /// Arc refcount traffic occurs only on fork/exit (cold paths), not on
    /// capability validation (hot path — `CapOperationGuard::try_new()`
    /// reads the entry through the already-held SpinLock, no Arc clone).
    ///
    /// **Expected hold time during fork**: ~50ns per entry * number of live
    /// entries. For a typical process with 20 open file capabilities, hold
    /// time is ~1us. Worst case (fully populated 256 entries): ~12.8us.
    /// Fork is a cold path; this hold time is acceptable.
    pub inline_slots: SpinLock<[Option<Arc<CapEntry>>; CAP_SPACE_INLINE]>,
    /// Heap-allocated extension for handles >= CAP_SPACE_INLINE.
    /// Allocated on first use; `None` if all handles fit in inline_slots.
    pub extension: SpinLock<Option<CapSpaceExtension>>,
    /// Number of live (non-None) entries across both levels.
    pub live_count: AtomicU32,
    /// Per-process maximum (from RLIMIT_NOFILE). Defaults to CAP_SPACE_DEFAULT_MAX.
    pub max_entries: u32,
    /// Bitmap for O(1) free-slot search. One bit per slot.
    pub free_bitmap: SpinLock<Bitmap>,
}

/// Reduced from 256 to 64 to prevent stack overflow during `cap_space_fork()`:
/// 256 * 8 = 2048 bytes consumed 12.5-25% of the 8-16 KiB kernel stack.
/// 64 * 8 = 512 bytes is safe. The vast majority of processes have <64
/// capabilities; the extension mechanism handles overflow.
const CAP_SPACE_INLINE: usize = 64;

/// Heap-allocated capability table extension for processes exceeding
/// `CAP_SPACE_INLINE` handles. Allocated from the slab allocator in
/// power-of-2 chunks (512, 1024, ... up to `max_entries`).
pub struct CapSpaceExtension {
    /// Slab-allocated slot array. Uses `Arc<CapEntry>` matching `inline_slots`.
    pub slots: *mut Option<Arc<CapEntry>>,
    /// Current capacity of the extension array.
    pub capacity: u32,
}
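The two-level lookup implied by this layout can be sketched with stand-in types (`String` in place of `Arc<CapEntry>`, a `Vec` in place of the slab-allocated extension array; real code holds the appropriate SpinLock first):

```rust
const INLINE: usize = 64; // mirrors CAP_SPACE_INLINE

struct Space {
    inline: [Option<String>; INLINE],
    extension: Option<Vec<Option<String>>>, // slab array in the real kernel
}

/// Resolve a handle's slot number to its entry: handles below INLINE hit
/// the inline array; higher handles index the extension (slot 0 of the
/// extension corresponds to handle INLINE).
fn lookup(space: &Space, handle: usize) -> Option<&String> {
    if handle < INLINE {
        space.inline[handle].as_ref()
    } else {
        space.extension.as_ref()?.get(handle - INLINE)?.as_ref()
    }
}
```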

Per-process capability tables (CapSpace) are small arrays indexed by local handle (typically <256 entries per process). Lifecycle:

  • Allocation: Capability table entries are allocated from the process's CapSpace when a capability is created (e.g., open() → new FileDescriptor capability) or received via IPC delegation. Each entry is reference-counted: the CapEntry holds a strong reference to the underlying kernel object.

  • Deallocation: When a capability handle is explicitly closed (e.g., close(fd)), the entry is removed from the CapSpace, and the kernel object's reference count is decremented. If the reference count reaches zero, the kernel object is freed.

  • Process exit: On process exit (do_exit), the kernel iterates the process's CapSpace and drops every entry. For generation-based objects, this decrements the reference count (the generation counter is untouched — other processes' capabilities remain valid if the object is still alive). For indirection-based objects, the indirection entry is marked "revoked" and scheduled for RCU-deferred reclamation.

  • Table exhaustion prevention: CapSpace has a configurable per-process maximum (default: 65536 entries, matching Linux's RLIMIT_NOFILE default hard limit). Attempts to allocate beyond this limit fail with -EMFILE. The system-wide total of live capability entries is bounded by the slab allocator's memory pressure feedback — under memory pressure, capability creation fails with -ENOMEM like any other kernel allocation. There is no unbounded capability table growth.
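The exhaustion rule in the last bullet can be sketched as a hypothetical allocation helper (the real free-slot search is a word-scanned bitmap, not the linear scan shown here):

```rust
const EMFILE: i32 = 24; // POSIX errno for "too many open files"

/// Allocate a free slot, enforcing the per-process maximum. Fails with
/// -EMFILE when the table is at its RLIMIT_NOFILE-derived limit, exactly
/// like fd allocation.
fn alloc_handle(free_bitmap: &mut Vec<bool>, live_count: &mut u32, max_entries: u32) -> Result<usize, i32> {
    if *live_count >= max_entries {
        return Err(-EMFILE);
    }
    let slot = free_bitmap.iter().position(|&free| free).ok_or(-EMFILE)?;
    free_bitmap[slot] = false; // mark allocated
    *live_count += 1;
    Ok(slot)
}
```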

9.1.1.10 fork() Inheritance Semantics

When a process calls fork() (or clone() without CLONE_THREAD), the child receives a deep copy of the handle table but shares the underlying capability graph nodes with the parent. This gives the child independent handle-to-capability bindings while maintaining shared revocation semantics across the fork boundary.

Concrete mechanism:

  1. Handle table (deep copy): The child's CapSpace is a fresh allocation with all entries copied from the parent's CapSpace. Each slot in the parent's inline_slots (and extension, if allocated) is duplicated into the child's corresponding slot. The child's live_count, max_entries, and free_bitmap are copied from the parent. After the copy, parent and child have independent handle tables — closing a handle in one does not affect the other.

  2. CapEntry graph (shared via Arc): Each CapEntry referenced by a handle slot is wrapped in Arc<CapEntry>. The fork copy increments the Arc refcount for every live slot. Parent and child both hold Arc references to the same CapEntry objects. The CapEntry's active_ops, children list, and the underlying Capability (including generation, permissions, constraints) are shared state.

  3. Revocation propagation: Because parent and child share the same CapEntry objects, revoking a capability in the parent (via cap_revoke(), which sets the REVOKED_FLAG in active_ops and increments the object's generation) immediately invalidates the capability in the child as well — both see the same AtomicU64 flag. This is the intended behavior: revocation is a security operation that must propagate to all holders of the same capability, including fork children.

  4. Independent handle manipulation: The child can independently:

     • Close handles (drops the Arc<CapEntry>, decrements refcount; the CapEntry is freed only when the last Arc holder drops it).
     • Delegate capabilities (adds a child to the shared CapEntry.children list; the delegation is visible to all Arc holders).
     • Receive new capabilities (allocated in the child's own CapSpace with no effect on the parent's handle table).

  5. No Copy-on-Write: CoW was considered and rejected. CoW handle tables would require deferred splitting on write, complicating the revocation path: a CoW page fault during cap_revoke() (which may run with interrupts disabled or under spinlock) is not acceptable. The deep copy of the handle table is O(N) where N = live_count (typically <256), taking <10 us for the common case.

Thread creation (clone(CLONE_THREAD)): Threads within the same process share the parent's CapSpace directly (no copy). This matches the thread model where all threads operate on the same Process and its cap_table field.

/// Fork the capability space for a new child process.
///
/// Deep-copies the handle table; shares CapEntry objects via Arc refcount.
/// Called from do_fork() step 9a (after fd table copy, before namespace setup).
///
/// # Errors
///
/// Returns ENOMEM if the child's CapSpace allocation fails (inline_slots
/// is stack-allocated so only the extension, if needed, may fail).
pub fn cap_space_fork(parent: &CapSpace) -> Result<CapSpace, Errno> {
    // Acquire both locks before copying. This ensures live_count is computed
    // from a consistent snapshot — a concurrent cap_close() cannot decrement
    // the parent's count between lock release and the live_count read.
    let parent_inline = parent.inline_slots.lock();
    let parent_ext_guard = parent.extension.lock();

    let child_inline: [Option<Arc<CapEntry>>; CAP_SPACE_INLINE] = {
        let mut slots = [const { None }; CAP_SPACE_INLINE];
        for (i, slot) in parent_inline.iter().enumerate() {
            if let Some(ref entry) = slot {
                // Arc::clone — shared CapEntry, independent handle slot.
                slots[i] = Some(Arc::clone(entry));
            }
        }
        slots
    };

    let child_extension = if let Some(ref ext) = *parent_ext_guard {
        Some(cap_space_extension_fork(ext)?)
    } else {
        None
    };

    // Compute live_count from the copied slots while both locks are held.
    // This is the definitive count: inline occupied slots + extension occupied slots.
    let inline_live = child_inline.iter().filter(|s| s.is_some()).count() as u32;
    let ext_live = child_extension.as_ref().map_or(0u32, |e| e.live_count());
    let total_live = inline_live + ext_live;

    let child_bitmap = parent.free_bitmap.lock().clone();

    drop(parent_ext_guard);
    drop(parent_inline);

    Ok(CapSpace {
        inline_slots: SpinLock::new(child_inline),
        extension: SpinLock::new(child_extension),
        live_count: AtomicU32::new(total_live),
        max_entries: parent.max_entries,
        free_bitmap: SpinLock::new(child_bitmap),
    })
}

exec() behavior: On exec(), the process's CapSpace is cleared — all handles are dropped (decrementing Arc refcounts on each CapEntry). The new program starts with an empty capability table. Capabilities marked with CAP_INHERITABLE in the exec-time capability grant table (see Exec Capability Grants below) are re-granted as fresh CapEntry objects (not inherited from the pre-exec table).

9.1.1.11 Exec Capability Grants

On execve(), the task's capability sets are transformed following Linux's cap_bprm_creds_from_file() semantics. This transformation determines which capabilities the new program receives, based on the executable file's capability metadata and the task's pre-exec capability state.

Transformation steps (executed atomically under the task's credential lock):

  1. File capability extraction: Read the file's permitted and effective capability sets from the security.capability extended attribute (xattr). If the file has no security.capability xattr, file_permitted and file_inheritable are both empty, and file_effective is false.

  2. New permitted set computation:

    new_permitted = (file_permitted & cap_bset) | (file_inheritable & old_inheritable)
    
    where cap_bset is the task's capability bounding set (inherited from parent, can only be reduced — never expanded — via prctl(PR_CAPBSET_DROP)). The bounding set acts as an upper bound: even if the file grants CAP_SYS_ADMIN, a task whose bounding set lacks it will not receive it.

  3. New effective set:

    new_effective = file_effective_flag ? new_permitted : empty
    
    If the file's effective flag is set (the VFS_CAP_FLAGS_EFFECTIVE bit in the xattr), all permitted capabilities become effective immediately. Otherwise, the new program starts with an empty effective set and must explicitly raise capabilities via prctl(PR_CAP_AMBIENT_RAISE) or capability-aware code.

  4. Inheritable set: The inheritable set is unchanged across execve (matching Linux semantics and Section 9.9 §9.9.4 step 7). The inheritable set's effect on exec is indirect — it contributes to the new permitted set via (old_inheritable & file_inheritable), but the inheritable set itself is preserved:

    new_inheritable = old_inheritable    // unchanged across exec
    new_permitted  |= old_inheritable & file_inheritable  // inheritable's contribution
    

  5. Ambient capabilities: If the task has ambient capabilities set (via prctl(PR_CAP_AMBIENT_RAISE)), they are added to both permitted and effective sets:

    new_permitted |= ambient
    new_effective |= ambient
    
    After raising into permitted and effective, the ambient set itself is cleared for this exec (matching Linux behavior: ambient caps are consumed during each exec, not retained across it). To re-establish ambient caps in the child, the new program must call prctl(PR_CAP_AMBIENT_RAISE). Ambient caps provide a mechanism for unprivileged programs to inherit specific capabilities without requiring file capabilities on the executable.

  6. Setuid-root special case: If !old.securebits.contains(NOROOT) && new.euid == 0, the bounding set is ORed into the already-computed permitted set (preserving the file-capability and ambient grants accumulated in steps 2-5):

    new_permitted = new_permitted | cap_bset
    new_effective = new_permitted   // full effective if euid == 0
    
    This matches Linux cap_bprm_creds_for_exec(): new->cap_permitted = cap_combine(new->cap_permitted, new->cap_bset). Note: this is an OR operation, not assignment — the grants accumulated in steps 2-5 are preserved. The canonical step-by-step algorithm is in Section 9.9 (execve_transform_caps() step 5a). The bounding set still applies — prctl(PR_CAPBSET_DROP) before exec can restrict even setuid-root binaries.

  7. CapSpace clearing: All capability handles (file descriptors, device handles, etc.) in CapSpace.inline_slots and CapSpace.extension are dropped. Capability handles do not survive exec — only the POSIX capability sets (effective, permitted, inheritable) are transformed. The new program starts with an empty handle table and must open/acquire resources explicitly.
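The set transformations in steps 2-6 reduce to a few bitmask operations. A self-contained sketch, with u64 bitmasks standing in for the capability sets and the securebits/euid condition collapsed to a single `root_exec` flag (names are illustrative, not the kernel's actual API):

```rust
struct ExecCapInput {
    file_permitted: u64,
    file_inheritable: u64,
    file_effective: bool,   // VFS_CAP_FLAGS_EFFECTIVE bit from the xattr
    old_inheritable: u64,
    cap_bset: u64,          // task bounding set (upper bound on file grants)
    ambient: u64,
    root_exec: bool,        // new euid == 0 and NOROOT securebit clear
}

/// Returns (permitted, effective, inheritable) for the post-exec task.
fn transform(i: &ExecCapInput) -> (u64, u64, u64) {
    // Step 2: file caps bounded by cap_bset, plus the inheritable intersection.
    let base = (i.file_permitted & i.cap_bset) | (i.file_inheritable & i.old_inheritable);
    // Step 3: effective is all-or-nothing on the file's effective flag.
    // Step 5: ambient caps raise into both permitted and effective.
    let mut permitted = base | i.ambient;
    let mut effective = (if i.file_effective { base } else { 0 }) | i.ambient;
    // Step 6: setuid-root ORs (does not assign) the bounding set into permitted.
    if i.root_exec {
        permitted |= i.cap_bset;
        effective = permitted;
    }
    // Step 4: inheritable is unchanged across exec.
    (permitted, effective, i.old_inheritable)
}
```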

/// Clear a process's CapSpace on exec and re-grant inheritable capabilities.
///
/// Lock ordering: inline_slots lock → extension lock → free_bitmap lock.
/// Concurrent cap_validate() from another thread (via a signal handler)
/// will see REVOKED entries and fail validation — safe.
fn cap_space_exec_clear(
    cap_table: &CapSpace,
    grants: &[ExecCapGrant],
) {
    // 1. Clear inline slots (under inline_slots lock).
    {
        let mut slots = cap_table.inline_slots.lock();
        for slot in slots.iter_mut() {
            if let Some(entry) = slot.take() {
                drop(entry); // Drop CapEntry, decrement Arc refcount
            }
        }
    }
    // 2. Drop the extension, if any, releasing every slot's Arc<CapEntry>
    //    and the slab-allocated backing array. CapSpaceExtension owns a raw
    //    slot array; its Drop impl walks the slots, drops each entry, and
    //    returns the allocation to the slab.
    {
        let mut ext = cap_table.extension.lock();
        *ext = None;
    }
    // 3. Reset counters and bitmap.
    cap_table.live_count.store(0, Relaxed);
    {
        let mut bitmap = cap_table.free_bitmap.lock();
        bitmap.set_all_free();
    }
    // 4. Re-grant inheritable capabilities from the exec grant table.
    //    Grant tables are small (<CAP_SPACE_INLINE entries), so every
    //    re-granted handle lands in the inline slot array.
    for grant in grants {
        let entry = Arc::new(CapEntry::new(grant.cap_type, grant.user_ns_id));
        let slot_idx = cap_table.alloc_slot(); // finds first free via bitmap
        cap_table.inline_slots.lock()[slot_idx] = Some(entry);
        cap_table.live_count.fetch_add(1, Relaxed);
    }
}

user_ns_id assignment for exec-granted capabilities: Capabilities granted by exec (file capabilities from security.capability xattr) use the executed file's owning user namespace: typically init_user_ns (ns_id = 0) for suid binaries on the root filesystem. Capabilities from the ambient set inherit the creating task's user_ns_id (as set at prctl(PR_CAP_AMBIENT_RAISE) time).

no_new_privs enforcement: If the task has PR_SET_NO_NEW_PRIVS active (set by prctl(PR_SET_NO_NEW_PRIVS, 1)), steps 1-6 are constrained: the new permitted set cannot exceed the old permitted set. This prevents privilege escalation via setuid or file capabilities when no_new_privs is in effect (required for seccomp-BPF without CAP_SYS_ADMIN).
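One possible enforcement of this constraint is an intersection applied after the step 1-6 transformation (bitmask stand-in; hypothetical helper, shown only to make the "cannot exceed" rule concrete):

```rust
/// Under no_new_privs, the computed permitted set is clamped to the
/// pre-exec permitted set, so neither setuid nor file capabilities can
/// add bits the task did not already hold.
fn clamp_no_new_privs(old_permitted: u64, computed_permitted: u64, nnp: bool) -> u64 {
    if nnp { computed_permitted & old_permitted } else { computed_permitted }
}
```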

9.1.1.12 Object Registry

The object registry maps ObjectId values to kernel objects. It is the central data structure enabling capability validation and revocation.

/// Global object registry. Maps ObjectId → kernel object with generation-based
/// revocation. The registry is partitioned per-CPU for allocation (reducing
/// contention) but globally readable for validation (via RCU).
///
/// Slot layout: each slot holds an object pointer, a generation counter, and
/// a type tag. ObjectId encodes both the slot index and the expected generation.
/// Validation compares the ObjectId's generation against the slot's current
/// generation — a mismatch means the capability has been revoked.

/// Unique identifier for a kernel object. Encodes a slot index and generation
/// counter. Two ObjectIds with the same slot index but different generations
/// refer to different objects (the slot was recycled).
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
// kernel-internal, not KABI — ObjectIdWire is the cross-boundary wire format.
#[repr(C)]
pub struct ObjectId {
    /// Slot index in the object registry (lower 32 bits).
    /// Maximum 2^32 concurrent objects (~4 billion). In practice, typical
    /// systems use far fewer (tens of thousands of open files, sockets, etc.).
    ///
    /// **Longevity analysis (u32):** This is a concurrent-slot index, not a
    /// monotonic counter. Slots are reused via the freelist; the generation
    /// counter (u64) prevents stale-handle collisions. The maximum number of
    /// *simultaneously live* objects is bounded by physical memory (~100K-10M
    /// objects on typical systems). u32 provides 4 billion slots, which exceeds
    /// any plausible concurrent object count. Widening the slot index to
    /// u64 would grow ObjectId from 12 to 16 bytes for no practical benefit.
    slot: u32,
    /// Generation counter. Incremented each time a slot is recycled.
    /// Prevents stale capabilities from accessing new objects allocated in
    /// the same slot. u64 to match `Capability.generation` and the slot's
    /// `AtomicU64`; wrapping is not a practical concern (see security
    /// analysis below).
    generation: u64,
}

/// A single slot in the object registry.
struct ObjectSlot {
    /// Pointer to the kernel object. NULL when the slot is free.
    /// The pointer type is erased; the `type_tag` field identifies the
    /// actual type for safe downcasting.
    object: AtomicPtr<()>,
    /// Current generation counter. Incremented on each free/reuse cycle.
    /// Capabilities holding an older generation are automatically invalid.
    /// u64 to match `Capability.generation` and `ObjectId.generation`.
    generation: AtomicU64,
    /// Type discriminant for safe downcasting. Matches the `ObjectType` enum.
    type_tag: AtomicU8,
    /// Free list link. When the slot is free, this holds the index of the
    /// next free slot (forming a per-CPU free list). When allocated, unused.
    next_free: AtomicU32,
    /// Destruction callback for the type-erased object pointer. Stored at
    /// registration time to avoid assuming Box provenance during revocation.
    /// The callback casts the `*mut ()` back to the original concrete type
    /// and drops it (e.g., `|p| drop(Box::from_raw(p as *mut ConcreteType))`).
    /// Null when the slot is free.
    drop_fn: AtomicPtr<()>,  // Actually unsafe fn(*mut ()), stored as raw ptr
}

/// Object type tags for safe downcasting from erased pointers.
#[repr(u8)]
pub enum ObjectType {
    None = 0,       // Slot is free
    File = 1,
    Socket = 2,
    Process = 3,
    Thread = 4,
    MemoryRegion = 5,
    Device = 6,
    IpcChannel = 7,
    Timer = 8,
    Signal = 9,
    // Extensible: new types added as subsystems are implemented.
}
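The generation comparison at the heart of validation can be sketched without the RCU/atomic machinery (stand-in types; the real path reads `AtomicPtr`/`AtomicU64` under an RCU read-side section):

```rust
struct Slot { object: Option<&'static str>, generation: u64 }

/// Validate an ObjectId against the registry: the handle is live only if
/// the slot's current generation equals the generation baked into the id.
fn validate(slots: &[Slot], slot_idx: u32, id_generation: u64) -> Option<&'static str> {
    let s = slots.get(slot_idx as usize)?;
    // A mismatch means the slot was freed and recycled since the id was
    // minted — every capability holding the old generation is implicitly
    // revoked, with no per-capability bookkeeping.
    if s.generation != id_generation {
        return None;
    }
    s.object
}
```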

/// Per-object security metadata. Attached to every kernel object that is
/// addressable through the capability system. Provides the ownership and
/// access-control information needed to validate capability operations
/// without consulting a central authority.
///
/// Used by the Unified Object Namespace ([Section 20.5](20-observability.md#unified-object-namespace))
/// as the `security` field of each `Object` entry, and by the capability
/// validation path ([Section 9.2](#permission-and-acl-model--linux-permission-emulation)) when
/// checking POSIX permission compatibility.
// kernel-internal, not KABI — accessed only within Tier 0 capability subsystem.
#[repr(C)]
pub struct SecurityDescriptor {
    /// Credential ID of the object's owner. Indexes into the credential
    /// table maintained by the process subsystem ([Section 8.1](08-process.md#process-and-task-management)).
    /// All permission checks against this object use `owner` as the
    /// reference identity for "owner" permission bits.
    pub owner: CredId,

    /// Credential ID of the owning group (POSIX group semantics).
    /// Used when mapping POSIX group permission bits to capability checks.
    pub group: CredId,

    /// Bitmask of permitted operations on this object. The encoding
    /// matches `PermissionBits` — the same bitfield used in `Capability`.
    /// This is the *object-side* maximum: a capability can grant at most
    /// the permissions recorded here. Operations not set in `access_mask`
    /// are unconditionally denied regardless of the requesting capability's
    /// permissions.
    pub access_mask: u64,

    /// Security label index for mandatory access control (MAC). When an
    /// LSM module (e.g., SELinux, AppArmor emulation) is active, this
    /// indexes into the LSM label table. Zero when no MAC policy is loaded.
    pub label_id: u32,
}
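Because `access_mask` is the object-side maximum, the permissions effective for any given access are the intersection of the capability's grant and the descriptor's mask. A one-line illustration (bit names are illustrative, not the actual `PermissionBits` encoding):

```rust
/// Effective permissions = capability grant ∩ object-side maximum.
/// Operations outside `access_mask` are denied no matter what the
/// capability claims to grant.
fn effective_perms(cap_perms: u64, access_mask: u64) -> u64 {
    cap_perms & access_mask
}

fn main() {
    const READ: u64 = 1 << 0;
    const WRITE: u64 = 1 << 1;
    // Capability grants read+write, but the object only permits read.
    assert_eq!(effective_perms(READ | WRITE, READ), READ);
}
```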

/// Credential identifier. Opaque 32-bit index into the per-namespace
/// credential table. Corresponds to a (uid, gid, supplementary groups)
/// tuple in the POSIX compatibility layer. The kernel never exposes raw
/// uid/gid to subsystems — all identity checks go through `CredId`.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct CredId(pub u32);

/// Registry of kernel objects accessible via capabilities.
/// Initialized from the boot allocator (Section 4.1) before the
/// slab allocator is ready. The `slots` array is allocated via
/// `BootAlloc::alloc_array::<ObjectSlot>(capacity)` at early boot
/// and has kernel lifetime — it is never freed.
///
/// Capacity is determined at boot from the platform's maximum
/// expected concurrent kernel objects (default: 65536 slots,
/// ~2MB at 32 bytes/slot). This is a boot-time parameter, not
/// a compile-time constant.
pub struct ObjectRegistry {
    /// Boot-allocator allocated array of slots. Length = `capacity`.
    /// Raw pointer because the allocation predates the type system's
    /// allocator infrastructure. Slots are RCU-readable (no lock for
    /// lookup) and locked for mutation.
    ///
    /// Capacity tiers (determined from discovered system memory at boot):
    /// - ≤1 GB: 65,536 slots
    /// - ≤16 GB: 262,144 slots
    /// - ≤256 GB: 1,048,576 slots
    /// - >256 GB: 4,194,304 slots
    slots: *mut ObjectSlot,
    /// Number of slots (set at init time, never changes).
    capacity: u32,
    /// Number of currently used slots.
    count: AtomicU32,
    /// Freelist head (index into slots, u32::MAX = empty).
    freelist_head: AtomicU32,
    /// Per-CPU free lists for lock-free allocation. Each CPU maintains a
    /// local free list head. When empty, the CPU steals a batch from the
    /// global free list (under a lock).
    per_cpu_free: PerCpu<AtomicU32>,
    /// Global free list for overflow. Protected by a spinlock, accessed
    /// only when a per-CPU list is exhausted.
    global_free_head: SpinLock<u32>,
}

// # Initialization
// ObjectRegistry::init(boot_alloc, capacity) is called from umka_core::early_init()
// before slab_init(). It uses BootAlloc to allocate the slots array.
// After slab_init(), no further BootAlloc allocations are needed by ObjectRegistry.
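The two-level free-list policy behind `alloc_slot()`/`free_slot()` can be modeled in user space. This is a simplified single-lock sketch using `Vec` stacks — the batch size (16) and struct names are assumptions; the real per-CPU fast path is lock-free atomics:

```rust
// User-space model of the two-level free list: each CPU pops from its
// local list; on empty it steals a batch from the global list under a lock.
use std::sync::Mutex;

const STEAL_BATCH: usize = 16; // assumed batch size

struct FreeLists {
    per_cpu: Vec<Vec<u32>>,  // one stack of free slot indices per CPU
    global: Mutex<Vec<u32>>, // overflow list, lock-protected
}

impl FreeLists {
    fn new(cpus: usize, capacity: u32) -> Self {
        FreeLists {
            per_cpu: (0..cpus).map(|_| Vec::new()).collect(),
            global: Mutex::new((0..capacity).rev().collect()),
        }
    }

    /// Fast path: pop from the local list (lock-free in the real kernel).
    /// Slow path: refill a batch from the global list, then retry.
    fn alloc_slot(&mut self, cpu: usize) -> Option<u32> {
        if let Some(idx) = self.per_cpu[cpu].pop() {
            return Some(idx);
        }
        let mut global = self.global.lock().unwrap();
        let take = STEAL_BATCH.min(global.len());
        if take == 0 {
            return None; // registry exhausted
        }
        let start = global.len() - take;
        self.per_cpu[cpu].extend(global.drain(start..));
        drop(global);
        self.per_cpu[cpu].pop()
    }

    /// Return a slot to the caller's per-CPU list.
    fn free_slot(&mut self, cpu: usize, idx: u32) {
        self.per_cpu[cpu].push(idx);
    }
}

fn main() {
    let mut fl = FreeLists::new(2, 64);
    let a = fl.alloc_slot(0).unwrap();
    let b = fl.alloc_slot(1).unwrap();
    assert_ne!(a, b); // different CPUs draw from disjoint batches
    fl.free_slot(0, a);
    assert_eq!(fl.alloc_slot(0), Some(a)); // LIFO reuse on the same CPU
}
```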

Operations:

impl ObjectRegistry {
    /// Allocate a slot and register an object. Returns an ObjectId that
    /// encodes the slot index and current generation.
    ///
    /// Called when creating files, sockets, processes, etc.
    /// Allocation is lock-free on the fast path (per-CPU free list).
    ///
    /// The `drop_fn` callback is stored alongside the object pointer and
    /// called during revocation to destroy the object. This avoids assuming
    /// Box provenance on a type-erased pointer — the caller provides the
    /// correct destructor for the concrete type at registration time.
    pub fn register<T: KernelObject>(&self, object: *mut T, type_tag: ObjectType) -> Result<ObjectId> {
        /// Type-safe destructor trampoline. Called during revocation to drop
        /// the concrete object behind the type-erased pointer.
        ///
        /// # Safety
        /// `ptr` must point to a valid `T` allocated via `Box::new`.
        unsafe fn drop_trampoline<T>(ptr: *mut ()) {
            drop(Box::from_raw(ptr as *mut T));
        }

        let guard = preempt_disable();
        let slot_idx = self.alloc_slot(&guard)?;
        // SAFETY: alloc_slot() only hands out indices < capacity.
        let slot = unsafe { &*self.slots.add(slot_idx as usize) };
        // Acquire pairs with the Release in revoke(), ensuring a reused slot's
        // ObjectId carries the post-revocation generation. This prevents ABA
        // capability confusion regardless of free-list synchronization details.
        let gen = slot.generation.load(Ordering::Acquire);
        slot.object.store(object as *mut (), Ordering::Release);
        slot.type_tag.store(type_tag as u8, Ordering::Release);
        slot.drop_fn.store(drop_trampoline::<T> as *mut (), Ordering::Release);
        Ok(ObjectId { slot: slot_idx, generation: gen })
    }

    /// Validate an ObjectId: check that the slot's generation matches and
    /// the object is still alive. Called from `cap_validate()` under
    /// `rcu_read_lock()`.
    ///
    /// Returns the object pointer if valid, or CapError::Revoked if the
    /// generation has advanced (capability was revoked).
    pub fn lookup(&self, id: ObjectId) -> Result<(*mut (), ObjectType), CapError> {
        if id.slot >= self.capacity {
            return Err(CapError::Revoked);
        }
        // SAFETY: slot index bounds-checked against capacity above.
        let slot = unsafe { &*self.slots.add(id.slot as usize) };
        let gen = slot.generation.load(Ordering::Acquire);
        if gen != id.generation {
            return Err(CapError::Revoked);
        }
        let ptr = slot.object.load(Ordering::Acquire);
        if ptr.is_null() {
            return Err(CapError::Revoked);
        }
        let tag = slot.type_tag.load(Ordering::Acquire);
        Ok((ptr, ObjectType::from_u8(tag)))
    }

    /// Revoke all capabilities referring to an object. Increments the slot's
    /// generation counter, making all existing ObjectIds for this slot invalid.
    /// The actual object is freed via RCU deferred callback after a grace period
    /// (ensuring no concurrent `lookup()` readers see a dangling pointer).
    ///
    /// Object destruction uses the `drop_fn` callback stored at registration
    /// time, not `Box::from_raw` — this ensures correct provenance for the
    /// type-erased pointer regardless of how the object was originally allocated.
    pub fn revoke(&self, id: ObjectId) -> Result<()> {
        // SAFETY: id.slot was issued by register() and is < capacity.
        let slot = unsafe { &*self.slots.add(id.slot as usize) };
        // Increment generation — all existing capabilities become invalid.
        slot.generation.fetch_add(1, Ordering::Release);
        // Null the pointer (new lookups will fail immediately).
        let old_ptr = slot.object.swap(core::ptr::null_mut(), Ordering::AcqRel);
        // Load the destruction callback that was stored at registration time.
        let destructor = slot.drop_fn.swap(core::ptr::null_mut(), Ordering::AcqRel);
        slot.type_tag.store(ObjectType::None as u8, Ordering::Release);
        // Defer actual object destruction until RCU grace period completes.
        // This ensures no concurrent reader (in rcu_read_lock) sees a dangling pointer.
        let registry = self as *const Self;
        rcu_call(move || {
            // SAFETY: old_ptr was valid when registered; RCU grace period
            // ensures no readers hold references. destructor was set by
            // register() and matches the concrete type behind old_ptr.
            if !old_ptr.is_null() && !destructor.is_null() {
                let drop_fn: unsafe fn(*mut ()) =
                    unsafe { core::mem::transmute(destructor) };
                unsafe { drop_fn(old_ptr); }
            }
            // Return slot to per-CPU free list AFTER RCU grace period.
            // This prevents slot reuse while readers from the revocation
            // epoch may still be active (ABA prevention).
            // SAFETY: registry is &'static and outlives all RCU callbacks.
            unsafe { (*registry).free_slot(id.slot); }
        });
        Ok(())
    }
}

Security analysis — generation counter wrapping: a u64 generation wraps only after 2^64 reuses of the same slot (~1.8×10^19). Even at an extreme hypothetical rate of 10 million slot reuses per second, wrap would take ~58,000 years — not a practical concern. The u64 width (matching Capability.generation) eliminates generation wrap as a security consideration, at the cost of 4 additional bytes per ObjectId and 4 bytes per ObjectSlot compared to a u32 design (negligible given typical slot counts).
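The generation-check mechanics can be modeled in user space. This simplified sketch (plain fields instead of atomics, no RCU deferral) shows why revocation is O(1): bumping the slot generation invalidates every outstanding ObjectId without scanning them:

```rust
// Model of generation-based O(1) revocation: an ObjectId embeds the slot
// generation at registration time; revoke() bumps the generation, so all
// stale ids fail lookup. Names mirror the kernel sketch above.
#[derive(Copy, Clone, Debug, PartialEq)]
struct ObjectId { slot: u32, generation: u64 }

struct Slot { generation: u64, live: bool }

struct Registry { slots: Vec<Slot> }

impl Registry {
    fn register(&mut self, slot: u32) -> ObjectId {
        let s = &mut self.slots[slot as usize];
        s.live = true;
        ObjectId { slot, generation: s.generation }
    }

    fn lookup(&self, id: ObjectId) -> Result<(), &'static str> {
        let s = &self.slots[id.slot as usize];
        if s.generation != id.generation || !s.live {
            return Err("Revoked");
        }
        Ok(())
    }

    fn revoke(&mut self, id: ObjectId) {
        let s = &mut self.slots[id.slot as usize];
        s.generation += 1; // all outstanding ids for this slot now fail
        s.live = false;
    }
}

fn main() {
    let mut r = Registry { slots: vec![Slot { generation: 0, live: false }] };
    let id = r.register(0);
    assert!(r.lookup(id).is_ok());
    r.revoke(id);
    assert_eq!(r.lookup(id), Err("Revoked"));
    // Slot reuse hands out a new generation; the old id stays dead (no ABA).
    let id2 = r.register(0);
    assert!(r.lookup(id2).is_ok());
    assert_eq!(r.lookup(id), Err("Revoked"));
}
```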

Continued in Section 9.2: Linux permission emulation, system administration capabilities (SystemCaps bitflags), dual ACL model (POSIX draft + NFSv4), driver sandboxing, and security-by-default design.

9.2 Permission and ACL Model

9.2.1 Linux Permission Emulation

Traditional Unix permissions (UIDs, GIDs, file modes, POSIX capabilities) are emulated on top of the UmkaOS capability model. The translation is transparent to applications:

  • UIDs/GIDs: Mapped to capability sets. uid == 0 grants a broad (but still bounded) capability set, not unlimited access.
  • File modes: Translated to per-file capability checks at open time.
  • POSIX capabilities (CAP_NET_RAW, CAP_SYS_ADMIN, etc.): Each maps to a specific set of UmkaOS capabilities.
  • Supplementary groups: Expand the effective capability set.

Applications see standard getuid(), stat(), access() behavior.
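The open-time file-mode translation can be sketched as follows. Bit values and helper names are illustrative — the kernel's actual PermissionBits encoding is not specified here:

```rust
// Sketch: reduce a POSIX mode to the rwx triplet that applies to the
// caller (owner / group / other precedence, as in classic Unix), then
// widen it to capability permission bits checked at open time.
const PERM_READ: u64 = 1 << 0;  // illustrative bit values
const PERM_WRITE: u64 = 1 << 1;
const PERM_EXEC: u64 = 1 << 2;

fn mode_to_perms(mode: u32, owner: u32, group: u32, uid: u32, gid: u32) -> u64 {
    let triplet = if uid == owner {
        (mode >> 6) & 0o7
    } else if gid == group {
        (mode >> 3) & 0o7
    } else {
        mode & 0o7
    };
    let mut perms = 0;
    if triplet & 0o4 != 0 { perms |= PERM_READ; }
    if triplet & 0o2 != 0 { perms |= PERM_WRITE; }
    if triplet & 0o1 != 0 { perms |= PERM_EXEC; }
    perms
}

fn main() {
    // 0o640: owner rw-, group r--, other ---
    assert_eq!(mode_to_perms(0o640, 1000, 100, 1000, 100), PERM_READ | PERM_WRITE);
    assert_eq!(mode_to_perms(0o640, 1000, 100, 2000, 100), PERM_READ);
    assert_eq!(mode_to_perms(0o640, 1000, 100, 2000, 200), 0);
}
```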

9.2.2 System Administration Capabilities

UmkaOS defines a set of core capabilities that map to Linux's POSIX capabilities. These are the native UmkaOS capabilities that form the basis for permission checks:

bitflags! {
    /// Core system administration and access control capabilities.
    /// These are the UmkaOS-native capabilities that correspond to Linux's
    /// POSIX capability set, but with a more granular and typed design.
    /// **Backing type**: `u128` provides 128 bits — enough for all Linux POSIX
    /// capabilities (currently 41, bits 0-40) and UmkaOS-native capabilities,
    /// with room for growth. Linux has added ~1 capability per release cycle;
    /// 128 bits gives decades of headroom. On 64-bit platforms, `u128` is
    /// two registers — capability checks remain fast (two AND+CMP operations).
    ///
    /// **Layout**: Bits 0-63 are reserved for POSIX-compatible capabilities
    /// (matching Linux numbering exactly — bit N = Linux `CAP_*` number N).
    /// Bits 64-127 are UmkaOS-native capabilities that have no Linux equivalent.
    /// Bits 41-63 are reserved for future Linux capabilities.
    pub struct SystemCaps: u128 {
        // ===== POSIX capabilities (bits 0-40, matching Linux exactly) =====
        // These bit positions are 1:1 with Linux's capability numbers.
        // The syscall SysAPI layer (Section 19.1) passes Linux cap numbers
        // through directly: capable(N) → has_cap(1 << N). No translation needed.

        /// CAP_CHOWN (Linux 0): Allow arbitrary file ownership changes.
        const CAP_CHOWN = 1 << 0;
        /// CAP_DAC_OVERRIDE (Linux 1): Bypass all DAC checks.
        const CAP_DAC_OVERRIDE = 1 << 1;
        /// CAP_DAC_READ_SEARCH (Linux 2): Bypass file read and directory search checks.
        const CAP_DAC_READ_SEARCH = 1 << 2;
        /// CAP_FOWNER (Linux 3): Bypass file-owner permission checks.
        const CAP_FOWNER = 1 << 3;
        /// CAP_FSETID (Linux 4): Set set-user-ID and set-group-ID bits.
        const CAP_FSETID = 1 << 4;
        /// CAP_KILL (Linux 5): Send signals to arbitrary processes.
        const CAP_KILL = 1 << 5;
        /// CAP_SETGID (Linux 6): Set arbitrary group IDs.
        const CAP_SETGID = 1 << 6;
        /// CAP_SETUID (Linux 7): Set arbitrary user IDs.
        const CAP_SETUID = 1 << 7;
        /// CAP_SETPCAP (Linux 8): Modify capability sets.
        const CAP_SETPCAP = 1 << 8;
        /// CAP_LINUX_IMMUTABLE (Linux 9): Set immutable and append-only file attributes.
        const CAP_LINUX_IMMUTABLE = 1 << 9;
        /// CAP_NET_BIND_SERVICE (Linux 10): Bind to privileged ports (< 1024).
        const CAP_NET_BIND_SERVICE = 1 << 10;
        /// CAP_NET_BROADCAST (Linux 11): Socket broadcasting and multicast.
        const CAP_NET_BROADCAST = 1 << 11;
        /// CAP_NET_ADMIN (Linux 12): Network administration operations.
        const CAP_NET_ADMIN = 1 << 12;
        /// CAP_NET_RAW (Linux 13): Raw and packet sockets.
        const CAP_NET_RAW = 1 << 13;
        /// CAP_IPC_LOCK (Linux 14): Lock memory (mlock, mlockall, SHM_LOCK).
        const CAP_IPC_LOCK = 1 << 14;
        /// CAP_IPC_OWNER (Linux 15): Override IPC ownership checks.
        const CAP_IPC_OWNER = 1 << 15;
        /// CAP_SYS_MODULE (Linux 16): Load and unload kernel modules.
        const CAP_SYS_MODULE = 1 << 16;
        /// CAP_SYS_RAWIO (Linux 17): Raw I/O operations.
        const CAP_SYS_RAWIO = 1 << 17;
        /// CAP_SYS_CHROOT (Linux 18): Use chroot(2).
        const CAP_SYS_CHROOT = 1 << 18;
        /// CAP_SYS_PTRACE (Linux 19): Trace arbitrary processes.
        /// Note: UmkaOS also provides CAP_DEBUG (bit 68) as the native debugging
        /// capability. The SysAPI layer maps CAP_SYS_PTRACE checks to CAP_DEBUG.
        const CAP_SYS_PTRACE = 1 << 19;
        /// CAP_SYS_PACCT (Linux 20): Configure process accounting.
        const CAP_SYS_PACCT = 1 << 20;
        /// CAP_SYS_ADMIN (Linux 21): Broad system administration.
        /// Note: UmkaOS also provides CAP_ADMIN (bit 64) as the native admin
        /// capability. The SysAPI layer: `capable(CAP_SYS_ADMIN)` checks
        /// `has_cap(CAP_SYS_ADMIN) || has_cap(CAP_ADMIN)`.
        const CAP_SYS_ADMIN = 1 << 21;
        /// CAP_SYS_BOOT (Linux 22): Use reboot(2) and kexec_load(2).
        const CAP_SYS_BOOT = 1 << 22;
        /// CAP_SYS_NICE (Linux 23): Set scheduling policies, nice values.
        const CAP_SYS_NICE = 1 << 23;
        /// CAP_SYS_RESOURCE (Linux 24): Override resource limits (RLIMIT).
        const CAP_SYS_RESOURCE = 1 << 24;
        /// CAP_SYS_TIME (Linux 25): Set system clock and adjtime.
        const CAP_SYS_TIME = 1 << 25;
        /// CAP_SYS_TTY_CONFIG (Linux 26): Configure virtual terminal settings.
        const CAP_SYS_TTY_CONFIG = 1 << 26;
        /// CAP_MKNOD (Linux 27): Create special files using mknod(2).
        const CAP_MKNOD = 1 << 27;
        /// CAP_LEASE (Linux 28): Establish leases on arbitrary files.
        const CAP_LEASE = 1 << 28;
        /// CAP_AUDIT_WRITE (Linux 29): Write records to audit log.
        const CAP_AUDIT_WRITE = 1 << 29;
        /// CAP_AUDIT_CONTROL (Linux 30): Configure audit rules.
        const CAP_AUDIT_CONTROL = 1 << 30;
        /// CAP_SETFCAP (Linux 31): Set file capabilities.
        const CAP_SETFCAP = 1 << 31;
        /// CAP_MAC_OVERRIDE (Linux 32): Override MAC enforcement.
        const CAP_MAC_OVERRIDE = 1 << 32;
        /// CAP_MAC_ADMIN (Linux 33): MAC configuration changes.
        const CAP_MAC_ADMIN = 1 << 33;
        /// CAP_SYSLOG (Linux 34): Privileged syslog operations.
        const CAP_SYSLOG = 1 << 34;
        /// CAP_WAKE_ALARM (Linux 35): Set system wakeup alarms.
        const CAP_WAKE_ALARM = 1 << 35;
        /// CAP_BLOCK_SUSPEND (Linux 36): Prevent system suspending.
        const CAP_BLOCK_SUSPEND = 1 << 36;
        /// CAP_AUDIT_READ (Linux 37): Read audit log messages.
        const CAP_AUDIT_READ = 1 << 37;
        /// CAP_PERFMON (Linux 38): Performance monitoring and observability.
        const CAP_PERFMON = 1 << 38;
        /// CAP_BPF (Linux 39): Load eBPF programs and create maps.
        const CAP_BPF = 1 << 39;
        /// CAP_CHECKPOINT_RESTORE (Linux 40): Checkpoint/restore operations.
        const CAP_CHECKPOINT_RESTORE = 1 << 40;

        // Bits 41-63: Reserved for future Linux capabilities.

        // ===== UmkaOS-native capabilities (bits 64-127) =====
        // These have no Linux equivalent. They provide finer-grained
        // control for UmkaOS-specific subsystems.

        /// CAP_ADMIN: UmkaOS-native administrative capability. Grants broad
        /// administrative access including: mounting/unmounting filesystems,
        /// modifying firewall/routing, loading kernel modules, sysctl changes,
        /// privilege management (setuid/setgid administration), cgroup/namespace administration, device
        /// management. The SysAPI layer maps `capable(CAP_SYS_ADMIN)` to
        /// check both CAP_SYS_ADMIN (bit 21) and CAP_ADMIN (bit 64) — a
        /// process holding either one passes the check. CAP_ADMIN is the
        /// preferred capability for new UmkaOS-native code paths; CAP_SYS_ADMIN
        /// exists purely for Linux application compatibility.
        const CAP_ADMIN = 1 << 64;

        /// CAP_P2P_DMA: Peer-to-peer DMA operations between devices.
        /// Required for drivers that initiate P2P transactions.
        const CAP_P2P_DMA = 1 << 65;

        /// CAP_NET_LOOKUP: BPF socket/connection table lookups via
        /// bpf_sk_lookup(). Required for BPF load balancers. See Section 16.15.
        const CAP_NET_LOOKUP = 1 << 66;

        /// CAP_NET_ROUTE_READ: BPF FIB (routing table) lookups via
        /// bpf_fib_lookup(). Required for XDP forwarding. See Section 16.15.
        const CAP_NET_ROUTE_READ = 1 << 67;

        /// CAP_DEBUG: Debug arbitrary processes via ptrace or /proc/pid.
        /// UmkaOS-native debugging capability. CAP_SYS_PTRACE (bit 19) is
        /// the Linux compat alias. See "Capability-Gated ptrace" in debugging-and-process-inspection.
        const CAP_DEBUG = 1 << 68;

        /// CAP_NS_TRAVERSE: Traverse namespace boundaries for cross-namespace
        /// operations. Each boundary crossing requires this capability.
        /// Combined with CAP_DEBUG, enables cross-namespace ptrace. See "Capability-Gated ptrace" in debugging-and-process-inspection.
        const CAP_NS_TRAVERSE = 1 << 69;

        /// CAP_MOUNT: Mount/unmount filesystems. Required for mount(2),
        /// umount(2), pivot_root(2). Scoped to caller's mount namespace. See [Section 14.6](14-vfs.md#mount-tree-data-structures-and-operations).
        const CAP_MOUNT = 1 << 70;

        /// CAP_VMX: Execute VMX instructions for KVM host operation.
        /// Maps to /dev/kvm open and VMXON/VMXOFF. See Section 18.1.
        const CAP_VMX = 1 << 71;

        /// CAP_CGROUP_ADMIN: Manage cgroup hierarchies — create, modify,
        /// destroy within delegated subtree. See Section 19.1.5.
        const CAP_CGROUP_ADMIN = 1 << 72;

        /// CAP_TPM_SEAL: Seal/unseal data to TPM. See Section 9.3.2.
        const CAP_TPM_SEAL = 1 << 73;

        /// CAP_NET_CONNTRACK: BPF conntrack state query/modify via
        /// bpf_ct_lookup(), bpf_ct_insert(), bpf_ct_set_nat(). See Section 16.15.
        const CAP_NET_CONNTRACK = 1 << 74;

        /// CAP_NET_REDIRECT: XDP packet redirect to another interface's
        /// isolation domain. See Section 16.15.
        const CAP_NET_REDIRECT = 1 << 75;

        /// CAP_TTY_DIRECT: Zero-copy PTY mode (PtyRingPage mapped into both
        /// master/slave). For container logging optimization. See Section 21.1.2.
        const CAP_TTY_DIRECT = 1 << 76;

        /// CAP_ACCEL_ADMIN: Accelerator device admin (firmware update, reset,
        /// perf counter reset, privileged debug). See Section 22.1.2.2.
        const CAP_ACCEL_ADMIN = 1 << 77;

        // ZFS-specific capabilities (scoped to datasets via object_id, Section 15.2.2).
        /// CAP_ZFS_MOUNT: Mount a ZFS dataset as a filesystem.
        const CAP_ZFS_MOUNT = 1 << 78;
        /// CAP_ZFS_SNAPSHOT: Create and destroy snapshots.
        const CAP_ZFS_SNAPSHOT = 1 << 79;
        /// CAP_ZFS_SEND: Generate a send stream for replication.
        const CAP_ZFS_SEND = 1 << 80;
        /// CAP_ZFS_RECV: Receive a send stream into a dataset.
        const CAP_ZFS_RECV = 1 << 81;
        /// CAP_ZFS_CREATE: Create child datasets within a parent.
        const CAP_ZFS_CREATE = 1 << 82;
        /// CAP_ZFS_DESTROY: Destroy a dataset (highest ZFS privilege).
        const CAP_ZFS_DESTROY = 1 << 83;

        // DLM (Distributed Lock Manager) capabilities. See Section 15.12.14.
        /// CAP_DLM_LOCK: Acquire, convert, release locks in permitted lockspaces.
        const CAP_DLM_LOCK = 1 << 84;
        /// CAP_DLM_ADMIN: Create/destroy lockspaces, configure, view cluster-wide.
        const CAP_DLM_ADMIN = 1 << 85;
        /// CAP_DLM_CREATE: Create new lock resources (app-level via /dev/dlm).
        const CAP_DLM_CREATE = 1 << 86;

        /// CAP_SYS_ADMIN_GLOBAL (bit 87): Cluster-wide system administration.
        /// Extends CAP_SYS_ADMIN to authorize operations with cluster-wide scope:
        /// creating/destroying DLM lockspaces visible to all nodes, applying
        /// cluster-wide sysctl changes, modifying shared overlay network topology,
        /// and bootstrapping cluster membership. Required by cluster management
        /// daemons (e.g., pacemaker, corosync). The SysAPI layer does not map
        /// any Linux capability to CAP_SYS_ADMIN_GLOBAL — it is UmkaOS-native only.
        /// Held in `TaskCredential.cap_permitted`; checked by DLM and cluster IPC.
        const CAP_SYS_ADMIN_GLOBAL = 1 << 87;

        /// CAP_ML_TUNE (bit 88): ML policy parameter tuning.
        /// Required to write ML-driven tuning parameters via the policy service
        /// interface ([Section 23.1](23-ml-policy.md#aiml-policy-framework-closed-loop-kernel-intelligence--policy-consumer-kabi-interface)).
        /// Global parameter updates require both CAP_ML_TUNE and CAP_SYS_ADMIN.
        /// Per-cgroup overrides require CAP_ML_TUNE scoped to the target cgroup's
        /// user namespace.
        const CAP_ML_TUNE = 1 << 88;

        /// CAP_CAMERA (bit 89): Open and stream from camera/video capture devices.
        /// Required to open `/dev/videoN` devices and start streaming. Prevents
        /// unprivileged access to cameras. The SysAPI layer maps V4L2 device
        /// `open()` to a CAP_CAMERA check.
        /// See [Section 13.16](13-device-classes.md#camera-and-video-capture--privacy-and-security).
        const CAP_CAMERA = 1 << 89;

        /// CAP_TPM_REMOTE (bit 90): Bind to a remote TPM service over the
        /// cluster fabric. See Section 9.3.4. Separate from CAP_TPM_SEAL (local).
        const CAP_TPM_REMOTE = 1 << 90;

        /// CAP_USB_REMOTE (bit 91): Bind to a remote USB forwarding service
        /// over the cluster fabric. Required by the USB/IP-style capability
        /// service ([Section 13.29](13-device-classes.md#usb-device-forwarding-service-provider)). Separate from CAP_SYS_RAWIO (local USB).
        const CAP_USB_REMOTE = 1 << 91;

        /// CAP_DMA (bit 92): Perform DMA operations on behalf of a device.
        /// Required by Tier 1 and Tier 2 drivers that allocate DMA buffers
        /// or map device memory. Tier 0 drivers (in-kernel) implicitly hold
        /// CAP_DMA. Granted via the capability grant bundle during
        /// `device_init()` ([Section 11.3](11-drivers.md#driver-isolation-tiers)). Checked by
        /// `DmaDevice::dma_alloc_coherent()` and `dma_map_sgl()` in
        /// [Section 4.14](04-memory.md#dma-subsystem). Separate from CAP_SYS_RAWIO (which grants
        /// raw I/O port access, not DMA).
        const CAP_DMA = 1 << 92;

        /// CAP_DMA_IDENTITY (bit 93): Use identity-mapped DMA (bypass IOMMU
        /// translation). Only granted to Tier 1 drivers that require direct
        /// physical address access (e.g., APIC, early boot devices). Tier 2
        /// drivers NEVER receive this capability. Checked in addition to
        /// CAP_DMA when identity mapping is requested.
        const CAP_DMA_IDENTITY = 1 << 93;

        /// CAP_IRQ (bit 94): Register and deregister interrupt vectors.
        /// Required by Tier 1 and Tier 2 drivers that call
        /// `KernelServicesVTable::register_interrupt()` or
        /// `deregister_interrupt()`. Tier 0 drivers (in-kernel) implicitly
        /// hold CAP_IRQ. Granted via the DeviceCapGrant bundle during
        /// `device_init()` ([Section 11.4](11-drivers.md#device-registry-and-bus-management--device-capability-grant-bundle)).
        /// Enforced by the KABI dispatch trampoline's dual-check protocol
        /// ([Section 12.3](12-kabi.md#kabi-bilateral-capability-exchange--kabi-operation-permission-requirements)).
        /// Separate from CAP_SYS_RAWIO (which grants raw I/O port access).
        const CAP_IRQ = 1 << 94;

        // --- Distributed subsystem capabilities (bits 95-99) ---

        /// CAP_BLOCK_REMOTE (bit 95): Access block devices on remote cluster
        /// nodes via the block service provider
        /// ([Section 15.13](15-storage.md#block-storage-networking--block-service-provider)).
        /// Required to bind to a remote block device advertised via
        /// `CapAdvertise`. Checked by the block service client on
        /// `ServiceBind` for block-class services. Without this capability,
        /// a process can only access locally-attached block devices.
        const CAP_BLOCK_REMOTE = 1 << 95;

        /// CAP_FS_REMOTE (bit 96): Access filesystems on remote cluster
        /// nodes via the VFS service provider
        /// ([Section 14.11](14-vfs.md#fuse-filesystem-in-userspace--vfs-service-provider)).
        /// Required to bind to a remote filesystem mount advertised via
        /// `CapAdvertise`. Checked by the VFS service client on
        /// `ServiceBind` for filesystem-class services. Without this
        /// capability, a process can only access locally-mounted filesystems.
        const CAP_FS_REMOTE = 1 << 96;

        /// CAP_ACCEL_REMOTE (bit 97): Access accelerator devices (GPU, FPGA,
        /// inference engines) on remote cluster nodes via the accelerator
        /// service provider
        /// ([Section 22.7](22-accelerators.md#accelerator-networking-rdma-and-linux-gpu-compatibility--accelerator-service-provider)).
        /// Required to bind to a remote accelerator context advertised via
        /// `CapAdvertise`. Checked by the accelerator service client on
        /// `ServiceBind` for accel-class services. Separate from
        /// CAP_ACCEL_ADMIN (which controls local admin operations).
        const CAP_ACCEL_REMOTE = 1 << 97;

        /// CAP_NET_REMOTE (bit 98): Access network interfaces on remote
        /// cluster nodes via the network service provider
        /// ([Section 16.31](16-networking.md#network-service-provider)).
        /// Required to bind to a remote network gateway or RDMA proxy
        /// advertised via `CapAdvertise`. Checked by the network service
        /// client on `ServiceBind` for network-class services. Separate
        /// from CAP_NET_ADMIN (which controls local network configuration).
        const CAP_NET_REMOTE = 1 << 98;

        /// CAP_CLUSTER_ADMIN (bit 99): Cluster-wide administrative operations
        /// that affect the entire distributed fabric. Required for:
        /// - Issuing `KeyRevokeMsg` on behalf of another peer (compromise
        ///   response when the peer cannot revoke its own key).
        /// - Forcing a node eviction (`DeadNotify` without heartbeat timeout).
        /// - Modifying cluster-wide fencing tokens (`FenceTokenUpdate`).
        /// - Creating cluster-wide DSM regions (DsmRegionCreate with
        ///   visibility=ClusterWide).
        /// - Administrative `REVOKE_URGENT` for capabilities issued by
        ///   other peers (emergency revocation).
        /// Stronger than CAP_SYS_ADMIN_GLOBAL (bit 87): CAP_SYS_ADMIN_GLOBAL
        /// covers cluster-visible sysctl and DLM lockspace management;
        /// CAP_CLUSTER_ADMIN covers operations that affect the cluster
        /// membership, security, and distributed resource fabric itself.
        /// Held only by cluster management daemons (e.g., pacemaker,
        /// corosync) and administrative tools.
        const CAP_CLUSTER_ADMIN = 1 << 99;

        /// CAP_DSM_CREATE (bit 100): Create DSM (Distributed Shared Memory) regions.
        /// Required to call `dsm_region_create()` ([Section 6.8](06-dsm.md#dsm-region-management)).
        /// Without this capability, a process can only join existing DSM regions
        /// (which requires CAP_SYS_ADMIN_GLOBAL or an explicit capability grant).
        /// Checked by the DSM subsystem before broadcasting `DsmRegionCreateBcast`.
        const CAP_DSM_CREATE = 1 << 100;

        /// CAP_PEER_MANAGE (bit 101): Manage peer protocol connections.
        /// Required for administrative peer lifecycle operations: initiating
        /// peer join/leave sequences, forcing peer eviction, modifying peer
        /// transport parameters, and querying peer health status via the
        /// management interface. Checked by the peer protocol layer on
        /// `PeerMessageType::JoinRequest`, `LeaveNotify`, and `DeadNotify`
        /// origination. Separate from CAP_CLUSTER_ADMIN (which covers
        /// cluster-wide fencing and security); CAP_PEER_MANAGE covers
        /// individual peer connection lifecycle.
        const CAP_PEER_MANAGE = 1 << 101;

        // Bits 102-127: Reserved for future UmkaOS-native capabilities.
    }
}

Key design notes:

  • Bit layout: Bits 0-40 match Linux's capability numbering exactly (bit N = Linux CAP_* number N). This means the SysAPI layer needs no translation for POSIX capability checks — capable(N) simply checks caps & (1 << N). Bits 41-63 are reserved for future Linux capabilities. Bits 64-127 are UmkaOS-native.

  • CAP_ADMIN vs CAP_SYS_ADMIN: CAP_SYS_ADMIN (bit 21) is the Linux-compatible capability at its exact Linux bit position. CAP_ADMIN (bit 64) is the UmkaOS-native administrative capability. The SysAPI layer checks both: capable(CAP_SYS_ADMIN) succeeds if the process holds either CAP_SYS_ADMIN or CAP_ADMIN. New UmkaOS-native code should check CAP_ADMIN; the POSIX bits exist purely for unmodified Linux application compatibility.

  • CAP_SYS_PTRACE vs CAP_DEBUG: Similar pattern. CAP_SYS_PTRACE (bit 19) is the Linux compat bit. CAP_DEBUG (bit 68) is the UmkaOS-native debugging capability. The SysAPI layer maps ptrace capability checks to CAP_SYS_PTRACE || CAP_DEBUG.

  • Granularity over blanket permissions: Individual capabilities (e.g., CAP_NET_ADMIN, CAP_DAC_OVERRIDE) allow least-privilege assignment. A web server needs only CAP_NET_BIND_SERVICE and CAP_SETUID, not CAP_ADMIN.

  • No root-equivalent unlimited access: Even CAP_ADMIN is bounded. It grants the operations listed above but does not bypass hardware isolation domains or allow arbitrary code execution in umka-core. There is no capability that grants "ignore all security checks" — administrative operations are still mediated through the capability system.

POSIX capability mapping: The syscall entry point (Section 19.1) converts Linux capability checks (capable(N)) to SystemCaps bit checks. For POSIX capabilities (0-40), this is a direct 1 << N check with no translation. For caps that have UmkaOS-native equivalents (e.g., CAP_SYS_ADMIN/CAP_ADMIN), the check is has_cap(1 << N) || has_cap(UMKA_NATIVE_EQUIVALENT).
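The mapping rule can be sketched directly. The two alias pairs come from the notes above (CAP_SYS_ADMIN→CAP_ADMIN, CAP_SYS_PTRACE→CAP_DEBUG); the function signature and alias-table shape are illustrative:

```rust
// Sketch of the SysAPI capable() mapping: a direct 1 << N check on the
// u128 capability word, plus the UmkaOS-native alias where one exists.
const CAP_SYS_PTRACE: u32 = 19;
const CAP_SYS_ADMIN: u32 = 21;
const CAP_ADMIN: u32 = 64;
const CAP_DEBUG: u32 = 68;

fn has_cap(caps: u128, bit: u32) -> bool {
    caps & (1u128 << bit) != 0
}

/// Linux `capable(N)`: POSIX bits 0-40 need no translation; caps with a
/// native equivalent pass if either the Linux bit or the alias is held.
fn capable(caps: u128, linux_cap: u32) -> bool {
    let native_alias = match linux_cap {
        CAP_SYS_ADMIN => Some(CAP_ADMIN),
        CAP_SYS_PTRACE => Some(CAP_DEBUG),
        _ => None,
    };
    has_cap(caps, linux_cap)
        || native_alias.map_or(false, |bit| has_cap(caps, bit))
}

fn main() {
    let caps = 1u128 << CAP_ADMIN; // native admin only, no Linux bit set
    assert!(capable(caps, CAP_SYS_ADMIN));   // alias passes the compat check
    assert!(!capable(caps, CAP_SYS_PTRACE)); // no debug capability held
    assert!(capable(1u128 << CAP_DEBUG, CAP_SYS_PTRACE));
}
```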

9.2.3 Dual ACL Model: POSIX Draft ACLs + NFSv4 ACLs

UmkaOS supports two ACL models as first-class citizens, designed into the VFS layer from the start:

POSIX Draft ACLs (IEEE 1003.1e/1003.2c):

  • Standard Linux ACL model (setfacl/getfacl)
  • ACL_USER, ACL_GROUP, ACL_MASK, ACL_OTHER entry types
  • Default ACLs for directory inheritance
  • Required for ext4, XFS, btrfs compatibility

NFSv4 ACLs (RFC 7530 / RFC 8881):

  • Richer model used by NFS, ZFS, and most modern Unix systems (FreeBSD, Solaris/illumos)
  • Explicit ALLOW/DENY ACE ordering (access control entries processed in order)
  • Fine-grained permissions: READ_DATA, WRITE_DATA, APPEND_DATA, READ_NAMED_ATTRS, WRITE_NAMED_ATTRS, EXECUTE, DELETE_CHILD, READ_ATTRIBUTES, WRITE_ATTRIBUTES, DELETE, READ_ACL, WRITE_ACL, WRITE_OWNER, SYNCHRONIZE
  • Inheritance flags: FILE_INHERIT, DIRECTORY_INHERIT, NO_PROPAGATE_INHERIT, INHERIT_ONLY
  • Automatic inheritance tracking (for efficient subtree ACL changes)
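Ordered ACE evaluation is what distinguishes the NFSv4 model: a DENY entry that appears earlier wins over a later ALLOW. A simplified sketch (single-principal matching, no inheritance or group expansion; all names illustrative):

```rust
// Sketch of ordered NFSv4 ACE evaluation: ACEs are scanned first to last;
// a DENY matching any still-needed bit fails immediately, an ALLOW
// satisfies the bits it grants.
#[derive(Clone, Copy, PartialEq)]
enum AceKind { Allow, Deny }

struct Ace { kind: AceKind, who: u32, mask: u32 } // who = principal id

fn check_access(aces: &[Ace], who: u32, requested: u32) -> bool {
    let mut needed = requested;
    for ace in aces {
        if ace.who != who { continue; }
        match ace.kind {
            AceKind::Deny if ace.mask & needed != 0 => return false,
            AceKind::Allow => needed &= !ace.mask,
            _ => {}
        }
        if needed == 0 { return true; }
    }
    needed == 0
}

const READ_DATA: u32 = 1 << 0;
const WRITE_DATA: u32 = 1 << 1;

fn main() {
    let aces = [
        Ace { kind: AceKind::Deny, who: 7, mask: WRITE_DATA },
        Ace { kind: AceKind::Allow, who: 7, mask: READ_DATA | WRITE_DATA },
    ];
    assert!(check_access(&aces, 7, READ_DATA));   // ALLOW satisfies read
    assert!(!check_access(&aces, 7, WRITE_DATA)); // earlier DENY wins
}
```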

VFS ACL Abstraction:

pub enum AclModel {
    PosixDraft,  // POSIX.1e draft ACLs
    Nfsv4,       // NFSv4/ZFS-style rich ACLs
}

pub trait VfsAcl {
    /// Which ACL model this filesystem uses
    fn acl_model(&self) -> AclModel;
    /// Get the effective ACL for an inode
    fn get_acl(&self, inode: InodeId, acl_type: AclType) -> Result<Acl, Error>;
    /// Set an ACL on an inode
    fn set_acl(&self, inode: InodeId, acl_type: AclType, acl: &Acl) -> Result<(), Error>;
    /// Check access (called by the permission check path)
    fn check_access(&self, inode: InodeId, who: &Principal, mask: AccessMask) -> Result<(), Error>;
}

Filesystem support:

  • ext4, XFS, btrfs: POSIX draft ACLs (native) + NFSv4 via translation layer
  • ZFS: NFSv4 ACLs (native)
  • NFS client: NFSv4 ACLs (native, passed through to server)
  • tmpfs, procfs, sysfs: POSIX draft ACLs (simple model sufficient)

Translation layer: When a filesystem natively uses one model but the user/application requests the other, a translation layer converts between them. The translation is lossy in some edge cases (NFSv4 DENY entries have no POSIX equivalent), but covers common use cases.

POSIX draft ACL → NFSv4 ACL (always lossless for ALLOW-only POSIX ACLs):

For each POSIX ACL entry in order:
  ACL_USER_OBJ  → NFSv4 ALLOW ACE for OWNER@  with rwxp permissions from entry
  ACL_USER(uid) → NFSv4 ALLOW ACE for user:uid with rwxp permissions from entry
  ACL_GROUP_OBJ → NFSv4 ALLOW ACE for GROUP@  with (permissions & MASK) from entry
  ACL_GROUP(gid)→ NFSv4 ALLOW ACE for group:gid with (permissions & MASK) from entry
  ACL_MASK      → Not translated as its own ACE; applied as mask to GROUP entries above
  ACL_OTHER     → NFSv4 ALLOW ACE for EVERYONE@ with permissions from entry

Default ACL (directory inheritance):
  Each entry above additionally receives FILE_INHERIT | DIRECTORY_INHERIT flags.
  ACL_DEFAULT entries with ACL_USER_OBJ / ACL_GROUP_OBJ / ACL_OTHER receive
  INHERIT_ONLY if the access ACL already covers owner/group/other.

NFSv4 ACL → POSIX draft ACL (lossy — DENY ACEs are discarded):

Pass 1 — extract ALLOW ACEs only (DENY ACEs have no POSIX equivalent):
  OWNER@  ALLOW → ACL_USER_OBJ  with the ACE permissions
  GROUP@  ALLOW → ACL_GROUP_OBJ with the ACE permissions
  EVERYONE@ ALLOW → ACL_OTHER   with the ACE permissions
  user:uid ALLOW → ACL_USER(uid) entry
  group:gid ALLOW → ACL_GROUP(gid) entry

Pass 2 — compute MASK:
  MASK = union of all GROUP@ and ACL_GROUP(gid) ALLOW permissions.
  This matches the POSIX MASK semantics (effective group permission limit).

Pass 3 — inheritance:
  ACEs with FILE_INHERIT | DIRECTORY_INHERIT → contribute to default ACL.
  ACEs with INHERIT_ONLY → contribute only to default ACL, not access ACL.

Lossy cases:
  DENY ACE                    → logged and discarded (no POSIX representation)
  Ordered DENY before ALLOW   → first-match ordering is lost; effective access
                                  may differ between models for complex rule sets
  NFSv4-only permission bits  → READ_NAMED_ATTRS, WRITE_NAMED_ATTRS, SYNCHRONIZE
    (not in POSIX model)         are silently dropped in translation
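
Pass 2 above (MASK computation) is the step most often gotten wrong in translation code. A minimal sketch, using illustrative stand-in types rather than the kernel's real ACE representation:

```rust
/// Illustrative ACE representation for the translation sketch.
#[derive(Clone, Copy, PartialEq)]
pub enum AceWho {
    Owner,            // OWNER@
    Group,            // GROUP@
    Everyone,         // EVERYONE@
    User(u32),        // user:uid
    NamedGroup(u32),  // group:gid
}

#[derive(Clone, Copy)]
pub struct Ace {
    pub allow: bool, // true = ALLOW ACE; DENY ACEs are discarded in Pass 1
    pub who: AceWho,
    pub perms: u16,  // rwx permission bits
}

/// Pass 2: MASK = union of all GROUP@ and group:gid ALLOW permissions.
/// OWNER@ and EVERYONE@ entries do not contribute (they are not in the
/// POSIX group class), and DENY ACEs were already dropped in Pass 1.
pub fn compute_posix_mask(aces: &[Ace]) -> u16 {
    aces.iter()
        .filter(|a| a.allow)
        .filter(|a| matches!(a.who, AceWho::Group | AceWho::NamedGroup(_)))
        .fold(0, |mask, a| mask | a.perms)
}
```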

POSIX→NFSv4 ACL mapping follows RFC 7530 §6.2.1 (NFS Version 4 Protocol). The mapping algorithm is deterministic:

  • POSIX mode bits {r, w, x} for {owner, group, other} map to ACE4_ACCESS_ALLOWED_ACE_TYPE entries with the corresponding NFSv4 access mask bits.
  • POSIX default ACL entries: emit Mask-ACEs first, Access-ACEs second.
  • Reference implementation: nfs4_acl_posix_to_nfs4() in Linux fs/nfsd/nfs4acl.c (for cross-reference; UmkaOS implements independently per RFC 7530).

The translation implementation is cross-referenced with the NFSv4 ACL mapping used by ZFS (PSARC 2006/496) and FreeBSD's acl_nfs4_posix.c.

Linux compatibility: The getxattr/setxattr syscalls support both system.posix_acl_access/system.posix_acl_default (POSIX) and system.nfs4_acl (NFSv4) extended attribute names. Tools like nfs4_getfacl/nfs4_setfacl work unmodified.

9.2.3.1 POSIX ACL Extended Attribute Wire Format

The system.posix_acl_access and system.posix_acl_default extended attributes (Section 14.16) encode POSIX draft ACLs in a binary wire format defined by POSIX.1e draft 17. This format is implemented identically by Linux, FreeBSD, and Solaris. UmkaOS uses the same byte-exact encoding for binary compatibility — tools like getfacl/setfacl, rsync, tar, and container runtimes (which read/write these xattrs directly) must work without modification.

/// POSIX ACL wire format — stored in `system.posix_acl_access` and
/// `system.posix_acl_default` extended attributes ([Section 14.16](14-vfs.md#extended-attributes)).
///
/// Binary layout: header + array of entries. Little-endian on all architectures.
/// This format is defined by POSIX.1e draft 17 and implemented identically by
/// Linux, FreeBSD, and Solaris. UmkaOS uses the same binary encoding for
/// Linux compatibility.

/// ACL version — must be 0x0002 (POSIX_ACL_XATTR_VERSION).
pub const POSIX_ACL_XATTR_VERSION: u32 = 0x0002;

/// Undefined qualifier sentinel — used for tags without a uid/gid qualifier.
pub const ACL_UNDEFINED_ID: u32 = 0xFFFF_FFFF;

/// ACL entry tag values. Values match Linux `include/uapi/linux/posix_acl.h`.
/// NOTE: Although values are powers of two, these are enum discriminants,
/// NOT combinable flags. Each `e_tag` must be exactly one of these values.
pub const ACL_USER_OBJ: u16  = 0x01;  // File owner
pub const ACL_USER: u16      = 0x02;  // Named user (uid in qualifier)
pub const ACL_GROUP_OBJ: u16 = 0x04;  // File owning group
pub const ACL_GROUP: u16     = 0x08;  // Named group (gid in qualifier)
pub const ACL_MASK: u16      = 0x10;  // Maximum effective permissions for USER/GROUP entries
pub const ACL_OTHER: u16     = 0x20;  // Everyone else

/// Permission bits (same as file mode bits in the low 3 bits of st_mode).
pub const ACL_READ: u16    = 0x04;
pub const ACL_WRITE: u16   = 0x02;
pub const ACL_EXECUTE: u16 = 0x01;

/// Wire format header — 4 bytes, always first in the xattr value.
/// All fields are little-endian on disk (Le32/Le16 enforce explicit conversion).
#[repr(C)]
pub struct PosixAclXattrHeader {
    /// Must be POSIX_ACL_XATTR_VERSION (0x0002).
    pub a_version: Le32,
}
// Packed layout: 4 bytes.
const_assert!(size_of::<PosixAclXattrHeader>() == 4);

/// Wire format entry — 8 bytes, repeated N times after the header.
/// All fields are little-endian on disk.
#[repr(C)]
pub struct PosixAclXattrEntry {
    /// Tag: one of ACL_USER_OBJ, ACL_USER, ACL_GROUP_OBJ, ACL_GROUP,
    /// ACL_MASK, ACL_OTHER.
    pub e_tag: Le16,
    /// Permission bits: bitwise OR of ACL_READ, ACL_WRITE, ACL_EXECUTE.
    pub e_perm: Le16,
    /// Qualifier: uid for ACL_USER, gid for ACL_GROUP.
    /// ACL_UNDEFINED_ID (0xFFFFFFFF) for ACL_USER_OBJ, ACL_GROUP_OBJ,
    /// ACL_MASK, and ACL_OTHER.
    pub e_id: Le32,
}
// Packed layout: 2 + 2 + 4 = 8 bytes.
const_assert!(size_of::<PosixAclXattrEntry>() == 8);

Total xattr value size: 4 + 8 * N bytes — a 4-byte PosixAclXattrHeader followed by N PosixAclXattrEntry values. Filesystems reject xattr values whose length is not exactly 4 + 8 * N for some positive N, or whose a_version field is not 0x0002.

Entry ordering (mandatory — consumers depend on sorted order):

  1. ACL_USER_OBJ — exactly one, always first
  2. ACL_USER entries — zero or more, sorted by ascending uid in e_id
  3. ACL_GROUP_OBJ — exactly one
  4. ACL_GROUP entries — zero or more, sorted by ascending gid in e_id
  5. ACL_MASK — exactly one if any ACL_USER or ACL_GROUP entries exist; absent otherwise
  6. ACL_OTHER — exactly one, always last

Minimal vs extended ACLs:

  • Minimal ACL: Only ACL_USER_OBJ + ACL_GROUP_OBJ + ACL_OTHER (3 entries, 28 bytes). Equivalent to standard Unix permission bits — no named user/group entries, no mask. Filesystems may choose not to store a minimal ACL as an xattr at all (the mode bits in the inode are sufficient).
  • Extended ACL: Contains at least one ACL_USER or ACL_GROUP entry, which requires a corresponding ACL_MASK entry. The mask limits the effective permissions granted to all ACL_USER, ACL_GROUP, and ACL_GROUP_OBJ entries.

chmod interaction — the critical st_mode ↔ ACL synchronization rules:

When chmod(path, mode) is called on an inode that has an extended POSIX ACL:

  • Owner permission bits (mode & 0o700) update the ACL_USER_OBJ entry's e_perm.
  • Group permission bits (mode & 0o070) update the ACL_MASK entry's e_perm, NOT ACL_GROUP_OBJ. This is the key subtlety: when an extended ACL exists, the "group" bits in stat.st_mode reflect the mask, not the owning group's actual permissions. The owning group's effective access is ACL_GROUP_OBJ.e_perm & ACL_MASK.e_perm.
  • Other permission bits (mode & 0o007) update the ACL_OTHER entry's e_perm.

Conversely, when stat() returns st_mode for an inode with an extended ACL, the group bits are set to ACL_MASK.e_perm. This matches Linux behavior and is required for correct operation of tools that rely on stat output (e.g., ls -l shows the mask as the group column, which getfacl annotates with #effective:).

For a minimal ACL (no mask), chmod operates on ACL_GROUP_OBJ directly as expected.
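
The extended-ACL chmod rule above can be made concrete with a short sketch. Field names here are illustrative, not the kernel's real ACL structure; the key point is that the group bits of `mode` land in ACL_MASK, leaving ACL_GROUP_OBJ untouched:

```rust
/// Illustrative extended POSIX ACL (one e_perm per mandatory entry).
pub struct ExtAcl {
    pub user_obj: u16,  // ACL_USER_OBJ e_perm
    pub group_obj: u16, // ACL_GROUP_OBJ e_perm — NOT touched by chmod
    pub mask: u16,      // ACL_MASK e_perm
    pub other: u16,     // ACL_OTHER e_perm
}

/// chmod on an inode with an extended ACL: owner bits -> ACL_USER_OBJ,
/// group bits -> ACL_MASK (the key subtlety), other bits -> ACL_OTHER.
pub fn apply_chmod(acl: &mut ExtAcl, mode: u32) {
    acl.user_obj = ((mode >> 6) & 0o7) as u16;
    acl.mask     = ((mode >> 3) & 0o7) as u16;
    acl.other    = (mode & 0o7) as u16;
}

/// The owning group's effective access is its own entry masked by ACL_MASK.
pub fn group_obj_effective(acl: &ExtAcl) -> u16 {
    acl.group_obj & acl.mask
}
```

With `group_obj = rwx` and `chmod 0o750`, `stat()` would report group bits `r-x` (the mask), and `getfacl` would annotate the group entry with `#effective:r-x`.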

Default ACL inheritance (system.posix_acl_default):

When a directory has a default ACL stored in system.posix_acl_default:

  • New files inherit the default ACL as their access ACL (system.posix_acl_access). The inherited permissions are intersected with the complement of the process's umask when the creating syscall specifies a mode (e.g., open() with O_CREAT). If the file is created by creat() or open() with an explicit mode argument, the umask applies as: effective_perm = inherited_perm & ~umask for each entry.
  • New directories inherit the default ACL as both their access ACL and their own default ACL (so the inheritance propagates recursively). The mkdir() syscall applies the default ACL without umask masking — this matches Linux behavior and ensures directory trees maintain consistent ACL inheritance chains.
  • If no default ACL exists on the parent directory, the traditional umask-based permission model applies.
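
A minimal sketch of the umask interaction described above (per-entry permission masking; the helper names are hypothetical):

```rust
/// File creation under a directory with a default ACL: each inherited
/// entry's permissions are intersected with the complement of the
/// process umask (the rule stated above for open()/creat() with a mode).
pub fn inherit_file_perm(inherited: u16, umask: u16) -> u16 {
    inherited & !umask
}

/// mkdir() applies the default ACL without umask masking, so directory
/// trees keep consistent inheritance chains.
pub fn inherit_dir_perm(inherited: u16, _umask: u16) -> u16 {
    inherited
}
```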

Validation rules (checked on setxattr and filesystem mount):

  • a_version must be exactly 0x0002; any other value is rejected with EINVAL.
  • Each e_tag must be one of the six defined tag values; unknown tags are rejected.
  • e_perm must only use bits 0-2 (ACL_READ | ACL_WRITE | ACL_EXECUTE); higher bits are rejected.
  • e_id must be ACL_UNDEFINED_ID for ACL_USER_OBJ, ACL_GROUP_OBJ, ACL_MASK, and ACL_OTHER; must be a valid uid/gid (not ACL_UNDEFINED_ID) for ACL_USER and ACL_GROUP.
  • Exactly one ACL_USER_OBJ, one ACL_GROUP_OBJ, and one ACL_OTHER must be present.
  • ACL_MASK must be present if and only if ACL_USER or ACL_GROUP entries exist.
  • Entries must be in the canonical order defined above; out-of-order entries are rejected.
  • Duplicate ACL_USER entries for the same uid, or duplicate ACL_GROUP entries for the same gid, are rejected.
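
The rules above translate directly into a wire-format validator. The sketch below checks the length invariant, version, tag legality, permission bits, and the qualifier rule; the canonical-order, entry-count, and duplicate checks are omitted for brevity. Constant values follow the definitions earlier in this section:

```rust
pub const POSIX_ACL_XATTR_VERSION: u32 = 0x0002;
pub const ACL_UNDEFINED_ID: u32 = 0xFFFF_FFFF;

/// Partial validator for the POSIX ACL xattr wire format.
/// Returns the entry count N on success.
pub fn validate_posix_acl_xattr(value: &[u8]) -> Result<usize, &'static str> {
    // Length must be exactly 4 + 8*N for some positive N.
    if value.len() < 12 || (value.len() - 4) % 8 != 0 {
        return Err("length must be 4 + 8*N");
    }
    let version = u32::from_le_bytes([value[0], value[1], value[2], value[3]]);
    if version != POSIX_ACL_XATTR_VERSION {
        return Err("a_version must be 0x0002");
    }
    let n = (value.len() - 4) / 8;
    for i in 0..n {
        let off = 4 + i * 8;
        let tag  = u16::from_le_bytes([value[off], value[off + 1]]);
        let perm = u16::from_le_bytes([value[off + 2], value[off + 3]]);
        let id   = u32::from_le_bytes([value[off + 4], value[off + 5],
                                       value[off + 6], value[off + 7]]);
        if !matches!(tag, 0x01 | 0x02 | 0x04 | 0x08 | 0x10 | 0x20) {
            return Err("unknown e_tag");
        }
        if perm & !0x07 != 0 {
            return Err("e_perm uses bits outside rwx");
        }
        let named = matches!(tag, 0x02 | 0x08); // ACL_USER / ACL_GROUP
        if named && id == ACL_UNDEFINED_ID {
            return Err("named entry requires a uid/gid qualifier");
        }
        if !named && id != ACL_UNDEFINED_ID {
            return Err("non-named entry must use ACL_UNDEFINED_ID");
        }
    }
    Ok(n)
}
```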

Cross-references: The xattr storage layer that persists these binary values is specified in Section 14.16. The VFS inode permission check path that consults ACLs during access(), open(), and path resolution is specified in Section 14.1.

9.2.4 Driver Sandboxing

Each driver tier has a different sandboxing model:

Tier 1 (domain-isolated):

  • Memory isolation via hardware domain keys (cannot access core or other drivers)
  • DMA fencing via IOMMU (cannot DMA to arbitrary physical addresses)
  • Capability restrictions: driver only receives capabilities for its own devices
  • No access to: page tables, capability tables, scheduler state, other driver state

Tier 2 (process-isolated):

  • Full address space isolation (separate page tables)
  • IOMMU DMA fencing
  • seccomp-like syscall filtering: driver can only invoke KABI syscalls, not arbitrary Linux syscalls
  • Resource limits: memory, CPU time, file descriptors
  • Mandatory capability-based access control

All drivers:

  • Least-privilege capability grants based on driver manifest and system policy
  • Cryptographic signature verification (optional, configurable)
  • Audit logging of all capability-mediated operations (when audit is enabled)

See also: Section 9.7 (Confidential Computing) extends capability-based isolation to TEE enclaves with hardware-encrypted memory (SEV-SNP, TDX, ARM CCA). Section 9.6 (Post-Quantum Cryptography) ensures capability tokens and distributed credentials remain secure against quantum attacks via algorithm-agile crypto abstractions.

9.2.5 Security by Default

Linux problem: MAC (SELinux/AppArmor) exists but is complex and usually permissive by default. POSIX capabilities were supposed to replace setuid but are clunky and poorly adopted.

UmkaOS design:

  • The capability model is the foundation, not an add-on. Every resource access goes through capability checks — there is no "bypass" path.
  • Default-deny: Processes start with an empty capability set. They receive capabilities explicitly from their parent or from exec-time capability grants (replacing setuid).
  • No setuid binaries: The setuid bit on executables maps to "grant these capabilities on exec" — the process never actually runs as uid 0 with unlimited power. This is transparent to applications that check geteuid() == 0 (the SysAPI layer handles this).
  • LSM hooks at all security-relevant points: Pre-integrated, always active, not optional kernel config. SELinux and AppArmor policy engines are loadable modules that attach to these hooks.
  • Application profiles: Ship default confinement profiles for common services (sshd, nginx, postgres) — applications are sandboxed out of the box.


9.3 Verified Boot Chain

Inspired by: ChromeOS Verified Boot, Android dm-verity, UEFI Secure Boot. IP status: Clean — industry standards (UEFI Secure Boot, TCG TPM), standard cryptographic constructions (Merkle trees, 1979). Public specifications.

9.3.1 Problem

Section 2.1 defines the boot sequence but does not address boot integrity. A compromised bootloader or kernel image is the most dangerous attack vector — it undermines all runtime security.

Additionally, in production deployments (cloud, embedded, enterprise), operators need assurance that the running kernel is exactly what they deployed, with no tampering.

9.3.2 Boot Chain Verification

Firmware (UEFI Secure Boot)
    |
    | Verifies: GRUB/bootloader signature
    v
Bootloader (GRUB2 / systemd-boot)
    |
    | Verifies: UmkaOS kernel ELF signature (ML-DSA-65 or hybrid)
    v
Nucleus (verified nucleus, Phase 0.8)
    |
    | Verifies: embedded Evolvable image via LMS-SHAKE256 (NIST SP 800-208)
    |   LMS public key baked in Nucleus .rodata — version-independent
    |   (any Evolvable signed by the corresponding key passes)
    v
Evolvable (evolution orchestration)
    |
    | Verifies: initramfs signature (ML-DSA-65)
    | Verifies: Tier 0 driver integrity (embedded in kernel image)
    | Verifies: Tier 1 driver signatures (loaded from initramfs/rootfs)
    | Verifies: all live evolution payloads (ML-DSA-65 trust anchor chain)
    v
Running System
    |
    | dm-verity: runtime block-level integrity verification
    v
Verified Root Filesystem

Each step verifies the next. A break at any point halts boot (or falls back to a known good configuration).

Dual signature scheme: Nucleus uses LMS (hash-based, ~1-3 KB verifier, reuses Keccak-f[1600]) because it must be minimal and formally verifiable. Evolvable uses ML-DSA-65 (lattice-based, ~5 KB verifier, richer features) for runtime evolution. The LMS public key in Nucleus is NOT tied to a specific Evolvable version — it decouples Nucleus releases from Evolvable releases entirely. See Section 2.21 for the full analysis.

Nucleus standalone LMS verifier: The Nucleus LMS verifier is standalone code (~2 KB) in the Nucleus image that verifies the Evolvable image signature BEFORE the crypto API is initialized. It uses a hardcoded LMS public key embedded in the Nucleus binary's .rodata section. After the crypto API initializes (Phase 1.3.1 builtin registration / Phase 3.6 full init), lms-shake256 is registered as a normal algorithm and the Nucleus standalone verifier is no longer used. All subsequent signature verification (driver loading, module signing) goes through the standard crypto API path.

9.3.3 Kernel Image Signing

The UmkaOS image (vmlinuz-umka-VERSION) is signed during the build process:

// Build-time signature structure appended to kernel image.
#[repr(C)]
pub struct KernelSignature {
    /// Magic: "IKSIG\0\0\0"
    pub magic: [u8; 8],
    /// Signature algorithm ID. See `BOOT_ALGO_MAP` below for the
    /// canonical ID-to-name mapping. Common values:
    /// 0x0101 = ML-DSA-65, 0x0103 = Ed25519, 0x0105 = SLH-DSA-SHAKE-128f,
    /// 0x0200 = hybrid Ed25519 + ML-DSA-65.
    /// Algorithm ID 0 is reserved/invalid — a zero value indicates an
    /// unsigned or corrupt image.
    pub algorithm: u32,
    /// Length of actual signature data within the signature buffer.
    pub sig_len: u32,
    /// SHA-256 hash of the unsigned kernel image (informational).
    /// This field records the hash at signing time for offline auditing
    /// and debugging (e.g., comparing against a known-good manifest).
    /// It is NOT used during boot verification — the boot stub computes
    /// a fresh hash (step 3 below) and verifies the signature over that,
    /// eliminating TOCTOU attacks on a stored hash value.
    pub image_hash: [u8; 32],
    /// Signature buffer. Sized for the largest supported algorithm
    /// rounded up to 512-byte disk sector alignment (simplifies bootloader
    /// read logic — KernelSignature is appended to the kernel ELF image on
    /// disk, and sector-aligned structures avoid partial-sector reads):
    /// ML-DSA-65 = 3,309 bytes (FIPS 204, Table 2), SLH-DSA-SHAKE-128f = 17,088 bytes,
    /// hybrid = Ed25519 (64) + ML-DSA-65 (3,309) = 3,373 bytes.
    /// SLH-DSA-SHAKE-128f is the worst case at 17,088 bytes.
    /// Buffer = ceil(17,088 / 512) * 512 = 17,408 bytes (320 bytes overhead).
    /// Only sig_len bytes are meaningful; the rest is zero-padded.
    pub signature: [u8; 17_408],
    /// Public key fingerprint (SHA-256 of the public key).
    pub key_fingerprint: [u8; 32],
}
// Layout: 8 + 4 + 4 + 32 + 17408 + 32 = 17488 bytes.
const_assert!(size_of::<KernelSignature>() == 17488);

/// Canonical algorithm ID mapping table. Maps the `u32` algorithm IDs used in
/// `KernelSignature.algorithm`, `KabiSigSection.algorithm`, `TrustAnchor.algorithm`,
/// and `KrlEntry.algo_hint` to Kernel Crypto API algorithm name strings.
///
/// The ID space is partitioned by category:
/// - 0x0001-0x00FF: Hash algorithms (used standalone or as components)
/// - 0x0100-0x01FF: Asymmetric signature algorithms
/// - 0x0200-0x02FF: Hybrid / composite schemes
/// - 0x0300-0x03FF: Key encapsulation mechanisms (KEM)
///
/// Algorithm ID 0x0000 is reserved/invalid (indicates unsigned or corrupt data).
/// Cross-reference: the `SignatureAlgorithm` enum in [Section 9.6](#post-quantum-cryptography)
/// uses the same numeric IDs as this table. The Kernel Crypto API
/// ([Section 10.1](10-security-extensions.md#kernel-crypto-api)) registers implementations under these same string names.
///
/// Renamed from `SIGNATURE_ALGO_MAP` to `BOOT_ALGO_MAP` because this table
/// contains both signature algorithms (0x0100-0x02FF) AND key encapsulation
/// mechanisms (0x0300-0x03FF), not just signatures.
pub const BOOT_ALGO_MAP: &[(u32, &str)] = &[
    // Hash algorithms
    (0x0001, "sha256"),
    (0x0002, "sha384"),
    (0x0003, "sha512"),
    (0x0004, "sha3-256"),
    (0x0005, "sha3-384"),
    (0x0006, "sha3-512"),
    // NOTE: SHAKE256 (0x0007) intentionally removed. It is an XOF, not a
    // fixed-digest hash — it has no HashAlgorithm enum variant. SHAKE256
    // is accessed only via crypto API string "shake256" or via
    // SignatureAlgorithm::LmsShake256 (0x0107) below.
    // Asymmetric signature algorithms
    (0x0100, "ml-dsa-44"),         // NIST FIPS 204, security level 2
    (0x0101, "ml-dsa-65"),         // NIST FIPS 204, security level 3 (primary)
    (0x0102, "ml-dsa-87"),         // NIST FIPS 204, security level 5
    (0x0103, "ed25519"),           // RFC 8032
    (0x0104, "rsa-4096-pss"),      // PKCS#1 v2.1, SHA-256
    (0x0105, "slh-dsa-shake-128f"),      // NIST FIPS 205, fast variant
    (0x0106, "slh-dsa-sha2-128s"),  // NIST FIPS 205, SHA-2 small variant (fallback)
    (0x0107, "lms-shake256"),      // NIST SP 800-208 (Nucleus verification only)
    // Hybrid / composite schemes
    (0x0200, "hybrid-ed25519-ml-dsa-65"),  // Ed25519 + ML-DSA-65 dual signature
    // Key encapsulation mechanisms
    (0x0300, "ml-kem-768"),        // NIST FIPS 203, security level 3
    (0x0301, "ml-kem-1024"),       // NIST FIPS 203, security level 5
];

/// Look up the Kernel Crypto API algorithm name for a given algorithm ID.
/// Returns `None` for unrecognized IDs. Used by the boot stub, driver
/// signature verifier, and KRL processor to select the correct verification
/// algorithm.
///
/// O(n) linear scan — the table has <20 entries and is accessed only on
/// cold paths (boot verification, driver load, KRL update).
pub fn signature_algo_name(id: u32) -> Option<&'static str> {
    BOOT_ALGO_MAP.iter().find(|(k, _)| *k == id).map(|(_, v)| *v)
}

/// Convert a signature algorithm ID (hash subset, 0x0001-0x0006) to the
/// corresponding `HashAlgorithm` enum variant. These are the hash entries
/// in `BOOT_ALGO_MAP` (values 0x0001-0x0006), distinct from signature
/// algorithms (0x0100+). Returns `None` for IDs outside the hash range.
/// Note: 0x0007 (formerly SHAKE256) returns None — SHAKE256 is an XOF,
/// not a fixed-digest hash, and has no HashAlgorithm variant.
pub fn signature_algo_to_hash(algo_id: u32) -> Option<HashAlgorithm> {
    match algo_id {
        0x0001 => Some(HashAlgorithm::Sha256),
        0x0002 => Some(HashAlgorithm::Sha384),
        0x0003 => Some(HashAlgorithm::Sha512),
        0x0004 => Some(HashAlgorithm::Sha3_256),
        0x0005 => Some(HashAlgorithm::Sha3_384),
        0x0006 => Some(HashAlgorithm::Sha3_512),
        _ => None,
    }
}

Early boot verification memory: PQC signature verification requires scratch memory for intermediate computations during early boot (before the slab allocator is initialized). This scratch space is provided by a statically-allocated .bss section buffer: static VERIFY_SCRATCH: [u8; 20_480]. The buffer is sized for the worst-case algorithm used at boot: SLH-DSA-SHAKE-128f verification requires up to ~17,088 bytes of working space (matching the signature size), plus SHA-256 streaming state (~200 bytes) and alignment padding, totalling under 18KB. The 20,480-byte buffer (20KB, a multiple of 4096) provides sufficient headroom for all supported algorithms: ML-DSA-65 (~4KB), SLH-DSA-SHAKE-128f (~17,088 bytes), and hybrid Ed25519+ML-DSA-65 (~4KB). The boot stub uses this fixed buffer exclusively; it is released after the kernel signature is verified. The slab allocator path (SignatureData::Heap) is used only for runtime driver signature verification after boot.

Verification flow in the boot stub:

1. Boot stub finds KernelSignature at the end of the image.
2. Reads the public key from:
   a. UEFI Secure Boot db (if UEFI boot), OR
   b. Embedded in bootloader (if BIOS boot), OR
   c. Kernel command line: umka.verify_key=<fingerprint>
      (only honored in umka.verify=warn or umka.verify=off modes;
       ignored in umka.verify=enforce mode — see security restriction below)
3. Determines the algorithm from `KernelSignature.algorithm` via
   `signature_algo_name()`. At this point the slab allocator is NOT
   initialized — the boot stub uses the statically-allocated
   `VERIFY_SCRATCH` buffer (20 KB, `.bss` section) for all verification
   working memory. It does NOT call `crypto_alloc_*()` or allocate
   an `AkCipherTfm` — those APIs require a running slab allocator.
   Instead, the boot stub calls the algorithm's verification function
   directly via a static dispatch table (`BOOT_VERIFY_TABLE`):

   ```rust
   /// Static dispatch table for boot-time signature verification.
   /// Used before slab/crypto API initialization. Each entry provides
   /// a standalone verify function that operates on the VERIFY_SCRATCH
   /// arena without any heap allocation.
   struct BootVerifyEntry {
       algo_id: u32,
       name: &'static str,
       internal_hash: bool,
       verify_fn: fn(key: &[u8], data: &[u8], sig: &[u8], scratch: &mut [u8]) -> bool,
   }
   static BOOT_VERIFY_TABLE: &[BootVerifyEntry] = &[
       BootVerifyEntry { algo_id: 0x0101, name: "ml-dsa-65", internal_hash: true,
                         verify_fn: ml_dsa_65_verify_standalone },
       BootVerifyEntry { algo_id: 0x0103, name: "ed25519", internal_hash: true,
                         verify_fn: ed25519_verify_standalone },
       BootVerifyEntry { algo_id: 0x0200, name: "hybrid-ed25519-ml-dsa-65",
                         internal_hash: true, verify_fn: hybrid_verify_standalone },
       // SLH-DSA-SHAKE-128f (FIPS 205, fast variant): stateless hash-based signature.
       // internal_hash = true per FIPS 205 §9.2 (SLH-DSA signs the message
       // directly; pre-hashing would break verification). Requires ~17,088 bytes
       // of scratch space (fits within the 20 KB VERIFY_SCRATCH buffer).
       BootVerifyEntry { algo_id: 0x0105, name: "slh-dsa-shake-128f", internal_hash: true,
                         verify_fn: slh_dsa_shake_128f_verify_standalone },
       // RSA-4096-PSS requires pre-hashing (internal_hash = false)
       BootVerifyEntry { algo_id: 0x0104, name: "rsa-4096-pss", internal_hash: false,
                         verify_fn: rsa_pss_verify_standalone },
   ];
   ```

   The boot stub looks up the algorithm in this table, passing
   `&VERIFY_SCRATCH` as the working-memory arena. The full crypto API
   path (`crypto_alloc_shash()`, `crypto_alloc_akcipher()`, etc.) is
   used only for runtime driver signature verification after the slab
   allocator is initialized.
4. Checks the algorithm's internal-hash property via the dispatch table:
   a. **If INTERNAL_HASH is set** (Ed25519, ML-DSA, SLH-DSA, LMS, hybrid):
      passes the raw image bytes (excluding signature) directly to the
      table entry's `verify_fn`. These algorithms perform their own
      internal hashing per their standards (FIPS 204/205, RFC 8032).
      Pre-hashing would corrupt verification.
   b. **If INTERNAL_HASH is not set** (RSA-4096-PSS): computes SHA-256
      of the image (excluding signature) → `fresh_hash`, then passes
      `fresh_hash` to the entry's `verify_fn`.
   In both cases, the input is derived from the image just loaded — not
   from a stored value — eliminating the TOCTOU window.
5. If verification fails:
   a. If umka.verify=enforce (default in production): halt boot.
   b. If umka.verify=warn: log warning, continue (development mode).
   c. If umka.verify=off: skip (testing only).

Security restriction: In umka.verify=enforce mode, the umka.verify_key command-line parameter is ignored — the verification key MUST come from the firmware (UEFI db variable or DTB /chosen/umka,verify-key node), which is itself part of the measured/verified boot chain. The command-line parameter is only honored in umka.verify=warn (development) and umka.verify=off (debugging) modes. This prevents an attacker who controls the boot loader command line from substituting their own verification key.
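
Steps 3 through 4 above can be condensed into a single dispatch function. This is a self-contained sketch: the table entries and verify functions below are placeholders standing in for the real `BOOT_VERIFY_TABLE` entries and standalone verifiers, and the hash function is a trivial stand-in for the boot stub's SHA-256:

```rust
/// Illustrative copy of the dispatch-table entry shape from the text.
struct BootVerifyEntry {
    algo_id: u32,
    internal_hash: bool,
    verify_fn: fn(key: &[u8], data: &[u8], sig: &[u8], scratch: &mut [u8]) -> bool,
}

// Placeholder verifier — the real entries point at standalone
// ML-DSA/Ed25519/RSA verification functions.
fn placeholder_verify(_k: &[u8], _d: &[u8], _s: &[u8], _x: &mut [u8]) -> bool {
    true
}

static TABLE: &[BootVerifyEntry] = &[
    BootVerifyEntry { algo_id: 0x0101, internal_hash: true,  verify_fn: placeholder_verify },
    BootVerifyEntry { algo_id: 0x0104, internal_hash: false, verify_fn: placeholder_verify },
];

// Trivial stand-in for SHA-256 (illustration only).
fn digest_placeholder(data: &[u8]) -> [u8; 32] {
    let mut h = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        h[i % 32] ^= *b;
    }
    h
}

/// Look up the algorithm and verify, branching on INTERNAL_HASH:
/// raw image for internally-hashing algorithms, fresh digest for RSA-PSS.
fn boot_verify(algo: u32, key: &[u8], image: &[u8], sig: &[u8],
               scratch: &mut [u8]) -> Result<(), &'static str> {
    let entry = TABLE.iter().find(|e| e.algo_id == algo)
        .ok_or("unknown signature algorithm")?;
    let ok = if entry.internal_hash {
        (entry.verify_fn)(key, image, sig, scratch)
    } else {
        // Digest is computed from the image just loaded — no stored hash,
        // no TOCTOU window.
        let digest = digest_placeholder(image);
        (entry.verify_fn)(key, &digest, sig, scratch)
    };
    if ok { Ok(()) } else { Err("signature verification failed") }
}
```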

Hibernate Secret

The hibernate subsystem (Section 18.4, 17-virtualization.md) uses a hibernate secret — a 256-bit random key generated at boot from the hardware RNG (RDRAND/RNDR) — to HMAC-authenticate hibernate images against tampering. The key is stored only in kernel memory and destroyed on shutdown.

  • TPM systems: The hibernate secret is sealed to the TPM via TPM2_Create() bound to the current PCR policy (PCRs 0-12). PCR 12 is included because it contains Tier 1 driver measurements (Section 9.4) — excluding it would allow a compromised or swapped storage driver to load a tampered hibernate image while still passing the TPM seal policy. Only a boot configuration matching the original (including the exact set of loaded Tier 1 drivers) can unseal the key.
  • Non-TPM systems: The hibernate secret does not survive a cold reboot. Hibernate images are only valid for warm suspend/resume cycles where the bootloader preserves the kernel's reserved memory region. Caveat: this mechanism requires a cooperative bootloader (e.g., GRUB with memmap= or a custom UEFI stub that avoids reclaiming the reserved region). On systems where the bootloader does not preserve kernel memory, non-TPM hibernate provides no cryptographic protection against image tampering — it only provides crash-recovery semantics (detecting corruption, not preventing forgery). Deployments requiring hibernate image authentication should use TPM-based sealing.

This key is NOT related to the kernel verification key (Section 9.3) or the IMA HMAC key (Section 9.5) — each subsystem derives its own independent secret.

Driver Signature Verification

Tier 1 drivers are signed (Section 11.3 mentions "cryptographically signed drivers can be granted Tier 1"). The device registry (Section 11.4) enforces this during the Loading → Initializing transition.

The .kabi_sig ELF section format is defined canonically in Section 12.7 (KabiSigSection struct). The primary signing algorithm is ML-DSA-65 (NIST FIPS 204, security level 3; 1,952-byte public key, 3,309-byte signature per FIPS 204 Table 2, verified against the liboqs reference implementation). ML-DSA-65 is used for both kernel image and driver signatures — a single algorithm simplifies key management and verification code.

Policy (integrates with existing tier trust model):

| Verification | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Signature required? | Always (part of kernel image) | Configurable (default: required) | Optional |
| Unsigned behavior | Cannot exist unsigned | Demoted to Tier 2 | Allowed (user-space isolation) |
| Tampered binary | Kernel does not boot | Loading rejected, alert | Loading rejected, alert |

Verification ordering (driver load path):

  1. .kabi_sig verification (authenticity): The device registry verifies the ML-DSA-65 (or fallback SLH-DSA-SHA2-128s) signature in the .kabi_sig ELF section against the trusted driver signing key. Rejection is immediate and final. See KabiSigSection in Section 12.7.
  2. IMA appraisal (runtime integrity): IMA measures the driver binary hash and verifies against the IMA appraise policy. IMA records the measurement in the IMA log for remote attestation. Rejection blocks the load.
  3. Both must pass for the driver to enter the Loading → Initializing transition. .kabi_sig provides build-time authenticity (was this binary signed by a trusted party?). IMA provides runtime integrity (has the binary been tampered with on disk since signing?).

Driver signing key hierarchy:

  • Root of trust: UmkaOS kernel signing key (ML-DSA-65), embedded in the kernel image at build time. Verified during secure boot.
  • Driver signing key: Separate ML-DSA-65 key, signed by the kernel signing key (certificate chain). The driver signing certificate is embedded in the kernel image's .driver_certs section.
  • Vendor driver keys: Third-party vendors (e.g., Nvidia) have their own signing keys. Their root CA certificate is pinned in the kernel image at distribution build time. Certificate chain: Vendor Root CA → Vendor Driver Signing Cert → module signature.
  • Key rotation: New driver signing keys are distributed via kernel updates. The old key remains trusted for a deprecation period (configurable, default: 2 kernel releases) to avoid breaking existing signed drivers.
  • key_id matching: KabiSigSection.key_id (SHA3-256 hash of the signing key) is matched against the trusted key list in .driver_certs. If no match, the driver is rejected.

Implementation path: Driver signature verification calls crypto_verify_signature(algo, message, signature, pubkey) from the Kernel Crypto API (Section 10.1). The algo parameter is CryptoAlgType::Akcipher with the algorithm selected from the SignatureAlgorithm enum (Section 9.6). Verification runs synchronously in the device registry worker thread context.

Signing and Trust Enforcement Model

UmkaOS enforces driver trust through three independent layers. A driver must pass ALL three to reach Tier 0 or Tier 1. Any single layer can cap it at Tier 2:

Layer What it enforces Mechanism Fakeable?
OKLF license (legal) Proprietary code cannot legally be in Ring 0 (OKLF §4.2(c)) Copyright law No (legal liability)
Signing cert max_tier (trust policy) Distro controls which signing keys vouch for which tiers DriverCertEntry.max_tier in .driver_certs No (embedded at kernel build time)
license_id in manifest (audit/automation) Automates license-to-tier policy for distro-built drivers license_id in KabiDriverManifest Yes (self-declaration)

The signing cert max_tier is the primary technical enforcement. It is set at kernel build time by the distro and cannot be overridden at runtime or faked by the vendor. The license_id field is a self-declaration (like Linux's MODULE_LICENSE()) — trustworthy for distro-built drivers, advisory for vendor-signed binaries. MOK-enrolled keys are always capped at Tier 2 regardless of their max_tier value. See Section 12.7 for the DriverCertEntry structure and the module load sequence that enforces these layers.
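The tier-capping decision described above can be sketched as a small pure function. This is a hedged illustration, not the KABI definition: names like CertSource, SigningCert, and effective_tier are hypothetical, and tiers are encoded with 0 as most privileged (so a numerically larger value is a lower tier):

```rust
/// Hypothetical sketch of the tier-ceiling computation. Tier 0 is the
/// most privileged; a larger number means a less privileged tier.
#[derive(Clone, Copy, PartialEq)]
pub enum CertSource { Builtin, Mok }

pub struct SigningCert {
    /// Ceiling this signing key vouches for (DriverCertEntry.max_tier).
    pub max_tier: u8,
    pub source: CertSource,
}

/// Effective tier = the least privileged (numerically largest) of the
/// driver's requested tier and the cert's ceiling. MOK-enrolled keys
/// are always capped at Tier 2 regardless of their max_tier value.
pub fn effective_tier(preferred_tier: u8, cert: &SigningCert) -> u8 {
    let mut ceiling = cert.max_tier;
    if cert.source == CertSource::Mok {
        ceiling = ceiling.max(2);
    }
    preferred_tier.max(ceiling)
}
```

For example, a driver requesting Tier 0 but signed by a vendor cert with max_tier=2 loads at Tier 2, while the same driver built and signed by the distro (max_tier=0) keeps Tier 0.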

Practical Signing Workflows

The signing architecture supports four deployment scenarios. In all cases, the on-disk driver format is a single ELF binary with .uko extension containing the manifest (.kabi_manifest section), signature (.kabi_sig section), and code. No separate manifest files exist on the running system.

Distribution Vendor Workflow (Fedora/Ubuntu/SUSE equivalent)

Each distribution independently:

  1. Generates its own ML-DSA-65 root keypair in an HSM (private key never exported).
  2. Builds the kernel image, embedding its public key as the trust anchor.
  3. Signs its kernel with its key, producing a PE/COFF binary for UEFI Secure Boot.
  4. Enrolls its key in UEFI db — either via a shim signed by the Microsoft UEFI CA (for broad hardware compatibility) or via direct db enrollment.
  5. Builds open-source drivers from source, signs them with the distro's driver signing key (chained to the root key). These drivers get max_tier=0 (eligible for any tier including Tier 0).
  6. Pins vendor CA certificates it chooses to trust into the kernel's .driver_certs section, with max_tier=2 for closed-source vendors. The distro decides which vendors to trust and at what tier — this is a policy decision, not a technical one.
  7. Ships modules in /lib/modules/umka/<version>/ with .uko extension.

Hardware Vendor Workflow (closed-source, e.g. Nvidia GPU)

  1. Vendor generates own ML-DSA-65 CA keypair and publishes the CA certificate (public portion) to distributions via a business relationship or public registry.
  2. Vendor derives a driver signing key, signs it with the CA key (certificate chain).
  3. Vendor compiles the driver binary (closed source) and signs it — the .kabi_sig section contains the signature and a key ID that chains to the vendor's CA.
  4. Vendor ships a single pre-signed .uko binary per KABI major version. Because KABI is a stable ABI (unlike Linux's deliberately-unstable internal API), one binary works across all distributions that pin the vendor's CA certificate. This is a significant practical advantage over Linux, where Nvidia must rebuild for every kernel release.
  5. At load time, the kernel verifies the .kabi_sig signature, finds the vendor's CA in .driver_certs with max_tier=2, and loads the driver at Tier 2 (Ring 3, process-isolated). The driver works, the hardware functions — but with full process isolation overhead (~200-500 cycles per crossing instead of ~23 cycles at Tier 1).

Vendor drivers that the distro independently builds from source (e.g., Intel open-sources their NIC driver, distro builds it) are signed with the distro's own key, not the vendor's key, and get the distro's max_tier (typically 0). The vendor CA is irrelevant for open-source drivers that the distro builds itself.

Self-Builder Workflow (Gentoo/LFS-style)

  1. Generate own ML-DSA-65 keypair: umka-keygen --output my-kernel-key.
  2. Build kernel with SIGNING_KEY=my-kernel-key.priv — public key becomes the trust anchor.
  3. Enroll key in UEFI db via MOK manager or direct enrollment.
  4. Sign own drivers with umka-sign-driver --key my-driver-key.priv driver.uko. Self-built drivers get full Tier 0/1/2 access since the signing key is embedded as a built-in cert (source=CERT_SOURCE_BUILTIN).
  5. Vendor drivers: manually add vendor CA certs to .driver_certs at build time, with chosen max_tier (self-builders control their own trust policy).

Development / Debug Workflow

  • umka.module_sig=off boot parameter: all signature checks disabled. Drivers load at their requested tier. Kernel is tainted (TAINT_UNSIGNED_MODULE). IMA log records all loads as unsigned.
  • umka.driver_tier_policy=permissive: license and cert tier ceilings are ignored. All tier assignments follow only the manifest's preferred_tier and admin overrides. Kernel is tainted (TAINT_TIER_POLICY_OVERRIDE). This mode is intended for development and CI testing, not production.
  • Both taint flags are visible via /proc/sys/kernel/tainted and appear in crash dumps. Remote attestation via IMA/TPM will show the tainted state.

Runtime Filesystem Integrity (dm-verity)

For production deployments, the root filesystem can be verified on every block read using Merkle tree verification. This detects tampering of any file at read time.

Root Filesystem Device           Hash Tree Device
+------------------+             +------------------+
| Block 0          | --hash-->   | Leaf hash 0      |
| Block 1          | --hash-->   | Leaf hash 1      |
| Block 2          | --hash-->   | Leaf hash 2      |
| Block 3          | --hash-->   | Leaf hash 3      |
| ...              |             | ...              |
+------------------+             +------------------+
                                         |
                                   +-----+-----+
                                   |           |
                                 Node 0-1    Node 2-3
                                   |           |
                                   +-----+-----+
                                         |
                                    Root Hash
                                   (signed, stored
                                    in kernel cmdline
                                    or bootloader)

Implementation in UmkaOS Block I/O layer:

// umka-block/src/verity.rs

/// Maximum supported hash tree depth. A depth of 20 with 4096-byte
/// hash blocks covers 4096^20 blocks, far exceeding any physical device.
const VERITY_MAX_TREE_DEPTH: usize = 20;

pub struct VerityTarget {
    /// Underlying data device.
    data_device: BlockDeviceHandle,

    /// Hash tree device (can be same device, appended).
    hash_device: BlockDeviceHandle,

    /// Hash algorithm: SHA-256 (default).
    hash_algorithm: HashAlgorithm,

    /// Block size for hashing (default: 4096).
    hash_block_size: u32,

    /// Root hash (trusted). Provenance (exactly one must succeed):
    ///  1. Embedded in the signed kernel image — verified transitively by the
    ///     kernel signature check in Section 9.2.3. No separate signature needed.
    ///  2. Passed as a kernel command-line parameter (`umka.verity.root_hash=`)
    ///     when the cmdline itself is measured into TPM PCR 11 (Section 9.3.1). The
    ///     TPM attestation chain verifies the cmdline has not been tampered with.
    ///  3. Loaded from initramfs (file `/etc/verity/root_hash`) when the
    ///     initramfs is IMA-appraised (Section 9.4) and measured into TPM PCR 10.
    /// An unauthenticated root hash from an unverified kernel cmdline is
    /// rejected unless `umka.verify=off`.
    root_hash: [u8; 32],

    /// Hash tree depth (computed from device size).
    tree_depth: u32,

    /// Pre-computed hash tree level offsets.
    /// A typical dm-verity device has at most ~20 hash tree levels
    /// (a 20-level tree covers 4096^20 blocks, far exceeding any
    /// physical device). A fixed-size array avoids heap allocation
    /// entirely, which is essential during early boot before the
    /// general-purpose allocator is initialized.
    level_offsets: [u64; VERITY_MAX_TREE_DEPTH],

    /// Number of valid entries in level_offsets (always <= VERITY_MAX_TREE_DEPTH).
    level_count: u32,

    /// Error behavior on verification failure.
    error_behavior: VerityErrorBehavior,

    /// Cache of verified hashes (avoid re-hashing on repeated reads).
    /// Capacity capped at 64 Ki entries (~2 MB) regardless
    /// of device size, covering the hot working set of recently-accessed blocks.
    /// None until the slab allocator is online; verification operates
    /// without caching. Initialized to Some(...) during block subsystem
    /// init.
    hash_cache: Option<LruCache<u64, [u8; 32]>>,
}

#[repr(u32)]
pub enum VerityErrorBehavior {
    /// Return -EIO for the failed block (default).
    Eio     = 0,
    /// Kernel panic (for security-critical deployments).
    Panic   = 1,
    /// Log and continue (for debugging).
    Ignore  = 2,
}

VerityTarget::map() implementation: For each bio, compute the block index from the bio's sector offset. map() queues the bio for async hash verification via a verification workqueue. The workqueue handler reads hash blocks from the hash device, walks the Merkle tree leaf-to-root, and on successful verification resubmits the original data bio to the data device. On hash mismatch, the bio is completed with -EIO. This matches Linux's dm-verity, which returns DM_MAPIO_SUBMITTED.

Return value: DmMapResult::Submitted (bio ownership transferred to the verification workqueue). If any hash mismatches: set the bio error, increment the corruption counter, emit an FMA event (Section 20.1), and apply the error_behavior policy.

The page cache caches verified data — re-reads of the same block skip verification (cache hit). Read-ahead submits verification asynchronously for prefetched blocks.

Merkle tree walk detail: For data block N at level 0, compute its hash. Look up the expected hash at level 1: hash_block = N / hashes_per_block, hash_offset = N % hashes_per_block. Compare. Repeat ascending through each level until reaching the root hash (stored in the dm-verity superblock). If all levels match: the block is authentic. If any level mismatches: return -EIO to the bio caller.
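The index arithmetic behind this walk can be sketched in a few lines. This is a simplified model under the defaults stated above (SHA-256 digests, 4096-byte hash blocks); hash_location and tree_depth are illustrative helpers, not the umka-block implementation:

```rust
/// Simplified Merkle-walk index computation: for data block `n` at
/// level 0, find which hash block and which byte offset within it hold
/// the expected hash at the next level up. With SHA-256 (32-byte
/// digests) and 4096-byte hash blocks, each hash block holds 128 digests.
const HASH_SIZE: u64 = 32;
const HASH_BLOCK_SIZE: u64 = 4096;
const HASHES_PER_BLOCK: u64 = HASH_BLOCK_SIZE / HASH_SIZE; // 128

/// Returns (hash_block_index, byte_offset_within_block) for block `n`,
/// i.e. hash_block = n / hashes_per_block, offset = n % hashes_per_block.
pub fn hash_location(n: u64) -> (u64, u64) {
    (n / HASHES_PER_BLOCK, (n % HASHES_PER_BLOCK) * HASH_SIZE)
}

/// Number of tree levels needed to cover `data_blocks` blocks: keep
/// dividing by the fan-out until a single root block remains.
pub fn tree_depth(mut data_blocks: u64) -> u32 {
    let mut depth = 0;
    while data_blocks > 1 {
        data_blocks = (data_blocks + HASHES_PER_BLOCK - 1) / HASHES_PER_BLOCK;
        depth += 1;
    }
    depth
}
```

With the 128-way fan-out, two levels already cover 128² = 16,384 data blocks, which is why the fixed VERITY_MAX_TREE_DEPTH of 20 is far more than any physical device needs.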

Activation via kernel command line (standard dm-verity syntax):

root=/dev/dm-0
dm-mod.create="umka-root,,,ro,0 <size> verity 1 /dev/sda1 /dev/sda2 4096 4096 <data_blocks> <hash_start> sha256 <root_hash> <salt>"

Or via the device-mapper ioctl interface (DM_TABLE_LOAD), which veritysetup from cryptsetup uses. Existing tools work unmodified.

9.3.4 Key Revocation

Driver signature verification checks against a Key Revocation List (KRL) to handle compromised signing keys:

  • The KRL is a signed, append-only list of revoked key fingerprints, embedded in the initramfs and updated with each kernel/initramfs build.
  • On boot, the KRL is loaded and checked before any driver loading. Any driver signed with a revoked key is rejected, regardless of tier.
  • At runtime, a new KRL can be loaded via /proc/umka/security/krl (requires admin capability). This allows revoking keys without rebooting.
  • UEFI Secure Boot dbx provides bootloader/kernel-level revocation (existing standard mechanism). The KRL extends this to cover driver-level revocation.

// Kernel-internal

/// Key revocation list.
///
/// **Lifetime management**: Both the `KeyRevocationList` struct and the
/// `revoked_keys` array it points to are allocated together as a single
/// slab allocation (struct header + trailing key array). Boot-time KRLs
/// are bump-allocated and have true `'static` lifetime. Runtime-loaded
/// KRLs are slab-allocated and managed via RCU: the old KRL (struct +
/// key array) remains valid until all readers complete their RCU
/// read-side critical section, then the entire slab slot is freed.
/// Readers must access the struct and its `revoked_keys` only within
/// `rcu_read_lock()` / `rcu_read_unlock()`.
// kernel-internal, not KABI
#[repr(C)]
pub struct KeyRevocationList {
    /// Signature over the KRL itself (prevents tampering).
    /// Sized for worst-case PQC signature (SLH-DSA-SHAKE-128f = 17,088 bytes,
    /// rounded up to 512-byte alignment = 17,408 bytes).
    pub signature: [u8; 17_408],
    /// Actual length of the signature in `signature[]`.
    /// Algorithms produce different-length signatures (Ed25519: 64 bytes,
    /// ML-DSA-65: 3,309 bytes (FIPS 204, Table 2), SLH-DSA-SHAKE-128f: 17,088 bytes).
    pub sig_len: u32,
    /// Algorithm hint for this KRL's signature. Matches `SignatureAlgorithm`
    /// enum (Section 9.5.2). This field is a parser optimization hint only —
    /// the verifier MUST NOT trust it as authoritative. The actual verification
    /// algorithm is determined by the signing key: each public key in the
    /// kernel's trusted keyring has a fixed algorithm, and the verifier enforces
    /// that `algorithm` matches the key's algorithm. A mismatch causes
    /// verification failure (`-EKEYREJECTED`), preventing algorithm-confusion
    /// attacks where an attacker specifies a weaker algorithm in a forged KRL.
    pub algorithm: u32,
    // **Trust anchor**: KRL signatures are verified against the platform's
    // **Secure Boot key hierarchy**. Specifically, the KRL signing key must
    // chain to a certificate in the UEFI `db` (Signature Database) or, for
    // systems without UEFI, to the kernel's built-in trusted keyring
    // (`CONFIG_SYSTEM_TRUSTED_KEYRING` equivalent). This is the same trust
    // anchor used for kernel module signatures, ensuring a single chain of
    // trust from firmware to revocation policy.
    /// Version (monotonically increasing, prevents rollback).
    /// Anti-rollback enforcement: on every KRL load, the kernel
    /// compares this version against the last-seen KRL version stored
    /// in TPM NV index 0x01800200 (a monotonic counter). If the new
    /// version is less than the stored value, the KRL is rejected
    /// (prevents an attacker from replaying an older KRL with fewer
    /// revocations). On successful load, the TPM NV counter is updated
    /// to the new version. On systems without a TPM, the last-seen
    /// version is stored in a UEFI authenticated variable
    /// (umka-krl-version); this is weaker (vulnerable to firmware-level
    /// attacks) but still prevents userspace rollback.
    ///
    /// **First-boot initialization**: On a system where NV index `0x01800200` does not exist,
    /// the boot stub creates it during the first secure boot sequence:
    ///
    /// 1. **Detect first boot**: `TPM2_NV_ReadPublic(0x01800200)` returns
    ///    `TPM_RC_NV_UNDEFINED` — no prior KRL has been loaded.
    /// 2. **Define index**: `TPM2_NV_DefineSpace` with attributes
    ///    `TPMA_NV_COUNTER | TPMA_NV_AUTHREAD | TPMA_NV_AUTHWRITE |
    ///    TPMA_NV_NO_DA | TPMA_NV_PLATFORMCREATE`, size 8 bytes. The
    ///    `PLATFORMCREATE` attribute ensures only firmware (not OS) can delete
    ///    the index.
    /// 3. **Initialize**: `TPM2_NV_Increment` to set the counter to 1.
    /// 4. **Load KRL**: The first KRL must carry `krl_version ≥ 1` to pass the
    ///    anti-rollback check.
    ///
    /// Subsequent boots read the counter and compare against the KRL's embedded
    /// version. Systems that have been provisioned but later have their TPM
    /// cleared (e.g., ownership change) will fail KRL loading until re-provisioned
    /// via the secure enrollment process (Section 9.3 key ceremony).
    pub version: u64,
    /// Public key fingerprints (SHA-256 hash of the raw public key bytes),
    /// sorted for binary search. Algorithm-agnostic: the fingerprint is a
    /// SHA-256 hash regardless of whether the key is Ed25519, ML-DSA-44,
    /// ML-DSA-65, or any future algorithm — all produce a 32-byte fingerprint.
    /// Uses `RcuSlice<[u8; 32]>` — a zero-cost wrapper around the raw pointer
    /// that can only be dereferenced within an `RcuReadGuard` scope, preventing
    /// use-after-free if the backing slab allocation is freed after a grace period.
    /// Boot-time KRLs are bump-allocated (effectively 'static); runtime-loaded
    /// KRLs are slab-allocated and freed after an RCU grace period.
    /// Length is given by `revoked_count`.
    pub revoked_keys: RcuSlice<[u8; 32]>,
    /// Number of entries pointed to by `revoked_keys`.
    /// Required because `RcuSlice` does not carry length information.
    pub revoked_count: u32,
    /// Revoked CA fingerprints (SHA-256 hash of the CA's raw public key bytes),
    /// sorted for binary search. When a CA is revoked, ALL driver signing keys
    /// chained to that CA are rejected — even if the individual driver signing
    /// keys do not appear in `revoked_keys`. This handles the case where a
    /// vendor's root CA key is compromised: a single CA revocation entry
    /// invalidates every driver signed by any key in that CA's hierarchy,
    /// without needing to enumerate all individual signing keys.
    ///
    /// The verification flow checks `revoked_cas` after checking `revoked_keys`:
    /// ```text
    /// 1. Extract key_fingerprint from DriverSignature.
    /// 2. Binary search revoked_keys for the fingerprint → reject if found.
    /// 3. Walk the signing key's certificate chain to the root CA.
    /// 4. For each CA in the chain: binary search revoked_cas → reject if found.
    /// 5. If no match: proceed with normal signature verification.
    /// ```
    ///
    /// CA revocation entries are distributed via the same KRL update mechanism
    /// as individual key revocations (signed, versioned, anti-rollback via TPM
    /// NV counter). Each distro must independently issue a kernel update or KRL
    /// update when a vendor CA is compromised.
    pub revoked_cas: RcuSlice<[u8; 32]>,
    /// Number of entries pointed to by `revoked_cas`.
    pub revoked_ca_count: u32,
    /// SHA-256 hash of the serialized KRL (for TPM PCR extend).
    pub digest: [u8; 32],
}

/// **Lifetime safety**: The `revoked_keys` field uses `RcuSlice<[u8; 32]>`,
/// a zero-cost wrapper around `*const [u8; 32]` that implements `Deref` only
/// when an `RcuReadGuard` is in scope. `RcuSlice` does NOT implement `Send`
/// or `Sync` directly — the containing `KeyRevocationList` is shared via
/// `RcuCell<KeyRevocationList>`, which provides the necessary lifetime
/// guarantees (readers hold `RcuReadGuard`, preventing grace period completion
/// and slab freeing). The blanket `unsafe impl Send/Sync` on `KeyRevocationList`
/// is safe because the struct is only accessible through `RcuCell`, never
/// directly shared.
// SAFETY: `revoked_keys` is an `RcuSlice` whose pointee remains valid for the
// lifetime of any `RcuReadGuard` that can reach this struct. The struct is
// read-only after construction (immutable once published via RCU). Access
// outside an RCU read-side critical section is prevented by `RcuSlice`'s API,
// making cross-thread send and share safe.
unsafe impl Send for KeyRevocationList {}
unsafe impl Sync for KeyRevocationList {}

The verification flow during driver loading:

1. Extract key_fingerprint from DriverSignature.
2. Binary search KRL revoked_keys for the fingerprint.
3. If found: reject driver load with -EKEYREVOKED, emit alert.
4. Walk the signing key's certificate chain to the root CA.
5. For each CA in the chain: binary search KRL revoked_cas for the CA fingerprint.
6. If found: reject driver load with -EKEYREVOKED, emit alert
   ("CA revoked: vendor {ca_name}, all drivers from this CA are rejected").
7. If no match in either list: proceed with normal signature verification.
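The two-list rejection check in steps 2 and 5 reduces to binary searches over sorted 32-byte fingerprints. The following is a minimal sketch under simplified assumptions (plain Vec instead of RcuSlice, no RCU guards; Krl, KrlVerdict, and check_krl are illustrative names):

```rust
/// Illustrative KRL lookup. Fingerprints are SHA-256 digests kept
/// sorted, so membership testing is an O(log N) binary search.
pub struct Krl {
    pub revoked_keys: Vec<[u8; 32]>, // sorted
    pub revoked_cas: Vec<[u8; 32]>,  // sorted
}

pub enum KrlVerdict {
    /// Not revoked: proceed with normal signature verification.
    Ok,
    /// Signing key fingerprint found in revoked_keys (-EKEYREVOKED).
    KeyRevoked,
    /// A CA in the chain is revoked; carries the chain index (-EKEYREVOKED).
    CaRevoked(usize),
}

/// `key_fp` is the driver signing key fingerprint; `ca_chain` lists the
/// fingerprint of each CA from the signing cert up to the root CA.
pub fn check_krl(krl: &Krl, key_fp: &[u8; 32], ca_chain: &[[u8; 32]]) -> KrlVerdict {
    if krl.revoked_keys.binary_search(key_fp).is_ok() {
        return KrlVerdict::KeyRevoked;
    }
    for (i, ca) in ca_chain.iter().enumerate() {
        if krl.revoked_cas.binary_search(ca).is_ok() {
            return KrlVerdict::CaRevoked(i);
        }
    }
    KrlVerdict::Ok
}
```

Because the CA check walks the whole chain, revoking a single vendor root CA rejects every driver signed under that hierarchy, exactly as the revoked_cas documentation above requires.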

9.3.4.1 First-Boot KRL Counter Pre-Increment Prevention

Attack: Between NV_DefineSpace (creates the NV counter at index 0x01800200) and the first NV_Increment (records the initial KRL epoch), an attacker with platform auth could call NV_Increment to advance the counter. The initial KRL entry would be bound to epoch 1, not epoch 0, making epoch 0 appear as a rollback.

Defense: UmkaOS uses a two-phase commit for first-boot NV counter provisioning:

  1. NV_DefineSpace with NV_ORDERLY attribute: Creates the counter. The NV_ORDERLY flag marks the space as "provisioning in progress" until explicitly cleared.

  2. Immediate NV_Increment in the same TPM command session: NV_DefineSpace and the first NV_Increment are submitted as a single session (TPM2_CC_StartAuthSession → NV_DefineSpace → NV_Increment → FlushContext). The TPM processes these atomically within the session. No external caller can interleave between NV_DefineSpace and NV_Increment within a session.

  3. Platform auth lock: After first-boot provisioning completes (NV_DefineSpace + first NV_Increment), UmkaOS calls TPM2_CC_HierarchyControl(TPM_RH_PLATFORM, disable=true) to disable the platform hierarchy for the remainder of the boot, preventing any further platform-auth commands from touching the counter. Provisioning itself requires the platform auth secret, which is only available to UmkaOS's trusted boot firmware; the hierarchy remains disabled until the next boot.

  4. Binding check: The first KRL entry is written with generation = NV_counter_value_after_first_increment (expected: 1). If the counter reads ≠ 1 after provisioning, provisioning aborts and triggers a measured boot failure event (extends PCR[10] with a failure marker).

Non-TPM path: On systems without TPM, the rollback counter is stored in NVRAM (UEFI SetVariable with EFI_VARIABLE_BOOTSERVICE_ACCESS | EFI_VARIABLE_NON_VOLATILE). The same two-phase approach applies: UEFI variable creation and initial-write happen in the same UEFI Boot Services call sequence before ExitBootServices().
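The two-phase provisioning and the step-4 binding check can be modeled against an abstract counter interface. This is a hedged sketch: MonotonicCounter, provision_first_boot, and FakeCounter are illustrative names, not kernel or TPM APIs; the real path issues the TPM2 commands listed above.

```rust
/// Abstract monotonic-counter interface standing in for the TPM NV
/// command sequence (hypothetical trait, for illustration only).
pub trait MonotonicCounter {
    fn define(&mut self);      // models TPM2_NV_DefineSpace
    fn increment(&mut self);   // models TPM2_NV_Increment
    fn read(&self) -> u64;     // models TPM2_NV_Read
}

/// First-boot provisioning with the step-4 binding check: after
/// DefineSpace plus the first Increment, the counter must read exactly
/// 1. Any other value means a command interleaved, so provisioning
/// aborts (and the real code extends PCR[10] with a failure marker).
pub fn provision_first_boot<C: MonotonicCounter>(c: &mut C) -> Result<u64, &'static str> {
    c.define();
    c.increment();
    match c.read() {
        1 => Ok(1), // the first KRL must then carry krl_version >= 1
        _ => Err("counter pre-incremented: abort provisioning"),
    }
}

/// In-memory stand-in: `pre_increments` models an attacker racing
/// extra NV_Increment calls in between define and our increment.
pub struct FakeCounter { pub value: u64, pub pre_increments: u64 }

impl MonotonicCounter for FakeCounter {
    fn define(&mut self) { self.value = self.pre_increments; }
    fn increment(&mut self) { self.value += 1; }
    fn read(&self) -> u64 { self.value }
}
```

The check is deliberately exact ("reads 1") rather than a range: a counter above 1 at this point can only mean interleaved increments, which the single-session submission is supposed to make impossible.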

9.3.5 RCU Read-Side Timeout (DoS Mitigation)

Attack vector: RCU-based KRL management has a theoretical denial-of-service vulnerability. An attacker could keep an RCU read-side critical section open indefinitely by repeatedly calling a syscall that reads from the KRL (e.g., driver loading queries) without allowing the critical section to exit. This would prevent the RCU grace period from ever completing, blocking KRL updates and preventing revocation from taking effect.

Mitigation: Preemptible RCU with priority boosting. UmkaOS uses preemptible RCU for KRL read-side critical sections — readers can be preempted by higher-priority tasks, preventing a single reader from indefinitely blocking grace period completion. When a grace period has been pending for longer than KRL_RCU_BOOST_MS (default: 1ms), the RCU subsystem boosts the priority of all in-progress KRL readers to SCHED_FIFO priority 99, expediting their completion. This is the same mechanism used by Linux's CONFIG_RCU_BOOST — it respects RCU semantics (readers complete naturally rather than being forcibly terminated) and avoids the use-after-free risk of forcibly calling rcu_read_unlock() from outside the critical section.

Implementation notes:

  • KRL readers are expected to be brief: a binary search over the revoked_keys array is O(log N) × 32-byte key comparisons. With 10,000 revoked keys (extreme worst case), this is ~14 comparisons of 32 bytes each — a few hundred nanoseconds, well within RCU read-side critical section guidelines. Any syscall that blocks while holding the KRL RCU read lock is a bug.
  • Priority boosting is a safety net, not a normal-path mechanism. Under normal operation, KRL lookups complete in <10 μs — well before the 1 ms boost threshold.
  • On boost trigger, the kernel logs a warning with the CPU ID, task PID, and execution context. Repeated boosts from the same task trigger rate limiting (log suppression) to avoid log flooding.
  • Syscall rate limiting: The driver-load syscall (umka_driver_load) is rate-limited to at most 10 invocations per second per UID (using a per-UID token bucket). This prevents a malicious userspace process from abusing repeated driver load requests to keep KRL RCU read-side critical sections active across many CPUs and delay grace period completion. Excess requests fail with -EAGAIN. This is the primary defense; priority boosting is the secondary safety net.
  • This mitigation is defensive: a compromised kernel (or a bug in a Tier 1 driver) could bypass it, but that is outside the threat model. The RCU boost + rate limiting protects against userspace-triggered DoS only.
  • Scope: Tier 1 crash containment (Section 11.2) is the primary defense against intentionally malicious or buggy drivers. The RCU timeout mechanism guards against accidental long RCU holds (e.g., a Tier 1 driver that forgot to call rcu_read_unlock() on an error path). It is not designed to stop a driver that deliberately holds RCU forever — Tier 1 crash recovery handles that by isolating and restarting the driver domain.
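The per-UID token bucket for umka_driver_load can be sketched as follows. The 10/s capacity matches the text; the struct and method names are illustrative, not the kernel's:

```rust
/// Illustrative per-UID token bucket: capacity 10 tokens, refilled at
/// 10 tokens/second, so sustained driver-load attempts are capped at
/// 10 per second with bursts of at most 10.
pub struct TokenBucket {
    tokens: f64,
    capacity: f64,
    refill_per_ns: f64,
    last_refill_ns: u64,
}

impl TokenBucket {
    pub fn new(capacity: f64, per_second: f64, now_ns: u64) -> Self {
        TokenBucket {
            tokens: capacity,
            capacity,
            refill_per_ns: per_second / 1e9,
            last_refill_ns: now_ns,
        }
    }

    /// Ok(()) admits the load attempt; Err(()) maps to -EAGAIN.
    pub fn try_acquire(&mut self, now_ns: u64) -> Result<(), ()> {
        // Lazily refill based on elapsed time, clamped to capacity.
        let elapsed = now_ns.saturating_sub(self.last_refill_ns) as f64;
        self.tokens = (self.tokens + elapsed * self.refill_per_ns).min(self.capacity);
        self.last_refill_ns = now_ns;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            Err(())
        }
    }
}
```

A per-UID map of these buckets bounds how many concurrent KRL read-side critical sections any single user can provoke, which is what keeps grace periods short even under attack.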

9.3.5.1 Trust Anchor Live Rotation

9.3.5.1.1 Problem

The root of trust — the ML-DSA-65 kernel signing key whose public key is embedded in the kernel image — has a finite cryptographic lifetime. NIST SP 800-57 Part 1 (Rev. 5, Table 1) recommends key rotation every 5-10 years for long-lived signing keys. Over a 50-year uptime with live kernel evolution, this key must be rotated 5-10 times without rebooting.

Driver signing keys and vendor keys can be rotated via the Key Revocation List mechanism. But the root key cannot: it is the key that verifies KRLs, kernel updates, and the live evolution payload itself. Rotating the key that verifies rotations requires a bootstrapping protocol.

9.3.5.1.2 Design: Dual-Anchor Chain

The kernel maintains a trust anchor chain of two keys at all times: the active anchor (currently used for all signature verification) and the next anchor (pre-distributed, not yet active).

/// Trust anchor state. Protected by a dedicated Mutex (cold-path only:
/// rotation happens at most once per several years). Stored in kernel BSS
/// (not heap) to survive live evolution: the evolution framework preserves
/// BSS data across component swaps.
pub struct TrustAnchorChain {
    /// Currently active trust anchor. All signature verification uses
    /// this key. Initialized at boot from the kernel image's embedded
    /// public key.
    pub active: TrustAnchor,
    /// Pre-staged next anchor. `None` until a rotation is initiated.
    /// Once activated, the previous `active` becomes `retired` and
    /// this becomes the new `active`.
    pub next: Option<TrustAnchor>,
    /// Retired anchor. Still accepted for signature verification during
    /// the deprecation window (default: 90 days) to allow existing signed
    /// artifacts (drivers, KRLs) to be re-signed with the new key.
    /// `None` if no rotation has occurred yet.
    pub retired: Option<RetiredAnchor>,
    /// Monotonic rotation epoch. Incremented on each successful rotation.
    /// Persisted in TPM NV counter for anti-rollback protection.
    pub epoch: u64,
}

pub struct TrustAnchor {
    /// Algorithm identifier (matches `SignatureAlgorithm`).
    pub algorithm: u32,
    /// Raw public key bytes. ML-DSA-65 public key = 1952 bytes.
    pub public_key: ArrayVec<u8, 2048>,
    /// SHA3-256 fingerprint of the public key (for display and comparison).
    pub fingerprint: [u8; 32],
    /// Wall-clock timestamp (monotonic ns) of activation.
    pub activated_at_ns: u64,
}

pub struct RetiredAnchor {
    /// The retired key.
    pub anchor: TrustAnchor,
    /// Wall-clock timestamp after which this anchor is no longer accepted
    /// for signature verification. Set to
    /// `activated_at_ns + TRUST_ANCHOR_DEPRECATION_WINDOW_NS`.
    pub expires_at_ns: u64,
}

/// Deprecation window: 90 days. During this window, signatures from the
/// retired anchor are still accepted, giving operators time to re-sign
/// all artifacts with the new active anchor.
pub const TRUST_ANCHOR_DEPRECATION_WINDOW_NS: u64 =
    90 * 24 * 3600 * 1_000_000_000;

9.3.5.1.3 Rotation Protocol

Rotation is a three-phase protocol. The entire protocol runs as a cold-path operation; no hot-path code is affected.

Phase 1: Pre-distribution (months/years before activation)

  1. The distribution builder generates a new ML-DSA-65 key pair.
  2. The builder signs a TrustAnchorUpdate message with the CURRENT active anchor:
/// Signed request to stage a new trust anchor for future activation.
pub struct TrustAnchorUpdate {
    /// New public key bytes.
    pub new_public_key: ArrayVec<u8, 2048>,
    /// Algorithm for the new key.
    pub new_algorithm: u32,
    /// SHA3-256 fingerprint of the new public key.
    pub new_fingerprint: [u8; 32],
    /// Target rotation epoch (must equal current epoch + 1).
    pub target_epoch: u64,
    /// Earliest activation time (wall clock, nanoseconds).
    /// The kernel will not activate before this time.
    pub earliest_activation_ns: u64,
    /// Signature over all preceding fields, by the active anchor.
    pub signature: ArrayVec<u8, 4672>,  // Sized for ML-DSA-87 worst case (4627 bytes, FIPS 204 Table 2),
                                         // rounded to 64-byte cache-line alignment: ceil(4627/64)*64 = 4672.
                                         // Trust anchors use ML-DSA-87 (security level 5) per §9.6.
}
  3. The update is delivered to the running kernel via either:
     a. a live evolution payload (the new kernel component includes the update), OR
     b. a write to /proc/umka/security/trust_anchor_update (requires CAP_SYS_ADMIN + CAP_SYS_SECURITY, rate-limited to 1/hour).

  4. The kernel verifies the signature against the active anchor. On success: stores the new key in TrustAnchorChain::next. On failure: returns EKEYREJECTED and logs an FMA alert.

Phase 2: Activation (at the scheduled time or via explicit command)

  1. When now_ns() >= next.earliest_activation_ns:
     a. Verify the next anchor is still present (not revoked).
     b. retired = Some(RetiredAnchor { anchor: active, expires_at_ns: now + DEPRECATION_WINDOW })
     c. active = next.take().unwrap()
     d. epoch += 1
     e. Update the TPM NV counter for anti-rollback.
     f. Log FMA event: TrustAnchorRotated { old_fingerprint, new_fingerprint, epoch }.

Phase 3: Deprecation expiry

  1. After TRUST_ANCHOR_DEPRECATION_WINDOW_NS elapses:
     a. retired = None
     b. Signature verification with the retired key now returns EKEYEXPIRED.
     c. All drivers and KRLs signed with the retired key must have been re-signed during the deprecation window.
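The Phase 2 and Phase 3 state transitions can be sketched directly against the chain fields. This is a simplified model: TrustAnchor is reduced to a bare fingerprint, and the TPM NV update and FMA logging side effects are elided to comments; Chain, try_activate, and expire_retired are illustrative names:

```rust
/// Simplified rotation model: anchors reduced to their fingerprints.
#[derive(Clone, PartialEq, Debug)]
pub struct Anchor { pub fingerprint: [u8; 32] }

pub struct Chain {
    pub active: Anchor,
    pub next: Option<(Anchor, u64)>,    // (anchor, earliest_activation_ns)
    pub retired: Option<(Anchor, u64)>, // (anchor, expires_at_ns)
    pub epoch: u64,
}

pub const DEPRECATION_WINDOW_NS: u64 = 90 * 24 * 3600 * 1_000_000_000;

impl Chain {
    /// Phase 2: activate the pre-staged anchor once its time arrives.
    /// Returns true if a rotation happened.
    pub fn try_activate(&mut self, now_ns: u64) -> bool {
        match self.next.take() {
            Some((anchor, earliest)) if now_ns >= earliest => {
                let old = std::mem::replace(&mut self.active, anchor);
                self.retired = Some((old, now_ns + DEPRECATION_WINDOW_NS));
                self.epoch += 1; // also persisted to the TPM NV counter
                // real code: log FMA TrustAnchorRotated { .. } here
                true
            }
            other => { self.next = other; false }
        }
    }

    /// Phase 3: drop the retired anchor once its window expires;
    /// verification with it then fails with EKEYEXPIRED.
    pub fn expire_retired(&mut self, now_ns: u64) {
        if matches!(self.retired, Some((_, exp)) if now_ns >= exp) {
            self.retired = None;
        }
    }
}
```

Note that try_activate refuses early activation by putting the staged anchor back, which is what makes the ≥24-hour earliest_activation_ns window in Section 9.3.5.1.5 enforceable.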

9.3.5.1.4 Signature Verification with Dual Anchors

verify_signature(payload, signature) -> Result<(), VerifyError>:
  // Try active anchor first (common case)
  if active.verify(payload, signature).is_ok():
    return Ok(())

  // Try retired anchor during deprecation window
  if let Some(ref retired) = retired:
    if now_ns() < retired.expires_at_ns:
      if retired.anchor.verify(payload, signature).is_ok():
        return Ok(())

  Err(VerifyError::SignatureInvalid)

Cost: one additional comparison (retired.is_some() && now < expires) on the verification fallback path. The primary verification (ML-DSA-65) takes >100 μs; the comparison adds ~2 ns.

9.3.5.1.5 Security Analysis

Threat: Compromised active key. An attacker with the active key could forge a TrustAnchorUpdate to install their own key. Mitigations:

  • earliest_activation_ns must be ≥24 hours in the future, giving the operator time to detect and revoke a fraudulent update.
  • Emergency revocation: /proc/umka/security/trust_anchor_revoke (requires physical console access OR TPM owner auth) immediately clears next and logs an FMA critical alert.
  • The TPM NV counter prevents rollback: an attacker cannot replay an older TrustAnchorUpdate with a lower epoch.

Threat: Compromised next key (before activation). The next key is pre-distributed but not yet active. If it is compromised before activation, the operator writes a new TrustAnchorUpdate (signed with the still-active current key) that replaces the compromised next key. The compromised key was never active, so no artifacts are signed with it.

Threat: Both keys compromised simultaneously. This is equivalent to total root key compromise. Mitigation: reboot with a new kernel image containing a fresh trust anchor embedded at build time. This is the only scenario that requires a reboot in the 50-year lifecycle. TPM-based remote attestation detects this state.

9.3.5.1.6 Performance Impact

Zero. Trust anchor verification is cold-path only (driver load, KRL update, live evolution payload verification). The rotation protocol itself runs at most once per several years. The dual-anchor lookup adds one pointer comparison + one timestamp comparison to the signature verification fallback path — ~2 ns overhead on a >100 μs ML-DSA-65 verification.


9.4 TPM Runtime Services

9.4.1 Measured Boot (TPM)

For environments with TPM (Trusted Platform Module):

PCR Assignment Table (consolidated for security review; authoritative detail at Section 2.19):

| PCR | Measured by | Content |
|-----|-------------|---------|
| 0 | UEFI firmware | Firmware code and EFI drivers |
| 1 | UEFI firmware | Platform configuration (NVRAM, PCI config) |
| 2 | UEFI firmware | Option ROM code |
| 3 | UEFI firmware | Option ROM data |
| 4 | UEFI/bootloader | Boot manager code (shim, GRUB executable) |
| 5 | UEFI/bootloader | Boot manager data + GPT partition table |
| 6 | UEFI firmware | Resume-from-S4 / hibernate image |
| 7 | UEFI firmware | Secure Boot policy (db, dbx, PK, KEK state) |
| 8 | Bootloader (GRUB) | GRUB command line |
| 9 | Bootloader (GRUB) | Kernel image (UmkaOS PE/COFF or bzImage) |
| 10 | Bootloader → IMA | initramfs (boot-time); IMA runtime measurements (both coexist, distinguished by event log type) |
| 11 | systemd-stub / pcrphase | UKI components + boot phase strings (UAPI Group registry) |
| 12 | UmkaOS Core | Tier 1 driver images + IMA policy signing key |
| 13 | systemd-stub | System extension images for the initrd (sysexts) |
| 14 | shim | MOK certificates and hashes |
| 15 | UmkaOS Core | Policy module measurements (Section 19.9) |

Measured boot sequence:
1. UEFI firmware self-measures into PCR 0–7 (standard, not UmkaOS-controlled).
2. Bootloader (GRUB/shim) measures: PCR 8 (GRUB cmdline), PCR 9 (kernel image),
   PCR 10 (initramfs). PCR 11 is used by systemd-stub for UKI/boot phases.
3. UmkaOS Core measures loaded Tier 1 drivers and IMA policy key → PCR 12.
4. IMA extends PCR 10 at runtime on each measured file open **in the init IMA
   namespace only** (coexists with boot-time initramfs measurement via distinct
   event log entry types). Container-scoped measurements extend software virtual
   PCRs in the container's `ImaNamespace`, not hardware PCR 10. See
   [Section 9.5](#runtime-integrity-measurement).
5. Remote attestation server verifies the full boot chain via TPM quotes
   covering PCRs 0–12.

Key PCR assignments for IMA and driver measurement (for quick reference; the full table above is authoritative):

| PCR | Purpose | Extends When |
|-----|---------|--------------|
| PCR 7 | UEFI Secure Boot state (EFI variables, DBx revocation list) | Boot, before handoff |
| PCR 10 | Initramfs (boot) + IMA runtime file measurements (init namespace only) | Boot: bootloader measures initramfs digest; Runtime: each file open/exec per IMA policy in the init IMA namespace (coexists via distinct event log entry types). Container namespace measurements extend virtual PCRs only (Section 9.5). |
| PCR 12 | UmkaOS Core static boot-time digest of loaded Tier 1 drivers | Driver load, before first domain grant |

Note: PCR 10 and PCR 12 both extend on Tier 1 driver load: IMA measures the driver binary file (PCR 10); UmkaOS Core pre-measures the driver's immutable code+data region before granting it an isolation domain (PCR 12). These are complementary measurements — PCR 10 tracks what was loaded, PCR 12 locks the Tier 1 driver set into the TPM attestation chain before execution begins.

This is additive — TPM integration is optional and does not affect the boot path for systems without TPM.

9.4.1.1 PCR[7] Measurement Without Secure Boot

PCR[7] in the UEFI Secure Boot architecture records the Secure Boot policy and the authority (certificate chain) used to authorize boot components. Its content and interpretation depend on whether Secure Boot is enabled.

What UEFI always measures into PCR[7] (per TCG PC Client Platform Firmware Profile Specification, implemented in EDK II Tcg2Dxe.c):

UEFI firmware unconditionally measures the following EFI variables into PCR[7] using EV_EFI_VARIABLE_DRIVER_CONFIG events, regardless of whether Secure Boot is active: SecureBoot, PK, KEK, db, dbx. If a variable is absent (e.g., no keys are enrolled), the firmware measures a zero-size variable entry for that variable. The dbt and dbr variables are measured only if present. An EV_SEPARATOR event is then extended into PCR[7] to delimit configuration from authority measurements.

When Secure Boot is active and successfully verifies a boot binary, each EFI_SIGNATURE_DATA record used for verification is also extended into PCR[7] as an EV_EFI_VARIABLE_AUTHORITY event. These authority events do not occur when Secure Boot is disabled.

| Condition | PCR[7] event sequence | Binding strength |
|-----------|----------------------|------------------|
| Secure Boot enabled, keys enrolled | EV_EFI_VARIABLE_DRIVER_CONFIG for SecureBoot=1, PK, KEK, db, dbx; EV_SEPARATOR; EV_EFI_VARIABLE_AUTHORITY for each certificate used | Strong: value binds to the specific certificate set that authorized the boot |
| Secure Boot disabled, keys enrolled | EV_EFI_VARIABLE_DRIVER_CONFIG for SecureBoot=0, PK, KEK, db, dbx; EV_SEPARATOR; no authority events | Intermediate: binds to the key database state but not to any boot authorization act |
| Secure Boot disabled, no keys enrolled | EV_EFI_VARIABLE_DRIVER_CONFIG for SecureBoot=0 plus zero-size entries for absent PK/KEK/db/dbx; EV_SEPARATOR; no authority events | Weak: deterministic but signals absence of any policy — machine-bound only |

The resulting PCR[7] value is always deterministic for a given Secure Boot configuration state. Importantly, a system with Secure Boot enabled (SecureBoot=1) produces a different PCR[7] than one with Secure Boot disabled (SecureBoot=0), because the SecureBoot EFI variable value is measured into PCR[7] as part of the EV_EFI_VARIABLE_DRIVER_CONFIG event. However, a system that had Secure Boot disabled and then re-enabled will produce the same PCR[7] value as when it was previously enabled (assuming the key databases are unchanged), because PCRs are reset at each power cycle and the entire measurement sequence is replayed from scratch on every boot.
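The replay-from-reset property can be illustrated with a short sketch. This is an illustration only: a real TPM computes SHA-256(old_PCR || measurement); the FNV-1a-style mix below is a stand-in so the determinism argument can be demonstrated without a crypto library.

```rust
/// Stand-in for the TPM extend operation. Real firmware computes
/// SHA-256(old_pcr_value || measurement); this FNV-1a-style mix is used
/// only to demonstrate determinism, not as a cryptographic hash.
fn extend(pcr: u64, measurement: &[u8]) -> u64 {
    let mut h = pcr ^ 0xcbf2_9ce4_8422_2325;
    for &b in measurement {
        h ^= b as u64;
        h = h.wrapping_mul(0x1_0000_01b3);
    }
    h
}

/// Replay a boot's measurement sequence from the reset value (zero).
/// PCRs are reset at every power cycle, so the same event sequence
/// always yields the same final PCR value, regardless of what state the
/// platform was in on previous boots.
fn replay(events: &[&[u8]]) -> u64 {
    events.iter().fold(0u64, |pcr, &e| extend(pcr, e))
}
```

Two boots with the same Secure Boot configuration replay to identical PCR[7] values; flipping the measured SecureBoot variable changes the very first event and thus the final value.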

UmkaOS policy for non-SB systems:

When UmkaOS detects that Secure Boot is not active (via the EFI_GLOBAL_VARIABLE SecureBoot variable being absent or zero, or via UEFI event log inspection at boot time):

  1. TPM sealing policy: Secrets sealed to PCR[7] on a non-SB system are treated as "machine-bound, not policy-bound" — they reflect the key database state at sealing time, but since no cryptographic authority event was recorded, they offer no assurance that the boot chain was verified. Operators are warned via the boot log: umka: TPM PCR[7] in no-SB mode; sealed secrets are machine-bound only.

  2. IMA policy adjustment: Without Secure Boot, the IMA policy falls back from secure_boot_appraise (which requires SB-signed files) to ima_appraise=fix mode, allowing measurement without strict appraisal. Files are still measured into the IMA log for audit purposes.

  3. PCR[7] state capture: UmkaOS records the UEFI event log entries for PCR[7] during early boot (before any further extensions) and exposes the Secure Boot status via Pcr7BootState. This allows later software to distinguish "SB active with these certificates" from "SB inactive" without re-reading the PCR or relying on the (mutable) SecureBoot EFI variable at runtime.

/// PCR[7] state recorded at UmkaOS boot time, captured from the UEFI event log
/// before the boot manager runs. Populated during early boot from the
/// EFI_TCG2_PROTOCOL event log.
#[repr(C)]
pub struct Pcr7BootState {
    /// The SHA-256 hash in PCR[7] as read from the TPM after UEFI firmware
    /// completes its measurements. Value is deterministic per boot configuration.
    pub pcr7_value: [u8; 32],
    /// True if UEFI Secure Boot was active when UmkaOS booted.
    /// Derived from the `SecureBoot` EFI variable
    /// (GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c) as measured by UEFI
    /// into the TCG event log (EV_EFI_VARIABLE_DRIVER_CONFIG), not from
    /// runtime EFI variable access (which is mutable post-ExitBootServices).
    pub secure_boot_active: u8, // 0 = inactive, 1 = active
    /// True if at least one EV_EFI_VARIABLE_AUTHORITY event was recorded
    /// in the PCR[7] event log, indicating that Secure Boot actually
    /// verified at least one boot component. A system can have
    /// `secure_boot_active = true` but `authority_events_present = false`
    /// if no EFI binaries were verified (unusual but firmware-possible).
    pub authority_events_present: u8, // 0 = absent, 1 = present
    /// Number of distinct EV_EFI_VARIABLE_AUTHORITY events recorded in
    /// the PCR[7] log. Zero if Secure Boot is disabled.
    pub authority_event_count: u16,
    pub _pad: [u8; 4],
    // pcr7_value(32) + secure_boot_active(1) + authority_events_present(1)
    // + authority_event_count(2) + _pad(4) = 40 bytes.
}
const_assert!(size_of::<Pcr7BootState>() == 40);

Non-UEFI boot paths: On systems that boot via BIOS/MBR, DeviceTree (ARM), OpenSBI (RISC-V), or other non-UEFI mechanisms, PCR[7] is not defined by the UEFI Secure Boot specification. UmkaOS sets secure_boot_active = false and pcr7_value = [0u8; 32] on these platforms, reflecting that no UEFI-based PCR[7] measurement occurred. Platform-specific measured boot chains (e.g., ARM TrustZone OP-TEE measurements) are handled separately and do not map to PCR[7].

9.4.2 TPM Runtime Services

Section 9.4.1 covers the TPM as a boot measurement device. But TPM 2.0 is also a runtime crypto engine — a hardware security module built into every modern server and laptop. UmkaOS integrates the TPM as a first-class runtime resource.

Linux TPM interfaces — Linux exposes /dev/tpm0 (raw character device for direct TPM command submission) and /dev/tpmrm0 (resource-managed access that multiplexes TPM sessions). UmkaOS provides both via the syscall interface for existing userspace tools (tpm2-tools, clevis, systemd-cryptenroll).

Key sealing and unsealing — TPM can seal secrets (encryption keys, credentials) to a specific PCR state. The secret can only be recovered if the PCRs match — meaning the system booted the expected kernel, with the expected drivers, in the expected configuration. If any component in the boot chain is modified or compromised, unsealing fails and the secret is protected. UmkaOS integrates sealing with its capability system: the CAP_TPM_SEAL capability is required to create sealed objects, and the resulting sealed blob is itself a capability-protected kernel object.

NV Indices (NVRAM) — TPM provides persistent, tamper-resistant storage called NV indices. These store small secrets (disk encryption keys, network credentials, device identity certificates) that survive reboots and are protected by TPM authorization policies. UmkaOS abstracts NV indices as capability-protected objects — reading/writing an NV index requires the appropriate capability token, and the authorization policy (password, HMAC, or PCR-bound) is enforced by the TPM hardware itself.

Hardware random number generator — TPM includes a hardware RNG (certified to NIST SP 800-90A). UmkaOS's entropy pool (which also draws from CPU RDRAND/RDSEED) mixes in TPM random data when a TPM is present. This provides defense-in-depth: if the CPU RNG is compromised (as has happened with certain microcode bugs), the TPM RNG provides independent entropy.

Authorization policies — TPM 2.0 supports rich authorization:

  • PCR-bound: seal/unseal only if PCRs match expected values (boot integrity)
  • Password-based: simple passphrase authorization
  • HMAC-based: proof of possession of a shared secret
  • Policy-based: compound policies combining PCR state, time-of-day, locality, NV counter values, and external authorization (e.g., "unseal only if PCR 11 matches AND a remote server approves")

UmkaOS's policy engine maps these to capability gates — a sealed key's authorization policy determines which capability holders can trigger unsealing.

TPM-backed disk encryption — dm-crypt volume keys sealed to TPM PCR state. On boot, UmkaOS's initramfs unseals the volume key from the TPM (no passphrase required if the boot chain is trusted). This replaces the Linux approach of requiring systemd-cryptenroll or clevis daemons with kernel-native TPM key management. If the boot chain changes (different kernel version, modified initramfs), the TPM refuses to unseal and the system falls back to passphrase entry.

Resource manager — TPM 2.0 has limited internal session/object slots (typically 3 loaded objects + 3 loaded sessions simultaneously, per TCG TPM 2.0 Part 1 §32). Linux requires either the in-kernel tpm_tis/tpm_crb resource manager or the userspace tpm2-abrmd daemon to multiplex access. UmkaOS's TPM driver includes a kernel-native resource manager that transparently handles context swapping, so multiple concurrent TPM users (disk encryption, IMA measurements, remote attestation, application key storage) never contend for TPM slots. No userspace daemon required.

Resource manager data structures:

/// A virtualized TPM handle. Callers always use VirtTpmHandle; the resource
/// manager maps it to a real TPM handle when the object is loaded.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct VirtTpmHandle(pub u32);

/// State of one virtualized TPM object or session.
pub enum TpmSlotState {
    /// Currently loaded in the TPM — real_handle is valid.
    Loaded { real_handle: u32 },
    /// Context-saved to TPM NV storage by ContextSave.
    /// context_blob is retrieved by ContextLoad to reload.
    Saved { context_blob: TpmContextBlob },
    /// Never loaded (newly created, awaiting first use).
    Pending,
}

/// Opaque blob returned by TPM2_ContextSave. Variable length (≤ 2048 bytes).
///
/// **Stack safety**: This struct is 2048+ bytes. It is NEVER allocated on the
/// stack. The eviction path allocates a context blob from `TPM_CONTEXT_POOL`
/// (slab-backed, max 2048 bytes per slot, pre-allocated at boot — see
/// `TpmContextPool` below). The blob is moved into the `TpmHandleEntry`
/// (stored in an XArray, heap-allocated) after `TPM2_ContextSave` completes.
/// The VFS → IMA → TPM → resource manager → eviction call chain never places
/// a 2 KiB buffer on the kernel stack (which is typically 8-16 KiB).
pub struct TpmContextBlob {
    pub data: ArrayVec<u8, 2048>,
}

/// Per-handle entry in the resource manager table.
pub struct TpmHandleEntry {
    pub virt_handle: VirtTpmHandle,
    pub slot_state: TpmSlotState,
    /// LRU clock tick when last used. Used for eviction ordering.
    pub last_used: u64,
}

/// Maximum loaded transient objects. Discovered at init via
/// `TPM2_GetCapability(TPM_PT_MAX_LOADED_OBJECTS)`. Typical value: 3.
/// Used as ArrayDeque capacity — the actual runtime limit is `max_loaded_objects`.
const TPM_MAX_LOADED_OBJECTS: usize = 7;

/// Maximum loaded sessions. Discovered at init via
/// `TPM2_GetCapability(TPM_PT_MAX_LOADED_SESSIONS)`. Typical value: 3.
/// Used as ArrayDeque capacity — the actual runtime limit is `max_loaded_sessions`.
const TPM_MAX_LOADED_SESSIONS: usize = 7;

/// Kernel-native TPM resource manager.
///
/// Maintains an LRU cache of loaded TPM handles. When a caller needs to use
/// a handle and the TPM is full, the LRU loaded handle is context-saved
/// (TPM2_ContextSave) to free a slot, then the requested handle is
/// context-loaded (TPM2_ContextLoad) into the freed slot.
pub struct TpmResourceManager {
    /// All virtualized handles, keyed by VirtTpmHandle (u32).
    /// XArray provides O(1) integer-keyed lookup per collection policy.
    pub handles: XArray<TpmHandleEntry>,
    /// Currently loaded transient objects (keys, NV indices), LRU order
    /// (front = most recent). Capacity from `TPM2_GetCapability(TPM_PT_MAX_LOADED_OBJECTS)`.
    /// Objects and sessions have independent hardware limits and must be tracked
    /// separately — a single mixed LRU can evict all sessions while objects remain,
    /// or vice versa, exceeding one limit while the other has capacity.
    pub loaded_objects: ArrayDeque<VirtTpmHandle, TPM_MAX_LOADED_OBJECTS>,
    /// Currently loaded sessions (HMAC, policy, trial), LRU order (front = most recent).
    /// Capacity from `TPM2_GetCapability(TPM_PT_MAX_LOADED_SESSIONS)`.
    pub loaded_sessions: ArrayDeque<VirtTpmHandle, TPM_MAX_LOADED_SESSIONS>,
    /// Maximum loaded object slots (queried from TPM at init). Typically 3.
    pub max_loaded_objects: u8,
    /// Maximum loaded session slots (queried from TPM at init). Typically 3.
    pub max_loaded_sessions: u8,
    /// Monotonic tick counter for LRU ordering.
    pub tick: u64,
    /// Serializes all TPM commands (TPM is single-threaded).
    pub tpm_lock: Mutex<()>,
}

Resource manager algorithm — ensure_loaded(virt):

Acquire tpm_lock.
If virt is already Loaded:
  Update virt.last_used = tick++.
  Return real_handle.
// Select the correct LRU queue based on handle type.
lru = if virt.is_session() { &loaded_sessions } else { &loaded_objects }
max = if virt.is_session() { max_loaded_sessions } else { max_loaded_objects }
If lru.len() == max:
  evict = lru.pop_back()  // LRU handle in this category
  TPM2_ContextSave(evict.real_handle) → context_blob
  evict.slot_state = Saved { context_blob }
If virt is Saved { context_blob }:
  TPM2_ContextLoad(context_blob) → real_handle
Else:  // Pending — first use: create/load the object into the freed slot
  TPM2_Load / TPM2_CreatePrimary (as appropriate) → real_handle
virt.slot_state = Loaded { real_handle }
virt.last_used = tick++
lru.push_front(virt)
Release tpm_lock.
Return real_handle.

Every TPM command path calls ensure_loaded(virt) to obtain the real handle before submitting the command. The caller is unaware of whether a context swap occurred. Context save/load adds ~1-5 ms per swap (one TPM2_ContextSave + one TPM2_ContextLoad round-trip); swaps are rare because the LRU policy keeps the most-used handles loaded.
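The per-category LRU swap can be sketched against a mock TPM. This is a minimal sketch, not the kernel implementation: the TPM is modeled as an unbounded context store, `MockRm` and `MAX_LOADED` are illustrative names, and only the object queue is shown (sessions behave identically on their own queue).

```rust
use std::collections::{HashMap, VecDeque};

/// Mock capacity playing the role of `max_loaded_objects` (typically 3).
const MAX_LOADED: usize = 3;

enum Slot {
    Loaded { real_handle: u32 },
    /// Context-saved blob (contents irrelevant for this sketch).
    Saved { _context_blob: Vec<u8> },
}

struct MockRm {
    handles: HashMap<u32, Slot>, // keyed by virtual handle
    lru: VecDeque<u32>,          // front = most recently used
    next_real: u32,
    pub swaps: u32,              // context save round-trips performed
}

impl MockRm {
    fn new() -> Self {
        MockRm { handles: HashMap::new(), lru: VecDeque::new(), next_real: 0x8000_0000, swaps: 0 }
    }

    /// Register a context-saved object (as if created earlier and evicted).
    fn add_saved(&mut self, virt: u32) {
        self.handles.insert(virt, Slot::Saved { _context_blob: vec![virt as u8] });
    }

    /// Return a real handle for `virt`, context-swapping if the TPM is full.
    fn ensure_loaded(&mut self, virt: u32) -> u32 {
        if let Some(Slot::Loaded { real_handle }) = self.handles.get(&virt) {
            let rh = *real_handle;
            // Already loaded: just move to the front of the LRU queue.
            self.lru.retain(|&v| v != virt);
            self.lru.push_front(virt);
            return rh;
        }
        if self.lru.len() == MAX_LOADED {
            // Evict the least-recently-used handle via (mock) TPM2_ContextSave.
            let evict = self.lru.pop_back().unwrap();
            self.handles.insert(evict, Slot::Saved { _context_blob: vec![evict as u8] });
            self.swaps += 1;
        }
        // (Mock) TPM2_ContextLoad into the freed slot.
        self.next_real += 1;
        let rh = self.next_real;
        self.handles.insert(virt, Slot::Loaded { real_handle: rh });
        self.lru.push_front(virt);
        rh
    }
}
```

Repeated use of a loaded handle costs no swap; only a miss on a full TPM pays the ~1-5 ms context round-trip, which is why the LRU policy keeps hot handles resident.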

Eviction Policy:

The TPM has limited internal context slots (TPM2_PT_HR_LOADED_SESSION: typically 3; TPM2_PT_ACTIVE_SESSIONS: typically 64 on current hardware). When all slots are occupied and a new context is needed:

  1. LRU eviction: The resource manager maintains an LRU list of active TPM context handles. When a new slot is needed, the least-recently-used non-pinned context is evicted via TPM2_CC_ContextSave (serializes the context to a TPMS_CONTEXT blob, ~256-2048 bytes per context) and stored in kernel memory.

  2. Pinned contexts: Contexts actively executing an operation (ContextState::InUse) are pinned and cannot be evicted. If all active contexts are pinned and a new slot is needed, the requesting operation blocks on tpm_context_wait_queue until a context completes and unpins.

  3. Re-loading: When a previously evicted context is needed, the resource manager calls TPM2_CC_ContextLoad to reload it into the TPM, potentially evicting another LRU context.

  4. Context blob storage: Saved context blobs are stored in a pre-allocated kernel pool (TpmContextPool), sized at boot to MAX_TPM_CONTEXTS × MAX_CONTEXT_SIZE = 64 × 2048 = 128 KiB. This pool is never grown after boot (no allocation in TPM I/O path).

  5. Context size limits: Each saved context blob is bounded by TPM2_PT_CONTEXT_SYM_SIZE (hardware-reported). The TpmContextPool slot size is set to max(reported_size, 2048) for safety.

  6. Eviction metrics: The resource manager tracks eviction_count, reload_count, and max_concurrent_active in the FMA health struct for the TPM device, allowing operators to tune MAX_TPM_CONTEXTS if thrashing is observed.

Performance considerations — TPM 2.0 operations are inherently slow, with latency varying dramatically by operation type:

| Operation category | Latency | Examples |
|--------------------|---------|----------|
| Simple reads/extends | ~1-10 ms | TPM2_PCR_Extend, TPM2_GetRandom, TPM2_PCR_Read |
| Sealing/unsealing | ~20-80 ms | TPM2_Create (sealed key), TPM2_Unseal |
| Asymmetric signing | ~100-600 ms | ECDSA P256 (p50 ~200 ms), RSA 2048 (slower) |
| Key generation | ~200-1000 ms | TPM2_CreatePrimary (RSA 2048) |

The TPM is single-threaded — all commands serialize internally regardless of concurrent callers. Under load (e.g., multiple containers requesting attestation quotes simultaneously), effective per-operation latency increases linearly with queue depth. UmkaOS's async I/O model ensures TPM operations never block the caller synchronously. TPM commands are submitted via the ring buffer interface and completed asynchronously. For latency-sensitive paths (e.g., IMA measurement during file open), UmkaOS caches measurement results and only re-measures when file content changes. The caching strategy is essential: without it, a file-intensive workload would saturate the TPM with PCR extend operations, each taking 1-10ms.

PCR extend operations use SHA-256 as the hash algorithm (matching the TPM 2.0 SHA-256 PCR bank, corresponding to HashAlgorithm::Sha256 from Section 10.1). The Kernel Crypto API is NOT used for PCR extend — the TPM hardware performs the hash-extend operation internally. The kernel provides the raw measurement data; the TPM computes SHA-256(old_PCR_value || measurement). The TPM command specifies the hash algorithm via TPM2 algorithm ID TPM_ALG_SHA256 (0x000B), not via the kernel's HashAlgorithm enum — the two are separate namespaces mapped only at the IMA/measured boot boundary where kernel-side digests are submitted to the TPM.
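The two algorithm namespaces meet only at that boundary, where a kernel-side digest type is translated to a TPM wire algorithm ID. A minimal sketch of the mapping (the `HashAlgorithm` variants shown are assumed from Section 10.1; the TPM_ALG values are the standard TCG algorithm registry IDs):

```rust
/// Kernel-side hash algorithm enum (illustrative subset; the authoritative
/// definition is HashAlgorithm in Section 10.1).
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum HashAlgorithm {
    Sha256,
    Sha384,
    Sha512,
}

/// TPM 2.0 algorithm IDs (TCG "TPM Rev 2.0 Part 2: Structures" registry).
pub const TPM_ALG_SHA256: u16 = 0x000B;
pub const TPM_ALG_SHA384: u16 = 0x000C;
pub const TPM_ALG_SHA512: u16 = 0x000D;

/// Translate a kernel digest type to the TPM wire algorithm ID at the
/// IMA / measured-boot boundary. The two namespaces never mix elsewhere.
pub fn to_tpm_alg(alg: HashAlgorithm) -> u16 {
    match alg {
        HashAlgorithm::Sha256 => TPM_ALG_SHA256,
        HashAlgorithm::Sha384 => TPM_ALG_SHA384,
        HashAlgorithm::Sha512 => TPM_ALG_SHA512,
    }
}
```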

9.4.2.1 TPM Command Timeout and Failure Policy

TPM hardware may hang or become unresponsive (firmware bug, bus error, power management race). Every TPM command submission has a bounded timeout:

| Command class | Timeout | Example commands |
|---------------|---------|------------------|
| Short | 750 ms | TPM2_PCR_Extend, TPM2_GetRandom, TPM2_NV_Read |
| Medium | 2000 ms | TPM2_ContextSave, TPM2_ContextLoad, TPM2_NV_Write |
| Long | 30000 ms | TPM2_CreatePrimary, TPM2_Create (RSA/ECC key generation) |

Timeout values are derived from the command duration properties reported by the TPM itself via TPM2_GetCapability(TPM_CAP_TPM_PROPERTIES). If the TPM does not report durations, the above defaults apply.

On timeout: Abort the in-flight command (TPM2_Cancel if the TPM supports it, otherwise hardware reset of the TPM interface). Return -ETIME to the caller.

Consecutive failure policy: After 3 consecutive timeouts (any command class):

  1. Mark the TPM device as TpmState::Degraded.
  2. Log a HealthEventClass::HardwareError event via FMA (Section 20.1).
  3. Non-critical TPM operations (attestation quotes, random number generation) fail immediately with -ENODEV. Callers fall back to software alternatives:
     • getrandom() falls back to ChaCha20-based CSPRNG (already seeded at boot).
     • Attestation quotes are unavailable (no software fallback — integrity requires hardware root of trust).
  4. Critical TPM operations (measured boot PCR extends during early boot) cause a kernel panic — the integrity chain is broken and cannot be recovered.
  5. A background recovery task attempts TPM2_Startup(CLEAR) every 30 seconds. If the TPM responds, it is re-probed and promoted back to TpmState::Ready.

/// TPM device health state.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum TpmState {
    /// Normal operation.
    Ready,
    /// 3+ consecutive timeouts. Non-critical ops fail; recovery in progress.
    Degraded,
    /// Hardware permanently failed (e.g., TPM removed, bus error). No recovery.
    Failed,
}
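The state transitions above can be sketched as a small health tracker. This is a sketch, not the kernel's FMA integration: `TpmHealth`, `record`, and `recovered` are illustrative names, and the `TpmState` enum is restated so the example is self-contained.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum TpmState { Ready, Degraded, Failed }

/// Sketch of the consecutive-timeout policy: three timeouts in a row
/// demote the device to Degraded; any successful command resets the
/// streak; a successful recovery probe promotes Degraded back to Ready.
pub struct TpmHealth {
    pub state: TpmState,
    consecutive_timeouts: u8,
}

impl TpmHealth {
    pub fn new() -> Self {
        TpmHealth { state: TpmState::Ready, consecutive_timeouts: 0 }
    }

    /// Record the outcome of one TPM command submission.
    pub fn record(&mut self, timed_out: bool) {
        if timed_out {
            self.consecutive_timeouts += 1;
            if self.consecutive_timeouts >= 3 && self.state == TpmState::Ready {
                // FMA HealthEventClass::HardwareError would be logged here.
                self.state = TpmState::Degraded;
            }
        } else {
            self.consecutive_timeouts = 0;
        }
    }

    /// Background recovery task: TPM2_Startup(CLEAR) succeeded on re-probe.
    pub fn recovered(&mut self) {
        if self.state == TpmState::Degraded {
            self.state = TpmState::Ready;
            self.consecutive_timeouts = 0;
        }
    }
}
```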

9.4.3 Anti-Rollback Counter (TOCTOU-Safe Initialization)

UmkaOS uses a dedicated TPM NV monotonic counter to prevent rollback to older kernel images. This counter is separate from the KRL version counter (NV index 0x01800200, Section 9.2.7) and tracks the kernel boot generation.

NV Index: UMKA_ROLLBACK_NV_INDEX = 0x01500001

First-boot TPM anti-rollback counter initialization (TOCTOU-safe):

1. TPM2_NV_DefineSpace(
     nvIndex      = UMKA_ROLLBACK_NV_INDEX,  // 0x01500001
     size         = 8,                        // 8-byte counter
     attributes   = TPMA_NV_COUNTER
                  | TPMA_NV_WRITE_STCLEAR    // write-locked after power cycle
                  | TPMA_NV_WRITEDEFINE      // cannot be redefined after first write
                  | TPMA_NV_PPWRITE          // requires physical presence to write
                  | TPMA_NV_AUTHREAD,        // readable with auth only
     authPolicy   = platform_auth            // requires platform authorization
   )
   If this returns TPM_RC_NV_DEFINED: the index already exists from a
   previous boot — skip to step 3 (do not re-initialize).

2. TPM2_NV_Increment(nvIndex, platformAuth)
   // First increment initializes the counter to 0 and locks the definition
   // (TPMA_NV_WRITEDEFINE prevents future TPM2_NV_DefineSpace for this index).

3. Read counter: TPM2_NV_Read(nvIndex) → current_rollback_generation
   Compare against kernel's compiled-in UMKA_MIN_ROLLBACK_GENERATION.
   If current < minimum: boot is rejected (anti-rollback enforcement).

Security properties:

  • TPMA_NV_WRITEDEFINE: once written, the NV space definition cannot be changed even with platform auth — prevents an attacker from deleting and re-creating the index with weaker attributes.
  • TPMA_NV_PPWRITE: physical presence required — cannot be manipulated by software alone.
  • The atomicity of step 1 is guaranteed by TPM hardware — concurrent TPM2_NV_DefineSpace calls are serialized by the TPM's internal command sequencer. If step 1 returns TPM_RC_NV_DEFINED, the index was created by a concurrent or prior boot; step 3 reads the existing counter safely. There is no window between "check if index exists" and "create index" because the DefineSpace command is itself the creation — either it succeeds (first boot) or returns TPM_RC_NV_DEFINED (already exists). This eliminates the TOCTOU race that would exist if existence were checked with a separate TPM2_NV_ReadPublic command before DefineSpace.
  • TPM2_NV_Increment is atomic by TPM specification (TCG TPM Library Specification, Part 3, Section 31.12): the counter value is incremented and persisted to NV storage as a single TPM command. Power loss during the NV write is handled by the TPM's internal journaling — the counter either reflects the old or new value on next read, never a corrupted intermediate. No software-side locking is required around increment calls.

Rollback generation increment: When a new kernel version is released with security fixes that must not be rolled back, UMKA_MIN_ROLLBACK_GENERATION is incremented in the kernel build. On next boot, step 2 increments the TPM counter past the minimum; old kernel binaries (with lower minimums) can no longer satisfy the anti-rollback check on this system.
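The TOCTOU-safe define-or-read flow can be exercised against a mock NV store. This is a sketch under stated assumptions: `MockNv`, `first_boot_init`, and the `Option<u64>` counter model are illustrative; the real atomicity comes from the TPM's internal command sequencer, modeled here by the fact that `define_space` is itself the existence check.

```rust
use std::collections::HashMap;

const RC_NV_DEFINED: u32 = 0x014C; // TPM_RC_NV_DEFINED

/// Mock TPM NV counter store. `None` = defined but never incremented.
struct MockNv {
    counters: HashMap<u32, Option<u64>>,
}

impl MockNv {
    fn new() -> Self { MockNv { counters: HashMap::new() } }

    /// TPM2_NV_DefineSpace: creation IS the existence check — there is no
    /// separate ReadPublic-then-Define window (the TOCTOU argument above).
    fn define_space(&mut self, index: u32) -> Result<(), u32> {
        if self.counters.contains_key(&index) {
            return Err(RC_NV_DEFINED);
        }
        self.counters.insert(index, None);
        Ok(())
    }

    /// TPM2_NV_Increment: the first increment initializes the counter to 0.
    fn increment(&mut self, index: u32) {
        let c = self.counters.get_mut(&index).expect("index not defined");
        let next = match *c { None => 0, Some(v) => v + 1 };
        *c = Some(next);
    }

    /// TPM2_NV_Read.
    fn read(&self, index: u32) -> u64 {
        self.counters[&index].expect("counter not initialized")
    }
}

/// Steps 1-3 of the first-boot initialization protocol.
fn first_boot_init(nv: &mut MockNv, index: u32, min_generation: u64) -> Result<u64, &'static str> {
    match nv.define_space(index) {
        Ok(()) => nv.increment(index),  // first boot: initialize + lock
        Err(RC_NV_DEFINED) => {}        // already exists: skip to read
        Err(_) => return Err("nv error"),
    }
    let current = nv.read(index);
    if current < min_generation {
        return Err("anti-rollback violation"); // boot rejected
    }
    Ok(current)
}
```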

9.4.4 TPM Service Provider (Cluster-Wide TPM Access)

Provider model: TPM service is host-proxy — the host kernel mediates access to the physical TPM chip. A discrete TPM with Tier M firmware is theoretically possible but impractical (TPMs have minimal processing power, ~33 MHz I2C/SPI interface). Sharing model: shared with session multiplexing — multiple local and remote peers bind simultaneously, each getting independent TPM sessions with handle isolation.

Not every node in a cluster has a TPM. CXL memory expanders, DPU-only nodes, embedded accelerator sleds, and virtual machines may lack a hardware TPM. The TPM service provider allows any peer to use another peer's TPM for attestation, key sealing, and random number generation.

This is the security-subsystem instantiation of the capability service provider model (Section 5.7).

Use cases:

  • Remote attestation: Node B (no TPM) requests a TPM quote from Node A. A verifier can confirm Node A's boot integrity and, by extension, the cluster's trust chain.
  • Key sealing across the cluster: A shared secret is sealed to Node A's TPM PCR state. Any authorized peer can request unsealing — if the boot chain on Node A is intact, the secret is released.
  • Entropy: Nodes without hardware RNG draw from a TPM-equipped peer's hardware random source (defense-in-depth for entropy pool).
  • Confidential computing bootstrap: A TEE node (Section 9.7) needs TPM-based attestation to establish trust before loading confidential workloads.

PeerCapFlags: TPM_SERVICE (bit 11) — advertised by peers with a hardware TPM 2.0.

ServiceId: ServiceId("tpm", 1).

PeerServiceDescriptor.properties (32 bytes):

#[repr(C)]
pub struct TpmServiceProperties {
    /// TPM manufacturer ID (TCG Vendor ID Registry, TCG specification
    /// "TPM Rev 2.0 Part 2: Structures", Table 16). Common values:
    /// 0x49465800 = IFX (Infineon), 0x53544D20 = STM (STMicro),
    /// 0x4E544300 = NTC (Nuvoton), 0x4E545A00 = NTZ (Nationz),
    /// 0x524F4343 = ROCC (Rockchip), 0x414D4400 = AMD (fTPM).
    /// SF-110 fix: endian wrapper — this struct is embedded in
    /// PeerServiceDescriptor.properties which is transmitted via RDMA
    /// between cluster nodes (wire struct). On mixed-endian clusters
    /// (x86-64 + s390x), native integers would be byte-swapped.
    pub manufacturer: Le32,
    /// TPM firmware version (major.minor as two Le16).
    pub firmware_major: Le16,
    pub firmware_minor: Le16,
    /// Supported hash algorithms bitmask.
    /// bit 0: SHA-1, bit 1: SHA-256, bit 2: SHA-384, bit 3: SHA-512,
    /// bit 4: SM3-256.
    pub hash_algorithms: Le32,
    /// Number of PCR banks available. Single byte — no endian wrapper needed.
    pub pcr_count: u8,
    /// Whether the TPM supports policy-based authorization. Single byte.
    pub policy_support: u8,
    /// Padding to fill 32-byte PeerServiceDescriptor.properties slot.
    /// Fields sum to 14 bytes; 32 - 14 = 18 bytes of trailing pad.
    pub _pad: [u8; 18],
}
const_assert!(core::mem::size_of::<TpmServiceProperties>() == 32);
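The Le* wrappers referenced in the SF-110 note fix the in-memory representation to wire byte order, so the struct can be copied byte-for-byte over RDMA regardless of host endianness. A minimal sketch of such a wrapper (the name `Le32` is taken from the struct above; this particular implementation is an assumption of the sketch):

```rust
/// Little-endian wire wrapper: stores the value in wire byte order, so a
/// `#[repr(C)]` struct containing it has the same byte layout on
/// little-endian (x86-64) and big-endian (s390x) hosts.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
#[repr(transparent)]
pub struct Le32(pub [u8; 4]);

impl Le32 {
    /// Convert a native u32 to wire order on store.
    pub fn new(v: u32) -> Self {
        Le32(v.to_le_bytes())
    }
    /// Convert back to native order on load.
    pub fn get(self) -> u32 {
        u32::from_le_bytes(self.0)
    }
}
```

Because the conversion happens at the accessor boundary, field reads and writes pay only a byte-swap on big-endian hosts and compile to plain loads/stores on little-endian ones.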

Wire protocol — TPM commands are opaque byte buffers (defined by the TCG TPM 2.0 specification). The service provider forwards them to the local TPM and returns responses:

#[repr(u16)]
pub enum TpmServiceOpcode {
    /// Client → provider: submit a TPM command.
    /// Payload: raw TPM2 command bytes (TCG Part 3 format).
    /// Max size: 4096 bytes (TPM command buffer limit).
    TpmCommand   = 0x0001,
    /// Provider → client: TPM command response.
    /// Payload: raw TPM2 response bytes.
    TpmResponse  = 0x0002,
    /// Client → provider: request hardware random bytes.
    /// Payload: u32 (number of bytes requested, max 4096).
    GetRandom    = 0x0010,
    /// Provider → client: random bytes response.
    /// Payload: raw random bytes.
    RandomData   = 0x0011,
    /// Provider → client: TPM locality/session info (sent on bind).
    TpmInfo      = 0x0020,
}

Security model — critical design decisions:

  1. No raw command forwarding for sensitive operations. The service provider does NOT blindly forward all TPM commands. The following are filtered:
     • TPM2_Clear (factory reset) — blocked. Only local admin.
     • TPM2_DictionaryAttackLockReset — blocked. Only local admin.
     • TPM2_HierarchyChangeAuth — blocked. Only local admin.
     • TPM2_HierarchyControl (enable/disable hierarchies) — blocked. A remote client could disable the platform or endorsement hierarchy, bricking the TPM until physical presence clears it.
     • TPM2_ChangePPS (rotate platform primary seed) — blocked. Invalidates all keys derived from the platform hierarchy; recovery requires physical access.
     • TPM2_ChangeEPS (rotate endorsement primary seed) — blocked. Invalidates the endorsement key (EK) and all keys derived from it; destroys remote attestation identity permanently.
     • TPM2_NV_UndefineSpace (delete NV indices) — blocked for indices outside the remote-allocated range. A remote client may only delete NV indices it created within 0x01C00000-0x01CFFFFF.
     • TPM2_NV_DefineSpace — restricted to specific NV index ranges allocated for remote use (NV indices 0x01C00000-0x01CFFFFF).

Command filtering implementation: The TPM service provider parses the first 10 bytes of every incoming TpmCommand payload (TPM2 command header: tag[2] + size[4] + commandCode[4]). The commandCode is checked against a static blocklist:

static BLOCKED_COMMANDS: &[u32] = &[
    0x0000_0126, // TPM2_Clear
    0x0000_0139, // TPM2_DictionaryAttackLockReset
    0x0000_0129, // TPM2_HierarchyChangeAuth
    0x0000_0121, // TPM2_HierarchyControl
    0x0000_0125, // TPM2_ChangePPS
    0x0000_0124, // TPM2_ChangeEPS
];

For TPM2_NV_DefineSpace (0x0000_012A) and TPM2_NV_UndefineSpace (0x0000_0122), the NV index is additionally extracted from the command body as a big-endian u32 (for UndefineSpace, the second handle following the 10-byte header and authHandle; for DefineSpace, the nvIndex field of the TPMS_NV_PUBLIC in the parameter area) and checked against the allowed range 0x01C00000-0x01CFFFFF.

Blocked commands return TpmResponse with TPM_RC_COMMAND_CODE (0x0143) -- the standard TPM error for unsupported commands. The client cannot distinguish "blocked by policy" from "not implemented" -- this is intentional (no information leakage about filtering policy).

Filtering cost: ~50ns per command (one array scan + one conditional index parse). Negligible vs TPM command latency (1-50ms).
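
The filter described above can be sketched in plain Rust. This is a userspace model under stated assumptions: `command_allowed` and `nv_index` are illustrative names, not the provider's actual API.

```rust
/// Static blocklist of sensitive TPM2 command codes (from the table above).
const BLOCKED_COMMANDS: &[u32] = &[
    0x0000_0126, // TPM2_Clear
    0x0000_0139, // TPM2_DictionaryAttackLockReset
    0x0000_0129, // TPM2_HierarchyChangeAuth
    0x0000_0121, // TPM2_HierarchyControl
    0x0000_0125, // TPM2_ChangePPS
    0x0000_0124, // TPM2_ChangeEPS
];

/// Parse the TPM2 command header (tag[2] + size[4] + commandCode[4],
/// all big-endian) and decide whether the command may be forwarded.
fn command_allowed(payload: &[u8]) -> bool {
    if payload.len() < 10 {
        return false; // malformed: shorter than a TPM2 command header
    }
    let cc = u32::from_be_bytes([payload[6], payload[7], payload[8], payload[9]]);
    !BLOCKED_COMMANDS.contains(&cc)
}

/// For TPM2_NV_DefineSpace / TPM2_NV_UndefineSpace: extract the NV index
/// from the command body (bytes 10-13, big-endian) for the range check.
fn nv_index(payload: &[u8]) -> Option<u32> {
    payload.get(10..14).map(|b| u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}
```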

NV index allocation tracking: The server maintains a global NV index allocation tracker for the remote range:

/// NV index range size: 0x01C00000-0x01CFFFFF = 1M indices.
const NV_REMOTE_RANGE_SIZE: usize = 1 << 20; // 1M indices
/// Bitmap words: 1M / 64 bits per word = 16384 words.
const NV_BITMAP_WORDS: usize = NV_REMOTE_RANGE_SIZE / 64;

struct NvIndexTracker {
    /// Bitmap of allocated NV indices in the remote range
    /// 0x01C00000-0x01CFFFFF (1M indices, 128 KiB bitmap).
    /// Fixed-size: the NV index range is defined by the TPM2 spec.
    /// Lock-protected: TPM NV operations are 1-50ms each (cold path),
    /// so SpinLock contention is negligible. Non-atomic `u64` words
    /// under the lock — no redundant atomics.
    bitmap: SpinLock<[u64; NV_BITMAP_WORDS]>,
    /// Per-client allocation count (for cleanup on disconnect).
    /// XArray keyed by PeerId (u64, integer-keyed per collection policy).
    per_client_count: XArray<AtomicU32>,
    /// Maximum NV indices per client. Default: 64.
    max_per_client: u32,
}

On TPM2_NV_DefineSpace: check bitmap for the requested index. If already allocated: return TPM_RC_NV_DEFINED (standard TPM error). If client exceeds max_per_client: return TPM_RC_NV_SPACE. Otherwise: set bit, increment per_client_count.

On TPM2_NV_UndefineSpace: verify the index was allocated by this client (check per-client tracking). If not: block (return TPM_RC_COMMAND_CODE). Otherwise: clear bit, decrement per_client_count.

On client disconnect: clear all NV indices allocated by that client (walk bitmap using per-client tracking). This prevents leaked NV resources from accumulating over the 50-year operational lifetime.
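
The define/undefine bitmap logic can be modeled in userspace as follows (std `Mutex` and `Vec` stand in for the kernel `SpinLock` and fixed 128 KiB array; the per-client count and ownership check are elided):

```rust
use std::sync::Mutex;

const NV_REMOTE_BASE: u32 = 0x01C0_0000;
const NV_REMOTE_RANGE_SIZE: usize = 1 << 20; // 1M indices

/// Simplified model of NvIndexTracker's bitmap path.
struct NvBitmap {
    bitmap: Mutex<Vec<u64>>,
}

impl NvBitmap {
    fn new() -> Self {
        Self { bitmap: Mutex::new(vec![0u64; NV_REMOTE_RANGE_SIZE / 64]) }
    }

    /// Map an NV index to its bit offset, rejecting out-of-range indices.
    fn offset(index: u32) -> Result<usize, &'static str> {
        index.checked_sub(NV_REMOTE_BASE)
            .map(|o| o as usize)
            .filter(|&o| o < NV_REMOTE_RANGE_SIZE)
            .ok_or("outside remote range") // blocked: TPM_RC_COMMAND_CODE
    }

    /// TPM2_NV_DefineSpace path: set the bit, failing if already defined.
    fn define(&self, index: u32) -> Result<(), &'static str> {
        let off = Self::offset(index)?;
        let mut bm = self.bitmap.lock().unwrap();
        let (w, b) = (off / 64, off % 64);
        if bm[w] & (1u64 << b) != 0 { return Err("TPM_RC_NV_DEFINED"); }
        bm[w] |= 1u64 << b;
        Ok(())
    }

    /// TPM2_NV_UndefineSpace path: clear the bit if it was allocated.
    fn undefine(&self, index: u32) -> Result<(), &'static str> {
        let off = Self::offset(index)?;
        let mut bm = self.bitmap.lock().unwrap();
        let (w, b) = (off / 64, off % 64);
        if bm[w] & (1u64 << b) == 0 { return Err("not allocated"); }
        bm[w] &= !(1u64 << b);
        Ok(())
    }
}
```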

  2. Session isolation. Each remote client gets an independent TPM session (via TPM2_StartAuthSession). Sessions from different clients cannot interfere with each other. The resource manager (/dev/tpmrm0 equivalent) handles context swapping.

TPM resource manager for remote clients: The TPM service provider includes a lightweight resource manager that multiplexes client sessions onto the physical TPM:

  • Each remote client gets a virtual TPM session namespace. Session handles returned to clients are REMAPPED: the provider maintains a per-client handle translation table (XArray<u32> mapping virtual handle to physical handle, integer-keyed per collection policy).
  • On each TpmCommand: translate virtual handles in the command body to physical handles before forwarding to the TPM. On TpmResponse: translate physical handles back to virtual handles.
  • Context swapping: when the TPM's internal session slots are full (TPM2_RC_SESSION_MEMORY), the provider saves the least-recently-used session via TPM2_ContextSave and loads the needed session via TPM2_ContextLoad. Saved contexts are stored in kernel memory (not persistent -- lost on provider reboot, which invalidates all remote sessions anyway).
  • Handle types requiring translation: session handles (0x02xxxxxx HMAC sessions, 0x03xxxxxx policy sessions), transient object handles (0x80xxxxxx). Persistent handles (0x81xxxxxx) and NV indices (0x01xxxxxx) are NOT translated (they are global TPM resources).
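
The handle-type test behind this translation can be sketched as follows (std `HashMap` stands in for the per-client XArray; per the TPM2 handle-type encoding, this sketch assumes 0x02xxxxxx HMAC-session handles are remapped the same way as policy sessions):

```rust
use std::collections::HashMap;

/// TPM2 handles encode their type in the most significant byte.
/// Sessions (0x02 HMAC, 0x03 policy) and transient objects (0x80) are
/// remapped; persistent handles (0x81) and NV indices (0x01) pass through.
fn needs_translation(handle: u32) -> bool {
    matches!(handle >> 24, 0x02 | 0x03 | 0x80)
}

/// Virtual-to-physical lookup for one client's translation table.
/// An unknown virtual handle yields None (the command is rejected).
fn to_physical(table: &HashMap<u32, u32>, handle: u32) -> Option<u32> {
    if needs_translation(handle) {
        table.get(&handle).copied()
    } else {
        Some(handle) // global TPM resource: forwarded unchanged
    }
}
```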

  3. Rate limiting. TPM operations are slow (1-50ms per command). A remote client is rate-limited to 100 commands/second to prevent DoS of the TPM for local operations. The rate limit is per-client, not global -- local operations always have priority.

Rate limiter: token bucket per client.

  • Bucket capacity: 100 tokens (= 100 commands).
  • Refill rate: 100 tokens/second.
  • Each TpmCommand consumes 1 token.
  • If bucket empty: the command is queued (bounded: 32 pending commands per client). If the queue is full: return TPM_RC_RETRY (0x922 -- the standard TPM "try again later" response code).
  • Local operations bypass the rate limiter entirely (they use /dev/tpmrm0 directly, not the service provider path).
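
A minimal token-bucket model of this limiter (deterministic elapsed-time arguments stand in for the kernel clock; the 32-entry pending queue is elided):

```rust
/// Per-client token bucket: capacity 100, refill 100 tokens/second.
struct TokenBucket {
    tokens: f64,
    capacity: f64,
    refill_per_sec: f64,
}

impl TokenBucket {
    fn new() -> Self {
        Self { tokens: 100.0, capacity: 100.0, refill_per_sec: 100.0 }
    }

    /// Refill for the elapsed interval, then try to consume one token
    /// for a TpmCommand. Returns false when the bucket is empty (the
    /// real provider would queue the command or return TPM_RC_RETRY).
    fn try_consume(&mut self, elapsed_secs: f64) -> bool {
        self.tokens = (self.tokens + elapsed_secs * self.refill_per_sec)
            .min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```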

  4. Audit trail. Every remote TPM command is logged as an IMA event (Section 9.5) with the requesting peer's identity. This provides a tamper-evident record of all remote TPM usage.

IMA audit event format:

  • Event name: tpm_remote_cmd
  • Template: d-ng|n-ng|sig (data hash, filename, signature)
  • Data hash: SHA-256 of the TPM command bytes.
  • Filename: peer:{peer_id}:cmd:{command_code_hex}
  • Signature: not signed (informational event, not a code measurement).
  • The event is appended to the IMA measurement list (Section 9.5) and can be retrieved via /sys/kernel/security/ima/ascii_runtime_measurements.
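
A sketch of the filename-field construction (the decimal peer id and zero-padded 8-digit hex command code are illustrative assumptions; the source only specifies the peer:{peer_id}:cmd:{command_code_hex} shape):

```rust
/// Build the IMA event filename field for one remote TPM command.
fn tpm_audit_name(peer_id: u64, command_code: u32) -> String {
    format!("peer:{}:cmd:{:08x}", peer_id, command_code)
}
```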

  5. Capability gating. CAP_TPM_REMOTE (new capability) is required to bind to a remote TPM service. This is separate from CAP_TPM_SEAL (which gates local sealing operations).

Client-side integration: On the consuming peer, the TPM service appears as /dev/tpmrm1 (or the next available TPM device number). The standard TPM2 software stack (tpm2-tss, tpm2-tools, tpm2-abrmd) works unchanged — it sees a standard TPM resource manager character device. The kernel's TPM driver translates /dev/tpmrm1 operations into TpmCommand/TpmResponse messages over the peer protocol.

Latency: TPM commands take 1-50ms on the hardware. The RDMA round-trip (~3-5μs) adds <0.5% to total latency. Remote TPM is effectively the same speed as local TPM.

Multi-client access: Unlike serial (exclusive), TPM supports multiple concurrent clients via the resource manager's session multiplexing. Multiple peers can bind to the same TPM service simultaneously.

Attestation flow example:

Node B (no TPM) wants to attest its workload to Verifier V:

1. B discovers TPM service on Node A via PeerRegistry (TPM_SERVICE flag).
2. B binds via ServiceBind → gets /dev/tpmrm1.
3. B generates an attestation key (AK) on A's TPM:
   TPM2_CreatePrimary → TPM2_Create → TPM2_Load.
4. B requests a quote from A's TPM:
   TPM2_Quote(AK, PCRs=[0,1,2,7,11]) → signed quote.
5. B sends the quote + A's EK certificate to Verifier V.
6. V verifies: (a) quote signature valid, (b) PCR values match
   expected boot state, (c) EK cert chains to a trusted CA.
7. V trusts that A's boot chain is intact → by extension, B's
   cluster membership is trustworthy (A vouches for B via the
   cluster's Raft-consensus membership protocol).

Drain protocol: On graceful shutdown:

  1. Provider sends ServiceDrainNotify to all connected clients.
  2. Wait up to 5 seconds for clients to send TPM2_FlushContext for their sessions (clients should do this on receiving drain notify).
  3. After timeout: provider forcibly flushes all sessions associated with remote clients (using the handle translation table to identify remote handles). TPM2_FlushContext for each remote session handle.
  4. Clear all NV indices allocated by remote clients (NvIndexTracker cleanup -- walk per-client allocations and issue TPM2_NV_UndefineSpace for each).
  5. Destroy handle translation tables.

No data writeback needed beyond the above -- TPM persistent state (sealed objects, persistent keys) survives in the TPM's NVRAM.


9.5 Runtime Integrity Measurement (IMA)

Measured boot (Section 9.4) ensures the system booted a trusted kernel and drivers. But what about runtime — every binary executed, every library loaded, every config file read after boot? Runtime integrity measurement extends the trust chain from boot into ongoing operation.

Linux IMA measures files into TPM PCR 10 (or a designated PCR) at access time. It was bolted onto the VFS layer after the fact, with a standalone policy language and limited integration with the rest of the security stack. UmkaOS provides equivalent functionality, deeply integrated with the capability system and driver loading.

Measurement policy — Rules specify what to measure:

  • All executable files (mmap PROT_EXEC)
  • All shared libraries (dlopen)
  • Specific configuration files (e.g., /etc/fstab, /etc/passwd)
  • All files opened by processes holding specific capabilities
  • All files in specific filesystem subtrees

Policy rules are expressed as capability predicates — "measure all files opened by any process holding CAP_NET_ADMIN" — making the policy composable with UmkaOS's existing security architecture.

9.5.1 Policy Rule Grammar

IMA policy rules follow a line-oriented grammar. Each rule is a single line specifying an action, a condition set, and optional flags. The grammar (EBNF):

rule         = action SP condition_list [SP flag_list] LF
action       = "measure" | "dont_measure" | "appraise" | "dont_appraise" | "audit"
condition_list = condition *( SP condition )
condition    = key "=" value
key          = "func" | "mask" | "fsmagic" | "fsuuid" | "fspath"
             | "uid" | "euid" | "cap" | "obj_type" | "obj_user"
             | "subtype" | "fsname"
value        = quoted_string | bare_word
flag_list    = flag *( SP flag )
flag         = "permit_directio" | "appraise_type=imasig"
             | "appraise_type=sigv2" | "appraise_algo=" hash_algo
             | "template=" template_name
hash_algo    = "sha256" | "sha384" | "sha512" | "sha3_256"
template_name = "ima-ng" | "ima-sig" | "ima-modsig"

Template formats: The ima-ng template records d-ng|n-ng = (hash_algo:hash_hex)|file_path. The ima-sig template (d-ng|n-ng|sig) adds a base64-encoded IMA signature. The security.ima xattr stores either an HMAC or an RSA/EC digital signature over the file hash, verified against the IMA keyring (the .ima keyring, populated from initramfs or built-in keys at boot).

Condition keys (evaluated in order, ALL conditions in a rule must match):

Key Value type Semantics
func FILE_CHECK, MMAP_CHECK, BPRM_CHECK, MODULE_CHECK, FIRMWARE_CHECK, KEXEC_CHECK, KABI_CHECK Hook point where measurement occurs. KABI_CHECK is UmkaOS-extension syntax (no Linux equivalent): fires on every KABI driver load (Tier 1 and Tier 2). Policy files using KABI_CHECK are not parseable by stock ima-evm-utils; UmkaOS ships its own ima-evm-utils build with KABI_CHECK support. Future Linux IMA func values will not collide because UmkaOS-specific values use a separate namespace.
mask MAY_EXEC, MAY_WRITE, MAY_READ, MAY_APPEND File access mode filter. MAY_EXEC covers both execve() and mmap(PROT_EXEC).
fspath glob pattern Filesystem path. Supports * (single component) and ** (recursive). Example: fspath=/usr/lib/**
fsmagic hex value Filesystem magic number (e.g., 0xEF53 for ext4). Used to exempt pseudo-filesystems.
fsuuid UUID string Filesystem UUID. Allows per-partition policies.
fsname string Filesystem type name (e.g., ext4, btrfs, tmpfs).
uid integer Real UID of the process accessing the file.
euid integer Effective UID of the process.
cap capability name UmkaOS capability held by the process. Example: cap=CAP_NET_ADMIN. Matches if the process's effective capability set includes the named capability.
obj_type string SELinux/UmkaOS label type of the file (for label-based policies).

Example built-in policy (compiled into the kernel):

 # Measure all executables and shared libraries
measure  func=BPRM_CHECK  mask=MAY_EXEC
measure  func=MMAP_CHECK  mask=MAY_EXEC
measure  func=MODULE_CHECK
measure  func=FIRMWARE_CHECK
measure  func=KABI_CHECK

 # Appraise (enforce) all executables
appraise func=BPRM_CHECK  mask=MAY_EXEC  appraise_type=imasig
appraise func=MMAP_CHECK  mask=MAY_EXEC  appraise_type=imasig
appraise func=MODULE_CHECK                appraise_type=imasig
appraise func=KABI_CHECK                  appraise_type=imasig

 # Exempt pseudo-filesystems from measurement
dont_measure  fsmagic=0x9FA0    # procfs
dont_measure  fsmagic=0x62656572  # sysfs
dont_measure  fsmagic=0x64626720  # debugfs
dont_measure  fsmagic=0x01021994  # tmpfs
dont_measure  fsmagic=0x1CD1     # devpts

Rule evaluation: Rules are evaluated top-to-bottom; first matching rule wins. If no rule matches, the default action is dont_measure / dont_appraise (fail-open for measurement, fail-closed for appraisal in umka.ima=enforce mode — unmeasured files that require appraisal are denied).
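
First-match-wins evaluation can be modeled compactly. In this sketch the AND-combined condition set of each rule is reduced to a single predicate over (fsmagic, wants_exec); the type and function names are illustrative.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Action { Measure, DontMeasure }

/// One compiled rule: an action plus its condition set (modeled as a
/// single predicate for illustration; the real parser builds one check
/// per key=value condition and ANDs them).
struct Rule {
    action: Action,
    matches: fn(u32, bool) -> bool,
}

/// Top-to-bottom evaluation; the first matching rule wins. If no rule
/// matches, the default action is dont_measure.
fn evaluate(rules: &[Rule], fsmagic: u32, wants_exec: bool) -> Action {
    rules.iter()
        .find(|r| (r.matches)(fsmagic, wants_exec))
        .map(|r| r.action)
        .unwrap_or(Action::DontMeasure)
}
```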

9.5.2 Algorithm Agility

IMA supports multiple hash algorithms via the appraise_algo flag:

  • SHA-256 (default): 32-byte digests. Quantum-safe for preimage resistance (Grover's algorithm gives ~128-bit effective security), standard across all deployments.
  • SHA-384: 48-byte digests. Higher security margin for long-lived signatures.
  • SHA-512: 64-byte digests. Maximum security, slightly higher measurement cost.
  • SHA3-256: 32-byte digests. Required for KABI driver measurement log interoperability: KABI key fingerprints (Section 12.7) use SHA3-256 (the DriverCertEntry.key_id is a SHA3-256 hash of the public key), so IMA must support SHA3-256 to record KABI driver measurements with matching algorithm identifiers. SHA3-256 is also the NIST-recommended sponge-based hash (FIPS 202) and shares the implementation with EVM's HMAC-SHA3-256 key derivation (Section 9.5).

The hash algorithm is selected per-rule, allowing different algorithms for different file categories (e.g., SHA-512 for kernel modules, SHA-256 for general executables, SHA3-256 for KABI drivers).

Crypto API integration: All IMA hash operations use the kernel Crypto API (Section 10.1) via crypto_shash_digest(), never standalone hash implementations. At IMA init time, a ShashTfm is allocated for each hash algorithm referenced by the active IMA policy rules. This ensures:

  • Hardware SHA acceleration (SHA-NI on x86, SHA2/SHA3 instructions on ARMv8.2+) is automatically utilized when available.
  • FIPS 140-3 compliance: all cryptographic operations go through the validated Crypto API code path, with self-test and integrity verification.
  • Algorithm agility: switching algorithms only requires updating the policy rule's appraise_algo flag; the Crypto API resolves the appropriate implementation.

PCR digest prediction (for TPM sealing policy construction) also uses the Crypto API — crypto_shash_init() / crypto_shash_update() / crypto_shash_final() — rather than standalone hash functions, ensuring consistent algorithm selection and FIPS compliance for all TPM-related cryptographic operations.

Algorithm migration: When upgrading the hash algorithm (e.g., SHA-256 → SHA-512), the kernel accepts both old and new algorithm digests during a transition window (configurable via umka.ima.transition_algos=sha256,sha512). This allows gradual re-signing of file extended attributes without breaking running systems. After transition, the old algorithm can be removed from the accepted list.

The extended attribute storing the signed hash includes an algorithm identifier prefix: security.ima xattr format is { algorithm_id: u8, signature_data: [u8] }, where algorithm_id maps to: 4 = SHA-256, 5 = SHA-384, 6 = SHA-512, 20 = SHA3-256 (matching the Linux IMA enum hash_algo values for compatibility). SHA3-256 (20) corresponds to HASH_ALGO_SHA3_256 in the Linux enum hash_algo (include/uapi/linux/hash_info.h).
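
Decoding that xattr header can be sketched as follows (`parse_ima_xattr` is a hypothetical helper; the id-to-name mapping follows the values listed above):

```rust
/// Decode the security.ima xattr: { algorithm_id: u8, signature_data: [u8] }.
/// Returns the algorithm name and the raw signature bytes, or None for an
/// unknown algorithm_id or an empty xattr.
fn parse_ima_xattr(xattr: &[u8]) -> Option<(&'static str, &[u8])> {
    let (&id, sig) = xattr.split_first()?;
    let algo = match id {
        4 => "sha256",
        5 => "sha384",
        6 => "sha512",
        20 => "sha3-256", // HASH_ALGO_SHA3_256 in the Linux enum hash_algo
        _ => return None, // unknown algorithm_id: reject the xattr
    };
    Some((algo, sig))
}
```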

9.5.3 Crypto API Integration

IMA must use the Kernel Crypto API (Section 10.1) for all hash operations. This is a hard requirement for three reasons:

  1. Hardware acceleration: SHA-NI (x86-64), ARM Crypto Extensions (FEAT_SHA2), and RISC-V Zknh provide 4-10x throughput improvement over software SHA-256. Direct computation bypassing the Crypto API forfeits this acceleration.
  2. FIPS compliance: FIPS 140-3 requires that all cryptographic operations use the validated module. A standalone SHA-256 implementation outside the Crypto API violates the FIPS module boundary.
  3. Algorithm agility: The Crypto API's crypto_alloc_shash() respects the per-rule appraise_algo setting (SHA-256, SHA-384, SHA-512, SHA3-256) and automatically selects the highest-priority registered implementation.

9.5.3.1 IMA Hash Transform Lifecycle

IMA allocates ShashTfm objects during subsystem initialization (Phase 2 boot, after the slab allocator is online) and caches them for the lifetime of the IMA subsystem. Hash operations on the VFS hot path use these pre-allocated transforms — no crypto_alloc_shash() occurs per file open.

// umka-core/src/security/ima/crypto.rs

/// Per-algorithm cached hash transform for IMA measurement.
/// Allocated once at IMA init; reused across all measurement operations
/// for the same algorithm. Each CPU gets its own `ShashDesc` (scratch
/// state) but shares the `ShashTfm` (algorithm binding + key schedule).
///
/// **Lifetime**: Stored in `static IMA_CRYPTO: OnceLock<ImaCryptoCtx>`, so the
/// `Box<ShashTfm>` values have kernel lifetime and are never dropped.
/// This is a deliberate permanent allocation (cold-path, boot-time only).
pub struct ImaCryptoCtx {
    /// SHA-256 transform (always allocated — default measurement algorithm).
    pub sha256_tfm: Box<ShashTfm>,
    /// SHA-384 transform (allocated if any policy rule uses sha384).
    pub sha384_tfm: Option<Box<ShashTfm>>,
    /// SHA-512 transform (allocated if any policy rule uses sha512).
    pub sha512_tfm: Option<Box<ShashTfm>>,
    /// SHA3-256 transform (allocated if any KABI measurement rule exists,
    /// or if any policy rule specifies sha3-256).
    pub sha3_256_tfm: Option<Box<ShashTfm>>,
}

/// Global IMA crypto context. Initialized once by `ima_crypto_init()`.
/// Read-only after init — no locking needed on the hot path.
static IMA_CRYPTO: OnceLock<ImaCryptoCtx> = OnceLock::new();

/// Initialize IMA hash transforms via the Kernel Crypto API.
///
/// Called during IMA subsystem init (after `crypto_register_builtin_algs()`
/// has populated the algorithm table). Panics on failure — IMA without
/// crypto is not a valid configuration.
///
/// # Crypto API interaction
/// Each `crypto_alloc_shash()` call:
/// 1. Looks up the algorithm name in the global `ALGORITHM_TABLE`.
/// 2. Selects the highest-priority non-DYING implementation (hardware
///    accelerated if available, software fallback otherwise).
/// 3. Increments the algorithm's `refcount` (prevents deregistration
///    while IMA holds the transform).
/// 4. Allocates a `ShashTfm` from the slab and binds it to the algorithm.
///
/// In FIPS mode (`crypto_fips_enabled()`), only FIPS-approved hash
/// implementations are returned. All four IMA hash algorithms (SHA-256,
/// SHA-384, SHA-512, SHA3-256) are FIPS-approved, so FIPS mode does not
/// restrict IMA's algorithm selection.
pub fn ima_crypto_init(policy: &ImaPolicy) -> ImaCryptoCtx {
    let sha256_tfm = crypto_alloc_shash("sha256", CryptoAllocFlags::empty())
        .expect("IMA: crypto_alloc_shash(sha256) failed — cannot proceed");

    // Allocate optional transforms only if referenced by policy rules.
    let sha384_tfm = if policy.uses_algorithm(HashAlgorithm::Sha384) {
        Some(crypto_alloc_shash("sha384", CryptoAllocFlags::empty())
            .expect("IMA: crypto_alloc_shash(sha384) failed"))
    } else {
        None
    };

    let sha512_tfm = if policy.uses_algorithm(HashAlgorithm::Sha512) {
        Some(crypto_alloc_shash("sha512", CryptoAllocFlags::empty())
            .expect("IMA: crypto_alloc_shash(sha512) failed"))
    } else {
        None
    };

    let sha3_256_tfm = if policy.uses_algorithm(HashAlgorithm::Sha3_256)
                        || policy.has_kabi_rules() {
        Some(crypto_alloc_shash("sha3-256", CryptoAllocFlags::empty())
            .expect("IMA: crypto_alloc_shash(sha3-256) failed"))
    } else {
        None
    };

    ImaCryptoCtx { sha256_tfm, sha384_tfm, sha512_tfm, sha3_256_tfm }
}

9.5.3.2 IMA as an LSM Module

IMA registers as an LSM module at priority 11 (integrity tier) (Section 9.8). IMA is NOT a separate standalone step in the exec/open path — it participates via the LSM hook chain. When do_execve() calls lsm_call_task_security(Exec, ...), the LSM framework iterates all registered modules; IMA's security_bprm_check hook runs as part of that iteration. Similarly, file opens invoke lsm_call_file_security(Open, ...), which calls IMA's security_file_open hook. There is exactly one invocation per hook point, not two separate calls (one "standalone" and one "as LSM").

9.5.3.3 Per-Inode IMA State

IMA tracks per-inode measurement state in the inode's LSM blob. This avoids redundant re-measurement of files that have not changed since the last measurement.

/// Cached hash digest with algorithm tag, supporting up to SHA-512 (64 bytes).
pub struct ImaInodeHash {
    /// The digest bytes (up to 64; only the first `len` bytes are valid).
    pub digest: [u8; 64],
    /// The hash algorithm used to produce this digest.
    pub algo: HashAlgorithm,
    /// Actual digest length in bytes (32 for SHA-256, 48 for SHA-384, 64 for SHA-512).
    pub len: u8,
}

/// Per-inode IMA state, stored in the inode's LSM security blob
/// (allocated by `inode_alloc_security()` during `new_inode()`).
/// Initialized to default values (all None/false, generation 0).
pub struct ImaInodeState {
    /// Cached file content hash from the most recent IMA measurement.
    /// `None` if the file has not been measured since last modification.
    /// Set by `ima_calc_file_hash()` after successful measurement.
    /// Invalidated (set to `None`) when the inode's `i_version` changes
    /// (file content modified) or when the inode is evicted from cache.
    pub hash: Option<ImaInodeHash>,

    /// Whether this inode has been measured (recorded in the IMA log)
    /// during the current boot cycle. Once measured, re-measurement is
    /// skipped unless the file changes (detected via `i_version`).
    pub measured: bool,

    /// Whether this inode has been appraised (signature verified against
    /// `security.ima` xattr) during the current boot cycle.
    pub appraised: bool,

    /// The IMA generation counter value at the time of the last measurement.
    /// If the global IMA generation counter has advanced past this value
    /// (e.g., due to policy reload or algorithm change), the cached hash
    /// is stale and the file must be re-measured. Compared against the
    /// global `IMA_GENERATION: AtomicU64` on each access.
    pub last_check_generation: u64,
}

9.5.3.4 Measurement Hot Path

On every file open/exec/mmap that matches an IMA policy rule, the measurement path hashes the file contents via crypto_shash_digest() using the pre-allocated transform. This is the IMA hot path — it runs synchronously in the VFS file_open context (invoked via the LSM hook chain, not as a separate step). The measurement is routed to the current task's IMA namespace:

/// Check whether any IMA policy rule (namespace-local then global) matches
/// this file for the given action. Evaluated under rcu_read_lock().
fn matches_policy(ima_ns: &ImaNamespace, inode: &Inode, action: ImaAction) -> bool;

/// Push a measurement onto the global extend queue. Called from the VFS
/// hot path; the ima-tpm-extend thread drains and routes to namespaces.
fn enqueue_extend(measurement: ImaMeasurement) {
    IMA_EXTEND_QUEUE.push(measurement);
    IMA_EXTEND_NEEDED.notify_one();
}

// Hot-path sequence (runs inside the security_file_open LSM hook):
let ima_ns = current_task().nsproxy.ima_ns;
// Policy: namespace-local rules first, then global fallthrough.
if !matches_policy(&ima_ns, file.inode(), action) { return Ok(()); }
// Hash: shared inode cache avoids re-hashing across namespaces.
let hash = ima_inode_state.get_or_compute_hash(file, algo)?;
// Enqueue to global extend queue (async thread routes to namespace log + PCR).
enqueue_extend(ImaMeasurement { ima_ns: ima_ns.clone(), .. });

// umka-core/src/security/ima/measure.rs

/// Compute the IMA file measurement hash using the Kernel Crypto API.
///
/// This function is called from the `security_file_open()` LSM hook
/// for every file that matches an IMA measurement or appraisal rule.
///
/// # Crypto API usage
/// - `crypto_shash_init()`: initialize the per-CPU `ShashDesc` scratch state.
/// - `crypto_shash_update()`: feed file pages incrementally (4 KiB per call,
///   streaming from the page cache without reading the entire file into a
///   contiguous buffer).
/// - `crypto_shash_final()`: finalize and extract the digest.
///
/// The `ShashTfm` is shared (read-only after init); the `ShashDesc` is
/// per-CPU (allocated on the stack or from a per-CPU slab pool). This
/// avoids any locking on the measurement hot path.
///
/// # Hardware acceleration
/// When SHA-NI (x86-64), FEAT_SHA2 (AArch64), or Zknh (RISC-V) is
/// available, the `crypto_shash_update()` calls dispatch to the hardware-
/// accelerated implementation transparently via the Crypto API's priority
/// selection. No IMA-specific code path change is needed — the `ShashTfm`
/// was bound to the best implementation at `crypto_alloc_shash()` time.
///
/// # Performance
/// SHA-256 throughput with hardware acceleration:
/// - SHA-NI (x86-64): ~4-6 GB/s single-core (~1 cycle/byte)
/// - FEAT_SHA2 (AArch64 Neoverse N2): ~3-5 GB/s
/// - Software fallback: ~500 MB/s
/// For a 1 MiB file: ~0.2 ms (SHA-NI) vs ~2 ms (software).
pub fn ima_calc_file_hash(
    file: &OpenFile,
    algo: HashAlgorithm,
    digest_out: &mut [u8],
) -> Result<(), ImaError> {
    let ctx = IMA_CRYPTO.get().ok_or(ImaError::NotInitialized)?;

    // Select the pre-allocated ShashTfm for the requested algorithm.
    let tfm: &ShashTfm = match algo {
        HashAlgorithm::Sha256  => &ctx.sha256_tfm,
        HashAlgorithm::Sha384  => ctx.sha384_tfm.as_ref()
            .ok_or(ImaError::AlgorithmNotConfigured)?,
        HashAlgorithm::Sha512  => ctx.sha512_tfm.as_ref()
            .ok_or(ImaError::AlgorithmNotConfigured)?,
        HashAlgorithm::Sha3_256 => ctx.sha3_256_tfm.as_ref()
            .ok_or(ImaError::AlgorithmNotConfigured)?,
    };

    // Allocate ShashDesc on the stack (per-CPU, no heap allocation).
    // desc_size is algorithm-dependent: SHA-256 = 108 bytes,
    // SHA-512 = 208 bytes, SHA3-256 = 360 bytes.
    let mut desc_buf = [0u8; 512];  // worst-case ShashDesc size
    let desc = shash_desc_on_stack(tfm, &mut desc_buf);

    // SAFETY: desc is a valid ShashDesc backed by desc_buf with
    // sufficient size for the algorithm's descsize. tfm is valid
    // for the duration of IMA's lifetime.
    unsafe {
        let rc = (tfm.alg.ops_shash().init)(desc);
        if rc != 0 { return Err(ImaError::HashInitFailed(rc)); }
    }

    // Stream file pages through the hash. Each page is read from
    // the page cache via vfs_read_page() — no separate I/O.
    let page_count = file.size().div_ceil(PAGE_SIZE as u64);
    for page_idx in 0..page_count {
        let page = vfs_read_page(file, page_idx)?;
        let len = core::cmp::min(
            PAGE_SIZE,
            (file.size() - page_idx * PAGE_SIZE as u64) as usize,
        );
        // SAFETY: page.as_ptr() is a valid kernel page mapping.
        // len is bounded by PAGE_SIZE and remaining file size.
        unsafe {
            let rc = (tfm.alg.ops_shash().update)(
                desc, page.as_ptr(), len as u32,
            );
            if rc != 0 { return Err(ImaError::HashUpdateFailed(rc)); }
        }
    }

    // Finalize and extract the digest.
    // SAFETY: digest_out must be at least algo.digest_size() bytes.
    unsafe {
        let rc = (tfm.alg.ops_shash().finalize)(desc, digest_out.as_mut_ptr());
        if rc != 0 { return Err(ImaError::HashFinalizeFailed(rc)); }
    }

    Ok(())
}

9.5.3.5 Re-Attestation on Algorithm Change

If the underlying hash implementation changes at runtime (e.g., a hardware crypto driver probes after IMA init and registers a higher-priority SHA-256), IMA's cached ShashTfm objects participate in the Crypto API's generation-based re-attestation protocol (Section 10.1). On the next crypto_shash_init() call, the TfmBase inside ShashTfm detects the generation mismatch and transparently re-binds to the new (hardware-accelerated) implementation. No IMA code change or restart is needed. The cost is a single atomic load per hash operation (~1 cycle) plus a one-time ~1 us re-initialization on mismatch.
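
The generation check can be modeled with a single atomic (illustrative types; the real protocol lives in TfmBase, Section 10.1):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Global crypto registry generation, bumped whenever an implementation
/// is registered or deregistered.
static CRYPTO_GENERATION: AtomicU64 = AtomicU64::new(1);

/// Model of a cached transform binding.
struct CachedBinding {
    bound_generation: u64,
}

impl CachedBinding {
    /// Hot path is one atomic load; returns true when a (slow-path)
    /// re-bind to the current best implementation was performed.
    fn ensure_current(&mut self) -> bool {
        let g = CRYPTO_GENERATION.load(Ordering::Acquire); // ~1 cycle
        if self.bound_generation == g {
            return false;
        }
        // Slow path: re-resolve the highest-priority implementation.
        self.bound_generation = g;
        true
    }
}
```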

9.5.4 Container and Namespace Interaction

IMA operates at the filesystem inode level, which is inherently global — the same inode has the same measurement regardless of which namespace accesses it. However, container isolation requires per-namespace policy control:

Policy namespacing — Each user namespace MAY have its own IMA policy overlay:

  1. The global policy (loaded at boot, signed, append-only) applies to all namespaces. It cannot be weakened by a container.
  2. A container runtime (e.g., umka-containerd) can request additional measurement rules for its namespace via securityfs in the container's IMA namespace (/proc/<pid>/ns/ima). These rules can only add measurements, never remove or weaken global rules.
  3. The IMA namespace is created alongside the user namespace. It inherits the global policy and can be extended but not reduced.
  4. Exec-path scoping: IMA appraisal during exec evaluates the policy in the init_user_ns context (global policy). Per-namespace IMA policies are not supported for appraisal enforcement in Phase 3 — appraisal always uses the global policy regardless of the task's user namespace; namespace-local rules affect measurement only.

Per-namespace IMA measurement logs — Each IMA namespace maintains its own independent, append-only measurement log. Files measured in init_ns have entries only in init_ns's log. Files measured inside a container namespace have entries only in that container's log. The inode hash cache (ImaInodeState.hash) is SHARED across all namespaces — a file already hashed in one namespace is not re-hashed when accessed from another, but a new LOG ENTRY is created in the accessing namespace's log. This provides container isolation (namespace A cannot see namespace B's file access patterns) while avoiding redundant cryptographic computation.

PCR extend semantics — Hardware TPM PCR 10 is extended ONLY by init_ns measurements. Container namespace measurements extend virtual PCRs (ImaNamespace.virtual_pcrs) only. Virtual PCRs are software-maintained hash chains — not hardware-backed. Remote attestation of a container uses virtual PCR values; remote attestation of the host uses hardware PCR 10.
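
The software hash chain behind a virtual PCR is pcr' = H(pcr || measurement), sketched below. std's DefaultHasher stands in for SHA-256 purely to keep the example self-contained; the real chain hashes the concatenated digests through the Crypto API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Virtual PCR extend: fold a new measurement into the running chain
/// value. Order-sensitive by construction — replaying the same
/// measurements in a different order yields a different PCR value.
fn vpcr_extend(pcr: u64, measurement: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    pcr.hash(&mut h);
    measurement.hash(&mut h);
    h.finish()
}
```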

Policy evaluation — Policy evaluation walks namespace-specific rules first (ima_ns.local_rules), then falls through to the global policy (ima_ns.global_policy). A container can have stricter policy than the host (e.g., require signatures for all binaries) but cannot weaken the global policy.

A remote verifier can request a TPM quote scoped to a specific container's measurements by specifying the namespace ID.

/// Number of virtual PCR slots per IMA namespace. Matches the standard
/// TPM 2.0 PCR bank size: PCR[0] through PCR[23]. Each slot independently
/// tracks a measurement chain for a different measurement category
/// (boot, kernel, IMA policy, application-specific, etc.).
const NUM_VIRTUAL_PCRS: usize = 24;

/// Per-namespace IMA state. Created alongside the user namespace (Section 17.1.2).
/// The global (init) IMA namespace is statically allocated; container
/// namespaces are dynamically allocated via `unshare(CLONE_NEWUSER)`.
///
/// A freshly-created IMA namespace starts with an empty measurement log. Files
/// accessed from the new namespace are measured and added to the namespace's own
/// log. The global inode hash cache (`ImaInodeState.hash`) is shared — a file
/// already hashed in the init namespace is not re-hashed, but IS added to the new
/// namespace's measurement log as a new entry.
pub struct ImaNamespace {
    /// Owning user namespace. Determines which UIDs can extend the
    /// local policy. Only `ns_capable(user_ns, CAP_SYS_ADMIN)` can
    /// write to this namespace's `securityfs` IMA policy file.
    pub user_ns: Arc<UserNamespace>,

    /// Global policy reference (read-only, shared across all namespaces).
    /// These rules cannot be removed or weakened by any container.
    /// Points to the init namespace's policy (loaded at boot, signed).
    pub global_policy: &'static ImaPolicy,

    /// IMA policy rules for this mount namespace. RCU-protected to allow lockless reads.
    ///
    /// Namespace-local additional rules (append-only). Container runtimes
    /// add rules via securityfs; rules can only strengthen measurement
    /// requirements, never weaken global policy.
    ///
    /// **Read path** (every `file_open` / `mmap` / `exec`): rule evaluation
    /// reads the list under `rcu_read_lock()` — lock-free, zero contention.
    /// Two pointer dereferences: RcuPtr -> Arc -> inline (data_ptr, len) -> rules.
    /// **Write path** (`securityfs` policy write, typically once at container
    /// startup): clone the current `Arc<[ImaRule]>` slice into a new allocation,
    /// append the new rule, swap the Arc under `rules_update_lock`. The old
    /// slice is freed after the RCU grace period. This O(N) clone-on-update is
    /// an accepted tradeoff: IMA policy is written at most once per container
    /// lifecycle (at container startup), never at runtime. For deployments
    /// requiring more than ~1,000 dynamic policy updates, consider an
    /// RCU-protected skip list — that optimization is not needed for current
    /// UmkaOS use cases.
    /// **Maximum rules per namespace**: 256 (IMA_MAX_NS_RULES). Policy writes
    /// that exceed this limit return ENOSPC.
    pub local_rules: RcuPtr<Arc<[ImaRule]>>,

    /// Serializes rule list updates. Only held during the cold-path
    /// `securityfs` policy write — never on the file-open hot path.
    pub rules_update_lock: Mutex<()>,

    /// Per-namespace measurement log. Append-only, hash-chained: each
    /// entry includes `H(previous_entry)` for tamper detection. Entries
    /// record (pcr_index, algorithm, hash, filename_hint, timestamp).
    /// Read via `/proc/<pid>/ns/ima/ascii_runtime_measurements`.
    /// The Mutex serializes writes from the extend thread. Readers
    /// (attestation queries via `ima_get_measurement_log`) access the
    /// `committed` RcuVec via RCU read-side — they do NOT acquire the
    /// Mutex. This allows lock-free reads concurrent with writes.
    pub measurement_log: Mutex<ImaMeasurementLog>,

    /// Software PCR bank for container attestation. Mirrors the hardware
    /// TPM PCR extend model but in software — each measurement extends
    /// `virtual_pcr[index] = H(virtual_pcr[index] || measurement)`.
    /// Remote verifiers requesting per-container attestation receive
    /// these virtual PCR values (signed by the kernel's attestation key)
    /// rather than the global TPM PCRs. The global (init) namespace
    /// extends both the hardware TPM PCRs and its virtual PCR bank.
    pub virtual_pcrs: [PcrSlot; NUM_VIRTUAL_PCRS],
}

impl ImaNamespace {
    /// Returns `true` if this is the init (global) IMA namespace.
    /// Used by the extend thread to decide whether to extend hardware TPM PCRs.
    pub fn is_init_ns(&self) -> bool {
        Arc::ptr_eq(&self.user_ns, &INIT_USER_NS)
    }
}

/// A single software PCR slot with seqlock-based concurrency.
///
/// PCR values are 32 bytes (SHA-256) or 48 bytes (SHA-384) — too wide for
/// any hardware atomic instruction on any supported architecture (max atomic
/// width is 16 bytes on x86-64 with CMPXCHG16B, 8 bytes on ARMv7/PPC32).
/// A seqlock provides lock-free readers with consistent snapshots:
///
/// **Writer** (IMA extend path, single-writer per PCR guaranteed by
/// the `ImaNamespace.measurement_log` mutex):
///   1. `sequence.fetch_add(1, Release)` — odd sequence signals write-in-progress
///   2. Write `digest` bytes and `algorithm`
///   3. `sequence.fetch_add(1, Release)` — even sequence signals write-complete
///
/// **Reader** (attestation path, lock-free):
///   1. `seq1 = sequence.load(Acquire)`
///   2. If seq1 is odd, retry (writer in progress)
///   3. Read `digest` and `algorithm`
///   4. `seq2 = sequence.load(Acquire)`
///   5. If seq2 != seq1, retry (writer intervened)
///   6. Return consistent snapshot
///
/// This is arch-neutral — no hardware atomics wider than 8 bytes needed.
pub struct PcrSlot {
    /// Seqlock sequence counter. Even = stable, odd = write in progress.
    pub sequence: AtomicU64,
    /// PCR digest value. Max 48 bytes for SHA-384; SHA-256 uses first 32.
    pub digest: [u8; 48],
    /// Hash algorithm for this PCR slot (determines valid digest length).
    /// Uses the canonical `HashAlgorithm` enum from [Section 10.1](10-security-extensions.md#kernel-crypto-api).
    pub algorithm: HashAlgorithm,
    /// Padding to cache-line alignment (avoid false sharing between PCR slots).
    pub _pad: [u8; 7],
}
// PcrSlot: sequence(8) + digest(48) + algorithm(1) + _pad(7) = 64 bytes = 1 cache line.
const_assert!(size_of::<PcrSlot>() == 64);
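The seqlock protocol above can be exercised in a minimal userspace sketch. This is illustrative only: the type and method names (`SeqSlot`, `write`, `read`) are hypothetical, the digest sits behind an `UnsafeCell` with writes serialized externally (mirroring the single-writer guarantee from the measurement-log mutex), and a real kernel implementation would add volatile access and arch-appropriate fencing.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical seqlock-protected 48-byte slot (sketch of PcrSlot's protocol).
pub struct SeqSlot {
    sequence: AtomicU64,
    digest: UnsafeCell<[u8; 48]>,
}

// Sound only because writes are externally serialized (single writer),
// matching the single-writer-per-PCR guarantee described above.
unsafe impl Sync for SeqSlot {}

impl SeqSlot {
    pub const fn new() -> Self {
        Self { sequence: AtomicU64::new(0), digest: UnsafeCell::new([0; 48]) }
    }

    /// Writer: odd sequence marks write-in-progress, even marks stable.
    pub fn write(&self, new: &[u8; 48]) {
        self.sequence.fetch_add(1, Ordering::Release); // now odd
        unsafe { *self.digest.get() = *new };
        self.sequence.fetch_add(1, Ordering::Release); // now even
    }

    /// Reader: retry until a stable, unchanged sequence brackets the copy.
    pub fn read(&self) -> [u8; 48] {
        loop {
            let s1 = self.sequence.load(Ordering::Acquire);
            if s1 % 2 != 0 { continue; } // writer in progress
            let snap = unsafe { *self.digest.get() };
            let s2 = self.sequence.load(Ordering::Acquire);
            if s1 == s2 { return snap; }
        }
    }
}

fn main() {
    let slot = SeqSlot::new();
    slot.write(&[0xAB; 48]);
    assert_eq!(slot.read(), [0xAB; 48]);
    // Sequence is even after a completed write — readers see a stable slot.
    assert_eq!(slot.sequence.load(Ordering::Relaxed) % 2, 0);
}
```

In the single-writer case shown, a concurrent reader observes either the pre-write or the post-write digest, never a torn mix of the two.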

/// A single IMA measurement log entry.
///
/// Fields are ordered to eliminate alignment padding under `#[repr(C)]`:
/// largest-alignment fields first (u64), then arrays, then u8 fields.
/// Slab-allocated from a dedicated cache.
#[repr(C)]
pub struct ImaMeasurementEntry {
    /// Measurement timestamp (monotonic nanoseconds).
    pub timestamp_ns: u64,
    /// File content hash.
    pub digest: [u8; 64],  // max SHA-512; actual length from algorithm
    /// Hash of the previous log entry (chain integrity).
    pub previous_hash: [u8; 32],  // SHA-256 chain
    /// File path hint (for human-readable log output). Only the first
    /// `filename_len` bytes are valid. The fixed 256-byte (1 + 255) layout
    /// avoids variable-length allocation — `ImaMeasurementEntry` is
    /// slab-allocated from a dedicated cache, and the fixed size avoids
    /// per-entry heap allocation and fragmentation.
    /// Paths longer than 255 bytes (e.g., deep container paths) are
    /// truncated. When `filename_len == 255`, the path MAY be truncated;
    /// consumers should treat it as a best-effort hint. The digest field
    /// is the actual measurement — filename truncation does not affect
    /// integrity chain correctness.
    pub filename_data: [u8; 255],
    /// PCR index this measurement was extended into (virtual PCR bank).
    pub pcr_index: u8,
    /// Hash algorithm used (SHA-256, SHA-384, SHA-512).
    pub algorithm: HashAlgorithm,
    /// Actual length of the filename hint in `filename_data`. 0..=255.
    pub filename_len: u8,
    // Struct alignment = 8 (from timestamp_ns). Fields sum = 8+64+32+255+1+1+1 = 362.
    // #[repr(C)] rounds struct size to next multiple of alignment: ceil(362/8)*8 = 368.
    // Trailing padding: 6 bytes (offsets 362-367).
}
// ImaMeasurementEntry: timestamp_ns(u64=8,off=0) + digest([u8;64]=64,off=8) +
//   previous_hash([u8;32]=32,off=72) + filename_data([u8;255]=255,off=104) +
//   pcr_index(u8=1,off=359) + algorithm(u8=1,off=360) + filename_len(u8=1,off=361)
//   + trailing_pad(6) = 368 bytes.
const_assert!(size_of::<ImaMeasurementEntry>() == 368);
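The layout arithmetic in the comment can be checked mechanically. A self-contained sketch — with a minimal stand-in `HashAlgorithm` enum using the discriminants given in Section 10.1 — reproduces the 368-byte size and the stated field offsets:

```rust
use std::mem::size_of;

#[repr(u8)]
#[derive(Clone, Copy)]
#[allow(dead_code)]
enum HashAlgorithm { Sha256 = 0x04, Sha384 = 0x05, Sha512 = 0x06 } // stand-in

#[repr(C)]
#[allow(dead_code)]
struct ImaMeasurementEntry {
    timestamp_ns: u64,
    digest: [u8; 64],
    previous_hash: [u8; 32],
    filename_data: [u8; 255],
    pcr_index: u8,
    algorithm: HashAlgorithm,
    filename_len: u8,
}

fn main() {
    // Field bytes: 8 + 64 + 32 + 255 + 1 + 1 + 1 = 362; #[repr(C)] rounds
    // the struct size up to a multiple of its alignment (8): 368 bytes.
    assert_eq!(size_of::<ImaMeasurementEntry>(), 368);

    // Offsets follow declaration order with no internal padding.
    let e = ImaMeasurementEntry {
        timestamp_ns: 0, digest: [0; 64], previous_hash: [0; 32],
        filename_data: [0; 255], pcr_index: 0,
        algorithm: HashAlgorithm::Sha256, filename_len: 0,
    };
    let base = &e as *const _ as usize;
    assert_eq!(&e.digest as *const _ as usize - base, 8);
    assert_eq!(&e.previous_hash as *const _ as usize - base, 72);
    assert_eq!(&e.filename_data as *const _ as usize - base, 104);
    assert_eq!(&e.pcr_index as *const _ as usize - base, 359);
}
```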

/// Read the measurement log for a specific IMA namespace.
/// `ns=None` reads the init namespace's log (hardware PCR-backed).
/// `ns=Some(ima_ns)` reads that namespace's log (virtual PCR-backed).
pub fn ima_get_measurement_log(
    ns: Option<&ImaNamespace>,
) -> &ImaMeasurementLog;

Virtual PCR trust level — Virtual PCRs are not hardware-backed. Attestation of a container namespace provides software-level integrity (the kernel guarantees the log is append-only and hash-chained) but not hardware-level trust. For hardware-backed container attestation, the host's TPM signs the container's virtual PCR values + log via a software attestation bridge: the host kernel produces a TPM quote over PCR 10 (proving kernel integrity) and then signs the container's virtual PCR values with the kernel's attestation key (proving the virtual PCRs were maintained by a trusted kernel).

Container image verification — For container images deployed as read-only filesystem layers (overlayfs lower layers), dm-verity (Section 9.3) is the primary integrity mechanism. IMA covers the writable upper layer and any files modified at runtime.

Cross-namespace file access — When a process in namespace A accesses a file whose inode was already measured in namespace B, the measurement is reused from the global cache (same inode → same hash). The measurement is recorded in namespace A's log but not re-computed. This prevents DoS via repeated cross-namespace measurement requests.
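The tamper-evidence property of the hash-chained measurement log (each entry carrying `H(previous_entry)`) can be illustrated with a toy model. The sketch below uses `DefaultHasher` purely as a stand-in for SHA-256 — it is not cryptographically secure — and the `append`/`verify` helpers are hypothetical names:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for SHA-256 — illustration only, NOT cryptographically secure.
fn toy_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

struct Entry { measurement: Vec<u8>, previous_hash: u64 }

/// Hash an entry's full content (measurement plus its back-link).
fn entry_hash(e: &Entry) -> u64 {
    let mut bytes = e.measurement.clone();
    bytes.extend_from_slice(&e.previous_hash.to_le_bytes());
    toy_hash(&bytes)
}

/// Append an entry chained to the hash of the previous one (0 for the first).
fn append(log: &mut Vec<Entry>, measurement: Vec<u8>) {
    let prev = log.last().map_or(0, entry_hash);
    log.push(Entry { measurement, previous_hash: prev });
}

/// Walk the chain and verify every link; returns false on tampering.
fn verify(log: &[Entry]) -> bool {
    let mut expected = 0u64;
    for e in log {
        if e.previous_hash != expected { return false; }
        expected = entry_hash(e);
    }
    true
}

fn main() {
    let mut log = Vec::new();
    append(&mut log, b"hash(/usr/bin/ssh)".to_vec());
    append(&mut log, b"hash(/usr/lib/libc.so.6)".to_vec());
    assert!(verify(&log));
    // Tamper with a committed (non-final) entry: the next entry's back-link
    // no longer matches. (Tampering with the *final* entry is caught by the
    // PCR value, not by the chain itself.)
    log[0].measurement[0] ^= 1;
    assert!(!verify(&log));
}
```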

9.5.5 Signed Hash Lifecycle (Package Updates)

IMA appraisal requires every file to carry a signed hash. This section specifies how signed hashes are generated, distributed, and updated during the software lifecycle.

Signing infrastructure:

Signing Key Hierarchy:
  Root CA (offline, HSM-protected)
   └── IMA Signing Key (per-distribution or per-organization)
        └── signs file hashes stored in security.ima xattr

Key distribution:
  - IMA signing key's public half is compiled into the kernel image
    (verified by Secure Boot, measured into PCR 9)
  - Alternatively: public key stored in a keyring loaded from a signed
    policy file (measured into PCR 12)

Package manager integration:

  1. Build time: The distribution's build system computes SHA-256 hashes of all files in a package and signs them with the IMA signing key. The signed hashes are stored as security.ima extended attributes in the package archive (RPM %__ima_sign, DEB dpkg-sig, or equivalent UmkaOS package format).

  2. Install time: The package manager (apt, dnf, or UmkaOS's native package tool) extracts files to disk including their security.ima xattrs. The package manager itself must be IMA-appraised (bootstrapping: the package manager binary is signed during initial system image creation).

  3. Update time: When updating a file (e.g., /usr/bin/ssh version upgrade):
     a. Package manager writes new file contents to a temporary path.
     b. Package manager sets the new security.ima xattr (from the package archive).
     c. Package manager atomically renames the temp file over the old file (rename(2)).
     d. IMA invalidates the measurement cache for the old inode (detected via inode version counter increment on the rename target).
     e. Next access to the file triggers re-measurement and appraisal against the new signed hash. The file is usable immediately — no reboot required.

  4. Rollback: If a package update fails mid-install, the old file retains its old security.ima xattr (the rename never happened). The system remains in a consistent state. If an attacker modifies a file without updating its xattr, appraisal fails and the file is denied.
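The write-temp-then-rename sequence from the update step can be sketched with `std::fs`. This is an illustration, not the package manager's actual code: `atomic_replace` is a hypothetical helper, and the `security.ima` xattr step is omitted because the Rust standard library has no xattr API (a real implementation sets the xattr on the temporary file before the rename).

```rust
use std::fs::{self, File};
use std::io::Write;

/// Sketch of steps (a) and (c): write new contents to a temp path, then
/// atomically rename over the target. Readers see either the old file or
/// the new one, never a partial write.
fn atomic_replace(target: &str, new_contents: &[u8]) -> std::io::Result<()> {
    let tmp = format!("{}.tmp-install", target);
    let mut f = File::create(&tmp)?;
    f.write_all(new_contents)?;
    f.sync_all()?;            // durability before the rename becomes visible
    fs::rename(&tmp, target)  // atomic on POSIX filesystems
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join(format!("umka-ima-demo-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    let target = dir.join("ssh").to_string_lossy().into_owned();
    fs::write(&target, b"old binary v1")?;
    atomic_replace(&target, b"new binary v2")?;
    // The target is fully replaced; a failed install before the rename
    // would have left the old file (and its old xattr) untouched.
    assert_eq!(fs::read(&target)?, b"new binary v2");
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```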

Files without signed hashes: In umka.ima=enforce mode, files that require appraisal (per policy rules) but lack a valid security.ima xattr are denied. In umka.ima=log mode, they are measured and logged but not denied. This allows gradual rollout: start in log mode to identify unsigned files, then switch to enforce.

IMA appraisal (enforcement) — Beyond measurement (recording what was accessed), IMA appraisal enforces integrity. Each file has a signed hash (stored as an extended attribute or in a separate manifest). On access, the kernel computes the file's hash, verifies it against the signed reference, and refuses to execute/load the file if they don't match. This blocks tampered binaries at runtime, not just at audit time.

Appraisal verification path: After measurement, if the policy requires appraisal (ima_appraise=enforce), the kernel verifies the file's security.ima xattr signature against the IMA keyring (.ima). The xattr contains a PKCS#7 or raw signature over the file's hash. Verification uses crypto_akcipher_verify(). If verification fails or the xattr is absent: return -EACCES (block exec/load).

IMA and UmkaOS driver loading — Every Tier 1 driver load is already measured into PCR 12 (Section 9.4). IMA extends this to:

  • Tier 2 driver binaries (userspace drivers loaded via KABI)
  • Userspace helper executables invoked by drivers
  • Firmware blobs loaded by drivers
  • Configuration files that affect driver behavior

IMA policy storage and loading — The IMA policy defines which files are measured and/or appraised. UmkaOS loads the IMA policy from one of these sources, in priority order:

  1. Built-in policy: A compiled-in default policy that measures all executables, shared libraries, and kernel modules. This is always available as a baseline.
  2. Signed policy file: /etc/umka/ima-policy.signed — a policy file signed with the IMA policy signing key (an RSA-4096 or Ed25519 key whose public half is compiled into the kernel or measured into TPM PCR 12 at boot). The kernel verifies the signature before loading. This is the recommended production mechanism.
  3. securityfs write: A privileged process (CAP_MAC_ADMIN) can write policy rules to /sys/kernel/security/ima/policy at runtime. Rules are append-only — once written, they cannot be removed without reboot. This is used for development and debugging only. In production configurations, umka.ima=enforce disables securityfs policy writes entirely.

Policy changes take effect immediately for newly opened files. Files already open and measured retain their existing measurement until closed and reopened.

Audit log — All measurements are logged to UmkaOS's audit subsystem. The log is tamper-evident (hash-chained) and optionally signed. A remote attestation server can request a TPM quote over the measurement PCR plus the audit log, verifying not just "the system booted correctly" but "the system has only executed known-good software since boot."

LSM policy changes are always IMA-audited. Any modification to the security policy is recorded in the IMA measurement log as a tamper-evident audit entry, satisfying audit trail requirements for security-sensitive deployments. The following events generate an IMA audit record:

  • Loading a new AppArmor profile or policy namespace (e.g., via apparmor_parser -r)
  • Loading a new SELinux policy module (e.g., via semodule -i)
  • Removing an LSM policy module
  • Changing the enforcement mode (permissive to enforcing or enforcing to permissive)
  • Loading a new UmkaOS-native LSM module via kabi_load_lsm()

Each audit record contains the following fields:

| Field | Description |
|-------|-------------|
| event_type | IMA_LSM_POLICY_LOAD, IMA_LSM_POLICY_REMOVE, or IMA_LSM_MODE_CHANGE |
| module_name | Policy module name (e.g., "docker-default" for AppArmor, "container_t" for SELinux) |
| module_version | Policy module version string, if available |
| policy_hash | SHA-256 hash of the policy binary (for load events; absent for remove and mode-change events) |
| caller_uid | Real UID of the process that triggered the policy change |
| caller_pid | PID of the calling process |
| caller_comm | Process name (comm) of the caller |
| timestamp_mono | Monotonic clock timestamp (nanoseconds since boot) |
| timestamp_wall | Wall clock timestamp (nanoseconds since Unix epoch) |

LSM policy audit records use the same ImaLsmPolicyEntry log structure as other IMA entries and are extended into TPM PCR 10 (init namespace only — LSM policy is global, so policy audit records always extend hardware PCR 10, not container virtual PCRs) via the same async-extend path described below. This means a remote attestation quote covering PCR 10 captures not only which files were executed, but also every security policy change made since boot.

9.5.6 EVM — Extended Verification Module

EVM protects security-critical file extended attributes (including the IMA measurement hash in security.ima) against offline tampering. Without EVM, an attacker with physical access could mount the disk on another system and modify security.selinux labels or security.capability values without changing file contents — bypassing IMA appraisal entirely. EVM closes this gap by storing a kernel-computed HMAC over the protected xattr set as security.evm.

Protected xattr set (evaluated in alphabetical name order for HMAC input stability):

| xattr | Content protected |
|-------|-------------------|
| security.capability | POSIX file capabilities |
| security.ima | IMA hash or digital signature |
| security.SMACK64 | Smack integrity label (if Smack LSM active) |
| security.selinux | SELinux security context (if SELinux LSM active) |

HMAC input: hmac_misc_struct_bytes || (for each xattr in alphabetical order: name || NUL || value). The hmac_misc struct is hashed in host byte order (matching Linux's hmac_add_misc() in security/integrity/evm/evm_crypto.c): { ino: unsigned_long, generation: u32, uid: u32, gid: u32, mode: u16 } — where unsigned_long is 4 bytes on 32-bit and 8 bytes on 64-bit. The inode number binds the HMAC to the specific file, preventing cross-file replay attacks.

Portability note: EVM HMACs are NOT portable between architectures with different endianness or word size. This is a Linux limitation that UmkaOS inherits for binary compatibility. A filesystem with EVM-signed xattrs from Linux/s390x produces different HMACs than Linux/x86-64 for the same file content.
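The non-portability claim is easy to see by serializing the `hmac_misc` fields at both `unsigned_long` widths. The helper below is illustrative (`hmac_misc_bytes` is a hypothetical name); it shows that the same file metadata yields different HMAC inputs on 32-bit vs 64-bit kernels, so the resulting HMACs cannot match:

```rust
/// Serialize the hmac_misc fields in host byte order, with `ino` at the
/// platform's unsigned_long width (4 or 8 bytes, parameterized here to
/// demonstrate why HMACs are not portable across word sizes).
fn hmac_misc_bytes(ino: u64, generation: u32, uid: u32, gid: u32, mode: u16,
                   ulong_width: usize) -> Vec<u8> {
    let mut out = Vec::new();
    // `unsigned_long` is 8 bytes on 64-bit kernels, 4 bytes on 32-bit ones.
    if ulong_width == 8 {
        out.extend_from_slice(&ino.to_ne_bytes());
    } else {
        out.extend_from_slice(&(ino as u32).to_ne_bytes());
    }
    out.extend_from_slice(&generation.to_ne_bytes()); // host byte order throughout
    out.extend_from_slice(&uid.to_ne_bytes());
    out.extend_from_slice(&gid.to_ne_bytes());
    out.extend_from_slice(&mode.to_ne_bytes());
    out
}

fn main() {
    let on_64bit = hmac_misc_bytes(1234, 7, 1000, 1000, 0o755, 8);
    let on_32bit = hmac_misc_bytes(1234, 7, 1000, 1000, 0o755, 4);
    // Same metadata, different HMAC input: 22 bytes vs 18 bytes.
    assert_eq!(on_64bit.len(), 8 + 4 + 4 + 4 + 2);
    assert_eq!(on_32bit.len(), 4 + 4 + 4 + 4 + 2);
    assert_ne!(on_64bit, on_32bit);
}
```

Endianness differences compound this: even at equal word size, `to_ne_bytes` produces different byte orders on big- and little-endian hosts.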

Key derivation — called once during early security init (after TPM, before first filesystem mount):

pub struct EvmKeyDerivation;

impl EvmKeyDerivation {
    /// Derive the 256-bit HMAC key. Must be called after `tpm_init()` and
    /// before any filesystem is mounted in appraisal mode.
    pub fn derive(tpm: &TpmDevice) -> Result<EvmHmacKey, EvmError> {
        // 1. Read EVM_SEED from TPM NV index 0x01500071.
        //    This index is provisioned at factory/install time with 32 random bytes.
        let seed = tpm.nv_read(0x01500071)?;

        // 2. Construct machine identity: DMI chassis UUID || TPM PCR[7] value.
        //    PCR[7] captures the Secure Boot state and is stable across boots on the same
        //    machine, but changes if Secure Boot keys are modified — a desirable property.
        //    Without Secure Boot, PCR[7] = 0x00...00; the derived EVM key is still
        //    deterministic (reproducible across reboots) but not policy-bound.
        let dmi_uuid  = acpi_dmi_uuid();
        let pcr7      = tpm.pcr_read(7)?;
        let machine_id = [dmi_uuid.as_bytes(), &pcr7].concat();

        // 3. HKDF-SHA3-256(IKM=seed, salt=machine_id, info=b"umkaos-evm-v1") → 32 bytes.
        let key_bytes = hkdf_sha3_256(&seed, &machine_id, b"umkaos-evm-v1");
        Ok(EvmHmacKey(key_bytes))
    }
}

/// The derived HMAC key. Zeroized on drop.
pub struct EvmHmacKey([u8; 32]);

impl Drop for EvmHmacKey {
    fn drop(&mut self) { self.0.zeroize(); }
}

/// Global EVM key, initialized once at boot.
static EVM_KEY: OnceLock<EvmHmacKey> = OnceLock::new();

The key is never written to disk and never exposed to userspace. The OnceLock<> wrapper guarantees it is initialized exactly once; subsequent calls to EVM_KEY.get() are lock-free reads.

Per-inode EVM lock — each inode with EVM-protected xattrs carries a spinlock that serializes HMAC recomputation with xattr writes:

/// Per-inode EVM state, stored in the inode's LSM blob ([Section 9.8](#linux-security-module-framework--lsm-blob-layout)).
pub struct EvmInodeState {
    /// Serializes HMAC computation + xattr update to prevent a concurrent
    /// xattr write from producing a stale HMAC. Per-inode (not global) to
    /// avoid contention across unrelated files.
    ///
    /// **Writer critical section** (xattr write path):
    ///   1. Acquire `evm_lock`
    ///   2. Read all protected xattrs (security.ima, security.selinux, etc.)
    ///   3. Compute HMAC-SHA3-256 over inode_number || xattr values
    ///   4. Write computed HMAC to `security.evm` xattr
    ///   5. Set `status.store(1, Release)` (valid)
    ///   6. Release `evm_lock`
    ///
    /// **Reader fast path** (file_open appraisal): When `status.load(Relaxed) == 1`,
    /// verification is complete — **no lock acquired**. The lock protects only
    /// initial verification (`status == 0`, first access after boot or after
    /// xattr write sets status back to 0) and re-verification. On steady-state
    /// file opens, the hot path is a single atomic load + branch (~1 cycle).
    /// The 256-core contention scenario (hundreds of threads opening the same
    /// popular file like `/usr/lib/libc.so.6`) only occurs on cold first access,
    /// not steady state.
    ///
    /// **Relaxed ordering is intentional**: The TOCTOU window between a concurrent
    /// xattr write (status→0 via Release) and a reader's Relaxed load is bounded
    /// by hardware cache coherence latency (~10-100 ns on all supported
    /// architectures). This matches Linux's IMA/EVM cached-status semantics —
    /// the check-and-open are not atomic regardless of ordering. The next file
    /// open after the writer completes will see status=0 and re-verify.
    /// Upgrading to Acquire adds no measurable security benefit.
    ///
    /// **Cold path** (first access, `status == 0`):
    ///   1. Acquire `evm_lock` (serializes concurrent first-accessors)
    ///   2. Re-check `status` (another thread may have computed while we waited)
    ///   3. If still 0: read xattrs, compute HMAC, compare, set status
    ///   4. Release `evm_lock`
    pub evm_lock: SpinLock<()>,
    /// Cached EVM verification status. Cleared (set to 0) on any protected
    /// xattr change by the writer under `evm_lock`.
    pub status: AtomicU8,  // 0 = unchecked, 1 = valid, 2 = invalid
}
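The fast-path/cold-path split can be sketched in userspace, with `Mutex` standing in for the kernel spinlock. `verify_hmac` is a stand-in that pretends the HMAC check succeeds; the point of the sketch is the double-checked re-read of `status` under the lock:

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicU8, Ordering};

const UNCHECKED: u8 = 0;
const VALID: u8 = 1;
const INVALID: u8 = 2;

/// Userspace sketch of EvmInodeState (hypothetical names).
struct InodeState {
    lock: Mutex<()>,
    status: AtomicU8,
}

/// Stand-in for the real HMAC recomputation and comparison.
fn verify_hmac() -> bool { true }

fn appraise(state: &InodeState) -> bool {
    // Fast path: steady-state opens are one Relaxed load + branch, no lock.
    if state.status.load(Ordering::Relaxed) == VALID {
        return true;
    }
    // Cold path: serialize concurrent first-accessors, then re-check —
    // another thread may have verified while we waited on the lock.
    let _guard = state.lock.lock().unwrap();
    if state.status.load(Ordering::Relaxed) == VALID {
        return true;
    }
    let ok = verify_hmac();
    state.status.store(if ok { VALID } else { INVALID }, Ordering::Release);
    ok
}

fn main() {
    let st = InodeState { lock: Mutex::new(()), status: AtomicU8::new(UNCHECKED) };
    assert!(appraise(&st));                            // cold path: verifies under lock
    assert_eq!(st.status.load(Ordering::Relaxed), VALID);
    assert!(appraise(&st));                            // fast path: no lock taken
}
```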

IMA interaction:

EVM wraps two IMA LSM hooks:

  1. inode_post_setxattr (IMA has just updated security.ima): EVM recomputes the HMAC over all protected xattrs (including the new security.ima) and writes the new security.evm. Both xattr writes are atomic with respect to each other under the inode's evm_lock — a concurrent reader cannot observe a state where security.ima is updated but security.evm is stale.

  2. inode_need_killpriv / file_open (EVM appraisal path): Before allowing a file to be opened or executed, EVM reads all protected xattrs, recomputes the HMAC, and compares it against security.evm. If the comparison fails (xattr tampered offline), the operation is denied with EACCES and an FMA security event is raised (FmaCategory::SecurityIntegrityViolation).

EVM operational mode — controlled by the evm_mode sysctl (/proc/sys/kernel/evm_mode):

| Mode | Behavior |
|------|----------|
| 0 | Disabled — no HMAC enforcement (provisioning/recovery mode) |
| 1 | Audit — log mismatches as FMA events, do not deny access |
| 2 | Enforce — deny access (EACCES) on HMAC mismatch ← production default |

The initramfs sets evm_mode=2 after deriving the EVM key and before pivoting to the root filesystem. This ensures all files accessed after pivot are HMAC-verified.

Sealed key mode (optional, high-security deployments): The TPM NV seed can be replaced with a TPM2_Seal'd blob requiring PCR[0,7] to match the verified boot chain. If the boot chain is tampered (different firmware, different Secure Boot key), PCR[7] changes, the TPM refuses to unseal the seed, EVM key derivation fails, and all EVM-protected files are inaccessible. The system boots (kernel image is verified by Secure Boot independently) but no protected file opens succeed — a hard security boundary against offline tampering even from the bootloader level.

HMAC algorithm: HMAC-SHA3-256. Note: the HMAC construction itself is not vulnerable to length extension even over SHA-2 (its double-keyed structure prevents it), so resistance to length extension is not the reason for choosing SHA3. SHA3's sponge construction also enables KMAC (NIST SP 800-185), but UmkaOS uses the standard HKDF construction (RFC 5869) with HMAC-SHA3-256 for maximum interoperability with existing KDF implementations. SHA3 is preferred because it is the NIST-recommended successor hash family (FIPS 202). Using the same algorithm family as IMA (§9.4.1) also simplifies the cryptographic dependency surface.

Performance impact — IMA measurement cost is dominated by SHA-256 hashing of file contents and scales linearly with file size. For typical small files (config files, shared libraries <1 MB): ~10-50 us. For large binaries (100 MB): ~200 ms at ~500 MB/s single-threaded SHA-256 throughput. The measurement cache (below) amortizes this cost — after the first measurement, re-measurement occurs only on content change, so the large-file cost is paid once. Mitigations:

  • Measurement cache: after the first measurement, the result is cached. Re-measurement occurs only if the file's content changes (detected via the VFS inode version counter, SB_I_VERSION — supported by ext4, Btrfs, XFS, and tmpfs since Linux 4.16+/Layton rework). For filesystems that do not implement SB_I_VERSION (e.g., some FUSE backends, legacy network filesystems), the cache falls back to mtime+ctime comparison, which is coarser-grained but avoids bypassing appraisal entirely.

  • Policy exemptions: transient paths (/tmp, /dev, /proc, /sys) are configurable exemptions — there is no point measuring pseudo-filesystems.

  • Async measurement: for large files, measurement can be performed asynchronously during non-security-critical reads (e.g., data files opened for read-only access). However, security-critical paths are always synchronous: any file opened for execution (mmap PROT_EXEC, dlopen, driver loading) is measured synchronously before the content is used. This prevents a TOCTOU vulnerability where file content could change between measurement and execution. The synchronous path holds an exclusive file lock (preventing concurrent writes) while computing the hash and comparing it against the signed reference. The async path is used only for audit/measurement-only policy rules where appraisal enforcement is not required.

Integration with dm-verity — For read-only filesystems (container images, system partitions), dm-verity (Section 9.3) provides block-level integrity verification. IMA is complementary: it covers read-write filesystems where dm-verity cannot apply (dm-verity requires immutable block devices). Together, they provide complete coverage: dm-verity for immutable system images, IMA for mutable data and user-installed software.

9.5.6.1.1 Asynchronous PCR Extension Queue

IMA measures files on open, which requires extending a TPM PCR (PCR 10 for file measurements). TPM 2.0 PCR extend commands take 5-50ms depending on TPM firmware. Doing this synchronously in the VFS open path would add 5-50ms to every cold-open (first open of a file not in the IMA cache). For a container startup loading 1000 files, this adds 5-50 seconds — clearly unacceptable.

Solution: Measurement log + async PCR extend

IMA splits measurement into two phases:

  1. Synchronous (on file open): Hash the file content and record the hash in the IMA measurement log (in-memory). This takes 1-10us (SHA-256 of cached pages). The measurement is pushed onto a global extend queue. File open proceeds immediately.

  2. Asynchronous (background thread): A dedicated ima-tpm-extend kernel thread drains the global queue, routes each measurement to the correct namespace's committed log and PCR bank, and batches hardware TPM extends at ~20-100ms intervals (or on explicit sync request).

Global extend queue — Multiple tasks across different namespaces concurrently measure files (multiple producers), and the single ima-tpm-extend thread is the sole consumer. A per-namespace ring would require SpscRing (single-producer), which is incorrect when multiple tasks share a namespace. Instead, all namespaces enqueue into one global BoundedMpmcRing:

/// Global queue: all namespaces enqueue measurements here.
/// The ima-tpm-extend thread is the sole consumer.
static IMA_EXTEND_QUEUE: BoundedMpmcRing<ImaMeasurement, 4096> =
    BoundedMpmcRing::new();

/// Condvar signalled on every push to IMA_EXTEND_QUEUE.
static IMA_EXTEND_NEEDED: Condvar = Condvar::new();

/// Per-namespace measurement log stores only committed (already-extended) entries.
pub struct ImaMeasurementLog {
    /// Measurements already extended into TPM/virtual PCRs (for audit export).
    committed: RcuVec<ImaMeasurement>,
    /// Total measurements since boot (for log sequence numbers).
    total: AtomicU64,
}

pub struct ImaMeasurement {
    /// Hash algorithm used for this measurement.
    pub hash_alg: HashAlgorithm,
    /// File content hash (up to 64 bytes for SHA-512; actual length = hash_len).
    pub file_hash: [u8; 64],
    /// Actual digest length in bytes (32 for SHA-256, 48 for SHA-384, 64 for SHA-512).
    pub hash_len: u8,
    /// PCR to extend into (default: 10).
    pub pcr: u8,
    /// Hash of (hash_algo || file_hash || filename).
    pub template_hash: [u8; 64],
    /// Timestamp of the measurement (nanoseconds since boot).
    pub timestamp: u64,
    /// Monotonic sequence number within this log.
    pub sequence: u64,
    /// IMA namespace that originated this measurement. Used by the
    /// extend thread to route hardware PCR extends (init_ns only)
    /// vs virtual PCR extends (all namespaces).
    ///
    /// `Arc<ImaNamespace>` guarantees the namespace remains alive while
    /// the measurement is in the extend queue. A namespace-ID alternative
    /// would require error handling for "namespace destroyed while queued."
    /// The 2 atomic ops per measurement (clone at enqueue, drop at consume)
    /// are negligible at typical IMA measurement rates (~100/sec).
    pub ima_ns: Arc<ImaNamespace>,
}

ima-tpm-extend thread:

fn ima_tpm_extend_thread() {
    loop {
        // Drain up to 64 measurements from the global queue per TPM session.
        let batch = IMA_EXTEND_QUEUE.drain_batch(64);
        if batch.is_empty() {
            IMA_EXTEND_NEEDED.wait_timeout(Duration::from_millis(100));
            continue;
        }
        // Open a single TPM session and extend all pending measurements.
        let session = tpm_open_session();
        for m in &batch {
            // IMA converts its internal HashAlgorithm enum to the TCG TPM_ALG_ID
            // via to_tpm_alg_id() ([Section 10.1](10-security-extensions.md#kernel-crypto-api)):
            //   HashAlgorithm::Sha256 (0x04) → TPM_ALG_SHA256 (0x000B)
            //   HashAlgorithm::Sha384 (0x05) → TPM_ALG_SHA384 (0x000C)
            //   HashAlgorithm::Sha512 (0x06) → TPM_ALG_SHA512 (0x000D)
            let tpm_alg = to_tpm_alg_id(m.hash_alg);

            // Namespace routing: hardware TPM PCR 10 extended only by init_ns.
            // Container namespaces extend virtual PCRs only.
            if m.ima_ns.is_init_ns() {
                tpm_pcr_extend(session, m.pcr, tpm_alg, &m.template_hash);
            }
            // All namespaces (including init_ns) extend their virtual PCR bank.
            m.ima_ns.virtual_pcrs[m.pcr as usize].extend(&m.template_hash);
        }
        tpm_close_session(session);
        // Route each measurement to the originating namespace's committed log.
        for m in batch {
            // Clone the Arc first: pushing `m` moves it, so `m.ima_ns`
            // cannot be borrowed in the same expression.
            let ns = m.ima_ns.clone();
            ns.measurement_log.lock().committed.push(m);
        }
    }
}

Consistency guarantee: Applications that require all measurements to be committed before proceeding can call ima_sync() (via ioctl(IMA_SYNC) or syncfs()), which blocks until ima-tpm-extend has processed all pending measurements. systemd uses this before pivoting to the real root to ensure boot measurements are finalized.

Latency impact: File open latency = SHA-256 hash time (~1-10us for cached pages) + measurement log append (~100ns). TPM extend cost is moved off the open path entirely. Container startup loading 1000 files adds ~1-10ms total (hashing), not 5-50 seconds.

9.5.6.1.2 Queue Bounds and Overflow Policy

The global extend queue (IMA_EXTEND_QUEUE) is bounded at 4096 entries. This provides ~50 seconds of buffering at 80 PCR extends/second (typical for a busy IMA workload during container startup or large package installation). The ima-tpm-extend thread drains the queue continuously, so under normal operation the queue stays well below capacity.

Overflow behavior — When the queue is full, the policy depends on the extend type:

  • IMA file measurement (runtime): The measurement is dropped — the file hash is not recorded in the measurement log and the corresponding PCR extend does not occur. The ima_measurements_dropped counter (AtomicU64) is incremented. A KERN_WARNING message is logged at most once per 60 seconds (rate-limited via a last_drop_warn: AtomicU64 timestamp to avoid log flooding). The file access proceeds normally — availability takes priority over integrity for already-running processes. The drop is visible to remote attestation verifiers via the counter and the gap in measurement log sequence numbers.

  • Boot-time PCR extends (security-critical): The caller blocks until queue space is available. Boot-time extends (kernel image, initramfs, Tier 1 drivers into PCR 9/10/12) are rare (typically < 20 total) and occur before the system accepts user workloads, so blocking is bounded by TPM throughput (~5-50ms per extend) and does not affect runtime latency.
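
The two overflow policies above reduce to a single admission decision. The following is a simplified sketch, not the kernel implementation: `ExtendQueueModel` and `EnqueueOutcome` are illustrative stand-ins for the real `TpmExtendQueue` (Section 9.5.6.1.4), with a plain `VecDeque` and a plain counter in place of the lock-free structures.

```rust
use std::collections::VecDeque;

const MAX_TPM_QUEUE_DEPTH: usize = 4096;

/// Outcome of submitting an extend op (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum EnqueueOutcome {
    /// Space available: queued for the async extend thread.
    Queued,
    /// Queue full, runtime IMA measurement: dropped, drop counter bumped.
    Dropped,
    /// Queue full, boot-critical extend: the caller must block for space.
    MustBlock,
}

pub struct ExtendQueueModel {
    pending: VecDeque<u64>, // sequence numbers only, for illustration
    pub drops: u64,         // models ima_measurements_dropped
}

impl ExtendQueueModel {
    pub fn new() -> Self {
        Self { pending: VecDeque::new(), drops: 0 }
    }

    /// Admission decision: drop runtime measurements on overflow
    /// (availability over integrity), make boot-critical callers block.
    pub fn submit(&mut self, sequence: u64, boot_critical: bool) -> EnqueueOutcome {
        if self.pending.len() < MAX_TPM_QUEUE_DEPTH {
            self.pending.push_back(sequence);
            EnqueueOutcome::Queued
        } else if boot_critical {
            EnqueueOutcome::MustBlock
        } else {
            self.drops += 1;
            EnqueueOutcome::Dropped
        }
    }
}
```

The asymmetry is deliberate: a dropped runtime measurement is recoverable evidence (visible via the counter and sequence gap), while a skipped boot-time extend would silently break the measured boot chain.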

9.5.6.1.3 TPM Timeout Handling

Each queued extend operation has a 5-second deadline. If the TPM does not complete the TPM2_PCR_Extend command within 5 seconds (indicating a hung or extremely slow TPM):

  1. The tpm_extend_timeouts counter (AtomicU64) is incremented.
  2. A KERN_ERR message is logged with the TPM device name, PCR index, and operation sequence number.
  3. For IMA runtime extends: The measurement is dropped (same as queue overflow). The ima_measurements_dropped counter is incremented.
  4. For boot-time PCR extends: The operation is retried once with a fresh 5-second deadline. If the retry also times out, the TPM is declared degraded: tpm_degraded is set to true, and all further PCR extend operations are skipped. Existing measurements already committed to the TPM remain valid — degraded mode only stops new extends. A KERN_CRIT message is logged indicating TPM degradation. The kernel continues in software-only measurement mode: the IMA measurement log is still maintained (for local audit), but TPM-backed remote attestation is no longer available until the next reboot.
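
The timeout policy above can be summarized as a small decision function. This is a hedged sketch; `TimeoutAction` and `on_extend_timeout` are illustrative names, not the actual kernel API:

```rust
/// Resolution for a TPM extend that missed its 5-second deadline
/// (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum TimeoutAction {
    /// Runtime IMA extend: drop the measurement, bump the drop counter.
    Drop,
    /// Boot-critical extend, first timeout: retry once with a fresh deadline.
    Retry { new_deadline_ns: u64 },
    /// Boot-critical extend, second timeout: declare the TPM degraded and
    /// skip all further extends until reboot (software-only measurement mode).
    Degrade,
}

const EXTEND_DEADLINE_NS: u64 = 5_000_000_000; // 5 seconds

pub fn on_extend_timeout(
    boot_critical: bool,
    already_retried: bool,
    now_ns: u64,
) -> TimeoutAction {
    if !boot_critical {
        TimeoutAction::Drop
    } else if !already_retried {
        TimeoutAction::Retry { new_deadline_ns: now_ns + EXTEND_DEADLINE_NS }
    } else {
        TimeoutAction::Degrade
    }
}
```
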

9.5.6.1.4 TpmExtendQueue

The low-level queue that backs both ImaMeasurementLog.pending and the timeout tracking:

/// Maximum pending TPM PCR extend operations.
const MAX_TPM_QUEUE_DEPTH: usize = 4096;

/// A single queued PCR extend operation.
///
/// Supports all hash algorithms (SHA-256/384/512). The `hash_len` field
/// indicates the actual number of valid bytes in `template_hash`.
pub struct TpmExtendOp {
    /// PCR index to extend.
    pub pcr: u8,
    /// Actual hash length in bytes (32 for SHA-256, 48 for SHA-384, 64 for SHA-512).
    pub hash_len: u8,
    /// Template hash to extend into the PCR.
    /// Sized for SHA-512 (64 bytes); actual length in `hash_len`.
    pub template_hash: [u8; 64],
    /// Monotonic sequence number (for gap detection on drop).
    pub sequence: u64,
    /// Deadline: absolute time (nanoseconds since boot) by which the
    /// TPM must complete this extend. Default: submission time + 5s.
    pub deadline_ns: u64,
    /// Boot-critical flag: 0 = false, 1 = true.
    /// (u8, not bool, because TpmExtendOp is enqueued in ArrayDeque
    /// and may cross compilation boundaries in debug tooling.)
    pub boot_critical: u8,
}

/// Bounded async queue for TPM PCR extend operations.
///
/// Provides backpressure (blocking) for boot-critical extends and
/// drop-on-overflow for runtime IMA extends. Tracks timeouts and
/// degraded state for TPM health monitoring.
pub struct TpmExtendQueue {
    /// Pending extend operations, bounded at MAX_TPM_QUEUE_DEPTH.
    queue: ArrayDeque<TpmExtendOp, MAX_TPM_QUEUE_DEPTH>,
    /// Counts dropped extends due to queue full or timeout.
    pub drops: AtomicU64,
    /// Counts extends that timed out waiting for TPM response.
    pub timeouts: AtomicU64,
    /// Set when TPM is non-responsive; disables further extends.
    pub degraded: AtomicBool,
    /// Last time a drop warning was logged (nanos since boot).
    /// Used to rate-limit KERN_WARNING to once per 60 seconds.
    last_drop_warn_ns: AtomicU64,
}
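
The `last_drop_warn_ns` field drives a simple rate limiter for the KERN_WARNING drop message. A sketch of that check, using a plain `&mut u64` in place of the `AtomicU64` and treating 0 as "never warned" (both simplifications are assumptions of this sketch):

```rust
const DROP_WARN_INTERVAL_NS: u64 = 60_000_000_000; // 60 seconds

/// Returns true if a drop warning should be logged now, updating the
/// last-warned timestamp. Rate-limits warnings to once per 60 seconds
/// to avoid log flooding during sustained overflow.
pub fn should_log_drop_warning(last_warn_ns: &mut u64, now_ns: u64) -> bool {
    if *last_warn_ns == 0
        || now_ns.saturating_sub(*last_warn_ns) >= DROP_WARN_INTERVAL_NS
    {
        *last_warn_ns = now_ns;
        true
    } else {
        false
    }
}
```
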

9.6 Post-Quantum Cryptography

9.6.1 Why This Cannot Wait

NIST finalized post-quantum cryptography standards in 2024:

  • ML-KEM (Kyber): key encapsulation (replaces ECDH/RSA key exchange)
  • ML-DSA (Dilithium): digital signatures (replaces RSA/Ed25519)
  • SLH-DSA (SPHINCS+): hash-based signatures (stateless, conservative)

"Harvest now, decrypt later" attacks mean data encrypted today with classical algorithms is vulnerable to future quantum computers. Migration timelines are 5-10 years — starting now is not optional.

UmkaOS uses cryptography in:

| Component | Current Algorithm | PQC Replacement | Section |
|---|---|---|---|
| Verified boot (kernel signature) | RSA-4096 / Ed25519 | ML-DSA-65 or SLH-DSA-SHAKE-128f | Section 9.3 |
| Driver signatures | Ed25519 | ML-DSA-65 | Section 9.3 |
| dm-verity Merkle tree | SHA-256 | SHA-256 (quantum-safe as-is) | Section 9.3 |
| Distributed capabilities | Ed25519 | ML-DSA-44 | Section 5.7 |
| Cluster node authentication | X25519 key exchange + Ed25519 auth | ML-KEM-768 + ML-DSA-65 | Section 5.2 |
| TPM PCR measurements | SHA-256 | SHA-256 (quantum-safe) | Section 9.4 |

9.6.2 Boot Stub Cryptographic Algorithm Subset

Nucleus executes before the kernel crypto API (Section 10.1) is initialized. During Phase 0.8 (Section 2.21), Nucleus must verify the Evolvable image signature using only code baked into Nucleus itself. The boot stub supports exactly three algorithms — the minimum required to perform LMS signature verification:

| Algorithm | Purpose | Specification | Code Size |
|---|---|---|---|
| SHAKE256 | Hash the Evolvable image content to produce a message digest; also used as the internal hash function for LMS Winternitz chains and Merkle path computation | NIST FIPS 202 (SHA-3), Sec. 6.2; uses Keccak-f[1600] permutation with padding byte 0x1F | Shares Keccak-f[1600] with SHA3-256 (~1 KB); SHAKE256-specific XOF output mode adds ~200 bytes |
| LMS (verification only) | Verify the Evolvable image signature against the baked-in LMS public key (NIST SP 800-208, Leighton-Micali Signature) | NIST SP 800-208, Sec. 4-5; W=4 Winternitz parameter (~530 hash calls per verification), H=15 Merkle tree height (~15 hash calls for authentication path) | ~1-2 KB (Winternitz chain completion + Merkle path walk) |
| SHA3-256 | Hash the Evolvable image header during magic/geometry validation (pre-signature step) | NIST FIPS 202, Sec. 6.1; Keccak-f[1600] with padding byte 0x06 and 256-bit output | Shares Keccak-f[1600] permutation with SHAKE256; ~0.5 KB additional |

Total boot stub crypto code: ~1-3 KB (dominated by the shared Keccak-f[1600] permutation, which is ~800 bytes of straight-line code with no branches, no allocation, and no external dependencies).

What is NOT in the boot stub: ML-DSA, ML-KEM, SLH-DSA, Ed25519, RSA, AES, and all other algorithms. These are loaded as part of Evolvable's crypto core and the kernel crypto API. The boot stub is deliberately minimal — every byte of Nucleus code is a formal verification target (Section 24.4).

Why LMS + SHAKE256: LMS is a stateful hash-based signature scheme (NIST SP 800-208) — it relies only on the security of the hash function (SHAKE256/Keccak), not on number-theoretic assumptions (lattice, factoring, discrete log). Statefulness is a signing concern (each OTS leaf key must be used at most once); for Nucleus, only verification is performed, so no OTS state tracking is needed. This makes it the most conservative choice for the boot trust root: even if lattice-based schemes (ML-DSA) are broken by future cryptanalysis, the boot verification chain remains intact. LMS was selected over SLH-DSA for Nucleus because LMS verification is simpler (~530 hash calls vs. SLH-DSA's multi-tree traversal) and the code is smaller and more amenable to formal verification. The LmsShake256 algorithm ID (0x0107 in SignatureAlgorithm) is reserved exclusively for Nucleus verification — no other kernel subsystem uses LMS.

9.6.3 Design: Algorithm-Agile Crypto Abstraction

PQC algorithm specifications follow NIST FIPS publications verbatim:

  • ML-KEM-768 (key encapsulation): FIPS 203, "Module-Lattice-Based Key-Encapsulation Mechanism Standard," NIST, August 2024. Parameter set: ML-KEM-768 (security level 3, 1184-byte encapsulation key).

  • ML-DSA-44/65/87 (digital signatures): FIPS 204, "Module-Lattice-Based Digital Signature Standard," NIST, August 2024.

  • ML-DSA-44: distributed capabilities (Section 5.7), short-lived tokens (security level 2, 1312-byte public key, 2420-byte signature)
  • ML-DSA-65: kernel and driver signing (security level 3, 1952-byte public key, 3309-byte signature). A single algorithm for both simplifies key management — see Section 9.3
  • ML-DSA-87: long-term identity keys (security level 5, 2592-byte public key)

  • SLH-DSA-SHAKE-128f (stateless hash-based signatures): FIPS 205, "Stateless Hash-Based Digital Signature Standard," NIST, August 2024. Used for: root of trust, boot verification (stateless = no state synchronization needed across reboots).

No custom parameter variations. Implementation MUST pass the NIST Known-Answer Tests (KAT) vectors from the respective FIPS publications.

The critical design requirement: never hardcode key/signature sizes. PQC signatures are much larger than classical:

| Algorithm | Signature Size | Public Key Size |
|---|---|---|
| Ed25519 (current) | 64 bytes | 32 bytes |
| ML-DSA-44 (PQC) | 2,420 bytes | 1,312 bytes |
| ML-DSA-65 (PQC) | 3,309 bytes | 1,952 bytes |
| SLH-DSA-SHAKE-128f (PQC) | 17,088 bytes | 32 bytes |

Algorithm selection criteria:

  • ML-DSA (default): Lattice-based. Fast signing/verification, moderate signature size (~2.4-3.3KB). Use for: driver signatures, capabilities, cluster authentication. This is the standard choice unless there is a specific reason to avoid lattice assumptions.
  • SLH-DSA (paranoid mode): Hash-based (stateless). Conservative — survives even if lattice mathematical assumptions are broken by future cryptanalysis. Huge signatures (~17KB). Use for: kernel image signatures (verified once at boot, size doesn't matter). Configurable: umka.crypto.boot_algorithm=slh-dsa-shake-128f overrides default ML-DSA-65.
  • Hybrid mode: Both Ed25519 + ML-DSA. Use during transition period (2025-2035). Both must verify. Provides defense if either algorithm family is broken.

If capability tokens (Section 5.7) have a fixed 64-byte signature field, PQC won't fit. The signature field must be variable-length or large enough for the biggest PQC algorithm.

// umka-core/src/crypto/mod.rs

/// Signature algorithm identifier.
/// Used for verified boot, driver signatures, capabilities, and
/// any context where digital signatures are created or verified.
/// Signatures and KEMs are separate enums because they serve
/// fundamentally different purposes: signatures prove authenticity
/// (sign/verify over a message), while KEMs establish shared secrets
/// (encapsulate/decapsulate). They have different field requirements
/// (signature data vs. ciphertext/shared secret) and must not be
/// conflated in data structures.
#[repr(u32)]
pub enum SignatureAlgorithm {
    // === Classical (pre-quantum) — 0x0103–0x0104 ===
    // Matches BOOT_ALGO_MAP in [Section 9.3](#verified-boot-chain).
    Ed25519             = 0x0103,   // RFC 8032
    Rsa4096Pss          = 0x0104,   // PKCS#1 v2.1, SHA-256

    // === Post-quantum (NIST standards) — 0x0100–0x0102, 0x0105–0x0107 ===
    MlDsa44             = 0x0100,   // FIPS 204, security level 2
    MlDsa65             = 0x0101,   // FIPS 204, security level 3
    MlDsa87             = 0x0102,   // FIPS 204, security level 5
    SlhDsa128f          = 0x0105,   // FIPS 205, fast variant
    SlhDsa128s          = 0x0106,   // FIPS 205, small variant
    LmsShake256         = 0x0107,   // NIST SP 800-208 (Nucleus verification only)

    // === Hybrid / composite — 0x0200 range ===
    Ed25519PlusMlDsa65  = 0x0200,   // Ed25519 + ML-DSA-65 (kernel image default)
}

impl SignatureAlgorithm {
    /// Returns the kernel crypto API algorithm name for this signature algorithm.
    /// Used by the akcipher subsystem to look up the algorithm implementation.
    pub fn akcipher_name(&self) -> &'static str {
        match self {
            Self::Ed25519            => "ed25519",
            Self::Rsa4096Pss         => "rsa-4096-pss",
            Self::MlDsa44            => "ml-dsa-44",
            Self::MlDsa65            => "ml-dsa-65",
            Self::MlDsa87            => "ml-dsa-87",
            Self::SlhDsa128f         => "slh-dsa-shake-128f",
            Self::SlhDsa128s         => "slh-dsa-sha2-128s",
            Self::LmsShake256        => "lms-shake256",
            Self::Ed25519PlusMlDsa65 => "hybrid-ed25519-ml-dsa-65",
        }
    }
}

/// Key Encapsulation Mechanism (KEM) algorithm identifier.
/// Used for key exchange in cluster node authentication (Section 5.2)
/// and any context where shared secrets are established.
/// Separate from SignatureAlgorithm because KEMs produce
/// (ciphertext, shared_secret) pairs, not signatures.
#[repr(u32)]
pub enum KemAlgorithm {
    // === Post-quantum (NIST standards) — 0x0300 range ===
    // Matches BOOT_ALGO_MAP KEM entries in [Section 9.3](#verified-boot-chain).
    MlKem768            = 0x0300,   // FIPS 203, security level 3
    MlKem1024           = 0x0301,   // FIPS 203, security level 5
}
// Note: Hybrid signature algorithm IDs start at 0x0200, which exceeds u8 range.
// Network-portable structures (e.g., DistributedCapability in Section 5.8)
// that carry a sig_algorithm field MUST use at least u16 (or the full u32
// encoding) to represent the complete SignatureAlgorithm ID space.
// Cross-reference: DistributedCapability.sig_algorithm in Section 5.8 (05-distributed.md
// Section 5.8.2) IS already defined as u16 — this requirement is satisfied.

/// Variable-length signature.
/// Avoids hardcoding signature size in data structures.
pub struct Signature {
    /// Algorithm that produced this signature.
    pub algorithm: SignatureAlgorithm,
    /// Signature bytes (length depends on algorithm).
    pub data: SignatureData,
}

/// Signature data — inline for classical (hot path), boxed for PQC (cold path).
pub enum SignatureData {
    /// Ed25519: 64 bytes. Fits inline. No allocation.
    /// Used in capability verification cache (hot-path lookups).
    Inline64([u8; 64]),
    /// PQC signatures: heap-allocated via Box<[u8]>.
    /// PQC signatures are 2.4KB–17KB. Heap allocation is acceptable
    /// because PQC signing/verification occurs only on cold paths
    /// (boot, driver load, capability creation, cluster join), all of
    /// which run after the kernel slab allocator is initialized
    /// (post-Phase 2 boot, Section 4.3). During early boot (before
    /// the heap is available), signature verification uses the
    /// fixed-size buffers in KernelSignature and DriverSignature
    /// structs directly, bypassing this type entirely.
    Heap(Box<[u8]>),
}

// Note on cache pressure: cached PQC capabilities are ~2.5KB per entry
// vs 128 bytes for Ed25519. For 1,000 active capabilities: 2.5MB vs 128KB.
// Both fit in L3 cache on any modern system. If the working set grows
// beyond this, LRU eviction of cold capabilities mitigates pressure.
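
A consequence of the variable-length design is that buffer sizing must be driven by the algorithm ID at runtime. The following is a hedged sketch of such a lookup: `max_signature_len` is a hypothetical helper, and a real implementation would derive sizes from the crypto API registration rather than a hardcoded match. The sizes come from the table above; algorithms whose sizes this chapter does not list return `None`.

```rust
/// Maximum signature size in bytes for each algorithm ID, per the
/// size table in Section 9.6.3. Hypothetical helper for illustration.
pub fn max_signature_len(algorithm_id: u32) -> Option<usize> {
    Some(match algorithm_id {
        0x0103 => 64,     // Ed25519
        0x0104 => 512,    // RSA-4096-PSS (4096-bit modulus)
        0x0100 => 2_420,  // ML-DSA-44 (FIPS 204)
        0x0101 => 3_309,  // ML-DSA-65 (FIPS 204)
        0x0105 => 17_088, // SLH-DSA-SHAKE-128f (FIPS 205)
        // Hybrid Ed25519 + ML-DSA-65: u16 length prefix plus both
        // components (see the wire format in Section 9.6.5).
        0x0200 => 2 + 64 + 3_309,
        _ => return None, // unknown or size not specified here
    })
}
```
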

9.6.4 Impact on Distributed Capabilities

Distributed capabilities (Section 5.7) carry a signature for network verification. With PQC:

Current DistributedCapability:
  Base fields:     ~64 bytes
  Ed25519 signature: 64 bytes
  Total: ~128 bytes per capability (classical-only mode)

  With hybrid Ed25519 + ML-DSA-65 credential (production default):
  ~3.6 KB (including pre-allocated PQC signature buffer — see Section 5.8.2
  for the canonical struct definition with full field breakdown)

With ML-DSA-44:
  Base fields:     ~64 bytes
  ML-DSA-44 signature: 2,420 bytes
  Total: ~2,484 bytes per capability

Impact:
  - Capability token is ~20x larger.
  - RDMA bandwidth for capability exchange: negligible (capabilities
    are exchanged once and cached, not per-operation).
  - Verification time: Ed25519 verify is ~25-50 μs;
    ML-DSA-44 verify is ~50-120 μs (cached after first verification).
    ML-DSA-44 is typically 2-3x slower than Ed25519, but both are
    negligible on cold paths (driver load, cluster join).
  - Memory for cached capabilities: 2.5KB vs 128 bytes per entry.
    For a cluster with 1,000 active capabilities: 2.5 MB vs 128 KB.
    Negligible on any modern system.

9.6.5 Hybrid Mode (Transition Period)

During the transition period (2025-2035), sign with BOTH classical and PQC algorithms:

Kernel image signature:
  Ed25519 signature: 64 bytes    (verifiable by old bootloaders)
  ML-DSA-65 signature: 3,309 bytes (verifiable by PQC-aware bootloaders)

  Verification responsibility:
  - Pre-UmkaOS bootloaders (UEFI Secure Boot, GRUB) do NOT understand
    UmkaOS's KernelSignature format. Backward compatibility requires
    that the kernel image also carry a standard signature in the
    format the bootloader expects (e.g., Authenticode PE signature
    for UEFI Secure Boot, GPG detached signature for GRUB). The
    UmkaOS KernelSignature (containing hybrid Ed25519 + ML-DSA-65)
    is appended separately and is invisible to these bootloaders.
  - UmkaOS-aware bootloaders that predate PQC support verify only
    the classical component of the hybrid signature.

    **Hybrid signature wire format**: Hybrid signatures use a
    length-prefixed layout, not a fixed-offset convention. Format:
    `[classical_len: u16 | classical_sig: [u8; classical_len] |
    pqc_sig: [u8; remaining]]`. The verifier reads `classical_len`
    to split the signature buffer. For Ed25519 + ML-DSA-65 hybrids,
    `classical_len = 64`. For RSA-4096-PSS + ML-DSA-65 hybrids,
    `classical_len = 512`. This format accommodates any classical
    algorithm size without breaking the wire protocol.

    An older UmkaOS-aware verifier that does not recognize algorithm
    ID 0x0200 reads `classical_len` from the first two bytes of the
    signature buffer, extracts `classical_sig[0..classical_len]`,
    and verifies that component only, ignoring the trailing PQC
    signature data. To enable this, older UmkaOS-aware verifiers MUST
    treat any algorithm ID in the range 0x0200-0x02FF as "hybrid:
    extract classical_len from bytes 0..2, verify classical_sig only,
    skip remainder." This forward-compatibility rule ensures smooth
    rollout even when some nodes in a cluster upgrade before others.
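
The length-prefixed layout can be parsed in a few lines. A sketch (assuming little-endian encoding of `classical_len`, which the text above does not pin down):

```rust
/// Split a hybrid signature buffer into (classical, pqc) components per
/// the length-prefixed wire format. Returns None on malformed input.
pub fn split_hybrid_signature(sig: &[u8]) -> Option<(&[u8], &[u8])> {
    // First two bytes: classical_len as a u16 (little-endian assumed here).
    let len_bytes: [u8; 2] = sig.get(0..2)?.try_into().ok()?;
    let classical_len = u16::from_le_bytes(len_bytes) as usize;
    let rest = sig.get(2..)?;
    if classical_len > rest.len() {
        return None; // prefix claims more bytes than are present
    }
    Some((&rest[..classical_len], &rest[classical_len..]))
}
```

An older verifier applying the 0x0200-0x02FF fallback rule would call this, verify only the first component, and ignore the second.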

Ring Buffer Impact: KABI ring messages (Section 11.7) do not carry signatures — capabilities are verified at connect() time, not per-message. The SignatureData::Heap variant handles large PQC signatures without affecting ring buffer sizing. No ring buffer resize is needed.

Side-Channel Resistance:

All PQC implementations in the kernel MUST be constant-time. Variable-time operations leak secret key material through timing and cache side channels. CI enforcement: ctgrind-style testing (Valgrind memcheck with secret memory marked as undefined) runs on every commit touching crypto code. Non-constant-time code paths are rejected.

Benchmark Reference:

Approximate times on a modern x86 core (single-threaded, AVX2-enabled). References: ed25519-dalek 4.x, pqcrypto-dilithium (AVX2), liboqs 0.10+:

  • Ed25519: ~25-50 μs (verify, depending on hardware and key schedule caching), ~15-20 μs (sign)
  • ML-DSA-44: ~50-120 μs (verify), ~80-200 μs (sign)
  • ML-DSA-65: ~100-200 μs (verify), ~250-460 μs (sign)
  • ML-KEM-768: ~90 μs (encapsulate), ~80 μs (decapsulate)

Ranges reflect implementation quality (optimized AVX2 vs. reference C) and CPU microarchitecture. ML-DSA-44 verification is typically 2-3x slower than Ed25519 in absolute terms, though both are fast enough to be negligible on cold-path operations (boot, driver load, cluster join). The performance difference is irrelevant for operations that occur at most a few times per second.

9.6.6 Performance Impact

Zero hot-path impact. Cryptographic operations occur only on cold paths:

  • Boot: once (kernel signature verification)
  • Driver load: once per driver (~50 drivers at boot)
  • Cluster join: once per node
  • Capability creation: infrequent (not per-operation)

Ed25519 verification (~25-50 μs) and ML-DSA-44 verification (~50-120 μs) are both fast on modern hardware; ML-DSA-65 (~100-200 μs) is somewhat slower. All are fast enough for cold-path operations. ML-KEM encapsulation is faster than ECDH. The performance difference is negligible for operations that occur once per boot, driver load, or cluster join.

9.6.7 PQC Key Management

PQC private keys must be stored securely. TPM 2.0 PQC support is emerging: TCG TPM 2.0 Library Specification 2.0 (V184, March 2025) is the latest published revision. V185 RC4 (December 2025) adds PQC algorithm profiles and completed public review on February 10, 2026. Until V185 is formally published, it should be treated as near-final draft, not as a normative reference. Initial PQC-capable TPM hardware samples were announced in late 2025 (e.g., SEALSQ). Production availability expected 2026-2027 (optimistic — historical TPM spec-to-silicon timelines are 18–36 months; plan for 2028 as fallback).

Key storage options (in priority order):

  1. TPM 2.0 with PQC profile (near-term): TPM stores ML-DSA/ML-KEM private keys in hardware. Key never leaves TPM. This is the ideal solution. Status: V184 published; V185 (PQC algorithm profiles) near-final after its RC4 public review, as noted above. PQC-capable TPM hardware samples announced (SEALSQ, late 2025); production availability expected 2026-2027, with 2028 as the conservative fallback given historical 18-36 month spec-to-silicon timelines.

  2. HSM / external key store: Enterprise deployments use network HSMs (Thales Luna, AWS CloudHSM) that already support ML-DSA/ML-KEM. Kernel communicates via PKCS#11 or vendor-specific interface. Key never in kernel memory.

  3. UEFI Secure Variable storage: Private key stored in a custom UEFI authenticated variable (NOT db/dbx — those store public keys and certificates for signature verification). Protected by Secure Boot chain: only authenticated updates can modify the variable. Key loaded into kernel memory at boot, used for signing, then zeroed. Vulnerable to cold-boot attacks. Acceptable for non-TEE systems.

  4. Software key in kernel memory (fallback): IMPORTANT: Only the VERIFICATION (public) key is embedded in kernel .rodata. This is standard practice (Linux embeds its module signing public key the same way). The public key allows the kernel to verify signatures but cannot create them. Exposure of the public key via a memory read vulnerability does NOT compromise the signing process — an attacker who reads the public key can verify signatures (which is public information) but cannot forge them. The PRIVATE (signing) key must NEVER be present in the running kernel image. It exists only in the build environment (offline signing server, HSM, or developer workstation) and is used during the build process to produce signatures that are then appended to kernel/driver images. For cluster identity keys (which require a private key at runtime for mutual authentication), the private key is loaded from disk at boot and held in kernel memory. Protected by:

     • Kernel ASLR (address space randomization)
     • Confidential computing (if running in TEE, key is hardware-encrypted)
     • Memory zeroing on key rotation

     This is the weakest option for runtime private keys. Acceptable for development and non-critical deployments.

Root of trust for verified boot (Section 9.3): Phase 1: Classical Ed25519 public key in UEFI db (existing infrastructure). Phase 2: Hybrid Ed25519 + ML-DSA-65 public keys in UEFI db. Phase 3: ML-DSA-65 key in TPM 2.0 PQC profile (when available). Each phase is a strict superset — old keys continue to work.


9.7 Confidential Computing

9.7.1 Why This Cannot Wait

AMD SEV-SNP, Intel TDX, and ARM CCA are shipping today. Every major cloud provider offers confidential VMs. The fundamental shift: the kernel becomes untrusted by its own workloads.

Traditional trust model (current design):
  Hardware → Kernel → Processes
  The kernel can read all process memory.
  The kernel is the root of trust.

Confidential computing trust model:
  Hardware → TEE (Trusted Execution Environment) → Workload
  The kernel CANNOT read TEE memory. Hardware enforces encryption.
  The kernel is OUTSIDE the trust boundary.
  The workload trusts only hardware + its own code.

This directly conflicts with several kernel subsystems if not designed in:

| Subsystem | Conflict | Resolution |
|---|---|---|
| Page migration (HMM, Section 22.4) | Kernel can't read encrypted pages to copy them | Hardware assists migration (SEV-SNP page copy with re-encryption) |
| Memory compression (Section 4.12) | Kernel can't compress what it can't read | Skip compression for confidential pages (metadata-only eviction) |
| DSM (Section 6.2) | Can't RDMA encrypted pages — remote node can't decrypt | TEE-to-TEE RDMA with shared key negotiation (Phase 5+ research scope: shared-key negotiation for DSM pages within TEE-protected VMs requires attestation-authenticated key exchange between TEE enclaves on different physical hosts; deferred to Phase 5+ pending standardization of inter-enclave communication protocols such as ARM CCA Realm-to-Realm and Intel TDX Migration TD) |
| In-kernel inference (Section 22.6) | Can't observe page content of encrypted memory; access patterns (page faults, accessed/dirty bits) remain fully observable (Section 9.7) | Train models on access patterns and scheduling metadata only; content-based features unavailable for TEE workloads (~10-20% quality reduction, Section 9.7) |
| Crash recovery (Section 11.9) | Can't inspect process state to reconstruct after driver crash | Preserve TEE context across driver reload (hardware feature) |
| FMA telemetry (Section 20.1) | Health events may leak information about confidential workloads | Aggregate-only telemetry for confidential contexts |

9.7.2 Architectural Requirements

// umka-core/src/security/confidential.rs

/// Confidential computing context.
/// Attached to a process or VM that runs inside a TEE.
pub struct ConfidentialContext {
    /// Hardware TEE type.
    pub tee_type: TeeType,

    /// Hardware-managed encryption key ID.
    /// Each TEE context has a unique key. The kernel never sees the key.
    /// The hardware memory controller encrypts/decrypts transparently.
    ///
    /// **Longevity analysis (u32)**: Constrained by hardware TEE implementations.
    /// AMD SEV-SNP: max ~1006 ASIDs (Genoa). Intel TDX: max ~32K keys
    /// (15-bit IA32_TME_CAPABILITY). ARM CCA: Realm IDs implementation-defined
    /// (current implementations <1024). All fit trivially in u32 (4 billion).
    /// Hardware key ID spaces are small by design (each key requires dedicated
    /// encryption hardware). u32 is sufficient for the foreseeable future.
    pub key_id: u32,

    /// VMPL level (0-3). Only meaningful when tee_type == AmdSevSnpVmpl.
    /// 0 = paravisor/SVSM (most privileged), 1+ = guest OS.
    /// For non-VMPL TEE types, this field is 0 and ignored.
    pub vmpl_level: u8,

    /// Attestation state.
    pub attestation: AttestationState,

    /// Memory policy: how the kernel handles this context's pages.
    pub memory_policy: ConfidentialMemoryPolicy,
}

#[repr(u32)]
pub enum TeeType {
    /// No TEE. Standard process. Kernel can access all memory.
    None        = 0,
    /// AMD SEV-SNP (Secure Encrypted Virtualization - Secure Nested Paging).
    /// Implicitly runs at VMPL0 (most privileged level within the VM).
    AmdSevSnp   = 1,
    /// Intel TDX (Trust Domain Extensions).
    IntelTdx    = 2,
    /// ARM CCA (Confidential Compute Architecture).
    ArmCca      = 3,
    /// AMD SEV-SNP with VMPL (Virtual Machine Privilege Level) at a
    /// non-default level. VMPL allows multiple trust levels within a
    /// single SEV-SNP VM: VMPL0 = paravisor/SVSM (most privileged),
    /// VMPL1+ = guest OS. Required for AMD's SVSM architecture.
    /// The VMPL level is stored in ConfidentialContext.vmpl_level (u8).
    AmdSevSnpVmpl = 4,
}

// Note: The VMPL level for AmdSevSnpVmpl is stored in ConfidentialContext
// as a separate field, not embedded in the enum variant. This preserves
// #[repr(u32)] compatibility (Rust requires all variants of a repr(inttype)
// enum to be unit variants with explicit discriminants).

#[repr(u32)]
pub enum ConfidentialMemoryPolicy {
    /// All pages are encrypted. Kernel cannot read content.
    /// Kernel can still manage page tables, migration metadata.
    FullEncryption  = 0,
    /// Shared pages (explicitly marked by guest) are readable by kernel.
    /// Private pages are encrypted. Used for I/O buffers.
    HybridShared    = 1,
}

9.7.2.1.1 TEE-Capable NUMA Nodes

Not all NUMA nodes support hardware memory encryption. CXL-attached memory expanders typically lack TEE capability — they provide additional capacity but cannot enforce hardware encryption. The tiering engine must never migrate pages belonging to a ConfidentialContext to a non-TEE-capable node.

/// Per-NUMA-node TEE capability flag, populated at boot from hardware discovery.
/// Used by the tiering engine to gate confidential page migration.
/// Stored in `BuddyAllocator::tee_info` ([Section 4.2](04-memory.md#physical-memory-allocator--numa-node-tee-info)).
pub struct NumaNodeTeeInfo {
    /// Whether this NUMA node's memory controller supports hardware encryption.
    /// True for: AMD SEV-SNP (ASID-keyed), Intel TDX (MKTME-keyed), ARM CCA (GPC-protected).
    /// False for: CXL Type 3 expanders without TEE firmware, standard DRAM on non-TEE platforms.
    pub tee_capable: bool,
    /// Maximum number of concurrent encryption key IDs supported by this node's controller.
    /// 0 if !tee_capable. AMD: up to 509 ASIDs (Milan), 1006+ (Genoa). Intel: MKTME key range.
    /// ARM: realm granule count (GPC-managed).
    pub max_key_ids: u32,
}

Migration gate (enforced in the tiering engine's page migration path, see Section 4.2):

if page.confidential_context.is_some() && !target_node.tee_info.tee_capable {
    // REJECT migration. Confidential page must stay on TEE-capable node.
    return Err(MigrationError::TargetNotTeeCapable);
}
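
The gate can be made concrete as a small standalone check. In this sketch, `page_is_confidential` stands in for `page.confidential_context.is_some()`, and the struct mirrors the `NumaNodeTeeInfo` definition above:

```rust
#[derive(Debug, PartialEq)]
pub enum MigrationError {
    TargetNotTeeCapable,
}

pub struct NumaNodeTeeInfo {
    pub tee_capable: bool,
    pub max_key_ids: u32,
}

/// Reject migration of confidential pages to nodes whose memory
/// controller cannot enforce hardware encryption.
pub fn check_confidential_migration(
    page_is_confidential: bool,
    target: &NumaNodeTeeInfo,
) -> Result<(), MigrationError> {
    if page_is_confidential && !target.tee_capable {
        return Err(MigrationError::TargetNotTeeCapable);
    }
    Ok(())
}
```
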

TEE capability is discovered per-node during boot from hardware-specific mechanisms: AMD SEV-SNP capability via CPUID Fn8000_001F[EAX] bit 1, Intel MKTME via IA32_TME_CAPABILITY MSR, ARM CCA via GPC capability per memory controller. See Section 2.15 for the full per-architecture discovery procedure.

/// Attestation lifecycle state.
pub enum AttestationState {
    /// No attestation attempted yet.
    Unattested,
    /// Challenge issued, awaiting hardware report.
    PendingChallenge { nonce: [u8; 32] },
    /// Successfully attested. Report can be verified by remote party.
    Attested {
        /// Hardware-generated attestation report (opaque, TEE-specific).
        /// Attestation report sizes vary by platform:
        /// AMD SEV-SNP: ~1,184 bytes (ATTESTATION_REPORT structure),
        /// Intel TDX: 1,024 bytes (TDREPORT),
        /// ARM CCA: Realm token (Realm Measurement and Attestation Token, RMAT)
        /// as defined in the ARM CCA Security Model specification.
        /// The 4,096-byte buffer provides headroom for all platforms
        /// including any future format extensions.
        /// Only report_len bytes are meaningful.
        report: [u8; 4096],
        /// Number of valid bytes in the report buffer.
        report_len: u16,
        /// SHA-384 hash of the report (for quick identity checks).
        report_hash: [u8; 48],
        /// When attestation completed (monotonic clock).
        timestamp: u64,
    },
    /// Attestation failed (hardware error or measurement mismatch).
    Failed { reason: AttestationError },
}

#[repr(u32)]
pub enum AttestationError {
    /// Hardware does not support attestation.
    HardwareUnsupported = 0,
    /// Measurement does not match expected value.
    MeasurementMismatch = 1,
    /// Hardware reported an internal error.
    HardwareError       = 2,
    /// Certificate chain validation failed.
    CertificateInvalid  = 3,
}
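
The attestation lifecycle is a small state machine. A simplified sketch of the challenge-issue transition — `AttState` flattens the real `AttestationState` for brevity, and `issue_challenge` is an illustrative name, not part of the kernel API:

```rust
/// Simplified attestation lifecycle for illustration.
#[derive(Debug, PartialEq)]
pub enum AttState {
    Unattested,
    PendingChallenge { nonce: [u8; 32] },
    Attested,
    Failed,
}

/// Issue a fresh challenge. Valid from Unattested or Failed (retry);
/// rejected while a challenge is outstanding or after success.
pub fn issue_challenge(
    state: &AttState,
    nonce: [u8; 32],
) -> Result<AttState, &'static str> {
    match state {
        AttState::Unattested | AttState::Failed => {
            Ok(AttState::PendingChallenge { nonce })
        }
        AttState::PendingChallenge { .. } => Err("challenge already outstanding"),
        AttState::Attested => Err("already attested"),
    }
}
```
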

Attestation Verification:

Hardware-rooted attestation allows remote parties to verify the integrity of a confidential workload. AMD SEV-SNP uses a Versioned Chip Endorsement Key (VCEK) signed by AMD's Key Distribution Service (KDS). Intel TDX uses a Quoting Enclave (QE) that produces quotes verifiable via Intel's Provisioning Certification Service (PCS). ARM CCA uses platform attestation tokens signed by the CCA platform key.

UmkaOS exposes Linux-compatible attestation device nodes:

- /dev/sev-guest — AMD SEV-SNP guest attestation reports
- /dev/tdx-guest — Intel TDX guest attestation reports (via TDX_CMD_GET_REPORT)


These are implemented in umka-sysapi with identical ioctl semantics to Linux.

9.7.3 Design Approach: Opaque Page Handles

The key design principle: kernel subsystems must be able to manage pages without reading their contents. Every subsystem that touches page data needs a "confidential-aware" path.

// umka-core/src/mem/page.rs

/// A physical page handle. The kernel always has this.
/// Whether the kernel can read the page DATA depends on the page's
/// confidentiality state.
pub struct PageHandle {
    /// Physical frame number.
    pub pfn: u64,

    /// Ownership and state metadata (kernel can always read this).
    pub state: PageState,

    /// Is this page's content readable by the kernel?
    /// false for pages owned by a ConfidentialContext.
    pub kernel_readable: bool,

    /// For confidential pages: hardware key ID for re-encryption
    /// during migration. The kernel uses this to instruct hardware
    /// to re-encrypt the page for the destination context.
    pub encryption_key_id: Option<u32>,
}

Subsystem-by-subsystem adaptation:

Memory compression (Section 4.2):
  if page.kernel_readable:
    compress normally (LZ4/zstd)
  else:
    skip compression — evict to swap encrypted
    (hardware encrypts at rest, no kernel involvement)

Page migration (Section 22.2, GPU VRAM):
  if page.kernel_readable:
    DMA copy (standard path)
  else:
    hardware-assisted encrypted copy
    (SEV-SNP: SNP_PAGE_MOVE firmware command via PSP, TDX: TDH.MEM.PAGE.RELOCATE SEAMCALL)

DSM (Section 5.6):
  if both endpoints are in the same TEE trust domain:
    Key negotiation protocol:
      1. Both nodes produce attestation reports (hardware-rooted).
      2. Reports are mutually verified (each node checks the other's
         measurement, firmware version, security policy).
      3. Key exchange via ML-KEM-768 (PQC, Section 9.5) authenticated by
         attestation-bound identity.
      4. Shared symmetric key established for RDMA encryption.
      5. RDMA transfers encrypted with shared key (AES-GCM).
    Latency: ~5ms for initial key negotiation (once per node pair).
    Steady-state RDMA: same performance as non-TEE (hardware AES).
    Key rotation: every 1 hour or 2^32 RDMA operations (whichever
    comes first). Nonce construction: 4-byte connection ID || 8-byte
    per-connection counter (deterministic, never reused within a key
    lifetime). The 2^32 limit is a conservative policy: counter-based
    nonces can safely reach 2^64, but frequent rotation limits the
    exposure window of any single key.
    Re-attestation: every 24 hours or on firmware update. If
    re-attestation fails (measurement changed), the shared key is
    revoked and all RDMA connections to that node are torn down.
    DSM pages cached from that node are invalidated (Section 5.6).
  else (different trust domains):
    pages must be re-encrypted for each endpoint
    → higher latency, but functionally correct
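The deterministic AES-GCM nonce construction in the key-negotiation steps above (4-byte connection ID || 8-byte per-connection counter) can be sketched as follows (illustrative helper, not the actual UmkaOS function):

```rust
/// Build a 12-byte AES-GCM nonce: 4-byte connection ID || 8-byte counter.
/// Big-endian so nonces sort in issue order; never reused within a key
/// lifetime because the counter is strictly increasing per connection.
fn rdma_gcm_nonce(conn_id: u32, counter: u64) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    nonce[..4].copy_from_slice(&conn_id.to_be_bytes());
    nonce[4..].copy_from_slice(&counter.to_be_bytes());
    nonce
}

fn main() {
    // Connection 7, third RDMA operation under the current key.
    let n = rdma_gcm_nonce(7, 3);
    assert_eq!(n, [0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 3]);
}
```

Because the counter is per-connection and the key rotates well before the counter could wrap, the (key, nonce) pair is never reused — the property AES-GCM requires.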

  Phase scope: The above TEE-to-TEE DSM key negotiation protocol is
  Phase 5+ research scope. It is not implemented in Phase 1-4 and is
  deferred pending standardization of inter-enclave communication
  protocols (ARM CCA Realm-to-Realm, Intel TDX Migration TD).

Hardware memory tagging (MTE, Section 2.3):
  MTE tags are stored in a separate physical memory region (tag RAM).
  For TEE-encrypted pages, tag RAM may also be encrypted.
  The kernel cannot set/clear tags on confidential pages.
  Policy: confidential pages are allocated UNTAGGED (tag = 0).
  MTE checking is disabled for pages owned by a ConfidentialContext.
  Rationale: MTE protects against kernel bugs accessing the page.
  For confidential pages, hardware encryption already prevents
  kernel access — MTE is redundant.

In-kernel inference (Section 22.4):
  Can observe: page fault frequency, access pattern (from page table
  accessed bits, not page content), memory pressure signals.
  Cannot observe: page content.
  → Page prefetching model trains on access patterns, not content.
  → Works fine. Content is irrelevant for prefetch decisions.
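The per-subsystem branching above reduces to the same pattern everywhere: consult `kernel_readable` on the page handle and pick the path. A condensed sketch for the eviction and migration decisions (enum and function names are illustrative, not the kernel's API):

```rust
#[derive(Debug, PartialEq)]
enum EvictPath { CompressLz4, SwapEncrypted }

#[derive(Debug, PartialEq)]
enum MigratePath { DmaCopy, HardwareReencrypt }

// Sketch: the confidential-aware branch the compression subsystem takes.
fn evict_path(kernel_readable: bool) -> EvictPath {
    if kernel_readable { EvictPath::CompressLz4 } else { EvictPath::SwapEncrypted }
}

// Sketch: the confidential-aware branch page migration takes
// (hardware re-encrypt = SNP_PAGE_MOVE / TDH.MEM.PAGE.RELOCATE).
fn migrate_path(kernel_readable: bool) -> MigratePath {
    if kernel_readable { MigratePath::DmaCopy } else { MigratePath::HardwareReencrypt }
}

fn main() {
    assert_eq!(evict_path(false), EvictPath::SwapEncrypted);
    assert_eq!(migrate_path(true), MigratePath::DmaCopy);
}
```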

9.7.4 Guest Mode: UmkaOS as a Confidential Guest

When UmkaOS runs inside a confidential VM (as a guest on a hypervisor):

UmkaOS as TDX/SEV-SNP guest:
  1. All guest physical memory is encrypted by hardware.
  2. The hypervisor cannot read guest memory.
  3. UmkaOS must:
     a. Use SWIOTLB (bounce buffer) for device I/O with the hypervisor.
     b. Explicitly mark shared pages for virtio/MMIO communication.
     c. Validate all data received from the hypervisor (untrusted host).
        This includes the **Heckler attack** defense: a malicious hypervisor
        can inject interrupts into TDX guests to influence kernel execution
        paths and cause information disclosure. All host-provided MMIO, port
        I/O, MSR, and CPUID values must be validated against expected ranges;
        interrupt injection must be filtered against expected interrupt vectors
        (reject unexpected #VE-injected events). See `X86Errata::TDX_HECKLER`.
     d. Support remote attestation so users can verify the guest.

  This requires:
  - Memory manager: distinguish "private" (encrypted) vs "shared" (plaintext) pages.
  - I/O path: bounce buffers for virtio when running as confidential guest.
  - io_uring: bounce buffer pool for DMA data payloads (SQE/CQE rings stay encrypted).
    See [Section 19.3](19-sysapi.md#io-uring-subsystem--iouring-under-sev-snp-confidential-guest-mode)
    for the complete io_uring + SEV-SNP bounce buffer specification.
  - Attestation: kernel provides attestation report via sysfs/ioctl.

9.7.4.1 Confidential VM Live Migration

Live migration of a confidential VM requires that:

  1. The destination platform is attested as trustworthy before any guest memory is sent.
  2. Guest memory is transported under a migration transport key established via attestation-authenticated key exchange — the source host cannot read it in transit.
  3. The migration request is authenticated to prevent rogue hosts from diverting a confidential VM to an untrusted platform.

AMD SEV-SNP migration protocol (KVM KVM_SEV_SEND_* / KVM_SEV_RECEIVE_* ioctls):

Phase 1 — Destination attestation:
  1. Destination PSP generates a migration ephemeral ECDH P-384 key pair:
       (migr_pub, migr_priv) = ECDH-P384 keypair
       (AMD PSP supports only ECDSA/ECDH P-384 — see AMD spec 56860 rev 1.58, Tables 139-142)
  2. Destination PSP issues an attestation report that includes a hash of migr_pub
     in the REPORT_DATA field (binds the key to the platform measurement).
  3. Attestation report is signed by the VCEK (Versioned Chip Endorsement Key,
     ECDSA P-384 with SHA-384) which chains to AMD's KDS root.
  4. Destination sends: (attestation_report, migr_pub) to the migration controller.

Phase 2 — Source verifies destination:
  1. Migration controller verifies attestation_report against AMD KDS root.
  2. Verifies: PLATFORM_VERSION ≥ minimum required, POLICY matches expected.
  3. Extracts migr_pub hash from REPORT_DATA (proves key ownership by attested platform).
  4. Source PSP performs ECDH key agreement:
       shared_secret = ECDH(source_migr_priv, dest_migr_pub)
  5. AES-256-GCM key derived from shared_secret via HKDF-SHA-384:
       enc_key = HKDF(shared_secret, "UmkaOS-SEV-MIGRATION-ENC-v1", 32)

Phase 3 — Memory transfer:
  1. Source calls PSP: SNP_PAGE_COPY(page_list, dst_platform_guestkey_id)
     PSP re-encrypts each page from source key to transport key.
  2. Encrypted page blobs + ciphertext sent to destination over TLS 1.3.
     (TLS authenticates the management channel; SEV encryption protects page data.)
  3. Destination calls PSP: SNP_PAGE_RECEIVE(encrypted_page_blob)
     PSP decrypts with migr_priv → re-encrypts with destination platform key.

Phase 4 — Guest state transfer:
  1. vCPU registers, VMSA (VM Save Area) encrypted with transport key.
     VMSA includes all vCPU state: GPRs, MSRs, VMCB fields.
  2. Destination PSP verifies VMSA measurement matches expected boot digest.
  3. Guest resumes on destination only after PSP approves state transfer.

Request authenticity:
  The migration request (source → destination address, VM ID, timestamp) is signed
  with the source host's Ed25519 cluster identity key ([Section 5.1.2.2]).
  Destination verifies signature before accepting connection. This prevents MITM
  attacks where an untrusted third party diverts migration traffic.
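Phase 2's verification order matters: certificate chain first, then version and policy, then the key binding in REPORT_DATA. A sketch of that ordering with simplified stand-in types (field names and the truncated hash are illustrative; real reports follow AMD's ATTESTATION_REPORT layout):

```rust
// Stand-in for the fields Phase 2 inspects (not the real report layout).
struct AttReport {
    chain_valid: bool,          // stand-in for VCEK → AMD KDS chain verification
    platform_version: u32,
    policy: u64,
    report_data_hash: [u8; 4],  // truncated stand-in for hash(migr_pub) in REPORT_DATA
}

/// Sketch: source-side verification of the destination, in the order
/// the phases above prescribe. First failure wins.
fn verify_destination(
    r: &AttReport,
    min_version: u32,
    expected_policy: u64,
    migr_pub_hash: &[u8; 4],
) -> Result<(), &'static str> {
    if !r.chain_valid { return Err("certificate chain invalid"); }
    if r.platform_version < min_version { return Err("platform version too old"); }
    if r.policy != expected_policy { return Err("policy mismatch"); }
    if &r.report_data_hash != migr_pub_hash {
        return Err("migration key not bound to report");
    }
    Ok(())
}

fn main() {
    let r = AttReport {
        chain_valid: true, platform_version: 5,
        policy: 0x3000, report_data_hash: [1, 2, 3, 4],
    };
    assert!(verify_destination(&r, 3, 0x3000, &[1, 2, 3, 4]).is_ok());
    assert_eq!(verify_destination(&r, 6, 0x3000, &[1, 2, 3, 4]),
               Err("platform version too old"));
}
```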

Intel TDX migration protocol (per Intel TDX Module Architecture Specification: TD Migration, pub #348550002; SEAMCALL interface):

Phase 1 — Destination attestation:
  1. Destination generates TDX migration TD (a special TrustDomain for migration).
  2. TDREPORT generated by TDH.REPORT SEAMCALL — includes migration TD measurement.
  3. Report verified via Intel PCS (Provisioning Certification Service).

Phase 2 — Key establishment and state setup:
  TDH.EXPORT.STATE.IMMUTABLE / TDH.IMPORT.STATE.IMMUTABLE for TD-scope metadata.
  TDH.EXPORT.TRACK / TDH.IMPORT.TRACK for migration ordering.
  Intel MKTME (Multi-Key Total Memory Encryption) handles re-keying.

Phase 3 — Page and vCPU state transfer:
  TDH.EXPORT.MEM / TDH.IMPORT.MEM for each page.
  TDH.EXPORT.STATE.VP / TDH.IMPORT.STATE.VP for per-vCPU state.
  Pages are re-encrypted from source KeyID to destination KeyID.

Latency: additional ~5-20ms for attestation (Phase 1) compared to
non-confidential migration. Memory transfer latency is identical
(hardware AES-256 encryption is in the memory controller, ~0.1% overhead).

9.7.5 Host Mode: UmkaOS Hosting Confidential VMs

When UmkaOS is the host running umka-kvm with confidential guests:

UmkaOS as TDX/SEV-SNP host:
  0. Boot-time prerequisite: microcode version validation against minimum-safe
     table before enabling SGX/TDX (INTEL-SA-00837: unauthorized error injection
     allows privilege escalation — see X86Errata::SGX_TDX_ERROR_INJ). AMD:
     CVE-2024-56161 microcode signature weakness — disable SEV-SNP below safe
     version (see X86Errata::AMD_UCODE_SIG).
  1. umka-kvm (Tier 1 driver, Section 11.4) creates confidential VM contexts.
  2. Kernel allocates encrypted physical pages for guest.
  3. Kernel CANNOT read guest memory (hardware enforces).
  4. Kernel CAN:
     - Schedule vCPUs on physical CPUs (scheduling, Section 7.1)
     - Manage guest memory mappings (page tables, metadata only)
     - Migrate guest pages between NUMA nodes (hardware re-encryption)
     - Apply cgroup limits (memory, CPU, accelerator)
  5. Kernel CANNOT:
     - Read guest page contents
     - Inject code into guest
     - Modify guest register state (except via controlled VM entry/exit)

9.7.6 Linux Compatibility

Existing Linux confidential computing interfaces:

/dev/sev (AMD SEV-SNP):
  KVM_SEV_INIT, KVM_SEV_LAUNCH_START, KVM_SEV_LAUNCH_UPDATE,
  KVM_SEV_LAUNCH_MEASURE, KVM_SEV_LAUNCH_FINISH, etc.
  → Implemented in umka-kvm, same ioctl numbers.

/dev/tdx_guest (Intel TDX):
  TDX_CMD_GET_REPORT, TDX_CMD_EXTEND_RTMR, etc.
  → Implemented in umka-sysapi, same ioctl numbers.

KVM ioctls (generic):
  KVM_CREATE_VM, KVM_SET_MEMORY_ATTRIBUTES (private vs shared), etc.
  → Implemented in umka-kvm.

QEMU, libvirt, cloud-hypervisor: all use these ioctls.
Binary compatibility preserved.

9.7.7 TEE Observability Degradation Model

When workloads run inside TEEs, some kernel optimization signals are unavailable. This is an inherent hardware constraint, not a software limitation. The kernel must degrade gracefully, not silently fail.

Signal                       Non-TEE    TEE (host observing guest)    Notes
────────────────────────────  ─────────  ───────────────────────────  ──────────────
Page fault patterns           ✓ Full     ✓ Full                      Kernel manages page tables
Page table accessed/dirty     ✓ Full     ✓ Full                      Hardware sets bits, kernel reads
I/O request patterns          ✓ Full     ✓ Full                      I/O goes through kernel
CPU scheduling (PELT)         ✓ Full     ✓ Full                      Kernel schedules vCPUs
Memory allocation patterns    ✓ Full     ✓ Full                      Kernel allocates pages
Memory pressure signals       ✓ Full     ✓ Full                      Kernel tracks pressure
Page content (for compress)   ✓ Full     ✗ Unavailable               Hardware encryption
Hardware perf counters        ✓ Full     ◐ Restricted                TDX: host reads limited set
                                                                     SEV-SNP: guest-only by default
                                                                     CCA: realm-restricted
In-kernel inference           ✓ Full     ◐ Reduced                   Trains on access patterns (ok)
                                                                     Cannot use content features
Memory compression            ✓ Full     ✗ Disabled                  Can't compress encrypted pages
MTE tagging                   ✓ Full     ✗ Disabled                  Tags on confidential pages = 0

Degraded features for TEE workloads:
  - Memory compression: disabled (pages evicted encrypted, no compression gain)
  - Learned prefetch: works on access patterns (page faults, accessed bits)
    but quality may be ~10-20% lower than non-TEE due to missing perf counters
  - Intent I/O scheduling: works (sees I/O requests), same quality as non-TEE
  - Power budgeting: works (RAPL is host-side, unaffected by TEE)
  - MTE safety: disabled for TEE pages (hardware encryption is the safety mechanism)

This degradation is identical to Linux. Linux has the same restricted visibility into TEE guests. No kernel can observe hardware-encrypted memory contents.

9.7.8 Performance Impact

When confidential computing is not used: zero overhead. No code runs, no checks happen. The page.kernel_readable flag is always true. All code paths take the standard branch.

When confidential computing is used: same overhead as Linux. The cost is hardware memory encryption (~1-5% memory bandwidth reduction), which is identical regardless of kernel implementation.

9.7.9 Device Passthrough for Confidential VMs

PCIe devices (GPUs, NICs, accelerators) passed through to a confidential VM face a fundamental problem: device DMA targets host physical memory, but TEE memory is encrypted. The device cannot read/write encrypted pages directly unless the device itself supports encryption.

UmkaOS supports both paths and selects automatically at VM creation time based on device capability. An admin override is available for testing and policy enforcement.

Boot/runtime parameter: umka.cc_device_dma=auto|bounce|hwcrypto

| Mode | Behavior | Performance | Use Case |
|------|----------|-------------|----------|
| auto (default) | Detect device CC capability. Use hardware crypto if available, bounce buffer otherwise. | Optimal | Production — always correct |
| bounce | Force bounce buffers for all CC VM device DMA, even if device supports hardware crypto. | 10-100x slower for large transfers | Security-conservative (eliminates trust in device firmware) |
| hwcrypto | Require hardware-encrypted DMA. Device passthrough fails with -ENOTSUP if device lacks CC capability. | Native (zero overhead) | Environments that mandate hardware encryption |

Detection mechanism:

/// Confidential computing device capability, discovered at VFIO device attach.
pub enum CcDeviceCapability {
    /// Device supports hardware-encrypted DMA to TEE memory.
    /// Examples: NVIDIA H100 CC mode, AMD MI300X SEV-SNP VRAM (when verified).
    /// Detected via vendor-specific PCI config space or firmware query.
    HardwareCrypto,
    /// Device does not support TEE-aware DMA. Bounce buffers required.
    /// This is the common case for most PCIe devices today.
    BounceRequired,
}

/// Called during VFIO_DEVICE_ATTACH for confidential VMs.
pub fn cc_device_dma_policy(
    device: &VfioDevice,
    vm: &ConfidentialContext,
    mode: CcDeviceDmaMode,
) -> Result<CcDmaPath, VfioError> {
    let hw_cap = device.detect_cc_capability();

    match (mode, hw_cap) {
        // Auto: prefer hardware crypto, fall back to bounce.
        (CcDeviceDmaMode::Auto, CcDeviceCapability::HardwareCrypto) => {
            Ok(CcDmaPath::HardwareCrypto)
        }
        (CcDeviceDmaMode::Auto, CcDeviceCapability::BounceRequired) => {
            Ok(CcDmaPath::BounceBuffer)
        }
        // Force bounce: always use bounce buffers.
        (CcDeviceDmaMode::Bounce, _) => Ok(CcDmaPath::BounceBuffer),
        // Require hardware crypto: fail if device can't do it.
        (CcDeviceDmaMode::HwCrypto, CcDeviceCapability::HardwareCrypto) => {
            Ok(CcDmaPath::HardwareCrypto)
        }
        (CcDeviceDmaMode::HwCrypto, CcDeviceCapability::BounceRequired) => {
            Err(VfioError::CcCapabilityRequired)
        }
    }
}

Bounce buffer path (reuses existing SWIOTLB infrastructure):

The IOMMU maps a shared unencrypted bounce buffer region visible to both the device and the kernel. DMA operations copy between the bounce buffer and the VM's encrypted memory via hardware re-encryption (SEV-SNP SNP_PAGE_MOVE, TDX TDH.MEM.PAGE.RELOCATE). Buffer size is configurable per device: /sys/kernel/iommu_groups/<N>/cc_bounce_size_mb (default: 64 MiB for GPUs, 16 MiB for NICs, same as io_uring SEV-SNP bounce buffers).

Hardware crypto path:

The device firmware manages VRAM encryption. The kernel performs standard DMA mapping — the device encrypts/decrypts transparently. From UmkaOS's perspective, this is identical to normal VFIO passthrough. Zero kernel changes, zero overhead.

Vendor capability detection:

| Vendor | Device | CC Mode | Detection |
|--------|--------|---------|-----------|
| NVIDIA | H100/H200 | CC-on mode (firmware toggle) | PCI config space vendor capability + nvidia-smi -cc 1 |
| AMD | MI300X | SEV-SNP VRAM integration | PCI config space + PSP query (silicon-revision dependent) |
| Intel | Data Center GPU Max | TDX Connect (emerging) | PCI config space TDISP capability |

When hardware CC capability is unverifiable for a specific silicon revision (e.g., early MI300X steppings), detect_cc_capability() returns BounceRequired — safe default. The admin can override with umka.cc_device_dma=hwcrypto after independent verification.

9.7.9.1 Nested GPU Passthrough

A confidential VM running a nested hypervisor that passes through a GPU adds three layers of IOMMU translation (host IOMMU → L1 guest IOMMU → L2 guest virtual IOMMU). UmkaOS supports this if the hardware supports it — no architectural limitation prevents it.

Support policy: allowed when all three conditions are met:

  1. Hardware: The physical IOMMU supports nested/multi-level translation (Intel VT-d with scalable-mode nested translation, AMD-Vi with GPA-to-SPA + nested guest translation).
  2. Firmware: The TEE firmware allows device assignment to nested guests (SEV-SNP: nested VMPL, TDX: nested TD — both are hardware-generation dependent).
  3. Performance: The cumulative translation overhead is ≤ 3x single-level. If overhead exceeds this, the VFIO attach returns -ENOTSUP with a diagnostic (dmesg: "nested GPU passthrough: IOMMU overhead exceeds 3x threshold, consider direct passthrough to L1 instead").

The 3x threshold is a configurable policy, not a hard limit: /sys/kernel/umka/vfio/nested_overhead_limit (default: 3). Setting to 0 disables the check (allow regardless of overhead).

When hardware or firmware does not support nested device assignment, VFIO returns -ENOTSUP at L2 device attach time — no silent degradation.
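The overhead policy above, including the limit-0 escape hatch, reduces to a one-line admission check (sketch; names hypothetical):

```rust
/// Sketch of the nested-passthrough admission check.
/// `limit` mirrors /sys/kernel/umka/vfio/nested_overhead_limit; 0 disables
/// the check entirely (attach regardless of measured overhead).
fn nested_attach_allowed(measured_overhead_x: f64, limit: u32) -> bool {
    limit == 0 || measured_overhead_x <= limit as f64
}

fn main() {
    assert!(nested_attach_allowed(2.8, 3));   // under threshold: attach
    assert!(!nested_attach_allowed(4.1, 3));  // over threshold: -ENOTSUP
    assert!(nested_attach_allowed(100.0, 0)); // limit 0: check disabled
}
```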

9.7.10 Confidential VM Migration: Platform Mechanisms

Section 9.7.4.1 specifies the migration protocol phases; this section summarizes the per-platform mechanisms and covers ARM CCA, key exhaustion, and guest-hypervisor communication. Confidential VMs require special migration handling because the hypervisor cannot read encrypted guest memory.

AMD SEV-SNP Migration:

The Platform Security Processor (PSP) manages migration. The source PSP exports encrypted guest pages with a transport key negotiated between source and destination PSPs. The destination allocates a new ASID and imports pages, re-encrypting with the destination's memory encryption key. Guest state (VMSA) is included in the encrypted export. The guest observes a brief pause but no data loss.

Intel TDX Migration:

TDX uses TD-Preserving migration via SEAMCALL instructions (TDH.EXPORT.MEM, TDH.IMPORT.MEM, TDH.EXPORT.STATE.IMMUTABLE, TDH.IMPORT.STATE.IMMUTABLE, TDH.EXPORT.STATE.VP, TDH.IMPORT.STATE.VP). The source TDX module exports TD pages (via TDH.EXPORT.MEM), immutable TD-scope metadata (via TDH.EXPORT.STATE.IMMUTABLE), and per-vCPU state (via TDH.EXPORT.STATE.VP) in an encrypted migration stream. The destination TDX module imports and re-keys via the corresponding IMPORT SEAMCALLs. Migration is transparent to the TD guest.

ARM CCA Migration:

CCA realm live migration is deferred to Phase 5: ARM is actively developing the protocol, but it is not yet standardized as of early 2026. The RMM specification (DEN0137) has progressed to RMI version 2.0 (2.0-alp19, November 2025), which introduced breaking RMI changes but kept RSI at v1.1. The anticipated migration flow involves RMM-mediated secure export/sealing of realm memory and state on the source host, with attestation-verified re-initialization on the destination — the hypervisor never sees realm private data. UmkaOS reserves the CCA migration interface and will implement it when ARM publishes the finalized migration commands in a future RMM spec revision.

Anticipated CCA migration mechanism (based on RMM architecture and ARM CCA security model): The Realm Management Extension (RME) partitions physical memory into four worlds (Normal, Secure, Realm, Root) via the Granule Protection Table (GPT). Migration requires three phases:

  1. Export (source host): The source RMM seals realm granules — each 4KB granule's content is encrypted with a migration-specific transport key derived from mutual attestation between source and destination RMMs. The GPT entries for exported granules transition from Realm to Normal world ownership, allowing the hypervisor to read the (encrypted) data for transfer. Per-vCPU register state and realm metadata (REC, RTT entries) are sealed similarly.

  2. Attestation exchange: Before importing, the destination RMM verifies the source realm's attestation token (signed by the source platform's CCA attestation key). This establishes that the realm was running on genuine CCA hardware and provides the transport key for decryption. The destination realm's initial measurement must match — realm-to-realm attestation ensures continuity.

  3. Import (destination host): The destination RMM allocates fresh realm granules, updates its GPT to assign them to Realm world, decrypts the sealed content, and reconstructs RTT (Realm Translation Table) entries. REC (Realm Execution Context) state is restored per vCPU. The realm resumes execution with no visible state change.

Required data structures (RealmMigrationContext, RealmPageExportList, sealed granule format) will be specified when the ARM spec stabilizes. Forward reference: revisit after ARM RME Migration spec reaches v1.0.

ASID/Key Exhaustion:

umka-kvm tracks hardware encryption key IDs (ASIDs for SEV-SNP, HKIDs for TDX). When all IDs are allocated, KVM_CREATE_VM with confidential flags returns -ENOSPC. The administrator must terminate existing confidential VMs to free IDs. Typical limits: SEV-SNP ASIDs are platform-dependent (e.g., 509 on Milan, 1006+ on Genoa), discovered at runtime via CPUID Fn8000001F[ECX]; TDX supports ~64 HKIDs (hardware-dependent).

The Genoa ASID count (1006) exceeds Milan's (509) because Genoa's PSP (Platform Security Processor) has an expanded key cache; both values are correct for their respective processor generations, per AMD's SEV-SNP API specification. CPUID Fn8000001F[ECX] returns the actual count at runtime, so the kernel never hardcodes either figure.
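The -ENOSPC behavior can be sketched as a free-list allocator over the platform's discovered ID count (illustrative; not the umka-kvm implementation):

```rust
const ENOSPC: i32 = 28;

/// Sketch: hardware encryption key ID pool (SEV-SNP ASIDs / TDX HKIDs).
struct KeyIdPool { free: Vec<u32> }

impl KeyIdPool {
    /// `count` comes from hardware discovery, e.g. CPUID Fn8000001F[ECX].
    fn new(count: u32) -> Self {
        KeyIdPool { free: (1..=count).rev().collect() }
    }

    /// Hand out the lowest free ID; -ENOSPC when all are in use.
    fn alloc(&mut self) -> Result<u32, i32> {
        self.free.pop().ok_or(-ENOSPC)
    }

    /// Return an ID when a confidential VM terminates.
    fn release(&mut self, id: u32) {
        self.free.push(id);
    }
}

fn main() {
    let mut pool = KeyIdPool::new(2);
    let a = pool.alloc().unwrap();
    let _b = pool.alloc().unwrap();
    assert_eq!(pool.alloc(), Err(-ENOSPC)); // all IDs in use
    pool.release(a);
    assert!(pool.alloc().is_ok());          // freed ID is reusable
}
```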

GHCB Protocol (SEV-ES/SNP):

SEV-ES and SEV-SNP guests cannot execute VMEXIT normally (host cannot read guest registers). The Guest-Hypervisor Communication Block (GHCB) is a shared page where the guest places exit information for the hypervisor. umka-kvm implements the GHCB protocol (v2) for handling #VC (VMM Communication) exceptions.

TDX TDCALL:

TDX guests communicate with the TDX module via TDCALL instruction (not VMEXIT). umka-kvm's TDX backend handles TDG.VP.VMCALL for guest-initiated exits.

9.7.11 TEE Key Negotiation Wire Format (UmkaOS-TEE-v1)

Used to establish a shared secret between UmkaOS Core and a TEE enclave (SEV-SNP, TDX, or ARM CCA) for encrypted inter-domain communication.

Handshake message format (sent by enclave to UmkaOS Core):

#[repr(C)]
pub struct TeeHelloMessage {
    /// Protocol version. Current: 1.
    pub version: u16,
    /// Key agreement algorithm:
    ///   1 = X25519 (32-byte public key)
    ///   2 = ML-KEM-768 encapsulation key (1184 bytes — post-quantum)
    pub algorithm: u16,
    /// Random nonce generated by the enclave's CSPRNG
    /// (RDRAND on x86, RNDR on AArch64, or TPM2_GetRandom).
    /// Must be unique per session; verified against a nonce replay log.
    pub nonce: [u8; 32],
    /// Enclave's public key or encapsulation key.
    /// For X25519: bytes 0..32. For ML-KEM-768: bytes 0..1184.
    /// Unused bytes are zero.
    pub public_key: [u8; 1184],
    /// TEE attestation report (SEV-SNP Report, TDX Quote, or CCA Token).
    /// UmkaOS Core verifies the report before accepting the key material.
    pub attestation_report: [u8; 4096],
    _pad: [u8; 4],  // align to 8 bytes. Total: 5320 bytes.
}
// Layout: 2 + 2 + 32 + 1184 + 4096 + 4 = 5320 bytes.
const_assert!(size_of::<TeeHelloMessage>() == 5320);

UmkaOS Core response:

#[repr(C)]
pub struct TeeCoreResponse {
    pub version: u16,
    pub algorithm: u16,
    /// UmkaOS Core's nonce. XOR with enclave nonce for KDF input.
    pub core_nonce: [u8; 32],
    /// For X25519: UmkaOS Core's public key (32 bytes).
    /// For ML-KEM-768: ciphertext (1088 bytes).
    pub response_key: [u8; 1088],
    _pad: [u8; 8],  // Total: 2 + 2 + 32 + 1088 + 8 = 1132 bytes.
}
const_assert!(size_of::<TeeCoreResponse>() == 1132);

Key derivation:

shared_secret = X25519(core_private, enclave_public)
             OR ML-KEM-768.Decapsulate(core_private, enclave_ciphertext)

session_key = HKDF-SHA256(
    ikm    = shared_secret,
    salt   = enclave_nonce XOR core_nonce,   // 32 bytes
    info   = b"UmkaOS-TEE-v1",
    length = 32                              // 256-bit AES key
)
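The salt in the KDF above is a byte-wise XOR of the two session nonces; a sketch:

```rust
/// Combine the enclave and core nonces into the HKDF salt (byte-wise XOR),
/// so neither party controls the salt unilaterally.
fn kdf_salt(enclave_nonce: &[u8; 32], core_nonce: &[u8; 32]) -> [u8; 32] {
    let mut salt = [0u8; 32];
    for i in 0..32 {
        salt[i] = enclave_nonce[i] ^ core_nonce[i];
    }
    salt
}

fn main() {
    let e = [0xAAu8; 32];
    let c = [0x55u8; 32];
    assert_eq!(kdf_salt(&e, &c), [0xFFu8; 32]); // 0xAA ^ 0x55 = 0xFF
}
```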

Anti-replay: UmkaOS Core maintains a nonce log (last 1024 nonces) per TEE instance type. Any incoming TeeHelloMessage.nonce matching a logged nonce is rejected with EEXIST.
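A sketch of the bounded nonce log (last 1024 entries, oldest evicted) with the EEXIST rejection — illustrative; the kernel's structure may differ, e.g. a hash set rather than this linear scan:

```rust
use std::collections::VecDeque;

const EEXIST: i32 = 17;
const LOG_CAPACITY: usize = 1024;

/// Sketch: anti-replay log holding the last 1024 handshake nonces.
struct NonceLog { seen: VecDeque<[u8; 32]> }

impl NonceLog {
    fn new() -> Self {
        NonceLog { seen: VecDeque::with_capacity(LOG_CAPACITY) }
    }

    /// Reject replayed nonces; otherwise record, evicting the oldest.
    fn check_and_record(&mut self, nonce: [u8; 32]) -> Result<(), i32> {
        if self.seen.contains(&nonce) {
            return Err(-EEXIST); // replayed TeeHelloMessage.nonce
        }
        if self.seen.len() == LOG_CAPACITY {
            self.seen.pop_front(); // drop the oldest entry
        }
        self.seen.push_back(nonce);
        Ok(())
    }
}

fn main() {
    let mut log = NonceLog::new();
    let n = [7u8; 32];
    assert!(log.check_and_record(n).is_ok());
    assert_eq!(log.check_and_record(n), Err(-EEXIST)); // replay rejected
}
```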

Attestation verification: UmkaOS Core verifies the TEE attestation report using the platform's attestation root of trust:

- SEV-SNP: VCEK certificate chain — AMD ARK/ASK
- TDX: Intel's DCAP quote verification service (offline-capable with cached CRL)
- ARM CCA: CCA realm token — platform RVIM certificate


9.8 Linux Security Module (LSM) Framework

UmkaOS provides a pluggable security module framework compatible with Linux's LSM infrastructure. The framework defines hook points, a stacking model, and a blob allocation scheme. Specific policy engines (SELinux, AppArmor) are implementation details outside this specification; this section specifies the framework they plug into.

Inspired by: Linux LSM (Wright et al., 2002), LSM stacking (Schaufler, 2018+). IP status: Clean -- LSM is a well-documented public kernel interface. The hook taxonomy is derived from the public Linux UAPI headers and kernel documentation.

9.8.1 Design Goals

  1. Linux compatibility: Unmodified AppArmor, SELinux, and Landlock policy semantics must be reproducible. Container runtimes (Docker, containerd, CRI-O) that set LSM profiles via OCI annotations must work without modification.

  2. Zero overhead when disabled: If no LSM module is loaded, every hook point compiles to a NOP via static keys (same mechanism as tracepoints, Section 20.2). There is no function pointer dereference, no branch, no cache line touch on the hot path.

  3. Grouped hook interface: Linux defines approximately 220 individual hook functions. UmkaOS groups these into approximately 15-20 trait methods organized by object category. Each method receives a discriminant identifying the specific operation, preserving the per-operation granularity that policy engines need while keeping the trait implementable. This is a framework design choice, not a semantic change -- every Linux LSM hook has a corresponding UmkaOS entry point.

  4. AND-logic stacking: Multiple LSM modules can be active simultaneously (e.g., IMA + AppArmor + Landlock). All active modules must permit an operation for it to proceed. Any module returning Err(SecurityDenial) blocks the operation. This matches Linux's LSM stacking model (mainlined incrementally since Linux 5.1).

9.8.2 The SecurityModule Trait

/// Pluggable Linux Security Module interface.
///
/// Each LSM (AppArmor, SELinux, Landlock, IMA, etc.) implements this trait.
/// The framework calls each registered module's hooks in order; all must
/// return Ok(()) for the operation to proceed. Any Err(SecurityDenial)
/// causes the operation to be denied.
///
/// Method grouping: hooks are grouped by kernel object category. Each method
/// receives an operation discriminant and the relevant kernel objects. This
/// provides the same granularity as Linux's ~220 individual hook functions
/// while keeping the trait manageable for implementation.
///
/// Default implementations return Ok(()) (permit). An LSM overrides only
/// the categories it cares about:
/// - AppArmor: file, task, socket, capable (path-based MAC)
/// - SELinux: all categories (label-based MAC)
/// - IMA: file only (integrity measurement)
/// - Landlock: file, socket (unprivileged sandboxing)
pub trait SecurityModule: Send + Sync {
    /// Human-readable module name (e.g., "apparmor", "selinux", "landlock").
    /// Used for /sys/kernel/security/lsm, audit logs, and error messages.
    fn name(&self) -> &'static str;

    /// Module initialization priority. Lower values initialize first.
    /// IMA = 10 (integrity must be first), MAC modules = 20, Landlock = 30.
    fn priority(&self) -> u32;

    /// Size in bytes of the security blob this module needs allocated in
    /// each kernel object (credential, inode, superblock, socket, etc.).
    /// Called once during module registration. Returns per-object-type sizes.
    fn blob_sizes(&self) -> LsmBlobSizes;

    /// Declare which hook categories this module actually implements.
    /// Returns a bitmask of `LsmHookMask` flags. The framework only
    /// dispatches hooks where the corresponding bit is set — avoiding
    /// virtual dispatch overhead for default (permit-all) implementations.
    ///
    /// This is better than Linux's `static_call` approach: Linux patches
    /// call sites per-hook at boot using architecture-specific binary
    /// rewriting (x86 NOPs, ARM branch patching). UmkaOS uses a portable
    /// bitmask check (single AND + branch) that works identically on all
    /// eight architectures with no binary patching infrastructure needed.
    ///
    /// Example: IMA returns `LsmHookMask::FILE_OPS | LsmHookMask::INODE_OPS`.
    /// Landlock returns `LsmHookMask::FILE_OPS | LsmHookMask::SOCKET_OPS`.
    /// SELinux returns `LsmHookMask::all()`.
    fn hooks_implemented(&self) -> LsmHookMask;

    // ===== Hook categories =====
    // Each method defaults to Ok(()) -- LSMs override only relevant hooks.

    /// File (open file descriptor) security checks.
    /// Called on operations that act on an open file.
    ///
    /// Operations:
    ///   Permission  -- check access mode (read/write/exec) on open file
    ///   Open        -- file being opened (after VFS lookup, before data access)
    ///   Receive     -- file descriptor received via SCM_RIGHTS (unix socket)
    ///   Lock        -- flock/fcntl locking operation
    ///   Mmap        -- file being memory-mapped (includes PROT_EXEC check)
    ///   Mprotect    -- mmap protection change on a file-backed mapping
    ///   Ioctl       -- ioctl on a file descriptor
    fn file_security(
        &self,
        op: FileSecurityOp,
        cred: &TaskCredential,
        file: &FileRef,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, file);
        Ok(())
    }

    /// Inode (on-disk object) security checks.
    /// Called on operations that create, modify, or query inodes.
    ///
    /// Operations:
    ///   Create      -- creating a new regular file
    ///   Link        -- creating a hard link
    ///   Unlink      -- removing a directory entry
    ///   Symlink     -- creating a symbolic link
    ///   Mkdir       -- creating a directory
    ///   Rmdir       -- removing a directory
    ///   Mknod       -- creating a special file (device, FIFO, socket)
    ///   Rename      -- renaming/moving a directory entry
    ///   Permission  -- inode access permission check (called by VFS layer)
    ///   Setattr     -- modifying inode attributes (chmod, chown, utimes)
    ///   Getattr     -- reading inode attributes (stat)
    ///   Setxattr    -- setting an extended attribute
    ///   Getxattr    -- reading an extended attribute
    ///   Removexattr -- removing an extended attribute
    ///   Listxattr   -- listing extended attributes
    fn inode_security(
        &self,
        op: InodeSecurityOp,
        cred: &TaskCredential,
        inode: &InodeRef,
        context: &InodeOpContext<'_>,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, inode, context);
        Ok(())
    }

    /// Superblock (filesystem instance) security checks.
    /// Called on filesystem-level operations.
    ///
    /// Operations:
    ///   Mount        -- mounting a filesystem
    ///   Umount       -- unmounting a filesystem
    ///   Remount      -- remounting with changed options
    ///   SetMntOpts   -- setting mount security options (e.g., SELinux context=)
    ///   Statfs       -- filesystem statistics query
    fn superblock_security(
        &self,
        op: SuperblockSecurityOp,
        cred: &TaskCredential,
        sb: &SuperBlockRef,
        context: &SbOpContext<'_>,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, sb, context);
        Ok(())
    }

    /// Task (process/thread) security checks.
    /// Called on operations that affect task state.
    ///
    /// Operations:
    ///   Alloc        -- new task being created (clone/fork)
    ///   Free         -- task being destroyed (cleanup hook, cannot deny)
    ///   Exec         -- execve() transition (profile change, label change)
    ///   SetPgid      -- changing process group
    ///   Setnice      -- changing scheduling priority
    ///   Setioprio    -- changing I/O priority
    ///   Setscheduler -- changing scheduling policy
    ///   Kill         -- sending a signal
    ///   Prctl        -- prctl() operation
    ///   Ptrace       -- ptrace attach/access (cross-reference: "Capability-Gated ptrace" in debugging-and-process-inspection)
    fn task_security(
        &self,
        op: TaskSecurityOp,
        cred: &TaskCredential,
        target: Option<&TaskCredential>,
        context: &TaskOpContext<'_>,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, target, context);
        Ok(())
    }

    /// Credential security checks.
    /// Called on credential allocation, preparation, and commitment.
    ///
    /// Operations:
    ///   Alloc        -- new credential being allocated (prepare_creds)
    ///   Free         -- credential being freed (cleanup, cannot deny)
    ///   Prepare      -- credential clone for modification
    ///   Commit       -- credential being published (commit_creds)
    ///   Transfer     -- credential being transferred to another task
    fn cred_security(
        &self,
        op: CredSecurityOp,
        old: &TaskCredential,
        new: Option<&TaskCredential>,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, old, new);
        Ok(())
    }

    /// Capability override hook.
    /// Called by ns_capable() (Section 9.2) after the bitfield check
    /// passes. Allows LSMs to deny capabilities that the task technically
    /// holds but the policy forbids (e.g., SELinux denying CAP_NET_ADMIN
    /// to a confined domain).
    fn capable(
        &self,
        cred: &TaskCredential,
        target_ns: &UserNamespace,
        cap: SystemCaps,
        audit: CapableAudit,
    ) -> Result<(), SecurityDenial> {
        let _ = (cred, target_ns, cap, audit);
        Ok(())
    }

    /// Socket security checks.
    /// Called on socket lifecycle and data operations.
    ///
    /// Operations:
    ///   Create       -- socket() syscall
    ///   Bind         -- bind() to an address
    ///   Connect      -- connect() to a remote address
    ///   Listen       -- listen() for incoming connections
    ///   Accept       -- accept() an incoming connection
    ///   Sendmsg      -- sending a message
    ///   Recvmsg      -- receiving a message
    ///   Shutdown     -- shutdown() a connection
    ///   Setsockopt   -- setting a socket option
    ///   Getsockopt   -- reading a socket option
    fn socket_security(
        &self,
        op: SocketSecurityOp,
        cred: &TaskCredential,
        sock: Option<&SocketRef>,
        context: &SocketOpContext<'_>,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, sock, context);
        Ok(())
    }

    /// IPC security checks (SysV shared memory, semaphores, message queues).
    ///
    /// Operations:
    ///   Alloc        -- creating an IPC object
    ///   Free         -- destroying an IPC object
    ///   Permission   -- permission check on IPC access
    ///   Associate    -- connecting to an existing IPC object
    fn ipc_security(
        &self,
        op: IpcSecurityOp,
        cred: &TaskCredential,
        ipc: &IpcObjRef,
        context: &IpcOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, ipc, context);
        Ok(())
    }

    /// Key management security checks (kernel keyring).
    ///
    /// Operations:
    ///   Alloc        -- allocating a new key
    ///   Free         -- freeing a key
    ///   Permission   -- permission check on key access
    fn key_security(
        &self,
        op: KeySecurityOp,
        cred: &TaskCredential,
        context: &KeyOpContext<'_>,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, context);
        Ok(())
    }

    /// BPF security checks.
    ///
    /// Operations:
    ///   ProgLoad     -- loading a BPF program
    ///   ProgAttach   -- attaching a BPF program to a hook
    ///   MapCreate    -- creating a BPF map
    ///   MapAccess    -- accessing a BPF map (read/write/delete)
    fn bpf_security(
        &self,
        op: BpfSecurityOp,
        cred: &TaskCredential,
        context: &BpfOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, context);
        Ok(())
    }

    /// Namespace security checks.
    /// UmkaOS-specific: called on namespace creation, joining, and
    /// cross-namespace operations.
    ///
    /// Operations:
    ///   Create       -- creating a new namespace (clone/unshare)
    ///   Join         -- setns() into an existing namespace
    ///   Destroy      -- namespace being destroyed (cleanup, cannot deny)
    fn namespace_security(
        &self,
        op: NamespaceSecurityOp,
        cred: &TaskCredential,
        ns_type: NamespaceType,
        context: &NsOpContext,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, cred, ns_type, context);
        Ok(())
    }

    /// KABI / capability validation security checks.
    /// Called during `validate_cap()` ([Section 12.3](12-kabi.md#kabi-bilateral-capability-exchange))
    /// when a driver validates a capability token for KABI dispatch.
    ///
    /// Operations:
    ///   CapValidate  -- capability being validated for dispatch
    ///   DriverLoad   -- KABI driver module being loaded
    ///   DriverUnload -- KABI driver module being unloaded
    ///
    /// The LSM can enforce: AppArmor profile restrictions on which drivers
    /// may use which capabilities, SELinux type transitions for capability
    /// usage, Landlock domain restrictions on driver operations.
    ///
    /// `lsm_check_cap_validate()` is the entry point called from
    /// `validate_cap()`. It dispatches to this hook with `op = CapValidate`.
    fn kabi_security(
        &self,
        op: KabiSecurityOp,
        subject: &TaskCredential,
        cap: &CapEntry,
        domain: &DriverDomain,
        perms: PermissionBits,
    ) -> Result<(), SecurityDenial> {
        let _ = (op, subject, cap, domain, perms);
        Ok(()) // Default: ALLOW
    }

    // ===== Blob initialization hooks =====
    // Called when kernel objects are created to initialize the LSM's
    // portion of the security blob.

    /// Initialize the LSM's blob region in a new credential.
    fn cred_init_blob(&self, cred_blob: &mut [u8]) {
        let _ = cred_blob;
    }

    /// Initialize the LSM's blob region in a new inode.
    fn inode_init_blob(&self, inode_blob: &mut [u8]) {
        let _ = inode_blob;
    }

    /// Initialize the LSM's blob region in a new superblock.
    fn superblock_init_blob(&self, sb_blob: &mut [u8]) {
        let _ = sb_blob;
    }

    /// Initialize the LSM's blob region in a new socket.
    fn socket_init_blob(&self, sock_blob: &mut [u8]) {
        let _ = sock_blob;
    }
}
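The mask-gated, AND-logic dispatch described above can be sketched in dependency-free Rust. This is a simplified analogue, not the kernel's real types: `MiniLsm`, `dispatch_file_open`, and the plain `u32` mask constants (standing in for `bitflags!`) are illustrative names only.

```rust
// Plain-u32 stand-ins for LsmHookMask bits.
pub const FILE_OPS: u32 = 1 << 0;
pub const SOCKET_OPS: u32 = 1 << 6;

pub trait MiniLsm {
    fn name(&self) -> &'static str;
    fn hooks_implemented(&self) -> u32;
    /// Default-allow: modules override only the hooks they care about.
    fn file_open(&self, path: &str) -> Result<(), &'static str> {
        let _ = path;
        Ok(())
    }
}

/// Toy MAC module that confines access to one path.
pub struct DenyShadow;
impl MiniLsm for DenyShadow {
    fn name(&self) -> &'static str { "deny_shadow" }
    fn hooks_implemented(&self) -> u32 { FILE_OPS }
    fn file_open(&self, path: &str) -> Result<(), &'static str> {
        if path == "/etc/shadow" { Err("denied") } else { Ok(()) }
    }
}

/// Module that only registers socket hooks: its FILE_OPS bit is unset,
/// so file dispatch never makes a virtual call into it.
pub struct SocketOnly;
impl MiniLsm for SocketOnly {
    fn name(&self) -> &'static str { "socket_only" }
    fn hooks_implemented(&self) -> u32 { SOCKET_OPS }
}

/// AND-logic dispatch: every module with the FILE_OPS bit must allow;
/// the first denial wins. The skip is one AND plus a branch.
pub fn dispatch_file_open(
    modules: &[&dyn MiniLsm],
    path: &str,
) -> Result<(), &'static str> {
    for m in modules {
        if m.hooks_implemented() & FILE_OPS == 0 {
            continue; // bit unset: no virtual call at all
        }
        m.file_open(path)?;
    }
    Ok(())
}
```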

9.8.3 Operation Discriminants

Each hook category uses an enum to identify the specific operation. Policy engines match on these discriminants to apply per-operation rules:

/// File-level security operations.
#[repr(u32)]
pub enum FileSecurityOp {
    Permission  = 0,
    Open        = 1,
    Receive     = 2,
    Lock        = 3,
    Mmap        = 4,
    Mprotect    = 5,
    Ioctl       = 6,
}
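As a sketch of how a policy engine keys rules off these discriminants, the following toy rule matches on `FileSecurityOp`. The W^X-style rule and the `wx_policy` helper are illustrative assumptions, not part of any shipped policy; the enum mirrors the one above.

```rust
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
#[repr(u32)]
pub enum FileSecurityOp {
    Permission  = 0,
    Open        = 1,
    Receive     = 2,
    Lock        = 3,
    Mmap        = 4,
    Mprotect    = 5,
    Ioctl       = 6,
}

/// Toy W^X rule: refuse making an existing file-backed mapping
/// executable after the fact; allow every other file operation.
pub fn wx_policy(op: FileSecurityOp, prot_exec: bool) -> Result<(), i32> {
    const EACCES: i32 = 13;
    match op {
        // Only the post-hoc Mprotect-to-exec transition is refused here;
        // Mmap with PROT_EXEC would be judged by a separate rule.
        FileSecurityOp::Mprotect if prot_exec => Err(EACCES),
        _ => Ok(()),
    }
}
```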

/// Inode-level security operations.
#[repr(u32)]
pub enum InodeSecurityOp {
    Create      = 0,
    Link        = 1,
    Unlink      = 2,
    Symlink     = 3,
    Mkdir       = 4,
    Rmdir       = 5,
    Mknod       = 6,
    Rename      = 7,
    Permission  = 8,
    Setattr     = 9,
    Getattr     = 10,
    Setxattr    = 11,
    Getxattr    = 12,
    Removexattr = 13,
    Listxattr   = 14,
}

/// Superblock-level security operations.
#[repr(u32)]
pub enum SuperblockSecurityOp {
    Mount       = 0,
    Umount      = 1,
    Remount     = 2,
    SetMntOpts  = 3,
    Statfs      = 4,
}

/// Task-level security operations.
#[repr(u32)]
pub enum TaskSecurityOp {
    Alloc        = 0,
    Free         = 1,
    Exec         = 2,
    SetPgid      = 3,
    Setnice      = 4,
    Setioprio    = 5,
    Setscheduler = 6,
    Kill         = 7,
    Prctl        = 8,
    Ptrace       = 9,
}

/// Credential-level security operations.
#[repr(u32)]
pub enum CredSecurityOp {
    Alloc       = 0,
    Free        = 1,
    Prepare     = 2,
    Commit      = 3,
    Transfer    = 4,
}

/// Socket-level security operations.
#[repr(u32)]
pub enum SocketSecurityOp {
    Create      = 0,
    Bind        = 1,
    Connect     = 2,
    Listen      = 3,
    Accept      = 4,
    Sendmsg     = 5,
    Recvmsg     = 6,
    Shutdown    = 7,
    Setsockopt  = 8,
    Getsockopt  = 9,
}

/// IPC-level security operations.
#[repr(u32)]
pub enum IpcSecurityOp {
    Alloc       = 0,
    Free        = 1,
    Permission  = 2,
    Associate   = 3,
}

/// Key management security operations.
#[repr(u32)]
pub enum KeySecurityOp {
    Alloc       = 0,
    Free        = 1,
    Permission  = 2,
}

/// BPF security operations.
#[repr(u32)]
pub enum BpfSecurityOp {
    ProgLoad    = 0,
    ProgAttach  = 1,
    MapCreate   = 2,
    MapAccess   = 3,
}

/// Namespace security operations (UmkaOS-specific).
#[repr(u32)]
pub enum NamespaceSecurityOp {
    Create      = 0,
    Join        = 1,
    Destroy     = 2,
}

/// Operations for the `kabi_security()` LSM hook.
#[repr(u32)]
pub enum KabiSecurityOp {
    /// Capability is being validated for KABI dispatch.
    CapValidate  = 0,
    /// KABI driver module is being loaded.
    DriverLoad   = 1,
    /// KABI driver module is being unloaded.
    DriverUnload = 2,
}

/// Audit control for capable() hook.
#[repr(u32)]
pub enum CapableAudit {
    /// Generate an audit record for this check.
    Audit       = 0,
    /// Suppress audit record (used for speculative checks like
    /// "would this operation succeed?" without generating noise).
    NoAudit     = 1,
}

// ===== Hook context structs =====
// Each hook category receives a context struct carrying operation-specific
// metadata. These are stack-allocated by the VFS/net/IPC layer at the call
// site and passed by reference — no heap allocation on the hook path.

/// Context passed to LSM superblock hooks during mount operations.
/// Populated by `do_mount()` ([Section 14.1](14-vfs.md#virtual-filesystem-layer)) before calling
/// `lsm_call_superblock_security()`.
pub struct SbOpContext<'a> {
    /// Mount flags (MS_RDONLY, MS_NOSUID, MS_NOEXEC, etc.).
    pub mnt_flags: u32,
    /// Filesystem type name (e.g., "ext4", "tmpfs", "nfs").
    /// `&'static str` because registered filesystem type names are static.
    pub fs_type: &'static str,
    /// Device path (e.g., "/dev/sda1"). `None` for pseudo-filesystems
    /// (tmpfs, procfs, sysfs) that have no backing block device.
    /// Lifetime `'a` — borrowed from the caller's stack during `do_mount()`.
    pub dev_path: Option<&'a CStr>,
    /// Target mount path — the directory where the filesystem will be
    /// mounted (e.g., "/mnt/data"). Required by SELinux for mount
    /// transition checks (source type × target path → allow/deny).
    /// References the resolved `Path` from `do_mount()`'s target lookup.
    /// Lifetime `'a` — borrowed from the path resolution result.
    pub target_path: &'a Path,
    /// Security context label from mount options, if present
    /// (e.g., `context=system_u:object_r:tmp_t:s0` for SELinux).
    /// `None` when no security label was specified in mount options.
    /// Lifetime `'a` — borrowed from the parsed mount options.
    pub security_label: Option<&'a CStr>,
    /// Superblock flags from the filesystem (SB_RDONLY, SB_NOSUID, etc.).
    pub sb_flags: u32,
}

/// Context passed to LSM inode hooks for operations that need additional
/// metadata beyond the target inode and credential.
///
/// The lifetime `'a` ensures that context references cannot outlive the hook
/// call. LSM modules must NOT store references from context structs beyond
/// the hook invocation — the borrow checker enforces this at the trait level.
pub struct InodeOpContext<'a> {
    /// For Create/Link/Symlink/Mkdir/Mknod/Rename: the parent directory.
    pub parent: Option<&'a InodeRef>,
    /// For Create/Mkdir/Mknod: the requested permission mode (e.g., 0o644).
    pub mode: u32,
    /// For Rename: the target (destination) inode, if it exists.
    pub target: Option<&'a InodeRef>,
    /// For Setxattr/Getxattr/Removexattr: the attribute name.
    pub xattr_name: Option<&'a CStr>,
}

/// Context passed to LSM task hooks.
///
/// Lifetime `'a` — all references are valid only for the duration of the
/// hook call. See `InodeOpContext<'a>` for rationale.
pub struct TaskOpContext<'a> {
    /// For Exec: the binary being executed (path + inode).
    pub binary: Option<&'a InodeRef>,
    /// For Exec with container profile: the OCI profile annotation
    /// (e.g., "apparmor=docker-default", "selinux=container_t").
    pub profile: Option<&'a CStr>,
    /// For Kill: the signal number.
    pub signo: u32,
    /// For Ptrace: the ptrace request (PTRACE_ATTACH, PTRACE_PEEK, etc.).
    pub ptrace_request: u32,
}

/// Context passed to LSM socket hooks.
///
/// Lifetime `'a` — all references are valid only for the duration of the
/// hook call. See `InodeOpContext<'a>` for rationale.
pub struct SocketOpContext<'a> {
    /// For Create: address family (AF_INET, AF_INET6, AF_UNIX, etc.).
    pub family: u16,
    /// For Create: socket type (SOCK_STREAM, SOCK_DGRAM, etc.).
    pub sock_type: u16,
    /// For Create: protocol number (IPPROTO_TCP, IPPROTO_UDP, etc.).
    pub protocol: u16,
    /// For Bind/Connect: the target address.
    pub addr: Option<&'a [u8]>,
}

/// Context passed to LSM IPC hooks.
pub struct IpcOpContext {
    /// IPC key (from `shmget()`, `semget()`, `msgget()`).
    pub key: u32,
    /// Requested permission flags.
    pub flags: u32,
}

/// Context passed to LSM key hooks.
///
/// Lifetime `'a` — all references are valid only for the duration of the
/// hook call. See `InodeOpContext<'a>` for rationale.
pub struct KeyOpContext<'a> {
    /// Key type (e.g., "user", "logon", "keyring").
    pub key_type: &'a str,
    /// Key description.
    pub description: Option<&'a CStr>,
    /// Requested permission mask.
    pub perm: u32,
}

/// Context passed to LSM BPF hooks.
pub struct BpfOpContext {
    /// For ProgLoad: BPF program type (kprobe, tracepoint, XDP, etc.).
    pub prog_type: u32,
    /// For MapCreate: BPF map type (hash, array, ringbuf, etc.).
    pub map_type: u32,
    /// For ProgAttach: the attach target (fd or path).
    pub attach_type: u32,
}

/// Context passed to LSM namespace hooks.
pub struct NsOpContext {
    /// For Create/Join: the parent namespace identifier.
    /// Uses `NamespaceId` (u64) instead of a raw byte slice or reference to
    /// avoid lifetime issues — namespaces have dynamic lifetimes, not `'static`.
    /// `None` when there is no parent (e.g., creating the init namespace).
    pub parent_ns: Option<NamespaceId>,
    /// Clone flags (for Create via clone/unshare).
    pub clone_flags: u64,
}
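The stack-allocation pattern for context structs can be sketched as follows. `hook_socket_bind` and `do_bind` are illustrative stand-ins for the real net-layer call site; the struct layout mirrors `SocketOpContext` above, and the toy bind rule is an assumption of this sketch.

```rust
pub struct SocketOpContext<'a> {
    pub family: u16,
    pub sock_type: u16,
    pub protocol: u16,
    pub addr: Option<&'a [u8]>,
}

/// Toy rule: only IPv4 stream sockets with a supplied address may bind.
pub fn hook_socket_bind(ctx: &SocketOpContext<'_>) -> Result<(), &'static str> {
    const AF_INET: u16 = 2;
    const SOCK_STREAM: u16 = 1;
    if ctx.family == AF_INET && ctx.sock_type == SOCK_STREAM && ctx.addr.is_some() {
        Ok(())
    } else {
        Err("bind denied")
    }
}

/// Call site: the context lives on this stack frame only. The lifetime
/// parameter means a hook cannot stash `addr` past the call, exactly the
/// property the spec requires of LSM modules.
pub fn do_bind(addr_bytes: &[u8]) -> Result<(), &'static str> {
    let ctx = SocketOpContext {
        family: 2,      // AF_INET
        sock_type: 1,   // SOCK_STREAM
        protocol: 6,    // IPPROTO_TCP
        addr: Some(addr_bytes),
    };
    hook_socket_bind(&ctx)
}
```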

/// Bitmask of hook categories that an LSM module implements.
/// Returned by `SecurityModule::hooks_implemented()`. The dispatch
/// macro checks this mask before calling any hook method — if the
/// bit is not set, the virtual call is skipped entirely.
///
/// This replaces Linux's per-hook `static_call` patching with a
/// portable bitmask check that works on all architectures without
/// binary rewriting. Cost: one AND + branch (~1 cycle) vs Linux's
/// NOP-to-call patching (arch-specific, ~0 cycles when disabled but
/// requires complex infrastructure). The bitmask approach is simpler,
/// debuggable, and has negligible overhead since hook sites are already
/// branching on `any_lsm_active`.
bitflags! {
    pub struct LsmHookMask: u32 {
        /// File descriptor operations (open, mmap, ioctl, lock).
        const FILE_OPS       = 1 << 0;
        /// Inode operations (create, unlink, setxattr, permission).
        const INODE_OPS      = 1 << 1;
        /// Superblock operations (mount, umount, remount, statfs).
        const SUPERBLOCK_OPS = 1 << 2;
        /// Task operations (fork, exec, signal, ptrace).
        const TASK_OPS       = 1 << 3;
        /// Credential operations (prepare, commit, transfer).
        const CRED_OPS       = 1 << 4;
        /// Capability override (ns_capable() post-check).
        const CAPABLE_OPS    = 1 << 5;
        /// Socket operations (create, bind, connect, sendmsg).
        const SOCKET_OPS     = 1 << 6;
        /// IPC operations (SysV shm, sem, msg).
        const IPC_OPS        = 1 << 7;
        /// Key management operations (kernel keyring).
        const KEY_OPS        = 1 << 8;
        /// BPF operations (prog load, map create, map access).
        const BPF_OPS        = 1 << 9;
        /// Namespace operations (create, join, destroy).
        const NAMESPACE_OPS  = 1 << 10;
        /// Blob initialization hooks (cred, inode, superblock, socket, file, task, etc.).
        /// BLOB_INIT has no corresponding StaticKey in LsmRegistry — blob init hooks
        /// are always called (they initialize required memory for every LSM-tracked
        /// object). The mask bit exists for registration accounting only; it cannot be
        /// gated because skipping blob init would leave uninitialized memory in the
        /// LSM blob regions. The `lsm_call!` dispatch macro MUST skip the StaticKey
        /// check for BLOB_INIT hooks — unconditional dispatch.
        const BLOB_INIT      = 1 << 11;
        /// KABI operations (driver load verification, capability exchange).
        const KABI_OPS       = 1 << 12;
    }
}

/// Denial returned by LSM hooks. Includes the denying module name
/// and an operation-specific reason code for audit logging.
pub struct SecurityDenial {
    /// Which LSM denied the operation.
    pub module: &'static str,
    /// errno to return to userspace (typically EACCES or EPERM).
    pub errno: i32,
    /// Optional reason string for the audit log. Always a compile-time
    /// string literal (e.g., "DENIED file_open") — the `'static`
    /// lifetime reflects that all denial reason strings are constants
    /// embedded in the LSM module's `.rodata` section. Dynamic data
    /// (file paths, UIDs) is formatted by the audit subsystem, not
    /// carried in this field.
    pub reason: Option<&'static str>,
}
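How a denial flows from hook to userspace can be sketched as follows. The `deny_open` and `to_syscall_ret` helpers are illustrative assumptions: they show the syscall layer mapping `SecurityDenial.errno` to a negative return value and formatting the dynamic detail (the path) outside the denial struct, as the field docs above require.

```rust
pub struct SecurityDenial {
    pub module: &'static str,
    pub errno: i32,
    pub reason: Option<&'static str>,
}

/// Toy hook result: a static-literal reason, no dynamic data.
pub fn deny_open() -> Result<(), SecurityDenial> {
    Err(SecurityDenial {
        module: "apparmor",
        errno: 13, // EACCES
        reason: Some("DENIED file_open"),
    })
}

/// Convert a hook result into the value returned to userspace, plus the
/// audit line. Dynamic detail (the path) is formatted here by the caller,
/// never carried inside SecurityDenial itself.
pub fn to_syscall_ret(r: Result<(), SecurityDenial>, path: &str) -> (i32, String) {
    match r {
        Ok(()) => (0, String::new()),
        Err(d) => {
            let line = format!("{}: {} {}", d.module, d.reason.unwrap_or("denied"), path);
            (-d.errno, line) // the real kernel would hand `line` to audit
        }
    }
}
```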

9.8.4 Security Blob Model

Each kernel object that LSMs can attach state to carries an opaque security blob. The blob is a single contiguous allocation divided into regions, one per registered LSM. This avoids per-LSM heap allocations and provides cache-friendly access.
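The region-assignment step can be sketched as a prefix sum over the registered sizes. `assign_cred_offsets` is an illustrative helper, not the kernel API, and the 8-byte alignment of each region boundary is an assumption of this sketch (the spec does not state an alignment rule).

```rust
/// Assign each LSM's offset within the cred blob as the running sum of
/// earlier LSMs' sizes, rounding each boundary up to 8 bytes (assumed
/// here) so pointer-sized fields inside a region stay aligned.
/// Returns (per-LSM offsets, total blob size = slab object size).
pub fn assign_cred_offsets(sizes: &[usize]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(sizes.len());
    let mut next = 0usize;
    for &sz in sizes {
        offsets.push(next);
        // round the running total so the NEXT region starts aligned
        next = (next + sz + 7) & !7;
    }
    (offsets, next)
}
```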

/// Per-object-type blob size requirements, reported by each LSM during
/// registration via `SecurityModule::blob_sizes()`.
///
/// Fields match Linux `struct lsm_blob_sizes` (17 fields as of Linux 6.12+).
/// Previously had 6 fields; expanded to 17 per SF-251 fix.
pub struct LsmBlobSizes {
    // --- Phase 1 blob types ---
    /// Bytes needed in each TaskCredential's blob.
    pub cred: usize,
    /// Bytes needed in each inode's blob.
    pub inode: usize,
    /// Bytes needed in each superblock's blob.
    pub superblock: usize,
    /// Bytes needed in each socket's blob.
    pub socket: usize,
    /// Bytes needed in each IPC object's blob.
    pub ipc: usize,
    /// Bytes needed in each key's blob.
    pub key: usize,
    /// Bytes needed in each open file's blob.
    pub file: usize,
    /// Bytes needed in each task's blob (separate from cred).
    pub task: usize,
    /// Bytes needed in each network namespace's blob.
    pub net_ns: usize,
    /// Bytes needed for per-inode xattr count tracking.
    pub xattr_count: usize,

    // --- Phase 2 blob types ---
    /// Bytes needed in each msg_msg's blob.
    pub msg_msg: usize,
    /// Bytes needed in each block device's blob.
    pub bdev: usize,
    /// Bytes needed in each perf_event's blob.
    pub perf_event: usize,
    /// Bytes needed in each BPF map's blob.
    pub bpf_map: usize,
    /// Bytes needed in each BPF program's blob.
    pub bpf_prog: usize,
    /// Bytes needed in each BPF token's blob.
    pub bpf_token: usize,

    // --- Phase 3 blob types ---
    /// Bytes needed in each InfiniBand partition's blob.
    pub ib: usize,
}

impl LsmBlobSizes {
    /// Return the blob size for the given object kind.
    pub const fn for_kind(&self, kind: LsmBlobKind) -> usize {
        match kind {
            LsmBlobKind::Cred       => self.cred,
            LsmBlobKind::Inode      => self.inode,
            LsmBlobKind::Superblock => self.superblock,
            LsmBlobKind::Socket     => self.socket,
            LsmBlobKind::Ipc        => self.ipc,
            LsmBlobKind::Key        => self.key,
            LsmBlobKind::File       => self.file,
            LsmBlobKind::Task       => self.task,
            LsmBlobKind::NetNs      => self.net_ns,
            LsmBlobKind::XattrCount => self.xattr_count,
            LsmBlobKind::MsgMsg     => self.msg_msg,
            LsmBlobKind::Bdev       => self.bdev,
            LsmBlobKind::PerfEvent  => self.perf_event,
            LsmBlobKind::BpfMap     => self.bpf_map,
            LsmBlobKind::BpfProg    => self.bpf_prog,
            LsmBlobKind::BpfToken   => self.bpf_token,
            LsmBlobKind::Ib         => self.ib,
        }
    }
}

/// Opaque security blob attached to kernel objects (cred, inode, etc.).
/// Layout: [ LSM_0 data | LSM_1 data | LSM_2 data | ... ]
/// Each LSM's region starts at its registered offset (assigned during
/// LSM registration, fixed for the boot lifetime).
///
/// Blob memory is slab-allocated from a per-object-type slab cache
/// (e.g., `cred_lsm_blob_cache`, `inode_lsm_blob_cache`). The slab
/// object size is the sum of all registered LSMs' blob sizes for that
/// object type, computed once during LSM initialization. The owning
/// object stores the blob inline (e.g., `TaskCredential.lsm_blob` is a
/// `NonNull<LsmBlob>` into the same slab allocation as the credential).
///
/// Access pattern: each LSM calls `lsm_blob.get_mut::<T>(offset)` where
/// `offset` is the byte offset of that LSM's private region (from
/// `LsmBlobOffsets`).
pub struct LsmBlob {
    /// Total size of the blob data in bytes (sum of all active LSMs' requirements).
    /// Stored so that debug tooling can inspect blob contents without consulting
    /// the LSM registry.
    pub data_len: u32,
    pub _pad: u32,
    /// Blob data begins immediately after this header in the slab allocation.
    /// Access via `lsm_blob_data_ptr(blob)` which returns `*mut u8` to the
    /// byte immediately after this struct header. Not a Rust reference —
    /// lifetimes are managed by the slab allocator and the owning object.
    pub _data: [u8; 0],
}
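Region access into the contiguous blob can be sketched with safe slices in place of the kernel's raw `lsm_blob_data_ptr()` arithmetic. `Blob`, `region_mut`, and the two-region layout in `demo` are illustrative only.

```rust
/// Safe-slice stand-in for the kernel blob: one contiguous allocation,
/// each LSM confined to its own sub-slice by (offset, len).
pub struct Blob {
    data: Vec<u8>,
}

impl Blob {
    /// Zero-initialized, matching what the blob init hooks expect to see.
    pub fn new(total: usize) -> Self {
        Blob { data: vec![0u8; total] }
    }

    /// Mutable view of one LSM's private region.
    pub fn region_mut(&mut self, offset: usize, len: usize) -> &mut [u8] {
        &mut self.data[offset..offset + len]
    }
}

/// Two LSMs write into disjoint regions of one 16-byte blob
/// (bytes 0..8 and 8..16, an illustrative layout).
pub fn demo() -> (u8, u8) {
    let mut blob = Blob::new(16);
    blob.region_mut(0, 8)[0] = 0xAA;
    blob.region_mut(8, 8)[0] = 0xBB;
    (blob.data[0], blob.data[8])
}
```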

/// Global LSM registry state. Initialized during boot, immutable after
/// all LSMs are registered (before the first user process runs).
pub struct LsmRegistry {
    /// Registered LSM modules, in evaluation order.
    /// Order: IMA first (integrity), then MAC (SELinux/AppArmor), then
    /// Landlock (unprivileged sandboxing). Within each priority tier,
    /// registration order (determined by kernel command line `lsm=`
    /// parameter or compiled-in default).
    pub modules: ArrayVec<&'static dyn SecurityModule, MAX_LSM_MODULES>,

    /// Per-LSM blob offsets for each object type. Indexed by LSM
    /// registration index (0..modules.len()). Each entry records the
    /// byte offset within the LsmBlob where this LSM's region begins.
    pub blob_offsets: ArrayVec<LsmBlobOffsets, MAX_LSM_MODULES>,

    /// Total blob size per object type (sum of all LSMs' requirements).
    /// Used to size the slab caches.
    pub total_blob_sizes: LsmBlobSizes,

    /// Static key: true if at least one LSM is registered.
    /// When false, all hook call sites are NOPs (zero overhead).
    /// Patched to true during LSM registration (one-time, at boot).
    pub any_lsm_active: StaticKey,

    /// Global LSM policy generation counter. Incremented atomically
    /// whenever any LSM policy is loaded or replaced:
    /// - SELinux: incremented in `selinux_policy_load()` step 7.
    /// - AppArmor: incremented on `.replace` and `.load` profile writes.
    /// - BPF-LSM: incremented on `bpf_prog_attach(BPF_LSM_MAC)`.
    ///
    /// `CapValidationToken.policy_gen` snapshots this value at token
    /// creation time. A mismatch at KABI dispatch forces re-validation
    /// of the LSM decision cached in the token. This ensures that a
    /// policy reload invalidates all cached capability validation tokens
    /// without requiring an explicit sweep.
    ///
    /// Ordering: writers use `Release` (after policy swap completes);
    /// readers use `Acquire` (before checking cached decisions).
    pub policy_generation: AtomicU64,

    /// Per-hook-category static keys. Each is true if at least one
    /// registered LSM overrides that category. Allows per-category
    /// zero-overhead bypass.
    pub file_hooks_active: StaticKey,
    pub inode_hooks_active: StaticKey,
    pub task_hooks_active: StaticKey,
    pub cred_hooks_active: StaticKey,
    pub capable_hooks_active: StaticKey,
    pub socket_hooks_active: StaticKey,
    pub ipc_hooks_active: StaticKey,
    pub key_hooks_active: StaticKey,
    pub bpf_hooks_active: StaticKey,
    pub namespace_hooks_active: StaticKey,
    pub superblock_hooks_active: StaticKey,
    pub kabi_hooks_active: StaticKey,
}
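The `policy_generation` protocol can be sketched with a standard-library atomic. `POLICY_GEN`, `Token`, and the function names are toy stand-ins; the ordering discipline (Release on the writer after the policy swap, Acquire on readers before trusting a cached decision) follows the field docs above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Global policy generation counter (toy stand-in for the registry field).
pub static POLICY_GEN: AtomicU64 = AtomicU64::new(0);

/// Cached validation token snapshotting the generation at creation time.
pub struct Token {
    pub policy_gen: u64,
}

pub fn mint_token() -> Token {
    Token { policy_gen: POLICY_GEN.load(Ordering::Acquire) }
}

/// Called by the policy loader after the new policy is fully published;
/// the Release ordering makes the swap visible before the bump.
pub fn policy_reload() {
    POLICY_GEN.fetch_add(1, Ordering::Release);
}

/// KABI dispatch check: a stale token forces re-validation of the cached
/// LSM decision instead of trusting it.
pub fn token_is_fresh(t: &Token) -> bool {
    t.policy_gen == POLICY_GEN.load(Ordering::Acquire)
}
```

A policy reload thus invalidates every outstanding token in O(1), with no sweep over cached tokens.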

/// Per-LSM blob offset table. One entry per object type that carries an
/// LSM blob. Indexed by `LsmBlobKind` via `offset_for_kind()`.
/// Fields match `LsmBlobSizes` 1:1. Both structs MUST have the same number
/// of fields; the const_assert below enforces this at compile time (SF-105 fix).
pub struct LsmBlobOffsets {
    // Phase 1
    pub cred: usize,
    pub inode: usize,
    pub superblock: usize,
    pub socket: usize,
    pub ipc: usize,
    pub key: usize,
    pub file: usize,
    pub task: usize,
    pub net_ns: usize,
    pub xattr_count: usize,
    // Phase 2
    pub msg_msg: usize,
    pub bdev: usize,
    pub perf_event: usize,
    pub bpf_map: usize,
    pub bpf_prog: usize,
    pub bpf_token: usize,
    // Phase 3
    pub ib: usize,
}
// SF-105 fix: ensure LsmBlobSizes and LsmBlobOffsets stay in sync.
// Both structs are composed entirely of `usize` fields, so identical
// size implies identical field count. If a field is added to one but
// not the other, this assertion fails at compile time.
const_assert!(size_of::<LsmBlobSizes>() == size_of::<LsmBlobOffsets>());

/// Maximum number of simultaneously registered LSM modules.
/// Linux supports up to ~10; UmkaOS allows 8 (IMA + SELinux/AppArmor +
/// Landlock + Yama + LoadPin + SafeSetID + Lockdown + BPF-LSM).
const MAX_LSM_MODULES: usize = 8;

Blob kind discriminant (selects the correct offset field from LsmBlobOffsets):

/// Discriminant for selecting the correct blob offset field.
/// Each variant corresponds to a kernel object type that carries an LSM blob.
///
/// Matches Linux `struct lsm_blob_sizes` fields from `include/linux/lsm_hooks.h`
/// (17 blob types as of Linux 6.12+). Previously split across `LsmBlobKind` and
/// `LsmBlobType` — unified into this single enum per SF-241 fix.
///
/// **Phase assignment** (per decision B-11):
/// - Phase 1 (immediate): Cred, Inode, Superblock, Socket, Ipc, Key, File, Task, NetNs, XattrCount
/// - Phase 2: MsgMsg, Bdev, PerfEvent, BpfMap, BpfProg, BpfToken
/// - Phase 3 (defer): Ib, TunDev
#[derive(Copy, Clone, PartialEq, Eq)]
#[repr(u8)]
pub enum LsmBlobKind {
    // --- Phase 1 blob types (required for basic SELinux/AppArmor operation) ---
    Cred       = 0,
    Inode      = 1,
    Superblock = 2,
    Socket     = 3,
    Ipc        = 4,
    Key        = 5,
    /// Per-open-file security context. Used by SELinux for per-file security
    /// labels and by AppArmor for file delegation tracking. Allocated in
    /// `file_alloc_security()` during `do_filp_open()`, freed in `fput()`.
    File       = 6,
    /// Per-task security context (separate from per-credential). Used by
    /// SELinux for task-level security context and by Yama for ptrace scope.
    /// Allocated in `task_alloc()` during `copy_process()`, freed in
    /// `task_free()` during `__put_task_struct()`.
    Task       = 7,
    /// Per-network-namespace security context. Used by SELinux for network
    /// namespace labeling. Allocated in `netns_alloc_security()` during
    /// `copy_net_ns()`, freed in `netns_free_security()` in `net_drop_ns()`.
    NetNs      = 8,
    /// Per-inode xattr count for EVM/IMA. Framework use (not per-LSM).
    /// Tracks the number of security xattrs on an inode for efficient
    /// EVM HMAC computation. Allocated alongside inode blob.
    XattrCount = 9,

    // --- Phase 2 blob types (SysV IPC, eBPF, storage, observability) ---
    /// Per-IPC-message security context. Used by SELinux for SysV IPC
    /// message labeling. Allocated in `msg_msg_alloc_security()`, freed
    /// in `msg_msg_free_security()`.
    MsgMsg     = 10,
    /// Per-block-device security context. Added in Linux 6.x. Used by
    /// SELinux for block device labeling. Allocated in `bdev_alloc_security()`,
    /// freed in `bdev_free_security()`.
    Bdev       = 11,
    /// Per-perf-event security context. Used by SELinux for perf event
    /// access control. Allocated in `perf_event_alloc()`, freed in
    /// `perf_event_free()`.
    PerfEvent  = 12,
    /// Per-BPF-map security context. Used by SELinux for BPF map access
    /// control. Allocated in `bpf_map_alloc_security()`, freed in
    /// `bpf_map_free_security()`.
    BpfMap     = 13,
    /// Per-BPF-program security context. Used by SELinux for BPF program
    /// access control. Allocated in `bpf_prog_alloc_security()`, freed in
    /// `bpf_prog_free_security()`.
    BpfProg    = 14,
    /// Per-BPF-token security context. Used for BPF token delegation
    /// access control. Allocated in `bpf_token_alloc_security()`, freed in
    /// `bpf_token_free_security()`.
    BpfToken   = 15,

    // --- Phase 3 blob types (specialized subsystems) ---
    /// Per-InfiniBand partition security context. Used by SELinux for IB
    /// partition labeling. Allocated in `ib_alloc_security()`, freed in
    /// `ib_free_security()`.
    Ib         = 16,
    // NOTE: Linux mainline now includes `lbs_tun_dev` in `struct lsm_blob_sizes`
    // (include/linux/lsm_hooks.h), though tun.c still also uses a direct
    // `void *security` pointer. UmkaOS defers TunDev blob support to Phase 3
    // alongside Ib. When added: TunDev = 17 here, tun_dev: usize in both
    // LsmBlobSizes and LsmBlobOffsets, and LSM_BLOB_KIND_COUNT → 18.
}

impl LsmBlobOffsets {
    /// Return the blob offset for the given object kind.
    pub const fn offset_for_kind(&self, kind: LsmBlobKind) -> usize {
        match kind {
            LsmBlobKind::Cred       => self.cred,
            LsmBlobKind::Inode      => self.inode,
            LsmBlobKind::Superblock => self.superblock,
            LsmBlobKind::Socket     => self.socket,
            LsmBlobKind::Ipc        => self.ipc,
            LsmBlobKind::Key        => self.key,
            LsmBlobKind::File       => self.file,
            LsmBlobKind::Task       => self.task,
            LsmBlobKind::NetNs      => self.net_ns,
            LsmBlobKind::XattrCount => self.xattr_count,
            LsmBlobKind::MsgMsg     => self.msg_msg,
            LsmBlobKind::Bdev       => self.bdev,
            LsmBlobKind::PerfEvent  => self.perf_event,
            LsmBlobKind::BpfMap     => self.bpf_map,
            LsmBlobKind::BpfProg    => self.bpf_prog,
            LsmBlobKind::BpfToken   => self.bpf_token,
            LsmBlobKind::Ib         => self.ib,
        }
    }
}

Blob access pattern (used by LSM implementations to read/write their blob region):

impl LsmBlob {
    /// Get a typed reference to this LSM's region within the blob.
    /// `lsm_index` is the module's registration index; `kind` selects the
    /// blob offset field for the kernel object type. `T` is the LSM's
    /// private per-object state struct. (`kind` is a runtime parameter
    /// because enum-valued const generics are not available in stable Rust;
    /// `offset_for_kind` is a `const fn`, so the compiler can still fold
    /// the lookup when `kind` is statically known.)
    ///
    /// Example: `blob.get::<SelinuxInode>(selinux_idx, LsmBlobKind::Inode)`
    ///
    /// # Safety
    /// The caller must ensure T matches the blob layout registered by
    /// the LSM at `lsm_index`, and that `size_of::<T>()` equals the
    /// blob size the LSM requested for this object type.
    pub unsafe fn get<T>(&self, lsm_index: usize, kind: LsmBlobKind) -> &T {
        let offset = LSM_REGISTRY.blob_offsets[lsm_index].offset_for_kind(kind);
        // SAFETY: offset and size were validated during LSM registration
        &*(self._data.as_ptr().add(offset) as *const T)
    }

    /// Mutable variant of `get()`. Used during blob initialization and
    /// by LSMs that mutate per-object state (e.g., the SELinux label cache).
    pub unsafe fn get_mut<T>(&mut self, lsm_index: usize, kind: LsmBlobKind) -> &mut T {
        let offset = LSM_REGISTRY.blob_offsets[lsm_index].offset_for_kind(kind);
        // SAFETY: offset and size were validated during LSM registration
        &mut *(self._data.as_mut_ptr().add(offset) as *mut T)
    }
}
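The offset-based access pattern above can be modeled in plain Rust with a byte buffer standing in for the slab-backed blob. This is an illustrative sketch, not the real API: `SelinuxInodeState` is a made-up per-LSM struct, and `read_unaligned`/`write_unaligned` sidestep alignment concerns that the real blob layout solves by construction.

```rust
/// Illustrative per-LSM state struct (not the real SELinux type).
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct SelinuxInodeState {
    pub sid: u32,
    pub task_sid: u32,
}

/// Write a typed value into a shared blob at a registration-time offset.
pub fn blob_write<T: Copy>(blob: &mut [u8], offset: usize, value: T) {
    assert!(offset + core::mem::size_of::<T>() <= blob.len());
    // SAFETY: bounds checked above; unaligned writes are valid for Copy types.
    unsafe { (blob.as_mut_ptr().add(offset) as *mut T).write_unaligned(value) }
}

/// Read a typed value back from the same region.
pub fn blob_read<T: Copy>(blob: &[u8], offset: usize) -> T {
    assert!(offset + core::mem::size_of::<T>() <= blob.len());
    // SAFETY: bounds checked above.
    unsafe { (blob.as_ptr().add(offset) as *const T).read_unaligned() }
}
```

A zeroed buffer also demonstrates why zero-initialization is a safe default: a freshly allocated region reads back as an all-zero state struct.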

9.8.4.1 Blob Allocation and Deallocation Lifecycle

Blob memory is managed by the framework, not by individual LSMs. Each kernel object type has a specific allocation and deallocation integration point:

/// Allocate and initialize an LSM blob for a kernel object.
///
/// Called by the framework at object creation time. Each object type has
/// a specific integration point:
///
/// | Object type | Allocation site | Deallocation site | Phase |
/// |-------------|----------------|-------------------|-------|
/// | Credential  | `prepare_creds()` (fork, exec, setuid) | `put_cred()` when refcount→0 | 1 |
/// | Inode       | `inode_alloc_security()` inside `new_inode()` | `inode_free_security()` in `evict_inode()` | 1 |
/// | Superblock  | `sb_alloc_security()` in `mount()` | `sb_free_security()` in `kill_sb()` | 1 |
/// | Socket      | `socket_alloc_security()` in `sock_alloc()` | `socket_free_security()` in `sock_release()` | 1 |
/// | IPC (msg/sem/shm) | `ipc_alloc_security()` in `ipc_get()` | `ipc_free_security()` in `ipc_rcu_free()` | 1 |
/// | Key (keyring) | `key_alloc_security()` in `key_alloc()` | `key_free_security()` in `key_gc()` | 1 |
/// | File        | `file_alloc_security()` in `do_filp_open()` | `file_free_security()` in `fput()` | 1 |
/// | Task        | `task_alloc()` in `copy_process()` | `task_free()` in `__put_task_struct()` | 1 |
/// | NetNs       | `netns_alloc_security()` in `copy_net_ns()` | `netns_free_security()` in `net_drop_ns()` | 1 |
/// | XattrCount  | Alongside inode blob (shared slab) | Alongside inode blob (shared free) | 1 |
/// | MsgMsg      | `msg_msg_alloc_security()` in `load_msg()` | `msg_msg_free_security()` in `free_msg()` | 2 |
/// | Bdev        | `bdev_alloc_security()` in `bdev_alloc()` | `bdev_free_security()` in `bdev_free()` | 2 |
/// | PerfEvent   | `perf_event_alloc()` in `perf_event_open()` | `perf_event_free()` in `__perf_event_release_kernel()` | 2 |
/// | BpfMap      | `bpf_map_alloc_security()` in `map_create()` | `bpf_map_free_security()` in `bpf_map_free_deferred()` | 2 |
/// | BpfProg     | `bpf_prog_alloc_security()` in `bpf_prog_load()` | `bpf_prog_free_security()` in `bpf_prog_free_deferred()` | 2 |
/// | BpfToken    | `bpf_token_alloc_security()` in `bpf_token_create()` | `bpf_token_free_security()` in `bpf_token_free()` | 2 |
/// | Ib          | `ib_alloc_security()` in ib_mad registration | `ib_free_security()` in ib_mad deregistration | 3 |
///
/// All allocations come from per-type slab caches created during `lsm_init()`.
/// The slab object size equals `LSM_REGISTRY.total_blob_sizes.<type>` (sum of
/// all registered LSMs' requirements for that type). If no LSM needs a blob
/// for a given type (total == 0), the allocation is skipped and the object's
/// blob pointer is set to a sentinel `EMPTY_LSM_BLOB` (static, zero-length).
///
/// After allocation, the framework calls each LSM's type-specific init hook
/// (e.g., `cred_init_security`, `inode_init_security`) in registration order.
/// Each hook initializes its own region within the blob.
pub fn lsm_blob_alloc(obj_type: LsmBlobKind) -> NonNull<LsmBlob> {
    let size = LSM_REGISTRY.total_blob_sizes.for_kind(obj_type);
    if size == 0 {
        return NonNull::from(&EMPTY_LSM_BLOB);
    }
    let cache = LSM_SLAB_CACHES[obj_type as usize].get()
        .expect("LSM slab cache not initialized for this type");
    // Zero-initialized allocation: the safe default for every LSM's region.
    let mut blob = NonNull::new(cache.alloc_zeroed())
        .expect("LSM blob slab exhausted");
    // SAFETY: `blob` is non-null, freshly allocated, and exclusively owned.
    unsafe { blob.as_mut().data_len = size as u32; }
    blob
}

/// Free an LSM blob. Called by the framework at object deallocation time.
/// Before freeing, calls each LSM's type-specific free hook in reverse
/// registration order (so that the last-registered module cleans up first,
/// avoiding use-after-free between modules with cross-dependencies).
pub fn lsm_blob_free(blob: NonNull<LsmBlob>, obj_type: LsmBlobKind) {
    if blob == NonNull::from(&EMPTY_LSM_BLOB) { return; }
    // NOTE: each LSM's type-specific free hook (e.g., inode_free_security)
    // is invoked here by the framework, in reverse registration order,
    // before the blob memory is returned to the slab.
    let cache = LSM_SLAB_CACHES[obj_type as usize].get()
        .expect("LSM slab cache not initialized for this type");
    cache.free(blob.as_ptr());
}

/// Number of LsmBlobKind variants. Derived from the maximum discriminant value.
/// When adding a new variant, update this constant to `(NewVariant as usize) + 1`.
/// The const_assert below ensures consistency.
const LSM_BLOB_KIND_COUNT: usize = (LsmBlobKind::Ib as usize) + 1;

/// Per-type slab caches for LSM blobs, created during lsm_init().
/// Indexed by LsmBlobKind discriminant. Initialized once at boot via `set()`.
/// Array size derived from `LSM_BLOB_KIND_COUNT` — self-sizing on variant addition.
static LSM_SLAB_CACHES: [OnceCell<SlabCache>; LSM_BLOB_KIND_COUNT] = [OnceCell::new(); LSM_BLOB_KIND_COUNT];

// Verify array covers all LsmBlobKind variants (catches forgotten updates).
const_assert!(LSM_BLOB_KIND_COUNT == 17);

/// Sentinel blob for object types where no LSM needs storage.
static EMPTY_LSM_BLOB: LsmBlob = LsmBlob { data_len: 0, _pad: 0, _data: [] };

9.8.4.2 LsmHookMask

Canonical definition: See LsmHookMask bitflags in the Hook Infrastructure section above (12 flags, bits 0–11). That is the authoritative definition with correct bit assignments and the full set of hook categories.

9.8.5 Hook Dispatch and Stacking

The framework dispatches hooks to all registered modules. The dispatch uses static keys for zero-overhead bypass when no module is registered for a given category.

// Macro-generated hook dispatch (conceptual expansion):

lsm_call_file_security(op, cred, file) -> Result<(), SecurityDenial>:
  // Static key check: compiled to NOP when no file hooks registered
  if !static_key_enabled(LSM_REGISTRY.file_hooks_active):
      return Ok(())

  // Iterate registered modules in priority order.
  // Skip modules that do not implement this hook category (LsmHookMask check).
  for module in LSM_REGISTRY.modules:
      if !module.hooks_implemented().contains(LsmHookMask::FILE_OPS):
          continue
      result = module.file_security(op, cred, file)
      if result.is_err():
          // AND logic: first denial wins. Log the denial via audit
          // subsystem (Section 20.1) if auditing is enabled.
          audit_lsm_denial(module.name(), op, cred, file)
          return result
  Ok(())

Stacking order: Modules are called in priority order (lower priority value = earlier):

| Priority | Module category | Examples | Rationale |
|----------|-----------------|----------|-----------|
| 11 | Integrity | IMA | Must measure/appraise before MAC decisions |
| 21 | Mandatory Access Control | SELinux, AppArmor | Primary policy enforcement |
| 30 | Supplementary | Landlock, Yama, LoadPin | Additional restrictions |
| 40 | BPF-LSM | User-loaded BPF security programs | Most dynamic, least trusted |

Within the same priority tier, modules are called in the order specified by the lsm= kernel command line parameter (e.g., lsm=integrity,selinux,landlock,bpf). If no lsm= parameter is provided, the compiled-in default order is used.
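The combined ordering (priority tier first, command-line position within a tier) amounts to a stable sort on priority. A minimal sketch, with module names and signature chosen for illustration:

```rust
/// Input: (name, priority) pairs in lsm= command-line order.
/// A stable sort by priority preserves command-line order within a tier.
pub fn lsm_call_order(cmdline: &[(&'static str, u8)]) -> Vec<&'static str> {
    let mut modules: Vec<(&'static str, u8)> = cmdline.to_vec();
    modules.sort_by_key(|&(_, prio)| prio); // stable: ties keep cmdline order
    modules.into_iter().map(|(name, _)| name).collect()
}
```

With `lsm=selinux,integrity,landlock,yama`, integrity (priority 11) is hoisted first, while landlock and yama (both priority 30) stay in command-line order.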

AND-logic semantics: ALL modules must return Ok(()) for the operation to proceed. If module A returns Ok(()) but module B returns Err(SecurityDenial), the operation is denied. The first denial is returned to the caller; subsequent modules are NOT called (short-circuit evaluation). This is safe because LSM hooks only restrict (deny), never grant -- skipping later modules after a denial cannot cause a false-allow.
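The AND-logic short-circuit can be sketched as a loop over registered modules; the `Module` struct and string-typed denial stand in for the real registry and `SecurityDenial`:

```rust
/// Illustrative hook result; the real framework returns SecurityDenial.
pub type HookResult = Result<(), &'static str>;

/// Stand-in for a registered LSM: a name plus a file-open decision.
pub struct Module {
    pub name: &'static str,
    pub allow_open: bool,
}

/// AND-logic dispatch: every module must allow; the first denial wins
/// and later modules are not consulted (short-circuit evaluation).
pub fn dispatch_file_open(modules: &[Module]) -> HookResult {
    for m in modules {
        if !m.allow_open {
            return Err(m.name); // deny: skipping later modules is safe,
                                // since hooks can only restrict, never grant
        }
    }
    Ok(())
}
```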

9.8.6 Per-Namespace LSM Profiles

Container runtimes set LSM profiles on a per-task basis, typically at execve() time. The profile is stored in the task's credential LSM blob.

AppArmor: Profile name stored in the credential blob. Changed via:

  • apparmor_parser writing to /sys/kernel/security/apparmor/.replace
  • aa_change_onexec() in the task's own code
  • The OCI runtime setting the profile before execve() of the container entrypoint

SELinux: Security context (label) stored in the credential blob. Changed via:

  • setcon() / setexeccon() before execve()
  • Policy-defined type transitions on exec

Profile application at container creation:

container_exec_with_lsm_profile(task, binary, profile_spec):
  1. new_cred = prepare_creds(task)
  2. For each active LSM:
       result = module.task_security(Exec, task.cred, None, &TaskOpContext::<'_> {
           binary: binary,
           profile: profile_spec,  // OCI annotation: "apparmor=docker-default"
       })
       if result.is_err():
           abort_creds(new_cred)
           return result
  3. // LSM updates its blob region in new_cred with the new profile
  4. commit_creds(task, new_cred)
  5. execve(binary)

Landlock: unprivileged sandboxing:

Landlock is unique among LSMs: it does not require CAP_MAC_ADMIN or any capability. Any process can restrict itself via landlock_create_ruleset(), landlock_add_rule(), landlock_restrict_self(). This is self-restriction only -- a process can reduce its own access rights but cannot affect other processes.

/// Landlock ruleset, stored per-credential in the LSM blob.
/// Each landlock_restrict_self() call creates a new layer that
/// further restricts the previous layer (stacking via intersection).
pub struct LandlockCredBlob {
    /// Stack of restriction layers. Each layer is an immutable ruleset.
    /// Access is permitted only if ALL layers permit it (AND logic
    /// within the Landlock module, on top of the inter-module AND logic).
    pub domain: Option<Arc<LandlockDomain>>,
}

pub struct LandlockDomain {
    /// Parent domain (from the previous landlock_restrict_self() call).
    /// None for the outermost restriction layer.
    pub parent: Option<Arc<LandlockDomain>>,
    /// Ruleset defining what this layer permits.
    pub ruleset: LandlockRuleset,
}

9.8.6.1 Landlock Syscall Interface

Three syscalls (added in Linux 5.13, numbers identical across all UmkaOS architectures):

/// landlock_create_ruleset(2) — create a new Landlock ruleset.
/// Syscall number: 444 on every supported architecture (x86-64, AArch64,
/// ARMv7, RISC-V, PPC32, PPC64LE).
///
/// Returns a file descriptor referring to the new ruleset.
/// The returned fd supports `close()` and `landlock_add_rule()`.
/// When `flags` contains `LANDLOCK_CREATE_RULESET_VERSION`, `attr` must be
/// NULL and `size` must be 0; the return value is the highest supported
/// ABI version (not a file descriptor).
pub fn sys_landlock_create_ruleset(
    attr: UserPtr<LandlockRulesetAttr>,
    size: usize,
    flags: u32,  // LANDLOCK_CREATE_RULESET_VERSION for ABI query
) -> Result<Fd, Errno>;

/// landlock_add_rule(2) — add a rule to a ruleset.
/// Syscall number: 445 (all architectures).
///
/// The ruleset fd must have been created by `landlock_create_ruleset()`
/// and must not yet have been passed to `landlock_restrict_self()`.
/// After enforcement the ruleset is immutable; adding rules returns EINVAL.
pub fn sys_landlock_add_rule(
    ruleset_fd: Fd,
    rule_type: LandlockRuleType,
    rule_attr: UserPtr<u8>,  // type depends on rule_type
    flags: u32,              // reserved, must be 0
) -> Result<(), Errno>;

/// landlock_restrict_self(2) — enforce a ruleset on the calling thread.
/// Syscall number: 446 (all architectures).
///
/// Requires `no_new_privs` (via `prctl(PR_SET_NO_NEW_PRIVS, 1)`).
/// After this call, the thread (and its future children) is restricted.
/// This is irreversible — the restriction cannot be lifted.
/// Multiple calls stack: each call adds a new layer that can only further
/// restrict access (AND-logic with all previous layers).
pub fn sys_landlock_restrict_self(
    ruleset_fd: Fd,
    flags: u32,  // reserved, must be 0
) -> Result<(), Errno>;

Error conditions:

| Syscall | Errno | Condition |
|---------|-------|-----------|
| create_ruleset | EINVAL | Unknown flags, zero handled_access_*, size mismatch |
| create_ruleset | ENOMEM | Kernel memory allocation failure |
| add_rule | EBADF | ruleset_fd is not a valid Landlock ruleset fd |
| add_rule | EINVAL | Unknown rule_type, unknown bits in allowed_access, or ruleset already enforced |
| add_rule | EBADF | parent_fd (for PathBeneath) is not a valid directory fd |
| restrict_self | EPERM | no_new_privs not set on the calling thread |
| restrict_self | EBADF | ruleset_fd is not a valid Landlock ruleset fd |

9.8.6.2 LandlockRulesetAttr

/// Attribute struct for landlock_create_ruleset().
/// Layout matches the Linux UAPI struct `landlock_ruleset_attr`.
#[repr(C)]
pub struct LandlockRulesetAttr {
    /// Bitmask of filesystem access rights handled by this ruleset.
    /// Only access rights present in this bitmask are restricted; all
    /// others are implicitly allowed (forward-compatible: unknown bits
    /// in future ABI versions are simply not handled by older rulesets).
    pub handled_access_fs: u64,
    /// Bitmask of network access rights handled by this ruleset (ABI v4+).
    /// Zero means no network restrictions.
    pub handled_access_net: u64,
}
// Layout: 2 × u64 = 16 bytes.
const_assert!(size_of::<LandlockRulesetAttr>() == 16);

9.8.6.3 LandlockRuleset (Internal)

/// Internal ruleset structure — mutable during construction (while adding
/// rules via landlock_add_rule), immutable after landlock_restrict_self().
///
/// The ruleset is reference-counted (Arc) and shared by all LandlockDomain
/// layers that reference it. Immutability after enforcement means no
/// synchronization is needed for read access during hook checks.
pub struct LandlockRuleset {
    /// Handled filesystem access rights.
    pub handled_access_fs: LandlockAccessFs,
    /// Handled network access rights.
    pub handled_access_net: LandlockAccessNet,
    /// Filesystem rules: keyed by inode number (u64).
    /// Each entry maps an inode to the set of access rights permitted
    /// under that inode's subtree (for PathBeneath rules).
    /// XArray provides O(1) lookup and is RCU-compatible for reads.
    pub fs_rules: XArray<u64, LandlockFsRule>,
    /// Network rules: keyed by port number (u64).
    /// Each entry maps a TCP port to the set of network access rights
    /// permitted on that port.
    pub net_rules: XArray<u64, LandlockNetRule>,
    /// Whether this ruleset has been enforced (passed to restrict_self).
    /// Once true, no further rules can be added.
    pub enforced: AtomicBool,
}

/// A filesystem rule entry: access rights permitted under a directory inode.
pub struct LandlockFsRule {
    /// Bitmask of access rights that this rule permits.
    pub allowed_access: LandlockAccessFs,
}

/// A network rule entry: access rights permitted on a TCP port.
pub struct LandlockNetRule {
    /// Bitmask of network access rights that this rule permits.
    pub allowed_access: LandlockAccessNet,
}
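The mutable-then-frozen lifecycle can be modeled with an `enforced` flag guarding rule insertion. This is a sketch only: `HashMap` stands in for the XArray, and the struct and error string are illustrative.

```rust
use std::collections::HashMap;

/// Minimal ruleset model: inode-keyed fs rules plus an enforcement flag.
pub struct RulesetModel {
    pub fs_rules: HashMap<u64, u64>, // inode number -> allowed_access bits
    pub enforced: bool,
}

impl RulesetModel {
    pub fn new() -> Self {
        Self { fs_rules: HashMap::new(), enforced: false }
    }

    /// Mirrors landlock_add_rule(): rejected once the ruleset is enforced.
    pub fn add_rule(&mut self, inode: u64, allowed: u64) -> Result<(), &'static str> {
        if self.enforced {
            return Err("EINVAL"); // ruleset already enforced, now immutable
        }
        self.fs_rules.insert(inode, allowed);
        Ok(())
    }

    /// Mirrors landlock_restrict_self(): freezes the ruleset.
    pub fn enforce(&mut self) {
        self.enforced = true;
    }
}
```

Immutability after enforcement is what lets the real kernel read rulesets from hook checks without synchronization.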

9.8.6.4 Filesystem Access Rights

bitflags! {
    /// Filesystem access rights for Landlock (LANDLOCK_ACCESS_FS_*).
    /// Values match the Linux UAPI constants exactly.
    pub struct LandlockAccessFs: u64 {
        const EXECUTE         = 1 << 0;   // Execute a file
        const WRITE_FILE      = 1 << 1;   // Open a file with write access
        const READ_FILE       = 1 << 2;   // Open a file with read access
        const READ_DIR        = 1 << 3;   // List directory contents (getdents)
        const REMOVE_DIR      = 1 << 4;   // Remove an empty directory (rmdir)
        const REMOVE_FILE     = 1 << 5;   // Unlink a file
        const MAKE_CHAR       = 1 << 6;   // Create a character device (mknod)
        const MAKE_DIR        = 1 << 7;   // Create a directory (mkdir)
        const MAKE_REG        = 1 << 8;   // Create a regular file (open O_CREAT)
        const MAKE_SOCK       = 1 << 9;   // Create a UNIX domain socket (bind)
        const MAKE_FIFO       = 1 << 10;  // Create a named pipe (mkfifo/mknod)
        const MAKE_BLOCK      = 1 << 11;  // Create a block device (mknod)
        const MAKE_SYM        = 1 << 12;  // Create a symbolic link (symlink)
        const REFER           = 1 << 13;  // Link or rename across directories (ABI v2+)
        const TRUNCATE        = 1 << 14;  // Truncate a file (ABI v3+)
        const IOCTL_DEV       = 1 << 15;  // ioctl on character/block device files (ABI v5+)
    }
}

9.8.6.5 Network Access Rights

bitflags! {
    /// Network access rights for Landlock (ABI v4+, Linux 6.7).
    /// Only TCP is currently supported; UDP and other protocols are
    /// not yet restricted by Landlock (future ABI versions may add them).
    pub struct LandlockAccessNet: u64 {
        const BIND_TCP    = 1 << 0;  // bind() to a TCP port
        const CONNECT_TCP = 1 << 1;  // connect() to a TCP port
    }
}

9.8.6.6 Rule Types and Attributes

/// Discriminant for landlock_add_rule().
pub enum LandlockRuleType {
    /// LANDLOCK_RULE_PATH_BENEATH (1): restrict filesystem access under a
    /// directory hierarchy. The rule permits `allowed_access` on all files
    /// and directories that are descendants of the directory referenced by
    /// `parent_fd`.
    PathBeneath = 1,
    /// LANDLOCK_RULE_NET_PORT (2): restrict network access on a specific
    /// TCP port (ABI v4+).
    NetPort = 2,
}

/// Filesystem rule attribute (for LANDLOCK_RULE_PATH_BENEATH).
/// Layout matches the Linux UAPI struct `landlock_path_beneath_attr`.
#[repr(C)]
pub struct LandlockPathBeneathAttr {
    /// Bitmask of access rights allowed under the directory hierarchy.
    /// Must be a subset of the ruleset's `handled_access_fs`.
    pub allowed_access: u64,
    /// File descriptor of the directory forming the root of the
    /// hierarchy. Must refer to a directory (not a regular file).
    /// The fd can be closed after the rule is added; the kernel
    /// holds a reference to the underlying inode.
    pub parent_fd: i32,
}
// Layout: u64(8) + i32(4) + pad(4) = 16 bytes.
const_assert!(size_of::<LandlockPathBeneathAttr>() == 16);

/// Network rule attribute (for LANDLOCK_RULE_NET_PORT).
/// Layout matches the Linux UAPI struct `landlock_net_port_attr`.
#[repr(C)]
pub struct LandlockNetPortAttr {
    /// Bitmask of network access rights allowed on this port.
    /// Must be a subset of the ruleset's `handled_access_net`.
    pub allowed_access: u64,
    /// TCP port number (0-65535). Values above 65535 return EINVAL.
    pub port: u64,
}
// Layout: 2 × u64 = 16 bytes.
const_assert!(size_of::<LandlockNetPortAttr>() == 16);
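The layout assertions above can be reproduced in plain Rust: with `#[repr(C)]`, a `u64` followed by an `i32` gets 4 bytes of tail padding to reach the struct's 8-byte alignment, so all three attribute structs are 16 bytes.

```rust
#[repr(C)]
pub struct LandlockRulesetAttr {
    pub handled_access_fs: u64,
    pub handled_access_net: u64,
}

#[repr(C)]
pub struct LandlockPathBeneathAttr {
    pub allowed_access: u64,
    pub parent_fd: i32, // 4 bytes of data + 4 bytes of tail padding
}

#[repr(C)]
pub struct LandlockNetPortAttr {
    pub allowed_access: u64,
    pub port: u64,
}
```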

9.8.6.7 ABI Versioning

Applications query the supported ABI version before constructing rulesets, using only flags and access rights from the negotiated version. This ensures forward compatibility: a binary compiled against ABI v5 gracefully degrades on a kernel that only supports ABI v3.

/// Query the highest supported Landlock ABI version.
/// Returns the version number (e.g., 5) or Err if Landlock is disabled.
pub fn landlock_query_abi_version() -> Result<u32, Errno> {
    // With LANDLOCK_CREATE_RULESET_VERSION set, the syscall's return slot
    // carries the ABI version rather than a real file descriptor, so the
    // Fd value is reinterpreted as a plain integer.
    sys_landlock_create_ruleset(
        UserPtr::null(),
        0,
        LANDLOCK_CREATE_RULESET_VERSION, // = 1 << 0
    )
    .map(|version| version.into_raw() as u32)
}

| ABI Version | Linux Version | New Features |
|-------------|---------------|--------------|
| 1 | 5.13 | Initial: 13 filesystem access rights |
| 2 | 5.19 | REFER — cross-directory rename and link |
| 3 | 6.2 | TRUNCATE — file truncation |
| 4 | 6.7 | Network access rights (BIND_TCP, CONNECT_TCP) |
| 5 | 6.10 | IOCTL_DEV — device ioctl restriction |

UmkaOS implements ABI version 5. Unknown bits in handled_access_fs or handled_access_net from future ABI versions cause EINVAL, matching Linux behavior. Applications should mask their requested rights to the negotiated ABI version.
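Masking requested rights to the negotiated version can be sketched as a version-to-bitmask lookup, using the LANDLOCK_ACCESS_FS_* bit values from the table in Section 9.8.6.4 (constant names here are illustrative):

```rust
// LANDLOCK_ACCESS_FS_* bit values (see Section 9.8.6.4).
pub const ACCESS_FS_V1_MASK: u64 = (1 << 13) - 1; // 13 rights, bits 0-12
pub const REFER: u64 = 1 << 13;     // ABI v2
pub const TRUNCATE: u64 = 1 << 14;  // ABI v3
pub const IOCTL_DEV: u64 = 1 << 15; // ABI v5

/// All filesystem access bits defined up to and including `abi_version`.
/// (ABI v4 added network rights only, so the fs mask is unchanged there.)
pub fn fs_access_mask_for_abi(abi_version: u32) -> u64 {
    let mut mask = ACCESS_FS_V1_MASK;
    if abi_version >= 2 { mask |= REFER; }
    if abi_version >= 3 { mask |= TRUNCATE; }
    if abi_version >= 5 { mask |= IOCTL_DEV; }
    mask
}
```

An application would intersect its desired `handled_access_fs` with `fs_access_mask_for_abi(negotiated_version)` before calling landlock_create_ruleset(), avoiding EINVAL on older kernels.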

9.8.6.8 Enforcement Semantics

When landlock_restrict_self() is called, the kernel:

  1. Clones the current credentials (copy-on-write, same as prepare_creds()).
  2. Creates a new LandlockDomain whose parent points to the existing domain (if any) from the current credential blob.
  3. Attaches the provided ruleset (now immutable) to the new domain.
  4. Stores the new domain in the cloned credential's LandlockCredBlob.
  5. Commits the new credentials atomically.

Access check algorithm: For each filesystem or network operation that reaches a Landlock hook, the kernel walks the domain stack from the innermost (most recent restrict_self) to the outermost (first restrict_self):

  • At each layer: if the layer's ruleset handles this access type (bit set in handled_access_fs or handled_access_net) AND no rule in that layer permits the specific operation → DENY (EACCES).
  • If a layer does NOT handle the access type (bit not set) → the layer is transparent to that operation; continue to the next layer.
  • If all layers either permit or are transparent → ALLOW.

This is AND-stacking: each restrict_self() call can only further restrict access, never relax restrictions imposed by a previous layer. A child process inherits the parent's full domain stack and can add additional layers but cannot remove any.

Landlock enforcement runs AFTER standard DAC and capability checks (Section 9.1) and IN ADDITION TO other LSMs (AppArmor, SELinux). It is never a substitute for DAC — it is a supplementary restriction layer at LSM priority 30 (see the stacking order table above).

9.8.6.9 Internal Rule Lookup

/// Check if a Landlock domain permits the given filesystem access on
/// the specified inode. Walks the domain stack (AND-stacking).
///
/// Called from the `file_security()` LSM hook implementation for Landlock.
pub fn landlock_check_access_fs(
    domain: &LandlockDomain,
    inode: &Inode,
    access: LandlockAccessFs,
) -> Result<(), Errno> {
    let mut current = Some(domain);
    while let Some(layer) = current {
        let handled = layer.ruleset.handled_access_fs & access;
        if !handled.is_empty() {
            // This layer handles some of the requested access rights.
            // Walk from the target inode up through parent directories,
            // collecting permitted access from matching rules.
            if !layer_permits_access(&layer.ruleset.fs_rules, inode, handled) {
                return Err(Errno::EACCES);
            }
        }
        current = layer.parent.as_deref();
    }
    Ok(())
}

/// Check if a Landlock domain permits the given network access on
/// the specified TCP port.
pub fn landlock_check_access_net(
    domain: &LandlockDomain,
    port: u16,
    access: LandlockAccessNet,
) -> Result<(), Errno> {
    let mut current = Some(domain);
    while let Some(layer) = current {
        let handled = layer.ruleset.handled_access_net & access;
        if !handled.is_empty() {
            match layer.ruleset.net_rules.load(port as u64) {
                Some(rule) if rule.allowed_access.contains(handled) => {}
                _ => return Err(Errno::EACCES),
            }
        }
        current = layer.parent.as_deref();
    }
    Ok(())
}

Filesystem rule matching (layer_permits_access): The kernel resolves the target inode to its position in the directory tree, then walks upward through parent directories. At each ancestor, it checks the fs_rules XArray for an entry keyed by that ancestor's inode number. If a LandlockFsRule exists and its allowed_access covers the requested rights, the access is permitted for this layer. This traversal implements LANDLOCK_RULE_PATH_BENEATH semantics: a rule on /home/user permits access to /home/user/documents/file.txt because /home/user is an ancestor of the target.

Network rule matching: Direct XArray lookup by port number. No hierarchy — each port is an independent key. If no rule exists for the port and the layer handles the requested network access type, the access is denied.
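The ancestor walk for a single layer can be modeled with a plain parent map standing in for VFS inode-to-parent traversal (the real kernel walks dentries; the signature and maps here are illustrative). Rights accumulate across matching ancestor rules, and the walk stops at the root:

```rust
use std::collections::HashMap;

/// One layer's PATH_BENEATH check. `fs_rules` maps inode number to
/// allowed_access bits; `parents` maps each inode to its parent directory
/// inode, with the root mapping to itself.
pub fn layer_permits_access(
    fs_rules: &HashMap<u64, u64>,
    parents: &HashMap<u64, u64>,
    target_inode: u64,
    requested: u64,
) -> bool {
    let mut collected = 0u64;
    let mut inode = target_inode;
    loop {
        if let Some(allowed) = fs_rules.get(&inode) {
            collected |= *allowed; // rights accumulate from ancestor rules
        }
        if collected & requested == requested {
            return true; // every requested bit covered by some ancestor rule
        }
        let parent = parents[&inode];
        if parent == inode {
            return false; // reached the root without covering all bits
        }
        inode = parent;
    }
}
```

With a rule on inode 3 (say, /home/user) granting read, access to a file beneath it succeeds for read but fails for write, matching the PATH_BENEATH semantics described above.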

9.8.6.10 Landlock and Cross-References

Landlock interacts with several other kernel subsystems:

  • DAC / capabilities (Section 9.1): Standard permission checks run first. Landlock can only further deny access that DAC would otherwise permit. A file with mode 0000 is inaccessible regardless of Landlock rules.
  • LSM stacking (Section 9.8): Landlock is registered at priority 30 (supplementary tier). It runs after integrity (IMA) and MAC (SELinux/AppArmor) modules. All must agree for access to proceed.
  • Seccomp-BPF (Section 10.3): Complementary mechanisms — seccomp restricts which syscalls a process can invoke; Landlock restricts which resources those syscalls can access. A well-sandboxed application uses both.
  • Namespaces (Section 17.1): Landlock is namespace-independent. It is stored per-credential (not per-namespace), so a process retains its Landlock restrictions across namespace transitions (setns, unshare). Mount namespace changes do not escape Landlock filesystem rules because rules are keyed by inode identity, not path string.
  • VFS inode lookup (Section 14.1): The layer_permits_access function relies on VFS inode-to-parent traversal. Bind mounts and overlayfs layers are handled correctly because the kernel resolves through the actual inode graph, not through path strings.

9.8.7 Hook Integration Points

This section specifies WHERE in the kernel codebase hook calls are inserted. Each entry maps a kernel code path to the LSM hook it invokes. Implementing agents use this table to know exactly where to place lsm_call_*() invocations.

| Kernel code path | Hook call | When |
|------------------|-----------|------|
| vfs_open() (umka-vfs) | lsm_call_file_security(Open, ...) | After successful lookup, before returning fd |
| vfs_read() / vfs_write() | lsm_call_file_security(Permission, ...) | Before data transfer |
| do_mmap() (umka-core, Section 4.8) | lsm_call!(mmap_file, file, prot, flags) | Before creating VMA for file-backed mapping. Separate from file_security because mmap needs prot/flags that generic file ops do not. Matches Linux security_mmap_file() hook name (SELinux policy rules reference mmap_file). |
| do_mprotect() (umka-core) | lsm_call_file_security(Mprotect, ...) | Before changing VMA permissions |
| vfs_create() (umka-vfs, InodeOps) | lsm_call_inode_security(Create, ...) | Before calling filesystem's create() |
| vfs_link() | lsm_call_inode_security(Link, ...) | Before calling filesystem's link() |
| vfs_unlink() | lsm_call_inode_security(Unlink, ...) | Before calling filesystem's unlink() |
| vfs_mkdir() | lsm_call_inode_security(Mkdir, ...) | Before calling filesystem's mkdir() |
| vfs_rename() | lsm_call_inode_security(Rename, ...) | Before calling filesystem's rename() |
| vfs_setattr() | lsm_call_inode_security(Setattr, ...) | Before calling filesystem's setattr() |
| vfs_setxattr() | lsm_call_inode_security(Setxattr, ...) | Before calling filesystem's setxattr() |
| vfs_getxattr() | lsm_call_inode_security(Getxattr, ...) | Before returning xattr data |
| do_mount() (umka-vfs, Section 14.6) | lsm_call_superblock_security(Mount, ...) | Before calling filesystem's mount() |
| copy_process() (Section 8.1) | lsm_call_task_security(Alloc, ...) | During fork/clone, before task becomes runnable |
| do_execve() step 2 (Section 8.1) | lsm_call_task_security(Exec, ...) | Before ELF loading — permission check (IMA hash, SELinux/AppArmor domain check, Landlock sandbox) |
| execve_transform_caps() step 11 (Section 9.9) | lsm_task_exec_transition(old_cred, new_cred) | During credential commit — SELinux/AppArmor domain transition, label assignment |
| do_kill() / do_tkill() | lsm_call_task_security(Kill, ...) | Before signal delivery |
| sys_ptrace() (Section 20.4) | lsm_call_task_security(Ptrace, ...) | Before granting ptrace access |
| commit_creds() (Section 9.9) | lsm_call_cred_security(Commit, ...) | Before publishing new credentials |
| ns_capable() (Section 9.9) | lsm_call_capable(...) | After bitfield check passes, before returning true |
| __sys_socket() (Section 16.2) | lsm_call_socket_security(Create, ...) | Before allocating socket |
| sys_bind() | lsm_call_socket_security(Bind, ...) | Before binding address |
| sys_connect() | lsm_call_socket_security(Connect, ...) | Before initiating connection |
| sys_listen() | lsm_call_socket_security(Listen, ...) | Before marking socket as listener |
| sys_sendmsg() / sys_sendto() | lsm_call_socket_security(Sendmsg, ...) | Before sending data |
| ipc_permission() | lsm_call_ipc_security(Permission, ...) | Before granting IPC access |
| bpf_prog_load() (Section 19.2) | lsm_call_bpf_security(ProgLoad, ...) | Before loading BPF program |
| create_namespace() (Section 17.1) | lsm_call_namespace_security(Create, ...) | Before creating namespace |
| do_setns() | lsm_call_namespace_security(Join, ...) | Before joining namespace |

9.8.8 /sys/kernel/security Interface

The LSM framework exposes state via the securityfs pseudo-filesystem:

/sys/kernel/security/
    lsm                         # Comma-separated list of active LSMs
                                #   (e.g., "integrity,apparmor,landlock")
    apparmor/                   # AppArmor-specific (if loaded)
        profiles                # Loaded profile list
        .replace                # Profile upload interface
        .remove                 # Profile removal interface
    selinux/                    # SELinux-specific (if loaded)
        enforce                 # Enforcement mode (0=permissive, 1=enforce)
        policy                  # Binary policy load interface
        booleans/               # Policy booleans
    ima/                        # Already specified in Section 9.4
        ascii_runtime_measurements
        policy
    landlock/                   # Landlock ABI version
        abi_version             # Integer, currently 5 (UmkaOS implements ABI v5; Linux 6.10+)

9.8.9 LSM Registration and Boot Sequence

lsm_init() — called during UmkaOS core initialization, after slab allocator
  is available but before any user process runs:

  1. Parse kernel command line for lsm= parameter
     (default: "integrity,apparmor,landlock" or "integrity,selinux,landlock"
      depending on compile-time config)

  2. For each requested LSM, in order:
     a. Call module.blob_sizes() to collect per-object blob size requirements
     b. Record blob offset = running sum of previous modules' sizes
     c. Add module to LSM_REGISTRY.modules

  3. Compute total_blob_sizes (sum per object type)

  4. Create slab caches for each object type's blob:
     - cred_lsm_blob_cache: total_blob_sizes.cred bytes per object
     - inode_lsm_blob_cache: total_blob_sizes.inode bytes per object
     - etc.
     If total is 0 for a type (no LSM needs that blob), skip slab creation

  5. Enable static keys:
     - LSM_REGISTRY.any_lsm_active = true (if modules.len() > 0)
     - Per-category keys based on which modules override which methods
       (determined by checking if the method implementation is the default)

  6. Initialize each module (call module-specific init, e.g.,
     apparmor loads compiled-in policy, selinux initializes AVC)

Performance impact (measured per operation):

Condition Overhead per hook call
No LSM loaded 0 cycles (NOP via static key)
Single LSM (AppArmor) ~50-100 cycles (one vtable call + policy lookup)
Two LSMs (IMA + AppArmor) ~100-200 cycles (two vtable calls)
Three LSMs (IMA + AppArmor + Landlock) ~150-250 cycles (three vtable calls)

On an NVMe 4KB read (~10 microseconds), a 250-cycle LSM overhead (~250ns at 1 GHz; proportionally less at higher clock rates) adds up to ~2.5%. This is within the project's 5% overhead budget (Section 11.1) when combined with the isolation domain switch cost.

Note on scope: The cycle counts above cover hook dispatch overhead only (vtable call + static-key branch). Policy evaluation — AppArmor rule matching, SELinux AVC lookup, or Landlock restriction tree traversal — adds ~200-2000 cycles per hook depending on policy complexity and AVC cache state. A fully-loaded three-LSM configuration with cache-cold policy can reach 1000-3000 total cycles per hook invocation. The dispatch overhead is bounded; the policy evaluation overhead depends on policy complexity and is expected to dominate in production configurations.

Cross-references: - Section 9.1 (08-security.md): UmkaOS native capability model - Section 9.2 (08-security.md): SystemCaps definition (used by capable() hook) - Section 19.1: Syscall dispatch (LSM hook at capable()) - Section 19.2: eBPF including BPF-LSM program type - Section 9.5 (08-security.md): IMA (first LSM in evaluation order) - Section 14.1 (13-vfs.md): VFS traits (hook integration at VFS layer) - Section 16.2 (15-networking.md): SocketOps (hook integration at socket layer) - Section 20.2 (19-observability.md): Static keys (zero-overhead mechanism) - Section 20.4 (19-observability.md): ptrace (task_security Ptrace hook) - Section 17.1 (16-containers.md): Namespaces (namespace_security hooks) - Section 17.1 (16-containers.md): Security policy integration (references this section) - Section 9.9: Credential model (TaskCredential.lsm_blob)

9.8.10 Policy Format Compatibility

UmkaOS's LSM framework accepts policy in the same binary format as the upstream AppArmor and SELinux policy compilers, ensuring that existing policy toolchains work unmodified. Operators do not need to recompile, reformat, or convert policy artifacts when moving workloads to UmkaOS.

AppArmor: UmkaOS accepts AppArmor policy in the compiled binary format produced by apparmor_parser. The same .aa source files and compiled profiles used on Linux work on UmkaOS without recompilation. The policy ABI version supported is AppArmor 3.x (the current upstream format, as shipped in Ubuntu 22.04+ and Debian 12+). Profiles are loaded via the same securityfs interface: writing compiled profile data to /sys/kernel/security/apparmor/.replace (replace existing) or .load (add new). The kernel validates the ABI version field in the binary header and rejects profiles compiled for incompatible ABI versions with EINVAL. After successful profile load or replacement, LSM_REGISTRY.policy_generation is incremented (fetch_add(1, Release)) to invalidate cached CapValidationToken LSM decisions.

SELinux: UmkaOS accepts SELinux policy modules in the binary .pp (policy package) format produced by semodule and checkmodule. CIL (Common Intermediate Language) source policies compiled with secilc are also accepted. The policy database format version supported is SELinux policy version 33, which is the current upstream version as of the Linux 6.x series. Policy is loaded via /sys/kernel/security/selinux/policy (full policy binary) or via semodule -i (individual .pp modules, which the kernel assembles into the running policy database). The enforce and booleans/ interfaces in securityfs behave identically to the Linux upstream SELinux implementation.

Custom LSM hooks for UmkaOS-specific events: UmkaOS extends the standard LSM hook set with hooks for events that do not exist in Linux (capability delegation, DebugCap issuance, accelerator context creation, KABI driver load). These hooks are invisible to AppArmor and SELinux policy written against upstream compilers: they pass through as SECURITY_ALLOW (permit) for any LSM module that does not explicitly implement the UmkaOS-specific hook variants. UmkaOS-native LSM modules (implemented against the full SecurityModule trait defined in this section) can hook these extended points to enforce site-specific policy over UmkaOS-specific resources. AppArmor and SELinux policy correctness is unaffected by the existence of these additional hooks.

9.8.11 SELinux KABI Interface, AVC, and Policy Load Protocol

This section specifies the kernel-side SELinux infrastructure: the policy load path, the Access Vector Cache (AVC), the SID table (sidtab), and the hook dispatch model. It does NOT specify the SELinux policy binary parser internals (which are an implementation detail of the policy compiler output format). The specification covers the interfaces that the rest of the kernel interacts with.

9.8.11.1 Policy Load Path

SELinux policy is loaded via a securityfs write interface and takes effect atomically:

/sys/fs/selinux/load    — write() accepts a complete binary policy blob
/sys/fs/selinux/policy  — read() returns the currently active policy blob
/sys/fs/selinux/enforce — read/write: "1" = enforcing, "0" = permissive
/sys/fs/selinux/disable — write "1" to permanently disable SELinux (one-way)

Policy load algorithm:

selinux_policy_load(blob: &[u8]) -> Result<()>:
  1. Validate policydb header:
     - magic: 0xF97CFF8C
     - policy_version: must be <= POLICYDB_VERSION_MAX (33)
     - config flags: MLS enabled/disabled
     If invalid: return Err(EINVAL)

  2. Parse the binary policy blob into in-kernel decision tables:
     - Type enforcement (TE) rules: source_type × target_type × class → allowed
     - Role-based access control (RBAC) rules
     - MLS (Multi-Level Security) constraints
     - Type transition rules
     - Boolean conditional rules
     This produces a new SelinuxPolicy struct.

  3. Build the new sidtab (SID table) from the policy's initial SID
     definitions and the context strings in the policy.

  4. Atomic policy swap (RCU pointer replacement):
     old_policy = rcu_dereference(SELINUX_STATE.policy)
     rcu_assign_pointer(SELINUX_STATE.policy, new_policy)

  5. Flush the entire AVC (all cached access decisions are invalidated
     because the decision tables have changed).

  6. After an RCU grace period, free the old policy struct:
     call_rcu(drop_selinux_policy, old_policy)

  7. Increment policy_seqno (monotonic policy generation counter).
     Used by audit logs to correlate decisions with the policy version
     that produced them.

  8. Increment LSM_REGISTRY.policy_generation (fetch_add(1, Release)).
     This invalidates all CapValidationTokens that cached LSM decisions
     under the old policy. The next KABI dispatch for each token will
     detect the mismatch and force full re-validation.

SelinuxPolicy struct:

/// In-kernel representation of a loaded SELinux policy.
/// Immutable after construction. Swapped atomically via RCU.
pub struct SelinuxPolicy {
    /// Policy version number (from the binary header).
    pub policy_version: u32,

    /// Type enforcement access vector rules.
    /// Indexed by packed `TeRuleKey` (u64) → AV decision.
    /// XArray gives O(1) lookup with RCU-protected reads for the AVC-miss
    /// path. The key is packed via `pack_te_key()`: `(source << 32) | (target << 16) | class`.
    /// Cold path — hot path uses the AVC cache, not this table directly.
    pub te_rules: XArray<AvDecision>,

    /// Type transition rules.
    /// Packed `TeRuleKey` (u64) → new_type for object labeling.
    /// Same packing as `te_rules`; XArray for integer-key compliance.
    pub type_transitions: XArray<SecurityType>,

    /// Role allow rules: which roles can transition to which other roles.
    /// **Collection policy exemption**: HashMap with composite non-integer key
    /// `(RoleId, RoleId)`. Fewer than 100 entries in typical policies; cold
    /// path only (role transitions are admin operations, not per-syscall).
    pub role_allow: HashMap<(RoleId, RoleId), bool>,

    /// Conditional booleans: named boolean → current value.
    /// Toggled via /sys/fs/selinux/booleans/<name>.
    /// Changing a boolean triggers selective AVC flush for affected rules.
    /// **Collection policy exemption**: HashMap with string key (not integer).
    pub booleans: HashMap<KernelString<64>, bool>,

    /// MLS constraints (if MLS is enabled in the policy).
    pub mls_constraints: Vec<MlsConstraint>,

    /// Policy generation sequence number (monotonically increasing).
    pub seqno: u64,
}

/// Pack a TE rule key into a u64 for XArray indexing.
/// Layout: bits [47:32] = source_type, bits [31:16] = target_type, bits [15:0] = tclass.
/// No information loss: each component is u16 (max 65535), fitting in 48 bits.
#[inline]
pub fn pack_te_key(source: u16, target: u16, class: u16) -> u64 {
    ((source as u64) << 32) | ((target as u64) << 16) | (class as u64)
}

/// Key for type enforcement rule lookup. Packed to u64 via `pack_te_key()`
/// for XArray indexing.
pub struct TeRuleKey {
    pub source_type: u16,
    pub target_type: u16,
    pub tclass: u16,
}

impl TeRuleKey {
    /// Pack this key into a u64 suitable for XArray indexing.
    pub fn pack(&self) -> u64 {
        pack_te_key(self.source_type, self.target_type, self.tclass)
    }
}

/// Access vector decision (result of a TE rule lookup).
pub struct AvDecision {
    /// Bitmask of allowed permissions for this (source, target, class) triple.
    pub allowed: u32,
    /// Which allowed permissions to audit (generate audit record when used).
    pub audit_allow: u32,
    /// Which denied permissions to audit (generate audit record when denied).
    pub audit_deny: u32,
    /// Default type to assign when creating new objects of this class
    /// (populated from the policy's type_transition/default rules).
    pub default_type: u32,
}

9.8.11.2 Access Vector Cache (AVC)

The AVC is a fast-path cache that avoids consulting the full policy decision tables on every LSM hook invocation. Most security decisions hit the AVC and never touch the policy tables.

/// AVC entry — cached access decision for a (source SID, target SID, class) triple.
pub struct AvcEntry {
    /// Source security identifier (the acting subject).
    pub ssid: u32,
    /// Target security identifier (the object being accessed).
    pub tsid: u32,
    /// Object class (file, socket, process, etc.).
    pub tclass: u16,
    /// Cached allowed permission bitmask.
    pub allowed: u32,
    /// Which allowed permissions generate audit records.
    pub audit_allow: u32,
    /// Which denied permissions generate audit records.
    pub audit_deny: u32,
}

/// Maximum AVC entries. Bounded to limit memory consumption.
/// Default: 1024 entries (configurable via kernel command line
/// `avc_cache_threshold=N`). LRU eviction when full.
pub const AVC_CACHE_DEFAULT_SIZE: usize = 1024;

AVC storage: XArray<u64, AvcEntry> keyed by avc_hash(ssid, tsid, tclass). This is a single-entry-per-bucket design: each XArray slot holds at most one AvcEntry. Hash collisions cause the new entry to evict (overwrite) the existing entry — there is no chaining. This matches Linux's AVC behavior (a direct-mapped cache). The AVC is a pure performance optimization; on any miss or eviction, the full policy lookup in SelinuxPolicy.te_rules produces the correct decision and repopulates the cache. Eviction rate depends on access pattern diversity; the default 1024 entries handle typical server workloads with >90% hit rate.

The hash function combines the three fields into a single u64 key:

avc_hash(ssid: u32, tsid: u32, tclass: u16) -> u64:
  // FNV-1a style mixing to distribute entries across the XArray
  h = (ssid as u64) ^ ((tsid as u64) << 17) ^ ((tclass as u64) << 34)
  h = h.wrapping_mul(0x100000001B3)
  h % (AVC_CACHE_DEFAULT_SIZE as u64)

AVC lookup algorithm (avc_has_perm — called from every SELinux LSM hook):

avc_has_perm(ssid: u32, tsid: u32, tclass: u16, requested: u32) -> Result<()>:
  1. key = avc_hash(ssid, tsid, tclass)

  2. // RCU-protected read (no lock)
     entry = rcu_dereference(AVC_CACHE.get(key))

  3. if entry is Some && entry.ssid == ssid && entry.tsid == tsid
        && entry.tclass == tclass:
       // Cache hit
       if entry.allowed & requested == requested:
           // Permitted — check if we need to audit the allowed access
           if entry.audit_allow & requested != 0:
               avc_audit(ssid, tsid, tclass, requested, AUDIT_ALLOW)
           return Ok(())
       else:
           // Denied — audit and return error
           denied = requested & !entry.allowed
           if entry.audit_deny & denied != 0:
               avc_audit(ssid, tsid, tclass, requested, AUDIT_DENY)
           return Err(EACCES)

  4. // Cache miss — consult the full policy decision tables
     decision = security_compute_av(ssid, tsid, tclass)
     // decision: AvDecision { allowed, audit_allow, audit_deny }

  5. // Insert into AVC (may evict LRU entry if cache is full)
     new_entry = AvcEntry {
         ssid, tsid, tclass,
         allowed: decision.allowed,
         audit_allow: decision.audit_allow,
         audit_deny: decision.audit_deny,
     }
     avc_cache_insert(key, new_entry)  // RCU-safe insertion

  6. // Now check the freshly computed decision
     if decision.allowed & requested == requested:
         if decision.audit_allow & requested != 0:
             avc_audit(ssid, tsid, tclass, requested, AUDIT_ALLOW)
         return Ok(())
     else:
         denied = requested & !decision.allowed
         if decision.audit_deny & denied != 0:
             avc_audit(ssid, tsid, tclass, requested, AUDIT_DENY)
         return Err(EACCES)

AVC invalidation events:

Event Action
Full policy reload (selinux_policy_load) Flush entire AVC
Boolean toggle (/sys/fs/selinux/booleans/) Flush AVC entries affected by the toggled boolean's conditional rules
avc_cache_threshold change Resize cache, flush all

9.8.11.3 SID Table (sidtab)

The sidtab maps between opaque numeric Security Identifiers (SIDs) and their full security context strings (e.g., system_u:object_r:tmp_t:s0).

/// Security context — the human-readable SELinux label.
pub struct SecurityContext {
    /// User component (e.g., "system_u", "unconfined_u").
    pub user: u16,
    /// Role component (e.g., "object_r", "staff_r").
    pub role: u16,
    /// Type component (e.g., "tmp_t", "httpd_t").
    pub type_: u16,
    /// MLS level/range (e.g., "s0", "s0-s0:c0.c1023").
    /// Only meaningful when MLS is enabled in the policy.
    pub mls_range: MlsRange,
}

/// SID table — maps numeric SIDs ↔ security contexts.
/// XArray<u32, SecurityContext> for O(1) lookup by SID number.
/// SIDs are allocated monotonically and never reused within a
/// policy generation. On policy reload, the sidtab is rebuilt
/// from the new policy's initial SID definitions.
///
/// The sidtab grows monotonically during a policy generation:
/// new SIDs are allocated when previously unseen security contexts
/// are encountered (e.g., a new file is created with a label that
/// was not in the initial policy). SIDs are never freed or reused
/// until a full policy reload replaces the entire sidtab.
pub struct Sidtab {
    /// Forward map: SID number → SecurityContext.
    pub sid_to_context: XArray<u32, SecurityContext>,
    /// Reverse map: context hash → SID number.
    /// Used by `security_context_to_sid()` to find an existing SID
    /// for a given context string, or allocate a new one.
    /// **Collection policy**: XArray is mandatory for integer-keyed (u64
    /// context hash) lookups. Cold path (SID allocation is rare after
    /// initial policy load).
    pub context_to_sid: XArray<u64, u32>,
    /// Next SID to allocate (monotonically increasing counter).
    /// Starts at SECINITSID_NUM + 1 after loading initial SIDs.
    pub next_sid: AtomicU32,
}

/// Number of initial SIDs defined by the policy (kernel, security,
/// unlabeled, file, etc.). These are pre-allocated in the sidtab
/// during policy load with fixed SID numbers (1..SECINITSID_NUM).
pub const SECINITSID_NUM: u32 = 27;

SID allocation:

security_context_to_sid(context: &str) -> Result<u32>:
  1. hash = hash_context(context)
  2. // Check reverse map (read path, no lock)
     if let Some(sid) = sidtab.context_to_sid.get(hash):
         return Ok(sid)
  3. // New context: parse and validate against current policy
     parsed = parse_security_context(context)?  // EINVAL if malformed
     validate_context(parsed, &current_policy)?  // EINVAL if type/role/user unknown
  4. // Allocate new SID (atomic increment)
     new_sid = sidtab.next_sid.fetch_add(1, Ordering::Relaxed)
  5. sidtab.sid_to_context.insert(new_sid, parsed)
     sidtab.context_to_sid.insert(hash, new_sid)
  6. return Ok(new_sid)

9.8.11.4 Hook Dispatch Model

Each SELinux LSM hook implementation follows the same pattern: resolve the subject and object SIDs, determine the object class, and call avc_has_perm():

selinux_file_open(cred: &TaskCredential, file: &FileRef) -> Result<()>:
  // 1. Get the subject SID from the task's credentials
  ssid = cred.lsm_blob.selinux_sid   // stored in the credential's LSM blob

  // 2. Get the target SID from the file's inode
  tsid = file.inode.lsm_blob.selinux_sid  // stored in the inode's LSM blob

  // 3. Determine the object class
  tclass = inode_to_security_class(file.inode)  // e.g., SECCLASS_FILE,
                                                 // SECCLASS_DIR, SECCLASS_SOCK_FILE

  // 4. Determine the requested permission(s)
  requested = file_mask_to_av(file.f_mode)  // e.g., FILE__READ | FILE__WRITE

  // 5. AVC check (hot path: cache hit; cold path: policy table lookup)
  avc_has_perm(ssid, tsid, tclass, requested)?

  Ok(())

This pattern is the same for all ~220 hook points. The differences are:
- Where the SIDs come from: credentials (subject), inodes (files), sockets (network objects), superblocks (filesystems), or IPC objects.
- The object class: SECCLASS_FILE, SECCLASS_PROCESS, SECCLASS_TCP_SOCKET, SECCLASS_CAPABILITY, etc. (~80 classes in a standard policy).
- The permission bits: each class has its own permission bitmask (e.g., FILE__READ, FILE__WRITE, FILE__EXECUTE, PROCESS__SIGNAL, PROCESS__PTRACE).

SELinux state:

/// Global SELinux state. Initialized during LSM registration.
pub struct SelinuxState {
    /// Currently active policy (RCU-protected pointer).
    pub policy: RcuPtr<SelinuxPolicy>,
    /// Enforcement mode: true = enforcing (denials are enforced),
    /// false = permissive (denials are logged but not enforced).
    pub enforcing: AtomicBool,
    /// Whether SELinux has been permanently disabled.
    pub disabled: AtomicBool,
    /// The AVC cache.
    pub avc: AvcCache,
    /// The SID table.
    pub sidtab: Sidtab,
}

/// Singleton SELinux state.
static SELINUX_STATE: OnceCell<SelinuxState> = OnceCell::new();

Cross-references: - Section 9.9: TaskCredential.lsm_blob (where the task SID is stored) - Section 14.1: Inode LSM blob (where the file SID is stored) - Section 9.1: Capability token model (SELinux checks are orthogonal)

9.9 Credential Model and Capabilities

The Linux credential model (UIDs, GIDs, supplementary groups) is a legacy access control mechanism. UmkaOS maps this model to its native Capability System (Section 9.1).

9.9.1 Credential Structure

Linux uses copy-on-write credentials (struct cred) shared between tasks via RCU. UmkaOS follows the same pattern: credentials are immutable once published. Modification follows a prepare-modify-commit sequence that atomically swaps the credential pointer visible to other tasks.

/// Per-task credential state. Analogous to Linux's `struct cred`.
///
/// Credentials are immutable after publication. To modify credentials,
/// a task calls `prepare_creds()` to obtain a mutable clone, modifies
/// the clone, then calls `commit_creds()` to atomically publish it.
/// The old credential is freed after an RCU grace period (readers that
/// already hold a reference via `rcu_read_lock()` see the old values
/// until they exit the read-side critical section).
///
/// **Memory layout**: The credential struct is slab-allocated from a
/// dedicated `cred_cache` slab (fixed size, no fragmentation). The
/// `Arc` provides reference counting for sharing between `prepare_creds`
/// clones and cross-task references (e.g., `/proc/PID/status` reads
/// another task's credentials). RCU protects the pointer swap; `Arc`
/// protects the struct lifetime.
///
/// **Hot-path optimization**: Syscall entry reads `current->cred` under
/// an implicit RCU read-side section (preempt-disabled on syscall entry).
/// The `SystemCaps` checks (`capable()`, `ns_capable()`) access
/// `cred.cap_effective` directly -- no indirection beyond the RCU
/// dereference of the cred pointer itself.
/// Kernel-internal, slab-allocated, accessed via RCU. Never crosses KABI or
/// wire boundary. `bool` and `Arc` fields are acceptable per kernel-internal
/// exception (CLAUDE.md rule 8). Implicit 3-byte padding between
/// `no_new_privs` (bool, 1 byte) and `securebits` (u32, 4-byte aligned)
/// is benign — kernel-internal, no information-disclosure boundary.
#[repr(C)]
pub struct TaskCredential {
    // ===== POSIX identity fields =====

    /// Real user ID. Set at login, inherited across fork/exec unless
    /// modified by setuid()/setreuid()/setresuid().
    pub uid: u32,
    /// Real group ID. Analogous to uid for group identity.
    pub gid: u32,
    /// Effective user ID. Used for permission checks (file access,
    /// signal delivery). May differ from uid after setuid binary exec
    /// or seteuid() call.
    pub euid: u32,
    /// Effective group ID. Used for permission checks.
    pub egid: u32,
    /// Saved set-user-ID. Preserved across exec for setuid binaries.
    /// Allows switching between real and saved UIDs via setreuid().
    pub suid: u32,
    /// Saved set-group-ID. Analogous to suid for groups.
    pub sgid: u32,
    /// Filesystem user ID. Used exclusively for filesystem permission
    /// checks (open, stat, chown). Normally tracks euid; can be set
    /// independently via setfsuid(). Exists for NFS server implementations
    /// that need to check permissions as a different user without changing
    /// signal delivery identity (euid).
    pub fsuid: u32,
    /// Filesystem group ID. Analogous to fsuid.
    pub fsgid: u32,

    // ===== Linux capability sets (5 sets, matching Linux kernel exactly) =====
    // All five sets use SystemCaps (u128) to hold both POSIX capabilities
    // (bits 0-63, matching Linux numbering) and UmkaOS-native capabilities
    // (bits 64-127). See Section 9.1.3 for the full SystemCaps definition.
    //
    // The five sets interact as follows:
    // - cap_permitted: upper bound on what the task CAN have
    // - cap_effective: what the task CURRENTLY has (checked by capable())
    // - cap_inheritable: what survives across execve()
    // - cap_bounding: limits what file capabilities can grant
    // - cap_ambient: auto-raised on execve() for capability-dumb binaries

    /// Effective capability set. This is the set checked by capable() and
    /// ns_capable() on every privileged operation. A capability must be in
    /// this set for the task to exercise it.
    pub cap_effective: SystemCaps,

    /// Permitted capability set. Upper bound on cap_effective. A task can
    /// raise a capability into cap_effective only if it is in cap_permitted.
    /// Dropping a capability from cap_permitted is permanent (cannot be
    /// re-added). cap_effective is always a subset of cap_permitted.
    pub cap_permitted: SystemCaps,

    /// Inheritable capability set. Capabilities that can be inherited across
    /// execve() if the executed file also has the capability in its file
    /// inheritable set. Used by capability-aware programs that explicitly
    /// manage which capabilities survive exec.
    pub cap_inheritable: SystemCaps,

    /// Bounding set. Limits which capabilities can be gained through file
    /// capabilities during execve(). A capability not in the bounding set
    /// cannot appear in cap_permitted after execve(), even if the file has
    /// it in its file permitted set. Can only be reduced (via
    /// prctl(PR_CAPBSET_DROP)), never expanded. Inherited across fork().
    pub cap_bounding: SystemCaps,

    /// Ambient capability set. Capabilities automatically added to
    /// cap_permitted and cap_effective on execve() of a non-setuid,
    /// non-file-capability binary (a "capability-dumb" binary). This
    /// allows unprivileged programs to inherit capabilities without
    /// requiring file capability xattrs. Added in Linux 4.3.
    ///
    /// Invariant: cap_ambient is always a subset of both cap_permitted
    /// and cap_inheritable. If a capability is dropped from either
    /// cap_permitted or cap_inheritable, it is automatically dropped
    /// from cap_ambient.
    pub cap_ambient: SystemCaps,

    // ===== Supplementary groups =====

    /// Supplementary group list. Set by setgroups(2) or initgroups(3).
    /// Used by permission checks alongside gid/egid. Maximum 65536 groups
    /// (matching Linux NGROUPS_MAX). Stored sorted for binary search during
    /// permission checks (in_group_p() on the hot path for file access).
    ///
    /// Shared via Arc for COW efficiency: fork() shares the list; only
    /// setgroups() triggers a clone. The common case (no setgroups after
    /// login) means the list is shared across the entire process tree
    /// spawned by a session.
    pub supplementary_groups: Arc<SortedGidList>,

    // ===== Security control flags =====

    /// PR_SET_NO_NEW_PRIVS flag. Once set to true, cannot be unset.
    /// Inherited across fork() and preserved across execve().
    /// When true, execve() will NOT:
    ///   - Honor setuid/setgid bits on the executed file
    ///   - Grant file capabilities from security.capability xattr
    ///   - Allow LSM transitions that increase privilege
    /// Required for unprivileged seccomp (SECCOMP_SET_MODE_FILTER
    /// without CAP_SYS_ADMIN requires no_new_privs == true).
    pub no_new_privs: bool,

    /// Securebits flags. Control the special handling of UID 0 (root).
    /// See SecureBits definition below. Set via prctl(PR_SET_SECUREBITS).
    /// Requires CAP_SETPCAP in the caller's effective set.
    pub securebits: SecureBits,

    // ===== Namespace and LSM =====

    /// User namespace this credential operates in. Determines the
    /// scope of UID/GID mappings and the ceiling for capability checks
    /// (see compute_effective_caps in Section 17.1.6).
    pub user_ns: Arc<UserNamespace>,

    /// Opaque LSM security blob. Allocated and managed by the active
    /// LSM modules ([Section 9.8](#linux-security-module-framework)). Each LSM gets a contiguous region
    /// within this blob, indexed by the LSM's registered blob offset.
    /// None if no LSM is loaded (zero overhead -- no allocation).
    /// See [Section 9.8](#linux-security-module-framework--lsm-blob-layout) for the blob layout specification.
    /// Slab-allocated, owned by this credential, freed via `lsm_blob_free()` in
    /// `TaskCredential`'s destructor. Not reference-counted -- the blob dies
    /// with its owning credential.
    /// SAFETY: the `NonNull<LsmBlob>` points to slab-allocated memory managed
    /// by `lsm_blob_alloc()`/`lsm_blob_free()` (not the global allocator).
    pub lsm_blob: Option<NonNull<LsmBlob>>,
}

/// Alias for use in cross-subsystem interfaces.
pub type Credentials = TaskCredential;

/// Sorted supplementary group ID list. Stored sorted for O(log n) lookup
/// via binary search. The inline ArrayVec holds up to 32 groups (covers
/// >99% of real-world users). Processes with more than 32 supplementary
/// groups use the heap-allocated `overflow` extension (allocated once at
/// setgroups() time, shared via Arc across fork()). NGROUPS_MAX = 65536
/// matches Linux.
pub struct SortedGidList {
    /// Inline sorted group IDs (up to 32). Binary search for in_group_p() checks.
    inline_groups: ArrayVec<u32, 32>,
    /// Overflow heap allocation for supplementary groups beyond 32.
    /// None when group count <= 32 (common case: zero heap allocation).
    /// When present, contains ALL groups (not just the overflow); the
    /// inline_groups field is unused and empty. This avoids a two-step
    /// lookup on the hot path.
    /// `Arc<[u32]>` stores the slice data inline after the Arc header
    /// (single allocation, no double indirection). Construct via
    /// `Arc::from(vec.into_boxed_slice())`.
    overflow: Option<Arc<[u32]>>,
}

/// Maximum number of supplementary groups per credential.
/// Matches Linux NGROUPS_MAX (65536 since Linux 2.6.4).
const NGROUPS_MAX: usize = 65536;

impl SortedGidList {
    /// Check if the given GID is in the supplementary group list.
    /// O(log n) binary search. Called on every file permission check
    /// when the file's group matches neither egid nor fsgid.
    pub fn contains(&self, gid: u32) -> bool {
        if let Some(ref all) = self.overflow {
            all.binary_search(&gid).is_ok()
        } else {
            self.inline_groups.binary_search(&gid).is_ok()
        }
    }
}

bitflags! {
    /// Securebits flags. These control the special treatment of UID 0.
    /// Each flag has a corresponding LOCKED variant that prevents
    /// clearing the flag (one-way escalation of restriction).
    ///
    /// Bit layout matches Linux exactly (include/uapi/linux/securebits.h):
    ///   bit 0: SECBIT_NOROOT
    ///   bit 1: SECBIT_NOROOT_LOCKED
    ///   bit 2: SECBIT_NO_SETUID_FIXUP
    ///   bit 3: SECBIT_NO_SETUID_FIXUP_LOCKED
    ///   bit 4: SECBIT_KEEP_CAPS
    ///   bit 5: SECBIT_KEEP_CAPS_LOCKED
    ///   bit 6: SECBIT_NO_CAP_AMBIENT_RAISE
    ///   bit 7: SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED
    ///   bit 8: SECBIT_EXEC_RESTRICT_FILE          (Linux UAPI reserved; UmkaOS enforces)
    ///   bit 9: SECBIT_EXEC_RESTRICT_FILE_LOCKED    (Linux UAPI reserved; UmkaOS enforces)
    ///   bit 10: SECBIT_EXEC_DENY_INTERACTIVE       (Linux UAPI reserved; UmkaOS enforces)
    ///   bit 11: SECBIT_EXEC_DENY_INTERACTIVE_LOCKED (Linux UAPI reserved; UmkaOS enforces)
    pub struct SecureBits: u32 {
        /// SECBIT_NOROOT: When set, UID 0 does NOT automatically gain
        /// capabilities. Without this flag (the default), a process
        /// that transitions to euid 0 via setuid() gains full
        /// cap_permitted (the "root is special" legacy behavior).
        /// Required for rootless containers where UID 0 inside the
        /// container maps to an unprivileged host UID.
        const NOROOT = 1 << 0;
        /// SECBIT_NOROOT_LOCKED: Prevents clearing NOROOT.
        const NOROOT_LOCKED = 1 << 1;

        /// SECBIT_NO_SETUID_FIXUP: When set, the kernel does NOT
        /// adjust cap_effective/cap_permitted/cap_ambient when
        /// euid changes (via setuid/seteuid/setreuid/setresuid).
        /// Without this flag, transitioning from euid 0 to non-zero
        /// clears cap_effective, and transitioning to euid 0 raises
        /// cap_effective to cap_permitted.
        const NO_SETUID_FIXUP = 1 << 2;
        /// SECBIT_NO_SETUID_FIXUP_LOCKED: Prevents clearing NO_SETUID_FIXUP.
        const NO_SETUID_FIXUP_LOCKED = 1 << 3;

        /// SECBIT_KEEP_CAPS: When set, a setuid() from euid 0 to
        /// non-zero does NOT clear cap_permitted. Normally, dropping
        /// root (setuid(non-zero)) clears cap_permitted entirely.
        /// This flag preserves capabilities across the UID change.
        /// Note: this flag is automatically cleared on execve().
        const KEEP_CAPS = 1 << 4;
        /// SECBIT_KEEP_CAPS_LOCKED: Prevents clearing KEEP_CAPS.
        const KEEP_CAPS_LOCKED = 1 << 5;

        /// SECBIT_NO_CAP_AMBIENT_RAISE: When set, prevents adding
        /// capabilities to the ambient set. Existing ambient caps
        /// are preserved but no new ones can be added.
        const NO_CAP_AMBIENT_RAISE = 1 << 6;
        /// SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED: Prevents clearing
        /// NO_CAP_AMBIENT_RAISE.
        const NO_CAP_AMBIENT_RAISE_LOCKED = 1 << 7;

        /// SECBIT_EXEC_RESTRICT_FILE (UAPI reserved in Linux; UmkaOS enforces): When set, `execve()`
        /// is restricted to files that have the execute permission bit set
        /// for the calling user. Without this flag, `execve()` of a non-+x
        /// file opened via `openat2(O_EXEC)` or `fexecve(fd)` is allowed.
        /// With this flag, the kernel rejects `execve()` of any file that
        /// does not have execute permission, even if opened via fd.
        /// Used by hardened container runtimes to prevent execution of
        /// arbitrary files that were opened for reading.
        const EXEC_RESTRICT_FILE = 1 << 8;
        /// SECBIT_EXEC_RESTRICT_FILE_LOCKED: Prevents clearing
        /// EXEC_RESTRICT_FILE.
        const EXEC_RESTRICT_FILE_LOCKED = 1 << 9;

        /// SECBIT_EXEC_DENY_INTERACTIVE (UAPI reserved in Linux; UmkaOS enforces): When set, denies
        /// execution of scripts or binaries passed via file descriptor to
        /// an interpreter (e.g., `bash < script.sh`, `python3 /dev/fd/3`).
        /// This prevents bypassing EXEC_RESTRICT_FILE by feeding
        /// non-executable content to an interpreter. The kernel checks
        /// whether the executed file was opened for reading (not exec) and
        /// denies execution if this bit is set.
        const EXEC_DENY_INTERACTIVE = 1 << 10;
        /// SECBIT_EXEC_DENY_INTERACTIVE_LOCKED: Prevents clearing
        /// EXEC_DENY_INTERACTIVE.
        const EXEC_DENY_INTERACTIVE_LOCKED = 1 << 11;
    }
}
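The LOCKED companion bits enforce one-way restriction when userspace changes securebits (e.g., via prctl(PR_SET_SECUREBITS)). A minimal sketch of that validation rule, under the semantics stated above (a LOCKED bit prevents clearing its flag, and LOCKED bits themselves can never be cleared); the function name `securebits_change_allowed` is illustrative, not the kernel's:

```rust
// Illustrative bit values matching the layout documented above.
const NOROOT: u32 = 1 << 0;
const NOROOT_LOCKED: u32 = 1 << 1;

/// Sketch: may a task change its securebits from `old` to `new`?
/// Even bits are flags; the next odd bit is the flag's LOCKED companion.
fn securebits_change_allowed(old: u32, new: u32) -> bool {
    for flag_bit in (0..12).step_by(2) {
        let flag = 1u32 << flag_bit;
        let locked = 1u32 << (flag_bit + 1);
        // LOCKED bits are one-way: once set they can never be cleared.
        if old & locked != 0 && new & locked == 0 {
            return false;
        }
        // A locked flag cannot be cleared (one-way escalation of restriction).
        if old & locked != 0 && old & flag != 0 && new & flag == 0 {
            return false;
        }
    }
    true
}

fn main() {
    // Setting NOROOT and locking it in one step is allowed.
    assert!(securebits_change_allowed(0, NOROOT | NOROOT_LOCKED));
    // Clearing a locked NOROOT is rejected.
    assert!(!securebits_change_allowed(NOROOT | NOROOT_LOCKED, NOROOT_LOCKED));
    // Clearing the lock bit itself is rejected.
    assert!(!securebits_change_allowed(NOROOT_LOCKED, 0));
    println!("ok");
}
```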

9.9.2 Credential Lifecycle (Copy-on-Write via RCU)

The credential is shared immutably between tasks. The pointer from the task struct to the credential is RCU-protected:

/// In the Task struct (Section 8.1.1):
pub struct Task {
    // ... existing fields ...

    /// RCU-protected pointer to the task's current credentials.
    /// Readers (other tasks inspecting this task's creds, e.g.,
    /// /proc/PID/status, kill() permission check, ptrace attach)
    /// dereference under rcu_read_lock().
    /// The owning task reads its own creds without RCU (it is the
    /// only writer, and commit_creds() is serialized per-task).
    pub cred: RcuPtr<Arc<TaskCredential>>,
}

Credential inheritance at fork: fork() inherits credentials via Arc::clone(&parent.cred) — a simple reference count bump, not a prepare_creds()/commit_creds() cycle. Parent and child share the same immutable TaskCredential struct. This is safe because credentials are never mutated in-place; modifications always go through prepare_creds() → clone → commit_creds() → RCU publish, creating a new Arc<TaskCredential>. The child independently calls prepare_creds()/commit_creds() if it later needs different credentials (e.g., via execve(), setuid(), or capset()).

Credential modification protocol:

prepare_creds(current_task) -> MutableCred:
  1. old = rcu_dereference(current_task.cred)  // or direct read for self
  2. new = Arc::new(old.deep_clone())           // clone all fields
  3. return MutableCred(new)                    // caller owns mutable access

// Caller modifies new.cap_effective, new.euid, etc.

commit_creds(current_task, new: MutableCred):
  1. Validate invariants (see below)
  2. old = current_task.cred
  2b. current_task.cred_generation.fetch_add(1, Ordering::Release);
      // Increment credential generation BEFORE publishing. This ensures
      // that any CapValidationToken captured before this point will see a
      // stale generation and force re-validation. The Release ordering
      // pairs with Acquire loads in KABI dispatch cap_validate_token().
  3. rcu_assign_pointer(current_task.cred, new.into_arc())
  4. // old Arc ref is decremented; if refcount reaches 0,
  5. // the old TaskCredential is freed after RCU grace period
  6. // via call_rcu(drop_cred, old)

abort_creds(new: MutableCred):
  // Called if the modification is abandoned (e.g., error path)
  // Simply drops the Arc, freeing the cloned credential
  drop(new)

Invariants enforced by commit_creds():

  1. cap_effective is a subset of cap_permitted
  2. cap_ambient is a subset of both cap_permitted AND cap_inheritable
  3. If no_new_privs was true in the old credential, it must be true in the new one
  4. Locked securebits in the old credential cannot be cleared in the new one
  5. cap_bounding can only shrink (new is a subset of old), never grow

Violation of any invariant causes commit_creds() to return Err(EPERM) without modifying the task's credential pointer.
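The five invariants translate directly into subset checks over the capability bitmasks. A sketch, assuming a simplified `Cred` with `u128` masks (the struct layout and the `validate_invariants` name are illustrative, not the kernel's):

```rust
/// Simplified credential for the sketch; masks are u128 like SystemCaps.
#[derive(Clone)]
struct Cred {
    cap_effective: u128,
    cap_permitted: u128,
    cap_inheritable: u128,
    cap_ambient: u128,
    cap_bounding: u128,
    no_new_privs: bool,
    securebits: u32,
}

/// All *_LOCKED securebits occupy the odd bit positions 1..=11.
const LOCKED_MASK: u32 = 0b1010_1010_1010;

/// Sketch of the commit_creds() invariant checks. Returns Err(EPERM)
/// on the first violated invariant, without touching the task.
fn validate_invariants(old: &Cred, new: &Cred) -> Result<(), i32> {
    const EPERM: i32 = 1;
    // 1. effective must be a subset of permitted
    if new.cap_effective & !new.cap_permitted != 0 { return Err(EPERM); }
    // 2. ambient must be a subset of permitted AND inheritable
    if new.cap_ambient & !(new.cap_permitted & new.cap_inheritable) != 0 {
        return Err(EPERM);
    }
    // 3. no_new_privs is one-way
    if old.no_new_privs && !new.no_new_privs { return Err(EPERM); }
    // 4. locked securebits cannot be cleared
    if old.securebits & LOCKED_MASK & !new.securebits != 0 { return Err(EPERM); }
    // 5. bounding set can only shrink
    if new.cap_bounding & !old.cap_bounding != 0 { return Err(EPERM); }
    Ok(())
}

fn main() {
    let base = Cred {
        cap_effective: 0b001, cap_permitted: 0b011, cap_inheritable: 0b001,
        cap_ambient: 0b001, cap_bounding: 0b111,
        no_new_privs: false, securebits: 0,
    };
    assert!(validate_invariants(&base, &base).is_ok());
    let mut bad = base.clone();
    bad.cap_effective = 0b100; // not in permitted: violates invariant 1
    assert!(validate_invariants(&base, &bad).is_err());
    println!("ok");
}
```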

CapValidationToken invalidation on credential change: When commit_creds() publishes new credentials, it increments task.cred_generation (AtomicU64). CapValidationToken captures cred_generation at creation time. KABI dispatch compares the token's cred_gen against current_task().cred_generation; a mismatch forces re-validation of the capability against the new credential set. This ensures that setuid(), setgid(), setgroups(), and any other credential-modifying operation immediately invalidates cached KABI tokens — preventing a process from retaining capabilities that the new credential set would not grant.

/// Credential generation counter. Incremented by `commit_creds()` each
/// time a task's credentials are modified. `CapValidationToken` captures
/// this value at creation; KABI dispatch compares the token's snapshot
/// against the task's current generation to detect stale tokens.
pub cred_generation: AtomicU64,
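The generation-compare protocol can be sketched with a bare `AtomicU64`. This is a minimal illustration of the snapshot-and-compare pattern described above; the method names (`make_token`, `token_still_valid`) are illustrative, and the Release/Acquire pairing follows the commit_creds() step 2b comment:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal task stand-in carrying only the generation counter.
struct Task {
    cred_generation: AtomicU64,
}

/// Token snapshot taken at capability-validation time.
struct CapValidationToken {
    cred_gen: u64,
}

impl Task {
    /// KABI validate_cap() side: snapshot the current generation.
    fn make_token(&self) -> CapValidationToken {
        CapValidationToken { cred_gen: self.cred_generation.load(Ordering::Acquire) }
    }

    /// commit_creds() side: bump BEFORE publishing the new credential.
    /// Release pairs with the Acquire load in token_still_valid().
    fn commit_creds_bump(&self) {
        self.cred_generation.fetch_add(1, Ordering::Release);
    }

    /// KABI dispatch side: a mismatch forces full re-validation.
    fn token_still_valid(&self, tok: &CapValidationToken) -> bool {
        tok.cred_gen == self.cred_generation.load(Ordering::Acquire)
    }
}

fn main() {
    let t = Task { cred_generation: AtomicU64::new(7) };
    let tok = t.make_token();
    assert!(t.token_still_valid(&tok));
    t.commit_creds_bump(); // e.g., setuid() committed new credentials
    assert!(!t.token_still_valid(&tok)); // stale: must re-validate
    println!("ok");
}
```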

9.9.3 The capable() and ns_capable() Check Functions

All privileged operations in the kernel go through one of these two functions:

capable(task: &Task, cap: SystemCaps) -> bool:
  // Check if the given task has the given capability against the
  // init user namespace (i.e., "global" capability).
  // When called as `capable(current_task(), cap)`, this matches Linux's
  // `capable()` which uses `current` internally.
  return ns_capable(task, &INIT_USER_NS, cap)

ns_capable(task: &Task, target_ns: &UserNamespace, cap: SystemCaps) -> bool:
  1. cred = task.cred   // no RCU needed for own creds when task == current
  2. if !cred.cap_effective.contains(cap):
       return false             // task doesn't have the cap at all
  3. // Check namespace hierarchy: task's user_ns must be same as or
  //  ancestor of target_ns (from compute_effective_caps, Section 17.1.6)
  4. if !is_same_or_ancestor(cred.user_ns, target_ns):
       return false             // cap not valid in target namespace
  5. // LSM hook: security_capable() -- allows LSMs to deny even if
  //  capability bits permit (e.g., SELinux policy denies CAP_NET_ADMIN
  //  to a confined domain even if the process holds it)
  6. if lsm_deny_capable(cred, target_ns, cap):  // [Section 9.8](#linux-security-module-framework)
       return false
  7. return true

Cross-reference: ns_capable() at step 4 uses the is_same_or_ancestor() function defined in the compute_effective_caps() algorithm (Section 17.1). The namespace hierarchy walk is the same; ns_capable() is the per-operation entry point, while compute_effective_caps() is the bulk computation used when constructing the effective set.

Credential vs nsproxy distinction: ns_capable() MUST use task.cred.user_ns (the task's effective credential namespace), NOT nsproxy.user_ns (the task's namespace set). cred.user_ns determines the task's capability scope; nsproxy.user_ns is the namespace the task belongs to. These differ after setns(CLONE_NEWUSER) — the task's nsproxy points to the joined namespace, but its credentials (and thus capability scope) reflect where the task was created.

9.9.4 The execve() Capability Transformation

When a task calls execve(), the capability sets are recomputed based on the executed file's properties. This is the most security-critical credential transformation in the kernel.

File capability structure (stored in security.capability xattr):

/// File capabilities stored as an extended attribute on the binary.
/// Format matches Linux VFS_CAP_REVISION_3 (Linux 4.14+).
/// Read from security.capability xattr during execve().
// kernel-internal, not KABI
#[repr(C)]
pub struct FileCapabilities {
    /// File permitted set: capabilities the file can grant to the
    /// task's cap_permitted. Subject to cap_bounding restriction.
    pub permitted: SystemCaps,

    /// File inheritable set: capabilities the file allows to be
    /// inherited from the task's cap_inheritable into cap_permitted.
    pub inheritable: SystemCaps,

    /// Effective flag (single bit, not a full set). When set (non-zero),
    /// cap_effective is raised to equal cap_permitted after the
    /// transformation. When clear (0), cap_effective is set to
    /// cap_ambient only (capability-aware binaries manage their
    /// own effective set via capset()).
    ///
    /// u8 instead of bool because FileCapabilities is `#[repr(C)]` and
    /// parsed from the `security.capability` xattr (disk/wire format).
    /// A non-0/1 byte value in a `bool` field is UB in Rust. The xattr
    /// content comes from disk (potentially corrupted) or from a
    /// different kernel implementation (which may write 0xFF for "true").
    /// Validation: the xattr parser treats any non-zero value as set,
    /// matching Linux behavior.
    pub effective_flag: u8,  // 0 = clear, non-zero = set

    /// Root UID for namespace-scoped file capabilities (VFS_CAP_REVISION_3).
    /// File capabilities are interpreted relative to this UID's user namespace.
    /// 0 = initial namespace (host file caps). Non-zero = namespaced file caps.
    pub rootid: u32,
}

Canonical capability transformation formulae (summary — the algorithm below implements these step by step):

| Variable | Meaning |
|----------|---------|
| P(x) | Task's current (pre-exec) set x |
| P'(x) | Task's new (post-exec) set x |
| F(x) | File capability set x (from security.capability xattr) |
| F(E) | File effective flag (single bit, not a set) |
| B | Task's bounding set (cap_bounding) |
| A | Task's ambient set (cap_ambient); cleared if exec is privileged |
P'(ambient)    = is_privileged ? {} : P(ambient)
P'(permitted)  = (P(inheritable) & F(inheritable))
               | (F(permitted) & B)
               | P'(ambient)
               // + B if euid==0 && !SECBIT_NOROOT  (legacy root grant)
P'(effective)  = (F(E) || (euid==0 && !SECBIT_NOROOT)) ? P'(permitted) : P'(ambient)
P'(inheritable)= P(inheritable)           // unchanged
P'(bounding)   = B                         // unchanged

Where is_privileged = (F != EMPTY) || (is_setuid && owner_uid==0 && !SECBIT_NOROOT). These formulae match Linux cap_bprm_set_creds() in security/commoncap.c exactly.
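A worked numeric example of these formulae on small sample masks (the bit positions are illustrative, not real CAP_* numbers; the legacy euid==0 branch is omitted for brevity):

```rust
/// Apply the canonical execve() capability formulae to sample bitmasks.
/// Returns (P'(ambient), P'(permitted), P'(effective)).
fn exec_transform(
    p_inheritable: u128, p_ambient: u128, bounding: u128,
    f_permitted: u128, f_inheritable: u128, f_effective: bool,
    is_privileged: bool,
) -> (u128, u128, u128) {
    // P'(ambient) = is_privileged ? {} : P(ambient)
    let new_ambient = if is_privileged { 0 } else { p_ambient };
    // P'(permitted) = (P(I) & F(I)) | (F(P) & B) | P'(ambient)
    let new_permitted =
        (p_inheritable & f_inheritable) | (f_permitted & bounding) | new_ambient;
    // P'(effective) = F(E) ? P'(permitted) : P'(ambient)
    // (the euid==0 && !SECBIT_NOROOT legacy branch is omitted here)
    let new_effective = if f_effective { new_permitted } else { new_ambient };
    (new_ambient, new_permitted, new_effective)
}

fn main() {
    // File with caps => privileged exec: ambient cleared, permitted built
    // from the file sets intersected with inheritable/bounding.
    let (a, p, e) = exec_transform(0b0110, 0b0100, 0b1111, 0b0001, 0b0010, true, true);
    assert_eq!((a, p, e), (0, 0b0011, 0b0011));
    // No file caps, unprivileged exec: ambient flows into all three sets.
    let (a2, p2, e2) = exec_transform(0b0110, 0b0100, 0b1111, 0, 0, false, false);
    assert_eq!((a2, p2, e2), (0b0100, 0b0100, 0b0100));
    println!("ok");
}
```

The second case shows why ambient capabilities exist: they are the only way an unprivileged, capability-dumb binary (no file caps, F(E) clear) ends up with a non-empty effective set after exec.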

Transformation algorithm (executed atomically during execve(), after ELF loading succeeds but before returning to user space):

execve_transform_caps(task, file):
  old = task.cred  // current credential

  // Step -2: SECBIT_EXEC_RESTRICT_FILE enforcement (UmkaOS-original).
  // NOTE: These securebits are defined in the Linux UAPI header
  // (include/uapi/linux/securebits.h, bits 8-11) but enforcement code
  // has NOT been merged into Linux mainline as of April 2026. The bits
  // were reserved in the UAPI but the actual enforcement logic was
  // proposed on LKML without being accepted (similar pattern to
  // latency_nice). UmkaOS implements enforcement as security hardening.
  // When set, execve() of any file without the execute permission bit
  // for the calling user is denied. This closes the fexecve(fd) loophole
  // where a file opened for reading can be executed via fd.
  //
  // `file.has_execute_permission()` checks the file's mode bits against
  // the caller's euid/egid/supplementary_groups (standard POSIX access
  // check for execute permission). Returns true if any of owner/group/other
  // execute bits grant access to the caller.
  if old.securebits.contains(EXEC_RESTRICT_FILE):
      if !file.has_execute_permission(old.euid, old.egid, old.supplementary_groups):
          return Err(EACCES)  // file not marked executable for this user

  // Step -1: SECBIT_EXEC_DENY_INTERACTIVE enforcement (UmkaOS-original).
  // When set, denies execution of interpreter-fed scripts. The kernel
  // checks whether the file was opened via O_EXEC (explicit exec intent)
  // vs O_RDONLY (read, then fed to interpreter). If the file's open mode
  // does not include O_EXEC and this bit is set, reject.
  // Implementation: the `file` argument carries the open mode from the
  // original openat2() call. If the caller used fexecve() or
  // execveat(AT_EMPTY_PATH), the kernel internally sets O_EXEC.
  // Scripts passed via stdin redirection (bash < script.sh) are caught
  // because the interpreter opens them with O_RDONLY, not O_EXEC.
  //
  // `file.is_interpreted_exec()` returns true if the current execve is
  // going through binfmt_script or binfmt_misc (the kernel detects this
  // via the #! shebang parsing in search_binary_handler). Set by the
  // binfmt layer before calling execve_transform_caps().
  //
  // `file.opened_with_exec_intent()` returns true if the file's open
  // flags include O_EXEC (or equivalent: AT_EMPTY_PATH with an exec fd).
  // The VFS layer records this flag during openat2()/fexecve().
  if old.securebits.contains(EXEC_DENY_INTERACTIVE):
      if file.is_interpreted_exec():  // detected via binfmt_script/binfmt_misc
          if !file.opened_with_exec_intent():
              return Err(EACCES)  // interpreter-based execution denied

  new = prepare_creds(task)

  // Step 0: Check no_new_privs gate
  is_setuid = file.is_setuid() && !old.no_new_privs
  is_setgid = file.is_setgid() && !old.no_new_privs
  has_file_caps = file.has_security_capability_xattr()
                  && !old.no_new_privs

  // Step 1: UID/GID transitions (setuid/setgid bits)
  if is_setuid:
      new.euid = file.owner_uid
      new.suid = file.owner_uid
      // fsuid tracks euid by default
      new.fsuid = file.owner_uid
  if is_setgid:
      new.egid = file.owner_gid
      new.sgid = file.owner_gid
      new.fsgid = file.owner_gid

  // Step 2: Load file capabilities
  F = if has_file_caps:
          parse_file_capabilities(file)  // from security.capability xattr
      else:
          FileCapabilities::EMPTY

  // Step 2a: Namespace-scoped file capability validation
  // File caps are only honored if the file's rootid namespace is the
  // same as or an ancestor of the task's user namespace (Section 17.1.6).
  if F != EMPTY:
      file_ns = namespace_for_rootid(F.rootid)
      if !is_same_or_ancestor(file_ns, old.user_ns):
          F = FileCapabilities::EMPTY  // silently ignore

  // Step 3: Determine if this is a "privileged" exec
  // A binary is "privileged" if it has file capabilities or is setuid-root
  is_privileged = (F != EMPTY)
                  || (is_setuid && file.owner_uid == 0
                      && !old.securebits.contains(NOROOT))

  // Step 4: Ambient set transformation
  // Ambient caps are cleared if this is a privileged exec
  new.cap_ambient = if is_privileged {
      SystemCaps::empty()
  } else {
      old.cap_ambient
  }

  // Step 5: Permitted set transformation
  // P'(permitted) = (P(inheritable) & F(inheritable))
  //               | (F(permitted) & cap_bounding)
  //               | P'(ambient)
  new.cap_permitted =
      (old.cap_inheritable & F.inheritable)
      | (F.permitted & old.cap_bounding)
      | new.cap_ambient

  // Step 5a: SECBIT_NOROOT interaction
  // If NOROOT is NOT set and the binary is setuid-root (or euid becomes 0),
  // grant full capabilities within the bounding set (legacy root behavior)
  if !old.securebits.contains(NOROOT) && new.euid == 0:
      new.cap_permitted = new.cap_permitted | old.cap_bounding

  // Step 6: Effective set transformation
  // If F.effective_flag is set (capability-dumb binary or setuid-root),
  // raise effective to equal permitted.
  // Otherwise (capability-aware binary), effective = ambient only.
  new.cap_effective = if F.effective_flag != 0 || (new.euid == 0
                        && !old.securebits.contains(NOROOT)) {
      new.cap_permitted
  } else {
      new.cap_ambient
  }

  // Step 7: Inheritable set is unchanged across execve()
  new.cap_inheritable = old.cap_inheritable

  // Step 8: Bounding set is unchanged across execve()
  new.cap_bounding = old.cap_bounding

  // Step 9: KEEP_CAPS is always cleared on execve()
  new.securebits.remove(KEEP_CAPS)
  // All other securebits (including locked variants) are preserved

  // Step 10: no_new_privs is preserved (one-way flag)
  new.no_new_privs = old.no_new_privs

  // Step 11: LSM transition hook
  // LSM hooks (AppArmor/SELinux) may deny the exec or apply domain transitions.
  // Denials are always enforced regardless of no_new_privs.
  // When no_new_privs is set, LSM additionally rejects privilege-increasing
  // domain transitions (the LSM hook checks bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS).
  lsm_result = lsm_task_exec_transition(old, new, file)  // [Section 9.8](#linux-security-module-framework)
  if lsm_result.is_err():
      abort_creds(new)
      return lsm_result

  // Step 12: Validate and commit
  commit_creds(task, new)

LSM blob lifecycle across execve():

- Each LSM module has a per-credential security blob (e.g., AppArmor's cred_label, SELinux's cred_security). The blob is allocated in new at step 1 via lsm_cred_alloc_blank(new), which calls each module's cred_alloc_blank hook.
- At step 11, lsm_task_exec_transition updates the blob in new with the post-exec security context (e.g., AppArmor domain transition, SELinux type transition). On denial, abort_creds(new) calls lsm_cred_free(new) to free the blob — no leak.
- At step 12, commit_creds atomically replaces task->cred with new via rcu_assign_pointer. The old cred is freed via call_rcu after all readers complete; its LSM blobs are freed in the RCU callback via lsm_cred_free(old).
- The RCU grace period guarantees that any concurrent current_cred() readers see either the old or the new credential — never a partially-updated blob.

Interaction table (how securebits affect the transformation):

| Securebit | Effect on execve() |
|-----------|--------------------|
| NOROOT set | UID 0 does NOT auto-gain caps at step 5a. Setuid-root binaries still change euid but get only file-granted caps, not full cap_bounding. |
| NOROOT clear (default) | UID 0 gets cap_bounding added to cap_permitted (legacy root behavior). |
| NO_SETUID_FIXUP set | No effect on execve() (only affects setuid()/seteuid() UID transitions, see below). |
| KEEP_CAPS set | Cleared unconditionally at step 9. Only affects setuid(), not execve(). |
| NO_CAP_AMBIENT_RAISE set | No effect on the execve() transform itself, but prevents prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE) beforehand. |
| EXEC_RESTRICT_FILE set | Step -2: execve() of files without execute permission for the calling user is denied with EACCES. Closes the fexecve(fd) loophole. (UmkaOS-original; bits reserved in Linux UAPI but enforcement not merged into mainline) |
| EXEC_DENY_INTERACTIVE set | Step -1: Interpreter-based execution of non-exec-intent files is denied with EACCES. Prevents the `bash < script.sh` bypass. (UmkaOS-original; bits reserved in Linux UAPI but enforcement not merged into mainline) |

9.9.5 UID Transition Capability Adjustments

When a task changes its euid via setuid(), seteuid(), setreuid(), or setresuid(), the capability sets are adjusted unless SECBIT_NO_SETUID_FIXUP is set:

fixup_caps_after_setuid(old_euid, new_euid, cred):
  if cred.securebits.contains(NO_SETUID_FIXUP):
      return  // no adjustment

  // Case 1: Transition FROM euid 0 to non-zero
  if old_euid == 0 && new_euid != 0:
      if !cred.securebits.contains(KEEP_CAPS):
          // Drop all capabilities (leaving root)
          cred.cap_permitted = SystemCaps::empty()
          cred.cap_effective = SystemCaps::empty()
          cred.cap_ambient = SystemCaps::empty()
      else:
          // KEEP_CAPS: permitted preserved, effective cleared
          cred.cap_effective = SystemCaps::empty()

  // Case 2: Transition TO euid 0 from non-zero
  if old_euid != 0 && new_euid == 0:
      if !cred.securebits.contains(NOROOT):
          // Gaining root: raise effective to permitted
          cred.cap_effective = cred.cap_permitted

  // In all cases, maintain ambient invariant
  cred.cap_ambient = cred.cap_ambient
      & cred.cap_permitted
      & cred.cap_inheritable

9.9.6 capget() and capset() Syscalls

These syscalls read and modify the Linux capability sets on a task's credentials. They operate on the five capability sets stored in TaskCredential (cap_effective, cap_permitted, cap_inheritable, cap_bounding, cap_ambient). The user-facing ABI uses the v3 protocol with two __user_cap_data_struct entries to encode 64-bit capability sets (the kernel widens these to SystemCaps u128 internally, zero-extending the high bits).

/// Userspace capability header (matches Linux struct __user_cap_header_struct).
/// Version must be _LINUX_CAPABILITY_VERSION_3 (0x20080522).
/// UmkaOS rejects v1 (0x19980330) and v2 (0x20071026) with EINVAL.
#[repr(C)]
pub struct UserCapHeader {
    /// Protocol version. Must be 0x20080522 (_LINUX_CAPABILITY_VERSION_3).
    pub version: u32,
    /// Target process PID. 0 = caller (self). Non-zero requires
    /// CAP_SETPCAP in the caller's effective set (for capset) or
    /// must be self (for capget, except with CAP_SYS_PTRACE).
    pub pid: i32,
}
// Layout: 4 + 4 = 8 bytes.
const_assert!(size_of::<UserCapHeader>() == 8);

/// Userspace capability data (matches Linux struct __user_cap_data_struct).
/// v3 uses TWO of these structs: [0] = bits 0-31, [1] = bits 32-63.
/// The kernel zero-extends bits 64-127 (UmkaOS-native caps are not
/// accessible via capget/capset — they use prctl() extensions instead).
#[repr(C)]
pub struct UserCapData {
    /// Effective capability bits for this 32-bit word.
    pub effective: u32,
    /// Permitted capability bits for this 32-bit word.
    pub permitted: u32,
    /// Inheritable capability bits for this 32-bit word.
    pub inheritable: u32,
}
// Layout: 3 × u32 = 12 bytes.
const_assert!(size_of::<UserCapData>() == 12);

pub const _LINUX_CAPABILITY_VERSION_3: u32 = 0x20080522;

capget(header, data_out) algorithm:

capget(header: &UserCapHeader, data_out: &mut [UserCapData; 2]) -> Result<()>:
  1. if header.version != _LINUX_CAPABILITY_VERSION_3:
       // Write correct version into header for userspace to retry
       header.version = _LINUX_CAPABILITY_VERSION_3
       return Err(EINVAL)

  2. target = if header.pid == 0:
       current_task
     else:
       find_task_by_pid(header.pid)?  // ESRCH if not found
       // Reading another task's caps requires CAP_SYS_PTRACE or
       // same thread group (tgid match) — same as /proc/PID/status
       if target.tgid != current_task.tgid
           && !capable(current_task, CAP_SYS_PTRACE):
         return Err(EPERM)

  3. cred = rcu_dereference(target.cred)

  4. // Pack SystemCaps (u128) into two UserCapData structs (low 64 bits only)
     data_out[0].effective   = (cred.cap_effective.bits() >>  0) as u32
     data_out[0].permitted   = (cred.cap_permitted.bits() >>  0) as u32
     data_out[0].inheritable = (cred.cap_inheritable.bits() >> 0) as u32
     data_out[1].effective   = (cred.cap_effective.bits() >> 32) as u32
     data_out[1].permitted   = (cred.cap_permitted.bits() >> 32) as u32
     data_out[1].inheritable = (cred.cap_inheritable.bits() >> 32) as u32

  5. return Ok(())
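The v3 packing in step 4 (and its inverse in capset step 3) is a straight split of the low 64 bits of the u128 mask into two 32-bit words. A self-contained sketch (the function names are illustrative):

```rust
/// Pack the low 64 bits of a u128 SystemCaps value into the two 32-bit
/// words of the v3 __user_cap_data_struct pair: [0] = bits 0-31,
/// [1] = bits 32-63. Bits 64-127 are unreachable via this ABI.
fn pack_v3(caps: u128) -> [u32; 2] {
    [(caps & 0xFFFF_FFFF) as u32, ((caps >> 32) & 0xFFFF_FFFF) as u32]
}

/// Zero-extending inverse used by capset(): widen two 32-bit words back
/// into a u128 with the high 64 bits clear.
fn unpack_v3(words: [u32; 2]) -> u128 {
    (words[0] as u128) | ((words[1] as u128) << 32)
}

fn main() {
    let caps: u128 = (1 << 40) | (1 << 3); // two sample bits in the low 64
    assert_eq!(unpack_v3(pack_v3(caps)), caps); // low 64 bits round-trip
    // Bits 64..127 (UmkaOS-native caps) are silently dropped by the ABI,
    // which is why they are managed via prctl() extensions instead:
    assert_eq!(unpack_v3(pack_v3(1u128 << 70)), 0);
    println!("ok");
}
```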

capset(header, data_in) algorithm:

capset(header: &UserCapHeader, data_in: &[UserCapData; 2]) -> Result<()>:
  1. if header.version != _LINUX_CAPABILITY_VERSION_3:
       header.version = _LINUX_CAPABILITY_VERSION_3
       return Err(EINVAL)

  2. if header.pid != 0 && header.pid != current_task.pid:
       return Err(EPERM)  // capset only modifies own caps (Linux restriction)

  3. // Unpack two UserCapData structs into SystemCaps (u128)
     new_effective   = SystemCaps::from_bits_truncate(
         (data_in[0].effective   as u128) | ((data_in[1].effective   as u128) << 32))
     new_permitted   = SystemCaps::from_bits_truncate(
         (data_in[0].permitted   as u128) | ((data_in[1].permitted   as u128) << 32))
     new_inheritable = SystemCaps::from_bits_truncate(
         (data_in[0].inheritable as u128) | ((data_in[1].inheritable as u128) << 32))

  4. // Clone the current credential BEFORE validation. This prevents
     // a TOCTOU race where another thread calls commit_creds() between
     // validation and cloning, causing validation against stale data.
     // Matches the pattern used by setuid/setreuid/setresuid in this file.
     old_cred = current_task.cred
     new_cred = prepare_creds(current_task)

  5. // Validation rules (must ALL pass before any modification):
     // Rule 1: new_permitted must be a subset of old cap_permitted.
     //   A task cannot grant itself caps it doesn't already have in permitted.
     if new_permitted & !old_cred.cap_permitted != SystemCaps::empty():
         return Err(EPERM)

     // Rule 2: new_effective must be a subset of new_permitted.
     //   Cannot raise effective bits beyond what is permitted.
     if new_effective & !new_permitted != SystemCaps::empty():
         return Err(EPERM)

     // Rule 3: new_inheritable — the (permitted | inheritable) check
     //   applies only when cap_inh_is_capped (i.e., the task lacks
     //   CAP_SETPCAP). A task WITH CAP_SETPCAP can raise inheritable
     //   bits beyond (permitted | inheritable), limited only by the
     //   bounding set. Matches Linux: cap_inh_is_capped() returns true
     //   when !ns_capable(CAP_SETPCAP).
     let new_inheritable_raised = new_inheritable & !old_cred.cap_inheritable
     if !ns_capable(current_task, old_cred.user_ns, CAP_SETPCAP):
         // Capped: raised bits must be in permitted | inheritable
         if new_inheritable_raised & !(old_cred.cap_permitted | old_cred.cap_inheritable)
             != SystemCaps::empty():
             return Err(EPERM)
     // Bounding set check applies unconditionally (even for CAP_SETPCAP holders)
     if new_inheritable_raised & !old_cred.cap_bounding != SystemCaps::empty():
         return Err(EPERM)

  6. // LSM hook: security_capset()
     lsm_result = lsm_security_capset(
         old_cred, new_effective, new_inheritable, new_permitted)
     if lsm_result.is_err():
         return lsm_result

  7. // new_cred was already cloned in step 4 (before validation).
     new_cred.cap_effective = new_effective
     new_cred.cap_permitted = new_permitted
     new_cred.cap_inheritable = new_inheritable

     // Maintain ambient invariant: ambient must be subset of both
     // permitted AND inheritable
     new_cred.cap_ambient = new_cred.cap_ambient
         & new_cred.cap_permitted
         & new_cred.cap_inheritable

  8. commit_creds(current_task, new_cred)

9.9.7 setuid(), setreuid(), setresuid(), and setfsuid() Syscalls

These syscalls modify the POSIX identity fields (uid, euid, suid, fsuid) in the task's credentials. Each has different rules about which fields change and what permission checks are applied. All follow the prepare-modify-commit credential protocol.

State transition table — which of {ruid, euid, suid, fsuid} change for each syscall:

| Syscall | ruid | euid | suid | fsuid | Permission |
|---------|------|------|------|-------|------------|
| setuid(uid) (euid==0) | uid | uid | uid | uid | euid==0 or CAP_SETUID |
| setuid(uid) (euid!=0) | unchanged | uid | unchanged | uid | uid must equal ruid or suid |
| seteuid(euid) | unchanged | euid | unchanged | euid | euid must equal ruid or suid, or CAP_SETUID |
| setreuid(ruid, euid) | ruid (if != -1) | euid (if != -1) | new_euid | new_euid | see rules below |
| setresuid(r, e, s) | r (if != -1) | e (if != -1) | s (if != -1) | new_euid | see rules below |
| setfsuid(fsuid) | unchanged | unchanged | unchanged | fsuid | fsuid must equal ruid, euid, suid, or old fsuid; or CAP_SETUID |

In all cases, fsuid tracks euid unless independently set via setfsuid().

setuid(uid) algorithm:

sys_setuid(task, uid: u32) -> Result<()>:
  old = task.cred

  new = prepare_creds(task)

  if ns_capable(task, old.user_ns, CAP_SETUID):
      // Privileged: set all four IDs
      new.uid  = uid
      new.euid = uid
      new.suid = uid
  else:
      // Unprivileged: can only set euid to ruid or suid
      if uid != old.uid && uid != old.suid:
          abort_creds(new)
          return Err(EPERM)
      new.euid = uid

  new.fsuid = new.euid  // fsuid always tracks euid

  // LSM hook
  if lsm_task_fix_setuid(old, new, LSM_SETID_ID).is_err():
      abort_creds(new)
      return Err(EPERM)

  // Capability side effects (see fixup_caps_after_setuid above)
  fixup_caps_after_setuid(old.euid, new.euid, &mut new)

  commit_creds(task, new)

setreuid(ruid, euid) algorithm:

sys_setreuid(task, ruid: i32, euid: i32) -> Result<()>:
  // Reject negative values other than -1 (the "don't change" sentinel).
  // Without this, -2 (0xFFFFFFFE) would be cast to uid 4294967294.
  if (ruid < -1) || (euid < -1):
      return Err(EINVAL)

  old = task.cred
  new = prepare_creds(task)

  if ruid != -1:
      // Setting ruid: must be current ruid or euid, or have CAP_SETUID
      if ruid != old.uid && ruid != old.euid
          && !ns_capable(task, old.user_ns, CAP_SETUID):
          abort_creds(new); return Err(EPERM)
      new.uid = ruid as u32

  if euid != -1:
      // Setting euid: must be current ruid, euid, or suid, or have CAP_SETUID
      if euid != old.uid && euid != old.euid && euid != old.suid
          && !ns_capable(task, old.user_ns, CAP_SETUID):
          abort_creds(new); return Err(EPERM)
      new.euid = euid as u32

  // If ruid was specified, or euid was set to a value different from the
  // old ruid, suid becomes the new euid.
  // Per Linux __sys_setreuid(): ruid != -1 || (euid != -1 && euid != old.uid).
  if ruid != -1 || (euid != -1 && new.euid != old.uid):
      new.suid = new.euid

  new.fsuid = new.euid

  if lsm_task_fix_setuid(old, new, LSM_SETID_RE).is_err():
      abort_creds(new); return Err(EPERM)

  fixup_caps_after_setuid(old.euid, new.euid, &mut new)
  commit_creds(task, new)

setresuid(ruid, euid, suid) algorithm:

sys_setresuid(task, ruid: i32, euid: i32, suid: i32) -> Result<()>:
  // Reject negative values other than -1, as in setreuid() above.
  if (ruid < -1) || (euid < -1) || (suid < -1):
      return Err(EINVAL)

  old = task.cred
  has_cap = ns_capable(old.user_ns, CAP_SETUID)

  // Validate all three parameters before modifying anything
  if ruid != -1 && ruid as u32 != old.uid && ruid as u32 != old.euid
      && ruid as u32 != old.suid && !has_cap:
      return Err(EPERM)
  if euid != -1 && euid as u32 != old.uid && euid as u32 != old.euid
      && euid as u32 != old.suid && !has_cap:
      return Err(EPERM)
  if suid != -1 && suid as u32 != old.uid && suid as u32 != old.euid
      && suid as u32 != old.suid && !has_cap:
      return Err(EPERM)

  new = prepare_creds(task)

  if ruid != -1: new.uid  = ruid as u32
  if euid != -1: new.euid = euid as u32
  if suid != -1: new.suid = suid as u32

  new.fsuid = new.euid

  // setresuid(-1, -1, -1) = no-op (keep all IDs unchanged)

  if lsm_task_fix_setuid(old, new, LSM_SETID_RES).is_err():
      abort_creds(new); return Err(EPERM)

  fixup_caps_after_setuid(old.euid, new.euid, &mut new)
  commit_creds(task, new)
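The three-way validation rule above can be sketched as a small predicate. This is an illustrative sketch, not the spec's implementation; `Ids` and `setresuid_allowed` are assumed names:

```rust
/// Sketch of the setresuid() validation rule: each argument that is not -1
/// must match one of the caller's current uid/euid/suid, unless the caller
/// holds CAP_SETUID.
#[derive(Clone, Copy)]
struct Ids { uid: u32, euid: u32, suid: u32 }

fn setresuid_allowed(old: Ids, arg: i32, has_cap_setuid: bool) -> bool {
    if arg == -1 { return true; }       // -1 means "leave unchanged"
    if has_cap_setuid { return true; }  // CAP_SETUID bypasses the match rule
    let v = arg as u32;
    v == old.uid || v == old.euid || v == old.suid
}

fn main() {
    let old = Ids { uid: 1000, euid: 0, suid: 0 };
    // Unprivileged process with euid/suid 0 may restore euid 0...
    assert!(setresuid_allowed(old, 0, false));
    // ...but may not become an unrelated uid without CAP_SETUID.
    assert!(!setresuid_allowed(old, 42, false));
    println!("ok");
}
```

Note that all three arguments are validated before any credential is modified, so a partially applied setresuid() is impossible.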

setfsuid(fsuid) algorithm:

sys_setfsuid(task, fsuid: u32) -> Result<u32>:
  old = task.cred
  old_fsuid = old.fsuid  // return value is always the old fsuid

  // Check permission: fsuid must match one of the four IDs, or CAP_SETUID
  if fsuid != old.uid && fsuid != old.euid && fsuid != old.suid
      && fsuid != old.fsuid && !ns_capable(old.user_ns, CAP_SETUID):
      return Ok(old_fsuid)  // silently fail (Linux behavior: returns old fsuid)

  new = prepare_creds(task)
  new.fsuid = fsuid

  // Capability side effects: if fsuid transitions to/from 0, adjust
  // FS-related capabilities ONLY (not all caps).
  //
  // Matches Linux `security/commoncap.c cap_task_fix_setuid()` case
  // LSM_SETID_FS exactly. Key differences from setuid/setreuid/setresuid:
  //   1. Only checks NO_SETUID_FIXUP (NOT KEEP_CAPS — KEEP_CAPS is
  //      irrelevant for the fsuid path in Linux).
  //   2. Only drops/raises FS-related capabilities via cap_drop_fs_set()
  //      and cap_raise_fs_set(), NOT all capabilities.
  //
  // CAP_FS_SET = CAP_CHOWN | CAP_MKNOD | CAP_DAC_OVERRIDE
  //            | CAP_DAC_READ_SEARCH | CAP_FOWNER | CAP_FSETID
  //            | CAP_LINUX_IMMUTABLE | CAP_MAC_OVERRIDE
  if !new.securebits.contains(NO_SETUID_FIXUP):
      if old.fsuid == 0 && fsuid != 0:
          // Leaving root fsuid → drop only FS-related effective caps
          new.cap_effective = cap_drop_fs_set(new.cap_effective)
      if old.fsuid != 0 && fsuid == 0:
          // Gaining root fsuid → raise only FS-related effective caps
          // from permitted set
          new.cap_effective = cap_raise_fs_set(new.cap_effective, new.cap_permitted)

  if lsm_task_fix_setuid(old, new, LSM_SETID_FS).is_err():
      abort_creds(new); return Ok(old_fsuid)

  commit_creds(task, new)
  Ok(old_fsuid)

FS capability set helpers (used by setfsuid() and Linux cap_task_fix_setuid(LSM_SETID_FS)):

/// Filesystem-related capability mask. These are the capabilities that are
/// dropped/raised when fsuid transitions to/from root. Matches Linux
/// `CAP_FS_SET` from `include/linux/capability.h`:
/// `CAP_FS_MASK | BIT_ULL(CAP_LINUX_IMMUTABLE)`.
const CAP_FS_SET: SystemCaps = SystemCaps::from_bits_truncate(
    (1u128 << CAP_CHOWN)
    | (1u128 << CAP_MKNOD)
    | (1u128 << CAP_DAC_OVERRIDE)
    | (1u128 << CAP_DAC_READ_SEARCH)
    | (1u128 << CAP_FOWNER)
    | (1u128 << CAP_FSETID)
    | (1u128 << CAP_LINUX_IMMUTABLE) // bit 9 — included in Linux CAP_FS_SET, not in CAP_FS_MASK
    | (1u128 << CAP_MAC_OVERRIDE)
);

/// Drop FS-related capabilities from the effective set.
/// Returns effective with CAP_FS_SET bits cleared.
fn cap_drop_fs_set(effective: SystemCaps) -> SystemCaps {
    effective & !CAP_FS_SET
}

/// Raise FS-related capabilities in the effective set from the permitted set.
/// Returns effective with CAP_FS_SET bits raised (ORed) from permitted.
fn cap_raise_fs_set(effective: SystemCaps, permitted: SystemCaps) -> SystemCaps {
    effective | (permitted & CAP_FS_SET)
}
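The two helpers can be exercised over a plain u128 bitmask. This runnable sketch uses the Linux capability bit numbers from capability.h and elides the `SystemCaps` wrapper:

```rust
// Linux capability bit numbers (include/uapi/linux/capability.h).
const CAP_CHOWN: u32 = 0;
const CAP_DAC_OVERRIDE: u32 = 1;
const CAP_DAC_READ_SEARCH: u32 = 2;
const CAP_FOWNER: u32 = 3;
const CAP_FSETID: u32 = 4;
const CAP_LINUX_IMMUTABLE: u32 = 9;
const CAP_SYS_ADMIN: u32 = 21;
const CAP_MKNOD: u32 = 27;
const CAP_MAC_OVERRIDE: u32 = 32;

// FS-related capability mask, as in the CAP_FS_SET constant above.
const CAP_FS_SET: u128 = (1 << CAP_CHOWN)
    | (1 << CAP_MKNOD)
    | (1 << CAP_DAC_OVERRIDE)
    | (1 << CAP_DAC_READ_SEARCH)
    | (1 << CAP_FOWNER)
    | (1 << CAP_FSETID)
    | (1 << CAP_LINUX_IMMUTABLE)
    | (1 << CAP_MAC_OVERRIDE);

fn cap_drop_fs_set(effective: u128) -> u128 {
    effective & !CAP_FS_SET
}

fn cap_raise_fs_set(effective: u128, permitted: u128) -> u128 {
    effective | (permitted & CAP_FS_SET)
}

fn main() {
    // Effective set holding all FS caps plus one non-FS cap.
    let effective: u128 = CAP_FS_SET | (1 << CAP_SYS_ADMIN);
    let dropped = cap_drop_fs_set(effective);
    // Only the non-FS bit survives the drop...
    assert_eq!(dropped, 1u128 << CAP_SYS_ADMIN);
    // ...and raising from a permitted superset restores the FS bits.
    assert_eq!(cap_raise_fs_set(dropped, effective), effective);
    println!("ok");
}
```

The drop/raise pair is why fsuid transitions are reversible: leaving fsuid 0 clears only CAP_FS_SET bits from effective, and returning to fsuid 0 restores exactly those bits that are still in permitted.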

LSM setid discriminants (passed to security_task_fix_setuid()):

/// Discriminant for security_task_fix_setuid() LSM hook.
/// Tells the LSM which syscall triggered the UID change so it can
/// apply different policy (e.g., SELinux may have different rules
/// for setuid vs setresuid).
#[repr(u32)]
pub enum LsmSetidOp {
    /// setuid() or seteuid()
    LSM_SETID_ID  = 0,
    /// setreuid()
    LSM_SETID_RE  = 1,
    /// setresuid()
    LSM_SETID_RES = 2,
    /// setfsuid()
    LSM_SETID_FS  = 3,
}

Capability side effects summary — when euid transitions:

| Transition | NO_SETUID_FIXUP clear (default) | NO_SETUID_FIXUP set |
| --- | --- | --- |
| euid: 0 → non-zero, KEEP_CAPS clear | cap_permitted = empty, cap_effective = empty, cap_ambient = empty | No change |
| euid: 0 → non-zero, KEEP_CAPS set | cap_effective = empty (permitted preserved) | No change |
| euid: non-zero → 0, NOROOT clear | cap_effective = cap_permitted | No change |
| euid: non-zero → 0, NOROOT set | No change | No change |
| euid unchanged | No change | No change |

The fixup_caps_after_setuid() function (defined above in "UID Transition Capability Adjustments") implements these rules.
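A minimal sketch of the rules in the table, under assumed names (`Caps`, `SecBits`); the spec's fixup_caps_after_setuid() operates on the full credential struct rather than these stripped-down types:

```rust
struct Caps { permitted: u128, effective: u128, ambient: u128 }

struct SecBits { noroot: bool, no_setuid_fixup: bool, keep_caps: bool }

fn fixup_caps_after_setuid(old_euid: u32, new_euid: u32, sb: &SecBits, c: &mut Caps) {
    // NO_SETUID_FIXUP suppresses all adjustments; unchanged euid is a no-op.
    if sb.no_setuid_fixup || old_euid == new_euid {
        return;
    }
    if old_euid == 0 && new_euid != 0 {
        // Leaving root: effective always cleared; KEEP_CAPS preserves
        // permitted (and ambient).
        c.effective = 0;
        if !sb.keep_caps {
            c.permitted = 0;
            c.ambient = 0;
        }
    } else if new_euid == 0 && !sb.noroot {
        // Regaining root (unless NOROOT): effective = permitted.
        c.effective = c.permitted;
    }
}

fn main() {
    let sb = SecBits { noroot: false, no_setuid_fixup: false, keep_caps: true };
    let mut c = Caps { permitted: 0b111, effective: 0b101, ambient: 0b001 };
    // Leaving root with KEEP_CAPS: effective cleared, permitted preserved.
    fixup_caps_after_setuid(0, 1000, &sb, &mut c);
    assert_eq!(c.effective, 0);
    assert_eq!(c.permitted, 0b111);
    // Regaining root: effective restored from permitted.
    fixup_caps_after_setuid(1000, 0, &sb, &mut c);
    assert_eq!(c.effective, 0b111);
    println!("ok");
}
```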

9.9.8 setgroups() and Supplementary Group Management

setgroups(task, new_groups: &[u32]) -> Result<()>:
  1. Check: ns_capable(task.cred.user_ns, CAP_SETGID)
     // OR: task has written to /proc/PID/gid_map (user namespace case)
  2. if new_groups.len() > NGROUPS_MAX:
       return Err(EINVAL)
  3. sorted = new_groups.to_sorted_deduped()
  4. new_cred = prepare_creds(task)
  5. new_cred.supplementary_groups = Arc::new(SortedGidList::from(sorted))
  6. commit_creds(task, new_cred)

// Permission check helper used by VFS (file open, directory access):
in_group_p(cred: &TaskCredential, gid: u32) -> bool:
  cred.fsgid == gid || cred.supplementary_groups.contains(gid)

User namespace interaction: Inside a user namespace, setgroups() is denied by default until /proc/PID/gid_map has been written (matching Linux behavior since Linux 3.19). This prevents an unprivileged process in a new user namespace from using setgroups() to drop groups it does not want to be checked against, which would be a privilege escalation. The gid_map_written flag is tracked on the UserNamespace struct.
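The sort/dedup step and the in_group_p() membership test can be sketched as follows (illustrative `SortedGidList`; the spec's version lives behind an `Arc` on the credential):

```rust
/// Sorted, deduplicated supplementary-group list. Sorting once at
/// setgroups() time makes every later membership test a binary search.
struct SortedGidList(Vec<u32>);

impl SortedGidList {
    fn from_unsorted(mut gids: Vec<u32>) -> Self {
        gids.sort_unstable();
        gids.dedup();
        SortedGidList(gids)
    }

    fn contains(&self, gid: u32) -> bool {
        self.0.binary_search(&gid).is_ok()
    }
}

/// Permission-check helper: fsgid match or supplementary-group membership.
fn in_group_p(fsgid: u32, groups: &SortedGidList, gid: u32) -> bool {
    fsgid == gid || groups.contains(gid)
}

fn main() {
    let groups = SortedGidList::from_unsorted(vec![20, 4, 27, 4]);
    assert!(in_group_p(1000, &groups, 1000)); // fsgid match
    assert!(in_group_p(1000, &groups, 27));   // supplementary match
    assert!(!in_group_p(1000, &groups, 99));
    println!("ok");
}
```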

9.9.9 prctl() Credential Operations

prctl_set_no_new_privs(task):
  // One-way flag: once set, never cleared
  new_cred = prepare_creds(task)
  new_cred.no_new_privs = true
  commit_creds(task, new_cred)
  // Always succeeds (no capability required -- this is self-restriction)

prctl_set_securebits(task, new_bits: u32) -> Result<()>:
  1. Check: ns_capable(task.cred.user_ns, CAP_SETPCAP)
  2. old_bits = task.cred.securebits
  3. // Cannot change a locked bit (in either direction).
     // Six (flag, locked) pairs: (NOROOT, NOROOT_LOCKED),
     // (NO_SETUID_FIXUP, NO_SETUID_FIXUP_LOCKED),
     // (KEEP_CAPS, KEEP_CAPS_LOCKED),
     // (NO_CAP_AMBIENT_RAISE, NO_CAP_AMBIENT_RAISE_LOCKED),
     // (EXEC_RESTRICT_FILE, EXEC_RESTRICT_FILE_LOCKED),
     // (EXEC_DENY_INTERACTIVE, EXEC_DENY_INTERACTIVE_LOCKED)
     for each (flag, locked) pair in SecureBits:
       if old_bits.contains(locked) && (old_bits & flag) != (new_bits & flag):
           return Err(EPERM)  // locked bit prevents changing the flag
  4. // Cannot clear a lock bit
     for each locked bit:
       if old_bits.contains(locked) && !new_bits.contains(locked):
           return Err(EPERM)
  5. // Validate: no unknown bits set
     if new_bits & !SecureBits::all().bits() != 0:
         return Err(EINVAL)
  6. new_cred = prepare_creds(task)
  7. new_cred.securebits = SecureBits::from_bits_truncate(new_bits)
  8. commit_creds(task, new_cred)
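The locked-pair validation can be sketched with illustrative bit positions (the real SecureBits values differ). Following the Linux semantics, a flag whose lock bit is set cannot be changed in either direction, and a lock bit, once set, can never be cleared:

```rust
// Illustrative (flag, locked) bit pairs; real SecureBits positions differ.
const PAIRS: [(u32, u32); 3] = [
    (1 << 0, 1 << 1), // (NOROOT, NOROOT_LOCKED)
    (1 << 2, 1 << 3), // (NO_SETUID_FIXUP, NO_SETUID_FIXUP_LOCKED)
    (1 << 4, 1 << 5), // (KEEP_CAPS, KEEP_CAPS_LOCKED)
];

fn securebits_update_allowed(old: u32, new: u32) -> bool {
    for (flag, locked) in PAIRS {
        // A flag whose lock bit is set must keep its old value.
        if old & locked != 0 && (old & flag) != (new & flag) {
            return false;
        }
        // A lock bit, once set, can never be cleared.
        if old & locked != 0 && new & locked == 0 {
            return false;
        }
    }
    true
}

fn main() {
    let noroot = 1 << 0;
    let noroot_locked = 1 << 1;
    // A locked flag cannot be cleared...
    assert!(!securebits_update_allowed(noroot | noroot_locked, noroot_locked));
    // ...and the lock itself cannot be dropped.
    assert!(!securebits_update_allowed(noroot_locked, 0));
    // Unlocked flags may change freely.
    assert!(securebits_update_allowed(noroot, 0));
    println!("ok");
}
```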

prctl_cap_ambient(task, op, cap) -> Result<u64>:
  // RAISE/LOWER/CLEAR_ALL return Ok(0); IS_SET returns the boolean as u64.
  match op:
    PR_CAP_AMBIENT_RAISE:
      if task.cred.securebits.contains(NO_CAP_AMBIENT_RAISE):
          return Err(EPERM)
      if !task.cred.cap_permitted.contains(cap):
          return Err(EPERM)
      if !task.cred.cap_inheritable.contains(cap):
          return Err(EPERM)
      new_cred = prepare_creds(task)
      new_cred.cap_ambient |= cap
      commit_creds(task, new_cred)
    PR_CAP_AMBIENT_LOWER:
      new_cred = prepare_creds(task)
      new_cred.cap_ambient &= !cap
      commit_creds(task, new_cred)
    PR_CAP_AMBIENT_CLEAR_ALL:
      new_cred = prepare_creds(task)
      new_cred.cap_ambient = SystemCaps::empty()
      commit_creds(task, new_cred)
    PR_CAP_AMBIENT_IS_SET:
      return Ok(task.cred.cap_ambient.contains(cap) as u64)

prctl_capbset_drop(task, cap) -> Result<()>:
  1. Check: ns_capable(task.cred.user_ns, CAP_SETPCAP)
  2. new_cred = prepare_creds(task)
  3. new_cred.cap_bounding &= !cap
  4. // Maintain ambient invariant: ambient must be a subset of both
     // permitted and inheritable. Bounding doesn't directly constrain
     // ambient, but dropping from bounding prevents future
     // file-cap-based elevation.
  5. commit_creds(task, new_cred)

Cross-references:

  • Section 9.1 (08-security.md): Capability-based foundation (UmkaOS native model)
  • Section 9.2 (08-security.md): SystemCaps bitflags definition
  • Section 8.1 (07-process.md): Task and Process structs
  • Section 8.1 (07-process.md): execve() ELF loading sequence
  • Section 19.1: Syscall dispatch (where capable() is called)
  • Section 9.8 (08-security.md): LSM framework (security blobs, hook callouts)
  • Section 17.1 (16-containers.md): compute_effective_caps() and namespace hierarchy

9.9.10 Credential Translation

When a Linux binary executes, umka-sysapi constructs a Capability Set based on the file's ownership and the process's current credentials:

  1. Authentication: The process presents its UID/GID.
  2. Translation: umka-sysapi translates the namespace-local UID/GID to the parent namespace via UserNamespace::uid_map / gid_map. This is a lock-free read (the maps are frozen after the single write to /proc/PID/uid_map). For the common case of a single mapping entry (e.g., container UID 0-65535 → host UID 100000-165535), translation is a single range check + addition. If is_identity is set, the UID is returned unchanged (zero-cost fast path).
  3. Capability Grant: If the UID is 0 (root) in the initial namespace, the process is granted a wide set of administrative capabilities (e.g., CAP_SYS_ADMIN equivalent).
  4. File Access: When accessing a file, umka-sysapi checks the file's VFS mode bits against the process's UID/GID. If access is allowed, a native UmkaOS Capability<VfsNode> is generated and returned as a file descriptor.
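The single-range fast path in step 2 amounts to one range check plus an offset addition. A sketch with illustrative field names (the spec's UserNamespace map layout may differ):

```rust
/// One entry of a frozen uid_map: maps `count` consecutive namespace-local
/// UIDs starting at `first_local` onto parent UIDs starting at `first_parent`.
struct UidMapEntry {
    first_local: u32,
    first_parent: u32,
    count: u32,
}

fn translate_uid(map: &[UidMapEntry], local: u32) -> Option<u32> {
    for e in map {
        // Range check + offset addition: the whole fast path.
        if local >= e.first_local && local - e.first_local < e.count {
            return Some(e.first_parent + (local - e.first_local));
        }
    }
    None // unmapped UID
}

fn main() {
    // Container UIDs 0..65535 mapped to host UIDs 100000..165535.
    let map = [UidMapEntry { first_local: 0, first_parent: 100000, count: 65536 }];
    assert_eq!(translate_uid(&map, 0), Some(100000));
    assert_eq!(translate_uid(&map, 1000), Some(101000));
    assert_eq!(translate_uid(&map, 70000), None);
    println!("ok");
}
```

Because the map is frozen after the single write to /proc/PID/uid_map, this read needs no lock; the `is_identity` fast path mentioned above skips even the range check.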

9.9.11 Bounding Sets and Dropping Privileges

Linux capabilities (e.g., CAP_NET_BIND_SERVICE) map 1:1 to specific UmkaOS capability flags. When a process calls capset() to drop privileges, umka-sysapi permanently revokes the corresponding UmkaOS capabilities from the process's Capability Domain. Because UmkaOS capabilities are unforgeable, a dropped privilege can never be regained, ensuring strict containment.

9.9.12 Linux Capability to UmkaOS Capability Translation

UmkaOS maintains two orthogonal access control systems: Linux-compatible SystemCaps (coarse-grained, process-level, checked by capable()) and UmkaOS-native PermissionBits (fine-grained, per-capability-object, checked in the KABI dispatch path). This section specifies how the two systems interact when Linux binaries exercise privileged operations that touch UmkaOS-native objects.

9.9.12.1 Translation Table

When a Linux syscall path requires access to a UmkaOS-native object (a file capability, a device capability, a network socket), the SysAPI layer translates the Linux SystemCaps authorization into UmkaOS PermissionBits on the resulting Capability object. The table below defines the mapping for each Linux capability that produces or widens a UmkaOS capability:

| Linux SystemCaps | UmkaOS PermissionBits | Scope | Notes |
| --- | --- | --- | --- |
| CAP_DAC_READ_SEARCH | READ | File/directory capabilities | Bypass file read and directory search permission checks. Produces file capabilities with READ. |
| CAP_DAC_OVERRIDE | READ \| WRITE | File capabilities | Bypass all DAC read/write checks. Does not grant ADMIN (no ownership bypass). |
| CAP_FOWNER | READ \| WRITE \| ADMIN | File capabilities | Bypass owner-identity checks (chmod, utime, etc.). ADMIN enables owner-only operations. |
| CAP_SYS_ADMIN | ADMIN | System-wide device capabilities | Maps to ADMIN on device and system objects. Also checked via CAP_ADMIN (bit 64); see Section 9.2. |
| CAP_NET_ADMIN | ADMIN | Network device capabilities | Network interface configuration, routing table modification, firewall rules. |
| CAP_NET_RAW | READ \| WRITE | Raw socket capabilities | Create raw (AF_PACKET, SOCK_RAW) sockets. READ for packet capture, WRITE for packet injection. |
| CAP_SYS_RAWIO | READ \| WRITE | Block device capabilities | Direct I/O to block devices (iopl, ioperm on x86). |
| CAP_MKNOD | WRITE | Device node creation capabilities | Create device special files. WRITE only (the created node inherits its own permission bits from the filesystem). |
| CAP_SYS_PTRACE | DEBUG \| READ | Process capabilities | Maps to DEBUG + READ on the target process capability. See also CAP_DEBUG (bit 68). |
| CAP_IPC_OWNER | READ \| WRITE \| ADMIN | IPC object capabilities | Bypass IPC ownership checks (SysV shm, sem, msg). |
| CAP_CHOWN | ADMIN | File capabilities | Change file ownership. Requires ADMIN on the target file capability. |
| CAP_SETFCAP | ADMIN | File capabilities | Set file capabilities (security.capability xattr). Requires ADMIN on the target. |

Key principles:

  1. Translation is additive, not substitutive. A SystemCaps check passing does not bypass the PermissionBits check on the capability object. It determines what PermissionBits a newly created or widened capability receives.
  2. Narrowing is always possible. A process with CAP_DAC_OVERRIDE that opens a file read-only receives a capability with only READ, not READ | WRITE. The translation table defines the maximum bits; the actual operation determines the granted subset.
  3. Unmapped capabilities. Linux capabilities that do not produce UmkaOS capability objects (e.g., CAP_SYS_BOOT, CAP_SYS_TIME, CAP_SYS_NICE, CAP_KILL, CAP_SETUID, CAP_SETGID) are checked purely via capable() against SystemCaps and have no PermissionBits translation. They gate operations that do not involve capability-mediated object access.
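Principle 2 (narrowing) reduces to a bitwise intersection. A trivial sketch with illustrative permission constants (the spec's PermissionBits layout is defined in Section 9.1):

```rust
// Illustrative permission bits; the real PermissionBits layout differs.
const READ: u8 = 0b001;
const WRITE: u8 = 0b010;

/// The translation table supplies the maximum bits a SystemCaps grant can
/// yield; the operation's request selects the granted subset.
fn granted_bits(table_max: u8, requested: u8) -> u8 {
    table_max & requested
}

fn main() {
    // CAP_DAC_OVERRIDE maps to READ | WRITE, but an O_RDONLY open only
    // requests READ, so the resulting capability carries READ alone.
    assert_eq!(granted_bits(READ | WRITE, READ), READ);
    println!("ok");
}
```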

9.9.12.2 UID 0 Capability Grant Policy

When a process with euid == 0 calls execve(), the standard capability transformation (see execve_transform_caps above) handles the Linux-compatible capability sets (cap_effective, cap_permitted, etc.) via the SECBIT_NOROOT mechanism. This section specifies the additional UmkaOS-native capability grants for UID 0.

Root capability set construction (executed during execve() after commit_creds, before returning to userspace):

construct_root_caps(task):
  if task.cred.euid != 0:
      return  // non-root: no additional grants

  if task.cred.securebits.contains(NOROOT):
      return  // NOROOT suppresses root auto-grants

  // Grant UmkaOS-native capability objects for each device category.
  // These are Capability objects (Section 9.1) with PermissionBits::ALL
  // (READ | WRITE | ADMIN) injected into the task's CapSpace.
  //
  // The objects cover the following categories:
  //   - Block devices: all currently registered block devices
  //   - Network devices: all currently registered NICs
  //   - Character devices: all currently registered char devices
  //   - IPC objects: full access to all SysV/POSIX IPC namespaces
  //   - Process objects: DEBUG | READ | WRITE on all PID capabilities
  //
  // Each capability is created with:
  //   permissions: PermissionBits::ALL  (READ | WRITE | ADMIN)
  //   constraints: CapConstraints {
  //       delegatable: true,
  //       max_delegation_depth: u8::MAX, // effectively unlimited (system limit of 16 is binding)
  //       revocable: true,
  //   }
  //
  // Implementation: the capability objects are NOT pre-created for every
  // device. Instead, the task's CapSpace is marked with a root_grant flag.
  // When the task first accesses a device (open(), ioctl()), the VFS/device
  // layer checks root_grant and synthesizes the capability on demand:
  task.capspace.set_root_grant(true)

The root_grant flag is a single boolean on the task's CapSpace. It is:

  • Set during execve() for euid == 0 (unless SECBIT_NOROOT).
  • Cleared when all of uid, euid, suid, and fsuid become non-zero (dropping root entirely via setuid()).
  • Inherited across fork() (child gets the same CapSpace snapshot).
  • NOT inherited across execve() of a non-root binary (cleared before construct_root_caps runs; only re-set if the new binary's effective UID is 0).

On-demand synthesis: When a root-grant task opens a device or file, the VFS path calls:

synthesize_root_cap(task, object_id, requested_perms) -> Result<Capability>:
  if !task.capspace.root_grant():
      return Err(EPERM)  // non-root: normal capability delegation path

  // LSM hooks are checked even for UID 0 root grant to support
  // SELinux/AppArmor policies that restrict root. A confined root
  // process (e.g., in a container with an SELinux label) must still
  // pass the LSM check before receiving a synthesized capability.
  if !lsm_check_root_grant(task, object_id, requested_perms):
      return Err(EACCES)  // LSM denied the root capability grant

  // Create a capability with the intersection of requested permissions
  // and PermissionBits::ALL. Root can request any permission.
  let cap = Capability {
      object_id,
      permissions: requested_perms,
      generation: object_generation(object_id),
  }
  task.capspace.insert(cap)
  Ok(cap)

This lazy approach avoids enumerating all devices at execve() time (which would be both expensive and incorrect for hot-plugged devices).

9.9.12.3 capable() and UmkaOS PermissionBits: Orthogonal Check Paths

The two access control systems are checked on different code paths and are not redundant:

/// Linux-compatible capability check. Called by syscall implementations
/// for coarse-grained privilege gates (e.g., "can this process bind to
/// port 80?", "can this process load a kernel module?").
///
/// This function checks only SystemCaps bits. It does NOT check
/// PermissionBits on any capability object.
pub fn capable(task: &Task, cap: SystemCaps) -> bool {
    // Step 1: Check Linux-compatible cap_effective bits.
    if !task.cred.cap_effective.contains(cap) {
        return false;
    }
    // Step 2: Namespace hierarchy check (Section 17.1.6).
    // For capable() (not ns_capable), the target is always init_user_ns.
    if !is_same_or_ancestor(task.cred.user_ns, &INIT_USER_NS) {
        return false;
    }
    // Step 3: LSM hook (SELinux/AppArmor equivalent, Section 9.8).
    if !lsm_capable(task, &INIT_USER_NS, cap) {
        return false;
    }
    // Step 4: Audit log (if auditing enabled, Section 20.2).
    audit_log_cap_check(task, cap, true);
    true
}

Interaction with KABI dispatch: The KABI dispatch path (Section 12.1) checks PermissionBits on the ValidatedCap token, not SystemCaps. A driver call like block_write(vcap, buf, offset, len) verifies that vcap.has_permission(PermissionBits::WRITE) is true. The capable() check, if any, happens earlier in the syscall path (e.g., the open() syscall checks capable(CAP_SYS_RAWIO) before creating the block device capability).

Summary of check points:

| Check Point | System | What Is Checked | Example |
| --- | --- | --- | --- |
| Syscall entry (SysAPI layer) | SystemCaps via capable() | Process-level Linux capability bits | bind() to port < 1024 checks CAP_NET_BIND_SERVICE |
| VFS open / device open | Both | capable() for privilege gate, then PermissionBits on the resulting capability | open("/dev/sda") checks CAP_SYS_RAWIO, then creates capability with READ \| WRITE |
| KABI dispatch (driver call) | PermissionBits via ValidatedCap | Per-object permission bits on the capability token | block_write(vcap, ...) checks vcap.has_permission(WRITE) |
| LSM hooks | SystemCaps + LSM policy | SELinux/AppArmor may deny even if SystemCaps allows | lsm_capable() at step 3 of capable() |

Cross-references:

  • Section 9.1: Capability token model, PermissionBits definition, CapSpace structure
  • Section 9.2: SystemCaps bitflags definition (bits 0-40 = Linux, 64-127 = UmkaOS-native)
  • Section 12.1: KABI dispatch, ValidatedCap amortized validation
  • Section 19.1: Syscall SysAPI layer where capable() is called
  • Section 9.8: LSM framework (lsm_capable() hook)

9.9.13 Security Subsystem Lock Ordering

All code paths that acquire multiple security-subsystem locks must follow this strict partial order. Violating this order risks deadlock.

Lock Ordering (outermost → innermost):

  CapTable.write_lock (global capability table writer lock)
    < capspace_spinlock (per-task CapSpace spinlock)
      < tpm_lock (TPM device command serialization mutex)
        < ima_measurement_log (per-namespace IMA log mutex)
          < evm_lock (per-inode EVM HMAC spinlock, Section 9.4.5)
            < rules_update_lock (IMA policy rules update mutex)
              < audit_buffer_lock (audit ring buffer spinlock)

| Lock | Type | Scope | Protects |
| --- | --- | --- | --- |
| CapTable.write_lock | SpinLock | Global (one CapTable per kernel) | CapEntry create, revoke, delegate (XArray structural modifications) |
| capspace_spinlock | SpinLock | Per-CapSpace | CapEntry insertion, delegation, revocation |
| tpm_lock | Mutex | Per-TPM device | Serializes TPM command submission |
| ima_measurement_log | Mutex | Per-IMA namespace | Append-only measurement log |
| evm_lock | SpinLock | Per-inode | Atomic HMAC compute + xattr write |
| rules_update_lock | Mutex | Per-IMA namespace | IMA policy rule list updates |
| audit_buffer_lock | SpinLock | Global | Audit ring buffer enqueue |

Rationale for ordering:

  • No cred_lock: Credential transitions use the RCU copy-on-write model (prepare_creds → modify clone → commit_creds with atomic pointer swap). Readers use rcu_dereference(task.cred) under an RCU read-side critical section; writers atomically publish via commit_creds(). No lock is needed.
  • CapTable.write_lock outermost: global capability table modifications (cap_create, cap_revoke, cap_delegate) may need to acquire per-task CapSpace locks to update delegation trees. Must be outermost to prevent inversion with capspace_spinlock. Readers use RCU (no lock).
  • capspace_spinlock next: capability operations may trigger audit events (which acquire audit_buffer_lock). Must rank above audit_buffer_lock to avoid inversion.
  • capspace_spinlock before tpm_lock: capability delegation may require TPM-sealed attestation of the new capability (cluster attestation path).
  • tpm_lock before ima_measurement_log: TPM PCR extend is called from IMA measurement path; the TPM command must complete before the measurement log entry is committed.
  • ima_measurement_log before evm_lock: IMA measurement triggers EVM HMAC recomputation (Section 9.4.5); the EVM lock is acquired inside the IMA measurement code path.
  • evm_lock before rules_update_lock: EVM hook execution reads the IMA rule list under RCU; rules_update_lock is only held on the cold policy update path and never nests inside evm_lock in practice. Ordering here prevents future inversions if the paths evolve.
  • audit_buffer_lock innermost: audit logging is a terminal action called from many security paths. It must never acquire any other security lock.

Enforcement: Debug builds (cfg(debug_assertions)) use a per-CPU lock depth counter. Each lock acquisition asserts that the new lock's rank is strictly greater than the current maximum held rank. Violation triggers a diagnostic panic with both lock names and the call stack.
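A user-space sketch of the rank assertion, using a thread-local cell in place of the per-CPU counter (names are illustrative; the kernel version also records lock names and the call stack for the panic diagnostic):

```rust
use std::cell::Cell;

thread_local! {
    // Highest rank currently held on this thread (0 = nothing held).
    static MAX_HELD_RANK: Cell<u32> = Cell::new(0);
}

/// Assert that `rank` is strictly greater than every rank already held.
/// Returns the previous maximum so the caller can restore it on release.
fn assert_lock_rank(rank: u32) -> u32 {
    MAX_HELD_RANK.with(|m| {
        let prev = m.get();
        assert!(
            rank > prev,
            "lock ordering violation: acquiring rank {} while holding rank {}",
            rank, prev
        );
        m.set(rank);
        prev
    })
}

fn release_lock_rank(prev: u32) {
    MAX_HELD_RANK.with(|m| m.set(prev));
}

fn main() {
    // Ranks follow the ordering diagram above (1 = outermost).
    const CAPTABLE: u32 = 1;
    const CAPSPACE: u32 = 2;
    const AUDIT: u32 = 7;
    // Legal nesting: outermost to innermost.
    let p1 = assert_lock_rank(CAPTABLE);
    let p2 = assert_lock_rank(CAPSPACE);
    let p3 = assert_lock_rank(AUDIT);
    // Release in reverse order, restoring the saved maxima.
    release_lock_rank(p3);
    release_lock_rank(p2);
    release_lock_rank(p1);
    println!("ok");
}
```

Acquiring CAPSPACE while AUDIT is held would trip the assertion, which is exactly the inversion the ordering rules exist to prevent.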