Chapter 9: Security Extensions

Companion to Chapter 8: Security Architecture. This chapter contains §9.1–§9.5: Kernel Crypto API, Key Retention, Seccomp-BPF, ARM MTE, and DebugCap. See 08-security.md for §8.1–§8.8.


9.1 Kernel Crypto API

The Kernel Crypto API is the algorithm registry and dispatch framework. It is not a security policy subsystem in the way that LSM (Section 8.7) or capabilities (Section 8.1) are, but it is placed here because it is the shared foundation that every security-relevant subsystem depends on: verified boot (Section 8.2) needs Ed25519 and ML-DSA-65 signature verification; PQC key exchange (Section 8.5) needs ML-KEM-768; confidential computing (Section 8.6) needs AES-256-GCM for sealed blobs; IMA (Section 8.4) needs SHA-256 and SHA-384; NVMe TLS authentication (Section 14.4) needs AES-GCM and ML-KEM; NFS/Kerberos (Section 14.X) needs AES-128-CTS-HMAC-SHA1 and AES-256-CTS-HMAC-SHA384; and kTLS (Section 15.X) needs ChaCha20-Poly1305.

A single, unified algorithm registry ensures: hardware-accelerated implementations are discovered at runtime and preferred automatically; PQC algorithms are first-class citizens with the same lookup paths as classical algorithms; and callers are insulated from implementation churn as acceleration support is added.

9.1.1 Algorithm Type Taxonomy

// umka-core/src/crypto/api.rs

/// Taxonomy of cryptographic algorithm families.
///
/// Each variant corresponds to a distinct API surface (different transform
/// objects, different operation descriptors, different vtables). Callers
/// use the type to filter the algorithm registry during lookup.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum CryptoAlgType {
    /// Synchronous hash (e.g., SHA-256, SHA-384, BLAKE2b).
    /// Output is produced in a single call from a complete message or
    /// incrementally via `update()` + `final()`. No async path.
    Shash = 0x0001,

    /// Asynchronous hash (e.g., offloaded SHA via AMD CCP).
    /// Operation descriptor submitted to a hardware ring buffer and
    /// completed via completion callback or poll.
    Ahash = 0x0002,

    /// Synchronous symmetric key cipher (block cipher or stream cipher).
    /// Operates in-place or out-of-place on a contiguous buffer. Suitable
    /// for AES-ECB, AES-CBC, AES-CTR, ChaCha20.
    Skcipher = 0x0003,

    /// Asynchronous symmetric key cipher (hardware offload path).
    Ablkcipher = 0x0004,

    /// Authenticated encryption with associated data (AEAD).
    /// Combines confidentiality and integrity: AES-GCM, AES-CCM,
    /// ChaCha20-Poly1305. The `encrypt` path appends the authentication
    /// tag; the `decrypt` path verifies it and returns `Err(EBADMSG)`
    /// on mismatch without exposing the decrypted plaintext.
    Aead = 0x0005,

    /// Asymmetric key cipher (sign/verify, encrypt/decrypt).
    /// Covers RSA, ECDSA (P-256, P-384), Ed25519, ML-DSA-44/65/87.
    /// Key import uses PKCS#8 DER or raw key bytes depending on algorithm.
    Akcipher = 0x0006,

    /// Key agreement / key encapsulation mechanism (KEM).
    /// Covers ECDH (P-256, P-384, X25519), ML-KEM-512/768/1024, and
    /// hybrid templates (e.g., `hybrid-kem(x25519,ml-kem-768)`).
    Kpp = 0x0007,

    /// Cryptographic random number generator.
    /// Backed by the hardware DRBG (RDRAND, ARM TRNG) seeded into a
    /// NIST SP 800-90A CTR-DRBG instance.
    Rng = 0x0008,
}

9.1.2 Algorithm Descriptor and Registration

Every algorithm implementation — whether software or hardware-accelerated — is described by a CryptoAlg descriptor and registered with the global algorithm table at module load or driver probe time.

// umka-core/src/crypto/api.rs

bitflags::bitflags! {
    /// Flags attached to an algorithm descriptor.
    #[derive(Clone, Copy, Debug)]
    pub struct CryptoAlgFlags: u32 {
        /// Algorithm has passed NIST FIPS 140-3 / SP 800-131A validation.
        /// In FIPS mode, only algorithms with this flag may be allocated.
        const FIPS_APPROVED    = 0x0001;

        /// Algorithm is implemented in software (no hardware dependency).
        const SW_IMPL          = 0x0002;

        /// Algorithm requires hardware support (will fail if HW absent).
        const HW_ACCEL         = 0x0004;

        /// Algorithm is a template that composes two or more base algorithms.
        /// Instantiated from component names in the template string, e.g.,
        /// `gcm(aes)` composes the `gcm` template with the `aes` base cipher.
        const TEMPLATE         = 0x0008;

        /// Internal use: algorithm is in the process of being unregistered.
        /// Allocation requests for this algorithm will return `Err(ENOENT)`.
        const DYING            = 0x0010;

        /// Algorithm supports in-place operation (src == dst buffer).
        const INPLACE          = 0x0020;

        /// Algorithm is part of the PQC suite (ML-KEM, ML-DSA, SLH-DSA).
        const PQC              = 0x0040;
    }
}

/// Algorithm implementation descriptor.
///
/// Registered once per implementation; shared across all transform objects
/// that use the same implementation. Immutable after registration.
///
/// `CryptoAlg` is placed in a static or module-scoped location: its
/// lifetime must exceed any `*Tfm` objects that reference it.
pub struct CryptoAlg {
    /// Canonical algorithm name used for lookup.
    /// Examples: `"sha256"`, `"aes-gcm"`, `"ml-kem-768"`,
    /// `"hybrid-kem(x25519,ml-kem-768)"`.
    /// Maximum 64 bytes, null-padded.
    pub name: [u8; 64],

    /// Implementation name, used for diagnostics and sysfs.
    /// Examples: `"aesni-sha256"`, `"soft-ml-kem-768"`, `"ccp-aes-gcm"`.
    pub driver_name: [u8; 64],

    /// Algorithm family. Determines which vtable pointer is valid.
    pub alg_type: CryptoAlgType,

    /// Selection priority. Higher priority implementations are preferred
    /// when multiple implementations of the same algorithm are registered.
    /// Range 0–999. Software fallback: 100. Hardware-accelerated: 300–900.
    /// Test vectors only (for self-test use): 0.
    pub priority: u32,

    /// Capability and mode flags.
    pub flags: CryptoAlgFlags,

    /// Reference count: incremented when a transform object is allocated
    /// from this descriptor, decremented when the transform is freed.
    /// The descriptor cannot be unregistered while refcount > 0.
    pub refcount: AtomicU32,

    /// Algorithm-family-specific operations vtable.
    pub ops: CryptoAlgOps,
}

/// Union of per-family vtables. Exactly one variant is valid, selected by
/// `alg_type`. Using an enum keeps the dispatch explicit and exhaustive.
pub enum CryptoAlgOps {
    Shash(ShashOps),
    Ahash(AhashOps),
    Skcipher(SkcipherOps),
    Ablkcipher(AblkcipherOps),
    Aead(AeadOps),
    Akcipher(AkCipherOps),
    Kpp(KppOps),
    Rng(RngOps),
}
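The fixed 64-byte, null-padded `name` and `driver_name` fields can be built and read back with small helpers. The sketch below (a plausible shape for the `alg_name` / `alg_name_str` helpers used in the registration examples, not a verbatim implementation) is `const` so it can also populate `static` descriptors:

```rust
/// Build a 64-byte, null-padded algorithm name. `const` so it can be
/// used in static descriptor initialisers. Panics at compile time if
/// the name does not fit (the final byte must remain null).
pub const fn alg_name(s: &str) -> [u8; 64] {
    let bytes = s.as_bytes();
    assert!(bytes.len() < 64, "algorithm name too long");
    let mut out = [0u8; 64];
    let mut i = 0;
    while i < bytes.len() {
        out[i] = bytes[i];
        i += 1;
    }
    out
}

/// Recover the `&str` form for table lookup and diagnostics.
pub fn alg_name_str(name: &[u8; 64]) -> &str {
    let len = name.iter().position(|&b| b == 0).unwrap_or(64);
    core::str::from_utf8(&name[..len]).expect("non-UTF-8 algorithm name")
}
```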

Algorithm registration and deregistration:

// umka-core/src/crypto/registry.rs

/// Global algorithm table. Protected by a single RwSpinLock for the rare
/// registration/deregistration paths; reads (lookup during alloc) are common
/// but brief enough that a spinlock is acceptable. An RCU-protected list
/// would reduce reader overhead but is not necessary given registration
/// happens only at boot and at module load.
static ALGORITHM_TABLE: RwSpinLock<AlgorithmTable> = RwSpinLock::new(AlgorithmTable::new());

/// Register an algorithm implementation.
///
/// # Errors
/// - `EEXIST`: an implementation with the same `name` + `driver_name` is
///   already registered.
/// - `EINVAL`: the descriptor is malformed (zero-length name, unknown type,
///   priority out of range, null function pointer in vtable).
pub fn crypto_register_alg(alg: &'static CryptoAlg) -> Result<(), KernelError> {
    let name = alg_name_str(alg)?;
    let mut table = ALGORITHM_TABLE.write();
    if table.find_by_driver(name, &alg.driver_name).is_some() {
        return Err(KernelError::EEXIST);
    }
    validate_alg_descriptor(alg)?;
    table.insert(alg);
    Ok(())
}

/// Deregister an algorithm implementation.
///
/// Marks the descriptor as `DYING` first so that concurrent allocations
/// fail gracefully, then waits for `refcount` to reach zero before removing
/// from the table.
///
/// # Errors
/// - `ENOENT`: algorithm not found.
/// - `EBUSY`: would block indefinitely; caller must retry (module unload
///   should be deferred until users have released their transforms).
pub fn crypto_unregister_alg(alg: &'static CryptoAlg) -> Result<(), KernelError>;
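The DYING/refcount handshake between allocation and unregistration can be modeled in isolation. This is an illustrative userspace sketch of the state machine only (the real path also holds the table lock, so the check-then-increment below is simplified):

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

/// Minimal model of the descriptor lifecycle: allocation bumps the
/// refcount unless DYING is set; unregistration sets DYING first so
/// concurrent allocations fail, then waits for the refcount to drain.
struct AlgLifecycle {
    dying: AtomicBool,
    refcount: AtomicU32,
}

impl AlgLifecycle {
    const fn new() -> Self {
        Self { dying: AtomicBool::new(false), refcount: AtomicU32::new(0) }
    }

    /// Transform allocation path: fails with ENOENT once DYING is set.
    fn try_get(&self) -> Result<(), &'static str> {
        if self.dying.load(Ordering::Acquire) {
            return Err("ENOENT");
        }
        self.refcount.fetch_add(1, Ordering::AcqRel);
        Ok(())
    }

    /// Transform free path.
    fn put(&self) {
        self.refcount.fetch_sub(1, Ordering::AcqRel);
    }

    /// Unregistration: mark DYING, then report whether the entry can be
    /// removed now (refcount zero) or the caller must retry (EBUSY).
    fn begin_unregister(&self) -> Result<(), &'static str> {
        self.dying.store(true, Ordering::Release);
        if self.refcount.load(Ordering::Acquire) == 0 { Ok(()) } else { Err("EBUSY") }
    }
}
```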

9.1.3 Transform Objects

A transform object (Tfm) is the per-user working state for an algorithm. It holds the key schedule, any per-instance configuration, and a reference to the algorithm descriptor. Tfm objects are not shared: each caller allocates its own.

// umka-core/src/crypto/tfm.rs

/// Synchronous hash transform.
pub struct ShashTfm {
    /// Descriptor of the underlying algorithm.
    pub alg: &'static CryptoAlg,
    /// Per-instance state (key for HMAC; empty for plain hash).
    state: ShashTfmState,
}

/// In-progress synchronous hash computation.
///
/// Allocated on the caller's stack (or in a kernel object) via
/// `tfm.desc_size()`. Lives for the duration of one hash computation.
pub struct ShashDesc {
    pub tfm: *const ShashTfm,
    /// Implementation-defined context (SHA state, BLAKE2 state, etc.).
    /// Size is `ShashTfm.alg.descsize` bytes; allocated adjacent to this
    /// struct in the same allocation.
    _ctx: [u8; 0], // variable-length tail, accessed via raw pointer arithmetic
}

/// AEAD transform.
pub struct AeadTfm {
    pub alg: &'static CryptoAlg,
    /// Encryption key schedule.
    key_enc: SecretBox<[u8]>,
    /// Authentication key material (for GCM: H = AES_K(0^128)).
    key_auth: SecretBox<[u8]>,
    /// Authentication tag size in bytes (set by `setauthsize`).
    authsize: u32,
    /// IV length in bytes (12 for AES-GCM, 16 for AES-CCM).
    ivsize: u32,
}

/// Single AEAD operation request.
///
/// Submitted inline (synchronous) or queued to a hardware ring (async).
pub struct AeadReq {
    /// Transform to use.
    pub tfm: *const AeadTfm,
    /// Associated data (authenticated but not encrypted).
    pub assoc: &'static [u8],
    /// Input buffer (plaintext for encrypt, ciphertext+tag for decrypt).
    pub src: *const u8,
    /// Output buffer. May equal `src` for in-place operation if
    /// `CryptoAlgFlags::INPLACE` is set on the algorithm.
    pub dst: *mut u8,
    /// Length of the payload (excluding the authentication tag).
    pub cryptlen: u32,
    /// IV/nonce. Must be exactly `AeadTfm.ivsize` bytes.
    pub iv: [u8; 16],
    /// Completion callback for async path. `None` for synchronous callers.
    pub complete: Option<fn(*mut AeadReq, i32)>,
    /// Caller-supplied context pointer, passed to `complete`.
    pub data: *mut core::ffi::c_void,
}

/// Asymmetric key cipher operations vtable.
///
/// Registered by each Akcipher implementation. All functions are mandatory
/// unless documented as optional (marked with `Option<…>`).
pub struct AkCipherOps {
    /// Import a private key from DER-encoded PKCS#8 or raw bytes.
    /// Stores parsed key material inside `tfm`. The src buffer is zeroed
    /// by the caller after this call returns.
    pub set_priv_key: unsafe extern "C" fn(
        tfm: *mut AkCipherTfm,
        src: *const u8,
        src_len: u32,
    ) -> i32,

    /// Import a public key. Format is algorithm-specific:
    /// RSA: SubjectPublicKeyInfo DER; Ed25519: raw 32 bytes; ML-DSA-65: raw 1952 bytes.
    pub set_pub_key: unsafe extern "C" fn(
        tfm: *mut AkCipherTfm,
        src: *const u8,
        src_len: u32,
    ) -> i32,

    /// Produce a signature over `src` (typically a digest). Output written
    /// to `dst`. Returns the number of bytes written on success.
    pub sign: unsafe extern "C" fn(
        tfm: *const AkCipherTfm,
        src: *const u8,
        src_len: u32,
        dst: *mut u8,
        dst_len: u32,
    ) -> i32,

    /// Verify `sig` over `src`. Returns 0 on success, `-EBADMSG` if the
    /// signature is invalid, other negative errno on error.
    pub verify: unsafe extern "C" fn(
        tfm: *const AkCipherTfm,
        src: *const u8,
        src_len: u32,
        sig: *const u8,
        sig_len: u32,
    ) -> i32,

    /// Maximum signature size in bytes. Used to allocate output buffers.
    pub max_size: unsafe extern "C" fn(tfm: *const AkCipherTfm) -> u32,
}

/// Synchronous hash operations vtable.
pub struct ShashOps {
    /// Digest size in bytes (e.g., 32 for SHA-256, 48 for SHA-384).
    pub digestsize: u32,

    /// `ShashDesc` context size in bytes.
    pub descsize: u32,

    /// Optional: set a key (for HMAC). Returns `-EINVAL` for plain hashes.
    pub setkey: Option<unsafe extern "C" fn(
        tfm: *mut ShashTfm,
        key: *const u8,
        keylen: u32,
    ) -> i32>,

    /// Initialise `desc` for a new hash computation.
    pub init: unsafe extern "C" fn(desc: *mut ShashDesc) -> i32,

    /// Process `len` bytes of data.
    pub update: unsafe extern "C" fn(
        desc: *mut ShashDesc,
        data: *const u8,
        len: u32,
    ) -> i32,

    /// Finalise and write `digestsize` bytes to `out`.
    pub finalize: unsafe extern "C" fn(desc: *mut ShashDesc, out: *mut u8) -> i32,

    /// One-shot: init + update(data, len) + finalize. Faster for small messages.
    pub digest: unsafe extern "C" fn(
        desc: *mut ShashDesc,
        data: *const u8,
        len: u32,
        out: *mut u8,
    ) -> i32,
}
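The `ShashOps` contract requires that the one-shot `digest` path produce byte-identical output to `init` + `update` + `finalize`, regardless of how the message is split across `update` calls. A toy stand-in (FNV-1a, which is not a cryptographic hash and is used here only to illustrate the contract):

```rust
/// Toy hash state standing in for a ShashDesc context.
struct FnvDesc { state: u64 }

/// Mirrors ShashOps::init — fresh state per computation.
fn fnv_init() -> FnvDesc {
    FnvDesc { state: 0xcbf2_9ce4_8422_2325 } // FNV-1a 64-bit offset basis
}

/// Mirrors ShashOps::update — may be called any number of times.
fn fnv_update(desc: &mut FnvDesc, data: &[u8]) {
    for &b in data {
        desc.state ^= b as u64;
        desc.state = desc.state.wrapping_mul(0x100000001b3); // FNV-1a prime
    }
}

/// Mirrors ShashOps::finalize — writes the fixed-size digest.
fn fnv_finalize(desc: &FnvDesc, out: &mut [u8; 8]) {
    *out = desc.state.to_be_bytes();
}

/// Mirrors ShashOps::digest — must equal the incremental path exactly.
fn fnv_digest(data: &[u8], out: &mut [u8; 8]) {
    let mut desc = fnv_init();
    fnv_update(&mut desc, data);
    fnv_finalize(&desc, out);
}
```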

/// AEAD operations vtable.
pub struct AeadOps {
    /// Set the encryption key. Key length must be in the algorithm's
    /// supported set (e.g., 16 or 32 bytes for AES-GCM).
    pub setkey: unsafe extern "C" fn(
        tfm: *mut AeadTfm,
        key: *const u8,
        keylen: u32,
    ) -> i32,

    /// Set the authentication tag size. For AES-GCM this is 16 bytes;
    /// shorter tags are allowed (8 bytes minimum) but not FIPS-approved.
    pub setauthsize: unsafe extern "C" fn(tfm: *mut AeadTfm, authsize: u32) -> i32,

    /// Encrypt and authenticate. On success, `dst` contains the ciphertext
    /// followed by the authentication tag (`authsize` bytes).
    pub encrypt: unsafe extern "C" fn(req: *mut AeadReq) -> i32,

    /// Decrypt and verify. Returns `-EBADMSG` if the authentication tag
    /// does not match; the output buffer is not modified on failure.
    /// On success, `dst` contains the plaintext (without the tag).
    pub decrypt: unsafe extern "C" fn(req: *mut AeadReq) -> i32,

    /// IV size in bytes (12 for GCM, 16 for CCM).
    pub ivsize: u32,

    /// Maximum authentication tag size in bytes.
    pub maxauthsize: u32,
}
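The `encrypt`/`decrypt` contracts imply fixed output-buffer sizing rules: encrypt appends `authsize` tag bytes after the ciphertext, and decrypt strips them. Hypothetical sizing helpers (not part of the vtable above) make the arithmetic explicit:

```rust
/// Output size for AEAD encrypt: ciphertext plus appended tag.
fn aead_encrypt_dst_len(cryptlen: u32, authsize: u32) -> u32 {
    cryptlen + authsize
}

/// Output size for AEAD decrypt: `src` holds ciphertext + tag, so the
/// plaintext is `src_len - authsize` bytes. Returns None when `src` is
/// too short to even contain a tag (which must be rejected as EINVAL
/// before invoking the implementation).
fn aead_decrypt_dst_len(src_len: u32, authsize: u32) -> Option<u32> {
    src_len.checked_sub(authsize)
}
```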

/// Key agreement / KEM operations vtable.
pub struct KppOps {
    /// Generate a fresh key pair, storing the private key inside `tfm`.
    /// The public key is written to `pub_key` (exactly `pub_key_size()` bytes).
    pub generate_key: unsafe extern "C" fn(
        tfm: *mut KppTfm,
        pub_key: *mut u8,
    ) -> i32,

    /// For ECDH/X25519: set the private key from `src`.
    /// For ML-KEM: import a serialised decapsulation key.
    pub set_priv_key: unsafe extern "C" fn(
        tfm: *mut KppTfm,
        src: *const u8,
        src_len: u32,
    ) -> i32,

    /// ECDH/X25519 compute_shared_secret / ML-KEM encapsulate.
    /// For ECDH: `peer_pub` is the peer's public key; `shared` receives
    /// the Diffie-Hellman shared secret.
    /// For ML-KEM: `peer_pub` is the peer's encapsulation key;
    /// `shared` receives the KEM shared secret (32 bytes); the
    /// ciphertext is written to `ct_out` (1088 bytes for ML-KEM-768).
    pub compute: unsafe extern "C" fn(
        tfm: *const KppTfm,
        peer_pub: *const u8,
        peer_pub_len: u32,
        shared: *mut u8,
        ct_out: *mut u8,
    ) -> i32,

    /// ML-KEM decapsulate. `ct` is the ciphertext produced by the peer's
    /// `compute` call. `shared` receives the same 32-byte shared secret.
    /// Returns 0 on success. Decapsulation always succeeds (implicit
    /// rejection per FIPS 203 Section 6.4): on ciphertext mismatch the
    /// output is a deterministic but unpredictable value.
    pub decapsulate: Option<unsafe extern "C" fn(
        tfm: *const KppTfm,
        ct: *const u8,
        ct_len: u32,
        shared: *mut u8,
    ) -> i32>,

    /// Public key size in bytes.
    pub pub_key_size: unsafe extern "C" fn(tfm: *const KppTfm) -> u32,

    /// Shared secret / KEM output size in bytes.
    pub shared_secret_size: unsafe extern "C" fn(tfm: *const KppTfm) -> u32,
}

9.1.4 Algorithm Lookup and Transform Allocation

The crypto_alloc_* family of functions drives the registry lookup, priority selection, and transform instantiation.

// umka-core/src/crypto/alloc.rs

/// Allocate a synchronous hash transform for the named algorithm.
///
/// # Algorithm
/// 1. Lock the algorithm table for reading.
/// 2. Collect all registered `CryptoAlg` entries where `name` matches
///    and `alg_type == CryptoAlgType::Shash` and `!flags.DYING`.
/// 3. In FIPS mode, discard any entries without `flags.FIPS_APPROVED`.
/// 4. Select the entry with the highest `priority`. If multiple entries
///    share the maximum priority, the most recently registered wins
///    (last-writer-wins within the same priority tier, consistent with
///    hardware driver load order at boot).
/// 5. Atomically increment `alg.refcount`.
/// 6. Release the read lock.
/// 7. Allocate a `ShashTfm` from the kernel slab, initialise fields,
///    call `alg.ops.shash.init_tfm(tfm)` if the vtable provides it.
/// 8. Return the tfm. Caller must call `crypto_free_shash(tfm)` when done.
///
/// # Errors
/// - `ENOENT`: no implementation found for the name (or all are filtered
///   out by FIPS mode).
/// - `ENOMEM`: slab allocation failed.
/// - `EINVAL`: algorithm name is empty or longer than 64 bytes.
pub fn crypto_alloc_shash(
    name: &str,
    flags: CryptoAllocFlags,
) -> Result<Box<ShashTfm>, KernelError>;

/// Allocate an AEAD transform.
/// Follows the same lookup algorithm as `crypto_alloc_shash`.
/// Template algorithms (e.g., `"gcm(aes)"`) are instantiated by:
/// 1. Parsing the template name to extract the template (`"gcm"`) and
///    the base algorithm (`"aes"`).
/// 2. Looking up and allocating a `SkcipherTfm` for the base algorithm.
/// 3. Looking up the template and calling `template.alloc_aead(base_tfm)`.
pub fn crypto_alloc_aead(
    name: &str,
    flags: CryptoAllocFlags,
) -> Result<Box<AeadTfm>, KernelError>;

/// Allocate an asymmetric key cipher transform.
pub fn crypto_alloc_akcipher(
    name: &str,
    flags: CryptoAllocFlags,
) -> Result<Box<AkCipherTfm>, KernelError>;

/// Allocate a KEM transform.
pub fn crypto_alloc_kpp(
    name: &str,
    flags: CryptoAllocFlags,
) -> Result<Box<KppTfm>, KernelError>;

bitflags::bitflags! {
    /// Flags for transform allocation.
    #[derive(Clone, Copy)]
    pub struct CryptoAllocFlags: u32 {
        /// Accept only hardware-accelerated implementations.
        const HW_ONLY      = 0x0001;
        /// Accept only software implementations (useful for self-tests).
        const SW_ONLY      = 0x0002;
        /// Caller is in atomic context; allocation must not sleep.
        const NOIO         = 0x0004;
    }
}

/// Free a synchronous hash transform, decrement algorithm refcount.
pub fn crypto_free_shash(tfm: Box<ShashTfm>);

/// Free an AEAD transform. Zeroises the key schedule before freeing.
pub fn crypto_free_aead(tfm: Box<AeadTfm>);

Template Instantiation

Template algorithms compose two or more base algorithms. gcm(aes) combines the GCM mode template with the AES block cipher. hybrid-kem(x25519,ml-kem-768) combines X25519 ECDH with ML-KEM-768 using the concatenated shared-secret construction from NIST SP 800-227 (IPD, 2024):

shared_secret = HKDF-SHA256(
    ikm  = x25519_shared || ml_kem_shared,
    info = "hybrid-kem v1" || x25519_pub || ml_kem_pub,
    len  = 32
)

The hybrid template is registered as a Kpp algorithm with CryptoAlgFlags::TEMPLATE | CryptoAlgFlags::PQC. Its generate_key generates both inner key pairs; compute runs both KEMs and applies HKDF; decapsulate runs ML-KEM decapsulation and the same HKDF.
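The first step of template instantiation — splitting a name like `gcm(aes)` into template and component names — can be sketched as a small parser (illustrative only; nested templates are out of scope here):

```rust
/// Parse a template algorithm name into (template, components).
/// "gcm(aes)"                      -> ("gcm", ["aes"])
/// "hybrid-kem(x25519,ml-kem-768)" -> ("hybrid-kem", ["x25519", "ml-kem-768"])
/// A plain name parses as itself with no components.
/// Returns None on malformed input (missing ')' or empty parts).
fn parse_template(name: &str) -> Option<(&str, Vec<&str>)> {
    match name.split_once('(') {
        None => Some((name, Vec::new())),
        Some((tmpl, rest)) => {
            let inner = rest.strip_suffix(')')?;
            if tmpl.is_empty() || inner.is_empty() {
                return None;
            }
            Some((tmpl, inner.split(',').collect()))
        }
    }
}
```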

9.1.5 PQC Algorithms as First-Class Citizens

ML-KEM-768 and ML-DSA-65 are the preferred algorithms for new key exchange and signature code respectively, matching the selection in Section 8.5.

// umka-core/src/crypto/pqc.rs

/// ML-KEM-768 KEM algorithm descriptor (software implementation).
/// Implements FIPS 203 (2024). Registered at boot with priority 200.
static ML_KEM_768_ALG: CryptoAlg = CryptoAlg {
    name:        alg_name("ml-kem-768"),
    driver_name: alg_name("soft-ml-kem-768"),
    alg_type:    CryptoAlgType::Kpp,
    priority:    200,
    flags:       CryptoAlgFlags::FIPS_APPROVED
                 .union(CryptoAlgFlags::SW_IMPL)
                 .union(CryptoAlgFlags::PQC),
    refcount:    AtomicU32::new(0),
    ops:         CryptoAlgOps::Kpp(ML_KEM_768_OPS),
};

/// ML-DSA-65 signature algorithm descriptor (software implementation).
/// Implements FIPS 204 (2024). Registered at boot with priority 200.
static ML_DSA_65_ALG: CryptoAlg = CryptoAlg {
    name:        alg_name("ml-dsa-65"),
    driver_name: alg_name("soft-ml-dsa-65"),
    alg_type:    CryptoAlgType::Akcipher,
    priority:    200,
    flags:       CryptoAlgFlags::FIPS_APPROVED
                 .union(CryptoAlgFlags::SW_IMPL)
                 .union(CryptoAlgFlags::PQC),
    refcount:    AtomicU32::new(0),
    ops:         CryptoAlgOps::Akcipher(ML_DSA_65_OPS),
};

When a hardware accelerator supporting PQC operations is present (future Intel IAA, AMD Phoenix ML-KEM offload), it registers a higher-priority implementation of the same algorithm name. The Crypto API selects it automatically; callers need no changes.
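Callers sizing buffers for these vtables need the FIPS 203/204 object sizes. A constants module makes them available in one place (sizes per the published standards; the module path and names here are illustrative):

```rust
/// Object sizes in bytes for the PQC parameter sets preferred above.
pub mod pqc_sizes {
    // ML-KEM-768 (FIPS 203).
    pub const ML_KEM_768_EK: usize = 1184; // encapsulation key
    pub const ML_KEM_768_DK: usize = 2400; // decapsulation key
    pub const ML_KEM_768_CT: usize = 1088; // ciphertext
    pub const ML_KEM_768_SS: usize = 32;   // shared secret

    // ML-DSA-65 (FIPS 204).
    pub const ML_DSA_65_PK: usize = 1952;  // public key
    pub const ML_DSA_65_SIG: usize = 3309; // signature
}
```

The 1088-byte ciphertext matches the `ct_out` size documented on `KppOps::compute`, and the 1952-byte public key is the raw import format for `set_pub_key`.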

9.1.6 Hardware Acceleration Integration

Tier 1 crypto drivers register implementations of standard algorithm names at higher priority than the software fallback. The KABI registration path for an async hardware driver:

// Executed during Tier 1 driver probe (e.g., aesni driver on x86-64).

fn aesni_probe(dev: &mut DriverContext) -> KabiResult {
    // Register AES-NI accelerated AES-GCM. Priority 400 > SW priority 100,
    // so this implementation is preferred on x86-64 systems with AES-NI.
    let alg = Box::leak(Box::new(CryptoAlg {
        name:        alg_name("aes-gcm"),
        driver_name: alg_name("aesni-gcm"),
        alg_type:    CryptoAlgType::Aead,
        priority:    400,
        flags:       CryptoAlgFlags::FIPS_APPROVED
                     .union(CryptoAlgFlags::HW_ACCEL)
                     .union(CryptoAlgFlags::INPLACE),
        refcount:    AtomicU32::new(0),
        ops:         CryptoAlgOps::Aead(AESNI_GCM_OPS),
    }));
    crypto_register_alg(alg)?;
    dev.set_private(alg as *mut _);
    KabiResult::Ok
}

Async hardware (e.g., AMD CCP, Intel QAT) uses the Ahash/Ablkcipher interfaces. Their encrypt/decrypt vtable functions submit an operation descriptor to the hardware's command ring buffer (see Section 11.1 for the ring buffer infrastructure) and return -EINPROGRESS. The kernel DMA completion IRQ fires the AeadReq.complete callback. Callers that cannot tolerate async completion use crypto_alloc_aead with CryptoAllocFlags::SW_ONLY to force the synchronous software implementation.

9.1.7 Hardware Crypto Acceleration by Architecture

UmkaOS's crypto API dispatches to hardware acceleration when available, with a portable software fallback for every algorithm. Acceleration availability is detected at boot via architecture-specific feature registers and registered per-algorithm with a higher priority than the software implementation. The crypto API always selects the highest-priority driver that supports the requested algorithm and mode.

x86-64 (Intel/AMD):

  • AES: AES-NI instructions (AESENC, AESDEC, AESKEYGENASSIST; VEX-encoded VAES variants on newer cores) — available on Intel Westmere+ and AMD Bulldozer+ processors. ~1-2 cycles/block for AES-128 GCM.
  • SHA: SHA-NI (SHA256RNDS2, SHA256MSG1, SHA256MSG2) — available on Intel Goldmont+ and AMD Zen+. ~4 cycles/block for SHA-256.
  • CLMUL: PCLMULQDQ — for GCM authentication tag and CRC-32. Available alongside AES-NI on the same CPU generations.
  • RDRAND/RDSEED: hardware RNG, available on Intel Ivy Bridge+ and AMD Zen+.
  • Detection: CPUID leaf 1 (ECX.AES bit 25, ECX.PCLMULQDQ bit 1), leaf 7 sub-leaf 0 (EBX.SHA bit 29). UmkaOS reads these at early boot in umka-kernel/src/arch/x86_64/cpu.rs and passes them to the crypto subsystem via CryptoHwCaps.

AArch64 (ARM Cryptography Extensions — FEAT_AES / FEAT_SHA2 / FEAT_SHA512 / FEAT_SHA3):

  • AES (FEAT_AES): AESE, AESD, AESMC, AESIMC NEON instructions — available on Cortex-A53+ (most ARMv8.0+ cores). Same throughput class as AES-NI.
  • SHA-256 (FEAT_SHA2): SHA256H, SHA256H2, SHA256SU0, SHA256SU1 — available on Cortex-A53+.
  • SHA-512 (FEAT_SHA512): SHA512H, SHA512H2, SHA512SU0, SHA512SU1 — available on Cortex-A55+, Neoverse N1+.
  • SHA-3 (FEAT_SHA3): EOR3, RAX1, XAR, BCAX — available on Neoverse V1+.
  • SM3/SM4 (FEAT_SM3 / FEAT_SM4): Chinese national cipher standards, available on some ARM licensees targeting CN markets.
  • PMULL (FEAT_PMULL): PMULL / PMULL2 for GCM polynomial multiplication — available on Cortex-A53+. Required alongside FEAT_AES for AES-GCM hardware offload.
  • RNG (FEAT_RNG): RNDR / RNDRRS system registers — available on Neoverse N2/V2 and Cortex-A710+. Provides a TRNG directly readable from EL0 without a syscall.
  • Detection: ID_AA64ISAR0_EL1 register — AES field [7:4], SHA2 [15:12], SHA3 [35:32], SM4 [43:40], RNDR [63:60]. UmkaOS reads these in umka-kernel/src/arch/aarch64/cpu.rs at boot before any crypto allocations.
  • Performance: comparable to AES-NI — ~1-3 cycles/block for AES-128 GCM on Neoverse V1.
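The detection steps above reduce to mask-and-shift operations on the feature registers. A sketch for the x86-64 CPUID bits and the AArch64 ID_AA64ISAR0_EL1 fields (register values are supplied by the caller here, not read from hardware; the `CryptoHwCaps` fields shown are a subset):

```rust
/// Subset of the capability record handed to the crypto subsystem.
struct CryptoHwCaps {
    aes: bool,
    pclmul: bool, // carry-less multiply for GCM GHASH (PCLMULQDQ / PMULL)
    sha2: bool,
}

/// x86-64: CPUID.1:ECX and CPUID.7.0:EBX bit tests.
fn caps_from_cpuid(leaf1_ecx: u32, leaf7_ebx: u32) -> CryptoHwCaps {
    CryptoHwCaps {
        aes: leaf1_ecx & (1 << 25) != 0,    // ECX.AES, bit 25
        pclmul: leaf1_ecx & (1 << 1) != 0,  // ECX.PCLMULQDQ, bit 1
        sha2: leaf7_ebx & (1 << 29) != 0,   // EBX.SHA, bit 29
    }
}

/// AArch64: 4-bit fields in ID_AA64ISAR0_EL1; nonzero means implemented.
fn caps_from_isar0(id_aa64isar0: u64) -> CryptoHwCaps {
    let field = |lo: u32| (id_aa64isar0 >> lo) & 0xf;
    CryptoHwCaps {
        aes: field(4) != 0,       // AES field, bits [7:4]
        pclmul: field(4) >= 2,    // AES field value 2 adds PMULL
        sha2: field(12) != 0,     // SHA2 field, bits [15:12]
    }
}
```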

RISC-V (Scalar Cryptography ISA extensions — the Zkn group, ratified in 2021 as part of the scalar cryptography (Zk) extension set):

  • Zkne (AES encryption): scalar AES round instructions aes64es, aes64esm, aes64ks1i, aes64ks2 — one round per instruction on 64-bit cores.
  • Zknd (AES decryption): aes64ds, aes64dsm — symmetric to Zkne.
  • Zknh (SHA-2): sha256sig0, sha256sig1, sha256sum0, sha256sum1, sha512sig0, sha512sig1, sha512sum0r, sha512sum1r.
  • Zksh (SM3): sm3p0, sm3p1 — Chinese national hash standard, paired with Zksed. (The sha512sig0l, sha512sig0h, etc. instructions are the RV32 forms of Zknh, not a separate extension.)
  • Zksed (SM4): sm4ed, sm4ks — Chinese national block cipher.
  • Zbkx (bit manipulation for crypto): xperm4, xperm8 — accelerates S-box lookups and byte shuffles in cipher implementations.
  • Zvkned (vector AES, requires V extension): AES using the V (vector) extension — higher throughput than scalar Zkne/Zknd when V is available.
  • Detection: the RISC-V ISA string in the Device Tree riscv,isa property (e.g., rv64imafdc_zkn_zks_zbkx) or the misa CSR (V bit for vector). UmkaOS parses the ISA string at boot in umka-kernel/src/arch/riscv64/cpu.rs. Scalar crypto extensions are not yet universal — many embedded RISC-V cores lack them. UmkaOS falls back to the portable software implementation on cores without them.

PPC32 / PPC64LE:

  • AES: POWER8+ (Power ISA 2.07) provides vcipher, vcipherlast, vncipher, vncipherlast VMX (AltiVec) instructions for AES encryption and decryption.
  • SHA: POWER8+ adds vshasigmaw and vshasigmad for SHA-256 and SHA-512 sigma functions, accelerating the compression rounds.
  • GCM: vpmsumw / vpmsumd VMX polynomial multiply — particularly efficient for GCM GHASH due to POWER8's wide polynomial multiply unit.
  • RNG: POWER9+ darn instruction (Deliver A Random Number) — a hardware TRNG accessible from privileged mode without a firmware call.
  • Detection: PPC feature bits in the Device Tree ibm,pa-features property for bare-metal boot, or AT_HWCAP / AT_HWCAP2 auxiliary vector entries (following the PPC Linux ABI feature bit definitions). UmkaOS reads ibm,pa-features from the DTB during early boot in umka-kernel/src/arch/ppc64le/cpu.rs.

UmkaOS crypto dispatch table (registered at boot):

/// A registered crypto algorithm implementation — hardware or software.
/// Registered via `crypto_register_alg`; the API selects the highest-priority
/// driver that supports the requested algorithm and operation mode.
struct CryptoDriver {
    /// Human-readable implementation name (e.g., "aesni-gcm", "arm-ce-aes-gcm").
    name:    &'static str,
    /// Selection priority. Hardware implementations register at priority 300-400;
    /// the portable software fallback registers at priority 100. Higher wins.
    priority: u32,
    /// AES-CBC encrypt/decrypt, or None if not supported by this driver.
    aes_cbc: Option<fn(key: &AesKey, iv: &[u8; 16], buf: &mut [u8], enc: bool)>,
    /// AES-GCM AEAD, or None if not supported.
    aes_gcm: Option<fn(key: &AesKey, nonce: &[u8; 12], aad: &[u8],
                       pt: &[u8], ct: &mut [u8], tag: &mut [u8; 16])>,
    /// SHA-256, or None if not supported.
    sha256:  Option<fn(data: &[u8], out: &mut [u8; 32])>,
    /// SHA-512, or None if not supported.
    sha512:  Option<fn(data: &[u8], out: &mut [u8; 64])>,
    // Additional algorithm slots follow the same pattern.
}

At boot, each architecture's cpu.rs initialisation code calls crypto_register_hw_drivers() which inspects the detected feature flags and registers whichever hardware drivers are available at priority 300-400. The portable software implementation registers unconditionally at priority 100. The crypto API's alloc_tfm path always selects the highest-priority registered driver, so hardware acceleration is transparent to callers.
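Because each algorithm slot is an `Option`, selection is per-slot: the winner for SHA-256 may differ from the winner for AES-GCM, and a high-priority hardware driver that lacks a slot is simply skipped for that algorithm. A minimal model of that per-slot selection (toy function signature, illustrative names):

```rust
/// Cut-down CryptoDriver with a single algorithm slot for the sketch.
struct Driver {
    name: &'static str,
    priority: u32,
    sha256: Option<fn(&[u8]) -> u32>, // toy signature, not the real one
}

/// Highest-priority driver that actually provides the requested slot.
fn select_sha256(drivers: &[Driver]) -> Option<&Driver> {
    drivers
        .iter()
        .filter(|d| d.sha256.is_some())
        .max_by_key(|d| d.priority)
}
```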

9.1.8 FIPS Mode

FIPS mode is a runtime configuration flag set during early boot from the kernel command line (umka.fips=1) or a UEFI variable. Once enabled, it cannot be disabled without a reboot.

// umka-core/src/crypto/fips.rs

/// Set once at boot. Subsequent reads are a single atomic load (relaxed).
static FIPS_MODE: AtomicBool = AtomicBool::new(false);

/// Returns true if FIPS mode is active. Hot path: one acquire-load.
#[inline]
pub fn crypto_fips_enabled() -> bool {
    FIPS_MODE.load(Ordering::Acquire)
}

/// Called once during early boot, before any crypto allocations.
pub fn crypto_fips_enable() {
    FIPS_MODE.store(true, Ordering::Release);
}

In FIPS mode:

  • Algorithm lookup filters out any descriptor lacking CryptoAlgFlags::FIPS_APPROVED. The following algorithms are approved under NIST SP 800-131A Rev. 2 (2019) and SP 800-131A Rev. 3 (draft, 2024): AES-128/192/256 (all approved modes), SHA-256/384/512, SHA-3-256/384/512, HMAC-SHA-256/384/512, AES-GCM, AES-CCM, RSA (≥2048 bits), ECDSA/ECDH (P-256, P-384), Ed25519 (approved by FIPS 186-5, 2023), ML-KEM-512/768/1024 (FIPS 203), ML-DSA-44/65/87 (FIPS 204), SLH-DSA (FIPS 205).
  • Algorithms explicitly disallowed: MD5, SHA-1 (signature generation), RC4, DES, 3DES (new applications), RSA < 2048 bits, ECDH/ECDSA on curves below P-256.

Note: FIPS approval status changes over time as NIST publishes new standards. The approved-algorithm list in UmkaOS is maintained as a compile-time table in umka-core/src/crypto/fips_approved.rs and updated with each relevant NIST publication. Do not hard-code FIPS approval decisions in algorithm registration sites; derive them from that central table.
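A sketch of what that central table could look like (a tiny excerpt with hypothetical names; the real fips_approved.rs table is larger and is the single source of truth):

```rust
/// Excerpt of a compile-time approval table, consulted by the lookup
/// path in FIPS mode instead of per-registration-site decisions.
static FIPS_APPROVED_NAMES: &[&str] = &[
    "aes-gcm", "aes-ccm", "sha256", "sha384", "sha512",
    "hmac-sha256", "ml-kem-768", "ml-dsa-65",
];

/// Explicitly disallowed legacy algorithms (defence in depth: these
/// must never carry the FIPS_APPROVED flag either).
static FIPS_DISALLOWED_NAMES: &[&str] = &["md5", "sha1", "rc4", "des", "3des"];

/// True if `name` may be allocated while FIPS mode is active.
fn fips_allows(name: &str) -> bool {
    FIPS_APPROVED_NAMES.contains(&name) && !FIPS_DISALLOWED_NAMES.contains(&name)
}
```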

9.1.9 sysfs Interface

Registered algorithms are exposed read-only under /sys/kernel/umka/crypto/algorithms/. Each algorithm has a directory named <driver_name> containing:

/sys/kernel/umka/crypto/algorithms/
  aesni-gcm/
    name        "aes-gcm"
    driver      "aesni-gcm"
    type        "aead"
    priority    400
    flags       "hw_accel,fips_approved,inplace"
    refcount    3
  soft-ml-kem-768/
    name        "ml-kem-768"
    driver      "soft-ml-kem-768"
    type        "kpp"
    priority    200
    flags       "sw_impl,fips_approved,pqc"
    refcount    0
  ...

The refcount attribute reflects the number of live transform objects backed by that implementation. This is diagnostic only; it is subject to TOCTOU races and must not be used for resource accounting.

Cross-references:

  • Section 8.2 (08-security.md): Verified boot uses ML-DSA-65 (Akcipher) and SHA-384 (Shash)
  • Section 8.4 (08-security.md): IMA uses SHA-256 and SHA-384 via Shash
  • Section 8.5 (08-security.md): PQC algorithm definitions (ML-KEM, ML-DSA)
  • Section 8.6 (08-security.md): SEV-SNP/TDX use AES-256-GCM (Aead)
  • Section 9.2: Key retention service AsymmetricKey type uses Akcipher transforms
  • Section 11.1 (10-drivers.md): Ring buffer infrastructure used by async hardware accelerators
  • Section 14.4 (14-storage.md): NVMe TLS/auth uses AES-GCM and ML-KEM via this API
  • Section 15.X (15-networking.md): kTLS uses ChaCha20-Poly1305 and AES-GCM via Aead


9.1.10 AF_ALG — Userspace Crypto via Sockets

AF_ALG (Linux 2.6.38+) exposes the kernel crypto API to userspace via a socket interface. Userspace programs gain access to kernel-implemented cryptographic algorithms — including hardware-accelerated implementations — without writing their own crypto code. Typical consumers include cryptsetup (via its kernel crypto backend), the OpenSSL afalg engine, and the libkcapi library.

Socket Setup

/// AF_ALG socket address (matches Linux struct sockaddr_alg).
#[repr(C)]
pub struct SockaddrAlg {
    /// Address family: AF_ALG = 38.
    pub salg_family: u16,
    /// Algorithm type: "hash", "skcipher", "aead", "rng".
    pub salg_type:   [u8; 14],
    /// Feature bits (currently unused, must be 0).
    pub salg_feat:   u32,
    /// Algorithm mask (currently unused, must be 0).
    pub salg_mask:   u32,
    /// Algorithm name (e.g., "sha256", "aes-cbc", "chacha20-poly1305", "stdrng").
    pub salg_name:   [u8; 64],
}

Usage pattern:

  1. socket(AF_ALG, SOCK_SEQPACKET, 0) → returns a bind socket fd (no data transferred here)
  2. bind(bind_fd, &SockaddrAlg { salg_type: "hash", salg_name: "sha256", .. }, sizeof) → selects the algorithm
  3. For ciphers: setsockopt(bind_fd, SOL_ALG, ALG_SET_KEY, key, key_len) → set the key
  4. For AEAD: setsockopt(bind_fd, SOL_ALG, ALG_SET_AEAD_AUTHSIZE, NULL, authsize) → set tag length
  5. accept(bind_fd, NULL, NULL) → returns an op socket fd (one per concurrent operation)
  6. sendmsg(op_fd, msg, 0) → provide input data and control messages
  7. recvmsg(op_fd, msg, 0) → receive output data

The bind socket is shared across threads. Each accept() creates an independent operation socket that maintains its own IV and transform state. Multiple concurrent operations require multiple accept() calls.
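The fixed-size address fields must be NUL-padded before bind(). A sketch of a safe constructor for the struct shown above (the helper name is hypothetical):

```rust
// Mirrors the SockaddrAlg layout from the text; the constructor is a sketch.
#[repr(C)]
pub struct SockaddrAlg {
    pub salg_family: u16,
    pub salg_type: [u8; 14],
    pub salg_feat: u32,
    pub salg_mask: u32,
    pub salg_name: [u8; 64],
}

pub const AF_ALG: u16 = 38;

/// Build a bind address, NUL-padding the fixed-size type/name fields.
/// Returns None if either string (plus its terminating NUL) does not fit.
pub fn sockaddr_alg(alg_type: &str, alg_name: &str) -> Option<SockaddrAlg> {
    if alg_type.len() >= 14 || alg_name.len() >= 64 {
        return None;
    }
    let mut sa = SockaddrAlg {
        salg_family: AF_ALG,
        salg_type: [0; 14],
        salg_feat: 0,
        salg_mask: 0,
        salg_name: [0; 64],
    };
    sa.salg_type[..alg_type.len()].copy_from_slice(alg_type.as_bytes());
    sa.salg_name[..alg_name.len()].copy_from_slice(alg_name.as_bytes());
    Some(sa)
}
```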

Socket-Level Options (SOL_ALG)

pub const SOL_ALG:               i32 = 279;
pub const ALG_SET_KEY:           i32 = 1;  // set cipher/MAC key (getsockopt: read current key length)
pub const ALG_SET_IV:            i32 = 2;  // set IV via cmsg ALG_SET_IV control message
pub const ALG_SET_OP:            i32 = 3;  // set direction via cmsg: ALG_OP_ENCRYPT / ALG_OP_DECRYPT
pub const ALG_SET_AEAD_AUTHSIZE: i32 = 4;  // set AEAD authentication tag size in bytes
pub const ALG_SET_DRBG_ENTROPY:  i32 = 5;  // seed RNG with entropy (privileged, for testing)

pub const ALG_OP_DECRYPT:        u32 = 0;
pub const ALG_OP_ENCRYPT:        u32 = 1;

Control Messages (sendmsg cmsg)

/// IV control message (type = ALG_SET_IV).
/// Sent as ancillary data in sendmsg to set the IV for this operation.
#[repr(C)]
pub struct AlgIv {
    /// IV length in bytes (must match algorithm's expected IV size).
    pub ivlen: u32,
    /// IV bytes (variable length: ivlen bytes follow this field).
    pub iv:    [u8; 0],
}

/// Operation direction message (type = ALG_SET_OP).
/// Contains a single u32: ALG_OP_ENCRYPT or ALG_OP_DECRYPT.

MSG_MORE: Setting MSG_MORE in sendmsg flags indicates more data follows for this operation (multi-call streaming for hash update or large cipher blocks). Only the final sendmsg (without MSG_MORE) triggers computation.
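The accumulate-then-compute behaviour can be sketched with a toy operation object; the length "digest" below is a stand-in, not a real hash:

```rust
/// Sketch of MSG_MORE semantics: calls with `more = true` only buffer
/// input; the final call (more = false) runs the operation over the
/// concatenated data. Illustrative only.
pub struct StreamOp {
    buf: Vec<u8>,
}

impl StreamOp {
    pub fn new() -> Self {
        StreamOp { buf: Vec::new() }
    }

    /// Returns Some(output) only on the final call (more = false).
    pub fn sendmsg(&mut self, data: &[u8], more: bool) -> Option<usize> {
        self.buf.extend_from_slice(data);
        if more {
            None
        } else {
            let out = self.buf.len(); // stand-in for the digest computation
            self.buf.clear();
            Some(out)
        }
    }
}
```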

Algorithm Types

  salg_type    Key?                 IV?  Use case
  "hash"       Optional (for HMAC)  No   SHA-256, SHA-384, BLAKE2b, HMAC-SHA256
  "skcipher"   Yes                  Yes  AES-CBC, AES-CTR, ChaCha20, AES-XTS
  "aead"       Yes                  Yes  AES-GCM, ChaCha20-Poly1305, AES-CCM
  "rng"        No (optional seed)   No   stdrng, drbg_pr_ctr_aes256

All algorithms registered in the kernel crypto API (§9.1) are accessible via AF_ALG, including hardware-accelerated implementations (AES-NI, Intel QAT, AMD CCP). The kernel automatically selects the fastest available implementation.

Zero-Copy Path

For large data (e.g., full-disk encryption buffers), AF_ALG supports zero-copy via vmsplice() + splice():

  1. vmsplice(pipe_write_fd, iov, iov_count, SPLICE_F_GIFT) → transfer user pages to a pipe without copying
  2. splice(pipe_read_fd, NULL, op_fd, NULL, len, 0) → feed pipe data to AF_ALG input without copying
  3. splice(op_fd, NULL, pipe_write_fd, NULL, len, 0) → read output from AF_ALG without copying

This avoids any kernel↔userspace data copy for bulk operations, achieving near-hardware throughput.

Security Model

  • No privilege required for standard algorithms. Any process may use AF_ALG with any registered algorithm.
  • Privileged algorithms: algorithms requiring CAP_SYS_ADMIN to use (currently none in the standard registry — this mechanism is reserved for test-only algorithms that bypass FIPS constraints).
  • Key secrecy: the key set via ALG_SET_KEY is not accessible from userspace after being set (the setsockopt write-only path). The kernel holds the key in struct af_alg_ctx allocated in kernel memory; it is not pinned into a keyring and is freed when the bind socket is closed.
  • Algorithm access control: LSM hooks (§8.7) can gate AF_ALG socket creation by algorithm name, allowing policy-based restrictions on which algorithms are available to which processes (e.g., FIPS mode that only allows FIPS-approved algorithms).
  • No TOCTOU: the bind socket locks in the algorithm at bind() time; subsequent key or IV changes on the bind socket do not affect already-accept()ed op sockets.

Linux Compatibility

  • Same AF_ALG = 38 socket family constant
  • Same SockaddrAlg struct layout (salg_family, salg_type[14], salg_feat, salg_mask, salg_name[64])
  • Same SOL_ALG socket-level options (279)
  • Same ALG_SET_KEY, ALG_SET_IV, ALG_SET_OP, ALG_SET_AEAD_AUTHSIZE values
  • Same MSG_MORE streaming semantics
  • Same zero-copy splice() path
  • cryptsetup can use AF_ALG as its crypto backend (e.g., for LUKS2 keyslot operations)
  • OpenSSL 1.1.0+ ships an AF_ALG engine for hardware-accelerated operation on embedded systems
  • libkcapi wraps AF_ALG in a conventional userspace library API

9.2 Kernel Key Retention Service

The Key Retention Service stores cryptographic keys and opaque credentials in kernel memory, where they are inaccessible to userspace except through the tightly controlled keyctl() syscall interface. The service provides durable, referenceable key handles that persist across file descriptors, survive fork/exec under controlled conditions, and integrate with the LSM framework (Section 8.7) for fine-grained access control.

Callers in the kernel that use the key retention service: NVMe TLS authentication (Section 14.4) stores TLS client certificates as AsymmetricKey entries; RPCSEC_GSS (NFS Kerberos) caches service tickets as LogonKey entries; dm-crypt stores volume master keys as LogonKey entries; IMA (Section 8.4) stores measurement policy signing keys; driver signing (Section 8.2) stores the .builtin_trusted_keys and .secondary_trusted_keys keyrings; and the TPM subsystem (Section 8.3) stores sealed blobs as EncryptedKey entries whose payload is protected by the TPM's storage root key.

9.2.1 Key Object

// umka-core/src/keys/key.rs

/// Globally unique key serial number.
/// Assigned monotonically from an atomic counter at key creation time.
/// Userspace references keys by serial number in `keyctl()` calls.
/// Zero is not a valid serial.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct KeySerial(pub u32);

/// Permissions bitfield for a key. Modelled after Linux's key permission
/// word (see `man 7 keyrings`). Four subjects (possessor, user, group, other)
/// each with six permission bits.
#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct KeyPerm(pub u32);

impl KeyPerm {
    // Possessor bits (bits 24-29)
    pub const POSS_VIEW:    u32 = 0x0100_0000; // see key attributes
    pub const POSS_READ:    u32 = 0x0200_0000; // read key payload
    pub const POSS_WRITE:   u32 = 0x0400_0000; // update/instantiate key
    pub const POSS_SEARCH:  u32 = 0x0800_0000; // find via keyring search
    pub const POSS_LINK:    u32 = 0x1000_0000; // link into a keyring
    pub const POSS_SETATTR: u32 = 0x2000_0000; // set timeout, perms, uid/gid

    // User bits (bits 16-21)
    pub const USER_VIEW:    u32 = 0x0001_0000;
    pub const USER_READ:    u32 = 0x0002_0000;
    pub const USER_WRITE:   u32 = 0x0004_0000;
    pub const USER_SEARCH:  u32 = 0x0008_0000;
    pub const USER_LINK:    u32 = 0x0010_0000;
    pub const USER_SETATTR: u32 = 0x0020_0000;

    // Group bits (bits 8-13)
    pub const GROUP_VIEW:    u32 = 0x0000_0100;
    pub const GROUP_READ:    u32 = 0x0000_0200;
    pub const GROUP_WRITE:   u32 = 0x0000_0400;
    pub const GROUP_SEARCH:  u32 = 0x0000_0800;
    pub const GROUP_LINK:    u32 = 0x0000_1000;
    pub const GROUP_SETATTR: u32 = 0x0000_2000;

    // Other bits (bits 0-5)
    pub const OTHER_VIEW:    u32 = 0x0000_0001;
    pub const OTHER_READ:    u32 = 0x0000_0002;
    pub const OTHER_WRITE:   u32 = 0x0000_0004;
    pub const OTHER_SEARCH:  u32 = 0x0000_0008;
    pub const OTHER_LINK:    u32 = 0x0000_0010;
    pub const OTHER_SETATTR: u32 = 0x0000_0020;
}
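The four-band permission word resolves to a single six-bit grant set per caller. A sketch of the DAC check, with illustrative types and simplified credential handling:

```rust
/// Sketch of the DAC check: collapse the applicable subject bands into the
/// low six bits, then test the requested permission. The possessor band is
/// additive; exactly one of user/group/other applies. Illustrative only.
pub fn key_perm_check(
    perm_word: u32,
    key_uid: u32,
    key_gid: u32,
    caller_uid: u32,
    caller_gid: u32,
    possesses: bool,
    wanted: u32, // one of the low-six permission bits, e.g. 0x02 = READ
) -> bool {
    let mut granted = 0u32;
    if possesses {
        granted |= (perm_word >> 24) & 0x3f; // possessor band
    }
    if caller_uid == key_uid {
        granted |= (perm_word >> 16) & 0x3f; // user band
    } else if caller_gid == key_gid {
        granted |= (perm_word >> 8) & 0x3f; // group band
    } else {
        granted |= perm_word & 0x3f; // other band
    }
    granted & wanted == wanted
}
```

For example, a key with permission word 0x0201_0000 grants READ to a possessor but only VIEW to its owning UID.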

/// Key lifecycle state.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum KeyState {
    /// Created but not yet instantiated. `request_key()` upcalls put a key
    /// here while waiting for the userspace helper to provide the payload.
    Uninstantiated = 0,
    /// Fully instantiated and available for use.
    Instantiated   = 1,
    /// Negative instantiation: the key does not exist (caches a lookup failure
    /// to prevent repeated upcalls). Requests for this key return `-ENOKEY`.
    Negative       = 2,
    /// Payload revoked by `KEYCTL_REVOKE`. Metadata readable; payload gone.
    Revoked        = 3,
    /// Refcount reached zero; being freed. Not reachable from the key table.
    Dead           = 4,
}

/// Core key object. Lives in the global key table (`KEY_TABLE`) as long as
/// `refcount > 0`. Keyrings hold strong references to keys they link to.
pub struct Key {
    /// Globally unique serial. Assigned at creation, immutable thereafter.
    pub serial: KeySerial,

    /// Key type implementation. Points to a static `KeyType` vtable.
    /// Immutable after creation.
    pub key_type: &'static dyn KeyType,

    /// Human-readable description. Set at creation, readable by `KEYCTL_DESCRIBE`.
    /// May contain structured data (e.g., `"nfs@server.example.com:krb5"`).
    /// Maximum 4096 bytes.
    pub description: Box<str>,

    /// Type-specific payload. Protected by `payload_lock`. On revocation the
    /// payload is replaced with `None` and the type's `destroy()` method is
    /// called to zeroize key material.
    pub payload: SpinLock<Option<Box<dyn Any + Send>>>,

    /// Owner user ID (initial namespace).
    pub uid: UserId,

    /// Owner group ID (initial namespace).
    pub gid: GroupId,

    /// Access control: who may perform which operations.
    pub perm: KeyPerm,

    /// Absolute expiry time. `None` = does not expire. When expired, the
    /// key transitions to `Negative` on next access and the payload is
    /// destroyed. The garbage collector also reaps expired keys.
    pub expiry: Option<MonotonicInstant>,

    /// Strong reference count. Includes: key table entry (1), each keyring
    /// link (1 per link), each in-progress `keyctl()` call (1).
    pub refcount: AtomicU32,

    /// Lifecycle state.
    pub state: AtomicU8, // KeyState

    /// Quota: which UID's quota this key counts against.
    /// Usually equals `uid`; may differ for keys created on behalf of another
    /// UID (e.g., by the request-key helper running as root).
    pub quota_uid: UserId,

    /// Size of the payload in bytes, for quota accounting.
    /// Updated atomically when the payload is set.
    pub payload_bytes: AtomicU32,
}

9.2.2 Key Types

// umka-core/src/keys/types.rs

/// Type implementation for a class of keys.
///
/// Each concrete key type implements this trait. Vtable pointers are stable
/// for the kernel lifetime (static implementations only).
pub trait KeyType: Send + Sync {
    /// Short name used in `keyctl(KEYCTL_DESCRIBE)` output and in the
    /// `add_key()` syscall `type` argument (e.g., `"user"`, `"logon"`,
    /// `"asymmetric"`). Maximum 32 bytes.
    fn name(&self) -> &'static str;

    /// Instantiate the key from raw payload bytes supplied by userspace
    /// (via `add_key()`) or by the request-key helper (via
    /// `KEYCTL_INSTANTIATE`). The type parses and validates `data`,
    /// returning a heap-allocated payload on success.
    ///
    /// # Errors
    /// - `EINVAL`: data is malformed for this key type.
    /// - `ENOMEM`: allocation failed.
    fn instantiate(
        &self,
        key: &Key,
        data: &[u8],
    ) -> Result<Box<dyn Any + Send>, KernelError>;

    /// Update the payload of an already-instantiated key.
    /// Not all types support update; return `Err(EOPNOTSUPP)` if not.
    fn update(
        &self,
        key: &Key,
        data: &[u8],
    ) -> Result<Box<dyn Any + Send>, KernelError>;

    /// Revoke the key: destroy key material. Called with `payload_lock` held.
    /// Must zeroize sensitive data before returning.
    fn revoke(&self, key: &Key);

    /// Final destruction: called when `refcount` reaches zero after revocation.
    /// At this point the payload is already `None`; the type may release any
    /// external resources (e.g., TPM NV slot).
    fn destroy(&self, key: &Key);

    /// Produce a human-readable description for `KEYCTL_DESCRIBE`.
    /// Format: `"<type>;<uid>;<gid>;<perm>;<description>"`.
    fn describe(&self, key: &Key, buf: &mut dyn core::fmt::Write) -> core::fmt::Result;

    /// Read the key payload back to userspace (for `KEYCTL_READ`).
    /// Not all types permit this. `LogonKey` returns `Err(EACCES)`.
    /// The output is type-specific: `UserKey` returns raw bytes; `AsymmetricKey`
    /// returns the public key in SubjectPublicKeyInfo DER.
    fn read(
        &self,
        key: &Key,
        buf: &mut [u8],
    ) -> Result<usize, KernelError>;
}

/// Opaque binary blob. Userspace writes the payload; userspace may also read
/// it back (subject to `KeyPerm::READ`). Used for passwords, tokens, and
/// arbitrary secrets where the kernel does not interpret the content.
pub struct UserKey;

/// Write-only credential. Userspace can search and link `LogonKey` entries
/// but cannot read the payload (`read()` returns `EACCES`). Used for
/// Kerberos service tickets, NVMe TLS PSKs, and dm-crypt volume keys
/// where the kernel uses the payload but userspace must not extract it.
pub struct LogonKey;

/// Key that points to another keyring, forming the keyring tree.
/// The payload is a `Vec<KeyRef>` — an ordered list of links to other keys.
/// Supports: `KEYCTL_LINK`, `KEYCTL_UNLINK`, `KEYCTL_CLEAR`, `KEYCTL_SEARCH`.
pub struct KeyringKey;

/// Asymmetric public key. Payload is an `AkCipherTfm` (from Section 9.1)
/// backed by a parsed key (RSA, ECDSA, Ed25519, or ML-DSA public/private key
/// pair). Supports `KEYCTL_PKEY_SIGN`, `KEYCTL_PKEY_VERIFY`,
/// `KEYCTL_PKEY_ENCRYPT`, `KEYCTL_PKEY_DECRYPT` via the Crypto API.
/// `read()` returns the public key in SubjectPublicKeyInfo DER.
pub struct AsymmetricKey;

/// Encrypted key. The raw key material is encrypted under a master key
/// (another key in the kernel key service, typically a TPM-bound key).
/// The encrypted blob is the on-disk/in-swap representation; the plaintext
/// is only present in kernel memory as long as the key is instantiated.
/// Payload format: `struct EncryptedKeyPayload { ct: Vec<u8>, iv: [u8; 12] }`.
/// Master key is referenced by serial; if the master key is revoked, this
/// key cannot be decrypted and transitions to `Negative`.
pub struct EncryptedKey;

/// DNS resolver key. Description is a DNS name + query type (e.g.,
/// `"server.example.com"` or `"_ldap._tcp.example.com srva"`).
/// Payload is a serialised DNS response (A/AAAA/SRV records).
/// Auto-expires at the TTL of the DNS response. On expiry, a `request_key()`
/// upcall refreshes the entry.
pub struct DnsResolverKey;

9.2.3 Keyring Hierarchy

Keyrings are keys of type KeyringKey. They form a directed acyclic graph (search cycles are forbidden and detected at KEYCTL_LINK time). The standard per-thread/process/session hierarchy is established at task creation:

  .builtin_trusted_keys   .secondary_trusted_keys   .ima_mok
         │                        │                     │
         └────────────────────────┼─────────────────────┘
                                  │ (kernel-owned, read-only from userspace)
  User persistent keyring ←───────┘
         │
  User keyring (per UID)
         │
  User session keyring ←── default session for login shells
         │
  Session keyring (per login session, replaced by pam_keyinit)
         │
  Process keyring (per process, shared by all threads, optional)
         │
  Thread keyring (per thread, optional)
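The no-cycles invariant checked at KEYCTL_LINK time can be sketched as a reachability search, modelling the keyring graph as a serial-to-children map (illustrative types):

```rust
use std::collections::{HashMap, HashSet};

/// Sketch of link-time cycle detection: linking `child` into `parent` is
/// rejected if `parent` is already reachable from `child`.
pub fn link_would_cycle(
    links: &HashMap<u32, Vec<u32>>, // keyring serial -> linked child serials
    parent: u32,
    child: u32,
) -> bool {
    if parent == child {
        return true;
    }
    let mut stack = vec![child];
    let mut seen = HashSet::new();
    while let Some(k) = stack.pop() {
        if k == parent {
            return true; // parent reachable from child: link would close a cycle
        }
        if seen.insert(k) {
            if let Some(next) = links.get(&k) {
                stack.extend(next);
            }
        }
    }
    false
}
```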

Special kernel-owned keyrings:

  • .builtin_trusted_keys: populated at build time from X.509 certificates embedded in the kernel image. Contains the distribution CA and UmkaOS signing key. Read-only; no userspace links permitted.
  • .secondary_trusted_keys: populated at runtime; restricted via KEYCTL_RESTRICT_KEYRING so that MOK (Machine Owner Key) certificates can be added without rebuilding the kernel. Additions require a valid signature from a key already in .builtin_trusted_keys.
  • .ima_mok: IMA's machine owner key ring. Keys added here affect IMA policy (Section 8.4). Requires CAP_SYS_ADMIN to modify.
  • .nvme: NVMe authentication keyring. Populated by the nvme_keyring module from /etc/nvme/hostkey.pem and /etc/nvme/hostsym.conf via request_key() upcall on first NVMe TLS connection attempt.

9.2.4 Key Quotas

To prevent denial-of-service via key exhaustion, each UID has a quota:

// umka-core/src/keys/quota.rs

/// Per-UID key quota.
#[derive(Debug)]
pub struct KeyQuota {
    /// UID this quota applies to.
    pub uid: UserId,

    /// Number of keys currently charged to this UID.
    pub key_count: AtomicU32,

    /// Total payload bytes currently charged to this UID.
    pub payload_bytes: AtomicU64,

    /// Maximum number of keys allowed. Default: 200.
    /// Configurable via `/proc/sys/kernel/keys/maxkeys`.
    pub max_keys: u32,

    /// Maximum payload bytes allowed. Default: 20 * 1024 (20 KiB).
    /// Configurable via `/proc/sys/kernel/keys/maxbytes`.
    pub max_bytes: u64,
}

/// Global quota table. One entry per active UID; entries created on first
/// key allocation for a UID and freed when all keys for that UID are gone.
static KEY_QUOTA_TABLE: RwSpinLock<BTreeMap<UserId, KeyQuota>> = ...;

/// Charge quota before creating a key. Returns `Err(EDQUOT)` if the UID
/// would exceed either limit.
pub fn key_quota_charge(uid: UserId, payload_bytes: u32) -> Result<(), KernelError>;

/// Release quota when a key is destroyed.
pub fn key_quota_release(uid: UserId, payload_bytes: u32);

Root (UID 0) is exempt from key quotas. Kernel-internal keys (with quota_uid set to a sentinel value KERNEL_KEY_UID) do not count against any user's quota.
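The charge/release arithmetic can be sketched as follows; the kernel version uses atomics under the quota-table lock, and these names are illustrative:

```rust
/// Sketch of the two-limit quota charge. Plain fields for clarity.
pub struct Quota {
    pub key_count: u32,
    pub payload_bytes: u64,
    pub max_keys: u32,  // default 200
    pub max_bytes: u64, // default 20 KiB
}

#[derive(Debug, PartialEq)]
pub struct Edquot;

/// Charge one key of `payload` bytes; either limit exceeded means EDQUOT.
pub fn quota_charge(q: &mut Quota, payload: u32) -> Result<(), Edquot> {
    if q.key_count + 1 > q.max_keys || q.payload_bytes + payload as u64 > q.max_bytes {
        return Err(Edquot);
    }
    q.key_count += 1;
    q.payload_bytes += payload as u64;
    Ok(())
}

/// Release quota when a key is destroyed.
pub fn quota_release(q: &mut Quota, payload: u32) {
    q.key_count -= 1;
    q.payload_bytes -= payload as u64;
}
```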

9.2.5 The keyctl() Syscall

keyctl() is the primary userspace interface. The first argument selects the operation; subsequent arguments are operation-specific.

// umka-compat/src/syscall/keyctl.rs

/// Dispatch table for keyctl operations.
/// Each entry maps an operation constant to a handler function.
fn keyctl_dispatch(
    op: u32,
    arg2: usize,
    arg3: usize,
    arg4: usize,
    arg5: usize,
    task: &Task,
) -> Result<isize, KernelError> {
    match op {
        // Return the key ID of one of the special keyrings.
        // arg2: KEY_SPEC_* constant (negative values).
        // Returns the key serial as a positive isize.
        KEYCTL_GET_KEYRING_ID => keyctl_get_keyring_id(arg2 as i32, task),

        // Join or create the named session keyring.
        // arg2: pointer to name string (NULL = anonymous).
        KEYCTL_JOIN_SESSION_KEYRING => keyctl_join_session_keyring(arg2 as *const u8, task),

        // Update a key's payload. arg2: key serial, arg3: payload ptr, arg4: payload len.
        KEYCTL_UPDATE => keyctl_update(KeySerial(arg2 as u32), arg3 as *const u8, arg4 as u32, task),

        // Revoke a key. arg2: key serial.
        KEYCTL_REVOKE => keyctl_revoke(KeySerial(arg2 as u32), task),

        // Return a description string. arg2: serial, arg3: buf ptr, arg4: buf size.
        KEYCTL_DESCRIBE => keyctl_describe(KeySerial(arg2 as u32), arg3 as *mut u8, arg4 as u32, task),

        // Clear all links in a keyring. arg2: keyring serial.
        KEYCTL_CLEAR => keyctl_clear(KeySerial(arg2 as u32), task),

        // Link a key into a keyring. arg2: key serial, arg3: keyring serial.
        KEYCTL_LINK => keyctl_link(KeySerial(arg2 as u32), KeySerial(arg3 as u32), task),

        // Unlink a key from a keyring. arg2: key serial, arg3: keyring serial.
        KEYCTL_UNLINK => keyctl_unlink(KeySerial(arg2 as u32), KeySerial(arg3 as u32), task),

        // Search keyrings for a key. arg2: keyring serial, arg3: type ptr,
        // arg4: description ptr, arg5: destination keyring serial.
        KEYCTL_SEARCH => keyctl_search(
            KeySerial(arg2 as u32),
            arg3 as *const u8,
            arg4 as *const u8,
            KeySerial(arg5 as u32),
            task,
        ),

        // Read a key's payload. arg2: serial, arg3: buf ptr, arg4: buf len.
        KEYCTL_READ => keyctl_read(KeySerial(arg2 as u32), arg3 as *mut u8, arg4 as u32, task),

        // Instantiate a key from the request-key helper.
        // arg2: serial, arg3: payload ptr, arg4: payload len, arg5: keyring serial.
        KEYCTL_INSTANTIATE => keyctl_instantiate(
            KeySerial(arg2 as u32),
            arg3 as *const u8,
            arg4 as u32,
            KeySerial(arg5 as u32),
            task,
        ),

        // Negatively instantiate (mark as not-found). arg2: serial,
        // arg3: timeout_secs, arg4: keyring serial.
        KEYCTL_NEGATE => keyctl_negate(
            KeySerial(arg2 as u32),
            arg3 as u32,
            KeySerial(arg4 as u32),
            task,
        ),

        // Set the default keyring for implicit key requests.
        // arg2: KEY_REQKEY_DEFL_* constant.
        KEYCTL_SET_REQKEY_KEYRING => keyctl_set_reqkey_keyring(arg2 as i32, task),

        // Set key expiry. arg2: serial, arg3: timeout_secs (0 = no expiry).
        KEYCTL_SET_TIMEOUT => keyctl_set_timeout(KeySerial(arg2 as u32), arg3 as u32, task),

        // Assume authority over an uninstantiated key (used by request-key helper).
        KEYCTL_ASSUME_AUTHORITY => keyctl_assume_authority(KeySerial(arg2 as u32), task),

        // Get the LSM security label for a key. arg2: serial,
        // arg3: buf ptr, arg4: buf len.
        KEYCTL_GET_SECURITY => keyctl_get_security(KeySerial(arg2 as u32), arg3 as *mut u8, arg4 as u32, task),

        // DH key derivation. arg2: pointer to keyctl_dh_params struct.
        KEYCTL_DH_COMPUTE => keyctl_dh_compute(arg2 as *const KeyctlDhParams, task),

        // Public key operations. arg2: serial, arg3: pointer to keyctl_pkey_params,
        // arg4: info ptr, arg5: in/out ptrs encoded in the params struct.
        KEYCTL_PKEY_QUERY    => keyctl_pkey_query(KeySerial(arg2 as u32), arg3 as *const KeyctlPkeyParams, task),
        KEYCTL_PKEY_ENCRYPT  => keyctl_pkey_encrypt(KeySerial(arg2 as u32), arg3 as *const KeyctlPkeyParams, task),
        KEYCTL_PKEY_DECRYPT  => keyctl_pkey_decrypt(KeySerial(arg2 as u32), arg3 as *const KeyctlPkeyParams, task),
        KEYCTL_PKEY_SIGN     => keyctl_pkey_sign(KeySerial(arg2 as u32), arg3 as *const KeyctlPkeyParams, task),
        KEYCTL_PKEY_VERIFY   => keyctl_pkey_verify(KeySerial(arg2 as u32), arg3 as *const KeyctlPkeyParams, task),

        _ => Err(KernelError::EOPNOTSUPP),
    }
}

Special key ID constants, as defined in Linux's <linux/keyctl.h>:

/// Refers to the calling thread's own thread keyring.
pub const KEY_SPEC_THREAD_KEYRING:       i32 = -1;
/// Refers to the calling process's own process keyring.
pub const KEY_SPEC_PROCESS_KEYRING:      i32 = -2;
/// Refers to the calling process's session keyring.
pub const KEY_SPEC_SESSION_KEYRING:      i32 = -3;
/// Refers to the calling process's user keyring.
pub const KEY_SPEC_USER_KEYRING:         i32 = -4;
/// Refers to the calling process's user session keyring.
pub const KEY_SPEC_USER_SESSION_KEYRING: i32 = -5;
/// Refers to the calling process's group keyring.
pub const KEY_SPEC_GROUP_KEYRING:        i32 = -6;
/// Refers to the assumed request_key() authorisation key.
pub const KEY_SPEC_REQKEY_AUTH_KEY:      i32 = -7;

9.2.6 The request_key() Upcall

request_key() is the mechanism by which the kernel asks userspace to supply a key that is not yet in any keyring reachable from the calling process. From the requesting process's perspective the call is synchronous: the kernel creates an uninstantiated key, blocks the requester, and resumes it when the key is instantiated (or construction fails). The construction itself runs asynchronously in a separate helper process.

request_key(type, description, callout_info, dest_keyring) algorithm:

1. SEARCH: Walk the keyring tree reachable from the calling thread
   (thread keyring → process keyring → session keyring → user keyring
   → user session keyring → .builtin_trusted_keys).
   For each keyring:
     a. Lock the keyring for reading (RCU).
     b. Iterate links; for each linked key matching (type, description):
        - If state == Instantiated and not expired → return that key.
        - If state == Negative → return Err(ENOKEY) immediately.
   If found: optionally link into dest_keyring and return.

2. CREATE uninstantiated key:
   - Allocate Key{serial=next_serial(), state=Uninstantiated, ...}.
   - Insert into the global key table.
   - Link into dest_keyring (if specified).

3. CONSTRUCT: Check if a call_sysrequest handler is registered for the type.
   If none: mark the key Negative with a short timeout and return Err(ENOKEY).

4. CREATE auth key:
   - Allocate a special `request_key_auth` key with:
     - payload: (target_key_serial, type, description, callout_info)
     - uid/gid: calling process's credentials
     - perm: possessor=VIEW|READ|SEARCH, others=none
   - Link the auth key into the kernel's `request_key_auth_keyring`.

5. FORK request-key helper:
   - Fork a process running `/sbin/request-key`.
   - Set its session keyring to a new keyring containing only the auth key.
   - Pass environment: KEYCTL_REQUESTKEY_AUTH_KEY=<auth_key_serial>.
   - The helper process looks up its auth key, reads the target description,
     contacts the appropriate credential daemon (gssd, nvme-cli, etc.),
     and calls:
       keyctl(KEYCTL_INSTANTIATE, target_serial, payload, len, dest_keyring)
     or on failure:
       keyctl(KEYCTL_NEGATE, target_serial, timeout_secs, dest_keyring)

6. WAIT: The requesting thread waits (interruptible) on a wait queue
   associated with the uninstantiated key. Timeout: 60 seconds by default.
   On wake:
     - If state == Instantiated: return key serial.
     - If state == Negative: return Err(ENOKEY).
     - If timeout: return Err(ETIMEDOUT); the key is marked Negative so immediate retries fail fast.
     - If signal received: return Err(ERESTARTSYS).

7. CLEANUP: When the request-key helper process exits, the kernel:
   - Revokes and destroys the auth key.
   - If the target key is still Uninstantiated, marks it Negative.

Note: The /sbin/request-key binary is part of the keyutils package. Its configuration file /etc/request-key.conf maps (operation, type, description) triples to handler programs. For example, create krb5 nfs@* * /usr/sbin/rpc.gssd %k %d %c tells request-key to invoke rpc.gssd for Kerberos NFS tickets.
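The SEARCH step of the algorithm above can be sketched as a walk over the standard keyring order; types, states, and error strings here are illustrative:

```rust
/// Sketch of step 1 (SEARCH): the first instantiated match wins; a negative
/// entry fails the whole lookup immediately. Expiry checks are omitted.
#[derive(Clone, Copy, Debug)]
pub enum State {
    Instantiated,
    Negative,
    Uninstantiated,
}

pub struct Entry {
    pub key_type: &'static str,
    pub description: String,
    pub state: State,
    pub serial: u32,
}

pub fn search(
    keyrings: &[Vec<Entry>], // thread, process, session, user, user-session, ...
    key_type: &str,
    description: &str,
) -> Result<u32, &'static str> {
    for ring in keyrings {
        for e in ring {
            if e.key_type == key_type && e.description == description {
                match e.state {
                    State::Instantiated => return Ok(e.serial),
                    State::Negative => return Err("ENOKEY"), // cached failure
                    State::Uninstantiated => {} // still being constructed; skip
                }
            }
        }
    }
    Err("ENOKEY") // triggers the CREATE/CONSTRUCT path
}
```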

9.2.7 LSM Hooks

Key operations are mediated by LSM hook callouts (Section 8.7). The hooks allow MAC policies (SELinux, AppArmor) to enforce additional constraints beyond the DAC KeyPerm checks:

// umka-core/src/keys/security.rs

/// Called when a new key is allocated, before quota charge.
/// LSM may set a security label on the key.
/// Returns `Ok(())` to allow, `Err(EACCES)` to deny.
pub fn security_key_alloc(
    key: &Key,
    cred: &TaskCredential,
    flags: KeyAllocFlags,
) -> Result<(), KernelError>;

/// Called when a key's refcount drops to zero, just before deallocation.
pub fn security_key_free(key: &Key);

/// Called before each `keyctl()` operation.
/// `perm` is one of the `KeyPerm::*` bit constants for the requested operation.
/// Returns `Ok(())` to allow, `Err(EACCES)` to deny.
pub fn security_key_permission(
    key_ref: &KeyRef,
    cred: &TaskCredential,
    perm: u32,
) -> Result<(), KernelError>;

/// Called during keyring search, once per candidate key.
/// LSM may suppress a key from search results (return `Err(EACCES)`)
/// without denying access to the key by direct serial reference.
pub fn security_keyring_search(
    keyring: &Key,
    key_type: &'static dyn KeyType,
    description: &str,
) -> Result<(), KernelError>;

The DAC permission check (KeyPerm) is performed before the LSM hook. If DAC denies the operation, the LSM hook is not called. This matches the Linux model and ensures that LSM cannot be used to grant permissions beyond what DAC allows (MAC augments DAC, does not override it).
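The ordering can be expressed as a short composition: a DAC denial short-circuits before the LSM hook runs, so MAC can only restrict further (a sketch, not the kernel's actual hook plumbing):

```rust
/// Sketch of the check ordering: DAC first, LSM second, both must allow.
/// A DAC denial returns before the LSM hook is ever consulted.
pub fn key_access_allowed(
    dac_allows: impl Fn() -> bool,
    lsm_allows: impl Fn() -> bool,
) -> bool {
    if !dac_allows() {
        return false; // LSM hook never called
    }
    lsm_allows()
}
```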

9.2.8 Integration: NVMe TLS Authentication

Section 14.4 describes NVMe over Fabrics with TLS. The key retention service provides the certificate and PSK storage:

Boot sequence for NVMe TLS:
  1. nvme_keyring module loads → creates ".nvme" keyring (KeyringKey, uid=0,
     perm=possessor:VIEW|READ|SEARCH, others:none).
  2. NVMe initiator attempts connection to a target requiring TLS.
  3. Before the TLS handshake, the initiator calls:
       request_key("asymmetric", "nvme-tls:<hostnqn>", NULL, nvme_keyring_serial)
  4. Key not found → request_key() upcall to /sbin/request-key.
  5. request-key invokes /usr/lib/nvme/nvme-key-helper, which:
       a. Reads /etc/nvme/hostkey.pem (PKCS#8 private key, ML-DSA-65 or RSA).
       b. Parses the certificate and calls:
            add_key("asymmetric", "nvme-tls:<hostnqn>", cert_der, len, nvme_keyring_serial)
  6. The kernel's AsymmetricKey.instantiate() parses the DER via crypto_alloc_akcipher(),
     stores the public key (for the peer to verify) and private key material.
  7. The target's certificate is verified against ".builtin_trusted_keys" or
     ".secondary_trusted_keys".
  8. The NVMe TLS layer calls keyctl_pkey_sign() to sign the TLS handshake
     messages using the key obtained in step 6.

Key lifetime: Keys in ".nvme" have no expiry by default (the NVMe host certificate
is valid until the certificate's notAfter date, checked by the TLS layer independently).
On NVMe controller removal, the corresponding key is unlinked from ".nvme" but persists
until its refcount drops to zero (any in-flight TLS sessions holding a reference).

9.2.9 Integration: RPCSEC_GSS (NFS Kerberos)

NFS mounts with sec=krb5 use RPCSEC_GSS to authenticate each RPC call with a Kerberos service ticket. Tickets are cached as LogonKey entries:

Per-NFS-operation key lookup:
  1. rpcsec_gss_krb5 calls:
       request_key("krb5", "nfs@<server>:<realm>", callout_info, session_keyring)
     where callout_info encodes the desired enctypes and flags.
  2. If a valid (not-expired) LogonKey exists: use its payload directly.
     The payload is a serialised krb5_creds structure (TGS reply + session key).
     Key expiry = Kerberos ticket expiry (from the ticket's endtime field).
  3. On cache miss or expiry: request_key() upcall to /sbin/request-key.
  4. request-key invokes rpc.gssd, which:
       a. Locates the user's TGT in the ccache (FILE:/tmp/krb5cc_<uid> or
          KEYRING:session:).
       b. Requests a TGS ticket for the NFS service principal.
       c. Serialises the ticket and calls:
            add_key("krb5", "nfs@<server>:<realm>", ticket_data, len, session_keyring)
  5. The LogonKey payload is set; rpcsec_gss_krb5 wakes and proceeds.
  6. rpcsec_gss_krb5 calls keyctl(KEYCTL_READ, ...) → Err(EACCES) because
     LogonKey.read() denies userspace reads. Only in-kernel callers with a
     direct Key reference can access the payload (via key.payload.lock()).

Security: The LogonKey type's read() method always returns Err(EACCES) regardless
of KeyPerm bits. This ensures Kerberos session keys cannot be extracted by userspace
even if the calling process has the key's serial number.

Cross-references:

- Section 8.2 (08-security.md): .builtin_trusted_keys and .secondary_trusted_keys keyrings used in verified boot
- Section 8.3 (08-security.md): TPM-bound EncryptedKey entries; TPM storage root key seals payloads
- Section 8.4 (08-security.md): IMA uses .ima_mok keyring for measurement policy key verification
- Section 8.7 (08-security.md): LSM hooks gate all key operations
- Section 9.1: AsymmetricKey type uses AkCipherOps from the Crypto API
- Section 14.4 (14-storage.md): NVMe TLS uses .nvme keyring for host certificate storage


9.3 Seccomp-BPF Syscall Filter

Seccomp is the kernel's last line of defense for syscall sandboxing. It operates after all other security checks (capabilities, LSM, DAC/MAC) and provides a programmable per-task filter that can restrict which syscalls a task is allowed to make. UmkaOS implements full Linux seccomp compatibility — libseccomp, systemd seccomp profiles, Docker/containerd seccomp JSON, and Chrome's sandbox all work without modification. The UmkaOS-native improvement is JIT compilation of filters at install time, reducing per-syscall overhead from ~50-200 ns (Linux interpreted BPF) to 2-5 ns (native code).

9.3.1 Entry Points

Seccomp state is set through two interfaces: the legacy prctl(2) interface and the preferred seccomp(2) syscall (x86-64 syscall number 317). Both are fully supported.

prctl(2) interface (legacy, Linux compatible):

prctl(PR_SET_SECCOMP, SECCOMP_MODE_DISABLED, 0, 0, 0)
    → EINVAL (cannot revert to disabled once seccomp is active)

prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0)
    → 0 on success; restricts task to read/write/exit/sigreturn only

prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, uaddr, 0, 0)
    → 0 on success; installs BPF filter from struct sock_fprog at uaddr

The prctl interface is equivalent to seccomp(SECCOMP_SET_MODE_STRICT, 0, NULL) and seccomp(SECCOMP_SET_MODE_FILTER, 0, uaddr) respectively. It is provided for backward compatibility. New code should use the seccomp(2) syscall.

seccomp(2) syscall (preferred):

long seccomp(unsigned int operation, unsigned int flags, void *args);

Syscall numbers:

- x86-64: 317
- AArch64: 277
- ARMv7: 383
- RISC-V 64: 277
- All other arches: follow Linux's syscall table for the respective architecture

The seccomp(2) syscall is the preferred interface because it exposes flags that prctl cannot pass and supports operations beyond mode setting.

9.3.2 seccomp() Operations

SECCOMP_SET_MODE_STRICT (operation = 0)

Equivalent to prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT). The flags argument must be 0 and args must be NULL; any other value returns EINVAL.

After this call, the task is restricted to five allowed syscalls: read, write, exit, exit_group, and rt_sigreturn. Any other syscall results in SIGKILL of the calling thread. This mode cannot be undone.

SECCOMP_SET_MODE_FILTER (operation = 1)

Install a cBPF filter program. The args pointer must point to a struct sock_fprog:

struct sock_fprog {
    unsigned short len;       /* number of filter instructions */
    struct sock_filter *filter; /* pointer to filter instructions */
};

The following flags are supported:

| Flag | Value | Meaning |
|------|-------|---------|
| SECCOMP_FILTER_FLAG_TSYNC | 0x1 | Synchronise filter to all threads of the process |
| SECCOMP_FILTER_FLAG_LOG | 0x2 | Log all allowed syscalls (even those not matching a log action) |
| SECCOMP_FILTER_FLAG_SPEC_ALLOW | 0x4 | Disable Spectre-class mitigations in the syscall path for this task |
| SECCOMP_FILTER_FLAG_NEW_LISTENER | 0x8 | Return a file descriptor for userspace notifications |
| SECCOMP_FILTER_FLAG_NOTIFY_ADDFD | 0x20 | Allow SECCOMP_IOCTL_NOTIF_ADDFD (companion to NEW_LISTENER) |

When SECCOMP_FILTER_FLAG_NEW_LISTENER is set, the return value on success is a file descriptor (not 0). This fd is epoll-able and readable; it delivers seccomp_notif structures when the filter returns SECCOMP_RET_USER_NOTIF.

When SECCOMP_FILTER_FLAG_TSYNC is set, the new filter is installed atomically on all threads of the calling process. If any thread has a more restrictive mode than the caller, the operation fails with ESRCH. The installation is all-or-nothing: if it fails for any thread, no thread is updated.

SECCOMP_GET_ACTION_AVAIL (operation = 2)

Checks whether a specific action code (passed via args as unsigned int *) is supported by the kernel. Returns 0 if supported, EOPNOTSUPP if not. UmkaOS supports all Linux-defined actions: Kill, KillProcess, Trap, Errno, Trace, Log, Allow, and Notify.

SECCOMP_GET_NOTIF_SIZES (operation = 3)

Fills in a struct seccomp_notif_sizes at the args pointer with the sizes of the notification structures:

struct seccomp_notif_sizes {
    __u16 seccomp_notif;       /* sizeof(struct seccomp_notif) */
    __u16 seccomp_notif_resp;  /* sizeof(struct seccomp_notif_resp) */
    __u16 seccomp_data;        /* sizeof(struct seccomp_data) */
};

These sizes are stable across UmkaOS versions (same as Linux kernel 5.0+). Userspace libraries (libseccomp, gVisor) use this to verify ABI compatibility before using the notification interface.

9.3.3 seccomp_data Struct (BPF Program Input)

Every time a BPF filter is evaluated, it receives a pointer to a seccomp_data struct describing the syscall. This struct's layout is an ABI — it must match Linux exactly because BPF programs compiled by libseccomp or by userspace policy engines reference specific offsets into this struct.

/// Input passed to the seccomp BPF filter. Layout is ABI-stable and matches
/// Linux's `struct seccomp_data` exactly (do not reorder or add fields).
#[repr(C)]
pub struct SeccompData {
    /// Syscall number (architecture-specific; matches the `arch` field).
    pub nr: i32,
    /// AUDIT_ARCH_* value identifying the calling ABI.
    /// On x86-64: AUDIT_ARCH_X86_64 (0xC000003E).
    /// On AArch64: AUDIT_ARCH_AARCH64 (0xC00000B7).
    /// On ARMv7 compat: AUDIT_ARCH_ARM (0x40000028).
    /// On RISC-V 64: AUDIT_ARCH_RISCV64 (0xC00000F3).
    pub arch: u32,
    /// Instruction pointer of the syscall instruction (not the return address).
    pub instruction_pointer: u64,
    /// Syscall arguments (up to 6). Arguments beyond the syscall's defined
    /// argument count are zero-filled.
    pub args: [u64; 6],
}

The arch field is filled with the task's current ABI identifier — not the kernel's native architecture. On x86-64 UmkaOS, a task running in 32-bit compatibility mode (ia32) receives AUDIT_ARCH_I386. On AArch64 UmkaOS, a 32-bit ARMv7 task running in AArch32 EL0 compat mode receives AUDIT_ARCH_ARM. This allows seccomp filters to correctly restrict 32-bit syscall numbers on 64-bit kernels.

UmkaOS note: SeccompData is allocated on the kernel stack of the syscall entry path. It is populated before the filter chain is called and discarded afterward. It is never heap-allocated and never escapes the syscall entry frame.

9.3.4 BPF Wire Format

Seccomp filters are specified using classic BPF (cBPF), not eBPF. This is the same BPF dialect used by SO_ATTACH_FILTER (socket filters). The BPF program is specified as an array of sock_filter instructions:

struct sock_filter {
    __u16 code;   /* instruction opcode */
    __u8  jt;     /* jump-if-true offset */
    __u8  jf;     /* jump-if-false offset */
    __u32 k;      /* generic multiuse field */
};

The filter is delivered from userspace as a sock_fprog:

struct sock_fprog {
    unsigned short      len;    /* number of sock_filter instructions */
    struct sock_filter *filter; /* pointer to instructions */
};

Constraints (identical to Linux):

- Maximum filter length: 4096 instructions (BPF_MAXINSNS)
- Maximum filter chain depth: 512 filters (MAX_SECCOMP_FILTER_DEPTH)
- A filter must end with a BPF_RET instruction
- Load instructions may only access seccomp_data fields (by validated offset)
- Absolute memory loads (BPF_LD | BPF_ABS) are the primary access mechanism: BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)) loads the syscall number into accumulator A
- Scratch memory (16 × u32, M[0..15]) is available
- No packet memory, no pointer arithmetic beyond scratch memory

Validation (UmkaOS implementation): UmkaOS validates the cBPF program using the same rules as Linux's bpf_check_classic():

- All branch targets must be forward-only (no backward jumps) and within bounds
- The program must terminate (no cycles possible with forward-only branches)
- All loads are range-checked against sizeof(SeccompData)
- Division-by-zero constants in BPF_ALU | BPF_DIV | BPF_K are rejected
- All instruction opcodes must be valid cBPF opcodes

If validation fails, SECCOMP_SET_MODE_FILTER returns EINVAL and the task's seccomp state is unchanged.

After validation: UmkaOS JIT-compiles the validated cBPF to native machine code. The original sock_filter array is retained in CompiledFilter.bpf for audit and debugging purposes but is not used during syscall dispatch.

9.3.5 Return Values (Actions)

Each BPF filter program returns a 32-bit value. The top 16 bits encode the action; the bottom 16 bits are action-specific data. The actions are evaluated in priority order: when a filter chain has multiple filters, the highest-priority action among all filters' return values wins.

/// Seccomp filter return actions, in descending priority order.
/// The numeric values match Linux's SECCOMP_RET_* constants exactly.
#[repr(u32)]
pub enum SeccompAction {
    /// Kill the entire process (all threads). No signal is delivered;
    /// the process exits with a core dump if core dumps are enabled.
    KillProcess = 0x80000000,

    /// Kill the calling thread only. SIGKILL is sent to the thread.
    /// No signal handler runs; no userspace notification.
    Kill = 0x00000000,

    /// Deliver SIGSYS to the thread. The siginfo_t is filled with:
    ///   si_signo = SIGSYS
    ///   si_code  = SYS_SECCOMP (1)
    ///   si_call_addr = seccomp_data.instruction_pointer
    ///   si_syscall   = seccomp_data.nr
    ///   si_arch      = seccomp_data.arch
    Trap = 0x00030000,

    /// Return -errno to the calling task. The syscall is not executed.
    /// The low 16 bits are the errno value (must be in 1..=65535).
    Errno(u16) /* = 0x00050000 | (errno & 0xffff) */,

    /// Notify a registered userspace supervisor via a listener fd.
    /// The task is suspended until the supervisor responds.
    /// The listener fd is obtained via SECCOMP_FILTER_FLAG_NEW_LISTENER.
    Notify = 0x7fc00000,

    /// Notify a ptrace(2) tracer. The low 16 bits are the tracer message
    /// (accessible via PTRACE_GETEVENTMSG). If no tracer is attached,
    /// ENOSYS is returned to the task.
    Trace(u16) /* = 0x7ff00000 | (id & 0xffff) */,

    /// Log the syscall to the audit log, then allow it to proceed.
    /// Respects the rate limit (max 10 log entries per second per task).
    Log = 0x7ffc0000,

    /// Allow the syscall to proceed. No logging, no overhead.
    Allow = 0x7fff0000,
}

Priority order (highest to lowest):

  1. KillProcess (0x80000000)
  2. Kill (0x00000000)
  3. Trap (0x00030000)
  4. Errno (0x00050000 | errno)
  5. Notify (0x7fc00000)
  6. Trace (0x7ff00000 | id)
  7. Log (0x7ffc0000)
  8. Allow (0x7fff0000)

When a filter chain contains N filters, each filter is evaluated independently. The returned values are collected, and the highest-priority action is applied. An inner (more recently installed) filter's Allow cannot override an outer filter's Errno — the outer filter's stricter action prevails.

This priority ordering is identical to Linux's (the numerically smaller action value wins, with KillProcess treated as highest despite its top bit) and allows composing filters safely: a library can install an inner filter that allows its needed syscalls without being able to override the outer filter's restrictions.

9.3.6 Filter Chain Data Structures

/// A single installed seccomp filter (immutable after creation).
///
/// Filters form a singly-linked chain: each filter holds an optional Arc
/// to its parent (the filter that was active at the time of installation).
/// The chain grows toward older filters; the innermost (most recently
/// installed) filter is the head.
pub struct SeccompFilter {
    /// JIT-compiled native code, or interpreted cBPF on arches without JIT.
    code: Arc<CompiledFilter>,
    /// The parent filter in the chain. None if this is the first filter.
    parent: Option<Arc<SeccompFilter>>,
    /// If true, log allowed syscalls regardless of the filter program's action.
    /// Set by SECCOMP_FILTER_FLAG_LOG.
    log: bool,
    /// If true, disable Spectre-class mitigations for this task's syscall path.
    /// Set by SECCOMP_FILTER_FLAG_SPEC_ALLOW.
    allow_spec: bool,
    /// Listener fd for SECCOMP_RET_USER_NOTIF delivery. Present only when
    /// the filter was installed with SECCOMP_FILTER_FLAG_NEW_LISTENER.
    notif_fd: Option<Arc<SeccompNotifFd>>,
}

/// JIT-compiled (or bytecode-retained) seccomp filter program.
pub struct CompiledFilter {
    /// Executable region containing native machine code. The region is
    /// write-protected (W^X) after compilation: it is writable during JIT
    /// output, then the write permission is revoked before first use.
    code: ExecutableRegion,
    /// Original cBPF instructions retained for audit and debugging.
    /// Not used during syscall dispatch in release builds.
    bpf: Vec<SockFilter>,
    /// Number of cBPF instructions in `bpf`.
    bpf_len: u16,
}

/// Per-task seccomp state, embedded in `Task`.
pub struct SeccompState {
    /// Innermost (most recently installed) filter in the chain.
    /// None if the task has not installed any filter.
    filter: Option<Arc<SeccompFilter>>,
    /// Current seccomp mode.
    mode: SeccompMode,
    /// Whether PR_SET_NO_NEW_PRIVS has been set for this task.
    /// Required before installing a filter without CAP_SYS_ADMIN.
    no_new_privs: bool,
}

/// Seccomp operating mode.
#[repr(u8)]
pub enum SeccompMode {
    /// No seccomp active. Syscall path skips all filter evaluation.
    Disabled = 0,
    /// Strict mode: only read/write/exit/rt_sigreturn are allowed.
    Strict   = 1,
    /// Filter mode: BPF filter chain is evaluated on every syscall.
    Filter   = 2,
}

SeccompState is embedded directly in the Task struct. Accessing the mode field on the syscall fast path requires no pointer indirection beyond the Task pointer that is already in a CpuLocal register. The mode check is a single byte comparison.

9.3.7 Filter Installation Algorithm (SECCOMP_SET_MODE_FILTER)

The installation path is per-task and requires no global locks. The BPF verifier and JIT compiler are purely local operations on data provided by the caller.

fn seccomp_set_mode_filter(task: &mut Task, flags: u32, fprog: UserPtr<SockFprog>) -> Result<i64>

Step 1 — Privilege check:

The caller must satisfy at least one of:

- task.seccomp.no_new_privs == true (set by prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)), OR
- task.creds.has_cap(CAP_SYS_ADMIN) in the task's own user namespace

If neither condition is satisfied, return EPERM.

Step 2 — Mode transition check:

Seccomp mode is write-once-increasing:

- If task.seccomp.mode == SeccompMode::Strict and SECCOMP_SET_MODE_FILTER is requested, return EINVAL. A task in strict mode cannot install a BPF filter.
- SECCOMP_SET_MODE_STRICT is accepted from Disabled or Strict mode (idempotent in strict mode) but returns EINVAL from Filter mode, per the transition table in Section 9.3.12.

Step 3 — Parse and copy sock_fprog from userspace:

Copy sock_fprog from userspace (validated pointer, no kernel alias). Check:

- len is in 1..=4096 (inclusive); reject 0 (empty program) and anything above BPF_MAXINSNS
- Copy len × sizeof(sock_filter) bytes from the filter pointer

Return EFAULT if any copy-from-user fails. Return EINVAL if len is out of range.

Step 4 — Validate cBPF program:

Run the cBPF validator (same rules as Linux's bpf_check_classic):

- All branch targets are forward-only and within bounds
- All load offsets are within sizeof(SeccompData)
- Division and modulo by a zero constant are rejected
- The last instruction is BPF_RET
- All opcodes are valid cBPF opcodes

Return EINVAL if validation fails.

Step 5 — Check filter chain depth:

Count the current depth of task.seccomp.filter by walking the parent links. If depth ≥ 512 (MAX_SECCOMP_FILTER_DEPTH), return E2BIG.

Step 6 — JIT compile:

let compiled = jit_compile_cbpf(&bpf_insns, target_arch)?;
// compiled.code is now write-protected and executable

On architectures with JIT support (x86-64, AArch64), this produces native code. On architectures without JIT (currently: ARMv7, RISC-V, PPC32, PPC64LE), the validator output is stored verbatim and the interpreter is used at runtime. JIT support for additional architectures is added as the architecture ports mature; the installation path is identical regardless.

If JIT compilation fails (out of memory for the executable region), return ENOMEM.

Step 7 — Construct SeccompFilter:

let new_filter = Arc::new(SeccompFilter {
    code: Arc::new(compiled),
    parent: task.seccomp.filter.clone(),  // Arc::clone, no deep copy
    log: flags & SECCOMP_FILTER_FLAG_LOG != 0,
    allow_spec: flags & SECCOMP_FILTER_FLAG_SPEC_ALLOW != 0,
    notif_fd: if flags & SECCOMP_FILTER_FLAG_NEW_LISTENER != 0 {
        Some(Arc::new(SeccompNotifFd::new()))
    } else {
        None
    },
});

Step 8 — Atomically install:

task.seccomp.filter = Some(new_filter.clone());
task.seccomp.mode = SeccompMode::Filter;

This is not a compare-and-swap because filter installation is a single-threaded operation on the calling task's own SeccompState. Thread synchronisation is handled in step 9 (TSYNC) if requested.

Step 9 — TSYNC (if SECCOMP_FILTER_FLAG_TSYNC):

Iterate all threads of the process (via the thread group list):

- For each thread, check that the thread's current filter chain is a prefix of the new filter's chain (i.e., the thread has not installed filters the caller does not have). If any thread's chain is incompatible, return ESRCH and roll back all changes.
- If all threads are compatible, install new_filter on each thread via the same Arc::clone assignment. This is done under each thread's task lock; if any thread exits during the operation, it is skipped (already exiting).

Step 10 — Return value:

  • If SECCOMP_FILTER_FLAG_NEW_LISTENER was set: return the listener fd number (≥0)
  • Otherwise: return 0

9.3.8 Syscall Interception Path

The seccomp check is inserted at the syscall entry point, after register saving and argument marshalling, before dispatch to the syscall handler. It is architecture-specific in its placement (each arch's entry.S/entry.rs calls seccomp_check_syscall) but the check logic is shared.

/// Called from syscall entry with interrupts enabled, preemption disabled.
/// Returns Ok(()) to proceed with the syscall, or Err(action) to handle.
#[inline(always)]
pub fn seccomp_check_syscall(task: &Task, data: &SeccompData) -> SeccompVerdict {
    match task.seccomp.mode {
        SeccompMode::Disabled => SeccompVerdict::Allow,  // predicted branch, ~0 cycles
        SeccompMode::Strict   => seccomp_strict_check(data.nr),
        SeccompMode::Filter   => seccomp_filter_check(task, data),
    }
}

Mode::Disabled fast path:

Because most processes never enable seccomp, the branch predictor learns to fall straight through the Disabled case on the fast path. The mode byte is the first field of SeccompState, co-located with the Task struct fields accessed on the syscall entry path. On x86-64 the check is a single cmp byte [task + offset], 0 / jne pair.

Mode::Strict check:

fn seccomp_strict_check(nr: i32) -> SeccompVerdict {
    match nr as u32 {
        SYS_READ | SYS_WRITE | SYS_EXIT | SYS_EXIT_GROUP | SYS_RT_SIGRETURN => {
            SeccompVerdict::Allow
        }
        _ => SeccompVerdict::Kill,
    }
}

Mode::Filter — filter chain evaluation:

fn seccomp_filter_check(task: &Task, data: &SeccompData) -> SeccompVerdict {
    let mut result: u32 = SECCOMP_RET_ALLOW;  // lowest priority: allow

    // Walk the filter chain from innermost to outermost, collecting return values.
    let mut filter_opt = task.seccomp.filter.as_ref();
    while let Some(filter) = filter_opt {
        // Call the JIT-compiled (or interpreted) filter function.
        // JIT signature: extern "C" fn(*const SeccompData) -> u32
        let ret = filter.code.call(data);
        // Take the highest-priority action seen so far.
        if seccomp_action_priority(ret) > seccomp_action_priority(result) {
            result = ret;
        }
        filter_opt = filter.parent.as_ref();
    }

    seccomp_verdict_from_action(result, task)
}

The chain is walked innermost-first (most recently installed filter runs first). This matches Linux behaviour: inner filters can only further restrict, not expand.

Verdict dispatch:

After seccomp_filter_check returns a SeccompVerdict, the syscall entry path handles the verdict:

| Verdict | Action |
|---------|--------|
| Allow | Proceed to syscall handler |
| Log | Write audit log entry, then proceed |
| Errno(e) | Return -e to userspace, skip syscall |
| Trap | Fill siginfo_t, deliver SIGSYS to task |
| Trace(id) | Notify ptrace tracer via ptrace_event, suspend task |
| Notify | Enqueue to SeccompNotifFd, suspend task, await response |
| Kill | Send SIGKILL to thread, do not return to userspace |
| KillProcess | Send SIGKILL to all threads, do not return to userspace |

9.3.9 JIT Compilation

UmkaOS JIT-compiles cBPF seccomp filters to native machine code at install time. The JIT is invoked from seccomp_set_mode_filter (Section 9.3.7, Step 6) and produces W^X-protected executable code.

Why JIT matters:

A cBPF filter of 30-100 instructions is common for realistic seccomp policies (e.g., systemd service filters, Docker default profiles). With interpretation, the dispatch loop over those instructions costs roughly 50-200 ns per syscall in total. After JIT, the same filter runs in 2-5 ns. For workloads that make frequent syscalls (high-throughput servers, container runtimes), this saves tens of microseconds per second per thread.

JIT properties:

| Property | Value |
|----------|-------|
| Input | Validated cBPF (after bpf_check_classic) |
| Output | Native machine code for the running architecture |
| Average expansion (x86-64) | ~3 native instructions per cBPF instruction |
| Average expansion (AArch64) | ~4 native instructions per cBPF instruction |
| Memory protection | W^X: write permission removed after compilation |
| Code cache | Per CompiledFilter instance (not shared across tasks) |
| JIT availability | x86-64, AArch64 (at launch); ARMv7, RISC-V, PPC added as arch ports mature |
| Fallback | cBPF interpreted via cbpf_interpret() on arches without JIT |

JIT output calling convention:

/// JIT-compiled filter function signature. Called from seccomp_filter_check.
/// The function pointer is stored in CompiledFilter.code.
/// Calling convention: System V AMD64 ABI (x86-64), AAPCS64 (AArch64).
type FilterFn = extern "C" fn(data: *const SeccompData) -> u32;

The JIT allocates an ExecutableRegion:

pub struct ExecutableRegion {
    /// Base of the mmap'd region (initially RW, then remapped RX after JIT).
    base: NonNull<u8>,
    /// Length of the region in bytes.
    len: usize,
    /// The callable function pointer, typed for safety.
    fn_ptr: FilterFn,
}

impl ExecutableRegion {
    /// Remap the region from RW to RX (W^X enforcement).
    /// Called once after JIT output is complete.
    pub fn seal(&mut self) -> Result<(), KernelError>;

    /// Call the compiled filter with the given seccomp_data.
    #[inline(always)]
    pub fn call(&self, data: &SeccompData) -> u32 {
        // SAFETY: fn_ptr is valid native code produced by the verified JIT,
        // protected RX. data is a valid reference for the call duration.
        unsafe { (self.fn_ptr)(data as *const SeccompData) }
    }
}

/proc/sys/net/core/bpf_jit_enable:

UmkaOS always JIT-compiles seccomp filters on supported architectures regardless of this sysctl value. The sysctl is accepted for compatibility with tools that check or set it, and value 2 enables diagnostic JIT output (dumps generated native code to the kernel log at KERN_DEBUG level). The sysctl does not affect seccomp JIT behaviour: seccomp JIT is not user-configurable and cannot be disabled.

9.3.10 Userspace Notification (SECCOMP_USER_NOTIF)

The userspace notification interface (introduced in Linux 5.0) allows a privileged supervisor process to intercept syscalls made by a sandboxed process, inspect them, and inject an arbitrary return value. This is used by container runtimes (e.g., gVisor's runsc, sysbox) to handle syscalls that are safe in a supervised context but would otherwise be blocked by a seccomp filter.

Data Structures

/// Notification fd: the supervisor's handle for receiving and responding to
/// seccomp notifications from a sandboxed task.
pub struct SeccompNotifFd {
    /// Lock-free ring buffer for pending notifications (fast path).
    /// Capacity 256: fits typical burst of notifications before supervisor wakes.
    queue: SpscRing<SeccompNotif, 256>,
    /// Map from notification id to the suspended task.
    /// Consulted when the supervisor sends a response.
    pending: Mutex<HashMap<u64, Arc<SuspendedTask>>>,
    /// Wait queue: the supervisor blocks here when the queue is empty.
    waiters: WaitQueue,
    /// Monotonically increasing id generator. Each notification gets a unique id.
    next_id: AtomicU64,
}

/// A pending notification sent to the supervisor.
/// Layout matches Linux's `struct seccomp_notif` exactly (ABI stable).
#[repr(C)]
pub struct SeccompNotif {
    /// Unique notification ID. Used to correlate with the response.
    pub id: u64,
    /// PID of the sandboxed task making the syscall.
    pub pid: u32,
    /// Reserved flags (always 0 in current version).
    pub flags: u32,
    /// Syscall information (identical to what the BPF program receives).
    pub data: SeccompData,
}

/// Response from the supervisor to the suspended task.
/// Layout matches Linux's `struct seccomp_notif_resp` exactly.
#[repr(C)]
pub struct SeccompNotifResp {
    /// Must match the `id` field of the corresponding `SeccompNotif`.
    pub id: u64,
    /// Return value to inject (used when `error == 0`).
    pub val: i64,
    /// Errno to return (if non-zero, `val` is ignored and `-error` is returned).
    pub error: i32,
    /// Flags: SECCOMP_USER_NOTIF_FLAG_CONTINUE (0x1) = execute the real syscall.
    pub flags: u32,
}

/// A task suspended waiting for a supervisor response.
pub struct SuspendedTask {
    /// Waker to resume the task once a response arrives.
    waker: TaskWaker,
    /// Response slot: filled by the supervisor, read by the task on wakeup.
    response: Option<SeccompNotifResp>,
}

UmkaOS improvement: SeccompNotifFd.queue is a lock-free SpscRing<SeccompNotif, 256> (single-producer, single-consumer). The sandboxed task pushes notifications; the supervisor pops them. This is lock-free on the fast path; the pending map (for response correlation) is only touched under a Mutex on the slower supervisor-response path.
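The kernel's SpscRing is Rust; as an illustrative model of the single-producer/single-consumer discipline it relies on, here is a C11 sketch (only the capacity of 256 comes from the struct above; names and the payload type are hypothetical):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_CAP 256  /* power of two: indices wrap via masking */

typedef struct {
    uint64_t slots[RING_CAP];
    _Atomic uint32_t head;  /* advanced only by the consumer (supervisor) */
    _Atomic uint32_t tail;  /* advanced only by the producer (sandboxed task) */
} SpscRing;

/* Producer side: enqueue a notification. Lock-free because only one
   thread ever writes `tail`, so a plain load/store pair suffices. */
bool ring_push(SpscRing *r, uint64_t v) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAP) return false;            /* full */
    r->slots[tail & (RING_CAP - 1)] = v;
    /* Release: the slot write becomes visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: dequeue a notification. */
bool ring_pop(SpscRing *r, uint64_t *out) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail) return false;                       /* empty */
    *out = r->slots[head & (RING_CAP - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

The acquire/release pairing is the whole trick: each side owns one index, so neither ever needs a compare-and-swap, which is why the notification fast path stays lock-free.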

ioctls on the Listener fd

The listener fd returned by SECCOMP_FILTER_FLAG_NEW_LISTENER supports the following ioctls (all defined in <linux/seccomp.h>):

SECCOMP_IOCTL_NOTIF_RECV (_IOWR(SECCOMP_IOC_MAGIC, 0, struct seccomp_notif)):

Dequeues one pending notification. Blocks if the queue is empty; with O_NONBLOCK set, it returns EAGAIN instead. On return, fills the userspace seccomp_notif struct with the notification. The sandboxed task remains suspended until a response is sent.

SECCOMP_IOCTL_NOTIF_SEND (_IOWR(SECCOMP_IOC_MAGIC, 1, struct seccomp_notif_resp)):

Sends a response to a suspended notification. The id field must match a currently suspended task. If flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE, the original syscall is executed (the filter's NOTIFY action is overridden). Otherwise, val or -error is returned to the task.

Returns ENOENT if the id is not found (task already timed out or was killed). Returns EINPROGRESS if a response for this id has already been sent.

SECCOMP_IOCTL_NOTIF_ID_VALID (_IOW(SECCOMP_IOC_MAGIC, 2, __u64)):

Checks whether the notification with the given id is still valid (the suspended task is still alive and waiting). Returns 0 if valid, ENOENT if not. The supervisor uses this to detect tasks that exited while the supervisor was processing the notification.

SECCOMP_IOCTL_NOTIF_ADDFD (_IOW(SECCOMP_IOC_MAGIC, 3, struct seccomp_notif_addfd)):

Installs a file descriptor into the suspended task's file descriptor table. The seccomp_notif_addfd struct specifies the fd to install and an optional target fd number (or -1 for the lowest available). This enables the supervisor to inject fds (e.g., a socket connected to a local service) into the sandboxed process, allowing "fake" syscall implementations that return a real, usable fd.

Requires SECCOMP_FILTER_FLAG_NOTIFY_ADDFD to have been set at filter installation time; returns EINVAL otherwise.

Notification Lifecycle

Sandboxed task:
  1. Makes syscall; filter returns SECCOMP_RET_USER_NOTIF.
  2. seccomp_filter_check builds SeccompNotif, enqueues to notif_fd.queue.
  3. Wakes any waiting supervisor (notif_fd.waiters.wake_one()).
  4. Registers self in notif_fd.pending[id] = SuspendedTask { waker, response: None }.
  5. Suspends (yields CPU; preemptible).

Supervisor process:
  6. Wakes on notif_fd (epoll, read, select, or blocking ioctl).
  7. SECCOMP_IOCTL_NOTIF_RECV: dequeues SeccompNotif, inspects data.
  8. (Optionally) SECCOMP_IOCTL_NOTIF_ID_VALID: verifies task still alive.
  9. (Optionally) SECCOMP_IOCTL_NOTIF_ADDFD: installs fd into task.
 10. SECCOMP_IOCTL_NOTIF_SEND: fills SuspendedTask.response, calls waker.

Sandboxed task (resumed):
 11. Reads SuspendedTask.response from own task struct (no lock needed: only own task).
 12. If SECCOMP_USER_NOTIF_FLAG_CONTINUE: execute original syscall normally.
     Else: return val or -error to userspace.
 13. Removes self from notif_fd.pending.

9.3.11 SECCOMP_MODE_STRICT

Strict mode is the simplest seccomp mode and predates the BPF filter interface. It allows a task to restrict itself to an absolute minimum set of syscalls with a single prctl call. No filter program is needed; the allowlist is hard-coded in the kernel.

Allowed syscalls in strict mode:

| Syscall | Rationale |
|---------|-----------|
| read (0) | Read from existing fds |
| write (4/1) | Write to existing fds |
| exit (1/60) | Exit current thread |
| exit_group (231) | Exit all threads of process |
| rt_sigreturn (15/173) | Return from signal handler |

All other syscalls result in SIGKILL of the calling thread (not the process — the thread is killed, but sibling threads continue running unless they also make a disallowed syscall). This matches Linux's strict mode behaviour.

Note: syscall numbers differ between architectures (x86 32-bit vs x86-64 vs AArch64). UmkaOS uses the correct architecture-specific numbers for each ABI.

Use case: OpenSSH privilege-separated workers, minimal computation daemons, and processes that have already opened all needed fds and only need to read/write/exit. Strict mode is simpler to reason about than a BPF filter and cannot be misconfigured.

9.3.12 Inheritance and exec Semantics

Fork (clone(2) without CLONE_SECCOMP_NOFILTER):

The child inherits the parent's SeccompState with the filter chain shared via Arc::clone(). No bytecode is copied; the child shares the same compiled filter code:

fn fork_seccomp(parent: &Task, child: &mut Task) {
    child.seccomp = SeccompState {
        filter: parent.seccomp.filter.clone(),  // Arc refcount increment only
        mode: parent.seccomp.mode,
        no_new_privs: parent.seccomp.no_new_privs,
    };
}

Fork is O(1) for seccomp state regardless of filter chain depth. The Arc refcount ensures the compiled filter code is kept alive as long as any task references it.

Thread creation (clone(CLONE_THREAD)):

Thread creation uses the same fork_seccomp path. All threads of a process share the same filter chain by default. SECCOMP_FILTER_FLAG_TSYNC can be used after the fact to synchronise a new filter to all threads.

exec (execve(2)):

The filter chain survives exec. This is intentional and matches Linux: a process that installs a seccomp filter before exec cannot have the filter stripped by the exec'd program. The exec'd program runs under the same (or more restrictive, if the exec'd program also calls seccomp) filter chain.

fn exec_seccomp(task: &mut Task) {
    // The filter chain is preserved across exec unchanged.
    // no_new_privs is also preserved (set before exec).
    // Mode is preserved.
    // Nothing to do here; SeccompState is unchanged by exec.
}

Mode monotonicity:

The seccomp mode can only increase:

| From \ To | Disabled | Strict | Filter |
|---|---|---|---|
| Disabled | EINVAL | Allowed | Allowed |
| Strict | EINVAL | Allowed (idempotent) | EINVAL |
| Filter | EINVAL | EINVAL | Allowed (adds to chain) |

A task cannot go from Filter back to Strict or from either back to Disabled. This ensures that filters installed by sandboxing infrastructure cannot be removed by sandboxed code. The mode field is written only in seccomp_set_mode_*, which is always called on the task's own state under the task lock.
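The monotonicity table reduces to a single match expression. A sketch with illustrative names:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum SeccompMode { Disabled, Strict, Filter }

/// Returns true if the mode transition is permitted; every other
/// combination is rejected with EINVAL.
fn mode_transition_allowed(from: SeccompMode, to: SeccompMode) -> bool {
    use SeccompMode::*;
    match (from, to) {
        (Disabled, Strict) | (Disabled, Filter) => true, // first installation
        (Strict, Strict) => true,                        // idempotent
        (Filter, Filter) => true,                        // appends to the chain
        _ => false,                                      // EINVAL
    }
}
```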

no_new_privs:

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) sets SeccompState.no_new_privs = true. This is required to install a seccomp filter without CAP_SYS_ADMIN. It also has independent effects: exec'd programs do not gain privileges from set-uid/set-gid bits. The flag is inherited by children (fork and thread) and is preserved across exec. It cannot be unset.

9.3.13 Audit Logging

Seccomp events are written to the kernel audit log under two circumstances:

  1. SECCOMP_RET_LOG: The BPF filter program explicitly returns this action, which logs the syscall and then allows it to proceed.

  2. SECCOMP_FILTER_FLAG_LOG: Set at filter installation time. Every syscall that is allowed by the filter chain is also logged (regardless of which action matched).

Log entry format:

audit: type=1326 msg=audit(1708956031.442:1234): auid=1000 uid=1000 gid=1000 \
  ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 \
  pid=12345 comm="my_process" exe="/usr/bin/my_process" sig=0 \
  arch=c000003e syscall=59 compat=0 ip=0x7f1234567890 code=0x7ffc0000

Field meanings:

  • type=1326: AUDIT_SECCOMP event type (matches Linux)
  • auid: Audit UID (login UID, set by PAM)
  • uid/gid: Real UID/GID at time of syscall
  • ses: Audit session ID
  • subj: SELinux/AppArmor subject label (from LSM; empty if no LSM active)
  • pid: Task PID
  • comm: Task name (up to 15 bytes, truncated from Task.name)
  • exe: Executable path (from Task.exe_path)
  • sig: Signal delivered (0 if none; e.g., SIGSYS=31 for Trap, 9 for Kill)
  • arch: AUDIT_ARCH_* value in hex (matching SeccompData.arch)
  • syscall: Syscall number
  • compat: 1 if the task was in 32-bit compatibility mode, 0 otherwise
  • ip: Instruction pointer from SeccompData.instruction_pointer
  • code: The SECCOMP_RET_* value returned by the filter, in hex

Rate limiting:

To prevent a compromised or misbehaving task from flooding the audit log:

/// Per-task seccomp log rate limiter.
pub struct SeccompLogRateLimit {
    /// Number of log entries written in the current second.
    count: u32,
    /// Start of the current one-second window (coarse monotonic clock, seconds).
    window_start: u64,
}

const SECCOMP_LOG_RATE_LIMIT: u32 = 10;

If more than 10 events per second are generated by a single task, the excess events are dropped and a single "N events suppressed" message is written at the end of the window, analogous to the rate limiting Linux applies to its audit and kernel-log output.
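A sketch of how the limiter's window logic might work. The clock value is injected as a parameter to keep the sketch testable, and the suppression-summary emission at window rollover is elided; `should_log` is an illustrative helper, not the actual UmkaOS API:

```rust
const SECCOMP_LOG_RATE_LIMIT: u32 = 10;

pub struct SeccompLogRateLimit {
    count: u32,        // log entries written in the current window
    window_start: u64, // coarse monotonic clock, seconds
}

impl SeccompLogRateLimit {
    /// Decide whether this event is written to the audit log.
    pub fn should_log(&mut self, now_secs: u64) -> bool {
        if now_secs != self.window_start {
            // New one-second window; a "N events suppressed" summary for the
            // previous window would be emitted here (elided).
            self.window_start = now_secs;
            self.count = 0;
        }
        self.count += 1;
        self.count <= SECCOMP_LOG_RATE_LIMIT
    }
}
```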

9.3.14 /proc Integration

/proc/PID/status:

The Seccomp: field in /proc/PID/status reports the current seccomp mode:

Seccomp: 0    # Disabled
Seccomp: 1    # Strict
Seccomp: 2    # Filter

This is an ABI-stable field read by many tools (systemd, ps, audit tools). The value is a decimal integer matching SeccompMode as u8.

/proc/PID/seccomp_filter (UmkaOS extension):

UmkaOS provides a read-only file /proc/PID/seccomp_filter that reports the installed filter chain in human-readable form. This is a UmkaOS extension (not present in Linux); it is intended for debugging and introspection by administrators.

Format:

seccomp_filter: mode=filter depth=3 jit=yes
filter[0]: len=47 instructions, jit=yes, log=no, spec_allow=no, has_notif=no
filter[1]: len=12 instructions, jit=yes, log=yes, spec_allow=no, has_notif=no
filter[2]: len=8 instructions, jit=yes, log=no, spec_allow=no, has_notif=yes
  • mode: current mode (disabled/strict/filter)
  • depth: number of filters in the chain (0 for disabled/strict)
  • jit: whether all filters in the chain are JIT-compiled
  • filter[N]: innermost is filter[0], outermost is filter[depth-1]
  • len: number of cBPF instructions in this filter
  • jit: whether this specific filter is JIT-compiled
  • log: whether SECCOMP_FILTER_FLAG_LOG was set
  • spec_allow: whether SECCOMP_FILTER_FLAG_SPEC_ALLOW was set
  • has_notif: whether a userspace notification fd is associated

Access control: the file is readable by the task owner and by processes with CAP_SYS_PTRACE in the target task's user namespace. Other processes see EPERM.

The BPF bytecode itself is not exposed via /proc (it would allow filter fingerprinting by sandboxed code). Only metadata is exposed.

9.3.15 Linux Compatibility

UmkaOS's seccomp implementation is a drop-in replacement for Linux's. All of the following work without modification:

libseccomp (libseccomp2): libseccomp generates cBPF programs from a high-level policy API and installs them via seccomp(SECCOMP_SET_MODE_FILTER). The generated BPF programs pass UmkaOS's validator and run correctly. The SCMP_ACT_* action codes map directly to UmkaOS's SeccompAction values.

systemd service sandboxing: systemd's SystemCallFilter=, SystemCallArchitectures=, and related security directives use libseccomp to generate and install seccomp filters. These filters work on UmkaOS because: (a) the seccomp(2) syscall number and ABI match Linux; (b) the sock_fprog wire format is identical; (c) AUDIT_ARCH_* values match.

Docker / containerd default seccomp profile: Docker's default seccomp profile is a JSON file that containerd compiles to cBPF via libseccomp. The resulting filter is installed via seccomp(2) in the container's init process. UmkaOS handles this identically to Linux.

Chrome sandbox: Chrome's renderer processes use SECCOMP_MODE_FILTER with a multi-layered filter chain. Chrome uses SECCOMP_FILTER_FLAG_TSYNC to synchronise the filter to all threads before starting sensitive work. UmkaOS's TSYNC implementation matches Linux's semantics exactly, including returning the TID of a conflicting thread on failure (or ESRCH when SECCOMP_FILTER_FLAG_TSYNC_ESRCH is set).

gVisor (runsc): gVisor's runsc runtime uses SECCOMP_RET_USER_NOTIF to implement its syscall interception model on UmkaOS. The SeccompNotifFd interface — ioctls, epoll-ability, the seccomp_notif / seccomp_notif_resp struct layout — matches Linux 5.0+ exactly.

Seccomp syscall numbers:

| Architecture | Syscall number |
|---|---|
| x86-64 | 317 |
| AArch64 | 277 |
| ARMv7 | 383 |
| RISC-V 64 | 277 |

SECCOMP_RET_* action values:

All SECCOMP_RET_* constants used in BPF programs compiled by libseccomp or other policy compilers match UmkaOS's SeccompAction values exactly. Programs that embed action constants numerically (as most BPF program generators do) work without modification.
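For reference, the numeric values these constants must reproduce bit-for-bit come from Linux's include/uapi/linux/seccomp.h. The two helper functions are illustrative, showing how the action and data halves of a filter's 32-bit return value are separated (e.g. the errno carried by SECCOMP_RET_ERRNO lives in the low 16 bits):

```rust
// Linux UAPI SECCOMP_RET_* values (include/uapi/linux/seccomp.h).
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_KILL_THREAD:  u32 = 0x0000_0000;
const SECCOMP_RET_TRAP:         u32 = 0x0003_0000;
const SECCOMP_RET_ERRNO:        u32 = 0x0005_0000;
const SECCOMP_RET_USER_NOTIF:   u32 = 0x7fc0_0000;
const SECCOMP_RET_TRACE:        u32 = 0x7ff0_0000;
const SECCOMP_RET_LOG:          u32 = 0x7ffc_0000;
const SECCOMP_RET_ALLOW:        u32 = 0x7fff_0000;

/// The action occupies the high 16 bits; the low 16 bits carry data
/// (the errno for ERRNO, the tracer cookie for TRACE).
const SECCOMP_RET_ACTION_FULL: u32 = 0xffff_0000;
const SECCOMP_RET_DATA:        u32 = 0x0000_ffff;

fn ret_action(ret: u32) -> u32 { ret & SECCOMP_RET_ACTION_FULL }
fn ret_data(ret: u32) -> u32 { ret & SECCOMP_RET_DATA }
```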

Cross-references: - Section 8.1 (08-security.md): Capabilities required for seccomp filter installation (CAP_SYS_ADMIN path) - Section 8.7 (08-security.md): LSM hooks for seccomp — security_seccomp_filter_install and security_seccomp_check_syscall - Section 8.8 (08-security.md): Credential model; no_new_privs interacts with set-uid exec and capability bounding set - Section 6.1 (06-scheduling.md): Task struct embedding SeccompState; task suspension and wakeup for SECCOMP_RET_USER_NOTIF - Section 18.1 (18-compat.md): Syscall dispatch table; seccomp check is inserted before dispatch in the compatibility layer as well as the native path


9.4 ARM Memory Tagging Extension (MTE)

ARM's Memory Tagging Extension is a hardware security capability available from ARMv8.5-A (AArch64 only). It provides automatic, hardware-enforced detection of heap use-after-free and heap/stack buffer overflows with near-zero runtime overhead in production. x86-64 has no hardware equivalent — the closest software analogues (ASAN, Valgrind) impose 2-10x slowdowns and are not suitable for production deployment. On ARM platforms, MTE is UmkaOS's preferred first-line mitigation against memory corruption attacks on Tier 1 drivers and userspace processes alike.

Linux added MTE support in kernel 5.10. UmkaOS's MTE implementation is ABI-compatible with Linux 5.10+ and is verified against the ARM Architecture Reference Manual DDI0487 (ARMv8.5-A section D8 "The Memory Tagging Extension").

9.4.1 MTE Overview and Architecture Coverage

Hardware mechanism:

Every 16-byte aligned granule of physical memory backed by Normal-Tagged memory pages has a 4-bit allocation tag stored in separate tag memory. This tag memory is transparent to normal loads and stores — programs that do not use MTE see no change in behaviour. When MTE is active, the processor compares the logical tag embedded in the top byte of the virtual address (bits 59:56 under TBI — Top Byte Ignore mode) against the stored allocation tag on every memory access. A mismatch either faults immediately (sync mode) or sets a sticky fault flag (async mode), depending on the configured Tag Check Fault mode.

The tag memory overhead is exactly 1 bit per 2 bytes of addressable memory (4 bits per 16-byte granule), equating to approximately 3.125% of physical memory for Normal-Tagged regions. Only memory mapped with PROT_MTE incurs this overhead; untagged regions consume no tag storage. The kernel manages tag memory as part of each physical page's metadata via a separate tag page table level; from software's perspective tag storage is addressed by logical address, not separately mapped.

Feature levels:

| Feature | Minimum architecture | What it adds |
|---|---|---|
| FEAT_MTE | ARMv8.5-A (optional from ARMv8.4) | Basic allocation tag storage, IRG/STG/LDG/ADDG/SUBG instructions, sync TCF mode |
| FEAT_MTE2 | ARMv8.5-A (optional from ARMv8.4) | Async TCF mode (TFSR_EL1/TFSRE0_EL1 registers), MTE_TCF_ASYNC usable |
| FEAT_MTE3 | ARMv8.7-A / ARMv9.0+ | Asymmetric TCF mode (sync load, async store), MTE_TCF_ASYMM |

Feature presence is detected at boot via ID_AA64PFR1_EL1.MTE:

  • Value 0: MTE not implemented
  • Value 1: FEAT_MTE only (sync mode, no async)
  • Value 2: FEAT_MTE2 (sync + async modes)
  • Value 3: FEAT_MTE3 (all three modes including asymmetric)

Chip availability (verified as of 2025):

MTE is common in ARMv9 consumer SoCs but optional in server microarchitectures — Neoverse N2 and V2 notably omit it (as do AWS Graviton 4, Azure Cobalt 100, Google Axion). Representative implementations WITH MTE: the ARMv9 Cortex cores (Cortex-A510, Cortex-A710, Cortex-A715, Cortex-X2 and successors); Qualcomm Snapdragon 8 Gen 1 and later; MediaTek Dimensity 9000 and later; Samsung Exynos 2200 and later. (Cortex-X1, Cortex-A78, and Neoverse V1 are ARMv8.2/v8.4 designs and do not implement MTE.)

UmkaOS detects MTE availability at boot from ID_AA64PFR1_EL1.MTE and conditionally enables MTE support. No assumptions are hardcoded about which CPU models have MTE.
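The boot-time decode is a simple field extraction: the MTE field occupies bits [11:8] of ID_AA64PFR1_EL1. A minimal sketch (function name illustrative):

```rust
/// Decode the MTE feature level from a raw ID_AA64PFR1_EL1 value.
/// The MTE field occupies bits [11:8]: 0 = none, 1 = FEAT_MTE,
/// 2 = FEAT_MTE2 (adds async), 3 = FEAT_MTE3 (adds asymmetric).
fn mte_feature_level(id_aa64pfr1_el1: u64) -> u8 {
    ((id_aa64pfr1_el1 >> 8) & 0xF) as u8
}
```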

Kernel-side tag check override (PSTATE.TCO):

The PSTATE.TCO (Tag Check Override) bit suppresses tag check faults for the current execution context when set to 1. The kernel always enters EL1 with TCO=1, meaning the kernel itself never faults on tag mismatches during EL1 code paths. On every return to EL0, TCO is cleared to 0, restoring tag checking for userspace. This is the correct and only safe design: the kernel must be able to access user memory (e.g. during read(2), copy_to_user()) even when the user has MTE enabled with conflicting tags. Kernel EL1 code paths use the TCR_EL1.TCMA1=1 configuration (match-all for tag 0b1111) as a controlled exception for specific kernel pointer uses — this is a known limitation noted in Section 9.4.7.

9.4.2 MTE Modes (SYNC / ASYNC / ASYMM)

MTE supports three tag check fault (TCF) modes, selectable per-thread via prctl(2):

SYNC mode (MTE_TCF_SYNC, PR_MTE_TCF_SYNC):

A tag mismatch raises a synchronous fault on the faulting instruction before any results are architecturally committed. The kernel delivers SIGSEGV with si_code = SEGV_MTESERR and si_addr set to the exact faulting address. The memory access is not performed.

SYNC mode is the strongest mitigation: the faulting address is precise, no speculative results are committed, and there is no window for an attacker to observe the results of an invalid access. SYNC mode imposes a measurable hardware overhead (instruction pipeline effects) of approximately 1-3% on memory-intensive workloads.

ASYNC mode (MTE_TCF_ASYNC, PR_MTE_TCF_ASYNC):

A tag mismatch sets the TFSRE0_EL1 (Tag Fault Status Register, EL0) sticky bit without immediately faulting. Execution continues. On the next kernel entry (syscall, interrupt, exception), the kernel checks TFSRE0_EL1; if set, it clears the register and delivers SIGSEGV with si_code = SEGV_MTEAERR and si_addr = 0 (faulting address is not available in async mode). Requires FEAT_MTE2.

ASYNC mode overhead is approximately 0-1% on production workloads — essentially immeasurable in most benchmarks. The trade-off is imprecise fault delivery: the actual faulting instruction may have already completed and the CPU may have speculated beyond it before the fault is reported. This makes async mode unsuitable for precise debugging but excellent for production crash containment where performance is critical.

ASYMM mode (MTE_TCF_ASYMM, PR_MTE_TCF_ASYMM, requires FEAT_MTE3):

Asymmetric mode uses SYNC semantics for loads (read tag mismatches fault immediately) and ASYNC semantics for stores (write tag mismatches set TFSRE0_EL1). Overhead is intermediate — approximately 0.5-2% depending on workload read/write ratio. Requires FEAT_MTE3 (ARMv8.7+/ARMv9+).

UmkaOS default mode selection:

UmkaOS selects the default per-thread mode as follows, based on hardware capabilities and the type of code being run:

  • Tier 1 driver processes (UmkaOS-native, ring 0 with hardware domain isolation): ASYNC mode by default (FEAT_MTE2 required). Balances crash detection against the hard real-time latency requirements of storage/network drivers. A tag fault in a driver terminates the driver domain and triggers reload (see Section 11.1).
  • Tier 2 processes (ring 3): ASYNC mode by default; userspace may upgrade to SYNC via prctl(). The allocator enables MTE on all anonymous heap mappings.
  • Debugging and testing: SYNC mode recommended — precise fault address enables exact identification of the corrupting site.
  • Legacy software without MTE awareness: MTE_TCF_NONE (default on execve). MTE is not forced on processes that have not opted in via prctl().

Per-CPU preferred mode:

The kernel exposes a per-CPU preferred TCF mode via /sys/devices/system/cpu/cpu<N>/mte_tcf_preferred (matching Linux). If userspace requests both PR_MTE_TCF_SYNC and PR_MTE_TCF_ASYNC simultaneously, the kernel selects the CPU's preferred mode. Preference order when multiple modes are requested: ASYNC > ASYMM > SYNC.
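A sketch of the resolution logic under the stated rules (names illustrative; the fallback order for combinations the text does not explicitly cover is an assumption based on the ASYNC > ASYMM > SYNC preference):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum MteTcfMode { None, Sync, Async, Asymm }

/// Resolve the effective TCF mode when userspace requests several at once:
/// use the CPU's preferred mode if it was requested, otherwise fall back
/// to the fixed preference order ASYNC > ASYMM > SYNC.
fn resolve_tcf(requested: &[MteTcfMode], cpu_preferred: MteTcfMode) -> MteTcfMode {
    if requested.contains(&cpu_preferred) {
        return cpu_preferred;
    }
    for m in [MteTcfMode::Async, MteTcfMode::Asymm, MteTcfMode::Sync] {
        if requested.contains(&m) {
            return m;
        }
    }
    MteTcfMode::None
}
```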

9.4.3 Kernel Data Structures

VMA flags (stored in VmaFlags alongside existing VM_READ/VM_WRITE):

/// MTE-related flags for a Virtual Memory Area.
/// Bits 40-42 of the VMA flags field. Disjoint from Linux VM_* bits in the
/// lower 32 bits; UmkaOS-native extensions occupy bits 32-63.
pub struct MteVmaFlags: u64 {
    /// PROT_MTE was specified at mmap()/mprotect() time.
    /// Pages backing this VMA are Normal-Tagged; tag memory is allocated.
    const MTE_ENABLED = 1 << 40;
    /// Tag Check Fault mode for this VMA's pages is SYNC.
    /// Recorded for coredump reconstruction; actual enforcement is per-thread.
    const MTE_SYNC    = 1 << 41;
    /// Tag Check Fault mode for this VMA's pages is ASYNC (FEAT_MTE2 required).
    const MTE_ASYNC   = 1 << 42;
}

Note that PROT_MTE on a VMA is irrevocable: once set via mmap() or mprotect(), it cannot be removed by a subsequent mprotect() call (consistent with Linux semantics and required because tag memory has already been provisioned for the physical pages). The MTE_SYNC/MTE_ASYNC flags on a VMA record the mode in effect when MTE was first enabled, for coredump annotation; the tag check fault mode enforced at runtime is always the per-thread MteTaskConfig.tcf_mode.

Per-thread MTE configuration (embedded in Task):

/// Per-thread MTE configuration. Stored in Task.mte_config.
/// Zero-initialised on task creation: MTE disabled, no faults checked.
pub struct MteTaskConfig {
    /// Tag Check Fault mode for this thread. Controls SCTLR_EL1.TCF0
    /// (for EL0 accesses) written on context switch and on prctl().
    pub tcf_mode: MteTcfMode,

    /// Tag inclusion mask for IRG/ADDG/SUBG instructions.
    /// 16-bit bitmask: bit i = 1 means tag i is included in the random set.
    /// Translated to GCR_EL1.Exclude = ~tag_mask & 0xFFFF before writing.
    /// Linux default: 0 (all tags excluded, IRG always returns tag 0).
    /// Allocator recommendation: 0xFFFE (exclude tag 0, include tags 1-15).
    pub tag_mask: u16,

    /// Tagged Address ABI enabled for this thread.
    /// Set by PR_TAGGED_ADDR_ENABLE bit in prctl(PR_SET_TAGGED_ADDR_CTRL).
    /// When false, the kernel strips top-byte tags from user-provided addresses
    /// (syscall arguments, signal si_addr) for backwards compatibility.
    pub tagged_addr_enabled: bool,

    /// Cached TFSRE0_EL1 value saved at last kernel entry in ASYNC mode.
    /// Non-zero means an async tag fault is pending delivery as SIGSEGV.
    /// The kernel checks and clears this field on every exception return.
    pub async_fault_pending: bool,
}

/// Tag Check Fault mode. Corresponds to SCTLR_EL1.TCF0 field encoding.
#[repr(u8)]
pub enum MteTcfMode {
    /// SCTLR_EL1.TCF0 = 0b00: tag faults ignored (default on execve).
    None  = 0,
    /// SCTLR_EL1.TCF0 = 0b01: synchronous fault on tag mismatch.
    Sync  = 1,
    /// SCTLR_EL1.TCF0 = 0b10: asynchronous fault (FEAT_MTE2 required).
    /// Writing this mode without FEAT_MTE2 returns EINVAL from prctl().
    Async = 2,
    /// SCTLR_EL1.TCF0 = 0b11: asymmetric (sync load, async store, FEAT_MTE3).
    /// Writing this mode without FEAT_MTE3 returns EINVAL from prctl().
    Asymm = 3,
}

UmkaOS-native system-level MTE state (boot-time singleton):

/// Boot-time-discovered MTE capabilities for this system.
/// Initialised once during AArch64 arch init; read-only thereafter.
pub struct MteSystemCapabilities {
    /// Highest MTE feature level present: 0=none, 1=FEAT_MTE,
    /// 2=FEAT_MTE2 (async), 3=FEAT_MTE3 (asymm).
    /// Read from ID_AA64PFR1_EL1.MTE at boot.
    pub feature_level: u8,

    /// Physical memory pages suitable for MTE (Normal-Tagged attribute).
    /// Not all memory may support tagging (e.g., device-mapped regions).
    pub tagged_memory_pages: u64,

    /// Tag storage memory in bytes (= tagged_memory_pages * 4096 / 32).
    /// Each page requires 128 bytes of tag storage (4 bits per 16-byte granule).
    pub tag_storage_bytes: u64,

    /// Whether the system-wide tagged address mode has been disabled by
    /// /proc/sys/abi/tagged_addr_disabled (requires CAP_SYS_ADMIN to set).
    pub tagged_addr_disabled: bool,
}

/// Global MTE system state. Initialised in arch_init_mte().
static MTE_SYSTEM: MteSystemCapabilities = MteSystemCapabilities::zeroed();

9.4.4 MTE-Aware Allocator Design

UmkaOS's slab allocator and buddy allocator are MTE-aware when running on AArch64 with MTE_SYSTEM.feature_level >= 1. MTE tagging is applied at the slab layer for heap allocations; the buddy allocator operates below slab granularity and manages untagged physical pages.

Allocation path:

  1. The slab allocator selects a free object from a magazine or slab.
  2. The allocator issues IRG Xd, Xn to generate a random tag from the thread's effective tag mask (IRG takes an optional second source register of additional exclusions, not an immediate). GCR_EL1.Exclude is configured to exclude tag 0, ensuring objects are never tagged zero — zero is the untagged sentinel.
  3. The allocator stores the tag to memory: ST2G Xtagged, [Xaddr] stores the same tag to two consecutive 16-byte granules (covering 32 bytes), repeated for the full object size. For objects up to 32 bytes, a single ST2G suffices. For larger objects, a loop of ST2G or STZG (store tag and zero memory) is used. STZG both stores the tag and zeroes the granule's memory, combining zeroing and tagging in one instruction.
  4. The allocator returns the tagged pointer produced by IRG: the tag is already encoded in bits 59:56 of Xtagged, so no further address arithmetic is needed.

Deallocation path:

  1. The allocator receives the potentially-tagged pointer from the caller.
  2. The raw address is extracted by masking off the top byte: Xaddr = Xptr & ~(0xFFUL << 56).
  3. The tag is cleared in memory: STG XZR, [Xaddr] stores tag 0 (the untagged sentinel) to the first granule, repeated for the full object. Alternatively, if the slab will immediately re-use the object, the allocator stores a new random tag for the next allocation. The key invariant is: the freed object's memory tag never matches any tag value 1-15 that a living pointer could carry, so any access through the freed pointer (use-after-free) triggers a tag mismatch fault.
  4. The object is returned to the magazine or slab free list.
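Steps 1-2 of the deallocation path reduce to simple bit masking. A sketch with hypothetical helper names:

```rust
/// Recover the raw address from a possibly-tagged pointer (deallocation
/// step 2): mask off the whole top byte, which TBI reserves.
fn untag_ptr(ptr: u64) -> u64 {
    ptr & !(0xFFu64 << 56)
}

/// Extract the 4-bit logical tag, which lives in bits [59:56].
fn ptr_tag(ptr: u64) -> u8 {
    ((ptr >> 56) & 0xF) as u8
}
```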

Adjacent object tag separation:

The slab allocator guarantees that no two adjacent objects within the same slab share the same tag. This is implemented by a simple retry: after generating a candidate tag via IRG, the allocator checks the tag of the preceding and succeeding objects (via LDG) and regenerates if a collision is detected. With 15 non-zero tags and typically 2 neighbour checks, the probability of requiring a retry is at most 2/15 ≈ 13%, so the expected number of IRG calls per allocation is 1/(1 − 2/15) ≈ 1.15 — negligible overhead.
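The expected-cost claim follows from the geometric distribution of retries; a small illustrative helper makes the arithmetic checkable:

```rust
/// Expected IRG executions per allocation when the allocator retries on a
/// tag collision with either neighbour. Each attempt collides with
/// probability at most p = neighbours / usable_tags, retries are
/// geometric, so the expected attempt count is 1 / (1 - p).
fn expected_irg_calls(neighbours: f64, usable_tags: f64) -> f64 {
    let p = neighbours / usable_tags;
    1.0 / (1.0 - p)
}
```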

Interaction with KASAN and compiler instrumentation:

When UmkaOS is built with KASAN enabled (kernel sanitizer, debug builds only), KASAN and MTE serve complementary roles. KASAN operates on shadow memory (software) and catches accesses to red zones and freed regions before they reach the hardware. MTE provides hardware enforcement in production builds where KASAN is disabled. The allocator does not enable KASAN and MTE simultaneously on the same object — when CONFIG_KASAN_HW_TAGS is set (the hardware-tag-based KASAN variant), KASAN hijacks the MTE tag mechanism directly and the allocator delegates all tag management to the KASAN layer.

Rust GlobalAlloc integration:

/// UmkaOS AArch64 slab allocator implementing GlobalAlloc with MTE tagging.
/// Called for all heap allocations in Tier 1 driver code compiled for AArch64.
unsafe impl GlobalAlloc for SlabAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // SAFETY: slab_alloc_mte returns a tagged pointer aligned to
        // layout.align() and valid for layout.size() bytes on success, or
        // null on OOM. A null return propagates to the caller, which is
        // responsible for invoking handle_alloc_error (per the GlobalAlloc
        // contract). Caller must not use the pointer after free.
        self.slab_alloc_mte(layout.size(), layout.align())
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // SAFETY: ptr was returned by alloc() with the same layout.
        // dealloc_mte clears the MTE tag before returning the slab object.
        self.dealloc_mte(ptr, layout.size());
    }
}

9.4.5 Context Switch Handling

When switching between tasks on AArch64 with MTE enabled, the following system registers must be saved and restored to maintain correct per-thread MTE semantics:

Registers saved/restored per-thread:

| Register | Purpose | Save/restore condition |
|---|---|---|
| SCTLR_EL1.TCF0 (bits 39:38) | Tag Check Fault mode for EL0 | Always when MTE enabled |
| GCR_EL1 | Tag exclusion mask for IRG/ADDG/SUBG | When tag_mask differs |
| TFSRE0_EL1 | Tag Fault Status for EL0 (async fault accumulator) | When FEAT_MTE2, task has MTE_TCF_ASYNC or MTE_TCF_ASYMM |
| PSTATE.TCO | Tag Check Override (suppress all tag faults) | Managed by kernel entry/exit; not per-thread |

Context switch sequence (AArch64 __switch_to):

// Saving outgoing task (prev):
// 1. Read TFSRE0_EL1 before dsb() — captures any async faults since last save.
mrs  x9, tfsre0_el1
str  x9, [x0, #TASK_MTE_TFSRE0]   // save prev's TFSRE0 snapshot (consumed by the async fault check)

// 2. Clear TFSRE0_EL1 before switching to prevent spurious delivery to next task.
msr  tfsre0_el1, xzr

// 3. dsb() barrier — required before SCTLR_EL1 write to ensure TFSRE0 read completes.
dsb  nsh

// Restoring incoming task (next):
// 4. Write SCTLR_EL1.TCF0 for next task's TCF mode.
//    Read-modify-write: preserve all other SCTLR_EL1 bits.
mrs  x9, sctlr_el1
ldr  x10, [x1, #TASK_MTE_SCTLR_TCF0_BITS]  // pre-computed bits for next->tcf_mode
bic  x9, x9, #SCTLR_EL1_TCF0_MASK
orr  x9, x9, x10
msr  sctlr_el1, x9

// 5. Write GCR_EL1 for next task's tag exclusion mask.
ldr  x9, [x1, #TASK_MTE_GCR_EL1]   // ~(next->tag_mask) & 0xFFFF
msr  gcr_el1, x9

// 6. Restore TFSRE0_EL1 for next task (may have pending async fault from before
//    it was preempted).
ldr  x9, [x1, #TASK_MTE_TFSRE0]
msr  tfsre0_el1, x9

// 7. isb() to synchronise SCTLR_EL1 and GCR_EL1 changes before returning to EL0.
isb

Kernel entry async fault check:

On every transition from EL0 to EL1 (syscall, IRQ, data abort), the kernel entry stub must check for pending async MTE faults in the outgoing thread. This check is inserted in the AArch64 exception entry code at el0_sync_handler and el0_irq:

/// Called on kernel entry from EL0 when task has MTE_TCF_ASYNC or MTE_TCF_ASYMM.
/// Checks TFSRE0_EL1 for a pending async tag fault; if set, schedules SIGSEGV delivery.
///
/// # Safety
/// Must be called with IRQs disabled on the kernel entry path, before any code
/// that could preempt or sleep. Reads and clears TFSRE0_EL1 atomically with
/// respect to exception return (the current exception level is EL1).
pub unsafe fn mte_check_async_fault(task: &mut Task) {
    if task.mte_config.tcf_mode != MteTcfMode::Async
        && task.mte_config.tcf_mode != MteTcfMode::Asymm
    {
        return;
    }

    // SAFETY: reading TFSRE0_EL1 in EL1 is always permitted.
    let tfsre0: u64;
    core::arch::asm!("mrs {}, tfsre0_el1", out(reg) tfsre0);

    if tfsre0 != 0 {
        // Clear the fault status register before processing.
        // SAFETY: writing TFSRE0_EL1 in EL1 is always permitted.
        core::arch::asm!("msr tfsre0_el1, xzr");
        task.mte_config.async_fault_pending = true;

        // Queue SIGSEGV delivery on return to userspace.
        // si_code = SEGV_MTEAERR (9), si_addr = 0 (address unavailable in async mode).
        task.signal_queue.push(Signal::new(
            SIGSEGV,
            SigInfo {
                si_code: SEGV_MTEAERR,
                si_addr: 0,
                ..SigInfo::default()
            },
        ));
    }
}

Signal handler invariant:

Signal handlers are always invoked with PSTATE.TCO = 0, regardless of whether the interrupted EL0 code had TCO set. Tag checking is always active inside signal handlers. PSTATE.TCO is restored to its pre-signal value on sigreturn(). This matches Linux semantics and is required because signal handlers may run MTE-unaware code that must not be disrupted by inherited TCO=1.

fork() and clone() inheritance:

On fork(), the child inherits the parent's MteTaskConfig (TCF mode, tag mask, tagged-addr-enable flag). TFSRE0_EL1 is reset to zero in the child — async faults pending in the parent are not inherited. On execve(), MteTaskConfig is reset to the default (all fields zero: MTE_TCF_NONE, tag_mask = 0, tagged addr disabled). This matches Linux semantics exactly.
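The inheritance rules can be sketched directly. This uses a simplified stand-in for MteTaskConfig (tcf_mode as a raw u8; function names illustrative):

```rust
#[derive(Clone, Copy, Default, PartialEq, Eq, Debug)]
struct MteTaskConfig {
    tcf_mode: u8,              // 0 = none, 1 = sync, 2 = async, 3 = asymm
    tag_mask: u16,
    tagged_addr_enabled: bool,
    async_fault_pending: bool,
}

/// fork()/clone(): the child inherits the configuration, but a pending
/// async fault stays with the parent (TFSRE0_EL1 resets to zero).
fn fork_mte(parent: &MteTaskConfig) -> MteTaskConfig {
    MteTaskConfig { async_fault_pending: false, ..*parent }
}

/// execve(): everything resets to the MTE-off default.
fn exec_mte() -> MteTaskConfig {
    MteTaskConfig::default()
}
```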

9.4.6 Userspace Interface (prctl, mmap PROT_MTE)

Compile-time feature constants (UAPI, AArch64 only):

/* arch/arm64/include/uapi/asm/hwcap.h */
#define HWCAP2_MTE              (1UL << 18)   /* MTE present; check via getauxval(AT_HWCAP2) */

/* arch/arm64/include/uapi/asm/mman.h */
#define PROT_MTE                0x20          /* Enable MTE allocation tags for mmap/mprotect */

/* include/uapi/linux/prctl.h */
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_GET_TAGGED_ADDR_CTRL 56
#  define PR_TAGGED_ADDR_ENABLE  (1UL << 0)    /* Enable tagged address ABI */
#  define PR_MTE_TCF_SHIFT       1
#  define PR_MTE_TCF_NONE        (0UL << 1)    /* Ignore tag faults (default) */
#  define PR_MTE_TCF_SYNC        (1UL << 1)    /* Synchronous tag fault mode */
#  define PR_MTE_TCF_ASYNC       (2UL << 1)    /* Asynchronous tag fault mode */
#  define PR_MTE_TCF_MASK        (3UL << 1)    /* Mask for TCF bits */
#  define PR_MTE_TAG_SHIFT       3
#  define PR_MTE_TAG_MASK        (0xffffUL << 3) /* Tag inclusion mask (16 bits) */

/* Signal codes for MTE faults */
#define SEGV_MTESERR  8    /* Synchronous MTE tag fault */
#define SEGV_MTEAERR  9    /* Asynchronous MTE tag fault (si_addr = 0) */

mmap() / mprotect() with PROT_MTE:

/* Map anonymous memory with MTE tagging enabled */
void *heap = mmap(NULL, size,
                  PROT_READ | PROT_WRITE | PROT_MTE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

/* Or enable MTE on an existing anonymous mapping */
mprotect(heap, size, PROT_READ | PROT_WRITE | PROT_MTE);

UmkaOS validates that PROT_MTE is only applied to MAP_ANONYMOUS or RAM-backed file mappings (tmpfs, memfd). Applying PROT_MTE to file-backed mappings of on-disk files, device-backed mappings, or MAP_FIXED mappings over non-Normal-Tagged physical memory returns EINVAL. PROT_MTE cannot be removed by a subsequent mprotect() call — this is enforced by the VMA merge logic, which treats VM_MTE (UmkaOS's internal name for PROT_MTE) as a non-mergeable flag.

prctl(PR_SET_TAGGED_ADDR_CTRL):

/* Enable tagged address ABI and set SYNC mode with all non-zero tags allowed */
prctl(PR_SET_TAGGED_ADDR_CTRL,
      PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xFFFEUL << PR_MTE_TAG_SHIFT),
      0, 0, 0);

/* Read current configuration */
unsigned long ctrl = prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0);
bool mte_sync  = (ctrl & PR_MTE_TCF_MASK) == PR_MTE_TCF_SYNC;
bool mte_async = (ctrl & PR_MTE_TCF_MASK) == PR_MTE_TCF_ASYNC;

The PR_MTE_TAG_MASK field provides an include mask: bit i set means tag i is allowed to be selected by IRG. The kernel inverts this to compute GCR_EL1.Exclude (which is an exclude mask). An include mask of 0x0000 causes IRG to always return tag 0 (the untagged value), effectively disabling random tag generation. UmkaOS's allocator sets tag_mask = 0xFFFE (exclude only tag 0, include tags 1-15) for all MTE-capable threads in Tier 1 and Tier 2 contexts.
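The include-to-exclude inversion is a one-liner (function name hypothetical):

```rust
/// Translate the prctl include mask (bit i set = tag i selectable by IRG)
/// into the hardware exclude mask: GCR_EL1.Exclude = !include & 0xFFFF.
fn gcr_exclude(include_mask: u16) -> u16 {
    !include_mask
}
```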

prctl(PR_SET_TAGGED_ADDR_CTRL) validation by UmkaOS:

UmkaOS validates the flags argument to PR_SET_TAGGED_ADDR_CTRL as follows:

  1. Bits outside PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_MASK | PR_MTE_TAG_MASK must be zero → EINVAL.
  2. If PR_MTE_TCF_ASYNC is set and MTE_SYSTEM.feature_level < 2 (no FEAT_MTE2) → EINVAL.
  3. If PR_MTE_TCF_ASYMM is set and MTE_SYSTEM.feature_level < 3 (no FEAT_MTE3) → EINVAL.
  4. If both PR_MTE_TCF_SYNC and PR_MTE_TCF_ASYNC are set, UmkaOS resolves to the CPU's preferred mode (see Section 9.4.2).
  5. If MTE_SYSTEM.tagged_addr_disabled = true and PR_TAGGED_ADDR_ENABLE is set → EINVAL.
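The five rules above can be sketched as a pure validation function. This is illustrative only: the flag bit positions, the `MteSystem` struct, and `validate_tagged_addr_ctrl` are assumptions for the example, not the kernel's real identifiers (the actual constants live in the prctl uapi definitions):

```rust
// Assumed bit positions for illustration only.
const PR_TAGGED_ADDR_ENABLE: u64 = 1 << 0;
const PR_MTE_TCF_SYNC: u64 = 1 << 1;
const PR_MTE_TCF_ASYNC: u64 = 1 << 2;
const PR_MTE_TCF_ASYMM: u64 = 1 << 19; // hypothetical placement
const PR_MTE_TAG_SHIFT: u64 = 3;
const PR_MTE_TAG_MASK: u64 = 0xFFFF << PR_MTE_TAG_SHIFT;
const EINVAL: i32 = 22;

struct MteSystem { feature_level: u8, tagged_addr_disabled: bool }

fn validate_tagged_addr_ctrl(flags: u64, sys: &MteSystem) -> Result<(), i32> {
    let valid = PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | PR_MTE_TCF_ASYNC
        | PR_MTE_TCF_ASYMM | PR_MTE_TAG_MASK;
    if flags & !valid != 0 { return Err(EINVAL); }                // rule 1
    if flags & PR_MTE_TCF_ASYNC != 0 && sys.feature_level < 2 {
        return Err(EINVAL);                                       // rule 2: no FEAT_MTE2
    }
    if flags & PR_MTE_TCF_ASYMM != 0 && sys.feature_level < 3 {
        return Err(EINVAL);                                       // rule 3: no FEAT_MTE3
    }
    // Rule 4 (SYNC and ASYNC both set) is not an error: it is resolved
    // later to the CPU's preferred mode, so it passes validation here.
    if sys.tagged_addr_disabled && flags & PR_TAGGED_ADDR_ENABLE != 0 {
        return Err(EINVAL);                                       // rule 5
    }
    Ok(())
}

fn main() {
    let sys = MteSystem { feature_level: 2, tagged_addr_disabled: false };
    assert!(validate_tagged_addr_ctrl(PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC, &sys).is_ok());
    // ASYMM requires FEAT_MTE3 (feature_level >= 3).
    assert_eq!(validate_tagged_addr_ctrl(PR_MTE_TCF_ASYMM, &sys), Err(EINVAL));
}
```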

ptrace() interface for tag access:

UmkaOS implements the Linux ptrace MTE interface for debugging tools (GDB, LLDB, sanitizer runtimes):

  • PTRACE_PEEKMTETAGS: reads allocation tags from a tracee's MTE-tagged address range. Data is packed as two 4-bit tags per byte. iov_len is updated to the number of tag bytes actually read. Returns EOPNOTSUPP if the address is not in a PROT_MTE VMA.
  • PTRACE_POKEMTETAGS: writes allocation tags to a tracee's MTE-tagged address range. Used by sanitizer tools to set up expected tag patterns.
  • PTRACE_GETREGSET/PTRACE_SETREGSET with NT_ARM_TAGGED_ADDR_CTRL: reads/writes the thread's tagged address control word (equivalent to PR_GET/SET_TAGGED_ADDR_CTRL).

Core dump support:

When a process with PROT_MTE mappings generates a core dump, UmkaOS includes the allocation tags as additional PT_AARCH64_MEMTAG_MTE program header segments. Each such segment covers the same virtual address range as a PT_LOAD segment for the corresponding MTE-tagged mapping. Tags are stored packed at 2 tags per byte (4 bits each), so a 4096-byte page produces 128 bytes of tag data in the core. This allows post-mortem debuggers to reconstruct the full tagged memory state.
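The packing arithmetic can be checked with a short sketch. `pack_tags` is a hypothetical helper, and the nibble order shown (even granule in the low nibble) is an assumption for illustration:

```rust
const MTE_GRANULE: usize = 16; // one 4-bit allocation tag per 16-byte granule

/// Pack one 4-bit tag per granule, two tags per output byte
/// (assumed layout: even granule in the low nibble, odd in the high).
fn pack_tags(tags: &[u8]) -> Vec<u8> {
    tags.chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0xF;
            let hi = pair.get(1).copied().unwrap_or(0) & 0xF;
            lo | (hi << 4)
        })
        .collect()
}

fn main() {
    // A 4096-byte page has 4096 / 16 = 256 granules...
    let page_tags = vec![0x3u8; 4096 / MTE_GRANULE];
    let packed = pack_tags(&page_tags);
    // ...which pack into 128 bytes of tag data in the core dump segment.
    assert_eq!(packed.len(), 128);
    assert_eq!(packed[0], 0x33);
}
```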

/proc and sysfs interfaces:

/proc/sys/abi/tagged_addr_disabled
    0: tagged addresses allowed (default)
    1: tagged addresses disabled system-wide (requires CAP_SYS_ADMIN to set)
    Writing 1 prevents any process from enabling PR_TAGGED_ADDR_ENABLE;
    existing processes with tagged addresses already enabled are unaffected.

/proc/<pid>/status
    Tagged_addr: 1    (present when PR_TAGGED_ADDR_ENABLE is set)
    Mte_tcf_mode: sync|async|asymm|none

/sys/devices/system/cpu/cpu<N>/mte_tcf_preferred
    sync|async|asymm  (per-CPU hardware preference, read-only)

/proc/cpuinfo MTE feature flag:

Features: ... mte ...

The mte feature flag is present in /proc/cpuinfo when HWCAP2_MTE is set (ID_AA64PFR1_EL1.MTE >= 1). This matches Linux 5.10+ behaviour exactly.

9.4.7 Integration with UmkaOS Security Model

Tier 1 drivers (ring 0, hardware domain isolation):

Tier 1 drivers compiled for AArch64 with MTE receive automatic heap memory safety with no additional coding. The MTE-aware slab allocator tags all heap objects. Any use-after-free or overflow within a Tier 1 driver produces a tag fault, which:

  1. In ASYNC mode: sets TFSRE0_EL1. On the next kernel entry (e.g., ring crossing, IPI), the tag fault handler runs, determines the fault occurred in a Tier 1 driver domain, and initiates driver domain teardown and reload (Section 11.1). The kernel itself is unaffected.
  2. In SYNC mode: produces an immediate instruction abort in the driver's EL1 execution context (driver code runs in EL1 but in a restricted MPK/POE domain). The exception handler identifies the Tier 1 domain, tears it down, and schedules reload.

This is a significant improvement over x86-64, where KASAN (software shadow memory) is the only equivalent tool and is not suitable for production use due to its 2-3x overhead. On AArch64 with MTE, Tier 1 drivers get production-grade heap safety with ~0-1% overhead.

SCTLR_EL1.TCF vs. SCTLR_EL1.TCF0:

SCTLR_EL1 contains two TCF fields:

  • TCF0 (bits 39:38): Tag Check Fault mode for EL0 accesses. UmkaOS uses this for userspace thread configuration via prctl().
  • TCF (bits 41:40): Tag Check Fault mode for EL1 accesses (kernel/Tier-1 driver code). UmkaOS sets TCF = 0b10 (ASYNC) system-wide for Tier 1 driver contexts where MTE is enabled. This means tag faults in kernel/Tier-1 code accumulate in TFSR_EL1 (not TFSRE0_EL1). On EL1-to-EL1 transitions (e.g., driver domain → kernel domain), the kernel checks TFSR_EL1 and handles faults as Tier 1 crashes.

Known limitation — TCMA1 and tag 0xF:

TCR_EL1.TCMA1 = 1 is required in the UmkaOS kernel (matching Linux). This means that any pointer with logical tag 0b1111 (0xF) bypasses tag checking and can be dereferenced without triggering a fault, regardless of the allocation tag in memory. This is necessary because many kernel subsystems generate kernel virtual addresses from physical addresses (e.g., phys_to_virt()) which carry no meaningful tag. An attacker aware of this can forge a pointer to any address by setting bits 59:56 to 0xF. This is a structural limitation of the ARMv8.5 MTE architecture as deployed in a full-kernel context; it is documented in the ARM architecture specification and in Linux's implementation notes. UmkaOS's response: PKEY 15 (the guard/unmapped key in the MPK domain table) is reserved; tag 0xF is excluded from the allocator's tag generation mask (PR_MTE_TAG_MASK excludes bit 15 in addition to bit 0). This ensures allocator-managed heap objects are never tagged 0xF, so the TCMA1 bypass does not help an attacker against heap objects.
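The tag arithmetic involved is worth making concrete. In this sketch, `logical_tag` is an illustrative helper (not a kernel function) that extracts the logical tag from bits 59:56 of a virtual address:

```rust
/// Extract the 4-bit MTE logical tag (bits 59:56) from a virtual address.
fn logical_tag(va: u64) -> u8 {
    ((va >> 56) & 0xF) as u8
}

fn main() {
    // A forged pointer with bits 59:56 = 0xF bypasses checks under TCMA1.
    let forged = 0x0000_0000_DEAD_B000u64 | (0xFu64 << 56);
    assert_eq!(logical_tag(forged), 0xF);

    // Excluding both tag 0 and tag 0xF leaves 14 usable tags for the
    // kernel/Tier-1 allocator (include mask 0x7FFE = bits 1..=14 set).
    let include_mask: u16 = 0x7FFE;
    assert_eq!(include_mask.count_ones(), 14);
}
```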

LSM integration:

The UmkaOS LSM framework (Section 8.7) provides the following MTE-related hooks:

/// Called on mmap() and mprotect() when PROT_MTE is in the requested protection flags.
/// LSM may deny the request (e.g., for confined processes in high-security contexts
/// that should not manipulate their own tag state).
///
/// Returns Ok(()) to permit, Err(Errno::EPERM) to deny.
fn security_mmap_mte(task: &Task, vma: &Vma) -> Result<(), Errno>;

/// Called on prctl(PR_SET_TAGGED_ADDR_CTRL) to allow LSM policy to restrict
/// MTE mode changes. For example, an LSM can prevent untrusted processes from
/// downgrading from PR_MTE_TCF_SYNC to PR_MTE_TCF_NONE.
fn security_mte_ctrl(task: &Task, new_flags: u64) -> Result<(), Errno>;

Disabling MTE for legacy software:

CAP_SYS_ADMIN is required to write 1 to /proc/sys/abi/tagged_addr_disabled, which disables the tagged address ABI system-wide for new opt-ins. Individual processes that have not opted in via prctl() are unaffected regardless of this setting (MTE is opt-in per-thread). A process with sufficient privilege can also call prctl(PR_SET_TAGGED_ADDR_CTRL, PR_MTE_TCF_NONE, ...) to disable tag checking for itself. There is no mechanism to disable MTE for another process without ptrace(2) access.

Interaction with UmkaOS capability tokens:

UmkaOS capability tokens (Section 8.1) use 64-bit object IDs. On AArch64 with TBI (Top Byte Ignore) enabled, the top byte of a pointer is not used for translation. UmkaOS's capability system does not store capabilities in pointers — capabilities are opaque handles in the CapabilitySpace table, never raw pointers. This design ensures that MTE's use of bits 59:56 for logical tags does not conflict with capability token encoding.

9.4.8 Comparison with x86-64 Mitigations

x86-64 has no hardware memory tagging mechanism. The comparison below covers the nearest available alternatives:

| Threat | x86-64 mitigation | Overhead | ARM MTE | Overhead |
|---|---|---|---|---|
| Heap use-after-free | ASAN (compiler instrumentation) | ~2-10x slowdown | Hardware tag check | ~0-1% (async) |
| Heap overflow | ASAN (compiler instrumentation) | ~2-10x slowdown | Hardware tag check | ~0-1% (async) |
| Stack buffer overflow | Intel CET Shadow Stack (SHSTK) | ~0-2% | MTE stack tagging (FEAT_MTE3) | ~0-2% |
| Production deployable? | No (ASAN overhead is prohibitive) | | Yes (ASYNC mode) | |
| Fault precision | Exact (ASAN synchronous) | | Exact (SYNC) / imprecise (ASYNC) | |
| Detection guarantee | Probabilistic (ASAN has false negatives on aliasing) | | Hardware deterministic (SYNC mode) | |
| Kernel-side use | KASAN (debug builds only) | ~2-3x | Tier 1 driver tagging (production) | ~0-1% |

Important nuance on MTE's probabilistic nature:

MTE with 4-bit tags provides 1/16 probability that a random tag guess is correct. This is not a cryptographic guarantee. Against an attacker who can make repeated attempts (a serial exploitation loop), MTE in ASYNC mode provides approximately 15/16 chance of detection per attempt. In SYNC mode, each attempt terminates the process, making iterative exploitation observable and limited by SIGSEGV delivery. MTE is classified as a crash-containment mitigation (probabilistic, stops typical bugs, slows targeted attacks) rather than a cryptographic security boundary (the role played by Tier 2 ring-3 isolation and IOMMU, which provide deterministic isolation). This is consistent with the analysis in the ARM Architecture Security Model and with Project Zero's MTE research (2023).
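The probabilistic claim can be made concrete with a few lines of arithmetic. Assuming independent uniform tag guesses (a simplification of real exploit conditions), the chance an attacker escapes detection shrinks geometrically with the number of attempts:

```rust
fn main() {
    // Probability that a single random 4-bit tag guess matches (1 of 16).
    let p_guess = 1.0f64 / 16.0;

    // In SYNC mode every wrong guess faults, so surviving n independent
    // attempts undetected requires guessing correctly every time.
    let survive = |n: u32| p_guess.powi(n as i32);

    assert!((survive(1) - 0.0625).abs() < 1e-12); // 1/16 per attempt
    assert!(survive(4) < 1.6e-5);                 // (1/16)^4 = 1/65536

    println!("P(undetected after 4 attempts) = {:e}", survive(4));
}
```

This is why the text classifies MTE as crash containment rather than a cryptographic boundary: 1/16 per attempt is far from the 2^-128 regime of real cryptographic guarantees, but it makes iterative exploitation loud and slow.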

x86-64 future outlook:

Intel's Linear Address Masking (LAM) and AMD's Upper Address Ignore (UAI) provide similar TBI-style top-byte-ignore semantics but do not provide hardware memory tags. They allow software tools to embed metadata in pointer top bytes but provide no hardware enforcement. No announced x86-64 extension as of 2025 provides hardware tagging equivalent to ARM MTE.

9.4.9 Linux Compatibility

UmkaOS's MTE implementation is a drop-in replacement for Linux 5.10+ on AArch64. All of the following work without modification on UmkaOS:

Compiler-based MTE support:

GCC 10+ and Clang 11+ support -march=armv8.5-a+memtag and provide the __arm_mte_* intrinsic family. Programs compiled with these flags run on UmkaOS without modification. The HWCAP2_MTE bit in AT_HWCAP2 (accessible via getauxval(3)) is set when MTE is present; applications performing runtime feature detection work correctly.

LLVM AddressSanitizer (HWASan) with MTE hardware backend:

LLVM's HWASan (Hardware-assisted AddressSanitizer) can use ARM MTE as its hardware backend when built with -fsanitize=hwaddress and run on MTE-capable hardware. HWASan manages allocation tagging itself (via the __hwasan_* runtime) using IRG/STG and relies on MTE faults for detection. UmkaOS's MTE implementation is compatible with HWASan because: (a) PROT_MTE on mmap() works as specified; (b) prctl(PR_SET_TAGGED_ADDR_CTRL) sets SYNC mode as HWASan requires for precise fault delivery; (c) SIGSEGV with si_code = SEGV_MTESERR and a valid si_addr is delivered on tag fault.

Android Scudo allocator with MTE:

Android's Scudo hardened allocator uses ARM MTE for heap tagging in production Android builds (Android 12+). Scudo uses PROT_MTE on its heap regions and manages tags directly via inline assembly IRG/STG/LDG sequences. Scudo-built applications run on UmkaOS without modification.

glibc MTE support:

glibc 2.35+ contains AArch64 MTE awareness in its memory allocator (malloc/free). When HWCAP2_MTE is detected, glibc's allocator enables PROT_MTE on heap mmap() regions and tags allocations. This is transparent to applications. All glibc-linked programs gain MTE heap protection automatically on UmkaOS/AArch64 with no recompilation.

Syscall compatibility table:

| Operation | UmkaOS AArch64 syscall | Linux AArch64 syscall | Notes |
|---|---|---|---|
| prctl(PR_SET_TAGGED_ADDR_CTRL) | 167 (prctl) | 167 (prctl) | ABI-identical |
| prctl(PR_GET_TAGGED_ADDR_CTRL) | 167 (prctl) | 167 (prctl) | ABI-identical |
| ptrace(PTRACE_PEEKMTETAGS) | 117 (ptrace) | 117 (ptrace) | ABI-identical |
| ptrace(PTRACE_POKEMTETAGS) | 117 (ptrace) | 117 (ptrace) | ABI-identical |

PTRACE_PEEKMTETAGS / PTRACE_POKEMTETAGS constants:

#define PTRACE_PEEKMTETAGS   33
#define PTRACE_POKEMTETAGS   34

These values match Linux and must not change (they are embedded in debugging tools and test suites).

Cross-references:

  • Section 8.1: Capability tokens; MTE logical tags in pointer top byte do not overlap with UmkaOS capability encoding (capabilities are table-indexed handles, not raw pointers)
  • Section 8.7: LSM hooks security_mmap_mte() and security_mte_ctrl() for policy enforcement
  • Section 8.8: CAP_SYS_ADMIN required to write /proc/sys/abi/tagged_addr_disabled
  • Section 4.2: VMA flags (MteVmaFlags) and PROT_MTE VMA handling; tag storage provisioning in the page fault handler
  • Section 4.1: Physical page metadata for tag storage; tag pages are not counted in usable memory reported to userspace
  • Section 11.1: Tier 1 driver crash recovery triggered by MTE tag faults in driver EL1 domains
  • Section 3.1: dsb nsh barrier requirement before SCTLR_EL1 write in context switch
  • Section 18.1: prctl() and ptrace() dispatch; MTE-specific argument validation in the compat layer

9.5 DebugCap — Capability-Based Process Debugging

Section 19.3.1 establishes that every ptrace operation requires a CAP_DEBUG capability token scoped to the target process. That model eliminates the ambient CAP_SYS_PTRACE authority problem — a debugger must hold an explicit, scoped token for the precise process it intends to inspect. This section extends that foundation with DebugCap: a first-class, transferable, revocable capability object that carries its own permission mask and an optional expiry time.

The distinction matters for three deployment scenarios that CAP_DEBUG tokens alone do not serve cleanly:

  1. Container debugging from outside the container. A monitoring service running on the host needs to debug a specific container workload process. It cannot enter the container (no shell, no shared UID, no root inside). CAP_DEBUG in the host's capability space covers host processes; the container's user namespace has its own capability space. A DebugCap is an object — it crosses namespace boundaries when explicitly handed across them, without requiring CAP_NS_TRAVERSE at every namespace boundary for routine debugging of a pre-authorised target.

  2. Privilege drop after attachment. A debugger daemon starts with CAP_SYS_PTRACE (or CAP_DEBUG) in order to attach to a target. Once attached, it should drop that broad authority and operate only on the specific process it was authorised for. A DebugCap is the handle it keeps after dropping the broad capability; the kernel enforces that the session is scoped to that handle.

  3. Granular, time-limited access. A CI system grants a test runner read-only memory inspection on a specific worker for 15 minutes. CAP_DEBUG is all-or-nothing; a DebugCap with read_memory: true and expires: Some(15min) is precise and self-cleaning.

9.5.1 DebugCap Data Structures

/// A capability token granting ptrace-level access to a specific process.
///
/// Issued by the kernel when a process explicitly grants debug access, or by a
/// process holding `CAP_DEBUG` (the UmkaOS-native form) or `CAP_SYS_PTRACE` (the
/// Linux-compat alias) for any process it can see in its capability namespace.
///
/// Properties:
/// - **Scoped**: only valid for `target`. Any operation attempted against a
///   different process returns `ESRCH`.
/// - **Revocable**: `cap_revoke(debug_cap)` revokes via the seqlock protocol on
///   `revocation_seq`: the revoker writes an odd value (in-progress), completes
///   bookkeeping, then writes an even value ≥ 2 (permanently revoked). Each
///   `DebugSession` method call checks `revocation_seq.load(Acquire) >= 2` before
///   dispatching. In-flight operations that have already passed the revocation check
///   may complete normally — revocation takes effect at the next dispatch boundary.
///   After revocation, any `DebugSession` holding the cap receives `EACCES` on its
///   next operation and is expected to release the cap.
/// - **Auditable**: issuance and every `ptrace_attach_cap()` call are logged
///   via the kernel audit subsystem.
/// - **Process-death-safe**: automatically invalidated (equivalent to revocation)
///   when `target` exits. The `Arc<Process>` inside keeps the process descriptor
///   alive for the duration of any active `DebugSession`, but `revocation_seq` is
///   advanced to an even value >= 2 during process teardown before releasing the
///   `Arc<Process>`, ensuring any blocked `DebugSession` wakes and sees revocation.
pub struct DebugCap {
    /// The process this capability grants access to.
    /// Kept alive (descriptor, not address space) for the lifetime of the cap.
    target: Arc<Process>,
    /// Permitted operations. Each field maps to one or more `PTRACE_*` requests.
    permissions: DebugPermissions,
    /// Kernel-assigned serial number for audit log correlation and revocation.
    /// Globally unique per boot; monotonically increasing.
    serial: u64,
    /// Monotonic expiry instant. `None` means the cap does not expire on its own.
    /// Checked on `ptrace_attach_cap()` and on each `DebugSession` operation.
    expires: Option<MonotonicInstant>,
    /// Seqlock-based revocation counter. Initially 0 (valid).
    /// Odd value = revocation in progress. Even value >= 2 = permanently revoked.
    /// Checked on every `DebugSession` method call before dispatching to ptrace.
    /// To check: `revocation_seq.load(Acquire) >= 2`.
    /// To revoke: store odd (in-progress), complete bookkeeping, store even (done).
    revocation_seq: AtomicU32,
}

/// Fine-grained permissions carried by a `DebugCap`.
/// Setting `full_ptrace` is equivalent to setting all individual fields.
pub struct DebugPermissions {
    /// Read target memory (`PTRACE_PEEKDATA`, `process_vm_readv`).
    pub read_memory: bool,
    /// Write target memory (`PTRACE_POKEDATA`, `process_vm_writev`).
    pub write_memory: bool,
    /// Read general-purpose and floating-point registers (`PTRACE_GETREGS`,
    /// `PTRACE_GETFPREGS`, `PTRACE_GETREGSET`).
    pub read_regs: bool,
    /// Write general-purpose and floating-point registers (`PTRACE_SETREGS`,
    /// `PTRACE_SETFPREGS`, `PTRACE_SETREGSET`).
    pub write_regs: bool,
    /// Set hardware breakpoints and watchpoints (Section 19.3.2).
    pub set_breakpoints: bool,
    /// Single-step execution (`PTRACE_SINGLESTEP`).
    pub single_step: bool,
    /// Receive and inject signals (`PTRACE_GETSIGINFO`, `PTRACE_SETSIGINFO`).
    pub intercept_signals: bool,
    /// Intercept syscall entry/exit (`PTRACE_SYSCALL`).
    pub trace_syscalls: bool,
    /// Full ptrace control: implies all fields above. When this field is `true`,
    /// the kernel ignores the individual fields and permits every ptrace request.
    pub full_ptrace: bool,
}
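The full_ptrace override described in the struct can be expressed as a small permission-check helper. This is a sketch only: the `allows` method, the `DebugOp` enum, and the reduced field set are illustrative, not the kernel API:

```rust
#[derive(Default)]
struct DebugPermissions {
    read_memory: bool,
    write_memory: bool,
    read_regs: bool,
    full_ptrace: bool, // implies every individual permission
}

#[derive(Clone, Copy)]
enum DebugOp { ReadMemory, WriteMemory, ReadRegs }

impl DebugPermissions {
    /// Permission gate: full_ptrace overrides all individual fields.
    fn allows(&self, op: DebugOp) -> bool {
        if self.full_ptrace { return true; }
        match op {
            DebugOp::ReadMemory => self.read_memory,
            DebugOp::WriteMemory => self.write_memory,
            DebugOp::ReadRegs => self.read_regs,
        }
    }
}

fn main() {
    // Read-only inspection cap, as in the CI scenario above.
    let ro = DebugPermissions { read_memory: true, ..Default::default() };
    assert!(ro.allows(DebugOp::ReadMemory));
    assert!(!ro.allows(DebugOp::WriteMemory));

    let full = DebugPermissions { full_ptrace: true, ..Default::default() };
    assert!(full.allows(DebugOp::WriteMemory)); // override in effect
}
```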

9.5.2 Obtaining a DebugCap

Three kernel interfaces issue DebugCap tokens. All three log an audit record.

/// A process grants another process (identified by `grantee_pid`) the right to
/// debug the calling process.
///
/// Preconditions (any one must hold):
/// - The caller and the grantee identified by `grantee_pid` share the same UID, OR
/// - The calling process has set `PR_SET_DUMPABLE` and `PR_SET_DEBUG_ACCEPT`
///   (see Section 9.5.5), OR
/// - The calling process invokes this on itself (`grantee_pid` is the
///   caller's own PID — equivalent to `self_debug_cap()`).
///
/// The kernel delivers the resulting `DebugCap` to `grantee_pid` via a
/// kernel-managed pending-cap queue; the grantee retrieves it with
/// `cap_recv(CAP_TYPE_DEBUG)`.
///
/// Returns: the serial number of the issued cap (for audit correlation).
pub fn grant_debug_cap(
    grantee_pid: Pid,
    permissions: DebugPermissions,
    expires: Option<Duration>,
) -> Result<u64, CapError>;

/// A process holding `CAP_DEBUG` (or the Linux-compat `CAP_SYS_PTRACE`) issues
/// a `DebugCap` for any process visible in the caller's namespace.
///
/// This is the primary entry point for debugger daemons. The recommended usage
/// pattern is:
///   1. Start with `CAP_SYS_PTRACE` in the bounding set.
///   2. Call `ptrace_cap_issue()` for the target.
///   3. Drop `CAP_SYS_PTRACE` from the ambient and effective sets.
///   4. Proceed using only the returned `DebugCap`.
///
/// Cross-namespace targets are reachable only if the caller also holds
/// `CAP_NS_TRAVERSE` for every intermediate namespace boundary (consistent
/// with [Section 19.3.1](19-observability.md#1931-capability-gated-ptrace)).
///
/// Returns: the `DebugCap` kernel handle (an opaque file descriptor in the
/// calling process's file-descriptor table, transferable via `SCM_RIGHTS`).
pub fn ptrace_cap_issue(
    target_pid: Pid,
    permissions: DebugPermissions,
    expires: Option<Duration>,
) -> Result<DebugCapFd, CapError>;  // Requires CAP_DEBUG or CAP_SYS_PTRACE

/// A process grants debug access to itself.
///
/// No capability checks — a process can always inspect itself. Useful for
/// test harnesses, in-process debuggers, and coverage tools that need the
/// structured `DebugSession` API rather than raw ptrace calls.
///
/// Self-caps are non-transferable (the `send_cap()` path returns `EPERM` for
/// self-issued caps) to prevent self-escalation.
pub fn self_debug_cap(permissions: DebugPermissions) -> DebugCapFd;

Non-transferability is enforced via the CAP_FLAG_SELF_ISSUED bit in cap_flags, set unconditionally in create_self_cap() and never cleared:

  • send_cap(cap, dest): checks cap.cap_flags & CAP_FLAG_SELF_ISSUED; if set, returns CapError::NonTransferable immediately, no capability transfer occurs
  • fd_dup2(old_fd, new_fd): creates a new fd pointing to the same capability entry (shared reference count). CAP_FLAG_SELF_ISSUED is in the entry, not the fd, so the duplicate fd inherits the non-transferable restriction automatically
  • fork(): the child inherits the fd table (new fds pointing to the same entries) with CAP_FLAG_SELF_ISSUED preserved. The child cannot transfer the cap either
  • execve(): by default, capabilities with CAP_FLAG_CLOEXEC are closed; self-issued caps are additionally closed regardless of CLOEXEC flag, since they represent the issuing task's identity context

This design requires no runtime "is this self-issued" check beyond reading a flag bit from the already-cached capability entry.
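The flag check can be sketched in a few lines. `CapEntry`, the bit positions, and these function bodies are illustrative stand-ins for the kernel's real capability-table code:

```rust
const CAP_FLAG_SELF_ISSUED: u32 = 1 << 0; // assumed bit position
const CAP_FLAG_CLOEXEC: u32 = 1 << 1;     // assumed bit position

#[derive(Debug, PartialEq)]
enum CapError { NonTransferable }

struct CapEntry { cap_flags: u32 }

/// Transfer gate: self-issued caps are rejected before any transfer work.
fn send_cap(entry: &CapEntry) -> Result<(), CapError> {
    if entry.cap_flags & CAP_FLAG_SELF_ISSUED != 0 {
        return Err(CapError::NonTransferable);
    }
    Ok(()) // transfer bookkeeping would happen here
}

/// execve gate: an entry is closed if it is CLOEXEC *or* self-issued.
fn closed_on_exec(entry: &CapEntry) -> bool {
    entry.cap_flags & (CAP_FLAG_CLOEXEC | CAP_FLAG_SELF_ISSUED) != 0
}

fn main() {
    let self_cap = CapEntry { cap_flags: CAP_FLAG_SELF_ISSUED };
    assert_eq!(send_cap(&self_cap), Err(CapError::NonTransferable));
    assert!(closed_on_exec(&self_cap)); // closed even without CLOEXEC

    let normal = CapEntry { cap_flags: 0 };
    assert!(send_cap(&normal).is_ok());
    assert!(!closed_on_exec(&normal));
}
```

Because the flag lives in the shared capability entry rather than in any one fd, every dup2 copy and every forked child sees the same restriction with no extra bookkeeping.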

DebugCapFd is a kernel file-descriptor type (analogous to a pidfd) that wraps the DebugCap. It is reference-counted: duplicating the fd (dup2, SCM_RIGHTS transfer) increments the reference count; closing the last reference destroys the cap if no DebugSession is currently holding it open.

9.5.3 Using a DebugCap

/// Attach to the target process using a capability token.
///
/// On success, the target is stopped (SIGSTOP delivered) and the returned
/// `DebugSession` provides the full debug interface. Dropping the session
/// detaches the debugger and delivers SIGCONT to the target.
///
/// Errors:
/// - `DebugError::Expired`  — cap has passed its `expires` time.
/// - `DebugError::Revoked`  — cap has been revoked by the issuer.
/// - `DebugError::TargetGone` — target process has already exited.
/// - `DebugError::PermDenied` — `permissions.full_ptrace` is false and the
///   operation mode requires full control (reserved for future use).
pub fn ptrace_attach_cap(cap_fd: DebugCapFd) -> Result<DebugSession, DebugError>;

/// An active debug session. Dropping this value detaches from the target and
/// delivers SIGCONT if the target was stopped by this session.
pub struct DebugSession {
    /// The capability authorising this session. Kept alive for session duration;
    /// revocation of the underlying cap is immediately visible here.
    cap: Arc<DebugCap>,
    /// Convenience pointer; equivalent to `cap.target`.
    target: Arc<Process>,
}

impl DebugSession {
    /// Read `buf.len()` bytes from the target's address space at `addr`.
    /// Requires `cap.permissions.read_memory`.
    pub fn read_memory(&self, addr: u64, buf: &mut [u8]) -> Result<usize, DebugError>;

    /// Write `data` to the target's address space at `addr`.
    /// Requires `cap.permissions.write_memory`.
    pub fn write_memory(&self, addr: u64, data: &[u8]) -> Result<(), DebugError>;

    /// Read the target's general-purpose registers.
    /// Requires `cap.permissions.read_regs`.
    pub fn get_regs(&self) -> Result<UserRegs, DebugError>;

    /// Write the target's general-purpose registers.
    /// Requires `cap.permissions.write_regs`.
    pub fn set_regs(&self, regs: &UserRegs) -> Result<(), DebugError>;

    /// Set a hardware breakpoint at `addr`. Returns a handle; drop the handle
    /// to clear the breakpoint. Requires `cap.permissions.set_breakpoints`.
    pub fn set_breakpoint(&self, addr: u64) -> Result<BreakpointHandle, DebugError>;

    /// Single-step the target: target executes one instruction then re-stops.
    /// Requires `cap.permissions.single_step`.
    pub fn single_step(&self) -> Result<(), DebugError>;

    /// Resume the target, optionally delivering `signal`.
    /// Requires `cap.permissions.full_ptrace` or appropriate per-op permissions.
    pub fn cont(&self, signal: Option<Signal>) -> Result<(), DebugError>;

    /// Wait for the target to stop. Returns the stop reason.
    ///
    /// Blocks until the target stops, the DebugCap expires, or the cap is
    /// revoked. The wait is implemented as a bounded loop so that cap expiry
    /// is detected promptly even if the target does not stop:
    ///
    /// ```text
    /// fn wait_stop(session: &DebugSession, cap: &DebugCap)
    ///     -> Result<StopReason, DebugError>:
    ///   loop:
    ///     now = monotonic_clock_ns()
    ///     if now >= cap.expiry_ns:
    ///       session.detach()  // revoke debug session on expiry
    ///       return Err(DebugError::Expired)
    ///     remaining_ns = cap.expiry_ns - now
    ///     // Slice wait into ≤100ms chunks so cap expiry is rechecked frequently.
    ///     event = session.wait_for_stop_event(
    ///                 timeout_ns = min(remaining_ns, 100_000_000))
    ///     match event:
    ///       Timeout      → continue  // deadline not reached; re-check expiry
    ///       StopEvent(r) → return Ok(r)
    ///       Revoked      → return Err(DebugError::Revoked)
    /// ```
    ///
    /// The 100ms timeout slice ensures that a DebugCap with a short remaining
    /// lifetime is honoured promptly without busy-waiting.
    ///
    /// Returns `Err(DebugError::Expired)` if the cap expires during the wait.
    /// Returns `Err(DebugError::Revoked)` if the cap is revoked while waiting.
    pub fn wait_stop(&self) -> Result<WaitStatus, DebugError>;
}

Every DebugSession method checks cap.revocation_seq atomically before dispatching to the underlying ptrace path. This is a single acquire load on the fast path (the common case, where the value is 0 and the cap is valid); a non-zero value (odd = revocation in progress, even >= 2 = permanently revoked) causes the operation to fail with Err(DebugError::Revoked). The acquire ordering ensures visibility of any state the revoking thread wrote before it advanced the sequence.

9.5.4 Capability Transfer

DebugCapFd can be sent to another process over a Unix domain socket using the standard SCM_RIGHTS control message interface. The kernel's SCM_RIGHTS path is extended to handle DebugCapFd file descriptors:

  1. The sender calls sendmsg(2) with a SCM_RIGHTS cmsg containing the DebugCapFd.
  2. The kernel validates that the sender holds the fd and that the cap is not revoked.
  3. The kernel creates a new DebugCapFd in the receiver's file-descriptor table, backed by the same Arc<DebugCap>. The reference count is incremented.
  4. The sender's fd remains open — SCM_RIGHTS duplicates the descriptor rather than moving it. After a hand-off, the sender typically closes its copy explicitly.

No additional privilege is required to transfer a DebugCap — if you hold it, you can delegate it. The receiver inherits the same permissions and expires values; there is no mechanism to escalate permissions on transfer (the cap is immutable after issuance).

Container debugging workflow — canonical example:

1. Container runtime (host, holds CAP_SYS_PTRACE):
       fd = ptrace_cap_issue(container_pid, DebugPermissions::full(), expires=Some(30min))
   → DebugCap issued; container_pid need not be in the host's user namespace.
   → Audit: type=DEBUG_CAP_ISSUED target_pid=1234 serial=77 perms=0xFF issuer=runtime

2. Runtime passes fd to an external debugger via Unix socket (SCM_RIGHTS):
       sendmsg(debugger_sock, SCM_RIGHTS=[fd])
   → Debugger receives the fd. Runtime may now close its copy and drop CAP_SYS_PTRACE.

3. Debugger (no CAP_SYS_PTRACE, no container root):
       session = ptrace_attach_cap(received_fd)
   → Kernel validates cap: not revoked, not expired, target still alive.
   → Audit: type=DEBUG_CAP_USED serial=77 attacher=debugger_pid

4. Debugger operates: session.get_regs(), session.read_memory(), etc.

5. 30 minutes later: kernel marks cap expired on next DebugSession operation.
       session.wait_stop() → Err(DebugError::Expired)
   → Session auto-detaches; target receives SIGCONT.

This workflow requires no root inside the container, no shared UID between the debugger and the container workload, and no long-lived broad privilege in the debugger process.

9.5.5 PR_SET_DEBUG_ACCEPT — Cross-UID Debug Grant

A process can advertise willingness to be debugged by processes that would not normally pass the UID check in grant_debug_cap(). This uses a new prctl option:

/* Allow processes in the same user namespace (even different UIDs) to call
 * grant_debug_cap() targeting this process.
 * arg2: DEBUG_ACCEPT_NONE (0) = default, only same-UID or parent
 *       DEBUG_ACCEPT_SAME_NS (1) = any process in same user namespace
 * arg3, arg4, arg5: must be zero.
 */
prctl(PR_SET_DEBUG_ACCEPT, DEBUG_ACCEPT_SAME_NS, 0, 0, 0);

The flag is stored in the Task struct alongside PR_SET_DUMPABLE. It is cleared on execve() (reset to DEBUG_ACCEPT_NONE). It is inherited across fork() and clone() (a process that accepts debug access continues to do so after forking worker children, which is the typical use case for worker-pool servers).

PR_SET_DEBUG_ACCEPT does not bypass LSM checks — the security_debug_cap_grant() hook (Section 9.5.6) fires regardless of the accept flag. It only widens the UID check in the kernel's own permission gate.

Use case: a multi-user web server spawns worker processes under different UIDs per virtual host. A monitoring tool running as the server's primary UID needs to inspect worker memory. Workers call prctl(PR_SET_DEBUG_ACCEPT, DEBUG_ACCEPT_SAME_NS) at startup; the monitor calls grant_debug_cap(worker_pid, read_memory_only, 5min) without needing root.

9.5.6 Revocation

DebugCap revocation uses a seqlock protocol to provide atomic revocation without a global lock:

  • DebugCap carries revocation_seq: AtomicU32, initially 0 (even = valid)
  • To revoke: write seq | 1 (odd = in-progress revocation), perform all revocation bookkeeping (remove from ptrace tables, close debug channels), then write seq + 2 (even = complete, permanently revoked)
  • ptrace operations read seq before the operation and after; if the value changed or is odd, the operation aborts and returns Err(DebugError::Revoked) (surfaced as EACCES on the Linux-compat path)
  • No stale-cap window: the odd transition signals all concurrent operations to abort before revocation bookkeeping completes

This is the standard seqlock pattern applied to capability lifecycle management.
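The state transitions can be sketched with a plain AtomicU32. This is a single-threaded illustration of the protocol, not the kernel's code; in the real path the bookkeeping between the two stores runs under concurrent readers:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct DebugCapState { revocation_seq: AtomicU32 }

impl DebugCapState {
    /// Dispatch-time check: odd = revocation in progress, even >= 2 = revoked.
    fn is_revoked_or_revoking(&self) -> bool {
        let seq = self.revocation_seq.load(Ordering::Acquire);
        seq >= 2 || seq & 1 == 1
    }

    /// Two-phase revocation: mark in-progress (odd), do bookkeeping,
    /// then finalize to a permanently revoked even value >= 2.
    fn revoke(&self) {
        let seq = self.revocation_seq.load(Ordering::Acquire);
        self.revocation_seq.store(seq | 1, Ordering::Release);       // odd: in progress
        // ... remove from ptrace tables, wake blocked sessions ...
        self.revocation_seq.store((seq | 1) + 1, Ordering::Release); // even >= 2: done
    }
}

fn main() {
    let cap = DebugCapState { revocation_seq: AtomicU32::new(0) };
    assert!(!cap.is_revoked_or_revoking()); // fresh cap: seq 0, valid
    cap.revoke();
    assert!(cap.is_revoked_or_revoking());
    assert_eq!(cap.revocation_seq.load(Ordering::Acquire), 2);
}
```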

/// Revoke a DebugCap identified by its kernel file descriptor.
///
/// Effects (all atomic with respect to ongoing DebugSession operations):
/// - Sets `cap.revocation_seq` to an odd value (in-progress), completes all
///   revocation bookkeeping, then advances to even (permanently revoked).
/// - Any `DebugSession` currently holding this cap has its next operation
///   return `Err(DebugError::Revoked)`.
/// - Any task blocked in `wait_stop()` on a revoked session is immediately
///   woken and receives `Err(DebugError::Revoked)`.
/// - If the target was stopped by this session, SIGCONT is delivered.
/// - All copies of `DebugCapFd` (across all processes, via SCM_RIGHTS
///   duplicates) are simultaneously invalidated — revocation is on the
///   underlying `DebugCap` object, not on any individual fd copy.
///
/// Audit: type=DEBUG_CAP_REVOKED serial=N revoker_pid=M
pub fn cap_revoke(cap_fd: DebugCapFd) -> Result<(), CapError>;

Only the process that issued the DebugCap can revoke it. Issuance is recorded in the cap's issuer_pid field, which the kernel checks in cap_revoke(). Processes that received the cap via SCM_RIGHTS transfer can close their own fd copy (reducing the reference count) but cannot revoke the cap itself. This asymmetry is intentional: a delegate can renounce its own access, but only the authority that granted the cap can terminate it for every holder.

Target process exit implicitly revokes all DebugCap tokens targeting that process. The kernel advances revocation_seq to an even value >= 2 on every outstanding DebugCap for the exiting process during the process teardown path, before releasing the Arc<Process>. This ensures that any DebugSession blocked in wait_stop() wakes and returns Err(DebugError::TargetGone) rather than blocking indefinitely.

9.5.7 Audit Logging

Every lifecycle event for a DebugCap is logged via the kernel audit subsystem (Section 19.2.9):

| Event | Audit record format |
| --- | --- |
| Cap issued via ptrace_cap_issue() | type=DEBUG_CAP_ISSUED serial=N target_pid=T issuer_pid=I perms=0xHH expires=S |
| Cap issued via grant_debug_cap() | type=DEBUG_CAP_GRANTED serial=N target_pid=T grantee_pid=G perms=0xHH expires=S |
| Cap issued via self_debug_cap() | type=DEBUG_CAP_SELF serial=N pid=P perms=0xHH |
| Cap used in ptrace_attach_cap() | type=DEBUG_CAP_USED serial=N attacher_pid=A target_pid=T |
| Cap revoked via cap_revoke() | type=DEBUG_CAP_REVOKED serial=N revoker_pid=R |
| Cap expired (on next use) | type=DEBUG_CAP_EXPIRED serial=N target_pid=T |
| Target exited with outstanding caps | type=DEBUG_CAP_TARGET_EXIT serial=N target_pid=T (one per outstanding cap) |

The perms field in the audit record is a bitmask of the DebugPermissions fields, in the order they appear in the struct:

  • read_memory = bit 0
  • write_memory = bit 1
  • read_regs = bit 2
  • write_regs = bit 3
  • set_breakpoints = bit 4
  • single_step = bit 5
  • intercept_signals = bit 6
  • trace_syscalls = bit 7
  • full_ptrace = bit 8
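The packing can be sketched as follows; the field names and bit positions come from the text above, while the `audit_mask()` helper itself is a hypothetical illustration, not the kernel's function:

```rust
/// Permission flags of a DebugCap, in audit bitmask order.
#[derive(Default)]
pub struct DebugPermissions {
    pub read_memory: bool,       // bit 0
    pub write_memory: bool,      // bit 1
    pub read_regs: bool,         // bit 2
    pub write_regs: bool,        // bit 3
    pub set_breakpoints: bool,   // bit 4
    pub single_step: bool,       // bit 5
    pub intercept_signals: bool, // bit 6
    pub trace_syscalls: bool,    // bit 7
    pub full_ptrace: bool,       // bit 8
}

impl DebugPermissions {
    /// Pack into the perms=0xHH bitmask that appears in audit records.
    pub fn audit_mask(&self) -> u16 {
        [
            self.read_memory, self.write_memory, self.read_regs,
            self.write_regs, self.set_breakpoints, self.single_step,
            self.intercept_signals, self.trace_syscalls, self.full_ptrace,
        ]
        .iter()
        .enumerate()
        .fold(0u16, |mask, (bit, &set)| if set { mask | (1 << bit) } else { mask })
    }
}
```

For example, a read-only cap with read_memory, read_regs, and set_breakpoints would be logged as perms=0x15 (bits 0, 2, and 4).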

The serial number is globally unique within a boot session and monotonically increasing. Audit records from the ptrace_attach_cap() call can be correlated with the issuance record using the serial field.

9.5.8 Linux Compatibility

ptrace(2) with the standard PTRACE_* constants continues to work unchanged. UmkaOS internally converts every ptrace(PTRACE_ATTACH, pid) call into a DebugCap with full_ptrace: true using the caller's CAP_DEBUG capability token (or the traditional UID check for processes that do not use the UmkaOS capability model). The resulting session is tracked identically to one opened via ptrace_attach_cap() — revocation, audit logging, and DebugSession semantics apply.

This means the audit trail covers traditional ptrace sessions as well as DebugCap sessions, with no gaps. GDB, LLDB, strace, perf, and any other tool using the ptrace(2) syscall operate without modification.

New UmkaOS-specific syscalls for the DebugCap API:

| Syscall | x86-64 number | Description |
| --- | --- | --- |
| ptrace_cap_issue | 1032 | Issue a DebugCap for a target process (requires CAP_DEBUG/CAP_SYS_PTRACE) |
| ptrace_attach_cap | 1033 | Attach a debug session to a target using a DebugCapFd |
| grant_debug_cap | 1034 | Grant debug access to another process (issuer = calling process) |
| self_debug_cap | 1035 | Issue a non-transferable DebugCap for the calling process |
| cap_revoke | 1036 | Revoke a DebugCap by its fd (issuer only) |

These syscall numbers are UmkaOS-specific. UmkaOS custom syscalls start at 1024, comfortably above Linux's current maximum syscall number, leaving generous headroom for future Linux growth. The PR_SET_DEBUG_ACCEPT and PR_GET_DEBUG_ACCEPT prctl options use the next available UmkaOS-reserved prctl numbers after the existing set defined in include/uapi/linux/prctl.h.

9.5.9 LSM Hooks

The UmkaOS LSM framework (Section 8.7) provides hooks at every DebugCap lifecycle point:

/// Called on ptrace_cap_issue() and grant_debug_cap() before issuing the cap.
/// LSM may deny issuance (e.g., Mandatory Access Control policy prevents
/// cross-label debugging).
///
/// `issuer`: the calling process.
/// `target`: the process to be debugged.
/// `perms`: the requested permissions.
///
/// Returns Ok(()) to permit, Err(Errno::EPERM) to deny.
fn security_debug_cap_grant(
    issuer: &Process,
    target: &Process,
    perms: &DebugPermissions,
) -> Result<(), Errno>;

/// Called on ptrace_attach_cap() before attaching the session.
/// LSM may deny attachment even if the cap is validly issued
/// (e.g., policy changed since issuance).
///
/// Returns Ok(()) to permit, Err(Errno::EPERM) to deny.
fn security_debug_cap_attach(
    attacher: &Process,
    cap: &DebugCap,
) -> Result<(), Errno>;

/// Called on cap_revoke() before revoking.
/// LSM may deny revocation (unusual; reserved for audit-locking scenarios
/// where the audit system must preserve an active session until log flush).
///
/// Returns Ok(()) to permit, Err(Errno::EPERM) to deny.
fn security_debug_cap_revoke(
    revoker: &Process,
    cap: &DebugCap,
) -> Result<(), Errno>;

The existing security_ptrace() hook (called for traditional ptrace(PTRACE_ATTACH)) continues to fire for the synthetic DebugCap created by the compat path, so LSM policy is uniformly applied regardless of whether the caller uses the new API or the legacy ptrace(2) syscall.
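As an illustration of the grant hook, a MAC module might permit read-only debugging across labels but require a shared label for mutating permissions. Everything here (the SecurityLabel type, the same-label rule, the trimmed-down stub types) is an invented example policy, not UmkaOS policy:

```rust
/// Stub types standing in for the kernel's own definitions.
#[derive(Clone, Copy, PartialEq)]
pub struct SecurityLabel(pub u32);

pub struct Process {
    pub pid: u32,
    pub label: SecurityLabel,
}

/// Trimmed to the two fields this example policy inspects.
pub struct DebugPermissions {
    pub write_memory: bool,
    pub full_ptrace: bool,
}

#[derive(Debug, PartialEq)]
pub enum Errno {
    EPERM,
}

/// Example MAC policy: mutating permissions (write_memory, full_ptrace)
/// require issuer and target to share a label; read-only caps pass.
pub fn security_debug_cap_grant(
    issuer: &Process,
    target: &Process,
    perms: &DebugPermissions,
) -> Result<(), Errno> {
    let mutating = perms.write_memory || perms.full_ptrace;
    if mutating && issuer.label != target.label {
        return Err(Errno::EPERM); // deny cross-label mutation
    }
    Ok(())
}
```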

9.5.10 DebugCap Request Rate Limiting

Processes holding CAP_DEBUG or CAP_SYS_PTRACE may call ptrace_cap_issue() or grant_debug_cap() in a tight loop, producing a rapid stream of kernel capability allocations and audit records. Without a rate limit, this is a low-cost DoS vector against the audit subsystem and the capability allocator; the rate limit described in this section closes that window.

Rate: 10 DebugCap issue requests per second per real UID (RUID), enforced via a token bucket algorithm. Processes holding CAP_SYS_ADMIN in their effective set are exempt — they are already unconditionally trusted.

self_debug_cap() is also exempt: it produces a non-transferable, same-process-only cap with no cross-process security boundary crossing, and carries no meaningful DoS potential.

/// Per-UID token bucket for DebugCap request rate limiting.
///
/// Each entry represents the rate-limit state for one real UID.
/// Entries are created on first request and evicted after 60 seconds
/// of inactivity (no requests from this UID).
pub struct DebugCapRateLimit {
    /// Tokens available. Each DebugCap issue request consumes 1 token.
    /// Refilled at REFILL_RATE_NS intervals up to MAX_TOKENS.
    tokens: AtomicU32,
    /// Timestamp of last token refill (nanoseconds since boot).
    /// Used to calculate how many tokens to add on the next request.
    last_refill_ns: AtomicU64,
}

impl DebugCapRateLimit {
    /// Burst capacity: maximum tokens in the bucket. A fully-charged
    /// UID may issue up to 10 DebugCap requests before being throttled.
    const MAX_TOKENS: u32 = 10;

    /// Token refill interval: one new token every 100ms = 10 tokens/sec
    /// sustained throughput. Calculated as: 1_000_000_000ns / 10 = 100ms.
    const REFILL_RATE_NS: u64 = 100_000_000;
}

Storage: Per-UID DebugCapRateLimit entries live in a hash table keyed by RUID, protected by RCU for lock-free lookup and a per-bucket spinlock for mutation. New entries are allocated from the slab allocator on first request. Entries are evicted (and their memory returned) after 60 seconds of inactivity, detected during the refill step: if now - last_refill_ns > 60_000_000_000ns, the entry is removed, and the next request from that UID allocates a fresh, fully-charged bucket (no penalty for inactivity).

Token consumption algorithm (executed under the per-bucket spinlock):

  1. Compute elapsed = now_ns - entry.last_refill_ns.
  2. Add elapsed / REFILL_RATE_NS tokens to entry.tokens, clamped to MAX_TOKENS.
  3. Set entry.last_refill_ns += (elapsed / REFILL_RATE_NS) * REFILL_RATE_NS (preserve fractional interval for the next call; do not set to now_ns directly).
  4. If entry.tokens >= 1: decrement tokens by 1, return Ok(()).
  5. Otherwise: return Err(Errno::EBUSY).
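The five steps can be modeled in a few lines. This is a single-threaded sketch (the kernel version runs under the per-bucket spinlock, so the atomics are elided), with `Err(())` standing in for EBUSY:

```rust
/// Single-threaded model of the per-UID token bucket algorithm above.
pub struct DebugCapRateLimit {
    tokens: u32,
    last_refill_ns: u64,
}

impl DebugCapRateLimit {
    const MAX_TOKENS: u32 = 10;
    const REFILL_RATE_NS: u64 = 100_000_000; // one token per 100ms

    /// A new bucket starts fully charged.
    pub fn new(now_ns: u64) -> Self {
        Self { tokens: Self::MAX_TOKENS, last_refill_ns: now_ns }
    }

    /// Steps 1-5: refill (clamped to MAX_TOKENS), advance last_refill_ns
    /// by whole intervals only, then consume one token or signal EBUSY.
    pub fn try_consume(&mut self, now_ns: u64) -> Result<(), ()> {
        let elapsed = now_ns - self.last_refill_ns;              // step 1
        let refilled = elapsed / Self::REFILL_RATE_NS;
        self.tokens = (self.tokens as u64 + refilled)            // step 2
            .min(Self::MAX_TOKENS as u64) as u32;
        // Step 3: preserve the fractional interval for the next call;
        // do NOT set last_refill_ns = now_ns directly.
        self.last_refill_ns += refilled * Self::REFILL_RATE_NS;
        if self.tokens >= 1 {                                    // step 4
            self.tokens -= 1;
            Ok(())
        } else {
            Err(())                                              // step 5: EBUSY
        }
    }
}
```

Preserving the fractional interval in step 3 matters: setting last_refill_ns to now_ns would silently discard up to one partial interval of accrued credit on every call, reducing sustained throughput below 10/sec under continuous load.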

Error semantics: when the rate limit is exceeded, the syscall returns EBUSY (not EPERM). EPERM signals permanent denial; EBUSY signals transient backpressure — the caller is authorized but must wait. Callers should apply exponential back-off starting at 100ms. A well-behaved debugger daemon will never encounter this limit in normal operation.

Audit: every rate-limit rejection is logged to the IMA audit ring:

type=DEBUG_CAP_RATELIMIT uid=U request=ptrace_cap_issue|grant_debug_cap timestamp_ns=T

The audit record includes the requesting UID (uid), the syscall name (request), and the kernel monotonic timestamp (timestamp_ns). Rate-limit audit records are written unconditionally (they are not themselves rate-limited) to preserve the full evidence trail for intrusion detection.

Interaction with LSM hooks: rate limiting occurs before the security_debug_cap_grant() LSM hook. A request rejected by the rate limiter never reaches LSM. This ordering is correct: there is no point invoking potentially expensive LSM policy evaluation for a request that will be refused regardless.

Cross-references:

  • Section 8.1: Core capability system; CAP_DEBUG and CAP_SYS_PTRACE capability bits; capability delegation model
  • Section 8.7: LSM hooks security_debug_cap_grant(), security_debug_cap_attach(), security_debug_cap_revoke()
  • Section 8.8: Credential model; CAP_DEBUG in the effective set vs. ambient set; bounding set enforcement
  • Section 13: Container namespace isolation; user namespace boundaries crossed by transferred DebugCap
  • Section 19.3.1: Capability-gated ptrace; the per-operation permission checks that DebugSession enforces
  • Section 19.3.2: Hardware breakpoint/watchpoint registers managed via DebugSession.set_breakpoint()