Chapter 9: Security Extensions
Companion to Chapter 8: Security Architecture. This chapter contains §9.1–§9.5: Kernel Crypto API, Key Retention, Seccomp-BPF, ARM MTE, and DebugCap. See 08-security.md for §8.1–§8.8.
9.1 Kernel Crypto API
The Kernel Crypto API is the algorithm registry and dispatch framework. It is not a security policy subsystem in the way that LSM (Section 8.7) or capabilities (Section 8.1) are, but it is placed here because it is the shared foundation that every security-relevant subsystem depends on: verified boot (Section 8.2) needs Ed25519 and ML-DSA-65 signature verification; PQC key exchange (Section 8.5) needs ML-KEM-768; confidential computing (Section 8.6) needs AES-256-GCM for sealed blobs; IMA (Section 8.4) needs SHA-256 and SHA-384; NVMe TLS authentication (Section 14.4) needs AES-GCM and ML-KEM; NFS/Kerberos (Section 14.X) needs AES-128-CTS-HMAC-SHA1 and AES-256-CTS-HMAC-SHA384; and kTLS (Section 15.X) needs ChaCha20-Poly1305.
A single, unified algorithm registry ensures: hardware-accelerated implementations are discovered at runtime and preferred automatically; PQC algorithms are first-class citizens with the same lookup paths as classical algorithms; and callers are insulated from implementation churn as acceleration support is added.
9.1.1 Algorithm Type Taxonomy
// umka-core/src/crypto/api.rs
/// Taxonomy of cryptographic algorithm families.
///
/// Each variant corresponds to a distinct API surface (different transform
/// objects, different operation descriptors, different vtables). Callers
/// use the type to filter the algorithm registry during lookup.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum CryptoAlgType {
/// Synchronous hash (e.g., SHA-256, SHA-384, BLAKE2b).
/// Output is produced in a single call from a complete message or
/// incrementally via `update()` + `finalize()`. No async path.
Shash = 0x0001,
/// Asynchronous hash (e.g., offloaded SHA via AMD CCP).
/// Operation descriptor submitted to a hardware ring buffer and
/// completed via completion callback or poll.
Ahash = 0x0002,
/// Synchronous symmetric key cipher (block cipher or stream cipher).
/// Operates in-place or out-of-place on a contiguous buffer. Suitable
/// for AES-ECB, AES-CBC, AES-CTR, ChaCha20.
Skcipher = 0x0003,
/// Asynchronous symmetric key cipher (hardware offload path).
Ablkcipher = 0x0004,
/// Authenticated encryption with associated data (AEAD).
/// Combines confidentiality and integrity: AES-GCM, AES-CCM,
/// ChaCha20-Poly1305. The `encrypt` path appends the authentication
/// tag; the `decrypt` path verifies it and returns `Err(EBADMSG)`
/// on mismatch without exposing the decrypted plaintext.
Aead = 0x0005,
/// Asymmetric key cipher (sign/verify, encrypt/decrypt).
/// Covers RSA, ECDSA (P-256, P-384), Ed25519, ML-DSA-44/65/87.
/// Key import uses PKCS#8 DER or raw key bytes depending on algorithm.
Akcipher = 0x0006,
/// Key agreement / key encapsulation mechanism (KEM).
/// Covers ECDH (P-256, P-384, X25519), ML-KEM-512/768/1024, and
/// hybrid templates (e.g., `hybrid-kem(x25519,ml-kem-768)`).
Kpp = 0x0007,
/// Cryptographic random number generator.
/// Backed by a hardware entropy source (RDRAND, ARM TRNG) that seeds a
/// NIST SP 800-90A CTR-DRBG instance.
Rng = 0x0008,
}
9.1.2 Algorithm Descriptor and Registration
Every algorithm implementation — whether software or hardware-accelerated — is described by a
CryptoAlg descriptor and registered with the global algorithm table at module load or driver
probe time.
// umka-core/src/crypto/api.rs
bitflags::bitflags! {
/// Flags attached to an algorithm descriptor.
#[derive(Clone, Copy, Debug)]
pub struct CryptoAlgFlags: u32 {
/// Algorithm has passed NIST FIPS 140-3 / SP 800-131A validation.
/// In FIPS mode, only algorithms with this flag may be allocated.
const FIPS_APPROVED = 0x0001;
/// Algorithm is implemented in software (no hardware dependency).
const SW_IMPL = 0x0002;
/// Algorithm requires hardware support (will fail if HW absent).
const HW_ACCEL = 0x0004;
/// Algorithm is a template that composes two or more base algorithms.
/// Instantiated from component names in the template string, e.g.,
/// `gcm(aes)` composes the `gcm` template with the `aes` base cipher.
const TEMPLATE = 0x0008;
/// Internal use: algorithm is in the process of being unregistered.
/// Allocation requests for this algorithm will return `Err(ENOENT)`.
const DYING = 0x0010;
/// Algorithm supports in-place operation (src == dst buffer).
const INPLACE = 0x0020;
/// Algorithm is part of the PQC suite (ML-KEM, ML-DSA, SLH-DSA).
const PQC = 0x0040;
}
}
/// Algorithm implementation descriptor.
///
/// Registered once per implementation; shared across all transform objects
/// that use the same implementation. Immutable after registration.
///
/// `CryptoAlg` is placed in a static or module-scoped location: its
/// lifetime must exceed any `*Tfm` objects that reference it.
pub struct CryptoAlg {
/// Canonical algorithm name used for lookup.
/// Examples: `"sha256"`, `"aes-gcm"`, `"ml-kem-768"`,
/// `"hybrid-kem(x25519,ml-kem-768)"`.
/// Maximum 64 bytes, null-padded.
pub name: [u8; 64],
/// Implementation name, used for diagnostics and sysfs.
/// Examples: `"aesni-sha256"`, `"soft-ml-kem-768"`, `"ccp-aes-gcm"`.
pub driver_name: [u8; 64],
/// Algorithm family. Determines which vtable pointer is valid.
pub alg_type: CryptoAlgType,
/// Selection priority. Higher priority implementations are preferred
/// when multiple implementations of the same algorithm are registered.
/// Range 0–999. Software fallback: 100. Hardware-accelerated: 300–900.
/// Test vectors only (for self-test use): 0.
pub priority: u32,
/// Capability and mode flags.
pub flags: CryptoAlgFlags,
/// Reference count: incremented when a transform object is allocated
/// from this descriptor, decremented when the transform is freed.
/// The descriptor cannot be unregistered while refcount > 0.
pub refcount: AtomicU32,
/// Algorithm-family-specific operations vtable.
pub ops: CryptoAlgOps,
}
/// Union of per-family vtables. Exactly one variant is valid, selected by
/// `alg_type`. Using an enum keeps the dispatch explicit and exhaustive.
pub enum CryptoAlgOps {
Shash(ShashOps),
Ahash(AhashOps),
Skcipher(SkcipherOps),
Ablkcipher(AblkcipherOps),
Aead(AeadOps),
Akcipher(AkCipherOps),
Kpp(KppOps),
Rng(RngOps),
}
Algorithm registration and deregistration:
// umka-core/src/crypto/registry.rs
/// Global algorithm table. Protected by a single RwSpinLock for the rare
/// registration/deregistration paths; reads (lookup during alloc) are common
/// but brief enough that a spinlock is acceptable. An RCU-protected list
/// would reduce reader overhead but is not necessary given registration
/// happens only at boot and at module load.
static ALGORITHM_TABLE: RwSpinLock<AlgorithmTable> = RwSpinLock::new(AlgorithmTable::new());
/// Register an algorithm implementation.
///
/// # Errors
/// - `EEXIST`: an implementation with the same `name` + `driver_name` is
/// already registered.
/// - `EINVAL`: the descriptor is malformed (zero-length name, unknown type,
/// priority out of range, null function pointer in vtable).
pub fn crypto_register_alg(alg: &'static CryptoAlg) -> Result<(), KernelError> {
let name = alg_name_str(alg)?;
let mut table = ALGORITHM_TABLE.write();
if table.find_by_driver(name, &alg.driver_name).is_some() {
return Err(KernelError::EEXIST);
}
validate_alg_descriptor(alg)?;
table.insert(alg);
Ok(())
}
/// Deregister an algorithm implementation.
///
/// Marks the descriptor as `DYING` first so that concurrent allocations
/// fail gracefully, then removes it from the table once `refcount` has
/// reached zero.
///
/// # Errors
/// - `ENOENT`: algorithm not found.
/// - `EBUSY`: live transforms still reference the descriptor, so waiting
/// would block indefinitely; the caller must retry (module unload should
/// be deferred until users have released their transforms).
pub fn crypto_unregister_alg(alg: &'static CryptoAlg) -> Result<(), KernelError>;
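The registration and deregistration rules above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: `Alg`, `Registry`, and `RegError` are hypothetical stand-ins for `CryptoAlg`, `AlgorithmTable`, and `KernelError`, `std::sync::RwLock` stands in for `RwSpinLock`, and the `DYING` flag is modelled as a separate `AtomicBool`.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::RwLock;

/// Hypothetical error type mirroring the errno values named in the text.
#[derive(Debug, PartialEq)]
pub enum RegError { Exist, NoEnt, Busy, Inval }

/// Hypothetical, simplified stand-in for `CryptoAlg`.
pub struct Alg {
    pub name: &'static str,
    pub driver_name: &'static str,
    pub priority: u32,
    pub dying: AtomicBool,
    pub refcount: AtomicU32,
}

impl Alg {
    pub const fn new(name: &'static str, driver_name: &'static str, priority: u32) -> Self {
        Alg { name, driver_name, priority,
              dying: AtomicBool::new(false), refcount: AtomicU32::new(0) }
    }
}

pub struct Registry { table: RwLock<Vec<&'static Alg>> }

impl Registry {
    pub const fn new() -> Self { Registry { table: RwLock::new(Vec::new()) } }

    /// Reject malformed descriptors, then reject duplicate name + driver pairs.
    pub fn register(&self, alg: &'static Alg) -> Result<(), RegError> {
        if alg.name.is_empty() || alg.driver_name.is_empty() || alg.priority > 999 {
            return Err(RegError::Inval);
        }
        let mut table = self.table.write().unwrap();
        if table.iter().any(|a| a.name == alg.name && a.driver_name == alg.driver_name) {
            return Err(RegError::Exist);
        }
        table.push(alg);
        Ok(())
    }

    /// Mark the descriptor dying first, then remove it only if unreferenced.
    pub fn unregister(&self, alg: &'static Alg) -> Result<(), RegError> {
        let mut table = self.table.write().unwrap();
        let idx = table.iter().position(|a| std::ptr::eq(*a, alg)).ok_or(RegError::NoEnt)?;
        alg.dying.store(true, Ordering::Release);
        if alg.refcount.load(Ordering::Acquire) != 0 {
            return Err(RegError::Busy); // caller retries after transforms are freed
        }
        table.remove(idx);
        Ok(())
    }
}
```

In this model a failed `unregister` leaves the descriptor marked dying, so new allocations keep failing while the caller waits for existing transforms to be released.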
9.1.3 Transform Objects
A transform object (Tfm) is the per-user working state for an algorithm. It holds the key schedule, any per-instance configuration, and a reference to the algorithm descriptor. Tfm objects are not shared: each caller allocates its own.
// umka-core/src/crypto/tfm.rs
/// Synchronous hash transform.
pub struct ShashTfm {
/// Descriptor of the underlying algorithm.
pub alg: &'static CryptoAlg,
/// Per-instance state (key for HMAC; empty for plain hash).
state: ShashTfmState,
}
/// In-progress synchronous hash computation.
///
/// Allocated on the caller's stack (or in a kernel object) via
/// `tfm.desc_size()`. Lives for the duration of one hash computation.
pub struct ShashDesc {
pub tfm: *const ShashTfm,
/// Implementation-defined context (SHA state, BLAKE2 state, etc.).
/// Size is `ShashOps.descsize` bytes; allocated adjacent to this
/// struct in the same allocation.
_ctx: [u8; 0], // variable-length tail, accessed via raw pointer arithmetic
}
/// AEAD transform.
pub struct AeadTfm {
pub alg: &'static CryptoAlg,
/// Encryption key schedule.
key_enc: SecretBox<[u8]>,
/// Authentication key material (for GCM: H = AES_K(0^128)).
key_auth: SecretBox<[u8]>,
/// Authentication tag length in bytes for this transform (set by `setauthsize`).
authsize: u32,
/// IV length in bytes (12 for AES-GCM, 16 for AES-CCM).
ivsize: u32,
}
/// Single AEAD operation request.
///
/// Submitted inline (synchronous) or queued to a hardware ring (async).
pub struct AeadReq {
/// Transform to use.
pub tfm: *const AeadTfm,
/// Associated data (authenticated but not encrypted).
pub assoc: &'static [u8],
/// Input buffer (plaintext for encrypt, ciphertext+tag for decrypt).
pub src: *const u8,
/// Output buffer. May equal `src` for in-place operation if
/// `CryptoAlgFlags::INPLACE` is set on the algorithm.
pub dst: *mut u8,
/// Length of the payload (excluding the authentication tag).
pub cryptlen: u32,
/// IV/nonce. Only the first `AeadTfm.ivsize` bytes are used.
pub iv: [u8; 16],
/// Completion callback for async path. `None` for synchronous callers.
pub complete: Option<fn(*mut AeadReq, i32)>,
/// Caller-supplied context pointer, passed to `complete`.
pub data: *mut core::ffi::c_void,
}
/// Asymmetric key cipher operations vtable.
///
/// Registered by each Akcipher implementation. All functions are mandatory
/// unless documented as optional (marked with `Option<…>`).
pub struct AkCipherOps {
/// Import a private key from DER-encoded PKCS#8 or raw bytes.
/// Stores parsed key material inside `tfm`. The src buffer is zeroed
/// by the caller after this call returns.
pub set_priv_key: unsafe extern "C" fn(
tfm: *mut AkCipherTfm,
src: *const u8,
src_len: u32,
) -> i32,
/// Import a public key. Format is algorithm-specific:
/// RSA: SubjectPublicKeyInfo DER; Ed25519: raw 32 bytes; ML-DSA-65: raw 1952 bytes.
pub set_pub_key: unsafe extern "C" fn(
tfm: *mut AkCipherTfm,
src: *const u8,
src_len: u32,
) -> i32,
/// Produce a signature over `src` (typically a digest). Output written
/// to `dst`. Returns the number of bytes written on success.
pub sign: unsafe extern "C" fn(
tfm: *const AkCipherTfm,
src: *const u8,
src_len: u32,
dst: *mut u8,
dst_len: u32,
) -> i32,
/// Verify `sig` over `src`. Returns 0 on success, `-EBADMSG` if the
/// signature is invalid, other negative errno on error.
pub verify: unsafe extern "C" fn(
tfm: *const AkCipherTfm,
src: *const u8,
src_len: u32,
sig: *const u8,
sig_len: u32,
) -> i32,
/// Maximum signature size in bytes. Used to allocate output buffers.
pub max_size: unsafe extern "C" fn(tfm: *const AkCipherTfm) -> u32,
}
/// Synchronous hash operations vtable.
pub struct ShashOps {
/// Digest size in bytes (e.g., 32 for SHA-256, 48 for SHA-384).
pub digestsize: u32,
/// `ShashDesc` context size in bytes.
pub descsize: u32,
/// Optional: set a key (for HMAC). Plain hashes leave this `None`, and a key request on them fails with `-EINVAL`.
pub setkey: Option<unsafe extern "C" fn(
tfm: *mut ShashTfm,
key: *const u8,
keylen: u32,
) -> i32>,
/// Initialise `desc` for a new hash computation.
pub init: unsafe extern "C" fn(desc: *mut ShashDesc) -> i32,
/// Process `len` bytes of data.
pub update: unsafe extern "C" fn(
desc: *mut ShashDesc,
data: *const u8,
len: u32,
) -> i32,
/// Finalise and write `digestsize` bytes to `out`.
pub finalize: unsafe extern "C" fn(desc: *mut ShashDesc, out: *mut u8) -> i32,
/// One-shot: init + update(data, len) + finalize. Faster for small messages.
pub digest: unsafe extern "C" fn(
desc: *mut ShashDesc,
data: *const u8,
len: u32,
out: *mut u8,
) -> i32,
}
/// AEAD operations vtable.
pub struct AeadOps {
/// Set the encryption key. Key length must be in the algorithm's
/// supported set (e.g., 16 or 32 bytes for AES-GCM).
pub setkey: unsafe extern "C" fn(
tfm: *mut AeadTfm,
key: *const u8,
keylen: u32,
) -> i32,
/// Set the authentication tag size. For AES-GCM this is 16 bytes;
/// shorter tags are allowed (8 bytes minimum) but not FIPS-approved.
pub setauthsize: unsafe extern "C" fn(tfm: *mut AeadTfm, authsize: u32) -> i32,
/// Encrypt and authenticate. On success, `dst` contains the ciphertext
/// followed by the authentication tag (`authsize` bytes).
pub encrypt: unsafe extern "C" fn(req: *mut AeadReq) -> i32,
/// Decrypt and verify. Returns `-EBADMSG` if the authentication tag
/// does not match; the output buffer is not modified on failure.
/// On success, `dst` contains the plaintext (without the tag).
pub decrypt: unsafe extern "C" fn(req: *mut AeadReq) -> i32,
/// IV size in bytes (12 for GCM, 16 for CCM).
pub ivsize: u32,
/// Maximum authentication tag size in bytes.
pub maxauthsize: u32,
}
/// Key agreement / KEM operations vtable.
pub struct KppOps {
/// Generate a fresh key pair, storing the private key inside `tfm`.
/// The public key is written to `pub_key` (exactly `pub_key_size()` bytes).
pub generate_key: unsafe extern "C" fn(
tfm: *mut KppTfm,
pub_key: *mut u8,
) -> i32,
/// For ECDH/X25519: set the private key from `src`.
/// For ML-KEM: import a serialised decapsulation key.
pub set_priv_key: unsafe extern "C" fn(
tfm: *mut KppTfm,
src: *const u8,
src_len: u32,
) -> i32,
/// ECDH/X25519 compute_shared_secret / ML-KEM encapsulate.
/// For ECDH: `peer_pub` is the peer's public key; `shared` receives
/// the Diffie-Hellman shared secret.
/// For ML-KEM: `peer_pub` is the peer's encapsulation key;
/// `shared` receives the KEM shared secret (32 bytes); the
/// ciphertext is written to `ct_out` (1088 bytes for ML-KEM-768).
pub compute: unsafe extern "C" fn(
tfm: *const KppTfm,
peer_pub: *const u8,
peer_pub_len: u32,
shared: *mut u8,
ct_out: *mut u8,
) -> i32,
/// ML-KEM decapsulate. `ct` is the ciphertext produced by the peer's
/// `compute` call. `shared` receives the same 32-byte shared secret.
/// Returns 0 on success. Decapsulation always succeeds (implicit
/// rejection per FIPS 203 Section 6.3): on ciphertext mismatch the
/// output is a deterministic but unpredictable value.
pub decapsulate: Option<unsafe extern "C" fn(
tfm: *const KppTfm,
ct: *const u8,
ct_len: u32,
shared: *mut u8,
) -> i32>,
/// Public key size in bytes.
pub pub_key_size: unsafe extern "C" fn(tfm: *const KppTfm) -> u32,
/// Shared secret / KEM output size in bytes.
pub shared_secret_size: unsafe extern "C" fn(tfm: *const KppTfm) -> u32,
}
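The `AeadReq` comments imply a simple buffer-sizing rule: `cryptlen` never includes the tag, so the tag is added to the output on encrypt and to the input on decrypt. The helper below is a hedged sketch of that rule only; `AeadDir` and `aead_buffer_lens` are hypothetical names that do not appear in the API above.

```rust
/// Hypothetical direction selector for the sizing helper.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AeadDir { Encrypt, Decrypt }

/// Returns `(src_len, dst_len)` in bytes for one AEAD request, or `None`
/// if `cryptlen + authsize` would overflow `u32`.
///
/// Assumes the convention stated for `AeadReq`: `cryptlen` is the payload
/// length without the tag, encrypt appends the tag to `dst`, and decrypt
/// consumes ciphertext followed by the tag from `src`.
pub fn aead_buffer_lens(dir: AeadDir, cryptlen: u32, authsize: u32) -> Option<(u32, u32)> {
    let with_tag = cryptlen.checked_add(authsize)?;
    Some(match dir {
        AeadDir::Encrypt => (cryptlen, with_tag),
        AeadDir::Decrypt => (with_tag, cryptlen),
    })
}
```

A caller would size `dst` from the second element before submitting the request; the overflow check matters because both lengths are caller-controlled `u32` values.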
9.1.4 Algorithm Lookup and Transform Allocation
The crypto_alloc_* family of functions drives the registry lookup, priority selection, and
transform instantiation.
// umka-core/src/crypto/alloc.rs
/// Allocate a synchronous hash transform for the named algorithm.
///
/// # Algorithm
/// 1. Lock the algorithm table for reading.
/// 2. Collect all registered `CryptoAlg` entries where `name` matches
/// and `alg_type == CryptoAlgType::Shash` and `!flags.DYING`.
/// 3. In FIPS mode, discard any entries without `flags.FIPS_APPROVED`.
/// 4. Select the entry with the highest `priority`. If multiple entries
/// share the maximum priority, the most recently registered wins
/// (last-writer-wins within the same priority tier, consistent with
/// hardware driver load order at boot).
/// 5. Atomically increment `alg.refcount`.
/// 6. Release the read lock.
/// 7. Allocate a `ShashTfm` from the kernel slab, initialise fields,
/// call `alg.ops.shash.init_tfm(tfm)` if the vtable provides it.
/// 8. Return the tfm. Caller must call `crypto_free_shash(tfm)` when done.
///
/// # Errors
/// - `ENOENT`: no implementation found for the name (or all are filtered
/// out by FIPS mode).
/// - `ENOMEM`: slab allocation failed.
/// - `EINVAL`: algorithm name is empty or longer than 64 bytes.
pub fn crypto_alloc_shash(
name: &str,
flags: CryptoAllocFlags,
) -> Result<Box<ShashTfm>, KernelError>;
/// Allocate an AEAD transform.
/// Follows the same lookup algorithm as `crypto_alloc_shash`.
/// Template algorithms (e.g., `"gcm(aes)"`) are instantiated by:
/// 1. Parsing the template name to extract the template (`"gcm"`) and
/// the base algorithm (`"aes"`).
/// 2. Looking up and allocating a `SkcipherTfm` for the base algorithm.
/// 3. Looking up the template and calling `template.alloc_aead(base_tfm)`.
pub fn crypto_alloc_aead(
name: &str,
flags: CryptoAllocFlags,
) -> Result<Box<AeadTfm>, KernelError>;
/// Allocate an asymmetric key cipher transform.
pub fn crypto_alloc_akcipher(
name: &str,
flags: CryptoAllocFlags,
) -> Result<Box<AkCipherTfm>, KernelError>;
/// Allocate a KEM transform.
pub fn crypto_alloc_kpp(
name: &str,
flags: CryptoAllocFlags,
) -> Result<Box<KppTfm>, KernelError>;
bitflags::bitflags! {
/// Flags for transform allocation.
#[derive(Clone, Copy)]
pub struct CryptoAllocFlags: u32 {
/// Accept only hardware-accelerated implementations.
const HW_ONLY = 0x0001;
/// Accept only software implementations (useful for self-tests).
const SW_ONLY = 0x0002;
/// Caller is in atomic context; allocation must not sleep.
const NOIO = 0x0004;
}
}
/// Free a synchronous hash transform, decrement algorithm refcount.
pub fn crypto_free_shash(tfm: Box<ShashTfm>);
/// Free an AEAD transform. Zeroises the key schedule before freeing.
pub fn crypto_free_aead(tfm: Box<AeadTfm>);
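Steps 2 to 4 of the lookup algorithm (filter, FIPS filter, priority selection with last-registered-wins) can be modelled in a few lines. This is an illustrative sketch rather than the allocator itself: `Candidate` and `select` are hypothetical names, the algorithm-type filter and refcounting are omitted, and the slice is assumed to be in registration order.

```rust
/// Hypothetical, flattened view of the fields the lookup inspects.
#[derive(Clone, Debug, PartialEq)]
pub struct Candidate {
    pub name: &'static str,
    pub driver_name: &'static str,
    pub priority: u32,
    pub fips_approved: bool,
    pub dying: bool,
}

impl Candidate {
    pub const fn new(name: &'static str, driver_name: &'static str,
                     priority: u32, fips_approved: bool, dying: bool) -> Self {
        Candidate { name, driver_name, priority, fips_approved, dying }
    }
}

/// Pick the implementation for `name` from a table in registration order.
///
/// `Iterator::max_by_key` returns the last element among equal maxima,
/// which gives the "most recently registered wins" tie-break for free.
pub fn select<'a>(table: &'a [Candidate], name: &str, fips_mode: bool) -> Option<&'a Candidate> {
    table
        .iter()
        .filter(|c| c.name == name && !c.dying)
        .filter(|c| !fips_mode || c.fips_approved)
        .max_by_key(|c| c.priority)
}
```

An empty result corresponds to the `ENOENT` case, including the situation where FIPS mode filters out every registered implementation.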
Template Instantiation
Template algorithms compose two or more base algorithms. gcm(aes) combines the GCM mode
template with the AES block cipher. hybrid-kem(x25519,ml-kem-768) combines X25519 ECDH
with ML-KEM-768 using the concatenated shared-secret construction from
NIST SP 800-227 (initial public draft, January 2025):
shared_secret = HKDF-SHA256(
ikm = x25519_shared || ml_kem_shared,
info = "hybrid-kem v1" || x25519_pub || ml_kem_pub,
len = 32
)
The hybrid template is registered as a Kpp algorithm with CryptoAlgFlags::TEMPLATE |
CryptoAlgFlags::PQC. Its generate_key generates both inner key pairs; compute runs the X25519
exchange and ML-KEM encapsulation, then applies HKDF; decapsulate runs the X25519 exchange
and ML-KEM decapsulation on the receiving side, then applies the same HKDF.
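To make the byte layout of the combiner concrete, the sketch below assembles the `ikm` and `info` inputs exactly as written in the formula. It deliberately stops short of calling HKDF, which would come from the Crypto API itself. `HkdfInputs` and `hybrid_kdf_inputs` are hypothetical names, and the formula does not say which party's X25519 public key is bound into `info`, so the parameter is left generic.

```rust
/// Domain-separation label taken from the formula.
pub const HYBRID_LABEL: &[u8] = b"hybrid-kem v1";

/// Hypothetical container for the three HKDF-SHA256 parameters.
pub struct HkdfInputs {
    pub ikm: Vec<u8>,
    pub info: Vec<u8>,
    pub len: usize,
}

/// Build `ikm = x25519_shared || ml_kem_shared` and
/// `info = "hybrid-kem v1" || x25519_pub || ml_kem_pub`.
pub fn hybrid_kdf_inputs(
    x25519_shared: &[u8; 32],
    ml_kem_shared: &[u8; 32],
    x25519_pub: &[u8; 32],
    ml_kem_pub: &[u8],
) -> HkdfInputs {
    let mut ikm = Vec::with_capacity(64);
    ikm.extend_from_slice(x25519_shared);
    ikm.extend_from_slice(ml_kem_shared);

    let mut info = Vec::with_capacity(HYBRID_LABEL.len() + 32 + ml_kem_pub.len());
    info.extend_from_slice(HYBRID_LABEL);
    info.extend_from_slice(x25519_pub);
    info.extend_from_slice(ml_kem_pub);

    HkdfInputs { ikm, info, len: 32 }
}
```

Both sides must build identical `info` bytes, so the ordering of the public keys is part of the wire contract. Real code would also zeroise `ikm` after use, since it holds both shared secrets.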
9.1.5 PQC Algorithms as First-Class Citizens
ML-KEM-768 and ML-DSA-65 are the preferred algorithms for new key exchange and signature code respectively, matching the selection in Section 8.5.
// umka-core/src/crypto/pqc.rs
/// ML-KEM-768 KEM algorithm descriptor (software implementation).
/// Implements FIPS 203 (2024). Registered at boot with priority 200.
static ML_KEM_768_ALG: CryptoAlg = CryptoAlg {
name: *b"ml-kem-768\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
driver_name: *b"soft-ml-kem-768\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
alg_type: CryptoAlgType::Kpp,
priority: 200,
flags: CryptoAlgFlags::FIPS_APPROVED
.union(CryptoAlgFlags::SW_IMPL)
.union(CryptoAlgFlags::PQC),
refcount: AtomicU32::new(0),
ops: CryptoAlgOps::Kpp(ML_KEM_768_OPS),
};
/// ML-DSA-65 signature algorithm descriptor (software implementation).
/// Implements FIPS 204 (2024). Registered at boot with priority 200.
static ML_DSA_65_ALG: CryptoAlg = CryptoAlg {
name: *b"ml-dsa-65\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
driver_name: *b"soft-ml-dsa-65\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
alg_type: CryptoAlgType::Akcipher,
priority: 200,
flags: CryptoAlgFlags::FIPS_APPROVED
.union(CryptoAlgFlags::SW_IMPL)
.union(CryptoAlgFlags::PQC),
refcount: AtomicU32::new(0),
ops: CryptoAlgOps::Akcipher(ML_DSA_65_OPS),
};
When a hardware accelerator supporting PQC operations is present (future Intel IAA, AMD Phoenix ML-KEM offload), it registers a higher-priority implementation of the same algorithm name. The Crypto API selects it automatically; callers need no changes.
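The hand-written null padding in the descriptors above is easy to miscount. A `const fn` can build the same 64-byte field at compile time. The sketch below shows one plausible shape for such a helper; the body is an assumption (the chapter calls `alg_name` elsewhere but does not define it), and `alg_name_as_str` is a hypothetical inverse added for illustration.

```rust
pub const ALG_NAME_LEN: usize = 64;

/// Null-pad `s` into a fixed 64-byte name field.
/// Evaluated at compile time when used in a `static` or `const` initialiser,
/// so an over-long or empty name becomes a build error.
pub const fn alg_name(s: &str) -> [u8; ALG_NAME_LEN] {
    let bytes = s.as_bytes();
    assert!(
        !bytes.is_empty() && bytes.len() <= ALG_NAME_LEN,
        "algorithm name must be 1..=64 bytes"
    );
    let mut out = [0u8; ALG_NAME_LEN];
    let mut i = 0;
    while i < bytes.len() {
        out[i] = bytes[i];
        i += 1;
    }
    out
}

/// Hypothetical inverse: view the padded field as `&str`, stopping at the
/// first NUL. Returns `None` for an empty or non-UTF-8 name.
pub fn alg_name_as_str(name: &[u8; ALG_NAME_LEN]) -> Option<&str> {
    let end = name.iter().position(|&b| b == 0).unwrap_or(ALG_NAME_LEN);
    if end == 0 {
        return None;
    }
    std::str::from_utf8(&name[..end]).ok()
}

/// Example: the same bytes as the literal in `ML_KEM_768_ALG.name`.
pub const ML_KEM_768_NAME: [u8; ALG_NAME_LEN] = alg_name("ml-kem-768");
```

With a helper like this, a static descriptor could write `name: alg_name("ml-kem-768")` instead of spelling out the padding.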
9.1.6 Hardware Acceleration Integration
Tier 1 crypto drivers register implementations of standard algorithm names at higher priority than the software fallback. The KABI registration path for a hardware-accelerated driver:
// Executed during Tier 1 driver probe (e.g., aesni driver on x86-64).
fn aesni_probe(dev: &mut DriverContext) -> KabiResult {
// Register AES-NI accelerated AES-GCM. Priority 400 > SW priority 100,
// so this implementation is preferred on x86-64 systems with AES-NI.
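// Bind the leaked descriptor as a shared `&'static` reference. A
// `&'static mut` binding would remain borrowed after registration and
// could not be converted to a raw pointer afterwards.
let alg: &'static CryptoAlg = Box::leak(Box::new(CryptoAlg {
name: alg_name("aes-gcm"),
driver_name: alg_name("aesni-gcm"),
alg_type: CryptoAlgType::Aead,
priority: 400,
flags: CryptoAlgFlags::FIPS_APPROVED
.union(CryptoAlgFlags::HW_ACCEL)
.union(CryptoAlgFlags::INPLACE),
refcount: AtomicU32::new(0),
ops: CryptoAlgOps::Aead(AESNI_GCM_OPS),
}));
crypto_register_alg(alg)?;
// Stash the descriptor so the remove path can unregister it.
dev.set_private(alg as *const CryptoAlg as *mut _);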
KabiResult::Ok
}
Async hardware (e.g., AMD CCP, Intel QAT) uses the Ahash/Ablkcipher interfaces. Their
encrypt/decrypt vtable functions submit an operation descriptor to the hardware's command
ring buffer (see Section 11.1 for the ring buffer infrastructure) and return -EINPROGRESS.
The kernel DMA completion IRQ fires the AeadReq.complete callback. Callers that cannot
tolerate async completion use crypto_alloc_aead with CryptoAllocFlags::SW_ONLY to force
the synchronous software implementation.
9.1.7 Hardware Crypto Acceleration by Architecture
UmkaOS's crypto API dispatches to hardware acceleration when available, with a portable software fallback for every algorithm. Acceleration availability is detected at boot via architecture-specific feature registers and registered per-algorithm with a higher priority than the software implementation. The crypto API always selects the highest-priority driver that supports the requested algorithm and mode.
x86-64 (Intel/AMD):
- AES: AES-NI instructions (VAESENC, VAESDEC, VAESKEYGENASSIST), available on all Intel Sandy Bridge+ and AMD Bulldozer+ processors. ~1-2 cycles/block for AES-128 GCM.
- SHA: SHA-NI (SHA256RNDS2, SHA256MSG1, SHA256MSG2), available on Intel Goldmont+ and AMD Zen+. ~4 cycles/block for SHA-256.
- CLMUL: PCLMULQDQ, for GCM authentication tag and CRC-32. Available alongside AES-NI on the same CPU generations.
- RDRAND/RDSEED: hardware RNG, available on Intel Ivy Bridge+ and AMD Zen+.
- Detection: CPUID leaf 1 (ECX.AES bit 25, ECX.PCLMULQDQ bit 1), leaf 7 sub-leaf 0 (EBX.SHA bit 29). UmkaOS reads these at early boot in umka-kernel/src/arch/x86_64/cpu.rs and passes them to the crypto subsystem via CryptoHwCaps.
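The x86-64 detection step reduces to three bit tests on register values that early boot has already read. The sketch below decodes them from plain integers, so it does not execute CPUID itself. `X86CryptoCaps` is a hypothetical name (the chapter names `CryptoHwCaps` but does not define its layout); the bit positions are the ones listed above.

```rust
/// Hypothetical x86-64 slice of the crypto capability set.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct X86CryptoCaps {
    pub aes: bool,
    pub pclmulqdq: bool,
    pub sha: bool,
}

impl X86CryptoCaps {
    /// Decode from CPUID leaf 1 ECX and leaf 7 (sub-leaf 0) EBX.
    pub const fn from_cpuid(leaf1_ecx: u32, leaf7_ebx: u32) -> Self {
        Self {
            aes: (leaf1_ecx & (1 << 25)) != 0,      // ECX.AES
            pclmulqdq: (leaf1_ecx & (1 << 1)) != 0, // ECX.PCLMULQDQ
            sha: (leaf7_ebx & (1 << 29)) != 0,      // EBX.SHA
        }
    }

    /// AES-GCM offload needs both the AES rounds and the carry-less
    /// multiply used for the authentication tag.
    pub const fn aes_gcm_accelerated(&self) -> bool {
        self.aes && self.pclmulqdq
    }
}
```

Keeping the decoder free of inline assembly makes it testable on any host; only the caller that executes CPUID is architecture-specific.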
AArch64 (ARM Cryptography Extensions — FEAT_AES / FEAT_SHA2 / FEAT_SHA512 / FEAT_SHA3):
- AES (FEAT_AES): AESE, AESD, AESMC, AESIMC NEON instructions, available on Cortex-A53+ (most ARMv8.0+ cores). Same throughput class as AES-NI.
- SHA-256 (FEAT_SHA2): SHA256H, SHA256H2, SHA256SU0, SHA256SU1, available on Cortex-A53+.
- SHA-512 (FEAT_SHA512): SHA512H, SHA512H2, SHA512SU0, SHA512SU1, available on Cortex-A55+, Neoverse N1+.
- SHA-3 (FEAT_SHA3): EOR3, RAX1, XAR, BCAX, available on Neoverse V1+.
- SM3/SM4 (FEAT_SM3 / FEAT_SM4): Chinese national cipher standards, implemented by some ARM licensees targeting CN markets.
- PMULL (FEAT_PMULL): PMULL/PMULL2 for GCM polynomial multiplication, available on Cortex-A53+. Required alongside FEAT_AES for AES-GCM hardware offload.
- RNG (FEAT_RNG): RNDR/RNDRRS system registers, available on Neoverse N2/V2 and Cortex-A710+. Provides a TRNG directly readable from EL0 without a syscall.
- Detection: ID_AA64ISAR0_EL1 register: AES field [7:4], SHA2 [15:12], SHA3 [35:32], SM4 [43:40], RNDR [63:60]. UmkaOS reads these in umka-kernel/src/arch/aarch64/cpu.rs at boot before any crypto allocations.
- Performance: comparable to AES-NI, ~1-3 cycles/block for AES-128 GCM on Neoverse V1.
RISC-V (Scalar Cryptography ISA extensions, Zkn and Zks groups, ratified as version 1.0 in 2021):
- Zkne (AES encryption): scalar AES round instructions aes64es, aes64esm, aes64ks1i, aes64ks2, one round per instruction on 64-bit cores.
- Zknd (AES decryption): aes64ds, aes64dsm, symmetric to Zkne.
- Zknh (SHA-2): sha256sig0, sha256sig1, sha256sum0, sha256sum1, sha512sig0, sha512sig1, sha512sum0, sha512sum1 on RV64. RV32 cores use split register-pair forms for SHA-512 instead (sha512sig0l, sha512sig0h, sha512sum0r, sha512sum1r, etc.).
- Zksh (SM3): sm3p0, sm3p1, the Chinese national hash function.
- Zksed (SM4): sm4ed, sm4ks, the Chinese national block cipher.
- Zbkx (bit manipulation for crypto): xperm4, xperm8, which accelerate S-box lookups and byte shuffles in cipher implementations.
- Zvkned (vector AES, requires V extension): AES using the V (vector) extension, with higher throughput than scalar Zkne/Zknd when V is available.
- Detection: the RISC-V ISA string in the Device Tree riscv,isa property (e.g., rv64imafdc_zkn_zks_zbkx) or the misa CSR (V bit for vector). UmkaOS parses the ISA string at boot in umka-kernel/src/arch/riscv64/cpu.rs. Scalar crypto extensions are not yet universal: many embedded RISC-V cores lack them. UmkaOS falls back to the portable software implementation on cores without them.
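Parsing the riscv,isa string is mostly a matter of splitting on underscores and expanding the group shorthands. The sketch below is a minimal illustration, not the UmkaOS parser: `isa_has_ext` and `group_members` are hypothetical names, version suffixes and the `g` shorthand are not handled, and only the `zkn` and `zks` groups are expanded (to their members as defined by the scalar cryptography specification).

```rust
/// Members implied by the scalar-crypto group shorthands.
fn group_members(group: &str) -> &'static [&'static str] {
    match group {
        "zkn" => &["zbkb", "zbkc", "zbkx", "zkne", "zknd", "zknh"],
        "zks" => &["zbkb", "zbkc", "zbkx", "zksed", "zksh"],
        _ => &[],
    }
}

/// Returns true if `isa` (e.g. "rv64imafdc_zkn_zks_zbkx") advertises `ext`,
/// either directly or through a group shorthand. Matching is case-insensitive.
pub fn isa_has_ext(isa: &str, ext: &str) -> bool {
    let isa = isa.to_ascii_lowercase();
    let ext = ext.to_ascii_lowercase();
    let mut parts = isa.split('_');
    let base = parts.next().unwrap_or("");

    if ext.len() == 1 {
        // Single-letter extensions live in the base token after the
        // "rv32"/"rv64" prefix; strip it so the 'v' in "rv" is not
        // mistaken for the vector extension.
        let letters = base
            .strip_prefix("rv64")
            .or_else(|| base.strip_prefix("rv32"))
            .unwrap_or(base);
        return letters.contains(ext.as_str());
    }

    // Multi-letter extensions follow, separated by underscores.
    parts.any(|p| p == ext || group_members(p).iter().any(|m| *m == ext))
}
```

A boot-time caller would query each extension it has a driver for (`zkne`, `zknd`, `zknh`, and so on) and register only the matching implementations.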
PPC32 / PPC64LE:
- AES: POWER8+ provides vcipher, vcipherlast, vncipher, vncipherlast VMX (AltiVec) instructions for AES encryption and decryption.
- SHA: POWER8+ also provides vshasigmaw and vshasigmad for SHA-256 and SHA-512 sigma functions, accelerating the compression rounds.
- GCM: vpmsumw/vpmsumd VMX polynomial multiply, particularly efficient for GCM GHASH due to POWER8's wide polynomial multiply unit.
- RNG: POWER9+ darn instruction (Deliver A Random Number), a hardware TRNG accessible from privileged mode without a firmware call.
- Detection: PPC feature bits in the Device Tree ibm,pa-features property for bare-metal boot, or AT_HWCAP/AT_HWCAP2 auxiliary vector entries (following the PPC Linux ABI feature bit definitions). UmkaOS reads ibm,pa-features from the DTB during early boot in umka-kernel/src/arch/ppc64le/cpu.rs.
UmkaOS crypto dispatch table (registered at boot):
/// A registered crypto algorithm implementation — hardware or software.
/// Registered via `crypto_register_alg`; the API selects the highest-priority
/// driver that supports the requested algorithm and operation mode.
struct CryptoDriver {
/// Human-readable implementation name (e.g., "aesni-gcm", "arm-ce-aes-gcm").
name: &'static str,
/// Selection priority. Hardware implementations register at priority 300-400;
/// the portable software fallback registers at priority 100. Higher wins.
priority: u32,
/// AES-CBC encrypt/decrypt, or None if not supported by this driver.
aes_cbc: Option<fn(key: &AesKey, iv: &[u8; 16], buf: &mut [u8], enc: bool)>,
/// AES-GCM AEAD, or None if not supported.
aes_gcm: Option<fn(key: &AesKey, nonce: &[u8; 12], aad: &[u8],
pt: &[u8], ct: &mut [u8], tag: &mut [u8; 16])>,
/// SHA-256, or None if not supported.
sha256: Option<fn(data: &[u8], out: &mut [u8; 32])>,
/// SHA-512, or None if not supported.
sha512: Option<fn(data: &[u8], out: &mut [u8; 64])>,
// Additional algorithm slots follow the same pattern.
}
At boot, each architecture's cpu.rs initialisation code calls crypto_register_hw_drivers()
which inspects the detected feature flags and registers whichever hardware drivers are
available at priority 300-400. The portable software implementation registers unconditionally
at priority 100. The crypto API's alloc_tfm path always selects the highest-priority
registered driver, so hardware acceleration is transparent to callers.
9.1.8 FIPS Mode
FIPS mode is a runtime configuration flag set during early boot from the kernel command line
(umka.fips=1) or a UEFI variable. Once enabled, it cannot be disabled without a reboot.
// umka-core/src/crypto/fips.rs
/// Set once at boot. Subsequent reads are a single atomic load (acquire).
static FIPS_MODE: AtomicBool = AtomicBool::new(false);
/// Returns true if FIPS mode is active. Hot path: one acquire-load.
#[inline]
pub fn crypto_fips_enabled() -> bool {
FIPS_MODE.load(Ordering::Acquire)
}
/// Called once during early boot, before any crypto allocations.
pub fn crypto_fips_enable() {
FIPS_MODE.store(true, Ordering::Release);
}
In FIPS mode:
- Algorithm lookup filters out any descriptor lacking CryptoAlgFlags::FIPS_APPROVED. The following algorithms are approved under NIST SP 800-131A Rev. 2 (2019) and SP 800-131A Rev. 3 (draft, 2024): AES-128/192/256 (all approved modes), SHA-256/384/512, SHA-3-256/384/512, HMAC-SHA-256/384/512, AES-GCM, AES-CCM, RSA (≥2048 bits), ECDSA/ECDH (P-256, P-384), Ed25519 (FIPS 186-5), ML-KEM-512/768/1024 (FIPS 203), ML-DSA-44/65/87 (FIPS 204), SLH-DSA (FIPS 205).
- Algorithms explicitly disallowed: MD5, SHA-1 (signature generation), RC4, DES, 3DES (new applications), RSA < 2048 bits, ECDH/ECDSA on curves below P-256.
Note: FIPS approval status changes over time as NIST publishes new standards. The approved-algorithm list in UmkaOS is maintained as a compile-time table in umka-core/src/crypto/fips_approved.rs and updated with each relevant NIST publication. Do not hard-code FIPS approval decisions in algorithm registration sites; derive them from that central table.
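One way to follow the advice in the note is to have registration sites ask the central table for the flag instead of setting it themselves. The sketch below is illustrative only: the table contents, `is_fips_approved`, and `with_fips_flag` are hypothetical, and the real fips_approved.rs layout is not specified in this chapter. The flag value matches `CryptoAlgFlags::FIPS_APPROVED`.

```rust
/// Hypothetical excerpt of the central approved-algorithm table.
pub const FIPS_APPROVED_NAMES: &[&str] = &[
    "sha256", "sha384", "aes-gcm", "ml-kem-768", "ml-dsa-65",
];

/// Bit value of `CryptoAlgFlags::FIPS_APPROVED`.
pub const FLAG_FIPS_APPROVED: u32 = 0x0001;

/// Look the canonical algorithm name up in the central table.
pub fn is_fips_approved(name: &str) -> bool {
    FIPS_APPROVED_NAMES.contains(&name)
}

/// Derive the final flag word for a descriptor: the FIPS bit is set or
/// cleared according to the table, whatever the caller passed in.
pub fn with_fips_flag(name: &str, base_flags: u32) -> u32 {
    if is_fips_approved(name) {
        base_flags | FLAG_FIPS_APPROVED
    } else {
        base_flags & !FLAG_FIPS_APPROVED
    }
}
```

Clearing the bit for unlisted names means a stale `FIPS_APPROVED` in a driver cannot outlive a table update.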
9.1.9 sysfs Interface
Registered algorithms are exposed read-only under /sys/kernel/umka/crypto/algorithms/.
Each algorithm has a directory named <driver_name> containing:
/sys/kernel/umka/crypto/algorithms/
aesni-gcm/
name "aes-gcm"
driver "aesni-gcm"
type "aead"
priority 400
flags "hw_accel,fips_approved,inplace"
refcount 3
soft-ml-kem-768/
name "ml-kem-768"
driver "soft-ml-kem-768"
type "kpp"
priority 200
flags "sw_impl,fips_approved,pqc"
refcount 0
...
The refcount attribute reflects the number of live transform objects backed by that
implementation. This is diagnostic only; it is subject to TOCTOU races and must not be used
for resource accounting.
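The flags attribute in the listing is a comma-separated rendering of the `CryptoAlgFlags` bits. The sketch below reproduces the two strings shown above. It assumes an output order (implementation kind, then FIPS status, then the remaining flags) inferred from those two examples; `FLAG_NAMES`, `render_flags`, and the token spellings for `template` and `dying` are hypothetical.

```rust
/// (bit, sysfs token) pairs in assumed output order. Bit values are the
/// ones defined for `CryptoAlgFlags`.
const FLAG_NAMES: &[(u32, &str)] = &[
    (0x0002, "sw_impl"),
    (0x0004, "hw_accel"),
    (0x0001, "fips_approved"),
    (0x0008, "template"),
    (0x0010, "dying"),
    (0x0020, "inplace"),
    (0x0040, "pqc"),
];

/// Render a flag word as the comma-separated string shown in sysfs.
pub fn render_flags(bits: u32) -> String {
    FLAG_NAMES
        .iter()
        .filter(|(bit, _)| bits & bit != 0)
        .map(|(_, name)| *name)
        .collect::<Vec<_>>()
        .join(",")
}
```

Unknown bits are silently dropped in this sketch; a real attribute might prefer to print them in hexadecimal so that new flags stay visible to older tools.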
Cross-references:
- Section 8.2 (08-security.md): Verified boot uses ML-DSA-65 (Akcipher) and SHA-384 (Shash)
- Section 8.4 (08-security.md): IMA uses SHA-256 and SHA-384 via Shash
- Section 8.5 (08-security.md): PQC algorithm definitions (ML-KEM, ML-DSA)
- Section 8.6 (08-security.md): SEV-SNP/TDX use AES-256-GCM (Aead)
- Section 9.2: Key retention service AsymmetricKey type uses Akcipher transforms
- Section 11.1 (10-drivers.md): Ring buffer infrastructure used by async hardware accelerators
- Section 14.4 (14-storage.md): NVMe TLS/auth uses AES-GCM and ML-KEM via this API
- Section 15.X (15-networking.md): kTLS uses ChaCha20-Poly1305 and AES-GCM via Aead
9.1.10 AF_ALG — Userspace Crypto via Sockets
AF_ALG (Linux 2.6.38+) exposes the kernel crypto API to userspace via a socket interface. Userspace programs access kernel-implemented cryptographic algorithms, including hardware-accelerated implementations, without writing their own crypto code. Existing userspace consumers include cryptsetup (through its kernel crypto backend) and the OpenSSL afalg engine.
Socket Setup
/// AF_ALG socket address (matches Linux struct sockaddr_alg).
#[repr(C)]
pub struct SockaddrAlg {
/// Address family: AF_ALG = 38.
pub salg_family: u16,
/// Algorithm type: "hash", "skcipher", "aead", "rng".
pub salg_type: [u8; 14],
/// Feature bits (currently unused, must be 0).
pub salg_feat: u32,
/// Algorithm mask (currently unused, must be 0).
pub salg_mask: u32,
/// Algorithm name (e.g., "sha256", "aes-cbc", "chacha20-poly1305", "stdrng").
pub salg_name: [u8; 64],
}
Usage pattern:
1. socket(AF_ALG, SOCK_SEQPACKET, 0) → returns a bind socket fd (no data transferred here)
2. bind(bind_fd, &SockaddrAlg { salg_type: "hash", salg_name: "sha256", .. }, sizeof) → selects the algorithm
3. For ciphers: setsockopt(bind_fd, SOL_ALG, ALG_SET_KEY, key, key_len) → set the key
4. For AEAD: setsockopt(bind_fd, SOL_ALG, ALG_SET_AEAD_AUTHSIZE, NULL, authsize) → set tag length
5. accept(bind_fd, NULL, NULL) → returns an op socket fd (one per concurrent operation)
6. sendmsg(op_fd, msg, 0) → provide input data and control messages
7. recvmsg(op_fd, msg, 0) → receive output data
The bind socket is shared across threads. Each accept() creates an independent operation socket that maintains its own IV and transform state. Multiple concurrent operations require multiple accept() calls.
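As a minimal sketch of step 2, the address passed to bind() can be built by copying the type and name strings into the fixed-size, NUL-terminated fields. The helper names `fixed_cstr` and `sockaddr_alg` are hypothetical and not part of the UmkaOS API; the struct is repeated only so the example is self-contained.

```rust
pub const AF_ALG: u16 = 38;

/// Mirrors the `SockaddrAlg` layout shown above (88 bytes under `repr(C)`).
#[repr(C)]
pub struct SockaddrAlg {
    pub salg_family: u16,
    pub salg_type: [u8; 14],
    pub salg_feat: u32,
    pub salg_mask: u32,
    pub salg_name: [u8; 64],
}

/// Copy `s` into a zero-filled buffer, leaving room for a trailing NUL.
/// Returns `None` if `s` is too long or contains an interior NUL.
fn fixed_cstr<const N: usize>(s: &str) -> Option<[u8; N]> {
    let bytes = s.as_bytes();
    if bytes.len() >= N || bytes.contains(&0) {
        return None;
    }
    let mut out = [0u8; N];
    out[..bytes.len()].copy_from_slice(bytes);
    Some(out)
}

/// Hypothetical helper: build the address passed to `bind()` in step 2.
pub fn sockaddr_alg(alg_type: &str, alg_name: &str) -> Option<SockaddrAlg> {
    Some(SockaddrAlg {
        salg_family: AF_ALG,
        salg_type: fixed_cstr::<14>(alg_type)?,
        salg_feat: 0,
        salg_mask: 0,
        salg_name: fixed_cstr::<64>(alg_name)?,
    })
}
```

Rejecting over-long names up front keeps the terminator intact, so the kernel side never has to guess where the string ends.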
Socket-Level Options (SOL_ALG)
pub const SOL_ALG: i32 = 279;
pub const ALG_SET_KEY: i32 = 1; // set cipher/MAC key (getsockopt: read current key length)
pub const ALG_SET_IV: i32 = 2; // set IV via cmsg ALG_SET_IV control message
pub const ALG_SET_OP: i32 = 3; // set direction via cmsg: ALG_OP_ENCRYPT / ALG_OP_DECRYPT
pub const ALG_SET_AEAD_ASSOCLEN: i32 = 4; // set AEAD associated-data length via cmsg
pub const ALG_SET_AEAD_AUTHSIZE: i32 = 5; // set AEAD authentication tag size in bytes
pub const ALG_SET_DRBG_ENTROPY: i32 = 6; // seed RNG with entropy (privileged, for testing)
pub const ALG_OP_DECRYPT: u32 = 0;
pub const ALG_OP_ENCRYPT: u32 = 1;
Control Messages (sendmsg cmsg)
/// IV control message (type = ALG_SET_IV).
/// Sent as ancillary data in sendmsg to set the IV for this operation.
#[repr(C)]
pub struct AlgIv {
/// IV length in bytes (must match algorithm's expected IV size).
pub ivlen: u32,
/// IV bytes (variable length: ivlen bytes follow this field).
pub iv: [u8; 0],
}
/// Operation direction message (type = ALG_SET_OP).
/// Contains a single u32: ALG_OP_ENCRYPT or ALG_OP_DECRYPT.
MSG_MORE: Setting MSG_MORE in sendmsg flags indicates more data follows for this operation (multi-call streaming for hash update or large cipher blocks). Only the final sendmsg (without MSG_MORE) triggers computation.
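The AlgIv payload is a length prefix followed by the IV bytes, which can be illustrated with a small encoder and decoder. This is a sketch of the payload only: the surrounding cmsg header and its alignment padding are omitted, and the function names are hypothetical.

```rust
/// Serialise the `AlgIv` payload: a native-endian `u32` length followed by
/// exactly that many IV bytes. Returns `None` if the length overflows `u32`.
pub fn encode_alg_iv(iv: &[u8]) -> Option<Vec<u8>> {
    if iv.len() > u32::MAX as usize {
        return None;
    }
    let ivlen = iv.len() as u32;
    let mut buf = Vec::with_capacity(4 + iv.len());
    buf.extend_from_slice(&ivlen.to_ne_bytes());
    buf.extend_from_slice(iv);
    Some(buf)
}

/// Inverse of `encode_alg_iv`. Rejects payloads whose declared length does
/// not match the number of bytes that actually follow the prefix.
pub fn decode_alg_iv(payload: &[u8]) -> Option<&[u8]> {
    if payload.len() < 4 {
        return None;
    }
    let (head, iv) = payload.split_at(4);
    let ivlen = u32::from_ne_bytes([head[0], head[1], head[2], head[3]]) as usize;
    if iv.len() == ivlen {
        Some(iv)
    } else {
        None
    }
}
```

A receiver that validates the declared length against the bytes present avoids reading past the control message when ivlen is wrong.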
Algorithm Types
| salg_type | Key? | IV? | Use case |
|---|---|---|---|
| "hash" | Optional (for HMAC) | No | SHA-256, SHA-384, BLAKE2b, HMAC-SHA256 |
| "skcipher" | Yes | Yes | AES-CBC, AES-CTR, ChaCha20, AES-XTS |
| "aead" | Yes | Yes | AES-GCM, ChaCha20-Poly1305, AES-CCM |
| "rng" | No (optional seed) | No | stdrng, drbg_pr_ctr_aes256 |
All algorithms registered in the kernel crypto API (§9.1) are accessible via AF_ALG, including hardware-accelerated implementations (AES-NI, Intel QAT, AMD CCP). The kernel automatically selects the fastest available implementation.
Zero-Copy Path
For large data (e.g., full-disk encryption buffers), AF_ALG supports zero-copy via vmsplice() + splice():
1. vmsplice(pipe_write_fd, iov, iov_count, SPLICE_F_GIFT) → transfer user pages to a pipe without copying
2. splice(pipe_read_fd, NULL, op_fd, NULL, len, 0) → feed pipe data to AF_ALG input without copying
3. splice(op_fd, NULL, pipe_write_fd, NULL, len, 0) → read output from AF_ALG without copying
This avoids any kernel↔userspace data copy for bulk operations, achieving near-hardware throughput.
Security Model
- No privilege required for standard algorithms. Any process may use AF_ALG with any registered algorithm.
- Privileged algorithms: algorithms requiring CAP_SYS_ADMIN to use (currently none in the standard registry; this mechanism is reserved for test-only algorithms that bypass FIPS constraints).
- Key secrecy: the key set via ALG_SET_KEY is not accessible from userspace after being set (the setsockopt path is write-only). The kernel holds the key in struct af_alg_ctx allocated in kernel memory; it is not pinned into a keyring and is freed when the bind socket is closed.
- Algorithm access control: LSM hooks (§8.7) can gate AF_ALG socket creation by algorithm name, allowing policy-based restrictions on which algorithms are available to which processes (e.g., FIPS mode that only allows FIPS-approved algorithms).
- No TOCTOU: the bind socket locks in the algorithm at bind() time; subsequent key or IV changes on the bind socket do not affect already-accept()ed op sockets.
Linux Compatibility
- Same AF_ALG = 38 socket family constant
- Same SockaddrAlg struct layout (salg_family, salg_type[14], salg_feat, salg_mask, salg_name[64])
- Same SOL_ALG socket-level options (279)
- Same ALG_SET_KEY, ALG_SET_IV, ALG_SET_OP, ALG_SET_AEAD_AUTHSIZE values
- Same MSG_MORE streaming semantics
- Same zero-copy splice() path
- cryptsetup's kernel crypto backend uses AF_ALG for userspace-side cipher operations such as LUKS keyslot handling and benchmarking (bulk disk I/O goes through dm-crypt in the kernel, not AF_ALG)
- OpenSSL 1.1.0+ ships an afalg engine that offloads AES-CBC to the kernel
- Go programs can reach AF_ALG through the golang.org/x/sys/unix socket wrappers; the standard crypto/tls package does not use it
9.2 Kernel Key Retention Service
The Key Retention Service stores cryptographic keys and opaque credentials in kernel memory,
where they are inaccessible to userspace except through the tightly-controlled keyctl()
syscall interface. The service provides durable, referenceable key handles that persist across
file descriptors, survive fork/exec under controlled conditions, and integrate with the LSM
framework (Section 8.7) for fine-grained access control.
Callers in the kernel that use the key retention service: NVMe TLS authentication
(Section 14.4) stores TLS client certificates as AsymmetricKey entries; RPCSEC_GSS
(NFS Kerberos) caches service tickets as LogonKey entries; dm-crypt stores volume master
keys as LogonKey entries; IMA (Section 8.4) stores measurement policy signing keys; driver
signing (Section 8.2) stores the .builtin_trusted_keys and .secondary_trusted_keys
keyrings; and the TPM subsystem (Section 8.3) stores sealed blobs as EncryptedKey entries
whose payload is protected by the TPM's storage root key.
9.2.1 Key Object
// umka-core/src/keys/key.rs
/// Globally unique key serial number.
/// Assigned monotonically from an atomic counter at key creation time.
/// Userspace references keys by serial number in `keyctl()` calls.
/// Zero is not a valid serial.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct KeySerial(pub u32);
/// Permissions bitfield for a key. Modelled after Linux's key permission
/// word (see `man 7 keyrings`). Four subjects (possessor, user, group, other)
/// each with six permission bits.
#[derive(Clone, Copy, Debug)]
#[repr(transparent)]
pub struct KeyPerm(pub u32);
impl KeyPerm {
// Possessor bits (bits 24-29)
pub const POSS_VIEW: u32 = 0x0100_0000; // see key attributes
pub const POSS_READ: u32 = 0x0200_0000; // read key payload
pub const POSS_WRITE: u32 = 0x0400_0000; // update/instantiate key
pub const POSS_SEARCH: u32 = 0x0800_0000; // find via keyring search
pub const POSS_LINK: u32 = 0x1000_0000; // link into a keyring
pub const POSS_SETATTR: u32 = 0x2000_0000; // set timeout, perms, uid/gid
// User bits (bits 16-21)
pub const USER_VIEW: u32 = 0x0001_0000;
pub const USER_READ: u32 = 0x0002_0000;
pub const USER_WRITE: u32 = 0x0004_0000;
pub const USER_SEARCH: u32 = 0x0008_0000;
pub const USER_LINK: u32 = 0x0010_0000;
pub const USER_SETATTR: u32 = 0x0020_0000;
// Group bits (bits 8-13)
pub const GROUP_VIEW: u32 = 0x0000_0100;
pub const GROUP_READ: u32 = 0x0000_0200;
pub const GROUP_WRITE: u32 = 0x0000_0400;
pub const GROUP_SEARCH: u32 = 0x0000_0800;
pub const GROUP_LINK: u32 = 0x0000_1000;
pub const GROUP_SETATTR: u32 = 0x0000_2000;
// Other bits (bits 0-5)
pub const OTHER_VIEW: u32 = 0x0000_0001;
pub const OTHER_READ: u32 = 0x0000_0002;
pub const OTHER_WRITE: u32 = 0x0000_0004;
pub const OTHER_SEARCH: u32 = 0x0000_0008;
pub const OTHER_LINK: u32 = 0x0000_0010;
pub const OTHER_SETATTR: u32 = 0x0000_0020;
}
/// Key lifecycle state.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum KeyState {
/// Created but not yet instantiated. `request_key()` upcalls put a key
/// here while waiting for the userspace helper to provide the payload.
Uninstantiated = 0,
/// Fully instantiated and available for use.
Instantiated = 1,
/// Negative instantiation: the key does not exist (caches a lookup failure
/// to prevent repeated upcalls). Requests for this key return `-ENOKEY`.
Negative = 2,
/// Payload revoked by `KEYCTL_REVOKE`. Metadata readable; payload gone.
Revoked = 3,
/// Refcount reached zero; being freed. Not reachable from the key table.
Dead = 4,
}
/// Core key object. Lives in the global key table (`KEY_TABLE`) as long as
/// `refcount > 0`. Keyrings hold strong references to keys they link to.
pub struct Key {
/// Globally unique serial. Assigned at creation, immutable thereafter.
pub serial: KeySerial,
/// Key type implementation. Points to a static `KeyType` vtable.
/// Immutable after creation.
pub key_type: &'static dyn KeyType,
/// Human-readable description. Set at creation, readable by `KEYCTL_DESCRIBE`.
/// May contain structured data (e.g., `"nfs@server.example.com:krb5"`).
/// Maximum 4096 bytes.
pub description: Box<str>,
/// Type-specific payload, protected by this field's `SpinLock` (referred to
/// elsewhere as `payload_lock`). On revocation the type's `revoke()` method is
/// called to zeroize key material and the payload is replaced with `None`.
pub payload: SpinLock<Option<Box<dyn Any + Send>>>,
/// Owner user ID (initial namespace).
pub uid: UserId,
/// Owner group ID (initial namespace).
pub gid: GroupId,
/// Access control: who may perform which operations.
pub perm: KeyPerm,
/// Absolute expiry time. `None` = does not expire. When expired, the
/// key transitions to `Negative` on next access and the payload is
/// destroyed. The garbage collector also reaps expired keys.
pub expiry: Option<MonotonicInstant>,
/// Strong reference count. Includes: key table entry (1), each keyring
/// link (1 per link), each in-progress `keyctl()` call (1).
pub refcount: AtomicU32,
/// Lifecycle state.
pub state: AtomicU8, // KeyState
/// Quota: which UID's quota this key counts against.
/// Usually equals `uid`; may differ for keys created on behalf of another
/// UID (e.g., by the request-key helper running as root).
pub quota_uid: UserId,
/// Size of the payload in bytes, for quota accounting.
/// Updated atomically when the payload is set.
pub payload_bytes: AtomicU32,
}
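The KeyPerm layout packs the same six operation bits into four 8-bit fields, so a permission test reduces to a shift and a mask. The sketch below assumes the caller has already been classified into one subject class; the `Subject` enum and `perm_allows` function are hypothetical names used for illustration.

```rust
/// Which of the four subject classes the caller falls into.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Subject {
    Possessor,
    User,
    Group,
    Other,
}

// Per-subject operation bits; the same values appear in every 8-bit field.
pub const VIEW: u32 = 0x01;
pub const READ: u32 = 0x02;
pub const WRITE: u32 = 0x04;
pub const SEARCH: u32 = 0x08;
pub const LINK: u32 = 0x10;
pub const SETATTR: u32 = 0x20;

/// True if every bit in `op` is granted to `subject` by the permission word.
pub fn perm_allows(perm: u32, subject: Subject, op: u32) -> bool {
    let shift = match subject {
        Subject::Possessor => 24,
        Subject::User => 16,
        Subject::Group => 8,
        Subject::Other => 0,
    };
    let granted = (perm >> shift) & 0x3f;
    (granted & op) == op
}
```

For example, a word of 0x3f01_0000 grants the possessor everything and the owning user only VIEW.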
9.2.2 Key Types
// umka-core/src/keys/types.rs
/// Type implementation for a class of keys.
///
/// Each concrete key type implements this trait. Vtable pointers are stable
/// for the kernel lifetime (static implementations only).
pub trait KeyType: Send + Sync {
/// Short name used in `keyctl(KEYCTL_DESCRIBE)` output and in the
/// `add_key()` syscall `type` argument (e.g., `"user"`, `"logon"`,
/// `"asymmetric"`). Maximum 32 bytes.
fn name(&self) -> &'static str;
/// Instantiate the key from raw payload bytes supplied by userspace
/// (via `add_key()`) or by the request-key helper (via
/// `KEYCTL_INSTANTIATE`). The type parses and validates `data`,
/// returning a heap-allocated payload on success.
///
/// # Errors
/// - `EINVAL`: data is malformed for this key type.
/// - `ENOMEM`: allocation failed.
fn instantiate(
&self,
key: &Key,
data: &[u8],
) -> Result<Box<dyn Any + Send>, KernelError>;
/// Update the payload of an already-instantiated key.
/// Not all types support update; return `Err(EOPNOTSUPP)` if not.
fn update(
&self,
key: &Key,
data: &[u8],
) -> Result<Box<dyn Any + Send>, KernelError>;
/// Revoke the key: destroy key material. Called with `payload_lock` held.
/// Must zeroize sensitive data before returning.
fn revoke(&self, key: &Key);
/// Final destruction: called when `refcount` reaches zero after revocation.
/// At this point the payload is already `None`; the type may release any
/// external resources (e.g., TPM NV slot).
fn destroy(&self, key: &Key);
/// Produce a human-readable description for `KEYCTL_DESCRIBE`.
/// Format: `"<type>;<uid>;<gid>;<perm>;<description>"`.
fn describe(&self, key: &Key, buf: &mut dyn core::fmt::Write) -> core::fmt::Result;
/// Read the key payload back to userspace (for `KEYCTL_READ`).
/// Not all types permit this. `LogonKey` returns `Err(EACCES)`.
/// The output is type-specific: `UserKey` returns raw bytes; `AsymmetricKey`
/// returns the public key in SubjectPublicKeyInfo DER.
fn read(
&self,
key: &Key,
buf: &mut [u8],
) -> Result<usize, KernelError>;
}
/// Opaque binary blob. Userspace writes the payload; userspace may also read
/// it back (subject to the `*_READ` bit in `KeyPerm` for the caller's subject
/// class). Used for passwords, tokens, and arbitrary secrets where the kernel
/// does not interpret the content.
pub struct UserKey;
/// Write-only credential. Userspace can search and link `LogonKey` entries
/// but cannot read the payload (`read()` returns `EACCES`). Used for
/// Kerberos service tickets, NVMe TLS PSKs, and dm-crypt volume keys
/// where the kernel uses the payload but userspace must not extract it.
pub struct LogonKey;
/// Key that points to another keyring, forming the keyring tree.
/// The payload is a `Vec<KeyRef>` — an ordered list of links to other keys.
/// Supports: `KEYCTL_LINK`, `KEYCTL_UNLINK`, `KEYCTL_CLEAR`, `KEYCTL_SEARCH`.
pub struct KeyringKey;
/// Asymmetric key. Payload is an `AkCipherTfm` (from Section 9.1)
/// backed by a parsed key (RSA, ECDSA, Ed25519, or ML-DSA public/private key
/// pair). Supports `KEYCTL_PKEY_SIGN`, `KEYCTL_PKEY_VERIFY`,
/// `KEYCTL_PKEY_ENCRYPT`, `KEYCTL_PKEY_DECRYPT` via the Crypto API.
/// `read()` returns the public key in SubjectPublicKeyInfo DER.
pub struct AsymmetricKey;
/// Encrypted key. The raw key material is encrypted under a master key
/// (another key in the kernel key service, typically a TPM-bound key).
/// The encrypted blob is the on-disk/in-swap representation; the plaintext
/// is only present in kernel memory as long as the key is instantiated.
/// Payload format: `struct EncryptedKeyPayload { ct: Vec<u8>, iv: [u8; 12] }`.
/// Master key is referenced by serial; if the master key is revoked, this
/// key cannot be decrypted and transitions to `Negative`.
pub struct EncryptedKey;
/// DNS resolver key. Description is a DNS name + query type (e.g.,
/// `"server.example.com"` or `"_ldap._tcp.example.com srva"`).
/// Payload is a serialised DNS response (A/AAAA/SRV records).
/// Auto-expires at the TTL of the DNS response. On expiry, a `request_key()`
/// upcall refreshes the entry.
pub struct DnsResolverKey;
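The difference between UserKey and LogonKey on the read path can be sketched with a reduced trait. `KeyTypeRead` is a hypothetical stand-in that keeps only `name` and `read` and passes the payload explicitly; the short-buffer behaviour (returning EINVAL) is an assumption made for the example, not something this chapter specifies.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum KernelError {
    EACCES,
    EINVAL,
}

/// Reduced stand-in for the `KeyType` trait: only `name` and `read`.
pub trait KeyTypeRead {
    fn name(&self) -> &'static str;
    /// Copy the payload into `buf`, returning the number of bytes written.
    fn read(&self, payload: &[u8], buf: &mut [u8]) -> Result<usize, KernelError>;
}

pub struct UserKey;
pub struct LogonKey;

impl KeyTypeRead for UserKey {
    fn name(&self) -> &'static str {
        "user"
    }
    fn read(&self, payload: &[u8], buf: &mut [u8]) -> Result<usize, KernelError> {
        // Assumption: a buffer that is too small is an error in this sketch.
        if buf.len() < payload.len() {
            return Err(KernelError::EINVAL);
        }
        buf[..payload.len()].copy_from_slice(payload);
        Ok(payload.len())
    }
}

impl KeyTypeRead for LogonKey {
    fn name(&self) -> &'static str {
        "logon"
    }
    fn read(&self, _payload: &[u8], _buf: &mut [u8]) -> Result<usize, KernelError> {
        // Write-only credential: the payload never leaves the kernel.
        Err(KernelError::EACCES)
    }
}

/// Dispatch through the vtable, as a `KEYCTL_READ` handler would.
pub fn keyctl_read_sketch(
    kt: &dyn KeyTypeRead,
    payload: &[u8],
    buf: &mut [u8],
) -> Result<usize, KernelError> {
    kt.read(payload, buf)
}
```

Putting the denial in the type rather than in the permission word means no KeyPerm setting can make a logon payload readable.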
9.2.3 Keyring Hierarchy
Keyrings are keys of type KeyringKey. They form a directed acyclic graph (search cycles are
forbidden and detected at KEYCTL_LINK time). The standard per-thread/process/session
hierarchy is established at task creation:
.builtin_trusted_keys .secondary_trusted_keys .ima_mok
│ │ │
└────────────────────────┼─────────────────────┘
│ (kernel-owned, read-only from userspace)
User persistent keyring ←───────┘
│
User keyring (per UID)
│
User session keyring ←── default session for login shells
│
Session keyring (per login session, replaced by pam_keyinit)
│
Process keyring (per process group, optional)
│
Thread keyring (per thread, optional)
Special kernel-owned keyrings:
- .builtin_trusted_keys: populated at build time from X.509 certificates embedded in the kernel image. Contains the distribution CA and UmkaOS signing key. Read-only; no userspace links permitted.
- .secondary_trusted_keys: populated at runtime from KEYCTL_RESTRICT_KEYRING-restricted keyring; allows MOK (Machine Owner Key) certificates to be added without rebuilding the kernel. Restricted: additions require a valid signature from a key already in .builtin_trusted_keys.
- .ima_mok: IMA's machine owner key ring. Keys added here affect IMA policy (Section 8.4). Requires CAP_SYS_ADMIN to modify.
- .nvme: NVMe authentication keyring. Populated by the nvme_keyring module from /etc/nvme/hostkey.pem and /etc/nvme/hostsym.conf via request_key() upcall on first NVMe TLS connection attempt.
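The cycle check performed at KEYCTL_LINK time can be sketched as a reachability search: linking a child into a keyring creates a cycle exactly when the keyring is already reachable from the child. The `Links` map and `link_would_cycle` function are hypothetical; the real keyring payload is the `Vec<KeyRef>` described in Section 9.2.2.

```rust
use std::collections::{HashMap, HashSet};

/// Keyring serial -> serials of the keys linked into it.
/// Only keyrings have entries; leaf keys are absent from the map.
pub type Links = HashMap<u32, Vec<u32>>;

/// True if linking `child` into `keyring` would make `keyring` reachable
/// from itself. Iterative depth-first search with a visited set.
pub fn link_would_cycle(links: &Links, keyring: u32, child: u32) -> bool {
    let mut stack = vec![child];
    let mut seen = HashSet::new();
    while let Some(k) = stack.pop() {
        if k == keyring {
            return true;
        }
        if !seen.insert(k) {
            continue; // already explored
        }
        if let Some(children) = links.get(&k) {
            stack.extend(children.iter().copied());
        }
    }
    false
}
```

The search is iterative rather than recursive so that a deep keyring chain cannot exhaust a fixed-size kernel stack.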
9.2.4 Key Quotas
To prevent denial-of-service via key exhaustion, each UID has a quota:
// umka-core/src/keys/quota.rs
/// Per-UID key quota.
#[derive(Debug)]
pub struct KeyQuota {
/// UID this quota applies to.
pub uid: UserId,
/// Number of keys currently charged to this UID.
pub key_count: AtomicU32,
/// Total payload bytes currently charged to this UID.
pub payload_bytes: AtomicU64,
/// Maximum number of keys allowed. Default: 200.
/// Configurable via `/proc/sys/kernel/keys/maxkeys`.
pub max_keys: u32,
/// Maximum payload bytes allowed. Default: 20 * 1024 (20 KiB).
/// Configurable via `/proc/sys/kernel/keys/maxbytes`.
pub max_bytes: u64,
}
/// Global quota table. One entry per active UID; entries created on first
/// key allocation for a UID and freed when all keys for that UID are gone.
static KEY_QUOTA_TABLE: RwSpinLock<BTreeMap<UserId, KeyQuota>> = ...;
/// Charge quota before creating a key. Returns `Err(EDQUOT)` if the UID
/// would exceed either limit.
pub fn key_quota_charge(uid: UserId, payload_bytes: u32) -> Result<(), KernelError>;
/// Release quota when a key is destroyed.
pub fn key_quota_release(uid: UserId, payload_bytes: u32);
Root (UID 0) is exempt from key quotas. Kernel-internal keys (with quota_uid set to a
sentinel value KERNEL_KEY_UID) do not count against any user's quota.
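One way to implement the charge and release pair without holding a lock across both counters is to reserve optimistically and roll back on overflow. This is a sketch under that assumption: the method names, the `Edquot` error type, and the `usage` accessor are hypothetical, and the root and KERNEL_KEY_UID exemptions are left out.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

#[derive(Debug, PartialEq, Eq)]
pub struct Edquot;

pub struct KeyQuota {
    key_count: AtomicU32,
    payload_bytes: AtomicU64,
    max_keys: u32,
    max_bytes: u64,
}

impl KeyQuota {
    pub fn new(max_keys: u32, max_bytes: u64) -> Self {
        KeyQuota {
            key_count: AtomicU32::new(0),
            payload_bytes: AtomicU64::new(0),
            max_keys,
            max_bytes,
        }
    }

    /// Reserve one key slot and `bytes` of payload, undoing the reservation
    /// if either limit would be exceeded.
    pub fn charge(&self, bytes: u32) -> Result<(), Edquot> {
        let keys = self.key_count.fetch_add(1, Ordering::AcqRel) + 1;
        if keys > self.max_keys {
            self.key_count.fetch_sub(1, Ordering::AcqRel);
            return Err(Edquot);
        }
        let add = bytes as u64;
        let total = self.payload_bytes.fetch_add(add, Ordering::AcqRel) + add;
        if total > self.max_bytes {
            self.payload_bytes.fetch_sub(add, Ordering::AcqRel);
            self.key_count.fetch_sub(1, Ordering::AcqRel);
            return Err(Edquot);
        }
        Ok(())
    }

    /// Return a previously charged key slot and its payload bytes.
    pub fn release(&self, bytes: u32) {
        self.payload_bytes.fetch_sub(bytes as u64, Ordering::AcqRel);
        self.key_count.fetch_sub(1, Ordering::AcqRel);
    }

    /// Current (key count, payload bytes) snapshot.
    pub fn usage(&self) -> (u32, u64) {
        (
            self.key_count.load(Ordering::Acquire),
            self.payload_bytes.load(Ordering::Acquire),
        )
    }
}
```

The counters can briefly overshoot while a failing charge is being rolled back, so a concurrent caller may see a spurious EDQUOT near the limit; a design that cannot tolerate that would use a compare-and-swap loop instead.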
9.2.5 The keyctl() Syscall
keyctl() is the primary userspace interface. The first argument selects the operation;
subsequent arguments are operation-specific.
// umka-compat/src/syscall/keyctl.rs
/// Dispatch table for keyctl operations.
/// Each entry maps an operation constant to a handler function.
fn keyctl_dispatch(
op: u32,
arg2: usize,
arg3: usize,
arg4: usize,
arg5: usize,
task: &Task,
) -> Result<isize, KernelError> {
match op {
// Return the key ID of one of the special keyrings.
// arg2: KEY_SPEC_* constant (negative values).
// Returns the key serial as a positive isize.
KEYCTL_GET_KEYRING_ID => keyctl_get_keyring_id(arg2 as i32, task),
// Join or create the named session keyring.
// arg2: pointer to name string (NULL = anonymous).
KEYCTL_JOIN_SESSION_KEYRING => keyctl_join_session_keyring(arg2 as *const u8, task),
// Update a key's payload. arg2: key serial, arg3: payload ptr, arg4: payload len.
KEYCTL_UPDATE => keyctl_update(KeySerial(arg2 as u32), arg3 as *const u8, arg4 as u32, task),
// Revoke a key. arg2: key serial.
KEYCTL_REVOKE => keyctl_revoke(KeySerial(arg2 as u32), task),
// Return a description string. arg2: serial, arg3: buf ptr, arg4: buf size.
KEYCTL_DESCRIBE => keyctl_describe(KeySerial(arg2 as u32), arg3 as *mut u8, arg4 as u32, task),
// Clear all links in a keyring. arg2: keyring serial.
KEYCTL_CLEAR => keyctl_clear(KeySerial(arg2 as u32), task),
// Link a key into a keyring. arg2: key serial, arg3: keyring serial.
KEYCTL_LINK => keyctl_link(KeySerial(arg2 as u32), KeySerial(arg3 as u32), task),
// Unlink a key from a keyring. arg2: key serial, arg3: keyring serial.
KEYCTL_UNLINK => keyctl_unlink(KeySerial(arg2 as u32), KeySerial(arg3 as u32), task),
// Search keyrings for a key. arg2: keyring serial, arg3: type ptr,
// arg4: description ptr, arg5: destination keyring serial.
KEYCTL_SEARCH => keyctl_search(
KeySerial(arg2 as u32),
arg3 as *const u8,
arg4 as *const u8,
KeySerial(arg5 as u32),
task,
),
// Read a key's payload. arg2: serial, arg3: buf ptr, arg4: buf len.
KEYCTL_READ => keyctl_read(KeySerial(arg2 as u32), arg3 as *mut u8, arg4 as u32, task),
// Instantiate a key from the request-key helper.
// arg2: serial, arg3: payload ptr, arg4: payload len, arg5: keyring serial.
KEYCTL_INSTANTIATE => keyctl_instantiate(
KeySerial(arg2 as u32),
arg3 as *const u8,
arg4 as u32,
KeySerial(arg5 as u32),
task,
),
// Negatively instantiate (mark as not-found). arg2: serial,
// arg3: timeout_secs, arg4: keyring serial.
KEYCTL_NEGATE => keyctl_negate(
KeySerial(arg2 as u32),
arg3 as u32,
KeySerial(arg4 as u32),
task,
),
// Set the default keyring for implicit key requests.
// arg2: KEY_REQKEY_DEFL_* constant.
KEYCTL_SET_REQKEY_KEYRING => keyctl_set_reqkey_keyring(arg2 as i32, task),
// Set key expiry. arg2: serial, arg3: timeout_secs (0 = no expiry).
KEYCTL_SET_TIMEOUT => keyctl_set_timeout(KeySerial(arg2 as u32), arg3 as u32, task),
// Assume authority over an uninstantiated key (used by request-key helper).
KEYCTL_ASSUME_AUTHORITY => keyctl_assume_authority(KeySerial(arg2 as u32), task),
// Get the LSM security label for a key. arg2: serial,
// arg3: buf ptr, arg4: buf len.
KEYCTL_GET_SECURITY => keyctl_get_security(KeySerial(arg2 as u32), arg3 as *mut u8, arg4 as u32, task),
// DH key derivation. arg2: pointer to keyctl_dh_params struct.
KEYCTL_DH_COMPUTE => keyctl_dh_compute(arg2 as *const KeyctlDhParams, task),
// Public key operations. arg2: serial, arg3: pointer to keyctl_pkey_params,
// arg4: info ptr, arg5: in/out ptrs encoded in the params struct.
KEYCTL_PKEY_QUERY => keyctl_pkey_query(KeySerial(arg2 as u32), arg3 as *const KeyctlPkeyParams, task),
KEYCTL_PKEY_ENCRYPT => keyctl_pkey_encrypt(KeySerial(arg2 as u32), arg3 as *const KeyctlPkeyParams, task),
KEYCTL_PKEY_DECRYPT => keyctl_pkey_decrypt(KeySerial(arg2 as u32), arg3 as *const KeyctlPkeyParams, task),
KEYCTL_PKEY_SIGN => keyctl_pkey_sign(KeySerial(arg2 as u32), arg3 as *const KeyctlPkeyParams, task),
KEYCTL_PKEY_VERIFY => keyctl_pkey_verify(KeySerial(arg2 as u32), arg3 as *const KeyctlPkeyParams, task),
_ => Err(KernelError::EOPNOTSUPP),
}
}
Special key ID constants, as defined by Linux <linux/keyctl.h>:
/// Refers to the calling thread's own thread keyring.
pub const KEY_SPEC_THREAD_KEYRING: i32 = -1;
/// Refers to the calling process's own process keyring.
pub const KEY_SPEC_PROCESS_KEYRING: i32 = -2;
/// Refers to the calling process's session keyring.
pub const KEY_SPEC_SESSION_KEYRING: i32 = -3;
/// Refers to the calling process's user keyring.
pub const KEY_SPEC_USER_KEYRING: i32 = -4;
/// Refers to the calling process's user session keyring.
pub const KEY_SPEC_USER_SESSION_KEYRING: i32 = -5;
/// Refers to the calling process's group keyring.
pub const KEY_SPEC_GROUP_KEYRING: i32 = -6;
/// Refers to the assumed request_key() authorisation key.
pub const KEY_SPEC_REQKEY_AUTH_KEY: i32 = -7;
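Because special IDs are negative and real serials are positive, a handler can resolve either form with one function. The sketch below is an assumption about how that resolution might look: `TaskKeyrings`, `LookupError`, and `resolve_key_id` are hypothetical, missing keyrings are reported as ENOKEY rather than created on demand, and the group and auth-key IDs are simply rejected.

```rust
pub const KEY_SPEC_THREAD_KEYRING: i32 = -1;
pub const KEY_SPEC_PROCESS_KEYRING: i32 = -2;
pub const KEY_SPEC_SESSION_KEYRING: i32 = -3;
pub const KEY_SPEC_USER_KEYRING: i32 = -4;
pub const KEY_SPEC_USER_SESSION_KEYRING: i32 = -5;

/// Hypothetical per-task keyring slots; `None` means not yet created.
#[derive(Default)]
pub struct TaskKeyrings {
    pub thread: Option<u32>,
    pub process: Option<u32>,
    pub session: Option<u32>,
    pub user: Option<u32>,
    pub user_session: Option<u32>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LookupError {
    ENOKEY,
    EINVAL,
}

/// Map a userspace key ID (special constant or real serial) to a serial.
pub fn resolve_key_id(id: i32, t: &TaskKeyrings) -> Result<u32, LookupError> {
    if id > 0 {
        return Ok(id as u32); // ordinary serial: pass through
    }
    let slot = match id {
        KEY_SPEC_THREAD_KEYRING => t.thread,
        KEY_SPEC_PROCESS_KEYRING => t.process,
        KEY_SPEC_SESSION_KEYRING => t.session,
        KEY_SPEC_USER_KEYRING => t.user,
        KEY_SPEC_USER_SESSION_KEYRING => t.user_session,
        // Zero, the group keyring, the auth key, and unknown IDs are not
        // modelled in this sketch.
        _ => return Err(LookupError::EINVAL),
    };
    slot.ok_or(LookupError::ENOKEY)
}
```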
9.2.6 The request_key() Upcall
request_key() is the mechanism by which the kernel asks userspace to supply a key that is
not yet in any keyring reachable from the calling process. The upcall is blocking from the
requesting process's perspective: the kernel creates an uninstantiated key, suspends the
requesting thread, and resumes it when the key is instantiated (or the request fails).
request_key(type, description, callout_info, dest_keyring) algorithm:
1. SEARCH: Walk the keyring tree reachable from the calling thread
(thread keyring → process keyring → session keyring → user keyring
→ user session keyring → .builtin_trusted_keys).
For each keyring:
a. Lock the keyring for reading (RCU).
b. Iterate links; for each linked key matching (type, description):
- If state == Instantiated and not expired → return that key.
- If state == Negative → return Err(ENOKEY) immediately.
If found: optionally link into dest_keyring and return.
2. CREATE uninstantiated key:
- Allocate Key{serial=next_serial(), state=Uninstantiated, ...}.
- Insert into the global key table.
- Link into dest_keyring (if specified).
3. CONSTRUCT: Check if a call_sysrequest handler is registered for the type.
If none: mark the key Negative with a short timeout and return Err(ENOKEY).
4. CREATE auth key:
- Allocate a special `request_key_auth` key with:
- payload: (target_key_serial, type, description, callout_info)
- uid/gid: calling process's credentials
- perm: possessor=VIEW|READ|SEARCH, others=none
- Link the auth key into the kernel's `request_key_auth_keyring`.
5. FORK request-key helper:
- Fork a process running `/sbin/request-key`.
- Set its session keyring to a new keyring containing only the auth key.
- Pass environment: KEYCTL_REQUESTKEY_AUTH_KEY=<auth_key_serial>.
- The helper process looks up its auth key, reads the target description,
contacts the appropriate credential daemon (gssd, nvme-cli, etc.),
and calls:
keyctl(KEYCTL_INSTANTIATE, target_serial, payload, len, dest_keyring)
or on failure:
keyctl(KEYCTL_NEGATE, target_serial, timeout_secs, dest_keyring)
6. WAIT: The requesting thread waits (interruptible) on a wait queue
associated with the uninstantiated key. Timeout: 60 seconds by default.
On wake:
- If state == Instantiated: return key serial.
- If state == Negative: return Err(ENOKEY).
- If timeout: return Err(ETIMEDOUT); the key is marked Negative.
- If signal received: return Err(ERESTARTSYS).
7. CLEANUP: When the request-key helper process exits, the kernel:
- Revokes and destroys the auth key.
- If the target key is still Uninstantiated, marks it Negative.
Note: The /sbin/request-key binary is part of the keyutils package. Its configuration file /etc/request-key.conf maps (operation, type, description, callout-info) patterns to handler programs. For example, create krb5 nfs@* * /usr/sbin/rpc.gssd %k %d %c tells request-key to invoke rpc.gssd for Kerberos NFS tickets.
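Step 1 of the algorithm (SEARCH) can be sketched as a nested walk where the first usable match wins and a negative entry short-circuits the search. The types here are simplified assumptions: keyrings are plain vectors, expiry is a boolean, and `SearchOutcome` is a hypothetical result type.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum KeyState {
    Uninstantiated,
    Instantiated,
    Negative,
    Revoked,
    Dead,
}

pub struct Key {
    pub serial: u32,
    pub key_type: &'static str,
    pub description: String,
    pub state: KeyState,
    pub expired: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SearchOutcome {
    /// A live key was found: return it to the caller.
    Found(u32),
    /// A cached lookup failure was found: return Err(ENOKEY).
    NegativeHit,
    /// Nothing matched: continue with step 2 (create uninstantiated key).
    Miss,
}

/// Walk the keyrings in search order; within each keyring, walk its links.
pub fn search_keyrings(keyrings: &[Vec<Key>], ty: &str, desc: &str) -> SearchOutcome {
    for ring in keyrings {
        for key in ring
            .iter()
            .filter(|k| k.key_type == ty && k.description == desc)
        {
            match key.state {
                KeyState::Instantiated if !key.expired => {
                    return SearchOutcome::Found(key.serial)
                }
                KeyState::Negative => return SearchOutcome::NegativeHit,
                _ => {} // expired, revoked, dead, or still uninstantiated
            }
        }
    }
    SearchOutcome::Miss
}
```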
9.2.7 LSM Hooks
Key operations are mediated by LSM hook callouts (Section 8.7). The hooks allow MAC policies
(SELinux, AppArmor) to enforce additional constraints beyond the DAC KeyPerm checks:
// umka-core/src/keys/security.rs
/// Called when a new key is allocated, before quota charge.
/// LSM may set a security label on the key.
/// Returns `Ok(())` to allow, `Err(EACCES)` to deny.
pub fn security_key_alloc(
key: &Key,
cred: &TaskCredential,
flags: KeyAllocFlags,
) -> Result<(), KernelError>;
/// Called when a key's refcount drops to zero, just before deallocation.
pub fn security_key_free(key: &Key);
/// Called before each `keyctl()` operation.
/// `perm` is one of the `KeyPerm::*` bit constants for the requested operation.
/// Returns `Ok(())` to allow, `Err(EACCES)` to deny.
pub fn security_key_permission(
key_ref: &KeyRef,
cred: &TaskCredential,
perm: u32,
) -> Result<(), KernelError>;
/// Called during keyring search, once per candidate key.
/// LSM may suppress a key from search results (return `Err(EACCES)`)
/// without denying access to the key by direct serial reference.
pub fn security_keyring_search(
keyring: &Key,
key_type: &'static dyn KeyType,
description: &str,
) -> Result<(), KernelError>;
The DAC permission check (KeyPerm) is performed before the LSM hook. If DAC denies the
operation, the LSM hook is not called. This matches the Linux model and ensures that LSM
cannot be used to grant permissions beyond what DAC allows (MAC augments DAC, does not
override it).
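The ordering described above can be sketched as a short-circuit: the LSM hook is only invoked once the DAC bits have already granted the request. `key_access_check` is a hypothetical function, and the hook is modelled as a closure rather than the real `security_key_permission` signature.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct Eacces;

/// DAC first, then LSM. `granted` holds the permission bits that apply to
/// the caller's subject class; `requested` holds the bits the operation needs.
pub fn key_access_check<F>(granted: u32, requested: u32, lsm_hook: F) -> Result<(), Eacces>
where
    F: FnOnce() -> Result<(), Eacces>,
{
    if (granted & requested) != requested {
        // DAC denies: the LSM hook is never consulted.
        return Err(Eacces);
    }
    // DAC allows: the LSM may still deny, but cannot widen access.
    lsm_hook()
}
```

Because the hook can only turn an allow into a deny, no LSM policy can grant an operation that the permission word refuses.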
9.2.8 Integration: NVMe TLS Authentication
Section 14.4 describes NVMe over Fabrics with TLS. The key retention service provides the certificate and PSK storage:
Boot sequence for NVMe TLS:
1. nvme_keyring module loads → creates ".nvme" keyring (KeyringKey, uid=0,
perm=possessor:VIEW|READ|SEARCH, others:none).
2. NVMe initiator attempts connection to a target requiring TLS.
3. Before the TLS handshake, the initiator calls:
request_key("asymmetric", "nvme-tls:<hostnqn>", NULL, nvme_keyring_serial)
4. Key not found → request_key() upcall to /sbin/request-key.
5. request-key invokes /usr/lib/nvme/nvme-key-helper, which:
a. Reads /etc/nvme/hostkey.pem (PKCS#8 private key, ML-DSA-65 or RSA).
b. Parses the certificate and calls:
add_key("asymmetric", "nvme-tls:<hostnqn>", cert_der, len, nvme_keyring_serial)
6. The kernel's AsymmetricKey.instantiate() parses the DER via crypto_alloc_akcipher(),
stores the public key (for the peer to verify) and private key material.
7. The target's certificate is verified against ".builtin_trusted_keys" or
".secondary_trusted_keys".
8. The NVMe TLS layer calls keyctl_pkey_sign() to sign the TLS handshake
messages using the key obtained in step 6.
Key lifetime: Keys in ".nvme" have no expiry by default (the NVMe host certificate
is valid until the certificate's notAfter date, checked by the TLS layer independently).
On NVMe controller removal, the corresponding key is unlinked from ".nvme" but persists
until its refcount drops to zero (any in-flight TLS sessions holding a reference).
9.2.9 Integration: RPCSEC_GSS (NFS Kerberos)
NFS mounts with sec=krb5 use RPCSEC_GSS to authenticate each RPC call with a Kerberos
service ticket. Tickets are cached as LogonKey entries:
Per-NFS-operation key lookup:
1. rpcsec_gss_krb5 calls:
request_key("krb5", "nfs@<server>:<realm>", callout_info, session_keyring)
where callout_info encodes the desired enctypes and flags.
2. If a valid (not-expired) LogonKey exists: use its payload directly.
The payload is a serialised krb5_creds structure (TGS reply + session key).
Key expiry = Kerberos ticket expiry (from the ticket's endtime field).
3. On cache miss or expiry: request_key() upcall to /sbin/request-key.
4. request-key invokes rpc.gssd, which:
a. Locates the user's TGT in the ccache (FILE:/tmp/krb5cc_<uid> or
KEYRING:session:).
b. Requests a TGS ticket for the NFS service principal.
c. Serialises the ticket and calls:
add_key("krb5", "nfs@<server>:<realm>", ticket_data, len, session_keyring)
5. The LogonKey payload is set; rpcsec_gss_krb5 wakes and proceeds.
6. A userspace keyctl(KEYCTL_READ, ...) on this key returns Err(EACCES) because
LogonKey.read() denies userspace reads. rpcsec_gss_krb5, as an in-kernel caller
holding a direct Key reference, reads the payload via key.payload.lock().
Security: The LogonKey type's read() method always returns Err(EACCES) regardless
of KeyPerm bits. This ensures Kerberos session keys cannot be extracted by userspace
even if the calling process has the key's serial number.
Cross-references:
- Section 8.2 (08-security.md): .builtin_trusted_keys and .secondary_trusted_keys keyrings used in verified boot
- Section 8.3 (08-security.md): TPM-bound EncryptedKey entries; TPM storage root key seals payloads
- Section 8.4 (08-security.md): IMA uses .ima_mok keyring for measurement policy key verification
- Section 8.7 (08-security.md): LSM hooks gate all key operations
- Section 9.1: AsymmetricKey type uses AkCipherOps from the Crypto API
- Section 14.4 (14-storage.md): NVMe TLS uses .nvme keyring for host certificate storage
9.3 Seccomp-BPF Syscall Filter
Seccomp is a final backstop for syscall sandboxing: even when capabilities, LSM, and
DAC/MAC would permit an operation, the filter can still refuse the syscall. It is evaluated
at syscall entry, before the handler's own checks run, and provides a programmable per-task
filter that can restrict which syscalls a task is allowed to make. UmkaOS implements full
Linux seccomp compatibility — libseccomp, systemd seccomp profiles, Docker/containerd
seccomp JSON, and Chrome's sandbox all work without modification. The UmkaOS-native
improvement is JIT compilation of filters at install time, reducing per-syscall overhead
from ~50-200 ns (Linux interpreted BPF) to 2-5 ns (native code).
9.3.1 Entry Points
Seccomp state is set through two interfaces: the legacy prctl(2) interface and the
preferred seccomp(2) syscall (x86-64 syscall number 317). Both are fully supported.
prctl(2) interface (legacy, Linux compatible):
prctl(PR_SET_SECCOMP, SECCOMP_MODE_DISABLED, 0, 0, 0)
→ EINVAL (cannot revert to disabled once seccomp is active)
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0)
→ 0 on success; restricts task to read/write/exit/sigreturn only
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, uaddr)
→ 0 on success; installs BPF filter from struct sock_fprog at uaddr
The prctl interface is equivalent to seccomp(SECCOMP_SET_MODE_STRICT, 0, NULL) and
seccomp(SECCOMP_SET_MODE_FILTER, 0, uaddr) respectively. It is provided for backward
compatibility. New code should use the seccomp(2) syscall.
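The equivalence between the two entry points can be sketched as a translation from prctl arguments to seccomp(2) arguments. The constant values are the Linux ones; `SeccompArgs` and `prctl_to_seccomp` are hypothetical names used only for this illustration.

```rust
pub const SECCOMP_MODE_DISABLED: u32 = 0;
pub const SECCOMP_MODE_STRICT: u32 = 1;
pub const SECCOMP_MODE_FILTER: u32 = 2;

pub const SECCOMP_SET_MODE_STRICT: u32 = 0;
pub const SECCOMP_SET_MODE_FILTER: u32 = 1;

#[derive(Debug, PartialEq, Eq)]
pub struct Einval;

/// Arguments of the equivalent `seccomp(operation, flags, args)` call.
#[derive(Debug, PartialEq, Eq)]
pub struct SeccompArgs {
    pub operation: u32,
    pub flags: u32,
    pub args: usize,
}

/// Translate `prctl(PR_SET_SECCOMP, mode, uaddr)` into seccomp(2) arguments.
/// prctl cannot pass flags, so `flags` is always 0.
pub fn prctl_to_seccomp(mode: u32, uaddr: usize) -> Result<SeccompArgs, Einval> {
    match mode {
        SECCOMP_MODE_STRICT => Ok(SeccompArgs {
            operation: SECCOMP_SET_MODE_STRICT,
            flags: 0,
            args: 0,
        }),
        SECCOMP_MODE_FILTER => Ok(SeccompArgs {
            operation: SECCOMP_SET_MODE_FILTER,
            flags: 0,
            args: uaddr,
        }),
        // SECCOMP_MODE_DISABLED and unknown modes cannot be requested.
        _ => Err(Einval),
    }
}
```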
seccomp(2) syscall (preferred):
long seccomp(unsigned int operation, unsigned int flags, void *args);
Syscall numbers:
- x86-64: 317
- AArch64: 277
- ARMv7: 383
- RISC-V 64: 277
- All other arches: follow Linux's syscall table for the respective architecture
The seccomp(2) syscall is the preferred interface because it exposes flags that
prctl cannot pass and supports operations beyond mode setting.
9.3.2 seccomp() Operations
SECCOMP_SET_MODE_STRICT (operation = 0)
Equivalent to prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT). The flags argument must
be 0 and args must be NULL; any other value returns EINVAL.
After this call, the task is restricted to four syscalls: read, write,
exit, and rt_sigreturn. Any other syscall results in SIGKILL of the
task. This mode cannot be undone.
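Strict mode amounts to a fixed allowlist check at syscall entry. The sketch below uses the x86-64 syscall numbers and follows the four-entry list given for the prctl interface (read, write, exit, sigreturn); `StrictVerdict` and `strict_mode_check` are hypothetical names.

```rust
// x86-64 syscall numbers for the strict-mode allowlist.
pub const NR_READ: u64 = 0;
pub const NR_WRITE: u64 = 1;
pub const NR_RT_SIGRETURN: u64 = 15;
pub const NR_EXIT: u64 = 60;

#[derive(Debug, PartialEq, Eq)]
pub enum StrictVerdict {
    /// Let the syscall proceed.
    Allow,
    /// Deliver SIGKILL to the task.
    Kill,
}

/// Strict-mode decision for a single syscall number.
pub fn strict_mode_check(nr: u64) -> StrictVerdict {
    match nr {
        NR_READ | NR_WRITE | NR_RT_SIGRETURN | NR_EXIT => StrictVerdict::Allow,
        _ => StrictVerdict::Kill,
    }
}
```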
SECCOMP_SET_MODE_FILTER (operation = 1)
Install a cBPF filter program. The args pointer must point to a struct sock_fprog:
struct sock_fprog {
unsigned short len; /* number of filter instructions */
struct sock_filter *filter; /* pointer to filter instructions */
};
The following flags are supported:
| Flag | Value | Meaning |
|---|---|---|
| SECCOMP_FILTER_FLAG_TSYNC | 0x1 | Synchronise filter to all threads of the process |
| SECCOMP_FILTER_FLAG_LOG | 0x2 | Log all allowed syscalls (even those not matching a log action) |
| SECCOMP_FILTER_FLAG_SPEC_ALLOW | 0x4 | Disable Spectre-class mitigations in the syscall path for this task |
| SECCOMP_FILTER_FLAG_NEW_LISTENER | 0x8 | Return a file descriptor for userspace notifications |
| SECCOMP_FILTER_FLAG_NOTIFY_ADDFD | 0x20 | Allow SECCOMP_IOCTL_NOTIF_ADDFD (companion to NEW_LISTENER) |
When SECCOMP_FILTER_FLAG_NEW_LISTENER is set, the return value on success is a file
descriptor (not 0). This fd is epoll-able and readable; it delivers seccomp_notif
structures when the filter returns SECCOMP_RET_USER_NOTIF.
When SECCOMP_FILTER_FLAG_TSYNC is set, the new filter is installed atomically on all
threads of the calling process. If any thread has a more restrictive mode than the
caller, the operation fails with ESRCH. The installation is all-or-nothing: if it fails
for any thread, no thread is updated.
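As a minimal sketch of how the flags argument might be decoded, the following uses the bit values from the table above. `FilterFlags`, `Einval`, and `parse_filter_flags` are hypothetical names introduced for illustration, and rejecting unknown bits with EINVAL is an assumption rather than documented UmkaOS behaviour.

```rust
// Flag values copied from the table in this section.
pub const SECCOMP_FILTER_FLAG_TSYNC: u32 = 0x1;
pub const SECCOMP_FILTER_FLAG_LOG: u32 = 0x2;
pub const SECCOMP_FILTER_FLAG_SPEC_ALLOW: u32 = 0x4;
pub const SECCOMP_FILTER_FLAG_NEW_LISTENER: u32 = 0x8;
pub const SECCOMP_FILTER_FLAG_NOTIFY_ADDFD: u32 = 0x20;

const KNOWN_FLAGS: u32 = SECCOMP_FILTER_FLAG_TSYNC
    | SECCOMP_FILTER_FLAG_LOG
    | SECCOMP_FILTER_FLAG_SPEC_ALLOW
    | SECCOMP_FILTER_FLAG_NEW_LISTENER
    | SECCOMP_FILTER_FLAG_NOTIFY_ADDFD;

/// Hypothetical decoded view of the `flags` argument (not an UmkaOS type).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct FilterFlags {
    pub tsync: bool,
    pub log: bool,
    pub spec_allow: bool,
    pub new_listener: bool,
    pub notify_addfd: bool,
}

/// Hypothetical error type standing in for EINVAL.
#[derive(Debug, PartialEq, Eq)]
pub struct Einval;

pub fn parse_filter_flags(flags: u32) -> Result<FilterFlags, Einval> {
    // Assumption: any bit outside the documented set is rejected.
    if flags & !KNOWN_FLAGS != 0 {
        return Err(Einval);
    }
    Ok(FilterFlags {
        tsync: flags & SECCOMP_FILTER_FLAG_TSYNC != 0,
        log: flags & SECCOMP_FILTER_FLAG_LOG != 0,
        spec_allow: flags & SECCOMP_FILTER_FLAG_SPEC_ALLOW != 0,
        new_listener: flags & SECCOMP_FILTER_FLAG_NEW_LISTENER != 0,
        notify_addfd: flags & SECCOMP_FILTER_FLAG_NOTIFY_ADDFD != 0,
    })
}
```

Decoding once at the top of the installation path keeps the later steps free of raw bit tests.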
SECCOMP_GET_ACTION_AVAIL (operation = 2)
Checks whether a specific action code (passed via args as unsigned int *) is
supported by the kernel. Returns 0 if supported, EOPNOTSUPP if not. UmkaOS supports all
Linux-defined actions: Kill, KillProcess, Trap, Errno, Trace, Log, Allow, and Notify.
SECCOMP_GET_NOTIF_SIZES (operation = 3)
Fills in a struct seccomp_notif_sizes at the args pointer with the sizes of the
notification structures:
struct seccomp_notif_sizes {
__u16 seccomp_notif; /* sizeof(struct seccomp_notif) */
__u16 seccomp_notif_resp; /* sizeof(struct seccomp_notif_resp) */
__u16 seccomp_data; /* sizeof(struct seccomp_data) */
};
These sizes are stable across UmkaOS versions (same as Linux kernel 5.0+). Userspace libraries (libseccomp, gVisor) use this to verify ABI compatibility before using the notification interface.
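The sizes reported by this operation follow directly from the `#[repr(C)]` layouts given in Sections 9.3.3 and 9.3.10. The sketch below restates those structs and derives the three values; `seccomp_get_notif_sizes` is a hypothetical helper name, not a documented kernel function.

```rust
use std::mem::size_of;

#[repr(C)]
pub struct SeccompData {
    pub nr: i32,
    pub arch: u32,
    pub instruction_pointer: u64,
    pub args: [u64; 6],
}

#[repr(C)]
pub struct SeccompNotif {
    pub id: u64,
    pub pid: u32,
    pub flags: u32,
    pub data: SeccompData,
}

#[repr(C)]
pub struct SeccompNotifResp {
    pub id: u64,
    pub val: i64,
    pub error: i32,
    pub flags: u32,
}

/// Mirrors `struct seccomp_notif_sizes` from this section.
#[repr(C)]
#[derive(Debug, PartialEq, Eq)]
pub struct SeccompNotifSizes {
    pub seccomp_notif: u16,
    pub seccomp_notif_resp: u16,
    pub seccomp_data: u16,
}

/// Hypothetical helper: computes the values from the struct layouts.
pub fn seccomp_get_notif_sizes() -> SeccompNotifSizes {
    SeccompNotifSizes {
        seccomp_notif: size_of::<SeccompNotif>() as u16,
        seccomp_notif_resp: size_of::<SeccompNotifResp>() as u16,
        seccomp_data: size_of::<SeccompData>() as u16,
    }
}
```

With these layouts the result is 80, 24, and 64 bytes respectively.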
9.3.3 seccomp_data Struct (BPF Program Input)
Every time a BPF filter is evaluated, it receives a pointer to a seccomp_data struct
describing the syscall. This struct's layout is an ABI — it must match Linux exactly
because BPF programs compiled by libseccomp or by userspace policy engines reference
specific offsets into this struct.
/// Input passed to the seccomp BPF filter. Layout is ABI-stable and matches
/// Linux's `struct seccomp_data` exactly (do not reorder or add fields).
#[repr(C)]
pub struct SeccompData {
/// Syscall number (architecture-specific; matches the `arch` field).
pub nr: i32,
/// AUDIT_ARCH_* value identifying the calling ABI.
/// On x86-64: AUDIT_ARCH_X86_64 (0xC000003E).
/// On AArch64: AUDIT_ARCH_AARCH64 (0xC00000B7).
/// On ARMv7 compat: AUDIT_ARCH_ARM (0x40000028).
/// On RISC-V 64: AUDIT_ARCH_RISCV64 (0xC00000F3).
pub arch: u32,
/// Instruction pointer of the syscall instruction (not the return address).
pub instruction_pointer: u64,
/// Syscall arguments (up to 6). Arguments beyond the syscall's defined
/// argument count are zero-filled.
pub args: [u64; 6],
}
The arch field is filled with the task's current ABI identifier — not the kernel's
native architecture. On x86-64 UmkaOS, a task running in 32-bit compatibility mode
(ia32) receives AUDIT_ARCH_I386. On AArch64 UmkaOS, a 32-bit ARMv7 task running in
AArch32 EL0 compat mode receives AUDIT_ARCH_ARM. This allows seccomp filters to
correctly restrict 32-bit syscall numbers on 64-bit kernels.
UmkaOS note: SeccompData is allocated on the kernel stack of the syscall entry path.
It is populated before the filter chain is called and discarded afterward. It is never
heap-allocated and never escapes the syscall entry frame.
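Because filters address this struct by byte offset, it can help to see the offsets spelled out. The following sketch measures them from the `#[repr(C)]` definition above; `field_offsets`, `arg_lo_offset`, and `arg_hi_offset` are hypothetical helpers, and the low/high word split assumes a little-endian target.

```rust
#[repr(C)]
pub struct SeccompData {
    pub nr: i32,
    pub arch: u32,
    pub instruction_pointer: u64,
    pub args: [u64; 6],
}

/// Byte offsets of nr, arch, instruction_pointer, and args, measured
/// from a live instance rather than hard-coded.
pub fn field_offsets() -> [usize; 4] {
    let d = SeccompData { nr: 0, arch: 0, instruction_pointer: 0, args: [0; 6] };
    let base = std::ptr::addr_of!(d) as usize;
    [
        std::ptr::addr_of!(d.nr) as usize - base,
        std::ptr::addr_of!(d.arch) as usize - base,
        std::ptr::addr_of!(d.instruction_pointer) as usize - base,
        std::ptr::addr_of!(d.args) as usize - base,
    ]
}

/// Offset of the low 32 bits of args[n] (little-endian assumption).
pub const fn arg_lo_offset(n: u32) -> u32 {
    16 + 8 * n
}

/// Offset of the high 32 bits of args[n] (little-endian assumption).
pub const fn arg_hi_offset(n: u32) -> u32 {
    arg_lo_offset(n) + 4
}
```

A cBPF load is 32 bits wide, so a filter that inspects a full 64-bit argument needs two loads, one per half.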
9.3.4 BPF Wire Format
Seccomp filters are specified using classic BPF (cBPF), not eBPF. This is the same
BPF dialect used by SO_ATTACH_FILTER (socket filters). The BPF program is specified
as an array of sock_filter instructions:
struct sock_filter {
__u16 code; /* instruction opcode */
__u8 jt; /* jump-if-true offset */
__u8 jf; /* jump-if-false offset */
__u32 k; /* generic multiuse field */
};
The filter is delivered from userspace as a sock_fprog:
struct sock_fprog {
unsigned short len; /* number of sock_filter instructions */
struct sock_filter *filter; /* pointer to instructions */
};
Constraints (identical to Linux):
- Maximum filter length: 32767 instructions (BPF_MAXINSNS)
- Maximum filter chain depth: 512 filters (MAX_SECCOMP_FILTER_DEPTH)
- A filter must end with a BPF_RET instruction
- Load instructions may only access seccomp_data fields (by validated offset)
- Absolute memory loads (BPF_LD | BPF_ABS) are the primary access mechanism:
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)) loads the
syscall number into accumulator A
- Scratch memory (16 × u32 M[0..15]) is available
- No packet memory, no pointer arithmetic beyond scratch memory
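To make the wire format concrete, here is a sketch that builds a small filter as an array of instructions. `bpf_stmt` and `bpf_jump` mirror the C macros BPF_STMT and BPF_JUMP; `deny_syscall_filter` is a hypothetical helper, and the policy it encodes (kill on a foreign architecture, fail one syscall with an errno, allow the rest) is only an illustration.

```rust
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SockFilter {
    pub code: u16,
    pub jt: u8,
    pub jf: u8,
    pub k: u32,
}

// Opcode components from <linux/bpf_common.h>.
const BPF_LD: u16 = 0x00;
const BPF_JMP: u16 = 0x05;
const BPF_RET: u16 = 0x06;
const BPF_W: u16 = 0x00;
const BPF_ABS: u16 = 0x20;
const BPF_JEQ: u16 = 0x10;
const BPF_K: u16 = 0x00;

const AUDIT_ARCH_X86_64: u32 = 0xC000_003E;
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;
const OFF_NR: u32 = 0; // offsetof(seccomp_data, nr)
const OFF_ARCH: u32 = 4; // offsetof(seccomp_data, arch)

pub const fn bpf_stmt(code: u16, k: u32) -> SockFilter {
    SockFilter { code, jt: 0, jf: 0, k }
}

pub const fn bpf_jump(code: u16, k: u32, jt: u8, jf: u8) -> SockFilter {
    SockFilter { code, jt, jf, k }
}

/// Hypothetical example policy for x86-64 callers.
pub fn deny_syscall_filter(nr: u32, errno: u16) -> Vec<SockFilter> {
    vec![
        // A = seccomp_data.arch
        bpf_stmt(BPF_LD | BPF_W | BPF_ABS, OFF_ARCH),
        // Matching arch skips the kill; anything else falls through to it.
        bpf_jump(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        bpf_stmt(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        // A = seccomp_data.nr
        bpf_stmt(BPF_LD | BPF_W | BPF_ABS, OFF_NR),
        // Matching nr falls through to ERRNO; anything else skips to ALLOW.
        bpf_jump(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
        bpf_stmt(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | errno as u32),
        bpf_stmt(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    ]
}
```

Checking `arch` before `nr` matters because syscall numbers are only meaningful relative to the calling ABI.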
Validation (UmkaOS implementation):
UmkaOS validates the cBPF program using the same rules as Linux's bpf_check_classic():
- All branch targets must be forward-only (no backward jumps) and within bounds
- The program must terminate (no cycles possible with forward-only branches)
- All loads are range-checked against sizeof(SeccompData)
- Division-by-zero constants in BPF_ALU | BPF_DIV | BPF_K are rejected
- All instruction opcodes must be valid cBPF opcodes
If validation fails, SECCOMP_SET_MODE_FILTER returns EINVAL and the task's seccomp
state is unchanged.
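The rules above can be sketched as a single pass over the instruction array. This is a simplified illustration, not the UmkaOS validator: `validate_cbpf`, `ValidateError`, and `insn` are hypothetical names, and the sketch accepts a narrower opcode set than bpf_check_classic (for example, it rejects BPF_LEN loads outright). The length limit is the one stated in this section.

```rust
#[derive(Clone, Copy)]
pub struct SockFilter { pub code: u16, pub jt: u8, pub jf: u8, pub k: u32 }

pub const fn insn(code: u16, jt: u8, jf: u8, k: u32) -> SockFilter {
    SockFilter { code, jt, jf, k }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ValidateError { BadLength, BadOpcode, BadLoad, BadScratch, BadJump, DivByZero, NoFinalRet }

const BPF_MAXINSNS: usize = 32767; // limit as stated in this section
const SECCOMP_DATA_SIZE: u32 = 64; // sizeof(SeccompData)
const BPF_MEMWORDS: u32 = 16; // scratch slots M[0..15]

// Instruction classes (code & 0x07); class 7 is BPF_MISC.
const LD: u16 = 0;
const LDX: u16 = 1;
const ST: u16 = 2;
const STX: u16 = 3;
const ALU: u16 = 4;
const JMP: u16 = 5;
const RET: u16 = 6;

pub fn validate_cbpf(prog: &[SockFilter]) -> Result<(), ValidateError> {
    use ValidateError::*;
    let len = prog.len();
    if len == 0 || len > BPF_MAXINSNS {
        return Err(BadLength);
    }
    for (i, f) in prog.iter().enumerate() {
        let (mode, op) = (f.code & 0xe0, f.code & 0xf0);
        // Jump offsets are unsigned, so targets are forward-only by
        // construction; only the upper bound needs checking.
        let in_bounds = |off: u64| (i as u64) + 1 + off < len as u64;
        match f.code & 0x07 {
            // Only word-sized loads are accepted in this sketch.
            LD | LDX if f.code & 0x18 != 0 => return Err(BadOpcode),
            LD | LDX => match mode {
                0x00 => {} // BPF_IMM
                0x60 if f.k < BPF_MEMWORDS => {} // BPF_MEM
                0x60 => return Err(BadScratch),
                // BPF_ABS (LD only): aligned and inside seccomp_data.
                0x20 if f.code == 0x20 => {
                    if f.k % 4 != 0 || f.k > SECCOMP_DATA_SIZE - 4 {
                        return Err(BadLoad);
                    }
                }
                _ => return Err(BadOpcode),
            },
            ST | STX => if f.k >= BPF_MEMWORDS { return Err(BadScratch); },
            ALU => {
                if op > 0xa0 { return Err(BadOpcode); }
                let by_const = f.code & 0x08 == 0; // BPF_K operand
                if (op == 0x30 || op == 0x90) && by_const && f.k == 0 {
                    return Err(DivByZero); // BPF_DIV or BPF_MOD by zero
                }
            }
            JMP => match op {
                0x00 => if !in_bounds(f.k as u64) { return Err(BadJump); }, // BPF_JA
                0x10 | 0x20 | 0x30 | 0x40 => {
                    if !in_bounds(f.jt as u64) || !in_bounds(f.jf as u64) {
                        return Err(BadJump);
                    }
                }
                _ => return Err(BadOpcode),
            },
            RET => {}
            // BPF_MISC: only TAX (0x07) and TXA (0x87) exist.
            _ => if f.code != 0x07 && f.code != 0x87 { return Err(BadOpcode); },
        }
    }
    if prog[len - 1].code & 0x07 != RET {
        return Err(NoFinalRet);
    }
    Ok(())
}
```

Every error variant here would surface to userspace as EINVAL; the finer-grained enum is only for readability.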
After validation: UmkaOS JIT-compiles the validated cBPF to native machine code.
The original sock_filter array is retained in CompiledFilter.bpf for audit and
debugging purposes but is not used during syscall dispatch.
9.3.5 Return Values (Actions)
Each BPF filter program returns a 32-bit value. The top 16 bits encode the action; the bottom 16 bits are action-specific data. The actions are evaluated in priority order: when a filter chain has multiple filters, the highest-priority action among all filters' return values wins.
/// Seccomp filter return actions, in descending priority order.
/// The numeric values match Linux's SECCOMP_RET_* constants exactly.
#[repr(u32)]
pub enum SeccompAction {
/// Kill the entire process (all threads). No signal is delivered;
/// the process exits with a core dump if core dumps are enabled.
KillProcess = 0x80000000,
/// Kill the calling thread only. SIGKILL is sent to the thread.
/// No signal handler runs; no userspace notification.
Kill = 0x00000000,
/// Deliver SIGSYS to the thread. The siginfo_t is filled with:
/// si_signo = SIGSYS
/// si_code = SYS_SECCOMP (1)
/// si_call_addr = seccomp_data.instruction_pointer
/// si_syscall = seccomp_data.nr
/// si_arch = seccomp_data.arch
Trap = 0x00030000,
/// Return -errno to the calling task. The syscall is not executed.
/// The low 16 bits are the errno value (must be in 1..=65535).
Errno(u16) /* = 0x00050000 | (errno & 0xffff) */,
/// Notify a registered userspace supervisor via a listener fd.
/// The task is suspended until the supervisor responds.
/// The listener fd is obtained via SECCOMP_FILTER_FLAG_NEW_LISTENER.
Notify = 0x7fc00000,
/// Notify a ptrace(2) tracer. The low 16 bits are the tracer message
/// (accessible via PTRACE_GETEVENTMSG). If no tracer is attached,
/// ENOSYS is returned to the task.
Trace(u16) /* = 0x7ff00000 | (id & 0xffff) */,
/// Log the syscall to the audit log, then allow it to proceed.
/// Respects the rate limit (max 10 log entries per second per task).
Log = 0x7ffc0000,
/// Allow the syscall to proceed. No logging, no overhead.
Allow = 0x7fff0000,
}
Priority order (highest to lowest). As on Linux, this is the order obtained by comparing the action values as signed 32-bit integers, smallest first:
1. KillProcess (0x80000000)
2. Kill (0x00000000)
3. Trap (0x00030000)
4. Errno (0x00050000 | errno)
5. Notify (0x7fc00000)
6. Trace (0x7ff00000 | id)
7. Log (0x7ffc0000)
8. Allow (0x7fff0000)
When a filter chain contains N filters, each filter is evaluated independently.
The returned values are collected, and the highest-priority action is applied.
A lower-numbered (more recently installed) filter's Allow cannot override an
outer filter's Errno — the outer filter's stricter action prevails.
This priority ordering is identical to Linux's and allows composing filters safely: a library can install an inner filter that allows its needed syscalls without being able to override the outer filter's restrictions.
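The composition property can be illustrated with a short sketch. It assumes the rule Linux uses to rank actions: mask off the data bits, reinterpret the action as a signed 32-bit integer, and let the smallest value win. `precedence_key` and `combine` are hypothetical names.

```rust
pub const SECCOMP_RET_ACTION_FULL: u32 = 0xffff_0000;
pub const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
pub const SECCOMP_RET_KILL_THREAD: u32 = 0x0000_0000;
pub const SECCOMP_RET_TRAP: u32 = 0x0003_0000;
pub const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
pub const SECCOMP_RET_LOG: u32 = 0x7ffc_0000;
pub const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;

/// Smaller key means higher precedence. The cast makes KillProcess
/// (0x80000000) negative, so it sorts ahead of Kill (0).
fn precedence_key(ret: u32) -> i32 {
    (ret & SECCOMP_RET_ACTION_FULL) as i32
}

/// Fold the return values of every filter in a chain into one result.
/// Starts from Allow, the lowest-precedence action.
pub fn combine(returns: &[u32]) -> u32 {
    let mut result = SECCOMP_RET_ALLOW;
    for &ret in returns {
        // Strict comparison: on a tie, the earlier value is kept.
        if precedence_key(ret) < precedence_key(result) {
            result = ret;
        }
    }
    result
}
```

Whatever order the filters run in, an Allow from one filter never displaces an Errno, Trap, or Kill from another, which is what makes stacking filters safe.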
9.3.6 Filter Chain Data Structures
/// A single installed seccomp filter (immutable after creation).
///
/// Filters form a singly-linked chain: each filter holds an optional Arc
/// to its parent (the filter that was active at the time of installation).
/// The chain grows toward older filters; the innermost (most recently
/// installed) filter is the head.
pub struct SeccompFilter {
/// JIT-compiled native code, or interpreted cBPF on arches without JIT.
code: Arc<CompiledFilter>,
/// The parent filter in the chain. None if this is the first filter.
parent: Option<Arc<SeccompFilter>>,
/// If true, log allowed syscalls regardless of the filter program's action.
/// Set by SECCOMP_FILTER_FLAG_LOG.
log: bool,
/// If true, disable Spectre-class mitigations for this task's syscall path.
/// Set by SECCOMP_FILTER_FLAG_SPEC_ALLOW.
allow_spec: bool,
/// Listener fd for SECCOMP_RET_USER_NOTIF delivery. Present only when
/// the filter was installed with SECCOMP_FILTER_FLAG_NEW_LISTENER.
notif_fd: Option<Arc<SeccompNotifFd>>,
}
/// JIT-compiled (or bytecode-retained) seccomp filter program.
pub struct CompiledFilter {
/// Executable region containing native machine code. The region is
/// write-protected (W^X) after compilation: it is writable during JIT
/// output, then the write permission is revoked before first use.
code: ExecutableRegion,
/// Original cBPF instructions retained for audit and debugging.
/// Not used during syscall dispatch in release builds.
bpf: Vec<SockFilter>,
/// Number of cBPF instructions in `bpf`.
bpf_len: u16,
}
/// Per-task seccomp state, embedded in `Task`.
pub struct SeccompState {
/// Innermost (most recently installed) filter in the chain.
/// None if the task has not installed any filter.
filter: Option<Arc<SeccompFilter>>,
/// Current seccomp mode.
mode: SeccompMode,
/// Whether PR_SET_NO_NEW_PRIVS has been set for this task.
/// Required before installing a filter without CAP_SYS_ADMIN.
no_new_privs: bool,
}
/// Seccomp operating mode.
#[repr(u8)]
pub enum SeccompMode {
/// No seccomp active. Syscall path skips all filter evaluation.
Disabled = 0,
/// Strict mode: only read/write/exit/exit_group/rt_sigreturn are allowed.
Strict = 1,
/// Filter mode: BPF filter chain is evaluated on every syscall.
Filter = 2,
}
SeccompState is embedded directly in the Task struct. Accessing the mode field on
the syscall fast path requires no pointer indirection beyond the Task pointer that is
already in a CpuLocal register. The mode check is a single byte comparison.
9.3.7 Filter Installation Algorithm (SECCOMP_SET_MODE_FILTER)
The installation path is per-task and requires no global locks. The BPF verifier and JIT compiler are purely local operations on data provided by the caller.
fn seccomp_set_mode_filter(task: &mut Task, flags: u32, fprog: UserPtr<SockFprog>) -> Result<i64> {
Step 1 — Privilege check:
The caller must satisfy at least one of:
- task.seccomp.no_new_privs == true (set by prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)), OR
- task.creds.has_cap(CAP_SYS_ADMIN) in the task's own user namespace
If neither condition is satisfied, return EPERM.
Step 2 — Mode transition check:
Seccomp mode is write-once-increasing:
- If task.seccomp.mode == SeccompMode::Strict and SECCOMP_SET_MODE_FILTER is requested,
return EINVAL. A task in strict mode cannot install a BPF filter.
- SECCOMP_SET_MODE_STRICT is accepted only when the current mode is Disabled or Strict (idempotent in strict mode); from Filter it returns EINVAL (see the mode table in Section 9.3.12).
Step 3 — Parse and copy sock_fprog from userspace:
Copy sock_fprog from userspace (validated pointer, no kernel alias). Check:
- len is in 1..=32767 (inclusive); reject 0 (empty program) and >32767 (BPF_MAXINSNS)
- Copy len × sizeof(sock_filter) bytes from filter pointer
Return EFAULT if any copy-from-user fails. Return EINVAL if len is out of range.
Step 4 — Validate cBPF program:
Run the cBPF validator (same rules as Linux's bpf_check_classic):
- All branch targets are forward-only and within bounds
- All load offsets are within sizeof(SeccompData)
- Division and modulo by zero constant are rejected
- The last instruction is BPF_RET
- All opcodes are valid cBPF opcodes
Return EINVAL if validation fails.
Step 5 — Check filter chain depth:
Count the current depth of task.seccomp.filter by walking the parent links.
If depth ≥ 512 (MAX_SECCOMP_FILTER_DEPTH), return E2BIG.
Step 6 — JIT compile:
let compiled = jit_compile_cbpf(&bpf_insns, target_arch)?;
// compiled.code is now write-protected and executable
On architectures with JIT support (x86-64, AArch64), this produces native code. On architectures without JIT (currently: ARMv7, RISC-V, PPC32, PPC64LE), the validator output is stored verbatim and the interpreter is used at runtime. JIT support for additional architectures is added as the architecture ports mature; the installation path is identical regardless.
If JIT compilation fails (out of memory for the executable region), return ENOMEM.
Step 7 — Construct SeccompFilter:
let new_filter = Arc::new(SeccompFilter {
code: Arc::new(compiled),
parent: task.seccomp.filter.clone(), // Arc::clone, no deep copy
log: flags & SECCOMP_FILTER_FLAG_LOG != 0,
allow_spec: flags & SECCOMP_FILTER_FLAG_SPEC_ALLOW != 0,
notif_fd: if flags & SECCOMP_FILTER_FLAG_NEW_LISTENER != 0 {
Some(Arc::new(SeccompNotifFd::new()))
} else {
None
},
});
Step 8 — Atomically install:
task.seccomp.filter = Some(new_filter.clone());
task.seccomp.mode = SeccompMode::Filter;
This is not a compare-and-swap because filter installation is a single-threaded
operation on the calling task's own SeccompState. Thread synchronisation is handled
in step 9 (TSYNC) if requested.
Step 9 — TSYNC (if SECCOMP_FILTER_FLAG_TSYNC):
Iterate all threads of the process (via the thread group list):
- For each thread, check that the thread's current filter chain is a prefix of the
new filter's chain (i.e., the thread has not installed filters the caller does not
have). If any thread's chain is incompatible, return ESRCH and roll back all changes.
- If all threads are compatible, install new_filter on each thread via the same
Arc::clone assignment. This is done under each thread's task lock; if any thread
exits during the operation, it is skipped (already exiting).
Step 10 — Return value:
- If SECCOMP_FILTER_FLAG_NEW_LISTENER was set: return the listener fd number (≥0)
- Otherwise: return 0
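Steps 5 and 9 both reduce to walking the parent links. The sketch below models only the `parent` field of SeccompFilter; `chain_depth`, `is_ancestor`, and `push` are hypothetical helpers, and using `Arc::ptr_eq` to test chain compatibility is an assumption about how the check in Step 9 could be done.

```rust
use std::sync::Arc;

/// Reduced model of SeccompFilter: only the parent link is kept.
pub struct SeccompFilter {
    pub parent: Option<Arc<SeccompFilter>>,
}

pub const MAX_SECCOMP_FILTER_DEPTH: usize = 512;

/// Step 5: count filters from the head to the oldest ancestor.
pub fn chain_depth(head: &Option<Arc<SeccompFilter>>) -> usize {
    let mut depth = 0;
    let mut cur = head.as_deref();
    while let Some(f) = cur {
        depth += 1;
        cur = f.parent.as_deref();
    }
    depth
}

/// Step 9: true if `thread_head` is `caller_head` itself or one of its
/// ancestors. A thread with no filter is compatible with any chain.
pub fn is_ancestor(
    thread_head: &Option<Arc<SeccompFilter>>,
    caller_head: &Option<Arc<SeccompFilter>>,
) -> bool {
    let target = match thread_head {
        Some(t) => t,
        None => return true,
    };
    let mut cur = caller_head.as_ref();
    while let Some(f) = cur {
        if Arc::ptr_eq(f, target) {
            return true;
        }
        cur = f.parent.as_ref();
    }
    false
}

/// Install a new filter on top of `parent` (Arc::clone, no deep copy).
pub fn push(parent: &Option<Arc<SeccompFilter>>) -> Option<Arc<SeccompFilter>> {
    Some(Arc::new(SeccompFilter { parent: parent.clone() }))
}
```

Both walks are linear in chain depth, which the limit of 512 keeps bounded.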
9.3.8 Syscall Interception Path
The seccomp check is inserted at the syscall entry point, after register saving and
argument marshalling, before dispatch to the syscall handler. It is architecture-specific
in its placement (each arch's entry.S/entry.rs calls seccomp_check_syscall) but the
check logic is shared.
/// Called from syscall entry with interrupts enabled, preemption disabled.
/// Returns SeccompVerdict::Allow to proceed with the syscall, or another
/// verdict for the entry path to handle.
#[inline(always)]
pub fn seccomp_check_syscall(task: &Task, data: &SeccompData) -> SeccompVerdict {
match task.seccomp.mode {
SeccompMode::Disabled => SeccompVerdict::Allow, // predicted branch, ~0 cycles
SeccompMode::Strict => seccomp_strict_check(data.nr),
SeccompMode::Filter => seccomp_filter_check(task, data),
}
}
Mode::Disabled fast path:
The Disabled arm is the predicted path: most processes never enable seccomp, so
syscall entry is laid out to fall through when the mode is Disabled. The mode byte
lives in SeccompState, which is embedded in Task alongside the other fields accessed
on the syscall entry path. On x86-64 the check is a cmp byte [task + offset], 0
followed by a jne to the slow path.
Mode::Strict check:
fn seccomp_strict_check(nr: i32) -> SeccompVerdict {
match nr as u32 {
SYS_READ | SYS_WRITE | SYS_EXIT | SYS_EXIT_GROUP | SYS_RT_SIGRETURN => {
SeccompVerdict::Allow
}
_ => SeccompVerdict::Kill,
}
}
Mode::Filter — filter chain evaluation:
fn seccomp_filter_check(task: &Task, data: &SeccompData) -> SeccompVerdict {
let mut result: u32 = SECCOMP_RET_ALLOW; // lowest priority: allow
// Walk the filter chain from innermost to outermost, collecting return values.
let mut filter_opt = task.seccomp.filter.as_ref();
while let Some(filter) = filter_opt {
// Call the JIT-compiled (or interpreted) filter function.
// JIT signature: extern "C" fn(*const SeccompData) -> u32
let ret = filter.code.call(data);
// Take the highest-priority action seen so far.
if seccomp_action_priority(ret) > seccomp_action_priority(result) {
result = ret;
}
filter_opt = filter.parent.as_ref(); // Option<&Arc<SeccompFilter>>, same type as above
}
seccomp_verdict_from_action(result, task)
}
The chain is walked innermost-first (most recently installed filter runs first). This matches Linux behaviour: inner filters can only further restrict, not expand.
Verdict dispatch:
After seccomp_filter_check returns a SeccompVerdict, the syscall entry path handles
the verdict:
| Verdict | Action |
|---|---|
| Allow | Proceed to syscall handler |
| Log | Write audit log entry, then proceed |
| Errno(e) | Return -e to userspace, skip syscall |
| Trap | Fill siginfo_t, deliver SIGSYS to task |
| Trace(id) | Notify ptrace tracer via ptrace_event, suspend task |
| Notify | Enqueue to SeccompNotifFd, suspend task, await response |
| Kill | Send SIGKILL to thread, do not return to userspace |
| KillProcess | Send SIGKILL to all threads, do not return to userspace |
9.3.9 JIT Compilation
UmkaOS JIT-compiles cBPF seccomp filters to native machine code at install time. The JIT
is invoked from seccomp_set_mode_filter (Section 9.3.7, Step 6) and produces
W^X-protected executable code.
Why JIT matters:
A cBPF filter of 30-100 instructions is common for realistic seccomp policies (e.g., systemd service filters, Docker default profiles). Under interpretation, every instruction costs one pass through the dispatch loop, for a total of ~50-200 ns per syscall. After JIT, the same filter runs in 2-5 ns. For workloads that make frequent syscalls (high-throughput servers, container runtimes) the saving scales with the syscall rate: at 100,000 syscalls per second it amounts to roughly 5-20 ms of CPU time per second per thread.
JIT properties:
| Property | Value |
|---|---|
| Input | Validated cBPF (after bpf_check_classic) |
| Output | Native machine code for the running architecture |
| Average expansion | x86-64: ~3 native instructions per cBPF instruction |
| Average expansion | AArch64: ~4 native instructions per cBPF instruction |
| Memory protection | W^X: write permission removed after compilation |
| Code cache | Per CompiledFilter instance (not shared across tasks) |
| JIT availability | x86-64, AArch64 (at launch); ARMv7, RISC-V, PPC added as arch ports mature |
| Fallback | cBPF interpreted via cbpf_interpret() on arches without JIT |
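For the fallback path, a minimal interpreter sketch is shown below. The name `cbpf_interpret` comes from the table above, but its signature here is an assumption, as are the `words` and `insn` helpers. The sketch supports only a subset of opcodes, assumes the program has already been validated, and returns 0 (Kill) for anything it does not recognise.

```rust
#[derive(Clone, Copy)]
pub struct SockFilter { pub code: u16, pub jt: u8, pub jf: u8, pub k: u32 }

pub struct SeccompData {
    pub nr: i32,
    pub arch: u32,
    pub instruction_pointer: u64,
    pub args: [u64; 6],
}

pub const fn insn(code: u16, jt: u8, jf: u8, k: u32) -> SockFilter {
    SockFilter { code, jt, jf, k }
}

/// View seccomp_data as sixteen 32-bit words (byte offset = index * 4),
/// using the little-endian split of each 64-bit field.
fn words(d: &SeccompData) -> [u32; 16] {
    let mut w = [0u32; 16];
    w[0] = d.nr as u32;
    w[1] = d.arch;
    w[2] = d.instruction_pointer as u32;
    w[3] = (d.instruction_pointer >> 32) as u32;
    for (i, a) in d.args.iter().enumerate() {
        w[4 + 2 * i] = *a as u32;
        w[5 + 2 * i] = (*a >> 32) as u32;
    }
    w
}

pub fn cbpf_interpret(prog: &[SockFilter], data: &SeccompData) -> u32 {
    let w = words(data);
    let (mut a, mut x) = (0u32, 0u32); // accumulator and index register
    let mut mem = [0u32; 16]; // scratch memory M[0..15]
    let mut pc = 0usize;
    while let Some(f) = prog.get(pc) {
        pc += 1;
        let (k, slot) = (f.k, (f.k & 15) as usize);
        match f.code {
            0x20 => a = w[((k / 4) & 15) as usize], // LD|W|ABS
            0x00 => a = k,                           // LD|IMM
            0x01 => x = k,                           // LDX|IMM
            0x60 => a = mem[slot],                   // LD|MEM
            0x61 => x = mem[slot],                   // LDX|MEM
            0x02 => mem[slot] = a,                   // ST
            0x03 => mem[slot] = x,                   // STX
            0x04 => a = a.wrapping_add(k),           // ALU|ADD|K
            0x54 => a &= k,                          // ALU|AND|K
            0x74 => a = a.checked_shr(k).unwrap_or(0), // ALU|RSH|K
            0x07 => x = a,                           // MISC|TAX
            0x87 => a = x,                           // MISC|TXA
            0x05 => pc = pc.saturating_add(k as usize), // JMP|JA
            0x15 | 0x25 | 0x35 | 0x45 => {
                let taken = match f.code {
                    0x15 => a == k,     // JEQ
                    0x25 => a > k,      // JGT
                    0x35 => a >= k,     // JGE
                    _ => a & k != 0,    // JSET
                };
                pc += if taken { f.jt as usize } else { f.jf as usize };
            }
            0x06 => return k, // RET|K
            0x16 => return a, // RET|A
            _ => return 0,    // unsupported opcode: fail closed
        }
    }
    0 // ran off the end: fail closed
}
```

The per-instruction `match` is the dispatch loop whose cost the JIT removes.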
JIT output calling convention:
/// JIT-compiled filter function signature. Called from seccomp_filter_check.
/// The function pointer is stored in CompiledFilter.code.
/// Calling convention: System V AMD64 ABI (x86-64), AAPCS64 (AArch64).
type FilterFn = extern "C" fn(data: *const SeccompData) -> u32;
The JIT allocates an ExecutableRegion:
pub struct ExecutableRegion {
/// Base of the mmap'd region (initially RW, then remapped RX after JIT).
base: NonNull<u8>,
/// Length of the region in bytes.
len: usize,
/// The callable function pointer, typed for safety.
fn_ptr: FilterFn,
}
impl ExecutableRegion {
/// Remap the region from RW to RX (W^X enforcement).
/// Called once after JIT output is complete.
pub fn seal(&mut self) -> Result<(), KernelError>;
/// Call the compiled filter with the given seccomp_data.
#[inline(always)]
pub fn call(&self, data: &SeccompData) -> u32 {
// SAFETY: fn_ptr is valid native code produced by the verified JIT,
// protected RX. data is a valid reference for the call duration.
unsafe { (self.fn_ptr)(data as *const SeccompData) }
}
}
/proc/sys/net/core/bpf_jit_enable:
UmkaOS always JIT-compiles seccomp filters on supported architectures regardless of this
sysctl value. The sysctl is accepted for compatibility with tools that check or set it,
and value 2 enables diagnostic JIT output (dumps generated native code to the kernel
log at KERN_DEBUG level). The sysctl does not affect seccomp JIT behaviour: seccomp
JIT is not user-configurable and cannot be disabled.
9.3.10 Userspace Notification (SECCOMP_USER_NOTIF)
The userspace notification interface (introduced in Linux 5.0) allows a privileged
supervisor process to intercept syscalls made by a sandboxed process, inspect them,
and inject an arbitrary return value. This is used by container runtimes (e.g.,
gVisor's runsc, sysbox) to handle syscalls that are safe in a supervised context
but would otherwise be blocked by a seccomp filter.
Data Structures
/// Notification fd: the supervisor's handle for receiving and responding to
/// seccomp notifications from a sandboxed task.
pub struct SeccompNotifFd {
/// Lock-free ring buffer for pending notifications (fast path).
/// Capacity 256: fits typical burst of notifications before supervisor wakes.
queue: SpscRing<SeccompNotif, 256>,
/// Map from notification id to the suspended task.
/// Consulted when the supervisor sends a response.
pending: Mutex<HashMap<u64, Arc<SuspendedTask>>>,
/// Wait queue: the supervisor blocks here when the queue is empty.
waiters: WaitQueue,
/// Monotonically increasing id generator. Each notification gets a unique id.
next_id: AtomicU64,
}
/// A pending notification sent to the supervisor.
/// Layout matches Linux's `struct seccomp_notif` exactly (ABI stable).
#[repr(C)]
pub struct SeccompNotif {
/// Unique notification ID. Used to correlate with the response.
pub id: u64,
/// PID of the sandboxed task making the syscall.
pub pid: u32,
/// Reserved flags (always 0 in current version).
pub flags: u32,
/// Syscall information (identical to what the BPF program receives).
pub data: SeccompData,
}
/// Response from the supervisor to the suspended task.
/// Layout matches Linux's `struct seccomp_notif_resp` exactly.
#[repr(C)]
pub struct SeccompNotifResp {
/// Must match the `id` field of the corresponding `SeccompNotif`.
pub id: u64,
/// Return value to inject (used when `error == 0`).
pub val: i64,
/// Errno to return (if non-zero, `val` is ignored and `-error` is returned).
pub error: i32,
/// Flags: SECCOMP_USER_NOTIF_FLAG_CONTINUE (0x1) = execute the real syscall.
pub flags: u32,
}
/// A task suspended waiting for a supervisor response.
pub struct SuspendedTask {
/// Waker to resume the task once a response arrives.
waker: TaskWaker,
/// Response slot: filled by the supervisor, read by the task on wakeup.
response: Option<SeccompNotifResp>,
}
UmkaOS improvement: SeccompNotifFd.queue is a lock-free SpscRing<SeccompNotif, 256>
(single-producer, single-consumer). The sandboxed task pushes notifications; the supervisor
pops them. This is lock-free on the fast path; the pending map (for response correlation)
is only touched under a Mutex on the slower supervisor-response path.
ioctls on the Listener fd
The listener fd returned by SECCOMP_FILTER_FLAG_NEW_LISTENER supports the following
ioctls (all defined in <linux/seccomp.h>):
SECCOMP_IOCTL_NOTIF_RECV (_IOWR(SECCOMP_IOC_MAGIC, 0, struct seccomp_notif)):
Dequeues one pending notification. Blocks if the queue is empty (respecting O_NONBLOCK
which returns EAGAIN if empty). On return, fills the userspace seccomp_notif struct
with the notification. The sandboxed task remains suspended until a response is sent.
SECCOMP_IOCTL_NOTIF_SEND (_IOWR(SECCOMP_IOC_MAGIC, 1, struct seccomp_notif_resp)):
Sends a response to a suspended notification. The id field must match a currently
suspended task. If flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE, the original syscall
is executed (the filter's NOTIFY action is overridden). Otherwise, val or -error
is returned to the task.
Returns ENOENT if the id is not found (task already timed out or was killed).
Returns EINPROGRESS if a response for this id has already been sent.
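How a response is turned into the resumed task's result can be sketched as a small pure function. `NotifOutcome` and `resolve_response` are hypothetical names; the precedence (CONTINUE first, then a non-zero `error`, then `val`) follows the field descriptions in this section.

```rust
pub const SECCOMP_USER_NOTIF_FLAG_CONTINUE: u32 = 0x1;

#[repr(C)]
pub struct SeccompNotifResp {
    pub id: u64,
    pub val: i64,
    pub error: i32,
    pub flags: u32,
}

/// Hypothetical result type for the resumed task.
#[derive(Debug, PartialEq, Eq)]
pub enum NotifOutcome {
    /// Execute the original syscall as if the filter had allowed it.
    Continue,
    /// Skip the syscall and hand this value back to userspace.
    Return(i64),
}

pub fn resolve_response(resp: &SeccompNotifResp) -> NotifOutcome {
    if resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE != 0 {
        NotifOutcome::Continue
    } else if resp.error != 0 {
        // A non-zero error takes precedence; val is ignored.
        NotifOutcome::Return(-(resp.error as i64))
    } else {
        NotifOutcome::Return(resp.val)
    }
}
```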
SECCOMP_IOCTL_NOTIF_ID_VALID (_IOW(SECCOMP_IOC_MAGIC, 2, __u64)):
Checks whether the notification with the given id is still valid (the suspended task
is still alive and waiting). Returns 0 if valid, ENOENT if not. The supervisor uses
this to detect tasks that exited while the supervisor was processing the notification.
SECCOMP_IOCTL_NOTIF_ADDFD (_IOW(SECCOMP_IOC_MAGIC, 3, struct seccomp_notif_addfd)):
Installs a file descriptor into the suspended task's file descriptor table. The
seccomp_notif_addfd struct specifies the fd to install and an optional target fd
number (or -1 for the lowest available). This enables the supervisor to inject fds
(e.g., a socket connected to a local service) into the sandboxed process, allowing
"fake" syscall implementations that return a real, usable fd.
Requires SECCOMP_FILTER_FLAG_NOTIFY_ADDFD to have been set at filter installation
time; returns EINVAL otherwise.
Notification Lifecycle
Sandboxed task:
1. Makes syscall; filter returns SECCOMP_RET_USER_NOTIF.
2. seccomp_filter_check builds SeccompNotif, enqueues to notif_fd.queue.
3. Wakes any waiting supervisor (notif_fd.waiters.wake_one()).
4. Registers self in notif_fd.pending[id] = SuspendedTask { waker, response: None }.
5. Suspends (yields CPU; preemptible).
Supervisor process:
6. Wakes on notif_fd (epoll, read, select, or blocking ioctl).
7. SECCOMP_IOCTL_NOTIF_RECV: dequeues SeccompNotif, inspects data.
8. (Optionally) SECCOMP_IOCTL_NOTIF_ID_VALID: verifies task still alive.
9. (Optionally) SECCOMP_IOCTL_NOTIF_ADDFD: installs fd into task.
10. SECCOMP_IOCTL_NOTIF_SEND: fills SuspendedTask.response, calls waker.
Sandboxed task (resumed):
11. Reads SuspendedTask.response from own task struct (no lock needed: only own task).
12. If SECCOMP_USER_NOTIF_FLAG_CONTINUE: execute original syscall normally.
Else: return val or -error to userspace.
13. Removes self from notif_fd.pending.
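The lifecycle can be modelled in userspace with two threads. This is a model, not kernel code: `std::sync::mpsc` channels stand in for the SpscRing queue and the pending map, and all names (`Notif`, `Resp`, `task_syscall`, `supervisor_loop`, `run_demo`) and the supervisor's policy are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Response from the supervisor (reduced SeccompNotifResp).
pub struct Resp {
    pub id: u64,
    pub val: i64,
    pub error: i32,
}

/// Notification plus a private reply channel, which plays the role of
/// the pending[id] entry.
pub struct Notif {
    pub id: u64,
    pub nr: i32,
    pub reply: mpsc::Sender<Resp>,
}

/// Sandboxed-task side: steps 1-5 and 11-12.
pub fn task_syscall(queue: &mpsc::Sender<Notif>, id: u64, nr: i32) -> i64 {
    let (reply_tx, reply_rx) = mpsc::channel();
    queue
        .send(Notif { id, nr, reply: reply_tx })
        .expect("supervisor is gone");
    // The task is suspended here until the supervisor responds.
    let resp = reply_rx.recv().expect("no response received");
    assert_eq!(resp.id, id);
    if resp.error != 0 { -(resp.error as i64) } else { resp.val }
}

/// Supervisor side: steps 6-10. Example policy: fail syscall 59 with
/// errno 1, return 0 for everything else.
pub fn supervisor_loop(queue: mpsc::Receiver<Notif>) {
    // The loop ends once every sender has been dropped.
    for n in queue {
        let error = if n.nr == 59 { 1 } else { 0 };
        let _ = n.reply.send(Resp { id: n.id, val: 0, error });
    }
}

/// Run one supervisor and two intercepted syscalls.
pub fn run_demo() -> (i64, i64) {
    let (tx, rx) = mpsc::channel();
    let sup = thread::spawn(move || supervisor_loop(rx));
    let denied = task_syscall(&tx, 1, 59);
    let faked = task_syscall(&tx, 2, 0);
    drop(tx);
    sup.join().unwrap();
    (denied, faked)
}
```

The per-notification reply channel makes the id check redundant in this model; in the kernel design the id is what ties a response back to its suspended task.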
9.3.11 SECCOMP_MODE_STRICT
Strict mode is the simplest seccomp mode and predates the BPF filter interface. It
allows a task to restrict itself to an absolute minimum set of syscalls with a single
prctl call. No filter program is needed; the allowlist is hard-coded in the kernel.
Allowed syscalls in strict mode:
| Syscall | x86-64 number | i386 number | Rationale |
|---|---|---|---|
| read | 0 | 3 | Read from existing fds |
| write | 1 | 4 | Write to existing fds |
| exit | 60 | 1 | Exit current thread |
| exit_group | 231 | 252 | Exit all threads of process |
| rt_sigreturn | 15 | 173 | Return from signal handler |
All other syscalls result in SIGKILL of the calling thread (not the process — the thread is killed, but sibling threads continue running unless they also make a disallowed syscall). This matches Linux's strict mode behaviour.
Note: syscall numbers differ between architectures (x86 32-bit vs x86-64 vs AArch64). UmkaOS uses the correct architecture-specific numbers for each ABI.
Use case: OpenSSH privilege-separated workers, minimal computation daemons, and processes that have already opened all needed fds and only need to read/write/exit. Strict mode is simpler to reason about than a BPF filter and cannot be misconfigured.
9.3.12 Inheritance and exec Semantics
Fork (clone(2) without CLONE_SECCOMP_NOFILTER):
The child inherits the parent's SeccompState with the filter chain shared via
Arc::clone(). No bytecode is copied; the child shares the same compiled filter code:
fn fork_seccomp(parent: &Task, child: &mut Task) {
child.seccomp = SeccompState {
filter: parent.seccomp.filter.clone(), // Arc refcount increment only
mode: parent.seccomp.mode,
no_new_privs: parent.seccomp.no_new_privs,
};
}
Fork is O(1) for seccomp state regardless of filter chain depth. The Arc refcount ensures the compiled filter code is kept alive as long as any task references it.
Thread creation (clone(CLONE_THREAD)):
Thread creation uses the same fork_seccomp path. All threads of a process share the
same filter chain by default. SECCOMP_FILTER_FLAG_TSYNC can be used after the fact
to synchronise a new filter to all threads.
exec (execve(2)):
The filter chain survives exec. This is intentional and matches Linux: a process that installs a seccomp filter before exec cannot have the filter stripped by the exec'd program. The exec'd program runs under the same (or more restrictive, if the exec'd program also calls seccomp) filter chain.
fn exec_seccomp(task: &mut Task) {
// The filter chain is preserved across exec unchanged.
// no_new_privs is also preserved (set before exec).
// Mode is preserved.
// Nothing to do here; SeccompState is unchanged by exec.
}
Mode monotonicity:
The seccomp mode can only increase:
| From \ To | Disabled | Strict | Filter |
|---|---|---|---|
| Disabled | EINVAL | Allowed | Allowed |
| Strict | EINVAL | Allowed (idempotent) | EINVAL |
| Filter | EINVAL | EINVAL | Allowed (adds to chain) |
A task cannot go from Filter back to Strict or from either back to Disabled. This
ensures that filters installed by sandboxing infrastructure cannot be removed by
sandboxed code. The mode field is written only in seccomp_set_mode_*, which is
always called on the task's own state under the task lock.
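The table translates directly into a transition check. In this sketch `check_mode_transition` and `Einval` are hypothetical names; the accepted pairs are exactly the "Allowed" cells above.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum SeccompMode {
    Disabled = 0,
    Strict = 1,
    Filter = 2,
}

/// Hypothetical error type standing in for EINVAL.
#[derive(Debug, PartialEq, Eq)]
pub struct Einval;

/// Returns Ok(()) for the transitions marked "Allowed" in the table.
pub fn check_mode_transition(from: SeccompMode, to: SeccompMode) -> Result<(), Einval> {
    use SeccompMode::*;
    match (from, to) {
        (Disabled, Strict) | (Disabled, Filter) => Ok(()),
        (Strict, Strict) => Ok(()), // idempotent
        (Filter, Filter) => Ok(()), // adds to the chain
        _ => Err(Einval),
    }
}
```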
no_new_privs:
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) sets SeccompState.no_new_privs = true.
This is required to install a seccomp filter without CAP_SYS_ADMIN. It also has
independent effects: exec'd programs do not gain privileges from set-uid/set-gid bits.
The flag is inherited by children (fork and thread) and is preserved across exec. It
cannot be unset.
9.3.13 Audit Logging
Seccomp events are written to the kernel audit log under two circumstances:
- SECCOMP_RET_LOG: The BPF filter program explicitly returns this action, which logs the syscall and then allows it to proceed.
- SECCOMP_FILTER_FLAG_LOG: Set at filter installation time. Every syscall that is allowed by the filter chain is also logged (regardless of which action matched).
Log entry format:
audit: type=1326 msg=audit(1708956031.442:1234): auid=1000 uid=1000 gid=1000 \
ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 \
pid=12345 comm="my_process" exe="/usr/bin/my_process" sig=0 \
arch=c000003e syscall=59 compat=0 ip=0x7f1234567890 code=0x7ffc0000
Field meanings:
- type=1326: AUDIT_SECCOMP event type (matches Linux)
- auid: Audit UID (login UID, set by PAM)
- uid/gid: Real UID/GID at time of syscall
- ses: Audit session ID
- subj: SELinux/AppArmor subject label (from LSM, empty if no LSM active)
- pid: Task PID
- comm: Task name (up to 15 bytes, truncated from Task.name)
- exe: Executable path (from Task.exe_path)
- sig: Signal delivered (0 if none, e.g., SIGSYS=31 for Trap, 9 for Kill)
- arch: AUDIT_ARCH_* value in hex (matching SeccompData.arch)
- syscall: Syscall number
- compat: 1 if the task was in 32-bit compatibility mode, 0 otherwise
- ip: Instruction pointer from SeccompData.instruction_pointer
- code: The SECCOMP_RET_* value returned by the filter in hex
Rate limiting:
To prevent a compromised or misbehaving task from flooding the audit log:
/// Per-task seccomp log rate limiter.
pub struct SeccompLogRateLimit {
/// Number of log entries written in the current second.
count: u32,
/// Start of the current one-second window (coarse monotonic clock, seconds).
window_start: u64,
}
const SECCOMP_LOG_RATE_LIMIT: u32 = 10;
If more than 10 events per second are generated by a single task, the excess events
are dropped and a single "N events suppressed" message is written at the end of the
window. This matches Linux's seccomp_log_rate_limit behaviour.
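One way the window logic could work is sketched below. The `suppressed` counter, the `check` method, and `LogDecision` are additions for illustration and not part of the struct shown above. The sketch also reports the suppressed count lazily, on the first event of the next window, which is an assumption about timing.

```rust
pub const SECCOMP_LOG_RATE_LIMIT: u32 = 10;

pub struct SeccompLogRateLimit {
    count: u32,
    window_start: u64,
    /// Hypothetical extra field: events dropped in the current window.
    suppressed: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LogDecision {
    Emit,
    Drop,
}

impl SeccompLogRateLimit {
    pub fn new(now_secs: u64) -> Self {
        SeccompLogRateLimit { count: 0, window_start: now_secs, suppressed: 0 }
    }

    /// Decide whether to log an event at `now_secs`. The second element
    /// is Some(n) when a new window opens after n events were dropped,
    /// so the caller can write one "n events suppressed" line.
    pub fn check(&mut self, now_secs: u64) -> (LogDecision, Option<u32>) {
        let mut report = None;
        if now_secs != self.window_start {
            if self.suppressed > 0 {
                report = Some(self.suppressed);
            }
            self.window_start = now_secs;
            self.count = 0;
            self.suppressed = 0;
        }
        if self.count < SECCOMP_LOG_RATE_LIMIT {
            self.count += 1;
            (LogDecision::Emit, report)
        } else {
            self.suppressed += 1;
            (LogDecision::Drop, report)
        }
    }
}
```

Keeping the limiter per task means one noisy task cannot consume another task's logging budget.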
9.3.14 /proc Integration
/proc/PID/status:
The Seccomp: field in /proc/PID/status reports the current seccomp mode:
Seccomp: 0 # Disabled
Seccomp: 1 # Strict
Seccomp: 2 # Filter
This is an ABI-stable field read by many tools (systemd, ps, audit tools). The value
is a decimal integer matching SeccompMode as u8.
/proc/PID/seccomp_filter (UmkaOS extension):
UmkaOS provides a read-only file /proc/PID/seccomp_filter that reports the installed
filter chain in human-readable form. This is an UmkaOS extension (not present in Linux);
it is intended for debugging and introspection by administrators.
Format:
seccomp_filter: mode=filter depth=3 jit=yes
filter[0]: len=47 instructions, jit=yes, log=no, spec_allow=no, has_notif=no
filter[1]: len=12 instructions, jit=yes, log=yes, spec_allow=no, has_notif=no
filter[2]: len=8 instructions, jit=yes, log=no, spec_allow=no, has_notif=yes
- mode: current mode (disabled/strict/filter)
- depth: number of filters in the chain (0 for disabled/strict)
- jit: whether all filters in the chain are JIT-compiled
- filter[N]: innermost is filter[0], outermost is filter[depth-1]
  - len: number of cBPF instructions in this filter
  - jit: whether this specific filter is JIT-compiled
  - log: whether SECCOMP_FILTER_FLAG_LOG was set
  - spec_allow: whether SECCOMP_FILTER_FLAG_SPEC_ALLOW was set
  - has_notif: whether a userspace notification fd is associated
Access control: the file is readable by the task owner and by processes with
CAP_SYS_PTRACE in the target task's user namespace. Other processes see EPERM.
The BPF bytecode itself is not exposed via /proc (it would allow filter fingerprinting
by sandboxed code). Only metadata is exposed.
9.3.15 Linux Compatibility
UmkaOS's seccomp implementation is a drop-in replacement for Linux's. All of the following work without modification:
libseccomp (libseccomp2):
libseccomp generates cBPF programs from a high-level policy API and installs them
via seccomp(SECCOMP_SET_MODE_FILTER). The generated BPF programs pass UmkaOS's
validator and run correctly. The SCMP_ACT_* action codes map directly to UmkaOS's
SeccompAction values.
systemd service sandboxing:
systemd's SystemCallFilter=, SystemCallArchitectures=, and related security
directives use libseccomp to generate and install seccomp filters. These filters
work on UmkaOS because: (a) the seccomp(2) syscall number and ABI match Linux; (b) the
sock_fprog wire format is identical; (c) AUDIT_ARCH_* values match.
Docker / containerd default seccomp profile:
Docker's default seccomp profile is a JSON file that containerd compiles to cBPF via
libseccomp. The resulting filter is installed via seccomp(2) in the container's
init process. UmkaOS handles this identically to Linux.
Chrome sandbox:
Chrome's renderer processes use SECCOMP_MODE_FILTER with a multi-layered filter
chain. Chrome uses SECCOMP_FILTER_FLAG_TSYNC to synchronise the filter to all
threads before starting sensitive work. UmkaOS's TSYNC implementation matches Linux's
semantics exactly, including failure reporting when a thread cannot be synchronised:
the call returns that thread's TID, or ESRCH if SECCOMP_FILTER_FLAG_TSYNC_ESRCH is also set.
gVisor (runsc):
gVisor's runsc runtime uses SECCOMP_RET_USER_NOTIF to implement its syscall
interception model on UmkaOS. The SeccompNotifFd interface — ioctls, epoll-ability,
the seccomp_notif / seccomp_notif_resp struct layout — matches Linux 5.0+ exactly.
Seccomp syscall numbers:
| Architecture | Syscall number |
|---|---|
| x86-64 | 317 |
| AArch64 | 277 |
| ARMv7 | 383 |
| RISC-V 64 | 277 |
SECCOMP_RET_* action values:
All SECCOMP_RET_* constants used in BPF programs compiled by libseccomp or other
policy compilers match UmkaOS's SeccompAction values exactly. Programs that embed
action constants numerically (as most BPF program generators do) work without
modification.
Cross-references:
- Section 8.1 (08-security.md): Capabilities required for seccomp filter installation (CAP_SYS_ADMIN path)
- Section 8.7 (08-security.md): LSM hooks for seccomp — security_seccomp_filter_install and security_seccomp_check_syscall
- Section 8.8 (08-security.md): Credential model; no_new_privs interacts with set-uid exec and capability bounding set
- Section 6.1 (06-scheduling.md): Task struct embedding SeccompState; task suspension and wakeup for SECCOMP_RET_USER_NOTIF
- Section 18.1 (18-compat.md): Syscall dispatch table; seccomp check is inserted before dispatch in the compatibility layer as well as the native path
9.4 ARM Memory Tagging Extension (MTE)
ARM's Memory Tagging Extension is a hardware security capability available from ARMv8.5-A (AArch64 only). It provides automatic, hardware-enforced detection of heap use-after-free and heap/stack buffer overflows with near-zero runtime overhead in production. x86-64 has no hardware equivalent — the closest software analogues (ASAN, Valgrind) impose 2-10x slowdowns and are not suitable for production deployment. On ARM platforms, MTE is UmkaOS's preferred first-line mitigation against memory corruption attacks on Tier 1 drivers and userspace processes alike.
Linux added MTE support in kernel 5.10. UmkaOS's MTE implementation is ABI-compatible with Linux 5.10+ and is verified against the ARM Architecture Reference Manual DDI0487 (ARMv8.5-A section D8 "The Memory Tagging Extension").
9.4.1 MTE Overview and Architecture Coverage
Hardware mechanism:
Every 16-byte aligned granule of physical memory backed by Normal-Tagged memory pages has a 4-bit allocation tag stored in separate tag memory. This tag memory is transparent to normal loads and stores — programs that do not use MTE see no change in behaviour. When MTE is active, the processor compares the logical tag embedded in the top byte of the virtual address (bits 59:56 under TBI — Top Byte Ignore mode) against the stored allocation tag on every memory access. A mismatch either faults immediately (sync mode) or sets a sticky fault flag (async mode), depending on the configured Tag Check Fault mode.
The tag memory overhead is exactly 1 bit per 4 bytes of addressable memory (4 bits per
16-byte granule), equating to 3.125% (1/32) of physical memory for Normal-Tagged
regions. Only memory mapped with PROT_MTE incurs this overhead; untagged regions consume
no tag storage. The kernel manages tag memory as part of each physical page's metadata via
a separate tag page table level; from software's perspective tag storage is addressed by
logical address, not separately mapped.
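The overhead figure can be checked with a few lines of arithmetic. The helper below is an illustrative sketch (the constant and function names are not taken from the UmkaOS sources); it rounds up to whole granules and whole bytes.

```rust
/// Size of one MTE tag granule in bytes.
pub const MTE_GRANULE_BYTES: u64 = 16;
/// Width of one allocation tag in bits.
pub const MTE_TAG_BITS: u64 = 4;

/// Bytes of tag storage needed to cover `len` bytes of Normal-Tagged memory.
pub const fn tag_storage_bytes(len: u64) -> u64 {
    // Round up to whole granules, then convert tag bits to whole bytes.
    let granules = (len + MTE_GRANULE_BYTES - 1) / MTE_GRANULE_BYTES;
    (granules * MTE_TAG_BITS + 7) / 8
}
```

For a 4096-byte page this gives 256 granules and 128 bytes of tags, which is 1/32 of the page size.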
Feature levels:
| Feature | Minimum architecture | What it adds |
|---|---|---|
| FEAT_MTE | ARMv8.5-A (optional from ARMv8.4) | Basic allocation tag storage, IRG/STG/LDG/ADDG/SUBG instructions, sync TCF mode |
| FEAT_MTE2 | ARMv8.5-A (optional from ARMv8.4) | Async TCF mode (TFSR_EL1/TFSRE0_EL1 registers), MTE_TCF_ASYNC usable |
| FEAT_MTE3 | ARMv8.7-A / ARMv9.0+ | Asymmetric TCF mode (sync load, async store), MTE_TCF_ASYMM |
Feature presence is detected at boot via ID_AA64PFR1_EL1.MTE:
- Value 0: MTE not implemented
- Value 1: FEAT_MTE only (sync mode, no async)
- Value 2: FEAT_MTE2 (sync + async modes)
- Value 3: FEAT_MTE3 (all three modes including asymmetric)
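Decoding the field can be sketched as below. The function takes the raw register value as an argument so that it can be exercised off-target; the `MteLevel` type and `mte_level` name are illustrative. The MTE field occupies bits [11:8] of ID_AA64PFR1_EL1, and the sketch assumes any value above 3 should be treated as at least FEAT_MTE3.

```rust
/// MTE feature level decoded from ID_AA64PFR1_EL1.MTE (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum MteLevel {
    None = 0,
    Mte = 1,
    Mte2 = 2,
    Mte3 = 3,
}

/// Decodes the MTE field, bits [11:8], from a raw ID_AA64PFR1_EL1 value.
pub const fn mte_level(id_aa64pfr1_el1: u64) -> MteLevel {
    match (id_aa64pfr1_el1 >> 8) & 0xF {
        0 => MteLevel::None,
        1 => MteLevel::Mte,
        2 => MteLevel::Mte2,
        // Assumption: larger values are supersets of FEAT_MTE3.
        _ => MteLevel::Mte3,
    }
}
```

Deriving `Ord` lets callers write capability checks such as `mte_level(reg) >= MteLevel::Mte2` before offering async mode.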
Chip availability (verified as of 2025):
MTE is common in ARMv9 consumer SoCs but is often not usable on server platforms: even where the core design implements the extension, the SoC must provision tag storage and firmware must enable it (AWS Graviton 4, Azure Cobalt 100 and Google Axion do not offer MTE). Representative implementations with MTE: ARM Cortex-A710 and Cortex-A715; Qualcomm Snapdragon 8 Gen 1 and later; MediaTek Dimensity 9000 and later; Samsung Exynos 2200 and later. ARMv8.2-A cores such as Cortex-X1 and Cortex-A78 predate MTE and do not implement it.
UmkaOS detects MTE availability at boot from ID_AA64PFR1_EL1.MTE and conditionally
enables MTE support. No assumptions are hardcoded about which CPU models have MTE.
Kernel-side tag check override (PSTATE.TCO):
The PSTATE.TCO (Tag Check Override) bit suppresses tag check faults for the current
execution context when set to 1. The kernel always enters EL1 with TCO=1, meaning the
kernel itself never faults on tag mismatches during EL1 code paths. On every return to
EL0, TCO is cleared to 0, restoring tag checking for userspace. This is the correct
and only safe design: the kernel must be able to access user memory (e.g. during read(2),
copy_to_user()) even when the user has MTE enabled with conflicting tags. Kernel EL1
code paths use the TCR_EL1.TCMA1=1 configuration (match-all for tag 0b1111) as a
controlled exception for specific kernel pointer uses — this is a known limitation noted
in Section 9.4.7.
9.4.2 MTE Modes (SYNC / ASYNC / ASYMM)
MTE supports three tag check fault (TCF) modes, selectable per-thread via prctl(2):
SYNC mode (MTE_TCF_SYNC, PR_MTE_TCF_SYNC):
A tag mismatch raises a synchronous fault on the faulting instruction before any results
are architecturally committed. The kernel delivers SIGSEGV with si_code = SEGV_MTESERR
and si_addr set to the exact faulting address. The memory access is not performed.
SYNC mode is the strongest mitigation: the faulting address is precise, no speculative results are committed, and there is no window for an attacker to observe the results of an invalid access. SYNC mode imposes a measurable hardware overhead (instruction pipeline effects) of approximately 1-3% on memory-intensive workloads.
ASYNC mode (MTE_TCF_ASYNC, PR_MTE_TCF_ASYNC):
A tag mismatch sets the TFSRE0_EL1 (Tag Fault Status Register, EL0) sticky bit without
immediately faulting. Execution continues. On the next kernel entry (syscall, interrupt,
exception), the kernel checks TFSRE0_EL1; if set, it clears the register and delivers
SIGSEGV with si_code = SEGV_MTEAERR and si_addr = 0 (faulting address is not
available in async mode). Requires FEAT_MTE2.
ASYNC mode overhead is approximately 0-1% on production workloads — essentially immeasurable in most benchmarks. The trade-off is imprecise fault delivery: the actual faulting instruction may have already completed and the CPU may have speculated beyond it before the fault is reported. This makes async mode unsuitable for precise debugging but excellent for production crash containment where performance is critical.
ASYMM mode (MTE_TCF_ASYMM, PR_MTE_TCF_ASYMM, requires FEAT_MTE3):
Asymmetric mode uses SYNC semantics for loads (read tag mismatches fault immediately)
and ASYNC semantics for stores (write tag mismatches set TFSRE0_EL1). Overhead is
intermediate — approximately 0.5-2% depending on workload read/write ratio. Requires
FEAT_MTE3 (ARMv8.7+/ARMv9+).
UmkaOS default mode selection:
UmkaOS selects the default per-thread mode as follows, based on hardware capabilities and the type of code being run:
- Tier 1 driver processes (UmkaOS-native, ring 0 with hardware domain isolation): ASYNC mode by default (FEAT_MTE2 required). Balances crash detection against the hard real-time latency requirements of storage/network drivers. A tag fault in a driver terminates the driver domain and triggers reload (see Section 11.1).
- Tier 2 processes (ring 3): ASYNC mode by default; userspace may upgrade to SYNC via prctl(). The allocator enables MTE on all anonymous heap mappings.
- Debugging and testing: SYNC mode recommended, because the precise fault address enables exact identification of the corrupting site.
- Legacy software without MTE awareness: MTE_TCF_NONE (default on execve). MTE is not forced on processes that have not opted in via prctl().
Per-CPU preferred mode:
The kernel exposes a per-CPU preferred TCF mode via
/sys/devices/system/cpu/cpu<N>/mte_tcf_preferred (matching Linux). If userspace
requests both PR_MTE_TCF_SYNC and PR_MTE_TCF_ASYNC simultaneously, the kernel
selects the CPU's preferred mode. Preference order when multiple modes are requested:
ASYNC > ASYMM > SYNC.
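The fallback order can be illustrated with a small resolver. This is a sketch under stated assumptions: the `Requested` struct and `resolve_tcf_mode` function are hypothetical, the per-CPU sysfs preference is not modelled, and requests for a mode the hardware lacks are assumed to have been rejected earlier rather than silently downgraded.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MteTcfMode {
    None,
    Sync,
    Async,
    Asymm,
}

/// Set of modes a thread has asked for (hypothetical representation).
#[derive(Debug, Clone, Copy, Default)]
pub struct Requested {
    pub sync: bool,
    pub async_: bool,
    pub asymm: bool,
}

/// Picks one mode using the order ASYNC > ASYMM > SYNC, skipping modes
/// that the given feature level (0..=3) cannot provide.
pub fn resolve_tcf_mode(req: Requested, feature_level: u8) -> MteTcfMode {
    if req.async_ && feature_level >= 2 {
        return MteTcfMode::Async;
    }
    if req.asymm && feature_level >= 3 {
        return MteTcfMode::Asymm;
    }
    if req.sync && feature_level >= 1 {
        return MteTcfMode::Sync;
    }
    MteTcfMode::None
}
```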
9.4.3 Kernel Data Structures
VMA flags (stored in VmaFlags alongside existing VM_READ/VM_WRITE):
// Assumes the `bitflags` crate, whose macro syntax this declaration follows.
use bitflags::bitflags;

bitflags! {
    /// MTE-related flags for a Virtual Memory Area.
    /// Bits 40-42 of the VMA flags field. Disjoint from Linux VM_* bits in the
    /// lower 32 bits; UmkaOS-native extensions occupy bits 32-63.
    #[derive(Clone, Copy, Debug, PartialEq, Eq)]
    pub struct MteVmaFlags: u64 {
        /// PROT_MTE was specified at mmap()/mprotect() time.
        /// Pages backing this VMA are Normal-Tagged; tag memory is allocated.
        const MTE_ENABLED = 1 << 40;
        /// Tag Check Fault mode for this VMA's pages is SYNC.
        /// Recorded for coredump reconstruction; actual enforcement is per-thread.
        const MTE_SYNC = 1 << 41;
        /// Tag Check Fault mode for this VMA's pages is ASYNC (FEAT_MTE2 required).
        const MTE_ASYNC = 1 << 42;
    }
}
Note that PROT_MTE on a VMA is irrevocable: once set via mmap() or mprotect(), it
cannot be removed by a subsequent mprotect() call (consistent with Linux semantics and
required because tag memory has already been provisioned for the physical pages). The
MTE_SYNC/MTE_ASYNC flags on a VMA record the mode in effect when MTE was first
enabled, for coredump annotation; the tag check fault mode enforced at runtime is always
the per-thread MteTaskConfig.tcf_mode.
Per-thread MTE configuration (embedded in Task):
/// Per-thread MTE configuration. Stored in Task.mte_config.
/// Zero-initialised on task creation: MTE disabled, no faults checked.
pub struct MteTaskConfig {
/// Tag Check Fault mode for this thread. Controls SCTLR_EL1.TCF0
/// (for EL0 accesses) written on context switch and on prctl().
pub tcf_mode: MteTcfMode,
/// Tag inclusion mask for IRG/ADDG/SUBG instructions.
/// 16-bit bitmask: bit i = 1 means tag i is included in the random set.
/// Translated to GCR_EL1.Exclude = ~tag_mask & 0xFFFF before writing.
/// Linux default: 0 (all tags excluded, IRG always returns tag 0).
/// Allocator recommendation: 0xFFFE (exclude tag 0, include tags 1-15).
pub tag_mask: u16,
/// Tagged Address ABI enabled for this thread.
/// Set by PR_TAGGED_ADDR_ENABLE bit in prctl(PR_SET_TAGGED_ADDR_CTRL).
/// When false, the kernel strips top-byte tags from user-provided addresses
/// (syscall arguments, signal si_addr) for backwards compatibility.
pub tagged_addr_enabled: bool,
/// Set when TFSRE0_EL1 was found non-zero at kernel entry in ASYNC or ASYMM
/// mode, meaning an async tag fault is pending delivery as SIGSEGV.
/// The kernel checks and clears this flag on every exception return.
pub async_fault_pending: bool,
}
/// Tag Check Fault mode. Corresponds to SCTLR_EL1.TCF0 field encoding.
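#[derive(Clone, Copy, Debug, PartialEq, Eq)] // PartialEq is needed for the mode comparisons on the kernel entry path
#[repr(u8)]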
pub enum MteTcfMode {
/// SCTLR_EL1.TCF0 = 0b00: tag faults ignored (default on execve).
None = 0,
/// SCTLR_EL1.TCF0 = 0b01: synchronous fault on tag mismatch.
Sync = 1,
/// SCTLR_EL1.TCF0 = 0b10: asynchronous fault (FEAT_MTE2 required).
/// Writing this mode without FEAT_MTE2 returns EINVAL from prctl().
Async = 2,
/// SCTLR_EL1.TCF0 = 0b11: asymmetric (sync load, async store, FEAT_MTE3).
/// Writing this mode without FEAT_MTE3 returns EINVAL from prctl().
Asymm = 3,
}
UmkaOS-native system-level MTE state (boot-time singleton):
/// Boot-time-discovered MTE capabilities for this system.
/// Initialised once during AArch64 arch init; read-only thereafter.
pub struct MteSystemCapabilities {
/// Highest MTE feature level present: 0=none, 1=FEAT_MTE,
/// 2=FEAT_MTE2 (async), 3=FEAT_MTE3 (asymm).
/// Read from ID_AA64PFR1_EL1.MTE at boot.
pub feature_level: u8,
/// Physical memory pages suitable for MTE (Normal-Tagged attribute).
/// Not all memory may support tagging (e.g., device-mapped regions).
pub tagged_memory_pages: u64,
/// Tag storage memory in bytes (= tagged_memory_pages * 4096 / 32).
/// Each page requires 128 bytes of tag storage (4 bits per 16-byte granule).
pub tag_storage_bytes: u64,
/// Whether the system-wide tagged address mode has been disabled by
/// /proc/sys/abi/tagged_addr_disabled (requires CAP_SYS_ADMIN to set).
pub tagged_addr_disabled: bool,
}
/// Global MTE system state. Initialised in arch_init_mte().
static MTE_SYSTEM: MteSystemCapabilities = MteSystemCapabilities::zeroed();
9.4.4 MTE-Aware Allocator Design
UmkaOS's slab allocator and buddy allocator are MTE-aware when running on AArch64 with
MTE_SYSTEM.feature_level >= 1. MTE tagging is applied at the slab layer for heap
allocations; the buddy allocator operates below slab granularity and manages untagged
physical pages.
Allocation path:
- The slab allocator selects a free object from a magazine or slab.
- The allocator issues IRG Xtagged, Xaddr to generate a random tag from the thread's effective tag mask (GCR_EL1.Exclude is configured to exclude tag 0, so objects are never tagged zero; zero is the untagged sentinel). IRG takes no immediate operand; its optional third operand is a register holding an additional exclusion mask.
- The allocator stores the tag to memory: ST2G Xtagged, [Xaddr] stores the same tag to two consecutive 16-byte granules (covering 32 bytes), repeated for the full object size. For objects up to 32 bytes, a single ST2G suffices. For larger objects, a loop of ST2G or STZG (store tag and zero memory) is used. STZG both stores the tag and zeroes the granule's memory, combining zeroing and tagging in one instruction.
- The allocator returns the tagged pointer: Xresult = Xaddr | (tag << 56), which is the value IRG already left in Xtagged.
Deallocation path:
- The allocator receives the potentially-tagged pointer from the caller.
- The raw address is extracted by masking off the top byte: Xaddr = Xptr & ~(0xFFUL << 56).
- The tag is cleared in memory: STG XZR, [Xaddr] stores tag 0 (the untagged sentinel) to the first granule, repeated for the full object. Alternatively, if the slab will immediately re-use the object, the allocator stores a new random tag for the next allocation. The key invariant is: the freed object's memory tag never matches any tag value 1-15 that a living pointer could carry, so any access through the freed pointer (use-after-free) triggers a tag mismatch fault.
- The object is returned to the magazine or slab free list.
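The pointer arithmetic used on both paths can be modelled in portable Rust, which is convenient for unit-testing allocator logic off-target. The helper names below are illustrative, and `tag_check_passes` is only a software model of the comparison the hardware performs; it does not account for match-all tags or PSTATE.TCO.

```rust
const TAG_SHIFT: u32 = 56;
/// Logical tag field, bits 59:56.
const TAG_MASK: u64 = 0xF << TAG_SHIFT;
/// Whole top byte, bits 63:56, ignored for translation under TBI.
const TOP_BYTE_MASK: u64 = 0xFF << TAG_SHIFT;

/// Returns `addr` with its logical tag replaced by the low 4 bits of `tag`.
pub const fn with_tag(addr: u64, tag: u8) -> u64 {
    (addr & !TAG_MASK) | (((tag & 0xF) as u64) << TAG_SHIFT)
}

/// Extracts the logical tag from bits 59:56.
pub const fn logical_tag(ptr: u64) -> u8 {
    ((ptr & TAG_MASK) >> TAG_SHIFT) as u8
}

/// Clears the top byte, as done at the start of the deallocation path.
pub const fn strip_top_byte(ptr: u64) -> u64 {
    ptr & !TOP_BYTE_MASK
}

/// Software model of the hardware check: logical tag vs. allocation tag.
pub const fn tag_check_passes(ptr: u64, allocation_tag: u8) -> bool {
    logical_tag(ptr) == (allocation_tag & 0xF)
}
```

Once the deallocation path has reset the allocation tag to 0, a stale pointer that still carries a tag in the range 1-15 no longer passes the check, which is the use-after-free case described above.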
Adjacent object tag separation:
The slab allocator guarantees that no two adjacent objects within the same slab share the
same tag. This is implemented by a simple retry: after generating a candidate tag via IRG,
the allocator checks the tag of the preceding and succeeding objects (via LDG) and
regenerates if a collision is detected. With 15 non-zero tags and at most 2 neighbour
tags to avoid, the probability that any single IRG draw collides is at most 2/15 ≈ 13%.
The expected number of IRG calls per allocation is therefore at most
1/(1 - 2/15) = 15/13 ≈ 1.15, which is negligible overhead.
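The retry count follows a geometric distribution, so its expectation can be computed directly. The function below is a sketch that assumes IRG draws uniformly from the usable tags and that the neighbours hold distinct tags (the worst case); `expected_irg_calls` is an illustrative name.

```rust
/// Expected number of IRG draws until a tag differs from all neighbours.
/// Returns None when the neighbours could exhaust every usable tag.
pub fn expected_irg_calls(usable_tags: u32, neighbours: u32) -> Option<f64> {
    if neighbours >= usable_tags {
        return None;
    }
    // Probability that one uniformly drawn tag collides with a neighbour.
    let p_collision = neighbours as f64 / usable_tags as f64;
    // Mean of a geometric distribution with success probability 1 - p.
    Some(1.0 / (1.0 - p_collision))
}
```

With 15 usable tags and 2 neighbours this evaluates to 15/13, a little over 1.15 draws per allocation.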
Interaction with KASAN and compiler instrumentation:
When UmkaOS is built with KASAN enabled (kernel sanitizer, debug builds only), KASAN and MTE
serve complementary roles. KASAN operates on shadow memory (software) and catches accesses
to red zones and freed regions before they reach the hardware. MTE provides hardware
enforcement in production builds where KASAN is disabled. The allocator does not enable
KASAN and MTE simultaneously on the same object — when CONFIG_KASAN_HW_TAGS is set (the
hardware-tag-based KASAN variant), KASAN hijacks the MTE tag mechanism directly and the
allocator delegates all tag management to the KASAN layer.
Rust GlobalAlloc integration:
use core::alloc::{GlobalAlloc, Layout};

/// UmkaOS AArch64 slab allocator implementing GlobalAlloc with MTE tagging.
/// Called for all heap allocations in Tier 1 driver code compiled for AArch64.
unsafe impl GlobalAlloc for SlabAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // slab_alloc_mte returns a tagged pointer aligned to layout.align() and
        // valid for layout.size() bytes, or null on OOM. GlobalAlloc reports
        // exhaustion by returning null, so the result is passed through
        // unchanged; callers such as Box invoke handle_alloc_error themselves.
        self.slab_alloc_mte(layout.size(), layout.align())
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // SAFETY: the GlobalAlloc contract guarantees ptr was returned by
        // alloc() with the same layout. dealloc_mte clears the MTE tag before
        // returning the slab object.
        self.dealloc_mte(ptr, layout.size());
    }
}
9.4.5 Context Switch Handling
When switching between tasks on AArch64 with MTE enabled, the following system registers must be saved and restored to maintain correct per-thread MTE semantics:
Registers saved/restored per-thread:
| Register | Purpose | Save/restore condition |
|---|---|---|
| SCTLR_EL1.TCF0 (bits 39:38) | Tag Check Fault mode for EL0 | Always when MTE enabled |
| GCR_EL1 | Tag exclusion mask for IRG/ADDG/SUBG | When tag_mask differs |
| TFSRE0_EL1 | Tag Fault Status for EL0 (async fault accumulator) | When FEAT_MTE2, task has MTE_TCF_ASYNC or MTE_TCF_ASYMM |
| PSTATE.TCO | Tag Check Override (suppress all tag faults) | Managed by kernel entry/exit; not per-thread |
Context switch sequence (AArch64 __switch_to):
// Saving outgoing task (prev):
// 1. Read TFSRE0_EL1 before dsb() — captures any async faults since last save.
mrs x9, tfsre0_el1
str x9, [x0, #TASK_MTE_TFSRE0] // save raw TFSRE0_EL1 in prev's MTE context (non-zero = async fault pending)
// 2. Clear TFSRE0_EL1 before switching to prevent spurious delivery to next task.
msr tfsre0_el1, xzr
// 3. dsb() barrier — required before SCTLR_EL1 write to ensure TFSRE0 read completes.
dsb nsh
// Restoring incoming task (next):
// 4. Write SCTLR_EL1.TCF0 for next task's TCF mode.
// Read-modify-write: preserve all other SCTLR_EL1 bits.
mrs x9, sctlr_el1
ldr x10, [x1, #TASK_MTE_SCTLR_TCF0_BITS] // pre-computed bits for next->tcf_mode
bic x9, x9, #SCTLR_EL1_TCF0_MASK
orr x9, x9, x10
msr sctlr_el1, x9
// 5. Write GCR_EL1 for next task's tag exclusion mask.
ldr x9, [x1, #TASK_MTE_GCR_EL1] // ~(next->tag_mask) & 0xFFFF
msr gcr_el1, x9
// 6. Restore TFSRE0_EL1 for next task (may have pending async fault from before
// it was preempted).
ldr x9, [x1, #TASK_MTE_TFSRE0]
msr tfsre0_el1, x9
// 7. isb() to synchronise SCTLR_EL1 and GCR_EL1 changes before returning to EL0.
isb
Kernel entry async fault check:
On every transition from EL0 to EL1 (syscall, IRQ, data abort), the kernel entry stub
must check for pending async MTE faults in the interrupted thread. This check is inserted
in the AArch64 exception entry code at el0_sync_handler and el0_irq:
/// Called on kernel entry from EL0 when task has MTE_TCF_ASYNC or MTE_TCF_ASYMM.
/// Checks TFSRE0_EL1 for a pending async tag fault; if set, schedules SIGSEGV delivery.
///
/// # Safety
/// Must be called with IRQs disabled on the kernel entry path, before any code
/// that could preempt or sleep. Reads and clears TFSRE0_EL1 atomically with
/// respect to exception return (the current exception level is EL1).
pub unsafe fn mte_check_async_fault(task: &mut Task) {
if task.mte_config.tcf_mode != MteTcfMode::Async
&& task.mte_config.tcf_mode != MteTcfMode::Asymm
{
return;
}
// SAFETY: reading TFSRE0_EL1 in EL1 is always permitted.
let tfsre0: u64;
core::arch::asm!("mrs {}, tfsre0_el1", out(reg) tfsre0);
if tfsre0 != 0 {
// Clear the fault status register before processing.
// SAFETY: writing TFSRE0_EL1 in EL1 is always permitted.
core::arch::asm!("msr tfsre0_el1, xzr");
task.mte_config.async_fault_pending = true;
// Queue SIGSEGV delivery on return to userspace.
// si_code = SEGV_MTEAERR (8 on Linux), si_addr = 0 (address unavailable in async mode).
task.signal_queue.push(Signal::new(
SIGSEGV,
SigInfo {
si_code: SEGV_MTEAERR,
si_addr: 0,
..SigInfo::default()
},
));
}
}
Signal handler invariant:
Signal handlers are always invoked with PSTATE.TCO = 0, regardless of whether the
interrupted EL0 code had TCO set. Tag checking is always active inside signal handlers.
PSTATE.TCO is restored to its pre-signal value on sigreturn(). This matches Linux
semantics and is required so that a handler never silently inherits suppressed tag
checking (TCO=1) from the code it interrupted.
fork() and clone() inheritance:
On fork(), the child inherits the parent's MteTaskConfig (TCF mode, tag mask,
tagged-addr-enable flag). TFSRE0_EL1 is reset to zero in the child — async faults
pending in the parent are not inherited. On execve(), MteTaskConfig is reset to
the default (all fields zero: MTE_TCF_NONE, tag_mask = 0, tagged addr disabled).
This matches Linux semantics exactly.
9.4.6 Userspace Interface (prctl, mmap PROT_MTE)
Compile-time feature constants (UAPI, AArch64 only):
/* arch/arm64/include/uapi/asm/hwcap.h */
#define HWCAP2_MTE (1UL << 18) /* MTE present; check via getauxval(AT_HWCAP2) */
/* arch/arm64/include/uapi/asm/mman.h */
#define PROT_MTE 0x20 /* Enable MTE allocation tags for mmap/mprotect */
/* include/uapi/linux/prctl.h */
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_GET_TAGGED_ADDR_CTRL 56
# define PR_TAGGED_ADDR_ENABLE (1UL << 0) /* Enable tagged address ABI */
# define PR_MTE_TCF_SHIFT 1
# define PR_MTE_TCF_NONE (0UL << 1) /* Ignore tag faults (default) */
# define PR_MTE_TCF_SYNC (1UL << 1) /* Synchronous tag fault mode */
# define PR_MTE_TCF_ASYNC (2UL << 1) /* Asynchronous tag fault mode */
# define PR_MTE_TCF_MASK (3UL << 1) /* Mask for TCF bits */
# define PR_MTE_TAG_SHIFT 3
# define PR_MTE_TAG_MASK (0xffffUL << 3) /* Tag inclusion mask (16 bits) */
/* Signal codes for MTE faults (values as defined by Linux) */
#define SEGV_MTEAERR 8 /* Asynchronous MTE tag fault (si_addr = 0) */
#define SEGV_MTESERR 9 /* Synchronous MTE tag fault */
mmap() / mprotect() with PROT_MTE:
/* Map anonymous memory with MTE tagging enabled */
void *heap = mmap(NULL, size,
PROT_READ | PROT_WRITE | PROT_MTE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
/* Or enable MTE on an existing anonymous mapping */
mprotect(heap, size, PROT_READ | PROT_WRITE | PROT_MTE);
UmkaOS validates that PROT_MTE is only applied to MAP_ANONYMOUS or RAM-backed file
mappings (tmpfs, memfd). Applying PROT_MTE to file-backed mappings of on-disk
files, device-backed mappings, or MAP_FIXED mappings over non-Normal-Tagged physical
memory returns EINVAL. PROT_MTE cannot be removed by a subsequent mprotect() call
— this is enforced by the VMA merge logic, which treats VM_MTE (UmkaOS's internal name
for PROT_MTE) as a non-mergeable flag.
prctl(PR_SET_TAGGED_ADDR_CTRL):
/* Enable tagged address ABI and set SYNC mode with all non-zero tags allowed */
prctl(PR_SET_TAGGED_ADDR_CTRL,
PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xFFFEUL << PR_MTE_TAG_SHIFT),
0, 0, 0);
/* Read current configuration */
unsigned long ctrl = prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0);
bool mte_sync = (ctrl & PR_MTE_TCF_MASK) == PR_MTE_TCF_SYNC;
bool mte_async = (ctrl & PR_MTE_TCF_MASK) == PR_MTE_TCF_ASYNC;
The PR_MTE_TAG_MASK field provides an include mask: bit i set means tag i is allowed
to be selected by IRG. The kernel inverts this to compute GCR_EL1.Exclude (which is
an exclude mask). An include mask of 0x0000 causes IRG to always return tag 0 (the
untagged value), effectively disabling random tag generation. UmkaOS's allocator sets
tag_mask = 0xFFFE (exclude only tag 0, include tags 1-15) for all MTE-capable threads
in Tier 1 and Tier 2 contexts.
prctl(PR_SET_TAGGED_ADDR_CTRL) validation by UmkaOS:
UmkaOS validates the flags argument to PR_SET_TAGGED_ADDR_CTRL as follows:
- Bits outside PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_MASK | PR_MTE_TAG_MASK must be zero → EINVAL.
- If PR_MTE_TCF_ASYNC is set and MTE_SYSTEM.feature_level < 2 (no FEAT_MTE2) → EINVAL.
- If PR_MTE_TCF_ASYMM is set and MTE_SYSTEM.feature_level < 3 (no FEAT_MTE3) → EINVAL.
- If both PR_MTE_TCF_SYNC and PR_MTE_TCF_ASYNC are set, UmkaOS resolves to the CPU's preferred mode (see Section 9.4.2).
- If MTE_SYSTEM.tagged_addr_disabled = true and PR_TAGGED_ADDR_ENABLE is set → EINVAL.
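The checks above, together with the include-to-exclude mask inversion, can be sketched as a pure function. The `TaggedAddrCtrl` and `Einval` types and the function name are hypothetical. The PR_MTE_TCF_ASYMM rule is omitted because this section gives no numeric value for that constant, and the case where both SYNC and ASYNC are set is passed through in `tcf_bits` for later resolution.

```rust
pub const PR_TAGGED_ADDR_ENABLE: u64 = 1 << 0;
pub const PR_MTE_TCF_SHIFT: u32 = 1;
pub const PR_MTE_TCF_SYNC: u64 = 1 << PR_MTE_TCF_SHIFT;
pub const PR_MTE_TCF_ASYNC: u64 = 2 << PR_MTE_TCF_SHIFT;
pub const PR_MTE_TCF_MASK: u64 = 3 << PR_MTE_TCF_SHIFT;
pub const PR_MTE_TAG_SHIFT: u32 = 3;
pub const PR_MTE_TAG_MASK: u64 = 0xffff << PR_MTE_TAG_SHIFT;

/// Stand-in for the EINVAL errno (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct Einval;

/// Decoded control word (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct TaggedAddrCtrl {
    pub tagged_addr_enabled: bool,
    /// Raw TCF request bits, already shifted down.
    pub tcf_bits: u8,
    /// Include mask as supplied by userspace.
    pub tag_mask: u16,
    /// Value for GCR_EL1.Exclude: the bitwise inverse of the include mask.
    pub gcr_exclude: u16,
}

pub fn validate_tagged_addr_ctrl(
    flags: u64,
    feature_level: u8,
    tagged_addr_disabled: bool,
) -> Result<TaggedAddrCtrl, Einval> {
    let known = PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_MASK | PR_MTE_TAG_MASK;
    if (flags & !known) != 0 {
        return Err(Einval); // unknown bits set
    }
    if (flags & PR_MTE_TCF_ASYNC) != 0 && feature_level < 2 {
        return Err(Einval); // async requested without FEAT_MTE2
    }
    let enable = (flags & PR_TAGGED_ADDR_ENABLE) != 0;
    if enable && tagged_addr_disabled {
        return Err(Einval); // tagged addresses disabled system-wide
    }
    let tag_mask = ((flags & PR_MTE_TAG_MASK) >> PR_MTE_TAG_SHIFT) as u16;
    Ok(TaggedAddrCtrl {
        tagged_addr_enabled: enable,
        tcf_bits: ((flags & PR_MTE_TCF_MASK) >> PR_MTE_TCF_SHIFT) as u8,
        tag_mask,
        gcr_exclude: !tag_mask,
    })
}
```

Passing the system state in as arguments keeps the function free of globals, so the policy can be tested without an AArch64 target.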
ptrace() interface for tag access:
UmkaOS implements the Linux ptrace MTE interface for debugging tools (GDB, LLDB, sanitizer runtimes):
- PTRACE_PEEKMTETAGS: reads allocation tags from a tracee's MTE-tagged address range. As on Linux, the tracer's buffer holds one 4-bit tag per byte (unlike the packed core dump format). iov_len is updated to the number of tags actually read. Returns EOPNOTSUPP if the address is not in a PROT_MTE VMA.
- PTRACE_POKEMTETAGS: writes allocation tags to a tracee's MTE-tagged address range. Used by sanitizer tools to set up expected tag patterns.
- PTRACE_GETREGSET / PTRACE_SETREGSET with NT_ARM_TAGGED_ADDR_CTRL: reads/writes the thread's tagged address control word (equivalent to PR_GET/SET_TAGGED_ADDR_CTRL).
Core dump support:
When a process with PROT_MTE mappings generates a core dump, UmkaOS includes the
allocation tags as additional PT_AARCH64_MEMTAG_MTE program header segments. Each
such segment covers the same virtual address range as a PT_LOAD segment for the
corresponding MTE-tagged mapping. Tags are stored packed at 2 tags per byte (4 bits each),
so a 4096-byte page produces 128 bytes of tag data in the core. This allows post-mortem
debuggers to reconstruct the full tagged memory state.
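Packing tags for a core segment can be sketched as follows. The nibble order used here (lower-address granule in the low nibble) is an assumption made for illustration and should be checked against the core file format definition; `pack_tags` is an illustrative name.

```rust
/// Packs one-tag-per-element input into two tags per byte.
/// Assumption: the tag of the lower-address granule goes in the low nibble.
/// An odd trailing tag is padded with a zero high nibble.
pub fn pack_tags(tags: &[u8]) -> Vec<u8> {
    tags.chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0xF;
            let hi = pair.get(1).copied().unwrap_or(0) & 0xF;
            lo | (hi << 4)
        })
        .collect()
}
```

A 4096-byte page has 256 granules, so packing its tags yields the 128 bytes quoted above.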
/proc and sysfs interfaces:
/proc/sys/abi/tagged_addr_disabled
0: tagged addresses allowed (default)
1: tagged addresses disabled system-wide (requires CAP_SYS_ADMIN to set)
Writing 1 prevents any process from enabling PR_TAGGED_ADDR_ENABLE;
existing processes with tagged addresses already enabled are unaffected.
/proc/<pid>/status
Tagged_addr: 1 (present when PR_TAGGED_ADDR_ENABLE is set)
Mte_tcf_mode: sync|async|asymm|none
/sys/devices/system/cpu/cpu<N>/mte_tcf_preferred
sync|async|asymm (per-CPU hardware preference, read-only)
/proc/cpuinfo MTE feature flag:
Features: ... mte ...
The mte feature flag is present in /proc/cpuinfo when HWCAP2_MTE is set
(ID_AA64PFR1_EL1.MTE >= 1). This matches Linux 5.10+ behaviour exactly.
9.4.7 Integration with UmkaOS Security Model
Tier 1 drivers (ring 0, hardware domain isolation):
Tier 1 drivers compiled for AArch64 with MTE receive automatic heap memory safety with no additional coding. The MTE-aware slab allocator tags all heap objects. Any use-after-free or overflow within a Tier 1 driver produces a tag fault, which:
- In ASYNC mode: sets TFSR_EL1 (driver code runs at EL1, so the fault accumulates in the EL1 status register rather than TFSRE0_EL1). On the next kernel entry (e.g., ring crossing, IPI), the tag fault handler runs, determines the fault occurred in a Tier 1 driver domain, and initiates driver domain teardown and reload (Section 11.1). The kernel itself is unaffected.
- In SYNC mode: produces an immediate data abort in the driver's EL1 execution context (driver code runs in EL1 but in a restricted MPK/POE domain). The exception handler identifies the Tier 1 domain, tears it down, and schedules reload.
This is a significant improvement over x86-64, where KASAN (software shadow memory) is the only equivalent tool and is not suitable for production use due to its 2-3x overhead. On AArch64 with MTE, Tier 1 drivers get production-grade heap safety with ~0-1% overhead.
SCTLR_EL1.TCF vs. SCTLR_EL1.TCF0:
SCTLR_EL1 contains two TCF fields:
- TCF0 (bits 39:38): Tag Check Fault mode for EL0 accesses. UmkaOS uses this for userspace thread configuration via prctl().
- TCF (bits 41:40): Tag Check Fault mode for EL1 accesses (kernel/Tier-1 driver code). UmkaOS sets TCF = 0b10 (ASYNC) system-wide for Tier 1 driver contexts where MTE is enabled. This means tag faults in kernel/Tier-1 code accumulate in TFSR_EL1 (not TFSRE0_EL1). On EL1-to-EL1 transitions (e.g., driver domain → kernel domain), the kernel checks TFSR_EL1 and handles faults as Tier 1 crashes.
Known limitation — TCMA1 and tag 0xF:
TCR_EL1.TCMA1 = 1 is required in the UmkaOS kernel (matching Linux). This means that any
pointer with logical tag 0b1111 (0xF) bypasses tag checking and can be dereferenced
without triggering a fault, regardless of the allocation tag in memory. This is necessary
because many kernel subsystems generate kernel virtual addresses from physical addresses
(e.g., phys_to_virt()) which carry no meaningful tag. An attacker aware of this can
forge a pointer that bypasses tag checks for any address by setting bits 59:56 to 0xF. This is a
structural limitation of the ARMv8.5 MTE architecture as deployed in a full-kernel context;
it is documented in the ARM architecture specification and in Linux's implementation notes.
UmkaOS's response: PKEY 15 (the guard/unmapped key in the MPK domain table) is reserved;
tag 0xF is excluded from the allocator's tag generation mask (PR_MTE_TAG_MASK excludes
bit 15 in addition to bit 0). This ensures allocator-managed heap objects are never tagged
0xF, so TCMA1 bypass does not help an attacker against heap objects.
LSM integration:
The UmkaOS LSM framework (Section 8.7) provides the following MTE-related hooks:
/// Called on mmap() and mprotect() when PROT_MTE is in the requested protection flags.
/// LSM may deny the request (e.g., for confined processes in high-security contexts
/// that should not manipulate their own tag state).
///
/// Returns Ok(()) to permit, Err(Errno::EPERM) to deny.
fn security_mmap_mte(task: &Task, vma: &Vma) -> Result<(), Errno>;
/// Called on prctl(PR_SET_TAGGED_ADDR_CTRL) to allow LSM policy to restrict
/// MTE mode changes. For example, an LSM can prevent untrusted processes from
/// downgrading from PR_MTE_TCF_SYNC to PR_MTE_TCF_NONE.
fn security_mte_ctrl(task: &Task, new_flags: u64) -> Result<(), Errno>;
Disabling MTE for legacy software:
CAP_SYS_ADMIN is required to write 1 to /proc/sys/abi/tagged_addr_disabled, which
prevents any process from newly enabling the tagged address ABI; processes that already
enabled it are unaffected. Individual processes that have not opted in via prctl() are
unaffected by MTE regardless of this setting (MTE is opt-in per-thread). A process
can also call prctl(PR_SET_TAGGED_ADDR_CTRL, PR_MTE_TCF_NONE, ...)
to disable tag checking for itself, reverting to MTE_TCF_NONE behaviour, subject to any
LSM policy on mode changes. There is no mechanism to disable MTE for another process
without ptrace(2) access.
Interaction with UmkaOS capability tokens:
UmkaOS capability tokens (Section 8.1) use 64-bit object IDs. On AArch64 with TBI (Top Byte
Ignore) enabled, the top byte of a pointer is not used for translation. UmkaOS's capability
system does not store capabilities in pointers — capabilities are opaque handles in the
CapabilitySpace table, never raw pointers. This design ensures that MTE's use of bits
59:56 for logical tags does not conflict with capability token encoding.
9.4.8 Comparison with x86-64 Mitigations
x86-64 has no hardware memory tagging mechanism. The comparison below covers the nearest available alternatives:
| Threat | x86-64 mitigation | Overhead | ARM MTE | Overhead |
|---|---|---|---|---|
| Heap use-after-free | ASAN (compiler instrumentation) | ~2-10x slowdown | Hardware tag check | ~0-1% (async) |
| Heap overflow | ASAN / AddressSanitizer | ~2-10x slowdown | Hardware tag check | ~0-1% (async) |
| Stack buffer overflow | Intel CET Shadow Stack (SHSTK) | ~0-2% | MTE stack tagging (FEAT_MTE3) | ~0-2% |
| Production deployable? | No (ASAN overhead is prohibitive) | — | Yes (ASYNC mode) | — |
| Fault precision | Exact (ASAN synchronous) | — | Exact (SYNC) / imprecise (ASYNC) | — |
| Detection guarantee | Probabilistic (ASAN has false negatives on aliasing) | — | Deterministic trap on tag mismatch (SYNC mode); accesses whose tag happens to match (1/16) go undetected | — |
| Kernel-side use | KASAN (debug builds only; ~2-3x overhead) | — | Tier 1 driver tagging (production) | ~0-1% |
Important nuance on MTE's probabilistic nature:
MTE with 4-bit tags provides 1/16 probability that a random tag guess is correct. This is not a cryptographic guarantee. Against an attacker who can make repeated attempts (a serial exploitation loop), MTE in ASYNC mode provides approximately 15/16 chance of detection per attempt. In SYNC mode, each attempt terminates the process, making iterative exploitation observable and limited by SIGSEGV delivery. MTE is classified as a crash-containment mitigation (probabilistic, stops typical bugs, slows targeted attacks) rather than a cryptographic security boundary (the role played by Tier 2 ring-3 isolation and IOMMU, which provide deterministic isolation). This is consistent with the analysis in the ARM Architecture Security Model and with Project Zero's MTE research (2023).
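The arithmetic behind these figures can be written out under one simplifying assumption that the text does not state explicitly: each attempt faces an independent, uniformly random 4-bit tag.

```latex
\[
P(\text{one attempt undetected}) = \frac{1}{16}, \qquad
P(\text{all } n \text{ attempts undetected}) = \left(\frac{1}{16}\right)^{n}, \qquad
P(\text{detected within } n \text{ attempts}) = 1 - \left(\frac{1}{16}\right)^{n}
\]
```

Under that assumption an exploit that needs two independent tag guesses is detected with probability $1 - 1/256 \approx 99.6\%$, and an attacker who can retry freely expects a matching tag after 16 attempts on average, which is why the mitigation slows targeted attacks rather than preventing them.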
x86-64 future outlook:
Intel's Linear Address Masking (LAM) and AMD's Upper Address Ignore (UAI) provide similar TBI-style top-byte-ignore semantics but do not provide hardware memory tags. They allow software tools to embed metadata in pointer top bytes but provide no hardware enforcement. No announced x86-64 extension as of 2025 provides hardware tagging equivalent to ARM MTE.
9.4.9 Linux Compatibility
UmkaOS's MTE implementation is a drop-in replacement for Linux 5.10+ on AArch64. All of the following work without modification on UmkaOS:
Compiler-based MTE support:
GCC 10+ and Clang 11+ support -march=armv8.5-a+memtag and provide the __arm_mte_*
intrinsic family. Programs compiled with these flags run on UmkaOS without modification.
The HWCAP2_MTE bit in AT_HWCAP2 (accessible via getauxval(3)) is set when MTE is
present; applications performing runtime feature detection work correctly.
LLVM AddressSanitizer (HWASan) with MTE hardware backend:
LLVM's HWASan (Hardware-assisted AddressSanitizer) can use ARM MTE as its hardware
backend when built with -fsanitize=hwaddress and run on MTE-capable hardware. HWASan
manages allocation tagging itself (via the __hwasan_* runtime) using IRG/STG and
relies on MTE faults for detection. UmkaOS's MTE implementation is compatible with HWASan
because: (a) PROT_MTE on mmap() works as specified; (b) prctl(PR_SET_TAGGED_ADDR_CTRL)
sets SYNC mode as HWASan requires for precise fault delivery; (c) SIGSEGV with
si_code = SEGV_MTESERR and a valid si_addr is delivered on tag fault.
Android Scudo allocator with MTE:
Android's Scudo hardened allocator uses ARM MTE for heap tagging in production Android
builds (Android 12+). Scudo uses PROT_MTE on its heap regions and manages tags directly
via inline assembly IRG/STG/LDG sequences. Scudo-built applications run on UmkaOS
without modification.
glibc MTE support:
glibc 2.35+ contains AArch64 MTE awareness in its memory allocator (malloc/free).
When HWCAP2_MTE is detected, glibc's allocator enables PROT_MTE on heap mmap()
regions and tags allocations. This is transparent to applications. All glibc-linked
programs gain MTE heap protection automatically on UmkaOS/AArch64 with no recompilation.
Syscall compatibility table:
| Operation | UmkaOS AArch64 syscall | Linux AArch64 syscall | Notes |
|---|---|---|---|
| prctl(PR_SET_TAGGED_ADDR_CTRL) | 167 (prctl) | 167 (prctl) | ABI-identical |
| prctl(PR_GET_TAGGED_ADDR_CTRL) | 167 (prctl) | 167 (prctl) | ABI-identical |
| ptrace(PTRACE_PEEKMTETAGS) | 117 (ptrace) | 117 (ptrace) | ABI-identical |
| ptrace(PTRACE_POKEMTETAGS) | 117 (ptrace) | 117 (ptrace) | ABI-identical |
PTRACE_PEEKMTETAGS / PTRACE_POKEMTETAGS constants:
#define PTRACE_PEEKMTETAGS 33
#define PTRACE_POKEMTETAGS 34
These values match Linux and must not change (they are embedded in debugging tools and test suites).
Cross-references:
- Section 8.1: Capability tokens; MTE logical tags in pointer top byte do not overlap with UmkaOS capability encoding (capabilities are table-indexed handles, not raw pointers)
- Section 8.7: LSM hooks security_mmap_mte() and security_mte_ctrl() for policy enforcement
- Section 8.8: CAP_SYS_ADMIN required to write /proc/sys/abi/tagged_addr_disabled
- Section 4.2: VMA flags (MteVmaFlags) and PROT_MTE VMA handling; tag storage provisioning in the page fault handler
- Section 4.1: Physical page metadata for tag storage; tag pages are not counted in usable memory reported to userspace
- Section 11.1: Tier 1 driver crash recovery triggered by MTE tag faults in driver EL1 domains
- Section 3.1: dsb nsh barrier requirement before SCTLR_EL1 write in context switch
- Section 18.1: prctl() and ptrace() dispatch; MTE-specific argument validation in the compat layer
9.5 DebugCap — Capability-Based Process Debugging
Section 19.3.1 establishes that every ptrace operation requires a CAP_DEBUG capability
token scoped to the target process. That model eliminates the ambient CAP_SYS_PTRACE
authority problem — a debugger must hold an explicit, scoped token for the precise process
it intends to inspect. This section extends that foundation with DebugCap: a first-class,
transferable, revocable capability object that carries its own permission mask and an
optional expiry time.
The distinction matters for three deployment scenarios that CAP_DEBUG tokens alone do
not serve cleanly:
- Container debugging from outside the container. A monitoring service running on the host needs to debug a specific container workload process. It cannot enter the container (no shell, no shared UID, no root inside). CAP_DEBUG in the host's capability space covers host processes; the container's user namespace has its own capability space. A DebugCap is an object: it crosses namespace boundaries when explicitly handed across them, without requiring CAP_NS_TRAVERSE at every namespace boundary for routine debugging of a pre-authorised target.
- Privilege drop after attachment. A debugger daemon starts with CAP_SYS_PTRACE (or CAP_DEBUG) in order to attach to a target. Once attached, it should drop that broad authority and operate only on the specific process it was authorised for. A DebugCap is the handle it keeps after dropping the broad capability; the kernel enforces that the session is scoped to that handle.
- Granular, time-limited access. A CI system grants a test runner read-only memory inspection on a specific worker for 15 minutes. CAP_DEBUG is all-or-nothing; a DebugCap with read_memory: true and expires: Some(15min) is precise and self-cleaning.
9.5.1 DebugCap Data Structures
/// A capability token granting ptrace-level access to a specific process.
///
/// Issued by the kernel when a process explicitly grants debug access, or by a
/// process holding `CAP_DEBUG` (the UmkaOS-native form) or `CAP_SYS_PTRACE` (the
/// Linux-compat alias) for any process it can see in its capability namespace.
///
/// Properties:
/// - **Scoped**: only valid for `target`. Any operation attempted against a
/// different process returns `ESRCH`.
/// - **Revocable**: `cap_revoke(debug_cap)` revokes via the seqlock protocol on
/// `revocation_seq`: the revoker writes an odd value (in-progress), completes
/// bookkeeping, then writes an even value ≥ 2 (permanently revoked). Each
/// `DebugSession` method call checks `revocation_seq.load(Acquire) >= 2` before
/// dispatching. In-flight operations that have already passed the revocation check
/// may complete normally — revocation takes effect at the next dispatch boundary.
/// After revocation, any `DebugSession` holding the cap receives `EACCES` on its
/// next operation and is expected to release the cap.
/// - **Auditable**: issuance and every `ptrace_attach_cap()` call are logged
/// via the kernel audit subsystem.
/// - **Process-death-safe**: automatically invalidated (equivalent to revocation)
/// when `target` exits. The `Arc<Process>` inside keeps the process descriptor
/// alive for the duration of any active `DebugSession`, but `revocation_seq` is
/// advanced to an even value >= 2 during process teardown before releasing the
/// `Arc<Process>`, ensuring any blocked `DebugSession` wakes and sees revocation.
pub struct DebugCap {
/// The process this capability grants access to.
/// Kept alive (descriptor, not address space) for the lifetime of the cap.
target: Arc<Process>,
/// Permitted operations. Each field maps to one or more `PTRACE_*` requests.
permissions: DebugPermissions,
/// Kernel-assigned serial number for audit log correlation and revocation.
/// Globally unique per boot; monotonically increasing.
serial: u64,
/// Monotonic expiry instant. `None` means the cap does not expire on its own.
/// Checked on `ptrace_attach_cap()` and on each `DebugSession` operation.
expires: Option<MonotonicInstant>,
/// Seqlock-based revocation counter. Initially 0 (valid).
/// Odd value = revocation in progress. Even value >= 2 = permanently revoked.
/// Checked on every `DebugSession` method call before dispatching to ptrace.
/// To check: `revocation_seq.load(Acquire) >= 2`.
/// To revoke: store odd (in-progress), complete bookkeeping, store even (done).
revocation_seq: AtomicU32,
}
/// Fine-grained permissions carried by a `DebugCap`.
/// Setting `full_ptrace` is equivalent to setting all individual fields.
pub struct DebugPermissions {
/// Read target memory (`PTRACE_PEEKDATA`, `process_vm_readv`).
pub read_memory: bool,
/// Write target memory (`PTRACE_POKEDATA`, `process_vm_writev`).
pub write_memory: bool,
/// Read general-purpose and floating-point registers (`PTRACE_GETREGS`,
/// `PTRACE_GETFPREGS`, `PTRACE_GETREGSET`).
pub read_regs: bool,
/// Write general-purpose and floating-point registers (`PTRACE_SETREGS`,
/// `PTRACE_SETFPREGS`, `PTRACE_SETREGSET`).
pub write_regs: bool,
/// Set hardware breakpoints and watchpoints (Section 19.3.2).
pub set_breakpoints: bool,
/// Single-step execution (`PTRACE_SINGLESTEP`).
pub single_step: bool,
/// Receive and inject signals (`PTRACE_GETSIGINFO`, `PTRACE_SETSIGINFO`).
pub intercept_signals: bool,
/// Intercept syscall entry/exit (`PTRACE_SYSCALL`).
pub trace_syscalls: bool,
/// Full ptrace control: implies all fields above. When this field is `true`,
/// the kernel ignores the individual fields and permits every ptrace request.
pub full_ptrace: bool,
}
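The rule that `full_ptrace` overrides the individual fields can be sketched as a single permission check. The `DebugOp` enum and the `permits()` method are hypothetical names introduced for this illustration; the struct repeats the fields defined above so that the example is self-contained.

```rust
#[derive(Clone, Copy, Default)]
pub struct DebugPermissions {
    pub read_memory: bool,
    pub write_memory: bool,
    pub read_regs: bool,
    pub write_regs: bool,
    pub set_breakpoints: bool,
    pub single_step: bool,
    pub intercept_signals: bool,
    pub trace_syscalls: bool,
    pub full_ptrace: bool,
}

/// Hypothetical operation classes, one per individual permission field.
#[derive(Clone, Copy)]
pub enum DebugOp {
    ReadMemory,
    WriteMemory,
    ReadRegs,
    WriteRegs,
    SetBreakpoints,
    SingleStep,
    InterceptSignals,
    TraceSyscalls,
}

impl DebugPermissions {
    /// `full_ptrace` short-circuits the check; otherwise the matching
    /// individual field decides.
    pub fn permits(&self, op: DebugOp) -> bool {
        if self.full_ptrace {
            return true;
        }
        match op {
            DebugOp::ReadMemory => self.read_memory,
            DebugOp::WriteMemory => self.write_memory,
            DebugOp::ReadRegs => self.read_regs,
            DebugOp::WriteRegs => self.write_regs,
            DebugOp::SetBreakpoints => self.set_breakpoints,
            DebugOp::SingleStep => self.single_step,
            DebugOp::InterceptSignals => self.intercept_signals,
            DebugOp::TraceSyscalls => self.trace_syscalls,
        }
    }
}
```

A read-only inspection cap would be built as `DebugPermissions { read_memory: true, ..Default::default() }`, leaving every other field `false`.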
9.5.2 Obtaining a DebugCap
Three kernel interfaces issue DebugCap tokens. All three log an audit record.
/// A process grants another process (identified by `grantee_pid`) the right to
/// debug the calling process.
///
/// Preconditions (any one must hold):
/// - Caller and grantee share the same UID, OR
/// - The calling process has set `PR_SET_DUMPABLE` and `PR_SET_DEBUG_ACCEPT`
/// (see Section 9.5.5), OR
/// - The calling process invokes this on itself (`grantee_pid` is the
/// caller's own PID — equivalent to `self_debug_cap()`).
///
/// The kernel delivers the resulting `DebugCap` to `grantee_pid` via a
/// kernel-managed pending-cap queue; the grantee retrieves it with
/// `cap_recv(CAP_TYPE_DEBUG)`.
///
/// Returns: the serial number of the issued cap (for audit correlation).
pub fn grant_debug_cap(
grantee_pid: Pid,
permissions: DebugPermissions,
expires: Option<Duration>,
) -> Result<u64, CapError>;
/// A process holding `CAP_DEBUG` (or the Linux-compat `CAP_SYS_PTRACE`) issues
/// a `DebugCap` for any process visible in the caller's namespace.
///
/// This is the primary entry point for debugger daemons. The recommended usage
/// pattern is:
/// 1. Start with `CAP_SYS_PTRACE` in the bounding set.
/// 2. Call `ptrace_cap_issue()` for the target.
/// 3. Drop `CAP_SYS_PTRACE` from the ambient and effective sets.
/// 4. Proceed using only the returned `DebugCap`.
///
/// Cross-namespace targets are reachable only if the caller also holds
/// `CAP_NS_TRAVERSE` for every intermediate namespace boundary (consistent
/// with [Section 19.3.1](19-observability.md#1931-capability-gated-ptrace)).
///
/// Returns: the `DebugCap` kernel handle (an opaque file descriptor in the
/// calling process's file-descriptor table, transferable via `SCM_RIGHTS`).
pub fn ptrace_cap_issue(
target_pid: Pid,
permissions: DebugPermissions,
expires: Option<Duration>,
) -> Result<DebugCapFd, CapError>; // Requires CAP_DEBUG or CAP_SYS_PTRACE
/// A process grants debug access to itself.
///
/// No capability checks — a process can always inspect itself. Useful for
/// test harnesses, in-process debuggers, and coverage tools that need the
/// structured `DebugSession` API rather than raw ptrace calls.
///
/// Self-caps are non-transferable (the `send_cap()` path returns `EPERM` for
/// self-issued caps) to prevent self-escalation.
pub fn self_debug_cap(permissions: DebugPermissions) -> DebugCapFd;
Non-transferability is enforced via the CAP_FLAG_SELF_ISSUED bit in
cap_flags, set unconditionally in create_self_cap() and never cleared:
- send_cap(cap, dest): checks cap.cap_flags & CAP_FLAG_SELF_ISSUED; if set, returns CapError::NonTransferable immediately and no capability transfer occurs.
- fd_dup2(old_fd, new_fd): creates a new fd pointing to the same capability entry (shared reference count). CAP_FLAG_SELF_ISSUED is in the entry, not the fd, so the duplicate fd inherits the non-transferable restriction automatically.
- fork(): the child inherits the fd table (new fds pointing to the same entries) with CAP_FLAG_SELF_ISSUED preserved. The child cannot transfer the cap either.
- execve(): by default, capabilities with CAP_FLAG_CLOEXEC are closed; self-issued caps are additionally closed regardless of the CLOEXEC flag, since they represent the issuing task's identity context.
This design requires no runtime "is this self-issued" check beyond reading a flag bit from the already-cached capability entry.
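A minimal sketch of the flag check follows. The bit position of `CAP_FLAG_SELF_ISSUED`, the `CapEntry` type, and the single-argument `send_cap()` signature are assumptions for illustration; `Rc` stands in for the kernel's shared capability entry so that a duplicated fd can be modelled as a second reference to the same entry.

```rust
use std::rc::Rc;

/// Bit position is an assumption made for this example.
const CAP_FLAG_SELF_ISSUED: u32 = 1 << 0;

#[derive(Debug, PartialEq)]
enum CapError {
    NonTransferable,
}

/// Hypothetical capability-table entry; fds are references to it.
struct CapEntry {
    cap_flags: u32,
}

/// Transfer gate: the flag lives in the entry, so every fd that refers to
/// the entry is subject to the same restriction.
fn send_cap(cap: &CapEntry) -> Result<(), CapError> {
    if (cap.cap_flags & CAP_FLAG_SELF_ISSUED) != 0 {
        return Err(CapError::NonTransferable);
    }
    Ok(())
}
```

Because a duplicate fd is only another reference to the same entry, `send_cap()` on `Rc::clone(&entry)` observes the same flag as the original.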
DebugCapFd is a kernel file-descriptor type (analogous to a pidfd) that wraps the
DebugCap. It is reference-counted: duplicating the fd (dup2, SCM_RIGHTS transfer)
increments the reference count; closing the last reference destroys the cap if no
DebugSession is currently holding it open.
9.5.3 Using a DebugCap
/// Attach to the target process using a capability token.
///
/// On success, the target is stopped (SIGSTOP delivered) and the returned
/// `DebugSession` provides the full debug interface. Dropping the session
/// detaches the debugger and delivers SIGCONT to the target.
///
/// Errors:
/// - `DebugError::Expired` — cap has passed its `expires` time.
/// - `DebugError::Revoked` — cap has been revoked by the issuer.
/// - `DebugError::TargetGone` — target process has already exited.
/// - `DebugError::PermDenied` — `permissions.full_ptrace` is false and the
/// operation mode requires full control (reserved for future use).
pub fn ptrace_attach_cap(cap_fd: DebugCapFd) -> Result<DebugSession, DebugError>;
/// An active debug session. Dropping this value detaches from the target and
/// delivers SIGCONT if the target was stopped by this session.
pub struct DebugSession {
/// The capability authorising this session. Kept alive for session duration;
/// revocation of the underlying cap is immediately visible here.
cap: Arc<DebugCap>,
/// Convenience pointer; equivalent to `cap.target`.
target: Arc<Process>,
}
impl DebugSession {
/// Read `buf.len()` bytes from the target's address space at `addr`.
/// Requires `cap.permissions.read_memory`.
pub fn read_memory(&self, addr: u64, buf: &mut [u8]) -> Result<usize, DebugError>;
/// Write `data` to the target's address space at `addr`.
/// Requires `cap.permissions.write_memory`.
pub fn write_memory(&self, addr: u64, data: &[u8]) -> Result<(), DebugError>;
/// Read the target's general-purpose registers.
/// Requires `cap.permissions.read_regs`.
pub fn get_regs(&self) -> Result<UserRegs, DebugError>;
/// Write the target's general-purpose registers.
/// Requires `cap.permissions.write_regs`.
pub fn set_regs(&self, regs: &UserRegs) -> Result<(), DebugError>;
/// Set a hardware breakpoint at `addr`. Returns a handle; drop the handle
/// to clear the breakpoint. Requires `cap.permissions.set_breakpoints`.
pub fn set_breakpoint(&self, addr: u64) -> Result<BreakpointHandle, DebugError>;
/// Single-step the target: target executes one instruction then re-stops.
/// Requires `cap.permissions.single_step`.
pub fn single_step(&self) -> Result<(), DebugError>;
/// Resume the target, optionally delivering `signal`.
/// Requires `cap.permissions.full_ptrace` or appropriate per-op permissions.
pub fn cont(&self, signal: Option<Signal>) -> Result<(), DebugError>;
/// Wait for the target to stop. Returns the stop reason.
///
/// Blocks until the target stops, the DebugCap expires, or the cap is
/// revoked. The wait is implemented as a bounded loop so that cap expiry
/// is detected promptly even if the target does not stop:
///
/// ```text
/// fn wait_stop(session: &DebugSession, cap: &DebugCap)
///         -> Result<WaitStatus, DebugError>:
///     loop:
///         // Slice the wait into ≤100ms chunks so cap expiry is rechecked frequently.
///         timeout_ns = 100_000_000
///         if cap.expires is Some(deadline):   // None: the cap never expires
///             now = monotonic_clock_ns()
///             if now >= deadline:
///                 session.detach()            // end the debug session on expiry
///                 return Err(DebugError::Expired)
///             timeout_ns = min(deadline - now, 100_000_000)
///         event = session.wait_for_stop_event(timeout_ns)
///         match event:
///             Timeout      → continue         // no stop yet; re-check expiry
///             StopEvent(r) → return Ok(r)
///             Revoked      → return Err(DebugError::Revoked)
/// ```
///
/// The 100ms timeout slice ensures that a DebugCap with a short remaining
/// lifetime is honoured promptly without busy-waiting.
///
/// Returns `Err(DebugError::Expired)` if the cap expires during the wait.
/// Returns `Err(DebugError::Revoked)` if the cap is revoked while waiting.
pub fn wait_stop(&self) -> Result<WaitStatus, DebugError>;
}
Every DebugSession method checks cap.revocation_seq atomically before dispatching to the
underlying ptrace path. This is a single relaxed load on the fast path (the common case
where the cap has not been revoked, so the counter still reads 0) and an acquire load when
a nonzero value is observed, ensuring visibility of any state written by the revoking
thread before it published the revocation.
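The bounded wait described in the `wait_stop()` documentation can be approximated in user-space Rust. This is a sketch under assumptions: an `mpsc` channel stands in for the kernel's stop-event source, the `Event` enum and the free-function signature are invented for the example, and a disconnected channel is treated as revocation.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum DebugError {
    Expired,
    Revoked,
}

/// Hypothetical events delivered to a waiting session.
#[derive(Debug, PartialEq)]
pub enum Event {
    Stop(u32),
    Revoked,
}

/// Upper bound on a single blocking wait, so expiry is rechecked regularly.
const SLICE: Duration = Duration::from_millis(100);

pub fn wait_stop(events: &Receiver<Event>, expires: Option<Instant>) -> Result<u32, DebugError> {
    loop {
        // Never sleep past the deadline, and never longer than one slice.
        let timeout = match expires {
            Some(deadline) => {
                let now = Instant::now();
                if now >= deadline {
                    return Err(DebugError::Expired);
                }
                (deadline - now).min(SLICE)
            }
            None => SLICE,
        };
        match events.recv_timeout(timeout) {
            Ok(Event::Stop(reason)) => return Ok(reason),
            Ok(Event::Revoked) => return Err(DebugError::Revoked),
            Err(RecvTimeoutError::Timeout) => continue,
            // Sketch-only choice: a vanished event source counts as revoked.
            Err(RecvTimeoutError::Disconnected) => return Err(DebugError::Revoked),
        }
    }
}
```

Clamping the timeout to the remaining lifetime means a cap with 30ms left is reported as expired after roughly 30ms rather than after a full 100ms slice.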
9.5.4 Capability Transfer
DebugCapFd can be sent to another process over a Unix domain socket using the standard
SCM_RIGHTS control message interface. The kernel's SCM_RIGHTS path is extended to
handle DebugCapFd file descriptors:
- The sender calls sendmsg(2) with an SCM_RIGHTS cmsg containing the DebugCapFd.
- The kernel validates that the sender holds the fd and that the cap is not revoked.
- The kernel creates a new DebugCapFd in the receiver's file-descriptor table, backed by the same Arc<DebugCap>. The reference count is incremented.
- The sender's fd stays open, as with any SCM_RIGHTS transfer; the sender may close it afterwards if it no longer needs the cap.
No additional privilege is required to transfer a DebugCap — if you hold it, you can
delegate it. The receiver inherits the same permissions and expires values; there is
no mechanism to escalate permissions on transfer (the cap is immutable after issuance).
Container debugging workflow — canonical example:
1. Container runtime (host, holds CAP_SYS_PTRACE):
fd = ptrace_cap_issue(container_pid, DebugPermissions::full(), expires=Some(30min))
→ DebugCap issued; container_pid need not be in the host's user namespace.
→ Audit: type=DEBUG_CAP_ISSUED target_pid=1234 serial=77 perms=0xFF issuer=runtime
2. Runtime passes fd to an external debugger via Unix socket (SCM_RIGHTS):
sendmsg(debugger_sock, SCM_RIGHTS=[fd])
→ Debugger receives the fd. Runtime may now close its copy and drop CAP_SYS_PTRACE.
3. Debugger (no CAP_SYS_PTRACE, no container root):
session = ptrace_attach_cap(received_fd)
→ Kernel validates cap: not revoked, not expired, target still alive.
→ Audit: type=DEBUG_CAP_USED serial=77 attacher=debugger_pid
4. Debugger operates: session.get_regs(), session.read_memory(), etc.
5. 30 minutes later: kernel marks cap expired on next DebugSession operation.
session.wait_stop() → Err(DebugError::Expired)
→ Session auto-detaches; target receives SIGCONT.
This workflow requires no root inside the container, no shared UID between the debugger and the container workload, and no long-lived broad privilege in the debugger process.
9.5.5 PR_SET_DEBUG_ACCEPT — Cross-UID Debug Grant
A process can advertise willingness to be debugged by processes that would not normally
pass the UID check in grant_debug_cap(). This uses a new prctl option:
/* Allow processes in the same user namespace (even different UIDs) to call
* grant_debug_cap() targeting this process.
* arg2: DEBUG_ACCEPT_NONE (0) = default, only same-UID or parent
* DEBUG_ACCEPT_SAME_NS (1) = any process in same user namespace
* arg3, arg4, arg5: must be zero.
*/
prctl(PR_SET_DEBUG_ACCEPT, DEBUG_ACCEPT_SAME_NS, 0, 0, 0);
The flag is stored in the Task struct alongside PR_SET_DUMPABLE. It is cleared on
execve() (reset to DEBUG_ACCEPT_NONE). It is inherited across fork() and clone()
(a process that accepts debug access continues to do so after forking worker children,
which is the typical use case for worker-pool servers).
PR_SET_DEBUG_ACCEPT does not bypass LSM checks — the security_debug_cap_grant() hook
(Section 9.5.9) fires regardless of the accept flag. It only widens the UID check in
the kernel's own permission gate.
Use case: a multi-user web server spawns worker processes under different UIDs per
virtual host. A monitoring tool running as the server's primary UID needs to inspect
worker memory. Workers call prctl(PR_SET_DEBUG_ACCEPT, DEBUG_ACCEPT_SAME_NS) at
startup; the monitor calls grant_debug_cap(worker_pid, read_memory_only, 5min) without
needing root.
9.5.6 Revocation
DebugCap revocation uses a seqlock protocol to provide atomic revocation without a global lock:
- DebugCap carries revocation_seq: AtomicU32, initially 0 (even = valid)
- To revoke: write seq | 1 (odd = in-progress revocation), perform all revocation bookkeeping (remove from ptrace tables, close debug channels), then write seq + 2 (even = complete, permanently revoked)
- ptrace operations read seq before the operation and after; if the value changed or is odd, the operation returns ESRCH (target gone)
- No stale-cap window: the odd transition atomically signals all concurrent operations to abort before revocation completes
This is the standard seqlock pattern applied to capability lifecycle management.
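The protocol can be sketched with a single atomic counter. The `Revocation` type and its method names are hypothetical; the sketch models only the counter transitions and the before/after check, not the bookkeeping or the wake-up of blocked sessions.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical wrapper around the revocation counter.
pub struct Revocation {
    seq: AtomicU32,
}

impl Revocation {
    pub fn new() -> Self {
        Self { seq: AtomicU32::new(0) }
    }

    /// Step 1: set the low bit (odd = revocation in progress).
    /// Returns the counter value seen before the transition.
    pub fn begin(&self) -> u32 {
        self.seq.fetch_or(1, Ordering::AcqRel)
    }

    /// Step 2: after bookkeeping, publish the next even value (revoked).
    pub fn finish(&self, start: u32) {
        self.seq.store((start & !1) + 2, Ordering::Release);
    }

    /// Run `op` only if the cap is valid before the call and the counter
    /// did not move while `op` ran; otherwise report failure.
    pub fn guarded<T>(&self, op: impl FnOnce() -> T) -> Option<T> {
        let before = self.seq.load(Ordering::Acquire);
        if before != 0 {
            return None; // odd (in progress) or even >= 2 (revoked)
        }
        let out = op();
        if self.seq.load(Ordering::Acquire) != before {
            return None; // revocation started while the operation ran
        }
        Some(out)
    }
}
```

In this sketch a `None` result corresponds to the failure path of a ptrace operation; an operation that raced with `begin()` is discarded even though its body already ran, which is the usual seqlock retry-or-abort behaviour.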
/// Revoke a DebugCap identified by its kernel file descriptor.
///
/// Effects (all atomic with respect to ongoing DebugSession operations):
/// - Sets `cap.revocation_seq` to an odd value (in-progress), completes all
/// revocation bookkeeping, then advances to even (permanently revoked).
/// - Any `DebugSession` currently holding this cap has its next operation
/// return `Err(DebugError::Revoked)`.
/// - Any task blocked in `wait_stop()` on a revoked session is immediately
/// woken and receives `Err(DebugError::Revoked)`.
/// - If the target was stopped by this session, SIGCONT is delivered.
/// - All copies of `DebugCapFd` (across all processes, via SCM_RIGHTS
/// duplicates) are simultaneously invalidated — revocation is on the
/// underlying `DebugCap` object, not on any individual fd copy.
///
/// Audit: type=DEBUG_CAP_REVOKED serial=N revoker_pid=M
pub fn cap_revoke(cap_fd: DebugCapFd) -> Result<(), CapError>;
Only the process that issued the DebugCap can revoke it. Issuance is recorded in the
cap's issuer_pid field, which the kernel checks in cap_revoke(). Processes that
received the cap via SCM_RIGHTS transfer can close their fd copy (reducing the reference
count) but cannot revoke the cap itself. This asymmetry is intentional: a delegated cap
should not be revocable by the delegate; only the authority that granted it can
terminate it.
Target process exit implicitly revokes all DebugCap tokens targeting that process.
The kernel advances revocation_seq to an even value >= 2 on every outstanding
DebugCap for the exiting process during the process teardown path, before releasing
the Arc<Process>. This
ensures that any DebugSession blocked in wait_stop() wakes and returns
Err(DebugError::TargetGone) rather than blocking indefinitely.
9.5.7 Audit Logging
Every lifecycle event for a DebugCap is logged via the kernel audit subsystem
(Section 19.2.9):
| Event | Audit record format |
|---|---|
| Cap issued via ptrace_cap_issue() | type=DEBUG_CAP_ISSUED serial=N target_pid=T issuer_pid=I perms=0xHH expires=S |
| Cap issued via grant_debug_cap() | type=DEBUG_CAP_GRANTED serial=N target_pid=T grantee_pid=G perms=0xHH expires=S |
| Cap issued via self_debug_cap() | type=DEBUG_CAP_SELF serial=N pid=P perms=0xHH |
| Cap used in ptrace_attach_cap() | type=DEBUG_CAP_USED serial=N attacher_pid=A target_pid=T |
| Cap revoked via cap_revoke() | type=DEBUG_CAP_REVOKED serial=N revoker_pid=R |
| Cap expired (on next use) | type=DEBUG_CAP_EXPIRED serial=N target_pid=T |
| Target exited with outstanding caps | type=DEBUG_CAP_TARGET_EXIT serial=N target_pid=T (one per outstanding cap) |
The perms field in the audit record is a bitmask of the DebugPermissions fields,
in the order they appear in the struct: read_memory = bit 0, write_memory = bit 1,
read_regs = bit 2, write_regs = bit 3, set_breakpoints = bit 4,
single_step = bit 5, intercept_signals = bit 6, trace_syscalls = bit 7,
full_ptrace = bit 8.
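The bit assignment above can be expressed as a small encoder. The function name `audit_perms_mask` is hypothetical, and the struct repeats the fields from Section 9.5.1 so that the example stands alone.

```rust
#[derive(Clone, Copy, Default)]
pub struct DebugPermissions {
    pub read_memory: bool,
    pub write_memory: bool,
    pub read_regs: bool,
    pub write_regs: bool,
    pub set_breakpoints: bool,
    pub single_step: bool,
    pub intercept_signals: bool,
    pub trace_syscalls: bool,
    pub full_ptrace: bool,
}

/// Encode the permissions in struct order: field i maps to bit i.
pub fn audit_perms_mask(p: &DebugPermissions) -> u16 {
    let fields = [
        p.read_memory,       // bit 0
        p.write_memory,      // bit 1
        p.read_regs,         // bit 2
        p.write_regs,        // bit 3
        p.set_breakpoints,   // bit 4
        p.single_step,       // bit 5
        p.intercept_signals, // bit 6
        p.trace_syscalls,    // bit 7
        p.full_ptrace,       // bit 8
    ];
    fields
        .iter()
        .enumerate()
        .fold(0u16, |mask, (bit, &set)| mask | ((set as u16) << bit))
}
```

Note that nine fields need nine bits, so a cap with every field set encodes as 0x1FF and does not fit in two hex digits.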
The serial number is globally unique within a boot session and monotonically increasing.
Audit records from the ptrace_attach_cap() call can be correlated with the issuance
record using the serial field.
9.5.8 Linux Compatibility
ptrace(2) with the standard PTRACE_* constants continues to work unchanged. UmkaOS
internally converts every ptrace(PTRACE_ATTACH, pid) call into a DebugCap with
full_ptrace: true using the caller's CAP_DEBUG capability token (or the traditional
UID check for processes that do not use the UmkaOS capability model). The resulting session
is tracked identically to one opened via ptrace_attach_cap() — revocation, audit
logging, and DebugSession semantics apply.
This means the audit trail covers traditional ptrace sessions as well as DebugCap sessions,
with no gaps. GDB, LLDB, strace, perf, and any other tool using the ptrace(2) syscall
operate without modification.
New UmkaOS-specific syscalls for the DebugCap API:
| Syscall | x86-64 number | Description |
|---|---|---|
| ptrace_cap_issue | 1032 | Issue a DebugCap for a target process (requires CAP_DEBUG/CAP_SYS_PTRACE) |
| ptrace_attach_cap | 1033 | Attach a debug session to a target using a DebugCapFd |
| grant_debug_cap | 1034 | Grant debug access to another process (issuer = calling process) |
| self_debug_cap | 1035 | Issue a non-transferable DebugCap for the calling process |
| cap_revoke | 1036 | Revoke a DebugCap by its fd (issuer only) |
These syscall numbers are UmkaOS-specific. UmkaOS custom syscalls start at 1024, well above
the highest syscall number Linux currently assigns (as of Linux 7.0), which leaves ample
headroom for future Linux growth. The PR_SET_DEBUG_ACCEPT and PR_GET_DEBUG_ACCEPT prctl
options use the next available UmkaOS-reserved prctl numbers after the existing set defined
in include/uapi/linux/prctl.h.
9.5.9 LSM Hooks
The UmkaOS LSM framework (Section 8.7) provides hooks at every DebugCap lifecycle point:
/// Called on ptrace_cap_issue() and grant_debug_cap() before issuing the cap.
/// LSM may deny issuance (e.g., Mandatory Access Control policy prevents
/// cross-label debugging).
///
/// `issuer`: the calling process.
/// `target`: the process to be debugged.
/// `perms`: the requested permissions.
///
/// Returns Ok(()) to permit, Err(Errno::EPERM) to deny.
fn security_debug_cap_grant(
issuer: &Process,
target: &Process,
perms: &DebugPermissions,
) -> Result<(), Errno>;
/// Called on ptrace_attach_cap() before attaching the session.
/// LSM may deny attachment even if the cap is validly issued
/// (e.g., policy changed since issuance).
///
/// Returns Ok(()) to permit, Err(Errno::EPERM) to deny.
fn security_debug_cap_attach(
attacher: &Process,
cap: &DebugCap,
) -> Result<(), Errno>;
/// Called on cap_revoke() before revoking.
/// LSM may deny revocation (unusual; reserved for audit-locking scenarios
/// where the audit system must preserve an active session until log flush).
///
/// Returns Ok(()) to permit, Err(Errno::EPERM) to deny.
fn security_debug_cap_revoke(
revoker: &Process,
cap: &DebugCap,
) -> Result<(), Errno>;
The existing security_ptrace() hook (called for traditional ptrace(PTRACE_ATTACH))
continues to fire for the synthetic DebugCap created by the compat path, so LSM policy
is uniformly applied regardless of whether the caller uses the new API or the legacy
ptrace(2) syscall.
9.5.10 DebugCap Request Rate Limiting
Processes holding CAP_DEBUG or CAP_SYS_PTRACE may call ptrace_cap_issue() or
grant_debug_cap() in a tight loop, producing a rapid sequence of kernel capability
allocations and audit records. Without a rate limit, this is a low-cost DoS vector against
the audit subsystem and capability allocator. The rate limit closes the window.
Rate: 10 DebugCap issue requests per second per real UID (RUID), enforced via a
token bucket algorithm. Processes holding CAP_SYS_ADMIN in their effective set are
exempt — they are already unconditionally trusted.
self_debug_cap() is also exempt: it produces a non-transferable, same-process-only cap
with no cross-process security boundary crossing, and carries no meaningful DoS potential.
/// Per-UID token bucket for DebugCap request rate limiting.
///
/// Each entry represents the rate-limit state for one real UID.
/// Entries are created on first request and evicted after 60 seconds
/// of inactivity (no requests from this UID).
pub struct DebugCapRateLimit {
/// Tokens available. Each DebugCap issue request consumes 1 token.
/// Refilled at REFILL_RATE_NS intervals up to MAX_TOKENS.
tokens: AtomicU32,
/// Timestamp of last token refill (nanoseconds since boot).
/// Used to calculate how many tokens to add on the next request.
last_refill_ns: AtomicU64,
}
impl DebugCapRateLimit {
/// Burst capacity: maximum tokens in the bucket. A fully-charged
/// UID may issue up to 10 DebugCap requests before being throttled.
const MAX_TOKENS: u32 = 10;
/// Token refill interval: one new token every 100ms = 10 tokens/sec
/// sustained throughput. Calculated as: 1_000_000_000ns / 10 = 100ms.
const REFILL_RATE_NS: u64 = 100_000_000;
}
Storage: Per-UID DebugCapRateLimit entries live in a hash table keyed by RUID,
protected by an RCU-read lock for lookup and a per-bucket spinlock for mutation.
New entries are allocated from the slab allocator on first request. Entries are evicted
(and their memory returned) after 60 seconds of inactivity, detected during the refill
step: if now - last_refill_ns > 60_000_000_000ns, the entry is removed and the bucket
fully refills for the next request from that UID (fresh start, no penalty for inactivity).
Token consumption algorithm (executed under the per-bucket spinlock):
- Compute elapsed = now_ns - entry.last_refill_ns.
- Add elapsed / REFILL_RATE_NS tokens to entry.tokens, clamped to MAX_TOKENS.
- Set entry.last_refill_ns += (elapsed / REFILL_RATE_NS) * REFILL_RATE_NS (preserve fractional interval for the next call; do not set to now_ns directly).
- If entry.tokens >= 1: decrement tokens by 1, return Ok(()).
- Otherwise: return Err(Errno::EBUSY).
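The token consumption steps can be sketched as follows. This is an illustration under stated assumptions, not the kernel implementation: it uses plain integer fields instead of AtomicU32/AtomicU64 because the caller is assumed to already hold the per-bucket spinlock, and Bucket, try_consume, and the local Errno enum are hypothetical names introduced here.

```rust
/// Hypothetical stand-in for the kernel errno type used in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EBUSY,
}

/// Burst capacity and refill interval, matching DebugCapRateLimit.
pub const MAX_TOKENS: u32 = 10;
pub const REFILL_RATE_NS: u64 = 100_000_000;

/// Plain-field model of one per-UID entry. The caller is assumed to hold
/// the per-bucket spinlock, so no atomics are used in this sketch.
pub struct Bucket {
    pub tokens: u32,
    pub last_refill_ns: u64,
}

impl Bucket {
    /// A new entry starts with a full bucket.
    pub fn new(now_ns: u64) -> Self {
        Bucket { tokens: MAX_TOKENS, last_refill_ns: now_ns }
    }

    /// Refill by whole elapsed intervals, then try to take one token.
    pub fn try_consume(&mut self, now_ns: u64) -> Result<(), Errno> {
        // Step 1: elapsed time since the last accounted refill.
        let elapsed = now_ns.saturating_sub(self.last_refill_ns);
        // Step 2: one token per whole interval, clamped to MAX_TOKENS.
        let intervals = elapsed / REFILL_RATE_NS;
        let refill = intervals.min(MAX_TOKENS as u64) as u32;
        self.tokens = self.tokens.saturating_add(refill).min(MAX_TOKENS);
        // Step 3: advance by whole intervals only, so the fractional
        // remainder carries over to the next call.
        self.last_refill_ns += intervals * REFILL_RATE_NS;
        // Steps 4 and 5: consume a token or signal transient backpressure.
        if self.tokens >= 1 {
            self.tokens -= 1;
            Ok(())
        } else {
            Err(Errno::EBUSY)
        }
    }
}
```

For example, a bucket drained at t = 0 and queried again at t = 250ms gains two tokens and sets last_refill_ns to 200ms. The remaining 50ms counts toward the next token, which becomes available at t = 300ms rather than t = 350ms.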
Error semantics: when the rate limit is exceeded, the syscall returns EBUSY (not
EPERM). EPERM signals permanent denial; EBUSY signals transient backpressure — the
caller is authorized but must wait. Callers should apply exponential back-off starting at
100ms. A well-behaved debugger daemon will never encounter this limit in normal operation.
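A caller-side retry loop following this guidance might look like the sketch below. The names backoff_delay and retry_on_ebusy, the 5-second MAX_BACKOFF cap, and the local Errno enum are assumptions made for illustration; the specification only fixes the 100ms starting delay and the EBUSY/EPERM distinction. The sleep function is passed in so the logic can be exercised without real delays.

```rust
use std::time::Duration;

/// Hypothetical errno subset for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EBUSY,
    EPERM,
}

/// Starting delay from the text above.
pub const INITIAL_BACKOFF: Duration = Duration::from_millis(100);
/// Assumed upper bound on a single delay; not specified in this section.
pub const MAX_BACKOFF: Duration = Duration::from_secs(5);

/// Delay before retry number `attempt` (0-based): 100ms, 200ms, 400ms, ...
/// doubling each time and capped at MAX_BACKOFF.
pub fn backoff_delay(attempt: u32) -> Duration {
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    INITIAL_BACKOFF.saturating_mul(factor).min(MAX_BACKOFF)
}

/// Calls `request` until it returns something other than EBUSY or until
/// `max_attempts` calls have been made. Any other error (such as EPERM)
/// is returned immediately, since it is not transient.
pub fn retry_on_ebusy<T>(
    max_attempts: u32,
    mut request: impl FnMut() -> Result<T, Errno>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, Errno> {
    let mut attempt = 0;
    loop {
        match request() {
            Err(Errno::EBUSY) if attempt + 1 < max_attempts => {
                sleep(backoff_delay(attempt));
                attempt += 1;
            }
            other => return other,
        }
    }
}
```

In a real caller, `sleep` would typically be std::thread::sleep and `request` would wrap the DebugCap issue syscall. EPERM is deliberately not retried because it signals permanent denial.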
Audit: every rate-limit rejection is logged to the IMA audit ring:
type=DEBUG_CAP_RATELIMIT uid=U request=ptrace_cap_issue|grant_debug_cap timestamp_ns=T
The audit record includes the requesting UID (uid), the syscall name (request), and
the kernel monotonic timestamp (timestamp_ns). Rate-limit audit records are written
unconditionally (they are not themselves rate-limited) to preserve the full evidence
trail for intrusion detection.
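As an illustration of the record layout, the sketch below formats one rejection record. It assumes that the `|` in the format line denotes a choice between the two syscall names. DebugCapRequest and format_ratelimit_record are hypothetical names, and writing to the IMA audit ring itself is not modelled.

```rust
/// The two syscalls that can trigger a rate-limit rejection, per the
/// record format above. Hypothetical enum for this sketch.
pub enum DebugCapRequest {
    PtraceCapIssue,
    GrantDebugCap,
}

impl DebugCapRequest {
    /// Syscall name as it appears in the `request=` field.
    pub fn as_str(&self) -> &'static str {
        match self {
            DebugCapRequest::PtraceCapIssue => "ptrace_cap_issue",
            DebugCapRequest::GrantDebugCap => "grant_debug_cap",
        }
    }
}

/// Builds the text of one DEBUG_CAP_RATELIMIT audit record.
/// `timestamp_ns` is the kernel monotonic timestamp in nanoseconds.
pub fn format_ratelimit_record(uid: u32, request: DebugCapRequest, timestamp_ns: u64) -> String {
    format!(
        "type=DEBUG_CAP_RATELIMIT uid={} request={} timestamp_ns={}",
        uid,
        request.as_str(),
        timestamp_ns
    )
}
```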
Interaction with LSM hooks: rate limiting occurs before the security_debug_cap_grant()
LSM hook. A request rejected by the rate limiter never reaches LSM. This ordering is correct:
there is no point invoking potentially expensive LSM policy evaluation for a request that
will be refused regardless.
Cross-references:
- Section 8.1: Core capability system; CAP_DEBUG and CAP_SYS_PTRACE capability bits; capability delegation model
- Section 8.7: LSM hooks security_debug_cap_grant(), security_debug_cap_attach(), security_debug_cap_revoke()
- Section 8.8: Credential model; CAP_DEBUG in the effective set vs. ambient set; bounding set enforcement
- Section 13: Container namespace isolation; user namespace boundaries crossed by transferred DebugCap
- Section 19.3.1: Capability-gated ptrace; the per-operation permission checks that DebugSession enforces
- Section 19.3.2: Hardware breakpoint/watchpoint registers managed via DebugSession.set_breakpoint()