Chapter 17: Virtualization

KVM host/guest integration, VMX/VHE/H-ext, live migration, PV features, suspend/resume

KVM host support scope: umka-kvm targets x86-64 (VMX/SVM), AArch64 (VHE/nVHE), and RISC-V (H-extension) as hypervisor hosts. ARMv7, PPC32, and PPC64LE are intentionally out of scope for KVM host mode: 32-bit ARM hypervisor workloads are niche and UmkaOS's target use cases (cloud, server, edge) use 64-bit hosts exclusively; PPC32/PPC64LE hypervisor support requires PAPR (IBM's PowerVM ABI), which is outside UmkaOS's compatibility scope. All three platforms continue to support UmkaOS as a KVM guest inside a compatible hypervisor.


17.1 Host and Guest Integration

How UmkaOS behaves as a VM host (via umka-kvm) and as a guest kernel running inside a hypervisor. This section covers virtio device negotiation, paravirtual optimizations, vhost data plane, and live migration.

Guest Mode — Virtio Device Negotiation

When UmkaOS runs as a guest kernel, it discovers virtio devices via PCI or MMIO transport and negotiates feature bits with the hypervisor. The virtio drivers (virtio-blk, virtio-net, virtio-gpu, virtio-console — already listed as Priority 1 in Section 10.5) implement the standard virtio 1.2 specification (approved as an OASIS Committee Specification in July 2022), with forward-compatible support for virtio 1.3 features as that draft is finalized.

Guest Mode — Paravirtual Clock

Hardware RDTSC inside a VM can be inaccurate (the TSC may not be invariant, or vmexit overhead distorts time). Paravirtual clock avoids this:

  • KVM pvclock / kvmclock: the hypervisor maps a shared memory page containing clock parameters (scale, offset, version). The guest reads time from this page — no vmexit required. UmkaOS's clocksource subsystem auto-detects and prefers pvclock when running as a KVM guest.
  • Hyper-V TSC page: equivalent mechanism for Hyper-V hosts. Same principle — shared memory page, no hypercall for time reads.
  • Fallback: if neither paravirt clock is available, UmkaOS uses the ACPI PM timer (slow but always accurate) or PIT (ancient but universal).
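The pvclock read protocol is a seqlock: the hypervisor bumps a version field to odd before updating the page and back to even after, and the guest retries until it observes a stable even version. A minimal host-independent sketch (field names loosely mirror the KVM pvclock ABI; `pvclock_read_ns` and its callback parameters are illustrative, not the actual UmkaOS clocksource API):

```rust
/// Loose mirror of the KVM pvclock shared-page fields (illustrative names;
/// the authoritative layout is pvclock_vcpu_time_info in the KVM ABI).
#[derive(Clone, Copy)]
pub struct PvclockTimeInfo {
    pub version: u32,        // odd while the hypervisor is mid-update
    pub tsc_timestamp: u64,  // host TSC at the last update
    pub system_time_ns: u64, // guest time (ns) at tsc_timestamp
    pub tsc_to_ns_mul: u32,  // fixed-point x.32 multiplier
    pub tsc_shift: i8,       // power-of-two pre-scale
}

/// Convert a raw TSC delta to nanoseconds using the page's scale parameters.
fn scale_delta(mut delta: u64, mul: u32, shift: i8) -> u64 {
    if shift >= 0 {
        delta <<= shift as u32;
    } else {
        delta >>= (-shift) as u32;
    }
    ((delta as u128 * mul as u128) >> 32) as u64
}

/// Seqlock-style read: retry if the version is odd (update in progress) or
/// changed between the two samples. No vmexit, no hypercall.
pub fn pvclock_read_ns(
    read_page: impl Fn() -> PvclockTimeInfo,
    rdtsc: impl Fn() -> u64,
) -> u64 {
    loop {
        let snap = read_page();
        if snap.version & 1 != 0 {
            continue; // hypervisor is updating the page
        }
        let tsc = rdtsc();
        let ns = snap.system_time_ns
            + scale_delta(tsc - snap.tsc_timestamp, snap.tsc_to_ns_mul, snap.tsc_shift);
        if read_page().version == snap.version {
            return ns; // consistent snapshot
        }
    }
}
```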

Guest Mode — Balloon Driver

virtio-balloon enables dynamic memory adjustment — the hypervisor can reclaim guest memory by inflating the balloon (guest returns pages) or release memory by deflating it. UmkaOS integrates balloon inflation with its memory pressure framework:

  • Balloon inflation is treated as memory pressure, triggering the same reclaim path as physical memory exhaustion (page cache eviction, slab shrinking, swap-out)
  • Balloon deflation immediately makes pages available to the buddy allocator
  • This unified pressure model means UmkaOS's OOM decisions correctly account for ballooned-away memory
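A sketch of the unified pressure model, assuming hypothetical `balloon_inflate`/`reclaim` entry points (not the actual UmkaOS API): an inflate request is satisfied from free pages first, and any shortfall is pushed through the ordinary reclaim path rather than a balloon-specific one.

```rust
/// Outcome of an inflate request (illustrative type, not the real API).
#[derive(Debug, PartialEq)]
pub enum ReclaimOutcome {
    Freed(u64),
    OutOfMemory,
}

/// `reclaim` stands in for the kernel's normal reclaim entry point
/// (page cache eviction, slab shrinking, swap-out); it returns the number
/// of pages actually freed.
pub fn balloon_inflate(
    target_pages: u64,
    free_pages: u64,
    reclaim: impl Fn(u64) -> u64,
) -> ReclaimOutcome {
    // Pages already free can be handed to the balloon directly.
    let shortfall = target_pages.saturating_sub(free_pages);
    if shortfall == 0 {
        return ReclaimOutcome::Freed(target_pages);
    }
    // Otherwise drive the same reclaim path as physical memory exhaustion.
    let freed = reclaim(shortfall);
    if freed >= shortfall {
        ReclaimOutcome::Freed(target_pages)
    } else {
        ReclaimOutcome::OutOfMemory
    }
}
```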

Guest Mode — PV Spinlocks

Under overcommitted VMs, spinning on a lock held by a descheduled vCPU wastes host CPU cycles (the spinning vCPU can never acquire the lock until the holder is scheduled). UmkaOS detects the hypervisor type at boot:

  • KVM: the spinning vCPU halts (HLT-based yield) when it detects the lock holder is descheduled; the lock releaser calls KVM_HC_KICK_CPU to wake the halted waiter
  • Hyper-V: uses the HvCallNotifyLongSpinWait hypercall — notifies the hypervisor of a long spin wait, allowing it to schedule the lock holder
  • Bare metal: standard spin loops (no overhead when not virtualized)

Post-Yield Backoff

When a yielded vCPU is resumed by the hypervisor (indicating another vCPU has released or is about to release the lock), the acquiring vCPU uses the following adaptive backoff before re-yielding:

attempt = 0
loop:
    try acquire lock (test-and-set)
    if acquired: return

    if attempt < 6:
        # Spin for 2^attempt iterations (1, 2, 4, 8, 16, 32)
        spin_hint(1 << attempt)   # x86: PAUSE; ARM: YIELD; RISC-V: nop
        attempt += 1
    else:
        # Back to hypervisor yield
        pv_kick_yield()
        attempt = 0               # reset after yield

Total spin before re-yielding: 1+2+4+8+16+32 = 63 loop iterations (~100-250ns). This avoids hammering the hypervisor with immediate re-yields while still responding quickly when the lock becomes available.

Maximum yield count: after 32 consecutive yields without acquiring the lock, the vCPU switches to schedule() (voluntary preemption) to allow other vCPUs to make progress. This prevents a vCPU from monopolizing its pCPU waiting for a lock held by a vCPU that is not scheduled.
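Combining the backoff pseudocode with the 32-yield cap, one possible shape in Rust (all three callbacks are hypothetical stand-ins for the arch PAUSE/YIELD hint, the hypervisor yield, and schedule(); this is a sketch of the policy, not the actual lock implementation):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

const MAX_SPIN_ROUNDS: u32 = 6; // 1+2+4+8+16+32 = 63 spins before re-yielding
const MAX_YIELDS: u32 = 32;     // then fall back to voluntary preemption

/// `spin_hint`, `pv_kick_yield`, and `schedule` are illustrative stand-ins
/// for the arch spin hint, the hypervisor yield, and voluntary preemption.
pub fn pv_lock_acquire(
    lock: &AtomicBool,
    spin_hint: &mut impl FnMut(u32),
    pv_kick_yield: &mut impl FnMut(),
    schedule: &mut impl FnMut(),
) {
    let mut attempt = 0u32;
    let mut yields = 0u32;
    loop {
        // test-and-set: a previous value of `false` means we took the lock.
        if !lock.swap(true, Ordering::Acquire) {
            return;
        }
        if attempt < MAX_SPIN_ROUNDS {
            spin_hint(1 << attempt); // exponential backoff: 1, 2, 4, 8, 16, 32
            attempt += 1;
        } else if yields < MAX_YIELDS {
            pv_kick_yield(); // back to hypervisor yield
            yields += 1;
            attempt = 0; // reset after yield
        } else {
            schedule(); // voluntary preemption after 32 consecutive yields
            yields = 0;
        }
    }
}
```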

Guest Mode — Hypervisor-Specific Backends

Hypervisor Paravirt Features
KVM (primary) pvclock, PV spinlocks, PV TLB flush, steal time accounting, async PF
Hyper-V Synthetic interrupts, synthetic timer, APIC assist, TSC page, PV spinlocks
Xen PV Future — xenbus, grant tables, PV disk/net (lower priority)

Guest Mode — Cloud Metadata

Cloud-init and instance metadata (AWS IMDSv2, Azure IMDS, GCP metadata server) are consumed by userspace agents. The kernel's role is providing transport:

  • vsock (virtio-vsock) for hypervisor↔guest communication without networking
  • virtio-serial for structured host↔guest channels
  • Standard networking for HTTP-based metadata endpoints (169.254.169.254)

vhost Kernel Data Plane

vhost moves the virtio data plane into the host kernel, bypassing the VMM (QEMU) for hot-path I/O:

  • vhost-net: kernel-side virtio-net processing. Packets move directly between the guest's virtio ring and the host's tap/macvtap device inside the kernel; the VMM handles only the control plane (device configuration, feature negotiation). Implemented as a Tier 1 umka-kvm module with extended hardware privileges: umka-kvm requires CAP_VMX (hardware virtualization support), which grants it the KvmHardwareCapability on top of standard Tier 1 memory-domain isolation. See Section 18.1.4.5 for the full classification. Why this exception is unique and non-proliferating: VMX/SVM instructions must execute directly on the host CPU — they cannot be mediated via MMIO, DMA, or ring buffer IPC like all other device operations. No other driver class has this constraint; all other hardware interactions go through memory-mapped registers or DMA descriptors that the standard Tier 1 isolation boundary can intercept.
  • vhost-scsi: kernel-side virtio-scsi processing for direct block device access from guests, bypassing QEMU's I/O path. Guests see near-native block device performance.
  • vhost-user: protocol for offloading vhost processing to userspace daemons (DPDK for networking, SPDK for storage). This is handled entirely in userspace by the VMM (e.g., QEMU) which shares guest memory via memfd with the backend daemon. The UmkaOS kernel does not implement vhost-user directly; it simply provides the standard shared memory and unix domain socket primitives required for QEMU to function.
  • vhost-vDPA: Hardware-accelerated virtio for SmartNICs and DPUs. vDPA (virtio Data Path Acceleration) allows the virtio data plane to be offloaded to hardware while the control plane remains in software. Integration with UmkaOS's SmartNIC architecture (Section 5.2) is planned for Phase 4-5.
  • vhost-vsock: host↔guest communication channel using the vsock address family. No networking stack required — communication uses a simple stream/datagram protocol over shared memory.

VM Live Migration (KVM)

Live migration moves a running VM from one physical host to another with minimal downtime. UmkaOS's umka-kvm implements the full migration pipeline:

  1. Pre-copy phase: Track dirty pages via Intel PML (Page Modification Logging) or manual dirty bitmap scanning. umka-kvm reads the PML buffer on a timer interrupt and transmits dirty pages to the destination host.
  2. Iterative convergence: Multiple pre-copy rounds, each sending pages dirtied since the last round. Configurable maximum downtime target (e.g., 50ms).
  3. Auto-converge: If the guest's dirty rate exceeds the network transfer rate (migration won't converge), umka-kvm throttles vCPU execution to reduce the dirty rate. This is automatic and transparent to the guest.
  4. Stop-and-copy: When the remaining dirty set is small enough to transfer within the downtime target, the VM is paused, final dirty pages are sent, and the destination resumes execution.
  5. Post-copy (optional): The destination VM starts running immediately. Pages not yet transferred are faulted in on demand via a kernel-internal demand-fault mechanism (not Linux's userfaultfd, which is a userspace API). Since umka-kvm runs as a Tier 1 kernel module with extended hardware privileges, it registers a post-copy fault handler directly with the page fault subsystem (Section 4.1). When a guest accesses a not-yet-migrated page, the fault handler requests the page from the source host over the migration channel (TCP or RDMA) and maps it before returning. This is functionally equivalent to QEMU's userfaultfd-based post-copy but operates entirely in kernel space.

Convergence Policy and Auto-Convergence

UmkaOS's migration controller owns the convergence decision. The wire protocol is QEMU-compatible for interoperability, but the policy for when and how to converge is UmkaOS's internal design ("UmkaOS inside").

Convergence threshold: Pre-copy is considered converged when:

remaining_dirty_pages < convergence_threshold

where convergence_threshold = initial_dirty_pages * 0.02 (2% of the initial working set). When this threshold is met, the controller proceeds directly to stop-and-copy regardless of which round it is.

Dirty-rate tracking: At the end of each pre-copy round the controller computes:

dirty_rate_pages_per_sec =
    pages_dirtied_this_round / round_duration_secs;

transfer_rate_pages_per_sec =
    bytes_transferred_this_round / PAGE_SIZE / round_duration_secs;

// Transfer must exceed the dirty rate by at least a 10% margin.
is_converging = dirty_rate_pages_per_sec < transfer_rate_pages_per_sec * 0.9;

Auto-converge trigger: If pre-copy has NOT converged after max_precopy_rounds = 30 rounds, OR if is_converging is false for 3 consecutive rounds, the controller begins auto-convergence using the following action type:

pub enum ConvergenceAction {
    /// Throttle vCPU execution to reduce the dirty rate.
    /// `throttle_pct` is the percentage reduction applied to the vCPU time
    /// slice. Increased by 10% each non-converging round; maximum 80%.
    ThrottleVcpu { throttle_pct: u8 },

    /// Switch to post-copy mode: resume the VM at the destination and fetch
    /// remaining pages on demand (guest page fault → source request →
    /// transfer → resume).
    SwitchToPostCopy,
}

Auto-converge sequence:

  1. Rounds 1–10: pure pre-copy, no throttling.
  2. Round 11+: if not converging, apply ThrottleVcpu starting at 10%, increasing by 10% per non-converging round up to a maximum of 80%.
  3. If throttle reaches 80% and migration still has not converged after 5 further rounds: issue SwitchToPostCopy. Post-copy always terminates because pages are fetched on demand and the VM is already live at the destination.
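The auto-converge sequence above can be sketched as a per-round decision function (struct and constant names are illustrative, not the actual controller; the "keep pre-copying" variant is added so the sketch is self-contained):

```rust
/// Per-round decision; extends the spec's ConvergenceAction with an
/// explicit "continue" variant for this sketch.
#[derive(Debug, PartialEq)]
pub enum Decision {
    ContinuePrecopy,
    ThrottleVcpu { throttle_pct: u8 },
    SwitchToPostCopy,
}

/// Illustrative controller state, one per in-flight migration.
pub struct ConvergenceCtl {
    pub round: u32,
    pub non_converging_streak: u32,
    pub throttle_pct: u8,
    pub rounds_at_max_throttle: u32,
}

impl ConvergenceCtl {
    /// Called at the end of each pre-copy round with the is_converging
    /// verdict from the dirty-rate tracking above.
    pub fn end_of_round(&mut self, is_converging: bool) -> Decision {
        self.round += 1;
        if is_converging {
            self.non_converging_streak = 0;
            return Decision::ContinuePrecopy;
        }
        self.non_converging_streak += 1;
        if self.round <= 10 {
            return Decision::ContinuePrecopy; // rounds 1-10: no throttling
        }
        if self.throttle_pct >= 80 {
            // At max throttle: give pre-copy 5 further rounds, then post-copy.
            self.rounds_at_max_throttle += 1;
            if self.rounds_at_max_throttle > 5 {
                return Decision::SwitchToPostCopy;
            }
            return Decision::ThrottleVcpu { throttle_pct: 80 };
        }
        // +10% per non-converging round, capped at 80%.
        self.throttle_pct = (self.throttle_pct + 10).min(80);
        Decision::ThrottleVcpu { throttle_pct: self.throttle_pct }
    }
}
```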

Post-copy failure mitigations: Post-copy fails catastrophically if the source host dies before all pages have been delivered. UmkaOS applies three mitigations:

  • The source host is kept alive (vCPUs suspended, not destroyed) until the destination's post-copy fault handler confirms every referenced page has been received.
  • If the source fails mid-post-copy: the destination VMM process is killed (SIGKILL) and the migration is declared failed. The VM cannot continue safely with unreachable pages.
  • Optionally: a pre-copy checkpoint snapshot is taken before switching to post-copy. If post-copy then fails, the operator can restart from the checkpoint rather than from scratch.

VFIO/passthrough constraint: Post-copy live migration is disabled when the VM has VFIO passthrough devices attached. The reason: post-copy allows the guest to run on the destination before all pages are transferred; if a passthrough device DMAs to a page that hasn't been migrated yet (still on the source), the IOMMU on the destination raises an unrecoverable fault (the physical address is not mapped in the destination's IOMMU domain). UmkaOS detects passthrough devices at migration-start time and automatically switches to pre-copy with auto-converge when any VfioDevice is attached to the VM. Pre-copy ensures all dirty pages are transferred before the final stop-and-copy phase, preventing any DMA to unmigrated pages.
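The migration-start check reduces to a small policy function (names are illustrative; the real controller would inspect the VM's attached device list):

```rust
/// Migration strategies the controller can select at migration start.
#[derive(Debug, PartialEq)]
pub enum MigrationMode {
    PrecopyAutoConverge,
    PostCopyAllowed,
}

/// Any attached VFIO passthrough device forces pre-copy with auto-converge,
/// because post-copy could let the device DMA into a guest page that is
/// still resident on the source (unmapped in the destination IOMMU domain).
pub fn select_migration_mode(
    vfio_device_count: usize,
    postcopy_requested: bool,
) -> MigrationMode {
    if vfio_device_count > 0 || !postcopy_requested {
        MigrationMode::PrecopyAutoConverge
    } else {
        MigrationMode::PostCopyAllowed
    }
}
```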

Guest-side migration support — When UmkaOS runs as a guest:

  • PV migration notifier: the guest receives a pre-migration hint via virtio, allowing it to flush caches, pause background I/O, and prepare for the brief freeze
  • Post-migration re-enumeration: the guest re-enumerates PCI topology (in case of heterogeneous migration to different hardware), re-calibrates pvclock, and resumes I/O
  • Confidential VM migration: handled by the TEE framework (Section 8.6.4)

Host Mode — Cloud Orchestration

UmkaOS provides /dev/kvm and the associated ioctl interface, making it compatible with the standard KVM ecosystem:

  • libvirt: standard virtualization management library. Works unmodified — it talks to /dev/kvm via standard ioctls.
  • OpenStack Nova: the compute driver talks to libvirt, and libvirt talks to /dev/kvm. UmkaOS is transparent to the orchestration layer.
  • QEMU and Firecracker: both use /dev/kvm directly. Both work unmodified on UmkaOS.

17.1.1 KVM Host-Side Implementation

This section specifies the hypervisor role: what umka-kvm does as a host to create, configure, and run virtual machines. umka-kvm runs as a Tier 1 driver with extended hardware privileges (see Section 18.1.4.5 for the isolation model, CAP_VMX rationale, and VMX trampoline design). The SLAT hooks (SlatHooks trait), EPT violation handling path, dirty page tracking, and memory overcommit behavior are specified in Section 18.1.4.5; this section covers the remaining host-side subsystems.

17.1.1.1 /dev/kvm Ioctl Interface

umka-kvm exposes the standard Linux KVM ioctl interface so that unmodified QEMU, Firecracker, Cloud Hypervisor, and crosvm work without changes. The interface is organized into three ioctl scopes:

System ioctls (on /dev/kvm file descriptor). Ioctl numbers use Linux's standard encoding: _IO(KVMIO, nr) where KVMIO = 0xAE. The nr column shows the number field; the actual ioctl constant includes direction and size bits per the _IO/_IOR/_IOW/_IOWR macros.

Ioctl nr Description
KVM_GET_API_VERSION 0x00 Returns KVM_API_VERSION (12). Userspace checks this first.
KVM_CREATE_VM 0x01 Allocate a new Vm struct, return VM file descriptor.
KVM_GET_MSR_INDEX_LIST 0x02 Returns list of MSRs that KVM_GET_MSRS/KVM_SET_MSRS can access.
KVM_CHECK_EXTENSION 0x03 Query capability support (EPT, PML, posted interrupts, etc.).
KVM_GET_VCPU_MMAP_SIZE 0x04 Returns size of the kvm_run shared page (one page per vCPU).
KVM_GET_SUPPORTED_CPUID 0x05 Returns filtered CPUID values reflecting host capabilities.

VM ioctls (on VM file descriptor):

Ioctl nr Description
KVM_CREATE_VCPU 0x41 Allocate a Vcpu struct, return vCPU file descriptor.
KVM_GET_DIRTY_LOG 0x42 Read and reset per-slot dirty bitmap (for live migration).
KVM_SET_USER_MEMORY_REGION 0x46 Add/modify/delete a memory slot mapping guest physical → host virtual.
KVM_SET_TSS_ADDR 0x47 Set guest TSS address (x86 specific, required by QEMU).
KVM_SET_IDENTITY_MAP_ADDR 0x48 Set identity-mapped page table region for real-mode emulation.
KVM_CREATE_IRQCHIP 0x60 Create in-kernel interrupt controller (LAPIC + IOAPIC on x86).
KVM_CREATE_PIT2 0x77 Create in-kernel PIT (8254 timer emulation).
KVM_IRQFD 0x76 Associate an eventfd with a guest IRQ for direct injection.
KVM_IOEVENTFD 0x79 Trigger an eventfd on guest I/O to a specified port/MMIO address.
KVM_SET_CLOCK 0x7B Set VM-wide kvmclock parameters (KVM_GET_CLOCK, 0x7C, reads them back).
KVM_CLEAR_DIRTY_LOG 0xC0 Granular dirty bitmap clear (avoids resetting entire slot).
KVM_MEMORY_ENCRYPT_OP 0xBA Confidential VM operations (SEV-SNP/TDX, see Section 8.6).

vCPU ioctls (on vCPU file descriptor):

Ioctl nr Description
KVM_RUN 0x80 Enter VMX non-root / VHE EL1 / VS-mode. Blocks until VM exit needs userspace.
KVM_GET_REGS / KVM_SET_REGS 0x81/0x82 Read/write guest general-purpose registers.
KVM_GET_SREGS / KVM_SET_SREGS 0x83/0x84 Read/write guest segment registers, CR0/CR3/CR4, EFER, IDT, GDT.
KVM_TRANSLATE 0x85 Walk guest page tables to translate guest virtual → guest physical.
KVM_INTERRUPT 0x86 Inject an external interrupt into the guest.
KVM_GET_MSRS / KVM_SET_MSRS 0x88/0x89 Read/write guest MSRs.
KVM_SET_SIGNAL_MASK 0x8B Set signal mask for the vCPU thread during KVM_RUN.
KVM_GET_FPU / KVM_SET_FPU 0x8C/0x8D Read/write guest FPU/SSE/AVX state.
KVM_GET_LAPIC / KVM_SET_LAPIC 0x8E/0x8F Read/write guest Local APIC state.
KVM_SET_CPUID2 0x90 Configure CPUID values exposed to the guest.
KVM_NMI 0x9A Inject an NMI into the guest.
KVM_GET_VCPU_EVENTS / KVM_SET_VCPU_EVENTS 0x9F/0xA0 Exception/interrupt/NMI injection state.
KVM_GET_XSAVE / KVM_SET_XSAVE 0xA4/0xA5 Read/write guest XSAVE state (AVX-512, AMX, etc.).

Ioctl dispatch: Each ioctl handler runs in umka-kvm's isolation domain. The /dev/kvm character device is registered via umka-core's device subsystem. When userspace calls ioctl(fd, KVM_RUN, ...), the syscall layer resolves the file descriptor to the Vcpu struct, switches into umka-kvm's domain, and invokes the KVM_RUN handler — which transitions to the VMX trampoline in umka-core's domain for the actual VM entry (see Section 18.1.4.5).
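A minimal sketch of the system-scope side of this dispatch, assuming a hypothetical VM fd allocator; only the `nr` field from the tables above is matched, with direction/size bits elided:

```rust
pub const KVM_API_VERSION: i64 = 12; // returned by KVM_GET_API_VERSION
pub const PAGE_SIZE: i64 = 4096;     // one kvm_run page per vCPU

/// Illustrative handler for ioctls on the /dev/kvm file descriptor.
/// `next_vm_fd` is a hypothetical stand-in for Vm allocation + fd install.
pub fn system_ioctl(nr: u32, next_vm_fd: &mut impl FnMut() -> i64) -> i64 {
    match nr {
        0x00 => KVM_API_VERSION, // KVM_GET_API_VERSION
        0x01 => next_vm_fd(),    // KVM_CREATE_VM: allocate Vm, return its fd
        0x04 => PAGE_SIZE,       // KVM_GET_VCPU_MMAP_SIZE
        _ => -25,                // -ENOTTY for ioctls not modeled in this sketch
    }
}
```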

kvm_run shared page: Each vCPU has a single page mapped into userspace (returned by mmap on the vCPU file descriptor). This page contains the kvm_run struct that communicates VM-exit reasons and I/O data between kernel and userspace:

/// KVM error code for unrecognized hypercalls. Positive 1000, not POSIX ENOSYS (-38).
/// Defined in Linux's include/uapi/linux/kvm_para.h. The hypervisor places this value
/// in the guest's return register (RAX on x86) when a hypercall number is not
/// recognized in-kernel. Guest PV code checks for this value, not -ENOSYS.
pub const KVM_ENOSYS: u64 = 1000;

/// KVM VM exit reason constants placed in KvmRun::exit_reason by the kernel.
/// Values match Linux's linux/kvm.h exactly for binary compatibility with QEMU,
/// Firecracker, and libvirt.
pub const KVM_EXIT_UNKNOWN: u32 = 0;         // Hardware exit reason KVM does not recognize.
pub const KVM_EXIT_IO: u32 = 2;              // Guest IN/OUT to intercepted port.
pub const KVM_EXIT_HYPERCALL: u32 = 3;       // Unrecognized VMCALL/HVC/ECALL.
pub const KVM_EXIT_DEBUG: u32 = 4;           // Hardware single-step or breakpoint.
pub const KVM_EXIT_MMIO: u32 = 6;            // Guest MMIO access with no in-kernel handler.
pub const KVM_EXIT_SHUTDOWN: u32 = 8;        // Guest triple-faulted or ACPI/PSCI shutdown.
pub const KVM_EXIT_FAIL_ENTRY: u32 = 9;      // VM entry failed before guest executed.
pub const KVM_EXIT_INTERNAL_ERROR: u32 = 17; // KVM internal consistency error.

/// Shared between umka-kvm (kernel) and VMM (userspace).
/// Layout matches Linux's struct kvm_run exactly for binary compatibility.
#[repr(C)]
pub struct KvmRun {
    /// Set by userspace before KVM_RUN: whether to inject an interrupt.
    pub request_interrupt_window: u8,
    /// Set by userspace: if nonzero, KVM_RUN returns to userspace immediately
    /// instead of entering the guest (used to kick a vCPU without signals).
    pub immediate_exit: u8,
    _padding1: [u8; 6],
    /// Set by kernel on VM exit: why the vCPU exited.
    pub exit_reason: u32,
    /// Set by kernel: whether an interrupt window is open.
    pub ready_for_interrupt_injection: u8,
    /// Set by kernel: whether the vCPU's IF flag is set.
    pub if_flag: u8,
    pub flags: u16,
    /// Guest CR8 (TPR) value. Avoids a KVM_SET_REGS round-trip.
    pub cr8: u64,
    /// Set by kernel: APIC base MSR value.
    pub apic_base: u64,
    /// Exit-reason-specific data. Union discriminated by exit_reason.
    pub exit_data: KvmRunExitData,
}

/// Union of exit-specific structs, discriminated by KvmRun::exit_reason.
#[repr(C)]
pub union KvmRunExitData {
    pub io: KvmRunIo,               // KVM_EXIT_IO
    pub mmio: KvmRunMmio,           // KVM_EXIT_MMIO
    pub hypercall: KvmRunHypercall, // KVM_EXIT_HYPERCALL
    pub internal: KvmRunInternal,   // KVM_EXIT_INTERNAL_ERROR
    /// KVM_EXIT_UNKNOWN — hardware reports unknown VM exit reason.
    pub hw: KvmExitHw,
    /// KVM_EXIT_FAIL_ENTRY — VM entry failed; hardware_entry_failure_reason
    /// holds the VMX/SVM-specific exit reason from the hardware.
    pub fail_entry: KvmExitFailEntry,
    /// KVM_EXIT_DEBUG — hardware single-step or breakpoint triggered.
    pub debug: KvmExitDebug,
    /// Padding to 256 bytes — matches Linux's `kvm_run` exit union size exactly
    /// (see `linux/kvm.h`: `__u8 padding[256]`). This ensures binary compatibility
    /// with VMMs (QEMU, Firecracker) that mmap the kvm_run page and access it as
    /// `struct kvm_run` using the Linux kernel headers.
    _padding: [u8; 256],
}

/// KVM_EXIT_UNKNOWN: hardware exit reason that KVM does not recognize.
#[repr(C)]
pub struct KvmExitHw {
    pub hardware_exit_reason: u64,
}

/// KVM_EXIT_FAIL_ENTRY: VM entry failed before the guest executed any instructions.
#[repr(C)]
pub struct KvmExitFailEntry {
    /// Architecture-specific entry failure reason (e.g., VMX basic exit reason).
    pub hardware_entry_failure_reason: u64,
    /// vCPU index on which the entry failure occurred.
    pub cpu: u32,
    pub _pad: u32,
}

/// KVM_EXIT_DEBUG: hardware single-step or breakpoint.
#[repr(C)]
pub struct KvmExitDebug {
    /// Architecture-specific debug exit information (e.g., DR6 on x86,
    /// ESR_EL2 on AArch64).
    pub arch: KvmDebugExitArch,
}

Exit reasons that require userspace handling (the KVM_RUN ioctl returns to userspace):

  • KVM_EXIT_IO: Guest executed IN/OUT to an intercepted port. Userspace emulates the device and re-enters.
  • KVM_EXIT_MMIO: Guest accessed an MMIO region with no in-kernel handler. Userspace emulates and re-enters.
  • KVM_EXIT_HYPERCALL: Guest executed VMCALL with an unrecognized hypercall number. Forwarded to userspace.
  • KVM_EXIT_SHUTDOWN: Guest triple-faulted. The VMM should terminate or reset the VM.
  • KVM_EXIT_SYSTEM_EVENT: Guest requested reset or shutdown via ACPI or PSCI.

Exit reasons handled entirely in-kernel (the vCPU re-enters the guest without returning to userspace): external interrupts, EPT violations that resolve to mapped memory slots, CPUID emulation, MSR access to non-intercepted MSRs, preemption timer expiry, APIC access, HLT (if other vCPUs can wake it). These are detailed in Section 17.1.1.3.
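The in-kernel/userspace split can be sketched as a routing decision inside the KVM_RUN loop. `HwExit` is a hypothetical stand-in for the arch-specific decoded exit (VMX exit reason, ESR_EL2 syndrome, etc.); the exit_reason codes are the constants from 17.1.1.1:

```rust
/// Hypothetical decoded hardware exit, abstracting over VMX/SVM/VHE/H-ext.
pub enum HwExit {
    ExternalInterrupt,  // host device IRQ arrived while in guest mode
    EptViolationMapped, // SLAT fault that resolves to a valid memslot
    Cpuid,              // emulated from the cached KVM_SET_CPUID2 values
    Hlt,                // halt-poll or block in kernel until a wakeup
    PortIo,             // intercepted IN/OUT
    Mmio,               // MMIO with no in-kernel handler
    TripleFault,        // guest shutdown
}

/// Re-enter the guest, or return to userspace with a KvmRun exit_reason.
#[derive(Debug, PartialEq)]
pub enum Disposition {
    Reenter,
    ToUserspace(u32),
}

pub fn route_exit(exit: HwExit) -> Disposition {
    match exit {
        // Handled entirely in-kernel: re-enter without a userspace round trip.
        HwExit::ExternalInterrupt
        | HwExit::EptViolationMapped
        | HwExit::Cpuid
        | HwExit::Hlt => Disposition::Reenter,
        // Requires userspace device emulation or VMM policy.
        HwExit::PortIo => Disposition::ToUserspace(2),      // KVM_EXIT_IO
        HwExit::Mmio => Disposition::ToUserspace(6),        // KVM_EXIT_MMIO
        HwExit::TripleFault => Disposition::ToUserspace(8), // KVM_EXIT_SHUTDOWN
    }
}
```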

17.1.1.2 Core Data Structures

Vm struct — one per virtual machine:

/// Represents a single virtual machine. Created by KVM_CREATE_VM.
/// Shared (via Arc) among all vCPUs belonging to this VM.
pub struct Vm {
    /// Unique VM identifier (monotonically increasing, never reused).
    pub id: u64,

    /// Guest physical address → host physical address mapping.
    /// Modified by KVM_SET_USER_MEMORY_REGION. RCU-protected for
    /// concurrent read access from vCPU fault handlers.
    pub memslots: RcuVec<MemSlot>,

    /// Architecture-specific second-level page table root.
    /// x86-64: EPT PML4 pointer. AArch64: Stage-2 VTTBR. RISC-V: hgatp.
    pub slat: SlatRoot,

    /// In-kernel interrupt controller state.
    /// x86: IOAPIC redirection table + PIC state.
    /// AArch64: vGIC distributor state.
    /// RISC-V: virtual PLIC/APLIC state.
    pub irqchip: Option<IrqChip>,

    /// Per-VM dirty page bitmap for live migration.
    /// One bit per 4 KiB guest physical page. Atomically set by EPT
    /// violation handler or PML drain; atomically read-and-reset by
    /// KVM_GET_DIRTY_LOG.
    pub dirty_bitmap: Option<AtomicBitmap>,

    /// vCPU list. Protected by vm_lock for structural changes
    /// (create/destroy). Individual vCPU operations do not take this lock.
    pub vcpus: RwLock<Vec<Arc<Vcpu>>>,

    /// Maximum number of vCPUs (set at VM creation, capped by hardware
    /// and policy). x86: min(KVM_MAX_VCPUS, host logical CPU count * 2).
    pub max_vcpus: u32,

    /// TSC frequency (Hz) for this VM. All vCPUs share the same virtual
    /// TSC frequency. Set by KVM_SET_TSC_KHZ or defaults to host TSC.
    pub tsc_khz: u64,

    /// Checkpointed state for crash recovery (Section 10.8). Updated
    /// on every KVM_RUN return-to-userspace and on periodic checkpoints.
    pub checkpoint: SpinLock<VmCheckpoint>,

    /// Power budget integration (Section 6.2.6.2).
    pub power_budget: Option<VmPowerBudget>,
}

MemSlot struct — guest physical region backed by host memory:

/// A contiguous region of guest physical address space backed by host memory.
/// Created/modified by KVM_SET_USER_MEMORY_REGION.
pub struct MemSlot {
    /// Slot identifier (0-based, userspace-assigned).
    pub slot: u32,
    /// Guest physical base address (page-aligned).
    pub guest_phys_base: u64,
    /// Size in bytes (page-aligned).
    pub size: u64,
    /// Host virtual address of the backing memory (userspace mapping).
    /// umka-kvm resolves this to host physical pages via the host page tables.
    pub userspace_addr: u64,
    /// Flags: KVM_MEM_LOG_DIRTY_PAGES, KVM_MEM_READONLY.
    pub flags: MemSlotFlags,
    /// Cached host physical addresses for fast EPT population.
    /// Lazily populated on first EPT fault per page. RCU-protected.
    pub hva_to_hpa_cache: RcuVec<Option<u64>>,
}

Vcpu struct — one per virtual CPU:

/// Represents a single virtual CPU. Created by KVM_CREATE_VCPU.
pub struct Vcpu {
    /// vCPU index within the VM (0-based).
    pub id: u32,
    /// Back-reference to the parent VM.
    pub vm: Arc<Vm>,

    // --- Architecture-specific hardware virtualization state ---

    /// x86-64: VMCS region (4 KiB aligned, one per vCPU).
    /// AArch64: saved EL1 register context for the guest.
    /// RISC-V: saved VS-mode CSR context.
    pub hw_state: ArchVcpuState,

    /// Guest general-purpose registers. Saved by the trampoline on VM exit,
    /// restored on VM entry. The trampoline handles all GPRs that are not
    /// automatically saved/restored by hardware (VMCS handles RSP/RIP/RFLAGS
    /// on x86; hardware handles PC/PSTATE on ARM64).
    pub guest_regs: GuestRegisters,

    /// Guest FPU/SIMD state. Saved/restored lazily — only when the host
    /// thread is about to be scheduled out, or when userspace reads via
    /// KVM_GET_FPU/KVM_GET_XSAVE.
    pub guest_fpu: FpuState,

    /// Highest-priority pending virtual interrupt to inject on next VM entry.
    /// Set by in-kernel LAPIC/vGIC or by KVM_INTERRUPT ioctl.
    /// AtomicU32 because interrupt injection can race with the vCPU thread
    /// (e.g., IOAPIC routing from a different vCPU's ioctl).
    pub pending_irq: AtomicU32,

    /// Virtual APIC page (x86 only). 4 KiB page used for x2APIC
    /// virtualization — hardware reads/writes APIC registers directly
    /// from this page without VM exits (when APIC virtualization is enabled
    /// via VMCS `APIC-access address` + `virtual-APIC address` fields).
    ///
    /// # Synchronization Protocol
    ///
    /// The vAPIC page is written by hardware (the CPU's VMX subsystem) on every
    /// APIC register access from within the guest and by the kernel on interrupt
    /// injection. It is read by the kernel on VM exit to sample guest APIC state.
    ///
    /// Access rules:
    /// 1. **Set once at vCPU creation** (when no guest is running): the physical
    ///    address is written into the VMCS. The pointer is stable for the vCPU
    ///    lifetime; it is `None` only if the host CPU does not support APIC
    ///    virtualization (checked via `CPUID.01H:ECX.X2APIC[bit 21]` and
    ///    `IA32_VMX_PROCBASED_CTLS2[bit 9]`).
    /// 2. **Kernel reads/writes**: only while the vCPU is not running (i.e., outside
    ///    `VMENTRY..VMEXIT`). The vCPU thread acquires `Vcpu::run_lock` before
    ///    accessing `vapic_page` for interrupt injection or state save/restore.
    /// 3. **Hardware writes**: occur inside the guest execution window. The kernel
    ///    never reads stale data because it only accesses the page after `VMEXIT`
    ///    has serialized all hardware writes.
    /// 4. **SMP**: each `Vcpu` has its own `vapic_page`; there is no cross-vCPU
    ///    sharing. A vCPU's page is touched only by its own vCPU thread and by
    ///    the in-kernel IOAPIC/PIC emulation (which acquires `run_lock` first).
    ///
    /// `*mut VApicPage` is used (not `Box<VApicPage>` or `Arc`) because the
    /// physical address must be pinned in the VMCS. The page is allocated via
    /// `PhysAlloc::alloc_pages(1)` at vCPU creation and freed at vCPU destruction.
    /// `# Safety`: callers must hold `run_lock` and the vCPU must be not-running.
    pub vapic_page: Option<*mut VApicPage>,

    /// TSC offset for this vCPU. Guest TSC = host TSC + tsc_offset.
    /// Set by KVM_SET_TSC_KHZ or by migration (to preserve guest-visible
    /// TSC continuity across hosts with different TSC frequencies).
    pub tsc_offset: i64,

    /// Shared page mapped into userspace for KVM_RUN communication.
    pub kvm_run: *mut KvmRun,

    /// Whether this vCPU has been launched (VMLAUNCH vs VMRESUME on x86).
    pub launched: AtomicBool,

    /// Pending MMIO emulation request (set on EPT fault for unmapped region,
    /// causes KVM_RUN to return with KVM_EXIT_MMIO).
    pub pending_mmio: Option<MmioRequest>,

    /// Preemption timer value. When the VMX preemption timer fires, the
    /// vCPU exits to allow the host scheduler to run. Set from the
    /// scheduler's time slice quantum (Section 6.1).
    pub preempt_timer_value: u32,

    /// Run state: Running, Halted (HLT), or Paused (by userspace/migration).
    pub run_state: AtomicU8,
}

Architecture-specific state (ArchVcpuState enum):

/// Per-architecture hardware virtualization state.
pub enum ArchVcpuState {
    /// Intel VMX: VMCS region.
    Vmx(VmcsState),
    /// AMD SVM: VMCB region.
    Svm(VmcbState),
    /// AArch64 VHE: saved guest EL1 context.
    ArmVhe(ArmVheState),
    /// AArch64 nVHE: saved guest EL1 context + EL2 trampoline state.
    ArmNvhe(ArmNvheState),
    /// RISC-V H-extension: saved VS-mode CSRs.
    RiscvH(RiscvHState),
}

SlatRoot — root of the Second-Level Address Translation page table, referenced by Vm::slat:

/// Root of the Second-Level Address Translation (SLAT) page table hierarchy.
/// On x86-64: EPT (Extended Page Tables) root; on AArch64: IPA stage-2 table root.
pub struct SlatRoot {
    /// Host physical address of the top-level page table (EPT PML4 / VTTBR_EL2 target).
    pub hpa: PhysAddr,
    /// ASID/VMID for TLB tagging (prevents cross-VM TLB pollution).
    pub vmid: u16,
}

VmCheckpoint — snapshot of a VM's complete state, referenced by Vm::checkpoint:

/// Snapshot of a VM's complete state for live migration or checkpoint.
///
/// # Memory Allocation Discipline
///
/// `VmCheckpoint` is held under `Vm::checkpoint: SpinLock<VmCheckpoint>`.
/// SpinLock disables preemption; heap allocation (`Vec::push`, `Vec::extend`,
/// `Box::new`) inside a spinlock is prohibited (the allocator may acquire
/// currently-held locks, causing deadlock).
///
/// Therefore all buffers in `VmCheckpoint` are pre-allocated at VM creation
/// time (when no spinlock is held) and never resized:
/// - `vcpu_states` is a `Box<[GuestRegisters]>` pre-sized to `vcpu_count`.
/// - `device_states` is a `Box<[DeviceStateBlob]>` pre-sized to `device_count`.
///
/// Snapshot capture only writes into the pre-allocated buffers; it never
/// allocates. The `epoch` field is updated atomically. This makes snapshot
/// capture lock-friendly and O(1) in latency.
pub struct VmCheckpoint {
    /// Saved register state for each vCPU. Pre-allocated at VM creation to
    /// hold exactly `Vm::vcpu_count` entries; never resized after creation.
    /// `Box<[GuestRegisters]>` prevents accidental `push()`/`extend()` that
    /// would heap-allocate under the spinlock.
    pub vcpu_states: Box<[GuestRegisters]>,
    /// Dirty page bitmap: bit N set = GPA page N modified since last checkpoint.
    /// Pre-allocated at VM creation to cover the full GPA range.
    pub dirty_bitmap: BitVec,
    /// Device state blobs (serialized from each Tier 1 virtual device).
    /// Pre-allocated at VM creation; one slot per registered virtual device.
    /// Each `DeviceStateBlob` has a fixed `data: Box<[u8]>` buffer sized to
    /// the maximum serialized state reported by the device at registration.
    pub device_states: Box<[DeviceStateBlob]>,
    /// Monotonic sequence number of this snapshot (used to order incremental
    /// checkpoints during live migration). Incremented on each capture.
    pub epoch: u64,
}

/// Fixed-capacity serialized state blob for one virtual device.
/// Pre-allocated at VM creation. The device driver writes its state into
/// `data[..actual_len]`; actual_len ≤ data.len() is always true.
pub struct DeviceStateBlob {
    /// Device identifier.
    pub device_id: DeviceId,
    /// Pre-allocated buffer (sized to device's `max_state_bytes()` at registration).
    pub data: Box<[u8]>,
    /// Valid bytes written in the most recent snapshot capture. Always ≤ data.len().
    pub actual_len: u32,
}
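The no-allocation discipline can be sketched as follows. This is an illustrative stand-in, not umka-kvm's actual implementation: the struct is simplified (`DeviceId` reduced to a plain `u32`), and the `capture` method is hypothetical. The point is that the snapshot path only copies into the pre-sized buffer:

```rust
/// Simplified stand-in for the DeviceStateBlob above (DeviceId reduced to u32).
pub struct DeviceStateBlob {
    pub device_id: u32,
    pub data: Box<[u8]>,
    pub actual_len: u32,
}

impl DeviceStateBlob {
    /// Pre-allocate at VM creation time (no spinlock held here).
    pub fn new(device_id: u32, max_state_bytes: usize) -> Self {
        DeviceStateBlob {
            device_id,
            data: vec![0u8; max_state_bytes].into_boxed_slice(),
            actual_len: 0,
        }
    }

    /// Snapshot capture path: safe under a spinlock because it only copies
    /// into the pre-allocated buffer — no Vec growth, no Box::new.
    pub fn capture(&mut self, state: &[u8]) -> Result<(), ()> {
        if state.len() > self.data.len() {
            // Device exceeded its registered max_state_bytes.
            return Err(());
        }
        self.data[..state.len()].copy_from_slice(state);
        self.actual_len = state.len() as u32;
        Ok(())
    }
}
```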

GuestRegisters — complete architectural register state, referenced by Vcpu::guest_regs:

/// Complete architectural register state of a guest vCPU.
/// Architecture-specific; union across supported ISAs.
pub struct GuestRegisters {
    /// General-purpose registers (16 on x86-64, 31 on AArch64).
    pub gpr: [u64; 31],
    /// Program counter / instruction pointer.
    pub pc: u64,
    /// Stack pointer.
    pub sp: u64,
    /// Architecture flags / PSTATE / CPSR.
    pub flags: u64,
    /// Floating-point and SIMD state (architecture-specific).
    pub fpu: FpuState,
    /// Architecture-specific system registers (CR0/CR3/CR4/EFER on x86-64, etc.).
    pub sys_regs: [u64; 64],
}

FpuState — floating-point and SIMD register state, referenced by Vcpu::guest_fpu and GuestRegisters::fpu:

/// Floating-point and SIMD register state.
/// Layout matches the host architecture's XSAVE area (x86-64) or FPSIMD context (AArch64).
#[repr(C, align(64))]
pub struct FpuState {
    /// Raw XSAVE/FPSIMD data; size is architecture-dependent.
    /// x86-64: up to 2688 bytes (AVX-512 XSAVE area).
    /// AArch64: 512 bytes (V0-V31 + FPCR/FPSR).
    pub data: [u8; 2688],
    /// Number of valid bytes in data[] (architecture-dependent).
    pub size: u32,
}

MmioRequest — MMIO access request from a vCPU, referenced by Vcpu::pending_mmio:

/// MMIO access request from a vCPU that trapped on an unhandled MMIO address.
pub struct MmioRequest {
    /// Guest physical address of the MMIO access.
    pub gpa: u64,
    /// Data to write (for writes); ignored for reads.
    pub data: u64,
    /// Access size in bytes: 1, 2, 4, or 8.
    pub size: u8,
    /// True = write, False = read.
    pub is_write: bool,
}

17.1.1.3 VM-Exit Handling and Dispatch

When a VM exit occurs, the architecture-specific trampoline (running in umka-core's domain, PKEY 0 on x86) saves guest GPRs to vcpu.guest_regs and dispatches to umka-kvm's exit handler. The trampoline design is specified in Section 18.1.4.5; this section covers the exit handler logic.

Exit reason dispatch (architecture-independent entry point):

/// Called by the trampoline after guest GPRs are saved.
/// Returns an action that tells the trampoline what to do next.
fn handle_vmexit(vcpu: &mut Vcpu, exit: VmExit) -> VmExitAction {
    match exit.reason {
        // --- Handled entirely in-kernel (fast path) ---
        ExitReason::ExternalInterrupt => {
            // Host interrupt arrived while guest was running. The trampoline
            // already acknowledged the interrupt via the host IDT. Just
            // re-enter the guest after the host ISR completes.
            VmExitAction::ReenterGuest
        }
        ExitReason::PreemptionTimer => {
            // Host scheduler quantum expired. Yield to the scheduler.
            // The scheduler will re-enter the guest when re-scheduled.
            VmExitAction::Reschedule
        }
        ExitReason::EptViolation | ExitReason::Stage2Fault | ExitReason::GuestPageFault => {
            // Second-level page fault. Handled via the EPT violation path
            // specified in Section 18.1.4.5.
            handle_slat_fault(vcpu, exit.guest_phys_addr, exit.fault_flags)
        }
        ExitReason::Cpuid => {
            handle_cpuid_exit(vcpu)
        }
        ExitReason::Msr(direction) => {
            handle_msr_exit(vcpu, direction)
        }
        ExitReason::Hlt => {
            handle_hlt_exit(vcpu)
        }
        ExitReason::ApicAccess => {
            handle_apic_access(vcpu)
        }
        ExitReason::Hypercall => {
            handle_hypercall(vcpu)
        }
        ExitReason::CrAccess(cr, direction) => {
            handle_cr_access(vcpu, cr, direction)
        }
        ExitReason::TaskSwitch => {
            handle_task_switch(vcpu)
        }
        ExitReason::Xsetbv => {
            handle_xsetbv(vcpu)
        }

        // --- Forwarded to userspace (slow path) ---
        ExitReason::IoInstruction => {
            handle_io_exit(vcpu)
        }
        ExitReason::MmioAccess => {
            // Unmapped MMIO — no in-kernel device model matched.
            populate_kvm_run_mmio(vcpu);
            VmExitAction::ReturnToUserspace
        }
        ExitReason::Shutdown => {
            vcpu.kvm_run().exit_reason = KVM_EXIT_SHUTDOWN;
            VmExitAction::ReturnToUserspace
        }
        ExitReason::Unknown(code) => {
            log_unknown_exit(vcpu, code);
            vcpu.kvm_run().exit_reason = KVM_EXIT_INTERNAL_ERROR;
            VmExitAction::ReturnToUserspace
        }
    }
}

VmExitAction enum:

pub enum VmExitAction {
    /// Re-enter the guest immediately (VMRESUME/ERET/sret).
    ReenterGuest,
    /// Yield to the host scheduler. The scheduler will call back into
    /// umka-kvm when the vCPU thread is next scheduled.
    Reschedule,
    /// Return from KVM_RUN to userspace. The kvm_run shared page
    /// contains exit reason and data for userspace to handle.
    ReturnToUserspace,
}

Key exit handlers:

  • CPUID: umka-kvm maintains a per-VM CPUID table (set by KVM_SET_CPUID2). On CPUID exit, the handler looks up guest EAX/ECX (leaf/subleaf) in the table and writes the result to guest EAX/EBX/ECX/EDX. Guest RIP is advanced past the 2-byte CPUID instruction. Handled entirely in-kernel; no userspace round-trip.

  • MSR access: umka-kvm maintains one 4 KiB MSR bitmap region divided into four 1 KiB sections per Intel SDM Vol. 3C Section 24.6.9: read-low (offset 0x000, MSRs 0x00000000-0x00001FFF), read-high (offset 0x400, MSRs 0xC0000000-0xC0001FFF), write-low (offset 0x800, MSRs 0x00000000-0x00001FFF), write-high (offset 0xC00, MSRs 0xC0000000-0xC0001FFF). Commonly passthrough MSRs (IA32_TSC, IA32_TSC_AUX, IA32_SYSENTER_*) have their bitmap bits cleared — the guest accesses them directly without exit. Intercepted MSRs (IA32_EFER, IA32_APIC_BASE, IA32_TSC_DEADLINE, IA32_SPEC_CTRL, IA32_STAR, IA32_LSTAR) are emulated in the exit handler: the handler validates the value, applies it to the VMCS guest-state area or internal shadow, and advances RIP.

  • HLT: If the in-kernel LAPIC has a pending interrupt or the vCPU's pending_irq is set, inject the interrupt and re-enter immediately (no halt). Otherwise, mark the vCPU as halted (run_state = Halted) and yield to the scheduler. The vCPU is woken when an interrupt is routed to it (via IOAPIC redirection, MSI injection, or KVM_INTERRUPT ioctl).

  • I/O port: The exit handler reads the port number, direction (in/out), size (1/2/4 bytes), and data from the VMCS exit qualification. If an in-kernel device model handles this port (PIT, PIC, IOAPIC, serial for debug), it is handled without returning to userspace. Otherwise, the handler populates the kvm_run.io struct and returns KVM_EXIT_IO to userspace.

  • Hypercall (VMCALL/HVC): Recognized hypercalls:

  • KVM_HC_VAPIC_POLL_IRQ (1): Check for pending virtual interrupts.
  • KVM_HC_MMU_OP (2): Deprecated; returns KVM_ENOSYS (1000) in RAX — not POSIX -ENOSYS (-38). Guests check for the positive value 1000 per the KVM paravirt ABI (include/uapi/linux/kvm_para.h).
  • KVM_HC_KICK_CPU (5): Wake a halted vCPU (PV spinlock support).
  • KVM_HC_SEND_IPI (10): Batch IPI injection (up to 128 targets).
  • KVM_HC_SCHED_YIELD (11): Hint that the vCPU is spinning; yield to scheduler.
  • KVM_HC_MAP_GPA_RANGE (12): Shared/private page conversion (confidential VMs, see Section 8.6).
  • Unrecognized hypercall numbers: forwarded to userspace as KVM_EXIT_HYPERCALL. If userspace also does not handle the hypercall, it must write KVM_ENOSYS (1000) into RAX before re-entering — not -ENOSYS.
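A minimal dispatch skeleton for the hypercall table above — the constants come from the KVM paravirt ABI (include/uapi/linux/kvm_para.h), but the handler bodies are elided and the `Ret(0)` results are placeholders for the real per-hypercall logic:

```rust
// Hypercall numbers from the KVM paravirt ABI (include/uapi/linux/kvm_para.h).
const KVM_HC_VAPIC_POLL_IRQ: u64 = 1;
const KVM_HC_MMU_OP: u64 = 2;
const KVM_HC_KICK_CPU: u64 = 5;
const KVM_HC_SEND_IPI: u64 = 10;
const KVM_HC_SCHED_YIELD: u64 = 11;
const KVM_HC_MAP_GPA_RANGE: u64 = 12;
/// Positive ABI value — NOT the POSIX -ENOSYS (-38).
const KVM_ENOSYS: u64 = 1000;

enum HcResult {
    /// Value to place in guest RAX before re-entering the guest.
    Ret(u64),
    /// Forward to userspace as KVM_EXIT_HYPERCALL.
    ToUserspace,
}

fn dispatch_hypercall(nr: u64) -> HcResult {
    match nr {
        KVM_HC_VAPIC_POLL_IRQ => HcResult::Ret(0),
        // Deprecated: report the positive ABI value, not -ENOSYS.
        KVM_HC_MMU_OP => HcResult::Ret(KVM_ENOSYS),
        KVM_HC_KICK_CPU | KVM_HC_SEND_IPI
        | KVM_HC_SCHED_YIELD | KVM_HC_MAP_GPA_RANGE => HcResult::Ret(0), // real handlers elided
        _ => HcResult::ToUserspace,
    }
}
```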

17.1.1.4 x86-64 VMX Implementation

VMCS (Virtual Machine Control Structure) — Intel SDM Vol. 3C, Chapter 24:

The VMCS is a 4 KiB hardware-defined region that controls VMX operation. One VMCS exists per vCPU. VMCS fields are accessed via VMREAD/VMWRITE instructions (not direct memory access — the format is opaque and CPU-model-specific).

/// x86-64 VMX state for one vCPU.
pub struct VmcsState {
    /// Physical address of the 4 KiB VMCS region (allocated from umka-core's
    /// page allocator, 4 KiB aligned). Written to the per-CPU VMCS pointer
    /// via VMPTRLD before any VMREAD/VMWRITE.
    pub vmcs_phys: u64,

    /// Virtual address of the VMCS region (for the revision ID write at
    /// VMCS initialization — the only direct memory access to the VMCS).
    pub vmcs_virt: *mut u8,

    /// MSR bitmap (one 4 KiB page, four 1 KiB sections): controls which MSR
    /// accesses cause VM exits. Bit set = intercept; bit clear = passthrough.
    /// Sections: read-low (0x000), read-high (0x400), write-low (0x800),
    /// write-high (0xC00). Physical address stored in VMCS MSR bitmap address field.
    pub msr_bitmap_phys: u64,

    /// I/O bitmap pages (two 4 KiB pages, covering ports 0x0000-0xFFFF).
    /// Bit set = intercept IN/OUT; bit clear = passthrough.
    pub io_bitmap_a_phys: u64, // ports 0x0000-0x7FFF
    pub io_bitmap_b_phys: u64, // ports 0x8000-0xFFFF

    /// Posted interrupt descriptor (if posted interrupts are enabled).
    /// 64-byte aligned structure used by hardware to inject interrupts
    /// without VM exit.
    pub posted_interrupt_desc: *mut PostedInterruptDesc,

    /// Whether this VMCS has been launched (VMLAUNCH sets this; subsequent
    /// entries use VMRESUME).
    pub launched: bool,
}
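The four-section MSR bitmap layout reduces to a small index computation. A sketch (function name is illustrative, not part of umka-kvm's API): given an MSR number and access direction, it yields the byte offset and bit index into the 4 KiB bitmap, or None for MSRs outside the two covered ranges (those accesses always exit):

```rust
/// Byte offset and bit index into the 4 KiB Intel VMX MSR bitmap:
/// read-low (0x000), read-high (0x400), write-low (0x800), write-high (0xC00).
fn msr_bitmap_pos(msr: u32, is_write: bool) -> Option<(usize, u8)> {
    let (base, index) = match msr {
        // Low range: MSRs 0x00000000-0x00001FFF.
        0x0000_0000..=0x0000_1FFF => (if is_write { 0x800 } else { 0x000 }, msr),
        // High range: MSRs 0xC0000000-0xC0001FFF.
        0xC000_0000..=0xC000_1FFF => (if is_write { 0xC00 } else { 0x400 }, msr - 0xC000_0000),
        _ => return None,
    };
    Some((base + (index as usize) / 8, (index % 8) as u8))
}
```

Clearing the returned bit makes the MSR passthrough; setting it intercepts the access.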

VMCS initialization (KVM_CREATE_VCPU path):

  1. Allocate a 4 KiB page from umka-core (SlatHooks::alloc_slat_page). Write the VMCS revision identifier (from the IA32_VMX_BASIC MSR) to the first 4 bytes.
  2. Execute VMPTRLD to make this VMCS current on the physical CPU.
  3. Write host-state fields — these define the CPU state restored on VM exit:
     • HOST_CR0, HOST_CR3 (umka-core's page table root), HOST_CR4
     • HOST_RSP (trampoline stack pointer), HOST_RIP (trampoline entry point)
     • HOST_CS_SELECTOR, HOST_SS_SELECTOR, HOST_DS_SELECTOR, HOST_ES_SELECTOR, HOST_FS_SELECTOR, HOST_GS_SELECTOR, HOST_TR_SELECTOR
     • HOST_FS_BASE, HOST_GS_BASE (PerCpu/CpuLocal base), HOST_TR_BASE, HOST_GDTR_BASE, HOST_IDTR_BASE
     • HOST_IA32_SYSENTER_CS, HOST_IA32_SYSENTER_ESP, HOST_IA32_SYSENTER_EIP
     • HOST_IA32_EFER, HOST_IA32_PAT
  4. Write guest-state fields — initial guest CPU state:
     • GUEST_CR0, GUEST_CR3, GUEST_CR4 (set by KVM_SET_SREGS)
     • GUEST_RIP, GUEST_RSP, GUEST_RFLAGS (set by KVM_SET_REGS)
     • Segment selectors, bases, limits, and access rights for CS/DS/ES/SS/FS/GS/TR/LDTR
     • GUEST_GDTR_BASE, GUEST_GDTR_LIMIT, GUEST_IDTR_BASE, GUEST_IDTR_LIMIT
     • GUEST_IA32_EFER, GUEST_IA32_PAT, GUEST_IA32_DEBUGCTL
     • GUEST_ACTIVITY_STATE (0 = active, 1 = HLT, 2 = shutdown, 3 = wait-for-SIPI)
     • GUEST_INTERRUPTIBILITY_STATE, GUEST_PENDING_DBG_EXCEPTIONS
     • VMCS_LINK_POINTER = 0xFFFF_FFFF_FFFF_FFFF (no VMCS shadowing initially)
  5. Write VM-execution control fields:
     • Pin-based controls: external-interrupt exiting, NMI exiting, virtual NMIs, VMX preemption timer (enabled — used for host scheduler integration)
     • Primary processor-based controls: HLT exiting, INVLPG exiting (disabled — guest manages its own TLB), CR3 load/store exiting (disabled when EPT is active), MOV-DR exiting (disabled unless debugging), I/O bitmap use, MSR bitmap use, MONITOR/MWAIT exiting, activate secondary controls
     • Secondary processor-based controls: enable EPT, enable VPID (Virtual Processor ID — tags TLB entries per-VM to avoid a full TLB flush on VM switch), unrestricted guest (allows real-mode execution in VMX non-root), APIC register virtualization, virtual interrupt delivery, PML enable, XSAVES/XRSTORS enable, INVPCID passthrough, TSC scaling
     • Exception bitmap: intercept #UD (for instruction emulation), #PF if EPT is not available (should not happen on any modern CPU), and #DB/#BP for guest debugging when GDB is attached. All other exceptions are delivered to the guest.
     • EPT pointer: PML4 physical address | memory type (WB = 6) | page-walk length (4 - 1 = 3) | accessed/dirty flags enable
     • VPID: unique 16-bit value per vCPU (allocated from a per-host bitmap). VPID 0 is reserved for the host; the usable range is 1-65535.
  6. Write VM-exit control fields: save IA32_EFER, load IA32_EFER, host address-space size (64-bit), acknowledge interrupt on exit (allows the host IDT to handle the interrupt without a separate VMREAD of the exit interrupt info field).
  7. Write VM-entry control fields: load IA32_EFER and IA32_PAT; IA-32e mode guest (64-bit guest); entry to SMM (no).
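The EPT pointer composition described in the initialization sequence above is a simple bitfield. A sketch (constant and function names are illustrative, not umka-kvm's actual symbols):

```rust
const EPT_MEMTYPE_WB: u64 = 6;      // bits [2:0]: write-back memory type
const EPT_WALK_LEN_4: u64 = 3 << 3; // bits [5:3]: page-walk length minus 1
const EPT_AD_ENABLE: u64 = 1 << 6;  // bit 6: accessed/dirty flag enable

/// Build the EPT pointer from a 4 KiB-aligned PML4 physical address.
fn make_eptp(pml4_phys: u64) -> u64 {
    assert_eq!(pml4_phys & 0xFFF, 0, "PML4 must be 4 KiB aligned");
    pml4_phys | EPT_MEMTYPE_WB | EPT_WALK_LEN_4 | EPT_AD_ENABLE
}
```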

AMD SVM (VMCB) — AMD APM Vol. 2, Chapter 15:

AMD's equivalent uses a VMCB (Virtual Machine Control Block), also 4 KiB, but directly memory-mapped (no VMREAD/VMWRITE — the hypervisor reads/writes VMCB fields as a C struct). Key differences from VMX:

  • Nested Page Tables (NPT) instead of EPT — a functionally equivalent 4-level page table for guest-physical → host-physical translation.
  • VMRUN instruction (replaces VMLAUNCH/VMRESUME — a single instruction for all entries). #VMEXIT stores exit info directly in the VMCB.
  • VMCB clean bits: the hypervisor marks which VMCB sections were modified since the last VMRUN, allowing the CPU to skip reloading unchanged state.
  • ASID (Address Space ID) for TLB tagging — analogous to Intel's VPID.
  • VMCB.control.intercept_* fields replace VMX's execution control bitmaps.
  • Secure Encrypted Virtualization (SEV/SEV-ES/SEV-SNP) extensions use the VMCB's SEV_FEATURES and VMSA fields (see Section 8.6).

/// AMD SVM state for one vCPU.
pub struct VmcbState {
    /// Physical address of the 4 KiB VMCB (4 KiB aligned).
    pub vmcb_phys: u64,
    /// Virtual address for direct field access.
    pub vmcb: *mut Vmcb,
    /// Host save area physical address (set in VM_HSAVE_PA MSR).
    /// CPU saves host state here on VMRUN and restores on #VMEXIT.
    pub host_save_area_phys: u64,
    /// AMD MSRPM: 8 KiB (two 4 KiB pages). Uses 2 bits per MSR: bit 2n = read
    /// intercept, bit 2n+1 = write intercept. Three MSR ranges:
    /// 0x00000000-0x00001FFF, 0xC0000000-0xC0001FFF, 0xC0010000-0xC0011FFF.
    /// Different from Intel VMX MSR bitmap (which is 4 KiB, 1 bit/MSR, four
    /// separate 1 KiB sections). See AMD APM Vol. 2, Section 15.11.
    pub msrpm_phys: u64,
    /// I/O permission bitmap (three 4 KiB pages covering 0-65535).
    pub iopm_phys: u64,
}
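The 2-bits-per-MSR MSRPM layout contrasts with the Intel bitmap; a sketch of the index computation (function name illustrative), following the three-range layout in the comment above:

```rust
/// Byte offset and bit index into the 8 KiB AMD MSRPM for a given MSR access.
/// Each 2 KiB vector covers 8192 MSRs at 2 bits per MSR: even bit = read
/// intercept, odd bit = write intercept.
fn msrpm_pos(msr: u32, is_write: bool) -> Option<(usize, u8)> {
    let (vector_base, index) = match msr {
        0x0000_0000..=0x0000_1FFF => (0x0000, msr),
        0xC000_0000..=0xC000_1FFF => (0x0800, msr - 0xC000_0000),
        0xC001_0000..=0xC001_1FFF => (0x1000, msr - 0xC001_0000),
        _ => return None,
    };
    let bit = 2 * index + if is_write { 1 } else { 0 };
    Some((vector_base + (bit as usize) / 8, (bit % 8) as u8))
}
```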

umka-kvm abstracts VMX and SVM behind a common HvOps trait so that all host-side logic above the trampoline level is architecture-neutral:

/// Architecture-specific VMX/SVM operations. Implemented by VmcsState (Intel)
/// and VmcbState (AMD). Called from the architecture-neutral exit handler.
pub trait HvOps {
    /// Read a guest register from the hardware state area.
    fn read_guest_reg(&self, reg: GuestReg) -> u64;
    /// Write a guest register to the hardware state area.
    fn write_guest_reg(&mut self, reg: GuestReg, val: u64);
    /// Advance guest instruction pointer by `len` bytes.
    fn advance_rip(&mut self, len: u8);
    /// Inject a virtual interrupt (IRQ number, priority).
    fn inject_irq(&mut self, vector: u8);
    /// Inject an exception (vector, error code, CR2 for #PF).
    fn inject_exception(&mut self, vector: u8, error_code: Option<u32>);
    /// Set the EPT/NPT pointer (for EPT invalidation after page table changes).
    fn set_slat_root(&mut self, root_phys: u64);
    /// Invalidate TLB entries for this vCPU's VPID/ASID.
    fn flush_guest_tlb(&self);
    /// Read exit qualification / exit info (architecture-specific format).
    fn exit_info(&self) -> ExitInfo;
}

17.1.1.5 AArch64 Host-Side Implementation

VHE mode (ARMv8.1+, preferred): UmkaOS runs at EL2. The guest runs at EL1/EL0. VM entry is a controlled ERET to guest EL1; VM exit is a trap from guest EL1 to host EL2. No EL2↔EL1 world switch is needed on the host side because VHE transparently redirects EL1 system register accesses to their EL2 counterparts.

/// AArch64 VHE vCPU state.
pub struct ArmVheState {
    /// Saved guest EL1 system registers (restored before ERET to guest,
    /// saved after trap back to host EL2).
    pub sctlr_el1: u64,
    pub tcr_el1: u64,
    pub ttbr0_el1: u64,
    pub ttbr1_el1: u64,
    pub mair_el1: u64,
    pub vbar_el1: u64,
    pub contextidr_el1: u64,
    pub cpacr_el1: u64,
    pub esr_el1: u64,
    pub far_el1: u64,
    pub sp_el1: u64,
    pub elr_el1: u64,
    pub spsr_el1: u64,
    pub tpidr_el1: u64,
    pub tpidr_el0: u64,
    pub tpidrro_el0: u64,
    pub cntvoff_el2: u64,   // virtual timer offset
    pub cntv_cval_el0: u64, // virtual timer compare value
    pub cntv_ctl_el0: u64,  // virtual timer control

    /// Stage-2 translation table base register.
    /// Points to the root of the IPA → PA page tables for this VM.
    /// Written to VTTBR_EL2 before guest entry.
    pub vttbr_el2: u64,

    /// Hypervisor Configuration Register — controls trap behavior.
    /// umka-kvm sets: VM bit (enable Stage-2), IMO/FMO/AMO (trap
    /// interrupts to EL2), TWI/TWE (trap WFI/WFE), TSC (trap SMC).
    pub hcr_el2: u64,

    /// VMID (Virtual Machine Identifier) — 8-bit or 16-bit tag for
    /// Stage-2 TLB entries (analogous to Intel VPID).
    pub vmid: u16,
}
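The HCR_EL2 configuration named in the comment above can be expressed as a bit composition. Bit positions follow the ARM Architecture Reference Manual; the function name is illustrative:

```rust
// HCR_EL2 trap-control bits used by the guest configuration above.
const HCR_VM: u64 = 1 << 0;   // enable Stage-2 translation
const HCR_FMO: u64 = 1 << 3;  // route FIQs to EL2
const HCR_IMO: u64 = 1 << 4;  // route IRQs to EL2
const HCR_AMO: u64 = 1 << 5;  // route SErrors/aborts to EL2
const HCR_TWI: u64 = 1 << 13; // trap WFI
const HCR_TWE: u64 = 1 << 14; // trap WFE
const HCR_TSC: u64 = 1 << 19; // trap SMC

/// Baseline HCR_EL2 value for entering a guest, per the field comment above.
fn guest_hcr_el2() -> u64 {
    HCR_VM | HCR_FMO | HCR_IMO | HCR_AMO | HCR_TWI | HCR_TWE | HCR_TSC
}
```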

Stage-2 page tables: 4-level (or concatenated 3-level for 40-bit IPA) page tables mapping IPA (Intermediate Physical Address) to PA (host physical). Constructed by umka-kvm using the same SlatHooks allocator as x86 EPT. TLBI VMALLE1IS (TLB invalidate all EL1, inner-shareable) flushes guest TLB entries; TLBI IPAS2E1IS flushes a single IPA mapping.

nVHE mode (pre-ARMv8.1 or when VHE is disabled): The host kernel runs at EL1. A small EL2 stub (~500 lines of assembly) handles world switches. On VM entry: save host EL1 context → load guest EL1 context → ERET to guest EL1. On VM exit: save guest EL1 context → restore host EL1 context → return to host. Cost: ~500-1000 cycles per entry/exit (vs ~200 for VHE) due to the full EL1 context save/restore.

Virtual interrupt injection: GICv4 direct injection (where hardware supports it) allows the physical GIC to deliver virtual interrupts directly to the guest vCPU without a VM exit. umka-kvm programs the GICv4 virtual LPI (Locality-specific Peripheral Interrupt) tables so that device MSIs targeted at a guest are directly delivered. Fallback: software injection via ICH_LR<n>_EL2 list registers (GICv3 virtualization interface).

AArch64 Cache Geometry Discovery and DC CISW Flush

KVM requires flushing guest memory pages from the host cache when switching between host and guest mappings and on guest Stage-2 page table updates. On AArch64 there is no single instruction that flushes the entire data cache hierarchy; the kernel must iterate over every cache set and way up to the Level of Coherence (LoC). The geometry parameters — line size, associativity, and number of sets — are discovered at runtime by reading three system registers after selecting each cache level.

Register layout:

CLIDR_EL1 (Cache Level ID Register):

  • bits[26:24] = LoC — the number of cache levels that must be flushed to achieve data coherency with all data-sharing agents
  • bits[23:21] = LoUIS — Level of Unification Inner Shareable (used for context-switch cache maintenance; not needed for the full-flush path)
  • Ctype_n = (CLIDR_EL1 >> (3 * (n-1))) & 0x7 for n = 1..7: 0b000 = no cache at this level; 0b001 = instruction cache only; 0b010 = data cache; 0b011 = separate I+D; 0b100 = unified. Levels where Ctype_n >= 2 have a data or unified cache and must be flushed.

CSSELR_EL1 (Cache Size Selection Register):

  • Write ((level - 1) << 1) | 0 to select the data/unified cache at the given one-indexed level (level 1 = write 0, level 2 = write 2, …).
  • An ISB must follow immediately before reading CCSIDR_EL1; without it the processor may return stale geometry data for the previously selected level.
  • The write is not interrupt-safe: an interrupt handler that inspects cache geometry (e.g., via a PMU callback) could change CSSELR_EL1 between the host write and the CCSIDR_EL1 read. IRQs must therefore be disabled for the entire select → ISB → read sequence.

CCSIDR_EL1 (Cache Size ID Register) — standard format (no FEAT_CCIDX):

  • bits[2:0] = LineSize_enc; L = LineSize_enc + 4; line size in bytes = 1 << L
  • bits[12:3] = Assoc_enc; NumWays = Assoc_enc + 1
  • bits[27:13] = NumSets_enc; NumSets = NumSets_enc + 1
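As a worked decode of these fields (a sketch, not umka-kvm's actual helper), a hypothetical 32 KiB, 4-way cache with 64-byte lines round-trips as expected:

```rust
/// Decode the standard-format (no FEAT_CCIDX) CCSIDR_EL1 geometry fields.
/// Returns (line_size_bytes, num_ways, num_sets).
fn decode_ccsidr(ccsidr: u64) -> (u32, u32, u32) {
    let l = ((ccsidr & 0x7) + 4) as u32;                 // L = LineSize_enc + 4
    let num_ways = (((ccsidr >> 3) & 0x3FF) + 1) as u32; // Assoc_enc + 1
    let num_sets = (((ccsidr >> 13) & 0x7FFF) + 1) as u32; // NumSets_enc + 1
    (1u32 << l, num_ways, num_sets)
}
```

Capacity then falls out as line_size × ways × sets (64 × 4 × 128 = 32 KiB for the sample value below).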

DC CISW (Clean and Invalidate by Set/Way) operand encoding (64-bit register):

bits[31 .. 32 - A] = way_index     (A = ceil(log2(NumWays)))
bits[B - 1 .. L]   = set_index     (B = L + ceil(log2(NumSets)))
bits[3:1]          = level - 1

where way_shift = u32::leading_zeros(Assoc_enc) — the CLZ of the encoded maximum way index. This places the way field at the most-significant end of the 32-bit operand word (equivalently, way_shift = 32 - A), so incrementing way_index by one always advances to the next way regardless of the actual associativity.

Flush algorithm (all data cache levels up to LoC):

/// Flush and invalidate all data caches from EL1 up to the Level of Coherence.
///
/// Required before:
/// - Switching Stage-2 page table mappings (host↔guest address attribute changes)
/// - Pinning guest physical pages for device DMA (ensures host caches are clean)
/// - Serializing guest memory for live migration transmission
/// - Changing memory type attributes in guest Stage-2 mappings (Normal→Device, etc.)
///
/// # Safety
/// - Caller must hold an `IrqDisabledGuard` (CSSELR_EL1 write is not interrupt-safe)
/// - Must execute at EL1 or EL2 (DC CISW is a privileged instruction)
pub unsafe fn flush_dcache_all(_irq: &IrqDisabledGuard) {
    // Ordering barrier: ensure all prior memory accesses are globally visible
    // before the first cache maintenance operation.
    core::arch::asm!("dsb sy", options(nostack, preserves_flags));

    let clidr: u64;
    core::arch::asm!("mrs {}, clidr_el1", out(reg) clidr, options(nostack, preserves_flags));

    // LoC: number of cache levels to flush for full coherency.
    let loc = ((clidr >> 24) & 0x7) as usize;

    // Iterate from L1 (level index 0) to LoC - 1 (inclusive).
    for level in 0..loc {
        // Ctype_n for this level (Ctype1 = bits[2:0], Ctype2 = bits[5:3], ...).
        let ctype = (clidr >> (3 * level)) & 0x7;

        // Skip levels with no data or unified cache.
        if ctype < 2 {
            continue;
        }

        // Select this cache level (data/unified; InD bit = 0).
        // ISB is mandatory before reading CCSIDR_EL1.
        let csselr: u64 = (level as u64) << 1;
        core::arch::asm!(
            "msr csselr_el1, {sel}",
            "isb",
            sel = in(reg) csselr,
            options(nostack, preserves_flags),
        );

        let ccsidr: u64;
        core::arch::asm!(
            "mrs {ccsidr}, ccsidr_el1",
            ccsidr = out(reg) ccsidr,
            options(nostack, preserves_flags),
        );

        // Decode geometry.
        let l         = ((ccsidr & 0x7) + 4) as u32;          // log2(line_size_bytes)
        let assoc_enc = ((ccsidr >> 3) & 0x3FF) as u32;       // NumWays - 1
        let sets_enc  = ((ccsidr >> 13) & 0x7FFF) as u32;     // NumSets - 1

        // CLZ(Assoc_enc) places the way index at the MSBs of bits[31:0] of
        // the DC CISW operand, matching the ARM architecture specification.
        let way_shift = assoc_enc.leading_zeros();

        // Iterate all sets and all ways.  Loop bounds are inclusive (0..=sets_enc,
        // 0..=assoc_enc) so that every set/way combination is covered.
        let mut set = sets_enc;
        loop {
            let mut way = assoc_enc;
            loop {
                let operand: u64 =
                    // Cast before shifting: for a direct-mapped cache
                    // (Assoc_enc = 0) way_shift is 32, which would overflow
                    // a 32-bit shift; a u64 shift by 32 is well-defined.
                    ((way as u64) << way_shift)
                    | ((set as u64) << l)
                    | ((level as u64) << 1);
                core::arch::asm!(
                    "dc cisw, {op}",
                    op = in(reg) operand,
                    options(nostack, preserves_flags),
                );
                if way == 0 {
                    break;
                }
                way -= 1;
            }
            if set == 0 {
                break;
            }
            set -= 1;
        }
    }

    // Restore CSSELR_EL1 to L1 data cache (level index 0, InD = 0).
    // This is good hygiene: other EL1 code that reads CCSIDR_EL1 without
    // writing CSSELR_EL1 first will see the L1 geometry, which is the
    // least surprising default.
    core::arch::asm!("msr csselr_el1, xzr", options(nostack, preserves_flags));

    // Completion barriers: DSB ensures all DC CISW operations are finished
    // before any subsequent memory access; ISB ensures the instruction stream
    // observes the completed maintenance.
    core::arch::asm!(
        "dsb sy",
        "isb",
        options(nostack, preserves_flags),
    );
}

FEAT_CCIDX support (ARMv8.3+ large-cache systems):

When the CCIDX field of ID_AA64MMFR2_EL1 (bits[23:20]) is non-zero, FEAT_CCIDX is implemented and CCSIDR_EL1 uses wider fields to support caches with more than 1024 ways or more than 32768 sets:

Field          Standard (no FEAT_CCIDX)            FEAT_CCIDX
LineSize_enc   bits[2:0]                           bits[2:0] (unchanged)
Assoc_enc      bits[12:3] (10-bit, mask 0x3FF)     bits[23:3] (21-bit, mask 0x1F_FFFF)
NumSets_enc    bits[27:13] (15-bit, mask 0x7FFF)   bits[55:32] (24-bit, mask 0xFF_FFFF)

The DC CISW operand format and the way_shift = u32::leading_zeros(Assoc_enc) formula are identical in both cases. Because way_shift is computed from the u32 representation of Assoc_enc and u32::leading_zeros operates on a 32-bit value, the 21-bit FEAT_CCIDX Assoc_enc is handled correctly: it is zero-extended into a u32 before leading_zeros is applied, producing the right shift position.

Detection: read ID_AA64MMFR2_EL1 at EL1. If the CCIDX field (bits[23:20]) is non-zero, apply the wider masks when extracting Assoc_enc and NumSets_enc from CCSIDR_EL1.
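A sketch of the format-aware extraction (function name illustrative), applying the masks from the table above:

```rust
/// Extract (Assoc_enc, NumSets_enc) from CCSIDR_EL1, honoring FEAT_CCIDX
/// field widths. `ccidx` is the result of the ID_AA64MMFR2_EL1 check.
fn ccsidr_geometry(ccsidr: u64, ccidx: bool) -> (u32, u32) {
    if ccidx {
        // Wide format: Assoc_enc bits[23:3], NumSets_enc bits[55:32].
        ((((ccsidr >> 3) & 0x1F_FFFF) as u32),
         (((ccsidr >> 32) & 0xFF_FFFF) as u32))
    } else {
        // Standard format: Assoc_enc bits[12:3], NumSets_enc bits[27:13].
        ((((ccsidr >> 3) & 0x3FF) as u32),
         (((ccsidr >> 13) & 0x7FFF) as u32))
    }
}
```

The same raw register value decodes differently depending on the format, which is why the detection step must precede any geometry read.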

KVM call sites:

flush_dcache_all() is called from umka-kvm's AArch64 path at four points:

  1. VM entry preparation — after modifying Stage-2 page tables and before the ERET into the guest. This ensures the host's cache does not hold stale data for pages whose Stage-2 attributes changed (e.g., a new guest mapping that maps a page as Normal Cacheable when the host previously treated it as Device).

  2. Guest memory pinning — when pinning guest physical pages for DMA passthrough (VFIO/iommufd, Section 17.3). The host caches are cleaned before the IOMMU mapping is established so the device reads coherent data.

  3. Live migration send — before reading guest memory pages to serialize them for network transmission (Section 17.1, VM Live Migration). The flush ensures that all dirty cache lines for the guest's physical pages are written back to DRAM before the migration sender reads them.

  4. Memory type attribute change — when reconfiguring Stage-2 page table entries to change the memory type of a guest region (e.g., from Normal Cacheable to Device nGnRnE for MMIO remapping). The cache must be flushed and invalidated before the attribute change takes effect to avoid cache aliasing.

17.1.1.6 RISC-V Host-Side Implementation

RISC-V H-extension (ratified as part of Privileged ISA v1.12, December 2021) provides VS-mode (virtualized supervisor) and VU-mode (virtualized user).

/// RISC-V H-extension vCPU state.
pub struct RiscvHState {
    /// Saved guest VS-mode CSRs (restored before sret to guest,
    /// saved after trap to HS-mode).
    pub vsstatus: u64,
    pub vsie: u64,
    pub vstvec: u64,
    pub vsscratch: u64,
    pub vsepc: u64,
    pub vscause: u64,
    pub vstval: u64,
    pub vsip: u64,
    pub vsatp: u64,     // guest's own page table root (SV48/SV39)

    /// Guest physical address translation register.
    /// Written to hgatp CSR before guest entry.
    /// Encodes: mode (Sv48x4/Sv39x4) | VMID | PPN of Stage-2 root.
    pub hgatp: u64,

    /// Hypervisor status register. SPV bit indicates guest context.
    pub hstatus: u64,

    /// Virtual interrupt pending (injection mechanism).
    /// Setting bits in hvip causes virtual interrupts in the guest.
    pub hvip: u64,

    /// VMID (Virtual Machine Identifier) — TLB tag analogous to
    /// Intel VPID / ARM VMID. Width is hardware-defined (up to 14 bits on
    /// RV64), discovered by writing all-ones to the hgatp VMID field and
    /// reading back which bits are implemented.
    pub vmid: u16,
}

VM entry: set hstatus.SPV = 1, load the guest VS-mode CSRs, and execute sret; the CPU transitions to VS-mode. VM exit: any trap that hedeleg/hideleg does not delegate to VS-mode traps into HS-mode. The hardware saves the faulting guest physical address in htval (for Stage-2 faults) and the trap cause in scause.

Stage-2 page tables: Controlled by hgatp CSR. Sv48x4 mode provides 4-level page tables with a 50-bit guest physical address space (the "x4" means the root page table is 4 pages / 16 KiB instead of 1 page / 4 KiB, giving 2 extra bits). HFENCE.GVMA flushes guest TLB entries (analogous to INVEPT on x86 and TLBI IPAS2E1IS on ARM).
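The hgatp composition can be sketched as a bitfield assembly per the RV64 layout (MODE in bits[63:60], VMID in bits[57:44], PPN in bits[43:0]); the constant and function names are illustrative:

```rust
const HGATP_MODE_SV48X4: u64 = 9; // MODE field value for Sv48x4 (bits [63:60])

/// Compose hgatp from the Stage-2 root physical address and a VMID.
/// The Sv48x4 root table is 4 pages (16 KiB), so the address must be
/// 16 KiB aligned; the PPN field holds the address in 4 KiB-page units.
fn make_hgatp(root_phys: u64, vmid: u16) -> u64 {
    assert_eq!(root_phys & 0x3FFF, 0, "Sv48x4 root must be 16 KiB aligned");
    (HGATP_MODE_SV48X4 << 60)
        | (((vmid as u64) & 0x3FFF) << 44) // VMID field is at most 14 bits
        | (root_phys >> 12)                // PPN
}
```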

Interrupt injection: hvip CSR provides virtual interrupt pending bits (VSSIP, VSTIP, VSEIP). For external interrupts, umka-kvm sets hvip.VSEIP to inject a virtual external interrupt. The AIA (Advanced Interrupt Architecture) extension, when available, provides IMSIC (Incoming MSI Controller) for direct MSI injection to guest interrupt files — analogous to ARM GICv4 direct injection.

17.1.1.7 vCPU Scheduling Integration

vCPU threads are scheduled by umka-core's EEVDF scheduler (Section 6.1) as normal kernel threads with specific properties:

  • vCPU affinity: By default, a vCPU thread can migrate between any host physical CPU. Userspace can pin vCPUs to specific pCPUs via sched_setaffinity for latency-sensitive workloads (DPDK, real-time). When pinned, the vCPU thread runs on exactly one pCPU and the VMX preemption timer is disabled (the vCPU owns the pCPU exclusively until it voluntarily exits or is preempted by a higher-priority host thread).

  • VMX preemption timer: On x86, the VMCS preemption timer field is programmed from the scheduler's remaining time slice for the vCPU thread. When the timer fires inside VMX non-root mode, a VM exit occurs with ExitReason::PreemptionTimer, and the vCPU thread yields to the scheduler. This ensures vCPU threads do not monopolize physical CPUs. The timer value is calculated as:

preempt_ticks = remaining_slice_ns * tsc_freq_khz / (1_000_000 * 2^preempt_timer_shift)

where preempt_timer_shift is read from IA32_VMX_MISC[4:0].

  • AArch64 equivalent: The generic timer's CNTHP_TVAL_EL2 (EL2 physical timer) is programmed as the host scheduler's preemption tick. When it fires, it traps the guest to EL2, where the exit handler yields.

  • RISC-V equivalent: The stimecmp CSR (Sstc extension) or SBI timer is programmed for the scheduler quantum. Timer interrupt traps to HS-mode.

  • NUMA placement: When a VM's backing memory is allocated from a specific NUMA node, umka-kvm hints the scheduler to prefer running vCPU threads on CPUs in the same NUMA node (via set_cpus_allowed_ptr with the node's CPU mask). This is a soft hint, not a hard pin — the scheduler can migrate vCPUs for load balancing but prefers local placement.

  • Halt polling: When a vCPU executes HLT and there is no pending interrupt, instead of immediately yielding to the scheduler (which incurs a context switch), the vCPU thread spins for a configurable duration (default: 200 microseconds, tunable via /sys/module/umka_kvm/parameters/halt_poll_ns). If an interrupt arrives during the spin window, the vCPU re-enters the guest without a context switch. If the spin window expires, the thread yields. This optimization reduces wake latency for interrupt-driven workloads (networking, storage) at the cost of slightly higher host CPU usage.

  • Overcommit behavior: When more vCPUs than physical CPUs are active, the scheduler distributes time fairly via EEVDF virtual deadline ordering. The PV spinlock mechanism (Section 17.1, "Guest Mode — PV Spinlocks") prevents lock-holder preemption waste. KVM_HC_SCHED_YIELD from a spinning guest vCPU triggers an immediate scheduler yield, allowing the lock-holding vCPU to run.

  • Power budget integration: Each VM can have a power budget (Section 6.2.6.2). The scheduler accounts vCPU thread CPU time against the VM's power budget. When a VM exceeds its budget, its vCPU threads' scheduling weights are reduced proportionally, throttling the VM without killing it.
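The halt-polling window described above reduces to a bounded spin before yielding. A minimal sketch, with `interrupt_pending` standing in for the real pending-interrupt check (names are illustrative, not from the actual source):

```rust
use std::time::{Duration, Instant};

/// Spin for up to `halt_poll_ns` after a guest HLT. Returns true if an
/// interrupt arrived in-window (re-enter the guest, no context switch),
/// false if the window expired (yield the vCPU thread to the scheduler).
fn halt_poll(halt_poll_ns: u64, interrupt_pending: &mut dyn FnMut() -> bool) -> bool {
    let deadline = Instant::now() + Duration::from_nanos(halt_poll_ns);
    while Instant::now() < deadline {
        if interrupt_pending() {
            return true; // fast wake path
        }
        std::hint::spin_loop();
    }
    false // fall back to the scheduler
}

fn main() {
    // No interrupt during the default 200 µs window: the thread yields.
    assert!(!halt_poll(200_000, &mut || false));
    // Interrupt already pending: immediate re-entry.
    assert!(halt_poll(200_000, &mut || true));
}
```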

17.1.1.8 In-Kernel Device Models

umka-kvm includes minimal in-kernel device emulation for devices where the userspace round-trip (VM exit → KVM_RUN return → userspace emulation → KVM_RUN re-entry) would be a performance bottleneck:

Device              | Emulation location            | Rationale
Local APIC (x2APIC) | In-kernel + hardware-assisted | Interrupt delivery is the hottest path. Hardware APIC virtualization avoids most exits.
IOAPIC              | In-kernel                     | Interrupt routing must be low-latency. Each I/O completion triggers IOAPIC.
PIT (i8254)         | In-kernel                     | Timer tick generation. Legacy but required for BIOS boot.
PIC (i8259)         | In-kernel                     | Legacy interrupt controller. Required for BIOS boot until IOAPIC takes over.
kvmclock            | In-kernel                     | Shared memory page, no exits needed. Host updates parameters on TSC recalibration.
vhost-net           | In-kernel (Tier 1, extended)  | See Section 17.1, "vhost Kernel Data Plane".
vhost-scsi          | In-kernel (Tier 1, extended)  | See Section 17.1, "vhost Kernel Data Plane".

All other devices (virtio-blk, virtio-gpu, IDE, e1000, etc.) are emulated in userspace by the VMM (QEMU, Firecracker, etc.). This split matches Linux KVM's architecture: the kernel handles the time-critical interrupt and timer paths; the VMM handles the device model complexity.

irqfd / ioeventfd: These mechanisms avoid the userspace round-trip for specific interrupt injection and I/O intercept patterns:

  • irqfd (KVM_IRQFD ioctl): Associates an eventfd with a guest IRQ line. When a userspace or kernel component writes to the eventfd, umka-kvm injects the corresponding interrupt into the guest — without a KVM_RUN exit/re-entry cycle. Used by QEMU for virtio interrupt injection.

  • ioeventfd (KVM_IOEVENTFD ioctl): Associates an eventfd with a guest I/O port or MMIO address. When the guest writes to that address, the VM exit handler triggers the eventfd and immediately re-enters the guest — the userspace device model processes the write asynchronously. Used by QEMU for virtio doorbell writes.
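The ioeventfd fast path in the MMIO-write exit handler can be sketched as a lookup table keyed by (address, length). This is a simplified model — `IoeventfdTable`, `ExitAction`, and `signal_eventfd` are illustrative names, not the actual source:

```rust
use std::collections::HashMap;

/// (guest physical address, access length) -> eventfd to signal.
struct IoeventfdTable {
    entries: HashMap<(u64, u32), i32>,
}

#[derive(Debug, PartialEq)]
enum ExitAction {
    ResumeGuest,       // handled in-kernel; re-enter the guest immediately
    ReturnToUserspace, // no match; fall back to the VMM's device model
}

impl IoeventfdTable {
    fn on_mmio_write(&self, addr: u64, len: u32) -> ExitAction {
        match self.entries.get(&(addr, len)) {
            Some(&efd) => {
                signal_eventfd(efd); // wakes the async device model
                ExitAction::ResumeGuest
            }
            None => ExitAction::ReturnToUserspace,
        }
    }
}

fn signal_eventfd(_efd: i32) {
    // Stand-in for the real eventfd write (counter increment + waiter wakeup).
}

fn main() {
    let mut entries = HashMap::new();
    entries.insert((0xfe00_3000, 4), 7); // virtio doorbell registered by the VMM
    let table = IoeventfdTable { entries };
    assert_eq!(table.on_mmio_write(0xfe00_3000, 4), ExitAction::ResumeGuest);
    assert_eq!(table.on_mmio_write(0xfe00_4000, 4), ExitAction::ReturnToUserspace);
}
```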

17.1.1.9 Nested Virtualization

Nested virtualization (running a hypervisor inside a VM) is a Phase 5 deliverable. The specification covers the basic architectural requirements:

  • x86 (VMCS shadowing): A guest hypervisor's VMCS operations (VMREAD, VMWRITE, VMLAUNCH, VMRESUME) are intercepted by umka-kvm. umka-kvm maintains a shadow VMCS (VMCS02) that merges the L1 hypervisor's intended guest state with umka-kvm's own host state. The VMCS_LINK_POINTER field points to the shadow VMCS. L2 VM exits are dispatched to L1 or L0 based on exit reason: exits caused by L1's execution controls go to L1; exits caused by L0's controls (e.g., EPT violation in L0's page tables) go to L0. Shadow EPT (EPT02) merges L1's EPT (guest-physical → L1-physical) with L0's EPT (L1-physical → host-physical) into a combined guest-physical → host-physical mapping.
  • ARM64: Nested virtualization on ARM64 requires trapping all EL2 instructions executed by the L1 hypervisor (HCR_EL2.NV = 1, ARMv8.3+). Stage-2 nesting (combining L1's Stage-2 with L0's Stage-2) follows the same shadow page table approach as x86 EPT02.
  • RISC-V: The H-extension does not yet define a standard nested virtualization mechanism. Software trap-and-emulate of all H-extension CSR accesses from L1 is functionally correct but slow (~10x overhead). Hardware support is expected in a future extension.

Performance target for nested virtualization: less than 20% overhead for L2 guest workloads compared to L1 (non-nested) execution, on hardware with VMCS shadowing (Intel), ARM VHE (Virtualization Host Extensions), or RISC-V H-extension. This is consistent with Linux KVM's measured nested overhead (10-30% depending on workload and exit frequency). Software-only nested virtualization — where the L0 hypervisor must emulate VMX/SVM instructions for L1 because the CPU lacks hardware shadowing support — has substantially higher overhead (typically 2-5x) and is not a supported configuration for production use.

Recovery advantage — UmkaOS's driver recovery provides unique benefits for virtualization:

  • Host-side: if a vhost-net or vhost-scsi module crashes, UmkaOS recovers it in-place (Tier 1 reload). The hypervisor and guest never notice. In Linux, a vhost crash would require tearing down and re-establishing the vhost connection.

  • Guest-side: if a guest running UmkaOS crashes a virtio driver, the driver recovers without VM reboot. The hypervisor sees a brief pause in I/O but no reset. In Linux, a guest virtio driver crash typically requires VM reboot.

Host Mode — Kernel Same-page Merging (KSM)

When UmkaOS runs as a hypervisor host managing many VMs, identical memory pages accumulate across guests — shared libraries (libc, libssl), zero-filled BSS pages, and common read-only data. KSM reclaims this waste by deduplicating identical pages:

  1. Scanning: A background kernel thread (ksmd) periodically walks pages marked as mergeable (via madvise(MADV_MERGEABLE) or per-VM opt-in at VM creation). Each page is hashed (xxHash, ~200-400ns per 4KB page) and inserted into a two-tree structure: a stable tree (already-merged pages, searched first) and an unstable tree (candidate pages, searched second).

  2. Merging: When a hash match is found, the kernel performs a byte-for-byte comparison to confirm identity (hashes can collide). If identical, the duplicate page's PTE is updated to point to the existing shared copy (COW-protected). The duplicate physical frame is freed back to the buddy allocator.

  3. Break-on-write: If a process writes to a KSM-merged page, the COW fault handler allocates a new page, copies the content, and remaps the writer's PTE to the private copy. This is the standard COW mechanism — no KSM-specific path.
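The hash-then-verify merge decision can be sketched as follows. This is a deliberately simplified model: the real design uses xxHash and stable/unstable trees, while this sketch substitutes an FNV-style hash and a HashMap of buckets to stay self-contained. The byte-for-byte confirmation before merging is the essential step, since hashes can collide:

```rust
use std::collections::HashMap;

/// FNV-1a stand-in for the real xxHash page hash.
fn page_hash(page: &[u8]) -> u64 {
    page.iter().fold(0xcbf29ce484222325u64, |h, &b| {
        (h ^ b as u64).wrapping_mul(0x100000001b3)
    })
}

/// Returns (duplicate, canonical) index pairs: each duplicate page would be
/// COW-remapped onto the canonical copy and its frame freed.
fn merge_identical(pages: &[Vec<u8>]) -> Vec<(usize, usize)> {
    let mut buckets: HashMap<u64, Vec<usize>> = HashMap::new();
    let mut merged = Vec::new();
    for (i, page) in pages.iter().enumerate() {
        let bucket = buckets.entry(page_hash(page)).or_default();
        // Byte-for-byte confirmation against every page in the hash bucket.
        if let Some(&j) = bucket.iter().find(|&&j| pages[j] == *page) {
            merged.push((i, j)); // page i now shares page j's frame (COW)
        } else {
            bucket.push(i);
        }
    }
    merged
}

fn main() {
    let pages = vec![vec![0u8; 4096], vec![1u8; 4096], vec![0u8; 4096]];
    // Page 2 is byte-identical to page 0 and merges into it.
    assert_eq!(merge_identical(&pages), vec![(2, 0)]);
}
```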

Configuration:

/sys/kernel/mm/ksm/run              — 0=off, 1=on (default: off)
/sys/kernel/mm/ksm/sleep_millisecs  — scan interval (default: 20ms)
/sys/kernel/mm/ksm/pages_to_scan    — pages per scan cycle (default: 100)
/sys/kernel/mm/ksm/pages_shared     — currently merged pages (read-only)
/sys/kernel/mm/ksm/pages_sharing    — additional page references saved by merging (read-only)

Performance trade-off: KSM's scanning consumes CPU (~1-5% of one core depending on scan rate and working set size). For VM-dense servers running 50-100 identical guests, the memory savings (30-50% for homogeneous Linux guests) far outweigh the CPU cost. For non-VM workloads or heterogeneous guests, the savings are minimal and KSM should remain disabled (the default).

NUMA awareness: KSM only merges pages within the same NUMA node by default (/sys/kernel/mm/ksm/merge_across_nodes=0; note: Linux defaults to 1, so UmkaOS diverges from the Linux default here for NUMA-aware performance). Cross-node merging saves more memory but forces remote NUMA accesses on the merged page — typically a net loss for latency-sensitive workloads. Administrators can enable cross-node merging explicitly for memory-constrained environments where density outweighs NUMA locality.


17.2 Suspend and Resume

Linux problem: Suspend/resume on laptops was notoriously unreliable for years. Driver suspend/resume callbacks are fragile — one broken driver blocks the entire system.

UmkaOS design:

17.2.1 Suspend Modes

  • s2idle (suspend-to-idle): Primary suspend mode. Freezes all userspace processes, puts devices into low-power states, and halts CPUs in their deepest idle state. Does not require firmware cooperation (no ACPI S3 handoff), making it more reliable than traditional suspend-to-RAM. Wake sources: any enabled interrupt (keyboard, network, RTC alarm, lid switch).
  • S3 (suspend-to-RAM): Fallback for platforms where s2idle power consumption is unacceptable. CPU and device state are saved to RAM, then firmware is called to power down the platform. Requires ACPI S3 support and correct firmware implementation.
  • S4 (hibernate / suspend-to-disk): Full system image is written to a swap partition or dedicated hibernate file, then the system powers off completely. On resume, the bootloader loads the kernel, which restores the saved image into memory and resumes execution. Hibernate support depends on the block I/O layer (Section 14.3) being available and a configured swap/hibernate target.

17.2.2 Device State Save/Restore Ordering

Device suspend follows the device dependency tree in leaf-to-root order (children before parents). Resume follows root-to-leaf order (parents before children, the reverse of suspend). This ensures that a child device never attempts I/O on a parent that is already suspended, and that parent buses are active before children attempt to re-initialize.

The ordering algorithm:

  1. Build topological order from the device dependency tree (Section 10.5, device registry).

  2. Suspend phase — traverse the tree bottom-up:
     • For each device, invoke its KABI suspend() callback with a per-device timeout (default per Section 10.5.5: 2 seconds for Tier 1, 5 seconds for Tier 2; configurable via sysfs).
     • If the callback does not complete within the timeout, the driver is forcibly stopped (Tier 1/Tier 2 driver recovery — the driver's isolation domain is torn down). The device is marked as "failed to suspend" and will be re-initialized from scratch on resume.
     • DMA engines are quiesced before their parent bus controller suspends.
     • Interrupt controllers are suspended last (after all device interrupts are masked).

  3. Resume phase — traverse the tree top-down:
     • Bus controllers and interrupt controllers are restored first.
     • For each device, invoke its KABI resume() callback.
     • Devices that failed to suspend are re-initialized via the standard driver probe path rather than the resume path.
     • Drivers that fail to resume within the timeout are forcibly restarted, same as suspend failures.
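The leaf-to-root ordering reduces to a post-order traversal of the device tree, with resume as its reverse. A minimal sketch — `Dev` is an illustrative stand-in for the device registry node type:

```rust
struct Dev {
    name: &'static str,
    children: Vec<Dev>,
}

/// Post-order traversal: children are appended before their parent, so a
/// child never issues I/O to an already-suspended parent bus.
fn suspend_order(dev: &Dev, out: &mut Vec<&'static str>) {
    for child in &dev.children {
        suspend_order(child, out);
    }
    out.push(dev.name);
}

fn main() {
    let tree = Dev {
        name: "pci-root",
        children: vec![
            Dev { name: "nvme0", children: vec![] },
            Dev { name: "eth0", children: vec![] },
        ],
    };
    let mut order = Vec::new();
    suspend_order(&tree, &mut order);
    // Leaves suspend before the parent bus controller.
    assert_eq!(order, vec!["nvme0", "eth0", "pci-root"]);
    // Resume walks the same list in reverse (root-to-leaf).
    let resume: Vec<_> = order.iter().rev().cloned().collect();
    assert_eq!(resume, vec!["pci-root", "eth0", "nvme0"]);
}
```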

17.2.3 CPU State Save/Restore

On suspend, the kernel saves per-CPU state that is not preserved by hardware across the suspend/resume cycle:

  • General-purpose registers: Saved to a per-CPU save area in the kernel BSS. On resume, the boot CPU restores its own state and brings up secondary CPUs via the normal SMP bringup path (Section 2.1), which re-initializes their register state.
  • System registers / MSRs: Architecture-specific system register state that must be explicitly restored:
  • x86_64: IA32_EFER, IA32_STAR, IA32_LSTAR, IA32_FMASK (syscall registers), IA32_PAT, IA32_KERNEL_GS_BASE, GDT/IDT/TR descriptors, CR0/CR3/CR4, debug registers (DR0-DR7), XCR0 (XSAVE state), IA32_SPEC_CTRL (Spectre mitigations)
  • AArch64: SCTLR_EL1, TCR_EL1, TTBR0/TTBR1_EL1, MAIR_EL1, VBAR_EL1, TPIDR_EL1, CNTKCTL_EL1, CPACR_EL1
  • RISC-V: satp, stvec, sscratch, sie, sstatus
  • ARMv7: SCTLR, TTBR0/TTBR1, TTBCR, DACR, VBAR, TPIDRPRW, CNTKCTL, CPACR, DFAR/IFAR, CONTEXTIDR (deferred to Phase 3: full list pending ARMv7 suspend implementation)
  • PPC32: MSR, SRR0/SRR1, SPRG0-3, DEC, PVR, HID0/HID1, DBAT/IBAT registers, L1CSR0/L1CSR1 (deferred to Phase 3: full list pending PPC32 suspend implementation)
  • PPC64LE: MSR, SRR0/SRR1, SPRG0-3, LPCR, HSPRG0/1, DEC, AMOR, PIDR, PTCR (Radix), SDR1 (HPT). On POWER9+: PSSCR for stop states. (Deferred to Phase 3: full list pending PPC64LE suspend implementation; OPAL firmware may handle some state save/restore)
  • FPU/SIMD state: Saved via XSAVE (x86_64), STP of Q0-Q31 + FPCR/FPSR (AArch64), or architecture-specific equivalent. Eager FPU restore is used on resume — FPU state is restored immediately on all CPUs before executing any userspace or untrusted code, preventing the CVE-2018-3665 (LazyFP) speculative execution side-channel vulnerability.

17.2.4 Memory Handling

  • s2idle and S3: RAM remains powered. No memory save/restore is needed. The kernel only needs to ensure that all dirty cache lines are flushed to RAM before the CPU enters the suspended state (WBINVD on x86_64; DC CISW (Clean and Invalidate by Set/Way) iterated over all sets and ways + DSB on AArch64 — DC CIVAC is per-VA and cannot flush the entire cache without iterating all dirty addresses, which is impractical; DC CISW is the standard ARM approach for full-cache flush before S3; see Section 17.1.1.5, "AArch64 Cache Geometry Discovery and DC CISW Flush", for the register layout and geometry discovery algorithm used by flush_dcache_all()).
  • S4 (hibernate): The kernel creates a consistent snapshot of all in-use memory pages using a two-phase freeze-and-snapshot approach:
  • Freeze phase: All userspace processes are frozen (SIGSTOP equivalent). All Tier 1 and Tier 2 drivers (except the storage stack required for the hibernate target) are suspended via the suspend path (Section 17.2.2), which quiesces their DMA activity. After this point, no new memory modifications occur except from the snapshot code and the active storage drivers.
  • Snapshot phase: With all sources of concurrent modification stopped, the kernel walks the page frame allocator's used-page bitmap. Free pages (tracked by the buddy allocator) are excluded. Each in-use page is compressed (LZ4, matching Linux's default hibernate compressor) and written to the configured hibernate target (swap partition or file). The snapshot code runs on the boot CPU only, with interrupts disabled except for the disk I/O completion interrupt.
  • Integrity and authentication: The hibernate image is cryptographically authenticated to prevent tampering:
    • A SHA-256 hash of the compressed image is computed during the write.
    • The hash is signed (not merely stored) using a TPM-backed key if a TPM is available (Section 8.3), or encrypted with a symmetric key derived from the kernel's hibernate secret — a 256-bit random key generated at boot from the hardware RNG (RDRAND/RNDR), stored only in kernel memory, and destroyed on shutdown.
    • On systems with a TPM, the hibernate secret is additionally sealed to the TPM (PCR-bound) so that only a boot configuration matching the original can unseal and verify the image.
    • On non-TPM systems, the hibernate secret does not survive a true cold reboot (full power-off), so true ACPI S4 (power completely removed, cold resume from disk) is only reliably supported in TPM mode. Non-TPM systems instead support a memory-preservation mode: the platform firmware must preserve a specific kernel memory region across the power state transition, making this closer to a deep sleep with disk checkpointing than a true S4. Platforms that cannot guarantee firmware memory preservation must use the TPM path for hibernate integrity.
    • On resume, the signature is verified (TPM path) or the hash is decrypted and validated (boot-secret path) before any pages are restored.
    • This prevents an attacker with disk write access from substituting a malicious hibernate image, as they cannot forge valid authentication without access to the TPM or the boot secret.
  • On resume, the bootloader loads a fresh kernel, which reads the hibernate image from disk, verifies the hash, and allocates intermediate safe memory (bounce frames) to hold the image. It then carefully copies the saved pages to their original physical addresses, taking care to avoid overwriting the fresh kernel's own executing code or page tables, and jumps to the restore entry point. The resumed kernel then re-initializes devices via the resume path described in Section 17.2.2.

17.2.5 Timer Re-synchronization

System clocks drift or lose state during suspend. On resume, the kernel must re-synchronize all time sources:

  1. Read the hardware RTC (CMOS on x86, PL031 on ARM, or platform-specific RTC) to determine wall-clock time elapsed during suspend.
  2. Adjust CLOCK_BOOTTIME offset by the elapsed suspend duration so that it reflects total wall-clock time since boot, including suspend. CLOCK_MONOTONIC is not adjusted — it does not count time spent in suspend, matching Linux semantics. CLOCK_BOOTTIME includes suspend time by definition.
  3. Re-calibrate TSC / arch timer: On x86, re-read IA32_TSC_ADJUST if available, or re-synchronize TSC across CPUs via the TSC synchronization protocol (Section 6.5). On AArch64, the generic timer (CNTPCT_EL0) typically survives S3 suspend but must be verified on resume. On platforms with paravirtual clocks (KVM pvclock, Hyper-V TSC page), the shared clock page is re-read to pick up updated scale/offset values.
  4. Fire expired timers: All pending hrtimer and timer_list entries are checked against their respective updated time bases. Timers armed against CLOCK_BOOTTIME or CLOCK_REALTIME that expired during suspend are fired immediately in a batch. Timers armed against CLOCK_MONOTONIC are evaluated against the unadjusted monotonic clock and will not fire prematurely.
  5. Notify userspace: A CLOCK_REALTIME discontinuity notification is sent to processes using timerfd or clock_nanosleep so they can adjust.
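Step 2's clock adjustment can be sketched as a single offset update: CLOCK_BOOTTIME is the monotonic clock plus an offset that grows by each suspend duration, while CLOCK_MONOTONIC itself is untouched. Field and method names below are illustrative, not from the actual source:

```rust
struct TimeBases {
    monotonic_ns: u64,       // does not advance across suspend
    boottime_offset_ns: u64, // CLOCK_BOOTTIME = monotonic + offset
}

impl TimeBases {
    /// On resume, fold the RTC-measured suspend duration into the
    /// boottime offset.
    fn resume_adjust(&mut self, rtc_suspend_ns: u64, rtc_resume_ns: u64) {
        let suspended_ns = rtc_resume_ns.saturating_sub(rtc_suspend_ns);
        self.boottime_offset_ns += suspended_ns;
    }

    fn boottime_ns(&self) -> u64 {
        self.monotonic_ns + self.boottime_offset_ns
    }
}

fn main() {
    let mut t = TimeBases { monotonic_ns: 100, boottime_offset_ns: 0 };
    t.resume_adjust(1_000, 4_000);      // RTC says 3000 ns spent suspended
    assert_eq!(t.boottime_ns(), 3_100); // BOOTTIME includes suspend time
    assert_eq!(t.monotonic_ns, 100);    // MONOTONIC unchanged
}
```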

17.2.6 Interrupt Controller State

Interrupt controller state is saved on suspend and restored before any device resume callbacks are invoked:

  • x86_64 (APIC): Save and restore the Local APIC registers (LVT entries, TPR, timer configuration, spurious interrupt vector). The I/O APIC redirection table entries are saved per-pin. MSI/MSI-X vectors are re-programmed by the PCI subsystem during device resume — the interrupt controller layer saves the IRQ-to-vector mapping so that devices resume with the same interrupt vectors they had before suspend.
  • AArch64 (GICv3): Save and restore the GIC Distributor (GICD), Redistributor (GICR), and CPU Interface (ICC) state. GICv3 defines standard save/restore registers for this purpose.
  • RISC-V (PLIC/APLIC): Save per-source priority, per-context enable bits, and threshold registers.

The kernel disables all interrupts (except the wake source IRQ) before entering the final suspend state, and re-enables them after interrupt controller state is restored on resume.


17.3 VFIO and iommufd — Device Passthrough Framework

VFIO (Virtual Function I/O) is the kernel framework that exposes physical PCIe devices directly to user-space processes — primarily KVM guests managed by a VMM such as QEMU. The device is detached from its host driver and assigned exclusively to the guest, which then drives it with its own unmodified guest driver. From the guest's perspective, the device is indistinguishable from a real bare-metal device. VFIO relies on the IOMMU subsystem (Section 10.4) to confine device DMA to the guest's physical address space.

iommufd is the modern replacement for the legacy vfio_iommu_type1 API. It provides a richer, more composable object model and is the preferred interface for all new VMM implementations.

17.3.1 VFIO Object Model

VFIO exposes three primary objects to userspace:

VfioGroup — A set of devices that the IOMMU requires to be isolated together (an IOMMU group). An IOMMU group is the minimal unit of isolation: all devices in the same group share DMA visibility and must either all be passed through to the same guest, or all remain bound to their host drivers. This constraint follows from the PCIe ACS (Access Control Services) topology: if a PCIe switch lacks ACS, peer devices behind that switch can DMA to each other's address space, so they form a single IOMMU group.

// umka-kvm/src/vfio/group.rs

/// An IOMMU group — set of devices that must be isolated together.
/// Corresponds to a /dev/vfio/N file descriptor in userspace.
pub struct VfioGroup {
    /// Kernel IOMMU group identity.
    pub iommu_group_id: u32,
    /// All devices in this group. Must all be unbound from host drivers
    /// before any can be assigned to a guest (or all must be bound).
    pub devices: Vec<Arc<VfioDevice>>,
    /// Reference to the iommufd context this group is attached to.
    /// None if the group is not yet associated with an IOAS.
    pub iommufd_ctx: Option<Arc<IommufdCtx>>,
    /// Exclusive open lock: only one VMM process may open a given group.
    pub open_mutex: Mutex<()>,
}

VfioDevice — A single PCIe function or platform device. Provides three capabilities to the VMM:

  • Region access: read/write of MMIO regions (BARs, ROM, config space) via pread/pwrite on the device fd, or mmap for regions that allow direct mapping.

  • Interrupt injection: delivery of device interrupts (INTx, MSI, MSI-X) to the guest via the irqbypass mechanism (Section 17.3.4).

  • Reset: VFIO_DEVICE_RESET triggers a Function-Level Reset (FLR) or bus reset as appropriate for the device type.

// umka-kvm/src/vfio/device.rs

/// A single passthrough device (one PCIe function or platform device).
pub struct VfioDevice {
    pub name: ArrayString<64>,          // e.g. "0000:03:00.0"
    pub group: Weak<VfioGroup>,
    pub pci_dev: Option<Arc<PciDevice>>, // None for platform devices
    /// BAR regions. A type 0 PCIe endpoint has at most 6 BARs plus one
    /// expansion ROM BAR; 8 fixed slots cover all standard hardware.
    /// Fixed array avoids heap allocation on the hot KVM MMIO path.
    pub bars: [Option<VfioRegion>; 8],
    /// Count of active (Some) bars for iteration.
    pub bar_count: u8,
    /// Overflow for devices exposing more VFIO regions than the fixed
    /// array holds (device-specific regions on some network controllers).
    /// None for the vast majority of hardware.
    pub bar_overflow: Option<Vec<VfioRegion>>,
    pub irqs: Vec<VfioIrqConfig>,
    /// Flags: VFIO_DEVICE_FLAGS_PCI | VFIO_DEVICE_FLAGS_RESET | ...
    pub flags: VfioDeviceFlags,
    /// IOMMU domain this device is attached to.
    pub iommu_domain: Option<Arc<IommuDomain>>,
    /// irqbypass producer, set when VFIO_DEVICE_SET_IRQS is called
    /// with a KVM IRQFD eventfd.
    pub irqbypass_producers: Vec<IrqBypassProducer>,
}

/// A VFIO memory region (BAR, ROM, Config, or platform-specific).
pub struct VfioRegion {
    pub index: u32,
    pub flags: VfioRegionFlags,   // READ | WRITE | MMAP | CAPS
    pub size: u64,
    /// Byte offset on the device fd for pread/pwrite access.
    pub fd_offset: u64,
    /// Physical address of the underlying MMIO resource (for mmap).
    pub phys_addr: Option<u64>,
}

bitflags! {
    pub struct VfioRegionFlags: u32 {
        const READ  = 0x1;
        const WRITE = 0x2;
        const MMAP  = 0x4;
        const CAPS  = 0x8;
    }
    pub struct VfioDeviceFlags: u32 {
        const PCI      = 0x1;
        const PLATFORM = 0x2;
        const RESET    = 0x10;  // device supports FLR/bus reset
        const NOIOMMU  = 0x40;  // no-IOMMU mode (dangerous, dev-only)
    }
}

BAR array sizing: The PCIe spec defines at most 6 BARs for a standard (type 0) endpoint, plus one expansion ROM BAR; 8 fixed slots therefore cover all standard hardware. Devices that expose more VFIO regions than the fixed array holds (device-specific regions on some network controllers) use the secondary overflow Vec<VfioRegion> attached via the bar_overflow extension field on VfioDevice.

VfioContainer (legacy) — Aggregates multiple VfioGroups under a single IOMMU domain. Kept for compatibility with older VMMs (QEMU < 8.2). New VMMs use iommufd exclusively. The container model is superseded because it couples IOMMU domain lifecycle to group membership, preventing the more flexible IOAS-based mapping composition that iommufd provides.

17.3.2 iommufd Object Model

iommufd introduces a composable object graph, accessed via a single /dev/iommu fd. All objects are reference-counted and can be shared across multiple VFIO devices or multiple VMM processes (within capability and policy constraints).

IommufdCtx — The per-fd root context. Owns all objects created on this fd.

// umka-kvm/src/iommufd/ctx.rs

/// Per-fd root context for iommufd. All objects are owned here.
pub struct IommufdCtx {
    /// Next object ID (monotonically increasing, wraps at u32::MAX).
    pub next_id: AtomicU32,
    /// All IO address spaces created on this fd.
    pub ioas: Mutex<HashMap<u32, Arc<IoAddrSpace>>>,
    /// Hardware page tables created from IOASes.
    pub hw_pagetables: Mutex<HashMap<u32, Arc<HwPagetable>>>,
    /// Physical devices bound to this context.
    pub devices: Mutex<HashMap<u32, Arc<BoundDevice>>>,
    /// owning process credential — checked on IOMMU_DEVICE_ATTACH.
    pub cred: Credential,
}

IoAddrSpace (IOAS) — A virtual DMA address space. Multiple physical devices can be attached to the same IOAS, causing them all to share the same IOVA→physical mapping. A KVM VM's guest physical address (GPA) space is implemented as an IOAS: the VMM maps all guest RAM into it, and all passthrough devices are attached to it, so device DMA using guest physical addresses is automatically translated to host physical addresses by the IOMMU.

// umka-kvm/src/iommufd/ioas.rs

/// An IO Address Space: a virtual DMA address space for one or more devices.
pub struct IoAddrSpace {
    pub id: u32,
    /// The underlying IOMMU page directory (arch-specific).
    /// Shared with all HwPagetables derived from this IOAS.
    pub pgd: Arc<IommuPgd>,
    /// All current IOVA mappings, sorted by IOVA start.
    pub mappings: BTreeMap<u64, IommuMapping>,
    /// Number of devices currently attached. Mappings cannot be freed
    /// while devices are attached (in-flight DMA hazard).
    pub attached_device_count: u32,
    /// Valid IOVA ranges (from IOMMU hardware capabilities).
    pub valid_iova_ranges: Vec<IovaRange>,
}

/// A single IOVA→physical mapping within an IOAS.
pub struct IommuMapping {
    /// IO Virtual Address — the address the device will use.
    pub iova: u64,
    /// Host physical address this IOVA maps to.
    pub paddr: u64,
    /// Length in bytes. Must be a multiple of the IOMMU page size.
    pub len: usize,
    /// Access permissions.
    pub prot: IommuProt,
}

bitflags! {
    pub struct IommuProt: u32 {
        const READ    = 0x1;
        const WRITE   = 0x2;
        const NOEXEC  = 0x4;  // where supported by IOMMU hardware
    }
}

/// A contiguous range of valid IOVA space reported by the IOMMU.
pub struct IovaRange {
    pub start: u64,
    pub last: u64,  // inclusive
}

HwPagetable — A hardware IOMMU page table derived from one IOAS. On x86-64 this is the VT-d SLPT (Second-Level Page Table); on ARM64 it is the stage-2 page table; on AMD systems it is the AMD-Vi page table. HwPagetable holds a reference to the IOAS's IommuPgd and a device-side IOMMU context entry pointing to it.

BoundDevice — A physical device that has been attached to this iommufd context, detached from its host driver, and linked to an IoAddrSpace or HwPagetable.

// umka-kvm/src/iommufd/bound.rs

pub struct BoundDevice {
    pub id: u32,
    pub dev: Arc<dyn DeviceNode>,       // from §10.4 device registry
    pub attached_ioas: Option<Arc<IoAddrSpace>>,
    pub attached_hwpt: Option<Arc<HwPagetable>>,
}

17.3.3 ioctl Interface

VFIO and iommufd expose their APIs via ioctl on character device file descriptors. All ioctl structs are #[repr(C)] and must match the Linux kernel ABI exactly for QEMU and other VMMs to work without modification.

VFIO device ioctls (on /dev/vfio/devices/vfioX):

// umka-kvm/src/vfio/ioctl.rs

/// VFIO_DEVICE_GET_INFO — query device capabilities.
#[repr(C)]
pub struct VfioDeviceInfo {
    pub argsz: u32,
    pub flags: VfioDeviceFlags,
    pub num_regions: u32,
    pub num_irqs: u32,
    pub cap_offset: u32,    // offset into info struct of first capability header
}

/// VFIO_DEVICE_GET_REGION_INFO — query one region (BAR, ROM, Config).
#[repr(C)]
pub struct VfioRegionInfo {
    pub argsz: u32,
    pub flags: VfioRegionFlags,
    pub index: u32,     // region index: VFIO_PCI_BAR0_REGION_INDEX .. CONFIG
    pub cap_offset: u32,
    pub size: u64,
    pub offset: u64,    // byte offset on the device fd for pread/pwrite/mmap
}

/// VFIO_DEVICE_GET_IRQ_INFO — query one IRQ index.
#[repr(C)]
pub struct VfioIrqInfo {
    pub argsz: u32,
    /// VFIO_IRQ_INFO_EVENTFD | VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED | NORESIZE
    pub flags: u32,
    pub index: u32,   // VFIO_PCI_INTX_IRQ_INDEX, MSI, MSIX, ERR, REQ
    pub count: u32,   // number of vectors in this IRQ index
}

/// VFIO_DEVICE_SET_IRQS — configure interrupt delivery.
#[repr(C)]
pub struct VfioIrqSet {
    pub argsz: u32,
    /// ACTION (SET/UNMASK/MASK) | DATA (NONE/BOOL/EVENTFD) | INDEX flags
    pub flags: u32,
    pub index: u32,   // which IRQ type (INTx, MSI, MSI-X, ...)
    pub start: u32,   // first vector within the index
    pub count: u32,   // number of vectors this call configures
    // Followed by `count` eventfd integers in the data[] array.
    // data[]: i32 eventfds. For irqbypass, these are KVM IRQFD eventfds.
}

The ioctl dispatch for VFIO_DEVICE_SET_IRQS wires the provided eventfds into the irqbypass subsystem (Section 17.3.4):

// umka-kvm/src/vfio/ioctl.rs

fn handle_set_irqs(
    dev: &mut VfioDevice,
    req: &VfioIrqSet,
    eventfds: &[RawFd],
) -> Result<(), KernelError> {
    // Validate IRQ index and vector range against dev.irqs[].
    let irq_cfg = dev.irqs.get(req.index as usize)
        .ok_or(KernelError::EINVAL)?;
    // start and count are userspace-controlled u32s: guard against
    // wrap-around as well as out-of-range vectors.
    if req.start.checked_add(req.count).map_or(true, |end| end > irq_cfg.count) {
        return Err(KernelError::EINVAL);
    }

    if req.flags & VFIO_IRQ_SET_DATA_EVENTFD != 0
        && req.flags & VFIO_IRQ_SET_ACTION_TRIGGER != 0
    {
        // Wire each eventfd as an irqbypass producer.
        for (i, &fd) in eventfds.iter().enumerate() {
            let vector = req.start as usize + i;
            let producer = IrqBypassProducer::from_eventfd(fd)?;
            irqbypass_register_producer(&producer)?;
            dev.irqbypass_producers.push(producer);
            // The KVM IRQFD (consumer) must have been registered already
            // via KVM_IRQFD ioctl on the KVM VM fd.
        }
    }
    Ok(())
}

iommufd ioctls (on /dev/iommu):

ioctl                  | Description
IOMMU_IOAS_ALLOC       | Allocate a new IO address space. Returns { id: u32 }.
IOMMU_IOAS_IOVA_RANGES | Query valid IOVA ranges for the IOAS. Returns Vec<IovaRange>.
IOMMU_IOAS_MAP         | Map a physical range into the IOAS at a given IOVA.
IOMMU_IOAS_UNMAP       | Unmap a range. Blocked while devices are in-flight (returns EBUSY if active DMA is detected).
IOMMU_IOAS_COPY        | Copy all mappings from one IOAS to another. Used during VM live migration to clone the guest's IOVA mapping into the destination kernel without quiescing the device.
IOMMU_HWPT_ALLOC       | Allocate a hardware page table from an IOAS. Returns { hwpt_id: u32 }. The HWPT shares the IOAS's IommuPgd.
IOMMU_DEVICE_ATTACH    | Attach a bound device to an IOAS or HWPT. Installs the IOMMU context entry pointing to the page table.
IOMMU_DEVICE_DETACH    | Detach a device. Invalidates the IOMMU context entry. Must drain in-flight DMA before returning (IOTLB invalidation with drain).

The IOMMU_IOAS_MAP ioctl struct:

// umka-kvm/src/iommufd/ioctl.rs

#[repr(C)]
pub struct IommuIoasMap {
    pub argsz: u32,
    pub flags: u32,        // IOMMU_IOAS_MAP_READABLE | WRITABLE | FIXED_IOVA
    pub ioas_id: u32,
    _padding: u32,
    /// Userspace virtual address of the memory to map. The kernel pins
    /// the pages and obtains their physical addresses via get_user_pages().
    pub user_va: u64,
    /// IO virtual address to map at. If FIXED_IOVA is not set, the kernel
    /// allocates the IOVA (and returns it in this field on success).
    pub iova: u64,
    pub length: u64,
    pub iommu_prot: u32,   // IommuProt bits
    _padding2: u32,
}
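As a usage sketch of the userspace side (the actual ioctl call is elided, and the flag bit values here are illustrative, not ABI-verified), `argsz` carries the struct size so the kernel can validate the layout, and guest RAM is mapped with FIXED_IOVA using the GPA as the IOVA:

```rust
use std::mem::size_of;

// Flag bits for IommuIoasMap.flags. Values are illustrative placeholders.
pub const IOMMU_IOAS_MAP_READABLE: u32 = 1 << 0;
pub const IOMMU_IOAS_MAP_WRITABLE: u32 = 1 << 1;
pub const IOMMU_IOAS_MAP_FIXED_IOVA: u32 = 1 << 2;

#[repr(C)]
pub struct IommuIoasMap {
    pub argsz: u32,
    pub flags: u32,
    pub ioas_id: u32,
    _padding: u32,
    pub user_va: u64,
    pub iova: u64,
    pub length: u64,
    pub iommu_prot: u32,
    _padding2: u32,
}

/// Hypothetical helper: build a map request for a guest RAM region, using
/// the guest physical address as the IOVA (FIXED_IOVA), readable and
/// writable so the device can DMA in both directions.
pub fn map_guest_ram(ioas_id: u32, user_va: u64, gpa: u64, len: u64) -> IommuIoasMap {
    IommuIoasMap {
        argsz: size_of::<IommuIoasMap>() as u32,
        flags: IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITABLE | IOMMU_IOAS_MAP_FIXED_IOVA,
        ioas_id,
        _padding: 0,
        user_va,
        iova: gpa,
        length: len,
        iommu_prot: 0, // IommuProt bits; left 0 here for illustration
        _padding2: 0,
    }
}
```

QEMU would fill one such request per KVM memory slot, which is why step 2 of the passthrough sequence pairs each KVM_SET_USER_MEMORY_REGION with an IOMMU_IOAS_MAP.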

17.3.4 irqbypass — Zero-Latency Interrupt Delivery

When a passthrough device raises an interrupt, the normal path would be: hardware IRQ → host kernel interrupt handler → write to eventfd → KVM thread wakeup → inject guest interrupt. This path costs a host context switch per interrupt, which is significant added latency for NVMe and network devices.

irqbypass eliminates the kernel interrupt handler and thread wakeup:

Device raises IRQ
       │
       ▼
 IOMMU/APIC delivers to host CPU
       │
       ▼
 irqbypass producer fires  ──────►  irqbypass consumer (KVM IRQFD)
                                           │
                                           ▼
                               KVM injects virtual interrupt
                               directly into guest LAPIC/GIC
                               (no kernel thread wakeup)

The two sides of irqbypass:

// umka-kvm/src/irqbypass.rs

/// Produced by the VFIO side: fires when the physical device raises an IRQ.
/// token is a unique identity used to match producers to consumers.
#[derive(Clone)]
pub struct IrqBypassProducer {
    /// Unique token — pointer-identity. Matches the corresponding consumer.
    pub token: NonNull<IrqBypassToken>,
    /// Called when a consumer is linked to this producer.
    /// Implementor: disables the standard interrupt handler and installs
    /// the consumer's direct delivery path.
    pub add_consumer: unsafe fn(prod: &IrqBypassProducer, cons: &IrqBypassConsumer) -> KabiResult,
    /// Called when the consumer is unlinked. Re-enables the standard handler.
    pub del_consumer: unsafe fn(prod: &IrqBypassProducer, cons: &IrqBypassConsumer),
}

/// Consumed by the KVM IRQFD side: receives IRQs and injects them into the guest.
#[derive(Clone)]
pub struct IrqBypassConsumer {
    pub token: NonNull<IrqBypassToken>,
    /// Called when a producer is linked. Installs the guest IRQ injection path.
    /// On x86: programs the Posted Interrupt Descriptor so the device's MSI
    /// vector is delivered directly to the vCPU's virtual LAPIC.
    pub add_producer: unsafe fn(cons: &IrqBypassConsumer, prod: &IrqBypassProducer) -> KabiResult,
    /// Called when the producer is unlinked. Tears down the direct path.
    pub del_producer: unsafe fn(cons: &IrqBypassConsumer, prod: &IrqBypassProducer),
}

/// Opaque token used for producer↔consumer matching. Pointer identity only.
pub struct IrqBypassToken(());

The irqbypass registry maintains per-VM lists of registered producers and consumers, sharded by VmId to avoid serializing multi-VM setups. When a new consumer is registered (via KVM_IRQFD with the irqbypass flag), the registry acquires only that VM's SpinLock and scans that VM's entries for a producer with the same token (pointer identity). When a new producer is registered (via VFIO_DEVICE_SET_IRQS), the registry does the same on the consumer side, again under only the target VM's lock.

// umka-kvm/src/irqbypass.rs

/// Entry in the per-VM irqbypass registry.
pub struct IrqBypassEntry {
    pub producer: Option<IrqBypassProducer>,
    pub consumer: Option<IrqBypassConsumer>,
}

/// Per-VM sharded IRQ bypass registry. Avoids a global mutex that would
/// serialize producer/consumer registration across independent VMs.
pub struct IrqBypassRegistry {
    /// Per-VM IRQ bypass mappings. RwLock allows concurrent reads across VMs.
    /// Write lock only needed on VM create/destroy, not per-device registration.
    per_vm: RwLock<HashMap<VmId, Arc<SpinLock<Vec<IrqBypassEntry>>>>>,
}

Registration for a specific VM acquires only that VM's SpinLock. Cross-VM iteration (for IOMMU group validation) acquires the global RwLock in read mode. The global write lock is taken only when a new VM is created or destroyed — not on the per-device registration hot path.

fn irqbypass_register_producer(
    registry: &IrqBypassRegistry,
    vm_id: VmId,
    prod: &IrqBypassProducer,
) -> Result<(), KernelError> {
    // Read lock: does not block other VMs' registrations.
    let per_vm = registry.per_vm.read();
    let vm_entries = per_vm.get(&vm_id)
        .ok_or(KernelError::InvalidVm)?;
    let mut entries = vm_entries.lock();
    entries.push(IrqBypassEntry {
        producer: Some(prod.clone()),
        consumer: None,
    });

    // Check if a matching consumer is already registered in this VM.
    if let Some(entry) = entries.iter()
        .find(|e| e.consumer.as_ref().map(|c| c.token) == Some(prod.token))
    {
        let cons = entry.consumer.as_ref().unwrap();
        // SAFETY: both producer and consumer are valid while registered.
        // add_consumer disables the standard IRQ handler and installs
        // the direct delivery path atomically.
        unsafe { (prod.add_consumer)(prod, cons) }
            .into_result()?;
    }
    Ok(())
}

fn irqbypass_register_consumer(
    registry: &IrqBypassRegistry,
    vm_id: VmId,
    cons: &IrqBypassConsumer,
) -> Result<(), KernelError> {
    let per_vm = registry.per_vm.read();
    let vm_entries = per_vm.get(&vm_id)
        .ok_or(KernelError::InvalidVm)?;
    let mut entries = vm_entries.lock();
    entries.push(IrqBypassEntry {
        producer: None,
        consumer: Some(cons.clone()),
    });

    if let Some(entry) = entries.iter()
        .find(|e| e.producer.as_ref().map(|p| p.token) == Some(cons.token))
    {
        let prod = entry.producer.as_ref().unwrap();
        // SAFETY: same as above.
        unsafe { (cons.add_producer)(cons, prod) }
            .into_result()?;
    }
    Ok(())
}

On x86-64 with APICv/Posted Interrupts (Intel VT-d PI), the add_producer implementation programs the device's MSI destination address to point to the vCPU's Posted Interrupt Descriptor (PID). The hardware then delivers the interrupt directly into the guest's virtual LAPIC without any host CPU intervention — the VMEXIT for interrupt injection is eliminated entirely.
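As an illustrative model of that posted-interrupt path (the field layout follows the VT-d Posted Interrupt Descriptor at field granularity, but this is a sketch, not the exact hardware bit packing):

```rust
/// Model of an Intel VT-d Posted Interrupt Descriptor: 64 bytes, 64-byte
/// aligned. One exists per vCPU; the device's remapped MSI points at it.
#[repr(C, align(64))]
pub struct PostedIntDesc {
    /// Posted Interrupt Requests: one bit per interrupt vector (0..=255).
    pub pir: [u64; 4],
    /// Control: bit 0 = ON (outstanding notification), bit 1 = SN (suppress).
    pub control: u64,
    /// Notification vector sent to the host CPU when ON transitions 0 -> 1.
    pub nv: u8,
    _rsvd: [u8; 7],
    /// Notification destination: APIC ID of the pCPU running the vCPU.
    pub ndst: u32,
    _rsvd2: [u8; 12],
}

impl PostedIntDesc {
    pub fn new(nv: u8, ndst: u32) -> Self {
        PostedIntDesc { pir: [0; 4], control: 0, nv, _rsvd: [0; 7], ndst, _rsvd2: [0; 12] }
    }

    /// What the IOMMU does when the device's remapped MSI fires: set the
    /// vector's PIR bit, then send a notification interrupt (vector nv to
    /// ndst) only if ON was clear and SN is clear. Returns whether a
    /// notification would be sent.
    pub fn post(&mut self, vector: u8) -> bool {
        self.pir[(vector / 64) as usize] |= 1u64 << (vector % 64);
        let on = self.control & 0b01 != 0;
        let sn = self.control & 0b10 != 0;
        if !on && !sn {
            self.control |= 0b01; // set ON: notification now outstanding
            true
        } else {
            false // coalesced: the vCPU drains PIR on the next VM entry
        }
    }
}
```

The ON bit is what makes delivery cheap: back-to-back interrupts only set PIR bits while a notification is outstanding, so the host CPU is poked at most once per batch.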

17.3.5 KVM Integration

When QEMU assigns a VFIO device to a KVM VM, the following sequence establishes the full passthrough configuration:

  1. Detach from host driver: QEMU opens /dev/vfio/devices/vfioX. The kernel calls driver_unbind(dev) on the PCI device's current host driver (e.g., NVMe, igb). The device is removed from the host driver's device list and its interrupts are disabled at the host LAPIC/GIC level.

  2. Create IOAS and map guest RAM: QEMU issues IOMMU_IOAS_ALLOC to create a fresh IOAS. For each memory region in the VM (KVM_SET_USER_MEMORY_REGION), QEMU also calls IOMMU_IOAS_MAP to create a matching IOVA mapping in the IOAS, using the guest physical address as the IOVA and the host userspace virtual address as the source. The kernel pins the backing pages and maps them into the IOMMU page table. The device can now DMA to guest physical addresses — the IOMMU translates GPA→HPA transparently.

  3. Attach device to IOAS: IOMMU_DEVICE_ATTACH programs the IOMMU context entry for the device, pointing it at the IOAS's IOMMU page table. From this point, the device's DMA is translated by the IOMMU and confined to the guest's mapped regions. Any out-of-range DMA access triggers an IOMMU fault and is logged to the FMA subsystem (Section 19.1).

  4. Wire interrupts via irqbypass: For each MSI-X vector, QEMU calls KVM_IRQFD on the KVM VM fd to register a KVM IRQFD consumer (linking a guest interrupt vector to an eventfd). Then QEMU calls VFIO_DEVICE_SET_IRQS with the same eventfds to register the VFIO producer side. The irqbypass registry links them, and on APICv-capable hardware, programs Posted Interrupt Descriptors.

  5. Map MMIO regions into guest address space: QEMU calls VFIO_DEVICE_GET_REGION_INFO to find which BARs support mmap. Mappable BARs are mmap-ed from the VFIO device fd (obtaining a userspace VA mapping of the device's MMIO). QEMU then calls KVM_SET_USER_MEMORY_REGION with the KVM_MEM_READONLY flag cleared to place this MMIO mapping at the guest's BAR address. The guest now accesses the BAR via EPT/NPT without VMEXIT.

  6. Non-mappable MMIO (Config space, registers that require emulation): These generate a KVM_EXIT_MMIO VMEXIT. QEMU handles it by calling pread/pwrite on the VFIO device fd at the region's fd offset. This path is inherently slower but only applies to infrequent configuration accesses.

IOMMU_IOAS_COPY for live migration: During VM live migration (Section 17.1), the destination kernel issues IOMMU_IOAS_COPY to clone the source VM's IOAS mappings. The device is detached from the source IOAS and re-attached to the destination IOAS atomically. In-flight DMA at the moment of device detach is drained (IOTLB invalidation with completion wait) before the device is released. The guest's DMA window is thus preserved across migration without the destination kernel needing to replay all IOMMU_IOAS_MAP calls from scratch.
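A schematic sketch of that handoff ordering, with the iommufd operations reduced to stubs that record their call order (names and structure here are illustrative only); the point is that the destination IOAS must be fully populated before the device is re-attached, so no window exists where the device is attached but its IOVA map is incomplete:

```rust
/// Hypothetical stubs standing in for the real iommufd operations; each
/// records its name so the required ordering can be checked.
pub struct MigrationLog(pub Vec<&'static str>);

impl MigrationLog {
    pub fn new() -> Self { MigrationLog(Vec::new()) }

    fn ioas_copy(&mut self) { self.0.push("IOMMU_IOAS_COPY"); }
    fn detach_with_drain(&mut self) {
        // IOTLB invalidation with completion wait happens inside detach.
        self.0.push("IOMMU_DEVICE_DETACH(drain)");
    }
    fn attach(&mut self) { self.0.push("IOMMU_DEVICE_ATTACH"); }

    /// Migration handoff for a passthrough device.
    pub fn migrate_device(&mut self) {
        self.ioas_copy();          // 1. clone source IOAS mappings into the destination IOAS
        self.detach_with_drain();  // 2. quiesce in-flight DMA, detach from the source IOAS
        self.attach();             // 3. attach to the already-populated destination IOAS
    }
}
```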

17.3.6 SR-IOV and VF Passthrough

SR-IOV (Single Root I/O Virtualization) allows a single PCIe Physical Function (PF) to present multiple Virtual Functions (VFs). VFs appear as independent PCIe devices with their own config space, BARs, and MSI-X vectors, but share the underlying hardware resources managed by the PF.

Each VF has its own IOMMU group (ACS ensures the VF is isolated from the PF and from other VFs at the PCIe bus level), so VFs can be passed through individually without binding the PF to VFIO. This is the standard mechanism for NIC and NVMe passthrough in cloud environments: the host retains control of the PF (and the physical link), while individual VFs are assigned to guest VMs.

UmkaOS's VFIO implementation supports VF passthrough with the same ioctl interface as full-device passthrough. The num_regions and num_irqs reported by VFIO_DEVICE_GET_INFO reflect the VF's resources, not the PF's.
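The VF's separate identity on the bus (and hence its separate IOMMU group) follows from PCIe Routing ID arithmetic: the PF's SR-IOV capability advertises First VF Offset and VF Stride, and each VF's Routing ID is derived from the PF's. A sketch of that arithmetic (helper names are illustrative; the formula itself is the SR-IOV one):

```rust
/// PCIe Routing ID: bus/device/function packed as bus<<8 | dev<<3 | func.
pub fn rid(bus: u16, dev: u16, func: u16) -> u16 {
    (bus << 8) | (dev << 3) | func
}

/// Routing ID of VF n (1-based, per the SR-IOV capability's convention):
/// VF RID = PF RID + FirstVFOffset + (n - 1) * VFStride.
/// first_vf_offset and vf_stride are read from the PF's SR-IOV capability.
pub fn vf_rid(pf_rid: u16, first_vf_offset: u16, vf_stride: u16, n: u16) -> u16 {
    pf_rid + first_vf_offset + (n - 1) * vf_stride
}
```

Because each VF presents a distinct Routing ID, the IOMMU can give it its own context/stream table entry, which is what lets VFs attach to different IOASes than the PF.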

17.3.7 Security Model

Device passthrough grants the guest direct hardware access. The security model must prevent privilege escalation back to the host:

  • Capability requirement: Opening a VFIO device fd requires Capability::SysAdmin (Section 8.1). This requirement applies to the process that opens /dev/vfio/devices/vfioX. In practice, only the VMM process (QEMU) holds this capability; it is typically run with a minimal privilege set (cap_sys_admin only, no network or filesystem capabilities beyond what the VM needs).

  • IOMMU mandatory: When a device is attached to VFIO, the kernel verifies that a functioning IOMMU domain can be created for it. If the system has no IOMMU, or if the device is not covered by the IOMMU (e.g., behind a legacy ISA bridge), the attach call fails with ENODEV. The sole exception is the iommu_off=dangerous boot parameter, which enables a no-IOMMU passthrough mode for development use only; a prominent boot warning and a WDIOF_OVERHEAT-equivalent flag in the device info struct mark the system as operating outside the security envelope.

  • DMA containment: The IOMMU page table is populated only with the guest's memory regions. Any device DMA that targets an address outside the IOAS mapping is blocked by the IOMMU and generates an IOMMU fault. The fault is logged via the FMA subsystem and, by default, triggers device isolation (the device is isolated into a fault domain and the VMM is notified via an error eventfd).

  • Config space access control: The raw PCIe config space is not fully exposed to userspace. The VFIO PCI driver intercepts pread/pwrite on the config space region. Writes to the Bus Master Enable bit, Interrupt Disable bit, and PCIe capability registers are validated or silently dropped where they could affect the host PCI topology. The Memory Space Enable and I/O Space Enable bits are allowed through (they gate BAR access and are necessary for device operation).

  • Reset on release: When the VFIO device fd is closed, the kernel performs an FLR (Function-Level Reset) if the device supports it, or a bus reset if not. This clears any DMA-capable state in the device (pending DMA descriptors, MSI-X configuration) before the device is re-bound to the host driver or left quiesced.
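The Command-register portion of the config space filtering above can be sketched as a pure merge function (the bit positions are the standard PCI Command register bits; the helper name and the exact writable set are illustrative):

```rust
pub const PCI_COMMAND_IO: u16 = 1 << 0;             // I/O Space Enable
pub const PCI_COMMAND_MEMORY: u16 = 1 << 1;         // Memory Space Enable
pub const PCI_COMMAND_MASTER: u16 = 1 << 2;         // Bus Master Enable
pub const PCI_COMMAND_INTX_DISABLE: u16 = 1 << 10;  // Interrupt Disable

/// Bits the guest may flip directly (they gate BAR access).
const GUEST_WRITABLE: u16 = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
/// Bits the kernel owns: guest writes to these are silently dropped.
const HOST_OWNED: u16 = PCI_COMMAND_MASTER | PCI_COMMAND_INTX_DISABLE;

/// Merge a guest write to the Command register (config offset 0x04) with
/// the current hardware value: guest-writable bits come from the write,
/// host-owned bits keep their current hardware value.
pub fn filter_command_write(hw_current: u16, guest_write: u16) -> u16 {
    let kept = hw_current & HOST_OWNED;
    let granted = guest_write & GUEST_WRITABLE;
    let untouched = hw_current & !(HOST_OWNED | GUEST_WRITABLE);
    kept | granted | untouched
}
```

The same pattern (a per-register writable mask applied on the pwrite path) extends to the PCIe capability registers mentioned above.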

17.3.8 Integration with UmkaOS IOMMU (Section 10.4)

The iommufd layer is built on top of the IOMMU primitives defined in Section 10.4. The correspondence is:

iommufd concept Section 10.4 primitive Notes
IoAddrSpace IommuDomain IOAS wraps an IommuDomain with userspace-facing state (mapping BTree, attached device count, valid IOVA ranges).
HwPagetable IommuPgd + context entry HWPT holds a reference to the IommuPgd and the device-side context entry that points to it.
IOMMU_IOAS_MAP iommu_map(domain, iova, paddr, len, prot) iommufd calls into the §10.4 iommu_map primitive after pinning userspace pages.
IOMMU_IOAS_UNMAP iommu_unmap(domain, iova, len) + IOTLB invalidate Unmap also calls iommu_iotlb_sync(domain) to flush TLB entries before releasing page pins.
IOMMU_DEVICE_ATTACH iommu_attach_device(domain, dev) Programs the IOMMU context/stream table entry that points the device at the domain's page table.
IOMMU_DEVICE_DETACH iommu_detach_device(domain, dev) Removes the context entry and drains in-flight DMA (issues IOTLB invalidation with drain completion wait).

The §10.4 layer handles all architecture-specific IOMMU programming (VT-d context tables and SLPT on x86-64; SMMU stream table and stage-2 tables on ARM64; PCIe PASID tables for Shared Virtual Addressing). iommufd is arch-neutral: it builds the IOVA→HPA mapping in the arch-agnostic IommuDomain and relies on §10.4 to push it to hardware.
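A schematic of that layering, with the §10.4 side reduced to a mock that records what would be pushed to hardware (all names here are illustrative stand-ins, not the real §10.4 API): the IOAS keeps the arch-neutral mapping BTree and rejects overlaps before delegating to the domain, mirroring the IOMMU_IOAS_MAP row in the table above.

```rust
use std::collections::BTreeMap;

/// Stand-in for the §10.4 IommuDomain: the real one programs VT-d/SMMU
/// page tables; this mock just records the mappings it would install.
#[derive(Default)]
pub struct MockDomain {
    pub hw_mappings: Vec<(u64, u64, u64)>, // (iova, paddr, len)
}

impl MockDomain {
    /// Stand-in for the §10.4 iommu_map primitive.
    pub fn iommu_map(&mut self, iova: u64, paddr: u64, len: u64) {
        self.hw_mappings.push((iova, paddr, len));
    }
}

/// Userspace-facing IOAS state layered over the domain: a BTree of
/// mappings keyed by IOVA, per the correspondence table above.
#[derive(Default)]
pub struct IoAddrSpace {
    pub domain: MockDomain,
    pub mappings: BTreeMap<u64, (u64, u64)>, // iova -> (paddr, len)
}

impl IoAddrSpace {
    /// IOMMU_IOAS_MAP path: overlap-check in the arch-neutral BTree, then
    /// delegate to the domain (page pinning elided in this sketch).
    pub fn ioas_map(&mut self, iova: u64, paddr: u64, len: u64) -> Result<(), &'static str> {
        // Check against the nearest mapping at or below the new IOVA...
        if let Some((&prev_iova, &(_, prev_len))) = self.mappings.range(..=iova).next_back() {
            if prev_iova + prev_len > iova {
                return Err("EEXIST: overlaps previous mapping");
            }
        }
        // ...and against the nearest mapping above it.
        if let Some((&next_iova, _)) = self.mappings.range(iova..).next() {
            if iova + len > next_iova {
                return Err("EEXIST: overlaps next mapping");
            }
        }
        self.domain.iommu_map(iova, paddr, len);
        self.mappings.insert(iova, (paddr, len));
        Ok(())
    }
}
```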