Chapter 18: Virtualization¶
KVM host/guest integration, VMX/VHE/H-ext, live migration, PV features, suspend/resume.
KVM host support scope: umka-kvm targets x86-64 (VMX/SVM), AArch64 (VHE/nVHE), RISC-V (H-extension), PPC64LE (KVM-HV), and LoongArch64 (LVZ) as hypervisor hosts. ARMv7 and PPC32 are intentionally out of scope for KVM host mode: 32-bit hypervisor workloads are niche, and UmkaOS's target use cases (cloud, server, edge) use 64-bit hosts exclusively. In-kernel PAPR (IBM's PowerVM ABI) emulation is likewise outside UmkaOS's compatibility scope; on PPC64LE, PAPR hypercalls exit to userspace (KVM_EXIT_PAPR_HCALL). Both 32-bit platforms continue to support UmkaOS as a KVM guest inside a compatible hypervisor.
UmkaOS provides KVM host support on x86-64 (VMX/SVM), AArch64 (VHE/nVHE), RISC-V (H-extension), PPC64LE (KVM-HV), and LoongArch64 (LVZ). VFIO and IOMMUFD provide device passthrough with IOMMU isolation. Live migration, suspend/resume, and paravirtual features are specified. ARMv7 and PPC32 are supported as KVM guests only — not as hypervisor hosts.
18.1 Host and Guest Integration¶
How UmkaOS behaves as a VM host (via umka-kvm) and as a guest kernel running inside a hypervisor. This section covers virtio device negotiation, paravirtual optimizations, vhost data plane, and live migration.
Guest Mode — Virtio Device Negotiation
When UmkaOS runs as a guest kernel, it discovers virtio devices via PCI or MMIO transport and negotiates feature bits with the hypervisor. The virtio drivers (virtio-blk, virtio-net, virtio-gpu, virtio-console — already listed as Priority 1 in Section 11.4) implement the standard virtio 1.2 specification (approved as an OASIS Committee Specification in July 2022), with forward-compatible support for virtio 1.3 features as that draft is finalized.
Guest Mode — Paravirtual Clock
Hardware RDTSC inside a VM can be inaccurate (the TSC may not be invariant, or vmexit
overhead distorts time). Paravirtual clock avoids this:
- KVM pvclock / kvmclock: the hypervisor maps a shared memory page containing clock parameters (scale, offset, version). The guest reads time from this page — no vmexit required. UmkaOS's clocksource subsystem auto-detects and prefers pvclock when running as a KVM guest.
- Hyper-V TSC page: equivalent mechanism for Hyper-V hosts. Same principle — shared memory page, no hypercall for time reads.
- Fallback: if neither paravirt clock is available, UmkaOS uses the ACPI PM timer (slow but always accurate) or PIT (ancient but universal).
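The pvclock read path can be sketched as follows. This is a minimal model under stated assumptions: the page layout is simplified (the real pvclock ABI adds a flags field and padding), and a real guest must use volatile reads with compiler barriers around the version check, which the plain copy below stands in for.

```rust
/// Simplified stand-in for the pvclock shared page (illustrative layout).
#[derive(Clone, Copy)]
struct PvclockPage {
    version: u32,           // odd while the hypervisor is updating the page
    tsc_timestamp: u64,     // TSC value at the moment system_time was captured
    system_time: u64,       // guest time in nanoseconds at tsc_timestamp
    tsc_to_system_mul: u32, // 32.32 fixed-point TSC-to-ns multiplier
    tsc_shift: i8,          // pre-multiply shift applied to the TSC delta
}

/// Convert a raw TSC delta to nanoseconds using the pvclock scale factors.
fn pvclock_scale_delta(mut delta: u64, mul: u32, shift: i8) -> u64 {
    if shift >= 0 {
        delta <<= shift as u32;
    } else {
        delta >>= (-shift) as u32;
    }
    ((delta as u128 * mul as u128) >> 32) as u64
}

/// Read guest time without a vmexit: retry while the version is odd or
/// changes mid-read (the hypervisor uses the version field as a seqlock).
fn pvclock_read(page: &PvclockPage, rdtsc: impl Fn() -> u64) -> u64 {
    loop {
        let v1 = page.version;
        let snap = *page; // real code: volatile reads + barrier
        let v2 = page.version;
        if v1 == v2 && v1 & 1 == 0 {
            let delta = rdtsc().wrapping_sub(snap.tsc_timestamp);
            return snap.system_time
                + pvclock_scale_delta(delta, snap.tsc_to_system_mul, snap.tsc_shift);
        }
    }
}
```

With `tsc_to_system_mul = 0x8000_0000` (0.5 in 32.32 fixed point) and `tsc_shift = 1`, the scale factor is exactly 1 ns per TSC tick, which makes the arithmetic easy to check by hand.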
Guest Mode — Balloon Driver
virtio-balloon enables dynamic memory adjustment — the hypervisor can reclaim guest
memory by inflating the balloon (guest returns pages) or release memory by deflating it.
UmkaOS integrates balloon inflation with its memory pressure framework:
- Balloon inflation is treated as memory pressure, triggering the same reclaim path
as physical memory exhaustion (page cache eviction, slab shrinking, swap-out)
- Balloon deflation immediately makes pages available to the buddy allocator
- This unified pressure model means UmkaOS's OOM decisions correctly account for
ballooned-away memory
Guest Mode — PV Spinlocks
Under overcommitted VMs, spinning on a lock held by a descheduled vCPU wastes host
CPU cycles (the spinning vCPU can never acquire the lock until the holder is scheduled).
UmkaOS detects the hypervisor type at boot:
- KVM: the spinning vCPU halts (HLT-based yield) when it detects the lock holder is
descheduled; the lock releaser calls KVM_HC_KICK_CPU to wake the halted waiter
- Hyper-V: uses HvCallNotifyLongSpinWait hypercall — notifies the hypervisor
of a long spin wait, allowing it to schedule the lock holder
- Bare metal: standard spin loops (no overhead when not virtualized)
Post-Yield Backoff
When the hypervisor returns from a VMEXIT yield (indicating another vCPU has
released or is about to release the lock), the acquiring vCPU uses the following
adaptive backoff before re-yielding:
attempt = 0
loop:
try acquire lock (test-and-set)
if acquired: return
if attempt < 6:
# Spin for 2^attempt iterations (1, 2, 4, 8, 16, 32)
spin_hint(1 << attempt) # x86: PAUSE; ARM: YIELD; RISC-V: nop
attempt += 1
else:
# Back to hypervisor yield
pv_kick_yield()
attempt = 0 # reset after yield
Total spin before re-yielding: 1+2+4+8+16+32 = 63 loop iterations (~100-250ns). This avoids hammering the hypervisor with immediate re-yields while still responding quickly when the lock becomes available.
Maximum yield count: after 32 consecutive yields without acquiring the lock,
the vCPU switches to schedule() (voluntary preemption) to allow other vCPUs to
make progress. This prevents a vCPU from monopolizing its pCPU waiting for a lock
held by a vCPU that is not scheduled.
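The backoff and yield-cap policy above can be modeled as a small state machine. This is an illustrative sketch, not the kernel's API: `AcquireStep`, `Backoff`, and the method names are invented here to make the schedule testable.

```rust
/// What the waiter does after each failed test-and-set (illustrative names).
enum AcquireStep {
    Spin(u32),   // spin_hint(1 << attempt) iterations (PAUSE/YIELD/nop)
    Yield,       // pv_kick_yield(): halt until KVM_HC_KICK_CPU wakes us
    Deschedule,  // schedule(): give up the pCPU after 32 fruitless yields
}

struct Backoff {
    attempt: u32, // position on the exponential spin ladder (0..6)
    yields: u32,  // consecutive yields without acquiring the lock
}

impl Backoff {
    fn new() -> Self {
        Backoff { attempt: 0, yields: 0 }
    }

    /// Decide the next action after a failed try-acquire.
    fn next(&mut self) -> AcquireStep {
        if self.attempt < 6 {
            let step = AcquireStep::Spin(1 << self.attempt);
            self.attempt += 1;
            step
        } else if self.yields >= 32 {
            // Yield cap reached: voluntarily deschedule so other vCPUs
            // (including the lock holder) can make progress.
            AcquireStep::Deschedule
        } else {
            self.attempt = 0; // reset the spin ladder after each yield
            self.yields += 1;
            AcquireStep::Yield
        }
    }
}
```

The first six steps spin 1+2+4+8+16+32 = 63 iterations, matching the total stated above, and the 33rd would-be yield becomes a `Deschedule`.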
Guest Mode — Hypervisor-Specific Backends
| Hypervisor | Paravirt Features |
|---|---|
| KVM (primary) | pvclock, PV spinlocks, PV TLB flush, steal time accounting, async PF |
| Hyper-V | Synthetic interrupts, synthetic timer, APIC assist, TSC page, PV spinlocks |
| Xen PV | Future — xenbus, grant tables, PV disk/net (lower priority) |
Guest Mode — Cloud Metadata
Cloud-init and instance metadata (AWS IMDSv2, Azure IMDS, GCP metadata server) are
consumed by userspace agents. The kernel's role is providing transport:
- vsock (virtio-socket) for hypervisor↔guest communication without networking
- virtio-serial for structured host↔guest channels
- Standard networking for HTTP-based metadata endpoints (169.254.169.254)
vhost Kernel Data Plane
vhost moves the virtio data plane into the host kernel, bypassing the VMM (QEMU) for hot-path I/O:
- vhost-net: kernel-side virtio-net processing. Packets move directly between the
guest's virtio ring and the host's tap/macvtap device via kernel. The VMM handles
only control plane (device configuration, feature negotiation). Implemented as a
Tier 1 (with extended hardware privileges) umka-kvm module. KVM requires
CAP_VMX (hardware virtualization support), which grants it the KvmHardwareCapability on top of standard Tier 1 memory-domain isolation; see Section 19.1 for the full classification. This exception is unique and non-proliferating: VMX/SVM instructions must execute directly on the host CPU and cannot be mediated via MMIO, DMA, or ring-buffer IPC like all other device operations. No other driver class has this constraint; all other hardware interactions go through memory-mapped registers or DMA descriptors that the standard Tier 1 isolation boundary can intercept.
- vhost-scsi: kernel-side virtio-scsi processing for direct block device access from guests, bypassing QEMU's I/O path. Guests see near-native block device performance.
- vhost-user: protocol for offloading vhost processing to userspace daemons (DPDK for networking, SPDK for storage). This is handled entirely in userspace by the VMM (e.g., QEMU), which shares guest memory via memfd with the backend daemon. The UmkaOS kernel does not implement vhost-user directly; it simply provides the standard shared memory and unix domain socket primitives required for QEMU to function.
- vhost-vDPA: hardware-accelerated virtio for SmartNICs and DPUs. vDPA (virtio Data Path Acceleration) allows the virtio data plane to be offloaded to hardware while the control plane remains in software. Integration with UmkaOS's SmartNIC architecture (Section 5.11) is planned for Phase 4-5.
- vhost-vsock: host↔guest communication channel using the vsock address family. No networking stack is required; communication uses a simple stream/datagram protocol over shared memory.
VM Live Migration (KVM)
Live migration moves a running VM from one physical host to another with minimal downtime. UmkaOS's umka-kvm implements the full migration pipeline:
- Pre-copy phase: Track dirty pages via Intel PML (Page Modification Logging) or manual dirty bitmap scanning. Umka-kvm reads the PML buffer on a timer interrupt and transmits dirty pages to the destination host.
- Iterative convergence: Multiple pre-copy rounds, each sending pages dirtied since the last round. Configurable maximum downtime target (e.g., 50ms).
- Auto-converge: If the guest's dirty rate exceeds the network transfer rate (migration won't converge), umka-kvm throttles vCPU execution to reduce the dirty rate. This is automatic and transparent to the guest.
- Stop-and-copy: When the remaining dirty set is small enough to transfer within the downtime target, the VM is paused, final dirty pages are sent, and the destination resumes execution.
- Post-copy (optional): The destination VM starts running immediately. Pages not
yet transferred are faulted in on demand via a kernel-internal demand-fault
mechanism (not Linux's
userfaultfd, which is a userspace API). Since umka-kvm runs as a Tier 1 kernel module with extended hardware privileges, it registers a post-copy fault handler directly with the page fault subsystem (Section 4.8). When a guest accesses a not-yet-migrated page, the fault handler requests the page from the source host over the migration channel (TCP or RDMA) and maps it before returning. This is functionally equivalent to QEMU's userfaultfd-based post-copy but operates entirely in kernel space.
Registration API: mm::register_fault_handler(addr_range, handler_fn) —
the page fault subsystem invokes handler_fn for faults in addr_range.
Returns Ok(page) with the resolved page or Err(MigrationFailed) →
SIGBUS. Registered at SwitchToPostCopy; unregistered when all pages
are transferred.
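The registration API described above can be mocked as follows. This is a sketch under stated assumptions: `Page`, `MigrationFailed`, and `FaultRegistry` are illustrative stand-ins for the kernel's actual types, and the real handler fetches the page over the migration channel rather than computing it locally.

```rust
use std::ops::Range;

type Page = u64; // stand-in for a resolved physical page

#[derive(Debug, PartialEq)]
struct MigrationFailed;

/// Mock of mm::register_fault_handler(addr_range, handler_fn).
struct FaultRegistry {
    range: Range<u64>,
    handler: Box<dyn Fn(u64) -> Result<Page, MigrationFailed>>,
}

impl FaultRegistry {
    fn register(
        range: Range<u64>,
        h: impl Fn(u64) -> Result<Page, MigrationFailed> + 'static,
    ) -> Self {
        FaultRegistry { range, handler: Box::new(h) }
    }

    /// Page-fault path: invoke the handler for in-range faults.
    /// None = fault is outside the registered range (normal handling);
    /// Some(Err(_)) maps to SIGBUS for the faulting vCPU thread.
    fn fault(&self, addr: u64) -> Option<Result<Page, MigrationFailed>> {
        if self.range.contains(&addr) {
            Some((self.handler)(addr))
        } else {
            None
        }
    }
}
```

A handler registered at SwitchToPostCopy would resolve faults in the guest-memory range until the last page arrives, then unregister.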
Convergence Policy and Auto-Convergence
UmkaOS's migration controller owns the convergence decision. The wire protocol is QEMU-compatible for interoperability, but the policy for when and how to converge is UmkaOS's internal design ("UmkaOS inside").
Convergence threshold: Pre-copy is considered converged when:
remaining_dirty_pages <= convergence_threshold
where convergence_threshold = initial_dirty_pages * 0.02 (2% of the initial
working set). When this threshold is met, the controller proceeds directly to
stop-and-copy regardless of which round it is.
Dirty-rate tracking: At the end of each pre-copy round the controller computes:
dirty_rate_pages_per_sec =
pages_dirtied_this_round / round_duration_secs;
transfer_rate_pages_per_sec =
bytes_transferred_this_round / PAGE_SIZE / round_duration_secs;
// Transfer must exceed dirty rate by at least 10% margin.
is_converging = dirty_rate < transfer_rate * 0.9;
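Both checks are sketched below as plain functions; the function names are illustrative, but the 2% threshold and 10% margin are the values given in the text.

```rust
const PAGE_SIZE: u64 = 4096;

/// Per-round rate check: transfer must exceed the dirty rate by >= 10%.
fn is_converging(pages_dirtied: u64, bytes_transferred: u64, round_secs: f64) -> bool {
    let dirty_rate = pages_dirtied as f64 / round_secs;
    let transfer_rate = (bytes_transferred / PAGE_SIZE) as f64 / round_secs;
    dirty_rate < transfer_rate * 0.9
}

/// Convergence threshold: remaining dirty set <= 2% of the initial set.
fn converged(remaining_dirty_pages: u64, initial_dirty_pages: u64) -> bool {
    remaining_dirty_pages as f64 <= initial_dirty_pages as f64 * 0.02
}
```

For example, a round that dirties 1,000 pages while transferring 10,000 pages' worth of bytes is converging; a round that dirties 10,000 pages against the same transfer rate is not (10,000 is not below the 9,000-page margin).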
Auto-converge trigger: If pre-copy has NOT converged after max_precopy_rounds
= 30 rounds, OR if is_converging is false for 3 consecutive rounds, the
controller begins auto-convergence using the following action type:
pub enum ConvergenceAction {
/// Throttle vCPU execution to reduce the dirty rate.
/// `throttle_pct` is the percentage reduction applied to the vCPU time
/// slice. Increased by 10% each non-converging round; maximum 80%.
///
/// **Mechanism**: The KVM vCPU thread's scheduler parameters are adjusted
/// to reduce its CPU allocation. Specifically, the vCPU thread is placed
/// in a temporary cgroup with `cpu.max` set to `(100 - throttle_pct)%`
/// of its original quota. This causes the scheduler to preempt the vCPU
/// thread more frequently, reducing the dirty page generation rate.
/// On migration completion (success or cancellation), the cgroup quota
/// is restored to its original value.
ThrottleVcpu { throttle_pct: u8 },
/// Switch to post-copy mode: resume the VM at the destination and fetch
/// remaining pages on demand (guest page fault → source request →
/// transfer → resume).
SwitchToPostCopy,
}
Auto-converge sequence:
- Rounds 1–10: pure pre-copy, no throttling.
- Round 11+: if not converging, apply ThrottleVcpu starting at 10%, increasing by 10% per non-converging round up to a maximum of 80%.
- If throttle reaches 80% and migration still has not converged after 5 further rounds: issue SwitchToPostCopy. Post-copy always terminates because pages are fetched on demand and the VM is already live at the destination.
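The escalation sequence can be sketched as a small controller. This models only the policy (which action fires on which round); the cgroup `cpu.max` throttle mechanism itself is not modeled, and the `None` variant and struct names are invented here for testability.

```rust
/// Mirrors the ConvergenceAction enum above, plus a None for untriggered rounds.
#[derive(Debug, PartialEq)]
enum ConvergenceAction {
    None,
    ThrottleVcpu { throttle_pct: u8 },
    SwitchToPostCopy,
}

struct AutoConverge {
    round: u32,         // completed pre-copy rounds
    throttle_pct: u8,   // current vCPU throttle (0 = none)
    rounds_at_max: u32, // rounds spent saturated at 80%
}

impl AutoConverge {
    fn new() -> Self {
        AutoConverge { round: 0, throttle_pct: 0, rounds_at_max: 0 }
    }

    /// Called at the end of each NON-converging pre-copy round.
    fn on_non_converging_round(&mut self) -> ConvergenceAction {
        self.round += 1;
        if self.round <= 10 {
            ConvergenceAction::None // rounds 1-10: pure pre-copy
        } else if self.throttle_pct < 80 {
            self.throttle_pct = (self.throttle_pct + 10).min(80);
            ConvergenceAction::ThrottleVcpu { throttle_pct: self.throttle_pct }
        } else if self.rounds_at_max < 5 {
            self.rounds_at_max += 1;
            ConvergenceAction::ThrottleVcpu { throttle_pct: 80 }
        } else {
            ConvergenceAction::SwitchToPostCopy
        }
    }
}
```

Under this policy, the throttle reaches 80% at round 18, holds for five more non-converging rounds, and round 24 triggers the post-copy switch.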
Post-copy failure mitigations: Post-copy fails catastrophically if the source host dies before all pages have been delivered. UmkaOS applies three mitigations:
- The source host is kept alive (vCPUs suspended, not destroyed) until the destination's post-copy fault handler confirms every referenced page has been received.
- If the source fails mid-post-copy: the destination VM is sent SIGKILL and the migration is declared failed. The VM cannot continue safely with unreachable pages.
- Optionally: a pre-copy checkpoint snapshot is taken before switching to post-copy. If post-copy then fails, the operator can restart from the checkpoint rather than from scratch.
VFIO/passthrough constraint: Post-copy live migration is disabled when the VM has VFIO passthrough devices attached. The reason: post-copy allows the guest to run on the destination before all pages are transferred; if a passthrough device DMAs to a page that hasn't been migrated yet (still on the source), the IOMMU on the destination raises an unrecoverable fault (the physical address is not mapped in the destination's IOMMU domain). UmkaOS detects passthrough devices at migration-start time and automatically switches to pre-copy with auto-converge when any VfioDevice is attached to the VM. Pre-copy ensures all dirty pages are transferred before the final stop-and-copy phase, preventing any DMA to unmigrated pages.
VFIO hot-attach during post-copy: During active post-copy migration, VFIO device attachment is blocked: vfio_attach_device() returns EBUSY when the VM is in PostCopyState::Active. This prevents a device from DMAing to pages that have not been migrated yet. Device attachment is permitted after post-copy completes (PostCopyState::Complete).
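Both gates reduce to simple state checks, sketched below. `Errno` and `MigrationMode` are illustrative stand-ins for the kernel's actual types; `PostCopyState` mirrors the states named in the text.

```rust
#[derive(Debug, PartialEq)]
enum PostCopyState { Inactive, Active, Complete }

#[derive(Debug, PartialEq)]
enum Errno { EBusy }

#[derive(Debug, PartialEq)]
enum MigrationMode { PreCopyWithAutoConverge, PostCopyAllowed }

/// Migration-start gate: any attached passthrough device forces pre-copy,
/// because it could DMA to a page still resident on the source host.
fn select_migration_mode(vfio_devices_attached: usize) -> MigrationMode {
    if vfio_devices_attached > 0 {
        MigrationMode::PreCopyWithAutoConverge
    } else {
        MigrationMode::PostCopyAllowed
    }
}

/// Hot-attach gate: block VFIO attach while post-copy is in flight.
fn vfio_attach_device(state: &PostCopyState) -> Result<(), Errno> {
    if *state == PostCopyState::Active {
        return Err(Errno::EBusy);
    }
    Ok(())
}
```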
Guest-side migration support — When UmkaOS runs as a guest:
- PV migration notifier: the guest receives a pre-migration hint via virtio, allowing it to flush caches, pause background I/O, and prepare for the brief freeze
- Post-migration re-enumeration: the guest re-enumerates PCI topology (in case of heterogeneous migration to different hardware), re-calibrates pvclock, and resumes I/O
- Confidential VM migration: handled by the TEE framework (Section 9.7)
Host Mode — Cloud Orchestration
UmkaOS provides /dev/kvm and the associated ioctl interface, making it compatible with
the standard KVM ecosystem:
- libvirt: standard virtualization management library. Works unmodified — it talks
to /dev/kvm via standard ioctls.
- OpenStack Nova: compute driver talks to libvirt, libvirt talks to /dev/kvm.
UmkaOS is transparent to the orchestration layer.
- QEMU and Firecracker: both use /dev/kvm directly. Both work unmodified on UmkaOS.
18.1.1 KVM Host-Side Implementation¶
This section specifies the hypervisor role: what umka-kvm does as a host to
create, configure, and run virtual machines. umka-kvm runs as a Tier 1 driver with extended hardware privileges
(see Section 19.1 for the isolation model, CAP_VMX rationale, and VMX
trampoline design). The SlatHooks trait (SLAT page table management callbacks),
EPT violation handling path, dirty page tracking, and memory overcommit behavior
are specified in Section 19.1 (search for SlatHooks);
this section covers the remaining host-side subsystems.
18.1.1.1 /dev/kvm Ioctl Interface¶
umka-kvm exposes the standard Linux KVM ioctl interface so that unmodified QEMU, Firecracker, Cloud Hypervisor, and crosvm work without changes. The interface is organized into three ioctl scopes:
System ioctls (on /dev/kvm file descriptor). Ioctl numbers use Linux's
standard encoding: _IO(KVMIO, nr) where KVMIO = 0xAE. The nr column
shows the number field; the actual ioctl constant includes direction and size
bits per the _IO/_IOR/_IOW/_IOWR macros.
| Ioctl | nr | Description |
|---|---|---|
| KVM_GET_API_VERSION | 0x00 | Returns KVM_API_VERSION (12). Userspace checks this first. |
| KVM_CREATE_VM | 0x01 | Allocate a new Vm struct, return VM file descriptor. |
| KVM_GET_MSR_INDEX_LIST | 0x02 | Returns list of MSRs that KVM_GET_MSRS/KVM_SET_MSRS can access. |
| KVM_CHECK_EXTENSION | 0x03 | Query capability support (EPT, PML, posted interrupts, etc.). |
| KVM_GET_VCPU_MMAP_SIZE | 0x04 | Returns size of the kvm_run shared page (one page per vCPU). |
| KVM_GET_SUPPORTED_CPUID | 0x05 | Returns filtered CPUID values reflecting host capabilities. |
The VM file descriptor returned by KVM_CREATE_VM has O_CLOEXEC set by default.
This matches the usage pattern of QEMU and Firecracker, which never intend KVM fds to
leak across execve. KVM VM fds are reference-counted; closing a duplicate fd (e.g.,
in a forked child) does not affect the VM's lifecycle as long as another fd reference
exists. Userspace that genuinely needs to pass a VM fd across execve must explicitly
clear the flag with fcntl(fd, F_SETFD, 0).
VM ioctls (on VM file descriptor):
| Ioctl | nr | Description |
|---|---|---|
| KVM_CREATE_VCPU | 0x41 | Allocate a Vcpu struct, return vCPU file descriptor. |
| KVM_GET_DIRTY_LOG | 0x42 | Read and reset per-slot dirty bitmap (for live migration). |
| KVM_SET_USER_MEMORY_REGION | 0x46 | Add/modify/delete a memory slot mapping guest physical → host virtual. |
| KVM_SET_TSS_ADDR | 0x47 | Set guest TSS address (x86 specific, required by QEMU). |
| KVM_SET_IDENTITY_MAP_ADDR | 0x48 | Set identity-mapped page table region for real-mode emulation. |
| KVM_CREATE_IRQCHIP | 0x60 | Create in-kernel interrupt controller (LAPIC + IOAPIC on x86). |
| KVM_IRQFD | 0x76 | Associate an eventfd with a guest IRQ for direct injection. |
| KVM_CREATE_PIT2 | 0x77 | Create in-kernel PIT (8254 timer emulation). |
| KVM_IOEVENTFD | 0x79 | Trigger an eventfd on guest I/O to a specified port/MMIO address. |
| KVM_SET_CLOCK | 0x7B | Set/get VM-wide kvmclock parameters. |
| KVM_MEMORY_ENCRYPT_OP | 0xBA | Confidential VM operations (SEV-SNP/TDX, see Section 9.7). |
| KVM_CLEAR_DIRTY_LOG | 0xC0 | Granular dirty bitmap clear (avoids resetting entire slot). |
vCPU ioctls (on vCPU file descriptor):
| Ioctl | nr | Description |
|---|---|---|
| KVM_RUN | 0x80 | Enter VMX non-root / VHE EL1 / VS-mode. Blocks until a VM exit needs userspace. |
| KVM_GET_REGS / KVM_SET_REGS | 0x81/0x82 | Read/write guest general-purpose registers. |
| KVM_GET_SREGS / KVM_SET_SREGS | 0x83/0x84 | Read/write guest segment registers, CR0/CR3/CR4, EFER, IDT, GDT. |
| KVM_TRANSLATE | 0x85 | Walk guest page tables to translate guest virtual → guest physical. |
| KVM_INTERRUPT | 0x86 | Inject an external interrupt into the guest. |
| KVM_GET_MSRS / KVM_SET_MSRS | 0x88/0x89 | Read/write guest MSRs. |
| KVM_SET_SIGNAL_MASK | 0x8B | Set signal mask for the vCPU thread during KVM_RUN. |
| KVM_GET_FPU / KVM_SET_FPU | 0x8C/0x8D | Read/write guest FPU/SSE/AVX state. |
| KVM_GET_LAPIC / KVM_SET_LAPIC | 0x8E/0x8F | Read/write guest Local APIC state. |
| KVM_SET_CPUID2 | 0x90 | Configure CPUID values exposed to the guest. |
| KVM_NMI | 0x9A | Inject an NMI into the guest. |
| KVM_GET_VCPU_EVENTS / KVM_SET_VCPU_EVENTS | 0x9F/0xA0 | Exception/interrupt/NMI injection state. |
| KVM_GET_XSAVE / KVM_SET_XSAVE | 0xA4/0xA5 | Read/write guest XSAVE state (AVX-512, AMX, etc.). |
Ioctl dispatch: Each ioctl handler runs in umka-kvm's isolation domain. The
/dev/kvm character device is registered via umka-core's device subsystem. When
userspace calls ioctl(fd, KVM_RUN, ...), the syscall layer resolves the file
descriptor to the Vcpu struct, switches into umka-kvm's domain, and invokes the
KVM_RUN handler — which transitions to the VMX trampoline in umka-core's domain
for the actual VM entry (see Section 19.1).
18.1.1.2 CPU Errata Integration in KVM¶
L1TF (L1 Terminal Fault) mitigation (x86-64): On CPUs affected by L1TF
(X86Errata::L1TF), the KVM_RUN handler flushes the L1 data cache before every
VM entry. This prevents a guest from speculatively reading host L1D contents via
non-present page table entries. The flush is conditional:
1. Skip if the CPU has IA32_ARCH_CAPABILITIES[SKIP_L1DFL_VMENTRY] set (hardware fix).
2. Otherwise: WRMSR IA32_FLUSH_CMD, 1 (microcode-provided L1D flush).
3. PTE inversion: non-present PTEs are bitwise inverted so that speculative
translation produces an address in the unmapped I/O hole rather than a valid
physical address. PTE inversion is applied in the VMM's PTE set/clear operations
and is transparent to the rest of the memory subsystem.
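The PTE inversion step can be sketched as a pair of bit operations. This is an illustrative model: the mask below covers the x86-64 physical-address bits (12..=51) as an assumption, and the helper names are invented here; the real kernel applies the inversion inside the VMM's PTE set/clear operations.

```rust
const PTE_PRESENT: u64 = 1 << 0;
// Physical-address bits of a 4 KiB PTE on x86-64 (illustrative mask).
const PTE_PFN_MASK: u64 = 0x000f_ffff_ffff_f000;

/// Make a PTE non-present with inverted address bits (L1TF mitigation):
/// a speculative walk through this entry resolves into the unmapped I/O
/// hole instead of real host memory.
fn pte_mknonpresent(pte: u64) -> u64 {
    (pte ^ PTE_PFN_MASK) & !PTE_PRESENT
}

/// Recover the original PTE: inverting the address bits twice is the
/// identity, so no extra bookkeeping is needed.
fn pte_mkpresent(pte: u64) -> u64 {
    (pte ^ PTE_PFN_MASK) | PTE_PRESENT
}
```

The round-trip property (invert, then invert again) is what makes the scheme transparent to the rest of the memory subsystem: non-address flag bits are untouched, and the stored non-present entry never contains a valid physical frame number.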
XFD (Extended Feature Disable) separation for AMX (x86-64, SPR+): Host and guest
maintain separate XFD values because XFD controls which XSAVE components trigger #NM
faults. On VM entry, the host's XFD value is saved and the guest's XFD value is loaded
into IA32_XFD. On VM exit, the reverse occurs. The XFD MSR is NOT part of the
automatic VMCS save/restore area — UmkaOS manages it manually. Failure to separate
XFD values causes AMX TILEDATA corruption when the host uses AMX while a guest also
has AMX enabled (errata X86Errata::AMX_XFD_KVM). Additionally, the faulted XRSTOR
must be re-executed before any XSAVE of tile state to avoid partial state capture
(X86Errata::SPR_TILEDATA).
VMX preemption timer clamp (SPR): On Sapphire Rapids CPUs
(X86Errata::VMX_PREEMPT_SPR), the VMX preemption timer may misfire when programmed
with a value of 1. UmkaOS clamps the preemption timer to max(2, requested_value).
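The clamp itself is a one-liner, shown here gated on the erratum flag as a boolean stand-in for the X86Errata check:

```rust
/// Clamp the VMX preemption timer per X86Errata::VMX_PREEMPT_SPR: a
/// programmed value of 1 may misfire on Sapphire Rapids, so the minimum
/// programmed value is 2 on affected parts. Unaffected CPUs pass through.
fn clamp_vmx_preempt_timer(requested: u32, errata_vmx_preempt_spr: bool) -> u32 {
    if errata_vmx_preempt_spr {
        requested.max(2)
    } else {
        requested
    }
}
```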
SEV-SNP page state change cache flush (AMD): Before a page transitions between
shared and private state (KVM_MEMORY_ENCRYPT_OP), the kernel must flush the cache
for those pages using WBINVD or CLFLUSH per cacheline. Without this flush, stale
shared-state cache lines may be returned to the guest after the page becomes private,
leaking host data (X86Errata::SEV_SNP_CACHE). AMD microcode minimum version
enforcement (Section 2.18) gates the SNP activation path — if
microcode is below the minimum safe version, SNP VM creation is refused.
AVIC IPI workaround (AMD Zen 1/2): X86Errata::AVIC_IPI_ZEN12 — on these CPUs,
AVIC (AMD Virtual Interrupt Controller) may miss IPI wakeup events. UmkaOS disables
AVIC-accelerated IPI delivery on affected steppings and falls back to MMIO-trapped
IPI injection. Nested AVIC field sanitization is applied on all AMD CPUs
(X86Errata::SVM_AVIC_BYPASS, CVE-2021-3653).
PKRU handling in KVM: PKRU is swapped on every VM entry/exit as part of the
vm_enter_and_exit() sequence (called inside the vcpu_run() loop).
On VM entry: save host PKRU to vcpu.host_pkru, load guest PKRU from vcpu.guest_pkru.
On VM exit: save guest PKRU to vcpu.guest_pkru, restore host PKRU from vcpu.host_pkru,
then update CpuLocalBlock.pkru_shadow to reflect the restored host PKRU value.
This shadow update is critical: without it, the next switch_domain() call
(Section 11.2) would
compare against a stale shadow (still holding the pre-VM-entry value), potentially
eliding a required WRPKRU and leaving the CPU in the wrong isolation domain.
The swap is unconditional — it occurs regardless of whether the guest has CR4.PKE set,
because a guest could enable CR4.PKE at any time without a VM exit.
// VM exit PKRU restore (x86-64 KVM, outside the fast-path loop)
vcpu.guest_pkru = rdpkru();
wrpkru(vcpu.host_pkru);
// CRITICAL: sync the per-CPU shadow so switch_domain() elision is correct.
CpuLocal::get_mut().pkru_shadow = vcpu.host_pkru;
AArch64 VHE mandate for speculative AT: On Cortex-A55 (erratum 1530923), Cortex-A76
(erratum 1165522), and Cortex-A510 (erratum 2077057), speculative address translation
operations can corrupt TLB entries across EL1/EL2 boundaries. On these cores, KVM
must run in VHE (Virtualization Host Extensions) mode — non-VHE operation is not
safe. On Cortex-A57/A72 (erratum 1319367/1319537), where VHE is not available, the
mitigation is to insert an ISB after every AT instruction and to use TLBI VMALLE1IS
on guest exit. UmkaOS refuses to start KVM in non-VHE mode on cores where
Aarch64Errata::A55_1530923 or A76_1165522 is set.
AArch64 SPSR_EL2 sanitization (A510): Aarch64Errata::A510_2077057 — on early
Cortex-A510, PAC trap exceptions corrupt SPSR_EL2. KVM must sanitize SPSR_EL2
after guest exit to prevent trap-based guest→host escalation.
AArch64 ThunderX2 TTBR trap (Marvell): Aarch64Errata::THUNDERX2_TTBR_TRAP —
On ThunderX2 SMT cores, guest writes to TTBR0_EL1/TTBR1_EL1 can corrupt the
sibling thread's TLB state due to incomplete hardware isolation of the page table
walker between SMT threads. Workaround: trap guest TTBR writes (HCR_EL2.TVM = 1)
and emulate them in the KVM handler with an explicit TLBI VMALLE1IS on both SMT
threads before returning. This adds ~100-200 cycles per guest TTBR write (infrequent —
only on mmap/munmap/exec).
AArch64 ThunderX2 GIC Group 1 re-enable: On KVM exit from ThunderX2 guests, GIC
Group 1 interrupts may be left disabled due to a race in the GICv3 save/restore path.
The host's KVM exit handler must unconditionally re-enable Group 1 via
MSR ICC_IGRPEN1_EL1, #1 after restoring the host GIC state.
AArch64 AmpereOne HCR_EL2 ordering: Aarch64Errata::AMPEREONE_HCR_ORDERING —
On AmpereOne cores, writes to HCR_EL2 (Hypervisor Configuration Register) are not
immediately visible to subsequent EL2 instructions. A DSB ISH + ISB sequence is
required after every HCR_EL2 write before executing ERET to the guest or performing
any trapped register access. Without this barrier, the guest may execute under the old
HCR_EL2 configuration, bypassing trap settings and potentially accessing host resources.
The barrier is inserted in the KVM world-switch path for all AmpereOne cores.
AArch64 POE KVM trap routing (D22677): KVM EL2 support for S1PIE/S1POE depends
on ARM architecture correction D22677. The CPTR_EL2.E0POE bit controls whether
guest POR_EL0 (Permission Overlay Register) accesses trap to EL2. Incorrect
configuration causes either spurious traps (breaking guest permission overlay
operations) or missed traps (allowing guest to modify permission overlays without
hypervisor awareness). umka-kvm on ARM64 must implement the corrected CPTR_EL2
configuration per D22677 when S1POE-capable hardware is available. Additionally,
when both S1PIE (Permission Indirection Extension) and S1POE are active in a guest,
hierarchical permissions must be disabled and the interaction between PIRE0_EL1
and POR_EL0 carefully managed — KVM must ensure the guest cannot exploit
PIE/POE coexistence to bypass Stage-2 isolation.
kvm_run shared page: Each vCPU has a single page mapped into userspace (returned
by mmap on the vCPU file descriptor). This page contains the kvm_run struct
that communicates VM-exit reasons and I/O data between kernel and userspace:
/// KVM error code for unrecognized hypercalls. Positive 1000, not POSIX ENOSYS (-38).
/// Defined in Linux's include/uapi/linux/kvm_para.h. The hypervisor places the
/// negated value (-KVM_ENOSYS) in the guest's return register (RAX on x86) when a
/// hypercall number is not recognized in-kernel; VMMs handling KVM_EXIT_HYPERCALL
/// return the same value for hypercalls they do not implement, not -ENOSYS.
pub const KVM_ENOSYS: u64 = 1000;
/// KVM VM exit reason constants placed in KvmRun::exit_reason by the kernel.
/// Values match Linux's linux/kvm.h exactly for binary compatibility with QEMU,
/// Firecracker, and libvirt.
pub const KVM_EXIT_UNKNOWN: u32 = 0; // Hardware exit reason KVM does not recognize.
pub const KVM_EXIT_EXCEPTION: u32 = 1; // Guest exception forwarded to userspace.
pub const KVM_EXIT_IO: u32 = 2; // Guest IN/OUT to intercepted port.
pub const KVM_EXIT_HYPERCALL: u32 = 3; // Unrecognized VMCALL/HVC/ECALL.
pub const KVM_EXIT_DEBUG: u32 = 4; // Hardware single-step or breakpoint.
pub const KVM_EXIT_HLT: u32 = 5; // Guest executed HLT (Firecracker shutdown detection).
pub const KVM_EXIT_MMIO: u32 = 6; // Guest MMIO access with no in-kernel handler.
pub const KVM_EXIT_IRQ_WINDOW_OPEN: u32 = 7; // Interrupt injection window available.
pub const KVM_EXIT_SHUTDOWN: u32 = 8; // Guest triple-faulted or ACPI/PSCI shutdown.
pub const KVM_EXIT_FAIL_ENTRY: u32 = 9; // VM entry failed before guest executed.
pub const KVM_EXIT_INTR: u32 = 10; // Host signal interrupted KVM_RUN.
pub const KVM_EXIT_SET_TPR: u32 = 11; // x86 TPR access.
pub const KVM_EXIT_TPR_ACCESS: u32 = 12; // x86 TPR access trap.
pub const KVM_EXIT_NMI: u32 = 16; // NMI forwarding to userspace.
pub const KVM_EXIT_INTERNAL_ERROR: u32 = 17; // KVM internal consistency error.
pub const KVM_EXIT_PAPR_HCALL: u32 = 19; // PPC64LE KVM-HV paravirt hypercall.
pub const KVM_EXIT_SYSTEM_EVENT: u32 = 24; // Guest reset/shutdown via ACPI/PSCI.
pub const KVM_EXIT_ARM_NISV: u32 = 28; // AArch64 not-in-syndrome MMIO.
pub const KVM_EXIT_DIRTY_RING_FULL: u32 = 31; // Dirty ring live migration.
pub const KVM_EXIT_RISCV_SBI: u32 = 35; // RISC-V SBI call.
pub const KVM_EXIT_MEMORY_FAULT: u32 = 39; // Guest memory fault.
// --- Architecture-specific exit types (Phase 4+) ---
// s390x KVM defines additional exit types for channel I/O (KVM_EXIT_S390_SIEIC = 13),
// reset (KVM_EXIT_S390_RESET = 14), store status (KVM_EXIT_S390_STSI = 25),
// and TSCH intercept (KVM_EXIT_S390_TSCH = 22). These are deferred to Phase 4
// s390x KVM implementation. The common exit types above are sufficient for
// x86-64 and AArch64 Phase 2-3 KVM.
/// Shared between umka-kvm (kernel) and VMM (userspace).
/// Layout matches Linux's struct kvm_run exactly for binary compatibility.
/// Userspace ABI struct — mmap'd via KVM vcpu fd; VMMs access fields directly.
/// Note: Linux's full struct kvm_run includes kvm_sync_regs and additional
/// padding to fill the mmap page beyond this core struct. The exit_data
/// union is padded to 256 bytes to match Linux's `char padding1[256]`.
#[repr(C)]
pub struct KvmRun {
/// Set by userspace before KVM_RUN: request a KVM_EXIT_IRQ_WINDOW_OPEN exit
/// as soon as the guest can accept an interrupt injection.
pub request_interrupt_window: u8,
/// Set by userspace: if non-zero, KVM_RUN returns immediately with
/// `KVM_EXIT_INTR` without entering the guest. Used by VMMs to implement
/// signal-safe KVM_RUN: the signal handler sets `immediate_exit = 1`,
/// ensuring the next KVM_RUN call does not block in guest execution.
pub immediate_exit: u8,
_padding1: [u8; 6],
/// Set by kernel on VM exit: why the vCPU exited.
pub exit_reason: u32,
/// Set by kernel: whether an interrupt window is open.
pub ready_for_interrupt_injection: u8,
/// Set by kernel: whether the vCPU's IF flag is set.
pub if_flag: u8,
pub flags: u16,
/// Guest CR8 (TPR) value. Avoids a KVM_SET_REGS round-trip.
pub cr8: u64,
/// Set by kernel: APIC base MSR value.
pub apic_base: u64,
/// Exit-reason-specific data. Union discriminated by exit_reason.
pub exit_data: KvmRunExitData,
// --- Fields below the exit union (Linux `struct kvm_run` continued) ---
/// Bitmask of `kvm_sync_regs` blocks the kernel populated on the last exit
/// (KVM_SYNC_* flags); meaningful only when the sync-regs capability is enabled.
pub kvm_valid_regs: u64,
/// Bitmask of `kvm_sync_regs` blocks userspace has modified; the kernel
/// loads these into the vCPU on the next KVM_RUN.
pub kvm_dirty_regs: u64,
/// `kvm_sync_regs` union — register state that userspace can read/write
/// without KVM_GET_REGS / KVM_SET_REGS round-trips. Size is arch-dependent;
/// padded to fill the mmap page. For x86: 2048 bytes. For ARM64: 384 bytes.
/// This struct is placed at the end of `kvm_run` and the total size is
/// exactly one page (4096 bytes on all supported architectures).
pub s: KvmSyncRegs,
/// Explicit padding to fill the KvmRun struct to exactly 4096 bytes.
/// The size is computed at compile time using a const expression:
/// `4096 - size_of_all_preceding_fields`. This ensures that regardless
/// of the arch-specific `KvmSyncRegs` size, the total struct is always
/// one page. The const_assert below validates this at compile time.
///
/// On x86-64: preceding fields = 32 (header) + 256 (exit_data) + 16
/// (valid/dirty regs) + 2048 (KvmSyncRegs) = 2352; padding = 1744.
/// On AArch64: KvmSyncRegs = 384; padding = 3408.
_page_padding: [u8; 4096
- KVMRUN_HEADER_SIZE // request_interrupt_window..kvm_dirty_regs
- size_of::<KvmRunExitData>()
- size_of::<KvmSyncRegs>()
],
}
// KVMRUN_HEADER_SIZE: sum of all scalar fields before exit_data + the two
// u64 fields (kvm_valid_regs, kvm_dirty_regs) after exit_data.
// request_interrupt_window: u8 (1) + immediate_exit: u8 (1) + _pad1: [u8;6] (6)
// + exit_reason: u32 (4) + ready_for_interrupt_injection: u8 (1) + if_flag: u8 (1)
// + flags: u16 (2) + cr8: u64 (8) + apic_base: u64 (8)
// = 32 bytes before exit_data.
// kvm_valid_regs: u64 (8) + kvm_dirty_regs: u64 (8) = 16 bytes after exit_data.
// Total header = 32 + 16 = 48.
const KVMRUN_HEADER_SIZE: usize = 48;
// KvmRun must be EXACTLY one page (4096 bytes) for binary compatibility with
// VMMs (QEMU, Firecracker, crosvm). This is the mmap(vcpu_fd, KVM_RUN) contract.
const_assert!(size_of::<KvmRun>() == 4096);
/// Union of exit-specific structs, discriminated by KvmRun::exit_reason.
#[repr(C)]
pub union KvmRunExitData {
pub io: KvmRunIo, // KVM_EXIT_IO
pub mmio: KvmRunMmio, // KVM_EXIT_MMIO
pub hypercall: KvmRunHypercall, // KVM_EXIT_HYPERCALL
pub internal: KvmRunInternal, // KVM_EXIT_INTERNAL_ERROR
/// KVM_EXIT_UNKNOWN — hardware reports unknown VM exit reason.
pub hw: KvmExitHw,
/// KVM_EXIT_FAIL_ENTRY — VM entry failed; hardware_entry_failure_reason
/// holds the VMX/SVM-specific exit reason from the hardware.
pub fail_entry: KvmExitFailEntry,
/// KVM_EXIT_DEBUG — hardware single-step or breakpoint triggered.
pub debug: KvmExitDebug,
/// Padding to 256 bytes — matches Linux's `kvm_run` exit union size exactly
/// (see `linux/kvm.h`: `__u8 padding[256]`). This ensures binary compatibility
/// with VMMs (QEMU, Firecracker) that mmap the kvm_run page and access it as
/// `struct kvm_run` using the Linux kernel headers.
_padding: [u8; 256],
}
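As an illustration of how the discriminated-union pattern is consumed, here is a minimal sketch using stand-in types and simplified payloads (the real `KvmRunExitData` carries the full structs and 256-byte padding): the field read is `unsafe`, and the only thing that makes it sound is that `exit_reason` names the field the kernel wrote.

```rust
#[repr(C)]
#[derive(Clone, Copy)]
struct Io { port: u16 }

#[repr(C)]
#[derive(Clone, Copy)]
struct Mmio { phys_addr: u64 }

// Stand-in for KvmRunExitData: which field is valid is decided by the
// exit_reason written alongside it, never by the union itself.
#[repr(C)]
union ExitData { io: Io, mmio: Mmio }

const EXIT_IO: u32 = 2;   // KVM_EXIT_IO
const EXIT_MMIO: u32 = 6; // KVM_EXIT_MMIO

fn describe(exit_reason: u32, data: &ExitData) -> String {
    match exit_reason {
        // SAFETY: exit_reason discriminates which union field the kernel
        // wrote before returning from KVM_RUN.
        EXIT_IO => format!("pio on port {:#x}", unsafe { data.io.port }),
        EXIT_MMIO => format!("mmio at {:#x}", unsafe { data.mmio.phys_addr }),
        _ => "unhandled exit".to_string(),
    }
}
```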
/// KVM_EXIT_UNKNOWN: hardware exit reason that KVM does not recognize.
/// Userspace ABI struct — part of `KvmRunExitData` union. Must be `#[repr(C)]`.
#[repr(C)]
pub struct KvmExitHw {
pub hardware_exit_reason: u64,
}
const_assert!(size_of::<KvmExitHw>() == 8);
/// KVM_EXIT_FAIL_ENTRY: VM entry failed before the guest executed any instructions.
/// Userspace ABI struct — part of `KvmRunExitData` union. Must be `#[repr(C)]`.
#[repr(C)]
pub struct KvmExitFailEntry {
/// Architecture-specific entry failure reason (e.g., VMX basic exit reason).
pub hardware_entry_failure_reason: u64,
/// vCPU index on which the entry failure occurred.
pub cpu: u32,
pub _pad: u32,
}
const_assert!(size_of::<KvmExitFailEntry>() == 16);
/// KVM_EXIT_DEBUG: hardware single-step or breakpoint.
/// Userspace ABI struct — part of `KvmRunExitData` union. Must be `#[repr(C)]`.
#[repr(C)]
pub struct KvmExitDebug {
/// Architecture-specific debug exit information (e.g., DR6 on x86,
/// ESR_EL2 on AArch64).
pub arch: KvmDebugExitArch,
}
// KvmExitDebug: size varies by architecture (x86-64: 32 bytes, AArch64: 16 bytes).
// Userspace ABI sub-struct within KvmRun exit_data union.
#[cfg(target_arch = "x86_64")]
const_assert!(core::mem::size_of::<KvmExitDebug>() == 32);
#[cfg(target_arch = "aarch64")]
const_assert!(core::mem::size_of::<KvmExitDebug>() == 16);
/// KVM_EXIT_INTERNAL_ERROR: KVM detected an internal inconsistency.
/// Suberror codes identify the class of internal failure; `data` carries
/// arch-specific diagnostic context (e.g., VMX/SVM instruction error bits).
/// Layout matches Linux's `struct kvm_run` internal subfield exactly.
#[repr(C)]
pub struct KvmRunInternal {
/// Suberror code: 1=KVM_INTERNAL_ERROR_EMULATION, 2=KVM_INTERNAL_ERROR_SIMUL_EX,
/// 3=KVM_INTERNAL_ERROR_DELIVERY_EV, 4=KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON.
pub suberror: u32,
/// Number of valid entries in `data` (0..16).
pub ndata: u32,
/// Architecture-specific diagnostic data (e.g., failed instruction bytes,
/// exit qualification, VM-entry failure info). Only `data[0..ndata]` is valid.
pub data: [u64; 16],
}
// KvmRunInternal: u32(4) + u32(4) + [u64;16](128) = 136 bytes.
// Userspace ABI sub-struct within KvmRun exit_data union.
const_assert!(core::mem::size_of::<KvmRunInternal>() == 136);
/// KVM_EXIT_IO: guest executed IN/OUT to an intercepted I/O port.
/// Layout matches Linux's `struct kvm_run` io subfield exactly.
#[repr(C)]
pub struct KvmRunIo {
/// Direction: 0 = OUT (guest writes to port), 1 = IN (guest reads from port).
pub direction: u8,
/// Access size in bytes: 1, 2, or 4.
pub size: u8,
/// I/O port number (0-65535).
pub port: u16,
/// Number of repetitions (for REP IN/OUT; 1 for single access).
pub count: u32,
/// Offset within the kvm_run mmap region where the data resides.
/// For OUT: data to be written (count * size bytes starting at this offset).
/// For IN: userspace writes the result here before re-entering KVM_RUN.
pub data_offset: u64,
}
// KvmRunIo: u8(1) + u8(1) + u16(2) + u32(4) + u64(8) = 16 bytes.
// Userspace ABI sub-struct within KvmRun exit_data union.
const_assert!(core::mem::size_of::<KvmRunIo>() == 16);
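For KVM_EXIT_IO, the VMM locates the data buffer inside the one-page mmap region using `data_offset`. A hedged sketch of that bounds-checked offset math (the helper name is illustrative, not part of the ABI):

```rust
/// Compute the byte range inside the 4096-byte kvm_run mmap page that holds
/// the PIO data for a KVM_EXIT_IO exit: `count` repetitions of `size` bytes
/// starting at `data_offset`. Returns None if the payload would escape the page.
fn io_data_range(data_offset: u64, size: u8, count: u32) -> Option<std::ops::Range<usize>> {
    let start = data_offset as usize;
    let len = (size as usize).checked_mul(count as usize)?;
    let end = start.checked_add(len)?;
    if end <= 4096 { Some(start..end) } else { None }
}
```

The end-of-page check matters: `data_offset` comes from the kernel, but a defensive VMM validates it anyway before slicing its mapping.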
/// KVM_EXIT_MMIO: guest accessed an MMIO address with no in-kernel handler.
/// Layout matches Linux's `struct kvm_run` mmio subfield exactly.
#[repr(C)]
pub struct KvmRunMmio {
/// Guest physical address of the MMIO access.
pub phys_addr: u64,
/// Data buffer: for writes, contains the value the guest wrote;
/// for reads, userspace fills this before re-entering KVM_RUN.
pub data: [u8; 8],
/// Access size in bytes: 1, 2, 4, or 8.
pub len: u32,
/// 1 = write, 0 = read.
pub is_write: u8,
pub _pad: [u8; 3],
}
// KvmRunMmio: u64(8) + [u8;8](8) + u32(4) + u8(1) + [u8;3](3) = 24 bytes.
// Userspace ABI sub-struct within KvmRun exit_data union.
const_assert!(core::mem::size_of::<KvmRunMmio>() == 24);
/// KVM_EXIT_HYPERCALL: guest executed VMCALL/HVC with an unrecognized number.
/// **Status note**: Linux long documented `KVM_EXIT_HYPERCALL` as unused,
/// steering paravirtual interfaces toward `KVM_EXIT_IO` (x86) or `KVM_EXIT_MMIO`;
/// it was later revived for `KVM_HC_MAP_GPA_RANGE`, currently its main user.
/// UmkaOS retains this exit type for backward-compatible VMM (QEMU/crosvm)
/// support, but new hypercall interfaces should prefer `KVM_EXIT_IO`,
/// `KVM_EXIT_MMIO`, or vendor-specific exits.
/// Layout matches Linux's `struct kvm_run` hypercall subfield exactly.
#[repr(C)]
pub struct KvmRunHypercall {
/// Hypercall number (from guest RAX on x86, X0 on AArch64).
pub nr: u64,
/// Arguments (up to 6). x86: guest RBX/RCX/RDX/RSI — the standard KVM
/// hypercall ABI uses 4 arguments; slots [4..=5] are reserved for future use.
/// AArch64: X1-X5 via the SMCCC convention.
pub args: [u64; 6],
/// Return value: userspace writes the result here before re-entering.
pub ret: u64,
/// Flags: bit 0 = `KVM_EXIT_HYPERCALL_LONG_MODE` (1 if guest was in 64-bit
/// mode when it issued the hypercall). Other bits reserved, must be 0.
/// Linux uses `__u64` (the `__u32 longmode` alias is deprecated, userspace-only).
pub flags: u64,
}
// KvmRunHypercall: u64(8) + [u64;6](48) + u64(8) + u64(8) = 72 bytes.
// Userspace ABI sub-struct within KvmRun exit_data union.
const_assert!(core::mem::size_of::<KvmRunHypercall>() == 72);
// Architecture-specific debug exit information.
// Selected at compile time via `#[cfg(target_arch)]` — matches Linux's
// per-architecture `struct kvm_debug_exit_arch` in `arch/*/include/uapi/asm/kvm.h`.
// There is NO runtime discriminant field — the architecture is a compile-time constant.

/// x86-64 debug exit (Linux `arch/x86/include/uapi/asm/kvm.h`).
#[cfg(target_arch = "x86_64")]
#[repr(C)]
pub struct KvmDebugExitArch {
/// Exception vector number (1 = #DB, 3 = #BP).
pub exception: u32,
/// Padding for natural alignment of `pc`.
pub _pad: u32,
/// Instruction address that triggered the debug event.
pub pc: u64,
/// DR6 (debug status register) value at the time of the debug exit.
/// Bit 0-3: breakpoint condition detected (DR0-DR3).
/// Bit 14: single-step (BS).
pub dr6: u64,
/// DR7 (debug control register) value.
pub dr7: u64,
}
#[cfg(target_arch = "x86_64")]
// KvmDebugExitArch (x86-64): u32(4) + u32(4) + u64(8)*3 = 32 bytes.
// Matches Linux `arch/x86/include/uapi/asm/kvm.h` exactly.
const_assert!(core::mem::size_of::<KvmDebugExitArch>() == 32);
/// AArch64 debug exit (Linux `arch/arm64/include/uapi/asm/kvm.h`).
#[cfg(target_arch = "aarch64")]
#[repr(C)]
pub struct KvmDebugExitArch {
/// ESR_EL2 (Exception Syndrome Register) lower 32 bits.
pub hsr: u32,
/// ESR_EL2[61:32] (upper bits, added in Linux 6.x).
pub hsr_high: u32,
/// FAR_EL2 (Fault Address Register) — watchpoint address.
pub far: u64,
}
#[cfg(target_arch = "aarch64")]
// KvmDebugExitArch (AArch64): u32(4)*2 + u64(8) = 16 bytes.
// Matches Linux `arch/arm64/include/uapi/asm/kvm.h` exactly.
const_assert!(core::mem::size_of::<KvmDebugExitArch>() == 16);
/// Guest register identifier for `HvOps::read_guest_reg()` / `write_guest_reg()`.
///
/// Architecture-neutral enum that covers registers accessed during VM-exit handling.
/// Each variant maps to a specific VMCS field (Intel), VMCB offset (AMD), or
/// saved register in the guest context (AArch64/RISC-V).
#[repr(u16)]
pub enum GuestReg {
// --- x86-64 GPRs (indices 0-15) ---
Rax = 0, Rcx = 1, Rdx = 2, Rbx = 3,
Rsp = 4, Rbp = 5, Rsi = 6, Rdi = 7,
R8 = 8, R9 = 9, R10 = 10, R11 = 11,
R12 = 12, R13 = 13, R14 = 14, R15 = 15,
// --- x86-64 system registers ---
Rip = 0x100, Rflags = 0x101, Cr0 = 0x102, Cr3 = 0x103, Cr4 = 0x104,
Efer = 0x105,
// --- AArch64 (X0-X30 mapped to 0-30, PC = 0x100, PSTATE = 0x101) ---
// --- RISC-V (x0-x31 mapped to 0-31, PC = 0x100) ---
// Shared namespace: the arch implementation maps these to concrete offsets.
}
/// Exit qualification / exit info returned by `HvOps::exit_info()`.
///
/// Wraps the architecture-specific exit reason details into a uniform struct
/// so that the architecture-neutral exit dispatcher can route without matching
/// on hardware-specific bitfields.
pub struct ExitInfo {
/// Primary exit reason (VMX: basic exit reason from VMCS field 0x4402;
/// SVM: EXITCODE from VMCB offset 0x070).
pub reason: u32,
/// Exit qualification (VMX: VMCS field 0x6400; SVM: EXITINFO1).
/// Content depends on `reason` — e.g., for EPT violation this contains
/// the faulting GPA access flags (read/write/execute).
pub qualification: u64,
/// Exit instruction length (VMX: VMCS field 0x440C). On SVM this is derived
/// from the VMCB next sequential RIP (`nRIP - RIP`) when the NRIPS feature
/// is available. Used by `advance_rip()` to skip the trapping instruction.
/// Guest-linear address involved in the exit (VMX: VMCS field 0x640A;
/// SVM: not always available, 0 if absent).
pub guest_linear_addr: u64,
/// Guest-physical address (for EPT/NPT violations).
pub guest_phys_addr: u64,
}
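For an EPT violation, `qualification` encodes the access type in its low bits (Intel SDM: bit 0 = data read, bit 1 = data write, bit 2 = instruction fetch). A small decoding sketch (names are illustrative):

```rust
/// Decoded access type of a VMX EPT-violation exit qualification.
#[derive(Debug, PartialEq)]
struct EptAccess {
    read: bool,
    write: bool,
    exec: bool,
}

fn decode_ept_access(qualification: u64) -> EptAccess {
    EptAccess {
        read: qualification & (1 << 0) != 0,  // bit 0: data read
        write: qualification & (1 << 1) != 0, // bit 1: data write
        exec: qualification & (1 << 2) != 0,  // bit 2: instruction fetch
    }
}
```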
/// Posted Interrupt Descriptor (x86-64 VMX).
///
/// 64-byte aligned structure used by the CPU to inject virtual interrupts
/// directly into the guest without causing a VM exit. The CPU atomically
/// sets bits in the `pir` (Posted Interrupt Request) bitmap and checks
/// `outstanding_notification` to decide whether to send a notification event.
///
/// Reference: Intel SDM Vol. 3C, Section 29.6 "Posted-Interrupt Processing".
#[repr(C, align(64))]
pub struct PostedInterruptDesc {
/// Posted Interrupt Request bitmap. 256 bits (one per x86 interrupt vector).
/// The CPU sets bit N when vector N is to be injected. The posted-interrupt
/// processing logic transfers set bits into the virtual APIC page's IRR
/// and clears them here, all without a VM exit.
pub pir: [AtomicU64; 4], // bytes 0-31
/// Control field: ON (bit 0) = outstanding notification, SN (bit 1) = suppress.
/// Hardware atomically tests/sets these bits. Both are in the same u16.
/// ON: set by the sender after writing to `pir`. The notification IPI
/// handler clears this bit and processes the PIR. If already set, no IPI
/// is sent (coalescing).
/// SN: when set, posted-interrupt processing is suppressed even if ON is
/// set. Used during vCPU halt so a posted interrupt causes a VM exit
/// instead of silent delivery.
pub notifications: AtomicU16, // bytes 32-33
/// Notification vector: the interrupt vector used for the posted-interrupt
/// notification IPI. Set during VMCS initialization.
pub nv: u8, // byte 34
pub _rsvd1: u8, // byte 35
/// Notification destination: physical APIC ID of the target CPU.
/// Written only during vCPU migration, after the vCPU has VM-exited.
/// `VMLAUNCH`/`VMRESUME` serializes all prior PID writes. Concurrent
/// writes prevented by the vCPU lock (`vcpu.mutex`) held during migration.
pub ndst: u32, // bytes 36-39
/// Reserved / padding to 64 bytes.
pub _reserved: [u32; 6], // bytes 40-63
}
// PostedInterruptDesc must be exactly 64 bytes (one cache line, hardware requirement).
const_assert!(size_of::<PostedInterruptDesc>() == 64);
impl PostedInterruptDesc {
pub const ON: u16 = 1 << 0; // Outstanding Notification
pub const SN: u16 = 1 << 1; // Suppress Notification
}
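The ON/SN coalescing described above can be modeled compactly. This is a minimal sender-side sketch, not the full protocol (it covers only the "should I send the notification IPI?" decision; field names mirror the struct above):

```rust
use std::sync::atomic::{AtomicU16, AtomicU64, Ordering};

const ON: u16 = 1 << 0; // Outstanding Notification
const SN: u16 = 1 << 1; // Suppress Notification

struct Pid {
    pir: [AtomicU64; 4],
    notifications: AtomicU16,
}

/// Post `vector` into the PIR and decide whether a notification IPI is
/// needed. Returns true iff the caller must send the IPI: ON was clear
/// (no notification already outstanding — otherwise coalesce) and SN was
/// clear (posting not suppressed, e.g. the vCPU is actually running).
fn post_interrupt(pid: &Pid, vector: u8) -> bool {
    pid.pir[(vector / 64) as usize].fetch_or(1 << (vector % 64), Ordering::SeqCst);
    let prev = pid.notifications.fetch_or(ON, Ordering::SeqCst);
    prev & (ON | SN) == 0
}
```

In hardware the PIR write and the ON test-and-set are performed by the sending CPU; the receiving CPU's posted-interrupt processing drains the PIR into the virtual APIC IRR and clears ON, all without a VM exit.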
Exit reasons that require userspace handling (the KVM_RUN ioctl returns to
userspace):
- KVM_EXIT_IO: Guest executed IN/OUT to an intercepted port. Userspace emulates
the device and re-enters.
- KVM_EXIT_MMIO: Guest accessed an MMIO region with no in-kernel handler.
Userspace emulates and re-enters.
- KVM_EXIT_HYPERCALL: Guest executed VMCALL with an unrecognized hypercall
number. Forwarded to userspace.
- KVM_EXIT_SHUTDOWN: Guest triple-faulted. VMM should terminate or reset the VM.
- KVM_EXIT_SYSTEM_EVENT: Guest requested reset or shutdown via ACPI or PSCI.
Exit reasons handled entirely in-kernel (the vCPU re-enters the guest without returning to userspace): external interrupts, EPT violations that resolve to mapped memory slots, CPUID emulation, MSR access to non-intercepted MSRs, preemption timer expiry, APIC access, HLT (if other vCPUs can wake it). These are detailed in Section 18.1.
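The split between the two lists can be expressed as a disposition function. An illustrative sketch (exit-reason values follow Linux's `kvm.h` numbering, which this ABI mirrors; the in-kernel list is collapsed into the default arm):

```rust
#[derive(Debug, PartialEq)]
enum Disposition {
    ReturnToUserspace,
    HandleInKernel,
}

fn classify_exit(exit_reason: u32) -> Disposition {
    const KVM_EXIT_IO: u32 = 2;
    const KVM_EXIT_HYPERCALL: u32 = 3;
    const KVM_EXIT_MMIO: u32 = 6;
    const KVM_EXIT_SHUTDOWN: u32 = 8;
    const KVM_EXIT_SYSTEM_EVENT: u32 = 24;
    match exit_reason {
        KVM_EXIT_IO | KVM_EXIT_HYPERCALL | KVM_EXIT_MMIO
        | KVM_EXIT_SHUTDOWN | KVM_EXIT_SYSTEM_EVENT => Disposition::ReturnToUserspace,
        // External interrupts, resolved EPT violations, CPUID, non-intercepted
        // MSRs, preemption timer, APIC access, HLT: re-enter without exiting.
        _ => Disposition::HandleInKernel,
    }
}
```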
18.1.1.3 Core Data Structures¶
Vm struct — one per virtual machine:
/// Represents a single virtual machine. Created by KVM_CREATE_VM.
/// Shared (via Arc) among all vCPUs belonging to this VM.
pub struct Vm {
/// Unique VM identifier (monotonically increasing, never reused).
pub id: u64,
/// Guest physical address → host physical address mapping.
/// Modified by KVM_SET_USER_MEMORY_REGION. RCU-protected for
/// concurrent read access from vCPU fault handlers.
pub memslots: RcuVec<MemSlot>,
/// Architecture-specific second-level page table root.
/// x86-64: EPT PML4 pointer. AArch64: Stage-2 VTTBR. RISC-V: hgatp.
pub slat: SlatRoot,
/// In-kernel interrupt controller state.
/// x86: IOAPIC redirection table + PIC state.
/// AArch64: vGIC distributor state.
/// RISC-V: virtual PLIC/APLIC state.
pub irqchip: Option<IrqChip>,
/// Per-VM dirty page bitmap for live migration.
/// One bit per 4 KiB guest physical page. See `AtomicBitmap` below.
pub dirty_bitmap: Option<AtomicBitmap>,
/// vCPU table. XArray keyed by vCPU ID (integer-keyed). RCU-protected
/// reads for lock-free interrupt injection. Structural modifications
/// (KVM_CREATE_VCPU) protected by `vm_lock`.
pub vcpus: XArray<Arc<Vcpu>>,
/// Maximum number of vCPUs (set at VM creation, capped by hardware
/// and policy). x86: min(KVM_MAX_VCPUS, host logical CPU count * 2).
pub max_vcpus: u32,
/// TSC frequency (Hz) for this VM. All vCPUs share the same virtual
/// TSC frequency. Set by KVM_SET_TSC_KHZ or defaults to host TSC.
pub tsc_khz: u64,
/// Checkpointed state for crash recovery (Section 11.7). Updated
/// on every KVM_RUN return-to-userspace and on periodic checkpoints.
pub checkpoint: SpinLock<VmCheckpoint>,
/// Power budget integration (Section 7.2.6.2).
pub power_budget: Option<VmPowerBudget>,
/// In-kernel I/O device dispatch bus. Routes MMIO and PIO accesses to
/// in-kernel emulated devices (LAPIC, IOAPIC, PIT, PIC, kvmclock,
/// ioeventfd). Page-level XArray lookup: `io_bus.mmio.xa_load(gpa >> PAGE_SHIFT)`.
/// See [Section 18.3](#kvm-operational--kvmiobus-in-kernel-io-device-dispatch).
pub io_bus: KvmIoBus,
/// SLAT page pool: pre-allocated pages for EPT/Stage-2 fault handling.
/// Drawn from per-VM pool first (GFP_ATOMIC safe — no sleeping). If pool
/// is exhausted, falls back to buddy allocator (GFP_KERNEL — may sleep,
/// only on the faulting vCPU's thread which is not in atomic context).
/// Default: 256 pages per VM, refilled in the background when < 64 remain.
pub slat_page_pool: SpinLock<SlabPagePool>,
}
IrqChip — in-kernel virtual interrupt controller:
/// In-kernel emulated interrupt controller state for a VM.
/// Architecture-specific: x86 uses split PIC + IOAPIC + LAPIC,
/// AArch64 uses vGIC, RISC-V uses virtual PLIC/APLIC.
pub enum IrqChip {
/// x86-64: dual 8259A PIC + IOAPIC + per-vCPU LAPIC.
X86 {
/// 24-entry I/O APIC redirection table.
ioapic_rtes: [IoApicRte; 24],
/// PIC state (master + slave, 2 × 8259A).
pic: [PicState; 2],
},
/// AArch64: GICv3 distributor + per-vCPU redistributor.
Aarch64 {
/// GIC distributor state (SPI routing, enable bits).
dist: GicDistState,
},
/// RISC-V: virtual PLIC context state.
Riscv {
/// Per-source priority and pending bits.
plic: PlicState,
},
}
AtomicBitmap — dirty page tracking with per-word atomics:
/// Bitmap with per-u64 word atomic operations. Used for dirty page tracking.
/// One bit per 4 KiB guest physical page.
///
/// **Atomicity**: per-`AtomicU64` word (64 bits = 64 pages per atomic unit).
/// - `set_bit(n)`: `words[n/64].fetch_or(1 << (n%64), Relaxed)` — called from
/// EPT violation handler or PML drain (interrupt context, any CPU).
/// - `read_and_reset(word_idx)`: `words[word_idx].swap(0, AcqRel)` — called by
/// `KVM_GET_DIRTY_LOG` to atomically read and clear a word. AcqRel ensures
/// the read sees all prior stores from EPT handlers.
/// - Full bitmap scan: iterate all words, `swap(0, AcqRel)` each.
pub struct AtomicBitmap {
words: Box<[AtomicU64]>,
/// Number of valid bits (may not fill the last word).
num_bits: usize,
}
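A runnable sketch of the two operations described above (simplified: fixed size, `Relaxed` set and `AcqRel` drain exactly as specified):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct AtomicBitmap {
    words: Box<[AtomicU64]>,
    num_bits: usize,
}

impl AtomicBitmap {
    fn new(num_bits: usize) -> Self {
        let words = (0..(num_bits + 63) / 64).map(|_| AtomicU64::new(0)).collect();
        Self { words, num_bits }
    }

    /// Mark page `n` dirty. Relaxed suffices here: ordering against the
    /// consumer is provided by the drain's AcqRel swap.
    fn set_bit(&self, n: usize) {
        debug_assert!(n < self.num_bits);
        self.words[n / 64].fetch_or(1 << (n % 64), Ordering::Relaxed);
    }

    /// Atomically read and clear one 64-page word (KVM_GET_DIRTY_LOG path).
    fn read_and_reset(&self, word_idx: usize) -> u64 {
        self.words[word_idx].swap(0, Ordering::AcqRel)
    }
}
```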
MemSlot struct — guest physical region backed by host memory:
/// A contiguous region of guest physical address space backed by host memory.
/// Created/modified by KVM_SET_USER_MEMORY_REGION.
pub struct MemSlot {
/// Slot identifier (0-based, userspace-assigned).
pub slot: u32,
/// Guest physical base address (page-aligned).
pub guest_phys_base: u64,
/// Size in bytes (page-aligned).
pub size: u64,
/// Host virtual address of the backing memory (userspace mapping).
/// umka-kvm resolves this to host physical pages via the host page tables.
pub userspace_addr: u64,
/// Flags: KVM_MEM_LOG_DIRTY_PAGES, KVM_MEM_READONLY.
pub flags: MemSlotFlags,
/// Cached host physical addresses for fast EPT population.
/// Lazily populated on first EPT fault per page. RCU-protected.
///
/// **Nested RCU lifetime**: `Vm.memslots` is `RcuVec<MemSlot>`, and each
/// `MemSlot` contains this `RcuVec`. The outer RCU read-side section
/// (protecting the `memslots` access) also protects the inner `RcuVec`:
/// since the `MemSlot` is valid for the duration of the outer RCU read lock,
/// its fields (including this cache) are also valid. When a memslot is
/// updated via `KVM_SET_USER_MEMORY_REGION`, the old `MemSlot` (and its
/// cache) are freed after a full RCU grace period — any vCPU thread
/// mid-fault holds an RCU read lock and sees the old, valid cache until
/// it exits the RCU section.
pub hva_to_hpa_cache: RcuVec<Option<u64>>,
}
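The slot's three address fields define the guest-physical → host-virtual translation the fault handler performs before consulting the HPA cache. A hedged sketch with a simplified `MemSlot` (no flags or cache):

```rust
struct MemSlot {
    guest_phys_base: u64,
    size: u64,
    userspace_addr: u64,
}

/// Translate a guest physical address through one memslot:
/// hva = userspace_addr + (gpa - guest_phys_base), if gpa falls in the slot.
fn gpa_to_hva(slot: &MemSlot, gpa: u64) -> Option<u64> {
    let offset = gpa.checked_sub(slot.guest_phys_base)?;
    if offset < slot.size {
        Some(slot.userspace_addr + offset)
    } else {
        None // not this slot — the caller walks the rest of `memslots`
    }
}
```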
Vcpu struct — one per virtual CPU:
/// Represents a single virtual CPU. Created by KVM_CREATE_VCPU.
pub struct Vcpu {
/// vCPU index within the VM (0-based).
pub id: u32,
/// Back-reference to the parent VM.
pub vm: Arc<Vm>,
// --- Architecture-specific hardware virtualization state ---
/// x86-64: VMCS region (4 KiB aligned, one per vCPU).
/// AArch64: saved EL1 register context for the guest.
/// RISC-V: saved VS-mode CSR context.
pub hw_state: ArchVcpuState,
/// Guest general-purpose registers. Saved by the trampoline on VM exit,
/// restored on VM entry. The trampoline handles all GPRs that are not
/// automatically saved/restored by hardware (VMCS handles RSP/RIP/RFLAGS
/// on x86; hardware handles PC/PSTATE on ARM64).
pub guest_regs: GuestRegisters,
/// Guest FPU/SIMD state. Saved/restored lazily — only when the host
/// thread is about to be scheduled out, or when userspace reads via
/// KVM_GET_FPU/KVM_GET_XSAVE.
pub guest_fpu: FpuState,
/// Highest-priority pending virtual interrupt to inject on next VM entry.
/// Set by in-kernel LAPIC/vGIC or by KVM_INTERRUPT ioctl.
/// AtomicU32 because interrupt injection can race with the vCPU thread
/// (e.g., IOAPIC routing from a different vCPU's ioctl).
pub pending_irq: AtomicU32,
/// Virtual APIC page (x86 only). 4 KiB page used for x2APIC
/// virtualization — hardware reads/writes APIC registers directly
/// from this page without VM exits (when APIC virtualization is enabled
/// via VMCS `APIC-access address` + `virtual-APIC address` fields).
///
/// # Synchronization Protocol
///
/// The vAPIC page is written by hardware (the CPU's VMX subsystem) on every
/// APIC register access from within the guest and by the kernel on interrupt
/// injection. It is read by the kernel on VM exit to sample guest APIC state.
///
/// Access rules:
/// 1. **Set once at vCPU creation** (when no guest is running): the physical
/// address is written into the VMCS. The pointer is stable for the vCPU
/// lifetime; it is `None` only if the host CPU does not support APIC
/// virtualization (checked via `CPUID.01H:ECX.X2APIC[bit 21]` and
/// `IA32_VMX_PROCBASED_CTLS2[bit 9]`).
/// 2. **Kernel reads/writes**: only while the vCPU is not running (i.e., outside
/// `VMENTRY..VMEXIT`). The vCPU thread acquires `Vcpu::run_lock` before
/// accessing `vapic_page` for interrupt injection or state save/restore.
/// 3. **Hardware writes**: occur inside the guest execution window. The kernel
/// never reads stale data because it only accesses the page after `VMEXIT`
/// has serialized all hardware writes.
/// 4. **SMP**: each `Vcpu` has its own `vapic_page`; there is no cross-vCPU
/// sharing. A vCPU's page is touched only by its own vCPU thread and by
/// the in-kernel IOAPIC/PIC emulation (which acquires `run_lock` first).
///
/// `*mut VApicPage` is used (not `Box<VApicPage>` or `Arc`) because the
/// physical address must be pinned in the VMCS. The page is allocated via
/// `PhysAlloc::alloc_pages(1)` at vCPU creation and freed at vCPU destruction.
/// `# Safety`: callers must hold `run_lock` and the vCPU must be not-running.
pub vapic_page: Option<*mut VApicPage>,
/// TSC offset for this vCPU. Guest TSC = host TSC + tsc_offset.
/// Set by KVM_SET_TSC_KHZ or by migration (to preserve guest-visible
/// TSC continuity across hosts with different TSC frequencies).
pub tsc_offset: i64,
/// Shared page mapped into userspace for KVM_RUN communication.
pub kvm_run: *mut KvmRun,
// NOTE: The `launched` state is stored in `hw_state` (e.g.,
// `VmcsState.launched` on VMX). Use `vcpu.hw_state.is_launched()`
// as the architecture-neutral accessor. No duplicate field here.
/// Physical CPU on which this vCPU's VMCS/VMCB was last loaded.
/// Checked at the top of `vm_enter_and_exit()` — if the vCPU thread
/// migrated to a different pCPU, we must VMCLEAR the old VMCS and
/// VMPTRLD on the new pCPU (x86 VMX), or update HOST_SAVE_AREA
/// (AMD SVM), before VM entry. Initialized to `u32::MAX` (invalid)
/// at vCPU creation to force the first `vcpu_load()` unconditionally.
pub last_loaded_cpu: u32,
/// Host PKRU value saved on VM entry, restored on VM exit (x86 MPK only).
/// Used by the PKRU swap path (vm_enter_and_exit step 5).
/// On architectures without MPK, this field is unused (always 0).
pub host_pkru: u32,
/// Guest PKRU value saved on VM exit, restored on VM entry (x86 MPK only).
/// Persists across VM entry/exit cycles — the guest's PKRU state is preserved.
pub guest_pkru: u32,
/// Pending MMIO emulation request (set on EPT fault for unmapped region,
/// causes KVM_RUN to return with KVM_EXIT_MMIO).
pub pending_mmio: Option<MmioRequest>,
/// Preemption timer value. When the VMX preemption timer fires, the
/// vCPU exits to allow the host scheduler to run. Set from the
/// scheduler's time slice quantum (Section 7.1).
///
/// Must be u64: with a 3 GHz TSC and `preempt_timer_shift` = 0, a 1ms
/// slice yields `remaining_slice_ns * tsc_freq_khz / 1000` ≈ 3 × 10⁶,
/// which fits in u32. However, with larger slices (e.g., RT tasks with
/// 100ms budget) or higher TSC frequencies (5+ GHz), the product
/// `remaining_slice_ns * tsc_freq_khz` can exceed 2³² before the
/// division by `2^preempt_timer_shift`. Using u64 for the intermediate
/// and stored value avoids silent truncation. The VMCS field itself is
/// 32-bit; the value is clamped to u32::MAX before VMWRITE (which
/// simply means "don't preempt for a very long time" — safe behavior).
pub preempt_timer_value: u64,
/// Remaining halt-poll budget for the current scheduling quantum (nanoseconds).
/// **Initialization**: Set to `DEFAULT_QUANTUM_NS * halt_poll_budget_pct / 100`
/// at vCPU creation time (where `DEFAULT_QUANTUM_NS` = sysctl_sched_base_slice
/// = 750_000 ns, and `halt_poll_budget_pct` defaults to 10).
/// **Reset mechanism**: Reset on each schedule-in via the KVM preempt_notifier
/// callback (equivalent to Linux's `kvm_sched_in()`). When the scheduler
/// dispatches the vCPU thread, the callback recalculates the budget from
/// the new quantum: `quantum_ns * halt_poll_budget_pct / 100`.
/// Decremented by actual poll duration on each halt-poll iteration.
/// When zero, subsequent HLTs bypass halt-poll and yield immediately.
///
/// Not atomic: only accessed by the vCPU's own thread (inside `run_lock`),
/// never cross-vCPU. See [Section 18.3](#kvm-operational) "Per-quantum halt-poll budget"
/// for the fairness rationale.
pub halt_poll_budget_remaining_ns: u64,
/// Run state: Running, Halted (HLT), Paused (by userspace/migration),
/// or Parked (crash recovery — vCPU is suspended during domain teardown
/// and driver reload). See VcpuRunState enum.
pub run_state: AtomicU8,
/// Serializes concurrent `vcpu_run()` calls on the same vCPU.
/// KVM requires that only one thread runs a vCPU at a time; this lock
/// enforces that invariant. Held for the duration of `KVM_RUN` ioctl
/// processing — from guest register restore through VM entry, guest
/// execution, VM exit, and exit handling, until control returns to
/// userspace. A second thread calling `KVM_RUN` on the same vCPU fd
/// blocks until the first exits.
///
/// Also protects fields that must not be accessed while the vCPU is
/// running: `vapic_page` (interrupt injection / state save), `guest_regs`
/// (register read/write ioctls), and `pending_irq` (in-kernel IOAPIC/PIC
/// emulation). See the `vapic_page` synchronization protocol above.
///
/// `run_lock` (Mutex) is held during `KVM_RUN` processing. The scheduler
/// may migrate the vCPU thread to another CPU during VM exit handling
/// (interrupts enabled, `run_lock` held). This is safe: `run_lock` is a
/// sleeping lock (not held with IRQs disabled), and scheduler migration
/// uses `RQ_LOCK` (SpinLock at level 5) which is at a lower lock hierarchy
/// level. No lock ordering conflict exists.
pub run_lock: Mutex<()>,
}
impl Vcpu {
/// Accessor for the kvm_run shared page. The kvm_run pointer is
/// valid for the lifetime of the Vcpu (allocated at creation, freed on
/// Vcpu drop). Only called while run_lock is held.
///
/// # Safety
/// The kvm_run pointer was allocated via mmap and is valid for the
/// lifetime of the Vcpu struct. The caller must hold run_lock.
unsafe fn kvm_run(&mut self) -> &mut KvmRun {
&mut *self.kvm_run
}
}
Lock ordering: run_lock vs mmap_lock
run_lock is held for the duration of KVM_RUN (VM entry → exit handling →
return to userspace). EPT/Stage-2 violations during guest execution need
mmap_lock.read() to resolve host virtual → physical mappings for the
faulting guest physical address. This establishes the ordering:
run_lock → mmap_lock.read().
The reverse path — KVM_SET_USER_MEMORY_REGION — modifies memslots and
may need mmap_lock.write(). To avoid deadlock, this ioctl does not
acquire run_lock. Instead:
- Memslot changes are published via RCU (`Vm::memslots: RcuVec<MemSlot>`).
  The writing thread acquires `Vm::memslots_update_lock` (a Mutex, never held
  concurrently with `run_lock`) to serialize memslot modifications.
- Running vCPUs see the new memslot array on the next EPT violation (RCU
  read-side during fault handling). No coordination with `run_lock` needed.
- If a memslot is removed or shrunk, the writer invalidates the
  corresponding EPT/Stage-2 entries (IPI + TLB flush) after the RCU grace
  period. This ensures no vCPU can fault into a freed memslot. Additionally,
  for every VFIO-attached device's IOMMU domain, the KVM layer calls
  `iommu_unmap_range(iommu_domain, gpa_start, size)` (Section 4.14) to ensure
  no device can DMA to freed guest pages. The IOMMU unmap is synchronous —
  KVM does not return to userspace until all IOMMU TLBs are invalidated.
- If a memslot is created or expanded, IOMMU mapping is the responsibility of
  userspace (the VMM). The VMM calls `VFIO_IOMMU_MAP_DMA` / `IOMMU_IOAS_MAP`
  via the iommufd interface (Section 18.5) to establish IOMMU mappings for new
  guest physical address ranges. The kernel does NOT auto-map memslots into
  IOMMU domains — this would bypass the VMM's security policy (which controls
  which guest regions are device-accessible). The VMM maps only the specific
  ranges needed by each passthrough device.
Lock ordering summary for KVM:

| Level | Lock | Context |
|---|---|---|
| — | `memslots_update_lock` (Mutex) | `KVM_SET_USER_MEMORY_REGION` only |
| 7 | `VM_LOCK` / `mmap_lock` | Host VMM mmap/munmap |
| — | `run_lock` (Mutex, per-vCPU) | `KVM_RUN`, register access ioctls |
run_lock is not assigned a global level because it is never held
concurrently with mmap_lock.write() or any other global-leveled lock.
The only lock acquired under run_lock is mmap_lock.read() (for EPT
fault resolution), and this cannot deadlock: no code path acquires
run_lock while holding mmap_lock, so the run_lock → mmap_lock.read()
ordering never forms a cycle.
Architecture-specific state (ArchVcpuState enum):
/// Per-architecture hardware virtualization state.
pub enum ArchVcpuState {
/// Intel VMX: VMCS region.
Vmx(VmcsState),
/// AMD SVM: VMCB region.
Svm(VmcbState),
/// AArch64 VHE: saved guest EL1 context.
ArmVhe(ArmVheState),
/// AArch64 nVHE: saved guest EL1 context + EL2 trampoline state.
ArmNvhe(ArmNvheState),
/// RISC-V H-extension: saved VS-mode CSRs.
RiscvH(RiscvHState),
/// PPC64LE KVM-HV: LPCR, HDEC, partition-scoped registers.
/// See [Section 18.2](#kvm-architecture-backends) for `PpcHvState` definition.
PpcHv(PpcHvState),
/// LoongArch64 LVZ: Guest CSR save area, GCFG, GPID.
/// See [Section 18.2](#kvm-architecture-backends) for `LvzState` definition.
Lvz(LvzState),
/// s390x SIE (Start Interpretive Execution): State Description block.
/// s390x virtualization uses the SIE instruction to enter the guest.
/// The State Description (SD) block is the equivalent of VMCS/VMCB.
/// Phase 3+ (s390x KVM backend not yet fully specified).
S390Sie(S390SieState),
}
/// s390x SIE (Start Interpretive Execution) state. The State Description (SD)
/// block is the s390x equivalent of x86's VMCS / AMD's VMCB. Contains guest
/// PSW, guest GPRs save area, SIE control block address, and intercept
/// configuration. Phase 3+ (s390x KVM backend not yet fully specified).
pub struct S390SieState {
/// State Description block physical address (4K-aligned, allocated from
/// the SIEIC region during vCPU creation).
pub sd_addr: u64,
/// Guest PSW (Program Status Word) — instruction address + condition code.
pub guest_psw: u128,
/// SIE intercept control flags.
pub intercept_control: u64,
}
SlatRoot — root of the Second-Level Address Translation page table, referenced
by Vm::slat. For IOMMU integration (shared page tables with passthrough devices),
see Section 11.5:
/// Root of the Second-Level Address Translation (SLAT) page table hierarchy.
/// On x86-64: EPT (Extended Page Tables) root; on AArch64: IPA stage-2 table root.
pub struct SlatRoot {
/// Host physical address of the top-level page table (EPT PML4 / VTTBR_EL2 target).
///
/// **Conversion to `IommuPageTable`**: For `IommuDomainType::VmPassthrough`
/// domains ([Section 11.5](11-drivers.md#iommu-and-dma-mapping--iommu-groups)), the IOMMU page table
/// and SLAT page table are the same physical structure (EPT serves as both
/// guest physical → host physical translation and IOMMU DMA translation).
/// Conversion: `IommuPageTable::from_phys(slat.hpa)`. The reverse path
/// (`IommuPageTable → SlatRoot`) is used during VM creation when the IOMMU
/// domain is allocated first and the SLAT root is derived from it.
pub hpa: PhysAddr,
/// VMID state for TLB tagging (prevents cross-VM TLB pollution).
/// Contains the assigned VMID and the generation at which it was allocated.
/// On VM entry, the generation is compared with `VmidAllocator::generation`:
/// if stale, a fresh VMID is allocated (see VMID Recycling Protocol below).
/// The VMID value is u16 to match hardware width (8-16 bits). Wrap-safe by
/// design — not a 50-year counter (the generation counter is u64).
pub vmid_state: VmidState,
}
VMID Recycling Protocol
Hardware VMIDs are a scarce resource: x86-64 VPIDs are 16 bits (max 65535),
AArch64 VMIDs are 8 or 16 bits depending on ID_AA64MMFR1_EL1.VMIDBits,
and RISC-V VMIDs are 7-14 bits (discovered from hgatp by writing all-ones to the VMID field and reading back the supported width). When the VMID space is exhausted, some mechanism must reclaim VMIDs for new VMs. Rather than evicting a running VM to free its VMID, UmkaOS invalidates every outstanding VMID at once via a generation bump, as specified below.
/// Global VMID allocator. One instance per physical CPU package (on systems
/// with per-package TLB domains) or one global instance (most hardware).
///
/// The allocator uses a generation counter to detect stale VMID assignments
/// without scanning all VMs. When the VMID space is exhausted, the generation
/// is bumped and all VMIDs become invalid — the next VM entry for each VM
/// lazily re-acquires a fresh VMID.
///
/// **Atomicity design**: `generation` (high 32 bits) and `next_vmid` (low 32 bits)
/// are packed into a single `AtomicU64` so that exhaustion-triggered rollover
/// (bump generation + reset next_vmid) is a single CAS — no TOCTOU race between
/// separate atomics. The generation is 32 bits rather than 64 to fit the
/// packing. This is safe because one generation is consumed per VMID-space
/// exhaustion event, not per allocation, and exhaustion events are rare
/// (typically thousands of VM entries apart). Even at one rollover per second,
/// the ~4 billion available generations last over a century. `VmidState`
/// stores a matching u32 generation for the per-VM staleness comparison.
pub struct VmidAllocator {
/// Packed state: high 32 bits = generation, low 32 bits = next_vmid.
/// CAS-updated atomically on every allocation. On exhaustion, both
/// fields change in a single CAS (generation incremented, next_vmid
/// reset to 2, VMID 1 returned to the caller that performed the rollover).
state: AtomicU64,
/// Maximum VMID value supported by hardware. Discovered at boot:
/// x86-64: 65535 (VPID is 16-bit, 0 reserved).
/// AArch64: 255 (8-bit) or 65535 (16-bit), from ID_AA64MMFR1_EL1.
/// RISC-V: (1 << vmid_bits) - 1, from hgatp (write-all-ones-read-back).
/// Stored as u32 for packing alignment; actual hardware max fits in u16.
max_vmid: u32,
}
impl VmidAllocator {
/// Allocate a VMID. Returns (generation, vmid). If the generation changed
/// (VMID space was exhausted), the caller must issue a full TLB flush
/// before using the returned VMID.
fn allocate(&self) -> (u32, u32) {
loop {
let old = self.state.load(Acquire);
let gen = (old >> 32) as u32;
let next = old as u32;
// `<=`: max_vmid itself is a valid VMID (only VMID 0 is reserved).
let (new_gen, new_next, allocated) = if next <= self.max_vmid {
// Normal path: allocate `next`, advance counter.
(gen, next + 1, next)
} else {
// Exhausted: bump generation, reset to 2, return VMID 1.
// VMID 0 is reserved (host context).
(gen.wrapping_add(1), 2, 1)
};
let new_state = ((new_gen as u64) << 32) | (new_next as u64);
if self.state.compare_exchange_weak(
old, new_state, AcqRel, Relaxed
).is_ok() {
if new_gen != gen {
// Generation rolled over: flush all VMID-tagged TLBs.
// The CAS ensures exactly one thread performs this flush
// per generation transition.
// Note: correctness does NOT depend on this IPI-based
// flush completing before other vCPUs re-enter. Each
// vCPU that detects a stale VMID in vm_enter_and_exit()
// step 1 re-acquires a VMID and sets per-vCPU VPID/VMID
// flush bits in its VMCS/VTTBR. The per-vCPU flush at
// VM entry provides LOCAL correctness; the global IPI
// flush is a best-effort optimization to reduce the
// number of per-vCPU flush events.
flush_all_vmid_tlbs();
}
return (new_gen, allocated);
}
// CAS failed: another thread allocated concurrently. Retry.
}
}
}
/// Per-VM VMID state, stored alongside `SlatRoot`.
pub struct VmidState {
/// The VMID assigned to this VM in the current generation.
pub vmid: u16,
/// Generation at which `vmid` was assigned. If this does not match
/// the current generation from `VmidAllocator::state`, the VMID is
/// stale and must be re-acquired before VM entry.
/// Matches the 32-bit generation produced by VmidAllocator. Wrap-safety
/// relies on practical impossibility (~4 billion VMID exhaustion events;
/// at 1/sec, wraps after ~136 years).
pub generation: u32,
}
Allocation algorithm (called on VM entry, before loading VMCS/VTTBR/hgatp):
1. Read the VM's `VmidState.generation` and compare it with the current
   generation from `VmidAllocator.state.load(Acquire) >> 32`.
2. If the generations match: the VM's VMID is still valid. Proceed to VM entry.
3. If the generations differ (stale VMID): call `allocator.allocate()`:

    a. The CAS loop atomically reads `(gen, next_vmid)` from the packed `state`.

    b. If `next_vmid <= max_vmid`: the CAS advances `next_vmid` by 1 and returns
       the old `next_vmid` as the allocated VMID. Update
       `VmidState { vmid, generation }`.

    c. If `next_vmid > max_vmid` (VMID space exhausted): the CAS atomically
       bumps `generation` and resets `next_vmid` to 2, returning VMID 1 to the
       caller. Because both fields change in a single CAS, exactly one thread
       wins the rollover — all other concurrent allocators retry and see the
       new generation. The winning thread issues a full TLB flush for the
       VMID-tagged TLB domain:

    - x86-64: `INVVPID` type 2 (all-context invalidation) — flushes all
      VPID-tagged entries across all VPIDs.
    - AArch64: `TLBI ALLE1IS` — invalidates all Stage-1 and Stage-2 entries
      for all VMIDs in the inner-shareable domain.
    - RISC-V: `HFENCE.GVMA` with `rs1=x0, rs2=x0` — flushes all guest
      physical address translations for all VMIDs.

    Threads that lost the CAS retry and allocate from the fresh generation
    (step 3b path) without redundant TLB flushes.
LRU is not used: The generation-based scheme avoids per-VM LRU tracking entirely. There is no "victim selection" — on exhaustion, all VMIDs are invalidated atomically by the CAS that bumps the generation counter. This is equivalent to a global flush but is O(1) at the allocator (no scanning). Each VM lazily re-acquires a VMID on its next entry. The cost is one TLB miss storm after a generation rollover, amortized across all VMs.
Atomicity: The packed AtomicU64 with CAS ensures that the generation bump
and next_vmid reset are a single atomic operation. This eliminates the TOCTOU
race that would exist if generation and next_vmid were separate atomics (where
two threads could both observe exhaustion and both attempt rollover, potentially
double-incrementing the generation or allocating from a partially-reset state).
The CAS uses AcqRel on success (publish the new state) and Relaxed on
failure (retry loop). compare_exchange_weak is used because the retry loop
tolerates spurious failures.
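The packed-state CAS scheme can be exercised in isolation. The sketch below is a minimal userspace model of the allocator (the tiny `max_vmid` and constructor are illustrative, and the rollover TLB flush is elided):

```rust
use std::sync::atomic::{
    AtomicU64,
    Ordering::{AcqRel, Acquire, Relaxed},
};

struct VmidAllocator {
    state: AtomicU64, // high 32 bits = generation, low 32 bits = next_vmid
    max_vmid: u32,
}

impl VmidAllocator {
    fn new(max_vmid: u32) -> Self {
        // next_vmid starts at 1: VMID 0 is reserved for the host context.
        Self { state: AtomicU64::new(1), max_vmid }
    }

    /// Returns (generation, vmid). A generation change signals that the
    /// caller must perform a full VMID-tagged TLB flush (elided here).
    fn allocate(&self) -> (u32, u32) {
        loop {
            let old = self.state.load(Acquire);
            let (gen, next) = ((old >> 32) as u32, old as u32);
            let (new_gen, new_next, vmid) = if next <= self.max_vmid {
                (gen, next + 1, next) // normal path
            } else {
                (gen.wrapping_add(1), 2, 1) // rollover: single CAS, one winner
            };
            let new = ((new_gen as u64) << 32) | new_next as u64;
            if self
                .state
                .compare_exchange_weak(old, new, AcqRel, Relaxed)
                .is_ok()
            {
                return (new_gen, vmid);
            }
        }
    }
}

fn main() {
    let a = VmidAllocator::new(3); // tiny VMID space: 1, 2, 3
    assert_eq!(a.allocate(), (0, 1));
    assert_eq!(a.allocate(), (0, 2));
    assert_eq!(a.allocate(), (0, 3));
    // Space exhausted: the generation bumps and VMID 1 goes to the winner.
    assert_eq!(a.allocate(), (1, 1));
    assert_eq!(a.allocate(), (1, 2));
}
```

Running the exhaustion path through assertions like these is a cheap way to pin down the rollover semantics (generation bump, reset to 2, VMID 1 to the winner) before wiring in the architecture-specific flushes.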
VmCheckpoint — snapshot of a VM's complete state, referenced by Vm::checkpoint:
/// Snapshot of a VM's complete state for live migration or checkpoint.
///
/// # Memory Allocation Discipline
///
/// `VmCheckpoint` is held under `Vm::checkpoint: SpinLock<VmCheckpoint>`.
/// SpinLock disables preemption; heap allocation (`Vec::push`, `Vec::extend`,
/// `Box::new`) inside a spinlock is prohibited (the allocator may acquire
/// currently-held locks, causing deadlock).
///
/// Therefore all buffers in `VmCheckpoint` are pre-allocated at VM creation
/// time (when no spinlock is held) and never resized:
/// - `vcpu_states` is a `Box<[GuestRegisters]>` pre-sized to `vcpu_count`.
/// - `device_states` is a `Box<[DeviceStateBlob]>` pre-sized to `device_count`.
BitVec — fixed-capacity bit vector used for dirty-page tracking, interrupt
masks, and other per-bit state:
/// Fixed-capacity bit vector for tracking dirty pages, interrupt masks,
/// and other per-bit state. Pre-allocated at creation time; never resized.
///
/// Internal storage uses `u64` words for efficient bulk operations
/// (popcount, find-first-set, bitwise OR for dirty bitmap merging).
/// `Box<[u64]>` ensures no accidental reallocation under spinlock.
pub struct BitVec {
/// Backing storage. Word count = ceil(len / 64).
data: Box<[u64]>,
/// Number of valid bits. Bits beyond `len` in the last word are always zero.
len: usize,
}
impl BitVec {
/// Create a zero-initialized bit vector with `len` bits.
pub fn zeroed(len: usize) -> Self {
let words = (len + 63) / 64;
Self { data: vec![0u64; words].into_boxed_slice(), len }
}
/// Set bit at `index`.
pub fn set(&mut self, index: usize) {
self.data[index / 64] |= 1u64 << (index % 64);
}
/// Clear bit at `index`.
pub fn clear(&mut self, index: usize) {
self.data[index / 64] &= !(1u64 << (index % 64));
}
/// Test bit at `index`.
pub fn test(&self, index: usize) -> bool {
(self.data[index / 64] >> (index % 64)) & 1 != 0
}
/// Clear all bits to zero.
pub fn clear_all(&mut self) {
self.data.fill(0);
}
/// Count of set bits (population count).
pub fn count_ones(&self) -> usize {
self.data.iter().map(|w| w.count_ones() as usize).sum()
}
/// Number of valid bits.
pub fn len(&self) -> usize {
self.len
}
/// Raw word slice for bulk transfers (e.g., copying out via the KVM dirty-log ioctl).
pub fn as_raw_words(&self) -> &[u64] {
&self.data
}
/// Mutable raw word slice for direct dirty-bitmap writes from hardware.
pub fn as_raw_words_mut(&mut self) -> &mut [u64] {
&mut self.data
}
}
/// Snapshot capture only writes into the pre-allocated buffers; it never
/// allocates. The `epoch` field is updated atomically. This makes snapshot
/// capture lock-friendly and O(1) in latency.
pub struct VmCheckpoint {
/// Saved register state for each vCPU. Pre-allocated at VM creation to
/// hold exactly `Vm::vcpu_count` entries; never resized after creation.
/// `Box<[GuestRegisters]>` prevents accidental `push()`/`extend()` that
/// would heap-allocate under the spinlock.
pub vcpu_states: Box<[GuestRegisters]>,
/// Dirty page bitmap: bit N set = GPA page N modified since last checkpoint.
/// Pre-allocated at VM creation to cover the full GPA range.
///
/// **AtomicBitmap → BitVec transfer protocol**: During checkpoint capture,
/// the live `MemSlot.dirty_bitmap: AtomicBitmap` (per-word atomic, updated
/// concurrently by EPT violation handlers) is drained into this non-atomic
/// `BitVec` via a word-by-word `swap(0, AcqRel)` loop:
/// ```
/// for i in 0..dirty_bitmap.word_count() {
/// let word = slot.dirty_bitmap.words[i].swap(0, AcqRel);
/// checkpoint.dirty_bitmap.as_raw_words_mut()[offset + i] |= word;
/// }
/// ```
/// The `swap(0, AcqRel)` atomically reads and clears each word, ensuring
/// no dirty bit is lost or double-counted. The `|=` merge accumulates
/// bits across multiple memslots into the single checkpoint bitmap.
/// vCPUs continue running during the transfer — new dirty bits set
/// after the swap are captured by the next checkpoint epoch.
/// Pre-allocated fixed-size dirty bitmap. `BitVec` wraps a `Box<[u64]>`,
/// so it cannot grow — no accidental heap allocation under the spinlock.
/// Sized at VM creation: `(max_guest_pages + 63) / 64` u64 words.
pub dirty_bitmap: BitVec,
/// Device state blobs (serialized from each Tier 1 virtual device).
/// Pre-allocated at VM creation; one slot per registered virtual device.
/// Each `DeviceStateBlob` has a fixed `data: Box<[u8]>` buffer sized to
/// the maximum serialized state reported by the device at registration.
pub device_states: Box<[DeviceStateBlob]>,
/// Monotonic sequence number of this snapshot (used to order incremental
/// checkpoints during live migration). Incremented on each capture.
pub epoch: u64,
}
/// Fixed-capacity serialized state blob for one virtual device.
/// Pre-allocated at VM creation. The device driver writes its state into
/// `data[..actual_len]`; actual_len ≤ data.len() is always true.
pub struct DeviceStateBlob {
/// Device identifier.
pub device_id: DeviceId,
/// Pre-allocated buffer (sized to device's `max_state_bytes()` at registration).
pub data: Box<[u8]>,
/// Valid bytes written in the most recent snapshot capture. Always ≤ data.len().
pub actual_len: u32,
}
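The swap-and-merge drain described in the `dirty_bitmap` documentation can be modeled in userspace. This sketch (the `AtomicBitmap` stand-in and page numbers are illustrative) shows why no dirty bit is lost: `swap(0, AcqRel)` reads and clears each word in one atomic step, so a bit set concurrently after the swap lands in the next epoch rather than vanishing:

```rust
use std::sync::atomic::{AtomicU64, Ordering::AcqRel};

/// Minimal stand-in for the live per-memslot dirty bitmap, updated
/// concurrently by second-level fault handlers.
struct AtomicBitmap {
    words: Vec<AtomicU64>,
}

impl AtomicBitmap {
    fn new(words: usize) -> Self {
        Self { words: (0..words).map(|_| AtomicU64::new(0)).collect() }
    }
    fn mark_dirty(&self, page: usize) {
        self.words[page / 64].fetch_or(1 << (page % 64), AcqRel);
    }
}

/// Drain the live bitmap into the checkpoint's plain word buffer.
/// `swap(0)` atomically reads-and-clears; `|=` merges into the checkpoint.
fn drain(live: &AtomicBitmap, checkpoint: &mut [u64]) {
    for (i, w) in live.words.iter().enumerate() {
        checkpoint[i] |= w.swap(0, AcqRel);
    }
}

fn main() {
    let live = AtomicBitmap::new(3); // covers 192 guest pages
    live.mark_dirty(3);
    live.mark_dirty(130);
    let mut cp = [0u64; 3];
    drain(&live, &mut cp);
    assert_eq!(cp[0], 1 << 3);
    assert_eq!(cp[2], 1 << 2); // page 130 = word 2, bit 2
    // The live bitmap is now clear; a second drain adds nothing.
    drain(&live, &mut cp);
    assert_eq!(cp.iter().sum::<u64>(), (1 << 3) + (1 << 2));
}
```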
GuestRegisters — complete architectural register state, referenced by
Vcpu::guest_regs:
/// Complete architectural register state of a guest vCPU.
pub struct GuestRegisters {
/// General-purpose registers. Sized for the largest ISA (RISC-V: 32 GPRs,
/// AArch64: 31 GPRs, x86-64: 16 GPRs). x86-64 uses indices [0..16];
/// AArch64 uses [0..31]; RISC-V uses [0..32]. Unused indices are zero-filled.
pub gpr: [u64; 32],
/// Program counter / instruction pointer.
pub pc: u64,
/// Stack pointer.
pub sp: u64,
/// Architecture flags / PSTATE / CPSR.
pub flags: u64,
// Note: FPU/SIMD state is NOT stored here. The authoritative guest FPU
// state lives in `Vcpu::guest_fpu` to avoid duplication and inconsistency.
// `GuestRegisters` captures only integer/system register state that the
// trampoline saves/restores on every VM exit/entry. FPU state is managed
// lazily by the host scheduler (saved only when the host thread is
// preempted or when userspace reads via KVM_GET_FPU/KVM_GET_XSAVE).
/// Architecture-specific system registers (CR0/CR3/CR4/EFER on x86-64, etc.).
pub sys_regs: [u64; 64],
}
FpuState — floating-point and SIMD register state. The authoritative
guest FPU state is stored in Vcpu::guest_fpu (not in GuestRegisters,
which holds only integer/system register state):
/// Floating-point and SIMD register state.
/// Layout matches the host architecture's XSAVE area (x86-64) or FPSIMD context (AArch64).
// kernel-internal, not KABI — vCPU register save area, never exposed to userspace.
#[repr(C, align(64))]
pub struct FpuState {
/// Raw XSAVE/FPSIMD data; size is architecture-dependent.
/// x86-64: up to ~11 KiB (AVX-512 = 2688; with AMX TILEDATA = 8192 +
/// TILECFG = 64 + legacy/header = ~11 KiB total). Buffer sized at 12 KiB
/// to accommodate all current XSAVE components with alignment headroom.
/// AArch64: 512 bytes (V0-V31 + FPCR/FPSR).
/// The `size` field records the actual valid length for this vCPU
/// (queried via CPUID EAX=0DH at boot); unused tail bytes are not accessed.
///
/// The 12 KiB buffer is a deliberate superset accommodating the largest ISA
/// (x86-64 with AVX-512/AMX). AArch64 uses only 512 bytes; RISC-V (V ext)
/// uses up to 4 KiB depending on VLEN. The per-vCPU `size` field tracks
/// actual valid bytes — save/restore only touches `data[0..size]`.
pub data: [u8; 12288],
/// Number of valid bytes in data[] (set by XSAVE/FPSIMD save).
pub size: u32,
}
MmioRequest — MMIO access request from a vCPU, referenced by
Vcpu::pending_mmio:
/// MMIO access request from a vCPU that trapped on an unhandled MMIO address.
/// Kernel-internal struct (not exposed to userspace, does not cross KABI boundary).
#[repr(C)]
pub struct MmioRequest {
/// Guest physical address of the MMIO access.
pub gpa: u64,
/// Data to write (for writes); ignored for reads.
pub data: u64,
/// Access size in bytes: 1, 2, 4, or 8.
pub size: u8,
/// 1 = write, 0 = read. Uses u8 (not bool) because `MmioRequest` is
/// `#[repr(C)]` and `bool` has a validity invariant (must be 0 or 1);
/// a future KABI extension or serialization could introduce values
/// outside {0, 1}. u8 is safe for any byte value.
pub is_write: u8,
}
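When the device model completes a read, the returned value must be truncated to the access width recorded in the request before being written back to the guest's destination register. A small helper (hypothetical name, a sketch of the masking step only) makes this explicit:

```rust
/// Truncate an MMIO read result to the access width of the request.
/// Sizes other than 1/2/4/8 are rejected when the exit is decoded,
/// before a request is ever built, so they are unreachable here.
fn mask_to_size(value: u64, size: u8) -> u64 {
    match size {
        1 => value & 0xff,
        2 => value & 0xffff,
        4 => value & 0xffff_ffff,
        8 => value,
        _ => unreachable!("size validated at decode time"),
    }
}

fn main() {
    assert_eq!(mask_to_size(0xDEAD_BEEF, 2), 0xBEEF);
    assert_eq!(mask_to_size(0xDEAD_BEEF, 4), 0xDEAD_BEEF);
    assert_eq!(mask_to_size(0xDEAD_BEEF, 1), 0xEF);
}
```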
18.1.1.4 VM-Exit Handling and Dispatch¶
When a VM exit occurs, the architecture-specific trampoline (running in umka-core's
domain, PKEY 0 on x86) saves guest GPRs to vcpu.guest_regs and dispatches to
umka-kvm's exit handler. The trampoline design is specified in
Section 19.1; this section covers the exit handler logic.
MsrDirection — direction of a guest MSR access:
/// Direction of a guest MSR access (RDMSR vs. WRMSR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MsrDirection {
/// Guest executed RDMSR. Handler writes the result to guest
/// EDX:EAX (high:low halves) and advances RIP.
Read,
/// Guest executed WRMSR. Handler validates the value in guest
/// EDX:EAX (the MSR index is in ECX) and applies it to the VMCS
/// or internal shadow.
Write,
}
CrDirection — direction of a guest control-register access:
/// Direction of a guest control-register access (MOV to/from CRn).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrDirection {
/// Guest wrote to CRn (MOV to CRn).
Write,
/// Guest read from CRn (MOV from CRn).
Read,
}
ExitReason — architecture-independent VM-exit reason. The architecture-specific
trampoline translates the raw exit code (VMCS field 0x4402 on x86, ESR_EL2 on ARM,
scause on RISC-V) into this enum before calling the common dispatch:
/// Architecture-independent VM-exit reason.
///
/// Each architecture trampoline maps hardware-specific exit codes into this
/// enum. Variants that carry data (Msr, CrAccess) include the decoded
/// sub-fields so the common handler does not need to re-parse raw bits.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitReason {
/// Host external interrupt arrived while the guest was running.
/// The trampoline already acknowledged the interrupt via the host IDT.
ExternalInterrupt,
/// VMX preemption timer fired (x86) or EL2 physical timer expired (ARM).
/// Indicates the host scheduler quantum has elapsed.
PreemptionTimer,
/// EPT violation (x86). Guest accessed a GPA with no EPT mapping or
/// with insufficient permissions. `VmExit::guest_phys_addr` holds the
/// faulting GPA; `VmExit::fault_flags` encodes read/write/execute.
EptViolation,
/// Stage-2 translation fault (AArch64). Analogous to `EptViolation`.
Stage2Fault,
/// Guest page fault (RISC-V). Second-level translation miss. Analogous
/// to `EptViolation`.
GuestPageFault,
/// Guest executed CPUID. Handled entirely in-kernel using the per-VM
/// CPUID table set by `KVM_SET_CPUID2`.
Cpuid,
/// Guest executed RDMSR or WRMSR on an intercepted MSR index.
Msr(MsrDirection),
/// Guest executed HLT. If an interrupt is already pending the guest is
/// re-entered immediately; otherwise the vCPU is marked halted.
Hlt,
/// Guest accessed the virtual APIC page at an offset that requires
/// software emulation (e.g., ICR write for IPI delivery).
ApicAccess,
/// Guest executed VMCALL (x86), HVC (ARM), or ecall with
/// `KVM_HC_*` function code (RISC-V).
Hypercall,
/// Guest accessed a control register (MOV to/from CRn). The first
/// field is the CR number (0, 3, 4, or 8); the second is the direction.
CrAccess(u8, CrDirection),
/// Guest executed a hardware task switch (x86 only). Emulated in
/// software — modern guests should not trigger this.
TaskSwitch,
/// Guest executed XSETBV to modify XCR0 (extended control register).
/// Handler validates the new value against the VM's allowed feature mask.
Xsetbv,
/// Guest executed an IN/OUT or INS/OUTS instruction on an I/O port
/// not covered by the I/O bitmap passthrough.
IoInstruction,
/// Guest performed an MMIO access to a GPA that has no in-kernel device
/// model mapping. Forwarded to userspace via `kvm_run`.
MmioAccess,
/// Guest triple-faulted or entered shutdown state.
Shutdown,
/// VM entry itself failed (invalid VMCS/HCR state, x86 VMX "VM-entry
/// failure" or ARM EL2 "invalid entry state"). This is NOT a VM exit —
/// the guest never ran. The `VmExit::exit_qual` field contains the
/// architecture-specific failure reason (e.g., VMX error code from
/// VMCS field 0x4400). Requires re-validation of the VMCS/vCPU state
/// before the next entry attempt.
EntryFailed,
/// Exit reason not recognized by the architecture trampoline. The raw
/// hardware exit code is preserved for diagnostic logging.
Unknown(u32),
}
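The trampoline's translation from raw hardware exit codes into `ExitReason` can be sketched for x86 VMX. The numeric codes below follow the Intel SDM basic-exit-reason table (a subset, to be verified against the SDM), and the enums are minimal stand-ins for the ones above so the sketch is self-contained:

```rust
// Minimal stand-ins for the MsrDirection/ExitReason variants used below.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MsrDirection { Read, Write }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExitReason {
    ExternalInterrupt,
    PreemptionTimer,
    EptViolation,
    Cpuid,
    Msr(MsrDirection),
    Hlt,
    Hypercall,
    IoInstruction,
    Xsetbv,
    Unknown(u32),
}

/// Map the VMX basic exit reason (low 16 bits of VMCS field 0x4402) to the
/// architecture-independent enum. Only a representative subset is shown.
fn map_vmx_exit(basic_reason: u32) -> ExitReason {
    match basic_reason {
        1 => ExitReason::ExternalInterrupt,
        10 => ExitReason::Cpuid,
        12 => ExitReason::Hlt,
        18 => ExitReason::Hypercall, // VMCALL
        30 => ExitReason::IoInstruction,
        31 => ExitReason::Msr(MsrDirection::Read),  // RDMSR
        32 => ExitReason::Msr(MsrDirection::Write), // WRMSR
        48 => ExitReason::EptViolation,
        52 => ExitReason::PreemptionTimer,
        55 => ExitReason::Xsetbv,
        other => ExitReason::Unknown(other),
    }
}

fn main() {
    assert_eq!(map_vmx_exit(10), ExitReason::Cpuid);
    assert_eq!(map_vmx_exit(31), ExitReason::Msr(MsrDirection::Read));
    assert_eq!(map_vmx_exit(99), ExitReason::Unknown(99));
}
```

Keeping the raw-code-to-enum mapping in one exhaustive `match` per architecture makes the `Unknown` fallback explicit: any code the trampoline does not recognize is preserved for diagnostic logging rather than silently mishandled.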
VmExit — decoded VM-exit information passed from the architecture trampoline
to the common dispatch handler:
/// Decoded VM-exit information. Built by the architecture trampoline from
/// hardware-specific fields (VMCS on x86, ESR_EL2/FAR_EL2 on ARM,
/// scause/stval on RISC-V) and passed to `handle_vmexit`.
#[derive(Debug, Clone, Copy)]
pub struct VmExit {
/// Architecture-independent exit reason.
pub reason: ExitReason,
/// Faulting guest physical address. Meaningful for EPT violations,
/// Stage-2 faults, and guest page faults. Zero for non-fault exits.
pub guest_phys_addr: u64,
/// Fault flags (read/write/execute/instruction-fetch). Encoding is
/// architecture-independent: bit 0 = read, bit 1 = write, bit 2 = exec.
/// Zero for non-fault exits.
pub fault_flags: u64,
/// Exit qualification (x86) or syndrome register (ARM). Contains
/// architecture-specific sub-fields for the exit reason. The common
/// handler uses this only for cases not fully decoded into `ExitReason`
/// variants (e.g., I/O port number for `IoInstruction`).
pub exit_qual: u64,
}
VApicPage — 4 KiB page mapped at the VMCS virtual-APIC address. Hardware
reads and writes APIC registers directly from this page when APIC virtualization
is enabled (see Vcpu::vapic_page). Offsets match the xAPIC MMIO register layout
(Intel SDM Vol. 3A Table 10-1):
/// Virtual APIC page used for hardware-assisted x2APIC virtualization.
///
/// The CPU's VMX subsystem reads and writes fields in this page on every
/// APIC register access from within the guest. The kernel reads it on
/// VM exit to sample guest APIC state and writes it for interrupt injection.
///
/// # Safety
///
/// This page is shared between the CPU hardware and the kernel. Accesses
/// must be volatile or use atomic operations. The page must remain pinned
/// in physical memory for the lifetime of the vCPU.
// Hardware-facing struct — CPU VMX subsystem reads/writes this page directly.
#[repr(C, align(4096))]
pub struct VApicPage {
/// Raw 4 KiB backing store. Register offsets are defined by the xAPIC
/// MMIO layout. Key offsets (in bytes):
///
/// | Offset | Register | Description |
/// |--------|----------|-------------|
/// | 0x020 | LAPIC ID | Local APIC ID (read-only in x2APIC mode) |
/// | 0x030 | LAPIC Version | APIC version and max LVT entry |
/// | 0x080 | TPR | Task Priority Register |
/// | 0x0B0 | EOI | End of Interrupt (write-only) |
/// | 0x0D0 | LDR | Logical Destination Register |
/// | 0x0E0 | DFR | Destination Format Register (flat/cluster) |
/// | 0x0F0 | SVR | Spurious Interrupt Vector Register |
/// | 0x100..0x170 | ISR | In-Service Register (256 bits) |
/// | 0x180..0x1F0 | TMR | Trigger Mode Register (256 bits) |
/// | 0x200..0x270 | IRR | Interrupt Request Register (256 bits) |
/// | 0x280 | ESR | Error Status Register |
/// | 0x300 | ICR_LOW | Interrupt Command Register (low 32 bits) |
/// | 0x310 | ICR_HIGH | Interrupt Command Register (high 32 bits) |
/// | 0x320 | LVT Timer | LVT Timer entry |
/// | 0x350 | LVT LINT0 | LVT LINT0 entry |
/// | 0x360 | LVT LINT1 | LVT LINT1 entry |
/// | 0x380 | Timer ICR | Timer Initial Count Register |
/// | 0x390 | Timer CCR | Timer Current Count Register (read-only) |
/// | 0x3E0 | Timer DCR | Timer Divide Configuration Register |
///
/// Access helpers: use `read_reg(offset)` and `write_reg(offset, val)`
/// which perform volatile 32-bit reads/writes at the given byte offset.
data: [u8; 4096],
}
// VApicPage: [u8;4096] with align(4096) = 4096 bytes.
// Hardware-facing struct — CPU reads/writes directly.
const_assert!(core::mem::size_of::<VApicPage>() == 4096);
impl VApicPage {
/// Byte offsets for commonly accessed APIC registers.
pub const LAPIC_ID: usize = 0x020;
pub const LAPIC_VERSION: usize = 0x030;
pub const TPR: usize = 0x080;
pub const EOI: usize = 0x0B0;
pub const LDR: usize = 0x0D0;
pub const DFR: usize = 0x0E0;
pub const SVR: usize = 0x0F0;
pub const ISR_BASE: usize = 0x100;
pub const TMR_BASE: usize = 0x180;
pub const IRR_BASE: usize = 0x200;
pub const ESR: usize = 0x280;
pub const ICR_LOW: usize = 0x300;
pub const ICR_HIGH: usize = 0x310;
pub const LVT_TIMER: usize = 0x320;
pub const LVT_LINT0: usize = 0x350;
pub const LVT_LINT1: usize = 0x360;
pub const TIMER_ICR: usize = 0x380;
pub const TIMER_CCR: usize = 0x390;
pub const TIMER_DCR: usize = 0x3E0;
/// Read a 32-bit APIC register at the given byte offset.
///
/// # Safety
///
/// `offset` must be a valid register offset (16-byte aligned, within 4 KiB).
/// Caller must hold `vcpu.run_lock` or be in VM-exit context for this vCPU.
pub unsafe fn read_reg(&self, offset: usize) -> u32 {
let ptr = (self.data.as_ptr() as *const u32).add(offset / 4);
core::ptr::read_volatile(ptr)
}
/// Write a 32-bit APIC register at the given byte offset.
///
/// # Safety
///
/// `offset` must be a valid register offset (16-byte aligned, within 4 KiB).
/// Caller must hold `vcpu.run_lock` or be in VM-exit context for this vCPU.
pub unsafe fn write_reg(&mut self, offset: usize, val: u32) {
let ptr = (self.data.as_mut_ptr() as *mut u32).add(offset / 4);
core::ptr::write_volatile(ptr, val);
}
}
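Note that the 256-bit ISR/TMR/IRR registers are not contiguous bytes: each 32-bit sub-register sits at a 16-byte stride (0x200, 0x210, ... 0x270 for the IRR). A sketch of scanning the IRR for the highest pending vector, using that layout against a plain buffer for illustration (a real `VApicPage` would use the volatile `read_reg` helper):

```rust
const IRR_BASE: usize = 0x200;

/// Read one 32-bit IRR sub-register. Each 32-bit chunk of the
/// 256-bit register sits at a 16-byte stride in the APIC page.
fn irr_word(page: &[u8; 4096], chunk: usize) -> u32 {
    let off = IRR_BASE + chunk * 0x10;
    u32::from_le_bytes(page[off..off + 4].try_into().unwrap())
}

/// Highest pending interrupt vector, or None if the IRR is empty.
/// Scans chunks from high to low; within a chunk, the highest set bit wins.
fn highest_pending(page: &[u8; 4096]) -> Option<u8> {
    for chunk in (0..8).rev() {
        let word = irr_word(page, chunk);
        if word != 0 {
            let bit = 31 - word.leading_zeros() as usize;
            return Some((chunk * 32 + bit) as u8);
        }
    }
    None
}

fn main() {
    let mut page = [0u8; 4096];
    // Mark vector 0x41 (= 65) pending: chunk 65/32 = 2, bit 65 % 32 = 1.
    page[IRR_BASE + 2 * 0x10] |= 1 << 1;
    assert_eq!(highest_pending(&page), Some(0x41));
    assert_eq!(highest_pending(&[0u8; 4096]), None);
}
```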
Exit reason dispatch (architecture-independent entry point):
/// Called by the trampoline after guest GPRs are saved.
/// Returns an action that tells the trampoline what to do next.
fn handle_vmexit(vcpu: &mut Vcpu, exit: VmExit) -> VmExitAction {
match exit.reason {
// --- Handled entirely in-kernel (fast path) ---
ExitReason::ExternalInterrupt => {
// Host interrupt arrived while guest was running. The trampoline
// already acknowledged the interrupt via the host IDT. Just
// re-enter the guest after the host ISR completes.
VmExitAction::ReenterGuest
}
ExitReason::PreemptionTimer => {
// Host scheduler quantum expired. Yield to the scheduler.
// The scheduler will re-enter the guest when re-scheduled.
VmExitAction::Reschedule
}
ExitReason::EptViolation | ExitReason::Stage2Fault | ExitReason::GuestPageFault => {
// Second-level page fault. Handled via the EPT violation path
// specified in Section 19.1.4.6.
// Called from VM exit context — interrupts are re-enabled after VM
// exit. alloc_slat_page uses GFP_ATOMIC — must not
// sleep. A pre-allocated SLAT page pool (default 256 pages per VM)
// absorbs allocation bursts. If the pool is exhausted, GFP_ATOMIC
// allocation from the buddy allocator is attempted; failure returns
// VM_FAULT_OOM to the guest.
handle_slat_fault(vcpu, exit.guest_phys_addr, exit.fault_flags)
}
ExitReason::Cpuid => {
handle_cpuid_exit(vcpu)
}
ExitReason::Msr(direction) => {
handle_msr_exit(vcpu, direction)
}
ExitReason::Hlt => {
handle_hlt_exit(vcpu)
}
ExitReason::ApicAccess => {
handle_apic_access(vcpu)
}
ExitReason::Hypercall => {
handle_hypercall(vcpu)
}
ExitReason::CrAccess(cr, direction) => {
handle_cr_access(vcpu, cr, direction)
}
ExitReason::TaskSwitch => {
handle_task_switch(vcpu)
}
ExitReason::Xsetbv => {
handle_xsetbv(vcpu)
}
// --- Forwarded to userspace (slow path) ---
ExitReason::IoInstruction => {
handle_io_exit(vcpu)
}
ExitReason::MmioAccess => {
// Unmapped MMIO — no in-kernel device model matched.
populate_kvm_run_mmio(vcpu);
VmExitAction::ReturnToUserspace
}
ExitReason::EntryFailed => {
// Unreachable: vm_enter_and_exit() returns Err(KvmError::EntryFailed)
// for entry failures, which is propagated by `?` before reaching
// handle_vmexit. This arm exists as defensive programming: if the
// error path is ever changed to return Ok(exit), this arm catches
// it explicitly instead of falling through to Unknown.
unreachable!("EntryFailed handled in vm_enter_and_exit via Err path")
}
ExitReason::Shutdown => {
// SAFETY: kvm_run pointer valid -- vcpu.run_lock held.
unsafe { vcpu.kvm_run().exit_reason = KVM_EXIT_SHUTDOWN; }
VmExitAction::ReturnToUserspace
}
ExitReason::Unknown(code) => {
log_unknown_exit(vcpu, code);
// SAFETY: kvm_run pointer valid -- vcpu.run_lock held.
unsafe { vcpu.kvm_run().exit_reason = KVM_EXIT_INTERNAL_ERROR; }
VmExitAction::ReturnToUserspace
}
}
}
VmExitAction enum:
pub enum VmExitAction {
/// Re-enter the guest immediately (VMRESUME/ERET/sret).
ReenterGuest,
/// Yield to the host scheduler. The scheduler will call back into
/// umka-kvm when the vCPU thread is next scheduled.
Reschedule,
/// Return from KVM_RUN to userspace. The kvm_run shared page
/// contains exit reason and data for userspace to handle.
ReturnToUserspace,
}
KVM_RUN main loop (inside vcpu_run()):
/// The KVM_RUN loop. Called from the KVM_RUN ioctl handler after
/// acquiring vcpu.run_lock.
fn vcpu_run(vcpu: &mut Vcpu) -> Result<(), KvmError> {
loop {
// Check for pending work: signals, reschedule requests, ptrace,
// task_work, and other TIF flags. This is the equivalent of Linux's
// xfer_to_guest_mode_work_pending() which checks:
// - TIF_SIGPENDING: signal delivery needed
// - TIF_NEED_RESCHED: scheduler wants to preempt the vCPU thread
// - TIF_NOTIFY_RESUME: seccomp, ptrace, task work completion
// - TIF_NOTIFY_SIGNAL: signalfd and task_work
// Without this broad check, need_resched set between VM exit and
// the top-of-loop would be ignored until the next VM exit.
if xfer_to_guest_mode_work_pending(current_task()) {
// SAFETY: kvm_run pointer valid for KVM_RUN ioctl duration.
unsafe { vcpu.kvm_run().exit_reason = KVM_EXIT_INTR; }
return Ok(());
}
// immediate_exit: userspace set this to 1 (e.g., in a signal handler)
// to force KVM_RUN to return without entering the guest.
// SAFETY: kvm_run pointer valid for KVM_RUN ioctl duration.
if unsafe { vcpu.kvm_run().immediate_exit } != 0 {
unsafe { vcpu.kvm_run().exit_reason = KVM_EXIT_INTR; }
return Ok(());
}
// Pre-entry setup and guest execution via the architecture-specific
// trampoline. See vm_enter_and_exit() definition below.
let exit = vm_enter_and_exit(vcpu)?;
let action = handle_vmexit(vcpu, exit);
match action {
VmExitAction::ReenterGuest => continue,
VmExitAction::Reschedule => {
schedule(); // Yield to host scheduler, re-enter on wake.
continue;
}
VmExitAction::ReturnToUserspace => return Ok(()),
}
}
}
18.1.1.5 vm_enter_and_exit() — Guest Entry/Exit Trampoline¶
/// Perform pre-entry setup, enter the guest, handle the exit, and return
/// the exit reason. This is the single most critical function in the KVM
/// subsystem — it bridges the umka-kvm domain (PKEY 7) and umka-core
/// domain (PKEY 0) via the VMX/SVM/VHE trampoline.
///
/// # Architecture
/// The function runs in two phases:
/// 1. **Pre-entry** (in umka-kvm domain, PKEY 7): prepare guest state.
/// 2. **Trampoline** (domain switch to PKEY 0): execute VMLAUNCH/VMRESUME.
/// The trampoline is ~200 lines of verified assembly in umka-core that
/// performs the actual VMX operation. umka-kvm cannot execute VMX
/// instructions directly from PKEY 7.
///
/// # Returns
/// `Ok(VmExit)` with the exit reason and exit info, or
/// `Err(KvmError::EntryFailed)` if VM entry itself fails (invalid VMCS state).
///
/// # Security boundary
/// The domain switch (PKEY 7 → PKEY 0 → guest → PKEY 0 → PKEY 7) is the
/// security boundary for the virtualization stack. The trampoline validates
/// all VMCS fields before executing VMLAUNCH/VMRESUME.
fn vm_enter_and_exit(vcpu: &mut Vcpu) -> Result<VmExit, KvmError> {
// Disable preemption for the entire VM entry/exit sequence. The pCPU
// check (step 0), VMCS load, and VMLAUNCH/VMRESUME must all execute on
// the same physical CPU. Without preemption disable, the scheduler could
// migrate this thread between smp_processor_id() and the VMCS load,
// causing the VMCS to be loaded on the wrong pCPU.
let _preempt = preempt_disable();
// Step 0: pCPU migration check. If the vCPU thread was rescheduled
// to a different physical CPU since the last VM entry, the VMCS/VMCB
// must be migrated. On x86 VMX: VMCLEAR on the old pCPU (flushes
// VMCS to memory), then VMPTRLD on the current pCPU (loads the VMCS
// from memory into the new pCPU's working set). Host-state fields
// (HOST_RSP, HOST_CR3, HOST_GS_BASE, HOST_TR_BASE, HOST_GDTR_BASE,
// HOST_IDTR_BASE, HOST_IA32_SYSENTER_ESP) must be updated for the
// new pCPU. On AMD SVM: the VMCB is in normal memory (no per-pCPU
// affinity), but HOST_SAVE_AREA must be updated.
// On AArch64 VHE: no action needed (no per-pCPU binding).
// On RISC-V H-ext: no per-pCPU VMCS equivalent.
let current_cpu = smp_processor_id();
if vcpu.last_loaded_cpu != current_cpu {
arch::current::kvm::vcpu_load(vcpu, current_cpu)?;
vcpu.last_loaded_cpu = current_cpu;
}
// Step 1: VMID/ASID staleness check. If the vCPU's VMID is stale
// (guest TLB may contain entries from a different VM's old VMID),
// request a new VMID from the allocator and mark TLB flush needed.
// The stored generation comes from allocate() itself rather than from
// a second load of the global counter: a rollover between two loads
// would leave vcpu.vmid_gen stale despite the re-allocation.
let current_gen = (VMID_ALLOCATOR.state.load(Acquire) >> 32) as u32;
if vcpu.vmid_gen != current_gen {
    let (gen, vmid) = VMID_ALLOCATOR.allocate();
    vcpu.vmid = vmid as u16;
    vcpu.vmid_gen = gen;
    // On x86: set VPID flush bits in VMCS. On ARM: update VMID in VTTBR.
}
// Step 2: Load guest register state into VMCS/VMCB.
// Only "dirty" fields need writing (tracked by vcpu.dirty_vmcs_fields bitmap).
arch::current::kvm::load_guest_state(vcpu)?;
// Step 3: Interrupt injection window check.
// If the guest has interrupts enabled (RFLAGS.IF=1 on x86, or PSTATE.I=0 on ARM)
// and there is a pending virtual interrupt, inject it now:
if let Some(vec) = vcpu.pending_interrupt() {
arch::current::kvm::inject_interrupt(vcpu, vec)?;
}
// Step 4: Preemption timer programming (x86 VMX only).
// Calculate TSC ticks until the next scheduler preemption point and write
// to VMCS VMX_PREEMPTION_TIMER_VALUE. This ensures the vCPU exits before
// its scheduler time slice expires.
//
// `preemption_timer_ticks()` definition:
// ```rust
// fn preemption_timer_ticks(vcpu: &Vcpu) -> u32 {
// // The remaining scheduler slice is communicated via the per-task
// // `slice_remaining_ns` field, which is updated by the scheduler on
// // every tick and context switch. umka-kvm reads this field from
// // the current task struct (same address space, Tier 1 read-only
// // access to Tier 0 task data via the CpuLocal current_task pointer).
// let remaining_ns = current_task().sched_entity.slice_remaining_ns;
// let tsc_khz = boot::tsc_freq_khz();
// let shift = vcpu.vm.vmx_misc_preempt_shift; // IA32_VMX_MISC[4:0]
//         let ticks: u64 = remaining_ns * tsc_khz / (1_000_000 * (1u64 << shift));
// // Clamp to u32::MAX (VMCS preemption timer field is 32-bit).
// core::cmp::min(ticks, u32::MAX as u64) as u32
// }
// ```
#[cfg(target_arch = "x86_64")]
{
let ticks = preemption_timer_ticks(vcpu);
arch::current::kvm::set_preemption_timer(vcpu, ticks);
}
// Step 5: Host domain key save (arch-dependent).
// Save the current host isolation domain key permissions.
// On x86 with MPK: saves PKRU (umka-kvm domain key permissions).
// On AArch64 with POE: saves POR_EL0. On ARMv7 with DACR: saves DACR.
// On architectures without hardware isolation (RISC-V, s390x,
// LoongArch64): `read_domain_key()` returns 0 (no-op sentinel).
// This is guaranteed by the arch::current::isolation contract:
// `read_domain_key()` is always callable on all 8 architectures and
// returns a well-defined value (0 when no fast isolation is available).
// The guest PKRU/domain-key is loaded by the trampoline after VMRESUME,
// or by hardware if the VMCS PKRU load controls are set.
vcpu.host_pkru = arch::current::isolation::read_domain_key();
// Guest PKRU will be loaded by the trampoline.
//
// SAFETY: The guest PKRU value is active in Ring 0 for ~3-5 instructions
// between the trampoline's WRPKRU(guest_pkru) and VMLAUNCH. This window
// is safe because: (1) the trampoline is verified assembly code that does
// not perform any memory accesses between WRPKRU and VMLAUNCH, (2) EPT
// is active so guest-physical address space is not accessible via Ring 0
// linear addresses, (3) this matches Linux's pattern for PKRU handling
// in KVM. UmkaOS uses PKRU for Tier 1 domain isolation (unlike Linux),
// but the security impact is identical — the trampoline code is in the
// Nucleus (non-replaceable, verified).
// Step 6: Domain switch to PKEY 0 and invoke trampoline.
// The trampoline performs:
// a. WRPKRU to PKEY 0 (if not already there)
// b. L1TF cache flush (if applicable, VERW or L1D_FLUSH)
// c. Extended state save/load (XFD swap if guest uses different XSAVE features)
// d. VMLAUNCH (if !vcpu.arch_state.launched) or VMRESUME (if launched)
// e. On VM exit: save guest registers, load host registers
// f. WRPKRU back to PKEY 7 (umka-kvm domain)
// g. Return VmExit struct
let exit = arch::current::kvm::vmx_trampoline_enter(vcpu)?;
// Step 7: Post-exit PKRU restore.
// The trampoline already restored PKRU to the umka-kvm domain key.
// Sync pkru_shadow with the restored value.
CpuLocal::get_mut().pkru_shadow = vcpu.host_pkru;
// Step 8: Guest PKRU was already saved by the trampoline.
// The architecture-specific trampoline executes `rdpkru()` immediately
// after VM exit (before any domain switch) and writes the result to
// `vcpu.guest_pkru`. It then restores host PKRU via `wrpkru()`.
// No action needed here — the trampoline handles both save and restore.
// Step 9: Mark as launched (for VMRESUME on next entry).
vcpu.arch_state.launched = true;
// Step 10: Handle VM entry failure.
if exit.reason == ExitReason::EntryFailed {
// SAFETY: kvm_run pointer valid for KVM_RUN ioctl duration.
unsafe {
vcpu.kvm_run().exit_reason = KVM_EXIT_FAIL_ENTRY;
vcpu.kvm_run().exit_data.fail_entry.hardware_entry_failure_reason =
exit.exit_qual;
}
return Err(KvmError::EntryFailed);
}
Ok(exit)
}
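Step 1's VMID staleness check above depends on the interplay between `alloc_vmid()` and `VMID_GENERATION`. A minimal, host-agnostic sketch of that interplay; the 16-bit limit and the `ensure_vmid` helper are illustrative assumptions, and a real host would also broadcast a full guest-TLB flush before reusing the VMID space after rollover:

```rust
use std::sync::atomic::{AtomicU64, Ordering::{AcqRel, Acquire, SeqCst}};

static VMID_GENERATION: AtomicU64 = AtomicU64::new(1);
static NEXT_VMID: AtomicU64 = AtomicU64::new(1);

/// 16-bit VMID space (assumption: matches Intel VPID and AArch64 VMID-16).
const VMID_MAX: u64 = 0xFFFF;

/// Allocate a fresh VMID; on exhaustion, bump the generation (invalidating
/// every cached VMID on every vCPU) and restart from 1.
fn alloc_vmid() -> (u16, u64) {
    loop {
        let v = NEXT_VMID.fetch_add(1, AcqRel);
        if v <= VMID_MAX {
            return (v as u16, VMID_GENERATION.load(Acquire));
        }
        // Rollover: reset the counter and open a new generation.
        NEXT_VMID.store(1, SeqCst);
        VMID_GENERATION.fetch_add(1, AcqRel);
    }
}

/// Per-vCPU staleness check mirroring step 1 of `vm_enter_and_exit`.
/// Returns true when the caller must also request a guest TLB flush.
fn ensure_vmid(cached_vmid: &mut u16, cached_gen: &mut u64) -> bool {
    let current_gen = VMID_GENERATION.load(Acquire); // loaded once (TOCTOU note)
    if *cached_gen == current_gen {
        return false;
    }
    let (vmid, gen) = alloc_vmid();
    *cached_vmid = vmid;
    *cached_gen = gen;
    true
}
```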
VcpuRunState — vCPU execution state enum:
/// Formal definition of vCPU run states. Stored as `AtomicU8` in `Vcpu.run_state`.
#[repr(u8)]
pub enum VcpuRunState {
/// vCPU is actively executing guest code or handling an exit.
Running = 0,
/// vCPU executed HLT instruction; waiting for interrupt injection.
Halted = 1,
/// vCPU is paused by userspace (migration, snapshot, debug).
Paused = 2,
/// vCPU is parked during crash recovery (domain reload).
Parked = 3,
}
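Because `run_state` is an `AtomicU8`, wakers (IOAPIC routing, MSI injection, the `KVM_INTERRUPT` ioctl) race with each other to transition a halted vCPU back to `Running`. A hedged sketch of the CAS discipline this implies; `try_wake` and `mark_halted` are hypothetical helper names:

```rust
use std::sync::atomic::{AtomicU8, Ordering::{AcqRel, Acquire, Release}};

const RUNNING: u8 = 0; // VcpuRunState::Running
const HALTED: u8 = 1;  // VcpuRunState::Halted

/// Attempt to wake a halted vCPU. Only one waker wins the CAS, so exactly one
/// caller owns the follow-up work of re-queueing the vCPU thread. Returns
/// false if the vCPU was not halted or another waker raced ahead.
fn try_wake(run_state: &AtomicU8) -> bool {
    run_state
        .compare_exchange(
            HALTED,
            RUNNING,
            AcqRel,  // success: publish the wakeup, observe halt-side stores
            Acquire, // failure: still read the current state coherently
        )
        .is_ok()
}

/// Halt-side store, as at the end of `handle_hlt_exit` above.
fn mark_halted(run_state: &AtomicU8) {
    run_state.store(HALTED, Release);
}
```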
Exit handler function signatures (all return VmExitAction):
/// All exit handlers return VmExitAction to tell the vcpu_run loop
/// whether to re-enter the guest, reschedule, or return to userspace.
fn handle_cpuid_exit(vcpu: &mut Vcpu) -> VmExitAction;
fn handle_msr_exit(vcpu: &mut Vcpu, direction: MsrDirection) -> VmExitAction;
fn handle_hlt_exit(vcpu: &mut Vcpu) -> VmExitAction {
// Check for pending interrupt -- if the in-kernel LAPIC has a pending
// interrupt or vcpu.pending_irq is set, inject and re-enter immediately.
if vcpu.has_pending_interrupt() {
return VmExitAction::ReenterGuest;
}
// Halt-poll: spin for up to halt_poll_ns checking for interrupts.
// Uses sched_idle_enter/exit to avoid charging poll time as vruntime.
// See [Section 18.3](#kvm-operational--vcpu-scheduling-integration) for the full
// halt-poll algorithm including per-quantum budget accounting.
let poll_ns = vcpu.vm.halt_poll_ns;
let start = sched_clock_nanos();
let guard = rq_lock_irqsave();
sched_idle_enter(&mut guard);
rq_unlock_irqrestore(guard);
// Halt-poll spin loop.
loop {
if vcpu.has_pending_interrupt() {
// Interrupt arrived during poll -- inject and re-enter.
let guard = rq_lock_irqsave();
sched_idle_exit(&mut guard);
rq_unlock_irqrestore(guard);
vcpu.halt_poll_budget_remaining_ns =
vcpu.halt_poll_budget_remaining_ns
.saturating_sub(sched_clock_nanos() - start);
return VmExitAction::ReenterGuest;
}
let elapsed = sched_clock_nanos() - start;
if elapsed >= poll_ns || elapsed >= vcpu.halt_poll_budget_remaining_ns {
break; // Poll window or per-quantum budget exhausted.
}
core::hint::spin_loop();
}
// Poll expired without interrupt. Mark vCPU as halted and yield.
let guard = rq_lock_irqsave();
sched_idle_exit(&mut guard);
rq_unlock_irqrestore(guard);
vcpu.halt_poll_budget_remaining_ns =
vcpu.halt_poll_budget_remaining_ns
.saturating_sub(sched_clock_nanos() - start);
vcpu.run_state.store(VcpuRunState::Halted as u8, Release);
VmExitAction::Reschedule
}
fn handle_apic_access(vcpu: &mut Vcpu) -> VmExitAction;
fn handle_hypercall(vcpu: &mut Vcpu) -> VmExitAction;
fn handle_cr_access(vcpu: &mut Vcpu, cr: u8, direction: CrDirection) -> VmExitAction;
fn handle_task_switch(vcpu: &mut Vcpu) -> VmExitAction;
fn handle_xsetbv(vcpu: &mut Vcpu) -> VmExitAction;
fn handle_io_exit(vcpu: &mut Vcpu) -> VmExitAction {
// Decode PIO parameters from VMCS exit qualification (x86) or
// architecture-equivalent register (AArch64: ESR_EL2, RISC-V: htval).
let port = arch::current::kvm::io_exit_port(vcpu);
let direction = arch::current::kvm::io_exit_direction(vcpu); // In or Out
let size = arch::current::kvm::io_exit_size(vcpu); // 1, 2, or 4 bytes
let data = arch::current::kvm::io_exit_data(vcpu);
// Look up in-kernel device model via the PIO bus (XArray keyed by port).
// The MMIO bus (vm.io_bus.mmio) uses the same pattern -- see the MmioAccess
// arm above for the MMIO dispatch flow.
if let Some(dev) = vcpu.vm.io_bus.pio.xa_load(port as u64) {
// In-kernel device handles this port (PIT, PIC, IOAPIC, serial).
match direction {
IoDirection::In => {
let result = dev.ops.read(port, size);
arch::current::kvm::set_io_exit_data(vcpu, result);
}
IoDirection::Out => {
dev.ops.write(port, size, data);
}
}
// Advance guest RIP past the I/O instruction.
arch::current::kvm::advance_rip(vcpu);
VmExitAction::ReenterGuest
} else {
// No in-kernel device -- forward to userspace via KVM_EXIT_IO.
// SAFETY: kvm_run pointer valid -- vcpu.run_lock held.
unsafe {
vcpu.kvm_run().exit_reason = KVM_EXIT_IO;
vcpu.kvm_run().exit_data.io.direction = direction as u8;
vcpu.kvm_run().exit_data.io.size = size;
vcpu.kvm_run().exit_data.io.port = port;
vcpu.kvm_run().exit_data.io.count = 1;
vcpu.kvm_run().exit_data.io.data_offset =
core::mem::offset_of!(KvmRun, io_data) as u64;
// For OUT: copy data to the kvm_run shared page.
if direction == IoDirection::Out {
core::ptr::write(vcpu.kvm_run().io_data.as_mut_ptr(), data);
}
}
VmExitAction::ReturnToUserspace
}
}
fn handle_slat_fault(vcpu: &mut Vcpu, guest_phys_addr: u64, fault_flags: SlatFaultFlags) -> VmExitAction;
Key exit handlers:
- CPUID: umka-kvm maintains a per-VM CPUID table (set by `KVM_SET_CPUID2`). On CPUID exit, the handler looks up guest EAX/ECX (leaf/subleaf) in the table and writes the result to guest EAX/EBX/ECX/EDX. Guest RIP is advanced past the 2-byte CPUID instruction. Handled entirely in-kernel; no userspace round-trip.
- MSR access: umka-kvm maintains one 4 KiB MSR bitmap region divided into four 1 KiB sections per Intel SDM Vol. 3C Section 25.6.9: read-low (offset 0x000, MSRs 0x00000000-0x00001FFF), read-high (offset 0x400, MSRs 0xC0000000-0xC0001FFF), write-low (offset 0x800, MSRs 0x00000000-0x00001FFF), write-high (offset 0xC00, MSRs 0xC0000000-0xC0001FFF). Commonly passed-through MSRs (IA32_TSC, IA32_TSC_AUX, IA32_SYSENTER_*) have their bitmap bits cleared — the guest accesses them directly without exit. Intercepted MSRs (IA32_EFER, IA32_APIC_BASE, IA32_TSC_DEADLINE, IA32_SPEC_CTRL, IA32_STAR, IA32_LSTAR) are emulated in the exit handler: the handler validates the value, applies it to the VMCS guest-state area or internal shadow, and advances RIP.
- HLT: If the in-kernel LAPIC has a pending interrupt or the vCPU's `pending_irq` is set, inject the interrupt and re-enter immediately (no halt). Otherwise, attempt halt-polling before yielding (Section 18.3): call `sched_idle_enter()`, spin for up to `halt_poll_ns` checking for interrupts, then `sched_idle_exit()`. If an interrupt arrives during the poll window, inject and re-enter without a context switch. If the poll window expires (or the per-quantum halt-poll budget is exhausted), mark the vCPU as halted (`run_state = Halted`) and yield to the scheduler. The vCPU is woken when an interrupt is routed to it (via IOAPIC redirection, MSI injection, or the `KVM_INTERRUPT` ioctl).
- I/O port: The exit handler reads the port number, direction (in/out), size (1/2/4 bytes), and data from the VMCS exit qualification. If an in-kernel device model handles this port (PIT, PIC, IOAPIC, serial for debug), it is handled without returning to userspace. Otherwise, the handler populates the `kvm_run.io` struct and returns `KVM_EXIT_IO` to userspace.
- Hypercall (VMCALL/HVC): Recognized hypercalls:
  - `KVM_HC_VAPIC_POLL_IRQ` (1): Check for pending virtual interrupts.
  - `KVM_HC_MMU_OP` (2): Deprecated; returns `KVM_ENOSYS` (1000) in RAX — not POSIX `-ENOSYS` (-38). Guests check for the positive value 1000 per the KVM paravirt ABI (include/uapi/linux/kvm_para.h).
  - `KVM_HC_KICK_CPU` (5): Wake a halted vCPU (PV spinlock support).
  - `KVM_HC_SEND_IPI` (10): Batch IPI injection (up to 128 targets).
  - `KVM_HC_SCHED_YIELD` (11): Hint that the vCPU is spinning; yield to the scheduler.
  - `KVM_HC_MAP_GPA_RANGE` (12): Shared/private page conversion (confidential VMs, see Section 9.7).
  - Unrecognized hypercall numbers: forwarded to userspace as `KVM_EXIT_HYPERCALL`. If userspace also does not handle the hypercall, it must write `KVM_ENOSYS` (1000) into RAX before re-entering — not `-ENOSYS`.
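The dispatch rule above (in-kernel handlers for recognized numbers, the positive `KVM_ENOSYS` = 1000 for the deprecated one, userspace for the rest) can be sketched as a pure routing function; handler bodies are elided, and `HcResult` is a hypothetical type:

```rust
/// KVM paravirt hypercall numbers (values per include/uapi/linux/kvm_para.h).
const KVM_HC_VAPIC_POLL_IRQ: u64 = 1;
const KVM_HC_MMU_OP: u64 = 2;
const KVM_HC_KICK_CPU: u64 = 5;
const KVM_HC_SEND_IPI: u64 = 10;
const KVM_HC_SCHED_YIELD: u64 = 11;
const KVM_HC_MAP_GPA_RANGE: u64 = 12;
/// Positive 1000, not POSIX -ENOSYS (-38): the KVM paravirt ABI defines its
/// own error namespace so guests can detect "no such hypercall" portably.
const KVM_ENOSYS: u64 = 1000;

enum HcResult {
    /// Write this value into guest RAX and re-enter.
    Ret(u64),
    /// Forward to userspace as KVM_EXIT_HYPERCALL.
    ToUserspace,
}

/// Route a hypercall number to its in-kernel handler. Real handlers are
/// elided; `Ret(0)` stands in for each recognized hypercall's success path.
fn dispatch_hypercall(nr: u64) -> HcResult {
    match nr {
        KVM_HC_VAPIC_POLL_IRQ | KVM_HC_KICK_CPU | KVM_HC_SEND_IPI
        | KVM_HC_SCHED_YIELD | KVM_HC_MAP_GPA_RANGE => HcResult::Ret(0),
        KVM_HC_MMU_OP => HcResult::Ret(KVM_ENOSYS), // deprecated
        _ => HcResult::ToUserspace,
    }
}
```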
handle_slat_fault — Second-Level Address Translation fault handler:
/// Handle an EPT violation (x86-64), Stage-2 fault (AArch64), or guest page
/// fault (RISC-V H-extension). Called from the VM exit dispatch loop in
/// interrupt-enabled context. Host page resolution (`get_user_pages_fast`) may
/// sleep when the host page needs to be faulted in. SLAT page table allocation
/// uses GFP_ATOMIC (must not sleep) from the pre-allocated SLAT page pool.
///
/// # Algorithm
/// 1. Translate the faulting GPA through the VM's memslot table
/// (`KvmVm::memslots`) to obtain the corresponding HVA.
/// - If no memslot covers this GPA: set `kvm_run.exit_reason = KVM_EXIT_MMIO`,
/// populate `kvm_run.mmio` with GPA/len/is_write, return
/// `VmExitAction::ReturnToUserspace` for userspace MMIO emulation.
/// 2. Resolve the HVA to a host physical address (HPA) via the host page tables
/// (`follow_pte(hva)` under RCU read lock).
/// - If the HVA is not mapped in the host: trigger a host-side page fault
/// via `get_user_pages_fast(hva, 1, write, &page)`. On failure (ENOMEM),
/// retry once after direct reclaim. If still failing, return
/// `VmExitAction::ReturnToUserspace` with `KVM_EXIT_INTERNAL_ERROR`
/// (suberror = `KVM_INTERNAL_ERROR_EMULATION`). Do NOT inject `#PF` into
/// the guest — a host OOM condition should not be visible as a guest page
/// fault (the guest would interpret it as a page table miss and loop
/// infinitely trying to page in). Linux KVM also does not inject #PF for
/// host-side allocation failures.
/// 3. Determine the SLAT mapping size:
/// - If both the GPA and HPA are 2 MiB-aligned and the memslot covers a
/// contiguous 2 MiB region: install a 2 MiB huge page SLAT PTE.
/// - If 1 GiB-aligned and contiguous: install a 1 GiB SLAT PTE.
/// - Otherwise: install a 4 KiB SLAT PTE.
/// 4. Walk the SLAT page table (EPT/Stage-2/G-stage), allocating intermediate
/// page table pages as needed from the VM's SLAT page pool
/// (`KvmVm::slat_page_pool`, pre-allocated, refilled from GFP_ATOMIC).
/// If SLAT page allocation fails (pool exhausted AND `GFP_ATOMIC` buddy
/// allocation fails): return `VmExitAction::ReturnToUserspace` with
/// `KVM_EXIT_INTERNAL_ERROR` (suberror = `KVM_INTERNAL_ERROR_EMULATION`).
/// Do NOT inject `#PF`/data abort into the guest — host OOM is not a
/// guest page table miss. The VMM (QEMU/Firecracker) handles the error
/// by retrying or reporting to the user. The host OOM killer may
/// independently select a different process for termination.
/// 5. Install the SLAT PTE with the resolved HPA and access permissions
/// derived from `fault_flags` (read/write/execute) and memslot flags.
/// 6. If this is a write fault and the page is CoW in the host: the host-side
/// `get_user_pages_fast` with `write=true` triggers CoW break before step 3.
/// 7. Return `VmExitAction::ReenterGuest`.
fn handle_slat_fault(
vcpu: &mut Vcpu,
guest_phys_addr: u64,
fault_flags: SlatFaultFlags,
) -> VmExitAction {
// Implementation per architecture:
// - x86-64 (EPT): Intel SDM Vol. 3C §29.3 — EPT violations
// - AArch64 (Stage-2): ARMv8 ARM §D5.10 — Stage 2 translation faults
// - RISC-V (G-stage): Privileged ISA §8.5 — Guest page faults
// Steps 1-7 as documented above.
}
bitflags! {
/// Flags extracted from the VM exit qualification describing the fault.
pub struct SlatFaultFlags: u32 {
/// Fault was caused by a read access.
const READ = 1 << 0;
/// Fault was caused by a write access.
const WRITE = 1 << 1;
/// Fault was caused by an instruction fetch.
const EXECUTE = 1 << 2;
/// The GPA was not present in the SLAT (vs. permission violation).
const NOT_PRESENT = 1 << 3;
}
}
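On x86, these flags are decoded from the EPT-violation exit qualification (Intel SDM Vol. 3C: bits 0-2 describe the access, bits 3-5 the EPT permissions of the existing translation). A hedged decode sketch using plain bit constants in place of the `bitflags!` type:

```rust
// SlatFaultFlags bit values as defined in the bitflags! block above.
const READ: u32 = 1 << 0;
const WRITE: u32 = 1 << 1;
const EXECUTE: u32 = 1 << 2;
const NOT_PRESENT: u32 = 1 << 3;

/// Decode an x86 EPT-violation exit qualification into SlatFaultFlags bits.
/// Qualification bits 0-2 describe the faulting access; bits 3-5 are the EPT
/// permissions of the translation -- all three clear means the GPA had no
/// translation at all (NOT_PRESENT rather than a permission violation).
fn decode_ept_qualification(qual: u64) -> u32 {
    let mut flags = 0u32;
    if qual & (1 << 0) != 0 { flags |= READ; }
    if qual & (1 << 1) != 0 { flags |= WRITE; }
    if qual & (1 << 2) != 0 { flags |= EXECUTE; }
    if qual & 0b111_000 == 0 { flags |= NOT_PRESENT; }
    flags
}
```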
Architecture-specific VMX/SVM/VHE/H-extension implementations continue in Section 18.2. vCPU scheduling, in-kernel device models, nested virtualization, crash recovery, and guest memory integration continue in Section 18.3.
18.2 KVM Architecture Backends¶
18.2.1 x86-64 VMX Implementation¶
VMCS (Virtual Machine Control Structure) — Intel SDM Vol. 3C, Chapter 25:
The VMCS is a 4 KiB hardware-defined region that controls VMX operation. One VMCS
exists per vCPU. VMCS fields are accessed via VMREAD/VMWRITE instructions
(not direct memory access — the format is opaque and CPU-model-specific).
/// x86-64 VMX state for one vCPU.
/// Kernel-internal, not a KABI or wire struct — no `#[repr(C)]` required.
pub struct VmcsState {
/// Physical address of the 4 KiB VMCS region (allocated from umka-core's
/// page allocator, 4 KiB aligned). Written to the per-CPU VMCS pointer
/// via VMPTRLD before any VMREAD/VMWRITE.
pub vmcs_phys: u64,
/// Virtual address of the VMCS region (for the revision ID write at
/// VMCS initialization — the only direct memory access to the VMCS).
pub vmcs_virt: *mut u8,
/// MSR bitmap (one 4 KiB page, four 1 KiB sections): controls which MSR
/// accesses cause VM exits. Bit set = intercept; bit clear = passthrough.
/// Sections: read-low (0x000), read-high (0x400), write-low (0x800),
/// write-high (0xC00). Physical address stored in VMCS MSR bitmap address field.
pub msr_bitmap_phys: u64,
/// I/O bitmap pages (two 4 KiB pages, covering ports 0x0000-0xFFFF).
/// Bit set = intercept IN/OUT; bit clear = passthrough.
pub io_bitmap_a_phys: u64, // ports 0x0000-0x7FFF
pub io_bitmap_b_phys: u64, // ports 0x8000-0xFFFF
/// Posted interrupt descriptor (if posted interrupts are enabled).
/// 64-byte aligned structure used by hardware to inject interrupts
/// without VM exit.
pub posted_interrupt_desc: *mut PostedInterruptDesc,
/// Whether this VMCS has been launched (VMLAUNCH sets this; subsequent
/// entries use VMRESUME).
pub launched: bool,
}
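The four-section MSR bitmap layout described in `msr_bitmap_phys` maps each (MSR, direction) pair to a single bit. A sketch of the offset arithmetic; `msr_bitmap_pos` and `msr_set_passthrough` are hypothetical helper names:

```rust
/// Compute (byte offset, bit) within the 4 KiB VMX MSR bitmap for one MSR
/// access, per the four-section layout: read-low 0x000, read-high 0x400,
/// write-low 0x800, write-high 0xC00. Returns None for MSRs outside both
/// covered ranges (accesses to those MSRs always cause a VM exit).
fn msr_bitmap_pos(msr: u32, write: bool) -> Option<(usize, u8)> {
    let (section, index): (usize, u32) = if msr <= 0x1FFF {
        (if write { 0x800 } else { 0x000 }, msr)
    } else if (0xC000_0000..=0xC000_1FFF).contains(&msr) {
        (if write { 0xC00 } else { 0x400 }, msr - 0xC000_0000)
    } else {
        return None;
    };
    Some((section + index as usize / 8, (index % 8) as u8))
}

/// Clear the intercept bit: the guest then performs this MSR access directly,
/// without a VM exit (bit set = intercept, bit clear = passthrough).
fn msr_set_passthrough(bitmap: &mut [u8; 4096], msr: u32, write: bool) {
    if let Some((byte, bit)) = msr_bitmap_pos(msr, write) {
        bitmap[byte] &= !(1u8 << bit);
    }
}
```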
VMCS initialization (KVM_CREATE_VCPU path):
- Allocate a 4 KiB page from umka-core (`SlatHooks::alloc_slat_page`). Write the VMCS revision identifier (from the `IA32_VMX_BASIC` MSR) to the first 4 bytes.
- Execute `VMPTRLD` to make this VMCS current on the physical CPU.
- Write host-state fields — these define the CPU state restored on VM exit:
  - `HOST_CR0`, `HOST_CR3` (umka-core's page table root), `HOST_CR4`
  - `HOST_RSP` (trampoline stack pointer), `HOST_RIP` (trampoline entry point)
  - `HOST_CS_SELECTOR`, `HOST_SS_SELECTOR`, `HOST_DS_SELECTOR`, `HOST_ES_SELECTOR`, `HOST_FS_SELECTOR`, `HOST_GS_SELECTOR`, `HOST_TR_SELECTOR`
  - `HOST_FS_BASE`, `HOST_GS_BASE` (PerCpu/CpuLocal base), `HOST_TR_BASE`, `HOST_GDTR_BASE`, `HOST_IDTR_BASE`
  - `HOST_IA32_SYSENTER_CS`, `HOST_IA32_SYSENTER_ESP`, `HOST_IA32_SYSENTER_EIP`
  - `HOST_IA32_EFER`, `HOST_IA32_PAT`
- Write guest-state fields — initial guest CPU state:
  - `GUEST_CR0`, `GUEST_CR3`, `GUEST_CR4` (set by `KVM_SET_SREGS`)
  - `GUEST_RIP`, `GUEST_RSP`, `GUEST_RFLAGS` (set by `KVM_SET_REGS`)
  - Segment selectors, bases, limits, access rights for CS/DS/ES/SS/FS/GS/TR/LDTR
  - `GUEST_GDTR_BASE`, `GUEST_GDTR_LIMIT`, `GUEST_IDTR_BASE`, `GUEST_IDTR_LIMIT`
  - `GUEST_IA32_EFER`, `GUEST_IA32_PAT`, `GUEST_IA32_DEBUGCTL`
  - `GUEST_ACTIVITY_STATE` (0 = active, 1 = HLT, 2 = shutdown, 3 = wait-for-SIPI)
  - `GUEST_INTERRUPTIBILITY_STATE`, `GUEST_PENDING_DBG_EXCEPTIONS`
  - `VMCS_LINK_POINTER` = 0xFFFF_FFFF_FFFF_FFFF (no VMCS shadowing initially)
- Write VM-execution control fields:
  - Pin-based controls: external interrupt exiting, NMI exiting, virtual NMIs, VMX preemption timer (enabled — used for host scheduler integration)
  - Primary processor-based controls: HLT exiting, INVLPG exiting (disabled — guest manages its own TLB), CR3 load/store exiting (disabled when EPT is active), MOV DR exiting (disabled unless debug), I/O bitmap use, MSR bitmap use, MONITOR/MWAIT exiting, activate secondary controls
  - Secondary processor-based controls: enable EPT, enable VPID (Virtual Processor ID — tags TLB entries per-VM to avoid a full TLB flush on VM switch), unrestricted guest (allows real-mode execution in VMX non-root), APIC register virtualization, virtual interrupt delivery, PML enable, XSAVES/XRSTORS enable, INVPCID passthrough, TSC scaling
  - Exception bitmap: intercept #UD (for instruction emulation), #PF if EPT is not available (should not happen on any modern CPU), #DB/#BP for guest debugging when GDB is attached. All other exceptions are delivered to the guest.
  - EPT pointer: PML4 physical address | memory type (WB = 6) | page-walk length (4-1 = 3) | accessed/dirty flags enable
  - VPID: unique 16-bit value per vCPU (allocated from a per-host bitmap). VPID 0 is reserved for the host; the usable range is 1-65535.
- Write VM-exit control fields: save IA32_EFER, load IA32_EFER, host address-space size (64-bit), acknowledge interrupt on exit (allows the host IDT to handle the interrupt without a separate VMREAD of the exit interrupt info field).
- Write VM-entry control fields: load IA32_EFER and IA32_PAT; IA-32e mode guest (64-bit guest); entry to SMM (no).
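The EPT pointer encoding named in the execution-control list above packs the memory type, walk length, and accessed/dirty enable into the low bits of the PML4 physical address. A minimal sketch, assuming a 4-level walk and WB memory type; `build_eptp` is a hypothetical helper:

```rust
/// Build the EPT pointer (EPTP) value: bits 2:0 memory type (WB = 6),
/// bits 5:3 page-walk length minus one (4-level walk => 3), bit 6
/// accessed/dirty flags enable, bits 63:12 PML4 physical address.
fn build_eptp(pml4_phys: u64, enable_ad: bool) -> u64 {
    const EPT_MEMTYPE_WB: u64 = 6;
    const EPT_WALK_LEN_4: u64 = (4 - 1) << 3;
    const EPT_AD_ENABLE: u64 = 1 << 6;
    debug_assert_eq!(pml4_phys & 0xFFF, 0, "PML4 must be 4 KiB aligned");
    let mut eptp = pml4_phys | EPT_MEMTYPE_WB | EPT_WALK_LEN_4;
    if enable_ad {
        eptp |= EPT_AD_ENABLE;
    }
    eptp
}
```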
AMD SVM (VMCB) — AMD APM Vol. 2, Chapter 15:
AMD's equivalent uses a VMCB (Virtual Machine Control Block), also 4 KiB, but
directly memory-mapped (no VMREAD/VMWRITE — the hypervisor reads/writes VMCB
fields as a C struct). Key differences from VMX:
- Nested Page Tables (NPT) instead of EPT — functionally equivalent 4-level
page table for guest-physical → host-physical translation.
- VMRUN instruction (replaces VMLAUNCH/VMRESUME — single instruction for all
entries). #VMEXIT stores exit info directly in the VMCB.
- VMCB clean bits: the hypervisor marks which VMCB sections were modified since
the last VMRUN, allowing the CPU to skip reloading unchanged state.
- ASID (Address Space ID) for TLB tagging — analogous to Intel's VPID.
- VMCB.control.intercept_* fields replace VMX's execution control bitmaps.
- Secure Encrypted Virtualization (SEV/SEV-ES/SEV-SNP) extensions use the VMCB's
SEV_FEATURES and VMSA fields (see Section 9.7).
/// AMD VMCB (Virtual Machine Control Block). 4 KiB, 4 KiB-aligned.
/// The CPU reads/writes this struct directly on VMRUN / #VMEXIT.
/// Layout per AMD APM Vol. 2, Appendix B (rev 3.42+).
// kernel-internal, not KABI
#[repr(C, align(4096))]
pub struct Vmcb {
// --- Control Area (offset 0x000-0x3FF) ---
/// Intercept reads/writes to CRn (bits 0-15 = read, 16-31 = write).
pub intercept_cr: u32, // 0x000
/// Intercept reads/writes to DRn.
pub intercept_dr: u32, // 0x004
/// Exception intercept bitmap (bit N = intercept exception #N).
pub intercept_exc: u32, // 0x008
/// Miscellaneous intercepts (INTR, NMI, SMI, INIT, VINTR, etc.).
pub intercept_misc1: u32, // 0x00C
/// Extended intercepts (VMRUN, VMMCALL, VMLOAD, VMSAVE, STGI, CLGI, SKINIT, RDTSCP, etc.).
pub intercept_misc2: u32, // 0x010
/// Additional intercepts (MCOMMIT, TLBSYNC, etc.).
pub intercept_misc3: u32, // 0x014
pub _reserved0: [u8; 0x024], // 0x018-0x03B (36 bytes)
/// Pause filter threshold / count.
pub pause_filter_thresh: u16, // 0x03C
pub pause_filter_count: u16, // 0x03E
/// Physical address of IOPM (I/O Permission Map), 12 KiB.
pub iopm_base_pa: u64, // 0x040
/// Physical address of MSRPM (MSR Permission Map), 8 KiB.
pub msrpm_base_pa: u64, // 0x048
/// TSC offset added to guest RDTSC/RDTSCP.
pub tsc_offset: u64, // 0x050
/// Guest ASID (must be non-zero).
pub guest_asid: u32, // 0x058
/// TLB control: 0=do nothing, 1=flush all, 3=flush this ASID, 7=flush non-global.
pub tlb_control: u8, // 0x05C
pub _reserved1: [u8; 3], // 0x05D-0x05F
/// Virtual interrupt control (V_TPR, V_IRQ, V_INTR_PRIO, V_IGN_TPR, V_INTR_VECTOR, etc.).
pub v_intr: u64, // 0x060
/// Interrupt shadow / guest interrupt state.
pub interrupt_shadow: u64, // 0x068
/// Exit code (written by CPU on #VMEXIT). Maps to SVM_EXIT_* constants.
pub exit_code: u64, // 0x070
/// Exit info 1 (exit-code-specific qualification).
pub exit_info1: u64, // 0x078
/// Exit info 2 (additional qualification).
pub exit_info2: u64, // 0x080
/// Exit interrupt info (if exit during interrupt delivery).
pub exit_int_info: u64, // 0x088
/// Nested paging enable + SEV control bits.
pub np_enable: u64, // 0x090
/// AVIC/SEV-related fields.
pub avic_apic_bar: u64, // 0x098
/// GHCB physical address (for SEV-ES).
pub ghcb_pa: u64, // 0x0A0
/// Event injection (for injecting #VMEXIT, exceptions, interrupts into guest).
pub event_inject: u64, // 0x0A8
/// Nested page table CR3 (physical address of nPT root).
pub n_cr3: u64, // 0x0B0
/// LBR virtualization enable.
pub lbr_virt_enable: u64, // 0x0B8
/// VMCB clean bits (hypervisor marks which sections were modified since last VMRUN).
pub vmcb_clean: u32, // 0x0C0
pub _reserved2: [u8; 4], // 0x0C4
/// Next sequential RIP (saved on intercepts of INS/OUTS/etc.).
pub next_rip: u64, // 0x0C8
/// Number of bytes fetched (for instruction decode assist).
pub bytes_fetched: u8, // 0x0D0
pub guest_inst_bytes: [u8; 15], // 0x0D1-0x0DF
/// AVIC backing page / logical/physical table pointers.
pub avic_backing_page: u64, // 0x0E0
pub _reserved3: [u8; 8], // 0x0E8
pub avic_logical_table: u64, // 0x0F0
pub avic_physical_table: u64, // 0x0F8
pub _reserved4: [u8; 8], // 0x100
/// VMSA pointer (for SEV-ES encrypted state).
pub vmsa_pa: u64, // 0x108
/// SEV features (SNP, VMSA_REG_PROT, etc.).
pub sev_features: u64, // 0x110
pub _reserved5: [u8; 0x2E8], // 0x118-0x3FF
// --- State Save Area (offset 0x400-0xFFF) ---
/// Guest segment registers (ES, CS, SS, DS, FS, GS, GDTR, LDTR, IDTR, TR).
pub es: VmcbSegment, // 0x400
pub cs: VmcbSegment, // 0x410
pub ss: VmcbSegment, // 0x420
pub ds: VmcbSegment, // 0x430
pub fs: VmcbSegment, // 0x440
pub gs: VmcbSegment, // 0x450
pub gdtr: VmcbSegment, // 0x460
pub ldtr: VmcbSegment, // 0x470
pub idtr: VmcbSegment, // 0x480
pub tr: VmcbSegment, // 0x490
pub _reserved6: [u8; 0x2B], // 0x4A0-0x4CA
pub cpl: u8, // 0x4CB
pub _reserved7: [u8; 4], // 0x4CC
pub efer: u64, // 0x4D0
pub _reserved8: [u8; 0x70], // 0x4D8-0x547
pub cr4: u64, // 0x548
pub cr3: u64, // 0x550
pub cr0: u64, // 0x558
pub dr7: u64, // 0x560
pub dr6: u64, // 0x568
pub rflags: u64, // 0x570
pub rip: u64, // 0x578
pub _reserved9: [u8; 0x58], // 0x580-0x5D7
pub rsp: u64, // 0x5D8
pub s_cet: u64, // 0x5E0
pub ssp: u64, // 0x5E8
pub isst_addr: u64, // 0x5F0
pub rax: u64, // 0x5F8
pub star: u64, // 0x600
pub lstar: u64, // 0x608
pub cstar: u64, // 0x610
pub sfmask: u64, // 0x618
pub kernel_gs_base: u64, // 0x620
pub sysenter_cs: u64, // 0x628
pub sysenter_esp: u64, // 0x630
pub sysenter_eip: u64, // 0x638
pub cr2: u64, // 0x640
pub _reserved10: [u8; 0x20], // 0x648-0x667
pub g_pat: u64, // 0x668
pub dbgctl: u64, // 0x670
pub br_from: u64, // 0x678
pub br_to: u64, // 0x680
pub last_excp_from: u64, // 0x688
pub last_excp_to: u64, // 0x690
pub _reserved11: [u8; 0x968], // 0x698-0xFFF
}
// VMCB is exactly one 4 KiB page (AMD APM Vol. 2, §15.5.1).
const_assert!(size_of::<Vmcb>() == 4096);
/// VMCB segment descriptor (16 bytes each).
#[repr(C)]
pub struct VmcbSegment {
pub selector: u16,
pub attrib: u16,
pub limit: u32,
pub base: u64,
}
// VmcbSegment: 2+2+4+8 = 16 bytes.
const_assert!(size_of::<VmcbSegment>() == 16);
/// AMD SVM state for one vCPU.
pub struct VmcbState {
/// Physical address of the 4 KiB VMCB (4 KiB aligned).
pub vmcb_phys: u64,
/// Virtual address for direct field access.
pub vmcb: *mut Vmcb,
/// Host save area physical address (set in VM_HSAVE_PA MSR).
/// CPU saves host state here on VMRUN and restores on #VMEXIT.
pub host_save_area_phys: u64,
/// AMD MSRPM: 8 KiB (two 4 KiB pages). Uses 2 bits per MSR: bit 2n = read
/// intercept, bit 2n+1 = write intercept. Three MSR ranges:
/// 0x00000000-0x00001FFF, 0xC0000000-0xC0001FFF, 0xC0010000-0xC0011FFF.
/// Different from Intel VMX MSR bitmap (which is 4 KiB, 1 bit/MSR, four
/// separate 1 KiB sections). See AMD APM Vol. 2, Section 15.11.
pub msrpm_phys: u64,
/// I/O permission bitmap (three 4 KiB pages covering 0-65535).
pub iopm_phys: u64,
}
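The MSRPM arithmetic differs from Intel's: two bits per MSR across three ranges. A sketch of the offset computation; the range base offsets 0x0000/0x0800/0x1000 follow the conventional layout used by Linux KVM for the AMD MSRPM and are an assumption here, and `msrpm_pos` is a hypothetical helper:

```rust
/// Compute (byte offset, bit) within the 8 KiB AMD MSRPM for one MSR access.
/// Two bits per MSR: bit 2n intercepts reads, bit 2n+1 intercepts writes.
/// Each covered range spans 0x2000 MSRs (2 KiB of bitmap). Assumed bases:
/// 0x0000 for MSRs 0x0000_0000.., 0x0800 for 0xC000_0000..,
/// 0x1000 for 0xC001_0000.. MSRs outside all ranges are always intercepted.
fn msrpm_pos(msr: u32, write: bool) -> Option<(usize, u8)> {
    let (base, index): (usize, u32) = match msr {
        0x0000_0000..=0x0000_1FFF => (0x0000, msr),
        0xC000_0000..=0xC000_1FFF => (0x0800, msr - 0xC000_0000),
        0xC001_0000..=0xC001_1FFF => (0x1000, msr - 0xC001_0000),
        _ => return None,
    };
    let bitpos = index as usize * 2 + usize::from(write);
    Some((base + bitpos / 8, (bitpos % 8) as u8))
}
```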
umka-kvm abstracts VMX and SVM behind a common HvOps trait so that all
host-side logic above the trampoline level is architecture-neutral:
/// Architecture-specific VMX/SVM operations. Implemented by VmcsState (Intel)
/// and VmcbState (AMD). Called from the architecture-neutral exit handler.
pub trait HvOps {
/// Read a guest register from the hardware state area.
fn read_guest_reg(&self, reg: GuestReg) -> u64;
/// Write a guest register to the hardware state area.
fn write_guest_reg(&mut self, reg: GuestReg, val: u64);
/// Advance guest instruction pointer by `len` bytes.
fn advance_rip(&mut self, len: u8);
/// Inject a virtual interrupt (IRQ number, priority).
fn inject_irq(&mut self, vector: u8);
/// Inject an exception (vector, error code, CR2 for #PF).
fn inject_exception(&mut self, vector: u8, error_code: Option<u32>);
/// Set the EPT/NPT pointer (for EPT invalidation after page table changes).
fn set_slat_root(&mut self, root_phys: u64);
/// Invalidate TLB entries for this vCPU's VPID/ASID.
fn flush_guest_tlb(&self);
/// Read exit qualification / exit info (architecture-specific format).
fn exit_info(&self) -> ExitInfo;
}
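Everything above the trampoline can then be written against the trait alone. A hedged sketch with a simplified trait slice and a test double; the real trait uses `GuestReg` and more operations, and `MockHv` and `handle_simple_exit` are illustrative only:

```rust
/// Minimal slice of the HvOps surface needed for this sketch (the real trait
/// is larger; registers are indexed by usize here instead of `GuestReg`).
trait HvOps {
    fn read_guest_reg(&self, reg: usize) -> u64;
    fn write_guest_reg(&mut self, reg: usize, val: u64);
    fn advance_rip(&mut self, len: u8);
}

const REG_RAX: usize = 0;
const REG_RIP: usize = 16;

/// Test double standing in for VmcsState (Intel) or VmcbState (AMD).
struct MockHv { regs: [u64; 17] }

impl HvOps for MockHv {
    fn read_guest_reg(&self, reg: usize) -> u64 { self.regs[reg] }
    fn write_guest_reg(&mut self, reg: usize, val: u64) { self.regs[reg] = val; }
    fn advance_rip(&mut self, len: u8) { self.regs[REG_RIP] += len as u64; }
}

/// Architecture-neutral handler sketch: read the guest value, apply it
/// (elided), and step past the instruction. Nothing here names VMCS or VMCB
/// fields; the backend trait hides them entirely.
fn handle_simple_exit<H: HvOps>(hv: &mut H, insn_len: u8) -> u64 {
    let val = hv.read_guest_reg(REG_RAX);
    hv.advance_rip(insn_len);
    val
}
```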
18.2.2 AArch64 Host-Side Implementation¶
VHE mode (ARMv8.1+, preferred): UmkaOS runs at EL2. The guest runs at EL1/EL0.
VM entry is a controlled ERET to guest EL1; VM exit is a trap from guest EL1 to
host EL2. No EL2↔EL1 world switch is needed on the host side because VHE
transparently redirects EL1 system register accesses to their EL2 counterparts.
/// AArch64 VHE vCPU state.
pub struct ArmVheState {
/// Saved guest EL1 system registers (restored before ERET to guest,
/// saved after trap back to host EL2).
pub sctlr_el1: u64,
pub tcr_el1: u64,
pub ttbr0_el1: u64,
pub ttbr1_el1: u64,
pub mair_el1: u64,
pub vbar_el1: u64,
pub contextidr_el1: u64,
pub cpacr_el1: u64,
pub esr_el1: u64,
pub far_el1: u64,
pub sp_el1: u64,
pub elr_el1: u64,
pub spsr_el1: u64,
pub tpidr_el1: u64,
pub tpidr_el0: u64,
pub tpidrro_el0: u64,
pub mdscr_el1: u64, // Monitor Debug System Control (gdb inside VM)
pub amair_el1: u64, // Auxiliary Memory Attribute Indirection
pub par_el1: u64, // Physical Address Register (AT instruction result)
// Note: this lists the most critical saved registers. The implementation
// saves all EL1 system registers from the kvm_cpu_context, including
// afsr0_el1, afsr1_el1, and others.
pub cntvoff_el2: u64, // virtual timer offset
pub cntv_cval_el0: u64, // virtual timer compare value
pub cntv_ctl_el0: u64, // virtual timer control
/// Stage-2 translation table base register.
/// Points to the root of the IPA → PA page tables for this VM.
/// Written to VTTBR_EL2 before guest entry.
pub vttbr_el2: u64,
/// Hypervisor Configuration Register — controls trap behavior.
/// umka-kvm sets: VM bit (enable Stage-2), IMO/FMO/AMO (trap
/// interrupts to EL2), TWI/TWE (trap WFI/WFE), TSC (trap SMC).
pub hcr_el2: u64,
/// VMID (Virtual Machine Identifier) — 8-bit or 16-bit tag for
/// Stage-2 TLB entries (analogous to Intel VPID).
pub vmid: u16,
}
Stage-2 page tables: 4-level (or concatenated 3-level for 40-bit IPA) page
tables mapping IPA (Intermediate Physical Address) to PA (host physical).
Constructed by umka-kvm using the same SlatHooks allocator as x86 EPT.
TLBI VMALLE1IS (TLB invalidate all EL1, inner-shareable) flushes guest TLB
entries; TLBI IPAS2E1IS flushes a single IPA mapping.
/// AArch64 nVHE vCPU state. Used when VHE is unavailable (pre-ARMv8.1).
/// Both guest and host EL1 contexts must be saved/restored on every
/// VM entry/exit because the host runs at EL1 (not EL2).
pub struct ArmNvheState {
/// Saved guest EL1 context (restored before ERET to guest,
/// saved after trap back to EL2 stub).
pub guest_ctxt: ArmCpuContext,
/// Saved host EL1 context (saved on VM entry, restored on VM exit).
/// Includes all system registers that the guest may modify.
pub host_ctxt: ArmCpuContext,
/// EL2 vector table base for the hyp stub.
pub hyp_vbar: u64,
/// HCR_EL2 value programmed for this guest (trap configuration bits).
pub hcr_el2: u64,
}
nVHE mode (pre-ARMv8.1 or when VHE is disabled): The host kernel runs at EL1.
A small EL2 stub (~500 lines of assembly) handles world switches. On VM entry:
save host EL1 context → load guest EL1 context → ERET to guest EL1. On VM exit:
save guest EL1 context → restore host EL1 context → return to host. Cost:
~500-1000 cycles per entry/exit (vs ~200 for VHE) due to the full EL1 context
save/restore.
Phase 3 deliverable. The nVHE EL2 stub is specified alongside the full KVM implementation (Phase 3). The entry points and save/restore protocol below are the architectural contract; the assembly implementation will be written during Phase 3.
nVHE EL2 stub entry points:
| Entry point | Trigger | Action |
|---|---|---|
| `__kvm_hyp_init` | Boot (called once per CPU) | Install EL2 vector table, configure HCR_EL2 (VM trapping bits), save host EL2 state baseline |
| `__kvm_vcpu_run(vcpu)` | `KVM_RUN` ioctl | Save host EL1 context (31 GPRs, SP_EL1, ELR_EL1, SPSR_EL1, SCTLR_EL1, TTBR0/1_EL1, TCR_EL1, VBAR_EL1, MAIR_EL1, CONTEXTIDR_EL1), load guest EL1 context from `vcpu->arch.ctxt`, program Stage-2 VTTBR_EL2, ERET to guest |
| `__kvm_hyp_host_vector` | Guest trap/IRQ | Save guest EL1 context to `vcpu->arch.ctxt`, restore host EL1 context, return to host EL1 with exit reason in x0 |
Save/restore context size: 39 system registers + 31 GPRs + SP + PC + PSTATE = ~600 bytes
per direction. Stored in struct kvm_cpu_context within the VcpuArch structure.
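As a sanity check on the ~600-byte figure, the register counts multiply out as follows (a sketch; the exact register list is implementation-defined, as noted above):

```rust
/// Register counts from the text: 39 EL1 system registers, 31 GPRs,
/// plus SP, PC, and PSTATE, each stored as a u64.
const SYSREGS: usize = 39;
const GPRS: usize = 31;
const SP_PC_PSTATE: usize = 3;

/// Bytes per save/restore direction.
pub const fn nvhe_context_bytes() -> usize {
    (SYSREGS + GPRS + SP_PC_PSTATE) * core::mem::size_of::<u64>()
}
```

This gives 73 registers × 8 bytes = 584 bytes, which rounds to the ~600 bytes per direction quoted above.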
Virtual interrupt injection: GICv4 direct injection (where hardware supports
it) allows the physical GIC to deliver virtual interrupts directly to the guest
vCPU without a VM exit. umka-kvm programs the GICv4 virtual LPI (Locality-specific
Peripheral Interrupt) tables so that device MSIs targeted at a guest are directly
delivered. Fallback: software injection via ICH_LR<n>_EL2 list registers
(GICv3 virtualization interface).
AArch64 Cache Geometry Discovery and DC CISW Flush
KVM requires flushing guest memory pages from the host cache when switching between host and guest mappings and on guest Stage-2 page table updates. On AArch64 there is no single instruction that flushes the entire data cache hierarchy; the kernel must iterate over every cache set and way up to the Level of Coherence (LoC). The geometry parameters — line size, associativity, and number of sets — are discovered at runtime by reading three system registers after selecting each cache level.
Register layout:
CLIDR_EL1 (Cache Level ID Register):
- bits[26:24] = LoC — the number of cache levels that must be flushed to achieve
data coherency with all data-sharing agents
- bits[23:21] = LoUIS — Level of Unification Inner Shareable (used for context-switch
cache maintenance; not needed for the full-flush path)
- Ctype_n = (CLIDR_EL1 >> (3 * (n-1))) & 0x7 for n = 1..7:
0b000 = no cache at this level; 0b001 = instruction cache only;
0b010 = data cache; 0b011 = separate I+D; 0b100 = unified.
Levels where Ctype_n >= 2 have a data or unified cache and must be flushed.
CSSELR_EL1 (Cache Size Selection Register):
- Write ((level - 1) << 1) | 0 to select the data/unified cache at the given
one-indexed level (level 1 = write 0, level 2 = write 2, …).
- An ISB must follow immediately before reading CCSIDR_EL1; without it the
processor may return stale geometry data for the previously selected level.
- The write is not interrupt-safe: an interrupt handler that inspects cache geometry
(e.g., via a PMU callback) could change CSSELR_EL1 between the host write and the
CCSIDR_EL1 read. IRQs must therefore be disabled for the entire
select → ISB → read sequence.
CCSIDR_EL1 (Cache Size ID Register) — standard format (no FEAT_CCIDX):
- bits[2:0] = LineSize_enc → L = LineSize_enc + 4; line size in bytes = 1 << L
- bits[12:3] = Assoc_enc → NumWays = Assoc_enc + 1
- bits[27:13] = NumSets_enc → NumSets = NumSets_enc + 1
DC CISW (Clean and Invalidate by Set/Way) operand encoding (64-bit register):
bits[31 .. 32 - A] = way_index (A = ceil(log2(NumWays)))
bits[B - 1 .. L] = set_index (B = L + ceil(log2(NumSets)))
bits[3:1] = level - 1
way_shift = u32::leading_zeros(Assoc_enc) (= CLZ of the encoded maximum
way index, placing the way field at the most-significant bits of the 32-bit way
field so that incrementing way_index by one always advances to the next way
regardless of the actual associativity).
Flush algorithm (all data cache levels up to LoC):
/// Flush and invalidate all data caches from EL1 up to the Level of Coherence.
///
/// Required before:
/// - Switching Stage-2 page table mappings (host↔guest address attribute changes)
/// - Pinning guest physical pages for device DMA (ensures host caches are clean)
/// - Serializing guest memory for live migration transmission
/// - Changing memory type attributes in guest Stage-2 mappings (Normal→Device, etc.)
///
/// # Safety
/// - Caller must hold an `IrqDisabledGuard` (CSSELR_EL1 write is not interrupt-safe)
/// - Must execute at EL1 or EL2 (DC CISW is a privileged instruction)
pub unsafe fn flush_dcache_all(_irq: &IrqDisabledGuard) {
    // Ordering barrier: ensure all prior memory accesses are globally visible
    // before the first cache maintenance operation.
    core::arch::asm!("dsb sy", options(nostack, preserves_flags));
    let clidr: u64;
    core::arch::asm!("mrs {}, clidr_el1", out(reg) clidr, options(nostack, preserves_flags));
    // LoC: number of cache levels to flush for full coherency.
    let loc = ((clidr >> 24) & 0x7) as usize;
    // Iterate from L1 (level index 0) to LoC - 1 (inclusive).
    for level in 0..loc {
        // Ctype_n for this level (Ctype1 = bits[2:0], Ctype2 = bits[5:3], ...).
        let ctype = (clidr >> (3 * level)) & 0x7;
        // Skip levels with no data or unified cache.
        if ctype < 2 {
            continue;
        }
        // Select this cache level (data/unified; InD bit = 0).
        // ISB is mandatory before reading CCSIDR_EL1.
        let csselr: u64 = (level as u64) << 1;
        core::arch::asm!(
            "msr csselr_el1, {sel}",
            "isb",
            sel = in(reg) csselr,
            options(nostack, preserves_flags),
        );
        let ccsidr: u64;
        core::arch::asm!(
            "mrs {ccsidr}, ccsidr_el1",
            ccsidr = out(reg) ccsidr,
            options(nostack, preserves_flags),
        );
        // Decode geometry.
        let l = ((ccsidr & 0x7) + 4) as u32;             // log2(line_size_bytes)
        let assoc_enc = ((ccsidr >> 3) & 0x3FF) as u32;  // NumWays - 1
        let sets_enc = ((ccsidr >> 13) & 0x7FFF) as u32; // NumSets - 1
        // CLZ(Assoc_enc) places the way index at the MSBs of bits[31:0] of
        // the DC CISW operand, matching the ARM architecture specification.
        // For a direct-mapped cache, Assoc_enc == 0 and way_shift == 32, so
        // the shift below is performed in u64 to avoid an out-of-range
        // u32 shift (which would panic in debug builds).
        let way_shift = assoc_enc.leading_zeros();
        // Iterate all sets and all ways, counting down from the inclusive
        // upper bounds (sets_enc, assoc_enc) so every combination is covered.
        let mut set = sets_enc;
        loop {
            let mut way = assoc_enc;
            loop {
                let operand: u64 = ((way as u64) << way_shift)
                    | ((set as u64) << l)
                    | ((level as u64) << 1);
                core::arch::asm!(
                    "dc cisw, {op}",
                    op = in(reg) operand,
                    options(nostack, preserves_flags),
                );
                if way == 0 {
                    break;
                }
                way -= 1;
            }
            if set == 0 {
                break;
            }
            set -= 1;
        }
    }
    // Restore CSSELR_EL1 to the L1 data cache (level index 0, InD = 0).
    // This is good hygiene: other EL1 code that reads CCSIDR_EL1 without
    // writing CSSELR_EL1 first will see the L1 geometry, which is the
    // least surprising default.
    core::arch::asm!("msr csselr_el1, xzr", options(nostack, preserves_flags));
    // Completion barriers: DSB ensures all DC CISW operations are finished
    // before any subsequent memory access; ISB ensures the instruction stream
    // observes the completed maintenance.
    core::arch::asm!(
        "dsb sy",
        "isb",
        options(nostack, preserves_flags),
    );
}
FEAT_CCIDX support (ARMv8.3+ large-cache systems):
When ID_AA64MMFR2_EL1 bits[23:20] are non-zero, FEAT_CCIDX is implemented
and CCSIDR_EL1 uses wider fields to support caches with more than 1024 ways or
more than 32768 sets:
| Field | Standard (no FEAT_CCIDX) | FEAT_CCIDX |
|---|---|---|
| `LineSize_enc` | bits[2:0] | bits[2:0] (unchanged) |
| `Assoc_enc` | bits[12:3] (10-bit, mask `0x3FF`) | bits[23:3] (21-bit, mask `0x1F_FFFF`) |
| `NumSets_enc` | bits[27:13] (15-bit, mask `0x7FFF`) | bits[55:32] (24-bit, mask `0xFF_FFFF`) |
The DC CISW operand format and the way_shift = u32::leading_zeros(Assoc_enc)
formula are identical in both cases. Because way_shift is computed from the
u32 representation of Assoc_enc and u32::leading_zeros operates on a 32-bit
value, the 21-bit FEAT_CCIDX Assoc_enc is handled correctly: it is zero-extended
into a u32 before leading_zeros is applied, producing the right shift position.
Detection: read ID_AA64MMFR2_EL1 at EL1. If bits[23:20] are non-zero, apply the
wider masks when extracting Assoc_enc and NumSets_enc from CCSIDR_EL1.
KVM call sites:
flush_dcache_all() is called from umka-kvm's AArch64 path at four points:
- VM entry preparation — after modifying Stage-2 page tables and before the ERET into the guest. This ensures the host's cache does not hold stale data for pages whose Stage-2 attributes changed (e.g., a new guest mapping that maps a page as Normal Cacheable when the host previously treated it as Device).
- Guest memory pinning — when pinning guest physical pages for DMA passthrough (VFIO/iommufd, Section 18.5). The host caches are cleaned before the IOMMU mapping is established so the device reads coherent data.
- Live migration send — before reading guest memory pages to serialize them for network transmission (Section 18.1, VM Live Migration). The flush ensures that all dirty cache lines for the guest's physical pages are written back to DRAM before the migration sender reads them.
- Memory type attribute change — when reconfiguring Stage-2 page table entries to change the memory type of a guest region (e.g., from Normal Cacheable to Device nGnRnE for MMIO remapping). The cache must be flushed and invalidated before the attribute change takes effect to avoid cache aliasing.
18.2.3 RISC-V Host-Side Implementation¶
RISC-V H-extension (ratified as part of Privileged ISA v1.12, December 2021) provides VS-mode (virtualized supervisor) and VU-mode (virtualized user).
/// RISC-V H-extension vCPU state.
pub struct RiscvHState {
/// Saved guest VS-mode CSRs (restored before sret to guest,
/// saved after trap to HS-mode).
pub vsstatus: u64,
pub vsie: u64,
pub vstvec: u64,
pub vsscratch: u64,
pub vsepc: u64,
pub vscause: u64,
pub vstval: u64,
pub vsip: u64,
pub vsatp: u64, // guest's own page table root (SV48/SV39)
/// Guest physical address translation register.
/// Written to hgatp CSR before guest entry.
/// Encodes: mode (Sv48x4/Sv39x4) | VMID | PPN of Stage-2 root.
pub hgatp: u64,
/// Hypervisor status register. SPV bit indicates guest context.
pub hstatus: u64,
/// Virtual interrupt pending (injection mechanism).
/// Setting bits in hvip causes virtual interrupts in the guest.
pub hvip: u64,
/// VMID (Virtual Machine Identifier) — TLB tag analogous to
/// Intel VPID / ARM VMID. Width determined by hardware (typically
/// 7-14 bits, discovered from hgatp write-all-ones-read-back).
pub vmid: u16,
}
VM entry: Set hstatus.SPV = 1, load the guest VS-mode CSRs; sret then transitions
the CPU to VS-mode. VM exit: any trap that hedeleg/hideleg does not delegate to
VS-mode traps into HS-mode. The hardware saves the faulting guest physical address
in htval (shifted right by 2 bits, for Stage-2 faults) and the trap cause in
scause.
Stage-2 page tables: Controlled by hgatp CSR. Sv48x4 mode provides 4-level
page tables with a 50-bit guest physical address space (the "x4" means the root
page table is 4 pages / 16 KiB instead of 1 page / 4 KiB, giving 2 extra bits).
HFENCE.GVMA flushes guest TLB entries (analogous to INVEPT on x86 and
TLBI IPAS2E1IS on ARM).
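The hgatp composition can be sketched as a pure function (field positions per the privileged-spec layout: MODE in bits[63:60], VMID in bits[57:44], PPN in bits[43:0]; function and constant names are illustrative):

```rust
/// hgatp MODE values (RISC-V privileged spec).
pub const HGATP_MODE_SV39X4: u64 = 8;
pub const HGATP_MODE_SV48X4: u64 = 9;

/// Compose an hgatp value: MODE in bits[63:60], VMID in bits[57:44],
/// PPN of the Stage-2 root in bits[43:0].
pub fn compose_hgatp(mode: u64, vmid: u16, root_pa: u64) -> u64 {
    // The Sv39x4/Sv48x4 root table is 4 pages (16 KiB), so the root
    // physical address must be 16 KiB aligned.
    debug_assert_eq!(root_pa & 0x3FFF, 0);
    (mode << 60) | (((vmid as u64) & 0x3FFF) << 44) | (root_pa >> 12)
}
```

The hardware-supported VMID width is discovered as described above (write all-ones to the VMID field, read back, and count the bits that stick).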
Interrupt injection: hvip CSR provides virtual interrupt pending bits (VSSIP,
VSTIP, VSEIP). For external interrupts, umka-kvm sets hvip.VSEIP to inject a
virtual external interrupt. The AIA (Advanced Interrupt Architecture) extension,
when available, provides IMSIC (Incoming MSI Controller) for direct MSI injection
to guest interrupt files — analogous to ARM GICv4 direct injection.
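Without AIA, injection reduces to bit manipulation on hvip (a sketch; the bit positions follow the VS-level interrupt numbers in the privileged spec: VSSIP = 2, VSTIP = 6, VSEIP = 10):

```rust
/// VS-level interrupt bits in hvip (match mip/mie interrupt numbering).
pub const HVIP_VSSIP: u64 = 1 << 2;  // virtual supervisor software interrupt
pub const HVIP_VSTIP: u64 = 1 << 6;  // virtual supervisor timer interrupt
pub const HVIP_VSEIP: u64 = 1 << 10; // virtual supervisor external interrupt

/// New hvip value that injects a virtual external interrupt.
pub fn inject_external(hvip: u64) -> u64 {
    hvip | HVIP_VSEIP
}

/// New hvip value after retracting the pending virtual external interrupt.
pub fn retract_external(hvip: u64) -> u64 {
    hvip & !HVIP_VSEIP
}
```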
18.2.4 LoongArch64 LVZ Implementation¶
LoongArch provides the LVZ (LoongArch Virtualization eXtension) for trap-and-emulate virtualization (Linux KVM support since kernel 6.7). LVZ is conceptually similar to ARM's EL2 / VHE model but uses LoongArch-specific CSR registers.
/// LoongArch64 LVZ state for one vCPU.
pub struct LvzState {
/// Guest CSR save area. LVZ defines a parallel set of Guest CSRs
/// (GCSR prefix) that mirror host CSRs for the guest. On VM exit,
/// hardware saves guest CSR state; on VM entry, hardware restores it.
/// Key GCSRs: GCSR.CRMD (mode), GCSR.PRMD (pre-exception mode),
/// GCSR.ESTAT (exception status), GCSR.ERA (exception return address),
/// GCSR.BADV (bad virtual address), GCSR.TLBRENTRY (TLB refill entry),
/// GCSR.PGDL/PGDH (page directory), GCSR.STLBPS (STLB page size).
pub guest_csrs: GuestCsrBlock,
/// Hardware-assisted guest timer state. The host writes
/// CSR.GCNTC (Guest Counter Compensation) to offset the guest's
/// view of the Stable Counter, providing transparent guest
/// timekeeping without trapping timer reads.
pub guest_timer_offset: u64,
/// Guest TLB configuration. LVZ provides separate guest TLB
/// entries (GTLB) that the guest manages via guest-mode TLBWR/
/// TLBFILL/INVTLB. Guest TLB invalidation uses INVTLB with
/// guest ASID — does not affect host TLB entries.
pub guest_asid: u32,
}
VM entry/exit: CSR.GSTAT (Guest Status) controls the guest/host mode. Setting
GSTAT.PGM = 1 (Guest Mode) and executing ERTN (Exception Return) enters the
guest. Exceptions not delegated to the guest (controlled by GCFG.MATC — guest
exception delegation matrix) trap to the host. The hardware saves guest CSR state
automatically on exit.
Nested page tables: LoongArch uses a standard 4-level page table for guest
physical → host physical translation, controlled by CSR.PGDL/CSR.PGDH in the
host's Stage-2 context. Page sizes match host configuration (4KB, 16KB, 64KB).
Guest TLB invalidation via INVTLB in guest mode affects only GTLB entries.
Interrupt injection: CSR.GINTC (Guest Interrupt Control) provides virtual interrupt pending bits for the guest. The host writes GINTC to signal pending interrupts. EIOINTC routing for guest interrupt delivery uses the same CSR.ESTAT-based dispatch mechanism as the host.
Phase: LoongArch KVM is Phase 3+. Initial implementation targets QEMU virt machine type with LVZ-capable 3A5000+ processors.
18.2.5 PPC64LE KVM-HV Implementation¶
PPC64LE supports two KVM modes. UmkaOS targets KVM-HV (Hypervisor mode) for production on POWER9+ with radix page tables, and KVM-PR (Problem state) as a fallback for older hardware or nested guests.
| Feature | KVM-HV | KVM-PR |
|---|---|---|
| Privilege level | Hypervisor mode (MSR[HV]=1) | Problem state (user mode emulation) |
| Available since | POWER8 (officially), POWER7 (limited) | All PowerPC |
| Guest CPU | Same as host | Can emulate different CPU models |
| Performance | Near-native | Slow (all privileged instructions trapped) |
| Page table | Radix (POWER9+) or Hash (POWER8) | Any |
| Interrupt controller | XIVE passthrough (POWER9+) | Software-emulated XICS |
| Nested virtualization | Yes (POWER9+ with L0 support) | Yes (from within any guest) |
UmkaOS decision: KVM-HV with radix page tables is the primary target. KVM-PR is a Phase 4+ item for backward compatibility with POWER7/8 and for nested guest scenarios.
/// PPC64LE KVM-HV vCPU state.
pub struct PpcHvState {
/// LPCR (Logical Partitioning Control Register) per vCPU.
/// Controls: large decrementer (LD), interrupt delivery (LPES),
/// alternate interrupt location (AIL), host radix (HR), etc.
pub lpcr: u64,
/// Partition table entry for this VM. On POWER9+, the PTCR register
/// points to a partition table indexed by LPID. Each entry contains
/// the guest's page table root pointer and configuration.
pub partition_table_entry: u64,
/// Logical Partition ID — hardware LPID assigned to this VM.
/// Maximum LPID width: 12 bits (4096 partitions) on POWER9/10.
pub lpid: u16,
/// Guest state saved/restored on VM entry/exit.
/// KVM-HV with radix: only dirty registers are transferred.
pub guest_gprs: [u64; 32],
pub guest_msr: u64,
pub guest_pc: u64, // SRR0
pub guest_srr1: u64,
pub guest_dar: u64,
pub guest_dsisr: u64,
pub guest_dec: u64, // decrementer
pub guest_sprg: [u64; 4],
pub guest_pid: u64, // PIDR (PID register for radix translation)
/// XIVE interrupt state for direct injection.
/// When XIVE is available (POWER9+), the hardware delivers interrupts
/// directly to the guest vCPU without hypervisor intervention.
pub xive_vp: u64, // Virtual Processor number
pub xive_cam_word: u64,
}
VM entry/exit: The HRFID (Hypervisor Return From Interrupt Doubleword) instruction
transitions from hypervisor mode (HV) to partition mode (guest). On guest exception or
hypervisor decrementer (HDEC) expiry, control returns to the hypervisor via the
exception vector at the hypervisor's HSPRG0-based entry point.
Radix page tables (POWER9+): Guest uses standard 4-level radix page tables. The
hypervisor programs the partition table entry with the guest's page table root. Stage-2
translation (guest-physical → host-physical) is performed by the partition table mechanism —
analogous to Intel EPT / ARM Stage-2. TLB invalidation uses tlbie with the guest's
LPID to scope the invalidation.
POWER9 DD2.1 constraint: Host and guest page table format must match (both radix or both hash). POWER10 removes this constraint and supports only radix for KVM-HV guests.
Hypervisor Decrementer (HDEC): Separate from the guest's DEC, the HDEC fires at the
hypervisor level to preempt guest execution for host scheduling. Always operates in
"large" mode on POWER9 (56-bit). The guest DEC can be 32-bit or 56-bit depending on
LPCR[LD].
Transactional Memory (TM): Disabled by default for host userspace on POWER9+.
POWER8 hardware TM has multiple known bugs (state corruption during signal delivery,
race conditions between treclaim and context switch). If TM is exposed to guests,
the KVM code must handle facility unavailable interrupts and TM state save/restore.
UmkaOS follows Linux's approach: TM is opt-in only.
18.2.5.1 Nested Virtualization (L0/L1/L2)¶
POWER9 and POWER10 support nested virtualization with two API versions:
v1 API (POWER9): Uses h_enter_nested() hcall. The L1 hypervisor transfers the
full L2 vCPU state on every nested entry/exit. The L0 (PowerVM or KVM-HV host) keeps
no persistent L2 state except partition table entries. High overhead due to full state
transfer on every transition.
v2 API (POWER10): Uses H_GUEST_* hcalls (H_GUEST_CREATE, H_GUEST_RUN_VCPU,
H_GUEST_SET_STATE, H_GUEST_GET_STATE, H_GUEST_DELETE). The L0 retains L2 state
between runs; only dirty state needs to be communicated. Significantly better performance
due to incremental state transfer.
When UmkaOS runs as an L1 hypervisor:
- On POWER10: use the v2 API (mandatory for new development)
- On POWER9: use the v1 API, with full state transfer on every nested entry/exit
Phase: PPC64LE KVM-HV is Phase 3. Nested virtualization is Phase 4+.
18.3 KVM Operational Integration¶
18.3.1 vCPU Scheduling Integration¶
vCPU threads are scheduled by umka-core's EEVDF scheduler (Section 7.1) as normal kernel threads with specific properties:
- vCPU affinity: By default, a vCPU thread can migrate between any host physical CPU. Userspace can pin vCPUs to specific pCPUs via sched_setaffinity for latency-sensitive workloads (DPDK, real-time). When pinned, the vCPU thread runs on exactly one pCPU and the VMX preemption timer is disabled (the vCPU owns the pCPU exclusively until it voluntarily exits or is preempted by a higher-priority host thread).
- VMX preemption timer: On x86, the VMCS preemption timer field is programmed from the scheduler's remaining time slice for the vCPU thread. When the timer fires inside VMX non-root mode, a VM exit occurs with ExitReason::PreemptionTimer, and the vCPU thread yields to the scheduler. This ensures vCPU threads do not monopolize physical CPUs. The timer value is calculated as:

  preempt_timer_value = (slice_ns × tsc_hz / 10⁹) >> preempt_timer_shift

  where preempt_timer_shift is read from IA32_VMX_MISC[4:0]. The calculation uses u64 arithmetic to avoid overflow: the intermediate product slice_ns × tsc_hz far exceeds 32 bits, and the quotient itself can exceed u32::MAX for long slices or zero shifts (a 100ms slice at 5 GHz TSC with shift=0 already yields ~5 × 10⁸). The result is stored in Vcpu::preempt_timer_value: u64 and clamped to u32::MAX before writing to the 32-bit VMCS preemption timer field.
- AArch64 equivalent: The generic timer's CNTHP_TVAL_EL2 (EL2 physical timer) is programmed as the host scheduler's preemption tick. When it fires, it traps the guest to EL2, where the exit handler yields.
- RISC-V equivalent: The stimecmp CSR (Sstc extension) or the SBI timer is programmed for the scheduler quantum. The timer interrupt traps to HS-mode.
- NUMA placement: When a VM's backing memory is allocated from a specific NUMA node, umka-kvm hints the scheduler to prefer running vCPU threads on CPUs in the same NUMA node (via set_cpus_allowed_ptr with the node's CPU mask). This is a soft hint, not a hard pin; the scheduler can migrate vCPUs for load balancing but prefers local placement.
- Halt polling: When a vCPU executes HLT and there is no pending interrupt, instead of immediately yielding to the scheduler (which incurs a context switch), the vCPU thread spins for a configurable duration (default: 200 microseconds, tunable via /sys/module/umka_kvm/parameters/halt_poll_ns). If an interrupt arrives during the spin window, the vCPU re-enters the guest without a context switch. If the spin window expires, the thread yields. This optimization reduces wake latency for interrupt-driven workloads (networking, storage) at the cost of slightly higher host CPU usage.
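The VMX preemption-timer calculation in the list above can be sketched as a pure function (`preempt_timer_value` here is an illustrative free function; a u128 intermediate is used so the sketch stays safe even for pathological slice lengths):

```rust
/// Compute the 32-bit VMCS preemption timer value from the remaining
/// scheduler slice. The timer counts down at tsc_hz >> preempt_timer_shift,
/// where the shift is read from IA32_VMX_MISC[4:0].
pub fn preempt_timer_value(slice_ns: u64, tsc_hz: u64, preempt_timer_shift: u32) -> u32 {
    // Wide intermediate: slice_ns * tsc_hz overflows 32 bits immediately,
    // and can even overflow 64 bits for multi-second slices on fast TSCs.
    let ticks = (slice_ns as u128) * (tsc_hz as u128) / 1_000_000_000;
    let value = ticks >> preempt_timer_shift;
    // Clamp to the 32-bit VMCS preemption timer field width.
    value.min(u32::MAX as u128) as u32
}
```

A 100 ms slice at a 5 GHz TSC with shift 0 yields 5 × 10⁸ ticks; a 10 s slice under the same parameters saturates at u32::MAX.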
EEVDF accounting for halt-poll time: Halt-poll spinning is CPU time consumed
by the vCPU thread, but it is idle waiting — not productive guest computation.
To prevent unfair scheduling penalties, halt-poll time is accounted as idle
time, not as vruntime. The vCPU thread calls sched_idle_enter() before
entering the halt-poll loop and sched_idle_exit() when an interrupt arrives
or the poll window expires. Lock requirement: sched_idle_enter() and
sched_idle_exit() require the current CPU's runqueue lock
(Section 7.1):
let guard = rq_lock_irqsave();
sched_idle_enter(&mut guard);
rq_unlock_irqrestore(guard);
// ... halt-poll loop ...
let guard = rq_lock_irqsave();
sched_idle_exit(&mut guard);
rq_unlock_irqrestore(guard);
This costs two runqueue lock round-trips per HLT (once for
sched_idle_enter, once for sched_idle_exit). Linux instead uses
current_set_polling() / current_clr_polling(), which are simple TIF flag
manipulations (no lock). The two lock round-trips are the cost of UmkaOS's
choice NOT to charge halt-poll time as vruntime (documented above). A
future optimization could use a per-task atomic flag checked by update_curr()
to avoid the lock, but the current design is correct and the overhead is
bounded (~100 cycles per HLT, <0.01% of a 200µs poll window).
During the halt-poll window the scheduler treats the thread as idle: its
vruntime does not advance, and it is not counted toward the runqueue load.
This prevents a vCPU that polls frequently (e.g., a latency-sensitive
networking guest) from being penalized with excess vruntime relative to
compute-bound vCPUs. The halt_poll_ns cap bounds the maximum idle-accounted
time per HLT; at the 200µs default, a single HLT can still account up to 20%
of a 1ms scheduler quantum as idle, which is why an aggregate per-quantum cap
is also needed.
Per-quantum halt-poll budget: In addition to the per-HLT cap
(halt_poll_ns, default 200µs), a per-quantum aggregate cap limits total
halt-poll time within a single scheduling quantum. The aggregate budget
defaults to 10% of the quantum duration (sched_latency_ns / nr_running).
Each halt-poll iteration decrements the per-quantum budget by the actual
poll duration. When the budget is exhausted, subsequent HLTs bypass
halt-poll and yield immediately to the scheduler. The budget resets when
the vCPU is next scheduled (i.e., on sched_idle_exit() following a
schedule-in event, the budget is recalculated from the new quantum).
| Parameter | Default | Tunable |
|---|---|---|
| `halt_poll_ns` (per-HLT cap) | 200,000 ns | `/sys/module/umka_kvm/parameters/halt_poll_ns` |
| `halt_poll_budget_pct` (per-quantum %) | 10 | `/sys/module/umka_kvm/parameters/halt_poll_budget_pct` |
This prevents the exploit where a guest alternating short compute bursts
with frequent HLTs consumes disproportionate host CPU without vruntime
charging. The per-quantum budget ensures halt-poll CPU consumption is
bounded to a fixed fraction of the quantum, matching the fix applied in
Linux for the same class of fairness exploit. The per-vCPU budget state
is stored in Vcpu::halt_poll_budget_remaining_ns
(Section 18.1).
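The per-quantum budget logic can be sketched as a small host-side model (struct and method names here are illustrative, not the umka-kvm API):

```rust
/// Per-vCPU halt-poll budget for one scheduling quantum.
pub struct HaltPollBudget {
    /// Remaining aggregate poll time in this quantum (ns).
    pub remaining_ns: u64,
    /// Per-HLT cap (halt_poll_ns, default 200_000 ns).
    pub per_hlt_cap_ns: u64,
}

impl HaltPollBudget {
    /// Budget granted at schedule-in: budget_pct percent of the quantum.
    pub fn new_for_quantum(quantum_ns: u64, budget_pct: u64, per_hlt_cap_ns: u64) -> Self {
        Self { remaining_ns: quantum_ns * budget_pct / 100, per_hlt_cap_ns }
    }

    /// On HLT: how long may this poll iteration spin?
    /// Zero means bypass halt-poll and yield to the scheduler immediately.
    pub fn poll_window_ns(&self) -> u64 {
        self.per_hlt_cap_ns.min(self.remaining_ns)
    }

    /// Charge the actual poll duration against the quantum budget.
    pub fn charge(&mut self, polled_ns: u64) {
        self.remaining_ns = self.remaining_ns.saturating_sub(polled_ns);
    }
}
```

With a 1 ms quantum and the 10% default, the budget is 100µs: smaller than the 200µs per-HLT cap, so the aggregate limit (not the per-HLT cap) is what binds, and the second HLT in the quantum yields immediately.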
- Overcommit behavior: When more vCPUs than physical CPUs are active, the scheduler distributes time fairly via EEVDF virtual deadline ordering. The PV spinlock mechanism (Section 18.1, "Guest Mode — PV Spinlocks") prevents lock-holder preemption waste. KVM_HC_SCHED_YIELD from a spinning guest vCPU triggers an immediate scheduler yield, allowing the lock-holding vCPU to run.
- Power budget integration: Each VM can have a power budget (Section 7.4). The scheduler accounts vCPU thread CPU time against the VM's power budget. When a VM exceeds its budget, its vCPU threads' scheduling weights are reduced proportionally, throttling the VM without killing it.
18.3.2 In-Kernel Device Models¶
umka-kvm includes minimal in-kernel device emulation for devices where the
userspace round-trip (VM exit → KVM_RUN return → userspace emulation →
KVM_RUN re-entry) would be a performance bottleneck:
| Device | Emulation location | Rationale |
|---|---|---|
| Local APIC (x2APIC) | In-kernel + hardware-assisted | Interrupt delivery is the hottest path. Hardware APIC virtualization avoids most exits. |
| IOAPIC | In-kernel | Interrupt routing must be low-latency. Each I/O completion triggers IOAPIC. |
| PIT (i8254) | In-kernel | Timer tick generation. Legacy but required for BIOS boot. |
| PIC (i8259) | In-kernel | Legacy interrupt controller. Required for BIOS boot until IOAPIC takes over. |
| kvmclock | In-kernel | Shared memory page, no exits needed. Host updates parameters on TSC recalibration. |
| vhost-net | In-kernel (Tier 1, extended) | See Section 18.1, "vhost Kernel Data Plane". |
| vhost-scsi | In-kernel (Tier 1, extended) | See Section 18.1, "vhost Kernel Data Plane". |
All other devices (virtio-blk, virtio-gpu, IDE, e1000, etc.) are emulated in userspace by the VMM (QEMU, Firecracker, etc.). This split matches Linux KVM's architecture: the kernel handles the time-critical interrupt and timer paths; the VMM handles the device model complexity.
irqfd / ioeventfd: These mechanisms avoid the userspace round-trip for
specific interrupt injection and I/O intercept patterns:
- irqfd (KVM_IRQFD ioctl): Associates an eventfd with a guest IRQ line.
When a userspace or kernel component writes to the eventfd, umka-kvm injects
the corresponding interrupt into the guest — without a KVM_RUN exit/re-entry
cycle. Used by QEMU for virtio interrupt injection.
- ioeventfd (KVM_IOEVENTFD ioctl): Associates an eventfd with a guest
I/O port or MMIO address. When the guest writes to that address, the VM exit
handler triggers the eventfd and immediately re-enters the guest — the
userspace device model processes the write asynchronously. Used by QEMU for
virtio doorbell writes.
In-kernel MMIO dispatch (KvmIoBus)
In-kernel device models (LAPIC, IOAPIC, PIT, PIC, kvmclock) and ioeventfd
registrations need a dispatch table so that EPT/Stage-2 violations for MMIO
regions can be resolved in-kernel without returning to userspace. KvmIoBus
provides this:
/// An I/O device registration in the KVM I/O bus.
pub struct IoDeviceEntry {
/// MMIO base address (guest physical).
pub base: u64,
/// Region size in bytes.
pub len: u32,
/// Handler for reads/writes to this region.
pub ops: &'static dyn KvmIoDeviceOps,
/// Opaque device context passed to `ops.read()`/`ops.write()`.
/// # Safety
/// - The pointed-to data must remain valid for the lifetime of this
/// `IoDeviceEntry` in the XArray. Since entries are RCU-protected,
/// `dev` must remain valid for at least one RCU grace period after
/// the entry is removed from the XArray.
/// - The owner (the in-kernel device model component that registered this
/// entry) is responsible for ensuring the lifetime via `Arc` or similar.
/// Typical pattern: the device model holds `Arc<LapicState>` and stores
/// `Arc::into_raw(lapic)` as `dev`. On unregister, `Arc::from_raw(dev)`
/// reconstructs the Arc for proper deallocation after RCU grace period.
pub dev: *const (),
}
/// Operations for an in-kernel I/O device.
pub trait KvmIoDeviceOps: Send + Sync {
/// Handle a guest read from the MMIO region.
/// `offset` is relative to the device's base address.
/// Writes the result into `data` (1/2/4/8 bytes).
fn read(&self, dev: *const (), offset: u64, data: &mut [u8]);
/// Handle a guest write to the MMIO region.
/// `offset` is relative to the device's base address.
fn write(&self, dev: *const (), offset: u64, data: &[u8]);
}
/// Per-VM I/O bus. Two buses: one for MMIO, one for PIO (x86 only).
/// Entries are stored in an XArray keyed by base address for O(1) lookup.
/// RCU-protected: readers (EPT fault handlers) use RCU read-side;
/// writers (device registration/removal) use `io_bus_lock`.
pub struct KvmIoBus {
/// MMIO device registrations, keyed by page-frame number (base GPA >> PAGE_SHIFT).
/// If a second device registers at an existing page-frame key, it is stored
/// in a collision chain (ArrayVec) within the XArray entry. Sub-page MMIO
/// collision chain limited to 8 entries per page. Typical devices use 1-2;
/// nested virtualization (L1 synthesized IOAPIC+PIT+HPET in one page) may
/// use 5-6. Registration uses `try_push()` and returns `-ENOSPC` if the
/// chain is full. On `-ENOSPC`, a `klog(Warning, "KVM: MMIO sub-page chain
/// full for GPA {:#x}, {} existing entries", gpa, chain.len())` is emitted
/// to aid diagnosis.
pub mmio: XArray<ArrayVec<IoDeviceEntry, 8>>,
/// PIO device registrations, keyed by port number (x86 only).
pub pio: XArray<IoDeviceEntry>,
/// Serializes structural changes (add/remove).
pub io_bus_lock: SpinLock<()>,
}
Dispatch flow: On an EPT/Stage-2 MMIO fault, the exit handler:
1. Extracts the guest physical address and access size from VMCS/exit info.
2. Page-level lookup: xa_load(gpa >> PAGE_SHIFT) in vm.io_bus.mmio.
Returns a page-level IoDeviceEntry (or a small list for the rare case
of multiple sub-page MMIO regions sharing one page).
3. Offset match: If the page-level entry covers the full page (the common
case -- LAPIC, IOAPIC, kvmclock are all page-aligned), dispatch directly.
For sub-page regions, linear scan within the page's entries:
base <= gpa < base + len. With K entries per page (typically 1-2),
this is O(K).
4. If found: call ops.read() or ops.write() and re-enter guest (no
userspace exit).
5. If not found: set KVM_EXIT_MMIO in kvm_run and return to userspace.
Performance: With N = 5-15 in-kernel devices (all standard devices page-aligned), step 2 is O(1) via XArray, and step 3 is trivially O(1). The sub-page linear scan handles edge cases (multiple PCI BARs in one page) without requiring a separate interval tree.
For PIO (x86 IN/OUT instructions), the same flow uses vm.io_bus.pio
keyed by port number.
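The dispatch flow above can be modeled in a few lines of host-side Rust (a sketch: a `HashMap` stands in for the RCU-protected XArray, and entries carry a name instead of a `KvmIoDeviceOps` handler):

```rust
use std::collections::HashMap;

/// Simplified MMIO registration (illustrative; no ops/dev pointers).
pub struct MmioEntry {
    pub base: u64,
    pub len: u64,
    pub name: &'static str,
}

/// Simplified model of the KvmIoBus MMIO side.
pub struct MmioBus {
    /// Keyed by page-frame number (base GPA >> 12), as in KvmIoBus.
    pages: HashMap<u64, Vec<MmioEntry>>,
}

impl MmioBus {
    pub fn new() -> Self {
        Self { pages: HashMap::new() }
    }

    /// Register a device; Err models -ENOSPC when the sub-page
    /// collision chain (8 entries per page) is full.
    pub fn register(&mut self, e: MmioEntry) -> Result<(), ()> {
        let chain = self.pages.entry(e.base >> 12).or_default();
        if chain.len() >= 8 {
            return Err(());
        }
        chain.push(e);
        Ok(())
    }

    /// Steps 2 and 3 of the dispatch flow: page-level lookup, then a
    /// linear offset match within the page's (typically 1-2) entries.
    pub fn lookup(&self, gpa: u64) -> Option<&'static str> {
        self.pages
            .get(&(gpa >> 12))?
            .iter()
            .find(|e| e.base <= gpa && gpa < e.base + e.len)
            .map(|e| e.name)
    }
}
```

A hit re-enters the guest in-kernel; a miss corresponds to step 5 (KVM_EXIT_MMIO back to userspace).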
18.3.3 Nested Virtualization¶
(Phase 5 — data structures and architectural requirements defined here for design completeness; implementation deferred.)
Nested virtualization (running a hypervisor inside a VM) requires intercepting and emulating the L1 hypervisor's virtualization instructions. The specification covers the basic architectural requirements:
- x86 (VMCS shadowing): A guest hypervisor's VMCS operations (
VMREAD,VMWRITE,VMLAUNCH,VMRESUME) are intercepted by umka-kvm. umka-kvm maintains a shadow VMCS (VMCS02) that merges the L1 hypervisor's intended guest state with umka-kvm's own host state. TheVMCS_LINK_POINTERfield points to the shadow VMCS. L2 VM exits are dispatched to L1 or L0 based on exit reason: exits caused by L1's execution controls go to L1; exits caused by L0's controls (e.g., EPT violation in L0's page tables) go to L0. Shadow EPT (EPT02) merges L1's EPT (guest-physical → L1-physical) with L0's EPT (L1-physical → host-physical) into a combined guest-physical → host-physical mapping. - ARM64: Nested virtualization on ARM64 requires trapping all EL2 instructions executed by the L1 hypervisor (HCR_EL2.NV = 1, ARMv8.3+). Stage-2 nesting (combining L1's Stage-2 with L0's Stage-2) follows the same shadow page table approach as x86 EPT02.
- RISC-V: The H-extension does not yet define a standard nested virtualization mechanism. Software trap-and-emulate of all H-extension CSR accesses from L1 is functionally correct but slow (~10x overhead). Hardware support is expected in a future extension.
Data structures:
/// Shadow VMCS state for nested virtualization. Merges L1 hypervisor's
/// intended guest state (VMCS12) with L0's host state into VMCS02.
pub struct NestedVmcsState {
/// VMCS02: the shadow VMCS that the CPU actually runs.
/// Combines L1's guest fields with L0's host/control fields.
pub vmcs02_phys: PhysAddr,
/// VMCS12: L1 hypervisor's virtual VMCS (in L1's guest-physical memory).
/// Read by L0 on L1's VMLAUNCH/VMRESUME, updated on L2→L1 exits.
pub vmcs12_gpa: PhysAddr,
/// Shadow EPT root (EPT02): merges L1's EPT with L0's EPT.
pub ept02_root: PhysAddr,
/// Current nesting state: NotNested, L1Running, L2Running.
pub state: NestedState,
/// Cached L1 execution controls for exit dispatch.
pub l1_pin_controls: u32,
pub l1_proc_controls: u32,
pub l1_proc_controls2: u32,
pub l1_exit_controls: u32,
}
/// Nested virtualization state machine.
#[repr(u8)]
pub enum NestedState {
/// No nested hypervisor active. L1 runs directly.
NotNested = 0,
/// L1 is running with nested intercepts armed. L1 has executed VMXON
/// but not yet VMLAUNCH.
L1VmxActive = 1,
/// L2 is running. L0 monitors VMCS02. Exits dispatch per rules below.
L2Running = 2,
}
Exit dispatch rule: On a VM exit from L2, L0 inspects the exit reason.
If the reason matches an L1-controlled intercept (in l1_proc_controls or
l1_exit_controls), reflect the exit to L1 (inject VMEXIT into VMCS12, switch
to L1 context). Otherwise, L0 handles the exit directly (e.g., EPT violation
in L0's page tables, external interrupt).
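The dispatch rule can be illustrated with a minimal sketch. The exit-reason enum, the `l1_caused` flag (did L1's EPT12 cause this fault?), and the function itself are illustrative stand-ins; only the HLT-exiting control (bit 7 of the primary processor-based VM-execution controls) is an architected value:

```rust
/// Simplified subset of VM-exit reasons.
#[derive(Clone, Copy)]
enum ExitReason {
    ExternalInterrupt, // always handled by L0
    Cpuid,             // unconditional exit: L1 must observe it
    Hlt,               // conditional on L1's HLT-exiting control
    EptViolation,      // needs a shadow-EPT walk to attribute the fault
}

/// Primary processor-based controls: bit 7 = HLT exiting (Intel SDM).
const CPU_BASED_HLT_EXITING: u32 = 1 << 7;

enum Target { ReflectToL1, HandleInL0 }

/// Hypothetical dispatch sketch. `l1_caused` answers "was this EPT
/// violation caused by L1's EPT12?", which real code derives from the
/// EPT02 merge, not from a pre-computed flag.
fn dispatch_l2_exit(reason: ExitReason, l1_proc_controls: u32, l1_caused: bool) -> Target {
    match reason {
        ExitReason::ExternalInterrupt => Target::HandleInL0,
        ExitReason::Cpuid => Target::ReflectToL1,
        ExitReason::Hlt if l1_proc_controls & CPU_BASED_HLT_EXITING != 0 => Target::ReflectToL1,
        ExitReason::Hlt => Target::HandleInL0,
        ExitReason::EptViolation if l1_caused => Target::ReflectToL1,
        ExitReason::EptViolation => Target::HandleInL0,
    }
}
```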
Performance target for nested virtualization: less than 20% overhead for L2 guest workloads compared to L1 (non-nested) execution, on hardware with VMCS shadowing (Intel) or ARM VHE (Virtualization Host Extensions). RISC-V H-extension does not define nested virtualization primitives; nested RISC-V performance is TBD pending future ISA extensions (software trap-and-emulate incurs ~10x overhead). The 20% target is consistent with Linux KVM's measured nested overhead (10-30% depending on workload and exit frequency). Software-only nested virtualization — where the L0 hypervisor must emulate VMX/SVM instructions for L1 because the CPU lacks hardware shadowing support — has substantially higher overhead (typically 2-5x) and is not a supported configuration for production use.
Recovery advantage — UmkaOS's driver recovery provides unique benefits for virtualization:
- Host-side: if a vhost-net or vhost-scsi module crashes, UmkaOS recovers it in-place (Tier 1 reload). The hypervisor and guest never notice. In Linux, a vhost crash would require tearing down and re-establishing the vhost connection.
- Guest-side: if a guest running UmkaOS crashes a virtio driver, the driver recovers without VM reboot. The hypervisor sees a brief pause in I/O but no reset. In Linux, a guest virtio driver crash typically requires VM reboot.
Host Mode — Kernel Same-page Merging (KSM)
When UmkaOS runs as a hypervisor host managing many VMs, identical memory pages accumulate across guests — shared libraries (libc, libssl), zero-filled BSS pages, and common read-only data. KSM reclaims this waste by deduplicating identical pages.
Page registration (merge lifecycle entry): Pages enter the KSM system through
two paths: (1) userspace calls madvise(MADV_MERGEABLE) on a VMA range, which
marks those pages as KSM-eligible, or (2) per-VM opt-in at VM creation marks all
guest memory slots as mergeable. Eligible pages are added to ksmd's scan list
— a per-NUMA-node linked list of KsmRmapItem entries, each tracking one
anonymous page. Pages are removed from the scan list when the owning VMA is
unmapped, when userspace calls madvise(MADV_UNMERGEABLE), or when the process
exits.
Two-tree lookup structure:
- Stable tree: `RBTree<PageHash, KsmStableEntry>` — pages already merged (shared COW). Searched first. Entries are never removed by scanning; only removed when the last reference breaks COW (i.e., `mapcount` drops to zero).
- Unstable tree: `RBTree<PageHash, KsmUnstableEntry>` — candidate pages seen in the current scan cycle but not yet matched. The entire unstable tree is rebuilt each scan cycle (dropped and re-populated) because page contents can change between cycles. Entries hold a weak page reference (no COW — the page may be modified before the next lookup).
Collection policy note: RBTree (BTreeMap equivalent) is used here instead of
XArray despite integer keys (PageHash = u32/u64) because KSM requires ordered
traversal (memcmp-based secondary comparison for hash collisions) and the key is a
content hash, not a monotonic ID. Hash collision resolution requires walking adjacent
entries in sorted order — BTreeMap::range() provides this naturally. XArray's radix
structure does not support efficient range queries on hash-distributed keys.
ksmd thread lifecycle: The ksmd kernel thread is created at boot but
remains in an interruptible sleep state until KSM is activated via
/sys/kernel/mm/ksm/run = 1. Once activated, ksmd enters a scan-sleep loop:
- Wake: `ksmd` wakes (from `schedule_timeout_interruptible`) and begins a scan cycle.
- Scan: Process up to `pages_to_scan` pages from the scan list. For each page:
  - (a) Compute the page hash (xxHash, ~200-400ns per 4 KiB page).
  - (b) Search the stable tree — if a hash match is found and a full byte-for-byte `memcmp` confirms identity, merge: update the scanned page's PTE to point to the existing shared COW page, and free the duplicate physical frame back to the buddy allocator.
  - (c) If no stable match, search the unstable tree — if a hash match is found and `memcmp` confirms, promote: allocate a new KSM page, copy the content, remap both pages' PTEs to the new shared COW page, insert into the stable tree, and free both original frames.
  - (d) If no match in either tree, insert into the unstable tree as a candidate for future cycles.
- Sleep: After processing `pages_to_scan` pages (or exhausting the scan list), `ksmd` sleeps for `sleep_millisecs` before repeating. At the start of each new full scan pass (after visiting every page on the scan list), the unstable tree is dropped and rebuilt from scratch.
- Deactivation: Writing `0` to `/sys/kernel/mm/ksm/run` sets a flag that causes `ksmd` to stop scanning and return to interruptible sleep. Already-merged pages remain shared until broken by COW or explicit unmerge.
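Steps (a)-(d) can be sketched with toy types: `BTreeMap` buckets stand in for the RB-trees, `Vec<u8>` for page contents, and `DefaultHasher` for xxHash. This is an illustrative model of the two-tree lookup, not the kernel code:

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

type Page = Vec<u8>;

/// Toy stand-in for the page content hash (the real code uses xxHash).
fn page_hash(p: &Page) -> u64 {
    let mut h = DefaultHasher::new();
    p.hash(&mut h);
    h.finish()
}

enum ScanResult { MergedStable, PromotedToStable, InsertedUnstable }

/// One scan step for a single candidate page, following (a)-(d).
/// Each hash bucket is a Vec so that a full content compare (the memcmp
/// in the text) resolves hash collisions.
fn scan_page(
    page: Page,
    stable: &mut BTreeMap<u64, Vec<Page>>,
    unstable: &mut BTreeMap<u64, Vec<Page>>,
) -> ScanResult {
    let h = page_hash(&page); // (a)
    if let Some(bucket) = stable.get(&h) {
        if bucket.iter().any(|p| *p == page) {
            return ScanResult::MergedStable; // (b): remap PTE, free duplicate
        }
    }
    if let Some(bucket) = unstable.get_mut(&h) {
        if let Some(i) = bucket.iter().position(|p| *p == page) {
            // (c) promote: one shared copy moves to the stable tree;
            // both original frames would be freed.
            let twin = bucket.remove(i);
            stable.entry(h).or_default().push(twin);
            return ScanResult::PromotedToStable;
        }
    }
    unstable.entry(h).or_default().push(page); // (d): candidate for next cycle
    ScanResult::InsertedUnstable
}
```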
Unmerge (merge lifecycle exit): Pages leave the merged state in two ways:
- Break-on-write (implicit): A write to a KSM-merged page triggers a COW
fault. The fault handler allocates a new page, copies the content, and remaps
the writer's PTE to the private copy. This is the standard COW mechanism — no
KSM-specific path. The stable tree entry's reference count decrements; when it
reaches zero, the shared KSM page is freed and the stable tree entry is removed.
- Explicit unmerge: When userspace calls madvise(MADV_UNMERGEABLE) or KSM is
deactivated with /sys/kernel/mm/ksm/run = 2 (unmerge-and-stop), the kernel
walks all KSM-merged pages in the affected range, breaks sharing by allocating
private copies (equivalent to forcing a COW break on each), and removes the
pages from the scan list.
Configuration:
/sys/kernel/mm/ksm/run — 0=off, 1=on, 2=unmerge-and-stop (default: 0)
/sys/kernel/mm/ksm/sleep_millisecs — scan interval (default: 20ms)
/sys/kernel/mm/ksm/pages_to_scan — pages per scan cycle (default: 100)
/sys/kernel/mm/ksm/pages_shared — currently merged pages (read-only)
/sys/kernel/mm/ksm/pages_sharing — additional page references saved by merging (read-only)
/sys/kernel/mm/ksm/full_scans — number of complete scan passes (read-only)
Performance trade-off: KSM's scanning consumes CPU (~1-5% of one core depending on scan rate and working set size). For VM-dense servers running 50-100 identical guests, the memory savings (30-50% for homogeneous Linux guests) far outweigh the CPU cost. For non-VM workloads or heterogeneous guests, the savings are minimal and KSM should remain disabled (the default).
NUMA awareness: KSM only merges pages within the same NUMA node by default
(/sys/kernel/mm/ksm/merge_across_nodes=0). UmkaOS divergence: Linux defaults
to merge_across_nodes=1. UmkaOS defaults to 0 because cross-node merging
degrades NUMA-sensitive VM workloads. Cross-node merging
saves more memory but forces remote NUMA accesses on the merged page — typically a
net loss for latency-sensitive workloads. Administrators can enable cross-node merging
explicitly for memory-constrained environments where density outweighs NUMA locality.
18.3.4 KVM Crash Recovery and vCPU Thread Management¶
umka-kvm runs as a Tier 1 driver with extended hardware privileges. When umka-kvm crashes and is reloaded (~50-150ms recovery window per Section 11.9), vCPU threads must be managed to preserve guest state:
vCPU thread lifecycle during crash recovery:
1. Crash detected: The driver isolation subsystem detects the fault (e.g., page fault in umka-kvm's memory domain). All vCPU threads are currently inside `KVM_RUN` or blocked in the exit handler.
2. Thread parking: The recovery framework sends an IPI to all CPUs running vCPU threads belonging to the faulted umka-kvm instance. The IPI handler:
   - Forces a VM exit (if the vCPU is in guest mode) via `vmx_preemption_timer = 0` (x86), `HCR_EL2.VI = 1` (ARM64), or `stimecmp = 0` (RISC-V).
   - Sets the vCPU's `run_state` to `PARKED` (new value: 3).
   - Saves the guest register state from the VMCS/exit context into `Vcpu::guest_regs` and `Vcpu::hw_state`.
   - Puts the vCPU thread into `TASK_UNINTERRUPTIBLE` sleep on a per-VM `park_waitqueue`.
3. Module reload: The driver recovery framework reloads umka-kvm. The new instance receives the `VmCheckpoint` from the old instance (persisted in Nucleus memory, not in the driver's domain).
4. VMCS reconstruction: For each vCPU:
   - Allocate a new VMCS region. Execute `VMCLEAR` on the old region and `VMPTRLD` on the new region via the VMX trampoline (the same trampoline used for normal VM entry — `VMCLEAR` and `VMPTRLD` are VMX instructions that require Ring 0 / PKEY 0 execution, which the reloaded umka-kvm domain (PKEY 7) cannot issue directly).
   - Populate from `VmCheckpoint::vcpu_states[i]` + `hw_state`.
   - Reload the EPT root from `Vm::slat` (EPT page tables are in Nucleus memory — they survive the crash).
   - Re-register in-kernel device state (LAPIC, IOAPIC) from the checkpoint.
5. Thread unparking: Wake all threads on `park_waitqueue`. Each vCPU thread re-enters `KVM_RUN` with the reconstructed VMCS. The guest sees a brief pause (50-150ms) but no crash — equivalent to a long host scheduling delay.
Guest visibility: The guest perceives the recovery as a scheduling stall. PV-aware guests see increased steal time for the recovery window. Non-PV guests see a TSC gap (which pvclock handles transparently). No guest-visible state is lost.
Failure mode: If VMCS reconstruction fails (e.g., hardware state
cannot be restored), the affected VM is terminated with KVM_EXIT_INTERNAL_ERROR.
The VMM (QEMU/Firecracker) handles this as a VM crash.
18.3.5 Guest Memory Integration with VMM Reclaim and NUMA¶
KVM guest memory must interact correctly with the host VMM's memory reclaim, NUMA placement, and THP subsystems.
Reclaim integration:
Guest memory is backed by host userspace pages (mapped via
KVM_SET_USER_MEMORY_REGION). From the host VMM's perspective, these are
anonymous pages in the VMM process's address space. The standard reclaim
path applies:
- Page cache pages (file-backed guest memory, e.g., virtiofs): reclaimable via normal page cache eviction.
- Anonymous pages (guest RAM): reclaimable via swap-out if swap is configured, or via balloon deflation if `virtio-balloon` is active.
- Pinned pages: Pages pinned for VFIO DMA (`vfio_pin_pages()`) are excluded from reclaim via `PG_mlocked` + elevated refcount. The OOM killer accounts pinned pages against the VMM process's RSS.
Balloon-aware OOM: When the OOM killer evaluates the VMM process, it
considers oom_score_adj (typically set by the orchestrator to reflect
VM priority). Balloon-inflated pages are already returned to the host
and do not count against the VMM's RSS.
NUMA placement:
- vCPU threads are placed on CPUs in the same NUMA node as the VM's backing memory (via `set_cpus_allowed_ptr`, soft affinity — see Section 18.3 above).
- `KVM_SET_USER_MEMORY_REGION` pages are faulted in on the node where the vCPU first accesses them (first-touch policy). For optimal placement, the VMM should `mmap` + `mbind(MPOL_BIND, node)` guest memory to the desired NUMA node before creating vCPUs.
- NUMA balancing (Section 4.11) migrates guest pages to follow vCPU access patterns. This is transparent to KVM — the VMM's page tables are updated by the NUMA balancing scanner, and EPT entries are invalidated via the standard mmu_notifier callback.
THP/EPT interaction:
When guest memory is backed by a THP (2MB transparent huge page), the EPT fault handler can install a 2MB EPT superpage entry, reducing TLB pressure:
- On EPT fault, if the host PTE is a PMD-level THP entry and the guest physical range is 2MB-aligned, install a 2MB EPT entry.
- If the THP is later split (e.g., by partial COW or KSM), the EPT entry is shattered into 512 × 4KB entries via mmu_notifier callback.
- This is automatic — no VMM configuration required.
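A minimal sketch of the superpage decision, with a hypothetical `ept_entry_for_fault` helper: a 2 MB EPT entry is installed only when the host backing is a PMD-level THP and the guest-physical and host-physical addresses share the same offset within a 2 MB frame, so the two 2 MB frame bases line up (this generalizes the "2MB-aligned range" condition from the text):

```rust
const PAGE_2M: u64 = 2 * 1024 * 1024;

enum EptMapping {
    Superpage2M { gpa: u64, hpa: u64 }, // one EPT PDE covers 512 pages
    Page4K { gpa: u64, hpa: u64 },
}

/// Decide the EPT entry size on fault (illustrative, no page-table walk).
fn ept_entry_for_fault(gpa: u64, hpa: u64, host_is_pmd_thp: bool) -> EptMapping {
    // The 2 MB translation is only valid if gpa and hpa have the same
    // offset within their 2 MB frames; a PMD THP guarantees the host
    // side is one contiguous, 2 MB-aligned physical extent.
    let same_offset = (gpa & (PAGE_2M - 1)) == (hpa & (PAGE_2M - 1));
    if host_is_pmd_thp && same_offset {
        EptMapping::Superpage2M {
            gpa: gpa & !(PAGE_2M - 1),
            hpa: hpa & !(PAGE_2M - 1),
        }
    } else {
        EptMapping::Page4K { gpa: gpa & !0xfff, hpa: hpa & !0xfff }
    }
}
```

If the THP is later split, the shatter path replaces the single 2 MB entry with 512 4 KB entries via the mmu_notifier callback, as described above.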
VFIO-Pinned Pages and IOMMU Domain Coherence (EPT/IOMMU Sync)
When a VM has VFIO passthrough devices, guest memory pages are mapped in two secondary address spaces simultaneously:
- EPT/NPT/Stage-2 — the VM's second-level page tables, managed by KVM. Updated via `MmuNotifier` callbacks (Section 4.8) when the host VMM's primary page tables change.
- IOMMU page tables — the device's DMA translation tables, managed by iommufd or the KVM `VmPassthrough` path.
The MmuNotifier mechanism covers EPT invalidation when the host moves or
reclaims pages (NUMA balancing, compaction, KSM, swap). However, IOMMU page
tables are not subscribed to MmuNotifier — IOMMU mappings are managed
explicitly via IOMMU_IOAS_MAP/IOMMU_IOAS_UNMAP ioctls, not derived from
the host process's primary page tables.
This creates a coherence hazard: if the host kernel moves a page that is
simultaneously mapped in EPT (via MmuNotifier → EPT invalidation) and
in the IOMMU domain (via IOMMU_IOAS_MAP pinning), the device would DMA to
the old physical address while the CPU accesses the new physical address.
Resolution — VFIO page pinning prevents the hazard:
Pages mapped into an IOMMU domain via IOMMU_IOAS_MAP are pinned with
FOLL_LONGTERM (Section 4.8). Pinned pages are
excluded from:
- NUMA balancing migration: The NUMA scanner skips pages with elevated refcount (pinned). The page stays on its original node.
- Memory compaction: The compaction scanner's `isolate_lru_page()` fails for pinned pages (refcount check). The page is not moved.
- KSM merging: KSM skips pages with `page_count > 1` (pinned pages always have elevated count). No COW merge occurs.
- Swap-out: Pinned pages have `PG_mlocked` set and are excluded from the reclaim LRU scan.
Because the host kernel never moves a VFIO-pinned page, the IOMMU mapping
remains valid for the page's entire pinned lifetime. The MmuNotifier
invalidate_range_start/end callbacks are never invoked for these pages
(there is nothing to invalidate — the primary PTE is stable).
Unpin lifecycle: When the VMM calls IOMMU_IOAS_UNMAP, iommufd unpins
the pages (unpin_user_pages()), removes the IOMMU page table entries, and
issues an IOTLB flush. Only after the IOTLB flush completes are the pages
returned to the free pool and eligible for migration/reclaim.
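The ordering constraint in the unpin lifecycle can be made explicit in a sketch. The stub below (a hypothetical `ioas_unmap`; real code manipulates IOMMU page tables and the page allocator) just records the steps, encoding the invariant that frames become reclaimable only after the IOTLB flush completes — otherwise a stale IOTLB entry could let the device DMA into a reallocated frame:

```rust
/// Teardown sequence for IOMMU_IOAS_UNMAP (illustrative step log).
fn ioas_unmap(log: &mut Vec<&'static str>) {
    log.push("unpin_user_pages");  // drop the FOLL_LONGTERM pins
    log.push("remove_iommu_ptes"); // tear down the DMA translations
    log.push("iotlb_flush_sync");  // wait until no stale IOTLB entry remains
    log.push("pages_reclaimable"); // only now may migration/reclaim touch them
}
```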
Invariant: A page mapped in an IOMMU domain is always pinned. A pinned
page is never subject to MmuNotifier callbacks (because the host never
moves it). Therefore EPT and IOMMU are always coherent: EPT maps the page
at its current host physical address, and IOMMU maps the same host physical
address — both derived from the same stable primary PTE.
18.3.5.1.1 TLB Invalidation Batching¶
KVM registers an MmuNotifier subscriber that invalidates Stage-2/EPT mappings
when host page tables change. To avoid IPI storms during multi-VMA operations,
KVM uses the MmuNotifierGuard RAII API
(see Section 4.8):
- On VM-exit that triggers host page table changes, KVM accumulates all `invalidate_range` callbacks into an `MmuNotifierRangeBuilder`.
- On the next VM-entry, a single architecture-specific TLB invalidation covers the union range: `INVEPT` on x86-64, `TLBI IPAS2E1IS` + `DSB ISH` on AArch64, `HFENCE.GVMA` on RISC-V.
- For AArch64, this reduces `DSB ISH` barriers from N (one per VMA) to 1, eliminating the inner-shareable broadcast storm. For a 10-VMA `munmap` on an 8-vCPU guest, this saves ~9 cross-core barrier sequences (~200-500 cycles each).
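A toy model of the batching, with a simplified stand-in for `MmuNotifierRangeBuilder` that keeps only the union of accumulated ranges and counts emitted flushes (the real builder also tracks per-architecture invalidation granularity):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct GpaRange { start: u64, end: u64 } // half-open [start, end)

/// Accumulates per-VMA invalidations; emits one union range at VM entry.
struct RangeBuilder { union: Option<GpaRange>, flushes_issued: u32 }

impl RangeBuilder {
    fn new() -> Self { Self { union: None, flushes_issued: 0 } }

    /// Called once per invalidate_range callback (one per VMA).
    fn accumulate(&mut self, r: GpaRange) {
        self.union = Some(match self.union {
            None => r,
            Some(u) => GpaRange { start: u.start.min(r.start), end: u.end.max(r.end) },
        });
    }

    /// Called once on VM entry: one INVEPT / TLBI+DSB / HFENCE.GVMA
    /// covering the union, instead of one flush per VMA.
    fn flush(&mut self) -> Option<GpaRange> {
        let r = self.union.take();
        if r.is_some() {
            self.flushes_issued += 1;
        }
        r
    }
}
```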
18.3.6 Post-Copy SwitchToPostCopy and mmap_lock Latency¶
The SwitchToPostCopy convergence action (Section 18.1)
registers a per-range fault handler via mm::register_fault_handler(). This
registration needs mmap_lock.write() on the VMM process's MmStruct to
install the fault handler in the VMA tree.
Latency risk: During VM execution, vCPU threads hold mmap_lock.read()
while resolving EPT/Stage-2 faults (host VA → PA lookup). On a VM with many
vCPUs, mmap_lock.read() may be held near-continuously — each EPT fault
takes 1-5 us, and on a 128-vCPU VM with active workload, faults can overlap.
The mmap_lock.write() request from SwitchToPostCopy must wait for all
readers to drain, which can take tens of milliseconds on large VMs.
During this wait, new EPT faults also stall (writer-pending blocks new
readers on a fair RwLock), compounding guest latency.
Mitigation — per-range fault handler registration without mmap_lock.write():
UmkaOS uses a dedicated post-copy fault handler slot on the Vm struct
rather than modifying the VMM process's VMA tree:
/// Post-copy fault handler state, stored on Vm (not on MmStruct).
/// Registration and lookup do not require mmap_lock.write().
pub struct PostCopyState {
/// Sorted array of GPA ranges pending transfer from the source host.
/// Lookup is O(log n) via binary search. Protected by a dedicated
/// SpinLock (never held concurrently with mmap_lock).
///
/// **Bound**: Adjacent ranges are coalesced during the faulting pass (when
/// a page is received, its range may merge with neighbors). The Vec grows
/// proportionally to the number of non-contiguous pending regions, bounded
/// by the number of KVM memory slots (typically < 500). Worst case without
/// coalescing would be per-page, but the migration protocol transfers pages
/// in batches that are range-coalesced by the source before sending.
/// Pre-allocated with capacity = max_memory_slots at `SwitchToPostCopy`
/// activation (BEFORE the spinlock is acquired). During post-copy operation,
/// only existing elements are modified (coalesce/remove) — no `push()` or
/// `extend()` calls under the spinlock. This prevents heap allocation under
/// SpinLock (which disables preemption and must not sleep).
pub pending_ranges: SpinLock<Vec<GpaRange>>,
/// Migration channel to the source host (TCP or RDMA).
pub source_channel: Arc<MigrationChannel>,
/// Set to true when post-copy is active. Checked on EPT fault before
/// consulting the host page tables.
pub active: AtomicBool,
}
EPT fault path with post-copy:
- vCPU takes an EPT violation. The fault handler acquires `mmap_lock.read()` as normal.
- Before walking host page tables, check `vm.postcopy.active.load(Acquire)`.
- If active: look up the faulting GPA in `postcopy.pending_ranges`. If found, the page has not yet been transferred. Release `mmap_lock.read()`, request the page from the source host via `postcopy.source_channel`, and block the vCPU thread until the page arrives. Once received, map it into the VMM's address space and re-fault (the EPT fault will now resolve normally).
- If not found in pending ranges: the page was already transferred. Proceed with normal EPT fault resolution.
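The pending-range lookup on this path reduces to a binary search over the sorted, coalesced ranges. A sketch with simplified types (no locking or channel I/O shown; `postcopy_check` is an illustrative name):

```rust
#[derive(Clone, Copy)]
struct GpaRange { start: u64, end: u64 } // half-open [start, end)

enum FaultAction {
    RequestFromSource, // page still pending: pull it from the source host
    ResolveLocally,    // page already transferred: normal EPT fault path
}

/// Post-copy check on EPT fault. `pending` must be sorted by `start`
/// and non-overlapping (the coalescing pass maintains this).
fn postcopy_check(active: bool, pending: &[GpaRange], gpa: u64) -> FaultAction {
    if !active {
        return FaultAction::ResolveLocally;
    }
    // partition_point returns the index of the first range with
    // start > gpa; the only candidate container is the range before it.
    let i = pending.partition_point(|r| r.start <= gpa);
    if i > 0 && gpa < pending[i - 1].end {
        FaultAction::RequestFromSource
    } else {
        FaultAction::ResolveLocally
    }
}
```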
Key property: SwitchToPostCopy sets vm.postcopy.active and populates
postcopy.pending_ranges — both operations require only the PostCopyState
SpinLock, not mmap_lock.write(). The post-copy activation latency is
therefore O(us) regardless of how many vCPU threads hold mmap_lock.read().
The mm::register_fault_handler() API documented earlier in this section
is still the generic mechanism for kernel-internal fault handlers. For
post-copy specifically, the PostCopyState path is preferred because it
avoids the mmap_lock write-side latency that would otherwise stall all
vCPUs during activation.
18.3.7 vGIC — Virtual GICv3 Emulation¶
On AArch64 hosts, umka-kvm emulates a GICv3 interrupt controller for each VM. The vGIC is an in-kernel device model (same tier as the x86 LAPIC/IOAPIC) because interrupt delivery is the hottest path in virtualisation — exiting to userspace for every interrupt injection would add 2-5 us per interrupt, destroying networking and storage throughput.
Architecture: A GICv3 system comprises a Distributor (GICD, one per system),
Redistributors (GICR, one per CPU), and CPU Interfaces (accessed via system
registers ICC_*_EL1). The hardware GICv3 supports direct virtual interrupt injection
via List Registers (LRs) in the ICH_LR<n>_EL2 system registers, minimising trap
overhead.
18.3.7.1 Distributor (GICD) MMIO Trap¶
The Distributor is mapped at a guest-physical address configured by the VMM via
KVM_VGIC_V3_ADDR_TYPE_DIST. All guest MMIO accesses to this region cause a Stage-2
fault trapped by umka-kvm. The vGIC distributor handler emulates:
/// GICD register offsets and emulation behaviour.
/// All registers are 32-bit unless noted.
pub struct VgicDistributor {
/// GICD_CTLR: distributor enable. Bit 0 = EnableGrp1NS.
/// Guest write → updates `enabled` flag; if transitioning from
/// disabled→enabled, flushes all pending interrupts to LRs.
pub ctlr: u32,
/// GICD_TYPER: read-only, reports configured INTID range and
/// number of implemented LRs. Bits [4:0] = ITLinesNumber (N),
/// supporting 32*(N+1) SPIs. Bits [7:5] = CPUNumber (max VCPUs - 1).
pub typer: u32,
/// SPI enable bits (GICD_ISENABLER[1..N] / GICD_ICENABLER[1..N]).
/// Bit per INTID (32..1019). XArray<u32> keyed by register index
/// (each u32 covers 32 INTIDs). Guest writes to ISENABLER set bits;
/// writes to ICENABLER clear bits.
pub spi_enabled: XArray<u32>,
/// SPI pending bits (GICD_ISPENDR / GICD_ICPENDR). Same layout as
/// `spi_enabled`. Set by `vcpu_inject_irq()` or guest write to
/// ISPENDR; cleared by guest write to ICPENDR or by LR delivery.
pub spi_pending: XArray<u32>,
/// SPI priority (GICD_IPRIORITYR). One byte per INTID, packed 4 per
/// register. XArray<u32> keyed by register index.
pub spi_priority: XArray<u32>,
/// SPI routing (GICD_IROUTER). One u64 per SPI INTID (32..1019).
/// Bits [39:32] = Aff3, [23:16] = Aff2, [15:8] = Aff1, [7:0] = Aff0.
/// Bit 31 (Interrupt_Routing_Mode): 0 = target specific PE,
/// 1 = any participating PE.
pub spi_route: XArray<u64>,
/// Lock serialising distributor state mutations. Not held on the
/// LR programming fast path (that path reads atomically).
pub lock: SpinLock<()>,
}
Register emulation table:
| Register | Offset | Emulation |
|---|---|---|
| `GICD_CTLR` | `0x0000` | R/W — enable/disable distributor. Write flushes pending to LRs. |
| `GICD_TYPER` | `0x0004` | RO — reflects VM's configured SPI count and VCPU count. |
| `GICD_ISENABLER<n>` | `0x0100+` | WS (write-set) — sets enable bits in `spi_enabled`. |
| `GICD_ICENABLER<n>` | `0x0180+` | WC (write-clear) — clears enable bits in `spi_enabled`. |
| `GICD_ISPENDR<n>` | `0x0200+` | WS — sets pending bits; triggers LR flush for target VCPU. |
| `GICD_ICPENDR<n>` | `0x0280+` | WC — clears pending bits. |
| `GICD_IPRIORITYR<n>` | `0x0400+` | R/W — 8-bit priority per INTID. Lower value = higher priority. |
| `GICD_ITARGETSR<n>` | `0x0800+` | R/W — GICv2 compat: 8-bit CPU target mask per INTID. Used by GICv2 guests only; ignored when `GICD_CTLR.ARE_NS = 1`. |
| `GICD_IROUTER<n>` | `0x6100+` | R/W (64-bit) — affinity routing for each SPI. |
18.3.7.2 Redistributor (GICR) Per-VCPU MMIO Trap¶
Each VCPU has a Redistributor frame mapped at a VMM-configured guest-physical
address (KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION). Redistributor frames are 128 KiB
each (64 KiB RD_base + 64 KiB SGI_base), contiguous in guest-physical space.
/// Per-VCPU redistributor state.
pub struct VgicRedistributor {
/// GICR_TYPER: read-only. Encodes VCPU affinity, processor number,
/// and `Last` bit (set on the highest-numbered VCPU's redistributor).
pub typer: u64,
/// SGI/PPI enable bits (GICR_ISENABLER0 / GICR_ICENABLER0).
/// INTIDs 0-31: SGIs (0-15) and PPIs (16-31). Single u32.
pub sgi_ppi_enabled: u32,
/// SGI/PPI pending bits (GICR_ISPENDR0 / GICR_ICPENDR0).
pub sgi_ppi_pending: AtomicU32,
/// SGI/PPI priority (GICR_IPRIORITYR<0..7>). 8 registers × 4 INTIDs.
pub sgi_ppi_priority: [u32; 8],
/// SGI/PPI configuration (GICR_ICFGR0/1). 2 bits per INTID:
/// edge-triggered or level-sensitive.
pub sgi_ppi_config: [u32; 2],
/// LPI enable table base (GICR_PROPBASER). Points to guest-physical
/// memory containing the LPI configuration table (1 byte per LPI:
/// enable bit + priority). Used by ITS emulation.
pub propbaser: u64,
/// LPI pending table base (GICR_PENDBASER). Points to guest-physical
/// memory containing the LPI pending bitmap (1 bit per LPI).
pub pendbaser: u64,
}
18.3.7.3 List Register Programming¶
On VM entry, the vGIC populates the hardware List Registers (ICH_LR<n>_EL2)
with pending virtual interrupts for the target VCPU. GICv3 provides up to 16
LRs (the actual count is read from ICH_VTR_EL2[4:0] + 1).
LR population algorithm (called from vcpu_enter_guest(), after loading
guest register state):
1. Scan the VCPU's `irq_pending` bitmap (a 1024-bit bitmap covering INTIDs 0–1019, stored in `Vcpu::vgic_state`).
2. For each pending INTID, check that the interrupt is enabled (in GICR for SGI/PPI, GICD for SPI) and that its priority is higher (numerically lower) than the VCPU's running priority (`ICH_AP1R<n>_EL2`).
3. Sort eligible interrupts by priority (lowest numerical value first).
4. Write up to `nr_lrs` entries into `ICH_LR<n>_EL2`:
   - Bits [63:62] = State (01 = Pending).
   - Bit [61] = HW (1 if backed by a physical interrupt, e.g., passthrough via VFIO; 0 for purely virtual interrupts).
   - Bits [55:48] = Priority.
   - Bits [44:32] = Physical INTID (if HW=1, for deactivation routing).
   - Bits [31:0] = Virtual INTID.
5. Enable the maintenance interrupt (`ICH_HCR_EL2.LRENPIE = 1`) so that the hardware notifies us when all LRs have been consumed.
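Steps 3-4 can be sketched as pure functions, using the architected `ICH_LR<n>_EL2` field positions (State [63:62], Group [60], Priority [55:48], vINTID [31:0]); the helper names and the `PendingIrq` type are illustrative, and the filtering of step 2 is assumed done by the caller:

```rust
const LR_STATE_PENDING: u64 = 0b01 << 62; // State = Pending
const LR_GROUP1: u64 = 1 << 60;           // Group 1 interrupt

/// Candidate virtual interrupt, already filtered to "pending and enabled".
#[derive(Clone, Copy)]
struct PendingIrq { intid: u32, priority: u8 }

/// Encode one LR value for a purely virtual (HW = 0) Group 1 interrupt.
fn encode_lr(irq: PendingIrq) -> u64 {
    LR_STATE_PENDING | LR_GROUP1 | ((irq.priority as u64) << 48) | irq.intid as u64
}

/// Sort eligible interrupts by priority (numerically lowest first) and
/// fill at most `nr_lrs` list registers; the rest stay in irq_pending
/// until the maintenance interrupt triggers a refill.
fn populate_lrs(mut eligible: Vec<PendingIrq>, nr_lrs: usize) -> Vec<u64> {
    eligible.sort_by_key(|i| i.priority);
    eligible.into_iter().take(nr_lrs).map(encode_lr).collect()
}
```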
/// Per-VCPU vGIC state, stored in `Vcpu::vgic_state`.
pub struct VcpuVgicState {
/// Pending interrupt bitmap for INTIDs 0-1023 (SGIs 0-15, PPIs 16-31,
/// SPIs 32-1019, reserved 1020-1023): bit N = INTID N is pending delivery.
/// Updated by `vcpu_inject_irq()` and by guest writes to ISPENDR.
/// Atomically updated (64-bit words with `AtomicU64`) so that
/// cross-VCPU injection (e.g., SGI from another VCPU) does not
/// require taking a lock on the target VCPU.
pub irq_pending: [AtomicU64; 16], // 1024 bits, INTIDs 0-1023
/// Active interrupt bitmap for INTIDs 0-1023 (same layout as `irq_pending`).
pub irq_active: [AtomicU64; 16],
/// Pending LPI bitmap for INTIDs 8192+ (GICv3 LPIs used by device
/// passthrough via ITS). LPI INTID space is sparse and can extend to
/// 2^32-1, so a fixed-size bitmap is infeasible. An XArray keyed by
/// `(intid >> 6)` stores `AtomicU64` words, each covering 64 LPIs.
/// Allocated on-demand when the guest configures LPI routes via the
/// virtual ITS command queue (`MAPTI`/`MAPI` commands).
/// Empty when no LPIs are configured (no overhead for non-passthrough VMs).
pub lpi_pending: XArray<AtomicU64>,
/// Active LPI bitmap (same XArray layout as `lpi_pending`).
pub lpi_active: XArray<AtomicU64>,
/// Number of hardware LRs available (cached from ICH_VTR_EL2).
pub nr_lrs: u8,
/// Bitmap of LRs currently in use (bit N = LR N contains a valid entry).
pub lr_used: u16,
}
18.3.7.4 Interrupt Injection¶
External code injects a virtual interrupt into a guest VCPU via vcpu_inject_irq():
/// Inject a virtual interrupt into a guest VCPU.
///
/// This function is called from:
/// - irqfd handler (eventfd → guest IRQ, no userspace exit)
/// - in-kernel vhost-net/vhost-scsi (Tier 1 data plane completion)
/// - ITS command emulation (MSI injection from passthrough devices)
/// - Guest-to-guest SGI (vCPU writes to ICC_SGI1R_EL1)
///
/// # Algorithm
/// 1. Set bit `intid` in `target_vcpu.vgic_state.irq_pending`.
/// 2. If `target_vcpu` is currently running on a physical CPU
/// (`vcpu.state == RUNNING`), send a doorbell IPI to that pCPU.
/// The IPI handler forces a lightweight VM exit (sets VGF bit in
/// HCR_EL2 or uses GICv4.1 direct injection if available), and
/// the re-entry path calls the LR population algorithm above.
/// 3. If `target_vcpu` is blocked in WFI (halt), wake the VCPU thread
/// via `wake_up_process()`. The thread will re-enter the guest and
/// populate LRs on entry.
/// 4. If `target_vcpu` is scheduled out (VCPU thread not running),
/// the pending bit persists. LR population happens on next VM entry.
pub fn vcpu_inject_irq(vcpu: &Vcpu, intid: u32);
Doorbell IPI cost: ~1-2 us on modern AArch64 hardware. GICv4.1 direct injection
(available on platforms with GITS_TYPER.VLPI = 1) avoids the IPI entirely by
programming the physical ITS to deliver the virtual interrupt directly to the VCPU's
LR via the GICv4.1 vPE table — reducing injection latency to <500 ns.
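The state-dependent delivery choice in steps 2-4 of the algorithm reduces to a small decision function; the enums below are illustrative stand-ins for the real vCPU run state (setting the pending bit itself is always the first, unconditional step):

```rust
#[derive(Clone, Copy)]
enum VcpuState {
    Running,      // executing guest code on a physical CPU
    BlockedInWfi, // halted, thread sleeping
    ScheduledOut, // thread not currently running
}

enum InjectAction {
    DoorbellIpi,      // force a lightweight VM exit; LRs refilled on re-entry
    WakeThread,       // wake_up_process(); LRs populated on next entry
    DeferToNextEntry, // pending bit persists until the next VM entry
}

/// Choose the delivery mechanism after the pending bit has been set
/// atomically in the target's irq_pending bitmap.
fn inject_action(state: VcpuState) -> InjectAction {
    match state {
        VcpuState::Running => InjectAction::DoorbellIpi,
        VcpuState::BlockedInWfi => InjectAction::WakeThread,
        VcpuState::ScheduledOut => InjectAction::DeferToNextEntry,
    }
}
```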
18.3.7.5 Maintenance Interrupt¶
The GICv3 maintenance interrupt (ICH_HCR_EL2.LRENPIE) fires when the guest
has acknowledged and deactivated enough interrupts that all LRs are now empty
(or when a specific LR transitions from pending to inactive). This traps from
the guest to EL2:
- The maintenance IRQ handler reads `ICH_MISR_EL2` to determine the cause.
- LR underflow (`MISR.LRENP = 1`): All LRs are empty but more interrupts are pending in `irq_pending`. Re-run the LR population algorithm to refill.
- EOI (`MISR.EOI = 1`, when HW=1 LR): A hardware-backed interrupt was deactivated by the guest. The handler signals the physical interrupt controller to deactivate the corresponding physical INTID (write to `ICC_DIR_EL1` at EL2), completing the interrupt lifecycle for passthrough devices.
- Re-enter the guest after LR repopulation.
18.3.7.6 ITS Emulation (Interrupt Translation Service)¶
The ITS maps MSI doorbell writes from PCIe devices to LPI (Locality-specific Peripheral Interrupt) INTIDs routed to specific VCPUs. For VMs with VFIO passthrough devices, the vITS translates guest MSI configuration into physical ITS commands.
Command queue: The guest writes ITS commands to a ring buffer in guest
memory. The ring is described by GITS_CBASER (base address + size) and
GITS_CWRITER (guest write pointer). umka-kvm traps writes to GITS_CWRITER
and processes commands from the queue:
/// Virtual ITS state per VM.
pub struct VgicIts {
/// ITS enabled flag (GITS_CTLR.Enabled).
/// `AtomicU8` (0 = disabled, 1 = enabled) because MMIO trap handler
/// (guest write to GITS_CTLR) and command queue processing can race
/// on different vCPU threads. The `cmd_lock` serialises command
/// execution; the `enabled` check is a fast-path gate using
/// `Relaxed` ordering (the lock provides the necessary ordering
/// for command-queue state).
pub enabled: AtomicU8,
/// Command queue: guest-physical base, size, read/write pointers.
pub cmd_base: PhysAddr,
pub cmd_size: u32,
pub cmd_read: u32,
pub cmd_write: u32,
/// Device table: maps (DeviceID) → interrupt translation table pointer.
/// XArray keyed by DeviceID (integer key, O(1) lookup).
/// Each entry points to a per-device interrupt translation table.
pub device_table: XArray<Arc<ItsDeviceEntry>>,
/// Collection table: maps CollectionID → target VCPU MPIDR.
/// XArray keyed by CollectionID.
pub collection_table: XArray<u64>,
/// Lock serialising command queue processing.
pub cmd_lock: SpinLock<()>,
}
/// Per-device interrupt translation table entry.
pub struct ItsDeviceEntry {
/// Maps EventID → (INTID, CollectionID). XArray keyed by EventID.
/// INTID is the LPI number; CollectionID selects the target VCPU
/// via the collection table.
pub itt: XArray<ItsTranslation>,
}
pub struct ItsTranslation {
/// LPI INTID (8192..2^32-1).
pub intid: u32,
/// Collection ID (indexes into `VgicIts::collection_table`).
pub collection: u16,
}
ITS command emulation:
| Command | Action |
|---|---|
| MAPD (Map Device) | Create/update device table entry. Allocates an ITT for the DeviceID. |
| MAPC (Map Collection) | Bind a CollectionID to a target VCPU (MPIDR → VCPU lookup). |
| MAPTI / MAPI | Map EventID → (INTID, CollectionID) in the device's ITT. |
| INV (Invalidate) | Invalidate cached LPI configuration for a single INTID. Re-reads the LPI config table (GICR_PROPBASER) for the affected INTID. |
| INVALL | Invalidate all cached LPI configurations for a collection. |
| INT | Assert the interrupt identified by (DeviceID, EventID). Translates through the ITT → INTID, routes via the collection → VCPU, calls vcpu_inject_irq(). |
| DISCARD | Remove an ITT entry (unmap EventID). |
| CLEAR | Clear the pending state of the LPI identified by (DeviceID, EventID). |
| SYNC | Ensure all prior commands targeting a specific collection are observable. Memory barrier on the vITS state; no-op if command processing is synchronous. |
After processing each command, umka-kvm advances GITS_CREADR (the hardware
read pointer visible to the guest). If the guest polls GITS_CREADR to wait
for command completion, it sees forward progress.
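The CREADR/CWRITER handshake can be modeled as a simple ring walk. The following is an illustrative simulation (hypothetical `CmdQueue` type; 32-byte command slots, as in the GIC ITS), not the umka-kvm code:

```rust
// Simplified model of vITS command-queue processing. The guest advances
// CWRITER; the host processes commands and advances CREADR until the two
// pointers meet, wrapping at the end of the ring.
const CMD_SIZE: u32 = 32; // GIC ITS commands are 8 doublewords (32 bytes)

struct CmdQueue {
    size: u32,   // ring size in bytes (from GITS_CBASER)
    creadr: u32, // byte offset of next command to process (GITS_CREADR)
}

impl CmdQueue {
    /// Process all commands between CREADR and the guest's CWRITER,
    /// wrapping at the end of the ring. Returns the number processed.
    fn process_until(&mut self, cwriter: u32, mut exec: impl FnMut(u32)) -> u32 {
        let mut n = 0;
        while self.creadr != cwriter {
            exec(self.creadr); // decode + emulate one command
            self.creadr = (self.creadr + CMD_SIZE) % self.size; // advance, wrap
            n += 1;
        }
        n // GITS_CREADR now equals CWRITER: the guest sees completion
    }
}

fn main() {
    let mut q = CmdQueue { size: 256, creadr: 224 };
    // Guest wrote 3 commands: offsets 224, 0 (after wrap), 32 — CWRITER = 64.
    let processed = q.process_until(64, |_off| { /* MAPD/MAPTI/INT ... */ });
    assert_eq!(processed, 3);
    assert_eq!(q.creadr, 64);
}
```

The modulo wrap is why a guest polling GITS_CREADR always observes forward progress, even across the ring boundary.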
Physical ITS integration for passthrough: When a VFIO device is assigned
to a VM, its physical MSI doorbell is programmed (via the physical ITS) to
target a host LPI. The host LPI handler calls vcpu_inject_irq() with the
corresponding virtual INTID. On GICv4.1 hardware, the physical ITS can inject
directly into the VCPU's virtual pending table, bypassing the host entirely
(configured via VMAPTI/VMAPP physical ITS commands).
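The host-side routing for passthrough MSIs amounts to a lookup from host LPI to (VCPU, virtual INTID), registered at VFIO setup time. A simplified sketch with hypothetical names:

```rust
use std::collections::HashMap;

// Sketch: when a passthrough device's MSI arrives as a host LPI, the
// handler looks up the route installed at VFIO setup time and injects
// the corresponding virtual INTID into the target VCPU. Illustrative
// types only; not the umka-kvm implementation.
struct LpiRoute {
    vcpu: usize,        // target VCPU index (resolved via the collection table)
    virtual_intid: u32, // guest LPI INTID (8192..)
}

struct HostLpiTable {
    routes: HashMap<u32, LpiRoute>, // host LPI -> guest route
}

impl HostLpiTable {
    /// Called from the host LPI handler; returns what to inject, if routed.
    fn translate(&self, host_lpi: u32) -> Option<(usize, u32)> {
        self.routes.get(&host_lpi).map(|r| (r.vcpu, r.virtual_intid))
    }
}

fn main() {
    let mut routes = HashMap::new();
    routes.insert(8200, LpiRoute { vcpu: 2, virtual_intid: 8192 });
    let table = HostLpiTable { routes };
    // Physical MSI fired host LPI 8200 -> inject guest LPI 8192 on VCPU 2.
    assert_eq!(table.translate(8200), Some((2, 8192)));
    assert_eq!(table.translate(9000), None); // unrouted: spurious, ignored
}
```

On GICv4.1 this software hop disappears: the route lives in the physical ITS's virtual pending table instead.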
18.4 Suspend and Resume¶
Linux problem: Suspend/resume on laptops was notoriously unreliable for years. Driver suspend/resume callbacks are fragile — one broken driver blocks the entire system.
UmkaOS design:
18.4.1 Suspend Modes¶
- s2idle (suspend-to-idle): Primary suspend mode. Freezes all userspace processes, puts devices into low-power states, and halts CPUs in their deepest idle state. Does not require firmware cooperation (no ACPI S3 handoff), making it more reliable than traditional suspend-to-RAM. Wake sources: any enabled interrupt (keyboard, network, RTC alarm, lid switch).
- S3 (suspend-to-RAM): Fallback for platforms where s2idle power consumption is unacceptable. CPU and device state are saved to RAM, then firmware is called to power down the platform. Requires ACPI S3 support and correct firmware implementation.
- S4 (hibernate / suspend-to-disk) (Phase 5 — data structures defined here for design completeness; implementation deferred): Full system image is written to a swap partition or dedicated hibernate file, then the system powers off completely. On resume, the bootloader loads the kernel, which restores the saved image into memory and resumes execution. Hibernate support depends on the block I/O layer (Section 15.2) being available and a configured swap/hibernate target.
18.4.2 Device State Save/Restore Ordering¶
Device suspend follows the device dependency tree in leaf-to-root order (children before parents). Resume follows root-to-leaf order (parents before children, the reverse of suspend). This ensures that a child device never attempts I/O on a parent that is already suspended, and that parent buses are active before children attempt to re-initialize.
The ordering algorithm:
- Build topological order from the device dependency tree (Section 11.4, device registry).
- Suspend phase — traverse the tree bottom-up:
- For each device, invoke its KABI suspend() callback with a per-device timeout (default per Section 11.4: 2 seconds for Tier 1, 5 seconds for Tier 2; configurable via sysfs).
- If the callback does not complete within the timeout, the driver is forcibly stopped (Tier 1/Tier 2 driver recovery — the driver's isolation domain is torn down). The device is marked as "failed to suspend" and will be re-initialized from scratch on resume.
- DMA engines are quiesced before their parent bus controller suspends.
- Interrupt controllers are suspended last (after all device interrupts are masked).
- Resume phase — traverse the tree top-down:
- Bus controllers and interrupt controllers are restored first.
- For each device, invoke its KABI resume() callback.
- Devices that failed to suspend are re-initialized via the standard driver probe path rather than the resume path.
- Drivers that fail to resume within the timeout are forcibly restarted, same as suspend failures.
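The ordering rules above reduce to a post-order tree walk for suspend, with resume as its exact reverse. A self-contained sketch with a hypothetical device tree (not the Section 11.4 registry types):

```rust
// Illustrative model of suspend/resume ordering: suspend is a post-order
// (children-before-parents) walk; resume is simply the reversed list.
struct Dev {
    name: &'static str,
    children: Vec<Dev>,
}

/// Post-order traversal: all children first, then the parent.
fn suspend_order(dev: &Dev, out: &mut Vec<&'static str>) {
    for child in &dev.children {
        suspend_order(child, out);
    }
    out.push(dev.name);
}

fn main() {
    // pci-root ── nvme
    //         └── usb-hub ── keyboard
    let tree = Dev {
        name: "pci-root",
        children: vec![
            Dev { name: "nvme", children: vec![] },
            Dev {
                name: "usb-hub",
                children: vec![Dev { name: "keyboard", children: vec![] }],
            },
        ],
    };
    let mut suspend = Vec::new();
    suspend_order(&tree, &mut suspend);
    // Children always precede their parent, so a child never does I/O
    // against an already-suspended bus.
    assert_eq!(suspend, ["nvme", "keyboard", "usb-hub", "pci-root"]);
    // Resume is root-to-leaf: the reverse of the suspend list.
    let resume: Vec<_> = suspend.iter().rev().cloned().collect();
    assert_eq!(resume, ["pci-root", "usb-hub", "keyboard", "nvme"]);
}
```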
18.4.3 CPU State Save/Restore¶
On suspend, the kernel saves per-CPU state that is not preserved by hardware across the suspend/resume cycle:
- General-purpose registers: Saved to a per-CPU save area in the kernel BSS. On resume, the boot CPU restores its own state and brings up secondary CPUs via the normal SMP bringup path (Section 2.1), which re-initializes their register state.
- System registers / MSRs: Architecture-specific system register state that must be explicitly restored:
- x86_64: IA32_EFER, IA32_STAR, IA32_LSTAR, IA32_FMASK (syscall registers), IA32_PAT, IA32_KERNEL_GS_BASE, GDT/IDT/TR descriptors, CR0/CR3/CR4, debug registers (DR0-DR7), XCR0 (XSAVE state), IA32_SPEC_CTRL (Spectre mitigations)
- AArch64: SCTLR_EL1, TCR_EL1, TTBR0/TTBR1_EL1, MAIR_EL1, VBAR_EL1, TPIDR_EL1, CNTKCTL_EL1, CPACR_EL1
- RISC-V: satp, stvec, sscratch, sie, sstatus
- ARMv7: SCTLR, TTBR0/TTBR1, TTBCR, DACR, VBAR, TPIDRPRW, CNTKCTL, CPACR, DFAR/IFAR, CONTEXTIDR (deferred to Phase 3: full list pending ARMv7 suspend implementation)
- PPC32: MSR, SRR0/SRR1, SPRG0-3, DEC, PVR, HID0/HID1, DBAT/IBAT registers, L1CSR0/L1CSR1 (deferred to Phase 3: full list pending PPC32 suspend implementation)
- PPC64LE: MSR, SRR0/SRR1, SPRG0-3, LPCR, HSPRG0/1, DEC, AMOR, PIDR, PTCR (Radix), SDR1 (HPT). On POWER9+: PSSCR for stop states. (Deferred to Phase 3: full list pending PPC64LE suspend implementation; OPAL firmware may handle some state save/restore)
- FPU/SIMD state: Saved via XSAVE (x86_64), STP of Q0-Q31 + FPCR/FPSR (AArch64), or architecture-specific equivalent. Eager FPU restore is used on resume — FPU state is restored immediately on all CPUs before executing any userspace or untrusted code, preventing the CVE-2018-3665 (LazyFP) speculative execution side-channel vulnerability.
18.4.4 Memory Handling¶
- s2idle and S3: RAM remains powered. No memory save/restore is needed. The kernel
only needs to ensure that all dirty cache lines are flushed to RAM before the CPU
enters the suspended state (WBINVD on x86_64; DC CISW (Clean and Invalidate by
Set/Way) iterated over all sets and ways + DSB on AArch64 — DC CIVAC is per-VA
and cannot flush the entire cache without iterating all dirty addresses, which is
impractical; DC CISW is the standard ARM approach for full-cache flush before S3;
see Section 18.2, "AArch64 Cache Geometry Discovery and DC CISW Flush",
for the register layout and geometry discovery algorithm used by
flush_dcache_all()).
- S4 (hibernate): The kernel creates a consistent snapshot of all in-use memory pages using a two-phase freeze-and-snapshot approach:
- Freeze phase: All userspace processes are frozen (SIGSTOP equivalent). All Tier 1 and Tier 2 drivers (except the storage stack required for the hibernate target) are suspended via the suspend path (Section 18.4), which quiesces their DMA activity. After this point, no new memory modifications occur except from the snapshot code and the active storage drivers.
- Snapshot phase: With all sources of concurrent modification stopped, the kernel walks the page frame allocator's used-page bitmap. Free pages (tracked by the buddy allocator) are excluded. Each in-use page is compressed (LZ4, matching Linux's default hibernate compressor) and written to the configured hibernate target (swap partition or file). The snapshot code runs on the boot CPU only, with interrupts disabled except for the disk I/O completion interrupt.
- Integrity and authentication: The hibernate image is cryptographically
authenticated to prevent tampering:
- A SHA-256 hash of the compressed image is computed during the write.
- The hash is signed (not merely stored) using a TPM-backed key if a TPM is available (Section 9.4), or encrypted with a symmetric key derived from the kernel's hibernate secret — a 256-bit random key generated at boot from the hardware RNG (RDRAND/RNDR), stored only in kernel memory, and destroyed on shutdown. On systems with TPM, the hibernate secret is additionally sealed to the TPM (PCR-bound) so that only a boot configuration matching the original can unseal and verify the image. On non-TPM systems, the hibernate secret does not survive a true cold reboot (full power-off). True ACPI S4 (power completely removed, cold resume from disk) is only reliably supported in TPM mode. Non-TPM systems support a memory-preservation mode:
- Detection: At early boot, the kernel checks the ACPI FACS (Firmware ACPI Control Structure) for the S4BIOS_F flag AND probes the EFI memory map for an EfiPersistentMemory region. If either mechanism confirms that at least one 4 KiB page survives an S4 transition, the memory-preservation path is available.
- Protocol: Before entering S4, the kernel writes the 256-bit hibernate
secret into the preserved region (either a FACS scratch area or the EFI
persistent memory page). The integrity guard depends on TPM availability:
- TPM-equipped systems: The kernel derives a verification key (HKDF-SHA3-256, label "hibernate-preserve-hmac") from the TPM storage hierarchy, seals it to the current PCR state (PCR 0-7), and computes HMAC-SHA3-256 over the entire preserved region using this key. The 256-bit HMAC tag is stored alongside the secret. On resume, the fresh kernel unseals the verification key from the TPM, recomputes the HMAC, and rejects the preserved region if the tag does not match — this detects both firmware corruption and physical tampering with the preserved memory.
- Non-TPM systems: A 32-bit CRC32 guard is stored alongside the secret. CRC32 validates firmware memory-preservation integrity (hardware reliability) only; the threat model for non-TPM systems accepts that a physical attacker with access to the preserved memory region could modify the secret and recompute the CRC32. The hibernate image itself is still authenticated via HMAC-SHA-256 (below), so this weakness is limited to secret substitution in the preserved region. If verification fails (HMAC mismatch, TPM unseal failure, or CRC32 failure), the kernel prints a warning and refuses hibernate resume — the image cannot be authenticated.
- Fallback: Platforms that cannot guarantee firmware memory preservation
(CRC32 fails on a test S4 cycle) must use the TPM path for hibernate
integrity. The kernel exposes this via /sys/power/hibernate_auth_method (tpm | memory-preservation | none).
- On resume, the signature is verified (TPM path) or the hash is decrypted and validated (boot-secret path) before any pages are restored.
- This prevents an attacker with disk write access from substituting a malicious hibernate image, as they cannot forge valid authentication without access to the TPM or the boot secret.
-
Hibernate image format:
/// Hibernate image header — stored at the beginning of the swap partition
/// or hibernate file. Fixed size (4096 bytes, page-aligned).
#[repr(C)]
pub struct HibernateImageHeader {
    /// Magic: b"UMKA_HIB" (8 bytes). Used by the bootloader to detect a
    /// valid hibernate image on the swap device.
    pub magic: [u8; 8],
    /// Header version (currently 1). Future versions must be backwards
    /// compatible or use a different magic.
    pub version: u32,
    /// Explicit padding to align page_count to 8-byte boundary.
    /// On-disk formats must not rely on implicit repr(C) padding.
    pub _pad0: u32,
    /// Total number of in-use page frames in the image.
    pub page_count: u64,
    /// Byte offset from start of device/file to the first compressed page
    /// data block. The header occupies bytes 0..4095; data starts at
    /// `data_offset` (typically 4096).
    pub data_offset: u64,
    /// Physical address of the kernel's resume entry point. The fresh
    /// kernel jumps here after restoring all pages.
    pub resume_entry: u64,
    /// Physical address of the page frame number (PFN) map. The PFN map
    /// is an array of `(pfn: u64, compressed_offset: u64, compressed_len: u32)`
    /// tuples, one per saved page, stored contiguously after the header.
    pub pfn_map_offset: u64,
    /// Number of entries in the PFN map (== page_count).
    pub pfn_map_count: u64,
    /// LZ4 compressed total size (bytes) of all page data blocks.
    pub compressed_size: u64,
    /// SHA-256 hash of the entire compressed image (PFN map + page data).
    pub image_hash: [u8; 32],
    /// Signature over `image_hash` (TPM-backed HMAC-SHA-256 or sealed key).
    /// Zero-filled if non-TPM boot-secret mode.
    /// **Size**: 64 bytes accommodates HMAC-SHA-256 (32 bytes, left-padded),
    /// HMAC-SHA-512 (64 bytes), and Ed25519 (64 bytes). For PQC algorithms
    /// (ML-DSA-65: 3309 bytes), the signature is stored in a secondary
    /// block following the header, and this field contains only the first
    /// 64 bytes as a fast-reject prefix check. `sig_algorithm` discriminates.
    pub signature: [u8; 64],
    /// Signature algorithm: 0 = HMAC-SHA-256 (boot secret, 32 bytes in
    /// signature field), 1 = TPM2-signed HMAC-SHA-256, 2 = Ed25519 (64 bytes),
    /// 3 = ML-DSA-65 (prefix in signature, full sig in next block).
    pub sig_algorithm: u8,
    /// Padding to 4096 bytes.
    pub _reserved: [u8; 3935],
}
// Field offset table (all offsets in bytes):
// magic:            0 ..   8  (8 bytes)
// version:          8 ..  12  (4 bytes)
// _pad0:           12 ..  16  (4 bytes)
// page_count:      16 ..  24  (8 bytes)
// data_offset:     24 ..  32  (8 bytes)
// resume_entry:    32 ..  40  (8 bytes)
// pfn_map_offset:  40 ..  48  (8 bytes)
// pfn_map_count:   48 ..  56  (8 bytes)
// compressed_size: 56 ..  64  (8 bytes)
// image_hash:      64 ..  96  (32 bytes)
// signature:       96 .. 160  (64 bytes)
// sig_algorithm:  160 .. 161  (1 byte)
// _reserved:      161 .. 4096 (3935 bytes)
// Total: 4096 bytes ✓
/// Compile-time layout guarantee. The hibernate image header must be exactly
/// one page (4096 bytes) for on-disk format stability.
const _: () = assert!(core::mem::size_of::<HibernateImageHeader>() == 4096);
-
Resume procedure: On resume, the bootloader loads a fresh kernel, which reads the hibernate image header from the swap device, verifies the magic and version, then verifies the image hash (TPM or boot-secret path) before any pages are restored. The fresh kernel allocates intermediate safe memory (bounce frames) at physical addresses that do not overlap with any saved page's original address. It then decompresses and copies the saved pages to their original physical addresses using the PFN map, taking care to avoid overwriting its own executing code or page tables. Finally, it jumps to resume_entry via a per-architecture identity-mapped trampoline (since the resumed kernel's virtual address space differs from the fresh kernel's):
- x86-64: Build a temporary PML4 that identity-maps the trampoline code page and resume_entry's physical page. Disable interrupts, load the temporary CR3, and jump to resume_entry's physical address. The resumed kernel's first instruction loads its own CR3 (saved in the hibernate image), switching to the restored virtual address layout.
- AArch64: Write the trampoline page's PA into TTBR0_EL1 (identity-map via a single 1 GB block entry). DSB SY; ISB; BR x0 to resume_entry. The resumed kernel restores TTBR1_EL1 (kernel mapping) and TTBR0_EL1 (user mapping) from saved state.
- RISC-V: Set satp to a temporary Sv48 table identity-mapping the trampoline page and resume_entry. SFENCE.VMA; JR resume_entry.
- ARMv7/PPC32/PPC64LE: Similar identity-map trampoline patterns using the respective MMU register (TTBR0/SDR1/LPCR).
The resumed kernel then re-initializes devices via the resume path described in Section 18.4.
-
Bootloader integration: The bootloader (GRUB or UEFI stub) checks the configured swap partition for UMKA_HIB magic at offset 0 before loading a fresh kernel. If a valid header is found, the bootloader passes umka.hibernate_resume=/dev/sdXN on the kernel command line. The fresh kernel's init code checks this parameter and enters the resume path instead of normal init.
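The bootloader's detection step is just a fixed-offset magic compare. A minimal sketch of the probe, assuming the field layout of the HibernateImageHeader struct above (magic at bytes 0..8, little-endian version at 8..12):

```rust
// Minimal sketch of the bootloader's hibernate-image probe: read the
// first bytes of the swap device and check magic + version before
// deciding whether to pass the resume parameter to the kernel.
const HIB_MAGIC: &[u8; 8] = b"UMKA_HIB";

/// Returns Some(version) if `sector0` begins a valid hibernate header.
fn probe_hibernate_header(sector0: &[u8]) -> Option<u32> {
    if sector0.len() < 12 || &sector0[0..8] != HIB_MAGIC {
        return None; // no image: boot normally
    }
    let version = u32::from_le_bytes(sector0[8..12].try_into().unwrap());
    (version == 1).then_some(version) // only header version 1 is understood
}

fn main() {
    let mut sector = [0u8; 4096];
    sector[0..8].copy_from_slice(HIB_MAGIC);
    sector[8..12].copy_from_slice(&1u32.to_le_bytes());
    // Valid image: bootloader would pass umka.hibernate_resume=...
    assert_eq!(probe_hibernate_header(&sector), Some(1));
    // Stale/absent image: fall through to normal init.
    assert_eq!(probe_hibernate_header(&[0u8; 4096]), None);
}
```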
18.4.5 Timer Re-synchronization¶
System clocks drift or lose state during suspend. On resume, the kernel must re-synchronize all time sources:
- Read the hardware RTC (CMOS on x86, PL031 on ARM, or platform-specific RTC) to determine wall-clock time elapsed during suspend.
- Adjust the CLOCK_BOOTTIME offset by the elapsed suspend duration so that it reflects total wall-clock time since boot, including suspend. CLOCK_MONOTONIC is not adjusted — it does not count time spent in suspend, matching Linux semantics. CLOCK_BOOTTIME includes suspend time by definition.
- Re-calibrate TSC / arch timer: On x86, re-read IA32_TSC_ADJUST if available, or re-synchronize TSC across CPUs via the TSC synchronization protocol (Section 7.8). On AArch64, the generic timer (CNTPCT_EL0) typically survives S3 suspend but must be verified on resume. On platforms with paravirtual clocks (KVM pvclock, Hyper-V TSC page), the shared clock page is re-read to pick up updated scale/offset values.
- Fire expired timers: All pending hrtimer and timer_list entries are checked against their respective updated time bases. Timers armed against CLOCK_BOOTTIME or CLOCK_REALTIME that expired during suspend are fired immediately in a batch. Timers armed against CLOCK_MONOTONIC are evaluated against the unadjusted monotonic clock and will not fire prematurely.
- Notify userspace: A CLOCK_REALTIME discontinuity notification is sent to processes using timerfd or clock_nanosleep so they can adjust.
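The clock adjustment in the steps above is plain offset arithmetic; a sketch under assumed nanosecond bookkeeping (illustrative field names, not the clocksource subsystem's types):

```rust
// Sketch of CLOCK_BOOTTIME vs CLOCK_MONOTONIC accounting across suspend.
// All values in nanoseconds.
struct TimeState {
    monotonic_ns: u64,       // does not advance during suspend
    boottime_offset_ns: u64, // boottime = monotonic + offset
}

impl TimeState {
    fn boottime_ns(&self) -> u64 {
        self.monotonic_ns + self.boottime_offset_ns
    }
    /// On resume: the RTC says `suspend_duration_ns` elapsed while
    /// suspended. Only the BOOTTIME offset grows; MONOTONIC is untouched,
    /// so MONOTONIC-armed timers cannot fire prematurely.
    fn resume(&mut self, suspend_duration_ns: u64) {
        self.boottime_offset_ns += suspend_duration_ns;
    }
}

fn main() {
    let mut t = TimeState { monotonic_ns: 5_000, boottime_offset_ns: 0 };
    t.resume(60_000_000_000); // RTC delta: 60 s spent suspended
    assert_eq!(t.monotonic_ns, 5_000);            // unchanged (Linux semantics)
    assert_eq!(t.boottime_ns(), 60_000_005_000);  // includes suspend time
}
```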
18.4.6 Interrupt Controller State¶
Interrupt controller state is saved on suspend and restored before any device resume callbacks are invoked:
- x86_64 (APIC): Save and restore the Local APIC registers (LVT entries, TPR, timer configuration, spurious interrupt vector). The I/O APIC redirection table entries are saved per-pin. MSI/MSI-X vectors are re-programmed by the PCI subsystem during device resume — the interrupt controller layer saves the IRQ-to-vector mapping so that devices resume with the same interrupt vectors they had before suspend.
- AArch64 (GICv3): Save and restore the GIC Distributor (GICD), Redistributor (GICR), and CPU Interface (ICC) state. GICv3 defines standard save/restore registers for this purpose.
- RISC-V (PLIC/APLIC): Save per-source priority, per-context enable bits, and threshold registers.
The kernel disables all interrupts (except the wake source IRQ) before entering the final suspend state, and re-enables them after interrupt controller state is restored on resume.
18.5 VFIO and iommufd — Device Passthrough Framework¶
VFIO (Virtual Function I/O) is the kernel framework that exposes physical PCIe devices directly to user-space processes — primarily KVM guests managed by a VMM such as QEMU. The device is detached from its host driver and assigned exclusively to the guest, which then drives it with its own unmodified guest driver. From the guest's perspective, the device is indistinguishable from a real bare-metal device. VFIO relies on the IOMMU subsystem (Section 11.3) to confine device DMA to the guest's physical address space.
iommufd is the modern replacement for the legacy vfio_iommu_type1 API.
It provides a richer, more composable object model and is the preferred interface
for all new VMM implementations.
18.5.1 VFIO Object Model¶
VFIO exposes three primary objects to userspace:
VfioGroup — A set of devices that the IOMMU requires to be isolated together
(an IOMMU group). An IOMMU group is the minimal unit of isolation: all devices in the
same group share DMA visibility and must either all be passed through to the same guest,
or all remain bound to their host drivers. This constraint follows from the PCIe ACS
(Access Control Services) topology: if a PCIe switch lacks ACS, peer devices behind
that switch can DMA to each other's address space, so they form a single IOMMU group.
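The all-or-nothing constraint can be enforced with a simple viability check before any member of a group is handed to a guest. A hedged sketch with hypothetical types (the real check would live in the VFIO group-open path):

```rust
// Sketch of the IOMMU-group viability check: a group may be passed
// through only if *every* member device has been unbound from its host
// driver, because group members share DMA visibility.
#[derive(Clone, Copy, PartialEq)]
enum Binding {
    HostDriver, // still driven by the host
    VfioStub,   // unbound, held by a vfio placeholder driver
}

/// True iff the whole group is safe to assign to a single guest.
fn group_viable(members: &[Binding]) -> bool {
    members.iter().all(|&b| b == Binding::VfioStub)
}

fn main() {
    use Binding::*;
    // GPU + its audio function behind a non-ACS switch: one group.
    assert!(group_viable(&[VfioStub, VfioStub]));
    // Audio function still bound to the host: passthrough must be
    // refused, since the host-owned peer could DMA into guest memory.
    assert!(!group_viable(&[VfioStub, HostDriver]));
}
```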
// umka-kvm/src/vfio/group.rs
/// An IOMMU group — set of devices that must be isolated together.
/// Corresponds to a /dev/vfio/N file descriptor in userspace.
pub struct VfioGroup {
/// Kernel IOMMU group identity.
pub iommu_group_id: u32,
/// All devices in this group. Must all be unbound from host drivers
/// before any can be assigned to a guest (or all must be bound).
/// **Bound**: Populated once at IOMMU group discovery (boot or hot-plug).
/// Typical count: 1 (single-function device) to 8 (multi-function PCIe
/// device). Maximum: bounded by PCIe topology (256 functions per bus ×
/// 256 buses, but IOMMU groups rarely exceed 16 devices). Cold-path
/// allocation (group discovery), so Vec is acceptable.
pub devices: Vec<Arc<VfioDevice>>,
/// Reference to the iommufd context this group is attached to.
/// None if the group is not yet associated with an IOAS.
pub iommufd_ctx: Option<Arc<IommufdCtx>>,
/// Exclusive open lock: only one VMM process may open a given group.
pub open_mutex: Mutex<()>,
}
VfioDevice — A single PCIe function or platform device. Provides three
capabilities to the VMM:
- Region access: read/write of MMIO regions (BARs, ROM, config space) via pread/pwrite
on the device fd, or mmap for regions that allow direct mapping.
- Interrupt injection: delivery of device interrupts (INTx, MSI, MSI-X) to the
guest via the irqbypass mechanism (Section 18.5).
- Reset: VFIO_DEVICE_RESET triggers a Function-Level Reset (FLR) or bus reset
as appropriate for the device type.
// umka-kvm/src/vfio/device.rs
/// Maximum IRQ bypass producers per device. Matches MAX_MSIX_VECTORS (2048).
pub const MAX_IRQBYPASS_PRODUCERS: usize = 2048;
/// A single passthrough device (one PCIe function or platform device).
pub struct VfioDevice {
pub name: ArrayString<64>, // e.g. "0000:03:00.0"
pub group: Weak<VfioGroup>,
pub pci_dev: Option<Arc<PciDevice>>, // None for platform devices
/// BAR regions. Most PCIe devices have <=8 BARs (6 standard + 2 optional).
/// Fixed array avoids heap allocation on the hot KVM MMIO path.
pub bars: [Option<VfioRegion>; 8],
/// Count of active (Some) bars for iteration.
pub bar_count: u8,
/// Overflow for non-standard devices exposing additional regions beyond
/// the PCIe 6-BAR limit (e.g., platform-specific or vendor-defined regions).
/// None for the vast majority of hardware (PCIe max is 6 BARs).
pub bar_overflow: Option<Vec<VfioRegion>>,
/// IRQ configuration for each interrupt type (INTX, MSI, MSI-X).
/// PCIe spec allows max 3 interrupt types (INTX, MSI, MSI-X),
/// so ArrayVec<3> is sufficient and avoids heap allocation.
pub irqs: ArrayVec<VfioIrqConfig, 3>,
/// Flags: VFIO_DEVICE_FLAGS_PCI | VFIO_DEVICE_FLAGS_RESET | ...
pub flags: VfioDeviceFlags,
/// IOMMU domain this device is attached to.
/// Legacy VFIO path: direct IommuDomain reference. The iommufd path uses
/// IoAddrSpace abstraction instead.
pub iommu_domain: Option<Arc<IommuDomain>>,
/// IRQ bypass producers for KVM IRQFD interrupt injection.
/// Set when VFIO_DEVICE_SET_IRQS is called with a KVM IRQFD eventfd.
/// PCIe MSI-X supports up to 2048 vectors, but typical GPU/NIC use ≤ 64.
/// Bounded by MAX_IRQBYPASS_PRODUCERS at the VFIO_DEVICE_SET_IRQS ioctl.
/// Vec with documented bound: VFIO device creation is cold-path.
pub irqbypass_producers: Vec<IrqBypassProducer>,
}
/// A VFIO memory region (BAR, ROM, Config, or platform-specific).
pub struct VfioRegion {
pub index: u32,
pub flags: VfioRegionFlags, // READ | WRITE | MMAP | CAPS
pub size: u64,
/// Byte offset on the device fd for pread/pwrite access.
pub fd_offset: u64,
/// Physical address of the underlying MMIO resource (for mmap).
pub phys_addr: Option<u64>,
}
/// Values from Linux `include/uapi/linux/vfio.h`.
bitflags! {
pub struct VfioRegionFlags: u32 {
const READ = 0x1;
const WRITE = 0x2;
const MMAP = 0x4;
const CAPS = 0x8;
}
pub struct VfioDeviceFlags: u32 {
const RESET = 1 << 0; // 0x01 -- device supports FLR/bus reset
const PCI = 1 << 1; // 0x02
const PLATFORM = 1 << 2; // 0x04
const AMBA = 1 << 3; // 0x08
const CCW = 1 << 4; // 0x10 -- s390x channel I/O device
const AP = 1 << 5; // 0x20 -- s390x adjunct processor (crypto)
const FSL_MC = 1 << 6; // 0x40 -- NXP QorIQ Management Complex
const CAPS = 1 << 7; // 0x80 -- device info has capability chain
const CDX = 1 << 8; // 0x100 -- AMD CDX bus device
}
}
BAR array sizing: The PCIe spec defines at most 6 BARs for standard endpoints (BAR0-BAR5) plus the expansion ROM. The 8-slot array covers all standard PCIe devices with headroom for expansion ROM regions. The bar_overflow field exists for non-standard platform devices exposing additional vendor-specific regions.
VfioContainer (legacy) — Aggregates multiple VfioGroups under a single IOMMU
domain. Kept for compatibility with older VMMs (QEMU < 8.2). New VMMs use iommufd
exclusively. The container model is superseded because it couples IOMMU domain
lifecycle to group membership, preventing the more flexible IOAS-based mapping
composition that iommufd provides.
18.5.2 iommufd Object Model¶
iommufd introduces a composable object graph, accessed via a single /dev/iommu fd.
All objects are reference-counted and can be shared across multiple VFIO devices or
multiple VMM processes (within capability and policy constraints).
IommufdCtx — The per-fd root context. Owns all objects created on this fd.
// umka-kvm/src/iommufd/ctx.rs
/// Per-fd root context for iommufd. All objects are owned here.
pub struct IommufdCtx {
/// Next object ID (monotonically increasing).
/// u64 internally — truncated to u32 only at the Linux ioctl ABI boundary
/// (IOMMU_IOAS_ALLOC returns u32 id). Internally u64 satisfies the 50-year
/// u64 counter policy. **Longevity**: Object IDs are allocated per IOAS,
/// per HW pagetable, and per device attach — not per VM. At a sustained
/// rate of 10,000 object allocations/sec (far beyond practical: typical
/// is <100/s during VM hot-plug storms), u64 wraps in ~58 million years.
/// The u32 ABI truncation wraps at ~4.3 billion IDs; per-fd (per-process)
/// scope makes collision harmless (process restart resets the counter).
pub next_id: AtomicU64,
/// All IO address spaces created on this fd.
/// XArray keyed by u32 object ID — O(1) lookup, internal locking replaces Mutex.
pub ioas: XArray<Arc<IoAddrSpace>>,
/// Hardware page tables created from IOASes.
pub hw_pagetables: XArray<Arc<HwPagetable>>,
/// Physical devices bound to this context.
pub devices: XArray<Arc<BoundDevice>>,
/// owning process credential — checked on IOMMU_DEVICE_ATTACH.
pub cred: Credential,
}
IoAddrSpace (IOAS) — A virtual DMA address space. Multiple physical devices
can be attached to the same IOAS, causing them all to share the same IOVA→physical
mapping. A KVM VM's guest physical address (GPA) space is implemented as an IOAS: the
VMM maps all guest RAM into it, and all passthrough devices are attached to it, so
device DMA using guest physical addresses is automatically translated to host physical
addresses by the IOMMU.
// umka-kvm/src/iommufd/ioas.rs
/// Alias for `IommuPageTable` ([Section 11.5](11-drivers.md#iommu-and-dma-mapping)) — the canonical
/// opaque arch-specific IOMMU page table root pointer. This alias preserves
/// the VFIO/iommufd naming convention (`Pgd` = Page Global Directory) while
/// referring to the same underlying type defined in the IOMMU subsystem.
pub type IommuPgd = IommuPageTable;
/// An IO Address Space: a virtual DMA address space for one or more devices.
///
/// `IoAddrSpace` wraps an `IommuDomain` ([Section 11.5](11-drivers.md#iommu-and-dma-mapping)) with
/// userspace-facing state (mapping BTree, attached device count, valid IOVA
/// ranges). The `pgd` field references the `IommuDomain`'s page table root.
/// When an `IommuDomain` is owned by an `IoAddrSpace`, `IommuDomain.mappings`
/// is not populated — `IoAddrSpace.mappings` is the sole software mapping
/// authority (see the correspondence table below for the ownership rule).
pub struct IoAddrSpace {
pub id: u32,
/// The underlying IOMMU page directory (arch-specific), obtained from the
/// wrapped `IommuDomain`. Shared with all HwPagetables derived from this IOAS.
pub pgd: Arc<IommuPgd>,
/// All current IOVA mappings, sorted by IOVA start.
/// This BTreeMap is used for IOVA range management (map/unmap ioctls);
/// actual DMA address translation is performed by the hardware IOMMU page
/// table (`pgd`). Map/unmap are warm-path operations (device setup, VM
/// memory slot changes); the per-packet/per-I/O DMA path never touches
/// this BTreeMap. Mutex (not SpinLock) is required because `iommu_map()`
/// allocates IOMMU page table entries, which may sleep on slab allocation
/// under memory pressure. SpinLock would deadlock or BUG if the slab
/// allocator triggers reclaim while spinning. Mutex protects concurrent
/// map/unmap from multiple vCPU threads issuing IOMMU_IOAS_MAP/UNMAP
/// ioctls simultaneously.
///
/// **Safety note (FOLL_LONGTERM)**: Mutex is safe here despite holding
/// it across sleeping allocations because VFIO-pinned pages use
/// FOLL_LONGTERM, which excludes them from reclaim. This prevents
/// re-entrant Mutex acquisition via the reclaim path.
pub mappings: Mutex<BTreeMap<u64, IommuMapping>>,
/// Number of devices currently attached. Mappings cannot be freed
/// while devices are attached (in-flight DMA hazard).
/// `AtomicU32` allows a fast-rejection check (`Relaxed` load) outside
/// the `mappings` Mutex: `IOMMU_IOAS_UNMAP` reads the count first and
/// returns `-EBUSY` without locking if devices are still attached.
/// Increment/decrement uses `Acquire`/`Release` under the Mutex to
/// synchronize with the mapping-table state.
pub attached_device_count: AtomicU32,
/// Valid IOVA ranges (from IOMMU hardware capabilities).
/// **Bound**: Populated once at IOAS creation from IOMMU capabilities.
/// Typical count: 1-3 ranges (one per DMA window). Maximum observed:
/// ~16 on s390x with multiple DMA windows. Cold-path allocation (once
/// per IOAS lifetime), so Vec is acceptable.
pub valid_iova_ranges: Vec<IovaRange>,
}
/// A single IOVA→physical mapping within an IOAS.
pub struct IommuMapping {
/// IO Virtual Address — the address the device will use.
pub iova: u64,
/// Host physical address this IOVA maps to.
pub paddr: u64,
/// Length in bytes. Must be a multiple of the IOMMU page size.
pub len: usize,
/// Access permissions.
pub prot: DmaProt,
}
// DmaProt is defined canonically in [Section 11.5](11-drivers.md#iommu-and-dma-mapping).
// Re-exported here for VFIO/IOMMUFD use:
// use umka_core::iommu::DmaProt;
//
// DmaProt: u32 { READ = 0x1, WRITE = 0x2, NOEXEC = 0x4 }
/// A contiguous range of valid IOVA space reported by the IOMMU.
pub struct IovaRange {
pub start: u64,
pub last: u64, // inclusive
}
HwPagetable — A hardware IOMMU page table derived from one IOAS. On x86-64 this
is the VT-d SLPT (Second-Level Page Table); on ARM64 it is the stage-2 page table; on
AMD systems it is the AMD-Vi page table. HwPagetable holds a reference to the IOAS's
IommuPgd and a device-side IOMMU context entry pointing to it.
BoundDevice — A physical device that has been attached to this iommufd context,
detached from its host driver, and linked to an IoAddrSpace or HwPagetable.
// umka-kvm/src/iommufd/bound.rs
pub struct BoundDevice {
pub id: u32,
pub dev: Arc<dyn DeviceNode>, // from §11.4 device registry
pub attached_ioas: Option<Arc<IoAddrSpace>>,
pub attached_hwpt: Option<Arc<HwPagetable>>,
}
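The object graph composes into a simple VMM-side flow: create an IOAS, map guest RAM into it, attach passthrough devices; unmap is refused while devices remain attached. A toy model of that gating, using std types in place of the kernel ones (illustrative only):

```rust
use std::collections::BTreeMap;

// Toy model of IoAddrSpace map/unmap gating: per the design above,
// IOMMU_IOAS_UNMAP fails with EBUSY while any device is attached
// (in-flight DMA hazard).
const EBUSY: i32 = 16;

struct IoAddrSpace {
    mappings: BTreeMap<u64, (u64, usize)>, // iova -> (paddr, len)
    attached_devices: u32,
}

impl IoAddrSpace {
    fn map(&mut self, iova: u64, paddr: u64, len: usize) {
        self.mappings.insert(iova, (paddr, len));
    }
    fn attach_device(&mut self) {
        self.attached_devices += 1;
    }
    fn unmap(&mut self, iova: u64) -> Result<(), i32> {
        if self.attached_devices > 0 {
            return Err(EBUSY); // fast-reject: a device may be DMAing
        }
        self.mappings.remove(&iova);
        Ok(())
    }
}

fn main() {
    let mut ioas = IoAddrSpace { mappings: BTreeMap::new(), attached_devices: 0 };
    ioas.map(0x0, 0x8000_0000, 2 << 20); // GPA 0 -> host PA, 2 MiB
    ioas.attach_device();                // passthrough device joins the IOAS
    assert_eq!(ioas.unmap(0x0), Err(EBUSY));
    ioas.attached_devices -= 1;          // device detached
    assert_eq!(ioas.unmap(0x0), Ok(()));
}
```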
18.5.3 ioctl Interface¶
VFIO and iommufd expose their APIs via ioctl on character device file descriptors.
All ioctl structs are #[repr(C)] and must match the Linux kernel ABI exactly for
QEMU and other VMMs to work without modification.
VFIO device ioctls (on /dev/vfio/devices/vfioX):
// umka-kvm/src/vfio/ioctl.rs
/// VFIO_DEVICE_GET_INFO — query device capabilities.
#[repr(C)]
pub struct VfioDeviceInfo {
pub argsz: u32,
pub flags: VfioDeviceFlags,
pub num_regions: u32,
pub num_irqs: u32,
pub cap_offset: u32, // offset into info struct of first capability header
}
// VfioDeviceInfo: u32(4)*5 = 20 bytes (VfioDeviceFlags is u32 bitflags).
// Userspace ABI struct — VFIO_DEVICE_GET_INFO ioctl argument.
const_assert!(core::mem::size_of::<VfioDeviceInfo>() == 20);
/// VFIO_DEVICE_GET_REGION_INFO — query one region (BAR, ROM, Config).
#[repr(C)]
pub struct VfioRegionInfo {
pub argsz: u32,
pub flags: VfioRegionFlags,
pub index: u32, // region index: VFIO_PCI_BAR0_REGION_INDEX .. CONFIG
pub cap_offset: u32,
pub size: u64,
pub offset: u64, // byte offset on the device fd for pread/pwrite/mmap
}
// VfioRegionInfo: u32(4)*4 + u64(8)*2 = 32 bytes.
// Userspace ABI struct — VFIO_DEVICE_GET_REGION_INFO ioctl argument.
const_assert!(core::mem::size_of::<VfioRegionInfo>() == 32);
/// VFIO_DEVICE_GET_IRQ_INFO — query one IRQ index.
#[repr(C)]
pub struct VfioIrqInfo {
pub argsz: u32,
/// VFIO_IRQ_INFO_EVENTFD | VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED | NORESIZE
pub flags: u32,
pub index: u32, // VFIO_PCI_INTX_IRQ_INDEX, MSI, MSIX, ERR, REQ
pub count: u32, // number of vectors in this IRQ index
}
// VfioIrqInfo: u32(4)*4 = 16 bytes.
// Userspace ABI struct — VFIO_DEVICE_GET_IRQ_INFO ioctl argument.
const_assert!(core::mem::size_of::<VfioIrqInfo>() == 16);
/// VFIO_DEVICE_SET_IRQS — configure interrupt delivery.
#[repr(C)]
pub struct VfioIrqSet {
pub argsz: u32,
/// ACTION (SET/UNMASK/MASK) | DATA (NONE/BOOL/EVENTFD) | INDEX flags
pub flags: u32,
pub index: u32, // which IRQ type (INTx, MSI, MSI-X, ...)
pub start: u32, // first vector within the index
pub count: u32, // number of vectors this call configures
// Followed by `count` eventfd integers in the data[] array.
// data[]: i32 eventfds. For irqbypass, these are KVM IRQFD eventfds.
}
// VfioIrqSet: u32(4)*5 = 20 bytes (header only; variable-length data[] follows).
// Userspace ABI struct — VFIO_DEVICE_SET_IRQS ioctl argument.
const_assert!(core::mem::size_of::<VfioIrqSet>() == 20);
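Because `data[]` is variable-length, userspace sizes `argsz` as the fixed header plus the eventfd array. A small standalone sketch of that computation (the `irq_set_argsz` helper is hypothetical; the struct is repeated here so the snippet compiles on its own):

```rust
use std::mem::size_of;

/// Fixed-size header of VFIO_DEVICE_SET_IRQS, as specified above.
#[repr(C)]
pub struct VfioIrqSet {
    pub argsz: u32,
    pub flags: u32,
    pub index: u32,
    pub start: u32,
    pub count: u32,
    // `count` i32 eventfds follow in data[].
}

/// argsz covers the header plus the trailing eventfd array.
pub fn irq_set_argsz(count: u32) -> u32 {
    (size_of::<VfioIrqSet>() + count as usize * size_of::<i32>()) as u32
}

fn main() {
    assert_eq!(size_of::<VfioIrqSet>(), 20); // header matches the const_assert
    assert_eq!(irq_set_argsz(0), 20);
    assert_eq!(irq_set_argsz(4), 36); // 20-byte header + 4 eventfds
    println!("argsz ok");
}
```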
The ioctl dispatch for VFIO_DEVICE_SET_IRQS wires the provided eventfds into the
irqbypass subsystem (Section 18.5.4):
// umka-kvm/src/vfio/ioctl.rs
fn handle_set_irqs(
    dev: &mut VfioDevice,
    vm_id: VmId,
    req: &VfioIrqSet,
    eventfds: &[RawFd],
) -> Result<(), KernelError> {
    // Validate IRQ index and vector range against dev.irqs[].
    let irq_cfg = dev.irqs.get(req.index as usize)
        .ok_or(KernelError::EINVAL)?;
    let end = req.start.checked_add(req.count)
        .ok_or(KernelError::EINVAL)?;
    if end > irq_cfg.count || eventfds.len() != req.count as usize {
        return Err(KernelError::EINVAL);
    }
    if (req.flags & VFIO_IRQ_SET_DATA_EVENTFD) != 0
        && (req.flags & VFIO_IRQ_SET_ACTION_TRIGGER) != 0
    {
        // Wire each eventfd as an irqbypass producer, one per vector.
        for (i, &fd) in eventfds.iter().enumerate() {
            let vector = req.start as usize + i;
            let producer = IrqBypassProducer::from_eventfd(fd, vector)?;
            irqbypass_register_producer(&IRQ_BYPASS_REGISTRY, vm_id, &producer)?;
            dev.irqbypass_producers.push(producer);
            // The KVM IRQFD (consumer) must have been registered already
            // via KVM_IRQFD ioctl on the KVM VM fd.
        }
    }
    Ok(())
}
iommufd ioctls (on /dev/iommu):
| ioctl | Description |
|---|---|
| `IOMMU_IOAS_ALLOC` | Allocate a new IO address space. Returns `{ id: u32 }`. |
| `IOMMU_IOAS_IOVA_RANGES` | Query valid IOVA ranges for the IOAS. Returns `Vec<IovaRange>`. |
| `IOMMU_IOAS_MAP` | Map a physical range into the IOAS at a given IOVA. |
| `IOMMU_IOAS_UNMAP` | Unmap a range. Blocked while devices are in-flight (returns `EBUSY` if active DMA is detected). |
| `IOMMU_IOAS_COPY` | Copy all mappings from one IOAS to another. Used during VM live migration to clone the guest's IOVA mapping into the destination kernel without quiescing the device. |
| `IOMMU_HWPT_ALLOC` | Allocate a hardware page table from an IOAS. Returns `{ hwpt_id: u32 }`. The HWPT shares the IOAS's `IommuPgd`. |
| `IOMMU_DEVICE_ATTACH` | Attach a bound device to an IOAS or HWPT. Installs the IOMMU context entry pointing to the page table. |
| `IOMMU_DEVICE_DETACH` | Detach a device. Invalidates the IOMMU context entry. Must drain in-flight DMA before returning (IOTLB invalidation with drain). |
The IOMMU_IOAS_MAP ioctl struct:
// umka-kvm/src/iommufd/ioctl.rs
#[repr(C)]
pub struct IommuIoasMap {
pub argsz: u32,
pub flags: u32, // IOMMU_IOAS_MAP_READABLE | WRITABLE | FIXED_IOVA
pub ioas_id: u32,
_padding: u32,
/// Userspace virtual address of the memory to map. The kernel pins
/// the pages and obtains their physical addresses via get_user_pages().
pub user_va: u64,
/// IO virtual address to map at. If FIXED_IOVA is not set, the kernel
/// allocates the IOVA (and returns it in this field on success).
pub iova: u64,
pub length: u64,
pub iommu_prot: u32, // DmaProt bits
_padding2: u32,
}
// IommuIoasMap: u32(4)*4 + u64(8)*3 + u32(4)*2 = 48 bytes.
// Userspace ABI struct — IOMMU_IOAS_MAP ioctl argument.
const_assert!(core::mem::size_of::<IommuIoasMap>() == 48);
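The two explicit padding fields are what keep the 48-byte layout independent of compiler-inserted padding. A standalone check of the field offsets (using `core::mem::offset_of!`; the struct is repeated here so the snippet is self-contained):

```rust
use std::mem::{offset_of, size_of};

#[repr(C)]
#[allow(dead_code)] // fields are never read in this layout-only check
pub struct IommuIoasMap {
    pub argsz: u32,
    pub flags: u32,
    pub ioas_id: u32,
    _padding: u32,
    pub user_va: u64,
    pub iova: u64,
    pub length: u64,
    pub iommu_prot: u32,
    _padding2: u32,
}

fn main() {
    // The explicit _padding fields keep the u64s 8-byte aligned without
    // relying on implicit compiler padding: the ABI is fully spelled out.
    assert_eq!(size_of::<IommuIoasMap>(), 48);
    assert_eq!(offset_of!(IommuIoasMap, user_va), 16);
    assert_eq!(offset_of!(IommuIoasMap, iova), 24);
    assert_eq!(offset_of!(IommuIoasMap, length), 32);
    assert_eq!(offset_of!(IommuIoasMap, iommu_prot), 40);
    println!("layout ok");
}
```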
18.5.4 irqbypass — Zero-Latency Interrupt Delivery¶
When a passthrough device raises an interrupt, the normal path would be: hardware IRQ → host kernel interrupt handler → write to eventfd → KVM thread wakeup → inject guest interrupt. This path involves a context switch and is latency-sensitive for NVMe and network devices.
irqbypass eliminates the kernel interrupt handler and thread wakeup:
Device raises IRQ
│
▼
IOMMU/APIC delivers to host CPU
│
▼
irqbypass producer fires ──────► irqbypass consumer (KVM IRQFD)
│
▼
KVM injects virtual interrupt
directly into guest LAPIC/GIC
(no kernel thread wakeup)
The two sides of irqbypass:
// umka-kvm/src/irqbypass.rs
/// Produced by the VFIO side: fires when the physical device raises an IRQ.
/// token is a unique identity used to match producers to consumers.
pub struct IrqBypassProducer {
/// Unique token — pointer-identity. Matches the corresponding consumer.
pub token: NonNull<IrqBypassToken>,
/// Called when a consumer is linked to this producer.
/// Implementor: disables the standard interrupt handler and installs
/// the consumer's direct delivery path.
pub add_consumer: unsafe fn(prod: &IrqBypassProducer, cons: &IrqBypassConsumer) -> KabiResult,
/// Called when the consumer is unlinked. Re-enables the standard handler.
pub del_consumer: unsafe fn(prod: &IrqBypassProducer, cons: &IrqBypassConsumer),
}
/// Consumed by the KVM IRQFD side: receives IRQs and injects them into the guest.
pub struct IrqBypassConsumer {
pub token: NonNull<IrqBypassToken>,
/// Called when a producer is linked. Installs the guest IRQ injection path.
/// On x86: programs the Posted Interrupt Descriptor so the device's MSI
/// vector is delivered directly to the vCPU's virtual LAPIC.
pub add_producer: unsafe fn(cons: &IrqBypassConsumer, prod: &IrqBypassProducer) -> KabiResult,
/// Called when the producer is unlinked. Tears down the direct path.
pub del_producer: unsafe fn(cons: &IrqBypassConsumer, prod: &IrqBypassProducer),
}
/// Opaque token used for producer↔consumer matching. Pointer identity only.
pub struct IrqBypassToken(());
The irqbypass registry maintains per-VM lists of registered producers and consumers,
sharded by VmId to avoid serializing multi-VM setups. When a new consumer is
registered (via KVM_IRQFD with the irqbypass flag), the registry acquires only
that VM's SpinLock and scans for a matching producer by token. When a new producer
is registered (via VFIO_DEVICE_SET_IRQS), the registry similarly acquires only the
target VM's lock. Matching is by token pointer identity.
// umka-kvm/src/irqbypass.rs
/// Entry in the per-VM irqbypass registry.
pub struct IrqBypassEntry {
pub producer: Option<IrqBypassProducer>,
pub consumer: Option<IrqBypassConsumer>,
}
/// Per-VM sharded IRQ bypass registry. Avoids a global mutex that would
/// serialize producer/consumer registration across independent VMs.
///
/// **Performance note**: The `per_vm` XArray is accessed only during irqbypass
/// registration/unregistration (warm path, per-device setup). Interrupt delivery
/// uses the linked producer→consumer fast path established at registration time,
/// bypassing the registry entirely. The XArray lookup cost is therefore
/// irrelevant to interrupt latency.
pub struct IrqBypassRegistry {
/// Per-VM IRQ bypass mappings. XArray keyed by VmId (integer) — O(1)
/// lookup with native RCU-compatible reads. Write operations use XArray's
/// internal locking (only on VM create/destroy, not per-device registration).
per_vm: XArray<Arc<SpinLock<ArrayVec<IrqBypassEntry, MAX_IRQ_BYPASS_ENTRIES>>>>,
}
/// Maximum IRQ bypass entries per VM. 256 covers multi-device passthrough
/// scenarios (e.g., 4 NVMe devices × 64 queues, or GPU + NIC + storage).
/// Previous value of 64 was insufficient for VMs with multiple high-queue-count
/// devices. All `push()` calls use `try_push()` and return `-ENOSPC` on
/// overflow rather than panicking.
const MAX_IRQ_BYPASS_ENTRIES: usize = 256;
Registration for a specific VM uses xa_load() under rcu_read_lock() to look up
the VM's entry (RCU-compatible read, no global lock). Cross-VM iteration uses
XArray's xa_for_each() iterator under RCU. Write operations (VM create/destroy)
use XArray's internal locking — not on the per-device registration path.
fn irqbypass_register_producer(
registry: &IrqBypassRegistry,
vm_id: VmId,
prod: &IrqBypassProducer,
) -> Result<(), KernelError> {
// Read lock: does not block other VMs' registrations.
let guard = rcu_read_lock();
let vm_entries = registry.per_vm.xa_load(vm_id.0 as u64, &guard)
.ok_or(KernelError::InvalidVm)?;
let mut entries = vm_entries.lock();
entries.try_push(IrqBypassEntry {
producer: Some(prod.clone()),
consumer: None,
}).map_err(|_| KernelError::NoSpace)?;
// After inserting, attempt to link with a matching consumer
// (if one is already registered in this VM).
irqbypass_try_link(&mut entries, prod.token)?;
Ok(())
}
fn irqbypass_register_consumer(
registry: &IrqBypassRegistry,
vm_id: VmId,
cons: &IrqBypassConsumer,
) -> Result<(), KernelError> {
let guard = rcu_read_lock();
let vm_entries = registry.per_vm.xa_load(vm_id.0 as u64, &guard)
.ok_or(KernelError::InvalidVm)?;
let mut entries = vm_entries.lock();
entries.try_push(IrqBypassEntry {
producer: None,
consumer: Some(cons.clone()),
}).map_err(|_| KernelError::NoSpace)?;
// After inserting, attempt to link with a matching producer.
irqbypass_try_link(&mut entries, cons.token)?;
Ok(())
}
/// After both a producer and consumer are registered for the same token,
/// link them to enable direct interrupt injection (posted interrupts on
/// x86, vGIC forwarding on AArch64). Called automatically when the second
/// side is registered — the registry scans for a matching counterpart and
/// merges the two orphaned entries into a single linked entry.
///
/// On teardown: unregistering either side calls `del_consumer`/`del_producer`
/// on both halves, then removes its own entry. If only one side is removed,
/// the remaining side reverts to an unlinked orphan (standard interrupt path).
fn irqbypass_try_link(
    entries: &mut ArrayVec<IrqBypassEntry, MAX_IRQ_BYPASS_ENTRIES>,
    token: NonNull<IrqBypassToken>,
) -> Result<(), KernelError> {
    // Find indices of the producer-only and consumer-only entries for this token.
    let prod_idx = entries.iter().position(|e|
        e.producer.as_ref().map(|p| p.token) == Some(token) && e.consumer.is_none()
    );
    let cons_idx = entries.iter().position(|e|
        e.consumer.as_ref().map(|c| c.token) == Some(token) && e.producer.is_none()
    );
    if let (Some(pi), Some(ci)) = (prod_idx, cons_idx) {
        // Both sides present. Take them out of `entries` so that linking
        // does not hold borrows across the mutations below.
        let producer = entries[pi].producer.take().unwrap();
        let consumer = entries[ci].consumer.take().unwrap();
        // SAFETY: both producer and consumer are valid while registered.
        // add_consumer disables the standard IRQ handler and installs
        // the direct delivery path (e.g., posted interrupts on x86).
        let res = unsafe { (producer.add_consumer)(&producer, &consumer) }
            .into_result()
            .and_then(|()| {
                // SAFETY: symmetric linkage — consumer installs its side of the
                // direct path (e.g., programs the vCPU's Posted Interrupt
                // Descriptor).
                let r = unsafe { (consumer.add_producer)(&consumer, &producer) }
                    .into_result();
                if r.is_err() {
                    // Roll back the producer side so the standard handler
                    // is restored.
                    unsafe { (producer.del_consumer)(&producer, &consumer) };
                }
                r
            });
        match res {
            Ok(()) => {
                // Merge: both sides live in the producer's entry; drop the orphan.
                entries[pi].producer = Some(producer);
                entries[pi].consumer = Some(consumer);
                entries.swap_remove(ci);
            }
            Err(e) => {
                // Restore both halves as unlinked orphans (standard delivery path).
                entries[pi].producer = Some(producer);
                entries[ci].consumer = Some(consumer);
                return Err(e);
            }
        }
    }
    Ok(())
}
/// Unregister a producer. If it was linked to a consumer, disconnect both
/// sides and leave the consumer as an unlinked orphan (reverts to standard
/// interrupt delivery path). Called from VFIO device teardown.
fn irqbypass_unregister_producer(
registry: &IrqBypassRegistry,
vm_id: VmId,
token: NonNull<IrqBypassToken>,
) -> Result<(), KernelError> {
let guard = rcu_read_lock();
let vm_entries = registry.per_vm.xa_load(vm_id.0 as u64, &guard)
.ok_or(KernelError::InvalidVm)?;
let mut entries = vm_entries.lock();
if let Some(idx) = entries.iter().position(|e|
e.producer.as_ref().map(|p| p.token) == Some(token)
) {
let entry = &mut entries[idx];
if let (Some(prod), Some(cons)) = (&entry.producer, &entry.consumer) {
// Linked — disconnect both sides before removing the producer.
// SAFETY: both are valid; del_consumer re-enables standard handler.
unsafe { (prod.del_consumer)(prod, cons) };
unsafe { (cons.del_producer)(cons, prod) };
// Leave consumer as an unlinked orphan.
entry.producer = None;
} else {
// Unlinked orphan producer — just remove.
entries.swap_remove(idx);
}
}
Ok(())
}
/// Unregister a consumer. Symmetric to `irqbypass_unregister_producer`.
/// Called from KVM IRQFD teardown.
fn irqbypass_unregister_consumer(
registry: &IrqBypassRegistry,
vm_id: VmId,
token: NonNull<IrqBypassToken>,
) -> Result<(), KernelError> {
let guard = rcu_read_lock();
let vm_entries = registry.per_vm.xa_load(vm_id.0 as u64, &guard)
.ok_or(KernelError::InvalidVm)?;
let mut entries = vm_entries.lock();
if let Some(idx) = entries.iter().position(|e|
e.consumer.as_ref().map(|c| c.token) == Some(token)
) {
let entry = &mut entries[idx];
if let (Some(prod), Some(cons)) = (&entry.producer, &entry.consumer) {
// SAFETY: both valid; disconnect before removal.
unsafe { (prod.del_consumer)(prod, cons) };
unsafe { (cons.del_producer)(cons, prod) };
entry.consumer = None;
} else {
entries.swap_remove(idx);
}
}
Ok(())
}
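The orphan-merge performed by `irqbypass_try_link` can be exercised in isolation. The model below substitutes integer tokens for pointer identity and elides the linkage callbacks; `Entry` and `try_link` are simplified stand-ins for `IrqBypassEntry` and the registry logic:

```rust
/// Userspace model of the orphan-merge algorithm: integer tokens stand in
/// for pointer identity, and the add/del callbacks are elided.
#[derive(Debug, PartialEq)]
struct Entry {
    producer: Option<u32>, // token of a registered producer, if any
    consumer: Option<u32>, // token of a registered consumer, if any
}

fn try_link(entries: &mut Vec<Entry>, token: u32) {
    let prod_idx = entries.iter()
        .position(|e| e.producer == Some(token) && e.consumer.is_none());
    let cons_idx = entries.iter()
        .position(|e| e.consumer == Some(token) && e.producer.is_none());
    if let (Some(pi), Some(ci)) = (prod_idx, cons_idx) {
        // Merge: move the consumer into the producer's entry, drop the orphan.
        let consumer = entries[ci].consumer.take();
        entries[pi].consumer = consumer;
        entries.swap_remove(ci);
    }
}

fn main() {
    let mut entries = vec![
        Entry { producer: Some(7), consumer: None }, // orphan producer
        Entry { producer: None, consumer: Some(9) }, // unrelated consumer
    ];
    try_link(&mut entries, 7); // no consumer with token 7 yet: no-op
    assert_eq!(entries.len(), 2);
    entries.push(Entry { producer: None, consumer: Some(7) });
    try_link(&mut entries, 7); // both halves present: merged
    assert_eq!(entries.len(), 2);
    assert!(entries.iter().any(|e| e.producer == Some(7) && e.consumer == Some(7)));
    println!("link model ok");
}
```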
On x86-64 with APICv/Posted Interrupts (Intel VT-d PI), the add_producer
implementation programs the device's MSI destination address to point to the vCPU's
Posted Interrupt Descriptor (PID). The hardware then delivers the interrupt directly
into the guest's virtual LAPIC without any host CPU intervention — the VMEXIT for
interrupt injection is eliminated entirely.
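A simplified model may make the "no host CPU intervention" claim concrete: the device side sets a bit in the 256-bit PIR and needs a notification event only if ON was clear; the vCPU side drains PIR into the virtual LAPIC. The real PID is a 64-byte, cache-line-aligned structure with SN, notification-vector, and destination fields (per the Intel SDM); this sketch models only the bitmap and ON logic:

```rust
/// Simplified model of an Intel Posted Interrupt Descriptor (PID):
/// a 256-bit Posted Interrupt Request bitmap plus the ON flag.
pub struct PostedIntrDesc {
    pir: [u64; 4], // 256 posted-interrupt request bits, one per vector
    on: bool,      // outstanding-notification: a notify event is pending
}

impl PostedIntrDesc {
    pub fn new() -> Self {
        PostedIntrDesc { pir: [0; 4], on: false }
    }
    /// Device-side post: set the vector's PIR bit. Returns true if a
    /// notification event must be sent (ON was clear); hardware uses
    /// this to avoid redundant notifications to the vCPU.
    pub fn post(&mut self, vector: u8) -> bool {
        self.pir[(vector / 64) as usize] |= 1u64 << (vector % 64);
        !std::mem::replace(&mut self.on, true)
    }
    /// vCPU-side sync: drain PIR into the virtual LAPIC's IRR (returned
    /// here as a list of pending vectors) and clear ON.
    pub fn sync(&mut self) -> Vec<u8> {
        self.on = false;
        let mut pending = Vec::new();
        for (i, word) in self.pir.iter_mut().enumerate() {
            let mut w = std::mem::take(word);
            while w != 0 {
                let bit = w.trailing_zeros();
                pending.push((i as u32 * 64 + bit) as u8);
                w &= w - 1; // clear lowest set bit
            }
        }
        pending
    }
}

fn main() {
    let mut pid = PostedIntrDesc::new();
    assert!(pid.post(0x41));  // first post: notification needed
    assert!(!pid.post(0x80)); // ON already set: no second notification
    assert_eq!(pid.sync(), vec![0x41, 0x80]);
    assert!(pid.post(0x41));  // after sync, ON is clear again
    println!("pid model ok");
}
```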
18.5.5 KVM Integration¶
When QEMU assigns a VFIO device to a KVM VM, the following sequence establishes the full passthrough configuration:
1. **Detach from host driver**: QEMU opens `/dev/vfio/devices/vfioX`. The kernel calls `driver_unbind(dev)` on the PCI device's current host driver (e.g., NVMe, igb). The device is removed from the host driver's device list and its interrupts are disabled at the host LAPIC/GIC level.
2. **Create IOAS and map guest RAM**: QEMU issues `IOMMU_IOAS_ALLOC` to create a fresh IOAS. For each memory region in the VM (`KVM_SET_USER_MEMORY_REGION`), QEMU also calls `IOMMU_IOAS_MAP` to create a matching IOVA mapping in the IOAS, using the guest physical address as the IOVA and the host userspace virtual address as the source. The kernel pins the backing pages and maps them into the IOMMU page table. The device can now DMA to guest physical addresses — the IOMMU translates GPA→HPA transparently.
3. **Attach device to IOAS**: `IOMMU_DEVICE_ATTACH` programs the IOMMU context entry for the device, pointing it at the IOAS's IOMMU page table. From this point, the device's DMA is translated by the IOMMU and confined to the guest's mapped regions. Any out-of-range DMA access triggers an IOMMU fault and is logged to the FMA subsystem (Section 20.1).
4. **Wire interrupts via irqbypass**: For each MSI-X vector, QEMU calls `KVM_IRQFD` on the KVM VM fd to register a KVM IRQFD consumer (linking a guest interrupt vector to an eventfd). Then QEMU calls `VFIO_DEVICE_SET_IRQS` with the same eventfds to register the VFIO producer side. The irqbypass registry links them and, on APICv-capable hardware, programs Posted Interrupt Descriptors.
5. **Map MMIO regions into guest address space**: QEMU calls `VFIO_DEVICE_GET_REGION_INFO` to find which BARs support `mmap`. Mappable BARs are `mmap`-ed from the VFIO device fd (obtaining a userspace VA mapping of the device's MMIO). QEMU then calls `KVM_SET_USER_MEMORY_REGION` with the `KVM_MEM_READONLY` flag cleared to place this MMIO mapping at the guest's BAR address. The guest now accesses the BAR via EPT/NPT without a VMEXIT.
6. **Non-mappable MMIO** (config space, registers that require emulation): these accesses generate a `KVM_EXIT_MMIO` VMEXIT. QEMU handles it by calling `pread`/`pwrite` on the VFIO device fd at the region's fd offset. This path is inherently slower but only applies to infrequent configuration accesses.
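Step 2 of the sequence above (mapping guest RAM with the GPA as the IOVA) can be sketched as a pure data transformation. The `MemRegion` and `MapReq` shapes below are illustrative stand-ins for the KVM_SET_USER_MEMORY_REGION and IOMMU_IOAS_MAP arguments; the flag values are assumptions, not the real ABI constants:

```rust
// Illustrative flag values (not the real ABI constants).
const MAP_READABLE: u32 = 1 << 0;
const MAP_WRITABLE: u32 = 1 << 1;
const MAP_FIXED_IOVA: u32 = 1 << 2;

#[derive(Debug, PartialEq)]
struct MapReq {
    flags: u32,
    user_va: u64, // host userspace VA backing the region
    iova: u64,    // == guest physical address of the region
    length: u64,
}

struct MemRegion {
    guest_phys_addr: u64,
    userspace_addr: u64,
    memory_size: u64,
}

/// For each guest memory region, build a fixed-IOVA map request whose
/// IOVA is the guest physical address: device DMA to a GPA then hits
/// the right host page through the IOMMU.
fn build_map_reqs(regions: &[MemRegion]) -> Vec<MapReq> {
    regions.iter().map(|r| MapReq {
        flags: MAP_READABLE | MAP_WRITABLE | MAP_FIXED_IOVA,
        user_va: r.userspace_addr,
        iova: r.guest_phys_addr,
        length: r.memory_size,
    }).collect()
}

fn main() {
    // Two RAM regions: low memory and high memory above the PCI hole.
    let regions = [
        MemRegion { guest_phys_addr: 0x0, userspace_addr: 0x7f00_0000_0000, memory_size: 0x8000_0000 },
        MemRegion { guest_phys_addr: 0x1_0000_0000, userspace_addr: 0x7f00_8000_0000, memory_size: 0x8000_0000 },
    ];
    let reqs = build_map_reqs(&regions);
    assert_eq!(reqs.len(), 2);
    assert_eq!(reqs[1].iova, 0x1_0000_0000);
    assert_eq!(reqs[1].flags, MAP_READABLE | MAP_WRITABLE | MAP_FIXED_IOVA);
    println!("map reqs ok");
}
```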
**`IOMMU_IOAS_COPY` for live migration**: During VM live migration
(Section 18.1), the destination kernel issues
IOMMU_IOAS_COPY to clone the source VM's IOAS mappings. The device is detached from
the source IOAS and re-attached to the destination IOAS atomically. In-flight DMA
at the moment of device detach is drained (IOTLB invalidation with completion wait)
before the device is released. The guest's DMA window is thus preserved across
migration without the destination kernel needing to replay all IOMMU_IOAS_MAP
calls from scratch.
18.5.6 SR-IOV and VF Passthrough¶
SR-IOV (Single Root I/O Virtualization) allows a single PCIe Physical Function (PF) to present multiple Virtual Functions (VFs). VFs appear as independent PCIe devices with their own config space, BARs, and MSI-X vectors, but share the underlying hardware resources managed by the PF.
Each VF has its own IOMMU group (ACS ensures the VF is isolated from the PF and from other VFs at the PCIe bus level), so VFs can be passed through individually without binding the PF to VFIO. This is the standard mechanism for NIC and NVMe passthrough in cloud environments: the host retains control of the PF (and the physical link), while individual VFs are assigned to guest VMs.
UmkaOS's VFIO implementation supports VF passthrough with the same ioctl interface as
full-device passthrough. The num_regions and num_irqs reported by
VFIO_DEVICE_GET_INFO reflect the VF's resources, not the PF's.
18.5.6.1 SR-IOV NIC Passthrough¶
SR-IOV NIC passthrough is the primary mechanism for high-performance networking in container and VM workloads. A single physical NIC exposes multiple VFs, each with dedicated TX/RX queues, interrupt vectors, and DMA rings. The PF driver manages VF lifecycle, enforces MAC/VLAN policy, and provides rate limiting — the VF driver (running in the guest or container) handles only data-plane operations.
18.5.6.1.1 SR-IOV Architecture¶
/// SR-IOV Physical Function (PF) operations — implemented by NIC drivers
/// that support Single Root I/O Virtualization (PCIe SR-IOV, section 9.3
/// of the PCIe 5.0 Base Specification).
pub trait SriovPfOps {
/// Enable SR-IOV and create the specified number of Virtual Functions.
/// This triggers PCIe config space writes to enable VF BAR decoding.
fn sriov_enable(&self, num_vfs: u16) -> Result<(), DeviceError>;
/// Disable SR-IOV and destroy all Virtual Functions.
fn sriov_disable(&self) -> Result<(), DeviceError>;
/// Configure a VF's parameters.
fn set_vf_config(&self, vf_id: u16, config: &VfConfig) -> Result<(), DeviceError>;
/// Get current VF configuration.
fn get_vf_config(&self, vf_id: u16) -> Result<VfConfig, DeviceError>;
/// Set VF link state (auto/enable/disable).
fn set_vf_link_state(&self, vf_id: u16, state: VfLinkState) -> Result<(), DeviceError>;
/// Get VF statistics.
fn get_vf_stats(&self, vf_id: u16) -> Result<VfStats, DeviceError>;
/// Set VF trust mode (allows VF to change its own MAC, enable promiscuous).
fn set_vf_trust(&self, vf_id: u16, trusted: bool) -> Result<(), DeviceError>;
/// Query total number of VFs supported by hardware.
fn total_vfs(&self) -> u16;
/// Query current number of enabled VFs.
fn num_vfs(&self) -> u16;
}
The SriovPfOps trait is implemented by PF-side NIC drivers (e.g., Intel ixgbe/ice,
Mellanox mlx5). The kernel's PCI subsystem calls sriov_enable when userspace writes
to /sys/bus/pci/devices/<PF>/sriov_numvfs. The PF driver programs hardware mailbox
registers, allocates per-VF queue sets, and triggers the PCIe SR-IOV extended
capability to expose VF BARs. Each VF then appears as a new PCI device enumerated by
the bus layer (Section 11.4).
18.5.6.1.2 VF Configuration¶
/// Virtual Function configuration — set by the PF driver, affects VF behavior.
pub struct VfConfig {
/// VF index (0-based).
pub vf_id: u16,
/// MAC address assigned to VF (PF can override VF's self-assigned MAC).
pub mac: [u8; 6],
/// VLAN ID (0 = no VLAN). PF can force VF into a specific VLAN.
pub vlan: u16,
/// VLAN QoS priority (0-7).
pub qos: u8,
/// VLAN protocol (ETH_P_8021Q or ETH_P_8021AD for QinQ).
pub vlan_proto: u16,
/// TX rate limit in Mbps (0 = unlimited).
pub max_tx_rate: u32,
/// Minimum guaranteed TX rate in Mbps (0 = no guarantee).
pub min_tx_rate: u32,
/// Spoofcheck: verify VF's source MAC matches assigned MAC.
pub spoofchk: bool,
/// Trust mode: allow VF to change MAC, enter promiscuous mode.
pub trusted: bool,
/// Link state: auto (follow PF), enable (always up), disable (always down).
pub link_state: VfLinkState,
/// RSS query enable: allow VF to query RSS configuration.
pub rss_query_en: bool,
}
/// VF link state policy — controls the virtual link reported to the VF driver.
pub enum VfLinkState {
/// Follow PF link state.
Auto = 0,
/// VF link always up (even if PF is down).
Enable = 1,
/// VF link always down.
Disable = 2,
}
/// Per-VF traffic statistics — counters are maintained by the PF hardware
/// and read via the PF driver's mailbox interface.
pub struct VfStats {
pub rx_packets: u64,
pub tx_packets: u64,
pub rx_bytes: u64,
pub tx_bytes: u64,
pub broadcast: u64,
pub multicast: u64,
pub rx_dropped: u64,
pub tx_dropped: u64,
}
All VF configuration is applied through the PF driver. The VF driver has no direct
access to the PF's configuration registers. When spoofchk is enabled, the PF
hardware drops any frame from the VF whose source MAC does not match the assigned
mac — this prevents a compromised guest from spoofing other VFs' traffic.
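The spoofcheck rule can be modeled as a pure function over raw frame bytes: with `spoofchk` on, a frame is transmittable only if its source MAC (bytes 6..12 of the Ethernet header) matches the assigned MAC. `VfPolicy` and `tx_allowed` below are illustrative names, not kernel API:

```rust
/// Per-VF transmit policy: the PF-assigned MAC and the spoofcheck flag.
pub struct VfPolicy {
    pub mac: [u8; 6],
    pub spoofchk: bool,
}

/// Returns true if the frame may be transmitted. An Ethernet frame's
/// source MAC occupies bytes 6..12, after the destination MAC.
pub fn tx_allowed(policy: &VfPolicy, frame: &[u8]) -> bool {
    if frame.len() < 12 {
        return false; // runt: no complete Ethernet header
    }
    !policy.spoofchk || frame[6..12] == policy.mac
}

fn main() {
    let policy = VfPolicy { mac: [0x02, 0, 0, 0, 0, 0x01], spoofchk: true };
    let mut frame = vec![0xffu8; 14]; // broadcast dst + src + ethertype
    frame[6..12].copy_from_slice(&policy.mac);
    assert!(tx_allowed(&policy, &frame));
    frame[11] = 0x02; // forged source MAC
    assert!(!tx_allowed(&policy, &frame));
    let lax = VfPolicy { spoofchk: false, ..policy };
    assert!(tx_allowed(&lax, &frame)); // spoofchk off: forwarded as-is
    println!("spoofchk model ok");
}
```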
18.5.6.1.3 sysfs Interface¶
Standard Linux sysfs paths for SR-IOV management:
- /sys/bus/pci/devices/<PF>/sriov_totalvfs — max VFs supported (read-only)
- /sys/bus/pci/devices/<PF>/sriov_numvfs — current VF count (write to create/destroy)
- /sys/bus/pci/devices/<PF>/sriov_drivers_autoprobe — auto-bind VF driver (default: 1)
- /sys/bus/pci/devices/<VF>/ — each VF appears as a separate PCI device
UmkaOS exposes the same sysfs attributes for Linux compatibility. The
sriov_numvfs write handler validates the requested count against sriov_totalvfs,
calls the PF driver's sriov_enable, and triggers device enumeration for each new VF.
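The validation performed by that write handler can be sketched against the two query methods of `SriovPfOps`. The error names and the `validate_numvfs_write` helper are illustrative; the "nonzero count only from the disabled state" rule mirrors Linux sysfs semantics:

```rust
/// The two query methods the handler needs, lifted from SriovPfOps.
pub trait SriovPfQuery {
    fn total_vfs(&self) -> u16;
    fn num_vfs(&self) -> u16;
}

#[derive(Debug, PartialEq)]
pub enum NumVfsError {
    TooMany, // requested count exceeds sriov_totalvfs
    Busy,    // VFs already enabled: must write 0 first to disable
}

/// Returns Ok(requested) if the write should proceed to sriov_enable /
/// sriov_disable (0 disables; a nonzero count is only valid when no
/// VFs are currently enabled).
pub fn validate_numvfs_write(pf: &dyn SriovPfQuery, requested: u16)
    -> Result<u16, NumVfsError>
{
    if requested > pf.total_vfs() {
        return Err(NumVfsError::TooMany);
    }
    if requested != 0 && pf.num_vfs() != 0 {
        return Err(NumVfsError::Busy);
    }
    Ok(requested)
}

struct FakePf { total: u16, enabled: u16 }
impl SriovPfQuery for FakePf {
    fn total_vfs(&self) -> u16 { self.total }
    fn num_vfs(&self) -> u16 { self.enabled }
}

fn main() {
    let idle = FakePf { total: 64, enabled: 0 };
    assert_eq!(validate_numvfs_write(&idle, 8), Ok(8));
    assert_eq!(validate_numvfs_write(&idle, 65), Err(NumVfsError::TooMany));
    let busy = FakePf { total: 64, enabled: 8 };
    assert_eq!(validate_numvfs_write(&busy, 4), Err(NumVfsError::Busy));
    assert_eq!(validate_numvfs_write(&busy, 0), Ok(0)); // disable is allowed
    println!("numvfs checks ok");
}
```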
18.5.6.1.4 Netlink Interface¶
VF configuration via netlink (iproute2):
ip link set <PF_name> vf <N> mac <addr>
ip link set <PF_name> vf <N> vlan <id> [qos <N>] [proto <802.1Q|802.1ad>]
ip link set <PF_name> vf <N> rate <max_tx_rate>
ip link set <PF_name> vf <N> min_tx_rate <rate>
ip link set <PF_name> vf <N> spoofchk {on|off}
ip link set <PF_name> vf <N> trust {on|off}
ip link set <PF_name> vf <N> state {auto|enable|disable}
ip link show <PF_name> # shows all VF configs
Netlink attributes: IFLA_VF_MAC, IFLA_VF_VLAN, IFLA_VF_TX_RATE,
IFLA_VF_SPOOFCHK, IFLA_VF_LINK_STATE, IFLA_VF_TRUST, IFLA_VF_STATS.
These are nested inside IFLA_VFINFO_LIST in the RTM_GETLINK/RTM_SETLINK messages.
UmkaOS's netlink subsystem (Section 16.13) dispatches VF
configuration requests to the PF driver's SriovPfOps implementation.
18.5.6.1.5 Container Integration¶
SR-IOV VFs in container environments:
1. Admin enables SR-IOV: echo 8 > /sys/bus/pci/devices/.../sriov_numvfs
2. VF PCI device appears; admin binds to vfio-pci driver
3. Container runtime (containerd, CRI-O) uses device plugin to discover VFs
4. Kubernetes nvidia.com/sriov-nic or intel.com/sriov-nic resource requests
5. CNI plugin (sriov-cni) moves VF into container's network namespace
6. Container sees a dedicated NIC with hardware-isolated TX/RX queues
The network namespace (Section 17.1) provides the isolation boundary: when a VF is moved into a container's netns, only that container can see and configure the interface. The device cgroup (Section 17.2) restricts which containers may access which VF PCI devices — preventing a container from binding to a VF that belongs to another tenant.
For VFIO passthrough to VMs:
1. Bind VF to vfio-pci: echo <VF_BDF> > /sys/bus/pci/drivers/vfio-pci/bind
2. Assign IOMMU group to VM via VFIO (Section 18.5)
3. VM sees a standard PCIe NIC — uses native driver, full line-rate performance
4. IOMMU isolates the VF's DMA to the VM's assigned memory region
18.5.6.1.6 macvtap Offload (Alternative to SR-IOV)¶
When SR-IOV is not available (hardware lacks SR-IOV capability or all VFs are exhausted), macvtap provides a software-based alternative:
/// macvtap — software-based NIC virtualization for VMs.
/// Creates a TAP device backed by a macvlan interface on the physical NIC.
/// Lower overhead than pure TAP (no bridge), but not hardware-isolated.
pub struct MacvtapDevice {
/// The macvtap network device visible to userspace.
pub dev: Arc<NetDevice>,
/// Backing physical NIC.
pub lower_dev: Arc<NetDevice>,
/// macvlan mode.
pub mode: MacvlanMode,
/// TAP file descriptor (for VM virtio-net).
pub tap_fd: Option<Fd>,
}
/// macvlan operating mode — determines traffic forwarding policy
/// between macvlan endpoints sharing the same lower device.
pub enum MacvlanMode {
/// Private: no communication between macvlan devices.
Private = 1,
/// VEPA: all traffic goes to external switch (802.1Qbg).
Vepa = 2,
/// Bridge: macvlan devices can communicate directly.
Bridge = 4,
/// Passthru: one macvlan takes over the lower device.
Passthru = 8,
/// Source: filter by source MAC address.
Source = 16,
}
macvtap is used by libvirt's "direct" network mode and by lightweight VMMs that prefer simplicity over line-rate performance. The TAP fd is passed to the VMM, which connects it to the guest's virtio-net backend. Traffic bypasses the host bridge layer entirely — the macvlan driver injects frames directly into the lower device's TX path.
TAP fd netns lifecycle: The TAP fd holds an Arc reference to the network
namespace containing the macvtap device. If the netns is destroyed (e.g., the
container exits), the TAP fd's Arc reference prevents the netns from being
freed until the VMM closes the fd. This matches Linux behavior — the TAP fd
keeps the netns alive, and the VMM continues to operate on the device even after
the originating container is gone. On fd close, the Arc is dropped, allowing
netns cleanup to proceed.
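The lifetime rule above can be demonstrated with plain `Arc` reference counting; `NetNs` and `TapFd` here are minimal stand-ins for the kernel objects:

```rust
use std::sync::Arc;

/// Minimal stand-in for a network namespace.
struct NetNs {
    name: String,
}

/// Minimal stand-in for the TAP fd: it pins the namespace via an Arc.
struct TapFd {
    netns: Arc<NetNs>,
}

fn main() {
    let netns = Arc::new(NetNs { name: "container-42".into() });
    let tap = TapFd { netns: Arc::clone(&netns) };
    // Container exits: its reference is dropped...
    drop(netns);
    // ...but the VMM's TAP fd still keeps the namespace alive.
    assert_eq!(Arc::strong_count(&tap.netns), 1);
    assert_eq!(tap.netns.name, "container-42");
    // Closing the fd drops the last Arc; only then can the netns be freed.
    drop(tap);
    println!("netns lifetime ok");
}
```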
18.5.6.1.7 Performance Comparison¶
| Method | TX Latency | Throughput | CPU Overhead | Isolation |
|---|---|---|---|---|
| SR-IOV VF passthrough | ~2 us | Line rate | Near zero | Hardware (IOMMU) |
| macvtap bridge | ~5-10 us | ~95% line rate | Low | Software |
| veth + bridge | ~10-20 us | ~80-90% line rate | Moderate | Software |
| TAP + Linux bridge | ~15-25 us | ~70-85% line rate | High | Software |
SR-IOV passthrough achieves near-bare-metal performance because the VF's DMA engine writes directly to guest memory via IOMMU translation — no kernel data path is involved. The PF hardware handles queue scheduling, RSS hashing, and interrupt coalescing independently for each VF.
18.5.6.1.8 SR-IOV Cross-References¶
- Section 18.5 — VFIO passthrough for VFs
- Section 16.13 — NetDevice for VFs
- Section 11.4 — PCI bus, SR-IOV capability parsing
- Section 17.2 — device cgroup for VF access control
- Section 17.1 — network namespace for container VF
- Section 11.3 — VF drivers as Tier 2 isolated
18.5.7 Security Model¶
Device passthrough grants the guest direct hardware access. The security model must prevent privilege escalation back to the host:
1. **Capability requirement**: Opening a VFIO device fd requires `Capability::SysAdmin` (Section 9.1). This requirement applies to the process that opens `/dev/vfio/devices/vfioX`. In practice, only the VMM process (QEMU) holds this capability; it is typically run with a minimal privilege set (`cap_sys_admin` only, no network or filesystem capabilities beyond what the VM needs).
2. **IOMMU mandatory**: When a device is attached to VFIO, the kernel verifies that a functioning IOMMU domain can be created for it. If the system has no IOMMU, or if the device is not covered by the IOMMU (e.g., behind a legacy ISA bridge), the attach call fails with `ENODEV`. The sole exception is the `iommu_off=dangerous` boot parameter, which enables a no-IOMMU passthrough mode for development use only; a prominent boot warning and a `WDIOF_OVERHEAT`-equivalent flag in the device info struct mark the system as operating outside the security envelope.
3. **DMA containment**: The IOMMU page table is populated only with the guest's memory regions. Any device DMA that targets an address outside the IOAS mapping is blocked by the IOMMU and generates an IOMMU fault. The fault is routed to the VMM via an eventfd-based notification mechanism and logged via the FMA subsystem — see Section 11.5 for the complete fault routing protocol, including the `VfioFaultRing` and escalation policy.
4. **Config space access control**: The raw PCIe config space is not fully exposed to userspace. The VFIO PCI driver intercepts `pread`/`pwrite` on the config space region. Writes to the Bus Master Enable bit, Interrupt Disable bit, and PCIe capability registers are validated or silently dropped where they could affect the host PCI topology. *Rationale for silent drop*: this matches Linux VFIO behavior, where VMMs (QEMU, crosvm, cloud-hypervisor) expect certain config space writes to be silently filtered rather than returning errors; changing to error returns would break compatibility with existing VMMs. The Memory Space Enable and I/O Space Enable bits are allowed through (they gate BAR access and are necessary for device operation).
5. **Reset on release**: When the VFIO device fd is closed, the kernel performs an FLR (Function-Level Reset) if the device supports it, or a bus reset if not. This clears any DMA-capable state in the device (pending DMA descriptors, MSI-X configuration) before the device is re-bound to the host driver or left quiesced.
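The config-space filtering rule for the PCI Command register can be sketched as a mask-merge: host-controlled bits keep their current value, guest-writable bits come from the write, and dropped bits produce no error. The constants follow the standard PCI Command register bit layout; `filter_command_write` and the exact writable set are illustrative:

```rust
// Standard PCI Command register bit positions.
const CMD_IO_SPACE: u16 = 1 << 0;
const CMD_MEM_SPACE: u16 = 1 << 1;
const CMD_BUS_MASTER: u16 = 1 << 2;
const CMD_INTX_DISABLE: u16 = 1 << 10;

/// Bits the guest is allowed to change via the VFIO config-space region
/// (illustrative policy: only the space-enable bits pass through).
const GUEST_WRITABLE: u16 = CMD_IO_SPACE | CMD_MEM_SPACE;

/// Keep host-controlled bits from `current`, take allowed bits from the
/// guest's write. The dropped bits produce no error (silent-filter
/// behavior, Linux VFIO compatible).
pub fn filter_command_write(current: u16, guest_value: u16) -> u16 {
    (current & !GUEST_WRITABLE) | (guest_value & GUEST_WRITABLE)
}

fn main() {
    // Host state: bus mastering on, memory space on.
    let current = CMD_BUS_MASTER | CMD_MEM_SPACE;
    // Guest tries to disable bus mastering and INTx, and enable IO space.
    let guest = CMD_INTX_DISABLE | CMD_IO_SPACE;
    let result = filter_command_write(current, guest);
    assert_eq!(result & CMD_BUS_MASTER, CMD_BUS_MASTER); // write ignored
    assert_eq!(result & CMD_INTX_DISABLE, 0);            // write ignored
    assert_eq!(result & CMD_IO_SPACE, CMD_IO_SPACE);     // allowed through
    assert_eq!(result & CMD_MEM_SPACE, 0);               // guest cleared it
    println!("cmd filter ok");
}
```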
18.5.8 VFIO Unbind and Isolation Domain Teardown¶
When a VFIO device fd is closed (VM shutdown or device hot-unplug), the kernel must tear down the full isolation context — including any Tier 1 isolation domain that was active on the device before VFIO claimed it. The unbind sequence is the reverse of the KVM integration sequence above, with additional steps for UmkaOS's isolation domain cleanup.
Full VFIO unbind sequence:
1. QUIESCE DEVICE
- Disable bus mastering: clear the PCI Bus Master Enable bit in the device's
Command register. This prevents the device from initiating new DMA transactions.
- IOTLB invalidation with drain-completion wait: call iommu_iotlb_sync(domain)
to flush all pending IOTLB entries for this device's IOMMU domain. Wait for
the IOMMU hardware to confirm that all in-flight DMA translations have completed
(VT-d: poll IQA drain; SMMU: poll SMMU_GERROR until clean).
2. UNLINK IRQBYPASS
- Unregister all VFIO producer-side entries from the irqbypass registry
(irqbypass_unregister_producer for each MSI-X vector).
- This disconnects posted interrupt descriptors: the interrupt path reverts
to standard host-side delivery.
- Close all eventfds associated with VFIO_DEVICE_SET_IRQS.
3. DETACH DEVICE FROM IOAS
- Call iommu_detach_device(domain, dev) to remove the IOMMU context/stream
table entry pointing the device at the VM's IOAS page table.
- Second IOTLB invalidation to ensure no stale TLB entries reference the
now-detached page table.
- Decrement the IoAddrSpace attached device count. If zero: proceed to
destroy the IOAS.
4. DESTROY IOAS MAPPINGS
- Unmap all IOVA ranges from the IoAddrSpace (iommu_unmap for each mapping
in the IOAS's BTreeMap).
- Unpin all host pages that were pinned via IOMMU_IOAS_MAP. Pages are
returned to the host memory manager.
- Free the IoAddrSpace and its backing IommuDomain page table.
5. TEAR DOWN TIER 1 ISOLATION DOMAIN (if device was Tier 1 before VFIO claim)
- The device registry records the pre-VFIO isolation tier of each device
(DeviceNode.pre_vfio_tier). If the device was a Tier 1 driver's device:
a. Release the isolation domain key assigned to the device's former
Tier 1 driver. On x86: clear the PKRU domain key in the per-CPU
domain allocation bitmap (returns the key to the free pool for
reuse by other Tier 1 drivers). On AArch64: release the POE
permission overlay index.
b. Free the driver's private memory region: the virtual address range
that was tagged with the isolation domain key is unmapped from the
kernel page tables and the backing physical pages are returned to
the page allocator. This includes the driver's heap, stack, and
any MMIO mappings that were part of its isolation domain.
c. Release driver capabilities: revoke all capability tokens held by
the former Tier 1 driver instance from the capability table
([Section 9.1](09-security.md#capability-based-foundation)).
d. Remove the driver's ring buffer endpoints from the KABI transport
registry. The ring buffer memory (allocated in umka-core address
space) is freed.
- If the device was Tier 0 or Tier 2 before VFIO claim: step 5 is skipped
(Tier 0 has no isolation domain to release; Tier 2's process-based
isolation was already torn down when the driver process exited during
VFIO bind).
6. DEVICE RESET
- Issue FLR (Function-Level Reset) if the device supports it
(PCI_EXP_DEVCTL_BCR_FLR). FLR clears all device-internal state:
pending DMA descriptors, MSI-X configuration, device-specific
registers.
- If FLR is not supported: issue a Secondary Bus Reset (SBR) on the
device's parent bridge.
- Wait for reset completion (poll PCI_EXP_DEVSTA for Transaction Pending
bit clear, timeout 100 ms per PCIe spec).
- On FLR timeout: log FMA event `FaultEvent::DeviceResetTimeout` with
device BDF. If SBR is available as fallback, attempt SBR. If SBR
also fails or is unavailable: mark device as permanently faulted
(state = `DeviceState::Faulted` in device registry), prevent
re-bind, and emit `FaultEvent::DevicePermanentFault`. The device
remains quiesced until manual admin intervention (`echo 1 > remove`
followed by PCI rescan).
7. RE-BIND HOST DRIVER (optional)
- If the device registry has a host driver binding for this device
(recorded at VFIO bind time): trigger driver_probe(dev) to re-bind
the host driver.
- The host driver's probe() re-initializes the device from scratch
(fresh register state after FLR).
- If re-bind fails: device remains quiesced (state = Unbound in the
device registry). An FMA event is emitted:
fma_emit(FaultEvent::Generic {
device_id: dev.id(),
event_code: FMA_VFIO_REBIND_FAILED,
payload: [...],
});
Concurrency: The unbind sequence runs in process context (VFIO device fd close handler, called from the VMM's exit path). Steps 1-4 hold the IOMMU domain lock. Step 5 acquires the device registry lock. Step 7 is asynchronous (queued to the device probe workqueue). A concurrent VM migration that races with fd close is prevented by the VfioDevice refcount — the fd close handler runs only when the last reference is dropped.
18.5.9 Integration with UmkaOS IOMMU (Section 11.4)¶
The iommufd layer is built on top of the IOMMU primitives defined in Section 11.4. The correspondence is:
| iommufd concept | Section 11.4 primitive | Notes |
|---|---|---|
| IoAddrSpace | IommuDomain | IOAS wraps an IommuDomain with userspace-facing state (mapping BTree, attached device count, valid IOVA ranges). When an IommuDomain is owned by an IoAddrSpace, IommuDomain.mappings is not populated — IoAddrSpace.mappings is the sole software mapping authority (see Section 11.5 for the ownership rule). The hardware page table is shared and is the source of truth for DMA translation in both cases. |
| HwPagetable | IommuPgd + context entry | HWPT holds a reference to the IommuPgd and the device-side context entry that points to it. |
| IOMMU_IOAS_MAP | iommu_map(domain, iova, paddr, len, prot) | iommufd calls into the §11.4 iommu_map primitive after pinning userspace pages. |
| IOMMU_IOAS_UNMAP | iommu_unmap(domain, iova, len) + IOTLB invalidate | Unmap also calls iommu_iotlb_sync(domain) to flush TLB entries before releasing page pins. |
| IOMMU_DEVICE_ATTACH | iommu_attach_device(domain, dev) | Programs the IOMMU context/stream table entry that points the device at the domain's page table. |
| IOMMU_DEVICE_DETACH | iommu_detach_device(domain, dev) | Removes the context entry and drains in-flight DMA (issues IOTLB invalidation with drain completion wait). |
The §11.4 layer handles all architecture-specific IOMMU programming (VT-d context
tables and SLPT on x86-64; SMMU stream table and stage-2 tables on ARM64; PCIe PASID
tables for Shared Virtual Addressing). iommufd is arch-neutral: it builds the IOVA→HPA
mapping in the arch-agnostic IommuDomain and relies on §11.4 to push it to hardware.
18.5.9.1 Concrete Type Bridging¶
The IoAddrSpace and HwPagetable types defined above (in the iommufd object model)
are the concrete wrappers that bridge iommufd's userspace-facing API to the driver
framework's IOMMU primitives. The following methods show how iommufd ioctl handlers
delegate to §11.4:
impl IoAddrSpace {
/// Map a guest physical address range to host physical via the IOMMU.
/// Called by the `IOMMU_IOAS_MAP` ioctl handler after pinning userspace
/// pages. Delegates to `iommu_map()` from Section 11.4, then records
/// the mapping in the local BTreeMap for userspace unmap tracking.
///
/// The `mappings` BTreeMap is keyed by IOVA start address. This is a
/// warm-path operation (called during VM memory slot setup, not per-I/O),
/// so BTreeMap is appropriate — it provides ordered iteration for
/// coalescing adjacent mappings and O(log n) lookup for unmap.
pub fn map(
&self,
iova: u64,
phys: PhysAddr,
size: u64,
prot: DmaProt,
) -> Result<(), IommuError> {
// Validate IOVA falls within hardware-reported valid ranges.
if !self.is_valid_iova_range(iova, size) {
return Err(IommuError::InvalidIova);
}
// Delegate to arch-specific IOMMU page table programming.
iommu_map(&self.pgd, iova, phys, size, prot)?;
// Record mapping for userspace unmap tracking.
self.mappings.lock().insert(iova, IommuMapping {
iova,
paddr: phys.as_u64(),
len: size as usize,
prot,
});
Ok(())
}
/// Unmap a previously mapped IOVA range and flush IOTLB entries.
/// Called by the `IOMMU_IOAS_UNMAP` ioctl handler.
/// After removing the mapping from the hardware page table, issues an
/// IOTLB invalidation to flush stale TLB entries. The invalidation is
/// synchronous — this function does not return until the IOTLB flush
/// completes, ensuring no in-flight DMA can use the old mapping.
pub fn unmap(&self, iova: u64, size: u64) -> Result<(), IommuError> {
iommu_unmap(&self.pgd, iova, size)?;
iommu_iotlb_sync(&self.pgd)?;
self.mappings.lock().remove(&iova);
Ok(())
}
/// Check whether an IOVA range falls within the hardware's valid IOVA
/// windows. The valid ranges are reported by the IOMMU driver at device
/// attach time and stored in `valid_iova_ranges`.
fn is_valid_iova_range(&self, iova: u64, size: u64) -> bool {
let end = iova.saturating_add(size).saturating_sub(1);
self.valid_iova_ranges.iter().any(|r| iova >= r.start && end <= r.last)
}
}
impl HwPagetable {
/// Attach a device to this hardware page table. Programs the IOMMU
/// context/stream table entry so the device's DMA transactions are
/// translated through this page table.
///
/// After this call, the device's DMA addresses are translated via
/// `self.pgd`. The device must be detached before the `HwPagetable`
/// is destroyed (enforced by the `BoundDevice` lifecycle).
pub fn attach_device(&self, dev: &Arc<dyn DeviceNode>) -> Result<(), IommuError> {
iommu_attach_device(&self.pgd, dev)
}
/// Detach a device from this hardware page table. Removes the IOMMU
/// context entry and drains in-flight DMA by issuing an IOTLB
/// invalidation with drain-completion wait.
pub fn detach_device(&self, dev: &Arc<dyn DeviceNode>) -> Result<(), IommuError> {
iommu_detach_device(&self.pgd, dev)?;
iommu_iotlb_sync(&self.pgd)
}
}
The key design point: IoAddrSpace owns both the hardware page table (pgd) and the
software mapping record (mappings). The pgd is the source of truth for hardware DMA
translation; the mappings BTreeMap exists solely for the userspace API (so IOMMU_IOAS_UNMAP
can validate and remove specific ranges). On the DMA data path, only the hardware page
table is consulted — the BTreeMap is never touched.
Dual mapping avoidance: When an IommuDomain (Section 11.5)
is owned by an IoAddrSpace, the IommuDomain.mappings BTreeMap is not populated.
IoAddrSpace.mappings is the sole software authority for all IOVA→HPA entries created
via IOMMU_IOAS_MAP. Having two software shadow maps for the same hardware page table
would be a consistency hazard — if one were updated and the other stale, unmap validation
or IOVA range queries would return wrong results. The IommuDomain.mappings field is
used only for kernel-initiated DMA (Tier 1/2 drivers using umka_driver_dma_alloc),
where no IoAddrSpace wrapper exists.
Path selection for VM device assignment: The iommufd/VFIO path (IoAddrSpace) and the
KVM direct-assign path (IommuDomainType::VmPassthrough) are mutually exclusive per
device. See Section 11.5 for the canonical
selection rule and invariant enforcement.