Execution flow of vCPU
Finally, we come to the vCPU execution flow, which puts everything together and shows what happens under the hood.
Recall that QEMU creates a POSIX thread for each vCPU of the guest, and that running a vCPU is the job of the KVM_RUN ioctl (#define KVM_RUN _IO(KVMIO, 0x80)). The vCPU thread executes ioctl(.., KVM_RUN, ...) to run the guest code. As these are POSIX threads, the Linux kernel can schedule them as it does any other process/thread in the system.
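To make this concrete, here is a minimal sketch of such a vCPU loop written directly against the KVM API, in the spirit of QEMU's qemu_kvm_cpu_thread_fn(). Error handling and guest memory/register setup are omitted, so as written the guest cannot actually boot; the sketch only illustrates the ioctl() sequence, and is not QEMU's code:

    #include <stddef.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    static void *vcpu_thread_fn(void *arg)
    {
        int kvm  = open("/dev/kvm", O_RDWR);
        int vm   = ioctl(kvm, KVM_CREATE_VM, 0);   /* VM file descriptor */
        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);  /* vCPU file descriptor */

        /* The kernel shares a struct kvm_run with userspace via mmap() */
        int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);

        for (;;) {
            ioctl(vcpu, KVM_RUN, 0);     /* enter the guest; returns on VMEXIT */
            switch (run->exit_reason) {  /* act on the exit, as QEMU does */
            case KVM_EXIT_IO:
                /* emulate the I/O access in userspace */
                break;
            case KVM_EXIT_HLT:
                return NULL;
            /* ... */
            }
        }
    }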
Let us see how it all works:
Qemu-kvm User Space:

    kvm_init_vcpu()
        kvm_arch_init_vcpu()
            qemu_init_vcpu()
                qemu_kvm_start_vcpu()
                    qemu_kvm_cpu_thread_fn()
                        while (1) {
                            if (cpu_can_run(cpu)) {
                                r = kvm_cpu_exec(cpu);
                            }
                        }

    kvm_cpu_exec(CPUState *cpu)
        -> run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
According to the underlying architecture and hardware, the KVM kernel modules initialize different structures; one of them is vmx_x86_ops/svm_x86_ops (owned by the kvm-intel or kvm-amd module, respectively), as can be seen in the following. It defines the operations that need to be performed while the vCPU is in context. KVM uses the kvm_x86_ops vector to point to one of these vectors, according to which KVM module (kvm-intel or kvm-amd) is loaded for the hardware. The "run" pointer defines the function to be executed when the guest vCPU is run, and handle_exit defines the actions to be performed at the time of a VMEXIT:
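An abridged sketch of how these operation vectors are populated (field names as in 4.x-era kernels; the full definitions in arch/x86/kvm/vmx.c and arch/x86/kvm/svm.c carry many more operations):

    static struct kvm_x86_ops vmx_x86_ops = {
        ...
        .vcpu_load   = vmx_vcpu_load,
        .run         = vmx_vcpu_run,     /* executed to enter the guest */
        .handle_exit = vmx_handle_exit,  /* executed on VMEXIT */
        ...
    };

    static struct kvm_x86_ops svm_x86_ops = {
        ...
        .vcpu_load   = svm_vcpu_load,
        .run         = svm_vcpu_run,
        .handle_exit = handle_exit,
        ...
    };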
The run pointer points to vmx_vcpu_run or svm_vcpu_run accordingly. The functions vmx_vcpu_run and svm_vcpu_run do the job of saving host state, loading the guest registers, and executing the hardware instruction that enters the guest (VMLAUNCH/VMRESUME on Intel, VMRUN on AMD). We have now walked through the QEMU-KVM user-space code executed at vCPU run time, up to the point where it enters the kernel via a syscall. From there, following the file operations structure of the vCPU file descriptor, the kernel calls kvm_vcpu_ioctl(), which defines the action to be taken according to the ioctl() number:
    static long kvm_vcpu_ioctl(struct file *filp,
                               unsigned int ioctl, unsigned long arg)
    {
        ...
        switch (ioctl) {
        case KVM_RUN:
            ...
            kvm_arch_vcpu_ioctl_run(vcpu, vcpu->run);
                -> vcpu_load()
                    -> vmx_vcpu_load()
                -> vcpu_run(vcpu)
                    -> vcpu_enter_guest()
                        -> vmx_vcpu_run()
            ...
        }
    }
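The dispatch above is reached through the file_operations attached to the vCPU file descriptor; abridged from virt/kvm/kvm_main.c (4.x-era kernels), it looks like this:

    static struct file_operations kvm_vcpu_fops = {
        .release        = kvm_vcpu_release,
        .unlocked_ioctl = kvm_vcpu_ioctl,  /* ioctl(vcpu_fd, KVM_RUN, ...) lands here */
        .mmap           = kvm_vcpu_mmap,
        .llseek         = noop_llseek,
    };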
We will go through vcpu_run() to understand how it reaches vmx_vcpu_run or svm_vcpu_run:
    static int vcpu_run(struct kvm_vcpu *vcpu)
    {
        ...
        for (;;) {
            if (kvm_vcpu_running(vcpu)) {
                r = vcpu_enter_guest(vcpu);
            } else {
                r = vcpu_block(kvm, vcpu);
            }
            ...
        }
        ...
    }
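For completeness, kvm_vcpu_running() simply checks whether the vCPU is in a runnable state; if it is not (for example, the guest executed HLT), vcpu_block() puts the thread to sleep until an event such as an interrupt makes it runnable again. In 4.x-era kernels, the check looks roughly like this:

    static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
    {
        return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
                !vcpu->arch.apf.halted);
    }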
Once it is in vcpu_enter_guest(), you can see some of the important calls that happen when KVM enters guest mode:
    static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
    {
        ...
        kvm_x86_ops->prepare_guest_switch(vcpu);
        vcpu->mode = IN_GUEST_MODE;
        __kvm_guest_enter();
        kvm_x86_ops->run(vcpu);              /* vmx_vcpu_run or svm_vcpu_run */
        vcpu->mode = OUTSIDE_GUEST_MODE;
        kvm_guest_exit();
        r = kvm_x86_ops->handle_exit(vcpu);  /* vmx_handle_exit or handle_exit */
        ...
    }
You can see a high-level picture of VMENTRY and VMEXIT in the vcpu_enter_guest() function. That said, VMENTRY (vmx_vcpu_run or svm_vcpu_run) is just the guest operating system executing on the CPU; different intercepted events can occur at this stage, causing a VMEXIT. When this happens, vmx_handle_exit or handle_exit starts looking into the exit cause. We have already discussed the reasons for a VMEXIT in previous sections. Once there is a VMEXIT, the exit reason is analyzed and action is taken accordingly.
vmx_handle_exit() is the function responsible for handling the exit reason:
    /*
     * The guest has exited. See if we can fix it or if we need userspace
     * assistance.
     */
    static int vmx_handle_exit(struct kvm_vcpu *vcpu)
    {
        ...
    }

    /*
     * The exit handlers return 1 if the exit was handled fully and guest
     * execution may resume. Otherwise they set the kvm_run parameter to
     * indicate what needs to be done to userspace and return 0.
     */
    static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
        [EXIT_REASON_EXCEPTION_NMI]      = handle_exception,
        [EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt,
        [EXIT_REASON_TRIPLE_FAULT]       = handle_triple_fault,
        [EXIT_REASON_IO_INSTRUCTION]     = handle_io,
        [EXIT_REASON_CR_ACCESS]          = handle_cr,
        [EXIT_REASON_VMCALL]             = handle_vmcall,
        [EXIT_REASON_VMCLEAR]            = handle_vmclear,
        [EXIT_REASON_VMLAUNCH]           = handle_vmlaunch,
        ...
    };
kvm_vmx_exit_handlers[] is the table of VMEXIT handlers, indexed by exit reason. Similar to the Intel code, the SVM code has handle_exit():
    static int handle_exit(struct kvm_vcpu *vcpu)
    {
        struct vcpu_svm *svm = to_svm(vcpu);
        struct kvm_run *kvm_run = vcpu->run;
        u32 exit_code = svm->vmcb->control.exit_code;
        ...
        return svm_exit_handlers[exit_code](svm);
    }
handle_exit() dispatches through the svm_exit_handlers[] array, shown in the following:

    static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = {
        [SVM_EXIT_READ_CR0] = cr_interception,
        [SVM_EXIT_READ_CR3] = cr_interception,
        [SVM_EXIT_READ_CR4] = cr_interception,
        [SVM_EXIT_INTR]     = intr_interception,
        [SVM_EXIT_NMI]      = nmi_interception,
        [SVM_EXIT_SMI]      = nop_on_interception,
        [SVM_EXIT_IOIO]     = io_interception,
        [SVM_EXIT_VMRUN]    = vmrun_interception,
        [SVM_EXIT_VMMCALL]  = vmmcall_interception,
        [SVM_EXIT_VMLOAD]   = vmload_interception,
        [SVM_EXIT_VMSAVE]   = vmsave_interception,
        ...
    };

If needed, KVM falls back to userspace (QEMU) to perform the emulation, as some of the accesses target devices emulated by QEMU. For example, to emulate an I/O port access, control goes to userspace (QEMU), as in this fragment of kvm-all.c:

    switch (run->exit_reason) {
    case KVM_EXIT_IO:
        DPRINTF("handle_io\n");
        /* Called outside BQL */
        kvm_handle_io(run->io.port, attrs,
                      (uint8_t *)run + run->io.data_offset,
                      run->io.direction,
                      run->io.size,
                      run->io.count);
        ret = 0;
        break;
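To see why this round trip works, recall that run is the struct kvm_run that userspace mmap()ed from the vCPU file descriptor: the data the guest moved with an I/O instruction sits inside that shared mapping, at run->io.data_offset. A minimal, QEMU-independent sketch of handling such an exit (assuming the skeleton vCPU loop shown earlier, and a guest writing bytes to port 0x3f8):

    #include <stdio.h>
    #include <stdint.h>
    #include <linux/kvm.h>

    /* 'run' is the struct kvm_run mmap()ed from the vCPU fd */
    static void handle_io_exit(struct kvm_run *run)
    {
        if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3f8) {
            /* Guest data lives inside the shared kvm_run mapping */
            fwrite((uint8_t *)run + run->io.data_offset,
                   run->io.size, run->io.count, stdout);
        }
    }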