Mastering KVM Virtualization

Time to think more about QEMU

Quick Emulator (QEMU) was written by Fabrice Bellard (the creator of FFmpeg). It is free software, mainly licensed under the GNU General Public License (GPL).

QEMU is a generic and open source machine emulator and virtualizer. When used as a machine emulator, QEMU can run OSs and programs made for one machine (for example: an ARM board) on a different machine (for example: your own PC). By using dynamic translation, it achieves very good performance (see www.QEMU.org).

Let me rephrase the preceding paragraph and give a more specific explanation. QEMU is actually a hosted hypervisor/VMM that performs hardware virtualization. Are you confused? If yes, don't worry. You will get a better picture by the end of this chapter, especially when you go through each of the interrelated components and correlate the entire path used here to perform virtualization. QEMU can act as an emulator or a virtualizer:

  • QEMU as an emulator: In this mode, QEMU translates binary code written for one processor into code for another (for example, running ARM code on an x86 host):

    "The Tiny Code Generator (TCG) aims to remove the shortcoming of relying on a particular version of GCC or any compiler, instead incorporating the compiler (code generator) into other tasks performed by QEMU at run time. The whole translation task thus consists of two parts: blocks of target code (TBs) being rewritten in TCG ops - a kind of machine-independent intermediate notation, and subsequently this notation being compiled for the host's architecture by TCG. Optional optimisation passes are performed between them.

    TCG requires dedicated code written to support every architecture it runs on."

    (TCG info from Wikipedia: https://en.wikipedia.org/wiki/QEMU#Tiny_Code_Generator)

    Tiny Code Generator in QEMU

  • QEMU as a virtualizer: This is the mode where QEMU executes the guest code directly on the host CPU, thus achieving native performance. For example, when working under the Xen/KVM hypervisors, QEMU can operate in this mode. If KVM is the underlying hypervisor, QEMU can virtualize guests such as PowerPC, S390, x86, and so on. In short, QEMU is capable of running without KVM, using the previously mentioned binary translation method; however, that execution will be slower than the hardware-accelerated virtualization enabled by KVM. In either mode (as a virtualizer or an emulator), QEMU does not just emulate the processor; it also emulates different peripherals, such as disks, networks, VGA, PCI, serial and parallel ports, USB, and so on. Apart from this I/O device emulation, when working with KVM, QEMU-KVM creates and initializes virtual machines. It also creates a POSIX thread for each virtual CPU of a guest (refer to the following figure). Also, it provides a framework to emulate the virtual machine's physical address space within the user mode address space of QEMU-KVM:

To execute the guest code on the physical CPU, QEMU makes use of POSIX threads; that is, the guest virtual CPUs are executed in the host kernel as POSIX threads. This in itself brings lots of advantages, because, from a high-level view, these are just ordinary processes to the host kernel. From another angle, the user space part of the KVM hypervisor is provided by QEMU. QEMU runs the guest code via the KVM kernel module. When working with KVM, QEMU also does I/O emulation, I/O device setup, live migration, and so on.

QEMU opens the device file (/dev/kvm) exposed by the KVM kernel module and executes ioctls() on it. Please refer to the next section on KVM to know more about these ioctls(). To conclude, KVM makes use of QEMU to become a complete hypervisor. KVM is an accelerator or enabler of the hardware virtualization extensions (VMX or SVM) provided by the processor, and is thus tightly coupled with the CPU architecture. Indirectly, this means that virtual systems also have to use the same architecture to make use of hardware virtualization extensions/capabilities. Once enabled, it definitely gives better performance than other techniques such as binary translation.
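To get a feel for this interface, the following is a minimal, hedged sketch that uses the raw KVM API directly (it is not QEMU's own code): it opens /dev/kvm and issues two of the system-level ioctls() that QEMU also relies on, KVM_GET_API_VERSION and KVM_CREATE_VM:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    /* open the device file exposed by the KVM kernel module */
    int kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) {
        perror("open /dev/kvm");
        return 1;
    }

    /* system-level ioctl(): query the KVM API version (12 on current kernels) */
    int version = ioctl(kvm, KVM_GET_API_VERSION, 0);
    printf("KVM API version: %d\n", version);

    /* system-level ioctl(): create a VM; the return value is the VM file descriptor */
    int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);
    if (vmfd < 0) {
        perror("KVM_CREATE_VM");
        return 1;
    }
    printf("created VM, vmfd = %d\n", vmfd);
    return 0;
}

QEMU wraps exactly this pattern in its kvm_ioctl() helpers, as we will see shortly.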

QEMU-KVM internals

Before we start looking into QEMU internals, let's clone the QEMU git repository:

# git clone git://git.qemu-project.org/qemu.git

Once it's cloned, you can see a hierarchy of files inside the repo, as shown in the following screenshot:

Some important data structures and ioctls() make up the QEMU userspace and KVM kernel space. Some of the important data structures are KVMState, CPU{X86}State, MachineState, and so on. Before we further explore the internals, I would like to point out that covering them in detail is beyond the scope of this book; however, I will give enough pointers to understand what is happening under the hood and give additional references for further explanation.

Data structures

In this section, we will discuss some of the important data structures of QEMU. The KVMState structure contains the important file descriptors of the VM representation in QEMU. For example, it contains the virtual machine file descriptor, as shown in the following code:

struct KVMState          /* kvm-all.c */
{
    ...
    int fd;
    int vmfd;
    int coalesced_mmio;
    struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
    ...
};

QEMU-KVM maintains a list of CPUX86State structures, one structure for each virtual CPU. The contents of the general purpose registers (as well as RSP and RIP) are part of the CPUX86State.
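For reference, a heavily trimmed sketch of this structure, modeled on target-i386/cpu.h, looks like the following (the exact fields vary between QEMU versions):

typedef struct CPUX86State {
    /* standard registers: RAX, RBX, ..., RSP live in regs[] */
    target_ulong regs[CPU_NB_REGS];
    target_ulong eip;        /* holds RIP in 64-bit mode */
    target_ulong eflags;
    /* segment registers, control registers, FPU/SSE state, and so on follow */
    ...
} CPUX86State;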

Various ioctls() exist: kvm_ioctl(), kvm_vm_ioctl(), kvm_vcpu_ioctl(), kvm_device_ioctl(), and so on. For function definitions, please visit kvm-all.c inside the QEMU source code repo. These ioctls() fundamentally map to the system KVM level, VM level, and vCPU level. These ioctls() are analogous to the ioctls() categorized by KVM. We will discuss this when we dig further into KVM internals. To get access to these ioctls() exposed by the KVM kernel module, QEMU-KVM has to open /dev/kvm, and the resulting file descriptor is stored in KVMState->fd:

  • kvm_ioctl(): These ioctl()s mainly execute on the KVMState->fd parameter, where KVMState->fd carries the file descriptor obtained by opening /dev/kvm.

    For example:

    kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
    kvm_ioctl(s, KVM_CREATE_VM, type);
  • kvm_vm_ioctl(): These ioctl()s mainly execute on the KVMState->vmfd parameter.

    For example:

    kvm_vm_ioctl(s, KVM_CREATE_VCPU, (void *)kvm_arch_vcpu_id(cpu));
    kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
  • kvm_vcpu_ioctl(): These ioctl()s mainly execute on the CPUState->kvm_fd parameter, which is a vCPU file descriptor for KVM.

    For example:

    kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
  • kvm_device_ioctl(): These ioctl()s mainly execute on the device fd parameter.

    For example:

    kvm_device_ioctl(dev_fd, KVM_HAS_DEVICE_ATTR, &attribute) ? 0 : 1;

kvm-all.c is one of the important source files when considering QEMU KVM communication.
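Each of these wrappers is, at its core, just a varargs ioctl() on the corresponding file descriptor. The following is a simplified sketch of kvm_vm_ioctl(), modeled on its definition in kvm-all.c with tracing and error reporting stripped out:

/* assumes <stdarg.h>, <errno.h>, and <sys/ioctl.h>, plus QEMU's KVMState definition */
int kvm_vm_ioctl(KVMState *s, int type, ...)
{
    int ret;
    void *arg;
    va_list ap;

    /* extract the single pointer argument passed by the caller */
    va_start(ap, type);
    arg = va_arg(ap, void *);
    va_end(ap);

    /* issue the ioctl() against the VM file descriptor (KVMState->vmfd) */
    ret = ioctl(s->vmfd, type, arg);
    if (ret == -1) {
        ret = -errno;
    }
    return ret;
}

kvm_ioctl() and kvm_vcpu_ioctl() follow the same pattern, only using KVMState->fd and CPUState->kvm_fd respectively.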

Now let us move on and see how a virtual machine and vCPUs are created and initialized by QEMU in the context of KVM virtualization.

kvm_init() is the function that opens the KVM device file, as shown in the following code, and it also fills in fd [1] and vmfd [2] of KVMState:

static int kvm_init(MachineState *ms)
{
    ...
    KVMState *s;
    s = KVM_STATE(ms->accelerator);
    ...
    s->vmfd = -1;
    s->fd = qemu_open("/dev/kvm", O_RDWR);           --->[1]
    ...
    do {
        ret = kvm_ioctl(s, KVM_CREATE_VM, type);     --->[2]
    } while (ret == -EINTR);
    s->vmfd = ret;
    ret = kvm_arch_init(ms, s);                      ---> (target-i386/kvm.c)
    ...
}

As you can see in the preceding code, the ioctl() with the KVM_CREATE_VM argument will return vmfd. Once QEMU has fd and vmfd, one more file descriptor has to be filled in, which is kvm_fd, the vCPU fd. Let us see how this is filled by QEMU:

main()
    -> cpu_init(cpu_model);
       [#define cpu_init(cpu_model) CPU(cpu_x86_init(cpu_model))]
    -> cpu_x86_create()
    -> qemu_init_vcpu()
    -> qemu_kvm_start_vcpu()
    -> qemu_thread_create()
    -> qemu_kvm_cpu_thread_fn()
    -> kvm_init_vcpu(CPUState *cpu)
int kvm_init_vcpu(CPUState *cpu)
{
    KVMState *s = kvm_state;
    ...
    ret = kvm_vm_ioctl(s, KVM_CREATE_VCPU, (void *)kvm_arch_vcpu_id(cpu));
    cpu->kvm_fd = ret;                               ---> [vCPU fd]
    ...
    mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
    cpu->kvm_run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, cpu->kvm_fd, 0); ---> [3]
    ...
    ret = kvm_arch_init_vcpu(cpu);                   ---> (target-i386/kvm.c)
    ...
}

Some memory pages are shared between the QEMU-KVM process and the KVM kernel module. You can see such a mapping in the kvm_init_vcpu() function. That is, two host memory pages per vCPU, kvm_run and pio_data, make up a communication channel between the QEMU user space process and the KVM kernel module. Also understand that, during the execution of the ioctls() that return the preceding fds, the Linux kernel allocates a file structure and related anonymous inodes. We will discuss the kernel part later, when discussing KVM.
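To illustrate how that shared mapping is used, here is a hedged sketch in terms of the raw KVM API rather than QEMU's exact code (handle_pio_write() is a hypothetical helper): after KVM_RUN returns with KVM_EXIT_IO, the bytes the guest wrote are found inside the same mmap()'ed region, at an offset the kernel records in kvm_run:

struct kvm_run *run = cpu->kvm_run;   /* the region mapped in kvm_init_vcpu() above */

if (run->exit_reason == KVM_EXIT_IO) {
    /* the PIO payload lives inside the shared mapping itself */
    uint8_t *data = (uint8_t *)run + run->io.data_offset;
    if (run->io.direction == KVM_EXIT_IO_OUT) {
        /* hypothetical helper: emulate the device register the guest touched */
        handle_pio_write(run->io.port, data, run->io.size, run->io.count);
    }
}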

We have seen that vCPUs are POSIX threads created by QEMU-KVM. To run the guest code, these vCPU threads execute an ioctl() with KVM_RUN as its argument, as shown in the following code:

int kvm_cpu_exec(CPUState *cpu)
{
    struct kvm_run *run = cpu->kvm_run;
    ...
    run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
    ...
}

The same function, kvm_cpu_exec(), also defines the actions that need to be taken when control comes back to the QEMU-KVM user space from KVM with a VM exit. Even though we will discuss later how KVM and QEMU communicate with each other to perform an operation on behalf of the guest, let me touch upon it here. KVM is an enabler of the hardware extensions provided by vendors such as Intel and AMD through their virtualization extensions, VMX and SVM. KVM uses these extensions to execute the guest code directly on the host CPU. However, if there is an event it cannot handle, for example, guest kernel code accessing a hardware device register that is emulated by QEMU, KVM has to exit back to QEMU and pass control to it. QEMU can then emulate the outcome of the operation. There are different exit reasons, as shown in the following code:

switch (run->exit_reason) {
case KVM_EXIT_IO:
    DPRINTF("handle_io\n");
    ...
case KVM_EXIT_MMIO:
    DPRINTF("handle_mmio\n");
    ...
case KVM_EXIT_IRQ_WINDOW_OPEN:
    DPRINTF("irq_window_open\n");
    ...
case KVM_EXIT_SHUTDOWN:
    DPRINTF("shutdown\n");
    ...
case KVM_EXIT_UNKNOWN:
    ...
case KVM_EXIT_INTERNAL_ERROR:
    ...
case KVM_EXIT_SYSTEM_EVENT:
    switch (run->system_event.type) {
    case KVM_SYSTEM_EVENT_SHUTDOWN:
    case KVM_SYSTEM_EVENT_RESET:
    case KVM_SYSTEM_EVENT_CRASH:
        ...
    }

Threading models in QEMU

QEMU-KVM is a multithreaded, event-driven (with a big lock) application. The important threads are:

  • Main thread
  • Worker threads for virtual disk I/O backend
  • One thread for each virtual CPU

For each and every VM, there is a QEMU process running in the host system. If the guest system is shut down, this process is destroyed/exited. Apart from the vCPU threads, there are dedicated I/O threads running a select(2) event loop to process I/O, such as network packets and disk I/O completions. These I/O threads are also spawned by QEMU. In short, the situation looks like this:

KVM Guest

Before we discuss this further, there is always a question about the physical memory of guest systems: where is it located? Here is the deal: the guest RAM is assigned inside the QEMU process's virtual address space, as shown in the preceding figure. That is, the physical RAM of the guest is inside the QEMU process address space.
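To make this concrete, here is a hedged sketch that uses the raw KVM API rather than QEMU's memory API; setup_guest_ram() and its parameters are illustrative, not QEMU functions. Guest "physical" RAM is just anonymous memory in the user space process, registered with KVM through a VM-level ioctl():

#include <stdint.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vmfd is the VM file descriptor obtained via KVM_CREATE_VM */
static int setup_guest_ram(int vmfd, size_t ram_size)
{
    /* guest "physical" RAM is ordinary anonymous memory in the VMM process */
    void *host_mem = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0x0,                            /* where the guest sees it */
        .memory_size     = ram_size,
        .userspace_addr  = (uint64_t)(uintptr_t)host_mem,  /* where it really lives */
    };

    /* VM-level ioctl(): register the memory slot with KVM */
    return ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
}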

Note

More details about the threading model can be found at: http://blog.vmsplice.net/2011/03/qemu-internals-overall-architecture-and.html?m=1.

The event loop thread is also called iothread. Event loops are used for timers, file descriptor monitoring, and so on. main_loop_wait() is QEMU's main event loop, which is defined as shown in the following code. The main loop services include file descriptor callbacks, bottom halves, and timers (defined in qemu-timer.h). Bottom halves are similar to timers that execute immediately, but have a lower overhead, and scheduling them is wait-free, thread-safe, and signal-safe.

File: vl.c

static void main_loop(void)
{
    bool nonblocking;
    int last_io = 0;
    ...
    do {
        nonblocking = !kvm_enabled() && !xen_enabled() && last_io > 0;
        ...
        last_io = main_loop_wait(nonblocking);
        ...
    } while (!main_loop_should_exit());
}
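Device and backend code hooks into this loop mostly through file descriptor callbacks and bottom halves. The following is a hedged sketch of the typical registration pattern inside the QEMU source tree, using qemu_set_fd_handler() and qemu_bh_new()/qemu_bh_schedule() from QEMU's main-loop API; the callback names are illustrative:

/* assumes QEMU's headers, for example "qemu/main-loop.h" */

static void my_fd_read_cb(void *opaque)
{
    /* called from the main loop when the watched fd becomes readable */
}

static void my_bh_cb(void *opaque)
{
    /* a bottom half: runs in the main loop shortly after being scheduled */
}

static void register_with_main_loop(int fd)
{
    /* watch a file descriptor (for example, a tap device or an eventfd) */
    qemu_set_fd_handler(fd, my_fd_read_cb, NULL, NULL);

    /* create a bottom half and schedule it; scheduling is thread- and signal-safe */
    QEMUBH *bh = qemu_bh_new(my_bh_cb, NULL);
    qemu_bh_schedule(bh);
}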

Before we leave the QEMU code base, I would like to point out that there are mainly two parts to device code. For example, the directory block/ contains the host side of the block device code, while hw/block/ contains the code for device emulation.