### CS5460: Operating Systems Lecture: Virtualization

Anton Burtsev, March 2013

### Traditional operating system





### Virtual machines



### A bit of history

- Virtual machines were popular in the 1960s-70s
  - Share resources of mainframe computers [Goldberg 1974]
  - Run multiple single-user operating systems
- Interest was lost by the 1980s-90s
  - Development of multi-user OS
  - Rapid drop in hardware cost
- Hardware support for virtualization was lost





### Virtual machine

Efficient duplicate of a real machine

- Compatibility
- Performance
- Isolation





### What needs to be emulated?

- CPU and memory
  - Register state
  - Memory state
- Memory management unit
  - Page tables, segments
- Platform
  - Interrupt controller, timer, buses
- BIOS
- Peripheral devices
  - Disk, network interface, serial line

### x86 is not virtualizable

- Some instructions are *sensitive* (they read or update the state of the virtual machine) but *non-privileged* (they don't trap)
  - 17 sensitive, non-privileged instructions [Robin et al 2000]

### x86 is not virtualizable (II)

| Group                                | Instructions                               |
|--------------------------------------|--------------------------------------------|
| Access to interrupt flag             | pushf, popf, iret                          |
| Visibility into segment descriptors  | lar, verr, verw, lsl                       |
| Segment manipulation instructions    | pop \<seg\>, push \<seg\>, mov \<seg\>     |
| Read-only access to privileged state | sgdt, sldt, sidt, smsw                     |
| Interrupt and gate instructions      | fcall, longjump, retfar, str, int \<n\>    |

#### Examples

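- popf in user mode silently leaves the interrupt flag (IF) unchanged instead of trapping
  - Impossible to detect when the guest disables interrupts
- push %cs exposes the code segment selector (%cs), whose low bits hold the CPL
  - A guest kernel can see that it is not running at ring 0 and gets confused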
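
The `popf` case can be modeled in a few lines. This is a simplified software model of how the IF bit is treated, not real hardware or VMM code. The names `cpu_t`, `model_popf`, and `iopl_of` are invented for the illustration. It assumes protected mode, where `popf` leaves IF unchanged without faulting whenever CPL is greater than IOPL.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_IF   (1u << 9)    /* interrupt enable flag */
#define EFLAGS_IOPL (3u << 12)   /* I/O privilege level, bits 12-13 */

/* Hypothetical, simplified CPU model for this illustration only. */
typedef struct {
    uint32_t eflags;
    unsigned cpl;   /* current privilege level, 0 (kernel) to 3 (user) */
} cpu_t;

static unsigned iopl_of(uint32_t eflags) { return (eflags >> 12) & 3u; }

/* Model of popf in protected mode: returns true if a trap was raised.
 * It never traps; at CPL > IOPL the new IF value is silently dropped. */
static bool model_popf(cpu_t *cpu, uint32_t popped)
{
    uint32_t keep = 0;
    if (cpu->cpl > iopl_of(cpu->eflags))
        keep |= EFLAGS_IF;      /* IF not writable: keep the old value */
    if (cpu->cpl != 0)
        keep |= EFLAGS_IOPL;    /* IOPL is writable only at CPL 0 */
    cpu->eflags = (cpu->eflags & keep) | (popped & ~keep);
    return false;               /* no trap, so a VMM never sees the attempt */
}
```

A deprivileged guest kernel (CPL 3, IOPL 0) that pops a flags word with IF cleared keeps running with IF set, and the monitor gets no notification. The same call at CPL 0 clears IF.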

### Solution space

- Parse the instruction stream and detect all sensitive instructions dynamically
  - Interpretation (Bochs, JSLinux)
  - Binary translation (VMware, QEMU)
- Change the operating system
  - Paravirtualization (Xen, L4, Denali, Hyper-V)
- Make all sensitive instructions privileged!
  - Hardware-supported virtualization (Xen, KVM, VMware)
    - Intel VT-x, AMD SVM

### Basic blocks of a virtual machine monitor: QEMU example



### Interpreted execution: Bochs, JSLinux



### What does it mean to run a guest?

- Bochs internal emulation loop
- Similar to a non-pipelined CPU such as the 8086
- How many cycles per instruction?
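
A fetch-decode-dispatch loop of this kind can be sketched as follows. The guest instruction set here is a toy invented for the example (it is not x86, and this is not Bochs code). The names `insn_t`, `vcpu_t`, and `run_guest` are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical toy guest ISA, invented for this sketch (not x86). */
enum { OP_HALT, OP_LOADI, OP_ADD, OP_DEC, OP_JNZ };

typedef struct { uint8_t op, a, b; } insn_t;   /* already in decoded form */

typedef struct {
    uint32_t reg[4];
    size_t   pc;
    uint64_t executed;   /* number of guest instructions executed */
} vcpu_t;

/* Interpret until HALT: one guest instruction per loop iteration. */
static void run_guest(vcpu_t *cpu, const insn_t *mem)
{
    for (;;) {
        insn_t i = mem[cpu->pc++];          /* fetch (decode is trivial here) */
        cpu->executed++;
        switch (i.op) {                     /* dispatch */
        case OP_LOADI: cpu->reg[i.a] = i.b;               break;
        case OP_ADD:   cpu->reg[i.a] += cpu->reg[i.b];    break;
        case OP_DEC:   cpu->reg[i.a]--;                   break;
        case OP_JNZ:   if (cpu->reg[i.a]) cpu->pc = i.b;  break;
        case OP_HALT:  return;
        default:       return;              /* unknown opcode: stop */
        }
    }
}
```

Every guest instruction costs at least one trip through the loop and one dispatch on the host, so each guest instruction takes many host cycles.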

### Binary translation: VMware

```
int isPrime(int a) {
  for (int i = 2; i < a; i++) {
    if (a % i == 0) return 0;
  }
  return 1;
}
```

Compiled x86 code:

```
isPrime:  mov  %ecx, %edi   ; %ecx = %edi (a)
          mov  %esi, $2     ; i = 2
          cmp  %esi, %ecx   ; is i >= a?
          jge  prime        ; jump if yes
nexti:    mov  %eax, %ecx   ; set %eax = a
          cdq               ; sign-extend
          idiv %esi         ; a % i
          test %edx, %edx   ; is remainder zero?
          jz   notPrime     ; jump if yes
          inc  %esi         ; i++
          cmp  %esi, %ecx   ; is i >= a?
          jl   nexti        ; jump if no
prime:    mov  %eax, $1     ; return value in %eax
          ret
notPrime: xor  %eax, %eax   ; %eax = 0
          ret
```

Translation of the first block:

```
isPrime': *mov %ecx, %edi   ; IDENT
          mov  %esi, $2
          cmp  %esi, %ecx
          jge  [takenAddr]  ; JCC
          jmp  [fallthrAddr]
```

Full translation:

```
isPrime':  *mov  %ecx, %edi        ; IDENT
           mov   %esi, $2
           cmp   %esi, %ecx
           jge   [takenAddr]       ; JCC
                                   ; fall-thru into next CCF
nexti':    *mov  %eax, %ecx        ; IDENT
           cdq
           idiv  %esi
           test  %edx, %edx
           jz    notPrime'         ; JCC
                                   ; fall-thru into next CCF
           *inc  %esi              ; IDENT
           cmp   %esi, %ecx
           jl    nexti'            ; JCC
           jmp   [fallthrAddr3]
notPrime': *xor  %eax, %eax        ; IDENT
           pop   %r11              ; RET
           mov   %gs:0xff39eb8(%rip), %rcx ; spill %rcx
           movzx %ecx, %r11b
           jmp   %gs:0xfc7dde0(8*%rcx)
```
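
The IDENT and JCC annotations in the translated listing suggest the overall shape of the translator loop: copy ordinary instructions unchanged and end the unit at the first control-flow instruction, which is rewritten. The sketch below illustrates that idea on a toy instruction stream. It is not VMware's translator, and every type and function name in it is hypothetical.

```c
#include <stddef.h>

/* Hypothetical toy instruction classes, invented for this sketch. */
typedef enum { K_PLAIN, K_JCC, K_RET } kind_t;
typedef enum { T_IDENT, T_JCC, T_RET_EMUL } xlate_t;

typedef struct { kind_t kind; } guest_insn_t;
typedef struct { xlate_t how; size_t guest_index; } out_insn_t;

/* Translate one unit starting at `start`: copy plain instructions
 * identically and stop after the first control-flow instruction,
 * which is marked for rewriting. Returns the number of outputs. */
static size_t translate_unit(const guest_insn_t *code, size_t n, size_t start,
                             out_insn_t *out, size_t max_out)
{
    size_t emitted = 0;
    for (size_t i = start; i < n && emitted < max_out; i++) {
        out_insn_t o = { T_IDENT, i };
        if (code[i].kind == K_JCC) o.how = T_JCC;
        if (code[i].kind == K_RET) o.how = T_RET_EMUL;
        out[emitted++] = o;
        if (code[i].kind != K_PLAIN)
            break;                      /* unit ends at control flow */
    }
    return emitted;
}
```

Applied to `isPrime`, a call with `start = 0` yields the three identical instructions plus the rewritten `jge`. A second call starting at `nexti` stops at `jz`.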

### VMware Workstation



Fig. 2. The VMware Hosted Architecture. VMware Workstation consists of the three shaded components. The figure is split vertically between host operating system context and VMM context, and horizontally between system-level and user-level execution. The steps labeled (i)–(v) correspond to the execution that follows an external interrupt that occurs while the CPU is executing in VMM context.

### Address space during the world switch



### The world switch

- First, save the old processor state: general-purpose registers, privileged registers, and segment registers.
- Then, restore the new address space by assigning %cr3. All page table mappings immediately change, except the one of the cross page.
- Restore the global segment descriptor table register (%gdtr).
- With the %gdtr now pointing to the new descriptor table, restore %ds. From that point on, all data references to the cross page must use a different virtual address to access the same data structure. However, because %cs is unchanged, instruction addresses remain the same.
- Restore the other segment registers, %idtr, and the general-purpose registers.
- Finally, restore %cs and %eip through a long jump instruction.
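
The ordering of the steps above can be made concrete with a small software model. The real switch is a short privileged assembly routine running from the cross page. The C below only mimics the order of the updates on a plain struct, and `world_t` and `world_switch` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical model of per-world CPU state; field names are illustrative. */
typedef struct {
    uint32_t gpr[8];
    uint32_t cr3, gdtr, idtr;
    uint16_t cs, ds, es, fs, gs, ss;
    uint32_t eip;
} world_t;

/* `cpu` is the live state, `from` receives the old world, `to` supplies
 * the new one. The assignment order mirrors the list above. */
static void world_switch(world_t *cpu, world_t *from, const world_t *to)
{
    *from = *cpu;               /* 1. save the old processor state */
    cpu->cr3  = to->cr3;        /* 2. switch address space */
    cpu->gdtr = to->gdtr;       /* 3. switch descriptor table */
    cpu->ds   = to->ds;         /* 4. reload %ds under the new table */
    cpu->es = to->es;           /* 5. other segments, %idtr, registers */
    cpu->fs = to->fs;
    cpu->gs = to->gs;
    cpu->ss = to->ss;
    cpu->idtr = to->idtr;
    for (int r = 0; r < 8; r++)
        cpu->gpr[r] = to->gpr[r];
    cpu->cs  = to->cs;          /* 6. %cs and %eip change together, last */
    cpu->eip = to->eip;
}
```

Loading `%cs` and `%eip` last matches the list: until that point, instruction fetch continues from the cross page at an unchanged address.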

### Protecting the VMM



### Translator continuations



### Interpreted execution revisited: Bochs



### Instruction trace cache

- 50% of time in the main loop
  - Fetch, decode, dispatch
- Trace cache (Bochs v2.3.6)
  - Hardware idea (Pentium 4)
  - Trace of up to 16 instructions (32K entries)
- 20% speedup
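
A trace cache along these lines can be sketched as a table keyed by guest address that stores already-decoded instructions. The sizes come from the slide. The direct-mapped organization, the struct layout, and the names `trace_t`, `trace_lookup`, and `trace_insert` are assumptions of this sketch and are not taken from the Bochs sources.

```c
#include <stdint.h>
#include <stddef.h>

#define TRACE_ENTRIES 32768u   /* 32K entries, as on the slide */
#define TRACE_MAX_LEN 16       /* up to 16 instructions per trace */

typedef struct { uint8_t opcode; } decoded_t;  /* hypothetical decoded form */

typedef struct {
    uint32_t  guest_pc;        /* tag: address of the first instruction */
    uint8_t   len;             /* 0 means the entry is empty */
    decoded_t insn[TRACE_MAX_LEN];
} trace_t;

static trace_t trace_cache[TRACE_ENTRIES];

/* Direct-mapped lookup (an assumption of this sketch). Returns the cached
 * trace, or NULL on a miss so the caller decodes and calls trace_insert. */
static const trace_t *trace_lookup(uint32_t pc)
{
    const trace_t *t = &trace_cache[pc % TRACE_ENTRIES];
    return (t->len != 0 && t->guest_pc == pc) ? t : NULL;
}

static void trace_insert(uint32_t pc, const decoded_t *insn, size_t n)
{
    trace_t *t = &trace_cache[pc % TRACE_ENTRIES];
    if (n > TRACE_MAX_LEN) n = TRACE_MAX_LEN;   /* truncate long traces */
    t->guest_pc = pc;
    t->len = (uint8_t)n;
    for (size_t k = 0; k < n; k++) t->insn[k] = insn[k];
}
```

On a hit the main loop skips fetch and decode and dispatches the cached instructions directly.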

### Improve branch prediction

```
void BX_CPU_C::SUB_EdGd(bxInstruction_c *i)
{
  Bit32u op2_32, op1_32, diff_32;

  op2_32 = BX_READ_32BIT_REG(i->nnn());

  // Slide annotation: 20 cycles penalty on Core 2 Duo
  if (i->modC0()) {   // reg/reg format
    op1_32 = BX_READ_32BIT_REG(i->rm());
    diff_32 = op1_32 - op2_32;
    BX_WRITE_32BIT_REGZ(i->rm(), diff_32);
  }
  else {              // mem/reg format
    read_RMW_virtual_dword(i->seg(),
        RMAddr(i), &op1_32);
    diff_32 = op1_32 - op2_32;
    Write_RMW_virtual_dword(diff_32);
  }

  SET_LAZY_FLAGS_SUB32(op1_32, op2_32,
        diff_32);
}
```

### Improve branch prediction (II)

- Split handlers to avoid conditional logic
  - Decide the handler at decode time (15% speedup)
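
Choosing the handler at decode time can be sketched with a function pointer stored in the decoded instruction. This is a heavily simplified model and not the Bochs API. The names `insn_t`, `cpu_t`, `sub_reg`, `sub_mem`, and `decode_sub` are hypothetical, and guest memory is reduced to a small array.

```c
#include <stdint.h>

/* Hypothetical, heavily simplified model; not the Bochs API. */
typedef struct insn insn_t;
typedef struct { uint32_t reg[8]; uint32_t mem[16]; } cpu_t;
typedef void (*handler_t)(cpu_t *, const insn_t *);

struct insn {
    handler_t execute;   /* chosen once, at decode time */
    uint8_t   dst, src;  /* dst: register number or memory slot */
};

/* Register destination: no test of the operand form at run time. */
static void sub_reg(cpu_t *c, const insn_t *i)
{
    c->reg[i->dst] -= c->reg[i->src];
}

/* Memory destination: same operation, separate straight-line handler. */
static void sub_mem(cpu_t *c, const insn_t *i)
{
    c->mem[i->dst] -= c->reg[i->src];
}

/* The register-or-memory decision is made here, once per decoded
 * instruction, instead of on every execution. */
static insn_t decode_sub(int dst_is_reg, uint8_t dst, uint8_t src)
{
    insn_t i = { dst_is_reg ? sub_reg : sub_mem, dst, src };
    return i;
}
```

Execution then becomes `i.execute(&cpu, &i)`, with no data-dependent conditional inside either handler.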

### Resolve memory references without misprediction

- Bochs v2.3.5 has 30 possible branch targets for the effective address computation
  - Effective Addr = (Base + Index\*Scale + Displacement) mod 2<sup>AddrSize</sup>
  - Many instructions use a simpler form, e.g. Effective Addr = Base or Effective Addr = Displacement
  - 100% chance of misprediction
- Two techniques to improve prediction:
  - Reduce the number of targets: leave only 2 forms
  - Replicate indirect branch point
- 40% speedup
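
One way to cut the number of address forms is to always evaluate the general formula and encode a missing base or index as a register that reads as zero. The sketch below illustrates that idea and collapses everything into a single form, which is a variant and not necessarily the two forms the slide refers to. `NIL_REG`, `cpu_t`, and `effective_addr` are hypothetical names.

```c
#include <stdint.h>

#define NIL_REG 8   /* hypothetical extra register slot that always holds 0 */

typedef struct { uint32_t reg[9]; } cpu_t;   /* reg[NIL_REG] must stay 0 */

/* General form only: an absent base or index is encoded as NIL_REG, so one
 * straight-line computation covers every addressing mode. `addr_bits` is
 * 16 or 32; the mask implements the mod 2^AddrSize. Scale is passed as a
 * shift count (0..3 for scale factors 1, 2, 4, 8). */
static uint32_t effective_addr(const cpu_t *c, unsigned base, unsigned index,
                               unsigned scale_shift, uint32_t disp,
                               unsigned addr_bits)
{
    uint32_t mask = (addr_bits == 32) ? 0xFFFFFFFFu
                                      : ((1u << addr_bits) - 1u);
    return (c->reg[base] + (c->reg[index] << scale_shift) + disp) & mask;
}
```

For example, Effective Addr = Base becomes `effective_addr(&c, base, NIL_REG, 0, 0, 32)`. The cost is a few redundant additions in exchange for no indirect branch on the address form.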

### Time to boot Windows

|             | 1000 MHz Pentium III | 2533 MHz Pentium 4 | 2666 MHz Core 2 Duo |
|-------------|----------------------|--------------------|---------------------|
| Bochs 2.3.5 | 882                  | 595                | 180                 |
| Bochs 2.3.6 | 609                  | 533                | 157                 |
| Bochs 2.3.7 | 457                  | 236                | 81                  |

### Cycle costs

|                                            | Bochs 2.3.5 | Bochs 2.3.7 | QEMU 0.9.0 |
|--------------------------------------------|-------------|-------------|------------|
| Register move (MOV, MOVSX)                 | 43          | 15          | 6          |
| Register arithmetic (ADD, SBB)             | 64          | 25          | 6          |
| Floating point multiply                    | 1054        | 351         | 27         |
| Memory store of constant                   | 99          | 59          | 5          |
| Pairs of memory load and store operations  | 193         | 98          | 14         |
| Non-atomic read-modify-write               | 112         | 75          | 10         |
| Indirect call through guest EAX register   | 190         | 109         | 197        |
| VirtualProtect system call                 | 126952      | 63476       | 22593      |
| Page fault and handler                     | 888666      | 380857      | 156823     |
| Best case peak guest execution rate (MIPS) | 62          | 177         | 444        |

### References

- A Comparison of Software and Hardware Techniques for x86 Virtualization. Keith Adams, Ole Agesen. ASPLOS'06.
- Bringing Virtualization to the x86 Architecture with the Original VMware Workstation. Edouard Bugnion, Scott Devine, Mendel Rosenblum, Jeremy Sugerman, Edward Y. Wang. ACM TOCS'12.
- Virtualization Without Direct Execution or Jitting: Designing a Portable Virtual Machine Infrastructure. Darek Mihocka, Stanislav Shwartsman.