Hardware virtualization performance

Read the Tempesta FW benchmarks in a virtualized environment.

Networking

virtio-net

The most basic setup is virtio-net networking for a VM running Tempesta FW. virtio-net is a paravirtualization solution, which is faster than the default emulation of an e1000 network adapter. You can check that virtio-net was correctly set up with ethtool -i, e.g.

# ethtool -i ens2 | grep driver
driver: virtio_net

The example above shows an interface driven by virtio-net. See the official Qemu documentation for how to set it up properly. You can also use this example to configure a VM with libvirt:

<interface type='network'>
  <mac address='52:54:00:ea:4b:97'/>
  <source network='default'/>
  <model type='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</interface>

Multiqueue

Tempesta FW works in soft interrupt Linux kernel threads (aka softirq). The same softirq threads are also used by the Linux kernel firewall, netfilter. Each system CPU has its own softirq context, and most of the network processing logic, such as IP and TCP protocol processing or network filtering, is done in the 'local' CPU context, i.e. the Linux kernel does its best not to access data structures of one CPU from another.
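
To see how the softirq work is actually distributed across CPUs, you can inspect /proc/softirqs (a standard Linux interface; the counters per CPU will of course differ on your system):

# grep -E 'NET_RX|NET_TX' /proc/softirqs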

The best network performance is achieved if a network adapter, even a virtual one, uses separate queues to deliver network packets to the TCP/IP stack. The queues are processed by the softirq kernel threads, so it makes sense to configure a network adapter to use a number of queues equal to the number of available CPUs.

You can configure a virtio-net virtual network interface to use separate queues on the Qemu command line using the vectors, queues, and mq=on options, as in the sketch below.
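
A minimal fragment of such a Qemu command line, assuming a tap backend named hostnet0 and a 4-vCPU guest (the vectors value follows the usual 2*queues+2 rule for multiqueue virtio-net; adjust the names and counts for your setup):

-netdev tap,id=hostnet0,vhost=on,queues=4 \
-device virtio-net-pci,netdev=hostnet0,mq=on,vectors=10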

Or you can use the following configuration for libvirt:

<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <driver name='vhost' queues='N'/>
</interface>
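
Note that with multiqueue virtio-net the guest usually brings up only a single queue by default; the extra queues have to be enabled inside the guest with ethtool, e.g. for the ens2 interface from the example above and a 4-CPU guest:

# ethtool -l ens2
# ethtool -L ens2 combined 4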

Direct networking

It's most recommended to use SR-IOV to directly map a host NIC to a VM. There are instructions for Intel adapters. In this case it's possible to coalesce interrupts to avoid host/guest transitions (see below).
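
A minimal sketch of an SR-IOV setup, assuming the host NIC is eth2 and its driver supports SR-IOV: first create the virtual functions (VFs) on the host, then pass one of them to the VM as a hostdev interface (the PCI address below is hypothetical and must match the VF address reported by lspci on your host):

# echo 4 > /sys/class/net/eth2/device/sriov_numvfs

<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x03' slot='0x10' function='0x0'/>
  </source>
</interface>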

If your adapter doesn't support SR-IOV, then you can use a macvtap interface. An example for a private connection with 4 queues:

    <interface type='direct'>
      <mac address='52:54:00:07:88:51'/>
      <source dev='eth2' mode='private'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>

vCPU binding

In the QEMU-KVM hypervisor each VM is a separate QEMU process, which can migrate from one CPU to another. The migrations can hurt VM performance through frequent CPU cache invalidations and/or slower memory transfers on multi-processor (NUMA) systems. Each vCPU (virtual CPU) is a thread of the corresponding QEMU process. These threads are distributed across the host CPUs by the host Linux scheduler just like any other thread/process in the OS. In the default configuration the vCPUs of a virtual machine are not pinned to any CPU and can be scheduled on any of them (regardless of the vCPU count specified in the -smp QEMU option). E.g. if we define two vCPUs for a guest VM on a host with four CPUs, then either of these vCPUs can be executed on any of the host CPUs. Moreover, because the vCPUs are freely distributed across all host CPUs, several vCPUs may often end up running on the same CPU.
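
You can observe the current placement of the vCPU threads with virsh (the domain name tfw-vm is hypothetical); the CPU field of the output shows the physical CPU each vCPU is currently running on, and it changes as the host scheduler migrates the threads:

# virsh vcpuinfo tfw-vm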

It's recommended to pin the VM CPUs to particular host system CPUs. You can do this with the following configuration for libvirt:

  <vcpu placement='static' cpuset='4-7'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='4'/>
    <vcpupin vcpu='1' cpuset='5'/>
    <vcpupin vcpu='2' cpuset='6'/>
    <vcpupin vcpu='3' cpuset='7'/>
  </cputune>
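
After restarting the domain you can verify the pinning with virsh, e.g. (again assuming the domain name is tfw-vm):

# virsh vcpupin tfw-vm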

Host/guest transitions

KVM virtualization introduces significant overheads, which can be dramatic without appropriate hardware acceleration. In particular, the performance issues are noticeable during interrupt processing (including inter-processor interrupts, IPIs), writes to model-specific registers (MSRs) and other events occurring during guest code execution which cause transitions between the guest and hypervisor modes.

Hardware support for virtualization in Intel processors is provided by VMX operation. There is VMX root operation (which the hypervisor runs in) and VMX non-root operation (for guest software). A VM-entry is the VMX root => VMX non-root transition; a VM-exit is the VMX non-root => VMX root transition. In VMX non-root mode certain instructions and events cause VM-exits to the hypervisor. VMX non-root operation and VMX transitions are controlled by a virtual-machine control structure (VMCS). A hypervisor can use a different VMCS for each virtual machine (VM) that it supports. For VMs with multiple processors (vCPUs), the QEMU-KVM hypervisor uses a separate VMCS for each vCPU.
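
You can quickly check on the host that the CPU supports VMX and that the corresponding KVM module is loaded (standard Linux tools; nothing here is specific to Tempesta FW):

# grep -c vmx /proc/cpuinfo
# lsmod | grep kvm_intel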

Interrupts

For example, consider IPI processing in a VM guest: on Intel processors with x2APIC support, IPI generation is simply a write to an MSR, which in the QEMU-KVM hypervisor causes a VM-exit on the IPI source CPU. Receiving an IPI also causes a VM-exit, at least on Intel processors without posted-interrupt (so-called APICv or vAPIC) support.
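
Whether the guest actually runs with x2APIC can be checked by looking for the x2apic flag in the guest's /proc/cpuinfo:

# grep -m1 -o x2apic /proc/cpuinfo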

Processing of interrupts in non-root mode (on Intel processors with VMX support) can be accelerated in KVM if the processor has virtualized APIC support (APICv, including posted interrupts). In this case no VM-exit occurs on the target CPU which receives an interrupt in non-root mode, and the processor passes the interrupt to the guest on its own, without hypervisor intervention. APICv support can be checked in the special Capability Reporting VMX MSRs; the following bits must be set for the Intel 64 architecture:

  • Activate secondary controls (bit 63 of IA32_VMX_PROCBASED_CTLS)
  • Virtualize APIC accesses (bit 32 of IA32_VMX_PROCBASED_CTLS2)
  • APIC-register virtualization (bit 40 of IA32_VMX_PROCBASED_CTLS2)
  • Virtual-interrupt delivery (bit 41 of IA32_VMX_PROCBASED_CTLS2)
  • Process posted interrupts (bit 39 of IA32_VMX_TRUE_PINBASED_CTLS if bit 55 is set in IA32_VMX_BASIC; otherwise - bit 39 of IA32_VMX_PINBASED_CTLS)

Besides, the kernel must be built with CONFIG_X86_LOCAL_APIC enabled and the kvm_intel kernel module must be loaded with the parameter enable_apicv=Y (default: N). You can check whether KVM is loaded with APICv support with:

# cat /sys/module/kvm_intel/parameters/enable_apicv
Y
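
To reload the module with APICv enabled (all VMs must be shut down before unloading kvm_intel), and to make the setting persistent, something like the following can be used (the modprobe.d file name is arbitrary):

# modprobe -r kvm_intel
# modprobe kvm_intel enable_apicv=1
# echo 'options kvm_intel enable_apicv=1' > /etc/modprobe.d/kvm.conf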

The values of the specified bits can be read with the rdmsr utility (msr-tools package):

rdmsr [options] <register_address>

The registers specified above have the following addresses:

  • IA32_VMX_BASIC: 0x480
  • IA32_VMX_PROCBASED_CTLS: 0x482
  • IA32_VMX_PROCBASED_CTLS2: 0x48B
  • IA32_VMX_PINBASED_CTLS: 0x481
  • IA32_VMX_TRUE_PINBASED_CTLS: 0x48D
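
For example, to check the 'Virtualize APIC accesses' bit (bit 32 of IA32_VMX_PROCBASED_CTLS2) on CPU 0, you could read just that bit with rdmsr's bitfield option; the command prints 1 if the bit is set (the msr kernel module must be loaded):

# modprobe msr
# rdmsr -p 0 -f 32:32 0x48B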

To automate this cumbersome bit-checking process, tempesta/scripts/check_conf.pl can be used:

$ sudo tempesta/scripts/check_conf.pl 
....
'Activate secondary controls' bit: found
'Virtualize APIC accesses' bit: found
'APIC-register virtualization' bit: NOT found
'Virtual-interrupt delivery' bit: NOT found
'Process posted interrupts' bit: NOT found

Detailed information about Capability Reporting VMX registers and their connection with VM-execution control can be found in Intel® 64 and IA-32 Architectures Software Developer’s Manual: Volume 3 (Sections 35.1, 24.6.1, 24.6.2, Appendix A.3).

Performance profiling

KVM uses a domain model for performance monitoring counter (PMC) virtualization: the relevant PMC registers are saved and restored only when execution switches between different guests, not on every VM-exit/VM-entry within the context of the current guest. So, when an IPI is received on the target CPU, the PMCs account not only for the guest's operations, but also for the hypervisor's operations (between VM-exit and VM-entry). As a result, performance measurement of an IPI handler inside the guest (e.g. via perf) accounts not only for IPI-handler processing in the guest, but also for IPI-handler processing on the host (on behalf of the guest IPI) on the target CPU. This effect may appear not only during IPI processing, but also for other guest operations which require a VM-exit to the hypervisor.

Interrupts & network performance

Network workloads with large packet payloads and/or intensive streaming (e.g. video streaming), where the Linux TCP/IP stack can efficiently coalesce packets, aren't sensitive to the interrupt issue described above. One such example is proxying of keep-alive connections. Another example is an iperf3 workload, e.g. (no vAPIC):

# iperf3 -c 172.16.0.200 -p 5000
Connecting to host 172.16.0.200, port 5000
[  4] local 172.16.0.101 port 46758 connected to 172.16.0.200 port 5000
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.06 GBytes  9.14 Gbits/sec   20    684 KBytes       
[  4]   1.00-2.00   sec  1.09 GBytes  9.32 Gbits/sec    0    686 KBytes       
[  4]   2.00-3.00   sec  1.09 GBytes  9.35 Gbits/sec    0    686 KBytes       
[  4]   3.00-4.00   sec  1.09 GBytes  9.33 Gbits/sec    0    687 KBytes       
[  4]   4.00-5.00   sec  1.09 GBytes  9.36 Gbits/sec    0    689 KBytes       
[  4]   5.00-6.00   sec  1.09 GBytes  9.37 Gbits/sec    0    689 KBytes       
[  4]   6.00-7.00   sec  1.09 GBytes  9.36 Gbits/sec    0    724 KBytes       
[  4]   7.00-8.00   sec  1.09 GBytes  9.36 Gbits/sec    0    752 KBytes       
[  4]   8.00-9.00   sec  1.09 GBytes  9.34 Gbits/sec    0    764 KBytes       
[  4]   9.00-10.00  sec  1.09 GBytes  9.34 Gbits/sec    0    773 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  10.9 GBytes  9.33 Gbits/sec   20             sender
[  4]   0.00-10.00  sec  10.9 GBytes  9.33 Gbits/sec                  receiver

In this case a 2-vCPU VM can deliver 10 Gbps throughput on a macvtap interface. perf kvm stat shows the VM performance statistics (run on the host):

# perf kvm stat record
...
# perf kvm stat report

Analyze events for all VMs, all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time 

                 HLT     241302    48.07%    99.81%      0.34us 1503998.97us    683.62us ( +-   4.66% )
       EPT_MISCONFIG     238604    47.53%     0.18%      0.62us    199.60us      1.24us ( +-   0.30% )
           MSR_WRITE      10439     2.08%     0.01%      0.28us     18.22us      0.86us ( +-   1.40% )
  EXTERNAL_INTERRUPT       5216     1.04%     0.00%      0.24us     22.30us      0.48us ( +-   3.39% )
    PREEMPTION_TIMER       3931     0.78%     0.00%      0.54us     20.07us      0.98us ( +-   1.33% )
      IO_INSTRUCTION       1456     0.29%     0.01%      1.77us     51.86us      6.65us ( +-   2.30% )
               CPUID        548     0.11%     0.00%      0.28us     13.96us      0.56us ( +-   6.34% )
   PENDING_INTERRUPT        288     0.06%     0.00%      0.39us     13.81us      0.63us ( +-   7.44% )
   PAUSE_INSTRUCTION        108     0.02%     0.00%      0.28us     13.76us      0.69us ( +-  17.80% )
            MSR_READ         94     0.02%     0.00%      0.34us      1.30us      0.62us ( +-   3.12% )
       EXCEPTION_NMI          1     0.00%     0.00%      0.45us      0.45us      0.45us ( +-   0.00% )

Total Samples:501987, Total events handled time:165279040.75us.

However, massive TCP connection establishment and closing is characterized by many small TCP segments, which cannot be efficiently coalesced. The perf kvm stat output for this case looks like:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

  EXTERNAL_INTERRUPT    5073570    75.37%     2.02%      0.22us   3066.52us      0.96us ( +-   0.49% )
       EPT_MISCONFIG    1029496    15.29%     1.78%      0.34us   1795.92us      4.19us ( +-   0.35% )
           MSR_WRITE     279208     4.15%     0.16%      0.28us   5695.07us      1.36us ( +-   2.52% )
                 HLT     194422     2.89%    95.74%      0.30us 1504068.39us   1192.90us ( +-   4.03% )
   PENDING_INTERRUPT      89818     1.33%     0.03%      0.32us    189.53us      0.70us ( +-   0.83% )
   PAUSE_INSTRUCTION      40905     0.61%     0.26%      0.26us   1390.91us     15.39us ( +-   1.82% )
    PREEMPTION_TIMER      17384     0.26%     0.01%      0.44us    183.21us      1.49us ( +-   1.47% )
      IO_INSTRUCTION       5482     0.08%     0.01%      1.75us    186.08us      3.26us ( +-   1.19% )
               CPUID        972     0.01%     0.00%      0.30us      5.29us      0.66us ( +-   1.94% )
            MSR_READ        104     0.00%     0.00%      0.49us      2.54us      0.94us ( +-   3.29% )
       EXCEPTION_NMI          6     0.00%     0.00%      0.37us      0.78us      0.59us ( +-   9.68% )

In this case interrupts take a dramatic 75% versus only 1% for the iperf case.

Many small packets are a well-known problem for modern virtualization solutions. Besides the SR-IOV and vAPIC hardware accelerations, there are software acceleration techniques like batching or XDP redirection. See the Netdev talks for more information.

To efficiently process small-packet workloads inside a VM you need to:

  • use an SR-IOV network adapter and configure a direct network interface for the VM;
  • use interrupt coalescing on the network adapter (see ethtool -C and the example after this list);
  • use a CPU with vAPIC support;
  • enable the IOMMU (the intel_iommu=on kernel boot parameter);
  • enable vAPIC for the KVM kernel module;
  • tune the sysctl parameters.
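
An example of the interrupt coalescing settings mentioned in the list above (the interface name and value are illustrative; supported parameters depend on the adapter, see ethtool -c for the current settings):

# ethtool -c ens2
# ethtool -C ens2 rx-usecs 64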

An example of sysctl settings to reduce the number of network interrupts (see linux/Documentation/sysctl/net.txt for the parameter descriptions):

    sysctl -w net.core.dev_weight=1024
    sysctl -w net.core.busy_read=1000
    sysctl -w net.core.busy_poll=1000
    sysctl -w net.core.netdev_budget=10000
    sysctl -w net.core.netdev_budget_usecs=100000
    sysctl -w net.core.netdev_max_backlog=32768
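
To make these settings persistent across reboots, they can be placed into a file under /etc/sysctl.d/ (a standard sysctl mechanism; the file name below is just an example):

    # cat /etc/sysctl.d/90-tempesta-net.conf
    net.core.dev_weight = 1024
    ...
    net.core.netdev_max_backlog = 32768
    # sysctl --system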