Hardware virtualization performance
Read the Tempesta FW benchmarks in a virtualized environment.
The most basic setup is virtio-net networking for a VM running Tempesta FW. virtio-net is a paravirtualization solution, which is faster than the default emulation of the e1000 network adapter. You can check that virtio-net was set up correctly with ethtool -i, e.g.
# ethtool -i ens2 | grep driver
driver: virtio_net
The example above uses virtio-net; check the official QEMU documentation on how to set it up properly. You can also use this example to configure a VM with libvirt:
<interface type='network'>
<mac address='52:54:00:ea:4b:97'/>
<source network='default'/>
<model type='virtio'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</interface>
Tempesta FW works in soft interrupt Linux kernel threads (aka softirq). The same threads (softirq) are also used by the Linux kernel firewall, netfilter. Each system CPU has its own softirq context, and most of the network processing logic, such as IP and TCP protocol processing or network filtering, is done in the 'local' CPU context, i.e. the Linux kernel does its best not to access data structures of one CPU from another.
The best network performance can be achieved if a network adapter, even a virtual one, uses separate queues to deliver network packets to the TCP/IP stack. The queues are processed by the softirq kernel threads, so it makes sense to configure a network adapter to use a number of queues equal to the number of available CPUs.
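For example, you can compare the number of queues the interface supports and currently uses with the number of CPUs (the interface name ens2 below is only an illustration; substitute your own):
# ethtool -l ens2
# nproc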
You can configure a virtio-net virtual network interface to use separate queues on the QEMU command line using the vectors, queues, and mq=on options, as in the sketch below.
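A minimal sketch of such a fragment for the VM's QEMU command line, assuming a tap backend with vhost and 4 queues (virtio-net expects vectors = 2 * queues + 2, i.e. 10 for 4 queues; the id and numbers are placeholders for your setup):
-netdev tap,id=net0,vhost=on,queues=4 \
-device virtio-net-pci,netdev=net0,mq=on,vectors=10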
Alternatively, you can use the following configuration for libvirt:
<interface type='network'>
<source network='default'/>
<model type='virtio'/>
<driver name='vhost' queues='N'/>
</interface>
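Note that a multiqueue virtio-net guest typically starts with a single queue pair in use; you can enable all N queues inside the guest with ethtool -L, e.g. (assuming the guest interface is ens2 and N is 4):
# ethtool -L ens2 combined 4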
The most recommended option is to use SR-IOV to directly map a host NIC to a VM; there are instructions for Intel adapters. In this case it's possible to coalesce interrupts to avoid host/guest transitions (see below).
If your adapter doesn't support SR-IOV, you can use a macvtap interface. An example of a private connection with 4 queues:
<interface type='direct'>
<mac address='52:54:00:07:88:51'/>
<source dev='eth2' mode='private'/>
<model type='virtio'/>
<driver name='vhost' queues='4'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</interface>
In the QEMU-KVM hypervisor each VM is a separate QEMU process, which can migrate from one CPU to another. The migrations can hurt the VM performance through frequent CPU cache invalidations and/or slower memory transfers on multi-processor (NUMA) systems. Each vCPU (virtual CPU) is a thread of the corresponding QEMU process. These threads are distributed among the host CPUs by the host Linux scheduler like any other thread/process in the OS. In the default configuration, the vCPUs of a virtual machine are not pinned to any CPU and can run on any of them (regardless of the vCPU count specified in the -smp QEMU option). E.g. if we define two vCPUs for a guest VM on a host with four CPUs, either of these two vCPUs can be executed on any of the host CPUs. Moreover, due to this free distribution across all host CPUs, several vCPUs often end up being processed on the same CPU.
It's recommended to pin VM vCPUs to particular host CPUs. You can do this with the following libvirt configuration:
<vcpu placement='static' cpuset='4-7'>4</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='4'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='6'/>
<vcpupin vcpu='3' cpuset='7'/>
</cputune>
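The pinning can also be checked or applied at runtime with virsh; a sketch, where tfw-vm is a placeholder domain name:
# virsh vcpupin tfw-vm
# virsh vcpupin tfw-vm 0 4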
KVM virtualization introduces significant overhead, which can be dramatic without appropriate hardware acceleration. In particular, the performance issues are noticeable during interrupt processing (including inter-processor interrupts, IPIs), writes to model specific registers (MSRs) and other events occurring during guest code execution which cause transitions between guest and hypervisor modes.
Hardware support for virtualization in Intel processors is provided by VMX operations. There are VMX root operation (which the hypervisor runs in) and VMX non-root operation (for guest software). A VM-entry is the transition VMX root => VMX non-root; a VM-exit is the transition VMX non-root => VMX root.
In VMX non-root mode certain instructions and events cause VM-exits to the hypervisor. VMX non-root operation and VMX transitions are controlled by a virtual-machine control structure (VMCS). A hypervisor may use a different VMCS for each virtual machine (VM) that it supports. For a VM with multiple processors (vCPUs), the QEMU-KVM hypervisor uses a different VMCS for each vCPU.
For example, consider IPI processing in a VM guest: on Intel processors with x2APIC support, IPI generation is simply a write to an MSR address, which in the QEMU-KVM hypervisor causes a VM-exit on the IPI source CPU. IPI receiving also causes a VM-exit, at least on Intel processors without posted-interrupt (so called APICv or vAPIC) support.
Processing of interrupts in non-root mode (on Intel processors with VMX support) can be accelerated in KVM if the processor has virtualized APIC support (APICv, including posted interrupts). In this case no VM-exit occurs on the target CPU which receives an interrupt in non-root mode, and the processor passes the interrupt to the guest on its own, without hypervisor intervention. APICv support can be checked in the special Capability Reporting VMX registers (MSRs); the following bits must be set for the Intel 64 architecture:
- Activate secondary controls (bit 63 of IA32_VMX_PROCBASED_CTLS)
- Virtualize APIC accesses (bit 32 of IA32_VMX_PROCBASED_CTLS2)
- APIC-register virtualization (bit 40 of IA32_VMX_PROCBASED_CTLS2)
- Virtual-interrupt delivery (bit 41 of IA32_VMX_PROCBASED_CTLS2)
- Process posted interrupts (bit 39 of IA32_VMX_TRUE_PINBASED_CTLS if bit 55 of IA32_VMX_BASIC is set; otherwise bit 39 of IA32_VMX_PINBASED_CTLS)
Besides, the kernel must be built with CONFIG_X86_LOCAL_APIC enabled, and the kvm_intel kernel module must be loaded with the parameter enable_apicv=Y (default: N).
You can check whether KVM is loaded with APICv support with:
# cat /sys/module/kvm_intel/parameters/enable_apicv
Y
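If the module was loaded without APICv, you can reload it with the parameter enabled and make the setting persistent; a sketch, assuming all guests are shut down first (the configuration file name is arbitrary):
# modprobe -r kvm_intel
# modprobe kvm_intel enable_apicv=Y
# echo 'options kvm_intel enable_apicv=Y' > /etc/modprobe.d/kvm-apicv.conf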
The values of the specified bits can be read with the rdmsr utility (the msr-tools package):
rdmsr [options] <register_address>
The registers specified above have the following addresses:
- IA32_VMX_BASIC: 0x480
- IA32_VMX_PROCBASED_CTLS: 0x482
- IA32_VMX_PROCBASED_CTLS2: 0x48B
- IA32_VMX_PINBASED_CTLS: 0x481
- IA32_VMX_TRUE_PINBASED_CTLS: 0x48D
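For example, to read the 'Process posted interrupts' bit (bit 39) of IA32_VMX_TRUE_PINBASED_CTLS directly, assuming bit 55 of IA32_VMX_BASIC is set so the TRUE variant applies (the msr kernel module must be loaded; the command prints 1 if the bit is set):
# modprobe msr
# rdmsr -f 39:39 0x48D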
To automate the cumbersome process of checking these bits, tempesta/scripts/check_conf.pl can be used:
$ sudo tempesta/scripts/check_conf.pl
....
'Activate secondary controls' bit: found
'Virtualize APIC accesses' bit: found
'APIC-register virtualization' bit: NOT found
'Virtual-interrupt delivery' bit: NOT found
'Process posted interrupts' bit: NOT found
Detailed information about the Capability Reporting VMX registers and their connection with the VM-execution controls can be found in the Intel® 64 and IA-32 Architectures Software Developer’s Manual: Volume 3 (Sections 35.1, 24.6.1, 24.6.2, Appendix A.3).
KVM uses a domain model for performance monitoring counter (PMC) virtualization, which implies saving and restoring the relevant PMC registers only when execution switches between different guests, and not saving/restoring PMC registers during VM-exit/VM-entry in the context of the current guest. So, when an IPI is received on the target CPU, we collect PMC values not only for the guest's operations, but also for hypervisor operations (between VM-exit and VM-entry). As a result, a performance measurement for an IPI handler inside the guest (e.g. via perf) will account not only for the IPI handler processing in the guest but also for the IPI handler processing on the host (on behalf of the guest IPI) on the target CPU. This effect may appear not only during IPI processing, but also in other guest operations which require a VM-exit into the hypervisor.
Network workloads using large packet payloads and/or intensive streaming, e.g. video streaming, when the Linux TCP/IP stack can efficiently coalesce packets, aren't sensitive to the interrupt issue described above. One such example is proxying of keep-alive connections. Another example is an iperf3 workload, e.g. (no vAPIC):
# iperf3 -c 172.16.0.200 -p 5000
Connecting to host 172.16.0.200, port 5000
[ 4] local 172.16.0.101 port 46758 connected to 172.16.0.200 port 5000
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 1.06 GBytes 9.14 Gbits/sec 20 684 KBytes
[ 4] 1.00-2.00 sec 1.09 GBytes 9.32 Gbits/sec 0 686 KBytes
[ 4] 2.00-3.00 sec 1.09 GBytes 9.35 Gbits/sec 0 686 KBytes
[ 4] 3.00-4.00 sec 1.09 GBytes 9.33 Gbits/sec 0 687 KBytes
[ 4] 4.00-5.00 sec 1.09 GBytes 9.36 Gbits/sec 0 689 KBytes
[ 4] 5.00-6.00 sec 1.09 GBytes 9.37 Gbits/sec 0 689 KBytes
[ 4] 6.00-7.00 sec 1.09 GBytes 9.36 Gbits/sec 0 724 KBytes
[ 4] 7.00-8.00 sec 1.09 GBytes 9.36 Gbits/sec 0 752 KBytes
[ 4] 8.00-9.00 sec 1.09 GBytes 9.34 Gbits/sec 0 764 KBytes
[ 4] 9.00-10.00 sec 1.09 GBytes 9.34 Gbits/sec 0 773 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 10.9 GBytes 9.33 Gbits/sec 20 sender
[ 4] 0.00-10.00 sec 10.9 GBytes 9.33 Gbits/sec receiver
In this case a 2-vCPU VM can deliver 10Gbps throughput on a macvtap interface.
perf kvm stat shows the VM performance statistics (run on the host):
# perf kvm stat record
...
# perf kvm stat report
Analyze events for all VMs, all VCPUs:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
HLT 241302 48.07% 99.81% 0.34us 1503998.97us 683.62us ( +- 4.66% )
EPT_MISCONFIG 238604 47.53% 0.18% 0.62us 199.60us 1.24us ( +- 0.30% )
MSR_WRITE 10439 2.08% 0.01% 0.28us 18.22us 0.86us ( +- 1.40% )
EXTERNAL_INTERRUPT 5216 1.04% 0.00% 0.24us 22.30us 0.48us ( +- 3.39% )
PREEMPTION_TIMER 3931 0.78% 0.00% 0.54us 20.07us 0.98us ( +- 1.33% )
IO_INSTRUCTION 1456 0.29% 0.01% 1.77us 51.86us 6.65us ( +- 2.30% )
CPUID 548 0.11% 0.00% 0.28us 13.96us 0.56us ( +- 6.34% )
PENDING_INTERRUPT 288 0.06% 0.00% 0.39us 13.81us 0.63us ( +- 7.44% )
PAUSE_INSTRUCTION 108 0.02% 0.00% 0.28us 13.76us 0.69us ( +- 17.80% )
MSR_READ 94 0.02% 0.00% 0.34us 1.30us 0.62us ( +- 3.12% )
EXCEPTION_NMI 1 0.00% 0.00% 0.45us 0.45us 0.45us ( +- 0.00% )
Total Samples:501987, Total events handled time:165279040.75us.
However, massive TCP connection establishment and closing is characterized by many small TCP segments, which cannot be efficiently coalesced. The perf kvm stat output for this case looks like:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 5073570 75.37% 2.02% 0.22us 3066.52us 0.96us ( +- 0.49% )
EPT_MISCONFIG 1029496 15.29% 1.78% 0.34us 1795.92us 4.19us ( +- 0.35% )
MSR_WRITE 279208 4.15% 0.16% 0.28us 5695.07us 1.36us ( +- 2.52% )
HLT 194422 2.89% 95.74% 0.30us 1504068.39us 1192.90us ( +- 4.03% )
PENDING_INTERRUPT 89818 1.33% 0.03% 0.32us 189.53us 0.70us ( +- 0.83% )
PAUSE_INSTRUCTION 40905 0.61% 0.26% 0.26us 1390.91us 15.39us ( +- 1.82% )
PREEMPTION_TIMER 17384 0.26% 0.01% 0.44us 183.21us 1.49us ( +- 1.47% )
IO_INSTRUCTION 5482 0.08% 0.01% 1.75us 186.08us 3.26us ( +- 1.19% )
CPUID 972 0.01% 0.00% 0.30us 5.29us 0.66us ( +- 1.94% )
MSR_READ 104 0.00% 0.00% 0.49us 2.54us 0.94us ( +- 3.29% )
EXCEPTION_NMI 6 0.00% 0.00% 0.37us 0.78us 0.59us ( +- 9.68% )
In this case interrupts take a dramatic 75% of VM-exits versus only 1% in the iperf case.
Many small packets are a well-known problem for modern virtualization solutions. Besides SR-IOV and vAPIC hardware acceleration, there are software acceleration techniques like batching or XDP redirection. See the Netdev talks for more information:
- Story of Network Virtualization and its future in Software and Hardware
- Performance improvements of Virtual Machine Networking
- XDP offload with virtio-net
To efficiently process small-packet workloads inside a VM you need to:
- use an SR-IOV network adapter and configure the direct network interface for the VM;
- use coalescing on the network adapter (see ethtool -C and the example after this list);
- use a CPU with vAPIC support;
- enable the IOMMU (the intel_iommu=on kernel boot parameter);
- enable vAPIC for the KVM kernel module;
- tune sysctl parameters.
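A hedged example of coalescing settings with ethtool -C; the interface name and values are illustrative only, and the supported parameters depend on the NIC and its driver:
# ethtool -C ens2 adaptive-rx off rx-usecs 100 rx-frames 64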
An example of sysctl settings to reduce the number of network interrupts (see linux/Documentation/sysctl/net.txt for the parameter descriptions):
sysctl -w net.core.dev_weight=1024
sysctl -w net.core.busy_read=1000
sysctl -w net.core.busy_poll=1000
sysctl -w net.core.netdev_budget=10000
sysctl -w net.core.netdev_budget_usecs=100000
sysctl -w net.core.netdev_max_backlog=32768
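To make such settings persistent across reboots, you can put them into a file under /etc/sysctl.d/ and reload them with sysctl --system; the file name below is just an example:
# echo 'net.core.netdev_max_backlog = 32768' >> /etc/sysctl.d/90-net-tuning.conf
# sysctl --system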