Use LuaJIT-fu to optimize virtio-net device #1001

Merged: 2 commits into snabbco:next on Aug 28, 2016
Conversation

@lukego (Member) commented Aug 27, 2016

This branch improves virtio-net performance substantially by fixing bad cases in the performance test matrix.

Test results

  • iperf and l2fwd benchmarks show more consistently high scores (dramatically so for l2fwd).
  • The l2fwd-by-configuration benchmark shows that the benefit comes mostly from fixing performance problems that affect specific virtio-net configuration options. This includes a big improvement for the base options negotiated by recent DPDK releases.

LuaJIT-fu

The vhost_user app now JITs separate machine code for each connection. The transmit and receive paths are also JITed separately.

This improves performance and consistency for many workloads, especially for dealing with different combinations of virtio-net options and with virtual machines that switch device drivers (e.g. from Linux to DPDK or Snabb).

The root problem was that our vring processing code is written in a style that performs badly with LuaJIT. The code is implemented in one central loop, VirtioVirtq:get_buffers(), which has many different behaviors depending on its arguments (callback functions, negotiated virtio features, etc). LuaJIT compiles all of these different options into a growing network of "side traces" and this degrades performance.

To make the code perform efficiently we need to make its control flow more consistent from the point of view of the JIT.

One solution to this problem would be to write specialized versions of the vring processing code for all different combinations of situations:

  • Separately for transmit and receive;
  • Separately for mergeable RX buffers and without;
  • Separately for indirect descriptors and without;
  • etc...

However, doing this by hand would be a lot of work: we would have to rewrite the vring processing code, and the new version would be much longer.

Instead we make the JIT do this work for us automatically. We just have one copy of the source code but we load a fresh copy of the object code for each vring. This means that each vring is JITed separately in a way that suits its specific behavior. So the machine code for a vring that supports mergeable RX buffers will automatically optimize for that case, and so on.

You can think of it as creating many different vring processing loops that each inline a different set of subroutines.
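As a rough sketch of the mechanism (the file name and module contents below are illustrative, not Snabb's exact layout), loading a private copy of the object code per vring could look like this:

local function load_fresh_virtq_module ()
   -- Bypass require()/package.loaded: loadfile() compiles a brand new
   -- chunk, so each call returns fresh function prototypes, and LuaJIT
   -- will record traces for each copy independently.
   local chunk = assert(loadfile("lib/virtio/virtq_device.lua"))
   return chunk()
end

-- One module instance per vring: the machine code for each instance
-- specializes to the options (mergeable RX buffers, indirect
-- descriptors, ...) that this particular vring actually uses.
local rx_virtq = load_fresh_virtq_module()
local tx_virtq = load_fresh_virtq_module()

The source stays in one place; only the loaded object code is duplicated, once per vring.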

If you want to know more about doing this kind of optimization with LuaJIT then there is some background information here:

LuaJIT/LuaJIT#208 (comment)

Note that the callback-driven nature of the vring processing code is not a problem directly. Callback indirection compiles very efficiently when the same callback function is used every time. However, performance degrades when the JIT is sharing machine code between multiple calls that provide different callback functions. (The same thing happens if you pass different parameter values that cause 'if' statements to switch from 'then' to 'else'.)
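For example, here is a minimal sketch (toy names and data, not the actual Snabb code) of the difference between a biased and an unbiased call site:

local function process (ring, callback)
   for i = 1, #ring do callback(ring[i]) end
end

local sum = 0
local function handle_rx (x) sum = sum + x end
local function handle_tx (x) sum = sum - x end
local ring = {1, 2, 3, 4}

-- Biased call site: 'callback' is always handle_rx, so the JIT can
-- specialize the loop trace and inline the callback.
for i = 1, 1e5 do process(ring, handle_rx) end

-- Unbiased call site: the same compiled loop alternates between two
-- callbacks, so the guard on the callback value keeps failing and the
-- JIT grows a network of side traces instead of one tight loop.
for i = 1, 1e5 do
   process(ring, i % 2 == 0 and handle_rx or handle_tx)
end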

Quoth the LuaJIT masters all too innocently:

Avoid unbiased branches.

lukego added 2 commits August 26, 2016 15:48
Checksum was being calculated in a case where it is not necessary:
specifically, when checksum offload is disabled and so is MRG_RXBUF.
The vhost_user app now JITs separate machine code for each connection.
The transmit and receive paths are also JITed separately.

This improves performance and consistency for many workloads,
especially for dealing with different combinations of virtio-net
options and with virtual machines that switch device drivers (e.g.
from Linux to DPDK or Snabb).

The root problem was that our vring processing code is written in a
style that performs badly with LuaJIT. The code is implemented in one
central loop, VirtioVirtq:get_buffers(), which has many different
behaviors depending on its arguments (callback functions, negotiated
virtio features, etc). LuaJIT compiles all of these different options
into a growing network of "side traces" and this degrades performance.

To make the code perform efficiently we need to make its control flow
more consistent from the point of view of the JIT.

One solution to this problem would be to write specialized versions
of the vring processing code for all different combinations of situations:

- Separately for transmit and receive;
- Separately for mergeable RX buffers and without;
- Separately for indirect descriptors and without;
- etc...

However doing this by hand would be a lot of work. We would have to
rewrite the vring processing code and the new version would be much
more code.

Instead we make the JIT do this work for us automatically. We just
have one copy of the source code but we load a fresh copy of the
object code for each vring. This means that each vring is JITed
separately in a way that suits its specific behavior. So the machine
code for a vring that supports mergeable RX buffers will automatically
optimize for that case, and so on.

You can think of it as creating many different vring processing loops
that each inline a different set of subroutines.

If you want to know more about doing this kind of optimization with
LuaJIT then there is some background information here:

  LuaJIT/LuaJIT#208 (comment)

Note that the callback-driven nature of the vring processing code is
not a problem directly. Callback indirection compiles very efficiently
when the same callback function is used every time. However,
performance degrades when the JIT is sharing machine code between
multiple calls that provide different callback functions. (The same
thing happens if you pass different parameter values that cause 'if'
statements to switch from 'then' to 'else'.)

Quoth the LuaJIT masters all too innocently:

> Avoid unbiased branches.
@kbara (Contributor) commented Aug 27, 2016

Excellent work! A couple of minor notes:
a) Why is there a large blue spike in iperf filter, just under halfway along the x axis? It's quite a lot worse than the baseline data (although the blue density graph as a whole still looks better, barely).
b) The reddish color of the first table with 100% successes is really distracting, especially since degree of redness is used for failures in the next similar table. Pretty much anything except a reddish color would be better!

@lukego (Member, Author) commented Aug 27, 2016

Good eye @kbara!

I wonder how to confirm whether this is a real effect vs. a random artifact due to excessive slicing-and-dicing of the dataset, i.e. a Type I error (xkcd 882).

The first idea that comes to mind is to do Tukey's Test on the whole iperf dataset and look at the confidence interval on the difference in mean value between filter benchmark on next vs virtio-opt branches.

Here it is with the default setting of 95% confidence interval:

library(readr)
library(dplyr)
d <- read_csv("https://hydra.snabb.co/build/438947/download/1/bench.csv")
iperf <- filter(d, benchmark=='iperf')
TukeyHSD(aov(score ~ snabb*config, iperf))

The relevant line from the full output:

vring-opt:filter-next:filter             -0.65285714  -1.1857199 -0.11999441 0.0042371

My shaky interpretation is that on the one hand this suggests there is a difference in the mean performance of the two branches (p adj = 0.004), but on the other hand the lower bound on the size of the effect is quite close to zero (the interval end nearest zero is -0.120). So the best we can say with 95% confidence is that vring-opt:filter is at least 0.120 Gbps slower than next:filter.

We could also repeat this test with 99% confidence interval to be more conservative:

TukeyHSD(aov(score ~ snabb*config, iperf), conf.level=0.99)
vring-opt:filter-next:filter             -0.65285714  -1.2672972 -0.038417072 0.0042371

which puts the lower bound on the effect at -0.038 Gbps.

How to interpret this? I am not really sure. My instinct is to collect more data as the first step. I already have a larger (more iterations) benchmark running (evaluation 3325) and that dataset should be ready tomorrow.

Interesting stuff! Still a risk that I am completely misapplying and misunderstanding all of these statistical tools but I feel like this is a promising approach.

@lukego (Member, Author) commented Aug 27, 2016

Thinking about this some more... there is a very important statement at the beginning of R for Data Science:

You can only use an observation once to confirm a hypothesis. As soon as you use it more than once you’re back to doing exploratory analysis. This means to do hypothesis confirmation you need to "preregister" (write out in advance) your analysis plan, and not deviate from it even when you have seen the data.

We have used our initial dataset to generate a hypothesis: that the filter configuration of the virtio-opt branch has a "hump" at around 12 Gbps. Now, to test this, we need to use new data.

So my idea of applying Tukey's Test to the original dataset was bad for at least two reasons. First, we have already "used up" this dataset on hypothesis generation. Second, Tukey's Test compares the mean scores of the benchmarks, which does not really match our hypothesis anyway.

So let us now take the new data from evaluation 3325 and see if the same pattern is there. We have to be careful here because this CSV file is actually a superset of the previous one i.e. it includes the original data points. The first dataset contains 5 results for each scenario and the second dataset contains 30. (Hydra is clever enough to see that the tests with id 1-5 are identical in both runs and so it shares the results.) So in order to meet our goal of only looking at new data we will need to only consider CSV rows with id>5.

So without further ado let us take a look at the new data (2100 results) for the iperf filter benchmark:

library(readr)
library(dplyr)
library(ggplot2)
d <- read_csv("https://hydra.snabb.co/build/433897/download/1/bench.csv")
new.iperf.filter <- filter(d, benchmark=='iperf' & config=='filter' & id>5)
ggplot(aes(x=score, y=..count.., color=snabb), data=new.iperf.filter) + geom_density()

[Plot: density of iperf filter scores by Snabb branch, new data only]

This picture seems like grounds to reject the hypothesis that the virtio-opt branch creates a spike in values around 12.5 Gbps. It looks more like scores move up from this range towards 17.5 Gbps.

Just for a sanity-check we can also use Tukey's Test to confirm that the average score has increased with virtio-opt:

> TukeyHSD(aov(score ~ snabb, new.iperf.filter), conf.level=0.99)
  Tukey multiple comparisons of means
    99% family-wise confidence level

Fit: aov(formula = score ~ snabb, data = new.iperf.filter)

$snabb
                    diff       lwr       upr p adj
vring-opt-next 0.6805714 0.4134409 0.9477019     0

This tells us that we can be 99% confident that the virtio-opt branch is improving average performance by at least 0.413 Gbps.

So: I reckon that the spike we saw in the first dataset was due to xkcd 882 i.e. we sliced our data into many different pieces and one of them showed a pattern by random chance.

@lukego (Member, Author) commented Aug 27, 2016

The reddish color of the first table with 100% successes is really distracting

This should be fixed now with snabblab/snabblab-nixos@25c1802.

@lukego merged commit 7f1d205 into snabbco:next on Aug 28, 2016
@lukego (Member, Author) commented Aug 28, 2016

The next tributary report presents another interesting dataset for this code. This run is on lugano hardware (real 82599 NIC) instead of murren hardware (generic Hetzner servers with a faked NIC). Because we have fewer lugano servers, it also tested fewer configurations.

The main things that jump out at me are:

  • l2fwd benchmark scores lower on lugano hardware - but much more consistently.
  • iperf scores are spread out and capped at ~14 Gbps (likely PCIe bottleneck with hardware NIC).

We also need to work out which tests we want to run on which hardware platform(s).

@lukego (Member, Author) commented Sep 7, 2016

@kbara Just following up on the "xkcd 882 problem." On the one hand, it seems like there are straightforward ways to account for this in statistics; on the other hand, I don't immediately know how to apply them to visualizations.

The Bonferroni correction seems cool. The idea is that if you make N comparisons then you are up to N times more likely to see an effect due to random chance, so to keep the same overall confidence you need to demand N times more significance from each test. You can do this either by dividing your significance threshold by N or, equivalently, by multiplying each P-value by N before comparing it to the original threshold.

So if we would make a single comparison of two Snabb versions then we may use P=0.05 to check for a significant difference with 95% confidence. If we split the data into 10 groups and compared them separately then the Bonferroni correction says we would need to use P=0.005 as the threshold for 95% confidence instead.
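As a toy illustration of that arithmetic (the p-values below are made up, not from our data), here is the adjustment written out in Lua:

-- Bonferroni correction with made-up p-values, one per data slice.
local alpha    = 0.05                  -- family-wise significance level
local p_values = {0.004, 0.020, 0.300} -- unadjusted p-values, N = 3 slices
local n        = #p_values

for i, p in ipairs(p_values) do
   -- Equivalent checks: p * n <= alpha, or p <= alpha / n.
   local significant = (p * n) <= alpha
   print(("slice %d: p=%.3f adjusted=%.3f %s"):format(
      i, p, p * n, significant and "significant" or "not significant"))
end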

I am sure that R routines for ANOVA, etc., apply such corrections automatically. However, the visualizations have no such corrections and are potentially misleading. I have to think about a good way to communicate the statistical significance of the test results, e.g. visual confidence intervals or numeric statistics.
