Use LuaJIT-fu to optimize virtio-net device #1001

Merged: 2 commits into snabbco:next on Aug 28, 2016
Conversation

@lukego (Member) commented Aug 27, 2016

This branch improves virtio-net performance substantially by fixing bad cases in the performance test matrix.

Test results

  • iperf and l2fwd benchmarks show more consistently high scores (dramatically so for l2fwd).
  • The l2fwd-by-configuration benchmark shows that the benefit comes mostly from fixing performance problems that affect specific virtio-net configuration options. This includes a big improvement for the base options negotiated by recent DPDK releases.

LuaJIT-fu

The vhost_user app now JITs separate machine code for each connection. The transmit and receive paths are also JITed separately.

This improves performance and consistency for many workloads, especially for dealing with different combinations of virtio-net options and with virtual machines that switch device drivers (e.g. from Linux to DPDK or Snabb).

The root problem was that our vring processing code is written in a style that performs badly with LuaJIT. The code is implemented in one central loop, VirtioVirtq:get_buffers(), which has many different behaviors depending on its arguments (callback functions, negotiated virtio features, etc). LuaJIT compiles all of these different options into a growing network of "side traces" and this degrades performance.

To make the code perform efficiently we need to make its control flow more consistent from the point of view of the JIT.

One solution to this problem would be to write specialized versions of the vring processing code for all different combinations of situations:

  • Separately for transmit and receive;
  • Separately for mergeable RX buffers and without;
  • Separately for indirect descriptors and without;
  • etc...

However, doing this by hand would be a lot of work: we would have to rewrite the vring processing code, and the new version would be much longer.

Instead we make the JIT do this work for us automatically. We just have one copy of the source code but we load a fresh copy of the object code for each vring. This means that each vring is JITed separately in a way that suits its specific behavior. So the machine code for a vring that supports mergeable RX buffers will automatically optimize for that case, and so on.

You can think of it as creating many different vring processing loops that each inline a different set of subroutines.
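As a rough sketch of the mechanism (the file name and module contents below are illustrative, not Snabb's exact layout), loading a private copy of the object code per vring could look like this:

local function load_fresh_virtq_module ()
   -- Bypass require()/package.loaded: loadfile() compiles a brand new
   -- chunk, so each call returns fresh function prototypes, and LuaJIT
   -- will record traces for each copy independently.
   local chunk = assert(loadfile("lib/virtio/virtq_device.lua"))
   return chunk()
end

-- One module instance per vring: the machine code for each instance
-- specializes to the options (mergeable RX buffers, indirect
-- descriptors, ...) that this particular vring actually uses.
local rx_virtq = load_fresh_virtq_module()
local tx_virtq = load_fresh_virtq_module()

The source stays in one place; only the loaded object code is duplicated, once per vring.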

If you want to know more about doing this kind of optimization with LuaJIT then there is some background information here:

LuaJIT/LuaJIT#208 (comment)

Note that the callback-driven nature of the vring processing code is not a problem directly. Callback indirection compiles very efficiently when the same callback function is used every time. However, performance degrades when the JIT is sharing machine code between multiple calls that provide different callback functions. (The same thing happens if you pass different parameter values that cause 'if' statements to switch from 'then' to 'else'.)
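For example, here is a minimal sketch (toy names and data, not the actual Snabb code) of the difference between a biased and an unbiased call site:

local function process (ring, callback)
   for i = 1, #ring do callback(ring[i]) end
end

local sum = 0
local function handle_rx (x) sum = sum + x end
local function handle_tx (x) sum = sum - x end
local ring = {1, 2, 3, 4}

-- Biased call site: 'callback' is always handle_rx, so the JIT can
-- specialize the loop trace and inline the callback.
for i = 1, 1e5 do process(ring, handle_rx) end

-- Unbiased call site: the same compiled loop alternates between two
-- callbacks, so the guard on the callback value keeps failing and the
-- JIT grows a network of side traces instead of one tight loop.
for i = 1, 1e5 do
   process(ring, i % 2 == 0 and handle_rx or handle_tx)
end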

Quoth the LuaJIT masters all too innocently:

Avoid unbiased branches.

lukego added 2 commits August 26, 2016 15:48
Checksum was being calculated in a case where it is not necessary:
specifically, when checksum offload is disabled and so is MRG_RXBUF.
The vhost_user app now JITs separate machine code for each connection.
The transmit and receive paths are also JITed separately.

This improves performance and consistency for many workloads,
especially for dealing with different combinations of virtio-net
options and with virtual machines that switch device drivers (e.g.
from Linux to DPDK or Snabb).

The root problem was that our vring processing code is written in a
style that performs badly with LuaJIT. The code is implemented in one
central loop, VirtioVirtq:get_buffers(), which has many different
behaviors depending on its arguments (callback functions, negotiated
virtio features, etc). LuaJIT compiles all of these different options
into a growing network of "side traces" and this degrades performance.

To make the code perform efficiently we need to make its control flow
more consistent from the point of view of the JIT.

One solution to this problem would be to write specialized versions
of the vring processing code for all different combinations of situations:

- Separately for transmit and receive;
- Separately for mergeable RX buffers and without;
- Separately for indirect descriptors and without;
- etc...

However doing this by hand would be a lot of work. We would have to
rewrite the vring processing code and the new version would be much
more code.

Instead we make the JIT do this work for us automatically. We just
have one copy of the source code but we load a fresh copy of the
object code for each vring. This means that each vring is JITed
separately in a way that suits its specific behavior. So the machine
code for a vring that supports mergeable RX buffers will automatically
optimize for that case, and so on.

You can think of it as creating many different vring processing loops
that each inline a different set of subroutines.

If you want to know more about doing this kind of optimization with
LuaJIT then there is some background information here:

  LuaJIT/LuaJIT#208 (comment)

Note that the callback-driven nature of the vring processing code is
not a problem directly. Callback indirection compiles very efficiently
when the same callback function is used every time. However,
performance degrades when the JIT is sharing machine code between
multiple calls that provide different callback functions. (The same
thing happens if you pass different parameter values that cause 'if'
statements to switch from 'then' to 'else'.)

Quoth the LuaJIT masters all too innocently:

> Avoid unbiased branches.
@kbara (Contributor) commented Aug 27, 2016

Excellent work! A couple of minor notes:
a) Why is there a large blue spike in iperf filter, just under halfway along the x axis? It's quite a lot worse than the baseline data (although the blue density graph as a whole still looks better, barely).
b) The reddish color of the first table with 100% successes is really distracting, especially since degree of redness is used for failures in the next similar table. Pretty much anything except a reddish color would be better!

@lukego (Member, Author) commented Aug 27, 2016

Good eye @kbara!

I wonder how to confirm whether this is a real effect vs. a random artifact due to excessive slicing-and-dicing of the dataset, i.e. a Type I error (xkcd 882).

The first idea that comes to mind is to do Tukey's Test on the whole iperf dataset and look at the confidence interval on the difference in mean value between filter benchmark on next vs virtio-opt branches.

Here it is with the default setting of 95% confidence interval:

library(readr)
library(dplyr)
d <- read_csv("https://hydra.snabb.co/build/438947/download/1/bench.csv")
iperf <- filter(d, benchmark=='iperf')
TukeyHSD(aov(score ~ snabb*config, iperf))

The relevant line from the full output:

vring-opt:filter-next:filter             -0.65285714  -1.1857199 -0.11999441 0.0042371

My shaky interpretation is that on the one hand this suggests there is a difference in the mean performance of the two branches (p adj = 0.004), but on the other hand the lower bound on the size of the effect is quite close to zero (the interval end nearest zero is -0.120). So the best we can say with 95% confidence is that vring-opt:filter is at least 0.120 Gbps slower than next:filter.

We could also repeat this test with 99% confidence interval to be more conservative:

TukeyHSD(aov(score ~ snabb*config, iperf), conf.level=0.99)
vring-opt:filter-next:filter             -0.65285714  -1.2672972 -0.038417072 0.0042371

which puts the lower bound on the effect at -0.038 Gbps.

How to interpret this? I am not really sure. My instinct is to collect more data as the first step. I already have a larger (more iterations) benchmark running (evaluation 3325) and that dataset should be ready tomorrow.

Interesting stuff! Still a risk that I am completely misapplying and misunderstanding all of these statistical tools but I feel like this is a promising approach.

@lukego (Member, Author) commented Aug 27, 2016

Thinking about this some more... there is a very important statement at the beginning of R for Data Science:

You can only use an observation once to confirm a hypothesis. As soon as you use it more than once you’re back to doing exploratory analysis. This means to do hypothesis confirmation you need to "preregister" (write out in advance) your analysis plan, and not deviate from it even when you have seen the data.

We have used our initial dataset to generate a hypothesis: that the filter configuration of the virtio-opt branch has a "hump" at around 12 Gbps. Now, to test this, we need to use new data.

So my idea of applying Tukey's Test to the original dataset was bad for at least two reasons. First, we have already "used up" this dataset on hypothesis generation. Second, Tukey's Test compares the mean scores of the benchmarks, which does not really match our hypothesis anyway.

So let us now take the new data from evaluation 3325 and see if the same pattern is there. We have to be careful here because this CSV file is actually a superset of the previous one i.e. it includes the original data points. The first dataset contains 5 results for each scenario and the second dataset contains 30. (Hydra is clever enough to see that the tests with id 1-5 are identical in both runs and so it shares the results.) So in order to meet our goal of only looking at new data we will need to only consider CSV rows with id>5.

So without further ado let us take a look at the new data (2100 results) for the iperf filter benchmark:

library(readr)
library(dplyr)
library(ggplot2)
d <- read_csv("https://hydra.snabb.co/build/433897/download/1/bench.csv")
new.iperf.filter <- filter(d, benchmark=='iperf' & config=='filter' & id>5)
ggplot(aes(x=score, y=..count.., color=snabb), data=new.iperf.filter) + geom_density()

[Plot: density of iperf filter scores by Snabb branch, new data only]

This picture seems like grounds to reject the hypothesis that the virtio-opt branch creates a spike in values around 12.5 Gbps. It looks more like scores move up from this range towards 17.5 Gbps.

Just for a sanity-check we can also use Tukey's Test to confirm that the average score has increased with virtio-opt:

> TukeyHSD(aov(score ~ snabb, new.iperf.filter), conf.level=0.99)
  Tukey multiple comparisons of means
    99% family-wise confidence level

Fit: aov(formula = score ~ snabb, data = new.iperf.filter)

$snabb
                    diff       lwr       upr p adj
vring-opt-next 0.6805714 0.4134409 0.9477019     0

This tells us that we can be 99% confident that the virtio-opt branch is improving average performance by at least 0.413 Gbps.

So: I reckon that the spike we saw in the first dataset was due to xkcd 882 i.e. we sliced our data into many different pieces and one of them showed a pattern by random chance.

@lukego (Member, Author) commented Aug 27, 2016

The reddish color of the first table with 100% successes is really distracting

This should be fixed now with snabblab/snabblab-nixos@25c1802.

@lukego merged commit 7f1d205 into snabbco:next on Aug 28, 2016
@lukego (Member, Author) commented Aug 28, 2016

The next tributary report presents another interesting dataset for this code. This run is on lugano hardware (real 82599 NIC) instead of murren hardware (generic Hetzner servers with a faked NIC). Because we have fewer lugano servers, it also tested fewer configurations.

The main things that jump out at me are:

  • l2fwd benchmark scores lower on lugano hardware - but much more consistently.
  • iperf scores are spread out and capped at ~14 Gbps (likely PCIe bottleneck with hardware NIC).

We also need to work out which tests we want to run on which hardware platform(s).

@lukego (Member, Author) commented Sep 7, 2016

@kbara Just following up on the "xkcd 882 problem." On the one hand, it seems like there are straightforward ways to account for this in statistics; on the other hand, I don't immediately know how to apply them to visualizations.

The Bonferroni correction seems cool. The idea is that if you make N comparisons then you are up to N times more likely to see an effect due to random chance, so to keep the same overall confidence you need to demand N times more significance from each test. You can do this either by dividing your significance threshold by N or, equivalently, by multiplying each P-value by N before comparing it to the original threshold.

So if we would make a single comparison of two Snabb versions then we may use P=0.05 to check for a significant difference with 95% confidence. If we split the data into 10 groups and compared them separately then the Bonferroni correction says we would need to use P=0.005 as the threshold for 95% confidence instead.
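As a toy illustration of that arithmetic (the p-values below are made up, not from our data), here is the adjustment written out in Lua:

-- Bonferroni correction with made-up p-values, one per data slice.
local alpha    = 0.05                  -- family-wise significance level
local p_values = {0.004, 0.020, 0.300} -- unadjusted p-values, N = 3 slices
local n        = #p_values

for i, p in ipairs(p_values) do
   -- Equivalent checks: p * n <= alpha, or p <= alpha / n.
   local significant = (p * n) <= alpha
   print(("slice %d: p=%.3f adjusted=%.3f %s"):format(
      i, p, p * n, significant and "significant" or "not significant"))
end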

I am sure that R routines for ANOVA, etc., apply such corrections automatically. However, the visualizations have no such corrections and are potentially misleading. I have to think about a good way to communicate the statistical significance of the test results, e.g. visual confidence intervals or numeric statistics.
