Use LuaJIT-fu to optimize virtio-net device #1001
Conversation
Checksum was being calculated in a case where it is not necessary: specifically, when checksum offload is disabled and so is MRG_RXBUF.
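For illustration only (made-up names, not the actual Snabb code), the guard described above amounts to something like:

```lua
-- Hypothetical sketch of the fix described above: only compute a software
-- checksum when it is actually needed, i.e. when checksum offload or
-- MRG_RXBUF is negotiated; otherwise skip the calculation entirely.
local function maybe_checksum (pkt, checksum_offload, mrg_rxbuf, compute_checksum)
   if checksum_offload or mrg_rxbuf then
      return compute_checksum(pkt)
   end
   return nil -- neither feature negotiated: no checksum needed
end
```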
Excellent work! A couple of minor notes:
Good eye @kbara! I wonder how to confirm whether this is a real effect vs a random artifact due to excessive slicing-and-dicing of the dataset, i.e. Type I error (xkcd 882).

The first idea that comes to mind is to do Tukey's Test on the whole iperf dataset and look at the confidence interval on the difference in mean value for the filter benchmark between the two branches. Here it is with the default setting of 95% confidence interval:

```r
library(readr)
library(dplyr)
d <- read_csv("https://hydra.snabb.co/build/438947/download/1/bench.csv")
iperf <- filter(d, benchmark=='iperf')
TukeyHSD(aov(score ~ snabb*config, iperf))
```

The relevant line from the full output:
My shaky interpretation is that on the one hand this suggests there is a difference in the mean performance of the two branches (p adj = 0.004), but on the other hand the lower bound on the difference is quite close to zero (lwr = -0.199). So the best we can say with 95% confidence is that the difference in means is at least -0.199 Gbps.

We could also repeat this test with a 99% confidence interval to be more conservative:

```r
TukeyHSD(aov(score ~ snabb*config, iperf), conf.level=0.99)
```
which puts the lower bound on the effect at -0.038 Gbps. How to interpret this? I am not really sure. My instinct is to collect more data as the first step. I already have a larger (more iterations) benchmark running (evaluation 3325) and that dataset should be ready tomorrow.

Interesting stuff! There is still a risk that I am completely misapplying and misunderstanding all of these statistical tools, but I feel like this is a promising approach.
Thinking about this some more... there is a very important statement at the beginning of R for Data Science: data that has been used to generate a hypothesis cannot also be used to confirm it.
We have used our initial dataset to generate a hypothesis: that there is a real spike in the iperf filter scores for one of the branches. So my idea of applying Tukey's Test to the original dataset was bad for at least two reasons. First, we have already "used up" this dataset with our hypothesis generation. Second, Tukey's Test compares the mean scores of the benchmarks, which does not really match our hypothesis anyway.

So let us now take the new data from evaluation 3325 and see if the same pattern is there. We have to be careful here because this CSV file is actually a superset of the previous one, i.e. it includes the original data points: the first dataset contains 5 results for each scenario and the second dataset contains 30. (Hydra is clever enough to see that the tests from the first dataset have already run and to reuse those results.)

So without further ado let us take a look at the new data (2100 results) for the iperf filter benchmark:

```r
library(readr)
library(dplyr)
library(ggplot2)
d <- read_csv("https://hydra.snabb.co/build/433897/download/1/bench.csv")
new.iperf.filter <- filter(d, benchmark=='iperf' & config=='filter' & id>5)
ggplot(aes(x=score, y=..count.., color=snabb), data=new.iperf.filter) + geom_density()
```

This picture seems like grounds to reject the hypothesis that the spike is a real effect.

Just for a sanity check we can also use Tukey's Test to confirm that the average score has increased with the new branch.
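The exact command is not preserved in this thread. As a rough sketch of the kind of check meant here, assuming the same data frame and column names as above (those details are my assumption, not the original code), it could look like:

```r
# Hypothetical sketch, not the original command: compare the mean scores of
# the two Snabb versions on the new filter results, with a 99% confidence
# interval as in the earlier test.
TukeyHSD(aov(score ~ snabb, new.iperf.filter), conf.level=0.99)
```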
This tells us that we can be 99% confident that the average score has improved with the new branch.

So: I reckon that the spike we saw in the first dataset was due to xkcd 882, i.e. we sliced our data into many different pieces and one of them showed a pattern by random chance.
This should be fixed now with snabblab/snabblab-nixos@25c1802.
The next tributary report presents another interesting dataset for this code, this time running on different hardware. Among the things that jump out at me is that we need to work out which tests we want to run on which hardware platform(s).
@kbara Just following up on the "xkcd 882 problem." On the one hand it seems like there are straightforward ways to account for this in statistics; on the other hand I don't immediately know how to apply those to visualizations.

The Bonferroni correction seems cool. The idea is that if you make N comparisons then you are N times more likely to see an effect due to random chance. So to maintain your confidence you need to look for N times more significance in the tests. You can do this by multiplying your P-values by N. So if we were making a single comparison of two Snabb versions then we might use P=0.05 to check for a significant difference with 95% confidence. If we split the data into 10 groups and compared them separately then the Bonferroni correction says we would need to use P=0.005 as the threshold for 95% confidence instead.

I am sure that the R routines for ANOVA, etc., apply such corrections automatically. However, the visualizations have no such corrections and are potentially misleading. I have to think about a good way to communicate the statistical significance of the test results, e.g. visual confidence intervals or numeric statistics.
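As a concrete illustration (my own sketch with made-up p-values, not numbers from the benchmark data), base R's p.adjust() applies exactly this correction:

```r
# Ten made-up p-values standing in for N = 10 separate comparisons.
p <- c(0.004, 0.02, 0.03, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90)

# Bonferroni correction: each p-value is multiplied by N (and capped at 1),
# which is equivalent to testing against a threshold of 0.05 / N.
p.adjust(p, method = "bonferroni")
```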
This branch improves virtio-net performance substantially by fixing bad cases in the performance test matrix.
Test results
LuaJIT-fu
The vhost_user app now JITs separate machine code for each connection. The transmit and receive paths are also JITed separately.
This improves performance and consistency for many workloads, especially for dealing with different combinations of virtio-net options and with virtual machines that switch device drivers (e.g. from Linux to DPDK or Snabb).
The root problem was that our vring processing code is written in a style that performs badly with LuaJIT. The code is implemented in one central loop, VirtioVirtq:get_buffers(), which has many different behaviors depending on its arguments (callback functions, negotiated virtio features, etc). LuaJIT compiles all of these different options into a growing network of "side traces" and this degrades performance.
To make the code perform efficiently we need to make its control flow more consistent from the point of view of the JIT.
One solution to this problem would be to write specialized versions of the vring processing code for all different combinations of situations:

- Separately for transmit and receive;
- Separately for mergeable RX buffers and without;
- Separately for indirect descriptors and without;
- etc...
However doing this by hand would be a lot of work. We would have to rewrite the vring processing code and the new version would be much more code.
Instead we make the JIT do this work for us automatically. We just have one copy of the source code but we load a fresh copy of the object code for each vring. This means that each vring is JITed separately in a way that suits its specific behavior. So the machine code for a vring that supports mergeable RX buffers will automatically optimize for that case, and so on.
You can think of it as creating many different vring processing loops that each inline a different set of subroutines.
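As a rough illustration of the idea (a hypothetical sketch, not the actual Snabb code; the file name and constructor are made up), loading a fresh chunk per vring gives LuaJIT a separate copy of the code to specialize:

```lua
-- Hypothetical sketch: each call to loadfile() compiles a brand-new chunk,
-- so LuaJIT records and specializes traces for each vring's copy separately
-- instead of sharing one trace network between all of them.
-- "virtq.lua" and its returned 'new' constructor are made-up names.
local function new_specialized_virtq (device, ring_id)
   local chunk = assert(loadfile("virtq.lua")) -- fresh bytecode, bypassing the require() cache
   local virtq_module = chunk()                -- assume the chunk returns a module table
   return virtq_module.new(device, ring_id)    -- this instance gets its own machine code
end

-- Each vring then runs its own copy of the object code:
--   local rx = new_specialized_virtq(device, 0)
--   local tx = new_specialized_virtq(device, 1)
```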
If you want to know more about doing this kind of optimization with LuaJIT then there is some background information here:
LuaJIT/LuaJIT#208 (comment)
Note that the callback-driven nature of the vring processing code is not a problem directly. Callback indirection compiles very efficiently when the same callback function is used every time. However, performance degrades when the JIT is sharing machine code between multiple calls that provide different callback functions. (The same thing happens if you pass different parameter values that cause 'if' statements to switch from 'then' to 'else'.)
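A toy example of the effect (hypothetical code, unrelated to Snabb itself): the same loop is cheap when it always receives the same callback, and produces side traces when the callback alternates from call to call:

```lua
-- Sum a table through a callback. LuaJIT specializes the trace on the
-- identity of 'callback' at the call site inside the loop.
local function sum_with (values, callback)
   local total = 0
   for i = 1, #values do
      total = total + callback(values[i])
   end
   return total
end

local xs = {}
for i = 1, 1000000 do xs[i] = i end

local double = function (x) return 2 * x end
local square = function (x) return x * x end

-- Biased: every call uses 'double', so the compiled trace stays on its fast path.
for _ = 1, 100 do sum_with(xs, double) end

-- Unbiased: alternating callbacks makes the shared machine code guard-fail,
-- grow side traces, and lose the specialization.
for _ = 1, 100 do
   sum_with(xs, double)
   sum_with(xs, square)
end
```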
Quoth the LuaJIT masters all too innocently:

> Avoid unbiased branches.