
Idea: Timeline log with performance counters #1011

Open
lukego opened this issue Sep 4, 2016 · 2 comments
lukego commented Sep 4, 2016

Just an idea that I wanted to share: I experimented with adding CPU performance monitoring counters to the timeline (#916) log entries.

In this mode each log message records not only the elapsed time (cycles) but also the number of L1/L2/L3/RAM accesses, the number of instructions executed, and the effective clock speed (adjusted for frequency scaling, Turbo Boost, the AVX2 frequency penalty, etc.).
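
The derived metrics mentioned above (IPC, effective clock relative to nominal) could be computed from deltas between counter snapshots, roughly like this sketch (field names are illustrative, not Snabb's actual timeline format):

```python
# Hypothetical sketch: deriving per-entry metrics from raw counter snapshots.
# Reference cycles tick at the nominal TSC rate, so core-cycles / ref-cycles
# gives the effective clock multiplier (Turbo level).

def derive_metrics(prev, cur):
    """Compute deltas and derived metrics between two counter snapshots."""
    d = {k: cur[k] - prev[k] for k in prev}   # raw counter deltas
    cycles = d["cpu_cycles"]                  # unhalted core cycles
    return {
        "instructions": d["instructions"],
        "ipc": d["instructions"] / cycles if cycles else 0.0,
        "turbo": cycles / d["ref_cycles"] if d["ref_cycles"] else 0.0,
        "l3_miss": d["llc_misses"],           # approximates RAM accesses
    }

prev = {"cpu_cycles": 0,    "instructions": 0,    "ref_cycles": 0,    "llc_misses": 0}
cur  = {"cpu_cycles": 3000, "instructions": 6000, "ref_cycles": 2500, "llc_misses": 4}
m = derive_metrics(prev, cur)
print(m["ipc"], m["turbo"])   # 2.0 1.2
```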

Here is a little demo in the Snabb Studio prototype. First you see a list of breaths where you can select one that looks interesting:

[screenshot: list of breaths in the Snabb Studio prototype, 2016-09-04]

Then you can see the detailed processing steps for that breath, each annotated with performance counter deltas and useful metrics like Turbo level and Instructions Per Cycle:

[screenshot: per-step performance counter deltas for one breath, 2016-09-04]

The idea is that this tooling could take some of the mystery out of performance analysis. Perhaps we could come up with more systematic ways to optimize application performance:

  • Which processing step should we focus on? (Start with the ones that take the most cycles.)
  • Is it running slowly (low instructions-per-cycle)? If so, investigate hazards such as cache misses.
  • Is it running quickly (high instructions-per-cycle)? If so, investigate the generated code and try to streamline it.
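
The triage above could be sketched as a tiny classifier over per-step counter deltas (the IPC threshold and field names are illustrative guesses, not measured values):

```python
def triage(step):
    """Suggest where to look first, per the heuristics above.
    `step` is a dict with hypothetical fields: cycles, instructions."""
    ipc = step["instructions"] / step["cycles"]
    if ipc < 1.0:
        return "low IPC: investigate hazards (cache misses, branch mispredicts)"
    return "high IPC: investigate the generated code and streamline it"

steps = [
    {"name": "app.pull", "cycles": 4000, "instructions": 2000},
    {"name": "app.push", "cycles": 1000, "instructions": 2500},
]
# Start with the step that takes the most cycles.
hottest = max(steps, key=lambda s: s["cycles"])
print(hottest["name"], "->", triage(hottest))
```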

The challenge, and the opportunity, is how to make sense of these logs when we have a million or so entries. Experience is needed here: can we skim them by hand? Do we need special visualizations? Can we capture the important details with a few key metrics? I don't know yet. This is why I tend to keep experimenting rather than pushing the prototype tools on other people for the moment :).

One more important direction is being able to deal with log files from executions where a lot of different things happened. For example, it would be wonderful to be able to torture a Snabb process with many different non-deterministic workloads and then extract well-defined performance results directly from the timeline files.
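
Extracting per-workload results from such a mixed log might look roughly like this sketch, assuming (hypothetically) that each breath record carries a workload tag:

```python
# Sketch: summarizing a mixed-workload timeline per workload.
# The "workload" tag and "cycles_per_packet" field are invented for illustration.
from statistics import mean

breaths = [
    {"workload": "iperf",   "cycles_per_packet": 350},
    {"workload": "iperf",   "cycles_per_packet": 360},
    {"workload": "dnsperf", "cycles_per_packet": 900},
]

by_workload = {}
for b in breaths:
    by_workload.setdefault(b["workload"], []).append(b["cycles_per_packet"])

summary = {wl: mean(xs) for wl, xs in by_workload.items()}
for wl in sorted(summary):
    print(wl, summary[wl])
```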

End braindump.

lukego added the idea label Sep 4, 2016

lukego commented Sep 6, 2016

So, latest crazy idea: suppose we were to analyze Snabb performance by looking at individual breaths rather than whole end-to-end benchmarks.

Then each benchmark run would produce not one metric (e.g. overall average throughput) but more like 100,000 metrics (performance of a sample of breaths). This way when Hydra runs an intense benchmark (a machine-week or so) we would have around a billion data points to analyze instead of the 10,000 or so that we have now. The analysis could be done using models like in #1007 (comment).

Potential advantages:

  • Test cases could be much more diverse e.g. using randomized configurations and workloads. Anything that produces a diverse and interesting set of breaths to analyze. This would create a satisfying mess to tease apart by modeling.
  • Data sets from production deployments could be analyzed in the same way. For example we could take a timeline file from a production deployment and see how well it fits the model we came up with from the CI tests.
  • Models may point directly to optimizations. For example we may find that performance per breath is best predicted by number of packets processed, or number of bytes processed, or number of L3 cache accesses, or other specific factors and then we may be able to tune the engine by influencing these.

So still a pipe dream for now but it could be very interesting to turn a timeline log into a million-row CSV file and see what R can make of it.
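
Turning the timeline into a CSV for R could be as simple as flattening one row per breath; a sketch with invented field names:

```python
# Sketch: flattening per-breath timeline records into CSV for analysis in R.
# The breath records and their fields are hypothetical examples.
import csv, io

breaths = [
    {"breath": 1, "packets": 98,  "bytes": 880000, "cycles": 350000, "l3_refs": 1200},
    {"breath": 2, "packets": 101, "bytes": 905000, "cycles": 360000, "l3_refs": 1250},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["breath", "packets", "bytes", "cycles", "l3_refs"])
writer.writeheader()
writer.writerows(breaths)
print(buf.getvalue())
```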

Incidentally, the data above ^^^ is a little interesting. This breath is from snabbnfv between two VMs doing iperf with jumbo frames (MTU 9000). This is fun because when we are copying packets to the VMs we are probably using at least 2MB of cache per 100 packets, so we see quite a bit of activity in terms of L3 hits and even L3 misses (RAM accesses). On the other hand, the overall performance is excellent at 20 bits of throughput per CPU cycle. So even if the engine is perhaps not optimally tuned for jumbo frames, they are still a very easy workload, and with ~100 packets per breath it seems like we could do 20 Gbps of traffic for each 1 GHz of CPU. This armchair analysis might be much more satisfying as a formal model fitted to the data, though...
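
The back-of-the-envelope arithmetic above checks out directly (figures taken from the comment; the factor of 2 for the cache footprint assumes each copy touches both source and destination):

```python
# Throughput claim: 20 bits per cycle at 1 GHz is 20 Gbps.
bits_per_cycle = 20
cycles_per_second = 1e9            # 1 GHz of CPU
gbps = bits_per_cycle * cycles_per_second / 1e9
print(gbps, "Gbps")                # 20.0 Gbps

# Cache-footprint claim: copying ~100 jumbo frames touches roughly
# 100 * 9000 bytes at the source and again at the destination, ~2 MB.
frames, mtu = 100, 9000
mb = 2 * frames * mtu / 1e6
print(mb, "MB")                    # 1.8 MB
```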


lukego commented Sep 7, 2016

Related idea:

We could extend LuaJIT with a global counter of the total number of trace exits taken and include it alongside the CPU performance counters. Then we could measure side-trace jumps in much the same way as cache misses or branch mispredictions. This could make it possible to account for the performance of a breath in terms of how well it stayed "on trace."
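
A crude sketch of what that accounting might look like, with trace exits treated like any other hazard counter (the unit costs and field names are invented for illustration, not measured):

```python
# Sketch: attributing a breath's cycles to counter deltas at assumed unit
# costs, including a hypothetical LuaJIT trace-exit counter.
ASSUMED_COST = {"instructions": 0.5, "llc_misses": 40, "trace_exits": 60}

def explain_cycles(breath):
    """Attribute cycles to each counter at the assumed per-event costs."""
    return {k: breath[k] * cost for k, cost in ASSUMED_COST.items()}

breath = {"cycles": 12000, "instructions": 8000, "llc_misses": 100, "trace_exits": 50}
parts = explain_cycles(breath)
accounted = sum(parts.values())
print(parts)
print(f"accounted {accounted:.0f} of {breath['cycles']} cycles")
```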
