Avoid ctype diversity for registers in intel10g #612

Conversation

alexandergall (Contributor)

I'm trying to set a new record for the amount of explanation for a one-line patch :) But seriously, I've had a number of revelations while working on this and I have an urge to share them with you. This is, in fact, the main point of this PR (maybe I should have raised an issue instead?). The suggested fix is just a workaround.

To have a concrete case to talk about, I've analysed the trace structure of a simple Snabb application: the "shunt". It's a program that simply short-circuits two physical interfaces by passing all packets from one directly to the other and vice versa. It is run via snsh:

 local Intel82599 = require("apps.intel.intel_app").Intel82599
 local c = config.new()
 config.app(c, "if1", Intel82599, { pciaddr = "0000:82:00.0" })
 config.app(c, "if2", Intel82599, { pciaddr = "0000:83:00.0" })
 config.link(c, "if1.tx -> if2.rx")
 config.link(c, "if2.tx -> if1.rx")
 engine.configure(c)
 engine.busywait = true
 require("jit.dump").start('+rs', 'dump')
 engine.main({ duration = 10, no_timers = false, report = { showlinks = true, showload = true } })

Note that I use engine.busywait = true. There is actually a minor problem of garbage being created by pace_breathing(). I'll submit a PR on that eventually.

I executed it in the Snabb lab on host interlaken, using two pairs of looped interfaces 82:00.{0,1} and 83:00.{0,1}. Packets were generated using

 ./snabb packetblaster replay program/packetblaster/64.pcap 82:00.1
 ./snabb packetblaster replay program/packetblaster/64.pcap 83:00.1

The 64.pcap file contains 64-byte packets including the Ethernet header (but not the CRC). Assuming the standard frame overhead of 24 bytes (8 bytes preamble/SOF, 4 bytes CRC, 12 bytes IPG), the maximum packet rate is thus 10^10/(88*8) = 14'204'545. The packetblasters don't quite achieve that. They peak at around 13'130'000. This is something I'd like to understand, but it's irrelevant for this analysis. Ideally, the shunt application will forward all packets and achieve ~26 Mpps. This should actually be realistic.
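
For reference, the same arithmetic as a tiny standalone Lua snippet (nothing Snabb-specific, names invented for illustration):

 -- Theoretical maximum packet rate for 64-byte frames on a 10G link.
 local frame    = 64          -- Ethernet frame incl. header, excl. CRC (bytes)
 local overhead = 8 + 4 + 12  -- preamble/SOF + CRC + inter-packet gap (bytes)
 local pps      = 1e10 / ((frame + overhead) * 8)
 print(pps) --> 14204545.45..., i.e. ~14.2 Mpps per direction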

Running the shunt without the patch for 10 seconds yields

 link report:
           37,870,464 sent on if1.tx -> if2.rx (loss rate: 0%)
           37,870,464 sent on if2.tx -> if1.rx (loss rate: 0%)

7.6 Mpps is not quite what we want to see. With the patch I get

 link report:
          101,016,320 sent on if1.tx -> if2.rx (loss rate: 0%)
          101,016,320 sent on if2.tx -> if1.rx (loss rate: 0%)

20.2 Mpps is much better, but we still lose about 6 Mpps. Closer inspection shows that the packets are dropped by the NICs when the receive buffers are full. Again, this is something I'd like to understand but it's irrelevant for this analysis.

Now let's look at the JIT dump file https://drive.google.com/folderview?id=0ByZgkfz7rch4bXEzNW90elE2SEU&usp=sharing and figure out the trace dependencies:

[image: shunt-trace-analysis (trace dependency diagram)]

The numbers correspond to the traces in the dump. The circles indicate "looping traces", i.e. traces that end with an actual loop (I'm not sure what the proper nomenclature is). The other traces do not contain loops. They are all called via exits of other traces. I call them "side traces".

The looping traces are ordered by the amount of CPU cycles consumed by them, highest usage at the top. Here is the complete breakdown according to perf top (trace 34 uses so few cycles that it doesn't show up here):

 10.06%  perf-16720.map           [.] TRACE_23::apps/intel/intel_app.lua:92
  8.17%  perf-16720.map           [.] TRACE_27::apps/intel/intel10g.lua:467
  3.87%  perf-16720.map           [.] TRACE_25::apps/intel/intel_app.lua:110
  3.08%  perf-16720.map           [.] TRACE_30::core/link.lua:70
  2.80%  perf-16720.map           [.] TRACE_17::apps/intel/intel_app.lua:101
  2.09%  perf-16720.map           [.] TRACE_31::apps/intel/intel10g.lua:473
  1.76%  perf-16720.map           [.] TRACE_32::core/app.lua:278
  0.30%  perf-16720.map           [.] TRACE_28::apps/intel/intel10g.lua:511
  0.20%  perf-16720.map           [.] TRACE_29::apps/intel/intel_app.lua:96
  0.20%  perf-16720.map           [.] TRACE_33::core/app.lua:293
  0.02%  perf-16720.map           [.] TRACE_38::core/counter.lua:77

The structure is pretty much ideal: there is a small number of looping traces connected by short side traces. The entire graph is fully connected, i.e. there are no transitions to the interpreter anywhere.
It's 100% compiled.

The looping traces represent the handful of inner loops in the code, like iterating over ring buffers or allocating fresh packets for the receive buffers. It is noteworthy that the breathe loop itself, as well as the inhale and exhale loops, are effectively unrolled. Most of that code is in traces 28 and 31 and the actual loops are implicit in the sequence in which the traces are connected. This is quite fascinating :) You'll find descriptions of all traces at the end.

So, what about the patch? One thing to note in the trace analysis is that both instances of the Intel82599 app can be executed by the exact same code. There are no branches due to the fact that they are distinct objects. Without the patch, the trace structure is much more complex. Here is an excerpt of a trace that contains a call to add_receive_buffer():

 ---- TRACE 16 IR
 [...]
 ....              SNAP   #4   [ ---- ---- intel10g.lua:513|0006 0062 0078 intel10g.lua:509|---- ]
 0106       >  fun EQ     0078  assert
 0107       >  p32 HREFK  0086  "rxdesc" @27
 0108 r9    >  cdt HLOAD  0107
 0109 r10      u16 FLOAD  0108  cdata.ctypeid
 0110       >  int EQ     0109  +1349
 [...]

rxdesc is a cdata object and the compiler inserts a guard that checks whether the ctype ID is the one that was registered when the trace was recorded (+1349). According to the dump, this guard is triggered during the execution of the program and a side trace is generated:

 ---- TRACE 17 start 16/4 intel10g.lua:510
 [...]
 ---- TRACE 17 IR
 0001 rbp      tab SLOAD  #3    PI
 0002 r12      fun SLOAD  #5    PI
 0003 rsi      p64 PVAL   #61
 0004  {sink}  cdt CNEWI  +1281  0003
 ....              SNAP   #0   [ ---- ---- intel10g.lua:513|0001 0004 0002 intel10g.lua:509|---- ]
 0005       >  nil GCSTEP
 0006       >  fun EQ     0002  assert
 0007          int FLOAD  0001  tab.hmask
 0008       >  int EQ     0007  +31
 0009 rax      p32 FLOAD  0001  tab.node
 0010       >  p32 HREFK  0009  "rxdesc" @27
 0011 r10   >  cdt HLOAD  0010
 0012       >  p32 HREFK  0009  "rdt" @24
 0013 xmm5  >  num HLOAD  0012
 0014 rbx      u16 FLOAD  0011  cdata.ctypeid
 0015       >  int EQ     0014  +1354
 [...]

This trace is executed when the ctype ID of rxdesc is equal to 1354. How does this come about? The rxdesc cdata objects are created in M_sf:init_dma_memory() via lib.bounds_checked(). Inspection of that function reveals that the innocent-looking ffi.typeof() in the first argument to ffi.metatype() creates a different ctype every time, even if the fundamental cdata type passed to the function is the same (rxdesc_t in this case).
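
A minimal sketch that reproduces the effect in isolation (this is not the actual lib.bounds_checked() code, just an illustration of the same pattern):

 local ffi = require("ffi")

 -- Wrap a pointer type in an anonymous struct, the way a bounds-checking
 -- wrapper might. The anonymous struct passed to ffi.typeof() yields a
 -- fresh, distinct ctype on every call.
 local function make_wrapper (ptr_t)
    local wrapper_t = ffi.typeof("struct { $ ptr; }", ptr_t)
    return ffi.metatype(wrapper_t, {
       __index = function (w, i) return w.ptr[i] end
    })
 end

 local int_ptr_t = ffi.typeof("int *")
 local w1_t = make_wrapper(int_ptr_t)
 local w2_t = make_wrapper(int_ptr_t)
 print(tostring(w1_t), tostring(w2_t))
 -- prints two different ctypes, e.g. ctype<struct 96>  ctype<struct 97>,
 -- so the JIT's cdata.ctypeid guards diverge between the two wrappers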

As a result, the two instances of Intel82599 can no longer be executed by the same traces. The first instance for which the trace happens to be compiled will be executed in the fast path, while the second instance will branch to a side trace where it suffers heavily from trace-setup overhead and un-optimized code. The result is devastating and leads to a performance drop of over 60%.

The simple fix is to remove the bounds_checked() wrapper from the rxdesc object. Re-writing the code to avoid generating different ctypes is left as an exercise to the reader :) Actually, checking bounds for every single register access is probably overkill. We could skip it (like my patch does) or enable it only for debug mode.
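
If we wanted to keep the checks for debugging, one option is a module-level switch. A sketch under assumed names (the struct layout is illustrative, not the driver's real rxdesc_t, and DEBUG is a hypothetical flag):

 local ffi = require("ffi")

 local DEBUG = false -- hypothetical debug-mode switch

 -- Construct the ctypes once at module level so that every driver
 -- instance shares them.
 local rxdesc_t = ffi.typeof("struct { uint64_t address, flags; }")
 local ring_t   = ffi.typeof("$[?]", rxdesc_t)

 local function new_rxdesc_ring (size)
    local ring = ffi.new(ring_t, size)
    if DEBUG then
       -- only in debug mode would we wrap the ring with lib.bounds_checked()
       -- and accept the ctype diversity that comes with it
    end
    return ring
 end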

The lesson to learn is that if we use multiple instances of an object in the same Snabb program, we must make sure that any cdata objects referenced in the inner packet-processing loop have the same ctype in all instances of the object. I've already fixed such a problem in my Murmur hash code, for example.
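
In code, the guideline boils down to hoisting ctype construction out of per-instance constructors. A sketch with invented names:

 local ffi = require("ffi")

 -- BAD: every call parses a fresh anonymous struct, so two app instances
 -- get cdata with different ctype IDs and cannot share compiled traces.
 local function new_state_bad ()
    return ffi.new("struct { uint32_t seed; }")
 end

 -- GOOD: the ctype is created once when the module loads; every instance
 -- carries the same ctype ID and the JIT guards stay on the fast path.
 local state_t = ffi.typeof("struct { uint32_t seed; }")
 local function new_state_good ()
    return state_t()
 end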

Finally, some more information about the traces of the shunt run.

Looping traces:

Trace 23

Starts at the for loop in Intel82599:pull(). There are always at least 128 packets ready, so the only exit ever taken is #19, when the iteration counter reaches 128.

Trace 27

Starts at the while loop in M_sf:sync_transmit(). The only possible exit is #8 when old_tdh == self.tdh.

Trace 25

Starts at the while loop in Intel82599:push(). There is always enough capacity on the egress NIC to transmit all packets queued from the ingress NIC, i.e. the loop is always terminated via exit #15 when empty(l) becomes true.

Trace 17

Starts at the while loop in Intel82599:add_receive_buffers(). The loop is always terminated via exit #15 when all available descriptors in the hardware receive buffer have been filled with fresh packets.

Trace 34

Starts at the for loop in commit() of core/counter.lua. Note that the trace is aborted twice during recording until the compiler has sorted out the nested loops from the calling trace (the dump doesn't contain enough context to see the complete state of the trace recorder). The loop stops via exit #5 when the iteration counter reaches #public.

Non-looping traces (with parent/exit from which they are called)

Trace 29 (23/19)

It starts right after the loop in Intel82599:pull() with the call to self:add_receive_buffers() and contains only the very first call to can_add_receive_buffer() before it connects with the actual loop in trace 17.

Trace 28 (17/15)

This trace starts with the return at M_sf:can_add_receive_buffer() after trace 17 has finished. It returns via with_restart() to the "inhale" loop in breathe() which has just finished calling the pull() method of one instance of the Intel82599 app. If no side exit is taken, the trace continues to execute the first unrolled iteration of the "exhale" loop, which executes the push() method of one of the instances of Intel82599 (as usual, it only contains the code up to the while loop and then hands over to trace 25 for the loop).

In case the inhale loop is not yet finished, the trace is left at exit #8 to call the pull() method of the other instance of Intel82599 via trace 23.

Trace 30 (25/15)

The trace starts with the return from the call to empty(l) that terminated the loop of trace 25. It continues with the call to sync_transmit() up to the conditional statement preceding the while loop and then connects to trace 27 to execute the loop.

Trace 31 (27/8)

The trace picks up after the loop of trace 27 has terminated by advancing the transmit descriptor tail. It continues to return to the "exhale" loop in breathe() from which it has been called. When no side exits are taken, the trace represents the last iteration of the exhale loop and it continues to update the breathe counter. The condition counter.read(breaths) % 100 == 0 is not fulfilled and counter.commit() is skipped. The trace returns to the main breath loop in app.main() where it executes timer.run() and skips pace_breathing() because busywait is set to true. timer.run() doesn't do anything either because no timers have been activated. It then executes the next call to breathe() with another "inhale" cycle, which calls Intel82599:pull() for one of the instances of the Intel82599 apps. It includes everything up to the for loop and then connects to trace 23 to execute the loop itself.

This is what happens when trace 31 is executed to the end. Of all 33 possible exits, only two are ever actually taken.

Exit #8 is taken when the exhale loop is not yet finished, i.e. when the next link in the sequence has has_new_data set to true. The trace then branches off to trace 33 to perform a call to the push() method of the other instance of the Intel82599 driver.

Exit #17 is taken when the condition counter.read(breaths) % 100 == 0 is true. Control is transferred via trace 36 to trace 34, which executes the loop in counter.commit(). It then continues with trace 38, which contains the same code as the remainder of trace 31 that was skipped by taking exit #17.

Trace 32 (28/8)

Called from exit #8 of trace 28, this trace contains the code of Intel82599:pull() up to the for loop, which is executed by trace 23.

Trace 33 (31/8)

Called from exit #8 of trace 31, this trace contains the code of Intel82599:push() up to the while loop, which is executed by trace 25.

Trace 36 (31/17)

Executes the call to counter.commit(), which is basically just a transition to trace 34, which executes the actual loop.

Trace 38 (34/5)

This trace contains the end of trace 31 after exit #17. It basically helps to splice traces 36/34 into the path of trace 31 when that particular branch is taken.

The application of lib.bounds_checked() to the cdata objects used for receive and transmit descriptor rings causes branches in JIT traces for different instances of the driver, leading to highly sub-optimal side traces in applications that use multiple instances in the same program. The patch only suggests a minimal solution by disabling the bounds-checking logic entirely.
lukego (Member) commented Sep 10, 2015

This is a really fantastic investigation and write-up :). More thoughts to follow..

lukego added a commit to lukego/snabb that referenced this pull request Sep 14, 2015
lukego (Member) commented Sep 14, 2015

I have not reproduced this setup yet but initial thoughts anyway:

This is a great write-up! I think that we really do need to go through the exercise of following the compiler in minute detail a few times to make sure that we understand what is going on. Thank you for digging into this and sharing what you figured out.

I have been doing a little related tracing in lukego/blog#8. I hope that once we have done this a few times we will be able to automate the analysis e.g. with profiler extensions like #621 and perhaps automatic generation of graphviz trace diagrams with hotspots indicated.

Having app instances interfere with each other seems unfortunate. It would be nice for an app to always behave in the same way independent of the rest of the app network. (This is already hard due to e.g. shared caches but it still seems like a good goal to strive towards.) Here we are solving one instance of this problem with a more careful programming style. I wonder if a reasonable alternative would be to somehow tell LuaJIT to always have separate traces for each app instance? (Could perhaps be accomplished by loading the app module fresh for each instance instead of using require(), which caches and reuses the module definition on each call.) This would be moving back in the direction of "sandboxes", where we have done a lot of experiments already.

Generally I would love to get to the point where it is straightforward to optimize an app, understand its performance under "lab conditions", and accurately predict its performance in production and in combination with other apps. It feels like this is a pretty big problem to solve but that we are already making a lot of progress.

lukego (Member) commented Sep 16, 2015

I reproduced this now. It works quite well to have a little snsh script in an issue/PR.

I confirmed that the alternative trick of loading a separate copy of the Lua code for each app instance will also solve the problem. This way I believe that each app has a separate trace, and so there is no need for the instances to behave in a compatible way.

I did this in a really kludgy way - removing the modules from the cache that require() looks in - but if we wanted to adopt this behavior for all apps then we should be able to find a better solution from amongst @javierguerragiraldez's many experiments with sandboxing possibilities.

Here is the code:

 local Intel82599_1 = require("apps.intel.intel_app").Intel82599
 -- Load a second copy of the Intel driver modules
 package.loaded["apps.intel.intel_app"] = nil
 _G.apps.intel.intel_app = nil
 package.loaded["apps.intel.intel10g"] = nil
 _G.apps.intel.intel10g = nil
 local Intel82599_2 = require("apps.intel.intel_app").Intel82599
 local c = config.new()
 config.app(c, "if1", Intel82599_1, { pciaddr = "0000:01:00.0" })
 config.app(c, "if2", Intel82599_2, { pciaddr = "0000:02:00.0" })
 config.link(c, "if1.tx -> if2.rx")
 config.link(c, "if2.tx -> if1.rx")
 engine.configure(c)
 engine.busywait = true
 require("jit.dump").start('+rs', 'dump')
 engine.main({ duration = 10, no_timers = false, report = { showlinks = true, showload = true } })

javierguerragiraldez (Contributor)

Fascinating read! I'd like to be even half as fluent in the JIT tracings...

About the sandboxing style: yes, in one of the first proposals you specified a file path in the config.app() call, and it instantiated the code with load() instead of require(). I think I still memoized the resulting chunk, but if it's better to compile it for each instance, that would be the easiest way to do it.

In the end it was left behind because the multi-[state/thread/process] schemes didn't really need any different style, but it could easily be resurrected if desired. I recall that the code style change for the apps was well received, but it was a non-trivial change for some of them.

alexandergall (Contributor, Author)

@lukego Interesting, thanks for checking that. This could eliminate the problem for apps, but you can still run into it with other modules that use some kind of instantiation mechanism (like I did with the Murmur hash function). So I think we need some programming guidelines anyway (it's probably not as complicated as one might think). Also, I wonder whether there could be situations where this approach backfires, i.e. cases in which forcing different traces actually degrades performance. None come to mind right now, but it's difficult to anticipate.

alexandergall deleted the intel10g-no-register-bounds-check branch October 2, 2015 11:12
lukego (Member) commented Oct 2, 2015

btw: This PR would close automatically when it merges onto master in a release. (I think it's okay to leave PRs open before they are released because it gives people a window of time to discuss the change and potentially suggest reverting it before release.)

alexandergall restored the intel10g-no-register-bounds-check branch October 2, 2015 11:28
alexandergall (Contributor, Author)

I deleted the branch by accident.

lukego (Member) commented Oct 2, 2015

no worries :)

lukego (Member) commented Mar 8, 2016

I was reminded of this PR when I came across this note in the LuaJIT FFI API documentation today:

Please note that an anonymous struct declaration implicitly creates a new and distinguished ctype every time you use it for ffi.new(). This is probably not what you want, especially if you create more than one cdata object. Different anonymous structs are not considered assignment-compatible by the C standard, even though they may have the same fields! Also, they are considered different types by the JIT-compiler, which may cause an excessive number of traces. It's strongly suggested to either declare a named struct or typedef with ffi.cdef() or to create a single ctype object for an anonymous struct with ffi.typeof().
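
A quick way to see the quoted behaviour for yourself (a sketch, plain LuaJIT):

 local ffi = require("ffi")

 -- Two uses of the same anonymous declaration: two distinct ctypes.
 local a = ffi.new("struct { int x; }")
 local b = ffi.new("struct { int x; }")
 print(tostring(ffi.typeof(a))) -- e.g. ctype<struct 95>
 print(tostring(ffi.typeof(b))) -- e.g. ctype<struct 96>, a different type!

 -- Creating a single ctype object once and reusing it avoids this.
 local point_t = ffi.typeof("struct { int x; }")
 print(ffi.istype(point_t, point_t(1)), ffi.istype(point_t, point_t(2))) --> true  true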

eugeneia added a commit to eugeneia/snabb that referenced this pull request Mar 24, 2016
eugeneia added a commit to eugeneia/snabb that referenced this pull request Mar 24, 2016
dpino pushed a commit to dpino/snabb that referenced this pull request Dec 5, 2016: "Container YANG schema nodes always appear in data"
alexandergall deleted the intel10g-no-register-bounds-check branch February 14, 2018 11:10