Avoid ctype diversity for registers in intel10g #612
Conversation
The application of `lib.bounds_checked()` to the cdata objects used for receive and transmit descriptor rings causes branches in JIT traces for different instances of the driver, leading to highly sub-optimal side-traces in applications that use multiple instances in the same program. The patch suggests only a minimal solution by disabling the bounds-checking logic entirely.
This is a really fantastic investigation and write-up :). More thoughts to follow..
I have not reproduced this setup yet but initial thoughts anyway: This is a great write-up! I think that we really do need to go through the exercise of following the compiler in minute detail a few times to make sure that we understand what is going on. Thank you for digging into this and sharing what you figured out. I have been doing a little related tracing in lukego/blog#8. I hope that once we have done this a few times we will be able to automate the analysis, e.g. with profiler extensions like #621 and perhaps automatic generation of graphviz trace diagrams with hotspots indicated.

Having app instances interfere with each other seems unfortunate. It would be nice for an app to always behave in the same way independent of the rest of the app network. (This is already hard due to e.g. shared caches but it still seems like a good goal to strive towards.) Here we are solving one instance of this problem with a more careful programming style. I wonder if a reasonable alternative would be to somehow tell LuaJIT to always have separate traces for each app instance? (Could perhaps be accomplished by loading the app module fresh for each instance instead of using the copy cached by `require`.)

Generally I would love to get to the point where it is straightforward to optimize an app, understand its performance under "lab conditions", and accurately predict its performance in production and in combination with other apps. It feels like this is a pretty big problem to solve but that we are already making a lot of progress.
I reproduced this now. It works quite well to have a little snsh script in an issue/PR. I confirmed that the alternative trick of loading a separate copy of the Lua code for each app instance will also solve the problem. This way I believe that each app has a separate trace and so there is no need for the instances to behave in a compatible way. I did this in a really kludgy way, by removing the modules from the cache that `require()` uses. Here is the code:

local Intel82599_1 = require("apps.intel.intel_app").Intel82599
-- Load a second copy of the Intel driver modules
package.loaded["apps.intel.intel_app"] = nil
_G.apps.intel.intel_app = nil
package.loaded["apps.intel.intel10g"] = nil
_G.apps.intel.intel10g = nil
local Intel82599_2 = require("apps.intel.intel_app").Intel82599
local c = config.new()
config.app(c, "if1", Intel82599_1, { pciaddr = "0000:01:00.0" })
config.app(c, "if2", Intel82599_2, { pciaddr = "0000:02:00.0" })
config.link(c, "if1.tx -> if2.rx")
config.link(c, "if2.tx -> if1.rx")
engine.configure(c)
engine.busywait = true
require("jit.dump").start('+rs', 'dump')
engine.main({ duration = 10, no_timers = false, report = { showlinks = true, showload = true } })
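The same kludge could also be wrapped in a small helper, along these lines (a sketch only, with a made-up function name; it clears the same `package.loaded` entries as the snippet above):

```lua
-- Hypothetical helper: drop the cached copies of the driver modules so
-- that require() loads a fresh, independent copy of the Lua code.
-- (The snippet above additionally clears the corresponding _G.apps entries.)
local function fresh_intel82599 ()
   package.loaded["apps.intel.intel_app"] = nil
   package.loaded["apps.intel.intel10g"] = nil
   return require("apps.intel.intel_app").Intel82599
end

local Intel82599_1 = fresh_intel82599()
local Intel82599_2 = fresh_intel82599()
```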
Fascinating read! I'd like to be even half as fluent in the JIT tracings... About the sandboxing style: yes, in one of the first proposals you specified a file path to the app code. In the end it was left behind because the multi-[state/thread/process] schemes didn't really need any different style, but it could be easily resurrected if it's desired. I recall that the code style change for the apps was well received, but it was a non-trivial change for some of them.
@lukego Interesting, thanks for checking that. This could eliminate the problem for apps but you can still run into it with other modules that use some kind of instantiation mechanism (like I did with the Murmur hash function). So I think we need some programming guidelines anyway (it's probably not as complicated as one might think). Also, I wonder whether there could be situations where this approach backfires, i.e. cases in which forcing different traces actually degrades performance. None come to mind right now, but it's difficult to anticipate.
btw: This PR would close automatically when it merges onto master in a release. (I think it's okay to leave PRs open before they are released because it gives people a window of time to discuss the change and potentially suggest reverting it before release.)
I deleted the branch by accident.
no worries :)
I was reminded of this PR when I came across this note in the LuaJIT FFI API documentation today:
I'm trying to set a new record for the amount of explanation for a one-line patch :) But seriously, I've had a number of revelations while working on this and I have an urge to share them with you. This is, in fact, the main point for this PR (maybe I should have raised an issue instead?). The suggested fix is just a workaround.
To have a concrete case to talk about, I've analysed the trace structure of a simple Snabb application: the "shunt". It's a program that simply short-circuits two physical interfaces by passing all packets from one directly to the other and vice versa. It is run via `snsh`. Note that I use `engine.busywait = true`. There is actually a minor problem of garbage being created by `pace_breathing()`. I'll submit a PR on that eventually.

I executed it in the Snabb lab on host interlaken, using two pairs of looped interfaces 82:00.{0,1} and 83:00.{0,1}. Packets were generated using `packetblaster`.
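For reference, a minimal sketch of what such a shunt script can look like (the PCI addresses and app names here are placeholders, not the actual lab setup; compare the snippet posted earlier in this thread):

```lua
-- Sketch of a "shunt" snsh script: two Intel82599 instances
-- cross-connected so that packets received on one NIC are transmitted
-- on the other, and vice versa.
local Intel82599 = require("apps.intel.intel_app").Intel82599

local c = config.new()
config.app(c, "nic1", Intel82599, { pciaddr = "0000:82:00.1" })
config.app(c, "nic2", Intel82599, { pciaddr = "0000:83:00.1" })
config.link(c, "nic1.tx -> nic2.rx")
config.link(c, "nic2.tx -> nic1.rx")

engine.configure(c)
engine.busywait = true
engine.main({ duration = 10, report = { showlinks = true, showload = true } })
```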
The 64.pcap file contains 64-byte packets including the Ethernet header (but not the CRC). Assuming the standard frame overhead of 24 bytes (8 bytes preamble/SOF, 4 bytes CRC, 12 bytes IPG), the maximum packet rate is thus 10^10/(88*8) = 14'204'545. The packetblasters don't quite achieve that. They peak at around 13'130'000. This is something I'd like to understand, but it's irrelevant for this analysis. Ideally, the shunt application will forward all packets and achieve ~26 Mpps. This should actually be realistic.
Running the shunt without the patch for 10 seconds yields
7.6 Mpps is not quite what we want to see. With the patch I get

20.2 Mpps is much better, but we still lose about 6 Mpps. Closer inspection shows that the packets are dropped by the NICs when the receive buffers are full. Again, this is something I'd like to understand but it's irrelevant for this analysis.
Now let's look at the JIT dump file https://drive.google.com/folderview?id=0ByZgkfz7rch4bXEzNW90elE2SEU&usp=sharing and figure out the trace dependencies:
The numbers correspond to the traces in the dump. The circles indicate "looping traces", i.e. traces that end with an actual loop (I'm not sure what the proper nomenclature is). The other traces do not contain loops. They are all called via exits of other traces. I call them "side traces".
The looping traces are ordered by the amount of CPU cycles consumed by them, highest usage at the top. Here is the complete breakdown according to `perf top` (trace 34 uses so few cycles that it doesn't show up here):

The structure is pretty much ideal: there is a small number of looping traces connected by short side traces. The entire graph is fully connected, i.e. there are no transitions to the interpreter anywhere. It's 100% compiled.
The looping traces represent the handful of inner loops in the code, like iterating over ring buffers or allocating fresh packets for the receive buffers. It is noteworthy that the breathe loop itself, as well as the inhale and exhale loops, are effectively unrolled. Most of that code is in traces 28 and 31 and the actual loops are implicit in the sequence in which the traces are connected. This is quite fascinating :) You'll find descriptions of all traces at the end.
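For orientation, the shape of `breathe()` is roughly the following (a simplified sketch, not the actual engine code; `app_array`, `with_restart()` and `has_new_input()` are stand-ins for the engine's internal state and the per-link `has_new_data` check):

```lua
-- Stand-ins for the engine's internal state (illustrative only):
local app_array = { }                                        -- configured app instances
local function with_restart (app, method) return method(app) end
local function has_new_input (app) return true end

-- Simplified sketch of the engine's breathe(): an "inhale" loop that
-- lets every app pull new packets in, followed by an "exhale" loop
-- that lets apps with new input push packets onward.
local function breathe ()
   -- inhale
   for _, app in ipairs(app_array) do
      if app.pull then with_restart(app, app.pull) end
   end
   -- exhale
   for _, app in ipairs(app_array) do
      if app.push and has_new_input(app) then
         with_restart(app, app.push)
      end
   end
end
```

In the trace graph this loop structure does not show up as explicit loops; it is unrolled across traces 28 and 31 as described above.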
So, what about the patch? One thing to note in the trace analysis is that both instances of the `Intel82599` app can be executed by the exact same code. There are no branches due to the fact that they are distinct objects. Without the patch, the trace structure is much more complex. Here is an excerpt of a trace that contains a call to `add_receive_buffer()`:

`rxdesc` is a cdata object and the compiler inserts a guard that checks whether the ctype ID is the one that was registered when the trace was recorded (+1349). According to the dump, this guard is triggered during the execution of the program and a side trace is generated:

This trace is executed when the ctype ID of `rxdesc` is equal to 1354. How does this come about? The `rxdesc` cdata objects are created in `M_sf:init_dma_memory()` via `lib.bounds_checked()`. Inspection of that function reveals that the innocent-looking `ffi.typeof()` in the first argument to `ffi.metatype()` creates a different ctype every time, even if the fundamental cdata type passed to the function is the same (`rxdesc_t` in this case).

As a result, the two instances of `Intel82599` can no longer be executed by the same traces. The first instance for which the trace happens to be compiled will be executed in the fast path, while the second instance will branch to a side trace where it suffers heavily from trace-setup overhead and un-optimized code. The result is devastating and leads to a performance drop of over 60%.

The simple fix is to remove the `bounds_checked()` wrapper from the `rxdesc` object. Re-writing the code to avoid generating different ctypes is left as an exercise to the reader :) Actually, checking bounds for every single register access is probably overkill. We could skip that (like with my patch) or enable it only for debug mode.

The lesson to learn is that if we use multiple instances of an object in the same Snabb program, we must make sure that any cdata objects referenced in the inner packet-processing loop have the same ctype in all instances of the object. I've already fixed such a problem in my Murmur hash code, for example.
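To make the mechanism concrete, here is a small standalone illustration (this is not the actual `lib.bounds_checked()` code; `rxdesc_t` below is a stand-in type and the wrapper functions are made up). Every call of `ffi.typeof()` on a struct declaration creates a brand-new anonymous ctype, whereas a ctype created once and shared keeps a single ID:

```lua
local ffi = require("ffi")

-- Pitfall: parsing a struct declaration per instance. Each call yields
-- a distinct anonymous ctype, so cdata created by different instances
-- fail each other's ctype guards in compiled traces.
local function make_wrapper_bad (elem_t)
   return ffi.typeof("struct { $ *ptr; }", elem_t)
end

print(make_wrapper_bad(ffi.typeof("int")))  --> e.g. ctype<struct 96>
print(make_wrapper_bad(ffi.typeof("int")))  --> a different ctype, e.g. ctype<struct 98>

-- Fix: create the ctype once (e.g. at module load time) and share it
-- between all instances, so they all produce cdata of the same ctype.
local rxdesc_t     = ffi.typeof("struct { uint64_t address, flags; }")  -- stand-in
local wrapper_t    = ffi.typeof("struct { $ *ptr; }", rxdesc_t)
local rxdesc_ptr_t = ffi.typeof("$ *", rxdesc_t)

local function make_wrapper_good (base)
   return wrapper_t(ffi.cast(rxdesc_ptr_t, base))
end

local ring_t = ffi.typeof("$[4]", rxdesc_t)
local ring   = ring_t()
print(ffi.typeof(make_wrapper_good(ring)))  -- same ctype on every call
print(ffi.typeof(make_wrapper_good(ring)))  -- same ctype again
```

Presumably the actual `lib.bounds_checked()` could apply the same idea by memoizing the wrapper ctype per element type instead of constructing it anew for every ring.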
Finally, here is some more information about the traces of the shunt run.
Looping traces:
Trace 23
Starts at the for loop in `Intel82599:pull()`. There are always at least 128 packets ready, so the only exit ever taken is #19 when the iteration counter reaches 128.

Trace 27
Starts at the while loop in `M_sf:sync_transmit()`. The only possible exit is #8 when `old_tdh == self.tdh`.

Trace 25
Starts at the while loop in `Intel82599:push()`. There is always enough capacity on the egress NIC to transmit all packets queued from the ingress NIC, i.e. the loop is always terminated via exit #15 when `empty(l)` becomes true.

Trace 17
Starts at the while loop in `Intel82599:add_receive_buffers()`. The loop is always terminated via exit #15 when all available descriptors in the hardware receive buffer have been filled with fresh packets.

Trace 34
Starts at the for loop in `commit()` of `core/counter.lua`. Note that the trace is aborted twice during recording until the compiler has sorted out the nested loops from the calling trace (the dump doesn't contain enough context to see the complete state of the trace recorder). The loop stops via exit #5 when the iteration counter reaches `#public`.
.Non-looping traces (with parent/exit from which they are called)
Trace 29 (23/19)
It starts just after the loop in `Intel82599:pull()` with the call to `self:add_receive_buffers()` and just contains the very first call to `can_add_receive_buffer()` before it connects with the actual loop in trace 17.

Trace 28 (17/15)
This trace starts with the return at `M_sf:can_add_receive_buffer()` after trace 17 has finished. It returns via `with_restart()` to the "inhale" loop in `breathe()`, which has just finished calling the `pull()` method of one instance of the `Intel82599` app. If no side exit is taken, the trace continues to execute the first unrolled iteration of the "exhale" loop, which executes the `push()` method of one of the instances of `Intel82599` (as usual, it only contains the code up to the while loop and then hands over to trace 25 for the loop).

In case the inhale loop is not yet finished, the trace is left at exit #8 to call the `pull()` method of the other instance of `Intel82599` via trace 23.

Trace 30 (25/15)
The trace starts with the return from the call to `empty(l)` that terminated the loop of trace 25. It continues with the call to `sync_transmit()` up to the conditional statement preceding the while loop and then connects to trace 27 to execute the loop.
Trace 31 (27/8)
The trace picks up after the loop of trace 27 has terminated by advancing the transmit descriptor tail. It continues to return to the "exhale" loop in `breathe()` from which it has been called. When no side exits are taken, the trace represents the last iteration of the exhale loop and it continues to update the breathe counter. The condition `counter.read(breaths) % 100 == 0` is not fulfilled and `counter.commit()` is skipped. The trace returns to the main breath loop in `app.main()`, where it executes `timer.run()` and skips `pace_breathing()` because busywait is set to true. `timer.run()` doesn't do anything either because no timers have been activated. It then executes the next call to `breathe()` with another "inhale" cycle, which calls `Intel82599:pull()` for one of the instances of the Intel82599 apps. It includes everything up to the for loop and then connects to trace 23 to execute the loop itself.

This is what happens when trace 31 is executed to the end. Among all 33 possible exits, only 2 are actually taken in some cases.

Exit #8 is taken when the exhale loop is not yet finished, i.e. when the next link in the sequence has `has_new_data` set to true. The trace then branches off to trace 33 to perform a call to the `push()` method of the other instance of the `Intel82599` driver.

Exit #17 is taken when the condition `counter.read(breaths) % 100 == 0` is true. Control is transferred via trace 36 to trace 34, which executes the loop in `counter.commit()`. It then continues with trace 38, which contains the same code as the remainder of trace 31 that was skipped due to the taking of exit #17.
Called from exit #8 of trace 28, this trace contains the code of
Intel82599:pull()
up to the while loop, which is executed by trace 23.Trace 33 (31/8)
Called from exit #8 of trace 31, this trace contains the code of
Intel82599:push()
up to the while loop, which is executed by trace 25.Trace 36 (31/17)
Executes the call to
counter.commit()
which is basically just a transition to trace 34, which executes the actual loop.Trace 38 (34/5)
This trace contains the end of trace 31 after exit #17. It basically helps to splice traces 36/34 into the path of trace 31 when that particular branch is taken.