net: "queue: Use k_poll if enabled" commit regressed BSD Sockets performance #4022
Comments
by Paul Sokolovsky: Luiz Augusto von Dentz: FYI I'm working on a native Zephyr testcase.
by Luiz Augusto von Dentz: So what exactly is this, some sort of benchmark? Anyway, the k_poll version should be getting similar numbers, so perhaps something else is causing the problem; 1 second is awfully long for k_fifo_get.
by Paul Sokolovsky: Luiz Augusto von Dentz:
It uses ab, Apache Bench, which is indeed a well-known and accessible HTTP benchmark tool. But I use it not as a benchmark tool, but as an HTTP request generator for a kind of smoke load testing: if the server can sustain 1000 requests, it can move on to the next stage of testing with 10000 requests, and if it can sustain that, then perhaps it works (based on the anecdote about production testing of Android phones: a phone that can sustain 24 hours of testing with the "monkey" tool gets the "ship it!" command). So, as promised, I prepared a native Zephyr example: #1378 The way to test it:
by Paul Sokolovsky:
Perhaps something else is causing it, like a race condition or a bug in the scheduler. But 84db641 is exactly what triggered this regression, so I'd appreciate help with investigating it (I for one have no idea how to approach it).
by Paul Sokolovsky: More info: with the following patch:
Slowdowns happen much less frequently: none for 100 requests, and for 1000 it's:
by Paul Sokolovsky: Actually, having only the printk("p out\n") line active is enough to alleviate the slowdown.
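To make the experiment above concrete, here is a minimal, hedged sketch of a k_fifo consumer built on k_poll() with a diagnostic printk placed right after the k_poll() call, i.e. the spot where adding output was observed to alleviate the slowdowns. It is an illustration only, not the actual patch; the fifo name, the function name, and the include paths (which vary across Zephyr versions) are assumptions.

```c
/* Illustrative sketch only, not the actual instrumentation patch.
 * A k_fifo consumer built on k_poll(), with a printk right after the
 * k_poll() call. Include paths and integer types follow the Zephyr of
 * that era (1.9-ish) and may differ on newer trees.
 */
#include <zephyr.h>
#include <misc/printk.h>   /* <sys/printk.h> on newer Zephyr */

K_FIFO_DEFINE(demo_fifo);  /* placeholder name */

static void *fifo_get_with_trace(s32_t timeout)
{
	struct k_poll_event event;
	int ret;

	k_poll_event_init(&event, K_POLL_TYPE_FIFO_DATA_AVAILABLE,
			  K_POLL_MODE_NOTIFY_ONLY, &demo_fifo);

	ret = k_poll(&event, 1, timeout);

	printk("p out\n");      /* the diagnostic output that "helps" */
	/* k_busy_wait(500); */ /* busy-wait variant tried in a later comment */

	if (ret != 0) {
		return NULL;    /* timeout or error */
	}

	return k_fifo_get(&demo_fifo, K_NO_WAIT);
}
```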
by Paul Sokolovsky: Further update: all of the above happens on qemu_x86. I tested on real hardware, frdm_k64f, and can't reproduce:
This is the worst of it, of course: you can't trust your development tool (the emulator). However, there are still other points of difference: architecture and interface speed (SLIP vs. Ethernet). Will try on qemu_cortex_m3 next.
by Paul Sokolovsky:
That almost helped. The first run with -n10000 went perfectly. Repeating it, the test was aborted with "conn reset by peer", which is surely a different issue. Scaling down to -n1000, I still get regular:
In other words, qemu_cortex_m3 unchanged behaves like qemu_x86 with a printk added after the k_poll() call (above).
by Andy Ross: So... the hypothesis is that something in k_poll() (or the k_queue use of k_poll) is failing to wake up synchronously and we're hitting 1000ms timeouts, but only on some systems? Have we verified that the "slow downs" are indeed a timeout condition? The printk interaction seems weird; on qemu that's just a bunch of synchronous register writes, so no changes to interrupt delivery order should be happening. Can you replace the printk() with a k_busy_wait() and fix it too? Do we know which queue is involved? Have we ruled out delays on insert from some other source (seems unlikely, I guess)?
by Andy Ross: I spent some time staring at the queue and poll implementations and can't see a race. But here's one shot in the dark: Ben's code has some subtle locking at the entry to k_poll() to allow the initialization of the (potentially large) poll array to be done without the IRQ lock held for the whole process. I can't see anything wrong with it, but the fact that the printk() above "fixes" the issue implies that the race might be somewhere in this vicinity: a "wakeup event" happens "very soon, maybe already" relative to the k_poll() call, and simply delaying it a little causes poll to hit the latter code path instead of the former. If you have the rig ready, can you try this experiment and see if it affects anything? It wouldn't be a fix (it simply keeps the lock the whole time and thus is subject to latency problems with big poll arrays), but if it works it helps to localize the problem:
by Paul Sokolovsky: Andy Ross: Thanks much for looking into this! Let me start from the end, by trying the patch you gave. Somehow, it didn't apply for me as is, so I redid it manually, catching another case of irq_lock to remove; see my exact patch below. So, unfortunately, it didn't resolve the issue:
The patch:
by Paul Sokolovsky:
Well, it can be even worse than that:
Otherwise yes, the hypothesis is that with slow enough packet speeds (like with SLIP), there's some race or something similar leading to an "extra sleep".
by Paul Sokolovsky:
Well, that was exactly my question: where does this 1s (1000ms) delay even come from? One such figure I know of is CONFIG_NET_TCP_ACK_TIMEOUT. I changed it to 500, but no, the delays are still integer multiples of 1000ms (1s, 2s, 3s, 4s seen). Grepping through .config, I don't see other "1000" figures; I would need to grep through the codebase for hardcoded values.
by Paul Sokolovsky:
Even before, I had tried k_yield() there, as seen (with some worry on my part) in some of Jukka Rissanen's networking patches. It didn't help visibly. With k_busy_wait(500) after k_poll() and 1000 requests it's:
With k_busy_wait(1000) after k_poll() and 1000 requests it's:
With k_busy_wait(2000) after k_poll() and 1000 requests it's:
So, k_busy_wait() definitely "helps", but doesn't resolve the issue, and the behavior isn't even a monotonic function of the delay used (of course, there's enough variability/nondeterminism in the first place).
by Paul Sokolovsky:
The exact one - no, but (at least) the following queues are involved:
I modified the code to print k_uptime_get_32() at some points in the socket code, and here is what I see:
So, the first 2 connections got served without a delay, then there was 1s between the following 2. But once zsock_accepted_cb() is called (which does a k_fifo_put), there's no big delay until that connection is served; the delay happens between connection acceptance and serving. Will patch in more printk's...
What do you mean/have in mind? The processing should be as above: a driver puts a packet into the RX queue, from which it's taken by the RX thread and routed to either the accept or the recv callback, both of which put it into other queues (which are actually union'ed, as their lifetimes don't overlap), from which application code pulls it via calls to accept() or recv().
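For readers unfamiliar with the stack, here is a hedged sketch of the queue topology just described, using invented placeholder names (fake_pkt, sock_rx_queue, driver_rx, and so on are illustrations, not the real identifiers): a driver feeds rx_queue, the RX thread moves packets onward to a per-socket queue, and the application blocks on that queue in accept()/recv().

```c
/* Hedged sketch of the data flow described above; it mirrors the
 * structure (driver -> rx_queue -> RX thread -> per-socket queue ->
 * application), not the actual Zephyr sources.
 */
#include <zephyr.h>

/* Placeholder packet; the first word of a k_fifo item is reserved
 * for the kernel's use.
 */
struct fake_pkt {
	void *fifo_reserved;
	u8_t data[64];
};

K_FIFO_DEFINE(rx_queue);       /* filled by the driver (e.g. SLIP) */
K_FIFO_DEFINE(sock_rx_queue);  /* drained by the app via accept()/recv() */

/* Driver side: hand a received packet to the RX thread. */
static void driver_rx(struct fake_pkt *pkt)
{
	k_fifo_put(&rx_queue, pkt);
}

/* RX thread: route packets onward; in the real stack this ends up in
 * zsock_accepted_cb() or the recv callback, both of which do a k_fifo_put.
 */
static void rx_thread(void *a, void *b, void *c)
{
	for (;;) {
		struct fake_pkt *pkt = k_fifo_get(&rx_queue, K_FOREVER);

		k_fifo_put(&sock_rx_queue, pkt);
	}
}

/* Application side: roughly where accept()/recv() block on the queue. */
static struct fake_pkt *app_wait(void)
{
	return k_fifo_get(&sock_rx_queue, K_FOREVER);
}
```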
by Paul Sokolovsky: Luiz Augusto von Dentz on IRC suggested:
OK, so I enabled in the config:
CONFIG_NET_BUF_WARN_ALLOC_INTERVAL itself defaults to 1. Nothing is being logged. I verified that with CONFIG_SYS_LOG_NET_BUF_LEVEL=4 I do get debug spam.
As an update, my further investigation so far shows that it may indeed be a (not so) usual case of the butterfly effect: a change in one place affects something not directly related to it. In this case, it seems that the SLIP driver runs out of free RX buffers, which causes received packets to be dropped, which causes the sender (the Linux host) to wait and then retransmit, and that is what causes the 1s delays. It's still a late-night hacking hypothesis, however. Two issues in that regard:
For testing with QEMU you can increase the number of buffers much more than with real hardware. So if you want to measure performance issues, increase the number of buffers to several hundred or so.
Further investigation showed that packets don't "spend more time in the queue"; they're actually leaked. In a typical situation, 2 of the 4 available RX net_pkt's get leaked, which causes a severe performance regression.
And here's how it looks through printf-style debugging:
The above shows that at one point, pkt 0x00409738 gets added to rx_queue, and indeed, the queue contains that packet. Next, pkt 0x004097b0 gets added, but after this operation, the queue is empty. Despite that, the receive thread gets the original pkt 0x00409738, but not the heisen-pkt 0x004097b0. Then pkt 0x004097ec is added, and retrieved by the thread. 0x004097b0 is lost forever.
Magic spell to address zephyrproject-rtos#4022. Signed-off-by: Paul Sokolovsky <[email protected]>
So, with outcomes like in the previous comment, the only reasonable explanation was that something trashes (NULLs) the "head" pointer of the fifo. And not just trashes it, but saves, NULLs, and then restores it; that would be the only (still partial, not fully consistent) explanation of how the put didn't work but the get still did. I did some more twists manually which didn't help (e.g. I noticed that rx_queue is preceded in the memory map by a SLIP structure, raising the suspicion of an overflow, but padding it didn't help).

Then I finally went into the debugger (which is in general not a panacea, because apps often behave differently under a debugger). The idea was to watch the rx_queue.[...].head pointer, specifically for the case when the head pointer is NULL but the tail pointer is non-NULL. Through the expected random crashes and misbehavior under the debugger, I arrived at sys_slist_get_not_empty() hitting my watchpoint, because it sets head first, then tail. So, I patched it to do it in reverse order, and initially just refactored the expression that sets tail based on this code motion. The run under the debugger produced some funky behavior, so I even thought that maybe I had made a mistake during the expression refactor. I went to look at sys_slist_get_not_empty() closely and saw that its code is "superfluous": if it's run against a node which is == ->tail, that node is the only node, and the list becomes empty. So, I changed the function accordingly. The run under the debugger no longer hit my watchpoint. I ran the test again, and the problem disappeared the same way it had appeared: with a seemingly unrelated change.

Adding a bit of rationality, this would mean that we either have some race condition during queue operations (which shouldn't be there, since we have everything irq_lock protected, right?), or there's a weird issue with QEMU which exposes a race where it shouldn't be (as a reminder, I wasn't able to reproduce this on real HW under the same conditions as on QEMU (~10s of running). A colleague looked into this and reported that he was able to "reproduce after ~30mins of running on frdm_k64f", but there are no further details, so it may as well be a different issue).
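To illustrate the ordering issue and the rework described above, here is a simplified stand-in with its own minimal list type; it is not the actual Zephyr sys_slist source, only a sketch of the two shapes of the "get" operation.

```c
/* Simplified stand-in for the singly-linked list "get" operation
 * discussed above. An empty list has head == NULL; the last node's
 * next pointer is NULL.
 */
#include <stddef.h>

struct node {
	struct node *next;
};

struct slist {
	struct node *head;
	struct node *tail;
};

/* Original shape: head is written first, tail afterwards. When the last
 * node is removed, there is a window where head is already NULL while
 * tail is still non-NULL -- exactly the state the watchpoint was set on.
 */
struct node *get_not_empty_original(struct slist *list)
{
	struct node *node = list->head;

	list->head = node->next;
	if (list->tail == node) {
		list->tail = list->head;   /* i.e. NULL: list is now empty */
	}
	return node;
}

/* Reworked shape per the comment above: if the node being removed is the
 * tail, it was the only node, so the list simply becomes empty; tail is
 * only written in that one case, and the inconsistent head/tail window
 * is gone.
 */
struct node *get_not_empty_reworked(struct slist *list)
{
	struct node *node = list->head;

	if (list->tail == node) {
		list->head = NULL;
		list->tail = NULL;
	} else {
		list->head = node->next;
	}
	return node;
}
```

Note that even the reworked ordering only narrows the window; as the next comment points out, the real problem is that this path was running without irq_lock held at all.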
Ahahah, fun, we actually don't have it all protected by irq_lock: https://github.com/zephyrproject-rtos/zephyr/blob/master/kernel/queue.c#L215 . @Vudentz : Get your prize ;-).
All sys_slist_*() functions aren't threadsafe and calls to them must be protected with irq_lock. This is usually done in a wider caller context, but k_queue_poll() is called with irq_lock already relinquished, and is thus subject to hard to detect and explain race conditions, as e.g. was tracked in zephyrproject-rtos#4022. Signed-off-by: Paul Sokolovsky <[email protected]>
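The commit message above describes the eventual fix (see the "Fixed by" note below): take the IRQ lock around the sys_slist access performed after k_poll() returns. A hedged sketch of the shape of that change follows; the field name data_q, the poll event type, and the timeout type follow the kernel of that era and are approximations, not the verbatim source.

```c
#include <zephyr.h>

/* Sketch of the fix: take the IRQ lock around the sys_slist access that
 * the k_poll-based queue-get path performs after k_poll() returns.
 */
static void *queue_poll_locked(struct k_queue *queue, s32_t timeout)
{
	struct k_poll_event event;
	unsigned int key;
	void *data = NULL;

	k_poll_event_init(&event, K_POLL_TYPE_FIFO_DATA_AVAILABLE,
			  K_POLL_MODE_NOTIFY_ONLY, queue);

	if (k_poll(&event, 1, timeout) == 0) {
		/* sys_slist_*() isn't thread-safe; previously this ran
		 * with the IRQ lock already released, allowing the race.
		 */
		key = irq_lock();
		data = sys_slist_get(&queue->data_q);
		irq_unlock(key);
	}

	return data;
}
```

With the lock held across the list manipulation, the inconsistent head/tail window observed earlier can no longer be seen by another context.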
Fixed by #4350.
Reported by Paul Sokolovsky:
With a sample "http_server" written in MicroPython, with BSD Sockets as the underlying implementation, running on qemu_x86, I get:
[UPDATE: A native Zephyr testcase is available below: https://github.com//issues/4022#issuecomment-331659689]
With "git checkout 84db641"
With "git checkout 84db641^" (note ^ at the end)
The Sockets implementation heavily relies on FIFOs (and uses k_poll on FIFOs too).
Per the above, there's a clear case of "some FIFO operation took +1s with the k_poll-based k_fifo_get vs. the previous wait_q based impl". (This was later proven to not be the case; see #4022 (comment).)
(Imported from Jira ZEP-2593)