net: "queue: Use k_poll if enabled" commit regressed BSD Sockets performance #4022

Closed
zephyrbot opened this issue Sep 1, 2017 · 23 comments
Labels: area: Networking, bug, priority: medium

@zephyrbot
Collaborator

zephyrbot commented Sep 1, 2017

Reported by Paul Sokolovsky:

With a sample "http_server" written in MicroPython, with BSD Sockets as the underlying implementation, running on qemu_x86, I get:

[UPDATE: A native Zephyr testcase is available below: https://github.com//issues/4022#issuecomment-331659689]

With "git checkout 84db641"

$ ab -n200 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests


Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        27 bytes

Concurrency Level:      1
Time taken for tests:   52.057 seconds
Complete requests:      200
Failed requests:        190
   (Connect: 0, Receive: 0, Length: 190, Exceptions: 0)
Total transferred:      9090 bytes
HTML transferred:       5690 bytes
Requests per second:    3.84 [#/sec] (mean)
Time per request:       260.283 [ms] (mean)
Time per request:       260.283 [ms] (mean, across all concurrent requests)
Transfer rate:          0.17 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        6  249 431.0      7    1029
Processing:     8   11   2.6     11      29
Waiting:        8   10   2.1     10      27
Total:         15  260 430.9     18    1040

Percentage of the requests served within a certain time (ms)
  50%     18
  66%     21
  75%     36
  80%   1022
  90%   1025
  95%   1026
  98%   1036
  99%   1037
 100%   1040 (longest request)

With "git checkout 84db641^" (note ^ at the end)

$ ab -n200 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests


Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        27 bytes

Concurrency Level:      1
Time taken for tests:   3.712 seconds
Complete requests:      200
Failed requests:        190
   (Connect: 0, Receive: 0, Length: 190, Exceptions: 0)
Total transferred:      9090 bytes
HTML transferred:       5690 bytes
Requests per second:    53.88 [#/sec] (mean)
Time per request:       18.561 [ms] (mean)
Time per request:       18.561 [ms] (mean, across all concurrent requests)
Transfer rate:          2.39 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        6    7   1.8      7      23
Processing:     8   11   2.8     11      38
Waiting:        7   10   2.4      9      35
Total:         15   19   3.3     18      44

Percentage of the requests served within a certain time (ms)
  50%     18
  66%     19
  75%     19
  80%     19
  90%     21
  95%     23
  98%     31
  99%     34
 100%     44 (longest request)

The sockets implementation relies heavily on FIFOs (and uses k_poll on a FIFO too). Per the above, there's a clear case of "some FIFO operation taking an extra ~1s with the k_poll-based k_fifo_get vs. the previous wait_q-based implementation". [UPDATE: this was proven to not be the case; see #4022 (comment).]

84db641

(Imported from Jira ZEP-2593)
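
For context, commit 84db641 routes k_fifo_get() through a k_poll()-based helper in kernel/queue.c when CONFIG_POLL is enabled. A rough sketch of that path, reconstructed from the patches quoted later in this thread (not verbatim upstream code):

static void *k_queue_poll(struct k_queue *queue, s32_t timeout)
{
        struct k_poll_event event;
        int err;

        k_poll_event_init(&event, K_POLL_TYPE_FIFO_DATA_AVAILABLE,
                          K_POLL_MODE_NOTIFY_ONLY, queue);

        event.state = K_POLL_STATE_NOT_READY;

        /* block until the poller is signalled that data was appended,
         * or the timeout expires
         */
        err = k_poll(&event, 1, timeout);
        if (err) {
                return NULL;
        }

        /* pop an element; as established near the end of this thread,
         * this call was made without irq_lock held
         */
        return sys_slist_get(&queue->data_q);
}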

@zephyrbot
Collaborator Author

by Paul Sokolovsky:

Luiz Augusto von Dentz : FYI

I'm working on a native Zephyr testcase.

@zephyrbot
Collaborator Author

by Luiz Augusto von Dentz:

So what exactly is this, some sort of benchmark? Anyway, the k_poll version should be getting similar numbers, so perhaps something else is causing the problem; 1 second is awfully long for k_fifo_get.

@zephyrbot
Collaborator Author

zephyrbot commented Sep 6, 2017

by Paul Sokolovsky:

Luiz Augusto von Dentz :

So what exactly is this, some sort of benchmark?

It uses ab, Apache Bench, which is indeed a well-known and accessible HTTP benchmarking tool. But I use it not as a benchmark tool, but as an HTTP request generator for a kind of smoke load testing: if the server can sustain 1000 requests, it can move on to the next stage of testing with 10000 requests, and if it can sustain that, then perhaps it works (based on the anecdote about production testing of Android phones: a build that survives 24 hours of testing with the "monkey" tool gets the "ship it!" command).

So, as promised, I prepared a native Zephyr example: #1378

The way to test it:

  1. Build that sample, run in qemu, run:
$ ab -n100 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient).....done


Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        19 bytes

Concurrency Level:      1
Time taken for tests:   134.229 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      7800 bytes
HTML transferred:       1900 bytes
Requests per second:    0.74 [#/sec] (mean)
Time per request:       1342.293 [ms] (mean)
Time per request:       1342.293 [ms] (mean, across all concurrent requests)
Transfer rate:          0.06 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5  683 477.0   1013    1032
Processing:     8  659 1254.8     10    3139
Waiting:        7  657 1254.9      9    3138
Total:         14 1342 1495.2   1024    4154

Percentage of the requests served within a certain time (ms)
  50%   1024
  66%   1026
  75%   1027
  80%   4095
  90%   4097
  95%   4097
  98%   4124
  99%   4154
 100%   4154 (longest request)
  2. git checkout 84db641de66463550fb41f083dbdf0b02fc78284 and you get the same picture.

  3. git checkout 84db641de66463550fb41f083dbdf0b02fc78284^ and you get:

$ ab -n10000 http://192.0.2.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.0.2.1 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        
Server Hostname:        192.0.2.1
Server Port:            8080

Document Path:          /
Document Length:        19 bytes

Concurrency Level:      1
Time taken for tests:   173.989 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      780000 bytes
HTML transferred:       190000 bytes
Requests per second:    57.47 [#/sec] (mean)
Time per request:       17.399 [ms] (mean)
Time per request:       17.399 [ms] (mean, across all concurrent requests)
Transfer rate:          4.38 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5    7   1.6      7      26
Processing:     7   10   1.8     10      29
Waiting:        6    9   1.6      8      27
Total:         13   17   2.8     17      45

Percentage of the requests served within a certain time (ms)
  50%     17
  66%     18
  75%     18
  80%     19
  90%     21
  95%     23
  98%     25
  99%     28
 100%     45 (longest request)

@zephyrbot
Collaborator Author

by Paul Sokolovsky:

Anyway, the k_poll version should be getting similar numbers, so perhaps something else is causing the problem; 1 second is awfully long for k_fifo_get.

Perhaps something else is causing it, like a race condition or a bug in the scheduler. But 84db641 is exactly what triggered this regression, so I'd appreciate help with investigating it (I, for one, have no idea how to approach it).

@zephyrbot
Collaborator Author

by Paul Sokolovsky:

More info: with the following patch:

--- a/kernel/queue.c
+++ b/kernel/queue.c
@@ -205,7 +205,9 @@ static void *k_queue_poll(struct k_queue *queue, s32_t timeout)
 
        event.state = K_POLL_STATE_NOT_READY;
 
+printk("p in %p\n", queue);
        err = k_poll(&event, 1, timeout);
+printk("p out\n");
        if (err) {
                return NULL;
        }

Slowdowns happen much less frequently. Like, none for 100 requests, and for 1000 it's:

Percentage of the requests served within a certain time (ms)
  50%     16
  66%     17
  75%     18
  80%     18
  90%     20
  95%     21
  98%     23
  99%     24
 100%   1039 (longest request)

@zephyrbot
Collaborator Author

by Paul Sokolovsky:

Actually, having only the printk("p out\n") active is enough to alleviate the slowdown.

@zephyrbot
Collaborator Author

by Paul Sokolovsky:

Further update: all of the above happens on qemu_x86. I tested on real hardware, frdm_k64f, and can't reproduce it:

Complete requests:      10000
Failed requests:        0
Total transferred:      780000 bytes
HTML transferred:       190000 bytes
Requests per second:    155.79 [#/sec] (mean)
Time per request:       6.419 [ms] (mean)
Time per request:       6.419 [ms] (mean, across all concurrent requests)
Transfer rate:          11.87 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.1      1       4
Processing:     2    6   0.2      6      13
Waiting:        2    6   0.2      6      13
Total:          4    6   0.1      6      14

Percentage of the requests served within a certain time (ms)
  50%      6
  66%      6
  75%      6
  80%      6
  90%      6
  95%      6
  98%      7
  99%      7
 100%     14 (longest request)

This is the worst situation of all, of course: not being able to trust your development tool (the emulator). However, there are still other points of difference: architecture and interface speed (SLIP vs. Ethernet). Will try on qemu_cortex_m3 next.

@zephyrbot
Collaborator Author

by Paul Sokolovsky:

Will try on qemu_cortex_m3 next.

That almost helped. The first run with -n10000 went perfectly. On repeating it, the test was aborted with "connection reset by peer", which is surely a different issue. Scaling down to -n1000, I still regularly get:

  50%     17
  66%     20
  75%     22
  80%     23
  90%     27
  95%     32
  98%     38
  99%     43
 100%   1033 (longest request)

In other words, unmodified qemu_cortex_m3 behaves like qemu_x86 with the printk added after the k_poll() call (above).

@zephyrbot
Collaborator Author

by Andy Ross:

So... the hypothesis is that something in k_poll() (or the k_queue use of k_poll) is failing to wake up synchronously and we're hitting 1000ms timeouts, but only on some systems? Have we verified that the "slowdowns" are indeed a timeout condition? The printk interaction seems weird; on QEMU that's just a bunch of synchronous register writes, so no changes to interrupt delivery order should be happening. Can you replace the printk() with a k_busy_wait() and fix it too?

Do we know which queue is involved? Have we ruled out delays on insert from some other source (seems unlikely I guess).

@zephyrbot
Collaborator Author

by Andy Ross:

I spent some time staring at the queue and poll implementations and can't see a race. But here's one shot in the dark:

Ben's code has some subtle locking at the entry to k_poll() to allow the initialization of the (potentially large) poll array to be done without the IRQ lock held for the whole process. I can't see anything wrong with it, but the fact that the printk() above "fixes" the issue implies that the race might be somewhere in this vicinity: a "wakeup event" happens "very soon, maybe already" relative to the k_poll() call, and simply delaying it a little causes poll to hit the latter code path instead of the former.

If you have the rig ready, can you try this experiment and see if it affects anything? It wouldn't be a fix (it simply keeps the lock the whole time and thus is subject to latency problems with big poll arrays), but if it works it helps to localize the problem:

diff --git a/kernel/poll.c b/kernel/poll.c
index eb014ea..d15bbd7 100644
--- a/kernel/poll.c
+++ b/kernel/poll.c
@@ -199,7 +199,6 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
 
        key = irq_lock();
        set_polling_state(_current);
-       irq_unlock(key);
 
        struct _poller poller = { .thread = _current };
 
@@ -207,7 +206,6 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
        for (int ii = 0; ii < num_events; ii++) {
                u32_t state;
 
-               key = irq_lock();
                if (is_condition_met(&events[ii], &state)) {
                        set_event_ready(&events[ii], state);
                        clear_polling_state(_current);
@@ -219,11 +217,8 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
                                __ASSERT(0, "unexpected return code\n");
                        }
                }
-               irq_unlock(key);
        }
 
-       key = irq_lock();
-
        /*
         * If we're not polling anymore, it means that at least one event
         * condition is met, either when looping through the events here or

@zephyrbot
Collaborator Author

by Paul Sokolovsky:

Andy Ross: Thanks much for looking into this! Let me start from the end, with trying the patch you gave. Somehow it didn't apply for me as is, so I redid it manually, catching another case of irq_lock to remove; see my exact patch below.

So, unfortunately, it didn't resolve the issue:

Complete requests:      853
Percentage of the requests served within a certain time (ms)
  50%     31
  66%     34
  75%     38
  80%     42
  90%   1025
  95%   1046
  98%   1055
  99%   1058
 100%   4146 (longest request)

The patch:

diff --git a/kernel/poll.c b/kernel/poll.c
index eb014ea51..50add842f 100644
--- a/kernel/poll.c
+++ b/kernel/poll.c
@@ -199,7 +199,7 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
 
 	key = irq_lock();
 	set_polling_state(_current);
-	irq_unlock(key);
+//	irq_unlock(key);
 
 	struct _poller poller = { .thread = _current };
 
@@ -207,7 +207,7 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
 	for (int ii = 0; ii < num_events; ii++) {
 		u32_t state;
 
-		key = irq_lock();
+//		key = irq_lock();
 		if (is_condition_met(&events[ii], &state)) {
 			set_event_ready(&events[ii], state);
 			clear_polling_state(_current);
@@ -219,10 +219,10 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
 				__ASSERT(0, "unexpected return code\n");
 			}
 		}
-		irq_unlock(key);
+//		irq_unlock(key);
 	}
 
-	key = irq_lock();
+//	key = irq_lock();
 
 	/*
 	 * If we're not polling anymore, it means that at least one event
@@ -259,7 +259,7 @@ int k_poll(struct k_poll_event *events, int num_events, s32_t timeout)
 	 * added to the list of events that occurred, the user has to check the
 	 * return code first, which invalidates the whole list of event states.
 	 */
-	key = irq_lock();
+//	key = irq_lock();
 	clear_event_registrations(events, last_registered, key);
 	irq_unlock(key);
 

@zephyrbot
Collaborator Author

by Paul Sokolovsky:

So... the hypothesis is that something in k_poll() (or the k_queue use of k_poll) is failing to wake up synchronously and we're hitting 1000ms timeouts, but only on some systems?

Well, it can be even worse than that:

  1. Until it's reproduced by someone else, it could be an artifact of my particular system setup. (I mention this to cover myself, and to avoid a situation like the one with the UART busy-wait API, when I made too much noise about imagined issues with kernel scheduling.)
  2. Then again, it could still be an issue with QEMU emulation. There are known issues with how the UART is emulated, so why couldn't there be more corner cases? Testing on more hardware would help, but apart from frdm_k64f/Ethernet (which is "too fast") I have only 96b_carbon/BLE, which is "too slow" and may have its own issues. (It would be cool if we could connect a real board over serial/SLIP.)

Otherwise yes, the hypothesis is that with slow enough packet speeds (as with SLIP), there's some race or something leading to an "extra sleep".

@zephyrbot
Collaborator Author

by Paul Sokolovsky:

Have we verified that the "slowdowns" are indeed a timeout condition?

Well, that was exactly my question: where does this even 1s (1000ms) delay come from? One such figure I know of is CONFIG_NET_TCP_ACK_TIMEOUT. I changed it to 500, but no: the delays are still integer multiples of 1000ms (1s, 2s, 3s, 4s seen). Grepping through .config, I don't see other "1000" figures; I'd need to grep through the codebase for hardcoded values.

@zephyrbot
Collaborator Author

by Paul Sokolovsky:

Can you replace the printk() with a k_busy_wait() and fix it too?

Even before that, I had tried a k_yield() there, as seen (with some worry on my part) in some of Jukka Rissanen's networking patches. It didn't help visibly.

With k_busy_wait(500) after k_poll() and 1000 requests it's:

  50%     32
  66%     35
  75%     37
  80%     39
  90%     44
  95%     49
  98%     55
  99%     67
 100%   1052 (longest request)

With k_busy_wait(1000) after k_poll() and 1000 requests it's:

  50%     36
  66%     40
  75%     42
  80%     44
  90%     51
  95%     63
  98%   1050
  99%   1057
 100%   1091 (longest request)

With k_busy_wait(2000) after k_poll() and 1000 requests it's:

  50%     37
  66%     41
  75%     44
  80%     46
  90%     55
  95%     66
  98%     78
  99%     89
 100%    150 (longest request)

So, k_busy_wait() definitely "helps", but doesn't resolve the issue, and the behavior isn't even a monotonic function of the delay used (of course, there's enough variability/nondeterminism to begin with).
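
For reference, the delay in these experiments was inserted right after the k_poll() call in k_queue_poll(), analogous to the printk patch earlier in the thread; roughly:

        err = k_poll(&event, 1, timeout);
        k_busy_wait(500);       /* also tried 1000 and 2000, per the numbers above */
        if (err) {
                return NULL;
        }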

@zephyrbot
Collaborator Author

zephyrbot commented Sep 7, 2017

by Paul Sokolovsky:

Do we know which queue is involved?

The exact one, no, but (at least) the following queues are involved:

  • RX packet queue
  • Accept socket queue (new connection contexts go there)
  • Per-socket RX packet queue

I modified the code to print k_uptime_get_32() at some points in the socket code, and here is what I see:

38690 zsock_accepted_cb 0x004053b8
38690 Connection from 192.0.2.2
38710 Connection from 192.0.2.2 closed

38740 zsock_accepted_cb 0x004053b8
38750 Connection from 192.0.2.2
38760 Connection from 192.0.2.2 closed

39810 zsock_accepted_cb 0x004053b8
39810 Connection from 192.0.2.2
39830 Connection from 192.0.2.2 closed

40860 zsock_accepted_cb 0x004053b8
40870 Connection from 192.0.2.2
40870 Connection from 192.0.2.2 closed

So, the first 2 connections got served without delay, then there was 1s between the following 2. But once zsock_accepted_cb() is called (which does a k_fifo_put), there's no big delay until that connection is served; the delay happens between connection acceptance and serving. Will patch in more printk's...

Have we ruled out delays on insert from some other source (seems unlikely I guess).

What do you mean / have in mind? The processing should be as described above: a driver puts a packet into the RX queue, from which it's taken by the RX thread and routed to either an accept or a recv callback, both of which put it into other queues (which are actually union'ed, as their lifetimes don't overlap), from which the app code pulls it via calls to accept() or recv().
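
To make that flow concrete, here is a minimal, self-contained sketch of the hand-off pattern (the struct, queue, and thread names are made up for illustration and are not the actual net stack symbols):

#include <zephyr.h>

struct demo_pkt {
        void *fifo_reserved;    /* first word is reserved for k_fifo bookkeeping */
        int len;
};

K_FIFO_DEFINE(demo_rx_queue);       /* the "driver" puts received packets here */
K_FIFO_DEFINE(demo_sock_rx_queue);  /* the "RX thread" forwards them here for recv() */

/* stand-in for the RX thread: take a packet from the global RX queue and
 * hand it to the per-socket queue, as the accept/recv callbacks do
 */
static void demo_rx_thread(void *p1, void *p2, void *p3)
{
        for (;;) {
                struct demo_pkt *pkt = k_fifo_get(&demo_rx_queue, K_FOREVER);

                k_fifo_put(&demo_sock_rx_queue, pkt);
        }
}

K_THREAD_DEFINE(demo_rx_tid, 1024, demo_rx_thread, NULL, NULL, NULL, 7, 0, K_NO_WAIT);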

@zephyrbot
Collaborator Author

zephyrbot commented Sep 13, 2017

by Paul Sokolovsky:

Luiz Augusto von Dentz on IRC suggested:

vudentz> pfalcon, are you using CONFIG_NET_BUF_WARN_ALLOC_INTERVAL, that should catch were is the extra latency

Ok, so I enabled in the config:

CONFIG_NET_BUF_LOG=y
CONFIG_SYS_LOG_NET_BUF_LEVEL=3

CONFIG_NET_BUF_WARN_ALLOC_INTERVAL itself defaults to 1. Nothing is being logged. I verified that with CONFIG_SYS_LOG_NET_BUF_LEVEL=4 I get debug spam.

@zephyrbot added the area: Networking, bug, and priority: medium labels on Sep 23, 2017
@zephyrbot added this to the v1.10.0 milestone on Sep 23, 2017
@pfalcon self-assigned this on Oct 2, 2017
@pfalcon
Contributor

pfalcon commented Oct 2, 2017

As an update, my further investigation so far shows that this may indeed be a (not so) unusual case of the butterfly effect: a change in one place affects something not directly related to it. In this case, it seems that the SLIP driver runs out of free RX buffers, which causes received packets to be dropped, which causes the sender (the Linux host) to wait and then retransmit, and that's what causes the 1s delays. It's still a late-night hacking hypothesis, however. Two issues in that regard:

  • There's no logging of SLIP packet dropping, so the user has no idea where things go wrong.
  • It's still somewhat strange that the SLIP driver started running short of packets exactly after 84db641, so maybe that commit somehow changes the queue behavior so that packets spend longer in the queue instead of being processed, and that's what causes the shortage.

@jukkar
Member

jukkar commented Oct 9, 2017

it seems that the SLIP driver runs out of free RX buffers

For testing with QEMU you can increase the number of buffers much more than with real hardware. So if you want to measure performance issues, increase the number of buffers to several hundred or so.
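
For a QEMU/SLIP setup that means bumping the RX pool sizes in the sample's prj.conf; the option names below are my assumption of the relevant knobs (double-check them against the networking Kconfig in your tree):

# Assumed option names; verify against your tree's Kconfig.
CONFIG_NET_PKT_RX_COUNT=200
CONFIG_NET_BUF_RX_COUNT=200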

@pfalcon
Contributor

pfalcon commented Oct 14, 2017

so maybe that commit somehow changes the queue behavior so that packets spend longer in the queue instead of being processed, and that's what causes the shortage.

Further investigation showed that packets don't "spend more time in the queue"; they're actually leaked. In a typical situation, 2 of the 4 available RX net_pkts get leaked, which causes a severe performance regression.

@pfalcon
Contributor

pfalcon commented Oct 14, 2017

And here's how it looks through printf-style debugging:

fifo 0x00401068 iface 0x0041c3e0 pkt 0x00409738 len 54
 0x00409738 ==

slip: pkt=0x004097b0
fifo 0x00401068 iface 0x0041c3e0 pkt 0x004097b0 len 54
 ==
slip: pkt=0x004097ec
thr Received pkt 0x00409738

slab_free(0x0041c310, 0x004096fc)
slab_free(0x0041c2f4, 0x00409738)
from 192.0.2.2 closed

fifo 0x00401068 iface 0x0041c3e0 pkt 0x004097ec len 74
 0x004097ec ==
thr Received pkt 0x004097ec

The above shows that at one point pkt 0x00409738 gets added to rx_queue, and indeed the queue contains that packet. Next, pkt 0x004097b0 gets added, but after this operation the queue is empty. Despite that, the receive thread gets the original pkt 0x00409738, but not the heisen-pkt 0x004097b0. Then pkt 0x004097ec is added, and is received by the thread. 0x004097b0 is lost forever.

pfalcon added a commit to pfalcon/zephyr that referenced this issue Oct 16, 2017
Magic spell to address zephyrproject-rtos#4022.

Signed-off-by: Paul Sokolovsky <[email protected]>
@pfalcon
Contributor

pfalcon commented Oct 16, 2017

So, with outcomes like the one in the previous comment, the only reasonable explanation was that something trashes (NULLs) the "head" pointer of the FIFO. And not just trashes it, but saves it, NULLs it, and then restores it; that would be the only (still partial, ends-don't-meet) explanation of how the put didn't work but the get still did. I did some more manual twists which didn't help (e.g. I noticed that rx_queue is preceded in the memory map by a SLIP structure, raising the suspicion of an overflow, but padding it didn't help).

Then I finally went into the debugger (which in general is not a panacea, because apps often behave differently under a debugger). The idea was to watch the rx_queue.[...].head pointer, specifically for the case when the head pointer is NULL but the tail pointer is non-NULL. Through the expected random crashes and misbehavior under the debugger, I arrived at sys_slist_get_not_empty() hitting my watchpoint, because it sets first the head, then the tail. So, I patched it to do this in reverse order, and initially just refactored the expression that sets the tail to account for this code motion. The run under the debugger produced some funky behavior, so I even thought that maybe I had made a mistake during the expression refactoring. I then looked at sys_slist_get_not_empty() closely and saw that its code is "superfluous": if it's run against a node which is == ->tail, that node is the only node, and the list becomes empty. So, I made the function do list->tail = NULL, and that patch can now be seen as #4349.
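
For reference, a sketch of what #4349 amounts to, reconstructed from the description above (not verbatim upstream code):

static inline sys_snode_t *sys_slist_get_not_empty(sys_slist_t *list)
{
        sys_snode_t *node = list->head;

        list->head = node->next;
        if (list->tail == node) {
                /* node was the only element, so the list is now empty;
                 * previously the (already updated) head was assigned here,
                 * #4349 writes NULL directly
                 */
                list->tail = NULL;
        }

        return node;
}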

The run under the debugger no longer hit my watchpoint. I ran ab -n1000 without the debugger and didn't hit the issue: 1st run, 2nd, 3rd. Then I ran ab -n10000. I had a gazillion other debugging changes in my tree and started to deactivate them one by one, afraid of scaring away the fix butterfly. But the situation converged: #4349 is what it took to make the issue described in this ticket non-reproducible.

So, it disappeared the same way it had appeared: with a seemingly unrelated change. Adding a bit of rationality, that would mean we either have some race condition during queue operations (which shouldn't be there, since we have everything irq_lock protected, right?), or there's a weird issue with QEMU which exposes a race where it shouldn't be. (As a reminder, I wasn't able to reproduce this on real HW under the same conditions as on QEMU (~10s of running). A colleague looked into this and reported that he was able to "reproduce after ~30mins of running on frdm_k64f", but there are no further details, so it could well be a different issue.)

@pfalcon
Contributor

pfalcon commented Oct 16, 2017

(which shouldn't be there, since we have everything irq_lock protected, right?)

Ahahah, the fun part is that we actually don't: https://github.com/zephyrproject-rtos/zephyr/blob/master/kernel/queue.c#L215. @Vudentz: Get your prize ;-).
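
That unlocked call is what the fix referenced below addresses; conceptually, the dequeue in k_queue_poll() ends up wrapped like this (a sketch based on the commit message, not the exact diff):

        /* sys_slist_*() operations aren't threadsafe, so the dequeue must
         * happen under irq_lock like every other queue operation
         */
        key = irq_lock();
        data = sys_slist_get(&queue->data_q);
        irq_unlock(key);

        return data;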

pfalcon added a commit to pfalcon/zephyr that referenced this issue Oct 16, 2017
All sys_slist_*() functions aren't threadsafe and calls to them
must be protected with irq_lock. This is usually done in a wider
caller context, but k_queue_poll() is called with irq_lock already
relinquished, and is thus subject to hard to detect and explain
race conditions, as e.g. was tracked in zephyrproject-rtos#4022.

Signed-off-by: Paul Sokolovsky <[email protected]>
carlescufi pushed a commit to carlescufi/zephyr that referenced this issue Oct 16, 2017 (same commit message as above)
carlescufi pushed a commit that referenced this issue Oct 17, 2017 (same commit message as above)
carlescufi pushed a commit to carlescufi/zephyr that referenced this issue Oct 18, 2017 (same commit message as above)
nashif pushed a commit that referenced this issue Oct 18, 2017 (same commit message as above)
SebastianBoe pushed a commit to SebastianBoe/zephyr that referenced this issue Oct 19, 2017 (same commit message as above)
@pfalcon
Contributor

pfalcon commented Oct 24, 2017

Fixed by #4350.

@pfalcon closed this as completed on Oct 24, 2017