cpu/esp8266: Fixes and improvements of esp_wifi netdev driver #10862
Nice one, getting rid of the mutex and the (new) LWIP wrapper! However, I cannot achieve the same results as you did; here are mine:
The following are with 1280 B packets:
interval [s] | loss [%] |
---|---|
1.000 | 0.0 |
0.100 | 0.0 |
0.010 | 0.0 |
0.003 | 1.0 |
0.002 | 5.0 |
0.001 | 50.0 |
Reducing the packet size helps; the following are all with an interval of 0.001 s:
size [B] | loss [%] |
---|---|
1280 | 50.0 |
1024 | 42.0 |
768 | 35.0 |
512 | 30.0 |
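To put the intervals and sizes in the tables above into perspective, the offered load of a single ping stream can be estimated with simple arithmetic. The sketch below is only an illustration: the function name `offered_load_bps` is made up, and it counts only the ping payload (IPv6, ICMPv6 and 802.11 overhead as well as the echo replies are ignored), so the real load on the esp8266 is higher.

```c
#include <stdint.h>

/* Offered payload rate in bit/s for one ping stream.
 * Assumption: only the ping payload is counted; protocol headers and
 * the replies travelling in the other direction are ignored.
 * interval_us must be greater than 0. */
uint64_t offered_load_bps(uint32_t payload_bytes, uint32_t interval_us)
{
    return ((uint64_t)payload_bytes * 8u * 1000000u) / interval_us;
}
```

For example, 1280 B every 0.001 s corresponds to `offered_load_bps(1280, 1000)`, i.e. roughly 10 Mbit/s of payload in one direction, while 512 B at the same interval is roughly 4 Mbit/s.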
But after several ping trials in a row without reset, I got a kernel panic:
> 2019-01-25 09:40:40,124 - INFO # EXCEPTION!! exccause=28 (LoadProhibitedCause) @4022082a excvaddr=00000004
2019-01-25 09:40:40,132 - INFO # pid | name | state Q | pri | stack ( used) | base addr | current
2019-01-25 09:40:40,140 - INFO # - | isr_stack | - - | - | 2048 ( 1488) | 0x3ffe8110 | 0x3ffe8910
2019-01-25 09:40:40,148 - INFO # 1 | ets | bl anyfl _ | 1 | 1536 ( 1388) | 0x3fff27cc | 0x3fff2cd0
2019-01-25 09:40:40,156 - INFO # 2 | idle | pending Q | 31 | 1024 ( 220) | 0x3fff2e9c | 0x3fff31d0
2019-01-25 09:40:40,164 - INFO # 3 | main | bl mutex _ | 15 | 3072 ( 1420) | 0x3fff329c | 0x3fff3c30
2019-01-25 09:40:40,172 - INFO # 4 | pktdump | bl rx _ | 14 | 1024 ( 360) | 0x3fff6ed8 | 0x3fff7180
2019-01-25 09:40:40,180 - INFO # 5 | ipv6 | running Q | 12 | 1536 ( 1152) | 0x3fff4be4 | 0x3fff4fb0
2019-01-25 09:40:40,188 - INFO # 6 | udp | bl rx _ | 13 | 1024 ( 392) | 0x3fff7928 | 0x3fff7bb0
2019-01-25 09:40:40,196 - INFO # 7 | esp_wifi | bl mutex _ | 10 | 1536 ( 600) | 0x3fff3fa8 | 0x3fff43d0
2019-01-25 09:40:40,204 - INFO # 8 | RPL | bl rx _ | 13 | 1024 ( 324) | 0x3fff7524 | 0x3fff77f0
2019-01-25 09:40:40,210 - INFO # | SUM | | | 13824 ( 7344)
2019-01-25 09:40:40,212 - INFO # heap: 17000 (free 11448) byte
2019-01-25 09:40:40,216 - INFO # sysmem: 17000 (used 5552, free 11448)
2019-01-25 09:40:40,221 - INFO #
2019-01-25 09:40:40,225 - INFO # ets Jan 8 2013,rst cause:2, boot mode:(3,6)
2019-01-25 09:40:40,225 - INFO #
2019-01-25 09:40:40,249 - INFO # load 0x3ffe8000, len 3992, room 16
2019-01-25 09:40:40,253 - INFO # tail 8
2019-01-25 09:40:40,254 - INFO # chksum 0x41
2019-01-25 09:40:40,257 - INFO # load 0x3ffe8fa0, len 15616, room 0
2019-01-25 09:40:40,270 - INFO # tail 0
2019-01-25 09:40:40,277 - INFO # chksum 0x26
2019-01-25 09:40:40,277 - INFO # load 0x40100000, len 29308, room 8
2019-01-25 09:40:40,299 - INFO # tail 4
2019-01-25 09:40:40,300 - INFO # chksum 0x10
2019-01-25 09:40:40,301 - INFO # csum 0x10
2019-01-25 09:40:40,341 - INFO # ��p�n���o|�
$ll b�2
�|r�$��N�l ��d���
�mode : null
I cannot reproduce the panic on master, however sometimes I don't get any ping reply there and the device seems to be stuck - a reset helps.
The size reduction is impressive though:
40K less ROM and 2K less RAM |
So actually, the panic is better because it reboots the device; on master it's stuck and a manual reboot/reset is required. So not nice, but somehow better 😬 |
Hm, strange ... Maybe, it depends on the basic network load in the WLAN. I can even ping from two hosts in the LAN with maximum size as fast as they can without any loss:
You should see a noticeable performance improvement when you ping any host in the LAN from the esp8266 at the same time.
You should observe a lot of ping timeouts with the |
IMHO, the exception occurs only when the network load increases a lot. Problem 3 in issue #10861 describes it. I couldn't observe this exception when I used normal network load, pinging from three nodes:
Probably, the exception is caused by an interrupt occurring where interrupts shouldn't be allowed. It is quite possible that the exception is caused by the SDK. It is pretty well known that the SDK is very buggy. There are a lot of discussions on the net where several WiFi crashes are described. My guess is that the problem does not occur in RIOT/cpu/esp8266/esp-wifi/esp_wifi_netdev.c Lines 199 to 207 in 8984a9b
NETDEV_EVENT_ISR approach. Under heavy network load, the ets thread provides the next frame before the gnrc_netif thread esp_wifi is scheduled. In the worst case, every second frame is dropped.
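The "every second frame is dropped" worst case can be reproduced with a toy model of a single-frame receive buffer. This is only a sketch under stated assumptions, not the driver code: the type `rx_model_t` and the functions `rx_consume`, `rx_frame_arrives` and `rx_simulate` are hypothetical names, and thread scheduling is reduced to "the consumer runs after every `burst` arrivals".

```c
#include <stdbool.h>

/* Toy model of a driver with a single-frame receive buffer. */
typedef struct {
    bool slot_full;      /* true while a frame waits to be read */
    unsigned delivered;  /* frames handed on to the network stack */
    unsigned dropped;    /* frames lost because the slot was occupied */
} rx_model_t;

/* Consumer: stands in for the receive function reading the buffer. */
void rx_consume(rx_model_t *m)
{
    if (m->slot_full) {
        m->slot_full = false;
        m->delivered++;
    }
}

/* Producer: stands in for the receive callback of the WiFi firmware. */
void rx_frame_arrives(rx_model_t *m, bool direct_call)
{
    if (m->slot_full) {
        m->dropped++;        /* previous frame not yet read: drop this one */
        return;
    }
    m->slot_full = true;
    if (direct_call) {
        rx_consume(m);       /* read immediately, no event in between */
    }
}

/* Simulate n_frames arrivals. In deferred mode (direct_call == false) the
 * consumer only runs after every `burst` arrivals; burst must be >= 1. */
rx_model_t rx_simulate(unsigned n_frames, unsigned burst, bool direct_call)
{
    rx_model_t m = { false, 0, 0 };
    for (unsigned i = 1; i <= n_frames; i++) {
        rx_frame_arrives(&m, direct_call);
        if (!direct_call && (i % burst == 0)) {
            rx_consume(&m);
        }
    }
    return m;
}
```

With `rx_simulate(100, 2, false)` the model delivers 50 frames and drops 50, while `rx_simulate(100, 2, true)` delivers all 100, which matches the intuition behind calling the receive function directly from the callback.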
|
I'm sure that you could observe the same for this PR if you play long enough. Problems 2 and 4 in issue #10861 describe the possible cases. Either the send function is blocking completely (problem 2) or the gnrc packets are not released and the gnrc packet buffer becomes full (problem 4). The latter one, you would be able to check with |
Thanks for these further improvements!
When the ESP disconnects because of "overload" for me it actually never came back to a working state without reset. |
@sming @MichelRottleuthner Thanks for testing. I know how frustrating testing can be when problems only occur sporadically and are not reproducible.
The Shell freezes? Do you mean the esp8266 shell? I have never seen this.
This corresponds to problem 3 described in issue #10861. I inspected the code to figure out why this happens. Unfortunately, it happens at different addresses. I have tried to catch the exception; however, there is no OCD that works satisfactorily. Thus, I was not able to catch the exception. That is, all I can do is guess.
3.b) is an intended disconnect by 3.a) comes from the SDK and stands for authentication expired. This may happen either after an intended disconnect or after the AP sent a de-authentication message (reason 7). I can observe with
This is an error message from the SDK and occurs when the memory is exhausted. Nobody knows exactly what this error message stands for. My guess is that this message stands for "Error in Memory management only 48 bytes left" or something like that. Once 3.a) has occurred, in a noticeable number of cases, reconnecting fails continuously until the memory is exhausted. According to heap statistics, a lot of memory (several kBytes) is allocated with each reconnect, but this memory is not released if the reconnect fails. This seems to be a known bug in the SDK. The situation is quite frustrating and after some days of testing and trying, I don't know much more than before. The big question is how reliable it is under normal network conditions. Most of these problems only occur under heavy network load. Aaaaannnnnddd, some of them also occur in the |
esp_wifi
netdev driver
Partially good news. I believe the cause for the crash is here. Once the packet buffer is full on heavy network load, RIOT/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c Lines 94 to 102 in 250b7cb
@miri64 Have you already a changed version of gnrc_icmpv6_echo.c anywhere? I have provided PR #10869
|
@sming @MichelRottleuthner Using the following changes, I can ping the esp8266 node from 4 terminals with a maximum data size of 1392 bytes and an interval of 0 without any crashes until I run into problem 3 in issue #10861.
--- a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
+++ b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
@@ -93,6 +93,12 @@ void gnrc_icmpv6_echo_req_handle(gnrc_netif_t *netif, ipv6_hdr_t *ipv6_hdr,
     pkt = hdr;
     hdr = gnrc_netif_hdr_build(NULL, 0, NULL, 0);
+    if (hdr == NULL) {
+        DEBUG("icmpv6_echo: no space left in packet buffer\n");
+        gnrc_pktbuf_release(pkt);
+        return;
+    }
+
     if (netif != NULL) {
         ((gnrc_netif_hdr_t *)hdr->data)->if_pid = netif->pid;
     }
|
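The pattern behind this fix (check the result of an allocation that can fail, release what was already allocated, and return) can be shown in isolation. The following is a minimal sketch with a made-up two-slot pool; `slot_t`, `pool_alloc`, `pool_release`, `pool_free_slots` and `handle_echo_request` are hypothetical names and are not part of the GNRC packet buffer API.

```c
#include <stddef.h>

#define POOL_SLOTS 2   /* hypothetical pool size, tiny on purpose */

typedef struct {
    int in_use;
    char data[16];
} slot_t;

static slot_t pool[POOL_SLOTS];

/* Returns a free slot or NULL if the pool is exhausted. */
slot_t *pool_alloc(void)
{
    for (size_t i = 0; i < POOL_SLOTS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return &pool[i];
        }
    }
    return NULL;
}

void pool_release(slot_t *s)
{
    if (s != NULL) {
        s->in_use = 0;
    }
}

size_t pool_free_slots(void)
{
    size_t n = 0;
    for (size_t i = 0; i < POOL_SLOTS; i++) {
        n += pool[i].in_use ? 0 : 1;
    }
    return n;
}

/* Returns 0 if the reply could be built, -1 if it had to be dropped. */
int handle_echo_request(slot_t *pkt)
{
    slot_t *hdr = pool_alloc();
    if (hdr == NULL) {
        pool_release(pkt);   /* do not leak the reply, do not touch hdr */
        return -1;
    }
    /* ... the header would be filled in and the reply sent here ... */
    pool_release(hdr);
    pool_release(pkt);
    return 0;
}
```

Without the `NULL` check the handler would dereference `hdr` once the pool is full, which is consistent with a load exception at a small address such as the `excvaddr=00000004` in the log above, although that connection is an interpretation and not something the thread confirms.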
@gschorcht thanks a lot for the additional info!
Yes, the shell of the esp8266. It won't reply to any commands anymore, not even
Oh well, we cannot really solve problems of the SDK; hopefully some of this will improve with newer releases from Espressif.
I also think we should focus on normal scenarios for now.
I completely agree, stability is much more important here. But if this PR has the same stability as master I would still prefer "same stability with better performance". With your above fix applied I just ran a test from three terminals with ~3 seconds before starting the next command:
With that after around a minute or less I can reliably trigger a case where the ESP isn't sending replies anymore. If I then enter help on the ESP shell it lists all the commands as usual, but as soon as I type ifconfig it gets stuck (no ifconfig output printed, no reset, no crash) and the shell stops working. |
Since _esp_wifi_recv_cb is not executed in interrupt context but in the context of the `ets` thread, the receive function can be called directly. There is no need for a mutex anymore to synchronize the access to the receive buffer between _esp_wifi_recv_cb and _recv function.
Since _esp_wifi_recv_cb is not executed in interrupt context but in the context of the `ets` thread, it is not necessary to pass the `NETDEV_EVENT_ISR` event first. Instead, the receive function can be called directly, which results in much faster handling, a lower frame loss rate and more robustness.
Defines a number of lwIP functions that are required as symbols by Espressif's SDK libraries. These functions are only dummies without real functionality. Using these functions instead of the real lwIP functions provided with the SDK saves around 4 kBytes of RAM.
It seems to be more stable and less memory consuming to use auto reconnect policy.
Disconnecting from the AP in the send function if the lwIP packet buffer is exhausted is counterproductive since reconnecting usually fails on heavy network load. A better strategy is to slow down the sending of MAC frames from netif a bit to wait for flushing the buffer in the MAC layer.
It is not necessary to implement timeout handling in the send function or to disconnect from the AP if the lwIP packet buffer is exhausted. Waiting for the frame allocated in the lwIP packet buffer to be freed by the MAC layer led to the complete blockage of the send function on heavy network load. Disconnecting from the AP is counterproductive since reconnecting usually fails on heavy network load.
6dfa044 to fa48e3f
@sming @MichelRottleuthner After further intensive investigations and a lot of things I tried, I have finally a version of
On performance I could observe in my network:
From my point of view, this PR provides a version that is much more stable, much better and much more performant than the |
The situation where the firmware `lwIP` packet buffer is exhausted is an important indication that the traffic sent to and sent from the esp8266 is more than the esp8266 is able to handle. Therefore, it should be an error message.
I did a little bit more testing but I can only confirm the better performance and that I don't get the weird disconnect/unable to reconnect problems. However, I can reliably get the ESP into a state where it prints the following messages and will never come back to a working state without a full reset:
I can trigger this even with only a single instance pinging the ESP
That's what I get for the same settings: rtt min/avg/max/mdev = 2.468/3.131/57.639/1.366 ms, pipe 4, ipg/ewma 3.277/2.837 ms I think this could be related to the fact that I use hostapd directly on my computer so there is no additional hop (like AP/router) in between. Maybe that small difference is enough to trigger the cases where the ESP can't handle the workload anymore(?) Anyway - I don't want to block this PR and the fact that simultaneous bidirectional communication works a lot better with it makes it worthwhile already. @gschorcht would you mind having a look at the Murdock output? There seems to be an issue with the lwip/err.h include. |
Having the send function check that at least two maximum-size Ethernet frames fit in the remaining heap before the lwIP packet buffer is allocated seems to increase stability. This may be because WLAN hardware interrupts allocate additional memory when a frame is received during the send attempt.
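The heap check described in this commit message boils down to a single comparison. The sketch below only illustrates the idea: `heap_allows_send` and `ETH_FRAME_MAX_LEN` are hypothetical names, 1518 bytes is assumed as the maximum Ethernet frame size, and how the driver actually obtains the free heap size is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed maximum Ethernet frame size in bytes (header, payload, FCS). */
#define ETH_FRAME_MAX_LEN 1518u

/* Returns true if at least two maximum-size frames still fit in the heap,
 * i.e. there is room for the frame to be sent plus one frame that might be
 * received (and allocated from interrupt context) in the meantime. */
bool heap_allows_send(size_t free_heap_bytes)
{
    return free_heap_bytes >= 2u * ETH_FRAME_MAX_LEN;
}
```

With the 11448 bytes of free heap reported in the crash log earlier in this thread, `heap_allows_send(11448)` would allow the send; with less than 3036 bytes free, the send would be refused before any buffer is allocated.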
Strange, I have tried with
Even more, I can run it for two pings in parallel:
I have added an additional heap check in the |
I did the last tests with the latest version also with I was able to ping the esp8266 from 5 terminals at full speed |
Nice! With this version I can not trigger any lock-ups or reboots anymore :) -> ACK
Feel free to squash as needed (IMO most of the detailed commits can stay like that to see exactly what was fixed and how).
Thanks for your support. |
Since this PR is notably improving things and fixing bugs, I would be in favour of backporting it to 2019.01-branch. What do you think ? |
@aabadie While you commented I was writing the following "I would like to backport these fixes to the So, yes of course, I would like to backport it. |
Contribution description
This PR increases the performance and the stability of the `esp_wifi` netdev driver. The changes in detail:
- Since the `_esp_wifi_recv_cb` callback on frame reception is not executed in interrupt context but in the context of the `ets` thread, it is not necessary to pass the `NETDEV_EVENT_ISR` event first. Instead, the receive function `_recv` can be called directly, which leads to much faster handling, a lower frame loss rate and more robustness.
- Since the `_recv` function and `_esp_wifi_recv_cb` are called in a deterministic order, there is no longer any need for a mutex to synchronize access to the receive buffer between `_esp_wifi_recv_cb` and the `_recv` function.
- The Espressif SDK includes its own `lwIP` version. Since `lwIP` is not required for RIOT, the `lwIP` library from the Espressif SDK is not used anymore. To satisfy the symbol dependencies of the SDK libraries on the `lwIP` library, a number of dummy functions are defined without real functionality. Using these dummy functions instead of the real `lwIP` functions saves about 4 kbytes of RAM.
- The auto reconnect mechanism from the SDK is used now since it seems to be more stable and less memory consuming.
- Removes timeout handling and disconnecting in the send function when the lwIP buffer is exhausted. This PR solves problems 2 and 4 in issue "cpu/esp8266: Tracking open problems of esp_wifi netdev driver" #10861. Waiting for the frame allocated in the lwIP packet buffer to be freed by the MAC layer led to the complete blockage of the send function on heavy network load. Disconnecting from the AP is counterproductive since reconnecting usually fails on heavy network load.
- Human readable disconnect reasons.
- Generation of `NETDEV_EVENT_LINK_DOWN` and `NETDEV_EVENT_LINK_UP` on disconnect and connect to the AP.
Testing procedure
With this PR, the performance of pinging an esp8266 node should be noticeably better than before.
`example/gnrc_networking` to at least one ESP8266 node using your AP configuration, e.g.,
`sudo ping6 fe80::<ESP_IID> -Ieth0 -s1392 -i 0` should be stable. The packet loss rate should be 0. This should also be the case if ping is executed on two nodes or in both directions.
Issues/PRs references
This PR solves problems 2 and 4 as described in #10861. It improves the stability and the performance a lot.