Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpu/esp8266: Fixes and improvements of esp_wifi netdev driver #10862

Merged
merged 12 commits into from
Jan 30, 2019

Conversation

gschorcht
Copy link
Contributor

@gschorcht gschorcht commented Jan 25, 2019

Contribution description

This PR increases the performance and the stability of the esp_wifi netdev driver. These are in detail:

  • Since the _esp_wifi_recv_cb callback on frame receiption is not executed in interrupt context but in the context of the ets thread, it is not necessary to pass theNETDEV_EVENT_ISR event first. Instead, the receive function _recv can be called directly which leads to a much faster handling, a lower frame lost rate and more robustness.

  • Since _recv function and _esp_wifi_recv_cb are called in a deterministic order, there is no need for a mutex anymore to synchronize the access to the receive buffer between the _esp_wifi_recv_cb and the recv function.

  • The Espressif SDK includes its own lwIP version. Since lwIP is not required for RIOT, the lwIP library from Espressif SDK is not used anymore. To satisfy symbol dependencies of the SDK libraries to the lwIP library, a number of dummy functions are defined without real functionality. Using these dummy functions instead of real lwIP functions saves about 4 kbytes of RAM.

  • The auto reconnect mechanism from SDK is used now since it seems to be more stable and less memory consuming.

  • Removes timeout handling and disconnecting in send function when the lwIP buffer is exhausted. This PR solves problems 2 and 4 in issue cpu/esp8266: Tracking open problems of esp_wifi netdev driver #10861. Waiting that the frame allocated in lwIP packet buffer is freed by MAC layer led to the complete blockage of send function on heavy network load. Disconnecting from AP is counterproductive since reconnecting usually fails on heavy network load.

  • Human readable disconnect reasons.

  • Generation of NETDEV_EVENT_LINK_DOWN and NETDEV_EVENT_LINK_DOWN on disconnect and connect to the AP.

Testing procedure

With this PR, the performance of pinging an esp8266 node should be noticeably better than before.

  • Compile and flash example/gnrc_networking to at least one ESP8266 node using your AP configuration, e.g.,
    CFLAGS='-DESP_WIFI_SSID=\"<your SSID>\" -DESP_WIFI_PASS=\"your passphrase\"' USEMODULE="esp_wifi" make BOARD=esp8266-esp-12x -C examples/gnrc_networking flash
    
  • Stress test from LAN with sudo ping6 fe80::<ESP_IID> -Ieth0 -s1392 -i 0 should be stable. The packet loss rate should be 0. This should be also the case, if ping is executed on two nodes or in both directions.

Issues/PRs references

This PR solves the problems 2 and 4 as described in #10861. It improves the stability and the performance a lot.

@gschorcht
Copy link
Contributor Author

gschorcht commented Jan 25, 2019

@smlng Ups, it was late last night. This are theThese are improvements I made after #10762 was already merged. The performance is now:

Interval [s] max. packet loss [%]
1.000 0.0
0.100 0.0
0.010 0.0
0.005 0.0
0.002 0.0
0.001 0.0
0.000 0.0

@smlng smlng self-assigned this Jan 25, 2019
@smlng smlng added Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation Area: network Area: Networking Area: drivers Area: Device drivers Platform: ESP Platform: This PR/issue effects ESP-based platforms labels Jan 25, 2019
Copy link
Member

@smlng smlng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one, getting rid of the mutex and the (new) LWIP wrapper! However, I cannot achieve the same results as you did, here mine:

following with 1280B packets:

interval [s] loss [%]
1.000 0.0
0.100 0.0
0.010 0.0
0.003 1.0
0.002 5.0
0.001 50.0

Reducing packer size helps, next all with interval 0.001s

size [B] loss [%]
1280 50.0
1024 42.0
768 35.0
512 30.0

But after several ping trials in a row without reset, I got a kernel panic:

> 2019-01-25 09:40:40,124 - INFO #  EXCEPTION!! exccause=28 (LoadProhibitedCause) @4022082a excvaddr=00000004
2019-01-25 09:40:40,132 - INFO # 	pid | name                 | state    Q | pri | stack  ( used) | base addr  | current     
2019-01-25 09:40:40,140 - INFO # 	  - | isr_stack            | -        - |   - |   2048 ( 1488) | 0x3ffe8110 | 0x3ffe8910
2019-01-25 09:40:40,148 - INFO # 	  1 | ets                  | bl anyfl _ |   1 |   1536 ( 1388) | 0x3fff27cc | 0x3fff2cd0 
2019-01-25 09:40:40,156 - INFO # 	  2 | idle                 | pending  Q |  31 |   1024 (  220) | 0x3fff2e9c | 0x3fff31d0 
2019-01-25 09:40:40,164 - INFO # 	  3 | main                 | bl mutex _ |  15 |   3072 ( 1420) | 0x3fff329c | 0x3fff3c30 
2019-01-25 09:40:40,172 - INFO # 	  4 | pktdump              | bl rx    _ |  14 |   1024 (  360) | 0x3fff6ed8 | 0x3fff7180 
2019-01-25 09:40:40,180 - INFO # 	  5 | ipv6                 | running  Q |  12 |   1536 ( 1152) | 0x3fff4be4 | 0x3fff4fb0 
2019-01-25 09:40:40,188 - INFO # 	  6 | udp                  | bl rx    _ |  13 |   1024 (  392) | 0x3fff7928 | 0x3fff7bb0 
2019-01-25 09:40:40,196 - INFO # 	  7 | esp_wifi             | bl mutex _ |  10 |   1536 (  600) | 0x3fff3fa8 | 0x3fff43d0 
2019-01-25 09:40:40,204 - INFO # 	  8 | RPL                  | bl rx    _ |  13 |   1024 (  324) | 0x3fff7524 | 0x3fff77f0 
2019-01-25 09:40:40,210 - INFO # 	    | SUM                  |            |     |  13824 ( 7344)
2019-01-25 09:40:40,212 - INFO # heap: 17000 (free 11448) byte
2019-01-25 09:40:40,216 - INFO # sysmem: 17000 (used 5552, free 11448)
2019-01-25 09:40:40,221 - INFO #                                                          
2019-01-25 09:40:40,225 - INFO #  ets Jan  8 2013,rst cause:2, boot mode:(3,6)
2019-01-25 09:40:40,225 - INFO # 
2019-01-25 09:40:40,249 - INFO # load 0x3ffe8000, len 3992, room 16 
2019-01-25 09:40:40,253 - INFO # tail 8
2019-01-25 09:40:40,254 - INFO # chksum 0x41
2019-01-25 09:40:40,257 - INFO # load 0x3ffe8fa0, len 15616, room 0 
2019-01-25 09:40:40,270 - INFO # tail 0
2019-01-25 09:40:40,277 - INFO # chksum 0x26
2019-01-25 09:40:40,277 - INFO # load 0x40100000, len 29308, room 8 
2019-01-25 09:40:40,299 - INFO # tail 4
2019-01-25 09:40:40,300 - INFO # chksum 0x10
2019-01-25 09:40:40,301 - INFO # csum 0x10
2019-01-25 09:40:40,341 - INFO # ��p�n���o|�
                                            $ll b�2
                                                   �|r�$��N�l ��d���
                                                                    �mode : null

I cannot reproduce the panic on master, however sometimes I don't get any ping reply there and the device seems to be stuck - a reset helps.

@smlng
Copy link
Member

smlng commented Jan 25, 2019

size reduction is impressive though:

branch text data bss
master 321457 4010 47256
this PR 279725 3992 45304

40K less ROM and 2K less RAM

@smlng
Copy link
Member

smlng commented Jan 25, 2019

so actually, the panic is better bc it reboot the device - on master its stuck and manual reboot/reset is required. So not nice, but somehow better 😬

@gschorcht
Copy link
Contributor Author

However, I cannot achieve the same results as you did, here mine:

Hm, strange ... Maybe, it depends on the basic network load in the WLAN. I can even ping from two hosts in the LAN with maximum size as fast as they can without any loss:

sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i0

You should see a noticable performance improvement when you ping any host in LAN from esp8266 at the same time

host> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i0
esp8266> ping6 1000 fe80::345b:6598:1369:5bf4 1392

You should observe a lot ping timeouts with the master but 0 % packet loss with this PR.

@gschorcht
Copy link
Contributor Author

I cannot reproduce the panic on master, however sometimes I don't get any ping reply there and the device seems to be stuck - a reset helps.

IMHO, the exception occurs only when the network load increases a lot. Problem 3 in issue #10861 describes it. I couldn't observe this exception when I used normal network load pinging from thee nodes:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0
term3> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s52 -i 0

Probably, the exception is caused due an interrupt where interrupts shouldn't be allowed. It is quite possible that the exception is caused by the SDK. It is pretty well known that the SDK is very buggy. There are a lot of discussions in the network where several crashes for WiFi are described.

My guess is that the problem does not occur in master since it slows down the communication. Function _esp_wifi_recv_cb checks whether the internal buffer is occupied. If so, it drops the incoming frame.

if (_esp_wifi_dev.rx_len) {
ESP_WIFI_DEBUG("buffer used, dropping incoming frame of %d bytes",
pb->tot_len);
pbuf_free(pb);
_in_esp_wifi_recv_cb = false;
critical_exit();
mutex_unlock(&_esp_wifi_dev.dev_lock);
return;
}
What I could observe is that this dropping happens quite often on heavy network load due the asynchronous handling of frames using the NETDEV_EVENT_ISR approach. On heavy network load, ets thread provides the next frame before the gnrc_netif thread esp_wifi is scheduled. In worst case, every second frame is dropped.

@gschorcht
Copy link
Contributor Author

however sometimes I don't get any ping reply there and the device seems to be stuck - a reset helps.

I'm sure that you could observe the same for this PR if you play long enough. Problems 2 and 4 in issue #10861 describe the possible cases. Either the send function is blocking completely (problem 2) or the gnrc packets are not released and the gnrc packet buffer becomes full (problem 4). The latter one, you would be able to check with pktbuf command if you compile with gnrc_pktbuf_cmd. At the moment, I have no solution for both cases. For problem 2, it is not clear why function ieee80211_output_pbuf blocks although it returns with success.

@MichelRottleuthner
Copy link
Contributor

Thanks for these further improvements!
After bit of testing I can say that this definitely improves performance.
Though, regarding robustness I'm not sure if it is better, same or worse compared to master because problems i found are hard to reproduce.
Simultaneous pings in both directions work a lot better with no major hiccups at first.
With the ESP connected to the AP running on my computer and using high packet rates (ping parameters -s1 -i 0) I was able to run into various different error states:

  1. ESP completely freezes, no ping reply, shell not responding, neither crash nor reboot though
  2. ESP crashing with kernel panic and rebooting
  3. ESP disconnects and not able to reconnect again
    a) [esp_wifi] disconnected ... reason 2
    b) [esp_wifi] disconnected ... reason 8
  4. (when pinging in both directions) I got same as (3a) but ESP prints the following without ever recovering:
# ping timeout
# ping timeout
# E:M 48
# ping timeout
# E:M 56
# E:M 48
# E:M 48
# E:M 48

When the ESP disconnects because of "overload" for me it actually never came back to a working state without reset.
I'll try if I can find something suspicious with wireshark.

@gschorcht
Copy link
Contributor Author

@sming @MichelRottleuthner Thanks for testing. I know how frustrating testing can be when problems only occur sporadically and are not reproducable.

  1. ESP completely freezes, no ping reply, shell not responding, neither crash nor reboot though

The Shell freezes? Do you mean the esp8266 shell? I have never seen this.

  1. ESP crashing with kernel panic and rebooting

This corresponds to problem 3 described in issue #10861. I inspected the code to figure out why this happens. Unfortunately, it happens at different addresses. I have tried to catch the exception, however, there is no OCD that is working satisfying. Thus, I was not able to catch the exception.That is, all I can do is guessing.

  1. ESP disconnects and not able to reconnect again
    a) [esp_wifi] disconnected ... reason 2
    b) [esp_wifi] disconnected ... reason 8
    (when pinging in both directions) I got same as (3a) but ESP prints the following without ever recovering:

3.b) is an intended disconnect by esp_wifi when the send function blocks and stands for leave association. This is a try by esp_wifi to release buffers and to reinitialize the WiFi interface. Unfortunately, it is not really clear what really happens and there is no other way known to do that. Sometimes it works, sometimes not.

3.a) comes from the SDK and stands for authentication expired. This may happen either after an intended disconnect or after the AP sent a de-authentication message (reason 7). I can observe with airmon and wireshark that the PSK authentication procedure is started by the AP with key message (1 of 4) but the eps8266 does not answer with key message (2 of 4) to continue the authentication.

  1. (when pinging in both directions) I got same as (3a) but ESP prints the following without ever recovering:
    # E:M 48

This is an error message from the SDK and occurs when the memory is exhausted. Nobody knows exactly what this error message stands for. My guess is that this messages stands for "Error in Memory management only 48 bytes left" or something like that. Once 3.a) occured, in a noticeable number of cases, reconnecting fails continuously until the memory is exhausted. According to heap statistics, a lot of memory (several kBytes) is allocated with each reconnect, but this memory is not released the reconnect fails. This seems to be a known bug in the SDK.

The situation is quite frustrating and after some days of testing and trying, I'm not knowing much more that before.

The big question is, how reliable is it under normal network conditions. Most of these problems only occur on heavy network load. Aaaaannnnnddd, some of them also occur in the master version, at least 1., 3.a) and 3.b). I'm not really sure what to do, but normally I would prefer stability over performance.

@gschorcht gschorcht changed the title cpu/esp8266: Improvements to the esp_wifi netdev driver cpu/esp8266: Improvements to the esp_wifi netdev driver Jan 25, 2019
@gschorcht
Copy link
Contributor Author

gschorcht commented Jan 25, 2019

Partially good news. I believe the cause for the crash is here. Once the packet buffer is full on heavy network load, gnrc_netif_hdr_build returns NULL. The following unchecked access to hdr->data ...

hdr = gnrc_netif_hdr_build(NULL, 0, NULL, 0);
if (netif != NULL) {
((gnrc_netif_hdr_t *)hdr->data)->if_pid = netif->pid;
}
else {
/* ipv6_hdr->dst is loopback address */
((gnrc_netif_hdr_t *)hdr->data)->if_pid = KERNEL_PID_UNDEF;
}

@miri64 Have you already a changed version of gnrc_icmpv6_echo.c anywhere? I have provided PR #10869

@gschorcht
Copy link
Contributor Author

@sming @MichelRottleuthner Using the following changes, I can ping the esp8266 node from 4 terms with a maximum data size of 1392 byte and an interval of 0 without any crashes until I run into problem 3 in issue #10861.

--- a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
+++ b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
@@ -93,6 +93,12 @@ void gnrc_icmpv6_echo_req_handle(gnrc_netif_t *netif, ipv6_hdr_t *ipv6_hdr,
     pkt = hdr;
     hdr = gnrc_netif_hdr_build(NULL, 0, NULL, 0);
 
+    if (hdr == NULL) {
+        DEBUG("icmpv6_echo: no space left in packet buffer\n");
+        gnrc_pktbuf_release(pkt);
+        return;
+    }
+
     if (netif != NULL) {
         ((gnrc_netif_hdr_t *)hdr->data)->if_pid = netif->pid;
     }

@MichelRottleuthner
Copy link
Contributor

@gschorcht thanks a lot for the additional info!

The Shell freezes? Do you mean the esp8266 shell? I have never seen this.

yes the shell of the esp8266. It wont reply to any commands anymore, not even help
But as I said it's hard to reproduce. I'll try if I can find something to reliably trigger this.

(...)known bug in the SDK

Oh well we can not really solve problems of the SDK - hopefully some of this this will improve with newer releases from espressif.

The big question is, how reliable is it under normal network conditions.

I also think we should focus on normal scenarios for now.

(...) some of them also occur in the master (...) I would prefer stability over performance.

I completely agree, stability is much more important here. But if this PR has same stability as master I would still prefer "same stability with better performance".

with your above fix applied I just ran a test from three terminals with ~3 seconds before starting the next command:

sudo ping -6 -I br0 <espaddr> -s1392 -i 0 -c 10000
sudo ping -6 -I br0 <espaddr> -s0 -i 0 -c 10000
sudo ping -6 -I br0 <espaddr> -s0 -i 0 -c 10000

With that after around a minute or less I can reliably trigger a case where the ESP isn't sending replies anymore. If I then enter help on the ESP shell it lists all the commands as usual, but as soon as I type ifconfig it gets stuck (no ifconfig output printed, no reset, no crash) and the shell stops working.

Since _esp_wifi_recv_cb is not executed in interrupt context but in the context of the `ets` thread, the receive function can be called directly. There is no need for a mutex anymore to synchronize the access to the receive buffer between _esp_wifi_recv_cb and _recv function.
Since _esp_wifi_recv_cb is not executed in interrupt context but in the context of the `ets` thread, it is not necessary to pass the`NETDEV_EVENT_ISR` event first. Instead, the receive function can be called directly which result in much faster handling, a less frame lost rate and more robustness.
Defines a number of lwIP functions that are required as symbols by Espressif's SDK libraries. These functions are only dummies without real functionality. Using these functions instead of real lwIP functions provided with the SDK saves arround 4 kBytes of RAM.
It seems to be more stable and less memory consuming to use auto reconnect policy.
Disconnecting from the AP in the send function if the lwIP packet buffer is exhausted is counterproductive since reconnecting usually fails on heavy network load. A better strategy is to slow down the sending of MAC frames from netif a bit to wait for flushing the buffer in the MAC layer.
It is not necessary to realize timeout handling in send function or to disconnect from AP if lwIP packet buffer is exhausted. Waiting that the frame allocated in lwIP packet buffer is freed by MAC layer led to the complete blockage of send function on heavy network load. Disconnecting from AP is counterproductive since reconnecting usually fails on heavy network load.
@gschorcht gschorcht force-pushed the esp8266_esp_wifi_netdev_fix branch from 6dfa044 to fa48e3f Compare January 26, 2019 16:42
@gschorcht
Copy link
Contributor Author

gschorcht commented Jan 26, 2019

@sming @MichelRottleuthner After further intensive investigations and a lot of things I tried, I have finally a version of esp_wifi which is much faster than the master version and solves a number of issues described in #10861:

On performance I could observe in my network:

  • Pinging from the esp8266 node any node in the LAN:

    esp8266> ping6 10000 fe80::345b:6598:1369:5bf4 1392 0
    --- fe80::345b:6598:1369:5bf4 ping statistics ---
    10000 packets transmitted, 10000 received, 0% packet loss, time 51.06725224 s
    rtt min/avg/max = 3.560/4.238/59.757 ms
    

    which corresponds to 2.193 Mbps IP payload traffic.

  • Ping the esp8266 node from any node in the LAN:

    term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i0 -c 10000
    --- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
    10000 packets transmitted, 10000 received, 0% packet loss, time 37059ms
    rtt min/avg/max/mdev = 3.157/3.735/54.445/2.139 ms, pipe 4, ipg/ewma 3.706/3.471 ms
    

    which corresponds to 3.022 Mbps IP payload traffic.

  • Pinging the esp8266 node from any node in the LAN and vise versa simultaneously:

     term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i0 -c 10000
     --- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
     10000 packets transmitted, 9954 received, 0% packet loss, time 55917ms
     rtt min/avg/max/mdev = 3.098/5.818/85.605/4.516 ms, pipe 6, ipg/ewma 5.592/4.996 ms
    
    esp8266> ping6 10000 fe80::345b:6598:1369:5bf4 1392 0
    --- fe80::345b:6598:1369:5bf4 ping statistics ---
    10000 packets transmitted, 9998 received, 1% packet loss, time 78.06540984 s
    rtt min/avg/max = 3.683/6.057/92.794 ms
    
  • Pinging the esp8266 node from 2 terminals simultaneously:

    term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
    term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
    

    The loss rate was 0 %.

  • Pinging the esp8266 node from 3 terminals simultaneously:

    term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
    term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
    term3> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
    

    A number of E:M 1576 error messages could be observed. That is, the limit of traffic the esp8266 is able to handle was reached.

  • Pinging the esp8266 node from 5 terminals and vise versa simultaneously:

    term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
    term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
    term3> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
    term4> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
    term5> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
    esp8266> ping6 10000 fe80::345b:6598:1369:5bf4 1392 0
    

    Was working stable for several hours. Of course, a lot of ping timeouts and E:M 1576 error messages occurred. That is, esp8266 was working at its limit all the time.

  • Pinging the esp8266 node from 5 terminals in a more realistic network scenario:

    term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.1
    term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.2
    term3> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.3
    term4> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.4
    term5> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.5
    

    Was working stable for several hours. The loss rate was 0 %.

From my point of view, this PR provides a version that is much more stable, much better and much more performant than the master version. Therefore, I urgently recommend to merge it as a fix and to backport it to the 2019.01 branch. I think it is better to backport it to 2019.01 than to have a released version which is not satisfying the the users.

@gschorcht gschorcht changed the title cpu/esp8266: Improvements to the esp_wifi netdev driver cpu/esp8266: Fixes and improvements of esp_wifi netdev driver Jan 27, 2019
The situation where the firmware `lwIP` packet buffer is exhausted is an important indication that the traffic sent to and sent from the esp8266 is more than the esp8266 is able to handle. Therefore, it should be an error message.
@MichelRottleuthner MichelRottleuthner added CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 3-testing The PR was tested according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines labels Jan 28, 2019
@MichelRottleuthner
Copy link
Contributor

I did a little bit more testing but I can only confirm the better performance and that I don't get the wired disconnect/unable to reconnect problems. Though, I can reliably get the ESP into a state where it prints the following messages and will never come back to a working state without a full reset:

# E:M 1632
# [esp_wifi] could not allocate buffer to send 1514 bytes 
# E:M 1632
# [esp_wifi] could not allocate buffer to send 1514 bytes
(...)

I can trigger this with even with only a single instance pinging the ESP
i.e. a full run of sudo ping -6 -I br0 <espLLADDR> -s1452 -i 0 -c 100000 never completes successfully on my setup.
The only difference I can spot when comparing with your results is that with an interval of -i 0 my setup seems to stress the ESP a little more:

--- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 37059ms
rtt min/avg/max/mdev = 3.157/3.735/54.445/2.139 ms, pipe 4, ipg/ewma 3.706/3.471 ms```

Thats what I get for the same settings:

rtt min/avg/max/mdev = 2.468/3.131/57.639/1.366 ms, pipe 4, ipg/ewma 3.277/2.837 ms

I think this could be related to the fact that I use hostapd directly on my computer so there is no additional hop (like AP/router) in between. Maybe that small difference is enough trigger the cases where the ESP can't handle the workload anymore(?)

Anyway - I don't want to block this PR and the fact that simultaneous bidirectional communication works a lot better with it makes it worthwhile already.

@gschorcht would you mind having a look at the Murdock output? There seems to be an issue with the lwip/err.h include.

Checking by the send function that at least two maximum size Ethernet frames fit in the remaining heap before the LwIP packet buffer is allocated seems to increase stability. This can be caused by the fact that WLAN hardware interrupts allocate additional memory when receiving a frame during the send attempt.
@gschorcht
Copy link
Contributor Author

gschorcht commented Jan 29, 2019

I can trigger this with even with only a single instance pinging the ESP
i.e. a full run of sudo ping -6 -I br0 <espLLADDR> -s1452 -i 0 -c 100000 never completes successfully on my setup.

Strange, I have tried with hostapd too. Mine is even faster, with sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Iap0 -s1452 -c100000 but it doesn't fail:

--- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 284680ms
rtt min/avg/max/mdev = 2.479/2.732/18.899/0.262 ms, pipe 2, ipg/ewma 2.846/2.718 ms

Even more, I can run it for two pings in parallel:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Iap0 -s1452 -i0 -c100000
--- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
100000 packets transmitted, 97837 received, 2% packet loss, time 416489ms
rtt min/avg/max/mdev = 2.329/3.815/17.467/0.478 ms, pipe 2, ipg/ewma 4.164/3.957 ms

term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Iap0 -s1452 -i0 -c100000
--- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
100000 packets transmitted, 97872 received, 2% packet loss, time 415960ms
rtt min/avg/max/mdev = 2.345/3.815/15.834/0.481 ms, pipe 2, ipg/ewma 4.159/2.671 ms

I have added an additional heap check in the send function before attempting to allocate the lwIP packet buffer. If there are not at least two maximum-size Ethernet frames left, the function returns with -EIO. This prevents the memory from being exhausted while the WiFi interface is simultaneously receiving a frame and trying to allocate lwIP packet buffer. This should further increase stability and could solve your problem. In any case, it solves the problem of E:M xxx messages.

@gschorcht
Copy link
Contributor Author

I did the last tests with the latest version also with hostapd.

I was able to ping the esp8266 from 5 terminals at full speed -s1452 -i0 while being bombarded with 50 deauthentication attacks! The esp8266 was absolutely stable and could easily reconnect at any time to continue responding to the pings.

Copy link
Contributor

@MichelRottleuthner MichelRottleuthner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! With this version I can not trigger any lock-ups or reboots anymore :) -> ACK
Feel free to squash as needed (IMO most of the detailed commits can stay like that to see exactly what was fixed and how).

@gschorcht gschorcht merged commit 9423252 into RIOT-OS:master Jan 30, 2019
@gschorcht
Copy link
Contributor Author

Thanks for your support.

@aabadie
Copy link
Contributor

aabadie commented Jan 30, 2019

Since this PR is notably improving things and fixing bugs, I would be in favour of backporting it to 2019.01-branch. What do you think ?

@gschorcht
Copy link
Contributor Author

gschorcht commented Jan 30, 2019

@aabadie While you commented I was writing the following "I would like to backport these fixes to the 2019.01 branch since it contains a number of essential changes and fixes that improve the stability and performance a lot. The driver as it is in 2019.01 is working but has a lot of problems."

So, yes of course, I would like to backport it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: drivers Area: Device drivers Area: network Area: Networking CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Platform: ESP Platform: This PR/issue effects ESP-based platforms Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 3-testing The PR was tested according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants