cpu/esp8266: Fixes and improvements of esp_wifi netdev driver #10862

gschorcht · 2019-01-25T00:51:45Z

Contribution description

This PR increases the performance and the stability of the esp_wifi netdev driver. These are in detail:

Since the _esp_wifi_recv_cb callback on frame receiption is not executed in interrupt context but in the context of the ets thread, it is not necessary to pass theNETDEV_EVENT_ISR event first. Instead, the receive function _recv can be called directly which leads to a much faster handling, a lower frame lost rate and more robustness.
Since _recv function and _esp_wifi_recv_cb are called in a deterministic order, there is no need for a mutex anymore to synchronize the access to the receive buffer between the _esp_wifi_recv_cb and the recv function.
The Espressif SDK includes its own lwIP version. Since lwIP is not required for RIOT, the lwIP library from Espressif SDK is not used anymore. To satisfy symbol dependencies of the SDK libraries to the lwIP library, a number of dummy functions are defined without real functionality. Using these dummy functions instead of real lwIP functions saves about 4 kbytes of RAM.
The auto reconnect mechanism from SDK is used now since it seems to be more stable and less memory consuming.
Removes timeout handling and disconnecting in send function when the lwIP buffer is exhausted. This PR solves problems 2 and 4 in issue cpu/esp8266: Tracking open problems of esp_wifi netdev driver #10861. Waiting that the frame allocated in lwIP packet buffer is freed by MAC layer led to the complete blockage of send function on heavy network load. Disconnecting from AP is counterproductive since reconnecting usually fails on heavy network load.
Human readable disconnect reasons.
Generation of NETDEV_EVENT_LINK_DOWN and NETDEV_EVENT_LINK_DOWN on disconnect and connect to the AP.

Testing procedure

With this PR, the performance of pinging an esp8266 node should be noticeably better than before.

Compile and flash example/gnrc_networking to at least one ESP8266 node using your AP configuration, e.g.,

CFLAGS='-DESP_WIFI_SSID=\"<your SSID>\" -DESP_WIFI_PASS=\"your passphrase\"' USEMODULE="esp_wifi" make BOARD=esp8266-esp-12x -C examples/gnrc_networking flash

Stress test from LAN with sudo ping6 fe80::<ESP_IID> -Ieth0 -s1392 -i 0 should be stable. The packet loss rate should be 0. This should be also the case, if ping is executed on two nodes or in both directions.

Issues/PRs references

This PR solves the problems 2 and 4 as described in #10861. It improves the stability and the performance a lot.

gschorcht · 2019-01-25T01:00:13Z

@smlng Ups, it was late last night. ~~This are the~~These are improvements I made after #10762 was already merged. The performance is now:

Interval [s]	max. packet loss [%]
1.000	0.0
0.100	0.0
0.010	0.0
0.005	0.0
0.002	0.0
0.001	0.0
0.000	0.0

smlng

Nice one, getting rid of the mutex and the (new) LWIP wrapper! However, I cannot achieve the same results as you did, here mine:

following with 1280B packets:

interval [s]	loss [%]
1.000	0.0
0.100	0.0
0.010	0.0
0.003	1.0
0.002	5.0
0.001	50.0

Reducing packer size helps, next all with interval 0.001s

size [B]	loss [%]
1280	50.0
1024	42.0
768	35.0
512	30.0

But after several ping trials in a row without reset, I got a kernel panic:

> 2019-01-25 09:40:40,124 - INFO #  EXCEPTION!! exccause=28 (LoadProhibitedCause) @4022082a excvaddr=00000004
2019-01-25 09:40:40,132 - INFO # 	pid | name                 | state    Q | pri | stack  ( used) | base addr  | current     
2019-01-25 09:40:40,140 - INFO # 	  - | isr_stack            | -        - |   - |   2048 ( 1488) | 0x3ffe8110 | 0x3ffe8910
2019-01-25 09:40:40,148 - INFO # 	  1 | ets                  | bl anyfl _ |   1 |   1536 ( 1388) | 0x3fff27cc | 0x3fff2cd0 
2019-01-25 09:40:40,156 - INFO # 	  2 | idle                 | pending  Q |  31 |   1024 (  220) | 0x3fff2e9c | 0x3fff31d0 
2019-01-25 09:40:40,164 - INFO # 	  3 | main                 | bl mutex _ |  15 |   3072 ( 1420) | 0x3fff329c | 0x3fff3c30 
2019-01-25 09:40:40,172 - INFO # 	  4 | pktdump              | bl rx    _ |  14 |   1024 (  360) | 0x3fff6ed8 | 0x3fff7180 
2019-01-25 09:40:40,180 - INFO # 	  5 | ipv6                 | running  Q |  12 |   1536 ( 1152) | 0x3fff4be4 | 0x3fff4fb0 
2019-01-25 09:40:40,188 - INFO # 	  6 | udp                  | bl rx    _ |  13 |   1024 (  392) | 0x3fff7928 | 0x3fff7bb0 
2019-01-25 09:40:40,196 - INFO # 	  7 | esp_wifi             | bl mutex _ |  10 |   1536 (  600) | 0x3fff3fa8 | 0x3fff43d0 
2019-01-25 09:40:40,204 - INFO # 	  8 | RPL                  | bl rx    _ |  13 |   1024 (  324) | 0x3fff7524 | 0x3fff77f0 
2019-01-25 09:40:40,210 - INFO # 	    | SUM                  |            |     |  13824 ( 7344)
2019-01-25 09:40:40,212 - INFO # heap: 17000 (free 11448) byte
2019-01-25 09:40:40,216 - INFO # sysmem: 17000 (used 5552, free 11448)
2019-01-25 09:40:40,221 - INFO #                                                          
2019-01-25 09:40:40,225 - INFO #  ets Jan  8 2013,rst cause:2, boot mode:(3,6)
2019-01-25 09:40:40,225 - INFO # 
2019-01-25 09:40:40,249 - INFO # load 0x3ffe8000, len 3992, room 16 
2019-01-25 09:40:40,253 - INFO # tail 8
2019-01-25 09:40:40,254 - INFO # chksum 0x41
2019-01-25 09:40:40,257 - INFO # load 0x3ffe8fa0, len 15616, room 0 
2019-01-25 09:40:40,270 - INFO # tail 0
2019-01-25 09:40:40,277 - INFO # chksum 0x26
2019-01-25 09:40:40,277 - INFO # load 0x40100000, len 29308, room 8 
2019-01-25 09:40:40,299 - INFO # tail 4
2019-01-25 09:40:40,300 - INFO # chksum 0x10
2019-01-25 09:40:40,301 - INFO # csum 0x10
2019-01-25 09:40:40,341 - INFO # ��p�n���o|�
                                            $ll b�2
                                                   �|r�$��N�l ��d���
                                                                    �mode : null

I cannot reproduce the panic on master, however sometimes I don't get any ping reply there and the device seems to be stuck - a reset helps.

smlng · 2019-01-25T08:57:43Z

size reduction is impressive though:

branch	text	data	bss
master	321457	4010	47256
this PR	279725	3992	45304

40K less ROM and 2K less RAM

smlng · 2019-01-25T09:00:03Z

so actually, the panic is better bc it reboot the device - on master its stuck and manual reboot/reset is required. So not nice, but somehow better 😬

gschorcht · 2019-01-25T11:47:27Z

However, I cannot achieve the same results as you did, here mine:

Hm, strange ... Maybe, it depends on the basic network load in the WLAN. I can even ping from two hosts in the LAN with maximum size as fast as they can without any loss:

sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i0

You should see a noticable performance improvement when you ping any host in LAN from esp8266 at the same time

host> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i0

esp8266> ping6 1000 fe80::345b:6598:1369:5bf4 1392

You should observe a lot ping timeouts with the master but 0 % packet loss with this PR.

gschorcht · 2019-01-25T12:14:33Z

I cannot reproduce the panic on master, however sometimes I don't get any ping reply there and the device seems to be stuck - a reset helps.

IMHO, the exception occurs only when the network load increases a lot. Problem 3 in issue #10861 describes it. I couldn't observe this exception when I used normal network load pinging from thee nodes:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0
term3> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s52 -i 0

Probably, the exception is caused due an interrupt where interrupts shouldn't be allowed. It is quite possible that the exception is caused by the SDK. It is pretty well known that the SDK is very buggy. There are a lot of discussions in the network where several crashes for WiFi are described.

My guess is that the problem does not occur in master since it slows down the communication. Function _esp_wifi_recv_cb checks whether the internal buffer is occupied. If so, it drops the incoming frame.

RIOT/cpu/esp8266/esp-wifi/esp_wifi_netdev.c

Lines 199 to 207 in 8984a9b

    
           if (_esp_wifi_dev.rx_len) { 
        
               ESP_WIFI_DEBUG("buffer used, dropping incoming frame of %d bytes", 
        
                              pb->tot_len); 
        
               pbuf_free(pb); 
        
               _in_esp_wifi_recv_cb = false; 
        
               critical_exit(); 
        
               mutex_unlock(&_esp_wifi_dev.dev_lock); 
        
               return; 
        
           }

What I could observe is that this dropping happens quite often on heavy network load due the asynchronous handling of frames using the NETDEV_EVENT_ISR approach. On heavy network load, ets thread provides the next frame before the gnrc_netif thread esp_wifi is scheduled. In worst case, every second frame is dropped.

gschorcht · 2019-01-25T12:22:08Z

however sometimes I don't get any ping reply there and the device seems to be stuck - a reset helps.

I'm sure that you could observe the same for this PR if you play long enough. Problems 2 and 4 in issue #10861 describe the possible cases. Either the send function is blocking completely (problem 2) or the gnrc packets are not released and the gnrc packet buffer becomes full (problem 4). The latter one, you would be able to check with pktbuf command if you compile with gnrc_pktbuf_cmd. At the moment, I have no solution for both cases. For problem 2, it is not clear why function ieee80211_output_pbuf blocks although it returns with success.

MichelRottleuthner · 2019-01-25T12:57:18Z

Thanks for these further improvements!
After bit of testing I can say that this definitely improves performance.
Though, regarding robustness I'm not sure if it is better, same or worse compared to master because problems i found are hard to reproduce.
Simultaneous pings in both directions work a lot better with no major hiccups at first.
With the ESP connected to the AP running on my computer and using high packet rates (ping parameters -s1 -i 0) I was able to run into various different error states:

ESP completely freezes, no ping reply, shell not responding, neither crash nor reboot though
ESP crashing with kernel panic and rebooting
ESP disconnects and not able to reconnect again
a) [esp_wifi] disconnected ... reason 2
b) [esp_wifi] disconnected ... reason 8
(when pinging in both directions) I got same as (3a) but ESP prints the following without ever recovering:

# ping timeout
# ping timeout
# E:M 48
# ping timeout
# E:M 56
# E:M 48
# E:M 48
# E:M 48

When the ESP disconnects because of "overload" for me it actually never came back to a working state without reset.
I'll try if I can find something suspicious with wireshark.

gschorcht · 2019-01-25T13:57:38Z

@sming @MichelRottleuthner Thanks for testing. I know how frustrating testing can be when problems only occur sporadically and are not reproducable.

ESP completely freezes, no ping reply, shell not responding, neither crash nor reboot though

The Shell freezes? Do you mean the esp8266 shell? I have never seen this.

ESP crashing with kernel panic and rebooting

This corresponds to problem 3 described in issue #10861. I inspected the code to figure out why this happens. Unfortunately, it happens at different addresses. I have tried to catch the exception, however, there is no OCD that is working satisfying. Thus, I was not able to catch the exception.That is, all I can do is guessing.

ESP disconnects and not able to reconnect again
a) [esp_wifi] disconnected ... reason 2
b) [esp_wifi] disconnected ... reason 8
(when pinging in both directions) I got same as (3a) but ESP prints the following without ever recovering:

3.b) is an intended disconnect by esp_wifi when the send function blocks and stands for leave association. This is a try by esp_wifi to release buffers and to reinitialize the WiFi interface. Unfortunately, it is not really clear what really happens and there is no other way known to do that. Sometimes it works, sometimes not.

3.a) comes from the SDK and stands for authentication expired. This may happen either after an intended disconnect or after the AP sent a de-authentication message (reason 7). I can observe with airmon and wireshark that the PSK authentication procedure is started by the AP with key message (1 of 4) but the eps8266 does not answer with key message (2 of 4) to continue the authentication.

(when pinging in both directions) I got same as (3a) but ESP prints the following without ever recovering:
# E:M 48

This is an error message from the SDK and occurs when the memory is exhausted. Nobody knows exactly what this error message stands for. My guess is that this messages stands for "Error in Memory management only 48 bytes left" or something like that. Once 3.a) occured, in a noticeable number of cases, reconnecting fails continuously until the memory is exhausted. According to heap statistics, a lot of memory (several kBytes) is allocated with each reconnect, but this memory is not released the reconnect fails. This seems to be a known bug in the SDK.

The situation is quite frustrating and after some days of testing and trying, I'm not knowing much more that before.

The big question is, how reliable is it under normal network conditions. Most of these problems only occur on heavy network load. Aaaaannnnnddd, some of them also occur in the master version, at least 1., 3.a) and 3.b). I'm not really sure what to do, but normally I would prefer stability over performance.

gschorcht · 2019-01-25T15:50:10Z

Partially good news. I believe the cause for the crash is here. Once the packet buffer is full on heavy network load, gnrc_netif_hdr_build returns NULL. The following unchecked access to hdr->data ...

RIOT/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c

Lines 94 to 102 in 250b7cb

    
           hdr = gnrc_netif_hdr_build(NULL, 0, NULL, 0); 
        
           if (netif != NULL) { 
        
               ((gnrc_netif_hdr_t *)hdr->data)->if_pid = netif->pid; 
        
           } 
        
           else { 
        
               /* ipv6_hdr->dst is loopback address */ 
        
               ((gnrc_netif_hdr_t *)hdr->data)->if_pid = KERNEL_PID_UNDEF; 
        
           }

@miri64 Have you already a changed version of gnrc_icmpv6_echo.c anywhere? I have provided PR #10869

gschorcht · 2019-01-25T16:02:51Z

@sming @MichelRottleuthner Using the following changes, I can ping the esp8266 node from 4 terms with a maximum data size of 1392 byte and an interval of 0 without any crashes until I run into problem 3 in issue #10861.

--- a/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
+++ b/sys/net/gnrc/network_layer/icmpv6/echo/gnrc_icmpv6_echo.c
@@ -93,6 +93,12 @@ void gnrc_icmpv6_echo_req_handle(gnrc_netif_t *netif, ipv6_hdr_t *ipv6_hdr,
     pkt = hdr;
     hdr = gnrc_netif_hdr_build(NULL, 0, NULL, 0);
 
+    if (hdr == NULL) {
+        DEBUG("icmpv6_echo: no space left in packet buffer\n");
+        gnrc_pktbuf_release(pkt);
+        return;
+    }
+
     if (netif != NULL) {
         ((gnrc_netif_hdr_t *)hdr->data)->if_pid = netif->pid;
     }

MichelRottleuthner · 2019-01-25T16:35:57Z

@gschorcht thanks a lot for the additional info!

The Shell freezes? Do you mean the esp8266 shell? I have never seen this.

yes the shell of the esp8266. It wont reply to any commands anymore, not even help
But as I said it's hard to reproduce. I'll try if I can find something to reliably trigger this.

(...)known bug in the SDK

Oh well we can not really solve problems of the SDK - hopefully some of this this will improve with newer releases from espressif.

The big question is, how reliable is it under normal network conditions.

I also think we should focus on normal scenarios for now.

(...) some of them also occur in the master (...) I would prefer stability over performance.

I completely agree, stability is much more important here. But if this PR has same stability as master I would still prefer "same stability with better performance".

with your above fix applied I just ran a test from three terminals with ~3 seconds before starting the next command:

sudo ping -6 -I br0 <espaddr> -s1392 -i 0 -c 10000
sudo ping -6 -I br0 <espaddr> -s0 -i 0 -c 10000
sudo ping -6 -I br0 <espaddr> -s0 -i 0 -c 10000

With that after around a minute or less I can reliably trigger a case where the ESP isn't sending replies anymore. If I then enter help on the ESP shell it lists all the commands as usual, but as soon as I type ifconfig it gets stuck (no ifconfig output printed, no reset, no crash) and the shell stops working.

Since _esp_wifi_recv_cb is not executed in interrupt context but in the context of the `ets` thread, the receive function can be called directly. There is no need for a mutex anymore to synchronize the access to the receive buffer between _esp_wifi_recv_cb and _recv function.

Since _esp_wifi_recv_cb is not executed in interrupt context but in the context of the `ets` thread, it is not necessary to pass the`NETDEV_EVENT_ISR` event first. Instead, the receive function can be called directly which result in much faster handling, a less frame lost rate and more robustness.

Defines a number of lwIP functions that are required as symbols by Espressif's SDK libraries. These functions are only dummies without real functionality. Using these functions instead of real lwIP functions provided with the SDK saves arround 4 kBytes of RAM.

It seems to be more stable and less memory consuming to use auto reconnect policy.

Disconnecting from the AP in the send function if the lwIP packet buffer is exhausted is counterproductive since reconnecting usually fails on heavy network load. A better strategy is to slow down the sending of MAC frames from netif a bit to wait for flushing the buffer in the MAC layer.

It is not necessary to realize timeout handling in send function or to disconnect from AP if lwIP packet buffer is exhausted. Waiting that the frame allocated in lwIP packet buffer is freed by MAC layer led to the complete blockage of send function on heavy network load. Disconnecting from AP is counterproductive since reconnecting usually fails on heavy network load.

gschorcht · 2019-01-26T17:19:16Z

@sming @MichelRottleuthner After further intensive investigations and a lot of things I tried, I have finally a version of esp_wifi which is much faster than the master version and solves a number of issues described in #10861:

This PR now seems to solve the problems of complete blockage (issues 2 and 4 in cpu/esp8266: Tracking open problems of esp_wifi netdev driver #10861), at least for an amount of traffic the esp8266 is able to handle at all.
PR gnrc_icmpv6_echo: avoid crashing when pktbuf full #10869 solved the problems of sporadic crashes (issues 3 and 5 in cpu/esp8266: Tracking open problems of esp_wifi netdev driver #10861).
Connecting/reconnecting problems while excessive traffic is sent to the esp8266 (issues 1 and 6 in cpu/esp8266: Tracking open problems of esp_wifi netdev driver #10861) can't be solved from the today's perspective. esp8266 has a performance that is too low to handle a large amount of frames while connecting, see Correct Error Handling for reconnect (better than range.py?) peterhinch/micropython-mqtt#3 (comment).
One further little problem remains. You will see an error message like E:M xxxx whenever the firmware lwIP packet buffer is exhausted and cannot be allocated anymore. Unfortunately, there is no way known to suppress this message 😟

On performance I could observe in my network:

Pinging from the esp8266 node any node in the LAN:

esp8266> ping6 10000 fe80::345b:6598:1369:5bf4 1392 0
--- fe80::345b:6598:1369:5bf4 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 51.06725224 s
rtt min/avg/max = 3.560/4.238/59.757 ms

which corresponds to 2.193 Mbps IP payload traffic.

Ping the esp8266 node from any node in the LAN:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i0 -c 10000
--- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 37059ms
rtt min/avg/max/mdev = 3.157/3.735/54.445/2.139 ms, pipe 4, ipg/ewma 3.706/3.471 ms

which corresponds to 3.022 Mbps IP payload traffic.

Pinging the esp8266 node from any node in the LAN and vise versa simultaneously:

 term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i0 -c 10000
 --- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
 10000 packets transmitted, 9954 received, 0% packet loss, time 55917ms
 rtt min/avg/max/mdev = 3.098/5.818/85.605/4.516 ms, pipe 6, ipg/ewma 5.592/4.996 ms

esp8266> ping6 10000 fe80::345b:6598:1369:5bf4 1392 0
--- fe80::345b:6598:1369:5bf4 ping statistics ---
10000 packets transmitted, 9998 received, 1% packet loss, time 78.06540984 s
rtt min/avg/max = 3.683/6.057/92.794 ms

Pinging the esp8266 node from 2 terminals simultaneously:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0

The loss rate was 0 %.

Pinging the esp8266 node from 3 terminals simultaneously:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
term3> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0

A number of E:M 1576 error messages could be observed. That is, the limit of traffic the esp8266 is able to handle was reached.

Pinging the esp8266 node from 5 terminals and vise versa simultaneously:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
term3> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
term4> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
term5> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
esp8266> ping6 10000 fe80::345b:6598:1369:5bf4 1392 0

Was working stable for several hours. Of course, a lot of ping timeouts and E:M 1576 error messages occurred. That is, esp8266 was working at its limit all the time.

Pinging the esp8266 node from 5 terminals in a more realistic network scenario:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.1
term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.2
term3> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.3
term4> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.4
term5> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.5

Was working stable for several hours. The loss rate was 0 %.

From my point of view, this PR provides a version that is much more stable, much better and much more performant than the master version. Therefore, I urgently recommend to merge it as a fix and to backport it to the 2019.01 branch. I think it is better to backport it to 2019.01 than to have a released version which is not satisfying the the users.

The situation where the firmware `lwIP` packet buffer is exhausted is an important indication that the traffic sent to and sent from the esp8266 is more than the esp8266 is able to handle. Therefore, it should be an error message.

MichelRottleuthner · 2019-01-29T09:10:10Z

I did a little bit more testing but I can only confirm the better performance and that I don't get the wired disconnect/unable to reconnect problems. Though, I can reliably get the ESP into a state where it prints the following messages and will never come back to a working state without a full reset:

# E:M 1632
# [esp_wifi] could not allocate buffer to send 1514 bytes 
# E:M 1632
# [esp_wifi] could not allocate buffer to send 1514 bytes
(...)

I can trigger this with even with only a single instance pinging the ESP
i.e. a full run of sudo ping -6 -I br0 <espLLADDR> -s1452 -i 0 -c 100000 never completes successfully on my setup.
The only difference I can spot when comparing with your results is that with an interval of -i 0 my setup seems to stress the ESP a little more:

--- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 37059ms
rtt min/avg/max/mdev = 3.157/3.735/54.445/2.139 ms, pipe 4, ipg/ewma 3.706/3.471 ms```

Thats what I get for the same settings:

rtt min/avg/max/mdev = 2.468/3.131/57.639/1.366 ms, pipe 4, ipg/ewma 3.277/2.837 ms

I think this could be related to the fact that I use hostapd directly on my computer so there is no additional hop (like AP/router) in between. Maybe that small difference is enough trigger the cases where the ESP can't handle the workload anymore(?)

Anyway - I don't want to block this PR and the fact that simultaneous bidirectional communication works a lot better with it makes it worthwhile already.

@gschorcht would you mind having a look at the Murdock output? There seems to be an issue with the lwip/err.h include.

Checking by the send function that at least two maximum size Ethernet frames fit in the remaining heap before the LwIP packet buffer is allocated seems to increase stability. This can be caused by the fact that WLAN hardware interrupts allocate additional memory when receiving a frame during the send attempt.

gschorcht · 2019-01-29T22:38:48Z

I can trigger this with even with only a single instance pinging the ESP
i.e. a full run of sudo ping -6 -I br0 <espLLADDR> -s1452 -i 0 -c 100000 never completes successfully on my setup.

Strange, I have tried with hostapd too. Mine is even faster, with sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Iap0 -s1452 -c100000 but it doesn't fail:

--- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 284680ms
rtt min/avg/max/mdev = 2.479/2.732/18.899/0.262 ms, pipe 2, ipg/ewma 2.846/2.718 ms

Even more, I can run it for two pings in parallel:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Iap0 -s1452 -i0 -c100000
--- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
100000 packets transmitted, 97837 received, 2% packet loss, time 416489ms
rtt min/avg/max/mdev = 2.329/3.815/17.467/0.478 ms, pipe 2, ipg/ewma 4.164/3.957 ms

term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Iap0 -s1452 -i0 -c100000
--- fe80::5ecf:7fff:fe80:3f08 ping statistics ---
100000 packets transmitted, 97872 received, 2% packet loss, time 415960ms
rtt min/avg/max/mdev = 2.345/3.815/15.834/0.481 ms, pipe 2, ipg/ewma 4.159/2.671 ms

I have added an additional heap check in the send function before attempting to allocate the lwIP packet buffer. If there are not at least two maximum-size Ethernet frames left, the function returns with -EIO. This prevents the memory from being exhausted while the WiFi interface is simultaneously receiving a frame and trying to allocate lwIP packet buffer. This should further increase stability and could solve your problem. In any case, it solves the problem of E:M xxx messages.

gschorcht · 2019-01-30T01:50:45Z

I did the last tests with the latest version also with hostapd.

I was able to ping the esp8266 from 5 terminals at full speed -s1452 -i0 while being bombarded with 50 deauthentication attacks! The esp8266 was absolutely stable and could easily reconnect at any time to continue responding to the pings.

MichelRottleuthner

Nice! With this version I can not trigger any lock-ups or reboots anymore :) -> ACK
Feel free to squash as needed (IMO most of the detailed commits can stay like that to see exactly what was fixed and how).

gschorcht · 2019-01-30T10:31:49Z

Thanks for your support.

aabadie · 2019-01-30T10:34:33Z

Since this PR is notably improving things and fixing bugs, I would be in favour of backporting it to 2019.01-branch. What do you think ?

gschorcht · 2019-01-30T10:38:08Z

@aabadie While you commented I was writing the following "I would like to backport these fixes to the 2019.01 branch since it contains a number of essential changes and fixes that improve the stability and performance a lot. The driver as it is in 2019.01 is working but has a lot of problems."

So, yes of course, I would like to backport it.

smlng self-assigned this Jan 25, 2019

smlng requested a review from MichelRottleuthner January 25, 2019 08:23

smlng added Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation Area: network Area: Networking Area: drivers Area: Device drivers Platform: ESP Platform: This PR/issue effects ESP-based platforms labels Jan 25, 2019

smlng reviewed Jan 25, 2019

View reviewed changes

gschorcht changed the title ~~cpu/esp8266: Improvements to the esp_wifi netdev driver~~ cpu/esp8266: Improvements to the esp_wifi netdev driver Jan 25, 2019

gschorcht mentioned this pull request Jan 25, 2019

gnrc_icmpv6_echo: avoid crashing when pktbuf full #10869

Merged

gschorcht mentioned this pull request Jan 25, 2019

gnrc_icmpv6_echo: avoid crashing when pktbuf full [backport 2019.01] #10874

Merged

gschorcht added 8 commits January 25, 2019 23:59

cpu/esp8266: auto reconnect in esp_wifi

5d0b447

It seems to be more stable and less memory consuming to use auto reconnect policy.

cpu/esp8266: readable disconn reasons in esp_wifi

4e41afa

cpu/esp8266: event handling for link up/down

fa48e3f

gschorcht mentioned this pull request Jan 26, 2019

cpu/esp8266: Tracking open problems of esp_wifi netdev driver #10861

Open

gschorcht force-pushed the esp8266_esp_wifi_netdev_fix branch from 6dfa044 to fa48e3f Compare January 26, 2019 16:42

gschorcht changed the title ~~cpu/esp8266: Improvements to the esp_wifi netdev driver~~ cpu/esp8266: Fixes and improvements of esp_wifi netdev driver Jan 27, 2019

cpu/esp8266: dbg msg replaced by an error msg

c811383

The situation where the firmware `lwIP` packet buffer is exhausted is an important indication that the traffic sent to and sent from the esp8266 is more than the esp8266 is able to handle. Therefore, it should be an error message.

gschorcht added 2 commits January 29, 2019 18:16

cpu/esp8266: fix compile problems

57fdb23

cpu/esp8266: avoid WiFi going into sleep mode

66c4cf8

MichelRottleuthner approved these changes Jan 30, 2019

View reviewed changes

gschorcht merged commit 9423252 into RIOT-OS:master Jan 30, 2019

gschorcht mentioned this pull request Jan 30, 2019

cpu/esp8266: Fixes and improvements of esp_wifi netdev driver [backport 2019.01] #10904

Merged

gschorcht deleted the esp8266_esp_wifi_netdev_fix branch January 30, 2019 12:39

aabadie added this to the Release 2019.01 milestone Jan 31, 2019

gschorcht mentioned this pull request Feb 9, 2019

cpu/esp8266: fix compilation problems if module esp_wifi is not used #10980

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cpu/esp8266: Fixes and improvements of esp_wifi netdev driver #10862

cpu/esp8266: Fixes and improvements of esp_wifi netdev driver #10862

gschorcht commented Jan 25, 2019 •

edited

Loading

gschorcht commented Jan 25, 2019 •

edited

Loading

smlng left a comment

smlng commented Jan 25, 2019 •

edited

Loading

smlng commented Jan 25, 2019

gschorcht commented Jan 25, 2019

gschorcht commented Jan 25, 2019

gschorcht commented Jan 25, 2019

MichelRottleuthner commented Jan 25, 2019

gschorcht commented Jan 25, 2019

gschorcht commented Jan 25, 2019 •

edited

Loading

gschorcht commented Jan 25, 2019

MichelRottleuthner commented Jan 25, 2019

gschorcht commented Jan 26, 2019 •

edited

Loading

MichelRottleuthner commented Jan 29, 2019

gschorcht commented Jan 29, 2019 •

edited

Loading

gschorcht commented Jan 30, 2019

MichelRottleuthner left a comment

gschorcht commented Jan 30, 2019

aabadie commented Jan 30, 2019

gschorcht commented Jan 30, 2019 •

edited

Loading

cpu/esp8266: Fixes and improvements of esp_wifi netdev driver #10862

cpu/esp8266: Fixes and improvements of esp_wifi netdev driver #10862

Conversation

gschorcht commented Jan 25, 2019 • edited Loading

Contribution description

Testing procedure

Issues/PRs references

gschorcht commented Jan 25, 2019 • edited Loading

smlng left a comment

Choose a reason for hiding this comment

smlng commented Jan 25, 2019 • edited Loading

smlng commented Jan 25, 2019

gschorcht commented Jan 25, 2019

gschorcht commented Jan 25, 2019

gschorcht commented Jan 25, 2019

MichelRottleuthner commented Jan 25, 2019

gschorcht commented Jan 25, 2019

gschorcht commented Jan 25, 2019 • edited Loading

gschorcht commented Jan 25, 2019

MichelRottleuthner commented Jan 25, 2019

gschorcht commented Jan 26, 2019 • edited Loading

MichelRottleuthner commented Jan 29, 2019

gschorcht commented Jan 29, 2019 • edited Loading

gschorcht commented Jan 30, 2019

MichelRottleuthner left a comment

Choose a reason for hiding this comment

gschorcht commented Jan 30, 2019

aabadie commented Jan 30, 2019

gschorcht commented Jan 30, 2019 • edited Loading

gschorcht commented Jan 25, 2019 •

edited

Loading

gschorcht commented Jan 25, 2019 •

edited

Loading

smlng commented Jan 25, 2019 •

edited

Loading

gschorcht commented Jan 25, 2019 •

edited

Loading

gschorcht commented Jan 26, 2019 •

edited

Loading

gschorcht commented Jan 29, 2019 •

edited

Loading

gschorcht commented Jan 30, 2019 •

edited

Loading