net: tcp: fix spurious TCP retries #9504

Merged
merged 1 commit into zephyrproject-rtos:master on Aug 20, 2018

Conversation

vaussard
Contributor

Spurious TCP retries were observed using Wireshark while continuously
sending TCP packets at an interval faster than the initial RTO.

If the send list is empty and CONFIG_NET_TCP_TIME_WAIT_DELAY is used,
the retry timer will not be correctly stopped when receiving a valid
ACK. As a consequence, the timer might be running when a new packet is
queued, but the logic in net_tcp_queue_data() will not restart the
timer as it is already running. This will cause the retry timer to
expire prematurely, potentially while packets are being sent.

The nested condition is merged into a single condition, allowing the
final else clause to be reached when a valid ACK is received.

Signed-off-by: Florian Vaussard <[email protected]>
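
For illustration, here is a minimal, self-contained sketch of the control-flow problem described above. It is not the actual code from subsys/net/ip/tcp.c: the struct, the field names (sent_list_empty, fin_sent, retry_timer_running) and the two ack_received_* functions are hypothetical stand-ins, used only to show why merging the nested condition lets the final else clause stop the retry timer when a valid ACK empties the send list.

```c
/* Hypothetical model of the timer handling described in the commit
 * message; names and structure are illustrative, not the actual
 * Zephyr code from subsys/net/ip/tcp.c. */
#include <stdbool.h>
#include <stdio.h>

struct fake_tcp {
	bool sent_list_empty;      /* all in-flight segments have been ACKed */
	bool fin_sent;             /* connection is closing (TIME_WAIT path) */
	bool retry_timer_running;  /* retransmission timer state */
};

/* Old shape: the nested condition swallows the "send list empty" case,
 * so a plain data ACK that empties the send list never reaches the
 * branch that would cancel the retry timer. */
static void ack_received_old(struct fake_tcp *tcp)
{
	if (tcp->sent_list_empty) {
		if (tcp->fin_sent) {
			/* start the TIME_WAIT timer (not modelled here) */
		}
		/* BUG: retry timer is left running */
	} else {
		tcp->retry_timer_running = true;  /* restart for remaining data */
	}
}

/* New shape: the conditions are merged, so the final else is reachable
 * and the retry timer is stopped once everything has been ACKed. */
static void ack_received_new(struct fake_tcp *tcp)
{
	if (tcp->sent_list_empty && tcp->fin_sent) {
		/* start the TIME_WAIT timer (not modelled here) */
	} else if (!tcp->sent_list_empty) {
		tcp->retry_timer_running = true;
	} else {
		tcp->retry_timer_running = false; /* no more spurious retries */
	}
}

int main(void)
{
	/* A valid ACK arrives, the send list is now empty, no FIN pending. */
	struct fake_tcp a = { true, false, true };
	struct fake_tcp b = a;

	ack_received_old(&a);
	ack_received_new(&b);
	printf("old logic: retry timer still running = %d\n", a.retry_timer_running);
	printf("new logic: retry timer still running = %d\n", b.retry_timer_running);
	return 0;
}
```

With the old shape the timer keeps running, so the next call to net_tcp_queue_data() sees it as already active and does not restart it, which is exactly the premature expiry described above.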
@pfalcon
Contributor

pfalcon commented Aug 17, 2018

@vaussard: Looks interesting. Are you aware that there are more issues in that area, e.g. #8188? And sent_list itself looks like it might get corrupted due to concurrency issues, as the weird debug traces in #5857 show.

Member

@jukkar jukkar left a comment


LGTM

@vaussard
Contributor Author

I do not think that I have encountered #5857 yet. Maybe #8188 has already happened, because I know that we had unexpected TCP disconnections sometimes. My first explanation was to blame the retry logic, because one of the hardware setups is a very nice collision generator. That's why I looked more carefully into the retry mechanisms. I will see if #8188 pops up in our tests.

@codecov-io

Codecov Report

Merging #9504 into master will decrease coverage by 0.02%.
The diff coverage is 25%.


@@            Coverage Diff            @@
##           master   #9504      +/-   ##
=========================================
- Coverage   52.22%   52.2%   -0.03%     
=========================================
  Files         212     212              
  Lines       25948   25951       +3     
  Branches     5577    5577              
=========================================
- Hits        13551   13547       -4     
- Misses      10140   10148       +8     
+ Partials     2257    2256       -1
Impacted Files         Coverage       Δ
subsys/net/ip/tcp.c    58.67% <25%>   (-0.3%)  ⬇️
kernel/timer.c         93.02% <0%>    (-3.49%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 47889cd...7f975dd.

@pfalcon
Contributor

pfalcon commented Aug 20, 2018

@vaussard

I do not think that I have encountered #5857 yet. Maybe #8188 has already happened, because I know that we had unexpected TCP disconnections sometimes. My first explanation was to blame the retry logic, because one of the hardware setups is a very nice collision generator. That's why I looked more carefully into the retry mechanisms. I will see if #8188 pops up in our tests.

Thanks. The first step is indeed getting confirmation from other folks that they see the same or similar problematic behavior. Myself, I couldn't find time to concentrate on investigating them further, with the flood of refactoring and new features Zephyr has :-(. For your reference, #5857 was spotted while playing with MicroPython's Zephyr port. Using an interactive interpreted language makes playing with networking easy, and it doesn't take long to drive the IP stack into havoc. (IIRC, it was sending a bunch of smallish packets, e.g. generating an HTTP request with a bunch of send() calls.)

@pfalcon
Contributor

pfalcon commented Aug 20, 2018

From commit message:

Spurious TCP retries were observed using Wireshark while continuously
sending TCP packets at an interval faster than the initial RTO.

Based on this description, it's unclear whether packets were sent from or to Zephyr.

@vaussard
Contributor Author

Based on this description, it's unclear whether packets were sent from or to Zephyr.

Packets were sent from Zephyr to Zephyr :) The sending node resent some packets without giving the other node enough time to ACK them. Do you want me to update the description to make it clearer?
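
(For anyone trying to reproduce this traffic pattern outside of the original setup, a plain POSIX sender that issues small back-to-back send() calls, well below a typical initial RTO of a few hundred milliseconds to a second, should be enough to exercise the same path. The sketch below is only illustrative, with a placeholder address and port; it is not the test code used here.)

```c
/* Illustrative POSIX sender: streams small TCP packets back to back,
 * much faster than a typical initial RTO. Placeholder address/port;
 * not the actual test setup from this PR. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in peer = { 0 };
	int sock = socket(AF_INET, SOCK_STREAM, 0);
	char payload[32] = "ping";

	peer.sin_family = AF_INET;
	peer.sin_port = htons(4242);                      /* placeholder port */
	inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);  /* placeholder address */

	if (sock < 0 || connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		perror("connect");
		return 1;
	}

	for (int i = 0; i < 1000; i++) {
		if (send(sock, payload, sizeof(payload), 0) < 0) {
			perror("send");
			break;
		}
		usleep(10 * 1000); /* 10 ms between sends, well below the RTO */
	}

	close(sock);
	return 0;
}
```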

@pfalcon
Contributor

pfalcon commented Aug 20, 2018

Do you want me to update the description to make it clearer?

Up to you, and thanks for considering that option ;-). In this case, it's probably not too important, as the rest of the commit message explains the situation with the code.

@jukkar jukkar merged commit 5212659 into zephyrproject-rtos:master Aug 20, 2018