net: tcp: fix spurious TCP retries #9504

Merged
merged 1 commit into zephyrproject-rtos:master on Aug 20, 2018

Conversation

vaussard
Contributor

Spurious TCP retries were observed using Wireshark while continuously
sending TCP packets at an interval faster than the initial RTO.

If the send list is empty and CONFIG_NET_TCP_TIME_WAIT_DELAY is used,
the retry timer will not be correctly stopped when receiving a valid
ACK. As a consequence, the timer might be running when a new packet is
queued, but the logic in net_tcp_queue_data() will not restart the
timer as it is already running. This will cause the retry timer to
expire prematurely, potentially while packets are being sent.

The nested condition is merged into a single condition, allowing the
final else clause to be reached when a valid ACK is received.

Signed-off-by: Florian Vaussard <[email protected]>
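
For illustration, here is a minimal, self-contained sketch of the control-flow problem described above. It is not the actual code from subsys/net/ip/tcp.c: the struct, the field names (sent_list_empty, fin_sent, retry_timer_running) and the two ack_received_* functions are hypothetical stand-ins, used only to show why merging the nested condition lets the final else clause stop the retry timer when a valid ACK empties the send list.

```c
/* Hypothetical model of the timer handling described in the commit
 * message; names and structure are illustrative, not the actual
 * Zephyr code from subsys/net/ip/tcp.c. */
#include <stdbool.h>
#include <stdio.h>

struct fake_tcp {
	bool sent_list_empty;      /* all in-flight segments have been ACKed */
	bool fin_sent;             /* connection is closing (TIME_WAIT path) */
	bool retry_timer_running;  /* retransmission timer state */
};

/* Old shape: the nested condition swallows the "send list empty" case,
 * so a plain data ACK that empties the send list never reaches the
 * branch that would cancel the retry timer. */
static void ack_received_old(struct fake_tcp *tcp)
{
	if (tcp->sent_list_empty) {
		if (tcp->fin_sent) {
			/* start the TIME_WAIT timer (not modelled here) */
		}
		/* BUG: retry timer is left running */
	} else {
		tcp->retry_timer_running = true;  /* restart for remaining data */
	}
}

/* New shape: the conditions are merged, so the final else is reachable
 * and the retry timer is stopped once everything has been ACKed. */
static void ack_received_new(struct fake_tcp *tcp)
{
	if (tcp->sent_list_empty && tcp->fin_sent) {
		/* start the TIME_WAIT timer (not modelled here) */
	} else if (!tcp->sent_list_empty) {
		tcp->retry_timer_running = true;
	} else {
		tcp->retry_timer_running = false; /* no more spurious retries */
	}
}

int main(void)
{
	/* A valid ACK arrives, the send list is now empty, no FIN pending. */
	struct fake_tcp a = { true, false, true };
	struct fake_tcp b = a;

	ack_received_old(&a);
	ack_received_new(&b);
	printf("old logic: retry timer still running = %d\n", a.retry_timer_running);
	printf("new logic: retry timer still running = %d\n", b.retry_timer_running);
	return 0;
}
```

With the old shape the timer keeps running, so the next call to net_tcp_queue_data() sees it as already active and does not restart it, which is exactly the premature expiry described above.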
@pfalcon
Contributor

pfalcon commented Aug 17, 2018

@vaussard: Looks interesting. Are you aware that there are more issues in that area, e.g. #8188? And sent_list itself looks like it might get corrupted due to concurrency issues, as the weird debug traces in #5857 show.

Member

@jukkar jukkar left a comment


LGTM

@vaussard
Contributor Author

I do not think that I have encountered #5857 yet. Maybe #8188 has already happened, because I know that we had unexpected TCP disconnections sometimes. My first explanation was to blame the retry logic, because one of the hardware setups is a very nice collision generator. That's why I looked more carefully into the retry mechanisms. I will see if #8188 pops up in our tests.

@codecov-io

Codecov Report

Merging #9504 into master will decrease coverage by 0.02%.
The diff coverage is 25%.


@@            Coverage Diff            @@
##           master   #9504      +/-   ##
=========================================
- Coverage   52.22%   52.2%   -0.03%     
=========================================
  Files         212     212              
  Lines       25948   25951       +3     
  Branches     5577    5577              
=========================================
- Hits        13551   13547       -4     
- Misses      10140   10148       +8     
+ Partials     2257    2256       -1
Impacted Files         Coverage       Δ
subsys/net/ip/tcp.c    58.67% <25%>   (-0.3%)  ⬇️
kernel/timer.c         93.02% <0%>    (-3.49%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 47889cd...7f975dd.

@pfalcon
Contributor

pfalcon commented Aug 20, 2018

@vaussard

I do not think that I have encountered #5857 yet. Maybe #8188 has already happened, because I know that we had unexpected TCP disconnections sometimes. My first explanation was to blame the retry logic, because one of the hardware setups is a very nice collision generator. That's why I looked more carefully into the retry mechanisms. I will see if #8188 pops up in our tests.

Thanks. The first step is indeed getting confirmation from other folks that they see the same or similar problematic behavior. Myself, I couldn't find time to concentrate on investigating them further, with the flood of refactoring and new features Zephyr has :-(. For your reference, #5857 was spotted while playing with MicroPython's Zephyr port. Using an interactive interpreted language makes playing with networking easy, and it doesn't take long to drive the IP stack into havoc. (IIRC, it was sending a bunch of smallish packets, e.g. generating an HTTP request with a bunch of send() calls.)

@pfalcon
Contributor

pfalcon commented Aug 20, 2018

From commit message:

Spurious TCP retries were observed using Wireshark while continuously
sending TCP packets at an interval faster than the initial RTO.

Based on this description, it's unclear whether packets were sent from or to Zephyr.

@vaussard
Contributor Author

Based on this description, it's unclear whether packets were sent from or to Zephyr.

Packets were sent from Zephyr to Zephyr :) The sending node resent some packets without giving the other node enough time to ACK them. Do you want me to update the description to make it clearer?
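
(For anyone trying to reproduce this traffic pattern outside of the original setup, a plain POSIX sender that issues small back-to-back send() calls, well below a typical initial RTO of a few hundred milliseconds to a second, should be enough to exercise the same path. The sketch below is only illustrative, with a placeholder address and port; it is not the test code used here.)

```c
/* Illustrative POSIX sender: streams small TCP packets back to back,
 * much faster than a typical initial RTO. Placeholder address/port;
 * not the actual test setup from this PR. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in peer = { 0 };
	int sock = socket(AF_INET, SOCK_STREAM, 0);
	char payload[32] = "ping";

	peer.sin_family = AF_INET;
	peer.sin_port = htons(4242);                      /* placeholder port */
	inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);  /* placeholder address */

	if (sock < 0 || connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		perror("connect");
		return 1;
	}

	for (int i = 0; i < 1000; i++) {
		if (send(sock, payload, sizeof(payload), 0) < 0) {
			perror("send");
			break;
		}
		usleep(10 * 1000); /* 10 ms between sends, well below the RTO */
	}

	close(sock);
	return 0;
}
```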

@pfalcon
Contributor

pfalcon commented Aug 20, 2018

Do you want me to update the description to make it clearer?

Up to you, and thanks for considering that option ;-). In this case, it's probably not too important, as the rest of the commit message explains the situation with the code.

@jukkar jukkar merged commit 5212659 into zephyrproject-rtos:master Aug 20, 2018