Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

subsys: shell_telent: don't close connection on EAGAIN #66287

Merged
merged 1 commit into from
Dec 12, 2023

Conversation

talih0
Copy link
Contributor

@talih0 talih0 commented Dec 7, 2023

An EAGAIN error can occur if the tcp window size is full. Instead of closing the connection the application should sleep and retry sendin the packet.

@zephyrbot zephyrbot added the area: Shell Shell subsystem label Dec 7, 2023
@talih0 talih0 force-pushed the telnet_fix branch 2 times, most recently from 0e15df1 to a4f17d1 Compare December 7, 2023 18:30
@talih0 talih0 requested a review from rlubos December 7, 2023 18:30
Copy link
Contributor

@rlubos rlubos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I've left one more suggestion to make this even more robust.

for (;;) {
int err;

err = net_context_send(sh_telnet->client_ctx, msg, len, telnet_sent_cb,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one more thing that's missing to make this fully robust. net_context_send() returns the number of bytes sent, and it could return less than was requested (for example if window was almost full).
So, intead of having an endless loop with break, the loop could monitor how much bytes are left to send, see https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/lib/http/http_client.c#L32 for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @rlubos, I updated the PR.

When I was testing this PR, I noticed that there can be a deadlock on arp_mutex/conn->lock when using the "net arp" command with telnet shell
https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/lib/shell/arp.c#L63

Thread 1:
The "net arp" command will lock
arp_mutex followed by conn->lock

Thread 2:
In tcp_in() the locks are taken in the other order:
conn->lock followed by arp_mutex (in ethernet_ll_prepare_on_ipv4())

In some race conditions, we can get stuck in Thread 1 waiting for conn->lock, because Thread 2 has conn->lock and is waiting for arp_mutex.

We probably should not take arp_mutex for a long time via "net arp" shell command. We can instead copy all the arp entries, and send the copied entries to telnet shell.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of puzzled here, I can't see where (and why in the first place) would net arp command lock the TCP connection mutex?

Copy link
Contributor Author

@talih0 talih0 Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rlubos, sorry, was a bit busy last couple of days. Here's a backtrace:

0: net_tcp_queue () at /home/andriy/zephyrproject/zephyr/subsys/net/ip/tcp.c:3285
1: 0x0c016b2c in context_sendto () at /home/andriy/zephyrproject/zephyr/subsys/net/ip/net_context.c:2184
2: 0x0c016e3e in net_context_send () at /home/andriy/zephyrproject/zephyr/subsys/net/ip/net_context.c:2287
3: 0x0c0015a6 in telnet_send () at /home/andriy/zephyrproject/zephyr/subsys/shell/backends/shell_telnet.c:188
4: 0x0c001678 in write () at /home/andriy/zephyrproject/zephyr/subsys/shell/backends/shell_telnet.c:493
5: 0x0c014378 in z_shell_write () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_ops.c:434
6: 0x0c013fe0 in z_shell_fprintf_buffer_flush () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_fprintf.c:46
7: 0x0c0026b4 in z_shell_fprintf_fmt () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_fprintf.c:39
8: 0x0c014404 in z_shell_vfprintf () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell_ops.c:525
9: 0x0c013ef2 in shell_vfprintf () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell.c:1526
10: 0x0c013f32 in shell_fprintf () at /home/andriy/zephyrproject/zephyr/subsys/shell/shell.c:1543
11: 0x0c0042ca in arp_cb () at /home/andriy/zephyrproject/zephyr/subsys/net/lib/shell/arp.c:25
12: 0x0c00728c in net_arp_foreach () at /home/andriy/zephyrproject/zephyr/subsys/net/l2/ethernet/arp.c:794
13: 0x0c004294 in cmd_net_arp () at /home/andriy/zephyrproject/zephyr/subsys/net/lib/shell/arp.c:63
14: cmd_net_arp () at /home/andriy/zephyrproject/zephyr/subsys/net/lib/shell/arp.c:46

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @talih0 for the details, I somehow missed the information that you use a telnet backend. I see where's the problem now.

If I'm not mistaken, I think that the deadlock could be avoided if CONFIG_NET_TC_TX_COUNT was set to 1 (it defaults to 0 unless userspace is enabled). That way the Ethernet L2 is processed by a separate thread, so TCP/ARP mutexes won't be locked from the same thread. Would you be able to check that? If I'm correct, we could likely enforce this configuration if one of the networking shell backends are enabled.

Otherwise, I don't see other way around than postponing shell operations until we're out of net_arp_foreach() function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, @rlubos. Yes, setting CONFIG_NET_TC_TX_COUNT=1 fixes the problem.

@rlubos rlubos requested a review from jukkar December 8, 2023 08:17
EAGAIN error is returned if the tcp window size is full. Retry
sending the packet instead of closing the connection if this
error occurs.

Also the full payload may not be sent in a single call to
net_context_send(). Keep track of the number of bytes remaining
and try to send the full payload.

Signed-off-by: Andriy Gelman <[email protected]>
@carlescufi carlescufi merged commit 8232ddd into zephyrproject-rtos:main Dec 12, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Shell Shell subsystem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants