Network deadlock because of mutex locking order #29347
Comments
One thing which is a bit of a mystery here is that the RX thread in the receive path and the application thread are separate threads. So even if …
Release the context lock before passing data to the application socket as that might cause deadlock if the application is run before the RX thread and it starts to send data and if the RX thread is never able to run (because of priorities etc). Fixes zephyrproject-rtos#29347 Signed-off-by: Jukka Rissanen <[email protected]>
@ohitz thanks for a good analysis. I think the new version of the fix will solve the issue; at least I was no longer able to replicate the hang. In the fix, we do not pass data to the application with the TCP conn->lock held.
Thanks @jukkar, I have tried the new fix. Unfortunately, there is a NULL pointer access now at …
Indeed, the connection handler disappears. This is an easy fix; I will send a new version in a few minutes.
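Since the patches themselves are not quoted in this thread, the following is only a rough sketch of the pattern being discussed, with made-up names (demo_conn, deliver_to_app, recv_cb) rather than the real Zephyr identifiers: drop the connection lock around the upcall into the application, re-take it afterwards, and re-check the connection because it may have been torn down in the meantime.

```c
#include <zephyr/kernel.h>
#include <stdbool.h>

/* Hypothetical sketch, NOT the actual Zephyr patch: hand received data to
 * the application without holding the connection lock, then re-take the
 * lock and re-validate the connection afterwards.
 */
struct demo_conn {
	struct k_mutex lock;
	bool in_use;                       /* stands in for "handler still registered" */
	void (*recv_cb)(void *user_data);  /* application receive callback */
	void *user_data;
};

/* Called with conn->lock held. */
static void deliver_to_app(struct demo_conn *conn)
{
	void (*cb)(void *user_data) = conn->recv_cb;
	void *ud = conn->user_data;

	k_mutex_unlock(&conn->lock);          /* never call into the app with the lock held */
	cb(ud);                               /* the app may send(), taking the context lock */
	k_mutex_lock(&conn->lock, K_FOREVER);

	if (!conn->in_use) {
		return;                       /* connection went away while unlocked */
	}
	/* ... continue processing under the lock ... */
}
```

The key point is that while the callback runs, no lock that the send path also needs is held, so the two paths can no longer wait on each other.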
I remember that there were some commits in another TCP issue where one could catch the null pointer access violations with this hardware; you seem to have that applied. Have you considered upstreaming that properly? It could be quite useful to have.
I can confirm the fix now works properly, thanks @jukkar! As for the MPU configuration to catch null pointer access violations, I'll check with my colleague. |
@jukkar I have no time to clean up our patch now, but I can publish it on GitHub if someone wants to take it over. It is not ready for a pull request yet since it is very specific to our target.
Describe the bug
It is possible to deadlock the entire network stack with TCP. If this happens, the network is completely blocked; not even ICMP traffic works anymore.
The reason is that the send and receive paths each lock the same two mutexes, but in opposite order.

Receive path (RX work queue):

    z_work_q_main()
      -> process_rx_packet()
        -> net_rx()
          -> process_data()
            -> net_ipv4_input()
              -> net_conn_input()
                -> tcp_recv()
                  -> tcp_in()                           (locks conn->lock)
                    -> tcp_data_get()
                      -> net_context_packet_received()  (locks context->lock)

Send path (application thread):

    send()
      -> zsock_send()
        -> zsock_sendto()
          -> z_impl_zsock_sendto()
            -> sock_sendto_vmeth()
              -> zsock_sendto_ctx()
                -> net_context_send()                   (locks context->lock)
                  -> context_sendto()
                    -> net_tcp_queue_data()             (locks conn->lock)

The deadlock happens when each path has taken its first mutex and then tries to take the one the other path is already holding; at that point both paths wait forever for each other to release the locks.
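To make the inversion concrete, here is a small self-contained sketch (simplified, assumed code, not the Zephyr networking sources; the include path follows recent Zephyr, older trees use zephyr.h) in which two threads take the same pair of k_mutex objects in opposite order, mirroring the two call paths above:

```c
#include <zephyr/kernel.h>

/* Two mutexes standing in for conn->lock and context->lock. */
K_MUTEX_DEFINE(conn_lock);
K_MUTEX_DEFINE(context_lock);

/* Receive-path shape: conn->lock first, then context->lock. */
static void rx_path(void *p1, void *p2, void *p3)
{
	ARG_UNUSED(p1); ARG_UNUSED(p2); ARG_UNUSED(p3);

	k_mutex_lock(&conn_lock, K_FOREVER);
	k_msleep(10);                            /* widen the race window */
	k_mutex_lock(&context_lock, K_FOREVER);  /* blocks forever if tx_path holds it */
	k_mutex_unlock(&context_lock);
	k_mutex_unlock(&conn_lock);
}

/* Send-path shape: context->lock first, then conn->lock. */
static void tx_path(void *p1, void *p2, void *p3)
{
	ARG_UNUSED(p1); ARG_UNUSED(p2); ARG_UNUSED(p3);

	k_mutex_lock(&context_lock, K_FOREVER);
	k_msleep(10);
	k_mutex_lock(&conn_lock, K_FOREVER);     /* blocks forever if rx_path holds it */
	k_mutex_unlock(&conn_lock);
	k_mutex_unlock(&context_lock);
}

K_THREAD_DEFINE(rx_tid, 1024, rx_path, NULL, NULL, NULL, 5, 0, 0);
K_THREAD_DEFINE(tx_tid, 1024, tx_path, NULL, NULL, NULL, 5, 0, 0);
```

With the sleeps in place, each thread ends up holding the lock the other one needs and both block forever. In the real stack the receive processing shown above runs from a work queue thread (z_work_q_main), so once it blocks, no other inbound packets, including ICMP, are processed either.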
To Reproduce
This problem is very sensitive to timing. It can be reliably reproduced by inserting a short sleep right before acquiring the mutex in net_tcp_queue_data(); in that case, a small modification to the echo server in the samples can be used to demonstrate the problem (a sketch of such a change is shown after the steps below).
Steps to reproduce the behavior:
echo -n "test" | nc -v -N 192.0.2.1 4242
The stack is blocked; not even ping works anymore.
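The exact modification is not shown in the report; as a purely hypothetical sketch (paraphrased names, not the real Zephyr source), the idea is simply to stall the send path right before it takes the connection lock so that the RX path reliably grabs its lock in the meantime:

```c
#include <zephyr/kernel.h>

/* Hypothetical illustration only, not the real net_tcp_queue_data():
 * pausing just before the lock acquisition lets the RX work queue take
 * conn->lock first, so the two lock orders reliably cross.
 */
static int queue_data_for_send(struct k_mutex *conn_lock)
{
	k_msleep(100);                      /* injected delay, as described above */
	k_mutex_lock(conn_lock, K_FOREVER); /* stands in for the existing lock in net_tcp_queue_data() */

	/* ... the real function would queue the outgoing data here ... */

	k_mutex_unlock(conn_lock);
	return 0;
}
```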
Expected behavior
The stack should not block.
Impact
This is a pretty bad, textbook-style deadlock.
Environment