Network deadlock because of mutex locking order #29347

Closed
ohitz opened this issue Oct 20, 2020 · 9 comments · Fixed by #29351
Labels: area: Networking · bug · priority: medium

Comments

ohitz (Contributor) commented Oct 20, 2020

Describe the bug

It is possible to deadlock the entire network stack with TCP. If this happens, the network is completely blocked, not even ICMP traffic works anymore.

The reason is that the send and receive paths each lock two mutexes in different order.

  • Receive path: z_work_q_main() -> process_rx_packet() -> net_rx() -> process_data() -> net_ipv4_input() -> net_conn_input() -> tcp_recv() -> tcp_in() (locks conn->lock) -> tcp_data_get() -> net_context_packet_received() (locks context->lock)
  • Send path: send() -> zsock_send() -> zsock_sendto() -> z_impl_zsock_sendto() -> sock_sendto_vmeth() -> zsock_sendto_ctx() -> net_context_send() (locks context->lock) -> context_sendto() -> net_tcp_queue_data() (locks conn->lock)

The deadlock happens when each path has acquired its first mutex and then tries to acquire the one already held by the other path. At that point both paths wait forever for each other to release their locks.
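
To make the ordering concrete, here is a minimal, self-contained sketch of the same AB-BA pattern using two Zephyr threads and two k_mutex objects. The names (lock_a, lock_b, rx_path, send_path) and the thread definitions are invented for illustration and are not the actual stack code:

```c
/*
 * Hypothetical reduction of the reported AB-BA deadlock: two threads take
 * the same two mutexes in opposite order, mirroring the RX and send paths
 * described above.
 */
#include <zephyr.h>

K_MUTEX_DEFINE(lock_a); /* stands in for conn->lock */
K_MUTEX_DEFINE(lock_b); /* stands in for context->lock */

static void rx_path(void *p1, void *p2, void *p3)
{
	ARG_UNUSED(p1); ARG_UNUSED(p2); ARG_UNUSED(p3);

	/* tcp_in() takes conn->lock first ... */
	k_mutex_lock(&lock_a, K_FOREVER);
	k_sleep(K_MSEC(10)); /* widen the race window, as in the repro */
	/* ... then net_context_packet_received() takes context->lock */
	k_mutex_lock(&lock_b, K_FOREVER);

	k_mutex_unlock(&lock_b);
	k_mutex_unlock(&lock_a);
}

static void send_path(void *p1, void *p2, void *p3)
{
	ARG_UNUSED(p1); ARG_UNUSED(p2); ARG_UNUSED(p3);

	/* net_context_send() takes context->lock first ... */
	k_mutex_lock(&lock_b, K_FOREVER);
	k_sleep(K_MSEC(10));
	/* ... then net_tcp_queue_data() wants conn->lock: both threads now
	 * wait on each other forever.
	 */
	k_mutex_lock(&lock_a, K_FOREVER);

	k_mutex_unlock(&lock_a);
	k_mutex_unlock(&lock_b);
}

K_THREAD_DEFINE(rx_tid, 1024, rx_path, NULL, NULL, NULL, 5, 0, 0);
K_THREAD_DEFINE(tx_tid, 1024, send_path, NULL, NULL, NULL, 5, 0, 0);
```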

To Reproduce

This problem is very sensitive to timing. It can be reliably reproduced by inserting a short sleep right before acquiring the mutex in net_tcp_queue_data(). With that in place, a small modification to the echo server in the samples can be used to demonstrate the problem.

Steps to reproduce the behavior:

  1. Patch the TCP stack and the echo server:
diff --git a/samples/net/sockets/echo_server/src/tcp.c b/samples/net/sockets/echo_server/src/tcp.c
index 0ff35a1852..034e2e7c7b 100644
--- a/samples/net/sockets/echo_server/src/tcp.c
+++ b/samples/net/sockets/echo_server/src/tcp.c
@@ -129,6 +129,8 @@ static void handle_data(void *ptr1, void *ptr2, void *ptr3)
 
        client = data->tcp.accepted[slot].sock;
 
+       sendall(client, "HELLO", sizeof("HELLO"));
+
        do {
                received = recv(client,
                        data->tcp.accepted[slot].recv_buffer + offset,
diff --git a/subsys/net/ip/tcp2.c b/subsys/net/ip/tcp2.c
index 36d6b191d1..ff7d64d5bd 100644
--- a/subsys/net/ip/tcp2.c
+++ b/subsys/net/ip/tcp2.c
@@ -1550,6 +1550,8 @@ int net_tcp_queue_data(struct net_context *context, struct net_pkt *pkt)
                return -ENOTCONN;
        }
 
+       k_sleep(K_MSEC(10));
+
        k_mutex_lock(&conn->lock, K_FOREVER);
 
        if (tcp_window_full(conn)) {
  2. Compile the echo server and flash it (tested on MIMXRT1060-EVK)
  3. Connect to the echo server, send data and close the connection right away: echo -n "test" | nc -v -N 192.0.2.1 4242

The stack is blocked, not even ping works anymore.

Expected behavior

The stack should not block.

Impact

This is a textbook lock-ordering deadlock, and it takes down the entire network stack.

Environment

ohitz added the bug label Oct 20, 2020
jukkar (Member) commented Oct 20, 2020

> The reason is that the send and receive paths each lock two mutexes in different order.
>
> * Receive path: `z_work_q_main()` -> `process_rx_packet()` -> `net_rx()` -> `process_data()` -> `net_ipv4_input()` -> `net_conn_input()` -> `tcp_recv()` -> `tcp_in()` **(locks conn->lock)** -> `tcp_data_get()` -> `net_context_packet_received()` **(locks context->lock)**
>
> * Send path: `send()` -> `zsock_send()` -> `zsock_sendto()` -> `z_impl_zsock_sendto()` -> `sock_sendto_vmeth()` -> `zsock_sendto_ctx()` -> `net_context_send()` **(locks context->lock)** -> `context_sendto()` -> `net_tcp_queue_data()` **(locks conn->lock)**

One thing which is a bit of a mystery here is that the RX thread in the receive path and the application thread are separate threads. So even if net_context_packet_received() calls the socket recv_cb() with the context locked, the application thread will just block when trying to take the context lock, in which case the RX thread should be able to continue running. Of course, if the RX thread is never able to run because of priorities, then the unlock will not happen.
Anyway, one easy "fix" here is to just release the context lock in net_context_packet_received() before calling recv_cb().
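
For illustration, the proposed pattern would look roughly like the sketch below. deliver_to_application and recv_cb_t are invented stand-ins; the real change in #29351 operates on the actual net_context receive callback in net_context_packet_received():

```c
#include <zephyr.h>
#include <net/net_context.h>
#include <net/net_pkt.h>

/* Stand-in for the real net_context receive callback type. */
typedef void (*recv_cb_t)(struct net_context *context, struct net_pkt *pkt,
			  void *user_data);

/* Sketch only: the essential shape of "release the lock before calling out". */
static void deliver_to_application(struct net_context *context,
				   struct net_pkt *pkt,
				   recv_cb_t cb, void *user_data)
{
	/* Drop context->lock before calling into the application so that a
	 * concurrent send() path (net_context_send()) can take it without
	 * waiting on the RX thread.
	 */
	k_mutex_unlock(&context->lock);

	cb(context, pkt, user_data);

	/* Re-acquire the lock before returning, since the caller still
	 * expects to hold it.
	 */
	k_mutex_lock(&context->lock, K_FOREVER);
}
```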

ohitz (Contributor, Author) commented Oct 20, 2020

@jukkar there is still something odd with your PR #29351. When the sleep is left in net_tcp_queue_data(), the echo server still locks the network stack as above, trying to lock the mutex in net_tcp_queue_data().

jukkar (Member) commented Oct 20, 2020

> @jukkar there is still something odd with your PR #29351. When the sleep is left in net_tcp_queue_data(), the echo server still locks the network stack as above, trying to lock the mutex in net_tcp_queue_data().

Ok, thanks for the info. I will try to replicate the issue.

jukkar added a commit to jukkar/zephyr that referenced this issue Oct 21, 2020
Release the context lock before passing data to the application
socket as that might cause deadlock if the application is run
before the RX thread and it starts to send data and if the RX
thread is never able to run (because of priorities etc).

Fixes zephyrproject-rtos#29347

Signed-off-by: Jukka Rissanen <[email protected]>
jukkar (Member) commented Oct 21, 2020

@ohitz thanks for the good analysis. I think the new version of the fix will solve the issue; at least I was no longer able to replicate the hang. In the fix, we do not pass data to the application with the TCP conn->lock held.

ohitz (Contributor, Author) commented Oct 21, 2020

Thanks @jukkar, I have tried the new fix. Unfortunately, there is a NULL pointer access now at subsys/net/ip/tcp2.c:1494 if the client closes the connection. The echo server doesn't crash, but you can see it if you insert an assertion which checks conn->context != NULL right there.
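
For reference, the kind of check being described might look like the sketch below. The helper name and includes are assumptions, and the exact spot in tcp2.c where conn->context is dereferenced may differ:

```c
#include <sys/__assert.h>
#include "tcp2_priv.h" /* struct tcp, when built as part of the stack */

/* Sketch only: assert that the TCP connection is still attached to a
 * net_context before it is dereferenced in the data-delivery path. If the
 * peer has already closed the connection, conn->context can be NULL here.
 */
static inline void assert_context_attached(struct tcp *conn)
{
	__ASSERT(conn->context != NULL,
		 "conn %p has no net_context (connection already closed?)",
		 conn);
}
```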

jukkar (Member) commented Oct 21, 2020

> Unfortunately, there is a NULL pointer access now at subsys/net/ip/tcp2.c:1494 if the client closes the connection.

Indeed, the connection handler disappears. This is an easy fix; I will send a new version in a few minutes.

jukkar (Member) commented Oct 21, 2020

I remember there were some commits in another TCP issue that made it possible to catch null pointer access violations on this hardware, and you seem to have that applied. Have you considered upstreaming that properly? It could be quite useful to have.

ohitz (Contributor, Author) commented Oct 21, 2020

I can confirm the fix now works properly, thanks @jukkar!

As for the MPU configuration to catch null pointer access violations, I'll check with my colleague.

armandciejak (Contributor) commented

@jukkar I have no time to clean up our patch right now, but I can publish it on GitHub if someone wants to take it over. It is not ready for a pull request yet since it is very specific to our target.

MaureenHelm pushed a commit that referenced this issue Oct 21, 2020
Release the context lock before passing data to the application
socket as that might cause deadlock if the application is run
before the RX thread and it starts to send data and if the RX
thread is never able to run (because of priorities etc).

Fixes #29347

Signed-off-by: Jukka Rissanen <[email protected]>