memory corruption in pkt_alloc #29003
Comments
Thanks for the analysis. I checked the |
The problem occurs in the function pkt_alloc in subsys/net/ip/net_pkt.c when it is called with CLONE_TIMEOUT. The function calls k_mem_slab_alloc, which in my case returns 0 (success) while the pkt pointer is NULL. Because pkt_alloc relies on the return code alone, it does not check the pkt pointer and passes the NULL pointer to the subsequent call memset(pkt, 0, sizeof(struct net_pkt));. I can reproduce the problem under heavy multicast traffic, which causes a lot of packet cloning and eventually, it seems, exhaustion of the packet buffers. What I find strange is the function k_mem_slab_alloc:
|
Ok, this seems to indicate issue in |
Are you also using |
Yes, we also use at least k_msgq, work queues, and threads.
Could you provide something that can reproduce the issue via QEMU (which will help with debugging)? Just by looking at the code, mem slab should not return a null pointer in |
Sorry, I do not have the resources to reproduce this outside the scenario I described. Could it be that there simply was no call to k_mem_slab_free within the given timeout? Another possibility I see is that our BSP code (aarch64) somehow messes things up.
It should return I ran my mock-up app with |
Another thing, is it always happening to the same thread? If so, maybe you can setup a watch point to catch when the thread's |
Looks like #29615 was submitted to address exactly the same issue. That's more of a workaround than a fix, though.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time. |
Running Zperf UDP-RX while issuing "wifi status" in the background a few times causes a crash (data access violation), and the call stack does not help. The current theory is that during "wifi status" the driver waits for RPU responses, which sit in the same queue as RX frames. This delays RX frame processing, the network stack runs out of buffers, and subsequent allocations in the driver wait for memory to be reclaimed. That wait internally uses z_pend_curr, which switches threads, and the crash is then seen in the WPA supplicant thread context (probably in l2_packet_recv or socket select). A similar issue is reported in [1], but the fix didn't help, so as a temporary workaround the allocation timeout is disabled; with this change the issue is not seen.
[1] - zephyrproject-rtos/zephyr#29003
Signed-off-by: Krishna T <[email protected]>
I am seeing this issue with latest Zephyr ( Test
Context
Ozone callstack:
This varies, but I consistently see two call stacks, neither of which is directly related to the code.
Enabling all network debug output was unhelpful. It looks like some thread switching while we wait for network buffers to be freed is causing the issue. For now, not using a timeout works well, so I am proceeding with that. I can provide any details needed. @rlubos @carlescufi please re-open the issue or let me know if you need me to raise a new bug.
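To make the "no timeout" workaround concrete, here is a minimal host-side sketch of a receive path that allocates without waiting and drops the frame when no buffer is free. It is not the driver's or Zephyr's code: mock_alloc_nowait, rx_frame, rx_dropped, and the buffer pool are hypothetical names invented for illustration, and the mock only imitates the behaviour one would expect from an allocation made with K_NO_WAIT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical RX buffer pool; the sizes are arbitrary illustration values. */
#define MOCK_RX_BUFS 2
static unsigned char mock_bufs[MOCK_RX_BUFS][64];
static int mock_in_use;

/* Counts frames dropped because no buffer was available. */
static unsigned int rx_dropped;

/* Non-blocking allocation: fails immediately instead of pending the caller. */
static int mock_alloc_nowait(void **mem)
{
    assert(mem != NULL);
    if (mock_in_use >= MOCK_RX_BUFS) {
        *mem = NULL;
        return -ENOMEM;
    }
    *mem = mock_bufs[mock_in_use++];
    return 0;
}

/* Hypothetical RX hook: returns 0 if a buffer was obtained, -ENOMEM if the frame was dropped. */
static int rx_frame(void)
{
    void *buf = NULL;

    if (mock_alloc_nowait(&buf) != 0 || buf == NULL) {
        rx_dropped++; /* drop the frame rather than block the RX path */
        return -ENOMEM;
    }
    /* A real driver would copy the frame into buf and hand it to the stack. */
    return 0;
}
```

The trade-off in this sketch is that frames are lost under buffer pressure, but the calling thread is never put to sleep inside the allocator, which is the code path the theory above points at.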
@krish2718 I'd rather open a new issue instead of resurrecting a report that is over two years old. However, I think it'll be really difficult to help here without some way to reproduce the issue, ideally in the form of a minimal sample or test that runs in a simulated environment.
When pkt_alloc is called with a timeout (CLONE_TIMEOUT=100ms) and there is no free buffer, k_mem_slab_alloc returns 0 (OK) but the returned pkt is NULL. Because the pkt pointer is not checked afterwards, this results in memory corruption.
I encountered this problem in the following scenario: I have a multicast listener (mdns ipv4) and bombard the device with multicast traffic for this listener, aiming to exhaust the pkt buffer pool. In the code (net_conn_input) I see that multicast packets are cloned as follows:
mcast_pkt = net_pkt_clone(pkt, CLONE_TIMEOUT);
This eventually calls pkt_alloc, which calls k_mem_slab_alloc. When the slab is exhausted and the timeout is not K_NO_WAIT, k_mem_slab_alloc returns a return code of zero together with a NULL pkt.
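The failure mode described above can be illustrated with a small host-side sketch of a defensive allocation wrapper that checks both the return code and the returned pointer before calling memset. This is not the code in net_pkt.c: struct mock_pkt, mock_slab_alloc, mock_inject_fault, and pkt_alloc_checked are hypothetical names, and the mock only copies the "return code plus out-pointer" shape of k_mem_slab_alloc. The fault flag imitates the reported behaviour (return code 0 with a NULL block).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct net_pkt; the real structure is larger. */
struct mock_pkt {
    int refcount;
    unsigned char data[32];
};

/* Hypothetical fixed-size pool standing in for the packet memory slab. */
#define MOCK_SLAB_BLOCKS 2
static struct mock_pkt mock_blocks[MOCK_SLAB_BLOCKS];
static int mock_used;

/* Set to 1 to imitate the reported misbehaviour: rc == 0 with a NULL block. */
static int mock_inject_fault;

/* Mock allocator: returns 0 and a block, or an error and NULL when exhausted. */
static int mock_slab_alloc(void **mem)
{
    assert(mem != NULL);
    if (mock_used < MOCK_SLAB_BLOCKS) {
        *mem = &mock_blocks[mock_used++];
        return 0;
    }
    *mem = NULL;
    return mock_inject_fault ? 0 : -ENOMEM;
}

/* Defensive allocation: trust neither the return code nor the pointer alone. */
static struct mock_pkt *pkt_alloc_checked(void)
{
    void *mem = NULL;
    int ret = mock_slab_alloc(&mem);

    if (ret != 0 || mem == NULL) {
        return NULL; /* the caller must handle allocation failure */
    }
    memset(mem, 0, sizeof(struct mock_pkt));
    ((struct mock_pkt *)mem)->refcount = 1;
    return (struct mock_pkt *)mem;
}
```

A check like this only guards against the symptom; it does not explain why an allocator would report success while handing back a NULL block.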