memory corruption in pkt_alloc #29003

eyuval · 2020-10-07T12:54:05Z

When pkt_alloc is called with a timeout (CLONE_TIMEOUT=100ms) and there is no free buffer, k_mem_slab_alloc returns a 0 value (OK) but the returned pkt is NULL. Since afterwards the pkt pointer is not checked it results in memory corruption.

I encountered this problem in the following scenario: I have a multicast listener (mdns ipv4) and bombard the device with multicast traffic for this listener, the aim being to exhaust the pkt buffer pool. From the code (net_conn_input) I see that multicast packets are cloned as follows:
mcast_pkt = net_pkt_clone(pkt, CLONE_TIMEOUT);

This eventually calls pkt_alloc, which calls k_mem_slab_alloc. When the slab is exhausted and because our timeout is not K_NO_WAIT, the function k_mem_slab_alloc returns a return code of zero but also a NULL pkt.

jukkar · 2020-10-07T14:37:56Z

Thanks for the analysis. I checked the net_pkt_clone() call in connection.c and at least it is checking the return value. Can you point me to the line where you found the issue?

eyuval · 2020-10-07T16:59:58Z

The problem happens in function pkt_alloc in subsys/net/ip/net_pkt.c, called with CLONE_TIME. The function calls k_mem_slab_alloc, which in my case returns 0, indicating success but the pkt pointer is NULL. Because pkt_alloc relies on the return code it does not check the pkt pointer and uses the returned NULL in the subsequent call to memset(pkt, 0, sizeof(struct net_pkt));. I reconstruct the problem under heavy multicast traffic, which causes a lot of packet cloning and finally apparently the exhaustion of the packet buffers. The strange thing for me is the function k_mem_slab_alloc:
The problematic code (below) somehow assumes that _current->base.swap_data; has a valid memory pointer. Shouldn't the logic be somewhat like: return *mem ? result : -ENOMEM

/* wait for a free block or timeout */
	result = z_pend_curr(&lock, key, &slab->wait_q, timeout);
	if (result == 0) {
		*mem = _current->base.swap_data;
	}
	return result;

jukkar · 2020-10-08T07:36:27Z

Ok, this seems to indicate issue in k_mem_slab_alloc() if 0 is returned but the returned memory pointer is NULL.
Assigning to @andyross for analysis.

dcpleung · 2020-10-15T20:52:34Z

Are you also using k_mbox_*(), k_stack_*(), k_pipes_*(), k_queue_*(), or k_msgq_*()? These also touch base.swap_data.

eyuval · 2020-10-16T05:51:53Z

yes, we at least use also the k_msgq , work queues and threads

dcpleung · 2020-10-17T22:55:01Z

Could you provide something that can reproduce the issue via QEMU (which will help with debugging)? Just by looking at the code, mem slab should not return a null pointer in k_mem_slab_alloc() unless a null pointer is being freed in k_mem_slab_free(). I also made a mock up to test alloc/free with timeouts but it ran for hours without returning pointers outside the mem slab block. So anything that can help us to reproduce the issue would be very helpful.

eyuval · 2020-10-20T06:47:53Z

Sorry, I do not have the resources to reproduce this, except the scenario I described. Could it be, that there simply was no call to k_mem_slab_free in the given timeout? Another possible direction for me is, that our bsp code (aarch64) somehow messes things up.

dcpleung · 2020-10-20T20:28:06Z

It should return -EAGAIN if alloc times out (as set in kernel/include/kswap.h, line 99 when the allocating thread first is being swapped out).

I ran my mock-up app with qemu_cortex_a53 for all morning without any issue, but this is missing a lot of interactions that exist in your app.

dcpleung · 2020-10-20T20:38:17Z

Another thing, is it always happening to the same thread? If so, maybe you can setup a watch point to catch when the thread's base.swap_data is set to zero. k_mem_slab_free() should not set it to zero, or else it would be freeing a null pointer. The k_msgq may be setting it to zero, depending on your app.

andyross · 2020-10-28T20:55:10Z

Looks like #29615 was submitted to manage exactly the same issue. That's more of a workaround than a fix, though.

github-actions · 2020-12-28T01:34:39Z

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

When running Zperf UDP-RX and issuing "wifi status" in the background few times causes a crash (data access violation), the callstack is not helping. The current theory is that, during "wifi status" the driver waits for RPU responses which are in the same queue as RX frames, so, this causes delay in RX frames processing and network stack runs out of buffers and subsequent allocations in the driver wait for memory to be reclaimed, and this internally uses z_pend_curr that causes to switch the thread and crash is seen in the WPA supplicant thread context (probably in l2_packet_recv or socket select). There is a similar issue reported in [1] but the fix didn't help, so, as a temporary workaround disable the timeout for allocation, with this the issue is not seen. [1] - zephyrproject-rtos/zephyr#29003 Signed-off-by: Krishna T <[email protected]>

krish2718 · 2023-02-23T21:26:45Z

I am seeing this issue with latest Zephyr (e1e06d05fa + some cherry-picked fixes), its quite easy to reproduce and my test case is below:

Test

Zperf is running UDP download over Wi-Fi and iperf client pumping at 50Mbps, the throughput reported is ~15Mbps
In the background, a separate thread is running a command to get some driver stats, this command causes the RX queue to be full as it blocks the
driver. Running it 4/5 times is enough

Context

Driver is using net_pkt_rx_alloc_with_buffer with 100ms timeout, when this fails net_pkt_clone also fails as described
above
Disable timeout fixes the issue (net_pkt_rx_alloc_with_buffer with K_NO_WAIT)
The fix suggested in this thread didn't help, as in the driver we do check for pkt not just the return value.

Ozone callstack:

[00:00:52.087,463] <err> net_pkt: Data buffer (1498) allocation failed.
[00:00:52.087,493] <wrn> net_conn: pkt cloning failed, pkt 0x20066df4 dropped
[00:00:52.187,744] <err> net_pkt: Data buffer (1498) allocation failed.
[00:00:52.187,744] <wrn> net_conn: pkt cloning failed, pkt 0x20067168 dropped
[00:00:52.288,024] <err> net_pkt: Data buffer (1498) allocation failed.
[00:00:52.288,024] <wrn> net_conn: pkt cloning failed, pkt 0x200671ac dropped
[00:03:38.462,005] <err> os: ***** BUS FAULT *****
[00:03:38.462,005] <err> os:   Precise data bus error
[00:03:38.462,005] <err> os:   BFAR Address: 0xffffff73
[00:03:38.462,036] <err> os: r0/a1:  0xffffff65  r1/a2:  0x20003700  r2/a3:  0x20003760
[00:03:38.462,036] <err> os: r3/a4:  0x20011338 r12/ip:  0x00006f4c r14/lr:  0x0006c6c7
[00:03:38.462,036] <err> os:  xpsr:  0xa1000000
[00:03:38.462,036] <err> os: Faulting instruction address (r15/pc): 0x0006c33a
[00:03:38.462,066] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:03:38.462,066] <err> os: Current thread: 0x20003700

This varies but I see two callstacks consistently, none of them are directly related to the code.

z_impl_zsock_select....->z_sched_prio_cmp (thread1 pointer is corrupted, thread2 is the background thread)
z_impl_k_poll->register_events...->clear_event_registrations...->sys_dlist_remove (list is corrupted)

Enabling all network debugs was unhelpful.

This looks like the some thread switching while we wait for network buffers to be free is causing the issue, for now not using timeout works well, so, proceeding with that . I can provide any details needed.

@rlubos @carlescufi please re-open the issue or let me know if you need me to raise a new bug.

rlubos · 2023-02-24T11:54:56Z

@krish2718 I'd rather open a new issue, instead of resurrecting an over 2-year-old report. However - I think'll be really difficult to help here, w/o providing some way to reproduce the issue, ideally in a form of a minimal sample or test to run in a simulated environment.
Also, please note, that based on my past experiences, I'm more inclined to believe that such kernel misbehaviors, are most likely a result of some external memory corruption (stack overflows etc.) rather than a problem within the kernel itself.

When running Zperf UDP-RX and issuing "wifi status" in the background few times causes a crash (data access violation), the callstack is not helping. The current theory is that, during "wifi status" the driver waits for RPU responses which are in the same queue as RX frames, so, this causes delay in RX frames processing and network stack runs out of buffers and subsequent allocations in the driver wait for memory to be reclaimed, and this internally uses z_pend_curr that causes to switch the thread and crash is seen in the WPA supplicant thread context (probably in l2_packet_recv or socket select). There is a similar issue reported in [1] but the fix didn't help, so, as a temporary workaround disable the timeout for allocation, with this the issue is not seen. [1] - zephyrproject-rtos/zephyr#29003 Signed-off-by: Krishna T <[email protected]>

eyuval added the bug The issue is a bug, or the PR is fixing a bug label Oct 7, 2020

nashif assigned jukkar Oct 7, 2020

nashif added the area: Networking label Oct 7, 2020

jukkar assigned andyross and unassigned jukkar Oct 8, 2020

jukkar added the area: Memory Management label Oct 8, 2020

nashif added the priority: medium Medium impact/importance bug label Oct 13, 2020

github-actions bot added the Stale label Dec 28, 2020

github-actions bot closed this as completed Jan 11, 2021

krish2718 mentioned this issue Feb 25, 2023

networking: Crash when requesting a network buffer allocation with timeout under stress #55191

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

memory corruption in pkt_alloc #29003

memory corruption in pkt_alloc #29003

eyuval commented Oct 7, 2020

jukkar commented Oct 7, 2020

eyuval commented Oct 7, 2020 •

edited

Loading

jukkar commented Oct 8, 2020

dcpleung commented Oct 15, 2020 •

edited

Loading

eyuval commented Oct 16, 2020

dcpleung commented Oct 17, 2020

eyuval commented Oct 20, 2020

dcpleung commented Oct 20, 2020

dcpleung commented Oct 20, 2020

andyross commented Oct 28, 2020

github-actions bot commented Dec 28, 2020

krish2718 commented Feb 23, 2023 •

edited

Loading

rlubos commented Feb 24, 2023

memory corruption in pkt_alloc #29003

memory corruption in pkt_alloc #29003

Comments

eyuval commented Oct 7, 2020

jukkar commented Oct 7, 2020

eyuval commented Oct 7, 2020 • edited Loading

jukkar commented Oct 8, 2020

dcpleung commented Oct 15, 2020 • edited Loading

eyuval commented Oct 16, 2020

dcpleung commented Oct 17, 2020

eyuval commented Oct 20, 2020

dcpleung commented Oct 20, 2020

dcpleung commented Oct 20, 2020

andyross commented Oct 28, 2020

github-actions bot commented Dec 28, 2020

krish2718 commented Feb 23, 2023 • edited Loading

Test

Context

Ozone callstack:

rlubos commented Feb 24, 2023

eyuval commented Oct 7, 2020 •

edited

Loading

dcpleung commented Oct 15, 2020 •

edited

Loading

krish2718 commented Feb 23, 2023 •

edited

Loading