Bluetooth: host: l2cap deadlock fix #50476

jori-nordic · 2022-09-21T10:04:17Z

This combination of commits fixes the symptoms seen in #34600.
It also adds an multilink l2cap test to stress the host in case other bugs like this crop up.

Fixes #34600.

A note on the test:
As it is a special usage of the l2cap CoC API, the stack will throw a lot of errors, this is expected.
We might want to downgrade those errors (in the near future) to be less scary, as no actual data loss is happening.

Expected errors/warnings:

d_00: @00:01:54.482920  [00:01:54.482,910] <wrn> bt_l2cap: No credits to transmit packet
d_00: @00:01:54.518763  [00:01:54.518,737] <err> bt_conn: Unable to allocate TX context
d_00: @00:01:54.518763  [00:01:54.518,737] <wrn> bt_l2cap: Unable to send seg -105

jori-nordic · 2022-09-21T11:50:30Z

whoops looks like I forgot to review the test, it's still littered with commented code and is missing the new alloc_seg callback. Adding DNM until I fix it.
The rest of the PR can still be reviewed.

Whole PR is now ready for review.

alwa-nordic · 2022-09-21T12:33:18Z

subsys/bluetooth/host/l2cap.c

+		/* Release the segment, even if it is the original buf. the
+		 * caller of this function already holds a reference to this
+		 * buffer and l2cap_chan_le_send will get called again.


Do I understand correctly?: l2cap_chan_create_seg returns an owned reference, if any. bt_l2cap_send_cb, if successful, moves the reference, but otherwise not. So, when it was not successful, we must still move the reference (into net_buf_unref) to avoid leaking it.

I don't think it's meaningful to say what the caller of this function does? At least it does not easily fit into my mental model.

alwa-nordic · 2022-09-21T12:39:25Z

subsys/bluetooth/host/l2cap.c

@@ -2265,6 +2266,12 @@ static void l2cap_chan_send_credits(struct bt_l2cap_le_chan *chan,
 		credits = chan->rx.init_credits;
 	}

+	/* Don't send back more than the initial amount. */


Why? Is it not allowed by spec? This looks to me like a band-aid solution for some other problem in the stack. Why does it happen that we try to l2cap_chan_send_credits more credits than we should?

Think the stack has this assumption, even if it's not in writing anywhere.

I've approved this PR because it seems to make things better and I don't want to block it from getting into the release. But I'm not happy with the change in this commit because I think it will bite us later.

The question stands: If the stack assumes that credits never goes above init credits, then why is l2cap_chan_send_credits invoked in a way that breaks that assumption?

This change is not well motivated in my eyes. The comment "Don't send back more than the initial amount." just says what the code is doing, not why.

yeah initially, the fix was on the caller side. I decided to put it here to gate it at the very last point, just in case I missed something in one of the other calls.
We can revisit later.

Isn't the number of credits we can give limited by the number of buffers available? We don't get any more buffers than the initial amount, so this change makes sense to me. But I may be misunderstanding something.

yes exactly, this is the assumption. In the zephyr stack, the credits are an image of the number of the receive buffers. If they get out of sync, you're in trouble.

I think in nordic's softdevice, it was on the application to send the credits (public api), so it allowed the app to process the buffers before deciding to accept more. it also allowed user error if they sent too much credits. I'm not sure but maybe the number of buffers were runtime-configurable, hence that API.

As I said, i think in zephyr that doesn't make sense, as the number of buffers is fixed at compile time.

So we are the losing sync between the number of free buffers and the credits we send?~ Does the limit added here prevent the desync? Or is the sync already lost, and we are patching the symptom; that some code is calling l2cap_chan_send_credits with more credits than it is possible to have?

yeah this code prevents the desync. There is some code callign l2cap_chan_send_credits with more credits than it is possible to have, yes. I figured I'd have the check in one place rather than two (or just the orig. caller and possibly leave another bug in if I didn't gate it here).
I can rummage thru my commits to find out where I originally put the limit if you so prefer. But i want to leave that one too.

alwa-nordic · 2022-09-21T12:46:27Z

subsys/bluetooth/host/l2cap.c

+	/* Resume the current channel */
 	l2cap_chan_tx_resume(BT_L2CAP_LE_CHAN(chan));


Is there a reason you leave this line? It seems to duplicate the work of the line below? Is it to retain the prioritizing property of preferentially resuming the channel that finished a send? Then we should have a comment on this. And the comment must also include an explanation why this behavior is good, or that it's unknown if this is good, but it is conservatively kept in it's legacy state.

yeah it's to prioritize the current channel, I'll add a comment.

alwa-nordic · 2022-09-21T12:50:03Z

include/zephyr/bluetooth/l2cap.h

+	 *
+	 *  @param chan The channel requesting a buffer.
+	 *
+	 *  @return Allocated buffer.


.."Or NULL to fallback on the SDU allocator."? Or is this behavior unintentional? If it's unintentional, we should log an error.

behavior is unintentional, I think we should log an error.

I moved the check into an assert, lemme know if you don't like it.

The behavior will be that without asserts enabled (which you shouldn't do bc of the stack relying on it), segments will be allocated from the global pdu pool.

@jori-nordic was your last comment posted in the right place? Seems it was intended to be in the thread about the assert?

no, it's in the right place:
@alwa-nordic was saying that I had undocumented behavior where if the allocator cb returns NULL, the segment would tentatively be allocated from the same pool as the SDU.
I decided to make it mandatory to return a valid buffer, added a line to this comment and added an assert to verify that.

I have already changed the line with the assert in the bsim test, that's the one you both had problems with.
edit: my bad I hadn't changed it, it's fixed now.

tests/bluetooth/bsim_bt/bsim_test_l2cap_stress/src/common.c

alwa-nordic · 2022-09-21T13:23:52Z

tests/bluetooth/bsim_bt/bsim_test_l2cap_stress/src/main_l2cap_stress.c

+	test_l2cap_server.psm = 0;
+	test_l2cap_server.sec_level = sec_level;
+
+	__ASSERT_NO_MSG(bt_l2cap_server_register(&test_l2cap_server) == 0);


Don't put side effects in assert macros. :O

Preempt: The day will come when we need to run babelsim tests with assertions disabled. And somebody may be tempted to copy this code into their application. I think it's reasonable that the babelsim tests double as samples in addition to tests.

carlescufi · 2022-09-22T08:13:32Z

@Vudentz a review from you would be appreciated for this PR

alwa-nordic · 2022-09-23T09:00:00Z

subsys/bluetooth/host/l2cap.c

+		/* Release the segment, even if it is the original buf, as
+		 * buffer ownership is not transferred when bt_l2cap_send_cb
+		 * returns an error.


I think this comment is still unnecessarily hard to understand. I don't see the need to refer to the "original buf". "Original buf" does not have an obvious meaning to the casual reader in this context. Imagine being new to the code.

Is it not better to say simply "bt_l2cap_send_cb would have taken the ownership of the reference in seg if it was successful. But the call returned an error, so we must get rid of the reference here."?

We can append the following to explain what is meant by owning references: "If a variable owning a reference falls out of scope, it is considered the source of a leak." But I think that is common concept in programming, so it does not need to be repeated.

yep wording was from the previous comment. "original buf" -> huge SDU buffer that was passed by ref to the l2cap send API.
I'll reword it.
edit: done

See code comment Signed-off-by: Jonathan Rico <[email protected]>

There was an edge-case where we were sending back too much credits, add a check so we can't do that anymore. Signed-off-by: Jonathan Rico <[email protected]>

See the code comments. SDUs might enter a state where they will be blocked forever, as a workaround, we nudge them when another SDU has been sent. Signed-off-by: Jonathan Rico <[email protected]>

This callback allows use-cases where the SDU is much larger than the l2cap MPS. The stack will then try to allocate using this callback if specified, and fall-back on using the buffer's pool (previous behavior). This way one can define two buffer pools, one with a very large buffer size, and one with a buffer size >= MPS, and the stack will allocate from that instead. Signed-off-by: Jonathan Rico <[email protected]>

`l2cap_tx_meta_data` is also a function-like macro. Signed-off-by: Jonathan Rico <[email protected]>

Turns out the [first bugfix] was too naive: there is a case where resuming all channels will not work on all queued SDUs, and the work handler will give up and wait for the next sent SDU instead of trying to resume again. This happens when the number of credits and conn contexts is very low for the amount of data to send. Always reschedule with a delay to avoid that situation. [first bugfix]: zephyrproject-rtos#50476 Signed-off-by: Jonathan Rico <[email protected]>

Turns out the [first bugfix] was too naive: there is a case where resuming all channels will not work on all queued SDUs, and the work handler will give up and wait for the next sent SDU instead of trying to resume again. This happens when the number of credits and conn contexts is very low for the amount of data to send. Always reschedule with a delay to avoid that situation. [first bugfix]: #50476 Signed-off-by: Jonathan Rico <[email protected]>

Turns out the [first bugfix] was too naive: there is a case where resuming all channels will not work on all queued SDUs, and the work handler will give up and wait for the next sent SDU instead of trying to resume again. This happens when the number of credits and conn contexts is very low for the amount of data to send. Always reschedule with a delay to avoid that situation. [first bugfix]: zephyrproject-rtos/zephyr#50476 (cherry picked from commit 6cdb82c) Original-Signed-off-by: Jonathan Rico <[email protected]> GitOrigin-RevId: 6cdb82c Change-Id: Ie7b283c65234be30ba2c561d1661f69b9861ef44 Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/4484100 Tested-by: Tristan Honscheid <[email protected]> Reviewed-by: Tristan Honscheid <[email protected]> Commit-Queue: Tristan Honscheid <[email protected]> Tested-by: CopyBot Service Account <[email protected]>

jori-nordic requested review from alwa-nordic, jhedberg, Vudentz, wopu-ot and sjanc as code owners September 21, 2022 10:04

zephyrbot added the area: Bluetooth label Sep 21, 2022

zephyrbot requested review from asbjornsabo, hermabe and Thalley September 21, 2022 10:04

zephyrbot assigned jhedberg Sep 21, 2022

jori-nordic changed the title ~~L2cap deadlock fix~~ Bluetooth: host: l2cap deadlock fix Sep 21, 2022

jori-nordic added bug The issue is a bug, or the PR is fixing a bug area: API Changes to public APIs area: Bluetooth Host Bluetooth Host (excluding BR/EDR) labels Sep 21, 2022

kruithofa mentioned this pull request Sep 21, 2022

Network traffic occurs before Bluetooth NET L2 (IPSP) link setup complete #50345

Closed

jori-nordic added the DNM This PR should not be merged (Do Not Merge) label Sep 21, 2022

alwa-nordic reviewed Sep 21, 2022

View reviewed changes

tests/bluetooth/bsim_bt/bsim_test_l2cap_stress/src/common.c Outdated Show resolved Hide resolved

alwa-nordic reviewed Sep 21, 2022

View reviewed changes

tests/bluetooth/bsim_bt/bsim_test_l2cap_stress/src/common.c Outdated Show resolved Hide resolved

alwa-nordic reviewed Sep 21, 2022

View reviewed changes

tests/bluetooth/bsim_bt/bsim_test_l2cap_stress/src/common.c Outdated Show resolved Hide resolved

alwa-nordic reviewed Sep 21, 2022

View reviewed changes

jhedberg previously approved these changes Sep 22, 2022

View reviewed changes

jori-nordic dismissed jhedberg’s stale review via d85950e September 22, 2022 13:40

jori-nordic force-pushed the l2cap-deadlock-tentative-fix branch 2 times, most recently from d85950e to 767952d Compare September 22, 2022 14:04

jori-nordic dismissed alwa-nordic’s stale review via fd2e4d4 September 23, 2022 08:51

jori-nordic force-pushed the l2cap-deadlock-tentative-fix branch from 4d8a4b9 to fd2e4d4 Compare September 23, 2022 08:51

alwa-nordic reviewed Sep 23, 2022

View reviewed changes

jori-nordic added 5 commits September 23, 2022 11:08

Bluetooth: host: l2cap: release segment in all cases

f478104

See code comment Signed-off-by: Jonathan Rico <[email protected]>

Bluetooth: host: l2cap: don't send too much credits

edb7638

There was an edge-case where we were sending back too much credits, add a check so we can't do that anymore. Signed-off-by: Jonathan Rico <[email protected]>

Bluetooth: host: l2cap: workaround SDU deadlock

85b8c62

See the code comments. SDUs might enter a state where they will be blocked forever, as a workaround, we nudge them when another SDU has been sent. Signed-off-by: Jonathan Rico <[email protected]>

Bluetooth: l2cap: remove ambiguous naming

e244b04

`l2cap_tx_meta_data` is also a function-like macro. Signed-off-by: Jonathan Rico <[email protected]>

jori-nordic force-pushed the l2cap-deadlock-tentative-fix branch from fd2e4d4 to e244b04 Compare September 23, 2022 09:08

jhedberg approved these changes Sep 23, 2022

View reviewed changes

jori-nordic requested a review from alwa-nordic September 23, 2022 11:46

carlescufi approved these changes Sep 26, 2022

View reviewed changes

carlescufi merged commit 109c476 into zephyrproject-rtos:main Sep 26, 2022

theob-pro mentioned this pull request Jun 5, 2023

[Backport v2.7-branch] Bluetooth: fix HCI command / data deadlock #58725

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bluetooth: host: l2cap deadlock fix #50476

Bluetooth: host: l2cap deadlock fix #50476

jori-nordic commented Sep 21, 2022 •

edited

Loading

jori-nordic commented Sep 21, 2022 •

edited

Loading

alwa-nordic Sep 21, 2022 •

edited

Loading

alwa-nordic Sep 21, 2022

jori-nordic Sep 21, 2022

alwa-nordic Sep 23, 2022 •

edited

Loading

jori-nordic Sep 23, 2022

hermabe Sep 23, 2022

jori-nordic Sep 23, 2022

alwa-nordic Sep 26, 2022 •

edited

Loading

jori-nordic Sep 26, 2022

alwa-nordic Sep 21, 2022

jori-nordic Sep 22, 2022

alwa-nordic Sep 21, 2022

jori-nordic Sep 21, 2022

jori-nordic Sep 22, 2022

jhedberg Sep 22, 2022

jori-nordic Sep 23, 2022 •

edited

Loading

alwa-nordic Sep 21, 2022

alwa-nordic Sep 21, 2022

carlescufi commented Sep 22, 2022

alwa-nordic Sep 23, 2022 •

edited

Loading

jori-nordic Sep 23, 2022 •

edited

Loading

		/* Resume the current channel */
		l2cap_chan_tx_resume(BT_L2CAP_LE_CHAN(chan));

Bluetooth: host: l2cap deadlock fix #50476

Bluetooth: host: l2cap deadlock fix #50476

Conversation

jori-nordic commented Sep 21, 2022 • edited Loading

jori-nordic commented Sep 21, 2022 • edited Loading

alwa-nordic Sep 21, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alwa-nordic Sep 23, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alwa-nordic Sep 26, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jori-nordic Sep 23, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

carlescufi commented Sep 22, 2022

alwa-nordic Sep 23, 2022 • edited Loading

Choose a reason for hiding this comment

jori-nordic Sep 23, 2022 • edited Loading

Choose a reason for hiding this comment

jori-nordic commented Sep 21, 2022 •

edited

Loading

jori-nordic commented Sep 21, 2022 •

edited

Loading

alwa-nordic Sep 21, 2022 •

edited

Loading

alwa-nordic Sep 23, 2022 •

edited

Loading

alwa-nordic Sep 26, 2022 •

edited

Loading

jori-nordic Sep 23, 2022 •

edited

Loading

alwa-nordic Sep 23, 2022 •

edited

Loading

jori-nordic Sep 23, 2022 •

edited

Loading