Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: host: l2cap deadlock fix #50476

Merged

Conversation

jori-nordic
Copy link
Collaborator

@jori-nordic jori-nordic commented Sep 21, 2022

This combination of commits fixes the symptoms seen in #34600.
It also adds an multilink l2cap test to stress the host in case other bugs like this crop up.

Fixes #34600.

A note on the test:
As it is a special usage of the l2cap CoC API, the stack will throw a lot of errors, this is expected.
We might want to downgrade those errors (in the near future) to be less scary, as no actual data loss is happening.

Expected errors/warnings:

d_00: @00:01:54.482920  [00:01:54.482,910] <wrn> bt_l2cap: No credits to transmit packet
d_00: @00:01:54.518763  [00:01:54.518,737] <err> bt_conn: Unable to allocate TX context
d_00: @00:01:54.518763  [00:01:54.518,737] <wrn> bt_l2cap: Unable to send seg -105

@jori-nordic jori-nordic changed the title L2cap deadlock fix Bluetooth: host: l2cap deadlock fix Sep 21, 2022
@jori-nordic jori-nordic added bug The issue is a bug, or the PR is fixing a bug area: API Changes to public APIs area: Bluetooth Host Bluetooth Host (excluding BR/EDR) labels Sep 21, 2022
@jori-nordic
Copy link
Collaborator Author

jori-nordic commented Sep 21, 2022

whoops looks like I forgot to review the test, it's still littered with commented code and is missing the new alloc_seg callback. Adding DNM until I fix it.
The rest of the PR can still be reviewed.

Whole PR is now ready for review.

@jori-nordic jori-nordic added the DNM This PR should not be merged (Do Not Merge) label Sep 21, 2022
Comment on lines 1974 to 1976
/* Release the segment, even if it is the original buf. the
* caller of this function already holds a reference to this
* buffer and l2cap_chan_le_send will get called again.
Copy link
Collaborator

@alwa-nordic alwa-nordic Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly?: l2cap_chan_create_seg returns an owned reference, if any. bt_l2cap_send_cb, if successful, moves the reference, but otherwise not. So, when it was not successful, we must still move the reference (into net_buf_unref) to avoid leaking it.

I don't think it's meaningful to say what the caller of this function does? At least it does not easily fit into my mental model.

@@ -2265,6 +2266,12 @@ static void l2cap_chan_send_credits(struct bt_l2cap_le_chan *chan,
credits = chan->rx.init_credits;
}

/* Don't send back more than the initial amount. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Is it not allowed by spec? This looks to me like a band-aid solution for some other problem in the stack. Why does it happen that we try to l2cap_chan_send_credits more credits than we should?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think the stack has this assumption, even if it's not in writing anywhere.

Copy link
Collaborator

@alwa-nordic alwa-nordic Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've approved this PR because it seems to make things better and I don't want to block it from getting into the release. But I'm not happy with the change in this commit because I think it will bite us later.

The question stands: If the stack assumes that credits never goes above init credits, then why is l2cap_chan_send_credits invoked in a way that breaks that assumption?

This change is not well motivated in my eyes. The comment "Don't send back more than the initial amount." just says what the code is doing, not why.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah initially, the fix was on the caller side. I decided to put it here to gate it at the very last point, just in case I missed something in one of the other calls.
We can revisit later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the number of credits we can give limited by the number of buffers available? We don't get any more buffers than the initial amount, so this change makes sense to me. But I may be misunderstanding something.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes exactly, this is the assumption. In the zephyr stack, the credits are an image of the number of the receive buffers. If they get out of sync, you're in trouble.

I think in nordic's softdevice, it was on the application to send the credits (public api), so it allowed the app to process the buffers before deciding to accept more. it also allowed user error if they sent too much credits. I'm not sure but maybe the number of buffers were runtime-configurable, hence that API.

As I said, i think in zephyr that doesn't make sense, as the number of buffers is fixed at compile time.

Copy link
Collaborator

@alwa-nordic alwa-nordic Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are the losing sync between the number of free buffers and the credits we send?~ Does the limit added here prevent the desync? Or is the sync already lost, and we are patching the symptom; that some code is calling l2cap_chan_send_credits with more credits than it is possible to have?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this code prevents the desync. There is some code callign l2cap_chan_send_credits with more credits than it is possible to have, yes. I figured I'd have the check in one place rather than two (or just the orig. caller and possibly leave another bug in if I didn't gate it here).
I can rummage thru my commits to find out where I originally put the limit if you so prefer. But i want to leave that one too.

Comment on lines +1897 to 1910
/* Resume the current channel */
l2cap_chan_tx_resume(BT_L2CAP_LE_CHAN(chan));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you leave this line? It seems to duplicate the work of the line below? Is it to retain the prioritizing property of preferentially resuming the channel that finished a send? Then we should have a comment on this. And the comment must also include an explanation why this behavior is good, or that it's unknown if this is good, but it is conservatively kept in it's legacy state.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's to prioritize the current channel, I'll add a comment.

*
* @param chan The channel requesting a buffer.
*
* @return Allocated buffer.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.."Or NULL to fallback on the SDU allocator."? Or is this behavior unintentional? If it's unintentional, we should log an error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

behavior is unintentional, I think we should log an error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the check into an assert, lemme know if you don't like it.

The behavior will be that without asserts enabled (which you shouldn't do bc of the stack relying on it), segments will be allocated from the global pdu pool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jori-nordic was your last comment posted in the right place? Seems it was intended to be in the thread about the assert?

Copy link
Collaborator Author

@jori-nordic jori-nordic Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it's in the right place:
@alwa-nordic was saying that I had undocumented behavior where if the allocator cb returns NULL, the segment would tentatively be allocated from the same pool as the SDU.
I decided to make it mandatory to return a valid buffer, added a line to this comment and added an assert to verify that.

I have already changed the line with the assert in the bsim test, that's the one you both had problems with.
edit: my bad I hadn't changed it, it's fixed now.

test_l2cap_server.psm = 0;
test_l2cap_server.sec_level = sec_level;

__ASSERT_NO_MSG(bt_l2cap_server_register(&test_l2cap_server) == 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't put side effects in assert macros. :O

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preempt: The day will come when we need to run babelsim tests with assertions disabled. And somebody may be tempted to copy this code into their application. I think it's reasonable that the babelsim tests double as samples in addition to tests.

@carlescufi
Copy link
Member

@Vudentz a review from you would be appreciated for this PR

jhedberg
jhedberg previously approved these changes Sep 22, 2022
@jori-nordic jori-nordic force-pushed the l2cap-deadlock-tentative-fix branch 2 times, most recently from d85950e to 767952d Compare September 22, 2022 14:04
Comment on lines 2007 to 2009
/* Release the segment, even if it is the original buf, as
* buffer ownership is not transferred when bt_l2cap_send_cb
* returns an error.
Copy link
Collaborator

@alwa-nordic alwa-nordic Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is still unnecessarily hard to understand. I don't see the need to refer to the "original buf". "Original buf" does not have an obvious meaning to the casual reader in this context. Imagine being new to the code.

Is it not better to say simply "bt_l2cap_send_cb would have taken the ownership of the reference in seg if it was successful. But the call returned an error, so we must get rid of the reference here."?

We can append the following to explain what is meant by owning references: "If a variable owning a reference falls out of scope, it is considered the source of a leak." But I think that is common concept in programming, so it does not need to be repeated.

Copy link
Collaborator Author

@jori-nordic jori-nordic Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep wording was from the previous comment. "original buf" -> huge SDU buffer that was passed by ref to the l2cap send API.
I'll reword it.
edit: done

See code comment

Signed-off-by: Jonathan Rico <[email protected]>
There was an edge-case where we were sending back too much credits, add a
check so we can't do that anymore.

Signed-off-by: Jonathan Rico <[email protected]>
See the code comments.

SDUs might enter a state where they will be blocked forever, as a
workaround, we nudge them when another SDU has been sent.

Signed-off-by: Jonathan Rico <[email protected]>
This callback allows use-cases where the SDU is much larger than the l2cap
MPS. The stack will then try to allocate using this callback if specified,
and fall-back on using the buffer's pool (previous behavior).

This way one can define two buffer pools, one with a very large buffer
size, and one with a buffer size >= MPS, and the stack will allocate from
that instead.

Signed-off-by: Jonathan Rico <[email protected]>
`l2cap_tx_meta_data` is also a function-like macro.

Signed-off-by: Jonathan Rico <[email protected]>
@jori-nordic jori-nordic force-pushed the l2cap-deadlock-tentative-fix branch from fd2e4d4 to e244b04 Compare September 23, 2022 09:08
@carlescufi carlescufi merged commit 109c476 into zephyrproject-rtos:main Sep 26, 2022
jori-nordic added a commit to jori-nordic/zephyr that referenced this pull request Mar 27, 2023
Turns out the [first bugfix] was too naive: there is a case where resuming
all channels will not work on all queued SDUs, and the work handler will
give up and wait for the next sent SDU instead of trying to resume again.

This happens when the number of credits and conn contexts is very low for
the amount of data to send.

Always reschedule with a delay to avoid that situation.

[first bugfix]: zephyrproject-rtos#50476

Signed-off-by: Jonathan Rico <[email protected]>
jori-nordic added a commit to jori-nordic/zephyr that referenced this pull request Apr 18, 2023
Turns out the [first bugfix] was too naive: there is a case where resuming
all channels will not work on all queued SDUs, and the work handler will
give up and wait for the next sent SDU instead of trying to resume again.

This happens when the number of credits and conn contexts is very low for
the amount of data to send.

Always reschedule with a delay to avoid that situation.

[first bugfix]: zephyrproject-rtos#50476

Signed-off-by: Jonathan Rico <[email protected]>
carlescufi pushed a commit that referenced this pull request Apr 26, 2023
Turns out the [first bugfix] was too naive: there is a case where resuming
all channels will not work on all queued SDUs, and the work handler will
give up and wait for the next sent SDU instead of trying to resume again.

This happens when the number of credits and conn contexts is very low for
the amount of data to send.

Always reschedule with a delay to avoid that situation.

[first bugfix]: #50476

Signed-off-by: Jonathan Rico <[email protected]>
coreboot-bot pushed a commit to coreboot/zephyr-cros that referenced this pull request Apr 27, 2023
Turns out the [first bugfix] was too naive: there is a case where resuming
all channels will not work on all queued SDUs, and the work handler will
give up and wait for the next sent SDU instead of trying to resume again.

This happens when the number of credits and conn contexts is very low for
the amount of data to send.

Always reschedule with a delay to avoid that situation.

[first bugfix]: zephyrproject-rtos/zephyr#50476

(cherry picked from commit 6cdb82c)

Original-Signed-off-by: Jonathan Rico <[email protected]>
GitOrigin-RevId: 6cdb82c
Change-Id: Ie7b283c65234be30ba2c561d1661f69b9861ef44
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/4484100
Tested-by: Tristan Honscheid <[email protected]>
Reviewed-by: Tristan Honscheid <[email protected]>
Commit-Queue: Tristan Honscheid <[email protected]>
Tested-by: CopyBot Service Account <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: API Changes to public APIs area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Bluetooth bug The issue is a bug, or the PR is fixing a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bluetooth: L2CAP: Deadlock when there are no free buffers while transmitting on multiple channels
7 participants