-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bluetooth: host: l2cap deadlock fix #50476
Bluetooth: host: l2cap deadlock fix #50476
Conversation
Whole PR is now ready for review. |
subsys/bluetooth/host/l2cap.c
Outdated
/* Release the segment, even if it is the original buf. the | ||
* caller of this function already holds a reference to this | ||
* buffer and l2cap_chan_le_send will get called again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand correctly?: l2cap_chan_create_seg
returns an owned reference, if any. bt_l2cap_send_cb
, if successful, moves the reference, but otherwise not. So, when it was not successful, we must still move the reference (into net_buf_unref
) to avoid leaking it.
I don't think it's meaningful to say what the caller of this function does? At least it does not easily fit into my mental model.
@@ -2265,6 +2266,12 @@ static void l2cap_chan_send_credits(struct bt_l2cap_le_chan *chan, | |||
credits = chan->rx.init_credits; | |||
} | |||
|
|||
/* Don't send back more than the initial amount. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? Is it not allowed by spec? This looks to me like a band-aid solution for some other problem in the stack. Why does it happen that we try to l2cap_chan_send_credits
more credits than we should?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think the stack has this assumption, even if it's not in writing anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've approved this PR because it seems to make things better and I don't want to block it from getting into the release. But I'm not happy with the change in this commit because I think it will bite us later.
The question stands: If the stack assumes that credits never goes above init credits, then why is l2cap_chan_send_credits
invoked in a way that breaks that assumption?
This change is not well motivated in my eyes. The comment "Don't send back more than the initial amount." just says what the code is doing, not why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah initially, the fix was on the caller side. I decided to put it here to gate it at the very last point, just in case I missed something in one of the other calls.
We can revisit later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the number of credits we can give limited by the number of buffers available? We don't get any more buffers than the initial amount, so this change makes sense to me. But I may be misunderstanding something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes exactly, this is the assumption. In the zephyr stack, the credits are an image of the number of the receive buffers. If they get out of sync, you're in trouble.
I think in nordic's softdevice, it was on the application to send the credits (public api), so it allowed the app to process the buffers before deciding to accept more. it also allowed user error if they sent too much credits. I'm not sure but maybe the number of buffers were runtime-configurable, hence that API.
As I said, i think in zephyr that doesn't make sense, as the number of buffers is fixed at compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are the losing sync between the number of free buffers and the credits we send?~ Does the limit added here prevent the desync? Or is the sync already lost, and we are patching the symptom; that some code is calling l2cap_chan_send_credits
with more credits than it is possible to have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this code prevents the desync. There is some code callign l2cap_chan_send_credits with more credits than it is possible to have, yes. I figured I'd have the check in one place rather than two (or just the orig. caller and possibly leave another bug in if I didn't gate it here).
I can rummage thru my commits to find out where I originally put the limit if you so prefer. But i want to leave that one too.
/* Resume the current channel */ | ||
l2cap_chan_tx_resume(BT_L2CAP_LE_CHAN(chan)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you leave this line? It seems to duplicate the work of the line below? Is it to retain the prioritizing property of preferentially resuming the channel that finished a send? Then we should have a comment on this. And the comment must also include an explanation why this behavior is good, or that it's unknown if this is good, but it is conservatively kept in it's legacy state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it's to prioritize the current channel, I'll add a comment.
* | ||
* @param chan The channel requesting a buffer. | ||
* | ||
* @return Allocated buffer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.."Or NULL to fallback on the SDU allocator."? Or is this behavior unintentional? If it's unintentional, we should log an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
behavior is unintentional, I think we should log an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the check into an assert, lemme know if you don't like it.
The behavior will be that without asserts enabled (which you shouldn't do bc of the stack relying on it), segments will be allocated from the global pdu pool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jori-nordic was your last comment posted in the right place? Seems it was intended to be in the thread about the assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it's in the right place:
@alwa-nordic was saying that I had undocumented behavior where if the allocator cb returns NULL, the segment would tentatively be allocated from the same pool as the SDU.
I decided to make it mandatory to return a valid buffer, added a line to this comment and added an assert to verify that.
I have already changed the line with the assert in the bsim test, that's the one you both had problems with.
edit: my bad I hadn't changed it, it's fixed now.
test_l2cap_server.psm = 0; | ||
test_l2cap_server.sec_level = sec_level; | ||
|
||
__ASSERT_NO_MSG(bt_l2cap_server_register(&test_l2cap_server) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't put side effects in assert macros. :O
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preempt: The day will come when we need to run babelsim tests with assertions disabled. And somebody may be tempted to copy this code into their application. I think it's reasonable that the babelsim tests double as samples in addition to tests.
@Vudentz a review from you would be appreciated for this PR |
d85950e
to
767952d
Compare
4d8a4b9
to
fd2e4d4
Compare
subsys/bluetooth/host/l2cap.c
Outdated
/* Release the segment, even if it is the original buf, as | ||
* buffer ownership is not transferred when bt_l2cap_send_cb | ||
* returns an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is still unnecessarily hard to understand. I don't see the need to refer to the "original buf". "Original buf" does not have an obvious meaning to the casual reader in this context. Imagine being new to the code.
Is it not better to say simply "bt_l2cap_send_cb would have taken the ownership of the reference in seg if it was successful. But the call returned an error, so we must get rid of the reference here."?
We can append the following to explain what is meant by owning references: "If a variable owning a reference falls out of scope, it is considered the source of a leak." But I think that is common concept in programming, so it does not need to be repeated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep wording was from the previous comment. "original buf" -> huge SDU buffer that was passed by ref to the l2cap send API.
I'll reword it.
edit: done
See code comment Signed-off-by: Jonathan Rico <[email protected]>
There was an edge-case where we were sending back too much credits, add a check so we can't do that anymore. Signed-off-by: Jonathan Rico <[email protected]>
See the code comments. SDUs might enter a state where they will be blocked forever, as a workaround, we nudge them when another SDU has been sent. Signed-off-by: Jonathan Rico <[email protected]>
This callback allows use-cases where the SDU is much larger than the l2cap MPS. The stack will then try to allocate using this callback if specified, and fall-back on using the buffer's pool (previous behavior). This way one can define two buffer pools, one with a very large buffer size, and one with a buffer size >= MPS, and the stack will allocate from that instead. Signed-off-by: Jonathan Rico <[email protected]>
`l2cap_tx_meta_data` is also a function-like macro. Signed-off-by: Jonathan Rico <[email protected]>
fd2e4d4
to
e244b04
Compare
Turns out the [first bugfix] was too naive: there is a case where resuming all channels will not work on all queued SDUs, and the work handler will give up and wait for the next sent SDU instead of trying to resume again. This happens when the number of credits and conn contexts is very low for the amount of data to send. Always reschedule with a delay to avoid that situation. [first bugfix]: zephyrproject-rtos#50476 Signed-off-by: Jonathan Rico <[email protected]>
Turns out the [first bugfix] was too naive: there is a case where resuming all channels will not work on all queued SDUs, and the work handler will give up and wait for the next sent SDU instead of trying to resume again. This happens when the number of credits and conn contexts is very low for the amount of data to send. Always reschedule with a delay to avoid that situation. [first bugfix]: zephyrproject-rtos#50476 Signed-off-by: Jonathan Rico <[email protected]>
Turns out the [first bugfix] was too naive: there is a case where resuming all channels will not work on all queued SDUs, and the work handler will give up and wait for the next sent SDU instead of trying to resume again. This happens when the number of credits and conn contexts is very low for the amount of data to send. Always reschedule with a delay to avoid that situation. [first bugfix]: #50476 Signed-off-by: Jonathan Rico <[email protected]>
Turns out the [first bugfix] was too naive: there is a case where resuming all channels will not work on all queued SDUs, and the work handler will give up and wait for the next sent SDU instead of trying to resume again. This happens when the number of credits and conn contexts is very low for the amount of data to send. Always reschedule with a delay to avoid that situation. [first bugfix]: zephyrproject-rtos/zephyr#50476 (cherry picked from commit 6cdb82c) Original-Signed-off-by: Jonathan Rico <[email protected]> GitOrigin-RevId: 6cdb82c Change-Id: Ie7b283c65234be30ba2c561d1661f69b9861ef44 Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/4484100 Tested-by: Tristan Honscheid <[email protected]> Reviewed-by: Tristan Honscheid <[email protected]> Commit-Queue: Tristan Honscheid <[email protected]> Tested-by: CopyBot Service Account <[email protected]>
This combination of commits fixes the symptoms seen in #34600.
It also adds an multilink l2cap test to stress the host in case other bugs like this crop up.
Fixes #34600.
A note on the test:
As it is a special usage of the l2cap CoC API, the stack will throw a lot of errors, this is expected.
We might want to downgrade those errors (in the near future) to be less scary, as no actual data loss is happening.
Expected errors/warnings: