Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: Controller: Advertising can only be started 2^16 times #30480

Closed
JordanYates opened this issue Dec 7, 2020 · 4 comments · Fixed by #30496
Closed

Bluetooth: Controller: Advertising can only be started 2^16 times #30480

JordanYates opened this issue Dec 7, 2020 · 4 comments · Fixed by #30496
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@JordanYates
Copy link
Collaborator

Describe the bug
On Nordic devices, Bluetooth advertising can only be started ~2^16 times before all calls to bt_le_adv_start return -EIO.

rc = bt_le_adv_start(adv_param, adv_data, adv_data_len, scan_rsp_data, scan_rsp_len);
if (rc < 0) {
	LOG_ERR("Advertising failed to start: %d", rc);
	return;
}
k_sleep(K_MSEC(50));
rc = bt_le_adv_stop();
if (rc < 0) {
	LOG_ERR("Advertising failed to stop: %d", rc);
	return;
}

Will eventually result in :

[00:00:08.656,616] <wrn> bt_hci_core: opcode 0x200a status 0x03
[00:00:08.656,646] <err> bt_hci_core: Failed to start advertiser
[00:00:08.656,646] <err> app: Advertising failed to start: -5

The root cause of this issue is the failure to release onoff_manager dependencies in lll_clock.c
Each time advertising is started, we wait until the LF clock is ready:
ll_adv_enable -> lll_clock_wait -> blocking_on -> onoff_request

Each call to blocking_on adds an additional dependency via onoff_request that is never released with a call to onoff_release.
Eventually the refs counter will reach UINT16_MAX, which will cause onoff_request to always return -EAGAIN.

To Reproduce
Add the following to clock_ready in lll_clock.c to reduce reproduction time:
If the refs were being properly released, there is still a headroom of ~156 to allow for normal operation.

static bool first = true;
if (first) {
	printk("Overriding refs\n");
	mgr->refs = 65400;
	first = false;
}

Repeatedly call bt_le_adv_start and bt_le_adv_stop until -EIO is always observed.

Expected behavior
Advertising sets should be able to be started and stopped any number of times without errors.

Impact
This bug can only be recovered from by rebooting the device, as nothing will naturally reduce the reference count.
If a reboot is not triggered, the device can no longer be interacted with via Bluetooth.
For devices without physical access to the PCB, this bug bricks the device.

Logs and console output
Standard error messages are included above.

Environment (please complete the following information):

  • Zephyr 2.4
@JordanYates JordanYates added the bug The issue is a bug, or the PR is fixing a bug label Dec 7, 2020
@JordanYates
Copy link
Collaborator Author

JordanYates commented Dec 7, 2020

As lll_clock provides no function which calls onoff_release, the same issue presumably affects connection creation as well, which calls lll_clock_wait in ll_create_connection.

@joerchan joerchan added the priority: medium Medium impact/importance bug label Dec 7, 2020
@cvinayak
Copy link
Contributor

cvinayak commented Dec 7, 2020

The regression was introduced here: 2b47630#diff-4b2727dd1031ae4d5de482dc43a4c5d2295a89f718b04e5480f6c8f8082846b4

The following is missing from lll_clock_wait

	static bool done;

	if (done) {
		return 0;
	}
	done = true;

I will send a PR soon. Thanks for the issue report and detailed analysis.

@JordanYates
Copy link
Collaborator Author

JordanYates commented Dec 7, 2020

static bool done;

if (done) {
return 0;
}
done = true;

While I imagine this would work in most real-world cases, it's not strictly correct.

  1. A second thread calling lll_clock_wait would return immediately without the clock being ready.
  2. If the first clock setup fails, future calls don't recognize that the clock is not running and return 0 immediately.

@cvinayak
Copy link
Contributor

cvinayak commented Dec 7, 2020

@JordanYates This is the LF clock which is also the core system clock, it is only started once on power up. The lll_clock_wait only needs to ensure the clock has settled, i.e. truely switched to crystal/RC source and stable. This clock is not to be stopped after power up.

A second thread calling lll_clock_wait would return immediately without the clock being ready.

By design, only one thread shall dispatch HCI cmd and data to the controller, and is a co-operative thread.

If the first clock setup fails, future calls don't recognize that the clock is not running and return 0 immediately.

Clock setup/settle failure is consider hardware fatal error, there is no expectation to succeed future calls, example, missing/damaged 32 KHz crystal on the board.

cvinayak added a commit to cvinayak/zephyr that referenced this issue Dec 8, 2020
Fix lll_clock_wait function to wait for LF clock to settle
only once after power up.

Regression introduced in commit 2b47630 ("bluetooth:
controller: Adapt to onoff clock control").

Fixes zephyrproject-rtos#30480.

Signed-off-by: Vinayak Kariappa Chettimada <[email protected]>
carlescufi pushed a commit that referenced this issue Dec 17, 2020
Fix lll_clock_wait function to wait for LF clock to settle
only once after power up.

Regression introduced in commit 2b47630 ("bluetooth:
controller: Adapt to onoff clock control").

Fixes #30480.

Signed-off-by: Vinayak Kariappa Chettimada <[email protected]>
github-actions bot pushed a commit that referenced this issue Dec 17, 2020
Fix lll_clock_wait function to wait for LF clock to settle
only once after power up.

Regression introduced in commit 2b47630 ("bluetooth:
controller: Adapt to onoff clock control").

Fixes #30480.

Signed-off-by: Vinayak Kariappa Chettimada <[email protected]>
bwasim pushed a commit to bwasim/zephyr that referenced this issue Jan 5, 2021
Fix lll_clock_wait function to wait for LF clock to settle
only once after power up.

Regression introduced in commit 2b47630 ("bluetooth:
controller: Adapt to onoff clock control").

Fixes zephyrproject-rtos#30480.

Signed-off-by: Vinayak Kariappa Chettimada <[email protected]>
nashif pushed a commit that referenced this issue Jan 10, 2021
Fix lll_clock_wait function to wait for LF clock to settle
only once after power up.

Regression introduced in commit 2b47630 ("bluetooth:
controller: Adapt to onoff clock control").

Fixes #30480.

Signed-off-by: Vinayak Kariappa Chettimada <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants