
High bluetooth ISR latency with CONFIG_BT_MAX_CONN=2 #29994

Closed
jmgao opened this issue Nov 13, 2020 · 36 comments · Fixed by #31753
Assignees
Labels
area: Bluetooth Controller area: Bluetooth area: Clock Control bug The issue is a bug, or the PR is fixing a bug platform: nRF Nordic nRFx priority: medium Medium impact/importance bug
Milestone

Comments

@jmgao
Contributor

jmgao commented Nov 13, 2020

Describe the bug

I'm seeing high ISR latencies with CONFIG_BT_MAX_CONN set to 2, which I suspect might be resulting in a !radio_is_ready() assertion firing after multiple hours (perhaps due to unlucky collision with another ISR).

To Reproduce
Apply the following patch to samples/bluetooth/peripheral:

diff --git a/samples/bluetooth/peripheral/prj.conf b/samples/bluetooth/peripheral/prj.conf
index 9a56b2c122..fc7eacf030 100644
--- a/samples/bluetooth/peripheral/prj.conf
+++ b/samples/bluetooth/peripheral/prj.conf
@@ -1,6 +1,9 @@
 # Incresed stack due to settings API usage
 CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=2048
 
+CONFIG_BT_CTLR_PROFILE_ISR=y
+CONFIG_BT_MAX_CONN=2
+
 CONFIG_BT=y
 CONFIG_BT_DEBUG_LOG=y
 CONFIG_BT_SMP=y

Build and flash (tested on particle_xenon).
Use nRF Connect to connect to the device, open the vendor-defined service, and wait.
I'm seeing maximum ISR latencies of >50us within a minute or so, and >70us after a few minutes of waiting.

Expected behavior
Lower peak ISR latencies

Impact
!radio_is_ready assertions firing

Logs and console output

[00:01:09.077,667] <inf> bt_ctlr_hci: l: 19, 1, 46; t: 7, 7, 23.
[00:01:09.078,002] <inf> bt_ctlr_hci: l: 13, 1, 46; t: 15, 7, 23.
[00:01:09.078,918] <inf> bt_ctlr_hci: l: 8, 1, 46; t: 8, 7, 23.
[00:01:09.084,136] <inf> bt_ctlr_hci: l: 1, 1, 46; t: 18, 7, 23.
[00:01:09.174,163] <inf> bt_ctlr_hci: l: 14, 1, 46; t: 17, 7, 23.
[00:01:09.185,363] <inf> bt_ctlr_hci: l: 19, 1, 46; t: 8, 7, 23.

<snip>

[00:03:39.603,515] <inf> bt_ctlr_hci: l: 13, 1, 75; t: 14, 7, 23.
[00:03:39.604,461] <inf> bt_ctlr_hci: l: 19, 1, 75; t: 7, 7, 23.
[00:03:39.620,849] <inf> bt_ctlr_hci: l: 1, 1, 75; t: 18, 7, 23.
[00:03:39.705,871] <inf> bt_ctlr_hci: l: 19, 1, 75; t: 8, 7, 23.
[00:03:39.706,207] <inf> bt_ctlr_hci: l: 13, 1, 75; t: 16, 7, 23.
[00:03:39.707,153] <inf> bt_ctlr_hci: l: 19, 1, 75; t: 7, 7, 23.
[00:03:39.707,489] <inf> bt_ctlr_hci: l: 13, 1, 75; t: 14, 7, 23.
[00:03:39.708,435] <inf> bt_ctlr_hci: l: 19, 1, 75; t: 7, 7, 23.
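
A note on reading these lines: with CONFIG_BT_CTLR_PROFILE_ISR=y the controller prints two triples, and judging from the values above each triple appears to be the latest, minimum, and maximum sample in microseconds ("l:" for ISR latency, "t:" for ISR execution time). As a rough illustration of that kind of running statistic (names are illustrative, not the controller's internals):

#include <stdint.h>

/* Hypothetical running latest/min/max statistic, similar in spirit to
 * what the "l: cur, min, max" output above suggests. Not taken from the
 * Zephyr controller sources.
 */
struct isr_stat {
	uint32_t latest;
	uint32_t min;
	uint32_t max;
};

static void isr_stat_init(struct isr_stat *s)
{
	s->latest = 0U;
	s->min = UINT32_MAX;
	s->max = 0U;
}

static void isr_stat_update(struct isr_stat *s, uint32_t sample_us)
{
	s->latest = sample_us;
	if (sample_us < s->min) {
		s->min = sample_us;
	}
	if (sample_us > s->max) {
		s->max = sample_us;
	}
}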

Environment (please complete the following information):

  • OS: linux host, Android R on Pixel 4 for BT client
  • Toolchain: Zephyr SDK 0.11.1
  • Commit SHA or Version used: v2.4.0

Additional context
@cvinayak

@jmgao jmgao added the bug The issue is a bug, or the PR is fixing a bug label Nov 13, 2020
@joerchan
Contributor

It could perhaps be this?

static void hci_num_completed_packets(struct net_buf *buf)
{
        /* ... */
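		/* The lines below sit inside the loop over the completed-packet
		 * handles (elided above), hence the continue statement.
		 */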
		key = irq_lock();

		conn = bt_conn_lookup_handle(handle);
		if (!conn) {
			irq_unlock(key);
			BT_ERR("No connection for handle %u", handle);
			continue;
		}

		irq_unlock(key);

@jhedberg Is this irq_lock really needed? We don't lock IRQs anywhere else where we look up the conn from the handle.
Not much can change with regard to the handles here, I think.
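
For context, the change being suggested here (and which jmgao is asked to test below) would look roughly like the following. This is only a sketch, assuming bt_conn_lookup_handle() is safe to call without the surrounding critical section; it is not the actual patch from this thread.

		conn = bt_conn_lookup_handle(handle);
		if (!conn) {
			/* No irq_lock()/irq_unlock() pair around the lookup */
			BT_ERR("No connection for handle %u", handle);
			continue;
		}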

@cvinayak
Contributor

The reporter sees increasing latencies over time, which has no explanation unless there is some sort of O(n) traversal.

@carlescufi carlescufi added area: Bluetooth priority: high High impact/importance bug labels Nov 13, 2020
@carlescufi
Member

Setting this as high since ISR latency is critical for most users.

@carlescufi
Member

The reporter sees increasing latencies over time, which has no explanation unless there is some sort of O(n) traversal.

Well, you do display the worst case recorded, right? So it could be that over time we hit the worst-case scenario?

@jmgao
Contributor Author

jmgao commented Nov 13, 2020

Yeah, I don't think it's actually linear growth, just more time to get unlucky. I've seen assertions firing anywhere between 2 and 8 hours of uptime.

(edit for clarification: I saw the assertions with my own project, not the sample)

@joerchan
Contributor

@jmgao Can you try commenting out the irq_lock/irq_unlock in hci_num_completed_packets around bt_conn_lookup_handle?
Source file: subsys/bluetooth/host/hci_core.c.

@jmgao
Contributor Author

jmgao commented Nov 13, 2020

Commenting out the first set of irq_lock/unlocks doesn't seem to help.

@jmgao
Contributor Author

jmgao commented Nov 14, 2020

Sidenote: increasing BT_CTLR_ULL_HIGH_PRIO and BT_CTLR_ULL_LOW_PRIO to 1 appears to make the symptoms go away. It's strange to me that the critical Radio ISR and the low-priority upper-level ISR have the same priority by default. Is ULL_LOW_PRIO actually under a hard real-time deadline? If not, that seems like it's just asking for priority inversion...
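
For anyone reproducing this, the override described above corresponds to a prj.conf fragment along these lines (CONFIG_BT_CTLR_ADVANCED_FEATURES=y is what exposes the priority options, as also noted in option 2 further down in this thread):

CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_ULL_HIGH_PRIO=1
CONFIG_BT_CTLR_ULL_LOW_PRIO=1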

@joerchan
Contributor

I agree, those look like strange IRQ priority defaults.
@jmgao You should probably only increase BT_CTLR_ULL_LOW_PRIO though. Leave BT_CTLR_ULL_HIGH_PRIO at the default value (0).

@jmgao
Contributor Author

jmgao commented Nov 16, 2020

Oh, whoops, I think I tested with zero-latency interrupts (ZLI) in my latest comment: with ZLI off, I'm still seeing elevated latencies with BT_CTLR_ULL_LOW_PRIO increased (and with HIGH_PRIO increased as well), although maybe somewhat better (max of 65 us after 5 minutes).

@cvinayak
Contributor

cvinayak commented Nov 17, 2020

appears to make the symptoms go away.

With ZLI, the controller's interrupts are at the highest priority, and hence other interrupts or irq locks do not introduce latency.

I'm still seeing elevated latencies with BT_CTLR_ULL_LOW_PRIO increased (and with HIGH_PRIO increased as well)

The fact that latencies are observed with ULL_LOW and ULL_HIGH priorities increased indicates that there is some other interrupt at the same priority (ARM interrupt priority value 0) as the Radio interrupt. As this is your own project, what other interrupts/peripherals are you using?

@jmgao
Contributor Author

jmgao commented Nov 17, 2020

As this is your own project, what other interrupts/peripherals are you using?

This is with the peripheral sample.

The fact that latencies are observed with ULL_LOW and ULL_HIGH priorities increased indicates that there is some other interrupt at the same priority (ARM interrupt priority value 0) as the Radio interrupt.

Or an irq_lock in a lower-priority interrupt, right?
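
To illustrate the mechanism being referred to: on these nRF5x (Cortex-M) targets, an irq_lock() critical section masks every interrupt that is not a zero-latency IRQ, so however long lower-priority code holds the lock adds directly to the Radio ISR latency. A minimal sketch of the effect (the busy-wait stands in for arbitrary work; this is not code from the tree):

#include <zephyr.h>	/* Zephyr v2.4-era header; newer trees use <zephyr/kernel.h> */

void lower_priority_work(void)
{
	unsigned int key = irq_lock();

	/* Anything done here delays all non-zero-latency ISRs,
	 * including the controller's Radio ISR.
	 */
	k_busy_wait(10);	/* stand-in for real work: ~10 us of added latency */

	irq_unlock(key);
}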

@cvinayak cvinayak self-assigned this Nov 22, 2020
@cvinayak
Contributor

cvinayak commented Dec 3, 2020

During my work-in-progress tasks, I am noticing a similar gradual increase in the latency reported by the controller. I suspect logging; hence, please disable the logging subsystem, use a printk instead to display the ISR latencies, and let us know whether you see a similar gradual increase in latencies.
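
As a sketch of that experiment: the logging subsystem can be turned off and the output routed through printk, for example with a prj.conf fragment along these lines (exact symbols depend on the tree, and the controller's BT_INFO-based ISR-profile print would then need to be replaced by a printk by hand, since disabling CONFIG_LOG silences it):

CONFIG_LOG=n
CONFIG_PRINTK=y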

@carlescufi carlescufi added the Waiting for response Waiting for author's response label Dec 3, 2020
@cvinayak
Contributor

cvinayak commented Dec 4, 2020

(screenshot of console output)

Is the increase in latency in some way related to log buffer overflow? @nordic-krch I see that log bytes are truncated, and then the next log message from the controller has an increased measured ISR latency. Does logging disable interrupts when log messages are truncated or rotated?

@cvinayak
Contributor

cvinayak commented Dec 4, 2020

(screenshot of console output)

OK, even without the logging subsystem I see high latencies... will investigate further.

@cvinayak
Contributor

cvinayak commented Dec 4, 2020

(screenshot of console output)

No added latency seen for >15 minutes of connection with a phone as central and simultaneous advertising when using the following:
CONFIG_ZERO_LATENCY_IRQS=y
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_ZLI=y

This rules out the controller's own implementation as the source of the latencies.
The latencies should be coming from outside the Bluetooth subsystem (and it's not logging either).
See #29994 (comment)

@koffes you may have interest here.

@cvinayak cvinayak removed the Waiting for response Waiting for author's response label Dec 4, 2020
@koffes
Collaborator

koffes commented Dec 4, 2020

I can confirm that I am seeing similar issues.
Random firings of !radio_is_ready, especially in https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/bluetooth/controller/ll_sw/nordic/lll/lll_conn.c#L316
The issue is most prominent if I activate CPU load reports and have logging active:

CONFIG_TRACING=y
CONFIG_TRACING_CPU_STATS=y
CONFIG_TRACING_CPU_STATS_LOG=y
CONFIG_TRACING_CPU_STATS_INTERVAL=5000

I tried with CONFIG_ZERO_LATENCY_IRQS=y + CONFIG_BT_CTLR_ZLI=n and
CONFIG_ZERO_LATENCY_IRQS=n + CONFIG_BT_CTLR_ZLI=n
In both cases the assert happens. I know the first combo of settings is not recommended.

@cvinayak
Contributor

cvinayak commented Dec 4, 2020

@koffes What was the result for the following in your application:
CONFIG_ZERO_LATENCY_IRQS=y
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_ZLI=y

@cvinayak
Contributor

cvinayak commented Dec 7, 2020

The controller currently has O(n) execution for scheduling conflict resolution and request processing in the ULL_HIGH and ULL_LOW execution contexts.

When the Radio ISR, LLL, ULL_HIGH, and ULL_LOW execution are configured at the same IRQ priority, they introduce ISR latencies amongst themselves.

There are three Kconfig options to mitigate the controller's self-induced ISR latencies (plus a proposal):

  1. Use of Zero Latency IRQs (which inherently splits out the Radio ISR to a high priority):
CONFIG_ZERO_LATENCY_IRQS=y
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_ZLI=y
  2. Explicit use of a higher IRQ priority for the Radio ISR (and LLL):
CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_LLL_PRIO=0
CONFIG_BT_CTLR_ULL_HIGH_PRIO=1
CONFIG_BT_CTLR_ULL_LOW_PRIO=1
  3. Use of the low-latency controller design (applicable for slower CPUs such as the nRF51 series, Cortex-M0 at 16 MHz):
CONFIG_BT_CTLR_LOW_LAT=y
CONFIG_BT_CTLR_LOW_LAT_ULL=y
  4. (Proposal) Limit the controller's processing to a maximum CPU use per iteration.

I would appreciate it if options 1 to 3 are tested and any observations noted here in this issue.

@jmgao
Contributor Author

jmgao commented Dec 7, 2020

Option 1 appeared to fix things for me, option 2 still had elevated latencies.

@koffes
Collaborator

koffes commented Dec 8, 2020

@cvinayak : I tested with the combination you suggested. It then asserts almost right away.

The call stack looks like: lll_hfclock_off() -> z_nrf_clock_bt_ctlr_hf_release() -> hfclk_stop(); then something strange happens with the call stack, as the actual assert seems to be triggered in int64_t z_impl_k_uptime_ticks(void), then k_spin_lock(struct k_spinlock *l) -> __ASSERT(z_spin_lock_valid(l), "Recursive spinlock %p", l);

Let me know and I can show you my debug session.
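
For readers unfamiliar with that assert: when CONFIG_ASSERT and spinlock validation (CONFIG_SPIN_VALIDATE) are enabled, k_spin_lock() checks that the same spinlock is not taken again by the same CPU before being released, which is what a re-entered or corrupted call path would trip. A minimal standalone illustration of the failing check (not the controller path described above):

#include <zephyr.h>

static struct k_spinlock my_lock;

void recursive_lock_example(void)
{
	k_spinlock_key_t key = k_spin_lock(&my_lock);

	/* Taking the same spinlock again without releasing it first trips
	 * __ASSERT(z_spin_lock_valid(l), "Recursive spinlock %p", l);
	 */
	k_spinlock_key_t key2 = k_spin_lock(&my_lock);

	k_spin_unlock(&my_lock, key2);
	k_spin_unlock(&my_lock, key);
}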

@carlescufi
Member

carlescufi commented Dec 10, 2020

Option 1 appeared to fix things for me, option 2 still had elevated latencies.

@jmgao Have you tried 3 as well?

@nashif
Member

nashif commented Jan 5, 2021

any updates on that?

@nashif nashif added the priority: medium Medium impact/importance bug label Jan 5, 2021
@nashif
Member

nashif commented Jan 5, 2021

Lowering to medium because this only happens with the shell enabled.

@nashif nashif removed the priority: high High impact/importance bug label Jan 5, 2021
@jmgao
Contributor Author

jmgao commented Jan 8, 2021

Sorry, missed the questions in my email inbox. I vaguely recall trying option 3 and it increasing the time to failure to a day or two, but it's been a while.

I don't believe the shell needs to be enabled for this to happen, although it might increase the chances of a fatal assert.

@carlescufi
Member

I don't believe the shell needs to be enabled for this to happen, although it might increase the chances of a fatal assert.

@koffes and @jmgao
Could you please verify this statement? It's important that we figure out whether enabling the shell is a factor here.

@carlescufi carlescufi added this to the v2.5.0 milestone Jan 14, 2021
@koffes
Collaborator

koffes commented Jan 15, 2021

@carlescufi I am unable to reproduce on recent master.

Attempted with CONFIG_BT_MAX_CONN=2 in samples/bluetooth/peripheral/prj.conf and hci_rpmsg with CONFIG_SHELL=y and CONFIG_TRACING=y on the NET core of the nRF5340.

Also tried in our custom application with TRACING and SHELL active.
My observed issue with !radio_is_ready is only related to this issue though.

@jmgao
Contributor Author

jmgao commented Jan 16, 2021

Sorry for the delay; I also had some problems reproducing, but that was because I tested without connecting to the device: I was using a different board on which connections were failing, and I spent the last few days trying to figure out why. (I was running it on a board that doesn't have a 32 kHz crystal, and apparently CONFIG_CLOCK_CONTROL_NRF_K32SRC_SYNTH is good enough for my project but not for the peripheral sample? I'm still not quite sure what's going on...)

I tested with particle_xenon on master as of 5 minutes ago: it reproduces without shell enabled. Cherry-picking #30829 does not help.

(I was wondering if the problem could be related to RC oscillator calibration, because I vaguely recall skimming across something in an nRF datasheet saying that it would do regular calibration. I checked the board I used in the original report and particle_xenon: both of them are definitely using CLOCK_CONTROL_NRF_K32SRC_XTAL, and regardless, the problem seems to occur with both K32SRC_RC and K32SRC_XTAL.)
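
For reference, the 32 kHz clock source mentioned here is a Kconfig choice; on nRF targets the relevant symbols are the following (only one can be selected):

# LFCLK (32.768 kHz) source selection on nRF, choose one:
CONFIG_CLOCK_CONTROL_NRF_K32SRC_XTAL=y    # external 32.768 kHz crystal
CONFIG_CLOCK_CONTROL_NRF_K32SRC_RC=y      # internal RC oscillator, periodically calibrated
CONFIG_CLOCK_CONTROL_NRF_K32SRC_SYNTH=y   # synthesized from HFCLK (advised against for BLE below)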

@carlescufi
Member

@jmgao

Sorry for the delay; I also had some problems reproducing, but that was because I tested without connecting to the device: I was using a different board on which connections were failing, and I spent the last few days trying to figure out why. (I was running it on a board that doesn't have a 32 kHz crystal, and apparently CONFIG_CLOCK_CONTROL_NRF_K32SRC_SYNTH is good enough for my project but not for the peripheral sample? I'm still not quite sure what's going on...)

You should not use SYNTH; that will not work reliably with Bluetooth. Use either XTAL or RC.

I tested with particle_xenon on master as of 5 minutes ago: it reproduces without shell enabled. Cherry-picking #30829 does not help.

Which application did you use, your own or a Zephyr sample?

@jmgao
Contributor Author

jmgao commented Jan 19, 2021

Which application did you use, your own or a Zephyr sample?

The Zephyr sample in samples/bluetooth/peripheral

@carlescufi
Member

I tested with particle_xenon on master as of 5 minutes ago: it reproduces without shell enabled. Cherry-picking #30829 does not help.

Do you have a Nordic nRF52840 DK that you could try to reproduce this on?
If not, could you please confirm that the instructions in the issue description are correct for reproducing the issue?

@pabigot you have a Particle Xenon, right? Could you please follow the instructions above and see if you can reproduce?
By reproduce, I mean latencies displayed above 60 us and any assert triggered.

@pabigot
Collaborator

pabigot commented Jan 21, 2021

On the Xenon I do get a latency spike around the time service discovery is done. It seems to be a one-time thing, and I don't see any asserts.

[00:01:08.277,313] <inf> bt_ctlr_hci: l: 2, 1, 19; t: 13, 7, 23.
[00:01:08.291,290] <inf> bt_ctlr_hci: l: 2, 1, 19; t: 15, 7, 23.
[00:01:08.321,289] <inf> bt_ctlr_hci: l: 2, 1, 19; t: 17, 7, 23.
[00:01:08.378,112] <inf> bt_ctlr_hci: l: 19, 1, 19; t: 8, 7, 23.
[00:01:08.378,753] <inf> bt_ctlr_hci: l: 19, 1, 19; t: 7, 7, 23.
[00:01:08.379,455] <inf> bt_ctlr_hci: l: 77, 1, 77; t: 8, 7, 23.
[00:01:08.381,317] <inf> bt_ctlr_hci: l: 13, 1, 77; t: 16, 7, 23.
[00:01:08.481,994] <inf> bt_ctlr_hci: l: 19, 1, 77; t: 8, 7, 23.
[00:01:08.482,635] <inf> bt_ctlr_hci: l: 19, 1, 77; t: 7, 7, 23.

edit: I see the same thing on the nRF52840 DK, with spikes there of 74 then 79.

@cvinayak
Contributor

Thank you @pabigot. I am not able to reproduce the assertion either, but I do see similar latencies on the nrf52840dk.

(screenshot of console output)

@cvinayak
Contributor

cvinayak commented Jan 28, 2021

I have been adding an implementation to in-system profile the controller's other ISRs, and have noticed an implementation flaw that executes both ULL_HIGH and ULL_LOW in the same ISR when their priorities are the same (instead of tail-chaining/yielding so as to permit the naturally higher-priority Radio ISR execution, which is at the same IRQ priority level as ULL_HIGH and ULL_LOW).

This is a regression in nRF52 and nRF53 introduced when nRF53 support was added. (78b461a)

I am working on a solution to decouple ULL_LOW and allow early vectoring to the Radio ISR.

With the new ticker implementation in the split controller, it is not feasible to have LLL and ULL_HIGH/ULL_LOW at the same IRQ priority level; hence, I have submitted #31753.

@jmgao please check that you get reduced Radio ISR latency using this PR: #31753
@pabigot could you also check, on the xenon board you have, that with #31753 the latency is around ~20 us.

cvinayak added a commit to cvinayak/zephyr that referenced this issue Jan 28, 2021
Fix the default ULL_HIGH execution priority to be lower than
LLL execution priority to reduce Radio ISR service latency.

Also, default to Zero Latency IRQ for Radio ISR if enabled
in the kernel.

Fixes zephyrproject-rtos#29994.

Signed-off-by: Vinayak Kariappa Chettimada <[email protected]>
nashif pushed a commit that referenced this issue Feb 2, 2021
Fix the default ULL_HIGH execution priority to be lower than
LLL execution priority to reduce Radio ISR service latency.

Also, default to Zero Latency IRQ for Radio ISR if enabled
in the kernel.

Fixes #29994.

Signed-off-by: Vinayak Kariappa Chettimada <[email protected]>