-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cpu/esp: improve thread safety in newlib locking functions #18544
cpu/esp: improve thread safety in newlib locking functions #18544
Conversation
@maribu Two questions regarding this PR:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good to me. If I didn't miss anything, no IRQ disabling is needed in the lock acquire functions.
cpu/esp_common/syscalls.c
Outdated
@@ -142,13 +156,17 @@ void IRAM_ATTR _lock_acquire(_lock_t *lock) | |||
|
|||
assert(lock != NULL); | |||
|
|||
uint32_t state = irq_disable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be needed, the mutex_lock()
below should disable IRQs at critical sections already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I thought too. But without disabling interrupts, tests/malloc_thread_safe
crashes on ESP32.
cpu/esp_common/syscalls.c
Outdated
@@ -176,13 +194,17 @@ void IRAM_ATTR _lock_acquire_recursive(_lock_t *lock) | |||
|
|||
assert(lock != NULL); | |||
|
|||
uint32_t state = irq_disable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
cpu/esp_common/syscalls.c
Outdated
@@ -194,16 +216,22 @@ int IRAM_ATTR _lock_try_acquire(_lock_t *lock) | |||
|
|||
assert(lock != NULL); | |||
|
|||
uint32_t state = irq_disable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
cpu/esp_common/syscalls.c
Outdated
@@ -215,16 +243,22 @@ int IRAM_ATTR _lock_try_acquire_recursive(_lock_t *lock) | |||
|
|||
assert(lock != NULL); | |||
|
|||
uint32_t state = irq_disable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
Sounds good to me. Those definitions are a bit finicky, but to my understanding the basic idea is that if a given program runs fine on Cortex M with the given stack size define, it should also run on all others.
The assertion should actually detect when one thread is unlocking the rmutex owned by another (something that is perfectly fine for mutex, btw). Assuming that lock operations and unlock operations are matching (that is the same number of unlocks as locks is issued and only an rmutex actually owned by the calling thread is unlocked), I would guess some kind of memory corruption has happened. |
I had to change a lot to get the locking working for both ESP8266 and ESP32:
With these changes |
Compilation was successful in CI, however 225 of 227 tests failed for ESP32 log with error message:
which seems to be completely unrelated. |
Can you limit your build so we can test a bit more? diff --git a/.murdock b/.murdock
index e2432b2c09..49f1f98b96 100755
--- a/.murdock
+++ b/.murdock
@@ -1,9 +1,9 @@
#!/bin/sh
# uncomment and change this to limit builds, e.g.,
-#export BOARDS="samr21-xpro native"
+export BOARDS="esp32-wroom-32"
# and / or
-#export APPS="examples/hello-world tests/unittests"
+export APPS="tests/thread_priority_inversion"
#
: ${TEST_BOARDS_AVAILABLE:="esp32-wroom-32 samr21-xpro"} |
AFAIK the pi fleet only has 1 esp32-wroom-32 still working, if that has gone down than hardware tests may be a bit of a challenge. |
❌ arduino-mega2560 (1 fail test)
😬 frdm-k64f (1 fail flash)
✅ frdm-k22f
✅ nucleo-f091rc
✅ nucleo-f303re
✅ nucleo-f767zi
✅ nucleo-l152re
✅ nucleo-l432kc
✅ saml10-xpro
✅ saml11-xpro
✅ samr21-xpro
✅ nrf52dk
✅ remote-revb
✅ nucleo-l073rz
✅ slstk3401a
✅ nucleo-f411re
✅ nucleo-f103rb
✅ frdm-kw41z
✅ esp32-wroom-32
✅ z1
✅ arduino-due
✅ slstk3400a
|
Whoa, the HiL actually did what I wanted it too. The pi-fleet is in fact broken now. I think @kaspar030 is on it. Meanwhile it looks like this PR fixes the one test. It seems like there is a problem in the |
After some further testing, I'm afraid I have to admit that it's not possible after all to assume that a lock variable is already assigned and initialized the first time That is, I have to revert some changes. @maribu Would it perhaps be better to completely squash and rebase the changes rather than fix the fixes of the fixes? |
Do as you consider is best. And more generally: My review-workflow doesn't really profit from the "squash only when asked" policy. So if a PR has only reviews of me, you can safely assume implicit approval to squashing at will ;) |
Oh, regarding squashing: @MrKevinWeiss also reviewed this PR. Sorry that I lost track. This omission was not intentional, but human error due to more context switching between task than my brain can safely handle :/ Anyway, from my side you have both implicit and explicit support for squashing ;) |
With the improvements of the locking mechanism, thread safety of malloc/realloc/calloc/free is guaranteed. Module malloc_thread_safe is not needed any longer.
This reverts commit 2210721.
Dynamic allocation and initialization of the mutex used by a newlib locking variable must not be interrupted. Since a thread context switch can occur on exit from an ISR, the allocation and initialization of the mutex must be guarded by disabling interrupts. The same must be done for the release of such a locking variable.
baa316d
to
5b713f7
Compare
|
e9a31ff
to
966930e
Compare
Just do it in the one pr. I can start another hil test when I'm free to see if the esp8266 is fixed too. |
All tests runs were now successful log. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good to me. The CI found a few typos and one instance of two subsequent blank lines.
The doccheck will need and entry in exclude_pattern
to not complain about the stack size (or an Doxygen comment). Both is fine with me.
❌ nucleo-f207gz (1 fail build)
❌ arduino-mega2560 (1 fail test)
❌ saml21-xpro (1 fail test)
❌ frdm-k64f (1 fail test)
❌ stk3200 (1 fail test)
✅ frdm-k22f
✅ nucleo-f091rc
✅ nucleo-f303re
✅ nucleo-g474re
✅ nucleo-f767zi
✅ nucleo-l152re
✅ nucleo-l432kc
✅ saml10-xpro
✅ saml11-xpro
✅ samr30-xpro
✅ samr21-xpro
✅ nrf52dk
✅ remote-revb
✅ nucleo-l073rz
✅ slstk3401a
✅ nucleo-f411re
✅ nucleo-f103rb
✅ frdm-kw41z
✅ esp32-wroom-32
✅ z1
✅ arduino-due
✅ slstk3400a
✅ samr34-xpro
✅ same54-xpro
|
For ESP32x, the operations on recursive locking variables have to be guarded by disabling interrupts to prevent unintended context switches. For ESP8266, interrupts must not be disabled, otherwise the intended context switch doesn't work when trying to lock a rmutex that is already locked by another thread.
If module `core_mutex_priority_inheritance` is enabled, the scheduling has to be active to lock/unlock the mutex/rmutex used by FreeRTOS semaphores. If scheduling is not active FreeRTOS semaphore function always succeed.
966930e
to
f183029
Compare
I squashed the fixes of static tests directly to avoid additional CI load by an extra compilation test cycle. Since all tests were already successful (log), I removed label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK. Looks good to me and the CI did the testing :)
Contribution description
This PR improves the thread safety of the newlib locking functions as following:
malloc_thread_safe
is not needed any longer.This PR fixes issue #18534 and reverts the change made in PR #18535.
Testing procedure
should work with this PR.
should still work.
Issues/PRs references
Fixes #18534