Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpu/esp: improve thread safety in newlib locking functions #18544

Merged

Conversation

gschorcht
Copy link
Contributor

Contribution description

This PR improves the thread safety of the newlib locking functions as following:

  • Dynamic allocation and initialization of the mutex used by a newlib locking variable must not be interrupted. Since a thread context switch can occur on exit from an ISR, the allocation and initialization of such a mutex is guarded by disabling interrupts. The same is done for the release of such a mutex.
  • Locking functions implicitly allocate and initialize the mutex used by a newlib locking variable if necessary. This implicit allocation and initialization as well as the subsequent acquiring must not be interrupted. Since a thread context switch can occur on exit from an ISR, the implicit allocation and initialization and the subsequent acquiring of such a newlib locking variable must be guarded by disabling interrupts. The same must be done for the release of such a newlib locking variable.
  • Module malloc_thread_safe is not needed any longer.

This PR fixes issue #18534 and reverts the change made in PR #18535.

Testing procedure

BOARD=esp32-wroom-32 make flash test -C tests/thread_priority_inversion flash test

should work with this PR.

BOARD=esp32-wroom-32 make-C tests/malloc_thread_safe flash test 

should still work.

Issues/PRs references

Fixes #18534

@github-actions github-actions bot added Area: cpu Area: CPU/MCU ports Area: tests Area: tests and testing framework Platform: ESP Platform: This PR/issue effects ESP-based platforms labels Aug 31, 2022
@gschorcht
Copy link
Contributor Author

gschorcht commented Aug 31, 2022

@maribu Two questions regarding this PR:

  • The stack size THREAD_STACKSIZE_SMALL for the duelling threads in tests/malloc_thread_safe is too small for ESP8266. What would be the best way from your point of view to increase it only for ESP8266. Should I override THREAD_STACKSIZE_SMALL in cpu/esp8266/include/cpu_conf.h?
  • What might be the reason that the following assertion fails for ESP8266
    assert(atomic_load_kernel_pid(&rmutex->owner) == thread_getpid());
    if interrupts are disabled in _lock_acquire_recursive here?
    uint32_t state = irq_disable();
    It works with enabled interrupts for ESP8266. For ESP32 it works as it should.

Copy link
Member

@maribu maribu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me. If I didn't miss anything, no IRQ disabling is needed in the lock acquire functions.

cpu/esp_common/syscalls.c Show resolved Hide resolved
cpu/esp_common/syscalls.c Show resolved Hide resolved
@@ -142,13 +156,17 @@ void IRAM_ATTR _lock_acquire(_lock_t *lock)

assert(lock != NULL);

uint32_t state = irq_disable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be needed, the mutex_lock() below should disable IRQs at critical sections already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I thought too. But without disabling interrupts, tests/malloc_thread_safe crashes on ESP32.

@@ -176,13 +194,17 @@ void IRAM_ATTR _lock_acquire_recursive(_lock_t *lock)

assert(lock != NULL);

uint32_t state = irq_disable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

@@ -194,16 +216,22 @@ int IRAM_ATTR _lock_try_acquire(_lock_t *lock)

assert(lock != NULL);

uint32_t state = irq_disable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

@@ -215,16 +243,22 @@ int IRAM_ATTR _lock_try_acquire_recursive(_lock_t *lock)

assert(lock != NULL);

uint32_t state = irq_disable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

@maribu
Copy link
Member

maribu commented Aug 31, 2022

The stack size THREAD_STACKSIZE_SMALL for the duelling threads in tests/malloc_thread_safe is too small for ESP8266. What would be the best way from your point of view to increase it only for ESP8266. Should I override THREAD_STACKSIZE_SMALL in cpu/esp8266/include/cpu_conf.h?

Sounds good to me. Those definitions are a bit finicky, but to my understanding the basic idea is that if a given program runs fine on Cortex M with the given stack size define, it should also run on all others.

What might be the reason that the following assertion fails for ESP8266 [...]

The assertion should actually detect when one thread is unlocking the rmutex owned by another (something that is perfectly fine for mutex, btw). Assuming that lock operations and unlock operations are matching (that is the same number of unlocks as locks is issued and only an rmutex actually owned by the calling thread is unlocked), I would guess some kind of memory corruption has happened.

@gschorcht
Copy link
Contributor Author

I had to change a lot to get the locking working for both ESP8266 and ESP32:

  • The initialization of locking variables with retargetable locking had to be added to avoid their implicit initialization in _lock_acquire* functions. Now the locking variables are either statically defined and initialized or they have to be allocated and initialized explicitly by calling _lock_init*. asserts ensure that locking variables are initialized when _lock_acquire* functions are called.
  • For ESP32, the operations on recursive locking variables had to be guarded by disabling interrupts to prevent unintended context switches. For ESP8266, interrupts must not be disabled, otherwise the intended context switch doesn't work when trying to lock a rmutex that is already locked by another thread. That was the reason for the assert in rmutex_unlock.

With these changes tests/malloc_thread_safety and tests/thread_priority_inversion work absolutely stable on ESP32. Without these changes, tests/malloc_thread_safety crashes often, probably because of unintended context switches on exit from ISRs. I wasn't able to catch the problem in debugger.

@gschorcht gschorcht added CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR CI: run tests If set, CI server will run tests on hardware for the labeled PR labels Sep 1, 2022
@gschorcht
Copy link
Contributor Author

Compilation was successful in CI, however 225 of 227 tests failed for ESP32 log with error message:

make: *** No rule to make target 'flash-only'.  Stop.

which seems to be completely unrelated.

@gschorcht gschorcht removed the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Sep 1, 2022
@MrKevinWeiss
Copy link
Contributor

Can you limit your build so we can test a bit more?

diff --git a/.murdock b/.murdock
index e2432b2c09..49f1f98b96 100755
--- a/.murdock
+++ b/.murdock
@@ -1,9 +1,9 @@
 #!/bin/sh
 
 # uncomment and change this to limit builds, e.g.,
-#export BOARDS="samr21-xpro native"
+export BOARDS="esp32-wroom-32"
 # and / or
-#export APPS="examples/hello-world tests/unittests"
+export APPS="tests/thread_priority_inversion"
 
 #
 : ${TEST_BOARDS_AVAILABLE:="esp32-wroom-32 samr21-xpro"}

@MrKevinWeiss
Copy link
Contributor

AFAIK the pi fleet only has 1 esp32-wroom-32 still working, if that has gone down than hardware tests may be a bit of a challenge.

@riot-hil-bot
Copy link

HiL Test Results

PASS FAIL SKIP
20 2 0
  ❌ arduino-mega2560 (1 fail test)
PASS FAIL SKIP
0 1 0
TEST RESULT
tests/thread_priority_inversion ❌ test fail
  😬 frdm-k64f (1 fail flash)
PASS FAIL SKIP
0 1 0
TEST RESULT
tests/thread_priority_inversion 😬 flash fail
  ✅ frdm-k22f
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-f091rc
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-f303re
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-f767zi
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-l152re
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-l432kc
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ saml10-xpro
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ saml11-xpro
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ samr21-xpro
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nrf52dk
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ remote-revb
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-l073rz
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ slstk3401a
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-f411re
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-f103rb
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ frdm-kw41z
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ esp32-wroom-32
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ z1
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ arduino-due
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ slstk3400a
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass

@MrKevinWeiss
Copy link
Contributor

Whoa, the HiL actually did what I wanted it too. The pi-fleet is in fact broken now. I think @kaspar030 is on it. Meanwhile it looks like this PR fixes the one test. It seems like there is a problem in the arduino-mega2560 unrelated to this PR but probably good to look into.

@kaspar030 kaspar030 added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Sep 1, 2022
@gschorcht
Copy link
Contributor Author

  • The initialization of locking variables with retargetable locking had to be added to avoid their implicit initialization in _lock_acquire* functions. Now the locking variables are either statically defined and initialized or they have to be allocated and initialized explicitly by calling _lock_init*. asserts ensure that locking variables are initialized when _lock_acquire* functions are called.

After some further testing, I'm afraid I have to admit that it's not possible after all to assume that a lock variable is already assigned and initialized the first time _lock_acquire* is used. When esp_wifi is used, some ESP IDF sources are used that use various static lock variables without allocating and initializing them using _lock_init*. They simply use _lock_acquire* and _lock_release and rely on implicit allocation and initialization.

That is, I have to revert some changes. @maribu Would it perhaps be better to completely squash and rebase the changes rather than fix the fixes of the fixes?

@maribu
Copy link
Member

maribu commented Sep 1, 2022

Do as you consider is best.

And more generally: My review-workflow doesn't really profit from the "squash only when asked" policy. So if a PR has only reviews of me, you can safely assume implicit approval to squashing at will ;)

@maribu
Copy link
Member

maribu commented Sep 1, 2022

Oh, regarding squashing: @MrKevinWeiss also reviewed this PR. Sorry that I lost track. This omission was not intentional, but human error due to more context switching between task than my brain can safely handle :/

Anyway, from my side you have both implicit and explicit support for squashing ;)

With the improvements of the locking mechanism, thread safety of malloc/realloc/calloc/free is guaranteed. Module malloc_thread_safe is not needed any longer.
Dynamic allocation and initialization of the mutex used by a newlib locking variable must not be interrupted. Since a thread context switch can occur on exit from an ISR, the allocation and initialization of the mutex must be guarded by disabling interrupts. The same must be done for the release of such a locking variable.
@gschorcht gschorcht force-pushed the cpu/esp/improve_thread_safety_of_malloc branch from baa316d to 5b713f7 Compare September 1, 2022 15:21
@gschorcht gschorcht requested a review from kaspar030 as a code owner September 1, 2022 15:21
@github-actions github-actions bot added the Area: CI Area: Continuous Integration of RIOT components label Sep 1, 2022
@gschorcht
Copy link
Contributor Author

Can you limit your build so we can test a bit more?

@MrKevinWeiss Log

@gschorcht gschorcht force-pushed the cpu/esp/improve_thread_safety_of_malloc branch 3 times, most recently from e9a31ff to 966930e Compare September 1, 2022 16:36
@gschorcht
Copy link
Contributor Author

@maribu With commits be0e530 and 2b1ab9f tests/thread_priority_inversion also works for ESP8266. Should I split off these commits as a separate PR or would the change of the title to cover all the commits ok?

@MrKevinWeiss
Copy link
Contributor

Just do it in the one pr. I can start another hil test when I'm free to see if the esp8266 is fixed too.

@gschorcht
Copy link
Contributor Author

All tests runs were now successful log.

Copy link
Member

@maribu maribu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me. The CI found a few typos and one instance of two subsequent blank lines.

The doccheck will need and entry in exclude_pattern to not complain about the stack size (or an Doxygen comment). Both is fine with me.

@riot-hil-bot
Copy link

HiL Test Results

PASS FAIL SKIP
24 5 0
  ❌ nucleo-f207gz (1 fail build)
PASS FAIL SKIP
0 1 0
TEST RESULT
tests/thread_priority_inversion ❌ build fail
  ❌ arduino-mega2560 (1 fail test)
PASS FAIL SKIP
0 1 0
TEST RESULT
tests/thread_priority_inversion ❌ test fail
  ❌ saml21-xpro (1 fail test)
PASS FAIL SKIP
0 1 0
TEST RESULT
tests/thread_priority_inversion ❌ test fail
  ❌ frdm-k64f (1 fail test)
PASS FAIL SKIP
0 1 0
TEST RESULT
tests/thread_priority_inversion ❌ test fail
  ❌ stk3200 (1 fail test)
PASS FAIL SKIP
0 1 0
TEST RESULT
tests/thread_priority_inversion ❌ test fail
  ✅ frdm-k22f
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-f091rc
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-f303re
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-g474re
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-f767zi
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-l152re
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-l432kc
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ saml10-xpro
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ saml11-xpro
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ samr30-xpro
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ samr21-xpro
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nrf52dk
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ remote-revb
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-l073rz
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ slstk3401a
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-f411re
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ nucleo-f103rb
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ frdm-kw41z
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ esp32-wroom-32
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ z1
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ arduino-due
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ slstk3400a
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ samr34-xpro
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass
  ✅ same54-xpro
PASS FAIL SKIP
1 0 0
TEST RESULT
tests/thread_priority_inversion ✅ pass

For ESP32x, the operations on recursive locking variables have to be guarded by disabling interrupts to prevent unintended context switches. For ESP8266, interrupts must not be disabled, otherwise the intended context switch doesn't work when trying to lock a rmutex that is already locked by another thread.
If module `core_mutex_priority_inheritance` is enabled, the scheduling has to be active to lock/unlock the mutex/rmutex used by FreeRTOS semaphores. If scheduling is not active FreeRTOS semaphore function always succeed.
@gschorcht gschorcht force-pushed the cpu/esp/improve_thread_safety_of_malloc branch from 966930e to f183029 Compare September 2, 2022 06:58
@github-actions github-actions bot added Area: tools Area: Supplementary tools and removed Area: CI Area: Continuous Integration of RIOT components labels Sep 2, 2022
@gschorcht gschorcht removed the CI: run tests If set, CI server will run tests on hardware for the labeled PR label Sep 2, 2022
@gschorcht
Copy link
Contributor Author

Code looks good to me. The CI found a few typos and one instance of two subsequent blank lines.

The doccheck will need and entry in exclude_pattern to not complain about the stack size (or an Doxygen comment). Both is fine with me.

I squashed the fixes of static tests directly to avoid additional CI load by an extra compilation test cycle. Since all tests were already successful (log), I removed label CI: run tests for these "cosmetic" corrections.

@gschorcht gschorcht added CI: run tests If set, CI server will run tests on hardware for the labeled PR CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR and removed CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR CI: run tests If set, CI server will run tests on hardware for the labeled PR labels Sep 2, 2022
Copy link
Member

@maribu maribu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK. Looks good to me and the CI did the testing :)

@gschorcht gschorcht merged commit a0a0b64 into RIOT-OS:master Sep 5, 2022
@gschorcht gschorcht deleted the cpu/esp/improve_thread_safety_of_malloc branch September 20, 2022 06:27
@maribu maribu added this to the Release 2022.10 milestone Oct 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: cpu Area: CPU/MCU ports Area: tests Area: tests and testing framework Area: tools Area: Supplementary tools CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Platform: ESP Platform: This PR/issue effects ESP-based platforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tests/thread_priority_inversion failing with esp32-wroom-32
5 participants