-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing thread exit and thread cancellation #10524
Fixing thread exit and thread cancellation #10524
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Thanks for working on this.
__pthread_setcancelstate(PTHREAD_CANCEL_MASKED, &cs); | ||
if (cs == PTHREAD_CANCEL_DISABLE) __pthread_setcancelstate(cs, 0); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its sad to see emscripten differing from upstream here ... can you explain and maybe add comments as to why the default implementation doesn't work for us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like musl bug. According to POSIX spec https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html "2.9.5 Thread Cancellation" pthread_cond_timedwait and pthread_cond_wait funcitons are cancellation points.
Test pthread_cond_wait/2-3 from Open POSIX suite checks such behavior and cancels thread waiting on conditional variable. With masked cancellation this test hangs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if there is a musl bug, did you mark this as a #ifndef __EMSCRIPTEN__
with the intent to "let the native side retain its buggy behavior to not break anything there" and fix the bug for Emscripten? Or was the divergence for some other reason?
This code does implement a __pthread_testcancel()
call in the beginning of this file. It is not wrong for the implementation to mask cancellation afterwards (to my interpretation of the spec). What kind of behavior are you seeing in the test suite with regards to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I wanted to mark Emscripten related changes/fixes with this macro.
Problem is that test pthread_cond_wait/2-3 hangs, because there is scenario when thread waiting on condition variable is being cancelled. So if we set PTHREAD_CANCEL_MASKED for such thread it won't be cancelled.
I'll check behaviour of native musl with this test (pthread_cond_wait/2-3), maybe there is some other cancelation mechanism (like signals)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked musl-1.2.0 behaviour and yes there is a bug there.
Original idea for this patch was to use PTHREAD_CANCEL_MASKED
state distinguish:
- waits on functions which are non-cancellation points (e.g.
pthread_mutex_lock
) and can
be canceled only when user wants asynchronous cancellation (i.e. sets thread
cancelability type toPTHREAD_CANCEL_ASYNCHRONOUS
). Such functions call__timedwait()
to perform waiting. - waits on functions which are defined as cancellation points (
pthread_cond_wait
).
Such functions call__timedwait_cp()
This test was done by setting cancellation state to PTHREAD_CANCEL_MASKED
and using check:
int can_cancel_wait = pthread_self()->canceldisable == PTHREAD_CANCEL_ENABLE ||
(pthread_self()->canceldisable == PTHREAD_CANCEL_MASKED &&
pthread_self()->cancelasync == PTHREAD_CANCEL_ASYNCHRONOUS);
in __timedwait_cp
.
Original musl implementation which disables thread cancelability internally during call to waits (__timedwait
), disables also user request to allow thread to be asynchronously canceled (Cancelability Type set PTHREAD_CANCEL_ASYNCHRONOUS
). This causes that test pthread_setcanceltype/1-1
fails with native musl implementation.
As there are many functions (~30) in musl which in their implementation disables thread cancellation and additionally flag PTHREAD_CANCEL_MASKED
was added to avoid some cancellation race in pthread_cond_wait
in commit https://git.musl-libc.org/cgit/musl/commit/?id=8741ffe625363a553e8f509dc3ca7b071bdbab47, so in acd89cd I changed test for Emscripten to keep changes in musl minimal.
tests/test_posixtestsuite_prejs.js
Outdated
(function () { | ||
Module['onExit'] = function(status) { | ||
console.log('POSIX test finished with status:' + status); | ||
reportResultToServer(status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its really a shame this doesn't work like this just out of the box for all browser tests.. but I know its complicated.
tests/test_posixtestsuite.py
Outdated
# '-I', posix_test_dir('pthread_attr_destroy')]) | ||
|
||
|
||
@requires_threads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These test are marked as requires_threads but the above ones are not?
I wonder if we could run the other ones outside of browser tests?
For that matter, given that we now support node threads I wonder if we could make all of these just regular tests that run without the browser?
if (is_main_thread || pthread_self()->cancelasync == PTHREAD_CANCEL_ASYNCHRONOUS) { | ||
int can_cancel_wait = pthread_self()->canceldisable == PTHREAD_CANCEL_ENABLE || | ||
(pthread_self()->canceldisable == PTHREAD_CANCEL_MASKED && | ||
pthread_self()->cancelasync == PTHREAD_CANCEL_ASYNCHRONOUS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's store pthread_self()
to its own variable, there is no guarantee that it would be inlined.
@@ -80,9 +83,13 @@ int __timedwait(volatile int *addr, int val, | |||
clockid_t clk, const struct timespec *at, int priv) | |||
{ | |||
int cs, r; | |||
__pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, was this something that I had incorrectly added outside an #ifdef __EMSCRIPTEN__
block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, now this call has been moved to #else block.
tests/test_posixtestsuite.py
Outdated
'-I', posix_test_dir('pthread_create')]) | ||
|
||
|
||
# Test throws exception: need to fix thread function si |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is soo much copy-paste in this file. :( The original test suite enumerated files in directories to run all tests in the suite, can we follow that approach to avoid hand-maintaining this file for each test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 4c4fd41 I've refactored this to generate test methods based on list of valid Test Cases.
My idea is to have all valid TCs (valid in browser environment) listed in this file. In this approach we avoid running TCs which won't work in with Emscripten (i.e. ones testing fork and signals). WDYT?
Open POSIX test suite is licensed under GPLv2, https://github.com/juj/posixtestsuite/blob/master/COPYING , which to my interpretation is not compatible with Emscripten licensing. While my interpretation is that using the test suite does not require user to license the project it is used for under GPLv2 (similar to the way that using Linux or its GPLv2 tools does not require one to license whatever those tools are used for under GPLv2), but when developing the test suite, any changes to the test suite must be released under GPLv2. Because if this requirement, it is a big gray area to marry Emscripten's test suite tightly together with the Open POSIX test suite; or have a GPLv2 submodule in a repository that is not licensed under GPLv2. This was the reason why the original work was a maintained in a separate repository - keep all changes applied to it public and released under GPLv2, while retaining a strict boundary to what is Emcripten and what is the test suite. Hence we should probably not merge these test suites directly together; though this is great work here you are doing! |
@juj Yes, but in my understanding only changes to test suite (GPLv2) code itself. We could release any changes to Emscripten with current licensing model
@juj With submodule POSIX test suite is sill a separate repository, only change is that now emscripten repo has named reference to it. We are not bundling Open POSIX test suite with emscritpen. This is my understanding, unfortunately I'm not a lawyer, so as you mentioned it is grey area to mary these two repos. My first idea was to use code similar to emsdk to checkout testsuite directly form scripts running tests (see dcefb9c) |
How does the submodule work here - I think this is the first submodule we have. Do we need to get people (and our CI) to do submodule init etc. in order to run the tests? Or are these tests optional and not run by default? About the license issue, how about putting it under the |
Oh wait, it's already in |
I think this tests will be optional, so only those who want to run them would run the submodule init. |
Sounds fine to me. |
Is there any interest in rebasing this PR now that #12923 is merged? I had an attempt to rebase this, see: (It fixes a couple of test failures on the Open POSIX test suite, but it also seems to have introduced two regressions) |
@kleisauke Can you tell me what are the regressions? |
@abujalski Sorry for not clarifying that. It seems to cause test failures on these two test cases: I still need to double check whether this is actually caused by this PR. |
I've rebased this branch. However I need to test it and verify that it doesn't cause any regressions in tests. |
Oh no, I'm really sorry but I completely forgot about this PR and I my recent change duplicates some it : #12923. Hopefully its just the testsuite integration stuff that was duplicated and all the substantive changes here are still good. |
@abujalski Details$ EMTEST_VERBOSE=1 python3 tests/runner.py posixtest.test_pthread_setcanceltype_1_1 Before: -- begin program output --
Test PASSED
Error in pthread_mutex_lock()
: No error information
-- end program output --
ok After: -- begin program output --
Test FAILED: Cancel request timed out
-- end program output --
FAIL And the Details[Thread 0x0x7019f0] started and locked the mutex
[Thread 0x0x7019f0] is waiting for the cond
[Thread 0x0x901ce8] started and locked the mutex
[Thread 0x0x901ce8] is waiting for the cond
[Thread 0x0xb02000] started and locked the mutex
[Thread 0x0xb02000] is waiting for the cond
[Main thread] signals a condition
[Thread 0x0x7019f0] was wakened and acquired the mutex again
[Thread 0x0x7019f0] released the mutex
[Main thread] 1 waiters were wakened
Calling stub instead of sigaction()
[Main thread] signals to wake up the next thread
[Thread 0x0x901ce8] was wakened and acquired the mutex again
[Main thread] signals to wake up the next thread
[Thread 0x0x901ce8] was wakened and acquired the mutex again
[Main thread] signals to wake up the next thread
[Main thread] signals to wake up the next thread
[Thread 0x0x901ce8] released the mutex
main thead called exit: noExitRuntime=undefined
[Thread 0x0xb02000[Main thread] had to signal the condition @sbc100 Most of the tests that were disabled were only failing on Node, see commit kleisauke@6d62b9c. Do you have any idea why this works without problems in the browser? |
I think its probably the differences between web workers and node workers. They are surprisingly different IIUC. You should mind submitting that as a PR? |
// cp suffix in the function name means "cancellation point", so this wait can be cancelled | ||
// by the users unless current threads cancelability is set to PTHREAD_CANCEL_DISABLE | ||
// which may be either done by the user of __timedwait() function. | ||
if (is_main_thread || pthread_self()->canceldisable != PTHREAD_CANCEL_DISABLE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is causing 'test_pthread_setcanceltype_1_1
' test to fail.
Call chain:
pthread_mutex_lock --> __timedwait() --> __pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs);
In test there are following flags set:
pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
It looks that this condition has to be modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated test in 0eb64ed
I just submitted PR #12963 which should fix this. |
Hi @kleisauke, Sorry for late reply, but I've missed second problem :(
This was regression, fixed. See #10524 (comment)
Indeed it looks flaky. But what is bit odd is that this tests failed once when I run whole test suite, but when I've run just this particular test it passes. I've looped this particular tests for 1000 times and each one of them passed. I'll investigate this further. |
I can confirm that commit 0eb64ed fixes that, thanks.
I've observed the same behavior of that test, it's really curious. Note that this test failure does not seem to be caused by this PR, as it appears to be happening on the master branch as well, see: Thanks for investigating this further! |
I've checked that This test awaits for some action to happen in side threads. However time threshold is quite low and it looks that when there are other tests/tasks running in background side threads didn't get CPU on time. There can be two possible ways to fix this issue:
Can you tell me which one is preferred? |
Sounds like the test is flaky. I would mark it as such (with @disabled(<link_to_bug>) or something like that) and we we can consider fixing the test (to be less timing dependent) upstream if we think its important enough (or if we think we lack coverage of this features via other tests). |
Done in separate PR #13259 |
30db8fc
to
c7879b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice fixes!
96160f9
to
d86abbe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm with better PR title and description.
// Lock mutex to ensure that thread is waiting | ||
pthread_mutex_lock(&mutex); | ||
|
||
EM_ASM(out('Canceling thread..');); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just use printf or puts instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
printf() and puts() locks stdout file. I wanted to avoid such locks here, that's why I've used EM_ASM here.
Replacing it with emscripten_log(EM_LOG_CONSOLE, ...)
will be ok?
REPORT_RESULT(res); | ||
#endif | ||
return res != 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious how if this new test is testing similar things to the newly enabled posixtestsuite tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually yes, this test is supposed to check similar scenario to one tested in pthread_cond_wait_2_3 test. Although in this test scenario itself is simplified compared to test from posixtestsuite. Main purpose of this test was to check cancelling thread waiting on cv scenario without adding posixtestsuite.
I don't know if posixtestsuite is run by CI but if so then this test can be removed. Do you think it is better to keep this tests or remove it as redundant?
@@ -8080,7 +8080,6 @@ def test_pthread_c11_threads(self): | |||
self.set_setting('INITIAL_MEMORY', '64mb') | |||
self.do_run_in_out_file_test('tests', 'pthread', 'test_pthread_c11_threads.c') | |||
|
|||
@no_asan('flakey errors that must be fixed, https://github.com/emscripten-core/emscripten/issues/12985') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the PR title and description including specifically how this change address this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure no problem. I've updated them. Is my description clear and understandable?
d86abbe
to
2cc5955
Compare
2cc5955
to
32bd7f0
Compare
32bd7f0
to
e21c1d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
@sbc100 had some outstanding questions about the test suite, not sure if those were resolved to his liking, so letting him hit merge if so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm with a couple more questions/nits
I'm ok with keeping the (possibly redundant) test.
src/library_pthread.js
Outdated
@@ -186,19 +204,8 @@ var LibraryPThread = { | |||
Atomics.store(HEAPU32, (tb + {{{ C_STRUCTS.pthread.profilerBlock }}} ) >> 2, 0); | |||
_free(profilerBlock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason this cleanup of profilerBlock happens for threadExit but not for threadCancel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the only reason for that was fact that in original code it is not called in threadCancel.
Fixed - moved to runExitHandlersAndDeinitThread
function.
@@ -494,9 +495,9 @@ var LibraryPThread = { | |||
$cleanupThread: function(pthread_ptr) { | |||
if (ENVIRONMENT_IS_PTHREAD) throw 'Internal Error! cleanupThread() can only ever be called from main application thread!'; | |||
if (!pthread_ptr) throw 'Internal Error! Null pthread_ptr in cleanupThread!'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines should probably be in #if ASSERTIONS
shouldn't that? Although unrelated to this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to tell that, as $killThread
, $cancelThread
and $spawnThread
functions has similar checks not guarded by #if ASSERTIONS
.
#ifdef __EMSCRIPTEN__ | ||
emscripten_futex_wait(&inst->finished, 1, INFINITY); | ||
int is_main_thread = emscripten_is_main_runtime_thread(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this not be emscripten_is_main_browser_thread
. IIUC its only the browser thread that can't block. This would also seem to match __timedwait.c.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remembered a comment about emscripten_is_main_browser_thread()
versus emscripten_is_main_runtime_thread()
, see: #12963 (comment).
Therefore shouldn't the __timedwait_cp
function instead check for emscripten_is_main_runtime_thread()
since that function is a cancellation point?
(note that emscripten_is_main_browser_thread()
also returns true
on non-worker threads on Node, I attempted to change this but unfortunately it didn't work out)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it seems that both pthread_barrier_wait.c
and __timedwait.c
should use the same thing. Having them differ seems bad. Perhaps land this change with emscripten_is_main_browser_thread
and then we can consider a separate PR to change them both to emscripten_is_main_runtime_thread
as a followup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good, catch. Fixed.
e21c1d0
to
610f958
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! The details in that comment sounds like there are some future improvments we could make here but this looks good.
This patch fixes thread cancellation/exit method and also unifies why how thread structures are updated when thread is canceled or exits. Following problems are addressed: 1. heap-use-after-free on std::thread emscripten-core#12985 Function `cleanupThread()` from `src/library_pthread.js` unconditionally set `pthread.self` member of pthread structure to 0. Problem was that `cleanupThread()` function might be called after that pthread structure has been deleted. To fix this problem, setting `pthread.self` field is now guarded by check if thread data hasn't been already freed. 2. pthread_cond_wait/2-3 test hang - Disabling recursive thread cancellation. - Allowing __timedwait_cp to be true cancellation point as _cp suffix suggests 3. pthread_getschedparam/1-3 test hangs sometimes: In pthread_barrier_wait adding check if lock is held by main thread and waiting on futex in small slices of time there, to check if there is some work to do on behalf of Worker Threads. Signed-off-by: Adam Bujalski <[email protected]>
610f958
to
a342a1d
Compare
This PR aim at fixing some pthread implementation defects (in JavaScript glue layer) discovered when running POSIX tests from http://posixtest.sourceforge.net/ compiled to WebAssembly. It is done by fixes thread cancellation/exit methods and also unifying way how thread structures are updated when thread is canceled or exits.
Following problems are addressed:
Function
cleanupThread()
fromsrc/library_pthread.js
unconditionally set
pthread.self
member of pthread structure to 0.Problem was that
cleanupThread()
function might be called afterthat pthread structure has been deleted. To fix this problem, setting
pthread.self
field is now guarded by check if thread data hasn'tbeen already freed.
suffix suggests
In pthread_barrier_wait adding check if lock is held by main thread
and waiting on futex in small slices of time there, to check if
there is some work to do on behalf of Worker Threads.