Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't poison the Mutex when panicking inside Condvar::wait #58461

Closed
wants to merge 1 commit into from

Conversation

jethrogb
Copy link
Contributor

@jethrogb jethrogb commented Feb 14, 2019

Currently, when a panic occurs while a Condvar is blocking on a wait, this will result in the Mutex being poisoned, even though the thread that was waiting didn't hold the lock at that time. Various platforms may trigger a panic while a thread is blocked, due to an error condition.

This PR changes the Condvar wait logic so that there is no poison guard on the stack during the wait. It also add a test for Condvar::wait_timeout. It can be tricky to trigger a panic while a thread is blocked. Here I've shimmed pthread_cond_timedwait on Unix systems to trigger an assertion failure in the sys code.

This is similar to but not quite the same as #58042. That issue triggers during any kind of unwinding, this issue only occurs during an actual Rust panic. Therefore, the test can't use pthread_cancel.

r? @alexcrichton

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 14, 2019
@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:092c9bf2:start=1550141531495449001,finish=1550141532873738292,duration=1378289291
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
Setting environment variables from .travis.yml
$ export IMAGE=x86_64-gnu-llvm-6.0
---

[00:12:55] travis_fold:start:tidy
travis_time:start:tidy
tidy check
[00:12:55] tidy error: /checkout/src/test/run-pass/condvar-wait-panic-poison.rs:6: trailing whitespace
[00:12:55] tidy error: /checkout/src/test/run-pass/condvar-wait-panic-poison.rs:7: trailing whitespace
[00:12:55] tidy error: /checkout/src/test/run-pass/condvar-wait-panic-poison.rs:8: trailing whitespace
[00:12:56] some tidy checks failed
[00:12:56] 
[00:12:56] 
[00:12:56] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
[00:12:56] 
[00:12:56] 
[00:12:56] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
[00:12:56] Build completed unsuccessfully in 0:00:44
[00:12:56] Build completed unsuccessfully in 0:00:44
[00:12:56] make: *** [tidy] Error 1
[00:12:56] Makefile:68: recipe for target 'tidy' failed
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:17be66c2
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Thu Feb 14 11:05:20 UTC 2019
---
travis_time:end:29e24460:start=1550142321936480625,finish=1550142321941182802,duration=4702177
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:004a2517
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:0f2ad44b
travis_time:start:0f2ad44b
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:0951404d
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@jethrogb jethrogb force-pushed the jb/condvar-panic-poison branch from 4bdb5d3 to 5ebf976 Compare February 14, 2019 11:10
@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:01bcf512:start=1550142738220470889,finish=1550142739101137347,duration=880666458
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
Setting environment variables from .travis.yml
$ export IMAGE=x86_64-gnu-llvm-6.0
---
travis_time:start:test_debuginfo
Check compiletest suite=debuginfo mode=debuginfo-both (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
[01:14:27] 
[01:14:27] running 119 tests
[01:14:52] .iiiii...i.....i..i...i..i.i..i.ii...i.....i..i....i..........iiii..........i...ii...i.......ii.i.i. 100/119
[01:14:56] i......iii.i.....ii
[01:14:56] 
[01:14:56]  finished in 28.721
[01:14:56] travis_fold:end:test_debuginfo

---
[01:32:26] .................................................................................................... 400/763
[01:32:27] thread '<unnamed>' panicked at 'attempted to use a condition variable with two mutexes', src/libstd/sync/condvar.rs:578:18
[01:32:27] thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
[01:32:27]   left: `16`,
[01:32:27]  right: `0`', src/libstd/sys/unix/mutex.rs:72:9
[01:32:27] thread panicked while panicking. aborting.
[01:32:27] error: process didn't exit successfully: `/checkout/obj/build/x86_64-unknown-linux-gnu/stage1-std/x86_64-unknown-linux-gnu/release/deps/std-5a5109574d7166f0 --quiet` (signal: 4, SIGILL: illegal instruction)
[01:32:27] 
[01:32:27] 
[01:32:27] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "test" "--target" "x86_64-unknown-linux-gnu" "-j" "4" "--release" "--locked" "--color" "always" "--features" "panic-unwind backtrace" "--manifest-path" "/checkout/src/libstd/Cargo.toml" "-p" "std" "--" "--quiet"
[01:32:27] 
[01:32:27] 
[01:32:27] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test
[01:32:27] Build completed unsuccessfully in 0:29:32
[01:32:27] Build completed unsuccessfully in 0:29:32
[01:32:27] make: *** [check] Error 1
[01:32:27] Makefile:48: recipe for target 'check' failed
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:003c3f7b
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Thu Feb 14 12:44:55 UTC 2019

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@jethrogb
Copy link
Contributor Author

The test failure is because the MutexGuard is gone, so during the unwind the unlock isn't happening before the Mutex is dropped. I will modify the PR to fix.

@jethrogb jethrogb force-pushed the jb/condvar-panic-poison branch from 5ebf976 to dd20f16 Compare February 14, 2019 17:34
@alexcrichton
Copy link
Member

As I mentioned on the other issue, Condvar::wait should never panic and if it does so then it's a bug in the standard library. The fix here I think that should be applied is retrying on EINTR and fixing any panicking conditions in other implementations.

@jethrogb
Copy link
Contributor Author

Note to triage: I will return to this PR March 4.

@Centril Centril added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 23, 2019
@bors
Copy link
Contributor

bors commented Feb 28, 2019

☔ The latest upstream changes (presumably #58208) made this pull request unmergeable. Please resolve the merge conflicts.

@jethrogb
Copy link
Contributor Author

jethrogb commented Mar 4, 2019

#58042 (comment)

@Amanieu
Copy link
Member

Amanieu commented Mar 4, 2019

I agree with @alexcrichton here, any panic in Condvar::wait (and other similar functions) should be considered either a bug in Rust or a bug in the underlying OS. In both cases I think that crashing or deadlocking is an acceptable outcome.

@jethrogb
Copy link
Contributor Author

jethrogb commented Mar 4, 2019

Why is deadlocking an acceptable outcome for a bug in Rust, if we know about the bug and can fix it? The outcome here is neither of those btw.

@Amanieu
Copy link
Member

Amanieu commented Mar 4, 2019

Can you explain what is causing the panic in the first place?

@jethrogb
Copy link
Contributor Author

jethrogb commented Mar 4, 2019

Doesn't matter, it can be anything. If in Condvar::wait/wait_timeout, any panic occurs in self.verify() or self.inner.wait/wait_timeout(), then the poison state of the Mutex becomes invalid. The current test triggers a panic at https://github.com/rust-lang/rust/blob/93b6d9e/src/libstd/sys/unix/condvar.rs#L102, but as I stated in #58042 (comment) I'm sure I can find another unwrap/assert! somewhere in sync/ or sys/ that, when failing, would lead to the same behavior.

@jethrogb
Copy link
Contributor Author

Here's another one. If there's a panic somewhere in park_timeout, you see three different panics here:

use std::{panic, thread, time::Duration};

fn main() {
    let _ = panic::catch_unwind(|| {
        thread::park_timeout(Duration::from_millis(10));
    }); // original panic
    let _ = panic::catch_unwind(|| {
        thread::park_timeout(Duration::from_millis(10));
    }); // “inconsistent park_timeout state”
    let _ = panic::catch_unwind(|| {
        thread::park_timeout(Duration::from_millis(10));
    }); // “PoisonError” (from `thread.inner.lock`)
}

@jethrogb
Copy link
Contributor Author

Please continue discussion in #58042

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants