
JoinGuard::join returning an Err is **really** unsafe #20807

Closed
tomaka opened this issue Jan 9, 2015 · 14 comments · Fixed by #22435
Comments

@tomaka
Contributor

tomaka commented Jan 9, 2015

Exception safety is a very well-known problem in C++.

Consider this:

#include <iostream>

struct Foo {
    int* ptr;
};

void modify(Foo& foo) {
    delete foo.ptr;
    foo.ptr = new int;   // if this throws, foo.ptr still points to freed memory
}

void error_proof_modify(Foo& foo) {
    try {
        modify(foo);
    } catch (...) {
        std::cerr << "Error, but let's continue!";
    }
}

If the call to new throws an exception, the error gets caught and execution continues. However, the Foo object is now in an invalid state, with its pointer pointing to memory that has already been freed.

Before #20615 and before the change in Send, this was not possible in Rust, because when a thread panics, all objects local to that thread become inaccessible. It's also the reason why mutexes are poisoned in case of a panic.
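
For illustration, here is a minimal sketch of how poisoning surfaces a panic. It uses today's std::sync::Mutex and std::thread API rather than the pre-1.0 API under discussion:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let data = Arc::new(Mutex::new(0));
    let data2 = data.clone();
    let handle = thread::spawn(move || {
        let mut guard = data2.lock().unwrap();
        *guard = 1;
        panic!("die while holding the lock");
    });
    let _ = handle.join(); // the child panicked

    // The mutex is now poisoned: lock() returns Err(PoisonError), so the
    // parent cannot silently observe the half-updated value.
    assert!(data.lock().is_err());
}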

But now that it's possible to send local variables to other threads, and to ignore when that other thread panics, situations like this can absolutely happen.

Let's take this code for example:

pub struct Foo {
    val1: int,
    val2: int,
    val3: int,
    calculation_result: int,    // must always be equal to val1+val2+val3
}

impl Foo {
    pub fn set_val1(&mut self, value: int) {
        self.val1 = value;
        self.update_calculation();
    }
    pub fn set_val2(&mut self, value: int) {
        self.val2 = value;
        self.update_calculation();
    }
    pub fn set_val3(&mut self, value: int) {
        self.val3 = value;
        self.update_calculation();
    }

    fn update_calculation(&mut self) {
        if self.val1 == 127 { panic!("for the sake of this example, we panic") };
        self.calculation_result = self.val1 + self.val2 + self.val3;
    }
}

Looks safe, right?

But what if you use that "ignore panics" trick?

let mut foo: Foo = ...;  // whatever

Thread::scoped(|| {
    foo.set_val1(127);
}).join();   // ignoring the panic

use_foo(&foo);   // continue using `foo`

At the end of this code, you have a Foo object in an invalid state, because calculation_result is no longer equal to val1 + val2 + val3 as it should be. Continuing to use the Foo object in this invalid state could lead to weird, hard-to-debug results (depending on the rest of the code).

With the latest reforms in std::thread, library writers really have to take exception safety (or "panic safety") into account by avoiding the kind of code written above. You may argue that it's just the fault of the person who wrote the Foo struct. The problem is that exception safety is hard. Too hard to get right.

In my opinion, JoinGuard::join should just panic if the underlying thread panicked, without leaving the choice to the user. The only situation in which a try_join function (or another name) is safe is for JoinGuard<'static>.

cc @aturon

@sfackler
Member

sfackler commented Jan 9, 2015

This is already something you have to worry about, since destructors would be able to look at foo during unwinding in the thread as well.

@jfager
Contributor

jfager commented Jan 9, 2015

These examples are different classes of bugs: the C++ code shows a critical memory safety issue, while the Rust code shows a logic error that doesn't fall within the realm of 'safety' in the sense that Rust means.

How is the Rust example so different from the obviously "just a bug" behavior of:

fn update_calculation(&mut self) {
    if self.val1 == 127 { return; };
    self.calculation_result = self.val1 + self.val2 + self.val3;
}

?

@pythonesque
Contributor

This is a really good point that I overlooked in the API. I agree with you, tomaka.

@tomaka
Contributor Author

tomaka commented Jan 9, 2015

How is the rust example so different from the obviously "just a bug" behavior of:

In my example I wrote if self.val1 == 127 { panic!() }, but in real code the things that can panic look like unwrap(), my_vec[i], etc., which you use when you know you can't recover from them. That's why introducing the possibility of recovering from a panic is a bad idea.

Here is a better example. This code below looks reasonable but has exception-safety related issues:

fn update_calculation(&mut self) {
    self.calculation_result = self.val1.checked_add(self.val2).unwrap().checked_add(self.val3).unwrap();
}

pub fn set_val1(&mut self, value: int) {
    self.val1 = value;
    self.update_calculation();
}

If you want to "fix" this code, you would have to write:

fn update_calculation(&mut self) -> Result<(), ()> {
    let tmp1 = match self.val1.checked_add(self.val2) {
        Some(v) => v,
        None => return Err(())
    };

    let tmp2 = match tmp1.checked_add(self.val3) {
        Some(v) => v,
        None => return Err(())
    };

    self.calculation_result = tmp2;
    Ok(())
}

pub fn set_val1(&mut self, value: int) {
    let previous_value = self.val1;
    self.val1 = value;
    if let Err(_) = self.update_calculation() {
        self.val1 = previous_value;
        panic!();
    }
}

This is why exception safety is hard. And I'd like to emphasize that this is just a trivial case.

@pythonesque
Contributor

@sfackler Sure, it's easy to replicate in a single thread:

#[derive(Show)]
struct Foo { foo: u8 }

impl Drop for Foo {
    fn drop(&mut self) {
        // Runs during unwinding and observes `foo` even though the invariant
        // below was never established, because `foo()` panicked first.
        println!("Foo: {:?}", self.foo);
    }
}

/// Invariant: *x == 2 after this is called.
fn foo(x: &mut u8) {
    if *x == 0 {
        panic!("oops");
    }
    *x = 2;
}

fn main() {
    let mut x = Foo { foo: 0 };
    foo(&mut x.foo);
}

The difference is that you only have to worry about exception-safety violations in destructors. Destructors already have to deal with this if you don't abort on panic; that's fundamental to unwinding. The problem tomaka is pointing out is that this change forces non-destructor code to worry about it as well.

The other key difference with destructors is that destructors can make no guarantees about where they are called (for the most part) so they can't really rely on functions being run after the type is instantiated. In this case, the guarantee foo made is not one the destructor could assume anyway. However, straightforward regular code does need to be able to assume that function calls fulfill their promises (often encoded in the type signature). If it can't, that's where you get exception safety violations.

Personally I think making .join() panic unless you use 'static is a great idea. It is more consistent with how I view bounded lifetime threads--they should for the most part work like function calls that just happen to be synchronous. When a regular function call panics, you know for sure that all code is going to get unwound. This would encourage people to use Results for bounded lifetime thread return values, just like normal functions that can fail.
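
For illustration only: the std::thread::scope API that was stabilized much later behaves essentially as suggested here, with a scoped child acting like a synchronous call whose recoverable failures travel through an ordinary Result return value while a panic in the child propagates to the parent. A minimal sketch:

use std::num::ParseIntError;
use std::thread;

fn parse_all(lines: &[&str]) -> Result<Vec<i64>, ParseIntError> {
    thread::scope(|s| {
        // Recoverable failure flows through the closure's Result return value.
        let worker = s.spawn(|| lines.iter().map(|l| l.parse::<i64>()).collect());
        // A genuine panic in the child is propagated to the parent instead.
        worker.join().expect("worker thread panicked")
    })
}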

@tomaka
Contributor Author

tomaka commented Jan 9, 2015

Destructors are usually either really low-level or don't require the object to be in a valid state, so it's a far smaller problem.

@pythonesque
Contributor

By the way, this is already sometimes a problem for existing threads in non-destructor code, unless we abort on panic. If you use atomics, for example, which have no poison value, it's very easy to get exception-safety violations by not coding transactionally. So this is also an argument for abort-on-panic being the default (among other things), but that is a larger discussion. I think at least for now, making non-destructors worry about exception safety in cases like this is not a good idea.
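
A minimal sketch of that failure mode (hypothetical counters, written against today's std::sync::atomic API): a panic between two related atomic updates leaves them inconsistent, and nothing marks the state as poisoned.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Two counters that are supposed to stay equal.
static STARTED: AtomicUsize = AtomicUsize::new(0);
static FINISHED: AtomicUsize = AtomicUsize::new(0);

fn process(item: Option<u32>) {
    STARTED.fetch_add(1, Ordering::SeqCst);
    let _value = item.unwrap();               // may panic here...
    FINISHED.fetch_add(1, Ordering::SeqCst);  // ...so this may never run
}

fn main() {
    let _ = thread::spawn(|| process(None)).join(); // ignore the child's panic
    // No poisoning for atomics: the inconsistent state is silently observable.
    assert_ne!(STARTED.load(Ordering::SeqCst), FINISHED.load(Ordering::SeqCst));
}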

On another note, if people use tricks like this to make sure their own or third-party code can't panic, rather than because they don't want to deal with it, that is a good sign that we need a noexcept.

@aturon
Member

aturon commented Feb 13, 2015

Sorry for taking so long to write in explicitly on this thread, though I've been mulling over the API issue for a while.

I think that @tomaka and @pythonesque are right in that in general, we make it somewhat difficult (though by no means impossible) to run into exception safety problems in "normal" (especially safe) code. You have to use RefCell, or dig out a value from PoisonError in a Mutex, in order to observe problems outside of destructors.

The current scoped API, however, risks changing this equation a bit in that "all" you have to do is throw away a Result to start observing potentially bad data:

let _ = Thread::scoped(f).join();
// read some data that f had &mut access to

The question is: is the #[must_use] attribute on Result (together with the fact that you have to at least explicitly throw it away, as above) enough of a warning here? Note that we do allow you to make similar observations with mutexes, but you have to dig the data out.
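
For reference, "digging it out" of a poisoned Mutex looks roughly like the sketch below (a hypothetical helper using today's PoisonError::into_inner): the error still carries the guard, but you must ask for it explicitly.

use std::sync::Mutex;

fn read_len_anyway(m: &Mutex<Vec<u8>>) -> usize {
    // Recover the guard from the PoisonError and look at the data anyway.
    let guard = m.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.len()
}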

I'm not really sure. I feel like the Result is probably enough, personally, but we could also consider splitting up join into a panicking version and a join_static version that has today's signature, for example. I'd appreciate other API sketches if people have more ideas.

I also played with a much crazier version, but I think it's far too "clever" to be of use.

@pythonesque

If you use atomics, for example, which have no poison value, it's very easy to get exception safety violations by not coding transactionally.

This I disagree with: if you're using atomics, you're coding transactionally almost by definition: since concurrent threads can observe the state after every atomic update, each update must uphold all needed invariants. This is why lock-free data structures are so fun to write :)

@tomaka
Contributor Author

tomaka commented Feb 14, 2015

I have never thought of #[must_use] as a "must handle". There are several situations where you don't care whether the function call succeeded and where you can legitimately throw away the Result. For example, writing a "goodbye" message just before closing a socket, or sending the result of a calculation from a background thread over a channel.
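
The socket case looks roughly like this (a hypothetical helper, shown only to illustrate a legitimately discarded Result):

use std::io::Write;
use std::net::TcpStream;

fn say_goodbye(mut stream: TcpStream) {
    // The connection is being closed no matter whether this write succeeds,
    // so the Result is deliberately discarded.
    let _ = stream.write_all(b"goodbye\n");
    // `stream` is dropped (and the socket closed) here.
}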

After some thought, for me the API should ideally look like this:

  • Thread::scoped returns a ScopedJoinGuard whose join method panics if the child thread panicked.
  • Thread::spawn returns a JoinGuard (or another name) whose join method returns a Result.

After all, I can see three use cases:

  • You want to spawn a background thread, in which case you'd call Thread::spawn() then detach (instead of just Thread::spawn() today). For me this is the rarest case.
  • You want to spawn a thread but have some kind of ownership over it, in which case you'd call Thread::spawn() (today you'd call Thread::scoped()).
  • You want to spawn a thread that manipulates your local variables, in which case you'd call Thread::scoped().

I don't know whether that will pass, since I think the API used to look somewhat like this, and since it's a bigger change than a simple fix for a safety problem.

Other important problem: the destructor of JoinGuard simply joins the thread and ignores the result. This should be fixed too. I don't know what the state of linear types is, but this would be a good place to use them. Alternatively, in my opinion, the destructor should panic unless the current thread is already unwinding.

or dig out a value from PoisonError in a Mutex

Ouch, I didn't know about this one.

I'd be in favor of making into_guard unsafe. I don't see what a legitimate use case of it would be outside of unsafe code.

I guess that this whole discussion is supposed to be in the "threads RFC" instead of an issue. Unfortunately like many people (I think) I don't have the time to read RFCs, and I spot problems once things are implemented.

@pythonesque
Contributor

@tomaka I'm okay with the poison stuff because you really have to explicitly abandon exception safety in order to use the Mutex that way. My concern is that there is nothing stopping you from just ignoring the value of the returned Thread except for #[must_use], so the path of least resistance may be giving up exception safety (especially since in most cases, Rust programmers are being correctly trained to use try! rather than unwrap()). Unfortunately I don't really see a way to make it much easier to do the right thing without real linear types (so you could actually force the Result of joining a thread to be handled explicitly) or just panicking (the bomb idea is really interesting though... I could actually be persuaded to like that, but it might be hard to get through an RFC).

@bluss
Member

bluss commented Feb 16, 2015

I thought you would get better traction if you demonstrated real memory unsafety in safe Rust code. The following is a paper construction, but it should work:

struct Panicker<T>(T);
// Implement PartialOrd, Ord on Panicker so that it panics when compared with `<=`.

let mut heap = BinaryHeap::new();
let guard = Thread::scoped(|| {
    heap.push(Panicker(Vec::new()));
    heap.push(Panicker(Vec::new()));
});
guard.join().ok();

// Access the heap's first element here.
let zeroed_struct = heap.into_iter().next();

// BinaryHeap's sift_up will now panic on the first comparison, but not until after having replaced
// its first element with `zeroed()`.
// So we can create a zeroed-out version of any struct we want.
// This breaks many invariants:
// 1) A zeroed Vec means that Some(v) is the same as None, because the vec's ptr is marked NonZero.
// 2) We can zero a BTreeMap which will result in a btreemap with 0 for `b` and null for the
// root node, which will lead to an immediate violation since the insert/lookup
// methods assume root is not null.

I've come back to the zeroing code in BinaryHeap before, and it's an easy target (that will eventually go away), but I'm sure there are other exploitable locations (in the sense of this bug) in libstd if we look for them.

Edit: It's actually the second element of the BinaryHeap that's zeroed.
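
For completeness, a minimal sketch of the panicking comparator described in the comment above (the Ord implementations themselves are straightforward; whether the zeroed() exploit still applies depends on the 2015-era BinaryHeap internals):

use std::cmp::Ordering;

struct Panicker<T>(T);

impl<T> PartialEq for Panicker<T> {
    fn eq(&self, _: &Self) -> bool { panic!("compared") }
}
impl<T> Eq for Panicker<T> {}
impl<T> PartialOrd for Panicker<T> {
    fn partial_cmp(&self, _: &Self) -> Option<Ordering> { panic!("compared") }
}
impl<T> Ord for Panicker<T> {
    fn cmp(&self, _: &Self) -> Ordering { panic!("compared") }
}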

@nikomatsakis
Contributor

I've been thinking about this over the weekend.

My feeling is roughly that exception safety is a fact of life, but we should do what we can to minimize how much code has to deal with it. I've always been a fan of "unwind the whole thread" for this reason, and when I envisioned "data-parallel" helpers in the past -- basically a more efficient and convenient version of Thread::scoped -- I always assumed that they would propagate panics unconditionally, just as would occur in sequential code.

That said, even if we did unconditionally propagate a panic, it doesn't obviate the need for exception safety! The need for exception safety is inherent to error recovery -- the more flexible your error recovery, the more fine-grained your problem. For example, even if we did NO unwinding at all, we'd still have to worry about shared memory between processes. The problem that @bluss highlights, for example, can also occur if you have a destructor that owns the BinaryHeap and which tries to use it during unwinding:

...
struct Something {
    heap: BinaryHeap<Panicker>
}

impl Drop for Something {
    fn drop(&mut self) {
        ... self.heap.iter().next() ...
    }
}

let mut heap = Something { heap: BinaryHeap::new() };
heap.heap.push(Panicker { ... });

I guess that the main difference is that using Thread::scoped one can fairly easily encounter such bugs outside of a destructor. That is, Thread::scoped is a lot like a more expensive "try/catch", and by allowing the task boundary to be permeated, it tends to expose some of the problems that "try/catch" implies.

I am sympathetic towards @aturon's suggestion (modulo naming) that we have a version of join which works for any thread, but which unconditionally propagates panics, and a version of join that works only for 'static threads, but which permits recovery (and perhaps an unsafe version of join that permits recovery in all situations?). But this starts to be quite the zoo of join functions! And given that all of the bugs we're talking about are bugs that have to be fixed anyway, I am not sure it's worth it.

I guess that the main advantages I see to having all of these functions are that they would avoid the footgun of ignoring the result and thus help mitigate problems due to exception safety in end-user code (no guarantees are provided). Certainly libstd will still need careful vetting and fuzzing and so forth to catch these sorts of bugs, because (as I described above) the problems with exception safety still exist, even if the range of code that can expose them is smaller. Those advantages may be worth the price of having many distinct join functions (or perhaps there's another, more ergonomic way to present this?).

@aturon
Member

aturon commented Feb 16, 2015

Thanks everyone for the additional comments!

The basic point, which has been articulated variously in the above comments, is that exception safety is fundamentally a concern when programming in Rust. However, with careful design of things like mutex poisoning, it's a concern that can largely be relegated to unsafe implementations (like BinaryHeap) and destructor code, making it a relatively narrow concern.

@nikomatsakis's last comment made clear to me, moreover, that the current scoped API is basically tantamount to adding a general catch facility, which we've long stayed away from for exactly the reasons above: we want exception safety to be a thing that most Rust code doesn't have to worry much about.

I will mull over the cleanest/most ergonomic way of tightening up the API -- probably something along the lines of @tomaka's suggestion.

@tomaka

Other important problem: the destructor of JoinGuard simply joins the thread and ignores the result. This should be fixed too. I don't know what the state of linear types is, but that would be a good situation to use them. Alternatively, in my opinion panic in the destructor unless the current thread is unwinding.

The problem is that we fundamentally cannot allow the thread to continue past that point in the stack, whether through unwinding or otherwise, because the child thread could continue accessing the data! That has clear memory safety problems. So the only other option would be to abort the process.

We thought pretty hard about these join-on-drop semantics, since they are surprising, and carefully chose the name "scoped" to try to clearly communicate the lifetime concerns.

@aturon
Member

aturon commented Feb 17, 2015

A WIP PR has been posted, addressing the above concerns together with some other API requests that have accumulated. I'd appreciate any comments on the PR.

aturon added a commit to aturon/rust that referenced this issue Feb 17, 2015
This commit makes several changes to `std::thread` in preparation for
final stabilization:

* It removes the ability to handle panics from `scoped` children; see
  rust-lang#20807 for discussion

* It adds a `JoinHandle` structure, now returned from `spawn`, which
  makes it possible to join on children that do not share data from
  their parent's stack. The child is automatically detached when the
  handle is dropped, and the handle cannot be copied due to Posix
  semantics.

* It moves all static methods from `std::thread::Thread` to free
  functions in `std::thread`. This was done in part because, due to the
  above changes, there are effectively no direct `Thread` constructors,
  and the static methods have tended to feel a bit awkward.

* Adds an `io::Result` around the `Builder` methods `scoped` and
  `spawn`, making it possible to handle OS errors when creating
  threads. The convenience free functions entail an unwrap.

* Stabilizes the entire module. Despite the fact that the API is
  changing somewhat here, this is part of a long period of baking and
  the changes are addressing all known issues prior to alpha2. If
  absolutely necessary, further breaking changes can be made prior to beta.

Closes rust-lang#20807

[breaking-change]