Skip to content
This repository has been archived by the owner on Apr 5, 2024. It is now read-only.

Is it undefined behavior to hold an invalid reference, if it is never dereferenced. #12

Closed
mystor opened this issue Aug 19, 2016 · 45 comments

Comments

@mystor
Copy link

mystor commented Aug 19, 2016

It is not Safe in rust code to create invalid references, in that these references can be used by safe code to trigger Undefined Behavior, however, is it Undefined Behavior to create one of these references?

For example, does this code trigger Undefined Behavior?

struct Foo {...}
let _x: &const Foo = &*(1024 as *const Foo);

How about this code, which gets the address of the member, but does no reading?

struct Foo {
    member: i32
}
let _x: &i32 = &(*(1024 as *const Foo)).member;

How about this code, which creates an invalid reference by over-extending the lifetime, but does not follow it?

struct Foo {...}
let _x: &'static Foo;
{
    let y = Foo {...};
    _x = mem::transmute(&y);
}

Or this code, which creates a reference with the wrong lifetime, but only follows it while the object behind it is still alive?

struct Foo {
    member: i32
}
let y: &'static Foo;
{
    let x = Foo {...};
    y = mem::transmute(&x);
    println!("y.member = {}", y.member);
}
@strega-nil
Copy link

I believe the second two are equivalent.

The real question is, are these different:

let x: &'static Foo;
let foo = Foo {...};
x = unsafe { &*(&foo as *const _) };

// x is never alive while foo is not, *but* the lifetimes are a lie

I'd argue your examples are all the same actually; invalid references. References which either point to objects that don't exist, or which point to freed objects.

I'm not sure if this should be UB, although I lean to yes, but I don't like accessing it while the reference is alive to change it. Also, the first three should, in my opinion, at least be the same, even if we do decide to look at accessing things for UB.

@mystor
Copy link
Author

mystor commented Aug 19, 2016

I lean towards none of these being Undefined Behavior. I feel like in rust, we want to make sure that when people write unsafe code which they would expect to work in other languages, that code works in rust as well.

Since references act a lot like pointers, it seems reasonable to assume (from the POV of an unsafe rust programmer coming from other languages like C/C++) that they act like C++ pointers - that is that invalid references are invalid, and thus cannot be followed, but that they don't trigger any UB unless they are dereferenced. If we want to make our memory model as intuitive as possible for people, I think we should make sure that this type of behavior is not UB.

I think it's fairly likely that a decent number of existing crates depend on invalid references not triggering UB in order to work correctly, and it would be a shame to break them, or make them trigger UB, when we can avoid it at (what I would imagine is) almost no performance cost. I am not an expert in this area, but I'm not sure what optimizations we could gain by assuming that these references cannot exist in codegen.

This also fits into the goal of making unsafe rust nicer to write than many other languages (such as C) by reducing the number of pitfalls which can screw you up and trigger UB I understand that unsafe rust will always be hard due to the need to maintain rust's invariants (which include things like not having invalid references) at the safety boundary, but allowing people to break some of these invariant without necessarily triggering UB behind the safety privacy boundary seems reasonable, and like it will make more reliable programs. Many people complain about how C has too much Undefined Behavior, which is too easy to trip over, and I don't want Rust to make the same mistake without good reasons.

@strega-nil
Copy link

@mystor I'd argue that Rust references are far more like C++ references; &T -> T const&, &UnsafeCell<T> -> T&, &mut T -> T&&.

@eternaleye
Copy link

@mystor: @ubsan: But unlike C++ references, Rust references with lifetimes greater than their referents are (or IMO should be) illegal/UB because they allow safe code to cause UB, trivially (or accidentally, simply by autoderef), and with no way for the safe code to tell that is a risk.

IMO, references are something of the canonical example of "you should never even construct an invalid/mistyped/mis-lifetimed instance" - largely because of how incredibly easy it is for an invalid lifetime to leak out of unsafe {} or even broader boundaries (such as the module).

@strega-nil
Copy link

@eternaleye I don't really agree. I feel like lifetimes shouldn't exist at all after typechecking; they should only be a lint tool, to me. Just because safe code might cause UB with what unsafe code has done doesn't mean doing it in the unsafe code should be UB.

@eternaleye
Copy link

@ubsan: My point is that permitting the construction of references with invalid lifetimes is a huge footgun, due to how easy it is for that to pollute safe code on accident (via inference or otherwise).

Meanwhile, I'm not sure there are any useful patterns for references with invalid lifetimes that are not served equally well by raw pointers - whose big distinction is that they neither have nor enforce lifetimes.

Blurring the lines between them by permitting references to have invalid lifetimes, when you aren't permitted to dereference them, seems to me as if it encourages unnecessarily risky behavior on the part of the user.

@strega-nil
Copy link

@eternaleye I have personally used them in my own code :)

struct Foo {
    foo: Box<Foo>,
    foo_ref: &'static Foo, // reference into self
}

and other similar pattern: I've gone through 'static references to do things a lot. There's no point to making it undefined, except for some notion of nicer. It only outlaws more code, without making it easier to do anything compiler-y, so AFAICT, it only makes it harder to write correct Rust code. The point is to hopefully make it easier.

@Amanieu
Copy link
Member

Amanieu commented Aug 19, 2016

AFAIK in the current implementation there are currently only 3 restrictions on references:

  • nonzero: Can't be null. This is necessary for the enum layout trick.
  • noalias: No mutable aliasing. (The precise definition of this is outside the scope of this issue)
  • dereferencable: The reference must point to valid memory (ie doesn't segfault). That memory doesn't need to contain a valid value, it just needs to not fault.

Violating any of these is UB. However there are also additional restrictions we may want to apply when dereferencing a reference:

  • Alignment: IMO a reference should only need to be correctly aligned when it is dereferenced. This allows the use of references to fields of a packed struct for example, which is unsafe (can cause UB from safe code) but not UB on its own if they are not dereferenced.
  • Valid value: This one is more debatable. Having a reference to an invalid value should be fine so that we can support the use of mem::uninitialized(). Overwriting such a value with ptr::write is also fine. The question is, is it UB to read such a value? I'm not entirely sure about this one.
  • Transient dereference: Is it UB to dereference an invalid reference without accessing any of its memory. One example of this is to get a reference to a sub-field of a reference (&a.b -- technically a is dereferenced, but its memory is never accessed). The question can be reformulated as "does UB occur when we dereference or when we perform lvalue to rvalue conversion?" I am currently tending towards the latter.

@strega-nil
Copy link

strega-nil commented Aug 19, 2016

@Amanieu

On the third point: lexpr to vexpr conversion ;P

I'd argue, for now, that it should be UB, but I think we should postmark it for revisiting in the future.

@RalfJung
Copy link
Member

RalfJung commented Aug 19, 2016

Since references act a lot like pointers, it seems reasonable to assume (from the POV of an unsafe rust programmer coming from other languages like C/C++) that they act like C++ pointers - that is that invalid references are invalid, and thus cannot be followed, but that they don't trigger any UB unless they are dereferenced.

Just a quick remark on this: Actually, at least in C11, even creating an invalid reference is UB. From section 6.5.6 of the standard:

If both the pointer
operand and the result point to elements of the same array object, or one past the last
element of the array object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined.

I am not saying that this is expected behavior, or that Rust should follow this. But generally, I think we don't want anything to act like C++ pointers, those are just way too horrible. ;-)

@nikomatsakis
Copy link
Contributor

@ubsan

There's no point to making it undefined, except for some notion of nicer. It only outlaws more code, without making it easier to do anything compiler-y

This is very interesting to me. I am sympathetic to this POV, but it seems to be one of the key questions that we need to decide: "to trust types or not to trust types". I think that using &'static u32 when, in fact, the underlying value is not 'static is strongly related to examples like "RefCell-Ref". This kind of example, to my mind, pushes us quickly towards "Tootsie-Pop"-style models, which I know you have historically been opposed to. =)

That said, I think I agree with @mystor that we have to tread carefully here -- it seems obvious to me that "access-based" models are likely to be less error-prone. I am very wary of giving too much significance to types.

Another related issue is the "only-access-using-volatile-loads" use case described in #16.

@eternaleye
Copy link

eternaleye commented Aug 25, 2016

One thing I think may be worthwhile is to examine how many ways an invalid reference in unsafe code could give rise to an invalid reference in safe code - I'm particularly worried about lifetime inference.

If this turns out to be "not many, and easily managed" then making it UB to pass an invalid reference across an unsafe {} boundary into safe code would seem to make sense to me.

If it turns out that there are many, or that they're hard to defend against, then from an ergonomics and teachability standpoint, I think it'd be better to outlaw creating such references, even within unsafe code.

The third option, allowing invalid references to exist in safe code, seems to fall into the Tootsie-Pop zone, and notably constrain the optimizer on safe code.

@strega-nil
Copy link

strega-nil commented Aug 26, 2016

@eternaleye Please define "invalid reference". I can't tell if you're talking about

{
    let y = 0;
    let x: &'static i32 = &* (&y as *const i32);
}

which should be fine, or

{
    let x: &'static i32 = &*(&0 as *const i32);
    x
}

which, I'd argue, shouldn't be.

@eternaleye
Copy link

eternaleye commented Aug 26, 2016

@ubsan
To be clearer, I'd call this "passing an invalid reference to safe code across an unsafe {} boundary":

let x = unsafe {
    let y: &'static i32 = &*{
        let z = 0;
        &z as *const i32
    }
}

The let y creates it, but my worry is that there may be other ways of doing this that are not near to the point it's passed across, and which thus result in "spooky" invalid references making their way to safe code.

Once an invalid reference has made it to safe code, we have two options:

  1. Constrain the optimizer in (at least some) safe code, removing its ability to optimize based on dereferenceability, including under inlining. This leads rapidly to the Tootsie-Pop model.
  2. Admit UB that is potentially very hard to diagnose and correct, due to non-locality between the point at which UB occurs and the code which caused it to occur.

My preference is to prevent invalid references from entering safe code, but if it is hard to prevent an invalid reference (once created) from reaching safe code then case 2 recurs. As a result, it would then be beneficial to make constructing it invalid, such that it's a good boundary for best-effort lints, or a sanitizer which does preserve lifetime information in the binary.

@strega-nil
Copy link

@eternaleye that does not answer my question. I'm asking your definition of invalid reference, not for an example of one :P

@eternaleye
Copy link

eternaleye commented Aug 27, 2016

@ubsan: An invalid reference is a reference that doesn't satisfy its contract, just like an invalid instance of any other type.

For references, the contract is that it points to a value of the specified type, and that the value outlives the specified lifetime. It also includes the aliasing rules, though I think those have some freedom in them (specifically, "parent" mutable references effectively don't exist when subobjects are borrowed)

@strega-nil
Copy link

strega-nil commented Aug 27, 2016

@eternaleye

So you're saying the first example of mine should be undefined behavior? This?

struct Foo {
    foo: Box<u8>,
    foo_ref: &'static u8,
}

fn new() -> Foo {
    let b = Box::new(0);
    let ref_ = unsafe { &*(&*b as *const _) };
    Foo {
        foo: b,
        foo_ref: ref_,
    }
}

because I've definitely written this code; that foo_ref is not 'static.

I don't get the point of that.

@eternaleye
Copy link

eternaleye commented Aug 28, 2016

I'd say that can't possibly be anything other than UB without a Tootsie-Pop model.

@strega-nil
Copy link

@eternaleye

I can't think of a reasonable model where that isn't defined behavior: no invalid reference ever exists (invalid meaning "points to an invalid value or invalid memory"). No memory model that I have seen so far has taken lifetimes into account: they aren't intended to be used for codegen, just for static analysis.

@arielb1
Copy link
Contributor

arielb1 commented Aug 28, 2016

@ubsan

My model does consider lifetimes in safe code. IIRC old trans considered lifetimes in some situations.

@Ixrec
Copy link

Ixrec commented Aug 28, 2016

Complete newbie here, but in the context of ubsan's example it seems like any code which creates a Foo and then tries to do anything at all with the foo_ref member (or call any Foo methods that touch that member) would be immediately invoking UB, since that member has already been dropped. It seems no different from "use-after-free" to me, which I assume is considered UB in Rust. The only question in my mind is whether "making-accessible-to-safe-code-after-free" should be UB as well, but I can't think of any reason we would want to allow that.

If the lifetime of foo_ref was part of new()'s signature, then it would at least be possible for clients of Foo to create it and invoke its methods in such a way that no use-after-free actually occurred. Thus I see no problem with allowing a reference to not-yet-freed memory to enter safe code. But giving safe code a reference to definitely-already-freed memory seems completely undesirable., unless you have a custom Drop impl that doesn't actually free the memory, but I have no idea what that would be good for.

@strega-nil
Copy link

@lxrec The point (heh) is that, while the pointer foo_ref exists, foo also exists, and foo_ref points to *foo.

@eternaleye
Copy link

eternaleye commented Aug 28, 2016

@ubsan

@Ixrec The point (heh) is that, while the pointer foo_ref exists, foo also exists, and foo_ref points to *foo.

That's not a guarantee; it's a precondition - one that needs to hold on every piece of code that accesses foo_ref.

A guarantee is enforced; a precondition is relied on - and violating a precondition, followed by relying on it, leads to UB.

Because the precondition there is not enforced by the type system, it can be violated by safe code.

As a result, safe code must behave very precisely - in particular, there are behaviors allowed by the type system that it must not perform, or else cause UB.

If the optimizer is unaware of exactly what those behaviors are (and that's a huge, possibly insoluble, kettle of fish in its own right), then the only solution is to apply blanket restrictions to the optimizer if such "illegal behaviors" could exist, as it may otherwise invoke such behaviors.

Since such behaviors could exist if it's ever valid to pass an invalid instance of a type to safe code across an unsafe {} boundary, the only options for bounding such restrictions become privacy-based. Specifically, module and crate boundaries.

Thus, we arrive at the Tootsie-Pop model. One which I (and I believe you as well?) do not favor.

@mystor
Copy link
Author

mystor commented Aug 29, 2016

Id love to read some explanation as to the problems with the Tootsie-Pop model. Since I read the blog post, it has always seemed like a relatively sane model for allowing breaking the rules in unsafe code, and avoiding the problems with people triggering ub when writing that unsafe code, but still allowing aggressive optimization in most of the program.

@Ixrec
Copy link

Ixrec commented Aug 29, 2016

The only explanation I could find was http://smallcultfollowing.com/babysteps/blog/2016/08/18/tootsie-pop-followup/, which if I understand correctly boils down to: Unsafe code is usually intended as an optimization, but if the optimizer is forced to be more conservative around any unsafety boundaries, then unsafe code can end up being slower than safe code; the example given was the unsafe vec.get_unchecked() being potentially slower than the safe-but-bounds-checked vec[i] operation. If there's more explanation than that, I'd also love to hear it.

@eternaleye
Copy link

There was a (lengthy) discussion thread on internals.rust-lang.org.

@RalfJung
Copy link
Member

I can't think of a reasonable model where that isn't defined behavior: no invalid reference ever exists (invalid meaning "points to an invalid value or invalid memory"). No memory model that I have seen so far has taken lifetimes into account: they aren't intended to be used for codegen, just for static analysis.

It was my understanding that there are a bunch of desirable optimizations that crucially rely on lifetimes being correct; or at least, they rely on references being more than just "valid" according to your definition. For example

fn foo<F: FnOnce()>(x: &mut u32, f: F) -> u32 {
  let r = *x;
  f();
  r
}

The compiler is (in my understanding) allowed to move the dereference of x after the call to unknown code. If x however is a perfectly valid reference that just happens to not be exclusively borrowed to foo for the given lifetime (but, say, for another lifetime that has long ended), then this transformation is incorrect.

Actually, I cannot see how a model would allow such transformations without taking lifetimes into account. According to your statement, a transmute that changes the lifetime (and nothing else) could never invoke UB, because the only bit that changed (the lifetime) is entirely irrelevant for the memory model. I must be missing something here?

@nikomatsakis
Copy link
Contributor

@RalfJung

Actually, I cannot see how a model would allow such transformations without taking lifetimes into account.

Indeed, this is more or less what I concluded in this recent blog post.

@RalfJung
Copy link
Member

I'd like to expand the scope of this topic beyond "holding an invalid reference", namely:
Even if we permit "holding" (but not using) invalid references, what about unaligned references? What about NULL references? Also, we certainly permit "holding" invalid and NULL pointers, but what about unaligned pointers?
See https://internals.rust-lang.org/t/rules-for-alignment-and-non-nullness-of-references/5430.

Actually, I think this goes beyond pointers. To me, the question is: Are there cases where even just "temporarily holding but not using an invalid value inside an unsafe function" is UB? (I am restricting this to "inside a function" because I think there are some guarantees we want to uphold even for unused values when certain or all function boundaries are crossed. In fact, this is needed for some optimizations.)

Most types don't matter here, as they clearly don't have any invalid values. Notable exceptions are pointers/references (this is what has been discussed here), str (must be valid UTF-8), bool and enums.

On the one hand, if the answer to the question is "yes" -- if there are such cases where just creating/holding a bad value is UB -- then we get into the difficult game to define what does and what does not constitute a use. Store/load, conditional branching, and pattern matching clearly do (and there is, I think, no question that these are UB for "invalid" values, but who knows?). But other cases are less clear-cut. Is it insta-UB to transmute 3u8 to bool? What about loading 3u8 from memory with a bool-pointer?

So, I'd somewhat prefer if we had as little such "invalid value UB" as possible -- ideally, none. It seems like a terrible footgun. However, I am not sure if that can be reconciled with NonZero and the optimizations the compiler does around it...

Just dropping problems here, not proposing solutions to all of them. ;)

@strega-nil
Copy link

@RalfJung yeah, I'm also of the opinion that we should make it as easy as possible to not commit UB, within reason. Passing and returning invalid values shouldn't be UB for example. OTOH, I don't think that invalid values should be treated specially - they should simply be treated as uninitialized. Use is UB, but there's a lot of stuff you should be able to do without it being UB.

@RalfJung
Copy link
Member

Passing and returning invalid values shouldn't be UB for example.

I don't think that's always possible...

fn some_fn(x: &mut u32) {
  other_unknown_fn();
  *x = 13;
}

It should be permitted for the compiler to move the store around the call. However, if passing a bad pointer into some_fn is NOT UB, and other_unknown_fn diverges, that optimization is incorrect.

@strega-nil
Copy link

@RalfJung note that that is a use of x.

@strega-nil
Copy link

(the dereference, that is)

The passing itself isn't UB, but it's definitely not okay to do that to random APIs; you're breaking an API that the function has, and then the function does UB because you broke its API.

@RalfJung
Copy link
Member

@ubsan

note that that is a use of x.

True -- but the use is after other_unknown_fn is called. If that function diverges, some_fn does not UB even if x is invalid. So, if we take your interpretation, reordering the store and the call is illegal. I think we want it to be legal, though.

@strega-nil
Copy link

strega-nil commented Jun 25, 2017

@RalfJung ah, I guess that's true. Although, if you think about it, if we don't know if this function is noexcept or not, we can't reorder the store anyways, even if the pointer is valid (because if it throws an exception, the pointer would never be written to).

@strega-nil
Copy link

strega-nil commented Jun 25, 2017

@RalfJung perhaps passing as a parameter could be counted as a use, though? I'm not sure. If you have panic=abort set, I'd want to be able to do that.

@RalfJung
Copy link
Member

Although, if you think about it, if we don't know if this function is noexcept or not, we can't reorder the store anyways, even if the pointer is valid (because if it throws an exception, the pointer would never be written to).

Good point!

@arielb1
Copy link
Contributor

arielb1 commented Jun 25, 2017

On the one hand, if the answer to the question is "yes" -- if there are such cases where just creating/holding a bad value is UB -- then we get into the difficult game to define what does and what does not constitute a use. Store/load, conditional branching, and pattern matching clearly do (and there is, I think, no question that these are UB for "invalid" values, but who knows?). But other cases are less clear-cut. Is it insta-UB to transmute 3u8 to bool? What about loading 3u8 from memory with a bool-pointer?

I think we made the decision that all of these are "instant" UB. The moment you have an invalid rvalue, you have UB. But then there's the little reference problem, which we should probably solve at the typeck level by allowing ExprRef to return *T directly without a coercion.

@strega-nil
Copy link

strega-nil commented Jun 25, 2017

@arielb1 I've gone back on that idea. It's just... too hard to work with, imo.

edit: nvm, arielby convinced me

@arielb1
Copy link
Contributor

arielb1 commented Jun 25, 2017

@ubsan

First, at least for nonnull/range that's how LLVM works, so we don't have much of a choice.

Also, what's the problem with that idea?

@strega-nil
Copy link

@arielb1 You've convinced me back on irc. I'm back to "creating an uninitialized rvalue is UB".

@arielb1
Copy link
Contributor

arielb1 commented Jun 25, 2017

The main reason I want "no undef values ever" is because, well, see the LLVM long thread about undef. I'll like to keep this within the optimizer.

@nagisa
Copy link
Member

nagisa commented Jun 25, 2017 via email

@strega-nil
Copy link

strega-nil commented Jun 25, 2017

It boils down to "it's already UB", basically.

arielby ubsan: what's the problem with instant UB on invalid rvalue?
[00:59:55] ubsan arielby: it doesn't actually help very much, and the standard library relies on that not being the case
[01:00:00] ubsan see: std::mem::uninitialized
[01:00:09] arielby ubsan: std::mem::uninitialized is being killed
[01:00:18] ubsan arielby: I know it's being killed
[01:00:31] ubsan but I still think that it's... a little strong
[01:00:54] ubsan I'm much more comfortable with "UB on use"
[01:01:23] arielby I'm actually less comfortable with "UB on use"
[01:01:27] ubsan where `use` is defined liberally
[01:01:31] arielby because that means you have undef values traveling through your code
[01:01:39] arielby and you are never sure what a use is or what value is undef
[01:01:50] ubsan arielby: it'd be reasonable rules
[01:02:01] ubsan like, passing to other functions, matching, using if, etc.
[01:02:19] ubsan or maybe not even passing to functions?
[01:02:23] ubsan I dunno
[01:02:35] ubsan I've been rethinking my mental model of values and objects
[01:03:10] arielby ubsan: first, is there a legitimate use case for invalid values that isn't handled by MaybeInitialized?
[01:03:37] ubsan arielby: initializing a stack thing one subobject at a time
[01:03:48] arielby ubsan: you do that with a MaybeInitialized
[01:03:55] arielby and pointers to subobjects
[01:04:03] ubsan arielby: hmm
[01:04:06] ubsan I mean, I guess?
[01:04:19] arielby let value: MaybeInitialized<MyStruct> = MaybeInitialized::uninit();
[01:04:34] arielby for i in 0..n { init(&mut value[pi(n)]); }
[01:04:45] ubsan I guess you could do that
[01:04:51] arielby I mean, if you want to use safe pointers
[01:04:59] arielby there's possibly some difficulty
[01:05:08] ubsan arielby: although I'd only want to break it at rust 2.0
[01:05:14] arielby but that's because of a different rule
[01:05:24] arielby ubsan: you mean, mem::uninitialized itself?
[01:05:28] ubsan arielby: yeah
[01:05:31] ubsan and the rules around it
[01:05:31] arielby of course
[01:05:34] arielby we'll special case it
[01:05:55] ubsan arielby: I'd rather allow people at least a release to fix their code
[01:06:05] ubsan *major release
[01:06:08] arielby I mean, GCC had 4.0 and they survived that
[01:06:16] ubsan before we go full UB
[01:06:23] ubsan arielby: I mean in Rust 2.0, not 3.0
[01:06:40] ubsan and we should *definitely* give people a ton of warning
[01:06:54] ubsan TBAA was a huge deal in the C community...
[01:06:58] arielby ubsan: btw we *already* do the UB thing
[01:07:09] ubsan arielby: do we?
[01:07:11] ubsan hmm
[01:07:11] arielby yes
[01:07:19] ubsan then yeah, uninitialized should be special cased, I guess
[01:07:22] arielby semi-inconsistently
[01:07:27] arielby (because it's UB)
[01:07:28] ubsan never mind then
[01:07:36] ubsan I wasn't clear on that
[01:07:57] arielby I once added nonnull enforcement to function pointers (it already was for data pointers) and that broke someone's code
[01:08:20] ubsan arielby: okay, never mind then lol

@RalfJung
Copy link
Member

This is now becoming the validity invariant discussion at rust-lang/unsafe-code-guidelines#76 and rust-lang/unsafe-code-guidelines#77.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants