Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: hint::bench_black_box #2360

Merged
merged 22 commits into from
Sep 2, 2019
Merged

RFC: hint::bench_black_box #2360

merged 22 commits into from
Sep 2, 2019

Conversation

gnzlbg
Copy link
Contributor

@gnzlbg gnzlbg commented Mar 12, 2018

Adds bench_black_box to core::hint.

Rendered.
Tracking issue.

text/0000-bench-utils.md Outdated Show resolved Hide resolved
@gnzlbg gnzlbg force-pushed the black_box branch 4 times, most recently from b3f4441 to 8a9ae3f Compare March 12, 2018 17:39
pub fn clobber() -> ();
```

flushes all pending writes to memory. Memory managed by block scope objects must

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wording (flushing pending writes) makes me uncomfortable because it's remniscient of memory consistency models, including hardware ones, when this is just a single-threaded and compiler-level restriction. Actually, come to think of it, I'd like to know the difference between this and compiler_fence(SeqCst). I can't think of any off-hand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when this is just a single-threaded and compiler-level restriction

This is correct.

compiler_fence(SeqCst). I can't think of any off-hand.

I can't either, but for some reason they do generate different code: https://godbolt.org/g/G2UoZC

I'll give this some more thought.

Copy link
Member

@nagisa nagisa Mar 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between asm! with a memory clobber and compiler_fence exists in the fact, that memory clobber requires compiler to actually reload the memory if they want to use it again (as memory is… clobbered – considered changed), whereas compiler_fence only enforces that memory accesses are not reordered and the compiler still may use the usual rules to figure that it needn’t to reload stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nagisa the only thing that clobber should do is flush pending writes to memory. It doesn't need to require that the compiler reloads memory on reuse. Maybe the volatile asm! with memory clobber is not the best way to implement that.

Copy link

@hanna-kruppe hanna-kruppe Mar 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nagisa Thank you. I was mislead by the fact that fences prohibit some load-store optimization to thinking they'd also impact things like store-load forwarding on the same address with no intervening writes.

@gnzlbg asm! with memory is simply assumed to read from and write to all memory and all its effects follow from that. However, to be precise, this does not mean any reloads are introduced after a clobbering inline asm, it just means that all the loads that is already there (of which are a lot given that every local and many temporaries are stack slots) can't be replaced with values loaded from the same address before the clobber.

If you want something less strong, you need to be precise. "Flushing loads writes" is not really something that makes intuitive sense at the compiler level (and it surely has no effect on runtime, for example caches or store buffers?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkruppe mem::clobber() should be assumed to read from all memory with effects, so that any pending memory stores must have completed before the clobber. All the loads that are already there in registers, temporary stack slots, etc, should not be invalidated by the clobber.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay that is a coherent concept. I'm not sure off-hand how to best implement that in LLVM. (compiler_fence is probably not enough, since it permits dead store elimination if the memory location is known to not escape.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, come to think of it, what's the difference between black_box(x) (which is currently stated to just force a write of x to memory) and let tmp = x; clobber(); (which writes x to the stack slot of tmp and then forces that store to be considered live)?

Copy link
Contributor Author

@gnzlbg gnzlbg Mar 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question and this relates to what I meant with "block scope" . In

{ 
    let tmp = x; 
    clobber();
}

this {} block is the only part of the program that knows the address of tmp so no other code can actually read it without invoking undefined behavior. So in this case, clobber does not make the store of tmp live, because nothing outside of this scope is able to read from it, and this scope doesn't read from it but executes clobber instead (clobbering "memory" does not clobber temporaries AFAICT).

However, if one shares the address of tmp with the rest of the program:

{ 
    let tmp = x; 
    black_box(&tmp);
    clobber();
}

then clobber will force the store of tmp to be considered live.

text/0000-bench-utils.md Outdated Show resolved Hide resolved
@Centril Centril added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Mar 12, 2018
@hanna-kruppe
Copy link

I was already concerned about the interaction with the memory model, and this comment basically confirmed my worst suspicions: There are extremely subtle interactions between the memory model and what these function can do.

I'm starting to believe that we'll be unable to give any guarantes that are of any use to benchmark authors. But if this is the case, these functions become a matter of chasing the optimizer and benchmarking reliably requires checking the asm to see your work wasn't optimize out. That's clearly very unappealing, but has always been true of micro benchmarks, so maybe that's just unavoidable.

This also raises the question of why this needs to be in std, if it's not compiler magic (like compiler_fence is) but a natural consequence of the memory model and inline asm. Aside from the stability of inline asm, which is not a very convincing argument to me.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 13, 2018

I'm temporarily closing this till we resolve the semantics of these functions in the memory model repo.

@gnzlbg gnzlbg closed this Mar 13, 2018
@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 13, 2018

The issue in the memory model repo is: https://github.com/nikomatsakis/rust-memory-model/issues/45

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jun 11, 2018

Update. The summary of the discussion on the memory model repo is that clobber should be scrapped, and the RFC updated with the definition of black box agreed on there.

@Manishearth
Copy link
Member

Any updates on opening the new RFC?

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Aug 29, 2018

I've updated the RFC with the discussion of the memory model repo.

It now just proposes pub unsafe fn core::hint::black_box<T>(x: T) -> T, which is specified as an unknown unsafe function that returns x. It is a no-op in the virtual machine, but the compiler has to assume that it can perform any valid operation on x that unsafe Rust code is allowed to perform.

Other changes:

  • moved the function from core::mem to core::hint since that appears to be a more suitable place for this functionality which is a no-op.

Open questions:

  • do we need a "safe" alternative? What safe and unsafe Rust are allowed to do differs, e.g., if the user passes it a raw pointer unsafe Rust is allowed to dereference it while safe Rust is not.

  • In the memory model repo discussion it was a bit unclear (to me) whether this function should take a reference or not. I've keep the interface from the test crate which takes a value, which allows users to pass it a &/&mut/*mut... or whatever they want, but I am not sure this is the right call.

  • should we call this black_box or something else ? (e.g. unknown)

cc @RalfJung @ubsan @rkruppe

@gnzlbg gnzlbg reopened this Aug 29, 2018
@Manishearth
Copy link
Member

Unsure if we should be opening a new RFC or reopening this one, cc @rust-lang/libs

text/0000-bench-utils.md Outdated Show resolved Hide resolved
text/0000-bench-utils.md Outdated Show resolved Hide resolved
@RalfJung
Copy link
Member

Seems fine to me! I was worried you'd try to talk about which memory black_box could touch, but you actually are using the spec I was hoping for. :)

text/0000-bench-utils.md Outdated Show resolved Hide resolved
@cbeck88
Copy link

cbeck88 commented Jul 24, 2019

As a user let me contribute my 2 cents

In C++, there is the same need for some kind of compiler barrier that prevents the compiler from hoisting benchmark code out of loops. There is a really great talk by Chandler Carruth about how to do this using inline assembly, where he goes into detail about how gtest implements this, and what it means from compiler's point of view: https://www.youtube.com/watch?v=nXaxk27zwlk

test::black_box(x) should mean for the compiler, this statement compiles to a no-op, but the surrounding code must be optimized in such a way that it would still be correct if test::black_box changed x in some arbitrary and unpredictable way.


Please don't block this PR based on arguments about the semantics of this.

The need for a barrier like this is obvious in any language that allows the optimizer to do things like hoisting expressions out of loops. It is an important practical need to be able to do benchmarks in rust, and this is not a "novel" solution to the problem, it is the same one that has been actively deployed in google's gtest C++ library for many years, which is used by millions of users with gcc and clang. Many experts have thought extensively about the best way to create this kind of barrier, with special attention to llvm. Real users of rust have the same need to use this benchmarking tool, without using a nightly compiler. Stabilizing test::black_box is the only way to do that AFAIU.

It's really sad that useful stuff like this is blocked for months by what appear to be pedantic concerns about having a completely formal definition of the semantics. It should be adequate to say "test::black_box is an appropriate barrier to use for microbenchmarks, to prevent the compiler from making optimizations about a particular variable over multiple iterations of a loop", and if in the future it is determined that the implementation is inadequate for this in some way, then it is treated as a bug and fixed.

Honestly, it's probably adequate to just track whatever gtest is doing -- if similar small microbenchmarks in C++ and rust, one using gtest and one using core::black_box, produce similar LLVM IR when compiled with clang vs. rustc, then the implementation is correct, and this is all that users need.

@eddyb
Copy link
Member

eddyb commented Jul 25, 2019

@cbeck88 That's why I attempted to bypass the whole thing and have bench_{input,output}, which can be more weakly specified (I hope?), and it should be clearer from the names that these are for benchmarks.

Whereas we've seen people (want to) misuse black_box to write incorrect/unsound code.

Pretty much the only way we can specify something like this is to make it a correct implementation to define those functions as noops that don't prevent optimization.
It's a fool's errand to try to specify almost anything about an optimizer, especially if you mention LLVM, I just wish it were more obvious. The only winning move is not to play.

@Centril
Copy link
Contributor

Centril commented Jul 25, 2019

There is a really great talk by Chandler Carruth about how to do this using inline assembly, where he goes into detail about how gtest implements this, and what it means from compiler's point of view: https://www.youtube.com/watch?v=nXaxk27zwlk

Here's a transcript from imo the most relevant part of that talk:

"First of all, I'm about to pause for a minute. I'm about to tell you all how to defeat the optimizer. As a compiler and optimizer writer, please use these powers for good. Like... Don't defeat the optimizer because you have undefined behavior in your source code and you're like 'No no no optimizer, you're not going to miscompile my code today, nah ah!'. That's not gonna actually work for you. It goes very badly.

Okay, but how do we actually defeat it when we need to for a good reason, we need to for some performance analysis? We need two functions. Alright, these are kind of magical functions. I would never expect anyone to know how to write these functions. You might reasonably argue that they shouldn't... like I shouldn't be writing them and they should be in my library but I wanted to should you guys how these things actually work. Okay, we're already in trouble, I'm using void* pointers, I know, I know, type safety and all that. But but... there's actually a reason I'm using void* and it's gonna get a lot worse.

I'm going to next type the two scariest tokens for my compiler implementer world. Okay, literally, the two scariest tokens. Okay. [types asm volatile("" : : "g"(p") : "memory");] So let's see, what on earth have I done? Okay so what is this...? This is a magical escape function. It is truly magical, okay.. It has these unique and mysterious properties. The first property here is that it accepts an arbitrary pointer and it discards all type information because this escape function isn't about playing nice with your type system. It's about turning off the optimizer, right? We don't need a type system to do that. Okay so then we get really crazy and we do an asm volatile. [...]"

It's really sad that useful stuff like this is blocked for months by what appear to be pedantic concerns about having a completely formal definition of the semantics.

We have a definition that is adequate for benchmarking purposes and completely formal already. (Well as much as fn black_box<T>(x: T) -> T { x } is formally defined at least.)

It seems to me that the only things this RFC is actually blocked on right now is a bikeshed and to consider whether a change towards bench_{input,output} would be better for benchmarking purposes. Anything else, e.g. guarantees regarding freezing or constant time operations, is really out of scope for this RFC.

From my perspective it would be fine to offer all three of hint::bench_{black_box,input,output} and we can start with these in any order. I think that's fine because these are all best effort tools that are trivial to formally specify when they truly are hints.

Sometimes the functions will also be wrapped in a benchmarking framework which means that those users won't need to care about which ones to use. I think which ones to use will be largely driven by tutorials and convention as they are all rather opaque (at least from my perspective). If it turns out we need to deprecate some of them in the future that's also fine because again, these functions are trivial to specify.

@eddyb

@cbeck88 That's why I attempted to bypass the whole thing and have bench_{input,output}, which can be more weakly specified (I hope?), and it should be clearer from the names that these are for benchmarks.

I'm not sure I follow this part. fn bench_black_box<T>(x: T) -> T { x } being a valid definition of bench_black_box for any Rust compiler/interpreter seems like as weak a specification as it gets?

Whereas we've seen people (want to) misuse black_box to write incorrect/unsound code.

Sure. That's unfortunate... but isn't good documentation the solution here?

@RalfJung
Copy link
Member

It's really sad that useful stuff like this is blocked for months by what appear to be pedantic concerns about having a completely formal definition of the semantics.

This is not what this is blocked on. Please re-read the discussion. Nobody, not even from the formal camp, is opposed to having something that says "this encourages the compiler to not optimize some things, but code cannot rely on this for functional correctness".

But some people don't like calling a thing that does this black_box, and I think that name is the only open concern.

@Centril thanks a lot for that transcript! I am relieved and happy that I can now even quote Chandler any time someone asks about using black_box for "defeating UB" purposes. It's not just us academic eggheads that argue this way. ;)

For those wanting to watch the video, the timecode-link is https://youtu.be/nXaxk27zwlk?t=2445.

Seems like the two functions Chandler defines are escape and clobber. black_box corresponds to escape and IIRC we discarded clobber because Rust's strong aliasing guarantees make it much less useful.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jul 25, 2019

I've renamed the intrinsic to bench_black_box but still maintained the unresolved question that we might want to tune naming before stabilization, since it is unclear whether we might want to add bench_input/bench_output at some point, and if that might be the case we might want to put these in a core::hint::bench::.. submodule where we can explain how these intrinsics are supposed to be holistically used. Otherwise we need to explain that in the docs of each of the intrinsics.

I don't think that resolving this should block the RFC from making progress, and we probably should just implement bench_input/bench_output on nightly behind a feature gate and let users gain experience with them and let us gain experience with them as well about how to explain how to use them together, and see how well that works out in practice.

This should resolve @cramertj concern about naming, let me know if this isn't the case.


I've also tried to document @eddyb proposal of adding bench_input/output in the alternative section. We tried to work that alternative out in dozens of comments so I might have missed some, please review the diff of the last commit.

The pros of that proposal is that they are easier to specify with stronger guarantees than bench_black_box, and the cons is that they are currently not properly implementable in LLVM, and without a proper implementation we can't really show users what their effect on optimizations actually is, which makes teaching how to use them together hard (at least harder than bench_black_box, since we can really show examples of things going wrong when they are used improperly).

@eddyb
Copy link
Member

eddyb commented Jul 25, 2019

@gnzlbg Would you be happier with bench_{input,output} if we added optimizations at the MIR level to demonstrate their distinct semantics?

@eddyb
Copy link
Member

eddyb commented Jul 25, 2019

Also, keep in mind none of the LLVM limitations would prevent miri from demonstrating UB.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Jul 25, 2019

Would you be happier with bench_{input,output} if we added optimizations at the MIR level to demonstrate their distinct semantics?

I'm happy with bench_input and bench_output as proposed.

bench_black_box has a steep-ish learning curve and might disable too many optimizations, and I do think we should add bench_intput/output as well so that we can try to experiment how to implement them more strictly, and for those users with more time and wanting more finer grained control can learn how to use them and try them out. I just don't think it is worth blocking bench_black_box on figuring this out.

Also, keep in mind none of the LLVM limitations would prevent miri from demonstrating UB.

The main use case for these is benchmarks and when users try to put what they learn into practice, there is no self-correcting mechanism that will let them know they are using these incorrectly unless they run their benchmarks under miri. Maybe there is a way to encourage that somehow, but that is going to require a bit of experimentation because currently miri is quite slow. I suppose libtest could help here by reducing the number of iterations to 1 when run under miri by default and omit all the timing stuff or something like that.

@Manishearth
Copy link
Member

Sure. That's unfortunate... but isn't good documentation the solution here?

I mean, function name choice (and whether you have one or two functions) kind of is part of "good documentation". Self-documenting functions are better than something people may misuse if they don't read the docs.

@cramertj
Copy link
Member

@rfcbot resolve actually-a-black-box

The new name bench_black_box seems pretty clear about how this is supposed to be used.

@cramertj
Copy link
Member

Ping @SimonSapin @Zoxc @michaelwoerister @varkor for your checkboxes here.

@varkor varkor changed the title RFC: hint::black_box RFC: hint::bench_black_box Jul 28, 2019
@hanna-kruppe
Copy link

Ping @SimonSapin @Zoxc @michaelwoerister for your checkboxes #2360 (comment)

@rfcbot
Copy link
Collaborator

rfcbot commented Aug 23, 2019

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot added final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. and removed proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. labels Aug 23, 2019
@rfcbot
Copy link
Collaborator

rfcbot commented Sep 2, 2019

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

The RFC will be merged soon.

@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this RFC. and removed final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. labels Sep 2, 2019
@Centril Centril merged commit 356cb57 into rust-lang:master Sep 2, 2019
@Centril
Copy link
Contributor

Centril commented Sep 2, 2019

🎉 Huzzah! This RFC has been merged! 🎉

Tracking issue: rust-lang/rust#64102

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-hint Hints related proposals & ideas A-profiling Profiling and benchmark related proposals & ideas disposition-merge This RFC is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this RFC. T-compiler Relevant to the compiler team, which will review and decide on the RFC. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.