RFC: hint::bench_black_box #2360
text/0000-bench-utils.md (outdated):

```rust
pub fn clobber() -> ();
```

> flushes all pending writes to memory. Memory managed by block scope objects must
This wording (flushing pending writes) makes me uncomfortable because it's reminiscent of memory consistency models, including hardware ones, when this is just a single-threaded, compiler-level restriction. Actually, come to think of it, I'd like to know the difference between this and `compiler_fence(SeqCst)`. I can't think of any off-hand.
> when this is just a single-threaded and compiler-level restriction

This is correct.

> `compiler_fence(SeqCst)`. I can't think of any off-hand.

I can't either, but for some reason they do generate different code: https://godbolt.org/g/G2UoZC
I'll give this some more thought.
The difference between `asm!` with a `memory` clobber and `compiler_fence` is that a `memory` clobber requires the compiler to actually reload memory if it wants to use it again (as memory is… clobbered, i.e. considered changed), whereas `compiler_fence` only enforces that memory accesses are not reordered; the compiler may still use the usual rules to conclude that it needn't reload anything.
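The distinction can be sketched in today's Rust. As an assumption, `clobber_sketch` below stands in for the `memory`-clobbering `asm!` being discussed (a stable `asm!` block is by default assumed to read and write all escaped memory); it is not anything defined by this RFC.

```rust
use std::arch::asm;
use std::hint::black_box;
use std::sync::atomic::{compiler_fence, Ordering};

// Assumed sketch of a "clobber": an empty asm block the compiler must
// treat as possibly reading and writing all escaped memory, so pending
// stores have to be materialized before it.
fn clobber_sketch() {
    unsafe { asm!("", options(nostack)) };
}

fn demo() -> u32 {
    let mut x = 0u32;
    // Escape the address of `x` so the asm block below could plausibly
    // observe it.
    black_box(&mut x as *mut u32);

    x = 1;
    // Only an ordering constraint: the dead store `x = 1` may still be
    // eliminated, since the fence does not count as a read of `x`.
    compiler_fence(Ordering::SeqCst);

    x = 2;
    // The asm block counts as a potential read of escaped memory, so
    // this store must actually be performed before it.
    clobber_sketch();
    x
}

fn main() {
    println!("{}", demo());
}
```

Both variants compute the same value; the difference is only visible in the generated code.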
@nagisa the only thing that `clobber` should do is flush pending writes to memory. It doesn't need to require that the compiler reloads memory on reuse. Maybe the volatile `asm!` with a `memory` clobber is not the best way to implement that.
@nagisa Thank you. I was misled by the fact that fences prohibit some load-store optimizations into thinking they'd also impact things like store-to-load forwarding on the same address with no intervening writes.

@gnzlbg `asm!` with `memory` is simply assumed to read from and write to all memory, and all its effects follow from that. However, to be precise, this does not mean any reloads are introduced after a clobbering inline asm; it just means that all the loads that are already there (of which there are a lot, given that every local and many temporaries are stack slots) can't be replaced with values loaded from the same address before the clobber.

If you want something less strong, you need to be precise. "Flushing pending writes" is not really something that makes intuitive sense at the compiler level (and it surely has no effect at runtime, for example on caches or store buffers?).
@rkruppe `mem::clobber()` should be assumed to read from all memory with effects, so that any pending memory stores must have completed before the `clobber`. All the loads that are already there in registers, temporary stack slots, etc., should not be invalidated by the `clobber`.
Okay, that is a coherent concept. I'm not sure off-hand how best to implement that in LLVM. (`compiler_fence` is probably not enough, since it permits dead store elimination if the memory location is known not to escape.)
However, come to think of it, what's the difference between `black_box(x)` (which is currently stated to just force a write of `x` to memory) and `let tmp = x; clobber();` (which writes `x` to the stack slot of `tmp` and then forces that store to be considered live)?
That's a good question, and this relates to what I meant with "block scope". In

```rust
{
    let tmp = x;
    clobber();
}
```

this `{}` block is the only part of the program that knows the address of `tmp`, so no other code can actually read it without invoking undefined behavior. So in this case, `clobber` does not make the store of `tmp` live, because nothing outside of this scope is able to read from it, and this scope doesn't read from it but executes `clobber` instead (clobbering "memory" does not clobber temporaries AFAICT).

However, if one shares the address of `tmp` with the rest of the program:

```rust
{
    let tmp = x;
    black_box(&tmp);
    clobber();
}
```

then `clobber` will force the store of `tmp` to be considered live.
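Both snippets can be turned into compilable sketches. As assumptions: `clobber` is modeled here with an empty `asm!` block (which by default is treated as reading and writing all escaped memory), and `black_box` is the since-stabilized `std::hint::black_box`; neither is the RFC's actual implementation.

```rust
use std::arch::asm;
use std::hint::black_box;

// Assumed model of `clobber()` from the RFC text.
fn clobber() {
    unsafe { asm!("", options(nostack)) };
}

fn store_may_be_dead(x: u32) -> u32 {
    // The address of `tmp` never escapes this scope, so `clobber()`
    // cannot observe it and the stack store may be optimized away.
    let tmp = x;
    clobber();
    tmp
}

fn store_is_live(x: u32) -> u32 {
    let tmp = x;
    // Escaping the address first means the clobber must assume the
    // store to `tmp` is observable, keeping it live.
    black_box(&tmp);
    clobber();
    tmp
}

fn main() {
    println!("{} {}", store_may_be_dead(1), store_is_live(2));
}
```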
I was already concerned about the interaction with the memory model, and this comment basically confirmed my worst suspicions: there are extremely subtle interactions between the memory model and what these functions can do. I'm starting to believe that we'll be unable to give any guarantees that are of any use to benchmark authors. But if this is the case, these functions become a matter of chasing the optimizer, and benchmarking reliably requires checking the asm to see that your work wasn't optimized out. That's clearly very unappealing, but it has always been true of micro benchmarks, so maybe that's just unavoidable. This also raises the question of why this needs to be in std, if it's not compiler magic (like …).
I'm temporarily closing this till we resolve the semantics of these functions in the memory model repo.
The issue in the memory model repo is: https://github.com/nikomatsakis/rust-memory-model/issues/45
Update. The summary of the discussion on the memory model repo is that …
Any updates on opening the new RFC?
I've updated the RFC with the discussion of the memory model repo. It now just proposes ….

Other changes: …

Open questions: …
Unsure if we should be opening a new RFC or reopening this one, cc @rust-lang/libs
Seems fine to me! I was worried you'd try to talk about which memory …
As a user, let me contribute my 2 cents. In C++, there is the same need for some kind of compiler barrier that prevents the compiler from hoisting benchmark code out of loops. There is a really great talk by Chandler Carruth about how to do this using inline assembly, where he goes into detail about how gtest implements this and what it means from the compiler's point of view: https://www.youtube.com/watch?v=nXaxk27zwlk
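For illustration, the pattern that talk describes translates to today's Rust roughly as follows. This uses `std::hint::black_box` (the function this RFC eventually produced) in place of gtest's `DoNotOptimize`; the benchmark body is a made-up example.

```rust
use std::hint::black_box;
use std::time::Instant;

// Toy workload the optimizer could otherwise const-fold entirely.
fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|i| i * i).sum()
}

fn main() {
    let start = Instant::now();
    for _ in 0..1_000 {
        // black_box on the input defeats constant folding; black_box
        // on the output keeps the call from being hoisted out of the
        // loop or removed as dead code.
        black_box(sum_of_squares(black_box(1_000)));
    }
    println!("elapsed: {:?}", start.elapsed());
}
```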
Please don't block this PR based on arguments about the semantics of this. The need for a barrier like this is obvious in any language that allows the optimizer to do things like hoisting expressions out of loops. It is an important practical need to be able to write benchmarks in Rust, and this is not a "novel" solution to the problem: it is the same one that has been actively deployed in Google's gtest C++ library for many years, which is used by millions of users with gcc and clang. Many experts have thought extensively about the best way to create this kind of barrier, with special attention to llvm. Real users of Rust have the same need for this benchmarking tool, without using a nightly compiler. Stabilizing ….

It's really sad that useful stuff like this is blocked for months by what appear to be pedantic concerns about having a completely formal definition of the semantics. It should be adequate to say "…".

Honestly, it's probably adequate to just track whatever gtest is doing -- if similar small microbenchmarks in C++ and Rust, one using …
@cbeck88 That's why I attempted to bypass the whole thing and have ….

Whereas we've seen people (want to) misuse ….

Pretty much the only way we can specify something like this is to make it a correct implementation to define those functions as no-ops that don't prevent optimization.
Here's a transcript from, imo, the most relevant part of that talk:

"First of all, I'm about to pause for a minute. I'm about to tell you all how to defeat the optimizer. As a compiler and optimizer writer, please use these powers for good. Like... don't defeat the optimizer because you have undefined behavior in your source code and you're like 'No no no, optimizer, you're not going to miscompile my code today, nah ah!'. That's not gonna actually work for you. It goes very badly. Okay, but how do we actually defeat it when we need to, for a good reason: we need to for some performance analysis? We need two functions. Alright, these are kind of magical functions. I would never expect anyone to know how to write these functions. You might reasonably argue that they shouldn't... like, I shouldn't be writing them and they should be in my library, but I wanted to show you guys how these things actually work. Okay, we're already in trouble, I'm using ….

I'm going to next type the two scariest tokens for my compiler-implementer world. Okay, literally, the two scariest tokens. Okay. [types …]"
We have a definition that is adequate for benchmarking purposes and completely formal already. (Well, as much as ….)

It seems to me that the only things this RFC is actually blocked on right now are a bikeshed and considering whether a change towards … is warranted.

From my perspective it would be fine to offer all three of ….

Sometimes the functions will also be wrapped in a benchmarking framework, which means that those users won't need to care about which ones to use. I think which ones to use will be largely driven by tutorials and convention, as they are all rather opaque (at least from my perspective). If it turns out we need to deprecate some of them in the future, that's also fine, because again, these functions are trivial to specify.
I'm not sure I follow this part.
Sure. That's unfortunate... but isn't good documentation the solution here?
This is not what this is blocked on. Please re-read the discussion. Nobody, not even from the formal camp, is opposed to having something that says "this encourages the compiler to not optimize some things, but code cannot rely on this for functional correctness". But some people don't like calling a thing that does this ….

@Centril thanks a lot for that transcript! I am relieved and happy that I can now even quote Chandler any time someone asks about using …. For those wanting to watch the video, the timecode-link is https://youtu.be/nXaxk27zwlk?t=2445.

Seems like the two functions Chandler defines are …
I've renamed the intrinsic to ….

I don't think that resolving this should block the RFC from making progress, and we probably should just implement …. This should resolve @cramertj's concern about naming; let me know if this isn't the case.

I've also tried to document @eddyb's proposal of adding …. The pros of that proposal are that they are easier to specify, with stronger guarantees than ….
@gnzlbg Would you be happier with …?
Also, keep in mind none of the LLVM limitations would prevent ….
I'm happy with ….

The main use case for these is benchmarks, and when users try to put what they learn into practice, there is no self-correcting mechanism that will let them know they are using these incorrectly unless they run their benchmarks under Miri. Maybe there is a way to encourage that somehow, but that is going to require a bit of experimentation, because currently Miri is quite slow. I suppose libtest could help here by reducing the number of iterations to 1 when run under Miri by default and omitting all the timing stuff, or something like that.
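That idea could look roughly like this; `iteration_count` and the `cfg!(miri)` check are hypothetical, sketching the suggested libtest behavior rather than anything libtest actually does:

```rust
fn iteration_count() -> u32 {
    // Hypothetical: run a single iteration under Miri (correctness
    // checking only, no timing), the full count otherwise.
    if cfg!(miri) { 1 } else { 10_000 }
}

fn main() {
    let iters = iteration_count();
    let mut acc = 0u64;
    for i in 0..iters as u64 {
        acc += i;
    }
    println!("ran {iters} iterations, acc = {acc}");
}
```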
I mean, function name choice (and whether you have one or two functions) kind of is part of "good documentation". Self-documenting functions are better than something people may misuse if they don't read the docs.
@rfcbot resolve actually-a-black-box

The new name …
Ping @SimonSapin @Zoxc @michaelwoerister @varkor for your checkboxes here.
Ping @SimonSapin @Zoxc @michaelwoerister for your checkboxes #2360 (comment)
🔔 This is now entering its final comment period, as per the review above. 🔔
The final comment period, with a disposition to merge, as per the review above, is now complete. As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed. The RFC will be merged soon.
🎉 Huzzah! This RFC has been merged! 🎉 Tracking issue: rust-lang/rust#64102
Adds `bench_black_box` to `core::hint`.

Rendered. Tracking issue.