-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate measureme's hardware performance counter support. #78781
Conversation
r? @davidtwco (rust_highfive has picked a reviewer for you, use r? to override) |
@bors try @rust-timer queue (this won't use the new counters, it's just to check the overhead of the new hacks) |
Awaiting bors try build completion |
⌛ Trying commit 5c0e03110a79a66356646ec3c02b3c0a4b6fb77a with merge 36688b7cf765fadab38d1c07a0e7c3111cac632d... |
cc @Amanieu Looks like we generate LLVM inline assembly that's not compatible with older versions? LLVM ERROR: Bad $ operand number in inline asm string: 'xor eax, eax
cpuid
mov ecx, ${4:k}
rdpmc'
error: could not compile `measureme` |
Intel syntax is not supported by older versions of LLVM. Only AT&T syntax is. |
☀️ Try build successful - checks-actions |
Queued 36688b7cf765fadab38d1c07a0e7c3111cac632d with parent b1d9f31, future comparison URL. |
Finished benchmarking try commit (36688b7cf765fadab38d1c07a0e7c3111cac632d): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never |
Regressions of up to 3%. |
@Mark-Simulacrum Just checking, do we run the |
Yes, but we take the minimum of all runs under perf stat (including the self profile one), and always have at least 2 (and IIRC at most 3 right now) runs, including one self profile run. |
You can use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation LGTM, r=me when ready
// HACK(eddyb) remove instruction-counting noise from `-Z self-profile`. | ||
#[cfg(target_arch = "x86_64")] | ||
unsafe { | ||
asm!("mfence", options(nostack)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidtwco @nikomatsakis Just it doesn't get lost, I don't think we should merge with this, but I also am not sure what to do here. Do you think it would be reasonable to limit this to when -Z self-profile
is enabled?
It's also not easy to test the impact of this fence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too sure either. I personally wouldn't have a problem with this being limited to when -Z self-profile
is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking for -Zself-profile
is probably more expensive than unconditionally running mfence
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine leaving it in as it is for now. We can analyze its impact once perftest multithreaded rustc
Given #78785, it might make sense to define |
I was confused about this, because I could've sworn I've used Intel syntax inline The code emitting the error we're seeing is still around, but the And, there it is: operand modifiers for Intel syntax is new in LLVM 10 (llvm/llvm-project@dc5b614) So the problem is the |
5c0e031
to
5b4363f
Compare
Looks like the CI build got much farther this time (before getting cancelled by the check failure), so that worked. |
5b4363f
to
97b4a30
Compare
☔ The latest upstream changes (presumably #79586) made this pull request unmergeable. Please resolve the merge conflicts. Note that reviewers usually do not review pull requests until merge conflicts are resolved! Once you resolve the conflicts, you should change the labels applied by bors to indicate that your PR is ready for review. Post this as a comment to change the labels:
|
Triage: there's merge conflicts. |
Found likely culprit, missing rust/library/std/src/collections/hash/map.rs Lines 3136 to 3159 in 083721a
Don't think anyone uses them so I won't bother with a separate benchmarking PR, just add a fix to this PR directly. |
@bors try @rust-timer queue |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
⌛ Trying commit a4f1331 with merge 37e95f0d4e5b87d4b88f8eb37c46b964462717c9... |
☀️ Try build successful - checks-actions |
Queued 37e95f0d4e5b87d4b88f8eb37c46b964462717c9 with parent 083721a, future comparison URL. |
Finished benchmarking commit (37e95f0d4e5b87d4b88f8eb37c46b964462717c9): comparison url. Instruction count
Max RSS (memory usage)Results
CyclesResults
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Footnotes |
@oli-obk Should be ready to merge now, the remaining |
@bors r+ |
📌 Commit a4f1331 has been approved by |
☀️ Test successful - checks-actions |
let filename = format!("{}-{}.rustc_profile", crate_name, process::id()); | ||
// HACK(eddyb) we need to pad the PID, strange as it may seem, as its | ||
// length can behave as a source of entropy for heap addresses, when | ||
// ASLR is disabled and the heap is otherwise determinic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*deterministic
Finished benchmarking commit (872503d): comparison url. Instruction count
Max RSS (memory usage)Results
CyclesResults
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. @rustbot label: -perf-regression Footnotes |
Oops this is what I get for not testing it, only realized it now that this is actually broken: $ echo 'fn main() {}' | rustc +nightly - -Z self-profile -Z self-profile-counter=instructions-minus-irqs:u
warning: failed to create profiler: only supported with measureme's "nightly" feature I removed |
$ echo 'fn main() {}' | rustc +nightly-2022-07-23 - -Z self-profile -Z self-profile-counter=instructions-minus-irqs:u
warning: failed to create profiler: only supported with measureme's "nightly" feature
$ echo 'fn main() {}' | rustc +nightly-2022-07-24 - -Z self-profile -Z self-profile-counter=instructions-minus-irqs:u
thread 'opt hkqbnt908e7l8ng' panicked at 'assertion failed: end <= MAX_INTERVAL_VALUE', /cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/measureme-10.1.0/src/raw_event.rs:56:9 @jyn514 mentioned it never worked, to me, and... yeah, we added So this is the $ echo 'fn main() {}' | rustc +nightly-2022-07-24 - -Z self-profile -Z self-profile-counter=instructions-minus-irqs:u --emit=metadata
$ summarize summarize unknown-crate-*.mm_profdata
Segmentation fault (core dumped) That's a segfault in |
Note: this is a companion to rust-lang/measureme#143, and duplicates some information with it for convenience
(much later) EDIT: take any numbers with a grain of salt, they may have changed since initial PR open.
Credits
I'd like to start by thanking @alyssais, @cuviper, @edef1c, @glandium, @jix, @Mark-Simulacrum, @m-ou-se, @mystor, @nagisa, @puckipedia, and @yorickvP, for all of their help with testing, and valuable insight and suggestions.
Getting here wouldn't have been possible without you!
(If I've forgotten anyone please let me know, I'm going off memory here, plus some discussion logs)
Summary
This PR adds support to
-Z self-profile
for counting hardware events such as "instructions retired" (as opposed to being limited to time measurements), using therdpmc
instruction onx86_64
Linux.While other OSes may eventually be supported, preliminary research suggests some kind of kernel extension/driver is required to enable this, whereas on Linux any user can profile (at least) their own threads.
Supporting Linux on architectures other than x86_64 should be much easier (provided the hardware supports such performance counters), and was mostly not done due to a lack of readily available test hardware.
That said, 32-bit
x86
(akai686
) would be almost trivial to add and test once we land the initialx86_64
version (as all the CPU detection code can be reused).A new flag
-Z self-profile-counter
was added, to control which of the namedmeasureme
counters is used, and which defaults towall-time
, in order to keep-Z self-profile
's current functionality unchanged (at least for now).The named counters so far are:
wall-time
: the existing time measurementperf.rust-lang.org
std::time::Instant
for a nanosecond-precision "monotonic clock"instructions:u
: the hardware performance counter usually referred to as "Instructions retired":u
suffix is from the Linuxperf
tool and indicates the counter only runs while userspace code is executing, and therefore counts no kernel instructionsinstructions-minus-irqs:u
should be preferred insteadinstructions-minus-irqs:u
: same asinstructions:u
, except the count of hardware interrupts ("IRQs" here for brevity) is subtractedinstructions:u
instructions-minus-r0420:u
: experimental counter, same asinstructions-minus-irqs:u
but subtracting an undocumented counter (r0420:u
) instead of IRQsrXXXX
notation is again from Linuxperf
, and indicates a "raw" counter, with a hex representation of the low-level counter configuration - this was picked because we still don't really know what it isThere are also some additional commits:
see Challenges/Rebasing shouldn't affect the results, right? for details on the changes torustc_parse
andrustc_trait_section
(the latter far more dubious, and probably shouldn't be merged, or not as-is)see Challenges/jemalloc
: purging will commence in ten seconds for details on thejemalloc
changejemalloc
offers (assuming that can stop the timer that's already running, which I'm not sure about)-Z
flags, this commit has also been revertedproc_macro
change was to avoid randomized hashing and therefore ASLR-like effects(much later) EDIT: take any numbers with a grain of salt, they may have changed since initial PR open.
Write-up / report
Because of how extensive the full report ended up being, I've kept most of it on
hackmd.io
, but for convenient access, here are all the sections (with individual links):(someone suggested I'd make a backup, so here it is on the wayback machine - I'll need to remember to update that if I have to edit the write-up)
Motivation
Results
Preview (see the report itself for more details):
instructions-minus-irqs:u
(for all 1903881
counter reads)
(per each counter read)
instructions:u
instructions-minus-irqs:u
wall-time
Preview (see the report itself for more details):
wall-time
(ns)instructions:u
instructions-minus-irqs:u
typeck
expand_crate
mir_borrowck
mir_built
resolve_crate
Caveats
Challenges
SpecLockMap
or: "how we accidentally unlocked
rr
on AMD Zen"jemalloc
: purging will commence in ten seconds