-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[experiment/perf] Disable jemalloc's time-delayed purging, for extra determinism. #77162
Conversation
r? @lcnr (rust_highfive has picked a reviewer for you, use r? to override) |
@bors try (do not start perf run yet, @Mark-Simulacrum probably wants to tweak some settings before then) |
⌛ Trying commit 8534cd7 with merge 138021fc09ebe36d0582611f85c5e3d45e75abad... |
☀️ Try build successful - checks-actions, checks-azure |
@rust-timer build 138021fc09ebe36d0582611f85c5e3d45e75abad exclude=style-servo,cargo,webrender runs=6 This should give us 5 clean runs, 1 with self-profile, on a subset of the faster-running benchmarks. Hopefully. |
Fixing some bugs... @rust-timer build 138021fc09ebe36d0582611f85c5e3d45e75abad exclude=style-servo,cargo,webrender runs=6 |
Fixing some bugs... @rust-timer build 138021f exclude=style-servo,cargo,webrender runs=6 |
Queued 138021fc09ebe36d0582611f85c5e3d45e75abad with parent 87d262a, future comparison URL. |
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
Ignore those messages -- had to fix some bugs -- but we should be collecting now. I'm not quite sure how best to extract useful data from the results, though. We have raw values for the instructions stat for each of the runs (5 for the try build, previous runs usually have just 1-2, depending on benchmark). My initial attempt is to do a max-min and compare that delta across parent/child here, which is relatively easy to do but doesn't appear to yield any significant change from this PR.
And this is for "check", non-incremental builds. Will update with more data once it finishes. (Edit: collection done) Collected with |
Finished benchmarking try commit (138021fc09ebe36d0582611f85c5e3d45e75abad): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never |
I also added setarch -R, so maybe worth rerunning another run now to get that onto master as well. But not sure. |
I'd split benchmarks where the variance decreases from those where the range increases. EDIT: oh and, well, we need to know which take over 10 seconds, I think there shouldn't be any benefits for anything shorter, as (supposedly) the delayed purging mechanism shouldn't be kicking in before then anyway. |
I just checked and we only have two benchmarks that go over 10 seconds: This workaround is (unsurprisingly) also a performance degradation so we probably don't want to turn on by default anyway. Let's see if this does anything: |
Okay that does nothing, but I won't go through the trouble of doing another try build, especially with |
The rr script has now been run on the collection server fwiw. |
@eddyb where you thinking that we should merge this PR? I forget what the status is... |
Integrate measureme's hardware performance counter support. *Note: this is a companion to rust-lang/measureme#143, and duplicates some information with it for convenience* **(much later) EDIT**: take any numbers with a grain of salt, they may have changed since initial PR open. ## Credits I'd like to start by thanking `@alyssais,` `@cuviper,` `@edef1c,` `@glandium,` `@jix,` `@Mark-Simulacrum,` `@m-ou-se,` `@mystor,` `@nagisa,` `@puckipedia,` and `@yorickvP,` for all of their help with testing, and valuable insight and suggestions. Getting here wouldn't have been possible without you! (If I've forgotten anyone please let me know, I'm going off memory here, plus some discussion logs) ## Summary This PR adds support to `-Z self-profile` for counting hardware events such as "instructions retired" (as opposed to being limited to time measurements), using the `rdpmc` instruction on `x86_64` Linux. While other OSes may eventually be supported, preliminary research suggests some kind of kernel extension/driver is required to enable this, whereas on Linux any user can profile (at least) their own threads. Supporting Linux on architectures other than x86_64 should be much easier (provided the hardware supports such performance counters), and was mostly not done due to a lack of readily available test hardware. That said, 32-bit `x86` (aka `i686`) would be almost trivial to add and test once we land the initial `x86_64` version (as all the CPU detection code can be reused). A new flag `-Z self-profile-counter` was added, to control which of the named `measureme` counters is used, and which defaults to `wall-time`, in order to keep `-Z self-profile`'s current functionality unchanged (at least for now). The named counters so far are: * `wall-time`: the existing time measurement * name chosen for consistency with `perf.rust-lang.org` * continues to use `std::time::Instant` for a nanosecond-precision "monotonic clock" * `instructions:u`: the hardware performance counter usually referred to as "Instructions retired" * here "retired" (roughly) means "fully executed" * the `:u` suffix is from the Linux `perf` tool and indicates the counter only runs while userspace code is executing, and therefore counts no kernel instructions * *see [Caveats/Subtracting IRQs](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Subtracting-IRQs) for why this isn't entirely true and why `instructions-minus-irqs:u` should be preferred instead* * `instructions-minus-irqs:u`: same as `instructions:u`, except the count of hardware interrupts ("IRQs" here for brevity) is subtracted * *see [Caveats/Subtracting IRQs](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Subtracting-IRQs) for why this should be preferred over `instructions:u`* * `instructions-minus-r0420:u`: experimental counter, same as `instructions-minus-irqs:u` but subtracting an undocumented counter (`r0420:u`) instead of IRQs * the `rXXXX` notation is again from Linux `perf`, and indicates a "raw" counter, with a hex representation of the low-level counter configuration - this was picked because we still don't *really* know what it is * this only exists for (future) testing and isn't included/used in any comparisons/data we've put together so far * *see [Challenges/Zen's undocumented 420 counter](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Epilogue-Zen’s-undocumented-420-counter) for details on how this counter was found and what it does* --- There are also some additional commits: * ~~see [Challenges/Rebasing *shouldn't* affect the results, right?](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Rebasing-*shouldn’t*-affect-the-results,-right) for details on the changes to `rustc_parse` and `rustc_trait_section` (the latter far more dubious, and probably shouldn't be merged, or not as-is)~~ * **EDIT**: the effects of these are no long quantifiable, the PR includes reverts for them * ~~see [Challenges/`jemalloc`: purging will commence in ten seconds](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#jemalloc-purging-will-commence-in-ten-seconds) for details on the `jemalloc` change~~ * this is also separately found in rust-lang#77162, and we probably want to avoid doing it by default, ideally we'd use the runtime control API `jemalloc` offers (assuming that can stop the timer that's already running, which I'm not sure about) * **EDIT**: until we can do this based on `-Z` flags, this commit has also been reverted * the `proc_macro` change was to avoid randomized hashing and therefore ASLR-like effects --- **(much later) EDIT**: take any numbers with a grain of salt, they may have changed since initial PR open. #### Write-up / report Because of how extensive the full report ended up being, I've kept most of it [on `hackmd.io`](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view), but for convenient access, here are all the sections (with individual links): <sup>(someone suggested I'd make a backup, so [here it is on the wayback machine](http://web.archive.org/web/20201127164748/https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view) - I'll need to remember to update that if I have to edit the write-up)</sup> * [**Motivation**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Motivation) * [**Results**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Results) * [**Overhead**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Overhead) *Preview (see the report itself for more details):* |Counter|Total<br>`instructions-minus-irqs:u`|Overhead from "Baseline"<br>(for all 1903881<br>counter reads)|Overhead from "Baseline"<br>(per each counter read)| |-|-|-|-| |Baseline|63637621286 ±6|| |`instructions:u`|63658815885 ±2| +21194599 ±8| +11| |`instructions-minus-irqs:u`|63680307361 ±13| +42686075 ±19| +22| |`wall-time`|63951958376 ±10275|+314337090 ±10281|+165| * [**"Macro" noise (self time)**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#“Macro”-noise-(self-time)) *Preview (see the report itself for more details):* || `wall-time` (ns) | `instructions:u` | `instructions-minus-irqs:u` -: | -: | -: | -: `typeck` | 5478261360 ±283933373 (±~5.2%) | 17350144522 ±6392 (±~0.00004%) | 17351035832.5 ±4.5 (±~0.00000003%) `expand_crate` | 2342096719 ±110465856 (±~4.7%) | 8263777916 ±2937 (±~0.00004%) | 8263708389 ±0 (±~0%) `mir_borrowck` | 2216149671 ±119458444 (±~5.4%) | 8340920100 ±2794 (±~0.00003%) | 8341613983.5 ±2.5 (±~0.00000003%) `mir_built` | 1269059734 ±91514604 (±~7.2%) | 4454959122 ±1618 (±~0.00004%) | 4455303811 ±1 (±~0.00000002%) `resolve_crate` | 942154987.5 ±53068423.5 (±~5.6%) | 3951197709 ±39 (±~0.000001%) | 3951196865 ±0 (±~0%) * [**"Micro" noise (individual sampling intervals)**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#“Micro”-noise-(individual-sampling-intervals)) * [**Caveats**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Caveats) * [**Disabling ASLR**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Disabling-ASLR) * [**Non-deterministic proc macros**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Non-deterministic-proc-macros) * [**Subtracting IRQs**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Subtracting-IRQs) * [**Lack of support for multiple threads**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Lack-of-support-for-multiple-threads) * [**Challenges**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Challenges) * [**How do we even read hardware performance counters?**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#How-do-we-even-read-hardware-performance-counters) * [**ASLR: it's free entropy**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#ASLR-it’s-free-entropy) * [**The serializing instruction**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#The-serializing-instruction) * [**Getting constantly interrupted**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Getting-constantly-interrupted) * [**AMD patented time-travel and dubbed it `SpecLockMap`<br><sup> or: "how we accidentally unlocked `rr` on AMD Zen"</sup>**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#AMD-patented-time-travel-and-dubbed-it-SpecLockMapnbspnbspnbspnbspnbspnbspnbspnbspor-“how-we-accidentally-unlocked-rr-on-AMD-Zen”) * [**`jemalloc`: purging will commence in ten seconds**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#jemalloc-purging-will-commence-in-ten-seconds) * [**Rebasing *shouldn't* affect the results, right?**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Rebasing-*shouldn’t*-affect-the-results,-right) * [**Epilogue: Zen's undocumented 420 counter**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Epilogue-Zen’s-undocumented-420-counter)
When ASLR is disabled1, the default glibc allocator appears to behave deterministically2, but
jemalloc
does not, and its own non-determinism leaks into the addresses being allocated, effectively becoming a form of ASLR3.1 either system-wide or by running
rustc
undersetarch -R ...
2 at least in my limited testing, on
libcore
, check-only i.e. without any LLVM threads, though I should try single-threaded LLVM3 any variation in addresses causes even more non-determinism because of how
rustc
uses (Fx
)HashMap
s keyed on types that hash "by address" (such as interned references)(To be clear, this "non-determinism" doesn't affect compilation behavior/outputs, at least I'm not aware of a way that it does, but it does affect measurements such as
instructions:u
measured byperf.rust-lang.org
)Turns out that this behavior is caused by
jemalloc
defaulting to a 10 second delay before "purging"4 memory, and so forrustc
executions taking longer than 10 seconds, purging will start taking place, in unpredictable5 patterns.See
jemalloc
documentation: http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms.4 using
madvise
to inform the kernel that it can reuse the physical backing memory - AFAIK, if userspace tries to use that memory again, the kernel will allocate zeroed physical memory to it, having lost the original data5 the timer will not expire in the exact same place on every execution, so different parts of memory will be purged first, which cascades into variations in allocation/deallocation behavior in
jemalloc
This PR sets both
dirty_decay_ms
andmuzzy_decay_ms
to 0, which is documented as removing the reliance on timers:And I've confirmed that, at least for my testcase, the resulting
jemalloc
behavior is just as deterministic as the glibc allocator.However, I don't know the full performance implications of these settings, hence opening this PR to test it. I also want to see whether we can observe a reduction in variation between identical runs on
perf.rust-lang.org
.cc @Mark-Simulacrum