-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
32 codegen units may not always be better at -O0 #44941
Comments
@mersinvald unfortunately the perf files you mentioned I can't read locally, and the I wonder, could you try investigating if one crate in particular takes much longer? Maybe take some of the crates that take longest to compile (sorting the rlibs by size is typically a rough measurement for this) and see how long they take? I'm hoping that this is one crate hitting an accidentally pathological case, but it may also just be a case where everything slows down with 32 codegen units (even though we don't intend that). One other possiblility may be the linker? (like |
I tested compiling regex with different number of codegen units on my 2 core Pentium E2200.
It gets a decent win with the number of codegen units equal the number of cores, but gets increasingly slower with more than that. Here are the results for rustfmt. These are a bit more wild. It's actually faster on 4 units than 2, but gets increasingly slower after that, up to the point where 32 units is actually slower than 1 unit.
|
Testing with winapi 0.3 on a Ryzen 7 1700X.
|
There is a pretty noticable change in compile times on perf.rust-lang.org with the commit that changed the default number as well. |
I'm going to tag this as a regression until we can figure out what the spike is |
Ok looking more into the data from perf.rust-lang.org. We're measuring instructions executed and the "cpu-clock" measurement. Instructions certainly don't have much bearing on wall time (but serve as great indicators of when to look for a regression) and apparently cpu-clock is not the same as wall time. I believe the I'm trying to rationalize the numbers we're seeing and it's not quite making sense. The regressions shown on perf.rust-lang.org imply to me that if I pin the process to one CPU we'll see that much increase in compile time, but I see virtually no increase in compile time locally. In fact, if I pin cargo/rustc to one CPU it typically gets faster with more cgus presumably because the objects are a little quicker to create overall. Basically I'm having a lot of difficulty reproducing any slowdowns here. Assistance investigating why things are slowing down as opposed to just "things are slowing down" would be much appreciated. |
I believe these are all the reported regressions from this change |
Some more investigation I found locally I was able to reproduce some slowdowns with the syntex_syntax crate:
Here "wall time" is the wall time it took to compile the There's a clear tend here that parallelization is helping but also taking up more CPU time. Namely increasing the number of codegen units is not purely dividing the work, it's also increasing the total amount of work being done. This becomes particularly noticeable with more codegen units limited to one core on this big crate. So on one hand there's certainly some overhead from more codegen units. We're shuffling more files around (not copying, just working with), spawning more threads, more messages, more synchronization, etc. I don't think that these factors would cause such a drastic increase in compile times, however. Instead what I think may be happening here (that I have just now remembered) is how we treat @michaelwoerister correct me if I'm wrong, but I believe that we'll inline any As a result my conclusion is that increasing the number of codegen units increases the amount of work the compiler does, and hence parallelism isn't always enough to recover this loss in the increase of work done. In the worst case with N codegen units you can probably do N times more work, and with less than N cores you'll notice the increase in compile times. @michaelwoerister do you agree with the conclusions here? If so, do you think we could treat I'm personally hesitant to set the number of codegen units based on the amount of cores on the machine. I think that may return optimal performance (perhaps 2x the number of cores cgus) because the extra work needed for each codegen unit I think should come out in the wash because of parallelization. This, however, would make builds across machines nondeterministic if they've got different numbers of cpus, forcing deterministic compiles to explicitly specify the number of codegen units. My personal preference for a solution right now is:
|
@alexcrichton, Yes I think your assumptions and conclusions are correct. I also think that
We probably mean the same thing here: I would suggest treating them the same as regular polymorphic functions:
That might require some special handling in the symbol hash to avoid conflicts between crates, just as we do for generic functions at the moment: rust/src/librustc_trans/back/symbol_names.rs Lines 194 to 201 in ed1cffd
If performance turns out to be unbearable, we maybe could have a heuristic to do some inlining:
Seems like a sensible approach. Other solutions?I wonder if ThinLTO would allow us to completely get rid of duplicating bodies of inline functions across codegen units during trans. That would be awesome for incremental compilation and it would also help in the non-incremental case. |
Rationale explained in the included comment as well as rust-lang#44941
Some excellent news! I implemented the strategy of "only inline functions into one codegen unit" and the updating timings I get are:
In other words, I think this indeed makes a huge dent in the issue of "more cgus causes rustc to do more work" problem. I'm going to send a PR shortly implementing this. @michaelwoerister I also think that you're correct in that ThinLTO should eliminate the need for inlining functions into all codegen units unconditionally. Just gotta land ThinLTO and test it out first :) |
Ok I've created a PR at #45075 @mersinvald and @crumblingstatue, if y'all could test out that PR locally that'd be great! My hope is that it should make 32 codegen units continue to be better than 1, and perhaps give us flexibility to increase the number of cgus in the future. |
This commit tweaks the behavior of inlining functions into multiple codegen units when rustc is compiling in debug mode. Today rustc will unconditionally treat `#[inline]` functions by translating them into all codegen units that they're needed within, marking the linkage as `internal`. This commit changes the behavior so that in debug mode (compiling at `-O0`) rustc will instead only translate `#[inline]` functions into *one* codegen unit, forcing all other codegen units to reference this one copy. The goal here is to improve debug compile times by reducing the amount of translation that happens on behalf of multiple codegen units. It was discovered in rust-lang#44941 that increasing the number of codegen units had the adverse side effect of increasing the overal work done by the compiler, and the suspicion here was that the compiler was inlining, translating, and codegen'ing more functions with more codegen units (for example `String` would be basically inlined into all codegen units if used). The strategy in this commit should reduce the cost of `#[inline]` functions to being equivalent to one codegen unit, which is only translating and codegen'ing inline functions once. Collected [data] shows that this does indeed improve the situation from [before] as the overall cpu-clock time increases at a much slower rate and when pinned to one core rustc does not consume significantly more wall clock time than with one codegen unit. One caveat of this commit is that the symbol names for inlined functions that are only translated once needed some slight tweaking. These inline functions could be translated into multiple crates and we need to make sure the symbols don't collideA so the crate name/disambiguator is mixed in to the symbol name hash in these situations. [data]: rust-lang#44941 (comment) [before]: rust-lang#44941 (comment)
…erister rustc: Reduce default CGUs to 16 Rationale explained in the included comment as well as #44941
rustc: Don't inline in CGUs at -O0 This commit tweaks the behavior of inlining functions into multiple codegen units when rustc is compiling in debug mode. Today rustc will unconditionally treat `#[inline]` functions by translating them into all codegen units that they're needed within, marking the linkage as `internal`. This commit changes the behavior so that in debug mode (compiling at `-O0`) rustc will instead only translate `#[inline]` functions into *one* codegen unit, forcing all other codegen units to reference this one copy. The goal here is to improve debug compile times by reducing the amount of translation that happens on behalf of multiple codegen units. It was discovered in #44941 that increasing the number of codegen units had the adverse side effect of increasing the overal work done by the compiler, and the suspicion here was that the compiler was inlining, translating, and codegen'ing more functions with more codegen units (for example `String` would be basically inlined into all codegen units if used). The strategy in this commit should reduce the cost of `#[inline]` functions to being equivalent to one codegen unit, which is only translating and codegen'ing inline functions once. Collected [data] shows that this does indeed improve the situation from [before] as the overall cpu-clock time increases at a much slower rate and when pinned to one core rustc does not consume significantly more wall clock time than with one codegen unit. One caveat of this commit is that the symbol names for inlined functions that are only translated once needed some slight tweaking. These inline functions could be translated into multiple crates and we need to make sure the symbols don't collideA so the crate name/disambiguator is mixed in to the symbol name hash in these situations. [data]: #44941 (comment) [before]: #44941 (comment)
@mersinvald can you try re-testing with tonight's nightly which should have some improvements? |
@alexcrichton sure, will do it tomorrow. |
@alexcrichton I've ran the tests on the I've also measured the hot build time, i.e only 200LOC crate + link-time.
|
@mersinvald ok I've done some further investigation into your benchmark. The setup was:
Build from scratch Here I compiled the
There's a suspicious jump at 2 CGUs where the compilation takes longer (with ThinLTO) but it quickly flattens out, the time improving with more and more codegen units thrown at it. Additionally in debug mode we see wins across the board as soon as codegen units are enabled. Incremental build Here I compiled the
Here like before debug builds leveled off pretty quickly, benefitting a good amount from multiple CGUs. Release builds didn't fare so well after 8 CGUs, however, but still show improvements across the board from the original build time with just one CGU. Again though there's a jump at 2 CGUs in release mode which is a little suspicious @mersinvald can you try running with that script and similar settings to guarantee that the only difference in the data collection here is hardware? |
Are the incremental crate-only builds 10-15x slower than full non-incremental builds? |
oh oops, I mixed up the tables |
@alexcrichton rustc 1.22.0-nightly (4e9527c 2017-10-16)
|
Awesome, thanks @mersinvald! So reading that I'd conclude that in debug mode 16cgus is a unversal win over 1 cgu (although slightly worse than 8/4 cgus in some cases). In that sense, I think we can close this issue? Otherwise it looks like the proposal for turning on multiple cgus by default in release mode means that 16 cgus is universally better than 1 cgu in terms of compile time if you only compile the last crate. If you compile the entire crate graph it looks like multiple cgus regreses the build time? |
👍 |
Ok! |
Continuing discussion from #44853 (comment)
The text was updated successfully, but these errors were encountered: