-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable ThinLTO with incremental compilation. #53673
Conversation
@bors try |
⌛ Trying commit ee14d4a5f2818fa96865ad03544716f2f124f32b with merge 2d7e52fcdbcbb66446270bc3f48b60f5ec50c119... |
This comment has been minimized.
This comment has been minimized.
ee14d4a
to
1d0f3fd
Compare
@bors try |
⌛ Trying commit 1d0f3fd0ae87ccd0f8823e045ad675a05f2926c3 with merge 937c465de94cf2edd249c4355619eda8fa8616ee... |
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
src/librustc_codegen_llvm/base.rs
Outdated
Err(e) => { | ||
let msg = format!("Error while trying to load ThinLTO import data \ | ||
for incremental compilation: {}", e); | ||
sess.fatal(&msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this perhaps fall back on returning a new ThinLTOImports
instance? It seems like if a previous compiler is ctrl-c'd at just the right time it could poison future compilers to return this error message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch.
@@ -983,6 +1006,9 @@ pub fn start_async_codegen(tcx: TyCtxt, | |||
allocator_config.emit_bc_compressed = true; | |||
} | |||
|
|||
modules_config.emit_pre_thin_lto_bc = | |||
need_pre_thin_lto_bitcode_for_incr_comp(sess); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be swapped with save_temps
above so -C save-temps
always emits it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foo.pre-thin-lto.bc
should actually be exactly the same as foo.thin-lto-input.bc
, so I didn't really make an effort to add it to the save-temps
output. Do you think it's worth the trouble?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean now. Yes, it should be swapped so we don't overwrite the value.
execute_copy_from_cache_work_item(cgcx, work_item, timeline) | ||
} | ||
work_item @ WorkItem::LTO(_) => { | ||
execute_lto_work_item(cgcx, work_item, timeline) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like each of these methods takes the bound work_item
but quickly unwraps it, could instead they be bound in this match
and the value passed in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(ping on this comment)
len: module.data().len(), | ||
}); | ||
serialized.push(module); | ||
module_names.push(name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks the same as the loop above, so could chain
be used to process both in one go? The modules_to_optimize
local variable looks like it can be hoisted above the loop too perhaps?
// the cache instead of having been recompiled... | ||
let current_imports = ThinLTOImports::from_thin_lto_data(data); | ||
|
||
// ... so we load this additional information from the previous |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I'm following what's going on here. Aren't all CGUs loaded into the ThinLTOData
instance? Is this perhaps an older comment?
AFAIK when we redo ThinLTO we have to unconditionally load the ThinLTO buffer for all CGUs coming inas input, so I can't quite figure out where some would be missing, but you can likely enlighten me!
pub fn save_to_file(&self, path: &Path) -> io::Result<()> { | ||
use std::io::Write; | ||
let file = File::create(path)?; | ||
let mut writer = io::BufWriter::new(file); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this and load_from_file
below I'd imagine that these maps are pretty small (on the order of CGUs, not symbols) which probably means that we can reasonable hold the entire contents of this serialized file in memory. In that case it's probably much faster to read/write the file in one go (and do all other operations in memory)
src/librustc_codegen_llvm/base.rs
Outdated
No, | ||
PreThinLto, | ||
PostThinLto, | ||
PostThinLtoButImportedFrom, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm slightly confused by this enum variant, but I think this is the same confusion that I had before perhaps?
If any CGU is either "no" reusable or pre-thin-lto, I think that means that all CGUs need to be loaded for the ThinLTO data collection stage.
In thinking about this as well, I think this function below may not work in general? I think we can only determine CGU post-thin-lto CGU reuse after the ThinLTO data is created, right? Put another way, I think the possible reuse states are:
- Everything is green, nothing has changed.
- All modified CGUs need to be re-codegen'd
- Afterwards, ThinLTOData is created, using the cached ThinLTO buffers for unmodified CGUs and freshly created buffers for re-codegen'd CGUs.
- Now there's a graph of CGU to CGUs-imported, as well as whether each CGU is red/green (green for cached, red for just codegen'd)
- Any red CGU is re-thin-LTO'd.
- Any green CGU which imports from a red CGU is re-thin-LTO'd
Here, before we create the ThinLTOData, I don't think we can determine that a green CGU only imports from other green CGUs? LLVM seems like it could do fancy things such as:
- Let's have three CGUs, A, B, and C.
- A/B are green and C is red
- Previously, A imported from B and not C
- Afterwards, though, A ends up importing from both B and C (for whatever reason)
I think that this classification below would mean that A is "as green as can be" but it actually needs to be re-thin-LTO'd?
I may also just be lost with the classification of names here...
src/librustc_codegen_llvm/base.rs
Outdated
}); | ||
true | ||
} | ||
CguReUsable::PostThinLtoButImportedFrom => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose to elaborate on my comment above, the way I expected this to work these two latter states wouldn't be possible. It seems like don't really need to handle the case that literally nothing changed as it's not so important. In that case we can assume something changed which means that everything will either be codegen'd or sent as a pre-thin-lto module to the backend. After the synchronization point we'd then make another decision about CGU reuse and such.
☀️ Test successful - status-travis |
@rust-timer build 937c465de94cf2edd249c4355619eda8fa8616ee |
Success: Queued 937c465de94cf2edd249c4355619eda8fa8616ee with parent 57e13ba, comparison URL. |
Yes, I can see one case where your A/B/C example would be handled sub-optimally: A references functions in both B and C, but in session 1 ThinLTO classifies no exported functions in C (and called from A) as potentially inlineable. Therefore the import data will show no edge from A to C. Then, in session 2, C is changed and now some function there has become small enough to be elegible for inlining. The algorithm in the PR would re-translate C (because it changed) but it would take the cached version of A since it has no edge to C. It would therefore not be able to inline functions from C into A although that might be possible now. There are a couple of factors that somewhat lessen the negative effect:
That being said, deferring the classification to until after the index is built would solve the problem reliably (and is probably more in line with how the linker-plugin works). Unless I'm overlooking something, it shouldn't be too hard to implement it this way fortunately |
OK, so the perf results (which don't contain the proposed changes yet but should be kind of valid anyway) look better than last time:
Some cases profit a lot from incr. comp. (e.g. |
1d0f3fd
to
8fdf3e6
Compare
@alexcrichton, I just pushed a commit that implements the algorithm as suggested by you. The code actually got simpler @bors try |
🔒 Merge conflict This pull request and the master branch diverged in a way that cannot be automatically merged. Please rebase on top of the latest master branch, and let the reviewer approve again. How do I rebase?Assuming
You may also read Git Rebasing to Resolve Conflicts by Drew Blessing for a short tutorial. Please avoid the "Resolve conflicts" button on GitHub. It uses Sometimes step 4 will complete without asking for resolution. This is usually due to difference between how Error message
|
737f1ef
to
21d05f6
Compare
I think all nits should be addressed now. I added some @bors r=alexcrichton |
📌 Commit 21d05f6 has been approved by |
…hton Enable ThinLTO with incremental compilation. This is an updated version of #52309. This PR allows `rustc` to use (local) ThinLTO and incremental compilation at the same time. In theory this should allow for getting compile-time improvements for small changes while keeping the runtime performance of the generated code roughly the same as when compiling non-incrementally. The difference to #52309 is that this version also caches the pre-LTO version of LLVM bitcode. This allows for another layer of caching: 1. if the module itself has changed, we have to re-codegen and re-optimize. 2. if the module itself has not changed, but a module it imported from during ThinLTO has, we don't need to re-codegen and don't need to re-run the first optimization phase. Only the second (i.e. ThinLTO-) optimization phase is re-run. 3. if neither the module itself nor any of its imports have changed then we can re-use the final, post-ThinLTO version of the module. (We might have to load its pre-ThinLTO version though so it's available for other modules to import from)
☀️ Test successful - status-appveyor, status-travis |
😲 |
What happened to the perf? Plenty of stuff turned very red, is this expected? |
Indeed, this had a calamitous effect on compile times for incremental opt builds, and I don't understand how this was deemed acceptable prior to landing. I think it should be backed out ASAP.
|
Yes, the effects on compile times were expected. Let me explain what's going on here: This PR enables a new combination of compiler settings (ThinLTO + incremental compilation) that we've wanted to have for years and that, as per the existing rules, is now selected as the default when doing optimized, incremental builds. The old behavior (optimized, incremental builds without the additional ThinLTO pass) is still available when compiling with Without ThinLTO, incremental opt builds produce much slower code. In many cases benchmarks performed 2-3 times worse because of reduced IPO opportunities. If that code is fast enough for your needs, great, but there was no way we could make incremental compilation the default for optimized builds in Cargo. With ThinLTO enabled this might change. Once this is part of a nightly compiler, we'll test what runtime performance of code produced this way looks like; if it's close enough to non-incremental builds, we can make incr. comp. the default for opt builds in Cargo, giving compile time reductions of 50-85% for small changes! Note that Cargo still defaults to non-incremental compilation for opt builds, so none of this will be visible to end users yet. |
Huh, ok.
If you look at the perf results for just this PR, there are no improvements. (The few green entries are almost certainly noise, belonging to benchmarks that have high variance.) |
Yeah, I was wondering why I hadn't seen those improvements in the try builds before But it looks like |
@michaelwoerister hm oh I also just realized, this didn't actually add any tests? Would it be possible to add a few incremental + optimized tests to exercise these code paths? (I don't think we can really test it works without a disassembly and brittle tests), but we can at least try to run it through the ringer! |
The existing incremental tests will actually test some of this when I'll think about how to test this some more. Maybe expand |
Oh nevermind then, carry on! So long as something broke when implementing this sounds like it's being exercised which is all I would look for :) |
@@ -1622,6 +1626,11 @@ extern "C" { | |||
Data: &ThinLTOData, | |||
Module: &Module, | |||
) -> bool; | |||
pub fn LLVMRustGetThinLTOModuleImports( | |||
Data: *const ThinLTOData, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be &ThinLTOData
.
I do have a few ideas for 2 or 3 tests. I'll make a PR this week, if I get to it. It requires exposing the different caching levels to the test framework. That's a good idea anyway but it's not totally trivial because of that. |
…nerics-for-incr-comp, r=alexcrichton incr.comp.: Don't automatically enable -Zshare-generics for incr. comp. builds. So far the compiler would automatically enable sharing of monomorphizations for incremental builds. That was OK because without (Thin)LTO this could have very little impact on the runtime performance of the generated code. However, since rust-lang#53673, ThinLTO and incr. comp. can be combined, so the trade-off is not as clear anymore. This PR removes the automatic tie between the two options. Whether monomorphizations are shared between crates or not now _only_ depends on the optimization level. r? @alexcrichton
…t-lto-imports, r=michaelwoerister save LTO import info and check it when trying to reuse build products Fix rust-lang#59535 Previous runs of LTO optimization on the previous incremental build can import larger portions of the dependence graph into a codegen unit than the current compilation run is choosing to import. We need to take that into account when we choose to reuse PostLTO-optimization object files from previous compiler invocations. This PR accomplishes that by serializing the LTO import information on each incremental build. We load up the previous LTO import data as well as the current LTO import data. Then as we decide whether to reuse previous PostLTO objects or redo LTO optimization, we check whether the LTO import data matches. After we finish with this decision process for every object, we write the LTO import data back to disk. ---- What is the scenario where comparing against past LTO import information is necessary? I've tried to capture it in the comments in the regression test, but here's yet another attempt from me to summarize the situation: 1. Consider a call-graph like `[A] -> [B -> D] <- [C]` (where the letters are functions and the modules are enclosed in `[]`) 2. In our specific instance, the earlier compilations were inlining the call to`B` into `A`; thus `A` ended up with a external reference to the symbol `D` in its object code, to be resolved at subsequent link time. The LTO import information provided by LLVM for those runs reflected that information: it explicitly says during those runs, `B` definition and `D` declaration were imported into `[A]`. 3. The change between incremental builds was that the call `D <- C` was removed. 4. That change, coupled with other decisions within `rustc`, made the compiler decide to make `D` an internal symbol (since it was no longer accessed from other codegen units, this makes sense locally). And then the definition of `D` was inlined into `B` and `D` itself was eliminated entirely. 5. The current LTO import information reported that `B` alone is imported into `[A]` for the *current compilation*. So when the Rust compiler surveyed the dependence graph, it determined that nothing `[A]` imports changed since the last build (and `[A]` itself has not changed either), so it chooses to reuse the object code generated during the previous compilation. 6. But that previous object code has an unresolved reference to `D`, and that causes a link time failure! ---- The interesting thing is that its quite hard to actually observe the above scenario arising, which is probably why no one has noticed this bug in the year or so since incremental LTO support landed (PR rust-lang#53673). I've literally spent days trying to observe the bug on my local machine, but haven't managed to find the magic combination of factors to get LLVM and `rustc` to do just the right set of the inlining and `internal`-reclassification choices that cause this particular problem to arise. ---- Also, I have tried to be careful about injecting new bugs with this PR. Specifically, I was/am worried that we could get into a scenario where overwriting the current LTO import data with past LTO import data would cause us to "forget" a current import. ~~To guard against this, the PR as currently written always asserts, at overwrite time, that the past LTO import-set is a *superset* of the current LTO import-set. This way, the overwriting process should always be safe to run.~~ * The previous note was written based on the first version of this PR. It has since been revised to use a simpler strategy, where we never attempt to merge the past LTO import information into the current one. We just *compare* them, and act accordingly. * Also, as you can see from the comments on the PR itself, I was quite right to be worried about forgetting past imports; that scenario was observable via a trivial transformation of the regression test I had devised.
…ts, r=mw save LTO import info and check it when trying to reuse build products Fix #59535 Previous runs of LTO optimization on the previous incremental build can import larger portions of the dependence graph into a codegen unit than the current compilation run is choosing to import. We need to take that into account when we choose to reuse PostLTO-optimization object files from previous compiler invocations. This PR accomplishes that by serializing the LTO import information on each incremental build. We load up the previous LTO import data as well as the current LTO import data. Then as we decide whether to reuse previous PostLTO objects or redo LTO optimization, we check whether the LTO import data matches. After we finish with this decision process for every object, we write the LTO import data back to disk. ---- What is the scenario where comparing against past LTO import information is necessary? I've tried to capture it in the comments in the regression test, but here's yet another attempt from me to summarize the situation: 1. Consider a call-graph like `[A] -> [B -> D] <- [C]` (where the letters are functions and the modules are enclosed in `[]`) 2. In our specific instance, the earlier compilations were inlining the call to`B` into `A`; thus `A` ended up with a external reference to the symbol `D` in its object code, to be resolved at subsequent link time. The LTO import information provided by LLVM for those runs reflected that information: it explicitly says during those runs, `B` definition and `D` declaration were imported into `[A]`. 3. The change between incremental builds was that the call `D <- C` was removed. 4. That change, coupled with other decisions within `rustc`, made the compiler decide to make `D` an internal symbol (since it was no longer accessed from other codegen units, this makes sense locally). And then the definition of `D` was inlined into `B` and `D` itself was eliminated entirely. 5. The current LTO import information reported that `B` alone is imported into `[A]` for the *current compilation*. So when the Rust compiler surveyed the dependence graph, it determined that nothing `[A]` imports changed since the last build (and `[A]` itself has not changed either), so it chooses to reuse the object code generated during the previous compilation. 6. But that previous object code has an unresolved reference to `D`, and that causes a link time failure! ---- The interesting thing is that its quite hard to actually observe the above scenario arising, which is probably why no one has noticed this bug in the year or so since incremental LTO support landed (PR #53673). I've literally spent days trying to observe the bug on my local machine, but haven't managed to find the magic combination of factors to get LLVM and `rustc` to do just the right set of the inlining and `internal`-reclassification choices that cause this particular problem to arise. ---- Also, I have tried to be careful about injecting new bugs with this PR. Specifically, I was/am worried that we could get into a scenario where overwriting the current LTO import data with past LTO import data would cause us to "forget" a current import. ~~To guard against this, the PR as currently written always asserts, at overwrite time, that the past LTO import-set is a *superset* of the current LTO import-set. This way, the overwriting process should always be safe to run.~~ * The previous note was written based on the first version of this PR. It has since been revised to use a simpler strategy, where we never attempt to merge the past LTO import information into the current one. We just *compare* them, and act accordingly. * Also, as you can see from the comments on the PR itself, I was quite right to be worried about forgetting past imports; that scenario was observable via a trivial transformation of the regression test I had devised.
This is an updated version of #52309. This PR allows
rustc
to use (local) ThinLTO and incremental compilation at the same time. In theory this should allow for getting compile-time improvements for small changes while keeping the runtime performance of the generated code roughly the same as when compiling non-incrementally.The difference to #52309 is that this version also caches the pre-LTO version of LLVM bitcode. This allows for another layer of caching: