Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental compilation results in linker error when method use is removed #59535

Closed
jonas-schievink opened this issue Mar 29, 2019 · 22 comments · Fixed by #67020
Closed

Incremental compilation results in linker error when method use is removed #59535

jonas-schievink opened this issue Mar 29, 2019 · 22 comments · Fixed by #67020
Assignees
Labels
A-incr-comp Area: Incremental compilation A-linkage Area: linking into static, shared libraries and binaries C-bug Category: This is a bug. P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-embedded Working group: Embedded systems

Comments

@jonas-schievink
Copy link
Contributor

See: https://github.com/jonas-schievink/rubble/compare/incr-linker-error

Build the first commit (jonas-schievink/rubble@04cf1d1), then the second (jonas-schievink/rubble@cedf7be) using cargo build, and you should get a linker error like this:

  = note: rust-lld: error: undefined symbol: _$LT$rubble..ble..time..Duration$u20$as$u20$core..fmt..Display$GT$::fmt::h343296729f83afc4
          >>> referenced by time.rs:129 (src/ble/time.rs:129)
          >>>               /home/jonas/dev/rubble/target/thumbv7em-none-eabi/debug/deps/rubble-a38354bf4561073f.3qtzuenakrd0qcaf.rcgu.o:(_$LT$$RF$T$u20$as$u20$core..fmt..Debug$GT$::fmt::h26ac05bd6f4f461e)

Note that this needs the thumbv7em-none-eabi target, so you might need to run rustup target add thumbv7em-none-eabi beforehand.

This is on rustc 1.33.0 (2aa4c46cf 2019-02-28).
Also reproduces on rustc 1.35.0-nightly (237bf3244 2019-03-28).

@jonas-schievink jonas-schievink added A-linkage Area: linking into static, shared libraries and binaries T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-incr-comp Area: Incremental compilation C-bug Category: This is a bug. labels Mar 29, 2019
@dtolnay
Copy link
Member

dtolnay commented Jun 19, 2019

I minimized this to ~60 lines in jonas-schievink/rubble#62.

@dtolnay
Copy link
Member

dtolnay commented Jun 19, 2019

#61917 looks like it reproduces this in a non-Cargo environment. Some observations from that issue:

  • It only happens with incremental builds. If I either remove the incremental DB or disable incremental builds, it works
  • It only happens at opt-level=1; opt-level=0 and opt-level=3 are fine
  • It goes away if I make small code changes, but it doesn't seem to matter what (which says incremental)
  • Changing codegen-units doesn't affect anything.

@jsgf

@pnkfelix
Copy link
Member

pnkfelix commented Jun 27, 2019

pre-triage: P-high; leaving nominated in hopes it increases chance that we discuss it today

@pnkfelix pnkfelix added the P-high High priority label Jun 27, 2019
@pnkfelix
Copy link
Member

discussed at T-compiler meeting. Tentatively assigning to @Zoxc for initial investigation.

@Zoxc, if you do not expect to be able to do much investigation, let me know (and unassign yourself)

@Zoxc
Copy link
Contributor

Zoxc commented Jun 27, 2019

Is this specific to the thumbv7em-none-eabi target?

@dtolnay
Copy link
Member

dtolnay commented Jun 27, 2019

Not specific to thumbv7em-none-eabi. @jsgf told me #61917 happened when building for x86_64-unknown-linux-gnu. We don't yet have a reliable repro on x86 ("build this commit followed by this commit") and the project in #61917 is not using Cargo so minimizing from there may be more complicated.

@pnkfelix
Copy link
Member

triage: Reassigning from @Zoxc to self.

@pnkfelix pnkfelix assigned pnkfelix and unassigned Zoxc Sep 12, 2019
@pnkfelix
Copy link
Member

pnkfelix commented Oct 29, 2019

spent a little while trying to reproduce this today, and failed.

Of note: even just trying to do cargo +nightly-2019-03-29 build atop jonas-schievink/rubble#62, I am getting

error: language item required, but not found: eh_personality

I tried to work around this (by adding my own fake definition for eh_personality), but that just led linking problems for the basic case:

  = note: /usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/9/../../../../lib64/Scrt1.o: in function `_start':
          (.text+0x16): undefined reference to `__libc_csu_fini'
          /usr/bin/ld: (.text+0x1d): undefined reference to `__libc_csu_init'
          /usr/bin/ld: (.text+0x2a): undefined reference to `__libc_start_main'
          /usr/bin/ld: /home/pnkfelix/.rustup/toolchains/nightly-2019-03-29-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcore\
-ed8b00cc009d913d.rlib(core-ed8b00cc009d913d.core.95lhmtlm-cgu.0.rcgu.o): in function `equal<u8>':
          /rustc/237bf3244fffef501cf37d4bda00e1fce3fcfb46//src/libcore/slice/mod.rs:5074: undefined reference to `memcmp'

(I assume there's some cross-compilation system library support that I'm missing.)

@pnkfelix
Copy link
Member

pnkfelix commented Nov 6, 2019

Okay, I managed to now replicate it atop @dtolnay 's reduction from jonas-schievink/rubble#62 (I was not installing the thumb target correctly before this point; I was probably not doing it with rustup target add --toolchain or something to synchronize it with my rustc version.)

@pnkfelix
Copy link
Member

pnkfelix commented Nov 8, 2019

Here is a repo with my (further-reduced) reduction:
https://github.com/pnkfelix/rust-issue-59535

I got the whole thing down to a chain of three crates: a 13-line crate at the root, a 65-line crate at the leaf, and a .... 1643-line crate in the middle...

  • Unfortunately the nature of the bug makes it quite difficult to remove any of those 1643 lines
  • "Fortunately" the majority of the lines (over 1000 of them) are impl R { }. Which is ... interesting...

It still requires the thumbv7em-none-eabi target.


At this point I think I've gotten it at small as I can with my limited knowledge of embedded rust.

So now I'll switch to trying to understand what's going on with the code-generation here.

@pnkfelix
Copy link
Member

pnkfelix commented Nov 8, 2019

(for my narrowed test case, passing -C codegen-units=1 does seem to make the problem go away, which could be interpreted as contradicting the last bullet in an earlier comment. but given the heisenbug nature of this issue, that could just be masking over the problem rather than showing something fundamental here...)

@therealprof
Copy link
Contributor

At this point I think I've gotten it at small as I can with my limited knowledge of embedded rust.

Embedded Rust is pretty much regular Rust sans magic done by the operating system and libc in a regular application. The reduced 1.6k middle crate is just an abstraction over fixed bits at fixed memory addresses to talk to built-in peripherals (data busses, timers, I/O pins, radios...) generated from a chip manufacturer provided register description.

Having multiple impl R { }s on the same struct R in the same module sounds quite wrong, is that even allowed? I could imagine that this confuses the heck out of the name mangling.

@jonas-schievink jonas-schievink added the WG-embedded Working group: Embedded systems label Nov 9, 2019
@pnkfelix
Copy link
Member

Having multiple impl R { }s on the same struct R in the same module sounds quite wrong, is that even allowed? I could imagine that this confuses the heck out of the name mangling.

Its absolutely allowed. See:

https://doc.rust-lang.org/stable/book/ch05-03-method-syntax.html#multiple-impl-blocks

@therealprof
Copy link
Contributor

@pnkfelix Yes, I have learned that last weekend in Barcelona and makes sense to me. Should have updated my comment to not waste your time, sorry about that.

It's still kind of weird that Rust accepts multiple identical impls. That sounds like something which should possibly be at least a lint.

@pnkfelix
Copy link
Member

pnkfelix commented Nov 12, 2019

Current status: using -C save-temps and llvm-dis, I have been inspecting the contents of the distinct codegen units that rustc generates.

As a reminder, the undefined symbol is: _$LT$rubble..ble..Duration$u20$as$u20$core..fmt..Display$GT$::fmt::hcaf51ebe05a39b3b, (referenced by mks4f5lxe2i4sqc in /tmp/issue59535/out/rubble2/rubble.mks4f5lxe2i4sqc.rcgu.o). So searching for hcaf51ebe05a39b3b in the disassembled bitcode files (*.ll) seemed like a reasonable step to take.

From there, I discovered that rubble.3bvob4s3e3jl7a31.rcgu.no-opt.ll has a defintion for the aforementioned symbol:

; Function Attrs: nounwind optsize
define internal zeroext i1 @"_ZN60_$LT$rubble..ble..Duration$u20$as$u20$core..fmt..Display$GT$3fmt17hcaf51ebe05a39b3bE"(i32* noalias readonly align 4 dereferenceable(4), %"core::fmt::Formatter"* align 4 dereferenceable(52)) unnamed_addr #1 {
...
}

but rubble.3bvob4s3e3jl7a31.rcgu.ll is missing that definition. So I think that means the definition is getting eliminated by llvm optimization passes.

One thing that stood out to me here: The function definition in the llvm bitcode says define internal. But this definition is referenced by another codegen unit. Surely that should not be internal ? (Maybe this is why the LLVM optimization passes are able to justify removing it?)

Still, seems odd that this only crops up during this combination of factors (namely, the fact that incremental compilation seems to be involved...)

  • Ah: The first clean run, that didn't have incremental build products, does not mark the function definition as internal; that definition is merely marked hidden
define hidden zeroext i1 @"_ZN60_$LT$rubble..ble..Duration$u20$as$u20$core..fmt..Display$GT$3fmt17hcaf51ebe05a39b3bE"(i32* noalias readonly align 4 dereferenceable(4), %"core::fmt::Formatter"* align 4 dereferenceable(52)) unnamed_addr #1 {
...
}

@pnkfelix
Copy link
Member

@pnkfelix Yes, I have learned that last weekend in Barcelona and makes sense to me. Should have updated my comment to not waste your time, sorry about that.

It's still kind of weird that Rust accepts multiple identical impls. That sounds like something which should possibly be at least a lint.

Well, multiple identical non-trivial impl blocks is impossible: You get "duplicate definition" errors.

And multiple non-identical impl blocks is useful, e.g. for code-grouping. As mentioned in a very recent tweet:

rust documentation code layout idea: group functions together in separate impl blocks, even if the blocks have the same trait bounds or are in the same file

then apply doc comments to the impl blocks that rustdoc can show to create "sections"

— a quiet misdreavus (@QuietMisdreavus) November 11, 2019

So I think the only case here that one might want to lint for is trivial (i.e. empty) impl blocks. (But even there I would want to exercise care to not lint cases where all the items in the block were removed due to cfg attributes.)

@pnkfelix
Copy link
Member

More data, after adding some debug! instrumentation:

The module rustc_mir::monomorphize::partitioning has fn internalize_symbols, which tries to mark symbols internal that are only referenced by the current codegen unit.

The first (i.e. clean) run does not mark rubble[317d]::ble[0]::{{impl}}[0]::fmt[0] as internal, because it has these two accessors:

  • Fn(Instance { def: Item(DefId(0:19 ~ rubble[317d]::ble[0]::{{impl}}[1]::fmt[0])), substs: [] }),
  • Fn(Instance { def: Item(DefId(0:6 ~ rubble[317d]::ble[0]::link[0]::process_data_packet[0])), substs: [] })

The second run with incremental build produces and an empty process_data_packet definition does mark rubble[317d]::ble[0]::{{impl}}[0]::fmt[0] as internal, while noting that it has this one accessor:

  • Fn(Instance { def: Item(DefId(0:17 ~ rubble[317d]::ble[0]::{{impl}}[1]::fmt[0])), substs: [] })

The logic of the relevant code is here:

if let Some(accessors) = accessor_map.get(accessee) {
if accessors.iter()
.filter_map(|accessor| {
// Some accessors might not have been
// instantiated. We can safely ignore those.
mono_item_placements.get(accessor)
})
.any(|placement| *placement != home_cgu) {
// Found an accessor from another CGU, so skip to the next
// item without marking this one as internal.
continue
}
}

(The debug print outs above do not include the codegen unit placements for the accessors, so I do not actually know for sure how much sense all of this makes. But it certainly makes sense that the aforementioned change to process_data_packet would cause it to be removed from the accessors list, so there is no smoking gun here yet...)

@pnkfelix
Copy link
Member

pnkfelix commented Nov 12, 2019

More data after adding instrumentation of the codegen units involved: The codegen unit that contains the unresolved reference is named mks4f5lxe2i4sqc. It has a single fn item (one of the fmt fns, specifically core::fmt[0]::{{impl}}[49]::fmt[0]<rubble::ble[0]::Duration[0]>), and that fn item accesses rubble[317d]::ble[0]::{{impl}}[1]::fmt[0].

Note that rubble[317d]::ble[0]::{{impl}}[1]::fmt[0] (which belongs to codegen unit 3bvob4s3e3jl7a31) has previously been found to be an accessor of rubble[317d]::ble[0]::{{impl}}[0]::fmt[0] (whcih belongs to the same codegen unit). The latter item is the item that being marked as Internal and subsequently has its definition optimized away by LLVM.

In short: it looks like we have a call chain of methods: A calls B calls C, A is in one code-gen unit, B and C are in another, and we make the decision to mark C internal (since it is only reachable from B, in the same codegen-unit), but independently, the compiler (presumably LLVM) has inlined the body of B into A, and so we now have a call from A to an internal method (C) in a distinct codegen-unit.

Is there some sort of inlining happening here that implies we need to do a deeper traversal of the accessor-chain before we commit to marking a reachable item as internal?

  • From looking at the pre- and post-optimized versions of rubble.mks4f5lxe2i4sqc.rcgu.ll, I am thinking that this seems like strong possibility.
  • Another thing to note: it seems to me like the inlining in question is happening for both the first (clean) build and also for the the second (reusing) build.
    • (Is it possible that the second build is reusing the object code for the mks4f5lxe2i4sqc codegen unit from the first build? Is that somehow relevant, or a red herring?)
    • In particular: Using the simplified A,B,C model I summarized above, I would have imagined that LLVM could not inline the body of B into A since it is making a call to the internal function C. But maybe the issue is that we are reusing the results of a previous compile where it was legal to inline B into A (because during that compile, C was not marked internal), and independently we are now deciding to recategorize C as internal.
      • Note however that I also would have thought that LLVM could not inline the body of B into A because they belong to distinct codegen units. So there are multiple points of personal confusion here for me to resolve in my theories above.
      • Further update: it seems like the inlining of B into A is happening during link-time optimization (specifically ThinLTO), which does seem like it would jibe with the rules for link-time optimization...

@pnkfelix
Copy link
Member

Okay, further investigation: rustc itself controls whether we can reuse the post-ThinLTO version of a module (or if we have to go back and reoptimize it).

And, thank goodness, we already had debug instrumentation in place to emit information about such reuse:

[DEBUG rustc::dep_graph::cgu_reuse_tracker] set_actual_reuse("37it3t5gwe6y9f2", PostLto)
[INFO  rustc_codegen_llvm::back::lto]  - mks4f5lxe2i4sqc: re-used
...
  = note: lld: error: undefined symbol: _$LT$rubble..ble..Duration$u20$as$u20$core..fmt..Display$GT$::fmt::hcaf51ebe05a39b3b
          >>> referenced by mks4f5lxe2i4sqc
          >>>               /tmp/issue59535/out/rubble2/rubble.mks4f5lxe2i4sqc.rcgu.o:(_$LT$$RF$T$u20$as$u20$core..fmt..Debug$GT$::fmt::he13cc5d4e65d5ab9)

In other words: This lends credence to my theory that we are reusing an artifact from a previous incremental compile that should have been invalidated by the changes to what symbols were marked internal by rustc_mir::monomorphize::partitioning::internalize_symbols

The code that dictates whether a post-ThinLTO object file for a codegen unit can be reused is here:

info!("checking which modules can be-reused and which have to be re-optimized.");
for (module_index, module_name) in shared.module_names.iter().enumerate() {
let module_name = module_name_to_str(module_name);
// If the module hasn't changed and none of the modules it imports
// from has changed, we can re-use the post-ThinLTO version of the
// module.
if green_modules.contains_key(module_name) {
let imports_all_green = import_map.modules_imported_by(module_name)
.iter()
.all(|imported_module| green_modules.contains_key(imported_module));
if imports_all_green {
let work_product = green_modules[module_name].clone();
copy_jobs.push(work_product);
info!(" - {}: re-used", module_name);
cgcx.cgu_reuse_tracker.set_actual_reuse(module_name,
CguReuse::PostLto);
continue
}
}
info!(" - {}: re-compiled", module_name);
opt_jobs.push(LtoModuleCodegen::Thin(ThinModule {
shared: shared.clone(),
idx: module_index,
}));
}

I'm going to read over this more carefully and see if I can figure out how to make it aware of the impact of internalize_symbols....

@pnkfelix
Copy link
Member

@michaelwoerister and I are continuing to investigate this. (zulip archive)

My most recent hypothesis as to what the bug is (transcribed here):

lto.rs, when it decides which modules can be reused in PostLTO form, looks at the import-list and sees if they are all on the green modules list. But it is looking at the import list that it currently gets from LLVM. Should we be instead (or additionally) looking at the old import list, from the time that we did the LTO optimization that we are reusing? (Presumably we would just serialize the ThinLTOImports and re-load that, or add it to the dep-graph maybe.)

Specifically:

let import_map = if cgcx.incr_comp_session_dir.is_some() {
ThinLTOImports::from_thin_lto_data(data)
} else {

is relying on this:

let mut map = ThinLTOImports::default();
llvm::LLVMRustGetThinLTOModuleImports(data,
imported_module_callback,
&mut map as *mut _ as *mut libc::c_void);
map
}

which reflects the whatever the current LTO imports are computed, which may differ from the set of imports that were used to drive the original LTO optimization that we are reusing. In particular, they may differ by being a subset of the original set of imports, which means we can overlook hidden dependencies.

Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this issue Dec 19, 2019
…t-lto-imports, r=michaelwoerister

save LTO import info and check it when trying to reuse build products

Fix rust-lang#59535

Previous runs of LTO optimization on the previous incremental build can import larger portions of the dependence graph into a codegen unit than the current compilation run is choosing to import. We need to take that into account when we choose to reuse PostLTO-optimization object files from previous compiler invocations.

This PR accomplishes that by serializing the LTO import information on each incremental build. We load up the previous LTO import data as well as the current LTO import data. Then as we decide whether to reuse previous PostLTO objects or redo LTO optimization, we check whether the LTO import data matches. After we finish with this decision process for every object, we write the LTO import data back to disk.

----

What is the scenario where comparing against past LTO import information is necessary?

I've tried to capture it in the comments in the regression test, but here's yet another attempt from me to summarize the situation:

 1. Consider a call-graph like `[A] -> [B -> D] <- [C]` (where the letters are functions and the modules are enclosed in `[]`)
 2. In our specific instance, the earlier compilations were inlining the call to`B` into `A`; thus `A` ended up with a external reference to the symbol `D` in its object code, to be resolved at subsequent link time. The LTO import information provided by LLVM for those runs reflected that information: it explicitly says during those runs, `B` definition and `D` declaration were imported into `[A]`.
 3. The change between incremental builds was that the call `D <- C` was removed.
 4. That change, coupled with other decisions within `rustc`, made the compiler decide to make `D` an internal symbol (since it was no longer accessed from other codegen units, this makes sense locally). And then the definition of `D` was inlined into `B` and `D` itself was eliminated entirely.
  5. The current LTO import information reported that `B` alone is imported into `[A]` for the *current compilation*. So when the Rust compiler surveyed the dependence graph, it determined that nothing `[A]` imports changed since the last build (and `[A]` itself has not changed either), so it chooses to reuse the object code generated during the previous compilation.
  6. But that previous object code has an unresolved reference to `D`, and that causes a link time failure!

----

The interesting thing is that its quite hard to actually observe the above scenario arising, which is probably why no one has noticed this bug in the year or so since incremental LTO support landed (PR rust-lang#53673).

I've literally spent days trying to observe the bug on my local machine, but haven't managed to find the magic combination of factors to get LLVM and `rustc` to do just the right set of the inlining and `internal`-reclassification choices that cause this particular problem to arise.

----

Also, I have tried to be careful about injecting new bugs with this PR. Specifically, I was/am worried that we could get into a scenario where overwriting the current LTO import data with past LTO import data would cause us to "forget" a current import. ~~To guard against this, the PR as currently written always asserts, at overwrite time, that the past LTO import-set is a *superset* of the current LTO import-set. This way, the overwriting process should always be safe to run.~~
 * The previous note was written based on the first version of this PR. It has since been revised to use a simpler strategy, where we never attempt to merge the past LTO import information into the current one. We just *compare* them, and act accordingly.
 * Also, as you can see from the comments on the PR itself, I was quite right to be worried about forgetting past imports; that scenario was observable via a trivial transformation of the regression test I had devised.
@bors bors closed this as completed in ccd2383 Dec 21, 2019
@rhit-sashiks
Copy link

I still see this issue

@bjorn3
Copy link
Member

bjorn3 commented Oct 26, 2024

If you mean #45929 (comment), that has almost certainly a different root cause that what this issue is about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-incr-comp Area: Incremental compilation A-linkage Area: linking into static, shared libraries and binaries C-bug Category: This is a bug. P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-embedded Working group: Embedded systems
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants