Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reloading a dylib stops working when moving from 1.19.0 to 1.20.0 #44365

Closed
Ryan1729 opened this issue Sep 6, 2017 · 39 comments
Closed

Reloading a dylib stops working when moving from 1.19.0 to 1.20.0 #44365

Ryan1729 opened this issue Sep 6, 2017 · 39 comments
Labels
A-FFI Area: Foreign function interface (FFI) A-linkage Area: linking into static, shared libraries and binaries C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Ryan1729
Copy link
Contributor

Ryan1729 commented Sep 6, 2017

I have been writing programs which check if a dynamic library's modification time has changed and reload it if so. This allows me to interact with the program, producing transient state but then make changes to the dynamic libraries code and load it to see the result of the changes on the transient state without closing the program.

This feature of my programs has stopped working since I updated to 1.20.0. If I switch back to 1.19.0 with rustup default 1.19.0 then it works again. Switching back with rustup default stable breaks it again.

I've made a small example program that does the lib reloading which you can see here. On 1.19.0 if I launch the program and change, say, this line to something like state.counter += 1000; and rebuild the dynamic library, (or just the whole crate,) the running version of the program will eventually notice the different modification time and start incrementing the counter by 1000 instead of 1. On 1.20.0 this no longer works. On this branch in particular you can see that the new modification time is noticed and the library reloading code is run but it doesn't do anything.

Looking through the patch notes for 1.20.0 I noticed that ManuallyDrop had been stabilized. So I tried using it in this branch but it didn't cause any (noticeable) change in behaviour, (besides compile errors on 1.19.0 of course.)

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Sep 6, 2017

Dylibs, like rlibs, have a completely unstable ABI. Among countless other problems, the compiler is free to inline code from a dylib into crates using the dylib, and will in fact do this for #[inline] and generic functions -- which of course breaks hot-reloading. (I'm not saying this is what happened here, that seems a bit unlikely given that you use libloading, it's just one example of how dylib is really more like an rlib than a traditional shared library.)

A reliable way to do this (insofar any hot-reloading can be reliable) would be using the cdylib crate type and interact with with the shared object through a C interface with a stable ABI. That is, declare extern fns and repr(C) structs in the cdylib, and bind to those as you would to a C library (e.g., get function pointers with an extern ABI from libloading).

(You can share type definitions if you structure your code accordingly, and the function bindings might be easier to generate automatically than with C.)

@Ryan1729
Copy link
Contributor Author

Ryan1729 commented Sep 7, 2017

@rkruppe I believe I've followed your suggestion in this branch, (diff here), please let me know if I missed anything!

The code in that branch doesn't reload the library on 1.20.0 or 1.190. Oddly, if I don't change state_manipulation to a cdylib then it works on 1.19.0 but not on 1.20.0, like before. Here is a branch with that change.

In both of those branches, the code compiles and runs on 1.19.0 and 1.20.0 just fine, the reloading just doesn't work except in the 1.19.0 plain dylib case.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Sep 7, 2017

One detail that's missing in your code is that you still get a fn() -> State instead of a extern "C" fn() -> State function pointer (and analogously for the other function). Unlikely to cause issues with these extremely simple signatures, but the calling convention is different.

Reading over the diff, I wonder whether LIB_PATH is still accurate? I believe cdylib artifacts are named and placed differently from dylib artifacts. You could try this by doing cargo clean in the main crate, ensuring that any dylib artifacts are gone.

Edit: Indeed something's fishy with the way you use the cdylib. I'm surprised that having a dependency on a cdylib crate works at all! A cdylib is more like a staticlib or executable in that it's a finished artifact for consumption by the outside world, rather than something specifically for other Rust crates to consume.

@Ryan1729
Copy link
Contributor Author

Ryan1729 commented Sep 7, 2017

I ran cargo clean and then ran the version of the program that uses the C ABI again, and it behaved the same as before, that is, still not reloading the lib, but as expected otherwise. It still creates libstate_manipulation.so in /target/debug/, along with another copy in /target/debug/deps, whether I set state_manipulation to be a cdylib or a plain dylib.

When you talk about "having a dependency on a cdylib crate" are you specifically referring to these lines? This might be clear already, but my intention is to do this live reloading stuff only in debug mode and to have state_manipulation be a normal crate in release mode. In order to do this I need to comment out the cylib/dylib line in state_manipulation's Cargo.toml before building in release mode. I build in release mode infrequently enough that this is no big deal.

I had so far in this issue only been running the code in debug mode, but I tried it in release mode and after fixing update_and_render's return type, (left over from cutting down the previous version to make the example,) it worked just as expected. And of course in release mode it does not create libstate_manipulation.so, just libstate_manipulation.rlib.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Sep 7, 2017

When you talk about "having a dependency on a cdylib crate" are you specifically referring to these lines?

I am referring to the fact that state_manipulation only declares the cdylib crate-type in its Cargo.toml, not lib. I would not expect an rlib to be generated for that. Indeed I have just cloned the repository (commit Ryan1729/tiny-live-code-example@d8ef24c) and I get "can't find crate for state_manipulation" in release mode. Please double-check that the code you compile exactly matches what's in the repository, and that you don't have any stale artifacts lying around.

(I'm booted into Windows at the moment so I can't try to reproduce the actual error (reloading failing to have an effect) in debug mode. The fact that state_manipulation does not generate an rlib should be platform-independent though.)

@Ryan1729
Copy link
Contributor Author

Ryan1729 commented Sep 8, 2017

can't find crate for state_manipulation

That's the error I get if I don't comment out the crate-type = ["cdylib"] line. As you expected, if I don't do that then I don't get an rlib.

The version of the C_ABI branch I was compiling from is even with the github version. If I download a zip of that code and run it from a different folder it has the same behaviour, (reload only works in 1.19.0 with state_manipulation set as a dylib, in order to compile in release mode you need to make state_manipulation a plain crate, etc.)

I think these debug mode vs release mode differences might be making things more complicated than necessary. So I made another branch from the C_ABI branch with those differences removed, (besides the necessary changes to LIB_PATH). The reloading now works on both (1.19.0,dylib,debug) and (1.19.0,dylib,release) so it seems like that's one less thing to think about.


Just in case it might be relevant, have you noticed these lines in the root Cargo.toml?

#this is needed so the state manipulation dynamic lib is put in the right place
[workspace]

I haven't used workspaces besides this so I'm not sure of the full effects of that [workspace] line, but apparently I added it because it affects where the .so files end up.

@Ryan1729
Copy link
Contributor Author

Ryan1729 commented Sep 8, 2017

I decided to see if I could get reloading working at all on windows. If I try to just build while the program is running I get an error saying the build process cannot remove the .dll files. Per this stackexchange answer you can get around this by renaming the .dll then copying the replacement to the original name. So I came up with the following procedure:

  • set the counter increment to 1000.
  • build the program
  • copy the library file, appending 1000 to its name.
  • set the counter increment to 1.
  • run the program. The remaining steps are done without stopping the program
  • rename the freshly built library file, say by adding an underscore.
  • rename the older library file, removing the 1000
  • set that library file's modification time to the current time [1]
  • hopefully observe that the program has now begun incrementing by 1000

I tried that procedure on windows, on a fresh download of the simplified C_ABI branch, (note, without changing cdylib to dylib,) using both 1.19 and 1.20.0. In both cases, it worked; the increment about became 1000.

Performing the same procedure on Linux, (using touch ./target/debug/libstate_manipulation.so instead of the Powershell) only worked if I was using 1.19.0 and I made state_manipulation a plain dylib. Otherwise the new modification time was noticed but the increment amount did not change.

[1]
on windows you can use the following in Powershell :

$file = Get-Item .\target\debug\libstate_manipulation.dll
$file.LastWriteTime = (Get-Date)

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Sep 8, 2017

I investigated some more (on a linux machine). I can reproduce the behavior you describe, but haven't been able to identify the cause. I'm at my wit's end: with both dylib and cdylib, the right .so file changes and contains the code I'd expect, and the file change is correctly noticed and the library is reloaded, yet results differ. Maybe someone else can diagnose this?

@Ryan1729
Copy link
Contributor Author

I searched back through the nightlies and found that if I set the Rust version with rustup default nightly-2017-07-06 the reloading works, but if I set it with rustup default nightly-2017-07-07 it doesn't work.

This works using using this branch where state_manipulation is set to be a plain dylib, on Linux.

@Ryan1729
Copy link
Contributor Author

Based on the commit hashes that are printed when I run rustup default nightly-2017-07-06 and rustup default nightly-2017-07-07 it seems like the bug would have to have been introduced among these commits. Based on the commit messages alone I don't see an obvious place to look for the bug. Maybe someone else would?

@Ryan1729
Copy link
Contributor Author

This issue is currently on the 4th page of the issues list, so I doubt someone who could quickly make further progress will just find it.

@rkruppe You probably know better than I, who would be most able to help. Can you ask someone else to have a look at this bug? @ mentioning everyone involved in that list of commits seems like overkill.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Sep 13, 2017

#42899 and #42727 seems like the most likely candidates, though only because they're the only things in the list that look like they touch linking at all. So: cc @alexcrichton

@alexcrichton
Copy link
Member

Hm there's a lot of discussion here and anything dealing with loading/unloading dylibs is absolutely fraught with unsafety, is there a distillation of the problem at this point?

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Sep 14, 2017

No idea what the cause might be, but there is a reasonably small reproduction: https://github.com/Ryan1729/tiny-live-code-example/tree/3636b8e29af3e141c12f4e00514b34840f99c12e

The problem that occurs on Linux with state_manipulation having crate type cdylib (and also with dylib on newer versions, but I consider that the less interesting part) is reproduced like this:

  • cargo run [--release], keep it running
  • change the amount by which self.counter is incremented in state_manipulation/src/lib.rs
  • cargo build [--release] in state_manipulation/
  • the .so is updated (and contains the correct code) and the running process notices that and reloads, but the state changes in said process are not affected

@Ryan1729 I have noticed that this reproduction could be even smaller, though. For example, you could remove Platform and State and just pass a &mut i64, and consequently also simplify the the code in main.rs. Can you try reducing it further?

@Ryan1729
Copy link
Contributor Author

I've removed Platform and State here: https://github.com/Ryan1729/tiny-live-code-example/tree/f7b1b07801866db6de3a525f6ec2ebb1cac05549
This reduction removed the need for the common crate altogether.
The behaviour is still the same.

@alexcrichton
Copy link
Member

Is this a rust bug? Testing https://github.com/Ryan1729/tiny-live-code-example/tree/f7b1b07801866db6de3a525f6ec2ebb1cac05549 I get the same behavior on 1.19.0 and 1.20.0, which is that the changes don't seem to make a difference when the library is reloaded. Additionally I saw via strace that when a library is "reloaded" it doesn't actually do anything syscall-wise.

@Ryan1729
Copy link
Contributor Author

If state_manipulation's crate-type is set to dylib instead of cdylibthen the reloading actually works on 1.19.0 but not on 1.20.0.

@alexcrichton
Copy link
Member

Ok that does reproduce for me, but this is pretty much super uncharted territory. The dylib crate type here is depending on the standard library dynamically, which means you've got two standard libraries in one program during the executable. That's a recipe for all sorts of disasters, so my guess it's that this is just some obscure bug related to that. I don't really have the time to dig in further unfortunately.

@Ryan1729
Copy link
Contributor Author

Is there a less uncharted territory way to achieve the same effect, that is, swapping in new behaviour at runtime? For my use case I'm fine with using or not using a C ABI. Also, given it didn't cause problems I'd be fine with an extra copy of the standard library, since I'm only using the reloading during development, (the first version of the example program only reloaded in debug mode and used state_manipulation as a normal crate in release.) For the problems it applies to, swapping behaviour like this gives much tighter feedback cycle which I would miss if it wasn't there.

@alexcrichton
Copy link
Member

Maybe? I'm not particularly well versed in dynamic libraries. If you're mixing two Rust programs together via dynamic libraries, though, the one being dlopen'd needs to be a cdylib

@TimNN TimNN added A-FFI Area: Foreign function interface (FFI) A-linkage Area: linking into static, shared libraries and binaries C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Sep 17, 2017
@Ryan1729
Copy link
Contributor Author

I've written a version of the example that performs the rename-then-copy mentioned above and exits with a zero exit code on 1.19.0 and a non-zero exit code on 1.20.0. It's on this branch. Hopefully that's useful for git bisect run or writing an regression test.

@Ryan1729
Copy link
Contributor Author

Ryan1729 commented Sep 21, 2017

I've taken that automatic rename-then-copy version and ran git bisect run with this script over the range of commits mentioned before

It reported the following:

696412de7e4e119f8536686c643621115b90c775 is the first bad commit

That's the final commit in the range, and it's only a change to a CI script.
It would appear that I misinterpreted the commit hashes (or there's a off by one error or something related to timezones?) and the bug now appears to originate in these commits. I'm re-running git bisect run but it takes quite awhile on my hardware. Maybe someone else can spot the error faster than that?

@Ryan1729
Copy link
Contributor Author

After a few false starts I've ran git bisect properly and gotten a result of the following

695dee063bcd40f154bb27b7beafcb3d4dd775ac is the first bad commit

which points to the problem being in #42727.

@alexcrichton
Copy link
Member

One thing I've noticed is that the symbols for allocation are exported from cdylibs when they shouldn't be (they should be internal symbols). That may fix the bug here, but it also may not.

@Ryan1729
Copy link
Contributor Author

Would the code that exports those symbols be in #42727? If so, can you point me in its general direction? 115 changed files is a lot to search through.

@alexcrichton
Copy link
Member

Oh sure!

So right now you can see for yourself:

$ touch foo.rs
$ rustc +nightly foo.rs --crate-type cdylib
$ nm -g libfoo.so | grep -w T
0000000000008590 T __rdl_alloc
00000000000087a0 T __rdl_alloc_excess
0000000000008710 T __rdl_alloc_zeroed
0000000000008630 T __rdl_dealloc
00000000000088f0 T __rdl_grow_in_place
0000000000008600 T __rdl_oom
0000000000008650 T __rdl_realloc
0000000000008820 T __rdl_realloc_excess
0000000000008900 T __rdl_shrink_in_place
0000000000008640 T __rdl_usable_size
0000000000012db0 T rust_eh_personality

All of the __rdl_* symbols (and __rde_* in some situations) don't actually need to get exported from the cdylib, they should be internal to Rust. I think that something in this file is likely to change to account for this.

@Ryan1729
Copy link
Contributor Author

Here's the symbols that are exported after compiling the automated branch with the compiler at 695dee06/after #42727.

Using cdylib

$ nm -g ../tiny-live-code-example/target/debug/libstate_manipulation1000.so | grep -w T
0000000000003530 T lib_update_and_render
000000000000a110 T __rdl_alloc
000000000000a300 T __rdl_alloc_excess
000000000000a280 T __rdl_alloc_zeroed
000000000000a1b0 T __rdl_dealloc
000000000000a430 T __rdl_grow_in_place
000000000000a180 T __rdl_oom
000000000000a1d0 T __rdl_realloc
000000000000a370 T __rdl_realloc_excess
000000000000a440 T __rdl_shrink_in_place
000000000000a1c0 T __rdl_usable_size
0000000000014a70 T rust_eh_personality

Using plain dylib

$ nm -g ../tiny-live-code-example/target/debug/libstate_manipulation1000.so | grep -w T
0000000000001058 T _fini
0000000000000b98 T _init
0000000000000ed0 T lib_update_and_render
0000000000000fc0 T __rust_alloc
0000000000001020 T __rust_alloc_excess
0000000000001010 T __rust_alloc_zeroed
0000000000000fe0 T __rust_dealloc
0000000000001040 T __rust_grow_in_place
0000000000000fd0 T __rust_oom
0000000000001000 T __rust_realloc
0000000000001030 T __rust_realloc_excess
0000000000001050 T __rust_shrink_in_place
0000000000000ff0 T __rust_usable_size

@alexcrichton
Copy link
Member

Yeah for the cdylib case the __rdl_* functions shouldn't be exported, but for the dylib case the __rust_ functions being exported is expected I believe.

@Ryan1729
Copy link
Contributor Author

I changed these lines to the following

let export_level = if special_runtime_crate {
                        // We can probably do better here by just ensuring that
                        // it has hidden visibility rather than public
                        // visibility, as this is primarily here to ensure it's
                        // not stripped during LTO.
                        //
                        // In general though we won't link right if these
                        // symbols are stripped, and LTO currently strips them.
                        if &*name == "rust_eh_personality" ||
                           &*name == "rust_eh_register_frames" ||
                           &*name == "rust_eh_unregister_frames" {
                            SymbolExportLevel::C
                        } else {
                            SymbolExportLevel::Rust
                        }
                    } else if (&*name).starts_with("__rdl_") ||
                              (&*name).starts_with("__rust_") { 
                        SymbolExportLevel::Rust
                    } else {
                        export_level(scx, def_id)
                    };

then recompiled the automated branch. If I set the two inner crates to be cdylibs then those symbols are removed

nm -g ../tiny-live-code-example/target/debug/libstate_manipulation1000.so | grep -w T
0000000000003260 T lib_update_and_render
00000000000145c0 T rust_eh_personality

but the reloading still doesn't work.

If I set the crates to be dylibs then the reloading still doesn't work, but the symbols are still exported, that is, I still see the following results from nm:

nm -g ../tiny-live-code-example/target/debug/libstate_manipulation1000.so | grep -w T
0000000000001058 T _fini
0000000000000b98 T _init
0000000000000ed0 T lib_update_and_render
0000000000000fc0 T __rust_alloc
0000000000001020 T __rust_alloc_excess
0000000000001010 T __rust_alloc_zeroed
0000000000000fe0 T __rust_dealloc
0000000000001040 T __rust_grow_in_place
0000000000000fd0 T __rust_oom
0000000000001000 T __rust_realloc
0000000000001030 T __rust_realloc_excess
0000000000001050 T __rust_shrink_in_place
0000000000000ff0 T __rust_usable_size

Should that code change have caused the dylib symbols to be internal? Or are those symbols handled separately?
(Note: I'm still concerned with dylibs because the reloading has only ever worked with dylibs.)

@Ryan1729
Copy link
Contributor Author

I checked out 4c225c4, the last commit the reloading works on, and I edited symbol_export.rs to always set export_level to SymbolExportLevel::C.

In the dylib case the reloading still works and I get this result from nm:

$ nm -g ../tiny-live-code-example/target/debug/libstate_manipulation1000.so | grep -w T
0000000000000a8c T _fini
0000000000000708 T _init
00000000000009a0 T lib_update_and_render

Which is the same result I get if I use the unaltered version.
So, it seems like whether the symbols are exported or not isn't affecting the reloading.

@alexcrichton
Copy link
Member

Ah then I'm not sure what's going on :(

@Ryan1729
Copy link
Contributor Author

I don't understand what's going on either, but I've managed to cause the symptoms with this 42 line diff.

You can checkout 4c225c4 and note that using that version to compile this version of the test results in an executable that reloads dynamic libraries without an issue.

Then you can apply that diff with git apply /path/to/break.diff and then note that compiling the test with the resulting code produces an executable that does not reload the dynamic libraries.

@comex
Copy link
Contributor

comex commented Oct 25, 2017

I'm not sure why the use of liballoc_system affects anything, but it's generally a bad idea to rely on dlclose actually fully unloading a library.

Instead, I strongly suggest using a different name for each version of the library. Not just the filename on disk should differ, but also the path embedded in the library file itself, if any: at least on macOS, such a path is always included, and defaults to the path the library is written to.

@Ryan1729
Copy link
Contributor Author

@comex I have now done some reading about dlclose and since the POSIX spec specifically states implementations do not have to unload the library I am forced to agree that I can't rely on it.

I will try your suggestion. Having to deal with multiple copies of the library on disk seems like it will be annoying, but if I can get it to work it sounds better than being stuck on 1.19.0.

@jethrogb
Copy link
Contributor

On Linux, you can use dlmopen with LM_ID_NEWLM to make sure the library is loaded anew.

@Ryan1729
Copy link
Contributor Author

@jethrogb Interesting. I think I would like the library reloading to be portable, and it's nice to be able to let libloading deal with the platform specific portions. But it's nice to know that dlmopen option is available, if needed.

@Ryan1729
Copy link
Contributor Author

@comex I've gotten a version of my example code working on 1.21.0 using your suggestion! The first working version is here but it is a messy hack. I'm currently working on a more robust version but now I can see a way forward where I'm not stuck on 1.19.0 forever. Thanks for making your suggestion!

@Ryan1729
Copy link
Contributor Author

Since the POSIX standard (currently?) doesn't provide a way to reload dynamic libraries without the workaround, I suppose this issue can be closed.

@porglezomp
Copy link
Contributor

I'm also dealing with the issue for live reloading in games (I wrote live-reload), so it's good to know that this is going on and it's not something I changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-FFI Area: Foreign function interface (FFI) A-linkage Area: linking into static, shared libraries and binaries C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants