
compilecache failed when @everywhere using from remote machines #48217

Closed
Moelf opened this issue Jan 10, 2023 · 33 comments
Labels
parallelism (Parallel or distributed computation) · regression 1.9 (Regression in the 1.9 release)

Comments

@Moelf
Contributor

Moelf commented Jan 10, 2023

it is common for a user of some cluster to create remote processes via e.g. https://github.com/JuliaParallel/ClusterManagers.jl, the workflow essentially looks like this:

  | | |_| | | | (_| |  |  Version 1.8.5 (2023-01-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using WVZAnalysis, ClusterManagers, Distributed

julia> addprocs(HTCManager(80))
Waiting for 80 workers: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 .

julia> @everywhere using WVZAnalysis

this has stopped working on 1.9-beta2 with errors like:

julia> @everywhere using WVZAnalysis
      From worker 2:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for SentinelArrays [91c51154-3ec4-41a3-a24f-3f23e20d615c]
      From worker 2:	│   exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.9/SentinelArrays/uMYVe_zIiTQ.so.
      From worker 2:	└ @ Base loading.jl:1673
      From worker 3:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for SentinelArrays [91c51154-3ec4-41a3-a24f-3f23e20d615c]
      From worker 3:	│   exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.9/SentinelArrays/uMYVe_zIiTQ.so.
      From worker 3:	└ @ Base loading.jl:1673
      From worker 8:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for JLLWrappers [692b3bcd-3c85-4b1f-b108-f13ce0eb3210]
      From worker 8:	│   exception = Required dependency Preferences [21216c6a-2e73-6563-6e65-726566657250] failed to load from a cache file.
      From worker 8:	└ @ Base loading.jl:1673
      From worker 2:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for JLLWrappers [692b3bcd-3c85-4b1f-b108-f13ce0eb3210]
      From worker 2:	│   exception = Required dependency Preferences [21216c6a-2e73-6563-6e65-726566657250] failed to load from a cache file.

might be related to #48057 and #48039

@Moelf
Contributor Author

Moelf commented Jan 13, 2023

this is not too easy to reproduce because it doesn't seem to happen with local procs

@timholy
Member

timholy commented Jan 13, 2023

Precompilation race condition? What if you first pkg> precompile everything?

@Moelf
Contributor Author

Moelf commented Jan 13, 2023

why would that happen when using XXX is done on master node already?

@timholy
Member

timholy commented Jan 13, 2023

Possible, perhaps, if your secondary julia processes are started with different command-line flags than the primary one? I'm not sure.

I've wondered whether it would be easier to debug this if you temporarily made the compiled/v1.x folder recursively read-only; then you might get an immediate error if something tries to precompile something differently.

@Moelf
Contributor Author

Moelf commented Jan 13, 2023

Possible, perhaps, if your secondary julia processes are started with different command-line flags than the primary one? I'm not sure.

maybe so, but still, that would be something that used to work on 1.8 but is broken on 1.9

@timholy
Member

timholy commented Jan 13, 2023

We're more picky about the command line flags now. In a sense the old behavior was a bug. But the crash is not the outcome we want.

@giordano
Contributor

The resolution of #48039 was to consistently use Base.julia_cmd() in the MPI.jl tests, to make sure the same flags are used in the subprocesses spawned during the tests. The problem became clear by setting the environment variable JULIA_DEBUG=loading, as suggested in #48039 (comment); you may want to do the same.
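As an illustration, a minimal sketch of that diagnostic (using local workers for simplicity; the cluster-manager `addprocs` call from this thread would replace the plain `addprocs(2)`):

```julia
# Sketch: check that parent and workers run with identical flags, with
# cache-loading diagnostics enabled. Local workers stand in for the
# cluster workers here.
using Distributed

ENV["JULIA_DEBUG"] = "loading"   # inherited by locally spawned workers

addprocs(2; exeflags = `--project=$(Base.active_project())`)

@show Base.julia_cmd()           # the parent's effective command line
for p in workers()
    # Mismatched flags here are one reason a worker rejects the
    # parent's precompile caches.
    @show remotecall_fetch(() -> Base.julia_cmd(), p)
end
```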

@IanButterworth
Member

Also try nightly, which has #48179 (it improves cache specificity); that change will be in 1.9-beta3.

@Moelf
Contributor Author

Moelf commented Jan 14, 2023

@giordano so julia_cmd() seems okay, and I don't know how I should use JULIA_DEBUG; do I set it for every remote worker?

julia> using WVZAnalysisCore, ClusterManagers, Distributed

julia> addprocs(HTCManager(80); extrajdl=["+queue=\"short\""], exeflags = `--project=$(Base.active_project()) -e 'include("/data/jiling/WVZ/init.jl")'`);

julia> Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`

julia> @fetchfrom 2 Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`

julia> @everywhere using WVZAnalysisCore
      From worker 42:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for SentinelArrays [91c51154-3ec4-41a3-a24f-3f23e20d615c]
      From worker 42:	│   exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.10/SentinelArrays/uMYVe_BgNuj.so.
      From worker 42:	└ @ Base loading.jl:1725
      From worker 24:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for SentinelArrays [91c51154-3ec4-41a3-a24f-3f23e20d615c]

@Moelf
Contributor Author

Moelf commented Jan 14, 2023

trying @timholy 's idea:

chmod -R 555 ~/.julia/compiled/v1.10
julia> using WVZAnalysisCore, ClusterManagers, Distributed

julia> addprocs(HTCManager(80); extrajdl=["+queue=\"short\""], exeflags = `--project=$(Base.active_project()) -e 'include("/data/jiling/WVZ/init.jl")'`);


julia> @everywhere using WVZAnalysisCore
ERROR: On worker 4:
SystemError: mktemp: Permission denied
Stacktrace:
  [1] #systemerror#84
    @ ./error.jl:176
  [2] kwcall
    @ ./error.jl:176
  [3] kwcall
    @ ./error.jl:176
  [4] #systemerror#83

so it does seem that for some reason some worker wants to recompile it, despite:

julia> Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`

julia> @fetchfrom 4 Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`

@brenhinkeller brenhinkeller added the parallelism Parallel or distributed computation label Jan 18, 2023
@timholy
Member

timholy commented Jan 18, 2023

Does the same thing happen on 1.9? It's possible 1.9's extra pickiness might actually solve this problem?

@Moelf
Contributor Author

Moelf commented Jan 18, 2023

it does, the original post was based on

this has stopped working on 1.9-beta2 with errors like

@Moelf
Contributor Author

Moelf commented May 18, 2023

fwiw I still hit this on 1.9.0

julia> using WVZAnalysis, ClusterManagers, Distributed, Pkg

julia> addprocs(HTCManager(4); extrajdl=["+queue=\"short\""], exeflags = `--project=$(Base.active_project()) -e 'include("/data/jiling/WVZ/init.jl")'`);
Waiting for 4 workers: ]1 23 4 .

(WVZAnalysis) pkg> precompile

julia> @everywhere using WVZAnalysis
      From worker 4:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for FHist [68837c9b-b678-4cd5-9925-8a54edc8f695]
      From worker 4:	│   exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.9/FHist/heGI2_Mag0e.so.
      From worker 4:	└ @ Base loading.jl:1783
      From worker 5:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for FHist [68837c9b-b678-4cd5-9925-8a54edc8f695]
      From worker 5:	│   exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.9/FHist/heGI2_Mag0e.so.
      From worker 5:	└ @ Base loading.jl:1783

I find this extremely disruptive for the workflow -- when we have the same architecture and OS across the cluster, not having to "compile" on every node was a big advantage of Julia over a C++ workflow.

Now it's worse than C++ -- even if I wanted to, I don't know how to tell each node to compile its own cache separately.

@IanButterworth
Member

Can you try with the JULIA_DEBUG=loading env var set for the workers? For some reason the precompile cache made by the parent is being ignored by the workers; that should explain why.
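One way to do that, sketched with local workers (the `env` keyword of `addprocs` is assumed to be forwarded by the cluster manager in use; HTCManager may instead need its own `extraenv`-style option, as used elsewhere in this thread):

```julia
using Distributed

# Ask addprocs to set the variable in each worker's environment.
addprocs(2; env = ["JULIA_DEBUG" => "loading"])

# Sanity check: every worker should report the setting, and then print
# `Debug: ...` lines from Base's loading code on the next `using`.
@everywhere @show get(ENV, "JULIA_DEBUG", "(unset)")
```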

@Moelf
Contributor Author

Moelf commented May 18, 2023

I did that for when I have 1 remote worker: https://pastebin.com/tt3Wz5qL

@IanButterworth
Member

Are the workers on the same machine or the same architecture? If so, perhaps it's

      From worker 2:	┌ Debug: Rejecting cache file /home/jiling/.julia/compiled/v1.9/Statistics/ERcPL_qEvLw.ji for Statistics [10745b16-79ce-11e8-11f9-7d13ad32a3b2] since the flags are mismatched
      From worker 2:	│   current session: use_pkgimages = true, debug_level = 1, check_bounds = 0, inline = true, opt_level = 2
      From worker 2:	│   cache file:      use_pkgimages = false, debug_level = 1, check_bounds = 0, inline = true, opt_level = 0
      From worker 2:	└ @ Base loading.jl:2690

so the workers want O2 but the parent is using O0?

Given there's a lot of cache files there, perhaps retry with a cleared out precompile cache dir, so there's just the file generated by the parent?

@IanButterworth
Member

So there are two issues here:

  1. Figuring out why the workers aren't using the parent's precompile cache files (it might be reasonable not to if they're on different architecture or different julia optimization settings)

  2. Making multiple workers using the same depot not crash when they try to precompile at the same time, which is the goal of pidlock cache file precompilation #49052
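For reference, the locking primitive that PR builds on is the Pidfile mechanism in the FileWatching stdlib (Julia 1.9+); a standalone sketch, with a placeholder lock path:

```julia
using FileWatching.Pidfile  # stdlib in Julia ≥ 1.9

# Only one process at a time holds the lock; concurrent processes block
# until it is released instead of racing on the same output file.
mkpidlock("/tmp/demo-cache.pid") do
    # ... compute and atomically rename the cache file here ...
end
```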

@Moelf
Contributor Author

Moelf commented May 18, 2023

let me try again by removing ~/.julia/compiled first https://pastebin.com/RpfsRpDn

I still see weird rejection. I also don't understand how the "current session" would have check_bounds = 0 btw, since I'm not tweaking that at all

@Moelf
Contributor Author

Moelf commented May 18, 2023

is there a way to tell workers to use their own depot at the moment? I guess I can manually set JULIA_DEPOT_PATH to workers' /tmp maybe
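A rough, untested sketch of that idea (the `/tmp` path is a placeholder; on a cluster, each node's `/tmp` is local, so workers on different machines would get separate caches):

```julia
using Distributed

# Put a node-local depot first (new precompile caches land there) and
# keep the shared depot second so installed packages are still found.
depot = "/tmp/$(ENV["USER"])-depot:" * first(DEPOT_PATH)
addprocs(2; env = ["JULIA_DEPOT_PATH" => depot])

@everywhere @show first(DEPOT_PATH)
```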

@IanButterworth
Member

I don't see any issue in the latest log shared in #48217 (comment)
The check_bounds=1 rejection is just because the stdlibs are shipped in two versions, one with check_bounds on and one on auto. I assume in that case you didn't see any precompilation on the workers

@Moelf
Contributor Author

Moelf commented May 18, 2023

No, workers still tried to precompile in this case

@IanButterworth
Member

Did you chop that out of the log then? Because there's no log line saying the (single) worker is having to precompile.

@Moelf
Contributor Author

Moelf commented May 19, 2023

https://pastebin.com/dRFucaqM

what does this mean?

since pkgimage can't be loaded on this target

yeah, idk, I think in general I still run into it no matter what I do. @vchuravy had a theory this might have something to do with the file system (in this case FUSE-mounted over the network) not supporting certain features (atomic move?).

I guess in this case we should try to mitigate, or at least provide users a workaround (spiritually similar to Revise.jl's "polling")

@KristofferC
Member

Can you play with setting the env variable JULIA_CPU_TARGET? For example, try setting it to generic.

@vchuravy
Member

Okay this is the smoking gun that I was looking for:

From worker 3:	┌ Debug: Rejecting cache file /home/jiling/.julia/compiled/v1.9/WVZAnalysis/mPzCc_RVCk5.ji for WVZAnalysis [15e846ca-95d3-4e21-8d51-3cc2ce27e5cd] since pkgimage can't be loaded on this target

Pkgimages compile first for the native target, but we don't know what that target is, so we can't mix the information into the hash... Therefore the cache file location collides with the previous one.

What is the cpuinfo on both machines? As @KristofferC alluded to, you can set JULIA_CPU_TARGET to control the architecture we create cache files for.

@Moelf
Contributor Author

Moelf commented May 19, 2023

julia> using WVZAnalysis, ClusterManagers, Distributed

julia> addprocs(HTCManager(4); extrajdl=["+queue=\"short\""], extraenv=["export JULIA_CPU_TARGET=generic"], exeflags = `-e 'include("/data/jiling/WVZ/init.jl")'`);
Waiting for 4 workers: 1 2 3 4 .

julia> @fetchfrom 1 ENV["JULIA_CPU_TARGET"]
"generic"

julia> @fetchfrom 2 ENV["JULIA_CPU_TARGET"]
"generic"

julia> @everywhere using WVZAnalysis

julia>

looks like success to me!

@KristofferC
Member

Pkgimages compiles first for the native target, but we don't know what the target is so we can't mixin the information into the hash... Therefore the cache file location collides with the previous one.

Almost seems like this should be more than a debug message.

@giordano
Contributor

Alternatively you can set JULIA_CPU_TARGET="target1;target2" to target both CPUs if they're different.
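For example (a sketch; the two target names below are placeholders taken from `Sys.CPU_NAME` on each machine, and the parent session itself should also be launched with the same setting so its own caches match):

```julia
using Distributed

Sys.CPU_NAME  # run on each machine to find its target, e.g. "znver2"

# Placeholder pair of targets; the syntax mirrors the sysimage
# multi-target format, with clone_all requesting full code generation
# for that target.
target = "generic;skylake-avx512,clone_all;znver2,clone_all"
addprocs(2; env = ["JULIA_CPU_TARGET" => target])
```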

@Moelf
Contributor Author

Moelf commented May 19, 2023

btw here's the CPU related information @vchuravy https://pastebin.com/x7gA14Gx

they look... the same? The NUMA node0 CPU... line is different, but why do we care about that?

@IanButterworth
Member

Almost seems like this should be more than a debug message.

Yeah. Something like

Caches exist in this Depot for another cpu architecture. Consider setting .... to make precompilation target both

@Moelf
Contributor Author

Moelf commented May 19, 2023

well in this case what's actually different for the purpose of Julia's compiled-cache hashing? The CPU arch seems to be identical

@Moelf
Contributor Author

Moelf commented Nov 8, 2023

https://discourse.julialang.org/t/yet-another-precompilation-on-hpc-issue/105731/

https://discourse.julialang.org/t/julia-1-9-same-depot-with-different-machines/103463

We need like a very late PSA on this I think, if we can't just fix this.

@giordano
Contributor

giordano commented Nov 8, 2023

The first one looks completely unrelated to this problem though; it's installation of packages somehow using a compile-time path.

@vtjnash vtjnash closed this as not planned Won't fix, can't repro, duplicate, stale Aug 9, 2024