compilecache failed when @everywhere using from remote machines #48217
this is not too easy to reproduce because it doesn't seem to happen with local procs |
Precompilation race condition? What if you first |
why would that happen when |
Possible, perhaps, if your secondary julia processes are started with different command-line flags than the primary one? I'm not sure. I've wondered if it would be easier to debug this if you temporarily make the |
maybe so, but still, that would be something that used to work on 1.8 but is broken on 1.9 |
We're more picky about the command line flags now. In a sense the old behavior was a bug. But the crash is not the outcome we want. |
The resolution of #48039 was to consistently use |
Also try nightly, which has #48179 (it will be in 1.9-beta3) and improves cache specificity |
@giordano so:
julia> using WVZAnalysisCore, ClusterManagers, Distributed
julia> addprocs(HTCManager(80); extrajdl=["+queue=\"short\""], exeflags = `--project=$(Base.active_project()) -e 'include("/data/jiling/WVZ/init.jl")'`);
julia> Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`
julia> @fetchfrom 2 Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`
julia> @everywhere using WVZAnalysisCore
From worker 42: ┌ Warning: The call to compilecache failed to create a usable precompiled cache file for SentinelArrays [91c51154-3ec4-41a3-a24f-3f23e20d615c]
From worker 42: │ exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.10/SentinelArrays/uMYVe_BgNuj.so.
From worker 42: └ @ Base loading.jl:1725
From worker 24: ┌ Warning: The call to compilecache failed to create a usable precompiled cache file for SentinelArrays [91c51154-3ec4-41a3-a24f-3f23e20d615c] |
trying @timholy's idea:
julia> using WVZAnalysisCore, ClusterManagers, Distributed
julia> addprocs(HTCManager(80); extrajdl=["+queue=\"short\""], exeflags = `--project=$(Base.active_project()) -e 'include("/data/jiling/WVZ/init.jl")'`);
julia> @everywhere using WVZAnalysisCore
ERROR: On worker 4:
SystemError: mktemp: Permission denied
Stacktrace:
[1] #systemerror#84
@ ./error.jl:176
[2] kwcall
@ ./error.jl:176
[3] kwcall
@ ./error.jl:176
[4] #systemerror#83
so it does seem that for some reason some worker wants to recompile it, despite:
julia> Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`
julia> @fetchfrom 4 Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1` |
Does the same thing happen on 1.9? It's possible 1.9's extra pickiness might actually solve this problem? |
it does, the original post was based on
|
fwiw I still hit this on 1.9.0:
julia> using WVZAnalysis, ClusterManagers, Distributed, Pkg
julia> addprocs(HTCManager(4); extrajdl=["+queue=\"short\""], exeflags = `--project=$(Base.active_project()) -e 'include("/data/jiling/WVZ/init.jl")'`);
Waiting for 4 workers: 1 2 3 4 .
(WVZAnalysis) pkg> precompile
julia> @everywhere using WVZAnalysis
From worker 4: ┌ Warning: The call to compilecache failed to create a usable precompiled cache file for FHist [68837c9b-b678-4cd5-9925-8a54edc8f695]
From worker 4: │ exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.9/FHist/heGI2_Mag0e.so.
From worker 4: └ @ Base loading.jl:1783
From worker 5: ┌ Warning: The call to compilecache failed to create a usable precompiled cache file for FHist [68837c9b-b678-4cd5-9925-8a54edc8f695]
From worker 5: │ exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.9/FHist/heGI2_Mag0e.so.
From worker 5: └ @ Base loading.jl:1783
I find this extremely disruptive for the workflow -- when we have the same architecture and OS across the cluster, it was a big advantage over a C++ workflow that I didn't have to "compile" on every node in Julia. Now it's worse than C++ -- even if I wanted to, I don't know how to tell each node to compile its own cache separately |
Can you try with |
I did that for when I have 1 remote worker: https://pastebin.com/tt3Wz5qL |
Are the workers on the same machine or same architecture? If so perhaps it's
so the workers want O2 but the parent is using O0? Given there are a lot of cache files there, perhaps retry with a cleared-out precompile cache dir, so there's just the file generated by the parent? |
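A minimal sketch of the cache-clearing suggestion above, assuming the default depot location `~/.julia` and the per-version `compiled/v1.9` directory seen in the logs (paths are taken from the warnings earlier in this thread, not from official documentation):

```shell
# Assumption: caches live under the default depot at ~/.julia/compiled/<ver>.
# Removing the per-version cache directory leaves nothing stale; the next
# `using` in the parent session then regenerates a single fresh cache file.
CACHE_DIR="$HOME/.julia/compiled/v1.9"
rm -rf "$CACHE_DIR"
echo "cleared $CACHE_DIR"
```

Run this with all Julia sessions stopped, then start the parent session and `using` the package once before adding workers.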
So there are two issues here:
|
let me try again by removing
I still see weird rejection. I also don't understand how the "current session" would have |
is there a way to tell workers to use their own depot at the moment? I guess I can manually set |
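On the per-worker-depot question above, one possible workaround is a shell wrapper that prepends a host-specific directory to `JULIA_DEPOT_PATH` before launching each worker. The `~/.julia-depots/<hostname>` layout here is my own invention, not an official mechanism; the key fact is that Julia writes new precompile caches into the first depot entry, while later entries (the shared `~/.julia`) are still searched for installed packages.

```shell
# Hypothetical per-host depot: the first entry of JULIA_DEPOT_PATH is where
# Julia writes new precompile caches; the shared ~/.julia stays on the path
# so packages installed there are still found read-only.
HOST_DEPOT="$HOME/.julia-depots/$(hostname)"
mkdir -p "$HOST_DEPOT"
export JULIA_DEPOT_PATH="$HOST_DEPOT:$HOME/.julia"
echo "$JULIA_DEPOT_PATH"
```

With ClusterManagers this could be sourced from the worker start-up script (e.g. the `init.jl` wrapper used in the `exeflags` above), so each node compiles into local storage instead of the fuse-mounted home directory.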
I don't see any issue in the latest log shared in #48217 (comment) |
No, workers still tried to precompile in this case |
Did you chop that out of the log then, because there's no log saying the (single) worker is having to precompile |
what does this mean?
yeah, idk, I think in general I still run into it no matter what I do. @vchuravy had a theory that this might have something to do with the file system (in this case fuse-mounted over the network) not supporting certain features (atomic move?). I guess in this case we should try to mitigate, or at least provide users a workaround (spiritually similar to Revise.jl's "polling")
Can you play with setting the env variable |
Okay this is the smoking gun that I was looking for:
Pkgimages compiles first for the native target, but we don't know what the target is, so we can't mix that information into the hash... Therefore the cache file location collides with the previous one. What is the |
julia> using WVZAnalysis, ClusterManagers, Distributed
julia> addprocs(HTCManager(4); extrajdl=["+queue=\"short\""], extraenv=["export JULIA_CPU_TARGET=generic"], exeflags = `-e 'include("/data/jiling/WVZ/init.jl")'`);
Waiting for 4 workers: 1 2 3 4 .
julia> @fetchfrom 1 ENV["JULIA_CPU_TARGET"]
"generic"
julia> @fetchfrom 2 ENV["JULIA_CPU_TARGET"]
"generic"
julia> @everywhere using WVZAnalysis
julia>
looks like success to me! |
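The successful run above sets `JULIA_CPU_TARGET=generic` only in the workers' environment via `extraenv`. A sketch of the complementary step, exporting the same value before starting the primary session so the cache files it writes are also built for a generic target (POSIX shell assumed; `generic` matches the value the workers used above):

```shell
# Export JULIA_CPU_TARGET before launching the primary session, so every
# process in the cluster hashes and compiles caches for the same target.
export JULIA_CPU_TARGET=generic
echo "primary will start with JULIA_CPU_TARGET=$JULIA_CPU_TARGET"
# julia --project=.   # then launch the REPL and addprocs as above
```

Note that `generic` trades away native-CPU optimizations in the compiled code for cache portability across heterogeneous nodes.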
Almost seems like this should be more than a debug message. |
Alternatively you can set |
btw here's the CPU-related information @vchuravy: https://pastebin.com/x7gA14Gx they look the same? the |
Yeah. Something like
|
well, in this case, what's actually different for the purposes of Julia's compiled-cache hashing? The CPU arch seems to be identical
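One way to check this (my suggestion, assuming Linux hosts): run the same CPU query on the login node and on a worker node and diff the output; differing feature flags can matter even when the model name matches.

```shell
# Print the CPU model and its feature flags, one flag per line so the
# output of two hosts can be compared directly with `diff`.
grep -m1 'model name' /proc/cpuinfo
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort
```

Save the output on each machine and compare, e.g. `diff login.txt worker.txt`.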
https://discourse.julialang.org/t/yet-another-precompilation-on-hpc-issue/105731/
https://discourse.julialang.org/t/julia-1-9-same-depot-with-different-machines/103463
We need like a very late PSA on this, I think, if we can't just fix it |
The first one looks completely unrelated to this problem, though; it's package installation somehow using a compile-time path. |
it is common for a user of some cluster to create remote processes via e.g. https://github.com/JuliaParallel/ClusterManagers.jl, the workflow essentially looks like this:
this has stopped working on 1.9-beta2 with errors like:
might be related to #48057 and #48039