
compilecache failed when @everywhere using from remote machines #48217

Closed
Moelf opened this issue Jan 10, 2023 · 33 comments
Labels
parallelism (Parallel or distributed computation) · regression 1.9 (Regression in the 1.9 release)

Comments

@Moelf
Contributor

Moelf commented Jan 10, 2023

it is common for a user of some cluster to create remote processes via e.g. https://github.com/JuliaParallel/ClusterManagers.jl, the workflow essentially looks like this:

  | | |_| | | | (_| |  |  Version 1.8.5 (2023-01-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using WVZAnalysis, ClusterManagers, Distributed

julia> addprocs(HTCManager(80))
Waiting for 80 workers: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 .

julia> @everywhere using WVZAnalysis

this has stopped working on 1.9-beta2 with errors like:

julia> @everywhere using WVZAnalysis
      From worker 2:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for SentinelArrays [91c51154-3ec4-41a3-a24f-3f23e20d615c]
      From worker 2:	│   exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.9/SentinelArrays/uMYVe_zIiTQ.so.
      From worker 2:	└ @ Base loading.jl:1673
      From worker 3:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for SentinelArrays [91c51154-3ec4-41a3-a24f-3f23e20d615c]
      From worker 3:	│   exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.9/SentinelArrays/uMYVe_zIiTQ.so.
      From worker 3:	└ @ Base loading.jl:1673
      From worker 8:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for JLLWrappers [692b3bcd-3c85-4b1f-b108-f13ce0eb3210]
      From worker 8:	│   exception = Required dependency Preferences [21216c6a-2e73-6563-6e65-726566657250] failed to load from a cache file.
      From worker 8:	└ @ Base loading.jl:1673
      From worker 2:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for JLLWrappers [692b3bcd-3c85-4b1f-b108-f13ce0eb3210]
      From worker 2:	│   exception = Required dependency Preferences [21216c6a-2e73-6563-6e65-726566657250] failed to load from a cache file.

might be related to #48057 and #48039

@Moelf
Contributor Author

Moelf commented Jan 13, 2023

this is not too easy to reproduce because it doesn't seem to happen with local procs

@timholy
Member

timholy commented Jan 13, 2023

Precompilation race condition? What if you first pkg> precompile everything?

@Moelf
Contributor Author

Moelf commented Jan 13, 2023

why would that happen when using XXX is done on master node already?

@timholy
Member

timholy commented Jan 13, 2023

Possible, perhaps, if your secondary julia processes are started with different command-line flags than the primary one? I'm not sure.

I've wondered whether it would be easier to debug this if you temporarily made the compiled/v1.x folder recursively read-only; then you might get an immediate error if something tries to precompile something differently.

@Moelf
Contributor Author

Moelf commented Jan 13, 2023

Possible, perhaps, if your secondary julia processes are started with different command-line flags than the primary one? I'm not sure.

maybe so, but still, that would be something that used to work on 1.8 but is broken on 1.9

@timholy
Member

timholy commented Jan 13, 2023

We're more picky about the command line flags now. In a sense the old behavior was a bug. But the crash is not the outcome we want.

@giordano
Contributor

The resolution of #48039 was to consistently use Base.julia_cmd() in the MPI.jl tests, to make sure the same flags are used in the subprocesses spawned during the tests. The problem became clear by setting the environment variable JULIA_DEBUG=loading, as suggested in #48039 (comment); you may want to do the same.
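As an illustration, a minimal sketch of that diagnostic (using local workers for simplicity; the cluster-manager `addprocs` call from this thread would replace the plain `addprocs(2)`):

```julia
# Sketch: check that parent and workers run with identical flags, with
# cache-loading diagnostics enabled. Local workers stand in for the
# cluster workers here.
using Distributed

ENV["JULIA_DEBUG"] = "loading"   # inherited by locally spawned workers

addprocs(2; exeflags = `--project=$(Base.active_project())`)

@show Base.julia_cmd()           # the parent's effective command line
for p in workers()
    # Mismatched flags here are one reason a worker rejects the
    # parent's precompile caches.
    @show remotecall_fetch(() -> Base.julia_cmd(), p)
end
```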

@IanButterworth
Member

Also try nightly, which has #48179 (it improves cache specificity); that change will be in 1.9-beta3.

@Moelf
Contributor Author

Moelf commented Jan 14, 2023

@giordano so julia_cmd() seems okay, and I don't know how I should use JULIA_DEBUG; do I set it for every remote worker?

julia> using WVZAnalysisCore, ClusterManagers, Distributed

julia> addprocs(HTCManager(80); extrajdl=["+queue=\"short\""], exeflags = `--project=$(Base.active_project()) -e 'include("/data/jiling/WVZ/init.jl")'`);

julia> Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`

julia> @fetchfrom 2 Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`

julia> @everywhere using WVZAnalysisCore
      From worker 42:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for SentinelArrays [91c51154-3ec4-41a3-a24f-3f23e20d615c]
      From worker 42:	│   exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.10/SentinelArrays/uMYVe_BgNuj.so.
      From worker 42:	└ @ Base loading.jl:1725
      From worker 24:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for SentinelArrays [91c51154-3ec4-41a3-a24f-3f23e20d615c]

@Moelf
Contributor Author

Moelf commented Jan 14, 2023

trying @timholy 's idea:

chmod -R 555 ~/.julia/compiled/v1.10
julia> using WVZAnalysisCore, ClusterManagers, Distributed

julia> addprocs(HTCManager(80); extrajdl=["+queue=\"short\""], exeflags = `--project=$(Base.active_project()) -e 'include("/data/jiling/WVZ/init.jl")'`);


julia> @everywhere using WVZAnalysisCore
ERROR: On worker 4:
SystemError: mktemp: Permission denied
Stacktrace:
  [1] #systemerror#84
    @ ./error.jl:176
  [2] kwcall
    @ ./error.jl:176
  [3] kwcall
    @ ./error.jl:176
  [4] #systemerror#83

so it does seem that for some reason some worker wants to recompile it, despite:

julia> Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`

julia> @fetchfrom 4 Base.julia_cmd()
`/home/jiling/julia-0c3b950e02/bin/julia -Cnative -J/home/jiling/julia-0c3b950e02/lib/julia/sys.so -g1`

@brenhinkeller brenhinkeller added the parallelism Parallel or distributed computation label Jan 18, 2023
@timholy
Member

timholy commented Jan 18, 2023

Does the same thing happen on 1.9? It's possible 1.9's extra pickiness might actually solve this problem?

@Moelf
Contributor Author

Moelf commented Jan 18, 2023

it does, the original post was based on

this has stopped working on 1.9-beta2 with errors like

@Moelf
Contributor Author

Moelf commented May 18, 2023

fwiw I still hit this on 1.9.0

julia> using WVZAnalysis, ClusterManagers, Distributed, Pkg

julia> addprocs(HTCManager(4); extrajdl=["+queue=\"short\""], exeflags = `--project=$(Base.active_project()) -e 'include("/data/jiling/WVZ/init.jl")'`);
Waiting for 4 workers: ]1 23 4 .

(WVZAnalysis) pkg> precompile

julia> @everywhere using WVZAnalysis
      From worker 4:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for FHist [68837c9b-b678-4cd5-9925-8a54edc8f695]
      From worker 4:	│   exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.9/FHist/heGI2_Mag0e.so.
      From worker 4:	└ @ Base loading.jl:1783
      From worker 5:	┌ Warning: The call to compilecache failed to create a usable precompiled cache file for FHist [68837c9b-b678-4cd5-9925-8a54edc8f695]
      From worker 5:	│   exception = ArgumentError: Invalid checksum in cache file /home/jiling/.julia/compiled/v1.9/FHist/heGI2_Mag0e.so.
      From worker 5:	└ @ Base loading.jl:1783

I find this extremely disruptive for the workflow -- when we have the same architecture and OS across the cluster, not having to "compile" on every node was a big advantage of Julia over a C++ workflow.

Now it's worse than C++ -- even if I wanted to, I don't know how to tell each node to compile its own cache separately.

@IanButterworth
Member

Can you try with the JULIA_DEBUG=loading env var set for the workers? For some reason the precompile cache made by the parent is being ignored by the workers; that should explain why.
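One way to do that, sketched with local workers (the `env` keyword of `addprocs` is assumed to be forwarded by the cluster manager in use; HTCManager may instead need its own `extraenv`-style option, as used elsewhere in this thread):

```julia
using Distributed

# Ask addprocs to set the variable in each worker's environment.
addprocs(2; env = ["JULIA_DEBUG" => "loading"])

# Sanity check: every worker should report the setting, and then print
# `Debug: ...` lines from Base's loading code on the next `using`.
@everywhere @show get(ENV, "JULIA_DEBUG", "(unset)")
```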

@Moelf
Contributor Author

Moelf commented May 18, 2023

I did that for when I have 1 remote worker: https://pastebin.com/tt3Wz5qL

@IanButterworth
Member

Are the workers on the same machine or the same architecture? If so, perhaps it's

      From worker 2:	┌ Debug: Rejecting cache file /home/jiling/.julia/compiled/v1.9/Statistics/ERcPL_qEvLw.ji for Statistics [10745b16-79ce-11e8-11f9-7d13ad32a3b2] since the flags are mismatched
      From worker 2:	│   current session: use_pkgimages = true, debug_level = 1, check_bounds = 0, inline = true, opt_level = 2
      From worker 2:	│   cache file:      use_pkgimages = false, debug_level = 1, check_bounds = 0, inline = true, opt_level = 0
      From worker 2:	└ @ Base loading.jl:2690

so the workers want O2 but the parent is using O0?

Given there's a lot of cache files there, perhaps retry with a cleared out precompile cache dir, so there's just the file generated by the parent?

@IanButterworth
Member

So there are two issues here:

  1. Figuring out why the workers aren't using the parent's precompile cache files (it might be reasonable not to if they're on different architecture or different julia optimization settings)

  2. Making multiple workers using the same depot not crash when they try to precompile at the same time, which is the goal of pidlock cache file precompilation #49052
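For reference, the locking primitive that PR builds on is the Pidfile mechanism in the FileWatching stdlib (Julia 1.9+); a standalone sketch, with a placeholder lock path:

```julia
using FileWatching.Pidfile  # stdlib in Julia ≥ 1.9

# Only one process at a time holds the lock; concurrent processes block
# until it is released instead of racing on the same output file.
mkpidlock("/tmp/demo-cache.pid") do
    # ... compute and atomically rename the cache file here ...
end
```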

@Moelf
Contributor Author

Moelf commented May 18, 2023

let me try again by removing ~/.julia/compiled first https://pastebin.com/RpfsRpDn

I still see weird rejection. I also don't understand how the "current session" would have check_bounds = 0 btw, since I'm not tweaking that at all

@Moelf
Contributor Author

Moelf commented May 18, 2023

is there a way to tell workers to use their own depot at the moment? I guess I can manually set JULIA_DEPOT_PATH to workers' /tmp maybe
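A rough, untested sketch of that idea (the `/tmp` path is a placeholder; on a cluster, each node's `/tmp` is local, so workers on different machines would get separate caches):

```julia
using Distributed

# Put a node-local depot first (new precompile caches land there) and
# keep the shared depot second so installed packages are still found.
depot = "/tmp/$(ENV["USER"])-depot:" * first(DEPOT_PATH)
addprocs(2; env = ["JULIA_DEPOT_PATH" => depot])

@everywhere @show first(DEPOT_PATH)
```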

@IanButterworth
Member

I don't see any issue in the latest log shared in #48217 (comment)
The check_bounds=1 rejection is just because the stdlibs are shipped in two versions, one with check_bounds on and one on auto. I assume in that case you didn't see any precompilation on the workers

@Moelf
Contributor Author

Moelf commented May 18, 2023

No, workers still tried to precompile in this case

@IanButterworth
Member

Did you chop that out of the log then? Because there's no log line saying the (single) worker is having to precompile.

@Moelf
Contributor Author

Moelf commented May 19, 2023

https://pastebin.com/dRFucaqM

what does this mean?

since pkgimage can't be loaded on this target

yeah, idk, I think in general I still run into it no matter what I do. @vchuravy had a theory this might have something to do with the file system (in this case FUSE-mounted over the network) not supporting certain features (atomic move?).

I guess in this case we should try to mitigate, or at least provide users a workaround (spiritually similar to Revise.jl's "polling")

@KristofferC
Member

Can you play with setting the env variable JULIA_CPU_TARGET? For example, try setting it to generic.

@vchuravy
Member

Okay this is the smoking gun that I was looking for:

From worker 3:	┌ Debug: Rejecting cache file /home/jiling/.julia/compiled/v1.9/WVZAnalysis/mPzCc_RVCk5.ji for WVZAnalysis [15e846ca-95d3-4e21-8d51-3cc2ce27e5cd] since pkgimage can't be loaded on this target

Pkgimages compile first for the native target, but we don't know what that target is, so we can't mix the information into the hash... Therefore the cache file location collides with the previous one.

What is the cpuinfo on both machines? As @KristofferC alluded to, you can set JULIA_CPU_TARGET to control the architecture we create cache files for.

@Moelf
Contributor Author

Moelf commented May 19, 2023

julia> using WVZAnalysis, ClusterManagers, Distributed

julia> addprocs(HTCManager(4); extrajdl=["+queue=\"short\""], extraenv=["export JULIA_CPU_TARGET=generic"], exeflags = `-e 'include("/data/jiling/WVZ/init.jl")'`);
Waiting for 4 workers: 1 2 3 4 .

julia> @fetchfrom 1 ENV["JULIA_CPU_TARGET"]
"generic"

julia> @fetchfrom 2 ENV["JULIA_CPU_TARGET"]
"generic"

julia> @everywhere using WVZAnalysis

julia>

looks like success to me!

@KristofferC
Member

Pkgimages compiles first for the native target, but we don't know what the target is so we can't mixin the information into the hash... Therefore the cache file location collides with the previous one.

Almost seems like this should be more than a debug message.

@giordano
Contributor

Alternatively you can set JULIA_CPU_TARGET="target1;target2" to target both CPUs if they're different.
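For example (a sketch; the two target names below are placeholders taken from `Sys.CPU_NAME` on each machine, and the parent session itself should also be launched with the same setting so its own caches match):

```julia
using Distributed

Sys.CPU_NAME  # run on each machine to find its target, e.g. "znver2"

# Placeholder pair of targets; the syntax mirrors the sysimage
# multi-target format, with clone_all requesting full code generation
# for that target.
target = "generic;skylake-avx512,clone_all;znver2,clone_all"
addprocs(2; env = ["JULIA_CPU_TARGET" => target])
```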

@Moelf
Contributor Author

Moelf commented May 19, 2023

btw here's the CPU related information @vchuravy https://pastebin.com/x7gA14Gx

they look... the same? The NUMA node0 CPU... line is different, but why do we care about that?

@IanButterworth
Member

Almost seems like this should be more than a debug message.

Yeah. Something like

Caches exist in this Depot for another cpu architecture. Consider setting .... to make precompilation target both

@Moelf
Contributor Author

Moelf commented May 19, 2023

well in this case what's actually different for the purpose of Julia's compiled-cache hashing? The CPU arch seems to be identical

@Moelf
Contributor Author

Moelf commented Nov 8, 2023

https://discourse.julialang.org/t/yet-another-precompilation-on-hpc-issue/105731/

https://discourse.julialang.org/t/julia-1-9-same-depot-with-different-machines/103463

We need like a very late PSA on this I think, if we can't just fix this.

@giordano
Contributor

giordano commented Nov 8, 2023

The first one looks completely unrelated to this problem though; it's installation of packages somehow using a compile-time path.

@vtjnash vtjnash closed this as not planned Won't fix, can't repro, duplicate, stale Aug 9, 2024