dlclose'ing the compatibility driver can fail #1848

Closed
maleadt opened this issue Mar 31, 2023 · 1 comment · Fixed by #2463
Labels
bug (Something isn't working), installation (CUDA is easy to install, right?)

Comments

maleadt commented Mar 31, 2023

As observed on the benchmark bot:

$ JULIA_DEBUG=CUDA_Driver_jll julia --project -e 'using CUDA'
┌ Debug: System CUDA driver found at libcuda.so.1, detected as version 11.6.0
└ @ CUDA_Driver_jll ~/.julia/packages/CUDA_Driver_jll/cTyAb/src/wrappers/x86_64-linux-gnu.jl:87
┌ Debug: Forward-compatible CUDA driver found at /home/tbesard/.julia/artifacts/7a7fd08bbad6b42e7ed42fd2fc058e42039b075f/lib/libcuda.so; known to be version 12.1.0
└ @ CUDA_Driver_jll ~/.julia/packages/CUDA_Driver_jll/cTyAb/src/wrappers/x86_64-linux-gnu.jl:138
┌ Debug: Could not use forward compatibility package (error 804)
└ @ CUDA_Driver_jll ~/.julia/packages/CUDA_Driver_jll/cTyAb/src/wrappers/x86_64-linux-gnu.jl:151

signal (11): Segmentation fault
in expression starting at none:1
unknown function (ip: 0x7ffff7df3d46)
_dl_exception_create at /lib64/ld-linux-x86-64.so.2 (unknown line)
_dl_signal_error at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7ffff7de9c80)
_dl_catch_exception at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_dl_catch_error at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7ffff7bd0744)
dlclose at /lib/x86_64-linux-gnu/libdl.so.2 (unknown line)
dlclose at ./libdl.jl:165
unknown function (ip: 0x7fffa14ea935)

I encountered this after seeing the following on CI:

┌ Error: Failed to initialize CUDA
│   exception =
│    CUDA error (code 804, CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE)
│    Stacktrace:
│      [1] throw_api_error(res::CUDA.cudaError_enum)
│        @ CUDA /var/lib/buildkite-agent/builds/gpuci1/julialang/cuda-dot-jl/lib/cudadrv/libcuda.jl:27
│      [2] macro expansion
│        @ /var/lib/buildkite-agent/builds/gpuci1/julialang/cuda-dot-jl/lib/cudadrv/libcuda.jl:35 [inlined]
│      [3] cuInit
│        @ /var/lib/buildkite-agent/builds/gpuci1/julialang/cuda-dot-jl/lib/utils/call.jl:26 [inlined]
│      [4] __init__()
│        @ CUDA /var/lib/buildkite-agent/builds/gpuci1/julialang/cuda-dot-jl/src/initialization.jl:119
│      [5] _include_from_serialized(path::String, depmods::Vector{Any})
│        @ Base ./loading.jl:696
│      [6] _require_from_serialized(path::String)
│        @ Base ./loading.jl:749
│      [7] _require(pkg::Base.PkgId)
│        @ Base ./loading.jl:1053
│      [8] require(uuidkey::Base.PkgId)
│        @ Base ./loading.jl:936
│      [9] require(into::Module, mod::Symbol)
│        @ Base ./loading.jl:923
│     [10] include(fname::String)
│        @ Base.MainInclude ./client.jl:444
│     [11] top-level scope
│        @ none:12
│     [12] eval
│        @ ./boot.jl:360 [inlined]
│     [13] exec_options(opts::Base.JLOptions)
│        @ Base ./client.jl:261
│     [14] _start()
│        @ Base ./client.jl:485
└ @ CUDA /var/lib/buildkite-agent/builds/gpuci1/julialang/cuda-dot-jl/src/initialization.jl:121

That error really shouldn't be possible: initialization should already have failed while loading CUDA_Driver_jll, at which point we should have decided not to use the forward-compatible driver. The fact that trying this out in isolation results in a segfault may be related...
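
To make the expected control flow concrete, here is a minimal C sketch of a probe-then-fall-back check. It is not the CUDA_Driver_jll implementation: `probe_driver`, `probe_status` and `init_fn` are hypothetical names, and skipping `dlclose` after a failed init call is an assumption made here because the traces in this issue crash inside `dlclose`, not a statement of how the issue was fixed. The only outside fact relied on is that `cuInit` takes an `unsigned int` flags argument and returns an integer status where 0 means success.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Outcome of probing one candidate driver library (hypothetical type). */
typedef enum {
    PROBE_OK = 0,       /* library loaded and its init call returned 0 */
    PROBE_LOAD_FAILED,  /* dlopen failed, so there is nothing to unload */
    PROBE_NO_SYMBOL,    /* library loaded but does not export the init symbol */
    PROBE_INIT_FAILED   /* init call returned a nonzero status (e.g. 804 or 100) */
} probe_status;

/* Shape of cuInit: takes flags, returns an integer status code. */
typedef int (*init_fn)(unsigned int);

/*
 * Load `path`, look up `init_symbol`, and call it with flags = 0.
 * On PROBE_OK the open handle is stored in *handle_out; otherwise it is NULL.
 * If init_code_out is non-NULL it receives the nonzero init status, or 0.
 */
probe_status probe_driver(const char *path, const char *init_symbol,
                          void **handle_out, int *init_code_out)
{
    assert(path != NULL && init_symbol != NULL && handle_out != NULL);
    *handle_out = NULL;
    if (init_code_out != NULL)
        *init_code_out = 0;

    void *handle = dlopen(path, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL)
        return PROBE_LOAD_FAILED;

    init_fn init = (init_fn)dlsym(handle, init_symbol);
    if (init == NULL) {
        /* No code from the library has run yet, so unloading is routine. */
        dlclose(handle);
        return PROBE_NO_SYMBOL;
    }

    int code = init(0);
    if (code != 0) {
        if (init_code_out != NULL)
            *init_code_out = code;
        /* ASSUMPTION: leave the library loaded instead of calling dlclose,
         * since dlclose after a failed init is where the traces above crash.
         * The handle is intentionally leaked; the caller must not use it. */
        return PROBE_INIT_FAILED;
    }

    *handle_out = handle;
    return PROBE_OK;
}
```

A caller would probe the forward-compatible library first and fall back to the system `libcuda.so.1` whenever the status is anything other than `PROBE_OK`. On older glibc versions the program has to be linked with `-ldl`.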

maleadt added the bug and installation labels Mar 31, 2023

maleadt commented Nov 15, 2023

@simonbyrne ran into this, with the following stack trace:

┌ Debug: System CUDA driver found at libcuda.so.1, detected as version 12.2.0
└ @ CUDA_Driver_jll /central/scratch/esm/slurm-buildkite/shared_depot/packages/CUDA_Driver_jll/TNFcW/src/wrappers/x86_64-linux-gnu.jl:130
┌ Debug: Forward-compatible CUDA driver found at /central/scratch/esm/slurm-buildkite/shared_depot/artifacts/09eba544c107fcbe4c50dc34a32b398dd75d33fb/lib/libcuda.so; known to be version 12.3.0
└ @ CUDA_Driver_jll /central/scratch/esm/slurm-buildkite/shared_depot/packages/CUDA_Driver_jll/TNFcW/src/wrappers/x86_64-linux-gnu.jl:184
┌ Debug: Could not use forward compatibility package (error 100)
└ @ CUDA_Driver_jll /central/scratch/esm/slurm-buildkite/shared_depot/packages/CUDA_Driver_jll/TNFcW/src/wrappers/x86_64-linux-gnu.jl:192
 
[20393] signal (11.1): Segmentation fault
in expression starting at /central/scratch/esm/slurm-buildkite/climacore-ci/2845/climacore-ci/test/DataLayouts/data2d.jl:2
strlen at /lib64/ld-linux-x86-64.so.2 (unknown line)
_dl_signal_error at /lib64/ld-linux-x86-64.so.2 (unknown line)
_dl_close at /lib64/ld-linux-x86-64.so.2 (unknown line)
_dl_catch_error at /lib64/ld-linux-x86-64.so.2 (unknown line)
_dlerror_run at /lib64/libdl.so.2 (unknown line)
dlclose at /lib64/libdl.so.2 (unknown line)
dlclose at ./libdl.jl:165
unknown function (ip: 0x7fa0ea141d65)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
__init__ at /central/scratch/esm/slurm-buildkite/shared_depot/packages/CUDA_Driver_jll/TNFcW/src/wrappers/x86_64-linux-gnu.jl:195
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
jl_module_run_initializer at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:75
ijl_init_restored_modules at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/module.c:982
register_restored_modules at ./loading.jl:1115
_include_from_serialized at ./loading.jl:1061
_tryrequire_from_serialized at ./loading.jl:1391
_require_search_from_serialized at ./loading.jl:1494
_require at ./loading.jl:1783
_require_prelocked at ./loading.jl:1660
macro expansion at ./loading.jl:1648 [inlined]
macro expansion at ./lock.jl:267 [inlined]
require at ./loading.jl:1611
jfptr_require_45889.clone_1 at /central/software/julia/1.9.3/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
call_require at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:466 [inlined]
eval_import_path at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:503
jl_toplevel_eval_flex at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:731
jl_toplevel_eval_flex at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1903
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
_include at ./loading.jl:1963
include at ./Base.jl:457
jfptr_include_35036.clone_1 at /central/software/julia/1.9.3/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
exec_options at ./client.jl:307
_start at ./client.jl:522
jfptr__start_40034.clone_1 at /central/software/julia/1.9.3/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
true_main at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jlapi.c:717
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 1210337 (Pool: 1209401; Big: 936); GC: 2
/bin/bash: line 1: 20393 Segmentation fault      julia --color=yes --check-bounds=yes --project=test test/DataLayouts/data2d.jl
