Sudden segfault when doing calculating FWI on the cloud #214

Closed
kerim371 opened this issue Nov 11, 2023 · 10 comments

@kerim371
Contributor

Hi,

I run calculations on the cloud (a master node and 4 compute nodes, standard SSH cluster manager, CentOS 7).

Since yesterday I have been getting a segmentation fault.
For about a week before that I hadn't encountered this error:

      From worker 3:    Operator `forward` ran in 8.50 s
      From worker 5:    Operator `forward` ran in 8.55 s
      From worker 2:    Operator `forward` ran in 8.02 s
      From worker 4:    Operator `forward` ran in 8.30 s
      From worker 4:
      From worker 4:    [9258] signal (11.1): Segmentation fault
      From worker 4:    in expression starting at none:0
      From worker 4:    sgemm_itcopy_SKYLAKEX at /home/kerim/shared_app/julia/julia-1.9.3/bin/../lib/julia/libopenblas64_.so (unknown line)
      From worker 4:    sgemm_nn at /home/kerim/shared_app/julia/julia-1.9.3/bin/../lib/julia/libopenblas64_.so (unknown line)
      From worker 4:    sgemm_64_ at /home/kerim/shared_app/julia/julia-1.9.3/bin/../lib/julia/libopenblas64_.so (unknown line)
      From worker 4:    gemm! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/blas.jl:1524
      From worker 4:    gemm_wrapper! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:674
      From worker 4:    mul! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:161 [inlined]
      From worker 4:    mul! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:276 [inlined]
      From worker 4:    * at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:148 [inlined]
      From worker 4:    SincInterpolation at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:553
      From worker 4:    macro expansion at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:527 [inlined]
      From worker 4:    macro expansion at ./timing.jl:393 [inlined]
      From worker 4:    macro expansion at /home/kerim/.julia/packages/JUDI/JEsVr/src/JUDI.jl:141 [inlined]
      From worker 4:    time_resample at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:523
      From worker 4:    time_resample at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:547 [inlined]
      From worker 4:    post_process at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Modeling/time_modeling_serial.jl:61
      From worker 4:    unknown function (ip: 0x7f6a7d452b42)
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    time_modeling at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Modeling/time_modeling_serial.jl:52
      From worker 4:    unknown function (ip: 0x7f6adaf3c1d8)
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    propagate at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Modeling/propagation.jl:9
      From worker 4:    unknown function (ip: 0x7f6adaf306d6)
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 4:    jl_f__call_latest at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:774
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 4:    do_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:730
      From worker 4:    #invokelatest#2 at ./essentials.jl:819
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 4:    do_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:730
      From worker 4:    invokelatest at ./essentials.jl:816
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 4:    do_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:730
      From worker 4:    #107 at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:281
      From worker 4:    run_work_thunk at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
      From worker 4:    unknown function (ip: 0x7f6adaf2e8b9)
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    run_work_thunk at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:79
      From worker 4:    #100 at ./task.jl:514
      From worker 4:    unknown function (ip: 0x7f6adaf2e47f)
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 4:    start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
      From worker 4:    Allocations: 83909332 (Pool: 83868309; Big: 41023); GC: 241
[ Info: Line search failed
         2          5          5          9     0.00000e+00     3.05662e-01     2.81569e+06     1.25000e-01
Step size: 0.00e+00 below progTol: 1.00e-10
Worker 4 terminated.
      From worker 3:    Operator `forward` ran in 8.10 s
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#715")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:949
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:955
  [4] unsafe_read
    @ ./io.jl:761 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:760
  [6] read!
    @ ./io.jl:762 [inlined]
  [7] deserialize_hdr_raw
    @ ~/shared_app/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/shared_app/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/shared_app/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ./task.jl:514
      From worker 5:    Operator `forward` ran in 8.33 s

I used to have Julia LTS 1.6.7, but today, after this problem started annoying me, I updated Julia using the following commands:

using Pkg
Pkg.add("UpdateJulia")  
using UpdateJulia
update_julia()

and now the Julia version is 1.9.3, but the problem still appears.
The error can occur at iteration 9 or 4, or probably at any other point.

I understand that the problem is unlikely to be related to JUDI itself, but maybe you have already seen it?

@mloubout
Member

It seems to happen randomly because of BLAS multithreading when there is more than one Julia worker on the same node.
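
To see whether several workers actually share a node, one could group the workers by host name. This is only a minimal sketch: the two local workers stand in for the SSH cluster, and the names `hosts` and `per_host` are hypothetical.

```julia
using Distributed

# Two local workers stand in for the SSH cluster (assumption for this sketch).
nprocs() == 1 && addprocs(2)

# Ask every worker for the name of the machine it runs on.
hosts = Dict(p => remotecall_fetch(gethostname, p) for p in workers())

# Count how many workers ended up on each host.
per_host = Dict{String,Int}()
for h in values(hosts)
    per_host[h] = get(per_host, h, 0) + 1
end

for (h, n) in per_host
    println("$h: $n worker(s)")
end
```

Any host that reports more than one worker is a place where multithreaded BLAS calls from different workers can compete for the same cores.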

@mloubout
Member

We should maybe disable it, but that can make the data time interpolation a bit slow without multithreading.

@kerim371
Contributor Author

Thank you!
Disabling it with BLAS.set_num_threads(1) should probably work. I will try it.

@kerim371
Contributor Author

I just tried to run FWI with preliminary settings in startup.jl:

@info "STARTUP SCRIPT: $(@__FILE__ )"

using LinearAlgebra
BLAS.set_num_threads(1) 

ENV["DEVITO_LANGUAGE"]="openmp"
ENV["OMP_NUM_THREADS"]=length(Sys.cpu_info())
ENV["DEVITO_LOGGING"]="INFO"

@info "Number of BLAS threads: $(BLAS.get_num_threads())"
@info "DEVITO_LANGUAGE: $(ENV["DEVITO_LANGUAGE"])"
@info "OMP_NUM_THREADS: $(ENV["OMP_NUM_THREADS"])"
@info "DEVITO_LOGGING: $(ENV["DEVITO_LOGGING"])"

and this didn't help: I got the same segfault error at the 9th FWI iteration.

@mloubout
Member

mloubout commented Nov 12, 2023

This won't help: JUDI sets the number of BLAS threads in its init, so you need to set it to 1 after using JUDI

BLAS.set_num_threads(Threads.nthreads())
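
A minimal sketch of that ordering on a Distributed setup, assuming local stand-in workers; the `using JUDI` line is commented out so the snippet runs without the package, and the name `blas_threads` is hypothetical:

```julia
using Distributed

# Two local workers stand in for the cluster nodes (assumption for this sketch).
nprocs() == 1 && addprocs(2)

@everywhere using LinearAlgebra
# @everywhere using JUDI   # in a real run, load JUDI here, before the next line

# Set the BLAS thread count after the package is loaded, on every process,
# so that the package's init cannot overwrite the value afterwards.
@everywhere BLAS.set_num_threads(1)

# Read the setting back from every process to confirm it took effect.
blas_threads = Dict(p => remotecall_fetch(BLAS.get_num_threads, p) for p in procs())
@info "BLAS threads per process" blas_threads
```

Setting the value in startup.jl runs before any package is loaded, which is why it can be overridden later.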

@kerim371
Contributor Author

This won't help: JUDI sets the number of BLAS threads in its init, so you need to set it to 1 after using JUDI

BLAS.set_num_threads(Threads.nthreads())

Didn't know that! Thank you!

@kerim371
Contributor Author

kerim371 commented Nov 13, 2023

@mloubout it is strange, but sometimes @everywhere BLAS.set_num_threads(1) after using JUDI works and sometimes it doesn't.
Now I'm trying to use only 3 of the 4 computational cores on each node:

addprocs(["[email protected]",
          "[email protected]",
          "[email protected]",
          "[email protected]"], 
          env=["DEVITO_LANGUAGE"=>"openmp", "OMP_NUM_THREADS"=>"3", "DEVITO_LOGGING"=>"INFO"])

I hope this helps
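
The arithmetic behind that OMP_NUM_THREADS value can be sketched as follows. The helper `threads_per_worker` and its `reserve` keyword are hypothetical names used only for illustration, and this is a way to derive the setting, not a confirmed fix for the segfault.

```julia
# Hypothetical helper: split a node's cores evenly among the workers running
# on it, optionally leaving `reserve` cores free, and never return less than 1.
function threads_per_worker(ncores::Integer, workers_on_node::Integer; reserve::Integer = 0)
    return max(1, (ncores - reserve) ÷ workers_on_node)
end

# 4 cores per node, one worker per node, one core left free (assumed values).
omp_threads = threads_per_worker(4, 1; reserve = 1)

# Environment entries in the form accepted by the env keyword of addprocs.
env = ["DEVITO_LANGUAGE" => "openmp",
       "OMP_NUM_THREADS" => string(omp_threads),
       "DEVITO_LOGGING" => "INFO"]

println(env)
```

With more than one worker per node, the same helper gives each worker a smaller share so that the total number of OpenMP threads stays within the core count.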

@kerim371
Contributor Author

kerim371 commented Nov 13, 2023

Update: limiting each node to 3 of its 4 cores seemed to help at first, but in the end it didn't help...

@kerim371
Contributor Author

The Julia community's thoughts on this (for future reference):
JuliaLang/LinearAlgebra.jl#1038

@mloubout
Member

Thanks for raising it there, and for the update.
