compiler hang in julia 1.7 on Gridap tutorial #42236

Closed
stevengj opened this issue Sep 13, 2021 · 23 comments · Fixed by #42263

Comments

@stevengj
Member

stevengj commented Sep 13, 2021

See the following issue in Gridap.jl: one of the tutorials that worked in Julia 1.6 now causes the compiler to hang in 1.7.0-beta4.

gridap/Tutorials#118

cc @fverdugo

@stevengj stevengj added the regression Regression in behavior compared to a previous version label Sep 13, 2021
@stevengj stevengj added this to the 1.7 milestone Sep 13, 2021
@stevengj
Member Author

stevengj commented Sep 13, 2021

Update: I can reproduce it in Julia master (1.8.0-DEV.532) on macOS (x86_64-apple-darwin18.7.0).

To reproduce, run git clone https://github.com/gridap/Tutorials, start julia --project=Tutorials, run Pkg.resolve() and Pkg.update() (since the dependency versions listed in Tutorials/Project.toml are too old for Julia 1.8), then cd to Tutorials/test and run include("../src/emscatter.jl").

I get

julia> include("../src/emscatter.jl")
Error   : Unknown number option 'General.SolverPositionX'
Error   : Unknown number option 'General.SolverPositionY'
Error   : Unknown number option 'General.SolverHeight'
Error   : Unknown number option 'General.SolverWidth'
Info    : Reading '../models/geometry.msh'...
Info    : 21 entities
Info    : 44987 nodes
Info    : 89876 elements
Info    : Done reading '../models/geometry.msh'

at which point it hangs. (It's been running for 20 minutes so far.) (The "Error" messages are from gmsh, and it seems they should actually be warnings since it succeeds in reading the mesh.)
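
The reproduction steps above could be scripted roughly as follows. This is only a sketch: the function name `reproduce` is hypothetical, and it assumes the current directory already contains a clone of https://github.com/gridap/Tutorials in a folder named `Tutorials`. Calling it is expected to hang on the affected Julia versions, exactly as described.

```julia
using Pkg

# Hypothetical helper, not part of Gridap or the Tutorials repository.
# Assumes `repo` is the path to a local clone of gridap/Tutorials.
function reproduce(repo::AbstractString = "Tutorials")
    Pkg.activate(repo)   # same project as starting julia with --project=<repo>
    Pkg.resolve()
    Pkg.update()         # the pinned dependency versions are too old for Julia 1.8
    # Use an absolute path so the script is found regardless of where this code lives.
    script = abspath(repo, "src", "emscatter.jl")
    # The tutorial reads '../models/geometry.msh', so run it from the test folder.
    cd(joinpath(repo, "test")) do
        Base.include(Main, script)
    end
end
```

Nothing runs until `reproduce()` is called, so the definition itself is safe to paste into a session.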

Killing the process seems to confirm that it has hung in compilation:

signal (15): Terminated: 15
in expression starting at /Users/stevenj/Documents/Code/Tutorials/src/emscatter.jl:187
var_gt at /Users/stevenj/Documents/Code/julia/src/subtype.c:657
forall_exists_equal at /Users/stevenj/Documents/Code/julia/src/subtype.c:1366
subtype_tuple_varargs at /Users/stevenj/Documents/Code/julia/src/subtype.c:990 [inlined]
subtype_tuple_tail at /Users/stevenj/Documents/Code/julia/src/subtype.c:1041
subtype_unionall at /Users/stevenj/Documents/Code/julia/src/subtype.c:803
exists_subtype at /Users/stevenj/Documents/Code/julia/src/subtype.c:1390 [inlined]
forall_exists_subtype at /Users/stevenj/Documents/Code/julia/src/subtype.c:1418
subtype_ccheck at /Users/stevenj/Documents/Code/julia/src/subtype.c:555
var_gt at /Users/stevenj/Documents/Code/julia/src/subtype.c:657
forall_exists_equal at /Users/stevenj/Documents/Code/julia/src/subtype.c:1366
subtype at /Users/stevenj/Documents/Code/julia/src/subtype.c:1301
subtype_unionall at /Users/stevenj/Documents/Code/julia/src/subtype.c:769
subtype_unionall at /Users/stevenj/Documents/Code/julia/src/subtype.c:769
subtype_unionall at /Users/stevenj/Documents/Code/julia/src/subtype.c:803
subtype_tuple_tail at /Users/stevenj/Documents/Code/julia/src/subtype.c:1074
subtype_unionall at /Users/stevenj/Documents/Code/julia/src/subtype.c:803
exists_subtype at /Users/stevenj/Documents/Code/julia/src/subtype.c:1390 [inlined]
forall_exists_subtype at /Users/stevenj/Documents/Code/julia/src/subtype.c:1418
jl_subtype_env at /Users/stevenj/Documents/Code/julia/src/subtype.c:1873
jl_type_union at /Users/stevenj/Documents/Code/julia/src/jltypes.c:476
analyze_single_call! at ./compiler/ssair/inlining.jl:1210
assemble_inline_todo! at ./compiler/ssair/inlining.jl:1356
ssa_inlining_pass! at ./compiler/ssair/inlining.jl:72
jfptr_ssa_inlining_passNOT._12806 at /Users/stevenj/Documents/Code/julia/usr/lib/julia/sys.dylib (unknown line)
...

@stevengj stevengj added the bug Indicates an unexpected problem or unintended behavior label Sep 13, 2021
@stevengj
Member Author

stevengj commented Sep 13, 2021

@fverdugo and @WenjieYao, it's taking an awfully long time to run this on Julia 1.6 for me — can you check for which version of Julia this script was working?

@fverdugo

fverdugo commented Sep 13, 2021

@stevengj I have not run it locally (as far as I can remember), but it seems to take a long time also in the CI:
https://github.com/gridap/Tutorials/runs/3433911309?check_suite_focus=true#step:6:46

@KristofferC
Member

If it is something with type intersection it should repro even with a smaller mesh, so maybe try that and get the 1.6 time into something reasonable.

@vchuravy vchuravy changed the title compiler hang in julia 1.7 on Gridap tutorial subtyping hang in julia 1.7 on Gridap tutorial Sep 13, 2021
@stevengj
Member Author

The mesh is small enough that the actual solver should take seconds, no? I think most of the time is compilation even on 1.6.

@KristofferC
Member

I didn't look at the mesh size, I just assumed that the problem didn't take 20 minutes to compile on 1.6.

@WenjieYao

@stevengj My Julia version is 1.6.0. I think one reason it takes so long is the analytical solution: each element requires a sum of 81 (= 40 + 40 + 1) Bessel/Hankel function terms. That is indeed the most time-consuming part when I run the code in a Jupyter notebook. However, when I run the code from the command line with `julia emscatter.jl`, it takes far longer than in the Jupyter notebook.

@stevengj
Member Author

(We can just comment out the analytical part for checking the compiler.)

@stevengj
Member Author

stevengj commented Sep 13, 2021

In particular, you can execute this line instead of include("../src/emscatter.jl") in order to skip the expensive semi-analytical calculation:

@time include_string(Main, join(first(eachline("../src/emscatter.jl"), 188), '\n'))

It takes about 78 seconds on my machine with Julia 1.6.2:

 78.294438 seconds (182.82 M allocations: 11.978 GiB, 4.10% gc time, 0.01% compilation time)
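
The truncation trick can be tried on a toy file first. The sketch below uses a hypothetical three-line temporary script in place of emscatter.jl and a scratch module in place of Main; only `first(eachline(path), n)`, `join`, and `include_string` are taken from the command above.

```julia
# Hypothetical three-line script standing in for emscatter.jl.
path, io = mktemp()
write(io, "a = 1\nb = a + 1\nerror(\"expensive tail that should be skipped\")\n")
close(io)

# Evaluate only the first two lines, in a scratch module instead of Main.
scratch = Module(:Scratch)
code = join(first(eachline(path), 2), '\n')
result = include_string(scratch, code)

println(result)   # value of the last evaluated line, here b
```

Because the third line is never read, the `error` call is not reached, which is the same reason the 188-line prefix skips the semi-analytical calculation.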

@fverdugo

fverdugo commented Sep 13, 2021

On my side, I have launched a run (full tutorial) with a mesh with 10x larger element size: https://github.com/gridap/Tutorials/runs/3589643700?check_suite_focus=true#step:6:47

The tutorial runs in 60 secs in Julia 1.6 with the coarse mesh.

@vchuravy vchuravy changed the title subtyping hang in julia 1.7 on Gridap tutorial compiler hang in julia 1.7 on Gridap tutorial Sep 13, 2021
@stevengj
Member Author

stevengj commented Sep 13, 2021

The same

include_string(Main, join(first(eachline("../src/emscatter.jl"), 188), '\n'))

indeed hangs in 1.7.0-beta4 — it's been running on my machine for an hour now.

I think the original mesh is fine (runs in about a minute) if you use the truncated tutorial file as I've done here.

@vchuravy
Member

Is it hanging or just taking a very long time?

Can you take a profile?

@stevengj
Member Author

stevengj commented Sep 13, 2021

It has been running for over 24 hours … I never saw it complete … when on 1.6 it takes a minute, so it's either hanging or there is a > 1000x slowdown.

Do you mean gcc-based profiling, i.e. compiling Julia with -pg?

@JeffBezanson
Member

In a debug build there is also:

Unbound GlobalRef not allowed in value position
Internal error: encountered unexpected error in runtime:
ErrorException("")
error at ./error.jl:33
check_op at ./compiler/ssair/verify.jl:41
verify_ir at ./compiler/ssair/verify.jl:217
verify_ir at ./compiler/ssair/verify.jl:67 [inlined]
run_passes at ./compiler/optimize.jl:338
optimize at ./compiler/optimize.jl:314 [inlined]

but probably a separate issue?

@JeffBezanson
Member

Looks like these arguments to subtype are taking an extremely long time:

Tuple{typeof(Base.eltype), Union{Type{var"#s521"} where var"#s521"<:Tuple{E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, Vararg{E, N}} where N, Type{var"#s78"} where var"#s78"<:(Tuple{E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, E, Vararg{E, N}} where N)}} where E<:Function
Tuple{typeof(Base.eltype), Type{var"#s521"} where var"#s521"<:Tuple{Function, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Union{Gridap.CellData.CellField, Function}, Vararg{Union{Gridap.CellData.CellField, Function}, N}} where N}

@JeffBezanson JeffBezanson self-assigned this Sep 14, 2021
@JeffBezanson
Member

The "good news" is that that subtype query takes forever in 1.6 as well; the change seems to be that inlining is now trying to union those types, which didn't occur in 1.6.
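
To see why forming a union can involve subtype queries at all, here is a small sketch using built-in types only (the types from this issue are deliberately not reproduced). Julia simplifies a `Union` whose members are related by subtyping, which is consistent with the `jl_subtype_env` frame directly under `jl_type_union` in the trace above.

```julia
# Built-in types only; this does not reproduce the slow query from the issue.
simplified = Union{Int, Integer}   # Int <: Integer, so the union collapses
unrelated  = Union{Int, String}    # no subtype relation, both members are kept

println(simplified === Integer)    # true
println(unrelated isa Union)       # true

# Time a single subtype query; for ordinary types this is effectively instant.
t = @elapsed (Tuple{Int, Vararg{Int}} <: Tuple{Vararg{Integer}})
println("subtype query took ", t, " seconds")
```

With the two large types quoted earlier, the same kind of check is what takes an extremely long time.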

@fverdugo

Hi @JeffBezanson,

one of our users was able to find an MWE, gridap/Gridap.jl#657 (here the code crashes with an internal error instead of hanging).

I have been able to reproduce the crash on my machine.

@KristofferC
Member

one of our users was able to find an MWE, gridap/Gridap.jl#657 (here the code crashes with an internal error instead of hanging).

Are you sure that is the same issue?

Also, out of curiosity, why are you generating such huge types as shown in #42236 (comment)? It is always possible to generate types big enough that the type system can't handle them, and it seems you are on the border now.

@stevengj
Member Author

Also, out of curiosity, why are you generating such huge types

As I understand it, the types essentially represent the AST of a symbolic expression for the weak form of a PDE — this is represented in the type domain for performance, to force the compiler to specialize finite-element assembly code that is generated from this AST.

@fverdugo

As I understand it, the types essentially represent the AST of a symbolic expression for the weak form of a PDE — this is represented in the type domain for performance, to force the compiler to specialize finite-element assembly code that is generated from this AST.

Yes, in Gridap we build some complex and large types and this is the main reason.

In any case, we build complex types by nesting structs, and the number of type parameters in each of those structs is usually less than 4. I am not aware of any part of the code that would lead to a tuple of 32 entries in this tutorial (see the Tuple{E,E,E,...,E} above), but I could be wrong of course...
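
The idea of carrying an expression tree in the type domain by nesting small structs can be sketched as follows. The names `Leaf`, `Op`, and `evaluate` are hypothetical and are not Gridap's actual types; the point is only that each struct has few type parameters while the nested type still records the whole expression.

```julia
# Hypothetical types for illustration; these are not Gridap's data structures.
struct Leaf{T}
    value::T
end

struct Op{F,A,B}
    f::F   # the operation, e.g. + or *
    a::A   # left operand (a Leaf or another Op)
    b::B   # right operand
end

evaluate(x::Leaf) = x.value
evaluate(x::Op) = x.f(evaluate(x.a), evaluate(x.b))

# 1.0 + 2.0 * 3.0, with the tree shape recorded in the type parameters.
expr = Op(+, Leaf(1.0), Op(*, Leaf(2.0), Leaf(3.0)))

println(typeof(expr))
println(evaluate(expr))   # 7.0
```

Since the full tree is part of `typeof(expr)`, the compiler can specialize `evaluate` for this particular expression, at the cost of types that grow with the expression.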

@JeffBezanson
Member

So the 32-element tuple comes from changing Any16 to Any32 in base/tuple.jl. These types are hazardous and we should probably stop using them in favor of simple length checks.
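
The two styles can be contrasted on a scaled-down example. In this sketch `AtLeast4` is a hypothetical 4-element analogue of the alias mentioned above, not the definition in base/tuple.jl, and the function names are made up for illustration.

```julia
# Hypothetical 4-element analogue of a "long tuple" alias.
const AtLeast4{N} = Tuple{Any, Any, Any, Any, Vararg{Any, N}}

# Style 1: pick the long-tuple path by dispatching on the alias type.
by_dispatch(t::AtLeast4) = :long
by_dispatch(t::Tuple) = :short

# Style 2: a single method with a plain length check.
by_length(t::Tuple) = length(t) >= 4 ? :long : :short

println(by_dispatch((1, 2, 3)), " ", by_dispatch((1, 2, 3, 4)))   # short long
println(by_length((1, 2, 3)), " ", by_length((1, 2, 3, 4)))       # short long
```

Both give the same answers, but the second never puts a many-parameter tuple type into a method signature, so the type system has no large tuple types to compare during dispatch or inference.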

@JeffBezanson
Member

one of our users was able to find an MWE, gridap/Gridap.jl#657 (here the code crashes with an internal error instead of hanging).

This definitely looks like a separate issue; could you file one?

@fverdugo

This definitely looks like a separate issue; could you file one?

Sure! #42264

KristofferC pushed a commit that referenced this issue Sep 16, 2021
LilithHafner pushed a commit to LilithHafner/julia that referenced this issue Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this issue Mar 8, 2022