Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix finalizer_ref usage of lock/wait resulting in task switches in gc context #41846

Merged
merged 2 commits into from
Aug 16, 2021
Merged

Fix finalizer_ref usage of lock/wait resulting in task switches in gc context #41846

merged 2 commits into from
Aug 16, 2021

Conversation

krynju
Copy link
Contributor

@krynju krynju commented Aug 9, 2021

Restructured some code, so that lock on the client_refs is obtained once at the beginning of the finalizer.
It's obtained using trylock instead of lock down the line in the functions invoked later.
Usage of lock in gc context led to calling wait() and a task switch attempt.
Catching the error and trying again led to different errors (switching to exited tasks) due to failed task switch handling in ensure_scheduled in task.jl, so I went for repeated trylock instead.

Fix addresses the two following issues found in JuliaParallel/Dagger.jl#243

And supersedes #41786

error in running finalizer: ErrorException("task switch not allowed from inside gc finalizer")
jl_error at /opt/julia/src/rtutils.c:41
jl_switch at /opt/julia/src/task.c:481
try_yieldto at ./task.jl:754
wait at ./task.jl:824
wait at ./condition.jl:112
lock at ./lock.jl:100
lock at ./lock.jl:185
lock at ./weakkeydict.jl:89 [inlined]
delete! at ./weakkeydict.jl:168 [inlined]
finalize_ref at /opt/julia/usr/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:92
jl_apply at /opt/julia/src/julia.h:1768 [inlined]
run_finalizer at /opt/julia/src/gc.c:278
error in running finalizer: ErrorException("task switch not allowed from inside gc finalizer")
jl_error at /tmp/julia/src/rtutils.c:41
jl_switch at /tmp/julia/src/task.c:481
try_yieldto at ./task.jl:754
wait at ./task.jl:824
wait at ./condition.jl:112
lock at ./lock.jl:100
lock at ./lock.jl:185
lock at ./weakkeydict.jl:90 [inlined]
del_client at /tmp/julia/usr/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:236 [inlined]
del_client at /tmp/julia/usr/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:234 [inlined]
del_client at /tmp/julia/usr/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:232 [inlined]
send_del_client at /tmp/julia/usr/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:265
finalize_ref at /tmp/julia/usr/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:102
jl_apply at /tmp/julia/src/julia.h:1768 [inlined]
run_finalizer at /tmp/julia/src/gc.c:278

@krynju
Copy link
Contributor Author

krynju commented Aug 9, 2021

@vtjnash Thanks for the review! Forgot to cleanup that trylock logic, but I made it nice now.
Tested on two separate examples that were causing the issues mentioned in the main post and it's working well.

@krynju
Copy link
Contributor Author

krynju commented Sep 13, 2021

@Sacha0 Hey, could this be backported to 1.7? It's a relatively common error occuring in Dagger under any heavier usage. Would be awesome if this could be already fixed on 1.7 when it releases!

@Sacha0
Copy link
Member

Sacha0 commented Sep 13, 2021

Tagging for backport --- thanks for the note Krystian! :)

KristofferC pushed a commit that referenced this pull request Sep 15, 2021
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
vchuravy pushed a commit to JuliaLang/Distributed.jl that referenced this pull request Oct 6, 2023
Keno pushed a commit that referenced this pull request Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants