Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix conflicting multithreading and fork management #4169

Merged
merged 5 commits into from
Jul 3, 2024

Conversation

binarman
Copy link
Contributor

This PR disables multithreading in MLIR context after compilation ends. This is done to safely finalize thread pool implemented in MLIRContext. Not properly finalized thread pool can hang or crash in child process after fork.

@binarman
Copy link
Contributor Author

Details about this issue

In compilation Triton creates native MLIRContext object managed by python:

context = ir.context()

MLIRContext object holds pool of worker threads, which are allocated during compilation.

This object is not removed instantly after compilation ends. I've tried to remove it by python del statement, but seems that something referencing it and it is not trivial to delete on request.

If program forks right after compiling one kernel, MLIRContext has a chance to be duplicated along with thread pool inside. This invalidates pool in child process and subsequent attempt to delete MLIRContext hangs or crashes compilation.

@zhanglx13
Copy link
Collaborator

@binarman Then why this is not an issue on nv backend?

@binarman
Copy link
Contributor Author

@zhanglx13
I've checked test from this PR and it hangs on NV backend as well if do not finalize thread pool.

My conclusion is that we were just lucky this issue did not showed before.

P.s. pay attention that I've disabled automatic GC in test, otherwise this test is flickering

@zhanglx13 zhanglx13 requested a review from antiagainst June 20, 2024 16:52
@binarman
Copy link
Contributor Author

@antiagainst please, take a look

Copy link
Collaborator

@antiagainst antiagainst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome findings! Just some nits around comments.

python/triton/compiler/compiler.py Outdated Show resolved Hide resolved
python/test/unit/runtime/test_subproc.py Show resolved Hide resolved
binarman added 3 commits June 25, 2024 14:34
This PR disables multithreading in MLIR context after compilation ends.
This is done to safely finalize thread pool implemented in MLIRContext.
Not properly finalized thread pool can hang or crash in child process after fork.
@alefimov-amd alefimov-amd force-pushed the fix_gc_fork_thread_conflict branch from ff4de99 to 6e8483e Compare June 25, 2024 13:20
@alefimov-amd
Copy link

@antiagainst added more comments, PTAL

@antiagainst antiagainst marked this pull request as ready for review June 25, 2024 21:22
@antiagainst antiagainst requested a review from ptillet as a code owner June 25, 2024 21:22
@ThomasRaoux
Copy link
Collaborator

how would you reproduce the problem?

@binarman
Copy link
Contributor Author

@ThomasRaoux

hi!
It is in test_compile_in_forked_subproc_with_forced_gc test.

@antiagainst
Copy link
Collaborator

Drop the change in python/triton/compiler/compiler.py and run the test I believe.

@antiagainst antiagainst changed the title [AMD] Fix conflicting multithreading and fork management Fix conflicting multithreading and fork management Jun 25, 2024
@binarman
Copy link
Contributor Author

@ThomasRaoux kind reminder

@ptillet ptillet merged commit 7809164 into triton-lang:main Jul 3, 2024
6 checks passed
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
…-lang#4169)

This PR disables multithreading in MLIR context after compilation ends.
This is done to safely finalize thread pool implemented in MLIRContext.
Not properly finalized thread pool can hang or crash in child process
after fork.

---------

Co-authored-by: Lei Zhang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants