Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hang on testsuite completion: avoid forking with threads #6

Merged
merged 1 commit into from
Dec 16, 2024

Conversation

cjriches
Copy link
Collaborator

We have observed that sometimes a multiprocessing worker fails to properly terminate, getting stuck somewhere in the python multiprocessing internals after the whole of process_with_threads has completed. This results in the entire test suite hanging at 99% completion, as the process join never completes.

This appears to be due to starting the responses_processor thread before starting the worker processes - the default multiprocessing start method on POSIX is fork which directly forks the python interpreter without execing. This is generally unsafe in a multithreaded environment as the child process may fork while another thread of the parent has locked arbitrary mutexes or similar, meaning they are already-locked in the child without any thread to ever unlock them, leading to deadlocks if the child ever tries to lock them itself.

In fact, the default is changing to forkserver in Python 3.14 precisely because of subtle issues like this (see
python/cpython#84559). Rather than making that same change here now, move the thread creation after the process creation to remain compatible with both fork and forkserver. There is no need to start the thread that early anyway; the worst that could happen is a few responses piling up in the meantime.

This appears to fix the hang, as it has not reproduced with this patch in several days of continuous runs (where previously it reproduced within a few minutes).

It is possible that the macOS-specific logic at the top of the file that "[forces] forking behavior at the expense of safety" should be revisited too, since the docs suggest that system libraries could create threads without our knowledge, but this is deferred to future work as no specific problems have been observed yet, and the docs suggest that problems here would lead to crashes rather than hangs.

We have observed that sometimes a multiprocessing worker fails to
properly terminate, getting stuck somewhere in the python
multiprocessing internals after the whole of `process_with_threads` has
completed. This results in the entire test suite hanging at 99%
completion, as the process join never completes.

This appears to be due to starting the `responses_processor` thread
before starting the worker processes - the default multiprocessing start
method on POSIX is `fork` which directly forks the python interpreter
without execing. This is generally unsafe in a multithreaded environment
as the child process may fork while another thread of the parent has
locked arbitrary mutexes or similar, meaning they are already-locked in
the child without any thread to ever unlock them, leading to deadlocks
if the child ever tries to lock them itself.

In fact, the default is changing to `forkserver` in Python 3.14
precisely because of subtle issues like this (see
python/cpython#84559). Rather than making that
same change here now, move the thread creation after the process
creation to remain compatible with both `fork` and `forkserver`. There
is no need to start the thread that early anyway; the worst that could
happen is a few responses piling up in the meantime.

This appears to fix the hang, as it has not reproduced with this patch
in several days of continuous runs (where previously it reproduced
within a few minutes).

It is possible that the macOS-specific logic at the top of the file that
"[forces] forking behavior at the expense of safety" should be revisited
too, since the docs suggest that system libraries could create threads
without our knowledge, but this is deferred to future work as no
specific problems have been observed yet, and the docs suggest that
problems here would lead to crashes rather than hangs.

Co-authored-by: Andrea Segalini <[email protected]>
Copy link
Collaborator

@PatrickJordanNutanix PatrickJordanNutanix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for providing a fix to this mysterious hanging problem. I've learned some more about how multiprocessing works in Python too!

@cjriches cjriches merged commit 46561a2 into andrea-segalini:master Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants