Fix hang on testsuite completion: avoid forking with threads #6
We have observed that a multiprocessing worker sometimes fails to terminate properly, getting stuck somewhere in the Python multiprocessing internals after the whole of `process_with_threads` has completed. This leaves the entire test suite hanging at 99% completion, because the process join never completes.

This appears to be caused by starting the `responses_processor` thread before starting the worker processes. The default multiprocessing start method on POSIX (other than macOS) is `fork`, which forks the Python interpreter directly without exec'ing. This is generally unsafe in a multithreaded process: the fork may happen while another thread of the parent holds an arbitrary mutex or similar lock. That lock is then already held in the child, with no thread there to ever release it, so the child deadlocks if it ever tries to acquire the lock itself.

In fact, the default is changing to `forkserver` in Python 3.14 precisely because of subtle issues like this (see python/cpython#84559). Rather than making that same change here now, this patch moves the thread creation after the process creation, which remains compatible with both `fork` and `forkserver`. There is no need to start the thread that early anyway; the worst that can happen is a few responses piling up in the meantime.

This appears to fix the hang: it has not reproduced with this patch in several days of continuous runs, whereas previously it reproduced within a few minutes.
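A minimal sketch of the ordering described above, assuming a simple queue-based design. The names `worker`, `collect` and `process_with_threads_sketch`, the squaring workload, and the `None` sentinel protocol are all hypothetical illustrations, not the project's actual `process_with_threads` or `responses_processor` code. The point it shows is that every `Process.start()` happens while the parent still has a single thread, and the response-collecting thread is started only afterwards.

```python
import multiprocessing as mp
import threading


# Hypothetical worker (not the project's code): squares each task
# until it receives a None sentinel, then sends its own sentinel.
def worker(tasks, responses):
    for item in iter(tasks.get, None):
        responses.put(item * item)
    responses.put(None)  # tell the collector this worker is done


# Hypothetical stand-in for a responses-processing thread: drains the
# queue until every worker has reported its sentinel.
def collect(responses, n_workers, results):
    finished = 0
    while finished < n_workers:
        item = responses.get()
        if item is None:
            finished += 1
        else:
            results.append(item)


def process_with_threads_sketch(items, n_workers=2, start_method=None):
    # start_method=None uses the platform default; "fork" and
    # "forkserver" both work with this ordering.
    ctx = mp.get_context(start_method)
    tasks = ctx.Queue()
    responses = ctx.Queue()

    # 1. Start the worker processes first. No put() has happened yet, so
    #    not even a queue feeder thread exists: the parent is still
    #    single-threaded when it forks.
    workers = [ctx.Process(target=worker, args=(tasks, responses))
               for _ in range(n_workers)]
    for p in workers:
        p.start()

    # 2. Only now start the collector thread. Responses produced in the
    #    meantime simply wait in the queue.
    results = []
    collector = threading.Thread(target=collect,
                                 args=(responses, n_workers, results))
    collector.start()

    for item in items:
        tasks.put(item)
    for _ in workers:
        tasks.put(None)  # one shutdown sentinel per worker

    collector.join()
    for p in workers:
        p.join()
    return sorted(results)


if __name__ == "__main__":
    print(process_with_threads_sketch([1, 2, 3, 4]))
```

With `forkserver` (or `spawn`), the worker function must be importable by the child and the calling code should sit under an `if __name__ == "__main__":` guard, as above; with `fork`, neither is required. Starting the thread late costs nothing here, because queued responses are just buffered until the collector begins draining them.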
It is possible that the macOS-specific logic at the top of the file, which "[forces] forking behavior at the expense of safety", should be revisited too, since the docs suggest that system libraries could create threads without our knowledge. This is deferred to future work, however, because no specific problems have been observed yet and the docs suggest that problems there would lead to crashes rather than hangs.