ProcessPool join hangs for 60 seconds due to intermittent deadlock #147
Comments
If I understand correctly, the occasional 60-second delay observed in ProcessPool.join() is caused by an intermittent deadlock. The root cause seems to be: …

The deadlock is recovered due to the …

As for ways to fix it, I'm thinking of maybe letting the …
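To illustrate the general deadlock pattern being discussed, here is a rough, self-contained sketch using plain multiprocessing (not Pebble's internals; all names and sizes are made up): a child blocks writing a large result into a full OS pipe while the parent blocks waiting for the child to exit.

```python
# Illustrative only -- plain multiprocessing, NOT Pebble's internals.
# A child that has to push a large result through the pipe blocks in send()
# once the OS pipe buffer (~64 KiB on Linux) is full; if the parent joins
# before draining the pipe, both sides wait on each other.
import multiprocessing as mp


def worker(conn):
    conn.send(b"x" * (1 << 20))  # larger than the pipe buffer: blocks until drained
    conn.close()


if __name__ == "__main__":
    parent_end, child_end = mp.Pipe()
    p = mp.Process(target=worker, args=(child_end,))
    p.start()
    # p.join()                   # joining here first would deadlock
    payload = parent_end.recv()  # draining the pipe first lets the child finish
    p.join()
    print(len(payload))
```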
Hello, this is a known limitation and it's covered within these test cases. As the documentation states, Python multiprocessing guidelines fully apply to Pebble as well. In particular, I am addressing the following section:
This is an anti-pattern for a few reasons.
The recommended way to handle these situations is to use files rather than sending large amounts of data back and forth through the pipe. In most cases this actually yields a significant performance increase within the application itself (some references: 1, 2). Your main loop and your workers would write the data to dedicated files and only share the file paths with each other. This allows optimal usage of the pipe (an empty pipe is a happy pipe) and brings a few benefits.
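A minimal sketch of that file-based pattern, assuming Pebble's ProcessPool.schedule() API; crunch(), the payload sizes and the file-naming scheme are made up for illustration.

```python
import tempfile
from pathlib import Path

from pebble import ProcessPool


def crunch(chunk_id):
    # Do the heavy work, write the (possibly large) result to a dedicated
    # file, and send back only the path -- the worker/main pipe stays empty.
    result = b"x" * 200_000  # placeholder for real output
    out = Path(tempfile.gettempdir()) / f"result_{chunk_id}.bin"
    out.write_bytes(result)
    return str(out)


if __name__ == "__main__":
    with ProcessPool(max_workers=4) as pool:
        futures = [pool.schedule(crunch, args=(i,)) for i in range(16)]
        paths = [f.result() for f in futures]
    # The main loop reads the result files at its own pace.
    total = sum(len(Path(p).read_bytes()) for p in paths)
    print(total)
```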
That said, I do agree that the current implementation is sub-optimal, as it is designed around those usage recommendations. Yet the solution you propose cannot be accepted for security reasons: waiting for the pipe to be drained does not provide strong enough guarantees, because a malicious worker process could keep slowly feeding the pipe, literally locking the main loop indefinitely. Hence, a better solution should be devised. I will come up with a proper implementation in the upcoming days. In the meantime, I have two recommendations for you.
Thank you for the detailed comment. I'll look into the workarounds you mentioned. However, I have a question about the concern you expressed regarding the PR:
My understanding is that the PR doesn't introduce waiting for the pipe to be drained. The pipe draining is still executed on the background thread, and that thread's lifetime is only prolonged to (roughly) until … I've also updated the PR to add a unit test for the situation you described (a spammy worker), if I understood it correctly.
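To make the intent concrete, here is a purely illustrative sketch of that idea (not the PR's code and not Pebble's internals): results keep being drained on a background thread, but that thread's lifetime is bounded so a spammy worker cannot keep shutdown alive forever. All names are hypothetical and queue.Queue stands in for the worker-to-main channel.

```python
import queue
import threading
import time


def drain(channel, results, deadline):
    # Pull results until a sentinel arrives or the deadline passes.
    while time.monotonic() < deadline:
        try:
            item = channel.get(timeout=0.1)
        except queue.Empty:
            continue
        if item is None:  # sentinel: nothing more to drain
            break
        results.append(item)


if __name__ == "__main__":
    channel = queue.Queue()
    results = []
    deadline = time.monotonic() + 5.0  # bounded lifetime, not "until fully drained"
    t = threading.Thread(target=drain, args=(channel, results, deadline), daemon=True)
    t.start()
    channel.put("result-1")
    channel.put(None)
    t.join()  # returns no later than the deadline, even with a spammy producer
    print(results)
```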
Regarding your other suggestion on large amounts of data: I've double-checked the application where the problem was originally observed (marxin/cvise#41), and the sizes of the messages are actually moderate: up to 2 KB. Still, the problem is observed with standard parallelization settings (e.g., 64 workers on my machine). Perhaps the deadlock scenario is bounded by 64 KiB divided by the number of workers (https://github.com/python/cpython/blob/a8dc6d6d44a141a8f839deb248a02148dcfb509e/Lib/multiprocessing/connection.py#L43)? Asking the library consumer to use file-based I/O even for a kilobyte of data seems like overkill.
2 KB is definitely too small to be the issue there, but it also seems the issue in the linked ticket is not due to that. The user had a dual-socket workstation, which might be a different matter. The side effect with a stuffed pipe is known, but I've never stumbled across issues with termination of the processes when using small files.
I was able to reduce it down to the following example:
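The exact reducer isn't shown here; the following is a hedged sketch of the kind of reproduction described in this thread (many workers returning ~2 KB results, then stop()/join()), where the worker count, task count and payload size are guesses taken from the discussion above.

```python
import time

from pebble import ProcessPool


def task(i):
    return b"x" * 2048  # a moderate (~2 KB) result per task


if __name__ == "__main__":
    pool = ProcessPool(max_workers=64)
    futures = [pool.schedule(task, args=(i,)) for i in range(1000)]
    time.sleep(0.5)        # let results accumulate in the result pipe
    start = time.monotonic()
    pool.stop()            # stop without waiting for pending tasks
    pool.join()            # intermittently reported to take ~60 seconds
    print(f"join() took {time.monotonic() - start:.1f}s")
```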