[BUG] Multiprocessing threading causing timeouts #372
Comments
(Assuming what I observe on our local cluster is the same bug) and running with the latest changes from #375, it looks like this is part of the code that still happens in the multiprocessing context, but should just happen in the main process.
I'm even more confused because I created this branch to attempt to find a minimum failing example... and everything looks like it's fine 😢
As in, you can see that it's still running in a multiprocessing context? Is there any way to force Python to only run a function on one process, so we can try to enforce that explicitly here and see if the error goes away?
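One way to enforce that explicitly, as a minimal sketch that assumes nothing about cellfinder's internals (the decorator name and `post_detection_cleanup` are hypothetical placeholders), is to guard the function so it raises if it is ever called from a worker process:

```python
import functools
import multiprocessing


def main_process_only(func):
    """Refuse to run the wrapped function inside a multiprocessing worker."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # parent_process() returns None only in the main process (Python >= 3.8)
        if multiprocessing.parent_process() is not None:
            raise RuntimeError(
                f"{func.__name__} was called from worker process "
                f"{multiprocessing.current_process().name!r}; "
                "it should only run in the main process."
            )
        return func(*args, **kwargs)

    return wrapper


@main_process_only
def post_detection_cleanup():
    # Placeholder for the work that should stay in the main process.
    pass
```

If the RuntimeError fires during a run, that would confirm the function is currently executing inside a worker; if it never fires and the error still occurs, the multiprocessing context could probably be ruled out for that function.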
It will be running inside the multiprocessing context we get at cellfinder/cellfinder/core/detect/detect.py, line 162 (in 68ccb12).
Seems tricky in our code specifically at first glance... would require some non-trivial refactoring?
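To make the refactoring idea concrete, here is a rough sketch (none of these names come from cellfinder; it only illustrates the pattern): workers return intermediate results, and the follow-up step runs once in the main process after the pool has exited.

```python
from multiprocessing import get_context


def detect_in_worker(chunk):
    # Heavy, parallelisable part of the detection (stand-in logic).
    return [value for value in chunk if value > 0]


def postprocess(partial_results):
    # The part that "should just happen in the main process".
    return sorted(x for chunk in partial_results for x in chunk)


if __name__ == "__main__":
    chunks = [[-1, 3, 5], [2, -4, 7]]
    ctx = get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        partial_results = pool.map(detect_in_worker, chunks)
    # The workers have exited by this point; postprocess runs only here.
    print(postprocess(partial_results))  # [2, 3, 5, 7]
```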
What about the Windows and (possibly) the no-NUMBA failures?
I set a timeout of 15 mins, which is twice the length of time it takes for the Ubuntu seg-fault to occur. Those tests on Windows / with NUMBA disabled are actually running fine, but they're usually just slower than the others, so they get cancelled due to the timeout.
Alright, I think we are looking at two distinct issues here, since I (think!) have fixed the pytest problem we saw in CI.

Bad cache restore on CI in #369

This one is an isolated GH problem. Looking back at the logs, the cache that was downloaded for the original tests on the Ubuntu 3.10 job was smaller than all the others for that job (although this has now retroactively changed thanks to the re-run and shared tasks). After a cache deletion everything is OK: https://github.com/brainglobe/cellfinder/actions/runs/7743237973/job/21314877990, and we don't need the garbage-collection fix, as shown in #376.

TL;DR: it looks like attempt 1 didn't fetch the entire cache for some reason (maybe a faulty download, not sure why). Forcing the branch to fetch the cache again seems to have done the trick. Will try merging in #376's branch, which removes all the "un-necessary" fixes, and re-running CI again after purging the cache once more. This we can safely put to bed with #369 once merged.
Moved the "brainmapper_hanging" issue to #383 as I think I have diagnosed the cause 🤞
Going to close this guy since we now have #383 to track the brainmapper hanging issue.
First encountered here after attempting a fix for #368 |
From what we can infer, it appears that some combination of our multiprocessing, CPU allocation/requesting, and Python's garbage collection process is resulting in segmentation faults during teardown.

This only occurs on Ubuntu machines running Python 3.10 specifically; all other OSes are fine, as are other Python versions. In particular, though, Python 3.10 is what BrainGlobe users on the HPC system have access to, and @alessandrofelder has received similar bug reports from users.
The symptoms
Identifiable symptoms seem to be: after running the detection algorithm, but before completing teardown, Python will hang due to a segmentation fault in the (memory) cleanup phase.
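If it helps with diagnosis: the standard library's `faulthandler` module can dump the Python tracebacks of all threads when the fatal signal arrives, which narrows down which cleanup code is running at the time. A generic sketch, not specific to cellfinder:

```python
import faulthandler
import sys

# Print tracebacks for every thread to stderr if a fatal signal
# (e.g. SIGSEGV) is received, including during teardown.
faulthandler.enable(file=sys.stderr, all_threads=True)

# Equivalently, the handler can be enabled without touching the code:
#   python -X faulthandler -m pytest ...
# or by setting the environment variable PYTHONFAULTHANDLER=1.
```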
This is most easily replicated by running the `tox` command to execute the tests on a GitHub runner. Note that the segmentation fault message will appear some time after the first `test_detection` test runs.

Aside from the `pytest-lazy-fixture` incompatibility with `pytest v8.0.0`, the fix in #369 seems to be to explicitly invoke the garbage collector at the end of the detection algorithm.
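For reference, "explicitly invoke the garbage collector at the end of the detection algorithm" amounts to something like the sketch below. This is only an illustration of the idea, not the actual #369 diff, and `run_detection` is a hypothetical stand-in:

```python
import gc


def run_detection(*args, **kwargs):
    # ... the detection work, including the multiprocessing section ...
    detected_cells = []  # placeholder for the real result

    # Force a full collection before returning, so worker-related objects
    # are freed while the interpreter is in a known-good state instead of
    # during interpreter teardown.
    gc.collect()
    return detected_cells
```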
Minimum failing example

Will be pasted here once a script that reliably reproduces the error is available.

Until then, invoking `tox` on a GH runner will reproduce the bug. Note that the session will time out rather than error immediately, due to the Python hang.

Possible causes
The `cpu_count` method allocating CPUs, and claiming they are "free", before garbage collection can be run.
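As a hypothetical illustration of a check that could behave this way (not cellfinder's actual CPU-counting code, and assuming the third-party `psutil` package purely for the example): a routine that samples instantaneous per-core usage to decide how many processes to spawn can report cores as "free" even though resources from earlier workers have not yet been garbage-collected.

```python
import multiprocessing

import psutil  # third-party; assumed here purely for illustration


def estimate_free_cpus(usage_threshold: float = 50.0) -> int:
    """Count cores whose current usage is below a threshold.

    If earlier worker processes have finished but their memory/handles
    have not yet been reclaimed, a snapshot like this can misjudge what
    is really available at spawn time.
    """
    per_core = psutil.cpu_percent(interval=0.5, percpu=True)
    free = sum(1 for usage in per_core if usage < usage_threshold)
    return max(1, min(free, multiprocessing.cpu_count()))
```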
Ruled out causes