
[BUG] Multiprocessing threading causing timeouts #372

Closed
willGraham01 opened this issue Feb 6, 2024 · 7 comments
Labels: bug (Something isn't working)

@willGraham01
Collaborator

willGraham01 commented Feb 6, 2024

First encountered here after attempting a fix for #368.

From what we can infer, it appears that some combination of our multiprocessing, CPU allocation/requesting, and Python's garbage collection is resulting in segmentation faults during teardown.

This only occurs on Ubuntu machines running Python 3.10 specifically; all other OSes are fine, as are other Python versions. Importantly, Python 3.10 is what BrainGlobe users on the HPC system have access to - and @alessandrofelder has received similar bug reports from users.

The symptoms

The identifiable symptom is that, after the detection algorithm has run but before teardown completes, Python hangs due to a segmentation fault in the (memory) cleanup phase.

This is most easily replicated by running the tox command to execute the tests on a GitHub runner. Note that the segmentation fault message will appear some time after the first test_detection test runs.
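One low-cost way to get more information out of a crash like this is Python's built-in faulthandler module, which dumps a traceback for every thread when a segmentation fault is received. A minimal, generic sketch (nothing here is cellfinder-specific):

```python
import faulthandler
import sys

# Dump the Python traceback of every thread to stderr if the process receives
# SIGSEGV (or SIGFPE/SIGABRT/SIGBUS), instead of dying or hanging silently.
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... run the detection code under test here ...
```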

Minimum failing example

Will be pasted here once a script that reliably reproduces the error is available.
Until then, invoking tox on a GH runner will reproduce the bug - note that the session will time out, rather than error immediately, because of the Python hang.

Possible causes

  • Multiprocessing memory management
  • Delegation of tasks to cores
  • Use of the cpu_count method to allocate CPUs, claiming they are "free" before garbage collection has had a chance to run (see the sketch after this list)
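For context on that last point: os.cpu_count() reports the total number of CPUs on the machine, not the number this process is actually allowed to use, and on an HPC allocation that pins the job to a subset of cores those two numbers can differ. A small standard-library sketch comparing the two (the variable names are illustrative, not taken from cellfinder):

```python
import multiprocessing
import os


def main():
    # Total CPUs visible on the machine (what cpu_count reports).
    total_cpus = os.cpu_count()

    # CPUs this process is actually allowed to run on (respects the
    # scheduler's affinity mask, e.g. a pinned HPC allocation); Linux-only.
    usable_cpus = len(os.sched_getaffinity(0))

    print(f"cpu_count reports {total_cpus}, affinity allows {usable_cpus}")

    # Sizing the pool from the affinity count avoids asking for cores the
    # scheduler never gave us.
    n_workers = max(usable_cpus - 1, 1)
    with multiprocessing.get_context("spawn").Pool(n_workers) as pool:
        print(pool.map(abs, [-1, -2, -3]))


if __name__ == "__main__":
    main()
```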

Ruled out causes

  • We are NOT requesting more CPUs than we have available, so this is not the cause of the problem
  • The detection algorithm itself completes without error; the issue always occurs after it finishes.
@alessandrofelder
Member

(Assuming what I observe on our local cluster is the same bug) running with the latest changes from #375, it looks like

f"Processing {len(self.cell_detector.coords_maps.items())} cells"
is reached, but
logger.debug("Finished splitting cell clusters.")
is not...

This is part of the code that still happens in the multiprocessing context, but should just happen in the main process.
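As an aside on pinning down where the hang happens: the standard library's faulthandler.dump_traceback_later can be armed just before the suspect section, so that if the process is still stuck after a chosen timeout it prints every thread's stack. A hypothetical sketch (run_cell_cluster_splitting is a stand-in, not a real cellfinder function):

```python
import faulthandler
import time


def run_cell_cluster_splitting():
    # Stand-in for the real work between the two log messages above.
    time.sleep(1)


# Watchdog: if not cancelled within 10 minutes, dump every thread's traceback
# to stderr so we can see where the hang is (repeat=False: fire once only).
faulthandler.dump_traceback_later(timeout=600, repeat=False)

run_cell_cluster_splitting()

# Only reached if the splitting step returned; disarm the watchdog.
faulthandler.cancel_dump_traceback_later()
```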

@willGraham01
Collaborator Author

I'm even more confused because I created this branch to attempt to find a minimum failing example... and everything looks like it's fine 😢

This is part of the code that still happens in the multiprocessing context, but should just happen in the main process.

As in, you can see that it's still running in a multiprocessing context? Is there any way to force Python to run a function on only one process, so we can enforce that explicitly here and see if the error goes away?
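One generic way to do that (a hypothetical sketch, not the existing cellfinder API - process_chunk, run, and debug_serial are made-up names) is to put the pool behind a flag, so the same worker function can run either through multiprocessing or serially in the main process while debugging:

```python
import multiprocessing


def process_chunk(chunk):
    # Stand-in for the real per-chunk worker function.
    return sum(chunk)


def run(chunks, n_procs, debug_serial=False):
    if debug_serial or n_procs <= 1:
        # Everything stays in the main process: no pool, no pickling,
        # no multiprocessing teardown to segfault in.
        return [process_chunk(c) for c in chunks]
    with multiprocessing.get_context("spawn").Pool(n_procs) as pool:
        return pool.map(process_chunk, chunks)


if __name__ == "__main__":
    print(run([[1, 2], [3, 4]], n_procs=2, debug_serial=True))
```

If the hang disappears with debug_serial=True, that would point at the pool set-up/teardown rather than the algorithm itself.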

@alessandrofelder
Member

It will be running inside the multiprocessing context we enter here:

with mp_ctx.Pool(n_ball_procs) as worker_pool:

Is there any way to force Python to run a function on only one process, so we can enforce that explicitly here and see if the error goes away?

Seems tricky in our code specifically at first glance... would require some non-trivial refactoring?

... and everything looks like it's fine 😢

The Windows and the possibly no-numba run are timing out though??

@willGraham01
Collaborator Author

The Windows and the possibly no-numba run are timing out though??

I set a timeout of 15 mins, which is twice the time it takes for the Ubuntu seg-fault to occur. The tests on Windows / with NUMBA disabled are actually running fine, but they're usually just slower than the others, so they get cancelled by the timeout.
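The issue doesn't record how the 15-minute limit was configured; purely for illustration, assuming the pytest-timeout plugin, a per-test ceiling could look like this (the test name and body are placeholders, not real cellfinder tests):

```python
# Hypothetical sketch assuming the pytest-timeout plugin is installed.
import pytest


@pytest.mark.timeout(900)  # fail after 15 minutes instead of hanging forever
def test_detection_runs_to_completion():
    assert True  # stand-in for the real detection test
```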

@willGraham01
Collaborator Author

willGraham01 commented Feb 7, 2024

Alright, I think we are looking at two distinct issues here, since I (think!) have fixed the pytest problem we saw in CI.

Bad cache restore on CI in #369

This one is an isolated GH problem.

Looking back at the logs, the cache that was downloaded for the original tests on the Ubuntu 3.10 job was smaller than all the others for that job (although this has now retroactively changed thanks to the re-run and shared tasks). After a cache deletion everything is OK: https://github.com/brainglobe/cellfinder/actions/runs/7743237973/job/21314877990, and we don't need the garbage collection fix, as shown in #376.

TL;DR: it looks like attempt 1 didn't fetch the entire cache for some reason (maybe a faulty download, not sure why). Forcing the branch to fetch the cache again seems to have done the trick. Will try merging in #376's branch, which removes all the "unnecessary" fixes, and re-running CI again after purging the cache once more.

This we can safely put to bed with #369 once merged.

brainmapper hanging

This, however, is a genuine issue with our multiprocessing logic, which we still need to get to the bottom of. This is what we are tracking in this issue, so I have renamed the issue accordingly.

willGraham01 changed the title from "[BUG] Threading Segmentation Faults" to "[BUG] Multiprocessing threading causing timeouts" on Feb 7, 2024
@alessandrofelder
Member

Moved the "brainmapper_hanging" issue to #383 as I think I have diagnosed the cause 🤞

@willGraham01
Collaborator Author

Going to close this guy since we now have #383 to track the brainmapper hanging
