
[BUG] Multiprocessing threading causing timeouts #372

Closed
willGraham01 opened this issue Feb 6, 2024 · 7 comments
Labels: bug (Something isn't working)

@willGraham01
Collaborator

willGraham01 commented Feb 6, 2024

First encountered here after attempting a fix for #368.

From what we can infer, it appears that some combination of our multiprocessing, CPU allocation/requesting, and Python's garbage collection is resulting in segmentation faults during teardown.

This only occurs on Ubuntu machines running Python 3.10 specifically; all other OSes are fine, as are other Python versions. Importantly, Python 3.10 is what BrainGlobe users on the HPC system have access to - and @alessandrofelder has received similar bug reports from users.

The symptoms

The identifiable symptom is that, after the detection algorithm has run but before teardown completes, Python hangs due to a segmentation fault in the (memory) cleanup phase.

This is most easily replicated by running the tox command to execute the tests on a GitHub runner. Note that the segmentation fault message will appear some time after the first test_detection test runs.
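One low-cost way to get more information out of a crash like this is Python's built-in faulthandler module, which dumps a traceback for every thread when a segmentation fault is received. A minimal, generic sketch (nothing here is cellfinder-specific):

```python
import faulthandler
import sys

# Dump the Python traceback of every thread to stderr if the process receives
# SIGSEGV (or SIGFPE/SIGABRT/SIGBUS), instead of dying or hanging silently.
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... run the detection code under test here ...
```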

Minimum failing example

Will be pasted here once a script that reliably reproduces the error is available.
Until then, invoking tox on a GH runner will reproduce the bug - note that the session will time out, rather than error immediately, because of the Python hang.

Possible causes

  • Multiprocessing memory management
  • Delegation of tasks to cores
  • Use of the cpu_count method to allocate CPUs, claiming they are "free" before garbage collection has had a chance to run (see the sketch after this list)
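For context on that last point: os.cpu_count() reports the total number of CPUs on the machine, not the number this process is actually allowed to use, and on an HPC allocation that pins the job to a subset of cores those two numbers can differ. A small standard-library sketch comparing the two (the variable names are illustrative, not taken from cellfinder):

```python
import multiprocessing
import os


def main():
    # Total CPUs visible on the machine (what cpu_count reports).
    total_cpus = os.cpu_count()

    # CPUs this process is actually allowed to run on (respects the
    # scheduler's affinity mask, e.g. a pinned HPC allocation); Linux-only.
    usable_cpus = len(os.sched_getaffinity(0))

    print(f"cpu_count reports {total_cpus}, affinity allows {usable_cpus}")

    # Sizing the pool from the affinity count avoids asking for cores the
    # scheduler never gave us.
    n_workers = max(usable_cpus - 1, 1)
    with multiprocessing.get_context("spawn").Pool(n_workers) as pool:
        print(pool.map(abs, [-1, -2, -3]))


if __name__ == "__main__":
    main()
```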

Ruled out causes

  • We are NOT requesting more CPUs than we have available, so this is not the cause of the problem
  • The detection algorithm itself completes without error; the issue always occurs after it finishes.
@alessandrofelder
Member

(Assuming what I observe on our local cluster is the same bug) running with the latest changes from #375, it looks like

f"Processing {len(self.cell_detector.coords_maps.items())} cells"
is reached, but
logger.debug("Finished splitting cell clusters.")
is not...

This is part of the code that still happens in the multiprocessing context, but should just happen in the main process.
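As an aside on pinning down where the hang happens: the standard library's faulthandler.dump_traceback_later can be armed just before the suspect section, so that if the process is still stuck after a chosen timeout it prints every thread's stack. A hypothetical sketch (run_cell_cluster_splitting is a stand-in, not a real cellfinder function):

```python
import faulthandler
import time


def run_cell_cluster_splitting():
    # Stand-in for the real work between the two log messages above.
    time.sleep(1)


# Watchdog: if not cancelled within 10 minutes, dump every thread's traceback
# to stderr so we can see where the hang is (repeat=False: fire once only).
faulthandler.dump_traceback_later(timeout=600, repeat=False)

run_cell_cluster_splitting()

# Only reached if the splitting step returned; disarm the watchdog.
faulthandler.cancel_dump_traceback_later()
```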

@willGraham01
Collaborator Author

I'm even more confused because I created this branch to attempt to find a minimum failing example... and everything looks like it's fine 😢

This is part of the code that still happens in the multiprocessing context, but should just happen in the main process.

As in, you can see that it's still running in a multiprocessing context? Is there any way to force Python to run a function on only one process, so we can enforce that explicitly here and see if the error goes away?
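One generic way to do that (a hypothetical sketch, not the existing cellfinder API - process_chunk, run, and debug_serial are made-up names) is to put the pool behind a flag, so the same worker function can run either through multiprocessing or serially in the main process while debugging:

```python
import multiprocessing


def process_chunk(chunk):
    # Stand-in for the real per-chunk worker function.
    return sum(chunk)


def run(chunks, n_procs, debug_serial=False):
    if debug_serial or n_procs <= 1:
        # Everything stays in the main process: no pool, no pickling,
        # no multiprocessing teardown to segfault in.
        return [process_chunk(c) for c in chunks]
    with multiprocessing.get_context("spawn").Pool(n_procs) as pool:
        return pool.map(process_chunk, chunks)


if __name__ == "__main__":
    print(run([[1, 2], [3, 4]], n_procs=2, debug_serial=True))
```

If the hang disappears with debug_serial=True, that would point at the pool set-up/teardown rather than the algorithm itself.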

@alessandrofelder
Member

It will be running inside the multiprocessing context we enter here:

with mp_ctx.Pool(n_ball_procs) as worker_pool:

Is there any way to force Python to run a function on only one process, so we can enforce that explicitly here and see if the error goes away?

Seems tricky in our code specifically at first glance... would require some non-trivial refactoring?

... and everything looks like it's fine 😢

The Windows and the possibly no-numba run are timing out though??

@willGraham01
Collaborator Author

The Windows and the possibly no-numba run are timing out though??

I set a timeout of 15 mins, which is twice the time it takes for the Ubuntu seg-fault to occur. The tests on Windows / with NUMBA disabled are actually running fine, but they're usually just slower than the others, so they get cancelled by the timeout.
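The issue doesn't record how the 15-minute limit was configured; purely for illustration, assuming the pytest-timeout plugin, a per-test ceiling could look like this (the test name and body are placeholders, not real cellfinder tests):

```python
# Hypothetical sketch assuming the pytest-timeout plugin is installed.
import pytest


@pytest.mark.timeout(900)  # fail after 15 minutes instead of hanging forever
def test_detection_runs_to_completion():
    assert True  # stand-in for the real detection test
```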

@willGraham01
Collaborator Author

willGraham01 commented Feb 7, 2024

Alright, I think we are looking at two distinct issues here, since I (think!) have fixed the pytest problem we saw in CI.

Bad cache restore on CI in #369

This one is an isolated GH problem.

Looking back at the logs, the cache that was downloaded for the original tests on the Ubuntu 3.10 job was smaller than all the others for that job (although this has now retroactively changed thanks to the re-run and shared tasks). After a cache deletion everything is OK: https://github.com/brainglobe/cellfinder/actions/runs/7743237973/job/21314877990, and we don't need the garbage collection fix, as shown in #376.

TL;DR: it looks like attempt 1 didn't fetch the entire cache for some reason (maybe a faulty download, not sure why). Forcing the branch to fetch the cache again seems to have done the trick. Will try merging in #376's branch, which removes all the "unnecessary" fixes, and re-running CI again after purging the cache once more.

This we can safely put to bed with #369 once merged.

brainmapper hanging

This, however, is a genuine issue with our multiprocessing logic, which we still need to get to the bottom of. This is what we are tracking in this issue, so I have renamed the issue accordingly.

willGraham01 changed the title from "[BUG] Threading Segmentation Faults" to "[BUG] Multiprocessing threading causing timeouts" on Feb 7, 2024
@alessandrofelder
Member

Moved the "brainmapper_hanging" issue to #383 as I think I have diagnosed the cause 🤞

@willGraham01
Collaborator Author

Going to close this guy since we now have #383 to track the brainmapper hanging
