Optimize cell detection #398
Conversation
Had to split the text because of the length limitation: jitted the ball filter code. The main difference is in the cProfile timing, meaning more happens underneath Python, which is ultimately good.
With cProfile
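The jitted filter itself isn't shown in this thread, but the pattern is roughly this: an explicit triple loop over the volume that numba compiles to machine code. Below is a minimal sketch with a hypothetical `count_bright` kernel (not the real ball filter), plus a fallback decorator so the snippet still runs when numba isn't installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # numba not installed: fall back to plain Python
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda f: f

@njit
def count_bright(volume, threshold):
    # Explicit loops like this are slow in CPython but compile to
    # tight machine code under numba's nopython mode.
    n = 0
    for z in range(volume.shape[0]):
        for y in range(volume.shape[1]):
            for x in range(volume.shape[2]):
                if volume[z, y, x] > threshold:
                    n += 1
    return n

vol = np.arange(27, dtype=np.float64).reshape(3, 3, 3)
print(count_bright(vol, 20.0))  # values 21..26 exceed 20 -> 6
```

The first call pays a compilation cost; subsequent calls run at native speed, which is why the work moves out of the cProfile-visible Python layer.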
As with the Fortran optimization above, numba doesn't support specifying array order, so we can't use Fortran layout here. Instead, I switched the internal volume to having the z-axis first, which with row-major memory again makes rolling the z-axis faster. The effect is small, but not insignificant.
With cProfile
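As an illustration of why z-first helps (using `np.roll` as a stand-in for the filter's actual ring-buffer update): in a C-order (row-major) array with z as the first axis, each z-plane is one contiguous block of memory, so advancing in z moves whole planes rather than strided single elements:

```python
import numpy as np

# C-order volume with z as the FIRST axis: every vol_zyx[k] plane
# is one contiguous chunk of memory.
vol_zyx = np.arange(2 * 3 * 4).reshape(2, 3, 4)
assert vol_zyx[0].flags["C_CONTIGUOUS"]

# "Advance the z ring buffer by one plane":
rolled = np.roll(vol_zyx, -1, axis=0)

# Equivalent to moving plane 0 to the end.
expected = np.concatenate([vol_zyx[1:], vol_zyx[:1]], axis=0)
assert np.array_equal(rolled, expected)
```

With z last in a C-order array, the same roll would touch every row of memory with a large stride, which is the cache-unfriendly case the commit avoids.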
Unrolled the recursion. No performance effect, but nice nonetheless in case the recursion does go deeper than normal.
With cProfile
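The shape of such an unrolling, sketched on a hypothetical 2D flood-fill rather than the actual structure-splitting code: replace the call stack with an explicit list, so the depth is bounded by memory instead of Python's recursion limit:

```python
def collect_component(grid, start):
    """Gather all connected True cells reachable from `start`,
    using an explicit stack instead of recursion (no depth limit)."""
    seen = set()
    stack = [start]
    while stack:
        y, x = stack.pop()
        if (y, x) in seen or not grid.get((y, x), False):
            continue
        seen.add((y, x))
        # Push the 4-connected neighbours; invalid ones are skipped
        # by the grid.get() check when they are popped.
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            stack.append((y + dy, x + dx))
    return seen

# A 1x3 horizontal strip of "on" cells, plus one disconnected cell.
grid = {(0, 0): True, (0, 1): True, (0, 2): True, (5, 5): True}
print(sorted(collect_component(grid, (0, 0))))  # [(0, 0), (0, 1), (0, 2)]
```

The iterative form computes exactly what the recursive one would, which matches the observation that unrolling changes robustness, not results or speed.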
Now, for the original problem of splitting clusters being so slow that from the terminal it wasn't clear why it was stuck: I parallelized splitting clusters and also added a progress bar. As expected, for very few clusters (10 as above) on 16 cores, it slows down. But if we use 100 clusters, it now improves on the prior commit. For us, with many cores and a few thousand clusters in a whole brain, it saves a lot.

I did notice, looking at the task manager, that it uses all cores, even when specifying a fair number of blank cores. I think internally it must somehow distribute reading the volume data or something across all cores, because there was a lot of kernel thread activity.

Next, I used numba's When we first process the images to find the cells, the system uses many cores to filter each slice and then passes it to the 3D filter, which in the main thread calls On the other hand, during cluster splitting. That's why I made

So, what does this gain? Now we have to compare the full cell detection stage. Brain one, with the previous commit:
After this commit
And for another brain, with the previous commit:
After this commit
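A minimal sketch of the parallel-splitting pattern described above, with a hypothetical `split_cluster` worker standing in for the real per-cluster work and a simple printed counter in place of a progress bar:

```python
from multiprocessing import Pool

def split_cluster(cluster):
    # Hypothetical stand-in for the real per-cluster splitting work:
    # pretend a cluster of n voxels splits into n // 2 cells.
    return len(cluster) // 2

def split_all(clusters, processes=4):
    results = []
    with Pool(processes=processes) as pool:
        # imap_unordered yields results as workers finish, which is
        # what lets a progress indicator advance smoothly.
        for done, n_cells in enumerate(
            pool.imap_unordered(split_cluster, clusters), 1
        ):
            results.append(n_cells)
            print(f"split {done}/{len(clusters)} clusters")
    return results

if __name__ == "__main__":
    clusters = [list(range(n)) for n in (4, 6, 8)]
    print(sum(split_all(clusters, processes=2)))  # 2 + 3 + 4 = 9
```

For a handful of clusters, the pool start-up and pickling overhead dominates, which is consistent with the slowdown observed at 10 clusters and the win at 100+.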
Hi @matham, thanks for your PRs! Someone will hopefully be able to take a look at them soon, but they won't be trivial to review. I wonder if you would like to join our developer chat over on Zulip (you can sign in with GitHub)? It would be useful to know more about your use cases and how we can help.
Ok, I'll join Zulip later! And apologies for the giant PR, which I know is hard to review. I initially thought of splitting it up, but the changes depend on each other, so I went with one PR for now.
Hey @matham
Thanks a lot for this 🤩 - the changes, as far as I understand them, look really helpful to me. And the commit-by-commit explanation really helped the review!
I have two questions, mostly to make sure I understand the intention entirely (so we can maintain it going forwards). I was wondering whether you could help explain?
Added a fix in the last commit for
Thanks @matham - this is great and I am happy with the changes. I'm going to merge this into a separate dev branch to enable me to install it more easily to run end-to-end tests on our HPC (pip installing from your dev branch installs as version
Merged commit 27ccf25 into brainglobe:matham/optimise-cell-detection
* Replace coord map values with numba list/tuple for optim.
* Switch to fortran layout for faster update of last dim.
* Cache kernel.
* jit ball filter.
* Put z as first axis to speed z rolling (row-major memory).
* Unroll recursion (no perf impact either way).
* Parallelize cell cluster splitting.
* Parallelize walking for full images.
* Cleanup docs and pep8 etc.
* Add pre-commit fixes.
* Fix parallel always being selected and numba function 1st class warning.
* Run hook.
* Older python needs Union instead of |.
* Accept review suggestion.
* Address review changes.
* num_threads must be an int.

Co-authored-by: Matt Einhorn <[email protected]>
Description
What is this PR
Why is this PR needed?
Our lab has whole cleared-brain light-sheet images. We previously used Imaris for cell counting but wanted to test open-source approaches. When I ran it on one of our brains (a 1500-image stack), it seemed not to finish even after a day. Looking closer, I saw that it stalled at cluster splitting because it took too long. Hence I started optimizing the code and ended up with this PR...
What does this PR do?
Among other optimizations, it parallelized `split_cells` to use multiprocessing, which helps when there are a fair number of them.

How has this PR been tested?
I ran the pipeline up to and including the initial cell detection that generates the points of 300k cells on the whole brain. Then, before the splitting stage, I dumped the cell coordinates and the parameters used by the pipeline to initialize `VolumeFilter`, and then tested with a small subset of those cells using the following script to make sure the number of cells didn't change with each commit. I also used this code for profiling:
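The profiling snippet itself was lost from this page; a minimal `cProfile` harness in the same spirit might look like the following, with `detect_cells` as a hypothetical stand-in for the pipeline call being measured:

```python
import cProfile
import io
import pstats
import time

def detect_cells():
    # Hypothetical stand-in for the detection/splitting call under test.
    time.sleep(0.01)
    return 86

t0 = time.perf_counter()
profiler = cProfile.Profile()
profiler.enable()
num_cells = detect_cells()
profiler.disable()
elapsed = time.perf_counter() - t0

# Report in the same style as the timings quoted in this thread.
print(f"Elapsed = {elapsed}, num cells = {num_cells}")
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```

Comparing wall-clock `Elapsed` against the cProfile totals is what reveals how much work happens below the Python layer (jitted or in worker processes), where cProfile can't see it.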
Is this a breaking change?
I don't think so. I changed `process` to not call `get_results` directly, and `get_results` now needs a worker-pool arg. But these are really all internal API, not user-facing API.

Checklist:
Changes
I'm going to list the changes in the order of the commits. There are two measures: the duration measured using `time`, and `cProfile` data. I tested with 10 clusters.

The duration for the `main` branch:

Elapsed = 19.57954730000347, num cells = 86
With cProfile
Replaced the accumulation of points in `coords_maps`: instead of stacking each new point onto the previous array, each new point is simply appended to a list of tuples. Also fixed the types so that we don't get the warnings about type conversions:

Elapsed = 1.2765582000138238, num cells = 86
With cProfile
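The difference between the two accumulation patterns can be sketched like this (with random points standing in for detected cell coordinates): stacking copies every previous point on each append, O(n²) overall, while a list of tuples appends in amortized constant time and converts to an array once at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.integers(0, 100, size=(1000, 3))

# Slow pattern: grow an array by stacking on every new point,
# which copies all previously accumulated points each time.
acc = np.empty((0, 3), dtype=np.int64)
for p in pts:
    acc = np.vstack([acc, p])

# Fast pattern: append to a plain list of tuples, then convert once.
coords = []
for p in pts:
    coords.append((p[0], p[1], p[2]))
result = np.array(coords, dtype=np.int64)

assert np.array_equal(acc, result)  # same points, very different cost
```

The list-of-tuples form is also what numba's typed containers handle well, which is where the type fixes mentioned above come in.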
Switched to using Fortran layout for the volume block because we constantly rotate (roll) the last axis, the z-axis, and Fortran layout is column-major order, which is faster for rolling that axis.
Elapsed = 1.227499600034207, num cells = 86
With cProfile
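To illustrate the layout argument (again with `np.roll` as a stand-in for the actual buffer update): after `np.asfortranarray`, each fixed-last-index slice is one contiguous block of memory, so rolling along the last axis moves whole blocks instead of strided elements:

```python
import numpy as np

vol_c = np.arange(4 * 4 * 8).reshape(4, 4, 8)  # C order (row-major)
vol_f = np.asfortranarray(vol_c)               # Fortran order (column-major)

# In column-major layout the LAST axis varies slowest, so each
# fixed-z slice vol_f[:, :, k] is one contiguous chunk.
assert vol_f.flags["F_CONTIGUOUS"]
assert vol_f[:, :, 0].flags["F_CONTIGUOUS"]

# Rolling the last axis gives identical values for either layout;
# only the memory-access pattern (and thus speed) differs.
assert np.array_equal(np.roll(vol_f, -1, axis=2), np.roll(vol_c, -1, axis=2))
```

This is the mirror image of the later z-first commit: once numba forced C order, putting z first recovered the same contiguous-block behaviour.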
Used an LRU cache for computing the kernel.
Elapsed = 1.1849509000312537, num cells = 86
With cProfile
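A sketch of the caching pattern, with a hypothetical `ball_kernel` builder in place of the real kernel computation; `functools.lru_cache` memoizes the result per radius, so repeated requests return the already-built array (which callers must therefore never mutate):

```python
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=None)
def ball_kernel(radius):
    # Hypothetical kernel builder: a boolean ball mask of the given
    # radius, computed once per radius and then served from the cache.
    # NOTE: the returned array is shared; callers must not modify it.
    z, y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1, -radius:radius + 1]
    return z * z + y * y + x * x <= radius * radius

a = ball_kernel(3)
b = ball_kernel(3)
assert a is b                 # cache hit: the very same array object
assert a.shape == (7, 7, 7)
```

Since the pipeline asks for the same kernel many times per volume, memoization removes the repeated construction cost with a one-line decorator.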