Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hvg seurat v3 numba kernel #3017

Merged
merged 18 commits into from
May 31, 2024
Merged

Hvg seurat v3 numba kernel #3017

merged 18 commits into from
May 31, 2024

Conversation

Intron7
Copy link
Member

@Intron7 Intron7 commented Apr 22, 2024

Adds a numba kernel for seurat_v3 for sparse matrices. This kernel is a lot faster and more memory efficient. I doesn't copy nor promotes the matrix to 64-bit floats.

@Intron7 Intron7 added this to the 1.10.2 milestone Apr 22, 2024
@Intron7 Intron7 linked an issue Apr 22, 2024 that may be closed by this pull request
Copy link

scverse-benchmark bot commented Apr 22, 2024

Benchmark changes

Change Before [3ba3f46] After [277c1bf] Ratio Benchmark (Parameter)
- 9.09±1ms 6.90±0.06ms 0.76 preprocessing_counts.time_log1p('pbmc3k')

Comparison: https://github.com/scverse/scanpy/compare/3ba3f46b4e6e77e8c6f0551db9663822097b486a..277c1bfb0885234aa757d0fdaeaa9103eb8568e2
Last changed: Thu, 23 May 2024 12:59:15 +0000

More details: https://github.com/scverse/scanpy/pull/3017/checks?check_run_id=25329779518

Copy link

codecov bot commented Apr 22, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 75.86%. Comparing base (3ba3f46) to head (277c1bf).
Report is 47 commits behind head on main.

Files with missing lines Patch % Lines
scanpy/preprocessing/_highly_variable_genes.py 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3017      +/-   ##
==========================================
- Coverage   75.87%   75.86%   -0.01%     
==========================================
  Files         110      110              
  Lines       12533    12533              
==========================================
- Hits         9509     9508       -1     
- Misses       3024     3025       +1     
Files with missing lines Coverage Δ
scanpy/preprocessing/_highly_variable_genes.py 95.20% <85.71%> (-0.40%) ⬇️

@Zethson Zethson changed the title Hvg seurat v3 update Hvg seurat v3 multicore kernel Apr 28, 2024
@Intron7 Intron7 changed the title Hvg seurat v3 multicore kernel Hvg seurat v3 numba kernel May 17, 2024
@Intron7
Copy link
Member Author

Intron7 commented May 17, 2024

~500k Cells.
New Version:
peak memory: 16117.61 MiB, increment: 7709.86 MiB
Wall time: 11 s
Old Version:
peak memory: 40093.42 MiB, increment: 31646.50 MiB
Wall time: 30.9 s

@Intron7 Intron7 requested review from flying-sheep and eroell May 17, 2024 12:02
@eroell
Copy link
Contributor

eroell commented May 21, 2024

Indeed, ~3-7x faster for me & of course quite a bit more memory efficient (quickly checked with scalene).

Tests keep working (for seurat_v3/seurat_v3_paper somewhat tight numeric comparison with Seurat results), nice.

scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Kinda suprised how much faster this is.

You could probably get another big speed up by operating on the whole matrix at once, and not creating batch specific matrices.

But I think fine as is.

I added a couple points to address, but that's just naming and docs.

scanpy/preprocessing/_highly_variable_genes.py Outdated Show resolved Hide resolved
docs/release-notes/1.10.2.md Outdated Show resolved Hide resolved
@ivirshup
Copy link
Member

Would also be nice to have a timing benchmark for this.

Co-authored-by: Isaac Virshup <[email protected]>
@Intron7
Copy link
Member Author

Intron7 commented May 22, 2024

It is the whole matrix for each batch. It's just called batch_counts because I needed to make sure the format is csr

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more of getting rid of the whole for b in np.unique(batch_info) loop and doing the whole thing in two passes over the matrix.

But LGTM, and I think ready to merge once those tests get fixed

@Intron7
Copy link
Member Author

Intron7 commented May 25, 2024

I don't know what makes the tests fail with dask in utils

@Intron7 Intron7 merged commit 5dc489d into main May 31, 2024
15 checks passed
@Intron7 Intron7 deleted the hvg-seurat_v3-update branch May 31, 2024 09:15
meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request May 31, 2024
flying-sheep pushed a commit that referenced this pull request Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update Preprocessing functions with numba
4 participants