Create cat regressor #3353

Intron7 · 2024-11-11T14:06:20Z

Use numba to create the regressor for categorical regression

codecov · 2024-11-11T14:21:13Z

Codecov Report

Attention: Patch coverage is 50.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 75.41%. Comparing base (8ce811a) to head (1b7d7e1).

Files with missing lines	Patch %	Lines
src/scanpy/preprocessing/_simple.py	50.00%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3353      +/-   ##
==========================================
- Coverage   75.44%   75.41%   -0.04%     
==========================================
  Files         113      113              
  Lines       13250    13257       +7     
==========================================
+ Hits         9997     9998       +1     
- Misses       3253     3259       +6

Files with missing lines	Coverage Δ
src/scanpy/preprocessing/_simple.py	`88.83% <50.00%> (-1.32%)`	⬇️

ilan-gold · 2024-11-11T15:13:59Z

tests/test_preprocessing.py

+    np.testing.assert_array_almost_equal(adata.X, tester)
+
+
+def test_regressor_categorical():


I would:

1. explain why this test exists (to test against a previous implementation?). I have no strong opinion on whether it's necessary, TBH: since we are already testing for reproducibility, I could see getting rid of it.

2. refactor the "Create org regressors" block into a helper function such as create_original

I can see your point here

Do you have an opinion on the first point? Is this test necessary? If so, perhaps add a comment?

ilan-gold

I think this is missing: #3353 (comment) and the first part of https://github.com/scverse/scanpy/pull/3353/files#r1836830351

ilan-gold · 2024-11-12T13:32:38Z

src/scanpy/preprocessing/_simple.py

@@ -722,13 +737,13 @@ def regress_out(
                "we regress on the mean for each category."
            )
        logg.debug("... regressing on per-gene means within categories")
-        regressors = np.zeros(X.shape, dtype="float32")
+        # Create numpy array's from categorical variable
+        cats = np.int64(len(adata.obs[keys[0]].cat.categories))


Also comment why np.int64

It has to be done because of weird typing from pandas; this ensures that it works within the kernel.

so len doesn’t return a Python int? That’s a pandas bug.

ilan-gold · 2024-11-12T15:53:37Z

tests/test_preprocessing.py

+    np.testing.assert_array_almost_equal(adata.X, tester)
+
+
+def test_regressor_categorical():


Do you have an opinion on the first point? Is this test necessary? If so, perhaps add a comment?

flying-sheep · 2024-11-14T08:41:57Z

src/scanpy/preprocessing/_simple.py

+        number_categories = np.int64(len(adata.obs[keys[0]].cat.categories))
+        filters = adata.obs[keys[0]].cat.codes.to_numpy()
+        number_categories = number_categories.astype(filters.dtype)


Either this or add a comment (to the code) explaining why it needs to be the other way.
Also if I do this, the test still passes, so …

Suggested change

-        number_categories = np.int64(len(adata.obs[keys[0]].cat.categories))
-        filters = adata.obs[keys[0]].cat.codes.to_numpy()
-        number_categories = number_categories.astype(filters.dtype)
+        number_categories = len(adata.obs[keys[0]].cat.categories)
+        filters = adata.obs[keys[0]].cat.codes.to_numpy()

I added a comment. Otherwise you have a dtype mismatch and a crash of the kernel.

"Otherwise you have a dtype mismatch and a crash of the kernel."

I would say that this is the important part for the comment!

100%!

refactor your code until the “what” is obvious.

if the “why” isn’t obvious from understanding the “what”, add the missing parts as a comment

I see that you’re

converting the cat codes into a numpy array

creating a numpy scalar with the same dtype as filters, holding the number of categories

So you don’t need to comment that you do any of that.

I asked because I’m confused why a Python integer is converted to a numpy scalar: Usually APIs accept either and do the converting themselves. So I’d like to see a comment removing that confusion by explaining why you convert to a numpy scalar. (a crash is a great reason)

but I also see that _create_regressor_categorical has number_categories: int and then does range(number_categories), so I’m still very confused why numba crashes unless the dtypes match.

I can’t reproduce the crash. leaving the thing as a Python int just works for me.

Also the way to do this in one step is

Suggested change

-        number_categories = np.int64(len(adata.obs[keys[0]].cat.categories))
-        filters = adata.obs[keys[0]].cat.codes.to_numpy()
-        number_categories = number_categories.astype(filters.dtype)
+        filters = adata.obs[keys[0]].cat.codes.to_numpy()
+        number_categories = filters.dtype.type(len(adata.obs[keys[0]].cat.categories))
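
For readers following this thread, here is a minimal sketch of the two conversions being compared, using only NumPy. The small `int8` array is an assumed stand-in for `adata.obs[keys[0]].cat.codes.to_numpy()`, and the names `two_step` and `one_step` are illustrative only. It does not reproduce or explain the reported numba crash; it only shows that both spellings give a NumPy scalar with the same dtype as the codes.

```python
import numpy as np

# Assumed stand-in for `adata.obs[keys[0]].cat.codes.to_numpy()`.
filters = np.array([0, 1, 0, 2, 1], dtype=np.int8)

# Plain Python int, as returned by len(...) on the categories.
n_categories = 3

# Two-step version from the diff: wrap in int64, then cast to the codes' dtype.
two_step = np.int64(n_categories).astype(filters.dtype)

# One-step version from the suggestion: call the dtype's scalar type directly.
one_step = filters.dtype.type(n_categories)

print(type(two_step).__name__, type(one_step).__name__)
```

Both results are scalars of the same type as the elements of `filters`, so the one-step form can replace the two-step form without changing what is passed on.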

ilan-gold · 2024-11-21T10:58:01Z

src/scanpy/preprocessing/_simple.py

+        number_categories = np.int64(len(adata.obs[keys[0]].cat.categories))
+        filters = adata.obs[keys[0]].cat.codes.to_numpy()
+        number_categories = number_categories.astype(filters.dtype)


"Otherwise you have a dtype mismatch and a crash of the kernel."

I would say that this is the important part for the comment!

ilan-gold · 2024-11-21T10:58:13Z

src/scanpy/preprocessing/_simple.py

+def _create_regressor_categorical(
+    X: np.ndarray, number_categories: int, filters: np.ndarray
+) -> np.ndarray:
+    # create regressor matrix faster for categorical variables


What does this comment mean?
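
One possible reading of that comment, sketched in plain NumPy rather than numba: the regressor matrix has the same shape as `X`, and each cell's row holds the per-gene means of the category that cell belongs to. The function name `create_regressor_categorical_sketch` and the toy data are hypothetical, and this is not the kernel from the PR, only an illustration of the "per-gene means within categories" idea under those assumptions.

```python
import numpy as np


def create_regressor_categorical_sketch(
    X: np.ndarray, number_categories: int, filters: np.ndarray
) -> np.ndarray:
    """Fill each cell's row with the per-gene mean of that cell's category."""
    regressors = np.zeros(X.shape, dtype=np.float32)
    for category in range(number_categories):
        mask = filters == category  # cells belonging to this category
        if mask.any():  # skip unused categories to avoid a mean over no rows
            regressors[mask] = X[mask].mean(axis=0)
    return regressors


# Toy data: 4 cells x 2 genes, two categories encoded as integer codes.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], dtype=np.float32)
filters = np.array([0, 1, 0, 1], dtype=np.int8)

regressors = create_regressor_categorical_sketch(X, 2, filters)
print(regressors)
```

In this toy case, cells 0 and 2 share category 0 and cells 1 and 3 share category 1, so rows within a category come out identical.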

scverse-benchmark · 2024-11-21T11:39:21Z

Benchmark changes

Change	Before [`8ce811a`]	After [`1b7d7e1`]	Ratio	Benchmark (Parameter)
-	59.9±0.8ms	43.6±0.5ms	0.73	preprocessing_counts.time_filter_genes('pbmc3k', 'counts-off-axis')
-	3.23±0.1ms	2.67±0.06ms	0.83	preprocessing_log.FastSuite.time_mean_var('pbmc3k', None)
+	8.87±0.03ms	9.78±0.7ms	1.1	preprocessing_log.time_highly_variable_genes('pbmc68k_reduced', 'off-axis')

Comparison: https://github.com/scverse/scanpy/compare/8ce811ac3ab6674a7d6235f44afda3e10e366682..1b7d7e177b9f1c8bc67eb6dad8d4617b1d16e443
Last changed: Thu, 23 Jan 2025 15:39:10 +0000

More details: https://github.com/scverse/scanpy/pull/3353/checks?check_run_id=36066814582

Intron7 added 3 commits November 11, 2024 14:35

add function and test

086f70d

add test

37244a9

add test for regressor

b4ecb0a

Intron7 added this to the 1.11.0 milestone Nov 11, 2024

Intron7 and others added 2 commits November 11, 2024 15:54

add release note

36858d9

Merge branch 'main' into create_cat_regressor

be1bccc

Intron7 requested review from flying-sheep and ilan-gold November 11, 2024 14:56

ilan-gold requested changes Nov 11, 2024

Intron7 added 2 commits November 11, 2024 16:25

update typing

a1a59ae

update test

7b41bc8

Intron7 requested a review from ilan-gold November 11, 2024 15:36

ilan-gold requested changes Nov 12, 2024

Intron7 added 2 commits November 12, 2024 13:45

update test

119a142

update dtype

d77fa9c

ilan-gold requested changes Nov 12, 2024

Intron7 and others added 4 commits November 12, 2024 14:44

rename cats

236e356

Update tests/test_preprocessing.py

bb9cde4

Co-authored-by: Ilan Gold <[email protected]>

Update tests/test_preprocessing.py

bbb5035

Co-authored-by: Ilan Gold <[email protected]>

Update tests/test_preprocessing.py

2a92193

Co-authored-by: Ilan Gold <[email protected]>

Intron7 requested a review from ilan-gold November 12, 2024 15:18

ilan-gold requested changes Nov 12, 2024

ilan-gold and others added 4 commits November 12, 2024 16:53

Update tests/test_preprocessing.py

c7b78c0

remove test

b001c0e

update kernel

c3ce03e

remove test

c50226a

Intron7 requested a review from ilan-gold November 13, 2024 10:55

flying-sheep requested changes Nov 14, 2024

make test together

c6665f4

Intron7 added 2 commits November 21, 2024 11:43

cleanup

858e247

add comment

2421bd5

Intron7 requested a review from flying-sheep November 21, 2024 10:45

ilan-gold reviewed Nov 21, 2024

ilan-gold added the benchmark label Nov 21, 2024

flying-sheep removed their request for review November 21, 2024 11:39

Merge branch 'main' into create_cat_regressor

2e16c45

flying-sheep modified the milestones: 1.11.0, 1.12.0 Dec 20, 2024

Merge branch 'main' into create_cat_regressor

1b7d7e1
