Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hvg flavors seurat and cellranger with batch: bug in subset #3042

Merged
merged 12 commits into from
Jun 28, 2024

Conversation

eroell
Copy link
Contributor

@eroell eroell commented May 2, 2024

  • Release notes not necessary because: they are, added

This PR fixes the bug reported in the linked issue.

A new test which spots the erroneous computations has been added.

I would use this chance to refactor the _highly_variable_genes.py, rather than using the 2-lines fix suggested in the first commit:
Doing the multi-batch hvg flagging differently for seurat_v3 and seurat/cell_ranger is what made this bug hard to spot in the first place I think.

Copy link

codecov bot commented May 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.31%. Comparing base (896e249) to head (13b1f6c).
Report is 50 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3042   +/-   ##
=======================================
  Coverage   76.31%   76.31%           
=======================================
  Files         109      109           
  Lines       12513    12515    +2     
=======================================
+ Hits         9549     9551    +2     
  Misses       2964     2964           
Files with missing lines Coverage Δ
src/scanpy/preprocessing/_highly_variable_genes.py 95.23% <100.00%> (+0.03%) ⬆️

@eroell eroell requested a review from flying-sheep May 2, 2024 14:49
@flying-sheep
Copy link
Member

Looks simple enough! Please deduplicate the tests though, they have too many identical lines.

@eroell
Copy link
Contributor Author

eroell commented May 21, 2024

Please deduplicate the tests though, they have too many identical lines.

To do so did across-setting tests with for loop on top of former test...

Would you prefer one separate test with that for loop for the across-settings check?

Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry for missing this. I’m always for pytest.parametrize instead of for loops. That way it’s immediately visible which variants of the test fail.

@eroell
Copy link
Contributor Author

eroell commented Jun 27, 2024

To test in the future that this bug doesnt happen, the 4 combinations of inplace=True/False and subset=True/False need to be compared for their selected var_names - if these arguments were be in pytest.mark.parametrize, the different produced anndatas don't live at the same time and can't be compared...

Could

  • extract this specific section into a new, dedicated smaller test
    # check that the results are consistent for subset True/False: inplace True
    adata_post_subset = adatas["subset_False_inplace_True"][
        :, adatas["subset_False_inplace_True"].var["highly_variable"]
    ]
    assert adata_post_subset.var_names.equals(
        adatas["subset_True_inplace_True"].var_names
    )

    # check that the results are consistent for subset True/False: inplace False
    df_post_subset = dfs["subset_False_inplace_False"][
        dfs["subset_False_inplace_False"]["highly_variable"]
    ]
    assert df_post_subset.index.equals(dfs["subset_True_inplace_False"].index)

    # check that the results are consistent for inplace True/False: subset True
    assert adatas["subset_True_inplace_True"].var_names.equals(
        dfs["subset_True_inplace_False"].index
    )
  • hardcode the known-to-be-selected features and check against them

what's your preference here?

@flying-sheep
Copy link
Member

Ah, silly me, this makes sense.

Since some entries of the dicts are never used, I just removed them and replaced the string with a bool (subset=True/False). This makes it all quite a bit more compact.

Also please remember itertools: If we’re not writing numba code, it’s always preferable to use it as opposed to nesting for loops.

Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a relnote

@flying-sheep flying-sheep added this to the 1.10.3 milestone Jun 27, 2024
@eroell
Copy link
Contributor Author

eroell commented Jun 27, 2024

Timeout and Scrublet failing in Python3.12?

@eroell
Copy link
Contributor Author

eroell commented Jun 27, 2024

Ah on main too I see

@eroell
Copy link
Contributor Author

eroell commented Jun 27, 2024

coverage decreased, I think not detected because some pytest.parametrize were removed?
I can also add the new specific test separately do avoid this, else it looks good to me

@eroell eroell marked this pull request as ready for review June 27, 2024 16:45
@flying-sheep
Copy link
Member

coverage decreased, I think not detected because some pytest.parametrize were removed?

I think codecov just hadn’t updated its comment yet when you saw that.

What you say can’t be, it doesn’t matter how a line was hit: If a line is run, it’ll be reported as hit, if your changes would have caused it to no longer be it, it would have been reported as a miss.

@flying-sheep flying-sheep merged commit fdfb9a1 into scverse:main Jun 28, 2024
11 of 14 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Jun 28, 2024
flying-sheep pushed a commit that referenced this pull request Jun 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexplainable sc.pp.highly_variable_genes(subset = True) behavior
2 participants