Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexplainable sc.pp.highly_variable_genes(subset = True) behavior #3027

Closed
3 tasks done
jberkh opened this issue Apr 24, 2024 · 1 comment · Fixed by #3042
Closed
3 tasks done

Unexplainable sc.pp.highly_variable_genes(subset = True) behavior #3027

jberkh opened this issue Apr 24, 2024 · 1 comment · Fixed by #3042
Assignees

Comments

@jberkh
Copy link

jberkh commented Apr 24, 2024

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

When doing HVG selection with batch_key = "something" and subset = True, I noticed unexpected genes to be selected in the subset anndata. Upon further investigation, it seems that somehow the inplace subsetting goes wrong. (Though I checked the source code and could not find any issue that may explain it there.)

Minimal code sample

import scanpy as sc
import numpy as np
np.random.seed(0)

# Get AnnData
adata = sc.datasets.pbmc3k()
adata.layers['counts'] = adata.X.copy().tocsr()
adata.obs["Age"] = np.random.randint(0, 6, (2700,))
adata.obs["Age"] = adata.obs["Age"].astype('category')

# Filter genes, preprocess
sc.pp.filter_genes(adata, min_counts = 10)
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# Subset = False
ad_nosub = adata.copy()
sc.pp.highly_variable_genes(ad_nosub, n_top_genes = 1000, batch_key = "Age", subset = False)

# Subset = False, manual subset afterwards
ad_nosub_subbed = ad_nosub.copy()
ad_nosub_subbed._inplace_subset_var(ad_nosub_subbed.var["highly_variable"].to_numpy())

# Subset = True
ad_sub = adata.copy()
sc.pp.highly_variable_genes(ad_sub, n_top_genes = 1000, batch_key = "Age", subset = True)

Error output

>>> # As expected
>>> print(np.sum(ad_nosub.var["highly_variable"]))
1000
>>> 
>>> # As expected
>>> print(np.sum(ad_nosub_subbed.var["highly_variable"]))
1000
>>> 
>>> # Not as expected
>>> print(np.sum(ad_sub.var["highly_variable"]))
101

Versions

→ conda list | grep scanpy
scanpy                    1.10.1             pyhd8ed1ab_0    conda-forge

→ conda list | grep anndata
anndata                   0.10.7             pyhd8ed1ab_0    conda-forge
@eroell eroell self-assigned this May 2, 2024
@eroell
Copy link
Contributor

eroell commented May 2, 2024

Hey, thanks a lot for spotting & the nice reproducible example! Big help.

From my first look into it, it seems this is a bug indeed;

Bug appearing when

  • batch_key is not None and
  • flavor is “seurat” or “cell_ranger” and
  • then using subset=True.

Other cases are not suffering from this it seems (e.g. flavor="seurat_v3", or when batch_key=None).

Issue
It appears in the cases describe above, subset=True will cause the first n_top_genes many genes of adata.var to be used as selection: not the actual n_top_genes highly variable genes.

Fix is on the way: I'll follow up here.

Your Example
Reveals that sc.pp.highly_variable_genes(ad_sub, n_top_genes = 1000, batch_key = "Age", subset = True) suffers from this.

Circumvent bug
For now, I recommend not using subset=True if the cases above hold for you:
Rather, use

adata_subset = adata[:, adata.var["highly_variable"]]

when subsetting: which is basically the "subsetting afterwards" strategy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants