Unexplainable sc.pp.highly_variable_genes(subset = True) behavior #3027

jberkh · 2024-04-24T10:14:20Z

Please make sure these conditions are met

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of scanpy.
(optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

When doing HVG selection with batch_key = "something" and subset = True, I noticed unexpected genes to be selected in the subset anndata. Upon further investigation, it seems that somehow the inplace subsetting goes wrong. (Though I checked the source code and could not find any issue that may explain it there.)

Minimal code sample

import scanpy as sc
import numpy as np
np.random.seed(0)

# Get AnnData
adata = sc.datasets.pbmc3k()
adata.layers['counts'] = adata.X.copy().tocsr()
adata.obs["Age"] = np.random.randint(0, 6, (2700,))
adata.obs["Age"] = adata.obs["Age"].astype('category')

# Filter genes, preprocess
sc.pp.filter_genes(adata, min_counts = 10)
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# Subset = False
ad_nosub = adata.copy()
sc.pp.highly_variable_genes(ad_nosub, n_top_genes = 1000, batch_key = "Age", subset = False)

# Subset = False, manual subset afterwards
ad_nosub_subbed = ad_nosub.copy()
ad_nosub_subbed._inplace_subset_var(ad_nosub_subbed.var["highly_variable"].to_numpy())

# Subset = True
ad_sub = adata.copy()
sc.pp.highly_variable_genes(ad_sub, n_top_genes = 1000, batch_key = "Age", subset = True)

Error output

>>> # As expected
>>> print(np.sum(ad_nosub.var["highly_variable"]))
1000
>>> 
>>> # As expected
>>> print(np.sum(ad_nosub_subbed.var["highly_variable"]))
1000
>>> 
>>> # Not as expected
>>> print(np.sum(ad_sub.var["highly_variable"]))
101

Versions

→ conda list | grep scanpy
scanpy                    1.10.1             pyhd8ed1ab_0    conda-forge

→ conda list | grep anndata
anndata                   0.10.7             pyhd8ed1ab_0    conda-forge

eroell · 2024-05-02T14:20:33Z

Hey, thanks a lot for spotting & the nice reproducible example! Big help.

From my first look into it, it seems this is a bug indeed;

Bug appearing when

batch_key is not None and
flavor is “seurat” or “cell_ranger” and
then using subset=True.

Other cases are not suffering from this it seems (e.g. flavor="seurat_v3", or when batch_key=None).

Issue
It appears in the cases describe above, subset=True will cause the first n_top_genes many genes of adata.var to be used as selection: not the actual n_top_genes highly variable genes.

Fix is on the way: I'll follow up here.

Your Example
Reveals that sc.pp.highly_variable_genes(ad_sub, n_top_genes = 1000, batch_key = "Age", subset = True) suffers from this.

Circumvent bug
For now, I recommend not using subset=True if the cases above hold for you:
Rather, use

adata_subset = adata[:, adata.var["highly_variable"]]

when subsetting: which is basically the "subsetting afterwards" strategy.

jberkh added the Bug 🐛 label Apr 24, 2024

eroell self-assigned this May 2, 2024

eroell mentioned this issue May 2, 2024

hvg flavors seurat and cellranger with batch: bug in subset #3042

Merged

3 tasks

flying-sheep closed this as completed in #3042 Jun 28, 2024

ilan-gold mentioned this issue Jul 16, 2024

hvg selects more genes than asked for #3157

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexplainable sc.pp.highly_variable_genes(subset = True) behavior #3027

Unexplainable sc.pp.highly_variable_genes(subset = True) behavior #3027

jberkh commented Apr 24, 2024 •

edited

Loading

eroell commented May 2, 2024 •

edited

Loading

Unexplainable sc.pp.highly_variable_genes(subset = True) behavior #3027

Unexplainable sc.pp.highly_variable_genes(subset = True) behavior #3027

Comments

jberkh commented Apr 24, 2024 • edited Loading

Please make sure these conditions are met

What happened?

Minimal code sample

Error output

Versions

eroell commented May 2, 2024 • edited Loading

jberkh commented Apr 24, 2024 •

edited

Loading

eroell commented May 2, 2024 •

edited

Loading