
hvg selects more genes than asked for #3157

Closed
ilan-gold opened this issue Jul 16, 2024 · 5 comments · Fixed by #3176
Comments

@ilan-gold
Contributor

ilan-gold commented Jul 16, 2024

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

HVG can mark more genes as highly variable than the number asked for.

This occurs on these two datasets:

wget https://datasets.cellxgene.cziscience.com/e00ab1f4-28cd-497d-b889-94d45840f423.h5ad

Minimal code sample

import scanpy as sc

adata1 = sc.read('e00ab1f4-28cd-497d-b889-94d45840f423.h5ad')

sc.pp.normalize_total(adata1, target_sum=1e4)

sc.pp.log1p(adata1)

n_top_gene = 10000
sc.pp.highly_variable_genes(adata1, n_top_genes = n_top_gene)

hvg_system1 = set(adata1.var[adata1.var['highly_variable']].index)
assert len(hvg_system1) == n_top_gene, f"found {len(hvg_system1)} instead of {n_top_gene}"

Error output

AssertionError                            Traceback (most recent call last)
Cell In[12], line 1
----> 1 assert len(hvg_system1) == n_top_gene, f"found {len(hvg_system1)} instead of {n_top_gene}"

AssertionError: found 13355 instead of 10000

Versions

import scanpy; scanpy.logging.print_versions()
-----
anndata     0.10.8
scanpy      1.10.0rc2.dev85+gb918a23e
-----
@ilan-gold ilan-gold added Bug 🐛 Triage 🩺 This issue needs to be triaged by a maintainer labels Jul 16, 2024
@ilan-gold ilan-gold changed the title hvg hvg selects more genes than asked for Jul 16, 2024
@ilan-gold ilan-gold added this to the 1.10.3 milestone Jul 16, 2024
@ilan-gold
Contributor Author

ilan-gold commented Jul 17, 2024

@eroell Can you give some context here? The issue is in the following lines:

def _subset_genes(
    adata: AnnData,
    *,
    mean: NDArray[np.float64] | DaskArray,
    dispersion_norm: NDArray[np.float64] | DaskArray,
    cutoff: _Cutoffs | int,
) -> NDArray[np.bool_] | DaskArray:
    """Get boolean mask of genes with normalized dispersion in bounds."""
    if isinstance(cutoff, _Cutoffs):
        dispersion_norm = np.nan_to_num(dispersion_norm)  # similar to Seurat
        return cutoff.in_bounds(mean, dispersion_norm)
    n_top_genes = cutoff
    del cutoff

    if n_top_genes > adata.n_vars:
        logg.info("`n_top_genes` > `adata.n_var`, returning all genes.")
        n_top_genes = adata.n_vars
    disp_cut_off = _nth_highest(dispersion_norm, n_top_genes)
    logg.debug(
        f"the {n_top_genes} top genes correspond to a "
        f"normalized dispersion cutoff of {disp_cut_off}"
    )
    return np.nan_to_num(dispersion_norm) >= disp_cut_off


def _nth_highest(x: NDArray[np.float64] | DaskArray, n: int) -> float | DaskArray:
    x = x[~np.isnan(x)]
    if n > x.size:
        msg = "`n_top_genes` > number of normalized dispersions, returning all genes with normalized dispersions."
        warnings.warn(msg, UserWarning)
        n = x.size
    if isinstance(x, DaskArray):
        return x.topk(n)[-1]
    # interestingly, np.argpartition is slightly slower
    x[::-1].sort()
    return x[n - 1]

Basically, disp_cut_off = _nth_highest(dispersion_norm, n_top_genes) can come out negative (it ignores NaNs entirely, since it first does x = x[~np.isnan(x)]). The last line, np.nan_to_num(dispersion_norm) >= disp_cut_off, then selects more genes than you'd expect: the NaNs become 0, and 0 passes a negative cutoff that was computed without those NaN-turned-0 entries.

So what is the fix here? Converting nans to minus infinity? Converting nans to 0 before getting the nth highest?
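
For illustration, a minimal numpy sketch of the effect with made-up values (not taken from the dataset):

import numpy as np

dispersion_norm = np.array([2.0, 1.0, -0.5, np.nan, np.nan])
n_top_genes = 3

# _nth_highest drops the NaNs first, so the cutoff can come out negative:
finite = dispersion_norm[~np.isnan(dispersion_norm)]
finite[::-1].sort()
disp_cut_off = finite[n_top_genes - 1]  # -0.5

# the final mask converts NaNs to 0, and 0 passes the negative cutoff:
mask = np.nan_to_num(dispersion_norm) >= disp_cut_off
print(mask.sum())  # 5 genes selected instead of 3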

@flying-sheep
Member

yeah, a fix could simply be

-return np.nan_to_num(dispersion_norm) >= disp_cut_off
+return np.nan_to_num(dispersion_norm, nan=-np.inf) >= disp_cut_off

but I have a hard time coming up with a test. Doing something like this doesn’t work, as it crashes earlier,
with something like “ValueError: cannot specify integer bins when input data contains infinity”

@pytest.mark.parametrize("flavor", ["seurat", "cell_ranger"])
def test_no_filter_genes(flavor):
    """Test that even with 0 columns in the data, n_top_genes is respected."""
    adata = pbmc3k()
    means, _ = _get_mean_var(adata.X)
    assert (means == 0).any()
    sc.pp.highly_variable_genes(adata, flavor=flavor, n_top_genes=10000)
    assert adata.var["highly_variable"].sum() == 10000
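
Not a proper regression test, but a quick toy check (made-up values) of what the one-line change does to the mask:

import numpy as np

dispersion_norm = np.array([2.0, 1.0, -0.5, np.nan, np.nan])
disp_cut_off = -0.5  # nth-highest value, computed on the NaN-free dispersions

# current behaviour: NaNs become 0 and pass the negative cutoff
print((np.nan_to_num(dispersion_norm) >= disp_cut_off).sum())  # 5
# with nan=-np.inf, the NaN genes are excluded
print((np.nan_to_num(dispersion_norm, nan=-np.inf) >= disp_cut_off).sum())  # 3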

@eroell
Contributor

eroell commented Jul 29, 2024

Sorry for being slow here - not sure how I got the mention, as I haven't contributed to this part of hvg in the past, but happy to give my five cents.

This occurs on these two datasets:

I think you only linked one? That's enough, though, since I assume it's only the constant-gene case that causes this issue, and that case is indeed present in this dataset, just as you showed.

Making a test

ValueError: cannot specify integer bins when input data contains infinity

I think this test is missing sc.pp.normalize_total and sc.pp.log1p for flavor="seurat". The following test passes with the fix and fails without it.

import scanpy as sc
# _get_mean_var is the private helper scanpy's HVG code uses; the import path may differ between versions
from scanpy.preprocessing._utils import _get_mean_var


def test_no_filter_genes(flavor):
    """Test that even with 0 columns in the data, n_top_genes is respected."""
    adata = sc.datasets.pbmc3k()
    means, _ = _get_mean_var(adata.X)
    assert (means == 0).any()
    sc.pp.normalize_total(adata, target_sum=10000)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, flavor=flavor, n_top_genes=10000)
    assert adata.var["highly_variable"].sum() == 10000


test_no_filter_genes("seurat")

Happy to make this PR

Below is a second comment, on a follow-up issue...

@eroell
Contributor

eroell commented Jul 29, 2024

The data also reveals a further issue: when multiple genes share the value of disp_cut_off.

It can be reproduced by e.g. not running sc.pp.normalize_total:

import scanpy as sc

adata = sc.datasets.pbmc3k()
# sc.pp.normalize_total(adata, target_sum=10000)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=10000)
adata.var["highly_variable"].sum()
10367

This is because many genes share the value selected as disp_cut_off here, i.e.

... x[n-2] = x[n-1] = x[n] = x[n+1] = x[n+2] ...

def _nth_highest(x: NDArray[np.float64] | DaskArray, n: int) -> float | DaskArray:
    x = x[~np.isnan(x)]
    if n > x.size:
        msg = "`n_top_genes` > number of normalized dispersions, returning all genes with normalized dispersions."
        warnings.warn(msg, UserWarning)
        n = x.size
    if isinstance(x, DaskArray):
        return x.topk(n)[-1]
    # interestingly, np.argpartition is slightly slower
    x[::-1].sort()
    return x[n - 1]
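
A toy illustration of how ties at the cutoff inflate the selection (made-up values):

import numpy as np

dispersion_norm = np.array([2.5, 2.0, 1.5, 1.5, 1.5, 0.5])
n_top_genes = 3

x = dispersion_norm[~np.isnan(dispersion_norm)]
x[::-1].sort()
disp_cut_off = x[n_top_genes - 1]  # 1.5, shared by three genes

mask = np.nan_to_num(dispersion_norm) >= disp_cut_off
print(mask.sum())  # 5 genes selected instead of 3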

I tried to check how Seurat proceeds in such a case, expecting to see how it breaks the ties (data downloaded from here).
Here I'm actually not sure how to turn off the scale.factor argument? It's set to 10,000 by default.

library(dplyr)
library(Seurat)
library(patchwork)

pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k")

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor=10000)

pbmc <- FindVariableFeatures(pbmc, selection.method = "mean.var.plot", nfeatures = 10000)

length(VariableFeatures(pbmc))
2292

However, it turns out Seurat seems to restrict the selection to genes that pass the configured mean and normalized-dispersion thresholds. Scanpy ignores these thresholds when the number of genes is given.

So no real insight into how to break ties in this case. I'd suggest opening a new issue, which the potential project on comparing the frameworks could address.

@flying-sheep flying-sheep mentioned this issue Jul 30, 2024
@flying-sheep
Member

The following test passes with the fix, and fails with the unfixed prior version.

Ah, with normalize_total, it works with seurat, thank you!

When multiple genes share the value of disp_cut_off.

I think when the n-th highest value is tied with others, returning more than the specified number makes total sense.

As in scipy.stats.rankdata, identical data points get identical ranks. So the “values up to the 4th-highest” of the array [6,5,3,3,3,1,0] are [6,5,3,3,3], not [6,5,3,3]; it makes no sense to include only some of the 3s here.
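
A small illustration of that with scipy.stats.rankdata (negating the array so that rank 1 is the highest value):

import numpy as np
from scipy.stats import rankdata

x = np.array([6, 5, 3, 3, 3, 1, 0])

# identical values get identical ranks
ranks = rankdata(-x, method="min")
print(ranks)  # [1. 2. 3. 3. 3. 6. 7.]

# "values up to the 4th-highest": all ties at the cutoff are kept
print(x[ranks <= 4])  # [6 5 3 3 3]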

@flying-sheep flying-sheep added Area – Preprocessing 🧼 and removed Triage 🩺 This issue needs to be triaged by a maintainer labels Jul 30, 2024
@flying-sheep flying-sheep self-assigned this Jul 30, 2024