
hvg selects more genes than asked for #3157

Closed
ilan-gold opened this issue Jul 16, 2024 · 5 comments · Fixed by #3176
Comments

@ilan-gold
Contributor

ilan-gold commented Jul 16, 2024

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the main branch of scanpy.

What happened?

HVG can mark more genes as highly variable than the number asked for.

This occurs on these two datasets:

wget https://datasets.cellxgene.cziscience.com/e00ab1f4-28cd-497d-b889-94d45840f423.h5ad

Minimal code sample

import scanpy as sc

adata1 = sc.read('e00ab1f4-28cd-497d-b889-94d45840f423.h5ad')

sc.pp.normalize_total(adata1, target_sum=1e4)

sc.pp.log1p(adata1)

n_top_gene = 10000
sc.pp.highly_variable_genes(adata1, n_top_genes = n_top_gene)

hvg_system1 = set(adata1.var[adata1.var['highly_variable']].index)
assert len(hvg_system1) == n_top_gene, f"found {len(hvg_system1)} instead of {n_top_gene}"

Error output

AssertionError                            Traceback (most recent call last)
Cell In[12], line 1
----> 1 assert len(hvg_system1) == n_top_gene, f"found {len(hvg_system1)} instead of {n_top_gene}"

AssertionError: found 13355 instead of 10000

Versions

import scanpy; scanpy.logging.print_versions()
-----
anndata     0.10.8
scanpy      1.10.0rc2.dev85+gb918a23e
-----
@ilan-gold ilan-gold added Bug 🐛 Triage 🩺 This issue needs to be triaged by a maintainer labels Jul 16, 2024
@ilan-gold ilan-gold changed the title hvg hvg selects more genes than asked for Jul 16, 2024
@ilan-gold ilan-gold added this to the 1.10.3 milestone Jul 16, 2024
@ilan-gold
Contributor Author

ilan-gold commented Jul 17, 2024

@eroell Can you give some context here? The issue is in the following lines:

def _subset_genes(
    adata: AnnData,
    *,
    mean: NDArray[np.float64] | DaskArray,
    dispersion_norm: NDArray[np.float64] | DaskArray,
    cutoff: _Cutoffs | int,
) -> NDArray[np.bool_] | DaskArray:
    """Get boolean mask of genes with normalized dispersion in bounds."""
    if isinstance(cutoff, _Cutoffs):
        dispersion_norm = np.nan_to_num(dispersion_norm)  # similar to Seurat
        return cutoff.in_bounds(mean, dispersion_norm)
    n_top_genes = cutoff
    del cutoff

    if n_top_genes > adata.n_vars:
        logg.info("`n_top_genes` > `adata.n_var`, returning all genes.")
        n_top_genes = adata.n_vars
    disp_cut_off = _nth_highest(dispersion_norm, n_top_genes)
    logg.debug(
        f"the {n_top_genes} top genes correspond to a "
        f"normalized dispersion cutoff of {disp_cut_off}"
    )
    return np.nan_to_num(dispersion_norm) >= disp_cut_off


def _nth_highest(x: NDArray[np.float64] | DaskArray, n: int) -> float | DaskArray:
    x = x[~np.isnan(x)]
    if n > x.size:
        msg = "`n_top_genes` > number of normalized dispersions, returning all genes with normalized dispersions."
        warnings.warn(msg, UserWarning)
        n = x.size
    if isinstance(x, DaskArray):
        return x.topk(n)[-1]
    # interestingly, np.argpartition is slightly slower
    x[::-1].sort()
    return x[n - 1]

Basically, disp_cut_off = _nth_highest(dispersion_norm, n_top_genes) can come out negative (it ignores NaNs entirely, since it first does x = x[~np.isnan(x)]). The last line, np.nan_to_num(dispersion_norm) >= disp_cut_off, then selects more genes than you'd expect: the NaNs become 0, and 0 passes a negative cutoff that was computed without those NaN-turned-0 entries.

So what is the fix here? Converting nans to minus infinity? Converting nans to 0 before getting the nth highest?
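
For illustration, a minimal numpy sketch of the effect with made-up values (not taken from the dataset):

import numpy as np

dispersion_norm = np.array([2.0, 1.0, -0.5, np.nan, np.nan])
n_top_genes = 3

# _nth_highest drops the NaNs first, so the cutoff can come out negative:
finite = dispersion_norm[~np.isnan(dispersion_norm)]
finite[::-1].sort()
disp_cut_off = finite[n_top_genes - 1]  # -0.5

# the final mask converts NaNs to 0, and 0 passes the negative cutoff:
mask = np.nan_to_num(dispersion_norm) >= disp_cut_off
print(mask.sum())  # 5 genes selected instead of 3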

@flying-sheep
Member

yeah, a fix could simply be

-return np.nan_to_num(dispersion_norm) >= disp_cut_off
+return np.nan_to_num(dispersion_norm, nan=-np.inf) >= disp_cut_off

but I have a hard time coming up with a test. Doing something like this doesn’t work, as it crashes earlier,
with something like “ValueError: cannot specify integer bins when input data contains infinity”

@pytest.mark.parametrize("flavor", ["seurat", "cell_ranger"])
def test_no_filter_genes(flavor):
    """Test that even with 0 columns in the data, n_top_genes is respected."""
    adata = pbmc3k()
    means, _ = _get_mean_var(adata.X)
    assert (means == 0).any()
    sc.pp.highly_variable_genes(adata, flavor=flavor, n_top_genes=10000)
    assert adata.var["highly_variable"].sum() == 10000
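
Not a proper regression test, but a quick toy check (made-up values) of what the one-line change does to the mask:

import numpy as np

dispersion_norm = np.array([2.0, 1.0, -0.5, np.nan, np.nan])
disp_cut_off = -0.5  # nth-highest value, computed on the NaN-free dispersions

# current behaviour: NaNs become 0 and pass the negative cutoff
print((np.nan_to_num(dispersion_norm) >= disp_cut_off).sum())  # 5
# with nan=-np.inf, the NaN genes are excluded
print((np.nan_to_num(dispersion_norm, nan=-np.inf) >= disp_cut_off).sum())  # 3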

@eroell
Contributor

eroell commented Jul 29, 2024

Sorry for being slow here - not sure how I got the mention, as I haven't contributed to this part of hvg in the past, but happy to give my five cents.

This occurs on these two datasets:

I think you only linked one? That's enough, though, since I assume it's only the constant-gene case that causes this issue, and that case is indeed present in this dataset, just as you showed.

Making a test

ValueError: cannot specify integer bins when input data contains infinity

I think this test is missing sc.pp.normalize_total and sc.pp.log1p for flavor="seurat". The following test passes with the fix and fails without it.

import scanpy as sc
# _get_mean_var is the private helper scanpy's HVG code uses; the import path may differ between versions
from scanpy.preprocessing._utils import _get_mean_var


def test_no_filter_genes(flavor):
    """Test that even with 0 columns in the data, n_top_genes is respected."""
    adata = sc.datasets.pbmc3k()
    means, _ = _get_mean_var(adata.X)
    assert (means == 0).any()
    sc.pp.normalize_total(adata, target_sum=10000)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, flavor=flavor, n_top_genes=10000)
    assert adata.var["highly_variable"].sum() == 10000


test_no_filter_genes("seurat")

Happy to make this PR

Below is a second comment, on a follow-up issue...

@eroell
Contributor

eroell commented Jul 29, 2024

The data also reveals a further issue: when multiple genes share the value of disp_cut_off.

It can be reproduced by e.g. not running sc.pp.normalize_total:

import scanpy as sc

adata = sc.datasets.pbmc3k()
# sc.pp.normalize_total(adata, target_sum=10000)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=10000)
adata.var["highly_variable"].sum()
10367

This is because many genes share the value selected as disp_cut_off here, i.e.

... x[n-2] = x[n-1] = x[n] = x[n+1] = x[n+2] ...

def _nth_highest(x: NDArray[np.float64] | DaskArray, n: int) -> float | DaskArray:
    x = x[~np.isnan(x)]
    if n > x.size:
        msg = "`n_top_genes` > number of normalized dispersions, returning all genes with normalized dispersions."
        warnings.warn(msg, UserWarning)
        n = x.size
    if isinstance(x, DaskArray):
        return x.topk(n)[-1]
    # interestingly, np.argpartition is slightly slower
    x[::-1].sort()
    return x[n - 1]
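
A toy illustration of how ties at the cutoff inflate the selection (made-up values):

import numpy as np

dispersion_norm = np.array([2.5, 2.0, 1.5, 1.5, 1.5, 0.5])
n_top_genes = 3

x = dispersion_norm[~np.isnan(dispersion_norm)]
x[::-1].sort()
disp_cut_off = x[n_top_genes - 1]  # 1.5, shared by three genes

mask = np.nan_to_num(dispersion_norm) >= disp_cut_off
print(mask.sum())  # 5 genes selected instead of 3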

I tried to check how Seurat proceeds in such a case, expecting to see how it breaks the ties (data downloaded from here).
Here I'm actually not sure how to turn off the scale.factor argument? It's set to 10,000 by default.

library(dplyr)
library(Seurat)
library(patchwork)

pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k")

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor=10000)

pbmc <- FindVariableFeatures(pbmc, selection.method = "mean.var.plot", nfeatures = 10000)

length(VariableFeatures(pbmc))
2292

However, it turns out Seurat seems to restrict the selection to genes that pass the configured mean and normalized-dispersion thresholds. Scanpy ignores these thresholds when the number of genes is given.

So no real insight into how to break ties in this case. I'd suggest opening a new issue, which the potential project on comparing the frameworks could address.

@flying-sheep flying-sheep mentioned this issue Jul 30, 2024
@flying-sheep
Member

The following test passes with the fix, and fails with the unfixed prior version.

Ah, with normalize_total, it works with seurat, thank you!

When multiple genes share the value of disp_cut_off.

I think when the n-th highest value is tied with others, returning more than the specified number makes total sense.

As in scipy.stats.rankdata, identical data points get identical ranks. So the “values up to the 4th-highest” of the array [6,5,3,3,3,1,0] are [6,5,3,3,3], not [6,5,3,3]; it makes no sense to include only some of the 3s here.
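
A small illustration of that with scipy.stats.rankdata (negating the array so that rank 1 is the highest value):

import numpy as np
from scipy.stats import rankdata

x = np.array([6, 5, 3, 3, 3, 1, 0])

# identical values get identical ranks
ranks = rankdata(-x, method="min")
print(ranks)  # [1. 2. 3. 3. 3. 6. 7.]

# "values up to the 4th-highest": all ties at the cutoff are kept
print(x[ranks <= 4])  # [6 5 3 3 3]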

@flying-sheep flying-sheep added Area – Preprocessing 🧼 and removed Triage 🩺 This issue needs to be triaged by a maintainer labels Jul 30, 2024
@flying-sheep flying-sheep self-assigned this Jul 30, 2024