-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hvg
selects more genes than asked for
#3157
Comments
@eroell Can you give some context here? The issue is the following lines: scanpy/src/scanpy/preprocessing/_highly_variable_genes.py Lines 383 to 418 in b918a23
Basically, So what is the fix here? Converting nans to minus infinity? Converting nans to 0 before getting the nth highest? |
yeah, a fix could simply be -return np.nan_to_num(dispersion_norm) >= disp_cut_off
+return np.nan_to_num(dispersion_norm, nan=-np.inf) >= disp_cut_off but I have a hard time coming up with a test. Doing something like this doesn’t work, as it crashes earlier, @pytest.mark.parametrize("flavor", ["seurat", "cell_ranger"])
def test_no_filter_genes(flavor):
"""Test that even with 0 columns in the data, n_top_genes is respected."""
adata = pbmc3k()
means, _ = _get_mean_var(adata.X)
assert (means == 0).any()
sc.pp.highly_variable_genes(adata, flavor=flavor, n_top_genes=10000)
assert adata.var["highly_variable"].sum() == 10000 |
Sorry for being slow here - not sure how I got the mention as this part of hvg I have not contributed to in the past but happy to give my 5 cents to this.
I think you only linked one? Which is enough as I assume it is only the constant-gene case which is causing this issue, which is indeed present just as you showed in this dataset. Making a test
I think this test is missing def test_no_filter_genes(flavor):
"""Test that even with 0 columns in the data, n_top_genes is respected."""
adata = sc.datasets.pbmc3k()
means, _ = _get_mean_var(adata.X)
assert (means == 0).any()
sc.pp.normalize_total(adata, target_sum=10000)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor=flavor, n_top_genes=10000)
assert adata.var["highly_variable"].sum() == 10000
test_no_filter_genes("seurat") Happy to make this PR Below second comment on follow up issue... |
The data also allows to detect a next issue: When multiple genes have the same value of Can be found if here e.g. dont do import scanpy as as
adata = sc.datasets.pbmc3k()
# sc.pp.normalize_total(adata, target_sum=10000)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor=flavor, n_top_genes=10000)
adata.var["highly_variable"].sum()
Which is due to many genes having the value selected for the ... scanpy/src/scanpy/preprocessing/_highly_variable_genes.py Lines 408 to 418 in b918a23
I tried to check how Seurat is proceeding in such a case, expecting to see how it breaks the ties. (data downloaded from here) library(dplyr)
library(Seurat)
library(patchwork)
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k")
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor=10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "mean.var.plot", nfeatures = 10000)
length(VariableFeatures(pbmc))
However, it turns out Seurat seems to restrict to the genes which are variable in the sense of passing the set mean threshold and normalized dispersion thresholds. These thresholds are ignored in scanpy if the number of genes is given. So not really an insight of how to break ties in this case. Would suggest to make a new issue, which the potential project on comparing the frameworks could address. |
Ah, with
I think when the n-th highest value is tied with others, returning more than the specified number makes total sense. As done by |
Please make sure these conditions are met
What happened?
HVG can produce more than the number of genes asked for as highly variable.
This occurs on these two datasets:
Minimal code sample
Error output
Versions
The text was updated successfully, but these errors were encountered: