Small (reference) data for testing #104

bernt-matthias · 2024-06-14T13:23:59Z

Is there a small reference data set (and FASTA) that could be used for testing?

Background: I'm thinking about creating a tool wrapper for Galaxy and those require tests.

apcamargo · 2024-06-14T18:37:17Z

Do you think the Klebsiella pneumoniae genome used in the guide is small enough?

curl -LJO https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_009025895.1/download\?include_annotation_type\=GENOME_FASTA

bernt-matthias · 2024-06-14T18:41:06Z

The fasta should be fine. I guess one could even use a subsequence of this genome to reduce runtime and memory requirements of the test.
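A minimal sketch of the subsequence idea, assuming it is acceptable to keep only the first record of the FASTA and truncate it to a fixed number of bases. The file names and the `MAX_BP` value are hypothetical, and a toy FASTA stands in for the real genome:

```shell
# Toy FASTA standing in for the downloaded genome (hypothetical file name).
printf '>contig_1 toy\nACGTACGTAC\nGGTTAACCGG\n>contig_2 toy\nTTTTCCCC\n' > toy_genome.fna

# Keep only the first record, truncated to the first MAX_BP bases.
MAX_BP=12
awk -v max="$MAX_BP" '
    /^>/ { if (++n > 1) exit; print; next }   # header line: stop at the second record
    {
        if (kept >= max) exit                 # enough sequence collected
        line = substr($0, 1, max - kept)      # trim the line that crosses the limit
        kept += length(line)
        print line
    }
' toy_genome.fna > toy_subset.fna
```

For a real test, `MAX_BP` would need to be large enough that the subsequence still contains genes with marker hits.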

But I was wondering more about the reference data that you have on Zenodo (i.e., the data that is downloaded with genomad download-database).

apcamargo · 2024-06-14T23:18:33Z

You could use mmseqs createsubdb to create a subset of the database.

Within the database directory, the mini_set_ids file contains the IDs of 42,098 markers (~20%) that comprise a "mini database" with the most informative markers. This can be used as input to mmseqs createsubdb.

You could create an even smaller database containing only the markers with hits in the test sequence.

bernt-matthias · 2024-06-15T08:01:39Z

> You could use mmseqs createsubdb to create a subset of the database.

Wonderful.

> You could create an even smaller database containing only the markers with hits in the test sequence.

Could you tell me where in the output I can find the IDs of the markers for the ids file?

apcamargo · 2024-06-15T22:28:38Z

genomad_db/genomad_db.lookup has the ID → marker accession mappings (first and second columns, respectively):

0	GENOMAD.070201.VV	0
1	GENOMAD.179093.PC	0
2	GENOMAD.152930.VV	0
3	GENOMAD.102389.VV	0
4	GENOMAD.094353.VV	0

To get a list of the accessions of markers with hits in the test genome:

awk -v FS="\t" 'NR>1 && $9!="NA" {print $9}' genomad_output/GCF_009025895.1_annotate/GCF_009025895.1_genes.tsv | sort -u

After you create the sub-database it's not guaranteed that the matches will be the same, as the database size will change significantly. It should work for test purposes anyway.
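The two pieces above (the lookup file and the accession list) can be combined to produce the IDs file. The following is a rough sketch, not a tested recipe: the file names `toy.lookup`, `hit_accessions.txt`, and `hit_ids.txt` are hypothetical, and toy data stands in for genomad_db/genomad_db.lookup and for the output of the awk command above.

```shell
# Toy stand-in for genomad_db/genomad_db.lookup (ID, accession, third column).
printf '0\tGENOMAD.070201.VV\t0\n1\tGENOMAD.179093.PC\t0\n2\tGENOMAD.152930.VV\t0\n3\tGENOMAD.102389.VV\t0\n4\tGENOMAD.094353.VV\t0\n' > toy.lookup

# Toy stand-in for the sorted list of accessions with hits in the test genome.
printf 'GENOMAD.094353.VV\nGENOMAD.179093.PC\n' > hit_accessions.txt

# First file: remember every accession with a hit.
# Second file: print the ID (column 1) of each lookup row whose accession (column 2) was seen.
awk -F'\t' 'NR==FNR { hits[$1]; next } ($2 in hits) { print $1 }' \
    hit_accessions.txt toy.lookup > hit_ids.txt
```

With the real files, `hit_ids.txt` would then be the first argument to `mmseqs createsubdb`, followed by the source database (genomad_db/genomad_db) and an output database path of your choosing. Whether the other files in the database directory also need to be copied alongside the new sub-database is an assumption worth checking.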

bernt-matthias · 2024-06-17T12:11:19Z

Excellent. Got it down to 23 MB (as tar.gz), which is still too large for our repo, but it will help a lot anyway.

I would put this on Zenodo, or would you prefer to do it with your account?

apcamargo · 2024-06-17T19:23:27Z

Great!

One thing you can do to reduce the size of the database a bit and make the test faster is to reduce the search sensitivity in geNomad (setting -s 1, for example). This will lead to fewer markers with hits and a shorter runtime.

I think it's best if you upload it yourself, since you'll be using it. But please share the link once it's up!

bernt-matthias · 2024-06-17T21:17:55Z

Thanks for the help. Here is the link: https://zenodo.org/records/11945948

Galaxy tool wrappers should be finished soon as well: Helmholtz-UFZ/galaxy-tools#29

apcamargo · 2024-06-17T21:37:35Z

Awesome! Thanks!

bernt-matthias closed this as completed Jun 17, 2024

jfy133 mentioned this issue Jul 18, 2024

Config restructure nf-core/mag#621

