Small (reference) data for testing #104

Closed
bernt-matthias opened this issue Jun 14, 2024 · 9 comments

@bernt-matthias

Is there a small reference data set (and FASTA file) that could be used for testing?

Background: I'm thinking about creating a tool wrapper for Galaxy and those require tests.

@apcamargo
Owner

Do you think the Klebsiella pneumoniae genome that is used in the guide is small enough?

curl -LJO https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_009025895.1/download\?include_annotation_type\=GENOME_FASTA

@bernt-matthias
Author

The FASTA should be fine. I guess one could even use a subsequence of this genome to reduce the runtime and memory requirements of the test.

But I was wondering more about the reference data that you have on Zenodo (i.e., the data that is downloaded with genomad download-database).
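
The subsequence idea mentioned above could be sketched roughly as follows. This is only an illustration: the function name `fasta_prefix` and the file names are hypothetical, and it simply keeps the first N bases of the first record, which may or may not contain any virus or plasmid signal.

```shell
# Print the first record of a FASTA file, truncated to a given number of bases.
# $1 = FASTA file, $2 = number of bases to keep (hypothetical helper, not part of geNomad)
fasta_prefix() {
    awk -v n="$2" '
        /^>/ { if (seen++) exit; print; next }   # print the first header, stop at the second
        kept < n {
            line = substr($0, 1, n - kept)       # trim the line that crosses the limit
            kept += length(line)
            print line
        }
    ' "$1"
}

# Example call; the file names are assumptions:
# fasta_prefix GCF_009025895.1.fna 200000 > test_genome.fna
```

Whether a prefix of a given length still produces hits would need to be checked by running geNomad on it.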

@apcamargo
Owner

You could use mmseqs createsubdb to create a subset of the database.

Within the database directory, the mini_set_ids file contains the IDs of 42,098 markers (~20%) that comprise a "mini database" with the most informative markers. This can be used as input to mmseqs createsubdb.

You could create an even smaller database if you create one containing only the markers with hits in the test sequence.
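
A minimal sketch of the createsubdb call described above, wrapped in a small function. The function name `make_subdb` and the output path are hypothetical, and the sketch assumes the database prefix is genomad_db/genomad_db, as in the lookup file path mentioned later in the thread. Whether auxiliary files of the database need separate handling was not spelled out here.

```shell
# Build a reduced marker database from a list of internal IDs.
# $1 = ID list (e.g. genomad_db/mini_set_ids)
# $2 = source database prefix, $3 = output database prefix
make_subdb() {
    local ids="$1" src="$2" out="$3"
    # Refuse to run with a missing or empty ID list
    [ -s "$ids" ] || { echo "ID list '$ids' is missing or empty" >&2; return 1; }
    mkdir -p "$(dirname "$out")"
    mmseqs createsubdb "$ids" "$src" "$out"
}

# Example call; the output directory name is an assumption:
# make_subdb genomad_db/mini_set_ids genomad_db/genomad_db genomad_db_mini/genomad_db
```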

@bernt-matthias
Author

> You could use mmseqs createsubdb to create a subset of the database.

Wonderful.

> You could create an even smaller database if you create one containing only the markers with hits in the test sequence.

Could you tell me where in the output I can find the IDs of the markers for the ids file?

@apcamargo
Owner

genomad_db/genomad_db.lookup has the ID → marker accession mappings (first and second columns, respectively):

0	GENOMAD.070201.VV	0
1	GENOMAD.179093.PC	0
2	GENOMAD.152930.VV	0
3	GENOMAD.102389.VV	0
4	GENOMAD.094353.VV	0

To get a list of the accessions of the markers with hits in the test genome:

awk -v FS="\t" 'NR>1 && $9!="NA" {print $9}' genomad_output/GCF_009025895.1_annotate/GCF_009025895.1_genes.tsv | sort -u

After you create the sub-database, the matches are not guaranteed to be the same, because the database size will change significantly. It should still work for testing purposes.
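
The two pieces above (the lookup file and the accession list) can be combined to produce the ID list. The following is a sketch under assumptions: the function name `accessions_to_ids` and the file names hit_accessions.txt and hit_ids.txt are hypothetical, and it relies only on the column layout shown above (ID in column 1, accession in column 2).

```shell
# Translate marker accessions into the internal IDs listed in the lookup file.
# $1 = file with one accession per line (e.g. the output of the awk | sort -u command)
# $2 = lookup file (genomad_db/genomad_db.lookup): ID <tab> accession <tab> ...
accessions_to_ids() {
    awk -F'\t' '
        NR == FNR { wanted[$1]; next }     # first file: remember each accession
        ($2 in wanted) { print $1 }        # lookup file: emit the ID of each wanted marker
    ' "$1" "$2"
}

# Example call; hit_accessions.txt and hit_ids.txt are hypothetical file names:
# accessions_to_ids hit_accessions.txt genomad_db/genomad_db.lookup > hit_ids.txt
```

The resulting file of IDs is the kind of input that mmseqs createsubdb takes as its subset list.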

@bernt-matthias
Author

Excellent. I got it down to 23 MB (as tar.gz), which is still too large for our repo, but it will help a lot anyway.

I could put this on Zenodo, or would you be interested in hosting it under your account?

@apcamargo
Owner

Great!

One thing you can do to reduce the size of the database a bit and make the test faster is to reduce the search sensitivity in geNomad (setting -s 1, for example). This will lead to fewer markers with hits, and the runtime will be shorter.

I think it's best if you upload it yourself, since you'll be using it. But please share the link once it's up!
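
For illustration, the reduced-sensitivity run suggested above might look like the sketch below. The wrapper name `annotate_low_sensitivity` and all file and directory names are hypothetical. The positional order (input, output directory, database directory) is an assumption based on the geNomad command line and should be checked against `genomad annotate --help`.

```shell
# Run the annotation step with a reduced marker search sensitivity.
# $1 = input FASTA, $2 = output directory, $3 = database directory
annotate_low_sensitivity() {
    # -s 1 is the example value from the comment above
    genomad annotate -s 1 "$1" "$2" "$3"
}

# Example call; file and directory names are assumptions:
# annotate_low_sensitivity GCF_009025895.1.fna genomad_output genomad_db
```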

@bernt-matthias
Author

Thanks for the help. Here is the link: https://zenodo.org/records/11945948

Galaxy tool wrappers should be finished soon as well: Helmholtz-UFZ/galaxy-tools#29

@apcamargo
Owner

Awesome! Thanks!
