Problems recognizing the BUSCO databases with nextflow BTK pipeline v0.6.0 #122

gitcruz · 2024-11-14T13:29:28Z

Description of the bug

Dear developers,

I downloaded and installed the pipeline v0.6.0.

As described in the usage documentation, I downloaded the full set of BUSCO v5 databases and untarred them. Because I kept hitting a BUSCO error, I then also decompressed refseq_db.faa.gz for every database. The error persists, however, and looks like this:

2024-11-14 12:32:41 ERROR: Unable to run BUSCO in offline mode. Dataset /scratch_tmp/32318106/nxf.LMCX5N46Un/lineages/lineages/viridiplantae_odb10 does not exist.
mv: cannot stat 'tnRamLact8_Nhpy_mq10-viridiplantae_odb10-busco//short_summary..json': No such file or directory
mv: cannot stat 'tnRamLact8_Nhpy_mq10-viridiplantae_odb10-busco//short_summary..txt': No such file or directory

Work dir:
/scratch_isilon/groups/assembly/data/projects/BGE/tnRamLact/assembly/curation/nextdenovo.hypo1.purged.yahs_mq10/1_blobtoolkit/blobtoolkit_nextflow/work/e3/910c9ec4ebbab08e742510c3a50ee8

I don't understand why it is not finding the BUSCO databases. All of them are stored here: /scratch_isilon/groups/assembly/data/databases/BUSCO_2024_11/v5/data/lineages/

This is my nextflow command:

nextflow run /software/assembly/pipelines/nf-core-pipelines/blobtoolkit_sanger-tol/blobtoolkit-0.6.0/main.nf \
  -c /software/assembly/pipelines/nf-core-pipelines/cluster_config/cnag_nextflow_queue.config \
  -profile singularity \
  --input tnRamLact8_samplesheet_s3.csv \
  --outdir out \
  --fasta tnRamLact8_Nhpy_mq10.fasta \
  --taxon 947578 \
  --align true \
  --taxdump /scratch_isilon/groups/assembly/data/databases/taxdump_2024_10_01 \
  --blastp /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd \
  --blastx /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd \
  --blastn /scratch_isilon/groups/assembly/data/databases/nt_2024_10_03 \
  --busco /scratch_isilon/groups/assembly/data/databases/BUSCO_2024_11/v5/data/lineages/ \
  --busco_lineages metazoa_odb10,viridiplantae_odb10,fungi_odb10,apicomplexa_odb10,euglenozoa_odb10,diptera_odb10,alphaproteobacteria_odb10,mycoplasmatales_odb10,proteobacteria_odb10,nematoda_odb10,rickettsiales_odb10

I am attaching the full log and the sbatch command so you can check them in full. I would really appreciate it if you could help me overcome this error and get the pipeline running.

Thanks.
btk_v0.6.0_nextflow.log
run_blobtoolkit_v060_on_tnRamLact8.sbatch.txt

muffato · 2024-11-14T21:52:24Z

Hi @gitcruz. Can you try with /scratch_isilon/groups/assembly/data/databases/BUSCO_2024_11/v5/data/, i.e. without the trailing lineages?

The way we run the pipeline, --busco gets the path to a directory that contains lineages, cf:

├── information
├── lineages
│   ├── acidobacteria_odb10
│   │   ├── hmms
│   │   └── info
│   ├── aconoidasida_odb10
│   │   ├── hmms
│   │   ├── info
│   │   └── prfl
│   (...)
│   ├── viridiplantae_odb10
│   │   ├── hmms
│   │   ├── info
│   │   └── prfl
│   └── xanthomonadales_odb10
│       ├── hmms
│       └── info
└── placement_files

In our copy, all the refseq_db.faa.gz files have already been decompressed (as you did). I should mention that in the documentation.
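
As a quick sanity check before launching, the layout above can be verified with a few lines of Python. This is only a sketch, not part of the pipeline: the function name check_busco_dir is hypothetical, and it assumes nothing beyond what this thread shows, namely that --busco must point to a directory containing a lineages subdirectory with one folder per requested lineage.

```python
from pathlib import Path


def check_busco_dir(busco_dir, lineages):
    """Return a list of problems found in a candidate --busco directory.

    Hypothetical helper: it only checks the layout described in this
    thread (<busco_dir>/lineages/<lineage_name>), nothing else.
    """
    root = Path(busco_dir)
    problems = []
    # Passing the lineages directory itself is the mistake reported here:
    # BUSCO then looks for .../lineages/lineages/<lineage_name>.
    if root.name == "lineages":
        problems.append(f"{root} ends in 'lineages'; pass its parent directory instead")
    for lineage in lineages:
        expected = root / "lineages" / lineage
        if not expected.is_dir():
            problems.append(f"missing {expected}")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_busco_dir.py <busco_dir> <lineage> [<lineage> ...]
    busco_dir, *names = sys.argv[1:]
    for problem in check_busco_dir(busco_dir, names):
        print(problem)
```

An empty result means every requested lineage was found under lineages; it says nothing about whether the datasets themselves are complete.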

Matthieu

gitcruz · 2024-11-15T12:23:07Z

Thanks for the quick response Matthieu,

I'm trying it that way. So far the Nextflow job has been running for more than 2 hours.

Regarding the database paths, is it necessary to add the final slash or not (i.e. --blastn /scratch_isilon/groups/assembly/data/databases/nt_2024_10_03/)?

Also, I don't understand the guide's examples for the DIAMOND databases. I have just one, while you show two:
--blastp /path/to/buscogenes.dmnd
--blastx /path/to/buscoregions.dmnd

I am using only one:
--blastp /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd
--blastx /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd

Is this correct? I followed the guide and built just one diamond db...

Regards,
Fernando

muffato · 2024-11-15T16:28:28Z

> Regarding the database paths, is it necessary to add the final slash or not (i.e. --blastn /scratch_isilon/groups/assembly/data/databases/nt_2024_10_03/)?

I think it should work the same with and without.

> Also, I don't understand the guide's examples for the DIAMOND databases. I have just one, while you show two: --blastp /path/to/buscogenes.dmnd --blastx /path/to/buscoregions.dmnd
>
> I am using only one: --blastp /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd --blastx /scratch_isilon/groups/assembly/data/databases/uniprot_2024_10_03/reference_proteomes.dmnd
>
> Is this correct? I followed the guide and built just one diamond db...

Yes, it's correct. They are separate parameters in case people want to use different databases. I could imagine someone optimising the pipeline by using a more restricted database for the blastp search (which happens first) so that the blastp jobs finish quicker, while using the complete database for the blastx search (which happens afterwards).
In practice, the way we run it on all our assembled genomes, we use the same, complete database for both.
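
For anyone scripting their launches, the two options can be captured in a small helper. This is a sketch only: diamond_args is a hypothetical name, the paths in the usage lines are placeholders, and the only thing taken from the pipeline is the pair of parameter names --blastp and --blastx.

```python
def diamond_args(blastp_db, blastx_db=None):
    """Build the --blastp/--blastx arguments for a nextflow command line.

    Hypothetical helper: with one path, the same DIAMOND database is used
    for both searches; a second path selects a different one for blastx.
    """
    if blastx_db is None:
        blastx_db = blastp_db
    return ["--blastp", blastp_db, "--blastx", blastx_db]


if __name__ == "__main__":
    # Same complete database for both searches.
    print(" ".join(diamond_args("/path/to/reference_proteomes.dmnd")))
    # Restricted database for blastp, complete database for blastx.
    print(" ".join(diamond_args("/path/to/restricted.dmnd",
                                "/path/to/reference_proteomes.dmnd")))
```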

Best,
Matthieu

gitcruz · 2024-11-18T14:57:23Z

Hi Matthieu,

It has worked well for three species: two vertebrates and one nemertean worm. I may have more questions, but I'll post them as separate comments or issues.

Thanks,
Fernando

gitcruz added the bug (Something isn't working) label Nov 14, 2024

gitcruz closed this as completed Nov 18, 2024

muffato mentioned this issue Nov 19, 2024

number of busco databases #115

Closed
