Support draft assemblies #97

Merged
merged 33 commits into from
Aug 24, 2024

Conversation

muffato
Member

@muffato muffato commented May 9, 2024

On this branch, there is no input Yaml file. The only mandatory parameters are:

  • Species name / taxon_id (--taxon)
  • Assembly (--fasta)
  • Sample sheet (--input) to list the read files

--accession is optional and is used to pull assembly information from ENA into the blobDir's meta.json.

I haven't restructured the pipeline much. All the blobtools commands at the end still require a yaml file. My solution is to add a script at the beginning of the pipeline that generates the minimal yaml file required (as per #77 (comment)). It still lets the input-check sub-workflow collect some of the parameters cleanly, and makes the busco sub-workflow more focused on running busco + blastp.

Busco lineages are inferred from the taxonomy directly here. Like in the genome-note pipeline, I've moved away from using GoaT as GoaT is just a proxy to the NCBI taxonomy. This way, I can keep control of both the version of Busco and the list of lineages in the same place.
I've also introduced the --busco_lineages parameter to allow precisely selecting the lineages that are used, rather than the taxonomy-based defaults.
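
To illustrate the taxonomy-based defaults and the --busco_lineages override, here is a minimal Python sketch. It is not the pipeline's code: the function name, the ancestor-to-lineage table, and the comma-separated format of the override are all assumptions made for this example.

```python
# Illustrative table only (assumption): ancestor taxon_id -> BUSCO lineage.
# The table actually used by the pipeline is not shown in this thread.
ANCESTOR_TO_LINEAGE = {
    40674: "mammalia_odb10",
    7742: "vertebrata_odb10",
    33208: "metazoa_odb10",
    2759: "eukaryota_odb10",
    2: "bacteria_odb10",
    2157: "archaea_odb10",
}


def select_busco_lineages(lineage_taxon_ids, busco_lineages=None):
    """Return the BUSCO lineages to run.

    lineage_taxon_ids: taxon_ids ordered from the species towards the root.
    busco_lineages: optional explicit list (assumed comma-separated here);
    when given, it replaces the taxonomy-based defaults entirely.
    """
    if busco_lineages:
        return [name.strip() for name in busco_lineages.split(",") if name.strip()]
    # Keep the order of the lineage, so the most specific match comes first.
    return [
        ANCESTOR_TO_LINEAGE[taxon_id]
        for taxon_id in lineage_taxon_ids
        if taxon_id in ANCESTOR_TO_LINEAGE
    ]
```

Keeping the table next to the code that consumes it is the point made above: the Busco version and the list of lineages can then be updated together.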

Still a draft for now as I want to review /nfs/team135/yy5/btk_config/taxonomiser_v2.py and maybe incorporate some elements of it.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@muffato muffato self-assigned this May 9, 2024

github-actions bot commented May 9, 2024

nf-core lint overall result: Passed ✅

Posted for pipeline commit 8c70c77

✅ 134 tests passed
❔ 24 tests were ignored

❔ Tests ignored:

  • files_exist - File is ignored: CODE_OF_CONDUCT.md
  • files_exist - File is ignored: assets/nf-core-blobtoolkit_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-blobtoolkit_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-blobtoolkit_logo_dark.png
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/config.yml
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: conf/igenomes.config
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • files_unchanged - File ignored due to lint config: CODE_OF_CONDUCT.md
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/config.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: assets/nf-core-blobtoolkit_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-blobtoolkit_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-blobtoolkit_logo_dark.png
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/blobtoolkit/blobtoolkit/.github/workflows/awstest.yml
  • template_strings - template_strings
  • merge_markers - merge_markers

✅ Tests passed:

Run details

  • nf-core/tools version 2.11
  • Run at 2024-08-24 10:19:22

@muffato muffato mentioned this pull request May 18, 2024
10 tasks

Python linting (black) is failing

To keep the code consistent with lots of contributors, we run automated code consistency checks.
To fix this CI test, please run:

  • Install black: pip install black
  • Fix formatting errors in your pipeline: black .

Once you push these changes the test should pass, and you can hide this comment 👍

We highly recommend setting up Black in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!

Thanks again for your contribution!

@muffato muffato changed the base branch from dev to clean_params May 20, 2024 09:42
@muffato muffato force-pushed the draft_assemblies branch 2 times, most recently from a049ba9 to 0e6fa8e Compare May 20, 2024 09:57
@muffato muffato requested review from eeaunin and DLBPointon May 20, 2024 10:00
@muffato muffato force-pushed the draft_assemblies branch from 0e6fa8e to 80d86de Compare May 21, 2024 09:16
Base automatically changed from clean_params to dev May 23, 2024 15:02
@muffato muffato force-pushed the draft_assemblies branch from 80d86de to 4b7b3b2 Compare May 23, 2024 15:03
Member Author

muffato commented May 24, 2024

I've added some code to achieve the goal of taxonomiser_v2.py, which is to find the taxon_id that is recognised by the NT database and is closest to the species of interest.
It's implemented very differently from the script. I leverage the taxonomy4blast.sqlite3 database that is shipped with NT and essentially lists the taxon_ids it knows about. If the species' taxon_id is not recognised, the code tries its parent, then the parent's parent, and so on until it finds one that is.
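
A minimal Python sketch of that lookup, for readers of this thread. It assumes the lineage is already available as a list ordered from the species up to the root, and it assumes a table named TaxidInfo with a taxid column; check the schema of the real taxonomy4blast.sqlite3 file before relying on either name. The in-memory database and the parent IDs 111 and 222 are placeholders, not real taxonomy.

```python
import sqlite3

# Assumed table and column names; verify with `.schema` in the sqlite3 shell.
QUERY = "SELECT 1 FROM TaxidInfo WHERE taxid = ? LIMIT 1"


def closest_known_taxon_id(lineage_taxon_ids, connection):
    """Return the first taxon_id of the lineage that the database knows.

    lineage_taxon_ids must be ordered from the species towards the root,
    so the first hit is the closest recognised taxon. Returns None when
    nothing in the lineage is recognised.
    """
    for taxon_id in lineage_taxon_ids:
        if connection.execute(QUERY, (taxon_id,)).fetchone() is not None:
            return taxon_id
    return None


def demo_database():
    """Build a toy in-memory stand-in for taxonomy4blast.sqlite3."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE TaxidInfo (taxid INTEGER PRIMARY KEY, parent INTEGER)"
    )
    # Placeholder rows: only the two ancestors are "known" to the database.
    connection.executemany(
        "INSERT INTO TaxidInfo VALUES (?, ?)", [(111, 222), (222, 1)]
    )
    return connection
```

With the toy database, a lineage of [352914, 111, 222] resolves to 111: the first ID is not recognised, so the lookup moves one level up.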

As far as I understand the requirements, this is the last bit that was missing to complete support for draft assemblies. I'll mark this pull-request as ready.

@muffato muffato marked this pull request as ready for review May 24, 2024 10:48
@muffato muffato added the enhancement Improvement of the existing features label May 24, 2024
@muffato muffato linked an issue Jun 1, 2024 that may be closed by this pull request
@muffato muffato added the user request Requests made by users and public label Jun 20, 2024
@muffato muffato force-pushed the draft_assemblies branch from 413a84d to 424cdf7 Compare July 10, 2024 17:47
Member Author

muffato commented Jul 10, 2024

@eeaunin, I've rebased this branch. It now includes the fixes I've made for blast.


eeaunin commented Aug 5, 2024

I had a closer look at how -negative_taxids has been implemented in the Snakemake pipeline and it appears quite confusing. The BlobToolKit paper (https://academic.oup.com/g3journal/article/10/4/1361/6026202) says:

An optional filter excludes a configurable list of NCBI taxIDs (default: excludes query genus).

So the exclusion of taxids is supposed to be optional and configurable by the user.
BlobToolKit pipeline v1 has the mask_ids setting for excluding taxids:

https://github.com/blobtoolkit/pipeline/blob/master/v1/example.yaml

However, I couldn't find a setting for the same thing in the Snakemake pipeline v2 code. Maybe the authors just forgot to include it?

In my runs with the Snakemake pipeline negative taxids were not used but there are suppressed error messages buried in the run logs relating to that. In a run with a Plasmodium yoelii yoelii assembly there is this error in the logs (/lustre/scratch123/tol/teams/tola/users/ea10/pipeline_testing/20230215_pyoelii_asg_cobiont_check_run/btk_busco/blastn/logs/pyoelii/run_blastn.log):

BLAST Database error: Taxonomy ID(s) not found.Taxonomy ID(s) not found. This could be because the ID(s) provided are not at or below the species level. Please use get_species_taxids.sh to get taxids for nodes higher than species (see https://www.ncbi.nlm.nih.gov/books/NBK546209/).
Restarting blastn without taxid filter

So it ran into the error but then just quietly continued running. It is unclear to me what caused this error, as the taxid used there (352914) is at strain level.

In another run it has skipped using the taxid filter due to another error: /lustre/scratch123/tol/teams/grit/contamination_screen/icMagCera1/20240712_icMagCera1.20240711.hap1.fa_asg_cobiont_check_run/btk_busco/blastn/logs/icMagCera1.20240711.hap1.fa/run_blastn.log

BLAST Database error: Taxonomy filtering is not supported in v4 BLAST dbs
Restarting blastn without taxid filter

So the filtering doesn't work if the supplied database is V4 instead of V5 but this also doesn't crash the Snakemake pipeline and just produces an error message in the logs.
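
The fallback seen in those logs can be sketched in Python as follows. This is an approximation reconstructed from the log messages, not the Snakemake pipeline's actual code; the function names and the injectable runner are hypothetical, and the sketch assumes blastn exits with a non-zero status when it reports a database error.

```python
import subprocess


def run_blastn(command):
    """Run a command and return (exit status, stderr text)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.stderr


def blastn_with_fallback(base_command, filter_args, runner=run_blastn):
    """Try blastn with the taxid filter, retry without it on a database error.

    base_command: blastn command as a list, without any taxid filter.
    filter_args: e.g. ["-negative_taxids", "352914"], or [] for no filter.
    Returns (exit status, filter_applied) so the caller can see the fallback
    instead of having to dig through the logs.
    """
    returncode, stderr = runner(base_command + filter_args)
    if filter_args and returncode != 0 and "BLAST Database error" in stderr:
        print("Restarting blastn without taxid filter")
        returncode, stderr = runner(base_command)
        return returncode, False
    return returncode, bool(filter_args)
```

Returning the filter_applied flag is a design choice for this sketch: it makes the silent downgrade visible to whatever calls the function.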

I guess it would be okay if the sanger-tol/blobtoolkit pipeline used -negative_taxids in all runs with draft assemblies as long as this doesn't produce frequent crashes. But I think it would be better if the use of -negative_taxids was optional for draft assemblies.

The filter in SEQTK_SUBSEQ is not sufficient because some BLOBTOOLKIT_CHUNK further excludes masked regions
Skip blastn if there are no chunks
…NA taxon_ids

NCBI is still the first database we query
Member Author

muffato commented Aug 22, 2024

@eeaunin, I've added a --skip_taxon_filtering flag for you. It removes the taxon filtering from all Blast searches.

I've rebased the branch onto the latest stable release 0.5.1
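
For illustration, the effect of the flag on a blastn command line can be pictured with this small Python sketch. The helper name is hypothetical and this is not how the Nextflow pipeline is written; only the -negative_taxids option and the meaning of --skip_taxon_filtering come from this thread.

```python
def taxon_filter_args(excluded_taxon_ids, skip_taxon_filtering=False):
    """Return the extra blastn arguments for taxon filtering.

    With skip_taxon_filtering set (or nothing to exclude), no filter is
    added at all, so blastn searches the whole database.
    """
    if skip_taxon_filtering or not excluded_taxon_ids:
        return []
    taxon_list = ",".join(str(taxon_id) for taxon_id in excluded_taxon_ids)
    return ["-negative_taxids", taxon_list]
```

For example, taxon_filter_args([352914]) yields the filter arguments, while taxon_filter_args([352914], skip_taxon_filtering=True) yields an empty list.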


eeaunin commented Aug 22, 2024

That's good then! I think it's fine to merge the draft_assemblies branch to dev now

@muffato muffato merged commit 18d2daf into dev Aug 24, 2024
6 checks passed
@muffato muffato deleted the draft_assemblies branch August 24, 2024 12:02
@muffato muffato mentioned this pull request Sep 11, 2024
10 tasks
Labels
enhancement Improvement of the existing features user request Requests made by users and public
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Improved generation of the summary Yaml file
2 participants