Merge pull request #268 from nf-core/dev
Release - v1.1.0 - British Beans on Toast
jasmezz authored Apr 27, 2023
2 parents 98a0815 + af57abe commit 1c1c9ae
Showing 55 changed files with 2,450 additions and 480 deletions.
53 changes: 27 additions & 26 deletions .github/workflows/ci.yml
@@ -29,7 +29,7 @@ jobs:
        parameters:
          - "--annotation_tool prodigal"
          - "--annotation_tool prokka"
          ## Warning: we can't test Bakta as it uses more memory than available on GHA CIs
          - "--annotation_tool bakta --annotation_bakta_db_downloadtype light"

    steps:
      - name: Check out pipeline code
@@ -57,6 +57,7 @@ jobs:
        parameters:
          - "--annotation_tool prodigal"
          - "--annotation_tool prokka"
          - "--annotation_tool bakta --annotation_bakta_db_downloadtype light"

    steps:
      - name: Check out pipeline code
@@ -71,31 +72,31 @@ jobs:
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_bgc,docker --outdir ./results ${{ matrix.parameters }}
  ## DEACTIVATE CURRENTLY DUE TO EXTENDED DATABASE SERVER FAILURE
  ## CAN REACTIVATE ONCE WORKING AGAIN
  # test_deeparg:
  #   name: Run pipeline with test data (DeepARG only workflow)
  #   # Only run on push if this is the nf-core dev branch (merged PRs)
  #   if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/funcscan') }}"
  #   runs-on: ubuntu-latest
  #   strategy:
  #     matrix:
  #       NXF_VER:
  #         - "22.10.1"
  #         - "latest-everything"
  #       parameters:
  #         - "--annotation_tool prodigal"
  #         - "--annotation_tool prokka"
  test_deeparg:
    name: Run pipeline with test data (DeepARG only workflow)
    # Only run on push if this is the nf-core dev branch (merged PRs)
    if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/funcscan') }}"
    runs-on: ubuntu-latest
    strategy:
      matrix:
        NXF_VER:
          - "22.10.1"
          - "latest-everything"
        parameters:
          - "--annotation_tool bakta --annotation_bakta_db_downloadtype light"
          - "--annotation_tool prodigal"
          - "--annotation_tool prokka"
          - "--annotation_tool pyrodigal"

  #   steps:
  #     - name: Check out pipeline code
  #       uses: actions/checkout@v2
    steps:
      - name: Check out pipeline code
        uses: actions/checkout@v2

  #     - name: Install Nextflow
  #       uses: nf-core/setup-nextflow@v1
  #       with:
  #         version: "${{ matrix.NXF_VER }}"
      - name: Install Nextflow
        uses: nf-core/setup-nextflow@v1
        with:
          version: "${{ matrix.NXF_VER }}"

  #     - name: Run pipeline with test data (DeepARG workflow)
  #       run: |
  #         nextflow run ${GITHUB_WORKSPACE} -profile test_deeparg,docker --outdir ./results ${{ matrix.parameters }}
      - name: Run pipeline with test data (DeepARG workflow)
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test_deeparg,docker --outdir ./results ${{ matrix.parameters }}
33 changes: 33 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,39 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.1.0 - British Beans on Toast - [2023-04-26]

### `Added`

- [#238](https://github.com/nf-core/funcscan/pull/238) Added dedicated DRAMP database downloading step for AMPcombi to prevent parallel downloads when no database provided by user. (by @jfy133)
- [#235](https://github.com/nf-core/funcscan/pull/235) Added parameter `annotation_bakta_db_downloadtype` to be able to switch between downloading either full (33.1 GB) or light (1.3 GB excluding UPS, IPS, PSC, see parameter description) versions of the Bakta database. (by @jasmezz)
- [#249](https://github.com/nf-core/funcscan/pull/249) Added bakta annotation to CI tests. (by @jasmezz)
- [#251](https://github.com/nf-core/funcscan/pull/251) Added annotation tool: Pyrodigal. (by @jasmezz)
- [#252](https://github.com/nf-core/funcscan/pull/252) Added a new parameter `--arg_rgi_savejson` that saves the file `<samplename>.json` in the RGI directory. The default output for RGI is now only `<samplename>.txt`. (by @darcy220606)
- [#253](https://github.com/nf-core/funcscan/pull/253) Updated Prodigal to have compressed output files. (by @jasmezz)
- [#262](https://github.com/nf-core/funcscan/pull/262) Added comBGC function to screen whole directory of antiSMASH output (one subfolder per sample). (by @jasmezz)
- [#263](https://github.com/nf-core/funcscan/pull/263) Removed `AMPlify` from test_full.config. (by @jasmezz)
- [#266](https://github.com/nf-core/funcscan/pull/266) Updated README.md with Pyrodigal. (by @jasmezz)

### `Fixed`

- [#243](https://github.com/nf-core/funcscan/pull/243) Compress the ampcombi_complete_summary.csv in the output directory. (by @louperelo)
- [#237](https://github.com/nf-core/funcscan/pull/237) Reactivate DeepARG automatic database downloading and CI tests as server is now back up. (by @jfy133)
- [#235](https://github.com/nf-core/funcscan/pull/235) Improved annotation speed by switching off pipeline-irrelevant Bakta annotation steps by default. (by @jasmezz)
- [#235](https://github.com/nf-core/funcscan/pull/235) Renamed parameter `annotation_bakta_db` to `annotation_bakta_db_localpath`. (by @jasmezz)
- [#242](https://github.com/nf-core/funcscan/pull/242) Fixed MACREL '.faa' issue that was generated when it was run on its own and upgraded MACREL from version `1.1.0` to `1.2.0` (by @Darcy220606)
- [#248](https://github.com/nf-core/funcscan/pull/248) Applied best-practice `error("message")` to all (sub)workflow files. (by @jasmezz)
- [#254](https://github.com/nf-core/funcscan/pull/254) Further resource optimisation based on feedback from 'real world' datasets. (ongoing, reported by @alexhbnr and @Darcy220606, fix by @jfy133)
- [#266](https://github.com/nf-core/funcscan/pull/266) Fixed wrong process name in base.config. (reported by @Darcy220606, fix by @jasmezz)

### `Dependencies`

| Tool | Previous version | New version |
| ----- | ---------------- | ----------- |
| Bakta | 1.6.1 | 1.7.0 |

### `Deprecated`

## v1.0.1 - [2023-02-27]

### `Added`
12 changes: 8 additions & 4 deletions CITATIONS.md
@@ -12,7 +12,7 @@

- [ABRicate](https://github.com/tseemann/abricate)

> Seemann T. (2020). ABRicate. Github [https://github.com/tseemann/abricate](https://github.com/tseemann/abricate).
> Seemann, T. (2020). ABRicate. Github [https://github.com/tseemann/abricate](https://github.com/tseemann/abricate).
- [AMPir](https://doi.org/10.1093/bioinformatics/btaa653)

@@ -48,15 +48,15 @@
- [GECCO](https://gecco.embl.de)

> Carroll, L.M. , Larralde, M., Fleck, J. S., Ponnudurai, R., Milanese, A., Cappio Barazzone, E. & Zeller, G. (2021). Accurate de novo identification of biosynthetic gene clusters with GECCO. bioRxiv [DOI: 10.1101/2021.05.03.442509](https://doi.org/10.1101/2021.05.03.442509)
> Carroll, L. M., Larralde, M., Fleck, J. S., Ponnudurai, R., Milanese, A., Cappio Barazzone, E. & Zeller, G. (2021). Accurate de novo identification of biosynthetic gene clusters with GECCO. bioRxiv [DOI: 10.1101/2021.05.03.442509](https://doi.org/10.1101/2021.05.03.442509)
- [hAMRonization](https://github.com/pha4ge/hAMRonization)

> Public Health Alliance for Genomic Epidemiology (pha4ge). (2022). Parse multiple Antimicrobial Resistance Analysis Reports into a common data structure. Github. Retrieved October 5, 2022, from [https://github.com/pha4ge/hAMRonization](https://github.com/pha4ge/hAMRonization)
- [AMPcombi](https://github.com/Darcy220606/AMPcombi)

> Anan Ibrahim, & Louisa Perelo. (2023). Darcy220606/AMPcombi. [DOI: 10.5281/zenodo.7639121](https://doi.org/10.5281/zenodo.7639121).
> Ibrahim, A. & Perelo, L. (2023). Darcy220606/AMPcombi. [DOI: 10.5281/zenodo.7639121](https://doi.org/10.5281/zenodo.7639121).
- [HMMER](https://doi.org/10.1371/journal.pcbi.1002195.)

@@ -72,7 +72,11 @@
- [PROKKA](https://doi.org/10.1093/bioinformatics/btu153)

> Seemann T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics (Oxford, England), 30(14), 2068–2069. [DOI: 10.1093/bioinformatics/btu153](https://doi.org/10.1093/bioinformatics/btu153)
> Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics (Oxford, England), 30(14), 2068–2069. [DOI: 10.1093/bioinformatics/btu153](https://doi.org/10.1093/bioinformatics/btu153)
- [Pyrodigal](https://doi.org/10.21105/joss.04296)

> Larralde, M. (2022). Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. Journal of Open Source Software, 7(72), 4296. [DOI: 10.21105/joss.04296](https://doi.org/10.21105/joss.04296)
- [RGI](https://doi.org/10.1093/nar/gkz935)

2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) Jasmin Frangenberg, Anan Ibrahim, James A. Fellows Yates
Copyright (c) Jasmin Frangenberg, Anan Ibrahim, Louisa Perelo, Moritz E. Beber, James A. Fellows Yates

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ The nf-core/funcscan AWS full test dataset are contigs generated by the MGnify s

## Pipeline summary

1. Annotation of assembled prokaryotic contigs with [`Prodigal`](https://github.com/hyattpd/Prodigal), [`Prokka`](https://github.com/tseemann/prokka), or [`Bakta`](https://github.com/oschwengers/bakta)
1. Annotation of assembled prokaryotic contigs with [`Prodigal`](https://github.com/hyattpd/Prodigal), [`Pyrodigal`](https://github.com/althonos/pyrodigal), [`Prokka`](https://github.com/tseemann/prokka), or [`Bakta`](https://github.com/oschwengers/bakta)
2. Screening contigs for antimicrobial peptide-like sequences with [`ampir`](https://cran.r-project.org/web/packages/ampir/index.html), [`Macrel`](https://github.com/BigDataBiology/macrel), [`HMMER`](http://hmmer.org/), [`AMPlify`](https://github.com/bcgsc/AMPlify)
3. Screening contigs for antibiotic resistant gene-like sequences with [`ABRicate`](https://github.com/tseemann/abricate), [`AMRFinderPlus`](https://github.com/ncbi/amr), [`fARGene`](https://github.com/fannyhb/fargene), [`RGI`](https://card.mcmaster.ca/analyze/rgi), [`DeepARG`](https://bench.cs.vt.edu/deeparg)
4. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/)
78 changes: 78 additions & 0 deletions bin/ampcombi_download.py
@@ -0,0 +1,78 @@
#!/usr/bin/env python3

#########################################
# Authors: [Anan Ibrahim](https://github.com/Darcy220606), [Louisa Perelo](https://github.com/louperelo)
# File: amp_database.py
# Source: https://github.com/Darcy220606/AMPcombi/blob/main/ampcombi/amp_database.py
# Source+commit: https://github.com/Darcy220606/AMPcombi/commit/a75bc00c32ecf873a133b18cf01f172ad9cf0d2d/ampcombi/amp_database.py
# Download Date: 2023-03-08, commit: a75bc00c
# This source code is licensed under the MIT license
#########################################

# TITLE: Download the DRAMP database if the input db is empty AND make the database compatible for DIAMOND

import pandas as pd
import requests
import os
from datetime import datetime
import subprocess
from Bio import SeqIO
import tempfile
import shutil


########################################
# FUNCTION: DOWNLOAD DRAMP DATABASE AND CLEAN IT
#########################################
def download_DRAMP(db):
    ## Download the (table) file and store it in a results directory
    url = "http://dramp.cpu-bioinfor.org/downloads/download.php?filename=download_data/DRAMP3.0_new/general_amps.xlsx"
    r = requests.get(url, allow_redirects=True)
    with open(db + "/" + "general_amps.xlsx", "wb") as f:
        f.write(r.content)
    ## Convert the Excel table to a tab-separated file in the DRAMP_db directory, stamped with the download date
    date = datetime.now().strftime("%Y_%m_%d")
    ref_amps = pd.read_excel(db + "/" + r"general_amps.xlsx")
    ref_amps.to_csv(db + "/" + f"general_amps_{date}.tsv", index=None, header=True, sep="\t")
    ## Download the (fasta) file and store it in a results directory
    urlfasta = (
        "http://dramp.cpu-bioinfor.org/downloads/download.php?filename=download_data/DRAMP3.0_new/general_amps.fasta"
    )
    z = requests.get(urlfasta)
    fasta_path = os.path.join(db + "/" + f"general_amps_{date}.fasta")
    with open(fasta_path, "wb") as f:
        f.write(z.content)
    ## Cleaning step to remove ambiguous amino acids from sequences in the database (e.g. zeros and brackets)
    new_fasta = db + "/" + f"general_amps_{date}_clean.fasta"
    seq_record = SeqIO.parse(open(fasta_path), "fasta")
    with open(new_fasta, "w") as f:
        for record in seq_record:
            id, sequence = record.id, str(record.seq)
            letters = [
                "A", "C", "D", "E", "F",
                "G", "H", "I", "K", "L",
                "M", "N", "P", "Q", "R",
                "S", "T", "V", "W", "Y",
            ]
            new = "".join(i for i in sequence if i in letters)
            f.write(">" + id + "\n" + new + "\n")
    return os.remove(fasta_path), os.remove(db + "/" + r"general_amps.xlsx")


download_DRAMP("amp_ref_database")
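The cleaning loop above keeps only the 20 canonical amino-acid letters and drops everything else (digits, brackets, ambiguity codes). A minimal standalone sketch of that filtering step — the function name here is illustrative, not part of AMPcombi:

```python
# Sketch of the DRAMP cleaning step: strip any character that is not one
# of the 20 canonical amino-acid letters from a peptide sequence.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")

def clean_sequence(sequence: str) -> str:
    """Return the sequence with ambiguous symbols (digits, brackets, X/B/Z, ...) removed."""
    return "".join(ch for ch in sequence if ch in CANONICAL)

print(clean_sequence("MKLV0(X)ACDB"))  # → MKLVACD
```

Using a set gives O(1) membership tests, which matters when cleaning the full database rather than a single record.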
64 changes: 56 additions & 8 deletions bin/comBGC.py
@@ -32,7 +32,7 @@
SOFTWARE.
"""

tool_version = "0.5"
tool_version = "0.6.0"
welcome = """\
........................
* comBGC v.{version} *
@@ -61,7 +61,9 @@
these can be:
- antiSMASH: <sample name>.gbk and (optional) knownclusterblast/ directory
- DeepBGC: <sample name>.bgc.tsv
- GECCO: <sample name>.clusters.tsv""",
- GECCO: <sample name>.clusters.tsv
Note: Please provide files from a single sample only. If you would like to
summarize multiple samples, please see the --antismash_multiple_samples flag.""",
)
parser.add_argument(
    "-o",
@@ -73,6 +75,16 @@
    type=str,
    default=".",
)
parser.add_argument(
    "-a",
    "--antismash_multiple_samples",
    metavar="PATH",
    dest="antismash_multiple_samples",
    nargs="?",
    help="""directory of antiSMASH output. Should contain subfolders (one per
sample). Can only be used if --input is not specified.""",
    type=str,
)
parser.add_argument("-vv", "--verbose", help="increase output verbosity", action="store_true")
parser.add_argument("-v", "--version", help="show version number and exit", action="store_true")

@@ -81,6 +93,7 @@

# Assign input arguments to variables
input = args.input
dir_antismash = args.antismash_multiple_samples
outdir = args.outdir
verbose = args.verbose
version = args.version
@@ -111,15 +124,38 @@
    elif path.endswith("knownclusterblast/"):
        input_antismash.append(path)

if input and dir_antismash:
    exit(
        "The flags --input and --antismash_multiple_samples are mutually exclusive.\nPlease use only one of them (or see --help for how to use)."
    )

# Make sure that at least one input argument is given
if not (input_antismash or input_gecco or input_deepbgc):
if not (input_antismash or input_gecco or input_deepbgc or dir_antismash):
    exit("Please specify at least one input file (i.e. output from antismash, deepbgc, or gecco) or see --help")
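The manual mutual-exclusion check above could also be expressed with argparse's built-in mechanism, which rejects conflicting flags before any script logic runs. A minimal sketch — the flag names mirror comBGC's, everything else is illustrative:

```python
import argparse

# Sketch: argparse enforces that --input and --antismash_multiple_samples
# are never given together, mirroring comBGC's manual exit() check.
parser = argparse.ArgumentParser(prog="combgc-sketch")
group = parser.add_mutually_exclusive_group()
group.add_argument("-i", "--input", nargs="*", type=str)
group.add_argument("-a", "--antismash_multiple_samples", metavar="PATH", type=str)

args = parser.parse_args(["--input", "sample.gbk"])
print(args.input, args.antismash_multiple_samples)  # → ['sample.gbk'] None
```

Passing both flags would make `parse_args` print a usage error and exit with status 2, so downstream code never sees an ambiguous combination.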

########################
# ANTISMASH FUNCTIONS
########################


def prepare_multisample_input_antismash(antismash_dir):
    """
    Prepare string of input paths of a given antiSMASH output folder (with sample subdirectories)
    """
    sample_paths = []
    for root, subdirs, files in os.walk(antismash_dir):
        antismash_file = "/".join([root, "index.html"])
        if os.path.exists(antismash_file):
            sample = root.split("/")[-1]
            gbk_path = "/".join([root, sample]) + ".gbk"
            kkb_path = "/".join([root, "knownclusterblast"])
            if os.path.exists(kkb_path):
                sample_paths.append([gbk_path, kkb_path])
            else:
                sample_paths.append([gbk_path])
    return sample_paths
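The walk above treats a subfolder as a sample iff it contains `index.html`, pairing each sample's `.gbk` with its optional `knownclusterblast/` folder. A self-contained sketch of the same pattern, run against a throwaway folder layout fabricated for illustration:

```python
import os
import tempfile

def collect_antismash_samples(antismash_dir):
    # Same pattern as prepare_multisample_input_antismash: a subfolder is a
    # sample iff it contains index.html; knownclusterblast/ is optional.
    sample_paths = []
    for root, subdirs, files in os.walk(antismash_dir):
        if os.path.exists(os.path.join(root, "index.html")):
            sample = os.path.basename(root)
            gbk = os.path.join(root, sample + ".gbk")
            kcb = os.path.join(root, "knownclusterblast")
            sample_paths.append([gbk, kcb] if os.path.exists(kcb) else [gbk])
    return sample_paths

with tempfile.TemporaryDirectory() as tmp:
    # sampleA has a knownclusterblast/ folder, sampleB does not
    os.makedirs(os.path.join(tmp, "sampleA", "knownclusterblast"))
    open(os.path.join(tmp, "sampleA", "index.html"), "w").close()
    os.makedirs(os.path.join(tmp, "sampleB"))
    open(os.path.join(tmp, "sampleB", "index.html"), "w").close()
    paths = collect_antismash_samples(tmp)
    print(sorted(len(p) for p in paths))  # → [1, 2]
```

Note the convention assumed here (as in comBGC) that the `.gbk` file is named after its parent sample directory.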


def parse_knownclusterblast(kcb_file_path):
    """
    Extract MIBiG IDs from knownclusterblast TXT file.
@@ -148,9 +184,6 @@ def antismash_workflow(antismash_paths):
    - Return data frame with aggregated info.
    """

    if verbose:
        print("\nParsing antiSMASH files\n... ", end="")

    antismash_sum_cols = [
        "Sample_ID",
        "Prediction_tool",
@@ -186,6 +219,9 @@

# Aggregate information
Sample_ID = gbk_path.split("/")[-1].split(".gbk")[-2] # Assuming file name equals sample name
if verbose:
print("\nParsing antiSMASH file(s): " + Sample_ID + "\n... ", end="")

with open(gbk_path) as gbk:
for record in SeqIO.parse(gbk, "genbank"): # GBK records are contigs in this case
# Initiate variables per contig
@@ -514,7 +550,13 @@ def gecco_workflow(gecco_paths):
########################

if __name__ == "__main__":
    tools = {"antiSMASH": input_antismash, "deepBGC": input_deepbgc, "GECCO": input_gecco}
    if input_antismash:
        tools = {"antiSMASH": input_antismash, "deepBGC": input_deepbgc, "GECCO": input_gecco}
    elif dir_antismash:
        tools = {"antiSMASH": dir_antismash}
    else:
        tools = {"deepBGC": input_deepbgc, "GECCO": input_gecco}

    tools_provided = {}

    for tool in tools.keys():
@@ -532,7 +574,13 @@ def gecco_workflow(gecco_paths):

    for tool in tools_provided.keys():
        if tool == "antiSMASH":
            summary_antismash = antismash_workflow(input_antismash)
            if dir_antismash:
                antismash_paths = prepare_multisample_input_antismash(dir_antismash)
                for input_antismash in antismash_paths:
                    summary_antismash_temp = antismash_workflow(input_antismash)
                    summary_antismash = pd.concat([summary_antismash, summary_antismash_temp])
            else:
                summary_antismash = antismash_workflow(input_antismash)
        elif tool == "deepBGC":
            summary_deepbgc = deepbgc_workflow(input_deepbgc)
        elif tool == "GECCO":
Expand Down
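The multi-sample branch above grows a single summary table by repeatedly concatenating per-sample data frames. A toy sketch of that accumulation pattern — the column names and sample data here are invented for illustration:

```python
import pandas as pd

# Sketch of the per-sample aggregation: start from an empty frame and
# append each sample's summary, as the antiSMASH multi-sample branch does.
summary = pd.DataFrame(columns=["Sample_ID", "BGC_count"])
per_sample = [
    pd.DataFrame({"Sample_ID": ["sampleA"], "BGC_count": [12]}),
    pd.DataFrame({"Sample_ID": ["sampleB"], "BGC_count": [7]}),
]
for part in per_sample:
    summary = pd.concat([summary, part], ignore_index=True)
print(summary["Sample_ID"].tolist())  # → ['sampleA', 'sampleB']
```

`ignore_index=True` renumbers the rows on each append; without it, every per-sample frame would contribute its own index 0 and the combined table would have duplicate row labels.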