04. Generating Required Inputs for lsaBGC

Rauf Salamzade edited this page Apr 28, 2021 · 18 revisions

lsaBGC-Process.py

Overview

lsaBGC-Process is the first program to run in the lsaBGC suite; it creates the required inputs for the rest of the suite. Its implementation also differs from the other programs in that it requires users to specify paths to separate conda environments for the three programs which generate these required inputs: (1) Prokka, (2) antiSMASH, and (3) OrthoFinder (version 2). It is actually a workflow, similar to lsaBGC-Automate.py, and both programs can be found in the workflows/ subdirectory of the suite.

All three programs take a while to run, so it is recommended that users process only complete / high-quality genomic assemblies through lsaBGC-Process to lay out and identify the major BGCs found in two or more members of a lineage. Additional instances of BGCs belonging to a GCF of interest can later be identified in high throughput across a multitude of draft genomes using lsaBGC-Expansion.py, if desired. To run lsaBGC-Expansion.py, however, you will first need to run the additional (low/medium quality) draft genomes through lsaBGC-Process.py in a special mode, specified by setting the flags -p (run only Prokka) and -q (avoid deep annotation with Prokka), which avoids running antiSMASH and OrthoFinder for each genomic assembly.
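The two modes described above differ only in the flags passed to lsaBGC-Process.py. As a rough sketch, the helper below assembles the argument list for either mode using only flags listed in the usage message. The function name `build_process_command`, its `draft_mode` parameter, and every path and genus shown are hypothetical placeholders, not part of lsaBGC.

```python
import shlex
from typing import List, Optional


def build_process_command(
    assembly_listing: str,
    output_directory: str,
    conda_path: str,
    prokka_env_path: str,
    orthofinder_env_path: Optional[str] = None,
    antismash_env_path: Optional[str] = None,
    genus: Optional[str] = None,
    cores: int = 1,
    draft_mode: bool = False,
) -> List[str]:
    """Assemble an argument list for lsaBGC-Process.py (hypothetical helper)."""
    cmd = [
        "lsaBGC-Process.py",
        "-a", assembly_listing,
        "-o", output_directory,
        "-cp", conda_path,
        "-pe", prokka_env_path,
        "-c", str(cores),
    ]
    if genus:
        cmd += ["-g", genus]
    if draft_mode:
        # Draft genomes: only run Prokka (-p) with fast annotation (-q).
        cmd += ["-p", "-q"]
    else:
        # High-quality genomes: also pass the OrthoFinder / antiSMASH environments.
        if orthofinder_env_path:
            cmd += ["-oe", orthofinder_env_path]
        if antismash_env_path:
            cmd += ["-ae", antismash_env_path]
    return cmd


if __name__ == "__main__":
    # Placeholder paths and genus; substitute your own.
    draft = build_process_command(
        "draft_genomes.txt", "Process_Drafts/", "/path/to/conda",
        "/path/to/prokka_env", genus="Corynebacterium", cores=4, draft_mode=True,
    )
    print(shlex.join(draft))
```

The returned list can be handed to `subprocess.run` as-is, or printed with `shlex.join` and pasted into a job script.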

A hopefully convenient option for users with access to high-performance computing resources is the dry-run option (-d), which simply creates task files with commands for each of the three major programs and leaves it to the user to parallelize or initiate these on the server.
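How the task files are executed is left to the user. The sketch below shows one way to run such a file on a single multi-core node. It assumes, without confirmation from the lsaBGC documentation, that a task file is plain text with one shell command per line; check the files produced by your dry run before relying on it. The function names `read_tasks` and `run_tasks` are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List


def read_tasks(task_file: str) -> List[str]:
    """Read commands from a task file (ASSUMED format: one command per line)."""
    commands = []
    with open(task_file) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and comment lines.
            if line and not line.startswith("#"):
                commands.append(line)
    return commands


def run_tasks(commands: List[str], max_parallel: int = 4) -> List[int]:
    """Run shell commands concurrently; return exit codes in input order."""

    def run_one(command: str) -> int:
        return subprocess.run(command, shell=True).returncode

    # Threads suffice here because each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))
```

A non-zero entry in the returned list marks a command that failed and may need to be rerun. On a cluster with a job scheduler, submitting each command as its own job is likely a better fit than this single-node approach.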

Usage

usage: lsaBGC-Process.py [-h] -a ASSEMBLY_LISTING -o OUTPUT_DIRECTORY -cp CONDA_PATH -pe PROKKA_ENV_PATH [-oe ORTHOFINDER_ENV_PATH] [-ae ANTISMASH_ENV_PATH] [-g GENUS] [-c CORES] [-d] [-q] [-p]

	Program: lsaBGC-Process.py
	Author: Rauf Salamzade
	Affiliation: Kalan Lab, UW Madison, Department of Microbiology and Immunology
	
	This program will automatically run or create task files for running Prokka (gene calling and annotation), 
	antiSMASH (biosynthetic gene cluster annotation), and OrthoFinder (de novo ortholog group construction).
	

optional arguments:
  -h, --help            show this help message and exit
  -a ASSEMBLY_LISTING, --assembly_listing ASSEMBLY_LISTING
                        Tab delimited text file. First column is the sample name and the second is the path to its assembly in FASTA format. Please remove troublesome characters in the sample name.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Prefix for output files.
  -cp CONDA_PATH, --conda_path CONDA_PATH
                        Path to anaconda/miniconda installation directory itself.
  -pe PROKKA_ENV_PATH, --prokka_env_path PROKKA_ENV_PATH
                        Path to conda environment for Prokka.
  -oe ORTHOFINDER_ENV_PATH, --orthofinder_env_path ORTHOFINDER_ENV_PATH
                        Path to conda environment for OrthoFinder. Optional, if not used, locus tags will be 3 characters instead of just 2.
  -ae ANTISMASH_ENV_PATH, --antiSMASH_env_path ANTISMASH_ENV_PATH
                        Path to conda environment for antiSMASH. Database should be automatically configured for antiSMASH loaded by the environment.
  -g GENUS, --genus GENUS
                        The genus under investigation. The lineage of interest could be species, but for this, just use the genus.
  -c CORES, --cores CORES
                        The number of cores to use.
  -d, --dry_run         Just create task files with commands for running prodigal, antiSMASH, and OrthoFinder. Useful for parallelizing across an HPC.
  -q, --fast_annotation
                        Skip basic/standard annotation in Prokka.
  -p, --only_run_prokka
                        Only run Prokka for gene annotation and Genbank creation. Skip the rest.
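The assembly listing expected by -a is a two-column, tab-delimited file (sample name, then path to the FASTA assembly). The following is a minimal sketch for generating one from a directory of assemblies. The function names, the set of accepted file extensions, and the rule that only letters, digits, and underscores are safe in sample names are all assumptions made for illustration; the help text does not define which characters count as troublesome.

```python
import re
from pathlib import Path
from typing import List, Tuple

# ASSUMPTION: assemblies use one of these extensions.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}


def sanitize_sample_name(name: str) -> str:
    """Replace anything other than letters, digits, or '_' with '_' (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


def build_assembly_listing(assembly_dir: str) -> List[Tuple[str, str]]:
    """Return (sample name, absolute FASTA path) pairs for a directory."""
    rows = []
    for path in sorted(Path(assembly_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            rows.append((sanitize_sample_name(path.stem), str(path.resolve())))
    return rows


def write_assembly_listing(rows: List[Tuple[str, str]], out_file: str) -> None:
    """Write one 'sample<TAB>path' line per assembly."""
    with open(out_file, "w") as handle:
        for sample, fasta_path in rows:
            handle.write(f"{sample}\t{fasta_path}\n")
```

Sanitizing can map two different file names to the same sample name, so it is worth checking the resulting listing for duplicates before passing it to -a.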