add command options

xiezhq · Apr 9, 2020 · 62c3af5 · 62c3af5
1 parent a4950b1
commit 62c3af5
Show file tree

Hide file tree

Showing 10 changed files with 116 additions and 355 deletions.
diff --git a/README.md b/README.md
@@ -5,11 +5,11 @@
 Contact:
 * Zhiqun Xie: [email protected]
 
-Last revision: 05/2019
+Last revision: 04/08/2020
 
 ## Overview
 
-ISEScan is a python pipeline to identify IS (Insertion Sequence) elements in genome. It includes an option in constants.py to report either complete IS elements or both complete and partial IS elements. It might be a good idea to try reporting both complete and partial IS elements when it is used to identify the IS elements in the assemblies of metegenome. ISEScan reports both complete and partial IS elements by default.
+ISEScan is a python pipeline to identify IS (Insertion Sequence) elements in genome. It includes an option to report either complete IS elements or both complete and partial IS elements. It might be a good idea to try reporting both complete and partial IS elements when it is used to identify the IS elements in the assemblies of metegenome. ISEScan reports both complete and partial IS elements by default.
 
 ISEScan was developed using Python3. It 1) scanes genome (or metagenome) in fasta format; 2) predicts/translates (using FragGeneScan) genome into proteome; 3) searches the pre-built pHMMs (profile Hidden Markov Models) of transposases (two files shipped with ISEScan; clusters.faa.hmm and clusters.single.faa) against the proteome and identifies the transposase gene in genome; 4) then extends the identified transposase gene into the complete IS (Insertion Sequence) elements based on the common characteristics shared by the known IS elements reported by literatures and database; 5) finally reports the identified IS elements in a few result files (e.g. a file containing a list of IS elements, a file containing sequences of IS elements in fasta format, an annotation file in GFF3 format).
 
@@ -46,8 +46,8 @@ Download: [publication/btx433.pdf](publication/btx433.pdf), [publication/Supplem
   * To use the shipped SSW library in ISEScan, please go to ssw201507 and then compile the codes by gcc:  
   `gcc -Wall -O3 -pipe -fPIC -shared -rdynamic -o libssw.so ssw.c ssw.h`
   * And then copy sswlib.so to the directory of ISEScan and set the search path as:   
-  `cp sswlib.so ../`
-  `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:libssw.so`
+  `cp libssw.so ../`
+  `export LD_LIBRARY_PATH=/path/to/libssw.so:$LD_LIBRARY_PATH # export LD_LIBRARY_PATH=/home/xiezhq/projects/isescan/libssw.so:$LD_LIBRARY_PATH`
   * The latest SSW library can be found at https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.
 * biopython 1.62 or later (required by SSW library)
 
@@ -61,10 +61,10 @@ Download: [publication/btx433.pdf](publication/btx433.pdf), [publication/Supplem
 
 Let's try an example, NC_012624.fna.
 
-* The command below scans NC_012624.fna (the genome sequence from Sulfolobus_islandicus_Y_N_15_51), and outputs all results in prediction directory:   
+* The command below scans NC_012624.fna (genome sequence of Sulfolobus_islandicus_Y_N_15_51, ~42 kb), and outputs all results in i`prediction` directory:   
 `python3 isescan.py NC_012624.fna proteome hmm`
 
-* Wait for its finishing. It may take a while as ISEScan uses the HMMER to scan the genome sequences and it will use 621 profile HMM models to scan each protein sequence (predicted by FragGeneScan) in the genome sequence. HMMER searching is usually more sensitive but slower than the regular BLAST searching for remote homologs.
+* Wait for its finishing. It may take a while (~40 seconds) as ISEScan uses the HMMER to scan the genome sequences and it will use 621 profile HMM models to scan each protein sequence (predicted by FragGeneScan) in the genome sequence. HMMER searching is usually more sensitive but slower than the regular BLAST searching for remote homologs. The running time for larger genome will increase quickly, e.g. about 20 minutes for NC_000913.fna (genome sequence of Escherichia coli str. K-12 substr. MG1655, ~4.6 Mb) with two cpu cores on my virtual machine.
 
 * After ISEScan finish running, you can find the output files in prediction directory: 
   * NC_012624.fna.sum: the summarization of IS copies for each IS family
@@ -110,11 +110,41 @@ Let's try an example, NC_012624.fna.
     * tir: terminal inverted repeat sequences
 
 ### Tips:
+* How to run a set of genomes in a row
+  *  Based on a few assumptions:
+  - You can successfully run ISEScan on one genome by executing `python3 /home/qiime2/ISEScan-1.7/isescan.py genome1.fa proteome hmm` where genome1.fa is your genome sequence file in fasta format. By default, ISEScan will use two CPU cores but you can change 'nproc = 2' (at the bottom of constants.py) to 'nproc = 1' or 'nproc = 4' or greater numbers in `constants.py` file.
+  - You are working and running ISEScan jobs on a Linux computer instead of a Linux cluster system.
+  - Your Linux computer has `nproc` (nproc could be 2 or 4 or 6 or 8 or ....) CPU cores.
+  - You want to run ISEScan on ngenome (ngenome could be 1 or 2 or 3, ...) fasta file (genome) in parallel on your Linux computer.
+
+  Now, let's run 200 genomes in one line of command and then wait for all computing jobs to complete (probably several days or weeks, depending on how many hours are required for each of your 200 genomes in average). If your computer has 8 CPU cores and You can execute the command below:
+  `nohup cat test.fna.list | xargs -n 1 -P 4 -I{} python3 /home/qiime2/ISEScan-1.7/isescan.py {} proteome hmm > log.txt &`
+
+  * In the command line, 
+  - **_test.fna.list_** is a text file which includes 200 fasta files, one fasta file per row, for example:
+  `/N/dc2/scratch/zhiqxie/hmp/HMASM/SRS014235.scaffolds.fa
+  /N/dc2/scratch/zhiqxie/hmp/HMASM/SRS049959.scaffolds.fa
+  /N/dc2/scratch/zhiqxie/hmp/HMASM/SRS020233.scaffolds.fa
+  /N/dc2/scratch/zhiqxie/hmp/HMASM/SRS022609.scaffolds.fa
+  /N/dc2/scratch/zhiqxie/hmp/HMASM/SRS024132.scaffolds.fa
+  ` 
+  - **_-n 1_** tells your computer to pick only one fasta file from **_test.fna.list_** for each ISEScan computing job.
+  - **_-P 4_** tells your computer to spawn 4 processes at the same time (run 4 ISEScan jobs in parallel, namely, run 4 genomes at the same time). When one job completes with success or exits with error, a new ISEScan job on the next fasta file (e.g. 5th fasta file) in **_test.fna.list_** is spawned. So, the command line will keep 4 ISEScan computing jobs (one fasta file per ISEScan job) running on your computer, and each job utilizes two CPU cores by default. It means all of 8 CPU cores on your computer have been utilized by your 4 ISEScan computing jobs till the last fasta file is processed by ISEScan.
+  - **> log.txt** tells your computer to write the screen messages output by ISEScan to the file `log.txt`.
+  - **&** tells your computer to run jobs in the background without interrupting you on the current terminal (e.g. xterm), in order that you can work on other things on the same terminal.
+  You can check your job status by the command `top -c -u qiime2` (assuming your user name is `qiime2`). 
+
+  It might take several days or weeks for 200 genomes to complete. It depends on how many CPU cores you have on your computer and how fast each CPU core is. Please do not load too many ISEScan jobs because each ISEScan job will consume part of your RAM on your computer. However, you can always test and estimate how many GB RAM and how many hours are required for a genome.
+
 * ISEScan will run much faster if you run it on the same genome sequence more than once (e.g., trying different optimal parameters of near and far regions (see our paper [...] for the definitions of near and far regions)) to search for IS elements in your genome). The reason is that it skips either FragGeneScan or both FragGeneScan and phmer/hmmsearch steps which are most time-consuming steps in ISEScan pipeline.
 * If you prefer ISEScan recalculating the the results, you can simply remove the proteome file and HMMER search results which are related to your genome sequence file name. For example, you can delete NC_012624.fna.faa in proteome directory and clusters.faa.hmm.NC_012624.fna.faa and clusters.single.faa.NC_012624.fna.faa in hmm directory, and then rerun it:  
 `python3 isescan.py NC_012624.fna proteome hmm`
 
 ## Release History 
+* 1.7.2.
+  * Add command options `--removeShortIS` and `--no-FragGeneScan`, and remove `removeShortIS` and `translateGenome` from constants.py. (Thanks EricDeveaud for his suggestion and codes)
+  * Add command option `--nthread` to isescan.py, and remove `nthread` and `nproc` from constants.py.
+  * Remove useless parallel testing codes from code base.
 * 1.7.1
   * fix a bug in constants.py, which fails to locate the correct path pointing to profile HMM files (clusters.single.faa and clusters.faa.hmm). Thank giuliodimaria92 for it.
 * 1.7

diff --git a/constants.py b/constants.py
@@ -6,15 +6,14 @@
 # on your computer.
 # 
 # FragGeneScan
-#FragGeneScan = '/u/zhiqxie/informatics/inst/FragGeneScan1.19/run_FragGeneScan.pl'
-FragGeneScan = '/u/zhiqxie/informatics/inst/FragGeneScan1.30/run_FragGeneScan.pl'
+FragGeneScan = '/apps/inst/FragGeneScan1.30/run_FragGeneScan.pl'
 # Hmmer
-phmmer = '/u/zhiqxie/informatics/inst/hmmer-3.1b2/bin/phmmer'
-hmmsearch = '/u/zhiqxie/informatics/inst/hmmer-3.1b2/bin/hmmsearch'
-# Blast 
-blastn = '/l/ncbi-blast/bin/blastn'
-blastp = '/l/ncbi-blast/bin/blastp'
-makeblastdb = '/l/ncbi-blast/bin/makeblastdb'
+phmmer = '/apps/inst/hmmer-3.3/bin/phmmer'
+hmmsearch = '/apps/inst/hmmer-3.3/bin/hmmsearch'
+# Blast
+blastn = '/apps/inst/ncbi-blast-2.10.0+/bin/blastn'
+blastp = '/apps/inst/ncbi-blast-2.10.0+/bin/blastp'
+makeblastdb = '/apps/inst/ncbi-blast-2.10.0+/bin/makeblastdb'
 
 # get path where isescan.py is
 import sys
@@ -23,39 +22,16 @@
 # Set the path variables pointing to the profile HMM files (clusters.single.faa and clusters.faa.hmm).
 #
 # The peptide sequences of single-member clusters, which is used by phmmer in hmmer
-file4clusterSeqFile4phmmer = os.path.join(path2isescan, 'pHMMs', 'clusters.single.faa')
+#file4clusterSeqFile4phmmer = os.path.join(path2isescan, 'pHMMs', 'clusters.single.faa')
+file4clusterSeqFile4phmmer = '/home/xiezhq/projects/isescan/pHMMs/clusters.single.faa'
 #
 # The profile HMMs of multiple-member clusters, which is used by hmmsearch in hmmer
-file4clusterHMM = os.path.join(path2isescan, 'pHMMs', 'clusters.faa.hmm')
+#file4clusterHMM = os.path.join(path2isescan, 'pHMMs', 'clusters.faa.hmm')
+file4clusterHMM = '/home/xiezhq/projects/isescan/pHMMs/clusters.faa.hmm'
 #
 ## Config packages
 
 
-# Option switch to report partial IS element
-#
-# If removeShortIS is True, ISEScan will remove partial IS elements which include 
-# IS element with length < 400 or single copy IS element without perfect TIR.
-# If removeShortIS is False, ISEScan will report both partial IS element and complete (full-lenght) IS elements.
-# The default is False.
-#removeShortIS = True
-removeShortIS = False
-#
-# Option switch to report both complete and partial IS elements
-
-# When translateGenome is True, pipepline will predict and translate genes 
-# from genome sequence into protein database (file in fasta format) 
-# using FragGeneScan program.
-translateGenome = True
-# When translateGenome is False, pipeline will use the protein database (.faa file)
-# from ncbi genome database, and the corresponding .fna and .ptt files in the same folder
-# are required to map proteins to genome locations.
-#translateGenome = False
-
-# set temporary directory used by ISEScan
-#tmpdir = 'tmpdir'
-#tmpdir = '/N/u/zhiqxie/Karst/is/isescan/tmpdir'
-#tmpdir = '/N/dc2/scratch/zhiqxie/insertion_sequence/tmpdir'
-
 # for local linux machine
 path2results = ''
 # for HPC system
@@ -361,13 +337,3 @@
 		'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
 		}
 gene2pepTable = {'11': table11}
-
-# default number of processes to use in calculation if it is not given
-#nproc = 32
-#nproc = 16
-#nproc = 8
-nproc = 2
-# default number of threads to use in calculation if it is not given
-nthread = 4
-#nthread = 16
-#nthread = 32