ESPRESSO_alpha1.2.2/ESPRESSO_v1.2.0_manual.txt

Manual of ESPRESSO

ESPRESSO is a novel method for processing alignment of long read RNA-seq data, which can effectively improve splice junction accuracy and isoform quantification.

1 Commands and arguments
How to run ESPRESSO: 
perl ESPRESSO_S.pl -L samples.tsv -F ref.fa -A anno.gtf -O work_dir
perl ESPRESSO_C.pl -I work_dir -F ref.fa -X 0
(perl ESPRESSO_C.pl -I work_dir -F ref.fa -X 1
.
.
.)
perl ESPRESSO_Q.pl -L work_dir/samples.tsv.updated -A anno.gtf


Arguments of ESPRESSO_S:

    -L, --list_samples
          tsv list of sample(s) (each file in a line with 1st column as sorted 
          BAM/SAM file and 2nd column as sample name; required)
    -F, --fa
          FASTA file of all reference sequences. Please make sure this file is
          the same one provided to mapper. (required)
    -A, --anno
          input annotation file in GTF format (optional)
    -B, --SJ_bed
          input custom reliable splice junctions in BED format (optional; each
          reliable SJ in one line, with the 1st column as chromosome, the 2nd
          column as upstream splice site 0-base coordinate, the 3rd column as 
          downstream splice site and 6th column as strand)
    -O, --out
          work directory (existing files in this directory may be OVERWRITTEN; 
          default: ./)

    -H, --help
          show this help information

    -N, --read_num_cutoff
          min perfect read count for denovo detected candidate splice junctions
          (default: 2)
    -R, --read_ratio_cutoff
          min perfect read ratio for denovo detected candidate splice junctions: 
          Set this as 1 for completely GTF-dependent processing (default: 0)
    
    -C, --cont_del_max
          max continuous deletion allowed; intron will be identified if longer
          (default: 50)
    -M, --chrM
          tell ESPRESSO the ID of mitochondrion in reference file (default:
          chrM)

    -T, --num_thread
          thread number (default: minimum of 5 and sam file number)  
    -Q, --mapq_cutoff
          min mapping quality for processing (default: 1)
          
          
Arguments of ESPRESSO_C:

    -I, --in
          work directory (generated by ESPRESSO_S)
    -F, --fa
          FASTA file of all reference sequences. Please make sure this file is
          the same one provided to mapper. (required)
    -X, --target_ID
          ID of sample to process (required)

    -H, --help
          show this help information

    -T, --num_thread
          thread number (default: 5)


Arguments of ESPRESSO_Q:

    -L, --list_samples
          tsv list of multiple samples (each bam in a line with 1st column as
          sorted bam file, 2nd column as sample name in output, 3rd column as
          directory of ESPRESSO_C results; this list can be generated by
          ESPRESSO_S according to the initially provided tsv list; required)
    -A, --anno
          input annotation file in GTF format (optional)
    -O, --out_dir
          work directory (default: directory of -L)

    -H, --help
          show this help information

    -N, --read_num_cutoff
          min perfect read count for all splice junctions of novel isoform
          (default: 2)
    -R, --read_ratio_cutoff
          min perfect read ratio for all splice junctions of novel isoform
          (default: 0)


2 Preparation of running ESPRESSO
1) Please make sure parallel Perl, hmmer 3.3.1 and BLAST 2.8.1+ (or later version) have been installed and included in $PATH.

2) Prepare your fastq file by base calling.

For example, if the data is generated by Nanopore r9.4.1 cDNA library and you use fast mode of guppy, you can do base calling with 20 threads as follows,
guppy_basecaller --input_path /path/to/fast5 --save_path /path/to/fastq/ --cpu_threads_per_caller 20 --config dna_r9.4.1_450bps_fast.cfg

Or, if the data is generated by Nanopore r9.4.1 direct RNA library and you use fast mode of guppy, you can do base calling with 20 threads as follows,
guppy_basecaller --input_path /path/to/fast5 --save_path /path/to/fastq/ --cpu_threads_per_caller 20 --config rna_r9.4.1_70bps_fast.cfg

And then simply cat all separate .fastq files into one as follows,
cat /path/to/fastq/*.fastq > in.fq

3) The input of ESPRESSO is sorted SAM/BAM file, generated by long read mapper.

For example, if you use Minimap2[1], SAM alignment can be generated using default setting without outputting secondary alignment:
minimap2 -ax splice -ub --secondary=no ref.fa in.fq > in.sam

Or you can also designate -k 14 and -w 4 for noisy 1D Nanopore data as suggested by developer of Minimap2:
minimap2 -ax splice -ub -k14 -w 4 --secondary=no ref.fa in.fq > in.sam

Or add a BED file to assist the mapping:
minimap2 -ax splice -ub -k14 -w 4 --junc-bed junctions.bed --secondary=no ref.fa in.fq > in.sam

Only common format of CIGAR values in sam/bam are accepted by ESPRESSO, so please do NOT use parameters to output specific format of CIGAR values (e.g. --cs in minimap2).

4) The alignment SAM file in.sam need be sorted by samtools and then be inputted to ESPRESSO:
samtools sort -o in.sort.sam in.sam

5) All of the SAM files need be summarized into a tsv list, so the list can be inputted to ESPRESSO_S(parameter -L).


3 An example of running ESPRESSO
1) Please download ESPRESSO and toy_data first.

2) In this example, we only have one sam file(SIRV2_3.sort.sam), and we need to let ESPRESSO know the path and sample name of this sample by inputting a sample file in tsv format.
We can write a new sample.tsv in toy_data directory and type the info:

/path/to/SIRV2_3.sort.sam	test

If you have multiple sam files, write each of them in one line as above. Each sample can correspond to one or multiple sam files.

3) Enter the toy_data directory and type as follows in your terminal:
perl /path/to/alpha1.2.0/ESPRESSO_S.pl -A SIRV_C.gtf -L sample.tsv -F SIRV2.fasta -O test_toy

By doing this, ESPRESSO_S can pre-process all of the sam to output high confidence splice junctions and other necessary info for next steps.
We can see a file sample.tsv.updated will be generated as follows after this step, in which a target ID (0) is assigned to the sam file.

/path/to/SIRV2_3.sort.sam	test	0

4) Next try typing as follows in your terminal:
perl /path/to/alpha1.2.0/ESPRESSO_C.pl -I test_toy -F SIRV2.fasta -X 0 -T 5

Parameter -X is the target ID of sam file to process, which can be found in the test_toy/sample.tsv.updated above.
By doing this, ESPRESSO_C can correct and recover splice junction for each read, and it will finish its work in about one minute in this example.
If you have multiple sam files, please process each of them with ESPRESSO_C.

5) Try the final step by typing as follows in your terminal:
perl /path/to/alpha1.2.0/ESPRESSO_Q.pl -A SIRV_C.gtf -L test_toy/sample.tsv.updated

By doing this, ESPRESSO_Q can identify and quantify all isoforms from the reads.

You will see three main output files in the directory test_toy, which include details of three detected isoforms (SIRV201,202,203):
(1) sample_N2_R0_updated.gtf: a updated GTF annotation for detected isoforms
(2) sample_N2_R0_abundance.esp: a tsv file for expression of detected isoforms
(3) sample_N2_R0_compatible_isoform.tsv: a tsv file for compatible isoforms of each read


References
[1] Li H. Minimap2: pairwise alignment for nucleotide sequences[J]. Bioinformatics, 2018, 34(18): 3094-3100.