-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathESPRESSO_v1.2.0_manual.txt
168 lines (121 loc) · 7.22 KB
/
ESPRESSO_v1.2.0_manual.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
Manual of ESPRESSO
ESPRESSO is a novel method for processing alignment of long read RNA-seq data, which can effectively improve splice junction accuracy and isoform quantification.
1 Commands and arguments
How to run ESPRESSO:
perl ESPRESSO_S.pl -L samples.tsv -F ref.fa -A anno.gtf -O work_dir
perl ESPRESSO_C.pl -I work_dir -F ref.fa -X 0
(perl ESPRESSO_C.pl -I work_dir -F ref.fa -X 1
.
.
.)
perl ESPRESSO_Q.pl -L work_dir/samples.tsv.updated -A anno.gtf
Arguments of ESPRESSO_S:
-L, --list_samples
tsv list of sample(s) (each file in a line with 1st column as sorted
BAM/SAM file and 2nd column as sample name; required)
-F, --fa
FASTA file of all reference sequences. Please make sure this file is
the same one provided to mapper. (required)
-A, --anno
input annotation file in GTF format (optional)
-B, --SJ_bed
input custom reliable splice junctions in BED format (optional; each
reliable SJ in one line, with the 1st column as chromosome, the 2nd
column as upstream splice site 0-base coordinate, the 3rd column as
downstream splice site and 6th column as strand)
-O, --out
work directory (existing files in this directory may be OVERWRITTEN;
default: ./)
-H, --help
show this help information
-N, --read_num_cutoff
min perfect read count for denovo detected candidate splice junctions
(default: 2)
-R, --read_ratio_cutoff
min perfect read ratio for denovo detected candidate splice junctions:
Set this as 1 for completely GTF-dependent processing (default: 0)
-C, --cont_del_max
max continuous deletion allowed; intron will be identified if longer
(default: 50)
-M, --chrM
tell ESPRESSO the ID of mitochondrion in reference file (default:
chrM)
-T, --num_thread
thread number (default: minimum of 5 and sam file number)
-Q, --mapq_cutoff
min mapping quality for processing (default: 1)
Arguments of ESPRESSO_C:
-I, --in
work directory (generated by ESPRESSO_S)
-F, --fa
FASTA file of all reference sequences. Please make sure this file is
the same one provided to mapper. (required)
-X, --target_ID
ID of sample to process (required)
-H, --help
show this help information
-T, --num_thread
thread number (default: 5)
Arguments of ESPRESSO_Q:
-L, --list_samples
tsv list of multiple samples (each bam in a line with 1st column as
sorted bam file, 2nd column as sample name in output, 3rd column as
directory of ESPRESSO_C results; this list can be generated by
ESPRESSO_S according to the initially provided tsv list; required)
-A, --anno
input annotation file in GTF format (optional)
-O, --out_dir
work directory (default: directory of -L)
-H, --help
show this help information
-N, --read_num_cutoff
min perfect read count for all splice junctions of novel isoform
(default: 2)
-R, --read_ratio_cutoff
min perfect read ratio for all splice junctions of novel isoform
(default: 0)
2 Preparation of running ESPRESSO
1) Please make sure parallel Perl, hmmer 3.3.1 and BLAST 2.8.1+ (or later version) have been installed and included in $PATH.
2) Prepare your fastq file by base calling.
For example, if the data is generated by Nanopore r9.4.1 cDNA library and you use fast mode of guppy, you can do base calling with 20 threads as follows,
guppy_basecaller --input_path /path/to/fast5 --save_path /path/to/fastq/ --cpu_threads_per_caller 20 --config dna_r9.4.1_450bps_fast.cfg
Or, if the data is generated by Nanopore r9.4.1 direct RNA library and you use fast mode of guppy, you can do base calling with 20 threads as follows,
guppy_basecaller --input_path /path/to/fast5 --save_path /path/to/fastq/ --cpu_threads_per_caller 20 --config rna_r9.4.1_70bps_fast.cfg
And then simply cat all separate .fastq files into one as follows,
cat /path/to/fastq/*.fastq > in.fq
3) The input of ESPRESSO is sorted SAM/BAM file, generated by long read mapper.
For example, if you use Minimap2[1], SAM alignment can be generated using default setting without outputting secondary alignment:
minimap2 -ax splice -ub --secondary=no ref.fa in.fq > in.sam
Or you can also designate -k 14 and -w 4 for noisy 1D Nanopore data as suggested by developer of Minimap2:
minimap2 -ax splice -ub -k14 -w 4 --secondary=no ref.fa in.fq > in.sam
Or add a BED file to assist the mapping:
minimap2 -ax splice -ub -k14 -w 4 --junc-bed junctions.bed --secondary=no ref.fa in.fq > in.sam
Only common format of CIGAR values in sam/bam are accepted by ESPRESSO, so please do NOT use parameters to output specific format of CIGAR values (e.g. --cs in minimap2).
4) The alignment SAM file in.sam need be sorted by samtools and then be inputted to ESPRESSO:
samtools sort -o in.sort.sam in.sam
5) All of the SAM files need be summarized into a tsv list, so the list can be inputted to ESPRESSO_S(parameter -L).
3 An example of running ESPRESSO
1) Please download ESPRESSO and toy_data first.
2) In this example, we only have one sam file(SIRV2_3.sort.sam), and we need to let ESPRESSO know the path and sample name of this sample by inputting a sample file in tsv format.
We can write a new sample.tsv in toy_data directory and type the info:
/path/to/SIRV2_3.sort.sam test
If you have multiple sam files, write each of them in one line as above. Each sample can correspond to one or multiple sam files.
3) Enter the toy_data directory and type as follows in your terminal:
perl /path/to/alpha1.2.0/ESPRESSO_S.pl -A SIRV_C.gtf -L sample.tsv -F SIRV2.fasta -O test_toy
By doing this, ESPRESSO_S can pre-process all of the sam to output high confidence splice junctions and other necessary info for next steps.
We can see a file sample.tsv.updated will be generated as follows after this step, in which a target ID (0) is assigned to the sam file.
/path/to/SIRV2_3.sort.sam test 0
4) Next try typing as follows in your terminal:
perl /path/to/alpha1.2.0/ESPRESSO_C.pl -I test_toy -F SIRV2.fasta -X 0 -T 5
Parameter -X is the target ID of sam file to process, which can be found in the test_toy/sample.tsv.updated above.
By doing this, ESPRESSO_C can correct and recover splice junction for each read, and it will finish its work in about one minute in this example.
If you have multiple sam files, please process each of them with ESPRESSO_C.
5) Try the final step by typing as follows in your terminal:
perl /path/to/alpha1.2.0/ESPRESSO_Q.pl -A SIRV_C.gtf -L test_toy/sample.tsv.updated
By doing this, ESPRESSO_Q can identify and quantify all isoforms from the reads.
You will see three main output files in the directory test_toy, which include details of three detected isoforms (SIRV201,202,203):
(1) sample_N2_R0_updated.gtf: a updated GTF annotation for detected isoforms
(2) sample_N2_R0_abundance.esp: a tsv file for expression of detected isoforms
(3) sample_N2_R0_compatible_isoform.tsv: a tsv file for compatible isoforms of each read
References
[1] Li H. Minimap2: pairwise alignment for nucleotide sequences[J]. Bioinformatics, 2018, 34(18): 3094-3100.