It is coded for LTR_retriever to classify long terminal repeat retrotransposons (LTR-RTs) at first. It can also be used to classify any other transposable elements (TEs), including Class I and Class II elements which are covered by the REXdb database.
Since version v1.4, a GENOME mode is supported to identify TE protein domains throughout whole genome.
For more details of methods and benchmarking results in classifying TEs, please see the paper in Horticulture Research.
- Installation
- Quick Start
- Citations
- Outputs
- Usage
- Limitations
- Further phylogenetic analyses
- Extracting TE sequences from genome for TEsorter
conda install -c bioconda tesorter
Dependencies:
- python >3
+ biopython: quickly install bypip install biopython
orconda install biopython
+ xopen: quickly install bypip install xopen
orconda install xopen
- hmmscan 3.3x: be compatible with HMMER3/f database format. quickly install by
conda install hmmer
- blast+: quickly install by
conda install blast
- TEsorter:
git clone https://github.com/zhangrengang/TEsorter
cd TEsorter
python setup.py install
# run the example
TEsorter-test
# or
TEsorter TEsorter/test/rice6.9.5.liban
By default, the newly released REXdb (viridiplantae_v3.0 + metazoa_v3) database is used, which is more sensitive and more common and thus is recommended.
For plants (an example), it might be better to use only the plant database (Note that the input file is TE or LTR sequences but not genome sequences: ELEMENT mode):
TEsorter TE.fasta -db rexdb-plant
Classical GyDB can also be used:
TEsorter TE.fasta -db gydb
To speed up, use more processors [default=4]:
TEsorter TE.fasta -p 20
To improve sensitivity, reduce the criteria (coverage and E-value):
TEsorter TE.fasta -p 20 -cov 10 -eval 1e-2
To improve specificity, increase the criteria and disable the pass2 mode:
TEsorter TE.fasta -p 20 -cov 30 -eval 1e-5 -dp2
To improve sensitivity of pass-2, reduce the 80–80–80 rule which may be too strict for superfamily-level classification:
TEsorter TE.fasta -p 20 -rule 70-30-80
To classify TE polyprotein sequences (an example) or gene protein seqeunces:
TEsorter RepeatPeps.lib -st prot -p 20
Since version v1.4, a GENOME mode (input genome sequences) is supported to identify TE protein domains throughout whole genome:
TEsorter genome.fasta -genome -p 20 -prob 0.9 -cov 30 -eval 1e-5 -score 1
TE domain regions can be masked by:
# hard mask
TEsorter genome.fasta -genome -p 20 -prob 0.9 -mask hard
# soft mask
TEsorter genome.fasta -genome -p 20 -prob 0.9 -mask soft
# both output
TEsorter genome.fasta -genome -p 20 -prob 0.9 -mask soft hard
# mask proteins of plant genes
TEsorter pep.faa -st prot -p 20 -prob 0.9 -score 1 -mask hard
If you use the TEsorter
tool, please cite:
Zhang RG, Li GL, Wang XL et. al. TEsorter: an accurate and fast method to classify LTR retrotransposons in plant genomes [J]. Horticulture Research, 2022, 9: uhac017 https://doi.org/10.1093/hr/uhac017
If you use the REXdb database (-db rexdb/rexdb-plant/rexdb-metazoa
), please cite:
Neumann P, Novák P, Hoštáková N et. al. Systematic survey of plant LTR-retrotransposons elucidates phylogenetic relationships of their polyprotein domains and provides a reference for element classification [J]. Mobile DNA, 2019, 10: 1 https://doi.org/10.1186/s13100-018-0144-1
If you use the GyDB database (-db gydb
), please cite:
Llorens C, Futami R, Covelli L et. al. The Gypsy Database (GyDB) of mobile genetic elements: release 2.0 [J]. Nucleic Acids Research, 2011, 39: 70–74 https://doi.org/10.1093/nar/gkq1061
If you use the AnnoSINE database (-db sine
), please cite:
Li Y, Jiang N, Sun Y. AnnoSINE: a short interspersed nuclear elements annotation tool for plant genomes [J]. Plant Physiology, 2022, 188: 955–970 http://doi.org/10.1093/plphys/kiab524
If you use the LINE/RT database (-db rexdb-line
), please cite:
Kapitonov VV, Tempel S, Jurka J. Simple and fast classification of non-LTR retrotransposons based on phylogeny of their RT domain protein sequences [J]. Gene, 2009, 448: 207–213 http://doi.org/10.1016/j.gene.2009.07.019
If you use the DNA/TIR database (-db rexdb-pnas
), please cite:
Yuan YW, Wessler SR. The catalytic domain of all eukaryotic cut-and-paste transposase superfamilies [J]. Proceedings of the National Academy of Sciences, 2011, 108: 7884–7889 http://doi.org/10.1073/pnas.1104208108
rice6.9.5.liban.rexdb.domtbl HMMScan raw output
rice6.9.5.liban.rexdb.dom.faa protein sequences of domain, which can be used for phylogenetic analysis.
rice6.9.5.liban.rexdb.dom.tsv inner domains of TEs/LTR-RTs, which might be used to filter domains based on their scores and coverages.
rice6.9.5.liban.rexdb.dom.gff3 domain annotations in `gff3` format
rice6.9.5.liban.rexdb.cls.tsv TEs/LTR-RTs classifications
Column 1: raw id
Column 2: Order, e.g. LTR
Column 3: Superfamily, e.g. Copia
Column 4: Clade, e.g. SIRE
Column 5: Complete, "yes" means one LTR Copia/Gypsy element with full GAG-POL domains.
Column 6: Strand, + or - or ?
Column 7: Domains, e.g. GAG|SIRE PROT|SIRE INT|SIRE RT|SIRE RH|SIRE; `none` for pass-2 classifications
rice6.9.5.liban.rexdb.cls.lib fasta library for RepeatMasker
rice6.9.5.liban.rexdb.cls.pep the same sequences as `rice6.9.5.liban.rexdb.dom.faa`, but id is changed with classifications.
rice6.9.5.liban.rexdb.*masked sequences masking the TE domains
Note: the GENOME mode (-genome
) will not output *.cls.*
files.
$ TEsorter -h
usage: TEsorter [-h] [-v] [-db {gydb,rexdb,rexdb-plant,rexdb-metazoa,rexdb-pnas,rexdb-line,sine}] [--db-hmm DB_HMM]
[-st {nucl,prot}] [-pre PREFIX] [-fw] [-p PROCESSORS] [-tmp TMP_DIR] [-cov MIN_COVERAGE] [-eval MAX_EVALUE]
[-prob MIN_PROBABILITY] [-nocln] [-cite] [-dp2] [-rule PASS2_RULE] [-nolib] [-norc] [-genome]
[-win_size WIN_SIZE] [-win_ovl WIN_OVL]
sequence
lineage-level classification of transposable elements using conserved protein domains.
positional arguments:
sequence input TE/LTR or genome sequences in fasta format [required]
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-db {gydb,rexdb,rexdb-plant,rexdb-metazoa,rexdb-pnas,rexdb-line,sine}, --hmm-database {gydb,rexdb,rexdb-plant,rexdb-metazoa,rexdb-pnas,rexdb-line,sine}
the database name used [default=rexdb]
--db-hmm DB_HMM the database HMM file used (prior to `-db`) [default=None]
-st {nucl,prot}, --seq-type {nucl,prot}
'nucl' for DNA or 'prot' for protein [default=nucl]
-pre PREFIX, --prefix PREFIX
output prefix [default='{-s}.{-db}']
-fw, --force-write-hmmscan
if False, will use the existed hmmscan outfile and skip hmmscan [default=False]
-p PROCESSORS, --processors PROCESSORS
processors to use [default=4]
-tmp TMP_DIR, --tmp-dir TMP_DIR
directory for temporary files [default=./tmp-e104611e-7ce3-11ed-90b7-0b7f57d69b28]
-cov MIN_COVERAGE, --min-coverage MIN_COVERAGE
mininum coverage for protein domains in HMMScan output [default=20]
-eval MAX_EVALUE, --max-evalue MAX_EVALUE
maxinum E-value for protein domains in HMMScan output [default=0.001]
-prob MIN_PROBABILITY, --min-probability MIN_PROBABILITY
mininum posterior probability for protein domains in HMMScan output [default=0.5]
-score MIN_SCORE, --min-score MIN_SCORE
mininum score for protein domains in HMMScan output
[default=0.1]
-mask {soft,hard} [{soft,hard} ...]
output masked sequences (soft: masking with lowercase;
hard: masking with N) [default=None]
-nocln, --no-cleanup do not clean up the temporary directory [default=False]
-cite, --citation print the citation and exit [default=False]
ELEMENT mode (default):
Input TE/LTR sequences to classify them into clade-level.
-dp2, --disable-pass2
do not further classify the unclassified sequences [default=False for `nucl`, True for `prot`]
-rule PASS2_RULE, --pass2-rule PASS2_RULE
classifying rule [identity-coverage-length] in pass-2 based on similarity [default=80-80-80]
-nolib, --no-library do not generate a library file for RepeatMasker [default=False]
-norc, --no-reverse do not reverse complement sequences if they are detected in minus strand [default=False]
GENOME mode:
Input genome sequences to detect TE protein domains throughout whole genome.
-genome input is genome sequences [default=False]
-win_size WIN_SIZE window size of chunking genome sequences [default=270000]
-win_ovl WIN_OVL overlap size of windows [default=30000]
- For each domain (e.g. RT), only the best hit with the highest score will output, which means: 1) if frame is shifted, only one part can be annotated; 2) for example, if two or more RT domains are present in one query sequence, only one of these RT domains will be annotated.
- Many LTR-RTs cannot be classified due to no hit, which might be because: 1) the database is still incompleted; 2) some LTR-RTs may have too many mutations such as frame shifts and stop gains or have lost protein domains; 3) some LTR-RTs may be identified false positively. For the test data set (rice6.9.5.liban), ~84% LTR-RTs (_INT sequences) are classified.
- Non-autonomous TEs that lack protein domains, some un-active autonomous TEs that have lost their protein domains and any other elements that contain none protein domains, are excepted to be un-classified.
You may want to use the RT domains to analysis relationships among retrotransposons (LTR, LINE, DIRS, etc.). Here is an example (with mafft and iqtree installed):
# to extract RT domain sequences
concatenate_domains.py rice6.9.5.liban.rexdb.cls.pep RT > rice6.9.5.liban.rexdb.cls.pep.RT.aln
# to reconduct the phylogenetic tree with IQTREE or other tools
iqtree -s rice6.9.5.liban.rexdb.dom.RT.aln -bb 1000 -nt AUTO
# Finally, visualize and edit the tree 'rice6.9.5.liban.rexdb.RT.faa.aln.treefile' with FigTree or other tools.
The alignments of LTR-RTs full domains can be generated by (align and concatenate; concatenate_domains.py
will convert all special characters to _
to be compatible with iqtree
and scripts/LTR_tree.R
):
concatenate_domains.py rice6.9.5.liban.rexdb.cls.pep GAG PROT RH RT INT > rice6.9.5.liban.rexdb.cls.pep.full.aln
The alignments of Class I INT and Class II TPase (DDE-transposases) can be generated by:
concatenate_domains.py rice6.9.5.liban.rexdb.cls.pep INT > rice6.9.5.liban.rexdb.cls.pep.INT.aln
concatenate_domains.py rice6.9.5.liban.rexdb.cls.pep TPase > rice6.9.5.liban.rexdb.cls.pep.TPase.aln
cat rice6.9.5.liban.rexdb.cls.pep.INT.aln rice6.9.5.liban.rexdb.cls.pep.TPase.aln > rice6.9.5.liban.rexdb.cls.pep.INT_TPase.faa
mafft --auto rice6.9.5.liban.rexdb.cls.pep.INT_TPase.faa > rice6.9.5.liban.rexdb.cls.pep.INT_TPase.aln
Note: the domain names between rexdb and gydb are somewhat different: PROT (rexdb) = AP (gydb), RH (rexdb) = RNaseH (gydb). Please use the actual domain name.
Here, an R script (depending on ggtree) is provided to fast visualize the LTR tree. An example in example_data/
:
../scripts/LTR_tree.R rice6.9.5.liban.rexdb.cls.pep_RT_RH_INT.aln.treefile rice6.9.5.liban.rexdb.cls.tsv rice6.9.5.liban.rexdb.cls.pep_RT_RH_INT.aln.treefile.png
Here are examples to extract TE sequences from outputs of wide-used softwares, when you have only genome sequences.
- extract all TE sequences from RepeatMasker output:
# run RepeatMasker, which will generate a *.out file.
RepeatMasker [options] genome.fa
# extract sequences
RepeatMasker.py out2seqs genome.fa.out genome.fa > whole_genome_te.fa
# classify
TEsorter whole_genome_te.fa [options]
- extract all intact LTR-RTs sequences from LTR_retriever outputs:
# run LTR_retriever, which generate two *.pass.list files.
LTR_retriever -genome genome.fa [options]
# extract sequences
LTR_retriever.py get_full_seqs genome.fa > intact_ltr.fa
# classify
TEsorter intact_ltr.fa [options]