Discrepancies between EDTA and RepeatMasker results, how to combine? #231

lx-1011 · 2021-10-25T11:39:46Z

Dear @oushujun ,
Thank you for developing this useful tool. I have run it successfully and found that SINEs and LINEs couldn't be identified based on structural features. However, we don't have a manually curated TE library for pigs. We tried to combine the results of EDTA and RepeatMasker for a complete TE identification, but the results differ:

TEs make up nearly 40% of mammalian genomes[1]. EDTA identifies 31.09%, and RepeatMasker identifies 37.31%. Is the difference of about 6% caused by the identification of SINEs and LINEs?
Regarding the difference between the EDTA and RepeatMasker results, do you have a suggestion for how to reconcile the two?
####Not about EDTA#######
Most mammalian genomes are dominated by LINE and SINE retrotransposons, with more limited LTR retrotransposons and minimal DNA transposon accumulation[2]. However, using EDTA we didn't identify any SINEs in the pig genome, and only 3 LINEs. Do you have any idea why?

Thanks and wish you all the best
Li Xin

[1] Isaac A Babarinde, Gang Ma, Yuhao Li, Boping Deng, Zhiwei Luo, Hao Liu, Mazid Md Abdul, Carl Ward, Minchun Chen, Xiuling Fu, Liyang Shi, Martha Duttlinger, Jiangping He, Li Sun, Wenjuan Li, Qiang Zhuang, Guoqing Tong, Jon Frampton, Jean-Baptiste Cazier, Jiekai Chen, Ralf Jauch, Miguel A Esteban, Andrew P Hutchins, Transposable element sequence fragments incorporated into coding and noncoding transcripts modulate the transcriptome of human pluripotent stem cells, Nucleic Acids Research, Volume 49, Issue 16, 20 September 2021, Pages 9132–9153, https://doi.org/10.1093/nar/gkab710
[2] Platt, R.N., Vandewege, M.W. & Ray, D.A. Mammalian transposable elements and their impacts on genome evolution. Chromosome Res 26, 25–43 (2018). https://doi.org/10.1007/s10577-017-9570-z

oushujun · 2021-11-10T00:18:37Z

Dear Li Xin,

Sorry for the delayed response. If you compare the TE categories side by side, you may find that many of them differ quite a lot. In my opinion, the major discrepancy comes from EDTA's failure to identify SINEs and LINEs, which may have inflated the TIR category (e.g., CACTA and Mutator).

It's a good sign that RepeatMasker can identify LINEs. A better way to combine the two is to find out which LINE sequences in Repbase were used for the annotation, obtain those library sequences from Repbase or elsewhere (e.g., NCBI), format their names in the RepeatMasker format (for an example, see EDTA/database/rice6.9.5.liban.nonLTR), and feed them to EDTA via --curatedlib. EDTA should then perform much better. If you know of any pig TEs, giving them to EDTA via --curatedlib is also a good idea; they don't have to be comprehensive.
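
The RepeatMasker naming convention mentioned above puts the classification after a `#` in the FASTA header (`>name#Class/Superfamily`). A minimal Python sketch of that renaming step follows. The helper names `format_header` and `rename_fasta`, the example ID `demo_L1`, and the ID-to-classification mapping are hypothetical illustrations, not part of EDTA.

```python
def format_header(name, te_class, superfamily="unknown"):
    """Build a RepeatMasker-style FASTA header: >name#Class/Superfamily."""
    return f">{name.strip()}#{te_class}/{superfamily}"


def rename_fasta(lines, classification):
    """Yield FASTA lines with every header rewritten.

    classification maps a sequence ID (the first token of the original
    header) to a (class, superfamily) tuple. This mapping is an assumption
    of the sketch; you have to build it from your own library.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_id = line[1:].split()[0]  # drop any trailing description
            te_class, superfamily = classification[seq_id]
            yield format_header(seq_id, te_class, superfamily)
        else:
            yield line


example = [">demo_L1 some description\n", "ACGTACGT\n"]
renamed = list(rename_fasta(example, {"demo_L1": ("LINE", "L1")}))
```

Here `renamed` holds the header `>demo_L1#LINE/L1` followed by the unchanged sequence line.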

Best,
Shujun

lx-1011 · 2021-11-21T05:13:17Z

Dear @oushujun

Thanks for your response. I have tried to follow your suggestion, but I still have some questions:

I extracted LINE sequences by their positions in reference.fa (from the RepeatMasker result based on the Dfam database), but the names did not meet the requirements.
I then formatted their names like those in rice6.9.5.liban.nonLTR (chr:pos-end#LINE/L1 match=RM) and wrote them to LINE.fa.
Then I ran EDTA.pl again:
perl ~/lixin/software/EDTA/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 20 --curatedlib ../03.RepeatModeler_RepeatMasker/LINE/LINE.fa

###log file
2021-11-18 15:35:06,300 -INFO- Summary of classifications:
Order Superfamily # of Sequences # of Clade Sequences # of Clades # of full Domains
LTR Bel-Pao 13 0 0 0
LTR Copia 175 64 13 0
LTR Gypsy 152 111 13 0
LTR Retrovirus 16 0 0 0
LTR mixture 1 0 0 0
DIRS unknown 7 0 0 0
LINE unknown 1116 0 0 0
TIR MuDR_Mutator 2 0 0 0
TIR PIF_Harbinger 1 0 0 0
TIR PiggyBac 2 0 0 0
TIR Tc1_Mariner 11 0 0 0
TIR hAT 32 0 0 0
Helitron unknown 4 0 0 0
Maverick unknown 252 0 0 0
2021-11-18 15:35:06,304 -INFO- Pipeline done.
2021-11-18 15:35:06,305 -INFO- cleaning the temporary directory ./tmp
Remove CDS-related sequences in the EDTA library:

Thu Nov 18 15:41:00 CST 2021 Combine the high-quality TE library LINE.fa with the EDTA library:

(EDTA) cche@sg04 15:51:57
~/lixin/02_sus_pop/06annotation/02TE_annotation/tt_EDTA_LINE
$
###No error was reported; the run was simply interrupted. I have tried it twice, and the result was the same.

I then tried using the first four lines of rice6.9.5.liban.nonLTR as the curatedlib and ran it again:
perl ~/lixin/software/EDTA/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --sensitive 1 --anno 1 --evaluate 1 --threads 20 --curatedlib rice6.9.5.liban.nonLTR

####log file
2021-11-21 00:49:51,972 -INFO- Summary of classifications:
Order Superfamily # of Sequences # of Clade Sequences # of Clades # of full Domains
LTR Bel-Pao 13 0 0 0
LTR Copia 175 64 13 0
LTR Gypsy 152 111 13 0
LTR Retrovirus 16 0 0 0
LTR mixture 1 0 0 0
DIRS unknown 7 0 0 0
LINE unknown 1116 0 0 0
TIR MuDR_Mutator 2 0 0 0
TIR PIF_Harbinger 1 0 0 0
TIR PiggyBac 2 0 0 0
TIR Tc1_Mariner 11 0 0 0
TIR hAT 32 0 0 0
Helitron unknown 4 0 0 0
Maverick unknown 252 0 0 0
2021-11-21 00:49:51,976 -INFO- Pipeline done.
2021-11-21 00:49:51,976 -INFO- cleaning the temporary directory ./tmp
Remove CDS-related sequences in the EDTA library:

Sun Nov 21 00:57:17 CST 2021 Combine the high-quality TE library rice6.9.5.liban.nonLTR with the EDTA library:

**Input file "Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa.mod.EDTA.TElib.fa.masked" not found!**

That's a good idea! I'd appreciate your suggestions!

Thanks and wish you all the best
Li Xin

oushujun · 2021-11-21T18:16:53Z

Hi Li Xin,

You should select only the high-copy LINE annotations from the RepeatMasker output and generate non-redundant sequences from them. You may manually select the ones that you think are representative, or build a consensus sequence for each family from its member sequences. Please DON'T give all RepeatMasker sequences to EDTA.
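
As a rough illustration of picking the high-copy families, the Python sketch below counts LINE/SINE hits per family in a RepeatMasker `.out` table and keeps those above a cutoff. It assumes the standard whitespace-separated `.out` layout (repeat name in column 10, class/family in column 11); the function names, the demo rows, and the cutoff of 2 are hypothetical, so choose a threshold that suits your data.

```python
from collections import Counter


def count_family_copies(out_lines, wanted_classes=("LINE", "SINE")):
    """Count RepeatMasker hits per (repeat name, class/family)."""
    counts = Counter()
    for line in out_lines:
        fields = line.split()
        # Data rows have at least 15 columns and start with an integer SW score;
        # this also skips the header lines and blank lines.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        repeat_name, class_family = fields[9], fields[10]
        if class_family.split("/")[0] in wanted_classes:
            counts[(repeat_name, class_family)] += 1
    return counts


def high_copy_families(counts, min_copies):
    """Return the families with at least min_copies hits, sorted by name."""
    return sorted(key for key, n in counts.items() if n >= min_copies)


demo_rows = [
    "score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID",
    "1200 10.0 0.5 0.2 chr12 100 900 (5000) + L1_DEMO LINE/L1 1 800 (0) 1",
    "900 12.0 0.5 0.2 chr12 2000 2600 (3400) C L1_DEMO LINE/L1 (0) 600 1 2",
    "400 15.0 0.0 0.0 chr12 3000 3200 (2800) + hAT_DEMO DNA/hAT 1 200 (0) 3",
]
copies = count_family_copies(demo_rows)
selected = high_copy_families(copies, min_copies=2)
```

The families in `selected` are only candidates; a representative or consensus sequence for each still has to be generated before anything is passed to --curatedlib.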

The name formatting looks good to me, but I don't understand what you mean by interruption. Please attach the full reports so that I can better judge what the issue may be.

Best,
Shujun

lx-1011 · 2022-01-18T02:17:33Z

Hi @oushujun ,
Thanks for your response. I have run the test file successfully following your suggestion, and I show the details below.
I then ran the whole genome through EDTA with LINE_SINE.data.fa, which contains about 177246 sequences identified by RepeatMasker (173080 LINEs, 4166 SINEs), from 23 Nov until now.

1. That is nearly 56 days. I am not sure whether this is normal. Is there any way to shorten the run time?
2. Annotation is running now, and I find only LINEs, no SINEs, in $.EDTA.TEanno.sum. Is that because the task has not completed?

###01input

###02LINE_SINE.data.fa

###03current proceeding

Test file in detail (LINE10.fa obtained from RepeatMasker)
#1
perl ~/lixin/software/EDTA/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --sensitive 1 --anno 1 --evaluate 1 --threads 20 --curatedlib LINE10.fa
#2 LINE10.fa (5 lines)

#3 output (nearly 25h)

oushujun · 2022-01-18T23:07:22Z

Hi,

Apparently, you are providing all SINE/LINE annotations to EDTA; you should not do that. Doing so will make your run very slow (as you mentioned, 56 days), and the annotation will not be right. Please provide only exemplary sequences (i.e., non-redundant library sequences) to EDTA.

Shujun


lx-1011 · 2022-01-19T04:59:02Z

Hi, @oushujun
Thanks for your response.

We filtered the RepeatMasker result (SW score 300, length 80, div 80). Overlapping regions accounted for 4% of the filtered result, and each overlap was only a few bp (mostly 1-10 bp). We then ran EDTA with the filtered RM database.
Another question: could the EDTA output be used as the input library for RepeatMasker, so that the RM and EDTA results can then be combined?

Li Xin

oushujun · 2022-01-22T04:56:14Z

Hi Li Xin,

Is your RM database redundant or not? You should provide only non-redundant sequences to EDTA. This means using one sequence to represent the entire family it belongs to and providing a collection of these representative sequences to EDTA. EDTA will use them to perform homology-based annotation of other similar sequences with the RepeatMasker that is integrated into EDTA. If you provide redundant sequences, EDTA will use them to annotate your genome repetitively, which is super slow and not meaningful.

The homology result will then be combined with structural results as the final output of EDTA. So there is no need to perform RepeatMasker annotation again.
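
To make "one sequence per family" concrete, here is a minimal Python sketch that keeps a single representative for each family. It simply picks the longest member, which is a stand-in for manual selection or a proper consensus and not what EDTA does internally; the function name and the demo records are hypothetical.

```python
def pick_representatives(records):
    """Keep one sequence per family.

    records is an iterable of (family, seq_id, sequence) tuples. The longest
    sequence of each family is kept; on ties, the first one seen wins.
    """
    best = {}
    for family, seq_id, sequence in records:
        current = best.get(family)
        if current is None or len(sequence) > len(current[1]):
            best[family] = (seq_id, sequence)
    return best


demo_records = [
    ("L1_DEMO", "copy_a", "ACGTACGT"),
    ("L1_DEMO", "copy_b", "ACGTACGTACGT"),
    ("SINE_DEMO", "copy_c", "TTTTGG"),
]
representatives = pick_representatives(demo_records)
```

With the demo input, two families remain, so the library handed to --curatedlib would contain two sequences instead of three.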

Shujun

lx-1011 · 2022-03-15T07:58:21Z

Hi Shujun,
I ran it again and the process is still running. The curatedlib has been filtered with CD-HIT, but the run still takes nearly 34 days.

perl ~/lixin/software/EDTA-2.0.0/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chr.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --overwrite 1 --anno 1 --evaluate 1 --threads 10 --curatedlib RModeler_pig_rm_merged.rmDup.fa

RModeler_pig_rm_merged.rmDup.fa is derived from Libraries/RepeatMaskerLib.h5

Thanks and wish you all the best
Li Xin

oushujun · 2022-03-15T17:03:44Z

Hi Li Xin,

It should not take this long. The pig genome is not that big, which means something is not right. If a job runs longer than a week, you should look for issues.

I think the issue is the --curatedlib you provided to EDTA. How large is it? Judging from the file name RModeler_pig_rm_merged.rmDup.fa, was it initially generated by RepeatModeler, then used with RepeatMasker to mask the pig genome, after which you extracted the masked sequences and removed duplicates with CD-HIT? If so, you are doing it wrong.

Both EDTA and RepeatModeler can generate a non-redundant TE library. What you want to do is to use the SINE/LINE elements in the RepeatModeler library to boost the annotation of EDTA. So you may extract SINE/LINE sequences from the RepeatModeler library, format the sequence names, and provide them to EDTA via --curatedlib.
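
The extraction step could look roughly like the Python sketch below. It assumes the RepeatModeler library headers already carry a `#Class/Superfamily` suffix (for example `>rnd-1_family-1#LINE/L1 ( ... )`); the function names and the demo records are hypothetical, and any trailing description after the first whitespace is dropped.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_nonltr(lines, wanted=("LINE", "SINE")):
    """Keep records whose header class (the text after '#') is LINE or SINE."""
    kept = []
    for header, seq in read_fasta(lines):
        seq_id = header.split()[0]  # drop the trailing description
        if "#" not in seq_id:
            continue  # unclassified header, cannot decide
        te_class = seq_id.split("#", 1)[1].split("/")[0]
        if te_class in wanted:
            kept.append((seq_id, seq))
    return kept


demo_library = [
    ">rnd-1_family-1#LINE/L1 ( demo description )",
    "ACGT",
    "ACGT",
    ">rnd-1_family-2#LTR/Gypsy",
    "GGGG",
    ">rnd-1_family-3#SINE/tRNA",
    "TTTT",
    ">rnd-1_family-4#Unknown",
    "CCCC",
]
nonltr = extract_nonltr(demo_library)
```

Writing each kept pair back out as `>seq_id` plus its sequence gives a small LINE/SINE-only FASTA file that can be supplied via --curatedlib.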

You may also want to read the EDTA paper for how it works.

Shujun

oushujun · 2022-05-24T03:27:55Z

any luck?

lx-1011 · 2022-05-24T05:12:47Z

Hi Shujun,
Thanks for your response.
I have run it successfully.

Whole-genome TE annotation (total TE: 35.04%): Sus_scrofa.Sscrofa11.1.dna.chr.fa.mod.EDTA.TEanno.gff3

However, the results show that the percentage of annotated TEs is lower than expected, and SINEs still can't be identified.

perl ~/lixin/software/EDTA-2.0.0/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chr.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --overwrite 1 --anno 1 --evaluate 1 --threads 10 --curatedlib ../line_sine.fa

line_sine.fa, filtered with CD-HIT, includes 233 LINEs and 36 SINEs.

oushujun · 2022-05-24T19:39:55Z

Thanks for the update. Can you articulate which superfamily or class of TEs is lower than expected? Your result suggests that the 36 SINEs provided are not annotating any SINE elements in your genome. You may use AnnoSINE to generate the SINE library.

Shujun

oushujun changed the title ~~How to combine the results of EDTA and RepeatMasker?~~ Discrepancies between EDTA and RepeatMasker results, how to combine? Nov 20, 2021

oushujun added the question Further information is requested label Nov 20, 2021

yuzhenpeng mentioned this issue Jun 24, 2022

Add SINE hmms zhangrengang/TEsorter#29

Closed

maruiqi0710 mentioned this issue Mar 22, 2023

How to import the contents of RepeatMasker into EDTA? #343

Closed
