Discrepancies between EDTA and RepeatMasker results, how to combine? #231

Open
lx-1011 opened this issue Oct 25, 2021 · 12 comments
Labels
question Further information is requested

Comments

@lx-1011

lx-1011 commented Oct 25, 2021

Dear @oushujun ,
Thank you for developing this useful tool. I ran it successfully and found that SINEs and LINEs couldn't be identified from structural features. However, we don't have a TE library with extensive manual curation for pigs. We tried to combine the results of EDTA and RepeatMasker for a complete TE annotation, but the two gave different results:
image
image

  1. TEs make up nearly 40% of mammalian genomes[1]. EDTA identified 31.09% and RepeatMasker identified 37.31%. Is this difference of about 6% caused by the identification of SINEs and LINEs?
  2. Given the differences between the EDTA and RepeatMasker results, do you have a suggestion for how best to combine the two?
    #### Not about EDTA ####
  3. Most mammalian genomes are dominated by LINE and SINE retrotransposons, with more limited LTR retrotransposons and minimal DNA transposon accumulation[2]. However, EDTA did not identify any SINEs in the pig genome, and only 3 LINEs. Do you have any idea why?

Thanks and wish you all the best
Li Xin

[1] Isaac A Babarinde, Gang Ma, Yuhao Li, Boping Deng, Zhiwei Luo, Hao Liu, Mazid Md Abdul, Carl Ward, Minchun Chen, Xiuling Fu, Liyang Shi, Martha Duttlinger, Jiangping He, Li Sun, Wenjuan Li, Qiang Zhuang, Guoqing Tong, Jon Frampton, Jean-Baptiste Cazier, Jiekai Chen, Ralf Jauch, Miguel A Esteban, Andrew P Hutchins, Transposable element sequence fragments incorporated into coding and noncoding transcripts modulate the transcriptome of human pluripotent stem cells, Nucleic Acids Research, Volume 49, Issue 16, 20 September 2021, Pages 9132–9153, https://doi.org/10.1093/nar/gkab710
[2] Platt, R.N., Vandewege, M.W. & Ray, D.A. Mammalian transposable elements and their impacts on genome evolution. Chromosome Res 26, 25–43 (2018). https://doi.org/10.1007/s10577-017-9570-z

@oushujun
Owner

Dear Li Xin,

Sorry for the delayed response. If you compare the TE categories side by side, you may find that many of them differ considerably. In my opinion, the major discrepancy comes from EDTA's failure to identify SINEs and LINEs, which may have inflated the TIR category (i.e., CACTA and Mutator).

It's a good sign that RepeatMasker can identify LINEs. A better way to combine the two is to find out which LINE sequences in Repbase were used for the annotation, obtain those library sequences from Repbase or elsewhere (e.g., NCBI), format their names in the RepeatMasker format (for an example, see EDTA/database/rice6.9.5.liban.nonLTR), and feed them to EDTA via --curatedlib; EDTA should then perform much better. If you know of any pig TEs, giving them to EDTA via --curatedlib is also a good idea, and they don't have to be comprehensive.
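
The renaming step could be sketched in Python roughly as follows. This assumes the target header pattern is `>name#Class/Superfamily`; that pattern should be checked against the example library shipped with EDTA. The names `format_header`, `rename_fasta`, and `HEADER_RE` are illustrative helpers, not part of EDTA or RepeatMasker.

```python
import re

# Assumed target pattern: ">name#Class/Superfamily", e.g. ">L1_SS#LINE/L1".
# This is an assumption about the naming convention, not a rule taken from
# the EDTA source; compare with the example library before relying on it.
HEADER_RE = re.compile(r"^>[^\s#]+#[A-Za-z_]+(/[^\s/#]+)?$")


def format_header(raw_header, classification):
    """Turn a raw FASTA header into '>name#classification'.

    Only the first whitespace-delimited token is kept as the name, and any
    existing '#...' suffix is dropped so the label is not duplicated.
    """
    name = raw_header.lstrip(">").split()[0].split("#")[0]
    return f">{name}#{classification}"


def rename_fasta(lines, classification):
    """Yield FASTA lines with every header rewritten by format_header."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield format_header(line, classification)
        elif line:
            yield line  # sequence lines pass through unchanged
```

A file could be processed with `rename_fasta(open(path), "LINE/L1")`, where the classification string is whatever label applies to that set of sequences; `HEADER_RE.match(header)` can then serve as a quick check that each rewritten header fits the assumed pattern.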

Best,
Shujun

oushujun changed the title from "How to combine the results of EDTA and RepeatMasker?" to "Discrepancies between EDTA and RepeatMasker results, how to combine?" on Nov 20, 2021
oushujun added the question (Further information is requested) label on Nov 20, 2021
@lx-1011
Author

lx-1011 commented Nov 21, 2021

Dear @oushujun

Thanks for your response. I have tried to follow your suggestion, but I still have some questions:

  1. I tried to extract LINE sequences by their positions in reference.fa (from the RepeatMasker result based on the Dfam database), but the sequence names are not in the required format.
    image

  2. Then I formatted the names as in rice6.9.5.liban.nonLTR (chr:pos-end#LINE/L1 match=RM) and saved them to LINE.fa.
    image

  3. Run EDTA.pl again.
    perl ~/lixin/software/EDTA/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 20 --curatedlib ../03.RepeatModeler_RepeatMasker/LINE/LINE.fa

###log file
2021-11-18 15:35:06,300 -INFO- Summary of classifications:
Order     Superfamily    # of Sequences  # of Clade Sequences  # of Clades  # of full Domains
LTR       Bel-Pao        13              0                     0            0
LTR       Copia          175             64                    13           0
LTR       Gypsy          152             111                   13           0
LTR       Retrovirus     16              0                     0            0
LTR       mixture        1               0                     0            0
DIRS      unknown        7               0                     0            0
LINE      unknown        1116            0                     0            0
TIR       MuDR_Mutator   2               0                     0            0
TIR       PIF_Harbinger  1               0                     0            0
TIR       PiggyBac       2               0                     0            0
TIR       Tc1_Mariner    11              0                     0            0
TIR       hAT            32              0                     0            0
Helitron  unknown        4               0                     0            0
Maverick  unknown        252             0                     0            0
2021-11-18 15:35:06,304 -INFO- Pipeline done.
2021-11-18 15:35:06,305 -INFO- cleaning the temporary directory ./tmp
Remove CDS-related sequences in the EDTA library:

Thu Nov 18 15:41:00 CST 2021 Combine the high-quality TE library LINE.fa with the EDTA library:

(EDTA) cche@sg04 15:51:57
~/lixin/02_sus_pop/06annotation/02TE_annotation/tt_EDTA_LINE
$
### No error was reported; the run simply stopped. I tried it twice and got the same result.

  1. As a test, I used the first four lines of rice6.9.5.liban.nonLTR as the curated library and ran EDTA again.
    perl ~/lixin/software/EDTA/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --sensitive 1 --anno 1 --evaluate 1 --threads 20 --curatedlib rice6.9.5.liban.nonLTR

####log file
2021-11-21 00:49:51,972 -INFO- Summary of classifications:
Order     Superfamily    # of Sequences  # of Clade Sequences  # of Clades  # of full Domains
LTR       Bel-Pao        13              0                     0            0
LTR       Copia          175             64                    13           0
LTR       Gypsy          152             111                   13           0
LTR       Retrovirus     16              0                     0            0
LTR       mixture        1               0                     0            0
DIRS      unknown        7               0                     0            0
LINE      unknown        1116            0                     0            0
TIR       MuDR_Mutator   2               0                     0            0
TIR       PIF_Harbinger  1               0                     0            0
TIR       PiggyBac       2               0                     0            0
TIR       Tc1_Mariner    11              0                     0            0
TIR       hAT            32              0                     0            0
Helitron  unknown        4               0                     0            0
Maverick  unknown        252             0                     0            0
2021-11-21 00:49:51,976 -INFO- Pipeline done.
2021-11-21 00:49:51,976 -INFO- cleaning the temporary directory ./tmp
Remove CDS-related sequences in the EDTA library:

Sun Nov 21 00:57:17 CST 2021 Combine the high-quality TE library rice6.9.5.liban.nonLTR with the EDTA library:

Input file "Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa.mod.EDTA.TElib.fa.masked" not found!
  1. That's a good idea! I'd appreciate your suggestions!

Thanks and wish you all the best
Li Xin

@oushujun
Owner

Hi Li Xin,

You should select only the high-copy LINE annotations from the RepeatMasker output and generate non-redundant sequences from them. You can manually pick the ones you think are representative, or build a consensus to represent the sequences of each family. Please DON'T give all RepeatMasker sequences to EDTA.
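
One way to find the high-copy families is to count hits per family in the RepeatMasker .out table. The sketch below assumes the standard whitespace-delimited .out layout, in which the 10th column is the matching repeat and the 11th its class/family. The function names and the `min_copies` threshold are hypothetical and would need tuning for a real genome.

```python
from collections import Counter


def count_family_copies(out_lines, class_prefixes=("LINE",)):
    """Count annotated copies per repeat family in RepeatMasker .out lines.

    Data rows are whitespace-delimited; column 10 is the matching repeat
    name and column 11 its class/family. Header and blank lines are skipped
    because their first field is not an integer score.
    """
    counts = Counter()
    for line in out_lines:
        fields = line.split()
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        repeat_name, repeat_class = fields[9], fields[10]
        if repeat_class.startswith(class_prefixes):
            counts[(repeat_name, repeat_class)] += 1
    return counts


def high_copy_families(counts, min_copies=100):
    """Return families with at least min_copies hits, most abundant first.

    min_copies=100 is an arbitrary illustrative default, not a recommendation.
    """
    return [(fam, n) for fam, n in counts.most_common() if n >= min_copies]
```

Passing `class_prefixes=("LINE", "SINE")` would include SINE families as well. The resulting short list names the families worth representing in the curated library; it does not by itself produce the representative sequences.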

The name formatting looks good to me, but I don't understand what you mean by interruption. Please attach the full reports so that I can better judge what the issue may be.

Best,
Shujun

@lx-1011
Author

lx-1011 commented Jan 18, 2022

Hi @oushujun ,
Thanks for your response. I ran the test file successfully with your suggestion, and I show the details below.
I then ran EDTA on the whole genome with LINE_SINE.data.fa, which contains 177246 sequences identified by RepeatMasker (173080 LINEs, 4166 SINEs). It has been running from Nov 23 until now.

  1. That is nearly 56 days. I am not sure whether this is normal. Is there any way to shorten the run time?
  2. Annotation is running now, and I find only LINEs, no SINEs, in $.EDTA.TEanno.sum. Is that because the task has not completed?
    image

###01input
image

###02LINE_SINE.data.fa
image

###03current proceeding
image

Test file in detail (LINE10.fa obtained from RepeatMasker)
#1
perl ~/lixin/software/EDTA/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --sensitive 1 --anno 1 --evaluate 1 --threads 20 --curatedlib LINE10.fa
#2 LINE10.fa (5 lines)
image
#3 output (nearly 25h)
image
image

@oushujun
Owner

oushujun commented Jan 18, 2022 via email

@lx-1011
Author

lx-1011 commented Jan 19, 2022

Hi, @oushujun
Thanks for your response.

  1. We filtered the RepeatMasker result (SW score 300, length 80, div 80). Overlapping regions accounted for about 4% of the filtered result, and each overlap was only a few bp (mostly 1-10 bp). We then ran EDTA with the filtered RM database.
  2. Another question: could the output of EDTA be used as the input database for RepeatMasker, and the RM and EDTA results then be combined?

Li Xin

@oushujun
Owner

Hi Li Xin,

Is your RM database redundant or not? You should provide only non-redundant sequences to EDTA. This means using one sequence to represent the entire family it belongs to, and providing a collection of these representative sequences to EDTA. EDTA will use them to annotate other similar sequences by homology, using the RepeatMasker that is integrated into EDTA. If you provide redundant sequences, EDTA will use them to annotate your genome repetitively, which is super slow and not meaningful.

The homology results will then be combined with the structural results as the final output of EDTA, so there is no need to run RepeatMasker annotation again.
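
As a rough illustration of "one sequence per family", the sketch below keeps the longest copy of each family. This is a simple stand-in heuristic, not a consensus and not what EDTA does internally. The tuple layout and the names `pick_representatives` and `to_fasta` are assumptions made for the example.

```python
def pick_representatives(records):
    """Keep one sequence per family: the longest copy.

    records: iterable of (family, copy_id, sequence) tuples.
    Returns a dict mapping family -> (copy_id, sequence).
    On ties, the first copy seen is kept.
    """
    best = {}
    for family, copy_id, seq in records:
        current = best.get(family)
        if current is None or len(seq) > len(current[1]):
            best[family] = (copy_id, seq)
    return best


def to_fasta(representatives, classification_of):
    """Format representatives as FASTA text with 'family#classification' headers.

    classification_of: dict mapping family -> label such as 'LINE/L1'.
    """
    parts = []
    for family, (_copy_id, seq) in representatives.items():
        parts.append(f">{family}#{classification_of[family]}\n{seq}\n")
    return "".join(parts)
```

The longest copy is not always the most intact one, so a consensus built from a multiple alignment of each family may be a better representative where time allows.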

Shujun

@lx-1011
Author

lx-1011 commented Mar 15, 2022

Hi Shujun,
I ran it again and it is still running. The curatedlib has been filtered with CD-HIT, but it is still taking nearly 34 days.

image

perl ~/lixin/software/EDTA-2.0.0/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chr.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --overwrite 1 --anno 1 --evaluate 1 --threads 10 --curatedlib RModeler_pig_rm_merged.rmDup.fa

RModeler_pig_rm_merged.rmDup.fa is derived from Libraries/RepeatMaskerLib.h5
image

Thanks and wish you all the best
Li Xin

@oushujun
Owner

Hi Li Xin,

It should not take this long. The pig genome is not that big, which means something is not being done right. If your job runs longer than a week, you should look for problems.

I think the issue is the --curatedlib you provided to EDTA. How large is it? Judging from the file name RModeler_pig_rm_merged.rmDup.fa, was it generated initially by RepeatModeler, then used with RepeatMasker to mask the pig genome, after which you extracted the masked sequences and removed duplicates with CD-HIT? If so, you are doing it wrong.
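
To check the size of a candidate library before starting a long run, something like the sketch below could report the number of sequences and total bases per class. It assumes headers of the form `name#Class/Superfamily`; the function name `summarize_library` and the `unlabeled` bucket are made up for this example.

```python
from collections import defaultdict


def summarize_library(lines):
    """Return {class: (n_sequences, total_bases)} for FASTA lines.

    The class is the text between '#' and '/' in the header's first token;
    headers without '#' are grouped under 'unlabeled'.
    """
    stats = defaultdict(lambda: [0, 0])
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            if "#" in name:
                current = name.split("#", 1)[1].split("/")[0]
            else:
                current = "unlabeled"
            stats[current][0] += 1
        elif current is not None:
            stats[current][1] += len(line)
    return {cls: tuple(values) for cls, values in stats.items()}
```

A library with hundreds of thousands of entries is a strong hint that it holds individual genomic copies rather than one representative per family.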

Both EDTA and RepeatModeler can generate a non-redundant TE library. What you want to do is to use the SINE/LINE elements in the RepeatModeler library to boost the annotation of EDTA. So you may extract SINE/LINE sequences from the RepeatModeler library, format the sequence names, and provide them to EDTA via --curatedlib.
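
The extraction step could look roughly like this in Python. It assumes the RepeatModeler library headers carry a `#Class/Superfamily` label in their first token, followed by an optional free-text description. `read_fasta` and `select_classes` are illustrative helpers, not RepeatModeler or EDTA functions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA lines; header excludes '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_classes(records, wanted=("LINE", "SINE")):
    """Keep records whose '#Class/Superfamily' label starts with a wanted class.

    The header is trimmed to its first token, which drops any free-text
    description that follows the name.
    """
    for header, seq in records:
        tokens = header.split()
        name = tokens[0] if tokens else ""
        if "#" not in name:
            continue
        label = name.split("#", 1)[1]
        if label.split("/")[0] in wanted:
            yield name, seq
```

Writing each selected pair back out as `>name` followed by the sequence gives a small FASTA file that can be passed to --curatedlib. Entries labeled Unknown are skipped by this filter, so any SINE or LINE families that RepeatModeler failed to classify would need separate handling.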

You may also want to read the EDTA paper for how it works.

Shujun

@oushujun
Owner

any luck?

@lx-1011
Author

lx-1011 commented May 24, 2022

Hi Shujun,
Thanks for your response.
I have run it successfully.

Whole-genome TE annotation (total TE: 35.04%): Sus_scrofa.Sscrofa11.1.dna.chr.fa.mod.EDTA.TEanno.gff3

image
However, the results show that the percentage of TEs annotated is lower than expected, and EDTA still can't identify SINEs.

perl ~/lixin/software/EDTA-2.0.0/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chr.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --overwrite 1 --anno 1 --evaluate 1 --threads 10 --curatedlib ../line_sine.fa

line_sine.fa, filtered with CD-HIT, includes 233 LINEs and 36 SINEs.

@oushujun
Owner

Thanks for the update. Can you articulate which superfamily or class of TEs is lower than expected? Your result suggests that the 36 SINEs provided are not annotating any SINE elements in your genome. You may use AnnoSINE to generate the SINE library.

Shujun
