Discrepancies between EDTA and RepeatMasker results, how to combine? #231
Dear Li Xin, Sorry for the delayed response. If you compare the TE categories side by side, you will find that many of them differ substantially. In my opinion, the major discrepancy comes from EDTA's failure to identify SINEs and LINEs, which may have inflated the TIR category (e.g., CACTA and Mutator). A good sign is that RepeatMasker can identify LINEs. A better way to combine the two is to find out which LINE sequences in Repbase were used for the annotation, then obtain those library sequences from Repbase or elsewhere (e.g., NCBI), and format their names into the RepeatMasker format (example, Best, |
Dear @oushujun, Thanks for your response. I have tried to follow your suggestion but still have some questions:
###log file Thu Nov 18 15:41:00 CST 2021 **Combine the high-quality TE library LINE.fa with the EDTA library: (EDTA) cche@sg04 15:51:57**
####log file Sun Nov 21 00:57:17 CST 2021 Combine the high-quality TE library rice6.9.5.liban.nonLTR with the EDTA library:
Thanks and wish you all the best |
Hi Li Xin, You should select only the high-copy LINE annotations from the RepeatMasker output and generate non-redundant sequences from them. You may manually select the ones that you think are representative, or build a consensus to represent the sequences of each family. Please DON'T give all RepeatMasker sequences to EDTA. The name formatting looks good to me, but I don't understand what you mean by interruption. Please attach the full reports so that I can better judge what the issue may be. Best, |
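One way to find the "high-copy" LINE families mentioned above is to count annotated copies per family in the RepeatMasker `.out` table. The sketch below is only an illustration: it assumes the standard whitespace-separated `.out` layout (repeat name in column 10, class/family in column 11), and the function names and the `min_copies` threshold are hypothetical, not part of EDTA or RepeatMasker.

```python
from collections import Counter


def count_line_copies(out_lines, wanted_prefixes=("LINE",)):
    """Count annotated copies per repeat family in RepeatMasker .out text.

    Assumes the usual .out column order: score, div, del, ins, query,
    begin, end, (left), strand, repeat name, class/family, ...
    """
    counts = Counter()
    for line in out_lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        repeat_name, repeat_class = fields[9], fields[10]
        if repeat_class.startswith(wanted_prefixes):
            counts[(repeat_name, repeat_class)] += 1
    return counts


def high_copy_families(counts, min_copies):
    """Return (name, class) keys with at least min_copies hits, most frequent first."""
    selected = [key for key, n in counts.items() if n >= min_copies]
    return sorted(selected, key=lambda key: -counts[key])
```

With a real file this could be called as `count_line_copies(open("genome.fa.out"))`, where the file name is a placeholder. The families that pass the threshold are candidates for manual inspection, not a finished library.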
Hi @oushujun,
Test file details (LINE10.fa obtained from RepeatMasker) |
Hi,
Apparently, you are providing all SINE/LINE annotations to EDTA. You
should not do that. Please provide only exemplary sequences (i.e.,
non-redundant library sequences) to EDTA. Providing every annotation makes
the run very slow (as you mentioned, 56 days), and the resulting annotation
is not right.
Shujun
On Mon, Jan 17, 2022 at 9:17 PM lx-1011 wrote:
Hi @oushujun,
Thanks for your response. I have run the test file successfully following
your suggestion, and I will show the details later.
I then ran the whole genome through EDTA with LINE_SINE.data.fa, which
contains about 177246 sequences (173080 LINEs, 4166 SINEs) identified by
RepeatMasker. The run started on 23 Nov and is still going.
1. That is nearly 56 days. I am not sure whether this is normal. Is there
any way to shorten the run time?
2. Annotation is running now, and I only find LINEs, no SINEs, in
$.EDTA.TEanno.sum. Is that because the task has not completed?
[image: image]
<https://user-images.githubusercontent.com/47030888/149858908-6bb0470b-5aef-44ad-8489-344af6f75458.png>
###01input
[image: image]
<https://user-images.githubusercontent.com/47030888/149857267-89d48eaa-6d9c-49da-ab9e-10ff2c221784.png>
###02LINE_SINE.data.fa
[image: image]
<https://user-images.githubusercontent.com/47030888/149857186-7837abd1-3d67-4bb0-9b75-5c2a489f809a.png>
###03current proceeding
[image: image]
<https://user-images.githubusercontent.com/47030888/149857460-79f7e34f-a76a-45ec-9a7f-db2698f573a3.png>
Test file details (LINE10.fa obtained from RepeatMasker)
#1
perl ~/lixin/software/EDTA/EDTA.pl \
  --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa \
  --species others --step all \
  --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa \
  --sensitive 1 --anno 1 --evaluate 1 --threads 20 --curatedlib LINE10.fa
#2 LINE10.fa (5 lines)
[image: image]
<https://user-images.githubusercontent.com/47030888/149857774-c1d9dcd4-8bc3-4014-9a2a-f9cd295cd6bf.png>
#3 output (nearly 25 h)
[image: image]
<https://user-images.githubusercontent.com/47030888/149858049-8c49e59c-0a7e-490b-aa15-761455bf586b.png>
[image: image]
<https://user-images.githubusercontent.com/47030888/149858081-bfd0c0c2-821f-4884-8c0e-5f1ff1843351.png>
|
Hi, @oushujun
Li Xin |
Hi Li Xin, Is your RM database redundant or not? You should provide only non-redundant sequences to EDTA. This means you need to use one sequence to represent the entire family it belongs to, and provide a collection of these representative sequences to EDTA. EDTA will use these sequences for homology-based annotation of other similar sequences, using the RepeatMasker integrated into EDTA. If you provide redundant sequences, EDTA will annotate your genome with them over and over, which is extremely slow and not meaningful. The homology results are then combined with the structural results as the final output of EDTA, so there is no need to run RepeatMasker annotation again. Shujun |
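The "one sequence per family" idea can be sketched as follows. This is a simplification, not what EDTA does internally: it keeps the longest copy of each family as a stand-in representative, whereas a consensus built from an alignment is usually a better choice. It assumes every FASTA header starts with a family label such as `L1_SS#LINE/L1` and that copies of the same family share that label; the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs; name is the first token of each header."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_per_family(records):
    """Keep only the longest sequence seen for each family label."""
    best = {}
    for name, seq in records:
        if name not in best or len(seq) > len(best[name]):
            best[name] = seq
    return best


def to_fasta(best):
    """Format the selected representatives as FASTA text, one record per family."""
    return "".join(f">{name}\n{seq}\n" for name, seq in best.items())
```

Reducing roughly 177,000 annotated copies to a small set of representatives this way is what keeps the homology step tractable; the result should still be checked by eye before it is used as a library.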
Hi Li Xin, It should not take this long. The pig genome is not that big, which means something in your run is not right. If your job runs longer than a week, you should start looking for problems. I think the issue is the [...] Both EDTA and RepeatModeler can generate a non-redundant TE library. What you want to do is use the SINE/LINE elements in the RepeatModeler library to boost the EDTA annotation. So you may extract SINE/LINE sequences from the RepeatModeler library, format the sequence names, and provide them to EDTA via [...] You may also want to read the EDTA paper to see how it works. Shujun |
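A minimal sketch of the extraction step described above, under stated assumptions: RepeatModeler library headers are taken to look like `>rnd-1_family-12#LINE/L1 ( ... )`, with the classification after `#`, and the cleaned header keeps only the `name#Class/Superfamily` token. The function name is hypothetical and the header layout should be checked against the actual library file.

```python
def extract_nonltr(lines, classes=("LINE", "SINE")):
    """Return FASTA lines for records whose classification starts with one of classes.

    Headers are reduced to their first token (name#Class/Superfamily);
    any trailing description is dropped.
    """
    kept = []
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            # Everything after the first '#' is treated as the classification.
            classification = name.partition("#")[2]
            keep = classification.startswith(classes)
            if keep:
                kept.append(">" + name)
        elif keep and line:
            kept.append(line)
    return kept
```

The resulting lines can be written to a file and supplied as a curated library in the same way `LINE10.fa` was passed with `--curatedlib` in the test command earlier in this thread.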
any luck? |
Thanks for the update. Can you articulate which superfamily or class of TEs is lower than expected? Your result suggests that the 36 SINEs provided are not annotating any SINE elements in your genome. You may use AnnoSINE to generate the SINE library. Shujun |
Dear @oushujun,
Thank you for developing this useful tool. I have run it successfully and found that SINEs and LINEs could not be identified from structural features. However, we do not have a TE library with extensive manual curation for pigs. We tried to combine the results of EDTA and RepeatMasker for a complete TE identification, but the two gave different results:
####Not about EDTA#######
Thanks and wish you all the best
Li Xin