Crashing with TIR-Learner #4

Closed
Neato-Nick opened this issue Jun 21, 2019 · 35 comments
Labels
bug Something isn't working

Comments

@Neato-Nick

Neato-Nick commented Jun 21, 2019

Hi,

I copied and pasted the installation instructions from the README and am running the script in the active EDTA environment. It seems that the EDTA.pl script chokes when it tries to use TIR-Learner. Looking at my output, all the expected folders are there. After the crash, the Helitron, MITE, and TIR folders are empty but the LTR folder is not. The only file in the parent output folder is genome.fasta.LTR.raw.fa.

Is there a way to run the Perl pipeline script but just not use TIR-Learner, or even just not call TIRs? I'm still interested in the other features, and even if I could just use EDTA for Helitrons, LTRs, MITEs, filtering, consensus calling, and repeat classifying I would be happy.

The lines before the crash start with what's seen in #2 (comment). Then there's a traceback starting from ~/bin/EDTA/bin/TIR-Learner1.12/Module1/Fullcov.py, line 52, in <module> ProcessHomology(genome_Name). After that, there are some cryptic errors, including
cat: '*DTA-+-select.fa': No such file or directory
cat: '*-+-*-+-*.gff3': No such file or directory
There are a few more error traces after that, with each Traceback followed by various errors from files not being found by rm, cp, mv, and cat.

  • TIR-Learner1.12/Module1 (above)
  • TIR-Learner1.12/Module1/Lowcomp_M1.py
  • TIR-Learner1.12/Module2/Lowcomp_M2.py
  • TIR-Learner1.12/

Lastly, in the last few lines before the crash, I get these lines which tell me that it certainly is a problem with TIR-Learner
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
mv: cannot stat 'TIR-Learner/*FinalAnn.gff3': No such file or directory
mv: cannot stat 'TIR-Learner/*FinalAnn.fa': No such file or directory
cp: cannot stat 'TIR-Learner-Result/TIR-Learner_FinalAnn.fa': No such file or directory
Error: TIR results not found!

ERROR: Raw TIR results not found in genome.fasta.EDTA.raw/genome.fasta.TIR.raw.fa at ~bin/EDTA/EDTA.pl line 145.

While bug testing I've just been using the first two scaffolds of my genome. That file is attached.

Thanks!

PR-102_JGI_twoscafs.fasta.zip
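
When a log like the one above is long, it can help to list just the files that were reported missing. The following is a minimal sketch, not part of EDTA or TIR-Learner: the function name missing_paths and the two patterns are assumptions based only on the message shapes quoted in this thread.

```python
import re

# Message shapes seen in this thread:
#   cat: '*DTA-+-select.fa': No such file or directory
#   mv: cannot stat 'TIR-Learner/*FinalAnn.gff3': No such file or directory
#   FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'
SHELL_TOOL = re.compile(
    r"^(?:cat|rm|mv|cp): (?:cannot \w+ )?(.+): No such file or directory$"
)
PYTHON_ERR = re.compile(r"No such file or directory: (.+)$")
QUOTES = "'\"\u2018\u2019"  # straight and curly quotes used by the tools


def missing_paths(log_text):
    """Return the distinct paths reported as missing, in order of first appearance."""
    found = []
    for line in log_text.splitlines():
        line = line.strip()
        match = SHELL_TOOL.match(line) or PYTHON_ERR.search(line)
        if match:
            path = match.group(1).strip(QUOTES)
            if path not in found:
                found.append(path)
    return found


if __name__ == "__main__":
    sample = (
        "cat: '*DTA-+-select.fa': No such file or directory\n"
        "mv: cannot stat 'TIR-Learner/*FinalAnn.gff3': No such file or directory\n"
    )
    for path in missing_paths(sample):
        print(path)
```

The earliest missing file is usually the most informative one, since later rm, cp, and mv failures tend to follow from it.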

@oushujun
Owner

Hi Nick,

We identified this bug in TIR-Learner as you described in detail. A testing version has been pushed in the EDTA branch named "TIR-Learner1.13". Please try that out under the same active EDTA environment (no need to reinstall). In particular, if you want to just test out TIR-Learner, you can:

nohup sh .....EDTA/bin/TIR-Learner1.13/TIR-Learner.sh genome.fa $CPU

For your other question, yes. To do so, you can run these initial TE finders separately, then feed them to the EDTA_process.pl pipeline to make the stage 1 library. If you don't have TIR-Learner results, you can use the MITE-Hunter result to feed the -tir parameter to trick the program.

Please let me know if you still encounter the same issue. Sorry for the inconvenience.

Best,
Shujun

@oushujun added the bug label on Jun 22, 2019
@Neato-Nick
Author

I pulled from the TIR-Learner1.13 branch and ran just TIR-Learner as you suggested. It looks like it's still crashing. Module1 did contain some results, but there was still a temp file in there, so I'm not sure it finished running. The nohup.out is attached.

CPU=4
nohup sh ~/bin/EDTA-TIR-Learner1.13/bin/TIR-Learner1.13/TIR-Learner.sh genome.fasta $CPU

nohup.TIR-Learner1.13.txt

@oushujun
Owner

Hi Nick,

Thanks for testing. @weijiaweijia is working on this. I will update you once we have a new version.

Best,
Shujun

@oushujun
Owner

Hi Nick,

If you have a pressing need, you may run the updated TIR-Learner1.13 branch on your genome. I have temporarily removed the TIR-Learner module from EDTA, so you should be able to run the rest of the pipeline. Note that without TIR-Learner, large TIR elements and autonomous TIR elements will likely be underrepresented in the final library. However, MITE-Hunter should be able to pick up most short TIR elements and MITEs.

Best,
Shujun

@DanJeffries

Hi Shujun,

I have a crash that seems similar to Nick's above. Here is the log file:

Wed Jun 26 13:40:03 CEST 2019   Dependency checking:
                All passed!
Wed Jun 26 13:40:14 CEST 2019   Obtain raw TE libraries using various structure-based programs:
FASTA-Reader: Ignoring invalid residues at position(s): On line 380: 5993-6353
FASTA-Reader: Ignoring invalid residues at position(s): On line 534: 225-229
FASTA-Reader: Ignoring invalid residues at position(s): On line 250: 1224-1597, 1622-1750
FASTA-Reader: Ignoring invalid residues at position(s): On line 252: 520-746
FASTA-Reader: Ignoring invalid residues at position(s): On line 386: 238-242, 1481-1485, 2106-2110, 3127-3131
FASTA-Reader: Ignoring invalid residues at position(s): On line 254: 1936-2299
FASTA-Reader: Ignoring invalid residues at position(s): On line 566: 11-15
FASTA-Reader: Ignoring invalid residues at position(s): On line 472: 268-272
Traceback (most recent call last):
  File "/stn4/djeffrie/EDTA/bin/TIR-Learner1.12/Module1/Fullcov.py", line 52, in <module>
    ProcessHomology(genome_Name)
  File "/stn4/djeffrie/EDTA/bin/TIR-Learner1.12/Module1/Fullcov.py", line 41, in ProcessHomology
    f = pd.read_csv(blast, header=None, sep="\t")
  File "/scratch/temporary/djeffrie/EDTAcondaenv/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/scratch/temporary/djeffrie/EDTAcondaenv/lib/python3.7/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/scratch/temporary/djeffrie/EDTAcondaenv/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/scratch/temporary/djeffrie/EDTAcondaenv/lib/python3.7/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/scratch/temporary/djeffrie/EDTAcondaenv/lib/python3.7/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
cat: *DTA-+-select.fa: No such file or directory
cat: *DTC-+-select.fa: No such file or directory

The issue again seems to be in TIR-learner. Perhaps it is still a TIR learner bug, but one thing that might be worth noting is that I had an issue during installation where scikit-learn=0.19.0 would not install because of some conflict with multiprocesses. I got around this problem by installing them in a different order but I later realised that by default on my cluster, python 3.7 gets installed in the environment and then a version of multiprocess that is only compatible with python 3.7 is installed. I think then the issue with scikit-learn=0.19.0 was because it only works with python 3.6.

So do you think my issue above could be an installation issue, or a bug in TIR-learner?

One more thing, I am dealing with a large genome of about 5 Gb. The LTR programs completed fine, but took about 3 days. So I was wondering if it is possible to re-use these outputs, rather than waiting another 3 days to see if the pipeline will pass the next step?

Thanks a lot in advance for your help

Best

Dan
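
The traceback above ends in pandas.errors.EmptyDataError, which pd.read_csv raises when the file it is given has no content to parse. One way to tell "no BLAST hits" apart from a genuine crash is to check the file before parsing it. The sketch below uses only the standard library; read_blast_table is a hypothetical helper and not something TIR-Learner provides, and the file name is a placeholder.

```python
import csv
import os


def read_blast_table(path):
    """Return the rows of a tab-separated table, or [] if there is nothing to parse.

    Hypothetical helper for illustration; not part of TIR-Learner.
    """
    # A missing or zero-byte file is the situation that trips pd.read_csv.
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return []
    with open(path, newline="") as handle:
        # Blank lines parse to empty lists and are dropped here.
        return [row for row in csv.reader(handle, delimiter="\t") if row]


if __name__ == "__main__":
    # "example.blast.tsv" is a placeholder name, not a file TIR-Learner writes.
    rows = read_blast_table("example.blast.tsv")
    print("rows to process:", len(rows))
```

With a guard like this, a step that finds no hits can be skipped explicitly instead of failing several commands later with "No such file or directory".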

@oushujun
Owner

Hi Dan,

Thanks for testing. Yes, this is an issue with TIR-Learner. We are working on a better version, so please wait a week or two.

For the conflicts between python, scikit-learn, and multiprocess, you may try different versions of python and multiprocess, but the trained models do require scikit-learn=0.19.0 to work properly.

The last suggestion is actually on my to-do list. Good idea!

Again, I am sorry for the bugs keeping you from getting meaningful results. We hope to resolve this issue in the near future.

Best,
Shujun
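
The scikit-learn=0.19.0 pin mentioned above can be checked from inside the active environment before starting a long run. This is a minimal sketch under two assumptions: the package imports as sklearn, and the helper names (installed_version, pin_satisfied) are made up for illustration and are not part of EDTA.

```python
import importlib


def installed_version(module_name):
    """Return module.__version__, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def pin_satisfied(installed, required):
    """True only when the installed version equals the pinned version exactly."""
    return installed is not None and installed == required


if __name__ == "__main__":
    required = "0.19.0"  # pin quoted in this thread for the trained models
    found = installed_version("sklearn")
    status = "OK" if pin_satisfied(found, required) else "MISMATCH"
    print(f"scikit-learn: installed={found} required={required} -> {status}")
```

An exact string comparison is used on purpose, since the thread states a single required version rather than a range.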

@philippbayer

In my case with TIR-Learner, my CentOS installation did not have a realpath executable, which TIR-Learner calls on line 30 of TIR-Learner.sh.

I fixed it like this:

#genomeFile=`realpath $rawFile` #the genome file with real path
genomeFile=`readlink -e "$rawFile"` # quoted so paths with spaces survive

@oushujun
Owner

oushujun commented Jul 1, 2019

Thanks @philippbayer!

I did some research and found this multi-platform solution:

resolve_link() {
  if type -p realpath >/dev/null; then
    realpath "$1"
  elif type -p greadlink >/dev/null; then
    greadlink -f "$1"
  else
    readlink -f "$1"
  fi
}

Ref: basherpm/basher#49 (comment)

Changes will be reflected in the next version.

Best,
Shujun

@Neato-Nick
Author

@oushujun would it be possible to push these changes to a development branch ahead of the release for your next version? Or do you think you are only 1-2 weeks away from your next release?

If my impatience is overwhelming, it might just be easier for me to fix as you and @philippbayer have suggested.

@philippbayer after you made those changes to TIR-Learner, did EDTA run properly?

@philippbayer

@Neato-Nick The main branch didn't like my GLIBC, so I switched to the origin/TIR-Learner1.13 branch for testing. That one is happily chugging along, but it hasn't finished so far (14 threads, plant genome, still in the 'raw' stage). grf-main and blastall are happily consuming resources, but nothing has been written in a while. The TIR, LTR, and MITE directories have data inside them.

@oushujun
Owner

Hi @Neato-Nick and @philippbayer,

Thank you for waiting patiently, and I am sorry for the prolonged time of development. I went to the Evolution meeting 2 weeks ago so there was some delay there.

I am working on a new version of EDTA that will have much better performance in both speed and quality. The main improvement is in TIR-Learner: @weijiaweijia and I are working together on an improved, more generalized prediction model that fits most species. The downstream filtering of TIR elements and Helitrons will also improve: I am working on more thorough filtering of raw predictions, which will make the final library much smaller and better.

I should be able to push these updates in 1-2 weeks if things work well. Our HPC has been down for maintenance for 3 days, so I can do nothing but talk ...

Again, thank you for your interest and testing.

Best,
Shujun

@philippbayer

No worries @oushujun :) I'm just playing around with this software, the outcome doesn't depend on anything. Take all the time you need!!

@oushujun
Owner

oushujun commented Aug 1, 2019

Dear All,

Sorry for the delayed response. I just pushed a bulk update to EDTA and have tested it on different servers; it seems to work now. However, I have not tested it on macOS, so some small differences could cause problems.

For testing purposes, please use a small file, e.g., 20 Mb, for faster turnaround. Please let me know if there are any issues.

Best,
Shujun
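
A small test file like the one suggested above can be cut from a larger genome by keeping whole FASTA records until a size budget is reached. The sketch below is one way to do that with the standard library; take_records is a hypothetical name, and the record-level cutoff is an assumption about what makes a sensible test set (sequences are never truncated mid-record).

```python
def take_records(lines, max_bases):
    """Yield whole FASTA records until at least max_bases bases have been kept.

    The record that crosses the limit is still emitted in full; every
    record after it is dropped.
    """
    kept = 0
    started = False
    for line in lines:
        if line.startswith(">"):
            if kept >= max_bases:
                return  # budget reached, drop the remaining records
            started = True
        elif started:
            kept += len(line.strip())
        if started:
            yield line


if __name__ == "__main__":
    demo = [">scaf1\n", "ACGTACGT\n", ">scaf2\n", "GGCC\n", ">scaf3\n", "TTTT\n"]
    # With a 10-base budget, scaf1 and scaf2 are kept and scaf3 is dropped.
    print("".join(take_records(demo, max_bases=10)), end="")
```

For a real genome, the same function can be fed an open file handle and a budget of about 20,000,000 bases, with the output written to a new FASTA file.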

@philippbayer

Thank you for the update! I'll give it a try this weekend :)

@baozg
Contributor

baozg commented Aug 3, 2019

Hi Shujun,

I tried the new EDTA release. TIR-Learner still fails, while the LTR, MITE, and Helitron steps are fine. Could it be that my genome (a 336M eudicot plant) has a low percentage of TIRs?

The TIR command is below:

perl /data/software/EDTA/20190802/EDTA_raw.pl -genome genome.fa -species others -type tir -threads 24

Here is the error log:

cat: *-+-DTA.fa: No such file or directory
cat: *-+-DTC.fa: No such file or directory
cat: *-+-DTH.fa: No such file or directory
cat: *-+-DTM.fa: No such file or directory
cat: *-+-DTT.fa: No such file or directory
cat: *-+-NonTIR.fa: No such file or directory
cat: *-+-*-+-*.gff3: No such file or directory
rm: cannot remove ‘*-+-*-+-*.gff3’: No such file or directory
Traceback (most recent call last):
  File "/data/software/EDTA/20190802/bin/TIR-Learner1.19/Module3_New/CombineAll.py", line 90, in <module>
    keep=removeIRFhomo("%s.gff3"%(genome_Name+spliter+dataset),remove,"%sClean.gff3"%(genome_Name+spliter+dataset+spliter))
  File "/data/software/EDTA/20190802/bin/TIR-Learner1.19/Module3_New/CombineAll.py", line 76, in removeIRFhomo
    f=pd.read_csv(file,header=None,sep="\t")
  File "/data/software/Anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/data/software/Anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/data/software/Anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/data/software/Anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/data/software/Anaconda3/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Traceback (most recent call last):
  File "/data/software/EDTA/20190802/bin/TIR-Learner1.19/Module3/GetAllSeq.py", line 62, in <module>
    file=open(f,"r+")
FileNotFoundError: [Errno 2] No such file or directory: 'TIR-Learner_FinalAnn.gff3'

@WeijiaSu

WeijiaSu commented Aug 3, 2019 via email

@baozg
Contributor

baozg commented Aug 3, 2019

Hi Weijia,

Thanks for the reply. I didn't find predi.fa-+-200. Where should the result be?

Only the Module3_New directory has results; the other directories are empty.

.
├── temp
│   ├── TIR-Learner-+-Chr10.fasta
│   ├── TIR-Learner-+-Chr10-+-GRFmite.fa
│   ├── TIR-Learner-+-Chr10-+-GRFmite.fa-+-p
├── TIR-Learner
│   ├── TIR-Learner-+-Chr10-+-GRFmite.fa-+-p
│   ├── TIR-Learner-+-Chr1-+-GRFmite.fa-+-p
│   ├── TIR-Learner-+-Chr2-+-GRFmite.fa-+-p
│   ├── TIR-Learner-+-Chr3-+-GRFmite.fa-+-p

@WeijiaSu

WeijiaSu commented Aug 3, 2019 via email

@baozg
Contributor

baozg commented Aug 3, 2019

Hi Weijia,

Thanks for checking.
Module3_New only has temp, TIR-Learner, and TIR-Learner-Result.

@WeijiaSu

WeijiaSu commented Aug 3, 2019 via email

@baozg
Contributor

baozg commented Aug 3, 2019

Hi Weijia,

The total size is 1.1G.

@baozg
Contributor

baozg commented Aug 3, 2019

Hi Weijia,

I found the reason the predi.fa-+-200 result was missing. The new EDTA release updated its installation requirements: the script getDataset.py needs tensorflow and keras. I will install a brand-new environment for EDTA.
Thanks for the update. I will update this issue if I have any new results.

Cheers,
Zhigui
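
A missing dependency like the one described above can be detected up front, without importing the heavy packages themselves. The following sketch asks the import system whether each module can be located; missing_modules is a hypothetical helper, and the module list contains only the two packages named in this comment.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    needed = ["tensorflow", "keras"]  # the two packages named in the comment above
    missing = missing_modules(needed)
    if missing:
        print("Missing from this environment:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

This only confirms that the modules can be located, not that their versions are compatible with each other.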

@baozg
Contributor

baozg commented Aug 4, 2019

Hi all,

Just an update on the testing results. It seems that the new TIR-Learner release resolves this issue, so it can be closed.

  1. Please install a new env for the EDTA 20190802 release.
  2. Follow the steps Shujun provided:
     • EDTA_raw
     • EDTA_processF
     • EDTA -step final
  3. The time and resources used for my plant genome (336M, 58% repeats as estimated by GenomeScope, 24-core machine):

Step        maxvmem   time (h)  raw_fa size
Helitron    7.914GB   2.352222  1.3Mb
MITE        1.529GB   1.815278  4.9kb
TIR         42.127GB  4.895556  20Mb
LTR         19.049GB  1.417222  2.5Mb
EDTA_Final  19.388GB  19.42389  19Mb

Thanks for developing this.

Best,
Zhigui
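
For planning a cluster job, the figures in the table above can be reduced to two numbers: the summed run time and the largest memory footprint. The sketch below assumes the steps run one after another, which the comment does not state, so the total is an upper-bound style estimate rather than a measured wall time.

```python
# (step, peak memory in GB, time in hours), copied from the table in the comment above
RUNS = [
    ("Helitron", 7.914, 2.352222),
    ("MITE", 1.529, 1.815278),
    ("TIR", 42.127, 4.895556),
    ("LTR", 19.049, 1.417222),
    ("EDTA_Final", 19.388, 19.42389),
]


def summarize(runs):
    """Return (summed hours, step with the highest peak memory, that peak in GB)."""
    total_hours = sum(hours for _step, _mem, hours in runs)
    peak_step, peak_mem, _hours = max(runs, key=lambda run: run[1])
    return total_hours, peak_step, peak_mem


if __name__ == "__main__":
    total, step, mem = summarize(RUNS)
    print(f"summed time: {total:.1f} h, peak memory: {mem} GB in step {step}")
```

On these numbers the TIR step sets the memory request, while the final EDTA step accounts for most of the time.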

@oushujun
Owner

I just pushed some new updates to EDTA, mainly to fix the TIR-Learner issue. Please reinstall EDTA and rerun it in the same work folder. Existing results will be reused so there is essentially no waste of time. Thank you for your patience and support!

@oushujun
Owner

I consider this issue resolved. Please reopen it if it isn't. Thank you all for testing. Shujun

@Neato-Nick
Author

I can't find TIR-Learner on GitHub, but I'd rather be opening issues there. The errors I'm running into occur almost exclusively while running TIR-Learner; EDTA itself is doing just fine. @oushujun Do you know if this is the right repo to post to? https://github.com/weijiaweijia/TIR-Learner-Rice

@oushujun
Owner

oushujun commented Aug 30, 2019 via email

@aaronphillips7493

Hello, I installed EDTA following the instructions for a conda install. I have run EDTA using the following commands:
perl ../EDTA.pl --genome $GENOME --cds $CDS --curatedlib $CURATEDLIB --overwrite 0 --sensitive 1 --anno 1 --species Rice --evaluate 1 --threads 10

It works for LTR, but it crashes at the TIR step. Please see the error below:
Species: Rice
Traceback (most recent call last):
  File "/hpcfs/users/a1779884/rice_genomics/EDTA/bin/TIR-Learner2.5/Module1/Fullcov.py", line 58, in <module>
    ProcessHomology(genome_Name)
  File "/hpcfs/users/a1779884/rice_genomics/EDTA/bin/TIR-Learner2.5/Module1/Fullcov.py", line 47, in ProcessHomology
    f = pd.read_csv(blast, header=None, sep="\t")
  File "/hpcfs/users/a1779884/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/hpcfs/users/a1779884/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 452, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/hpcfs/users/a1779884/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 946, in __init__
    self._make_engine(self.engine)
  File "/hpcfs/users/a1779884/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 1178, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/hpcfs/users/a1779884/.conda/envs/EDTA/lib/python3.6/site-packages/pandas/io/parsers.py", line 2008, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 540, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
cat: *DTC-+-select.fa: No such file or directory
cat: *DTH-+-select.fa: No such file or directory
cat: *DTM-+-select.fa: No such file or directory
cat: *DTT-+-select.fa: No such file or directory

I have read this thread, but have not found anything to help me (I am a novice, so maybe that is why). Can someone please help me understand what is going on here, and help me figure out how to fix it?

Thank you,
Aaron :)

@oushujun
Owner

oushujun commented Jan 16, 2021 via email

@aaronphillips7493

aaronphillips7493 commented Jan 16, 2021

Hi Shujun,

Thank you for your quick reply!

To install I did the following on Thursday 14th Jan 2021:

conda create -n EDTA

conda activate EDTA

conda config --env --add channels anaconda --add channels conda-forge --add channels bioconda

conda install -n EDTA -y cd-hit repeatmodeler muscle mdust blast openjdk perl perl-text-soundex multiprocess regex tensorflow=1.14.0 keras=2.2.4 scikit-learn=0.19.0 biopython pandas glob2 python=3.6 tesorter genericrepeatfinder genometools-genometools ltr_retriever ltr_finder numpy=1.16.4

git clone https://github.com/oushujun/EDTA

And then I ran the test, which worked.

Please find the list of packages in the EDTA env attached here.
edta.env.list.txt

I have just refreshed the EDTA page and the instructions for installation appear to be different now. Were they recently updated, and perhaps that is why I am having issues?

Thank you again,
Aaron :)

@oushujun
Owner

Hi Aaron,

Thanks for the details. There may be a conflict between keras and numpy. You can use the latest installation file (the yml file), which is frozen from a successful installation. You may need to modify the first line of that file, changing "EDTA" to something else (e.g., EDTA1.9.5), to avoid conflicts with your current env.

Best,
Shujun

@aaronphillips7493

Hey Shujun,

I reinstalled EDTA using the .yml file. I re-ran my analyses with the overwrite option switched off (to avoid redoing the LTR finding) and I got the same errors again. I am now trying to rerun EDTA with the overwrite option switched on, so will let you know how that goes.

Thanks again for your suggestions,
Aaron :)

@oushujun
Owner

Hi Aaron,

That is one of my thoughts too: you may have run EDTA multiple times in the same folder, and some failed runs may have left files in a bad state that prevents new runs from proceeding. Overwriting the existing files would be a good choice. If you want to keep the LTR results, you can run EDTA_raw with --type TIR --overwrite 1 to overwrite just the TIR results.

Best,
Shujun
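
Before deciding what to overwrite, it can help to see which leftover files from earlier runs are empty. The sketch below walks a work folder and lists zero-byte files. The helper name find_empty_files is hypothetical, the folder name is taken from the error message earlier in this thread, and treating empty files as suspect is an assumption, since some tools may legitimately write empty outputs.

```python
import os


def find_empty_files(root):
    """Return the sorted paths of zero-byte regular files under root."""
    empty = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


if __name__ == "__main__":
    # Folder name as it appears in the error message quoted earlier in this thread.
    for path in find_empty_files("genome.fasta.EDTA.raw"):
        print(path)
```

If the folder does not exist, the loop simply prints nothing, so the check is safe to run in any working directory.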

@aaronphillips7493

aaronphillips7493 commented Jan 17, 2021

Hey Shujun,

LTR step worked, but TIR failed again with the same errors.

I noticed that when I try to do just TIR with --type TIR --overwrite 1 I instantly get the error:
Failed to parse command line

Do you have any other suggestions?

Thank you again,
Aaron :)

@oushujun
Owner

oushujun commented Jan 19, 2021 via email

jguhlin added a commit that referenced this issue Oct 30, 2024
Add main.nf into the new branch

7 participants