This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

rna-seq build fails because of missing bowtie2 index files #30

Closed
malachig opened this issue Oct 25, 2013 · 8 comments

@malachig
Collaborator

It seems that the reference sequence build incorporated into GMS1 is missing many files compared to the same build on the TGI filesystem. Some of the missing files are critical: for example, rna-seq builds fail when they expect bowtie2 indices to be there and they are not:

Compare the contents of this (46 Gb):
ls /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/

With the contents of this (5.9 Gb):
ls /gscmnt/ams1102/info/model_data/2869585698/build106942997/
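To make the comparison above concrete, here is a minimal sketch that prints the entries present in one build directory but absent from another. `list_missing` is a hypothetical helper, not part of GMS, and it assumes both directories are reachable from the same host (otherwise the two `ls` listings would need to be saved and compared separately).

```shell
# list_missing REFERENCE_DIR COPY_DIR
# Print the names found in REFERENCE_DIR that do not exist in COPY_DIR.
# Hypothetical helper for illustration; only top-level entries are compared.
list_missing() {
    local ref_dir="$1"
    local copy_dir="$2"
    local path name
    for path in "$ref_dir"/*; do
        # Skip the literal pattern when REFERENCE_DIR is empty.
        [ -e "$path" ] || continue
        name=$(basename "$path")
        [ -e "$copy_dir/$name" ] || echo "$name"
    done
}
```

For example, `list_missing /gscmnt/ams1102/info/model_data/2869585698/build106942997 /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997` would list what the GMS1 copy lacks, if both paths were mounted on one machine.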

Maybe these files do not need to be there anyway because it seems like the bowtie2 index files were created during the rna-seq build attempt and stored here:

/opt/gms/HU9D538/fs/HU9D538/info/model_data/ref_build_aligner_index_data/2869585698/build106942997/aligner-index-precise64-vagrant-13502-4bf701b63ced11e3b0cc080027880ca6/bowtie/2_0_0_beta7/

But the following command during the rna-seq build is looking for them here:
Error: Could not find Bowtie 2 index files. /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa.*.bt2

/usr/bin/tophat2.0.4 -p 4 --transcriptome-only --transcriptome-index '/tmp/9.tmpdir/gm-genome_sys-2013-10-24_20_45_57--iyBS/annotation-index-precise64-vagrant-13502-513acabe3d0f11e3a0dc080027880ca6/all_sequences' -G /opt/gms/GMS1/fs/gc12001/info/model_data/2772828715/build124434505/annotation_data/rna_annotation/106942997-all_sequences.gtf --output-dir '/tmp/9.tmpdir/gm-genome_sys-2013-10-24_20_45_57--iyBS/anonymous1' /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa /tmp/9.tmpdir/gm-genome_sys-2013-10-24_20_45_57--iyBS/anonymous0/fake_reads.fastq

Incidentally, this is kind of a 'dummy' rna-seq alignment run, perhaps performed for the purpose of creating genome and transcriptome indices?
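Since tophat aborts when the `all_sequences.fa.*.bt2` files are not next to the reference FASTA, a quick pre-flight check can save a long wait. The sketch below tests for the six index files named later in this thread; `check_bt2_index` is a hypothetical helper, not a GMS or Bowtie 2 command.

```shell
# check_bt2_index PREFIX
# Report which of the six Bowtie 2 index files are missing (or empty) for an
# index prefix such as .../build106942997/all_sequences.fa.
# Hypothetical helper for illustration only.
check_bt2_index() {
    local prefix="$1"
    local missing=0
    local suffix
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -s "${prefix}.${suffix}" ]; then
            echo "missing: ${prefix}.${suffix}"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every index file is present and non-empty.
    [ "$missing" -eq 0 ]
}
```

Usage: `check_bt2_index /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa`. A non-zero exit status means at least one file is absent.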

@sakoht
Contributor

sakoht commented Oct 25, 2013

Even though the refseq build has a bunch of index files, they are from back before we had a formal software result for aligner indexes, and in theory are no longer used.

This theory clearly fails. :(

Probably the best thing is to grab the few things we die on and put them in the FTP staging location. Then redo the sync after the files stage.


@malachig
Collaborator Author

Yeah, I was thinking that is the short-term fix as well. Also, building the indexes takes a long time, so having them pre-generated for the aligners used in the tutorial exercise will make it go faster for the user.

For clarity, I am going to place the files in the TGI staging dir here:
/gscmnt/sata102/info/ftp-staging/pub/software/gms/testdata/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/

I will get the files from TGI reference annotation build here:
/gscmnt/ams1102/info/model_data/2869585698/build106942997/

The files we are missing are:
ls /gscmnt/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa.*.bt2
all_sequences.fa.1.bt2 all_sequences.fa.2.bt2 all_sequences.fa.3.bt2 all_sequences.fa.4.bt2 all_sequences.fa.rev.1.bt2 all_sequences.fa.rev.2.bt2

The copy command is as follows:
cp /gscmnt/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa.*.bt2 /gscmnt/sata102/info/ftp-staging/pub/software/gms/testdata/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/

Monitor progress of staging here:
http://genome.wustl.edu/pub/software/gms/testdata/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/

Once staging is complete I will run the rsync command again:
genome sys gateway attach GMS1 --protocol ftp --rsync

And these missing files should appear here on my VM system:
/opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997
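As a sanity check on the copy step above, the following sketch compares each source index file byte-for-byte against its counterpart in a destination directory. `verify_copies` is a hypothetical helper, and it assumes the source and destination are both readable from the host where it runs (true for the TGI staging copy, not necessarily for the VM).

```shell
# verify_copies SRC_PREFIX DEST_DIR
# Compare every SRC_PREFIX.*.bt2 file with the same-named file in DEST_DIR.
# Hypothetical helper for illustration; uses POSIX cmp for the comparison.
verify_copies() {
    local src_prefix="$1"
    local dest_dir="$2"
    local status=0
    local src name
    for src in "$src_prefix".*.bt2; do
        # An unmatched glob stays literal, so stop if nothing matched.
        if [ ! -e "$src" ]; then
            echo "no index files match ${src_prefix}.*.bt2"
            return 1
        fi
        name=$(basename "$src")
        if ! cmp -s "$src" "$dest_dir/$name"; then
            echo "differs or missing: $name"
            status=1
        fi
    done
    return "$status"
}
```

For the copy in this comment, the call would be `verify_copies /gscmnt/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa /gscmnt/sata102/info/ftp-staging/pub/software/gms/testdata/GMS1/fs/ams1102/info/model_data/2869585698/build106942997`.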

@malachig
Collaborator Author

This worked. Relaunching the rna-seq build to see if I can get past the index building step now.

When this new build launched, the first thing I noticed is that it is still trying to generate indices:

Finding or generating reference build index for aligner per-lane-tophat version 2.0.4 params -p 4 --bowtie-version=2.0.0-beta7 refbuild 106942997

This step completed successfully before, but perhaps because the overall step crashed, the result was not correctly logged in the DB?

Unfortunately, this means I will now have to wait many hours before I know if the same crash is going to happen.

Talking to Jason, it seems like this issue has been solved in 'gms-core master' before the last merge into 'gms-core pub'.

See here for details:
http://git/cgi-bin/gitweb.cgi?p=genome.git;a=commitdiff;h=57a461c2c37701272047615d03911ff2b87d6379

This is yet another example of why it would be great to merge from master into pub more regularly. The merge is already an active issue: #23

@malachig
Collaborator Author

One possible short-term fix for this issue is to selectively merge the bug fixes made to the following module from 'gms-core master' into 'gms-core pub':

lib/perl/Genome/InstrumentData/AlignmentResult/PerLaneTophat.pm

@sakoht
Contributor

sakoht commented Oct 30, 2013

This should be a moot point now that master is merged into gms-pub, right?

@malachig
Collaborator Author

I think so. Need to confirm still. I definitely got past this point in my last round of tests...

@malachig
Collaborator Author

The merge is complete, but a fresh rna-seq build still has the same issue.

2013-11-27 16:38:01-0600 clia1: Finding or generating reference build index for aligner per-lane-tophat version 2.0.4 params -p 4 --bowtie-version=2.0.0-beta7 refbuild 106942997

Currently it seems that we are not able to find software results for bwa indexes in reference-alignment or bowtie indexes in rna-seq alignments, despite the fact that the actual data files seem to be present in both cases.

This slows down testing, and even if we get a small data set (i.e. TST2 with a small number of reads), the total run time of the demonstration analysis will still be high because generating fresh reference indexes is slow.

@malachig
Collaborator Author

This is successfully resolved with the closing of issue 15.
