rna-seq build fails because of missing bowtie2 index files #30

malachig · 2013-10-25T02:35:54Z

It seems that the reference sequence build that gets incorporated into GMS1 is missing many files compared to the same build on the TGI filesystem. Some critical files are missing, and rna-seq builds fail, for example, when they expect bowtie2 indices that are not there:

Compare the contents of this (46 Gb):
ls /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/

With the contents of this (5.9 Gb):
ls /gscmnt/ams1102/info/model_data/2869585698/build106942997/
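
Rather than comparing the two `ls` listings by eye, the difference can be listed directly. This is only a minimal sketch in POSIX shell: the helper name `only_in_first` is made up for illustration and is not part of the GMS. It looks only at top-level, non-hidden entries.

```shell
# Print the names of entries that exist in directory $1 but not in directory $2.
# only_in_first is a hypothetical helper name, not a GMS command.
only_in_first() {
    for f in "$1"/*; do
        # Skip the unexpanded pattern when $1 is empty.
        [ -e "$f" ] || continue
        name=$(basename "$f")
        [ -e "$2/$name" ] || printf '%s\n' "$name"
    done
}

# Example with the two paths from this report (run where both are mounted):
# only_in_first /gscmnt/ams1102/info/model_data/2869585698/build106942997 \
#               /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997
```

The first argument is the copy assumed to be complete (the TGI build), so the output is the set of files the GMS1 copy lacks.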

Maybe these files do not need to be there anyway because it seems like the bowtie2 index files were created during the rna-seq build attempt and stored here:

/opt/gms/HU9D538/fs/HU9D538/info/model_data/ref_build_aligner_index_data/2869585698/build106942997/aligner-index-precise64-vagrant-13502-4bf701b63ced11e3b0cc080027880ca6/bowtie/2_0_0_beta7/

But the following command during the rna-seq build is looking for them here:
Error: Could not find Bowtie 2 index files. /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa.*.bt2

/usr/bin/tophat2.0.4 -p 4 --transcriptome-only --transcriptome-index '/tmp/9.tmpdir/gm-genome_sys-2013-10-24_20_45_57--iyBS/annotation-index-precise64-vagrant-13502-513acabe3d0f11e3a0dc080027880ca6/all_sequences' -G /opt/gms/GMS1/fs/gc12001/info/model_data/2772828715/build124434505/annotation_data/rna_annotation/106942997-all_sequences.gtf --output-dir '/tmp/9.tmpdir/gm-genome_sys-2013-10-24_20_45_57--iyBS/anonymous1' /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa /tmp/9.tmpdir/gm-genome_sys-2013-10-24_20_45_57--iyBS/anonymous0/fake_reads.fastq

Incidentally, this is kind of a 'dummy' rna-seq alignment run, perhaps performed for the purpose of creating genome and transcriptome indices?

sakoht · 2013-10-25T03:25:59Z

Even though the refseq build has a bunch of index files, they are from back before we had a formal software result for aligner indexes, and in theory are no longer used.

This theory clearly fails. :(

Probably the best thing is to grab the few things we die on and put them in the FTP staging location. Then redo the sync after the files stage.


malachig · 2013-10-25T14:46:35Z

Yeah, I was thinking that is the short term fix as well. Also, building the indexes takes a long time, so having them pre-generated for the aligners used in the tutorial exercise will make that go faster for the user.

For clarity, I am going to place the files in the TGI staging dir here:
/gscmnt/sata102/info/ftp-staging/pub/software/gms/testdata/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/

I will get the files from TGI reference annotation build here:
/gscmnt/ams1102/info/model_data/2869585698/build106942997/

The files we are missing are:
ls /gscmnt/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa.*.bt2
all_sequences.fa.1.bt2 all_sequences.fa.2.bt2 all_sequences.fa.3.bt2 all_sequences.fa.4.bt2 all_sequences.fa.rev.1.bt2 all_sequences.fa.rev.2.bt2

The copy command is as follows:
cp /gscmnt/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa.*.bt2 /gscmnt/sata102/info/ftp-staging/pub/software/gms/testdata/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/
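
Since these files are large and the staging area feeds a later sync, it may be worth confirming each copy byte-for-byte. A hedged sketch in POSIX shell, using only `cp` and `cmp`; the function name `copy_and_verify` is hypothetical and not something the GMS provides:

```shell
# Copy every "<prefix>.*.bt2" file into a destination directory and
# confirm each copy is identical to its source.
# copy_and_verify is a hypothetical helper name, not a GMS command.
# $1 = index prefix (e.g. .../all_sequences.fa), $2 = destination directory
copy_and_verify() {
    for src in "$1".*.bt2; do
        # If nothing matched, the pattern is left unexpanded and does not exist.
        [ -e "$src" ] || { echo "no .bt2 files match prefix $1" >&2; return 1; }
        cp "$src" "$2"/ || return 1
        # cmp -s is silent and exits non-zero if the files differ.
        cmp -s "$src" "$2/$(basename "$src")" || { echo "mismatch: $src" >&2; return 1; }
    done
}

# Example with the paths from this comment:
# copy_and_verify /gscmnt/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa \
#   /gscmnt/sata102/info/ftp-staging/pub/software/gms/testdata/GMS1/fs/ams1102/info/model_data/2869585698/build106942997
```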

Monitor progress of staging here:
http://genome.wustl.edu/pub/software/gms/testdata/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/

Once staging is complete, I will run the rsync command again:
genome sys gateway attach GMS1 --protocol ftp --rsync

And these missing files should appear here on my VM system:
/opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997
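
After the sync, a quick check can confirm that all six index files listed above are present and non-empty. This is a sketch only: `missing_bt2_files` is a made-up name, and the suffix list is taken from the `ls` output earlier in this comment.

```shell
# Print any of the six bowtie2 index files that are absent or empty for a
# given prefix; return non-zero if at least one is missing.
# missing_bt2_files is a hypothetical helper name, not a GMS command.
# $1 = index prefix, e.g. /path/to/all_sequences.fa
missing_bt2_files() {
    rc=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        f="$1.$suffix.bt2"
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "$f" ]; then
            printf '%s\n' "$f"
            rc=1
        fi
    done
    return "$rc"
}

# Example with the path from this comment:
# missing_bt2_files /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa \
#   && echo "all bowtie2 index files present"
```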

malachig · 2013-10-25T15:58:07Z

This worked. Relaunching the rna-seq build to see if I can get past the index building step now.

When this new build launched, the first thing I noticed is that it is still trying to generate indices:

Finding or generating reference build index for aligner per-lane-tophat version 2.0.4 params -p 4 --bowtie-version=2.0.0-beta7 refbuild 106942997

This step completed successfully before, but perhaps because the overall step crashed, the result was not correctly logged in the DB?

Unfortunately, this means I will now have to wait many hours before I know if the same crash is going to happen.

Talking to Jason, it seems like this issue has been solved in 'gms-core master' before the last merge into 'gms-core pub'.

See here for details:
http://git/cgi-bin/gitweb.cgi?p=genome.git;a=commitdiff;h=57a461c2c37701272047615d03911ff2b87d6379

This is yet another example of why it would be great to merge from master into pub more regularly. The merge is already an active issue: #23

malachig · 2013-10-25T17:48:08Z

One possible short term fix for this issue that might work is to selectively merge the bug fixes made to the following module from 'gms-core master' into 'gms-core pub':

lib/perl/Genome/InstrumentData/AlignmentResult/PerLaneTophat.pm

sakoht · 2013-10-30T02:48:32Z

This should be a moot point now that master is merged into gms-pub, right?

malachig · 2013-10-30T17:17:18Z

I think so. Need to confirm still. I definitely got past this point in my last round of tests...

malachig · 2013-11-27T23:21:07Z

The merge is complete, but a fresh rna-seq build still has the same issue.

2013-11-27 16:38:01-0600 clia1: Finding or generating reference build index for aligner per-lane-tophat version 2.0.4 params -p 4 --bowtie-version=2.0.0-beta7 refbuild 106942997

Currently it seems that we are not able to find software results for bwa indexes in reference-alignment or bowtie indexes in rna-seq alignments, despite the fact that the actual data files seem to be present in both cases...

This slows down testing, and even if we get a small data set (e.g. TST2 with a small number of reads), the total run time of the demonstration analysis will still be high because generating fresh reference indexes is slow.

malachig · 2014-01-28T05:01:49Z

This is successfully resolved with the closing of issue 15.

ghost assigned malachig Jan 3, 2014

malachig mentioned this issue Jan 3, 2014

Exome RefAlign builds consistently crash at reference_coverage step in workflow [AND] Reference Alignment software results are not exported correctly to the standalone gms. #15

Closed

malachig closed this as completed Jan 28, 2014
