Exome RefAlign builds consistently crash at reference_coverage step in workflow [AND] Reference Alignment software results are not exported correctly to the standalone gms. #15
Comments
Do we have that as an input on those models? If so, is it a "name" instead of an ID? If so the dumper "genome model export metadata" will need some exception logic like on the target_region_set_name for refalign.
|
I believe there are two inputs on exome ref-align models that point to these feature list objects: '--target-region-set-names' and '--region-of-interest-set-name'. In order to run the exome analysis you need to define both a target region set name and a region of interest. This is generally done by name when you do it at the command line. Since I can't do an install because of the dependency issue, I can't confirm this right now. It does seem like there is logic in "/lib/perl/Genome/Model/Command/Export/Metadata.pm" to dump target region software results. Maybe this was added after the metadata dump was created? Or maybe something is also needed for region of interest software results? Or maybe it just isn't working quite right? Wish I understood the whole software results business a bit better... |
As we discussed verbally, the real issue here was that the DB data for the feature list was present, but the FS data was not. I just copied it into the FTP staging location: If you ever forget where the staging directory is for the FTP site, I just do this, because it is in a comment in the Makefile (the commented-out code does scp via a random blade): grep blade Makefile We will also need to sync Amazon from there. Since the tool requires Ubuntu precise, that will have to happen after doing an install via FTP locally. |
Ok, sounds good. For future reference and convenience, at this time the staging dir is here: /gscmnt/sata102/info/ftp-staging/pub/software/gms/testdata/GMS1/setup/archive-files/ |
We are still blocked on reference-alignment at this step, but I think slightly further. Now we get the following errors: ERROR: Calling get on SoftwareResult (unless getting by id) is slow and possibly incorrect. Looking at the code, it is not immediately obvious what is going on there... Anyone know who might be familiar with this code? |
From Tom: "The "algorithm" is the name of a subroutine defined in [tmooney@linus284:~]$ genome model reference-sequence converter list-f 108573982 nimblegen-human-buildhg19 (108563338) GRCh37-lite-build37 The algorithm must be defined when the converter is created or else the |
If I do this test inside the standalone GMS:
genome model reference-sequence converter list -f destination_reference_build_id=106942997
ID         SOURCE_REFERENCE_BUILD                 DESTINATION_REFERENCE_BUILD      ALGORITHM  RESOURCE
108573982  nimblegen-human-buildhg19 (108563338)  GRCh37-lite-build37 (106942997)
In other words, 'algorithm' is defined inside the TGI but is not defined in the standalone GMS. This is a problem with this module: genome/lib/perl/Genome/Model/Command/Export/Metadata.pm
Metric associations with models and software results are not being obtained when grabbing the content to dump and then import into the standalone GMS for the demonstration analysis. Refer to this code section:
for my $ext (qw/Input Param/) {
    my $related_class = $base_class . "::$ext";
    if (UR::Object::Type->get($related_class)) {
        my $owner_method;
        my $value_method;
        my $value_method2;
        if ($obj->isa("Genome::Model")) {
            $owner_method = "model_id";
            $value_method = "value";
        }
        elsif ($obj->isa("Genome::Model::Build")) {
            $owner_method = "build_id";
            $value_method = "value";
        }
        elsif ($obj->isa("Genome::SoftwareResult")) {
            $owner_method = "software_result_id";
            $value_method = "value_obj";
            $value_method2 = "value_id";
        }
        else {
            next;
        }
        my @assoc = $related_class->get($owner_method => $obj->id);
        for my $a (@assoc) {
            my $v = $a->$value_method;
            unless ($v) {
                my $id = $a->$value_method2;
                die if not defined $id;
                $v = UR::Value::Text->get($id);
            }
            unless ($sanitize_map->{$v->id} and $sanitize_map->{$v->id} == $obj->id) {
                $self->add_to_dump_queue($a, $queue, $exclude, $sanitize_map) unless $exclude->{$final_class};
                $self->add_to_dump_queue($v, $queue, $exclude, $sanitize_map);
            }
        }
    }
}
|
Fixing this might be as simple as changing:
for my $ext (qw/Input Param/) {
to
for my $ext (qw/Input Param Metric/) {
Then of course we would have to regenerate the metadata dump, update this in the FTP staging dir, redo the import in the standalone GMS and test.... |
To test whether fixing this issue properly will actually get this build past this point we can manually patch the database to contain the missing algorithm like so: perl -MGenome -e 'Genome::Model::Build::ReferenceSequence::Converter->get(108573982)->algorithm("convert_chrXX_contigs_to_GL"); UR::Context->commit();' But we should really fix the metadata exporter and not do this. |
Manually adding the algorithm to the database worked, but now I am getting a bunch of errors like this: DBD::Pg::st execute failed: ERROR: insert or update on table "instance" violates foreign key constraint "instance_peer_instance_id_fkey" |
I tried adding the line of code described above to Metadata.pm: for my $ext (qw/Input Param Metric/) { But when I try to regenerate the metadata I now get this error:
export Genome::SoftwareResult::Param (Genome::SoftwareResult::Param): variant_type:snv
ERROR: Can't locate object method "value_obj" via package "Genome::SoftwareResult::Metric" (perhaps you forgot to load "Genome::SoftwareResult::Metric"?) at /gscuser/mgriffit/git/genome/lib/perl/Genome/Model/Command/Export/Metadata.pm line 310 |
Metrics are a little asymmetrical from inputs and params because they don't have value_{class_name,id,obj}, just value. This probably needs a similar loop that just copies "value".
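A minimal sketch of that asymmetry, assuming nothing about the real Genome classes: MockParam, MockMetric and exported_value are hypothetical stand-ins made up for illustration. The idea is to ask each association row which accessor it supports instead of assuming value_obj exists. Unlike the exporter, the fallback here returns the raw value_id rather than wrapping it in UR::Value::Text.

```perl
use strict;
use warnings;

# Stand-in for an Input/Param row on a software result: value_obj may be
# undefined, in which case value_id holds the raw value.
{
    package MockParam;
    sub new {
        my ($class, %args) = @_;
        my $self = { %args };
        return bless $self, $class;
    }
    sub value_obj { return $_[0]{value_obj} }
    sub value_id  { return $_[0]{value_id} }
}

# Stand-in for a Metric row: only a plain "value" accessor exists.
{
    package MockMetric;
    sub new {
        my ($class, %args) = @_;
        my $self = { %args };
        return bless $self, $class;
    }
    sub value { return $_[0]{value} }
}

# Hypothetical helper: return whatever should be exported for one association.
sub exported_value {
    my ($assoc) = @_;
    if ($assoc->can('value_obj')) {
        # Input/Param style: prefer the object, fall back to the raw id.
        my $v = $assoc->value_obj;
        return $v if defined $v;
        my $id = $assoc->value_id;
        die "association has neither value_obj nor value_id\n" unless defined $id;
        return $id;
    }
    # Metric style: just copy "value".
    return $assoc->value;
}

print exported_value(MockParam->new(value_id => 'snv')), "\n";
print exported_value(MockMetric->new(value => 42)), "\n";
```

Checking with can() keeps one loop for all three association types; a separate Metric-only loop, as suggested above, would work just as well.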
|
The latest test still has some issue with obtaining a FeatureList during the reference coverage step. 2013-11-23 08:52:12-0600 clia1: ERROR: Calling get on SoftwareResult (unless getting by id) is slow and possibly incorrect. I'm guessing this is still some problem with the metadata export/import process? If I try the following query inside TGI I get a file path among other values for the feature list in question. If I try the same query in the standalone GMS, all values look the same, except 'OUTPUT_DIR' is NULL: % genome feature-list list id=7BF768EF51FB11E1A0743039993C62A0 Why? Are any other values related to the feature list also not being defined properly? The actual data does seem to be present in the standalone GMS as part of the TST1 data mount here: Furthermore, inside the TGI: % genome feature-list list id=7BF768EF51FB11E1A0743039993C62A0 --show disk_allocation Gives: But in the standalone GMS, this is NULL. |
I expect this latest error has something to do with this code in Genome/FeatureList.pm
Perhaps the needed disk allocation is not being imported when the database is primed. If that is the case, then the problem likely lies in: Genome/Model/Command/Export/Metadata.pm |
I did an export from the TGI side and stored the results in 2891454740-2013.11.25 and did a |
ssmith@blade16-4-16 ~> genome feature-list list 7BF768EF51FB11E1A0743039993C62A0 11111001 capture chip set nimblegen true-BED exome nimblegen-human-buildhg19 (108563338) /opt/gms/GMS1/fs/gc4095/info/feature_list/7BF768EF51FB11E1A0743039993C62A0 |
The core tries to call file_path instead of output_dir. Does that work?
|
Run the lister and show file_path and disk_allocation. It may be that one or some of those fail.
|
The ref align model succeeded! Looks like just doing a fresh export and import worked. Might this be due to the recent changes in table schemas? === Build === |
While it is not entirely clear why redoing the export/import worked here, for now I have copied the data dump file described above '2891454740-2013.11.25.dat' into the staging dir here: It will stage here: I also updated the README.md to specify use of this file. I will now update the database priming in the clia test box and try a new reference alignment there as well. This was done by |
After importing the new meta data:
% genome feature-list list id=7BF768EF51FB11E1A0743039993C62A0 --show output_dir,disk_allocation
OUTPUT_DIR DISK_ALLOCATION
/opt/gms/GMS1/fs/gc4095/info/feature_list/7BF768EF51FB11E1A0743039993C62A0 /opt/gms/GMS1/fs/gc4095/info/feature_list/7BF768EF51FB11E1A0743039993C62A0 |
The new data dump seems to have added a disk allocation and various other things that were not in the old data dump. This seems to resolve the issue we were having above. BUT, at the same time we have lost meta-data related to the aligner indexes for bwa 0.5.9. This means that when reference alignment runs, the indexes have to be built from scratch... Kind of unfortunate since we go through the trouble of copying over the index files... Why is the software result for the aligner index no longer being exported? To see the differences between the old and the new meta-data dumps you can diff these files: http://genome.wustl.edu/pub/software/gms/testdata/GMS1/export/2891454740-2013.11.1.dat How are these things different while the test case for genome model export metadata continues to pass? |
To test for existence of software results related to indexing we should be able to do this: Or more specifically: Now try grepping for SoftwareResult in the metadata .dat file: % cat 2891454740-2013.11.25.dat | grep Software | grep aligner It appears that the Nov. 1 dat file had software results (Genome::SoftwareResult::Param) for 'bwa' and '0.5.9' but these were lost in the Nov. 25 dat file. Neither seems to make any mention of Bowtie-related software results. But they do both have ProcessingProfile params (Genome::ProcessingProfile::Param) related to Bowtie... Not sure of the significance of this... If I try to regenerate the .dat file again from a fresh checkout, the 'aligner' results that were present Nov. 1 are still gone. Same thing if I run it from a stable branch. Are these software results even in the database on the TGI side still? Yes. |
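The grep comparison above could be scripted roughly as follows. This is only a sketch: matching_lines and lost_lines are hypothetical helpers, and the sample lines are invented because the real .dat layout is not shown in this thread. In practice the two arrays would be read from the Nov. 1 and Nov. 25 dump files.

```perl
use strict;
use warnings;

# Keep only the lines that match every pattern (like chaining greps).
sub matching_lines {
    my ($lines, @patterns) = @_;
    my @hits = @$lines;
    for my $re (@patterns) {
        @hits = grep { $_ =~ $re } @hits;
    }
    return @hits;
}

# Lines present in the old dump but absent from the new one.
sub lost_lines {
    my ($old, $new) = @_;
    my %in_new = map { ($_ => 1) } @$new;
    return grep { !$in_new{$_} } @$old;
}

# Invented sample lines; the real .dat layout is an assumption here.
my @old_dump = (
    'Genome::SoftwareResult::Param aligner_name bwa',
    'Genome::SoftwareResult::Param aligner_version 0.5.9',
    'Genome::ProcessingProfile::Param aligner_name bowtie',
);
my @new_dump = (
    'Genome::ProcessingProfile::Param aligner_name bowtie',
);

my @old_hits = matching_lines(\@old_dump, qr/Software/, qr/aligner/);
my @new_hits = matching_lines(\@new_dump, qr/Software/, qr/aligner/);
print "lost: $_\n" for lost_lines(\@old_hits, \@new_hits);
```

Comparing only the filtered lines keeps the report focused on aligner software results instead of every difference between the two dumps.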
Both this issue and issue #30 are now stuck on an apparent inability to recognize software results that are imported into the sGMS from TGI by:
For both rna-seq and reference-alignment pipelines we import software results for aligner-specific reference genome indexes, i.e., an indexed version of the reference genome created using the aligner program and the specific version that will be used for alignment of reads. For both pipelines, these software results are not recognized and the indexes are created from scratch the first time they are needed. These new indexes are stored as new software results and in subsequent steps, these software results are recognized and short-cutting works correctly from that point onward. |
It looks like the problem is related to this code block in:
if ($obj->isa("Genome::Model::Build::ReferenceSequence")) {
    my @i = Genome::Model::Build::ReferenceSequence::AlignerIndex->get(reference_build_id => $obj->id, test_name => undef);
    for my $i (@i) {
        my $dir = $i->output_dir;
        next if $dir and $dir =~ /gscarchive/;
        next unless $i->id eq '117803766'; # TODO: make this smarter
        $self->add_to_dump_queue($i, $queue, $exclude, $sanitize_map);
    }
    my @prev_builds = grep { $_->isa("Genome::Model::Build::ReferenceSequence") } values %{ $queue->{"Genome::Model::Build"} };
    if (@prev_builds) {
        $DB::single = 1;
        my @converters1 = map { Genome::Model::Build::ReferenceSequence::Converter->get(source_reference_build => $obj, destination_reference_build => $_) } @prev_builds;
        my @converters2 = map { Genome::Model::Build::ReferenceSequence::Converter->get(destination_reference_build => $obj, source_reference_build => $_) } @prev_builds;
        for my $converter (@converters1, @converters2) {
            $self->add_to_dump_queue($converter, $queue, $exclude, $sanitize_map);
        }
    }
}
Try this:
It looks like the aligner index we are trying to retrieve has a test name set and that we skip such objects... Even if we fix this it seems that we still would not get the tophat aligner index. |
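One possible direction for the "make this smarter" TODO, sketched on plain hash records rather than real AlignerIndex objects: select by properties (no test_name, not on an archive volume) instead of by one hard-coded id. The helper exportable_indexes, the record fields and the sample ids and paths are all made up for illustration, and this is not a proposal for what the exporter actually does.

```perl
use strict;
use warnings;

# Hypothetical filter: keep aligner index records that have no test_name and
# whose output_dir is not on an archive volume.
sub exportable_indexes {
    my (@indexes) = @_;
    return grep {
        !defined $_->{test_name}
            and !(defined $_->{output_dir} and $_->{output_dir} =~ /gscarchive/)
    } @indexes;
}

# Invented records: one flagged with a test name, one usable, one archived.
my @indexes = (
    { id => 'idx-old',     aligner_name => 'bwa',    test_name => 'flagged', output_dir => '/data/idx-old' },
    { id => 'idx-new',     aligner_name => 'bwa',    test_name => undef,     output_dir => '/data/idx-new' },
    { id => 'idx-archive', aligner_name => 'bowtie', test_name => undef,     output_dir => '/gscarchive/idx' },
);

print join(',', map { $_->{id} } exportable_indexes(@indexes)), "\n";
```

A property-based filter like this would also pick up indexes for other aligners, which may or may not be wanted given the size of the staged data.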
Note from @acoffman 👍 When I set the test name for the bad software results for the ticket (RT 96275), I must have inadvertently grabbed all the results associated with the build (including the aligner index). There is nothing wrong with the index as far as I know. Unfortunately, I can't remove the test name at this point because then there will be two identical aligner index software results and this will break shortcutting. It looks like the identical one that was created subsequently has the id: ec26139c40dc4d9ea24dc9fa55160b71 |
Obtaining information about ReferenceSequence::AlignerIndex software results:
Get all results for a particular reference sequence build and display basic results:
perl -e 'use Genome; my @r = Genome::Model::Build::ReferenceSequence::AlignerIndex->get(reference_build_id=>106942997); foreach my $r (@r){$a_id=$r->id; $a_name=$r->aligner_name; $a_version=$r->aligner_version; $a_params=$r->aligner_params; $test_name=$r->test_name; print "$a_id\t$a_name\t$a_version\t$a_params\t$test_name\n"}';
Narrow down to a specific aligner and version of that aligner:
perl -e 'use Genome; my @r = Genome::Model::Build::ReferenceSequence::AlignerIndex->get(aligner_name=>"bwa", reference_build_id=>106942997, aligner_version=>"0.5.9"); foreach my $r (@r){$a_id=$r->id; $a_name=$r->aligner_name; $a_version=$r->aligner_version; $a_params=$r->aligner_params; $test_name=$r->test_name; print "$a_id\t$a_name\t$a_version\t$a_params\t$test_name\n"}';
Now remove any with a 'test_name' defined and display along with actual dir for: 'bwa', '0.5.9', reference_build=106942997
perl -e 'use Genome; my @r = Genome::Model::Build::ReferenceSequence::AlignerIndex->get(aligner_name=>"bwa", reference_build_id=>106942997, aligner_version=>"0.5.9"); foreach my $r (@r){$a_id=$r->id; $a_name=$r->aligner_name; $a_version=$r->aligner_version; $a_params=$r->aligner_params; $test_name=$r->test_name; $a_dir=$r->output_dir; unless ($test_name){print "$a_id\t$a_name\t$a_version\t$a_params\t$a_dir\n"}}';
Now do the same thing but for: 'bowtie', '2.0.0-beta7', reference_build=106942997
perl -e 'use Genome; my @r = Genome::Model::Build::ReferenceSequence::AlignerIndex->get(aligner_name=>"bowtie", reference_build_id=>106942997, aligner_version=>"2.0.0-beta7"); foreach my $r (@r){$a_id=$r->id; $a_name=$r->aligner_name; $a_version=$r->aligner_version; $a_params=$r->aligner_params; $test_name=$r->test_name; $a_dir=$r->output_dir; unless ($test_name){print "$a_id\t$a_name\t$a_version\t$a_params\t$a_dir\n"}}'; |
The index that is currently staged is here (the staging directory): /gscmnt/sata102/info/ftp-staging/pub/software/gms/testdata/GMS1/fs/gc4095/info/model_data/ref_build_aligner_index_data/2869585698/build106942997/aligner-index-blade13-4-7.gsc.wustl.edu-wschierd-21466-117803766/bwa/0_5_9/ This index has a test_name that is not 'undef', hence we have decided to remove these index files (these old indexes have been removed from the staging directory) and import a newer aligner index result (id = ec26139c40dc4d9ea24dc9fa55160b71), this will go here, Index results for bowtie (id = 127215980) are now going to be staged as well. These will go here, BWA successfully shortcuts the indexing step for the ref align models. |
In the 'first phase of testing', bwa shortcuts but the 'per-lane-tophat index #1' step does not shortcut. This now takes 30 minutes as compared to 4 hours earlier (presumably due to bowtie indexes being present now) but does not shortcut.
aligner_name = per-lane-tophat, aligner_version = 2.0.4, aligner_params = "-p 4 --bowtie-version=2.0.0-beta7"
perl -e 'use Genome; my @r = Genome::Model::Build::ReferenceSequence::AlignerIndex->get(aligner_name=>"per-lane-tophat", reference_build_id=>106942997, aligner_version=>"2.0.4", aligner_params=>"-p 4 --bowtie-version=2.0.0-beta7"); foreach my $r (@r){$a_id=$r->id; $a_name=$r->aligner_name; $a_version=$r->aligner_version; $a_params=$r->aligner_params; $test_name=$r->test_name; $a_dir=$r->output_dir; unless ($test_name){print "$a_id\t$a_name\t$a_version\t$a_params\t$a_dir\n"}}';
These indexes were copied to /gscmnt/sata102/info/ftp-staging/pub/software/gms/testdata/GMS1/fs/gc4096/info/model_data/ref_build_aligner_index_data/2869585698/build106942997/aligner-index-blade14-4-6.gsc.wustl.edu-jwalker-4387-127215977/per_lane_tophat/2_0_4/_p_4___bowtie_version_2_0_0_beta7
These indexes have now been staged over to the FTP site. |
In addition to the Genome::Model::Build::ReferenceSequence::AlignerIndex, RNAseq also needs objects of kind Genome::Model::Build::ReferenceSequence::AnnotationIndex and their software results. perl -e 'use Genome; my @r = Genome::Model::Build::ReferenceSequence::AnnotationIndex->get(annotation_build_id => 124434505, aligner_name=>"per-lane-tophat", reference_build_id=>106942997, aligner_version=>"2.0.4", aligner_params=>"-p 4 --bowtie-version=2.0.0-beta7"); foreach my $r (@r){$a_id=$r->id; $a_name=$r->aligner_name; $a_version=$r->aligner_version; $a_params=$r->aligner_params; $test_name=$r->test_name; $a_dir=$r->output_dir; unless ($test_name){print "$a_id\t$a_name\t$a_version\t$a_params\t$a_dir\n"}}'; These files have been copied to These files have been staged. The RNAseq per-lane-tophat index step should now successfully shortcut. |
That will speed things along. Such a dilemma, having people download something they can rebuild.
|
If they are both verified to be identical, a test name could be added to the new one and removed from the old one in the same transaction. This would be "safe" from causing crashes or race conditions.
|
Just trying a fresh install and run through on the box at home. First try at running exome refalign and it does not look like I am getting a shortcut on building the reference index. Update: also still not getting shortcut on RNAseq indexing. |
Did we resolve that test_name issue?
|
You can list with:
|
Unfortunately, the last command above errors out. ogriffit@GGMS ~/gms (ubuntu-12.04)> ur list objects --subj Genome::SoftwareResult ERROR: Can't call method "display_name" on an undefined value at /opt/gms/AUIS907/sw/genome/lib/perl/Genome/SoftwareResult/Param.pm line 68. |
Both the ref-align-exome and rna-seq steps shortcut on a completely fresh install and fresh download of data on a blade. This step not working for Obi might have something to do with a standalone install outside TGI or an rsync issue on Obi's box (I'm leaning towards this). |
Try --show id,class. Not sure what the deal with display_name is. It should be On Tuesday, January 21, 2014, Obi Griffith [email protected] wrote:
|
This is now working on the external standalone. Both RNAseq and RefAlign indexing are being shortcut successfully. The problem last time through for me was not having the latest version of the meta-data file. This issue seems to be resolved. |
Reference alignment builds for TST1 currently cannot get past the ref-cov step because an input software result is missing. Specifically, the ref-cov step needs a feature-list result to run properly. This feature-list includes a bed file for the exome target regions.
When the build fails, the following errors are generated:
ERROR: Calling get on SoftwareResult (unless getting by id) is slow and possibly incorrect.
ERROR: Can't open file (/opt/gms/GMS1/fs/gc4095/info/feature_list/7BF768EF51FB11E1A0743039993C62A0/7BF768EF51FB11E1A0743039993C62A0.bed) to md5sum: No such file or directory at /opt/gms-1E0X346/sw/genome/lib/
I confirmed that this file (and the entire result) are missing from my GMS1 instance. I expect this result is not getting dumped correctly during the initial creation of the TST1 metadata object, or perhaps just not getting imported correctly...
mg