-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make HaplotypeCaller genotype and output spanning deletions #4963
Conversation
Not sure who the best reviewer(s) would be but it would be good to get some experienced eyes on this because it touches a few core methods of haplotype caller. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! This is so much more correct.
|
||
@VisibleForTesting | ||
protected static VariantContext composeGivenAllelesVariantContextFromVariantList(final List<VariantContext> variantContextsInFeatureContext, final Locatable loc, final boolean snpsOnly, final boolean keepFiltered) { | ||
final List<VariantContext> rodVcsAtLoc = variantContextsInFeatureContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like "ROD" is a linguistic artifact of GATK 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was trying to step lightly on renaming variables in this PR but since you want them to go I'm more than happy to.. renamed.
.stream() | ||
.filter(vc -> vc.getStart() == loc.getStart() && | ||
(keepFiltered || vc.isNotFiltered()) && | ||
(!snpsOnly || vc.isSNP())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
snpsOnly
strikes me as a weird argument to have, and it's only ever false
. Why not eliminate it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must have been a legacy thing. Removed.
} | ||
final List<String> haplotypeSources = rodVcsAtLoc.stream().map(VariantContext::getSource).collect(Collectors.toList()); | ||
final VariantContext mergedVc = GATKVariantContextUtils.simpleMerge(rodVcsAtLoc, haplotypeSources, | ||
GATKVariantContextUtils.FilteredRecordMergeType.KEEP_IF_ANY_UNFILTERED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be keepFiltered ? KEEP_UNCONDITIONAL : KEEP_IF_ANY_UNFILTERED
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that. Changed to your suggestion.
@Override | ||
public int hashCode() { | ||
int result = loc; | ||
result = 31 * result + (alleles != null ? alleles.hashCode() : 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems better as a one-liner return 31 * loc + (alleles != null ? alleles.hashCode() : 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} else { // we are in GGA mode! | ||
int compCount = 0; | ||
for( final VariantContext compVC : activeAllelesToGenotype ) { | ||
if( compVC.getStart() == loc ) { | ||
if( compVC.getStart() <= loc && compVC.getEnd() >= loc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all "comp" names should be renamed to something clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, more weird old legacy names.. done, although I did keep the "Comp" in the vcSourceName
in case anyone depended on that downstream.
@@ -442,7 +442,7 @@ public void writeHeader( final VariantContextWriter vcfWriter, final SAMSequence | |||
public ActivityProfileState isActive( final AlignmentContext context, final ReferenceContext ref, final FeatureContext features ) { | |||
|
|||
if ( hcArgs.genotypingOutputMode == GenotypingOutputMode.GENOTYPE_GIVEN_ALLELES ) { | |||
final VariantContext vcFromAllelesRod = GenotypingGivenAllelesUtils.composeGivenAllelesVariantContextFromRod(features, ref.getInterval(), false, hcArgs.genotypeFilteredAlleles, logger, hcArgs.alleles); | |||
final VariantContext vcFromAllelesRod = GenotypingGivenAllelesUtils.composeGivenAllelesVariantContextFromRod(features, ref.getInterval(), false, hcArgs.genotypeFilteredAlleles, hcArgs.alleles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're at it you can rename variables with "rod" in their name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -230,7 +227,7 @@ public void shutdown() { | |||
@Override | |||
public ActivityProfileState isActive(final AlignmentContext context, final ReferenceContext ref, final FeatureContext featureContext) { | |||
if ( MTAC.genotypingOutputMode == GenotypingOutputMode.GENOTYPE_GIVEN_ALLELES ) { | |||
final VariantContext vcFromAllelesRod = GenotypingGivenAllelesUtils.composeGivenAllelesVariantContextFromRod(featureContext, ref.getInterval(), false, MTAC.genotypeFilteredAlleles, logger, MTAC.alleles); | |||
final VariantContext vcFromAllelesRod = GenotypingGivenAllelesUtils.composeGivenAllelesVariantContextFromRod(featureContext, ref.getInterval(), false, MTAC.genotypeFilteredAlleles, MTAC.alleles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to change "rod" names in Mutect code, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -395,6 +396,10 @@ public String toString() { | |||
return startPosKeySet; | |||
} | |||
|
|||
public Iterator<VariantContext> getSpanningEvents(final int loc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would call this getOverlappingEvents
because otherwise it suggests that you're only getting spanning deletions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public class GenotypingGivenAllelesUtilsUnitTest { | ||
|
||
@Test | ||
public void testComposeGivenAllelesVariantContextFromRod() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More griping about "ROD" names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@Test | ||
public void testComposeGivenAllelesVariantContextFromRod() { | ||
|
||
final SimpleInterval loc = new SimpleInterval("20", 10093568, 10093568); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These loci don't need to be 8-digit numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@davidbenjamin Thanks for your review; I've tried to address your comments but let me know if anything still looks like it could use some work or if you spot anything new. Still waiting for tests to pass on this update (bar the GenomicsDB tests that I expect to fail). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! 👍 to merge. (That is, once all the tests are resolved).
final Map<Allele, Allele> spanningEventAlleleMappingToMergedVc | ||
= GATKVariantContextUtils.createAlleleMapping(mergedVC.getReference(), spanningEvent, new ArrayList<>()); | ||
final Allele remappedSpanningEventAltAllele = spanningEventAlleleMappingToMergedVc.get(spanningEvent.getAlternateAllele(0)); | ||
// in the case of GGA mode the spanning event might not match an allele in the mergedVC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does sound correct, and I see that you hooked up SomaticGenotypingEngine
to the version of this method that doesn't get the GGA alleles, which is correct because they have already been injected into the haplotypes. So, yes, I think everything is okay.
I've rebased on the most recent GenomicsDB changes but am still having trouble getting tests to pass. Not sure yet what the solution is but here is the issue for documentation: With these new changes HaplotypeCaller produces the following GVCF records on our chr20 test data:
If I run this through CombineGVCFs like this:
The resulting GVCF has these records:
When the original GVCF is imported into GenomicsDB and then extracted:
It contains the following records in this region:
Note that format DP and AD are different for sites after 10068161. Not sure what the right answer should be here, to my eye it looks like both CombineGVCFs and GenomicsDB are wrong and DP should be 6 for the star alleles in those sites. |
I agree that AD should be 6 for the * allele. It doesn't look like it's entirely trivial to implement in the GATK code since when we do the allele index remapping we expect dropped alleles to be low quality and don't aggregate their ADs. But that issue aside, I'm concerned that the CombineGVCF and GenomicsDB results don't match. @kgururaj is it possible #4645 changed something relevant to this bug? |
I'll try to explain my understanding of how spanning deletions and the associated attributes are computed in GenomicsDB - you can let me know what part of the logic I should fix.
Let me know if I should fix 3.ii |
f5ff180
to
97d3898
Compare
Yes, the * at 10068160 should be considered a deletion. It looks like that must be necessary to match the CombineGVCFs results. |
Thanks Laura, the next release will incorporate both your suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so familiar with the GGA code anymore, but nothing jumped out at me as being amiss. I looked at all the integration test changes and they're all improvements.
|
||
/** | ||
* Compendium of utils to work in GENOTYPE_GIVEN_ALLELES mode. | ||
*/ | ||
public final class GenotypingGivenAllelesUtils { | ||
|
||
/** | ||
* Composes the given allele variant-context providing information about the rods and reference location. | ||
* Composes the given allele variant-context providing information about the given allele variants and reference location. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would say "providing information about the given variant alleles"
@@ -1124,8 +1124,8 @@ | |||
20 10068149 . T <NON_REF> . . END=10068151 GT:DP:GQ:MIN_DP:PL 0/0:31:63:31:0,63,945 | |||
20 10068152 . A <NON_REF> . . END=10068155 GT:DP:GQ:MIN_DP:PL 0/0:29:57:28:0,57,855 | |||
20 10068156 . A <NON_REF> . . END=10068157 GT:DP:GQ:MIN_DP:PL 0/0:27:51:27:0,51,765 | |||
20 10068158 . GTGTATATATATA G,<NON_REF> 66.73 . AS_RAW_BaseQRankSum=|-0.7,1|NaN;AS_RAW_MQ=5282.00|8882.00|0.00;AS_RAW_MQRankSum=|0.3,1|NaN;AS_RAW_ReadPosRankSum=|0.5,1|NaN;AS_SB_TABLE=0,3|2,2|0,0;BaseQRankSum=-0.652;DP=29;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.328;RAW_MQ=93364.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:3,4,0:7:57:0|1:10068158_GTGTATATATATA_G:104,0,57,114,69,183:0,3,2,2 | |||
20 10068160 . GTATATATATATGTA G,<NON_REF> 7.57 . AS_RAW_BaseQRankSum=|-0.6,1|NaN;AS_RAW_MQ=9723.00|1682.00|0.00;AS_RAW_MQRankSum=|-0.8,1|NaN;AS_RAW_ReadPosRankSum=|0.0,1|NaN;AS_SB_TABLE=2,3|0,2|0,0;BaseQRankSum=-0.566;DP=28;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.712;RAW_MQ=87005.00;ReadPosRankSum=0.000 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:5,2,0:7:44:1|0:10068158_GTGTATATATATA_G:44,0,134,60,141,201:2,3,0,2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes are all great.
* A spanning deletion is also considered a deletion allele - fix as directed by broadinstitute/gatk#4963 (comment)
@cwhelan my understanding is that this is blocked because tests won't pass until there's a GDB update -- yes? |
@ldgauthier Yes, this was just waiting on some resolution to the double-spanning deletion issue reported above. If a new GDB update makes the output of |
I know you don't need this for your project anymore, but I'd rather get this in and leave the AD correction for overlapping deletions for another issue. |
… mapping to a proper check of matching alleles
…by haplotypes (GGA mode)
…B and CombinedGVCFs behavior around doubled spanning deletions can be synchronized
97d3898
to
0284666
Compare
@ldgauthier I've updated |
If you add an issue for the combining of AD for multiple spanning deletions (which will ultimately improve your new expected results here) then I'll consider this 👍 |
from earlier positions) as deletions in the min PL value computation. This behavior now matches the behavior of CombineGVCFs. A more detailed description of the issue is provided in #4963 * Deleted a couple of files which are no longer necessary. * Fixed the index of newMQcalc.combined.g.vcf
from earlier positions) as deletions in the min PL value computation. This behavior now matches the behavior of CombineGVCFs. A more detailed description of the issue is provided in #4963 * Deleted a couple of files which are no longer necessary. * Fixed the index of newMQcalc.combined.g.vcf
from earlier positions) as deletions in the min PL value computation. This behavior now matches the behavior of CombineGVCFs. A more detailed description of the issue is provided in #4963 * Deleted a couple of files which are no longer necessary. * Fixed the index of newMQcalc.combined.g.vcf
from earlier positions) as deletions in the min PL value computation. This behavior now matches the behavior of CombineGVCFs. A more detailed description of the issue is provided in #4963 * Deleted a couple of files which are no longer necessary. * Fixed the index of newMQcalc.combined.g.vcf
* The newest release of GenomicsDB treats spanning deletions (spanning from earlier positions) as deletions in the min PL value computation. This behavior now matches the behavior of CombineGVCFs. A more detailed description of the issue is provided in #4963 * Deleted a couple of files which are no longer necessary. * Fixed the index of newMQcalc.combined.g.vcf * Fix for #5300 when multiple reader-threads are used in the importer. Not a race condition in GenomicsDB - InitializedQueryWrapper wasn't written for multiple intervals.
This PR modifies
HaplotypeCaller
so that it can output and genotype spanning deletion alleles represented by the*
allele.Currently, the output of
HaplotypeCaller
will not include spanning deletion alleles when run in single sample VCF mode or in genotype given alleles mode, even when that genotype would be more appropriate. In the joint calling workflowGenotypeGVCFs
adds genotypes for spanning deletions, although the input likelihoods will not be broken out to specifically account for spanning deletion alleles.Some implementation notes:
HaplotypeCaller
used to emit the warning"Multiple valid VCF records detected in the alleles input file at site " + loc + ", only considering the first record"
for each such site. This was a bit of a misleading message, since the other variants were in fact taken into account UNLESS HC decided to emit an empty variant context, for example due to zero coverage.createAlleleMapper
method inAssemblyBasedCallerGenotypingEngine
. The old version had a very brittle mapping scheme that depended heavily on the ordering of alleles in the variant context created byAssemblyBasedCallerUtils.makeMergedVariantContext
andgetEventsAtThisLoc
. This proved to be difficult to ensure when spanning deletions were added in, and there was an ominous TODO in the old method saying that the logic was not good enough, so I ended up re-writing it from scratch. The new version is longer but I hope it is easier to read and less fragile, but let me know if I've missed anything.Test currently fail on this branch and therefore it should not be merged. To make them pass we need a fix to #4716 which is currently being worked on in #4645. However, since that PR is taking a while to make it through code review, I thought it might be good to start the review process for these changes.