-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GGA mode for Mutect #4601
GGA mode for Mutect #4601
Conversation
@cwhelan Maybe David could give you a second opinion on the overlapping variants issue in GGA mode? |
What's the overlapping variants issue? |
@davidbenjamin I'm unsure about how GGA mode works when alleles overlap, ie a spanning deletion over a SNP. HaplotypeCaller spews a bunch of warnings from line 51 of |
Codecov Report
@@ Coverage Diff @@
## master #4601 +/- ##
===============================================
+ Coverage 79.811% 79.814% +0.003%
- Complexity 17152 17165 +13
===============================================
Files 1067 1067
Lines 62415 62439 +24
Branches 10130 10138 +8
===============================================
+ Hits 49814 49835 +21
- Misses 8656 8657 +1
- Partials 3945 3947 +2
|
Does this have anything to do with HaplotypeCaller GGA mode? Or, is Mutect2 completely separate in GATK4? |
@chandrans It's a similar idea in that GGA mode forces the tool to genotype all sites in the |
I see. Thanks David |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few comments/questions. Thanks for being patient. Back to you @davidbenjamin
@@ -43,7 +45,7 @@ public static VariantContext composeGivenAllelesVariantContextFromRod(final Feat | |||
|
|||
// search for usable record | |||
for ( final VariantContext rodVc : tracker.getValues(allelesBinding, new SimpleInterval(loc)) ) { | |||
if ( rodVc != null && ! rodVc.isFiltered() && (! snpsOnly || rodVc.isSNP() )) { | |||
if ( rodVc != null && (keepFiltered || rodVc.isNotFiltered()) && (! snpsOnly || rodVc.isSNP() )) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not your code but here I think a stream with findFirst
is cleaner and more clear.
Optional<VariantContext> vc = tracker.getValues(allelesBinding, new SimpleInterval(loc)).stream()
.filter(rodVc -> rodVc != null && (keepFiltered || rodVc.isNotFiltered()) && (! snpsOnly || rodVc.isSNP()))
.findFirst();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree but I don't think the engine team wants us touching HaplotypeCaller code more than absolutely necessary for the moment.
@@ -98,15 +96,19 @@ public CalledHaplotypes callMutations( | |||
|
|||
// update the haplotypes so we're ready to call, getting the ordered list of positions on the reference | |||
// that carry events among the haplotypes | |||
final List<Integer> startPosKeySet = decomposeHaplotypesIntoVariantContexts(haplotypes, assemblyResultSet.getFullReferenceWithPadding(), assemblyResultSet.getPaddedReferenceLoc(), NO_GIVEN_ALLELES).stream() | |||
final List<Integer> startPosKeySet = decomposeHaplotypesIntoVariantContexts(haplotypes, assemblyResultSet.getFullReferenceWithPadding(), assemblyResultSet.getPaddedReferenceLoc(), givenAlleles).stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside AssemblyBasedCallerGenotypingEngine::decomposeHaplotypesIntoVariantContexts
(the whole method copied below), we call startPosKeySet.clear()
, which, as far as I can tell, ensures that the only GAA variants go through and that's not what we want. Since we've already inserted the GGA alleles into haplotypes already, it seems that we can remove the if (inGGAMode)
block entirely.
protected TreeSet<Integer> decomposeHaplotypesIntoVariantContexts(final List<Haplotype> haplotypes,
final byte[] ref,
final SimpleInterval refLoc,
final List<VariantContext> activeAllelesToGenotype) {
final boolean inGGAMode = ! activeAllelesToGenotype.isEmpty();
// Using the cigar from each called haplotype figure out what events need to be written out in a VCF file
// IMPORTANT NOTE: This needs to be done even in GGA mode, as this method call has the side effect of setting the
// event maps in the Haplotype objects!
final TreeSet<Integer> startPosKeySet = EventMap.buildEventMapsForHaplotypes(haplotypes, ref, refLoc, configuration.debug);
if ( inGGAMode ) {
startPosKeySet.clear(); // <<<< (emphasis by Takuto)
for( final VariantContext compVC : activeAllelesToGenotype ) {
startPosKeySet.add(compVC.getStart());
}
}
return startPosKeySet;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively we can give this method an empty list instead of givenAlleles
as you do several lines below with getVCsAtThisLocation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. Really good catch. I thought I had tracked down all the places where GGA mode overrides discovery alleles. Code review at its finest.
@@ -176,6 +186,18 @@ public CalledHaplotypes callMutations( | |||
return new CalledHaplotypes(outputCallsWithEventCountAnnotation, calledHaplotypes); | |||
} | |||
|
|||
// check whether two alleles coming from different variant contexts and with possibly different reference alleles | |||
// could in fact be the same. The condition is that one is a prefix of the other | |||
private boolean allelesAreConsistent(final Allele allele1, final Allele allele2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gonna have to think about this more but don't want to delay handing the review back to you. It seems that the way it's written,
VC: A -> AA
Given allele: A -> AAA
are deemed consistent and therefore emitted forcibly...I don't think that's what we want
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that this happens and I was actually debating myself about the same thing. I was willing to be generous and possibly admit an extra allele at a GGA site because I didn't want to complicate the code and re-represent the GGA allele in the merged variant context.
However, I realize now that without complicating the code I can simply add the condition that the difference between the ref and alt lengths agree.
Back to you @takutosato. |
Looks good! |
cae34a4
to
67b7315
Compare
Quick question @davidbenjamin. Does the GGA mode allow outputting a BAMOUT that correspond to the GGA genotypes? |
@sooheelee If you only wanted the GGA sites in the bamout you would need to restrict your intervals eg |
I just wanted to know if generating a BAMOUT was possible with GGA mode @davidbenjamin. Great to hear that it is. Now I have a followup question. Is it possible to simultaneously run Mutect2 on a BAM and specify GGA mode for the callset that will be generated during the run such that the BAMOUT will reflect the GGA calls? Or for GGA mode do I have to have a callset in hand that I supply to Mutect2 to generate this GGA-BAMOUT? |
@sooheelee I don't understand yet. Could you restate what you mean? |
Sure @davidbenjamin. Is an alleles VCF required to run GGA mode in Mutect2? Or can I have Mutect2 run GGA mode on the fly, for the alleles it determines for the pseudo-genotype that it outputs to the VCF, such that the BAMOUT reads will represent the pseudo-genotypes without artificial haplotypes etc? I want a BAMOUT with all the reads being forced to represent the called alleles, for manual review. But I don't want to have to run Mutect2 twice for this BAMOUT. Rather, I want M2 to generate this GGA-BAMOUT on the fly, while it is determining the somatic calls. |
@sooheelee Mutect2 can't run GGA on the fly. You have to pass the GGA vcf on the command line. I am, however, separately working on an option to call haplotypes directly so that there are fewer artificial haplotypes in the bamout, and the ones that remain will be biologically meaningful. This isn't exactly what you want, but it comes close. |
@takutosato Can you review this PR?
This is a community request and a useful feature for our MC3 validation.