Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GGA mode for Mutect #4601

Merged
merged 4 commits into from
Apr 8, 2018
Merged

GGA mode for Mutect #4601

merged 4 commits into from
Apr 8, 2018

Conversation

davidbenjamin
Copy link
Contributor

@takutosato Can you review this PR?

This is a community request and a useful feature for our MC3 validation.

@davidbenjamin davidbenjamin added this to the Popularize Mutect 2 at the Broad milestone Mar 27, 2018
@ldgauthier
Copy link
Contributor

@cwhelan Maybe David could give you a second opinion on the overlapping variants issue in GGA mode?

@davidbenjamin
Copy link
Contributor Author

What's the overlapping variants issue?

@cwhelan
Copy link
Member

cwhelan commented Mar 28, 2018

@davidbenjamin I'm unsure about how GGA mode works when alleles overlap, ie a spanning deletion over a SNP. HaplotypeCaller spews a bunch of warnings from line 51 of GenotypingGivenAllelesUtils but I'm not sure the warnings are right or meaningful. I'll come chat with you about it.

@codecov-io
Copy link

codecov-io commented Mar 28, 2018

Codecov Report

Merging #4601 into master will increase coverage by 0.003%.
The diff coverage is 73.529%.

@@               Coverage Diff               @@
##              master     #4601       +/-   ##
===============================================
+ Coverage     79.811%   79.814%   +0.003%     
- Complexity     17152     17165       +13     
===============================================
  Files           1067      1067               
  Lines          62415     62439       +24     
  Branches       10130     10138        +8     
===============================================
+ Hits           49814     49835       +21     
- Misses          8656      8657        +1     
- Partials        3945      3947        +2
Impacted Files Coverage Δ Complexity Δ
...walkers/genotyper/GenotypingGivenAllelesUtils.java 28.571% <0%> (ø) 2 <0> (ø) ⬇️
...nder/tools/walkers/genotyper/GenotypingEngine.java 68.908% <0%> (ø) 68 <0> (ø) ⬇️
...rs/genotyper/StandardCallerArgumentCollection.java 89.744% <100%> (+0.27%) 15 <0> (ø) ⬇️
...walkers/haplotypecaller/HaplotypeCallerEngine.java 78.652% <50%> (-0.375%) 66 <1> (+1)
.../tools/walkers/mutect/SomaticGenotypingEngine.java 89.82% <76.19%> (-1.57%) 65 <11> (+6)
...hellbender/tools/walkers/mutect/Mutect2Engine.java 87.654% <87.5%> (+0.558%) 50 <12> (+5) ⬆️
...ypecaller/AssemblyBasedCallerGenotypingEngine.java 91.515% <0%> (+0.606%) 70% <0%> (+1%) ⬆️
...utils/smithwaterman/SmithWatermanIntelAligner.java 90% <0%> (+10%) 3% <0%> (ø) ⬇️

@chandrans
Copy link
Contributor

Does this have anything to do with HaplotypeCaller GGA mode? Or, is Mutect2 completely separate in GATK4?

@davidbenjamin
Copy link
Contributor Author

@chandrans It's a similar idea in that GGA mode forces the tool to genotype all sites in the -alleles vcf. The big difference is that in HaplotypeCaller this mode excludes other alleles whereas the new mode forces genotyping of these alleles in addition to any other variants it finds.

@chandrans
Copy link
Contributor

I see. Thanks David

Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few comments/questions. Thanks for being patient. Back to you @davidbenjamin

@@ -43,7 +45,7 @@ public static VariantContext composeGivenAllelesVariantContextFromRod(final Feat

// search for usable record
for ( final VariantContext rodVc : tracker.getValues(allelesBinding, new SimpleInterval(loc)) ) {
if ( rodVc != null && ! rodVc.isFiltered() && (! snpsOnly || rodVc.isSNP() )) {
if ( rodVc != null && (keepFiltered || rodVc.isNotFiltered()) && (! snpsOnly || rodVc.isSNP() )) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not your code but here I think a stream with findFirst is cleaner and more clear.

Optional<VariantContext> vc = tracker.getValues(allelesBinding, new SimpleInterval(loc)).stream()
                .filter(rodVc -> rodVc != null && (keepFiltered || rodVc.isNotFiltered()) && (! snpsOnly || rodVc.isSNP()))
                .findFirst();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but I don't think the engine team wants us touching HaplotypeCaller code more than absolutely necessary for the moment.

@@ -98,15 +96,19 @@ public CalledHaplotypes callMutations(

// update the haplotypes so we're ready to call, getting the ordered list of positions on the reference
// that carry events among the haplotypes
final List<Integer> startPosKeySet = decomposeHaplotypesIntoVariantContexts(haplotypes, assemblyResultSet.getFullReferenceWithPadding(), assemblyResultSet.getPaddedReferenceLoc(), NO_GIVEN_ALLELES).stream()
final List<Integer> startPosKeySet = decomposeHaplotypesIntoVariantContexts(haplotypes, assemblyResultSet.getFullReferenceWithPadding(), assemblyResultSet.getPaddedReferenceLoc(), givenAlleles).stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside AssemblyBasedCallerGenotypingEngine::decomposeHaplotypesIntoVariantContexts (the whole method copied below), we call startPosKeySet.clear(), which, as far as I can tell, ensures that the only GAA variants go through and that's not what we want. Since we've already inserted the GGA alleles into haplotypes already, it seems that we can remove the if (inGGAMode) block entirely.

    protected TreeSet<Integer> decomposeHaplotypesIntoVariantContexts(final List<Haplotype> haplotypes,
                                                                      final byte[] ref,
                                                                      final SimpleInterval refLoc,
                                                                      final List<VariantContext> activeAllelesToGenotype) {
        final boolean inGGAMode = ! activeAllelesToGenotype.isEmpty();

        // Using the cigar from each called haplotype figure out what events need to be written out in a VCF file
        // IMPORTANT NOTE: This needs to be done even in GGA mode, as this method call has the side effect of setting the
        // event maps in the Haplotype objects!
        final TreeSet<Integer> startPosKeySet = EventMap.buildEventMapsForHaplotypes(haplotypes, ref, refLoc, configuration.debug);

        if ( inGGAMode ) {
            startPosKeySet.clear(); // <<<< (emphasis by Takuto)
            for( final VariantContext compVC : activeAllelesToGenotype ) {
                startPosKeySet.add(compVC.getStart());
            }
        }

        return startPosKeySet;
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we can give this method an empty list instead of givenAlleles as you do several lines below with getVCsAtThisLocation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. Really good catch. I thought I had tracked down all the places where GGA mode overrides discovery alleles. Code review at its finest.

@@ -176,6 +186,18 @@ public CalledHaplotypes callMutations(
return new CalledHaplotypes(outputCallsWithEventCountAnnotation, calledHaplotypes);
}

// check whether two alleles coming from different variant contexts and with possibly different reference alleles
// could in fact be the same. The condition is that one is a prefix of the other
private boolean allelesAreConsistent(final Allele allele1, final Allele allele2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gonna have to think about this more but don't want to delay handing the review back to you. It seems that the way it's written,

VC: A -> AA
Given allele: A -> AAA

are deemed consistent and therefore emitted forcibly...I don't think that's what we want

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that this happens and I was actually debating myself about the same thing. I was willing to be generous and possibly admit an extra allele at a GGA site because I didn't want to complicate the code and re-represent the GGA allele in the merged variant context.

However, I realize now that without complicating the code I can simply add the condition that the difference between the ref and alt lengths agree.

@davidbenjamin
Copy link
Contributor Author

Back to you @takutosato.

@takutosato
Copy link
Contributor

Looks good!

@davidbenjamin davidbenjamin force-pushed the db_m2_gga branch 2 times, most recently from cae34a4 to 67b7315 Compare April 5, 2018 20:55
@davidbenjamin davidbenjamin merged commit ce34240 into master Apr 8, 2018
@davidbenjamin davidbenjamin deleted the db_m2_gga branch April 8, 2018 03:51
@sooheelee
Copy link
Contributor

Quick question @davidbenjamin. Does the GGA mode allow outputting a BAMOUT that correspond to the GGA genotypes?

@davidbenjamin
Copy link
Contributor Author

@sooheelee If you only wanted the GGA sites in the bamout you would need to restrict your intervals eg -L intervals.interval_list -L gga.vcf -interval-set-rule INTERSECTION.

@sooheelee
Copy link
Contributor

I just wanted to know if generating a BAMOUT was possible with GGA mode @davidbenjamin. Great to hear that it is. Now I have a followup question. Is it possible to simultaneously run Mutect2 on a BAM and specify GGA mode for the callset that will be generated during the run such that the BAMOUT will reflect the GGA calls? Or for GGA mode do I have to have a callset in hand that I supply to Mutect2 to generate this GGA-BAMOUT?

@davidbenjamin
Copy link
Contributor Author

@sooheelee I don't understand yet. Could you restate what you mean?

@sooheelee
Copy link
Contributor

Sure @davidbenjamin. Is an alleles VCF required to run GGA mode in Mutect2? Or can I have Mutect2 run GGA mode on the fly, for the alleles it determines for the pseudo-genotype that it outputs to the VCF, such that the BAMOUT reads will represent the pseudo-genotypes without artificial haplotypes etc?

I want a BAMOUT with all the reads being forced to represent the called alleles, for manual review. But I don't want to have to run Mutect2 twice for this BAMOUT. Rather, I want M2 to generate this GGA-BAMOUT on the fly, while it is determining the somatic calls.

@davidbenjamin
Copy link
Contributor Author

@sooheelee Mutect2 can't run GGA on the fly. You have to pass the GGA vcf on the command line.

I am, however, separately working on an option to call haplotypes directly so that there are fewer artificial haplotypes in the bamout, and the ones that remain will be biologically meaningful. This isn't exactly what you want, but it comes close.

cwhelan pushed a commit to cwhelan/gatk-linked-reads that referenced this pull request May 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants