
(SV) consolidate logic in simple chimera inference, update how variants are represented in VCF emitted by new code path #4663

Merged
merged 1 commit into master on May 10, 2018

Conversation

@SHuang-Broad (Contributor) commented Apr 17, 2018

Part of the roadmap laid out in #4111

Consolidate logic, update variant representation (PR#4663)

Consolidate logic in the following classes:

  • AssemblyContigAlignmentSignatureClassifier is now gone; its inner enum class RawTypes has moved to AssemblyContigWithFineTunedAlignments.AlignmentSignatureBasicTypes and been reduced to fewer cases (Suspicious, Simple, and Complex)

  • the static method BreakpointsInference.inferFromSimpleChimera() has moved to the state-query method ChimericAlignment.inferType()

  • AssemblyContigWithFineTunedAlignments.hasIncompletePictureFromTwoAlignments() has been merged with ChimericAlignment.hasIncompletePicture()

Update how variants are represented:

  • change SVLEN for CPX variants to the difference between [alt haplotype sequence length] and [affected reference region length], following the technical definition of SVLEN in the VCF spec (see the sketch after this list)

  • change RPL output to one of these (with test coverage added):

    • a single ins/del record, when the del/ins bases are < 50, with annotations; when the type is determined to be ins, POS is placed 1 base before the micro-deleted range, END at the end of the micro-deleted range, and the REF allele holds the corresponding reference bases
    • separate ins and del records when both are >= 50, linked by EVENT
  • change SVTYPE=DUP to SVTYPE=INS when the duplicated region is shorter than 50 bp (tests added). Note that this leads to INS records carrying DUP_REPEAT_UNIT_REF_SPAN and DUP_SEQ_CIGARS annotations (when available)
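
As a concrete illustration of the new SVLEN definition, a minimal sketch (the method name is illustrative, not the actual implementation; SimpleInterval is GATK's interval type):

import org.broadinstitute.hellbender.utils.SimpleInterval;

// SVLEN for a CPX variant = [alt haplotype length] - [affected reference region length];
// positive for a net gain of bases, negative for a net loss, per the VCF spec.
static int computeCpxSvLen(final byte[] altHaplotypeSequence, final SimpleInterval affectedRefRegion) {
    return altHaplotypeSequence.length - affectedRefRegion.size();
}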

In addition, we currently treat duplication expansion as insertion.
The VCF spec doesn't require duplications to be recorded as DUP.
If we decide to allow POS and END to designate the beginning and end of the duplicated reference region, we would need to make at least the following changes:

  • shift the left breakpoint to the right by 1 base compared to the current implementation, and
  • downstreamBreakpointRefPos = complication.getDupSeqRepeatUnitRefSpan().getEnd();

Bump test coverage:

  • SimpleNovelAdjacencyAndChimericAlignmentEvidence serialization test

  • NovelAdjacencyAndAltHaplotype.toSimpleOrBNDTypes() (only the SimpleSvType cases; BNDs will wait for a later PR)

@SHuang-Broad force-pushed the sh-sv-representation-change branch 2 times, most recently from c96209a to 5b8e571 on April 20, 2018 at 21:13
@codecov-io commented Apr 20, 2018

Codecov Report

Merging #4663 into master will increase coverage by 0.082%.
The diff coverage is 76.21%.

@@               Coverage Diff               @@
##              master     #4663       +/-   ##
===============================================
+ Coverage     79.973%   80.055%   +0.082%     
- Complexity     17421     17476       +55     
===============================================
  Files           1081      1080        -1     
  Lines          63140     63456      +316     
  Branches       10200     10271       +71     
===============================================
+ Hits           50495     50800      +305     
  Misses          8656      8656               
- Partials        3989      4000       +11
Impacted Files Coverage Δ Complexity Δ
...te/hellbender/tools/spark/sv/discovery/SvType.java 100% <ø> (ø) 6 <0> (ø) ⬇️
...ender/tools/spark/sv/utils/GATKSVVCFConstants.java 75% <ø> (ø) 1 <0> (ø) ⬇️
...tructuralVariationDiscoveryArgumentCollection.java 97.297% <ø> (ø) 0 <0> (ø) ⬇️
.../tools/spark/sv/discovery/BreakEndVariantType.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...lbender/tools/spark/sv/discovery/SimpleSVType.java 86.667% <100%> (+0.226%) 3 <1> (+1) ⬆️
...der/tools/spark/sv/utils/GATKSVVCFHeaderLines.java 82.813% <100%> (+0.273%) 7 <0> (ø) ⬇️
...pleNovelAdjacencyAndChimericAlignmentEvidence.java 87.5% <100%> (+63.176%) 10 <2> (+5) ⬆️
...ry/inference/CpxVariantInducingAssemblyContig.java 84.141% <100%> (ø) 27 <0> (ø) ⬇️
...s/spark/sv/discovery/SvDiscoveryInputMetaData.java 100% <100%> (ø) 2 <2> (?)
...overy/inference/TypeInferredFromSimpleChimera.java 100% <100%> (ø) 1 <0> (ø) ⬇️
... and 43 more

@SHuang-Broad force-pushed the sh-sv-representation-change branch from 5b8e571 to 954cf6d on April 26, 2018 at 22:39
@SHuang-Broad changed the title from "(SV) change how variants are represented in VCF emitted by new code path" to "(SV) consolidate logic in simple chimera inference, change how variants are represented in VCF emitted by new code path" on Apr 26, 2018
@SHuang-Broad changed the title from "(SV) consolidate logic in simple chimera inference, change how variants are represented in VCF emitted by new code path" to "(SV) consolidate logic in simple chimera inference, update how variants are represented in VCF emitted by new code path" on Apr 26, 2018
@SHuang-Broad force-pushed the sh-sv-representation-change branch from 954cf6d to 8bb042b on April 26, 2018 at 23:53
@SHuang-Broad (Contributor, Author) commented:

Feature (this) branch run log

00:47:25.107 INFO  StructuralVariationDiscoveryPipelineSpark - Used 3957 evidence target links to annotate assembled breakpoints
00:47:25.226 INFO  StructuralVariationDiscoveryPipelineSpark - Called 336 imprecise deletion variants
00:47:25.264 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 9651 variants.
00:47:25.280 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 281
00:47:25.281 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 5630
00:47:25.281 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 2248
00:47:25.281 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 1492
00:47:25.291 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
00:47:25.291 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
00:47:25.291 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
00:47:25.291 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
00:47:25.291 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0
18/04/27 00:47:28 WARN org.apache.spark.scheduler.TaskSetManager: Stage 19 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 00:47:33 WARN org.apache.spark.scheduler.TaskSetManager: Stage 20 contains a task of very large size (2685 KB). The maximum recommended task size is 100 KB.
00:47:39.417 INFO  StructuralVariationDiscoveryPipelineSpark - Processing 323040 raw alignments from 254678 contigs.
18/04/27 00:47:39 WARN org.apache.spark.scheduler.TaskSetManager: Stage 22 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
00:47:47.083 INFO  StructuralVariationDiscoveryPipelineSpark - Filtering on MQ left 225280 contigs.
00:47:48.135 INFO  StructuralVariationDiscoveryPipelineSpark - 23702 contigs with chimeric alignments potentially giving SV signals.
18/04/27 00:47:49 WARN org.apache.spark.scheduler.TaskSetManager: Stage 32 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 00:47:57 WARN org.apache.spark.scheduler.TaskSetManager: Stage 33 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 00:48:04 WARN org.apache.spark.scheduler.TaskSetManager: Stage 34 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
00:48:13.094 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 15964 variants.
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 6920
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 264
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 268
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 4680
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 487
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 3345
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 0
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0
00:48:47.955 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 1334 variants.
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 1334
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 0

Master branch run log

02:20:09.897 INFO  StructuralVariationDiscoveryPipelineSpark - Used 3957 evidence target links to annotate assembled breakpoints
02:20:10.082 INFO  StructuralVariationDiscoveryPipelineSpark - Called 336 imprecise deletion variants
02:20:10.121 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 9651 variants.
02:20:10.138 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 281
02:20:10.138 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 5630
02:20:10.138 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 2248
02:20:10.138 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 1492
02:20:10.147 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
02:20:10.148 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
02:20:10.148 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
02:20:10.148 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
02:20:10.148 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0
18/04/27 02:20:12 WARN org.apache.spark.scheduler.TaskSetManager: Stage 19 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 02:20:18 WARN org.apache.spark.scheduler.TaskSetManager: Stage 20 contains a task of very large size (2599 KB). The maximum recommended task size is 100 KB.
02:20:24.460 INFO  StructuralVariationDiscoveryPipelineSpark - Processing 323040 raw alignments from 254678 contigs.
18/04/27 02:20:24 WARN org.apache.spark.scheduler.TaskSetManager: Stage 22 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
02:20:32.040 INFO  StructuralVariationDiscoveryPipelineSpark - Filtering on MQ left 225280 contigs.
02:20:33.079 INFO  StructuralVariationDiscoveryPipelineSpark - 23841 contigs with chimeric alignments potentially giving SV signals.
18/04/27 02:20:34 WARN org.apache.spark.scheduler.TaskSetManager: Stage 32 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 02:20:42 WARN org.apache.spark.scheduler.TaskSetManager: Stage 33 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 02:20:49 WARN org.apache.spark.scheduler.TaskSetManager: Stage 34 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
02:20:58.973 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 15659 variants.
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 6920
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 264
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 268
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 5019
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 1793
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 1395
02:20:58.992 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 0
02:20:58.992 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
02:20:58.992 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0
02:21:46.603 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 1334 variants.
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 1334
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 0

As can be seen, the stable version of the interpretation tool is not affected, whereas the experimental version is affected as expected in the number of DEL, INS, and DUP calls (fewer DEL calls because RPL calls previously defaulted to DEL; fewer DUP calls because we now require the duplicated region to be > 49 bp; more variants overall since some RPL variants now generate two calls).

@cwhelan (Member) left a comment:

I have a bunch of questions about some of the refactoring here.

final Broadcast<ReferenceMultiSource> referenceBroadcast = svDiscoveryInputData.referenceBroadcast;
final Broadcast<SVIntervalTree<VariantContext>> cnvCallsBroadcast = svDiscoveryInputData.cnvCallsBroadcast;
final String updatedOutputPath = svDiscoveryInputData.outputPath + "experimentalInterpretation_";
final SvDiscoveryInputData.SampleSpecificData updatedSampleSpecificData =
cwhelan (Member):

I find this whole process very confusing. Why do we need to make a new SvDiscoveryInputData.SampleSpecificData that has the contigRawAlignments updated? It looks like it runs through the old pipeline once with contigRawAlignments set to getReads(), which are actually the aligned input reads, and then through the new pipeline with contigRawAlignments set to the aligned contigs (which makes more sense with the name of the field). Why not just have two different fields, named appropriately for what's going to be in them? Then you wouldn't have to create a new input object for the second run to begin with.

SHuang-Broad (Contributor, Author):

Yes, it is confusing, and it is a bit difficult to explain, so I drew the following chart describing what's happening:

[untitled diagram]

It looks like it runs through the old pipeline once with contigRawAlignments set to getReads(),

Those are actually not used, as the old pipeline takes (see diagram above) a JavaRDD<AlignedContig>.

So I took contigRawAlignments out of SvDiscoveryInputData, and renamed SvDiscoveryInputData to SvDiscoveryInputMetaData.

* A simple struct holding information of simple chimera of a particular contig and
* its good mapping to a non-canonical chromosome, if it has one.
*/
@DefaultSerializer(SimpleChimeraAndNCAMstring.Serializer.class)
cwhelan (Member):

What does NCAM stand for? I'd prefer not to add a new acronym if possible.

cwhelan (Member):

Why not just add the NCAM string to ChimericAlignment? It would save a lot of lines of code and having to reason about a new object type.

SHuang-Broad (Contributor, Author):

NCAM stands for "Non-CAnonical Mapping".
Yes, that is a better idea!

* @param inferredType inferred type of variant
* @param altHaplotypeSeq alt haplotype sequence (could be null)
* @param contigAlignments chimeric alignments from contigs used for generating this novel adjacency,
cwhelan (Member):

I think you're missing documentation for contigEvidence now.

SHuang-Broad (Contributor, Author):

updated


/**
* Write SAM file, if requested, for original alignments of contigs recognized as "Ambiguous", "Incomplete", and "MisAssemblySuspect"
* TODO: 11/17/17 salvation on assembly contigs that 1) has ambiguous "best" configuration, and 2) has incomplete picture; and flag accordingly
cwhelan (Member):

Is this todo still applicable?

SHuang-Broad (Contributor, Author):

Yes, it is; the ticket (#4229) was created a while ago.

Out of the two, the "incomplete" ones cost us more variants, in the sense that
even though the contig currently doesn't cover the full event,
extending it is unlikely to change what can currently be extracted.
I'm thinking about how to make such checks and get these variants back.

else
return hasIncompletePictureFromMultipleAlignments();
public enum AlignmentSignatureBasicTypes {
Suspicious, Simple, Complex;
cwhelan (Member):

I'd suggest renaming Suspicious to Unknown. Also I like to UPCASE all enum values.

SHuang-Broad (Contributor, Author):

done

@@ -195,6 +160,137 @@ static StrandSwitch determineStrandSwitch(final AlignmentInterval first, final A
}
}

// =================================================================================================================
// state query block
cwhelan (Member):

What does this mean?

SHuang-Broad (Contributor, Author):

A block of methods for querying the state of a SimpleChimera; the comment is now removed.

@@ -317,25 +319,38 @@ private static void forNonComplexVariants(final JavaRDD<AssemblyContigWithFineTu
throws IOException {
final SimpleNovelAdjacencyAndChimericAlignmentEvidence simpleNovelAdjacencyAndChimericAlignmentEvidence = pair._1;
final List<SvType> svTypes = pair._2;
cwhelan (Member):

I think you're assuming here that there are either 1 or 2 elements of this list. Can you add a validation of that?

SHuang-Broad (Contributor, Author):

Added.
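
For reference, a minimal sketch of what such a validation could look like (GATK's Utils.validate exists; the exact message and placement here are illustrative, not the committed code):

import org.broadinstitute.hellbender.utils.Utils;

// A simple novel adjacency is expected to be interpreted as either one variant
// (e.g. a lone DEL) or two linked variants (e.g. the >= 50 bp RPL case emitting
// an INS and a DEL linked by EVENT), so the list must hold exactly 1 or 2 elements.
Utils.validate(svTypes.size() == 1 || svTypes.size() == 2,
        "expected 1 or 2 inferred SvTypes but got " + svTypes.size());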

return Collections.singletonList( new SimpleSVType.Deletion(this, svLength) );
final int deletedLength = leftJustifiedRightRefLoc.getStart() - leftJustifiedLeftRefLoc.getEnd();
final int insertionLength = complication.getInsertedSequenceForwardStrandRep().length();
if ( deletedLength < 50 ) { // "fat" insertion
cwhelan (Member):

Make a constant for 50 and put it somewhere central?

SHuang-Broad (Contributor, Author):

added StructuralVariationDiscoveryArgumentCollection.STRUCTURAL_VARIANT_SIZE_LOWER_BOUND

RPL, // replacement
SMALL_DUP_EXPANSION, // small duplication tandem expansion
SMALL_DUP_CPX, // small duplication, either expansion or contraction, with complex structure
IntraChrStrandSwitch, // intra-chromosome strand-switch novel adjacency
cwhelan (Member):

Upcase these.

SHuang-Broad (Contributor, Author):

done

return Collections.singletonList( new SimpleSVType.DuplicationTandem(this, svLength) );
final BreakpointComplications.SmallDuplicationWithPreciseDupRangeBreakpointComplications duplicationComplication =
(BreakpointComplications.SmallDuplicationWithPreciseDupRangeBreakpointComplications) this.getComplication();
if (duplicationComplication.getDupSeqRepeatUnitRefSpan().size() < 50) {
cwhelan (Member):

Use constant.

SHuang-Broad (Contributor, Author):

done

@SHuang-Broad force-pushed the sh-sv-representation-change branch from 1aebf69 to fa389b0 on May 7, 2018 at 01:10
@SHuang-Broad (Contributor, Author) left a comment:

Hi Chris, I've updated the code following the requests.
Regarding one of your major concerns, the performance cost of splitting the RDDs into smaller ones and then working on them: I replied with a comment, and I'm not sure yet how to optimize it better.

public static EnumMap<RawTypes, JavaRDD<AssemblyContigWithFineTunedAlignments>> preprocess(final SvDiscoveryInputData svDiscoveryInputData,
final boolean writeSAMFiles) {
public static AssemblyContigsClassifiedByAlignmentSignatures preprocess(final SvDiscoveryInputData svDiscoveryInputData,
final boolean writeSAMFiles) {

final Broadcast<SAMSequenceDictionary> referenceSequenceDictionaryBroadcast = svDiscoveryInputData.referenceData.referenceSequenceDictionaryBroadcast;
SHuang-Broad (Contributor, Author):

removed


* Write SAM file, if requested, for original alignments of contigs recognized as "Ambiguous", "Incomplete", and "MisAssemblySuspect"
* TODO: 11/17/17 salvation on assembly contigs that 1) has ambiguous "best" configuration, and 2) has incomplete picture; and flag accordingly
*
* TODO: 3/4/18 a bug is present here that even though only one alignment has not-bad MQ, it could contain a large gap,
SHuang-Broad (Contributor, Author):

No. This particular bug was fixed, but I've found another edge case that's on my radar to fix; an example follows (notice the 58I gap "flanked" by the short 16M):

asm005554:tig00000	0	chr3	96003552	60	1137M58I16M438S	*	0	0	TATTAAAACCTAGCATTGATACTGATAGACAATGAGACTGGAGGATTTTTCTCTAGTGAAATAAATGAAGCAAGTAAAAAAATTGTCAGTTATGAAAATAAATCCTTGACCTTCACCATTTTATACTTCAATCACACAAACCCAAAATAAAACACTACTCATGATCACACAGATTCCATTCAACTTTTTAGTACTTTGATTTTAAATATTACTAGAAAATTTATGATTAGTGAATATTAGAAAAAGAAATCAGTTTCTACCATAAAAATACAGAAATACAAATAAGAAGAAAAAAAGAAAATTAGAGAAAACAGAGGCATTATCAAAGGTGATAGAACTTAAAAAAAAACCTTAATTTAATATACCAAGGTATATATGGTAATACATTACATCCATTAATATACCACAAGTTTAGATATCCCTTCTGTTCATATCCCATTGATAAATTTTTATGTAAATTGCCACATCTAAAATCAAGGAAAACAGGTAACATAGTTTGGCTGAATAGCAAATTATTTGAGAGGGGGAAAAGCAAATATACGTATCCACTAAAAAAAAACAAACAAAAACTAATATCATTATAAAAAGAGTTTTAAAAAAAGGAGAACAAAAAATATAGAGATCACAACATCAAATAACTTTTTCAATAAATATGTTCTGGAGAAAGGGATGACAACAGTTGTCTTGTGTTCATTTGCCCCACAAAAAGAGACCACAACGTCAATAAACAAGTTGACTTCAACGCCAGTGACTGAGGAGGTATGGTGGAGAGAACCAGGGGAGCAGCAAAATCTCTGTAGAACATAGAAGCCCAGGATAACACCATAGAGAGGGGAATAAGACATCCCACCCCTGCCACACTGTCTCCCCTGATAGGATTGGCTCAGTGGTGGGAGGACTTATTTTAGGAAAAAGATAAGCTGGAGATCCCTCTTGGTCCCCATTTCTACCATGGACACAAGACAACCTTTCTACAGGAGAGACCCACTGTCCTCACAGCCCCTAAATCCAGTTTGGAGAGTTCTTAAAAGTTTACACCATGCATCCCCCACTCTCATGATCTAAAAATTATACTTTTTTATATATGTATGCATACATATATAATTTTTATATATGTATGCATACATATATAATTTTTATATATGTATGCATACATATATAATTTTTATATATGTATGCATACATATAATTTTTATATATGTATGCATACATATATAATTTTTATATGTATGCATACATATATAATTTTTTATATGTATGCATACATATATAATTTTTATATGTATGCATACAATATAATTTTTTATATATTGTATGCATACATATACTGTTTTATATATGTATGCATACATATATACTTTTTTGTATGTACACATACACATATAATTTTTTATATATGCATGCATACATATATGATTTTTATATAGGCATGCATACATATATAATTTTTTATATATGCATGCATATATAAATGCATATAACTTTTTATATATGTATGCATACATAACTTTTTATATATGTATGCATACATAACTTTTTATATATGTATGCATACATAACTTTTATATATGTATGCATACATAACTTTTTATATATGTATGCATACATATATAACTTTTTATATATGTATGCATACATATATAACTTTT	*	SA:Z:chr3,96004627,+,1189S35M26I399M,60,26;	MD:Z:1153	RG:Z:GATKSVContigAlignments	NM:i:58	AS:i:1079	XS:i:58


//==================================================================================================================

public static final int LAST_ROUND_TUNING_ALIGNMENT_MAPQUAL_THREHOLD = 20;
SHuang-Broad (Contributor, Author):

These parameters (names now updated) and the following methods moved here from the now-gone class AssemblyContigAlignmentSignatureClassifier;
the intention is documented in the method removeNonUniqueMappings.

* See {@link #removeNonUniqueMappings(List, int, int)}
*/
@VisibleForTesting
static AssemblyContigWithFineTunedAlignments lastRoundAlignmentFineTuning(final AssemblyContigWithFineTunedAlignments tig,
SHuang-Broad (Contributor, Author):

This middle-man method is now removed; the intention is documented in removeNonUniqueMappings.

* </li>
* <li>
* there's strand switch,
* todo: this obsoletes the actual use of SimpleChimera#isCandidateInvertedDuplication()
SHuang-Broad (Contributor, Author):

I've updated the comment and moved the todo to SimpleChimera.isCandidateInvertedDuplication().
The problem at hand, essentially, is that this way of classifying the contig as incomplete makes that function "useless" in the new pipeline, because such contigs won't be sent down for analysis.

||
(firstAndLastAlignmentMappedToSameChr() && hasIncompletePictureDueToOverlappingRefOrderSwitch());
}
public static boolean hasIncompletePictureFromTwoAlignments(final AlignmentInterval headAlignment,
SHuang-Broad (Contributor, Author):

Yes, because some of the logic is different. One example: for 2-alignment contigs, alignments to different chromosomes are considered OK, but for multiple-alignment contigs, the head and tail alignments must be mapped to the same chromosome for the contig NOT to be considered "incomplete". The motivation is that the 2-alignment case could produce a BND record (or two, if both mates count), whereas a multiple-alignment contig could emit several, and I'm not sure yet how to interpret them (this is tied to the more general question of how to extract variants from the "incomplete" contigs).

}

@Override
public int hashCode() {
int result = sourceContigName.hashCode();
result = 31 * result + regionWithLowerCoordOnContig.hashCode();
result = 31 * result + regionWithHigherCoordOnContig.hashCode();
result = 31 * result + strandSwitch.ordinal();
result = 31 * result + strandSwitch.hashCode();
cwhelan (Member):

Looks like you regenerated this hashCode method and reverted from using ordinal, which will potentially give rise to bugs as we learned before. Please change back.

SHuang-Broad (Contributor, Author):

Oops! Thanks for catching that! Auto-generated by IntelliJ...
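
For the record, a sketch of the corrected method, restricted to the fields visible in the diff above (the real class may hash more fields). The underlying issue: Enum.hashCode() is identity-based and varies between JVMs, which breaks hashing of serialized objects on Spark executors, whereas ordinal() is stable:

@Override
public int hashCode() {
    int result = sourceContigName.hashCode();
    result = 31 * result + regionWithLowerCoordOnContig.hashCode();
    result = 31 * result + regionWithHigherCoordOnContig.hashCode();
    // use ordinal(), NOT hashCode(): enum hash codes are identity-based and
    // differ across JVMs, so hashes computed on different executors disagree
    result = 31 * result + strandSwitch.ordinal();
    return result;
}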

@@ -165,7 +165,16 @@ public final boolean hasEquallyGoodAlnConfigurations() {
return hasEquallyGoodAlnConfigurations;
}

public final boolean hasIncompletePicture() {
/**
* This method exist because extending the assembly contig in either direction
cwhelan (Member):

"method exists"

SHuang-Broad (Contributor, Author):

corrected

*
* If the assembly contig, with its two (picked) alignments, shows any of the following signature,
* If the assembly contig's alignments show any of the following signature,
cwhelan (Member):

"signatures"

SHuang-Broad (Contributor, Author):

corrected

*
* If the assembly contig, with its two (picked) alignments, shows any of the following signature,
* If the assembly contig's alignments show any of the following signature,
* it will be classified as incomplete.
* it is definitely not giving the whole picture of the alt haplotype,
cwhelan (Member):

With your edits I'd now start this sentence off, "Incomplete alignments do not give the whole picture.."

SHuang-Broad (Contributor, Author):

Added.

@@ -279,8 +279,13 @@ private boolean hasIncompletePicture() {
return AssemblyContigWithFineTunedAlignments.hasIncompletePictureFromTwoAlignments(regionWithLowerCoordOnContig, regionWithHigherCoordOnContig);
}

// TODO: 5/5/18 Note that the use of the following predicate is currently obsoleted by
cwhelan (Member):

I'm a bit confused by this comment: this method is still being called in several places, so how is it obsolete?

@@ -186,7 +189,7 @@ protected void runTool( final JavaSparkContext ctx ) {

// TODO: 1/14/18 this is the next version of precise variant calling
if ( expInterpret != null ) {
experimentalInterpretation(ctx, assembledEvidenceResults, svDiscoveryInputMetaData);
experimentalInterpretation(ctx, expInterpret, assembledEvidenceResults, svDiscoveryInputMetaData);
cwhelan (Member):

What's the use of this boolean variable? Why not just add the check that it's true to the if clause on the previous line, then you don't need the extra parameter.

SHuang-Broad (Contributor, Author):

Right! I think the original command-line option was a String telling where to store the whole directory of experimental interpretation. When I removed the directory creation and switched to the flag, I forgot to remove this parameter.

private void writeSAMfilesForUnknown(final String outputPrefix, final JavaRDD<GATKRead> assemblyRawAlignments,
final SAMFileHeader header) {

final Map<ReasonForAlignmentClassificationFailure, Iterable<String>> reasonToContigNames =
cwhelan (Member):

This is better, but you are still collecting the RDD and passing over the collection three times. What I meant by my original suggestion was this: why not make the map go the other way, i.e. make a Map<String, ReasonForAlignmentClassificationFailure> that maps contig names to their reasons? Then make a Map<ReasonForAlignmentClassificationFailure, SAMFileWriter> with three entries. Then you only have to iterate over the collection of reads once to write everything out (you just look up the writer for each entry).

@cwhelan (Member) left a comment:

Looks better after the changes. The most important thing remaining to fix is the hashCode method on SimpleChimera, which looks vulnerable to the serialized-hashcode-on-enums problem again. Also consider my comments on writing the three BAM files.

@SHuang-Broad (Contributor, Author) commented May 9, 2018

For record keeping, as the comments and replies may be buried in the many commits:


On the problem of too many RDD splits and performance concerns

Initial comment by @cwhelan :

I'm starting to really not like this approach of splitting up the RDD into lots of smaller RDDs for later processing. It seems inefficient to me: it launches tons of different Spark stages each of which has a bunch of overhead. Perhaps not in this PR, but I think it would be better to classify the contigs on the fly and dispatch them to the right processing methods in a single pass over the RDD.

Reply by @SHuang-Broad

I tried to fix it in this PR, but that seems to be a big task, and is probably impossible to achieve in a single pass, because currently each class of contig ends up producing a different type of object (3 general classes: simple -> SimpleNovelAdjacency, complex -> ComplexVariantCanonicalRepresentation, and unknown -> SAM records of the contigs), and a groupBy() operation is necessary in the middle using these objects as keys, because different contigs may produce the same variant.
So what I'm thinking about is two passes: one pass for splitting the contigs up into the 3 classes, then another pass on each of those 3 RDDs to turn them into VariantContexts.
Any better idea?

Reply by @cwhelan

That would be better, and yeah, you don't have to do it in this PR.
In theory you could make the keys for the groupByKey() (i.e. NovelAdjacencyAndAltHaplotype, CpxVariantCanonicalRepresentation, right?) all inherit from the same superclass and do a single group-by, couldn't you? Then you could do everything in a single pass.

Reply by @SHuang-Broad

Yes, that is what I'm planning, but I'm not sure yet how to approach it (I actually tried it before putting in the above comment, and quickly ran into the problem of mixing Java serialization and Kryo serialization, so a larger restructuring might be needed, not just an inheritance structure).
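
A hypothetical sketch of the single-group-by idea under discussion (VariantGroupingKey, ContigEvidence, classifyToKey, and evidenceOf are all illustrative names, and the Kryo-registration issue mentioned above is not addressed here):

import java.io.Serializable;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Common supertype so NovelAdjacencyAndAltHaplotype-style and
// CpxVariantCanonicalRepresentation-style keys can share one groupByKey().
abstract class VariantGroupingKey implements Serializable { }

// One shuffle covering all contig classes, instead of one groupBy per class;
// contigs is a JavaRDD<AssemblyContigWithFineTunedAlignments>.
JavaPairRDD<VariantGroupingKey, Iterable<ContigEvidence>> grouped =
        contigs.mapToPair(tig -> new Tuple2<>(classifyToKey(tig), evidenceOf(tig)))
               .groupByKey();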


On the problem of having a confusing TODO for

boolean SimpleChimera.isCandidateInvertedDuplication()

The todo message:

TODO: 5/5/18 Note that the use of the following predicate is currently obsoleted by
{@link AssemblyContigWithFineTunedAlignments#hasIncompletePictureFromTwoAlignments()}
because the contigs with this alignment signature are classified as "incomplete",
hence will NOT be sent here for constructing SimpleChimera's.
But we may want to keep the code (and related code in BreakpointComplications) for future use.

Comment by @cwhelan

I'm a bit confused by this comment: this method is still being called in several places, so how is it obsolete?

Reply by @SHuang-Broad (also copied into the updated "TODO" message):

This predicate is currently used in two places (excluding appearances in comments): BreakpointComplications.IntraChrStrandSwitchBreakpointComplications, where it is used to test if the input simple chimera indicates an inverted tandem duplication and to trigger the logic for inferring the duplicated region; and BreakpointsInference.IntraChrStrandSwitchBreakpointInference, where it is used for breakpoint inference. The problem is, the contig will not even be sent here, because AssemblyContigWithFineTunedAlignments.hasIncompletePictureFromTwoAlignments() defines a simple chimera that has a strand switch and two alignments overlapping on the reference as "incomplete", so in practice the two uses are not going to be triggered. But when we come back later to see what can be extracted from such "incomplete" contigs, this code could be useful again. So it is kept.


On the problem of writing out SAM records of "Unknown" contigs efficiently

First round comment by @cwhelan

This seems like a very inefficient way to write these three files. You end up calling collect on the RDD three different times and then traversing the local collection three times. Why not make a map of contig name to bam file, collect the rdd once, and then traverse the local collection once, writing each read to the appropriate bam file from the map?

Second round comment by @cwhelan

This is better, but you are still collecting the RDD and passing over the collection three times. What I meant by my original suggestion was this: why not make the map go the other way, i.e. make a Map<String, ReasonForAlignmentClassificationFailure> that maps contig names to their reasons? Then make a Map<ReasonForAlignmentClassificationFailure, SAMFileWriter> with three entries. Then you only have to iterate over the collection of reads once to write everything out (you just look up the writer for each entry).

Reply by @SHuang-Broad

That's smart!
But there's a catch: SAMFileWriter is unfortunately not serializable, so I cannot pass it to the forEach working on the JavaRDD<GATKRead>.
So I have two options: a) use writeReads() of GATKSparkTool, which takes an RDD, so we would have to go through the RDD three times again for the three cases; or b) still do a collect on the RDD, but this time filter by name (there are several thousand such contigs, resulting in ~12K GATKReads, which is not that bad) before the collect call.
I took the second approach.
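
To record the shape of the suggestion, a hedged sketch of the reason-to-writer map combined with the collect-first workaround (htsjdk's SAMFileWriterFactory and GATK's GATKRead.convertToSAMRecord are real APIs; the method name and file-naming scheme are illustrative):

import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import java.io.File;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import org.broadinstitute.hellbender.utils.read.GATKRead;

// One writer per failure reason, then a single pass over the collected reads.
// SAMFileWriter is not serializable, hence the collect() to the driver first.
static void writeByReason(final List<GATKRead> collectedReads,
                          final Map<String, ReasonForAlignmentClassificationFailure> reasonByContigName,
                          final SAMFileHeader header, final String outputPrefix) {
    final Map<ReasonForAlignmentClassificationFailure, SAMFileWriter> writers =
            new EnumMap<>(ReasonForAlignmentClassificationFailure.class);
    for (final ReasonForAlignmentClassificationFailure reason : ReasonForAlignmentClassificationFailure.values()) {
        writers.put(reason, new SAMFileWriterFactory()
                .makeSAMOrBAMWriter(header, true, new File(outputPrefix + reason + ".bam")));
    }
    for (final GATKRead read : collectedReads) {
        final ReasonForAlignmentClassificationFailure reason = reasonByContigName.get(read.getName());
        if (reason != null) {                      // only the "unknown" contigs get written
            writers.get(reason).addAlignment(read.convertToSAMRecord(header));
        }
    }
    writers.values().forEach(SAMFileWriter::close);
}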

  * move method parseOneContig() in ChimericAlignments to DiscoverVariantsFromContigAlignmentsSAMSpark, considering it is only used there
  * refactor and update SvDiscoveryInputData, rename it to SvDiscoveryInputMetaData, and move the contig alignments out of it
  * refactor to simplify contig alignment signature basic-type (suspicious, simple, complex) classification, removing class AssemblyContigAlignmentSignatureClassifier
  * simplify SimpleNovelAdjacencyAndChimericAlignmentEvidence

(SV) Representation change commit:
  * change SVLEN for CPX variants to the difference between [alt haplotype sequence length] and [affected reference region length]
  * change how replacement (RPL) variants are represented in NEW CODE PATH ONLY
  * change how DUP variants with short duplicated ref region are represented in NEW CODE PATH ONLY
@SHuang-Broad force-pushed the sh-sv-representation-change branch from fa389b0 to ffbca98 on May 10, 2018 at 00:59
@SHuang-Broad merged commit 0e8c5fd into master on May 10, 2018
@SHuang-Broad deleted the sh-sv-representation-change branch on May 10, 2018 at 01:58