
(SV) consolidate logic in simple chimera inference, update how variants are represented in VCF emitted by new code path #4663

Merged
merged 1 commit into master on May 10, 2018

Conversation

@SHuang-Broad (Contributor) commented Apr 17, 2018

Part of the roadmap laid out in #4111

Consolidate logic, update variant representation (PR#4663)

Consolidate logic in the following classes:

  • AssemblyContigAlignmentSignatureClassifier is now gone; its inner enum class RawTypes has moved to AssemblyContigWithFineTunedAlignments.AlignmentSignatureBasicTypes and been reduced to fewer cases (Suspicious, Simple, and Complex)

  • the static method BreakpointsInference.inferFromSimpleChimera() has moved to the state-query method ChimericAlignment.inferType()

  • AssemblyContigWithFineTunedAlignments.hasIncompletePictureFromTwoAlignments() has been merged with ChimericAlignment.hasIncompletePicture()

Update how variants are represented:

  • change SVLEN for CPX variants to the difference between [alt haplotype sequence length] and [affected reference region length], following the technical definition of SVLEN in the VCF spec (see the sketch after this list)

  • change RPL output to one of these (with test coverage added):

    • a single ins/del record, when the del/ins bases are < 50, with annotations; when the type is determined to be ins, POS is placed 1 base before the micro-deleted range, END at the end of the micro-deleted range, and the REF allele holds the corresponding reference bases
    • separate ins and del records when both are >= 50, linked by EVENT
  • change SVTYPE=DUP to SVTYPE=INS when the duplicated region is shorter than 50 bp (tests added). Note that this leads to INS records carrying DUP_REPEAT_UNIT_REF_SPAN and DUP_SEQ_CIGARS annotations (when available)
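
As a concrete illustration of the new SVLEN definition, a minimal sketch (the method name is illustrative, not the actual implementation; SimpleInterval is GATK's interval type):

import org.broadinstitute.hellbender.utils.SimpleInterval;

// SVLEN for a CPX variant = [alt haplotype length] - [affected reference region length];
// positive for a net gain of bases, negative for a net loss, per the VCF spec.
static int computeCpxSvLen(final byte[] altHaplotypeSequence, final SimpleInterval affectedRefRegion) {
    return altHaplotypeSequence.length - affectedRefRegion.size();
}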

In addition, we currently treat duplication expansion as insertion.
The VCF spec doesn't require duplications to be recorded as DUP.
If we decide to allow POS and END to designate the beginning and end of the duplicated reference region, we would need to make at least the following changes:

  • shift the left breakpoint to the right by 1 base compared to the current implementation, and
  • downstreamBreakpointRefPos = complication.getDupSeqRepeatUnitRefSpan().getEnd();

Bump test coverage:

  • SimpleNovelAdjacencyAndChimericAlignmentEvidence serialization test

  • NovelAdjacencyAndAltHaplotype.toSimpleOrBNDTypes() (only the SimpleSvType cases; BNDs will wait for a later PR)

@SHuang-Broad force-pushed the sh-sv-representation-change branch 2 times, most recently from c96209a to 5b8e571 on April 20, 2018 at 21:13
@codecov-io commented Apr 20, 2018

Codecov Report

Merging #4663 into master will increase coverage by 0.082%.
The diff coverage is 76.21%.

@@               Coverage Diff               @@
##              master     #4663       +/-   ##
===============================================
+ Coverage     79.973%   80.055%   +0.082%     
- Complexity     17421     17476       +55     
===============================================
  Files           1081      1080        -1     
  Lines          63140     63456      +316     
  Branches       10200     10271       +71     
===============================================
+ Hits           50495     50800      +305     
  Misses          8656      8656               
- Partials        3989      4000       +11
Impacted Files Coverage Δ Complexity Δ
...te/hellbender/tools/spark/sv/discovery/SvType.java 100% <ø> (ø) 6 <0> (ø) ⬇️
...ender/tools/spark/sv/utils/GATKSVVCFConstants.java 75% <ø> (ø) 1 <0> (ø) ⬇️
...tructuralVariationDiscoveryArgumentCollection.java 97.297% <ø> (ø) 0 <0> (ø) ⬇️
.../tools/spark/sv/discovery/BreakEndVariantType.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...lbender/tools/spark/sv/discovery/SimpleSVType.java 86.667% <100%> (+0.226%) 3 <1> (+1) ⬆️
...der/tools/spark/sv/utils/GATKSVVCFHeaderLines.java 82.813% <100%> (+0.273%) 7 <0> (ø) ⬇️
...pleNovelAdjacencyAndChimericAlignmentEvidence.java 87.5% <100%> (+63.176%) 10 <2> (+5) ⬆️
...ry/inference/CpxVariantInducingAssemblyContig.java 84.141% <100%> (ø) 27 <0> (ø) ⬇️
...s/spark/sv/discovery/SvDiscoveryInputMetaData.java 100% <100%> (ø) 2 <2> (?)
...overy/inference/TypeInferredFromSimpleChimera.java 100% <100%> (ø) 1 <0> (ø) ⬇️
... and 43 more

@SHuang-Broad force-pushed the sh-sv-representation-change branch from 5b8e571 to 954cf6d on April 26, 2018 at 22:39
@SHuang-Broad changed the title from "(SV) change how variants are represented in VCF emitted by new code path" to "(SV) consolidate logic in simple chimera inference, change how variants are represented in VCF emitted by new code path" on Apr 26, 2018
@SHuang-Broad changed the title from "(SV) consolidate logic in simple chimera inference, change how variants are represented in VCF emitted by new code path" to "(SV) consolidate logic in simple chimera inference, update how variants are represented in VCF emitted by new code path" on Apr 26, 2018
@SHuang-Broad force-pushed the sh-sv-representation-change branch from 954cf6d to 8bb042b on April 26, 2018 at 23:53
@SHuang-Broad (Contributor, Author) commented:

Feature (this) branch run log

00:47:25.107 INFO  StructuralVariationDiscoveryPipelineSpark - Used 3957 evidence target links to annotate assembled breakpoints
00:47:25.226 INFO  StructuralVariationDiscoveryPipelineSpark - Called 336 imprecise deletion variants
00:47:25.264 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 9651 variants.
00:47:25.280 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 281
00:47:25.281 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 5630
00:47:25.281 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 2248
00:47:25.281 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 1492
00:47:25.291 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
00:47:25.291 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
00:47:25.291 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
00:47:25.291 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
00:47:25.291 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0
18/04/27 00:47:28 WARN org.apache.spark.scheduler.TaskSetManager: Stage 19 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 00:47:33 WARN org.apache.spark.scheduler.TaskSetManager: Stage 20 contains a task of very large size (2685 KB). The maximum recommended task size is 100 KB.
00:47:39.417 INFO  StructuralVariationDiscoveryPipelineSpark - Processing 323040 raw alignments from 254678 contigs.
18/04/27 00:47:39 WARN org.apache.spark.scheduler.TaskSetManager: Stage 22 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
00:47:47.083 INFO  StructuralVariationDiscoveryPipelineSpark - Filtering on MQ left 225280 contigs.
00:47:48.135 INFO  StructuralVariationDiscoveryPipelineSpark - 23702 contigs with chimeric alignments potentially giving SV signals.
18/04/27 00:47:49 WARN org.apache.spark.scheduler.TaskSetManager: Stage 32 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 00:47:57 WARN org.apache.spark.scheduler.TaskSetManager: Stage 33 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 00:48:04 WARN org.apache.spark.scheduler.TaskSetManager: Stage 34 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
00:48:13.094 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 15964 variants.
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 6920
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 264
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 268
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 4680
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 487
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 3345
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 0
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
00:48:13.107 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0
00:48:47.955 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 1334 variants.
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 1334
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 0
00:48:47.956 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 0

Master branch run log

02:20:09.897 INFO  StructuralVariationDiscoveryPipelineSpark - Used 3957 evidence target links to annotate assembled breakpoints
02:20:10.082 INFO  StructuralVariationDiscoveryPipelineSpark - Called 336 imprecise deletion variants
02:20:10.121 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 9651 variants.
02:20:10.138 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 281
02:20:10.138 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 5630
02:20:10.138 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 2248
02:20:10.138 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 1492
02:20:10.147 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
02:20:10.148 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
02:20:10.148 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
02:20:10.148 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
02:20:10.148 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0
18/04/27 02:20:12 WARN org.apache.spark.scheduler.TaskSetManager: Stage 19 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 02:20:18 WARN org.apache.spark.scheduler.TaskSetManager: Stage 20 contains a task of very large size (2599 KB). The maximum recommended task size is 100 KB.
02:20:24.460 INFO  StructuralVariationDiscoveryPipelineSpark - Processing 323040 raw alignments from 254678 contigs.
18/04/27 02:20:24 WARN org.apache.spark.scheduler.TaskSetManager: Stage 22 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
02:20:32.040 INFO  StructuralVariationDiscoveryPipelineSpark - Filtering on MQ left 225280 contigs.
02:20:33.079 INFO  StructuralVariationDiscoveryPipelineSpark - 23841 contigs with chimeric alignments potentially giving SV signals.
18/04/27 02:20:34 WARN org.apache.spark.scheduler.TaskSetManager: Stage 32 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 02:20:42 WARN org.apache.spark.scheduler.TaskSetManager: Stage 33 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
18/04/27 02:20:49 WARN org.apache.spark.scheduler.TaskSetManager: Stage 34 contains a task of very large size (2262 KB). The maximum recommended task size is 100 KB.
02:20:58.973 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 15659 variants.
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 6920
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 264
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 268
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 5019
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 1793
02:20:58.991 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 1395
02:20:58.992 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 0
02:20:58.992 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
02:20:58.992 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 0
02:21:46.603 INFO  StructuralVariationDiscoveryPipelineSpark - Discovered 1334 variants.
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - CPX: 1334
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - INV: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - BND_NOSS: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - DUP_INV: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV33: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - BND_INV55: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - DEL: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - DUP: 0
02:21:46.604 INFO  StructuralVariationDiscoveryPipelineSpark - INS: 0

As can be seen, the stable version of the interpretation tool is not affected, whereas the experimental version is affected as expected in the number of DEL, INS, and DUP calls (fewer DEL calls because RPL calls previously defaulted to DEL; fewer DUP calls because we now require the duplicated region to be > 49 bp; more variants overall since some RPL variants now generate two calls).

@cwhelan (Member) left a comment:

I have a bunch of questions about some of the refactoring here.

final Broadcast<ReferenceMultiSource> referenceBroadcast = svDiscoveryInputData.referenceBroadcast;
final Broadcast<SVIntervalTree<VariantContext>> cnvCallsBroadcast = svDiscoveryInputData.cnvCallsBroadcast;
final String updatedOutputPath = svDiscoveryInputData.outputPath + "experimentalInterpretation_";
final SvDiscoveryInputData.SampleSpecificData updatedSampleSpecificData =
cwhelan (Member):

I find this whole process very confusing. Why do we need to make a new SvDiscoveryInputData.SampleSpecificData that has the contigRawAlignments updated? It looks like it runs through the old pipeline once with contigRawAlignments set to getReads(), which are actually the aligned input reads, and then through the new pipeline with contigRawAlignments set to the aligned contigs (which makes more sense with the name of the field). Why not just have two different fields, named appropriately for what's going to be in them? Then you wouldn't have to create a new input object for the second run to begin with.

SHuang-Broad (Contributor, Author):

Yes, it is confusing, and it is a bit difficult to explain, so I drew the following chart describing what's happening:

[untitled diagram]

It looks like it runs through the old pipeline once with contigRawAlignments set to getReads(),

Those are actually not used, as the old pipeline takes (see diagram above) a JavaRDD<AlignedContig>.

So I took contigRawAlignments out of SvDiscoveryInputData, and renamed SvDiscoveryInputData to SvDiscoveryInputMetaData.

* A simple struct holding information of simple chimera of a particular contig and
* its good mapping to a non-canonical chromosome, if it has one.
*/
@DefaultSerializer(SimpleChimeraAndNCAMstring.Serializer.class)
cwhelan (Member):

What does NCAM stand for? I'd prefer not to add a new acronym if possible.

cwhelan (Member):

Why not just add the NCAM string to ChimericAlignment? It would save a lot of lines of code and having to reason about a new object type.

SHuang-Broad (Contributor, Author):

NCAM stands for "Non-CAnonical Mapping".
Yes, that is a better idea!

* @param inferredType inferred type of variant
* @param altHaplotypeSeq alt haplotype sequence (could be null)
* @param contigAlignments chimeric alignments from contigs used for generating this novel adjacency,
cwhelan (Member):

I think you're missing documentation for contigEvidence now.

SHuang-Broad (Contributor, Author):

updated


/**
* Write SAM file, if requested, for original alignments of contigs recognized as "Ambiguous", "Incomplete", and "MisAssemblySuspect"
* TODO: 11/17/17 salvation on assembly contigs that 1) has ambiguous "best" configuration, and 2) has incomplete picture; and flag accordingly
cwhelan (Member):

Is this todo still applicable?

SHuang-Broad (Contributor, Author):

Yes, it is; the ticket (#4229) was created a while ago.

Out of the two, the "incomplete" ones cost us more variants, in the sense that
even though the contig currently doesn't cover the full event,
extending it is unlikely to change what can currently be extracted.
I'm thinking about how to make such checks and get these variants back.

else
return hasIncompletePictureFromMultipleAlignments();
public enum AlignmentSignatureBasicTypes {
Suspicious, Simple, Complex;
cwhelan (Member):

I'd suggest renaming Suspicious to Unknown. Also I like to UPCASE all enum values.

SHuang-Broad (Contributor, Author):

done

@@ -195,6 +160,137 @@ static StrandSwitch determineStrandSwitch(final AlignmentInterval first, final A
}
}

// =================================================================================================================
// state query block
cwhelan (Member):

What does this mean?

SHuang-Broad (Contributor, Author):

A block of methods for querying the state of a SimpleChimera; the comment is now removed.

@@ -317,25 +319,38 @@ private static void forNonComplexVariants(final JavaRDD<AssemblyContigWithFineTu
throws IOException {
final SimpleNovelAdjacencyAndChimericAlignmentEvidence simpleNovelAdjacencyAndChimericAlignmentEvidence = pair._1;
final List<SvType> svTypes = pair._2;
cwhelan (Member):

I think you're assuming here that there are either 1 or 2 elements of this list. Can you add a validation of that?

SHuang-Broad (Contributor, Author):

Added.
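
For reference, a minimal sketch of what such a validation could look like (GATK's Utils.validate exists; the exact message and placement here are illustrative, not the committed code):

import org.broadinstitute.hellbender.utils.Utils;

// A simple novel adjacency is expected to be interpreted as either one variant
// (e.g. a lone DEL) or two linked variants (e.g. the >= 50 bp RPL case emitting
// an INS and a DEL linked by EVENT), so the list must hold exactly 1 or 2 elements.
Utils.validate(svTypes.size() == 1 || svTypes.size() == 2,
        "expected 1 or 2 inferred SvTypes but got " + svTypes.size());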

return Collections.singletonList( new SimpleSVType.Deletion(this, svLength) );
final int deletedLength = leftJustifiedRightRefLoc.getStart() - leftJustifiedLeftRefLoc.getEnd();
final int insertionLength = complication.getInsertedSequenceForwardStrandRep().length();
if ( deletedLength < 50 ) { // "fat" insertion
cwhelan (Member):

Make a constant for 50 and put it somewhere central?

SHuang-Broad (Contributor, Author):

added StructuralVariationDiscoveryArgumentCollection.STRUCTURAL_VARIANT_SIZE_LOWER_BOUND

RPL, // replacement
SMALL_DUP_EXPANSION, // small duplication tandem expansion
SMALL_DUP_CPX, // small duplication, either expansion or contraction, with complex structure
IntraChrStrandSwitch, // intra-chromosome strand-switch novel adjacency
cwhelan (Member):

Upcase these.

SHuang-Broad (Contributor, Author):

done

return Collections.singletonList( new SimpleSVType.DuplicationTandem(this, svLength) );
final BreakpointComplications.SmallDuplicationWithPreciseDupRangeBreakpointComplications duplicationComplication =
(BreakpointComplications.SmallDuplicationWithPreciseDupRangeBreakpointComplications) this.getComplication();
if (duplicationComplication.getDupSeqRepeatUnitRefSpan().size() < 50) {
cwhelan (Member):

Use constant.

SHuang-Broad (Contributor, Author):

done

@SHuang-Broad force-pushed the sh-sv-representation-change branch from 1aebf69 to fa389b0 on May 7, 2018 at 01:10
@SHuang-Broad (Contributor, Author) left a comment:

Hi Chris, I've updated the code following the requests.
Regarding one of your major concerns, the performance cost of splitting the RDDs into smaller ones and then working on them: I replied with a comment, and I'm not sure yet how to optimize it better.

public static EnumMap<RawTypes, JavaRDD<AssemblyContigWithFineTunedAlignments>> preprocess(final SvDiscoveryInputData svDiscoveryInputData,
final boolean writeSAMFiles) {
public static AssemblyContigsClassifiedByAlignmentSignatures preprocess(final SvDiscoveryInputData svDiscoveryInputData,
final boolean writeSAMFiles) {

final Broadcast<SAMSequenceDictionary> referenceSequenceDictionaryBroadcast = svDiscoveryInputData.referenceData.referenceSequenceDictionaryBroadcast;
SHuang-Broad (Contributor, Author):

removed


* Write SAM file, if requested, for original alignments of contigs recognized as "Ambiguous", "Incomplete", and "MisAssemblySuspect"
* TODO: 11/17/17 salvation on assembly contigs that 1) has ambiguous "best" configuration, and 2) has incomplete picture; and flag accordingly
*
* TODO: 3/4/18 a bug is present here that even though only one alignment has not-bad MQ, it could contain a large gap,
SHuang-Broad (Contributor, Author):

No. This particular bug was fixed, but I've found another edge case that's on my radar to fix; an example follows (notice the 58I gap "flanked" by the short 16M):

asm005554:tig00000	0	chr3	96003552	60	1137M58I16M438S	*	0	0	TATTAAAACCTAGCATTGATACTGATAGACAATGAGACTGGAGGATTTTTCTCTAGTGAAATAAATGAAGCAAGTAAAAAAATTGTCAGTTATGAAAATAAATCCTTGACCTTCACCATTTTATACTTCAATCACACAAACCCAAAATAAAACACTACTCATGATCACACAGATTCCATTCAACTTTTTAGTACTTTGATTTTAAATATTACTAGAAAATTTATGATTAGTGAATATTAGAAAAAGAAATCAGTTTCTACCATAAAAATACAGAAATACAAATAAGAAGAAAAAAAGAAAATTAGAGAAAACAGAGGCATTATCAAAGGTGATAGAACTTAAAAAAAAACCTTAATTTAATATACCAAGGTATATATGGTAATACATTACATCCATTAATATACCACAAGTTTAGATATCCCTTCTGTTCATATCCCATTGATAAATTTTTATGTAAATTGCCACATCTAAAATCAAGGAAAACAGGTAACATAGTTTGGCTGAATAGCAAATTATTTGAGAGGGGGAAAAGCAAATATACGTATCCACTAAAAAAAAACAAACAAAAACTAATATCATTATAAAAAGAGTTTTAAAAAAAGGAGAACAAAAAATATAGAGATCACAACATCAAATAACTTTTTCAATAAATATGTTCTGGAGAAAGGGATGACAACAGTTGTCTTGTGTTCATTTGCCCCACAAAAAGAGACCACAACGTCAATAAACAAGTTGACTTCAACGCCAGTGACTGAGGAGGTATGGTGGAGAGAACCAGGGGAGCAGCAAAATCTCTGTAGAACATAGAAGCCCAGGATAACACCATAGAGAGGGGAATAAGACATCCCACCCCTGCCACACTGTCTCCCCTGATAGGATTGGCTCAGTGGTGGGAGGACTTATTTTAGGAAAAAGATAAGCTGGAGATCCCTCTTGGTCCCCATTTCTACCATGGACACAAGACAACCTTTCTACAGGAGAGACCCACTGTCCTCACAGCCCCTAAATCCAGTTTGGAGAGTTCTTAAAAGTTTACACCATGCATCCCCCACTCTCATGATCTAAAAATTATACTTTTTTATATATGTATGCATACATATATAATTTTTATATATGTATGCATACATATATAATTTTTATATATGTATGCATACATATATAATTTTTATATATGTATGCATACATATAATTTTTATATATGTATGCATACATATATAATTTTTATATGTATGCATACATATATAATTTTTTATATGTATGCATACATATATAATTTTTATATGTATGCATACAATATAATTTTTTATATATTGTATGCATACATATACTGTTTTATATATGTATGCATACATATATACTTTTTTGTATGTACACATACACATATAATTTTTTATATATGCATGCATACATATATGATTTTTATATAGGCATGCATACATATATAATTTTTTATATATGCATGCATATATAAATGCATATAACTTTTTATATATGTATGCATACATAACTTTTTATATATGTATGCATACATAACTTTTTATATATGTATGCATACATAACTTTTATATATGTATGCATACATAACTTTTTATATATGTATGCATACATATATAACTTTTTATATATGTATGCATACATATATAACTTTT	*	SA:Z:chr3,96004627,+,1189S35M26I399M,60,26;	MD:Z:1153	RG:Z:GATKSVContigAlignments	NM:i:58	AS:i:1079	XS:i:58


//==================================================================================================================

public static final int LAST_ROUND_TUNING_ALIGNMENT_MAPQUAL_THREHOLD = 20;
SHuang-Broad (Contributor, Author):

These parameters (names now updated) and the following methods moved here from the now-gone class AssemblyContigAlignmentSignatureClassifier;
the intention is documented in the method removeNonUniqueMappings.

* See {@link #removeNonUniqueMappings(List, int, int)}
*/
@VisibleForTesting
static AssemblyContigWithFineTunedAlignments lastRoundAlignmentFineTuning(final AssemblyContigWithFineTunedAlignments tig,
SHuang-Broad (Contributor, Author):

This middle-man method is now removed; the intention is documented in removeNonUniqueMappings.

* </li>
* <li>
* there's strand switch,
* todo: this obsoletes the actual use of SimpleChimera#isCandidateInvertedDuplication()
SHuang-Broad (Contributor, Author):

I've updated the comment and moved the todo to SimpleChimera.isCandidateInvertedDuplication().
The problem at hand, essentially, is that this way of classifying the contig as incomplete makes that function "useless" in the new pipeline, because such contigs won't be sent down for analysis.

||
(firstAndLastAlignmentMappedToSameChr() && hasIncompletePictureDueToOverlappingRefOrderSwitch());
}
public static boolean hasIncompletePictureFromTwoAlignments(final AlignmentInterval headAlignment,
SHuang-Broad (Contributor, Author):

Yes, because some of the logic is different. One example: for 2-alignment contigs, alignments to different chromosomes are considered OK, but for multiple-alignment contigs, the head and tail alignments must be mapped to the same chromosome for the contig NOT to be considered "incomplete". The motivation is that the 2-alignment case could produce a BND record (or two, if both mates count), whereas a multiple-alignment contig could emit several, and I'm not sure yet how to interpret them (this is tied to the more general question of how to extract variants from the "incomplete" contigs).

}

@Override
public int hashCode() {
int result = sourceContigName.hashCode();
result = 31 * result + regionWithLowerCoordOnContig.hashCode();
result = 31 * result + regionWithHigherCoordOnContig.hashCode();
result = 31 * result + strandSwitch.ordinal();
result = 31 * result + strandSwitch.hashCode();
cwhelan (Member):

Looks like you regenerated this hashCode method and reverted from using ordinal, which will potentially give rise to bugs as we learned before. Please change back.

SHuang-Broad (Contributor, Author):

Oops! Thanks for catching that! Auto-generated by IntelliJ...
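
For the record, a sketch of the corrected method, restricted to the fields visible in the diff above (the real class may hash more fields). The underlying issue: Enum.hashCode() is identity-based and varies between JVMs, which breaks hashing of serialized objects on Spark executors, whereas ordinal() is stable:

@Override
public int hashCode() {
    int result = sourceContigName.hashCode();
    result = 31 * result + regionWithLowerCoordOnContig.hashCode();
    result = 31 * result + regionWithHigherCoordOnContig.hashCode();
    // use ordinal(), NOT hashCode(): enum hash codes are identity-based and
    // differ across JVMs, so hashes computed on different executors disagree
    result = 31 * result + strandSwitch.ordinal();
    return result;
}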

@@ -165,7 +165,16 @@ public final boolean hasEquallyGoodAlnConfigurations() {
return hasEquallyGoodAlnConfigurations;
}

public final boolean hasIncompletePicture() {
/**
* This method exist because extending the assembly contig in either direction
cwhelan (Member):

"method exists"

SHuang-Broad (Contributor, Author):

corrected

*
* If the assembly contig, with its two (picked) alignments, shows any of the following signature,
* If the assembly contig's alignments show any of the following signature,
cwhelan (Member):

"signatures"

SHuang-Broad (Contributor, Author):

corrected

*
* If the assembly contig, with its two (picked) alignments, shows any of the following signature,
* If the assembly contig's alignments show any of the following signature,
* it will be classified as incomplete.
* it is definitely not giving the whole picture of the alt haplotype,
cwhelan (Member):

With your edits I'd now start this sentence off, "Incomplete alignments do not give the whole picture.."

SHuang-Broad (Contributor, Author):

Added.

@@ -279,8 +279,13 @@ private boolean hasIncompletePicture() {
return AssemblyContigWithFineTunedAlignments.hasIncompletePictureFromTwoAlignments(regionWithLowerCoordOnContig, regionWithHigherCoordOnContig);
}

// TODO: 5/5/18 Note that the use of the following predicate is currently obsoleted by
cwhelan (Member):

I'm a bit confused by this comment: this method is still being called in several places, so how is it obsolete?

@@ -186,7 +189,7 @@ protected void runTool( final JavaSparkContext ctx ) {

// TODO: 1/14/18 this is the next version of precise variant calling
if ( expInterpret != null ) {
experimentalInterpretation(ctx, assembledEvidenceResults, svDiscoveryInputMetaData);
experimentalInterpretation(ctx, expInterpret, assembledEvidenceResults, svDiscoveryInputMetaData);
cwhelan (Member):

What's the use of this boolean variable? Why not just add the check that it's true to the if clause on the previous line, then you don't need the extra parameter.

SHuang-Broad (Contributor, Author):

Right! I think the original command-line option was a String telling where to store the whole directory of experimental interpretation. When I removed the directory creation and switched to the flag, I forgot to remove this parameter.

private void writeSAMfilesForUnknown(final String outputPrefix, final JavaRDD<GATKRead> assemblyRawAlignments,
final SAMFileHeader header) {

final Map<ReasonForAlignmentClassificationFailure, Iterable<String>> reasonToContigNames =
cwhelan (Member):

This is better, but you are still collecting the RDD and passing over the collection three times. What I meant by my original suggestion was this: why not make the map go the other way, i.e. make a Map<String, ReasonForAlignmentClassificationFailure> that maps contig names to their reasons? Then make a Map<ReasonForAlignmentClassificationFailure, SAMFileWriter> with three entries. Then you only have to iterate over the collection of reads once to write everything out (you just look up the writer for each entry).

@cwhelan (Member) left a comment:

Looks better after the changes. The most important thing remaining to fix is the hashCode method on SimpleChimera, which looks vulnerable to the serialized-hashcode-on-enums problem again. Also consider my comments on writing the three BAM files.

@SHuang-Broad (Contributor, Author) commented May 9, 2018

For record keeping, as the comments and replies may be buried in the many commits:


On the problem of too many RDD splits and performance concerns

Initial comment by @cwhelan :

I'm starting to really not like this approach of splitting up the RDD into lots of smaller RDDs for later processing. It seems inefficient to me: it launches tons of different Spark stages each of which has a bunch of overhead. Perhaps not in this PR, but I think it would be better to classify the contigs on the fly and dispatch them to the right processing methods in a single pass over the RDD.

Reply by @SHuang-Broad

I tried to fix it in this PR, but that seems to be a big task, and is probably impossible to achieve in a single pass, because currently each class of contig ends up producing a different type of object (3 general classes: simple -> SimpleNovelAdjacency, complex -> ComplexVariantCanonicalRepresentation, and unknown -> SAM records of the contigs), and a groupBy() operation is necessary in the middle using these objects as keys, because different contigs may produce the same variant.
So what I'm thinking about is two passes: one pass for splitting the contigs up into the 3 classes, then another pass on each of those 3 RDDs to turn them into VariantContexts.
Any better idea?

Reply by @cwhelan

That would be better, and yeah, you don't have to do it in this PR.
In theory you could make the keys for the groupByKey() (i.e. NovelAdjacencyAndAltHaplotype, CpxVariantCanonicalRepresentation, right?) all inherit from the same superclass and do a single group-by, couldn't you? Then you could do everything in a single pass.

Reply by @SHuang-Broad

Yes, that is what I'm planning, but I'm not sure yet how to approach it (I actually tried it before putting in the above comment, and quickly ran into the problem of mixing Java serialization and Kryo serialization, so a larger restructuring might be needed, not just an inheritance structure).
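
A hypothetical sketch of the single-group-by idea under discussion (VariantGroupingKey, ContigEvidence, classifyToKey, and evidenceOf are all illustrative names, and the Kryo-registration issue mentioned above is not addressed here):

import java.io.Serializable;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Common supertype so NovelAdjacencyAndAltHaplotype-style and
// CpxVariantCanonicalRepresentation-style keys can share one groupByKey().
abstract class VariantGroupingKey implements Serializable { }

// One shuffle covering all contig classes, instead of one groupBy per class;
// contigs is a JavaRDD<AssemblyContigWithFineTunedAlignments>.
JavaPairRDD<VariantGroupingKey, Iterable<ContigEvidence>> grouped =
        contigs.mapToPair(tig -> new Tuple2<>(classifyToKey(tig), evidenceOf(tig)))
               .groupByKey();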


On the problem of having a confusing TODO for

boolean SimpleChimera.isCandidateInvertedDuplication()

The todo message:

TODO: 5/5/18 Note that the use of the following predicate is currently obsoleted by
{@link AssemblyContigWithFineTunedAlignments#hasIncompletePictureFromTwoAlignments()}
because the contigs with this alignment signature are classified as "incomplete",
hence will NOT be sent here for constructing SimpleChimera's.
But we may want to keep the code (and related code in BreakpointComplications) for future use.

Comment by @cwhelan

I'm a bit confused by this comment: this method is still being called in several places, so how is it obsolete?

Reply by @SHuang-Broad (also copied into the updated "TODO" message):

This predicate is currently used in two places (excluding appearances in comments): BreakpointComplications.IntraChrStrandSwitchBreakpointComplications, where it is used to test if the input simple chimera indicates an inverted tandem duplication and to trigger the logic for inferring the duplicated region; and BreakpointsInference.IntraChrStrandSwitchBreakpointInference, where it is used for breakpoint inference. The problem is, the contig will not even be sent here, because AssemblyContigWithFineTunedAlignments.hasIncompletePictureFromTwoAlignments() defines a simple chimera that has a strand switch and two alignments overlapping on the reference as "incomplete", so in practice the two uses are not going to be triggered. But when we come back later to see what can be extracted from such "incomplete" contigs, this code could be useful again. So it is kept.


On the problem of writing out SAM records of "Unknown" contigs efficiently

First round comment by @cwhelan

This seems like a very inefficient way to write these three files. You end up calling collect on the RDD three different times and then traversing the local collection three times. Why not make a map of contig name to bam file, collect the rdd once, and then traverse the local collection once, writing each read to the appropriate bam file from the map?

Second round comment by @cwhelan

This is better, but you are still collecting the RDD and passing over the collection three times. What I meant by my original suggestion was this: why not make the map go the other way, i.e. make a Map<String, ReasonForAlignmentClassificationFailure> that maps contig names to their reasons? Then make a Map<ReasonForAlignmentClassificationFailure, SAMFileWriter> with three entries. Then you only have to iterate over the collection of reads once to write everything out (you just look up the writer for each entry).

Reply by @SHuang-Broad

That's smart!
But there's a catch: SAMFileWriter is unfortunately not serializable, so I cannot pass it to the forEach working on the JavaRDD<GATKRead>.
So I have two options: a) use writeReads() of GATKSparkTool, which takes an RDD, so we would have to go through the RDD three times again for the three cases; or b) still do a collect on the RDD, but this time filter by name (there are several thousand such contigs, resulting in ~12K GATKReads, which is not that bad) before the collect call.
I took the second approach.
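
To record the shape of the suggestion, a hedged sketch of the reason-to-writer map combined with the collect-first workaround (htsjdk's SAMFileWriterFactory and GATK's GATKRead.convertToSAMRecord are real APIs; the method name and file-naming scheme are illustrative):

import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import java.io.File;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import org.broadinstitute.hellbender.utils.read.GATKRead;

// One writer per failure reason, then a single pass over the collected reads.
// SAMFileWriter is not serializable, hence the collect() to the driver first.
static void writeByReason(final List<GATKRead> collectedReads,
                          final Map<String, ReasonForAlignmentClassificationFailure> reasonByContigName,
                          final SAMFileHeader header, final String outputPrefix) {
    final Map<ReasonForAlignmentClassificationFailure, SAMFileWriter> writers =
            new EnumMap<>(ReasonForAlignmentClassificationFailure.class);
    for (final ReasonForAlignmentClassificationFailure reason : ReasonForAlignmentClassificationFailure.values()) {
        writers.put(reason, new SAMFileWriterFactory()
                .makeSAMOrBAMWriter(header, true, new File(outputPrefix + reason + ".bam")));
    }
    for (final GATKRead read : collectedReads) {
        final ReasonForAlignmentClassificationFailure reason = reasonByContigName.get(read.getName());
        if (reason != null) {                      // only the "unknown" contigs get written
            writers.get(reason).addAlignment(read.convertToSAMRecord(header));
        }
    }
    writers.values().forEach(SAMFileWriter::close);
}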

  * move method parseOneContig() in ChimericAlignments to DiscoverVariantsFromContigAlignmentsSAMSpark, considering it is only used there
  * refactor and update SvDiscoveryInputData, rename it to SvDiscoveryInputMetaData, and move the contig alignments out of it
  * refactor to simplify contig alignment signature basic-type (suspicious, simple, complex) classification, removing class AssemblyContigAlignmentSignatureClassifier
  * simplify SimpleNovelAdjacencyAndChimericAlignmentEvidence

(SV) Representation change commit:
  * change SVLEN for CPX variants to the difference between [alt haplotype sequence length] and [affected reference region length]
  * change how replacement (RPL) variants are represented in NEW CODE PATH ONLY
  * change how DUP variants with short duplicated ref region are represented in NEW CODE PATH ONLY
@SHuang-Broad force-pushed the sh-sv-representation-change branch from fa389b0 to ffbca98 on May 10, 2018 at 00:59
@SHuang-Broad merged commit 0e8c5fd into master on May 10, 2018
@SHuang-Broad deleted the sh-sv-representation-change branch on May 10, 2018 at 01:58