Switch to the updated type & location inference tool in SV pipeline #4111

Open
SHuang-Broad opened this issue Jan 9, 2018 · 2 comments

We have finished implementing the updated logic for interpreting variants and inferring their locations from local assembly contig alignment signatures. It is time to clean up the corresponding package in the pipeline and switch to the updated implementation. It now outputs not only insertions, deletions, small tandem duplications, and inversions, but also novel adjacencies (BND records whose meaning cannot be fully resolved from assembly alignment signatures alone) and complex variants (<CPX>), which in theory can be arbitrarily complex as long as we have assembled across the full event.

Planned organization

The discovery package can now be divided roughly into:

interface

SvDiscoveryDataBundle, SvDiscoverFromLocalAssemblyContigAlignmentsSpark, SvType, AnnotatedVariantProducer

alignment prep (sub package)

AlignmentInterval, AlignedContig (refactor AssemblyContigWithFineTunedAlignments into AlignedContig), AlignedContigGenerator, AlignedAssembly, ContigAlignmentsModifier (refactor AlnModType into it), GappedAlignmentSplitter, StrandSwitch, FilterLongReadAlignmentsSAMSpark (factor out the major methods in the new alignment filter by score into a 1st level class)

type & location inference (sub package)

  • imprecise: refactor out methods from to-be-deprecated DiscoverVariantsFromContigAlignmentsSAMSpark

  • alignment classification: ChimericAlignment and NovelAdjacencyReferenceLocations (very tricky to decouple the functionalities because both have over 50 uses), AssemblyContigAlignmentSignatureClassifier, VariantDetectorFromLocalAssemblyContigAlignments

  • simple: SimpleSVType, SvTypeInference, InsDelVariantDetector, BreakpointComplications (rename to BreakpointComplicationsForSimpleTypes)

  • complex: BreakEndVariantType, SuspectedTransLocDetector, SimpleStrandSwitchVariantDetector

deprecated

DiscoverVariantsFromContigAlignmentsSAMSpark

It currently provides three groups of functionality:

  • novel adjacency detection (for ins, del, small dup, and inversion only) by delegating to ChimericAlignment.parseOneContig and NovelAdjacencyReferenceLocations(ChimericAlignment chimericAlignment, byte[] contigSequence, SAMSequenceDictionary); this should be deprecated
  • exact variant type inference (delegated to SvTypeInference.inferFromNovelAdjacency()) and annotation (delegated to AnnotatedVariantProducer.produceAnnotatedVcFromInferredTypeAndRefLocations()); this should be deprecated
  • imprecise variant detection; this should be kept and factored out


Planned steps

  1. repackaging & refactoring (no logic change, see merge sv discovery code path, commit 0 #3934 )
  2. bring in some valuable changes made in PR HOLD: Bring in more duplication expansion calls #3668
  3. more test coverage (ticket Follow up with necessary tests for cpx SV PR series #3431)
  4. switch
    make StructuralVariationDiscoveryPipelineSpark call into SvDiscoverFromLocalAssemblyContigAlignmentsSpark by default and optionally into DiscoverVariantsFromContigAlignmentsSAMSpark, i.e. the opposite of what we currently do.
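
The switch in step 4 amounts to flipping a default. A minimal sketch of that selection follows; the flag name `useLegacyInterpretation` and the enum are hypothetical and are not taken from the GATK code base, which may expose the choice differently.

```java
// Hypothetical sketch: only the two tool names come from this issue.
final class InterpretationToolSwitchSketch {

    /** The two interpretation code paths named in the plan. */
    enum Tool {
        SV_DISCOVER_FROM_LOCAL_ASSEMBLY_CONTIG_ALIGNMENTS_SPARK,
        DISCOVER_VARIANTS_FROM_CONTIG_ALIGNMENTS_SAM_SPARK
    }

    /**
     * After the switch, the new tool is the default and the old one is opt-in.
     * The boolean stands in for whatever command-line argument ends up being used.
     */
    static Tool chooseTool(final boolean useLegacyInterpretation) {
        return useLegacyInterpretation
                ? Tool.DISCOVER_VARIANTS_FROM_CONTIG_ALIGNMENTS_SAM_SPARK
                : Tool.SV_DISCOVER_FROM_LOCAL_ASSEMBLY_CONTIG_ALIGNMENTS_SPARK;
    }
}
```
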
@SHuang-Broad SHuang-Broad added the SV label Jan 9, 2018
@SHuang-Broad SHuang-Broad self-assigned this Jan 9, 2018

SHuang-Broad commented Mar 22, 2018

Updated plan


Small improvements in new interpretation tool

  • Output bam instead of sam for assembly alignments
  • Have the new interpretation tool write files instead of creating a directory (consistent with the behavior of the current interpretation tool)
  • Prefix output file names with the sample name
  • Add INSLEN annotation when there's INSSEQ
  • Clarify the boundary between AlignedContig and AssemblyContigWithFineTunedAlignments
  • Increase test coverage for AssemblyContigAlignmentsConfigPicker
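
As an illustration of the INSLEN item above, here is a minimal sketch that derives INSLEN from INSSEQ. It assumes the annotations are held in a plain `java.util.Map`, which is a simplification for illustration; the helper name `withInsLen` is hypothetical and this is not how the tool itself stores attributes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a plain map stands in for the variant's INFO attributes.
final class InsLenAnnotationSketch {

    /**
     * Returns a copy of the attributes with INSLEN added whenever a
     * non-empty INSSEQ string is present; otherwise the copy is unchanged.
     */
    static Map<String, Object> withInsLen(final Map<String, Object> attributes) {
        final Map<String, Object> annotated = new LinkedHashMap<>(attributes);
        final Object insSeq = annotated.get("INSSEQ");
        if (insSeq instanceof String && !((String) insSeq).isEmpty()) {
            annotated.put("INSLEN", ((String) insSeq).length());
        }
        return annotated;
    }
}
```
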

Consolidate logic, bump test coverage and update how variants are represented

consolidate logic

The initial prototype left redundant logic for simple variants; it is now time to consolidate it.

  • AssemblyContigWithFineTunedAlignments

    • hasIncompletePicture()
  • AssemblyContigAlignmentSignatureClassifier

    • Don't make so many splits
    • Reduce RawTypes into fewer cases
  • ChimericAlignment

    • update documentation
    • implement getCoordinateSortedRefSpans() and use it in BreakpointsInference
    • isNeitherSimpleTranslocationNorIncompletePicture()
    • extractSimpleChimera()

bump test coverage

Once the code above is consolidated, bump test coverage, particularly for the classes above and the following poorly covered classes:

  • ChimericAlignment

    • isForwardStrandRepresentation()
    • splitPairStrongEnoughEvidenceForCA()
    • parseOneContig() (needs testing because we need it for the simple re-interpretation of CPX variants). Note that nextAlignmentMayBeInsertion() is currently broken: when it is used to filter out alignments whose ref span is contained by another, it needs to check whether the two alignments involved are head/tail.
  • BreakpointsInference & BreakpointComplications

  • NovelAdjacencyAndAltHaplotype

    • toSimpleOrBNDTypes()
  • SimpleNovelAdjacencyAndChimericAlignmentEvidence

    • serialization test
  • AnnotatedVariantProducer

    • produceAnnotatedBNDmatesVcFromNovelAdjacency()
  • BreakEndVariantType

  • SvDiscoverFromLocalAssemblyContigAlignmentsSpark integration test

update how variants are represented

Implement the following representation changes, which should make type-based evaluation easier:

  • change INSDUP to INS when the duplicated ref region, denoted with the annotation DUP_REPEAT_UNIT_REF_SPAN, is shorter than 50 bp.
  • change scarred deletion calls, which are currently output as DEL with an INSSEQ annotation, to one of these:
    • INS/DEL, when the deleted/inserted bases are < 50 bp, annotated accordingly; when the type is determined as INS, POS will be 1 base before the micro-deleted range and END will be the end of the micro-deleted range, and the REF allele will be the corresponding reference bases.
    • two records, INS and DEL, when both are >= 50 bp; they share the same POS and are linked by EVENT
  • we are making a choice that treats duplication expansion as insertion. If we decide to treat DUP as a separate first-class type, we need to
    • shift the left breakpoint to the right by 1 base compared to the current implementation, and
    • set downstreamBreakpointRefPos = complication.getDupSeqRepeatUnitRefSpan().getEnd();
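
The size rules above can be sketched as a small decision function. This is an illustration under stated assumptions, not the tool's implementation: the class and method names are hypothetical, types are plain strings, and the case where both the deleted and inserted lengths are below 50 bp is not covered by the plan, so the sketch returns an empty list for it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the 50 bp representation rules listed above.
final class RepresentationRulesSketch {

    /** Size threshold used throughout the list above. */
    static final int SIZE_THRESHOLD_BP = 50;

    /** INSDUP becomes INS when the duplicated ref region is shorter than 50 bp. */
    static String typeForDuplication(final int dupRepeatUnitRefSpanLength) {
        return dupRepeatUnitRefSpanLength < SIZE_THRESHOLD_BP ? "INS" : "INSDUP";
    }

    /**
     * Record types emitted for a scarred deletion (a deletion with inserted sequence).
     * Both sides >= 50 bp: two records, INS and DEL (sharing POS, linked by EVENT).
     * Only one side >= 50 bp: a single record of that type.
     * Both sides < 50 bp: not specified in the plan, so an empty list is returned.
     */
    static List<String> recordTypesForScarredDeletion(final int deletedLength,
                                                      final int insertedLength) {
        final boolean largeDeletion = deletedLength >= SIZE_THRESHOLD_BP;
        final boolean largeInsertion = insertedLength >= SIZE_THRESHOLD_BP;
        if (largeDeletion && largeInsertion) {
            return Arrays.asList("INS", "DEL");
        }
        if (largeDeletion) {
            return Collections.singletonList("DEL");
        }
        if (largeInsertion) {
            return Collections.singletonList("INS");
        }
        return Collections.emptyList();
    }
}
```

For example, a call with 120 deleted bases and 10 inserted bases stays a single DEL, while 120 deleted and 80 inserted bases yields the INS and DEL pair.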

CPX variant re-interpretation

Send CPX variants for re-interpretation as basic simple types, and check the results for consistency (this may be the difficult part).
