gcnvkernel tunings #4720
Conversation
bumped up version
minor issue in the sampling error calculation that could lead to NaNs
different admixing rates (internal, external)
bugfix in determining annealing rate
bugfix in ploidy args
…cting correlations between copy numbers), then mean-field between discrete and continuous RVs (though with correlations between copy-number and classes))
bugfix in convergence tracker
thermal epochs to thermal advi iterations
updated WDLs
bugfix in ploidy args
NOTE: case mode still does not have the two-stage implementation
…on to avoid inf
changed the definition of QS (no more normalization -- it turned out to yield more realistic ROC curves, i.e. more representative of the actual segment quality)
fixed test resources
small fixes and changes to the hybrid ADVI inference task
base convergence criterion
bumped up gcnvkernel version to 0.7
Codecov Report
@@             Coverage Diff              @@
##              master     #4720     +/-  ##
==============================================
+ Coverage     79.977%   80.537%   +0.56%
- Complexity     17397     18570    +1173
==============================================
  Files           1080      1084       +4
  Lines          63093     67925    +4832
  Branches       10179     11449    +1270
==============================================
+ Hits           50460     54705    +4245
- Misses          8648      9052     +404
- Partials        3985      4168     +183
Looks good, thanks! Just a couple very minor comments and one request: can you write a quick summary of the changes in the PR description? Perhaps briefly touch on how different internal/external admixing rates aid in convergence here and how you arrived at this conclusion. Might be helpful for future reference, that's all!
@@ -362,21 +366,23 @@ task GermlineCNVCallerCaseMode {
     --log-emission-samples-per-round ${default="50" log_emission_samples_per_round} \
     --log-emission-sampling-median-rel-error ${default="0.005" log_emission_sampling_median_rel_error} \
     --log-emission-sampling-rounds ${default="10" log_emission_sampling_rounds} \
-    --max-advi-iter-first-epoch ${default="100" max_advi_iter_first_epoch} \
+    --max-advi-iter-first-epoch ${default="5000" max_advi_iter_first_epoch} \
Can we actually change all instances of "iters" to "iterations", both in the jar and in the WDL? Also perhaps "rel" to "relative"?
@@ -190,6 +190,7 @@
    public static final String MODEL_PATH_SUFFIX = "-model";
    public static final String CALLS_PATH_SUFFIX = "-calls";
    public static final String TRACKING_PATH_SUFFIX = "-tracking";
Can you add documentation about what is output into this directory?
I tested this on a collection of 32 tandem duplications (~10kbp-50kbp events, 5kbp padding, 100bp bins) from an HGSV proband case. Using the latest release docker it nailed almost all of them (hooray!) but it hardly called any dups when I tried to do the calling/postprocessing locally using this branch (log reports successful convergence). This could just be an error on my end but we should check before merging.
@mbabadi @samuelklee After our discussion, I suspect my issue was related to sharding and is probably not related to any changes in this branch.
@mbabadi can you make any changes and merge this when you get a chance?
@samuelklee done!
Summary of changes:
Fixed a minor issue in sampling error estimation that could lead to NaN (as a result of division by zero)
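A guarded version of such an error estimate might be sketched as follows (illustrative only; the function and parameter names are assumptions, not the actual gcnvkernel fix):

```python
import numpy as np

def relative_sampling_error(samples: np.ndarray, eps: float = 1e-12) -> float:
    """Relative standard error of the mean of a batch of samples.

    If the mean is zero (or tiny), a naive relative error divides by zero
    and yields NaN/inf; flooring the denominator with `eps` avoids this.
    """
    mean = float(np.mean(samples))
    std_err = float(np.std(samples)) / np.sqrt(len(samples))
    return std_err / max(abs(mean), eps)  # guarded division
```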
Introduced separate internal and external admixing rates. The internal admixing rate is used by discrete RV posterior update routines ("callers") as a safety measure to stabilize self-consistency loops. For example, consider the mean-field treatment of two coupled Markov chains: the mean-field decoupling of the two chains yields two independent Markov chains with effective emission, transition, and prior probabilities, all of which must be self-consistently determined. The internal admixing rate is used to admix the old and new self-consistent fields across the two chains in order to dampen oscillations and improve convergence properties. Once internal convergence is achieved, the converged posteriors must be saved to a workspace in order to be consumed by the continuous sub-model. The new internally converged posteriors are then admixed with the old internally converged posteriors from the previous epoch using the external admixing rate.
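The damped fixed-point update described above can be sketched as follows (a minimal illustration; `admix` and `self_consistent_update` are hypothetical names, not the gcnvkernel API):

```python
import numpy as np

def admix(new_posterior, old_posterior, rate: float):
    """Damped update: convex combination of the new and old posteriors."""
    return rate * new_posterior + (1.0 - rate) * old_posterior

def self_consistent_update(compute_new, posterior, internal_rate: float,
                           tol: float = 1e-6, max_iter: int = 100):
    """Iterate a fixed-point map, admixing each new iterate with the old one
    at the internal rate to dampen oscillations in the self-consistency loop."""
    for _ in range(max_iter):
        updated = admix(compute_new(posterior), posterior, internal_rate)
        if np.max(np.abs(updated - posterior)) < tol:
            return updated
        posterior = updated
    return posterior
```

In this picture, `self_consistent_update` plays the role of the internal loop; the external admixing rate would then combine its converged output with the previous epoch's converged posterior via the same `admix` operation, once per epoch.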
Introduced two-stage inference for cohort denoising and calling. In the first ("warm-up") stage, discrete variables are marginalized out, yielding an effective continuous-only model. The warm-up stage calculates continuous posteriors based on the marginalized model. Once convergence is achieved, continuous and discrete variables are decoupled for the second ("main") stage. The second stage starts with a discrete calling step (crucial), using continuous posteriors from the warm-up stage as the starting point. The motivation behind the two-stage inference strategy is to avoid getting trapped in spurious local minima that are potentially introduced by mean-field decoupling of discrete and continuous RVs. Note that mean-field decoupling has a tendency to stabilize local minima, most of which will disappear or turn into saddle points once correlations are taken into account. While the marginalized model is free of such spurious local minima, it does not yield discrete posteriors in a tractable way; hence, the necessity of ultimately decoupling in the "main" stage.
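The two-stage control flow might be sketched schematically like this (the callables are hypothetical placeholders for the marginalized warm-up fit, the discrete calling step, and the continuous mean-field update; this is not the actual gcnvkernel implementation):

```python
def two_stage_fit(fit_marginalized, call_discrete, update_continuous,
                  n_main_iterations: int = 10):
    """Schematic of the two-stage inference strategy.

    Stage 1 ("warm-up"): fit the continuous posterior under the model with
    discrete RVs marginalized out, which is free of the spurious local
    minima introduced by mean-field decoupling.

    Stage 2 ("main"): decouple discrete and continuous variables, and
    crucially begin with a discrete calling step seeded by the warm-up
    continuous posterior, then alternate mean-field updates.
    """
    continuous = fit_marginalized()           # warm-up stage
    discrete = call_discrete(continuous)      # first (crucial) main-stage step
    for _ in range(n_main_iterations):        # alternating mean-field updates
        continuous = update_continuous(discrete)
        discrete = call_discrete(continuous)
    return continuous, discrete
```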
Capped phred-scaled qualities to maximum values permitted by machine precision in order to avoid NaNs and overflows.
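As an illustration, capping a phred-scaled quality at the largest value representable in double precision might look like this (a sketch; the names are assumptions and the actual cap used by gcnvkernel may differ):

```python
import numpy as np

# Largest phred value corresponding to the smallest positive normal float64;
# error probabilities below this would make -10*log10(p) overflow to inf.
MAX_PHRED = -10.0 * np.log10(np.finfo(np.float64).tiny)

def phred(error_prob: np.ndarray) -> np.ndarray:
    """Phred-scaled quality, capped so error_prob == 0 yields a finite value."""
    p = np.maximum(error_prob, np.finfo(np.float64).tiny)  # floor the probability
    return np.minimum(-10.0 * np.log10(p), MAX_PHRED)      # cap the quality
```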
Took a first step toward tracking and logging parameters during inference, starting with the ELBO history. In the future, it may be desirable to allow tracking of arbitrary RVs and deterministics via command line args (for debugging and exploratory work).
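A minimal sketch of such a tracker (hypothetical class name and convergence rule, not the gcnvkernel API):

```python
class ELBOTracker:
    """Records the ELBO per ADVI iteration and exposes a simple
    trailing-window convergence check on its relative spread."""

    def __init__(self, window: int = 10, rel_tol: float = 1e-4):
        self.history = []
        self.window = window
        self.rel_tol = rel_tol

    def record(self, elbo: float) -> None:
        self.history.append(elbo)

    def converged(self) -> bool:
        if len(self.history) < self.window:
            return False
        recent = self.history[-self.window:]
        spread = max(recent) - min(recent)
        scale = max(abs(recent[-1]), 1.0)  # guard against near-zero ELBO
        return spread / scale < self.rel_tol
```

Tracking arbitrary RVs and deterministics would amount to keeping one such history per tracked quantity, selected via command line args.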
Notes:
GermlineCNVCaller default arguments: see issue "Multiple default values for optional parameters for different data types" #4719.