gcnvkernel tunings #4720
Conversation
bumped up version
minor issue in the sampling error calculation that could lead to NaNs
different admixing rates (internal, external)
bugfix in determining annealing rate
bugfix in ploidy args
…cting correlations between copy numbers), then mean-field between discrete and continuous RVs (though with correlations between copy-number and classes))
bugfix in convergence tracker
thermal epochs to thermal advi iterations
updated WDLs
bugfix in ploidy args
NOTE: case mode still does not have the two-stage implementation
…on to avoid inf
changed the definition of QS (no more normalization -- it turned out to yield more realistic ROC curves, i.e. more representative of the actual segment quality)
fixed test resources
small fixes and changes to the hybrid ADVI inference task
base convergence criterion
bumped up gcnvkernel version to 0.7
Codecov Report
@@             Coverage Diff              @@
##              master     #4720     +/-  ##
==============================================
+ Coverage     79.977%   80.537%   +0.56%
- Complexity     17397     18570    +1173
==============================================
  Files           1080      1084       +4
  Lines          63093     67925    +4832
  Branches       10179     11449    +1270
==============================================
+ Hits           50460     54705    +4245
- Misses          8648      9052     +404
- Partials        3985      4168     +183
Looks good, thanks! Just a couple very minor comments and one request: can you write a quick summary of the changes in the PR description? Perhaps briefly touch on how different internal/external admixing rates aid in convergence here and how you arrived at this conclusion. Might be helpful for future reference, that's all!
@@ -362,21 +366,23 @@ task GermlineCNVCallerCaseMode {
     --log-emission-samples-per-round ${default="50" log_emission_samples_per_round} \
     --log-emission-sampling-median-rel-error ${default="0.005" log_emission_sampling_median_rel_error} \
     --log-emission-sampling-rounds ${default="10" log_emission_sampling_rounds} \
-    --max-advi-iter-first-epoch ${default="100" max_advi_iter_first_epoch} \
+    --max-advi-iter-first-epoch ${default="5000" max_advi_iter_first_epoch} \
Can we actually change all instances of "iters" to "iterations", both in the jar and in the WDL? Also perhaps "rel" to "relative"?
@@ -190,6 +190,7 @@
    public static final String MODEL_PATH_SUFFIX = "-model";
    public static final String CALLS_PATH_SUFFIX = "-calls";
    public static final String TRACKING_PATH_SUFFIX = "-tracking";
Can you add documentation about what is output into this directory?
I tested this on a collection of 32 tandem duplications (~10kbp-50kbp events, 5kbp padding, 100bp bins) from an HGSV proband case. Using the latest release docker it nailed almost all of them (hooray!) but it hardly called any dups when I tried to do the calling/postprocessing locally using this branch (log reports successful convergence). This could just be an error on my end but we should check before merging.
@mbabadi @samuelklee After our discussion, I suspect my issue was related to sharding and is probably not related to any changes in this branch.
@mbabadi can you make any changes and merge this when you get a chance?
@samuelklee done!
Summary of changes:
Fixed a minor issue in sampling error estimation that could lead to NaN (as a result of division by zero)
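A guarded version of such an error estimate might be sketched as follows (illustrative only; the function and parameter names are assumptions, not the actual gcnvkernel fix):

```python
import numpy as np

def relative_sampling_error(samples: np.ndarray, eps: float = 1e-12) -> float:
    """Relative standard error of the mean of a batch of samples.

    If the mean is zero (or tiny), a naive relative error divides by zero
    and yields NaN/inf; flooring the denominator with `eps` avoids this.
    """
    mean = float(np.mean(samples))
    std_err = float(np.std(samples)) / np.sqrt(len(samples))
    return std_err / max(abs(mean), eps)  # guarded division
```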
Introduced separate internal and external admixing rates. The internal admixing rate is used by discrete RV posterior update routines ("callers") as a safety measure to stabilize self-consistency loops. For example, consider the mean-field treatment of two coupled Markov chains: the mean-field decoupling of the two chains yields two independent Markov chains with effective emission, transition, and prior probabilities, all of which must be self-consistently determined. The internal admixing rate is used to admix the old and new self-consistent fields across the two chains in order to dampen oscillations and improve convergence properties. Once internal convergence is achieved, the converged posteriors must be saved to a workspace in order to be consumed by the continuous sub-model. The new internally converged posteriors are then admixed with the old internally converged posteriors from the previous epoch using the external admixing rate.
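The damped fixed-point update described above can be sketched as follows (a minimal illustration; `admix` and `self_consistent_update` are hypothetical names, not the gcnvkernel API):

```python
import numpy as np

def admix(new_posterior, old_posterior, rate: float):
    """Damped update: convex combination of the new and old posteriors."""
    return rate * new_posterior + (1.0 - rate) * old_posterior

def self_consistent_update(compute_new, posterior, internal_rate: float,
                           tol: float = 1e-6, max_iter: int = 100):
    """Iterate a fixed-point map, admixing each new iterate with the old one
    at the internal rate to dampen oscillations in the self-consistency loop."""
    for _ in range(max_iter):
        updated = admix(compute_new(posterior), posterior, internal_rate)
        if np.max(np.abs(updated - posterior)) < tol:
            return updated
        posterior = updated
    return posterior
```

In this picture, `self_consistent_update` plays the role of the internal loop; the external admixing rate would then combine its converged output with the previous epoch's converged posterior via the same `admix` operation, once per epoch.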
Introduced two-stage inference for cohort denoising and calling. In the first ("warm-up") stage, discrete variables are marginalized out, yielding an effective continuous-only model. The warm-up stage calculates continuous posteriors based on the marginalized model. Once convergence is achieved, continuous and discrete variables are decoupled for the second ("main") stage. The second stage starts with a discrete calling step (crucial), using continuous posteriors from the warm-up stage as the starting point. The motivation behind the two-stage inference strategy is to avoid getting trapped in spurious local minima that are potentially introduced by mean-field decoupling of discrete and continuous RVs. Note that mean-field decoupling has a tendency to stabilize local minima, most of which will disappear or turn into saddle points once correlations are taken into account. While the marginalized model is free of such spurious local minima, it does not yield discrete posteriors in a tractable way; hence, the necessity of ultimately decoupling in the "main" stage.
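The two-stage control flow might be sketched schematically like this (the callables are hypothetical placeholders for the marginalized warm-up fit, the discrete calling step, and the continuous mean-field update; this is not the actual gcnvkernel implementation):

```python
def two_stage_fit(fit_marginalized, call_discrete, update_continuous,
                  n_main_iterations: int = 10):
    """Schematic of the two-stage inference strategy.

    Stage 1 ("warm-up"): fit the continuous posterior under the model with
    discrete RVs marginalized out, which is free of the spurious local
    minima introduced by mean-field decoupling.

    Stage 2 ("main"): decouple discrete and continuous variables, and
    crucially begin with a discrete calling step seeded by the warm-up
    continuous posterior, then alternate mean-field updates.
    """
    continuous = fit_marginalized()           # warm-up stage
    discrete = call_discrete(continuous)      # first (crucial) main-stage step
    for _ in range(n_main_iterations):        # alternating mean-field updates
        continuous = update_continuous(discrete)
        discrete = call_discrete(continuous)
    return continuous, discrete
```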
Capped phred-scaled qualities to maximum values permitted by machine precision in order to avoid NaNs and overflows.
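As an illustration, capping a phred-scaled quality at the largest value representable in double precision might look like this (a sketch; the names are assumptions and the actual cap used by gcnvkernel may differ):

```python
import numpy as np

# Largest phred value corresponding to the smallest positive normal float64;
# error probabilities below this would make -10*log10(p) overflow to inf.
MAX_PHRED = -10.0 * np.log10(np.finfo(np.float64).tiny)

def phred(error_prob: np.ndarray) -> np.ndarray:
    """Phred-scaled quality, capped so error_prob == 0 yields a finite value."""
    p = np.maximum(error_prob, np.finfo(np.float64).tiny)  # floor the probability
    return np.minimum(-10.0 * np.log10(p), MAX_PHRED)      # cap the quality
```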
Took a first step toward tracking and logging parameters during inference, starting with the ELBO history. In the future, it may be desirable to allow tracking of arbitrary RVs and deterministics via command line args (for debugging and exploratory work).
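A minimal sketch of such a tracker (hypothetical class name and convergence rule, not the gcnvkernel API):

```python
class ELBOTracker:
    """Records the ELBO per ADVI iteration and exposes a simple
    trailing-window convergence check on its relative spread."""

    def __init__(self, window: int = 10, rel_tol: float = 1e-4):
        self.history = []
        self.window = window
        self.rel_tol = rel_tol

    def record(self, elbo: float) -> None:
        self.history.append(elbo)

    def converged(self) -> bool:
        if len(self.history) < self.window:
            return False
        recent = self.history[-self.window:]
        spread = max(recent) - min(recent)
        scale = max(abs(recent[-1]), 1.0)  # guard against near-zero ELBO
        return spread / scale < self.rel_tol
```

Tracking arbitrary RVs and deterministics would amount to keeping one such history per tracked quantity, selected via command line args.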
Notes:
GermlineCNVCaller default arguments: see issue "Multiple default values for optional parameters for different data types" #4719.