-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scripts for SV callset evaluation #4406
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4406 +/- ##
==============================================
- Coverage 87.069% 79.769% -7.3%
+ Complexity 31244 17447 -13797
==============================================
Files 1922 1067 -855
Lines 144210 63673 -80537
Branches 15916 10444 -5472
==============================================
- Hits 125562 50791 -74771
+ Misses 12874 8885 -3989
+ Partials 5774 3997 -1777
|
Here's how these scripts are organized and why they take the form it is now: How to run
and what to expect
Misc points:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this. I got this going and it seems to work pretty well. I'm not going to do a full review yet, but a few things I think we need to change to make this useable by everyone, and a few more general suggestions are:
- Get rid of hard-coded paths in all scripts and make them all arguments to the main script.
- Right now the way it handles the Cpx variants seems not to generalize very well; I think it's assuming that master GATK doesn't have Cpx variants and the feature branch does. It'd be nice to make the Cpx variant processing completely optional (I had to comment out some things to get it to ignore them), until we switch the default behavior of master to use the new variant interpretation methods.
- I'm not sure we should filter out very small and very large variants, especially very large ones. We should get graded on making those calls. Very small ones I can see excluding because by definition they might not be in the truth set, but I don't think that applies to large ones.
- What's the reason for storing compiled Rdata objects in with the scripts? I don't necessarily see anything that needs that, and it will make things very hard to maintain.
masterVsFeature.sh
appears to set its working directory in the parent directory of where it is run from. If these scripts are in the gatk repo that will end up being in the scripts/sv/evaluation dir and will look like an untracked directory by git there. Its location should be a parameter just like the other working directories.- Running masterVsFeature should be optional; sometimes we'll just want to run master vs pacbio (ie in a Jenkins automated test).
- It might be helpful if the masterVsFeature logic also produced these lists of variants: TP (vs PacBio) gained in feature, TP lost in feature, FP gained in feature, FP lost in feature.
What do you think?
Thanks for providing suggestions @cwhelan ! ===============
Will do, I'll also make Manta output an optional argument. EDIT: removed hard-coded path in 1560c6e
Agree. EDIT: done in d4d4244
I agree it is not optimal, particularly for filtering out the huge variants.
Historical reason. They were used for storing R functions. Will remove them. EDIT: done in 546a36465f7860f8c85e28cf40ca8f3851ba9d4c
Will do as suggested. EDIT: done in 23da9d4
Will do as suggested. EDIT: done in 23da9d4
That's actually available. They are stored as: In addition, two files I think I just need to rename the outputs. |
@cwhelan |
a0eb379
to
eae0881
Compare
467da7e
to
61d8275
Compare
ebf0ec5
to
937b4f3
Compare
74f064a
to
235a228
Compare
235a228
to
7cd9a02
Compare
7cd9a02
to
1d5ec93
Compare
1d5ec93
to
a91cc6e
Compare
e6876e6
to
bfa3076
Compare
bfa3076
to
f7549a9
Compare
e43677e
to
f292cc5
Compare
27d841d
to
0c81127
Compare
7d205f6
to
6439e2c
Compare
b92f224
to
81db984
Compare
81db984
to
79bd9cc
Compare
190ada0
to
31f1a12
Compare
ad4af01
to
612bfbc
Compare
612bfbc
to
3ff361b
Compare
3ff361b
to
930dd23
Compare
597ecc6
to
91bf3e9
Compare
* setup main script "GATSVcallsetAnalysis.sh" to replace the script "project.sh", and * get rid of hard-coded path
* optionally run analysis on complex SV vcf's
* optionally run comparison between two GATK call sets * move the hard-coded directory creation from the utility script to a user provided path in "GATSVcallsetAnalysis.sh"
* remove Rdata * change loading Rdata to source the corresponding R script
* switch from using CPX call to re-interpreted calls
91bf3e9
to
e9e437f
Compare
@SHuang-Broad @cwhelan Can you guys either merge this ancient PR or close it if it's no longer needed? Thanks! |
closing at request, and we have an improved pipeline now somewhere else, and a plan for big improvements there. |
1st version, with its own assumptions and shortcomings.
@cwhelan please see if they fit what you have in mind,
one demo of running these scripts is here