
Scripts for SV callset evaluation #4406

Closed: wants to merge 8 commits into master from sh-sv-eval-scripts

Conversation

SHuang-Broad
Contributor

First version, with its own assumptions and shortcomings.

@cwhelan, please see if these fit what you have in mind;
one demo of running these scripts is here

@codecov-io

codecov-io commented Feb 13, 2018

Codecov Report

Merging #4406 into master will decrease coverage by 7.3%.
The diff coverage is n/a.

@@              Coverage Diff               @@
##              master     #4406      +/-   ##
==============================================
- Coverage     87.069%   79.769%    -7.3%     
+ Complexity     31244     17447   -13797     
==============================================
  Files           1922      1067     -855     
  Lines         144210     63673   -80537     
  Branches       15916     10444    -5472     
==============================================
- Hits          125562     50791   -74771     
+ Misses         12874      8885    -3989     
+ Partials        5774      3997    -1777
Impacted Files Coverage Δ Complexity Δ
...nder/engine/spark/AssemblyRegionWalkerContext.java 0% <0%> (-100%) 0% <0%> (-4%)
...bender/engine/spark/AssemblyRegionWalkerSpark.java 0% <0%> (-100%) 0% <0%> (-13%)
...ols/examples/ExampleAssemblyRegionWalkerSpark.java 0% <0%> (-93.103%) 0% <0%> (-8%)
.../tools/spark/sv/discovery/BreakEndVariantType.java 0% <0%> (-92.381%) 0% <0%> (-14%)
...s/spark/ParallelCopyGCSDirectoryIntoHDFSSpark.java 0% <0%> (-74.257%) 0% <0%> (-17%)
...nder/tools/spark/pipelines/PrintVariantsSpark.java 0% <0%> (-66.667%) 0% <0%> (-2%)
...pleNovelAdjacencyAndChimericAlignmentEvidence.java 24.324% <0%> (-63.176%) 5% <0%> (-5%)
...ellbender/engine/filters/VariantFilterLibrary.java 33.333% <0%> (-56.667%) 1% <0%> (ø)
...walkers/genotyper/GenotypingGivenAllelesUtils.java 28.571% <0%> (-46.429%) 2% <0%> (-3%)
...hellbender/tools/spark/pipelines/SortSamSpark.java 70.588% <0%> (-29.412%) 4% <0%> (-2%)
... and 1259 more

@SHuang-Broad
Contributor Author

Here's how these scripts are organized and why they take the form they do now:

How to run

  • manage/project.sh is the "executable"
  • paths to the VCF files (gzipped or not) from the PacBio callsets on the CHM haploids, and to Manta's VCF on the mixture, must be provided to manage/project.sh
  • paths to two versions of GATK-SV callsets must be provided; a single callset also works, but the scripts in the manage sub-directory must then be modified. Two GATK-SV VCF files are requested so that one can check whether a supposed improvement makes our raw sensitivity/specificity better or worse, which was the use case here
  • paths to where the results are to be stored, one for each GATK-SV callset, must be given
  • a path to where the results of comparing the two callsets are stored must be given
  • several GNU utilities are expected (guniq and gsort, when run on a Mac), as well as bedtools
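Since the GNU versions of sort and uniq are typically installed on macOS with a `g` prefix (e.g. via Homebrew's coreutils), a script can resolve whichever name is available before using it. A minimal sketch of that idea (the variable names are illustrative, not taken from the actual scripts):

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefer GNU gsort/guniq (macOS coreutils install)
# and fall back to the system sort/uniq elsewhere.
SORT=$(command -v gsort || command -v sort)
UNIQ=$(command -v guniq || command -v uniq)

# Demonstrate: dedupe a small ID list with whichever tools were found.
printf '%s\n' DEL_1 INS_2 DEL_1 | "$SORT" | "$UNIQ"
```

This keeps the rest of the pipeline identical on Linux and macOS: every later stage calls `"$SORT"` and `"$UNIQ"` instead of hard-coding one platform's tool names.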

and what to expect:

  • the scripts check the VCF files and print to screen a slew of information that one can pipe elsewhere or simply browse through
  • the scripts also output the IDs of variants from each of the two GATK callsets that are "validated" by the PacBio haploid calls

Misc points:

  • watch out for "duplicated" records: sometimes different assembly contigs mapped to the same locations carry slightly different alleles (a SNP, for example), so both are output; based on experience there aren't many such records
  • some variants we output to the VCFs have size <50 or >50K; both groups are filtered upfront and saved separately
  • the scripts date back to when we first called insertions, deletions, inversions, and small duplications; at that time PacBio call sets on the CHM haploids were not available, so Manta's calls were used as the "reference", which explains why they are referred to throughout the project
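The upfront size filter mentioned above (records with size <50 or >50K saved separately) can be sketched as a short awk pass over the INFO column. The SVLEN parsing and output file names below are illustrative assumptions, not the project's actual code:

```shell
# Hypothetical sketch: split VCF records by |SVLEN| into an in-range file
# and a filtered file; assumes an SVLEN=... key in the INFO column ($8).
awk -F'\t' '
/^#/ { next }                                # skip VCF header lines
match($8, /SVLEN=-?[0-9]+/) {
    len = substr($8, RSTART + 6, RLENGTH - 6) + 0
    if (len < 0) len = -len                  # deletions carry negative SVLEN
    if (len < 50 || len > 50000)
        print > "size_filtered.vcf"
    else
        print > "size_inrange.vcf"
}' calls.vcf
```

Records without an SVLEN key are silently skipped here; a real script would also decide how to treat BND records, which have no single length.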

@cwhelan (Member) left a comment

Thanks for submitting this. I got it going and it seems to work pretty well. I'm not going to do a full review yet, but here are a few things I think we need to change to make this usable by everyone, plus a few more general suggestions:

  • Get rid of hard-coded paths in all scripts and make them all arguments to the main script.
  • Right now the way it handles the Cpx variants seems not to generalize very well; I think it's assuming that master GATK doesn't have Cpx variants and the feature branch does. It'd be nice to make the Cpx variant processing completely optional (I had to comment out some things to get it to ignore them), until we switch the default behavior of master to use the new variant interpretation methods.
  • I'm not sure we should filter out very small and very large variants, especially very large ones. We should get graded on making those calls. Very small ones I can see excluding because by definition they might not be in the truth set, but I don't think that applies to large ones.
  • What's the reason for storing compiled Rdata objects in with the scripts? I don't necessarily see anything that needs that, and it will make things very hard to maintain.
  • masterVsFeature.sh appears to set its working directory in the parent directory of where it is run from. If these scripts are in the gatk repo that will end up being in the scripts/sv/evaluation dir and will look like an untracked directory by git there. Its location should be a parameter just like the other working directories.
  • Running masterVsFeature should be optional; sometimes we'll just want to run master vs PacBio (e.g. in an automated Jenkins test).
  • It might be helpful if the masterVsFeature logic also produced these lists of variants: TP (vs PacBio) gained in feature, TP lost in feature, FP gained in feature, FP lost in feature.

What do you think?

@SHuang-Broad
Contributor Author

SHuang-Broad commented Feb 26, 2018

Thanks for providing suggestions @cwhelan !
Replies are below

===============

Get rid of hard-coded paths in all scripts and make them all arguments to the main script.

Will do, I'll also make Manta output an optional argument. EDIT: removed hard-coded path in 1560c6e

Right now the way it handles the Cpx variants seems not to generalize very well; I think it's assuming that master GATK doesn't have Cpx variants and the feature branch does. It'd be nice to make the Cpx variant processing completely optional (I had to comment out some things to get it to ignore them), until we switch the default behavior of master to use the new variant interpretation methods.

Agree. EDIT: done in d4d4244

I'm not sure we should filter out very small and very large variants, especially very large ones. We should get graded on making those calls. Very small ones I can see excluding because by definition they might not be in the truth set, but I don't think that applies to large ones.

I agree it is not optimal, particularly filtering out the huge variants.
The huge variants come from two sources: <DEL> and <INV>.
For huge deletions, the 50% (or a custom value) reciprocal-overlap test should classify almost all of them as FP; in fact the PacBio call sets (CHM1, CHM13, and the mixture) do not contain any deletion records over 30K in size.
For huge inversions, though, I am not sure exactly what to do with them. The scripts currently skip overlap analysis on the inversions because, IMO, we are overly confident in the <INV> variants; the new variant interpretation methods will not emit such records at all. Instead they become BND's with appropriate annotations and are submitted to a yet-to-be-implemented unit for further interpretation as either a dispersed duplication or an inversion.
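For reference, the 50% reciprocal-overlap criterion is what bedtools applies with `intersect -f 0.5 -r` on whole files; on a single pair of intervals it reduces to an arithmetic check. A sketch with made-up coordinates:

```shell
# Two intervals overlap "reciprocally at 50%" when the shared span covers
# at least half of EACH interval (bedtools intersect -f 0.5 -r does this
# over whole BED/VCF files). The coordinates below are made up.
awk -v s1=100 -v e1=1000 -v s2=400 -v e2=1200 'BEGIN {
    ov = (e1 < e2 ? e1 : e2) - (s1 > s2 ? s1 : s2)   # shared span
    if (ov > 0 && ov >= 0.5 * (e1 - s1) && ov >= 0.5 * (e2 - s2))
        print "reciprocal-overlap"
    else
        print "no-match"
}'
```

With the coordinates above the shared span is 600 bp, which is at least half of both interval lengths (900 bp and 800 bp), so it prints `reciprocal-overlap`; a huge call paired with a much smaller truth call fails the test from the larger side, which is why huge deletions tend to come out as FP.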

What's the reason for storing compiled Rdata objects in with the scripts? I don't necessarily see anything that needs that, and it will make things very hard to maintain.

Historical reason. They were used for storing R functions. Will remove them. EDIT: done in 546a36465f7860f8c85e28cf40ca8f3851ba9d4c

masterVsFeature.sh appears to set its working directory in the parent directory of where it is run from. If these scripts are in the gatk repo that will end up being in the scripts/sv/evaluation dir and will look like an untracked directory by git there. Its location should be a parameter just like the other working directories.

Will do as suggested. EDIT: done in 23da9d4

Running masterVsFeature should be optional; sometimes we'll just want to run master vs pacbio (ie in a Jenkins automated test).

Will do as suggested. EDIT: done in 23da9d4

It might be helpful if the masterVsFeature logic also produced these lists of variants: TP (vs PacBio) gained in feature, TP lost in feature, FP gained in feature, FP lost in feature.

That's actually available. They are stored as:
TP gained in feature: "featureOnly.gatkIDsWithMatchingPacBio.txt"
TP lost in feature: "masterOnly.gatkIDsWithMatchingPacBio.txt"
FP gained in feature: "featureOnly.gatkIDsNoMatchingPacBio.txt"
FP lost in feature: "masterOnly.gatkIDsNoMatchingPacBio.txt"

In addition, two files,
"shared.gatkIDsNOMatchingPacBio.txt"
"shared.gatkIDsWithMatchingPacBio.txt"
show the common TP's and FP's shared by the GATK master and feature call sets.

I think I just need to rename the outputs.
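The gained/lost lists plus the shared lists amount to two set differences and an intersection over sorted ID files, which `comm` computes directly. A minimal sketch (the file names here are placeholders, not the scripts' actual outputs):

```shell
# Hypothetical sketch: given one validated-ID file per callset, derive
# the feature-only, master-only, and shared ID lists.
# comm requires its inputs to be sorted.
sort -o master.ids master.ids
sort -o feature.ids feature.ids
comm -13 master.ids feature.ids > featureOnly.ids   # gained in feature
comm -23 master.ids feature.ids > masterOnly.ids    # lost in feature
comm -12 master.ids feature.ids > shared.ids        # common to both
```

Running the same three `comm` calls once on the validated-ID files and once on the unvalidated-ID files yields all six lists described above.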

@SHuang-Broad
Contributor Author

@cwhelan
I've made most of the suggested changes in separate commits; would you mind taking a look to see if they are what you need?

@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from a0eb379 to eae0881 Compare March 4, 2018 02:15
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from 467da7e to 61d8275 Compare March 22, 2018 20:46
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from ebf0ec5 to 937b4f3 Compare March 27, 2018 22:11
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from 74f064a to 235a228 Compare April 13, 2018 01:32
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch from 1d5ec93 to a91cc6e Compare May 7, 2018 19:03
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from e6876e6 to bfa3076 Compare May 25, 2018 19:52
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 3 times, most recently from e43677e to f292cc5 Compare June 14, 2018 18:19
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 3 times, most recently from 27d841d to 0c81127 Compare June 24, 2018 20:19
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from 7d205f6 to 6439e2c Compare June 29, 2018 23:04
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from b92f224 to 81db984 Compare July 9, 2018 20:50
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 3 times, most recently from 190ada0 to 31f1a12 Compare July 23, 2018 18:39
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 3 times, most recently from ad4af01 to 612bfbc Compare August 6, 2018 20:32
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from 597ecc6 to 91bf3e9 Compare September 10, 2018 22:51
* setup main script "GATSVcallsetAnalysis.sh" to replace the script "project.sh", and
* get rid of hard-coded path
* optionally run analysis on complex SV vcf's
* optionally run comparison between two GATK call sets
* move the hard-coded directory creation from the utility script to a user provided path in "GATSVcallsetAnalysis.sh"
* remove Rdata
* change loading Rdata to source the corresponding R script
* switch from using CPX call to re-interpreted calls
@droazen
Contributor

droazen commented Jan 11, 2019

@SHuang-Broad @cwhelan Can you guys either merge this ancient PR or close it if it's no longer needed? Thanks!

@SHuang-Broad
Contributor Author

Closing as requested; we now have an improved pipeline elsewhere, and a plan for big improvements there.
Thanks for keeping an eye on this, @droazen!

@SHuang-Broad SHuang-Broad deleted the sh-sv-eval-scripts branch January 17, 2019 16:28