
Scripts for SV callset evaluation #4406

Closed: wants to merge 8 commits into master from sh-sv-eval-scripts

Conversation

SHuang-Broad
Contributor

First version, with its own assumptions and shortcomings.

@cwhelan, please see if these fit what you have in mind;
one demo of running these scripts is here

@codecov-io

codecov-io commented Feb 13, 2018

Codecov Report

Merging #4406 into master will decrease coverage by 7.3%.
The diff coverage is n/a.

@@              Coverage Diff               @@
##              master     #4406      +/-   ##
==============================================
- Coverage     87.069%   79.769%    -7.3%     
+ Complexity     31244     17447   -13797     
==============================================
  Files           1922      1067     -855     
  Lines         144210     63673   -80537     
  Branches       15916     10444    -5472     
==============================================
- Hits          125562     50791   -74771     
+ Misses         12874      8885    -3989     
+ Partials        5774      3997    -1777
Impacted Files Coverage Δ Complexity Δ
...nder/engine/spark/AssemblyRegionWalkerContext.java 0% <0%> (-100%) 0% <0%> (-4%)
...bender/engine/spark/AssemblyRegionWalkerSpark.java 0% <0%> (-100%) 0% <0%> (-13%)
...ols/examples/ExampleAssemblyRegionWalkerSpark.java 0% <0%> (-93.103%) 0% <0%> (-8%)
.../tools/spark/sv/discovery/BreakEndVariantType.java 0% <0%> (-92.381%) 0% <0%> (-14%)
...s/spark/ParallelCopyGCSDirectoryIntoHDFSSpark.java 0% <0%> (-74.257%) 0% <0%> (-17%)
...nder/tools/spark/pipelines/PrintVariantsSpark.java 0% <0%> (-66.667%) 0% <0%> (-2%)
...pleNovelAdjacencyAndChimericAlignmentEvidence.java 24.324% <0%> (-63.176%) 5% <0%> (-5%)
...ellbender/engine/filters/VariantFilterLibrary.java 33.333% <0%> (-56.667%) 1% <0%> (ø)
...walkers/genotyper/GenotypingGivenAllelesUtils.java 28.571% <0%> (-46.429%) 2% <0%> (-3%)
...hellbender/tools/spark/pipelines/SortSamSpark.java 70.588% <0%> (-29.412%) 4% <0%> (-2%)
... and 1259 more

@SHuang-Broad
Contributor Author

Here's how these scripts are organized and why they take the form they do now:

How to run

  • manage/project.sh is the "executable"
  • paths to the VCF files (gzipped or not) from the PacBio callsets on the CHM haploids, and to Manta's VCF on the mixture, must be provided to manage/project.sh
  • paths to two versions of GATK-SV callsets must be provided; a single callset also works, but the scripts in the manage sub-directory must then be modified. Two GATK-SV VCF files are requested so that one can check whether a supposed improvement makes our raw sensitivity/specificity better or worse, which was the use case here
  • paths to where the results are to be stored, one for each GATK-SV callset, must be given
  • a path to where the results of comparing the two callsets are stored must be given
  • several GNU utilities are expected (guniq and gsort, when run on a Mac), as well as bedtools
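Since the GNU versions of sort and uniq are typically installed on macOS with a `g` prefix (e.g. via Homebrew's coreutils), a script can resolve whichever name is available before using it. A minimal sketch of that idea (the variable names are illustrative, not taken from the actual scripts):

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefer GNU gsort/guniq (macOS coreutils install)
# and fall back to the system sort/uniq elsewhere.
SORT=$(command -v gsort || command -v sort)
UNIQ=$(command -v guniq || command -v uniq)

# Demonstrate: dedupe a small ID list with whichever tools were found.
printf '%s\n' DEL_1 INS_2 DEL_1 | "$SORT" | "$UNIQ"
```

This keeps the rest of the pipeline identical on Linux and macOS: every later stage calls `"$SORT"` and `"$UNIQ"` instead of hard-coding one platform's tool names.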

and what to expect:

  • the scripts check the VCF files and print to screen a slew of information that one can pipe elsewhere or simply browse through
  • the scripts also output the IDs of variants from each of the two GATK callsets that are "validated" by the PacBio haploid calls

Misc points:

  • watch out for "duplicated" records: sometimes different assembly contigs mapped to the same locations carry slightly different alleles (a SNP, for example), so both are output; based on experience there aren't many such records
  • some variants we output to the VCFs have size <50 or >50K; both groups are filtered upfront and saved separately
  • the scripts date back to when we first called insertions, deletions, inversions, and small duplications; at that time PacBio call sets on the CHM haploids were not available, so Manta's calls were used as the "reference", which explains why they are referred to throughout the project
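The upfront size filter mentioned above (records with size <50 or >50K saved separately) can be sketched as a short awk pass over the INFO column. The SVLEN parsing and output file names below are illustrative assumptions, not the project's actual code:

```shell
# Hypothetical sketch: split VCF records by |SVLEN| into an in-range file
# and a filtered file; assumes an SVLEN=... key in the INFO column ($8).
awk -F'\t' '
/^#/ { next }                                # skip VCF header lines
match($8, /SVLEN=-?[0-9]+/) {
    len = substr($8, RSTART + 6, RLENGTH - 6) + 0
    if (len < 0) len = -len                  # deletions carry negative SVLEN
    if (len < 50 || len > 50000)
        print > "size_filtered.vcf"
    else
        print > "size_inrange.vcf"
}' calls.vcf
```

Records without an SVLEN key are silently skipped here; a real script would also decide how to treat BND records, which have no single length.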

@cwhelan (Member) left a comment

Thanks for submitting this. I got it going and it seems to work pretty well. I'm not going to do a full review yet, but here are a few things I think we need to change to make this usable by everyone, plus a few more general suggestions:

  • Get rid of hard-coded paths in all scripts and make them all arguments to the main script.
  • Right now the way it handles the Cpx variants seems not to generalize very well; I think it's assuming that master GATK doesn't have Cpx variants and the feature branch does. It'd be nice to make the Cpx variant processing completely optional (I had to comment out some things to get it to ignore them), until we switch the default behavior of master to use the new variant interpretation methods.
  • I'm not sure we should filter out very small and very large variants, especially very large ones. We should get graded on making those calls. Very small ones I can see excluding because by definition they might not be in the truth set, but I don't think that applies to large ones.
  • What's the reason for storing compiled Rdata objects in with the scripts? I don't necessarily see anything that needs that, and it will make things very hard to maintain.
  • masterVsFeature.sh appears to set its working directory in the parent directory of where it is run from. If these scripts are in the gatk repo that will end up being in the scripts/sv/evaluation dir and will look like an untracked directory by git there. Its location should be a parameter just like the other working directories.
  • Running masterVsFeature should be optional; sometimes we'll just want to run master vs PacBio (e.g. in an automated Jenkins test).
  • It might be helpful if the masterVsFeature logic also produced these lists of variants: TP (vs PacBio) gained in feature, TP lost in feature, FP gained in feature, FP lost in feature.

What do you think?

@SHuang-Broad
Contributor Author

SHuang-Broad commented Feb 26, 2018

Thanks for providing suggestions @cwhelan !
Replies are below

===============

Get rid of hard-coded paths in all scripts and make them all arguments to the main script.

Will do, I'll also make Manta output an optional argument. EDIT: removed hard-coded path in 1560c6e

Right now the way it handles the Cpx variants seems not to generalize very well; I think it's assuming that master GATK doesn't have Cpx variants and the feature branch does. It'd be nice to make the Cpx variant processing completely optional (I had to comment out some things to get it to ignore them), until we switch the default behavior of master to use the new variant interpretation methods.

Agree. EDIT: done in d4d4244

I'm not sure we should filter out very small and very large variants, especially very large ones. We should get graded on making those calls. Very small ones I can see excluding because by definition they might not be in the truth set, but I don't think that applies to large ones.

I agree it is not optimal, particularly filtering out the huge variants.
The huge variants come from two sources: <DEL> and <INV>.
For huge deletions, the 50% (or a custom value) reciprocal-overlap test should classify almost all of them as FP; in fact the PacBio call sets (CHM1, CHM13, and the mixture) do not contain any deletion records over 30K in size.
For huge inversions, though, I am not sure exactly what to do with them. The scripts currently skip overlap analysis on the inversions because, IMO, we are overly confident in the <INV> variants; the new variant interpretation methods will not emit such records at all. Instead they become BND's with appropriate annotations and are submitted to a yet-to-be-implemented unit for further interpretation as either a dispersed duplication or an inversion.
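For reference, the 50% reciprocal-overlap criterion is what bedtools applies with `intersect -f 0.5 -r` on whole files; on a single pair of intervals it reduces to an arithmetic check. A sketch with made-up coordinates:

```shell
# Two intervals overlap "reciprocally at 50%" when the shared span covers
# at least half of EACH interval (bedtools intersect -f 0.5 -r does this
# over whole BED/VCF files). The coordinates below are made up.
awk -v s1=100 -v e1=1000 -v s2=400 -v e2=1200 'BEGIN {
    ov = (e1 < e2 ? e1 : e2) - (s1 > s2 ? s1 : s2)   # shared span
    if (ov > 0 && ov >= 0.5 * (e1 - s1) && ov >= 0.5 * (e2 - s2))
        print "reciprocal-overlap"
    else
        print "no-match"
}'
```

With the coordinates above the shared span is 600 bp, which is at least half of both interval lengths (900 bp and 800 bp), so it prints `reciprocal-overlap`; a huge call paired with a much smaller truth call fails the test from the larger side, which is why huge deletions tend to come out as FP.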

What's the reason for storing compiled Rdata objects in with the scripts? I don't necessarily see anything that needs that, and it will make things very hard to maintain.

Historical reason. They were used for storing R functions. Will remove them. EDIT: done in 546a36465f7860f8c85e28cf40ca8f3851ba9d4c

masterVsFeature.sh appears to set its working directory in the parent directory of where it is run from. If these scripts are in the gatk repo that will end up being in the scripts/sv/evaluation dir and will look like an untracked directory by git there. Its location should be a parameter just like the other working directories.

Will do as suggested. EDIT: done in 23da9d4

Running masterVsFeature should be optional; sometimes we'll just want to run master vs pacbio (ie in a Jenkins automated test).

Will do as suggested. EDIT: done in 23da9d4

It might be helpful if the masterVsFeature logic also produced these lists of variants: TP (vs PacBio) gained in feature, TP lost in feature, FP gained in feature, FP lost in feature.

That's actually available. They are stored as:
TP gained in feature: "featureOnly.gatkIDsWithMatchingPacBio.txt"
TP lost in feature: "masterOnly.gatkIDsWithMatchingPacBio.txt"
FP gained in feature: "featureOnly.gatkIDsNoMatchingPacBio.txt"
FP lost in feature: "masterOnly.gatkIDsNoMatchingPacBio.txt"

In addition, two files,
"shared.gatkIDsNOMatchingPacBio.txt"
"shared.gatkIDsWithMatchingPacBio.txt"
show the common TP's and FP's shared by the GATK master and feature call sets.

I think I just need to rename the outputs.
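The gained/lost lists plus the shared lists amount to two set differences and an intersection over sorted ID files, which `comm` computes directly. A minimal sketch (the file names here are placeholders, not the scripts' actual outputs):

```shell
# Hypothetical sketch: given one validated-ID file per callset, derive
# the feature-only, master-only, and shared ID lists.
# comm requires its inputs to be sorted.
sort -o master.ids master.ids
sort -o feature.ids feature.ids
comm -13 master.ids feature.ids > featureOnly.ids   # gained in feature
comm -23 master.ids feature.ids > masterOnly.ids    # lost in feature
comm -12 master.ids feature.ids > shared.ids        # common to both
```

Running the same three `comm` calls once on the validated-ID files and once on the unvalidated-ID files yields all six lists described above.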

@SHuang-Broad
Contributor Author

@cwhelan
I've made most of the suggested changes in separate commits; would you mind taking a look to see if they are what you need?

@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from a0eb379 to eae0881 Compare March 4, 2018 02:15
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from 467da7e to 61d8275 Compare March 22, 2018 20:46
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from ebf0ec5 to 937b4f3 Compare March 27, 2018 22:11
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from 74f064a to 235a228 Compare April 13, 2018 01:32
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch from 1d5ec93 to a91cc6e Compare May 7, 2018 19:03
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from e6876e6 to bfa3076 Compare May 25, 2018 19:52
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 3 times, most recently from e43677e to f292cc5 Compare June 14, 2018 18:19
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 3 times, most recently from 27d841d to 0c81127 Compare June 24, 2018 20:19
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from 7d205f6 to 6439e2c Compare June 29, 2018 23:04
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from b92f224 to 81db984 Compare July 9, 2018 20:50
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 3 times, most recently from 190ada0 to 31f1a12 Compare July 23, 2018 18:39
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 3 times, most recently from ad4af01 to 612bfbc Compare August 6, 2018 20:32
@SHuang-Broad SHuang-Broad force-pushed the sh-sv-eval-scripts branch 2 times, most recently from 597ecc6 to 91bf3e9 Compare September 10, 2018 22:51
* setup main script "GATSVcallsetAnalysis.sh" to replace the script "project.sh", and
* get rid of hard-coded path
* optionally run analysis on complex SV vcf's
* optionally run comparison between two GATK call sets
* move the hard-coded directory creation from the utility script to a user provided path in "GATSVcallsetAnalysis.sh"
* remove Rdata
* change loading Rdata to source the corresponding R script
* switch from using CPX call to re-interpreted calls
@droazen
Contributor

droazen commented Jan 11, 2019

@SHuang-Broad @cwhelan Can you guys either merge this ancient PR or close it if it's no longer needed? Thanks!

@SHuang-Broad
Contributor Author

Closing as requested; we now have an improved pipeline elsewhere, and a plan for big improvements there.
Thanks for keeping an eye on this, @droazen!

@SHuang-Broad SHuang-Broad deleted the sh-sv-eval-scripts branch January 17, 2019 16:28