VCF Tranche filtering in java #4800

lucidtronix · 2018-05-22T23:15:44Z

Rewrite of the FilterVariantTranches tool without python dependencies. Uses @takutosato's shiny new TwoPassVariantWalker. @takutosato or @cmnbroad care to review?

codecov-io · 2018-05-23T00:18:14Z

Codecov Report

Merging #4800 into master will increase coverage by 0.021%.
The diff coverage is 92.308%.

@@               Coverage Diff               @@
##              master     #4800       +/-   ##
===============================================
+ Coverage     80.457%   80.478%   +0.021%     
- Complexity     17839     17866       +27     
===============================================
  Files           1092      1092               
  Lines          64238     64287       +49     
  Branches       10352     10368       +16     
===============================================
+ Hits           51684     51737       +53     
+ Misses          8503      8498        -5     
- Partials        4051      4052        +1

Impacted Files	Coverage Δ	Complexity Δ
...der/tools/walkers/vqsr/CNNVariantWriteTensors.java	`83.333% <100%> (ø)`	`4 <2> (ø)`	⬇️
...hellbender/tools/walkers/vqsr/CNNVariantTrain.java	`80.645% <100%> (ø)`	`4 <2> (ø)`	⬇️
...nder/tools/walkers/vqsr/FilterVariantTranches.java	`92.632% <92.135%> (+16.545%)`	`32 <32> (+27)`	⬆️

cmnbroad

Not finished with first pass yet but checkpointing what I have so far.

cmnbroad · 2018-05-25T21:20:51Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

        }
    }

    @Override
-    protected Object doWork() {
-        final Resource pythonScriptResource = new Resource("tranches.py", FilterVariantTranches.class);


Can all or part of tranches.py be removed from the repo now ?

yes all gone!

cmnbroad · 2018-05-25T21:28:00Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

+        for (FeatureInput<VariantContext> featureSource : resources) {
+            for (VariantContext v : featureContext.getValues(featureSource)) {
+                if (variant.isSNP()){
+                    snpScores.add(Double.parseDouble((String)variant.getAttribute(infoKey)));


Is this supposed to be including the CNN score for the input variant repeatedly, once for each overlapping resource variant, no matter whether the known variant's type matches the input variant's type or not (SNP or INDEL) ?

No, that was a bug, and it also wasn't checking that the alleles actually matched. Now it returns as soon as it finds a match.
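The fix described in this reply can be illustrated in isolation. What follows is a minimal sketch, not the code merged in this PR: `Site` is a hypothetical stand-in for a variant record (it is not an htsjdk class), and `matchesAnyResource` is an invented name. It assumes that a "match" means the same contig, start, and reference allele plus at least one shared alternate allele, and it returns on the first match so a score would be recorded at most once per input variant.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a variant record; not an htsjdk class.
final class Site {
    final String contig;
    final int start;
    final String ref;
    final List<String> alts;

    Site(String contig, int start, String ref, String... alts) {
        this.contig = contig;
        this.start = start;
        this.ref = ref;
        this.alts = Arrays.asList(alts);
    }
}

final class ResourceMatchSketch {
    // Returns true as soon as one overlapping resource site has the same
    // position, the same reference allele, and at least one shared alt allele.
    static boolean matchesAnyResource(Site variant, List<Site> overlapping) {
        for (Site known : overlapping) {
            boolean sameSite = known.contig.equals(variant.contig)
                    && known.start == variant.start
                    && known.ref.equals(variant.ref);
            boolean sharesAlt = false;
            for (String alt : known.alts) {
                if (variant.alts.contains(alt)) {
                    sharesAlt = true;
                    break;
                }
            }
            if (sameSite && sharesAlt) {
                return true; // stop here: one match is enough
            }
        }
        return false;
    }
}
```

A caller would add the variant's score to the SNP or INDEL list only when this returns true, which avoids both the repeated additions and the type mismatch raised in the review.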

cmnbroad · 2018-05-25T21:33:21Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

+                } else if (variant.isIndel()){
+                    indelScores.add(Double.parseDouble((String)variant.getAttribute(infoKey)));
+                } else {
+                    logger.info(String.format("Not SNP or INDEL Overlapping variant at %s:%d-%d: Ref: %s Alt(s): %s\n",


This could produce lots of output.

Also, the text of the message could be clearer.

cmnbroad · 2018-05-25T21:39:54Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

+        if (variant.isSNP() && isTrancheFiltered(score, snpCutoffs)){
+            builder.filter(filterStringFromScore(score, snpCutoffs));
+        } else if (variant.isIndel() && isTrancheFiltered(score, indelCutoffs)){
+            builder.filter(filterStringFromScore(Double.parseDouble((String)variant.getAttribute(infoKey)), indelCutoffs));


The getAttribute/parse calls are redundant here since the attribute value is already in score.

cmnbroad · 2018-05-25T21:49:07Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

+    private String filterStringFromScore(double score, List<Double> cutoffs){
+        for (int i = 0; i < cutoffs.size()-1; i++){
+            if (score > cutoffs.get(i) && i == 0){
+                return "PASS"; // but this case should already be caught by isTrancheFiltered()


There is a VCFConstant for this.

I couldn't find one, so I added it, let me know if I'm looking in the wrong place.

cmnbroad · 2018-05-25T21:56:34Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java


+public class FilterVariantTranches extends TwoPassVariantWalker {


This tool should have at least one validating test with expected output, independent of the integrated pipeline test.

cmnbroad

First pass done, with a couple of questions.

cmnbroad · 2018-05-30T20:22:49Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

+            shortName="t",
+            doc="The levels of truth sensitivity at which to slice the data. (in percents, i.e. 99.9 for 99.9 percent and 1.0 for 1 percent)",
+            optional=true)
+    private List<Double> tranches = new ArrayList<Double>(Arrays.asList(99.9, 99.0, 90.0));


<Double> in new ArrayList<Double> is unnecessary and can be removed since it can be inferred.

cmnbroad · 2018-05-30T20:49:54Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

+            shortName = "resource",
+            doc="A list of validated VCFs with known sites of common variation",
+            optional=false)
+    private List<FeatureInput<VariantContext>> resources = new ArrayList<>();

    @Argument(fullName = "info-key", shortName = "info-key", doc = "The key must be in the INFO field of the input VCF.")
    private String infoKey = GATKVCFConstants.CNN_1D_KEY;


Just checking that the intention is that this can be applied to any info attribute, not just the CNN ones ?

Also, any reason not to default to 2d ?

Yes, any INFO attribute should work, I will change the default.

cmnbroad · 2018-05-30T20:52:38Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

+        // setup the header fields
+        final VCFHeader inputHeader = getHeaderForVariants();
+        final Set<VCFHeaderLine> inputHeaders = inputHeader.getMetaDataInSortedOrder();
+        final Set<VCFHeaderLine> hInfo = new HashSet<>(inputHeaders);


What should the behavior be if the input already has tranche filters for this key from a previous run ? It seems like both the header lines as well as the filters themselves should be removed since they're being replaced with new ones ?

Also, this should check that the header has a header line that matches the requested infoKey (i.e., that the input actually has CNN_1D or whatever).

The behavior now is to just add to the filter field and I think that makes sense in general. There could be use-cases where we've filtered with a different tool on some other criteria and then want to add tranche filtering. If the user does want to remove all existing filters they can run VariantFiltration with the --invalidate-previous-filters option.

cmnbroad · 2018-05-30T21:16:19Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java


+public class FilterVariantTranches extends TwoPassVariantWalker {


I just noticed that the CNNVariantPipeline test is using VQSLOD for the info key for the tranche test. That should probably be replaced with something generated by the pipeline... and the existing test can be made part of the integration test for this tool.

cmnbroad · 2018-05-30T21:27:11Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

-    private List<Double> tranches = new ArrayList<Double>(Arrays.asList(99.9, 99.0, 90.0));
+    @Override
+    public void onTraversalStart() {
+        tranches.sort(Double::compareTo);


This should verify that there is at least one tranche or throw, and that the tranche values make sense.
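One way to read this request is sketched below. It is an illustration under stated assumptions rather than the check that was merged: the class and method names are hypothetical, "make sense" is assumed to mean a percentage in the range (0, 100], and `IllegalArgumentException` stands in for whatever user-facing exception the tool would actually throw.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TrancheValidationSketch {
    // Validates the requested tranche levels and returns them sorted ascending.
    static List<Double> validatedTranches(List<Double> tranches) {
        if (tranches == null || tranches.isEmpty()) {
            throw new IllegalArgumentException("At least one tranche must be specified.");
        }
        for (Double t : tranches) {
            // Assumption: a tranche is a truth sensitivity expressed in percent.
            if (t == null || t.isNaN() || t <= 0.0 || t > 100.0) {
                throw new IllegalArgumentException(
                        "Tranche must be a percentage in (0, 100], got: " + t);
            }
        }
        List<Double> sorted = new ArrayList<>(tranches); // leave the caller's list untouched
        Collections.sort(sorted);
        return sorted;
    }
}
```

With the default from the argument definition, `validatedTranches(Arrays.asList(99.9, 99.0, 90.0))` returns the levels in ascending order, while an empty list or a value such as 150.0 is rejected before traversal starts.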

cmnbroad · 2018-05-30T21:30:02Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

-            throw new GATKException(String.format("Could not write temporary index to file: %s", tempFileIdx.getAbsolutePath()), e);
+    protected void secondPassApply(VariantContext variant, ReadsContext readsContext, ReferenceContext referenceContext, FeatureContext featureContext) {
+        final VariantContextBuilder builder = new VariantContextBuilder(variant);
+        final double score = Double.parseDouble((String)variant.getAttribute(infoKey));


getAttribute here and elsewhere will return null, and this will throw a NullPointerException when called on a variant that doesn't have the attribute.

added check.
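The thread does not show what the added check looks like, so the following is only a sketch of the general idea. A plain `Map` stands in for the variant's INFO attributes, and `scoreFor` is a hypothetical helper; in the real tool, htsjdk's `VariantContext.hasAttribute` or `getAttributeAsDouble` may be a more natural fit.

```java
import java.util.Map;
import java.util.OptionalDouble;

final class ScoreLookupSketch {
    // Returns the score stored under infoKey, or empty when the attribute
    // is missing or cannot be parsed as a number.
    static OptionalDouble scoreFor(Map<String, Object> attributes, String infoKey) {
        Object raw = attributes.get(infoKey);
        if (raw == null) {
            return OptionalDouble.empty(); // variant lacks the attribute
        }
        try {
            return OptionalDouble.of(Double.parseDouble(raw.toString()));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty(); // present but not numeric
        }
    }
}
```

Parsing the score once through a helper like this also removes the repeated getAttribute/parse calls flagged earlier in the review, since every caller works from the same value.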

cmnbroad · 2018-05-30T21:39:14Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

+    private String filterStringFromScore(double score, List<Double> cutoffs){
+        for (int i = 0; i < cutoffs.size()-1; i++){
+            if (score > cutoffs.get(i) && i == 0){
+                return "PASS"; // but this case should already be caught by isTrancheFiltered()


On second thought, since this should never happen, I think it should just throw a GATKException if it does rather than returning PASS.

lucidtronix · 2018-05-30T22:17:18Z

Responded to most of the review, but still need to add a proper integration test and fix up the pipeline test.

cmnbroad

Added a question about the tranche filter assignments, plus this has conflicts now.

cmnbroad · 2018-06-06T12:56:57Z

src/test/java/org/broadinstitute/hellbender/tools/walkers/vqsr/CNNVariantPipelineTest.java

-                .addArgument("snp-truth-vcf", snpTruthVCF)
-                .addArgument("indel-truth-vcf", indelTruthVCF)
+                .addArgument("resource", snpTruthVCF)
+                .addArgument("resource", indelTruthVCF)


This test doesn't really need to be assigned to the python group anymore.

cmnbroad · 2018-06-06T14:24:06Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranches.java

+            if (score > cutoffs.get(i) && i == 0){
+                throw new GATKException("Trying to add a filter to a passing variant.");
+            } else if (score > cutoffs.get(i)){
+                return filterKeyFromTranches(infoKey, tranches.get(i), tranches.get(i+1));


Is this assigning the correct tranche label? i is always > 0 && < size if you get here, so should the tranche label for this be i-1 to i? i+1 would go off the end of the list when i == size()-1?

I think this is OK. Tranches and cutoffs are assumed to be the same size, and the loop condition is i < cutoffs.size()-1.

@lucidtronix Right, just want to check if the label assignment is correct: Given these tranches and snp cutoffs:

Tranche: 95.0
Tranche: 99.0
Tranche: 99.9

snpCutoffs: 0.860204
snpCutoffs: -1.59729
snpCutoffs: -4.4343

and a snpScore of -3.51172, this assigns filter label 99.90 - 100.00. Is that the right one ?

Oh, I see now. You're right, that is not expected. I will shift the tranches over.
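The intended mapping can be worked through with the numbers from this thread. The sketch below is an assumption about what "shift the tranches over" means, not the merged code: it assumes tranches are sorted ascending, that cutoffs.get(i) is the score threshold for tranches.get(i), and that a score above cutoffs.get(i) but not above cutoffs.get(i-1) belongs to the band from tranches.get(i-1) to tranches.get(i). The label format and the names `filterFor` and `label` are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

final class TrancheLabelSketch {
    // Hypothetical label format for the band between two sensitivity levels.
    static String label(String infoKey, double low, double high) {
        return String.format(Locale.ROOT, "%s_Tranche_%.2f_%.2f", infoKey, low, high);
    }

    // Returns the tranche filter for a score, or empty if the variant passes.
    static Optional<String> filterFor(double score, String infoKey,
                                      List<Double> tranches, List<Double> cutoffs) {
        if (score > cutoffs.get(0)) {
            return Optional.empty(); // above the strictest cutoff: no filter
        }
        for (int i = 1; i < cutoffs.size(); i++) {
            if (score > cutoffs.get(i)) {
                // Band bounded below by tranche i-1 and above by tranche i.
                return Optional.of(label(infoKey, tranches.get(i - 1), tranches.get(i)));
            }
        }
        // At or below every cutoff: the last band runs up to 100 percent.
        return Optional.of(label(infoKey, tranches.get(tranches.size() - 1), 100.0));
    }
}
```

With tranches 95.0, 99.0, 99.9, cutoffs 0.860204, -1.59729, -4.4343, and a score of -3.51172, the first cutoff the score exceeds is -4.4343 (index 2), so this sketch assigns the 99.00 to 99.90 band instead of 99.90 to 100.00.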

lucidtronix · 2018-06-11T15:00:07Z

Resolved conflicts and rebased.

lucidtronix · 2018-06-11T18:51:23Z

@cmnbroad back to you.

lucidtronix · 2018-06-20T13:04:53Z

@cmnbroad rebased and fixed bug. Is this ready to merge?

cmnbroad

One more minor cleanup request, then we can merge this once tests pass again.

cmnbroad · 2018-06-21T00:39:30Z

...a/org/broadinstitute/hellbender/tools/walkers/vqsr/FilterVariantTranchesIntegrationTest.java

+
+        if(newExpectations){
+            argsBuilder.addArgument(StandardArgumentDefinitions.OUTPUT_LONG_NAME, largeFileTestDir + "VQSR/expected/g94982_20_1m_10m_tranched_99.vcf");
+            runCommandLine(argsBuilder);


This code path and the variable newExpectations can be removed, here and below.

cmnbroad

Thanks for the changes @lucidtronix!

cmnbroad self-assigned this May 23, 2018

This was referenced May 23, 2018

FilterVariantTranches should be written entirely in Java #4535

Closed

[FilterVariantTranches] ValueError: fetch requires an index #4794

Closed

droazen requested review from cmnbroad and takutosato May 23, 2018 15:27

droazen assigned takutosato May 23, 2018

lucidtronix mentioned this pull request May 24, 2018

cnn variant wdls and jsons #4774

Merged

cmnbroad reviewed May 25, 2018

cmnbroad requested changes May 30, 2018

cmnbroad reviewed Jun 6, 2018

lucidtronix force-pushed the sf_tranche_filter_java branch from f242121 to 9195390 Compare June 11, 2018 14:58

tranche filtering in java

1f732a5

lucidtronix force-pushed the sf_tranche_filter_java branch from 58d7eb1 to 26175ad Compare June 20, 2018 13:03

cleanup

26175ad

cmnbroad reviewed Jun 21, 2018

fix tests

c88dbde

cmnbroad approved these changes Jun 21, 2018

cmnbroad merged commit fb4c7a1 into master Jun 21, 2018

cmnbroad deleted the sf_tranche_filter_java branch June 21, 2018 15:55

