-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VCF Tranche filtering in java #4800
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4800 +/- ##
===============================================
+ Coverage 80.457% 80.478% +0.021%
- Complexity 17839 17866 +27
===============================================
Files 1092 1092
Lines 64238 64287 +49
Branches 10352 10368 +16
===============================================
+ Hits 51684 51737 +53
+ Misses 8503 8498 -5
- Partials 4051 4052 +1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not finished with first pass yet but checkpointing what I have so far.
} | ||
} | ||
|
||
@Override | ||
protected Object doWork() { | ||
final Resource pythonScriptResource = new Resource("tranches.py", FilterVariantTranches.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can all or part of tranches.py be removed from the repo now ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes all gone!
for (FeatureInput<VariantContext> featureSource : resources) { | ||
for (VariantContext v : featureContext.getValues(featureSource)) { | ||
if (variant.isSNP()){ | ||
snpScores.add(Double.parseDouble((String)variant.getAttribute(infoKey))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be including the CNN score for the input variant repeatedly, once for each overlapping resource variant, no matter whether the known variant's type matches the input variant's type or not (SNP or INDEL) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No that was bug and it also wasn't checking that the alleles actually matched. Now it returns if it finds a match.
} else if (variant.isIndel()){ | ||
indelScores.add(Double.parseDouble((String)variant.getAttribute(infoKey))); | ||
} else { | ||
logger.info(String.format("Not SNP or INDEL Overlapping variant at %s:%d-%d: Ref: %s Alt(s): %s\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could produce lots of output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the text of the message could be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
if (variant.isSNP() && isTrancheFiltered(score, snpCutoffs)){ | ||
builder.filter(filterStringFromScore(score, snpCutoffs)); | ||
} else if (variant.isIndel() && isTrancheFiltered(score, indelCutoffs)){ | ||
builder.filter(filterStringFromScore(Double.parseDouble((String)variant.getAttribute(infoKey)), indelCutoffs)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Th getAtttibute
/parse
calls are redundant here since the attribute value is already in score
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
private String filterStringFromScore(double score, List<Double> cutoffs){ | ||
for (int i = 0; i < cutoffs.size()-1; i++){ | ||
if (score > cutoffs.get(i) && i == 0){ | ||
return "PASS"; // but this case should already be caught by isTrancheFiltered() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a VCFConstant for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, since this should never happen, I think it should just throw a GATKException if it does rather than returning PASS
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find one, so I added it, let me know if I'm looking in the wrong place.
|
||
public class FilterVariantTranches extends TwoPassVariantWalker { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tool should have at least one validating test with expected output, independent of the integrated pipeline test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that the CNNVariantPipeline test is using VQSLOD
for the info key for the tranche test. That should probably be replaced with something generated by the pipeline... and the existing test can be made part of the integration test for this tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass done, with a couple of questions.
shortName="t", | ||
doc="The levels of truth sensitivity at which to slice the data. (in percents, i.e. 99.9 for 99.9 percent and 1.0 for 1 percent)", | ||
optional=true) | ||
private List<Double> tranches = new ArrayList<Double>(Arrays.asList(99.9, 99.0, 90.0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<Double>
in new ArrayList<Double>
is unnecessary and can be removed since it can be inferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
shortName = "resource", | ||
doc="A list of validated VCFs with known sites of common variation", | ||
optional=false) | ||
private List<FeatureInput<VariantContext>> resources = new ArrayList<>(); | ||
|
||
@Argument(fullName = "info-key", shortName = "info-key", doc = "The key must be in the INFO field of the input VCF.") | ||
private String infoKey = GATKVCFConstants.CNN_1D_KEY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking that the intention is that this can be applied to any info attribute, not just the CNN ones ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, any reason not to default to 2d ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, any INFO attribute should work, I will change the default.
// setup the header fields | ||
final VCFHeader inputHeader = getHeaderForVariants(); | ||
final Set<VCFHeaderLine> inputHeaders = inputHeader.getMetaDataInSortedOrder(); | ||
final Set<VCFHeaderLine> hInfo = new HashSet<>(inputHeaders); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should the behavior be if the input already has tranche filters for this key from a previous run ? It seems like both the header lines as well as the filters themselves should be removed since they're being replaced with new ones ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this should check that the header has a header line that matches the requested infoKey (i.e., that the input actually has CNN_1D or whatever).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior now is to just add to the filter field and I think that makes sense in general. There could be use-cases where we've filtered with a different tool on some other criteria and then want to add tranche filtering. If the user does want to remove all existing filters they can run VariantFiltration with the --invalidate-previous-filters
option.
|
||
public class FilterVariantTranches extends TwoPassVariantWalker { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that the CNNVariantPipeline test is using VQSLOD
for the info key for the tranche test. That should probably be replaced with something generated by the pipeline... and the existing test can be made part of the integration test for this tool.
private List<Double> tranches = new ArrayList<Double>(Arrays.asList(99.9, 99.0, 90.0)); | ||
@Override | ||
public void onTraversalStart() { | ||
tranches.sort(Double::compareTo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should verify that there is at least one tranche or throw, and that the tranche values make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
throw new GATKException(String.format("Could not write temporary index to file: %s", tempFileIdx.getAbsolutePath()), e); | ||
protected void secondPassApply(VariantContext variant, ReadsContext readsContext, ReferenceContext referenceContext, FeatureContext featureContext) { | ||
final VariantContextBuilder builder = new VariantContextBuilder(variant); | ||
final double score = Double.parseDouble((String)variant.getAttribute(infoKey)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getAttribute
here and elsewhere will return null and this throw a null pointer exception when called on a variant that doesn't have the attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added check.
private String filterStringFromScore(double score, List<Double> cutoffs){ | ||
for (int i = 0; i < cutoffs.size()-1; i++){ | ||
if (score > cutoffs.get(i) && i == 0){ | ||
return "PASS"; // but this case should already be caught by isTrancheFiltered() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, since this should never happen, I think it should just throw a GATKException if it does rather than returning PASS
.
Responded to most of the review, but still need to add a proper integration test and fix up the pipeline test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a question about the tranche filter assignments, plus this has conflicts now.
.addArgument("snp-truth-vcf", snpTruthVCF) | ||
.addArgument("indel-truth-vcf", indelTruthVCF) | ||
.addArgument("resource", snpTruthVCF) | ||
.addArgument("resource", indelTruthVCF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test doesn't really need to be assigned to the python
group anymore.
if (score > cutoffs.get(i) && i == 0){ | ||
throw new GATKException("Trying to add a filter to a passing variant."); | ||
} else if (score > cutoffs.get(i)){ | ||
return filterKeyFromTranches(infoKey, tranches.get(i), tranches.get(i+1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this assigning the correct tranche label ? i
is always > 0 && < size if you get here, should the tranche label for this be i-1
to i
? i+1
would go off the end of the list when i == size()-1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK. Tranches and cutoffs are assumed to be the same size and loop conditional is
i < cutoffs.size()-1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lucidtronix Right, just want to check if the label assignment is correct: Given these tranches and snp cutoffs:
Tranche: 95.0
Tranche: 99.0
Tranche: 99.9
snpCutoffs: 0.860204
snpCutoffs: -1.59729
snpCutoffs: -4.4343
and a snpScore of -3.51172, this assigns filter label 99.90 - 100.00. Is that the right one ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see now, you're right that is not expected. I will shift the tranches over.
f242121
to
9195390
Compare
Resolved conflicts and rebased. |
@cmnbroad back to you. |
58d7eb1
to
26175ad
Compare
@cmnbroad rebased and fixed bug. Is this ready to merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more minor cleanup request, then we can merge this once tests pass again.
|
||
if(newExpectations){ | ||
argsBuilder.addArgument(StandardArgumentDefinitions.OUTPUT_LONG_NAME, largeFileTestDir + "VQSR/expected/g94982_20_1m_10m_tranched_99.vcf"); | ||
runCommandLine(argsBuilder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code path and the variable newExpectations
can be removed, here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes @lucidtronix!
Rewrite of the FilterVariantTranches tool without python dependencies. Uses @takutosato's shiny new TwoPassVariantWalker. @takutosato or @cmnbroad care to review?