
VCF Tranche filtering in java #4800

Merged 3 commits into master on Jun 21, 2018
Conversation

lucidtronix (Contributor):

A rewrite of the FilterVariantTranches tool without Python dependencies. Uses @takutosato's shiny new TwoPassVariantWalker. @takutosato or @cmnbroad, care to review?

codecov-io commented May 23, 2018:

Codecov Report

Merging #4800 into master will increase coverage by 0.021%.
The diff coverage is 92.308%.

@@               Coverage Diff               @@
##              master     #4800       +/-   ##
===============================================
+ Coverage     80.457%   80.478%   +0.021%     
- Complexity     17839     17866       +27     
===============================================
  Files           1092      1092               
  Lines          64238     64287       +49     
  Branches       10352     10368       +16     
===============================================
+ Hits           51684     51737       +53     
+ Misses          8503      8498        -5     
- Partials        4051      4052        +1
Impacted Files Coverage Δ Complexity Δ
...der/tools/walkers/vqsr/CNNVariantWriteTensors.java 83.333% <100%> (ø) 4 <2> (ø) ⬇️
...hellbender/tools/walkers/vqsr/CNNVariantTrain.java 80.645% <100%> (ø) 4 <2> (ø) ⬇️
...nder/tools/walkers/vqsr/FilterVariantTranches.java 92.632% <92.135%> (+16.545%) 32 <32> (+27) ⬆️

cmnbroad (Collaborator) left a comment:

Not finished with first pass yet but checkpointing what I have so far.

}
}

@Override
protected Object doWork() {
final Resource pythonScriptResource = new Resource("tranches.py", FilterVariantTranches.class);
cmnbroad (Collaborator):

Can all or part of tranches.py be removed from the repo now?

lucidtronix (Contributor, Author):

Yes, all gone!

for (FeatureInput<VariantContext> featureSource : resources) {
for (VariantContext v : featureContext.getValues(featureSource)) {
if (variant.isSNP()){
snpScores.add(Double.parseDouble((String)variant.getAttribute(infoKey)));
cmnbroad (Collaborator):

Is this supposed to be including the CNN score for the input variant repeatedly, once for each overlapping resource variant, regardless of whether the known variant's type (SNP or INDEL) matches the input variant's type?

lucidtronix (Contributor, Author):

No, that was a bug, and it also wasn't checking that the alleles actually matched. Now it returns as soon as it finds a match.
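The fix described here — count a variant's score only when an overlapping resource variant of the same type with identical alleles is found, then stop — can be sketched without the htsjdk classes. The Variant record below is a simplified stand-in for VariantContext, not GATK's actual API:

```java
import java.util.List;
import java.util.Objects;

public class TrancheMatchSketch {
    // Illustrative stand-in for htsjdk's VariantContext.
    record Variant(String contig, int start, String ref, String alt) {
        boolean isSNP() { return ref.length() == 1 && alt.length() == 1; }
    }

    // Return true as soon as an overlapping resource variant with the same
    // position, type, and alleles is found, so a score is counted at most once.
    static boolean foundInResource(Variant v, List<Variant> overlapping) {
        for (Variant r : overlapping) {
            if (Objects.equals(v.contig(), r.contig())
                    && v.start() == r.start()
                    && Objects.equals(v.ref(), r.ref())
                    && Objects.equals(v.alt(), r.alt())
                    && v.isSNP() == r.isSNP()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Variant snp = new Variant("20", 100, "A", "G");
        List<Variant> resource = List.of(
                new Variant("20", 100, "A", "T"),  // same site, different alt: no match
                new Variant("20", 100, "A", "G")); // exact allele match
        System.out.println(foundInResource(snp, resource)); // prints true
    }
}
```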

} else if (variant.isIndel()){
indelScores.add(Double.parseDouble((String)variant.getAttribute(infoKey)));
} else {
logger.info(String.format("Not SNP or INDEL Overlapping variant at %s:%d-%d: Ref: %s Alt(s): %s\n",
cmnbroad (Collaborator):

This could produce lots of output.

cmnbroad (Collaborator):

Also, the text of the message could be clearer.

lucidtronix (Contributor, Author):

Removed.

if (variant.isSNP() && isTrancheFiltered(score, snpCutoffs)){
builder.filter(filterStringFromScore(score, snpCutoffs));
} else if (variant.isIndel() && isTrancheFiltered(score, indelCutoffs)){
builder.filter(filterStringFromScore(Double.parseDouble((String)variant.getAttribute(infoKey)), indelCutoffs));
cmnbroad (Collaborator):

The getAttribute/parse calls are redundant here since the attribute value is already in score.

lucidtronix (Contributor, Author):

Fixed.

private String filterStringFromScore(double score, List<Double> cutoffs){
for (int i = 0; i < cutoffs.size()-1; i++){
if (score > cutoffs.get(i) && i == 0){
return "PASS"; // but this case should already be caught by isTrancheFiltered()
cmnbroad (Collaborator):

There is a VCFConstant for this.

cmnbroad (Collaborator):

On second thought, since this should never happen, I think it should just throw a GATKException if it does rather than returning PASS.

lucidtronix (Contributor, Author):

I couldn't find one, so I added it; let me know if I'm looking in the wrong place.


public class FilterVariantTranches extends TwoPassVariantWalker {
cmnbroad (Collaborator):

This tool should have at least one validating test with expected output, independent of the integrated pipeline test.

cmnbroad (Collaborator):

I just noticed that the CNNVariantPipeline test is using VQSLOD for the info key for the tranche test. That should probably be replaced with something generated by the pipeline... and the existing test can be made part of the integration test for this tool.

cmnbroad (Collaborator) left a comment:

First pass done, with a couple of questions.

shortName="t",
doc="The levels of truth sensitivity at which to slice the data. (in percents, i.e. 99.9 for 99.9 percent and 1.0 for 1 percent)",
optional=true)
private List<Double> tranches = new ArrayList<Double>(Arrays.asList(99.9, 99.0, 90.0));
cmnbroad (Collaborator):

<Double> in new ArrayList<Double> is unnecessary and can be removed since it can be inferred.

lucidtronix (Contributor, Author):

Fixed.

shortName = "resource",
doc="A list of validated VCFs with known sites of common variation",
optional=false)
private List<FeatureInput<VariantContext>> resources = new ArrayList<>();

@Argument(fullName = "info-key", shortName = "info-key", doc = "The key must be in the INFO field of the input VCF.")
private String infoKey = GATKVCFConstants.CNN_1D_KEY;
cmnbroad (Collaborator):

Just checking: is the intention that this can be applied to any INFO attribute, not just the CNN ones?

cmnbroad (Collaborator):

Also, any reason not to default to 2D?

lucidtronix (Contributor, Author):

Yes, any INFO attribute should work, I will change the default.

// setup the header fields
final VCFHeader inputHeader = getHeaderForVariants();
final Set<VCFHeaderLine> inputHeaders = inputHeader.getMetaDataInSortedOrder();
final Set<VCFHeaderLine> hInfo = new HashSet<>(inputHeaders);
cmnbroad (Collaborator):

What should the behavior be if the input already has tranche filters for this key from a previous run? It seems like both the header lines and the filters themselves should be removed, since they're being replaced with new ones?

cmnbroad (Collaborator):

Also, this should check that the header has a header line that matches the requested infoKey (i.e., that the input actually has CNN_1D or whatever).

lucidtronix (Contributor, Author):

The behavior now is to just add to the filter field and I think that makes sense in general. There could be use-cases where we've filtered with a different tool on some other criteria and then want to add tranche filtering. If the user does want to remove all existing filters they can run VariantFiltration with the --invalidate-previous-filters option.



private List<Double> tranches = new ArrayList<Double>(Arrays.asList(99.9, 99.0, 90.0));
@Override
public void onTraversalStart() {
tranches.sort(Double::compareTo);
cmnbroad (Collaborator):

This should verify that there is at least one tranche or throw, and that the tranche values make sense.

lucidtronix (Contributor, Author):

Added.
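The validation asked for above — at least one tranche, and values that make sense as truth sensitivities — might look like the following sketch. The exception type and messages are illustrative, not GATK's actual code:

```java
import java.util.Arrays;
import java.util.List;

public class TrancheValidation {
    // Reject an empty tranche list and any value outside [0, 100],
    // since tranches are truth sensitivities expressed in percent.
    static void validateTranches(List<Double> tranches) {
        if (tranches.isEmpty()) {
            throw new IllegalArgumentException("At least one tranche must be supplied.");
        }
        for (double t : tranches) {
            if (t < 0.0 || t > 100.0) {
                throw new IllegalArgumentException(
                        "Tranches are percentages and must be in [0, 100], got: " + t);
            }
        }
    }

    public static void main(String[] args) {
        validateTranches(Arrays.asList(99.9, 99.0, 90.0)); // passes silently
        try {
            validateTranches(Arrays.asList(101.0));
            throw new AssertionError("expected an exception");
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```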

throw new GATKException(String.format("Could not write temporary index to file: %s", tempFileIdx.getAbsolutePath()), e);
protected void secondPassApply(VariantContext variant, ReadsContext readsContext, ReferenceContext referenceContext, FeatureContext featureContext) {
final VariantContextBuilder builder = new VariantContextBuilder(variant);
final double score = Double.parseDouble((String)variant.getAttribute(infoKey));
cmnbroad (Collaborator):

getAttribute here and elsewhere will return null, and this will throw a NullPointerException when called on a variant that doesn't have the attribute.

lucidtronix (Contributor, Author):

Added a check.
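A minimal sketch of that null check, with getAttribute modeled as a plain Map lookup rather than GATK's real VariantContext API:

```java
import java.util.Map;

public class ScoreLookup {
    // Fail with a clear message instead of an NPE when the INFO key is absent.
    static double scoreFor(Map<String, String> info, String infoKey) {
        Object raw = info.get(infoKey);
        if (raw == null) {
            throw new IllegalStateException(
                    "Variant is missing the required INFO annotation: " + infoKey);
        }
        return Double.parseDouble((String) raw);
    }

    public static void main(String[] args) {
        System.out.println(scoreFor(Map.of("CNN_1D", "-3.51"), "CNN_1D")); // prints -3.51
    }
}
```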


lucidtronix (Contributor, Author):

Responded to most of the review, but still need to add a proper integration test and fix up the pipeline test.

cmnbroad (Collaborator) left a comment:

Added a question about the tranche filter assignments, plus this has conflicts now.

.addArgument("snp-truth-vcf", snpTruthVCF)
.addArgument("indel-truth-vcf", indelTruthVCF)
.addArgument("resource", snpTruthVCF)
.addArgument("resource", indelTruthVCF)
cmnbroad (Collaborator):

This test doesn't really need to be assigned to the python group anymore.

if (score > cutoffs.get(i) && i == 0){
throw new GATKException("Trying to add a filter to a passing variant.");
} else if (score > cutoffs.get(i)){
return filterKeyFromTranches(infoKey, tranches.get(i), tranches.get(i+1));
cmnbroad (Collaborator):

Is this assigning the correct tranche label? i is always > 0 and < size if you get here; should the tranche label for this be i-1 to i? And i+1 would go off the end of the list when i == size()-1?

lucidtronix (Contributor, Author):

I think this is OK. Tranches and cutoffs are assumed to be the same size, and the loop conditional is i < cutoffs.size()-1.

cmnbroad (Collaborator) commented Jun 18, 2018:

@lucidtronix Right, just want to check if the label assignment is correct: Given these tranches and snp cutoffs:

Tranche: 95.0
Tranche: 99.0
Tranche: 99.9

snpCutoffs: 0.860204
snpCutoffs: -1.59729
snpCutoffs: -4.4343

and a snpScore of -3.51172, this assigns filter label 99.90 - 100.00. Is that the right one ?

lucidtronix (Contributor, Author):

Oh, I see now, you're right, that is not expected. I will shift the tranches over.
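The shifted label assignment agreed on here can be sketched as follows, reusing the worked example from the review. Cutoffs are scores sorted descending and tranches are sensitivities sorted ascending; the method and message names are illustrative, not GATK's final code:

```java
import java.util.List;
import java.util.Locale;

public class TrancheLabel {
    // A score better than the i-th cutoff (for i > 0) falls in the band
    // between tranche i-1 and tranche i; a score worse than every cutoff
    // falls in the worst band, up to 100% sensitivity.
    static String filterStringFromScore(double score, List<Double> tranches, List<Double> cutoffs) {
        for (int i = 0; i < cutoffs.size(); i++) {
            if (score > cutoffs.get(i)) {
                if (i == 0) {
                    throw new IllegalStateException("A passing variant should not be filtered.");
                }
                return String.format(Locale.ROOT, "%.2f-%.2f", tranches.get(i - 1), tranches.get(i));
            }
        }
        return String.format(Locale.ROOT, "%.2f-100.00", tranches.get(tranches.size() - 1));
    }

    public static void main(String[] args) {
        List<Double> tranches = List.of(95.0, 99.0, 99.9);
        List<Double> cutoffs  = List.of(0.860204, -1.59729, -4.4343);
        // The snpScore from the example above, -3.51172, sits between the
        // 99.0 cutoff (-1.59729) and the 99.9 cutoff (-4.4343):
        System.out.println(filterStringFromScore(-3.51172, tranches, cutoffs)); // prints 99.00-99.90
    }
}
```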

lucidtronix force-pushed the sf_tranche_filter_java branch from f242121 to 9195390 on June 11, 2018 14:58
lucidtronix (Contributor, Author):

Resolved conflicts and rebased.

lucidtronix (Contributor, Author):

@cmnbroad back to you.

lucidtronix force-pushed the sf_tranche_filter_java branch from 58d7eb1 to 26175ad on June 20, 2018 13:03
lucidtronix (Contributor, Author):

@cmnbroad rebased and fixed bug. Is this ready to merge?

cmnbroad (Collaborator) left a comment:

One more minor cleanup request, then we can merge this once tests pass again.


if(newExpectations){
argsBuilder.addArgument(StandardArgumentDefinitions.OUTPUT_LONG_NAME, largeFileTestDir + "VQSR/expected/g94982_20_1m_10m_tranched_99.vcf");
runCommandLine(argsBuilder);
cmnbroad (Collaborator):

This code path and the variable newExpectations can be removed, here and below.

lucidtronix (Contributor, Author):

Done.

cmnbroad (Collaborator) left a comment:

Thanks for the changes @lucidtronix!

cmnbroad merged commit fb4c7a1 into master on Jun 21, 2018.
cmnbroad deleted the sf_tranche_filter_java branch on June 21, 2018 15:55.