Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added --sites-only argument to GATK engine #4764

Merged
merged 4 commits into from
May 17, 2018
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ private StandardArgumentDefinitions(){}
public static final String ANNOTATIONS_TO_EXCLUDE_LONG_NAME = "annotations-to-exclude";
public static final String SAMPLE_NAME_LONG_NAME = "sample-name";
public static final String PEDIGREE_FILE_LONG_NAME = "pedigree";
public static final String SITES_ONLY_LONG_NAME = "sites-only";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename to the more explicit sites-only-vcf-output?


public static final String INPUT_SHORT_NAME = "I";
public static final String OUTPUT_SHORT_NAME = "O";
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,10 @@ public abstract class GATKTool extends CommandLineProgram {
optional = true)
public boolean disableBamIndexCaching = false;

@Argument(fullName = StandardArgumentDefinitions.SITES_ONLY_LONG_NAME,
doc = "If true, don't emit GT fields when writing vcf file output.", optional = true)
public boolean sitesOnlyMode = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name of field should make it clear that this applies to VCFs specifically -- eg., outputSitesOnlyVCFs instead of sitesOnlyMode


/**
* Master sequence dictionary to be used instead of all other dictionaries (if provided).
*/
Expand Down Expand Up @@ -728,6 +732,10 @@ public VariantContextWriter createVCFWriter(final File outFile) {
}
}

if (sitesOnlyMode) {
options.add(Options.DO_NOT_WRITE_GENOTYPES);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check to see whether there are any other GATKTools that bypass GATKTool.createVCFWriter() when writing a VCF and call GATKVariantContextUtils.createVCFWriter() directly instead? (besides the ones you dealt with below)

Don't worry about non-GATKTools.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already checked. It is bypassed in a number of tests, the HaplotypeCaller, and CreateSomaticPanelOfNormals which doesn't extend GATKTool so I elected not to include it here.

}

return GATKVariantContextUtils.createVCFWriter(
outFile,
sequenceDictionary,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -198,7 +198,7 @@ public void onTraversalStart() {

// The HC engine will make the right kind (VCF or GVCF) of writer for us
final SAMSequenceDictionary sequenceDictionary = getHeaderForReads().getSequenceDictionary();
vcfWriter = hcEngine.makeVCFWriter(outputVCF, sequenceDictionary, createOutputVariantIndex, createOutputVariantMD5);
vcfWriter = hcEngine.makeVCFWriter(outputVCF, sequenceDictionary, createOutputVariantIndex, sitesOnlyMode, createOutputVariantMD5);
hcEngine.writeHeader(vcfWriter, sequenceDictionary, getDefaultToolVCFHeaderLines());
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -322,17 +322,20 @@ public static List<ReadFilter> makeStandardHCReadFilters() {
* @return a VCF or GVCF writer as appropriate, ready to use
*/
public VariantContextWriter makeVCFWriter( final String outputVCF, final SAMSequenceDictionary readsDictionary,
final boolean createOutputVariantIndex, final boolean createOutputVariantMD5 ) {
final boolean createOutputVariantIndex, final boolean sitesOnlyMode,
final boolean createOutputVariantMD5 ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be slightly more logical if the ordering of the boolean args were: createOutputVariantIndex, createOutputVariantMD5, sitesOnlyMode instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Utils.nonNull(outputVCF);
Utils.nonNull(readsDictionary);

List<Options> options = new ArrayList<>(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

if (createOutputVariantIndex) {options.add(Options.INDEX_ON_THE_FLY);}
if (sitesOnlyMode) {options.add(Options.DO_NOT_WRITE_GENOTYPES);}

VariantContextWriter writer = GATKVariantContextUtils.createVCFWriter(
new File(outputVCF),
readsDictionary,
createOutputVariantMD5,
createOutputVariantIndex ?
new Options[]{Options.INDEX_ON_THE_FLY} :
new Options[0]
options.toArray(new Options[options.size()])
);

if ( hcArgs.emitReferenceConfidence == ReferenceConfidenceMode.GVCF ) {
Expand Down
Original file line number Diff line number Diff line change
@@ -1,14 +1,23 @@
package org.broadinstitute.hellbender.engine;

import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.vcf.VCFHeader;
import org.apache.commons.lang3.tuple.Pair;
import org.broadinstitute.hellbender.CommandLineProgramTest;
import org.broadinstitute.hellbender.Main;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove stray imports from this class

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doen

import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.exceptions.UserException;
import org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants;
import org.broadinstitute.hellbender.utils.test.IntegrationTestSpec;
import org.broadinstitute.hellbender.tools.examples.ExampleReadWalkerWithVariants;
import org.broadinstitute.hellbender.utils.test.VariantContextTestUtils;
import org.testng.Assert;
import org.testng.annotations.Test;

import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public final class FeatureSupportIntegrationTest extends CommandLineProgramTest {
private static final String FEATURE_INTEGRATION_TEST_DIRECTORY = publicTestDir + "org/broadinstitute/hellbender/engine/";
Expand Down Expand Up @@ -83,6 +92,25 @@ public void testFeaturesAsIntervalsUnrecognizedFormatFile() throws IOException {
testSpec.executeTest("testFeaturesAsIntervals", this);
}

@Test
public void testSitesOnlyMode() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to a new GATKToolIntegrationTest class instead? I regret my original suggestion of FeatureSupportIntegrationTest

File out = createTempFile("GTStrippedOutput", "vcf");
String[] args = new String[] {
"-V", FEATURE_INTEGRATION_TEST_DIRECTORY + "vcf_with_genotypes.vcf",
"--" + StandardArgumentDefinitions.SITES_ONLY_LONG_NAME,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you confirm that this test fails if you remove the --sites-only arg temporarily?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just checked, it fails without --sites-only

"-O",
out.getAbsolutePath()};
new Main().instanceMain(makeCommandLineArgs(Arrays.asList(args), SelectVariants.class.getSimpleName()));

// Assert that the genotype field has been stripped from the file
Pair<VCFHeader, List<VariantContext>> results = VariantContextTestUtils.readEntireVCFIntoMemory(out.getAbsolutePath());

Assert.assertFalse(results.getLeft().hasGenotypingData());
for (VariantContext v: results.getRight()) {
Assert.assertFalse(v.hasGenotypes());
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a similar test to HaplotypeCallerIntegrationTest to prove that it honors --sites-only, since that is a special case?

@Test
// this test asserts that a helpful exception is thrown for blockZipped files lacking an index as they may not be fully supported
//TODO this is a temporary fix until https://github.com/broadinstitute/gatk/issues/4224 has been resolved
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=MIN_GQ,Number=1,Type=Integer,Description="Minimum GQ observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA1
20 69491 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:MIN_GQ:PL 0/0:94:99:82:99:0,120,1800
20 69511 . A G,<NON_REF> 2253.77 . . GT:AD:DP:GQ:PL:SB 1/1:1,79,0:80:99:2284,207,0,2287,237,2316:0,1,46,33
20 69512 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:MIN_GQ:PL 0/0:96:99:82:99:0,120,1800
20 69522 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:MIN_GQ:PL 0/0:95:0:95:0:0,0,0
20 69549 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:MIN_GQ:PL 0/0:156:99:56:66:0,66,990
20 69635 . C T,<NON_REF> 60.77 . . GT:AD:DP:GQ:PL:SB 0/1:4,3,0:7:89:89,0,119,101,128,229:0,4,0,3
20 69762 . A <NON_REF> . . . GT:DP:GQ:MIN_DP:MIN_GQ:PL 0/0:7:18:7:18:0,18,270
20 69763 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:MIN_GQ:PL 0/0:7:21:7:21:0,21,253
20 69767 . C <NON_REF> . . . GT:DP:GQ:MIN_DP:MIN_GQ:PL 0/0:7:12:7:12:0,12,180
20 69772 . TTATC T,<NON_REF> 60.77 . . GT:AD:DP:GQ:PL:SB 0/1:4,3,0:7:89:89,0,119,101,128,229:0,4,0,3
20 69773 . T <NON_REF> . . . GT:DP:GQ:MIN_DP:MIN_GQ:PL 0/0:7:0:3:0:0,0,0