Genotype concordance workflows #414
Conversation
I have a few questions and I'm requesting a few minor, stylistic changes. Overall it looks pretty good.
I have a couple of comments about format_svtk_vcf_for_gatk.py that are either 1) not really a request for change or 2) just outside the changes you made and therefore not critical for the PR:
- Minor style note: I noticed __parse_bnd_ends and __parse_ploidy_table start with double underscores. I've taken to using double underscores for __parse_arguments because they enforce privacy: if a Python script does "from other_script import *" and that other script happens to have __parse_arguments in it, nothing bad happens, whereas if they both had _parse_arguments, that could get clobbered and produce crazy crashes. I don't think it's necessary to use more than a single underscore for most other private functions, and it's arguably overkill even for __parse_arguments.
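For illustration, a minimal sketch of the naming convention in question (module and function names here are hypothetical):

# util_script.py (hypothetical module)
def __parse_arguments(argv):
    # double underscore: private by convention; never re-exported
    ...

def _parse_bnd_ends(record):
    # single underscore: sufficient for most private helpers
    ...

# main_script.py (hypothetical module)
# a wildcard import skips underscore-prefixed names unless __all__ overrides it
from util_script import *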
- It's better practice to use a context manager than to handle the closing of file-like objects directly, e.g.

with pysam.VariantFile(arguments.vcf, 'r') as vcf_in:
    # vcf_in will be closed when the block exits, including on early return, exception, etc.
    do_stuff()
Technically you could do that even with filter_out, using nullcontext, although that may be a bit clunky.
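For reference, a minimal sketch of the nullcontext approach (assuming filter_out is an optional output VCF and that arguments.filter_out is None when not requested; names beyond those in the snippet above are guesses):

from contextlib import nullcontext
import pysam

with pysam.VariantFile(arguments.vcf, 'r') as vcf_in, \
        (pysam.VariantFile(arguments.filter_out, 'w', header=vcf_in.header)
         if arguments.filter_out else nullcontext()) as filter_out:
    # filter_out is a VariantFile if a path was given, otherwise None;
    # either way, cleanup happens when the block exits
    do_stuff(vcf_in, filter_out)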
@@ -2236,6 +2238,7 @@
"name": "ref_panel_1kg",
"outlier_cutoff_table": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v1/module03_outlier_cutoff_table.tsv",
"ped_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/ref-panel/1KG/v1/ped/1kg_ref_panel_v1.ped",
"ploidy_table": "gs://gatk-sv-ref-panel-1kg/outputs/ClusterBatch/mw-concordance/ref_panel_1kg.ploidy.FEMALE_chrY_1.tsv",
Should open a ticket to track the changes needed to repopulate this JSON with create_test_batch.py, including outputting this file from the Batch pipeline or keeping this value along with the new batch's metadata. I think some other recent changes might need to sync up with the create_test_batch.py script too (e.g. handling tarballs of std vcfs vs. the file lists used in the single-sample pipeline), so it would be good to keep track of those.
Done
"SVConcordance.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }}, | ||
"SVConcordance.sv_utils_docker": {{ dockers.sv_utils_docker | tojson }}, | ||
|
||
"SVConcordance.eval_vcf" : {{ test_batch.merged_depth_vcf | tojson }}, |
I thought the comparison would be the raw calls vs. the clean VCF, not the clustered depth VCF? (If that's the case, the clean VCF should be GATK-ready except for dropping CTX/CPX for your tool, so I'm not sure that the call to SvUtilsFixVcf is necessary.)
Good catch, fixed
I'm changing my review to "approve". I made a couple of comments, including one with suggested changes. I'll leave it to your judgement whether to make those changes, though; they're not critical.
new_record = vcf_out.new_record(id=record.id, contig=contig, start=record.start, stop=record.stop, alleles=alleles)
# copy INFO fields
for key in record.info:
    if key not in default_remove_infos and key not in remove_infos:
        new_record.info[key] = record.info[key]
Two last suggestions for this section: 1) combine the default and passed remove_infos into one set object so there's one membership check instead of two, and 2) construct the new record's INFO field as one object so as to avoid multiple API calls to set INFO values. I'll leave it to your judgement whether to make either of these changes (they can be done independently).
Suggested change:

- new_record = vcf_out.new_record(id=record.id, contig=contig, start=record.start, stop=record.stop, alleles=alleles)
- # copy INFO fields
- for key in record.info:
-     if key not in default_remove_infos and key not in remove_infos:
-         new_record.info[key] = record.info[key]
+ remove_infos = remove_infos.union(default_remove_infos)
+ new_record = vcf_out.new_record(id=record.id, contig=contig, start=record.start, stop=record.stop, alleles=alleles,
+                                 info={key: value for key, value in record.info.items() if key not in remove_infos})
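For context, a self-contained sketch of the combined version (file paths and the example INFO keys are hypothetical); the union is hoisted above the per-record loop so the set is built only once:

import pysam

default_remove_infos = {"UNRESOLVED"}   # hypothetical default keys
remove_infos = {"VARGQ"}                # hypothetical caller-supplied keys
remove_infos = frozenset(remove_infos) | default_remove_infos

with pysam.VariantFile("in.vcf.gz") as vcf_in, \
        pysam.VariantFile("out.vcf", "w", header=vcf_in.header) as vcf_out:
    for record in vcf_in:
        new_record = vcf_out.new_record(
            id=record.id, contig=record.contig, start=record.start, stop=record.stop,
            alleles=record.alleles,
            # build INFO with one dict comprehension instead of one call per key
            info={key: value for key, value in record.info.items()
                  if key not in remove_infos})
        vcf_out.write(new_record)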
Made this change
Force-pushed from dc2234e to 913a86b
"gq_recalibrator_docker": "us.gcr.io/broad-dsde-methods/tbrookin/gatk:0a7e1d86f", | ||
"str": "us.gcr.io/broad-dsde-methods/gatk-sv/str:2022-09-15-v0.25.1-beta-b53e58af" |
I suppose this PR is not related to EH; if so, I wonder how build_docker.py ended up incorrectly updating this, while it worked correctly when invoked by the bot.
It had to do with the way I invoked build_docker.py: my master branch was out of date on the machine where I did the build, but my working branch was rebased on the latest master, which contained your recent changes to some str code. I used "master" as the current commit, so it saw the str updates relative to the outdated master and rebuilt the docker.
That explains it, thanks!
Adds two new WDLs, the first intended to be run on ClusterBatch outputs. The second workflow is intended to be run with a "cleaned" eval vcf and a joined raw call truth vcf from the same cohort. The output vcf can then be used for filtering with the TrainGqRecalibrator and RecalibrateGq workflows as the annotations_vcf input, along with the appropriate annotations_to_transfer list (TBD).

Note this depends on a pending PR in gatk, so I've added a temporary concordance gatk docker.

Also includes necessary updates to the vcf formatting script. Currently the concordance tool does not handle CPX/CTX, so those are filtered out. As a result, the concordance-annotated VCF should NOT be carried forward other than as an input to the GQ recalibrator workflows.
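As an illustration only (not the actual formatting script), a minimal pysam sketch of the SVTYPE filtering described above; file names are hypothetical:

import pysam

UNSUPPORTED_SVTYPES = {"CPX", "CTX"}  # not handled by the concordance tool

with pysam.VariantFile("eval.vcf.gz") as vcf_in, \
        pysam.VariantFile("eval.formatted.vcf", "w", header=vcf_in.header) as vcf_out:
    for record in vcf_in:
        if record.info.get("SVTYPE") in UNSUPPORTED_SVTYPES:
            continue  # drop CPX/CTX records before concordance annotation
        vcf_out.write(record)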
Tested and updated both new workflows with inputs generated from corresponding templates.