Genotype concordance workflows #414
Conversation
I have a few questions and I'm requesting a few minor, stylistic changes. Overall it looks pretty good.
I have a couple of comments about format_svtk_vcf_for_gatk.py that are either 1) not really a request for change or 2) just outside the changes you made and therefore not critical for the PR:
- Minor style note: I noticed __parse_bnd_ends and __parse_ploidy_table start with double underscores. I've taken to using double underscores for __parse_arguments because they enforce privacy: if a Python script does "from other_script import *" and that other script happens to have __parse_arguments in it, nothing bad happens, whereas if they both had _parse_arguments, that could get clobbered and produce crazy crashes. I don't think it's necessary to use more than a single underscore for most other private functions, and it's arguably overkill even for __parse_arguments.
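For illustration, a minimal sketch of the naming convention in question (module and function names here are hypothetical):

# util_script.py (hypothetical module)
def __parse_arguments(argv):
    # double underscore: private by convention; never re-exported
    ...

def _parse_bnd_ends(record):
    # single underscore: sufficient for most private helpers
    ...

# main_script.py (hypothetical module)
# a wildcard import skips underscore-prefixed names unless __all__ overrides it
from util_script import *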
- It's better practice to use a context manager than to handle the closing of file-like objects directly, e.g.

with pysam.VariantFile(arguments.vcf, 'r') as vcf_in:
    # vcf_in will be closed when the block exits, including on early return, exception, etc.
    do_stuff()
Technically you could do that even with filter_out, using nullcontext, although that may be a bit clunky.
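For reference, a minimal sketch of the nullcontext approach (assuming filter_out is an optional output VCF and that arguments.filter_out is None when not requested; names beyond those in the snippet above are guesses):

from contextlib import nullcontext
import pysam

with pysam.VariantFile(arguments.vcf, 'r') as vcf_in, \
        (pysam.VariantFile(arguments.filter_out, 'w', header=vcf_in.header)
         if arguments.filter_out else nullcontext()) as filter_out:
    # filter_out is a VariantFile if a path was given, otherwise None;
    # either way, cleanup happens when the block exits
    do_stuff(vcf_in, filter_out)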
@@ -2236,6 +2238,7 @@
"name": "ref_panel_1kg",
"outlier_cutoff_table": "gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v1/module03_outlier_cutoff_table.tsv",
"ped_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/ref-panel/1KG/v1/ped/1kg_ref_panel_v1.ped",
"ploidy_table": "gs://gatk-sv-ref-panel-1kg/outputs/ClusterBatch/mw-concordance/ref_panel_1kg.ploidy.FEMALE_chrY_1.tsv",
Should open a ticket to track the changes needed to repopulate this JSON with create_test_batch.py, including outputting this file from the Batch pipeline or keeping this value along with the new batch's metadata. I think some other recent changes might need to sync up with the create_test_batch.py script too (e.g. handling tarballs of std vcfs vs. the file lists used in the single-sample pipeline), so it would be good to keep track of those.
Done
"SVConcordance.sv_pipeline_docker": {{ dockers.sv_pipeline_docker | tojson }}, | ||
"SVConcordance.sv_utils_docker": {{ dockers.sv_utils_docker | tojson }}, | ||
|
||
"SVConcordance.eval_vcf" : {{ test_batch.merged_depth_vcf | tojson }}, |
I thought the comparison would be the raw calls vs. the clean VCF, not the clustered depth VCF? (If that's the case, the clean VCF should be GATK-ready except for dropping CTX/CPX for your tool, so I'm not sure that the call to SvUtilsFixVcf is necessary.)
Good catch, fixed
I'm changing my review to "approve". I made a couple of comments, including one with suggested changes. I'll leave it to your judgement whether to make those changes, though; they're not critical.
new_record = vcf_out.new_record(id=record.id, contig=contig, start=record.start, stop=record.stop, alleles=alleles)
# copy INFO fields
for key in record.info:
    if key not in default_remove_infos and key not in remove_infos:
        new_record.info[key] = record.info[key]
Two last suggestions for this section: 1) combine the default and passed remove_infos into one set object so there's one membership check instead of two, and 2) construct the new record's INFO field as one object so as to avoid multiple API calls to set INFO values. I'll leave it to your judgement whether to make either of these changes (they can be done independently).
Suggested change:

- new_record = vcf_out.new_record(id=record.id, contig=contig, start=record.start, stop=record.stop, alleles=alleles)
- # copy INFO fields
- for key in record.info:
-     if key not in default_remove_infos and key not in remove_infos:
-         new_record.info[key] = record.info[key]
+ remove_infos = remove_infos.union(default_remove_infos)
+ new_record = vcf_out.new_record(id=record.id, contig=contig, start=record.start, stop=record.stop, alleles=alleles,
+                                 info={key: value for key, value in record.info.items() if key not in remove_infos})
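For context, a self-contained sketch of the combined version (file paths and the example INFO keys are hypothetical); the union is hoisted above the per-record loop so the set is built only once:

import pysam

default_remove_infos = {"UNRESOLVED"}   # hypothetical default keys
remove_infos = {"VARGQ"}                # hypothetical caller-supplied keys
remove_infos = frozenset(remove_infos) | default_remove_infos

with pysam.VariantFile("in.vcf.gz") as vcf_in, \
        pysam.VariantFile("out.vcf", "w", header=vcf_in.header) as vcf_out:
    for record in vcf_in:
        new_record = vcf_out.new_record(
            id=record.id, contig=record.contig, start=record.start, stop=record.stop,
            alleles=record.alleles,
            # build INFO with one dict comprehension instead of one call per key
            info={key: value for key, value in record.info.items()
                  if key not in remove_infos})
        vcf_out.write(new_record)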
Made this change
Force-pushed from dc2234e to 913a86b
"gq_recalibrator_docker": "us.gcr.io/broad-dsde-methods/tbrookin/gatk:0a7e1d86f", | ||
"str": "us.gcr.io/broad-dsde-methods/gatk-sv/str:2022-09-15-v0.25.1-beta-b53e58af" |
I suppose this PR is not related to EH; if so, I wonder how build_docker.py ended up incorrectly updating this, while it worked correctly when invoked by the bot.
It had to do with the way I invoked build_docker.py: my master branch was out of date on the machine where I did the build, but my working branch was rebased on the latest master, which contained your recent changes to some str code. I used "master" as the current commit, so it saw the str updates relative to the outdated master and rebuilt the docker.
That explains it, thanks!
Adds two new WDLs, the first intended to be run on ClusterBatch outputs. The second workflow is intended to be run with a "cleaned" eval vcf and a joined raw call truth vcf from the same cohort. The output vcf can then be used for filtering with the TrainGqRecalibrator and RecalibrateGq workflows as the annotations_vcf input, along with the appropriate annotations_to_transfer list (TBD).

Note this depends on a pending PR in gatk, so I've added a temporary concordance gatk docker.

Also includes necessary updates to the vcf formatting script. Currently the concordance tool does not handle CPX/CTX, so those are filtered out. As a result, the concordance-annotated VCF should NOT be carried forward other than as an input to the GQ recalibrator workflows.
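As an illustration only (not the actual formatting script), a minimal pysam sketch of the SVTYPE filtering described above; file names are hypothetical:

import pysam

UNSUPPORTED_SVTYPES = {"CPX", "CTX"}  # not handled by the concordance tool

with pysam.VariantFile("eval.vcf.gz") as vcf_in, \
        pysam.VariantFile("eval.formatted.vcf", "w", header=vcf_in.header) as vcf_out:
    for record in vcf_in:
        if record.info.get("SVTYPE") in UNSUPPORTED_SVTYPES:
            continue  # drop CPX/CTX records before concordance annotation
        vcf_out.write(record)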
Tested and updated both new workflows with inputs generated from corresponding templates.