Integer-conformant AF values #661

MattWellie · 2024-04-02T03:38:21Z

Docker image this failure was identified in: australia-southeast1-docker.pkg.dev/cpg-common/images/sv/sv-pipeline:2023-09-13-v0.28.3-beta-af8362e3

Relevant Workflow: ShardedAnnotateVcf.ComputeAFs

gatk-sv/wdl/ShardedAnnotateVcf.wdl

Line 146 in d37e453

task ComputeAFs {

I've recently seen deterministic failures during AnnotateVCF running on CNV data. I've tracked the error to this line -

if hemi:
    AN = AN / 2

if Hemi (chrX & MALE & non-par), AN = AN/2. If the AN was an odd number, the result is a float. This value is added to the Variant entity as MALE_AN, which is described in the header as an INTEGER. PySam doesn't support adding a float value to an Integer field:

Traceback (most recent call last):
  File "/data/script.py", line 506, in <module>
    main()
  File "/data/script.py", line 498, in main
    newrec = gather_allele_freqs(r, samples_list, males_set, females_set, parbt, pop_dict,
  File "/data/script.py", line 105, in gather_allele_freqs
    calc_allele_freq(record, males_set, prefix='MALE', hemi=True)
  File "/data/script.py", line 195, in calc_allele_freq
    record.info[(prefix + '_' if prefix else '') + 'AN'] = AN
  File "pysam/libcbcf.pyx", line 2586, in pysam.libcbcf.VariantRecordInfo.__setitem__
  File "pysam/libcbcf.pyx", line 687, in pysam.libcbcf.bcf_info_set_value
  File "pysam/libcbcf.pyx", line 592, in pysam.libcbcf.bcf_check_values
TypeError: invalid value for Integer format

There is an additional later call implicated here, adj_an = m_an + f_an, where if the male AN is a float, the overall AN field would become a float, though this single change also addresses that, as MALE_AN is fixed as an Integer.

I'm not sure if I'm running into this now due to some weird data, previous data run through the same pipeline has never hit this issue, but based on the data I have this is a possible point of failure and I'm surprised nobody has hit it yet.

mwalker174 · 2024-04-04T00:22:41Z

Hey @MattWellie thanks for reporting this. I do suspect this has something to do with how the input data is formatted here, specifically the genotype (GT) fields. In gatk-sv, genotypes are always diploid even on sex chromosomes (e.g. 0/1 for hemizygous genotypes and ./. for females on Y). Admittedly this is not correct, but for historical reasons we still maintain this convention and the annotation assumes as much on the input.

I suspect that the data you're using is using the correct convention where GT reflects the ploidy (i.e. 0 or 1 for haploid genotypes). Can you check the VCF for records that might use that convention? I'm not sure how they would get in there unless you did something a little off the rails.

MattWellie · 2024-04-04T00:45:11Z

Hi Mark, thanks for replying. I was missing the forest for the trees a bit here: The data we're having problems with is generated from GATK-gCNV, not from this pipeline. As we've implemented GATK-SV for genomes already, we wanted to share a common annotation endpoint for our exome and genome pipelines. That is the reason we're now hitting this AD counting issue when it was previously A-Ok.

If it doesn't cause you any issues we'd love for this to be patched at source, otherwise we can implement a buffer stage in our pipeline to translate all these haplo genotypes into the GATK'y representation before running through annotation

mwalker174 · 2024-04-04T01:03:05Z

IMO, if you are mixing GATK-SV and GATK-gCNV calls it would be better to harmonize everything with one genotype convention, so a formatting script for gCNV->GATK-SV sex chromosome genotype conversion might be the best way to go. Unfortunately I don't think we have any scripts available for this purpose but it shouldn't be too hard to implement.

I'd definitely be open to fixing at the source if you want to create a PR for that, but I don't think we can dedicate any effort to that right now.

MattWellie · 2024-04-04T22:19:51Z

I think this PR itself is a minim fix. The round shouldn't affect your GATK sv genotypes as they're already ints. I've test run this script locally to show that it fixes my CNV case, I'll check it produces identical gatksv output

mwalker174 · 2024-04-05T21:32:02Z

Ah I didn't even realize this was a PR! I think the fix is a bit more complicated, however. I think you would need to know a priori which convention is being used for GT and adjust the calculation for AN accordingly. You could modify the script to take in a flag that switches the behavior.

mwalker174 · 2024-05-08T14:46:09Z

@MattWellie did you want to follow up on this or should we close this out? I think a proper fix will be a bit involved so if you have a workaround it would probably be easier for you to do that instead.

MattWellie · 2024-05-10T05:06:54Z

Hi @mwalker174,

Just to clarify from before we have two separate pipelines which use this GATK-SV annotation stage, but the results are never intermingled. The gCNV/exome usage is running into this problem, the GATK-SV/genomes are fine. I don't think it's important, but just wanted to clear that up.

I've overcome this issue by taking your sv-pipeline docker image and patching this script with a working copy that rounds the AN value before assigning it to the pysam AN. That's solved this problem for our purposes, so if this is a patch you don't need I'm happy to close this.

mwalker174 · 2024-05-10T15:15:03Z

Ok I think it makes sense to close this out but keep #663 open.

Integer-conformant AF values

21e7df3

MattWellie mentioned this pull request Apr 3, 2024

AD values in Annotation aren't Integers #663

Open

MattWellie mentioned this pull request Apr 8, 2024

Workaround for GATK-SV annotation bug populationgenomics/images#137

Merged

MattWellie closed this May 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integer-conformant AF values #661

Integer-conformant AF values #661

MattWellie commented Apr 2, 2024

mwalker174 commented Apr 4, 2024

MattWellie commented Apr 4, 2024

mwalker174 commented Apr 4, 2024

MattWellie commented Apr 4, 2024

mwalker174 commented Apr 5, 2024

mwalker174 commented May 8, 2024

MattWellie commented May 10, 2024

mwalker174 commented May 10, 2024

Integer-conformant AF values #661

Integer-conformant AF values #661

Conversation

MattWellie commented Apr 2, 2024

mwalker174 commented Apr 4, 2024

MattWellie commented Apr 4, 2024

mwalker174 commented Apr 4, 2024

MattWellie commented Apr 4, 2024

mwalker174 commented Apr 5, 2024

mwalker174 commented May 8, 2024

MattWellie commented May 10, 2024

mwalker174 commented May 10, 2024