Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integer-conformant AF values #661

Closed
wants to merge 1 commit into from

Conversation

MattWellie
Copy link
Contributor

Docker image this failure was identified in: australia-southeast1-docker.pkg.dev/cpg-common/images/sv/sv-pipeline:2023-09-13-v0.28.3-beta-af8362e3

Relevant Workflow: ShardedAnnotateVcf.ComputeAFs

task ComputeAFs {

I've recently seen deterministic failures during AnnotateVCF running on CNV data. I've tracked the error to this line -

if hemi:
    AN = AN / 2

if Hemi (chrX & MALE & non-par), AN = AN/2. If the AN was an odd number, the result is a float. This value is added to the Variant entity as MALE_AN, which is described in the header as an INTEGER. PySam doesn't support adding a float value to an Integer field:

Traceback (most recent call last):
  File "/data/script.py", line 506, in <module>
    main()
  File "/data/script.py", line 498, in main
    newrec = gather_allele_freqs(r, samples_list, males_set, females_set, parbt, pop_dict,
  File "/data/script.py", line 105, in gather_allele_freqs
    calc_allele_freq(record, males_set, prefix='MALE', hemi=True)
  File "/data/script.py", line 195, in calc_allele_freq
    record.info[(prefix + '_' if prefix else '') + 'AN'] = AN
  File "pysam/libcbcf.pyx", line 2586, in pysam.libcbcf.VariantRecordInfo.__setitem__
  File "pysam/libcbcf.pyx", line 687, in pysam.libcbcf.bcf_info_set_value
  File "pysam/libcbcf.pyx", line 592, in pysam.libcbcf.bcf_check_values
TypeError: invalid value for Integer format

There is an additional later call implicated here, adj_an = m_an + f_an, where if the male AN is a float, the overall AN field would become a float, though this single change also addresses that, as MALE_AN is fixed as an Integer.

I'm not sure if I'm running into this now due to some weird data, previous data run through the same pipeline has never hit this issue, but based on the data I have this is a possible point of failure and I'm surprised nobody has hit it yet.

@mwalker174
Copy link
Collaborator

Hey @MattWellie thanks for reporting this. I do suspect this has something to do with how the input data is formatted here, specifically the genotype (GT) fields. In gatk-sv, genotypes are always diploid even on sex chromosomes (e.g. 0/1 for hemizygous genotypes and ./. for females on Y). Admittedly this is not correct, but for historical reasons we still maintain this convention and the annotation assumes as much on the input.

I suspect that the data you're using is using the correct convention where GT reflects the ploidy (i.e. 0 or 1 for haploid genotypes). Can you check the VCF for records that might use that convention? I'm not sure how they would get in there unless you did something a little off the rails.

@MattWellie
Copy link
Contributor Author

Hi Mark, thanks for replying. I was missing the forest for the trees a bit here: The data we're having problems with is generated from GATK-gCNV, not from this pipeline. As we've implemented GATK-SV for genomes already, we wanted to share a common annotation endpoint for our exome and genome pipelines. That is the reason we're now hitting this AD counting issue when it was previously A-Ok.

If it doesn't cause you any issues we'd love for this to be patched at source, otherwise we can implement a buffer stage in our pipeline to translate all these haplo genotypes into the GATK'y representation before running through annotation

@mwalker174
Copy link
Collaborator

IMO, if you are mixing GATK-SV and GATK-gCNV calls it would be better to harmonize everything with one genotype convention, so a formatting script for gCNV->GATK-SV sex chromosome genotype conversion might be the best way to go. Unfortunately I don't think we have any scripts available for this purpose but it shouldn't be too hard to implement.

I'd definitely be open to fixing at the source if you want to create a PR for that, but I don't think we can dedicate any effort to that right now.

@MattWellie
Copy link
Contributor Author

I think this PR itself is a minim fix. The round shouldn't affect your GATK sv genotypes as they're already ints. I've test run this script locally to show that it fixes my CNV case, I'll check it produces identical gatksv output

@mwalker174
Copy link
Collaborator

Ah I didn't even realize this was a PR! I think the fix is a bit more complicated, however. I think you would need to know a priori which convention is being used for GT and adjust the calculation for AN accordingly. You could modify the script to take in a flag that switches the behavior.

@mwalker174
Copy link
Collaborator

@MattWellie did you want to follow up on this or should we close this out? I think a proper fix will be a bit involved so if you have a workaround it would probably be easier for you to do that instead.

@MattWellie
Copy link
Contributor Author

Hi @mwalker174,

Just to clarify from before we have two separate pipelines which use this GATK-SV annotation stage, but the results are never intermingled. The gCNV/exome usage is running into this problem, the GATK-SV/genomes are fine. I don't think it's important, but just wanted to clear that up.

I've overcome this issue by taking your sv-pipeline docker image and patching this script with a working copy that rounds the AN value before assigning it to the pysam AN. That's solved this problem for our purposes, so if this is a patch you don't need I'm happy to close this.

@mwalker174
Copy link
Collaborator

Ok I think it makes sense to close this out but keep #663 open.

@MattWellie MattWellie closed this May 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants