-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integer-conformant AF values #661
Conversation
Hey @MattWellie thanks for reporting this. I do suspect this has something to do with how the input data is formatted here, specifically the genotype ( I suspect that the data you're using is using the correct convention where |
Hi Mark, thanks for replying. I was missing the forest for the trees a bit here: The data we're having problems with is generated from GATK-gCNV, not from this pipeline. As we've implemented GATK-SV for genomes already, we wanted to share a common annotation endpoint for our exome and genome pipelines. That is the reason we're now hitting this AD counting issue when it was previously A-Ok. If it doesn't cause you any issues we'd love for this to be patched at source, otherwise we can implement a buffer stage in our pipeline to translate all these haplo genotypes into the GATK'y representation before running through annotation |
IMO, if you are mixing GATK-SV and GATK-gCNV calls it would be better to harmonize everything with one genotype convention, so a formatting script for gCNV->GATK-SV sex chromosome genotype conversion might be the best way to go. Unfortunately I don't think we have any scripts available for this purpose but it shouldn't be too hard to implement. I'd definitely be open to fixing at the source if you want to create a PR for that, but I don't think we can dedicate any effort to that right now. |
I think this PR itself is a minim fix. The round shouldn't affect your GATK sv genotypes as they're already ints. I've test run this script locally to show that it fixes my CNV case, I'll check it produces identical gatksv output |
Ah I didn't even realize this was a PR! I think the fix is a bit more complicated, however. I think you would need to know a priori which convention is being used for GT and adjust the calculation for AN accordingly. You could modify the script to take in a flag that switches the behavior. |
@MattWellie did you want to follow up on this or should we close this out? I think a proper fix will be a bit involved so if you have a workaround it would probably be easier for you to do that instead. |
Hi @mwalker174, Just to clarify from before we have two separate pipelines which use this GATK-SV annotation stage, but the results are never intermingled. The gCNV/exome usage is running into this problem, the GATK-SV/genomes are fine. I don't think it's important, but just wanted to clear that up. I've overcome this issue by taking your sv-pipeline docker image and patching this script with a working copy that rounds the AN value before assigning it to the pysam |
Ok I think it makes sense to close this out but keep #663 open. |
Docker image this failure was identified in: australia-southeast1-docker.pkg.dev/cpg-common/images/sv/sv-pipeline:2023-09-13-v0.28.3-beta-af8362e3
Relevant Workflow: ShardedAnnotateVcf.ComputeAFs
gatk-sv/wdl/ShardedAnnotateVcf.wdl
Line 146 in d37e453
I've recently seen deterministic failures during AnnotateVCF running on CNV data. I've tracked the error to this line -
if Hemi (chrX & MALE & non-par), AN = AN/2. If the AN was an odd number, the result is a float. This value is added to the Variant entity as
MALE_AN
, which is described in the header as an INTEGER. PySam doesn't support adding a float value to an Integer field:There is an additional later call implicated here,
adj_an = m_an + f_an
, where if the male AN is a float, the overallAN
field would become a float, though this single change also addresses that, asMALE_AN
is fixed as an Integer.I'm not sure if I'm running into this now due to some weird data, previous data run through the same pipeline has never hit this issue, but based on the data I have this is a possible point of failure and I'm surprised nobody has hit it yet.