-
Notifications
You must be signed in to change notification settings - Fork 355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using DP to extract somatic AF in strelka2.py might be wrong #2836
Comments
Andrej; |
This totally makes sense. Much appriciated for pointing at this. I can create a pull request changing the calculation accordingly. |
On second thought. But what about multiallelics? If the REF is G, but the genotype is A/T, then we should sum up for all possible bases in the denominator. So it would be rather:
And perhaps also need to account for possible UU for RNA sequences. And for the possibility of other ambiguity symbols from the IUPAC notation. Unless they go as undefined reads. But in this case it actually makes sense to just count undefined reads in. As long as they are filtered by quality (and according to the docs, DP includes filtered reads). So the most natural way is just to divide by DP, isn't it? Or perhaps the approaches should be different for germline and for somatic calling? |
Is strelka2 reporting multiallelics? I don't see any in my datasets (but get thousands from strelka1). Perhpas @sangtaekim can help us here :) Using DP for SNPs might not be a practical problem in most cases I guess. |
Thanks-- @vladsaveliev what do you think the resolution of this should be? |
Hello,
Correct me if I am wrong, but strelka2.py uses the following formula to calculate allele fractions (AF) from srelka2 somatic VCFs:
snps:
AF = {ALT}U[0] / DP
indels:
AF = TIR[0] / DP
However, according to the strelka2 manual, the correct formula should be
snps:
AF = {ALT}U[0] / ({ALT}U[0] + {REF}U[0])
indels:
AF = TIR[0] / (TIR[0] + TAR[0])
.The problem with DP is that it includes also uninformative reads (reads that do not provide enough statistical evidence to support one allele over another).
From the GATK forum:
AD and DP : Allele depth and depth of coverage.
These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
DP is the filtered depth, at the sample level. This gives you the number of filtered reads that support each of the reported alleles. You can check the variant caller’s documentation to see which filters are applied by default. Only reads that passed the variant caller’s filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.
Best regards,
Andrej
The text was updated successfully, but these errors were encountered: