Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using DP to extract somatic AF in strelka2.py might be wrong #2836

Closed
abenjak opened this issue May 24, 2019 · 5 comments
Closed

Using DP to extract somatic AF in strelka2.py might be wrong #2836

abenjak opened this issue May 24, 2019 · 5 comments
Labels

Comments

@abenjak
Copy link

abenjak commented May 24, 2019

Hello,

Correct me if I am wrong, but strelka2.py uses the following formula to calculate allele fractions (AF) from srelka2 somatic VCFs:
snps: AF = {ALT}U[0] / DP
indels: AF = TIR[0] / DP

However, according to the strelka2 manual, the correct formula should be
snps: AF = {ALT}U[0] / ({ALT}U[0] + {REF}U[0])
indels: AF = TIR[0] / (TIR[0] + TAR[0]).

The problem with DP is that it includes also uninformative reads (reads that do not provide enough statistical evidence to support one allele over another).

From the GATK forum:
AD and DP : Allele depth and depth of coverage.
These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
DP is the filtered depth, at the sample level. This gives you the number of filtered reads that support each of the reported alleles. You can check the variant caller’s documentation to see which filters are applied by default. Only reads that passed the variant caller’s filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.

Best regards,
Andrej

@chapmanb
Copy link
Member

Andrej;
Thanks much for starting this discussion and looking into the calculations so deeply. Your suggestions make sense, but I'd love to get @vladsaveliev to comment as he's the original author of this code and want to understand his intentions and thoughts. Thanks again.

@vladsavelyev
Copy link
Contributor

This totally makes sense. Much appriciated for pointing at this. I can create a pull request changing the calculation accordingly.

@vladsavelyev
Copy link
Contributor

On second thought. But what about multiallelics? If the REF is G, but the genotype is A/T, then we should sum up for all possible bases in the denominator. So it would be rather:

AF = {ALT}U[0] / (AU[0] + CU[0] + GU[0] + TU[0])

And perhaps also need to account for possible UU for RNA sequences. And for the possibility of other ambiguity symbols from the IUPAC notation. Unless they go as undefined reads. But in this case it actually makes sense to just count undefined reads in. As long as they are filtered by quality (and according to the docs, DP includes filtered reads). So the most natural way is just to divide by DP, isn't it?

Or perhaps the approaches should be different for germline and for somatic calling?

@abenjak
Copy link
Author

abenjak commented Jun 4, 2019

Is strelka2 reporting multiallelics? I don't see any in my datasets (but get thousands from strelka1). Perhpas @sangtaekim can help us here :)

Using DP for SNPs might not be a practical problem in most cases I guess.
It's a bigger issue for indels, where often an important number of reads (TOR) does not support neither the reference nor the indel, but these reads make up for the DP.

@roryk
Copy link
Collaborator

roryk commented Aug 10, 2019

Thanks-- @vladsaveliev what do you think the resolution of this should be?

@roryk roryk added the bug label Aug 10, 2019
@naumenko-sa naumenko-sa mentioned this issue May 29, 2020
90 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants