Using DP to extract somatic AF in strelka2.py might be wrong #2836

abenjak · 2019-05-24T17:06:17Z

Hello,

Correct me if I am wrong, but strelka2.py uses the following formula to calculate allele fractions (AF) from srelka2 somatic VCFs:
snps: AF = {ALT}U[0] / DP
indels: AF = TIR[0] / DP

However, according to the strelka2 manual, the correct formula should be
snps: AF = {ALT}U[0] / ({ALT}U[0] + {REF}U[0])
indels: AF = TIR[0] / (TIR[0] + TAR[0]).

The problem with DP is that it includes also uninformative reads (reads that do not provide enough statistical evidence to support one allele over another).

From the GATK forum:
AD and DP : Allele depth and depth of coverage.
These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
DP is the filtered depth, at the sample level. This gives you the number of filtered reads that support each of the reported alleles. You can check the variant caller’s documentation to see which filters are applied by default. Only reads that passed the variant caller’s filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.

Best regards,
Andrej

The text was updated successfully, but these errors were encountered:

chapmanb · 2019-05-25T00:12:58Z

Andrej;
Thanks much for starting this discussion and looking into the calculations so deeply. Your suggestions make sense, but I'd love to get @vladsaveliev to comment as he's the original author of this code and want to understand his intentions and thoughts. Thanks again.

vladsavelyev · 2019-05-25T00:21:59Z

This totally makes sense. Much appriciated for pointing at this. I can create a pull request changing the calculation accordingly.

vladsavelyev · 2019-05-25T07:49:09Z

On second thought. But what about multiallelics? If the REF is G, but the genotype is A/T, then we should sum up for all possible bases in the denominator. So it would be rather:

AF = {ALT}U[0] / (AU[0] + CU[0] + GU[0] + TU[0])

And perhaps also need to account for possible UU for RNA sequences. And for the possibility of other ambiguity symbols from the IUPAC notation. Unless they go as undefined reads. But in this case it actually makes sense to just count undefined reads in. As long as they are filtered by quality (and according to the docs, DP includes filtered reads). So the most natural way is just to divide by DP, isn't it?

Or perhaps the approaches should be different for germline and for somatic calling?

abenjak · 2019-06-04T12:23:12Z

Is strelka2 reporting multiallelics? I don't see any in my datasets (but get thousands from strelka1). Perhpas @sangtaekim can help us here :)

Using DP for SNPs might not be a practical problem in most cases I guess.
It's a bigger issue for indels, where often an important number of reads (TOR) does not support neither the reference nor the indel, but these reads make up for the DP.

roryk · 2019-08-10T18:22:04Z

Thanks-- @vladsaveliev what do you think the resolution of this should be?

roryk added the bug label Aug 10, 2019

naumenko-sa mentioned this issue May 29, 2020

bcbio priorities #3242

Open

90 tasks

naumenko-sa closed this as completed May 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using DP to extract somatic AF in strelka2.py might be wrong #2836

Using DP to extract somatic AF in strelka2.py might be wrong #2836

abenjak commented May 24, 2019

chapmanb commented May 25, 2019

vladsavelyev commented May 25, 2019

vladsavelyev commented May 25, 2019

abenjak commented Jun 4, 2019

roryk commented Aug 10, 2019

Using DP to extract somatic AF in strelka2.py might be wrong #2836

Using DP to extract somatic AF in strelka2.py might be wrong #2836

Comments

abenjak commented May 24, 2019

chapmanb commented May 25, 2019

vladsavelyev commented May 25, 2019

vladsavelyev commented May 25, 2019

abenjak commented Jun 4, 2019

roryk commented Aug 10, 2019