Allele frequency cant be flipped for multi-allelic variants. #165

Closed
kousathanas opened this issue Aug 18, 2023 · 4 comments

@kousathanas

Hi,
when running MungeSumstats v1.9.6, I got the following error:

Error in check_allele_flip(sumstats_dt = sumstats_return$sumstats_dt, : Certain SNPs need to be flipped along with their effect columns and frequency column. However to flip the FRQ column, only bi-allelic SNPs can be considered. It is recommended to set bi_allelic_filter to TRUE so non-bi-allelic SNPs are removed. Otherwise, set allele_flip_frq to FALSE to not flip the FRQ column but note this could lead to incorrect FRQ values.

With ever-increasing sample sizes, the majority of positions in the genome will have (rare) multi-allelic variants. It is theoretically possible that, with large enough sample sizes, all possible mutations for every single position in the genome will be detected and added to dbSNP. Thus, only keeping bi-allelics (as defined through dbSNP) is not really a viable option: more than half of any dataset (eventually the entirety) will be eliminated by such a filter.

In this context, I would like to know how the above error can be sensibly bypassed while flipping columns. Is the solution to do this procedure manually?

I would be glad for any info or feedback on the above issue/error.

best,
Thanos

@Al-Murphy
Owner

Hey!

> With ever-increasing sample sizes, the majority of positions in the genome will have (rare) multi-allelic variants. It is theoretically possible that, with large enough sample sizes, all possible mutations for every single position in the genome will be detected and added to dbSNP. Thus, only keeping bi-allelics (as defined through dbSNP) is not really a viable option: more than half of any dataset (eventually the entirety) will be eliminated by such a filter.

I completely agree. This is something we are actively investigating in the lab, specifically the number of non-bi-allelic SNPs across different dbSNP builds. For example, see this issue. I do believe we will be heading towards keeping non-bi-allelic SNPs as the default, but this requires checking what effect it will have, for example on commonly used downstream analysis tools, which currently mostly expect bi-allelic SNPs only.

> In this context, I would like to know how the above error can be sensibly bypassed while flipping columns. Is the solution to do this procedure manually?

Flipping the frequency of non-bi-allelic SNPs is a hard problem, since we would need to know the frequency of the other alternative allele in the same population that the study was conducted in. We could ask the user to specify a population, say European, and use reference databases for it, but this would not be accurate for the specific study, and these databases currently don't exist in this form in R. Consider dbSNP, which has started to capture this information: where a second alternative allele is found in an African population but has never been seen in Europeans, the SNP is added to dbSNP with a frequency value for Africa only. This could be used to flip frequencies more accurately, but it is not perfect and would require the R versions of dbSNP releases to hold frequency data, which they currently don't.
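
To make the difficulty concrete, here is a small arithmetic sketch in Python. The allele frequencies are made-up numbers for a hypothetical tri-allelic site, not data from any study; the point is only that 1 - p overshoots by exactly the combined frequency of the remaining alleles.

```python
# Hypothetical allele frequencies at one tri-allelic site (made-up numbers).
freqs = {"A": 0.70, "G": 0.25, "T": 0.05}


def flip_biallelic(p):
    """Bi-allelic flip: assume the other allele carries all remaining frequency."""
    return 1.0 - p


# The summary statistics report FRQ for G, and the flip makes A the new allele.
reported_frq = freqs["G"]
naive_flipped = flip_biallelic(reported_frq)  # 1 - 0.25
true_flipped = freqs["A"]                     # what FRQ should be after the flip

# The naive flip is off by the frequency of the third allele (T).
error = naive_flipped - true_flipped
print(f"naive={naive_flipped:.2f} true={true_flipped:.2f} error={error:.2f}")
```

When the third allele is very rare the error is small, but it cannot be corrected without knowing that allele's frequency in the study population.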

My advice for dealing with this sensibly is to set allele_flip_frq = FALSE and then not use the FRQ data, since these values will not have been flipped (all other effect columns will have been flipped for the necessary SNPs). Alternatively, set allele_flip_frq = FALSE and imputation_ind = TRUE; then, for the SNPs that have been flipped, manually flip the frequency of the bi-allelic ones and set the non-bi-allelic SNPs to a sensible value, perhaps NA. Again, the second approach is only necessary if you need the frequency column for downstream analysis.
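
The second approach could be sketched as follows, in Python rather than R for illustration. The row layout and the `flipped` and `bi_allelic` field names are assumptions made for this example; they are not the column names MungeSumstats writes, so they would need to be mapped onto the real output.

```python
def fix_frq(rows):
    """Return copies of rows with FRQ corrected for flipped SNPs.

    Each row is a dict with hypothetical keys:
      FRQ        - reported frequency (float or None)
      flipped    - True if the alleles and effect columns were flipped
      bi_allelic - True if the SNP has exactly two alleles
    """
    fixed = []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        if row["flipped"]:
            if row["bi_allelic"] and row["FRQ"] is not None:
                row["FRQ"] = 1.0 - row["FRQ"]  # safe: only two alleles
            else:
                row["FRQ"] = None  # unknown after the flip; stands in for NA
        fixed.append(row)
    return fixed


rows = [
    {"SNP": "rs_example1", "FRQ": 0.25, "flipped": True, "bi_allelic": True},
    {"SNP": "rs_example2", "FRQ": 0.10, "flipped": True, "bi_allelic": False},
    {"SNP": "rs_example3", "FRQ": 0.40, "flipped": False, "bi_allelic": False},
]
result = fix_frq(rows)
```

Unflipped rows keep their reported frequency regardless of how many alleles the site has, because nothing about them changed.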

I'm open to suggestions if you think MSS could be modified to better deal with your issue, just let me know.

Alan.

@kousathanas
Author

kousathanas commented Aug 18, 2023

Hi @Al-Murphy

Thank you for the prompt reply. I agree that the problem of flipping allele frequencies for multi-allelics is not trivial.

Three comments:

  • The vast majority of multi-allelics are very rare (the third and fourth alleles).
  • MungeSumstats determines "multi-allelicness" from dbSNP, not from the examined sample. This inflates the number of multi-allelics.
  • Pragmatically speaking, multi-allelics are not treated properly by most GWAS analyses and they are analysed in a decomposed way as bi-allelics anyway.

In this context, I believe that an easy solution for MungeSumstats would be to add an option, flip_frq_as_biallelic, which flips allele frequencies as if the variant were bi-allelic, i.e., 1 - p. This could be set to FALSE by default. The user can then choose to QC out variants whose allele frequencies are inconsistent with a reference population of their choice (e.g., gnomAD), which would remove the errors introduced in this way. It's not a perfect solution, but you could issue a warning when it is activated. As the vast majority of multi-allelics are very rare, this would save a large fraction of variants that are effectively bi-allelic.
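
A minimal Python sketch of this idea, for illustration only: flip as 1 - p, then keep a variant only if the flipped frequency is close to a reference frequency. The `ref_frq` values, the `max_diff` threshold of 0.2, and the function names are all hypothetical choices for this example, not part of MungeSumstats or gnomAD.

```python
def flip_as_biallelic(p):
    """Flip a frequency as if the variant were bi-allelic."""
    return 1.0 - p


def passes_frequency_qc(frq, ref_frq, max_diff=0.2):
    """Keep a variant only if its frequency is within max_diff of the reference.

    Variants with no reference frequency are dropped here; that is a
    conservative assumption of this sketch, not a recommendation.
    """
    if ref_frq is None:
        return False
    return abs(frq - ref_frq) <= max_diff


# Made-up variants: FRQ is the unflipped value, ref_frq is a hypothetical
# reference-population frequency for the allele that FRQ refers to after flipping.
variants = [
    {"SNP": "rs_example1", "FRQ": 0.30, "ref_frq": 0.68},  # flipped 0.70, close
    {"SNP": "rs_example2", "FRQ": 0.30, "ref_frq": 0.35},  # flipped 0.70, far off
    {"SNP": "rs_example3", "FRQ": 0.30, "ref_frq": None},  # no reference value
]

kept = [
    v["SNP"]
    for v in variants
    if passes_frequency_qc(flip_as_biallelic(v["FRQ"]), v["ref_frq"])
]
```

The threshold trades off how many effectively bi-allelic variants are rescued against how much frequency error is tolerated.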

best,
Thanos

@Al-Murphy
Owner

Hey,

> In this context, I believe that an easy solution for MungeSumstats would be to add an option, flip_frq_as_biallelic, which flips allele frequencies as if the variant were bi-allelic, i.e., 1 - p. This could be set to FALSE by default. The user can then choose to QC out variants whose allele frequencies are inconsistent with a reference population of their choice (e.g., gnomAD), which would remove the errors introduced in this way. It's not a perfect solution, but you could issue a warning when it is activated. As the vast majority of multi-allelics are very rare, this would save a large fraction of variants that are effectively bi-allelic.

Yes, I agree with this approach as long as it isn't set as the default. I have updated MSS to incorporate this (v1.9.16). Could you test it and let me know if it works as intended for you?

Cheers,
Alan.

@Al-Murphy
Owner

Closing because of inactivity. Reopen if the issue isn't resolved for you.

Alan.
