WGSA annotation for noncoding variants in WGS studies #2587

naumenko-sa · 2018-12-06T17:36:07Z

Hello, bcbio community!

Thanks for the great framework!

Gnomad_genome frequencies help to prioritize variants in WGS studies.
However, it would be nice to have functional prediction and conservation scores for noncoding variants.

For now, many scores come from dbNSFP, but by definition this database covers only nonsynonymous (i.e. coding and not synonymous) variants and splice-site variants (it contains 83,189,732 records).
https://sites.google.com/site/jpopgen/dbNSFP

For noncoding variants, the same group proposes using WGSA:
https://sites.google.com/site/jpopgen/wgsa
"For SNV-centric resources, WGSA integrated 12 sets of functional prediction scores (CADD, FATHMM-MKL, FATHMM-XF, Funseq, Funseq2, RegulomeDB, DANN, fitCons x 4, GenoCanyon, Eigen & Eigen-PC, GenoSkyline-Plus x 127, LINSIGHT), 9 conservation scores (bStatistic, GERP++, PhyloP x 3, phastCons x 3, SiPhy), allele frequencies from 5 large-scale re-sequencing studies (1000G, ESP6500, ExAC, UK10K, gnomAD), variants in 4 disease related databases (ClinVar, COSMIC, GWAS_catalog, GRASP2), among others (see list of resources)."

Are there any plans to introduce WGSA to bcbio? The dataset is huge (1.4T, which is 2-3 times larger than most bcbio installations with human/mouse genomes), so a local installation of WGSA is probably not an option. But what about accessing it through Amazon Web Services? Does that look feasible (https://sites.google.com/site/jpopgen/wgsa/using-wgsa-via-aws)?

Thanks!
Sergey

chapmanb · 2018-12-07T15:49:23Z

Sergey;
Thanks for starting this discussion. This looks fairly unwieldy to deal with, and given how tricky dbNSFP has been, I'm worried about the amount of effort needed to make this happen. I'd love to have better prioritization for non-coding variants, but I'm also worried about this approach of enumerating every position. The AWS approach looks like setting up something custom within the context of a project, but maybe not the best target for bcbio to automate and support. How were you envisioning this all happening? Do you have any ideas for how we can do this in a useful way without needing to mess with these gigantic files? Thanks again.

naumenko-sa · 2018-12-11T05:40:03Z

Thanks Brad,
In particular, I needed a GERP++ score. For this score there is also a small file (17Mb) with conserved elements: http://mendel.stanford.edu/SidowLab/downloads/gerp/ It could easily be recoded as a bed file and used in vcfanno.
A similar approach could probably work for every score: transform the values into a discrete variable (i.e. binning), then create a bed file in which the genome is split into elements, one per run of positions in the same bin.
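
A minimal sketch of the binning idea, assuming per-base scores are already loaded as a list for one contiguous region. The function names, the bin edges and the example scores are all hypothetical and only illustrate the transformation; they are not part of bcbio, vcfanno or WGSA.

```python
from bisect import bisect_right
from itertools import groupby


def bin_score(score, edges):
    """Return the bin index of a score: the number of edges that are <= score."""
    return bisect_right(edges, score)


def scores_to_bed(chrom, start, scores, edges):
    """Collapse consecutive per-base scores into BED-style intervals.

    start is the 0-based coordinate of the first score; intervals are
    half-open (start, end), as in BED. Adjacent positions that fall into
    the same bin are merged into a single interval.
    """
    intervals = []
    pos = start
    for bin_id, run in groupby(scores, key=lambda s: bin_score(s, edges)):
        length = sum(1 for _ in run)
        intervals.append((chrom, pos, pos + length, bin_id))
        pos += length
    return intervals


def to_bed_lines(intervals):
    """Format intervals as tab-separated BED lines (chrom, start, end, bin)."""
    return ["\t".join(str(field) for field in iv) for iv in intervals]


# Hypothetical scores for six consecutive bases, binned as <2, 2 to <4, >=4.
example = scores_to_bed("chr1", 100, [0.1, 0.3, 2.5, 2.7, 4.2, 0.0], [2.0, 4.0])
print("\n".join(to_bed_lines(example)))
```

The fewer bins are used, the longer the merged runs and the smaller the resulting bed file, at the cost of score resolution.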
SN

chapmanb · 2018-12-11T10:10:34Z

Sergey;
Thanks for this. The gerp_elements files are a component of the GEMINI inputs but are unfortunately only available for build 37, so I hadn't ported them over to the generalized vcfanno support in bcbio and CWL, since I was trying to focus on shared resources that are also available for build 38. Are there any equivalent scores that would be useful and are also updated for the latest build that we could include? Thanks again.

naumenko-sa · 2018-12-12T19:19:21Z

Thanks Brad!

I had not noticed that the GERP conserved elements are already in the gemini bundle. Now I see! This works perfectly well for me, as I'm still on grch37 and a standalone bcbio installation. For grch38 I can only propose using the phastcons20way and phylop20way scores from the UCSC browser:
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/
http://genome.ucsc.edu/cgi-bin/hgTables
They will also be small interval files, like the GERP elements, but they are updated for grch38.
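
A minimal sketch of turning such a UCSC element table into a bed file. It assumes a gzipped, tab-separated table dump whose first six columns are bin, chrom, chromStart, chromEnd, name and score (the layout I would expect for a table such as phastConsElements20way, but please check the accompanying .sql file). The function name and file names are hypothetical.

```python
import gzip


def ucsc_elements_to_bed(table_path, bed_path):
    """Convert a gzipped UCSC element table dump to a 5-column BED file.

    Assumed input columns: bin, chrom, chromStart, chromEnd, name, score.
    UCSC database tables use 0-based, half-open coordinates, the same
    convention as BED, so start and end are copied unchanged.
    Returns the number of intervals written.
    """
    written = 0
    with gzip.open(table_path, "rt") as src, open(bed_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            _bin, chrom, start, end, name, score = fields[:6]
            dst.write(f"{chrom}\t{start}\t{end}\t{name}\t{score}\n")
            written += 1
    return written


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    n = ucsc_elements_to_bed("phastConsElements20way.txt.gz", "phastcons20way_elements.bed")
    print(f"wrote {n} intervals")
```

Before use in vcfanno the bed file would still need to be sorted, compressed with bgzip and indexed with tabix.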

Sergey

roryk · 2019-08-10T19:03:20Z

Thanks. @naumenko-sa, do you think this is something that would still be useful?

roryk added the enhancement label Aug 10, 2019

naumenko-sa mentioned this issue May 29, 2020

bcbio priorities #3242

naumenko-sa closed this as completed May 29, 2020
