WGSA annotation for noncoding variants in WGS studies #2587

Closed
naumenko-sa opened this issue Dec 6, 2018 · 5 comments

@naumenko-sa
Contributor

Hello, bcbio community!

Thanks for the great framework!

Gnomad_genome frequencies help to prioritize variants in WGS studies.
However, it would be nice to have functional prediction and conservation scores for noncoding variants.

For now, many scores come from dbNSFP, but, by definition, this database covers only nonsynonymous (i.e. coding and not synonymous) variants and splice-site variants (it contains 83,189,732 records).
https://sites.google.com/site/jpopgen/dbNSFP

For noncoding variants, the same group proposes using WGSA:
https://sites.google.com/site/jpopgen/wgsa
"For SNV-centric resources, WGSA integrated 12 sets of functional prediction scores (CADD, FATHMM-MKL, FATHMM-XF, Funseq, Funseq2, RegulomeDB, DANN, fitCons x 4, GenoCanyon, Eigen & Eigen-PC, GenoSkyline-Plus x 127, LINSIGHT), 9 conservation scores (bStatistic, GERP++, PhyloP x 3, phastCons x 3, SyPhy), allele frequencies from 5 large-scale re-sequencing studies (1000G, EP6500, ExAC, UK10K, gnomAD), variants in 4 disease related databases (ClinVar, COSMIC, GWAS_catalog, GRASP2), among others (see list of resources)."

Are there any plans to introduce WGSA to bcbio? The dataset is huge (1.4T, which is 2-3 times larger than most bcbio installations with human/mouse genomes), so a local installation of WGSA is probably not an option. But what about accessing it through Amazon Web Services? Does that look feasible (https://sites.google.com/site/jpopgen/wgsa/using-wgsa-via-aws)?

Thanks!
Sergey

@chapmanb
Member

chapmanb commented Dec 7, 2018

Sergey;
Thanks for starting this discussion. This looks fairly unwieldy to deal with, and given how tricky dbNSFP has been, I'm worried about the amount of effort needed to make this happen. I'd love to have better prioritization for non-coding variants, but I'm also worried about this approach of enumerating every position. The AWS approach looks like setting up something custom within the context of a project, which is maybe not the best target for bcbio to automate and support. How were you envisioning this all happening? Do you have any ideas about how we can do this in a useful way without needing to mess with these gigantic files? Thanks again.

@naumenko-sa
Contributor Author

naumenko-sa commented Dec 11, 2018

Thanks Brad,
In particular, I needed a GERP++ score. For this score there is also a small file (17Mb) with conserved elements: http://mendel.stanford.edu/SidowLab/downloads/gerp/. It could easily be recoded as a BED file and used in vcfanno.
A similar approach might work for every score: transform the values into a discrete variable (i.e. binning) and then create a BED file that splits the genome into elements.
SN
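
The binning idea above could be sketched roughly as follows. This is only an illustration, not anything bcbio or vcfanno ships: the function name `scores_to_bed`, the input layout (tuples of chromosome, 0-based position, score, sorted by position) and the bin edges are all assumptions made for the example. Adjacent positions that fall into the same bin are merged into one BED-style interval, which is what keeps the resulting file small.

```python
from bisect import bisect_right
from typing import Iterable, Iterator, Sequence, Tuple

Interval = Tuple[str, int, int, int]  # (chrom, start, end, bin index)


def scores_to_bed(
    records: Iterable[Tuple[str, int, float]],
    edges: Sequence[float],
) -> Iterator[Interval]:
    """Merge per-base scores into BED-style intervals of equal bin.

    `records` is assumed to hold (chrom, 0-based position, score) tuples
    sorted by chromosome and position. `edges` are ascending, hypothetical
    bin boundaries; a score's bin is the number of edges that are <= score.
    """
    current = None
    for chrom, pos, score in records:
        bin_idx = bisect_right(edges, score)
        if (
            current is not None
            and current[0] == chrom
            and current[2] == pos  # directly adjacent to the open interval
            and current[3] == bin_idx
        ):
            # Extend the open interval by one base (BED ends are exclusive).
            current = (chrom, current[1], pos + 1, bin_idx)
        else:
            if current is not None:
                yield current
            current = (chrom, pos, pos + 1, bin_idx)
    if current is not None:
        yield current


if __name__ == "__main__":
    demo = [("1", 100, 0.5), ("1", 101, 0.7), ("1", 102, 2.5), ("1", 103, 2.1)]
    for chrom, start, end, bin_idx in scores_to_bed(demo, edges=[0.0, 2.0]):
        print(f"{chrom}\t{start}\t{end}\t{bin_idx}")
```

The bin index ends up in the fourth BED column, so an annotation tool that reads a column from a BED file could attach it to each variant. Choosing the bin edges is the real design decision and would depend on the score in question.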

@chapmanb
Member

Sergey;
Thanks for this. The gerp_elements files are a component of the GEMINI inputs but unfortunately are only available for build 37, so I hadn't ported them over to the generalized vcfanno support in bcbio and CWL, since I was trying to focus on shared resources that are also available for build 38. Are there any equivalent scores that would be useful and are also updated for the latest build that we could include? Thanks again.

@naumenko-sa
Contributor Author

Thanks Brad!

I had not noticed that the GERP conserved elements are already in the GEMINI bundle. Now I see! This works perfectly well for me, as I'm still on GRCh37 and a standalone bcbio installation. For GRCh38 I can only propose using the phastcons20way and phylop20way scores from the UCSC browser:
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/
http://genome.ucsc.edu/cgi-bin/hgTables
They will also be small interval files, like the GERP elements, but they are updated for GRCh38.

Sergey
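
As a rough illustration of how such an interval file might be prepared, here is a minimal sketch that turns rows of a UCSC-style table dump into BED lines. It assumes the conserved-elements layout of six tab-separated columns (bin, chrom, chromStart, chromEnd, name, score); that layout is an assumption and should be checked against the table's schema before use. The function name `ucsc_rows_to_bed` is hypothetical.

```python
from typing import Iterable, Iterator


def ucsc_rows_to_bed(lines: Iterable[str]) -> Iterator[str]:
    """Convert UCSC-style table rows to 5-column BED lines.

    Assumed input columns (tab-separated): bin, chrom, chromStart,
    chromEnd, name, score. The leading `bin` column is an indexing
    helper and is dropped; coordinates are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 columns, got {len(fields)}")
        _bin, chrom, start, end, name, score = fields[:6]
        yield "\t".join((chrom, start, end, name, score))


if __name__ == "__main__":
    # Illustrative row only, not real data.
    example = ["585\tchr1\t10917\t10931\tlod=14\t254\n"]
    for bed_line in ucsc_rows_to_bed(example):
        print(bed_line)
```

The output would still need to be sorted, compressed and indexed before an annotation tool could query it by region.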

@roryk
Collaborator

roryk commented Aug 10, 2019

Thanks, @naumenko-sa. Do you think this would still be useful?
