vcfassoc

perform phenotype::genotype-association tests on a VCF with logistic regression.

Quickstart

(also see Installation)

You will need:

A multi-sample VCF
A covariates file with columns you wish to model. (currently the first column of this file must match the sample headings in the VCF.)
A formula in R (patsy) syntax that will pull from 2. Must contain 'genotype' which will be added to the covariates on-the-fly.

Example run with additive model:

vcfassoc.py samples.vcf covariates.txt 'status ~ genotype + age + gender + PC1 + PC2'

Example run with dominant model:

vcfassoc.py samples.vcf covariates.txt 'status ~ I(genotype > 0) + age + gender + PC1 + PC2'

Example run with recessive model:

vcfassoc.py samples.vcf covariates.txt 'status ~ I(genotype == 2) + age + gender + PC1 + PC2'

The output will look like:

site	pvalue	REF/ALT	OR	OR_CI	z	p_chi_additive	p_chi_dominant	df_resid
1:256769(rs166498)	0.1418	C/T	5.000	0.5842..42.7971	1.469	0.174	0.174	31
1:257063(rs194882)	0.9994	G/A	0.000	0.0000..inf	-0.001	1	1	31
1:257334(rs3959)	0.8998	A/G	1.032	0.6298..1.6919	0.126	0.875	1	31
1:257454(rs70148)	0.5714	C/G	0.500	0.0453..5.5141	-0.566	1	1	31
1:257735(rs134134)	0.9994	T/G	2322868131.974	0.0000..inf	0.001	1	1	31
1:257925(rs39944)	0.05584	C/T	4.076	0.9656..17.2084	1.912	0.0465	0.0431	27
1:258014(rs3283)	0.4235	A/G	2.000	0.3663..10.9192	0.800	0.651	0.651	31
1:258064(rs3091)	0.1863	G/C	0.427	0.1207..1.5090	-1.321	0.206	0.202	30
1:258091(rs3092)	0.438	A/G	0.730	0.3294..1.6175	-0.776	0.473	0.6	30

Where the p_chi_* columns are the result of running a chisq test under the additive or dominant (and now recessive) models. pvalue is for the specified model under logistic regression. the OR_CI is the 95% confidence interval for the odds-ratio of genotype from the model.

The INFO column from the VCF and contingency table for the chisq test are also printed, but not shown here.

If --as-vcf, is specified, the above info will be crammed into the INFO field and the output will be in VCF format.

Installation

vcfassoc requires a number of python modules: numpy, scipy, statsmodels, toolshed, patsy, click, scikit-learn

All dependencies can be installed with pip via:

pip install -r requirements.txt

For users who do not have the scientific python stack set up, it is recommended to download and install anaconda, a free distribution that contains most of these modules.

You will also need to install futures.

Features

In addition to the obvious features, vcfassoc will also perform a feature-selection using 'l1' penalized regression and report the top features from that selection to STDERR at the end of the run.

Further, it will allow correlated data via the use of generalized estimating equations (GEE)'s. Just specify, e.g. --group 'family_id' to indicate the grouping and GEE will automatically be used instead of traditional logistic regression.

See: https://github.com/brentp/vcfassoc/blob/master/test/test.sh

vcfassoc has been tested against R.

TODO

option to output genotype matrix.
allow continuous dependent variable
parallelization (naive version didn't improve speed much)
weighted regression by genotype qualities

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
test		test
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
vcfassoc.py		vcfassoc.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vcfassoc

Quickstart

Installation

Features

TODO

About

Releases

Packages

Languages

brentp/vcfassoc

Folders and files

Latest commit

History

Repository files navigation

vcfassoc

Quickstart

Installation

Features

TODO

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages