-
Notifications
You must be signed in to change notification settings - Fork 74
Genetic association tests using SAIGE
- Overview
- Frequently asked questions
- Installing SAIGE
- Flowchart
- Running SAIGE and SAIGE-GENE
SAIGE version >= 0.35.8.5 accounts for case-control imbalance in region/gene-based tests.
SAIGE version > 0.35.6 supports single-variant association tests, gene-based tests (SKAT-O, SKAT, BURDEN), as well as conditional analysis based on summary statistics for both single variant and gene-based tests.
This tutorial will walk you through
- Installing SAIGE
- Running SAIGE for
- Single-variant association tests
- Gene-based tests
- Conditional analysis
Workflow of these tests are almost the same. They use the same step 1 and step 2 functions of SAIGE. We are going to use the example data in the extdata folder in the github page.
- Which version of SAIGE/SAIGE-GENE should I use?
- We are continuously adding more options to meet the needs of data analysis by users and fixing bugs, so we recommend always trying the newest version in the master branch. The docker hub will be always updated when there is new version.
- What is the easiest way to install SAIGE/SAIGE-GENE?
- The easiest way to use SAIGE/SAIGE-GENE is using the docker image (approach 4). If docker cannot be used in your machine, please try the installation using a conda env (approach 5) https://github.com/weizhouUMICH/SAIGE/wiki/Genetic-association-tests-using-SAIGE/_edit#installing-saige
- Can SAIGE/SAIGE-GENE handle samples from different ancestry?
- We haven't throughly tested the performance of SAIGE on multi-ethnic samples. SAIGE uses a random effect to account for sample relatedness/population sub-stratification through the genetic relationship matrix. For multi-ancestry samples (such as 1000 genome) you would need to include PCs to adjust for population stratification. Using both (mixed effect model + PC as covariates) would be the best approach.
- For Step 1, how many markers need to be included in the plink file?
-
To fit the null logistic mixed model using fitNULLGLMM (step 1), ideally the number of independent markers for constructing GRM is larger than the sample size. In our previous GWAS for the UK Biobank, we used 93k markers that were pruned by UKB for kinship estimation while the sample size is ~400k (white British samples). The results look fine. We then compared the P-values from the previous analysis with 93k markers for GRM and with 340k markers for GRM for four phenotypes with different prevalences as shown in the Supplementary Figure 17 in the SAIGE paper https://www.ncbi.nlm.nih.gov/pubmed/30104761. We see for phenotypes such as CAD, the inflation of p-values was further corrected when 340k markers were used for GRM.
-
The option minMAFforGRM (default, 0.01) is used for set the minimum MAF for markers used in the plink file to estimate the GRM for step 1. Only markers with MAF >= minMAFforGRM will be used to estimate the GRM.
-
For gene-based tests, the same plink file should be used for both Step 0 and Step 1. This means, Step 0 creates a sparse version of the "full" GRM that is used in Step 1. The problem is that for gene-based tests, variance ratios are estimated for rare variants and common variants and for each MAC category that is specified, 30 markers will be randomly selected first and increase by 10 until the variance ratio estimation is stable. Therefore, rare variants need to be included in the plink file, although they are not used for GRM, because only markers with MAF >= minMAFforGRM will be used for GRM and the default value is 0.01. In the example of the gene-based tests for UK Biobank, you may include all rare variants with MAF < 1% and those 93k markers in the plink file for Step 0 and Step 1.
- Does LOCO=TRUE work?
- Version 0.36.5 or later works for LOCO=TRUE
- Does SAIGE/SAIGE-GENE requires that samples in Step 1 have non-missing dosages in Step 2?
- Yes. Current version of SAIGE/SAIGE-GENE requires that all samples used in Step 1 have to be in the bgen/vcf file used in Step 2.
- For Step 2, how does SAIGE/SAIGE-GENE handle missing dosages/genotypes?
- To drop samples with missing genotypes/dosages, --IsDropMissingDosages=TRUE, if FALSE, missing genotypes/dosages will be mean imputed.
- For Step 2, what options can be used to filter out variants based on the MAF and imputation scores?
- minMAF and minMAC (default 0.5) can be used to filter variants based on the MAF. The higher threshold between minMAC and minMAF will be used.
- minInfo can be used to filter variants based on the imputation score. ** Please note that the saddlepoint approximation used in SAIGE for single-variant association tests does not work for variants with MAC <= 3 (usually gives extremely significant P-values). For those extremely rare variants, please use SAIGE-GENE to tests them. **
SAIGE can be installed in 4 ways. The first 3 ways require installing dependencies first.
- R-3.6.1, gcc >= 5.4.0, cmake 3.14.1, cget
- R packages: "R.utils", "Rcpp", "RcppParallel", "RcppArmadillo", "data.table", "RcppEigen", "Matrix", "methods", "BH", "optparse", "SPAtest", "SKAT","MetaSKAT"
- /extdata/install_packages.R can be used to install the R packages
- The binary install file can be downloaded from SAIGE releases or from the master branch. Then using the following command line to install SAIGE
R CMD INSTALL SAIGE_XX_R_x86_64-pc-linux-gnu.tar.gz
- Installing using the R library devtools from github
devtools::install_github("weizhouUMICH/SAIGE")
- Installing from the source code.
src_branch=master
repo_src_url=https://github.com/weizhouUMICH/SAIGE
git clone --depth 1 -b $src_branch $repo_src_url
R CMD INSTALL SAIGE
- Using a docker image (Thanks to Juha Karjalainen for sharing the Dockerfile). The docker image can be pulled
docker pull wzhou88/saige:0.36.6
Dockerfile for creating a docker image for SAIGE
Functions can be called
step1_fitNULLGLMM.R --help
step2_SPAtests.R --help
createSparseGRM.R --help
- Using a conda environment (Thanks to Wallace(Minxian) Wang)
a) create a conda environment using (conda environment file)
conda env create -f environment-RSAIGE.yml
conda activate RSAIGE
FLAGPATH=`which python | sed 's|/bin/python$||'`
export LDFLAGS="-L${FLAGPATH}/lib"
export CPPFLAGS="-I${FLAGPATH}/include"
Note: Here are the steps to create the conda environment file
b) Using method 3 Open R and install package MetaSKAT
install.packages('MetaSKAT')
exit R and run command
src_branch=master
repo_src_url=https://github.com/weizhouUMICH/SAIGE
git clone --depth 1 -b $src_branch $repo_src_url
R CMD INSTALL SAIGE
c) Or using method 2 Open R and run (choose 3 no update any packages):
devtools::install_github("weizhouUMICH/SAIGE")
- Previous versions of SAIGE installation files are here https://www.dropbox.com/sh/zmlu1llpxd66pjl/AADFqdssvOBjbWZch6Q9zYNaa?dl=0
SAIGE and SAIGE-GENE can work for both binary and quantitative traits.
Scripts are in the folder extdata/ and command lines are in the file cmd.sh
cd /extdata/
less -S cmd.sh
Input files are in the folder extdata/input
Output files are in the folder extdata/output
To obtain help information of the scripts that call functions in the SAIGE library
Rscript createSparseGRM.R --help
Rscript step1_fitNULLGLMM.R --help
Rscript step2_SPAtests.R --help
To obtain help information of functions in the SAIGE library
#open R
R
library(SAIGE)
?createSparseGRM
?fitNULLGLMM
?SPAGMMATtest
For single-variant association tests, sparse GRM and categorical variance ratios are NOT needed. Randomly selected markers with MAC >= 20 are used to estimate the variance ratio
- For binary traits, a null logistic mixed model will be fitted (--traitType=binary).
- For quantitative traits, a null linear mixed model will be fitted (--traitType=quantitative) and needs to be inverse normalized (--invNormalize=TRUE)
#check the help info for step 1
Rscript step1_fitNULLGLMM.R --help
#For Binary traits:
Rscript step1_fitNULLGLMM.R \
--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
--phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
--phenoCol=y_binary \
--covarColList=x1,x2 \
--sampleIDColinphenoFile=IID \
--traitType=binary \
--outputPrefix=./output/example_binary \
--nThreads=4 \
--LOCO=FALSE
#For quantitative traits, if not normally distributed, inverse normalization needs to be specified to be TRUE --invNormalize=TRUE
Rscript step1_fitNULLGLMM.R \
--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
--phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
--phenoCol=y_quantitative \
--covarColList=x1,x2 \
--sampleIDColinphenoFile=IID \
--traitType=quantitative \
--invNormalize=TRUE \
--outputPrefix=./output/example_quantitative \
--nThreads=4 \
--LOCO=FALSE \
--tauInit=1,0
- Genotype file for constructing the genetic relationship matrix (GRM)
SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam
- Phenotype file (contains non-genetic covariates if any) The file can be either space or tab-delimited with a header. It is required that the file contains one column for sample IDs and one column for the phenotype. It may contain columns for non-genetic covariates.
Note: Current version of SAIGE does not support categorical covariates that have more than two categories
less -S ./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt
- model file
./output/example_quantitative.rda
#open R
R
#load the model file in R
load("./output/example_quantitative.rda")
names(modglmm)
modglmm$theta
#theta: a vector of length 2. The first element is the dispersion parameter estimate and the second one is the variance component parameter estimate
#coefficients: fixed effect parameter estimates
#linear.predictors: a vector of length N (N is the sample size) containing linear predictors
#fitted.values: a vector of length N (N is the sample size) containing fitted mean values on the original scale
#Y: a vector of length N (N is the sample size) containing final working vector
#residuals: a vector of length N (N is the sample size) containing residuals on the original scale
#sampleID: a vector of length N (N is the sample size) containing sample IDs used to fit the null model
- association result file for the subset of randomly selected markers
less -S ./output/example_quantitative_30markers.SAIGE.results.txt
- variance ratio file
less -S ./output/example_quantitative.varianceRatio.txt
- For binary traits, saddle point approximation is used to account for case-control imbalance.
- Four different file formats for dosages/genotypes of genetic variants to be tested can be used: VCF, BGEN, SAV, plain text
- Conditional analysis based summary stats can be performed (--condition) can be performed in step 2 with dosage/genotype input file formats VCF, BGEN and SAV.
- To test a subset of markers,
-
- both variant IDs and range of chromosome positions can be specified for BGEN input (--idstoExcludeFile, --idstoIncludeFile, --rangestoExcludeFile, --rangestoIncludeFile)
-
- range chromosome positions can be specified for VCF/SAV input (--chrom, --start, --end).
- For VCF/SAV input, --chrom must be specified and the string needs to be exactly the same as in the VCF/SAV, such as "01" or "chr1".
- For VCF/SAV input, --vcfField=DS to test dosages and --vcfField=GT to test genotypes
- To drop samples with missing genotypes/dosages, --IsDropMissingDosages=TRUE, if FALSE, missing genotypes/dosages will be mean imputed. **NOTE: this option has not been extensively tested yet. **
#check the help info for step 2
Rscript step2_SPAtests.R --help
#perform the single-variant association tests
#for binary traits, --IsOutputAFinCaseCtrl=TRUE can be specified to output allele frequencies in cases and controls
Rscript step2_SPAtests.R \
--vcfFile=./input/genotype_10markers.missingness.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example_binary.rda \
--varianceRatioFile=./output/example_binary.varianceRatio.txt \
--SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
##drop samples with missing genotypes/dosages
##--IsDropMissingDosages=TRUE
Rscript step2_SPAtests.R \
--vcfFile=./input/genotype_10markers.missingness.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example_binary.rda \
--varianceRatioFile=./output/example_binary.varianceRatio.txt \
--SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.dropmissing.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE \
--IsDropMissingDosages=TRUE
#conditional analysis
## --condition = Genetic marker ids (chr:pos_ref/alt for vcf/sav or marker id for bgen) seperated by comma. e.g.chr3:101651171_C/T,chr3:101651186_G/A, Note that currently conditional analysis is only for vcf/sav and bgen input.
Rscript step2_SPAtests.R \
--vcfFile=./input/genotype_10markers.missingness.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example_binary.rda \
--varianceRatioFile=./output/example_binary.varianceRatio.txt \
--SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.dropmissing.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE \
--IsDropMissingDosages=TRUE \
--condition=1:4_A/C
#plain text input for dosages (plain text input for dosages is not inputed since 0.36.1)
Rscript step2_SPAtests.R \
--dosageFile=./input/dosage_10markers.txt \
--dosageFileNrowSkip=1 \
--dosageFileNcolSkip=6 \
--dosageFilecolnamesSkip=CHR,SNP,CM,POS,EFFECT_ALLELE,ALT_ALLELE \
--minMAF=0.0001 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example.rda \
--varianceRatioFile=./output/example.varianceRatio.txt \
--SAIGEOutputFile=./output/example.plainDosage.SAIGE.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
#bgen for dosages
Rscript step2_SPAtests.R \
--bgenFile=./input/genotype_100markers.bgen \
--bgenFileIndex=./input/genotype_100markers.bgen.bgi \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/samplefileforbgen_10000samples.txt \
--GMMATmodelFile=./output/example.rda \
--varianceRatioFile=./output/example.varianceRatio.txt \
--SAIGEOutputFile=./output/example.SAIGE.bgen.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
#VCF containing genotypes
Rscript step2_SPAtests.R \
--vcfFile=./input/genotype_10markers.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example.rda \
--varianceRatioFile=./output/example.varianceRatio.txt \
--SAIGEOutputFile=./output/example.SAIGE.vcf.genotype.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
#VCF containing dosages
Rscript step2_SPAtests.R \
--vcfFile=./input/dosage_10markers.vcf.gz \
--vcfFileIndex=./input/dosage_10markers.vcf.gz.tbi \
--vcfField=DS \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example.rda \
--varianceRatioFile=./output/example.varianceRatio.txt \
--SAIGEOutputFile=./output/example.SAIGE.vcf.dosage.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
# SAV for dosages
Rscript step2_SPAtests.R \
--savFile=./input/dosage_10markers.sav \
--savFileIndex=./input/dosage_10markers.sav.s1r \
--vcfField=DS \
--minMAF=0.0001 \
--minMAC=1 \
--chrom=1 \
--sampleFile=./input/samplefileforbgen_10000samples.txt \
--GMMATmodelFile=./output/example.rda \
--varianceRatioFile=./output/example.varianceRatio.txt \
--SAIGEOutputFile=./output/example.SAIGE.sav.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
- Dosage file
SAIGE can take in several different formats for dosages: plain text (gzipped file is supported), VCF, BCF, BGEN, and SAV.
The example dosage files contain 10 markers for 1,000 samples
- plain text file
less -S dosage_10markers.txt
- BGEN
genotype_100markers.bgen
genotype_100markers.bgen.bgi
- VCF containing genotypes
zcat genotype_10markers.vcf.gz | less -S
genotype_10markers.vcf.gz.tbi
- VCF containing dosages
zcat dosage_10markers.vcf.gz | less -S
dosage_10markers.vcf.gz.tbi
- SAV
dosage_10markers.sav
dosage_10markers.sav.s1r
- Sample file <br> This file contains one column for sample IDs corresponding to the sample order in the dosage file. No header is included. The option was originally for BGEN file that does not contain sample information.
less -S sampleIDindosage.txt
- Model file from step 1
./output/example.rda
- Variance ratio file from step 1
./output/example.varianceRatio.txt
A file with association test results
less -S ./output/example.SAIGE.vcf.dosage.txt
NOTE:
- If the input format of dosages is plain text. The first several columns for header are provided by the user.
- For plain text input for dosage/genotypes, association results are for the allele whose dosages/genotypes are provides. For VCF, BCF, SAV, or BGEN input, association results are for Allele2.
- For binary traits, the header of the output file:
CHR POS SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar AF.Cases AF.Controls
- For quantitative traits, the header of the output file:
CHR POS SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value varT varTstar
CHR: chromosome
POS: genome position
SNPID: variant ID
Allele1: allele 1
Allele2: allele 2
AC_Allele2: allele count of allele 2
AF_Allele2: allele frequency of allele 2
imputationInfo: imputation info. If not in dosage/genotype input file, will output 1
N: sample size
BETA: effect size of allele 2
SE: standard error of BETA
Tstat: score statistic of allele 2
p.value: p value (with SPA applied for binary traits)
p.value.NA: p value when SPA is not applied (only for binary traits)
Is.SPA.converge: whether SPA is converged or not (only for binary traits)
varT: estimated variance of score statistic with sample relatedness incorporated
varTstar: variance of score statistic without sample relatedness incorporated
AF.Cases: allele frequency of allele 2 in cases (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
AF.Controls: allele frequency of allele 2 in controls (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
Tstat_cond, p.value_cond, varT_cond, BETA_cond, SE_cond: summary stats for conditional analysis
Note: This step is only needed for region- and gene-based tests (SAIGE-GENE)
The sparse GRM only needs to be created once for each data set, e.g. a biobank, and can be used for all different phenotypes as long as all tested samples are included in it.
Rscript createSparseGRM.R \
--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
--nThreads=4 \
--outputPrefix=./output/sparseGRM \
--numRandomMarkerforSparseKin=2000 \
--relatednessCutoff=0.125
Genotype file for constructing the genetic relationship matrix
SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam
- a file storing the sparse GRM
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx
The sparse GRM file can be opened using the readMM function in the R library Matrix
- a file storing IDs of the samples in the sparse GRM
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt
- IsSparseKin=TRUE must be specified for SAIGE-GENE
- step 1 model result from the single-variant assoc test can be re-used, except that for gene-based tests, variance ratios for multiple MAC categories and a sparse GRM need to be used (IsSparseKin=TRUE).
- with IsSparseKin=TRUE, no sparseGRMFile and sparseGRMSampleIDFile are specified (step 0 has not been done beforehand), a sparse GRM will be created based on the relatednessCutoff in this step.
- with IsSparseKin=TRUE, sparseGRMFile and sparseGRMSampleIDFile can be used to specify a pre-calcuated sparse GRM and the sample ids for the sparse GRM (output from step 0). Tested samples would be a subset of samples in the pre-calcuated GRM.
- To activate the variance ratio estimation based multiple MAC categories, --isCateVarianceRatio=TRUE
- cateVarRatioMinMACVecExclude and cateVarRatioMaxMACVecInclude are used to specify the MAC categories
#by default
--cateVarRatioMinMACVecExclude=0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5
--cateVarRatioMaxMACVecInclude=1.5,2.5,3.5,4.5,5.5,10.5,20.5
corresponding to
0.5 < MAC <= 1.5
1.5 < MAC <= 2.5
2.5 < MAC <= 3.5
3.5 < MAC <= 4.5
4.5 < MAC <= 5.5
5.5 < MAC <= 10.5
10.5 < MAC <= 20.5
20.5 < MAC
Rscript step1_fitNULLGLMM.R \
--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
--phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
--phenoCol=y_quantitative \
--covarColList=x1,x2 \
--sampleIDColinphenoFile=IID \
--traitType=quantitative \
--invNormalize=TRUE \
--outputPrefix=./output/example_quantitative \
--outputPrefix_varRatio=./output/example_quantitative_cate \
--sparseGRMFile=./output/example_binary_cate.varianceRatio.txt.sparseGRM.mtx \
--sparseGRMSampleIDFile=./output/example_binary.varianceRatio.txt.sparseGRM.mtx.sample \
--nThreads=4 \
--LOCO=FALSE \
--skipModelFitting=FALSE \
--IsSparseKin=TRUE \
--isCateVarianceRatio=TRUE
-
(same as step 1 for single-variant assoc tests and step 0)
Genotype file for constructing the genetic relationship matrix in the plink format -
a file storing the sparse GRM (optional, output by step 0)
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx
The sparse GRM file can be opened using the readMM function in the R library Matrix
- a file storing IDs of the samples in the sparse GRM (optional, output by step 0)
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt
- same as the step 1 output by SAIGE for single-variant association tests
- model file
- association result file for the subset of randomly selected markers
- variance ratio file
- specific to SAIGE-GENE
- sparse Sigma file
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx
NOTE: the file contains the sparse Sigma matrix, which is ** NOT ** the sparse GRM. The sparse Sigma matrix is computed based on the sparse GRM and the results of step 1.
- The command line is the same as the step 2 for single-variant assoc tests, except that a group file is specified (--groupFile)
- Each line is for one gene/set of variants. The first element is for gene/set name.
- The rest of the line is for variant ids included in this gene/set. For vcf/sav, the genetic marker ids are in the format chr:pos_ref/alt. For bgen, the genetic marker ids should match the ids in the bgen file. Each element in the line is separated by tab.
- IsSingleVarinGroupTest=TRUE is to perform single-variant assoc tests as well for markers included in the gene-based tests
- Same as the single-variant assoc tests, conditional analysis based summary stats can be performed (--condition) can be performed in step 2 with dosage/genotype input file formats VCF, BGEN and SAV.
Rscript step2_SPAtests.R \
--vcfFile=./input/genotype_10markers.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0 \
--minMAC=0.5 \
--maxMAFforGroupTest=0.01 \
--sampleFile=./input/samplelist.txt \
--GMMATmodelFile=./output/example_quantitative.rda \
--varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
--SAIGEOutputFile=./output/example_quantitative.SAIGE.gene.txt \
--numLinesOutput=1 \
--groupFile=./input/groupFile_geneBasedtest_simulation.txt \
--sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx \
--IsSingleVarinGroupTest=TRUE
##another example, conditional analysis for gene-based tests
Rscript step2_SPAtests.R \
--vcfFile=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav \
--vcfFileIndex=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav.s1r \
--vcfField=DS \
--chrom=chr1 \
--minMAF=0 \
--minMAC=0.5 \
--maxMAFforGroupTest=0.01 \
--sampleFile=./input/samplelist.txt \
--GMMATmodelFile=./output/example_quantitative.rda \
--varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
--SAIGEOutputFile=./output/example_quantitative.SAIGE.gene_conditional.txt \
--numLinesOutput=1 \
--groupFile=./input/groupFile_geneBasedtest.txt \
--sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx \
--IsOutputAFinCaseCtrl=TRUE \
--IsSingleVarinGroupTest=TRUE \
--condition=chr1:32302_A/C
##Specify customized weights for markers in the gene- or region-based tests
* weightsIncludeinGroupFile logical. Whether to specify customized weight for makers in gene- or region-based tests. If TRUE, weights are included in the group file. For vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab. By default, FALSE
* weights_for_G2_cond vector of float. weights for conditioning markers for gene- or region-based tests. The length equals to the number of conditioning markers, delimited by comma.
Rscript step2_SPAtests.R \
--vcfFile=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav \
--vcfFileIndex=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav.s1r \
--vcfField=DS \
--chrom=chr1 \
--minMAF=0 \
--minMAC=0.5 \
--maxMAFforGroupTest=0.01 \
--sampleFile=./input/samplelist.txt \
--GMMATmodelFile=./output/example_binary.rda \
--varianceRatioFile=./output/example_binary_cate_v2.varianceRatio.txt \
--SAIGEOutputFile=./output/example_binary.SAIGE.gene_conditional_withspecifiedWeights.txt \
--numLinesOutput=1 \
--groupFile=./input/groupFile_geneBasedtest_withWeights.txt \
--sparseSigmaFile=./output/example_binary_cate_v2.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx \
--IsOutputAFinCaseCtrl=TRUE \
--IsSingleVarinGroupTest=TRUE \
--IsOutputPvalueNAinGroupTestforBinary=TRUE \
--IsAccountforCasecontrolImbalanceinGroupTest=TRUE \
--weightsIncludeinGroupFile=TRUE \
--weights_for_G2_cond=3,1 \
--condition=chr1:32302_A/C,chr1:32304_A/C
-
Dosage file
SAIGE-GENE can take in VCF, BCF, BGEN, and SAV. The plain text dosage/genotype input does not work. -
Sample file
This file contains one column for sample IDs corresponding to the sample order in the dosage file. No header is included.
less -S sampleIDindosage.txt
- Model file from step 1
./output/example.rda
- Variance ratio file from step 1
./output/example.varianceRatio.txt
- specific to SAIGE-GENE
- Group file
Each line is for one gene/set of variants. The first element is for gene/set name.
The rest of the line is for variant ids included in this gene/set. For vcf/sav, the genetic marker ids are in the format chr:pos_ref/alt. For bgen, the genetic marker ids should match the ids in the bgen file. Each element in the line is separated by tab.
** Note that the order of variants in the group file needs to be consistent to variants' order in the dosage file ** To specify customized weights for variants, set weightsIncludeinGroupFile=TRUE and in the group file, for vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab.**
** If weightsIncludeinGroupFile=TRUE, in conditional analysis, weights for conditioning markers need to be specified using the argument weights_for_G2_cond, e.g. weights_for_G2_cond=1,3,4. The weights in --weights_for_G2_cond are for markers in --condition, respectively.**
less -S ./input/groupFile_geneBasedtest.txt
- sparse Sigma file
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx
- A file with region- or gene-based association test results
less -S ./output/example_quantitative.SAIGE.gene_conditional.txt
NOTE:
- The header of the output file:
Gene Pvalue Pvalue_cond Nmarker_MACCate_1 Nmarker_MACCate_2 Nmarker_MACCate_3 Nmarker_MACCate_4 Nmarker_MACCate_5 Nmarker_MACCate_6 Nmarker_MACCate_7 Nmarker_MACCate_8 markerIDs markerAFs Pvalue_Burden Pvalue_SKAT Pvalue_Burden_cond Pvalue_SKAT_cond
Gene : Gene name (as provided in the group file)
Pvalue: p-value of SKAT-O test
Pvalue_cond: conditional p-value of SKAT-O test (if --condition= is specified)
Nmarker_MACCate_n: number of markers in the gene falling in the nth MAC category (The MAC category is corresponding to the one used for the variance ratio estimation). For example, by default, Nmarker_MACCate_1 is the number of singletons
markerIDs: IDs for variants included in the test
markerAFs: allele frequencies for variants included in the test
Pvalue_Burden: p-value of Burden test
Pvalue_SKAT: p-value of SKAT test
Pvalue_Burden_cond: conditional p-value of Burden test (if --condition= is specified)
Pvalue_SKAT_cond: conditional p-value of SKAT test (if --condition= is specified)
Pvalue_SKAT.NA: p-values of SKAT test without accounting for unbalanced case-control ratios for binary traits
Pvalue.NA: p-values of SKAT-O test without accounting for unbalanced case-control ratios for binary traits
Pvalue_Burden.NA: p-values of Burden test without accounting for unbalanced case-control ratios for binary traits
- A file with single-variant association test results for genetic variants included in the gene-based tests (if --IsSingleVarinGroupTest=TRUE)
less -S ./output/example_quantitative.SAIGE.gene_conditional.txt_single
Same as the output by SAIGE for single-variant association tests