Genetic association tests using SAIGE

Overview
Frequently asked questions
Installing SAIGE
Flowchart
- SAIGE
- SAIGE-GENE
Running SAIGE and SAIGE-GENE
- Single variant association tests
  - Step 1: fitting the null logistic/linear mixed model
    - Input files
    - Output files
  - Step 2: performing the single-variant association tests
    - Input files
    - Output file
- Region- or gene-based association tests (SAIGE-GENE)

Overview

SAIGE version >= 0.35.8.5 accounts for case-control imbalance in region/gene-based tests.

SAIGE version > 0.35.6 supports single-variant association tests, gene-based tests (SKAT-O, SKAT, BURDEN), as well as conditional analysis based on summary statistics for both single variant and gene-based tests.

This tutorial will walk you through

Installing SAIGE
Running SAIGE for

Single-variant association tests
Gene-based tests
Conditional analysis

Workflow of these tests are almost the same. They use the same step 1 and step 2 functions of SAIGE. We are going to use the example data in the extdata folder in the github page.

Frequently asked questions

Which version of SAIGE/SAIGE-GENE should I use?

We are continuously adding more options to meet the needs of data analysis by users and fixing bugs, so we recommend always trying the newest version in the master branch. The docker hub will be always updated when there is new version.

What is the easiest way to install SAIGE/SAIGE-GENE?

The easiest way to use SAIGE/SAIGE-GENE is using the docker image (approach 4). If docker cannot be used in your machine, please try the installation using a conda env (approach 5) https://github.com/weizhouUMICH/SAIGE/wiki/Genetic-association-tests-using-SAIGE/_edit#installing-saige

Can SAIGE/SAIGE-GENE handle samples from different ancestry?

We haven't throughly tested the performance of SAIGE on multi-ethnic samples. SAIGE uses a random effect to account for sample relatedness/population sub-stratification through the genetic relationship matrix. For multi-ancestry samples (such as 1000 genome) you would need to include PCs to adjust for population stratification. Using both (mixed effect model + PC as covariates) would be the best approach.

For Step 1, how many markers need to be included in the plink file?

To fit the null logistic mixed model using fitNULLGLMM (step 1), ideally the number of independent markers for constructing GRM is larger than the sample size. In our previous GWAS for the UK Biobank, we used 93k markers that were pruned by UKB for kinship estimation while the sample size is ~400k (white British samples). The results look fine. We then compared the P-values from the previous analysis with 93k markers for GRM and with 340k markers for GRM for four phenotypes with different prevalences as shown in the Supplementary Figure 17 in the SAIGE paper https://www.ncbi.nlm.nih.gov/pubmed/30104761. We see for phenotypes such as CAD, the inflation of p-values was further corrected when 340k markers were used for GRM.
The option minMAFforGRM (default, 0.01) is used for set the minimum MAF for markers used in the plink file to estimate the GRM for step 1. Only markers with MAF >= minMAFforGRM will be used to estimate the GRM.
For gene-based tests, the same plink file should be used for both Step 0 and Step 1. This means, Step 0 creates a sparse version of the "full" GRM that is used in Step 1. The problem is that for gene-based tests, variance ratios are estimated for rare variants and common variants and for each MAC category that is specified, 30 markers will be randomly selected first and increase by 10 until the variance ratio estimation is stable. Therefore, rare variants need to be included in the plink file, although they are not used for GRM, because only markers with MAF >= minMAFforGRM will be used for GRM and the default value is 0.01. In the example of the gene-based tests for UK Biobank, you may include all rare variants with MAF < 1% and those 93k markers in the plink file for Step 0 and Step 1.

Does LOCO=TRUE work?

Version 0.36.5 or later works for LOCO=TRUE

Does SAIGE/SAIGE-GENE requires that samples in Step 1 have non-missing dosages in Step 2?

Yes. Current version of SAIGE/SAIGE-GENE requires that all samples used in Step 1 have to be in the bgen/vcf file used in Step 2.

For Step 2, how does SAIGE/SAIGE-GENE handle missing dosages/genotypes?

To drop samples with missing genotypes/dosages, --IsDropMissingDosages=TRUE, if FALSE, missing genotypes/dosages will be mean imputed.

For Step 2, what options can be used to filter out variants based on the MAF and imputation scores?

minMAF and minMAC (default 0.5) can be used to filter variants based on the MAF. The higher threshold between minMAC and minMAF will be used.
minInfo can be used to filter variants based on the imputation score. ** Please note that the saddlepoint approximation used in SAIGE for single-variant association tests does not work for variants with MAC <= 3 (usually gives extremely significant P-values). For those extremely rare variants, please use SAIGE-GENE to tests them. **

Installing SAIGE

SAIGE can be installed in 4 ways. The first 3 ways require installing dependencies first.

R-3.6.1, gcc >= 5.4.0, cmake 3.14.1, cget
R packages: "R.utils", "Rcpp", "RcppParallel", "RcppArmadillo", "data.table", "RcppEigen", "Matrix", "methods", "BH", "optparse", "SPAtest", "SKAT","MetaSKAT"
/extdata/install_packages.R can be used to install the R packages

The binary install file can be downloaded from SAIGE releases or from the master branch. Then using the following command line to install SAIGE

R CMD INSTALL SAIGE_XX_R_x86_64-pc-linux-gnu.tar.gz

Installing using the R library devtools from github

devtools::install_github("weizhouUMICH/SAIGE")

Installing from the source code.

src_branch=master
repo_src_url=https://github.com/weizhouUMICH/SAIGE
git clone --depth 1 -b $src_branch $repo_src_url

R CMD INSTALL SAIGE

Using a docker image (Thanks to Juha Karjalainen for sharing the Dockerfile). The docker image can be pulled

docker pull wzhou88/saige:0.36.6

Dockerfile for creating a docker image for SAIGE

Functions can be called

step1_fitNULLGLMM.R --help
step2_SPAtests.R --help
createSparseGRM.R --help

Using a conda environment (Thanks to Wallace(Minxian) Wang)

a) create a conda environment using (conda environment file)

conda env create -f environment-RSAIGE.yml
conda activate RSAIGE
FLAGPATH=`which python | sed 's|/bin/python$||'`
export LDFLAGS="-L${FLAGPATH}/lib"
export CPPFLAGS="-I${FLAGPATH}/include"

Note: Here are the steps to create the conda environment file

b) Using method 3 Open R and install package MetaSKAT

install.packages('MetaSKAT')

exit R and run command

src_branch=master
repo_src_url=https://github.com/weizhouUMICH/SAIGE
git clone --depth 1 -b $src_branch $repo_src_url
R CMD INSTALL SAIGE

c) Or using method 2 Open R and run (choose 3 no update any packages):

devtools::install_github("weizhouUMICH/SAIGE")

Previous versions of SAIGE installation files are here https://www.dropbox.com/sh/zmlu1llpxd66pjl/AADFqdssvOBjbWZch6Q9zYNaa?dl=0

Flowchart

SAIGE and SAIGE-GENE can work for both binary and quantitative traits.

SAIGE

2 steps in SAIGE

SAIGE-GENE

2 steps in SAIGE-GENE

Running SAIGE and SAIGE-GENE

Scripts are in the folder extdata/ and command lines are in the file cmd.sh

cd /extdata/
less -S cmd.sh

Input files are in the folder extdata/input
Output files are in the folder extdata/output

To obtain help information of the scripts that call functions in the SAIGE library

Rscript createSparseGRM.R --help
Rscript step1_fitNULLGLMM.R --help
Rscript step2_SPAtests.R --help

To obtain help information of functions in the SAIGE library

#open R
R
library(SAIGE)
?createSparseGRM
?fitNULLGLMM
?SPAGMMATtest

Single variant association tests

For single-variant association tests, sparse GRM and categorical variance ratios are NOT needed. Randomly selected markers with MAC >= 20 are used to estimate the variance ratio

Step 1: fitting the null logistic/linear mixed model

For binary traits, a null logistic mixed model will be fitted (--traitType=binary).
For quantitative traits, a null linear mixed model will be fitted (--traitType=quantitative) and needs to be inverse normalized (--invNormalize=TRUE)

#check the help info for step 1
Rscript step1_fitNULLGLMM.R --help

#For Binary traits:
Rscript step1_fitNULLGLMM.R     \
        --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_binary \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=binary        \
        --outputPrefix=./output/example_binary \
        --nThreads=4 \
        --LOCO=FALSE

#For quantitative traits, if not normally distributed, inverse normalization needs to be specified to be TRUE --invNormalize=TRUE
Rscript step1_fitNULLGLMM.R     \
        --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_quantitative \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=quantitative       \
	--invNormalize=TRUE	\
        --outputPrefix=./output/example_quantitative \
        --nThreads=4 \
        --LOCO=FALSE	\
	--tauInit=1,0

Input files

Genotype file for constructing the genetic relationship matrix (GRM)
SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam

./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam

Phenotype file (contains non-genetic covariates if any) The file can be either space or tab-delimited with a header. It is required that the file contains one column for sample IDs and one column for the phenotype. It may contain columns for non-genetic covariates.

Note: Current version of SAIGE does not support categorical covariates that have more than two categories

less -S ./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt

Output files

model file

./output/example_quantitative.rda

#open R
R
#load the model file in R
load("./output/example_quantitative.rda")
names(modglmm)
modglmm$theta

#theta: a vector of length 2. The first element is the dispersion parameter estimate and the second one is the variance component parameter estimate
#coefficients: fixed effect parameter estimates
#linear.predictors: a vector of length N (N is the sample size) containing linear predictors
#fitted.values: a vector of length N (N is the sample size) containing fitted mean values on the original scale
#Y: a vector of length N (N is the sample size) containing final working vector
#residuals: a vector of length N (N is the sample size) containing residuals on the original scale
#sampleID: a vector of length N (N is the sample size) containing sample IDs used to fit the null model

association result file for the subset of randomly selected markers

less -S ./output/example_quantitative_30markers.SAIGE.results.txt

variance ratio file

less -S ./output/example_quantitative.varianceRatio.txt

Step 2: performing the single-variant association tests

For binary traits, saddle point approximation is used to account for case-control imbalance.
Four different file formats for dosages/genotypes of genetic variants to be tested can be used: VCF, BGEN, SAV, plain text
Conditional analysis based summary stats can be performed (--condition) can be performed in step 2 with dosage/genotype input file formats VCF, BGEN and SAV.
To test a subset of markers,
- both variant IDs and range of chromosome positions can be specified for BGEN input (--idstoExcludeFile, --idstoIncludeFile, --rangestoExcludeFile, --rangestoIncludeFile)
- range chromosome positions can be specified for VCF/SAV input (--chrom, --start, --end).
For VCF/SAV input, --chrom must be specified and the string needs to be exactly the same as in the VCF/SAV, such as "01" or "chr1".
For VCF/SAV input, --vcfField=DS to test dosages and --vcfField=GT to test genotypes
To drop samples with missing genotypes/dosages, --IsDropMissingDosages=TRUE, if FALSE, missing genotypes/dosages will be mean imputed. **NOTE: this option has not been extensively tested yet. **

#check the help info for step 2
Rscript step2_SPAtests.R --help

#perform the single-variant association tests
#for binary traits, --IsOutputAFinCaseCtrl=TRUE can be specified to output allele frequencies in cases and controls
Rscript step2_SPAtests.R	\
        --vcfFile=./input/genotype_10markers.missingness.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

##drop samples with missing genotypes/dosages
##--IsDropMissingDosages=TRUE
Rscript step2_SPAtests.R        \
        --vcfFile=./input/genotype_10markers.missingness.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.dropmissing.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE	\
	--IsDropMissingDosages=TRUE    


#conditional analysis
## --condition = Genetic marker ids (chr:pos_ref/alt for vcf/sav or marker id for bgen) seperated by comma. e.g.chr3:101651171_C/T,chr3:101651186_G/A, Note that currently conditional analysis is only for vcf/sav and bgen input.
Rscript step2_SPAtests.R        \
        --vcfFile=./input/genotype_10markers.missingness.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.dropmissing.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE     \
	--IsDropMissingDosages=TRUE	\
	--condition=1:4_A/C

#plain text input for dosages (plain text input for dosages is not inputed since 0.36.1)
Rscript step2_SPAtests.R \
        --dosageFile=./input/dosage_10markers.txt \
        --dosageFileNrowSkip=1 \
        --dosageFileNcolSkip=6 \
        --dosageFilecolnamesSkip=CHR,SNP,CM,POS,EFFECT_ALLELE,ALT_ALLELE \
        --minMAF=0.0001 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.plainDosage.SAIGE.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

#bgen for dosages
Rscript step2_SPAtests.R \
        --bgenFile=./input/genotype_100markers.bgen \
        --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/samplefileforbgen_10000samples.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.bgen.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

#VCF containing genotypes
Rscript step2_SPAtests.R \
        --vcfFile=./input/genotype_10markers.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.vcf.genotype.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

#VCF containing dosages
Rscript step2_SPAtests.R \
        --vcfFile=./input/dosage_10markers.vcf.gz \
        --vcfFileIndex=./input/dosage_10markers.vcf.gz.tbi \
        --vcfField=DS \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.vcf.dosage.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

# SAV for dosages
Rscript step2_SPAtests.R        \
        --savFile=./input/dosage_10markers.sav  \
        --savFileIndex=./input/dosage_10markers.sav.s1r \
        --vcfField=DS \
        --minMAF=0.0001 \
        --minMAC=1 \
        --chrom=1 \
        --sampleFile=./input/samplefileforbgen_10000samples.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.sav.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

Input files

Dosage file SAIGE can take in several different formats for dosages: plain text (gzipped file is supported), VCF, BCF, BGEN, and SAV.
The example dosage files contain 10 markers for 1,000 samples

plain text file

less -S dosage_10markers.txt

BGEN

genotype_100markers.bgen
genotype_100markers.bgen.bgi

VCF containing genotypes

zcat genotype_10markers.vcf.gz | less -S
genotype_10markers.vcf.gz.tbi

VCF containing dosages

zcat dosage_10markers.vcf.gz | less -S
dosage_10markers.vcf.gz.tbi

SAV

dosage_10markers.sav
dosage_10markers.sav.s1r

Sample file <br> This file contains one column for sample IDs corresponding to the sample order in the dosage file. No header is included. The option was originally for BGEN file that does not contain sample information.

less -S sampleIDindosage.txt

Model file from step 1

./output/example.rda

Variance ratio file from step 1

./output/example.varianceRatio.txt

Output file

A file with association test results

less -S ./output/example.SAIGE.vcf.dosage.txt

NOTE:

If the input format of dosages is plain text. The first several columns for header are provided by the user.
For plain text input for dosage/genotypes, association results are for the allele whose dosages/genotypes are provides. For VCF, BCF, SAV, or BGEN input, association results are for Allele2.
For binary traits, the header of the output file: CHR POS SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar AF.Cases AF.Controls
For quantitative traits, the header of the output file: CHR POS SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value varT varTstar

CHR: chromosome
POS: genome position 
SNPID: variant ID
Allele1: allele 1
Allele2: allele 2
AC_Allele2: allele count of allele 2
AF_Allele2: allele frequency of allele 2
imputationInfo: imputation info. If not in dosage/genotype input file, will output 1
N: sample size
BETA: effect size of allele 2
SE: standard error of BETA
Tstat: score statistic of allele 2
p.value: p value (with SPA applied for binary traits)
p.value.NA: p value when SPA is not applied (only for binary traits)
Is.SPA.converge: whether SPA is converged or not (only for binary traits)
varT: estimated variance of score statistic with sample relatedness incorporated
varTstar: variance of score statistic without sample relatedness incorporated
AF.Cases: allele frequency of allele 2 in cases (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
AF.Controls: allele frequency of allele 2 in controls (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
Tstat_cond, p.value_cond, varT_cond, BETA_cond, SE_cond: summary stats for conditional analysis

Region- or gene-based association tests (SAIGE-GENE)

Step 0: creating a sparse GRM

Note: This step is only needed for region- and gene-based tests (SAIGE-GENE)
The sparse GRM only needs to be created once for each data set, e.g. a biobank, and can be used for all different phenotypes as long as all tested samples are included in it.

Rscript createSparseGRM.R	\
	--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
	--nThreads=4  \
	--outputPrefix=./output/sparseGRM	\
	--numRandomMarkerforSparseKin=2000	\
	--relatednessCutoff=0.125

Input file

Genotype file for constructing the genetic relationship matrix
SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam

nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam

Output files

a file storing the sparse GRM

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx

The sparse GRM file can be opened using the readMM function in the R library Matrix

a file storing IDs of the samples in the sparse GRM

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt

Step 1: fitting the null logistic/linear mixed model

IsSparseKin=TRUE must be specified for SAIGE-GENE
step 1 model result from the single-variant assoc test can be re-used, except that for gene-based tests, variance ratios for multiple MAC categories and a sparse GRM need to be used (IsSparseKin=TRUE).
with IsSparseKin=TRUE, no sparseGRMFile and sparseGRMSampleIDFile are specified (step 0 has not been done beforehand), a sparse GRM will be created based on the relatednessCutoff in this step.
with IsSparseKin=TRUE, sparseGRMFile and sparseGRMSampleIDFile can be used to specify a pre-calcuated sparse GRM and the sample ids for the sparse GRM (output from step 0). Tested samples would be a subset of samples in the pre-calcuated GRM.
To activate the variance ratio estimation based multiple MAC categories, --isCateVarianceRatio=TRUE
cateVarRatioMinMACVecExclude and cateVarRatioMaxMACVecInclude are used to specify the MAC categories

#by default 
 --cateVarRatioMinMACVecExclude=0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5
 --cateVarRatioMaxMACVecInclude=1.5,2.5,3.5,4.5,5.5,10.5,20.5
corresponding to
0.5 < MAC <=  1.5
1.5 < MAC <=  2.5
2.5 < MAC <=  3.5
3.5 < MAC <=  4.5
4.5 < MAC <=  5.5
5.5 < MAC <=  10.5
10.5 < MAC <=  20.5
20.5 < MAC

Rscript step1_fitNULLGLMM.R     \
	--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_quantitative \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=quantitative       \
        --invNormalize=TRUE     \
        --outputPrefix=./output/example_quantitative \
	--outputPrefix_varRatio=./output/example_quantitative_cate	\
	--sparseGRMFile=./output/example_binary_cate.varianceRatio.txt.sparseGRM.mtx    \
        --sparseGRMSampleIDFile=./output/example_binary.varianceRatio.txt.sparseGRM.mtx.sample  \
        --nThreads=4 \
        --LOCO=FALSE	\
	--skipModelFitting=FALSE \
        --IsSparseKin=TRUE      \
        --isCateVarianceRatio=TRUE

Input files

(same as step 1 for single-variant assoc tests and step 0)
Genotype file for constructing the genetic relationship matrix in the plink format
a file storing the sparse GRM (optional, output by step 0)

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx

The sparse GRM file can be opened using the readMM function in the R library Matrix

a file storing IDs of the samples in the sparse GRM (optional, output by step 0)

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt

Output files

same as the step 1 output by SAIGE for single-variant association tests

model file
association result file for the subset of randomly selected markers
variance ratio file

specific to SAIGE-GENE

sparse Sigma file

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx

NOTE: the file contains the sparse Sigma matrix, which is ** NOT ** the sparse GRM. The sparse Sigma matrix is computed based on the sparse GRM and the results of step 1.

Step 2: performing the region- or gene-based association tests

The command line is the same as the step 2 for single-variant assoc tests, except that a group file is specified (--groupFile)
Each line is for one gene/set of variants. The first element is for gene/set name.
The rest of the line is for variant ids included in this gene/set. For vcf/sav, the genetic marker ids are in the format chr:pos_ref/alt. For bgen, the genetic marker ids should match the ids in the bgen file. Each element in the line is separated by tab.
IsSingleVarinGroupTest=TRUE is to perform single-variant assoc tests as well for markers included in the gene-based tests
Same as the single-variant assoc tests, conditional analysis based summary stats can be performed (--condition) can be performed in step 2 with dosage/genotype input file formats VCF, BGEN and SAV.

Rscript step2_SPAtests.R \
        --vcfFile=./input/genotype_10markers.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
	--minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --sampleFile=./input/samplelist.txt \
        --GMMATmodelFile=./output/example_quantitative.rda \
        --varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_quantitative.SAIGE.gene.txt \
        --numLinesOutput=1 \
        --groupFile=./input/groupFile_geneBasedtest_simulation.txt    \
        --sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx       \
        --IsSingleVarinGroupTest=TRUE


##another example, conditional analysis for gene-based tests
Rscript step2_SPAtests.R \
        --vcfFile=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav \
        --vcfFileIndex=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav.s1r \
        --vcfField=DS \
        --chrom=chr1 \
        --minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --sampleFile=./input/samplelist.txt \
        --GMMATmodelFile=./output/example_quantitative.rda \
        --varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
	--SAIGEOutputFile=./output/example_quantitative.SAIGE.gene_conditional.txt \
        --numLinesOutput=1 \
        --groupFile=./input/groupFile_geneBasedtest.txt    \
        --sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx       \
        --IsOutputAFinCaseCtrl=TRUE     \
        --IsSingleVarinGroupTest=TRUE   \
	--condition=chr1:32302_A/C 


##Specify customized weights for markers in the gene- or region-based tests
* weightsIncludeinGroupFile logical. Whether to specify customized weight for makers in gene- or region-based tests. If TRUE, weights are included in the group file. For vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab. By default, FALSE
* weights_for_G2_cond vector of float. weights for conditioning markers for gene- or region-based tests. The length equals to the number of conditioning markers, delimited by comma.

Rscript step2_SPAtests.R \
        --vcfFile=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav \
        --vcfFileIndex=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav.s1r \
        --vcfField=DS \
        --chrom=chr1 \
        --minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --sampleFile=./input/samplelist.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary_cate_v2.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.gene_conditional_withspecifiedWeights.txt \
        --numLinesOutput=1 \
        --groupFile=./input/groupFile_geneBasedtest_withWeights.txt    \
        --sparseSigmaFile=./output/example_binary_cate_v2.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx       \
        --IsOutputAFinCaseCtrl=TRUE     \
        --IsSingleVarinGroupTest=TRUE   \
        --IsOutputPvalueNAinGroupTestforBinary=TRUE     \
        --IsAccountforCasecontrolImbalanceinGroupTest=TRUE      \
        --weightsIncludeinGroupFile=TRUE        \
        --weights_for_G2_cond=3,1       \
        --condition=chr1:32302_A/C,chr1:32304_A/C

Input files

same as step 1 input by SAIGE for single-variant association tests

Dosage file
SAIGE-GENE can take in VCF, BCF, BGEN, and SAV. The plain text dosage/genotype input does not work.
Sample file
This file contains one column for sample IDs corresponding to the sample order in the dosage file. No header is included.

less -S sampleIDindosage.txt

Model file from step 1

./output/example.rda

Variance ratio file from step 1

./output/example.varianceRatio.txt

specific to SAIGE-GENE

Group file
Each line is for one gene/set of variants. The first element is for gene/set name.
The rest of the line is for variant ids included in this gene/set. For vcf/sav, the genetic marker ids are in the format chr:pos_ref/alt. For bgen, the genetic marker ids should match the ids in the bgen file. Each element in the line is separated by tab.
** Note that the order of variants in the group file needs to be consistent to variants' order in the dosage file ** To specify customized weights for variants, set weightsIncludeinGroupFile=TRUE and in the group file, for vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab.**
** If weightsIncludeinGroupFile=TRUE, in conditional analysis, weights for conditioning markers need to be specified using the argument weights_for_G2_cond, e.g. weights_for_G2_cond=1,3,4. The weights in --weights_for_G2_cond are for markers in --condition, respectively.**

less -S ./input/groupFile_geneBasedtest.txt

sparse Sigma file

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx

Output files

A file with region- or gene-based association test results

less -S ./output/example_quantitative.SAIGE.gene_conditional.txt

NOTE:

The header of the output file: Gene Pvalue Pvalue_cond Nmarker_MACCate_1 Nmarker_MACCate_2 Nmarker_MACCate_3 Nmarker_MACCate_4 Nmarker_MACCate_5 Nmarker_MACCate_6 Nmarker_MACCate_7 Nmarker_MACCate_8 markerIDs markerAFs Pvalue_Burden Pvalue_SKAT Pvalue_Burden_cond Pvalue_SKAT_cond

Gene : Gene name (as provided in the group file)
Pvalue: p-value of SKAT-O test
Pvalue_cond: conditional p-value of SKAT-O test (if --condition= is specified)
Nmarker_MACCate_n: number of markers in the gene falling in the nth MAC category (The MAC category is corresponding to the one used for the variance ratio estimation). For example, by default, Nmarker_MACCate_1 is the number of singletons
markerIDs: IDs for variants included in the test
markerAFs: allele frequencies for variants included in the test
Pvalue_Burden: p-value of Burden test
Pvalue_SKAT: p-value of SKAT test
Pvalue_Burden_cond: conditional p-value of Burden test (if --condition= is specified)
Pvalue_SKAT_cond: conditional p-value of SKAT test (if --condition= is specified)
Pvalue_SKAT.NA: p-values of SKAT test without accounting for unbalanced case-control ratios for binary traits
Pvalue.NA: p-values of SKAT-O test without accounting for unbalanced case-control ratios for binary traits
Pvalue_Burden.NA: p-values of Burden test without accounting for unbalanced case-control ratios for binary traits

A file with single-variant association test results for genetic variants included in the gene-based tests (if --IsSingleVarinGroupTest=TRUE)

less -S ./output/example_quantitative.SAIGE.gene_conditional.txt_single

Same as the output by SAIGE for single-variant association tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Genetic association tests using SAIGE

Overview

Frequently asked questions

Installing SAIGE

Flowchart

SAIGE

SAIGE-GENE

Running SAIGE and SAIGE-GENE

Single variant association tests

Step 1: fitting the null logistic/linear mixed model

Input files

Output files

Step 2: performing the single-variant association tests

Input files

Output file

Region- or gene-based association tests (SAIGE-GENE)

Step 0: creating a sparse GRM

Input file

Output files

Step 1: fitting the null logistic/linear mixed model

Input files

Output files

Step 2: performing the region- or gene-based association tests

Input files

Output files

Clone this wiki locally