-
Notifications
You must be signed in to change notification settings - Fork 74
Genetic association tests using SAIGE
- Overview
- Frequently asked questions
- Installing SAIGE
- Flowchart
- Running SAIGE and SAIGE-GENE
The most current version of SAIGE is version 0.38
This tutorial will walk you through
- Installing SAIGE
- Running SAIGE for
- Single-variant association tests
- Gene-based tests
- Conditional analysis
Workflow of these tests are almost the same. They use the same step 1 and step 2 functions of SAIGE. We are going to use the example data in the extdata folder in the github page.
- Which version of SAIGE/SAIGE-GENE should I use?
- We are continuously adding more options to meet the needs of data analysis by users and fixing bugs, so we recommend always trying the newest version in the master branch. The docker hub will be always updated when there is new version.
- What is the easiest way to install SAIGE/SAIGE-GENE?
- The easiest way to use SAIGE/SAIGE-GENE is using the docker image (approach 4). If docker cannot be used in your machine, please try the installation using a conda env (approach 5) https://github.com/weizhouUMICH/SAIGE/wiki/Genetic-association-tests-using-SAIGE/_edit#installing-saige
- Can SAIGE/SAIGE-GENE handle samples from different ancestry?
- We haven't throughly tested the performance of SAIGE on multi-ethnic samples. SAIGE uses a random effect to account for sample relatedness/population sub-stratification through the genetic relationship matrix. For multi-ancestry samples (such as 1000 genome) you would need to include PCs to adjust for population stratification. Using both (mixed effect model + PC as covariates) would be the best approach.
- For Step 1, how many markers need to be included in the plink file?
-
To fit the null logistic mixed model using fitNULLGLMM (step 1), ideally the number of independent markers for constructing GRM is larger than the sample size. In our previous GWAS for the UK Biobank, we used 93k markers that were pruned by UKB for kinship estimation while the sample size is ~400k (white British samples). The results look fine. We then compared the P-values from the previous analysis with 93k markers for GRM and with 340k markers for GRM for four phenotypes with different prevalences as shown in the Supplementary Figure 17 in the SAIGE paper https://www.ncbi.nlm.nih.gov/pubmed/30104761. We see for phenotypes such as CAD, the inflation of p-values was further corrected when 340k markers were used for GRM.
-
The option minMAFforGRM (default, 0.01) is used for set the minimum MAF for markers used in the plink file to estimate the GRM for step 1. Only markers with MAF >= minMAFforGRM will be used to estimate the GRM.
-
For gene-based tests, the same plink file should be used for both Step 0 and Step 1. This means, Step 0 creates a sparse version of the "full" GRM that is used in Step 1. The problem is that for gene-based tests, variance ratios are estimated for rare variants and common variants and for each MAC category that is specified, 30 markers will be randomly selected first and increase by 10 until the variance ratio estimation is stable. Therefore, rare variants need to be included in the plink file, although they are not used for GRM, because only markers with MAF >= minMAFforGRM will be used for GRM and the default value is 0.01. In the example of the gene-based tests for UK Biobank, you may include all rare variants with MAF < 1% and those 93k markers in the plink file for Step 0 and Step 1.
- Does LOCO=TRUE work?
- Version 0.36.5 or later works for LOCO=TRUE
- Does SAIGE/SAIGE-GENE requires that samples in Step 1 have non-missing dosages in Step 2?
- Yes. Current version of SAIGE/SAIGE-GENE requires that all samples used in Step 1 have to be in the bgen/vcf file used in Step 2.
- For Step 2, how does SAIGE/SAIGE-GENE handle missing dosages/genotypes?
- To drop samples with missing genotypes/dosages, --IsDropMissingDosages=TRUE, if FALSE, missing genotypes/dosages will be mean imputed.
- For Step 2, what options can be used to filter out variants based on the MAF and imputation scores?
- minMAF and minMAC (default 0.5) can be used to filter variants based on the MAF. The higher threshold between minMAC and minMAF will be used.
- minInfo can be used to filter variants based on the imputation score. ** Please note that the saddlepoint approximation used in SAIGE for single-variant association tests does not work for variants with MAC <= 3 (usually gives extremely significant P-values). For those extremely rare variants, please use SAIGE-GENE to tests them. **
- How does SAIGE/SAIGE-GENE handle multi-allelic variants in VCF files?
SAIGE/SAIGE-GENE uses the savvy library to handle VCF files. For the multi-allelic variants, genotypes/dosages for all alt alleles in one line in VCF files is not supported. The best solution is to split the variants by alleles.
- How to extract the "heritability" estimated by SAIGE/SAIGE-GENE?
Tau is a vector with 2 elements. The first element is for the variance component parameter for the error term and the second one is for the GRM (genetic relationship matrix). Tau can be extracted from the null model of SAIGE results by R load("model.rda"); tau = modglmm$theta
For quantitative traits from the linear mixed model, h2 = tau[2]/(tau[1]+tau[2])
For binary traits from the logistic mixed model (tau[1] is always 1), h2_liability = tau[2]/(tau[2]+pi^2/3)
. But note that the heritability is the point estimate for proportion of variance of the phenotype explained by the GRM, which is not equal to the heritability explained using LDSC. Also, we have noticed that the h2 estimate for binary traits by SAIGE is underestimated and the penalized quasi-likelihood used in SAIGE is known to be biased for heritability estimation but it works well for adjusting for sample-relatedness.
- R-3.6.1, gcc >= 5.4.0, cmake 3.14.1, cget
- R packages: "R.utils", "Rcpp", "RcppParallel", "RcppArmadillo", "data.table", "RcppEigen", "Matrix", "methods", "BH", "optparse", "SPAtest", "SKAT","MetaSKAT"
- /extdata/install_packages.R can be used to install the R packages
-
Create a conda environment using (conda environment file) Here is a link to download the conda environment file
After downloading environment-RSAIGE.yml, run following command
conda env create -f environment-RSAIGE.yml
-
Activate the conda environment RSAIGE
conda activate RSAIGE FLAGPATH=`which python | sed 's|/bin/python$||'` export LDFLAGS="-L${FLAGPATH}/lib" export CPPFLAGS="-I${FLAGPATH}/include"
Please make sure to set up the LDFLAGS and CPPFLAGS using export (the last two command lines), so libraries can be linked correctly when the SAIGE source code is compiled. Note: Here are the steps to create the conda environment file
-
Open R, run following script to install the MetaSKAT R library.
devtools::install_github("leeshawn/MetaSKAT")
-
Install SAIGE from the source code.
Method 1:
src_branch=master repo_src_url=https://github.com/weizhouUMICH/SAIGE git clone --depth 1 -b $src_branch $repo_src_url R CMD INSTALL --library=path_to_final_SAIGE_library SAIGE
When call SAIGE in R, set lib.loc=path_to_final_SAIGE_library
library(SAIGE, lib.loc=path_to_final_SAIGE_library)
Method 2:
Open R. Run
devtools::install_github("weizhouUMICH/SAIGE")
Thanks to Juha Karjalainen for sharing the Dockerfile. The docker image can be pulled
docker pull wzhou88/saige:0.38
Dockerfile for creating your own docker image for SAIGE
Functions can be called
step1_fitNULLGLMM.R --help
step2_SPAtests.R --help
createSparseGRM.R --help
- Previous versions of SAIGE installation files are here https://www.dropbox.com/sh/zmlu1llpxd66pjl/AADFqdssvOBjbWZch6Q9zYNaa?dl=0
SAIGE and SAIGE-GENE can work for both binary and quantitative traits.
Scripts are in the folder extdata/ and command lines are in the file cmd.sh
cd /extdata/
less -S cmd.sh
Input files are in the folder extdata/input
Output files are in the folder extdata/output
To obtain help information of the scripts that call functions in the SAIGE library
Rscript createSparseGRM.R --help
Rscript step1_fitNULLGLMM.R --help
Rscript step2_SPAtests.R --help
To obtain help information of functions in the SAIGE library
#open R
R
library(SAIGE)
?createSparseGRM
?fitNULLGLMM
?SPAGMMATtest
For single-variant association tests, sparse GRM and categorical variance ratios are NOT needed. Randomly selected markers with MAC >= 20 are used to estimate the variance ratio
- For binary traits, a null logistic mixed model will be fitted (--traitType=binary).
- For quantitative traits, a null linear mixed model will be fitted (--traitType=quantitative) and needs to be inverse normalized (--invNormalize=TRUE)
#check the help info for step 1
Rscript step1_fitNULLGLMM.R --help
#For Binary traits:
Rscript step1_fitNULLGLMM.R \
--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
--phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
--phenoCol=y_binary \
--covarColList=x1,x2 \
--sampleIDColinphenoFile=IID \
--traitType=binary \
--outputPrefix=./output/example_binary \
--nThreads=4 \
--LOCO=FALSE \
--IsOverwriteVarianceRatioFile ## v0.38. Whether to overwrite the variance ratio file if the file already exists
#For Quantitative traits, if not normally distributed, inverse normalization needs to be specified to be TRUE --invNormalize=TRUE
Rscript step1_fitNULLGLMM.R \
--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
--phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
--phenoCol=y_quantitative \
--covarColList=x1,x2 \
--sampleIDColinphenoFile=IID \
--traitType=quantitative \
--invNormalize=TRUE \
--outputPrefix=./output/example_quantitative \
--nThreads=4 \
--LOCO=FALSE \
--tauInit=1,0
- Genotype file for constructing the genetic relationship matrix (GRM)
SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam
- Phenotype file (contains non-genetic covariates if any, such as gender and age) The file can be either space or tab-delimited with a header. It is required that the file contains one column for sample IDs and one column for the phenotype. It may contain columns for non-genetic covariates.
Note: Current version of SAIGE does not support categorical covariates that have more than two categories
less -S ./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt
- model file
./output/example_quantitative.rda
#open R
R
#load the model file in R
load("./output/example_quantitative.rda")
names(modglmm)
modglmm$theta
#theta: a vector of length 2. The first element is the dispersion parameter estimate and the second one is the variance component parameter estimate
#coefficients: fixed effect parameter estimates
#linear.predictors: a vector of length N (N is the sample size) containing linear predictors
#fitted.values: a vector of length N (N is the sample size) containing fitted mean values on the original scale
#Y: a vector of length N (N is the sample size) containing final working vector
#residuals: a vector of length N (N is the sample size) containing residuals on the original scale
#sampleID: a vector of length N (N is the sample size) containing sample IDs used to fit the null model
- association result file for the subset of randomly selected markers
less -S ./output/example_quantitative_30markers.SAIGE.results.txt
- variance ratio file
less -S ./output/example_quantitative.varianceRatio.txt
- For binary traits, saddle point approximation is used to account for case-control imbalance.
- File formats for dosages/genotypes of genetic variants to be tested can be used: VCF, BGEN, SAV
- Conditional analysis based summary stats can be performed (--condition) can be performed in Step 2 with dosage/genotype input file formats VCF, BGEN and SAV.
- To query and test a subset of markers
-
- both variant IDs and range of chromosome positions can be specified for BGEN input (--idstoExcludeFile, --idstoIncludeFile, --rangestoExcludeFile, --rangestoIncludeFile)
-
- range chromosome positions can be specified for VCF/SAV input (--chrom, --start, --end).
- For VCF/SAV input, --chrom MUST be specified and the string needs to be exactly the same as in the VCF/SAV, such as "01" or "chr1".
- For VCF/SAV input, --vcfField=DS to test dosages and --vcfField=GT to test genotypes
- To drop samples with missing genotypes/dosages, --IsDropMissingDosages=TRUE, if FALSE, missing genotypes/dosages will be mean imputed. **NOTE: this option has not been extensively tested yet. **
- sampleFile is used specify a file with sample IDs for bgen file. Please DO NOT include a header in the file. SAIGE versions >= 0.38 do not need sampleFile if VCF files are used
#check the help info for step 2
Rscript step2_SPAtests.R --help
#Perform single-variant association tests
#for binary traits,
* --IsOutputAFinCaseCtrl=TRUE can be specified to output allele frequencies in cases and controls
* --IsOutputNinCaseCtrl=TRUE can be specified to output sample sizes of cases and controls for each marker
* --IsOutputHetHomCountsinCaseCtrl can be specified to output heterozygous and homozygous counts in cases and controls
Rscript step2_SPAtests.R \
--vcfFile=./input/genotype_10markers.missingness.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example_binary.rda \
--varianceRatioFile=./output/example_binary.varianceRatio.txt \
--SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
##drop samples with missing genotypes/dosages
##--IsDropMissingDosages=TRUE
Rscript step2_SPAtests.R \
--vcfFile=./input/genotype_10markers.missingness.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example_binary.rda \
--varianceRatioFile=./output/example_binary.varianceRatio.txt \
--SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.dropmissing.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE \
--IsDropMissingDosages=TRUE
#conditional analysis
## --condition = Genetic marker ids (**chr:pos_ref/alt for vcf/sav or marker id for bgen**) separated by comma. e.g.chr3:101651171_C/T,chr3:101651186_G/A
Rscript step2_SPAtests.R \
--vcfFile=./input/genotype_10markers.missingness.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example_binary.rda \
--varianceRatioFile=./output/example_binary.varianceRatio.txt \
--SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.dropmissing.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE \
--IsDropMissingDosages=TRUE \
--condition=1:4_A/C
#plain text input for dosages (plain text input for dosages is not inputed since 0.36.1)
Rscript step2_SPAtests.R \
--dosageFile=./input/dosage_10markers.txt \
--dosageFileNrowSkip=1 \
--dosageFileNcolSkip=6 \
--dosageFilecolnamesSkip=CHR,SNP,CM,POS,EFFECT_ALLELE,ALT_ALLELE \
--minMAF=0.0001 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example.rda \
--varianceRatioFile=./output/example.varianceRatio.txt \
--SAIGEOutputFile=./output/example.plainDosage.SAIGE.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
#bgen for dosages
Rscript step2_SPAtests.R \
--bgenFile=./input/genotype_100markers.bgen \
--bgenFileIndex=./input/genotype_100markers.bgen.bgi \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/samplefileforbgen_10000samples.txt \
--GMMATmodelFile=./output/example.rda \
--varianceRatioFile=./output/example.varianceRatio.txt \
--SAIGEOutputFile=./output/example.SAIGE.bgen.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
#VCF containing genotypes (--vcfField=GT)
Rscript step2_SPAtests.R \
--vcfFile=./input/genotype_10markers.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example.rda \
--varianceRatioFile=./output/example.varianceRatio.txt \
--SAIGEOutputFile=./output/example.SAIGE.vcf.genotype.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
#VCF containing dosages (--vcfField=DS)
Rscript step2_SPAtests.R \
--vcfFile=./input/dosage_10markers.vcf.gz \
--vcfFileIndex=./input/dosage_10markers.vcf.gz.tbi \
--vcfField=DS \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--sampleFile=./input/sampleIDindosage.txt \
--GMMATmodelFile=./output/example.rda \
--varianceRatioFile=./output/example.varianceRatio.txt \
--SAIGEOutputFile=./output/example.SAIGE.vcf.dosage.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
# SAV for dosages
Rscript step2_SPAtests.R \
--savFile=./input/dosage_10markers.sav \
--savFileIndex=./input/dosage_10markers.sav.s1r \
--vcfField=DS \
--minMAF=0.0001 \
--minMAC=1 \
--chrom=1 \
--sampleFile=./input/samplefileforbgen_10000samples.txt \
--GMMATmodelFile=./output/example.rda \
--varianceRatioFile=./output/example.varianceRatio.txt \
--SAIGEOutputFile=./output/example.SAIGE.sav.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE
- BGEN
genotype_100markers.bgen
genotype_100markers.bgen.bgi
- VCF containing genotypes
zcat genotype_10markers.vcf.gz | less -S
genotype_10markers.vcf.gz.tbi
- VCF containing dosages
zcat dosage_10markers.vcf.gz | less -S
dosage_10markers.vcf.gz.tbi
- SAV
dosage_10markers.sav
dosage_10markers.sav.s1r
- Sample file <br> This file contains one column for sample IDs corresponding to the sample order in the dosage file. No header is included. The option was originally for BGEN file that does not contain sample information.
less -S sampleIDindosage.txt
- Model file from step 1
./output/example.rda
- Variance ratio file from step 1
./output/example.varianceRatio.txt
A file with association test results
less -S ./output/example.SAIGE.vcf.dosage.txt
NOTE:
- Association results are with regard to Allele2.
- For binary traits, the header of the output file:
CHR POS SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar AF.Cases AF.Controls
- For quantitative traits, the header of the output file:
CHR POS SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value varT varTstar
CHR: chromosome
POS: genome position
SNPID: variant ID
Allele1: allele 1
Allele2: allele 2
AC_Allele2: allele count of allele 2
AF_Allele2: allele frequency of allele 2
imputationInfo: imputation info. If not in dosage/genotype input file, will output 1
N: sample size
BETA: effect size of allele 2
SE: standard error of BETA
Tstat: score statistic of allele 2
p.value: p value (with SPA applied for binary traits)
p.value.NA: p value when SPA is not applied (only for binary traits)
Is.SPA.converge: whether SPA is converged or not (only for binary traits)
varT: estimated variance of score statistic with sample relatedness incorporated
varTstar: variance of score statistic without sample relatedness incorporated
AF.Cases: allele frequency of allele 2 in cases (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
AF.Controls: allele frequency of allele 2 in controls (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
Tstat_cond, p.value_cond, varT_cond, BETA_cond, SE_cond: summary stats for conditional analysis
- p.value in Step 2 for the marker is ~4.16 x 10^-7
Rscript step1_fitNULLGLMM.R \
--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
--phenoFile=./input/Prev_0.1_nfam_1000.pheno_positive_pheno.txt \
--phenoCol=y \
--covarColList=x1,x2 \
--sampleIDColinphenoFile=IID \
--traitType=binary \
--outputPrefix=./output/example_binary_positive_signal \
--nThreads=4 \
--LOCO=FALSE \
--minMAFforGRM=0.01
Rscript step2_SPAtests.R \
--vcfFile=./input/nfam_1000_MAF0.2_nMarker1_nseed200.vcf.gz \
--vcfFileIndex=./input/nfam_1000_MAF0.2_nMarker1_nseed200.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0.0001 \
--minMAC=1 \
--GMMATmodelFile=./output/example_binary_positive_signal.rda \
--varianceRatioFile=./output/example_binary_positive_signal.varianceRatio.txt \
--SAIGEOutputFile=./output/example_binary_positive_signal.assoc.step2.txt \
--numLinesOutput=2 \
--IsOutputAFinCaseCtrl=TRUE \
Note: This step is only needed for region- and gene-based tests (SAIGE-GENE)
The sparse GRM only needs to be created once for each data set, e.g. a biobank, and can be used for all different phenotypes as long as all tested samples are included in it.
Rscript createSparseGRM.R \
--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
--nThreads=4 \
--outputPrefix=./output/sparseGRM \
--numRandomMarkerforSparseKin=2000 \
--relatednessCutoff=0.125
Genotype file for constructing the genetic relationship matrix
SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam
- a file storing the sparse GRM
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx
The sparse GRM file can be opened using the readMM function in the R library Matrix
- a file storing IDs of the samples in the sparse GRM
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt
- IsSparseKin=TRUE must be specified for SAIGE-GENE
- step 1 model result from the single-variant assoc test can be re-used, except that for gene-based tests, variance ratios for multiple MAC categories and a sparse GRM need to be used (IsSparseKin=TRUE).
- with IsSparseKin=TRUE, no sparseGRMFile and sparseGRMSampleIDFile are specified (step 0 has not been done beforehand), a sparse GRM will be created based on the relatednessCutoff in this step.
- with IsSparseKin=TRUE, sparseGRMFile and sparseGRMSampleIDFile can be used to specify a pre-calcuated sparse GRM and the sample ids for the sparse GRM (output from step 0). Tested samples would be a subset of samples in the pre-calcuated GRM.
- To activate the variance ratio estimation based multiple MAC categories, --isCateVarianceRatio=TRUE
- cateVarRatioMinMACVecExclude and cateVarRatioMaxMACVecInclude are used to specify the MAC categories
#by default
--cateVarRatioMinMACVecExclude=0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5
--cateVarRatioMaxMACVecInclude=1.5,2.5,3.5,4.5,5.5,10.5,20.5
corresponding to
0.5 < MAC <= 1.5
1.5 < MAC <= 2.5
2.5 < MAC <= 3.5
3.5 < MAC <= 4.5
4.5 < MAC <= 5.5
5.5 < MAC <= 10.5
10.5 < MAC <= 20.5
20.5 < MAC
Rscript step1_fitNULLGLMM.R \
--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
--phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
--phenoCol=y_quantitative \
--covarColList=x1,x2 \
--sampleIDColinphenoFile=IID \
--traitType=quantitative \
--invNormalize=TRUE \
--outputPrefix=./output/example_quantitative \
--outputPrefix_varRatio=./output/example_quantitative_cate \
--sparseGRMFile=./output/example_binary_cate.varianceRatio.txt.sparseGRM.mtx \
--sparseGRMSampleIDFile=./output/example_binary.varianceRatio.txt.sparseGRM.mtx.sample \
--nThreads=4 \
--LOCO=FALSE \
--skipModelFitting=FALSE \
--IsSparseKin=TRUE \
--isCateVarianceRatio=TRUE
-
(same as step 1 for single-variant assoc tests and step 0)
Genotype file for constructing the genetic relationship matrix in the plink format -
a file storing the sparse GRM (optional, output by step 0)
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx
The sparse GRM file can be opened using the readMM function in the R library Matrix
- a file storing IDs of the samples in the sparse GRM (optional, output by step 0)
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt
- same as the step 1 output by SAIGE for single-variant association tests
- model file
- association result file for the subset of randomly selected markers
- variance ratio file
- specific to SAIGE-GENE
- sparse Sigma file
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx
NOTE: the file contains the sparse Sigma matrix, which is ** NOT ** the sparse GRM. The sparse Sigma matrix is computed based on the sparse GRM and the results of step 1.
- The command line is the same as the step 2 for single-variant assoc tests, except that a group file is specified (--groupFile)
- Each line is for one gene/set of variants. The first element is for gene/set name.
- The rest of the line is for variant ids included in this gene/set. For vcf/sav, the genetic marker ids are in the format chr:pos_ref/alt. For bgen, the genetic marker ids should match the ids in the bgen file. Each element in the line is separated by tab. The marker ids in the group file for vcf/sav need to be sorted by chr and pos.
- IsSingleVarinGroupTest=TRUE is to perform single-variant association tests as well for markers included in the gene-based tests
- --IsOutputBETASEinBurdenTest is to output effect sizes for burden tests (this option is still under development)
- Same as the single-variant association tests, conditional analysis based summary stats can be performed (--condition) can be performed in step 2 with dosage/genotype input file formats VCF, BGEN and SAV.
Rscript step2_SPAtests.R \
--vcfFile=./input/genotype_10markers.vcf.gz \
--vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
--vcfField=GT \
--chrom=1 \
--minMAF=0 \
--minMAC=0.5 \
--maxMAFforGroupTest=0.01 \
--sampleFile=./input/samplelist.txt \
--GMMATmodelFile=./output/example_quantitative.rda \
--varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
--SAIGEOutputFile=./output/example_quantitative.SAIGE.gene.txt \
--numLinesOutput=1 \
--groupFile=./input/groupFile_geneBasedtest_simulation.txt \
--sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx \
--IsSingleVarinGroupTest=TRUE
##another example, conditional analysis for gene-based tests
Rscript step2_SPAtests.R \
--vcfFile=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav \
--vcfFileIndex=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav.s1r \
--vcfField=DS \
--chrom=chr1 \
--minMAF=0 \
--minMAC=0.5 \
--maxMAFforGroupTest=0.01 \
--sampleFile=./input/samplelist.txt \
--GMMATmodelFile=./output/example_quantitative.rda \
--varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
--SAIGEOutputFile=./output/example_quantitative.SAIGE.gene_conditional.txt \
--numLinesOutput=1 \
--groupFile=./input/groupFile_geneBasedtest.txt \
--sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx \
--IsOutputAFinCaseCtrl=TRUE \
--IsSingleVarinGroupTest=TRUE \
--condition=chr1:32302_A/C
##Specify customized weights for markers in the gene- or region-based tests
* weightsIncludeinGroupFile logical. Whether to specify customized weight for makers in gene- or region-based tests. If TRUE, weights are included in the group file. For vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab. By default, FALSE
* weights_for_G2_cond vector of float. weights for conditioning markers for gene- or region-based tests. The length equals to the number of conditioning markers, delimited by comma.
Rscript step2_SPAtests.R \
--vcfFile=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav \
--vcfFileIndex=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav.s1r \
--vcfField=DS \
--chrom=chr1 \
--minMAF=0 \
--minMAC=0.5 \
--maxMAFforGroupTest=0.01 \
--sampleFile=./input/samplelist.txt \
--GMMATmodelFile=./output/example_binary.rda \
--varianceRatioFile=./output/example_binary_cate_v2.varianceRatio.txt \
--SAIGEOutputFile=./output/example_binary.SAIGE.gene_conditional_withspecifiedWeights.txt \
--numLinesOutput=1 \
--groupFile=./input/groupFile_geneBasedtest_withWeights.txt \
--sparseSigmaFile=./output/example_binary_cate_v2.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx \
--IsOutputAFinCaseCtrl=TRUE \
--IsSingleVarinGroupTest=TRUE \
--IsOutputPvalueNAinGroupTestforBinary=TRUE \
--IsAccountforCasecontrolImbalanceinGroupTest=TRUE \
--weightsIncludeinGroupFile=TRUE \
--weights_for_G2_cond=3,1 \
--condition=chr1:32302_A/C,chr1:32304_A/C
-
Sample file
This file contains one column for sample IDs corresponding to the sample order in the dosage file. No header is included.
less -S sampleIDindosage.txt
- Model file from step 1
./output/example.rda
- Variance ratio file from step 1
./output/example.varianceRatio.txt
- specific to SAIGE-GENE
- Group file
Each line is for one gene/set of variants. The first element is for gene/set name.
The rest of the line is for variant ids included in this gene/set. For vcf/sav, the genetic marker ids are in the format chr:pos_ref/alt. For bgen, the genetic marker ids should match the ids in the bgen file. Each element in the line is separated by tab.
** Note that the order of variants in the group file needs to be consistent to variants' order in the dosage file ** To specify customized weights for variants, set weightsIncludeinGroupFile=TRUE and in the group file, for vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab.**
** If weightsIncludeinGroupFile=TRUE, in conditional analysis, weights for conditioning markers need to be specified using the argument weights_for_G2_cond, e.g. weights_for_G2_cond=1,3,4. The weights in --weights_for_G2_cond are for markers in --condition, respectively.**
less -S ./input/groupFile_geneBasedtest.txt
- sparse Sigma file
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx
- A file with region- or gene-based association test results
less -S ./output/example_quantitative.SAIGE.gene_conditional.txt
NOTE:
- The header of the output file:
Gene Pvalue Pvalue_cond Nmarker_MACCate_1 Nmarker_MACCate_2 Nmarker_MACCate_3 Nmarker_MACCate_4 Nmarker_MACCate_5 Nmarker_MACCate_6 Nmarker_MACCate_7 Nmarker_MACCate_8 markerIDs markerAFs Pvalue_Burden Pvalue_SKAT Pvalue_Burden_cond Pvalue_SKAT_cond
Gene : Gene name (as provided in the group file)
Pvalue: p-value of SKAT-O test
Pvalue_cond: conditional p-value of SKAT-O test (if --condition= is specified)
Nmarker_MACCate_n: number of markers in the gene falling in the nth MAC category (The MAC category is corresponding to the one used for the variance ratio estimation). For example, by default, Nmarker_MACCate_1 is the number of singletons
markerIDs: IDs for variants included in the test
markerAFs: allele frequencies for variants included in the test
Pvalue_Burden: p-value of Burden test
Pvalue_SKAT: p-value of SKAT test
Pvalue_Burden_cond: conditional p-value of Burden test (if --condition= is specified)
Pvalue_SKAT_cond: conditional p-value of SKAT test (if --condition= is specified)
Pvalue_SKAT.NA: p-values of SKAT test without accounting for unbalanced case-control ratios for binary traits
Pvalue.NA: p-values of SKAT-O test without accounting for unbalanced case-control ratios for binary traits
Pvalue_Burden.NA: p-values of Burden test without accounting for unbalanced case-control ratios for binary traits
- A file with single-variant association test results for genetic variants included in the gene-based tests (if --IsSingleVarinGroupTest=TRUE)
less -S ./output/example_quantitative.SAIGE.gene_conditional.txt_single
Same as the output by SAIGE for single-variant association tests