R/tutorials/scLVM_vignette.Rmd

---
title: "An R package for scLVM"
author: "Florian Buettner,  F. Paolo Casale and Oliver Stegle"  
output: 
  html_document:
    toc: true
    toc_depth: 2
---
# What is scLVM?

scLVM is a modelling framework for single-cell RNA-seq data that can be used to dissect the observed heterogeneity into different sources, thereby allowing for the correction of confounding sources of variation.

scLVM was primarily designed to account for cell-cycle induced variations in single-cell RNA-seq data where cell cycle is the primary soure of variability.

Software by Florian Buettner, Paolo Casale and Oliver Stegle. scLVM is explained in more detail in the accompanying publication:

Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC & Stegle O, 2015. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-Sequencing data reveals hidden subpopulation of cells, Nature Biotechnology, doi: 10.1038/nbt.3102.


# Data preparation and pre-processing

In the following exmaple script we illustrate how the pre-processing was performed for the T-cell data.
First, we need to load some required packages. 

```{r,message=FALSE}
library(genefilter)
library(statmod)
require(ggplot2)
library(gplots)
require(DESeq2)
library(scLVM)


#We also need to set the limix path; if you installed, you can delete the last 2 lines, otherwise you need to adapt the path.
limix_path = '/Users/flo/software/limix-master/build/release.darwin/interfaces/python'
configLimix(limix_path)

```

Now, we load he data frames containing the mapped read counts for the 81 T-cells described in the paper.
```{r}
data(data_Tcells)
help(data_Tcells)
```

Next, we look for the spike-ins which we then use to normalise the data. In general, there data can be normalized either with a size factor derived from the endogenous genes (normalising for cell size) or a size factor derived from the ERCC spike-ins (only normalising for sequencing depth).  Here, we use a spike-in based normalisation as  cell size and cell cycle are correlated. 

```{r}
dataMouse[ 1:5, 1:4 ]

geneTypes <- factor( c( ENSM="ENSM", ERCC="ERCC" )[
  substr( rownames(dataMouse), 1, 4 ) ] )

#2. calculate normalisation for counts
countsMmus <- dataMouse[ which( geneTypes=="ENSM" ), ]
countsERCC <- dataMouse[ which( geneTypes=="ERCC" ), ]
lengthsMmus <- dataMouse[ which( geneTypes=="ENSM" ), 1 ]
lengthsERCC <- dataMouse[ which( geneTypes=="ERCC" ), 1 ]


sfERCC <- estimateSizeFactorsForMatrix( countsERCC )
sfMmus <- sfERCC #also use ERCC size factor for endogenous genes


#normalise read counts
nCountsERCC <- t( t(countsERCC) / sfERCC )
nCountsMmus <- t( t(countsMmus) / sfERCC )

```
Using the normalised read counts and the ERCC spike-ins, we can estimate the technical noise by fitting the relationship between mean and squared coefficient of variation (CV2) following Brennecke et al.

```{r}
#get technical noise
techNoise = fitTechnicalNoise(nCountsMmus,nCountsERCC=nCountsERCC, fit_type = 'counts')  

```

If no spike-ins are available, we can also use the endogenous read counts for fitting the mean-CV2 relation using a log-linear fit in the log-space. Alternatively, we can fit the mean-variance relationship in the log-space using local 2nd order polynomial regression (loess).

```{r}
techNoiseLogFit = fitTechnicalNoise(nCountsMmus, fit_type = 'log', use_ERCC = FALSE, plot=FALSE) 

techNoiseLogVarFit = fitTechnicalNoise(nCountsMmus, fit_type = 'logvar', use_ERCC = FALSE, plot=FALSE) 


```

Once we have completed the fitting process, we can call variable genes.

```{r}
#call variable genes
is_het = getVariableGenes(nCountsMmus, techNoise$fit, method = "fdr", 
                          threshold = 0.1, fit_type="counts",sfEndo=sfMmus, sfERCC=sfERCC)
table(is_het)

#we an also do this for the other fits

is_hetLog = getVariableGenes(nCountsMmus, techNoiseLogFit$fit, plot=TRUE)
table(is_hetLog)

is_hetLogVar = getVariableGenes(nCountsMmus, techNoiseLogVarFit$fit, plot=TRUE)
table(is_hetLogVar)

``` 


In order to fit the latent cell cycle factor we need to retrieve cell cycle genes. For illustration purposes, here we only use genes annotated in GO (term GO:0007049).

```{r}

#get cell cycle genes from GO 
ens_ids_cc <- getEnsembl('GO:0007049')

```


Now, we have preprocessed the data and can run scLVM.
```{r}


#rename a few variables
Y = t(log10(nCountsMmus+1)) #normalised trandformed read counts
genes_het_bool = as.vector(is_het) #variable genes
geneID = rownames(nCountsMmus) #gene IDs
tech_noise = as.vector(techNoise$techNoiseLog) #technical noise


```

# Running vanilla scLVM
We first construct an scLVM object and iniitalize it with a normalised expression matirx and estimates of technical noise which can be computed using the pre-processing functions. In addition, it can be useful to include a set of variable genes which can also be determined using the preprocessing functions as described above.

```{r}
#construct and initialize new scLVM object
sclvm = new("scLVM")
sclvm = init(sclvm,Y=Y,tech_noise = tech_noise)
```

## Fitting the latent factors
Next, we fit the latent factor. Here, we fit the cell cycle factor. We first use an ARD prior and fit a large number of factors (here k=20) in order to assess how many factors we need in order to capture cell cycle. The ARD prior tells us the relevance of each factor. Note that in this example we expected cell cycle to be the major driver of cell-cell variability.

```{r}
#get cell cycle genes from GO 
ens_ids_cc <- getEnsembl('GO:0007049')

CellCycleARD = fitFactor(sclvm,geneSet = ens_ids_cc, k=20,use_ard = TRUE)
```

In order to establish the number of latent factors used to model the cell-cell covariance we plot the variance contributions of the individual latent factors.

```{r}
plot(seq(1, length(CellCycleARD$X_ard)), CellCycleARD$X_ard, xlab = '# Factor', ylab = 'Variance explained')
title('Variance explained by latent factors')
```

In this example (and generally when considering cell cycle as the confounding factor), there is a large gap in the proportion of explained variance between the first and the second factor. This suggests, that a single latent factor underlies the variation captured by the cellcycle genes. Consequently, we choose to re-fit the scLVM mdoel with one latent factor only.

```{r}
CellCycle = fitFactor(sclvm,geneSet = ens_ids_cc,k=1)

#Get cell-cycle factor
Kcc = CellCycle$K
Xcc = CellCycle$X
```

Next, we plot the inferred cell-cell covarince matirx.

```{r}
#Plot inferred similarity matrix
image(Kcc,xaxt = "n", yaxt = "n", xlab = 'cells', ylab = 'cells')
title('Similarity matrix based on cell cycle')
```


## Variance decomposition and cell-cycle corection


We first perfrom a variance decomposition on the variable genes. The computation time for the next step can be substantial. If large datasets are considerd, it may be advisable to distribute these calculations on a high performance compute cluster. In this case idx determines the range of genes for wich this anlaysis is performed. 

```{r}
idx_het = which(is_het)

# fit the model for variable genes
sclvm = varianceDecomposition(sclvm, K=Kcc, idx = idx_het)
``` 

Once the contribution of cell cycle to the observed variance is estimated, cell-cycled corrected gene expression levels can be obtained. Variance components are normalised such that they sum uo to 1 and genes for which the variance decompostion has not convered are filtered out.


```{r}

# get variance components
results_var = getVarianceComponents(sclvm)
var_filtered = results_var$var[results_var$conv,] # filter out genes for which vd has not converged
head(var_filtered)

# get corrected expression levels
Ycorr = getCorrectedExpression(sclvm)
dim(Ycorr)

```

After having perfromed the variance decompostion we can illustrate the contributions of the indiviudal components.

```{r}
var_mean = apply(var_filtered,2,mean)
colors = c('Green','Blue','Gray')
pie(var_mean, , col = colors)
```


## Correlation Analysis

In order to estimate pairwise correlation coefficients whilst controlling for hidden factors such as the cell cycle, we fit a linear mixed model with a fixed effect representing the contribution of another gene j and random effects representing the contribution of the cell cycle and biological variation.
Again, as computaion times can be very long we allow the computations to be split up over several calculations for subsets of genes.

```{r}
idx_lmm = idx_het[1:5]

# fit lmm without correction
res_nocorr = LMM(sclvm, K = NULL,idx = idx_lmm,verbose=TRUE)

# fit lmm with correction

res_corr = LMM(sclvm, K = Kcc, idx = idx_lmm,verbose = TRUE)

```

Finally we can have a quick look at the difference between corrected and uncorrected correlations.

```{r}

heatmap.2(res_nocorr$beta, Rowv = NULL, Colv = NULL, dendrogram = "none",
          labCol = as.character(idx_lmm), labRow = as.character(idx_lmm),srtCol = 0, key=T,density.info = "none",
          trace="none", breaks=seq.int(from = -0.6, to = 1.0, length.out = 13), main = 'Without Correction')

heatmap.2(res_corr$beta, Rowv = NULL, Colv = NULL, dendrogram = "none",
          labCol = as.character(idx_lmm), labRow = as.character(idx_lmm),srtCol = 0, key=T,density.info = "none",
          trace="none", breaks=seq.int(from = -0.6, to = 1.0, length.out = 13), main = 'With Correction')
```

Finally, let's make a standard PCA on the corrected and uncorrected data. If you would like to use non-linear PCA as in our paper, we suggest you use the python implementation as illustrated in our ipython notebook.


```{r}
Yhet = Y[,idx_het]
geneSymbols = getSymbols(colnames(Yhet))


gene_plot = "Gata3"
idx_gene = which(geneSymbols==gene_plot)

#PCA on corrected data
pcaCorr = prcomp(Ycorr,2)
d <- qplot(pcaCorr$x[,1], pcaCorr$x[,2],colour=Ycorr[,idx_gene], xlab = 'PC1', ylab = 'PC2')
d + ggtitle('PCA corrected gene expression') + scale_color_continuous(name =gene_plot)

#PCA on uncorrected data
pca = prcomp(Yhet,2)
d <- qplot(pca$x[,1], pca$x[,2],colour=Yhet[,idx_gene], xlab = 'PC1', ylab = 'PC2')
d + ggtitle('PCA uncorrected gene expression') + scale_color_continuous(name =gene_plot)


```

# Fitting multiple factors with scLVM
In 'vanilla' scLVM we implicitly assume that cell cycle is the major driver of heterogeneity; that's why we can simply capture cell-cycle effects by fiiting a GPLVM on the cell cycle genes. However, in many applications, either cell cycle is not the dominant source of variation (e.g. imagine a time-course experiment where pluripotnet cells are differentiated towards a specific lineage - here, most of the observed variablity may be induced by differentiation) and/or there are several factors of interest driving the observed heterogeneity (e.g. in the T-cell data, cell-cycle is the major  driver of hetergeneity, but differentation processes are also playing an important role). 
In both cases, we would like a framework where we condition on the dominant factor (cell cycle for the T-cell experients or facotrs reflecting the experimental desing in more complex experiments). 

Here, we illustrate with the T-cell data how this conditioning can be perfromed using scLVM.

## Fitting multiple latent factors

We assume that cell cycle is the dominant factor and fit it without conditioning as before. Next, we fit a Th2 factor based on 122 Th2 signature genes by conditioning on the cell cycle factors. Therefore, we use the gpCLVM class which is part of the scLVM package.

First, let's generate a new scLVM object and fit the cell-cycle factor:

```{r}
#get cell cycle genes from GO 
ens_ids_cc <- getEnsembl('GO:0007049')

#construct and initialize new scLVM object
sclvmMult = new("scLVM")
sclvmMult = init(sclvmMult,Y=Y,tech_noise = tech_noise)


CellCycle = fitFactor(sclvmMult,geneSet = ens_ids_cc, k=1)

#Get cell-cycle factor
Kcc = CellCycle$K
Xcc = CellCycle$X

```

After having fit the cell-cycle factor, we can now fit the Th2 factor by conditioning on the the dominant cell-cycle factor. As we might expect interactions between cell cycle and differentiation, we can also fit an interaction term.

```{r}

#Load Th2 genes
Th2_genes = read.table(system.file("extdata","Th2_markers.txt",package = "scLVM"), as.is=TRUE)$V1

#get Th2 marker genes 
gene_symbols = getSymbols(rownames(dataMouse))
idx_Th2 <- na.omit(match(Th2_genes, gene_symbols))

th2 = fitFactor(sclvmMult, idx = idx_Th2, XKnown = Xcc, k = 1, interaction=TRUE)
KTh2 = th2$K
Kint = th2$Kint

```


Next, we plot the inferred cell-cell covarince matrices.

```{r}
#Plot inferred similarity matrix
par(mfrow = c(1,3))
image(Kcc,xaxt = "n", yaxt = "n", xlab = 'cells', ylab = 'cells')
title('Similarity matrix based on cell cycle')

image(KTh2,xaxt = "n", yaxt = "n", xlab = 'cells', ylab = 'cells')
title('Similarity matrix based on Th2 factor')

image(Kint,xaxt = "n", yaxt = "n", xlab = 'cells', ylab = 'cells')
title('Interaction similarity matrix')

```


## Variance decomposition


We first perfrom a variance decomposition on the variable genes. The computation time for the next step can be substantial. If large datasets are considerd, it may be advisable to distribute these calculations on a high performance compute cluster.

```{r}

Klist = list(Kcc,KTh2 , Kint)
idx_het = which(is_het)

# fit the model for the variable genes
sclvmMult = varianceDecomposition(sclvmMult, K = Klist, idx = idx_het)
```

Once the contribution of cell cycle to the observed variance is estimated, cell-cycled corrected gene expression levels can be obtained. Variance components are normalised such that they sum uo to 1 and genes for which the variance decompostion has not convered are filtered out.

```{r}

# get variance components
results_var = getVarianceComponents(sclvmMult)
var_filtered = results_var$var[results_var$conv,] # filter out genes for which vd has not converged
head(var_filtered)

# get corrected expression levels
Ycorr = getCorrectedExpression(sclvmMult)
dim(Ycorr)

```

After having perfromed the variance decompostion we can illustrate the contributions of the indiviudal components.

```{r}
var_mean = apply(var_filtered,2,mean)
colors = c('Green','Blue','Gray', 'yellow')
pie(var_mean)
```


# Correlation Analysis

In order to estimate pairwise correlation coefficients whilst controlling for hidden factors such as the cell cycle, we fit a linear mixed model with a fixed effect representing the contribution of another gene j and random effects representing the contribution of the cell cycle and biological variation.
Again, as computaion times can be very long we allow the computations to be split up over several calculations for subsets of genes.

```{r}
idx_lmm = idx_het[1:5]

# fit lmm without correction
res_nocorr = LMM(sclvmMult, K = NULL,idx = idx_lmm,verbose=TRUE)

# fit lmm with correction
K=list(Kcc, KTh2, Kint)
res_corr = LMM(sclvmMult, K = K, idx = idx_lmm,verbose = TRUE)

```

Finally we can have a quick look at the difference between corrected and uncorrected correlations.

```{r}

heatmap.2(res_nocorr$beta, Rowv = NULL, Colv = NULL, dendrogram = "none",
          labCol = as.character(idx_lmm), labRow = as.character(idx_lmm),srtCol = 0, key=T,density.info = "none",
          trace="none", breaks=seq.int(from = -0.6, to = 1.0, length.out = 13), main = 'Without Correction')

heatmap.2(res_corr$beta, Rowv = NULL, Colv = NULL, dendrogram = "none",
          labCol = as.character(idx_lmm), labRow = as.character(idx_lmm),srtCol = 0, key=T,density.info = "none",
          trace="none", breaks=seq.int(from = -0.6, to = 1.0, length.out = 13), main = 'With Correction')
```

Finally, let's make a standard PCA on the corrected and uncorrected data. If you would like to use non-linear PCA as in our paper, we suggest you use the python implementation as illustrated in our ipython notebook. You can see that as we have now regressed out not only the cell cycle effects, but also Th2 effects, very little structure is left.

```{r}
Yhet = Y[,idx_het]
geneSymbols = getSymbols(colnames(Yhet))


gene_plot = "Gata3"
idx_gene = which(geneSymbols==gene_plot)


#PCA on cell cycle corrected gene expression data
pcaCorr = prcomp(Ycorr,2)
d <- qplot(pcaCorr$x[,1], pcaCorr$x[,2],colour=Ycorr[,idx_gene], xlab = 'PC1', ylab = 'PC2')
d + ggtitle('PCA corrected gene expression') + scale_color_continuous(name =gene_plot)

#PCA on uncorrected data
pca = prcomp(Yhet,2)
d <- qplot(pca$x[,1], pca$x[,2],colour=Yhet[,idx_gene], xlab = 'PC1', ylab = 'PC2')
d + ggtitle('PCA uncorrected gene expression') + scale_color_continuous(name =gene_plot)


```