-
Notifications
You must be signed in to change notification settings - Fork 46
/
Copy pathscLVM_vignette.Rmd
426 lines (271 loc) · 16.2 KB
/
scLVM_vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
---
title: "An R package for scLVM"
author: "Florian Buettner, F. Paolo Casale and Oliver Stegle"
output:
html_document:
toc: true
toc_depth: 2
---
# What is scLVM?
scLVM is a modelling framework for single-cell RNA-seq data that can be used to dissect the observed heterogeneity into different sources, thereby allowing for the correction of confounding sources of variation.
scLVM was primarily designed to account for cell-cycle induced variations in single-cell RNA-seq data where cell cycle is the primary soure of variability.
Software by Florian Buettner, Paolo Casale and Oliver Stegle. scLVM is explained in more detail in the accompanying publication:
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC & Stegle O, 2015. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-Sequencing data reveals hidden subpopulation of cells, Nature Biotechnology, doi: 10.1038/nbt.3102.
# Data preparation and pre-processing
In the following exmaple script we illustrate how the pre-processing was performed for the T-cell data.
First, we need to load some required packages.
```{r,message=FALSE}
library(genefilter)
library(statmod)
require(ggplot2)
library(gplots)
require(DESeq2)
library(scLVM)
#We also need to set the limix path; if you installed, you can delete the last 2 lines, otherwise you need to adapt the path.
limix_path = '/Users/flo/software/limix-master/build/release.darwin/interfaces/python'
configLimix(limix_path)
```
Now, we load he data frames containing the mapped read counts for the 81 T-cells described in the paper.
```{r}
data(data_Tcells)
help(data_Tcells)
```
Next, we look for the spike-ins which we then use to normalise the data. In general, there data can be normalized either with a size factor derived from the endogenous genes (normalising for cell size) or a size factor derived from the ERCC spike-ins (only normalising for sequencing depth). Here, we use a spike-in based normalisation as cell size and cell cycle are correlated.
```{r}
dataMouse[ 1:5, 1:4 ]
geneTypes <- factor( c( ENSM="ENSM", ERCC="ERCC" )[
substr( rownames(dataMouse), 1, 4 ) ] )
#2. calculate normalisation for counts
countsMmus <- dataMouse[ which( geneTypes=="ENSM" ), ]
countsERCC <- dataMouse[ which( geneTypes=="ERCC" ), ]
lengthsMmus <- dataMouse[ which( geneTypes=="ENSM" ), 1 ]
lengthsERCC <- dataMouse[ which( geneTypes=="ERCC" ), 1 ]
sfERCC <- estimateSizeFactorsForMatrix( countsERCC )
sfMmus <- sfERCC #also use ERCC size factor for endogenous genes
#normalise read counts
nCountsERCC <- t( t(countsERCC) / sfERCC )
nCountsMmus <- t( t(countsMmus) / sfERCC )
```
Using the normalised read counts and the ERCC spike-ins, we can estimate the technical noise by fitting the relationship between mean and squared coefficient of variation (CV2) following Brennecke et al.
```{r}
#get technical noise
techNoise = fitTechnicalNoise(nCountsMmus,nCountsERCC=nCountsERCC, fit_type = 'counts')
```
If no spike-ins are available, we can also use the endogenous read counts for fitting the mean-CV2 relation using a log-linear fit in the log-space. Alternatively, we can fit the mean-variance relationship in the log-space using local 2nd order polynomial regression (loess).
```{r}
techNoiseLogFit = fitTechnicalNoise(nCountsMmus, fit_type = 'log', use_ERCC = FALSE, plot=FALSE)
techNoiseLogVarFit = fitTechnicalNoise(nCountsMmus, fit_type = 'logvar', use_ERCC = FALSE, plot=FALSE)
```
Once we have completed the fitting process, we can call variable genes.
```{r}
#call variable genes
is_het = getVariableGenes(nCountsMmus, techNoise$fit, method = "fdr",
threshold = 0.1, fit_type="counts",sfEndo=sfMmus, sfERCC=sfERCC)
table(is_het)
#we an also do this for the other fits
is_hetLog = getVariableGenes(nCountsMmus, techNoiseLogFit$fit, plot=TRUE)
table(is_hetLog)
is_hetLogVar = getVariableGenes(nCountsMmus, techNoiseLogVarFit$fit, plot=TRUE)
table(is_hetLogVar)
```
In order to fit the latent cell cycle factor we need to retrieve cell cycle genes. For illustration purposes, here we only use genes annotated in GO (term GO:0007049).
```{r}
#get cell cycle genes from GO
ens_ids_cc <- getEnsembl('GO:0007049')
```
Now, we have preprocessed the data and can run scLVM.
```{r}
#rename a few variables
Y = t(log10(nCountsMmus+1)) #normalised trandformed read counts
genes_het_bool = as.vector(is_het) #variable genes
geneID = rownames(nCountsMmus) #gene IDs
tech_noise = as.vector(techNoise$techNoiseLog) #technical noise
```
# Running vanilla scLVM
We first construct an scLVM object and iniitalize it with a normalised expression matirx and estimates of technical noise which can be computed using the pre-processing functions. In addition, it can be useful to include a set of variable genes which can also be determined using the preprocessing functions as described above.
```{r}
#construct and initialize new scLVM object
sclvm = new("scLVM")
sclvm = init(sclvm,Y=Y,tech_noise = tech_noise)
```
## Fitting the latent factors
Next, we fit the latent factor. Here, we fit the cell cycle factor. We first use an ARD prior and fit a large number of factors (here k=20) in order to assess how many factors we need in order to capture cell cycle. The ARD prior tells us the relevance of each factor. Note that in this example we expected cell cycle to be the major driver of cell-cell variability.
```{r}
#get cell cycle genes from GO
ens_ids_cc <- getEnsembl('GO:0007049')
CellCycleARD = fitFactor(sclvm,geneSet = ens_ids_cc, k=20,use_ard = TRUE)
```
In order to establish the number of latent factors used to model the cell-cell covariance we plot the variance contributions of the individual latent factors.
```{r}
plot(seq(1, length(CellCycleARD$X_ard)), CellCycleARD$X_ard, xlab = '# Factor', ylab = 'Variance explained')
title('Variance explained by latent factors')
```
In this example (and generally when considering cell cycle as the confounding factor), there is a large gap in the proportion of explained variance between the first and the second factor. This suggests, that a single latent factor underlies the variation captured by the cellcycle genes. Consequently, we choose to re-fit the scLVM mdoel with one latent factor only.
```{r}
CellCycle = fitFactor(sclvm,geneSet = ens_ids_cc,k=1)
#Get cell-cycle factor
Kcc = CellCycle$K
Xcc = CellCycle$X
```
Next, we plot the inferred cell-cell covarince matirx.
```{r}
#Plot inferred similarity matrix
image(Kcc,xaxt = "n", yaxt = "n", xlab = 'cells', ylab = 'cells')
title('Similarity matrix based on cell cycle')
```
## Variance decomposition and cell-cycle corection
We first perfrom a variance decomposition on the variable genes. The computation time for the next step can be substantial. If large datasets are considerd, it may be advisable to distribute these calculations on a high performance compute cluster. In this case idx determines the range of genes for wich this anlaysis is performed.
```{r}
idx_het = which(is_het)
# fit the model for variable genes
sclvm = varianceDecomposition(sclvm, K=Kcc, idx = idx_het)
```
Once the contribution of cell cycle to the observed variance is estimated, cell-cycled corrected gene expression levels can be obtained. Variance components are normalised such that they sum uo to 1 and genes for which the variance decompostion has not convered are filtered out.
```{r}
# get variance components
results_var = getVarianceComponents(sclvm)
var_filtered = results_var$var[results_var$conv,] # filter out genes for which vd has not converged
head(var_filtered)
# get corrected expression levels
Ycorr = getCorrectedExpression(sclvm)
dim(Ycorr)
```
After having perfromed the variance decompostion we can illustrate the contributions of the indiviudal components.
```{r}
var_mean = apply(var_filtered,2,mean)
colors = c('Green','Blue','Gray')
pie(var_mean, , col = colors)
```
## Correlation Analysis
In order to estimate pairwise correlation coefficients whilst controlling for hidden factors such as the cell cycle, we fit a linear mixed model with a fixed effect representing the contribution of another gene j and random effects representing the contribution of the cell cycle and biological variation.
Again, as computaion times can be very long we allow the computations to be split up over several calculations for subsets of genes.
```{r}
idx_lmm = idx_het[1:5]
# fit lmm without correction
res_nocorr = LMM(sclvm, K = NULL,idx = idx_lmm,verbose=TRUE)
# fit lmm with correction
res_corr = LMM(sclvm, K = Kcc, idx = idx_lmm,verbose = TRUE)
```
Finally we can have a quick look at the difference between corrected and uncorrected correlations.
```{r}
heatmap.2(res_nocorr$beta, Rowv = NULL, Colv = NULL, dendrogram = "none",
labCol = as.character(idx_lmm), labRow = as.character(idx_lmm),srtCol = 0, key=T,density.info = "none",
trace="none", breaks=seq.int(from = -0.6, to = 1.0, length.out = 13), main = 'Without Correction')
heatmap.2(res_corr$beta, Rowv = NULL, Colv = NULL, dendrogram = "none",
labCol = as.character(idx_lmm), labRow = as.character(idx_lmm),srtCol = 0, key=T,density.info = "none",
trace="none", breaks=seq.int(from = -0.6, to = 1.0, length.out = 13), main = 'With Correction')
```
Finally, let's make a standard PCA on the corrected and uncorrected data. If you would like to use non-linear PCA as in our paper, we suggest you use the python implementation as illustrated in our ipython notebook.
```{r}
Yhet = Y[,idx_het]
geneSymbols = getSymbols(colnames(Yhet))
gene_plot = "Gata3"
idx_gene = which(geneSymbols==gene_plot)
#PCA on corrected data
pcaCorr = prcomp(Ycorr,2)
d <- qplot(pcaCorr$x[,1], pcaCorr$x[,2],colour=Ycorr[,idx_gene], xlab = 'PC1', ylab = 'PC2')
d + ggtitle('PCA corrected gene expression') + scale_color_continuous(name =gene_plot)
#PCA on uncorrected data
pca = prcomp(Yhet,2)
d <- qplot(pca$x[,1], pca$x[,2],colour=Yhet[,idx_gene], xlab = 'PC1', ylab = 'PC2')
d + ggtitle('PCA uncorrected gene expression') + scale_color_continuous(name =gene_plot)
```
# Fitting multiple factors with scLVM
In 'vanilla' scLVM we implicitly assume that cell cycle is the major driver of heterogeneity; that's why we can simply capture cell-cycle effects by fiiting a GPLVM on the cell cycle genes. However, in many applications, either cell cycle is not the dominant source of variation (e.g. imagine a time-course experiment where pluripotnet cells are differentiated towards a specific lineage - here, most of the observed variablity may be induced by differentiation) and/or there are several factors of interest driving the observed heterogeneity (e.g. in the T-cell data, cell-cycle is the major driver of hetergeneity, but differentation processes are also playing an important role).
In both cases, we would like a framework where we condition on the dominant factor (cell cycle for the T-cell experients or facotrs reflecting the experimental desing in more complex experiments).
Here, we illustrate with the T-cell data how this conditioning can be perfromed using scLVM.
## Fitting multiple latent factors
We assume that cell cycle is the dominant factor and fit it without conditioning as before. Next, we fit a Th2 factor based on 122 Th2 signature genes by conditioning on the cell cycle factors. Therefore, we use the gpCLVM class which is part of the scLVM package.
First, let's generate a new scLVM object and fit the cell-cycle factor:
```{r}
#get cell cycle genes from GO
ens_ids_cc <- getEnsembl('GO:0007049')
#construct and initialize new scLVM object
sclvmMult = new("scLVM")
sclvmMult = init(sclvmMult,Y=Y,tech_noise = tech_noise)
CellCycle = fitFactor(sclvmMult,geneSet = ens_ids_cc, k=1)
#Get cell-cycle factor
Kcc = CellCycle$K
Xcc = CellCycle$X
```
After having fit the cell-cycle factor, we can now fit the Th2 factor by conditioning on the the dominant cell-cycle factor. As we might expect interactions between cell cycle and differentiation, we can also fit an interaction term.
```{r}
#Load Th2 genes
Th2_genes = read.table(system.file("extdata","Th2_markers.txt",package = "scLVM"), as.is=TRUE)$V1
#get Th2 marker genes
gene_symbols = getSymbols(rownames(dataMouse))
idx_Th2 <- na.omit(match(Th2_genes, gene_symbols))
th2 = fitFactor(sclvmMult, idx = idx_Th2, XKnown = Xcc, k = 1, interaction=TRUE)
KTh2 = th2$K
Kint = th2$Kint
```
Next, we plot the inferred cell-cell covarince matrices.
```{r}
#Plot inferred similarity matrix
par(mfrow = c(1,3))
image(Kcc,xaxt = "n", yaxt = "n", xlab = 'cells', ylab = 'cells')
title('Similarity matrix based on cell cycle')
image(KTh2,xaxt = "n", yaxt = "n", xlab = 'cells', ylab = 'cells')
title('Similarity matrix based on Th2 factor')
image(Kint,xaxt = "n", yaxt = "n", xlab = 'cells', ylab = 'cells')
title('Interaction similarity matrix')
```
## Variance decomposition
We first perfrom a variance decomposition on the variable genes. The computation time for the next step can be substantial. If large datasets are considerd, it may be advisable to distribute these calculations on a high performance compute cluster.
```{r}
Klist = list(Kcc,KTh2 , Kint)
idx_het = which(is_het)
# fit the model for the variable genes
sclvmMult = varianceDecomposition(sclvmMult, K = Klist, idx = idx_het)
```
Once the contribution of cell cycle to the observed variance is estimated, cell-cycled corrected gene expression levels can be obtained. Variance components are normalised such that they sum uo to 1 and genes for which the variance decompostion has not convered are filtered out.
```{r}
# get variance components
results_var = getVarianceComponents(sclvmMult)
var_filtered = results_var$var[results_var$conv,] # filter out genes for which vd has not converged
head(var_filtered)
# get corrected expression levels
Ycorr = getCorrectedExpression(sclvmMult)
dim(Ycorr)
```
After having perfromed the variance decompostion we can illustrate the contributions of the indiviudal components.
```{r}
var_mean = apply(var_filtered,2,mean)
colors = c('Green','Blue','Gray', 'yellow')
pie(var_mean)
```
# Correlation Analysis
In order to estimate pairwise correlation coefficients whilst controlling for hidden factors such as the cell cycle, we fit a linear mixed model with a fixed effect representing the contribution of another gene j and random effects representing the contribution of the cell cycle and biological variation.
Again, as computaion times can be very long we allow the computations to be split up over several calculations for subsets of genes.
```{r}
idx_lmm = idx_het[1:5]
# fit lmm without correction
res_nocorr = LMM(sclvmMult, K = NULL,idx = idx_lmm,verbose=TRUE)
# fit lmm with correction
K=list(Kcc, KTh2, Kint)
res_corr = LMM(sclvmMult, K = K, idx = idx_lmm,verbose = TRUE)
```
Finally we can have a quick look at the difference between corrected and uncorrected correlations.
```{r}
heatmap.2(res_nocorr$beta, Rowv = NULL, Colv = NULL, dendrogram = "none",
labCol = as.character(idx_lmm), labRow = as.character(idx_lmm),srtCol = 0, key=T,density.info = "none",
trace="none", breaks=seq.int(from = -0.6, to = 1.0, length.out = 13), main = 'Without Correction')
heatmap.2(res_corr$beta, Rowv = NULL, Colv = NULL, dendrogram = "none",
labCol = as.character(idx_lmm), labRow = as.character(idx_lmm),srtCol = 0, key=T,density.info = "none",
trace="none", breaks=seq.int(from = -0.6, to = 1.0, length.out = 13), main = 'With Correction')
```
Finally, let's make a standard PCA on the corrected and uncorrected data. If you would like to use non-linear PCA as in our paper, we suggest you use the python implementation as illustrated in our ipython notebook. You can see that as we have now regressed out not only the cell cycle effects, but also Th2 effects, very little structure is left.
```{r}
Yhet = Y[,idx_het]
geneSymbols = getSymbols(colnames(Yhet))
gene_plot = "Gata3"
idx_gene = which(geneSymbols==gene_plot)
#PCA on cell cycle corrected gene expression data
pcaCorr = prcomp(Ycorr,2)
d <- qplot(pcaCorr$x[,1], pcaCorr$x[,2],colour=Ycorr[,idx_gene], xlab = 'PC1', ylab = 'PC2')
d + ggtitle('PCA corrected gene expression') + scale_color_continuous(name =gene_plot)
#PCA on uncorrected data
pca = prcomp(Yhet,2)
d <- qplot(pca$x[,1], pca$x[,2],colour=Yhet[,idx_gene], xlab = 'PC1', ylab = 'PC2')
d + ggtitle('PCA uncorrected gene expression') + scale_color_continuous(name =gene_plot)
```