-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathModule2_Quiz.Rmd
178 lines (129 loc) · 5.42 KB
/
Module2_Quiz.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
---
title: "Module2 Quiz"
author: "mindan"
date: "2021/10/9"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r }
rm(list=ls())
```
## Dependencies
```{r load}
library(devtools)
library(Biobase)
library(broom)
library(limma)
library(sva)
```
## Load the Montgomery and Pickrell eSet:
```{r}
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
load(file=con)
close(con)
mp = montpick.eset
pdata=pData(mp)
edata=as.data.frame(exprs(mp))
fdata = fData(mp)
ls()
```
## Questions and Answers
1.What percentage of variation is explained by the 1st principal component in the data set if you:
- Do no transformations?
- log2(data + 1) transform?
- log2(data + 1) transform and subtract row means?
```{r}
edata1 = edata
edata2 = log2(edata + 1)
edata3 = edata2 - rowMeans(edata2)
pc1 = prcomp(edata1,center=F, scale=F)
pc2 = prcomp(edata2,center=F, scale=F)
pc3 = prcomp(edata3,center=F, scale=F)
summary(pc1)
summary(pc2)
summary(pc3)
```
2.Perform the log2(data + 1) transform and subtract row means from the samples. Set the seed to 333 and use k-means to cluster the samples into two clusters. Use `svd` to calculate the singular vectors. What is the correlation between the first singular vector and the sample clustering indicator?
```{r}
edata_centered = edata2 - rowMeans(edata2)
set.seed(333)
kmeans1 = kmeans(t(edata_centered),centers=2)
names(kmeans1)
table(kmeans1$cluster)
svd3 = svd(edata_centered)
names(svd3)
length(svd3$v[,1])
cor(svd3$v[,1],kmeans1$cluster)
```
5.Perform the log2(data + 1) transform. Then fit a regression model to each sample using population as the outcome. Do this using the `lm.fit` function (hint: don't forget the intercept). What is the dimension of the residual matrix, the effects matrix and the coefficients matrix?
```{r}
edata = as.matrix(edata2)
mod = model.matrix(~ pdata$population)
fit = lm.fit(mod,t(edata))
names(fit)
nrow(fit$coefficients)
nrow(fit$residuals)
nrow(fit$effects)
```
6.Perform the log2(data + 1) transform. Then fit a regression model to each sample using population as the outcome. Do this using the `lm.fit` function (hint: don't forget the intercept). What is the effects matrix?
```{r}
hist(fit$effects[2,],col=2,breaks=100)
nrow(fit$effects)
```
9.Why is it difficult to distinguish the study effect from the population effect in the Montgomery Pickrell dataset from ReCount?
## Load the Bodymap data with the following command
```{r}
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
edata = exprs(bm)
pdata_bm=pData(bm)
ls()
```
3.Fit a linear model relating the first gene’s counts to the number of technical replicates, treating the number of replicates as a factor. Plot the data for this gene versus the covariate. Can you think of why this model might not fit well?
```{r}
edata = as.matrix(edata)
lm1 = lm(edata[1,] ~ as.factor(pdata_bm$num.tech.reps))
tidy(lm1)
plot(pdata_bm$num.tech.reps,edata[1,], col=1)
abline(lm1$coeff[1],lm1$coeff[2], col=2,lwd=3)
```
4.Fit a linear model relating he first gene’s counts to the age of the person and the sex of the samples. What is the value and interpretation of the coefficient for age?
```{r}
edata = as.matrix(edata)
lm2 = lm(edata[1,] ~ pdata_bm$age + pdata_bm$gender)
tidy(lm2)
```
7.Fit many regression models to the expression data where `age` is the outcome variable using the `lmFit` function from the `limma` package (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is the coefficient for age for the 1,000th gene? Make a plot of the data and fitted values for this gene. Does the model fit well?
```{r}
pdata0 = as.data.frame(na.omit(pdata_bm))
edata0 = edata[,-c(11,12,13)]
mod_adj = model.matrix(~ pdata0$age)
fit_limma = lmFit(edata0,mod_adj)
names(fit_limma)
fit_limma$coefficients[1000,]
plot(pdata0$age,edata0[1000,], col=1)
abline(fit_limma$coeff[1],fit_limma$coeff[2], col=2,lwd=3)
```
8.Fit many regression models to the expression data where `age` is the outcome variable and `tissue.type` is an adjustment variable using the `lmFit` function from the `limma` package (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is wrong with this model?
```{r}
mod_adj = model.matrix(~ pdata0$age + pdata0[,3])
fit_limma = lmFit(edata0,mod_adj)
```
10.Set the seed using the command `set.seed(33353)` then estimate a single surrogate variable using the `sva` function after log2(data + 1) transforming the expression data, removing rows with rowMeans less than 1, and treating age as the outcome (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is the correlation between the estimated surrogate for batch and age? Is the surrogate more highly correlated with `race` or `gender`?
```{r}
edata2 = log2(edata0 + 1)
edata = edata2[rowMeans(edata2) > 1, ]
mod = model.matrix(~age,data=pdata0)
mod0 = model.matrix(~1, data=pdata0)
sva1 = sva(edata,mod,mod0,n.sv=2)
# why error
pdata0$batch
# summary(lm(sva1$sv ~ pdata0$batch))
```
```{r session_info}
devtools::session_info()
```