Module2_Quiz.Rmd

---
title: "Module2 Quiz"
author: "mindan"
date: "2021/10/9"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r }
rm(list=ls())
```

## Dependencies

```{r load}
  library(devtools)
  library(Biobase)
  library(broom)
  library(limma)
  library(sva)
```


## Load the Montgomery and Pickrell eSet:
```{r}
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData")
load(file=con)
close(con)
mp = montpick.eset
pdata=pData(mp)
edata=as.data.frame(exprs(mp))
fdata = fData(mp)
ls()
```

## Questions and Answers

1.What percentage of variation is explained by the 1st principal component in the data set if you:

- Do no transformations?

- log2(data + 1) transform?

- log2(data + 1) transform and subtract row means?

```{r}
edata1 = edata
edata2 = log2(edata + 1)
edata3 = edata2 - rowMeans(edata2)

pc1 = prcomp(edata1,center=F, scale=F)
pc2 = prcomp(edata2,center=F, scale=F)
pc3 = prcomp(edata3,center=F, scale=F)

summary(pc1)
summary(pc2)
summary(pc3)
```

2.Perform the log2(data + 1) transform and subtract row means from the samples. Set the seed to 333 and use k-means to cluster the samples into two clusters. Use `svd` to calculate the singular vectors. What is the correlation between the first singular vector and the sample clustering indicator?

```{r}
edata_centered = edata2 - rowMeans(edata2)
set.seed(333) 
kmeans1 = kmeans(t(edata_centered),centers=2)
names(kmeans1)
table(kmeans1$cluster)

svd3 = svd(edata_centered)
names(svd3)
length(svd3$v[,1])

cor(svd3$v[,1],kmeans1$cluster)
```

5.Perform the log2(data + 1) transform. Then fit a regression model to each sample using population as the outcome. Do this using the `lm.fit` function (hint: don't forget the intercept). What is the dimension of the residual matrix, the effects matrix and the coefficients matrix?

```{r}
edata = as.matrix(edata2)

mod = model.matrix(~ pdata$population)
fit = lm.fit(mod,t(edata))
names(fit)

nrow(fit$coefficients)
nrow(fit$residuals)
nrow(fit$effects)
```

6.Perform the log2(data + 1) transform. Then fit a regression model to each sample using population as the outcome. Do this using the `lm.fit` function (hint: don't forget the intercept). What is the effects matrix?

```{r}
hist(fit$effects[2,],col=2,breaks=100)
nrow(fit$effects)
```

9.Why is it difficult to distinguish the study effect from the population effect in the Montgomery Pickrell dataset from ReCount? 


## Load the Bodymap data with the following command

```{r}
con =url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData")
load(file=con)
close(con)
bm = bodymap.eset
edata = exprs(bm)
pdata_bm=pData(bm)
ls()
```

3.Fit a linear model relating the first gene’s counts to the number of technical replicates, treating the number of replicates as a factor. Plot the data for this gene versus the covariate. Can you think of why this model might not fit well?

```{r}
edata = as.matrix(edata)
lm1 = lm(edata[1,] ~ as.factor(pdata_bm$num.tech.reps))
tidy(lm1)

plot(pdata_bm$num.tech.reps,edata[1,], col=1)
abline(lm1$coeff[1],lm1$coeff[2], col=2,lwd=3)
```

4.Fit a linear model relating he first gene’s counts to the age of the person and the sex of the samples. What is the value and interpretation of the coefficient for age?

```{r}
edata = as.matrix(edata)
lm2 = lm(edata[1,] ~ pdata_bm$age + pdata_bm$gender)
tidy(lm2)
```

7.Fit many regression models to the expression data where `age` is the outcome variable using the `lmFit` function from the `limma` package (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is the coefficient for age for the 1,000th gene? Make a plot of the data and fitted values for this gene. Does the model fit well?

```{r}
pdata0 = as.data.frame(na.omit(pdata_bm))
edata0 = edata[,-c(11,12,13)]

mod_adj = model.matrix(~ pdata0$age)
fit_limma = lmFit(edata0,mod_adj)
names(fit_limma)
fit_limma$coefficients[1000,]

plot(pdata0$age,edata0[1000,], col=1)
abline(fit_limma$coeff[1],fit_limma$coeff[2], col=2,lwd=3)
```

8.Fit many regression models to the expression data where `age` is the outcome variable and `tissue.type` is an adjustment variable using the `lmFit` function from the `limma` package (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is wrong with this model?

```{r}
mod_adj = model.matrix(~ pdata0$age + pdata0[,3])
fit_limma = lmFit(edata0,mod_adj)
```
10.Set the seed using the command `set.seed(33353)` then estimate a single surrogate variable using the `sva` function after log2(data + 1) transforming the expression data, removing rows with rowMeans less than 1, and treating age as the outcome (hint: you may have to subset the expression data to the samples without missing values of age to get the model to fit). What is the correlation between the estimated surrogate for batch and age? Is the surrogate more highly correlated with `race` or `gender`?


```{r}
edata2 = log2(edata0 + 1)
edata = edata2[rowMeans(edata2) > 1, ]

mod = model.matrix(~age,data=pdata0)
mod0 = model.matrix(~1, data=pdata0)
sva1 = sva(edata,mod,mod0,n.sv=2)
# why error 
pdata0$batch

# summary(lm(sva1$sv ~ pdata0$batch))
```

```{r session_info}
devtools::session_info()
```