---
title: "`r params$set_title`"
author: "Andrew Phipps"
date: "13 April 2018"
params:
  set_title: "K4me3_12m_2fc_1000bplim"
output:
  html_document: default
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Download and install the required packages
```{r, include=FALSE}
#download and install required packages - note you need R 3.3.3 or later
#source("https://bioconductor.org/biocLite.R")
#biocLite("edgeR")
#biocLite("csaw")
#biocLite("DiffBind")
#install.packages("tidyverse")
#install.packages("dplyr")
#biocLite("RUVSeq")
#biocLite("TxDb.Mmusculus.UCSC.mm10.knownGene")
```
We will now load the required packages into R
```{r}
#load required packages into R
#tutorial link https://www.bioconductor.org/help/course-materials/2015/BioC2015/csaw_lab.html
library(DiffBind)
library(csaw)
library(edgeR)
library(rtracklayer)
library(ggplot2)
library(dplyr)
```
Import the sample files for analysis - note I have included the input control, although it is not needed for this analysis.
For this analysis we will only use a single time point (12 months, WT vs TG).
```{r}
#import required bam files - for now stick to single timepoint WT vs TG for K4me3 - but not sure if it would be better to load all data
#for multiple marks at a single time point or import all data from the experiment
#load samples in from marked duplicated files
K4me3_bam.files <- file.path("/NGS_Data/Andrew/Bowtie2_MarkDuplicates/K4me3" , c("154467_TG12_K4me3_CC0TAANXX_CGTCCATT_all_trimmed.fq.sam.bam.sorted.bam_markdup.bam",
"155542_TG12_K4me3_CC0TAANXX_GTCCTGTT_all_trimmed.fq.sam.bam.sorted.bam_markdup.bam",
"155668_WT12_K4me3_CC0TAANXX_AGCTAGTG_all_trimmed.fq.sam.bam.sorted.bam_markdup.bam",
"155669_TG12_K4me3_CC0TAANXX_AGCCGTAA_all_trimmed.fq.sam.bam.sorted.bam_markdup.bam",
"155688_WT12_K4me3_CC0TAANXX_CACGTCTA_all_trimmed.fq.sam.bam.sorted.bam_markdup.bam",
"155691_TG12_K4me3_CC0TAANXX_GAGTAGAG_all_trimmed.fq.sam.bam.sorted.bam_markdup.bam",
"156508_WT12_K4me3_CC0TAANXX_ACTATCGC_all_trimmed.fq.sam.bam.sorted.bam_markdup.bam",
"156509_WT12_K4me3_CC0TAANXX_GCGTATCA_all_trimmed.fq.sam.bam.sorted.bam_markdup.bam",
"157306_WT12_K4me3_CC0TAANXX_ACTCTCCA_all_trimmed.fq.sam.bam.sorted.bam_markdup.bam",
"157307_TG12_K4me3_CC0TAANXX_ACGTCGTT_all_trimmed.fq.sam.bam.sorted.bam_markdup.bam"))
input <- file.path("/NGS_Data/Andrew/Bowtie2_MarkDuplicates/TI", c("TI_merged.sorted.bam_markdup.bam"))
```
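Before going further, it is worth a quick check that every path resolves on this machine (a minimal sanity check; it assumes the mount points above are visible from this R session):
```{r}
#check that all BAM files and the input are visible before counting reads
stopifnot(all(file.exists(c(K4me3_bam.files, input))))
```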
We will now make a design matrix for these samples (keep the same order as samples)
```{r}
#design matrix
design <- model.matrix(~factor(c('TG12', 'TG12', 'WT12', 'TG12', 'WT12', 'TG12', 'WT12','WT12','WT12','TG12')))
colnames(design) <- c("intercept", "genotype")
#import the ENCODE mm10 blacklist as a GRanges object (rtracklayer was loaded above)
gr_blacklist <- import("/NGS_Data/Andrew/bigwig/mm10.blacklist.bed")
```
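One thing worth checking here: `factor()` orders levels alphabetically, so 'TG12' sorts before 'WT12' and becomes the baseline, meaning the genotype coefficient above is WT relative to TG. A minimal sketch of making WT the explicit reference instead (`genotype` and `design.alt` are illustrative names, not part of the analysis above):
```{r}
#assumption: we may prefer WT12 as the reference so the genotype coefficient reads TG vs WT
genotype <- factor(c('TG12','TG12','WT12','TG12','WT12','TG12','WT12','WT12','WT12','TG12'),
                   levels = c('WT12','TG12'))
design.alt <- model.matrix(~genotype) #hypothetical alternative to 'design'
head(design.alt)
```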
We can also load other genomic regions to annotate against, e.g. `library(TxDb.Mmusculus.UCSC.mm10.knownGene)`.
Now we set up the parameters for importing the data and bring it into a RangedSummarizedExperiment.
```{r}
###############################################################################
#setting up parameters
frag.length <- 250 #average fragment length from sonication
window.width <- 150 #partition the genome into windows - use small windows for TFs; for histone marks use a
##window relevant to the mark - around 150bp, the minimum size of a nucleosome
##(see section 2.5 of the csaw guide - Choosing appropriate window size)
spacing <- 50
#discard ENCODE mm10 blacklist regions; minq=30 removes reads with >0.001 probability of incorrect mapping
parameters <- readParam(minq=30, discard = gr_blacklist)
data <- windowCounts(K4me3_bam.files, ext=frag.length, width =window.width, param = parameters)
```
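As a quick structural check, `windowCounts()` returns a RangedSummarizedExperiment: rows are windows, columns are the ten libraries, and per-library read totals sit in the column data.
```{r}
#rows = windows, columns = libraries
dim(data)
#total read counts per library recorded by windowCounts()
data$totals
```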
Visualisation of the data to ensure it has loaded correctly:
```{r}
#can visualise the data with
rowRanges(data)
```
and visualisation of counts:
```{r}
#visualise counts with
head(assay(data), n=100)
```
Confirmation of fragment size:
```{r}
#can visualise to pick fragment size with the following plot - however you need to have marked (not removed) duplicates with Picard
#Sharp spike = fragment length. Low curve = potentially poor IP efficiency (low input from my samples)
max.delay <- 500
dedup.on <- reform(parameters, dedup=TRUE)
plot1 <- correlateReads(K4me3_bam.files, max.delay, param = dedup.on)
plot(0:max.delay, plot1, type='l', ylab="CCF", xlab="Delay (bp)")
```
We can then estimate the average fragment length from sonication numerically:
```{r}
#can quantitate the fragment size with the following:
maximizeCcf(plot1)
```
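If the estimate disagrees substantially with the 250 bp assumed above, the window counts could be regenerated with the data-driven value; a minimal sketch (not run):
```{r, eval=FALSE}
#re-count with the estimated fragment length if it is far from the assumed 250bp
frag.length.est <- maximizeCcf(plot1)
#data <- windowCounts(K4me3_bam.files, ext=frag.length.est, width=window.width, param=parameters)
```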
Perform filtering - filtering by global enrichment to remove background (see chapter 3 of the csaw user's guide).
We keep windows with greater than a 3-fold change over the background estimated from large bins; windows below this threshold, including those with very low counts, are filtered out with the background.
The intention for this run (per the title) was to relax the comparison to a 2-fold change against background - a 3-fold cut-off on noisy data might limit our ability to visualise the data and call significant regions - although the chunk below still applies log2(3).
```{r}
#filtering on abundance (the log-transform of the negative binomial mean)
##to remove uninteresting regions and lower computational complexity
#a better option for our analysis is to filter via global enrichment
bin.size <- 2000L
binned1 <- windowCounts(K4me3_bam.files, bin=TRUE, width=bin.size, param=parameters)
filtered.stat <- filterWindows(data, background=binned1, type="global")
#keep windows with a fold change of greater than 3 over background (the csaw guide default) - try other thresholds
filtered.keep <- filtered.stat$filter > log2(3)
```
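To see what relaxing the threshold would do, a quick sketch comparing how many windows survive at 2-fold versus 3-fold enrichment:
```{r}
#windows passing each candidate fold-change threshold
sum(filtered.stat$filter > log2(2))
sum(filtered.stat$filter > log2(3))
```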
Total number of windows retained after filtering
```{r}
#number of windows retained after filtering
sum(filtered.keep)
#subset to the retained windows for downstream analysis
filtered.data <- data[filtered.keep,]
```
Generate a histogram to show that the majority of sites are filtered out as background
```{r}
#visualise fold change to confirm that the bulk of background sites are removed by filtering
par(mfrow=c(1,1))
hist(filtered.stat$back.abundances, xlab="adjusted bin log-CPM", breaks=100,
main="", col="grey80", xlim=c(min(filtered.stat$back.abundances), 0))
global.bg <- filtered.stat$abundances -filtered.stat$filter
abline(v=global.bg[1], col="red", lwd=2)
abline(v=global.bg[1]+log2(3), col="blue", lwd=2)
legend("topright", lwd=2, col=c('red', 'blue'), legend=c("Background", "Threshold"))
```
Now eliminate composition bias using normalisation factors computed from large background bins
```{r}
#elimination of composition bias with global enrichment
binned <- windowCounts(K4me3_bam.files, bin=TRUE, width=10000, param=parameters)
filtered.data <- normOffsets(binned, se.out=filtered.data)
filtered.data$norm.factors
#You can test multiple normalisation windows here: too small = low counts and loss of DB, too large and DB will
##be in the same window as background
#demo <- windowCounts(K4me3_bam.files, bin=TRUE, width=5000, param=parameters)
#normOffsets(demo, se.out=FALSE)
# [1] 0.9748893 1.0295585 0.8987019 1.0386579 1.0815877 0.8709669 0.9466737 1.0718893 1.0981895 1.0167509
#demo <- windowCounts(K4me3_bam.files, bin=TRUE, width=15000, param=parameters)
#normOffsets(demo, se.out=FALSE)
#[1] 0.9847623 1.0302603 0.9183524 1.0549877 1.0909148 0.8883423 0.9719159 1.0686444 1.1166129 0.9051679
```
Generate MA plots to visualise composition bias and normalisation. A vertical shift in the cloud of points can indicate composition bias; ideally the normalisation factors (red line) pass through the centre of the dense cloud.
```{r}
#visualisation of normalisation with MA plots - generate for each sample
#vertical shift in the bars might indicate composition bias - ideally want comp factors (line) to pass through
##centre of the cloud
par(mfrow=c(3,3), mar=c(5,4,2,1.5))
adj.counts <- cpm(asDGEList(binned), log=TRUE)
normfacs <- filtered.data$norm.factors
for (i in seq_len(length(K4me3_bam.files)-1)) {
cur.x <- adj.counts[,1]
cur.y <- adj.counts[,1+i]
smoothScatter(x=(cur.x+cur.y)/2+6*log2(10), y=cur.x-cur.y,
xlab="A", ylab="M", main=paste("1 vs",i+1))
all.dist <-diff(log2(normfacs[c(i+1, 1)]))
abline(h=all.dist, col="red")
}
```
TMM normalisation on high abundance regions
```{r}
#Eliminating efficiency bias using TMM on high abundance regions
filtered.data.TMM <- normOffsets(filtered.data, se.out=TRUE)
filtered.data.TMM.efficiency <- filtered.data.TMM$norm.factors
data.comp<- normOffsets(binned, se.out=FALSE)
```
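With both sets of factors in hand, a quick hedged comparison per library (the data frame here is purely for inspection):
```{r}
#side-by-side view of composition-based vs efficiency-based normalisation factors
data.frame(library = basename(K4me3_bam.files),
           composition = data.comp,
           efficiency = filtered.data.TMM.efficiency)
```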
Now we can visualise the effect of TMM normalisation: the factors from removal of composition bias (dashed line) should pass through the cloud at low A-values (background), while the factors from removal of efficiency bias (solid line) should pass through at high A-values (bound regions).
```{r}
#visualisation post normalisation
##Low A-value = background, high A-value = bound. Normalisation factors from removal of comp bias(dashed line), pass
###through low A-value, removal of efficiency bias pass through (full)
par(mfrow=c(1,2))
bins <- binned
comp <- data.comp
eff <- filtered.data.TMM.efficiency
adjc <-cpm(asDGEList(bins), log=TRUE)
smoothScatter(x=rowMeans(adjc), y=adjc[,1]-adjc[,2], xlab="A-value (background vs whole)", ylab="M", main= "TMM normalisation K4me3 12m")
abline(h=log2(eff[1]/eff[2]), col="red")
abline(h=log2(comp[1]/comp[2]), col="red", lty=2)
```
Now we have set up the samples and everything is ready for testing of differential binding
```{r}
#testing for Diffbinding
#need: filtered.data.TMM and filtered.data, original data: K4me3_bam.files, and design matrix
#setting up the data
y<- asDGEList(filtered.data.TMM)
#experimental design and setting up data
design
#stabilising dispersion estimates with empirical Bayes
y<- estimateDisp(y, design)
summary(y$trended.dispersion)
fit <-glmQLFit(y, design, robust=TRUE)
summary(fit$var.post)
```
We can visualise the empirical Bayes stabilisation of the NB dispersion estimates
```{r}
#visualise the EB-stabilised biological coefficient of variation for the NB dispersion - see pg 42 of the csaw guide
par(mfrow=c(1,2))
o<-order(y$AveLogCPM)
plot(y$AveLogCPM[o],sqrt(y$trended.dispersion[o]), type="l", lwd=2,
ylim=c(0,1), xlab=expression("Ave."~Log[2]~"CPM"),
ylab=("biological coefficient of variation"))
plotQLDisp(fit)
#my filtering may not have been strong enough - a greater fold-change threshold might reduce the noise seen here
#alternatively it might reflect the increased variation between my samples
summary(fit$df.prior)
```
Prior to testing for differential binding - or perhaps even prior to TMM normalisation - we should try RUVSeq to remove unknown sources of bias. It can be installed with:
```{r, eval=FALSE}
#RUVSeq removal of unknown biases - built around RNA-seq data, but could work to remove
##the biases we are seeing in our samples
source("https://bioconductor.org/biocLite.R")
biocLite("RUVSeq")
```
Now we test for differential binding; we will then visualise replicate similarity with MDS plots (`plotMDS()`).
```{r}
#Testing for differential binding
results <- glmQLFTest(fit, contrast=c(0,1))
head(results$table)
#assign p-values to co-ordinates
rowData(filtered.data.TMM) <-cbind(rowData(filtered.data.TMM),results$table)
```
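A quick diagnostic worth running at this point: if the model fits, the raw window-level p-values should be roughly uniform with a spike near zero.
```{r}
#distribution of raw window p-values from the QL F-test
hist(results$table$PValue, breaks=50, main="", xlab="raw window p-value")
```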
Using MDS plots we can examine similarity between samples for the top 100, 500, 1000 and 5000 called sites
```{r}
#examine replicate similarity with MDS plots
par(mfrow=c(2,2), mar=c(5,4,2,2))
adj.counts<-cpm(y,log=TRUE)
for(top in c(100,500,1000,5000)) {
out <- plotMDS(adj.counts, main=top, col=c("blue","blue", "red", "blue", "red", "blue", "red", "red", "red", "blue" ),
labels=c("154467", "155542", "155668","155669", "155688", "155691", "156508", "156509", "157306", "157307"), top=top)
}
```
Now we can correct these regions for false discovery rate using the Benjamini-Hochberg method on the combined p-values. This controls the FDR across regions and is less conservative than Bonferroni.
First, cluster windows with external cues/prior information - the mm10 annotation:
```{r}
#define the external regions as mm10 gene bodies from the TxDb annotation mentioned earlier
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
broads <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
olap <- findOverlaps(broads, rowRanges(filtered.data.TMM))
tabbroad <- combineOverlaps(olap, results$table)
head(tabbroad[!is.na(tabbroad$PValue),])
```
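We can then count how many annotated regions are significant at a 5% FDR (assuming the FDR column returned by combineOverlaps(); regions with no overlapping windows are NA):
```{r}
#number of gene-based regions significant at 5% FDR
sum(tabbroad$FDR <= 0.05, na.rm=TRUE)
```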
```{r}
#cluster windows into regions without external information/cues - note max.width limits region size (see 6.2.2)
#tol is the maximum gap between adjacent windows for merging - windows further apart are treated as separate events
mergedwindowsK4me3 <- mergeWindows(rowRanges(filtered.data.TMM), tol = 1000, max.width = 1000L)
mergedwindowsK4me3$region
#assigning combined P value for merged windows
p.mergedwindowsK4me3 <- combineTests(mergedwindowsK4me3$id, results$table)
#check to see if most clusters are an acceptable size, if there are huge clusters we need to improve our filtering or limit
summary(width(mergedwindowsK4me3$region))
```
and assign a direction of fold change to the significant merged regions (FDR <= 0.05)
```{r}
#flag significant regions and tabulate the direction of change reported by combineTests()
direction.p.mergedwindowsK4me3 <- p.mergedwindowsK4me3$FDR <= 0.05
table(p.mergedwindowsK4me3$direction[direction.p.mergedwindowsK4me3])
```
List the top regions by p-value to check their FDRs:
```{r}
options(digits = 22)
#coerce the S4 DataFrame to a base data.frame so that dplyr verbs work on it
as.data.frame(p.mergedwindowsK4me3) %>% arrange(PValue) %>% head(n = 10)
```
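Finally, a minimal sketch of saving the merged regions together with their combined statistics for later use (the output filename is an assumption):
```{r, eval=FALSE}
#attach the combined test results to the region coordinates and save
out <- mergedwindowsK4me3$region
mcols(out) <- p.mergedwindowsK4me3
saveRDS(out, "K4me3_12m_csaw_regions.rds")
```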