-
Notifications
You must be signed in to change notification settings - Fork 42
/
Copy path09-initialize-genomic-data.Rmd
437 lines (357 loc) · 17.8 KB
/
09-initialize-genomic-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
```{r, echo = FALSE}
library(circlize)
```
# Initialize with genomic data {#initialize-genomic-plot}
**circlize** is quite flexible to initialize the circular plot
not only by chromosomes, but also by any type of general genomic categories.
## Initialize with cytoband data {#initialize-cytoband}
[Cytoband data](https://genome.ucsc.edu/cgi-bin/hgTables?hgta_table=cytoBand&hgta_doSchema=describe%20table%20schema) is
an ideal data source to initialize genomic plots. It contains length of
chromosomes as well as so called "chromosome band" annotation to help to
identify positions on chromosomes.
### Basic usage
If you work on human genome, the most straightforward way is to directly use
`circos.initializeWithIdeogram()` (Figure \@ref(fig:genomic-initialize-ideogram-default)).
By default, the function creates a track with chromosome name
and axes, and a track of ideograms.
Although chromosome names added to the plot are pure numeric, actually the
internally names have the "chr" index. When you adding more tracks, the
chromosome names should also have "chr" index.
```{r genomic-initialize-ideogram-default, fig.cap = "Initialize genomic plot, default."}
circos.initializeWithIdeogram()
text(0, 0, "default", cex = 1)
circos.info()
circos.clear()
```
By default, `circos.initializeWithIdeogram()` initializes the plot with
cytoband data of human genome `hg19`. Users can also initialize with other
species by specifying `species` argument and it will automatically download
cytoband files for corresponding species.
```{r, eval = FALSE}
circos.initializeWithIdeogram(species = "hg18")
circos.initializeWithIdeogram(species = "mm10")
```
When you are dealing rare species and there is no cytoband data available yet,
`circos.initializeWithIdeogram()` will try to continue to download the
"chromInfo" file form UCSC, which also contains lengths of chromosomes, but of
course, there is no ideogram track on the plot.
In some cases, when there is no internet connection for downloading or there is
no corresponding data avaiable on UCSC yet. You can manually construct a data frame
which contains ranges of chromosomes or a file path if it is stored in a file,
and sent to `circos.initializeWithIdeogram()`.
```{r eval = FALSE}
cytoband.file = system.file(package = "circlize", "extdata", "cytoBand.txt")
circos.initializeWithIdeogram(cytoband.file)
cytoband.df = read.table(cytoband.file, colClasses = c("character", "numeric",
"numeric", "character", "character"), sep = "\t")
circos.initializeWithIdeogram(cytoband.df)
```
If you read cytoband data directly from file, please explicitly specify
`colClasses` arguments and set the class of position columns as `numeric`. The
reason is since positions are represented as integers, `read.table` would
treat those numbers as `integer` by default. In initialization of circular
plot, **circlize** needs to calculate the summation of all chromosome lengths.
The summation of such large integers would throw error of integer overflow.
By default, `circos.intializeWithIdeogram()` uses all chromosomes which are
available in cytoband data to initialize the circular plot. Users can
choose a subset of chromosomes by specifying `chromosome.index`. This
argument is also for ordering chromosomes (Figure \@ref(fig:genomic-initialize-ideogram-subset)).
```{r genomic-initialize-ideogram-subset, fig.cap = "Initialize genomic plot, subset chromosomes."}
circos.initializeWithIdeogram(chromosome.index = paste0("chr", c(3,5,2,8)))
text(0, 0, "subset of chromosomes", cex = 1)
circos.clear()
```
When there is no cytoband data for the specified species, and when chromInfo data
is used instead, there may be many many extra short contigs. `chromosome.index`
can also be useful to remove unnecessary contigs.
### Pre-defined tracks
After the initialization of the circular plot,
`circos.initializeWithIdeogram()` additionally creates a track where there are
genomic axes and chromosome names, and create another track where there is an
ideogram (depends on whether cytoband data is available). `plotType` argument
is used to control which type of tracks to add. (figure Figure \@ref(fig:genomic-initialize-ideogram-track)).
```{r genomic-initialize-ideogram-track, echo = -1, fig.width = 8, fig.height = 4, fig.cap = "Initialize genomic plot, control tracks."}
par(mfrow = c(1, 2))
circos.initializeWithIdeogram(plotType = c("axis", "labels"))
text(0, 0, "plotType = c('axis', 'labels')", cex = 1)
circos.clear()
circos.initializeWithIdeogram(plotType = NULL)
text(0, 0, "plotType = NULL", cex = 1)
circos.clear()
```
### Other general settings
Similar as general circular plot, the parameters for the layout can be
controlled by `circos.par()` (Figure \@ref(fig:genomic-initialize-ideogram-par)).
Do remember when you explicitly set `circos.par()`, you need to call `circos.clear()`
to finish the plotting.
```{r genomic-initialize-ideogram-par, echo = -1, fig.width = 8, fig.height = 4, fig.cap = "Initialize genomic plot, control layout."}
par(mfrow = c(1, 2));
circos.par("start.degree" = 90)
circos.initializeWithIdeogram()
circos.clear()
text(0, 0, "'start.degree' = 90", cex = 1)
circos.par("gap.degree" = rep(c(2, 4), 12))
circos.initializeWithIdeogram()
circos.clear()
text(0, 0, "'gap.degree' = rep(c(2, 4), 12)", cex = 1)
```
## Customize chromosome track
By default `circos.initializeWithIdeogram()` initializes the layout and adds
two tracks. When `plotType` argument is set to `NULL`, the circular layout is
only initialized but nothing is added. This makes it possible for users to
completely design their own style of chromosome track.
In the following example, we use different colors to represent chromosomes and
put chromosome names in the center of each cell (Figure \@ref(fig:genomic-customize-chromosome)).
```{r genomic-customize-chromosome, fig.cap = "Customize chromosome track."}
set.seed(123)
circos.initializeWithIdeogram(plotType = NULL)
circos.track(ylim = c(0, 1), panel.fun = function(x, y) {
chr = CELL_META$sector.index
xlim = CELL_META$xlim
ylim = CELL_META$ylim
circos.rect(xlim[1], 0, xlim[2], 1, col = rand_color(1))
circos.text(mean(xlim), mean(ylim), chr, cex = 0.7, col = "white",
facing = "inside", niceFacing = TRUE)
}, track.height = 0.15, bg.border = NA)
circos.clear()
```
## Initialize with general genomic category
Chromosome is just a special case of genomic category.
`circos.genomicInitialize()` can initialize circular layout with any type of
genomic categories. In fact, `circos.initializeWithIdeogram()` is implemented
by `circos.genomicInitialize()`. The input data for
`circos.genomicInitialize()` is also a data frame with at least three columns.
The first column is genomic category (for cytoband data, it is chromosome
name), and the next two columns are positions in each genomic category. The
range in each category will be inferred as the minimum position and the
maximum position in corresponding category.
In the following example, a circular plot is initialized with three genes.
```{r, eval = FALSE}
df = data.frame(
name = c("TP53", "TP63", "TP73"),
start = c(7565097, 189349205, 3569084),
end = c(7590856, 189615068, 3652765))
circos.genomicInitialize(df)
```
Note it is not necessary that the record for each gene is only one row.
In following example, we plot the transcripts for TP53, TP63 and TP73 in a
circular layout (Figure \@ref(fig:genomic-gene-model)).
```{r}
tp_family = readRDS(system.file(package = "circlize", "extdata", "tp_family_df.rds"))
head(tp_family)
```
In the following code, we first create a track which identifies three genes.
```{r genomic_gene_model_1, eval = FALSE}
circos.genomicInitialize(tp_family)
circos.track(ylim = c(0, 1),
bg.col = c("#FF000040", "#00FF0040", "#0000FF40"),
bg.border = NA, track.height = 0.05)
```
Next, we put transcripts one after the other for each gene. It is simply to
add lines and rectangles. The usage of `circos.genomicTrack()` and
`circos.genomicRect()` will be discussed in Chapter \@ref(genomic-plotting-region).
```{r genomic_gene_model_2, eval = FALSE}
n = max(tapply(tp_family$transcript, tp_family$gene, function(x) length(unique(x))))
circos.genomicTrack(tp_family, ylim = c(0.5, n + 0.5),
panel.fun = function(region, value, ...) {
all_tx = unique(value$transcript)
for(i in seq_along(all_tx)) {
l = value$transcript == all_tx[i]
# for each transcript
current_tx_start = min(region[l, 1])
current_tx_end = max(region[l, 2])
circos.lines(c(current_tx_start, current_tx_end),
c(n - i + 1, n - i + 1), col = "#CCCCCC")
circos.genomicRect(region[l, , drop = FALSE], ytop = n - i + 1 + 0.4,
ybottom = n - i + 1 - 0.4, col = "orange", border = NA)
}
}, bg.border = NA, track.height = 0.4)
circos.clear()
```
```{r genomic-gene-model, echo = FALSE,fig.width = 6, fig.height = 6, fig.cap = "Circular representation of alternative transcripts for genes."}
chunks <- knitr:::knit_code$get()
eval(parse(text = chunks[["genomic_gene_model_1"]]))
eval(parse(text = chunks[["genomic_gene_model_2"]]))
```
In Figure \@ref(fig:genomic-gene-model), you may notice the start of axes
becomes "0KB" while not the original values. It is just an adjustment of the
axes labels to reflect the relative distance to the start of each gene, while
the coordinate in the cells are still using the original values. Set
`tickLabelsStartFromZero` to `FALSE` to recover axes labels to the original
values.
## Zooming chromosomes
The strategy is the same as introduced in Section \@ref(zooming-of-sectors).
We first define a function `extend_chromosomes()` which copy data in subset of
chromosomes into the original data frame.
```{r}
extend_chromosomes = function(bed, chromosome, prefix = "zoom_") {
zoom_bed = bed[bed[[1]] %in% chromosome, , drop = FALSE]
zoom_bed[[1]] = paste0(prefix, zoom_bed[[1]])
rbind(bed, zoom_bed)
}
```
We use `read.cytoband()` to download and read cytoband data from UCSC. In following,
x ranges for normal chromosomes and zoomed chromosomes are normalized separetely.
```{r}
cytoband = read.cytoband()
cytoband_df = cytoband$df
chromosome = cytoband$chromosome
xrange = c(cytoband$chr.len, cytoband$chr.len[c("chr1", "chr2")])
normal_chr_index = 1:24
zoomed_chr_index = 25:26
# normalize in normal chromsomes and zoomed chromosomes separately
sector.width = c(xrange[normal_chr_index] / sum(xrange[normal_chr_index]),
xrange[zoomed_chr_index] / sum(xrange[zoomed_chr_index]))
```
The extended cytoband data which is in form of a data frame is sent to
`circos.initializeWithIdeogram()`. You can see the ideograms for chromosome 1
and 2 are zoomed (Figure \@ref(fig:genomic-zoom)).
```{r genomic_zoom_1, eval = FALSE}
circos.par(start.degree = 90)
circos.initializeWithIdeogram(extend_chromosomes(cytoband_df, c("chr1", "chr2")),
sector.width = sector.width)
```
Add a new track.
```{r genomic_zoom_2, eval = FALSE}
bed = generateRandomBed(500)
circos.genomicTrack(extend_chromosomes(bed, c("chr1", "chr2")),
panel.fun = function(region, value, ...) {
circos.genomicPoints(region, value, pch = 16, cex = 0.3)
})
```
Add a link from original chromosome to the zoomed chromosome (Figure \@ref(fig:genomic-zoom)).
```{r genomic_zoom_3, eval = FALSE}
circos.link("chr1", get.cell.meta.data("cell.xlim", sector.index = "chr1"),
"zoom_chr1", get.cell.meta.data("cell.xlim", sector.index = "zoom_chr1"),
col = "#00000020", border = NA)
circos.clear()
```
```{r genomic-zoom, echo = FALSE, fig.width = 6, fig.height = 6, fig.cap = "Zoom chromosomes."}
chunks <- knitr:::knit_code$get()
eval(parse(text = chunks[["genomic_zoom_1"]]))
eval(parse(text = chunks[["genomic_zoom_2"]]))
eval(parse(text = chunks[["genomic_zoom_3"]]))
```
## Concatenating two genomes
There are some scenarios where users want to visualize more than one genomes
in the circular plots. This can be done by making "a combined genome". In the
following example, I combine both human and mouse genomes.
```{r}
human_cytoband = read.cytoband(species = "hg19")$df
mouse_cytoband = read.cytoband(species = "mm10")$df
```
One thing important is since the two genome will be combined, the chromosome
names for each genome need to be adjusted. Here I simply add `human_`/`mouse_`
prefix.
```{r}
human_cytoband[ ,1] = paste0("human_", human_cytoband[, 1])
mouse_cytoband[ ,1] = paste0("mouse_", mouse_cytoband[, 1])
```
Now I can combine the two cytoband data frames into one.
```{r}
cytoband = rbind(human_cytoband, mouse_cytoband)
head(cytoband)
```
The combined `cytoband` is still a valid cytoband data frame, thus, the ideograms
can be drawn for the combined genome. Also when I construct the variable `chromosome.index`,
I let human chromosome 1 to be close to mouse chromosome 1 (Figure \@ref(fig:genomic-combined-default).
```{r genomic-combined-default, fig.cap = "Default style of two combined genomes."}
chromosome.index = c(paste0("human_chr", c(1:22, "X", "Y")),
rev(paste0("mouse_chr", c(1:19, "X", "Y"))))
circos.initializeWithIdeogram(cytoband, chromosome.index = chromosome.index)
circos.clear()
```
By default, in the plot there are chromosome names, axes and ideograms. Now
for the combined genome, since there are quite a lot of chromosomes, each
chromosome will be very short in the plot, which makes it not easy to read the
axes and the long chromosome names. In the following improved code, I turn off
the chromosome name and the axes. We create a small track to discriminate
human chromosomes and mouse chromosomes (by `highlight.chromosome()`) and I
only write the numeric indices (also `X` and `Y`) for chromosomes. A gap of 5
degrees is set between human and mouse chromosomes (by
`circos.par("gap.after")`) (Figure \@ref(fig:genomic-combined-improved).
```{r genomic-combined-improved, fig.cap = "Improved visualization of the combined genome."}
circos.par(gap.after = c(rep(1, 23), 5, rep(1, 20), 5))
circos.initializeWithIdeogram(cytoband, plotType = NULL,
chromosome.index = chromosome.index)
circos.track(ylim = c(0, 1), panel.fun = function(x, y) {
circos.text(CELL_META$xcenter, CELL_META$ylim[2] + mm_y(2),
gsub(".*chr", "", CELL_META$sector.index), cex = 0.6, niceFacing = TRUE)
}, track.height = mm_h(1), cell.padding = c(0, 0, 0, 0), bg.border = NA)
highlight.chromosome(paste0("human_chr", c(1:22, "X", "Y")),
col = "red", track.index = 1)
highlight.chromosome(paste0("mouse_chr", c(1:19, "X", "Y")),
col = "blue", track.index = 1)
circos.genomicIdeogram(cytoband)
circos.clear()
```
In previous example, I demonstrate to create the circular layout for the
combined genome with the cytoband data frames. The layout can also be created
only by the chromosome ranges, i.e., the length of each chromosome. In the
following code, `read.chromInfo()` can fetch the chromosome range for a
specific genome.
```{r}
human_chromInfo = read.chromInfo(species = "hg19")$df
mouse_chromInfo = read.chromInfo(species = "mm10")$df
human_chromInfo[ ,1] = paste0("human_", human_chromInfo[, 1])
mouse_chromInfo[ ,1] = paste0("mouse_", mouse_chromInfo[, 1])
chromInfo = rbind(human_chromInfo, mouse_chromInfo)
# note the levels of the factor controls the chromosome orders in the plot
chromInfo[, 1] = factor(chromInfo[ ,1], levels = chromosome.index)
head(chromInfo)
```
With the specified chromosome ranges, `circos.genomicInitialize()` is used to initialize the
layout (Figure \@ref(fig:genomic-combined-genomic-initialize).
```{r genomic-combined-genomic-initialize, fig.cap = "Initialize the plot with chromosome ranges."}
circos.par(gap.after = c(rep(1, 23), 5, rep(1, 20), 5))
circos.genomicInitialize(chromInfo, plotType = NULL)
circos.track(ylim = c(0, 1), panel.fun = function(x, y) {
circos.text(CELL_META$xcenter, CELL_META$ylim[2] + mm_y(2),
gsub(".*chr", "", CELL_META$sector.index), cex = 0.6, niceFacing = TRUE)
}, track.height = mm_h(1), cell.padding = c(0, 0, 0, 0), bg.border = NA)
highlight.chromosome(paste0("human_chr", c(1:22, "X", "Y")),
col = "red", track.index = 1)
highlight.chromosome(paste0("mouse_chr", c(1:19, "X", "Y")),
col = "blue", track.index = 1)
circos.track(ylim = c(0, 1))
circos.clear()
```
Adding more tracks has no difference to the single-genome plots. The only thing
to note is the chromosome names should be properly formatted. In the following
code, I create a track of points and add links between human and mouse genomes
(Figure \@ref(fig:genomic-combined-more-tracks).
```{r genomic-combined-more-tracks, fig.cap = "The combined genome with more tracks."}
circos.par(gap.after = c(rep(1, 23), 5, rep(1, 20), 5))
circos.genomicInitialize(chromInfo, plotType = NULL)
circos.track(ylim = c(0, 1), panel.fun = function(x, y) {
circos.text(CELL_META$xcenter, CELL_META$ylim[2] + mm_y(2),
gsub(".*chr", "", CELL_META$sector.index), cex = 0.6, niceFacing = TRUE)
}, track.height = mm_h(1), cell.padding = c(0, 0, 0, 0), bg.border = NA)
highlight.chromosome(paste0("human_chr", c(1:22, "X", "Y")),
col = "red", track.index = 1)
highlight.chromosome(paste0("mouse_chr", c(1:19, "X", "Y")),
col = "blue", track.index = 1)
circos.genomicIdeogram(cytoband)
# a track of points
human_df = generateRandomBed(200, species = "hg19")
mouse_df = generateRandomBed(200, species = "mm10")
human_df[ ,1] = paste0("human_", human_df[, 1])
mouse_df[ ,1] = paste0("mouse_", mouse_df[, 1])
df = rbind(human_df, mouse_df)
circos.genomicTrack(df, panel.fun = function(region, value, ...) {
circos.genomicPoints(region, value, col = rand_color(1), cex = 0.5, ...)
})
# links between human and mouse genomes
human_mid = data.frame(
chr = paste0("human_chr", 1:19),
mid = round((human_chromInfo[1:19, 2] + human_chromInfo[1:19, 3])/2)
)
mouse_mid = data.frame(
chr = paste0("mouse_chr", 1:19),
mid = round((mouse_chromInfo[1:19, 2] + mouse_chromInfo[1:19, 3])/2)
)
circos.genomicLink(human_mid, mouse_mid, col = rand_color(19))
circos.clear()
text(-0.9, -0.8, "Human\ngenome")
text(0.9, 0.8, "Mouse\ngenome")
```