-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathintro.Rmd
169 lines (116 loc) · 5.75 KB
/
intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
title: "Tidy Genomics"
author: "Constantin Ahlmann-Eltze"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
fig_caption: yes
vignette: >
%\VignetteIndexEntry{Tidy Genomics}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
The most dramatic impact on programming in R the last years was the development of the [tidyverse](http://tidyverse.org/) by Hadley Wickham et al.
which, combined with the ingenious `%>%` from magrittr, provides a uniform philosophy for handling data.
The genomics community has an alternative set of approaches, for which [bioconductor](http://bioconductor.org/) and the
[GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) package provide the basis. The `GenomicRanges` and
the underlying `IRanges` package provide a great set of methods for dealing with intervals as they typically encountered in genomics.
Unfortunately it is not always easy to combine those two worlds, many common operations in `GenomicRanges` focus solely on the
ranges and loose the additional metadata columns. On the other hand the `tidyverse` does not provide a unified set of methods
to do common set operations with intervals.
At least until recently, when the [fuzzyjoin](https://github.com/dgrtwo/fuzzyjoin) package was extended with the `genome_join`
method for combining genomic data stored in a `data.frame`. It demonstrated that genomic data could appropriately be handled
with the _tidy_-philosophy.
The `tidygenomics` package extends the limited set of methods provided by the `fuzzyjoin` package for dealing with genomic
data. Its API is inspired by the very popular [bedtools](http://bedtools.readthedocs.io/en/latest/index.html):
- `genome_intersect`
- `genome_subtract`
- `genome_join_closest`
- `genome_cluster`
- `genome_complement`
- `genome_join` _Provided by the fuzzyjoin package_
```{r, message=FALSE, warning=FALSE, echo=FALSE}
library(dplyr)
library(tidygenomics)
```
## genome_intersect
Joins 2 data frames based on their genomic overlap. Unlike the `genome_join` function it updates the boundaries to reflect
the overlap of the regions.
<img src="resources/genome_intersect_docu.png" alt="genome_intersect" style="width: 100%;"/>
```{r}
x1 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr1", "chr2", "chr2"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
x2 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr2", "chr2", "chr1"),
start = c(140, 210, 400, 300),
end = c(160, 240, 415, 320))
genome_intersect(x1, x2, by=c("chromosome", "start", "end"), mode="both")
```
## genome_subtract
Subtracts one data frame from the other. This can be used to split the x data frame into smaller areas.
<img src="resources/genome_subtract_docu.png" alt="genome_subtract" style="width: 100%;"/>
```{r}
x1 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr1", "chr2", "chr1"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
x2 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr2", "chr1", "chr1"),
start = c(120, 210, 300, 400),
end = c(125, 240, 320, 415))
genome_subtract(x1, x2, by=c("chromosome", "start", "end"))
```
## genome_join_closest
Joins 2 data frames based on their genomic location. If no exact overlap is found the next closest interval is used.
<img src="resources/genome_join_closest_docu.png" alt="genome_join_closest" style="width: 100%;"/>
```{r}
x1 <- tibble(id = 1:4,
chr = c("chr1", "chr1", "chr2", "chr3"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
x2 <- tibble(id = 1:4,
chr = c("chr1", "chr1", "chr1", "chr2"),
start = c(220, 210, 300, 400),
end = c(225, 240, 320, 415))
genome_join_closest(x1, x2, by=c("chr", "start", "end"), distance_column_name="distance", mode="left")
```
## genome_cluster
Add a new column with the cluster if 2 intervals are overlapping or are within the `max_distance`.
<img src="resources/genome_cluster_docu.png" alt="genome_cluster" style="width: 100%;"/>
```{r}
x1 <- data.frame(id = 1:4, bla=letters[1:4],
chromosome = c("chr1", "chr1", "chr2", "chr1"),
start = c(100, 120, 300, 260),
end = c(150, 250, 350, 450))
genome_cluster(x1, by=c("chromosome", "start", "end"))
genome_cluster(x1, by=c("chromosome", "start", "end"), max_distance=10)
```
## genome_complement
Calculates the complement of a genomic region.
<img src="resources/genome_complement_docu.png" alt="genome_complement" style="width: 100%;"/>
```{r}
x1 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr1", "chr2", "chr1"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
genome_complement(x1, by=c("chromosome", "start", "end"))
```
## genome_join
Classical join function based on the overlap of the interval. Implemented and mainted in the
[fuzzyjoin](https://github.com/dgrtwo/fuzzyjoin) package and documented here only for completeness.
<img src="resources/genome_join_docu.png" alt="genome_join" style="width: 100%;"/>
```{r}
x1 <- tibble(id = 1:4,
chr = c("chr1", "chr1", "chr2", "chr3"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
x2 <- tibble(id = 1:4,
chr = c("chr1", "chr1", "chr1", "chr2"),
start = c(220, 210, 300, 400),
end = c(225, 240, 320, 415))
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="inner")
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="left")
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="anti")
```