add select,filter,mutate,left_join for treedata #19

Merged: 7 commits, Aug 22, 2021

xiangpin
Member

  • Add select, filter, mutate, and left_join methods for treedata objects.
    Introduce a keep.td argument that controls whether select, filter, and mutate return a treedata object.
    For select, the default is keep.td=FALSE, which returns a tbl_df.
    For filter and mutate, the default is keep.td=TRUE, which returns a treedata object.
    These verbs only process the data associated with the tree, not the tree itself.
  • Add left_join to attach external associated data to a treedata object.
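
As a quick check of the keep.td defaults described above, here is a minimal sketch. It assumes installed treeio and tidytree versions that include this PR. The three-tip tree and the `flag` column are made up for illustration.

```r
library(treeio)
library(tidytree)

# Hypothetical toy tree with three tips, just for illustration
nwk <- "((A:1,B:1):1,C:2);"
tree <- treeio::as.treedata(treeio::read.tree(text = nwk))
tree@data <- tibble::tibble(node = c(1, 2), group = c("x", "y"))

# select() defaults to keep.td = FALSE, so a tibble should come back
sel <- tree %>% select(group)

# filter() and mutate() default to keep.td = TRUE, so a treedata should come back
flt <- tree %>% filter(group == "x")
mut <- tree %>% mutate(flag = TRUE)

# Passing keep.td explicitly overrides the default in either direction
sel_td <- tree %>% select(group, keep.td = TRUE)
flt_tbl <- tree %>% filter(group == "x", keep.td = FALSE)

class(sel)
class(flt)
```

Setting keep.td explicitly in a pipeline makes the return type clear to the reader, since the default differs between select and the other two verbs.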

Examples

> library(treeio)
treeio v1.17.2  For help: https://yulab-smu.top/treedata-book/

If you use treeio in published research, please cite:

LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package for phylogenetic tree input and output with richly annotated and associated data. Molecular Biology and Evolution 2020, 37(2):599-603. doi: 10.1093/molbev/msz240

> library(tidytree)
> nwk <- '(((((((A:4,B:4):6,C:5):8,D:6):3,E:21):10,((F:4,G:12):14,H:8):13):13,((I:5,J:2):30,(K:11,L:11):2):17):4,M:56);'
> dat <- tibble(node=c(1, 2, 3, 4, 5), group=c("A", "A", "A", "B", "B"), test=c(10, 20, 30, 40, 50))
>
> tree <- read.tree(text=nwk) %>% treeio::as.treedata()
> tree@data <- dat
> tree
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 13 tips and 12 internal nodes.

Tip labels:
  A, B, C, D, E, F, ...

Rooted; includes branch lengths.

with the following features available:
        'group',        'test'.

# The associated data tibble abstraction: 25 x 5
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip group  test
   <int> <chr> <lgl> <chr> <dbl>
 1     1 A     TRUE  A        10
 2     2 B     TRUE  A        20
 3     3 C     TRUE  A        30
 4     4 D     TRUE  B        40
 5     5 E     TRUE  B        50
 6     6 F     TRUE  NA       NA
 7     7 G     TRUE  NA       NA
 8     8 H     TRUE  NA       NA
 9     9 I     TRUE  NA       NA
10    10 J     TRUE  NA       NA
# … with 15 more rows

select

> tree %>% select(group)
# A tibble: 25 x 1
   group
   <chr>
 1 A
 2 A
 3 A
 4 B
 5 B
 6 NA
 7 NA
 8 NA
 9 NA
10 NA
# … with 15 more rows
> tree %>% select(node, group) %>% filter(!is.na(group))
# A tibble: 5 x 2
   node group
  <int> <chr>
1     1 A
2     2 A
3     3 A
4     4 B
5     5 B
> tree %>% select(-group, keep.td=TRUE)
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 13 tips and 12 internal nodes.

Tip labels:
  A, B, C, D, E, F, ...

Rooted; includes branch lengths.

with the following features available:
        'test'.

# The associated data tibble abstraction: 25 x 4
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip  test
   <int> <chr> <lgl> <dbl>
 1     1 A     TRUE     10
 2     2 B     TRUE     20
 3     3 C     TRUE     30
 4     4 D     TRUE     40
 5     5 E     TRUE     50
 6     6 F     TRUE     NA
 7     7 G     TRUE     NA
 8     8 H     TRUE     NA
 9     9 I     TRUE     NA
10    10 J     TRUE     NA
# … with 15 more rows

filter

> tree %>% filter(group=="A", keep.td=FALSE)
# A tibble: 3 x 5
   node label isTip group  test
  <int> <chr> <lgl> <chr> <dbl>
1     1 A     TRUE  A        10
2     2 B     TRUE  A        20
3     3 C     TRUE  A        30
> tree %>% filter(group=="A")
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 13 tips and 12 internal nodes.

Tip labels:
  A, B, C, D, E, F, ...

Rooted; includes branch lengths.

with the following features available:
        'group',        'test'.

# The associated data tibble abstraction: 25 x 5
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip group  test
   <int> <chr> <lgl> <chr> <dbl>
 1     1 A     TRUE  A        10
 2     2 B     TRUE  A        20
 3     3 C     TRUE  A        30
 4     4 D     TRUE  NA       NA
 5     5 E     TRUE  NA       NA
 6     6 F     TRUE  NA       NA
 7     7 G     TRUE  NA       NA
 8     8 H     TRUE  NA       NA
 9     9 I     TRUE  NA       NA
10    10 J     TRUE  NA       NA
# … with 15 more rows
> tree %>% filter(group=="A" & test>=20)
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 13 tips and 12 internal nodes.

Tip labels:
  A, B, C, D, E, F, ...

Rooted; includes branch lengths.

with the following features available:
        'group',        'test'.

# The associated data tibble abstraction: 25 x 5
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip group  test
   <int> <chr> <lgl> <chr> <dbl>
 1     1 A     TRUE  NA       NA
 2     2 B     TRUE  A        20
 3     3 C     TRUE  A        30
 4     4 D     TRUE  NA       NA
 5     5 E     TRUE  NA       NA
 6     6 F     TRUE  NA       NA
 7     7 G     TRUE  NA       NA
 8     8 H     TRUE  NA       NA
 9     9 I     TRUE  NA       NA
10    10 J     TRUE  NA       NA
# … with 15 more rows

mutate

> tree %>% mutate(type="A", keep.td=FALSE)
# A tibble: 25 x 6
    node label isTip group  test type
   <int> <chr> <lgl> <chr> <dbl> <chr>
 1     1 A     TRUE  A        10 A
 2     2 B     TRUE  A        20 A
 3     3 C     TRUE  A        30 A
 4     4 D     TRUE  B        40 A
 5     5 E     TRUE  B        50 A
 6     6 F     TRUE  NA       NA A
 7     7 G     TRUE  NA       NA A
 8     8 H     TRUE  NA       NA A
 9     9 I     TRUE  NA       NA A
10    10 J     TRUE  NA       NA A
# … with 15 more rows
> tree %>% mutate(test="A", keep.td=FALSE)
# A tibble: 25 x 5
    node label isTip group test
   <int> <chr> <lgl> <chr> <chr>
 1     1 A     TRUE  A     A
 2     2 B     TRUE  A     A
 3     3 C     TRUE  A     A
 4     4 D     TRUE  B     A
 5     5 E     TRUE  B     A
 6     6 F     TRUE  NA    A
 7     7 G     TRUE  NA    A
 8     8 H     TRUE  NA    A
 9     9 I     TRUE  NA    A
10    10 J     TRUE  NA    A
# … with 15 more rows
> tree %>% mutate(test="A")
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 13 tips and 12 internal nodes.

Tip labels:
  A, B, C, D, E, F, ...

Rooted; includes branch lengths.

with the following features available:
        'group',        'test'.

# The associated data tibble abstraction: 25 x 5
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip group test
   <int> <chr> <lgl> <chr> <chr>
 1     1 A     TRUE  A     A
 2     2 B     TRUE  A     A
 3     3 C     TRUE  A     A
 4     4 D     TRUE  B     A
 5     5 E     TRUE  B     A
 6     6 F     TRUE  NA    A
 7     7 G     TRUE  NA    A
 8     8 H     TRUE  NA    A
 9     9 I     TRUE  NA    A
10    10 J     TRUE  NA    A
# … with 15 more rows

left_join

> set.seed(123)
> df <- data.frame(label=tree@phylo$tip.label, value=abs(rnorm(length(tree@phylo$tip.label))))
> N <- tree %>% treeio::Nnode(internal.only=FALSE)
> dt <- data.frame(ind=rep(seq_len(N), 2), group=rep(c("A","B"), each=N))
> tr2 <- tree %>% left_join(df, by="label")
> tr3 <- tree %>% left_join(dt, by=c("node"="ind"))
> tr2
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 13 tips and 12 internal nodes.

Tip labels:
  A, B, C, D, E, F, ...

Rooted; includes branch lengths.

with the following features available:
        'group',        'test', 'value'.

# The associated data tibble abstraction: 25 x 6
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip group  test  value
   <int> <chr> <lgl> <chr> <dbl>  <dbl>
 1     1 A     TRUE  A        10 0.560
 2     2 B     TRUE  A        20 0.230
 3     3 C     TRUE  A        30 1.56
 4     4 D     TRUE  B        40 0.0705
 5     5 E     TRUE  B        50 0.129
 6     6 F     TRUE  NA       NA 1.72
 7     7 G     TRUE  NA       NA 0.461
 8     8 H     TRUE  NA       NA 1.27
 9     9 I     TRUE  NA       NA 0.687
10    10 J     TRUE  NA       NA 0.446
# … with 15 more rows
> tr3
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 13 tips and 12 internal nodes.

Tip labels:
  A, B, C, D, E, F, ...

Rooted; includes branch lengths.

with the following features available:
        'group',        'test', 'group.y'.

# The associated data tibble abstraction: 25 x 6
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip group  test group.y
   <int> <chr> <lgl> <chr> <dbl> <list>
 1     1 A     TRUE  A        10 <tibble [2 × 1]>
 2     2 B     TRUE  A        20 <tibble [2 × 1]>
 3     3 C     TRUE  A        30 <tibble [2 × 1]>
 4     4 D     TRUE  B        40 <tibble [2 × 1]>
 5     5 E     TRUE  B        50 <tibble [2 × 1]>
 6     6 F     TRUE  NA       NA <tibble [2 × 1]>
 7     7 G     TRUE  NA       NA <tibble [2 × 1]>
 8     8 H     TRUE  NA       NA <tibble [2 × 1]>
 9     9 I     TRUE  NA       NA <tibble [2 × 1]>
10    10 J     TRUE  NA       NA <tibble [2 × 1]>
# … with 15 more rows
> tr3 %>% left_join(dt, by=c("node"="ind"))
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 13 tips and 12 internal nodes.

Tip labels:
  A, B, C, D, E, F, ...

Rooted; includes branch lengths.

with the following features available:
        'group',        'test', 'group.y',      'group.y.y'.

# The associated data tibble abstraction: 25 x 7
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip group  test group.y          group.y.y
   <int> <chr> <lgl> <chr> <dbl> <list>           <list>
 1     1 A     TRUE  A        10 <tibble [2 × 1]> <tibble [2 × 1]>
 2     2 B     TRUE  A        20 <tibble [2 × 1]> <tibble [2 × 1]>
 3     3 C     TRUE  A        30 <tibble [2 × 1]> <tibble [2 × 1]>
 4     4 D     TRUE  B        40 <tibble [2 × 1]> <tibble [2 × 1]>
 5     5 E     TRUE  B        50 <tibble [2 × 1]> <tibble [2 × 1]>
 6     6 F     TRUE  NA       NA <tibble [2 × 1]> <tibble [2 × 1]>
 7     7 G     TRUE  NA       NA <tibble [2 × 1]> <tibble [2 × 1]>
 8     8 H     TRUE  NA       NA <tibble [2 × 1]> <tibble [2 × 1]>
 9     9 I     TRUE  NA       NA <tibble [2 × 1]> <tibble [2 × 1]>
10    10 J     TRUE  NA       NA <tibble [2 × 1]> <tibble [2 × 1]>
# … with 15 more rows
>
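
Note on the `<tibble [2 × 1]>` cells in tr3: dt has two rows per node, so each node matches two external rows, and the output shows them stored as one list cell per node. The sketch below reproduces that shape with plain dplyr and tidyr on an ordinary tibble. It only illustrates the pattern and is not necessarily how the treedata method is implemented. The `nodes` and `ext` tables are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins: one row per node, and external data with two rows per node
nodes <- tibble(node = 1:3)
ext <- tibble(ind = rep(1:3, 2), group = rep(c("A", "B"), each = 3))

# A plain left_join duplicates each node row (3 nodes x 2 matches = 6 rows)
long <- nodes %>% left_join(ext, by = c("node" = "ind"))

# Nesting the joined column brings it back to one row per node,
# with a list column holding a 2 x 1 tibble in each cell
joined <- long %>% nest(group = group)

joined
```

Keeping one row per node is what lets the result stay aligned with the tree; the nested values can be expanded again with unnest when a flat table is needed.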

@xiangpin
Member Author

Also added unnest, pull, and rename.

pull

> library(tidytree)
> library(treeio)
> nwk <- '(((((((A:4,B:4):6,C:5):8,D:6):3,E:21):10,((F:4,G:12):14,H:8):13):13,((I:5,J:2):30,(K:11,L:11):2):17):4,M:56);'
> tr <- read.tree(textConnection(nwk))
> dat <- data.frame(label=rep(tr$tip.label, 2), value=abs(rnorm(length(tr$tip.label)*2)), group=rep(c("A", "B"), each=length(tr$tip.label)))
> dat
   label     value group
1      A 1.5519595     A
2      B 1.3769063     A
3      C 0.2054155     A
4      D 1.1060085     A
5      E 0.6554567     A
6      F 0.4969638     A
7      G 0.8210107     A
8      H 0.8286685     A
9      I 0.1559962     A
10     J 0.2080823     A
11     K 1.1358309     A
12     L 0.5969092     A
13     M 1.9534269     A
14     A 0.1723078     B
15     B 0.6882117     B
16     C 0.8640694     B
17     D 0.2769176     B
18     E 1.4081071     B
19     F 1.9001461     B
20     G 0.6673657     B
21     H 0.2133854     B
22     I 2.1669321     B
23     J 0.2632242     B
24     K 0.1254599     B
25     L 1.6644862     B
26     M 1.2938236     B
> tr %>% left_join(dat, by="label")
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 13 tips and 12 internal nodes.

Tip labels:
  A, B, C, D, E, F, ...

Rooted; includes branch lengths.

with the following features available:
        '',     'value',        'group'.

# The associated data tibble abstraction: 25 x 5
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip value            group
   <int> <chr> <lgl> <list>           <list>
 1     1 A     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 2     2 B     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 3     3 C     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 4     4 D     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 5     5 E     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 6     6 F     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 7     7 G     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 8     8 H     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 9     9 I     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
10    10 J     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
# … with 15 more rows
> tr %<>% left_join(dat, by="label")
> tr %>% pull(label, name=node)
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
"A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"  NA  NA  NA  NA  NA  NA  NA
 21  22  23  24  25
 NA  NA  NA  NA  NA

rename

> tr %>% rename(G=group)
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 13 tips and 12 internal nodes.

Tip labels:
  A, B, C, D, E, F, ...

Rooted; includes branch lengths.

with the following features available:
        '',     'value',        'G'.

# The associated data tibble abstraction: 25 x 5
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip value            G
   <int> <chr> <lgl> <list>           <list>
 1     1 A     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 2     2 B     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 3     3 C     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 4     4 D     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 5     5 E     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 6     6 F     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 7     7 G     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 8     8 H     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
 9     9 I     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
10    10 J     TRUE  <tibble [2 × 1]> <tibble [2 × 1]>
# … with 15 more rows

unnest

> tr %>% unnest(cols=c(value, group))
# A tbl_df is returned for independent data analysis.
# A tibble: 38 x 5
    node label isTip value group
   <int> <chr> <lgl> <dbl> <chr>
 1     1 A     TRUE  1.55  A
 2     1 A     TRUE  0.172 B
 3     2 B     TRUE  1.38  A
 4     2 B     TRUE  0.688 B
 5     3 C     TRUE  0.205 A
 6     3 C     TRUE  0.864 B
 7     4 D     TRUE  1.11  A
 8     4 D     TRUE  0.277 B
 9     5 E     TRUE  0.655 A
10     5 E     TRUE  1.41  B
# … with 28 more rows
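
The 38 rows reported above are consistent with the following count. It assumes that each of the 12 internal nodes, which have no match in dat, keeps a single row with NA values. Those rows fall outside the first 10 printed, so this is an inference rather than something shown in the output.

```r
# Counts taken from the tree summary printed earlier in this thread
n_tip <- 13        # tips A..M
n_internal <- 12   # internal nodes
rows_per_tip <- 2  # each tip label appears twice in dat (groups A and B)

# Assumption: internal nodes contribute one NA row each after unnest
expected_rows <- n_tip * rows_per_tip + n_internal
expected_rows
#> [1] 38
```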

@GuangchuangYu GuangchuangYu merged commit 652adbe into YuLab-SMU:master Aug 22, 2021