Betadiversity.Rmd

---
title: "Beta diversity and microbiome divergence"
author: "Leo Lahti, Sudarshan Shetty et al."
bibliography: 
- bibliography.bib
output:
  BiocStyle::html_document:
    number_sections: no
    toc: yes
    toc_depth: 4
    toc_float: true
    self_contained: true
    thumbnails: true
    lightbox: true
    gallery: true
    use_bookdown: false
    highlight: haddock    
---
<!--
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{microbiome tutorial - variability}
  %\usepackage[utf8]{inputenc}
  %\VignetteEncoding{UTF-8}  
-->


## Beta diversity

Beta diversity quantifies dissimilarity in community composition between samples. Dissimilarity can be also quantified by _distance_ or _divergence_. These measures have a broad use in statistical data analysis.

The [vegan R package](https://cran.r-project.org/web/packages/vegan/index.html) and the [phyloseq R package](https://bioconductor.org/packages/release/bioc/html/phyloseq.html) implement a number of standard ecological dissimilarity measures implemented in the 'vegdist' function.

Here, we show brief examples on how to compare sample heterogeneity between groups and over time. 


Load example data

```{r divergence-example2, message=FALSE}
library(microbiome)
library(dplyr)
data(peerj32)
pseq <- peerj32$phyloseq
```

## Intra-individual divergence 

Divergence within subjects may increase following intervention.

```{r divergence2c, message=FALSE, warning=FALSE}
library(tidyverse)
betas <- list()
groups <- as.character(unique(meta(pseq)$group))
for (g in groups) {

  df <- subset(meta(pseq), group == g)
  beta <- c()

  for (subj in df$subject) {
    # Pick the samples for this subject
    dfs <- subset(df, subject == subj)
    # Check that the subject has two time points
    if (nrow(dfs) == 2) {
      s <- as.character(dfs$sample)
      # Here with just two samples we can calculate the
      beta[[subj]] <- divergence(abundances(pseq)[, s[[1]]],abundances(pseq)[, s[[2]]],method = "bray")
    }
  }
  betas[[g]] <- beta
}
# boxplot
df <- as.data.frame(unlist(betas))
s<- rownames(df)
si<- as.data.frame(s)
si<- separate(si, s, into = c('names','s'))
df1<- bind_cols(df, si)
rownames(df1)<- df1$s ; df1$s<- NULL

p<- ggplot(df1, aes(x = names, y = `unlist(betas)`))+ geom_boxplot() + ylab('') + xlab('')

plot(p) 

```


## Divergence within individual over time

Community divergence within individual often tends to increase over time with respect to the baseline sample.

```{r homogeneity-example2d, message=FALSE, warning=FALSE}
library(MicrobeDS)
library(microbiome)
library(dplyr)
library(vegan)
data(MovingPictures)

# Pick the metadata for this subject and sort the
# samples by time

# Pick the data and modify variable names
pseq <- MovingPictures
s <- "F4" # Selected subject
b <- "UBERON:feces" # Selected body site

# Let us pick a subset
pseq <- subset_samples(MovingPictures, host_subject_id == s & body_site == b) 

# Rename variables
sample_data(pseq)$subject <- sample_data(pseq)$host_subject_id
sample_data(pseq)$sample <- sample_data(pseq)$X.SampleID

# Tidy up the time point information (convert from dates to days)
sample_data(pseq)$time <- as.numeric(as.Date(gsub(" 0:00", "", as.character(sample_data(pseq)$collection_timestamp)), "%m/%d/%Y") - as.Date("10/21/08", "%m/%d/%Y"))

# Order the entries by time
df <- meta(pseq) %>% arrange(time)

# Calculate the beta diversity between each time point and
# the baseline (first) time point
beta <- c() # Baseline similarity
s0 <- subset(df, time == 0)$sample
# Let us transform to relative abundance for Bray-Curtis calculations
a <-microbiome::abundances(microbiome::transform(pseq, "compositional")) 
for (tp in df$time[-1]) {
  # Pick the samples for this subject
  # If the same time point has more than one sample,
  # pick one at random
  st <- sample(subset(df, time == tp)$sample, 1)
  # Beta diversity between the current time point and baseline
  b <- vegdist(rbind(a[, s0], a[, st]), method = "bray")
  # Add to the list
  beta <- rbind(beta, c(tp, b))
}
colnames(beta) <- c("time", "beta")
beta <- as.data.frame(beta)

theme_set(theme_bw(20))
library(ggplot2)
p <- ggplot(beta, aes(x = time, y = beta)) +
       geom_point() +
       geom_line() +
       geom_smooth() +
       labs(x = "Time (Days)", y = "Beta diversity (Bray-Curtis)")
print(p)
```


## Inter-individual divergence / spread 

Divergence within a sample set quantifies the overall heterogeneity in community composition across samples or individuals. This is sometimes quantified as the average dissimilarity of each sample from the group mean; the dissimilarity can be quantified by beta diversity as in [Salonen et al. ISME J 2014](http://www.nature.com/ismej/journal/v8/n11/full/ismej201463a.html) (they focused on homogeneity using inverse divergence but the measure is essentially the same). 

Calculate divergences within the LGG (probiotic) and Placebo groups with respect to the median profile within each group.

```{r divergence-example2bb, message=FALSE}
pseq <- peerj32$phyloseq

b.pla <- divergence(subset_samples(pseq, group == "Placebo"),
   apply(abundances(subset_samples(pseq, group == "Placebo")), 1, median))

b.lgg <- divergence(subset_samples(pseq, group == "LGG"),
   apply(abundances(subset_samples(pseq, group == "LGG")), 1, median))
```


The group with larger values has a more heterogeneous community composition. 

```{r divergence-example2bbb, message=FALSE}
library(reshape)
l<- list(b.pla, b.lgg)
df<- melt(l)
df$L1[df$L1 == '1']<- 'placebo'
df$L1[df$L1 == '2']<- 'LGG'

df$L1<- factor(df$L1, levels = c('placebo','LGG'))


p<- ggplot(df, aes(x = L1, y = value)) + geom_boxplot()+ xlab('')

plot(p)
```


See [Community comparisons](Comparisons.html) for examples on group-level comparisons based on beta diversity.

#####

For visualizing temporal trajectory in beta diversity [click here](TemporalMicrobiotaTrajectory.html)