12_universal_enrichment.Rmd

# Universal enrichment analysis {#universal-api}


```{r include=FALSE}
library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, cache=TRUE)
library(clusterProfiler)
```


The `r Biocpkg("clusterProfiler")` package [@yu2012] supports both hypergeometric test and gene set enrichment analyses of many ontology/pathway, but it's still not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations.

The `r Biocpkg("clusterProfiler")` package provides `enricher()` function for hypergeometric test and `GSEA()` function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters `TERM2GENE` and `TERM2NAME`. As indicated in the parameter names, `TERM2GENE` is a data.frame with first column of term ID and second column of corresponding mapped gene and `TERM2NAME` is a `data.frame` with first column of term ID and second column of corresponding term name. `TERM2NAME` is optional.


## Input data

For over representation analysis, all we need is a gene vector, that is a vector of gene IDs. These gene IDs can be obtained by differential expression analysis (*e.g.* with the `r Biocpkg("DESeq2")` package). 

For gene set enrichment analysis, we need a ranked list of genes. `r Biocpkg("DOSE")` provides an example dataset `geneList` which was derived from `R` package `r Biocpkg("breastCancerMAINZ")` that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versus geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in `geneList` dataset. If you want to prepare your own `geneList`, please refer to the [FAQ](#genelist).


We can load the sample data into R via:

```{r}
data(geneList, package="DOSE")
head(geneList)
```

Suppose we define fold change greater than 2 as DEGs:

```{r}
gene <- names(geneList)[abs(geneList) > 2]
head(gene)
```


## Cell Marker	

<!--     # tidyr::unite("cellMarker", tissueType, cancerType, cellName, sep=", ")  -->


```{r}
## cell_marker_data <- vroom::vroom('http://bio-bigdata.hrbmu.edu.cn/CellMarker/download/Human_cell_markers.txt')
url <- "http://yikedaxue.slwshop.cn/Cell_marker_Human.xlsx"
f <- tempfile(fileext = ".xlsx")
download.file(url, f)

cell_marker_data <- readxl::read_excel(f, 1)


## instead of `cellName`, users can use other features (e.g. `cancerType`)
cells <- cell_marker_data %>%
    dplyr::select('Cell name', GeneID) %>%
    dplyr::mutate(GeneID = strsplit(GeneID, ', ')) %>%
    tidyr::unnest(cols = c(GeneID))
```

### Cell Marker over-presentaton analysis {#cell-marker-ora}


```{r}
x <- enricher(gene, TERM2GENE = cells)
head(x)
```

### Cell Marker gene set enrichment analysis {#cell-marker-gsea}

```{r}
y <- GSEA(geneList, TERM2GENE = cells)
head(y)
```

## MSigDb analysis


[Molecular Signatures Database](http://software.broadinstitute.org/gsea/msigdb) is a collection of annotated gene sets. It contains 8 major collections:

* H: hallmark gene sets
* C1: positional gene sets
* C2: curated gene sets
* C3: motif gene sets
* C4: computational gene sets
* C5: GO gene sets
* C6: oncogenic signatures
* C7: immunologic signatures

Users can download [GMT files](www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29) from [Broad Institute](http://software.broadinstitute.org/gsea/msigdb) and use the `read.gmt()` function to parse the file to be used in `enricher()` and `GSEA()`.

There is an R package, [msigdbr](https://cran.r-project.org/package=msigdbr), that already packed the MSigDB gene sets in tidy data format that can be used directly with `r Biocpkg("clusterProfiler")` [@yu2012].

It supports several specices:

```{r}
library(msigdbr)
msigdbr_show_species()
```

We can retrieve all human gene sets:

```{r}
m_df <- msigdbr(species = "Homo sapiens")
head(m_df, 2) %>% as.data.frame
```


Or specific collection. Here we use C6, oncogenic gene sets as an example:

```{r}
m_t2g <- msigdbr(species = "Homo sapiens", category = "C6") %>% 
  dplyr::select(gs_name, entrez_gene)
head(m_t2g)
```


### MSigDb over-presentaton analysis {#msigdb-ora}


```{r}
em <- enricher(gene, TERM2GENE=m_t2g)
head(em)
```


### MSigDb gene set enrichment analysis {#msigdb-gsea}

In over-presentaton analysis, we use oncogenic gene sets (i.e. C6) to test whether the DE genes are involved in the process that leads to cancer. In this example, we will use the C3 category to test whether genes are up/down-regulated by sharing specific motif using GSEA approach.


```{r}
C3_t2g <- msigdbr(species = "Homo sapiens", category = "C3") %>% 
  dplyr::select(gs_name, entrez_gene)
head(C3_t2g)

em2 <- GSEA(geneList, TERM2GENE = C3_t2g)
head(em2)
```