Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding differential expression analysis tutorial pages to Usage section #1447

Merged
merged 18 commits into from
Nov 8, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
553 changes: 553 additions & 0 deletions docs/usage/DE_analysis/de_rstudio.md

Large diffs are not rendered by default.

Binary file added docs/usage/DE_analysis/img/DESeq_function.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/Excalidraw_RNAseq.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/MA_plot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/PCA.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/count_distribution.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/dds_comparison.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/enrichment_plot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/heatmap_de_genes.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/overdispersion.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/plotCounts.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/project_R.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/volcanoplot.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/usage/DE_analysis/img/workdir_RStudio.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
77 changes: 77 additions & 0 deletions docs/usage/DE_analysis/interpretation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
---
order: 5
---

# Interpretation

Once DE genes have been identified, the next crucial step is to interpret the results. This involves the inspection of tables and plots generated during the analysis to understand the biological significance of the data. In this part of the tutorial, we will delve into the results by discussing the significant DE genes and we will explore various plots generated during the analysis.

!!! note

The results illustrated in this section might show slight variations compared to your runs due to randomness in the STAR algorithm. This randomness arises from using variable seed values and parallel processing, leading to minor differences in results between runs on the same data. These small discrepancies are not biologically significant and may affect counts and subsequent plots (such as PCA and count plots). However, the overall patterns and main findings should remain consistent. While exact reproducibility is ideal, minor variations are acceptable in practice, as long as they do not impact the main conclusions of the study.

## Quality control plots

The first plot we will examine is the Principal Component Analysis (PCA) plot. Since we're working with simulated data, our metadata is relatively simple, consisting of just three variables: sample, condition and replica. In a typical RNA-seq experiment, however, metadata can be complex and encompass a wide range of variables that could contribute to sample variation, such as sex, age and developmental stage.

![RNAseq](./img/PCA.png)

By plotting the PCA on the PC1 and PC2 axes, using `condition` as the main variable of interest, we can quickly identify the primary source of variation in our data. By accounting for this variation in our design model, we should be able to detect more differentially expressed genes related to `condition`. When working with real data, it's often useful to plot the data using different variables to explore how much variation is explained by the first two PCs. Depending on the results, it may be informative to examine variation on additional PC axes, such as PC3 and PC4, to gain a more comprehensive understanding of the data.

Next, we will examine the hierarchical clustering plot to explore the relationships between samples based on their gene expression profiles. The heatmap is organized such that samples with similar expression profiles are close to each other, allowing us to identify patterns in the data.

![RNAseq](./img/hierarchical_clustering.png)

Remember that to create this plot, we utilized the `dist()` function, so in the legend on the right, a value of 0 corresponds to high correlation, while a value of 5 corresponds to very low correlation. Similar to PCA, we can see that samples tend to cluster together according to `condition`, indeed we can observe a high degree of correlation between the three control samples and between the three treated samples.

Overall, the integration of these plots suggests that we are working with high-quality data and we can confidently proceed to the differential expression analysis.

## Differential expression results

From this point, we will examine plots that are generated after the differential expression analysis. These plots are not quality control (QC) plots, but rather plots that help us to interpret the results.
After running the `results()` function, a good way to start to have an idea about the results is to look at the MA plot.

![RNAseq](./img/MA_plot.png)

By default, genes are coloured in blue if the padj is less than 0.1 and the log2foldchange greater than or less than 0. Genes that fall outside the plotting region are represented as open triangles. Note that we have not yet applied a filter to select only significant, which we define as those with a padj value less than 0.5 and a log2 fold change of at least 1 or -1.

After filtering our genes of interest according to our threshold, let's have a look to our significatnt genes:

```bash
gene baseMean log2FoldChange lfcSE stat pvalue padj
ENSG00000205726 121645.5908 2.894480 0.1515387 19.100600 2.496005e-81 5.840651e-79
ENSG00000142192 51065.3192 3.025489 0.1891258 15.997230 1.335883e-57 1.562983e-55
ENSG00000142156 20805.8078 2.977705 0.2159277 13.790287 2.915972e-43 2.274458e-41
ENSG00000159231 458.9277 -1.194777 0.3058100 -3.906926 9.347790e-05 5.468457e-03
ENSG00000156282 481.7624 1.095272 0.2969594 3.688289 2.257672e-04 1.056590e-02
```

To gain a comprehensive overview of the transcriptional profile, the volcano plot represents a highly informative tool.

![RNAseq](./img/volcanoplot.png)

The treatment induced differential expression in five genes, with one downregulated and four upregulated. This plot visually represents the numerical results reported in the table above.

After the identification of DE genes, it's informative to visualise the expression of specific genes of interest. The `plotCounts()` function applied directly on the `dds` object allows us to examine individual gene expression profiles without accessing the full `res` object.

![RNAseq](./img/plotCounts.png)

In our example, post-treatment, we observe a significant increase in the expression of the _ENSG00000142192_ gene, highlighting its responsiveness to the experimental conditions.

Finally, we can create a heatmap using the normalised expression counts of DE genes. The resulting heatmap visualises how the expression of significant genes varies across samples. Each row represents a gene, and each column represents a sample. The color intensity in the heatmap reflects the normalised expression levels: red colors indicate higher expression, while blue colors indicate lower expression.

![RNAseq](./img/heatmap_DE_genes.png)

By examining the heatmap, we can visually identify the expression patterns of our five significant differentially expressed genes. This visualization allows us to identify how these genes respond to the treatment. The heatmap provides a clear and intuitive way to explore gene expression dynamics.

## Over Representation Analysis (ORA)

Finally, we can attempt to assign biological significance to our differentially expressed genes through Enrichment Analysis (ORA). The ORA analysis identifies specific biological pathways, molecular functions and cellular processes, according to the Gene Ontology (GO) database, that are enriched with our differentially expressed genes.

![RNAseq](./img/enrichment_plot.png)

The enrichment analysis reveals a possible involvement of cellular structures and processes, including "clathrin-coated pit", "dendritic spine", "neuron spine" and "endoplasmic reticulum lumen". These terms suggest a focus on cellular transport, structural integrity and protein processing, especially in neural contexts. This pattern points to pathways related to cellular organization and maintenance, possibly playing an important role in the biological condition under study.

## Conclusions

In this tutorial, we have walked through the steps of performing differential expression analysis using DESeq2, from preparing the data to interpreting the results. We have seen how to identify differentially expressed genes, visualise the results and perform enrichment analysis to gain insights into the biological significance of the results. By following this tutorial, you should now be able to perform differential expression analysis using DESeq2 and interpret the results in the context of your own research question.
44 changes: 44 additions & 0 deletions docs/usage/DE_analysis/intro.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
---
order: 1
---

## Welcome

These pages are a tutorial workshop for the [Nextflow](https://www.nextflow.io) pipeline [nf-core/rnaseq](https://nf-co.re/rnaseq).

In this workshop, we will recap the application of next generation sequencing to identify differentially expressed genes. You will learn how to use the rnaseq pipeline out this data-intensive workflow efficiently. We will cover topics such as configuration of the pipeline, code execution and data interpretation.

Please note that this is not an introductory workshop, and we will assume some basic familiarity with Nextflow.

By the end of this workshop, you will be able to:

- analyse simple NGS datasets with the nf-core/rnaseq workflow
- understand the key concepts behind RNAseq differential expression analysis
- customise some of its features for your own analyses
- integrate different sources of information to interpret the results

Let's get started!

## Running with Gitpod

In order to run this using GitPod, please make sure:

1. You have a GitHub account: if not, create one [here](https://github.com/signup)
2. Once you have a GitHub account, sign up for GitPod using your GitHub user [here](https://gitpod.io/login/) choosing "continue with GitHub".

Now you're all set and can use the following button to launch the service:

[![Open in GitPod](https://img.shields.io/badge/Gitpod-%20Open%20in%20Gitpod-908a85?logo=gitpod)](https://gitpod.io/#https://github.com/lescai-teaching/rnaseq-tutorial)

## Additional documentation

- You can find detailed documentation on **Nextflow** [here](https://www.nextflow.io/docs/latest/)
- You can find additional training on [these pages](https://training.nextflow.io)

## Credits & Copyright

This training material has been written and completed during the [nf-core](https://nf-co.re) Hackathon in Barcellona, 2024, by Lorenzo Sola, Francesco Lescai, Mariangela Santorsola and Victoria Cepeda. The tutorial is aimed at anyone who is interested in using nf-core pipelines for their studies or research activities.

The Docker image and Gitpod environment used in this repository have been created by [Seqera](https://seqera.io) but have been made open-source ([CC BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)) for the community.

All examples and descriptions are licensed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/).
126 changes: 126 additions & 0 deletions docs/usage/DE_analysis/rnaseq.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,126 @@
---
order: 3
---

# The nf-core/rnaseq pipeline

In order to carry out a RNA-Seq analysis we will use the nf-core pipeline [rnaseq](https://nf-co.re/rnaseq/3.14.0).

## Overview

The pipeline is organised following the diffent blocks: pre-processing, alignment (or lightweight alignment) and quantification, post-processing and final QC.

![RNAseq](./img/nf-core-rnaseq_metro_map_grey.png)

In each process, the users can choose among a range of different options. Importantly, the users can decide to follow one of the two different routes in the alignment and quantification step:

- alignment and quantification (stage 2);
- pseudoalignment and quantification (stage 3).

## Experimental Design

The number of reads and the number of biological replicates are two critical factors that researchers need to carefully consider during the design of a RNA-seq experiment. While it may seem intuitive that having a large number of reads is always desirable, an excessive number can lead to unnecessary costs and computational burdens, without providing significant improvements in quality of the data. Instead, it is often more beneficial to prioritise the number of biological replicates, as it allows to capture the natural biological variation of the data. Biological replicates involve collecting and sequencing RNA from distinct biological samples (e.g., different individuals, tissues, or time points), helping to detect genuine changes in gene expression.

!!! note

This concept must not be confused with technical replicates that asses the technical variability of the sequencing platform by sequencing the same RNA sample multiple time.

To obtain optimal results, it is crucial to balance the number of biological replicates and sequencing depth. While deeper sequencing improves the detection of lowly expressed genes, it reaches a plateau, beyond which no additional benefits are gained. Statistical power calculations can inform experimental design by estimating the optimal number of reads and replicates required. For instance, this approach helps establish a suitable log2 fold change (log2FC) threshold for the DE analysis. By incorporating multiple biological replicates into the design and optimizing sequencing depth, researchers can enhance the statistical power of the analysis, reducing the number of false positive results and increasing the reliability of the findings.

## Library design

RNA-seq library design involves critical decisions, including the choice between paired-end and single-end sequencing. Paired-end sequencing provides valuable information on structural variations and transcript isoforms, improving mapping accuracy, especially for longer transcripts and repetitive regions. In contrast, single-end sequencing, where only one end of the fragment is sequenced, can be more cost-effective while still providing high-quality data for gene expression analysis. The decision between paired-end and single-end sequencing ultimately depends on the research question and experimental goals. Paired-end sequencing is preferred for novel transcript identification or isoform characterization, while single-end sequencing is sufficient for gene expression quantification. The type of RNA (e.g., mRNA or total RNA), read length, budget and computational resources can impact the choice.

## Reference genome

nf-core pipelines make use of the Illumina iGenomes collection as [reference genomes](https://nf-co.re/docs/usage/reference_genomes).
Before starting the analysis, the users might want to check whether the genome they need is part of this collection.
They also might want to consider downloading the reference locally, when running on premises: this would be useful for multiple runs and to speed up the analysis. In this case the parameter `--igenomes_base` might be used to pass the root directory of the downloaded references.

One might also need to use custom files: in this case the users might either provide specific parameters at command line (`--fasta` option followed by the genome of choiche), or create a config file adding a new section to the `genome` object. See [here](https://nf-co.re/docs/usage/reference_genomes#custom-genomes) for more details.

We will follow this specific approach in this tutorial, since the data we will be using have been simulated on chromosome 21 of the Human GRCh38 reference, and we have prepared genome fasta and genome index containing only this chromosome locally. The two files are `/workspace/gitpod/training/data/refs/Homo_sapiens_assembly38_chr21.fa` and `/workspace/gitpod/training/data/refs/Homo_sapiens_assembly38_chr21.fa.fai`, respectively.

## Reference annoation

The reference annotation plays a crucial role in the RNA-seq analysis. Without a high-quality reference annotation, RNA-seq analysis would result in inaccurate or incomplete results. The reference annotation provides a precise guide for aligning sequencing reads to specific genomic regions, allowing to identify genes, transcripts and regulatory elements. This is particularly important for identifying novel transcripts and alternative splicing events.

nf-core pipelines make use of the Illumina iGenomes collection also as [reference annotation](https://nf-co.re/docs/usage/reference_genomes).
The reference annotations are vastly out of date with respect to current annotations and miss certain features. So, the general recommendation is to download a newest annotation version compatible with the genome. A user can utilize the `--gtf` or the `--gff` options to specify the annottation files of choiche, or create a config file adding a new section to the `genome` object.

Similarly to the approach utilised for the genome, in this tutorial we will follow this approach. The annotation files include only the annotated transcripts on chromosome 21 of the Human GRCh38 reference genome and we have already prepared these files locally. The two files are `/workspace/gitpod/training/data/refs/gencode_v29_chr21.gff` and `/workspace/gitpod/training/data/refs/gencode_v29_transcripts_chr21.fa`, respectively.

## Input files

The input data should be provided in a CSV file, according to a format that is largely common for nf-core pipelines.
The format is described in the [rnaseq usage page](https://nf-co.re/rnaseq/3.14.0/docs/usage).
In the tutorial, the input file is `/workspace/gitpod/training/data/reads/rnaseq_samplesheet.csv`.

## Running nf-core/rnaseq

In the following sections we will:

- prepare our references;
- set our computational resources in order to be able to run the pipeline on a gitpod VM;
- edit the optional settings;
- run the pipeline.

## Reference and annotation files

Following the considerations above, we will first of all edit the `nextflow.config` file in our working directory to add a new genome.
It is sufficient to add the following code to the `parameters` directive in the config.

```groovy
igenomes_base = '/workspace/gitpod/training/data/refs/'
genomes {
'GRCh38chr21' {
fasta = "${params.igenomes_base}/sequence/Homo_sapiens_assembly38_chr21.fasta"
fasta_fai = "${params.igenomes_base}/sequence/Homo_sapiens_assembly38_chr21.fasta.fai"
gff = "${params.igenomes_base}/gencode_v29_chr21_parsed.gff"
transcript_fasta = "${params.igenomes_base}/gencode.v29.transcripts_chr21.fa"
star_index = "${params.igenomes_base}/star_index_chr21.tar.gz"
salmon_index = "${params.igenomes_base}/salmon_index_chr21.tar.gz"
}
}
```

To speed up the analysis we will include the `star_index` and the `salmon_index` in the config. These files have already been created locally.

## Computing resources

Based on the choices we made when starting up the gitpod environment, we recommend to use the following additional parameters.
They can also be added to the parameters directive in the config file we just edited.

```groovy
params {
max_cpus = 2
max_memory = '6.5GB'
max_time = '2.h'
}
```

## Launching the pipeline

Now we are ready to launch the pipeline, and we can use the following command line:

```bash
nextflow run nf-core/rnaseq -r 3.12.0 \
--input /workspace/gitpod/training/data/reads/rnaseq_samplesheet.csv \
--outdir ./results_star_salmon \
--genome GRCh38chr21 \
--aligner star_salmon \
--pseudo_aligner salmon \
--skip_biotype_qc \
--skip_stringtie \
--skip_bigwig \
--skip_umi_extract \
--skip_trimming \
--skip_fastqc \
--skip_markduplicates \
--skip_dupradar \
--skip_rseqc \
--skip_qualimap
```

Notice that we will run the pipeline with STAR as aligner and Salmon in alignment-based mode to quantify gene expression. We will also run the pipeline with Salmon in mapping-based mode to perform a lightweight alignment and quantification.
The `skip` parameters were inserted to reduce the running time.
Loading
Loading