# PiGx pipeline for single-cell RNAseq

**Copyright 2017-2022: Vedran Franke, Bora Uyar, Ricardo Wurmus, Altuna Akalin.**
**This work is distributed under the terms of the GNU General Public License, version 3 or later. It is free to use for all purposes.**

-----------

# Introduction

PiGx scRNAseq is an analysis pipeline for preprocessing and quality control of single-cell RNA sequencing experiments.
The inputs are read files from the sequencing experiment and a configuration file which describes the experiment.
It produces processed files for downstream analysis and interactive quality reports.
The pipeline is designed to work with UMI-based methods. It currently supports all methods that output paired
adapter and read files.
The pipeline was heavily influenced by the [Dropseq](http://mccarrolllab.com/dropseq/) pipeline from the McCarroll lab.

## What does it do

- Performs quality control of reads using FastQC and MultiQC
- Automatically determines the appropriate cell number
- Constructs the digital gene expression matrix
- Calculates per-sample and per-cell statistics
- Prepares a quality control report
- Normalizes data and performs dimensionality reduction

## What does it output

- bam files
- bigwig files
- UMI and read count matrices
- Quality control report
- SingleCellExperiment object with pre-calculated statistics and dimensionality reductions

## PiGx - scRNA-seq workflow

![PiGx scRNAseq workflow](./images/pigx-scrnaseq.svg)
_Figure 1: An overview of the PiGx scRNA-seq workflow_

# Install

You can install this pipeline and all of its dependencies through GNU
Guix:

    guix install pigx-scrnaseq

You can also install it manually from source. You can find the [latest
release](https://github.com/BIMSBbioinfo/pigx_scrnaseq/releases/latest)
here. PiGx uses the GNU build system. Please make sure that all
required dependencies are installed and then follow these steps after
unpacking the latest release tarball:

```sh
./configure \
--prefix=/some/where
make install
```

# Dependencies

By default the `configure` script expects tools to be in a directory
listed in the `PATH` environment variable. If the tools are installed
in a location that is not on the `PATH` you can tell the `configure`
script about them with variables. Run `./configure --help` for a list
of all variables and options.

You can prepare a suitable environment with Conda or with [GNU
Guix](https://gnu.org/s/guix).

## Via Guix

Assuming you have Guix installed, the following command spawns a
sub-shell in which all dependencies are available:

```sh
guix shell
```

# Getting started

To run PiGx on your experimental data, first enter the necessary parameters in the sample sheet and the settings file (see the following section).
You will also need the appropriate genome sequence in fasta format and the genome annotation in
gtf format. Then, from the terminal, run:

```sh
$ pigx-scrnaseq [options] sample_sheet.csv -s settings.yaml
```

To see all available options, pass the `--help` option:

```sh
$ pigx-scrnaseq --help

usage: pigx-scrnaseq [-h] [-v] -s SETTINGS [-c CONFIGFILE] [--target TARGET]
[-n] [--graph GRAPH] [--force] [--reason] [--unlock]
samplesheet

PiGx scRNAseq Pipeline.

PiGx scRNAseq is a data processing pipeline for single cell RNAseq read data.

positional arguments:
samplesheet The sample sheet containing sample data in CSV format.

optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-s SETTINGS, --settings SETTINGS A YAML file for settings that deviate from the defaults.
-c CONFIGFILE, --configfile CONFIGFILE The config file used for calling the underlying snakemake process. By
default the file 'config.json' is dynamically created from the sample
sheet and the settings file.
--target TARGET Stop when the named target is completed instead of running the whole
pipeline. The default target is "final-report". Pass "--target=help"
to describe all available targets.
-n, --dry-run Only show what work would be performed. Do not actually run the
pipeline.
--graph GRAPH Output a graph in Graphviz dot format showing the relations between
rules of this pipeline. You must specify a graph file name such as
"graph.pdf".
--force Force the execution of rules, even though the outputs are considered
fresh.
--reason Print the reason why a rule is executed.
--unlock Recover after a snakemake crash.

This pipeline was developed by the Akalin group at MDC in Berlin in 2017-2018.
```
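As a rough illustration of how the documented options fit together, the following Python sketch assembles a command line for a dry run. The helper name `build_pigx_command` is hypothetical and not part of PiGx; it only uses flags that appear in the help output above (`-s`, `--dry-run`, `--target`).

```python
import shlex


def build_pigx_command(sample_sheet, settings, dry_run=False, target=None):
    """Assemble the pigx-scrnaseq command line as a list of arguments.

    Hypothetical helper for illustration; not part of PiGx itself.
    """
    cmd = ["pigx-scrnaseq"]
    if dry_run:
        # Only show what work would be performed.
        cmd.append("--dry-run")
    if target is not None:
        # Stop when the named target is completed.
        cmd.extend(["--target", target])
    cmd.extend(["-s", settings, sample_sheet])
    return cmd


if __name__ == "__main__":
    cmd = build_pigx_command("sample_sheet.csv", "settings.yaml", dry_run=True)
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(cmd))
    # To actually run it, one could use: subprocess.run(cmd, check=True)
```

Starting with a dry run is a cheap way to check that the sample sheet and settings file are accepted before committing to a full run.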

# The input parameters

## Sample Sheet

The sample sheet is a tabular file describing the experiment. The table has the following columns:

| name | barcode | reads | method | covariate1 | covariate2 |
|------|---------|-------|---------|------------|------------|

- _name_ - name for the sample, which will be used to label the sample in all downstream analysis
- _barcode_ - fastq file containing the **adapter sequences**
- _reads_ - fastq file containing the **sequenced reads**
  - the location of these files is specified in `settings.yaml`
- _method_ - sequencing platform on which the experiment was performed (e.g. dropseq)
- _covariates_ - variables which describe the samples. For example: replicate, time, hour post infection, tissue ...

Additional columns may be included which may be used as covariates in the differential expression analysis (sex, age, different treatments).
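
Before starting a run, it can help to confirm that a sample sheet has the columns described above. The following is a minimal sketch, not part of the pipeline: the function name `check_sample_sheet` is hypothetical, the file names in the example are made up, and treating every column other than the four fixed ones as a covariate is an assumption based on the description above.

```python
import csv
import io

# The four fixed columns described in the sample sheet section.
REQUIRED_COLUMNS = ("name", "barcode", "reads", "method")


def check_sample_sheet(handle):
    """Return (samples, covariates) or raise ValueError for a malformed sheet."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {', '.join(missing)}")
    # Assumption: every remaining column is a covariate.
    covariates = [col for col in header if col not in REQUIRED_COLUMNS]
    samples = []
    for line_no, row in enumerate(reader, start=2):
        empty = [col for col in REQUIRED_COLUMNS if not (row[col] or "").strip()]
        if empty:
            raise ValueError(f"line {line_no}: empty value for {', '.join(empty)}")
        samples.append(row)
    return samples, covariates


if __name__ == "__main__":
    # Hypothetical sample sheet with one covariate column.
    example = io.StringIO(
        "name,barcode,reads,method,replicate\n"
        "WT_1,wt_1_R1.fastq.gz,wt_1_R2.fastq.gz,dropseq,1\n"
    )
    samples, covariates = check_sample_sheet(example)
    print(len(samples), covariates)
```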

## Settings File

The settings file is a _YAML_ file which specifies:

- Locations:
  - The locations of the reads (directory where `fastq` files are located)
  - The location of the output directory
  - The location of the `fasta` file with the reference genome (must be prepared by the user)
  - The location of a `GTF` file with genome annotations
- Genome assembly name (e.g. mm10)

- covariates

In order to get started, enter `pigx-scrnaseq --init-settings my_settings.yaml`. This will create a file called `my_settings.yaml` with the default structure. The file will look like this:

```yaml
locations:
  output-dir: out/
  reads-dir: sample_data/reads/
  tempdir:

covariates: 'covariate1, covariate2, ...'

annotation:
  primary:
    genome:
      name: hg19
      fasta: sample_data/genome.fa
    gtf: sample_data/genome.gtf

execution:
  submit-to-cluster: no
  jobs: 6
  nice: 19
```
```
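
Since the settings file mostly points at files and directories, a quick existence check can catch typos early. The sketch below assumes the settings have already been loaded into a Python dict (for example with PyYAML's `yaml.safe_load`, which is not imported here). The function names are hypothetical, and the choice of `reads-dir`, `fasta`, and `gtf` as the keys to check is an assumption taken from the example file above.

```python
import os

# Keys from the example settings file that are expected to name existing inputs.
PATH_KEYS = ("reads-dir", "fasta", "gtf")


def find_paths(settings, keys=PATH_KEYS):
    """Yield (key, value) pairs for the given keys anywhere in a nested dict."""
    for key, value in settings.items():
        if isinstance(value, dict):
            yield from find_paths(value, keys)
        elif key in keys and value:
            yield key, value


def missing_inputs(settings):
    """Return the configured input paths that do not exist on disk."""
    return [(key, path) for key, path in find_paths(settings)
            if not os.path.exists(path)]


if __name__ == "__main__":
    # Same values as the example settings file, written as a dict.
    settings = {
        "locations": {"output-dir": "out/", "reads-dir": "sample_data/reads/"},
        "annotation": {
            "primary": {
                "genome": {"name": "hg19", "fasta": "sample_data/genome.fa"},
                "gtf": "sample_data/genome.gtf",
            }
        },
    }
    for key, path in missing_inputs(settings):
        print(f"{key}: {path} does not exist")
```

The search is deliberately independent of nesting depth, so it does not rely on the exact layout of the `annotation` section.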

# Resource consideration

Single-cell expression analysis is data intensive and requires substantial computing resources.
The pipeline uses the [STAR](https://github.com/alexdobin/STAR) aligner for read mapping, so the memory requirements scale with the
size of the genome.
Please consult the [STAR](https://github.com/alexdobin/STAR) manual for concrete numbers on the memory requirements.
For the human or mouse genome, it requires approximately 40 GB of RAM.
The pipeline produces temporary files which require a substantial amount of disk space. Please ensure
that you have at least 30 GB of disk space per 100 million sequenced reads.
The location of the temporary directory can be controlled using the `tempdir` variable in `settings.yaml`.
By default, `tempdir` is set to **/tmp**.

**Important:** please make sure that the temporary directory has adequate free space.
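
The disk space rule of thumb above (30 GB per 100 million reads) can be turned into a quick check. This is a minimal sketch under that stated assumption; the function names are hypothetical and the estimate is only as good as the rule of thumb itself.

```python
import shutil

# Rule of thumb from the text: 30 GB of temporary space per 100 million reads.
GB_PER_100M_READS = 30


def required_temp_space_gb(n_reads):
    """Estimate temporary disk space (in GB) for a number of sequenced reads."""
    return GB_PER_100M_READS * n_reads / 100_000_000


def has_enough_temp_space(tempdir, n_reads):
    """Compare the estimate with the free space reported for tempdir."""
    free_gb = shutil.disk_usage(tempdir).free / 1e9
    return free_gb >= required_temp_space_gb(n_reads)


if __name__ == "__main__":
    n_reads = 250_000_000
    print(required_temp_space_gb(n_reads))  # 75.0
    print(has_enough_temp_space("/tmp", n_reads))
```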

# Output directory structure

The output directory is organized into the subdirectories described below.

### Annotation

Contains the pre-processed fasta and gtf files, along with the STAR genome index.
The genome fasta file is processed into a dict header.
The gtf file has gene_names replaced with gene_id.

**Important**: We strongly advise that you check that the gtf file corresponds to the same organism
and genome version as the genome fasta file.
The chromosome names must match exactly between the two files.

We encourage users to use both the genome annotation and the fasta file from the [ENSEMBL](https://www.ensembl.org/info/data/ftp/index.html) database.
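
The requirement that chromosome names match between the fasta and gtf files can be checked before running the pipeline. The sketch below is an illustration only: the function names are hypothetical, and it assumes the usual conventions that a fasta sequence name is the text after `>` up to the first whitespace and that the first tab-separated gtf column holds the sequence name.

```python
import io


def fasta_sequence_names(handle):
    """Collect sequence names from FASTA header lines."""
    return {line[1:].split()[0] for line in handle
            if line.startswith(">") and line[1:].strip()}


def gtf_sequence_names(handle):
    """Collect the first (sequence name) column from non-comment GTF lines."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_gtf_names(fasta_handle, gtf_handle):
    """Return GTF sequence names that have no counterpart in the FASTA file."""
    return sorted(gtf_sequence_names(gtf_handle) - fasta_sequence_names(fasta_handle))


if __name__ == "__main__":
    # Tiny made-up inputs: the gtf uses "chr1" where the fasta uses "1".
    fasta = io.StringIO(">1 dna:chromosome\nACGT\n>2\nACGT\n")
    gtf = io.StringIO(
        "#!genome-build example\n"
        'chr1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n'
        '2\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n'
    )
    print(unmatched_gtf_names(fasta, gtf))  # ['chr1']
```

For real files, the same functions can be applied to handles from `open(path)`; an empty result means every sequence name in the gtf was found in the fasta.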

### Log

Contains execution logs for every step of the pipeline.

### Mapped

The **Mapped** folder contains the processed data for each sample.
Additionally, it contains a [loom](http://linnarssonlab.org/loompy/) file with merged expression values from all experiments,
an RDS file with a saved [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) object, and a quality control report in the html format.

#### Mapped/Sample1

Analysis results for each sample are stored in a separate subdirectory under **Mapped**.
Structure of analysis results:

#### Description of relevant output files:

- Sample1.fastq.bam - contains merged barcode and sequence fq files

- Sample1_genome_name.**bw** - bigWig file constructed from selected cells. Files with **m/p**.bw contain strand separated signal

- Sample1_genome_name_**BAMTagHistogram** - Number of reads corresponding to each cell barcode.

- Sample1_genome_name_UMI.Matrix.**txt/loom** - UMI based digital expression matrix in txt and loom format

- Sample1_genome_name_READS.Matrix.txt - Read count digital expression matrix

- Sample1_genome_name_ReadCutoff.yaml - contains the UMI threshold for selecting high quality cells (obtained using [dropbead](https://github.com/rajewsky-lab/dropbead)). The corresponding .png file visualizes the UMI curve and the threshold.

- star_gene_exon_tagged.bam - mapped and annotated reads. Each read is tagged with an annotation based on its mapping location.

- Sample1_genome_name_**ReadStatistics/Downstream**.txt - quality control statistics used in the html report. They contain values such as number of reads in Exons/Introns.

# Downstream analysis

The combined expression data are subsequently processed into a [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) object. SingleCellExperiment is a Bioconductor class for storing expression values, along with cell and gene data and experimental metadata, in a single container. It is constructed on top of HDF5 file-based arrays (Pagès 2017), which enables exploration even on systems with limited memory capacity.
During object construction, the pipeline performs expression normalization, dimensionality reduction, and identification of significantly variable genes, assigns the cells to the steps of the cell cycle, and calculates the quality statistics. The SingleCellExperiment object contains all of the data needed for further exploration. The object connects the PiGx pipeline with the Bioconductor single-cell computing environment, and enables integration with state-of-the-art statistical and machine learning methods ([scran](https://bioconductor.org/packages/release/bioc/html/scran.html), [zinbwave](https://bioconductor.org/packages/release/bioc/html/zinbwave.html), [netSmooth](https://github.com/BIMSBbioinfo/netSmooth), [iSEE](https://github.com/csoneson/iSEE)).

### Cluster Execution

The `execution` section in the settings file allows the user to specify whether the pipeline is to be submitted to a cluster, or run locally, and the degree of parallelism. For a full list of possible parameters, see `etc/settings.yaml`.

# Example

An example can be found in the `tests` directory. The
`sample_sheet.csv` file here specifies the following sample data:

# How to contribute?

### Dependencies

#### Guix

The easiest way to install all of the dependencies is through the [guix](https://www.gnu.org/software/guix/download/)
package management system.
First, download and install Guix on your computer.
The **guix.scm** file in the root of the project directory contains the recipe for installing all of the necessary
tools.
The following command will install all of the dependencies to the **.guix-profile** folder:

    guix environment -l guix.scm --root=`pwd`/.guix-profile

### Installing PigX-scRNAseq for development

    # sets up the directory ($HOME is used because ~ does not expand inside quotes)
    basepath="$HOME/pigx-scrnaseq/development"
    mkdir -p "$basepath"; cd "$basepath"

    # downloads the repository
    git clone https://github.com/BIMSBbioinfo/pigx_scrnaseq.git

    cd pigx_scrnaseq; mkdir run

    # uses guix to install all of the dependencies into a separate environment
    guix environment -l guix.scm --root="$(pwd)/run/.guix-profile"

    # fetch pigx-common
    git submodule update --init --recursive

    # sets the temporary directory - needed for storing large temporary files
    export TMPDIR=~/Tmp

    # installs the pipeline
    ./bootstrap.sh && ./configure --prefix="$(pwd)/run" && make install

    # runs the pipeline on the test data
    ./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml

### Preparing the environment for the development

To prepare the environment for the development set the following variable:

    export PIGX_UNINSTALLED=1

If this variable is not set pigx-scrnaseq will execute files in the `./run/bin` folder (pre-installed files),
and will not react to changes to scripts.

### Continuing development

If you already have the dependencies installed, execute the following commands to set up your environment:

    # loads the guix environment.
    guix package -p run/.guix-profile --search-path=prefix
    export PIGX_UNINSTALLED=1

    # runs the pipeline on the test data
    ./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml

    # loads the dependencies into PATH
    guix package -p ./run/.guix-profile --search-path="prefix"

### Scripts and Executables

#### pigx-scrnaseq

pigx-scrnaseq is the main driver script for the pipeline (user entry point).
It is constructed from `pigx-runner.in` (from the `pigx-common` repository) during the configuration step.
To update pigx-scrnaseq, change the copy of `common/pigx-runner.in` and rerun the `installs the pipeline`
step of the development installation to apply the changes.

#### snakefile.py

The main Snakemake script, which constructs the execution graph and executes the pipeline.
Any changes to `snakefile.py` take effect directly upon execution.

#### ./scripts

Folder which contains all **R** and **python** scripts. These scripts are called
by **Snake_Dropseq.py**.

### Making changes

To make changes or add improvements to the pipeline, follow these steps:

1. create a new git branch
2. switch to the branch
3. make your updates
4. check whether the updates work by running the following code:

        make install && ./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml

5. run the tests with

        make distcheck

6. check whether there were updates to master. If there were updates, run `git pull -r`. Again check whether the pipeline works
7. push the changes to the corresponding branch, and open a pull request.

----------------------------------------
2022