-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #2 from fiberseq/fire_updates
Updates for FIRE
- Loading branch information
Showing
13 changed files
with
461 additions
and
92 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# `ft-pileup` | ||
The `ft-pileup` command is used to generate a per-base pileup of Fiber-seq data. This command is useful for visualizing the distribution of various features across the genome. | ||
|
||
|
||
## Inputs and options | ||
|
||
See the [help message](../help.md#ft-pileup) for details. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# `ft-qc` | ||
The `ft-qc` command is used to generate quality control metrics for a Fiber-seq BAM file. | ||
|
||
|
||
## Inputs and options | ||
|
||
See the [help message](../help.md#ft-qc) for details. |
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,21 @@ | ||
# FIRE: descriptions, methods, and outputs | ||
# FIRE: descriptions, methods, and outputs | ||
|
||
For details on running the FIRE pipeline see the [README.md](https://github.com/fiberseq/FIRE). | ||
The code for running the FIRE pipeline can be found on [GitHub](https://github.com/fiberseq/FIRE), and if you do use FIRE in your work please cite our publication: | ||
|
||
> **Vollger, M. R.**, **Swanson, E. G.**, Neph, S. J., Ranchalis, J., Munson, K. M., Ho, C.-H., Sedeño-Cortés, A. E., Fondrie, W. E., Bohaczuk, S. C., Mao, Y., Parmalee, N. L., Mallory, B. J., Harvey, W. T., Kwon, Y., Garcia, G. H., Hoekzema, K., Meyer, J. G., Cicek, M., Eichler, E. E., … Stergachis, A. B. (2024). A haplotype-resolved view of human gene regulation. _bioRxiv_. [https://doi.org/10.1101/2024.06.14.599122](https://doi.org/10.1101/2024.06.14.599122) | ||
## Summary of Fiber-seq inferred regulatory elements (FIREs) | ||
|
||
For those who would prefer video I have recorded a [lab meeting](https://youtu.be/RiZrMltAiWM?si=sSo64goaNQxgyfcc) discussing FIRE and the methods used here. If you prefer to read please continue on. | ||
For those who would prefer video I have recorded a [lab meeting](https://youtu.be/RiZrMltAiWM?si=sSo64goaNQxgyfcc) discussing FIRE and the methods used here. If you prefer to read please continue on. | ||
|
||
FIREs are MTase sensitive patches (MSPs) that are inferred to be regulatory elements on single chromatin fibers. To do this we used semi-supervised machine learning to identify MSPs that are likely to be regulatory elements using the `Mokapot` framework and `XGBoost`. Every individual FIRE element is associated with an estimated precision value (defined below), which indicates the probability that the FIRE element is a true regulatory element. The estimated precision of FIREs elements are created using `Mokapot` and validation data not used in training. We train our model targeting FIRE elements with at least 95% precision, MSPs with less than 90% precision are considered to have average level of accessibility expected between two nucleosomes, and are referred to as linker regions. | ||
FIREs are MTase sensitive patches (MSPs) that are inferred to be regulatory elements on single chromatin fibers. To do this we used semi-supervised machine learning to identify MSPs that are likely to be regulatory elements using the `Mokapot` framework and `XGBoost`. Every individual FIRE element is associated with an estimated precision value, which indicates the probability that the FIRE element is a true regulatory element. The estimated precision of FIREs elements are created using `Mokapot` and validation data not used in training. We train our model targeting FIRE elements with at least 95% precision, MSPs with less than 90% precision are considered to have average level of accessibility expected between two nucleosomes, and are referred to as linker regions. | ||
|
||
Semi-superivized machine learning with `Mokapot` requires a mixed-positive training set and a clean negative training set. To create mixed positive training data we selected MSPs that overlapped DNase hypersensitive sites (DHSs) and CTCF ChIP-seq peaks. And to create a clean negative training set we selected MSPs that did not overlap DHSs or CTCF ChIP-seq peaks. | ||
|
||
For details on running the FIRE workflow see: | ||
|
||
- [Running and installing](run.md) | ||
|
||
For details on the outputs of the FIRE workflow see: | ||
|
||
- [Outputs](outputs.md) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
File renamed without changes.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
# FIRE methods | ||
This chapter describes the methods and training used in the FIRE pipeline. There are two major components, identifying per-molecule FIRE elements, and aggregating these elements into peaks. |
File renamed without changes.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
# Outputs of the FIRE pipeline | ||
|
||
The FIRE pipeline returns outputs for a sample in the directory `results/{sample}/` and files are labeled with the sample (`{sample}`) and the version of the FIRE pipeline (`{v}`). The following files and directories are generated: | ||
|
||
| Output | Description | | ||
| ---------------------------------------- | -------------------------------------------------------------------------------------------------- | | ||
| {sample}-fire-{v}-filtered.cram | CRAM file containing the all the data used in the FIRE pipeline. | | ||
| {sample}-fire-{v}-peaks.bed.gz | BED file containing the FIRE peaks calls | | ||
| {sample}-fire-{v}-hap-differences.bed.gz | BED file containing the results of searching for haplotype-selective peaks. | | ||
| {sample}-fire-{v}-pileup.bed.gz | BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more. | | ||
| {sample}-fire-{v}-qc.tbl.gz | Table containing quality control metrics for the FIRE CRAM. | | ||
| trackHub-{v}/ | Directory containing a UCSC trackHub for visualizing all the results of the FIRE pipeline. | | ||
| additional-outputs-{v}/ | Directory containing additional outputs from the FIRE pipeline. | | ||
|
||
# More details on the individual outputs | ||
|
||
## The `{sample}-fire-{v}-filtered.cram` file | ||
|
||
The CRAM file contains all the data used in the FIRE pipeline. It is a CRAM file that can be viewed with IGV or other genome browsers. Sequencing quality scores are removed from the CRAM file to reduce the file size since per base quality scores are not used in the FIRE pipeline, as well as reads with insufficient m6A signal. The CRAM file is sorted and indexed. | ||
|
||
## The `{sample}-fire-{v}-peaks.bed.gz` file | ||
|
||
This is the peak file for the FIRE method. Peaks are called by identifying FIRE score ([methods](methods/aggregation.md)) local-maxima that have FDR values below a threshold. By default, the pipeline reports peaks at a 5% FDR threshold. Once a local-maxima is identified, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculate and report wide peaks in the `additional-outputs/` by taking the union of the FIRE peaks and all regions below the FDR threshold and then merging resulting regions that are within one nucleosome (147 bp) of one another. | ||
|
||
The FIRE peaks file has the following columns: | ||
| Column | Description | | ||
| --- | --- | | ||
| #chrom | Chromosome of the peak | | ||
| peak\*start | Start of the peak | | ||
| peak_end | End of the peak | | ||
| start | Start of the maximum of the peak | | ||
| end | End of the maximum of the peak | | ||
| coverage | Coverage of the peak | | ||
| fire_coverage | Coverage of the FIREs in the peak | | ||
| score | FIRE score of the peak (see [methods](methods/aggregation.md)) | | ||
| nuc_coverage | Coverage of the nucleosomes in the peak | | ||
| msp_coverage | Coverage of the MSPs in the peak | | ||
| .\*\*{H1,H2} | Repeats of previous columns but specific for the two haplotypes | | ||
| FDR | False discovery rate of the peak | | ||
| log_FDR | -10\*log10 of the FDR | | ||
| FIRE_size_mean | Mean size of the FIREs in the peak | | ||
| FIRE_size_ssd | Standard deviation of the size of the FIREs in the peak | | ||
| FIRE_start_ssd | Standard deviation of the start of the FIREs in the peak | | ||
| FIRE_end_ssd | Standard deviation of the end of the FIREs in the peak | | ||
| pass_coverage | Whether the peak passes coverage filters | | ||
|
||
## The `{sample}-fire-{v}-hap-differences.bed.gz` file | ||
|
||
This file primarily contains the same columns as the FIRE peaks file but additionally has a `p_value` column with the results of a Fisher's exact test for the difference in coverage between the two haplotypes, and a `p_adjust` column with the Benjamini-Hochberg adjusted p-value. See the [methods](methods/haplotype-selective.md) for more details. | ||
|
||
## The `{sample}-fire-{v}-pileup.bed.gz` file | ||
|
||
This is a BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more. The columns are calculated using `ft-pileup` and more details can be found in the `ft-pileup` documentation. | ||
|
||
## The `{sample}-fire-{v}-qc.tbl.gz` file | ||
|
||
This file contains quality control metrics for the FIRE CRAM. The results are directly created by `ft-qc` and more details can be found in the `ft-qc` documentation. | ||
|
||
## The `trackHub-{v}/` directory | ||
|
||
The `trackHub-{v}/` directory contains a UCSC trackHub for visualizing all the results of the FIRE pipeline. A description of the trackHub can be found in `trackHub/fire-description.html`. The trackHub can be loaded into UCSC by uploading the trackHub directory to a public facing website and then loading the `hub.txt`'s URL into the UCSC trackHub browser. | ||
|
||
A copy of the trackHub description can be found [here](https://s3-us-west-2.amazonaws.com/stergachis-public1/Mitchell/temp/FIRE/dev/trackHub/fire-description.html). | ||
|
||
## The `additional-outputs-{v}/` directory | ||
|
||
The `additional-outputs-{v}/` directory contains the following files: | ||
TODO |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
# Running the FIRE pipeline | ||
|
||
The FIRE pipeline is a Snakemake workflow for calling Fiber-seq Inferred Regulatory Elements (FIREs) on single molecules and peak calling with Fiber-seq. | ||
|
||
## Install | ||
|
||
Please start by installing [pixi](https://pixi.sh/latest/) which handles the environment of the FIRE workflow. | ||
|
||
Then install FIRE using `git` and `pixi`: | ||
|
||
```bash | ||
git clone https://github.com/fiberseq/FIRE.git | ||
pixi install | ||
``` | ||
|
||
We then recommend quickly testing your installation by running the test suite: | ||
|
||
```bash | ||
pixi run test | ||
``` | ||
|
||
We recommend setting a Snakemake conda prefix and the Apptainer cache directory in your `bashrc`, e.g. in the Stergachis lab add: | ||
|
||
```bash | ||
export SNAKEMAKE_CONDA_PREFIX=/mmfs1/gscratch/stergachislab/snakemake-conda-envs | ||
export APPTAINER_CACHEDIR=/mmfs1/gscratch/stergachislab/snakemake-conda-envs/apptainer-cache | ||
``` | ||
|
||
Then Snakemake installs all the additional requirements as conda envs in that directory. | ||
|
||
If you wish to distribute jobs across a cluster you may need to install the appropriate [snakemake executor plugin](https://snakemake.github.io/snakemake-plugin-catalog/). The SLURM executor is included in the environment (`snakemake-executor-slurm`) | ||
|
||
## Configuring | ||
|
||
See the [configuration README](https://github.com/fiberseq/FIRE/tree/main/config), the example [configuration file](https://github.com/fiberseq/FIRE/blob/main/config/config.yaml), and the example [manifest file](https://github.com/fiberseq/FIRE/blob/main/config/config.tbl) for configuration options. | ||
|
||
## Run | ||
|
||
The `FIRE` workflow can be executed using the `pixi run fire` command. Under the hood this runs a `snakemake` workflow and any extra parameters are passed directly to snakemake. For example: | ||
|
||
```bash | ||
pixi run fire --configfile config/config.yaml | ||
``` | ||
|
||
If you want to do a dry run: | ||
|
||
```bash | ||
pixi run fire --configfile config/config.yaml -n | ||
``` | ||
|
||
If you want to execute across a cluster (modify `profiles/slurm-executor` as needed for distributed execution): | ||
|
||
```bash | ||
pixi run fire --configfile config/config.yaml --profile profiles/slurm-executor | ||
``` | ||
|
||
And if you want to run the FIRE workflow from another directory you can do so with: | ||
|
||
```bash | ||
pixi run --manifest-path /path/to/snakemake/pixi.toml fire ... | ||
``` | ||
|
||
where you update `/path/to/snakemake/pixi.toml` to the path of the `pixi.toml` in the cloned FIRE repository. | ||
|
||
You can also run snakemake directly, e.g.: | ||
|
||
```bash | ||
pixi shell | ||
snakemake \ | ||
--configfile config/config.yaml \ | ||
--profile profiles/slurm-executor \ | ||
--local-cores 8 -k | ||
``` |