Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates for FIRE #2

Merged
merged 26 commits into from
Nov 26, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion book.toml
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ mathjax-support = true

[output.html.fold]
enable = true # whether or not to enable section folding
level = 1 # the depth to start folding
level = 2 # the depth to start folding

[preprocessor.toc]
command = "mdbook-toc"
Expand Down
2 changes: 1 addition & 1 deletion scripts/make_help_docs.sh
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ echo "# Help pages for fibertools subcommands" >$out
echo "<!-- toc -->" >>$out
echo "" >>$out

for subcommand in "" "predict-m6a" "fire" "extract" "center" "add-nucleosomes" "footprint" "clear-kinetics" "strip-basemods" "track-decorators" "pileup"; do
for subcommand in "" "predict-m6a" "fire" "extract" "center" "add-nucleosomes" "footprint" "clear-kinetics" "strip-basemods" "track-decorators" "pileup" "qc"; do
echo $subcommand
#out="docs/ft-${subcommand}-help.md"

Expand Down
11 changes: 8 additions & 3 deletions src/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,16 +16,21 @@
- [`ft extract`](fibertools/extracting/extract.md)
- [`ft center`](fibertools/extracting/center.md)
- [`ft footprint`](fibertools/extracting/footprint.md)
- [`ft pileup`](fibertools/extracting/pileup.md)
- [`ft qc`](fibertools/extracting/qc.md)
- [Help pages for `ft` subcommands](fibertools/help.md)
- [pyft: Python bindings](fibertools/pyft.md)
- [Installation](fibertools/install.md)

---

- [FIRE](fire/fire.md)
- [Training](fire/training.md)
- [Aggregation and peak calling](fire/aggregation.md)
- [Identifying haplotype-selective peaks](fire/haplotype-selective.md)
- [Running and installing](fire/run.md)
- [Outputs](fire/outputs.md)
- [Methods](fire/methods/methods.md)
- [Training](fire/methods/training.md)
- [Aggregation and peak calling](fire/methods/aggregation.md)
- [Identifying haplotype-selective peaks](fire/methods/haplotype-selective.md)

---

Expand Down
7 changes: 7 additions & 0 deletions src/fibertools/extracting/pileup.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# `ft-pileup`
The `ft-pileup` command is used to generate a per-base pileup of Fiber-seq data. This command is useful for visualizing the distribution of various features across the genome.


## Inputs and options

See the [help message](../help.md#ft-pileup) for details.
7 changes: 7 additions & 0 deletions src/fibertools/extracting/qc.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# `ft-qc`
The `ft-qc` command is used to generate quality control metrics for a Fiber-seq BAM file.


## Inputs and options

See the [help message](../help.md#ft-qc) for details.
360 changes: 279 additions & 81 deletions src/fibertools/help.md

Large diffs are not rendered by default.

17 changes: 13 additions & 4 deletions src/fire/fire.md
Original file line number Diff line number Diff line change
@@ -1,12 +1,21 @@
# FIRE: descriptions, methods, and outputs
# FIRE: descriptions, methods, and outputs

For details on running the FIRE pipeline see the [README.md](https://github.com/fiberseq/FIRE).
The code for running the FIRE pipeline can be found on [GitHub](https://github.com/fiberseq/FIRE), and if you do use FIRE in your work please cite our publication:

> **Vollger, M. R.**, **Swanson, E. G.**, Neph, S. J., Ranchalis, J., Munson, K. M., Ho, C.-H., Sedeño-Cortés, A. E., Fondrie, W. E., Bohaczuk, S. C., Mao, Y., Parmalee, N. L., Mallory, B. J., Harvey, W. T., Kwon, Y., Garcia, G. H., Hoekzema, K., Meyer, J. G., Cicek, M., Eichler, E. E., … Stergachis, A. B. (2024). A haplotype-resolved view of human gene regulation. _bioRxiv_. [https://doi.org/10.1101/2024.06.14.599122](https://doi.org/10.1101/2024.06.14.599122)

## Summary of Fiber-seq inferred regulatory elements (FIREs)

For those who would prefer video I have recorded a [lab meeting](https://youtu.be/RiZrMltAiWM?si=sSo64goaNQxgyfcc) discussing FIRE and the methods used here. If you prefer to read please continue on.
For those who would prefer video I have recorded a [lab meeting](https://youtu.be/RiZrMltAiWM?si=sSo64goaNQxgyfcc) discussing FIRE and the methods used here. If you prefer to read please continue on.

FIREs are MTase sensitive patches (MSPs) that are inferred to be regulatory elements on single chromatin fibers. To do this we used semi-supervised machine learning to identify MSPs that are likely to be regulatory elements using the `Mokapot` framework and `XGBoost`. Every individual FIRE element is associated with an estimated precision value (defined below), which indicates the probability that the FIRE element is a true regulatory element. The estimated precision of FIREs elements are created using `Mokapot` and validation data not used in training. We train our model targeting FIRE elements with at least 95% precision, MSPs with less than 90% precision are considered to have average level of accessibility expected between two nucleosomes, and are referred to as linker regions.
FIREs are MTase sensitive patches (MSPs) that are inferred to be regulatory elements on single chromatin fibers. To do this we used semi-supervised machine learning to identify MSPs that are likely to be regulatory elements using the `Mokapot` framework and `XGBoost`. Every individual FIRE element is associated with an estimated precision value, which indicates the probability that the FIRE element is a true regulatory element. The estimated precision of FIREs elements are created using `Mokapot` and validation data not used in training. We train our model targeting FIRE elements with at least 95% precision, MSPs with less than 90% precision are considered to have average level of accessibility expected between two nucleosomes, and are referred to as linker regions.

Semi-superivized machine learning with `Mokapot` requires a mixed-positive training set and a clean negative training set. To create mixed positive training data we selected MSPs that overlapped DNase hypersensitive sites (DHSs) and CTCF ChIP-seq peaks. And to create a clean negative training set we selected MSPs that did not overlap DHSs or CTCF ChIP-seq peaks.

For details on running the FIRE workflow see:

- [Running and installing](run.md)

For details on the outputs of the FIRE workflow see:

- [Outputs](outputs.md)
4 changes: 2 additions & 2 deletions src/fire/aggregation.md → src/fire/methods/aggregation.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,8 @@ where \\(C_g\\) is the number of FIRE elements at the \\(g\\)th position, \\(R_g
Regions with unreliable coverage are defined as regions with Fiber-seq coverage that deviate from the median coverage by five standard deviations, where a standard deviation is defined by the Poisson distribution (i.e., the square root of the mean coverage). This is a parameter that can be adjusted by the user.

### FIRE score FDR calculation
We shuffle the location of an entire Fiber-seq read by selecting a random start position within the chromosome and relocating the entire read to that start position. Fiber-seq reads originating from regions with unreliable coverage (defined above) are not shuffled, and reads from regions with reliable coverage are not shuffled into regions with unreliable coverage (`bedtools shuffle -chrom -excl`, v2.31.0) (Quinlan & Hall, 2010). We then compute FIRE scores associated with the shuffled genome. Recalling that the FIRE score for the \\(g\\)th position in the genome is denoted \\(S_g\\), we divide the number of bases that have shuffled FIRE scores above \\(S_g\\) by the number of bases that have un-shuffled FIRE scores above \\(S_g\\). This provides an estimate of the FDR associated with the FIRE score \\(S_g\\).
We shuffle the location of an entire Fiber-seq read by selecting a random start position within the chromosome and relocating the entire read to that start position. Fiber-seq reads originating from regions with unreliable coverage are not shuffled, and reads from regions with reliable coverage are not shuffled into regions with unreliable coverage (`bedtools shuffle -chrom -excl`, v2.31.0). We then compute FIRE scores associated with the shuffled genome. Recalling that the FIRE score for the \\(g\\)th position in the genome is denoted \\(S_g\\), we divide the number of bases that have shuffled FIRE scores above \\(S_g\\) by the number of bases that have un-shuffled FIRE scores above \\(S_g\\). This provides an estimate of the FDR associated with the FIRE score \\(S_g\\).

### Peak calling
Peaks are called by identifying FIRE score local-maxima that have FDR values below a 5% threshold and at least 10% actuation (Cg / Rg). Adjacent local maxima that share 50% of the underlying FIRE elements or have 90% reciprocal overlap are merged into a single peak, using the higher of the two local maxima. Then, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculated and reported wide peaks by taking the union of the FIRE peaks and all regions below the FDR threshold and then merged the resulting regions that were within one nucleosome (147 bp) of one another.
Peaks are called by identifying FIRE score local-maxima that have FDR values below a 5% threshold and that exceed an optional percent actuation threshold \\(C_g / R_g\\). Adjacent local maxima that share 50% of the underlying FIRE elements or have 90% reciprocal overlap are merged into a single peak, using the higher of the two local maxima. Then, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculated and reported wide peaks by taking the union of the FIRE peaks and all regions below the FDR threshold and then merged the resulting regions that were within one nucleosome (147 bp) of one another.

2 changes: 2 additions & 0 deletions src/fire/methods/methods.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
# FIRE methods
This chapter describes the methods and training used in the FIRE pipeline. There are two major components, identifying per-molecule FIRE elements, and aggregating these elements into peaks.
File renamed without changes.
68 changes: 68 additions & 0 deletions src/fire/outputs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
# Outputs of the FIRE pipeline

The FIRE pipeline returns outputs for a sample in the directory `results/{sample}/` and files are labeled with the sample (`{sample}`) and the version of the FIRE pipeline (`{v}`). The following files and directories are generated:

| Output | Description |
| ---------------------------------------- | -------------------------------------------------------------------------------------------------- |
| {sample}-fire-{v}-filtered.cram | CRAM file containing the all the data used in the FIRE pipeline. |
| {sample}-fire-{v}-peaks.bed.gz | BED file containing the FIRE peaks calls |
| {sample}-fire-{v}-hap-differences.bed.gz | BED file containing the results of searching for haplotype-selective peaks. |
| {sample}-fire-{v}-pileup.bed.gz | BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more. |
| {sample}-fire-{v}-qc.tbl.gz | Table containing quality control metrics for the FIRE CRAM. |
| trackHub-{v}/ | Directory containing a UCSC trackHub for visualizing all the results of the FIRE pipeline. |
| additional-outputs-{v}/ | Directory containing additional outputs from the FIRE pipeline. |

# More details on the individual outputs

## The `{sample}-fire-{v}-filtered.cram` file

The CRAM file contains all the data used in the FIRE pipeline. It is a CRAM file that can be viewed with IGV or other genome browsers. Sequencing quality scores are removed from the CRAM file to reduce the file size since per base quality scores are not used in the FIRE pipeline, as well as reads with insufficient m6A signal. The CRAM file is sorted and indexed.

## The `{sample}-fire-{v}-peaks.bed.gz` file

This is the peak file for the FIRE method. Peaks are called by identifying FIRE score ([methods](methods/aggregation.md)) local-maxima that have FDR values below a threshold. By default, the pipeline reports peaks at a 5% FDR threshold. Once a local-maxima is identified, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculate and report wide peaks in the `additional-outputs/` by taking the union of the FIRE peaks and all regions below the FDR threshold and then merging resulting regions that are within one nucleosome (147 bp) of one another.

The FIRE peaks file has the following columns:
| Column | Description |
| --- | --- |
| #chrom | Chromosome of the peak |
| peak\*start | Start of the peak |
| peak_end | End of the peak |
| start | Start of the maximum of the peak |
| end | End of the maximum of the peak |
| coverage | Coverage of the peak |
| fire_coverage | Coverage of the FIREs in the peak |
| score | FIRE score of the peak (see [methods](methods/aggregation.md)) |
| nuc_coverage | Coverage of the nucleosomes in the peak |
| msp_coverage | Coverage of the MSPs in the peak |
| .\*\*{H1,H2} | Repeats of previous columns but specific for the two haplotypes |
| FDR | False discovery rate of the peak |
| log_FDR | -10\*log10 of the FDR |
| FIRE_size_mean | Mean size of the FIREs in the peak |
| FIRE_size_ssd | Standard deviation of the size of the FIREs in the peak |
| FIRE_start_ssd | Standard deviation of the start of the FIREs in the peak |
| FIRE_end_ssd | Standard deviation of the end of the FIREs in the peak |
| pass_coverage | Whether the peak passes coverage filters |

## The `{sample}-fire-{v}-hap-differences.bed.gz` file

This file primarily contains the same columns as the FIRE peaks file but additionally has a `p_value` column with the results of a Fisher's exact test for the difference in coverage between the two haplotypes, and a `p_adjust` column with the Benjamini-Hochberg adjusted p-value. See the [methods](methods/haplotype-selective.md) for more details.

## The `{sample}-fire-{v}-pileup.bed.gz` file

This is a BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more. The columns are calculated using `ft-pileup` and more details can be found in the `ft-pileup` documentation.

## The `{sample}-fire-{v}-qc.tbl.gz` file

This file contains quality control metrics for the FIRE CRAM. The results are directly created by `ft-qc` and more details can be found in the `ft-qc` documentation.

## The `trackHub-{v}/` directory

The `trackHub-{v}/` directory contains a UCSC trackHub for visualizing all the results of the FIRE pipeline. A description of the trackHub can be found in `trackHub/fire-description.html`. The trackHub can be loaded into UCSC by uploading the trackHub directory to a public facing website and then loading the `hub.txt`'s URL into the UCSC trackHub browser.

A copy of the trackHub description can be found [here](https://s3-us-west-2.amazonaws.com/stergachis-public1/Mitchell/temp/FIRE/dev/trackHub/fire-description.html).

## The `additional-outputs-{v}/` directory

The `additional-outputs-{v}/` directory contains the following files:
TODO
73 changes: 73 additions & 0 deletions src/fire/run.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# Running the FIRE pipeline

The FIRE pipeline is a Snakemake workflow for calling Fiber-seq Inferred Regulatory Elements (FIREs) on single molecules and peak calling with Fiber-seq.

## Install

Please start by installing [pixi](https://pixi.sh/latest/) which handles the environment of the FIRE workflow.

Then install FIRE using `git` and `pixi`:

```bash
git clone https://github.com/fiberseq/FIRE.git
pixi install
```

We then recommend quickly testing your installation by running the test suite:

```bash
pixi run test
```

We recommend setting a Snakemake conda prefix and the Apptainer cache directory in your `bashrc`, e.g. in the Stergachis lab add:

```bash
export SNAKEMAKE_CONDA_PREFIX=/mmfs1/gscratch/stergachislab/snakemake-conda-envs
export APPTAINER_CACHEDIR=/mmfs1/gscratch/stergachislab/snakemake-conda-envs/apptainer-cache
```

Then Snakemake installs all the additional requirements as conda envs in that directory.

If you wish to distribute jobs across a cluster you may need to install the appropriate [snakemake executor plugin](https://snakemake.github.io/snakemake-plugin-catalog/). The SLURM executor is included in the environment (`snakemake-executor-slurm`)

## Configuring

See the [configuration README](https://github.com/fiberseq/FIRE/tree/main/config), the example [configuration file](https://github.com/fiberseq/FIRE/blob/main/config/config.yaml), and the example [manifest file](https://github.com/fiberseq/FIRE/blob/main/config/config.tbl) for configuration options.

## Run

The `FIRE` workflow can be executed using the `pixi run fire` command. Under the hood this runs a `snakemake` workflow and any extra parameters are passed directly to snakemake. For example:

```bash
pixi run fire --configfile config/config.yaml
```

If you want to do a dry run:

```bash
pixi run fire --configfile config/config.yaml -n
```

If you want to execute across a cluster (modify `profiles/slurm-executor` as needed for distributed execution):

```bash
pixi run fire --configfile config/config.yaml --profile profiles/slurm-executor
```

And if you want to run the FIRE workflow from another directory you can do so with:

```bash
pixi run --manifest-path /path/to/snakemake/pixi.toml fire ...
```

where you update `/path/to/snakemake/pixi.toml` to the path of the `pixi.toml` in the cloned FIRE repository.

You can also run snakemake directly, e.g.:

```bash
pixi shell
snakemake \
--configfile config/config.yaml \
--profile profiles/slurm-executor \
--local-cores 8 -k
```