Merge pull request #2 from fiberseq/fire_updates

Updates for FIRE
fiberseq · Nov 26, 2024 · 73339f1 · 73339f1
2 parents 34304f0 + a3da1cd
commit 73339f1
Show file tree

Hide file tree

Showing 13 changed files with 461 additions and 92 deletions.
diff --git a/book.toml b/book.toml
@@ -23,7 +23,7 @@ mathjax-support = true
 
 [output.html.fold]
 enable = true    # whether or not to enable section folding
-level = 1        # the depth to start folding
+level = 2        # the depth to start folding
 
 [preprocessor.toc]
 command = "mdbook-toc"

diff --git a/scripts/make_help_docs.sh b/scripts/make_help_docs.sh
@@ -10,7 +10,7 @@ echo "# Help pages for fibertools subcommands" >$out
 echo "<!-- toc -->" >>$out
 echo "" >>$out
 
-for subcommand in "" "predict-m6a" "fire" "extract" "center" "add-nucleosomes" "footprint" "clear-kinetics" "strip-basemods" "track-decorators" "pileup"; do
+for subcommand in "" "predict-m6a" "fire" "extract" "center" "add-nucleosomes" "footprint" "clear-kinetics" "strip-basemods" "track-decorators" "pileup" "qc"; do
     echo $subcommand
     #out="docs/ft-${subcommand}-help.md"
 

diff --git a/src/SUMMARY.md b/src/SUMMARY.md
@@ -16,16 +16,21 @@
         - [`ft extract`](fibertools/extracting/extract.md)
         - [`ft center`](fibertools/extracting/center.md)
         - [`ft footprint`](fibertools/extracting/footprint.md)
+        - [`ft pileup`](fibertools/extracting/pileup.md)
+        - [`ft qc`](fibertools/extracting/qc.md)
     - [Help pages for `ft` subcommands](fibertools/help.md)
     - [pyft: Python bindings](fibertools/pyft.md)
     - [Installation](fibertools/install.md)
 
 ---
 
 - [FIRE](fire/fire.md)
-    - [Training](fire/training.md)
-    - [Aggregation and peak calling](fire/aggregation.md)
-    - [Identifying haplotype-selective peaks](fire/haplotype-selective.md)
+    - [Running and installing](fire/run.md)
+    - [Outputs](fire/outputs.md)
+    - [Methods](fire/methods/methods.md)
+        - [Training](fire/methods/training.md)
+        - [Aggregation and peak calling](fire/methods/aggregation.md)
+        - [Identifying haplotype-selective peaks](fire/methods/haplotype-selective.md)
 
 ---
 

diff --git a/src/fibertools/extracting/pileup.md b/src/fibertools/extracting/pileup.md
@@ -0,0 +1,7 @@
+# `ft-pileup` 
+The `ft-pileup` command is used to generate a per-base pileup of Fiber-seq data. This command is useful for visualizing the distribution of various features across the genome.
+
+
+## Inputs and options
+
+See the [help message](../help.md#ft-pileup) for details.
diff --git a/src/fibertools/extracting/qc.md b/src/fibertools/extracting/qc.md
@@ -0,0 +1,7 @@
+# `ft-qc`
+The `ft-qc` command is used to generate quality control metrics for a Fiber-seq BAM file. 
+
+
+## Inputs and options
+
+See the [help message](../help.md#ft-qc) for details.
diff --git a/src/fibertools/help.md b/src/fibertools/help.md
diff --git a/src/fire/fire.md b/src/fire/fire.md
@@ -1,12 +1,21 @@
-# FIRE: descriptions, methods, and outputs 
+# FIRE: descriptions, methods, and outputs
 
-For details on running the FIRE pipeline see the [README.md](https://github.com/fiberseq/FIRE).
+The code for running the FIRE pipeline can be found on [GitHub](https://github.com/fiberseq/FIRE), and if you do use FIRE in your work please cite our publication:
+
+> **Vollger, M. R.**, **Swanson, E. G.**, Neph, S. J., Ranchalis, J., Munson, K. M., Ho, C.-H., Sedeño-Cortés, A. E., Fondrie, W. E., Bohaczuk, S. C., Mao, Y., Parmalee, N. L., Mallory, B. J., Harvey, W. T., Kwon, Y., Garcia, G. H., Hoekzema, K., Meyer, J. G., Cicek, M., Eichler, E. E., … Stergachis, A. B. (2024). A haplotype-resolved view of human gene regulation. _bioRxiv_. [https://doi.org/10.1101/2024.06.14.599122](https://doi.org/10.1101/2024.06.14.599122)
 
 ## Summary of Fiber-seq inferred regulatory elements (FIREs)
 
-For those who would prefer video I have recorded a [lab meeting](https://youtu.be/RiZrMltAiWM?si=sSo64goaNQxgyfcc) discussing FIRE and the methods used here. If you prefer to read please continue on. 
+For those who would prefer video I have recorded a [lab meeting](https://youtu.be/RiZrMltAiWM?si=sSo64goaNQxgyfcc) discussing FIRE and the methods used here. If you prefer to read please continue on.
 
-FIREs are MTase sensitive patches (MSPs) that are inferred to be regulatory elements on single chromatin fibers. To do this we used semi-supervised machine learning to identify MSPs that are likely to be regulatory elements using the `Mokapot` framework and `XGBoost`. Every individual FIRE element is associated with an estimated precision value (defined below), which indicates the probability that the FIRE element is a true regulatory element. The estimated precision of FIREs elements are created using `Mokapot` and validation data not used in training. We train our model targeting FIRE elements with at least 95% precision, MSPs with less than 90% precision are considered to have average level of accessibility expected between two nucleosomes, and are referred to as linker regions.
+FIREs are MTase sensitive patches (MSPs) that are inferred to be regulatory elements on single chromatin fibers. To do this we used semi-supervised machine learning to identify MSPs that are likely to be regulatory elements using the `Mokapot` framework and `XGBoost`. Every individual FIRE element is associated with an estimated precision value, which indicates the probability that the FIRE element is a true regulatory element. The estimated precision of FIREs elements are created using `Mokapot` and validation data not used in training. We train our model targeting FIRE elements with at least 95% precision, MSPs with less than 90% precision are considered to have average level of accessibility expected between two nucleosomes, and are referred to as linker regions.
 
 Semi-superivized machine learning with `Mokapot` requires a mixed-positive training set and a clean negative training set. To create mixed positive training data we selected MSPs that overlapped DNase hypersensitive sites (DHSs) and CTCF ChIP-seq peaks. And to create a clean negative training set we selected MSPs that did not overlap DHSs or CTCF ChIP-seq peaks.
 
+For details on running the FIRE workflow see:
+
+- [Running and installing](run.md)
+
+For details on the outputs of the FIRE workflow see:
+
+- [Outputs](outputs.md)
diff --git a/src/fire/aggregation.md → src/fire/methods/aggregation.md b/src/fire/aggregation.md → src/fire/methods/aggregation.md
@@ -12,8 +12,8 @@ where \\(C_g\\) is the number of FIRE elements at the \\(g\\)th position, \\(R_g
 Regions with unreliable coverage are defined as regions with Fiber-seq coverage that deviate from the median coverage by five standard deviations, where a standard deviation is defined by the Poisson distribution (i.e., the square root of the mean coverage). This is a parameter that can be adjusted by the user.
 
 ### FIRE score FDR calculation
-We shuffle the location of an entire Fiber-seq read by selecting a random start position within the chromosome and relocating the entire read to that start position. Fiber-seq reads originating from regions with unreliable coverage (defined above) are not shuffled, and reads from regions with reliable coverage are not shuffled into regions with unreliable coverage (`bedtools shuffle -chrom -excl`, v2.31.0) (Quinlan & Hall, 2010). We then compute FIRE scores associated with the shuffled genome. Recalling that the FIRE score for the \\(g\\)th position in the genome is denoted \\(S_g\\), we divide the number of bases that have shuffled FIRE scores above \\(S_g\\) by the number of bases that have un-shuffled FIRE scores above \\(S_g\\). This provides an estimate of the FDR associated with the FIRE score \\(S_g\\).
+We shuffle the location of an entire Fiber-seq read by selecting a random start position within the chromosome and relocating the entire read to that start position. Fiber-seq reads originating from regions with unreliable coverage are not shuffled, and reads from regions with reliable coverage are not shuffled into regions with unreliable coverage (`bedtools shuffle -chrom -excl`, v2.31.0). We then compute FIRE scores associated with the shuffled genome. Recalling that the FIRE score for the \\(g\\)th position in the genome is denoted \\(S_g\\), we divide the number of bases that have shuffled FIRE scores above \\(S_g\\) by the number of bases that have un-shuffled FIRE scores above \\(S_g\\). This provides an estimate of the FDR associated with the FIRE score \\(S_g\\).
 
 ### Peak calling
-Peaks are called by identifying FIRE score local-maxima that have FDR values below a 5% threshold and at least 10% actuation (Cg / Rg). Adjacent local maxima that share 50% of the underlying FIRE elements or have 90% reciprocal overlap are merged into a single peak, using the higher of the two local maxima. Then, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculated and reported wide peaks by taking the union of the FIRE peaks and all regions below the FDR threshold and then merged the resulting regions that were within one nucleosome (147 bp) of one another.
+Peaks are called by identifying FIRE score local-maxima that have FDR values below a 5% threshold and that exceed an optional percent actuation threshold \\(C_g / R_g\\). Adjacent local maxima that share 50% of the underlying FIRE elements or have 90% reciprocal overlap are merged into a single peak, using the higher of the two local maxima. Then, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculated and reported wide peaks by taking the union of the FIRE peaks and all regions below the FDR threshold and then merged the resulting regions that were within one nucleosome (147 bp) of one another.
 
diff --git a/src/fire/haplotype-selective.md → src/fire/methods/haplotype-selective.md b/src/fire/haplotype-selective.md → src/fire/methods/haplotype-selective.md
diff --git a/src/fire/methods/methods.md b/src/fire/methods/methods.md
@@ -0,0 +1,2 @@
+# FIRE methods
+This chapter describes the methods and training used in the FIRE pipeline. There are two major components, identifying per-molecule FIRE elements, and aggregating these elements into peaks.
diff --git a/src/fire/training.md → src/fire/methods/training.md b/src/fire/training.md → src/fire/methods/training.md
diff --git a/src/fire/outputs.md b/src/fire/outputs.md
@@ -0,0 +1,68 @@
+# Outputs of the FIRE pipeline
+
+The FIRE pipeline returns outputs for a sample in the directory `results/{sample}/` and files are labeled with the sample (`{sample}`) and the version of the FIRE pipeline (`{v}`). The following files and directories are generated:
+
+| Output                                   | Description                                                                                        |
+| ---------------------------------------- | -------------------------------------------------------------------------------------------------- |
+| {sample}-fire-{v}-filtered.cram          | CRAM file containing the all the data used in the FIRE pipeline.                                   |
+| {sample}-fire-{v}-peaks.bed.gz           | BED file containing the FIRE peaks calls                                                           |
+| {sample}-fire-{v}-hap-differences.bed.gz | BED file containing the results of searching for haplotype-selective peaks.                        |
+| {sample}-fire-{v}-pileup.bed.gz          | BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more. |
+| {sample}-fire-{v}-qc.tbl.gz              | Table containing quality control metrics for the FIRE CRAM.                                        |
+| trackHub-{v}/                            | Directory containing a UCSC trackHub for visualizing all the results of the FIRE pipeline.         |
+| additional-outputs-{v}/                  | Directory containing additional outputs from the FIRE pipeline.                                    |
+
+# More details on the individual outputs
+
+## The `{sample}-fire-{v}-filtered.cram` file
+
+The CRAM file contains all the data used in the FIRE pipeline. It is a CRAM file that can be viewed with IGV or other genome browsers. Sequencing quality scores are removed from the CRAM file to reduce the file size since per base quality scores are not used in the FIRE pipeline, as well as reads with insufficient m6A signal. The CRAM file is sorted and indexed.
+
+## The `{sample}-fire-{v}-peaks.bed.gz` file
+
+This is the peak file for the FIRE method. Peaks are called by identifying FIRE score ([methods](methods/aggregation.md)) local-maxima that have FDR values below a threshold. By default, the pipeline reports peaks at a 5% FDR threshold. Once a local-maxima is identified, the start and end positions of the peak are determined by the median start and end positions of the underlying FIRE elements. We also calculate and report wide peaks in the `additional-outputs/` by taking the union of the FIRE peaks and all regions below the FDR threshold and then merging resulting regions that are within one nucleosome (147 bp) of one another.
+
+The FIRE peaks file has the following columns:
+| Column | Description |
+| --- | --- |
+| #chrom | Chromosome of the peak |
+| peak\*start | Start of the peak |
+| peak_end | End of the peak |
+| start | Start of the maximum of the peak |
+| end | End of the maximum of the peak |
+| coverage | Coverage of the peak |
+| fire_coverage | Coverage of the FIREs in the peak |
+| score | FIRE score of the peak (see [methods](methods/aggregation.md)) |
+| nuc_coverage | Coverage of the nucleosomes in the peak |
+| msp_coverage | Coverage of the MSPs in the peak |
+| .\*\*{H1,H2} | Repeats of previous columns but specific for the two haplotypes |
+| FDR | False discovery rate of the peak |
+| log_FDR | -10\*log10 of the FDR |
+| FIRE_size_mean | Mean size of the FIREs in the peak |
+| FIRE_size_ssd | Standard deviation of the size of the FIREs in the peak |
+| FIRE_start_ssd | Standard deviation of the start of the FIREs in the peak |
+| FIRE_end_ssd | Standard deviation of the end of the FIREs in the peak |
+| pass_coverage | Whether the peak passes coverage filters |
+
+## The `{sample}-fire-{v}-hap-differences.bed.gz` file
+
+This file primarily contains the same columns as the FIRE peaks file but additionally has a `p_value` column with the results of a Fisher's exact test for the difference in coverage between the two haplotypes, and a `p_adjust` column with the Benjamini-Hochberg adjusted p-value. See the [methods](methods/haplotype-selective.md) for more details.
+
+## The `{sample}-fire-{v}-pileup.bed.gz` file
+
+This is a BED file containing per-base information on number of FIREs, MSPs, nucleosomes, coverage and more. The columns are calculated using `ft-pileup` and more details can be found in the `ft-pileup` documentation.
+
+## The `{sample}-fire-{v}-qc.tbl.gz` file
+
+This file contains quality control metrics for the FIRE CRAM. The results are directly created by `ft-qc` and more details can be found in the `ft-qc` documentation.
+
+## The `trackHub-{v}/` directory
+
+The `trackHub-{v}/` directory contains a UCSC trackHub for visualizing all the results of the FIRE pipeline. A description of the trackHub can be found in `trackHub/fire-description.html`. The trackHub can be loaded into UCSC by uploading the trackHub directory to a public facing website and then loading the `hub.txt`'s URL into the UCSC trackHub browser.
+
+A copy of the trackHub description can be found [here](https://s3-us-west-2.amazonaws.com/stergachis-public1/Mitchell/temp/FIRE/dev/trackHub/fire-description.html).
+
+## The `additional-outputs-{v}/` directory
+
+The `additional-outputs-{v}/` directory contains the following files:
+TODO
diff --git a/src/fire/run.md b/src/fire/run.md
@@ -0,0 +1,73 @@
+# Running the FIRE pipeline
+
+The FIRE pipeline is a Snakemake workflow for calling Fiber-seq Inferred Regulatory Elements (FIREs) on single molecules and peak calling with Fiber-seq.
+
+## Install
+
+Please start by installing [pixi](https://pixi.sh/latest/) which handles the environment of the FIRE workflow.
+
+Then install FIRE using `git` and `pixi`:
+
+```bash
+git clone https://github.com/fiberseq/FIRE.git
+pixi install
+```
+
+We then recommend quickly testing your installation by running the test suite:
+
+```bash
+pixi run test
+```
+
+We recommend setting a Snakemake conda prefix and the Apptainer cache directory in your `bashrc`, e.g. in the Stergachis lab add:
+
+```bash
+export SNAKEMAKE_CONDA_PREFIX=/mmfs1/gscratch/stergachislab/snakemake-conda-envs
+export APPTAINER_CACHEDIR=/mmfs1/gscratch/stergachislab/snakemake-conda-envs/apptainer-cache
+```
+
+Then Snakemake installs all the additional requirements as conda envs in that directory.
+
+If you wish to distribute jobs across a cluster you may need to install the appropriate [snakemake executor plugin](https://snakemake.github.io/snakemake-plugin-catalog/). The SLURM executor is included in the environment (`snakemake-executor-slurm`)
+
+## Configuring
+
+See the [configuration README](https://github.com/fiberseq/FIRE/tree/main/config), the example [configuration file](https://github.com/fiberseq/FIRE/blob/main/config/config.yaml), and the example [manifest file](https://github.com/fiberseq/FIRE/blob/main/config/config.tbl) for configuration options.
+
+## Run
+
+The `FIRE` workflow can be executed using the `pixi run fire` command. Under the hood this runs a `snakemake` workflow and any extra parameters are passed directly to snakemake. For example:
+
+```bash
+pixi run fire --configfile config/config.yaml
+```
+
+If you want to do a dry run:
+
+```bash
+pixi run fire --configfile config/config.yaml -n
+```
+
+If you want to execute across a cluster (modify `profiles/slurm-executor` as needed for distributed execution):
+
+```bash
+pixi run fire --configfile config/config.yaml --profile profiles/slurm-executor
+```
+
+And if you want to run the FIRE workflow from another directory you can do so with:
+
+```bash
+pixi run --manifest-path /path/to/snakemake/pixi.toml fire ...
+```
+
+where you update `/path/to/snakemake/pixi.toml` to the path of the `pixi.toml` in the cloned FIRE repository.
+
+You can also run snakemake directly, e.g.:
+
+```bash
+pixi shell
+snakemake \
+  --configfile config/config.yaml \
+  --profile profiles/slurm-executor \
+  --local-cores 8 -k
+```
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,2 @@
		# FIRE methods
		This chapter describes the methods and training used in the FIRE pipeline. There are two major components, identifying per-molecule FIRE elements, and aggregating these elements into peaks.