Skip to content

Latest commit

 

History

History
217 lines (189 loc) · 24.9 KB

README.md

File metadata and controls

217 lines (189 loc) · 24.9 KB

ModHMM

ModHMM is a genome segmentation method that is easy to apply and requires no manual interpretation of hidden states. It provides highly accurate annotations of chromatin states including active promoters and enhancers. The following is a list of states detected by ModHMM:

State name Description
PA active promoter
EA active enhancer
BI bivalent region
PR primed region
EA:tr active enhancer in a transcribed region
BI:tr bivalet region in a transcribed region
PR:tr primed region in a transcribed region
TR transcribed region
R1 H3K27me3 repressed
R2 H3K9me3 repressed
NS no signal
CL control signal

References:

Philipp Benner and Martin Vingron. ModHMM: A Modular Supra-Bayesian Genome Segmentation Method. Journal of Computational Biology. Apr 2020. 442-457. [Link]

ModHMM

Contact

E-Mail: <philipp.benner@bäm.de> (replace ä with a).

Available Segmentations

ModHMM segmentations are available for several ENCODE data sets:

Tissue Version Segmentation Posteriors Config Comment
GRCh38 ascending aorta 1.2.2 Segmentation Posteriors Config total RNA-seq
GRCh38 gastrocnemius medialis 1.2.2 Segmentation Posteriors Config total RNA-seq
GRCh38 heart left ventricle 1.2.2 Segmentation Posteriors Config total RNA-seq
GRCh38 lung upper lobe 1.2.2 Segmentation Posteriors Config total RNA-seq
GRCh38 pancreas body 1.2.2 Segmentation Posteriors Config total RNA-seq
GRCh38 spleen 1.2.2 Segmentation Posteriors Config total RNA-seq
GRCh38 stomach 1.2.2 Segmentation Posteriors Config total RNA-seq
GRCh38 tibial nerve 1.2.2 Segmentation Posteriors Config total RNA-seq
GRCh38 transverse colon 1.2.2 Segmentation Posteriors Config total RNA-seq
GRCh38 uterus 1.2.2 Segmentation Posteriors Config total RNA-seq
mm10 forebrain embryo day11.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 forebrain embryo day12.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 forebrain embryo day13.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 forebrain embryo day14.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 forebrain embryo day15.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 forebrain embryo day16.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 heart embryo day14.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 heart embryo day15.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 hindbrain embryo day11.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 hindbrain embryo day12.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 hindbrain embryo day13.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 hindbrain embryo day14.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 hindbrain embryo day15.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 hindbrain embryo day16.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 kidney embryo day14.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 kidney embryo day15.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 kidney embryo day16.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 limb embryo day14.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 limb embryo day15.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 liver embryo day11.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 liver embryo day12.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 liver embryo day13.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 liver embryo day14.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 liver embryo day15.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 liver embryo day16.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 lung embryo day14.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 lung embryo day15.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 lung embryo day16.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 midbrain embryo day11.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 midbrain embryo day12.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 midbrain embryo day13.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 midbrain embryo day14.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 midbrain embryo day15.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq
mm10 midbrain embryo day16.5 1.2.2 Segmentation Posteriors Config poly-A RNA-seq

Installation

ModHMM can be installed by either downloading a binary from the binary repository or by compiling the program from source.

To compile ModHMM you must first install the Go compiler. Afterwards, ModHMM can be installed as follows:

  go install -v github.com/pbenner/modhmm@latest

Alternatively, it is also possible to clone the repository and compile modhmm manually

  git clone github.com/pbenner/modhmm
  make
  make install

Computing ModHMM Segmentations

Unlike most other genome segmentation methods, ModHMM so far depends on data from a fixed set of features or assays (ATAC/DNase, H3K27ac, H3K9me3, H3K27me3, H3K4me1, H3K4me3, WCE/IgG, and RNA-seq). Open chromatin information must be either provided as ATAC-seq or DNase-seq data. Preferentially, the data should be given as BAM files, but it is also possible to use bigWig files as input.

ModHMM requires a configuration file in JSON format. The following is a simple example where data is provided as BAM files:

{
    # Directory containing feature alignment files
    "Bam Directory" : ".bam",
    # Names of alignment files (for each feature a comma separated list of
    # replicates must be specified, any number of replicates is supported)
    "Bam Files"     : {
        "ATAC"      : [    "atac-rep1.bam",     "atac-rep2.bam"],
        #"DNase"    : [   "dnase-rep1.bam",    "dnase-rep2.bam"],
        "H3K27ac"   : [ "h3k27ac-rep1.bam",  "h3k27ac-rep2.bam"],
        "H3K27me3"  : ["h3k27me3-rep1.bam", "h3k27me3-rep2.bam"], # optional feature
        "H3K9me3"   : [ "h3k9me3-rep1.bam",  "h3k9me3-rep2.bam"], # optional feature
        "H3K4me1"   : [ "h3k4me1-rep1.bam",  "h3k4me1-rep2.bam"],
        "H3K4me3"   : [ "h3k4me3-rep1.bam",  "h3k4me3-rep2.bam"],
        "RNA"       : [     "rna-rep1.bam",      "rna-rep2.bam"],
        "Control"   : [ "control-rep1.bam",  "control-rep2.bam"]  # optional feature
    },
    # Number of threads used for computing coverage bigWigs (memory intense!)
    "Coverage Threads"                : 5,
    # Directory containing all auxiliary files and the final segmentation
    "Directory"                       : "mm10-liver-embryo-day12.5",
    "Description"                     : "liver embryo day12.5",
    # Number of threads used for evaluating classifiers and computing the segmentation
    "Threads"                         : 20,
    # Verbose level (0: no output, 1: low, 2: high)
    "Verbose"                         : 1
}

ModHMM computes segmentations in several stages. At every stage the output is saved as a bigWig file, which can be inspected in a genome browser. The location and name of each bigWig file can be configured. A full set of all options is printed with modhmm --genconf.

To execute ModHMM simply run (assuming the configuration file is named config.json):

  modhmm -c config.json segmentation

Extracting Promoter and Enhancer Predictions

Genome segmentations are discretized predictions of chromatin states. They do not contain any information about the certainty of a particular prediction. Another drawback is that the number of predicted promoters and enhancers depends on the quality of the data, in particular the sequencing depth. Especially for differential analysis the dependency on the data quality might be hindering. In addition to genome segmentations, ModHMM can compute chromatin state probabilities:

  modhmm -c config.json eval-posterior-marginals

This command will compute genome-wide probabilities for all chromatin states and export them as bigWig files named posterior-marginal-STATE.bw. By default, probabilities are on log-scale. With the option --std-scale ModHMM exports probabilities on standard scale. In this case, bigWig files are named posterior-marginal-exp-STATE.bw.

ModHMM also implements a simple peak-finding algorithm that can be used to call high-probability regions in chromatin state probability tracks:

  modhmm -c config.json call-posterior-marginal-peaks --threshold=0.8

This command outputs tables with identified peaks, i.e. all regions with probabilities higher than the given threshold.

Using ModHMM as a Peak Caller

Most peak callers use a single pre-defined model for computing enrichment probabilities and detecting peaks. In most cases there is a strong model misfit, because of the strong heterogeneity of ChIP-seq data. ModHMM instead allows to fit a mixture distribution (single-feature model) to the observed coverage values with a user-defined set of components. The following command calls ATAC-seq peaks using the estimated single-feature model, if available (see previous section):

  modhmm -c config.json call-single-feature-peaks atac

Using bigWig files as input

The following configuration can be used if data instead is given in bigWig format:

{
    # Data is provided as bigWig files. Set all coverage files static!
    "Coverage Files": {
        "ATAC"   : {"Filename": "coverage-atac.bw",    "Static": true },
        #"DNase" : {"Filename": "coverage-dnase.bw",   "Static": true },
        "H3K27ac": {"Filename": "coverage-h3k27ac.bw", "Static": true },
        "H3K4me1": {"Filename": "coverage-h3k4me1.bw", "Static": true },
        "H3K4me3": {"Filename": "coverage-h3k4me3.bw", "Static": true },
        "RNA"    : {"Filename": "coverage-rna.bw",     "Static": true },
        "Control": {"Filename": "coverage-control.bw", "Static": true }
    },
    # Directory containing all auxiliary files and the final segmentation
    "Directory"                       : "mm10-liver-embryo-day12.5",
    "Description"                     : "liver embryo day12.5",
    # Number of threads used for evaluating classifiers and computing the segmentation
    "Threads"                         : 20,
    # Verbose level (0: no output, 1: low, 2: high)
    "Verbose"                         : 1
}

Coverage bigWig files should contain discrete count data and must be placed in the directory mm10-liver-embryo-day12.5. Setting the option Static to true tells ModHMM that the provided bigWig files are not automatically generated and should not be overwritten.

Use Cases

Example 1: Compute segmentation on ENCODE data from mouse embyonic liver at day 12.5

Create a configuration file named mm10-liver-embryo-day12.5.json (ModHMM accepts an extended JSON format that allows comments):

{
    "Bam Directory" : ".bam",
    "Bam Files"     : {
        "ATAC"      : ["ENCFF929LOH.bam", "ENCFF848NLJ.bam"],
        "H3K27ac"   : ["ENCFF524ZFV.bam", "ENCFF322QGS.bam"],
        "H3K27me3"  : ["ENCFF811DWT.bam", "ENCFF171KAM.bam"],
        "H3K9me3"   : ["ENCFF293UCG.bam", "ENCFF777XFH.bam"],
        "H3K4me1"   : ["ENCFF788JMC.bam", "ENCFF340ACH.bam"],
        "H3K4me3"   : ["ENCFF211WGC.bam", "ENCFF587PZE.bam"],
        "RNA"       : ["ENCFF405LEY.bam", "ENCFF627PCS.bam"],
        "Control"   : ["ENCFF865QGZ.bam", "ENCFF438RYK.bam"]
    },
    "Coverage Threads"                : 5,
    "Directory"                       : "mm10-liver-embryo-day12.5",
    "Description"                     : "liver embryo day12.5",
    "Threads"                         : 20,
    "Verbose"                         : 1
}

ENCODE bam files will be automatically downloaded by ModHMM.

Create output directory

  mkdir mm10-liver-embryo-day12.5

Execute ModHMM:

  modhmm -c mm10-liver-embryo-day12.5.json segmentation