Affinity Distillation is a method for extracting thermodynamic affinities de novo from deep learning models of ChIP experiments. It has been tested with neural networks that model base-resolution in-vivo binding profiles of yeast and mammalian TFs. Affinity Distillation accurately predicts the energetic impact on TF binding of variation in the underlying motif and in the local sequence context. It relies on in-silico marginalization against many sequence backgrounds, which yields a higher dynamic range and more accurate predictions than motif discovery algorithms. Systematic comparisons with other predictive algorithms consistently show that Affinity Distillation predicts affinities more accurately across a wide range of TF structural classes and DNA sequences.
This repository implements the methods described in "De novo distillation of thermodynamic affinity from deep learning regulatory sequence models of in vivo protein-DNA binding", as well as the comparisons with other commonly used methods such as Weeder, MoDISco, and STREME.
Here is a video of a short talk given at ISMB. Here is an example of Affinity Distillation being used to predict effects identical to those measured in-vitro from neural networks trained only on in-vivo genome-wide chromatin immunoprecipitation data.
The train_and_eval directory contains all the relevant code for model training, testing, interpretation, and evaluation.
train_model.py can be used to train base-pair-resolution neural networks. The architecture is based on https://github.com/kundajelab/basepair/blob/cda0875571066343cdf90aed031f7c51714d991a/basepair/models.py#L534
The script accepts a JSON file specifying the network parameters. Here is an example JSON file that was used to train models for mammalian TFs such as MAX in HeLa-S3 cells:
{
"seq_len": 1346,
"out_pred_len": 1000,
"c_task_weight": 100,
"filters": 64,
"n_dil_layers": 6,
"conv1_kernel_size": 21,
"dil_kernel_size": 3,
"outconv_kernel_size": 75,
"lr": 0.001,
"genome_fasta": "hg38.genome.fa",
"genome_sizes": "hg38.chrom.sizes"
}
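As a sanity check on the example above, the relationship between seq_len and out_pred_len can be reproduced with simple arithmetic. The sketch below assumes unpadded ("valid") convolutions and dilation rates that double per layer starting at 2 (2, 4, ..., 64 for six layers). Neither assumption is stated in this README, and output_length is a hypothetical helper, not part of train_model.py. Under those assumptions the 1346 bp input is trimmed to exactly the 1000 bp output.

```python
import json

# The example configuration from above, parsed from a string for self-containment.
config = json.loads("""
{
  "seq_len": 1346,
  "out_pred_len": 1000,
  "c_task_weight": 100,
  "filters": 64,
  "n_dil_layers": 6,
  "conv1_kernel_size": 21,
  "dil_kernel_size": 3,
  "outconv_kernel_size": 75,
  "lr": 0.001,
  "genome_fasta": "hg38.genome.fa",
  "genome_sizes": "hg38.chrom.sizes"
}
""")


def output_length(cfg: dict) -> int:
    """Length left after all convolutions, assuming no padding (hypothetical helper)."""
    # First convolution trims (kernel_size - 1) positions.
    length = cfg["seq_len"] - (cfg["conv1_kernel_size"] - 1)
    # Assumed dilation schedule: 2, 4, 8, ... (one rate per dilated layer).
    for i in range(1, cfg["n_dil_layers"] + 1):
        dilation = 2 ** i
        length -= dilation * (cfg["dil_kernel_size"] - 1)
    # Output convolution trims (kernel_size - 1) positions.
    length -= cfg["outconv_kernel_size"] - 1
    return length


print(output_length(config))  # 1000, which equals config["out_pred_len"]
```

If these assumptions hold, changing any kernel size or the number of dilated layers means seq_len has to be adjusted so that the two lengths still agree.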
interpret_model.py can be used to obtain DeepSHAP contribution scores. The script uses the DeepExplainer implementation of SHAP, which is an updated version of the DeepLIFT algorithm, to interpret all models. It uses a shuffled reference with 20 random shuffles. The results of running this code are used as inputs for the run_modisco scripts, which can be configured to compute consensus motifs using the TF-MoDISco algorithm.
The evaluate scripts are used to compute the in silico marginalization scores. In silico marginalization relies on the counts head of BPNet. To obtain a marginalization score for a sequence: (1) background sequences are generated by dinucleotide-shuffling DNA sequences from held-out genomic peaks; (2) the sequence of interest is inserted at the center of each background sequence; (3) the model predictions from the counts head are stored for both the background sequences and the sequences with the insert; (4) the mean of the differences between the two sets of predictions across the different backgrounds (the mean predicted log count ratio, Δ log(counts)) is the marginalization score for the sequence of interest. At this stage the scores are uncalibrated.
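The four steps can be sketched in a few lines of Python. This is a minimal illustration, not the code in the evaluate scripts: marginalization_score and insert_at_center are hypothetical names, the toy predictor stands in for a trained counts head, and the backgrounds are uniform random DNA rather than dinucleotide-shuffled held-out peaks.

```python
import math
import random
from typing import Callable, Sequence


def insert_at_center(background: str, insert: str) -> str:
    """Replace the central bases of `background` with `insert`, keeping the length."""
    start = (len(background) - len(insert)) // 2
    return background[:start] + insert + background[start + len(insert):]


def marginalization_score(
    predict_log_counts: Callable[[str], float],
    backgrounds: Sequence[str],
    insert: str,
) -> float:
    """Mean over backgrounds of log-count(with insert) minus log-count(background)."""
    deltas = [
        predict_log_counts(insert_at_center(bg, insert)) - predict_log_counts(bg)
        for bg in backgrounds
    ]
    return sum(deltas) / len(deltas)


# Toy stand-in for a trained counts head (assumption, for illustration only):
# the predicted log count grows with the number of CACGTG matches.
def toy_log_counts(seq: str) -> float:
    return math.log(1.0 + seq.count("CACGTG"))


rng = random.Random(0)
# Stand-in backgrounds: uniform random DNA, NOT dinucleotide-shuffled peaks.
toy_backgrounds = [
    "".join(rng.choice("ACGT") for _ in range(200)) for _ in range(100)
]

strong = marginalization_score(toy_log_counts, toy_backgrounds, "CACGTG")
weak = marginalization_score(toy_log_counts, toy_backgrounds, "AAAAAA")
```

With a real model, predict_log_counts would wrap a batched call to the counts head, and averaging over many backgrounds is what removes the dependence on any single flanking context.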
The analyses_and_plots directory contains all the relevant code for reproducing the paper results, figures, and analyses.
Calibration refers to mapping Δ log(counts) to binding free energies. It infers binding free energies using a regression model: using a sample of in-vitro measurements, we fit a linear regression model with the marginalization scores as the input. The resulting function recovers the inference we would have obtained if the neural network model were predicting in relative free energy space. We can then generalize to other sequences, for which no measurements are available, by applying the fitted regression model to their marginalization scores. To be consistent across comparisons, the same samples were used to calibrate all the methods. As a result, the calibration step is the first step of any comparison script.
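A minimal sketch of this calibration step, assuming an ordinary least squares fit of measured free energies against marginalization scores. The function names (fit_line, calibrate, rmse) are hypothetical, and the numbers are synthetic toy values generated from a known line, not measurements from the paper.

```python
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate(scores: Sequence[float], slope: float, intercept: float) -> List[float]:
    """Map uncalibrated marginalization scores to predicted free energies."""
    return [slope * s + intercept for s in scores]


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean squared error between predictions and observations."""
    n = len(predicted)
    return (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n) ** 0.5


# Synthetic calibration sample (toy values, arbitrary units): scores for a few
# sequences and "measured" free energies generated from a known line.
sample_scores = [0.0, 0.5, 1.0, 1.5, 2.0]
sample_ddg = [-0.9 * s + 1.2 for s in sample_scores]

slope, intercept = fit_line(sample_scores, sample_ddg)

# Apply the fitted line to scores of sequences that have no measurements.
new_scores = [0.25, 1.75]
predicted_ddg = calibrate(new_scores, slope, intercept)
```

The same rmse helper illustrates how a post-calibration error could be computed once measurements for the held-out sequences are available.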
- plot_raw_counts_comparison finds genomic matches for the in-vitro sequences and plots observed and predicted log-transformed read counts for the training and held-out test chromosomes. This is a common baseline plot in several main figures.
- plot_gmatches scripts are related; they plot Affinity Distillation-predicted marginalization scores vs measured binding free energies for sequences present in both in-vivo and in-vitro experiments.
- plot_results scripts generalize this to plot Affinity Distillation-predicted marginalization scores vs measured binding free energies, as well as calibrated predictions vs measured binding free energies, for all sequences in the in-vitro experiments. For example, for the BET-seq library in figures 1 and 2, predictions are present for all 1,048,576 possible NNNNNCACGTGNNNNN sequences in the BET-seq experiments.
- plot_rmse scripts produce the post-calibration RMSEs of predictions vs observations for each TF. These include the performance of Weeder, STREME, MoDISco, and Affinity Distillation. They can be used to show the performance of the top output of each algorithm, as well as mean values and standard deviations. This setup of baseline, evaluation, and summary plots is present in several figures throughout the paper.
- yeast_gcpbm focuses on gcPBM experiments of Pho4 and Cbf1 in yeast. These are used to analyze the differential specificity of paralogous TFs using Affinity Distillation, as shown in figure 3.
- mammal_gcpbm analyzes gcPBM experiments of MAX, GABPA, E2F1, and others in a variety of mammalian cells. These are used to showcase the versatility of Affinity Distillation across different TF structural families and cell types, as shown in figure 6.
- dynamic_range_analysis plots histograms of observed and predicted intensity distributions for different algorithms. It also plots observed and predicted standard deviations and RMSEs for log-transformed intensity distributions, and the standard deviations and RMSEs of Affinity Distillation predictions as a function of the number of backgrounds (1, 2, 5, 10, 20, 50, or 100 sequences) used in in silico marginalization. The results are shown in figure 4.
- augmentation_analysis analyzes GC-matched augmentation strategies. It focuses on testing prediction accuracy for an out-of-distribution yeast DNA library designed to test how short tandem repeats flanking known binding sites alter binding affinities. The results are shown in figure 4.
- library_analysis_GR focuses on MITOMI measurements of the glucocorticoid receptor (GR). It can be used to produce scatterplots showing the predictive performance of Affinity Distillation on single-substitution variations of consensus sites (MITOMI and ChIP), mutations to half sites (MITOMI and ChIP), alternate spacer sequences, and genomic GRE variants. The breakdown of RMSE by type of variation in the library, sorted by the difference between Affinity Distillation and the best-performing motif (in this case MoDISco), is shown in figure 5.
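The BET-seq library size quoted for plot_results follows from the sequence template: NNNNNCACGTGNNNNN has ten free positions, so there are 4^10 = 1,048,576 sequences. The sketch below enumerates such a library lazily; flank_library and library_size are hypothetical helpers for illustration, not functions from this repository.

```python
from itertools import islice, product
from typing import Iterator


def library_size(flank_len: int = 5) -> int:
    """Number of sequences with `flank_len` free bases on each side of the core."""
    return 4 ** (2 * flank_len)


def flank_library(core: str = "CACGTG", flank_len: int = 5) -> Iterator[str]:
    """Lazily yield every sequence of the form N{flank_len} + core + N{flank_len}."""
    for bases in product("ACGT", repeat=2 * flank_len):
        left = "".join(bases[:flank_len])
        right = "".join(bases[flank_len:])
        yield left + core + right


print(library_size())  # 1048576

# Peek at the first few sequences without materializing the whole library.
first_three = list(islice(flank_library(), 3))
```

Because the generator is lazy, each sequence can be scored with in silico marginalization in batches rather than holding the full library in memory.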
All the ChIP-nexus data generated and used for this study are available in GEO:GSE207001. All the MITOMI measurements of glucocorticoid receptor binding are available at Zenodo: https://zenodo.org/record/6762262.