- vConv paper
- Prerequisites
- Step 1: Reproduce Figures 2-3, Supplementary Figures 5-8 and 10-11, and Supplementary Tables 2-3 (benchmarking models on motif identification)
- 1.1 Prepare datasets
- 1.2 Train and evaluate models needed for reproducing these figures
- 1.3 Prepare summary files for reproducing figures and tables from datasets and results above
- 1.4 Reproduce Figures
- 1.4.1 Reproduce Figure 2
- 1.4.2 Reproduce Figure 3
- 1.4.3 Reproduce Supplementary Figure 5
- 1.4.4 Reproduce Supplementary Figure 6
- 1.4.5 Reproduce Supplementary Figure 7
- 1.4.6 Reproduce Supplementary Figure 8
- 1.4.7 Reproduce Supplementary Figure 10
- 1.4.8 Reproduce Supplementary Figure 11
- 1.4.9 Reproduce Supplementary Table 2
- 1.4.10 Reproduce Supplementary Table 3
- Step 2: Reproduce Figure 4 (benchmarking models on motif discovery)
- Step 3: Reproduce Supplementary Fig. 12 B-I (theoretical analysis)
This is the repository for reproducing the figures and tables in the paper "Identifying complex sequence patterns in massive omics data with a variable-convolutional layer in deep neural network".
A Keras-based implementation of vConv is available at https://github.com/gao-lab/vConv.
- ImageMagick
- Python (version 2)
- R
- bedtools
- DREME (version 5.0.1)
- MEME-ChIP (version 5.0.1)
- CisFinder
- numpy
- h5py
- pandas
- seaborn
- scipy
- keras (version 2.2.4)
- tensorflow (version 1.3.0)
- sklearn
Alternatively, if you want to guarantee working versions of each dependency, you can install them via a fully pre-specified conda environment:
conda env create -f corecode/environment_vConv.yml
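After creating the environment, activate it before running the commands below. This sketch assumes the environment defined in the yml is named `vConv`; if activation fails, check the `name:` field of `corecode/environment_vConv.yml`.

conda activate vConv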
- ggpubr
- data.table
- readxl
- foreach
- ggseqlogo
- magick
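If these R packages are not yet installed, the following one-liner is a minimal sketch for installing them from CRAN (assuming a configured CRAN mirror; `magick` additionally requires the ImageMagick library listed above):

Rscript -e 'install.packages(c("ggpubr", "data.table", "readxl", "foreach", "ggseqlogo", "magick"))'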
rm -fr ./vConv
git clone https://github.com/gao-lab/vConv
cp -r ./vConv/corecode ./
- Python 3
- Follow the 'Installation' instructions at https://github.com/calico/basenji to install Basenji, with the following modifications:
- Must use CUDA 10.0
- Must use tensorflow version 2.3.4
- We wrote a tensorflow-2.3.4-compatible vConv for Basset in our code (see the section 'Train basset-related model' below). We do not currently provide public support for vConv on this version of tensorflow.
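A minimal sketch of setting up such an environment (the environment name `basenji` and the Python version are assumptions; follow Basenji's own installation instructions for the full dependency list):

conda create -n basenji python=3.7
conda activate basenji
pip install tensorflow==2.3.4  # per the modifications above; requires a CUDA 10.0 setup
# ...then continue with Basenji's 'Installation' steps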
Step 1: Reproduce Figures 2-3, Supplementary Figures 5-8 and 10-11, and Supplementary Tables 2-3 (benchmarking models on motif identification)
Run the following commands to prepare the datasets.
wget -P ./ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/data.tar.gz
tar -C ./ -xzvf ./data.tar.gz
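A quick check of the extraction (the `JasperMotif` directory holds the simulated dataset described below; the remaining entries are the published datasets):

ls ./data/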
This tarball archive contains both simulated and published datasets. The following code was used to generate the simulation dataset:
mkdir -p ./data/JasperMotif
cd ./data/JasperMotif/
python generateSequencesForSimulation.py
cd -
The training takes about 18 days on a server with 1 CPU core, 32 GB of memory, and one NVIDIA 1080 Ti GPU card. The user can either train the models themselves or use the pre-trained results.
- Run the following commands to train the models.
cd ./train/JasperMotifSimulation
python trainAllVConvBasedAndConvolutionBasedNetworksForSimulation.py
python trainAllVConvBasedNetworksWithoutMSLForSimulation.py
cd -
cd ./train/ZengChIPSeqCode
python trainAllVConvBasedAndConvolutionBasedNetworksForZeng.py
cd -
cd ./train/DeepBindChIPSeqCode2015
python trainAllVConvBasedAndConvolutionBasedNetworksForDeepBind.py
cd -
cd ./train/convergenceSpeed
python estimateConvergenceSpeed.py
cd -
- The user needs to switch to the basenji environment before running these steps (a sketch of the switch follows this list).
- After training finishes, the user needs to deactivate the basenji environment before running any other code.
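A minimal sketch of the environment switch (assuming the Basenji environment is a conda environment named `basenji`, as in the prerequisites above):

conda activate basenji
# ...run the three Basset-related training scripts below...
conda deactivate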
cd ./basset/vConv/9layersvConv/
python TrainBasenjiBasset.py
cd -
cd ./basset/vConv/basenjibasset/
python basenji_train.py params_basset.json ../../../data/data_basset/
cd -
cd ./basset/vConv/singlelayervConv/
python TrainBasenjiBasset.py
cd -
- Run the following commands to obtain the pre-trained models.
mkdir -p ./output
wget -P ./output/ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/result.tar.gz
tar -C ./output/ -xzvf ./output/result.tar.gz
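A quick sanity check that both required directories (see the notes below) are in place:

ls ./data/ ./output/result/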
Note that
- Both the original datasets ("./data") and the trained results ("./output/result/") are needed.
- The scripts below must be run in the order displayed.
cd ./output/code
python checkResultsForSimulation.py
python checkResultsForSimulationWithoutMSL.py
python checkResultsForZengCase.py
python checkResultsForDeepBindCase.py
python extractMaskedKernelsFromSimulation.py
python prepareInputForTomtom.py
python useTomtomToCompareWithRealMotifs.py
cd -
Rscript ./vConvFigmain/code/generate_fig_2.R
The generated figure is located at ./vConvFigmain/result/Fig.2/Fig.2.png.
Rscript ./vConvFigmain/code/generate_fig_3.R
The generated figure is located at ./vConvFigmain/result/Fig.3/Fig.3.png.
Rscript ./vConvFigmain/code/generate_supplementary_figure_5.R
The generated figure is located at ./vConvFigmain/result/Supplementary.Fig.5/Supplementary.Fig.5.png.
Rscript ./vConvFigmain/code/generate_supplementary_figure_6.R
The generated figure is located at ./vConvFigmain/result/Supplementary.Fig.6/Supplementary.Fig.6.png.
Rscript ./vConvFigmain/code/generate_supplementary_figure_7.R
The generated figure is located at ./vConvFigmain/result/Supplementary.Fig.7/Supplementary.Fig.7.png.
cd ./output/code
python checkResultComparedZengSearch.py
cd -
The generated figure panels are located at:
- Supp. Fig. 8A:
output/ModelAUC/ChIPSeq/Pic/worseData/DataSize.png
- Supp. Fig. 8B:
output/ModelAUC/ChIPSeq/Pic/worseData/DataSizeWorseCase.png
- Supp. Fig. 8C:
"output/ModelAUC/ChIPSeq/Pic/convolution-based network from Zeng et al., 2016Boxplot.png"
cd ./output/SpeedTest/code
python DrawLoss.py
cd -
The generated figure panels are located at:
- 2 motifs:
output/SpeedTest/Png/2.jpg
- 4 motifs:
output/SpeedTest/Png/4.jpg
- 6 motifs:
output/SpeedTest/Png/6.jpg
- 8 motifs:
output/SpeedTest/Png/8.jpg
- TwoDiffMotif1:
output/SpeedTest/Png/TwoDiff1.jpg
- TwoDiffMotif2:
output/SpeedTest/Png/TwoDiff2.jpg
- TwoDiffMotif3:
output/SpeedTest/Png/TwoDiff3.jpg
- Basset:
output/SpeedTest/Png/basset.jpg
Rscript ./vConvFigmain/code/generate_supplementary_figure_11.R
The generated figure is located at ./vConvFigmain/result/Supplementary.Fig.11/Supplementary.Fig.11.png.
By now this table should have been generated at ./vConvFigmain/supptable23/SuppTable2.csv. Use the script below to regenerate it.
cd ./output/code
python checkResultsForSimulation.py
cd -
By now this table should have been generated at ./vConvFigmain/supptable23/SuppTable3.csv. Use the script below to regenerate it.
cd ./output/code
python checkResultsForSimulationWithoutMSL.py
cd -
mkdir -p ./vConvMotifDiscovery/ChIPSeqPeak/
wget -P ./vConvMotifDiscovery/ChIPSeqPeak/ http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/files.txt
for file in $(cut -f 1 ./vConvMotifDiscovery/ChIPSeqPeak/files.txt | grep narrowPeak.gz)
do
wget -P ./vConvMotifDiscovery/ChIPSeqPeak/ http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/${file}
gunzip ./vConvMotifDiscovery/ChIPSeqPeak/${file}
done
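# Optional sanity check (illustrative, not a required step): the loop above
# should have left one uncompressed .narrowPeak file per matching entry in files.txt
ls ./vConvMotifDiscovery/ChIPSeqPeak/*.narrowPeak | wc -l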
mkdir -p ./data
wget -P ./data/ http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip ./data/hg19.fa.gz
The user can either generate the results themselves or use the pre-computed version. Note that both the data files and the results are needed for reproducing Figure 4.
See Supplementary Fig. 4 (shown below) for a description of each step.
## extract sequences
cd ./vConvMotifDiscovery/code/MLtools
python extractSequences.py
cd -
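# Conceptually, the sequence-extraction step pulls peak sequences out of the
# hg19 FASTA, e.g. via a bedtools call of the following form (an illustrative
# sketch only; 'peaks.narrowPeak' and 'peaks.fa' are placeholder names, and
# the script's actual logic may differ):
#     bedtools getfasta -fi ./data/hg19.fa -bed peaks.narrowPeak -fo peaks.fa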
## generate motifs by CisFinder, DREME, and MEME-ChIP
cd ./vConvMotifDiscovery/code/MLtools
python generateMotifsByCisFinder.py
python generateMotifsByDREME.py
python generateMotifsByMEMEChIP.py
cd -
## generate motifs by vConv-based networks
cd ./vConvMotifDiscovery/code/vConvBased
python generateMotifsByVConvBasedNetworks.py
cd -
cd ./vConvMotifDiscovery/code/CisfinderFile
python convertIntoCisfinderFormat.py
python splitIntoIndividualMotifFiles.py
python scanSequencesWithCisFinder.py
python combineCisFinderResults.py
cd -
cd ./vConvMotifDiscovery/code/Analysis
python computeAccuracy.py
python summarizeResults.py
cd -
wget -P ./vConvMotifDiscovery/output/ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/AUChdf5.tar.gz
tar -C ./vConvMotifDiscovery/output/ -xzvf ./vConvMotifDiscovery/output/AUChdf5.tar.gz
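To locate the pre-computed AUC results after extraction (the tarball name suggests HDF5 files; the exact layout may differ):

find ./vConvMotifDiscovery/output/ -name '*.hdf5' -o -name '*.h5'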
Supplementary Figure 9 is generated together with Figure 4.
Rscript ./vConvFigmain/code/generate_fig_4.R
The generated Figure 4 is located at ./vConvFigmain/result/Fig.4/Fig.4.png.
The generated Supplementary Figure 9 is located at ./vConvFigmain/result/Supplementary.Fig.9/Supplementary.Fig.9.png.
cd theoretical/code/
python runSimulation.py
python trainCNN.py
cd -
cd theoretical/code/
python plotFigures.py
cd -
The generated figure panels are located at:
- Supp. Fig. 12B:
theoretical/Motif/ICSimu/simuMtf_Len-8_totIC-10.png
- Supp. Fig. 12C:
theoretical/Motif/ICSimu/simuMtf_Len-23_totIC-12.png
- Supp. Fig. 12D:
theoretical/figure/simuMtf_Len-8_totIC-10.png
- Supp. Fig. 12E:
theoretical/figure/simuMtf_Len-23_totIC-12.png
- Supp. Fig. 12F:
theoretical/figure/simuMtf_Len-8_totIC-10rank.png
- Supp. Fig. 12G:
theoretical/figure/simuMtf_Len-23_totIC-12rank.png
- Supp. Fig. 12H:
theoretical/figure/simu01.png
- Supp. Fig. 12I:
theoretical/figure/simu02.png