- vConv paper
- Prerequisites
- Step 1: Reproduce Figures 2-3, Supplementary Figures 5-8 and 10-11, and Supplementary Tables 2-3 (benchmarking models on motif identification)
- 1.1 Prepare datasets
- 1.2 Train and evaluate models needed for reproducing these figures
- 1.3 Prepare summary files for reproducing figures and tables from datasets and results above
- 1.4 Reproduce Figures
- 1.4.1 Reproduce Figure 2
- 1.4.2 Reproduce Figure 3
- 1.4.3 Reproduce Supplementary Figure 5
- 1.4.4 Reproduce Supplementary Figure 6
- 1.4.5 Reproduce Supplementary Figure 7
- 1.4.6 Reproduce Supplementary Figure 8
- 1.4.7 Reproduce Supplementary Figure 10
- 1.4.8 Reproduce Supplementary Figure 11
- 1.4.9 Reproduce Supplementary Table 2
- 1.4.10 Reproduce Supplementary Table 3
- Step 2: Reproduce Figure 4 (benchmarking models on motif discovery)
- Step 3: Reproduce Supplementary Fig. 12 B-I (theoretical analysis)
This is the repository for reproducing the figures and tables in the paper "Identifying complex sequence patterns in massive omics data with a variable-convolutional layer in deep neural network".
A Keras-based implementation of vConv is available at https://github.com/gao-lab/vConv.
- ImageMagick
- Python (version 2)
- R
- bedtools
- DREME (version 5.0.1)
- MEME-ChIP (version 5.0.1)
- CisFinder
- numpy
- h5py
- pandas
- seaborn
- scipy
- keras (version 2.2.4)
- tensorflow (version 1.3.0)
- sklearn
Alternatively, if you want to guarantee working versions of each dependency, you can install them via a fully pre-specified conda environment:
conda env create -f corecode/environment_vConv.yml
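After creating the environment, activate it before running the commands below. This sketch assumes the environment defined in the yml is named `vConv`; if activation fails, check the `name:` field of `corecode/environment_vConv.yml`.

conda activate vConv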
- ggpubr
- data.table
- readxl
- foreach
- ggseqlogo
- magick
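If these R packages are not yet installed, the following one-liner is a minimal sketch for installing them from CRAN (assuming a configured CRAN mirror; `magick` additionally requires the ImageMagick library listed above):

Rscript -e 'install.packages(c("ggpubr", "data.table", "readxl", "foreach", "ggseqlogo", "magick"))'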
rm -fr ./vConv
git clone https://github.com/gao-lab/vConv
cp -r ./vConv/corecode ./
- Python 3
- Follow the 'Installation' instructions at https://github.com/calico/basenji to install Basenji, with the following modifications:
- Must use CUDA 10.0
- Must use tensorflow version 2.3.4
- We wrote a tensorflow-2.3.4-compatible vConv for Basset in our code (see the section 'Train basset-related model' below). We do not currently provide public support for vConv on this version of tensorflow.
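A minimal sketch of setting up such an environment (the environment name `basenji` and the Python version are assumptions; follow Basenji's own installation instructions for the full dependency list):

conda create -n basenji python=3.7
conda activate basenji
pip install tensorflow==2.3.4  # per the modifications above; requires a CUDA 10.0 setup
# ...then continue with Basenji's 'Installation' steps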
Step 1: Reproduce Figures 2-3, Supplementary Figures 5-8 and 10-11, and Supplementary Tables 2-3 (benchmarking models on motif identification)
Run the following commands to prepare the datasets.
wget -P ./ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/data.tar.gz
tar -C ./ -xzvf ./data.tar.gz
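A quick check of the extraction (the `JasperMotif` directory holds the simulated dataset described below; the remaining entries are the published datasets):

ls ./data/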
This tarball archive contains both simulated and published datasets. The following code was used to generate the simulation dataset:
mkdir -p ./data/JasperMotif
cd ./data/JasperMotif/
python generateSequencesForSimulation.py
cd -
The training takes about 18 days on a server with 1 CPU core, 32 GB of memory, and one NVIDIA 1080 Ti GPU card. The user can either train the models themselves or use the pre-trained results.
- Run the following commands to train the models.
cd ./train/JasperMotifSimulation
python trainAllVConvBasedAndConvolutionBasedNetworksForSimulation.py
python trainAllVConvBasedNetworksWithoutMSLForSimulation.py
cd -
cd ./train/ZengChIPSeqCode
python trainAllVConvBasedAndConvolutionBasedNetworksForZeng.py
cd -
cd ./train/DeepBindChIPSeqCode2015
python trainAllVConvBasedAndConvolutionBasedNetworksForDeepBind.py
cd -
cd ./train/convergenceSpeed
python estimateConvergenceSpeed.py
cd -
- The user needs to switch to the basenji environment before running these steps (a sketch of the switch follows this list).
- After training finishes, the user needs to deactivate the basenji environment before running any other code.
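A minimal sketch of the environment switch (assuming the Basenji environment is a conda environment named `basenji`, as in the prerequisites above):

conda activate basenji
# ...run the three Basset-related training scripts below...
conda deactivate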
cd ./basset/vConv/9layersvConv/
python TrainBasenjiBasset.py
cd -
cd ./basset/vConv/basenjibasset/
python basenji_train.py params_basset.json ../../../data/data_basset/
cd -
cd ./basset/vConv/singlelayervConv/
python TrainBasenjiBasset.py
cd -
- Run the following commands to obtain the pre-trained models.
mkdir -p ./output
wget -P ./output/ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/result.tar.gz
tar -C ./output/ -xzvf ./output/result.tar.gz
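A quick sanity check that both required directories (see the notes below) are in place:

ls ./data/ ./output/result/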
Note that
- Both the original datasets ("./data") and the trained results ("./output/result/") are needed.
- The scripts below must be run in the order displayed.
cd ./output/code
python checkResultsForSimulation.py
python checkResultsForSimulationWithoutMSL.py
python checkResultsForZengCase.py
python checkResultsForDeepBindCase.py
python extractMaskedKernelsFromSimulation.py
python prepareInputForTomtom.py
python useTomtomToCompareWithRealMotifs.py
cd -
Rscript ./vConvFigmain/code/generate_fig_2.R
The generated figure is located at ./vConvFigmain/result/Fig.2/Fig.2.png.
Rscript ./vConvFigmain/code/generate_fig_3.R
The generated figure is located at ./vConvFigmain/result/Fig.3/Fig.3.png.
Rscript ./vConvFigmain/code/generate_supplementary_figure_5.R
The generated figure is located at ./vConvFigmain/result/Supplementary.Fig.5/Supplementary.Fig.5.png.
Rscript ./vConvFigmain/code/generate_supplementary_figure_6.R
The generated figure is located at ./vConvFigmain/result/Supplementary.Fig.6/Supplementary.Fig.6.png.
Rscript ./vConvFigmain/code/generate_supplementary_figure_7.R
The generated figure is located at ./vConvFigmain/result/Supplementary.Fig.7/Supplementary.Fig.7.png.
cd ./output/code
python checkResultComparedZengSearch.py
cd -
The generated figure panels are located at:
- Supp. Fig. 8A:
output/ModelAUC/ChIPSeq/Pic/worseData/DataSize.png
- Supp. Fig. 8B:
output/ModelAUC/ChIPSeq/Pic/worseData/DataSizeWorseCase.png
- Supp. Fig. 8C:
"output/ModelAUC/ChIPSeq/Pic/convolution-based network from Zeng et al., 2016Boxplot.png"
cd ./output/SpeedTest/code
python DrawLoss.py
cd -
The generated figure panels are located at:
- 2 motifs:
output/SpeedTest/Png/2.jpg
- 4 motifs:
output/SpeedTest/Png/4.jpg
- 6 motifs:
output/SpeedTest/Png/6.jpg
- 8 motifs:
output/SpeedTest/Png/8.jpg
- TwoDiffMotif1:
output/SpeedTest/Png/TwoDiff1.jpg
- TwoDiffMotif2:
output/SpeedTest/Png/TwoDiff2.jpg
- TwoDiffMotif3:
output/SpeedTest/Png/TwoDiff3.jpg
- Basset:
output/SpeedTest/Png/basset.jpg
Rscript ./vConvFigmain/code/generate_supplementary_figure_11.R
The generated figure is located at ./vConvFigmain/result/Supplementary.Fig.11/Supplementary.Fig.11.png.
By now this table should have been generated at ./vConvFigmain/supptable23/SuppTable2.csv. Use the script below to regenerate it.
cd ./output/code
python checkResultsForSimulation.py
cd -
By now this table should have been generated at ./vConvFigmain/supptable23/SuppTable3.csv. Use the script below to regenerate it.
cd ./output/code
python checkResultsForSimulationWithoutMSL.py
cd -
mkdir -p ./vConvMotifDiscovery/ChIPSeqPeak/
wget -P ./vConvMotifDiscovery/ChIPSeqPeak/ http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/files.txt
for file in $(cut -f 1 ./vConvMotifDiscovery/ChIPSeqPeak/files.txt | grep narrowPeak.gz)
do
wget -P ./vConvMotifDiscovery/ChIPSeqPeak/ http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/${file}
gunzip ./vConvMotifDiscovery/ChIPSeqPeak/${file}
done
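# Optional sanity check (illustrative, not a required step): the loop above
# should have left one uncompressed .narrowPeak file per matching entry in files.txt
ls ./vConvMotifDiscovery/ChIPSeqPeak/*.narrowPeak | wc -l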
mkdir -p ./data
wget -P ./data/ http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip ./data/hg19.fa.gz
The user can either generate the results themselves or use the pre-computed version. Note that both the data files and the results are needed for reproducing Figure 4.
See Supplementary Fig. 4 (shown below) for a description of each step.
## extract sequences
cd ./vConvMotifDiscovery/code/MLtools
python extractSequences.py
cd -
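# Conceptually, the sequence-extraction step pulls peak sequences out of the
# hg19 FASTA, e.g. via a bedtools call of the following form (an illustrative
# sketch only; 'peaks.narrowPeak' and 'peaks.fa' are placeholder names, and
# the script's actual logic may differ):
#     bedtools getfasta -fi ./data/hg19.fa -bed peaks.narrowPeak -fo peaks.fa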
## generate motifs by CisFinder, DREME, and MEME-ChIP
cd ./vConvMotifDiscovery/code/MLtools
python generateMotifsByCisFinder.py
python generateMotifsByDREME.py
python generateMotifsByMEMEChIP.py
cd -
## generate motifs by vConv-based networks
cd ./vConvMotifDiscovery/code/vConvBased
python generateMotifsByVConvBasedNetworks.py
cd -
cd ./vConvMotifDiscovery/code/CisfinderFile
python convertIntoCisfinderFormat.py
python splitIntoIndividualMotifFiles.py
python scanSequencesWithCisFinder.py
python combineCisFinderResults.py
cd -
cd ./vConvMotifDiscovery/code/Analysis
python computeAccuracy.py
python summarizeResults.py
cd -
wget -P ./vConvMotifDiscovery/output/ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/AUChdf5.tar.gz
tar -C ./vConvMotifDiscovery/output/ -xzvf ./vConvMotifDiscovery/output/AUChdf5.tar.gz
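To locate the pre-computed AUC results after extraction (the tarball name suggests HDF5 files; the exact layout may differ):

find ./vConvMotifDiscovery/output/ -name '*.hdf5' -o -name '*.h5'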
Supplementary Figure 9 is generated together with Figure 4.
Rscript ./vConvFigmain/code/generate_fig_4.R
The generated Figure 4 is located at ./vConvFigmain/result/Fig.4/Fig.4.png.
The generated Supplementary Figure 9 is located at ./vConvFigmain/result/Supplementary.Fig.9/Supplementary.Fig.9.png.
cd theoretical/code/
python runSimulation.py
python trainCNN.py
cd -
cd theoretical/code/
python plotFigures.py
cd -
The generated figure panels are located at:
- Supp. Fig. 12B:
theoretical/Motif/ICSimu/simuMtf_Len-8_totIC-10.png
- Supp. Fig. 12C:
theoretical/Motif/ICSimu/simuMtf_Len-23_totIC-12.png
- Supp. Fig. 12D:
theoretical/figure/simuMtf_Len-8_totIC-10.png
- Supp. Fig. 12E:
theoretical/figure/simuMtf_Len-23_totIC-12.png
- Supp. Fig. 12F:
theoretical/figure/simuMtf_Len-8_totIC-10rank.png
- Supp. Fig. 12G:
theoretical/figure/simuMtf_Len-23_totIC-12rank.png
- Supp. Fig. 12H:
theoretical/figure/simu01.png
- Supp. Fig. 12I:
theoretical/figure/simu02.png