Skip to content

Latest commit

 

History

History
444 lines (319 loc) · 13.5 KB

README.md

File metadata and controls

444 lines (319 loc) · 13.5 KB

Table of contents

vConv paper

This is the repository for reproducing figures and tables in the paper Identifying complex sequence patterns in massive omics data with a variable-convolutional layer in deep neural network.

A Keras-based implementation of vConv is available at https://github.com/gao-lab/vConv.

Prerequisites

Prerequisites for all but training Basset-related models

Basics

  • Imagemagick
  • Python (version 2)
  • R
  • bedtools
  • DREME (version 5.0.1)
  • MEME-ChIP (version 5.0.1)
  • CisFinder

Python packages

  • numpy
  • h5py
  • pandas
  • seaborn
  • scipy
  • keras (version 2.2.4)
  • tensorflow (version 1.3.0)
  • sklearn

Alternatively, if you want to guarantee working versions of each dependency, you can install via a fully pre-specified environment.

conda env create -f corecode/environment_vConv.yml

R packages

  • ggpubr
  • data.table
  • readxl
  • foreach
  • ggseqlogo
  • magick

the vConv layer

rm -fr ./vConv
git clone https://github.com/gao-lab/vConv
cp -r ./vConv/corecode ./

Prerequisites for training Basset-related models

  • Python 3
  • Follow the 'Installation' instruction at here to install Basenji: https://github.com/calico/basenji , with the following modifications:
    • Must use CUDA 10.0
    • Must use tensorflow version 2.3.4
      • We wrote a tensorflow-2.3.4-compatible vConv for Basset in our code (see codes in the section 'Train basset-related model' below). Currently we do not support vConv for this version of tensorflow publicly.

Step 1: Reproduce Figures 2-3, Supplementary Figures 5-10, and Supplementary Tables 2,4 (benchmarking models on motif identification)

1.1 Prepare datasets

Run the following codes to prepare the datasets.

wget -P ./ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/data.tar.gz
tar -C ./ -xzvf ./data.tar.gz

This tarball archive contains both simulated and published datasets. The following code has been used to generate the simulation dataset:

mkdir -p ./data/JasperMotif
cd ./data/JasperMotif/
python generateSequencesForSimulation.py
cd -

1.2 Train and evaluate models needed for reproducing these figures

The training takes about 18 days on a server with 1 CPU cores, 32G memory, and one NVIDIA 1080 Ti GPU card. The user can either train the models by themselves or use the pre-trained results.

1.2.1 Train the models directly

1.2.1.1 Train all but Basset-related models

  • Run the following codes to train the models.
cd ./train/JasperMotifSimulation
python trainAllVConvBasedAndConvolutionBasedNetworksForSimulation.py
python trainAllVConvBasedNetworksWithoutMSLForSimulation.py
cd -
cd ./train/ZengChIPSeqCode
python trainAllVConvBasedAndConvolutionBasedNetworksForZeng.py
cd -
cd ./train/DeepBindChIPSeqCode2015
python trainAllVConvBasedAndConvolutionBasedNetworksForDeepBind.py
cd -
cd ./train/convergenceSpeed
python estimateConvergenceSpeed.py
cd -

1.2.1.2 Train Basset-related models

  • Here the user needs to switch to the basenji environment.
  • After finishing training, the user needs to deactivate the basenji environment to run other codes.
cd ./basset/vConv/9layersvConv/
python TrainBasenjiBasset.py
cd -

cd ./basset/vConv/basenjibasset/
python basenji_train.py params_basset.json ../../../data/data_basset/
cd -

cd ./basset/vConv/singlelayervConv/
python TrainBasenjiBasset.py
cd -

1.2.2 Use the pre-trained results

  • Run the following codes to obtain the pre-trained models.
mkdir -p ./output
wget -P ./output/ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/result.tar.gz
tar -C ./output/ -xzvf ./output/result.tar.gz

1.3 Prepare summary files for reproducing figures and tables from datasets and results above

Note that

  • Both the original datasets ("./data") and the trained results ("./output/result/") are needed.
  • The scripts below must be run in the order displayed.
cd ./output/code
python checkResultsForSimulation.py
python checkResultsForSimulationWithoutMSL.py
python checkResultsForZengCase.py.py
python checkResultsForDeepBindCase.py
python extractMaskedKernelsFromSimulation.py
python prepareInputForTomtom.py
python useTomtomToCompareWithRealMotifs.py
cd -

1.4 Reproduce Figures

1.4.1 Reproduce Figure 2

Rscript ./vConvFigmain/code/generate_fig_2.R

The figure generated is located at ./vConvFigmain/result/Fig.2/Fig.2.png.

1.4.2 Reproduce Figure 3

Rscript ./vConvFigmain/code/generate_fig_3.R

The figure generated is located at ./vConvFigmain/result/Fig.3/Fig.3.png.

1.4.3 Reproduce Supplementary Figure 5

Rscript ./vConvFigmain/code/generate_supplementary_figure_5.R

The figure generated is located at ./vConvFigmain/result/Supplementary.Fig.5/Supplementary.Fig.5.png.

1.4.4 Reproduce Supplementary Figure 6

Rscript ./vConvFigmain/code/generate_supplementary_figure_6.R

The figure generated is located at ./vConvFigmain/result/Supplementary.Fig.6/Supplementary.Fig.6.png.

1.4.5 Reproduce Supplementary Figure 7

Rscript ./vConvFigmain/code/generate_supplementary_figure_7.R

The figure generated is located at ./vConvFigmain/result/Supplementary.Fig.7/Supplementary.Fig.7.png.

1.4.6 Reproduce Supplementary Figure 8

cd ./output/code
python checkResultComparedZengSearch.py
cd -

The figure generated is located at

  • Supp. Fig. 8A: output/ModelAUC/ChIPSeq/Pic/worseData/DataSize.png
  • Supp. Fig. 8B: output/ModelAUC/ChIPSeq/Pic/worseData/DataSizeWorseCase.png
  • Supp. Fig. 8C: "output/ModelAUC/ChIPSeq/Pic/convolution-based network from Zeng et al., 2016Boxplot.png"

1.4.7 Reproduce Supplementary Figure 10

cd ./output/SpeedTest/code
python DrawLoss.py
cd -

The figure generated is located at

  • 2 motifs: output/SpeedTest/Png/2.jpg
  • 4 motifs: output/SpeedTest/Png/4.jpg
  • 6 motifs: output/SpeedTest/Png/6.jpg
  • 8 motifs: output/SpeedTest/Png/8.jpg
  • TwoDiffMotif1: output/SpeedTest/Png/TwoDiff1.jpg
  • TwoDiffMotif2: output/SpeedTest/Png/TwoDiff2.jpg
  • TwoDiffMotif3: output/SpeedTest/Png/TwoDiff3.jpg
  • Basset: output/SpeedTest/Png/basset.jpg

1.4.8 Reproduce Supplementary Figure 11

Rscript ./vConvFigmain/code/generate_supplementary_figure_11.R

The figure generated is located at ./vConvFigmain/result/Supplementary.Fig.11/Supplementary.Fig.11.png.

1.4.9 Reproduce Supplementary Table 2

By now this table should have been generated at ./vConvFigmain/supptable23/SuppTable2.csv. Use the script below to reproduce it again.

cd ./output/code
python checkResultsForSimulation.py
cd -

1.4.10 Reproduce Supplementary Table 3

By now this table should have been generated at ./vConvFigmain/supptable23/SuppTable3.csv. Use the script below to reproduce it again.

cd ./output/code
python checkResultsForSimulationWithoutMSL.py
cd -

Step 2: Reproduce Figure 4 (benchmarking models on motif discovery)

2.1 Prepare datasets needed by Figure 4

mkdir -p ./vConvMotifDiscovery/ChIPSeqPeak/
wget -P ./vConvMotifDiscovery/ChIPSeqPeak/  http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/files.txt
for file in `cut -f 1 ./vConvMotifDiscovery/ChIPSeqPeak/files.txt|grep narrowPeak.gz`
do
    wget -P ./vConvMotifDiscovery/ChIPSeqPeak/  http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/${file}
    gunzip ./vConvMotifDiscovery/ChIPSeqPeak/${file}
done

mkdir -p ./data
wget -P ./data/ http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip ./data/hg19.fa.gz

2.2 Generate results needed by Figure 4

The user could either generate the results by themselves, or use the pre-computed version. Note that both the data files and the results are needed for reproducing Figure 4.

2.2.1 Generate the results

See Suppelementary Fig. 4. (shown below) for description of each step.

SupplementaryFig4

2.2.1.1 step (1)

## extract sequences
cd ./vConvMotifDiscovery/code/MLtools
python extractSequences.py
cd -

## generate motifs by CisFinder, DREME, and MEME-ChIP
cd ./vConvMotifDiscovery/code/MLtools
python generateMotifsByCisFinder.py
python generateMotifsByDREME.py
python generateMotifsByMEMEChIP.py
cd -

## generate motifs by vConv-based
cd ./vConvMotifDiscovery/code/vConvBased
python generateMotifsByVConvBasedNetworks.py
cd -

2.2.1.2 steps (2-3)

cd ./vConvMotifDiscovery/code/CisfinderFile
python convertIntoCisfinderFormat.py
python splitIntoIndividualMotifFiles.py
python scanSequencesWithCisFinder.py
python combineCisFinderResults.py
cd -

2.2.1.3 steps (4-5)

cd ./vConvMotifDiscovery/code/Analysis
python computeAccuracy.py
python summarizeResults.py
cd -

2.2.2 Use the pre-computed version

wget -P ./vConvMotifDiscovery/output/ ftp://ftp.cbi.pku.edu.cn/pub/supplementary_file/VConv/Data/AUChdf5.tar.gz
tar -C ./vConvMotifDiscovery/output/ -xzvf ./vConvMotifDiscovery/output/AUChdf5.tar.gz

2.3 Reproduce Figure 4 (and Supplementary Figure 9)

Supplementary Figure 9 is generated together with Figure 4.

Rscript ./vConvFigmain/code/generate_fig_4.R

Figure 4 generated is located at vConvFigmain/result/Fig.4/Fig.4.png.

Supplementary Figure 9 generated is located at ./vConvFigmain/result/Supplementary.Fig.9/Supplementary.Fig.9.png.

Step 3: Reproduce Supplementary Fig. 12 B-I (theoretical analysis)

3.1 Prepare datasets and results needed by Supp. Fig. 12 B-I

cd theoretical/code/
python runSimulation.py
python trainCNN.py
cd -

3.2 Reproduce Supplementary Fig. 12 B-I

cd theoretical/code/
python plotFigures.py
cd -

The figure generated is located at:

  • Supp. Fig. 12B: theoretical/Motif/ICSimu/simuMtf_Len-8_totIC-10.png
  • Supp. Fig. 12C: theoretical/Motif/ICSimu/simuMtf_Len-23_totIC-12.png
  • Supp. Fig. 12D: theoretical/figure/simuMtf_Len-8_totIC-10.png
  • Supp. Fig. 12E: theoretical/figure/simuMtf_Len-23_totIC-12.png
  • Supp. Fig. 12F: theoretical/figure/simuMtf_Len-8_totIC-10rank.png
  • Supp. Fig. 12G: theoretical/figure/simuMtf_Len-23_totIC-12rank.png
  • Supp. Fig. 12H: theoretical/figure/simu01.png
  • Supp. Fig. 12I: theoretical/figure/simu02.png