This is a pipeline for calculating FRiP ("fraction of reads in peaks") from FASTQ.
This involves mapping and filtering FASTQ to generate bam files (with bowtie2 and samtools), calling peaks (with macs2). FRiP is then calculated from the bam files and bed files. Publicly-available datasets provide FASTQ files, however, bam files are often not provided, yet essential for calculating FRiP.
The fastaFRiP pipeline can be used to process ChIP-seq datasets that are:
- spike-in or not spike-in
- single-end or paired-end
- with input or with no input
This repository additional provides scripts for:
- managing metadata from SRA/GEO (via ffq)
- computing FRiP (via DeepTools)
- Python
git clone https://github.com/Fudenberg-Research-Group/fastaFRiP.git
cd fastaFRiP/frip_sm
conda env create -f env/fastq_frip_env.yml -n fastq_frip_env
conda activate fastq_frip_env
All dependecies mentioned below are include in our conda environment, so you don't need to worry about any further installation :)
FASTQ data is available in the Gene Expression Omnibus (GEO). FASTQ is a common format for storing raw sequencing data generated by next-generation sequencing technologies. In GEO, such raw sequencing data are often included as part of the supplementary files associated with a GEO Series (GSE) record.
By clicking on the 'SRA Run Selector', users can select and download specific data (e.g., based on organism, gene, condition, or experiment type) from the Sequence Read Archive (SRA) page.
To quickly access the accession codes for a particular ChIP-seq dataset, click "Metadata" button on the page to download "SraRunTable.txt". To extract accesion numbers for each file, run:
grep "ChIP" SraRunTable.txt | awk -F, '{print $1}' > accessions.txt
batch_download.sh
, will download FASTQs listed in 'accessions.txt'. This uses fasterq-dump
(from sra-tools
). This script requires 'accession.txt' in the same folder, and can be run as:
./batch_download.sh
To rescale the bigwig file and call peaks based on an input (control) sample, the pipeline requires a metadata table with two columns: ChIP and Input.
ChIP | Input |
---|---|
SRR5085155 | SRR5085156 |
SRR5085157 | SRR5085158 |
Sample names should match those in the FASTQ files (e.g., SRR5085155.fastq). If a sample does not have input, exclude it from the table.
A python script for generating such a table is provided in this repository, generate_chip_input_table.ipynb
.
For many genomes, bowtie2 index files can be obtained from NCBI or the UCSC genome browser. From NCBI, you can choose to use bowtie2 index files directly, or download reference genome for alignment to make your own bowtie2 index files.
Species/File Type | URL Link |
---|---|
hg38/reference genome | Link |
| | mm39/reference genome | Link| | | hg38/bowtie2 index | Link| | | mm39/bowtie2 index | Link|
Most of time, you can use bowtie2 index files directly by running the following command:
tar -xvzf GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz
For spike-in ChIP-seq, you can use the following way to create a bowtie index files that includes two species, here we use hg38 and mm39 as an example:
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001635.9_GRCm39_full_analysis_set.fna.gz
sed -i '1s/^>/>hg38_/' GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
sed -i '1s/^>/>mm39_/' GCA_000001635.9_GRCm39_full_analysis_set.fna
cat GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GCA_000001635.9_GRCm39_full_analysis_set.fna > hg38_mm39.fna
mkdir hg38_mm39
cd hg38_mm39
bowtie2-build ../hg38_mm39.fna hg38_mm39.bowtie_index
Specify the locations of your input files (FASTQ and Bowtie index files) and output files, and choose whether to include spike-in normalization in the configuration file config.yml
. Detailed explanations for each parameter are included in config.yml.
If the experiment includes spike-in, set 'include_spikein' to true, and set 'index_primary' and 'index_spikein' according to the experiment.
Once the configuration file is set up, run the following command in the terminal to generate the required BAM/BED files:
snakemake --use-conda --cores $Ncores --configfile config/config.yml
Ensure that your computing resources are available.
Tips: [Number of cores] = [number of jobs] * [number of process in config.yml]. And, [Number of cores] <= the total number of cpus you have
To create metadata file, modifying config/fetch_metadata_config.yml
, and run
python fetch_metadata.py config/fetch_metadata_config.yml
Example output metadata table:
Organism | Celltype | Condition | Antibody | Peak_protein | Author_Year | SRUN | GSM_Accession | GEO | Experiment |
---|---|---|---|---|---|---|---|---|---|
Homo sapiens | RPE | siNIPBL | H3K4me3 | N/A | Nakato_2023 | SRR18024424 | GSM5899636 | GSE196450 | siNIPBL |
Homo sapiens | RPE | Control | H3K4me3 | N/A | Nakato_2023 | SRR18024425 | GSM5899635 | GSE196450 | Control |
If you have your own bam files, you can also create a metadata table based on the below template. Each row is one sample, if the sample does not have either BAM file or Peak BED file, then just leave the cell blank. (columns listed below are mandatory for pipeline. For convenience, you can also add more columns as above example output metadata table):
Condition | Antibody | BAM | Peak_BED |
---|---|---|---|
condition of the sample | antibody used in ChIP assay | pathway to the BAM file of its ChIP-seq | pathway to the peak BED file by calling peaks on this sample |
If all samples use same Peak_BED file/files, then you can ignore this column and input path to peak bed files in create_frip_table_config.yml
After you generate bam files and bed files with the above command line, you can specify path to the bed file, input data, and output data in the config file config/create_frip_table_config.yml
, and use create_frip_table.py
to calculate FRiP value,
python create_frip_table.py config/create_frip_table_config.yml
Example FRiP table:
FRiP | Organism | Celltype | Treatment | Antibody | Peak_protein | Author_year | SRUN | Peak-SRUN | GEO | Experiment | FRiP enrichment | #Peaks | Total #basepairs in peaks | Total #reads |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.11113138926362592 | Homo sapiens | Hap1 | SCC4KO | CTCF | CTCF | Haarhuis_2017 | SRR5266528 | SRR5266528 | GSE90994 | SCC4KO CTCF ChIPseq | 27.17470160067354 | 37415 | 12677501 | 19977713 |
0.00698026098917694 | Homo sapiens | Hap1 | SCC4KO | IgG | CTCF | Haarhuis_2017 | SRR5266524 | SRR5266528 | GSE90994 | SCC4KO IgG ChIPseq | 1.7068670762832925 | 37415 | 12677501 | 14485275 |