fastaFRiP

Description

This is a pipeline for calculating FRiP ("fraction of reads in peaks") from FASTQ.

This involves mapping and filtering FASTQ to generate bam files (with bowtie2 and samtools), calling peaks (with macs2). FRiP is then calculated from the bam files and bed files. Publicly-available datasets provide FASTQ files, however, bam files are often not provided, yet essential for calculating FRiP.

The fastaFRiP pipeline can be used to process ChIP-seq datasets that are:

spike-in or not spike-in
single-end or paired-end
with input or with no input

This repository additional provides scripts for:

managing metadata from SRA/GEO (via ffq)
computing FRiP (via DeepTools)

Prerequisites

Python

Installation

git clone https://github.com/Fudenberg-Research-Group/fastaFRiP.git
cd fastaFRiP/frip_sm
conda env create -f env/fastq_frip_env.yml -n fastq_frip_env
conda activate fastq_frip_env

All dependecies mentioned below are include in our conda environment, so you don't need to worry about any further installation :)

Preparing input data

Download fastq files from GEO dataset

FASTQ data is available in the Gene Expression Omnibus (GEO). FASTQ is a common format for storing raw sequencing data generated by next-generation sequencing technologies. In GEO, such raw sequencing data are often included as part of the supplementary files associated with a GEO Series (GSE) record.

By clicking on the 'SRA Run Selector', users can select and download specific data (e.g., based on organism, gene, condition, or experiment type) from the Sequence Read Archive (SRA) page.

To quickly access the accession codes for a particular ChIP-seq dataset, click "Metadata" button on the page to download "SraRunTable.txt". To extract accesion numbers for each file, run:

grep "ChIP" SraRunTable.txt | awk -F, '{print $1}' > accessions.txt

batch_download.sh, will download FASTQs listed in 'accessions.txt'. This uses fasterq-dump (from sra-tools). This script requires 'accession.txt' in the same folder, and can be run as:

./batch_download.sh

(Optional) ChIP & Input table for Peak Calling:

To rescale the bigwig file and call peaks based on an input (control) sample, the pipeline requires a metadata table with two columns: ChIP and Input.

ChIP	Input
SRR5085155	SRR5085156
SRR5085157	SRR5085158

Sample names should match those in the FASTQ files (e.g., SRR5085155.fastq). If a sample does not have input, exclude it from the table. A python script for generating such a table is provided in this repository, generate_chip_input_table.ipynb.

Create Bowtie2 index files

For many genomes, bowtie2 index files can be obtained from NCBI or the UCSC genome browser. From NCBI, you can choose to use bowtie2 index files directly, or download reference genome for alignment to make your own bowtie2 index files.

Species/File Type	URL Link
hg38/reference genome	Link

Most of time, you can use bowtie2 index files directly by running the following command:

tar -xvzf GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz

For spike-in ChIP-seq, you can use the following way to create a bowtie index files that includes two species, here we use hg38 and mm39 as an example:

gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001635.9_GRCm39_full_analysis_set.fna.gz

sed -i '1s/^>/>hg38_/' GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
sed -i '1s/^>/>mm39_/' GCA_000001635.9_GRCm39_full_analysis_set.fna

cat GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GCA_000001635.9_GRCm39_full_analysis_set.fna > hg38_mm39.fna
mkdir hg38_mm39
cd hg38_mm39
bowtie2-build ../hg38_mm39.fna hg38_mm39.bowtie_index

The snakefile: Generating bams and BEDs

Modify Configuration file of snakemake

Specify the locations of your input files (FASTQ and Bowtie index files) and output files, and choose whether to include spike-in normalization in the configuration file config.yml. Detailed explanations for each parameter are included in config.yml. If the experiment includes spike-in, set 'include_spikein' to true, and set 'index_primary' and 'index_spikein' according to the experiment.

Generating BAM/BED files

Once the configuration file is set up, run the following command in the terminal to generate the required BAM/BED files:

snakemake --use-conda --cores $Ncores --configfile config/config.yml

Ensure that your computing resources are available.
Tips: [Number of cores] = [number of jobs] * [number of process in config.yml]. And, [Number of cores] <= the total number of cpus you have

Computing FRiPs

Creating metadata table

To create metadata file, modifying config/fetch_metadata_config.yml, and run

python fetch_metadata.py config/fetch_metadata_config.yml

Example output metadata table:

Organism	Celltype	Condition	Antibody	Peak_protein	Author_Year	SRUN	GSM_Accession	GEO	Experiment
Homo sapiens	RPE	siNIPBL	H3K4me3	N/A	Nakato_2023	SRR18024424	GSM5899636	GSE196450	siNIPBL
Homo sapiens	RPE	Control	H3K4me3	N/A	Nakato_2023	SRR18024425	GSM5899635	GSE196450	Control

If you have your own bam files, you can also create a metadata table based on the below template. Each row is one sample, if the sample does not have either BAM file or Peak BED file, then just leave the cell blank. (columns listed below are mandatory for pipeline. For convenience, you can also add more columns as above example output metadata table):

Condition	Antibody	BAM	Peak_BED
condition of the sample	antibody used in ChIP assay	pathway to the BAM file of its ChIP-seq	pathway to the peak BED file by calling peaks on this sample

If all samples use same Peak_BED file/files, then you can ignore this column and input path to peak bed files in create_frip_table_config.yml

Create FRiP table

After you generate bam files and bed files with the above command line, you can specify path to the bed file, input data, and output data in the config file config/create_frip_table_config.yml , and use create_frip_table.py to calculate FRiP value,

python create_frip_table.py config/create_frip_table_config.yml

Example FRiP table:

FRiP	Organism	Celltype	Treatment	Antibody	Peak_protein	Author_year	SRUN	Peak-SRUN	GEO	Experiment	FRiP enrichment	#Peaks	Total #basepairs in peaks	Total #reads
0.11113138926362592	Homo sapiens	Hap1	SCC4KO	CTCF	CTCF	Haarhuis_2017	SRR5266528	SRR5266528	GSE90994	SCC4KO CTCF ChIPseq	27.17470160067354	37415	12677501	19977713
0.00698026098917694	Homo sapiens	Hap1	SCC4KO	IgG	CTCF	Haarhuis_2017	SRR5266524	SRR5266528	GSE90994	SCC4KO IgG ChIPseq	1.7068670762832925	37415	12677501	14485275

Name		Name	Last commit message	Last commit date
Latest commit History 107 Commits
data		data
frip_sm		frip_sm
.gitignore		.gitignore
README.md		README.md
batch_download.sh		batch_download.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

fastaFRiP

Description

Prerequisites

Installation

Preparing input data

Download fastq files from GEO dataset

(Optional) ChIP & Input table for Peak Calling:

Create Bowtie2 index files

The snakefile: Generating bams and BEDs

Modify Configuration file of snakemake

Generating BAM/BED files

Computing FRiPs

Creating metadata table

Create FRiP table

About

Releases

Packages

Contributors 3

Languages

Fudenberg-Research-Group/fastaFRiP

Folders and files

Latest commit

History

Repository files navigation

fastaFRiP

Description

Prerequisites

Installation

Preparing input data

Download fastq files from GEO dataset

(Optional) ChIP & Input table for Peak Calling:

Create Bowtie2 index files

The snakefile: Generating bams and BEDs

Modify Configuration file of snakemake

Generating BAM/BED files

Computing FRiPs

Creating metadata table

Create FRiP table

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages