Skip to content

pipeline for calculating FRiP from fasta (or bed + fasta)

Notifications You must be signed in to change notification settings

Fudenberg-Research-Group/fastaFRiP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fastaFRiP

Description

This is a pipeline for calculating FRiP ("fraction of reads in peaks") from FASTQ.

This involves mapping and filtering FASTQ to generate bam files (with bowtie2 and samtools), calling peaks (with macs2). FRiP is then calculated from the bam files and bed files. Publicly-available datasets provide FASTQ files, however, bam files are often not provided, yet essential for calculating FRiP.

The fastaFRiP pipeline can be used to process ChIP-seq datasets that are:

  • spike-in or not spike-in
  • single-end or paired-end
  • with input or with no input

This repository additional provides scripts for:

  • managing metadata from SRA/GEO (via ffq)
  • computing FRiP (via DeepTools)

Prerequisites

  • Python

Installation

git clone https://github.com/Fudenberg-Research-Group/fastaFRiP.git
cd fastaFRiP/frip_sm
conda env create -f env/fastq_frip_env.yml -n fastq_frip_env
conda activate fastq_frip_env

All dependecies mentioned below are include in our conda environment, so you don't need to worry about any further installation :)

Preparing input data

Download fastq files from GEO dataset

FASTQ data is available in the Gene Expression Omnibus (GEO). FASTQ is a common format for storing raw sequencing data generated by next-generation sequencing technologies. In GEO, such raw sequencing data are often included as part of the supplementary files associated with a GEO Series (GSE) record.

By clicking on the 'SRA Run Selector', users can select and download specific data (e.g., based on organism, gene, condition, or experiment type) from the Sequence Read Archive (SRA) page.

To quickly access the accession codes for a particular ChIP-seq dataset, click "Metadata" button on the page to download "SraRunTable.txt". To extract accesion numbers for each file, run:

grep "ChIP" SraRunTable.txt | awk -F, '{print $1}' > accessions.txt

batch_download.sh, will download FASTQs listed in 'accessions.txt'. This uses fasterq-dump (from sra-tools). This script requires 'accession.txt' in the same folder, and can be run as:

./batch_download.sh

(Optional) ChIP & Input table for Peak Calling:

To rescale the bigwig file and call peaks based on an input (control) sample, the pipeline requires a metadata table with two columns: ChIP and Input.

ChIP Input
SRR5085155 SRR5085156
SRR5085157 SRR5085158

Sample names should match those in the FASTQ files (e.g., SRR5085155.fastq). If a sample does not have input, exclude it from the table. A python script for generating such a table is provided in this repository, generate_chip_input_table.ipynb.

Create Bowtie2 index files

For many genomes, bowtie2 index files can be obtained from NCBI or the UCSC genome browser. From NCBI, you can choose to use bowtie2 index files directly, or download reference genome for alignment to make your own bowtie2 index files.

Species/File Type URL Link
hg38/reference genome Link

| | mm39/reference genome | Link| | | hg38/bowtie2 index | Link| | | mm39/bowtie2 index | Link|

Most of time, you can use bowtie2 index files directly by running the following command:

tar -xvzf GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz

For spike-in ChIP-seq, you can use the following way to create a bowtie index files that includes two species, here we use hg38 and mm39 as an example:

gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001635.9_GRCm39_full_analysis_set.fna.gz

sed -i '1s/^>/>hg38_/' GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
sed -i '1s/^>/>mm39_/' GCA_000001635.9_GRCm39_full_analysis_set.fna

cat GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GCA_000001635.9_GRCm39_full_analysis_set.fna > hg38_mm39.fna
mkdir hg38_mm39
cd hg38_mm39
bowtie2-build ../hg38_mm39.fna hg38_mm39.bowtie_index

The snakefile: Generating bams and BEDs

Modify Configuration file of snakemake

Specify the locations of your input files (FASTQ and Bowtie index files) and output files, and choose whether to include spike-in normalization in the configuration file config.yml. Detailed explanations for each parameter are included in config.yml. If the experiment includes spike-in, set 'include_spikein' to true, and set 'index_primary' and 'index_spikein' according to the experiment.

Generating BAM/BED files

Once the configuration file is set up, run the following command in the terminal to generate the required BAM/BED files:

snakemake --use-conda --cores $Ncores --configfile config/config.yml

Ensure that your computing resources are available.
Tips: [Number of cores] = [number of jobs] * [number of process in config.yml]. And, [Number of cores] <= the total number of cpus you have

Computing FRiPs

Creating metadata table

To create metadata file, modifying config/fetch_metadata_config.yml, and run

python fetch_metadata.py config/fetch_metadata_config.yml

Example output metadata table:

Organism Celltype Condition Antibody Peak_protein Author_Year SRUN GSM_Accession GEO Experiment
Homo sapiens RPE siNIPBL H3K4me3 N/A Nakato_2023 SRR18024424 GSM5899636 GSE196450 siNIPBL
Homo sapiens RPE Control H3K4me3 N/A Nakato_2023 SRR18024425 GSM5899635 GSE196450 Control

If you have your own bam files, you can also create a metadata table based on the below template. Each row is one sample, if the sample does not have either BAM file or Peak BED file, then just leave the cell blank. (columns listed below are mandatory for pipeline. For convenience, you can also add more columns as above example output metadata table):

Condition Antibody BAM Peak_BED
condition of the sample antibody used in ChIP assay pathway to the BAM file of its ChIP-seq pathway to the peak BED file by calling peaks on this sample

If all samples use same Peak_BED file/files, then you can ignore this column and input path to peak bed files in create_frip_table_config.yml

Create FRiP table

After you generate bam files and bed files with the above command line, you can specify path to the bed file, input data, and output data in the config file config/create_frip_table_config.yml , and use create_frip_table.py to calculate FRiP value,

python create_frip_table.py config/create_frip_table_config.yml

Example FRiP table:

FRiP Organism Celltype Treatment Antibody Peak_protein Author_year SRUN Peak-SRUN GEO Experiment FRiP enrichment #Peaks Total #basepairs in peaks Total #reads
0.11113138926362592 Homo sapiens Hap1 SCC4KO CTCF CTCF Haarhuis_2017 SRR5266528 SRR5266528 GSE90994 SCC4KO CTCF ChIPseq 27.17470160067354 37415 12677501 19977713
0.00698026098917694 Homo sapiens Hap1 SCC4KO IgG CTCF Haarhuis_2017 SRR5266524 SRR5266528 GSE90994 SCC4KO IgG ChIPseq 1.7068670762832925 37415 12677501 14485275

About

pipeline for calculating FRiP from fasta (or bed + fasta)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •