
Snakemake for ATACseq

A snakemake-based workflow for ATAC-seq data analysis and report

Author

Zhuofei Xu


Background

The workflow is an all-in-one analysis and reporting system. Users only need to place the sample data files in the pre-defined file paths and edit some basic metadata (e.g. sample names, group names) in the configuration file (config.yaml). The workflow has the following advantages:

  • Automation: no need to manually execute the command lines of individual analysis steps. An analysis may involve dozens of steps, and running these commands one by one is tedious and error-prone; the only prerequisite is that the developer is familiar with the input and output files of each task.
  • Reproducibility
  • Portability: the workflow can be installed and configured on different machines using standard operations, so users can run the analysis on their own hardware.
  • Extensibility: developers can add further analysis modules to the toolkit as the software is updated and upgraded.

The workflow performs the following analysis tasks:

  • quality assessment of the raw FASTQ reads for each sample_r?.fq.gz file
  • quality trimming of the raw reads
  • mapping the reads of each sample to the reference genome
  • removing duplicate reads from the BAM files
  • filtering low-quality reads from the BAM files
  • peak calling
  • peak annotation
  • enrichment analysis of the genes adjacent to the peaks

Computer requirements

Recommended requirements:

  • CPUs = 64
  • Memory = 256GB

Minimum requirements:

  • CPUs = 4
  • Memory = 16GB

Approximate run time for the example dataset provided in this repository, using the minimum requirements: about 1.5 h for 4 samples (1 M reads per sample).

Usage

Running the commands below produces an HTML analysis report from the raw data of all samples.

$ conda activate atacenv1
$ snakemake -c 2 -p -s workflow/Snakefile
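
Before launching a full run, you can preview the jobs Snakemake plans to execute with a dry run (no files are created, the planned shell commands are only printed):

$ snakemake -n -p -s workflow/Snakefile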

The configuration files and raw data used for a test run of the workflow are deposited in this repository, together with the final report and some intermediate processing files, so that the results can be reproduced.

The resulting report file is ATACseq_report.html under the directory 'result'.

The overall framework of the workflow is shown below:

[ATAC-seq workflow diagram]

Some preparation prior to running the workflow

$ mamba create -n atacenv1
$ mamba activate atacenv1
$ mamba install snakemake
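
The rules below activate two conda environments: atacenv1 (snakemake, fastqc) and atacenv2 (trim_galore, bowtie2, samtools, macs2). A minimal sketch for setting them up from Bioconda; the exact package list and versions are assumptions and should be adapted to your setup:

$ mamba install -n atacenv1 -c bioconda fastqc
$ mamba create -n atacenv2 -c conda-forge -c bioconda trim-galore bowtie2 samtools macs2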

Create directory structure for a project

$ mkdir {config,data,result,workflow}
$ mkdir workflow/{rules,envs,scripts}
$ tree
.
├── config
├── data
├── result
└── workflow
    ├── envs
    ├── rules
    └── scripts

The paths and file names of the raw samples are written into the file config/config.yaml. Other sample metadata and the group assignment information are also configured in this file, as well as some preprocessed files required by the tools, such as the genome index needed by bowtie2, bwa, or STAR.

# the paths below need to be replaced with the corresponding paths on your server
workdir: ./result
PE: true
sample:
  liver_rep1:
    r1: /path/to/sample/files/ENCFF288CVJ.fastq.gz
    r2: /path/to/sample/files/ENCFF888ZZV.fastq.gz
  liver_rep2:                
    r1: /path/to/sample/files/ENCFF883SEZ.fastq.gz
    r2: /path/to/sample/files/ENCFF035OMK.fastq.gz
  heart_rep1:                
    r1: /path/to/sample/files/ENCFF279LMU.fastq.gz
    r2: /path/to/sample/files/ENCFF820PVO.fastq.gz
  heart_rep2:                
    r1: /path/to/sample/files/ENCFF823XXU.fastq.gz
    r2: /path/to/sample/files/ENCFF518FYP.fastq.gz
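
The example file names above are ENCODE accessions; if you want to reproduce the test run, the raw FASTQ files can typically be fetched from the ENCODE portal into a local directory (the download URL pattern is an assumption to verify for your files), e.g.:

$ wget -P data/ https://www.encodeproject.org/files/ENCFF288CVJ/@@download/ENCFF288CVJ.fastq.gz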


## config/config.yaml

bwt2_idx: /path-to-the-btw2-genome-index/mm10
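
If the bowtie2 index does not exist yet, it can be built once from the reference genome FASTA; a minimal sketch, assuming an mm10 FASTA file is available locally (the FASTA path is a placeholder):

$ mamba activate atacenv2
$ bowtie2-build --threads 8 /path/to/mm10.fa /path-to-the-btw2-genome-index/mm10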

## config/config.yaml

genome: mm


## config/config.yaml

OrgDb: org.Mm.eg.db
txdb: TxDb.Mmusculus.UCSC.mm10.knownGene
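
The OrgDb and TxDb packages named here have to be installed in the environment that runs snakemake itself (atacenv1 in the usage above), because rules using the script: directive execute in that environment unless a rule-level conda environment is declared. One way to install them, assuming a conda-managed setup (BiocManager::install() from within R works as well):

$ mamba install -n atacenv1 -c bioconda bioconductor-org.mm.eg.db bioconductor-txdb.mmusculus.ucsc.mm10.knowngene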

## config/config.yaml
diffGroup:
    lh:
      control:
        - liver_rep1
        - liver_rep2
      case:
        - heart_rep1
        - heart_rep2

Create a file named Snakefile.

$ touch workflow/Snakefile

Write all the command-line operations required by the workflow into the file Snakefile. The snakemake program will execute the rules defined in this file.

## workflow/Snakefile

# edit a configuration file in yaml or json format
configfile: "config/config.yaml"
# set the current directory as the working directory
workdir: config["workdir"]

## get the sample names from the config.yaml file
SAMPLES = config["sample"].keys()

## set the read ends for paired-end or single-end data according to config.yaml
## if "PE: true" is set in config.yaml, two files (R1 and R2) are expected per sample
if config["PE"]:
    ENDS = ["r1", "r2"]
else:
    ENDS = ["se"]

## get the group names for the differential comparison analysis
DIFFGROUPS = list(config["diffGroup"].keys())


raw_fq_qc_zips = expand("02fqc/raw/{sample}_{end}_fastqc.zip", sample=SAMPLES, end=ENDS)
## if trimming with trim_galore is the last step requested, the files generated by trim_galore must be added to rule all
clean_fq_files = expand("01seq/clean/{sample}_r1_val_1.fq.gz", sample=SAMPLES)
align_bams = expand("03align/{sample}.bam", sample=SAMPLES)
filter_bams = expand("03align/{sample}.filter.bam", sample=SAMPLES)

peak_anno = expand("05peakanno/{sample}_peak_anno.csv", sample=SAMPLES)
diff_peak_result = expand("06diffpeak/{dgroup}_result.csv", dgroup = DIFFGROUPS)



rule all:
    input:
        raw_fq_qc_zips,
        #clean_fq_files,
        #align_bams,
        #filter_bams,
        peak_anno,
        diff_peak_result,
        "ATACseq_report.html"

# for each sample sub-block of the 'sample' section in config.yaml, return the value of the 'r1' or 'r2' entry, i.e. the path of the corresponding fastq file
def get_fq(wildcards):
    return config["sample"][wildcards.sample][wildcards.end]


rule rename:
    input:
        get_fq
    output:
        "01seq/raw/{sample}_{end}.fq.gz"
    shell:
        "ln -s {input} {output}"


## workflow/rules/fastqc.smk

rule raw_fq:
    input: 
        raw = rules.rename.output,
    output:
        "02fqc/raw/{sample}_{end}_fastqc.zip",  
    threads: 1        
    log:
        "logs/fastqc/raw/{sample}_{end}.log",
    shell:
        """
        bash -c '
        . $HOME/.bashrc # if not loaded automatically
        conda activate atacenv1
        fastqc -o 02fqc/raw -f fastq -t {threads} --noextract {input} 2> {log}
        conda deactivate'
        """

## workflow/rules/trimming.smk 

if config["PE"]:
    rule trim:
        input:
            "01seq/raw/{sample}_r1.fq.gz",
            "01seq/raw/{sample}_r2.fq.gz"
        output:
            "01seq/clean/{sample}_r1_val_1.fq.gz",
            "01seq/clean/{sample}_r2_val_2.fq.gz"   
        threads: 2 
        log:
            "logs/trim/{sample}.log"          
        shell:
            """
            bash -c '
            . $HOME/.bashrc # if not loaded automatically
            conda activate atacenv2
            trim_galore -j {threads} -q 25 --phred33 --length 36 --stringency 3 -o 01seq/clean --paired {input} 2> {log}
            conda deactivate'
            """
else:
    rule trim:
        input:
            "01seq/raw/{sample}_se.fq.gz"
        output:
            "01seq/clean/{sample}_se_trimmed.fq.gz"             
        threads: 2 
        log:
            "logs/trim/{sample}.log"          
        shell:
            """
            bash -c '
            . $HOME/.bashrc # if not loaded automatically
            conda activate atacenv2
            trim_galore -j {threads} -q 25 --phred33 --length 36 --stringency 3 -o 01seq/clean {input} 2> {log}
            conda deactivate'
            """

## workflow/rules/bowtie2.smk 


if config["PE"]:
    rule btw2_map:
        input:
            rules.trim.output
        output:
            "03align/{sample}.bam" 
        threads: 2 
        log:
            "logs/bowtie2/{sample}.log"
        params:
            idx = config['bwt2_idx']
        shell:
            """
            bash -c '
            . $HOME/.bashrc # if not loaded automatically
            conda activate atacenv2
            bowtie2 -X2000 --mm -x {params.idx} --no-unal --seed 100 -p {threads} -1 {input[0]} -2 {input[1]} 2> {log} | samtools sort -O bam -o {output}
            conda deactivate'
            """
            
else:
    rule btw2_map:
        input:
            rules.trim.output
        output:
            "03align/{sample}.bam" 
        threads: 2 
        log:
            "logs/bowtie2/{sample}.log"
        params:
            idx = config['bwt2_idx']
        shell:
            """
            bash -c '
            . $HOME/.bashrc # if not loaded automatically
            conda activate atacenv2
            bowtie2 -X2000 --mm -x {params.idx} --no-unal --seed 100 -p {threads} -U {input} 2> {log} | samtools sort -O bam -o {output} && samtools index -@ {threads} {output}
            conda deactivate'
            """
            
## workflow/rules/bamfilter.smk 

## remove duplicate reads
rule rmdup:
    input:
        rules.btw2_map.output
    output:
        "03align/{sample}.rmdup.bam"
    log:
        "logs/bamfilter/{sample}.rmdup.log"
    params:
        mode = "" if config["PE"] else "-s"
    shell:
            """
            bash -c '
            . $HOME/.bashrc # if not loaded automatically
            conda activate atacenv2
            samtools rmdup {params.mode} --output-fmt BAM {input} {output} 2> {log}
            conda deactivate'
            """

rule filter_bam:
    input:
        rules.rmdup.output
    output:
        "03align/{sample}.filter.bam"
    threads: 2
    log: 
        "logs/bamfilter/{sample}.filter.log"
    params:
        mode = "" if config["PE"] else "-s"
    shell:
            """
            bash -c '
            . $HOME/.bashrc # if not loaded automatically
            conda activate atacenv2
            ( samtools view -@ {threads} -q 20 -h {input} | grep -v -P "\\tchrM\\t" | samtools view -Sb -o {output} - ) 2> {log}
            conda deactivate'
            """

## workflow/rules/callpeak.smk 

rule macs2_callpeak:
    input:
        rules.filter_bam.output
    output:
        "04callpeak/{sample}_peaks.narrowPeak"
    params:
        np = "04callpeak/{sample}", 
        fp = "BAMPE" if config["PE"] else "",
        gp = config["genome"]
    log:
        "logs/callpeak/{sample}.log"
    shell:
        """
            bash -c '
            . $HOME/.bashrc # if not loaded automatically
            conda activate atacenv2
            macs2 callpeak -t {input[0]} -g {params.gp} {params.fp} -n {params.np} -B -q 0.05 2> {log}
            conda deactivate'
        """



## workflow/rules/peak_anno.smk 

rule peak_annotate:
    input:
        "04callpeak/{sample}_peaks.narrowPeak"
    output:
        "05peakanno/{sample}_peak_anno.csv",
    params:
        outdir = "05peakanno"
    script:
        "scripts/peak_annotate.R"


def get_diff_control_sn(wildcards):
    return config["diffGroup"][wildcards.dgroup]["control"]

def get_diff_case_sn(wildcards):
    return config["diffGroup"][wildcards.dgroup]["case"]


def get_diff_control_bam(wildcards):
    sn = get_diff_control_sn(wildcards)
    bams = [f"03align/{i}.filter.bam" for i in  sn]
    return bams

def get_diff_case_bam(wildcards):
    sn = get_diff_case_sn(wildcards)
    bams = [f"03align/{i}.filter.bam" for i in  sn]
    return bams

def get_diff_control_peak(wildcards):
    sn = get_diff_control_sn(wildcards)
    peaks = [f"04callpeak/{i}_peaks.narrowPeak" for i in  sn]
    return peaks

def get_diff_case_peak(wildcards):
    sn = get_diff_case_sn(wildcards)
    peaks = [f"04callpeak/{i}_peaks.narrowPeak" for i in  sn]
    return peaks


rule peak_diff:
    input:
        control_bam = get_diff_control_bam,
        case_bam = get_diff_case_bam,
        control_peak = get_diff_control_peak,
        case_peak = get_diff_case_peak
    output:
        "06diffpeak/{dgroup}_result.csv",
    params:
        control_sn = get_diff_control_sn,
        case_sn = get_diff_case_sn      
    script:
        "scripts/peak_diff.R"


### workflow/rules/make_report.smk
rule make_report:
    input:
        peak_anno,
        diff_peak_result
    output:
        "ATACseq_report.html",
    params:
        samples = SAMPLES,
        diffgroups = DIFFGROUPS,
        peak_dir = "05peakanno",
        diffpeak_dir = "06diffpeak"
    script:
        "scripts/make_report.Rmd"

Citation

If you use the workflow in this repository in a publication, please cite both this software and the corresponding tools that you used.
