Simulation pipeline for ecDNA structures.
The pipeline generates:
- an ecDNA template based on a BED file definition
- reads based on the ecDNA template
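As a rough illustration of what a BED definition could look like, the sketch below writes a three-segment file and sanity-checks it. It assumes the plain three-column BED layout (chromosome, start, end). The columns this pipeline actually expects are not shown here, and the coordinates and file name are made up.

```shell
# Hypothetical three-segment template; coordinates are invented for illustration.
tmpdir=$(mktemp -d)
bed="$tmpdir/example.bed"
printf 'chr1\t1000\t5000\nchr2\t200\t1200\nchr8\t30000\t45000\n' > "$bed"

# Check that each line has chrom, start, end with start < end,
# then report the summed segment length.
template_length=$(awk -F'\t' '
    NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || ($2 + 0) >= ($3 + 0) { bad = 1; exit }
    { total += $3 - $2 }
    END { if (bad) exit 1; print total }
' "$bed")
echo "template length: $template_length bp"
```

A check like this can catch swapped or non-numeric coordinates before a long simulation run starts.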
The following steps are then performed on the raw data:
- assembly
- evaluation of the assembly
- mapping, SV calling, and CNV calling
If you clone this repo, the genome index and dictionary are already included. If you use a different genome or change the location of the genome, run the initialization rule first:
snakemake --use-conda -s rules/genome_index.smk
snakemake --cores 8 \
--use-conda \
--configfile configs/config.yaml \
--config input=data/raw/AnBC.bed name=AnBC outputdir=data/process/AnBC
For a single ecDNA template:
#SBATCH --job-name=snakemake_main_job
#SBATCH --ntasks=32
#SBATCH --nodes=1
#SBATCH --time=14-00:00:00
#SBATCH --mem-per-cpu=50G
#SBATCH --output=slurm_logs/%x-%j.log
mkdir -p slurm_logs
export SBATCH_DEFAULTS=" --output=slurm_logs/%x-%j.log"
snakemake --cores 32 \
--use-conda \
--configfile configs/config.yaml \
--config input=data/raw/AnBC.bed name=AnBC outputdir=data/process/AnBC runmode=simulate-mapping-sv
For multiple templates at once:
#SBATCH --job-name=snakemake_main_job
#SBATCH --ntasks=32
#SBATCH --nodes=1
#SBATCH --time=14-00:00:00
#SBATCH --mem-per-cpu=50G
#SBATCH --output=slurm_logs/%x-%j.log
mkdir -p slurm_logs
export SBATCH_DEFAULTS=" --output=slurm_logs/%x-%j.log"
snakemake --cores 32 \
--use-conda \
--configfile configs/config.yaml \
--config batch=<dir_batch> outputdir=data/process/<batch_number> runmode=simulate-mapping-sv
The argument batch=<dir_batch> runs the pipeline for multiple templates at once; the templates are located under <dir_batch>.
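Conceptually, a batch run resembles invoking the single-template command once per template. The sketch below only prints such commands and does not run them. It assumes that templates are *.bed files directly under the batch directory, that the name is the file name without its extension, and that each template gets its own output subdirectory. How the pipeline itself discovers and names templates is not shown here.

```shell
# Sketch only: print one single-template command per BED file in a batch directory.
# Assumptions: templates are *.bed files; name = file name without the extension.
batch_dir=$(mktemp -d)          # stand-in for <dir_batch>
touch "$batch_dir/AnBC.bed" "$batch_dir/AB.bed"   # hypothetical templates

commands=$(
    for bed in "$batch_dir"/*.bed; do
        name=$(basename "$bed" .bed)
        echo "snakemake --cores 32 --use-conda --configfile configs/config.yaml --config input=$bed name=$name outputdir=data/process/$name runmode=simulate-mapping-sv"
    done
)
printf '%s\n' "$commands"
```

Printing the commands first makes it easy to check which templates would be picked up before committing cluster time.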
Plot the pipeline DAG:
snakemake --dag | dot -Tpdf > dag.pdf
snakemake --rulegraph | dot -Tpdf > dag_simplified.pdf
conda install quast
# GRIDSS (needed for structural variant detection)
# install gene annotation
Precompute the high-frequency k-mers for the different assemblies.
meryl count k=15 output merylDB ref.fa
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt
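To make the idea of high-frequency k-mers concrete, here is a toy sketch that counts k-mers in a short string with awk and keeps those above a cutoff. It is not a substitute for meryl: the sequence, k=4, and the fixed count cutoff are illustrative assumptions, and meryl's distinct= option selects the threshold differently.

```shell
# Toy illustration: count k-mers in a short sequence and keep the frequent ones.
seq="ACGTACGTACGTTTGA"
k=4
min_count=2   # hypothetical fixed cutoff, not meryl's distinct= criterion

frequent=$(printf '%s\n' "$seq" | awk -v k="$k" -v min="$min_count" '
    # Slide a window of width k over the sequence and tally each k-mer.
    { for (i = 1; i <= length($0) - k + 1; i++) count[substr($0, i, k)]++ }
    # Keep only k-mers seen strictly more than min times.
    END { for (kmer in count) if (count[kmer] > min) print kmer "\t" count[kmer] }
' | sort)
printf '%s\n' "$frequent"
```

In this string only ACGT occurs more than twice, so it is the single k-mer reported.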