Epigenomics and genetics: variation (EGG:V) integration pipeline.

This is one of five ETL sub-pipelines designed to integrate and model large-scale, heterogeneous epigenomics datasets in GeneWeaver (GW). EGG:V processes and integrates variant metadata and annotations for use in GW. The H. sapiens and M. musculus genome builds, hg38 and mm10 respectively, are currently supported.
Usage: eggv [OPTIONS] COMMAND [ARGS]...

Options:
  -c, --config PATH          configuration file
  -f, --force                force data retrieval and overwrite local copies
                             if they exist
  -s, --species [hg38|mm10]  run the pipeline for the given species
  --version                  Show the version and exit.
  --help                     Show this message and exit.

Commands:
  annotate  Annotate intragenic variants to their corresponding genes.
  complete  Run the complete variant processing pipeline.
  process   Parse and process gene and variant builds from Ensembl.
  retrieve  Retrieve gene and variant builds from Ensembl.
The complete variation ETL pipeline can be run by specifying the genome build and the complete subcommand:
$ eggv complete -s hg38
The complete pipeline is composed of a number of steps, each of which can be run individually. If steps are run independently, they should be run in the order retrieve → process → annotate.
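For example, the full hg38 pipeline can be run step by step using the three subcommands described in the remainder of this section:

    $ eggv retrieve -s hg38
    $ eggv process -s hg38
    $ eggv annotate -s hg38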
The retrieve step of the pipeline retrieves genomic variant and gene builds from Ensembl and the Ensembl variation database. Variants downloaded from Ensembl are stored in the Genome Variation Format (GVF). Run this step using the retrieve subcommand:
$ eggv retrieve -s hg38
The process step parses and formats genomic variants for later use. Variant effects are isolated and stored separately from other metadata (e.g. dbSNP ID, chromosome, alleles). Run this step using the process subcommand:
$ eggv process -s hg38
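The idea behind this separation can be illustrated with a short Python sketch that splits a GVF-style record into its positional/identifier metadata and its Variant_effect entries. This is not the pipeline's actual implementation; the function name, record, and attribute values below are illustrative, though the field layout follows the GFF3-style GVF conventions.

    def parse_gvf_line(line):
        """Split a tab-delimited, GVF-style record into metadata and effects."""
        (seqid, source, vtype, start, end,
         score, strand, phase, attrs) = line.rstrip('\n').split('\t')

        # The ninth column holds semicolon-delimited key=value attributes.
        attributes = dict(field.split('=', 1) for field in attrs.split(';') if field)

        # Positional and identifier metadata are kept separately from effects.
        metadata = {
            'chromosome': seqid,
            'start': int(start),
            'end': int(end),
            'rsid': attributes.get('Dbxref', ''),
            'ref_allele': attributes.get('Reference_seq', ''),
            'var_allele': attributes.get('Variant_seq', ''),
        }

        effects = attributes.get('Variant_effect', '')
        return metadata, effects.split(',') if effects else []

    # Illustrative record; the values are made up, not taken from an Ensembl release.
    record = ('1\tdbSNP\tSNV\t10177\t10177\t.\t+\t.\t'
              'ID=1;Dbxref=dbSNP:rs123456;Reference_seq=A;Variant_seq=C;'
              'Variant_effect=missense_variant 0 mRNA ENST00000000001')
    meta, effects = parse_gvf_line(record)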
The final annotate step identifies variants as either intergenic or intragenic. Intragenic variants are mapped to their respective genes using previously retrieved Ensembl gene builds. Run this step using the annotate subcommand:
$ eggv annotate -s hg38
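Conceptually, annotation reduces to an interval-overlap test between variant positions and gene coordinates from the Ensembl gene build. The Python sketch below shows the idea only; it is not the pipeline's implementation, and the gene interval and variants are illustrative.

    def annotate_variants(variants, genes):
        """Classify variants as intragenic or intergenic by gene overlap."""
        # genes: {chromosome: [(start, end, gene_id), ...]}
        intragenic, intergenic = [], []
        for chrom, pos, rsid in variants:
            hits = [gid for start, end, gid in genes.get(chrom, [])
                    if start <= pos <= end]
            if hits:
                intragenic.append((rsid, hits))   # mapped to overlapping genes
            else:
                intergenic.append(rsid)
        return intragenic, intergenic

    # Illustrative inputs: one gene interval and two variants.
    genes = {'1': [(11869, 14409, 'ENSG00000223972')]}
    variants = [('1', 12010, 'rs111'), ('1', 20000, 'rs222')]
    intra, inter = annotate_variants(variants, genes)
    # intra -> [('rs111', ['ENSG00000223972'])]; inter -> ['rs222']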
The pipeline can be configured using a YAML-based configuration file. The complete file, pipeline defaults, and option explanations are listed below.
resources:
    environment:
        hpc: true
        local: false
        custom: false
    cores: 4
    processes: 4
    jobs: 15
    memory: '40GB'
    walltime: '05:00:00'
    interface: 'ib0'
directories:
    data: 'data/'
    temp: ~
scheduler: ~
workers: ~
overwrite: true
species: ~
- resources.environment.hpc: boolean. If true, the pipeline will initialize a cluster on an HPC system running PBS/Torque.
- resources.environment.local: boolean. If true, the pipeline will initialize a local, single machine cluster.
- resources.environment.custom: boolean. If true, the pipeline will initialize a cluster for custom environments. This option requires the scheduler option to be set.
- resources.cores: integer. The number of CPU cores available to each cluster worker process. This option only has an effect if running an HPC cluster.
- resources.processes: integer. The number of worker processes to use. If running an HPC cluster, the number of cores will be divided by the number of worker processes: if cores = 4 and processes = 2, two worker processes will spawn utilizing 2 cores (threads) each; if cores = 4 and processes = 4, four worker processes will spawn utilizing 1 core each.
- resources.jobs: integer. The number of worker nodes to use. This option only has an effect if running an HPC cluster.
- resources.memory: string. Worker process memory limits. If using a memory limit of 40GB with 4 worker processes, each worker has a limit of 10GB. This option only has an effect if running an HPC cluster.
- resources.walltime: string. Worker node time limits. This option only has an effect if running an HPC cluster.
- resources.interface: string. Network interface to use for worker-worker and worker-scheduler communication. 'ib0' is Infiniband, 'eth0' is ethernet, etc. Use ip addr to identify the proper interface to use. This option only has an effect if running an HPC cluster.
- directories.data: string. The base directory path to store raw and processed datasets.
- directories.temp: string. The temp directory. If left blank the pipeline will automatically use system defaults.
- scheduler: string. The scheduler node address.
- workers: list. A list of worker node addresses.
- overwrite: boolean. Force data retrieval and overwrite local copies even if they already exist.
- species: string. The genome build to run the pipeline on.
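As an example, a custom configuration might override only a few of the defaults above for a local, single machine run and be passed to the pipeline with the -c option. The file name and values below are hypothetical, and it is assumed that unspecified options fall back to the defaults listed earlier:

    # my-config.yml (hypothetical)
    resources:
        environment:
            hpc: false
            local: true
            custom: false
        processes: 8
    directories:
        data: '/scratch/eggv/data/'
        temp: '/scratch/eggv/tmp/'
    overwrite: false

    $ eggv -c my-config.yml complete -s mm10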
The current release is v1.2.0.
Install via pip:
$ pip install https://github.com/treynr/eggv/archive/v1.2.0.tar.gz
Or clone this repo and install via setup.py:
$ git clone https://github.com/treynr/eggv.git
$ cd eggv
$ python setup.py install
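After installation, the installed release can be verified using the --version option shown in the usage output above:

    $ eggv --version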
The EGG:V pipeline has some hefty storage and memory requirements.
To be safe, at least 500GB of disk space should be available if both hg38 and mm10 builds will be processed. The sizes below are for Ensembl v95.
249G ./hg38/raw
106G ./hg38/effects
27G ./hg38/meta
5.8G ./hg38/annotated/intergenic
49G ./hg38/annotated/intragenic
54G ./hg38/annotated
436G ./hg38
23G ./mm10/raw
21G ./mm10/effects
6.6G ./mm10/meta
2.0G ./mm10/annotated/intergenic
4.5G ./mm10/annotated/intragenic
6.5G ./mm10/annotated
56G ./mm10
492G ./
The lowest amount of total available memory this pipeline has been tested with is 450GB. Since processing is done entirely in memory, systems with less than roughly 400GB of total memory might not be able to run the complete pipeline.
Use as many CPU cores as you possibly can. Seriously.
See requirements.txt for a complete list of required Python packages.
The major requirements are: