Annotating input files with VEP

This directory contains tools and utilities for running Ensembl's Variant Effect Predictor (VEP) on the VCF files that serve as input to Variant Transforms.

Overview

With tools provided in this directory, one can:

Create a docker image of VEP.
Download and package VEP's database (a.k.a. cache) for different species, reference sequences and versions of VEP.
Run VEP on VCF input files and create output VCF files that are annotated.

Note that this is a useful standalone tool for running VEP in the cloud, but its main goal is to run VEP as a preprocessor through Variant Transforms and then import the annotated variants into BigQuery with proper handling of annotations.

How to create and push VEP docker images

Inside this directory, run:

docker build . -t [IMAGE_TAG]

This downloads the source from the VEP GitHub repo and builds VEP from that source. By default, it uses version 104 of VEP. This can be changed with the ENSEMBL_RELEASE build argument, e.g.,

docker build . -t [IMAGE_TAG] --build-arg ENSEMBL_RELEASE=104

Suppose we want to push this image to the Container Registry of the Google Cloud project my-project. We can then pick [IMAGE_TAG] as gcr.io/my-project/vep:104 and push the image with:

gcloud docker -- push gcr.io/my-project/vep:104
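The build and tag steps above can be tied together so that the image tag and the ENSEMBL_RELEASE build argument never drift apart. The following is a minimal sketch, not part of this repository: the function names vep_image_tag and build_vep_image are hypothetical, and the gcr.io/PROJECT/vep:RELEASE naming simply follows the example tag used above.

```shell
# Hypothetical helper (not part of this repo): derive the image tag from a
# project ID and a VEP release, following the gcr.io/my-project/vep:104 example.
vep_image_tag() {
  local project="$1"
  local release="${2:-104}"   # 104 is the default release named in this README
  printf 'gcr.io/%s/vep:%s\n' "$project" "$release"
}

# Hypothetical wrapper: build the image so that the tag and the
# ENSEMBL_RELEASE build argument always use the same release number.
build_vep_image() {
  local project="$1"
  local release="${2:-104}"
  local tag
  tag="$(vep_image_tag "$project" "$release")"
  docker build . -t "$tag" --build-arg ENSEMBL_RELEASE="$release"
}
```

With these definitions sourced, running build_vep_image my-project 104 from inside this directory would build an image tagged gcr.io/my-project/vep:104, which can then be pushed as shown above.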

TODO: Add cloudbuild.yaml files for both easy push and integration test.

How to download and package VEP databases

Choose a local directory with enough space (e.g., ~20GB for homo_sapiens) to download and integrate the different pieces of the VEP database, or cache, files. Then run the build_vep_cache.sh script from within that directory. By default, this script creates the database for human (homo_sapiens), reference sequence GRCh38, and release 104 of VEP. These values can be overridden with the following environment variables (note that you should use the same VEP release that you used when creating the VEP docker image above):

VEP_SPECIES
GENOME_ASSEMBLY
ENSEMBL_RELEASE
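As an illustration of how these three variables combine, the sketch below predicts the name of the resulting archive. The naming pattern vep_cache_SPECIES_ASSEMBLY_RELEASE.tar.gz is an assumption inferred from the sample file name vep_cache_homo_sapiens_GRCh38_104.tar.gz used elsewhere in this README, and vep_cache_name is a hypothetical helper, not something build_vep_cache.sh provides.

```shell
# Hypothetical helper (not part of build_vep_cache.sh): predict the archive
# name from the three environment variables, using the documented defaults.
# The file name pattern is an assumption based on the sample name
# vep_cache_homo_sapiens_GRCh38_104.tar.gz.
vep_cache_name() {
  local species="${VEP_SPECIES:-homo_sapiens}"
  local assembly="${GENOME_ASSEMBLY:-GRCh38}"
  local release="${ENSEMBL_RELEASE:-104}"
  printf 'vep_cache_%s_%s_%s.tar.gz\n' "$species" "$assembly" "$release"
}
```

For example, a mouse cache could be requested with VEP_SPECIES=mus_musculus GENOME_ASSEMBLY=GRCm39 ENSEMBL_RELEASE=104 /path/to/build_vep_cache.sh, where /path/to is a placeholder for wherever this directory is checked out. Under the assumed pattern, the output would be named vep_cache_mus_musculus_GRCm39_104.tar.gz.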

How to run VEP on GCP

The helper script run_vep.sh is added to the VEP docker image and can be used to run VEP. One way of running it on Google Cloud Platform (GCP) is through the Pipelines API. For a sample yaml job description, see sample_pipeline.yaml. Here is a sample gcloud command that uses that file:

gcloud alpha genomics pipelines run \
  --project my-project \
  --pipeline-file sample_pipeline.yaml \
  --logging gs://my_bucket/logs \
  --inputs VCF_INFO_FILED=CSQ_RERUN

Note that the vep_cache_homo_sapiens_GRCh38_104.tar.gz file referenced in the sample yaml file is the output of the database creation step above.

The run_vep.sh script relies on several environment variables that can be set to change the default behaviour. In the above example VCF_INFO_FILED is changed to CSQ_RERUN (the default is CSQ_VT).

This is the full list of supported environment variables:

SPECIES: default is homo_sapiens
GENOME_ASSEMBLY: default is GRCh38
NUM_FORKS: The value to be set for --fork option of VEP. default is 1.
OTHER_VEP_OPTS: Other options to be set for the VEP invocation, default is --everything
VCF_INFO_FILED: The name of the info field to be used for annotations, default is CSQ_VT. See --vcf_info_field

The following environment variables have to be set and point to valid storage locations:

VEP_CACHE: Where the tar.gz file, created in the above database creation step, is located.
INPUT_FILE: Note this can be either a VCF file or a compressed VCF file (.gz or .bgz). Treatment of compressed and uncompressed files is the same, i.e., the input file is directly fed into VEP.
OUTPUT_VCF: The name of the output file which is always a VCF file.
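Before submitting a job, it can help to confirm that the three required variables are set and to see which defaults will apply to the optional ones. The following is a minimal sketch of such a check; check_vep_env is a hypothetical function and is not part of run_vep.sh, and it only inspects the variables without verifying that the storage locations exist.

```shell
# Hypothetical pre-flight check (not part of run_vep.sh): fail if a required
# variable is unset or empty, otherwise print the effective settings.
check_vep_env() {
  local missing=""
  [ -n "${VEP_CACHE:-}" ] || missing="$missing VEP_CACHE"
  [ -n "${INPUT_FILE:-}" ] || missing="$missing INPUT_FILE"
  [ -n "${OUTPUT_VCF:-}" ] || missing="$missing OUTPUT_VCF"
  if [ -n "$missing" ]; then
    echo "missing required variables:$missing" >&2
    return 1
  fi
  # Optional variables fall back to the defaults documented in this README.
  echo "SPECIES=${SPECIES:-homo_sapiens}"
  echo "GENOME_ASSEMBLY=${GENOME_ASSEMBLY:-GRCh38}"
  echo "NUM_FORKS=${NUM_FORKS:-1}"
  echo "OTHER_VEP_OPTS=${OTHER_VEP_OPTS:---everything}"
  echo "VCF_INFO_FILED=${VCF_INFO_FILED:-CSQ_VT}"
}
```

Calling check_vep_env with VEP_CACHE, INPUT_FILE and OUTPUT_VCF exported prints one NAME=VALUE line per optional setting. Calling it with any of the three missing prints the missing names to standard error and returns a non-zero status.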
