Annotating input files with VEP

This directory includes tools and utilities for running Ensembl's Variant Effect Predictor (VEP) on the VCF files that serve as input to Variant Transforms.

Overview

With tools provided in this directory, one can:

  • Create a docker image of VEP.
  • Download and package VEP's database (a.k.a. cache) for different species, reference sequences and versions of VEP.
  • Run VEP on VCF input files and create output VCF files that are annotated.

Note that although this is a useful standalone tool for running VEP in the cloud, the main goal is to run VEP as a preprocessor through Variant Transforms and then import the annotated variants into BigQuery with proper handling of annotations.

How to create and push VEP docker images

Inside this directory, run:

docker build . -t [IMAGE_TAG]

This downloads the source from the VEP GitHub repo and builds VEP from it. By default, it uses version 104 of VEP. To use a different version, set the ENSEMBL_RELEASE build argument, e.g.,

docker build . -t [IMAGE_TAG] --build-arg ENSEMBL_RELEASE=104

Suppose we want to push this image to the Container Registry of my-project on Google Cloud. We can pick gcr.io/my-project/vep:104 as [IMAGE_TAG] and then push the image with:

gcloud docker -- push gcr.io/my-project/vep:104
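To keep the image tag and the ENSEMBL_RELEASE build argument in sync, the two commands above can be derived from a single release value. The following is a minimal sketch, not part of this directory: the image_tag helper and the PROJECT variable are hypothetical names introduced for illustration, and the script only prints the commands (a dry run) rather than executing them.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: builds the image tag from a project ID ($1)
# and a VEP release ($2), following the gcr.io/my-project/vep:104 example.
image_tag() {
  echo "gcr.io/$1/vep:$2"
}

PROJECT="${PROJECT:-my-project}"            # placeholder project ID (assumption)
ENSEMBL_RELEASE="${ENSEMBL_RELEASE:-104}"   # default release documented above
IMAGE_TAG="$(image_tag "${PROJECT}" "${ENSEMBL_RELEASE}")"

# Dry run: print the build and push commands instead of executing them.
echo "docker build . -t ${IMAGE_TAG} --build-arg ENSEMBL_RELEASE=${ENSEMBL_RELEASE}"
echo "gcloud docker -- push ${IMAGE_TAG}"
```

Once the printed commands look right, they can be run from inside this directory.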

TODO: Add cloudbuild.yaml files for both easy push and integration test.

How to download and package VEP databases

Choose a local directory with enough space (e.g., ~20GB for homo_sapiens) to download and integrate the different pieces of the VEP database (cache) files. Then, from within that directory, run the build_vep_cache.sh script. By default, this script creates the database for human (homo_sapiens), reference sequence GRCh38, and release 104 of VEP. These values can be overridden with the following environment variables (note that you should use the same VEP release that you used for creating the VEP docker image above):

  • VEP_SPECIES
  • GENOME_ASSEMBLY
  • ENSEMBL_RELEASE
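As a rough illustration of overriding these variables, the sketch below sets them with the documented defaults as fallbacks and prints the archive name one would expect. The naming pattern is an assumption inferred from the vep_cache_homo_sapiens_GRCh38_104.tar.gz file mentioned in this README, and expected_cache_name is a hypothetical helper, not part of build_vep_cache.sh.

```shell
#!/usr/bin/env bash
set -eu

# Fall back to the documented defaults when a variable is not already set.
export VEP_SPECIES="${VEP_SPECIES:-homo_sapiens}"
export GENOME_ASSEMBLY="${GENOME_ASSEMBLY:-GRCh38}"
export ENSEMBL_RELEASE="${ENSEMBL_RELEASE:-104}"

# Hypothetical helper. Assumption: the archive is named
# vep_cache_<species>_<assembly>_<release>.tar.gz, as in the one example
# given in this README.
expected_cache_name() {
  echo "vep_cache_${VEP_SPECIES}_${GENOME_ASSEMBLY}_${ENSEMBL_RELEASE}.tar.gz"
}

echo "Expected archive: $(expected_cache_name)"

# Then, from the chosen working directory, run the script
# (the path below is a placeholder):
#   /path/to/build_vep_cache.sh
```

For instance, exporting GENOME_ASSEMBLY=GRCh37 before running the sketch changes the expected archive name accordingly, assuming a cache for that assembly is available for the chosen release.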

How to run VEP on GCP

The helper script run_vep.sh is added to the VEP docker image and can be used to run VEP. One way of running it on Google Cloud Platform (GCP) is through the Pipelines API. For a sample YAML job description, see sample_pipeline.yaml. Here is a sample gcloud command that uses that file:

gcloud alpha genomics pipelines run \
  --project my-project \
  --pipeline-file sample_pipeline.yaml \
  --logging gs://my_bucket/logs \
  --inputs VCF_INFO_FILED=CSQ_RERUN

Note that the vep_cache_homo_sapiens_GRCh38_104.tar.gz file referenced in the sample YAML file is the output of the database creation step above.

The run_vep.sh script relies on several environment variables that can be set to change the default behaviour. In the above example VCF_INFO_FILED is changed to CSQ_RERUN (the default is CSQ_VT).

This is the full list of supported environment variables:

  • SPECIES: default is homo_sapiens
  • GENOME_ASSEMBLY: default is GRCh38
  • NUM_FORKS: The value to be set for the --fork option of VEP; default is 1
  • OTHER_VEP_OPTS: Other options to be set for the VEP invocation; default is --everything
  • VCF_INFO_FILED: The name of the INFO field to be used for annotations; default is CSQ_VT. See --vcf_info_field

The following environment variables have to be set and point to valid storage locations:

  • VEP_CACHE: Where the tar.gz file, created in the above database creation step, is located.
  • INPUT_FILE: Either an uncompressed VCF file or a compressed one (.gz or .bgz). Compressed and uncompressed files are treated the same, i.e., the input file is fed directly into VEP.
  • OUTPUT_VCF: The name of the output file, which is always a VCF file.
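Because these three variables have no defaults, it can help to check them before submitting a job. The function below is a hypothetical preflight helper, not part of run_vep.sh, and it makes no claim about how run_vep.sh itself reacts to missing variables. It only reports which of the required names are unset or empty.

```shell
#!/usr/bin/env bash

# Hypothetical preflight check (not part of run_vep.sh).
# Reports each required variable that is unset or empty on stderr and
# returns non-zero if any is missing.
check_required_vars() {
  missing=0
  for name in VEP_CACHE INPUT_FILE OUTPUT_VCF; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

A caller might run check_required_vars || exit 1 before invoking gcloud, after setting the three variables to storage locations such as gs://my_bucket/... paths (placeholders here).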