vcf_to_bq running with --run_annotation_pipeline fails to find **_vep_output.vcf files #655

TomGardner · 2020-09-08T22:46:18Z

The input xxxx.g.vcf.gz file was generated using the BAM to VCF Cromwell pipeline: https://github.com/broadinstitute/wdl-runner

When I ran vcf_to_bq without --run_annotation_pipeline, it ran fine and the BigQuery tables were created.

When I added the '--run_annotation_pipeline true' parameter, 8570 output files were generated, but none matched the **_vep_output.vcf pattern. The output structure was 'annotation/shards/LONG_UUID', with a single file named 'count_20000' in each directory.

The command I ran was:

#!/bin/bash
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=my_project
GOOGLE_CLOUD_REGION=my_region
TEMP_LOCATION=gs://my_output_bucket/temp
ANNOTATION_LOCATION=gs://my_output_bucket/annotation
INPUT_PATTERN=gs://my_input_bucket/gatk/gatk4-genome-processing-pipeline/output/NA12878.g.vcf.gz
OUTPUT_TABLE=my_project:vcf_to_bq.test_run

COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --job_name vcf-to-bigquery-09-08-64 \
  --run_annotation_pipeline true \
  --use_allele_num true \
  --max_num_workers 1000 \
  --worker_machine_type n1-standard-64 \
  --annotation_output_dir ${ANNOTATION_LOCATION} \
  --runner DataflowRunner"

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

The output error was:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/gcp_variant_transforms/src/gcp_variant_transforms/vcf_to_bq.py", line 643, in <module>
    raise e
IOError: No files found based on the file pattern gs://my_output_bucket/annotation/**_vep_output.vcf
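
For anyone reproducing this, a rough way to confirm what the annotation step wrote is to list the annotation directory and count the paths that end in _vep_output.vcf. This is only a sketch: the helper name count_vep_outputs is made up for illustration, and it assumes the listing arrives on stdin with one object path per line.

```shell
#!/bin/bash
# Hypothetical helper (not part of gcp-variant-transforms):
# reads object paths on stdin, one per line, and prints how many
# end in "_vep_output.vcf".
count_vep_outputs() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps that case from being treated as a failure.
  grep -c '_vep_output\.vcf$' || true
}
```

Assuming gsutil is installed and authenticated, it could be fed with `gsutil ls "gs://my_output_bucket/annotation/**" | count_vep_outputs`. A result of 0 would match the IOError above, meaning the VEP step produced no output files rather than vcf_to_bq looking in the wrong place.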

TomGardner · 2020-09-11T00:07:49Z

I think what is needed for VEP is the --vcf flag.
I'm running VEP manually and the command looks like:
vep -i NA12878.g.vcf.gz --fork=4 --vcf --allele_number --assembly=GRCh38 --cache --offline --force_overwrite
Which outputs a VCF formatted file.
Also, I tried adding '--shard_variants false' to the parameter list. No change: it still failed with the same error.

moschetti · 2021-05-14T21:42:25Z

I believe this is due to the vep_runner using the v2alpha Genomics API. We will look at updating this. In the meantime, you can enable the Genomics API for your project to resolve the issue.
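
As a sketch of that workaround from the command line, assuming the gcloud CLI is installed and that the Genomics API's service name is genomics.googleapis.com (verify this in the API Library for your project). The wrapper name enable_genomics_api and the GCLOUD override are hypothetical, added only so the command can be dry-run.

```shell
#!/bin/bash
# Hypothetical wrapper: enables the Genomics API for the given project.
# GCLOUD can be set to "echo" to print the arguments instead of calling gcloud.
enable_genomics_api() {
  local project="$1"
  # Assumed service name: genomics.googleapis.com
  "${GCLOUD:-gcloud}" services enable genomics.googleapis.com --project "${project}"
}
```

For example, `enable_genomics_api my_project` would enable the API for my_project, and `GCLOUD=echo enable_genomics_api my_project` only prints the arguments it would pass.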

moschetti mentioned this issue Jun 5, 2021

Update VEP runner to use Life Sciences API beta & VEP 104 #698

Merged
