vcf_to_bq running with --run_annotation_pipeline fails to find **_vep_output.vcf files #655

Open
TomGardner opened this issue Sep 8, 2020 · 2 comments

Comments

@TomGardner

TomGardner commented Sep 8, 2020

The input xxxx.g.vcf.gz file was generated using the BAM to VCF Cromwell pipeline: https://github.com/broadinstitute/wdl-runner

When I ran vcf_to_bq without --run_annotation_pipeline, it ran fine and the BigQuery tables were created.

When I added the '--run_annotation_pipeline true' parameter, 8570 output files were generated, but none matched the **_vep_output.vcf pattern. The output directory structure was 'annotation/shards/LONG_UUID', with a single file called 'count_20000' in each directory.

The command I ran was:

#!/bin/bash
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=my_project
GOOGLE_CLOUD_REGION=my_region
TEMP_LOCATION=gs://my_output_bucket/temp
ANNOTATION_LOCATION=gs://my_output_bucket/annotation
INPUT_PATTERN=gs://my_input_bucket/gatk/gatk4-genome-processing-pipeline/output/NA12878.g.vcf.gz
OUTPUT_TABLE=my_project:vcf_to_bq.test_run

COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --job_name vcf-to-bigquery-09-08-64 \
  --run_annotation_pipeline true \
  --use_allele_num true \
  --max_num_workers 1000 \
  --worker_machine_type n1-standard-64 \
  --annotation_output_dir ${ANNOTATION_LOCATION} \
  --runner DataflowRunner"

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

The output error was:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/gcp_variant_transforms/src/gcp_variant_transforms/vcf_to_bq.py", line 643, in <module>
    raise e
IOError: No files found based on the file pattern gs://my_output_bucket/annotation/**_vep_output.vcf
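
To confirm what the annotation step actually wrote, one option is to list the annotation directory and count the objects whose names end in _vep_output.vcf. This is only a minimal sketch: count_vep_outputs is a hypothetical helper (not part of Variant Transforms), and the commented usage line assumes gsutil is installed and authenticated.

```shell
#!/bin/bash
# Hypothetical helper: reads object paths on stdin and prints how many
# end in _vep_output.vcf (the suffix the pipeline's file pattern expects).
count_vep_outputs() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps that quiet.
  grep -c '_vep_output\.vcf$' || true
}

# Usage (assumes gsutil and the ANNOTATION_LOCATION variable from the script above):
#   gsutil ls "${ANNOTATION_LOCATION}/**" | count_vep_outputs
```

A count of 0 would be consistent with the IOError above, since the only files present were the 'count_20000' files under 'annotation/shards/'.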
@TomGardner
Author

TomGardner commented Sep 11, 2020

I think what is needed for VEP is the --vcf flag.
I'm running VEP manually and the command looks like:
vep -i NA12878.g.vcf.gz --fork=4 --vcf --allele_number --assembly=GRCh38 --cache --offline --force_overwrite
This outputs a VCF-formatted file.
I also tried adding '--shard_variants false' to the parameter list, but nothing changed: it still failed with the same error.
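
A quick way to check that a manual VEP run really produced VCF is to look at the first line of the output, which for a VCF file starts with ##fileformat=VCF. The helper below is a hypothetical sketch (is_vcf is not a VEP or Variant Transforms command), and the file path in the usage line is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the file's first line is a VCF header.
# Handles plain-text files and gzip-compressed files (by .gz suffix).
is_vcf() {
  case "$1" in
    *.gz) first_line=$(gzip -dc "$1" | head -n 1) ;;
    *)    first_line=$(head -n 1 "$1") ;;
  esac
  case "${first_line}" in
    '##fileformat=VCF'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (path is a placeholder for wherever VEP wrote its output):
#   is_vcf path/to/vep_output && echo "looks like VCF"
```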

@moschetti
Member

I believe this is due to the vep_runner using the v2alpha Genomics API. We will look at updating this. In the meantime, you can enable the Genomics API in your project to resolve the issue.
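
For anyone who prefers the command line to the console, the API can probably be enabled with gcloud. This is a sketch under assumptions: it assumes the Genomics API's service name is genomics.googleapis.com and that gcloud is installed and authenticated; enable_genomics_api and the GCLOUD override are hypothetical names added here so the command can be dry-run.

```shell
#!/bin/bash
# Hypothetical wrapper around "gcloud services enable".
# Assumption: the Genomics API is registered as genomics.googleapis.com.
# Set GCLOUD=echo to print the arguments instead of calling gcloud.
enable_genomics_api() {
  "${GCLOUD:-gcloud}" services enable genomics.googleapis.com --project "$1"
}

# Usage (my_project is the placeholder project ID from the original script):
#   enable_genomics_api my_project
```

After enabling the API, rerunning the same vcf_to_bq command should show whether the **_vep_output.vcf files are now produced.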
