Add location flag to documentation (#682)
* Update README.md for location flag

* vcf_files_preprocessor.md for location flag

* bigquery_to_vcf.md for location flag

* Add docs for --use_public_ips, --subnetwork, & --location

* Clarify docker worker in setting_region.md

* Clarify location flag as not required

Co-authored-by: Saman Vaisipour <[email protected]>
moschetti and samanvp authored Oct 5, 2020
1 parent 92197df commit c24fd61
Showing 4 changed files with 69 additions and 3 deletions.
README.md: 7 additions & 0 deletions
@@ -55,6 +55,11 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
to process your data, for example: `us-west1`. For more information please refer to
[Setting Regions](docs/setting_region.md).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for the Cloud Life
Sciences API to orchestrate the job from. This is not where the data will be processed,
but where some operation metadata will be stored. It can be the same as or different from
the region chosen for Cloud Dataflow. If this flag is not set, the metadata will be stored
in `us-central1`. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
project has write access to. It's used to store temporary files and logs
from the pipeline.
@@ -72,6 +77,7 @@ Run the script below and replace the following parameters:
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_PATTERN=gs://BUCKET/*.vcf
OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
@@ -85,6 +91,7 @@ COMMAND="vcf_to_bq \
docker run -v ~/.config:/root/.config \
gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--location "${GOOGLE_CLOUD_LOCATION}" \
--region "${GOOGLE_CLOUD_REGION}" \
--temp_location "${TEMP_LOCATION}" \
"${COMMAND}"
docs/bigquery_to_vcf.md: 7 additions & 0 deletions
@@ -21,6 +21,11 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
to process your data, for example: `us-west1`. For more information please refer to
[Setting Regions](docs/setting_region.md).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for the Cloud Life
Sciences API to orchestrate the job from. This is not where the data will be processed,
but where some operation metadata will be stored. It can be the same as or different from
the region chosen for Cloud Dataflow. If this flag is not set, the metadata will be stored
in `us-central1`. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
project has write access to. It's used to store temporary files and logs
from the pipeline.
@@ -35,6 +40,7 @@ Run the script below and replace the following parameters:
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_TABLE=GOOGLE_CLOUD_PROJECT:DATASET.TABLE
OUTPUT_FILE=gs://BUCKET/loaded_file.vcf
@@ -48,6 +54,7 @@ COMMAND="bq_to_vcf \
docker run -v ~/.config:/root/.config \
gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--location "${GOOGLE_CLOUD_LOCATION}" \
--region "${GOOGLE_CLOUD_REGION}" \
--temp_location "${TEMP_LOCATION}" \
"${COMMAND}"
docs/setting_region.md: 48 additions & 3 deletions
@@ -13,22 +13,37 @@ are located in the same region:
* Your pipeline's temporary location set by `--temp_location` flag.
* Your output BigQuery dataset set by `--output_table` flag.
* Your Dataflow pipeline set by `--region` flag.
* Your Life Sciences API location set by `--location` flag.

## Running jobs in a particular region
The Dataflow API [requires](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#configuring-pipelineoptions-for-execution-on-the-cloud-dataflow-service)
setting a [GCP
region](https://cloud.google.com/compute/docs/regions-zones/#available) via
the `--region` flag to run.

When running from Docker, the Cloud Life Sciences API is used to spin up a
worker that launches and monitors the Dataflow job. The Cloud Life Sciences API
is a [regionalized service](https://cloud.google.com/life-sciences/docs/concepts/locations)
that runs in multiple regions; its location is set with the `--location` flag. The
Life Sciences API location is where metadata about the pipeline's progress
will be stored, and it can be different from the region where the data is
processed. Note that the Cloud Life Sciences API is not available in all regions;
if this flag is left out, the metadata will be stored in `us-central1`. See
the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).

In addition to this requirement, you might also
choose to run Variant Transforms in a specific region following your project’s
security and compliance requirements. For example, in order
to restrict your processing job to europe-west4 (Netherlands), set the region
and location as follows:

```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--region europe-west4 \
--location europe-west4 \
--temp_location "${TEMP_LOCATION}" \
"${COMMAND}"
```
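
Because `--location` only controls where operation metadata is kept, it does not have to
match the Dataflow region. As an illustrative sketch (this region/location pairing is an
example chosen for the illustration, not a prescribed combination), data could be
processed in europe-west1 while the Life Sciences API orchestrates from europe-west4:

```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--region europe-west1 \
--location europe-west4 \
--temp_location "${TEMP_LOCATION}" \
"${COMMAND}"
```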
@@ -77,3 +92,33 @@ You can choose the region for the BigQuery dataset at dataset creation time.
![BigQuery dataset region](images/bigquery_dataset_region.png)

## Advanced Flags

Variant Transforms supports specifying a subnetwork to use with the `--subnetwork` flag.
This can be used to start the processing VMs in a specific network of your Google Cloud
project as opposed to the default network.

Variant Transforms allows disabling the use of external IP addresses with the
`--use_public_ips` flag. If not specified, this defaults to true, so to restrict the
use of external IP addresses, use `--use_public_ips false`. Note that without external
IP addresses, VMs can only send packets to other internal IP addresses. To allow these
VMs to connect to the external IP addresses used by Google APIs and services, you can
[enable Private Google Access](https://cloud.google.com/vpc/docs/configure-private-google-access)
on the subnet.

For example, to run Variant Transforms in a VPC you have already created called
`custom-network-eu-west`, with no public IP addresses, you can add these flags to the
example above as follows:
```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--region europe-west4 \
--location europe-west4 \
--temp_location "${TEMP_LOCATION}" \
--subnetwork custom-network-eu-west \
--use_public_ips false \
"${COMMAND}"
```
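
If the subnet does not yet have Private Google Access turned on, the following is a hedged
sketch of how it could be enabled and verified with `gcloud`, assuming
`custom-network-eu-west` is the name of a subnet in europe-west4 (reusing the names from
the example above; see the linked Private Google Access guide for the authoritative steps):

```bash
# Enable Private Google Access on the subnet the workers will use,
# so VMs without external IPs can still reach Google APIs and services.
gcloud compute networks subnets update custom-network-eu-west \
--region europe-west4 \
--enable-private-ip-google-access

# Verify the setting; this should print "True".
gcloud compute networks subnets describe custom-network-eu-west \
--region europe-west4 \
--format="get(privateIpGoogleAccess)"
```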
docs/vcf_files_preprocessor.md: 7 additions & 0 deletions
@@ -46,6 +46,11 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
to process your data, for example: `us-west1`. For more information please refer to
[Setting Regions](docs/setting_region.md).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for the Cloud Life
Sciences API to orchestrate the job from. This is not where the data will be processed,
but where some operation metadata will be stored. It can be the same as or different from
the region chosen for Cloud Dataflow. If this flag is not set, the metadata will be stored
in `us-central1`. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
project has write access to. It's used to store temporary files and logs
from the pipeline.
@@ -71,6 +76,7 @@ records.
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_PATTERN=gs://BUCKET/*.vcf
REPORT_PATH=gs://BUCKET/report.tsv
@@ -87,6 +93,7 @@ COMMAND="vcf_to_bq_preprocess \
docker run -v ~/.config:/root/.config \
gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--location "${GOOGLE_CLOUD_LOCATION}" \
--region "${GOOGLE_CLOUD_REGION}" \
--temp_location "${TEMP_LOCATION}" \
"${COMMAND}"
