
Merge pull request #236 from navinrathore/zBigQuery
Documentation for BigQuery connector
sonalgoyal authored May 5, 2022
2 parents 0b648ed + e7b418d commit 68f034f
Showing 3 changed files with 78 additions and 1 deletion.
1 change: 1 addition & 0 deletions docs/SUMMARY.md
```diff
@@ -23,6 +23,7 @@
 * [MongoDB](dataSourcesAndSinks/mongodb.md)
 * [Neo4j](dataSourcesAndSinks/neo4j.md)
 * [Parquet](dataSourcesAndSinks/parquet.md)
+* [BigQuery](dataSourcesAndSinks/bigquery.md)
 * [Running Zingg on Cloud](running/running.md)
 * [Running on AWS](running/aws.md)
 * [Running on Azure](running/azure.md)
```
70 changes: 70 additions & 0 deletions docs/dataSourcesAndSinks/bigquery.md
## Using Google BigQuery to read and write data with Zingg

Zingg works seamlessly with Google BigQuery. The properties that must be set are detailed below.

Two connector jars, **spark-bigquery-with-dependencies_2.12-0.24.2.jar** and **gcs-connector-hadoop2-latest.jar**, are required to work with BigQuery. To add these BigQuery drivers to the Spark classpath, set the following environment variables before running Zingg.

```bash
export ZINGG_EXTRA=./spark-bigquery-with-dependencies_2.12-0.24.2.jar,./gcs-connector-hadoop2-latest.jar
export ZINGG_ARGS_EXTRA="--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
```

If Zingg is run from outside Google Cloud, it needs to authenticate: set the following environment variable to the location of the file containing the service account key. A service account key can be created and downloaded in JSON format from the [Google Cloud console](https://cloud.google.com/docs/authentication/getting-started).

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
```
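Putting the pieces together, a typical launch from outside Google Cloud looks like the sketch below. The jar locations, key path, phase name, and config file are illustrative assumptions, not fixed values.

```bash
# Sketch: run Zingg with the BigQuery connector jars on the classpath.
export ZINGG_EXTRA=./spark-bigquery-with-dependencies_2.12-0.24.2.jar,./gcs-connector-hadoop2-latest.jar
export ZINGG_ARGS_EXTRA="--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# Hypothetical phase and config file; substitute your own.
./scripts/zingg.sh --phase match --conf config.json
```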

Connection properties for BigQuery as a data source and a data sink are given below. To learn more about how Spark connects to BigQuery, see the [Spark BigQuery connector documentation](https://github.com/GoogleCloudDataproc/spark-bigquery-connector).

### Properties for reading data from BigQuery:

The **"credentialsFile"** property should point to the location of the Google service account key file; this is the same path used to set the **GOOGLE_APPLICATION_CREDENTIALS** variable. The **"table"** property should point to the BigQuery table that contains the source data. The **"viewsEnabled"** property must be set to true.

```json
"data" : [{
"name":"test",
"format":"bigquery",
"props": {
"credentialsFile": "/home/work/product/final/zingg-1/mynotification-46566-905cbfd2723f.json",
"table": "mynotification-46566.zinggdataset.zinggtest",
"viewsEnabled": true
}
}],
```
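Before running Zingg, you can confirm that the configured table exists using the `bq` command-line tool. This sketch assumes the Google Cloud SDK is installed and reuses the example table name from above.

```bash
# Sketch: verify the source table exists (bq uses the project:dataset.table form).
bq show mynotification-46566:zinggdataset.zinggtest
```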


### Properties for writing data to BigQuery:

To write to BigQuery, a GCS bucket needs to be created and its name assigned to the **"temporaryGcsBucket"** property; the connector stages output data in this bucket before loading it into BigQuery.

```json
"output" : [{
"name":"output",
"format":"bigquery",
"props": {
"credentialsFile": "/home/work/product/final/zingg-1/mynotification-46566-905cbfd2723f.json",
"table": "mynotification-46566.zinggdataset.zinggOutput",
"temporaryGcsBucket":"zingg-test",
}
}],
```
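A minimal sketch of creating that bucket with `gsutil` from the Google Cloud SDK; the bucket name matches the example above, and the location flag is an assumption, so pick a region close to your dataset.

```bash
# Sketch: create the temporary GCS bucket used to stage writes to BigQuery.
gsutil mb -l US gs://zingg-test
```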

### Notes:
* The library **"gcs-connector-hadoop2-latest.jar"** can be downloaded from [Google](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar) and the library **"spark-bigquery-with-dependencies_2.12-0.24.2.jar"** from the [Maven repository](https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.24.2/spark-bigquery-with-dependencies_2.12-0.24.2.jar); a download sketch is given after this section.
* A typical service account key file looks like the one below; the file format is JSON.

```json
{
"type": "service_account",
"project_id": "mynotification-46566",
"private_key_id": "905cbfd273ff9205d1cabfe06fa6908e54534",
"private_key": "-----BEGIN PRIVATE KEY-----CERT.....",
"client_id": "11143646541283115487",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/zingtest%44mynotification-46566.iam.gserviceaccount.com"
}
```
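A download sketch for the two connector jars, using the URLs from the first note:

```bash
# Download the two connector jars into the current directory.
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar
wget https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.24.2/spark-bigquery-with-dependencies_2.12-0.24.2.jar
```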
8 changes: 7 additions & 1 deletion scripts/zingg.sh
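This change teaches the launcher to forward optional extra Spark arguments: if **ZINGG_ARGS_EXTRA** is set (as in the BigQuery setup above), its contents are passed through to the spark-submit invocation; otherwise nothing is added.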
```diff
@@ -12,4 +12,10 @@ else
 OPTION_JARS="--jars ${ZINGG_EXTRA}"
 fi
 
-$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER $OPTION_JARS --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.es.nodes="127.0.0.1" --conf spark.es.port="9200" --conf spark.es.resource="cluster/cluster1" --conf spark.default.parallelism="8" --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/memLog.txt -XX:+UseCompressedOops" --conf spark.executor.memory=10g --conf spark.debug.maxToStringFields=200 --driver-class-path $ZINGG_JARS --class zingg.client.Client $ZINGG_JARS $@ --email $EMAIL --license $LICENSE
+if [[ -z "${ZINGG_ARGS_EXTRA}" ]]; then
+OPTION_ARGS=""
+else
+OPTION_ARGS="${ZINGG_ARGS_EXTRA}"
+fi
+
+$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER $OPTION_JARS $OPTION_ARGS --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.es.nodes="127.0.0.1" --conf spark.es.port="9200" --conf spark.es.resource="cluster/cluster1" --conf spark.default.parallelism="8" --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/memLog.txt -XX:+UseCompressedOops" --conf spark.executor.memory=10g --conf spark.debug.maxToStringFields=200 --driver-class-path $ZINGG_JARS --class zingg.client.Client $ZINGG_JARS $@ --email $EMAIL --license $LICENSE
```
