
Merge pull request #236 from navinrathore/zBigQuery
Documentation for BigQuery connector
sonalgoyal authored May 5, 2022
2 parents 0b648ed + e7b418d commit 68f034f
Showing 3 changed files with 78 additions and 1 deletion.
1 change: 1 addition & 0 deletions docs/SUMMARY.md
```diff
@@ -23,6 +23,7 @@
 * [MongoDB](dataSourcesAndSinks/mongodb.md)
 * [Neo4j](dataSourcesAndSinks/neo4j.md)
 * [Parquet](dataSourcesAndSinks/parquet.md)
+* [BigQuery](dataSourcesAndSinks/bigquery.md)
 * [Running Zingg on Cloud](running/running.md)
 * [Running on AWS](running/aws.md)
 * [Running on Azure](running/azure.md)
```
70 changes: 70 additions & 0 deletions docs/dataSourcesAndSinks/bigquery.md
## Using Google BigQuery to read and write data with Zingg

Zingg works seamlessly with Google BigQuery. The properties that must be set are detailed below.

Two connector jars, **spark-bigquery-with-dependencies_2.12-0.24.2.jar** and **gcs-connector-hadoop2-latest.jar**, are required to work with BigQuery. To add these BigQuery drivers to the Spark classpath, set the following environment variables before running Zingg.

```bash
export ZINGG_EXTRA=./spark-bigquery-with-dependencies_2.12-0.24.2.jar,./gcs-connector-hadoop2-latest.jar
export ZINGG_ARGS_EXTRA="--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
```

If Zingg is run from outside Google Cloud, it needs to authenticate: set the following environment variable to the location of the file containing the service account key. A service account key can be created and downloaded in JSON format from the [Google Cloud console](https://cloud.google.com/docs/authentication/getting-started).

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
```
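Putting the pieces together, a typical launch from outside Google Cloud looks like the sketch below. The jar locations, key path, phase name, and config file are illustrative assumptions, not fixed values.

```bash
# Sketch: run Zingg with the BigQuery connector jars on the classpath.
export ZINGG_EXTRA=./spark-bigquery-with-dependencies_2.12-0.24.2.jar,./gcs-connector-hadoop2-latest.jar
export ZINGG_ARGS_EXTRA="--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# Hypothetical phase and config file; substitute your own.
./scripts/zingg.sh --phase match --conf config.json
```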

Connection properties for BigQuery as a data source and a data sink are given below. To learn more about how Spark connects to BigQuery, see the [Spark BigQuery connector documentation](https://github.com/GoogleCloudDataproc/spark-bigquery-connector).

### Properties for reading data from BigQuery:

The **"credentialsFile"** property should point to the location of the Google service account key file; this is the same path used to set the **GOOGLE_APPLICATION_CREDENTIALS** variable. The **"table"** property should point to the BigQuery table that contains the source data. The **"viewsEnabled"** property must be set to true.

```json
"data" : [{
"name":"test",
"format":"bigquery",
"props": {
"credentialsFile": "/home/work/product/final/zingg-1/mynotification-46566-905cbfd2723f.json",
"table": "mynotification-46566.zinggdataset.zinggtest",
"viewsEnabled": true
}
}],
```
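Before running Zingg, you can confirm that the configured table exists using the `bq` command-line tool. This sketch assumes the Google Cloud SDK is installed and reuses the example table name from above.

```bash
# Sketch: verify the source table exists (bq uses the project:dataset.table form).
bq show mynotification-46566:zinggdataset.zinggtest
```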


### Properties for writing data to BigQuery:

To write to BigQuery, a GCS bucket needs to be created and its name assigned to the **"temporaryGcsBucket"** property; the connector stages output data in this bucket before loading it into BigQuery.

```json
"output" : [{
"name":"output",
"format":"bigquery",
"props": {
"credentialsFile": "/home/work/product/final/zingg-1/mynotification-46566-905cbfd2723f.json",
"table": "mynotification-46566.zinggdataset.zinggOutput",
"temporaryGcsBucket":"zingg-test",
}
}],
```
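A minimal sketch of creating that bucket with `gsutil` from the Google Cloud SDK; the bucket name matches the example above, and the location flag is an assumption, so pick a region close to your dataset.

```bash
# Sketch: create the temporary GCS bucket used to stage writes to BigQuery.
gsutil mb -l US gs://zingg-test
```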

### Notes:
* The library **"gcs-connector-hadoop2-latest.jar"** can be downloaded from [Google](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar) and the library **"spark-bigquery-with-dependencies_2.12-0.24.2.jar"** from the [Maven repository](https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.24.2/spark-bigquery-with-dependencies_2.12-0.24.2.jar); a download sketch is given after this section.
* A typical service account key file looks like the one below; the file format is JSON.

```json
{
"type": "service_account",
"project_id": "mynotification-46566",
"private_key_id": "905cbfd273ff9205d1cabfe06fa6908e54534",
"private_key": "-----BEGIN PRIVATE KEY-----CERT.....",
"client_id": "11143646541283115487",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/zingtest%44mynotification-46566.iam.gserviceaccount.com"
}
```
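A download sketch for the two connector jars, using the URLs from the first note:

```bash
# Download the two connector jars into the current directory.
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar
wget https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.24.2/spark-bigquery-with-dependencies_2.12-0.24.2.jar
```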
8 changes: 7 additions & 1 deletion scripts/zingg.sh
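This change teaches the launcher to forward optional extra Spark arguments: if **ZINGG_ARGS_EXTRA** is set (as in the BigQuery setup above), its contents are passed through to the spark-submit invocation; otherwise nothing is added.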
```diff
@@ -12,4 +12,10 @@ else
 OPTION_JARS="--jars ${ZINGG_EXTRA}"
 fi
 
-$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER $OPTION_JARS --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.es.nodes="127.0.0.1" --conf spark.es.port="9200" --conf spark.es.resource="cluster/cluster1" --conf spark.default.parallelism="8" --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/memLog.txt -XX:+UseCompressedOops" --conf spark.executor.memory=10g --conf spark.debug.maxToStringFields=200 --driver-class-path $ZINGG_JARS --class zingg.client.Client $ZINGG_JARS $@ --email $EMAIL --license $LICENSE
+if [[ -z "${ZINGG_ARGS_EXTRA}" ]]; then
+OPTION_ARGS=""
+else
+OPTION_ARGS="${ZINGG_ARGS_EXTRA}"
+fi
+
+$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER $OPTION_JARS $OPTION_ARGS --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.es.nodes="127.0.0.1" --conf spark.es.port="9200" --conf spark.es.resource="cluster/cluster1" --conf spark.default.parallelism="8" --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/memLog.txt -XX:+UseCompressedOops" --conf spark.executor.memory=10g --conf spark.debug.maxToStringFields=200 --driver-class-path $ZINGG_JARS --class zingg.client.Client $ZINGG_JARS $@ --email $EMAIL --license $LICENSE
```
