This repository has been archived by the owner on Dec 15, 2021. It is now read-only.

spotify/spark-bigquery

MAINTENANCE MODE

THIS PROJECT IS IN MAINTENANCE MODE DUE TO THE FACT THAT IT'S NOT WIDELY USED WITHIN SPOTIFY. WE'LL PROVIDE BEST EFFORT SUPPORT FOR ISSUES AND PULL REQUESTS BUT DO EXPECT DELAY IN RESPONSES.

spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames.

| spark-bigquery version | Spark version | Comment |
| --- | --- | --- |
| 0.2.x | 2.x.y | Active development |
| 0.1.x | 1.x.y | Development halted |

To use the package in a Google Cloud Dataproc cluster:

Install org.apache.avro_avro-ipc-1.7.7.jar to ~/.ivy2/jars, then start the shell with the package:

spark-shell --packages com.spotify:spark-bigquery_2.10:0.2.2

To use it in a local SBT console:

import com.spotify.spark.bigquery._

// Set up GCP credentials
sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>")

// Set up BigQuery project and bucket
sqlContext.setBigQueryProjectId("<BILLING_PROJECT>")
sqlContext.setBigQueryGcsBucket("<GCS_BUCKET>")

// Set up BigQuery dataset location, default is US
sqlContext.setBigQueryDatasetLocation("<DATASET_LOCATION>")
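
The console session above assumes the library is already on the SBT classpath. A minimal build.sbt sketch, assuming the same Maven coordinates as the spark-shell command (com.spotify, spark-bigquery_2.10, 0.2.2) and a Scala 2.10 build to match that artifact suffix; the exact Scala patch version is an assumption, and the Spark dependencies themselves are left out:

```scala
// build.sbt (sketch, not taken from the project)
// Assumption: a Scala 2.10 build, matching the _2.10 artifact suffix above.
scalaVersion := "2.10.6"

// Same coordinates as the --packages argument shown for spark-shell.
libraryDependencies += "com.spotify" % "spark-bigquery_2.10" % "0.2.2"
```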

Usage:

// Load everything from a table
val table = sqlContext.bigQueryTable("bigquery-public-data:samples.shakespeare")

// Load results from a SQL query
// Only the legacy SQL dialect is supported for now
val df = sqlContext.bigQuerySelect(
  "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]")

// Save data to a table
df.saveAsBigQueryTable("my-project:my_dataset.my_table")
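
The table references above follow a project:dataset.table form, and legacy SQL queries wrap the same string in square brackets. The helper below is a hypothetical sketch for building and checking such strings before passing them to the calls above; TableRef is not part of spark-bigquery, and the pattern assumes that none of the three parts contains a colon, a dot, or whitespace.

```scala
// Hypothetical helper, not part of spark-bigquery.
final case class TableRef(project: String, dataset: String, table: String) {
  // Form passed to bigQueryTable and saveAsBigQueryTable in the examples above
  def spec: String = s"$project:$dataset.$table"

  // Bracketed form used inside legacy SQL queries
  def legacySql: String = s"[$spec]"
}

object TableRef {
  // Assumption: no part contains ':', '.', or whitespace.
  private val Pattern = """([^:.\s]+):([^:.\s]+)\.([^:.\s]+)""".r

  // Returns None when the string is not in project:dataset.table form.
  def parse(s: String): Option[TableRef] = s match {
    case Pattern(p, d, t) => Some(TableRef(p, d, t))
    case _                => None
  }
}
```

For example, TableRef.parse("bigquery-public-data:samples.shakespeare").map(_.legacySql) yields Some("[bigquery-public-data:samples.shakespeare]"), which can be spliced into the FROM clause of a legacy SQL query.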

To write nested records to BigQuery, specify an Avro namespace. BigQuery cannot load nested records whose Avro namespace has a leading dot (for example, .nestedColumn), which is what results when no namespace is set.

// BigQuery is able to load fields with namespace 'myNamespace.nestedColumn'
df.saveAsBigQueryTable("my-project:my_dataset.my_table", tmpWriteOptions = Map("recordNamespace" -> "myNamespace"))
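
To make the leading-dot problem concrete, the sketch below mimics the naming rule implied by the comment above: the namespace of a nested record is assumed to be the record namespace and the column name joined by a dot. This is an illustration only, not the converter's actual code, and both function names are hypothetical.

```scala
// Illustration only: not the actual Avro conversion code.
object AvroNamespaceSketch {
  // Assumed rule: nested namespace = "<recordNamespace>.<column>"
  def nestedNamespace(recordNamespace: String, column: String): String =
    s"$recordNamespace.$column"

  // BigQuery cannot load namespaces that start with a dot.
  def isLoadable(namespace: String): Boolean = !namespace.startsWith(".")
}

// With no record namespace, the result has a leading dot:
//   AvroNamespaceSketch.nestedNamespace("", "nestedColumn")            == ".nestedColumn"
// With recordNamespace set, it does not:
//   AvroNamespaceSketch.nestedNamespace("myNamespace", "nestedColumn") == "myNamespace.nestedColumn"
```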

See also Loading Avro Data from Google Cloud Storage for data type mappings and limitations. For example, loading arrays of arrays is not supported.

License

Copyright 2016 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
