README updates for 0.5.0 release
This patch updates the README for the 0.5.0 release.

- Fixes incorrect Python examples (fixes #56).
- Clarifies use cases for ETL vs. interactive queries.
- Provides verbatim examples that users can copy for setting up credentials.

Author: Josh Rosen <[email protected]>

Closes #72 from JoshRosen/readme-updates.
JoshRosen committed Sep 9, 2015
1 parent a9198c3 commit 7b402e3
Showing 1 changed file: README.md (33 additions, 18 deletions).
@@ -7,6 +7,8 @@ A library to load data into Spark SQL DataFrames from Amazon Redshift, and write
Redshift tables. Amazon S3 is used to efficiently transfer data in and out of Redshift, and
JDBC is used to automatically trigger the appropriate `COPY` and `UNLOAD` commands on Redshift.

+This library is more suited to ETL than interactive queries, since large amounts of data could be extracted to S3 for each query execution. If you plan to perform many queries against the same Redshift tables then we recommend saving the extracted data in a format such as Parquet.
+
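As a minimal sketch of the recommendation above (an illustrative addition rather than part of this patch; the Parquet path and temporary table name are placeholders), the data can be extracted once and then queried repeatedly from a Parquet copy:

```scala
// Extract once from Redshift via spark-redshift, then persist as Parquet so that
// repeated interactive queries do not trigger a fresh UNLOAD to S3 on every run.
// Uses the sqlContext created in the example below; URL, credentials, table name,
// and paths are placeholders.
val redshiftDF = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://path/for/temp/data")
  .load()

// Save the extract as Parquet and run subsequent queries against the Parquet copy.
redshiftDF.write.parquet("s3n://path/for/parquet/my_table")
val cachedDF = sqlContext.read.parquet("s3n://path/for/parquet/my_table")
cachedDF.registerTempTable("my_table_cached")
sqlContext.sql("SELECT x, COUNT(*) FROM my_table_cached GROUP BY x").show()
```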
- [Installation](#installation)
- Usage:
- Data sources API: [Scala](#scala), [Python](#python), [SQL](#sql)
@@ -49,21 +51,20 @@ import org.apache.spark.sql._
val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

-
// Get some data from a Redshift table
val df: DataFrame = sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
-.option("dbtable" -> "my_table")
-.option("tempdir" -> "s3n://path/for/temp/data")
+.option("dbtable", "my_table")
+.option("tempdir", "s3n://path/for/temp/data")
.load()

// Can also load data from a Redshift query
val df: DataFrame = sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
-.option("query" -> "select x, count(*) my_table group by x")
-.option("tempdir" -> "s3n://path/for/temp/data")
+.option("query", "select x, count(*) from my_table group by x")
+.option("tempdir", "s3n://path/for/temp/data")
.load()

// Apply some transformations to the data as per normal, then you can use the
@@ -72,8 +73,8 @@ val df: DataFrame = sqlContext.read
df.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
-.option("dbtable" -> "my_table_copy")
-.option("tempdir" -> "s3n://path/for/temp/data")
+.option("dbtable", "my_table_copy")
+.option("tempdir", "s3n://path/for/temp/data")
.mode("error")
.save()
```
@@ -90,25 +91,25 @@ sql_context = SQLContext(sc)
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
-.option("dbtable" -> "my_table") \
-.option("tempdir" -> "s3n://path/for/temp/data") \
+.option("dbtable", "my_table") \
+.option("tempdir", "s3n://path/for/temp/data") \
.load()

# Read data from a query
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
-.option("query" -> "select x, count(*) my_table group by x") \
-.option("tempdir" -> "s3n://path/for/temp/data") \
+.option("query", "select x, count(*) from my_table group by x") \
+.option("tempdir", "s3n://path/for/temp/data") \
.load()

# Write back to a table
df.write \
-.format("com.databricks.spark.redshift")
+.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
-.option("dbtable" -> "my_table_copy") \
-.option("tempdir" -> "s3n://path/for/temp/data") \
-.mode("error")
+.option("dbtable", "my_table_copy") \
+.option("tempdir", "s3n://path/for/temp/data") \
+.mode("error") \
.save()
```

@@ -145,9 +146,23 @@ val records = sc.newAPIHadoopFile(

There are three ways of configuring AWS credentials for use by this library:

-1. Specify AWS keys via [Hadoop configuration properties](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md). For example, if your `tempdir` configuration points to a `s3n://` filesystem then you can set the `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` properties in a Hadoop XML configuration file or call `sc.hadoopConfig.set()` to mutate Spark's global Hadoop configuration.
-2. Encode the keys into the `tempdir` URI. For example, the URI `s3n://ACCESSKEY:SECRETKEY@bucket/path/to/temp/dir` encodes the key pair (`ACCESSKEY`, `SECRETKEY`). Due to Hadoop limitations, this approach will not work for secret keys which contain forward slash (`/`) characters.
-3. If you are running on EC2 and authenticate to S3 using IAM and [instance profiles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html), then you must must configure the `temporary_aws_access_key_id`, `temporary_aws_access_key_id`, and `temporary_aws_session_token` configuration properties to point to temporary keys created via the AWS [Security Token Service](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html). These temporary keys will then be passed to Redshift via `LOAD` and `UNLOAD` commands.
+1. **Set keys in Hadoop conf (best option for most users):** You can specify AWS keys via [Hadoop configuration properties](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md). For example, if your `tempdir` configuration points to an `s3n://` filesystem then you can set the `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` properties in a Hadoop XML configuration file or call `sc.hadoopConfiguration.set()` to mutate Spark's global Hadoop configuration.
+
+   For example, if you are using the `s3n` filesystem then add
+
+   ```scala
+   sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
+   sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
+   ```
+
+   and for the `s3a` filesystem add
+
+   ```scala
+   sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_KEY_ID")
+   sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_ACCESS_KEY")
+   ```
+2. **Encode keys in `tempdir` URI**: For example, the URI `s3n://ACCESSKEY:SECRETKEY@bucket/path/to/temp/dir` encodes the key pair (`ACCESSKEY`, `SECRETKEY`). Due to [Hadoop limitations](https://issues.apache.org/jira/browse/HADOOP-3733), this approach will not work for secret keys which contain forward slash (`/`) characters.
+3. **IAM instance profiles:** If you are running on EC2 and authenticate to S3 using IAM and [instance profiles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html), then you must configure the `temporary_aws_access_key_id`, `temporary_aws_secret_access_key`, and `temporary_aws_session_token` configuration properties to point to temporary keys created via the AWS [Security Token Service](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html). These temporary keys will then be passed to Redshift via its `COPY` and `UNLOAD` commands (one way to pass these options is sketched below).
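As a minimal sketch of the third option (assuming, as the option names suggest, that the temporary credentials are supplied as data source options; the credential values shown are placeholders obtained from STS):

```scala
// Pass STS temporary credentials to the data source so they can be forwarded to
// Redshift for its COPY/UNLOAD. Uses the sqlContext from the earlier example;
// all values below are placeholders, with real ones obtained from the
// AWS Security Token Service.
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://path/for/temp/data")
  .option("temporary_aws_access_key_id", "YOUR_TEMPORARY_KEY_ID")
  .option("temporary_aws_secret_access_key", "YOUR_TEMPORARY_SECRET_ACCESS_KEY")
  .option("temporary_aws_session_token", "YOUR_SESSION_TOKEN")
  .load()
```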

**:warning: Note**: `spark-redshift` does not clean up the temporary files that it creates in S3. As a result, we recommend that you use a dedicated temporary S3 bucket with an [object lifecycle configuration](http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html) to ensure that temporary files are automatically deleted after a specified expiration period.
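One way to set up such a lifecycle rule is programmatically via the AWS SDK for Java; this is an illustrative sketch, with a placeholder bucket name, key prefix, and one-day expiration period:

```scala
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration

// Expire objects under the spark-redshift temp prefix one day after creation.
// The credentials, bucket name, prefix, and expiration period are placeholders.
val s3 = new AmazonS3Client(new BasicAWSCredentials("YOUR_KEY_ID", "YOUR_SECRET_ACCESS_KEY"))
val tempDataRule = new BucketLifecycleConfiguration.Rule()
  .withId("expire-spark-redshift-temp-data")
  .withPrefix("path/for/temp/data/")
  .withStatus(BucketLifecycleConfiguration.ENABLED)
  .withExpirationInDays(1)
val lifecycleConfig = new BucketLifecycleConfiguration()
  .withRules(java.util.Arrays.asList(tempDataRule))
s3.setBucketLifecycleConfiguration("your-temp-bucket", lifecycleConfig)
```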

