Python documentation error #56

Closed
TELSER1 opened this issue Aug 26, 2015 · 22 comments

@TELSER1

TELSER1 commented Aug 26, 2015

I believe you have accidentally copied the Scala syntax into the Python code examples, e.g.:

df.write \
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option("dbtable" -> "my_table_copy") \
  .option("tempdir" -> "s3://path/for/temp/data") \
  .mode("error")
  .save()

Presumably, all of the -> should be replaced with =, and the quotes on the assignment removed, i.e.:

df.write \
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option(dbtable="my_table_copy") \
  .option(tempdir="s3://path/for/temp/data") \
  .mode("error")
  .save()

However, this doesn't quite work either; it returns the error "option() got an unexpected keyword argument 'dbtable'" instead.

@JoshRosen
Contributor

Here's the correct syntax:

df.write \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3://path/for/temp/data") \
  .mode("error") \
  .save()

There's a fix for this as part of another PR that hasn't been merged yet. I'll just push these fixes as a separate commit.
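
For completeness, here's a minimal end-to-end sketch of the corrected calls as a standalone script (in the pyspark shell, sc and sqlContext already exist); the SQLContext setup and the placeholder DataFrame are assumptions, and the connection values are the same placeholders as above:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Assumes spark-redshift is already on the classpath (e.g. via --packages).
sc = SparkContext(appName="spark-redshift-example")
sqlContext = SQLContext(sc)

# Placeholder data just to have something to write.
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

jdbc_url = "jdbc:redshift://redshifthost:5439/database?user=username&password=pass"

df.write \
  .format("com.databricks.spark.redshift") \
  .option("url", jdbc_url) \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3://path/for/temp/data") \
  .mode("error") \
  .save()

# Reading back uses the same option names, via .read ... .load().
df_copy = sqlContext.read \
  .format("com.databricks.spark.redshift") \
  .option("url", jdbc_url) \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3://path/for/temp/data") \
  .load()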

JoshRosen added the bug label Aug 26, 2015
@TELSER1
Author

TELSER1 commented Aug 26, 2015

Thanks; this resolves the error, although I promptly hit "Failed to load class for data source: com.databricks.spark.redshift" right afterwards. This may be because I'm missing a dependency.

@JoshRosen
Contributor

How did you add spark-redshift to your Spark environment?

Try using the --packages spark-submit option, e.g.

$SPARK_HOME/bin/pyspark --packages databricks:spark-redshift:0.4.0-hadoop2
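
The same flag works with spark-submit if you're running a standalone job; my_script.py here is just a placeholder for your own script:

$SPARK_HOME/bin/spark-submit --packages databricks:spark-redshift:0.4.0-hadoop2 my_script.py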

@JoshRosen
Contributor

Oh, I just realized that you're probably building your own version off of master since you're trying to use the save feature.

In this case, try publishing locally using SBT, then update the version to 0.4.1-SNAPSHOT in your --packages command.
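
Roughly, from a checkout of this repository, that workflow would look like the following; the _2.10 suffix assumes the default Scala 2.10 build, and it relies on --packages resolving artifacts from your local Ivy repository:

sbt publishLocal
$SPARK_HOME/bin/pyspark --packages com.databricks:spark-redshift_2.10:0.4.1-SNAPSHOT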

@TELSER1
Author

TELSER1 commented Aug 27, 2015

No, I was using the command you suggested.

@JoshRosen
Contributor

To clarify: you need to add this library as a dependency to your job. There are instructions at http://spark-packages.org/package/databricks/spark-redshift; this is what I was asking about.

@TELSER1
Author

TELSER1 commented Aug 27, 2015

More specifically, I am currently initializing the shell as follows

pyspark --packages databricks:spark-redshift:0.4.0-hadoop2,com.databricks:spark-csv_2.10:1.2.0

Spark-csv works fine; both seem to download and install.

@JoshRosen
Contributor

Ah, I see the problem:

spark-redshift 0.4.0 does not include data source support; that feature is only present in the master branch of this repository and hasn't been included in a published release yet. For the documentation matching the version that you're using, see https://github.com/databricks/spark-redshift/tree/v0.4.0.

We are tentatively planning to publish v0.5.0 this Friday; that version will include the data sources feature. In the meantime, you'll have to build the library yourself if you want to test out its data sources support.

@TELSER1
Author

TELSER1 commented Aug 27, 2015

Alright, thanks for the update. I can wait on the release.

JoshRosen added this to the 0.5 milestone Aug 27, 2015
JoshRosen self-assigned this Sep 4, 2015
@TELSER1
Author

TELSER1 commented Sep 11, 2015

Hello,

I got around to trying this out again, and running

pyspark --packages databricks:spark-redshift:0.5.0-hadoop2,com.databricks:spark-csv_2.10:1.2.0

now throws

    ::::::::::::::::::::::::::::::::::::::::::::::

    ::          UNRESOLVED DEPENDENCIES         ::

    ::::::::::::::::::::::::::::::::::::::::::::::

    :: databricks#spark-redshift;0.5.0-hadoop2: not found

    ::::::::::::::::::::::::::::::::::::::::::::::

Is it not fully released yet, or am I misspecifying something?

@JoshRosen
Contributor

It is fully released. Remove the -hadoop2 part from your --packages command; we now publish a single artifact that is compatible with both Hadoop 1.x and 2.x.

@TELSER1
Author

TELSER1 commented Sep 11, 2015

pyspark --packages databricks:spark-redshift:0.5.0,com.databricks:spark-csv_2.10:1.2.0 yields the same message:
    ::::::::::::::::::::::::::::::::::::::::::::::

    ::          UNRESOLVED DEPENDENCIES         ::

    ::::::::::::::::::::::::::::::::::::::::::::::

    :: databricks#spark-redshift;0.5.0: not found

    ::::::::::::::::::::::::::::::::::::::::::::::

@brkyvz
Contributor

brkyvz commented Sep 11, 2015

The package hasn't been published to Spark Packages yet, so the shorthand coordinate doesn't work. Could you please try
com.databricks:spark-redshift_2.10:0.5.0
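
For example, keeping the spark-csv coordinate from your earlier command, the full invocation would be:

pyspark --packages com.databricks:spark-redshift_2.10:0.5.0,com.databricks:spark-csv_2.10:1.2.0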

@brkyvz
Contributor

brkyvz commented Sep 11, 2015

@JoshRosen Maybe you could also publish the release to Spark Packages.

@JoshRosen
Contributor

@brkyvz, yep, I just realized that we haven't done that yet. Going to work on it shortly.

@TELSER1
Author

TELSER1 commented Sep 11, 2015

com.databricks:spark-redshift_2.10:0.5.0 didn't throw an error.

@JoshRosen
Contributor

We've just published 0.5.0 to spark-packages, in case you want to give it another try.

@TELSER1
Author

TELSER1 commented Sep 12, 2015

That also currently works, although I hit

java.lang.ClassNotFoundException: com.amazon.redshift.jdbc4.Driver

immediately; I presume this is because I haven't figured out how to properly provide a JDBC driver, as per the installation instructions.

@JoshRosen
Contributor

@TELSER1, try downloading Amazon's JDBC4 driver from http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html, then add it to your Spark cluster via --jars.
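
For example, with the downloaded driver jar in your working directory (the jar filename below is a placeholder; use the actual file you download):

pyspark --jars RedshiftJDBC4-<version>.jar --packages com.databricks:spark-redshift_2.10:0.5.0,com.databricks:spark-csv_2.10:1.2.0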

@TELSER1
Author

TELSER1 commented Sep 12, 2015

Thanks for all the help; I had been trying to use JDBC4.1.

@JoshRosen
Contributor

@TELSER1, glad that we were able to figure everything out. I think that you technically can use the JDBC 4.1 driver if you've configured the jdbcdriver option to be com.amazon.redshift.jdbc41.Driver. I think that this confusion is common enough that it would be a good idea for spark-redshift to automatically use either the 4.0 or 4.1 driver depending on which is available. I've filed #83 so that we remember to follow up on this.
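
Concretely, that would look something like the following (untested sketch; the url/dbtable/tempdir values are the same placeholders as in the earlier examples):

df.write \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3://path/for/temp/data") \
  .option("jdbcdriver", "com.amazon.redshift.jdbc41.Driver") \
  .mode("error") \
  .save()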

@JoshRosen
Contributor

@TELSER1, the 4.0 vs 4.1 driver class configuration issue should now be fixed as of #90, so we'll now automatically pick the correct class name.
