Python documentation error #56
Comments
Here's the correct syntax:

df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table_copy") \
    .option("tempdir", "s3://path/for/temp/data") \
    .mode("error") \
    .save()

There's a fix for this as part of another PR that hasn't been merged yet. I'll just push these fixes as a separate commit.
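For reference, reads use the same option-based syntax; a minimal Python sketch, assuming the same connection URL and tempdir (the values below are placeholders, not taken from this thread):

# Hypothetical read example -- the URL, table name, and tempdir are placeholders.
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3://path/for/temp/data") \
    .load()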
Thanks; this resolves the error, although I promptly hit "Failed to load class for data source: com.databricks.spark.redshift" right afterwards. This may be because I'm missing a dependency.
How did you add the library as a dependency? Try pulling it in with the --packages flag when launching the shell.
Oh, I just realized that you're probably building your own version off of master since you're trying to use the save feature. In this case, try publishing locally using SBT, then update the version in your --packages coordinate to match the locally published build.
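A rough sketch of that workflow, under the assumption that master builds as a 0.5.0-SNAPSHOT version (not confirmed in this thread): run sbt publishLocal from your spark-redshift checkout, then launch with pyspark --packages com.databricks:spark-redshift_2.10:0.5.0-SNAPSHOT,com.databricks:spark-csv_2.10:1.2.0 so the shell picks up the locally published artifact from your local Ivy repository.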
No, I was using the command you suggested.
To clarify: you need to add this library as a dependency to your job. There are instructions at http://spark-packages.org/package/databricks/spark-redshift; this is what I was asking about.
More specifically, I am currently initializing the shell with --packages, listing both spark-redshift and spark-csv. Spark-csv works fine; both seem to download and install.
Ah, I see the problem: spark-redshift 0.4.0 does not include data source support; that feature is only present in the master branch of this repository and hasn't been included in a published release yet. To see the documentation for the version that you're using, see https://github.com/databricks/spark-redshift/tree/v0.4.0. We are tentatively planning to publish v0.5.0 this Friday; that version will include the data sources feature. In the meantime, you'll have to build the library yourself if you want to test out its data sources support.
Alright, thanks for the update. I can wait on the release.
Hello, I got around to trying this out again, and running pyspark --packages databricks:spark-redshift:0.5.0-hadoop2,com.databricks:spark-csv_2.10:1.2.0 now throws a dependency-resolution error. Is it not fully released yet, or am I misspecifying something?
It is fully released. Remove the -hadoop2 suffix from the version.
pyspark --packages databricks:spark-redshift:0.5.0,com.databricks:spark-csv_2.10:1.2.0 yields the same message.
The package hasn't been published to Spark Packages yet, so the shorthand coordinate doesn't work. Could you please try the full Maven coordinate, com.databricks:spark-redshift_2.10:0.5.0, instead?
@JoshRosen Maybe you could also publish the release to Spark Packages.
@brkyvz, yep, I just realized that we haven't done that yet. Going to work on it shortly.
com.databricks:spark-redshift_2.10:0.5.0 didn't throw an error.
We've just published 0.5.0 to spark-packages, in case you want to give it another try.
That also currently works, although I hit an error immediately; I presume this is because I haven't figured out how to properly provide a JDBC driver, as per the installation instructions.
@TELSER1, try downloading Amazon's JDBC4 driver from http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html, then add it to your Spark cluster via --jars.
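A rough sketch of that launch command (the jar filename is a placeholder; use whatever file the Amazon download page actually provides): pyspark --jars /path/to/RedshiftJDBC4-x.x.x.jar --packages com.databricks:spark-redshift_2.10:0.5.0,com.databricks:spark-csv_2.10:1.2.0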
Thanks for all the help; I had been trying to use JDBC 4.1.
@TELSER1, glad that we were able to figure everything out. I think that you technically can use the JDBC 4.1 driver if it's configured appropriately.
I believe you have accidentally copied the Scala syntax into the Python code examples.
Presumably, all of the -> should be replaced with =, and the quotes on the assignment removed.
However, this doesn't quite work either; it returns the error "option() got an unexpected keyword argument 'dbtable'" instead.
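For reference, a short sketch of the positional-vs-keyword distinction in PySpark (assuming a Spark version whose DataFrameWriter exposes both option() and options(); the table name mirrors the example above):

# option() expects a positional (key, value) pair.
writer = df.write.format("com.databricks.spark.redshift")
writer = writer.option("dbtable", "my_table_copy")    # correct
# writer.option(dbtable="my_table_copy")              # TypeError: option() got an unexpected keyword argument 'dbtable'
writer = writer.options(dbtable="my_table_copy")      # keyword arguments go through options(**kwargs) instead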