Python documentation error #56

Closed
TELSER1 opened this issue Aug 26, 2015 · 22 comments

@TELSER1

TELSER1 commented Aug 26, 2015

I believe you have accidentally copied the Scala syntax into the Python code examples, e.g.:

df.write \
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option("dbtable" -> "my_table_copy") \
  .option("tempdir" -> "s3://path/for/temp/data") \
  .mode("error")
  .save()

Presumably, all of the -> should be replaced with =, and the quotes on the assignment removed, i.e.:

df.write \
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option(dbtable="my_table_copy") \
  .option(tempdir="s3://path/for/temp/data") \
  .mode("error")
  .save()

However, this doesn't quite work either; it returns the error "option() got an unexpected keyword argument 'dbtable'" instead.

@JoshRosen
Contributor

Here's the correct syntax:

df.write \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3://path/for/temp/data") \
  .mode("error") \
  .save()

There's a fix for this as part of another PR that hasn't been merged yet. I'll just push these fixes as a separate commit.
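
For completeness, here's a minimal end-to-end sketch of the corrected calls as a standalone script (in the pyspark shell, sc and sqlContext already exist); the SQLContext setup and the placeholder DataFrame are assumptions, and the connection values are the same placeholders as above:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Assumes spark-redshift is already on the classpath (e.g. via --packages).
sc = SparkContext(appName="spark-redshift-example")
sqlContext = SQLContext(sc)

# Placeholder data just to have something to write.
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

jdbc_url = "jdbc:redshift://redshifthost:5439/database?user=username&password=pass"

df.write \
  .format("com.databricks.spark.redshift") \
  .option("url", jdbc_url) \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3://path/for/temp/data") \
  .mode("error") \
  .save()

# Reading back uses the same option names, via .read ... .load().
df_copy = sqlContext.read \
  .format("com.databricks.spark.redshift") \
  .option("url", jdbc_url) \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3://path/for/temp/data") \
  .load()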

JoshRosen added the bug label Aug 26, 2015
@TELSER1
Author

TELSER1 commented Aug 26, 2015

Thanks; this resolves the error, although I promptly hit "Failed to load class for data source: com.databricks.spark.redshift" right afterwards. This may be because I'm missing a dependency.

@JoshRosen
Contributor

How did you add spark-redshift to your Spark environment?

Try using the --packages spark-submit option, e.g.

$SPARK_HOME/bin/pyspark --packages databricks:spark-redshift:0.4.0-hadoop2
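
The same flag works with spark-submit if you're running a standalone job; my_script.py here is just a placeholder for your own script:

$SPARK_HOME/bin/spark-submit --packages databricks:spark-redshift:0.4.0-hadoop2 my_script.py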

@JoshRosen
Contributor

Oh, I just realized that you're probably building your own version off of master since you're trying to use the save feature.

In this case, try publishing locally using SBT, then update the version to 0.4.1-SNAPSHOT in your --packages command.
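
Roughly, from a checkout of this repository, that workflow would look like the following; the _2.10 suffix assumes the default Scala 2.10 build, and it relies on --packages resolving artifacts from your local Ivy repository:

sbt publishLocal
$SPARK_HOME/bin/pyspark --packages com.databricks:spark-redshift_2.10:0.4.1-SNAPSHOT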

@TELSER1
Author

TELSER1 commented Aug 27, 2015

No, I was using the command you suggested.

@JoshRosen
Contributor

To clarify: you need to add this library as a dependency to your job. There are instructions at http://spark-packages.org/package/databricks/spark-redshift; this is what I was asking about.

@TELSER1
Author

TELSER1 commented Aug 27, 2015

More specifically, I am currently initializing the shell as follows

pyspark --packages databricks:spark-redshift:0.4.0-hadoop2,com.databricks:spark-csv_2.10:1.2.0

Spark-csv works fine; both seem to download and install.

@JoshRosen
Contributor

Ah, I see the problem:

spark-redshift 0.4.0 does not include data source support; that feature is only present in the master branch of this repository and hasn't been included in a published release yet. For the documentation matching the version that you're using, see https://github.com/databricks/spark-redshift/tree/v0.4.0.

We are tentatively planning to publish v0.5.0 this Friday; that version will include the data sources feature. In the meantime, you'll have to build the library yourself if you want to test out its data sources support.

@TELSER1
Author

TELSER1 commented Aug 27, 2015

Alright, thanks for the update. I can wait on the release.

JoshRosen added this to the 0.5 milestone Aug 27, 2015
JoshRosen self-assigned this Sep 4, 2015
@TELSER1
Author

TELSER1 commented Sep 11, 2015

Hello,

I got around to trying this out again, and running

pyspark --packages databricks:spark-redshift:0.5.0-hadoop2,com.databricks:spark-csv_2.10:1.2.0

now throws

    ::::::::::::::::::::::::::::::::::::::::::::::

    ::          UNRESOLVED DEPENDENCIES         ::

    ::::::::::::::::::::::::::::::::::::::::::::::

    :: databricks#spark-redshift;0.5.0-hadoop2: not found

    ::::::::::::::::::::::::::::::::::::::::::::::

Is it not fully released yet, or am I misspecifying something?

@JoshRosen
Contributor

It is fully released. Remove the -hadoop2 part from your --packages command; we now publish a single artifact that is compatible with both Hadoop 1.x and 2.x.

@TELSER1
Author

TELSER1 commented Sep 11, 2015

pyspark --packages databricks:spark-redshift:0.5.0,com.databricks:spark-csv_2.10:1.2.0 yields the same message:
    ::::::::::::::::::::::::::::::::::::::::::::::

    ::          UNRESOLVED DEPENDENCIES         ::

    ::::::::::::::::::::::::::::::::::::::::::::::

    :: databricks#spark-redshift;0.5.0: not found

    ::::::::::::::::::::::::::::::::::::::::::::::

@brkyvz
Contributor

brkyvz commented Sep 11, 2015

The package hasn't been published to Spark Packages yet, so the shorthand coordinate doesn't work. Could you please try
com.databricks:spark-redshift_2.10:0.5.0
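
For example, keeping the spark-csv coordinate from your earlier command, the full invocation would be:

pyspark --packages com.databricks:spark-redshift_2.10:0.5.0,com.databricks:spark-csv_2.10:1.2.0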

@brkyvz
Contributor

brkyvz commented Sep 11, 2015

@JoshRosen Maybe you could also publish the release to Spark Packages.

@JoshRosen
Contributor

@brkyvz, yep, I just realized that we haven't done that yet. Going to work on it shortly.

@TELSER1
Author

TELSER1 commented Sep 11, 2015

com.databricks:spark-redshift_2.10:0.5.0 didn't throw an error.

@JoshRosen
Contributor

We've just published 0.5.0 to spark-packages, in case you want to give it another try.

@TELSER1
Author

TELSER1 commented Sep 12, 2015

That also currently works, although I hit

java.lang.ClassNotFoundException: com.amazon.redshift.jdbc4.Driver

immediately; I presume this is because I haven't figured out how to properly provide a JDBC driver, as per the installation instructions.

@JoshRosen
Contributor

@TELSER1, try downloading Amazon's JDBC4 driver from http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html, then add it to your Spark cluster via --jars.
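
For example, with the downloaded driver jar in your working directory (the jar filename below is a placeholder; use the actual file you download):

pyspark --jars RedshiftJDBC4-<version>.jar --packages com.databricks:spark-redshift_2.10:0.5.0,com.databricks:spark-csv_2.10:1.2.0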

@TELSER1
Author

TELSER1 commented Sep 12, 2015

Thanks for all the help; I had been trying to use JDBC4.1.

@JoshRosen
Contributor

@TELSER1, glad that we were able to figure everything out. I think that you technically can use the JDBC 4.1 driver if you've configured the jdbcdriver option to be com.amazon.redshift.jdbc41.Driver. I think that this confusion is common enough that it would be a good idea for spark-redshift to automatically use either the 4.0 or 4.1 driver depending on which is available. I've filed #83 so that we remember to follow up on this.
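
Concretely, that would look something like the following (untested sketch; the url/dbtable/tempdir values are the same placeholders as in the earlier examples):

df.write \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3://path/for/temp/data") \
  .option("jdbcdriver", "com.amazon.redshift.jdbc41.Driver") \
  .mode("error") \
  .save()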

@JoshRosen
Contributor

@TELSER1, the 4.0 vs 4.1 driver class configuration issue should now be fixed as of #90, so we'll now automatically pick the correct class name.
