Upgrade to Spark v3 #1543
Conversation
Looks good! Wondering if we have tested Python 3.8 support as well?
"pyarrow>=0.8.0,<1.0.0", | ||
"pyspark>=2.4.5,<3.0.0", | ||
"pyarrow>=0.12.1,<6.0.0", | ||
"pyspark>=2.4.5,<4.0.0", |
One question here: if a user has PySpark 2 and installs the current notebooks with the current version of MMLSpark, will it work?
The databricks_install script will work because it looks up the jar based on the Spark version.
The notebooks that use the MMLSPARK variables from spark_utils.py will have to be changed, but it is a reasonably small change for the end user IMO. Perhaps we could have a check of the Spark version in the notebooks too.
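A minimal sketch of what that check could look like in a notebook cell (the error message and the 3.x cutoff are illustrative, not code from this PR):

```python
# Sketch: fail fast in a notebook when the installed Spark is older than 3.x.
import pyspark

spark_major = int(pyspark.__version__.split(".")[0])
if spark_major < 3:
    raise EnvironmentError(
        f"Spark 3.x expected, found {pyspark.__version__}; "
        "adjust the MMLSpark coordinates in spark_utils.py accordingly."
    )
```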
I guess for users that are still on 2.x it is not going to be easy to find out the exact package and repo information. How can we notify users? Should we have a note in the notebook or a note in spark_utils.py? What do you think?
Maybe add a couple of commented-out lines in the notebook (that they can uncomment in case of v2) with the explicit mmlspark info for v2? Just for their convenience, since we support only v3.
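For instance, something along these lines in the notebook, where the coordinates are placeholders rather than the real v2 artifact info:

```python
# Uncomment the two lines below if you are on Spark 2.x
# (placeholder coordinates; substitute the actual MMLSpark v2 package and repo):
# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:<v2-version>"
# MMLSPARK_REPO = "<v2-maven-repo-url>"
```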
yeah makes sense
I included the old info in spark_utils.py in the end.
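Roughly in this shape (a sketch only; the variable names follow the MMLSPARK naming mentioned above, and all coordinates are placeholders for the values actually committed):

```python
# spark_utils.py (sketch): Spark 3 coordinates as the default,
# with the old Spark 2 info kept nearby for users who need it.
MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.12:<v3-version>"
MMLSPARK_REPO = "<maven-repo-url>"

# Spark 2.x users: uncomment and use these instead (placeholders).
# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:<v2-version>"
# MMLSPARK_REPO = "<v2-maven-repo-url>"
```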
```diff
@@ -118,5 +118,5 @@
     install_requires=install_requires,
     package_dir={"recommenders": "recommenders"},
     packages=find_packages(where=".", exclude=["tests", "tools", "examples"]),
-    python_requires=">=3.6, <3.8",
+    python_requires=">=3.6, <3.9",  # latest Databricks versions come with Python 3.8 installed
```
Do we know if the whole repo works with 3.8?
I could check.
I need to check.
"pyarrow>=0.8.0,<1.0.0", | ||
"pyspark>=2.4.5,<3.0.0", | ||
"pyarrow>=0.12.1,<6.0.0", | ||
"pyspark>=2.4.5,<4.0.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah makes sense
So, Python 3.8 doesn't work. The [gpu] dependencies cannot be installed with pip (I tried it in a conda env). Sometimes it complains about PyTorch, sometimes about TensorFlow.
I have also tested with Java 11 (i.e., keeping the JAVA_HOME environment variable pointing to the Java 11 home) and the tests pass.
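One way to confirm which JVM PySpark actually picked up (a sketch assuming a local pyspark install; `_jvm` is py4j's gateway into the driver JVM):

```python
# Sketch: print the Java version of the JVM backing the SparkContext.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("java-check").getOrCreate()
java_version = spark.sparkContext._jvm.java.lang.System.getProperty("java.version")
print(java_version)  # e.g. "11.0.12" when JAVA_HOME points at a Java 11 home
spark.stop()
```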
Codecov Report
```
@@           Coverage Diff            @@
##           staging    #1543   +/-   ##
=========================================
  Coverage    62.07%   62.07%
=========================================
  Files           84       84
  Lines         8492     8492
=========================================
  Hits          5271     5271
  Misses        3221     3221
```
Continue to review the full report at Codecov.
Description
Changes required to support Spark version 3.
Enabled Python 3.8 because it is required by Databricks.
Related Issues
Checklist:
- PR goes to the staging branch and not to the main branch.