Upgrade to Spark v3 #1543
Conversation
Looks good! Wondering if we have tested Python 3.8 support as well?
"pyarrow>=0.8.0,<1.0.0", | ||
"pyspark>=2.4.5,<3.0.0", | ||
"pyarrow>=0.12.1,<6.0.0", | ||
"pyspark>=2.4.5,<4.0.0", |
One question here: if a user has PySpark 2 and installs the current notebooks with the current version of MMLSpark, will it work?
The databricks_install script will work because it looks up the jar based on the Spark version.
The notebooks that use the MMLSPARK variables from spark_utils.py will have to be changed, but it is a reasonably small change for the end user IMO. Perhaps we could have a check of the Spark version in the notebooks too.
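A minimal sketch of what that check could look like in a notebook cell (the error message and the 3.x cutoff are illustrative, not code from this PR):

```python
# Sketch: fail fast in a notebook when the installed Spark is older than 3.x.
import pyspark

spark_major = int(pyspark.__version__.split(".")[0])
if spark_major < 3:
    raise EnvironmentError(
        f"Spark 3.x expected, found {pyspark.__version__}; "
        "adjust the MMLSpark coordinates in spark_utils.py accordingly."
    )
```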
I guess for users that are still on 2.x it is not going to be easy to find out the exact package and repo information. How can we notify users? Should we have a note in the notebook or a note in spark_utils.py? What do you think?
Maybe add a couple of commented-out lines in the notebook (that they can uncomment in case of v2) with the explicit mmlspark info for v2? Just for their convenience, since we support only v3.
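For instance, something along these lines in the notebook, where the coordinates are placeholders rather than the real v2 artifact info:

```python
# Uncomment the two lines below if you are on Spark 2.x
# (placeholder coordinates; substitute the actual MMLSpark v2 package and repo):
# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:<v2-version>"
# MMLSPARK_REPO = "<v2-maven-repo-url>"
```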
yeah makes sense
I included the old info in spark_utils.py in the end.
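Roughly in this shape (a sketch only; the variable names follow the MMLSPARK naming mentioned above, and all coordinates are placeholders for the values actually committed):

```python
# spark_utils.py (sketch): Spark 3 coordinates as the default,
# with the old Spark 2 info kept nearby for users who need it.
MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.12:<v3-version>"
MMLSPARK_REPO = "<maven-repo-url>"

# Spark 2.x users: uncomment and use these instead (placeholders).
# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:<v2-version>"
# MMLSPARK_REPO = "<v2-maven-repo-url>"
```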
```diff
@@ -118,5 +118,5 @@
     install_requires=install_requires,
     package_dir={"recommenders": "recommenders"},
     packages=find_packages(where=".", exclude=["tests", "tools", "examples"]),
-    python_requires=">=3.6, <3.8",
+    python_requires=">=3.6, <3.9",  # latest Databricks versions come with Python 3.8 installed
```
Do we know if the whole repo works with 3.8?
I could check.
I need to check.
"pyarrow>=0.8.0,<1.0.0", | ||
"pyspark>=2.4.5,<3.0.0", | ||
"pyarrow>=0.12.1,<6.0.0", | ||
"pyspark>=2.4.5,<4.0.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah makes sense
So, Python 3.8 doesn't work. The [gpu] dependencies cannot be installed with pip (I tried it in a conda env). Sometimes it complains about PyTorch, sometimes about TensorFlow.
I have also tested with Java 11 (i.e., keeping the JAVA_HOME environment variable pointing to the Java 11 home) and the tests pass.
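One way to confirm which JVM PySpark actually picked up (a sketch assuming a local pyspark install; `_jvm` is py4j's gateway into the driver JVM):

```python
# Sketch: print the Java version of the JVM backing the SparkContext.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("java-check").getOrCreate()
java_version = spark.sparkContext._jvm.java.lang.System.getProperty("java.version")
print(java_version)  # e.g. "11.0.12" when JAVA_HOME points at a Java 11 home
spark.stop()
```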
Codecov Report
```
@@           Coverage Diff            @@
##           staging    #1543   +/-   ##
=========================================
  Coverage    62.07%   62.07%
=========================================
  Files           84       84
  Lines         8492     8492
=========================================
  Hits          5271     5271
  Misses        3221     3221
```
Continue to review the full report at Codecov.
Description
Changes required to support Spark version 3.
Enabled Python 3.8 because it is required by Databricks.
Related Issues
Checklist:
- PR goes to the staging branch and not to the main branch.