Merge pull request #1543 from microsoft/andreas/spark3
Upgrade to Spark v3
anargyri authored Oct 11, 2021
2 parents 2eccb87 + 8d878aa commit a485784
Showing 7 changed files with 30 additions and 34 deletions.
28 changes: 12 additions & 16 deletions SETUP.md
@@ -58,18 +58,18 @@ conda update conda -n root
conda update anaconda # use 'conda install anaconda' if the package is not installed
```

If using venv, see [these instructions](#using-a-virtual-environment).
If using venv or virtualenv, see [these instructions](#using-a-virtual-environment).

**NOTE** the `xlearn` package has a dependency on `cmake`. If you use the `xlearn`-related notebooks or scripts, make sure `cmake` is installed on the system. The easiest way to install on Linux is with apt-get: `sudo apt-get install -y build-essential cmake`. Detailed instructions for installing `cmake` from source can be found [here](https://cmake.org/install/).

**NOTE** the models from Cornac require installation of `libpython`, e.g. using `sudo apt-get install -y libpython3.6` or `libpython3.7`, depending on the version of Python.

**NOTE** PySpark v2.4.x requires Java version 8.
**NOTE** Spark requires Java version 8 or 11. We support Spark version 3, but versions 2.4+ with Java version 8 may also work.
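
A quick way to confirm which Java and PySpark versions your environment picks up is a short check like the sketch below (a minimal sketch; it assumes `java` is on the PATH and that `pyspark` is already installed in the active environment):

```python
# Sanity-check sketch: print the Java and PySpark versions visible to this environment.
# Assumes `java` is on the PATH and pyspark is installed; adjust as needed.
import subprocess
import pyspark

try:
    proc = subprocess.run(
        ["java", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Most JDKs print the version banner to stderr rather than stdout.
    print("Java:", (proc.stderr or proc.stdout).strip().splitlines()[0])
except FileNotFoundError:
    print("Java: not found on PATH")

print("PySpark:", pyspark.__version__)
```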

<details>
<summary><strong><em>Install Java 8 on MacOS</em></strong></summary>
<summary><strong><em>Install Java on MacOS</em></strong></summary>

To install Java 8 on MacOS using [asdf](https://github.com/halcyon/asdf-java):
To install Java (e.g. Java 8) on MacOS using [asdf](https://github.com/halcyon/asdf-java):

brew install asdf
asdf plugin add java
@@ -151,7 +151,7 @@ create the file `%RECO_ENV%\etc\conda\deactivate.d\env_vars.bat` and add:

It is straightforward to install the recommenders package within a [virtual environment](https://docs.python.org/3/library/venv.html). However, setting up CUDA for use with a GPU can be cumbersome. The most convenient way to do this is therefore to set up [Nvidia docker](https://github.com/NVIDIA/nvidia-docker) and run the virtual environment within a container.
In the following `3.6` should be replaced with the Python version you are using.
In the following `3.6` should be replaced with the Python version you are using and `11` should be replaced with the appropriate Java version.

# Start docker daemon if not running
sudo dockerd &
@@ -167,14 +167,15 @@ In the following `3.6` should be replaced with the Python version you are using.
apt-get -y install python3.6-venv
apt-get -y install libpython3.6-dev
apt-get -y install cmake
apt-get install -y libgomp1 openjdk-8-jre
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
apt-get install -y libgomp1 openjdk-11-jre
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

python3.6 -m venv --system-site-packages /venv
source /venv/bin/activate
pip install --upgrade pip
pip install --upgrade setuptools

export SPARK_HOME=/venv/lib/python3.6/site-packages/pyspark
export PYSPARK_DRIVER_PYTHON=/venv/bin/python
export PYSPARK_PYTHON=/venv/bin/python

@@ -223,13 +224,6 @@ SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true, -Dspark.worker.cleanup.a
unset SPARK_HOME
```

* Java 11 might produce errors when running the notebooks. To change it to Java 8:

```
sudo apt install openjdk-8-jdk
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
```

* There may be conflicts between the MMLSpark jars available in the DSVM and the ones used by the library. In that case, it is better to remove those jars and rely on loading them from Maven or other repositories made available by the MMLSpark team.

```
@@ -241,9 +235,10 @@ sudo rm -rf Azure_mmlspark-0.12.jar com.microsoft.cntk_cntk-2.4.jar com.microsof

### Requirements

* Databricks Runtime version >= 4.3 (Apache Spark 2.3.1, Scala 2.11) and <= 5.5 (Apache Spark 2.4.3, Scala 2.11)
* Databricks Runtime version >= 7 (Apache Spark >= 3.0.1, Scala 2.12)
* Python 3

Earlier versions of Databricks or Spark may work but this is not guaranteed.
An example of how to create an Azure Databricks workspace and an Apache Spark cluster within the workspace can be found [here](https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal). To utilize deep learning models and GPUs, you may set up a GPU-enabled cluster. For more details about this topic, please see the [Azure Databricks deep learning guide](https://docs.azuredatabricks.net/applications/deep-learning/index.html).

### Installation from PyPI
@@ -368,7 +363,8 @@ You can follow instructions [here](https://docs.azuredatabricks.net/user-guide/l

Additionally, you must install the [spark-cosmosdb connector](https://docs.databricks.com/spark/latest/data-sources/azure/cosmosdb-connector.html) on the cluster. The easiest way to manually do that is to:

1. Download the [appropriate jar](https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar) from MAVEN. **NOTE** This is the appropriate jar for spark versions `2.3.X`, and is the appropriate version for the recommended Azure Databricks run-time detailed above.

1. Download the [appropriate jar](https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar) from Maven. **NOTE** This is the correct jar for Spark versions `3.1.X` and matches the recommended Azure Databricks run-time detailed above (an alternative that pulls the connector from Maven at session start is sketched after these steps).
2. Upload and install the jar by:
1. Log into your `Azure Databricks` workspace
2. Select the `Clusters` button on the left.
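
As an alternative to uploading the jar by hand, the connector can usually be pulled from Maven when the Spark session is created. The sketch below is illustrative only: the Maven coordinate is inferred from the jar linked above, and the `cosmos.oltp` format and `spark.cosmos.*` option names are assumed from the azure-cosmos-spark 4.x connector documentation, so verify them against the connector version you actually install.

```python
# Sketch: load the Cosmos DB Spark 3 connector from Maven instead of uploading the jar.
# Coordinate inferred from the jar linked above; option names assumed from the 4.x connector docs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cosmosdb-connector-example")
    .config(
        "spark.jars.packages",
        "com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:4.3.1",
    )
    .getOrCreate()
)

# Placeholder connection settings -- replace with your own account details.
cosmos_options = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<key>",
    "spark.cosmos.database": "<database>",
    "spark.cosmos.container": "<container>",
}

df = spark.read.format("cosmos.oltp").options(**cosmos_options).load()
df.show(5)
```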
4 changes: 2 additions & 2 deletions conda.md
@@ -1,6 +1,6 @@
One possible way to use the repository is to run all the recommender utilities directly from a local copy of the source code (without building the package). This requires installing all the necessary dependencies from Anaconda and PyPI.

To this end we provide a script, [generate_conda_file.py](tools/generate_conda_file.py), to generate a conda-environment yaml file which you can use to create the target environment using Python 3.6 or 3.7 with all the correct dependencies.
To this end we provide a script, [generate_conda_file.py](tools/generate_conda_file.py), to generate a conda-environment yaml file which you can use to create the target environment using Python with all the correct dependencies.

Assuming the repo is cloned as `Recommenders` in the local system, to install **a default (Python CPU) environment**:

@@ -34,7 +34,7 @@ To install the PySpark environment:

Additionally, if you want to test a particular version of Spark, you may pass the `--pyspark-version` argument:

python tools/generate_conda_file.py --pyspark-version 2.4.5
python tools/generate_conda_file.py --pyspark-version 3.1.1

</details>

9 changes: 2 additions & 7 deletions examples/04_model_select_and_optimize/tuning_spark_als.ipynb
@@ -42,29 +42,24 @@
"# set the environment path to find Recommenders\n",
"%matplotlib notebook\n",
"\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import sys\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"import pyspark\n",
"from pyspark.sql import SparkSession\n",
"import pyspark.sql.functions as F\n",
"from pyspark import SparkContext, SparkConf\n",
"from pyspark.ml.recommendation import ALS\n",
"from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit\n",
"from pyspark.ml.evaluation import Evaluator, RegressionEvaluator\n",
"from pyspark.ml.pipeline import Estimator, Model\n",
"from pyspark import keyword_only \n",
"from pyspark.ml import Transformer\n",
"from pyspark.ml.param.shared import *\n",
"from pyspark.ml.util import *\n",
"from pyspark.mllib.evaluation import RegressionMetrics, RankingMetrics\n",
"from pyspark.sql.types import ArrayType, IntegerType, StringType\n",
"from pyspark.mllib.evaluation import RankingMetrics\n",
"from pyspark.sql.types import ArrayType, IntegerType\n",
"\n",
"from hyperopt import fmin, tpe, hp, STATUS_OK, Trials\n",
"from hyperopt.pyll.base import scope\n",
"from hyperopt.pyll.stochastic import sample\n",
"\n",
"from recommenders.utils.timer import Timer\n",
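
For reference, the pruned import list above still covers a minimal ALS workflow on Spark 3. The sketch below is illustrative only; the toy data and column names (`userID`, `itemID`, `rating`) are assumptions, not the notebook's actual MovieLens schema or utilities.

```python
# Minimal ALS sketch using the imports the notebook keeps after this change.
# Toy data and column names are placeholders, not the notebook's real schema.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

ratings = spark.createDataFrame(
    [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0), (2, 30, 2.0)],
    ["userID", "itemID", "rating"],
)

als = ALS(
    userCol="userID",
    itemCol="itemID",
    ratingCol="rating",
    rank=10,
    maxIter=5,
    coldStartStrategy="drop",
    seed=42,
)
model = als.fit(ratings)

predictions = model.transform(ratings)
rmse = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
).evaluate(predictions)
print("RMSE:", rmse)
```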
8 changes: 5 additions & 3 deletions recommenders/utils/spark_utils.py
@@ -9,9 +9,11 @@
except ImportError:
pass # skip this import if we are in pure python environment


MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
MMLSPARK_REPO = "https://mvnrepository.com/artifact"
MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark:1.0.0-rc3-184-3314e164-SNAPSHOT"
MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"
# We support Spark v3, but in case you wish to use v2, set
# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
# MMLSPARK_REPO = "https://mvnrepository.com/artifact"

def start_or_get_spark(
app_name="Sample",
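
The constants above are typically handed to the Spark session as extra packages and repositories. A hedged sketch of that wiring (not the repository's actual `start_or_get_spark` implementation) looks like this:

```python
# Sketch: pass the MMLSpark coordinates to a SparkSession via standard Spark config keys.
# Mirrors the intent of the constants above, but is not the library's actual code.
from pyspark.sql import SparkSession

MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark:1.0.0-rc3-184-3314e164-SNAPSHOT"
MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"

spark = (
    SparkSession.builder.appName("mmlspark-example")
    .config("spark.jars.packages", MMLSPARK_PACKAGE)
    .config("spark.jars.repositories", MMLSPARK_REPO)
    .getOrCreate()
)
```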
6 changes: 3 additions & 3 deletions setup.py
@@ -72,8 +72,8 @@
],
"spark": [
"databricks_cli>=0.8.6,<1",
"pyarrow>=0.8.0,<1.0.0",
"pyspark>=2.4.5,<3.0.0",
"pyarrow>=0.12.1,<6.0.0",
"pyspark>=2.4.5,<4.0.0",
],
"xlearn": [
"cmake>=3.18.4.post1",
@@ -129,6 +129,6 @@
"machine learning python spark gpu",
install_requires=install_requires,
package_dir={"recommenders": "recommenders"},
python_requires=">=3.6, <3.9", # latest Databricks versions come with Python 3.8 installed
packages=find_packages(where=".", exclude=["contrib", "docs", "examples", "scenarios", "tests", "tools"]),
python_requires=">=3.6, <3.8",
)
7 changes: 5 additions & 2 deletions tools/databricks_install.py
@@ -47,7 +47,10 @@
"3": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.2.0_2.11/1.1.1/azure-cosmosdb-spark_2.2.0_2.11-1.1.1-uber.jar",
"4": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar",
"5": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.4.0_2.11/1.3.5/azure-cosmosdb-spark_2.4.0_2.11-1.3.5-uber.jar",
"6": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.4.0_2.11/3.7.0/azure-cosmosdb-spark_2.4.0_2.11-3.7.0-uber.jar"
"6": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.4.0_2.11/3.7.0/azure-cosmosdb-spark_2.4.0_2.11-3.7.0-uber.jar",
"7": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar",
"8": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar",
"9": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar"
}

MMLSPARK_INFO = {
@@ -302,7 +305,7 @@ def prepare_for_operationalization(
while "recommenders" not in installed_libraries:
time.sleep(PENDING_SLEEP_INTERVAL)
installed_libraries = get_installed_libraries(my_api_client, args.cluster_id)
while installed_libraries["recommenders"] != "INSTALLED":
while installed_libraries["recommenders"] not in ["INSTALLED", "FAILED"]:
time.sleep(PENDING_SLEEP_INTERVAL)
installed_libraries = get_installed_libraries(my_api_client, args.cluster_id)
if installed_libraries["recommenders"] == "FAILED":
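
The dictionary above keys the connector jar URL on the Databricks runtime major version, with runtimes 7 and above all pointing at the Spark 3 connector. Below is a small sketch of how such a lookup might be used to fetch the jar locally; the runtime value and output filename are assumptions, not the script's actual arguments.

```python
# Sketch: pick the Cosmos DB connector jar for a Databricks runtime version and download it.
# The single-entry mapping mirrors the dictionary above; runtime "7" maps to the Spark 3 jar.
import urllib.request

COSMOSDB_JAR_FILE_OPTIONS = {
    "7": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/"
         "azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar",
}

dbr_version = "7"  # assumed example value; the real script derives this from the cluster
jar_url = COSMOSDB_JAR_FILE_OPTIONS[dbr_version]
local_path, _ = urllib.request.urlretrieve(jar_url, "azure-cosmos-spark.jar")
print("Downloaded connector jar to", local_path)
```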
2 changes: 1 addition & 1 deletion tools/generate_conda_file.py
@@ -60,7 +60,7 @@
"tqdm": "tqdm>=4.31.1",
}

CONDA_PYSPARK = {"pyarrow": "pyarrow>=0.8.0", "pyspark": "pyspark==2.4.5"}
CONDA_PYSPARK = {"pyarrow": "pyarrow>=0.8.0", "pyspark": "pyspark>=3"}

CONDA_GPU = {
"fastai": "fastai==1.0.46",
