diff --git a/SETUP.md b/SETUP.md
index 7592ce955e..b39283380a 100644
--- a/SETUP.md
+++ b/SETUP.md
@@ -58,18 +58,18 @@ conda update conda -n root
 conda update anaconda        # use 'conda install anaconda' if the package is not installed
 ```

-If using venv, see [these instructions](#using-a-virtual-environment).
+If using venv or virtualenv, see [these instructions](#using-a-virtual-environment).

 **NOTE** the `xlearn` package has dependency on `cmake`. If one uses the `xlearn` related notebooks or scripts, make sure `cmake` is installed in the system. The easiest way to install on Linux is with apt-get: `sudo apt-get install -y build-essential cmake`. Detailed instructions for installing `cmake` from source can be found [here](https://cmake.org/install/).

 **NOTE** the models from Cornac require installation of `libpython` i.e. using `sudo apt-get install -y libpython3.6` or `libpython3.7`, depending on the version of Python.

-**NOTE** PySpark v2.4.x requires Java version 8.
+**NOTE** Spark requires Java version 8 or 11. We support Spark version 3, but Spark 2.4+ with Java 8 may also work.
-Install Java 8 on MacOS
+Install Java on MacOS

-To install Java 8 on MacOS using [asdf](https://github.com/halcyon/asdf-java):
+To install, for example, Java 8 on MacOS using [asdf](https://github.com/halcyon/asdf-java):

     brew install asdf
     asdf plugin add Java
@@ -151,7 +151,7 @@ create the file `%RECO_ENV%\etc\conda\deactivate.d\env_vars.bat` and add:

 It is straightforward to install the recommenders package within a [virtual environment](https://docs.python.org/3/library/venv.html). However, setting up CUDA for use with a GPU can be cumbersome. We thus recommend setting up [Nvidia docker](https://github.com/NVIDIA/nvidia-docker) and running the virtual environment within a container, as the most convenient way to do this.

-In the following `3.6` should be replaced with the Python version you are using.
+In the following `3.6` should be replaced with the Python version you are using and `11` should be replaced with the appropriate Java version.

     # Start docker daemon if not running
     sudo dockerd &
@@ -167,14 +167,15 @@ In the following `3.6` should be replaced with the Python version you are using.
     apt-get -y install python3.6-venv
     apt-get -y install libpython3.6-dev
     apt-get -y install cmake
-    apt-get install -y libgomp1 openjdk-8-jre
-    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+    apt-get install -y libgomp1 openjdk-11-jre
+    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

     python3.6 -m venv --system-site-packages /venv
     source /venv/bin/activate
     pip install --upgrade pip
     pip install --upgrade setuptools

+    export SPARK_HOME=/venv/lib/python3.6/site-packages/pyspark
     export PYSPARK_DRIVER_PYTHON=/venv/bin/python
     export PYSPARK_PYTHON=/venv/bin/python
@@ -223,13 +224,6 @@ SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true, -Dspark.worker.cleanup.a
 unset SPARK_HOME
 ```

-* Java 11 might produce errors when running the notebooks. To change it to Java 8:
-
-```
-sudo apt install openjdk-8-jdk
-export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
-```
-
 * We found that there might be conflicts between the current MMLSpark jars available in the DSVM and the ones used by the library. In that case, it is better to remove those jars and rely on loading them from Maven or other repositories made available by MMLSpark team.

 ```
@@ -241,9 +235,10 @@ sudo rm -rf Azure_mmlspark-0.12.jar com.microsoft.cntk_cntk-2.4.jar com.microsof

 ### Requirements

-* Databricks Runtime version >= 4.3 (Apache Spark 2.3.1, Scala 2.11) and <= 5.5 (Apache Spark 2.4.3, Scala 2.11)
+* Databricks Runtime version >= 7 (Apache Spark >= 3.0.1, Scala 2.12)
 * Python 3

+Earlier versions of Databricks or Spark may work, but this is not guaranteed.
 An example of how to create an Azure Databricks workspace and an Apache Spark cluster within the workspace can be found from [here](https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal). To utilize deep learning models and GPUs, you may setup GPU-enabled cluster. For more details about this topic, please see [Azure Databricks deep learning guide](https://docs.azuredatabricks.net/applications/deep-learning/index.html).

 ### Installation from PyPI
@@ -368,7 +363,8 @@ You can follow instructions [here](https://docs.azuredatabricks.net/user-guide/l

 Additionally, you must install the [spark-cosmosdb connector](https://docs.databricks.com/spark/latest/data-sources/azure/cosmosdb-connector.html) on the cluster. The easiest way to manually do that is to:

-1. Download the [appropriate jar](https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar) from MAVEN. **NOTE** This is the appropriate jar for spark versions `2.3.X`, and is the appropriate version for the recommended Azure Databricks run-time detailed above.
+
+1. Download the [appropriate jar](https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar) from MAVEN. **NOTE** This is the appropriate jar for Spark versions `3.1.X`, and is the appropriate version for the recommended Azure Databricks run-time detailed above.
 2. Upload and install the jar by:
    1. Log into your `Azure Databricks` workspace
    2. Select the `Clusters` button on the left.
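The SETUP.md hunks above move the pip-based virtual environment from Java 8 to Java 11 and add an explicit `SPARK_HOME` pointing at the pip-installed PySpark. As a quick sanity check after activating that environment, a sketch along these lines (the local master and `appName` are illustrative, not part of the repository) confirms that the interpreter, `SPARK_HOME`, and JDK are wired up as intended:

```python
import os

import pyspark
from pyspark.sql import SparkSession

# SPARK_HOME should resolve to the pip-installed PySpark distribution,
# e.g. /venv/lib/python3.6/site-packages/pyspark as exported above.
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))
print("pyspark package:", os.path.dirname(pyspark.__file__))

# Creating a local session fails fast if JAVA_HOME points at an unsupported
# JDK, so this also doubles as a Java 8/11 check.
spark = SparkSession.builder.master("local[*]").appName("setup-check").getOrCreate()
print("Spark version:", spark.version)  # expected to be 3.x with this PR's pins
spark.stop()
```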
diff --git a/conda.md b/conda.md
index 2c2ee29bbf..db0e03bfc2 100644
--- a/conda.md
+++ b/conda.md
@@ -1,6 +1,6 @@
 One possible way to use the repository is to run all the recommender utilities directly from a local copy of the source code (without building the package). This requires installing all the necessary dependencies from Anaconda and PyPI.

-To this end we provide a script, [generate_conda_file.py](tools/generate_conda_file.py), to generate a conda-environment yaml file which you can use to create the target environment using Python 3.6 or 3.7 with all the correct dependencies.
+To this end we provide a script, [generate_conda_file.py](tools/generate_conda_file.py), to generate a conda-environment yaml file which you can use to create the target environment using Python with all the correct dependencies.

 Assuming the repo is cloned as `Recommenders` in the local system, to install **a default (Python CPU) environment**:

@@ -34,7 +34,7 @@ To install the PySpark environment:

 Additionally, if you want to test a particular version of spark, you may pass the `--pyspark-version` argument:

-    python tools/generate_conda_file.py --pyspark-version 2.4.5
+    python tools/generate_conda_file.py --pyspark-version 3.1.1
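With the conda path above, a quick check that an environment generated with `--pyspark-version 3.1.1` actually resolved the requested versions can look like this minimal sketch (the `3.1` prefix simply mirrors the example above):

```python
import pyarrow
import pyspark

# The environment built from the yaml generated with `--pyspark-version 3.1.1`
# should resolve to Spark 3.1.x; pyarrow is pulled in alongside it.
assert pyspark.__version__.startswith("3.1"), pyspark.__version__
print("pyspark:", pyspark.__version__, "| pyarrow:", pyarrow.__version__)
```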
diff --git a/examples/04_model_select_and_optimize/tuning_spark_als.ipynb b/examples/04_model_select_and_optimize/tuning_spark_als.ipynb
index e0d839412c..9d8052a096 100644
--- a/examples/04_model_select_and_optimize/tuning_spark_als.ipynb
+++ b/examples/04_model_select_and_optimize/tuning_spark_als.ipynb
@@ -42,29 +42,24 @@
     "# set the environment path to find Recommenders\n",
     "%matplotlib notebook\n",
     "\n",
-    "import matplotlib\n",
     "import matplotlib.pyplot as plt\n",
     "import sys\n",
     "import pandas as pd\n",
     "import numpy as np\n",
     "\n",
     "import pyspark\n",
-    "from pyspark.sql import SparkSession\n",
     "import pyspark.sql.functions as F\n",
-    "from pyspark import SparkContext, SparkConf\n",
     "from pyspark.ml.recommendation import ALS\n",
     "from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit\n",
     "from pyspark.ml.evaluation import Evaluator, RegressionEvaluator\n",
     "from pyspark.ml.pipeline import Estimator, Model\n",
     "from pyspark import keyword_only \n",
-    "from pyspark.ml import Transformer\n",
     "from pyspark.ml.param.shared import *\n",
     "from pyspark.ml.util import *\n",
-    "from pyspark.mllib.evaluation import RegressionMetrics, RankingMetrics\n",
-    "from pyspark.sql.types import ArrayType, IntegerType, StringType\n",
+    "from pyspark.mllib.evaluation import RankingMetrics\n",
+    "from pyspark.sql.types import ArrayType, IntegerType\n",
     "\n",
     "from hyperopt import fmin, tpe, hp, STATUS_OK, Trials\n",
-    "from hyperopt.pyll.base import scope\n",
     "from hyperopt.pyll.stochastic import sample\n",
     "\n",
     "from recommenders.utils.timer import Timer\n",
diff --git a/recommenders/utils/spark_utils.py b/recommenders/utils/spark_utils.py
index 6e6f625944..8cea34fdef 100644
--- a/recommenders/utils/spark_utils.py
+++ b/recommenders/utils/spark_utils.py
@@ -9,9 +9,11 @@
 except ImportError:
     pass  # skip this import if we are in pure python environment

-
-MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
-MMLSPARK_REPO = "https://mvnrepository.com/artifact"
+MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark:1.0.0-rc3-184-3314e164-SNAPSHOT"
+MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"
+# We support Spark v3, but in case you wish to use v2, set
+# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
+# MMLSPARK_REPO = "https://mvnrepository.com/artifact"

 def start_or_get_spark(
     app_name="Sample",
diff --git a/setup.py b/setup.py
index 9cca36aa38..7e406eed92 100644
--- a/setup.py
+++ b/setup.py
@@ -72,8 +72,8 @@
     ],
     "spark": [
         "databricks_cli>=0.8.6,<1",
-        "pyarrow>=0.8.0,<1.0.0",
-        "pyspark>=2.4.5,<3.0.0",
+        "pyarrow>=0.12.1,<6.0.0",
+        "pyspark>=2.4.5,<4.0.0",
     ],
     "xlearn": [
         "cmake>=3.18.4.post1",
@@ -129,6 +129,6 @@
     "machine learning python spark gpu",
     install_requires=install_requires,
     package_dir={"recommenders": "recommenders"},
+    python_requires=">=3.6, <3.9",  # latest Databricks versions come with Python 3.8 installed
     packages=find_packages(where=".", exclude=["contrib", "docs", "examples", "scenarios", "tests", "tools"]),
-    python_requires=">=3.6, <3.8",
 )
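For context on the notebook hunk further up: the deleted imports are unused, and the retained ones still cover the core ALS tuning flow. A self-contained sketch of that flow under roughly the same imports (synthetic ratings, illustrative column names, and an explicit `SparkSession` builder, since this sketch does not use the repository's session helper) might be:

```python
import itertools

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.master("local[*]").appName("als-tuning-sketch").getOrCreate()

# Tiny synthetic ratings frame; every user rates every item so that the
# train/validation split almost surely sees all users and items.
rows = [(u, i, float((u * i) % 5 + 1)) for u, i in itertools.product(range(8), range(6))]
ratings = spark.createDataFrame(rows, ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          coldStartStrategy="drop", nonnegative=True)
grid = (ParamGridBuilder()
        .addGrid(als.rank, [4, 8])
        .addGrid(als.regParam, [0.01, 0.1])
        .build())
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

tvs = TrainValidationSplit(estimator=als, estimatorParamMaps=grid,
                           evaluator=evaluator, trainRatio=0.75, seed=42)
model = tvs.fit(ratings)
print("best RMSE over the grid:", min(model.validationMetrics))
spark.stop()
```

The notebook additionally brings in `hyperopt` (`fmin`, `tpe`, `hp`) for the same tuning problem; the estimator and evaluator wiring stays the same.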
"https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar", "5": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.4.0_2.11/1.3.5/azure-cosmosdb-spark_2.4.0_2.11-1.3.5-uber.jar", - "6": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.4.0_2.11/3.7.0/azure-cosmosdb-spark_2.4.0_2.11-3.7.0-uber.jar" + "6": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.4.0_2.11/3.7.0/azure-cosmosdb-spark_2.4.0_2.11-3.7.0-uber.jar", + "7": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar", + "8": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar", + "9": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar" } MMLSPARK_INFO = { @@ -302,7 +305,7 @@ def prepare_for_operationalization( while "recommenders" not in installed_libraries: time.sleep(PENDING_SLEEP_INTERVAL) installed_libraries = get_installed_libraries(my_api_client, args.cluster_id) - while installed_libraries["recommenders"] != "INSTALLED": + while installed_libraries["recommenders"] not in ["INSTALLED", "FAILED"]: time.sleep(PENDING_SLEEP_INTERVAL) installed_libraries = get_installed_libraries(my_api_client, args.cluster_id) if installed_libraries["recommenders"] == "FAILED": diff --git a/tools/generate_conda_file.py b/tools/generate_conda_file.py index 061f4e43a1..6c5978a96a 100644 --- a/tools/generate_conda_file.py +++ b/tools/generate_conda_file.py @@ -60,7 +60,7 @@ "tqdm": "tqdm>=4.31.1", } -CONDA_PYSPARK = {"pyarrow": "pyarrow>=0.8.0", "pyspark": "pyspark==2.4.5"} +CONDA_PYSPARK = {"pyarrow": "pyarrow>=0.8.0", "pyspark": "pyspark>=3"} CONDA_GPU = { "fastai": "fastai==1.0.46",