Merge pull request #1543 from microsoft/andreas/spark3
Upgrade to Spark v3
anargyri authored Oct 11, 2021
2 parents 2eccb87 + 8d878aa commit a485784
Showing 7 changed files with 30 additions and 34 deletions.
28 changes: 12 additions & 16 deletions SETUP.md
@@ -58,18 +58,18 @@ conda update conda -n root
conda update anaconda # use 'conda install anaconda' if the package is not installed
```

If using venv, see [these instructions](#using-a-virtual-environment).
If using venv or virtualenv, see [these instructions](#using-a-virtual-environment).

**NOTE** the `xlearn` package has a dependency on `cmake`. If you use the `xlearn`-related notebooks or scripts, make sure `cmake` is installed on the system. The easiest way to install on Linux is with apt-get: `sudo apt-get install -y build-essential cmake`. Detailed instructions for installing `cmake` from source can be found [here](https://cmake.org/install/).

**NOTE** the models from Cornac require installation of `libpython`, e.g. using `sudo apt-get install -y libpython3.6` or `libpython3.7`, depending on the version of Python.

**NOTE** PySpark v2.4.x requires Java version 8.
**NOTE** Spark requires Java version 8 or 11. We support Spark version 3, but versions 2.4+ with Java version 8 may also work.
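
A quick way to confirm which Java and PySpark versions your environment picks up is a short check like the sketch below (a minimal sketch; it assumes `java` is on the PATH and that `pyspark` is already installed in the active environment):

```python
# Sanity-check sketch: print the Java and PySpark versions visible to this environment.
# Assumes `java` is on the PATH and pyspark is installed; adjust as needed.
import subprocess
import pyspark

try:
    proc = subprocess.run(
        ["java", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Most JDKs print the version banner to stderr rather than stdout.
    print("Java:", (proc.stderr or proc.stdout).strip().splitlines()[0])
except FileNotFoundError:
    print("Java: not found on PATH")

print("PySpark:", pyspark.__version__)
```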

<details>
<summary><strong><em>Install Java 8 on MacOS</em></strong></summary>
<summary><strong><em>Install Java on MacOS</em></strong></summary>

To install Java 8 on MacOS using [asdf](https://github.com/halcyon/asdf-java):
To install Java (e.g. Java 8) on MacOS using [asdf](https://github.com/halcyon/asdf-java):

brew install asdf
asdf plugin add java
@@ -151,7 +151,7 @@ create the file `%RECO_ENV%\etc\conda\deactivate.d\env_vars.bat` and add:

It is straightforward to install the recommenders package within a [virtual environment](https://docs.python.org/3/library/venv.html). However, setting up CUDA for use with a GPU can be cumbersome. The most convenient way to do this is therefore to set up [Nvidia docker](https://github.com/NVIDIA/nvidia-docker) and run the virtual environment within a container.
In the following `3.6` should be replaced with the Python version you are using.
In the following `3.6` should be replaced with the Python version you are using and `11` should be replaced with the appropriate Java version.

# Start docker daemon if not running
sudo dockerd &
@@ -167,14 +167,15 @@ In the following `3.6` should be replaced with the Python version you are using.
apt-get -y install python3.6-venv
apt-get -y install libpython3.6-dev
apt-get -y install cmake
apt-get install -y libgomp1 openjdk-8-jre
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
apt-get install -y libgomp1 openjdk-11-jre
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

python3.6 -m venv --system-site-packages /venv
source /venv/bin/activate
pip install --upgrade pip
pip install --upgrade setuptools

export SPARK_HOME=/venv/lib/python3.6/site-packages/pyspark
export PYSPARK_DRIVER_PYTHON=/venv/bin/python
export PYSPARK_PYTHON=/venv/bin/python

@@ -223,13 +224,6 @@ SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true, -Dspark.worker.cleanup.a
unset SPARK_HOME
```

* Java 11 might produce errors when running the notebooks. To change it to Java 8:

```
sudo apt install openjdk-8-jdk
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
```

* There may be conflicts between the MMLSpark jars available in the DSVM and the ones used by the library. In that case, it is better to remove those jars and rely on loading them from Maven or other repositories made available by the MMLSpark team.

```
@@ -241,9 +235,10 @@ sudo rm -rf Azure_mmlspark-0.12.jar com.microsoft.cntk_cntk-2.4.jar com.microsof

### Requirements

* Databricks Runtime version >= 4.3 (Apache Spark 2.3.1, Scala 2.11) and <= 5.5 (Apache Spark 2.4.3, Scala 2.11)
* Databricks Runtime version >= 7 (Apache Spark >= 3.0.1, Scala 2.12)
* Python 3

Earlier versions of Databricks or Spark may work but this is not guaranteed.
An example of how to create an Azure Databricks workspace and an Apache Spark cluster within the workspace can be found [here](https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal). To utilize deep learning models and GPUs, you may set up a GPU-enabled cluster. For more details about this topic, please see the [Azure Databricks deep learning guide](https://docs.azuredatabricks.net/applications/deep-learning/index.html).

### Installation from PyPI
@@ -368,7 +363,8 @@ You can follow instructions [here](https://docs.azuredatabricks.net/user-guide/l

Additionally, you must install the [spark-cosmosdb connector](https://docs.databricks.com/spark/latest/data-sources/azure/cosmosdb-connector.html) on the cluster. The easiest way to manually do that is to:

1. Download the [appropriate jar](https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar) from MAVEN. **NOTE** This is the appropriate jar for spark versions `2.3.X`, and is the appropriate version for the recommended Azure Databricks run-time detailed above.

1. Download the [appropriate jar](https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar) from Maven. **NOTE** This is the correct jar for Spark versions `3.1.X` and matches the recommended Azure Databricks run-time detailed above (an alternative that pulls the connector from Maven at session start is sketched after these steps).
2. Upload and install the jar by:
1. Log into your `Azure Databricks` workspace
2. Select the `Clusters` button on the left.
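
As an alternative to uploading the jar by hand, the connector can usually be pulled from Maven when the Spark session is created. The sketch below is illustrative only: the Maven coordinate is inferred from the jar linked above, and the `cosmos.oltp` format and `spark.cosmos.*` option names are assumed from the azure-cosmos-spark 4.x connector documentation, so verify them against the connector version you actually install.

```python
# Sketch: load the Cosmos DB Spark 3 connector from Maven instead of uploading the jar.
# Coordinate inferred from the jar linked above; option names assumed from the 4.x connector docs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cosmosdb-connector-example")
    .config(
        "spark.jars.packages",
        "com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:4.3.1",
    )
    .getOrCreate()
)

# Placeholder connection settings -- replace with your own account details.
cosmos_options = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<key>",
    "spark.cosmos.database": "<database>",
    "spark.cosmos.container": "<container>",
}

df = spark.read.format("cosmos.oltp").options(**cosmos_options).load()
df.show(5)
```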
4 changes: 2 additions & 2 deletions conda.md
@@ -1,6 +1,6 @@
One possible way to use the repository is to run all the recommender utilities directly from a local copy of the source code (without building the package). This requires installing all the necessary dependencies from Anaconda and PyPI.

To this end we provide a script, [generate_conda_file.py](tools/generate_conda_file.py), to generate a conda-environment yaml file which you can use to create the target environment using Python 3.6 or 3.7 with all the correct dependencies.
To this end we provide a script, [generate_conda_file.py](tools/generate_conda_file.py), to generate a conda-environment yaml file which you can use to create the target environment using Python with all the correct dependencies.

Assuming the repo is cloned as `Recommenders` in the local system, to install **a default (Python CPU) environment**:

@@ -34,7 +34,7 @@ To install the PySpark environment:

Additionally, if you want to test a particular version of Spark, you may pass the `--pyspark-version` argument:

python tools/generate_conda_file.py --pyspark-version 2.4.5
python tools/generate_conda_file.py --pyspark-version 3.1.1

</details>

9 changes: 2 additions & 7 deletions examples/04_model_select_and_optimize/tuning_spark_als.ipynb
@@ -42,29 +42,24 @@
"# set the environment path to find Recommenders\n",
"%matplotlib notebook\n",
"\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import sys\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"import pyspark\n",
"from pyspark.sql import SparkSession\n",
"import pyspark.sql.functions as F\n",
"from pyspark import SparkContext, SparkConf\n",
"from pyspark.ml.recommendation import ALS\n",
"from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit\n",
"from pyspark.ml.evaluation import Evaluator, RegressionEvaluator\n",
"from pyspark.ml.pipeline import Estimator, Model\n",
"from pyspark import keyword_only \n",
"from pyspark.ml import Transformer\n",
"from pyspark.ml.param.shared import *\n",
"from pyspark.ml.util import *\n",
"from pyspark.mllib.evaluation import RegressionMetrics, RankingMetrics\n",
"from pyspark.sql.types import ArrayType, IntegerType, StringType\n",
"from pyspark.mllib.evaluation import RankingMetrics\n",
"from pyspark.sql.types import ArrayType, IntegerType\n",
"\n",
"from hyperopt import fmin, tpe, hp, STATUS_OK, Trials\n",
"from hyperopt.pyll.base import scope\n",
"from hyperopt.pyll.stochastic import sample\n",
"\n",
"from recommenders.utils.timer import Timer\n",
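
For reference, the pruned import list above still covers a minimal ALS workflow on Spark 3. The sketch below is illustrative only; the toy data and column names (`userID`, `itemID`, `rating`) are assumptions, not the notebook's actual MovieLens schema or utilities.

```python
# Minimal ALS sketch using the imports the notebook keeps after this change.
# Toy data and column names are placeholders, not the notebook's real schema.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

ratings = spark.createDataFrame(
    [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0), (2, 30, 2.0)],
    ["userID", "itemID", "rating"],
)

als = ALS(
    userCol="userID",
    itemCol="itemID",
    ratingCol="rating",
    rank=10,
    maxIter=5,
    coldStartStrategy="drop",
    seed=42,
)
model = als.fit(ratings)

predictions = model.transform(ratings)
rmse = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
).evaluate(predictions)
print("RMSE:", rmse)
```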
8 changes: 5 additions & 3 deletions recommenders/utils/spark_utils.py
@@ -9,9 +9,11 @@
except ImportError:
pass # skip this import if we are in pure python environment


MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
MMLSPARK_REPO = "https://mvnrepository.com/artifact"
MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark:1.0.0-rc3-184-3314e164-SNAPSHOT"
MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"
# We support Spark v3, but in case you wish to use v2, set
# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
# MMLSPARK_REPO = "https://mvnrepository.com/artifact"

def start_or_get_spark(
app_name="Sample",
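
The constants above are typically handed to the Spark session as extra packages and repositories. A hedged sketch of that wiring (not the repository's actual `start_or_get_spark` implementation) looks like this:

```python
# Sketch: pass the MMLSpark coordinates to a SparkSession via standard Spark config keys.
# Mirrors the intent of the constants above, but is not the library's actual code.
from pyspark.sql import SparkSession

MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark:1.0.0-rc3-184-3314e164-SNAPSHOT"
MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"

spark = (
    SparkSession.builder.appName("mmlspark-example")
    .config("spark.jars.packages", MMLSPARK_PACKAGE)
    .config("spark.jars.repositories", MMLSPARK_REPO)
    .getOrCreate()
)
```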
6 changes: 3 additions & 3 deletions setup.py
@@ -72,8 +72,8 @@
],
"spark": [
"databricks_cli>=0.8.6,<1",
"pyarrow>=0.8.0,<1.0.0",
"pyspark>=2.4.5,<3.0.0",
"pyarrow>=0.12.1,<6.0.0",
"pyspark>=2.4.5,<4.0.0",
],
"xlearn": [
"cmake>=3.18.4.post1",
@@ -129,6 +129,6 @@
"machine learning python spark gpu",
install_requires=install_requires,
package_dir={"recommenders": "recommenders"},
python_requires=">=3.6, <3.9", # latest Databricks versions come with Python 3.8 installed
packages=find_packages(where=".", exclude=["contrib", "docs", "examples", "scenarios", "tests", "tools"]),
python_requires=">=3.6, <3.8",
)
7 changes: 5 additions & 2 deletions tools/databricks_install.py
@@ -47,7 +47,10 @@
"3": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.2.0_2.11/1.1.1/azure-cosmosdb-spark_2.2.0_2.11-1.1.1-uber.jar",
"4": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar",
"5": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.4.0_2.11/1.3.5/azure-cosmosdb-spark_2.4.0_2.11-1.3.5-uber.jar",
"6": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.4.0_2.11/3.7.0/azure-cosmosdb-spark_2.4.0_2.11-3.7.0-uber.jar"
"6": "https://search.maven.org/remotecontent?filepath=com/microsoft/azure/azure-cosmosdb-spark_2.4.0_2.11/3.7.0/azure-cosmosdb-spark_2.4.0_2.11-3.7.0-uber.jar",
"7": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar",
"8": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar",
"9": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar"
}

MMLSPARK_INFO = {
@@ -302,7 +305,7 @@ def prepare_for_operationalization(
while "recommenders" not in installed_libraries:
time.sleep(PENDING_SLEEP_INTERVAL)
installed_libraries = get_installed_libraries(my_api_client, args.cluster_id)
while installed_libraries["recommenders"] != "INSTALLED":
while installed_libraries["recommenders"] not in ["INSTALLED", "FAILED"]:
time.sleep(PENDING_SLEEP_INTERVAL)
installed_libraries = get_installed_libraries(my_api_client, args.cluster_id)
if installed_libraries["recommenders"] == "FAILED":
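
The dictionary above keys the connector jar URL on the Databricks runtime major version, with runtimes 7 and above all pointing at the Spark 3 connector. Below is a small sketch of how such a lookup might be used to fetch the jar locally; the runtime value and output filename are assumptions, not the script's actual arguments.

```python
# Sketch: pick the Cosmos DB connector jar for a Databricks runtime version and download it.
# The single-entry mapping mirrors the dictionary above; runtime "7" maps to the Spark 3 jar.
import urllib.request

COSMOSDB_JAR_FILE_OPTIONS = {
    "7": "https://search.maven.org/remotecontent?filepath=com/azure/cosmos/spark/"
         "azure-cosmos-spark_3-1_2-12/4.3.1/azure-cosmos-spark_3-1_2-12-4.3.1.jar",
}

dbr_version = "7"  # assumed example value; the real script derives this from the cluster
jar_url = COSMOSDB_JAR_FILE_OPTIONS[dbr_version]
local_path, _ = urllib.request.urlretrieve(jar_url, "azure-cosmos-spark.jar")
print("Downloaded connector jar to", local_path)
```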
2 changes: 1 addition & 1 deletion tools/generate_conda_file.py
@@ -60,7 +60,7 @@
"tqdm": "tqdm>=4.31.1",
}

CONDA_PYSPARK = {"pyarrow": "pyarrow>=0.8.0", "pyspark": "pyspark==2.4.5"}
CONDA_PYSPARK = {"pyarrow": "pyarrow>=0.8.0", "pyspark": "pyspark>=3"}

CONDA_GPU = {
"fastai": "fastai==1.0.46",
