From 092aebd59a67ee10e1815fc7b09b08ada0235489 Mon Sep 17 00:00:00 2001 From: Tao Wu <21267949+wutaomsft@users.noreply.github.com> Date: Mon, 10 Apr 2023 08:37:42 -0400 Subject: [PATCH 1/6] setup update --- SETUP.md | 426 +++++++++++++++++++++++++++++-------------------------- 1 file changed, 226 insertions(+), 200 deletions(-) diff --git a/SETUP.md b/SETUP.md index e9f58560b4..0446b907c7 100644 --- a/SETUP.md +++ b/SETUP.md @@ -1,13 +1,16 @@ -# Setup guide +# Setup Guide -This document describes how to setup all the dependencies to run the notebooks in this repository in following platforms: +The repo, including this guide, is tested on Linux. Where applicable, we document differences in [Windows](Setup_Windows.md) and [MacOS](Setup_MacOS.md) although +such documentation may not always be up to date. We currently have documentation for Docker container, but plan to remove it in the future +due to limited ability to maintain it. +FIXME - the following three lines should be removed. * Local (Linux, MacOS or Windows) or [DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (Linux or Windows) * [Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/) * Docker container ## Table of Contents - + - [Extras](#extras) - [Compute environments](#compute-environments) - [Setup guide for Local or DSVM](#setup-guide-for-local-or-dsvm) - [Requirements](#requirements) @@ -25,211 +28,49 @@ This document describes how to setup all the dependencies to run the notebooks i - [Setup guide for Docker](#setup-guide-for-docker) - [Setup guide for making a release](#setup-guide-for-making-a-release) -## Compute environments -Depending on the type of recommender system and the notebook that needs to be run, there are different computational requirements. -Currently, this repository supports **Python CPU**, **Python GPU** and **PySpark**. 
+## Extras +In addition to the pip installable package, several extras are provided, including: ++ `[examples]`: Needed for running examples. ++ `[gpu]`: Needed for running GPU models. ++ `[spark]`: Needed for running Spark models. ++ `[dev]`: Needed for development. ++ `[all]`: `[examples]`|`[gpu]`|`[spark]`|`[dev]` ++ `[experimental]`: Models that are not throughly tested and/or may require additional steps in installation). ++ `[nni]`: Needed for running models integrated with [NNI](https://nni.readthedocs.io/en/stable/). +## Test environments -## Setup guide for Local or DSVM +Depending on the type of recommender system and the notebook that needs to be run, there are different computational requirements. -There are different ways one may use the recommenders utilities. The most convenient one is probably by installing the `recommenders` package from [PyPI](https://pypi.org). +Currently, tests are done on **Python CPU** (the base environment), **Python GPU** (corresponding to `[gpu]` extra above) and **PySpark** (corresponding to `[spark]` extra above). Another way is to build a docker image and use the functions inside a [docker container](#setup-guide-for-docker). Another alternative is to run all the recommender utilities directly from a local copy of the source code. This requires installing all the necessary dependencies from Anaconda and PyPI. For instructions on how to do this, see [this guide](conda.md). -### Requirements - -* A machine running Linux, MacOS or Windows -* An optional requirement is Anaconda with Python version >= 3.6, <= 3.9 - * This is pre-installed on Azure DSVM such that one can run the following steps directly. To setup on your local machine, [Miniconda](https://docs.conda.io/en/latest/miniconda.html) is a quick way to get started. - - Alternatively a [virtual environment](#using-a-virtual-environment) can be used instead of Anaconda. 
-* [Apache Spark](https://spark.apache.org/downloads.html) (this is only needed for the PySpark environment). - -### Dependencies setup - -As a pre-requisite to installing the dependencies, if using Conda, make sure that Anaconda and the package manager Conda are both up to date: - -```{shell} -conda update conda -n root -conda update anaconda # use 'conda install anaconda' if the package is not installed -``` - -If using venv or virtualenv, see [these instructions](#using-a-virtual-environment). +## Setup for Core Package -**NOTE** the `xlearn` package has dependency on `cmake`. If one uses the `xlearn` related notebooks or scripts, make sure `cmake` is installed in the system. The easiest way to install on Linux is with apt-get: `sudo apt-get install -y build-essential cmake`. Detailed instructions for installing `cmake` from source can be found [here](https://cmake.org/install/). +Follow the [Getting Started](./README.md#Getting-Started) section in the [README](./README.md) to install the package and run the examples. **NOTE** the models from Cornac require installation of `libpython` i.e. using `sudo apt-get install -y libpython3.x`, depending on the version of Python. +### Dependencies setup -**NOTE** Spark requires Java version 8 or 11. We support Spark versions 3.0 and 3.1, but versions 2.4+ with Java version 8 may also work. - -
-Install Java on MacOS - -To install e.g. Java 8 on MacOS using [asdf](https://github.com/halcyon/asdf-java): - - brew install asdf - asdf plugin add Java - asdf install java adoptopenjdk-8.0.265+1 - asdf global java adoptopenjdk-8.0.265+1 - . ~/.asdf/plugins/java/set-java-home.zsh - -
- - -Then, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable. - -Click on the following menus to see details: -
-Set PySpark environment variables on Linux or MacOS - -If you use conda, to set these variables every time the environment is activated, you can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). - -First, assuming that the environment is called `reco_pyspark`, get the path where the environment is installed: - - RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}') - mkdir -p $RECO_ENV/etc/conda/activate.d - mkdir -p $RECO_ENV/etc/conda/deactivate.d - -Then, create the file `$RECO_ENV/etc/conda/activate.d/env_vars.sh` and add: - -```bash -#!/bin/sh -RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}') -export PYSPARK_PYTHON=$RECO_ENV/bin/python -export PYSPARK_DRIVER_PYTHON=$RECO_ENV/bin/python -unset SPARK_HOME -``` -This will export the variables every time we do `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, create the file `$RECO_ENV/etc/conda/deactivate.d/env_vars.sh` and add: +## Setup for Spark +Make sure you have installed JDK (we tested on Java 8 and 11). FIXME - instrcutions are on 11. +You can install OpenJDK 11 using the command `[sudo apt-get install openjdk-11-jdk]`. +Then, ```bash -#!/bin/sh -unset PYSPARK_PYTHON -unset PYSPARK_DRIVER_PYTHON +# Within vscode: +# 1. Open a notebook with a Spark model, e.g., examples/00_quick_start/als_movielens.ipynb; +# 2. Select Jupyter kernel ; +# 3. Run the notebook. ``` -
- -
Set PySpark environment variables on Windows - -To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#windows). -First, get the path of the environment `reco_pyspark` is installed: - - for /f "delims=" %A in ('conda env list ^| grep reco_pyspark ^| awk "{print $NF}"') do set "RECO_ENV=%A" - -Then, create the file `%RECO_ENV%\etc\conda\activate.d\env_vars.bat` and add: - - @echo off - for /f "delims=" %%A in ('conda env list ^| grep reco_pyspark ^| awk "{print $NF}"') do set "RECO_ENV=%%A" - set PYSPARK_PYTHON=%RECO_ENV%\python.exe - set PYSPARK_DRIVER_PYTHON=%RECO_ENV%\python.exe - set SPARK_HOME_BACKUP=%SPARK_HOME% - set SPARK_HOME= - set PYTHONPATH_BACKUP=%PYTHONPATH% - set PYTHONPATH= - -This will export the variables every time we do `conda activate reco_pyspark`. -To unset these variables when we deactivate the environment, -create the file `%RECO_ENV%\etc\conda\deactivate.d\env_vars.bat` and add: - - @echo off - set PYSPARK_PYTHON= - set PYSPARK_DRIVER_PYTHON= - set SPARK_HOME=%SPARK_HOME_BACKUP% - set SPARK_HOME_BACKUP= - set PYTHONPATH=%PYTHONPATH_BACKUP% - set PYTHONPATH_BACKUP= - -
- - -### Using a virtual environment - -It is straightforward to install the recommenders package within a [virtual environment](https://docs.python.org/3/library/venv.html). However, setting up CUDA for use with a GPU can be cumbersome. We thus -recommend setting up [Nvidia docker](https://github.com/NVIDIA/nvidia-docker) and running the virtual environment within a container, as the most convenient way to do this. -In the following `3.6` should be replaced with the Python version you are using and `8` should be replaced with the appropriate Java version. - - # Start docker daemon if not running - sudo dockerd & - # Pull the image from the Nvidia docker hub (https://hub.docker.com/r/nvidia/cuda) that is suitable for your system - # E.g. for Ubuntu 18.04 do - sudo docker run --gpus all -it --rm nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04 - - # Within the container: - - apt-get -y update - apt-get -y install python3.6 - apt-get -y install python3-pip - apt-get -y install python3.6-venv - apt-get -y install libpython3.6-dev - apt-get -y install cmake - apt-get install -y libgomp1 openjdk-8-jre - export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 - - python3.6 -m venv --system-site-packages /venv - source /venv/bin/activate - pip install --upgrade pip - pip install --upgrade setuptools - - export SPARK_HOME=/venv/lib/python3.6/site-packages/pyspark - export PYSPARK_DRIVER_PYTHON=/venv/bin/python - export PYSPARK_PYTHON=/venv/bin/python - - pip install recommenders[all] - -If you prefer to use [virtualenv](https://virtualenv.pypa.io/en/latest/index.html#) instead of venv, you may follow the above steps, except you will need to replace the line - -`apt-get -y install python3.6-venv` - -with - -`python3.6 -m pip install --user virtualenv` - -and the line - -`python3.6 -m venv --system-site-packages /venv` - -with - -`python3.6 -m virtualenv /venv` - - -### Register the environment as a kernel in Jupyter - -We can register our conda or virtual environment to appear as a 
kernel in the Jupyter notebooks. After activating the environment (`my_env_name`) do - - python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)" - -If you are using the DSVM, you can [connect to JupyterHub](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro#jupyterhub-and-jupyterlab) by browsing to `https://your-vm-ip:8000`. - - -### Troubleshooting for the DSVM - -* We found that there can be problems if the Spark version of the machine is not the same as the one in the [conda file](conda.md). You can use the option `--pyspark-version` to address this issue. - -* When running Spark on a single local node it is possible to run out of disk space as temporary files are written to the user's home directory. To avoid this on a DSVM, we attached an additional disk to the DSVM and made modifications to the Spark configuration. This is done by including the following lines in the file at `/dsvm/tools/spark/current/conf/spark-env.sh`. - -```{shell} -SPARK_LOCAL_DIRS="/mnt" -SPARK_WORKER_DIR="/mnt" -SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true, -Dspark.worker.cleanup.appDataTtl=3600, -Dspark.worker.cleanup.interval=300, -Dspark.storage.cleanupFilesAfterExecutorExit=true" -``` - -* Another source of problems is when the variable `SPARK_HOME` is not set correctly. In the Azure DSVM, `SPARK_HOME` is by default `/dsvm/tools/spark/current`. We need to unset it: -``` -unset SPARK_HOME -``` - -* We found that there might be conflicts between the current MMLSpark jars available in the DSVM and the ones used by the library. In that case, it is better to remove those jars and rely on loading them from Maven or other repositories made available by MMLSpark team. 
- -``` -cd /dsvm/tools/spark/current/jars -sudo rm -rf Azure_mmlspark-0.12.jar com.microsoft.cntk_cntk-2.4.jar com.microsoft.ml.lightgbm_lightgbmlib-2.0.120.jar -``` +TODO 0401 - Databricks ## Setup guide for Azure Databricks ### Requirements @@ -376,21 +217,10 @@ Additionally, you must install the [spark-cosmosdb connector](https://docs.datab -## Setup guide for Docker -A [Dockerfile](tools/docker/Dockerfile) is provided to build images of the repository to simplify setup for different environments. You will need [Docker Engine](https://docs.docker.com/install/) installed on your system. -*Note: `docker` is already available on Azure Data Science Virtual Machine* -See guidelines in the Docker [README](tools/docker/README.md) for detailed instructions of how to build and run images for different environments. -Example command to build and run Docker image with base CPU environment. -```{shell} -DOCKER_BUILDKIT=1 docker build -t recommenders:cpu --build-arg ENV="cpu" --build-arg VIRTUAL_ENV="conda" . -docker run -p 8888:8888 -d recommenders:cpu -``` - -You can then open the Jupyter notebook server at http://localhost:8888 ## Setup guide for making a release @@ -407,3 +237,199 @@ generates a wheel and a tar.gz which are uploaded to a [GitHub draft release](ht 1. Download the wheel and tar.gz locally, these files shouldn't have any bug, since they passed all the tests. 1. Install twine: `pip install twine` 1. Publish the wheel and tar.gz to pypi: `twine upload recommenders*` + + +## Setup for Experimental + +**NOTE** the `xlearn` package has dependency on `cmake`. If one uses the `xlearn` related notebooks or scripts, make sure `cmake` is installed in the system. The easiest way to install on Linux is with apt-get: `sudo apt-get install -y build-essential cmake`. Detailed instructions for installing `cmake` from source can be found [here](https://cmake.org/install/). + + +# MacOS-Specific Instructions + +
+Install Java on MacOS + +To install e.g. Java 8 on MacOS using [asdf](https://github.com/halcyon/asdf-java): + + brew install asdf + asdf plugin add Java + asdf install java adoptopenjdk-8.0.265+1 + asdf global java adoptopenjdk-8.0.265+1 + . ~/.asdf/plugins/java/set-java-home.zsh + +
+ + +Then, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable. + + +# Windows-Specific Instructions + +
Set PySpark environment variables on Windows + +To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#windows). +First, get the path of the environment `reco_pyspark` is installed: + + for /f "delims=" %A in ('conda env list ^| grep reco_pyspark ^| awk "{print $NF}"') do set "RECO_ENV=%A" + +Then, create the file `%RECO_ENV%\etc\conda\activate.d\env_vars.bat` and add: + + @echo off + for /f "delims=" %%A in ('conda env list ^| grep reco_pyspark ^| awk "{print $NF}"') do set "RECO_ENV=%%A" + set PYSPARK_PYTHON=%RECO_ENV%\python.exe + set PYSPARK_DRIVER_PYTHON=%RECO_ENV%\python.exe + set SPARK_HOME_BACKUP=%SPARK_HOME% + set SPARK_HOME= + set PYTHONPATH_BACKUP=%PYTHONPATH% + set PYTHONPATH= + +This will export the variables every time we do `conda activate reco_pyspark`. +To unset these variables when we deactivate the environment, +create the file `%RECO_ENV%\etc\conda\deactivate.d\env_vars.bat` and add: + + @echo off + set PYSPARK_PYTHON= + set PYSPARK_DRIVER_PYTHON= + set SPARK_HOME=%SPARK_HOME_BACKUP% + set SPARK_HOME_BACKUP= + set PYTHONPATH=%PYTHONPATH_BACKUP% + set PYTHONPATH_BACKUP= + +
+ +## Setup guide for Docker + +A [Dockerfile](tools/docker/Dockerfile) is provided to build images of the repository to simplify setup for different environments. You will need [Docker Engine](https://docs.docker.com/install/) installed on your system. + +*Note: `docker` is already available on Azure Data Science Virtual Machine* + +See guidelines in the Docker [README](tools/docker/README.md) for detailed instructions of how to build and run images for different environments. + +Example command to build and run Docker image with base CPU environment. +```{shell} +DOCKER_BUILDKIT=1 docker build -t recommenders:cpu --build-arg ENV="cpu" --build-arg VIRTUAL_ENV="conda" . +docker run -p 8888:8888 -d recommenders:cpu +``` + +You can then open the Jupyter notebook server at http://localhost:8888 + + + +Click on the following menus to see details: +
+Set PySpark environment variables on Linux or MacOS + +If you use conda, to set these variables every time the environment is activated, you can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). + +First, assuming that the environment is called `reco_pyspark`, get the path where the environment is installed: + + RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}') + mkdir -p $RECO_ENV/etc/conda/activate.d + mkdir -p $RECO_ENV/etc/conda/deactivate.d + +Then, create the file `$RECO_ENV/etc/conda/activate.d/env_vars.sh` and add: + +```bash +#!/bin/sh +RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}') +export PYSPARK_PYTHON=$RECO_ENV/bin/python +export PYSPARK_DRIVER_PYTHON=$RECO_ENV/bin/python +unset SPARK_HOME +``` + +This will export the variables every time we do `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, create the file `$RECO_ENV/etc/conda/deactivate.d/env_vars.sh` and add: + +```bash +#!/bin/sh +unset PYSPARK_PYTHON +unset PYSPARK_DRIVER_PYTHON +``` + +
+ + + +### Using a virtual environment + +It is straightforward to install the recommenders package within a [virtual environment](https://docs.python.org/3/library/venv.html). However, setting up CUDA for use with a GPU can be cumbersome. We thus +recommend setting up [Nvidia docker](https://github.com/NVIDIA/nvidia-docker) and running the virtual environment within a container, as the most convenient way to do this. +In the following `3.6` should be replaced with the Python version you are using and `8` should be replaced with the appropriate Java version. + + # Start docker daemon if not running + sudo dockerd & + # Pull the image from the Nvidia docker hub (https://hub.docker.com/r/nvidia/cuda) that is suitable for your system + # E.g. for Ubuntu 18.04 do + sudo docker run --gpus all -it --rm nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04 + + # Within the container: + + apt-get -y update + apt-get -y install python3.6 + apt-get -y install python3-pip + apt-get -y install python3.6-venv + apt-get -y install libpython3.6-dev + apt-get -y install cmake + apt-get install -y libgomp1 openjdk-8-jre + export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 + + python3.6 -m venv --system-site-packages /venv + source /venv/bin/activate + pip install --upgrade pip + pip install --upgrade setuptools + + export SPARK_HOME=/venv/lib/python3.6/site-packages/pyspark + export PYSPARK_DRIVER_PYTHON=/venv/bin/python + export PYSPARK_PYTHON=/venv/bin/python + + pip install recommenders[all] + + + +### Register the environment as a kernel in Jupyter + +We can register our conda or virtual environment to appear as a kernel in the Jupyter notebooks. 
After activating the environment (`my_env_name`) do + + python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)" + +If you are using the DSVM, you can [connect to JupyterHub](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro#jupyterhub-and-jupyterlab) by browsing to `https://your-vm-ip:8000`. + + +### Troubleshooting for the DSVM + +* We found that there can be problems if the Spark version of the machine is not the same as the one in the [conda file](conda.md). You can use the option `--pyspark-version` to address this issue. + +* When running Spark on a single local node it is possible to run out of disk space as temporary files are written to the user's home directory. To avoid this on a DSVM, we attached an additional disk to the DSVM and made modifications to the Spark configuration. This is done by including the following lines in the file at `/dsvm/tools/spark/current/conf/spark-env.sh`. + +```{shell} +SPARK_LOCAL_DIRS="/mnt" +SPARK_WORKER_DIR="/mnt" +SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true, -Dspark.worker.cleanup.appDataTtl=3600, -Dspark.worker.cleanup.interval=300, -Dspark.storage.cleanupFilesAfterExecutorExit=true" +``` + +* Another source of problems is when the variable `SPARK_HOME` is not set correctly. In the Azure DSVM, `SPARK_HOME` is by default `/dsvm/tools/spark/current`. We need to unset it: +``` +unset SPARK_HOME +``` + +* We found that there might be conflicts between the current MMLSpark jars available in the DSVM and the ones used by the library. In that case, it is better to remove those jars and rely on loading them from Maven or other repositories made available by MMLSpark team. 
+ +``` +cd /dsvm/tools/spark/current/jars +sudo rm -rf Azure_mmlspark-0.12.jar com.microsoft.cntk_cntk-2.4.jar com.microsoft.ml.lightgbm_lightgbmlib-2.0.120.jar +``` From 005dcd06bd32ab996c7e4c9892ebbfcab80adb1a Mon Sep 17 00:00:00 2001 From: Tao Wu <21267949+wutaomsft@users.noreply.github.com> Date: Mon, 10 Apr 2023 09:01:04 -0400 Subject: [PATCH 2/6] update gcc; typo --- README.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index b0e4505a47..b7b66f6fbb 100644 --- a/README.md +++ b/README.md @@ -34,9 +34,12 @@ For a more detailed overview of the repository, please see the documents on the ## Getting Started -We recommend [conda](https://docs.conda.io/projects/conda/en/latest/glossary.html?highlight=environment#conda-environment) for environment management, and [vscode](https://code.visualstudio.com/) for development. To install the recommenders package and run an example notebook: +We recommend [conda](https://docs.conda.io/projects/conda/en/latest/glossary.html?highlight=environment#conda-environment) for environment management, and [VS Code](https://code.visualstudio.com/) for development. To install the recommenders package and run an example notebook on Linux/WSL: ```bash +# Install gcc if it is not installed already. On Ubuntu, this could done by using the command +# sudo apt install gcc + # Create and activate a new conda environment conda create -n python=3.9 conda activate @@ -50,13 +53,13 @@ python -m ipykernel install --user --name --display-name ; # 3. Run the notebook. ``` -For more information about setup including extras, as well as configurations for GPU, Spark and Docker container, see the [setup guide](SETUP.md). +For more information about setup on different platforms (e.g., Windows and macOS) and configurations (GPU, Spark and Docker container), see the [setup guide](SETUP.md). 
In addition to the core package, several extras are also provided, including: + `[examples]`: Needed for running examples. @@ -64,7 +67,7 @@ In addition to the core package, several extras are also provided, including: + `[spark]`: Needed for running Spark models. + `[dev]`: Needed for development for the repo. + `[all]`: `[examples]`|`[gpu]`|`[spark]`|`[dev]` -+ `[experimental]`: Models that are not throughly tested and/or may require additional steps in installation. ++ `[experimental]`: Models that are not thoroughly tested and/or may require additional steps in installation. + `[nni]`: Needed for running models integrated with [NNI](https://nni.readthedocs.io/en/stable/). From c3ef4f5a7367983b6f66c8be5e433f2d8038b13d Mon Sep 17 00:00:00 2001 From: Tao Wu <21267949+wutaomsft@users.noreply.github.com> Date: Mon, 10 Apr 2023 09:16:10 -0400 Subject: [PATCH 3/6] Update Setup/README for Java/Spark instructions --- README.md | 18 +++++++++--------- SETUP.md | 19 +++++++++++-------- 2 files changed, 20 insertions(+), 17 deletions(-) diff --git a/README.md b/README.md index b7b66f6fbb..d1f936d280 100644 --- a/README.md +++ b/README.md @@ -37,26 +37,26 @@ For a more detailed overview of the repository, please see the documents on the We recommend [conda](https://docs.conda.io/projects/conda/en/latest/glossary.html?highlight=environment#conda-environment) for environment management, and [VS Code](https://code.visualstudio.com/) for development. To install the recommenders package and run an example notebook on Linux/WSL: ```bash -# Install gcc if it is not installed already. On Ubuntu, this could done by using the command +# 1. Install gcc if it is not installed already. On Ubuntu, this could be done by using the command # sudo apt install gcc -# Create and activate a new conda environment +# 2. Create and activate a new conda environment conda create -n python=3.9 conda activate -# Install the recommenders package with examples +# 3. 
Install the recommenders package with examples pip install recommenders[examples] -# create a Jupyter kernel +# 4. create a Jupyter kernel python -m ipykernel install --user --name --display-name -# Clone this repo within vscode or using command: +# 5. Clone this repo within vscode or using command: git clone https://github.com/microsoft/recommenders.git -# Within VS Code: -# 1. Open a notebook, e.g., examples/00_quick_start/sar_movielens.ipynb; -# 2. Select Jupyter kernel ; -# 3. Run the notebook. +# 6. Within VS Code: +# a. Open a notebook, e.g., examples/00_quick_start/sar_movielens.ipynb; +# b. Select Jupyter kernel ; +# c. Run the notebook. ``` For more information about setup on different platforms (e.g., Windows and macOS) and configurations (GPU, Spark and Docker container), see the [setup guide](SETUP.md). diff --git a/SETUP.md b/SETUP.md index 0446b907c7..b57bd29972 100644 --- a/SETUP.md +++ b/SETUP.md @@ -59,18 +59,21 @@ Follow the [Getting Started](./README.md#Getting-Started) section in the [README ## Setup for Spark -Make sure you have installed JDK (we tested on Java 8 and 11). FIXME - instrcutions are on 11. -You can install OpenJDK 11 using the command `[sudo apt-get install openjdk-11-jdk]`. -Then, ```bash -# Within vscode: -# 1. Open a notebook with a Spark model, e.g., examples/00_quick_start/als_movielens.ipynb; -# 2. Select Jupyter kernel ; -# 3. Run the notebook. +# 1. Make sure JDK is installed. For example, OpenJDK 11 can be installed using the command +# sudo apt-get install openjdk-11-jdk + +# 2. Follow Steps 1-5 in [Getting Started](./README.md#Getting-Started) section in [README](./README.md) to install the package and Jupyter kernel, adding the spark extra to the pip install command: +pip install recommenders[examples,spark] + +# 3. Within VS Code: +# a. Open a notebook with a Spark model, e.g., examples/00_quick_start/als_movielens.ipynb; +# b. Select Jupyter kernel ; +# c. Run the notebook. 
``` -TODO 0401 - Databricks +TODO 0410 - Databricks ## Setup guide for Azure Databricks ### Requirements From 46b828c58b18e506291737ff40da682b583b9468 Mon Sep 17 00:00:00 2001 From: Tao Wu <21267949+wutaomsft@users.noreply.github.com> Date: Sun, 16 Apr 2023 22:27:39 -0400 Subject: [PATCH 4/6] Completed Spark; need o16n update --- SETUP.md | 371 ++++++------------------------------------------------- 1 file changed, 35 insertions(+), 336 deletions(-) diff --git a/SETUP.md b/SETUP.md index b57bd29972..50afb47d46 100644 --- a/SETUP.md +++ b/SETUP.md @@ -1,33 +1,7 @@ # Setup Guide -The repo, including this guide, is tested on Linux. Where applicable, we document differences in [Windows](Setup_Windows.md) and [MacOS](Setup_MacOS.md) although -such documentation may not always be up to date. We currently have documentation for Docker container, but plan to remove it in the future -due to limited ability to maintain it. - -FIXME - the following three lines should be removed. -* Local (Linux, MacOS or Windows) or [DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) (Linux or Windows) -* [Azure Databricks](https://azure.microsoft.com/en-us/services/databricks/) -* Docker container - -## Table of Contents - - [Extras](#extras) - - [Compute environments](#compute-environments) - - [Setup guide for Local or DSVM](#setup-guide-for-local-or-dsvm) - - [Requirements](#requirements) - - [Dependencies setup](#dependencies-setup) - - [Using a virtual environment](#using-a-virtual-environment) - - [Register the environment as a kernel in Jupyter](#register-the-environment-as-a-kernel-in-jupyter) - - [Troubleshooting for the DSVM](#troubleshooting-for-the-dsvm) - - [Setup guide for Azure Databricks](#setup-guide-for-azure-databricks) - - [Requirements of Azure Databricks](#requirements-1) - - [Installation from PyPI](#installation-from-pypi) - - [Dependencies setup](#dependencies-setup-1) - - [Confirm Installation](#confirm-installation) - 
- [Troubleshooting Installation on Azure Databricks](#troubleshooting-installation-on-azure-databricks) - - [Prepare Azure Databricks for Operationalization](#prepare-azure-databricks-for-operationalization) - - [Setup guide for Docker](#setup-guide-for-docker) - - [Setup guide for making a release](#setup-guide-for-making-a-release) - +The repo, including this guide, is tested on Linux. Where applicable, we document differences in [Windows](#windows-specific-instructions) and [macOS](#macos-specific-instructions) although +such documentation may not always be up to date. ## Extras In addition to the pip installable package, several extras are provided, including: @@ -39,23 +13,11 @@ In addition to the pip installable package, several extras are provided, includi + `[experimental]`: Models that are not throughly tested and/or may require additional steps in installation). + `[nni]`: Needed for running models integrated with [NNI](https://nni.readthedocs.io/en/stable/). -## Test environments - -Depending on the type of recommender system and the notebook that needs to be run, there are different computational requirements. - -Currently, tests are done on **Python CPU** (the base environment), **Python GPU** (corresponding to `[gpu]` extra above) and **PySpark** (corresponding to `[spark]` extra above). - -Another way is to build a docker image and use the functions inside a [docker container](#setup-guide-for-docker). - -Another alternative is to run all the recommender utilities directly from a local copy of the source code. This requires installing all the necessary dependencies from Anaconda and PyPI. For instructions on how to do this, see [this guide](conda.md). ## Setup for Core Package Follow the [Getting Started](./README.md#Getting-Started) section in the [README](./README.md) to install the package and run the examples. -**NOTE** the models from Cornac require installation of `libpython` i.e. 
using `sudo apt-get install -y libpython3.x`, depending on the version of Python. -### Dependencies setup - ## Setup for Spark @@ -72,111 +34,20 @@ pip install recommenders[examples,spark] # c. Run the notebook. ``` +## Setup for Azure Databricks -TODO 0410 - Databricks -## Setup guide for Azure Databricks +The following instructions were tested on Azure Databricks Runtime 12.2 LTS (Apache Spark version 3.3.2) and 11.3 LTS (Apache Spark version 3.3.0). +As of April 2023, Databricks Runtime 13 is not yet supported as it is on Python 3.10. -### Requirements - -* Databricks Runtime version >= 7, <= 9 (Apache Spark >= 3.0, <= 3.1, Scala 2.12) -* Python 3.6 - 3.9 - -Earlier versions of Databricks or Spark may work but this is not guaranteed. -An example of how to create an Azure Databricks workspace and an Apache Spark cluster within the workspace can be found from [here](https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal). To utilize deep learning models and GPUs, you may setup GPU-enabled cluster. For more details about this topic, please see [Azure Databricks deep learning guide](https://docs.azuredatabricks.net/applications/deep-learning/index.html). - -### Installation from PyPI - -The `recommenders` package can be installed with core dependencies for utilities and CPU-based algorithms. -This is done from the _Libraries_ link at the cluster, selecting the option to import a library and selecting _PyPI_ in the menu. -For installations with more dependencies, see the steps below. - -### Dependencies setup - -You can setup the repository as a library on Databricks either manually or by running an [installation script](tools/databricks_install.py). Both options assume you have access to a provisioned Databricks workspace and cluster and that you have appropriate permissions to install libraries. - -
-Quick install - -This option utilizes an installation script to do the setup, and it requires additional dependencies in the environment used to execute the script. - -> To run the script, following **prerequisites** are required: -> * Setup CLI authentication for [Azure Databricks CLI (command-line interface)](https://docs.azuredatabricks.net/user-guide/dev-tools/databricks-cli.html#install-the-cli). Please find details about how to create a token and set authentication [here](https://docs.azuredatabricks.net/user-guide/dev-tools/databricks-cli.html#set-up-authentication). Very briefly, you can install and configure your environment with the following commands. -> -> ```{shell} -> conda activate reco_pyspark -> databricks configure --token -> ``` -> -> * Get the target **cluster id** and **start** the cluster if its status is *TERMINATED*. -> * You can get the cluster id from the databricks CLI with: -> ```{shell} -> databricks clusters list -> ``` -> * If required, you can start the cluster with: -> ```{shell} -> databricks clusters start --cluster-id ` -> ``` - -The installation script has a number of options that can also deal with different databricks-cli profiles, install a version of the mmlspark library, overwrite the libraries, or prepare the cluster for operationalization. For all options, please see: - -```{shell} -python tools/databricks_install.py -h -``` - -Once you have confirmed the databricks cluster is *RUNNING*, install the modules within this repository with the following commands. - -```{shell} -cd Recommenders -python tools/databricks_install.py -``` - -**Note** If you are planning on running through the sample code for operationalization [here](examples/05_operationalize/als_movie_o16n.ipynb), you need to prepare the cluster for operationalization. You can do so by adding an additional option to the script run. 
is the same as that mentioned above, and can be identified by running `databricks clusters list` and selecting the appropriate cluster. - -```{shell} -python tools/databricks_install.py --prepare-o16n -``` - -See below for details. - -
- -
-Manual setup - -To install the repo manually onto Databricks, follow the steps: - -1. Clone the Microsoft Recommenders repository to your local computer. -2. Zip the contents inside the Recommenders folder (Azure Databricks requires compressed folders to have the `.egg` suffix, so we don't use the standard `.zip`): - - ```{shell} - cd Recommenders - zip -r Recommenders.egg . - ``` - -3. Once your cluster has started, go to the Databricks workspace, and select the `Home` button. -4. Your `Home` directory should appear in a panel. Right click within your directory, and select `Import`. -5. In the pop-up window, there is an option to import a library, where it says: `(To import a library, such as a jar or egg, click here)`. Select `click here`. -6. In the next screen, select the option `Upload Python Egg or PyPI` in the first menu. -7. Next, click on the box that contains the text `Drop library egg here to upload` and use the file selector to choose the `Recommenders.egg` file you just created, and select `Open`. -8. Click on the `Create library`. This will upload the egg and make it available in your workspace. -9. Finally, in the next menu, attach the library to your cluster. - -
- -### Confirm Installation - -After installation, you can now create a new notebook and import the utilities from Databricks in order to confirm that the import worked. - -```{python} -import recommenders +After an Azure Databricks cluster is provisioned: +```bash +# 1. Go to the "Compute" tab on the left of the page, click on the provisioned cluster and then click on "Libraries". +# 2. Click the "Install new" button. +# 3. In the popup window, select "PyPI" as the library source. Enter "recommenders[examples]" as the package name. Click "Install" to install the package. ``` -### Troubleshooting Installation on Azure Databricks - -* For the [recommenders](recommenders) import to work on Databricks, it is important to zip the content correctly. The zip has to be performed inside the Recommenders folder, if you zip directly above the Recommenders folder, it won't work. - ### Prepare Azure Databricks for Operationalization - + This repository includes an end-to-end example notebook that uses Azure Databricks to estimate a recommendation model using matrix factorization with Alternating Least Squares, writes pre-computed recommendations to Azure Cosmos DB, and then creates a real-time scoring service that retrieves the recommendations from Cosmos DB. In order to execute that [notebook](examples/05_operationalize/als_movie_o16n.ipynb), you must install the Recommenders repository as a library (as described above), **AND** you must also install some additional dependencies. With the *Quick install* method, you just need to pass an additional option to the [installation script](tools/databricks_install.py).
@@ -221,11 +92,34 @@ Additionally, you must install the [spark-cosmosdb connector](https://docs.datab +## Setup for Experimental + +The `xlearn` package has dependency on `cmake`. If one uses the `xlearn` related notebooks or scripts, make sure `cmake` is installed in the system. The easiest way to install on Linux is with apt-get: `sudo apt-get install -y build-essential cmake`. Detailed instructions for installing `cmake` from source can be found [here](https://cmake.org/install/). + +## Windows-Specific Instructions + +For Spark features to work, make sure Java and Spark are installed and respective environment variables such as `JAVA_HOME`, `SPARK_HOME` and `HADOOP_HOME` are set properly. Also make sure environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are set to the same python executable. + +## macOS-Specific Instructions +We recommend using [Homebrew](https://brew.sh/) to install the dependencies on macOS, including conda (please remember to add conda's path to `$PATH`). One may also need to install lightgbm using Homebrew before pip install the package. +If zsh is used, one will need to use `pip install 'recommenders[]'` to install \. +For Spark features to work, make sure Java and Spark are installed first. Also make sure environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are set to the same python executable. + -## Setup guide for making a release +## Test Environments + +Depending on the type of recommender system and the notebook that needs to be run, there are different computational requirements. + +Currently, tests are done on **Python CPU** (the base environment), **Python GPU** (corresponding to `[gpu]` extra above) and **PySpark** (corresponding to `[spark]` extra above). + +Another way is to build a docker image and use the functions inside a [docker container](#setup-guide-for-docker). + +Another alternative is to run all the recommender utilities directly from a local copy of the source code.
This requires installing all the necessary dependencies from Anaconda and PyPI. For instructions on how to do this, see [this guide](conda.md). + +## Setup for Making a Release The process of making a new release and publishing it to pypi is as follows: @@ -241,198 +135,3 @@ generates a wheel and a tar.gz which are uploaded to a [GitHub draft release](ht 1. Install twine: `pip install twine` 1. Publish the wheel and tar.gz to pypi: `twine upload recommenders*` - -## Setup for Experimental - -**NOTE** the `xlearn` package has dependency on `cmake`. If one uses the `xlearn` related notebooks or scripts, make sure `cmake` is installed in the system. The easiest way to install on Linux is with apt-get: `sudo apt-get install -y build-essential cmake`. Detailed instructions for installing `cmake` from source can be found [here](https://cmake.org/install/). - - -# MacOS-Specific Instructions - -
-Install Java on MacOS - -To install e.g. Java 8 on MacOS using [asdf](https://github.com/halcyon/asdf-java): - - brew install asdf - asdf plugin add Java - asdf install java adoptopenjdk-8.0.265+1 - asdf global java adoptopenjdk-8.0.265+1 - . ~/.asdf/plugins/java/set-java-home.zsh - -
- - -Then, we need to set the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the conda python executable. - - -# Windows-Specific Instructions - -
Set PySpark environment variables on Windows - -To set these variables every time the environment is activated, we can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#windows). -First, get the path of the environment `reco_pyspark` is installed: - - for /f "delims=" %A in ('conda env list ^| grep reco_pyspark ^| awk "{print $NF}"') do set "RECO_ENV=%A" - -Then, create the file `%RECO_ENV%\etc\conda\activate.d\env_vars.bat` and add: - - @echo off - for /f "delims=" %%A in ('conda env list ^| grep reco_pyspark ^| awk "{print $NF}"') do set "RECO_ENV=%%A" - set PYSPARK_PYTHON=%RECO_ENV%\python.exe - set PYSPARK_DRIVER_PYTHON=%RECO_ENV%\python.exe - set SPARK_HOME_BACKUP=%SPARK_HOME% - set SPARK_HOME= - set PYTHONPATH_BACKUP=%PYTHONPATH% - set PYTHONPATH= - -This will export the variables every time we do `conda activate reco_pyspark`. -To unset these variables when we deactivate the environment, -create the file `%RECO_ENV%\etc\conda\deactivate.d\env_vars.bat` and add: - - @echo off - set PYSPARK_PYTHON= - set PYSPARK_DRIVER_PYTHON= - set SPARK_HOME=%SPARK_HOME_BACKUP% - set SPARK_HOME_BACKUP= - set PYTHONPATH=%PYTHONPATH_BACKUP% - set PYTHONPATH_BACKUP= - -
- -## Setup guide for Docker - -A [Dockerfile](tools/docker/Dockerfile) is provided to build images of the repository to simplify setup for different environments. You will need [Docker Engine](https://docs.docker.com/install/) installed on your system. - -*Note: `docker` is already available on Azure Data Science Virtual Machine* - -See guidelines in the Docker [README](tools/docker/README.md) for detailed instructions of how to build and run images for different environments. - -Example command to build and run Docker image with base CPU environment. -```{shell} -DOCKER_BUILDKIT=1 docker build -t recommenders:cpu --build-arg ENV="cpu" --build-arg VIRTUAL_ENV="conda" . -docker run -p 8888:8888 -d recommenders:cpu -``` - -You can then open the Jupyter notebook server at http://localhost:8888 - - - -Click on the following menus to see details: -
-Set PySpark environment variables on Linux or MacOS - -If you use conda, to set these variables every time the environment is activated, you can follow the steps of this [guide](https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux). - -First, assuming that the environment is called `reco_pyspark`, get the path where the environment is installed: - - RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}') - mkdir -p $RECO_ENV/etc/conda/activate.d - mkdir -p $RECO_ENV/etc/conda/deactivate.d - -Then, create the file `$RECO_ENV/etc/conda/activate.d/env_vars.sh` and add: - -```bash -#!/bin/sh -RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}') -export PYSPARK_PYTHON=$RECO_ENV/bin/python -export PYSPARK_DRIVER_PYTHON=$RECO_ENV/bin/python -unset SPARK_HOME -``` - -This will export the variables every time we do `conda activate reco_pyspark`. To unset these variables when we deactivate the environment, create the file `$RECO_ENV/etc/conda/deactivate.d/env_vars.sh` and add: - -```bash -#!/bin/sh -unset PYSPARK_PYTHON -unset PYSPARK_DRIVER_PYTHON -``` - -
- - - -### Using a virtual environment - -It is straightforward to install the recommenders package within a [virtual environment](https://docs.python.org/3/library/venv.html). However, setting up CUDA for use with a GPU can be cumbersome. We thus -recommend setting up [Nvidia docker](https://github.com/NVIDIA/nvidia-docker) and running the virtual environment within a container, as the most convenient way to do this. -In the following `3.6` should be replaced with the Python version you are using and `8` should be replaced with the appropriate Java version. - - # Start docker daemon if not running - sudo dockerd & - # Pull the image from the Nvidia docker hub (https://hub.docker.com/r/nvidia/cuda) that is suitable for your system - # E.g. for Ubuntu 18.04 do - sudo docker run --gpus all -it --rm nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04 - - # Within the container: - - apt-get -y update - apt-get -y install python3.6 - apt-get -y install python3-pip - apt-get -y install python3.6-venv - apt-get -y install libpython3.6-dev - apt-get -y install cmake - apt-get install -y libgomp1 openjdk-8-jre - export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 - - python3.6 -m venv --system-site-packages /venv - source /venv/bin/activate - pip install --upgrade pip - pip install --upgrade setuptools - - export SPARK_HOME=/venv/lib/python3.6/site-packages/pyspark - export PYSPARK_DRIVER_PYTHON=/venv/bin/python - export PYSPARK_PYTHON=/venv/bin/python - - pip install recommenders[all] - - - -### Register the environment as a kernel in Jupyter - -We can register our conda or virtual environment to appear as a kernel in the Jupyter notebooks. 
After activating the environment (`my_env_name`) do - - python -m ipykernel install --user --name my_env_name --display-name "Python (my_env_name)" - -If you are using the DSVM, you can [connect to JupyterHub](https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro#jupyterhub-and-jupyterlab) by browsing to `https://your-vm-ip:8000`. - - -### Troubleshooting for the DSVM - -* We found that there can be problems if the Spark version of the machine is not the same as the one in the [conda file](conda.md). You can use the option `--pyspark-version` to address this issue. - -* When running Spark on a single local node it is possible to run out of disk space as temporary files are written to the user's home directory. To avoid this on a DSVM, we attached an additional disk to the DSVM and made modifications to the Spark configuration. This is done by including the following lines in the file at `/dsvm/tools/spark/current/conf/spark-env.sh`. - -```{shell} -SPARK_LOCAL_DIRS="/mnt" -SPARK_WORKER_DIR="/mnt" -SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true, -Dspark.worker.cleanup.appDataTtl=3600, -Dspark.worker.cleanup.interval=300, -Dspark.storage.cleanupFilesAfterExecutorExit=true" -``` - -* Another source of problems is when the variable `SPARK_HOME` is not set correctly. In the Azure DSVM, `SPARK_HOME` is by default `/dsvm/tools/spark/current`. We need to unset it: -``` -unset SPARK_HOME -``` - -* We found that there might be conflicts between the current MMLSpark jars available in the DSVM and the ones used by the library. In that case, it is better to remove those jars and rely on loading them from Maven or other repositories made available by MMLSpark team. 
- -``` -cd /dsvm/tools/spark/current/jars -sudo rm -rf Azure_mmlspark-0.12.jar com.microsoft.cntk_cntk-2.4.jar com.microsoft.ml.lightgbm_lightgbmlib-2.0.120.jar -``` From 0b721b66835991ea7ae4c72d37261fe47800309e Mon Sep 17 00:00:00 2001 From: Tao Wu <21267949+wutaomsft@users.noreply.github.com> Date: Sun, 16 Apr 2023 22:35:20 -0400 Subject: [PATCH 5/6] Updated README --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index d1f936d280..51f11f0e02 100644 --- a/README.md +++ b/README.md @@ -59,7 +59,7 @@ git clone https://github.com/microsoft/recommenders.git # c. Run the notebook. ``` -For more information about setup on different platforms (e.g., Windows and macOS) and configurations (GPU, Spark and Docker container), see the [setup guide](SETUP.md). +For more information about setup on different platforms (e.g., Windows and macOS) and configurations (GPU, Spark and experimental features), see the [setup guide](SETUP.md). In addition to the core package, several extras are also provided, including: + `[examples]`: Needed for running examples. From 47fe0e1a0c5760c59c1e3e19fe4e9909d89ea303 Mon Sep 17 00:00:00 2001 From: Tao Wu <21267949+wutaomsft@users.noreply.github.com> Date: Sun, 16 Apr 2023 22:36:08 -0400 Subject: [PATCH 6/6] updated --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 51f11f0e02..005828f921 100644 --- a/README.md +++ b/README.md @@ -59,7 +59,7 @@ git clone https://github.com/microsoft/recommenders.git # c. Run the notebook. ``` -For more information about setup on different platforms (e.g., Windows and macOS) and configurations (GPU, Spark and experimental features), see the [setup guide](SETUP.md). +For more information about setup on other platforms (e.g., Windows and macOS) and different configurations (e.g., GPU, Spark and experimental features), see the [Setup Guide](SETUP.md). 
In addition to the core package, several extras are also provided, including: + `[examples]`: Needed for running examples.