This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Refine document for installation and platform supports (#1978)
squirrelsc authored Jan 23, 2020
1 parent 85d5ecb commit 71fbff1
Showing 5 changed files with 183 additions and 136 deletions.
79 changes: 26 additions & 53 deletions README.md
@@ -167,7 +167,7 @@ Within the following table, we summarized the current NNI capabilities, we are g
</ul>
</ul>
</td>
</tr>
</tr>
<tr align="center" valign="bottom">
</td>
</tr>
@@ -193,18 +193,18 @@ Within the following table, we summarized the current NNI capabilities, we are g
<li><a href="docs/en_US/TrainingService/SupportTrainingService.md">Support TrainingService</a></li>
<li><a href="docs/en_US/TrainingService/HowToImplementTrainingService.md">Implement TrainingService</a></li>
</ul>
</td>
</tr>
</td>
</tr>
</tbody>
</table>

## **Install & Verify**
## **Installation**

**Install through pip**
### **Install**

* We support Linux, MacOS and Windows (local, remote and pai mode) in current stage, Ubuntu 16.04 or higher, MacOS 10.14.1 along with Windows 10.1809 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`.
NNI supports and is tested on Ubuntu >= 16.04, macOS >= 10.14.1, and Windows 10 >= 1809. Simply run the following `pip install` in an environment that has `python 64-bit >= 3.5`.

Linux and MacOS
Linux or macOS

```bash
python3 -m pip install --upgrade nni
@@ -216,65 +216,39 @@ Windows
python -m pip install --upgrade nni
```

Note:

* `--user` can be added if you want to install NNI in your home directory, which does not require any special privileges.
* Currently NNI on Windows support local, remote and pai mode. Anaconda or Miniconda is highly recommended to install NNI on Windows.
* If there is any error like `Segmentation fault`, please refer to [FAQ](docs/en_US/Tutorial/FAQ.md)

**Install through source code**

* We support Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) and Windows (10.1809) in our current stage.

Linux and MacOS

* Run the following commands in an environment that has `python >= 3.5`, `git` and `wget`.

```bash
git clone -b v1.3 https://github.com/Microsoft/nni.git
cd nni
source install.sh
```

Windows

* Run the following commands in an environment that has `python >=3.5`, `git` and `PowerShell`
If you want to try the latest code, please [install NNI](docs/en_US/Tutorial/Installation.md) from source code.

```bash
git clone -b v1.3 https://github.com/Microsoft/nni.git
cd nni
powershell -ExecutionPolicy Bypass -file install.ps1
```
For detailed system requirements of NNI, please refer to [here](docs/en_US/Tutorial/Installation.md#system-requirements).

For the system requirements of NNI, please refer to [Install NNI](docs/en_US/Tutorial/Installation.md)
Note:

For NNI on Windows, please refer to [NNI on Windows](docs/en_US/Tutorial/NniOnWindows.md)
* If a permission error occurs, add `--user` to install NNI in the user directory.
* Currently, NNI on Windows supports local, remote, and pai modes. Anaconda or Miniconda is highly recommended for installing NNI on Windows.
* If an error such as `Segmentation fault` occurs, please refer to the [FAQ](docs/en_US/Tutorial/FAQ.md). For the FAQ on Windows, please refer to [NNI on Windows](docs/en_US/Tutorial/NniOnWindows.md).
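
The interpreter requirement stated above (64-bit Python, version 3.5 or later) can be checked before running `pip install`. The following is a minimal sketch and not part of NNI: the function name `check_python` is hypothetical, and it only inspects the interpreter that executes it.

```python
import struct
import sys


def check_python(version_info=sys.version_info, pointer_bits=struct.calcsize("P") * 8):
    """Return a list of problems; an empty list means the interpreter qualifies.

    Both parameters default to the running interpreter and exist only so the
    check can be exercised with other values.
    """
    problems = []
    # The README asks for Python >= 3.5.
    if tuple(version_info[:2]) < (3, 5):
        problems.append("Python >= 3.5 is required, found %d.%d" % tuple(version_info[:2]))
    # struct.calcsize("P") is the size of a C pointer in bytes: 8 on a 64-bit build.
    if pointer_bits != 64:
        problems.append("a 64-bit Python build is required, found %d-bit" % pointer_bits)
    return problems


if __name__ == "__main__":
    for problem in check_python():
        print("WARNING:", problem)
```

Run it with the same interpreter that will be used for `pip install` (`python3` on Linux or macOS, `python` on Windows); no output means both conditions hold.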

**Verify install**
### **Verify installation**

The following example is an experiment built on TensorFlow. Make sure you have **TensorFlow 1.x installed** before running it. Note that **currently Tensorflow 2.0 is NOT supported**.
The following example is built on TensorFlow 1.x. Make sure **TensorFlow 1.x is used** when running it.

* Download the examples by cloning the source code.

```bash
git clone -b v1.3 https://github.com/Microsoft/nni.git
```

Linux and MacOS
```bash
git clone -b v1.3 https://github.com/Microsoft/nni.git
```

* Run the MNIST example.

```bash
nnictl create --config nni/examples/trials/mnist-tfv1/config.yml
```
Linux or macOS

Windows
```bash
nnictl create --config nni/examples/trials/mnist-tfv1/config.yml
```

* Run the MNIST example.
Windows

```bash
nnictl create --config nni\examples\trials\mnist-tfv1\config_windows.yml
```
```bash
nnictl create --config nni\examples\trials\mnist-tfv1\config_windows.yml
```

* Wait for the message `INFO: Successfully started experiment!` in the command line. This message indicates that your experiment has been successfully started. You can explore the experiment using the `Web UI url`.
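
When the verification step is scripted, the wait for the success message can be automated. The sketch below is an illustration under assumptions, not an NNI API: the helper names `experiment_started` and `create_experiment` are hypothetical, `nnictl` is assumed to be on `PATH`, and the message is assumed to be printed to stdout or stderr before `nnictl create` returns.

```python
import subprocess

# The message quoted in the README.
SUCCESS_MESSAGE = "INFO: Successfully started experiment!"


def experiment_started(output):
    """Return True if the success message appears in the captured output."""
    return SUCCESS_MESSAGE in output


def create_experiment(config_path):
    """Run `nnictl create` and report whether the success message was printed."""
    result = subprocess.run(
        ["nnictl", "create", "--config", config_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        universal_newlines=True,   # decode bytes to str (works on Python 3.5+)
    )
    return result.returncode == 0 and experiment_started(result.stdout)
```

For example, `create_experiment("nni/examples/trials/mnist-tfv1/config.yml")` would be expected to return `True` on Linux or macOS once the experiment has started.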

Expand Down Expand Up @@ -371,4 +345,3 @@ We encourage researchers and students leverage these projects to accelerate the
## **License**

The entire codebase is under [MIT license](LICENSE)

40 changes: 21 additions & 19 deletions docs/en_US/TrainingService/RemoteMachineMode.md
@@ -1,24 +1,32 @@
# Run an Experiment on Multiple Machines
# Run an Experiment on Remote Machines

NNI supports running an experiment on multiple machines through SSH channel, called `remote` mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code.
NNI can run one experiment on multiple remote machines through SSH; this is called `remote` mode. It works like a lightweight training platform: NNI is started from your computer and dispatches trials to the remote machines in parallel.

e.g. Three machines and you login in with account `bob` (Note: the account is not necessarily the same on different machine):
## Remote machine requirements

| IP | Username| Password |
| -------- |---------|-------|
| 10.1.1.1 | bob | bob123 |
| 10.1.1.2 | bob | bob123 |
| 10.1.1.3 | bob | bob123 |
* Only Linux is supported on remote machines. The [Linux part of the system specification](../Tutorial/Installation.md) is the same as for NNI local mode.

## Setup NNI environment
* Follow [installation](../Tutorial/Installation.md) to install NNI on each machine.

Install NNI on each of your machines following the install guide [here](../Tutorial/QuickStart.md).
* Make sure the remote machines meet the environment requirements of your trial code. If the default environment does not meet the requirements, a setup script can be added to the `command` field of the NNI config.

* Make sure the remote machines can be accessed through SSH from the machine that runs the `nnictl` command. Both password and key authentication are supported. For advanced usage, please refer to the [machineList part of configuration](../Tutorial/ExperimentConfig.md).

* Make sure the NNI version on each machine is consistent.
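
The version-consistency requirement can be checked by comparing the `Version:` line that `python3 -m pip show nni` prints on each machine. The sketch below covers only the comparison and is not part of NNI: how the output is collected (for example over SSH) is left out, and the function names are hypothetical.

```python
import re


def parse_pip_show_version(pip_show_output):
    """Extract the value of the `Version:` line from `pip show` output, or None."""
    match = re.search(r"^Version:\s*(\S+)", pip_show_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def group_hosts_by_version(outputs_by_host):
    """Map each NNI version (None if not found) to the sorted hosts reporting it."""
    groups = {}
    for host, output in outputs_by_host.items():
        groups.setdefault(parse_pip_show_version(output), []).append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}
```

If the returned dictionary has more than one key, at least one machine runs a different NNI version (or has no NNI installed, reported under the key `None`).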

## Run an experiment

Install NNI on another machine which has network accessibility to those three machines above, or you can just run `nnictl` on any one of the three to launch the experiment.
For example, suppose there are three machines that can be logged in to with a username and password.

| IP | Username | Password |
| -------- | -------- | -------- |
| 10.1.1.1 | bob | bob123 |
| 10.1.1.2 | bob | bob123 |
| 10.1.1.3 | bob | bob123 |

Install and run NNI on one of those three machines, or on another machine that has network access to them.

We use `examples/trials/mnist-annotation` as an example here. Shown here is `examples/trials/mnist-annotation/config_remote.yml`:
Use `examples/trials/mnist-annotation` as the example. Below is the content of `examples/trials/mnist-annotation/config_remote.yml`:

```yaml
authorName: default
Expand Down Expand Up @@ -58,14 +66,8 @@ machineList:
passwd: bob123
```
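
With many machines, the `machineList` section can be generated from the table of machines instead of being typed by hand. The following is a rough sketch under assumptions: only `machineList` and `passwd` are visible in the excerpt above, so the key names `ip` and `username` are assumptions, the function name is hypothetical, and no YAML quoting or escaping is done (values with special characters would need it).

```python
def render_machine_list(machines):
    """Render (ip, username, passwd) tuples as a YAML `machineList` section."""
    lines = ["machineList:"]
    for ip, username, passwd in machines:
        # `ip` and `username` are assumed key names; `passwd` appears in the excerpt.
        lines.append("  - ip: %s" % ip)
        lines.append("    username: %s" % username)
        lines.append("    passwd: %s" % passwd)
    return "\n".join(lines)


# The three example machines from the table.
machines = [
    ("10.1.1.1", "bob", "bob123"),
    ("10.1.1.2", "bob", "bob123"),
    ("10.1.1.3", "bob", "bob123"),
]
print(render_machine_list(machines))
```

Storing plain-text passwords in a config file is shown here only because the example does so; key authentication, which remote mode also supports, avoids it.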
Files in `codeDir` will be automatically uploaded to the remote machine. You can run NNI on different operating systems (Windows, Linux, MacOS) to spawn experiments on the remote machines (only Linux allowed):
Files in `codeDir` will be uploaded to the remote machines automatically. You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines:

```bash
nnictl create --config examples/trials/mnist-annotation/config_remote.yml
```

You can also use public/private key pairs instead of username/password for authentication. For advanced usages, please refer to [Experiment Config Reference](../Tutorial/ExperimentConfig.md).

## Version check

NNI support version check feature in since version 0.6, [reference](PaiMode.md).
7 changes: 5 additions & 2 deletions docs/en_US/TrainingService/SupportTrainingService.md
@@ -4,10 +4,11 @@ NNI TrainingService provides the training platform for running NNI trial jobs. N
NNI not only provides a few built-in training service options, but also provides a method for customers to build their own training service easily.

## Built-in TrainingService

|TrainingService|Brief Introduction|
|---|---|
|[__Local__](./LocalMode.md)|NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.|
|[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enouth gpu resource if specified.|
|[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enough gpu resource if specified.|
|[__Pai__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.|
|[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.|
|[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.|
@@ -16,7 +17,8 @@ NNI not only provides few built-in training service options, but also provides a

TrainingService is designed to be easy to implement. NNI defines an abstract class, TrainingService, as the parent class of all training services; to implement a customized TrainingService, users only need to inherit from this parent class and complete their own child class.
The abstract functions in TrainingService are shown below:
```

```javascript
abstract class TrainingService {
public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>;
@@ -32,5 +34,6 @@ abstract class TrainingService {
public abstract run(): Promise<void>;
}
```

The TrainingService parent class has a few abstract functions; users need to inherit from the parent class and implement all of them.
For more information about how to write your own TrainingService, please refer to [Implement TrainingService](https://github.com/microsoft/nni/blob/master/docs/en_US/TrainingService/HowToImplementTrainingService.md).
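
The inherit-and-implement pattern described above can be illustrated with a small Python analogy. This is not NNI code: real training services are written against the abstract class shown earlier, and every name below is hypothetical. Only two of the abstract functions are mirrored, and `asyncio.run` in the usage note requires Python 3.7 or later.

```python
import asyncio
from abc import ABC, abstractmethod


class TrainingServiceSketch(ABC):
    """Python stand-in for the abstract parent class (two methods only)."""

    @abstractmethod
    async def list_trial_jobs(self):
        """Counterpart of listTrialJobs()."""

    @abstractmethod
    async def run(self):
        """Counterpart of run()."""


class IncompleteService(TrainingServiceSketch):
    """Implements only one abstract method, so it cannot be instantiated."""

    async def run(self):
        return None


class InMemoryService(TrainingServiceSketch):
    """Implements every abstract method, so it can be instantiated."""

    def __init__(self):
        self.jobs = []

    async def list_trial_jobs(self):
        return list(self.jobs)

    async def run(self):
        self.jobs.append("trial-0")  # hypothetical job id


def can_instantiate(cls):
    """Return True if cls() succeeds, False if an abstract method is missing."""
    try:
        cls()
        return True
    except TypeError:
        return False
```

Here `can_instantiate(IncompleteService)` is `False` and `can_instantiate(InMemoryService)` is `True`, and a complete service can be driven with `asyncio.run(service.run())`. Python rejects the incomplete class at instantiation time, whereas an `abstract class` reports the missing members at compile time.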