
Commit 8944849

Doc refactoring and update hello-world sample (#2445)
squirrelsc authored Mar 29, 2019
1 parent a7525dc commit 8944849
Showing 2 changed files with 19 additions and 19 deletions.
22 changes: 11 additions & 11 deletions README.md
@@ -91,17 +91,17 @@ For a large size cluster, this section is still needed to generate default confi…

#### Customize deployment

- As various hardware environments and different use scenarios, default configuration of OpenPAI may need to be updated. Following [Customize deployment](docs/pai-management/doc/how-to-generate-cluster-config.md#Optional-Step-3.-Customize-configure-OpenPAI) part to learn more details.
+ Because hardware environments and use scenarios vary, the default configuration of OpenPAI may need to be optimized. Follow the [Customize deployment](docs/pai-management/doc/how-to-generate-cluster-config.md#Optional-Step-3.-Customize-configure-OpenPAI) section to learn more details.

### Validate deployment

After deployment, it's recommended to [validate key components of OpenPAI](docs/pai-management/doc/validate-deployment.md) are in a healthy status. After validation succeeds, [submit a hello-world job](docs/user/training.md) and check if it works end-to-end.

### Train users before "train models"

- The common practice on OpenPAI is to submit job requests, and wait jobs got computing resource and executed. It's different experience with assigning dedicated servers to each one. People may feel computing resource is not in control and the learning curve may be higher than run job on dedicated servers. But shared resource on OpenPAI can improve productivity significantly and save time on maintaining environments.
+ The common practice on OpenPAI is to submit job requests and wait until jobs get computing resources and are executed. This is a different experience from assigning dedicated servers to each person. People may feel the computing resources are not under their control, and the learning curve may be steeper than running jobs on dedicated servers. But sharing resources on OpenPAI can improve resource utilization and save time on maintaining environments.

- For administrators of OpenPAI, a successful deployment is first step, the second step is to let users of OpenPAI understand benefits and know how to use it. Users of OpenPAI can learn from [Train models](#train-models). But below content is for various scenarios and may be too much to specific users. So, a simplified document based on below content is easier to learn.
+ For administrators of OpenPAI, a successful deployment is the first step; the second step is to let users of OpenPAI understand the benefits and know how to use it. Users can learn from [Train models](#train-models). But that part covers various scenarios, and users may not need all of them. So, administrators can create simplified documents for users' actual scenarios.

### FAQ

@@ -111,23 +111,23 @@ If FAQ doesn't resolve it, refer to [here](#get-involved) to ask question or sub…

## Train models

- Like all machine learning platforms, OpenPAI is a productive tool. To maximize utilization, it's recommended to submit training jobs and let OpenPAI to allocate resource and run it. If there are too many jobs, some jobs may be queued until enough resource available, and OpenPAI choose some server(s) to run a job. This is different with run code on dedicated servers, and it needs a bit more knowledge about how to submit/manage training jobs on OpenPAI.
+ Like all machine learning platforms, OpenPAI is a productivity tool. To maximize resource utilization, it's recommended to submit training jobs and let OpenPAI allocate resources and run them. If there are too many jobs, some jobs are queued until enough resources are available. This is different from running code on dedicated servers, and it requires a bit more knowledge about how to submit and manage training jobs on OpenPAI.

- Note, OpenPAI also supports to allocate on demand resource besides queuing jobs. Users can use SSH or Jupyter to connect like on a physical server, refer to [here](examples/jupyter/README.md) about how to use OpenPAI like this way. Though it's not efficient to resources, but it also saves cost on setup and managing environments on physical servers.
+ Note, besides queuing jobs, OpenPAI also supports allocating dedicated resources. Users can connect with SSH or Jupyter and work as on a physical server; refer to [here](examples/jupyter/README.md) for details. Though this is not resource-efficient, it saves the cost of setting up and managing environments on physical servers.
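Connecting to such an allocation is plain SSH once the job is running. A minimal hypothetical sketch — the address, port, and key file are placeholders for the real values shown on the job's detail page:

```sh
# Hypothetical values: copy the real IP, port, and private key
# from the job detail page in the web UI.
ssh -i ~/.ssh/openpai-job-key -p 34567 root@10.0.0.12
```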

### Submit training jobs

- Follow [submitting a hello-world job](docs/user/training.md), and learn more about training models on OpenPAI. It's a very simple job and used to understand OpenPAI job definition and familiar with Web portal.
+ Follow [submitting a hello-world job](docs/user/training.md) to learn more about training models on OpenPAI. It's a very simple job that introduces the OpenPAI job configuration and the web UI.

### OpenPAI VS Code Client

[OpenPAI VS Code Client](contrib/pai_vscode/VSCodeExt.md) is a friendly, GUI based client tool of OpenPAI. It's an extension of Visual Studio Code. It can submit jobs, simulate jobs running locally, manage multiple OpenPAI environments, and so on.

### Troubleshooting job failure

- Web portal and job log are helpful to analyze job failure, and OpenPAI supports SSH into environment for debugging.
+ The web UI and job logs are helpful for analyzing job failures, and OpenPAI supports SSH into the job environment for debugging.

- Refer to [here](docs/user/troubleshooting_job.md) for more information about troubleshooting job failure. It's recommended to get code succeeded locally, then submit to OpenPAI. It reduces posibility to troubleshoot remotely.
+ Refer to [here](docs/user/troubleshooting_job.md) for more information about troubleshooting job failures.

## Administration

@@ -137,7 +137,7 @@

## Reference

- * [Job definition](docs/job_tutorial.md)
+ * [Job configuration](docs/job_tutorial.md)
* [RESTful API](docs/rest-server/API.md)
* Design documents could be found [here](docs).

@@ -167,8 +167,8 @@ contact [[email protected]](mailto:[email protected]) with any additio…

We are working on a set of major feature improvements and refactoring; anyone who is familiar with the features is encouraged to join the design review and discussion in the corresponding issue ticket.

- * PAI virtual cluster design. [Issue 1754](https://github.com/Microsoft/pai/issues/1754)
- * PAI protocol design. [Issue 2007](https://github.com/Microsoft/pai/issues/2007)
+ * OpenPAI virtual cluster design. [Issue 1754](https://github.com/Microsoft/pai/issues/1754)
+ * OpenPAI protocol design. [Issue 2007](https://github.com/Microsoft/pai/issues/2007)

### Who should consider contributing to OpenPAI

16 changes: 8 additions & 8 deletions docs/user/training.md
@@ -23,7 +23,7 @@
- [Submit a hello-world job](#submit-a-hello-world-job)
- [Understand job](#understand-job)
- [Learn hello-world job](#learn-hello-world-job)
- - [Exchange data](#exchange-data)
+ - [Transfer files](#transfer-files)
- [Job workflow](#job-workflow)
- [Reference](#reference)

@@ -54,15 +54,15 @@ Following this section to submit a very simple job like hello-world during learn…
```json
{
  "jobName": "tensorflow-cifar10",
-  "image": "ufoym/deepo:tensorflow-py36-cu90",
+  "image": "tensorflow/tensorflow:1.12.0-gpu-py3",
  "taskRoles": [
    {
      "name": "default",
      "taskNumber": 1,
      "cpuNumber": 4,
      "memoryMB": 8192,
      "gpuNumber": 1,
-      "command": "git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=/tmp/data && python train_image_classifier.py --dataset_name=cifar10 --dataset_dir=/tmp/data --max_number_of_steps=1000"
+      "command": "apt update && apt install -y git && git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=/tmp/data && python train_image_classifier.py --dataset_name=cifar10 --dataset_dir=/tmp/data --max_number_of_steps=1000"
    }
  ]
}
```
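To submit this configuration without the web portal, the JSON can be posted to the OpenPAI REST server. A hedged sketch with `curl` — the server address and credentials are placeholders, and the endpoint paths are assumptions that should be verified against the [RESTful API](../rest-server/API.md) reference:

```sh
# Assumed v1 endpoints; verify against the RESTful API reference.
# 1. Get an access token (placeholder address and credentials).
TOKEN=$(curl -s -X POST http://pai-master.example.com/rest-server/api/v1/token \
  -H "Content-Type: application/json" \
  -d '{"username": "myuser", "password": "mypassword"}' \
  | python -c "import json, sys; print(json.load(sys.stdin)['token'])")

# 2. Post the job configuration saved above as hello-world.json.
curl -X POST http://pai-master.example.com/rest-server/api/v1/user/myuser/jobs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @hello-world.json
```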
@@ -94,7 +94,7 @@ The **job configuration** is a JSON file, which is posted to OpenPAI. Here uses…

The JSON file of a job has two levels of entries. The top level includes shared information of the job, including job name, docker image, task roles, and so on. The second level, taskRoles, is an array; each item in the array specifies commands and the corresponding running environment.

- Below is key part of all fields and [full spec of job configuration](../job_tutorial.md) is here.
+ Below are the required fields; the [full spec of job configuration](../job_tutorial.md) is here.

- **jobName** is the unique name of the current job; it's also displayed on the web. A meaningful name helps manage jobs well.

@@ -126,13 +126,13 @@

Like the hello-world job, users need to construct command(s) to get code and data, and to trigger execution, as in the sketch below.
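A hypothetical command value — the repository URL and script names are placeholders, not part of the sample:

```sh
# Placeholder repo and entry script; steps are chained with &&
# so the job fails fast if any step fails.
git clone https://example.com/me/my-model.git && \
  cd my-model && \
  python -m pip install -r requirements.txt && \
  python train.py --data-dir /tmp/data
```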

- ### Exchange data
+ ### Transfer files

- The data here doesn't only mean *dataset* of machine learning, also includes all files and information, like code, scripts, trained model, and so on. Most model training and other kinds of jobs need to exchange data between docker container and outside.
+ Most model training and other kinds of jobs need to transfer files between the docker container and the outside on OpenPAI. The files include datasets, code, scripts, trained models, and so on.

- OpenPAI creates a clean docker container. Some data can be built into docker image directly if it's changed rarely.
+ OpenPAI creates a clean docker container for each run. Files that rarely change can be built into the docker image directly, as in the sketch below.
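For example, a minimal sketch of baking a rarely-changed dependency into a custom image — the image name and registry are placeholders, and pushing assumes a registry the cluster can pull from:

```sh
# Hypothetical image name and registry. With git preinstalled, the
# hello-world command no longer needs "apt update && apt install -y git".
cat > Dockerfile <<'EOF'
FROM tensorflow/tensorflow:1.12.0-gpu-py3
RUN apt update && apt install -y git
EOF
docker build -t registry.example.com/my-tf:1.12.0-gpu-py3 .
docker push registry.example.com/my-tf:1.12.0-gpu-py3
```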

- If it needs to exchange data on runtime, the command, which passes to docker in job configuration, needs to initiate the data exchange progress. For example, use `git`, `wget`, `scp`, `sftp` or other commands to copy data in and out. If some command is not built in docker, it can be installed in the command by `apt install` or `python -m pip install`.
+ If files need to be transferred at runtime, the command field, which is passed to docker in the job configuration, is used to initiate the file transfer. For example, use `git`, `wget`, `scp`, `sftp`, other commands, code, or scripts to copy files in and out. If a command is not built into the docker image, it can be installed in the command field by `apt install ...` or `python -m pip install ...`.
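For instance, a hedged sketch of a command field that pulls code and data in and copies the trained model out — the hosts, paths, and accounts are all placeholders:

```sh
# Placeholder hosts and paths; assumes the container can reach them
# and that SSH keys for scp are provisioned in the image or at runtime.
git clone https://example.com/me/my-model.git && cd my-model && \
  wget -O /tmp/data.tar.gz http://fileserver.example.com/dataset.tar.gz && \
  mkdir -p /tmp/data && tar -xzf /tmp/data.tar.gz -C /tmp/data && \
  python train.py --data-dir /tmp/data --model-dir /tmp/model && \
  scp -r /tmp/model backup@fileserver.example.com:/models/my-model
```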

It's better to check with the administrator of the OpenPAI cluster, since there may already be suggested approaches and examples.

