From 8944849070be0051044f5c2b8e0219bf69030cb1 Mon Sep 17 00:00:00 2001
From: Chi Song
Date: Fri, 29 Mar 2019 10:20:28 +0800
Subject: [PATCH] Doc refactoring and update hello-world sample (#2445)

---
 README.md             | 22 +++++++++++-----------
 docs/user/training.md | 16 ++++++++--------
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/README.md b/README.md
index c27aa4a784..eed00a77ad 100644
--- a/README.md
+++ b/README.md
@@ -91,7 +91,7 @@ For a large size cluster, this section is still needed to generate default confi
 
 #### Customize deployment
 
-As various hardware environments and different use scenarios, default configuration of OpenPAI may need to be updated. Following [Customize deployment](docs/pai-management/doc/how-to-generate-cluster-config.md#Optional-Step-3.-Customize-configure-OpenPAI) part to learn more details.
+Because hardware environments and usage scenarios vary, the default configuration of OpenPAI may need to be optimized. Follow the [Customize deployment](docs/pai-management/doc/how-to-generate-cluster-config.md#Optional-Step-3.-Customize-configure-OpenPAI) section to learn more details.
 
 ### Validate deployment
 
@@ -99,9 +99,9 @@ After deployment, it's recommended to [validate key components of OpenPAI](docs/
 
 ### Train users before "train models"
 
-The common practice on OpenPAI is to submit job requests, and wait jobs got computing resource and executed. It's different experience with assigning dedicated servers to each one. People may feel computing resource is not in control and the learning curve may be higher than run job on dedicated servers. But shared resource on OpenPAI can improve productivity significantly and save time on maintaining environments.
+The common practice on OpenPAI is to submit job requests and wait for jobs to receive computing resources and execute. This is a different experience from assigning dedicated servers to each user. People may feel the computing resources are not under their control, and the learning curve may be steeper than running jobs on dedicated servers. But sharing resources on OpenPAI improves resource utilization and saves time on maintaining environments.
 
-For administrators of OpenPAI, a successful deployment is first step, the second step is to let users of OpenPAI understand benefits and know how to use it. Users of OpenPAI can learn from [Train models](#train-models). But below content is for various scenarios and may be too much to specific users. So, a simplified document based on below content is easier to learn.
+For administrators of OpenPAI, a successful deployment is the first step; the second is to help users understand the benefits and learn how to use it. Users can learn from [Train models](#train-models). But that part covers various scenarios, and users may not need all of them, so administrators can create simplified documents that match users' actual scenarios.
 
 ### FAQ
 
@@ -111,13 +111,13 @@ If FAQ doesn't resolve it, refer to [here](#get-involved) to ask question or sub
 
 ## Train models
 
-Like all machine learning platforms, OpenPAI is a productive tool. To maximize utilization, it's recommended to submit training jobs and let OpenPAI to allocate resource and run it. If there are too many jobs, some jobs may be queued until enough resource available, and OpenPAI choose some server(s) to run a job. This is different with run code on dedicated servers, and it needs a bit more knowledge about how to submit/manage training jobs on OpenPAI.
+Like all machine learning platforms, OpenPAI is a productivity tool. To maximize resource utilization, it's recommended to submit training jobs and let OpenPAI allocate resources and run them. If there are too many jobs, some may be queued until enough resources are available. This is different from running code on dedicated servers, and it needs a bit more knowledge about how to submit and manage training jobs on OpenPAI.
 
-Note, OpenPAI also supports to allocate on demand resource besides queuing jobs. Users can use SSH or Jupyter to connect like on a physical server, refer to [here](examples/jupyter/README.md) about how to use OpenPAI like this way. Though it's not efficient to resources, but it also saves cost on setup and managing environments on physical servers.
+Note that besides queuing jobs, OpenPAI also supports allocating dedicated resources. Users can connect via SSH or Jupyter and work as if on a physical server; refer to [here](examples/jupyter/README.md) for details. Though this is less efficient in resource usage, it saves the cost of setting up and managing environments on physical servers.
 
 ### Submit training jobs
 
-Follow [submitting a hello-world job](docs/user/training.md), and learn more about training models on OpenPAI. It's a very simple job and used to understand OpenPAI job definition and familiar with Web portal.
+Follow [submitting a hello-world job](docs/user/training.md) to learn more about training models on OpenPAI. It's a very simple job, used to understand OpenPAI job configuration and become familiar with the web UI.
 
 ### OpenPAI VS Code Client
 
@@ -125,9 +125,9 @@ Follow [submitting a hello-world job](docs/user/training.md), and learn more abo
 
 ### Troubleshooting job failure
 
-Web portal and job log are helpful to analyze job failure, and OpenPAI supports SSH into environment for debugging.
+The web UI and job logs are helpful for analyzing job failures, and OpenPAI supports SSH into the environment for debugging.
 
-Refer to [here](docs/user/troubleshooting_job.md) for more information about troubleshooting job failure. It's recommended to get code succeeded locally, then submit to OpenPAI. It reduces posibility to troubleshoot remotely.
+Refer to [here](docs/user/troubleshooting_job.md) for more information about troubleshooting job failures.
 
 ## Administration
 
@@ -137,7 +137,7 @@ Refer to [here](docs/user/troubleshooting_job.md) for more information about tro
 
 ## Reference
 
-* [Job definition](docs/job_tutorial.md)
+* [Job configuration](docs/job_tutorial.md)
 * [RESTful API](docs/rest-server/API.md)
 * Design documents could be found [here](docs).
 
@@ -167,8 +167,8 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio
 
 We are working on a set of major features improvement and refactor, anyone who is familiar with the features is encouraged to join the design review and discussion in the corresponding issue ticket.
 
-* PAI virtual cluster design. [Issue 1754](https://github.com/Microsoft/pai/issues/1754)
-* PAI protocol design. [Issue 2007](https://github.com/Microsoft/pai/issues/2007)
+* OpenPAI virtual cluster design. [Issue 1754](https://github.com/Microsoft/pai/issues/1754)
+* OpenPAI protocol design. [Issue 2007](https://github.com/Microsoft/pai/issues/2007)
 
 ### Who should consider contributing to OpenPAI
 
diff --git a/docs/user/training.md b/docs/user/training.md
index 08aae5c16f..845454a8a3 100644
--- a/docs/user/training.md
+++ b/docs/user/training.md
@@ -23,7 +23,7 @@
   - [Submit a hello-world job](#submit-a-hello-world-job)
 - [Understand job](#understand-job)
   - [Learn hello-world job](#learn-hello-world-job)
-  - [Exchange data](#exchange-data)
+  - [Transfer files](#transfer-files)
   - [Job workflow](#job-workflow)
 - [Reference](#reference)
 
@@ -54,7 +54,7 @@ Following this section to submit a very simple job like hello-world during learn
 ```json
 {
   "jobName": "tensorflow-cifar10",
-  "image": "ufoym/deepo:tensorflow-py36-cu90",
+  "image": "tensorflow/tensorflow:1.12.0-gpu-py3",
   "taskRoles": [
     {
       "name": "default",
@@ -62,7 +62,7 @@ Following this section to submit a very simple job like hello-world during learn
       "cpuNumber": 4,
       "memoryMB": 8192,
       "gpuNumber": 1,
-      "command": "git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=/tmp/data && python train_image_classifier.py --dataset_name=cifar10 --dataset_dir=/tmp/data --max_number_of_steps=1000"
+      "command": "apt update && apt install -y git && git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=/tmp/data && python train_image_classifier.py --dataset_name=cifar10 --dataset_dir=/tmp/data --max_number_of_steps=1000"
     }
   ]
 }
@@ -94,7 +94,7 @@ The **job configuration** is a JSON file, which is posted to OpenPAI. Here uses
 
 The JSON file of job has two levels entries. The top level includes shared information of the job, including job name, docker image, task roles, and so on. The second level is taskRoles, it's an array. Each item in the array specifies commands and the corresponding running environment.
 
-Below is key part of all fields and [full spec of job configuration](../job_tutorial.md) is here.
+Below are the required fields; the [full spec of job configuration](../job_tutorial.md) is here.
 
 - **jobName** is the unique name of current job, displays in web also. A meaningful name helps managing jobs well.
 
@@ -126,13 +126,13 @@ Below is key part of all fields and [full spec of job configuration](../job_tuto
 
 Like the hello-world job, user needs to construct command(s) to get code, data and trigger executing.
 
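+For example, stripping the hello-world sample down to only the required fields gives a minimal sketch like the one below; the repository URL and training script are placeholders, not a tested sample:
+
+```json
+{
+  "jobName": "minimal-example",
+  "image": "tensorflow/tensorflow:1.12.0-gpu-py3",
+  "taskRoles": [
+    {
+      "name": "default",
+      "cpuNumber": 4,
+      "memoryMB": 8192,
+      "gpuNumber": 1,
+      "command": "git clone https://github.com/<your-org>/<your-repo>.git code && cd code && python train.py"
+    }
+  ]
+}
+```
+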
-### Exchange data
+### Transfer files
 
-The data here doesn't only mean *dataset* of machine learning, also includes all files and information, like code, scripts, trained model, and so on. Most model training and other kinds of jobs need to exchange data between docker container and outside.
+On OpenPAI, most model training and other kinds of jobs need to transfer files between the docker container and the outside. These files include datasets, code, scripts, trained models, and so on.
 
-OpenPAI creates a clean docker container. Some data can be built into docker image directly if it's changed rarely.
+OpenPAI creates a clean docker container for each run. Files that rarely change can be built into the docker image directly.
 
-If it needs to exchange data on runtime, the command, which passes to docker in job configuration, needs to initiate the data exchange progress. For example, use `git`, `wget`, `scp`, `sftp` or other commands to copy data in and out. If some command is not built in docker, it can be installed in the command by `apt install` or `python -m pip install`.
+If files need to be transferred at runtime, the command field of the job configuration, which is passed to docker, initiates the transfer. For example, use `git`, `wget`, `scp`, `sftp`, or other commands, code, or scripts to copy files in and out. If a command is not built into the docker image, it can be installed in the command field with `apt install ...` or `python -m pip install ...`. It's better to check with the administrator of the OpenPAI cluster first, since there may already be suggested approaches and examples.
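+
+As a sketch, a command that installs missing tools, copies code and data in, and copies the trained model out might look like the following. It reuses the hello-world image; the hosts, paths, and scripts are placeholders, not a tested sample, and it assumes the container can authenticate to the storage host:
+
+```json
+{
+  "jobName": "transfer-files-example",
+  "image": "tensorflow/tensorflow:1.12.0-gpu-py3",
+  "taskRoles": [
+    {
+      "name": "default",
+      "cpuNumber": 4,
+      "memoryMB": 8192,
+      "gpuNumber": 1,
+      "command": "apt update && apt install -y git wget openssh-client && git clone https://github.com/<your-org>/<your-repo>.git code && wget -O /tmp/data.zip http://<your-data-server>/dataset.zip && cd code && python train.py --data /tmp/data.zip --output /tmp/model && scp -r /tmp/model <user>@<your-storage-host>:~/models/"
+    }
+  ]
+}
+```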