This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

02 Feb 03:21

Gerhut

v0.9.0

d26f762

v0.9.0: Feb. 2019 Release

Release v0.9.0

New Features

Add a PAI service dashboard to Grafana. Cluster admins can view the resource consumption of PAI services on the paiServiceMetrics page. - PR 1694
Support adding custom web pages to the web portal of PAI deployments with the WebPortal Plugin. Refer to Plugins Doc for how to use the new feature, and to PR 1700 for an example of how PAI Marketplace uses it.
Support updating virtual clusters dynamically from the web portal. Please refer to virtual cluster management for how to use this new feature. - PR 1831 PR 1974
Support customized job environment variables. - PR 1544
Add a VS Code client for PAI. Please refer to OpenPAI VS Code Client for more details.

Improvements

Service

Implement a cluster object model to make it easier for developers to add customized service configuration generation logic. - PR 1735
Refactor the job exporter so that a single external command call can no longer make the exporter hang indefinitely. - PR 1840
Extend the Yarn local log expiration time to 7 days. - PR 1673
Reduce the grafana image from 440M to 280M by merging all startup scripts and removing useless plugins. - PR 1685
Upgrade the Node.js version of webportal and rest server to 8. - PR 1453
Support hdfs path customization. - PR 1922
Migrate user information from etcd to k8s secret to reduce the dependency on raw etcd data. - PR 1917
Move user code to a background process. - PR 1461
Support configuration storage: all config files are stored in a kubernetes config map. Please refer to paictl-manual for more information. - PR 1177 PR 1431 PR 1489

Job

Add timestamp for cloned job's name - PR 1532
Add log if job's image doesn't have ssh server - PR 1675
Escape injected variables in shell scripts - PR 1860
Add an example of how to integrate jupyter and pai by using restserver. - PR 1676
Expose all ports among tasks. The format is: PAI_${taskRole}_${taskidx}_${portLabel}_PORT=${portNumber}. - PR 1918
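
As an illustration of the port variable format in the last item, the following Python sketch builds the variable name for a task and reads the port from an environment mapping. The helper names `port_env_name` and `lookup_port` and the sample role, index and label are hypothetical; only the naming pattern comes from the note above, and the sketch assumes the name contains no spaces.

```python
import os


def port_env_name(task_role, task_index, port_label):
    """Build the name PAI_${taskRole}_${taskidx}_${portLabel}_PORT."""
    return "PAI_{}_{}_{}_PORT".format(task_role, task_index, port_label)


def lookup_port(task_role, task_index, port_label, environ=None):
    """Return the port number as an int, or None if the variable is not set."""
    environ = os.environ if environ is None else environ
    value = environ.get(port_env_name(task_role, task_index, port_label))
    return int(value) if value is not None else None


if __name__ == "__main__":
    # Hypothetical values: task role "worker", task index 0, port label "http".
    sample_env = {"PAI_worker_0_http_PORT": "8080"}
    print(port_env_name("worker", 0, "http"))
    print(lookup_port("worker", 0, "http", sample_env))
```

Passing the environment as an argument keeps the lookup easy to try outside a job container; inside a container the default `os.environ` would be used.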

GPU driver

Make the GPU driver version configurable. - PR 1626
Add two driver images. The currently supported driver versions are 384.111, 390.25 and 410.73. Version 390.25 is deployed by default. - PR 1642
Users can skip driver installation if the driver is already installed. - PR 1841

Command

Support adding machines from a node-list file. - PR 819
Add a config sub-command in paictl to manage config files. - PR 1263
After the configuration storage is enabled, -p is no longer required in the service sub-command.

Example

Use a tensorflow job instead of a cntk job for end-to-end tests. Note: if you are upgrading from a previous version of pai, please delete the test data file under the hdfs://ip:port/Test folder to ensure that the end-to-end test works normally.

Bug Fixes

issue 1603 is fixed by adding job_exporter_iteration_seconds to expose iteration time. - PR 1627.
issue 1602 is fixed by initializing the host ip from None to unscheduled - PR 1625.
issue 1639 is fixed by adding imagePullSecrets to prometheus. - PR 1678.
issue 1600 is fixed by offloading docker daemon check from watchdog to job-exporter. - PR 1670.
Fix the issue that an admin cannot submit jobs to a newly added virtual cluster. - PR 1972
issue 2005 is fixed by making Grafana Legend unique in task level dashboard. - PR 1921

Known Issues

Paictl may fail to start services after stopping them. - issue 2081
If a running container's "View SSH Info" popup is opened in the Chrome browser, clicking the "private key" link downloads the private key file and stores it on the local host. The key file's name consists of the user name and job name, joined by a ~ character, but Chrome replaces the ~ with a - character. Users therefore need to change the key file name accordingly when connecting to the container over SSH by following steps 3 and 4 in the "View SSH Info" popup. Please follow issue 1574 to track this problem.
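
One possible way to handle the renamed key file is to restore the original name after downloading, instead of editing the commands from the popup. The Python sketch below assumes the file name is exactly the user name and job name joined by the separator, with no extension; the release note does not state the full file name, so adjust as needed. All function names are hypothetical.

```python
from pathlib import Path


def expected_key_name(user_name, job_name):
    """Name the web portal uses: user name and job name joined by '~'."""
    return "{}~{}".format(user_name, job_name)


def chrome_key_name(user_name, job_name):
    """Name Chrome saves instead, with '~' replaced by '-'."""
    return expected_key_name(user_name, job_name).replace("~", "-")


def restore_key_name(download_dir, user_name, job_name):
    """Rename the Chrome-saved file back to the '~' form.

    Returns the new path, or None if the Chrome-named file is not present.
    """
    saved = Path(download_dir) / chrome_key_name(user_name, job_name)
    if not saved.exists():
        return None
    target = saved.with_name(expected_key_name(user_name, job_name))
    saved.rename(target)
    return target
```

For example, `restore_key_name("/home/me/Downloads", "alice", "job1")` would rename `alice-job1` to `alice~job1` in that (hypothetical) folder.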

Upgrading from Earlier Release

Download the code package of release v0.9.0 from release page.
Or you can clone the code by running
```
git clone --branch v0.9.0 git@github.com:Microsoft/pai.git
```
Prepare your cluster configuration by following the instructions in OpenPAI Configuration. Configure the docker info as follows:
```
docker-registry:
  namespace: openpai
  domain: docker.io
  tag: v0.9.0
```
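
The v0.8.0 notes further down this page list the same three settings as flat keys (`docker-namespace`, `docker-registry-domain`, `docker-tag`). The Python sketch below shows one way to carry such values over to the nested layout above. It assumes a one-to-one mapping between the old and new keys, which these notes suggest but do not state; the function names are hypothetical.

```python
def to_nested_docker_registry(flat):
    """Map the v0.8.0 flat keys onto the v0.9.0 nested docker-registry section."""
    return {
        "docker-registry": {
            "namespace": flat["docker-namespace"],
            "domain": flat["docker-registry-domain"],
            "tag": flat["docker-tag"],
        }
    }


def render_yaml(section):
    """Render a one-level nested mapping as YAML text (no third-party library)."""
    lines = []
    for name, fields in section.items():
        lines.append("{}:".format(name))
        for key, value in fields.items():
            lines.append("  {}: {}".format(key, value))
    return "\n".join(lines)


if __name__ == "__main__":
    old = {
        "docker-namespace": "openpai",
        "docker-registry-domain": "docker.io",
        "docker-tag": "v0.9.0",  # the tag is bumped to the new release
    }
    print(render_yaml(to_nested_docker_registry(old)))
```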

In the source code directory, upgrade with the following steps:

# push cluster configuration file to kubernetes
python paictl.py config push -p cluster_configuration_file_path
# stop the services
python paictl.py service stop
# after the services are stopped, stop the kubernetes cluster
python paictl.py cluster k8s-clean -p cluster_configuration_file_path
# reboot the kubernetes cluster
python paictl.py cluster k8s-bootup -p cluster_configuration_file_path
# start the services
python paictl.py service start
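
The order of the steps above matters, so it can help to script them. The following Python sketch lists the five commands in order and, unless run in dry-run mode, stops at the first failing step. It assumes the first step is invoked through `paictl.py` like the others and that `python` on the PATH is the interpreter paictl expects; `upgrade_commands` and `run_upgrade` are hypothetical helper names, not part of paictl.

```python
import subprocess


def upgrade_commands(config_path):
    """Return the v0.9.0 upgrade steps in order, as argument lists."""
    paictl = ["python", "paictl.py"]
    return [
        paictl + ["config", "push", "-p", config_path],        # push config to kubernetes
        paictl + ["service", "stop"],                          # stop the services
        paictl + ["cluster", "k8s-clean", "-p", config_path],  # stop the kubernetes cluster
        paictl + ["cluster", "k8s-bootup", "-p", config_path], # boot the cluster again
        paictl + ["service", "start"],                         # start the services
    ]


def run_upgrade(config_path, dry_run=True):
    """Print each step; when dry_run is False, run it and raise on the first failure."""
    for command in upgrade_commands(config_path):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run only: prints the commands without executing them.
    run_upgrade("cluster_configuration_file_path")
```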

Thanks to our Contributors

Thanks to the following people who have contributed new code or given us helpful suggestions for this release:

Bin Wang, Fan Yang, Can Wang, Di Xu, Hao Yuan, Qixiang Cheng, Xinwei Zheng @ustc (virtual cluster update), Yanjie Gao, Yundong Ye, Ziming Miao, Yuqi Wang, Scarlett Li, Dian Wang, Mao Yang, Shuguang Liu, Quanlu Zhang


19 Dec 09:25

wangcan0329

v0.8.3

5185b03

v0.8.3: Dec. 2018 Release

Release v0.8.3

Bug Fixes

issue 1873 Fix broken links in the webportal documentation. - PR 1898 PR 1844 PR 1847
issue 1887 Fix the problem that SSH info in webportal behind pylon is not replaced. - PR 1901
issue 1863 Fix the missing file issue when cleaning up kubernetes. - PR 1864


30 Nov 04:49

wangcan0329

v0.8.2

3c274a9

v0.8.2: Nov. 2018 Release

Release v0.8.2

Bug Fixes

Fix grafana rule syntax error and remove unnecessary template variables - PR 1728
Add more detail about how to add node to kubernetes cluster in machine-maintenance document. - PR 1750
Add disk and memory pressure toleration policy for node manager. - PR 1784


19 Nov 06:17

hao1939

v0.8.1

a56bc2d

v0.8.1: Nov. 2018 Release

Changes since v0.8.0 release

Bug Fixes

Issue 1658 Fix the datanode out-of-memory problem - PR 1689


31 Oct 09:48

DongZhaoYu

v0.8.0

b75ec9e

v0.8.0: Oct. 2018 Release

Release v0.8.0

New Features

All user-submitted jobs can be cloned and resubmitted on the Job detail page - PR 1448.
The newly designed Marketplace and Submit Job V2 are under public review.
For more information, please refer to the instructions for Marketplace and Submit job v2.
Any feedback and suggestions are appreciated.
The alerting service supports muting alerts. The instructions can be found via alert-manager.
New Feedback Button: users can submit GitHub Issues with the OpenPAI version appended directly from the WebUI - PR 1289.

Improvements

Service

Memory limits are added for all OpenPAI services. Please refer to Resource Requirement for details.
The metrics from the alerting service are extended and can be reported per job, node or service.
An etcd data path configuration entry is added to Kubernetes Configuration, so users can decide where to store etcd data permanently - PR 1221.
Alert email from Prometheus is refined for clarity - PR 1282.
RestServer's API supports username, so different users can submit jobs with the same name.

Job

When starting a container to run the user's command, the --init option is enabled to help avoid zombie processes - PR 1435.
In the container running the user's command, the code directory is mounted as read-only - PR 1422.
In a job submission request, both user-specified and random ports are supported and they can coexist - PR 1402.
When the Yarn node manager service is deleted, the user's job containers are forcefully cleaned up - PR 1296.

Web Portal

The job's view page is enhanced to show the retry history link - PR 1425.
On job detail page, job configuration can be exported and stored locally - PR 1429.

Command

Build tool is refactored out from paictl and is implemented by pai-build.
When logging in to machines in the cluster during deployment, users can configure an ssh key file path for authentication in addition to username and password. The details can be found in deployment configuration.

Example

Add auto test to run examples in an automatic or semi-automatic way.

Bug Fixes

issue 1153 is fixed by checking API resources before installing kube-proxy - PR 1210.
issue 1226 is fixed by limiting the image pull time to 10 minutes - PR 1227.
issue 1217 is fixed by rotating the exporter's log via docker daemon - PR 1239.
issue 1314 is fixed by redirecting the WebHDFS requests via pylon correctly - PR 1328.
issue 1396 is fixed by detecting GPU correctly when ECC is turned off - PR 1421.

Known Issues

An abnormal Yarn resource manager can leave submitted jobs stuck in the waiting state. This can be resolved by restarting the Yarn resource manager - issue 1274.
Scheduling jobs by GpuType does not work now because the cluster configuration file is missing in FrameworkLauncher - Issue 1416.
A workaround is to manually upload the configuration file to the cluster. This can be done with the following steps:

  # In the OpenPAI source code folder where you do the deployment, 
  # there should be a gpu configuration file under path src/cluster-configuration/deploy/gpu-configuration/gpu-configuration.json.
  # Or you can start cluster-configuration to generate it.
  sudo python paictl.py service start -p your_configuration_dir -n cluster-configuration
  # Make sure launcher is configured to login with admin user and get the admin user name.
  # The admin user name can be found in launcher status which can be retrieved by running following command. The default user name is root.
  curl "http://master_address:9086/v1/LauncherStatus"
  # put the configuration to cluster with admin user name
  curl -X PUT -H "Content-Type: application/json" -H "UserName: admin_user_name" \
   -d @src/cluster-configuration/deploy/gpu-configuration/gpu-configuration.json "http://master_address:9086/v1/LauncherRequest/ClusterConfiguration"
  # check the configuration
  curl -X GET "http://master_address:9086/v1/LauncherRequest/ClusterConfiguration"
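
The two `curl` calls above can also be issued from a script. The Python 3 sketch below builds the same PUT and GET requests with the standard library; the endpoint, port and `UserName` header are taken from the commands above, while the function names are hypothetical. Nothing is sent until `upload_and_check` is called.

```python
import urllib.request

CONFIG_PATH = "/v1/LauncherRequest/ClusterConfiguration"


def build_put_request(master_address, admin_user_name, config_bytes):
    """Build the PUT request that uploads the GPU configuration JSON."""
    return urllib.request.Request(
        "http://{}:9086{}".format(master_address, CONFIG_PATH),
        data=config_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # urllib normalizes header capitalization; HTTP header names
            # are case-insensitive, so "UserName" is sent as "Username".
            "UserName": admin_user_name,
        },
    )


def build_get_request(master_address):
    """Build the GET request that reads the configuration back."""
    return urllib.request.Request(
        "http://{}:9086{}".format(master_address, CONFIG_PATH), method="GET"
    )


def upload_and_check(master_address, admin_user_name, config_file):
    """Send the PUT, then fetch the stored configuration and return it as text."""
    with open(config_file, "rb") as handle:
        payload = handle.read()
    put_request = build_put_request(master_address, admin_user_name, payload)
    with urllib.request.urlopen(put_request):
        pass
    with urllib.request.urlopen(build_get_request(master_address)) as response:
        return response.read().decode("utf-8")
```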

If a running container's "View SSH Info" popup is opened in the Chrome browser, clicking the "private key" link downloads the private key file and stores it on the local host. The key file's name consists of the user name and job name, joined by a ~ character, but Chrome replaces the ~ with a - character. Users therefore need to change the key file name accordingly when connecting to the container over SSH by following steps 3 and 4 in the "View SSH Info" popup.
Please follow Issue 1574 to track this problem.

Breaking Changes

In release v0.8.0 the Yarn container script is run by the docker executor. After a cluster is upgraded to release v0.8.0 from an earlier release, jobs submitted before the upgrade cannot be retried on the new release.
The retried jobs may end up with a nonzero exit code even if they complete correctly.
To run such jobs, users need to end them and submit new jobs with the same configuration.

Upgrading from Earlier Release

Download the code package of release v0.8.0 from release page.
Or you can clone the code by running
```
git clone --branch v0.8.0 git@github.com:Microsoft/pai.git
```
Prepare your cluster configuration by following the instructions in OpenPAI Configuration.
In the service-configuration.yaml file, configure the docker info as follows:
```
docker-namespace: openpai
docker-registry-domain: docker.io
docker-tag: v0.8.0
```

In the source code directory, upgrade with the following steps:

# stop the services
python paictl.py service stop -p cluster_configuration_file_path
# after the services are stopped, stop the kubernetes cluster
python paictl.py cluster k8s-clean -p cluster_configuration_file_path
# reboot the kubernetes cluster
python paictl.py cluster k8s-bootup -p cluster_configuration_file_path
# start the services
python paictl.py service start -p cluster_configuration_file_path


17 Sep 07:44

xudifsd

v0.7.2

448a9ba

v0.7.2: Sept. 2018 Release

Changes since v0.7.1 release

Bug Fixes

Pylon: Fix rewrite of WebHDFS redirection
Add backward compatibility of killAllOnCompletedTaskNumber


31 Aug 07:35

xudifsd

v0.7.1

85686d9

v0.7.1: August 2018 Release

New features

Administrators can receive email notifications on cluster problems after setting up the newly supported "Alert Manager". Please read more about how to set up Alert Manager and the notification Rules.

Improvements

Optimized the boot speed of web portal - PR 1021
Improved the Kubernetes upgrade experience: PAI admins are no longer required to delete ETCD data when upgrading Kubernetes - PR 1038
Upgrade hadoop from 2.7.2 to 2.9.0 - PR 923
Enable docker log rotation by default - PR 995
Documentation
- Restructured and refined README to provide a better experience for new users.
- Documentation improvement for:
More examples:
- caffe
- caffe2
- chainer
- kafka
- Spark
- XGBoost

Bug fixes

Fixed nginx reverse proxy issue in webhdfs - PR 1009
Fixed pylon UI issue - PR 916
Fixed webportal data table issue - PR 734

Known issues

Currently, OpenPAI starts users' jobs using hostNetwork and leverages docker stats to generate a job's running metrics, including network usage. As a result, the job/task metrics page shows 0 network usage, because docker stats returns 0 network usage if a container uses hostNetwork. We will fix this issue in a future release.

Breaking changes

Replace killOnAnyComplete with the more powerful options minFailedTaskCount and minSucceededTaskCount. killOnAnyComplete is now obsolete; any json file that specifies this field will not work in this version. Please see job tutorial for more information. - PR 879
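
Before upgrading, existing job json files can be checked for the obsolete field. The Python sketch below searches a parsed job configuration recursively and reports where `killOnAnyComplete` appears. It makes no assumption about where the field sits in the job json; the function name and the sample job layout are hypothetical.

```python
import json


def find_obsolete_fields(node, field="killOnAnyComplete", path="$"):
    """Return the JSON paths of every occurrence of `field` in a parsed config."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = "{}.{}".format(path, key)
            if key == field:
                hits.append(child)
            hits.extend(find_obsolete_fields(value, field, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            child = "{}[{}]".format(path, index)
            hits.extend(find_obsolete_fields(item, field, child))
    return hits


if __name__ == "__main__":
    # Hypothetical job config used only to exercise the search.
    job = json.loads('{"jobName": "demo", "killOnAnyComplete": true}')
    for location in find_obsolete_fields(job):
        print("obsolete field found at", location)
```

A file with no hits needs no change for this particular break; a file with hits has to be rewritten with minFailedTaskCount or minSucceededTaskCount as described in the job tutorial.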


19 Jul 06:34

hwuu

v0.6.1

87dcd35

v0.6.1: July 2018 Release

New features:

The 'paictl' tool: Introducing paictl, the deployment/management tool with the functionalities of image building, service start/stop, k8s bootup/clean, and configuration generation.
Single-box deployment: Support single-box deployment for evaluation purpose.
New UI for user management: Now the console for administrators to manage PAI users has got a new UI.
Documentation: Significant changes on documents -- more comprehensive, more structured, and easier to follow.

Improvements:

Faster loading of the job list UI: Now the page gets 5x faster than before when loading its content.

Various bug fixes: (Omitted here)

Known issues:

#827 Deploying PAI master and worker on the same node may lead to resource competition.
#813 Installing PAI on some old kernels may fail. Still under investigation.
#713 Yarn may not use all the resource shown on the PAI dashboard, due to configuration issues.

Example of single-box deployment with quick start:

Step 1. Prepare file ~/quick-start.yaml:

ssh-username: pai-admin
ssh-password: ****
machines:
  - 10.240.0.10

Step 2. Go to the <pai-codebase>/pai-deployment folder and then run the following commands one by one:

python paictl.py cluster configuration-generation -i ~/quick-start.yaml -o ~/.pai/config
python paictl.py cluster k8s-bootup -p ~/.pai/config
python paictl.py service start -p ~/.pai/config

Step 3. Open a web browser and go to http://10.240.0.10 to see the PAI web portal.

