This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Releases: microsoft/pai

v0.9.0: Feb. 2019 Release

02 Feb 03:21

Release v0.9.0

New Features

  • Add a PAI service dashboard to Grafana. Cluster admins can view the resource consumption of PAI services on the paiServiceMetrics page. - PR 1694
  • Support adding custom web pages to the web portal of PAI deployments with the WebPortal Plugin. Refer to Plugins Doc for how to use the new feature, and to PR 1700 for an example of how PAI Marketplace uses it.
  • Support updating virtual clusters dynamically from the web portal. Please refer to virtual cluster management for how to use this new feature. - PR 1831, PR 1974
  • Support customized job environment variables. - PR 1544
  • Add a VS Code client for PAI. Please refer to OpenPAI VS Code Client for more details.

Improvements

Service

  • Implement the cluster object model to make it easier for developers to add customized service configuration generation logic. - PR 1735
  • Refactor the job exporter so that a single external command call can no longer make the exporter hang indefinitely. - PR 1840
  • Extend yarn local log expiration time to 7 days. - PR 1673
  • Reduce the grafana image from 440M to 280M by merging all startup scripts and removing useless plugins. - PR 1685
  • Upgrade the Node.js version of the webportal and rest server to 8. - PR 1453
  • Support hdfs path customization. - PR 1922
  • Migrate user information from etcd to k8s secret to reduce the dependency on raw etcd data. - PR 1917
  • Move user code to a background process. - PR 1461
  • Support configuration storage: all config files are stored in a Kubernetes config map. Please refer to paictl-manual for more information. - PR 1177, PR 1431, PR 1489

Job

  • Add timestamp for cloned job's name - PR 1532
  • Add log if job's image doesn't have ssh server - PR 1675
  • Escape injected variables in shell scripts -PR 1860
  • Add an example of how to integrate jupyter and pai by using restserver. -PR 1676
  • Expose all ports among tasks. The format is: PAI_${taskRole}_${taskidx}_${portLabel}_PORT=${portNumber}. - PR 1918
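
As a rough illustration of the port variable format above, the following Python sketch composes the variable name and reads the port from an environment mapping. The helper names are not part of PAI, the task role `worker`, index `0` and label `http` are hypothetical values, and the sketch assumes each variable holds a single integer port.

```python
import os
from typing import Mapping, Optional


def port_env_name(task_role: str, task_index: int, port_label: str) -> str:
    """Compose a name following PAI_${taskRole}_${taskidx}_${portLabel}_PORT."""
    return f"PAI_{task_role}_{task_index}_{port_label}_PORT"


def lookup_port(task_role: str, task_index: int, port_label: str,
                env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """Return the port as an int, or None if the variable is not set."""
    env = os.environ if env is None else env
    value = env.get(port_env_name(task_role, task_index, port_label))
    # Assumption: the value is a single integer port number.
    return int(value) if value is not None else None


if __name__ == "__main__":
    # Hypothetical environment standing in for what a task container might see.
    fake_env = {"PAI_worker_0_http_PORT": "8080"}
    print(lookup_port("worker", 0, "http", fake_env))
```

Passing the environment as a parameter keeps the lookup easy to try outside a real job container.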

GPU driver

  • Make GPU drivers version configurable. - PR 1626
  • Add two driver images. The currently supported driver versions are 384.111, 390.25 and 410.73. Version 390.25 is deployed by default. - PR 1642
  • Users can skip driver installation if the driver is already installed. - PR 1841

Command

  • Support adding machines from a node-list file. - PR 819
  • Add a config sub-command in paictl to manage config files. - PR 1263
  • After configuration storage is enabled, -p is no longer required in the service sub-command.

Example

  • Use a TensorFlow job instead of a CNTK job for the end-to-end tests. Note: if you are upgrading from a previous version of PAI, please delete the test data file under the hdfs://ip:port/Test folder so that the end-to-end test works normally.

Bug Fixes

  • issue 1603 is fixed by adding job_exporter_iteration_seconds to expose iteration time. - PR 1627.
  • issue 1602 is fixed by initializing the host ip from None to unscheduled - PR 1625.
  • issue 1639 is fixed by adding imagePullSecrets to prometheus. - PR 1678.
  • issue 1600 is fixed by offloading docker daemon check from watchdog to job-exporter. - PR 1670.
  • Fix the problem that admins can't submit jobs to a newly added virtual cluster. - PR 1972
  • issue 2005 is fixed by making Grafana Legend unique in task level dashboard. - PR 1921

Known Issues

  • paictl may fail to start services after a service stop has been called. - issue 2081
  • If a running container's "View SSH Info" popup is opened in the Chrome browser, clicking the "private key" link downloads the private key file and stores it on the local host. The key file's name consists of the user name and job name, joined by a ~ character, but Chrome replaces the ~ with a - character. Users therefore need to change the key file name accordingly when connecting to the container over SSH by following steps 3 and 4 in the "View SSH Info" popup. Please follow issue 1574 to track this problem.
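
One way to handle the Chrome file-name issue above is to rename the downloaded key back to its ~ form before following the SSH steps. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the key file name is exactly the user name and job name with no extension.

```python
from pathlib import Path


def expected_key_name(user_name: str, job_name: str) -> str:
    """File name as described in the notes: user name and job name joined by '~'."""
    return f"{user_name}~{job_name}"


def chrome_key_name(user_name: str, job_name: str) -> str:
    """File name after Chrome replaces '~' with '-'."""
    return expected_key_name(user_name, job_name).replace("~", "-")


def restore_key_name(download_dir: Path, user_name: str, job_name: str) -> Path:
    """Rename Chrome's file back to the '~' form if present; return the '~' path."""
    saved = download_dir / chrome_key_name(user_name, job_name)
    wanted = download_dir / expected_key_name(user_name, job_name)
    if saved.exists() and not wanted.exists():
        saved.rename(wanted)
    return wanted


if __name__ == "__main__":
    # Hypothetical user and job names, for illustration only.
    print(chrome_key_name("alice", "job1"), "->", expected_key_name("alice", "job1"))
```

Alternatively, the file can be left as saved and the key file name adjusted in the SSH commands instead.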

Upgrading from Earlier Release

  • Download the code package of release v0.9.0 from release page.
    Or you can clone the code by running
    git clone --branch v0.9.0 git@github.com:Microsoft/pai.git
  • Prepare your cluster configuration by following the instructions in OpenPAI Configuration. Configure the docker info as follows:
    docker-registry:
      namespace: openpai
      domain: docker.io
      tag: v0.9.0
  • In the source code directory, upgrade with the following steps:
    # push cluster configuration file to kubernetes
    python paictl.py config push -p cluster_configuration_file_path
    # stop the services
    python paictl.py service stop
    # after the services are stopped, stop the kubernetes cluster
    python paictl.py cluster k8s-clean -p cluster_configuration_file_path
    # reboot the kubernetes cluster
    python paictl.py cluster k8s-bootup -p cluster_configuration_file_path
    # start the services
    python paictl.py service start
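
The upgrade sequence above can be sketched as a small Python wrapper that keeps the steps in order and stops at the first failure. This is a sketch under assumptions, not part of PAI: the function names are hypothetical, the script is assumed to be invoked as `paictl.py` in every step, and the commands are assumed to run without interactive prompts. By default it only prints the commands.

```python
import subprocess
from typing import List


def upgrade_commands(config_path: str) -> List[List[str]]:
    """Return the v0.9.0 upgrade steps, in order, as argv lists."""
    return [
        ["python", "paictl.py", "config", "push", "-p", config_path],
        ["python", "paictl.py", "service", "stop"],
        ["python", "paictl.py", "cluster", "k8s-clean", "-p", config_path],
        ["python", "paictl.py", "cluster", "k8s-bootup", "-p", config_path],
        ["python", "paictl.py", "service", "start"],
    ]


def run_upgrade(config_path: str, dry_run: bool = True) -> None:
    """Print each step; when dry_run is False, also execute it."""
    for argv in upgrade_commands(config_path):
        print(" ".join(argv))
        if not dry_run:
            # check=True raises on a failing step, so later steps are not run.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_upgrade("cluster_configuration_file_path")
```

Stopping on the first error matters here because each step depends on the previous one having completed.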

Thanks to our Contributors

Thanks to the following people who have contributed new code or given us helpful suggestions for this release:

Bin Wang, Fan Yang, Can Wang, Di Xu, Hao Yuan, Qixiang Cheng, Xinwei Zheng @ustc (virtual cluster update), Yanjie Gao, Yundong Ye, Ziming Miao, Yuqi Wang, Scarlett Li, Dian Wang, Mao Yang, Shuguang Liu, Quanlu Zhang

v0.8.3: Dec. 2018 Release

19 Dec 09:25

Release v0.8.3

Bug Fixes

v0.8.2: Nov. 2018 Release

30 Nov 04:49
3c274a9

Release v0.8.2

Bug Fixes

  • Fix grafana rule syntax error and remove unnecessary template variables - PR 1728
  • Add more detail about how to add a node to the Kubernetes cluster in the machine-maintenance document. - PR 1750
  • Add disk and memory pressure toleration policy for node manager. - PR 1784

v0.8.1: Nov. 2018 Release

19 Nov 06:17
a56bc2d

Changes since v0.8.0 release

Bug Fixes

v0.8.0: Oct. 2018 Release

31 Oct 09:48
b75ec9e

Release v0.8.0

New Features

  • All user-submitted jobs can be cloned and resubmitted on the job detail page - PR 1448.
  • The newly designed Marketplace and Submit Job V2 are under public review.
    Please refer to the instructions for more information: Marketplace and Submit job v2.
    Any feedback and suggestions are appreciated.
  • The alerting service supports muting alerts. Instructions can be found in alert-manager.
  • New Feedback Button: users can submit GitHub issues, with the OpenPAI version appended, directly from the WebUI - PR 1289.

Improvements

Service

  • Memory limits are added for all OpenPAI services. Please refer to Resource Requirement for details.
  • The metrics from the alerting service are extended and can be reported per job, node or service.
  • An etcd data path configuration entry is added to Kubernetes Configuration, so users can decide where etcd data is stored permanently - PR 1221.
  • Alert email from Prometheus is refined for clarity - PR 1282.
  • The RestServer API supports usernames, and different users can submit jobs with the same name.

Job

  • When starting a container to run the user command, the --init option is enabled to help avoid zombie processes. - PR 1435
  • In the container running the user's command, the code directory is mounted read-only - PR 1422.
  • In a job submission request, both user-specified and random ports are supported and they can coexist - PR 1402.
  • When the Yarn node manager service is deleted, the user's job containers are forcefully cleaned up - PR 1296.

Web Portal

  • The job's view page is enhanced to show the retry history link - PR 1425.
  • On job detail page, job configuration can be exported and stored locally - PR 1429.

Command

  • The build tool is factored out of paictl and is now implemented by pai-build.
  • When logging in to cluster machines during deployment, users can now configure an SSH key file path for authentication, in addition to username and password. Details can be found in deployment configuration.

Example

  • Add auto test to run examples in an automatic or semi-automatic way.

Bug Fixes

Known Issues

  • An abnormal Yarn resource manager can leave submitted jobs stuck in the waiting state. This can be resolved by restarting the Yarn resource manager - issue 1274.
  • Scheduling jobs by GpuType does not work at present because the cluster configuration file is missing in FrameworkLauncher - Issue 1416.
    A workaround is to manually upload the configuration file to the cluster, with the following steps:
    # In the OpenPAI source code folder where you do the deployment,
    # there should be a gpu configuration file under path src/cluster-configuration/deploy/gpu-configuration/gpu-configuration.json.
    # Or you can start cluster-configuration to generate it.
    sudo python paictl.py service start -p your_configuration_dir -n cluster-configuration
    # Make sure launcher is configured to login with admin user and get the admin user name.
    # The admin user name can be found in launcher status which can be retrieved by running following command. The default user name is root.
    curl "http://master_address:9086/v1/LauncherStatus"
    # put the configuration to cluster with admin user name
    curl -X PUT -H "Content-Type: application/json" -H "UserName: admin_user_name" \
     -d @src/cluster-configuration/deploy/gpu-configuration/gpu-configuration.json "http://master_address:9086/v1/LauncherRequest/ClusterConfiguration"
    # check the configuration
    curl -X GET "http://master_address:9086/v1/LauncherRequest/ClusterConfiguration"
  • If a running container's "View SSH Info" popup is opened in the Chrome browser, clicking the "private key" link downloads the private key file
    and stores it on the local host. The key file's name consists of the user name and job name, joined by a ~ character, but Chrome replaces the ~ with
    a - character. Users therefore need to change the key file name accordingly when connecting to the container over SSH by following steps 3 and 4 in the "View SSH Info" popup.
    Please follow Issue 1574 to track this problem.
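
For the GpuType workaround in the list above, the PUT call made with curl could also be issued from Python. The following is a minimal sketch using only the standard library. The function name is hypothetical, and the port, path and header names are copied from the curl commands rather than from any documented client API. The example builds the request without sending it.

```python
import urllib.request

# Path taken from the curl commands in the workaround.
CLUSTER_CONFIG_PATH = "/v1/LauncherRequest/ClusterConfiguration"


def build_put_request(master_address: str, admin_user_name: str,
                      config_bytes: bytes) -> urllib.request.Request:
    """Build the PUT request that uploads the GPU configuration JSON."""
    url = f"http://{master_address}:9086{CLUSTER_CONFIG_PATH}"
    return urllib.request.Request(
        url,
        data=config_bytes,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "UserName": admin_user_name,
        },
    )


if __name__ == "__main__":
    # Placeholder address and payload; in practice the payload would be the
    # contents of gpu-configuration.json read in binary mode.
    req = build_put_request("master_address", "root", b"{}")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```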

Breaking Changes

  • In release v0.8.0, the Yarn container script is run by the docker executor. After a cluster is upgraded to release v0.8.0 from an earlier release,
    jobs submitted before the upgrade cannot be retried on the new release: the retried jobs may end with a nonzero exit code even if they complete correctly.
    To run such jobs, users need to stop them and submit new jobs with the same configuration.

Upgrading from Earlier Release

  • Download the code package of release v0.8.0 from release page.
    Or you can clone the code by running
    git clone --branch v0.8.0 git@github.com:Microsoft/pai.git
  • Prepare your cluster configuration by following the instructions in OpenPAI Configuration.
    In the service-configuration.yaml file, configure the docker info as follows:
    docker-namespace: openpai
    docker-registry-domain: docker.io
    docker-tag: v0.8.0
  • In the source code directory, upgrade with the following steps:
    # stop the services
    python paictl.py service stop -p cluster_configuration_file_path
    # after the services are stopped, stop the kubernetes cluster
    python paictl.py cluster k8s-clean -p cluster_configuration_file_path
    # reboot the kubernetes cluster
    python paictl.py cluster k8s-bootup -p cluster_configuration_file_path
    # start the services
    python paictl.py service start -p cluster_configuration_file_path

v0.7.2: Sept. 2018 Release

17 Sep 07:44
448a9ba

Changes since v0.7.1 release

Bug Fixes

  • Pylon: Fix rewrite of WebHDFS redirection
  • Add backward compatibility of killAllOnCompletedTaskNumber

v0.7.1: August 2018 Release

31 Aug 07:35
85686d9

New features

  • Administrators can receive email notifications about cluster problems after setting up the newly supported "Alert Manager". Please read more about how to set up Alert Manager and the notification Rules.

Improvements

Bug fixes

  • Fixed nginx reverse proxy issue in webhdfs - PR 1009
  • Fixed pylon UI issue - PR 916
  • Fixed webportal data table issue - PR 734

Known issues

  • Currently, OpenPAI starts users' jobs using hostNetwork and leverages docker stats to generate a job's running metrics, including network usage. As a result, the job/task metrics page shows 0 network usage, because docker stats returns 0 network usage if a container uses hostNetwork. We will fix this issue in a future release.

Breaking changes

  • Replace killOnAnyComplete with the more powerful options minFailedTaskCount and minSucceededTaskCount. killOnAnyComplete is now obsolete; any JSON file that specifies this field will not work in this version. Please see the job tutorial for more information. - PR 879
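
Before upgrading, it may help to scan existing job JSON files for the obsolete field. The sketch below searches a parsed JSON document recursively, so it makes no assumption about where in the job configuration the field appears. The function name and the sample job layout are hypothetical.

```python
import json
from typing import Any, Iterator

OBSOLETE_FIELD = "killOnAnyComplete"


def find_field(node: Any, field: str, path: str = "$") -> Iterator[str]:
    """Yield a path for every occurrence of `field` in a parsed JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == field:
                yield child
            yield from find_field(value, field, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_field(item, field, f"{path}[{index}]")


if __name__ == "__main__":
    # Hypothetical job configuration, for illustration only.
    job = json.loads(
        '{"jobName": "demo", "killOnAnyComplete": true, '
        '"taskRoles": [{"name": "worker"}]}'
    )
    for hit in find_field(job, OBSOLETE_FIELD):
        print("obsolete field found at", hit)
```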

v0.6.1: July 2018 Release

19 Jul 06:34
87dcd35

New features:

  1. The 'paictl' tool: Introducing paictl, the deployment/management tool with the functionalities of image building, service start/stop, k8s bootup/clean, and configuration generation.
  2. Single-box deployment: Support single-box deployment for evaluation purposes.
  3. New UI for user management: The console that administrators use to manage PAI users now has a new UI.
  4. Documentation: Significant changes on documents -- more comprehensive, more structured, and easier to follow.

Improvements:

  1. Faster loading of the job list UI: The page now loads its content 5x faster than before.

Various bug fixes: (Omitted here)

Known issues:

  • #827 Deploying the PAI master and worker on the same node may lead to resource competition.
  • #813 Still under investigation: installing PAI on some old kernels may fail.
  • #713 Yarn may not use all the resources shown on the PAI dashboard, due to configuration issues.

Example of single-box deployment with quick start:

  • Step 1. Prepare the file ~/quick-start.yaml:
    ssh-username: pai-admin
    ssh-password: ****
    machines:
      - 10.240.0.10
  • Step 2. Go to the <pai-codebase>/pai-deployment folder and then run the following commands one by one:
    python paictl.py cluster configuration-generation -i ~/quick-start.yaml -o ~/.pai/config
    python paictl.py cluster k8s-bootup -p ~/.pai/config
    python paictl.py service start -p ~/.pai/config
