This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
Releases: microsoft/pai
v0.9.0: Feb. 2019 Release
Release v0.9.0
New Features
- Add a PAI service dashboard to Grafana. Cluster admins can view the resource consumption of PAI services on the paiServiceMetrics page. - PR 1694
- Support adding custom web pages to the web portal of PAI deployments with the WebPortal Plugin. Refer to the Plugins Doc for how to use the new feature, and to PR 1700 for an example of how PAI Marketplace uses it.
- Support updating virtual clusters dynamically from the web portal. Please refer to virtual cluster management for how to use this new feature. - PR 1831, PR 1974
- Support customized job environment variables. - PR 1544
- Add a VS Code client for PAI. Please refer to OpenPAI VS Code Client for more detail.
Improvements
Service
- Implement a cluster object model to make it easier for developers to add customized service configuration generation logic. - PR 1735
- Refactor the job exporter so that a single external command call can no longer make the exporter hang indefinitely. - PR 1840
- Extend yarn local log expiration time to 7 days. - PR 1673
- Reduce the grafana image from 440M to 280M by merging all startup scripts and removing useless plugins. - PR 1685
- Upgrade the Node.js version of webportal and rest server to 8. - PR 1453
- Support hdfs path customization. - PR 1922
- Migrate user information from etcd to k8s secret to reduce the dependency on raw etcd data. - PR 1917
- Move user code to a background process. - PR 1461
- Support configuration storage: all config files are stored in a kubernetes config map. Please refer to paictl-manual for more information. - PR 1177, PR 1431, PR 1489
Job
- Add timestamp for cloned job's name - PR 1532
- Add log if job's image doesn't have ssh server - PR 1675
- Escape injected variables in shell scripts -PR 1860
- Add an example of how to integrate jupyter and pai by using restserver. -PR 1676
- Expose all ports among tasks in the format PAI_${taskRole}_${taskidx}_${portLabel}_PORT=${portNumber}. - PR 1918
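As a rough illustration of the exposed-port naming above, the following Python sketch builds the variable name and looks it up. It assumes the pattern contains no space (environment variable names cannot contain one) and returns the raw string value, since the note does not say whether a value can hold more than one port. The function names and the sample role `worker` and label `http` are hypothetical.

```python
import os


def port_env_name(task_role, task_index, port_label):
    """Build the variable name following the pattern in the release note."""
    return "PAI_{}_{}_{}_PORT".format(task_role, task_index, port_label)


def lookup_port(task_role, task_index, port_label, environ=None):
    """Return the raw value of the port variable, or None if it is not set.

    `environ` defaults to the process environment; passing a dict makes the
    helper easy to try outside a job container.
    """
    environ = os.environ if environ is None else environ
    return environ.get(port_env_name(task_role, task_index, port_label))


# Hypothetical environment as a task might see it.
sample_env = {"PAI_worker_0_http_PORT": "8080"}
print(lookup_port("worker", 0, "http", sample_env))
```

Inside a real job container the `environ` argument can be omitted so that the lookup reads `os.environ`.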
GPU driver
- Make GPU drivers version configurable. - PR 1626
- Add two driver images. The currently supported driver versions are 384.111, 390.25 and 410.73; version 390.25 is deployed by default. - PR 1642
- Users can skip driver installation if the drivers are pre-installed. - PR 1841
Command
- Support adding machines from a node-list file. - PR 819
- Add a config sub-command to paictl to manage config files. - PR 1263
- After configuration storage is enabled, -p is no longer required in the service sub-command.
Example
- Use a tensorflow job instead of a cntk job for end-to-end tests. Note: if you are upgrading from a previous version of PAI, please delete the test data file under the hdfs://ip:port/Test folder to ensure that the end-to-end test works normally.
Bug Fixes
- issue 1603 is fixed by adding job_exporter_iteration_seconds to expose iteration time. - PR 1627.
- issue 1602 is fixed by initializing the host ip from None to unscheduled - PR 1625.
- issue 1639 is fixed by adding imagePullSecrets to prometheus. - PR 1678.
- issue 1600 is fixed by offloading docker daemon check from watchdog to job-exporter. - PR 1670.
- Fix the issue that admins can't submit jobs to a newly added virtual cluster. - PR 1972
- issue 2005 is fixed by making Grafana Legend unique in task level dashboard. - PR 1921
Known Issues
- Paictl may fail to start services after calling stop service. See issue 2081.
- When the "View SSH Info" popup of a running container is opened in the Chrome browser, clicking the "private key" link downloads the private key file and stores it on the local host. The key file's name consists of the user name and job name, joined by a ~ character, but Chrome replaces the ~ with a - character. Users therefore need to change the key file name accordingly when connecting to the container over SSH by following steps 3 and 4 in the "View SSH Info" popup. Please follow issue 1574 to track this problem.
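The key file renaming described in the Chrome issue above could be undone with a small helper like the sketch below. It assumes only what the note states: the name is the user name and job name joined by `~`, and Chrome swaps that `~` for `-`. Any file extension is unknown, so `suffix` is a hypothetical parameter. The helper asks for the user and job names rather than guessing, because either name may itself contain a `-`.

```python
from pathlib import Path


def expected_key_name(user_name, job_name, suffix=""):
    """File name the SSH steps expect: user name and job name joined by '~'."""
    return "{}~{}{}".format(user_name, job_name, suffix)


def chrome_key_name(user_name, job_name, suffix=""):
    """File name as saved by Chrome, which replaces '~' with '-'."""
    return expected_key_name(user_name, job_name, suffix).replace("~", "-")


def restore_key_name(download_dir, user_name, job_name, suffix=""):
    """Rename the Chrome-saved key file back to the expected name.

    Returns True if a rename happened, False if there was nothing to do.
    """
    directory = Path(download_dir)
    saved = directory / chrome_key_name(user_name, job_name, suffix)
    wanted = directory / expected_key_name(user_name, job_name, suffix)
    if saved.exists() and not wanted.exists():
        saved.rename(wanted)
        return True
    return False
```

For example, `restore_key_name("/home/me/Downloads", "alice", "mnist-job")` would rename `alice-mnist-job` to `alice~mnist-job`; the directory, user and job names here are placeholders.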
Upgrading from Earlier Release
- Download the code package of release v0.9.0 from release page.
Or you can clone the code by running git clone --branch v0.9.0 git@github.com:Microsoft/pai.git
- Prepare your cluster configuration by following the instructions in OpenPAI Configuration. Configure the docker info as follows:
docker-registry:
  namespace: openpai
  domain: docker.io
  tag: v0.9.0
- In the code source directory, upgrade by following steps:
# push cluster configuration file to kubernetes
python paictl.py config push -p cluster_configuration_file_path
# stop the services
python paictl.py service stop
# after the services are stopped, stop the kubernetes cluster
python paictl.py cluster k8s-clean -p cluster_configuration_file_path
# reboot the kubernetes cluster
python paictl.py cluster k8s-bootup -p cluster_configuration_file_path
# start the services
python paictl.py service start
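For operators who prefer to script the v0.9.0 upgrade sequence above, here is a minimal sketch. It assumes the script is invoked as `paictl.py` in every step and that it runs from the code source directory. The function names and the dry-run default are illustrative choices, not part of paictl.

```python
import subprocess


def upgrade_commands(config_path, python="python"):
    """Return the upgrade steps, in order, as argument lists."""
    paictl = [python, "paictl.py"]
    return [
        paictl + ["config", "push", "-p", config_path],         # push config to kubernetes
        paictl + ["service", "stop"],                           # stop the services
        paictl + ["cluster", "k8s-clean", "-p", config_path],   # stop the kubernetes cluster
        paictl + ["cluster", "k8s-bootup", "-p", config_path],  # reboot the kubernetes cluster
        paictl + ["service", "start"],                          # start the services
    ]


def run_upgrade(config_path, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for command in upgrade_commands(config_path):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(command, check=True)


run_upgrade("cluster_configuration_file_path")
```

Stopping at the first failure matters here because each step presumes the previous one finished, for example the cluster should only be cleaned after the services have stopped.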
Thanks to our Contributors
Thanks to the following people who have contributed new code or given us helpful suggestions for this release:
Bin Wang, Fan Yang, Can Wang, Di Xu, Hao Yuan, Qixiang Cheng, Xinwei Zheng @ustc (virtual cluster update), Yanjie Gao, Yundong Ye, Ziming Miao, Yuqi Wang, Scarlett Li, Dian Wang, Mao Yang, Shuguang Liu, Quanlu Zhang
v0.8.3: Dec. 2018 Release
Release v0.8.3
Bug Fixes
- issue 1873: Fix broken links in the webportal documentation. - PR 1898, PR 1844, PR 1847
- issue 1887: Fix the problem that SSH info in the webportal behind pylon is not replaced. - PR 1901
- issue 1863: Fix the missing-file issue when cleaning up kubernetes. - PR 1864
v0.8.2: Nov. 2018 Release
v0.8.1: Nov. 2018 Release
v0.8.0: Oct. 2018 Release
Release v0.8.0
New Features
- All user-submitted jobs can be cloned and resubmitted from the Job detail page - PR 1448.
- The newly designed Marketplace and Submit Job V2 are under public review.
Please refer to the instructions for more information: Marketplace and Submit job v2.
Any feedback and suggestions are appreciated.
- The alerting service supports muting alerts. The instructions can be found via alert-manager.
- New Feedback Button: users are allowed to submit GitHub Issues with appended OpenPAI version directly from WebUI - PR 1289.
Improvements
Service
- Memory limits are added for all OpenPAI services. Please refer to Resource Requirement for details.
- The metrics from the alerting service are extended and can be reported per job, node or service.
- An etcd data path configuration entry is added to Kubernetes Configuration, so users can decide the path where etcd data is stored permanently - PR 1221.
- Alert email from Prometheus is refined for clarity - PR 1282.
- RestServer's API supports usernames, so different users can submit jobs with the same name.
Job
- When starting a container to run the user command, the --init option is enabled to help avoid zombie processes - PR 1435.
- In the container running the user's command, the code directory is mounted as read-only - PR 1422.
- In a job submission request, both user-specified and random ports are supported and they can coexist - PR 1402.
- When the Yarn node manager service is deleted, the user's job containers are forcibly cleaned up - PR 1296.
Web Portal
- The job's view page is enhanced to show the retry history link - PR 1425.
- On job detail page, job configuration can be exported and stored locally - PR 1429.
Command
- Build tool is refactored out from paictl and is implemented by pai-build.
- When logging in to machines in the cluster during deployment, users can configure an ssh key file path for authentication in addition to username and password. The details can be found in deployment configuration.
Example
- Add auto test to run examples in an automatic or semi-automatic way.
Bug Fixes
- issue 1153 is fixed by checking API resources before installing kube-proxy - PR 1210.
- issue 1226 is fixed by limiting the image pull time to 10 minutes - PR 1227.
- issue 1217 is fixed by rotating the exporter's log via the docker daemon - PR 1239.
- issue 1314 is fixed by redirecting the WebHDFS requests via pylon correctly - PR 1328.
- issue 1396 is fixed so that GPUs are detected correctly when ECC is turned off - PR 1421.
Known Issues
- Yarn resource manager abnormality can leave submitted jobs stuck in the waiting state. This can be resolved by restarting the Yarn resource manager - issue 1274.
- Scheduling jobs by GpuType does not work at present because the cluster configuration file is missing in FrameworkLauncher - Issue 1416.
A workaround is to manually upload the configuration file to the cluster. This can be done in the following steps:
# In the OpenPAI source code folder where you do the deployment,
# there should be a gpu configuration file under path src/cluster-configuration/deploy/gpu-configuration/gpu-configuration.json.
# Or you can start cluster-configuration to generate it.
sudo python paictl.py service start -p your_configuration_dir -n cluster-configuration
# Make sure launcher is configured to login with admin user and get the admin user name.
# The admin user name can be found in launcher status which can be retrieved by running following command. The default user name is root.
curl "http://master_address:9086/v1/LauncherStatus"
# put the configuration to cluster with admin user name
curl -X PUT -H "Content-Type: application/json" -H "UserName: admin_user_name" \
-d @src/cluster-configuration/deploy/gpu-configuration/gpu-configuration.json "http://master_address:9086/v1/LauncherRequest/ClusterConfiguration"
# check the configuration
curl -X GET "http://master_address:9086/v1/LauncherRequest/ClusterConfiguration"
- When the "View SSH Info" popup of a running container is opened in the Chrome browser, clicking the "private key" link downloads the private key file
and stores it on the local host. The key file's name consists of the user name and job name, joined by a ~ character, but Chrome replaces the ~ with
a - character. Users therefore need to change the key file name accordingly when connecting to the container over SSH by following steps 3 and 4 in the "View SSH Info" popup.
Please follow Issue 1574 to track this problem.
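The curl PUT in the GpuType workaround above can also be expressed with the Python standard library. The sketch below only mirrors the URL, port 9086, and the two headers shown in the curl commands. The function names are hypothetical, and the sample payload in the usage lines is a placeholder, not the real gpu-configuration.json schema.

```python
import json
import urllib.request


def cluster_configuration_url(master_address, port=9086):
    """URL of the launcher endpoint used in the curl commands."""
    return "http://{}:{}/v1/LauncherRequest/ClusterConfiguration".format(
        master_address, port
    )


def build_put_request(master_address, admin_user_name, gpu_configuration):
    """Build (but do not send) the PUT request carrying the configuration."""
    body = json.dumps(gpu_configuration).encode("utf-8")
    return urllib.request.Request(
        cluster_configuration_url(master_address),
        data=body,
        headers={"Content-Type": "application/json", "UserName": admin_user_name},
        method="PUT",
    )


def put_cluster_configuration(master_address, admin_user_name, config_path):
    """Load the JSON file and send it; returns the HTTP status code."""
    with open(config_path) as config_file:
        gpu_configuration = json.load(config_file)
    request = build_put_request(master_address, admin_user_name, gpu_configuration)
    with urllib.request.urlopen(request) as response:
        return response.status


# Inspect the request without touching the network (placeholder payload).
request = build_put_request("master_address", "root", {"placeholder": True})
print(request.get_method(), request.full_url)
```

Separating request construction from sending makes it possible to check the URL and headers before anything is changed on the cluster.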
Breaking Changes
- In release v0.8.0 the Yarn container script is run by the docker executor. After a cluster is upgraded to release v0.8.0 from an earlier release,
jobs submitted before the upgrade cannot be retried on the new release: the retried jobs may end up with a nonzero exit code even if they complete correctly.
To run such jobs, users need to end them and submit new jobs with the same configuration.
Upgrading from Earlier Release
- Download the code package of release v0.8.0 from release page.
Or you can clone the code by running git clone --branch v0.8.0 git@github.com:Microsoft/pai.git
- Prepare your cluster configuration by following the instructions in OpenPAI Configuration.
In the service-configuration.yaml file, configure the docker info as follows:
docker-namespace: openpai
docker-registry-domain: docker.io
docker-tag: v0.8.0
- In the code source directory, upgrade by following steps:
# stop the services
python paictl.py service stop -p cluster_configuration_file_path
# after the services are stopped, stop the kubernetes cluster
python paictl.py cluster k8s-clean -p cluster_configuration_file_path
# reboot the kubernetes cluster
python paictl.py cluster k8s-bootup -p cluster_configuration_file_path
# start the services
python paictl.py service start -p cluster_configuration_file_path
v0.7.2: Sept. 2018 Release
Changes since v0.7.1 release
Bug Fixes
- Pylon: Fix rewrite of WebHDFS redirection
- Add backward compatibility of killAllOnCompletedTaskNumber
v0.7.1: August 2018 Release
New features
- Administrators can receive email notifications about cluster problems after setting up the newly supported "Alert Manager". Please read more about how to set up Alert Manager and the notification Rules.
Improvements
- Optimized the boot speed of web portal - PR 1021
- Improved the Kubernetes upgrade experience: PAI admins are no longer required to delete ETCD data when upgrading Kubernetes - PR 1038
- Upgrade hadoop from 2.7.2 to 2.9.0 - PR 923
- Enable docker log rotation by default - PR 995
- Documentation
- Restructured and refined README to provide a better experience for new users.
- Documentation improvement for:
- More examples:
Bug fixes
- Fixed nginx reverse proxy issue in webhdfs - PR 1009
- Fixed pylon UI issue - PR 916
- Fixed webportal data table issue - PR 734
Known issues
- Currently, OpenPAI starts users' jobs using hostNetwork and leverages docker stats to generate jobs' running metrics, including network usage. This renders 0 network usage on the job/task metrics page, because docker stats returns 0 network usage if a container uses hostNetwork. We will fix this issue in a future release.
Breaking changes
- Replace killOnAnyComplete with the more powerful options minFailedTaskCount and minSucceededTaskCount. killOnAnyComplete is now obsolete; any json files that specify this field will not work in this version. Please see the job tutorial for more information. - PR 879
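Since job json files that still specify killOnAnyComplete stop working in this version, a quick scan before upgrading may help. The sketch below searches a parsed job config for the obsolete key at any depth, because the note does not say where in the file the field lives. The function name and the sample job are hypothetical.

```python
OBSOLETE_FIELD = "killOnAnyComplete"


def find_obsolete_fields(node, path="$"):
    """Return the JSON paths at which the obsolete field occurs."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = "{}.{}".format(path, key)
            if key == OBSOLETE_FIELD:
                hits.append(child_path)
            hits.extend(find_obsolete_fields(value, child_path))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_obsolete_fields(item, "{}[{}]".format(path, index)))
    return hits


# Hypothetical job config; only the obsolete key name comes from the note.
sample_job = {
    "jobName": "example",
    "killOnAnyComplete": True,
    "taskRoles": [{"name": "worker", "killOnAnyComplete": False}],
}
for location in find_obsolete_fields(sample_job):
    print("obsolete field at", location)
```

A real file can be checked by passing the result of `json.load` to `find_obsolete_fields`; an empty list means the field was not found.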
v0.6.1: July 2018 Release
New features:
- The 'paictl' tool: Introducing paictl, the deployment/management tool with the functionalities of image building, service start/stop, k8s bootup/clean, and configuration generation.
- Single-box deployment: Support single-box deployment for evaluation purposes.
- New UI for user management: The console for administrators to manage PAI users now has a new UI.
- Documentation: Significant changes on documents -- more comprehensive, more structured, and easier to follow.
Improvements:
- Faster loading of the job list UI: Now the page gets 5x faster than before when loading its content.
Various bug fixes: (Omitted here)
Known issues:
- #827 Deploying PAI master and worker on the same node may lead to resource competition.
- #813 Still under investigation. Installing PAI on some old kernels may fail.
- #713 Yarn may not use all the resources shown on the PAI dashboard, due to configuration issues.
Example of single-box deployment with quick start:
- Step 1. Prepare file ~/quick-start.yaml:
ssh-username: pai-admin
ssh-password: ****
machines:
- 10.240.0.10
- Step 2. Go to the <pai-codebase>/pai-deployment folder and then run the following commands one by one:
python paictl.py cluster configuration-generation -i ~/quick-start.yaml -o ~/.pai/config
python paictl.py cluster k8s-bootup -p ~/.pai/config
python paictl.py service start -p ~/.pai/config
- Step 3. Open a web browser and go to http://10.240.0.10 to see the PAI web portal.
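For clusters with more than one machine, the quick-start file from Step 1 could be generated rather than typed. The following sketch renders the same three keys shown above. It assumes plain values that need no YAML quoting, and the function name is hypothetical.

```python
def render_quick_start(ssh_username, ssh_password, machines):
    """Render quick-start.yaml content with the keys shown in Step 1.

    Values are written verbatim, so a password containing YAML special
    characters (for example ':' or '#') would need quoting by the caller.
    """
    lines = [
        "ssh-username: {}".format(ssh_username),
        "ssh-password: {}".format(ssh_password),
        "machines:",
    ]
    lines.extend("  - {}".format(address) for address in machines)
    return "\n".join(lines) + "\n"


print(render_quick_start("pai-admin", "****", ["10.240.0.10"]), end="")
```

The output can be written to ~/quick-start.yaml and then passed to the configuration-generation command from Step 2.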