PAI configuration consists of 4 YAML files:
- cluster-configuration.yaml: Machine-level configurations. This file contains the basic configuration of the cluster, such as the login info, machine SKUs, and labels of each machine.
- kubernetes-configuration.yaml: Kubernetes-level configurations. This file contains the basic configuration of Kubernetes, such as the version info and network configuration.
- k8s-role-definition.yaml: Kubernetes-level configurations. This file contains the mappings between Kubernetes roles and machine labels.
- services-configuration.yaml: Service-level configurations. This file contains the cluster id, the docker registry settings, and the configuration of each individual PAI service.
Before deployment or maintenance, users should have the cluster configuration files ready.
You can find example configuration files in pai/cluster-configuration/.
Note: Please do not change the names of the configuration files, and put all 4 files in the same directory.
Configure OpenPAI from scenarios
- placement
- scheduling
- account
- port / data folder etc.
- component version
- HA
- Cluster related configuration: configuration of cluster-configuration.yaml
- Kubernetes role related configuration: configuration of k8s-role-definition.yaml
- Kubernetes related configuration: configuration of kubernetes-configuration.yaml
- Service related configuration: configuration of services-configuration.yaml
Configure OpenPAI services [Note: This part is for advanced users who want to customize individual OpenPAI services]
- Kubernetes
- Webportal
- FrameworkLauncher
- Hadoop
- Monitor
Appendix: Default values in auto-generated configuration files
An example cluster-configuration.yaml is available here. In the following we explain the fields in the yaml file one by one.
# A Linux host account with sudo permission
username: username
password: password
sshport: port
Set the default value of username, password, and sshport in default-machine-properties. PAI will use these default values to access cluster machines. User can override the default access information for each machine in machine-list.
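The override rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not paictl's actual code; the function name `resolve_access` and the sample values are hypothetical.

```python
def resolve_access(defaults, machine):
    """Return the SSH access info PAI would use for one machine-list entry."""
    access = {}
    for key in ("username", "password", "sshport"):
        # A per-machine value, when present, overrides the default.
        access[key] = machine.get(key, defaults[key])
    return access


# Hypothetical values standing in for default-machine-properties
# and one entry of machine-list.
defaults = {"username": "pai", "password": "secret", "sshport": 22}
machine = {"hostname": "worker-01", "hostip": "10.0.0.5", "sshport": 2222}

print(resolve_access(defaults, machine))
```

Here only sshport is overridden for worker-01, so username and password fall back to the defaults.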
mem: 224
type: teslak80
count: 4
vcore: 24
#Note: Up to now, the only supported os version is Ubuntu16.04. Please do not change it here.
os: ubuntu16.04
In this field, you can define several SKUs with different names. Each machine in the machine list must refer to one of them.
check gpu driver:
search for the driver and view its status
view the driver logs; the log shows that the driver is healthy
- hostname: hostname (echo `hostname`)
hostip: IP
machine-type: D8SV3
etcdid: etcdid1
#sshport: PORT (Optional)
#username: username (Optional)
#password: password (Optional)
k8s-role: master
dashboard: "true"
zkid: "1"
pai-master: "true"
- hostname: hostname
hostip: IP
machine-type: D8SV3
etcdid: etcdid2
#sshport: PORT (Optional)
#username: username (Optional)
#password: password (Optional)
k8s-role: master
node-exporter: "true"
- hostname: hostname
hostip: IP
machine-type: NC24R
#sshport: PORT (Optional)
#username: username (Optional)
#password: password (Optional)
k8s-role: worker
pai-worker: "true"
Users can control which node each service is deployed on by labeling the node with the service tags below:
| Configuration Property | File | Meaning |
| --- | --- | --- |
| hostname | cluster-configuration.yaml | Required. You can get the hostname by running echo `hostname` on the host. |
| hostip | cluster-configuration.yaml | Required. The ip address of the corresponding host. |
| machine-type | cluster-configuration.yaml | Required. The sku name defined in machine-sku. |
| sshport, username, password | cluster-configuration.yaml | Optional. Set these only if this machine's account or port differs from the default properties; otherwise you can remove them. |
| etcdid | cluster-configuration.yaml | Required for k8s masters. Etcd is part of the kubernetes master. If you assign k8s-role=master to a node, you should set this field. This value will be used when starting and fixing k8s. |
| k8s-role | cluster-configuration.yaml | Required. You can set this value to master, worker or proxy. If you want to configure more than 1 k8s-master, please refer to Kubernetes High Availability Configuration. |
| dashboard | cluster-configuration.yaml | Select one node to set this field, and set the value to "true". |
| pai-master | cluster-configuration.yaml | Optional. hadoop-name-node, hadoop-resource-manager, frameworklauncher, restserver, webportal, grafana, prometheus and node-exporter will be deployed on a pai-master node. |
| zkid | cluster-configuration.yaml | Unique zookeeper id required by pai-master node(s). You can set this field from 1 to n. |
| pai-worker | cluster-configuration.yaml | Optional. hadoop-data-node, hadoop-node-manager, and node-exporter will be deployed on a pai-worker node. |
| node-exporter | cluster-configuration.yaml | Optional. You can assign this label to nodes to enable hardware and service monitoring. |
Note: To deploy PAI in a single box, users should set pai-master and pai-worker labels for the same machine in machine-list section, or just follow the quick deployment approach described in this section.
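The rules in the table above can be checked before deployment with a small script. The following is a minimal sketch, assuming the machine-sku and machine-list sections have already been loaded into Python dicts; `check_machine_list` is a hypothetical helper (not part of paictl), and "exactly one dashboard node" is one reading of the dashboard row.

```python
VALID_ROLES = {"master", "worker", "proxy"}
REQUIRED_FIELDS = ("hostname", "hostip", "machine-type", "k8s-role")


def check_machine_list(machine_sku, machine_list):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for machine in machine_list:
        name = machine.get("hostname", "<no hostname>")
        for field in REQUIRED_FIELDS:
            if field not in machine:
                problems.append(f"{name}: missing required field {field}")
        sku = machine.get("machine-type")
        if sku is not None and sku not in machine_sku:
            problems.append(f"{name}: machine-type {sku} is not in machine-sku")
        role = machine.get("k8s-role")
        if role is not None and role not in VALID_ROLES:
            problems.append(f"{name}: unknown k8s-role {role}")
        # etcdid is required on every k8s master.
        if role == "master" and "etcdid" not in machine:
            problems.append(f"{name}: k8s master needs an etcdid")
        # zkid is required on pai-master nodes.
        if machine.get("pai-master") == "true" and "zkid" not in machine:
            problems.append(f"{name}: pai-master node needs a zkid")
    dashboards = [m for m in machine_list if m.get("dashboard") == "true"]
    if len(dashboards) != 1:
        problems.append(f"expected one dashboard node, found {len(dashboards)}")
    return problems


# Hypothetical cluster: SKU bodies are omitted because only the names matter here.
machine_sku = {"D8SV3": {}, "NC24R": {}}
machine_list = [
    {"hostname": "master-01", "hostip": "10.0.0.1", "machine-type": "D8SV3",
     "etcdid": "etcdid1", "k8s-role": "master", "dashboard": "true",
     "zkid": "1", "pai-master": "true"},
    {"hostname": "worker-01", "hostip": "10.0.0.2", "machine-type": "NC24R",
     "k8s-role": "worker", "pai-worker": "true"},
]

for problem in check_machine_list(machine_sku, machine_list):
    print(problem)
```

The sample cluster passes every check, so the loop prints nothing. Returning a list of messages rather than raising on the first error lets a user fix all problems in one pass.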
check node labels:
check service pod deployed on which node:
An example k8s-role-definition.yaml file is available here. The file is used to bootstrap a k8s cluster. It includes a list of k8s components and specifies which components should be included in each k8s role (master, worker, and proxy). By default, users do not need to change the file.
An example kubernetes-configuration.yaml file is available here. The yaml file includes the following fields.
We suggest using the default configuration:
cluster-dns: IP
load-balance-ip: IP
storage-backend: etcd3
hyperkube-version: v1.9.4
etcd-version: 3.2.17
apiserver-version: v1.9.4
kube-scheduler-version: v1.9.4
kube-controller-manager-version: v1.9.4
dashboard-version: v1.8.3
| Configuration Property | File | Meaning |
| --- | --- | --- |
| cluster-dns | kubernetes-configuration.yaml | Find the nameserver address in /etc/resolv.conf. |
| load-balance-ip | kubernetes-configuration.yaml | If the cluster has only one k8s-master, please set this field to the ip address of your k8s-master. If there is more than one k8s-master, please refer to k8s high availability configuration. |
| Configuration Property | File | Meaning |
| --- | --- | --- |
| service-cluster-ip-range | kubernetes-configuration.yaml | Please specify an ip range that does not overlap with the host network in the cluster. E.g., use the link-local IPv4 address according to [RFC 3927]. |
| storage-backend | kubernetes-configuration.yaml | ETCD major version. If you are not familiar with etcd, please do not change it. |
| docker-registry | kubernetes-configuration.yaml | The docker registry used in the k8s deployment. To use the official k8s Docker images, set this field to, the deployment process will pull Kubernetes component's image from . You can also set the docker registry to (or, which is maintained by pai. |
| hyperkube-version | kubernetes-configuration.yaml | The version of hyperkube. If the registry is gcr, you could find the version tag here. |
| etcd-version | kubernetes-configuration.yaml | The version of etcd. If you are not familiar with etcd, please do not change it. If the registry is gcr, you could find the version tag here. |
| apiserver-version | kubernetes-configuration.yaml | The version of apiserver. If the registry is gcr, you could find the version tag here. |
| kube-scheduler-version | kubernetes-configuration.yaml | The version of kube-scheduler. If the registry is gcr, you could find the version tag here. |
| kube-controller-manager-version | kubernetes-configuration.yaml | The version of kube-controller-manager. If the registry is gcr, you could find the version tag here. |
| dashboard-version | kubernetes-configuration.yaml | The version of kubernetes-dashboard. If the registry is gcr, you could find the version tag here. |
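The requirement that service-cluster-ip-range must not overlap with the host network can be checked with Python's standard ipaddress module. This is a minimal sketch; the helper name `overlapping_host_networks` and the sample networks are hypothetical, and 169.254.0.0/16 is used only because it is the IPv4 link-local block defined by RFC 3927.

```python
import ipaddress


def overlapping_host_networks(service_range, host_networks):
    """Return the host networks (CIDR strings) that overlap the service range."""
    service_net = ipaddress.ip_network(service_range)
    return [
        net for net in host_networks
        if service_net.overlaps(ipaddress.ip_network(net))
    ]


# Hypothetical host networks used by the cluster machines.
host_networks = ["10.0.0.0/24", "192.168.1.0/24"]

# An empty result means the candidate range is safe with respect to these networks.
print(overlapping_host_networks("169.254.0.0/16", host_networks))
print(overlapping_host_networks("10.0.0.0/16", host_networks))
```

The first call reports no conflict, while the second reports 10.0.0.0/24 because it lies inside 10.0.0.0/16.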
check kubernetes version:
~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.4", GitCommit:"bee2d150", GitTreeState:"clean", BuildDate:"2018-03-12T16:21:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
An example services-configuration.yaml file is available here. The following explains the details of the yaml file.
clusterid: pai-example
nvidia-drivers-version: 384.111
docker-verison: 17.06.2
data-path: "/datastorage"
docker-namespace: your_registry_namespace
docker-registry-domain: your_registry_domain
# If the docker registry doesn't require authentication, please leave docker_username and docker_password empty
docker-username: your_registry_username
docker-password: your_registry_password
docker-tag: your_image_tag
# The name of the secret in kubernetes will be created in your cluster
# Must be lower case, e.g., regsecret.
secret-name: your_secret_name
| Configuration Property | File | Meaning |
| --- | --- | --- |
| clusterid | services-configuration.yaml | The id of the cluster. |
| nvidia-drivers-version | services-configuration.yaml | Choose proper nvidia driver version for your cluster here. |
| docker-verison | services-configuration.yaml | The version of the Docker client used by hadoop NM (node manager) to launch Docker containers (e.g., of a deep learning job) in the host environment. |
| data-path | services-configuration.yaml | The absolute path on the host in your cluster to store data such as hdfs, zookeeper and yarn. Note: please make sure there is enough space in this path. |
| docker-namespace | services-configuration.yaml | Your registry's namespace. If you choose DockerHub as your docker registry, you should fill this field with your username. |
| docker-registry-domain | services-configuration.yaml | E.g., if the registry is public, fill docker-registry-domain with the word "public". |
| docker-username | services-configuration.yaml | The account of the docker registry. |
| docker-password | services-configuration.yaml | The password of the account. |
| docker-tag | services-configuration.yaml | The image tag of the service. You can set a specific version here, or just set latest. |
| secret-name | services-configuration.yaml | The name of the Kubernetes secret that will be created in your cluster. Must be lower case, e.g., regsecret. |
Note that we provide a read-only public docker registry on DockerHub for official releases. To use this docker registry, the docker-registry-info section should be configured as follows, leaving docker-username and docker-password commented out:
- docker-namespace: openpai
- docker-registry-domain:
#- docker-username: <n/a>
#- docker-password: <n/a>
- docker-tag: latest # or a specific version, i.e. 0.5.0.
- secret-name: <anything>
check docker OpenPAI public registry:
User can browse to to see all the repositories in this public docker registry.
check docker image tag:
check data path:
~$ ls /datastorage
hadooptmp hdfs launcherlogs prometheus yarn zoodata
check docker version:
~$ sudo docker version
Version: 17.09.0-ce
API version: 1.32
Go version: go1.8.3
Git commit: afdb6d4
Built: Tue Sep 26 22:42:18 2017
OS/Arch: linux/amd64
Version: 17.09.0-ce
API version: 1.32 (minimum version 1.12)
Go version: go1.8.3
Git commit: afdb6d4
Built: Tue Sep 26 22:40:56 2017
OS/Arch: linux/amd64
Experimental: false
check driver version:
# (1) find the driver container on the server
~$ sudo docker ps | grep driver
daeaa9a81d3f aiplatform/drivers "/bin/sh -c ./inst..." 8 days ago Up 8 days k8s_nvidia-drivers_drivers-one-shot-d7fr4_default_9d91059c-9078-11e8-8aea-000d3ab5296b_0
ccf53c260f6f "/pause" 8 days ago Up 8 days k8s_POD_drivers-one-shot-d7fr4_default_9d91059c-9078-11e8-8aea-000d3ab5296b_0
# (2) log in to the driver container
~$ sudo docker exec -it daeaa9a81d3f /bin/bash
# (3) check the driver version
root@~/drivers# nvidia-smi
Fri Aug 3 01:53:04 2018
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 Tesla K80 On | 0000460D:00:00.0 Off | 0 |
| N/A 31C P8 31W / 149W | 0MiB / 11439MiB | 0% Default |
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
| No running processes found |
# custom_hadoop_binary_path specifies the path where PAI stores the custom-built hadoop-ai
# Notice: the name should be hadoop-{hadoop-version}.tar.gz
custom-hadoop-binary-path: /pathHadoop/hadoop-2.9.0.tar.gz
description: default queue for all users.
capacity: 40
description: VC for Alice's team.
capacity: 20
description: VC for Bob's team.
capacity: 20
description: VC for Charlie's team.
capacity: 20
| Configuration Property | File | Meaning |
| --- | --- | --- |
| custom-hadoop-binary-path | services-configuration.yaml | Please set a path here for paictl to build hadoop-ai. |
| virtualClusters | services-configuration.yaml | Hadoop queue setting. Each VC will be assigned (capacity / total_capacity * 100%) of the resources. paictl will create the 'default' VC with 0 capacity if it has not been specified. paictl will split resources evenly among the VCs if the total capacity is 0. The capacity of a VC will be set to 0 if it is a negative number. |
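The capacity rules in the virtualClusters row can be worked through in code. The following is a minimal sketch of those four rules as stated, not paictl's actual implementation; the function name `vc_shares` and the VC names are hypothetical.

```python
def vc_shares(virtual_clusters):
    """Map each VC name to its percentage of cluster resources."""
    # Rule: a negative capacity is treated as 0.
    capacities = {name: max(0, cap) for name, cap in virtual_clusters.items()}
    # Rule: the 'default' VC is created with 0 capacity if it is not specified.
    capacities.setdefault("default", 0)
    total = sum(capacities.values())
    if total == 0:
        # Rule: resources are split evenly if the total capacity is 0.
        even = 100.0 / len(capacities)
        return {name: even for name in capacities}
    # Rule: each VC gets capacity / total_capacity * 100%.
    return {name: cap * 100.0 / total for name, cap in capacities.items()}


# Capacities taken from the example configuration; the VC names are made up.
shares = vc_shares({"default": 40, "alice": 20, "bob": 20, "charlie": 20})
for name, percent in shares.items():
    print(f"{name}: {percent:.1f}%")
```

With the example capacities (40, 20, 20, 20) the total is 100, so each VC's share equals its configured capacity. Capacities do not have to sum to 100: with capacities of 1 and 3, the shares would be 25% and 75%.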
check virtual cluster:
After configuring the node placement of services, which determines each service's node ip, users can also define each service's entry port with the following configuration (note: webportal is OpenPAI's main page):
frameworklauncher-port: 9086
frameworklauncher-port: Launcher's port. You can use the default value.
server-port: 9186
jwt-secret: your_jwt_secret
default-pai-admin-username: your_default_pai_admin_username
default-pai-admin-password: your_default_pai_admin_password
configure OpenPAI admin user account
| Configuration Property | File | Meaning |
| --- | --- | --- |
| server-port | services-configuration.yaml | Port for rest api server. You can use the default value. |
| jwt-secret | services-configuration.yaml | Secret for signing authentication tokens, e.g., "Hello PAI!" |
| default-pai-admin-username | services-configuration.yaml | Database admin username, and admin username of pai. |
| default-pai-admin-password | services-configuration.yaml | Database admin password. |
check admin user:
try to login:
server-port: 9286
server-port: Port for webportal. You can use the default value.
grafana-port: 3000
grafana-port: Port for grafana. You can use the default value.
prometheus-port: 9091
node-exporter-port: 9100
| Configuration Property | File | Meaning |
| --- | --- | --- |
| prometheus-port | services-configuration.yaml | Port for prometheus. You can use the default value. |
| node-exporter-port | services-configuration.yaml | Port for node exporter. You can use the default value. |
# port of pylon
port: 80
port: Port of pylon. You can use the default value.
Users can browse to each service's dashboard:
Single master mode does not have high availability.
- Only set one node's k8s-role to master
- Set the load-balance-ip field in kubernetes-configuration.yaml to your master's ip address
There are 3 roles in k8s-role-definition. The master role will start a k8s-master component on the specified machine, and the proxy role will start a proxy component on the specified machine. In cluster-configuration.yaml:
- One or more nodes are labeled with k8s-role: master
- One node should be labeled with k8s-role: proxy
- In kubernetes-configuration.yaml, set the load-balance-ip field to your proxy node's ip address
Note: the proxy node itself is not in HA mode. How to configure the proxy node in HA mode is out of the scope of PAI deployment.
If your cluster has a reliable load-balance server (e.g. in a cloud environment such as Azure), you could set up a load-balancer and set the field load-balance-ip in kubernetes-configuration.yaml to the load-balancer.
- Set the field load-balance-ip to the ip address of your load-balancer.
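The three ways of choosing load-balance-ip described above (single master, proxy node, external load-balancer) can be summarized as a small decision function. This is a sketch under the assumption that the machine list is available as Python dicts; `pick_load_balance_ip` is a hypothetical helper, not something paictl provides.

```python
def pick_load_balance_ip(machine_list, external_lb_ip=None):
    """Return the value to use for load-balance-ip in kubernetes-configuration.yaml."""
    # An external load-balancer (e.g. in a cloud environment) takes precedence.
    if external_lb_ip is not None:
        return external_lb_ip
    masters = [m for m in machine_list if m.get("k8s-role") == "master"]
    proxies = [m for m in machine_list if m.get("k8s-role") == "proxy"]
    # With a proxy node, the proxy's ip address is used.
    if len(proxies) == 1:
        return proxies[0]["hostip"]
    if len(proxies) > 1:
        raise ValueError("only one node should be labeled k8s-role: proxy")
    # Single master mode: the master's ip address is used.
    if len(masters) == 1:
        return masters[0]["hostip"]
    raise ValueError("need exactly one master, a proxy node, or an external load-balancer")


# Hypothetical HA cluster: two masters behind one proxy node.
ha_cluster = [
    {"hostname": "master-01", "hostip": "10.0.0.1", "k8s-role": "master"},
    {"hostname": "master-02", "hostip": "10.0.0.2", "k8s-role": "master"},
    {"hostname": "proxy-01", "hostip": "10.0.0.9", "k8s-role": "proxy"},
]
print(pick_load_balance_ip(ha_cluster))
```

Raising an error when there are several masters but no proxy or load-balancer reflects that the text defines no load-balance-ip value for that layout.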
The paictl tool sets the following default values in the 4 configuration files:
| Configuration Property | Default value |
| --- | --- |
| master node | The first machine in the machine list will be configured as the master node. |
| SSH port | If not explicitly specified, the SSH port is set to 22. |
| cluster DNS | If not explicitly specified, the cluster DNS is set to the value of the nameserver field in the /etc/resolv.conf file of the master node. |
| IP range used by Kubernetes | If not explicitly specified, the IP range used by Kubernetes is set to . |
| docker registry | The docker registry is set to , and the docker namespace is set to openpai. In other words, all PAI service images will be pulled from (see this link on DockerHub for the details of all images). |
| Cluster id | Cluster id is set to pai-example. |
| REST server's admin user | REST server's admin user is set to admin, and its password is set to admin-password. |
| VC | There is only one VC in the system, default, which has 100% of the resource capacity. |