update docs for release v0.4.0 #778

Merged (11 commits) on Dec 7, 2022
58 changes: 58 additions & 0 deletions apiserver/DEVELOPMENT.md
@@ -0,0 +1,58 @@
# User Guide

This guide documents the purpose and deployment of kuberay-apiserver.

## Requirements

| software | version | link |
| :------- | :------: | ------------------------------------------------------------------: |
| kubectl | v1.18.3+ | [download](https://kubernetes.io/docs/tasks/tools/install-kubectl/) |
| go | v1.13+ | [download](https://golang.org/dl/) |
| docker | 19.03+ | [download](https://docs.docker.com/install/) |

## Purpose
Lifecycle management of Ray clusters may not be easy for Kubernetes non-experts.
The backend service provides a RESTful web service to manage Ray cluster Kubernetes resources.

## Build and Deployment
The backend service can be deployed locally or in the Kubernetes cluster itself. The HTTP service listens on port 8888.

### Pre-requisites
The admin kubeconfig file is located at `~/.kube/config`.

### Local Deployment
#### Build
```
go build -a -o raymgr cmd/main.go
```

#### Start Service
```
./raymgr
```
#### Access
The service is available at `localhost:8888`.
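
A quick reachability check could look like the following (a sketch; hitting the root path is expected to return a "Not Found" style response, as in the APIServer README example):

```
curl localhost:8888
```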

### Kubernetes Deployment
#### Build
```
./docker-image-builder.sh
```
This script builds the image and optionally pushes it to the remote Docker registry (hub.byted.org).
#### Start Service
```
kubectl apply -f deploy/
```
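
After applying the manifests, you can check that the pods are up (a quick sanity check; it assumes the manifests in `deploy/` use the `ray-system` namespace, consistent with the service lookup below):

```
kubectl get pods -n ray-system
```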
#### Access
To get the port:

```
NODE_PORT=$(kubectl get -o jsonpath="{.spec.ports[0].nodePort}" services backend-service -n ray-system)
```
To get the node IP:
```
NODE_IP=$(kubectl get nodes -o jsonpath='{ $.items[*].status.addresses[?(@.type=="InternalIP")].address }')
```
and pick any IP address from the list.

Use `NODE_IP:NODE_PORT` to access the service.
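
For example, a reachability check could look like this (a sketch; it assumes `NODE_IP` holds a single address picked from the list above):

```
curl http://$NODE_IP:$NODE_PORT
```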
164 changes: 158 additions & 6 deletions apiserver/README.md
@@ -1,28 +1,180 @@
# KubeRay APIServer

The KubeRay APIServer provides gRPC and HTTP APIs to manage KubeRay resources.

**Note**

The KubeRay APIServer is an optional component. It provides a layer of simplified
configuration for KubeRay resources. The KubeRay API server is used internally
by some organizations to back user interfaces for KubeRay resource management.

The KubeRay APIServer is community-managed and is not officially endorsed by the
Ray maintainers. At this time, the only officially supported methods for
managing KubeRay resources are

- Direct management of KubeRay custom resources via kubectl, kustomize, and Kubernetes language clients.
- Helm charts.

KubeRay APIServer maintainer contacts (GitHub handles):
@Jeffwan @scarlet25151


## Usage

You can install the KubeRay APIServer by using the [helm chart](https://github.com/ray-project/kuberay/tree/master/helm-chart/kuberay-apiserver) or [kustomize](https://github.com/ray-project/kuberay/tree/master/apiserver/deploy/base).

After the deployment, you can use `{{baseUrl}}` to access the service:

- (default) For NodePort access, the default HTTP port is `31888`, and you can connect to it directly (for example, `localhost:31888`).

- For Ingress access, you will need to create your own Ingress resource.

Detailed request parameters can be found in the [KubeRay swagger](https://github.com/ray-project/kuberay/tree/master/proto/swagger); here we only present some basic examples:
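
With the default NodePort setup, a first request against `{{baseUrl}}` could look like the following sketch (the list endpoint path is an assumption based on the v1alpha2 examples below and the swagger definitions):

```bash
# Hypothetical example: list compute templates in the ray-system namespace
curl http://localhost:31888/apis/v1alpha2/namespaces/ray-system/compute_templates
```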

### Setup end-to-end test

0. (Optional) You may use your local kind cluster or minikube; a quick node check follows the config below.

```bash
cat <<EOF | kind create cluster --name ray-test --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 30379
    hostPort: 6379
    listenAddress: "0.0.0.0"
    protocol: tcp
  - containerPort: 30265
    hostPort: 8265
    listenAddress: "0.0.0.0"
    protocol: tcp
  - containerPort: 30001
    hostPort: 10001
    listenAddress: "0.0.0.0"
    protocol: tcp
  - containerPort: 8000
    hostPort: 8000
    listenAddress: "0.0.0.0"
  - containerPort: 31888
    hostPort: 31888
    listenAddress: "0.0.0.0"
- role: worker
- role: worker
EOF
```
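
Once the cluster is up, you can verify the nodes (a quick check, assuming `kubectl` now points at the new kind context):

```bash
kubectl get nodes
# Expect one control-plane node and two worker nodes in Ready state
```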

1. Deploy the KubeRay APIServer within the same cluster as the KubeRay operator; a quick pod check follows the command below.

```bash
helm -n ray-system install kuberay-apiserver kuberay/helm-chart/kuberay-apiserver
```
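
You can confirm that the APIServer pod is running (a sanity check; the `ray-system` namespace follows the helm command above):

```bash
kubectl -n ray-system get pods
```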

2. The APIServer exposes its service using `NodePort` by default. You can test access via your host and port; the default port is set to `31888`.

```
curl localhost:31888
{"code":5, "message":"Not Found"}
```

3. You can create a `RayCluster`, `RayJob`, or `RayService` by calling the endpoints. The following is a simple example for creating a `RayService` object; follow [swagger support](https://ray-project.github.io/kuberay/components/apiserver/#swagger-support) to get the complete definitions of the APIs.

```shell
curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/compute_templates' \
--header 'Content-Type: application/json' \
--data '{
"name": "default-template",
"namespace": "ray-system",
"cpu": 2,
"memory": 4
}'

curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/services' \
--header 'Content-Type: application/json' \
--data '{
"name": "user-test-1",
"namespace": "ray-system",
"user": "user",
"serveDeploymentGraphSpec": {
"importPath": "fruit.deployment_graph",
"runtimeEnv": "working_dir: \"https://github.com/ray-project/test_dag/archive/c620251044717ace0a4c19d766d43c5099af8a77.zip\"\n",
"serveConfigs": [
{
"deploymentName": "OrangeStand",
"replicas": 1,
"userConfig": "price: 2",
"actorOptions": {
"cpusPerActor": 0.1
}
},
{
"deploymentName": "PearStand",
"replicas": 1,
"userConfig": "price: 1",
"actorOptions": {
"cpusPerActor": 0.1
}
},
{
"deploymentName": "FruitMarket",
"replicas": 1,
"actorOptions": {
"cpusPerActor": 0.1
}
},{
"deploymentName": "DAGDriver",
"replicas": 1,
"routePrefix": "/",
"actorOptions": {
"cpusPerActor": 0.1
}
}]
},
"clusterSpec": {
"headGroupSpec": {
"computeTemplate": "default-template",
"image": "rayproject/ray:2.1.0",
"serviceType": "NodePort",
"rayStartParams": {
"dashboard-host": "0.0.0.0",
"metrics-export-port": "8080"
},
"volumes": []
},
"workerGroupSpec": [
{
"groupName": "small-wg",
"computeTemplate": "default-template",
"image": "rayproject/ray:2.1.0",
"replicas": 1,
"minReplicas": 0,
"maxReplicas": 5,
"rayStartParams": {
"node-ip-address": "$MY_POD_IP"
}
}
]
}
}'
```
The Ray resource will then be created in your Kubernetes cluster.
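
To confirm that the objects were created, you can query the cluster directly (a quick check; the resource names follow the KubeRay CRDs):

```shell
kubectl -n ray-system get rayservices
```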

## Full definition of payload

### Compute Template

To simplify resource configuration, the pod template resources are abstracted into a `compute template`. You can define the resources in a `compute template` and then choose the appropriate
template for your `head` and `workergroup` when creating the actual `RayCluster`, `RayJob`, or `RayService` objects.

#### Create compute templates in a given namespace

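A request sketch for this endpoint, mirroring the compute template example from the end-to-end test above (the diff is truncated here, so the field values are illustrative):

```shell
curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/compute_templates' \
--header 'Content-Type: application/json' \
--data '{
  "name": "default-template",
  "namespace": "ray-system",
  "cpu": 2,
  "memory": 4
}'
```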
1 change: 1 addition & 0 deletions apiserver/deploy/base/apiserver.yaml
@@ -84,6 +84,7 @@ rules:
resources:
- rayclusters
- rayjobs
- rayservices

verbs:
- create
- delete
8 changes: 3 additions & 5 deletions docs/best-practice/worker-head-reconnection.md
@@ -6,7 +6,7 @@ For a `RayCluster` with a head and several workers, if a worker is crashed, it w

## Explanation

When the head pod is deleted, it will be recreated with a new IP by the KubeRay controller, and the GCS server address changes accordingly. The Raylets of all workers will try to get the GCS address from Redis in `ReconnectGcsServer`, but the redis_clients always use the previous head IP, so they will always fail to get the new GCS address. The Raylets will not exit until the max retries are reached. There are two configurations determining this long delay:

```
/// The interval at which the gcs rpc client will check if gcs rpc server is ready.
```

@@ -22,12 +22,10 @@ It retries 600 times and each interval is 1s, resulting in total 600s timeout, i

## Best Practice

The GCS Fault-Tolerance (FT) feature is in alpha release. To enable GCS FT, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md).

To reduce the chances of a lost worker-head connection, there are two other options:

- Make the head more stable: when creating the cluster, allocate a sufficient amount of resources on the head pod so that it tends to be stable and not easy to crash. You can also set `{"num-cpus": "0"}` in the `rayStartParams` of the `headGroupSpec` so that the Ray scheduler will skip the head node when scheduling workloads. This also helps to maintain the stability of the head.

- Make reconnection shorter: for Ray version <= 1.9.1, you can set the head param `--system-config='{"ping_gcs_rpc_server_max_retries": 20}'` to reduce the delay from 600s down to 20s before workers reconnect to the new head.
