update docs for release v0.4.0 #778

Merged (11 commits) on Dec 7, 2022
58 changes: 58 additions & 0 deletions apiserver/DEVELOPMENT.md
@@ -0,0 +1,58 @@
# User Guide

This guide documents the purpose and deployment of kuberay-apiserver.

## Requirements

| software | version | link |
| :------- | :------: | ------------------------------------------------------------------: |
| kubectl | v1.18.3+ | [download](https://kubernetes.io/docs/tasks/tools/install-kubectl/) |
| go | v1.13+ | [download](https://golang.org/dl/) |
| docker | 19.03+ | [download](https://docs.docker.com/install/) |

## Purpose
Lifecycle management of Ray clusters may not be easy for Kubernetes non-experts.
The backend service provides a RESTful web service to manage Ray cluster Kubernetes resources.

## Build and Deployment
The backend service can be deployed locally or in the Kubernetes cluster itself. The HTTP service listens on port 8888.

### Pre-requisites
The admin kubeconfig file is located at `~/.kube/config`.

### Local Deployment
#### Build
```
go build -a -o raymgr cmd/main.go
```

#### Start Service
```
./raymgr
```
#### Access
The service is available at `localhost:8888`.
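
A quick reachability check could look like the following (a sketch; hitting the root path is expected to return a "Not Found" style response, as in the APIServer README example):

```
curl localhost:8888
```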

### Kubernetes Deployment
#### Build
```
./docker-image-builder.sh
```
This script builds the image and optionally pushes it to the remote Docker registry (hub.byted.org).
#### Start Service
```
kubectl apply -f deploy/
```
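
After applying the manifests, you can check that the pods are up (a quick sanity check; it assumes the manifests in `deploy/` use the `ray-system` namespace, consistent with the service lookup below):

```
kubectl get pods -n ray-system
```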
#### Access
To get the port:

```
NODE_PORT=$(kubectl get -o jsonpath="{.spec.ports[0].nodePort}" services backend-service -n ray-system)
```
To get the node IP:
```
NODE_IP=$(kubectl get nodes -o jsonpath='{ $.items[*].status.addresses[?(@.type=="InternalIP")].address }')
```
and pick any IP address from the list.

Use `NODE_IP:NODE_PORT` to access the service.
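
For example, a reachability check could look like this (a sketch; it assumes `NODE_IP` holds a single address picked from the list above):

```
curl http://$NODE_IP:$NODE_PORT
```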
164 changes: 158 additions & 6 deletions apiserver/README.md
@@ -1,28 +1,180 @@
# KubeRay APIServer

The KubeRay APIServer provides gRPC and HTTP APIs to manage KubeRay resources.

**Note**

The KubeRay APIServer is an optional component. It provides a layer of simplified
configuration for KubeRay resources. The KubeRay API server is used internally
by some organizations to back user interfaces for KubeRay resource management.

The KubeRay APIServer is community-managed and is not officially endorsed by the
Ray maintainers. At this time, the only officially supported methods for
managing KubeRay resources are

- Direct management of KubeRay custom resources via kubectl, kustomize, and Kubernetes language clients.
- Helm charts.

KubeRay APIServer maintainer contacts (GitHub handles):
@Jeffwan @scarlet25151


## Usage

You can install the KubeRay APIServer by using the [helm chart](https://github.com/ray-project/kuberay/tree/master/helm-chart/kuberay-apiserver) or [kustomize](https://github.com/ray-project/kuberay/tree/master/apiserver/deploy/base).

After the deployment, you can use `{{baseUrl}}` to access the service:

- (default) For NodePort access, the default HTTP port is `31888`, and you can connect to it directly (for example, `localhost:31888`).

- For Ingress access, you will need to create your own Ingress resource.

Detailed request parameters can be found in the [KubeRay swagger](https://github.com/ray-project/kuberay/tree/master/proto/swagger); here we only present some basic examples:
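
With the default NodePort setup, a first request against `{{baseUrl}}` could look like the following sketch (the list endpoint path is an assumption based on the v1alpha2 examples below and the swagger definitions):

```bash
# Hypothetical example: list compute templates in the ray-system namespace
curl http://localhost:31888/apis/v1alpha2/namespaces/ray-system/compute_templates
```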

### Setup end-to-end test

0. (Optional) You may use your local kind cluster or minikube; a quick node check follows the config below.

```bash
cat <<EOF | kind create cluster --name ray-test --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 30379
    hostPort: 6379
    listenAddress: "0.0.0.0"
    protocol: tcp
  - containerPort: 30265
    hostPort: 8265
    listenAddress: "0.0.0.0"
    protocol: tcp
  - containerPort: 30001
    hostPort: 10001
    listenAddress: "0.0.0.0"
    protocol: tcp
  - containerPort: 8000
    hostPort: 8000
    listenAddress: "0.0.0.0"
  - containerPort: 31888
    hostPort: 31888
    listenAddress: "0.0.0.0"
- role: worker
- role: worker
EOF
```
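
Once the cluster is up, you can verify the nodes (a quick check, assuming `kubectl` now points at the new kind context):

```bash
kubectl get nodes
# Expect one control-plane node and two worker nodes in Ready state
```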

1. Deploy the KubeRay APIServer within the same cluster as the KubeRay operator; a quick pod check follows the command below.

```bash
helm -n ray-system install kuberay-apiserver kuberay/helm-chart/kuberay-apiserver
```
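
You can confirm that the APIServer pod is running (a sanity check; the `ray-system` namespace follows the helm command above):

```bash
kubectl -n ray-system get pods
```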

2. The APIServer exposes its service using `NodePort` by default. You can test access via your host and port; the default port is set to `31888`.

```
curl localhost:31888
{"code":5, "message":"Not Found"}
```

3. You can create a `RayCluster`, `RayJob`, or `RayService` by calling the endpoints. The following is a simple example for creating a `RayService` object; follow [swagger support](https://ray-project.github.io/kuberay/components/apiserver/#swagger-support) to get the complete definitions of the APIs.

```shell
curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/compute_templates' \
--header 'Content-Type: application/json' \
--data '{
"name": "default-template",
"namespace": "ray-system",
"cpu": 2,
"memory": 4
}'

curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/services' \
--header 'Content-Type: application/json' \
--data '{
"name": "user-test-1",
"namespace": "ray-system",
"user": "user",
"serveDeploymentGraphSpec": {
"importPath": "fruit.deployment_graph",
"runtimeEnv": "working_dir: \"https://github.com/ray-project/test_dag/archive/c620251044717ace0a4c19d766d43c5099af8a77.zip\"\n",
"serveConfigs": [
{
"deploymentName": "OrangeStand",
"replicas": 1,
"userConfig": "price: 2",
"actorOptions": {
"cpusPerActor": 0.1
}
},
{
"deploymentName": "PearStand",
"replicas": 1,
"userConfig": "price: 1",
"actorOptions": {
"cpusPerActor": 0.1
}
},
{
"deploymentName": "FruitMarket",
"replicas": 1,
"actorOptions": {
"cpusPerActor": 0.1
}
},{
"deploymentName": "DAGDriver",
"replicas": 1,
"routePrefix": "/",
"actorOptions": {
"cpusPerActor": 0.1
}
}]
},
"clusterSpec": {
"headGroupSpec": {
"computeTemplate": "default-template",
"image": "rayproject/ray:2.1.0",
"serviceType": "NodePort",
"rayStartParams": {
"dashboard-host": "0.0.0.0",
"metrics-export-port": "8080"
},
"volumes": []
},
"workerGroupSpec": [
{
"groupName": "small-wg",
"computeTemplate": "default-template",
"image": "rayproject/ray:2.1.0",
"replicas": 1,
"minReplicas": 0,
"maxReplicas": 5,
"rayStartParams": {
"node-ip-address": "$MY_POD_IP"
}
}
]
}
}'
```
The Ray resource will then be created in your Kubernetes cluster.
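
To confirm that the objects were created, you can query the cluster directly (a quick check; the resource names follow the KubeRay CRDs):

```shell
kubectl -n ray-system get rayservices
```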

## Full definition of payload

### Compute Template

To simplify resource configuration, the pod template resources are abstracted into a `compute template`. You can define the resources in a `compute template` and then choose the appropriate
template for your `head` and `workergroup` when creating the actual `RayCluster`, `RayJob`, or `RayService` objects.

#### Create compute templates in a given namespace

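A request sketch for this endpoint, mirroring the compute template example from the end-to-end test above (the diff is truncated here, so the field values are illustrative):

```shell
curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/ray-system/compute_templates' \
--header 'Content-Type: application/json' \
--data '{
  "name": "default-template",
  "namespace": "ray-system",
  "cpu": 2,
  "memory": 4
}'
```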
1 change: 1 addition & 0 deletions apiserver/deploy/base/apiserver.yaml
@@ -84,6 +84,7 @@ rules:
resources:
- rayclusters
- rayjobs
- rayservices

verbs:
- create
- delete
8 changes: 3 additions & 5 deletions docs/best-practice/worker-head-reconnection.md
@@ -6,7 +6,7 @@ For a `RayCluster` with a head and several workers, if a worker is crashed, it w

## Explanation

When the head pod is deleted, it will be recreated with a new IP by the KubeRay controller, and the GCS server address changes accordingly. The Raylets of all workers will try to get the GCS address from Redis in `ReconnectGcsServer`, but the redis_clients always use the previous head IP, so they will always fail to get the new GCS address. The Raylets will not exit until the max retries are reached. There are two configurations determining this long delay:

```
/// The interval at which the gcs rpc client will check if gcs rpc server is ready.
```

@@ -22,12 +22,10 @@ It retries 600 times and each interval is 1s, resulting in total 600s timeout, i

## Best Practice

The GCS Fault-Tolerance (FT) feature is in alpha release. To enable GCS FT, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md).

To reduce the chances of a lost worker-head connection, there are two other options:

- Make the head more stable: when creating the cluster, allocate a sufficient amount of resources on the head pod so that it tends to be stable and not easy to crash. You can also set `{"num-cpus": "0"}` in the `rayStartParams` of the `headGroupSpec` so that the Ray scheduler will skip the head node when scheduling workloads. This also helps to maintain the stability of the head.

- Make reconnection shorter: for Ray version <= 1.9.1, you can set the head param `--system-config='{"ping_gcs_rpc_server_max_retries": 20}'` to reduce the delay from 600s down to 20s before workers reconnect to the new head.
