
Commit

update doc
kevin85421 committed Aug 7, 2023
1 parent 413a803 commit 875c7f3
Showing 2 changed files with 15 additions and 2 deletions.
docs/guidance/rayservice-troubleshooting.md (1 addition & 1 deletion)
@@ -237,7 +237,7 @@ If you consistently encounter this issue, there are several possible causes:
# Get \"http://rayservice-sample-raycluster-rqlsl-head-svc.default.svc.cluster.local:52365/api/serve/applications/\": dial tcp 10.96.7.154:52365: connect: connection refused
```

- ### Issue 8: A loop of restarting the RayCluster occurs when the Kubernetes cluster runs out of resources.
+ ### Issue 8: A loop of restarting the RayCluster occurs when the Kubernetes cluster runs out of resources. (KubeRay v0.6.1 or earlier)

> Note: Currently, the KubeRay operator does not have a clear plan to handle situations where the Kubernetes cluster runs out of resources.
> Therefore, we recommend ensuring that the Kubernetes cluster has sufficient resources to accommodate the serve application.
docs/guidance/rayservice.md (14 additions & 1 deletion)
@@ -178,7 +178,6 @@ curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc:800
```

* `rayservice-sample-serve-svc` is highly available (HA) in general. It routes traffic among all workers that have Serve deployments and always tries to point to the healthy cluster, even during upgrades or failures.
- * You can set `serviceUnhealthySecondThreshold` to define the threshold of seconds that the Serve deployments fail. You can also set `deploymentUnhealthySecondThreshold` to define the threshold of seconds that Ray fails to deploy any Serve deployments.

## Step 7: In-place update for Ray Serve applications

@@ -263,6 +262,20 @@ curl -X POST -H 'Content-Type: application/json' rayservice-sample-serve-svc:800
# [Expected output]: 8
```

### Two other scenarios that trigger a new RayCluster preparation

> Note: The following behavior applies to KubeRay v0.6.2 or newer.
> For older versions, please refer to [kuberay#1293](https://github.com/ray-project/kuberay/pull/1293) for more details.

A zero-downtime upgrade is not the only event that triggers the preparation of a new RayCluster; KubeRay also prepares one when it considers a RayCluster unhealthy.
Within a RayService, KubeRay marks a RayCluster as unhealthy in two scenarios.

* Case 1: The KubeRay operator cannot connect to the dashboard agent on the head Pod for longer than the duration defined by the `deploymentUnhealthySecondThreshold` parameter. The default value of `deploymentUnhealthySecondThreshold`, which the sample YAML files also use, is 300 seconds.

* Case 2: The KubeRay operator marks a RayCluster as unhealthy if the status of a Serve application is `DEPLOY_FAILED` or `UNHEALTHY` for a duration exceeding the `serviceUnhealthySecondThreshold` parameter. The default value of `serviceUnhealthySecondThreshold`, which the sample YAML files also use, is 900 seconds.

After KubeRay marks a RayCluster as unhealthy, it initiates the creation of a new RayCluster. Once the new RayCluster is ready, KubeRay redirects network traffic to it and subsequently deletes the old RayCluster. Both thresholds are fields on the RayService spec, as sketched below.
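
As a rough illustration (not part of this commit's diff), the snippet below shows a minimal, hypothetical RayService manifest with both thresholds set to the defaults quoted above. The placement of the fields directly under `spec` is an assumption based on the sample YAML files, and the `serveConfigV2` and `rayClusterConfig` contents are omitted.

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  # Case 2: a Serve app stuck in DEPLOY_FAILED or UNHEALTHY for longer
  # than this many seconds marks the RayCluster unhealthy (default: 900).
  serviceUnhealthySecondThreshold: 900
  # Case 1: the dashboard agent on the head Pod being unreachable for longer
  # than this many seconds marks the RayCluster unhealthy (default: 300).
  deploymentUnhealthySecondThreshold: 300
  # serveConfigV2 and rayClusterConfig omitted for brevity.
```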

## Step 9: Clean up the Kubernetes cluster

```sh
# ... (cleanup commands collapsed in the diff view)
```
