correct gcs ha to gcs ft #482

Merged 1 commit on Aug 16, 2022
4 changes: 2 additions & 2 deletions docs/best-practice/worker-head-reconnection.md
@@ -22,12 +22,12 @@ It retries 600 times and each interval is 1s, resulting in total 600s timeout, i

## Best Practice

-GCS HA feature [#20498](https://github.com/ray-project/ray/issues/20498) is planned in Ray Core Roadmap. When this feature is released, expect a stable head and GCS such that worker-head connection lost issue will not appear anymore.
+GCS FT feature [#20498](https://github.com/ray-project/ray/issues/20498) is planned in the Ray Core roadmap. Once it is released, expect a stable head and GCS, so the worker-head connection loss issue will no longer appear.

Before that, to mitigate the worker-head connection loss, there are two options:

- Make the head more stable: when creating the cluster, allocate sufficient resources on the head pod so that it tends to be stable and does not crash easily. You can also set {"num-cpus": "0"} in "rayStartParams" of "headGroupSpec" so that the Ray scheduler skips the head node when scheduling workloads. This also helps maintain the stability of the head.

- Make reconnection shorter: for versions <= 1.9.1, you can set the head param --system-config='{"ping_gcs_rpc_server_max_retries": 20}' to reduce the delay from 600s down to 20s before workers reconnect to the new head.
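The first option can be sketched as a trimmed RayCluster manifest. This is illustrative only: most required fields are omitted and the cluster name is a placeholder, but the field names follow the KubeRay RayCluster CRD.

```yaml
# Sketch: keep the Ray scheduler from placing workloads on the head node.
# Only the relevant fields are shown; this is not a complete manifest.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-stable-head   # placeholder name
spec:
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"   # head advertises zero CPUs, so tasks go to workers
```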

-> Note: we should update this doc when GCS HA feature gets updated.
+> Note: we should update this doc when the GCS FT feature gets updated.
30 changes: 15 additions & 15 deletions docs/guidance/gcs-ha.md → docs/guidance/gcs-ft.md
@@ -1,32 +1,32 @@
-## Ray GCS HA (Experimental)
+## Ray GCS Fault Tolerance (GCS FT) (Experimental)

> Note: This feature is still experimental; there are a few limitations, and stabilization work will be done in future releases on both the Ray and KubeRay sides.

-Ray GCS HA enables GCS server to use external storage backend. As a result, Ray clusters can tolerant GCS failures and recover from failures
+Ray GCS FT enables the GCS server to use an external storage backend. As a result, Ray clusters can tolerate GCS failures and recover from them
without affecting important services such as detached Actors & RayServe deployments.

### Prerequisite

* Ray 2.0 is required.
* You need to set up an external Redis server for Ray. (A Redis HA cluster is highly recommended.)

-### Enable Ray GCS HA
+### Enable Ray GCS FT

-To enable Ray GCS HA in your newly KubeRay-managed Ray cluster, you need to enable it by adding an annotation to the
+To enable Ray GCS FT in a newly created KubeRay-managed Ray cluster, add an annotation to the
RayCluster YAML file.

```yaml
...
kind: RayCluster
metadata:
  annotations:
-    ray.io/ha-enabled: "true" # <- add this annotation enable GCS HA
+    ray.io/ft-enabled: "true" # <- add this annotation to enable GCS FT
    ray.io/external-storage-namespace: "my-raycluster-storage-namespace" # <- optional, to specify the external storage namespace
...
```
An example can be found at [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml)

-When annotation `ray.io/ha-enabled` is added with a `true` value, KubeRay will enable Ray GCS HA feature. This feature
+When the annotation `ray.io/ft-enabled` is set to `true`, KubeRay enables the Ray GCS FT feature. This feature
contains several components:

1. A newly created Ray cluster has a `Readiness Probe` and a `Liveness Probe` added to all head/worker nodes.
@@ -37,7 +37,7 @@ contains several components:

#### Readiness Probe vs Liveness Probe

-These are the two types of probes we used in Ray GCS HA.
+These are the two types of probes used in Ray GCS FT.

The readiness probe is used to notify KubeRay of failures in the corresponding Ray cluster. KubeRay then tries its best to
recover the Ray cluster. If KubeRay cannot recover the failed head/worker node, the liveness probe kicks in, deletes the old pod
@@ -53,14 +53,14 @@ On Ray head node, we access a local Ray dashboard http endpoint and a Raylet htt
healthy state. Since the Ray dashboard does not reside on Ray worker nodes, we only check the local Raylet http endpoint to make sure
the worker node is healthy.
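For readers unfamiliar with Kubernetes probes, the shape is roughly the following. This is an illustrative sketch only: the endpoint path and port are placeholders, not the exact probes KubeRay generates, while the timing values mirror the readiness defaults in `constant.go` from this PR.

```yaml
# Illustrative probe shape only; KubeRay builds these programmatically,
# and the real endpoints/commands may differ from these placeholders.
readinessProbe:
  httpGet:
    path: /healthz   # placeholder endpoint
    port: 8265       # placeholder port
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 3
  failureThreshold: 20
livenessProbe:
  httpGet:
    path: /healthz
    port: 8265
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 3
```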

-#### Ray GCS HA Annotation
+#### Ray GCS FT Annotation

-Our Ray GCS HA feature checks if an annotation called `ray.io/ha-enabled` is set to `true` in `RayCluster` YAML file. If so, KubeRay
+Our Ray GCS FT feature checks if an annotation called `ray.io/ft-enabled` is set to `true` in the `RayCluster` YAML file. If so, KubeRay
will also add such annotation to the pod whenever the head/worker node is created.

#### Use External Redis Cluster

-To use external Redis cluster as the backend storage(required by Ray GCS HA),
+To use an external Redis cluster as the backend storage (required by Ray GCS FT),
you need to add `RAY_REDIS_ADDRESS` environment variable to the head node template.

Also, you can specify a storage namespace for your Ray cluster by using an annotation `ray.io/external-storage-namespace`
@@ -70,8 +70,8 @@ An example can be found at [ray-cluster.external-redis.yaml](https://github.com/
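A minimal sketch of wiring the head pod to the external Redis follows. The address is a placeholder; in the linked sample it points at a Redis service running in the same Kubernetes namespace.

```yaml
# Sketch: head group template exporting RAY_REDIS_ADDRESS so the GCS
# uses the external Redis (host/port are placeholders for your setup).
headGroupSpec:
  template:
    spec:
      containers:
        - name: ray-head
          env:
            - name: RAY_REDIS_ADDRESS
              value: redis:6379
```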
#### KubeRay Operator Controller

The KubeRay Operator controller watches for new `Event` reconcile calls. If the Event object notifies a failed readiness probe,
-controller checks if this pod has `ray.io/ha-enabled` set to `true`. If this pod has this annotation set to true, that means this pod
-belongs to a Ray cluster that has Ray GCS HA enabled.
+the controller checks if the pod has `ray.io/ft-enabled` set to `true`. If so, the pod
+belongs to a Ray cluster that has Ray GCS FT enabled.

After this, the controller will try to recover the failed pod. If the controller cannot recover it, an annotation named
`ray.io/health-state` with the value `Unhealthy` is added to the pod.
@@ -82,7 +82,7 @@ In every KubeRay Operator controller reconcile loop, it monitors any pod in Ray
#### External Storage Namespace

External storage namespaces can be used to share a single storage backend among multiple Ray clusters. By default, `ray.io/external-storage-namespace`
-uses the RayCluster UID as its value when GCS HA is enabled. Or if the user wants to use customized external storage namespace,
+uses the RayCluster UID as its value when GCS FT is enabled. If the user wants a customized external storage namespace,
the user can add the `ray.io/external-storage-namespace` annotation to the RayCluster YAML file.
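For example, sharing one Redis backend between two clusters might look like the following sketch (cluster and namespace names are placeholders; most required fields are omitted):

```yaml
# Sketch: two RayClusters sharing one Redis backend, isolated by
# distinct external storage namespaces.
kind: RayCluster
metadata:
  name: cluster-a
  annotations:
    ray.io/ft-enabled: "true"
    ray.io/external-storage-namespace: "storage-ns-a"
---
kind: RayCluster
metadata:
  name: cluster-b
  annotations:
    ray.io/ft-enabled: "true"
    ray.io/external-storage-namespace: "storage-ns-b"
```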

Whenever `ray.io/external-storage-namespace` annotation is set, the head/worker node will have `RAY_external_storage_namespace` environment
@@ -92,9 +92,9 @@ variable set which Ray can pick up later.

1. For now, a Ray head/worker node that fails the readiness probe recovers by restarting itself. More fine-grained control and recovery mechanisms are expected in the future.

-### Test Ray GCS HA
+### Test Ray GCS FT

-Currently, two tests are responsible for ensuring Ray GCS HA is working correctly.
+Currently, two tests are responsible for ensuring Ray GCS FT is working correctly.

1. Detached actor test
2. RayServe test
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -30,7 +30,7 @@ nav:
  - Features:
      - RayService: guidance/rayservice.md
      - RayJob: guidance/rayjob.md
-      - Ray GCS HA: guidance/gcs-ha.md
+      - Ray GCS FT: guidance/gcs-ft.md
      - Autoscaling: guidance/autoscaler.md
      - Ingress: guidance/ingress.md
      - Observability: guidance/observability.md
@@ -72,7 +72,7 @@ metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  annotations:
-    ray.io/ha-enabled: "true" # enable Ray HA
+    ray.io/ft-enabled: "true" # enable Ray GCS FT
    ray.io/external-storage-namespace: "my-raycluster-storage-namespace"
  # A unique identifier for the head node and workers of this cluster.
  name: raycluster-external-redis
8 changes: 4 additions & 4 deletions ray-operator/controllers/ray/common/constant.go
@@ -11,8 +11,8 @@ const (
RayClusterDashboardServiceLabelKey = "ray.io/cluster-dashboard"
RayClusterServingServiceLabelKey = "ray.io/serve"

-	// Ray GCS HA related annotations
-	RayHAEnabledAnnotationKey = "ray.io/ha-enabled"
+	// Ray GCS FT related annotations
+	RayFTEnabledAnnotationKey = "ray.io/ft-enabled"
RayExternalStorageNSAnnotationKey = "ray.io/external-storage-namespace"
RayNodeHealthStateAnnotationKey = "ray.io/health-state"

@@ -79,14 +79,14 @@ const (
DefaultRedisPassword = "5241590000000000"

LOCAL_HOST = "127.0.0.1"
-	// Ray HA default readiness probe values
+	// Ray FT default readiness probe values
DefaultReadinessProbeInitialDelaySeconds = 10
DefaultReadinessProbeTimeoutSeconds = 1
DefaultReadinessProbePeriodSeconds = 3
DefaultReadinessProbeSuccessThreshold = 0
DefaultReadinessProbeFailureThreshold = 20

-	// Ray HA default liveness probe values
+	// Ray FT default liveness probe values
DefaultLivenessProbeInitialDelaySeconds = 10
DefaultLivenessProbeTimeoutSeconds = 1
DefaultLivenessProbePeriodSeconds = 3
18 changes: 9 additions & 9 deletions ray-operator/controllers/ray/common/pod.go
@@ -47,12 +47,12 @@ func GetHeadPort(headStartParams map[string]string) string {
return headPort
}

-// rayClusterHAEnabled check if RayCluster enabled HA in annotations
+// rayClusterHAEnabled checks if the RayCluster enabled FT in annotations
func rayClusterHAEnabled(instance rayiov1alpha1.RayCluster) bool {
if instance.Annotations == nil {
return false
}
-	if v, ok := instance.Annotations[RayHAEnabledAnnotationKey]; ok {
+	if v, ok := instance.Annotations[RayFTEnabledAnnotationKey]; ok {
if strings.ToLower(v) == "true" {
return true
}
@@ -65,14 +65,14 @@ func initTemplateAnnotations(instance rayiov1alpha1.RayCluster, podTemplate *v1.
podTemplate.Annotations = make(map[string]string)
}

-	// For now, we just set ray external storage enabled/disabled by checking if HA is enalled/disabled.
+	// For now, we just set ray external storage enabled/disabled by checking if FT is enabled/disabled.
// This may need to be updated in the future.
if rayClusterHAEnabled(instance) {
-		podTemplate.Annotations[RayHAEnabledAnnotationKey] = "true"
-		// if we have HA enabled, we need to set up a default external storage namespace.
+		podTemplate.Annotations[RayFTEnabledAnnotationKey] = "true"
+		// if we have FT enabled, we need to set up a default external storage namespace.
podTemplate.Annotations[RayExternalStorageNSAnnotationKey] = string(instance.UID)
} else {
-		podTemplate.Annotations[RayHAEnabledAnnotationKey] = "false"
+		podTemplate.Annotations[RayFTEnabledAnnotationKey] = "false"
}
podTemplate.Annotations[RayNodeHealthStateAnnotationKey] = ""

@@ -327,11 +327,11 @@ func BuildPod(podTemplateSpec v1.PodTemplateSpec, rayNodeType rayiov1alpha1.RayN

setContainerEnvVars(&pod, rayContainerIndex, rayNodeType, rayStartParams, svcName, headPort, creator)

-	// health check only if HA enabled
+	// health check only if FT enabled
if podTemplateSpec.Annotations != nil {
-		if enabledString, ok := podTemplateSpec.Annotations[RayHAEnabledAnnotationKey]; ok {
+		if enabledString, ok := podTemplateSpec.Annotations[RayFTEnabledAnnotationKey]; ok {
if strings.ToLower(enabledString) == "true" {
-			// Ray HA is enabled and we need to add health checks
+			// Ray FT is enabled and we need to add health checks
if pod.Spec.Containers[rayContainerIndex].ReadinessProbe == nil {
// it is possible that some user have the probe parameters to override the default,
// in this case, this if condition is skipped
4 changes: 2 additions & 2 deletions ray-operator/controllers/ray/raycluster_controller.go
@@ -137,9 +137,9 @@ func (r *RayClusterReconciler) eventReconcile(request ctrl.Request, event *v1.Ev
return ctrl.Result{}, nil
}

-	if enabledString, ok := unhealthyPod.Annotations[common.RayHAEnabledAnnotationKey]; ok {
+	if enabledString, ok := unhealthyPod.Annotations[common.RayFTEnabledAnnotationKey]; ok {
if strings.ToLower(enabledString) != "true" {
-		r.Log.Info("HA not enabled skipping event reconcile for pod.", "pod name", unhealthyPod.Name)
+		r.Log.Info("FT not enabled, skipping event reconcile for pod.", "pod name", unhealthyPod.Name)
return ctrl.Result{}, nil
}
} else {
14 changes: 7 additions & 7 deletions tests/compatibility-test.py
@@ -252,7 +252,7 @@ def test_cluster_info(self):
        client.close()


-def ray_ha_supported():
+def ray_ft_supported():
    if ray_version == "nightly":
        return True
    major, minor, patch = parse_ray_version(ray_version)
@@ -269,22 +269,22 @@ def ray_service_supported():
    return True


-class RayHATestCase(unittest.TestCase):
-    cluster_template_file = 'tests/config/ray-cluster.ray-ha.yaml.template'
+class RayFTTestCase(unittest.TestCase):
+    cluster_template_file = 'tests/config/ray-cluster.ray-ft.yaml.template'

    @classmethod
    def setUpClass(cls):
-        if not ray_ha_supported():
+        if not ray_ft_supported():
            return
        delete_cluster()
        create_cluster()
        apply_kuberay_resources()
        download_images()
-        create_kuberay_cluster(RayHATestCase.cluster_template_file)
+        create_kuberay_cluster(RayFTTestCase.cluster_template_file)

    def setUp(self):
-        if not ray_ha_supported():
-            raise unittest.SkipTest("ray ha is not supported")
+        if not ray_ft_supported():
+            raise unittest.SkipTest("ray ft is not supported")

    def test_kill_head(self):
        # This test will delete head node and wait for a new replacement to
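The version gate used by the test suite can be sketched as a standalone function. This reimplementation assumes the cutoff is simply Ray >= 2.0, per the prerequisite section of the GCS FT doc, and it inlines a hypothetical `parse_ray_version` helper; the repository's actual helper may differ.

```python
# Sketch (assumption): gate FT tests on Ray >= 2.0 or the nightly build.
def parse_ray_version(version_str):
    """Split a version string like '2.0.0' into integer parts."""
    major, minor, patch = (int(part) for part in version_str.split("."))
    return major, minor, patch

def ray_ft_supported(ray_version):
    # Nightly builds are assumed to always carry the GCS FT feature.
    if ray_version == "nightly":
        return True
    major, _minor, _patch = parse_ray_version(ray_version)
    # GCS FT requires Ray 2.0+ (see the Prerequisite section of gcs-ft.md).
    return major >= 2

print(ray_ft_supported("1.9.1"))  # False
print(ray_ft_supported("2.0.0"))  # True
```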
@@ -68,7 +68,7 @@ metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  annotations:
-    ray.io/ha-enabled: "true" # enable Ray HA
+    ray.io/ft-enabled: "true" # enable Ray GCS FT
  name: raycluster-external-redis
spec:
  rayVersion: '$ray_version'