-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature][GCS FT] Redis cleanup improvement #1557
Comments
@rueian @evalaiyc98 will take this issue. |
In terms of issue 3, there are two job configurations we can leverage:
When setting the In contrast, when setting the I would recommend setting the Either way, after setting up one of these options, we can then detect if the job is terminated by checking its kuberay/ray-operator/controllers/ray/raycluster_controller.go Lines 283 to 284 in 856a33e
And then remove the finalizer. For example: for _, condition := range redisCleanupJob.Status.Conditions {
if condition.Status == corev1.ConditionTrue && condition.Type == batchv1.JobFailed {
controllerutil.RemoveFinalizer(instance, common.GCSFaultToleranceRedisCleanupFinalizer)
if err := r.Update(ctx, instance); err != nil {
return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, err
}
return ctrl.Result{}, nil
}
}
if redisCleanupJob.Status.Failed > 0 {
r.Log.Info("If the Redis cleanup Job has failed, we will requeue the RayCluster CR after 1 minute.")
return ctrl.Result{RequeueAfter: 1 * time.Minute}, nil
} |
This makes sense.
|
I will take issue 3, and @evalaiyc98 will take issue 2. |
I've reconsidered issue 3. I've decided not to adopt the best-effort strategy in v1.0.0. The reason is that the Redis cleanup feature isn't stable enough, and adopting this strategy might cause us to miss some opportunities to enhance the feature. When we are confident that the Redis cleanup feature works well in most scenarios, we can adopt the best-effort strategy to enhance the user experience during certain edge cases, such as network connection issues. In v1.0.0, I will minimize the UX friction if the Redis cleanup fails by (1) Minimizing the computing resources for the K8s Job and (2) Setting the backoffLimit to 0 to avoid some issues related to ResourceQuota. |
Search before asking
Description
Issue 1: https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1697556705340069
Issue 2: Enhance observability. For instance, display the link to the KubeRay GCS FT doc on the Ray website in the K8s Job. In addition, we can also consider create a K8s event when the K8s Job fails to clean up the Redis key.
Issue 3: Currently, if the K8s Job fails to clean up the Redis key, KubeRay will not remove the RayCluster's finalizer. As a result, users must remove the finalizer manually. This approach is somewhat aggressive, and we might consider changing the behavior to be best-effort. For instance, if the K8s Job fails to clean up the Redis key within 5 minutes, KubeRay could issue a K8s event for observability and then remove the finalizer.
Use case
No response
Related issues
No response
Are you willing to submit a PR?
The text was updated successfully, but these errors were encountered: