Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature][GCS FT] Best-effort redis cleanup job #1766

Merged
merged 1 commit into from
Dec 24, 2023

Conversation

rueian
Copy link
Contributor

@rueian rueian commented Dec 21, 2023

Why are these changes needed?

Following the previous proposal #1557 (comment), this PR introduces a new environment variable GCS_FT_REDIS_CLEANUP_DEADLINE_SECONDS to the kuberay-operator.

This variable will be set to the .spec.activeDeadlineSeconds option of the redis cleanup job of a fault-tolerant RayCluster, and it will terminate the job after the deadline no matter what.

This PR further makes the kuberay-operator remove the ray.io/gcs-ft-redis-cleanup-finalizer finalizer on the RayCluster allowing it to be deleted after the job is terminated.

The default value of GCS_FT_REDIS_CLEANUP_DEADLINE_SECONDS is 300. This essentially makes all redis cleanup jobs work in a best-effort manner for 5 minutes by default.

Related issue number

#1725
#1557

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 self-requested a review December 21, 2023 19:40
@kevin85421 kevin85421 self-assigned this Dec 21, 2023
ray-operator/controllers/ray/raycluster_controller.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/utils/util.go Show resolved Hide resolved
@@ -283,6 +284,19 @@ func (r *RayClusterReconciler) rayClusterReconcile(ctx context.Context, request
"We do not need to requeue the RayCluster CR anymore.", instance.Name))
return ctrl.Result{}, nil
}
// redisCleanupJob.Spec.ActiveDeadlineSeconds is set by the GCS_FT_REDIS_CLEANUP_DEADLINE_SECONDS env
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge this if statement and the if redisCleanupJob.Status.Succeeded > 0 statement?

if utils.IsJobFinished(&redisCleanupJob) {
  isBestEffort := redisCleanupJob.Spec.ActiveDeadlineSeconds != nil
  if (Job is complete) {
    print some logs
  } else {
    print some logs
    maybe file a k8s event
  }
  remove finalizer
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

ray-operator/controllers/ray/raycluster_controller.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/raycluster_controller.go Outdated Show resolved Hide resolved
return false
}

func EnvInt64(name string, fallback int64) int64 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could directly use strconv.ParseInt(v, 10, 64) in the buildRedisCleanupJob function. If users input invalid values, it should fail. In open-source projects, we often need to address numerous user questions without access to their Kubernetes clusters. Therefore, identifying and surfacing issues as early as possible might be a better strategy for maintainers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kevin85421,

I find it hard to make the operator fail in the buildRedisCleanupJob. This function is called only when the operator is going to create the cleanup job, which is already too late.

Considering that the GCS_FT_REDIS_CLEANUP_DEADLINE_SECONDS is an environment variable, the proper timing to fail the operator should be during initialization. How about we parse the env and set the parsed result as a member field of RayClusterReconciler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After our latest discussion, we decided not to introduce the GCS_FT_REDIS_CLEANUP_DEADLINE_SECONDS but always set the .spec.activeDeadlineSeconds to 300 seconds instead.

@rueian rueian force-pushed the redis-cleanup-best-effort branch 3 times, most recently from 9dde125 to a548d87 Compare December 23, 2023 07:05
@rueian rueian force-pushed the redis-cleanup-best-effort branch from a548d87 to 93aa1a3 Compare December 23, 2023 08:33
@rueian rueian requested a review from kevin85421 December 23, 2023 10:18
@kevin85421 kevin85421 merged commit cfa1203 into ray-project:master Dec 24, 2023
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants