RayService object's Status is being updated due to frequent reconciliation #1065
Conversation
The branch was force-pushed from 12af66c to 191e934.
cc @msumitjain, would you mind giving some feedback? Thanks!
Great work! Thanks for the detailed PR description. The behavior looks good to me. I left a few suggestions around wording.
Co-authored-by: shrekris-anyscale <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
LGTM
Nice work! Thanks for responding to all my comments.
```go
	return ctrl.Result{RequeueAfter: ServiceDefaultRequeueDuration}, nil
}

func (r *RayServiceReconciler) inconsistentRayServiceStatus(oldStatus rayv1alpha1.RayServiceStatus, newStatus rayv1alpha1.RayServiceStatus) bool {
```
Add a func comment listing what is considered inconsistent.
Added. 36f1050
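For context, here is a minimal sketch of what such a helper and its doc comment might look like. The stand-in struct and its field names are assumptions for illustration, not the exact `rayv1alpha1` schema:

```go
package controllers

import (
	"reflect"
	"time"
)

// rayServiceStatus is a stand-in for rayv1alpha1.RayServiceStatus; the real
// struct has more fields. All names here are illustrative assumptions.
type rayServiceStatus struct {
	RayClusterName       string
	DashboardStatus      string
	HealthLastUpdateTime *time.Time // ignored: only consulted when unhealthy
	RayClusterStatus     string     // ignored: kept purely for observability
}

// inconsistentRayServiceStatus reports whether oldStatus and newStatus differ
// in any field the controller acts on. HealthLastUpdateTime and
// RayClusterStatus are zeroed before the comparison so that churn in
// timestamp- and observability-only fields does not trigger a status update.
// The parameters are value copies, so zeroing them does not mutate the caller.
func inconsistentRayServiceStatus(oldStatus, newStatus rayServiceStatus) bool {
	oldStatus.HealthLastUpdateTime, newStatus.HealthLastUpdateTime = nil, nil
	oldStatus.RayClusterStatus, newStatus.RayClusterStatus = "", ""
	return !reflect.DeepEqual(oldStatus, newStatus)
}
```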
Just a quick question: the frequency of the Reconcile function itself is too high. Does Reconcile get called on some event? Regardless, should we limit how often this function can be called per second, maybe with a RateLimiter?
@msumitjain would you mind performing an experiment similar to #1062? We can add a RateLimiter, and there is an issue tracking that progress: #820.
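For reference, a rough sketch of how a rate limiter could be wired in through controller-runtime's `WithOptions`, assuming the existing `RayServiceReconciler` type; the constants and the `rayv1alpha1` import path are assumptions, and the exact `controller.Options` shape depends on the controller-runtime version in use:

```go
package controllers

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// SetupWithManager registers the reconciler with a custom rate limiter:
// per-item exponential backoff combined with an overall 10 QPS / burst-100
// token bucket (the same shape as the client-go default).
func (r *RayServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayv1alpha1.RayService{}).
		WithOptions(controller.Options{
			RateLimiter: workqueue.NewMaxOfRateLimiter(
				workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
				&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
			),
		}).
		Complete(r)
}
```

Note that a rate limiter only throttles how quickly requeued items are retried; it does not by itself reduce the number of status writes, which is why this PR's diff-before-update check is still needed.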
RayService object's Status is being updated due to frequent reconciliation (ray-project#1065)
Why are these changes needed?
The experiment result in #1062 shows that `r.Status().Update()` (the final status update for any CR modification) occurred 48 times during the RayCluster initialization phase and 239 times within the 10 minutes after the RayCluster completed initialization. Hence, this PR focuses on relieving the frequent status updates caused by `r.Status().Update()`.

The root cause is that `r.Status().Update()` is called at the end of the `Reconcile` function regardless of whether the status actually changed. These frequent status updates hurt the performance of the Kubernetes cluster.

Refer to the source code of ReplicaSet and StatefulSet: both check the difference between the original status and the new status to decide whether to update the status. Hence, this PR implements a similar function, `inconsistentRayServiceStatuses`, to determine whether to update the status.

Implementation of `inconsistentRayServiceStatuses`:
- `inconsistentRayServiceStatuses` does not check `LastUpdateTime` and `HealthLastUpdateTime`. The controller uses `HealthLastUpdateTime` to determine whether the RayService is healthy (example), and triggers a new RayCluster preparation if it is unhealthy. Since the controller only consults `HealthLastUpdateTime` when the status is not `HEALTHY`, it is fine to ignore updates to `HealthLastUpdateTime`.
- `inconsistentRayServiceStatuses` ignores `RayClusterStatus`. `rayservice_controller.go` does not use any information in `RayClusterStatus`; it exists purely for observability.

This solution is not good enough because the status is too complicated, so I list some items in the follow-up section. A sketch of the resulting call site in `Reconcile` follows.
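As a hedged illustration, the guard at the end of `Reconcile` could look roughly like this (a sketch, not the exact PR diff; `originalRayServiceInstance` stands for a deep copy of the CR taken at the start of `Reconcile`, and `ServiceDefaultRequeueDuration` is the requeue constant shown in the diff excerpt above):

```go
// Only issue the API write when a field the controller acts on has changed;
// otherwise, skip the update and requeue at the regular cadence.
if r.inconsistentRayServiceStatuses(originalRayServiceInstance.Status, rayServiceInstance.Status) {
	if err := r.Status().Update(ctx, rayServiceInstance); err != nil {
		return ctrl.Result{}, err
	}
}
return ctrl.Result{RequeueAfter: ServiceDefaultRequeueDuration}, nil
```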
Related issue number
Closes #1061
#1062
Checks
I performed an experiment similar to #1062; the only difference is that this experiment ran for 15 minutes while #1062 ran for 10 minutes. The result is (see this gist for more details):
Let the number of reconciliations be N.
Follow-up
- The status of RayService is too complicated. We should revisit the status design based on the principles of the Kubernetes API conventions.
- Add `observedGeneration` to the logic of `inconsistentRayServiceStatuses` (see the sketch after this list).
- "48 status updates during RayCluster initialization" is definitely not good enough. I will optimize it.
- Currently, RayService determines whether to trigger a new RayCluster preparation based on data plane logic (i.e. Ray). This is pretty weird and makes everything very complex because Ray already has similar functionality. For me, the better solution would be:
- The key of the cache may not be unique in some special cases. In addition, a Kubernetes operator should be stateless, but the cache makes the operator stateful.
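A minimal sketch of the `observedGeneration` follow-up item; `ObservedGeneration` is an assumed new field here, not something claimed to exist in the current RayService status schema:

```go
package controllers

// statuses is a stand-in for rayv1alpha1.RayServiceStatuses with the proposed
// ObservedGeneration field added; both the type and the field are assumptions.
type statuses struct {
	ObservedGeneration int64
	// ...existing status fields would live here...
}

// inconsistentWithGeneration treats a spec-generation bump as inconsistent,
// so a spec change always flushes a status update even when no other status
// field differs. specGeneration is the CR's metadata.generation.
func inconsistentWithGeneration(specGeneration int64, oldStatus, newStatus statuses) bool {
	if oldStatus.ObservedGeneration != specGeneration {
		return true
	}
	// Fall back to the existing field-level comparison for everything else.
	return oldStatus != newStatus
}
```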