Fixes around exceptional cases in watchers #2671
Conversation
- If the target of a watch is deleted, it will never satisfy the watch condition
- Path segment based filters have spotty support in newer kube API versions; field selectors seem to work everywhere
- Handle unsuccessful watch statuses better -- we should either be retrying, or passing an error event back to the user. Throwing an exception will result in a silent failure that no user can handle, because it happens in a thread
- Long polls return a response once data is available. We should only back off in exceptional cases, otherwise immediately reconnect so we can receive follow-up data in a more timely manner
Can one of the admins verify this patch?
SonarCloud Quality Gate failed.
Timing is impeccable on this one. We identified #2668 yesterday. I'm still wary about why these lines were added in the first place:

```java
if (baseOperation.isApiGroup()) {
  httpUrlBuilder.addPathSegment(name);
}
```
Sorry, missed this earlier. It makes sense given the timeline.
I can't really think of any resource API that allows this. As Rohan stated in his comment, this was added a long time back for certain OpenShift resources (#821); however, I can't really find any documentation for these endpoints where watches can be applied to a specific resource and not the whole list.
LGTM
I confirm this PR also fixed the issue reported in #2668
[merge]
Description
These changes have been running for us internally for 6 months or so, but I never got a chance to upstream them. We'd like to bring our fork more in line with upstream, so here is a group of changes related to watcher reliability:
If the target of a watch is deleted, it will never satisfy the watch condition
We saw a case where you are watching a resource, waiting for it to reach a particular condition; if the resource gets deleted while you are waiting, the watcher simply goes on until the timeout. This change makes the wait fail much faster in the case where it can never succeed.
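Not the change itself, but a minimal sketch of the idea, assuming the 4.x-era `Watcher` callbacks; the readiness check and exception message are illustrative:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;

import java.util.concurrent.CompletableFuture;

// Sketch: a condition watcher that fails fast when the watched resource is deleted,
// instead of waiting out the full timeout for a condition that can no longer be met.
public class ConditionWatcher implements Watcher<Pod> {

  private final CompletableFuture<Pod> result = new CompletableFuture<>();

  @Override
  public void eventReceived(Action action, Pod pod) {
    if (action == Action.DELETED) {
      // The target is gone, so the condition can never be satisfied; fail immediately.
      result.completeExceptionally(
          new KubernetesClientException("resource deleted while waiting for condition"));
    } else if (conditionMet(pod)) {
      result.complete(pod);
    }
  }

  @Override
  public void onClose(KubernetesClientException cause) {
    if (cause != null) {
      result.completeExceptionally(cause);
    }
  }

  public CompletableFuture<Pod> getResult() {
    return result;
  }

  private boolean conditionMet(Pod pod) {
    // Illustrative condition; the real predicate depends on what the caller is waiting for.
    return pod.getStatus() != null && "Running".equals(pod.getStatus().getPhase());
  }
}
```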
Path segment based filters have spotty support in newer kube api versions. Field selector seems to work everywhere
It looks like this was added 3.5 years ago because field selectors weren't working. The patch to make CRDs support field selectors was then added a few months later. If you take a look at the API spec (even for the really old v1.11), the read API URL we are using here does not support the `watch` query param. If you want to watch an object, you need to use either the Watch API (with the `/api/v1/watch` path) or the List API, which does support the `watch` query param. We noticed that watches for specific resources were not working correctly for us; instead they were spamming our API servers because they were actually short polling (basically just a get that returns instantly). Note that I linked to the Pods API, but the same seems to be true for CustomResources (or, as the original PR called them, API Groups). You can test this out pretty easily; we used a CRD we have internally, but you can create your own CRD of any type to test it, and it works the same on standard APIs.

I chose to simply remove this apiGroup handling so that we always use the `fieldSelector` filtering. It's possible that another change we could make is to move over to the `/watch` API, but that would be more complicated because we'd need to use a different URL than what we have available from `getNamespacedUrl()`. I could hook that up if you want, but it would mean somewhat wider scope changes. I'd prefer to handle that separately/later if possible.
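For illustration only (again assuming the 4.x-era DSL and `Watcher` callbacks), this is the user-visible shape of a field-selector based watch; the namespace and name are placeholders, and the URL in the comment is the generic list-endpoint form, not something this PR adds:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;

public class FieldSelectorWatch {
  public static void main(String[] args) throws Exception {
    try (KubernetesClient client = new DefaultKubernetesClient();
         // With a field selector the watch goes through the List API, roughly:
         //   GET /api/v1/namespaces/default/pods?fieldSelector=metadata.name%3Dmy-pod&watch=true
         // instead of appending the name as a path segment on the read endpoint.
         Watch watch = client.pods()
             .inNamespace("default")
             .withField("metadata.name", "my-pod")
             .watch(new Watcher<Pod>() {
               @Override
               public void eventReceived(Action action, Pod pod) {
                 System.out.println(action + " " + pod.getMetadata().getName());
               }

               @Override
               public void onClose(KubernetesClientException cause) {
                 System.out.println("watch closed" + (cause == null ? "" : ": " + cause.getMessage()));
               }
             })) {
      Thread.sleep(60_000); // keep the watch open for a minute
    }
  }
}
```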
Handle unsuccessful watch statuses better -- we should either be retrying, or passing an error event back to the user. Throwing an exception will result in a silent failure that no user can handle, because it happens in a thread
We noticed this issue where our watchers were failing silently with no real feedback. This is because the client was throwing a failure exception in a callback which does not get executed in the main thread, so the exception really goes nowhere. I refactored the callback so that when `!isSuccessful()` we do the same status handling that we do in the case of a successful response, just passing in the unsuccessful Status object. This way the watcher will receive an appropriate event and the caller can be notified that the watch failed.
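A simplified, self-contained sketch of that dispatch pattern; the class and method names here are made up for illustration and are not the client's actual internals:

```java
import io.fabric8.kubernetes.api.model.Status;
import io.fabric8.kubernetes.api.model.StatusBuilder;

// Sketch: an unsuccessful HTTP response is converted into a Status and pushed
// through the same handler as a Status received in the watch stream, so the
// failure reaches the Watcher instead of being thrown on a background thread
// where no caller can catch it.
public class StatusDispatchSketch {

  interface StatusHandler {
    void onStatus(Status status);
  }

  private final StatusHandler handler;

  StatusDispatchSketch(StatusHandler handler) {
    this.handler = handler;
  }

  void onHttpResponse(int code, String message, boolean successful) {
    if (!successful) {
      // Same code path as a Status delivered inside a successful watch response.
      handler.onStatus(new StatusBuilder()
          .withStatus("Failure")
          .withCode(code)
          .withMessage(message)
          .build());
      return;
    }
    // ... normal processing of the watch events in the response body ...
  }

  public static void main(String[] args) {
    StatusDispatchSketch sketch = new StatusDispatchSketch(
        status -> System.out.println("watch error event: " + status.getCode() + " " + status.getMessage()));
    sketch.onHttpResponse(410, "Gone", false);
  }
}
```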
Long polls return a response once data is available. We should only back off in exceptional cases, otherwise immediately reconnect so we can receive follow-up data in a more timely manner
Another situation we saw was that some watch events would not make their way to us in a timely manner. This was because the WatchHTTPManager was always backing off after processing a response body. The way long polling works is that, when data is ready, the connection immediately returns a response with the data in the body; at that point the long poll is not in an error state and should be reinitialized immediately. So I updated the onResponse callback and scheduleReconnect to only back off in the case of errors.
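A minimal sketch of that reconnect policy, with made-up class names and illustrative delay values (not the client's actual backoff settings):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: a long poll that completed normally is re-opened immediately, and
// exponential backoff is applied only after a failed attempt.
public class LongPollReconnect {

  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private int consecutiveFailures = 0;

  void onPollFinished(boolean wasError) {
    if (wasError) {
      consecutiveFailures++;
      long delayMillis = Math.min((1L << consecutiveFailures) * 100, 32_000);
      scheduler.schedule(this::openLongPoll, delayMillis, TimeUnit.MILLISECONDS);
    } else {
      // A normal response just means data arrived; reconnect right away so
      // follow-up events are not delayed by an unnecessary backoff.
      consecutiveFailures = 0;
      scheduler.execute(this::openLongPoll);
    }
  }

  private void openLongPoll() {
    // ... issue the next watch request ...
  }
}
```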