feat: add possibility to add an exception handler to Watchers #4365
Conversation
@metacosm I think it makes sense, but it may not match the go watch behavior. It looks like for all error scenarios in a watch they will create an error event - https://github.com/kubernetes/client-go/blob/master/tools/watch/retrywatcher.go#L160 - and possibly terminate the watch. That isn't a great match for us, as we are using KubernetesResource, rather than the raw type, for the object. Does it make sense to try to reuse the error WatchEvent for this error handling? The error event handling in go in the reflector for an unknown error just seems to be to log and retry: https://sourcegraph.com/github.com/kubernetes/client-go/-/blob/tools/cache/reflector.go?L347 - so it doesn't currently handle this scenario very well either.
We discussed this internally in yesterday's call. The main drawback with the currently proposed approach is that we already have two error entry points: onClose(Exception) and eventReceived with an ERROR action.
The new Exception Handler would add a third option, making it hard for users to understand what each of these callbacks or points of entry might entail (e.g. would onException be called too when onClose(Exception) is called? would eventReceived(ERROR) trigger an onException too? and so on). IMHO this is bad for the overall Informer + Watcher UX.
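To make the overlap concrete, here is a minimal sketch (not the actual fabric8 API; the onException method and its semantics are hypothetical) of the three entry points being discussed:

```java
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

// Hypothetical sketch only: onException is the proposed third entry point.
interface WatcherWithExceptionHandler<T> extends Watcher<T> {

  // 1) already exists: server-sent ERROR events arrive here with Action.ERROR
  @Override
  void eventReceived(Action action, T resource);

  // 2) already exists: abnormal termination of the watch
  @Override
  void onClose(WatcherException cause);

  // 3) proposed addition: when would it fire relative to the two above?
  //    That ambiguity is the UX concern raised in this comment.
  default void onException(Throwable t) {
    // no defined semantics yet
  }
}
```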
I was considering two other options besides the one proposed by Chris (none of which makes me happy TBH)
As you know, I'm inclined to provide whatever client-go is doing, for UX consistency purposes.
This might be the best option… How should we proceed?
The problem with this option is that it doesn't provide any flexibility: the calling code cannot decide what to do apart from crashing or not, which isn't terribly useful most of the time, especially in user code where people might not have direct access to the informers/watchers.
Looking things over some more: in all of the cases in which the go client creates its own Error event, it also terminates the watch. So I think it would make sense for us to just terminate the watch and call onClose with the relevant exception instead. When the informer sees a non-HTTP-Gone onClose exception, it will stop. If we add an additional handler for informers, that would be an alternative to monitoring the isWatching method - so that you get a callback if there is an abnormal termination. That notification would need to happen here: Line 223 in 0bd34f8
The resource too old exceptions are recoverable, so we likely don't need to notify the handler. Also, the consistency with the go client seems cumbersome. The go client doesn't introduce another enum type; you'd probably just reuse Watcher.Action.ERROR. The event object would then be a Status, which matches the docs at https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#watchevent-v1-meta and what the go client is doing.
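A minimal sketch of how these two pieces could fit together on the consuming side, assuming the typed Watcher API and WatcherException.isHttpGone(); the handling shown is illustrative, not the PR's implementation:

```java
import io.fabric8.kubernetes.api.model.Status;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

class ErrorAwareWatcher<T> implements Watcher<T> {

  @Override
  public void eventReceived(Action action, T resource) {
    if (action == Action.ERROR && resource instanceof Status) {
      // Reusing the existing ERROR action: the payload is a Status, matching
      // the WatchEvent docs linked above. Delivering a Status through a typed
      // Watcher<T> is exactly the typing concern raised earlier in the thread.
      Status status = (Status) resource;
      System.err.println("watch error: " + status.getMessage());
      return;
    }
    // normal ADDED/MODIFIED/DELETED handling would go here
  }

  @Override
  public void onClose(WatcherException cause) {
    if (cause.isHttpGone()) {
      // resource too old (HTTP 410): recoverable, re-list and re-watch
    } else {
      // non-recoverable: this is where the proposed handler would be notified
    }
  }
}
```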
What are the implications, though? The scenario we're trying to solve is an edge case where a serialisation issue might occur with one CR instance. If we stop the watcher, what does it mean for the informer? Will it need to get restarted (and resynched), knowing that the same CR might crash it again? This particular case should only really happen during development (though we all know that issues that are supposed to never happen in prod have a tendency to do so anyway 😅), so this is more about the developer experience than anything else, but the point is to make it easier to detect (and presumably fix) such issues without having to sift through very long logs; it's really easy to miss the problem when you don't know what you're looking for.
I personally don't think that we should care all that much about what client-go is doing… Trying to somewhat follow what it does, but not quite, seems actually worse to me than doing something completely different, because then you get surprised when the behavior differs even though it looks like it should be the same…
No, when the informer sees a non-HTTP-Gone onClose exception, it will stop.
I'm not saying you have to rely on isWatching, just pointing out that if we add something new on top of the change to close the watch with a non-HTTP-Gone exception, it's an alternative callback to isWatching.
If the informer is not processing the other events then what is the point? The solution we're aiming for is to not disable the processing of CRs of a given type just because one instance is somehow causing an issue… If the informer attempts to restart just to get stopped again because of that one CR, that's a nice way to cause denial of service.
Not sure I understand what you're trying to say? Maybe we should explore options using mock code so that we can look at how things would look from a user's perspective?
This is probably related:
Again the informer will not restart with what I'm describing.
It should be clearer as add-exception-handler...shawkins:kubernetes-client:add-exception-handler
That could look a little different than the commit I just showed. Instead of void onWatchNonrecoverable(WatcherException e), it would be boolean onWatchNonrecoverable(WatcherException e) - so that the handler can choose whether to shut down.
Thanks! 👀
Yep, the handler should have the opportunity to ask for a graceful shutdown of the watcher, indeed.
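A small sketch of that shape, using the method name from the comment above; the exact meaning of the return value isn't pinned down in this thread, so the reading below is just one plausible interpretation:

```java
import io.fabric8.kubernetes.client.WatcherException;

// Hypothetical handler shape; the name is taken from the discussion above.
@FunctionalInterface
interface WatchNonRecoverableHandler {
  /**
   * Called when the watch terminates with a non-recoverable exception.
   *
   * @return true to keep the informer/watch running despite the failure,
   *         false to let it shut down gracefully (assumed semantics)
   */
  boolean onWatchNonrecoverable(WatcherException e);
}
```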
Looked again, there actually is one - https://sourcegraph.com/github.com/kubernetes/client-go@7ccf7b05af286664a658af15c8502a6066ae3288/-/blob/tools/cache/shared_informer.go?L183 - but it's informational only: the informer will continue to retry using a backoff (which is not implemented for ours yet). Another go implementation change looks to be that the initial informer start is no longer handled as a special case - that is, an exception during the initial list/watch won't cause the informer not to start; rather it will just proceed to looping with the backoff. Also, I think switching to all async calls has introduced a regression for us - previously, if we failed during listSyncWatch, it was the WatchManager that would retry after catching the exception. Now the reflector, after a resource too old exception, will retry calling listSyncWatch only once - that means if we see a resource too old (which is now rarer due to bookmarks), then fail to list after the watch interval, and fail again, the informer / reflector will stop trying. So that needs to be addressed as well.
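For reference, a rough sketch of the retry-with-backoff behaviour described for client-go - the handler is informational only and the loop never stops on an error. The backoff values and the listAndWatch callable are placeholders, not fabric8 code:

```java
import java.util.function.Consumer;

final class BackoffReflectorSketch {

  // Keeps calling list+watch indefinitely, notifying the handler on failure
  // but never stopping; mirrors the informational-only error handler idea.
  static void run(Runnable listAndWatch, Consumer<Exception> informationalHandler)
      throws InterruptedException {
    long delayMs = 1_000;            // illustrative starting backoff
    final long maxDelayMs = 30_000;  // illustrative cap
    while (!Thread.currentThread().isInterrupted()) {
      try {
        listAndWatch.run();          // blocks until the watch terminates
        delayMs = 1_000;             // reset the backoff after a clean cycle
      } catch (Exception e) {
        informationalHandler.accept(e);  // notify, but keep looping
      }
      Thread.sleep(delayMs);
      delayMs = Math.min(delayMs * 2, maxDelayMs);
    }
  }
}
```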
Note that this is exactly what we need in the other issue: #4369
@metacosm I don't see how this PR addresses the core problem of being able to process the raw resource when it can't be deserialized.
In summary we have two goals and related issues in JOSDK:
This might be one callback method on the informer level to cover both. |
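A sketch of what such a single informer-level callback could look like; the interface name appears in this PR's draft, but the signature below is an assumption for illustration, not the final API:

```java
import io.fabric8.kubernetes.client.WatcherException;

// Assumed shape only: one callback covering both goals above.
interface InformerExceptionHandler {

  /**
   * Invoked both for events that could not be deserialized and for
   * non-recoverable termination of the underlying watch. The rawMessage
   * parameter (an assumption here) would carry the raw watch payload
   * when it is available.
   */
  void onException(WatcherException cause, String rawMessage);
}
```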
@metacosm @csviri I've now hijacked this PR with something that should address #4369 as well. This PR does not introduce a backoff - we're still using the watch retry delay. Nor does it change the startup - that is a potentially breaking change, so we likely want to handle it separately. Let me know if you have any issues with the draft, then I'll clean this up and add some tests.
Approach sounds good to me, though I guess ideally, we'd also pass the raw object that caused the deserialisation error if at all possible…
Marking as ready for review. Added a couple of tests and further cleaned up the reconnect logic. Also added more conformity to the go retry watcher logic in our abstract watch manager - there were several cases where it handled the retry on its own rather than relying on downstream logic. Also, the raw message has been added to the watch exception in more cases.
also stopping messages when a watch closes
SonarCloud Quality Gate failed.
@manusa I would opt for squashing this given how many intermediate commits no longer make sense.
Ok, just saw those. Merging master now and squashing.
Actually wondering if we could release a new version and get that version in Quarkus before 2.13 is cut… 😨 |
🎉 Amazing job, thanks to everyone involved! |
also adding stacktraces for blocking request exceptions