feat: add possibility to add an exception handler to Watchers #4365
Conversation
@metacosm I think it makes sense, but it may not match the go watch behavior. It looks like for all error scenarios in a watch they will create an error event - https://github.com/kubernetes/client-go/blob/master/tools/watch/retrywatcher.go#L160 - and possibly terminate the watch. That isn't a great match for us, as we are using KubernetesResource, rather than the raw type, for the object. Does it make sense to try to reuse the error WatchEvent for this error handling? The error event handling in go in the reflector for an unknown error just seems to be to log and retry: https://sourcegraph.com/github.com/kubernetes/client-go/-/blob/tools/cache/reflector.go?L347 - so it doesn't currently handle this scenario very well either.
We discussed this internally in yesterday's call. The main drawback with the currently proposed approach is that we already have two error entry points: onClose(Exception) and eventReceived with an ERROR action.
The new Exception Handler would add a third option, making it hard for users to understand what each of these callbacks or points of entry might entail (e.g. would onException be called too when onClose(Exception) is called? would eventReceived(ERROR) trigger an onException too? and so on). IMHO this is bad for the overall Informer + Watcher UX.
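To make the overlap concrete, here is a minimal sketch (not the actual fabric8 API; the onException method and its semantics are hypothetical) of the three entry points being discussed:

```java
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

// Hypothetical sketch only: onException is the proposed third entry point.
interface WatcherWithExceptionHandler<T> extends Watcher<T> {

  // 1) already exists: server-sent ERROR events arrive here with Action.ERROR
  @Override
  void eventReceived(Action action, T resource);

  // 2) already exists: abnormal termination of the watch
  @Override
  void onClose(WatcherException cause);

  // 3) proposed addition: when would it fire relative to the two above?
  //    That ambiguity is the UX concern raised in this comment.
  default void onException(Throwable t) {
    // no defined semantics yet
  }
}
```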
I was considering two other options besides the one proposed by Chris (none of which makes me happy TBH)
As you know, I'm inclined to provide whatever client-go is doing, for UX consistency purposes.
This might be the best option… How should we proceed?
The problem with this option is that it doesn't provide any flexibility: the calling code cannot decide what to do apart from crashing or not, which isn't terribly useful most of the time, especially in user code where people might not have direct access to the informers/watchers.
Looking things over some more: in all of the cases in which the go client creates its own Error event, it also terminates the watch. So I think it would make sense for us to just terminate the watch and call onClose with the relevant exception instead. When the informer sees a non-HTTP-Gone onClose exception, it will stop. If we add an additional handler for informers, that would be an alternative to monitoring the isWatching method - so that you get a callback if there is an abnormal termination. That notification would need to happen here: Line 223 in 0bd34f8
The resource too old exceptions are recoverable, so we likely don't need to notify the handler. Also, the consistency with the go client seems cumbersome. The go client doesn't introduce another enum type; you'd probably just reuse Watcher.Action.ERROR. The event object would then be a Status, which matches the docs at https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#watchevent-v1-meta and what the go client is doing.
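A minimal sketch of how these two pieces could fit together on the consuming side, assuming the typed Watcher API and WatcherException.isHttpGone(); the handling shown is illustrative, not the PR's implementation:

```java
import io.fabric8.kubernetes.api.model.Status;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

class ErrorAwareWatcher<T> implements Watcher<T> {

  @Override
  public void eventReceived(Action action, T resource) {
    if (action == Action.ERROR && resource instanceof Status) {
      // Reusing the existing ERROR action: the payload is a Status, matching
      // the WatchEvent docs linked above. Delivering a Status through a typed
      // Watcher<T> is exactly the typing concern raised earlier in the thread.
      Status status = (Status) resource;
      System.err.println("watch error: " + status.getMessage());
      return;
    }
    // normal ADDED/MODIFIED/DELETED handling would go here
  }

  @Override
  public void onClose(WatcherException cause) {
    if (cause.isHttpGone()) {
      // resource too old (HTTP 410): recoverable, re-list and re-watch
    } else {
      // non-recoverable: this is where the proposed handler would be notified
    }
  }
}
```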
What are the implications, though? The scenario we're trying to solve is an edge case where a serialisation issue might occur with one CR instance. If we stop the watcher, what does it mean for the informer? Will it need to get restarted (and resynched), knowing that the same CR might crash it again? This particular case should only really happen during development (though we all know that issues that are supposed to never happen in prod have a tendency to do so anyway 😅), so this is more about the developer experience than anything else, but the point is to make it easier to detect (and presumably fix) such issues without having to sift through very long logs; it's really easy to miss the problem when you don't know what you're looking for.
I personally don't think that we should care all that much about what client-go is doing… Trying to somewhat follow what it does, but not quite, seems actually worse to me than doing something completely different, because then you get surprised when the behavior differs even though it looks like it should be the same…
No, when the informer sees a non-HTTP-Gone onClose exception, it will stop.
I'm not saying you have to rely on isWatching, just pointing out that if we add something new on top of the change to close the watch with a non-HTTP-Gone exception, it's an alternative callback to isWatching.
If the informer is not processing the other events then what is the point? The solution we're aiming for is to not disable the processing of CRs of a given type just because one instance is somehow causing an issue… If the informer attempts to restart just to get stopped again because of that one CR, that's a nice way to cause denial of service.
Not sure I understand what you're trying to say? Maybe we should explore options using mock code so that we can look at how things would look from a user's perspective?
This is probably related:
Again the informer will not restart with what I'm describing.
It should be clearer as add-exception-handler...shawkins:kubernetes-client:add-exception-handler
That could look a little different than the commit I just showed. Instead of void onWatchNonrecoverable(WatcherException e), it would be boolean onWatchNonrecoverable(WatcherException e) - so that the handler can choose whether to shut down.
Thanks! 👀
Yep, the handler should have the opportunity to ask for a graceful shutdown of the watcher, indeed.
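A small sketch of that shape, using the method name from the comment above; the exact meaning of the return value isn't pinned down in this thread, so the reading below is just one plausible interpretation:

```java
import io.fabric8.kubernetes.client.WatcherException;

// Hypothetical handler shape; the name is taken from the discussion above.
@FunctionalInterface
interface WatchNonRecoverableHandler {
  /**
   * Called when the watch terminates with a non-recoverable exception.
   *
   * @return true to keep the informer/watch running despite the failure,
   *         false to let it shut down gracefully (assumed semantics)
   */
  boolean onWatchNonrecoverable(WatcherException e);
}
```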
Looked again, there actually is one - https://sourcegraph.com/github.com/kubernetes/client-go@7ccf7b05af286664a658af15c8502a6066ae3288/-/blob/tools/cache/shared_informer.go?L183 - but it's informational only: the informer will continue to retry using a backoff (which is not implemented for ours yet). Another go implementation change looks to be that the initial informer start is no longer handled as a special case - that is, an exception during the initial list/watch won't cause the informer not to start; rather it will just proceed to looping with the backoff. Also, I think switching to all async calls has introduced a regression for us - previously, if we failed during listSyncWatch, it was the WatchManager that would retry after catching the exception. Now the reflector, after a resource too old exception, will retry calling listSyncWatch only once - that means if we see a resource too old (which is now rarer due to bookmarks), then fail to list after the watch interval, and fail again, the informer / reflector will stop trying. So that needs to be addressed as well.
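For reference, a rough sketch of the retry-with-backoff behaviour described for client-go - the handler is informational only and the loop never stops on an error. The backoff values and the listAndWatch callable are placeholders, not fabric8 code:

```java
import java.util.function.Consumer;

final class BackoffReflectorSketch {

  // Keeps calling list+watch indefinitely, notifying the handler on failure
  // but never stopping; mirrors the informational-only error handler idea.
  static void run(Runnable listAndWatch, Consumer<Exception> informationalHandler)
      throws InterruptedException {
    long delayMs = 1_000;            // illustrative starting backoff
    final long maxDelayMs = 30_000;  // illustrative cap
    while (!Thread.currentThread().isInterrupted()) {
      try {
        listAndWatch.run();          // blocks until the watch terminates
        delayMs = 1_000;             // reset the backoff after a clean cycle
      } catch (Exception e) {
        informationalHandler.accept(e);  // notify, but keep looping
      }
      Thread.sleep(delayMs);
      delayMs = Math.min(delayMs * 2, maxDelayMs);
    }
  }
}
```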
Note that this is exactly what we need in the other issue: #4369
@metacosm I don't see how this PR addresses the core problem of being able to process the raw resource when it can't be deserialized.
In summary we have two goals and related issues in JOSDK:
This might be one callback method on the informer level to cover both. |
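A sketch of what such a single informer-level callback could look like; the interface name appears in this PR's draft, but the signature below is an assumption for illustration, not the final API:

```java
import io.fabric8.kubernetes.client.WatcherException;

// Assumed shape only: one callback covering both goals above.
interface InformerExceptionHandler {

  /**
   * Invoked both for events that could not be deserialized and for
   * non-recoverable termination of the underlying watch. The rawMessage
   * parameter (an assumption here) would carry the raw watch payload
   * when it is available.
   */
  void onException(WatcherException cause, String rawMessage);
}
```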
@metacosm @csviri I've now hijacked this PR with something that should address #4369 as well. This PR does not introduce a backoff - we're still using the watch retry delay. Nor does it change the startup - that is a potentially breaking change, so we likely want to handle it separately. Let me know if you have any issues with the draft, then I'll clean this up and add some tests.
Approach sounds good to me, though I guess ideally, we'd also pass the raw object that caused the deserialisation error if at all possible…
Marking as ready for review. Added a couple of tests and further cleaned up the reconnect logic. Also added more conformity to the go retry watcher logic in our abstract watch manager - there were several cases where it handled the retry on its own rather than relying on downstream logic. Also, the raw message has been added to the watch exception in more cases.
also stopping messages when a watch closes
SonarCloud Quality Gate failed.
@manusa I would opt for squashing this given how many intermediate commits no longer make sense.
Ok, just saw those. Merging master now and squashing.
Actually wondering if we could release a new version and get that version in Quarkus before 2.13 is cut… 😨 |
🎉 Amazing job, thanks to everyone involved! |
also adding stacktraces for blocking request exceptions