
Fork HttpEventCollectorResendMiddleware to properly implement retrying #322

Open · wants to merge 4 commits into base: main

Conversation

@jcjveraa commented Feb 14, 2025

As of today the 'bundled' HttpEventCollectorResendMiddleware only retries when a '200 - OK' response is returned from the Splunk server, which is odd to say the least. @simonhege has created a PR on the java splunk logging project (ref. splunk/splunk-library-javalogging#287), but that project seems to be inactive to the point that nobody is merging it. I suggest including his work in this project until #281 is implemented (or until his PR ever gets merged).

Without this, any user of this library who specifies retries gets what I would call a false sense of security: whenever the Splunk server returns a server error, no retry is done. In my organisation we get 503s quite regularly when the server is temporarily overloaded, and these log messages now simply disappear. (ref https://docs.splunk.com/Documentation/Splunk/9.4.0/RESTREF/RESTinput#services.2Fcollector - note that deep links work poorly on this site; ctrl+f for "HEC is unhealthy, queues are full".)

If you agree to this change, I would propose also including it in the relevant 3.15.X LTS release (my team is on Quarkus 3.15 LTS).

A perfectly fine alternative would of course be to have this merged in the main Splunk HEC repository, but as said, that project seems to be very inactive.
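For reference, the heart of the forked middleware is the reworked retry check (it also shows up in the diff excerpts further down): anything that is neither a successful post nor a documented Splunk application error gets retried.

private boolean shouldRetry(int statusCode) {
    // retry everything except a 200 OK or a documented Splunk application error
    return statusCode != 200 && !HttpEventCollectorApplicationErrors.contains(statusCode);
}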

@jcjveraa requested a review from a team as a code owner on February 14, 2025 at 13:36
@jcjveraa changed the title from "Fork HttpEventCollectorResendMiddleware to properly implemeent retrying" to "Fork HttpEventCollectorResendMiddleware to properly implement retrying" on Feb 14, 2025
@jcjveraa (Author) commented

@vietk ?

@rquinio1A (Member) left a comment

Sorry for the late review, I think the PR makes sense as the logic can be quite generic.

// Method Not Allowed
405,
// Bad Request
400
@rquinio1A (Member):

There may be other 4xx codes where we should not retry.
I think it may be safer to retry only on 502 Bad Gateway, 503 Service Unavailable and 504 Gateway Timeout (we can expand the list in the future if needed).
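For illustration only (this is not code from the PR; the constant name just mirrors the suggested change further down), the whitelist variant could look roughly like this inside the middleware class:

// hypothetical whitelist: retry only on transient gateway/availability errors
private static final java.util.Set<Integer> RetryableHttpStatusCodes = java.util.Set.of(
        502, // Bad Gateway
        503, // Service Unavailable
        504  // Gateway Timeout
);

private boolean shouldRetry(int statusCode) {
    return RetryableHttpStatusCodes.contains(statusCode);
}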

@jcjveraa (Author):

I would actually prefer the 'blacklist' as @simonhege implemented it: we want logging to happen.
Retrying anything except the documented Splunk 'there is no point in retrying' responses would, in my eyes, be the 'safe' default.

If we retry only on a whitelist, we risk not retrying on a hypothetical future error 567 that the Splunk API might return, unless we adjust our code. Or we won't retry on a 404 that might be caused by a faulty DNS configuration while Splunk itself is fine and back online 5 minutes later.

This PR was a 1:1 copy of @simonhege's work, but triggered by your response I'm thinking about it a bit more and would actually want to narrow the check of what to retry even further: we should retry anything that isn't a known Splunk 4XX response. For example, if the response is a 400 but lacks the expected JSON body such as {"text":"Incorrect data format","code":5,"invalid-event-number":0}, we should assume the 400 comes from some middleware between us and the Splunk HEC, and retry. I've seen this often enough in practice, where some 'transparent' proxy failed and we got responses from the proxy instead of the end system we tried to reach.

Optionally, we could add another user config parameter for 'codes not to retry'?
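Sketching that idea (purely illustrative, not part of this PR; the body check, and the assumption that Splunk's error JSON always carries "text" and "code" fields, are mine): a response would only count as a final, non-retryable Splunk application error when both the status code and the expected JSON body match.

private boolean shouldRetry(int statusCode, String reply) {
    if (statusCode == 200) {
        return false; // success, nothing to resend
    }
    // assumed heuristic: a genuine Splunk HEC error reply carries a JSON body with "text" and "code"
    boolean looksLikeSplunkError = reply != null
            && reply.contains("\"text\"")
            && reply.contains("\"code\"");
    // only a documented Splunk application error with the expected body is final;
    // anything else (unknown 5xx, proxy-generated 400, ...) is retried
    return !(HttpEventCollectorApplicationErrors.contains(statusCode) && looksLikeSplunkError);
}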

* When HTTP post reply isn't an application error it tries to resend the data.
* An exponentially growing delay is used to prevent server overflow.
*/
public class HttpEventCollectorResendMiddleware
@rquinio1A (Member):

Would it make sense to make this class the default for the quarkus.log.handler.splunk.middleware config? Since retriesOnError = 0, it would do nothing by default.
If so, we could potentially make retriesOnError (the same as the existing quarkus.log.handler.splunk.max-retries?) and retryDelay configurable by doing a MicroProfile Config lookup in the constructor, via something like:

org.eclipse.microprofile.config.ConfigProvider.getConfig().getOptionalValue("quarkus.log.handler.splunk.max-retries", Integer.class).orElse(0);
org.eclipse.microprofile.config.ConfigProvider.getConfig().getOptionalValue("quarkus.log.handler.splunk.retry-initial-delay-ms", Integer.class).orElse(1000);
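
For context, a minimal sketch of what such a constructor lookup could look like (field names, property names and defaults here are assumptions, not code from the PR):

public HttpEventCollectorResendMiddleware() {
    org.eclipse.microprofile.config.Config config = org.eclipse.microprofile.config.ConfigProvider.getConfig();
    // assumed defaults: no retries, 1000 ms initial delay
    this.retriesOnError = config
            .getOptionalValue("quarkus.log.handler.splunk.max-retries", Integer.class)
            .orElse(0);
    this.retryDelay = config
            .getOptionalValue("quarkus.log.handler.splunk.retry-initial-delay-ms", Integer.class)
            .orElse(1000);
}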

@jcjveraa (Author):

If we set this as the default middleware, it might confuse users: setting their own middleware would then remove this one, I think? Especially confusing as they can still specify retries via the config. Injecting it 'transparently' when retries > 0, as in the current implementation, makes sense to me.

Making it more configurable is a good suggestion, of course! I would specify the retry delay as a Duration, which we see more often in Quarkus (with a default of 1 second): https://quarkus.io/guides/all-config#duration-note-anchor-all-config
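For example (illustrative only; the property name is an assumption, and this relies on Quarkus' built-in Duration converter):

java.time.Duration retryDelay = org.eclipse.microprofile.config.ConfigProvider.getConfig()
        .getOptionalValue("quarkus.log.handler.splunk.retry-delay", java.time.Duration.class)
        .orElse(java.time.Duration.ofSeconds(1));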

}

private boolean shouldRetry(int statusCode) {
return statusCode != 200 && !HttpEventCollectorApplicationErrors.contains(statusCode);
@rquinio1A (Member):

Linked to the comment about reversing the logic to a whitelist:

Suggested change
return statusCode != 200 && !HttpEventCollectorApplicationErrors.contains(statusCode);
return RetryableHttpStatusCodes.contains(statusCode);

retry();
} else {
// if non-retryable, resend wouldn't help, delegate to previous callback
prevCallback.completed(statusCode, reply);
@rquinio1A (Member):

Should we call prevCallback.failed(new RetryLimitExceededException()) with our own exception ?
There's some logic in io.quarkiverse.logging.splunk.SplunkErrorCallback to log to stdout/stderr in case of failure, but I'm not sure whether the ErrorCallback will get called.
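For illustration (an editorial sketch, not a suggestion made in the thread; retryCount and retriesOnError are assumed field names, and RetryLimitExceededException would be a new exception type in this extension), the callback could distinguish the two non-retry cases roughly like this:

if (shouldRetry(statusCode) && retryCount < retriesOnError) {
    retry();
} else if (shouldRetry(statusCode)) {
    // retries exhausted: fail with our own exception so the error path (e.g. SplunkErrorCallback) can report it
    prevCallback.failed(new RetryLimitExceededException());
} else {
    // if non-retryable, resend wouldn't help, delegate to previous callback
    prevCallback.completed(statusCode, reply);
}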

@jcjveraa (Author):

This PR as-is should in principle work the same as the current implementation - it only changes what gets retried, so if it works now it will in theory still work. But there is no reason not to change it, of course. Do you have a concrete code suggestion?

@rquinio1A (Member) commented

Note: you'll need to rebase from main to fix the CI failure.

@jcjveraa (Author) left a comment

Thanks for the review @rquinio1A! I've responded to the comments above.

@jcjveraa (Author) commented Mar 7, 2025

Merged with main so the pipeline can run; I also discovered that I had failed to update SplunkLogHandlerTest earlier. I haven't implemented anything from the discussion above yet - time to wake up my sleeping toddler now, so perhaps later today :-)
