
Fork HttpEventCollectorResendMiddleware to properly implement retrying #322

Open · wants to merge 4 commits into base: main

Conversation

@jcjveraa commented Feb 14, 2025

As of today the 'bundled' HttpEventCollectorResendMiddleware only retries when a '200 - OK' response is returned from the Splunk server, which is odd to say the least. @simonhege has created a PR on the java splunk logging project (ref. splunk/splunk-library-javalogging#287), but that project seems to be inactive to the point that nobody is merging it. I suggest including his work in this project until #281 is implemented (or until his PR ever gets merged).

Without this, any user of this library who specifies retries gets what I would call a false sense of security: whenever the Splunk server returns a server error, no retry is done. In my organisation we get 503s quite regularly when the server is temporarily overloaded, and these log messages now simply disappear. (ref https://docs.splunk.com/Documentation/Splunk/9.4.0/RESTREF/RESTinput#services.2Fcollector - note that deep links work poorly on this site; ctrl+f for "HEC is unhealthy, queues are full".)

If you agree to this change, I would propose also including it in the relevant 3.15.X LTS release (my team is on Quarkus 3.15 LTS).

A perfectly fine alternative would of course be to have this merged in the main Splunk HEC repository, but as said, that project seems to be very inactive.
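For reference, the heart of the forked middleware is the reworked retry check (it also shows up in the diff excerpts further down): anything that is neither a successful post nor a documented Splunk application error gets retried.

private boolean shouldRetry(int statusCode) {
    // retry everything except a 200 OK or a documented Splunk application error
    return statusCode != 200 && !HttpEventCollectorApplicationErrors.contains(statusCode);
}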

@jcjveraa requested a review from a team as a code owner on February 14, 2025 at 13:36
@jcjveraa changed the title from "Fork HttpEventCollectorResendMiddleware to properly implemeent retrying" to "Fork HttpEventCollectorResendMiddleware to properly implement retrying" on Feb 14, 2025
@jcjveraa (Author) commented

@vietk ?

@rquinio1A (Member) left a comment

Sorry for the late review, I think the PR makes sense as the logic can be quite generic.

// Method Not Allowed
405,
// Bad Request
400
@rquinio1A (Member):

There may be other 4xx codes where we should not retry.
I think it may be safer to retry only on 502 Bad Gateway, 503 Service Unavailable and 504 Gateway Timeout (we can expand the list in the future if needed).
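For illustration only (this is not code from the PR; the constant name just mirrors the suggested change further down), the whitelist variant could look roughly like this inside the middleware class:

// hypothetical whitelist: retry only on transient gateway/availability errors
private static final java.util.Set<Integer> RetryableHttpStatusCodes = java.util.Set.of(
        502, // Bad Gateway
        503, // Service Unavailable
        504  // Gateway Timeout
);

private boolean shouldRetry(int statusCode) {
    return RetryableHttpStatusCodes.contains(statusCode);
}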

@jcjveraa (Author):

I would actually prefer the 'blacklist' as @simonhege implemented it: we want logging to happen.
Retrying anything except the documented Splunk 'there is no point in retrying' responses would, in my eyes, be the 'safe' default.

If we retry only on a whitelist, we risk not retrying on a hypothetical future error 567 that the Splunk API might return, unless we adjust our code. Or we won't retry on a 404 that might be caused by a faulty DNS configuration while Splunk itself is fine and back online 5 minutes later.

This PR was a 1:1 copy of @simonhege's work, but triggered by your response I'm thinking about it a bit more and would actually want to narrow the check of what to retry even further: we should retry anything that isn't a known Splunk 4XX response. For example, if the response is a 400 but lacks the expected JSON body such as {"text":"Incorrect data format","code":5,"invalid-event-number":0}, we should assume the 400 comes from some middleware between us and the Splunk HEC, and retry. I've seen this often enough in practice, where some 'transparent' proxy failed and we got responses from the proxy instead of the end system we tried to reach.

Optionally, we could add another user config parameter for 'codes not to retry'?
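Sketching that idea (purely illustrative, not part of this PR; the body check, and the assumption that Splunk's error JSON always carries "text" and "code" fields, are mine): a response would only count as a final, non-retryable Splunk application error when both the status code and the expected JSON body match.

private boolean shouldRetry(int statusCode, String reply) {
    if (statusCode == 200) {
        return false; // success, nothing to resend
    }
    // assumed heuristic: a genuine Splunk HEC error reply carries a JSON body with "text" and "code"
    boolean looksLikeSplunkError = reply != null
            && reply.contains("\"text\"")
            && reply.contains("\"code\"");
    // only a documented Splunk application error with the expected body is final;
    // anything else (unknown 5xx, proxy-generated 400, ...) is retried
    return !(HttpEventCollectorApplicationErrors.contains(statusCode) && looksLikeSplunkError);
}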

* When HTTP post reply isn't an application error it tries to resend the data.
* An exponentially growing delay is used to prevent server overflow.
*/
public class HttpEventCollectorResendMiddleware
@rquinio1A (Member):

Would it make sense to make this class the default for the quarkus.log.handler.splunk.middleware config? Since retriesOnError = 0, it would do nothing by default.
If so, we could potentially make retriesOnError (the same as the existing quarkus.log.handler.splunk.max-retries?) and retryDelay configurable by doing a MicroProfile Config lookup in the constructor, via something like:

org.eclipse.microprofile.config.ConfigProvider.getConfig().getOptionalValue("quarkus.log.handler.splunk.max-retries", Integer.class).orElse(0);
org.eclipse.microprofile.config.ConfigProvider.getConfig().getOptionalValue("quarkus.log.handler.splunk.retry-initial-delay-ms", Integer.class).orElse(1000);
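
For context, a minimal sketch of what such a constructor lookup could look like (field names, property names and defaults here are assumptions, not code from the PR):

public HttpEventCollectorResendMiddleware() {
    org.eclipse.microprofile.config.Config config = org.eclipse.microprofile.config.ConfigProvider.getConfig();
    // assumed defaults: no retries, 1000 ms initial delay
    this.retriesOnError = config
            .getOptionalValue("quarkus.log.handler.splunk.max-retries", Integer.class)
            .orElse(0);
    this.retryDelay = config
            .getOptionalValue("quarkus.log.handler.splunk.retry-initial-delay-ms", Integer.class)
            .orElse(1000);
}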

@jcjveraa (Author):

If we set this as the default middleware, it might confuse users: setting their own middleware would then remove this one, I think? Especially confusing as they can still specify retries via the config. Injecting it 'transparently' when retries > 0, as in the current implementation, makes sense to me.

Making it more configurable is a good suggestion, of course! I would specify the retry delay as a Duration, which we see more often in Quarkus (with a default of 1 second): https://quarkus.io/guides/all-config#duration-note-anchor-all-config
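For example (illustrative only; the property name is an assumption, and this relies on Quarkus' built-in Duration converter):

java.time.Duration retryDelay = org.eclipse.microprofile.config.ConfigProvider.getConfig()
        .getOptionalValue("quarkus.log.handler.splunk.retry-delay", java.time.Duration.class)
        .orElse(java.time.Duration.ofSeconds(1));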

}

private boolean shouldRetry(int statusCode) {
return statusCode != 200 && !HttpEventCollectorApplicationErrors.contains(statusCode);
@rquinio1A (Member):

Linked to the comment about reversing the logic to a whitelist:

Suggested change
return statusCode != 200 && !HttpEventCollectorApplicationErrors.contains(statusCode);
return RetryableHttpStatusCodes.contains(statusCode);

retry();
} else {
// if non-retryable, resend wouldn't help, delegate to previous callback
prevCallback.completed(statusCode, reply);
@rquinio1A (Member):

Should we call prevCallback.failed(new RetryLimitExceededException()) with our own exception ?
There's some logic in io.quarkiverse.logging.splunk.SplunkErrorCallback to log to stdout/stderr in case of failure, but I'm not sure whether the ErrorCallback will get called.
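For illustration (an editorial sketch, not a suggestion made in the thread; retryCount and retriesOnError are assumed field names, and RetryLimitExceededException would be a new exception type in this extension), the callback could distinguish the two non-retry cases roughly like this:

if (shouldRetry(statusCode) && retryCount < retriesOnError) {
    retry();
} else if (shouldRetry(statusCode)) {
    // retries exhausted: fail with our own exception so the error path (e.g. SplunkErrorCallback) can report it
    prevCallback.failed(new RetryLimitExceededException());
} else {
    // if non-retryable, resend wouldn't help, delegate to previous callback
    prevCallback.completed(statusCode, reply);
}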

@jcjveraa (Author):

This PR as-is should in principle work the same as the current implementation - it only changes what gets retried, so if it works now it will in theory still work. But there is no reason not to change it, of course. Do you have a concrete code suggestion?

@rquinio1A (Member) commented

Note: you'll need to rebase from main to fix the CI failure.

@jcjveraa (Author) left a comment

Thanks for the review @rquinio1A! I've responded to the comments above.

@jcjveraa (Author) commented Mar 7, 2025

Merged with main so the pipeline can run; I also discovered that I had failed to update SplunkLogHandlerTest earlier. I haven't implemented anything from the discussion above yet - time to wake up my sleeping toddler now, so perhaps later today :-)
