[Heartbeat] Adjust State loader to only retry for failed requests and not for 4xx #37424
Comments
Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)
@vigneshshanmugam, I could help if you want. Currently the error contains a JSON string that could be unmarshalled to check the status. For example: "401 Unauthorized: {\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"missing authentication credentials for REST request [/]\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer realm=\\\"security\\\"\",\"ApiKey\"]}}],\"type\":\"security_exception\",\"reason\":\"missing authentication credentials for REST request [/]\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer realm=\\\"security\\\"\",\"ApiKey\"]}},\"status\":401}" But what about errors that occur when, for example, the server is offline or there are certificate issues? In those cases the error message has a different structure.
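A minimal sketch of that idea in Go, assuming the error message has the shape shown above (a status line followed by a JSON body with a top-level `status` field). `statusFromError` is a hypothetical helper, not an existing Heartbeat function. It returns `ok == false` for messages without such a body, which would cover the offline-server and certificate cases.

```go
package main

import (
	"encoding/json"
	"strings"
)

// statusFromError is a hypothetical helper that tries to recover the HTTP
// status from an error message of the form "<code> <text>: <JSON body>".
// ok is false when the message carries no JSON body with a "status" field,
// e.g. for connection or certificate errors.
func statusFromError(msg string) (status int, ok bool) {
	// Assumption: the JSON body starts at the first '{' in the message.
	i := strings.Index(msg, "{")
	if i < 0 {
		return 0, false
	}
	var body struct {
		Status int `json:"status"`
	}
	if err := json.Unmarshal([]byte(msg[i:]), &body); err != nil || body.Status == 0 {
		return 0, false
	}
	return body.Status, true
}
```

Parsing the message text is fragile. If the client can expose the status code directly on the error value, checking that would be the more robust choice.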
I guess you had these attempts in mind?
Hi @martinscholz83, we started making progress on this issue a while ago and overlooked your comments. Sorry about that. We really appreciate your interest in solving this!
The attempts we had in mind relate to the query that extracts the state. The ones you mentioned relate to the "initial" connection to ES. Thanks again.
Summary
Heartbeat uses the state loader to get the last status from the ES cluster and loads the current monitor state after the monitor has run successfully. The ES loader has a backoff of 3 retries, which are attempted when the request fails due to connection failures or ES being unavailable. However, we currently also retry on 4xx responses, such as when the API key has limited permissions to read the ES state.
This increases the time a particular execution takes when we are running inside the SaaS service.
Proposal
Retry ES state loader requests only for network connection failures or 5xx responses, and avoid retrying for 403 and other 4xx responses when we know it is a valid status from the ES side.