Decommission retry/pr #66
Conversation
Signed-off-by: Rishab Nahata <[email protected]>
Signed-off-by: Rishab Nahata <[email protected]>
Signed-off-by: Rishab Nahata <[email protected]>
Gradle Check (Jenkins) Run Completed with:
Signed-off-by: Rishab Nahata <[email protected]>
Gradle Check (Jenkins) Run Completed with:
```java
public DecommissionRequest(DecommissionAttribute decommissionAttribute, TimeValue retryTimeout) {
    this(decommissionAttribute, false, retryTimeout);
}
```
How are we making sure that the user doesn't specify this parameter? Did we evaluate putting it in the request context, since it is an internal detail of the request and not a user-facing one?
Are you talking about the retry flag or the timeout?
For the flag: the REST action doesn't accept a retry flag from the user, so any request created at the REST layer will always have it set to false. Today it is only set to true when we trigger the retry action from the service.
For the timeout: a default is set, and the user can override it for their use case.
Let me know if this answers your question.
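To illustrate (a minimal sketch; `decommissionAttribute` and `retryTimeout` are placeholders, and the three-argument constructor is the one the two-argument constructor above delegates to):

```java
// REST layer: the user-facing constructor hardcodes the retry flag to false,
// so a request built from the REST API can never arrive with retry = true.
DecommissionRequest userRequest = new DecommissionRequest(decommissionAttribute, retryTimeout);

// Service layer: only internal retry code uses the three-argument constructor
// to mark the request as a retry.
DecommissionRequest internalRetry = new DecommissionRequest(decommissionAttribute, true, retryTimeout);
```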
```java
final long remainingTimeoutMS = decommissionRequest.getRetryTimeout().millis()
    - (threadPool.relativeTimeInMillis() - startTime);
if (remainingTimeoutMS <= 0) {
    logger.debug(
        "timed out before retrying [{}] for attribute [{}] after cluster manager change",
        DecommissionAction.NAME,
        decommissionRequest.getDecommissionAttribute()
    );
    listener.onFailure(
        new OpenSearchTimeoutException(
            "timed out before retrying [{}] for attribute [{}] after cluster manager change",
            DecommissionAction.NAME,
            decommissionRequest.getDecommissionAttribute()
        )
    );
    return;
}
```
Do we need the `remainingTimeoutMS` check here only? Why do we need it?
This timeout is a check for retry eligibility only. The other actions that make up decommission have their own timeouts. This is the place where we actually trigger the retry, hence the check: if the window has lapsed, the request is not eligible for a retry.
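For a concrete sense of the check, a standalone sketch of the arithmetic only (made-up numbers, plain Java, no OpenSearch types):

```java
// Sketch of the retry-eligibility arithmetic with made-up numbers.
long startTime = 0L;                // relative time when the original attempt started (ms)
long now = 35_000L;                 // stand-in for threadPool.relativeTimeInMillis()
long retryTimeoutMillis = 30_000L;  // the request's retry timeout (ms)

long remainingTimeoutMS = retryTimeoutMillis - (now - startTime);
// 30_000 - 35_000 = -5_000 <= 0: the retry window has lapsed, so the request
// is failed with a timeout instead of being retried.
boolean eligibleForRetry = remainingTimeoutMS > 0;
```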
Can we do without this? In the worst case, the retried action will time out and we will throw a timeout exception. This part of the code looks out of place from a readability point of view.
```diff
-    // explicitly calling listener.onFailure with NotClusterManagerException as the local node is not the cluster manager
-    // this ensures that the request is retried until the cluster manager times out
-    logger.info(
-        "local node is not eligible to process the request, "
-            + "throwing NotClusterManagerException to attempt a retry on an eligible node"
-    );
-    listener.onFailure(
-        new NotClusterManagerException(
-            "node ["
-                + transportService.getLocalNode().toString()
-                + "] not eligible to execute decommission request. Will retry until timeout."
-        )
-    );
+    // since the local node is no longer cluster manager, which could've happened due to leader abdication,
+    // retry the decommission action until it times out
+    logger.info("local node is not eligible to process the request, " + "retrying the transport action until it times out");
+    decommissionController.retryDecommissionAction(
+        decommissionRequest,
+        startTime,
+        ActionListener.delegateResponse(listener, (delegatedListener, t) -> {
+            logger.debug(
+                () -> new ParameterizedMessage(
+                    "failed to retry decommission action for attribute [{}]",
+                    decommissionRequest.getDecommissionAttribute()
+                ),
+                t
+            );
+            clearVotingConfigExclusionAndUpdateStatus(false, false); // TODO - need to test this
+            delegatedListener.onFailure(t);
+        })
+    );
```
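For readers following along, a rough sketch of the shape a helper like `retryDecommissionAction` could take (hypothetical body; the actual implementation lives in `DecommissionController`, and `nodeClient` is an assumed field):

```java
public void retryDecommissionAction(
    DecommissionRequest request,
    long startTime,
    ActionListener<DecommissionResponse> listener
) {
    // only retry within the remaining window of the original request's retry timeout
    long remainingMS = request.getRetryTimeout().millis() - (threadPool.relativeTimeInMillis() - startTime);
    if (remainingMS <= 0) {
        listener.onFailure(new OpenSearchTimeoutException("timed out before retrying decommission"));
        return;
    }
    // rebuild the request with retry = true and the shrunken timeout, then
    // re-dispatch it through the transport action so the new cluster manager picks it up
    DecommissionRequest retried = new DecommissionRequest(
        request.getDecommissionAttribute(),
        true,
        TimeValue.timeValueMillis(remainingMS)
    );
    nodeClient.execute(DecommissionAction.INSTANCE, retried, listener);
}
```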
How are we responding to the user's API call now? Earlier, the retried request on the newly elected cluster manager would respond to it. Does the same hold true now as well?
Yes, the response to the retried request is attached to the listener of the original request, so the chain continues even if multiple retries get executed.
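In code terms, the chaining works because the retry reuses a listener derived from the original one via `ActionListener.delegateResponse` (a sketch based on the diff above):

```java
// The retried request does not get a fresh response channel; it wraps the
// original listener, so whichever attempt completes answers the user's call.
ActionListener<DecommissionResponse> retryListener = ActionListener.delegateResponse(
    listener, // listener of the original user request
    (delegatedListener, t) -> {
        // onResponse flows straight through to the original listener;
        // only failures pass through this handler before reaching it
        delegatedListener.onFailure(t);
    }
);
// If a retry itself triggers another retry, the new attempt wraps retryListener
// the same way, extending the chain back to the original caller.
```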
Closing this and opening a new PR in the upstream repo: opensearch-project#4684
Description
[Describe what this change achieves]
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.