Decommission retry/pr #66
Conversation
Signed-off-by: Rishab Nahata <[email protected]>
Signed-off-by: Rishab Nahata <[email protected]>
Signed-off-by: Rishab Nahata <[email protected]>
Gradle Check (Jenkins) Run Completed with:
Signed-off-by: Rishab Nahata <[email protected]>
Gradle Check (Jenkins) Run Completed with:
```java
public DecommissionRequest(DecommissionAttribute decommissionAttribute, TimeValue retryTimeout) {
    this(decommissionAttribute, false, retryTimeout);
}
```
How are we making sure that the user doesn't specify this parameter? Did we evaluate putting it in the request context, since it is an internal detail of the request and not a user-facing one?
Are you talking about the retry flag or the timeout?
For the flag: the REST action doesn't accept a retry flag from the user, so any request created at the REST layer will always have it set to false. Today it is only set to true when we trigger the retry action from the service.
For the timeout: a default is set, and the user can override it for their use case.
Let me know if this answers your question.
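To illustrate (a minimal sketch; `decommissionAttribute` and `retryTimeout` are placeholders, and the three-argument constructor is the one the two-argument constructor above delegates to):

```java
// REST layer: the user-facing constructor hardcodes the retry flag to false,
// so a request built from the REST API can never arrive with retry = true.
DecommissionRequest userRequest = new DecommissionRequest(decommissionAttribute, retryTimeout);

// Service layer: only internal retry code uses the three-argument constructor
// to mark the request as a retry.
DecommissionRequest internalRetry = new DecommissionRequest(decommissionAttribute, true, retryTimeout);
```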
```java
final long remainingTimeoutMS = decommissionRequest.getRetryTimeout().millis()
    - (threadPool.relativeTimeInMillis() - startTime);
if (remainingTimeoutMS <= 0) {
    logger.debug(
        "timed out before retrying [{}] for attribute [{}] after cluster manager change",
        DecommissionAction.NAME,
        decommissionRequest.getDecommissionAttribute()
    );
    listener.onFailure(
        new OpenSearchTimeoutException(
            "timed out before retrying [{}] for attribute [{}] after cluster manager change",
            DecommissionAction.NAME,
            decommissionRequest.getDecommissionAttribute()
        )
    );
    return;
}
```
Do we need the `remainingTimeoutMS` check here only? Why do we need it?
This timeout is a check for retry eligibility only. The other actions that make up decommission have their own timeouts. This is the place where we actually trigger the retry, hence the check: if the window has lapsed, the request is not eligible for a retry.
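For a concrete sense of the check, a standalone sketch of the arithmetic only (made-up numbers, plain Java, no OpenSearch types):

```java
// Sketch of the retry-eligibility arithmetic with made-up numbers.
long startTime = 0L;                // relative time when the original attempt started (ms)
long now = 35_000L;                 // stand-in for threadPool.relativeTimeInMillis()
long retryTimeoutMillis = 30_000L;  // the request's retry timeout (ms)

long remainingTimeoutMS = retryTimeoutMillis - (now - startTime);
// 30_000 - 35_000 = -5_000 <= 0: the retry window has lapsed, so the request
// is failed with a timeout instead of being retried.
boolean eligibleForRetry = remainingTimeoutMS > 0;
```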
Can we do without this? In the worst case, the retried action will time out and we will throw a timeout exception. This part of the code looks out of place from a readability point of view.
```diff
-    // explicitly calling listener.onFailure with NotClusterManagerException as the local node is not the cluster manager
-    // this ensures that the request is retried until the cluster manager times out
-    logger.info(
-        "local node is not eligible to process the request, "
-            + "throwing NotClusterManagerException to attempt a retry on an eligible node"
-    );
-    listener.onFailure(
-        new NotClusterManagerException(
-            "node ["
-                + transportService.getLocalNode().toString()
-                + "] not eligible to execute decommission request. Will retry until timeout."
-        )
-    );
+    // since the local node is no longer cluster manager, which could've happened due to leader abdication,
+    // retry the decommission action until it times out
+    logger.info("local node is not eligible to process the request, " + "retrying the transport action until it times out");
+    decommissionController.retryDecommissionAction(
+        decommissionRequest,
+        startTime,
+        ActionListener.delegateResponse(listener, (delegatedListener, t) -> {
+            logger.debug(
+                () -> new ParameterizedMessage(
+                    "failed to retry decommission action for attribute [{}]",
+                    decommissionRequest.getDecommissionAttribute()
+                ),
+                t
+            );
+            clearVotingConfigExclusionAndUpdateStatus(false, false); // TODO - need to test this
+            delegatedListener.onFailure(t);
+        })
+    );
```
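For readers following along, a rough sketch of the shape a helper like `retryDecommissionAction` could take (hypothetical body; the actual implementation lives in `DecommissionController`, and `nodeClient` is an assumed field):

```java
public void retryDecommissionAction(
    DecommissionRequest request,
    long startTime,
    ActionListener<DecommissionResponse> listener
) {
    // only retry within the remaining window of the original request's retry timeout
    long remainingMS = request.getRetryTimeout().millis() - (threadPool.relativeTimeInMillis() - startTime);
    if (remainingMS <= 0) {
        listener.onFailure(new OpenSearchTimeoutException("timed out before retrying decommission"));
        return;
    }
    // rebuild the request with retry = true and the shrunken timeout, then
    // re-dispatch it through the transport action so the new cluster manager picks it up
    DecommissionRequest retried = new DecommissionRequest(
        request.getDecommissionAttribute(),
        true,
        TimeValue.timeValueMillis(remainingMS)
    );
    nodeClient.execute(DecommissionAction.INSTANCE, retried, listener);
}
```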
How are we responding to the user's API call now? Earlier, the retried request on the newly elected cluster manager would respond to it. Does the same hold true now as well?
Yes, the response to the retried request is attached to the listener of the original request, so the chain continues even if multiple retries get executed.
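In code terms, the chaining works because the retry reuses a listener derived from the original one via `ActionListener.delegateResponse` (a sketch based on the diff above):

```java
// The retried request does not get a fresh response channel; it wraps the
// original listener, so whichever attempt completes answers the user's call.
ActionListener<DecommissionResponse> retryListener = ActionListener.delegateResponse(
    listener, // listener of the original user request
    (delegatedListener, t) -> {
        // onResponse flows straight through to the original listener;
        // only failures pass through this handler before reaching it
        delegatedListener.onFailure(t);
    }
);
// If a retry itself triggers another retry, the new attempt wraps retryListener
// the same way, extending the chain back to the original caller.
```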
Closing this and opening a new PR in the upstream repo: opensearch-project#4684
Description
[Describe what this change achieves]
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.