Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Flaky Test SpecificClusterManagerNodesIT.testElectOnlyBetweenClusterManagerNodes #16021

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kkewwei
Copy link
Contributor

@kkewwei kkewwei commented Sep 20, 2024

Description

The case is as follows:

  1. When the node_t1 is excluded from the vote config, and the cluster starts a new leader election, but the the node node_t2 hasn't been elected as the new leader.

  2. At the moment, we send request to get the ClusterManager, we first get ClusterManager name, and leads to the NullPointerException.

internalCluster().nonClusterManagerClient()-> ......->getClusterManagerName()

return client.admin().cluster().prepareState().get().getState().nodes().getClusterManagerNode().getName();

Related Issues

Resolves #15944 #16015

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

❕ Gradle check result for 1b3920b: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@msfroh
Copy link
Collaborator

msfroh commented Sep 20, 2024

Checking the history of this test, has this been flaky for more than a year?

The only code related to cluster manager election that I've been able to find changed more recently than ~5 years ago (besides renaming) is the introduction of DecommisionService in 2023.

@kkewwei kkewwei force-pushed the fix_16015 branch 2 times, most recently from 1338c91 to 3823b1c Compare September 21, 2024 05:30
@kkewwei kkewwei changed the title Fix Flaky Test SpecificClusterManagerNodesIT.testElectOnlyBetweenClus… Fix Flaky Test SpecificClusterManagerNodesIT.testElectOnlyBetweenClusterManagerNodes Sep 21, 2024
@kkewwei
Copy link
Contributor Author

kkewwei commented Sep 21, 2024

Checking the history of this test, has this been flaky for more than a year?

The only code related to cluster manager election that I've been able to find changed more recently than ~5 years ago (besides renaming) is the introduction of DecommisionService in 2023.

@msfroh I'm not clearly why it wasn’t reported before, maybe the frequency of occurrence is very low and we ignored it? This issue #15944 collects some of the recent case.

Logically speaking, when the cluster has no cluster manager, It's right to return null from getClusterManagerNode, which will hit the case.

Copy link
Contributor

✅ Gradle check result for 4d812e9: SUCCESS

Copy link

codecov bot commented Sep 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.98%. Comparing base (036f6bc) to head (4d812e9).
Report is 239 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16021      +/-   ##
============================================
+ Coverage     71.92%   71.98%   +0.06%     
  Complexity    64400    64400              
============================================
  Files          5281     5281              
  Lines        300995   300995              
  Branches      43479    43479              
============================================
+ Hits         216491   216672     +181     
+ Misses        66793    66569     -224     
- Partials      17711    17754      +43     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kkewwei
Copy link
Contributor Author

kkewwei commented Sep 24, 2024

@msfroh @gaobinlong please help confirm it when you are free.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Nov 6, 2024
return null;
}
};
assertThat(clusterManagerName.get(), equalTo(nextClusterManagerEligableNodeName));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we swallow the NPE with the try/catch above, won't this test still fail on the assertion here since clusterManagerName.get() would be null and nextClusterManagerEligableNodeName shouldn't be null?

Copy link
Contributor Author

@kkewwei kkewwei Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jed326 assertBusy allows the assert failed in 10s, so it seems ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add an additional assert before this to evaluate it is not null.

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Nov 7, 2024
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Dec 13, 2024
Copy link
Contributor

@rajiv-kv rajiv-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kkewwei for raising a fix

.getName(),
equalTo(nextClusterManagerEligableNodeName)
);
Supplier<String> clusterManagerName = () -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Can we move the supplier ouside the assertBusy block ?
  • nit: could be rename as getClusterManagerIfElected

.nodes()
.getClusterManagerNode()
.getName();
} catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of catching a generic exception, could we catch a specific exception related to ClusterManager Not found.

return null;
}
};
assertThat(clusterManagerName.get(), equalTo(nextClusterManagerEligableNodeName));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add an additional assert before this to evaluate it is not null.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autocut backport 2.x Backport to 2.x branch bug Something isn't working Cluster Manager flaky-test Random test failure that succeeds on second run skip-changelog >test-failure Test failure from CI, local build, etc.
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for SpecificClusterManagerNodesIT
5 participants