-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug where replication lag grows post primary relocation #11238
Conversation
Compatibility status:Checks if related components are compatible with change e4b35a7 Incompatible componentsSkipped componentsCompatible componentsCompatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git] |
❌ Gradle check result for 49069a3: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
...er/src/main/java/org/opensearch/indices/replication/checkpoint/PublishCheckpointRequest.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java
Outdated
Show resolved
Hide resolved
...sterTest/java/org/opensearch/remotestore/SegmentReplicationUsingRemoteStoreDisruptionIT.java
Outdated
Show resolved
Hide resolved
...er/src/main/java/org/opensearch/indices/replication/checkpoint/PublishCheckpointRequest.java
Outdated
Show resolved
Hide resolved
...er/src/main/java/org/opensearch/indices/replication/checkpoint/PublishCheckpointRequest.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java
Outdated
Show resolved
Hide resolved
The current solution breaks down when processing the latest received checkpoint as we don't know on the replica if an incoming checkpoint is coming from the former or new primary. An alternate solution I've been testing here is to simply add a cluster changed listener, and on a routing table change we either process the latest cp or simply update the new primary with the replica's state like so...
This would cover our case here post relocation, guaranteeing that the replica updates the correct node. Alternatively, we could look up the current primary on the replica, and if its routing entry shows a relocation is in progress invoke update to both old & new nodes. However, this still leaves us depending on cluster state arriving at a certain time. |
Resolved comments made against removed code as obsolete. |
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/index/shard/IndexShard.java
Outdated
Show resolved
Hide resolved
This comment was marked as outdated.
This comment was marked as outdated.
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #11238 +/- ##
============================================
- Coverage 71.28% 71.16% -0.12%
+ Complexity 59033 58970 -63
============================================
Files 4893 4893
Lines 277753 277780 +27
Branches 40357 40363 +6
============================================
- Hits 197989 197681 -308
- Misses 63288 63659 +371
+ Partials 16476 16440 -36 ☔ View full report in Codecov by Sentry. |
❕ Gradle check result for 08d58b7: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
❌ Gradle check result for 6c93efb: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/main/java/org/opensearch/index/shard/IndexShard.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java
Outdated
Show resolved
Hide resolved
|
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
added a changelog entry here as its user facing, needed to rebase to avoid conflict |
❌ Gradle check result for e4b35a7: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❕ Gradle check result for e4b35a7: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
|
* Fix bug where replication lag grows post primary relocation Signed-off-by: Marc Handalian <[email protected]> * Fix broken UT Signed-off-by: Marc Handalian <[email protected]> * add unit test for cluster state update Signed-off-by: Marc Handalian <[email protected]> * PR feedback Signed-off-by: Marc Handalian <[email protected]> * add changelog entry Signed-off-by: Marc Handalian <[email protected]> --------- Signed-off-by: Marc Handalian <[email protected]> (cherry picked from commit 6fa3a0d) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…11427) * Fix bug where replication lag grows post primary relocation * Fix broken UT * add unit test for cluster state update * PR feedback * add changelog entry --------- (cherry picked from commit 6fa3a0d) Signed-off-by: Marc Handalian <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
protected void doClose() throws IOException { | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should doClose
replicate doStop
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't need to as stop is invoked before close when the node is shut down from Node#close. However, I don't see this getting explicitly closed there along with multiple multiple other services extending AbstractLifecycleComponent, ex PeerRecoverySourceService
. Unless I'm missing something here, will raise a pr to get these added.
…ch-project#11238) * Fix bug where replication lag grows post primary relocation Signed-off-by: Marc Handalian <[email protected]> * Fix broken UT Signed-off-by: Marc Handalian <[email protected]> * add unit test for cluster state update Signed-off-by: Marc Handalian <[email protected]> * PR feedback Signed-off-by: Marc Handalian <[email protected]> * add changelog entry Signed-off-by: Marc Handalian <[email protected]> --------- Signed-off-by: Marc Handalian <[email protected]>
…ch-project#11238) * Fix bug where replication lag grows post primary relocation Signed-off-by: Marc Handalian <[email protected]> * Fix broken UT Signed-off-by: Marc Handalian <[email protected]> * add unit test for cluster state update Signed-off-by: Marc Handalian <[email protected]> * PR feedback Signed-off-by: Marc Handalian <[email protected]> * add changelog entry Signed-off-by: Marc Handalian <[email protected]> --------- Signed-off-by: Marc Handalian <[email protected]>
…ch-project#11238) * Fix bug where replication lag grows post primary relocation Signed-off-by: Marc Handalian <[email protected]> * Fix broken UT Signed-off-by: Marc Handalian <[email protected]> * add unit test for cluster state update Signed-off-by: Marc Handalian <[email protected]> * PR feedback Signed-off-by: Marc Handalian <[email protected]> * add changelog entry Signed-off-by: Marc Handalian <[email protected]> --------- Signed-off-by: Marc Handalian <[email protected]> Signed-off-by: Shivansh Arora <[email protected]>
Description
This fixes an issue where replication lag grows post primary relocation. After a relocation occurs the new primary will publish a checkpoint to sync with the new replica, the replica if it has not yet processed a cluster state update that the relocation has completed, will respond to the old primary. If no further action is taken on the shard, the new primary will think the replica is stale. This change fixes this by triggering a round of replication from the replica once it has processed the cluster state update.
Related Issues
Resolves #11211
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.