[BUG] Segment Replication - SegRep bytes behind and lag metrics incorrect post primary relocation #11211
Labels
bug
Something isn't working
Indexing:Replication
Issues and PRs related to core replication framework eg segrep
Describe the bug
The replication lag metric appears to be growing indefinitely even though no document has been indexed.
Some logs I captured
Where the node assignment is as follows:
The replica's logs show that, after the primary relocated, the replica was still trying to notify the old primary of its state instead of the new one.
From the logs, I've found that this happens after a primary relocation. It looks like the primary moves to a new node, refreshes, publishes a checkpoint to its replicas, and starts its timers. The replica then syncs or discards the checkpoint, but calls back to the old primary to update its state.
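To make the suspected sequence concrete, here is a minimal toy model of it in Python. It is not OpenSearch code: `PrimaryTracker`, `simulate`, the node names `node-A`/`node-B`, and the 60-second interval are all hypothetical, chosen only to illustrate how a lag timer started on the new primary would keep running if the replica's "caught up" notification goes to the old primary.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PrimaryTracker:
    """Toy stand-in for per-primary lag bookkeeping (hypothetical, not the real class)."""
    node: str
    # replica id -> time at which the pending checkpoint was published
    timers: Dict[str, float] = field(default_factory=dict)

    def publish_checkpoint(self, replica: str, now: float) -> None:
        # Start the lag timer when a checkpoint is published to the replica.
        self.timers.setdefault(replica, now)

    def replica_caught_up(self, replica: str) -> None:
        # Stop the timer once the replica reports it synced or discarded the checkpoint.
        self.timers.pop(replica, None)

    def lag(self, replica: str, now: float) -> float:
        start = self.timers.get(replica)
        return 0.0 if start is None else now - start


def simulate(notify_node: str) -> float:
    """Relocate the primary from node-A to node-B, then read lag 60s later."""
    old_primary = PrimaryTracker("node-A")
    new_primary = PrimaryTracker("node-B")
    trackers = {t.node: t for t in (old_primary, new_primary)}

    # After relocation the new primary refreshes and publishes a checkpoint at t=0.
    new_primary.publish_checkpoint("replica-1", now=0.0)

    # The replica finishes and notifies whichever node it believes is the primary.
    trackers[notify_node].replica_caught_up("replica-1")

    # No indexing happens; lag is read from the new primary at t=60.
    return new_primary.lag("replica-1", now=60.0)


stale_lag = simulate(notify_node="node-A")    # replica still points at the old primary
correct_lag = simulate(notify_node="node-B")  # replica resolves the current primary
print(stale_lag, correct_lag)  # 60.0 0.0
```

In this sketch the stale lookup leaves the timer on `node-B` running, so the reported lag grows with wall-clock time even though nothing was indexed, which would match the symptom described above if the real code behaves similarly.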
SegmentReplicationTargetService identifies the primary using:
To Reproduce
Steps to reproduce the behavior:
This is not 100% reproducible...
/_cluster/stats
and we will see lag even though no document has been indexed.
I have also been able to reproduce this case with an IT using NetworkDisruption. I added this to
SegmentReplicationUsingRemoteStoreDisruptionIT
and it fails 100% of the time:
Expected behavior
Lag should not grow unless there is an active replication event in the group.
Plugins
N/A