CCR status should show lagging / doc count #89991

Closed
Leaf-Lin opened this issue Sep 12, 2022 · 1 comment · Fixed by #104015 · May be fixed by #97379
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@Leaf-Lin
Contributor

Description

Today, CCR stats give a lot of low-level details, but those aren't as useful as a document count. At the end of the day, users want to know whether the follower is keeping up with the leader and, if there's a delay, what the difference in document count is or how long it will take the follower to catch up with the leader. These should be exposed as simple, user-friendly metrics that don't require extra mental arithmetic.

  • One could check the document count on the leader and the follower per CCR index after shards are refreshed.
  • A difference between follower_max_seq_no and leader_max_seq_no on a shard indicates that some operations haven't been processed.
  • Alternatively, a difference between leader_global_checkpoint and follower_global_checkpoint indicates some lag. However, we have seen cases where _ccr/stats reports identical checkpoint values while the doc counts differ. It seems some xpack/ccr/shard_follow_task tasks can get stuck when the connection fails, so the global checkpoints alone may not be a reliable indicator of lag.
  • The Kibana CCR monitoring visualization also provides Sync Lag (Ops), which [Monitoring] CCR UI kibana#23013 describes as follows:
    • For each shard, take the delta of the max and min leader_max_seq_no and the delta of the max and min follower_global_checkpoint over the time period, subtract those two from each other, and take the max.
  • It may be possible to estimate the time for the follower to catch up with the leader:
(elasticsearch.ccr.leader.max_seq_no - elasticsearch.ccr.follower.max_seq_no) * total_read_time_millis / operations_read 
  • Capture the leader’s global checkpoint, call it N, and then time how long it takes for the follower’s global checkpoint to be ≥N.
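To make the arithmetic in the list above concrete, here is a minimal Python sketch. It assumes a shard entry shaped like the per-shard objects returned by _ccr/stats, using only the field names mentioned above. The helper names (`shard_lag`, `time_until_caught_up`) and the sample numbers are hypothetical, and the checkpoint callables stand in for whatever client call you use to fetch the values. As noted above, checkpoints can look identical while doc counts differ, so treat the result as an indicator rather than proof of being in sync.

```python
import time
from typing import Callable, Optional


def shard_lag(shard: dict) -> dict:
    """Derive simple lag figures from one shard entry of CCR follow stats."""
    # Operations the leader has that the follower has not yet applied.
    ops_behind = shard["leader_max_seq_no"] - shard["follower_max_seq_no"]
    checkpoint_gap = (
        shard["leader_global_checkpoint"] - shard["follower_global_checkpoint"]
    )
    # Catch-up estimate from the issue text:
    # ops_behind * total_read_time_millis / operations_read.
    # It is undefined until at least one operation has been read.
    ops_read = shard["operations_read"]
    eta_millis: Optional[float] = None
    if ops_read > 0:
        eta_millis = ops_behind * shard["total_read_time_millis"] / ops_read
    return {
        "ops_behind": ops_behind,
        "checkpoint_gap": checkpoint_gap,
        "eta_millis": eta_millis,
    }


def time_until_caught_up(
    leader_checkpoint: Callable[[], int],
    follower_checkpoint: Callable[[], int],
    poll_seconds: float = 1.0,
    timeout_seconds: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Capture the leader checkpoint N, then time how long the follower takes
    to reach a checkpoint >= N. Returns None if the timeout expires first."""
    target = leader_checkpoint()
    start = clock()
    while clock() - start <= timeout_seconds:
        if follower_checkpoint() >= target:
            return clock() - start
        sleep(poll_seconds)
    return None


# Made-up sample values, for illustration only.
sample_shard = {
    "leader_global_checkpoint": 1000,
    "leader_max_seq_no": 1000,
    "follower_global_checkpoint": 900,
    "follower_max_seq_no": 900,
    "operations_read": 9000,
    "total_read_time_millis": 4500,
}
lag = shard_lag(sample_shard)
```

With these sample values the follower is 100 operations behind and the average read time is 0.5 ms per operation, so the estimate is 50 ms. The estimate only accounts for read time on the follower, so it is likely a lower bound.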


Related: #86798

@Leaf-Lin Leaf-Lin added >enhancement needs:triage Requires assignment of a team area label :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. and removed needs:triage Requires assignment of a team area label labels Sep 12, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)
