CCR status should show lagging / doc count #89991

Leaf-Lin · 2022-09-12T02:10:47Z

Description

Today, CCR stats give a lot of low-level detail, but that is less useful than a document count. Users want to know whether the follower is keeping up with the leader and, if there is a delay, what the difference in document count is or how long it will take the follower to catch up with the leader. These should be exposed as simple, user-friendly metrics that require no extra mental arithmetic.

One could check the document count on the leader and the follower for each CCR index after the shards are refreshed.
A difference between follower_max_seq_no and leader_max_seq_no on a shard indicates that some operations have not been processed yet.
Alternatively, a difference between leader_global_checkpoint and follower_global_checkpoint indicates some lag. However, we have seen cases where _ccr/stats reports identical checkpoint values while the doc counts differ. It seems some xpack/ccr/shard_follow_task tasks can get stuck when the connection fails, so the global checkpoints alone may not be a reliable indicator of lag.
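As a rough illustration of the two per-shard differences described above, here is a minimal Python sketch. It assumes the input is a dict shaped like the follow stats response (an "indices" list whose entries carry a "shards" list with the four counters named above). The function name shard_lags and the sample values are hypothetical, and the sketch makes no HTTP call.

```python
from typing import Any


def shard_lags(follow_stats: dict[str, Any]) -> list[dict[str, Any]]:
    """Compute per-shard lag figures from a follow-stats-shaped dict.

    Assumption: `follow_stats` looks like {"indices": [{"index": ..., "shards": [...]}]}
    and each shard entry carries the four counters used below.
    """
    rows = []
    for index in follow_stats.get("indices", []):
        for shard in index.get("shards", []):
            rows.append({
                "index": index["index"],
                "shard_id": shard["shard_id"],
                # Operations the leader has assigned that the follower has not seen yet.
                "seq_no_lag": shard["leader_max_seq_no"] - shard["follower_max_seq_no"],
                # Gap between the two global checkpoints.
                "checkpoint_lag": shard["leader_global_checkpoint"]
                - shard["follower_global_checkpoint"],
            })
    return rows


# Hypothetical sample data, for illustration only.
sample = {
    "indices": [{
        "index": "follower_index",
        "shards": [{
            "shard_id": 0,
            "leader_global_checkpoint": 1024,
            "leader_max_seq_no": 1536,
            "follower_global_checkpoint": 768,
            "follower_max_seq_no": 896,
        }],
    }]
}

for row in shard_lags(sample):
    print(row)
```

As noted above, a zero checkpoint_lag does not prove the follower is in sync, since a stuck shard follow task can leave the reported checkpoints identical while the doc counts differ.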
The Kibana CCR monitoring visualization also provides Sync Lag (Ops), which "[Monitoring] CCR UI" (kibana#23013) describes as follows:
- The delta of the max and min leader_max_seq_no subtracted against the delta of the max and min follower_global_checkpoint between the time period for each shard, then subtract those two from each other and take the max.
It may be possible to estimate the time for the follower to catch up with the leader:

(elasticsearch.ccr.leader.max_seq_no - elasticsearch.ccr.follower.max_seq_no) * total_read_time_millis / operations_read
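The formula above can be sketched in Python as follows. The function name estimate_catch_up_millis is hypothetical. The estimate assumes that the historical average read time per operation (total_read_time_millis / operations_read) is representative of future throughput and that the leader receives no further writes in the meantime, so treat the result as a rough figure only.

```python
from typing import Optional


def estimate_catch_up_millis(
    leader_max_seq_no: int,
    follower_max_seq_no: int,
    total_read_time_millis: int,
    operations_read: int,
) -> Optional[float]:
    """Estimate the catch-up time in milliseconds using the formula above.

    Returns None when no operations have been read yet, because the
    average read time per operation is undefined in that case.
    """
    if operations_read <= 0:
        return None
    # Pending operations; clamp at zero in case the counters were sampled at different moments.
    pending_ops = max(leader_max_seq_no - follower_max_seq_no, 0)
    avg_read_millis_per_op = total_read_time_millis / operations_read
    return pending_ops * avg_read_millis_per_op


# Hypothetical values: 100 pending operations at an average of 2 ms per operation.
print(estimate_catch_up_millis(1100, 1000, 2000, 1000))
```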

Capture the leader’s global checkpoint, call it N, and then time how long it takes for the follower’s global checkpoint to be ≥N.
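The measurement described above could be sketched like this. The two reader callables are hypothetical placeholders for however the checkpoints are fetched (for example, from a stats response). The polling interval and timeout are arbitrary assumptions, and the clock and sleep functions are injectable only to make the sketch easy to exercise without waiting.

```python
import time
from typing import Callable, Optional


def time_to_catch_up(
    read_leader_checkpoint: Callable[[], int],
    read_follower_checkpoint: Callable[[], int],
    poll_interval_s: float = 1.0,
    timeout_s: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return the seconds taken for the follower checkpoint to reach N, or None on timeout.

    N is the leader's global checkpoint captured once at the start.
    """
    target = read_leader_checkpoint()  # N
    start = clock()
    while True:
        if read_follower_checkpoint() >= target:
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)
```

The measured time is only as fine-grained as the polling interval, and a None result is worth surfacing on its own, since it may point to a stuck follow task instead of ordinary lag.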

Related: #86798

elasticsearchmachine · 2022-09-12T02:11:34Z

Pinging @elastic/es-distributed (Team:Distributed)

volodk85 mentioned this issue Jul 5, 2023

Show CCR status lagging doc count #97379

Draft

volodk85 self-assigned this Dec 9, 2023

volodk85 mentioned this issue Jan 5, 2024

Calc follower vs leader indexing lag based on shard global checkpoints #104015

Merged

volodk85 closed this as completed in #104015 Jan 10, 2024
