Calc follower vs leader indexing lag based on shard global checkpoints #104015

volodk85 · 2024-01-05T23:08:25Z

Calculate the follower vs. leader indexing lag from shard global checkpoints. Calculating it from doc stats after a shard refresh is undesirable, because the refresh can significantly slow down the overall process.

Closes #89991

Sample API output:

curl -X GET "localhost:19200/my-follower-index-000001/_ccr/stats?pretty"         
{
  "indices" : [
    {
      "index" : "my-follower-index-000001",
      "follower_to_leader_lagging_ops_count" : 1634,
      "shards" : [
        {
          "remote_cluster" : "cluster0",
          "leader_index" : "my-index-000001",
          "follower_index" : "my-follower-index-000001",
          "shard_id" : 0,
          "leader_global_checkpoint" : 3290,
          "leader_max_seq_no" : 3290,
          "follower_global_checkpoint" : 3290,
          "follower_max_seq_no" : 3290,
          "last_requested_seq_no" : 3290,
          "outstanding_read_requests" : 1,
          "outstanding_write_requests" : 0,
          "write_buffer_operation_count" : 0,
          "write_buffer_size_in_bytes" : 0,
          "follower_mapping_version" : 2,
          "follower_settings_version" : 1,
          "follower_aliases_version" : 1,
          "total_read_time_millis" : 349,
          "total_read_remote_exec_time_millis" : 121,
          "successful_read_requests" : 2,
          "failed_read_requests" : 0,
          "operations_read" : 2343,
          "bytes_read" : 38633727,
          "total_write_time_millis" : 979,
          "successful_write_requests" : 2,
          "failed_write_requests" : 4,
          "operations_written" : 2343,
          "read_exceptions" : [ ],
          "time_since_last_read_millis" : 1149
        },
        {
          "remote_cluster" : "cluster0",
          "leader_index" : "my-index-000001",
          "follower_index" : "my-follower-index-000001",
          "shard_id" : 1,
          "leader_global_checkpoint" : 3406,
          "leader_max_seq_no" : 3406,
          "follower_global_checkpoint" : 1772,
          "follower_max_seq_no" : 3406,
          "last_requested_seq_no" : 3406,
          "outstanding_read_requests" : 1,
          "outstanding_write_requests" : 1,
          "write_buffer_operation_count" : 0,
          "write_buffer_size_in_bytes" : 0,
          "follower_mapping_version" : 2,
          "follower_settings_version" : 1,
          "follower_aliases_version" : 1,
          "total_read_time_millis" : 349,
          "total_read_remote_exec_time_millis" : 134,
          "successful_read_requests" : 2,
          "failed_read_requests" : 0,
          "operations_read" : 2417,
          "bytes_read" : 39853913,
          "total_write_time_millis" : 144,
          "successful_write_requests" : 1,
          "failed_write_requests" : 3,
          "operations_written" : 382,
          "read_exceptions" : [ ],
          "time_since_last_read_millis" : 1137
        },
        {
          "remote_cluster" : "cluster0",
          "leader_index" : "my-index-000001",
          "follower_index" : "my-follower-index-000001",
          "shard_id" : 2,
          "leader_global_checkpoint" : 3301,
          "leader_max_seq_no" : 3301,
          "follower_global_checkpoint" : 3301,
          "follower_max_seq_no" : 3301,
          "last_requested_seq_no" : 3301,
          "outstanding_read_requests" : 1,
          "outstanding_write_requests" : 0,
          "write_buffer_operation_count" : 0,
          "write_buffer_size_in_bytes" : 0,
          "follower_mapping_version" : 2,
          "follower_settings_version" : 1,
          "follower_aliases_version" : 1,
          "total_read_time_millis" : 315,
          "total_read_remote_exec_time_millis" : 115,
          "successful_read_requests" : 2,
          "failed_read_requests" : 0,
          "operations_read" : 2248,
          "bytes_read" : 37067272,
          "total_write_time_millis" : 1022,
          "successful_write_requests" : 2,
          "failed_write_requests" : 0,
          "operations_written" : 2248,
          "read_exceptions" : [ ],
          "time_since_last_read_millis" : 1149
        }
      ]
    }
  ]
}
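
The top-level value in the sample output above can be reproduced by hand from the per-shard fields. The following is a minimal sketch, assuming the lag is the sum over all shards of `leader_global_checkpoint - follower_global_checkpoint`, which is consistent with the sample numbers. The names `CheckpointLagSketch`, `ShardCheckpoints`, and `totalLag` are hypothetical and are not taken from the PR.

```java
import java.util.List;

public class CheckpointLagSketch {

    // Hypothetical holder for the two per-shard fields used here; not an Elasticsearch class.
    record ShardCheckpoints(int shardId, long leaderGlobalCheckpoint, long followerGlobalCheckpoint) {
        // Number of operations the follower shard still has to catch up on.
        long lag() {
            return leaderGlobalCheckpoint - followerGlobalCheckpoint;
        }
    }

    // Assumed formula: sum of the per-shard global checkpoint differences.
    static long totalLag(List<ShardCheckpoints> shards) {
        return shards.stream().mapToLong(ShardCheckpoints::lag).sum();
    }

    public static void main(String[] args) {
        // Checkpoint values copied from the three shards in the sample output.
        List<ShardCheckpoints> sample = List.of(
            new ShardCheckpoints(0, 3290, 3290),
            new ShardCheckpoints(1, 3406, 1772),
            new ShardCheckpoints(2, 3301, 3301)
        );
        System.out.println(totalLag(sample)); // prints 1634
    }
}
```

Only shard 1 is behind (3406 - 1772 = 1634), so it accounts for the entire index-level figure. No refresh is needed because the checkpoints are already part of the shard stats.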

github-actions · 2024-01-05T23:08:37Z

Documentation preview:

✨ Changed pages

elasticsearchmachine · 2024-01-08T05:43:31Z

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner

Looks good, I left a couple of nits/suggestions

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ccr/action/FollowStatsAction.java

DaveCTurner · 2024-01-08T08:20:25Z

docs/reference/ccr/apis/follow/get-follow-stats.asciidoc

@@ -74,6 +74,8 @@ task. In this situation, the following task must be resumed manually with the

 `index`::
 (string) The name of the follower index.
+`follower_to_leader_lagging_ops_count`::


Naming nit: how about total_global_checkpoint_lag? The follower_to_leader bit seems redundant since this is the CCR stats API, and I think it would be helpful to include global_checkpoint in the name in case we decide to add some other metrics in future.

Also you're missing a blank line above this item.

"total_global_checkpoint_lag" sounds good, thx

idegtiarenko · 2024-01-08T08:25:14Z

docs/reference/ccr/apis/follow/get-follow-stats.asciidoc

@@ -219,6 +221,7 @@ The API returns the following results:
  "indices" : [
    {
      "index" : "follower_index",
+      "follower_to_leader_lagging_ops_count" : 256,


Suggested change

-      "follower_to_leader_lagging_ops_count" : 256,
+      "lag_ops_count" : 256,

Will use "total_global_checkpoint_lag", as per the suggestion above.

idegtiarenko · 2024-01-08T08:27:36Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ccr/action/FollowStatsAction.java

-                            (builder, params) -> builder.startObject().field("index", indexEntry.getKey()).startArray("shards")
+                            (builder, params) -> builder.startObject()
+                                .field("index", indexEntry.getKey())
+                                .field("follower_to_leader_lagging_ops_count", calcFollowerToLeaderLaggingOps(indexEntry.getValue()))


I am surprised that this is per index, not per shard. Any reason why it deviates from the rest of the stats?

I think the reason is that it is meant to be a cumulative stat across all shards of an index. The main motivation is stated in the parent issue:

These should be exposed as simple user-friendly metrics without going through extra mental arithmetic.
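
To illustrate the per-index accumulation discussed in this thread, here is a minimal sketch that groups a flat list of shard stats by follower index and sums the checkpoint difference for each index. The names `PerIndexLagSketch`, `ShardStat`, and `lagPerIndex` are hypothetical, and the second index in the usage example is made up for illustration. The actual change computes the value in `FollowStatsAction` through `calcFollowerToLeaderLaggingOps`, whose body is not shown in this conversation.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class PerIndexLagSketch {

    // Hypothetical flat view of one shard's stats; not an Elasticsearch class.
    record ShardStat(String followerIndex, int shardId, long leaderGlobalCheckpoint, long followerGlobalCheckpoint) {}

    // Groups shard stats by follower index and sums (leader - follower) global checkpoints.
    // A TreeMap keeps the indices in a stable, sorted order.
    static Map<String, Long> lagPerIndex(List<ShardStat> stats) {
        return stats.stream()
            .collect(Collectors.groupingBy(
                ShardStat::followerIndex,
                TreeMap::new,
                Collectors.summingLong(s -> s.leaderGlobalCheckpoint() - s.followerGlobalCheckpoint())
            ));
    }

    public static void main(String[] args) {
        List<ShardStat> stats = List.of(
            // Values from the sample output in the PR description.
            new ShardStat("my-follower-index-000001", 0, 3290, 3290),
            new ShardStat("my-follower-index-000001", 1, 3406, 1772),
            new ShardStat("my-follower-index-000001", 2, 3301, 3301),
            // Made-up second index, for illustration only.
            new ShardStat("other-follower", 0, 100, 90),
            new ShardStat("other-follower", 1, 50, 50)
        );
        System.out.println(lagPerIndex(stats));
    }
}
```

With one number per index, a reader does not have to subtract checkpoints shard by shard; the per-shard fields remain available in the `shards` array for anyone who wants the breakdown.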

kingherc

Also looks good to me, and nice that it is as simple as this! Will approve after handling reviewer comments.

kingherc

LGTM. One suggestion for the documentation.

docs/reference/ccr/apis/follow/get-follow-stats.asciidoc

Co-authored-by: Iraklis Psaroudakis <[email protected]>

Calc follower vs leader indexing lag based on shard global checkpoints

618047c

elasticsearchmachine added the v8.13.0 label Jan 5, 2024

volodk85 added 3 commits January 5, 2024 15:46

code simplify

dceb9da

Merge branch 'main' into ccr_indexing_lag

65a8661

Merge branch 'main' into ccr_indexing_lag

c37fc65

volodk85 added the :Distributed Indexing/CCR and >non-issue labels Jan 8, 2024

volodk85 marked this pull request as ready for review January 8, 2024 05:43

volodk85 requested a review from kingherc January 8, 2024 05:43

elasticsearchmachine added the Team:Distributed (Obsolete) label Jan 8, 2024

DaveCTurner reviewed Jan 8, 2024

idegtiarenko reviewed Jan 8, 2024

kingherc reviewed Jan 9, 2024

volodk85 added 2 commits January 9, 2024 09:37

follow PR comments

74e8da4

spotless

33ae847

volodk85 requested a review from kingherc January 9, 2024 19:47

kingherc approved these changes Jan 10, 2024

docs/reference/ccr/apis/follow/get-follow-stats.asciidoc (outdated)

Update docs/reference/ccr/apis/follow/get-follow-stats.asciidoc

cbba8e2

Co-authored-by: Iraklis Psaroudakis <[email protected]>

volodk85 merged commit f6f86d1 into elastic:main Jan 10, 2024
15 checks passed

volodk85 deleted the ccr_indexing_lag branch January 10, 2024 19:03
