Calc follower vs leader indexing lag based on shard global checkpoints #104015

Merged on Jan 10, 2024 (7 commits)

@volodk85 volodk85 commented Jan 5, 2024

Calculate the follower vs leader indexing lag from the shards' global checkpoints. Calculating it from doc stats would require a shard refresh first, which is undesirable because it can significantly slow down the overall process.

Closes #89991

Sample API output:

curl -X GET "localhost:19200/my-follower-index-000001/_ccr/stats?pretty"         
{
  "indices" : [
    {
      "index" : "my-follower-index-000001",
      "follower_to_leader_lagging_ops_count" : 1634,
      "shards" : [
        {
          "remote_cluster" : "cluster0",
          "leader_index" : "my-index-000001",
          "follower_index" : "my-follower-index-000001",
          "shard_id" : 0,
          "leader_global_checkpoint" : 3290,
          "leader_max_seq_no" : 3290,
          "follower_global_checkpoint" : 3290,
          "follower_max_seq_no" : 3290,
          "last_requested_seq_no" : 3290,
          "outstanding_read_requests" : 1,
          "outstanding_write_requests" : 0,
          "write_buffer_operation_count" : 0,
          "write_buffer_size_in_bytes" : 0,
          "follower_mapping_version" : 2,
          "follower_settings_version" : 1,
          "follower_aliases_version" : 1,
          "total_read_time_millis" : 349,
          "total_read_remote_exec_time_millis" : 121,
          "successful_read_requests" : 2,
          "failed_read_requests" : 0,
          "operations_read" : 2343,
          "bytes_read" : 38633727,
          "total_write_time_millis" : 979,
          "successful_write_requests" : 2,
          "failed_write_requests" : 4,
          "operations_written" : 2343,
          "read_exceptions" : [ ],
          "time_since_last_read_millis" : 1149
        },
        {
          "remote_cluster" : "cluster0",
          "leader_index" : "my-index-000001",
          "follower_index" : "my-follower-index-000001",
          "shard_id" : 1,
          "leader_global_checkpoint" : 3406,
          "leader_max_seq_no" : 3406,
          "follower_global_checkpoint" : 1772,
          "follower_max_seq_no" : 3406,
          "last_requested_seq_no" : 3406,
          "outstanding_read_requests" : 1,
          "outstanding_write_requests" : 1,
          "write_buffer_operation_count" : 0,
          "write_buffer_size_in_bytes" : 0,
          "follower_mapping_version" : 2,
          "follower_settings_version" : 1,
          "follower_aliases_version" : 1,
          "total_read_time_millis" : 349,
          "total_read_remote_exec_time_millis" : 134,
          "successful_read_requests" : 2,
          "failed_read_requests" : 0,
          "operations_read" : 2417,
          "bytes_read" : 39853913,
          "total_write_time_millis" : 144,
          "successful_write_requests" : 1,
          "failed_write_requests" : 3,
          "operations_written" : 382,
          "read_exceptions" : [ ],
          "time_since_last_read_millis" : 1137
        },
        {
          "remote_cluster" : "cluster0",
          "leader_index" : "my-index-000001",
          "follower_index" : "my-follower-index-000001",
          "shard_id" : 2,
          "leader_global_checkpoint" : 3301,
          "leader_max_seq_no" : 3301,
          "follower_global_checkpoint" : 3301,
          "follower_max_seq_no" : 3301,
          "last_requested_seq_no" : 3301,
          "outstanding_read_requests" : 1,
          "outstanding_write_requests" : 0,
          "write_buffer_operation_count" : 0,
          "write_buffer_size_in_bytes" : 0,
          "follower_mapping_version" : 2,
          "follower_settings_version" : 1,
          "follower_aliases_version" : 1,
          "total_read_time_millis" : 315,
          "total_read_remote_exec_time_millis" : 115,
          "successful_read_requests" : 2,
          "failed_read_requests" : 0,
          "operations_read" : 2248,
          "bytes_read" : 37067272,
          "total_write_time_millis" : 1022,
          "successful_write_requests" : 2,
          "failed_write_requests" : 0,
          "operations_written" : 2248,
          "read_exceptions" : [ ],
          "time_since_last_read_millis" : 1149
        }
      ]
    }
  ]
}
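
The index-level figure in the sample can be reproduced from the per-shard checkpoints: shards 0 and 2 have matching leader and follower global checkpoints, and shard 1 trails by 3406 - 1772 = 1634. Below is a minimal Java sketch of that arithmetic, assuming the lag is the sum over shards of the leader global checkpoint minus the follower global checkpoint. The class, record, and method names are hypothetical, the clamp at zero is an added assumption, and this is not the PR's actual `calcFollowerToLeaderLaggingOps` implementation.

```java
import java.util.List;

public class GlobalCheckpointLag {

    /** Hypothetical stand-in for the per-shard fields shown in the sample output. */
    record ShardCheckpoints(int shardId, long leaderGlobalCheckpoint, long followerGlobalCheckpoint) {}

    /** Sums, over all shards, how far the follower's global checkpoint trails the leader's. */
    static long totalGlobalCheckpointLag(List<ShardCheckpoints> shards) {
        long total = 0;
        for (ShardCheckpoints s : shards) {
            // Assumption: clamp at zero so a stale leader reading never yields a negative lag.
            total += Math.max(0L, s.leaderGlobalCheckpoint() - s.followerGlobalCheckpoint());
        }
        return total;
    }

    public static void main(String[] args) {
        // Checkpoint values copied from the three shards in the sample output.
        List<ShardCheckpoints> shards = List.of(
            new ShardCheckpoints(0, 3290, 3290),
            new ShardCheckpoints(1, 3406, 1772),
            new ShardCheckpoints(2, 3301, 3301)
        );
        System.out.println(totalGlobalCheckpointLag(shards)); // prints 1634
    }
}
```

Because the checkpoints are already part of the shard stats, this needs no refresh and no doc counts.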

@volodk85 volodk85 added :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >non-issue labels Jan 8, 2024
@volodk85 volodk85 marked this pull request as ready for review January 8, 2024 05:43
@volodk85 volodk85 requested a review from kingherc January 8, 2024 05:43
@elasticsearchmachine

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jan 8, 2024

@DaveCTurner DaveCTurner left a comment

Looks good, I left a couple of nits/suggestions

@@ -74,6 +74,8 @@ task. In this situation, the following task must be resumed manually with the

`index`::
(string) The name of the follower index.
`follower_to_leader_lagging_ops_count`::

Naming nit: how about total_global_checkpoint_lag? The follower_to_leader bit seems redundant since this is the CCR stats API, and I think it would be helpful to include global_checkpoint in the name in case we decide to add some other metrics in future.

Also you're missing a blank line above this item.

Contributor Author

"total_global_checkpoint_lag" sounds good, thx

@@ -219,6 +221,7 @@ The API returns the following results:
"indices" : [
{
"index" : "follower_index",
"follower_to_leader_lagging_ops_count" : 256,

Suggested change
- "follower_to_leader_lagging_ops_count" : 256,
+ "lag_ops_count" : 256,

Contributor Author

will use "total_global_checkpoint_lag" as per suggestion above

- (builder, params) -> builder.startObject().field("index", indexEntry.getKey()).startArray("shards")
+ (builder, params) -> builder.startObject()
+     .field("index", indexEntry.getKey())
+     .field("follower_to_leader_lagging_ops_count", calcFollowerToLeaderLaggingOps(indexEntry.getValue()))

I am surprised that this is per index, not per shard. Any reason why it deviates from the rest of the stats?

Contributor Author

I think the reason is that it is meant to be a cumulative stat across all shards of an index. The main motivation is stated in the parent issue:

These should be exposed as simple user-friendly metrics without going through extra mental arithmetic.
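
For comparison, the "mental arithmetic" a reader would otherwise do is a per-shard subtraction followed by a sum. The sketch below illustrates that breakdown with the checkpoint values from the sample output in the PR description. The class and method names are hypothetical, and treating the array index as the shard id is an assumption made for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PerShardLagBreakdown {

    /**
     * Maps shard id to lag, given parallel arrays of leader and follower global checkpoints.
     * Assumption: the array index doubles as the shard id.
     */
    static Map<Integer, Long> lagByShard(long[] leaderCheckpoints, long[] followerCheckpoints) {
        if (leaderCheckpoints.length != followerCheckpoints.length) {
            throw new IllegalArgumentException("checkpoint arrays must have the same length");
        }
        Map<Integer, Long> lag = new LinkedHashMap<>();
        for (int shard = 0; shard < leaderCheckpoints.length; shard++) {
            lag.put(shard, leaderCheckpoints[shard] - followerCheckpoints[shard]);
        }
        return lag;
    }

    public static void main(String[] args) {
        Map<Integer, Long> lag = lagByShard(
            new long[] {3290, 3406, 3301},   // leader_global_checkpoint per shard
            new long[] {3290, 1772, 3301}    // follower_global_checkpoint per shard
        );
        // The index-level stat is the sum of the per-shard values.
        long total = lag.values().stream().mapToLong(Long::longValue).sum();
        System.out.println(lag + " total=" + total);
    }
}
```

Only shard 1 contributes here, so the single index-level number hides which shard is behind; the per-shard checkpoints remain available in the `shards` array for that.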

@kingherc kingherc left a comment

Also looks good to me, and nice that it is as simple as this! Will approve after handling reviewer comments.

@volodk85 volodk85 requested a review from kingherc January 9, 2024 19:47

@kingherc kingherc left a comment

LGTM. One suggestion for the documentation.

docs/reference/ccr/apis/follow/get-follow-stats.asciidoc (outdated, resolved)
@volodk85 volodk85 merged commit f6f86d1 into elastic:main Jan 10, 2024
15 checks passed
@volodk85 volodk85 deleted the ccr_indexing_lag branch January 10, 2024 19:03
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >non-issue Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.13.0
Development

Successfully merging this pull request may close these issues.

CCR status should show lagging / doc count
5 participants