Pushing back on index stats requests can cause ILM rollover-ready checks to pile up #85333

original-brownbear · 2022-03-24T15:43:23Z

#83832 introduced a limit of 100 concurrent requests per coordinating node for certain stats requests. This causes issues once ILM starts executing more than 100 of them at a time, which it can easily do during a rollover-ready check.
For one, if more than 100 indices need to be checked for rollover readiness during one periodic execution of ILM, those in excess of the 100 will be marked as being in ILM error.
More importantly though, this effectively limits us to checking only 100 indices per ILM trigger period (10 minutes by default), which will cause issues in larger deployments.
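To make the failure mode concrete, here is a minimal sketch of a non-blocking concurrency limiter. The class and function names are hypothetical and are not the Elasticsearch implementation. It assumes every rollover-ready check issues one stats request and that all of them are in flight at the same time.

```python
import threading


class StatsRequestLimiter:
    """Hypothetical stand-in for the per-node limiter (not the real class)."""

    def __init__(self, max_concurrent=100):
        self._permits = threading.Semaphore(max_concurrent)

    def try_acquire(self):
        # Non-blocking: returns False immediately when no permit is free.
        return self._permits.acquire(blocking=False)

    def release(self):
        self._permits.release()


def run_rollover_checks(index_names, limiter):
    """Start one stats request per index, all overlapping in time."""
    admitted, rejected = [], []
    for name in index_names:
        if limiter.try_acquire():
            admitted.append(name)
        else:
            rejected.append(name)  # this check would surface as an ILM error
    # Only after every request has been started do the admitted ones finish.
    for _ in admitted:
        limiter.release()
    return admitted, rejected


indices = [f"logs-{i:06d}" for i in range(250)]
admitted, rejected = run_rollover_checks(indices, StatsRequestLimiter(100))
print(len(admitted), len(rejected))  # 100 150
```

With 250 indices and a limit of 100, the 150 requests that arrive while all permits are held are rejected outright.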

I think the easiest solution here might be to disable this new functionality by default, so that we don't ship something in 8.2 that might cause trouble elsewhere, given how close the 8.2 feature freeze (FF) is.

elasticmachine · 2022-03-24T15:43:26Z

Pinging @elastic/es-data-management (Team:Data Management)

joegallo · 2022-03-24T15:46:30Z

Related to #51992 (the issue that #83832 closed), also related to #77466

joegallo · 2022-03-24T16:57:42Z

Making a note here of #82830, too -- IIUC this should have sped up some of the related stats requests significantly.

dakrone · 2022-03-24T20:12:14Z

Okay, a few (non-exhaustive!) ideas for fixing this for us to discuss:

Disable the limiting for this API entirely.

This is the fast-and-easy fix, though we remove the pushback and we'll have to figure out if/how to re-add it in the future. This might be something we do for 8.2, but I don't think we should consider it a permanent fix.

Add a special flag that makes requests not subject to the 100-max limit.

This would allow the ILM-made requests to pass this flag through and not run into the limit. The downside for this is that if we still increment the number of concurrent requests, we could end up causing other stats requests to fail. We don't necessarily have to increment the request counter though, so it could still be an option.
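The trade-off between the two variants of this option can be sketched as follows. This is an illustrative, single-threaded model with hypothetical names (`Limiter`, `count_exempt`), not an existing API: exempt requests are always admitted, and the only question is whether they still consume slots.

```python
class Limiter:
    """Hypothetical limiter with an exemption flag (illustration only)."""

    def __init__(self, max_concurrent, count_exempt):
        self.max_concurrent = max_concurrent
        self.count_exempt = count_exempt  # do exempt requests use up slots?
        self.in_flight = 0

    def try_admit(self, exempt=False):
        if exempt:
            if self.count_exempt:
                self.in_flight += 1
            return True  # exempt requests are never rejected
        if self.in_flight >= self.max_concurrent:
            return False
        self.in_flight += 1
        return True


def ordinary_request_admitted(count_exempt):
    limiter = Limiter(max_concurrent=100, count_exempt=count_exempt)
    for _ in range(250):  # 250 ILM-originated requests, all flagged exempt
        limiter.try_admit(exempt=True)
    # A regular stats request arrives while those are still in flight.
    return limiter.try_admit()


print(ordinary_request_admitted(count_exempt=True))   # False
print(ordinary_request_admitted(count_exempt=False))  # True
```

If exempt requests are counted, 250 of them push the counter past the limit and the next ordinary request is rejected; if they are not counted, the ordinary request still gets through.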

Handle rejections in the ILM action itself.

We could handle these in the onFailure listener for the rollover check, and then do a retry (maybe with a timeout) inside the WaitForRolloverReadyStep class itself if it got the "too many concurrent stats requests" error. It won't really help too much though, because the first 100 will get in, and then all the others will be sitting there doing retries and we'll still end up not being able to get the request through.

Add the ability to "wait-for-an-empty-slot" for the stats request.

This would let us use the timeout to say, wait 30 seconds on the semaphore for a permit to be available. It essentially limits us to 100 concurrently, but they will all queue up behind one another, and we could end up in the state where the last rollover check waits longer than the timeout and we end up in an error anyway.
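A rough sketch of why the last checks in the queue can still time out. The model is deliberately simplified and its assumptions are mine: all checks start together, every stats request takes exactly the same time, and permits change hands in waves of `max_concurrent`. The function name is hypothetical.

```python
import threading


def simulate_queued_checks(num_checks, max_concurrent, request_seconds, timeout_seconds):
    """Count checks that get a permit in time versus those that time out."""
    served = timed_out = 0
    for position in range(num_checks):
        wave = position // max_concurrent       # how many waves run before this one
        wait_seconds = wave * request_seconds   # time spent waiting for a permit
        if wait_seconds <= timeout_seconds:
            served += 1
        else:
            timed_out += 1
    return served, timed_out


# The blocking primitive itself: acquire() with a timeout returns False
# when no permit becomes free within that time.
permits = threading.Semaphore(1)
permits.acquire()
got_permit = permits.acquire(timeout=0.05)
permits.release()

served, timed_out = simulate_queued_checks(
    num_checks=5000, max_concurrent=100, request_seconds=1.0, timeout_seconds=30.0
)
print(got_permit, served, timed_out)  # False 3100 1900
```

Under these assumptions, 5000 checks with one-second requests and a 30-second timeout leave 1900 checks waiting longer than the timeout, so they would still end up in an error.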

Implement check-rollover-ready within ILM differently.

Since we're triggering most of these from a cluster state changed event, we could grab the stats once at the beginning of the new state, and pass it in to all of these events (so that they use the same stats). This would entail probably not calling the rollover stuff itself, but factoring that logic out so that ILM could use it directly (passing in the stats info as necessary).
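The idea can be sketched as one stats fetch followed by purely local condition checks. Everything below is hypothetical (the `StatsSource` class, the field names, and the two conditions shown), and it assumes that an index is ready as soon as any configured condition is met.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexStats:
    doc_count: int
    primary_size_bytes: int


@dataclass(frozen=True)
class RolloverConditions:
    max_docs: Optional[int] = None
    max_primary_size_bytes: Optional[int] = None


def is_rollover_ready(stats, conditions):
    """True if any configured condition is met (conditions are OR-ed here)."""
    if conditions.max_docs is not None and stats.doc_count >= conditions.max_docs:
        return True
    if (conditions.max_primary_size_bytes is not None
            and stats.primary_size_bytes >= conditions.max_primary_size_bytes):
        return True
    return False


class StatsSource:
    """Hypothetical stats provider that counts how many requests it serves."""

    def __init__(self, data):
        self._data = data
        self.requests = 0

    def fetch(self, index_names):
        self.requests += 1
        return {name: self._data[name] for name in index_names}


def ready_indices(source, policies):
    # One stats request for all indices, then evaluate each policy locally.
    snapshot = source.fetch(list(policies))
    return [name for name, cond in policies.items()
            if is_rollover_ready(snapshot[name], cond)]


GB = 1024 ** 3
source = StatsSource({
    "logs-a": IndexStats(doc_count=1200, primary_size_bytes=1 * GB),
    "logs-b": IndexStats(doc_count=10, primary_size_bytes=5 * GB),
    "logs-c": IndexStats(doc_count=10, primary_size_bytes=60 * GB),
})
policies = {
    "logs-a": RolloverConditions(max_docs=1000),
    "logs-b": RolloverConditions(max_primary_size_bytes=50 * GB),
    "logs-c": RolloverConditions(max_primary_size_bytes=50 * GB),
}
ready = ready_indices(source, policies)
print(ready, source.requests)  # ['logs-a', 'logs-c'] 1
```

The number of stats requests stays at one regardless of how many indices are checked, which is the property that keeps ILM under any per-node limit.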

Make TransportRolloverAction divorced from synchronous stats collection.

For this, we'd make TransportRolloverAction use stats for all indices gathered every N seconds, similar to what we do for gathering disk usage. It would then scale independently from ILM to as many indices as we want without having to get the stats on-demand each time. This would add complexity and change the contract for rollover that we currently have though, so I don't think I am in favor of this.

Well, those are a few options to kick off the discussion. I don't think I have a favorite yet, and there are undoubtedly more ways to go about fixing this. Going to /cc @gmarouli as well, since she worked on the initial PR for limiting this, in case she has some ideas.

joegallo · 2022-03-28T20:24:36Z

We should probably change the label (and maybe changelog entry) for #83832 -- for example, if we disable this by default with 8.2.0, then we wouldn't want to have the changelog go out as if the feature was included as originally written.

joegallo · 2022-03-30T18:07:05Z

~~I tweaked the label back to v8.2.0 because this remains an 8.2.0 blocker (but I'll remove the blocker label and update the version label after #85504 has been merged).~~

I reverted #83832 from the 8.2 branch via #85504, and I've updated the version tag here to v8.3.0 rather than v8.2.0. For the moment I think this stays as an 8.3.0 blocker until we further resolve things.

joegallo · 2022-05-23T16:59:13Z

I reverted #83832 from master (before the 8.3 branch was created) via #87054, so I'm removing the 'blocker' label from this issue.

joegallo · 2022-05-26T14:28:59Z

@DaveCTurner and I discussed this issue again today, and I'd like to add one more possible option to the list in #85333 (comment):

Ignore the limit for stats requests that are scoped to a single index. At least in the many shards benchmark cases, it doesn't appear that running tons of these simultaneously ends up being a problem (note: double check this, though!), so we could simply allow them through and not have them count towards the limit.

DaveCTurner · 2022-05-26T14:34:31Z

Also:

Avoid making thousands of individual stats requests while executing check-rollover-ready steps. We could for instance introduce a bulk dry-run rollover API which makes a single stats call for all the target indices at once.

henningandersen · 2022-05-30T08:01:24Z

Adding one more to the mix:

Move the determination of when to roll over to the primary shard(s). This introduces a slight imprecision when rolling over on total doc count or total index size (I imagine simply dividing the limit by the number of shards to get the per-shard limit), but it would be very efficient and could serve as a foundation for pushing back on indexing when a rollover does not happen in a timely manner.
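The imprecision mentioned above can be illustrated with a small worked example. It assumes, as one possible reading of the proposal, that a rollover is triggered as soon as any single primary shard reaches its share of the index-wide limit; the function names are hypothetical.

```python
import math


def per_shard_doc_limit(max_docs, primary_shards):
    """Index-wide limit divided evenly across the primary shards."""
    return math.ceil(max_docs / primary_shards)


def should_roll_over(shard_doc_counts, max_docs):
    # Assumption: any one primary shard hitting its share triggers the rollover.
    limit = per_shard_doc_limit(max_docs, len(shard_doc_counts))
    return any(count >= limit for count in shard_doc_counts)


even = [250_000, 250_000, 250_000, 250_000]
skewed = [250_000, 100_000, 100_000, 100_000]

print(per_shard_doc_limit(1_000_000, 4))               # 250000
print(should_roll_over(even, 1_000_000), sum(even))     # True 1000000
print(should_roll_over(skewed, 1_000_000), sum(skewed)) # True 550000
```

With evenly distributed documents the rollover happens at the intended total, but with skewed routing it can fire well before the index-wide limit is reached (here at 550,000 documents instead of 1,000,000).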

original-brownbear added the >bug, blocker, :Data Management/Stats, and :Data Management/ILM+SLM labels Mar 24, 2022

elasticmachine added the Team:Data Management label Mar 24, 2022

original-brownbear added the v8.2.0 label Mar 24, 2022

salvatore-campagna added v8.3.0 and removed v8.2.0 labels Mar 30, 2022

joegallo mentioned this issue Mar 30, 2022

Revert "Push back excessive requests for stats" (#83832) from 8.2 #85504

Merged

joegallo added v8.2.0 and removed v8.3.0 labels Mar 30, 2022

joegallo added v8.3.0 and removed v8.2.0 labels Mar 31, 2022

dakrone added the team-discuss label May 12, 2022

joegallo mentioned this issue May 23, 2022

Remove "Push back excessive requests for stats (#83832)" #87054

Merged

joegallo removed the blocker label May 23, 2022

joegallo mentioned this issue May 23, 2022

Push back on excessive requests for stats #51992

Open

joegallo removed v8.3.0 team-discuss labels May 23, 2022
