-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pushing back on index stats requests can cause ILM rollover-ready checks to pile up #85333
Comments
Pinging @elastic/es-data-management (Team:Data Management) |
Making a note here of #82830, too -- IIUC this should have sped up some of the related stats requests significantly. |
Okay, a few (non-exhaustive!) ideas for fixing this for us to discuss:
This is the fast-and-easy fix, though we remove the pushback and we'll have to figure out if/how to re-add it in the future. This might be something we do for 8.2, but I don't think we should consider it a permanent fix.
This would allow the ILM-made requests to pass this flag through and not run into the limit. The downside for this is that if we still increment the number of concurrent requests, we could end up causing other stats requests to fail. We don't necessarily have to increment the request counter though, so it could still be an option.
We could handle these in the
This would let us use the timeout to say, wait 30 seconds on the semaphore for a permit to be available. It essentially limits us to 100 concurrently, but they will all queue up behind one another, and we could end up in the state where the last rollover check waits longer than the timeout and we end up in an error anyway.
Since we're triggering most of these from a cluster state changed event, we could grab the stats once at the beginning of the new state, and pass it in to all of these events (so that they use the same stats). This would entail probably not calling the rollover stuff itself, but factoring that logic out so that ILM could use it directly (passing in the stats info as necessary).
For this, we'd make Well, those are a few options to kick off the discussion, I don't think I have a favorite yet, and there are undoubtedly more ways to go about fixing this. Going to /cc @gmarouli also since she worked on the initial PR for limiting this also, in case she has some ideas. |
We should probably change the label (and maybe changelog entry) for #83832 -- for example, if we disable this by default with 8.2.0, then we wouldn't want to have the changelog go out as if the feature was included as originally written. |
I reverted #83832 from the 8.2 branch via #85504, and I've updated the version tag here to v8.3.0 rather than v8.2.0. For the moment I think this stays as an 8.3.0 blocker until we further resolve things. |
@DaveCTurner and I discussed this issue again today, and I'd like to add additional possible option onto #85333 (comment):
|
Also:
|
Adding one more to the mix:
|
#83832 introduced a rate limit of 100 per coordinating node for certain stats requests. This causes issues once ILM starts executing more than 100 of them at a time which it can easily do during a rollover-ready check.
For one, if more than 100 indices need to be checked for rollover readiness during one periodic execution of ILM those in excess of the 100 will be marked as in ILM error.
More importantly though, this effectively limits us to checking only 100 indices per ILM trigger period (10 minutes by default) which will cause issues in larger deployments.
I think the easiest solution here might be to disable this new functionality by default to not introduce new functionality that might cause trouble elsewhere into 8.2 given how close the 8.2 FF is.
The text was updated successfully, but these errors were encountered: