[Transform] Elasticsearch upgrades make transforms fail easily #107251
Comments
Pinging @elastic/ml-core (Team:ML)
1719 instances of
Transient issues, such as temporary search or indexing problems, are retried; the transform fails only if a configurable retry count is exceeded. The workaround is to increase the retry count, fix cluster stability, or run the transform as unattended. These cases can be excluded from the scope of this initial investigation.

To help focus the investigation, I would suggest we look at the behaviour of transforms while nodes leave and join the cluster. I believe it is likely we have code paths here that can lead to errors and race conditions depending on the order and timing of these events. Transforms may not react quickly enough when a node is shutting down, and may be too eager to start when a node is joining. Nodes may leave or join due to an upgrade, a restart, or a catastrophic node failure, and it could be any node: the one running the task, the one hosting the config index, etc.

Because it is easier to test, I believe it's worth initially validating whether transforms behave well during node movement, rather than during an upgrade. (Also, from experience, upgrade errors tend to manifest themselves as cluster state failures, and we don't see those at the moment.) Timeouts for graceful node shutdowns are longer for Serverless than for non-Serverless, so I'd prioritise Serverless initially, as we've seen more alerts there (however, both are applicable, so pick whichever is easiest for bulk-testing multi-node movement).
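As a concrete illustration of the workaround mentioned above (raise the retry count or run unattended), here is a minimal sketch using the transform `_update` API. The transform id `my-transform` is a placeholder, and the `num_failure_retries` and `unattended` settings (and the `xpack.transform.num_transform_failure_retries` cluster setting) are assumed to be available in the version in use:

```
# Raise the per-transform retry count so transient errors during node
# movement are less likely to exhaust the retries (a cluster-wide default
# can also be set via xpack.transform.num_transform_failure_retries).
POST _transform/my-transform/_update
{
  "settings": {
    "num_failure_retries": 50
  }
}

# Alternatively, run the transform as unattended so it keeps retrying
# instead of moving to the failed state.
POST _transform/my-transform/_update
{
  "settings": {
    "unattended": true
  }
}
```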
This makes sense, and I'll try to keep the task list ordered by priority.
Will look into this as well: #100891. It's likely we don't have to worry about some of these inconsistencies during a rollout if we can handle the rollout.
Related to |
Related to |
A user also found these after upgrade from
@nerophon, that seems to be an issue with the index. From searching around, it seems that it is corrupted. Do you know if the index is a Transform internal index, the source index that the Transform is searching, or the destination index that the Transform is bulk writing to? I'm not sure if there's anything the Transform can automatically do to recover in this scenario. That seems to require external intervention.
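As a hedged pointer for narrowing this down (the transform id is a placeholder), the transform config and stats APIs show which indices are involved and why the transform failed:

```
# The config shows the source and destination indices the transform uses;
# the Transform internal indices are the hidden .transform-internal-* and
# .transform-notifications-* indices.
GET _transform/my-transform

# When a transform is in the failed state, the stats API reports the reason.
GET _transform/my-transform/_stats
```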
A few new ones
Doesn't seem to reoccur. There are still a lot of WARNs due to node disconnects, missing shards, etc., that happen while nodes join and leave the cluster. We could potentially listen for shutdown events and handle them accordingly, but there don't seem to be any transforms moving into the failed state for these reasons.
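For background, and only as a sketch of the signal such shutdown handling could key off (the node id and reason below are placeholders), the node shutdown API is how orchestrators announce that a node is about to leave:

```
# Register an upcoming restart for a node; this is recorded in the cluster
# state, which transforms could observe before the node actually leaves.
PUT _nodes/my-node-id/shutdown
{
  "type": "restart",
  "reason": "rolling upgrade"
}

# List the shutdown records currently registered in the cluster.
GET _nodes/shutdown
```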
We haven't seen unrecoverable failures in the last month - I think it is safe to mark this closed, and we can prioritize new issues outside of this meta-issue.
Elasticsearch Version
main
Installed Plugins
No response
Java Version
bundled
OS Version
serverless, Cloud
Problem Description
Users are noticing transforms failing when there is an Elasticsearch version upgrade.
This came up in serverless and on Cloud. I'm not sure if this also affects stateful ES.
Each such upgrade can make a transform fail. Once the transform fails, the user has to manually stop and delete it and create a new transform.
The primary purpose of this GH issue is to reduce the volume of transform alerts and user complaints.
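For reference, a minimal sketch of the manual recovery described above, with `my-transform` as a placeholder id; the `_reset` API is assumed to be available as an alternative to deleting and recreating the transform:

```
# Force-stop the failed transform (force is required while it is in the failed state).
POST _transform/my-transform/_stop?force=true

# Option 1: delete it, then recreate it by re-issuing the original PUT _transform request.
DELETE _transform/my-transform

# Option 2 (instead of deleting): reset its state in place and start it again.
POST _transform/my-transform/_reset
POST _transform/my-transform/_start
```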
Some questions/ideas that need to be addressed:
- The failing transforms had `unattended` set to `false`. Does the problem also occur when `unattended` is `true`? If so, this is a bug, as we expect `unattended` transforms to never fail.
- When `unattended` is `false`, what can we do to make the transform more robust during these upgrades? Maybe we can make all the transforms slightly more "unattended", i.e. less prone to intermittent issues. (See the sketch after this list.)
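To test the `unattended` question above, a minimal continuous transform could look like the following sketch; the transform id, index names, and field names are placeholders, not taken from the issue:

```
PUT _transform/upgrade-test-transform
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "sync": { "time": { "field": "@timestamp" } },
  "frequency": "1m",
  "pivot": {
    "group_by": {
      "user": { "terms": { "field": "user.id" } }
    },
    "aggregations": {
      "event_count": { "value_count": { "field": "event.id" } }
    }
  },
  "settings": { "unattended": true }
}

# Start it, then run the upgrade / node-movement scenario and check whether
# it ever reaches the failed state.
POST _transform/upgrade-test-transform/_start
```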
Steps to Reproduce
It happens during Cloud upgrades.
Logs (if relevant)
No response
Tasks