[Transform] Elasticsearch upgrades make transforms fail easily #107251

Closed
4 tasks done
przemekwitek opened this issue Apr 9, 2024 · 12 comments
Labels: >bug, :ml/Transform, Team:ML

Comments

@przemekwitek (Contributor) commented Apr 9, 2024

Elasticsearch Version

main

Installed Plugins

No response

Java Version

bundled

OS Version

serverless, Cloud

Problem Description

Users are noticing transforms failing when there is an Elasticsearch version upgrade.
This came up in serverless and on Cloud; I'm not sure whether it also affects stateful ES.
Each such upgrade can make a transform fail. Once a transform fails, the user has to manually stop it, delete it, and create a new transform (a sketch of that recovery sequence is below).
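
For reference, a minimal sketch of that manual recovery sequence, assuming a hypothetical transform id my-transform and that the original configuration is still available to re-submit:

```
# A failed transform can only be stopped with force=true
POST _transform/my-transform/_stop?force=true

# Delete the failed transform
DELETE _transform/my-transform

# Recreate it from the original configuration, then start it again
PUT _transform/my-transform
{ ... original transform configuration ... }

POST _transform/my-transform/_start
```

On recent versions the `POST _transform/<id>/_reset` API may avoid the delete/recreate step, though I haven't verified that it helps in this scenario.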

The primary purpose of this GH issue is to reduce the volume of transform alerts and user complaints.
Some questions/ideas that need to be addressed:

  • We've seen the problem happening for transforms with unattended set to false. Does the problem also occur when unattended is true? If so, this is a bug, as we expect unattended transforms to never fail.
  • Even when unattended is false, what can we do to make transforms more robust during these upgrades? Maybe we can make all transforms slightly more "unattended", i.e. less prone to intermittent issues.
  • Maybe transforms should treat all error types as recoverable?
  • What is the right retry strategy for a non-unattended transform? (A sketch of the relevant settings follows this list.)
  • Does the problem happen for a version upgrade only or does it also happen for a full cluster restart (but without changing the version)?
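
As a point of reference for the unattended/retry questions above, a minimal sketch of the per-transform settings involved. The transform id, indices, fields, and values are illustrative, not taken from this issue:

```
PUT _transform/my-transform
{
  "source": { "index": "my-source" },
  "dest": { "index": "my-dest" },
  "sync": { "time": { "field": "@timestamp" } },
  "pivot": {
    "group_by": { "user": { "terms": { "field": "user.id" } } },
    "aggregations": { "event_count": { "value_count": { "field": "@timestamp" } } }
  },
  "settings": {
    "unattended": true,
    "num_failure_retries": 20
  }
}
```

As far as I understand, with unattended set to true the transform retries indefinitely instead of moving to the failed state, and num_failure_retries is only consulted when unattended is false (defaulting to the cluster-wide setting otherwise).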

Steps to Reproduce

It happens during Cloud upgrades.

Logs (if relevant)

No response

Tasks

  1. :ml/Transform >bug Team:ML v8.14.0 (prwhelan)
  2. :ml/Transform Team:ML (prwhelan)
  3. :ml/Transform >bug Team:ML v8.15.0 (prwhelan)
  4. :ml/Transform >bug Team:ML v8.15.0 (prwhelan)
@elasticsearchmachine added the Team:ML label Apr 9, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@prwhelan (Member) commented Apr 9, 2024

1719 instances of "Transform has failed" errors on serverless over the last 90 days:

  • 14 of The object cannot be set twice -> [Transforms] Transform has failed with "The object cannot be set twice" #107215
  • 25 of Failed to reload transform configuration for transform <>
  • 846 of [parent] Data too large, data for [indices:data/read/search[phase/query]] would be [4092117844/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4092117096/3.8gb], new bytes reserved: [748/748b], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=394234840/375.9mb, request=0/0b, inflight_requests=2244/2.1kb]
  • 9 of Failed to persist transform statistics for transform
  • 3-5 per rollout that are similar to Bulk index experienced [2] failures and at least 1 irrecoverable [org.elasticsearch.index.IndexNotFoundException: no such index [.metrics-endpoint.metadata_united_default]].
  • 2 of rejected execution of TimedRunnable
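
A side note on the largest bucket above: the "Data too large" messages are parent circuit breaker trips on the data nodes. Not a fix, but when triaging these, the node stats API shows which breaker and which node is under pressure:

```
# Per-node circuit breaker usage; compare the "parent" breaker's estimated size to its limit
GET _nodes/stats/breaker
```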

@przemekwitek changed the title from "[Transform] Elasticsearch upgrades in serverless make transforms fail" to "[Transform] Elasticsearch upgrades make transforms fail easily" Apr 9, 2024
@sophiec20 (Contributor) commented Apr 9, 2024

Transient issues, such as temporary search or indexing problems, are retried. The transform will fail if this configurable retry count is exceeded (a sketch of that setting follows). The workaround is to increase retries, to fix cluster stability, or to run as unattended. These can be excluded from the scope of this initial investigation.
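
For completeness, a hedged sketch of the retry knob referred to above: the failure retry count defaults to a dynamic cluster-wide setting and can be overridden per transform via settings.num_failure_retries (the value below is illustrative, not a recommendation):

```
PUT _cluster/settings
{
  "persistent": {
    "xpack.transform.num_transform_failure_retries": 20
  }
}
```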

To help focus investigations, I would suggest we look at the behaviour of transforms whilst nodes leave and join the cluster. I believe it is likely we have some code paths here which can lead to errors and race conditions depending on the order and time it takes for these things to happen. Transforms may not react quickly enough when a node is shutting down. Transform may be too eager to start when a node is joining.

Nodes may leave/join due to an upgrade, a restart, or a catastrophic node failure. It could be any node: the one running the task, the one hosting the config index, etc. Because it is easier to test, I believe it's worth initially validating whether transforms behave well during node movement rather than during an upgrade. (Also, from experience, upgrade errors tend to manifest themselves as cluster state failures, and we don't see these at the moment.)

Timeouts for graceful node shutdowns are longer for Serverless than for non-Serverless, so I'd prioritise Serverless initially as we've seen more alerts there (however, I think both are applicable, so pick whichever is easiest for bulk-testing multi-node movement).

@prwhelan (Member) commented Apr 9, 2024

> To help focus investigations, I would suggest we look at the behaviour of transforms whilst nodes leave and join the cluster. I believe it is likely we have some code paths here which can lead to errors and race conditions depending on the order and time it takes for these things to happen. Transforms may not react quickly enough when a node is shutting down. Transform may be too eager to start when a node is joining.

This makes sense, and I'll try to keep the task list ordered by priority.

@prwhelan (Member)

Will look into this as well: #100891

It's likely we don't have to worry about some of these inconsistencies during a rollout if we can handle the rollout itself.

@prwhelan (Member)

Related to data too large: #60391

@prwhelan (Member)

[endpoint.metadata_united-default-8.14.0] transform has failed; experienced: [Insufficient memory for search after repeated page size reductions to [0], unable to continue pivot, please simplify job or increase heap size on data nodes.].

Related to Data too large
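
A hedged side note on the "repeated page size reductions" failure above: when the underlying cause is transient, such as memory pressure during a rollout, a failed transform can sometimes be recovered without deleting it, by clearing the failed state and resuming from the last checkpoint (transform id below is illustrative):

```
# Inspect the failure reason and checkpoint progress
GET _transform/my-transform/_stats

# Clear the failed state and resume from the last checkpoint
POST _transform/my-transform/_stop?force=true
POST _transform/my-transform/_start
```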

@prwhelan (Member)

[endpoint.metadata_current-default-8.14.0] transform has failed; experienced: [task encountered irrecoverable failure: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]].

@nerophon

A user also found these after upgrade from 8.11.1 to 8.13.2:

Transform task state is [failed]
task encountered more than 10 failures; latest failure: read past EOF: MemorySegmentIndexInput(path="/app/data/indices/mtc6-NYrQPi_irBUjusHPA/0/index/_5ro.cfs") [slice=_5ro_ES87TSDB_0.dvd] [slice=values]

@prwhelan (Member) commented Apr 26, 2024

@nerophon that seems to be an issue with the index. From searching around, it seems that it is corrupted. Do you know if the index is a Transform internal index, the source index that the Transform is searching, or the destination index that the Transform is bulk writing to? I'm not sure if there's anything the Transform can automatically do to recover in this scenario. That seems to require external intervention.

@prwhelan (Member)

A few new ones

Caused by: java.lang.IllegalArgumentException: field [message] not present as part of path [message]

Doesn't seem to reoccur

There are still a lot of WARN logs due to node disconnects, missing shards, etc., that happen while nodes join/leave the cluster. We could potentially listen for shutdown events and handle them accordingly, but there don't seem to be any transforms moving into the failed state for these reasons.

@prwhelan (Member)

We haven't seen unrecoverable failures in the last month, so I think it is safe to mark this closed; we can prioritize new issues outside of this meta-issue.
