[Transform] Elasticsearch upgrades make transforms fail easily #107251

Closed
4 tasks done
przemekwitek opened this issue Apr 9, 2024 · 12 comments
Labels: >bug, :ml/Transform, Team:ML

Comments

@przemekwitek (Contributor) commented Apr 9, 2024

Elasticsearch Version

main

Installed Plugins

No response

Java Version

bundled

OS Version

serverless, Cloud

Problem Description

Users are noticing transforms failing when there is an Elasticsearch version upgrade.
This came up in serverless and on Cloud; I'm not sure whether it also affects stateful ES.
Each such upgrade can make a transform fail. Once a transform fails, the user has to manually stop it, delete it, and create a new transform (a sketch of that recovery sequence is below).
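
For reference, a minimal sketch of that manual recovery sequence, assuming a hypothetical transform id my-transform and that the original configuration is still available to re-submit:

```
# A failed transform can only be stopped with force=true
POST _transform/my-transform/_stop?force=true

# Delete the failed transform
DELETE _transform/my-transform

# Recreate it from the original configuration, then start it again
PUT _transform/my-transform
{ ... original transform configuration ... }

POST _transform/my-transform/_start
```

On recent versions the `POST _transform/<id>/_reset` API may avoid the delete/recreate step, though I haven't verified that it helps in this scenario.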

The primary purpose of this GH issue is to reduce the volume of transform alerts and user complaints.
Some questions/ideas that need to be addressed:

  • We've seen the problem happening for transforms with unattended set to false. Does the problem also occur when unattended is true? If so, this is a bug, as we expect unattended transforms to never fail.
  • Even when unattended is false, what can we do to make transforms more robust during these upgrades? Maybe we can make all transforms slightly more "unattended", i.e. less prone to intermittent issues.
  • Maybe transforms should treat all error types as recoverable?
  • What is the right retry strategy for a non-unattended transform? (A sketch of the relevant settings follows this list.)
  • Does the problem happen for a version upgrade only or does it also happen for a full cluster restart (but without changing the version)?
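
As a point of reference for the unattended/retry questions above, a minimal sketch of the per-transform settings involved. The transform id, indices, fields, and values are illustrative, not taken from this issue:

```
PUT _transform/my-transform
{
  "source": { "index": "my-source" },
  "dest": { "index": "my-dest" },
  "sync": { "time": { "field": "@timestamp" } },
  "pivot": {
    "group_by": { "user": { "terms": { "field": "user.id" } } },
    "aggregations": { "event_count": { "value_count": { "field": "@timestamp" } } }
  },
  "settings": {
    "unattended": true,
    "num_failure_retries": 20
  }
}
```

As far as I understand, with unattended set to true the transform retries indefinitely instead of moving to the failed state, and num_failure_retries is only consulted when unattended is false (defaulting to the cluster-wide setting otherwise).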

Steps to Reproduce

It happens during Cloud upgrades.

Logs (if relevant)

No response

Tasks

  1. :ml/Transform >bug Team:ML v8.14.0 (prwhelan)
  2. :ml/Transform Team:ML (prwhelan)
  3. :ml/Transform >bug Team:ML v8.15.0 (prwhelan)
  4. :ml/Transform >bug Team:ML v8.15.0 (prwhelan)
@elasticsearchmachine added the Team:ML label Apr 9, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@prwhelan (Member) commented Apr 9, 2024

1719 instances of "Transform has failed" errors on serverless over the last 90 days:

  • 14 of The object cannot be set twice -> [Transforms] Transform has failed with "The object cannot be set twice" #107215
  • 25 of Failed to reload transform configuration for transform <>
  • 846 of [parent] Data too large, data for [indices:data/read/search[phase/query]] would be [4092117844/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4092117096/3.8gb], new bytes reserved: [748/748b], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=394234840/375.9mb, request=0/0b, inflight_requests=2244/2.1kb]
  • 9 of Failed to persist transform statistics for transform
  • 3-5 per rollout that are similar to Bulk index experienced [2] failures and at least 1 irrecoverable [org.elasticsearch.index.IndexNotFoundException: no such index [.metrics-endpoint.metadata_united_default]].
  • 2 of rejected execution of TimedRunnable
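
A side note on the largest bucket above: the "Data too large" messages are parent circuit breaker trips on the data nodes. Not a fix, but when triaging these, the node stats API shows which breaker and which node is under pressure:

```
# Per-node circuit breaker usage; compare the "parent" breaker's estimated size to its limit
GET _nodes/stats/breaker
```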

@przemekwitek changed the title from "[Transform] Elasticsearch upgrades in serverless make transforms fail" to "[Transform] Elasticsearch upgrades make transforms fail easily" Apr 9, 2024
@sophiec20 (Contributor) commented Apr 9, 2024

Transient issues, such as temporary search or indexing problems, are retried. The transform will fail if this configurable retry count is exceeded (a sketch of that setting follows). The workaround is to increase retries, to fix cluster stability, or to run as unattended. These can be excluded from the scope of this initial investigation.
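
For completeness, a hedged sketch of the retry knob referred to above: the failure retry count defaults to a dynamic cluster-wide setting and can be overridden per transform via settings.num_failure_retries (the value below is illustrative, not a recommendation):

```
PUT _cluster/settings
{
  "persistent": {
    "xpack.transform.num_transform_failure_retries": 20
  }
}
```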

To help focus investigations, I would suggest we look at the behaviour of transforms whilst nodes leave and join the cluster. I believe it is likely we have some code paths here which can lead to errors and race conditions depending on the order and time it takes for these things to happen. Transforms may not react quickly enough when a node is shutting down. Transform may be too eager to start when a node is joining.

Nodes may leave/join due to an upgrade, a restart, or a catastrophic node failure. It could be any node: the one running the task, the one hosting the config index, etc. Because it is easier to test, I believe it's worth initially validating whether transforms behave well during node movement rather than during an upgrade. (Also, from experience, upgrade errors tend to manifest themselves as cluster state failures, and we don't see these at the moment.)

Timeouts for graceful node shutdowns are longer for Serverless than for non-Serverless, so I'd prioritise Serverless initially as we've seen more alerts there (however, I think both are applicable, so pick whichever is easiest for bulk-testing multi-node movement).

@prwhelan (Member) commented Apr 9, 2024

> To help focus investigations, I would suggest we look at the behaviour of transforms whilst nodes leave and join the cluster. I believe it is likely we have some code paths here which can lead to errors and race conditions depending on the order and time it takes for these things to happen. Transforms may not react quickly enough when a node is shutting down. Transform may be too eager to start when a node is joining.

This makes sense, and I'll try to keep the task list ordered by priority.

@prwhelan (Member)

Will look into this as well: #100891

It's likely we don't have to worry about some of these inconsistencies during a rollout if we can handle the rollout itself.

@prwhelan (Member)

Related to data too large: #60391

@prwhelan (Member)

[endpoint.metadata_united-default-8.14.0] transform has failed; experienced: [Insufficient memory for search after repeated page size reductions to [0], unable to continue pivot, please simplify job or increase heap size on data nodes.].

Related to Data too large
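
A hedged side note on the "repeated page size reductions" failure above: when the underlying cause is transient, such as memory pressure during a rollout, a failed transform can sometimes be recovered without deleting it, by clearing the failed state and resuming from the last checkpoint (transform id below is illustrative):

```
# Inspect the failure reason and checkpoint progress
GET _transform/my-transform/_stats

# Clear the failed state and resume from the last checkpoint
POST _transform/my-transform/_stop?force=true
POST _transform/my-transform/_start
```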

@prwhelan (Member)

[endpoint.metadata_current-default-8.14.0] transform has failed; experienced: [task encountered irrecoverable failure: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]].

@nerophon

A user also found these after upgrade from 8.11.1 to 8.13.2:

Transform task state is [failed]
task encountered more than 10 failures; latest failure: read past EOF: MemorySegmentIndexInput(path="/app/data/indices/mtc6-NYrQPi_irBUjusHPA/0/index/_5ro.cfs") [slice=_5ro_ES87TSDB_0.dvd] [slice=values]

@prwhelan (Member) commented Apr 26, 2024

@nerophon that seems to be an issue with the index. From searching around, it seems that it is corrupted. Do you know if the index is a Transform internal index, the source index that the Transform is searching, or the destination index that the Transform is bulk writing to? I'm not sure if there's anything the Transform can automatically do to recover in this scenario. That seems to require external intervention.

@prwhelan (Member)

A few new ones

Caused by: java.lang.IllegalArgumentException: field [message] not present as part of path [message]

Doesn't seem to reoccur

There are still a lot of WARN logs due to node disconnects, missing shards, etc., that happen while nodes join/leave the cluster. We could potentially listen for shutdown events and handle them accordingly, but there don't seem to be any transforms moving into the failed state for these reasons.

@prwhelan (Member)

We haven't seen unrecoverable failures in the last month, so I think it is safe to mark this closed; we can prioritize new issues outside of this meta-issue.
