Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Missing enrich policies on skip_unavailable=true clusters no longer fail the query #116972

Closed
wants to merge 19 commits into from

Conversation

quux00
Copy link
Contributor

@quux00 quux00 commented Nov 18, 2024

Missing enrich policies (or failures while looking up the policies on remote clusters) are no longer
fatal errors for skip_unavailable=true clusters. Those clusters will simply not be included
in the rest of the query.

Partially addresses #114531

@quux00 quux00 added auto-backport Automatically create backport pull requests when merged v9.0.0 v8.17.0 labels Nov 18, 2024
@quux00 quux00 changed the title ESQL: Enrich policy failures on skip_unavailable=true do not fail the query ESQL: Enrich policy failures on skip_unavailable=true clusters do not fail the query Nov 18, 2024
@quux00 quux00 force-pushed the esql-ccs/skip_un-enrich-policy-t3 branch 4 times, most recently from 6670870 to 2bb0629 Compare November 19, 2024 20:25
@quux00 quux00 force-pushed the esql-ccs/skip_un-enrich-policy-t3 branch from 2bb0629 to ebc931e Compare November 19, 2024 21:51
@quux00 quux00 added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL >enhancement labels Nov 19, 2024
@quux00 quux00 changed the title ESQL: Enrich policy failures on skip_unavailable=true clusters do not fail the query ESQL: Missing enrich policies on skip_unavailable=true clusters do not fail the query Nov 19, 2024
@quux00 quux00 marked this pull request as ready for review November 19, 2024 21:56
@quux00 quux00 requested a review from dnhatn November 19, 2024 21:56
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @quux00, I've created a changelog YAML for you.

Map<String, EsField> mappings = new HashMap<>();
Map<String, String> concreteIndices = new HashMap<>();
ResolvedEnrichPolicy last = null;
// loop over clusters with a ResolvedEnrichPolicy - ensure no errors within the policy
Copy link
Contributor Author

@quux00 quux00 Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that mismatches across policies (being checked in the section below here) are still fatal errors for skip_unavailable=true clusters. I started down the road of having these be skippable errors, but that looks rather tricky to pull off. At a minimum, you'd have to partition the policies by "skip_unavailable" and build a canonical list of fields/types/etc. from skip_un=false clusters and then compare the skip_un=true clusters and then if any mismatches are found, those aren't fatal, but you pull that cluster out of the list to be resolved for field-caps. Not impossible but this section would require a significant rewrite, so I decided to only handle missing enrich policies (and policies that have errors on the remote cluster during resolution), but still fail them based on mismatches between policies.

final String reason;
if (failures.isEmpty()) {
List<String> missingClusters = targetClusters.stream().filter(c -> policies.containsKey(c) == false).sorted().toList();
reason = missingPolicyError(policyName, targetClusters, missingClusters);
List<String> missingClusters = targetClusters.keySet()
Copy link
Contributor Author

@quux00 quux00 Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this block (under (if failures.isEmpty)) will ever execute now, but I was apprehensive to remove it. I couldn't find a way to enter this block based on my testing.

@quux00 quux00 changed the title ESQL: Missing enrich policies on skip_unavailable=true clusters do not fail the query ESQL: Missing enrich policies on skip_unavailable=true clusters no longer fail the query Nov 20, 2024
@dnhatn
Copy link
Member

dnhatn commented Nov 22, 2024

@quux00 Can you explain why we choose to treat a missing enrich policy as unavailable? I think we should either fail the query in this case or return a partial result if partial_results is specified. This approach would allow users to fix the query or clusters accordingly.

@quux00
Copy link
Contributor Author

quux00 commented Nov 22, 2024

Can you explain why we choose to treat a missing enrich policy as unavailable?

The definition of skip_unavailable that we are using means that if an error occurs on a remote cluster with skip_unavailable=true, then that should not be a fatal error that fails the query. Instead, we return partial data from other clusters. Since the missing enrich policy will only affect that one remote cluster, it can be safely left out of the query and marked as SKIPPED with a failure notice in the ccs_metadata of the response.

So a missing enrich policy on a remote cluster is a fatal error only if the cluster is skip_unavailable=false (or if it is the local cluster).

return a partial result if partial_results is specified

I'm not sure what you are referring to. At present ES|QL does not support partial results, except for the skip_unavailable handling we've been adding. Marking a cluster as skip_unavailable=true is effectively saying you want partial data from the query, just at a cluster level, not a shard/node level.

@dnhatn
Copy link
Member

dnhatn commented Nov 22, 2024

I think a missing enrich policy error is different from an unavailable one. Could we narrow the scope of this PR to focus only on unavailable nodes or connections?

@smalyshev
Copy link
Contributor

I think generally our assumption has been that virtually any error on a cluster that is marked as skip_unavailable is going to lead us to ignore this cluster, but not fail the whole request - provided the request can be performed at all ignoring that cluster (e.g. if the request only contained data from that cluster, it'll still fail). That includes missing indexes and missing policies too. If that assumption is not correct we need to re-sync and define the behavior we want, but that was also the underlying assumption for #112886

@dnhatn
Copy link
Member

dnhatn commented Nov 22, 2024

See #33915 and #27182 (comment)

@smalyshev
Copy link
Contributor

Yeah it looks like there was the same kind of discussion on _search side, and while the "unavailable" thing is indeed quite misleading, I think the end resolution has been that this option works as "ignore the errors that is possible to ignore" rather than "only ignore errors that have to do with network connections". If I understand this correctly, then it makes sense to follow the same road with ES|QL? It may be a bit confusing that "unavailable" works this way, but IMHO it'd be much more confusing if it worked differently for different types of search. In any case, if there are concerns about it then we probably should raise it on the PM level to ensure we don't misunderstand what is supposed to happen?

@quux00
Copy link
Contributor Author

quux00 commented Nov 22, 2024

Could we narrow the scope of this PR to focus only on unavailable nodes or connections?

That is already done: #115266
as well as handling missing indices errors based on the skip_unavailable setting: #116348

The two in progress PRs around skip_unavailable are to handling missing enrich policies (this PR) and arbitrary errors at execution time (#116365). This will make ES|QL consistent with current skip_unavailable handling in _search CCS.

@dnhatn
Copy link
Member

dnhatn commented Nov 22, 2024

I think we should avoid the mistake we made with the _search API. This flag is inconsistent in _search, depending on whether ccs_minimized_round_trip is enabled. We can't easily fix this because of BWC.

@smalyshev
Copy link
Contributor

But for ES|QL MRT is always on, so there's no room for such inconsistency?

@quux00
Copy link
Contributor Author

quux00 commented Jan 21, 2025

Closing this as we will not be extending skip_unavailable=true to handle this use case.

@quux00 quux00 closed this Jan 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants