ES|QL CCS uses skip_unavailable setting for handling disconnected remote clusters #115266

quux00 · 2024-10-21T17:13:42Z

As part of ES|QL planning of a cross-cluster search, a field-caps call is done to each cluster and,
if an ENRICH command is present, the enrich policy-resolve API is called on each remote. If a
remote cluster cannot be connected to in these calls, the outcome depends on the
skip_unavailable setting.

For skip_unavailable=false clusters, the error is fatal and the error will immediately be propagated
back to the client with a top level error message with a 500 HTTP status response code.

For skip_unavailable=true clusters, the error is not fatal. The error will be trapped, recorded in the
EsqlExecutionInfo object for the query, marking the cluster as SKIPPED. If the user requested
CCS metadata to be included, the cluster status and connection failure will be present in the
_clusters/details section of the response.

If no clusters can be contacted, if they are all marked as skip_unavailable=true, no error will be
returned. Instead a 200 HTTP status will be returned with no column and no values. If the
include_ccs_metadata: true setting was included on the query, the errors will listed in the
_clusters metadata section. (Note: this is also how the _search endpoint works for CCS.)

Partially addresses #114531

…rectly; needs cleanup and more testing

Created ConnectionErrorLookupResponse as subclass of LookupResponse

…ich policy API handling. One key change: if after the enrich policy resolution we know we could not contact one or more clusters, we remove them from the index expression that is sent to field-caps so we don't trigger another round of network errors and ExecutionInfo updates.

…pResponse

…edback

…ableRemotesIT

… enrich API calls though

…usterSecurityEsqlIT to match the new error output

Fixed some bugs in EnrichPolicyResolver and/or adjusted logic around handling missing remotes in enrich based on testing.

…e error swallowing functionality

… indexResolution.isValid block Other minor cleanup

…lable-disconnected-t1

…r SKIPPED clusters

docs/changelog/115266.yaml

smalyshev · 2024-10-25T21:40:22Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/enrich/EnrichPolicyResolver.java

@@ -261,22 +281,34 @@ private void lookupPolicies(
            if (remotePolicies.isEmpty() == false) {
                for (String cluster : remoteClusters) {


By this point, as I understand, we have already run field caps resolution. So we may have some skipped clusters. Do we exclude them somewhere from the list of clusters for which we're going to resolve policies? Because it looks like it's pointless to try to resolve those policies if the cluster will be skipped anyway.

No, the enrich policy resolution is done before the field-caps. I'm not sure why or if that's necessary for some reason or just historical accident. IMO, the field-caps call seems the logical one to do first, but that's not how it is, so we need to track unavailable clusters here and then pass that list to the callback listener which is the one that does the field-caps lookup: https://github.com/elastic/elasticsearch/pull/115266/files#diff-40060e2ec9003953a228c4a03bdc80a301949b0d4e3dccc1978798e33a992a73R345

Oh ok, I was for some reason assuming field caps go first... then of course yes, enrich should be telling field caps which clusters are skipped, not the other way around.

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

smalyshev · 2024-10-26T00:05:47Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

+            // if the preceding call to the enrich policy API found unavailable clusters, recreate the index expression to search
+            // based only on available clusters (which could now be an empty list)
+            String indexExpressionToResolve = createIndexExpressionFromAvailableClusters(executionInfo);
+            if (indexExpressionToResolve.equals("")) {


why not isEmpty?

Fixed in next push.

smalyshev · 2024-10-26T00:09:31Z

...src/javaRestTest/java/org/elasticsearch/xpack/remotecluster/RemoteClusterSecurityEsqlIT.java

@@ -491,6 +495,10 @@ public void testCrossClusterQueryWithRemoteDLSAndFLS() throws Exception {
        assertThat(flatList, containsInAnyOrder("engineering"));
    }

+    /**
+     * Note: invalid_remote is "invalid" because it has a bogus API key, but the cluster does actually exist


Does invalid_remote also always have skip_unavailable=true? Because as I understand if it's not, it should error out?

Good catch. I think my understanding of this test is wrong. I'll work on it next and update the comments and make sure this is behaving as expected.

I've updated the javadoc comment. The test is actually missing and we get different errors from the security layer depending on its skip_un=true/false flag, which explains the different behavior in the test seen below (for the remote-only test; the local+remote test has a different issue).

smalyshev · 2024-10-26T00:11:12Z

...src/javaRestTest/java/org/elasticsearch/xpack/remotecluster/RemoteClusterSecurityEsqlIT.java

+        {
+            var q = "FROM invalid_remote:employees,employees | SORT emp_id DESC | LIMIT 10";
+            Response response = performRequestWithRemoteSearchUser(esqlRequest(q));
+            assertLocalOnlyResults(response);


For example, here - if skipUnavailable is false, shouldn't it be an error here?

I analyzed this case and the problem is that field-caps is returning nothing from this cluster, rather than an Exception, so it is indistinguishable from the scenario where no matching indices could be found. The "no matching indices" found is going to be handled in a follow-on PR, so I will investigate this tests behavior under RCS 2.0 in that test.

...erTest/java/org/elasticsearch/xpack/esql/action/CrossClusterEnrichUnavailableClustersIT.java

…lable-disconnected-t1

DRY'd up CrossClusterEnrichUnavailableClustersIT to reuse set up from CrossClustersEnrichIT Created NoClustersToSearchException sentinel exception Clarified how RemoteClusterSecurityEsqlIT works and flagged it for more work in a follow-on PR.

…lable-disconnected-t1

astefan

That's great @quux00.
I've left some comments.
I was also about to add something related to testing scenarios involving inexistent indices (excluding the remote clusters not being available case), but I think you are going to cover those in the second PR right?

For example, the case of from remote:existent, inexistent* where there should be no error and the case of from remote:existent, inexistent where there should be an error. This is not strictly related to CCS and skip_unvailable but it would be good to see that the above behavior involving missing local indices (with and without wildcard) is also tested.

...erTest/java/org/elasticsearch/xpack/esql/action/CrossClusterEnrichUnavailableClustersIT.java

astefan · 2024-10-28T13:13:06Z

...erTest/java/org/elasticsearch/xpack/esql/action/CrossClusterEnrichUnavailableClustersIT.java

+                String query = "FROM *:events | eval ip= TO_STR(host) | " + enrichHosts(mode) + " | stats c = COUNT(*) by os | SORT os";
+                try (EsqlQueryResponse resp = runQuery(query, requestIncludeMeta)) {
+                    List<List<Object>> rows = getValuesList(resp);
+                    assertThat(rows.size(), equalTo(0));


I think here it would be important to also check the number of columns. I think this case should be treated the same as this test here https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/action/EsqlActionIT.java#L1414

That <no-fields> thingy comes from https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java#L211 where the use case is the one of an empty mappings index, meaning the index exists but it has no fields. In the special case of CCS-only remotes all which have skip_unavailable = true I am wondering if "empty" response should actually be a one-column response that has <no-fields> in it and it's of the null type.

Oh, interesting. Thanks for flagging that. I'll work on adding that next. I've seen that response in my manual testing and didn't know what it meant.

OK, I added it and did some manual testing. So here's a remote-only test against remote1 which is offline and skip_unavailable true. Is this the output that looks best to you?

{ "is_running": false, "took": 7, "all_columns": [ { "name": "<no-fields>", "type": "null" } ], "columns": [], "values": [], "_clusters": { "total": 1, "successful": 0, "running": 0, "skipped": 1, "partial": 0, "failed": 0, "details": { "remote1": { "status": "skipped", "indices": "bl*", "took": 7, "_shards": { "total": 0, "successful": 0, "skipped": 0, "failed": 0 }, "failures": [ { "shard": -1, "index": null, "reason": { "type": "remote_transport_exception", "reason": "[connect_transport_exception - unable to connect to remote cluster]" } } ] } } } }

Added that behavior and modified two tests in this IT test in this commit.

astefan · 2024-10-28T13:14:42Z

...erTest/java/org/elasticsearch/xpack/esql/action/CrossClusterEnrichUnavailableClustersIT.java

+                assertTrue(ExceptionsHelper.isRemoteUnavailableException(exception));
+            } else {
+                try (EsqlQueryResponse resp = runQuery(query, requestIncludeMeta)) {
+                    assertThat(getValuesList(resp).size(), equalTo(0));


Same here, about the number of columns.

astefan · 2024-10-28T14:00:46Z

...sterTest/java/org/elasticsearch/xpack/esql/action/CrossClusterQueryUnavailableRemotesIT.java

+                assertClusterMetadataInResponse(resp, responseExpectMeta);
+            }
+
+            // close remote-cluster-12that it is also unavailable


Nit: remote-cluster-2 I think?

Thanks. Fixed.

astefan · 2024-10-28T14:04:08Z

...sterTest/java/org/elasticsearch/xpack/esql/action/CrossClusterQueryUnavailableRemotesIT.java

+        clusterInfo.put("remote.index", remoteIndex);
+
+        if (numClusters == 3) {
+            String remoteIndex2 = "logs-2";


Here it wasn't clear for me if remoteIndex2 should have the same name as remoteIndex or not. If it should be the same, then I suggest using remoteIndex already defined value instead.

astefan · 2024-10-28T14:09:16Z

...sterTest/java/org/elasticsearch/xpack/esql/action/CrossClusterQueryUnavailableRemotesIT.java

+            // close remote cluster 2 so that it is also unavailable
+            cluster(REMOTE_CLUSTER_2).close();
+
+            // query only the REMOTE_CLUSTER_1


Nit: I think this should be different, as the query queries two remote clusters.

Thanks. Fixed.

astefan · 2024-10-28T14:09:32Z

...sterTest/java/org/elasticsearch/xpack/esql/action/CrossClusterQueryUnavailableRemotesIT.java

+                )
+            ) {
+                List<List<Object>> values = getValuesList(resp);
+                assertThat(values, hasSize(0));


Same thing about the number of columns check.

astefan · 2024-10-28T14:40:26Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/enrich/EnrichPolicyResolver.java

        String policyName = unresolved.name;
+        if (targetClusters.isEmpty()) {
+            return Tuple.tuple(null, "enrich policy [" + policyName + "] cannot be resolved since remote clusters are unavailable");


I couldn't find a test for this situation. Does it make sense to add one now? (I know there are several other follow up PRs in this area)

This code was added in response to one of the IT tests I added failing:
CrossClusterEnrichUnavailableClustersIT..testEnrichWithHostsPolicyAndDisconnectedRemotesWithSkipUnavailableTrue, and in particular when the 2nd cluster is taken offline: https://github.com/elastic/elasticsearch/pull/115266/files#diff-67efeda55a8a497d595e7ce7737ee3ae3acc3ec966db031190bd6954a610f88aR240

In that case invariants in the code below are not met and assertion failures are thrown, so I added this to just leave the code immediately when there are no targetClusters, which can happen when the Enrich Mode type is "REMOTE" and the local cluster was not included in the query.

astefan · 2024-10-28T15:09:06Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

+        /**
+         * Whether to return an empty result (HTTP status 200) for a CCS rather than a top level 4xx/5xx error.
+         *
+         * For cases where field-caps had no indices to search and the remotes were unavailable, we


I think there is an edge case here that I might not have covered here: from inexistent, remote:inexistent meaning when there is no index in remotes but also no index in local cluster. It doesn't matter if inexistent uses wildcard or not in any of the local or remote clusters. I do also think this is in the same area as this use case from #112886 table:

from remote:inexistent | failure coming from post-_field_caps check, at the Verifier level. See IndexResolver:93

First comment - I was planning to handle the "missing indices" questions in "ticket 2" of the skip_unavailable work (this is ticket 1 dealing with just disconnected clusters).

That said, let's clarify here what behavior you are proposing:

If, at the end of cluster resolution (and checking for connectivity) and index resolution, there are no indices to query (either because the clusters are unavailable and/or the available clusters have no matching indices), then we should throw a VerificationException, rather than an empty "successful" result.

The above is true even if it is a remote-only query and wildcards were used for either the cluster names or the index names or both.

If I have that right - do you think there is any scenario where at planning time (before starting ComputeService execution) we should return an empty "successful" result?

If, at the end of cluster resolution (and checking for connectivity) and index resolution, there are no indices to query (either because the clusters are unavailable and/or the available clusters have no matching indices), then we should throw a VerificationException, rather than an empty "successful" result.

No :-). My comment was specific to from inexistent, remote:inexistent and from inexistent, remote:inexistent*. This is a scenario that I missed to cover when we discussed about behavior of CCS queries. This involves the local cluster as well and I think in this case it doesn't matter how CCS behaves. The fact that the local cluster doesn't have an explicitly specified index should fail no matter what. In the two tables from #112886 (skip_un true/false) a better comparison query than the one in my previous review comment is

from remote:existent, inexistent

where the behavior is

failure coming from ES at query time complaining about "inexistent" from the "local" cluster

But, in our case the failure should be coming from the Verifier because the field_caps call returned no indices.

Another two use cases are from inexistent*, remote:inexistent and from inexistent*, remote:inexistent* and the principle for both these two cases and the previous two should be if the local cluster is involved and the overall - remote and local - resolution of indices ends up with no indices this should be an error because the user specified a missing indices pattern for the local cluster; even if the user cannot control how the remote clusters behave, the user is in control of the index pattern for the local cluster.

skip_unavailable = true/false

index pattern behavior

from inexistent*, remote:inexistent failure coming from post-_field_caps check, at the Verifier level

from inexistent*, remote:inexistent* failure coming from post-_field_caps check, at the Verifier level

ESQL's current behavior where no CCS is used is to reject queries that have no indices after the index pattern is resolved:
from inexistent*
from inexistent1*,inexistent2*
from inexistent
from inexistent1,inexistent2
from inexistent,inexistent1*

I have updated the tables from #112886

OK, sounds good. We are on the same page. This PR is not the one to handle those cases though. That is the next PR ("ticket 2" in #114531). Thanks for updating the tables in 112886. I will use that as a guide for the work on the next PR.

astefan · 2024-10-28T15:16:42Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

+         * on any of the requested clusters.
+         */
+        private boolean returnSuccessWithEmptyResult(Exception e) {
+            if (executionInfo.isCrossClusterSearch() == false) {


My review above about from inexistent, remote:inexistent I think it started from this line where this ESQL query is considered to be a CCS search. If this ESQL query flow goes through the next condition I think returnSuccessWithEmptyResult can return true?

Sorry, I don't think I follow what you are saying here. It sounds like there is a scenario where you think we should return an empty successful result at planning. Could you outline what that is again? (Same question I just asked in the previous comment, so you can just answer there.)

What I was trying to say is that this method may return true for a query like from inexistent, remote:inexistent which I believe shouldn't be the case.

…otesIT. Some minor code cleanup

…lable-disconnected-t1

…er.NO_FIELDS as the schema; tests updated

…otesIT when empty result in returned

…lable-disconnected-t1

astefan · 2024-10-29T13:37:13Z

...sterTest/java/org/elasticsearch/xpack/esql/action/CrossClusterQueryUnavailableRemotesIT.java

+            // 1) the local cluster indexExpression and REMOTE_CLUSTER_2 indexExpression match no indices
+            // 2) the REMOTE_CLUSTER_1 is unavailable
+            // 3) both remotes are marked as skip_un=true
+            String query = "FROM nomatch*," + REMOTE_CLUSTER_1 + ":logs-*," + REMOTE_CLUSTER_2 + ":nomatch* | STATS sum (v)";


This specific test should fail following the second PR from the skip_un series, right? (and following this discussion)

astefan

LGTM
Thanks @quux00

…lable-disconnected-t1

elasticsearchmachine · 2024-10-29T20:20:14Z

💚 Backport successful

Status	Branch	Result
✅	8.x

…ote clusters (elastic#115266) As part of ES|QL planning of a cross-cluster search, a field-caps call is done to each cluster and, if an ENRICH command is present, the enrich policy-resolve API is called on each remote. If a remote cluster cannot be connected to in these calls, the outcome depends on the skip_unavailable setting. For skip_unavailable=false clusters, the error is fatal and the error will immediately be propagated back to the client with a top level error message with a 500 HTTP status response code. For skip_unavailable=true clusters, the error is not fatal. The error will be trapped, recorded in the EsqlExecutionInfo object for the query, marking the cluster as SKIPPED. If the user requested CCS metadata to be included, the cluster status and connection failure will be present in the _clusters/details section of the response. If no clusters can be contacted, if they are all marked as skip_unavailable=true, no error will be returned. Instead a 200 HTTP status will be returned with no column and no values. If the include_ccs_metadata: true setting was included on the query, the errors will listed in the _clusters metadata section. (Note: this is also how the _search endpoint works for CCS.) Partially addresses elastic#114531

…ote clusters (#115266) (#115881) As part of ES|QL planning of a cross-cluster search, a field-caps call is done to each cluster and, if an ENRICH command is present, the enrich policy-resolve API is called on each remote. If a remote cluster cannot be connected to in these calls, the outcome depends on the skip_unavailable setting. For skip_unavailable=false clusters, the error is fatal and the error will immediately be propagated back to the client with a top level error message with a 500 HTTP status response code. For skip_unavailable=true clusters, the error is not fatal. The error will be trapped, recorded in the EsqlExecutionInfo object for the query, marking the cluster as SKIPPED. If the user requested CCS metadata to be included, the cluster status and connection failure will be present in the _clusters/details section of the response. If no clusters can be contacted, if they are all marked as skip_unavailable=true, no error will be returned. Instead a 200 HTTP status will be returned with no column and no values. If the include_ccs_metadata: true setting was included on the query, the errors will listed in the _clusters metadata section. (Note: this is also how the _search endpoint works for CCS.) Partially addresses #114531

…cted remote clusters (elastic#115266)" This reverts commit b6d2d4b.

costin

I'm leaving feedback here after this PR was merged for future enhancement (as I understand there's going to be more work).

It mainly resolved around the Enrich and CCS intricacies being spread through-out EsqlSession which makes evolving this problematic - as this class is at the code of doing cluster calls.

The Enrich and CCS parts need encapsulation (such as moving to a separate class like CcsUtils, defining a dedicated listener, etc...) so that EsqlSession focused on its core purpose : working on the LogicalPlan.
Please make this a priority moving forward - thanks!

costin · 2024-10-30T03:24:34Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlExecutionInfo.java

+            if (failures == null) {
+                this.failures = List.of();
+            } else {
+                this.failures = failures;
+            }


this.failures = failures == null ? emptyList() : failures

costin · 2024-10-30T03:25:28Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlExecutionInfo.java

@@ -362,6 +370,11 @@ public Cluster(StreamInput in) throws IOException {
            this.failedShards = in.readOptionalInt();
            this.took = in.readOptionalTimeValue();
            this.skipUnavailable = in.readBoolean();
+            if (in.getTransportVersion().onOrAfter(TransportVersions.ESQL_CCS_EXEC_INFO_WITH_FAILURES)) {
+                this.failures = Collections.unmodifiableList(in.readCollectionAsList(ShardSearchFailure::readShardSearchFailure));


Any reason why the collection is unmodifiable - within the ESQL code we're not using the defensive style since the collections are never modified, rather copied.

The EsqlExecution.Cluster class, like the SearchResponse.Cluster class it is based on, is immutable. You can only change state by coping an existing Cluster and swapping it in. I can change line 374 to use a mutable list, but it won't make any difference since the coding practice for this class is to never modify in place. In fact, the line below where I do this.failures = List.of(); should likely be changed to an immutable list as well to enforce this model.

costin · 2024-10-30T03:27:23Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/enrich/EnrichPolicyResolver.java

+                                new ActionListenerResponseHandler<>(lookupListener.delegateResponse((l, e) -> {
+                                    if (ExceptionsHelper.isRemoteUnavailableException(e)
+                                        && remoteClusterService.isSkipUnavailable(cluster)) {
+                                        l.onResponse(new LookupResponse(e));
+                                    } else {
+                                        l.onFailure(e);
+                                    }
+                                }), LookupResponse::new, threadPool.executor(ThreadPool.Names.SEARCH))
+                            );
+                        }
+
+                        @Override
+                        public void onFailure(Exception e) {
+                            if (ExceptionsHelper.isRemoteUnavailableException(e) && remoteClusterService.isSkipUnavailable(cluster)) {
+                                lookupListener.onResponse(new LookupResponse(e));
+                            } else {
+                                lookupListener.onFailure(e);
+                            }
+                        }
+                    });


Since the code is similar between onFailure and onResponse, pull it out in a separate method.

costin · 2024-10-30T03:35:22Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

+     * Any Exception sent to onFailure stops processing, but not all are fatal (return a 4xx or 5xx), so
+     * the onFailure handler determines whether to return an empty successful result or a 4xx/5xx error.
+     */
+    class LogicalPlanActionListener implements ActionListener<LogicalPlan> {


As far as I can tell, the reason for this listener is to pick up remote CCS error and properly report them. These occur during the preAnalysis phase or the execution.
Hence why I suggest to:

use a different name that better conveys the purpose (e.g. CcsAwareActionListener or CcsPartialFailuresActionListener)

update the javadoc description accordingly

costin · 2024-10-30T03:38:42Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

+
+        @Override
+        public void onFailure(Exception e) {
+            if (returnSuccessWithEmptyResult(e)) {


To improve readability, please extract this code into a CcsUtils class and call it here as a one liner:

CcsUtils.handleException(Exception, ExecutionInfo, ActionListener)

costin · 2024-10-30T03:39:33Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

@@ -302,6 +440,30 @@ private void preAnalyzeIndices(
        }
    }

+    // visible for testing
+    static String createIndexExpressionFromAvailableClusters(EsqlExecutionInfo executionInfo) {


Move this into a separate class CcsUtils

costin · 2024-10-30T03:39:50Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java

-            executionInfo.swapCluster(
-                clusterAlias,
-                (k, v) -> new EsqlExecutionInfo.Cluster.Builder(v).setStatus(EsqlExecutionInfo.Cluster.Status.SKIPPED).build()
+    static void updateExecutionInfoWithUnavailableClusters(EsqlExecutionInfo execInfo, Map<String, FieldCapabilitiesFailure> unavailable) {


Move this into a separate CcsUtils

costin · 2024-10-30T03:40:27Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/IndexResolver.java

    }

    // visible for testing
-    static Set<String> determineUnavailableRemoteClusters(List<FieldCapabilitiesFailure> failures) {
-        Set<String> unavailableRemotes = new HashSet<>();
+    static Map<String, FieldCapabilitiesFailure> determineUnavailableRemoteClusters(List<FieldCapabilitiesFailure> failures) {


Same as above- look into moving this into a CcsUtils class

…ote clusters (elastic#115266) As part of ES|QL planning of a cross-cluster search, a field-caps call is done to each cluster and, if an ENRICH command is present, the enrich policy-resolve API is called on each remote. If a remote cluster cannot be connected to in these calls, the outcome depends on the skip_unavailable setting. For skip_unavailable=false clusters, the error is fatal and the error will immediately be propagated back to the client with a top level error message with a 500 HTTP status response code. For skip_unavailable=true clusters, the error is not fatal. The error will be trapped, recorded in the EsqlExecutionInfo object for the query, marking the cluster as SKIPPED. If the user requested CCS metadata to be included, the cluster status and connection failure will be present in the _clusters/details section of the response. If no clusters can be contacted, if they are all marked as skip_unavailable=true, no error will be returned. Instead a 200 HTTP status will be returned with no column and no values. If the include_ccs_metadata: true setting was included on the query, the errors will listed in the _clusters metadata section. (Note: this is also how the _search endpoint works for CCS.) Partially addresses elastic#114531

quux00 added >feature auto-backport Automatically create backport pull requests when merged v9.0.0 v8.17.0 labels Oct 21, 2024

quux00 mentioned this pull request Oct 21, 2024

ES|QL CCS handles disconnected clusters according to skip_unavailable setting #114745

Closed

quux00 force-pushed the esql-ccs/skip_unavailable-disconnected-t1 branch 5 times, most recently from 2647bf8 to 4d5ec71 Compare October 25, 2024 19:52

quux00 added 17 commits October 25, 2024 16:54

Intmd commit in order to work on backports

da778ee

Added LogicalPlanActionListener and improved edge case handling

c784ebb

Placeholder to start on enrich policy pathways

737ca7c

Intmd commit - very basic testing shows that this handles skip_un cor…

43cf559

…rectly; needs cleanup and more testing

Added Map<String, Exception> unavailableClusters to EnrichResolution

6f73b4d

Created ConnectionErrorLookupResponse as subclass of LookupResponse

Remove ConnectionErrorLookupResponse and add errorConnection to Looku…

e84e0c8

…pResponse

Non-working version of CrossClusterQueryUnavailableIT added to get fe…

52ce899

…edback

Working version of CrossClusterQueryUnavailableRemotesIT

e8626c1

Renamed to CrossClusterQueryUnavailableIT to CrossClusterQueryUnavail…

3027b58

…ableRemotesIT

Enhancements to CrossClusterQueryUnavailableRemotesIT - not yet doing…

21e849c

… enrich API calls though

Fixed bug in EsqlSession found by existing tests and patched RemoteCl…

a7ca28d

…usterSecurityEsqlIT to match the new error output

Added CrossClusterEnrichUnavailableClustersIT

2d4bc98

Fixed some bugs in EnrichPolicyResolver and/or adjusted logic around handling missing remotes in enrich based on testing.

Removed two tests from CrossClusterEnrichUnavailableClustersIT

ac1cb85

Updated RemoteClusterSecurityEsqlIT to match new skip_unavailable=tru…

22871a8

…e error swallowing functionality

Minor cleanup before PR review

b3956f4

Moved updateExecutionInfoWithClustersWithNoMatchingIndices out of the…

67454fd

… indexResolution.isValid block Other minor cleanup

quux00 force-pushed the esql-ccs/skip_unavailable-disconnected-t1 branch from 4d5ec71 to 67454fd Compare October 25, 2024 20:54

quux00 added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL labels Oct 25, 2024

quux00 added 2 commits October 25, 2024 19:44

Merge remote-tracking branch 'elastic/main' into esql-ccs/skip_unavai…

5fbb373

…lable-disconnected-t1

Adjusted EsqlSessionTests to new logic of zeroing out shard counts fo…

3c95dfe

…r SKIPPED clusters

smalyshev reviewed Oct 26, 2024

View reviewed changes

...erTest/java/org/elasticsearch/xpack/esql/action/CrossClusterEnrichUnavailableClustersIT.java Outdated Show resolved Hide resolved

quux00 added 4 commits October 26, 2024 10:25

Merge remote-tracking branch 'elastic/main' into esql-ccs/skip_unavai…

bb6b96c

…lable-disconnected-t1

Merge remote-tracking branch 'elastic/main' into esql-ccs/skip_unavai…

06aa298

…lable-disconnected-t1

PR feedback

5fd33f0

DRY'd up CrossClusterEnrichUnavailableClustersIT to reuse set up from CrossClustersEnrichIT Created NoClustersToSearchException sentinel exception Clarified how RemoteClusterSecurityEsqlIT works and flagged it for more work in a follow-on PR.

Merge remote-tracking branch 'elastic/main' into esql-ccs/skip_unavai…

59fe3e9

…lable-disconnected-t1

astefan reviewed Oct 28, 2024

View reviewed changes

quux00 added 8 commits October 28, 2024 16:16

PR feedback. Added additional test to CrossClusterQueryUnavailableRem…

dc122c3

…otesIT. Some minor code cleanup

Merge remote-tracking branch 'elastic/main' into esql-ccs/skip_unavai…

cd56a2e

…lable-disconnected-t1

PR feedback: Empty Results (skip_un=true scenarios) now return Analyz…

dc2a75b

…er.NO_FIELDS as the schema; tests updated

Added checks for no-fields columns to CrossClusterQueryUnavailableRem…

bcdaad3

…otesIT when empty result in returned

Fix failing test and spotless check

9369d29

Merge remote-tracking branch 'elastic/main' into esql-ccs/skip_unavai…

4b5c72b

…lable-disconnected-t1

Minor cleanup

db2a441

Merge remote-tracking branch 'elastic/main' into esql-ccs/skip_unavai…

8eda8c8

…lable-disconnected-t1

astefan reviewed Oct 29, 2024

View reviewed changes

astefan approved these changes Oct 29, 2024

View reviewed changes

Merge remote-tracking branch 'elastic/main' into esql-ccs/skip_unavai…

2a4c06c

…lable-disconnected-t1

quux00 merged commit b6d2d4b into elastic:main Oct 29, 2024
16 checks passed

quux00 mentioned this pull request Oct 29, 2024

[8.x] ES|QL CCS uses skip_unavailable setting for handling disconnected remote clusters (#115266) #115881

Merged

quux00 added a commit to quux00/elasticsearch that referenced this pull request Oct 30, 2024

Revert "ES|QL CCS uses skip_unavailable setting for handling disconne…

1ded26a

…cted remote clusters (elastic#115266)" This reverts commit b6d2d4b.

costin reviewed Oct 30, 2024

View reviewed changes

quux00 mentioned this pull request Oct 30, 2024

Refactor CCS handling code from EsqlSession and IndexResolver into dedicated util class #115976

Closed

quux00 mentioned this pull request Nov 22, 2024

ESQL: Missing enrich policies on skip_unavailable=true clusters no longer fail the query #116972

Closed

		@@ -261,22 +281,34 @@ private void lookupPolicies(
		if (remotePolicies.isEmpty() == false) {
		for (String cluster : remoteClusters) {

index pattern	behavior
from inexistent*, remote:inexistent	failure coming from post-`_field_caps` check, at the `Verifier` level
from inexistent, remote:inexistent	failure coming from post-`_field_caps` check, at the `Verifier` level

ES|QL CCS uses skip_unavailable setting for handling disconnected remote clusters #115266

ES|QL CCS uses skip_unavailable setting for handling disconnected remote clusters #115266

Conversation

quux00 commented Oct 21, 2024 • edited Loading

Choose a reason for hiding this comment

quux00 Oct 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

astefan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

astefan left a comment

Choose a reason for hiding this comment

elasticsearchmachine commented Oct 29, 2024

💚 Backport successful

costin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

quux00 commented Oct 21, 2024 •

edited

Loading

quux00 Oct 28, 2024 •

edited

Loading