
Transient 503 UH "no healthy upstream" errors during CDS updates #13070

Closed
kryzthov-stripe opened this issue Sep 11, 2020 · 9 comments · Fixed by #13906 or kgateway-dev/kgateway#4028

Comments

@kryzthov-stripe

Title: Envoy briefly fails to route requests during CDS updates

Description:
When a cluster is updated via CDS, we sometimes observe request failures in the form of HTTP 503 "no healthy upstream" with the UH response flag.
The cluster's membership_total stat remains constant throughout the update.
The membership_healthy stat drops to 0 shortly after the CDS update begins, then returns to normal shortly after the update completes.
Request failures occur regardless of the panic routing threshold.

The docs around cluster warming (in https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/cluster_manager#cluster-warming) suggest there should be no traffic disruption during a CDS update.

Repro steps:
We observed and reproduced this behavior on both static and EDS clusters by sending a CDS update that changes the health-checking timeout (e.g. from 5s to 10s and back).
For example, we used an EDS cluster with a 1% panic routing threshold, configured roughly as follows:

{
    "@type": "type.googleapis.com/envoy.api.v2.Cluster",
    "name": "foo",
    "type": "EDS",
    "eds_cluster_config": {
        "eds_config": {
            "api_config_source": {
                "api_type": "GRPC",
                "grpc_services": [ { "envoy_grpc": { "cluster_name": "xxx" } } ]
            }
        }
    },
    "connect_timeout": "10s",
    "health_checks": [ {
        "timeout": "5s",
        "interval": "3s",
        "unhealthy_threshold": 1,
        "healthy_threshold": 2,
        "reuse_connection": true,
        "http_health_check": {
            "host": "xxx",
            "path": "/healthcheck",
        },
        "no_traffic_interval": "3s"
    } ],
    "circuit_breakers": {
        "thresholds": [ {
            "max_connections": 4294967295,
            "max_pending_requests": 4294967295,
            "max_requests": 4294967295,
            "retry_budget": {}
        } ]
    },
    "dns_lookup_family": "V4_ONLY",
    "common_lb_config": { "healthy_panic_threshold": { "value": 1 } },
    "transport_socket": {
        "name": "envoy.transport_sockets.tls",
        "typed_config": {
            "@type": "type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext",
            "common_tls_context": {
                "tls_certificates": [ {
                        "certificate_chain": { "filename": "/path/to/cert.pem" },
                        "private_key": { "filename": "/path/to/key.pem" }
                } ],
                "validation_context": {
                    "trusted_ca": { "filename": "/path/to/ca.pem" },
                    "match_subject_alt_names": [ { "exact": "xxx" } ]
                }
            }
        }
    },
    "upstream_connection_options": {
        "tcp_keepalive": {
            "keepalive_probes": 9,
            "keepalive_time": 300,
            "keepalive_interval": 60
        }
    }
}

The cluster load assignment included a stable pool of 10 healthy backends.
Envoy does not appear to make any health-check request during the CDS update.
The backends are hard-coded to report healthy, and run locally on the same host to minimize possible network interference.

Logs:
We've reproduced this using a version of Envoy 1.14.5-dev with some extra log statements:

2020-09-11T18:05:48 cluster.foo.membership_healthy = 10
[...]
2020-09-11T18:05:49 Traffic generator: 1428 requests, 0 errors
[...]
2020-09-11T18:05:54 Traffic generator: 2590 requests, 0 errors
2020-09-11T18:05:54 Updating foo HC timeout to 5s
2020-09-11T18:05:55 Traffic generator: 2962 requests, 0 errors
[...]
2020-09-11T18:05:57 Traffic generator: 3024 requests, 0 errors
2020-09-11T18:05:57 [[email protected]:6878] responding to /healthcheck request from 192.168.0.11:44078 (for [192.168.0.12]) with OK
2020-09-11T18:05:57 [[email protected]:6872] responding to /healthcheck request from 192.168.0.5:37884 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6871] responding to /healthcheck request from 192.168.0.4:49690 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6877] responding to /healthcheck request from 192.168.0.10:36026 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6876] responding to /healthcheck request from 192.168.0.9:46476 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6870] responding to /healthcheck request from 192.168.0.3:46406 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6873] responding to /healthcheck request from 192.168.0.6:58104 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6875] responding to /healthcheck request from 192.168.0.8:34440 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6869] responding to /healthcheck request from 192.168.0.2:47934 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6874] responding to /healthcheck request from 192.168.0.7:45602 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 Traffic generator: 3640 requests, 0 errors
[...]
2020-09-11T18:05:58 Traffic generator: 3554 requests, 0 errors
[2020-09-11 18:05:58.897][401][upstream] [source/common/upstream/cluster_manager_impl.cc:582] add/update cluster foo starting warming
[2020-09-11 18:05:58.897][401][upstream] [source/common/upstream/upstream_impl.cc:951] ClusterImplBase::initialize()
[2020-09-11 18:05:58.913][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 0
[2020-09-11 18:05:58.913][401][upstream] [source/common/upstream/upstream_impl.cc:967] initializing Secondary cluster foo completed
[2020-09-11 18:05:58.913][401][upstream] [source/common/upstream/upstream_impl.cc:972] ClusterImplBase::onInitDone()
2020-09-11T18:05:58 cluster.foo.membership_healthy = 0
[2020-09-11 18:05:58.976][401][upstream] [source/common/upstream/upstream_impl.cc:992] ClusterImplBase::finishInitialization()
[2020-09-11 18:05:58.976][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 0 added and 0 removed, updating hosts from 10 to 10, and healthy hosts from 0 to 10
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:594] warming cluster foo complete
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:612] updating TLS cluster foo
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:1176] Init ClusterEntry for foo at 0x563f2e21a340
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:1206] Initializing RoundRobin Load Balancer
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:369] onClusterInit
[2020-09-11 18:05:58.978][665][upstream] [source/common/upstream/cluster_manager_impl.cc:612] updating TLS cluster foo
[2020-09-11 18:05:58.978][665][upstream] [source/common/upstream/cluster_manager_impl.cc:1176] Init ClusterEntry for foo at 0x563f2e4189c0
[2020-09-11 18:05:58.978][665][upstream] [source/common/upstream/cluster_manager_impl.cc:1206] Initializing RoundRobin Load Balancer
[2020-09-11 18:05:58.981][665][misc] [source/common/upstream/load_balancer_impl.cc:808] chooseHostOnce: hosts_to_use.empty with source type AllHosts
[2020-09-11 18:05:58.981][665][upstream] [source/common/upstream/cluster_manager_impl.cc:1244] no healthy host for HTTP connection pool 0x563f2e4189c0
[2020-09-11 18:05:58.981][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 10
[2020-09-11 18:05:58.985][665][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 10
2020-09-11T18:05:58 Error making request: Expected HTTP code 200, but got 503 (with body "no healthy upstream")
2020-09-11T18:05:58 cluster.foo.membership_healthy = 10
2020-09-11T18:05:59 Traffic generator: 3441 requests, 1 errors

Based on the trace above, it looks like a load balancer for the updated cluster is used to route a request before its underlying host set has been initialized.
I'm not familiar enough with Envoy internals and could use some help understanding whether I'm somehow misconfiguring Envoy, or whether something else is happening.

@kryzthov-stripe kryzthov-stripe added bug triage Issue requires triage labels Sep 11, 2020
@mattklein123
Member

This sounds like another repeat of #13009? cc @htuch @ramaraochavali @snowp

@snowp
Contributor

snowp commented Sep 13, 2020

I'm not sure it's the same issue; if it were, I would expect the timestamps to show that we're hitting the EDS fetch timeout (15s) without a response, causing cluster warming to complete before any hosts are received.

The logs seem to indicate that the hosts are being added to the cluster, then health checked during warming:

[2020-09-11 18:05:58.897][401][upstream] [source/common/upstream/upstream_impl.cc:951] ClusterImplBase::initialize()
[2020-09-11 18:05:58.913][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 0
[2020-09-11 18:05:58.913][401][upstream] [source/common/upstream/upstream_impl.cc:967] initializing Secondary cluster foo completed
[2020-09-11 18:05:58.913][401][upstream] [source/common/upstream/upstream_impl.cc:972] ClusterImplBase::onInitDone()
2020-09-11T18:05:58 cluster.foo.membership_healthy = 0
[2020-09-11 18:05:58.976][401][upstream] [source/common/upstream/upstream_impl.cc:992] ClusterImplBase::finishInitialization()
[2020-09-11 18:05:58.976][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 0 added and 0 removed, updating hosts from 10 to 10, and healthy hosts from 0 to 10

I wonder if the problem is that we create the thread-local (TLS) cluster and update its hosts in separate dispatcher events, so the two don't happen atomically?

@mattklein123
Member

Warming shouldn't complete until the first round of health checking is complete, so I'm not sure what is happening here. I can take a more detailed look later.

@snowp
Contributor

snowp commented Sep 13, 2020

The idea I mentioned in my last post isn't even necessarily about warming, but about the fact that we call

void ClusterManagerImpl::createOrUpdateThreadLocalCluster(ClusterData& cluster) {
  tls_->runOnAllThreads([this, new_cluster = cluster.cluster_->info(),
                         thread_aware_lb_factory = cluster.loadBalancerFactory()]() -> void {
    ThreadLocalClusterManagerImpl& cluster_manager =
        tls_->getTyped<ThreadLocalClusterManagerImpl>();

    if (cluster_manager.thread_local_clusters_.count(new_cluster->name()) > 0) {
      ENVOY_LOG(debug, "updating TLS cluster {}", new_cluster->name());
    } else {
      ENVOY_LOG(debug, "adding TLS cluster {}", new_cluster->name());
    }

    auto thread_local_cluster = new ThreadLocalClusterManagerImpl::ClusterEntry(
        cluster_manager, new_cluster, thread_aware_lb_factory);
    cluster_manager.thread_local_clusters_[new_cluster->name()].reset(thread_local_cluster);
    for (auto& cb : cluster_manager.update_callbacks_) {
      cb->onClusterAddOrUpdate(*thread_local_cluster);
    }
  });
}

before we do

  // Finally, if the cluster has any hosts, post updates cross-thread so the per-thread load
  // balancers are ready.
  for (auto& host_set : cluster.prioritySet().hostSetsPerPriority()) {
    if (host_set->hosts().empty()) {
      continue;
    }
    postThreadLocalClusterUpdate(cluster, host_set->priority(), host_set->hosts(), HostVector{});
  }

which means we get two dispatcher events: [update TLS cluster, set hosts], which could be interleaved with a dispatcher event that tries to route to this cluster.
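
To make the suspected interleaving concrete, here is a minimal standalone sketch (toy types only, none of this is Envoy code) of a single worker dispatcher draining its queue in the order [create empty cluster entry, route request, set hosts]:

// Toy sketch only: none of these types are Envoy's. A single worker thread
// drains a queue of posted events in order.
#include <deque>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct ToyClusterEntry {
  std::vector<std::string> healthy_hosts; // empty until the host update event runs
};

int main() {
  std::deque<std::function<void()>> worker_queue;
  ToyClusterEntry* cluster = nullptr;

  // Event 1: the main thread posts "create/replace thread-local cluster" (no hosts yet).
  worker_queue.push_back([&] { cluster = new ToyClusterEntry{}; });

  // A request arrives on the worker and its routing event lands between the two updates.
  worker_queue.push_back([&] {
    if (cluster == nullptr || cluster->healthy_hosts.empty()) {
      std::cout << "503 UH: no healthy upstream\n"; // the failure observed in this issue
    } else {
      std::cout << "routed to " << cluster->healthy_hosts.front() << "\n";
    }
  });

  // Event 2: the main thread posts the cross-thread host update afterwards.
  worker_queue.push_back([&] { cluster->healthy_hosts = {"10.0.0.1:8080"}; });

  // Draining in order gives: create entry -> route (fails) -> set hosts.
  while (!worker_queue.empty()) {
    worker_queue.front()();
    worker_queue.pop_front();
  }
  delete cluster;
}

Run as-is it prints "503 UH: no healthy upstream", because the routing event runs while the entry exists but its host set is still empty.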

Maybe there's something I missed that prevents this from being an issue; I didn't try to reproduce it.

If this is the case then any CDS update could possibly cause this, so the main reason I'm skeptical about this theory is that I would have expected it to have been discovered a long time ago.

@mattklein123
Member

If this is the case then any CDS update could possibly cause this, so the main reason I'm skeptical about this theory is that I would have expected it to have been discovered a long time ago.

No, I think you are right. This is definitely a race condition and broken. I'm guessing the reason it has not been reported before is that CDS updates are pretty rare and no one has noticed. We will need to fix this.

@mattklein123 mattklein123 added area/cluster_manager help wanted Needs help! and removed triage Issue requires triage labels Sep 13, 2020
@mattklein123 mattklein123 self-assigned this Sep 13, 2020
@ramaraochavali
Contributor

The fact that this specific issue ("We observed and reproduced this behavior on static or EDS clusters") happens on static clusters seems to indicate that it is not the same as #13009, and maybe @snowp's theory is correct.

@howardjohn
Contributor

I am fairly sure this is the cause of istio/istio#23029. I can reproduce this pretty reliably (not 100% of the time, but within a few minutes).

The test I am doing updates a cluster every 2s, changing a metadata field that does not matter. Meanwhile, I send 2k QPS.

Cluster config, which notably has no TLS or health check:


     "version_info": "2020-10-29T00:34:54Z/259",
     "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "outbound|8080||echo.default.svc.cluster.local",
      "type": "EDS",
      "eds_cluster_config": {
       "eds_config": {
        "ads": {},
        "initial_fetch_timeout": "3s",
        "resource_api_version": "V3"
       },
       "service_name": "outbound|8080||echo.default.svc.cluster.local"
      },
      "connect_timeout": "10s",
      "circuit_breakers": {
       "thresholds": [
        {
         "max_connections": 4294967295,
         "max_pending_requests": 4294967295,
         "max_requests": 4294967295,
         "max_retries": 4294967295
        }
       ]
      },
      "metadata": {
       "filter_metadata": {
        "istio": {
         "services": [
          {
           "namespace": "default",
           "host": "echo.default.svc.cluster.local",
           "name": "echo"
          }
         ],
         "default_original_port": 0.74407261806614822,
         "config": "/apis/networking.istio.io/v1alpha3/namespaces/istio-system/destination-rule/plaintext"
        }
       }
      }
     },
     "last_updated": "2020-10-29T00:34:54.197Z"
    }
   ]
  },

Note: the 3s initial_fetch_timeout was left over from playing around with #13009; it does not matter what I set it to, and even disabling it with 0s still reproduces the issue.

I captured trace logs for two updates that triggered this and two that did not.

While the above test is synthetic, this behavior is also consistently reproducible in real Knative workloads.

@mattklein123
Member

I can take a look at fixing this.

@mattklein123
Member

I looked at this and the fix is not difficult. I will fix soon.
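
For what it's worth, one conceivable shape for such a fix (a sketch under assumptions only, not necessarily what the eventual change does): install the thread-local cluster entry and its initial host set in a single posted event, so a worker can never observe the new entry with an empty host set. Reusing the toy dispatcher sketch from earlier in the thread:

// Toy sketch only (same toy types as the earlier sketch, not Envoy code):
// one posted event installs the entry together with its initial hosts.
#include <deque>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct ToyClusterEntry {
  std::vector<std::string> healthy_hosts;
};

int main() {
  std::deque<std::function<void()>> worker_queue;
  ToyClusterEntry* cluster = nullptr;
  std::vector<std::string> initial_hosts = {"10.0.0.1:8080"}; // snapshot taken when posting

  // Single event: publish the entry only after its hosts are populated.
  worker_queue.push_back([&, initial_hosts] {
    auto* entry = new ToyClusterEntry{};
    entry->healthy_hosts = initial_hosts;
    cluster = entry;
  });

  // Any routing event queued after the update now always sees the hosts.
  worker_queue.push_back([&] {
    std::cout << (cluster->healthy_hosts.empty() ? "503 UH\n" : "routed\n");
  });

  while (!worker_queue.empty()) {
    worker_queue.front()();
    worker_queue.pop_front();
  }
  delete cluster;
}

With the two updates collapsed into one posted event, a routing event can land before or after the update, but never between the entry creation and the host assignment.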
