
Transient 503 UH "no healthy upstream" errors during CDS updates #13070

Closed
kryzthov-stripe opened this issue Sep 11, 2020 · 9 comments · Fixed by #13906 or kgateway-dev/kgateway#4028

Comments

@kryzthov-stripe

Title: Envoy briefly fails to route requests during CDS updates

Description:
When a cluster is updated via CDS, we sometimes observe request failures in the form of HTTP 503 "no healthy upstream" with the UH response flag.
The cluster's membership_total stat remains constant throughout the update.
The membership_healthy stat drops to 0 shortly after the CDS update begins, then returns to normal shortly after the update completes.
Request failures occur regardless of the panic routing threshold.

The docs around cluster warming (in https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/cluster_manager#cluster-warming) suggest there should be no traffic disruption during a CDS update.

Repro steps:
We observed and reproduced this behavior on both static and EDS clusters by sending a CDS update that changes the health-checking timeout (e.g. from 5s to 10s and back).
For example, we used an EDS cluster with a 1% panic routing threshold, configured roughly as follows:

{
    "@type": "type.googleapis.com/envoy.api.v2.Cluster",
    "name": "foo",
    "type": "EDS",
    "eds_cluster_config": {
        "eds_config": {
            "api_config_source": {
                "api_type": "GRPC",
                "grpc_services": [ { "envoy_grpc": { "cluster_name": "xxx" } } ]
            }
        }
    },
    "connect_timeout": "10s",
    "health_checks": [ {
        "timeout": "5s",
        "interval": "3s",
        "unhealthy_threshold": 1,
        "healthy_threshold": 2,
        "reuse_connection": true,
        "http_health_check": {
            "host": "xxx",
            "path": "/healthcheck",
        },
        "no_traffic_interval": "3s"
    } ],
    "circuit_breakers": {
        "thresholds": [ {
            "max_connections": 4294967295,
            "max_pending_requests": 4294967295,
            "max_requests": 4294967295,
            "retry_budget": {}
        } ]
    },
    "dns_lookup_family": "V4_ONLY",
    "common_lb_config": { "healthy_panic_threshold": { "value": 1 } },
    "transport_socket": {
        "name": "envoy.transport_sockets.tls",
        "typed_config": {
            "@type": "type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext",
            "common_tls_context": {
                "tls_certificates": [ {
                        "certificate_chain": { "filename": "/path/to/cert.pem" },
                        "private_key": { "filename": "/path/to/key.pem" }
                } ],
                "validation_context": {
                    "trusted_ca": { "filename": "/path/to/ca.pem" },
                    "match_subject_alt_names": [ { "exact": "xxx" } ]
                }
            }
        }
    },
    "upstream_connection_options": {
        "tcp_keepalive": {
            "keepalive_probes": 9,
            "keepalive_time": 300,
            "keepalive_interval": 60
        }
    }
}

The cluster load assignment included a stable pool of 10 healthy backends.
Envoy does not appear to make any health-check request during the CDS update.
The backends are hard-coded to report healthy, and run locally on the same host to minimize possible network interference.

Logs:
We've reproduced this using a version of Envoy 1.14.5-dev with some extra log statements:

2020-09-11T18:05:48 cluster.foo.membership_healthy = 10
[...]
2020-09-11T18:05:49 Traffic generator: 1428 requests, 0 errors
[...]
2020-09-11T18:05:54 Traffic generator: 2590 requests, 0 errors
2020-09-11T18:05:54 Updating foo HC timeout to 5s
2020-09-11T18:05:55 Traffic generator: 2962 requests, 0 errors
[...]
2020-09-11T18:05:57 Traffic generator: 3024 requests, 0 errors
2020-09-11T18:05:57 [[email protected]:6878] responding to /healthcheck request from 192.168.0.11:44078 (for [192.168.0.12]) with OK
2020-09-11T18:05:57 [[email protected]:6872] responding to /healthcheck request from 192.168.0.5:37884 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6871] responding to /healthcheck request from 192.168.0.4:49690 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6877] responding to /healthcheck request from 192.168.0.10:36026 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6876] responding to /healthcheck request from 192.168.0.9:46476 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6870] responding to /healthcheck request from 192.168.0.3:46406 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6873] responding to /healthcheck request from 192.168.0.6:58104 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6875] responding to /healthcheck request from 192.168.0.8:34440 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6869] responding to /healthcheck request from 192.168.0.2:47934 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 [[email protected]:6874] responding to /healthcheck request from 192.168.0.7:45602 (for [192.168.0.12]) with OK
2020-09-11T18:05:58 Traffic generator: 3640 requests, 0 errors
[...]
2020-09-11T18:05:58 Traffic generator: 3554 requests, 0 errors
[2020-09-11 18:05:58.897][401][upstream] [source/common/upstream/cluster_manager_impl.cc:582] add/update cluster foo starting warming
[2020-09-11 18:05:58.897][401][upstream] [source/common/upstream/upstream_impl.cc:951] ClusterImplBase::initialize()
[2020-09-11 18:05:58.913][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 0
[2020-09-11 18:05:58.913][401][upstream] [source/common/upstream/upstream_impl.cc:967] initializing Secondary cluster foo completed
[2020-09-11 18:05:58.913][401][upstream] [source/common/upstream/upstream_impl.cc:972] ClusterImplBase::onInitDone()
2020-09-11T18:05:58 cluster.foo.membership_healthy = 0
[2020-09-11 18:05:58.976][401][upstream] [source/common/upstream/upstream_impl.cc:992] ClusterImplBase::finishInitialization()
[2020-09-11 18:05:58.976][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 0 added and 0 removed, updating hosts from 10 to 10, and healthy hosts from 0 to 10
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:594] warming cluster foo complete
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:612] updating TLS cluster foo
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:1176] Init ClusterEntry for foo at 0x563f2e21a340
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:1206] Initializing RoundRobin Load Balancer
[2020-09-11 18:05:58.978][401][upstream] [source/common/upstream/cluster_manager_impl.cc:369] onClusterInit
[2020-09-11 18:05:58.978][665][upstream] [source/common/upstream/cluster_manager_impl.cc:612] updating TLS cluster foo
[2020-09-11 18:05:58.978][665][upstream] [source/common/upstream/cluster_manager_impl.cc:1176] Init ClusterEntry for foo at 0x563f2e4189c0
[2020-09-11 18:05:58.978][665][upstream] [source/common/upstream/cluster_manager_impl.cc:1206] Initializing RoundRobin Load Balancer
[2020-09-11 18:05:58.981][665][misc] [source/common/upstream/load_balancer_impl.cc:808] chooseHostOnce: hosts_to_use.empty with source type AllHosts
[2020-09-11 18:05:58.981][665][upstream] [source/common/upstream/cluster_manager_impl.cc:1244] no healthy host for HTTP connection pool 0x563f2e4189c0
[2020-09-11 18:05:58.981][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 10
[2020-09-11 18:05:58.985][665][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 10
2020-09-11T18:05:58 Error making request: Expected HTTP code 200, but got 503 (with body "no healthy upstream")
2020-09-11T18:05:58 cluster.foo.membership_healthy = 10
2020-09-11T18:05:59 Traffic generator: 3441 requests, 1 errors

Based on the trace above, it looks like a load balancer for the updated cluster is used to route a request before its underlying host set has been initialized.
I'm not familiar enough with Envoy internals and could use some help understanding whether I'm somehow misconfiguring Envoy, or whether something else is happening.

@kryzthov-stripe kryzthov-stripe added bug triage Issue requires triage labels Sep 11, 2020
@mattklein123
Member

This sounds like another repeat of #13009? cc @htuch @ramaraochavali @snowp

@snowp
Contributor

snowp commented Sep 13, 2020

I'm not sure it's the same issue; if it were, I would expect the timestamps to show that we're hitting the EDS fetch timeout (15s) without a response, causing cluster warming to complete before any hosts are received.

The logs seem to indicate that the hosts are being added to the cluster, then health checked during warming:

[2020-09-11 18:05:58.897][401][upstream] [source/common/upstream/upstream_impl.cc:951] ClusterImplBase::initialize()
[2020-09-11 18:05:58.913][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 10 added and 0 removed, updating hosts from 0 to 10, and healthy hosts from 0 to 0
[2020-09-11 18:05:58.913][401][upstream] [source/common/upstream/upstream_impl.cc:967] initializing Secondary cluster foo completed
[2020-09-11 18:05:58.913][401][upstream] [source/common/upstream/upstream_impl.cc:972] ClusterImplBase::onInitDone()
2020-09-11T18:05:58 cluster.foo.membership_healthy = 0
[2020-09-11 18:05:58.976][401][upstream] [source/common/upstream/upstream_impl.cc:992] ClusterImplBase::finishInitialization()
[2020-09-11 18:05:58.976][401][misc] [source/common/upstream/upstream_impl.cc:386] HostSetImpl::updateHosts with 0 added and 0 removed, updating hosts from 10 to 10, and healthy hosts from 0 to 10

I wonder if the problem is that we create the thread-local (TLS) cluster and update its hosts in separate dispatcher events, so the two don't happen atomically?

@mattklein123
Member

Warming shouldn't complete until the first round of health checking is complete, so I'm not sure what is happening here. I can take a more detailed look later.

@snowp
Contributor

snowp commented Sep 13, 2020

The idea I mentioned in my last post isn't even necessarily about warming, but about the fact that we call

void ClusterManagerImpl::createOrUpdateThreadLocalCluster(ClusterData& cluster) {
  tls_->runOnAllThreads([this, new_cluster = cluster.cluster_->info(),
                         thread_aware_lb_factory = cluster.loadBalancerFactory()]() -> void {
    ThreadLocalClusterManagerImpl& cluster_manager =
        tls_->getTyped<ThreadLocalClusterManagerImpl>();

    if (cluster_manager.thread_local_clusters_.count(new_cluster->name()) > 0) {
      ENVOY_LOG(debug, "updating TLS cluster {}", new_cluster->name());
    } else {
      ENVOY_LOG(debug, "adding TLS cluster {}", new_cluster->name());
    }

    auto thread_local_cluster = new ThreadLocalClusterManagerImpl::ClusterEntry(
        cluster_manager, new_cluster, thread_aware_lb_factory);
    cluster_manager.thread_local_clusters_[new_cluster->name()].reset(thread_local_cluster);
    for (auto& cb : cluster_manager.update_callbacks_) {
      cb->onClusterAddOrUpdate(*thread_local_cluster);
    }
  });
}

before we do

  // Finally, if the cluster has any hosts, post updates cross-thread so the per-thread load
  // balancers are ready.
  for (auto& host_set : cluster.prioritySet().hostSetsPerPriority()) {
    if (host_set->hosts().empty()) {
      continue;
    }
    postThreadLocalClusterUpdate(cluster, host_set->priority(), host_set->hosts(), HostVector{});
  }

which means we get two dispatcher events: [update TLS cluster, set hosts], which could be interleaved with a dispatcher event that tries to route to this cluster.
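
To make the suspected interleaving concrete, here is a minimal standalone sketch (toy types only, none of this is Envoy code) of a single worker dispatcher draining its queue in the order [create empty cluster entry, route request, set hosts]:

// Toy sketch only: none of these types are Envoy's. A single worker thread
// drains a queue of posted events in order.
#include <deque>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct ToyClusterEntry {
  std::vector<std::string> healthy_hosts; // empty until the host update event runs
};

int main() {
  std::deque<std::function<void()>> worker_queue;
  ToyClusterEntry* cluster = nullptr;

  // Event 1: the main thread posts "create/replace thread-local cluster" (no hosts yet).
  worker_queue.push_back([&] { cluster = new ToyClusterEntry{}; });

  // A request arrives on the worker and its routing event lands between the two updates.
  worker_queue.push_back([&] {
    if (cluster == nullptr || cluster->healthy_hosts.empty()) {
      std::cout << "503 UH: no healthy upstream\n"; // the failure observed in this issue
    } else {
      std::cout << "routed to " << cluster->healthy_hosts.front() << "\n";
    }
  });

  // Event 2: the main thread posts the cross-thread host update afterwards.
  worker_queue.push_back([&] { cluster->healthy_hosts = {"10.0.0.1:8080"}; });

  // Draining in order gives: create entry -> route (fails) -> set hosts.
  while (!worker_queue.empty()) {
    worker_queue.front()();
    worker_queue.pop_front();
  }
  delete cluster;
}

Run as-is it prints "503 UH: no healthy upstream", because the routing event runs while the entry exists but its host set is still empty.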

Maybe there's something I missed that prevents this from being an issue; I didn't try to reproduce it.

If this is the case then any CDS update could possibly cause this, so the main reason I'm skeptical about this theory is that I would have expected it to have been discovered a long time ago.

@mattklein123
Member

If this is the case then any CDS update could possibly cause this, so the main reason I'm skeptical about this theory is that I would have expected it to have been discovered a long time ago.

No, I think you are right. This is definitely a race condition and broken. I'm guessing the reason it has not been reported before is that CDS updates are pretty rare and no one has noticed. We will need to fix this.

@mattklein123 mattklein123 added area/cluster_manager help wanted Needs help! and removed triage Issue requires triage labels Sep 13, 2020
@mattklein123 mattklein123 self-assigned this Sep 13, 2020
@ramaraochavali
Contributor

The fact that this specific issue ("We observed and reproduced this behavior on static or EDS clusters") happens on static clusters seems to indicate that it is not the same as #13009, and maybe @snowp's theory is correct.

@howardjohn
Contributor

I am fairly sure this is the cause of istio/istio#23029. I can reproduce this pretty reliably (not 100% of the time, but within a few minutes).

The test I am doing updates a cluster every 2s, changing a metadata field that does not matter. Meanwhile, I send 2k QPS.

Cluster config, which notably has no TLS or health check:


     "version_info": "2020-10-29T00:34:54Z/259",
     "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "outbound|8080||echo.default.svc.cluster.local",
      "type": "EDS",
      "eds_cluster_config": {
       "eds_config": {
        "ads": {},
        "initial_fetch_timeout": "3s",
        "resource_api_version": "V3"
       },
       "service_name": "outbound|8080||echo.default.svc.cluster.local"
      },
      "connect_timeout": "10s",
      "circuit_breakers": {
       "thresholds": [
        {
         "max_connections": 4294967295,
         "max_pending_requests": 4294967295,
         "max_requests": 4294967295,
         "max_retries": 4294967295
        }
       ]
      },
      "metadata": {
       "filter_metadata": {
        "istio": {
         "services": [
          {
           "namespace": "default",
           "host": "echo.default.svc.cluster.local",
           "name": "echo"
          }
         ],
         "default_original_port": 0.74407261806614822,
         "config": "/apis/networking.istio.io/v1alpha3/namespaces/istio-system/destination-rule/plaintext"
        }
       }
      }
     },
     "last_updated": "2020-10-29T00:34:54.197Z"
    }
   ]
  },

Note: the 3s initial_fetch_timeout was left over from playing around with #13009; it does not matter what I set it to, and even disabling it with 0s still reproduces the issue.

I captured trace logs for two updates that triggered this and two that did not.

While the above test is synthetic, this behavior is also consistently reproducible in real Knative workloads.

@mattklein123
Member

I can take a look at fixing this.

@mattklein123
Member

I looked at this and the fix is not difficult. I will fix soon.
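
For what it's worth, one conceivable shape for such a fix (a sketch under assumptions only, not necessarily what the eventual change does): install the thread-local cluster entry and its initial host set in a single posted event, so a worker can never observe the new entry with an empty host set. Reusing the toy dispatcher sketch from earlier in the thread:

// Toy sketch only (same toy types as the earlier sketch, not Envoy code):
// one posted event installs the entry together with its initial hosts.
#include <deque>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct ToyClusterEntry {
  std::vector<std::string> healthy_hosts;
};

int main() {
  std::deque<std::function<void()>> worker_queue;
  ToyClusterEntry* cluster = nullptr;
  std::vector<std::string> initial_hosts = {"10.0.0.1:8080"}; // snapshot taken when posting

  // Single event: publish the entry only after its hosts are populated.
  worker_queue.push_back([&, initial_hosts] {
    auto* entry = new ToyClusterEntry{};
    entry->healthy_hosts = initial_hosts;
    cluster = entry;
  });

  // Any routing event queued after the update now always sees the hosts.
  worker_queue.push_back([&] {
    std::cout << (cluster->healthy_hosts.empty() ? "503 UH\n" : "routed\n");
  });

  while (!worker_queue.empty()) {
    worker_queue.front()();
    worker_queue.pop_front();
  }
  delete cluster;
}

With the two updates collapsed into one posted event, a routing event can land before or after the update, but never between the entry creation and the host assignment.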
