Cluster initialization issue. Envoy start could be blocked forever #13874

Closed
lambdai opened this issue Nov 3, 2020 · 6 comments · Fixed by #13875

Comments

@lambdai (Contributor) commented Nov 3, 2020

If a cluster is warming during server start, a second update to this cluster may put Envoy into a state where the start never happens.

Note that the second update must be "real", i.e. the config cannot be detected as a duplicate of the existing config.

The full story:

  1. Cluster A is warming up.
  2. CDS delivers a new config for cluster A.
  3. Envoy removes the old warming cluster [1], but the warming secondary cluster counter is not decremented [2].
  4. Envoy creates a new cluster A with the new config.

The counter never being decremented blocks Envoy initialization; see istio/istio#28500 (comment).
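To illustrate why the missed decrement wedges startup, here is a minimal standalone sketch (not Envoy's actual init manager; PendingClusterTracker and its methods are hypothetical names): the "all clusters ready" callback fires only when the pending secondary cluster count drops to zero, so a cluster that is counted twice but decremented once keeps the count stuck above zero forever.

#include <functional>
#include <iostream>
#include <string>
#include <utility>

class PendingClusterTracker {
public:
  explicit PendingClusterTracker(std::function<void()> on_all_ready)
      : on_all_ready_(std::move(on_all_ready)) {}

  // A secondary cluster starts warming: "secondary=" goes up in the logs.
  void addWarmingCluster(const std::string& name) {
    ++pending_;
    std::cout << "adding: cluster=" << name << " secondary=" << pending_ << "\n";
  }

  // A cluster finishes warming: "secondary=" goes down in the logs.
  void finishWarming(const std::string& name) {
    --pending_;
    std::cout << "init complete: cluster=" << name << " secondary=" << pending_ << "\n";
    if (pending_ == 0) {
      on_all_ready_();  // Server start only proceeds once this fires.
    }
  }

private:
  unsigned pending_{0};
  std::function<void()> on_all_ready_;
};

int main() {
  PendingClusterTracker tracker([] { std::cout << "all clusters ready, server starts\n"; });

  // 1-2. Cluster A is warming and CDS delivers a new config for it.
  tracker.addWarmingCluster("A");  // pending_ == 1
  // 3. The old warming cluster A is destroyed, but nothing decrements pending_.
  // 4. A new cluster A is created from the new config and counted again.
  tracker.addWarmingCluster("A");  // pending_ == 2: counted twice
  tracker.finishWarming("A");      // pending_ == 1: decremented once, never reaches 0
}

Running this prints two "adding" lines and one "init complete" line for the same cluster, mirroring the logs below where the cluster is added at secondary=31 and again at secondary=35 but completes only once, and the ready callback never runs.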

I am not fully sure how bad destroy-before-add is. In the LDS API this is explicitly forbidden, since destroying the last warming resource would prematurely signal the init manager that "all clusters are ready".

The entire log file can be found in the attached Istio issue.

[1] Init manager destroyed before init:

2020-11-03T01:05:59.112926Z	debug	envoy upstream	cm init: adding: cluster=outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local primary=0 secondary=31
2020-11-03T01:05:59.112929Z	info	envoy upstream	cds: add/update cluster 'outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local'
2020-11-03T01:05:59.116583Z	debug	envoy upstream	initializing secondary cluster outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local
2020-11-03T01:05:59.124471Z	debug	envoy init	init manager Cluster outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local destroyed
2020-11-03T01:05:59.124483Z	debug	envoy upstream	add/update cluster outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local during init

[2] Counter incremented twice, decremented once:

$ grep "secondary=" logs.txt |grep -B 1 "svc00-0-6-0.service-graph00"
2020-11-03T01:05:59.112393Z	debug	envoy upstream	cm init: adding: cluster=outbound|8080||svc00-0-5-0.service-graph00.svc.cluster.local primary=0 secondary=30
2020-11-03T01:05:59.112926Z	debug	envoy upstream	cm init: adding: cluster=outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local primary=0 secondary=31
--
2020-11-03T01:05:59.116213Z	debug	envoy upstream	cm init: adding: cluster=InboundPassthroughClusterIpv4 primary=0 secondary=34
2020-11-03T01:05:59.124538Z	debug	envoy upstream	cm init: adding: cluster=outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local primary=0 secondary=35
2020-11-03T01:05:59.126380Z	debug	envoy upstream	cm init: init complete: cluster=outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local primary=0 secondary=34
@lambdai (Contributor, Author) commented Nov 3, 2020

#13875 should fix the 2nd issue.

@howardjohn (Contributor)
Confirmed that this breaks with only istio#281, which corresponds to #12783 and #13344.

@lambdai (Contributor, Author) commented Nov 3, 2020

Reproduced the failure in the clang-asan config. secondary_init_clusters_ contains a wild pointer:

    secondary_init_clusters_.remove_if(
        [name_to_remove = cluster.info()->name()](Cluster* cluster_iter) {
          return cluster_iter->info()->name() == name_to_remove;
        });
source/common/upstream/cluster_manager_impl.cc:72:32: runtime error: member call on address 0x617000048380 which does not point to an object of type 'Envoy::Upstream::Cluster'
0x617000048380: note: object has a possibly invalid vptr: abs(offset to top) too big
 98 01 80 71  cb 01 80 00 00 00 00 00  f0 c5 3b 1d 00 00 00 00  21 00 00 00 00 00 00 00  1e 00 00 00
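For illustration, a minimal standalone sketch (an assumed simplification, not the real cluster manager data structures) of how a list of non-owning pointers ends up with a dangling entry when the warming cluster is destroyed before it is removed from the list:

#include <list>
#include <memory>
#include <string>

struct Cluster {
  std::string name;
};

int main() {
  // Non-owning raw pointers, analogous to secondary_init_clusters_.
  std::list<Cluster*> secondary_init_clusters;

  auto cluster_a = std::make_unique<Cluster>();
  cluster_a->name = "A";
  secondary_init_clusters.push_back(cluster_a.get());

  // A CDS update replaces the warming cluster: the old object is destroyed...
  cluster_a.reset();
  // ...but its pointer is never removed from the pending list.

  // Removing "A" by name later dereferences the destroyed object, which is
  // the "member call on address ... does not point to an object" error above.
  secondary_init_clusters.remove_if(
      [](Cluster* cluster_iter) { return cluster_iter->name == "A"; });
}

Built with -fsanitize=address, this sketch reports a heap-use-after-free at the remove_if call; the dangling-pointer cause is the same as in the report above.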

@mattklein123 (Member)
cc @rgs1: this is causing multiple issues. We are going to revert the original PR, so let's go from there.

@lambdai (Contributor, Author) commented Nov 3, 2020

@mattklein123 @rgs1
Hold on... let me see if I can put together a quick fix for the 2nd issue.

@lambdai (Contributor, Author) commented Nov 3, 2020

And I suspect the first bug (early cluster removal) is not a regression.
