Cluster initialization issue. Envoy start could be blocked forever #13874

Closed
lambdai opened this issue Nov 3, 2020 · 6 comments · Fixed by #13875

Comments

@lambdai (Contributor) commented Nov 3, 2020

If a cluster is warming during server start, a second update to this cluster may put Envoy into a state where the start never happens.

Note that the second update must be "real", i.e. the config cannot be detected as a duplicate of the existing config.

The full story:

  1. Cluster A is warming up.
  2. CDS delivers a new config for cluster A.
  3. Envoy removes the old warming cluster [1], but the warming secondary cluster counter is not decremented [2].
  4. Envoy creates a new cluster A with the new config.

The counter never being decremented blocks Envoy initialization; see istio/istio#28500 (comment).
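To illustrate why the missed decrement wedges startup, here is a minimal standalone sketch (not Envoy's actual init manager; PendingClusterTracker and its methods are hypothetical names): the "all clusters ready" callback fires only when the pending secondary cluster count drops to zero, so a cluster that is counted twice but decremented once keeps the count stuck above zero forever.

#include <functional>
#include <iostream>
#include <string>
#include <utility>

class PendingClusterTracker {
public:
  explicit PendingClusterTracker(std::function<void()> on_all_ready)
      : on_all_ready_(std::move(on_all_ready)) {}

  // A secondary cluster starts warming: "secondary=" goes up in the logs.
  void addWarmingCluster(const std::string& name) {
    ++pending_;
    std::cout << "adding: cluster=" << name << " secondary=" << pending_ << "\n";
  }

  // A cluster finishes warming: "secondary=" goes down in the logs.
  void finishWarming(const std::string& name) {
    --pending_;
    std::cout << "init complete: cluster=" << name << " secondary=" << pending_ << "\n";
    if (pending_ == 0) {
      on_all_ready_();  // Server start only proceeds once this fires.
    }
  }

private:
  unsigned pending_{0};
  std::function<void()> on_all_ready_;
};

int main() {
  PendingClusterTracker tracker([] { std::cout << "all clusters ready, server starts\n"; });

  // 1-2. Cluster A is warming and CDS delivers a new config for it.
  tracker.addWarmingCluster("A");  // pending_ == 1
  // 3. The old warming cluster A is destroyed, but nothing decrements pending_.
  // 4. A new cluster A is created from the new config and counted again.
  tracker.addWarmingCluster("A");  // pending_ == 2: counted twice
  tracker.finishWarming("A");      // pending_ == 1: decremented once, never reaches 0
}

Running this prints two "adding" lines and one "init complete" line for the same cluster, mirroring the logs below where the cluster is added at secondary=31 and again at secondary=35 but completes only once, and the ready callback never runs.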

I am not fully sure how bad destroy-before-add is. In the LDS API this is explicitly forbidden, since destroying the last warming resource would prematurely signal the init manager that "all clusters are ready".

The entire log file can be found in the attached Istio issue.

[1] Init manager destroyed before init:

2020-11-03T01:05:59.112926Z	debug	envoy upstream	cm init: adding: cluster=outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local primary=0 secondary=31
2020-11-03T01:05:59.112929Z	info	envoy upstream	cds: add/update cluster 'outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local'
2020-11-03T01:05:59.116583Z	debug	envoy upstream	initializing secondary cluster outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local
2020-11-03T01:05:59.124471Z	debug	envoy init	init manager Cluster outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local destroyed
2020-11-03T01:05:59.124483Z	debug	envoy upstream	add/update cluster outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local during init

[2] Counter incremented twice, decremented once:

$ grep "secondary=" logs.txt |grep -B 1 "svc00-0-6-0.service-graph00"
2020-11-03T01:05:59.112393Z	debug	envoy upstream	cm init: adding: cluster=outbound|8080||svc00-0-5-0.service-graph00.svc.cluster.local primary=0 secondary=30
2020-11-03T01:05:59.112926Z	debug	envoy upstream	cm init: adding: cluster=outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local primary=0 secondary=31
--
2020-11-03T01:05:59.116213Z	debug	envoy upstream	cm init: adding: cluster=InboundPassthroughClusterIpv4 primary=0 secondary=34
2020-11-03T01:05:59.124538Z	debug	envoy upstream	cm init: adding: cluster=outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local primary=0 secondary=35
2020-11-03T01:05:59.126380Z	debug	envoy upstream	cm init: init complete: cluster=outbound|8080||svc00-0-6-0.service-graph00.svc.cluster.local primary=0 secondary=34
@lambdai (Contributor, Author) commented Nov 3, 2020

#13875 should fix the 2nd issue.

@howardjohn (Contributor)
Confirmed that this breaks with only istio#281, which corresponds to #12783 and #13344.

@lambdai (Contributor, Author) commented Nov 3, 2020

Reproduced the failure in the clang-asan config. secondary_init_clusters_ contains a wild pointer:

    secondary_init_clusters_.remove_if(
        [name_to_remove = cluster.info()->name()](Cluster* cluster_iter) {
          return cluster_iter->info()->name() == name_to_remove;
        });
source/common/upstream/cluster_manager_impl.cc:72:32: runtime error: member call on address 0x617000048380 which does not point to an object of type 'Envoy::Upstream::Cluster'
0x617000048380: note: object has a possibly invalid vptr: abs(offset to top) too big
 98 01 80 71  cb 01 80 00 00 00 00 00  f0 c5 3b 1d 00 00 00 00  21 00 00 00 00 00 00 00  1e 00 00 00
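For illustration, a minimal standalone sketch (an assumed simplification, not the real cluster manager data structures) of how a list of non-owning pointers ends up with a dangling entry when the warming cluster is destroyed before it is removed from the list:

#include <list>
#include <memory>
#include <string>

struct Cluster {
  std::string name;
};

int main() {
  // Non-owning raw pointers, analogous to secondary_init_clusters_.
  std::list<Cluster*> secondary_init_clusters;

  auto cluster_a = std::make_unique<Cluster>();
  cluster_a->name = "A";
  secondary_init_clusters.push_back(cluster_a.get());

  // A CDS update replaces the warming cluster: the old object is destroyed...
  cluster_a.reset();
  // ...but its pointer is never removed from the pending list.

  // Removing "A" by name later dereferences the destroyed object, which is
  // the "member call on address ... does not point to an object" error above.
  secondary_init_clusters.remove_if(
      [](Cluster* cluster_iter) { return cluster_iter->name == "A"; });
}

Built with -fsanitize=address, this sketch reports a heap-use-after-free at the remove_if call; the dangling-pointer cause is the same as in the report above.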

@mattklein123 (Member)
cc @rgs1: this is causing multiple issues. We are going to revert the original PR, so let's go from there.

@lambdai (Contributor, Author) commented Nov 3, 2020

@mattklein123 @rgs1
Hold on... let me see if I can put together a quick fix for the 2nd issue.

@lambdai (Contributor, Author) commented Nov 3, 2020

And I suspect the first bug (early cluster removal) is not a regression.
