
cluster manager: make add/update and initial host set update atomic #13906

Merged: 5 commits merged into master from fix_cm_init on Nov 13, 2020

Conversation

mattklein123 (Member)

Previously, we would do a separate TLS (thread local storage) set to add/update the cluster, and then another TLS operation to populate the initial host set. For cluster updates on a server already in operation, this can lead to a window with no hosts.

Fixes #13070

Risk Level: High
Testing: I didn't add any new tests, because it's unclear to me what tests to add.
I could have added a unit test to cover the race condition, but it would be broken
by the change and then deleted. Note that getting all of the existing tests to pass
was quite difficult, so I do feel there is good coverage here, but if anyone has good
ideas on how to better test this please let me know.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
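To make the race concrete, here is a minimal, self-contained C++ sketch of the before/after behavior described above. All names (ThreadLocalClusterMap, runOnAllThreads, etc.) are hypothetical stand-ins, not the actual Envoy cluster manager code:

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the worker-side state; not Envoy's real types.
struct ThreadLocalCluster {
  std::vector<std::string> hosts_;
};
struct ThreadLocalClusterMap {
  std::unordered_map<std::string, ThreadLocalCluster> clusters_;
};

// Stand-in for posting a callback to every worker thread's TLS slot.
// Here it just runs against a single simulated worker.
void runOnAllThreads(const std::function<void(ThreadLocalClusterMap&)>& cb) {
  static ThreadLocalClusterMap worker_map;
  cb(worker_map);
}

// Before the fix: two separate posts. Between (1) and (2) a worker can
// observe the cluster with an empty host set and fail requests.
void addOrUpdateClusterRacy(const std::string& name,
                            const std::vector<std::string>& hosts) {
  runOnAllThreads([name](ThreadLocalClusterMap& map) {
    map.clusters_[name] = ThreadLocalCluster{};  // (1) cluster set, no hosts yet
  });
  runOnAllThreads([name, hosts](ThreadLocalClusterMap& map) {
    map.clusters_[name].hosts_ = hosts;          // (2) hosts populated later
  });
}

// After the fix: a single post carries both the cluster and its initial
// host set, so workers never observe an intermediate empty state.
void addOrUpdateClusterAtomic(const std::string& name,
                              const std::vector<std::string>& hosts) {
  runOnAllThreads([name, hosts](ThreadLocalClusterMap& map) {
    map.clusters_[name] = ThreadLocalCluster{hosts};
  });
}
```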

@mattklein123 (Member, Author)

@snowp @lambdai could you take a first pass on this? This was much harder to fix than I thought. I tried to do the minimal change possible and left some TODOs for follow ups.

cc @rgs1 it would be great if you would be willing to smoke test this once we get past the initial round of reviews.

@rgs1 (Member) left a comment

Happy to smoke test this when more eyes have made a pass.

@lambdai (Contributor) left a comment

So the major change is that createOrUpdateCluster never happens on its own; it now goes together with the host update.

At a high level I feel it's fine. The question is whether a non-add/update change deserves an individual post. My concern is that a cluster update does not always come with a membership change, and Envoy should propagate the cluster update ASAP instead of waiting for the next membership change.

@mattklein123 (Member, Author)

> My concern is that a cluster update does not always come with a membership change, and Envoy should propagate the cluster update ASAP instead of waiting for the next membership change.

Essentially, this is the fix. Now, we could track whether the cluster is new or updated, and only do this in the updating case, but I'm not sure the extra logic is worth it. In the bootstrap case, we do what you describe above, but as I mentioned in the TODOs I would actually like to remove it from the few paths that require it (e.g. route validation).

@snowp (Contributor) left a comment

At a high level this makes perfect sense. Is there any way we could try racing requests and cluster updates and check for intermittent failures?

@mattklein123 (Member, Author)

> Is there any way we could try racing requests and cluster updates and check for intermittent failures?

You mean like in an integration test? This seems kind of fragile from a test perspective.

@snowp (Contributor) commented Nov 9, 2020

> You mean like in an integration test? This seems kind of fragile from a test perspective.

Yeah, that was my thinking: just to have something that could hit this in case of a regression. But a test like that would likely just end up being flaky if something broke, and a flaky failure would be painful to debug, so it might not be a great idea.
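For context, here is a minimal generic C++ sketch of the kind of racing check discussed above (plain threads and a mutex-guarded map, not Envoy's integration test framework); it also shows why such a test would be flaky by construction:

```cpp
#include <atomic>
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct Cluster {
  std::vector<std::string> hosts;
};

std::mutex mu;
std::unordered_map<std::string, Cluster> clusters;

// The pre-fix, two-step update: the cluster briefly exists with no hosts.
void racyUpdate() {
  { std::lock_guard<std::mutex> l(mu); clusters["backend"] = Cluster{}; }
  { std::lock_guard<std::mutex> l(mu); clusters["backend"].hosts = {"10.0.0.1"}; }
}

int main() {
  std::atomic<bool> done{false};
  std::atomic<int> empty_windows{0};

  std::thread updater([&] {
    for (int i = 0; i < 100000; ++i) racyUpdate();
    done = true;
  });
  std::thread requester([&] {
    while (!done) {
      std::lock_guard<std::mutex> l(mu);
      auto it = clusters.find("backend");
      // An empty host set here is the "no healthy upstream" window.
      if (it != clusters.end() && it->second.hosts.empty()) ++empty_windows;
    }
  });
  updater.join();
  requester.join();
  // Nonzero only when the requester happens to win the race, which is
  // exactly the intermittent behavior that makes this fragile as a CI test.
  std::printf("observed %d empty-host windows\n", empty_windows.load());
  return 0;
}
```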

@mattklein123 (Member, Author)

@lambdai @snowp updated per comments.

@rgs1 I think this should be ready for a smoke test. Thank you!

@snowp (Contributor) left a comment

Thanks, this LGTM. Definitely a high risk change that's hard to test, so would want to see some smoke testing results if possible before merging.

@mattklein123 (Member, Author)

I'm going to work on the follow-up cleanup to this PR. @rgs1 depending on when you have time to smoke test we might want to just get them both in together.

@mattklein123 added the no stalebot label (disables stalebot from closing an issue) on Nov 11, 2020
@rgs1 (Member) commented Nov 11, 2020

> I'm going to work on the follow-up cleanup to this PR. @rgs1 depending on when you have time to smoke test we might want to just get them both in together.

Won't happen before Thu, I am afraid. I've been looking at the new TCP conn pool with Alyssa. So if you can have both by then, I'll test them together.

@rgs1 (Member) commented Nov 13, 2020

> I'm going to work on the follow-up cleanup to this PR. @rgs1 depending on when you have time to smoke test we might want to just get them both in together.
>
> Won't happen before Thu, I am afraid. I've been looking at the new TCP conn pool with Alyssa. So if you can have both by then, I'll test them together.

Ok, giving this a try now.

@rgs1 (Member) commented Nov 13, 2020

@mattklein123 so it's not breaking anything, but I don't have a repro handy.

I guess the repro would be something like:

a) push a cluster update (e.g. change health check settings)
b) see if the membership count drops and/or 5xxs happen?

@mattklein123 (Member, Author)

@rgs1 the repro would be to do CDS updates under high traffic load. However, I think I'm OK with merging assuming nothing broke. :) So let's go from there. Thank you!

@mattklein123 merged commit d0dda3f into master on Nov 13, 2020
@mattklein123 deleted the fix_cm_init branch on Nov 13, 2020
mattklein123 added a commit that referenced this pull request Nov 28, 2020
This is a follow up to #13906. It replaces use of the thread local
clusters with the main thread clusters() output for static route
validation. This will enable further cleanups in the cluster manager
code.

Signed-off-by: Matt Klein <[email protected]>
mattklein123 added a commit that referenced this pull request Dec 2, 2020
…ter validation (#14204)

This is a follow up to #13906. It replaces use of the thread local
clusters with the main thread clusters() output for static route
validation. This will enable further cleanups in the cluster manager
code.

Signed-off-by: Matt Klein <[email protected]>
mattklein123 added a commit that referenced this pull request Dec 8, 2020
This is part of several follow-ups to #13906. This change moves various
functions from the cluster API to the thread local cluster API. This
simplifies error handling and will also increase performance slightly,
as many instances of double map lookups are now a single lookup.

Signed-off-by: Matt Klein <[email protected]>
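To make the double-lookup point concrete, here is a hedged C++ sketch with simplified stand-in types (ConnectionPool, getThreadLocalCluster, and httpConnPool are illustrative names, not the exact Envoy signatures):

```cpp
#include <string>
#include <unordered_map>

// Simplified stand-ins; not the real Envoy interfaces.
struct ConnectionPool {};

struct ThreadLocalCluster {
  // New-style entry point: the caller already holds the cluster, so no
  // further map lookup is needed.
  ConnectionPool* httpConnPool() { return &pool_; }
  ConnectionPool pool_;
};

struct ClusterManager {
  ThreadLocalCluster* getThreadLocalCluster(const std::string& name) {
    auto it = clusters_.find(name);
    return it == clusters_.end() ? nullptr : &it->second;
  }
  // Old-style entry point: keyed by name, so it repeats the map lookup the
  // caller typically already performed, and every caller must handle the
  // "cluster not found" case separately.
  ConnectionPool* httpConnPoolForCluster(const std::string& name) {
    ThreadLocalCluster* cluster = getThreadLocalCluster(name);  // second lookup
    return cluster ? cluster->httpConnPool() : nullptr;
  }
  std::unordered_map<std::string, ThreadLocalCluster> clusters_;
};

// After the refactor, callers look up once and keep the handle:
//   if (ThreadLocalCluster* c = cm.getThreadLocalCluster("backend")) {
//     ConnectionPool* pool = c->httpConnPool();  // no second map lookup
//   }
```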
mattklein123 added a commit that referenced this pull request Dec 11, 2020
…ter (#14332)

This is part of several follow-ups to #13906. This change moves various
functions from the cluster API to the thread local cluster API. This
simplifies error handling and will also increase performance slightly,
as many instances of double map lookups are now a single lookup.

Signed-off-by: Matt Klein <[email protected]>
mattklein123 added a commit that referenced this pull request Dec 11, 2020
Final follow up from #13906. This PR does:
1) Simplify the logic during startup by making thread local clusters
   only appear after a cluster has been initialized. This is now uniform
   both for bootstrap clusters as well as CDS clusters, making the logic
   simpler to follow.
2) The aggregate cluster needed fixes due to assumptions about the
   thread local cluster existing at startup. This change also
   fixes #14119
3) Make TLS mocks verify that set() is called before other functions.

Signed-off-by: Matt Klein <[email protected]>
mattklein123 added a commit that referenced this pull request Dec 14, 2020
Final follow up from #13906. This PR does:
1) Simplify the logic during startup by making thread local clusters
   only appear after a cluster has been initialized. This is now uniform
   both for bootstrap clusters as well as CDS clusters, making the logic
   simpler to follow.
2) The aggregate cluster needed fixes due to assumptions about the
   thread local cluster existing at startup. This change also
   fixes #14119
3) Make TLS mocks verify that set() is called before other functions.

Signed-off-by: Matt Klein <[email protected]>
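As a rough illustration of point 1 above, here is a sketch of the startup invariant under hypothetical names (not the real cluster manager code): a cluster is published to the worker-visible map only once its initialization completes, uniformly for bootstrap and CDS clusters.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the startup invariant; not Envoy's real code.
struct Cluster {
  std::string name;
  std::vector<std::string> initial_hosts;
  // Fires once the cluster has finished initializing (e.g. initial service
  // discovery or the first health check round).
  std::function<void()> init_complete_callback;
};

class ClusterManagerSketch {
public:
  // Uniform path for bootstrap and CDS clusters: register an init callback
  // instead of publishing the cluster to workers immediately.
  void addOrUpdateCluster(Cluster& cluster) {
    cluster.init_complete_callback =
        [this, name = cluster.name, hosts = cluster.initial_hosts] {
          // Only now does the thread local cluster appear, and it appears
          // together with its initial host set (the atomic update from this PR).
          publishToWorkers(name, hosts);
        };
  }

private:
  // Stand-in for the single TLS post to all workers.
  void publishToWorkers(const std::string& name,
                        const std::vector<std::string>& hosts) {
    worker_visible_clusters_[name] = hosts;
  }
  std::unordered_map<std::string, std::vector<std::string>> worker_visible_clusters_;
};
```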
@lambdai added the backport/review label (request to backport to stable releases) on Feb 1, 2021
@Shikugawa added the backport/approved label (approved backports to stable releases) and removed the backport/review label on Feb 5, 2021
Merging this pull request closed: Transient 503 UH "no healthy upstream" errors during CDS updates