
cluster manager: make add/update and initial host set update atomic #13906

Merged: 5 commits merged into master from fix_cm_init on Nov 13, 2020

Conversation

mattklein123 (Member)

Previously, we would do a separate TLS (thread local storage) set to add/update the cluster, and then another TLS operation to populate the initial host set. For cluster updates on a server already in operation, this can lead to a window with no hosts.

Fixes #13070

Risk Level: High
Testing: I didn't add any new tests, because it's unclear to me what tests to add.
I could have added a unit test to cover the race condition, but it would be broken
by the change and then deleted. Note that getting all of the existing tests to pass
was quite difficult, so I do feel there is good coverage here, but if anyone has good
ideas on how to better test this please let me know.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
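To make the race concrete, here is a minimal, self-contained C++ sketch of the before/after behavior described above. All names (ThreadLocalClusterMap, runOnAllThreads, etc.) are hypothetical stand-ins, not the actual Envoy cluster manager code:

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the worker-side state; not Envoy's real types.
struct ThreadLocalCluster {
  std::vector<std::string> hosts_;
};
struct ThreadLocalClusterMap {
  std::unordered_map<std::string, ThreadLocalCluster> clusters_;
};

// Stand-in for posting a callback to every worker thread's TLS slot.
// Here it just runs against a single simulated worker.
void runOnAllThreads(const std::function<void(ThreadLocalClusterMap&)>& cb) {
  static ThreadLocalClusterMap worker_map;
  cb(worker_map);
}

// Before the fix: two separate posts. Between (1) and (2) a worker can
// observe the cluster with an empty host set and fail requests.
void addOrUpdateClusterRacy(const std::string& name,
                            const std::vector<std::string>& hosts) {
  runOnAllThreads([name](ThreadLocalClusterMap& map) {
    map.clusters_[name] = ThreadLocalCluster{};  // (1) cluster set, no hosts yet
  });
  runOnAllThreads([name, hosts](ThreadLocalClusterMap& map) {
    map.clusters_[name].hosts_ = hosts;          // (2) hosts populated later
  });
}

// After the fix: a single post carries both the cluster and its initial
// host set, so workers never observe an intermediate empty state.
void addOrUpdateClusterAtomic(const std::string& name,
                              const std::vector<std::string>& hosts) {
  runOnAllThreads([name, hosts](ThreadLocalClusterMap& map) {
    map.clusters_[name] = ThreadLocalCluster{hosts};
  });
}
```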

@mattklein123 (Member, Author)

@snowp @lambdai could you take a first pass on this? This was much harder to fix than I thought. I tried to do the minimal change possible and left some TODOs for follow ups.

cc @rgs1 it would be great if you would be willing to smoke test this once we get past the initial round of reviews.

@rgs1 (Member) left a comment

Happy to smoke test this when more eyes have made a pass.

@lambdai (Contributor) left a comment

So the major change is that createOrUpdateCluster never happens on its own; it now goes together with the host update.

At a high level I feel it's fine. The question is whether a non-add/update change deserves an individual post. My concern is that a cluster update does not always come with a membership change, and Envoy should propagate the cluster update ASAP instead of waiting for the next membership change.

@mattklein123 (Member, Author)

> My concern is that a cluster update does not always come with a membership change, and Envoy should propagate the cluster update ASAP instead of waiting for the next membership change.

Essentially, this is the fix. Now, we could track whether the cluster is new or updated, and only do this in the updating case, but I'm not sure the extra logic is worth it. In the bootstrap case, we do what you describe above, but as I mentioned in the TODOs I would actually like to remove it from the few paths that require it (e.g. route validation).

@snowp (Contributor) left a comment

At a high level this makes perfect sense. Is there any way we could try racing requests and cluster updates and check for intermittent failures?

@mattklein123 (Member, Author)

> Is there any way we could try racing requests and cluster updates and check for intermittent failures?

You mean like in an integration test? This seems kind of fragile from a test perspective.

@snowp (Contributor) commented Nov 9, 2020

> You mean like in an integration test? This seems kind of fragile from a test perspective.

Yeah, that was my thinking: just to have something that could hit this in case of a regression. But a test like that would likely just end up being flaky if something broke, and a flaky failure would be painful to debug, so it might not be a great idea.
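For context, here is a minimal generic C++ sketch of the kind of racing check discussed above (plain threads and a mutex-guarded map, not Envoy's integration test framework); it also shows why such a test would be flaky by construction:

```cpp
#include <atomic>
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

struct Cluster {
  std::vector<std::string> hosts;
};

std::mutex mu;
std::unordered_map<std::string, Cluster> clusters;

// The pre-fix, two-step update: the cluster briefly exists with no hosts.
void racyUpdate() {
  { std::lock_guard<std::mutex> l(mu); clusters["backend"] = Cluster{}; }
  { std::lock_guard<std::mutex> l(mu); clusters["backend"].hosts = {"10.0.0.1"}; }
}

int main() {
  std::atomic<bool> done{false};
  std::atomic<int> empty_windows{0};

  std::thread updater([&] {
    for (int i = 0; i < 100000; ++i) racyUpdate();
    done = true;
  });
  std::thread requester([&] {
    while (!done) {
      std::lock_guard<std::mutex> l(mu);
      auto it = clusters.find("backend");
      // An empty host set here is the "no healthy upstream" window.
      if (it != clusters.end() && it->second.hosts.empty()) ++empty_windows;
    }
  });
  updater.join();
  requester.join();
  // Nonzero only when the requester happens to win the race, which is
  // exactly the intermittent behavior that makes this fragile as a CI test.
  std::printf("observed %d empty-host windows\n", empty_windows.load());
  return 0;
}
```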

@mattklein123 (Member, Author)

@lambdai @snowp updated per comments.

@rgs1 I think this should be ready for a smoke test. Thank you!

@snowp (Contributor) left a comment

Thanks, this LGTM. Definitely a high risk change that's hard to test, so would want to see some smoke testing results if possible before merging.

@mattklein123 (Member, Author)

I'm going to work on the follow-up cleanup to this PR. @rgs1 depending on when you have time to smoke test we might want to just get them both in together.

@mattklein123 added the no stalebot label (disables stalebot from closing an issue) on Nov 11, 2020
@rgs1 (Member) commented Nov 11, 2020

> I'm going to work on the follow-up cleanup to this PR. @rgs1 depending on when you have time to smoke test we might want to just get them both in together.

Won't happen before Thu, I am afraid. I've been looking at the new TCP conn pool with Alyssa. So if you can have both by then, I'll test them together.

@rgs1 (Member) commented Nov 13, 2020

> I'm going to work on the follow-up cleanup to this PR. @rgs1 depending on when you have time to smoke test we might want to just get them both in together.
>
> Won't happen before Thu, I am afraid. I've been looking at the new TCP conn pool with Alyssa. So if you can have both by then, I'll test them together.

Ok, giving this a try now.

@rgs1 (Member) commented Nov 13, 2020

@mattklein123 so it's not breaking anything, but I don't have a repro handy.

I guess the repro would be something like:

a) push a cluster update (e.g. change health check settings)
b) see if the membership count drops and/or 5xxs happen?

@mattklein123 (Member, Author)

@rgs1 the repro would be to do CDS updates under high traffic load. However, I think I'm OK with merging assuming nothing broke. :) So let's go from there. Thank you!

@mattklein123 merged commit d0dda3f into master on Nov 13, 2020
@mattklein123 deleted the fix_cm_init branch on Nov 13, 2020
mattklein123 added a commit that referenced this pull request Nov 28, 2020
This is a follow up to #13906. It replaces use of the thread local
clusters with the main thread clusters() output for static route
validation. This will enable further cleanups in the cluster manager
code.

Signed-off-by: Matt Klein <[email protected]>
mattklein123 added a commit that referenced this pull request Dec 2, 2020
…ter validation (#14204)

This is a follow up to #13906. It replaces use of the thread local
clusters with the main thread clusters() output for static route
validation. This will enable further cleanups in the cluster manager
code.

Signed-off-by: Matt Klein <[email protected]>
mattklein123 added a commit that referenced this pull request Dec 8, 2020
This is part of several follow-ups to #13906. This change moves various
functions from the cluster API to the thread local cluster API. This
simplifies error handling and will also increase performance slightly,
as many instances of double map lookups are now a single lookup.

Signed-off-by: Matt Klein <[email protected]>
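To make the double-lookup point concrete, here is a hedged C++ sketch with simplified stand-in types (ConnectionPool, getThreadLocalCluster, and httpConnPool are illustrative names, not the exact Envoy signatures):

```cpp
#include <string>
#include <unordered_map>

// Simplified stand-ins; not the real Envoy interfaces.
struct ConnectionPool {};

struct ThreadLocalCluster {
  // New-style entry point: the caller already holds the cluster, so no
  // further map lookup is needed.
  ConnectionPool* httpConnPool() { return &pool_; }
  ConnectionPool pool_;
};

struct ClusterManager {
  ThreadLocalCluster* getThreadLocalCluster(const std::string& name) {
    auto it = clusters_.find(name);
    return it == clusters_.end() ? nullptr : &it->second;
  }
  // Old-style entry point: keyed by name, so it repeats the map lookup the
  // caller typically already performed, and every caller must handle the
  // "cluster not found" case separately.
  ConnectionPool* httpConnPoolForCluster(const std::string& name) {
    ThreadLocalCluster* cluster = getThreadLocalCluster(name);  // second lookup
    return cluster ? cluster->httpConnPool() : nullptr;
  }
  std::unordered_map<std::string, ThreadLocalCluster> clusters_;
};

// After the refactor, callers look up once and keep the handle:
//   if (ThreadLocalCluster* c = cm.getThreadLocalCluster("backend")) {
//     ConnectionPool* pool = c->httpConnPool();  // no second map lookup
//   }
```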
mattklein123 added a commit that referenced this pull request Dec 11, 2020
…ter (#14332)

This is part of several follow-ups to #13906. This change moves various
functions from the cluster API to the thread local cluster API. This
simplifies error handling and will also increase performance slightly,
as many instances of double map lookups are now a single lookup.

Signed-off-by: Matt Klein <[email protected]>
mattklein123 added a commit that referenced this pull request Dec 11, 2020
Final follow up from #13906. This PR does:
1) Simplify the logic during startup by making thread local clusters
   only appear after a cluster has been initialized. This is now uniform
   both for bootstrap clusters as well as CDS clusters, making the logic
   simpler to follow.
2) The aggregate cluster needed fixes due to assumptions about the
   thread local cluster existing at startup. This change also
   fixes #14119
3) Make TLS mocks verify that set() is called before other functions.

Signed-off-by: Matt Klein <[email protected]>
mattklein123 added a commit that referenced this pull request Dec 14, 2020
Final follow up from #13906. This PR does:
1) Simplify the logic during startup by making thread local clusters
   only appear after a cluster has been initialized. This is now uniform
   both for bootstrap clusters as well as CDS clusters, making the logic
   simpler to follow.
2) The aggregate cluster needed fixes due to assumptions about the
   thread local cluster existing at startup. This change also
   fixes #14119
3) Make TLS mocks verify that set() is called before other functions.

Signed-off-by: Matt Klein <[email protected]>
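As a rough illustration of point 1 above, here is a sketch of the startup invariant under hypothetical names (not the real cluster manager code): a cluster is published to the worker-visible map only once its initialization completes, uniformly for bootstrap and CDS clusters.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the startup invariant; not Envoy's real code.
struct Cluster {
  std::string name;
  std::vector<std::string> initial_hosts;
  // Fires once the cluster has finished initializing (e.g. initial service
  // discovery or the first health check round).
  std::function<void()> init_complete_callback;
};

class ClusterManagerSketch {
public:
  // Uniform path for bootstrap and CDS clusters: register an init callback
  // instead of publishing the cluster to workers immediately.
  void addOrUpdateCluster(Cluster& cluster) {
    cluster.init_complete_callback =
        [this, name = cluster.name, hosts = cluster.initial_hosts] {
          // Only now does the thread local cluster appear, and it appears
          // together with its initial host set (the atomic update from this PR).
          publishToWorkers(name, hosts);
        };
  }

private:
  // Stand-in for the single TLS post to all workers.
  void publishToWorkers(const std::string& name,
                        const std::vector<std::string>& hosts) {
    worker_visible_clusters_[name] = hosts;
  }
  std::unordered_map<std::string, std::vector<std::string>> worker_visible_clusters_;
};
```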
@lambdai added the backport/review label (request to backport to stable releases) on Feb 1, 2021
@Shikugawa added the backport/approved label (approved backports to stable releases) and removed the backport/review label on Feb 5, 2021
Merging this pull request closed: Transient 503 UH "no healthy upstream" errors during CDS updates