cluster manager: make add/update and initial host set update atomic #13906
Conversation
Previously, we would do a separate TLS set to add/update the cluster, and then another TLS operation to populate the initial host set. For cluster updates on a server already in operation, this can lead to a gap with no hosts. Signed-off-by: Matt Klein <[email protected]>
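A minimal sketch of the race and of the combined-post idea, assuming hypothetical stand-ins (`postToAllWorkers`, `ThreadLocalCluster`, `tl_clusters`) rather than the actual Envoy cluster manager code:

```cpp
// Sketch only: illustrates the two-post race vs. a single combined post.
// postToAllWorkers() stands in for the cross-thread TLS update; here it just
// runs the callback inline so the example is self-contained.
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct HostSet {
  std::vector<std::string> hosts;
};

struct ThreadLocalCluster {
  std::string name;
  HostSet host_set; // empty until populated
};

// Per-worker view of the clusters (thread local in the real code).
thread_local std::unordered_map<std::string, std::unique_ptr<ThreadLocalCluster>> tl_clusters;

void postToAllWorkers(const std::function<void()>& cb) { cb(); }

// BEFORE: two separate posts. Between (1) and (2) a worker can route to the
// cluster and find an empty host set, i.e. a gap with no hosts on update.
void addOrUpdateClusterRacy(const std::string& name, const HostSet& initial_hosts) {
  postToAllWorkers([name] { // (1) add/update the cluster shell
    tl_clusters[name] = std::make_unique<ThreadLocalCluster>(ThreadLocalCluster{name, {}});
  });
  postToAllWorkers([name, initial_hosts] { // (2) populate the initial host set
    tl_clusters[name]->host_set = initial_hosts;
  });
}

// AFTER (the idea in this PR): a single post carries both the cluster
// add/update and its initial host set, so workers never observe the new
// cluster without hosts.
void addOrUpdateClusterAtomic(const std::string& name, const HostSet& initial_hosts) {
  postToAllWorkers([name, initial_hosts] {
    tl_clusters[name] =
        std::make_unique<ThreadLocalCluster>(ThreadLocalCluster{name, initial_hosts});
  });
}
```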
happy to smoke test this when more eyes have made a pass.
So the major change is that there is never a standalone createOrUpdateCluster post; it now goes along with the host update instead.
At a high level I feel it's fine. The question is whether a non-add/update change deserves an individual post.
My concern is that a cluster update does not always come with a member change, and Envoy should propagate the cluster update ASAP instead of waiting for the next member change.
Essentially, this is the fix. Now, we could track whether the cluster is new or updated, and only do this in the updating case, but I'm not sure the extra logic is worth it. In the bootstrap case, we do what you describe above, but as I mentioned in the TODOs I would actually like to remove it from the few paths that require it (e.g. route validation).
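For illustration only, a hedged sketch of the alternative being weighed here (track whether the cluster is new or updated and only combine in the updating case). It reuses the hypothetical helpers from the sketch above and is not what the PR does; as noted, the PR simply always combines:

```cpp
// Hypothetical variant only: combine the post solely when an existing cluster
// is being updated, since a brand-new cluster is carrying no traffic yet and
// the no-host gap is harmless there. Reuses HostSet, addOrUpdateClusterRacy()
// and addOrUpdateClusterAtomic() from the earlier sketch.
#include <string>
#include <unordered_set>

// Hypothetical main-thread registry of clusters that have already been added.
std::unordered_set<std::string> known_clusters;

void addOrUpdateCluster(const std::string& name, const HostSet& initial_hosts) {
  const bool updating = known_clusters.count(name) != 0;
  known_clusters.insert(name);
  if (updating) {
    addOrUpdateClusterAtomic(name, initial_hosts); // live traffic: must be atomic
  } else {
    addOrUpdateClusterRacy(name, initial_hosts); // new cluster: two posts are tolerable
  }
}
```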
At a high level this makes perfect sense. Is there any way we could try racing requests and cluster updates and check for intermittent failures?
You mean like in an integration test? This seems kind of fragile from a test perspective.
Yeah, that was my thinking: just to have something that could hit this in case of a regression. But introducing a test that would likely just end up being flaky if something broke might be too painful to debug if something actually breaks, so it might not be a great idea.
Thanks, this LGTM. Definitely a high risk change that's hard to test, so would want to see some smoke testing results if possible before merging.
I'm going to work on the follow-up cleanup to this PR. @rgs1 depending on when you have time to smoke test, we might want to just get them both in together.
Won't happen before Thu, I am afraid. I've been looking at the new TCP conn pool with Alyssa. So if you can have both by then, I'll test them together.
Ok, giving this a try now.
@mattklein123 so it's not breaking anything, but I don't have a repro handy. I guess it would be something like: a) push a cluster update (e.g. change HC settings)?
@rgs1 the repro would be to do CDS updates under high traffic load. However, I think I'm OK with merging assuming nothing broke. :) So let's go from there. Thank you!
(#14204) This is a follow up to #13906. It replaces use of the thread local clusters with the main thread clusters() output for static route validation. This will enable further cleanups in the cluster manager code. Signed-off-by: Matt Klein <[email protected]>
(#14332) This is part of several follow ups to #13906. This change moves various functions from the cluster API to the thread local cluster API. This simplifies error handling and will also slightly increase performance, as many instances of double map lookups are now a single lookup. Signed-off-by: Matt Klein <[email protected]>
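A rough sketch of the double-lookup point, with hypothetical `ClusterManager`/`ThreadLocalCluster` types standing in for the real Envoy interfaces: name-keyed entry points on the manager repeat the name-to-cluster map lookup on every call, while methods on the thread local cluster object need only the initial lookup.

```cpp
// Sketch only; these types are hypothetical stand-ins, not the real interfaces.
#include <memory>
#include <string>
#include <unordered_map>

struct ConnPool {};

struct ThreadLocalCluster {
  // After the refactor: operations live on the cluster object itself.
  ConnPool* httpConnPool() { return &pool_; }
  ConnPool pool_;
};

class ClusterManager {
public:
  // Callers fetch the thread local cluster once...
  ThreadLocalCluster* getThreadLocalCluster(const std::string& name) {
    auto it = clusters_.find(name);
    return it == clusters_.end() ? nullptr : it->second.get();
  }

  // Before the refactor: name-keyed entry points like this one forced a second
  // map lookup inside every call, and each caller had to handle the
  // "cluster not found" case separately.
  ConnPool* httpConnPoolForCluster(const std::string& name) {
    auto it = clusters_.find(name); // repeated lookup
    return it == clusters_.end() ? nullptr : it->second->httpConnPool();
  }

private:
  std::unordered_map<std::string, std::unique_ptr<ThreadLocalCluster>> clusters_;
};

// Usage after the refactor: look up once, then call methods on the handle.
ConnPool* example(ClusterManager& cm) {
  ThreadLocalCluster* cluster = cm.getThreadLocalCluster("backend"); // single lookup
  return cluster != nullptr ? cluster->httpConnPool() : nullptr;     // no second lookup
}
```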
Final follow up from #13906. This PR does the following: 1) Simplifies the logic during startup by making thread local clusters only appear after a cluster has been initialized. This is now uniform for both bootstrap clusters and CDS clusters, making the logic simpler to follow. 2) Fixes the aggregate cluster, which needed changes due to assumptions about the startup existence of the thread local cluster; this also fixes #14119. 3) Makes TLS mocks verify that set() is called before other functions. Signed-off-by: Matt Klein <[email protected]>
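A small sketch of point 1, again with hypothetical types rather than the actual Envoy code: the thread local cluster is only pushed to workers from the cluster's initialization callback, so bootstrap and CDS clusters follow the same path, and a warming cluster is simply absent on workers rather than present with no hosts.

```cpp
// Sketch only: the worker-visible cluster is created from the initialization
// callback, never at add time.
#include <functional>
#include <string>
#include <vector>

struct HostSet {
  std::vector<std::string> hosts;
};

struct Cluster {
  std::string name;
  HostSet initial_hosts;
  // Runs `done` once the cluster has finished initializing (e.g. after the
  // first DNS resolution or EDS response). Invoked inline here for the sketch.
  void initialize(const std::function<void()>& done) { done(); }
};

// Stand-in for the combined add/update + initial host set post to all workers
// (see the earlier sketch).
void postClusterWithHostsToWorkers(const std::string& name, const HostSet& hosts) {
  (void)name;
  (void)hosts;
}

void onClusterAddedOrUpdated(Cluster& cluster) {
  // Do NOT create the thread local cluster here. Whether the cluster came from
  // bootstrap or from CDS, it becomes visible to workers only once it has
  // initialized, and then complete with its initial host set in a single post.
  cluster.initialize([&cluster] {
    postClusterWithHostsToWorkers(cluster.name, cluster.initial_hosts);
  });
}
```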
Previously, we would do a separate TLS set to add/update the cluster, and then another TLS operation to populate the initial host set. For cluster updates on a server already in operation, this can lead to a gap with no hosts.
Fixes #13070
Risk Level: High
Testing: I didn't add any new tests, because it's unclear to me what tests to add.
I could have added a unit test to cover the race condition, but it would be broken
by the change and then deleted. Note that getting all of the existing tests to pass
was quite difficult, so I do feel there is good coverage here, but if anyone has good
ideas on how to better test this please let me know.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A