-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cluster: unstuck cluster manager when update the initializing cluster #13875
Conversation
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
The new test AdsClusterV3Test.ClusterUpdateWhenWarming is failing on CI but I cannot reproduce locally Is it another #11877 ? It seems CDS and EDS update on Envoy are concurrent executing |
The PR that is being reverted is causing production issues, so let's land that revert, and then work on re-applying with better tests? cc @Shikugawa /wait |
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
linux_x64 release test passed! |
Sorry is this still valid after reverting the other PR? Or should I wait until that issue is resolved? /wait-any |
I can try this one tomorrow, we should spot the problem immediately. |
That is, on top of master with #13344 already reverted. |
Oh nvm, this is not for the crash. |
Signed-off-by: Yuchen Dai <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks in general this makes sense to me with a few small comments.
/wait
Signed-off-by: Yuchen Dai <[email protected]>
…erfix Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks LGTM pending potentially a better integration test for the primary case. Can you also merge main to pick up the CI fix?
/wait
// The override cluster is added. Manually drop the previous cluster. In production flow this is | ||
// achieved by ClusterManagerImpl. | ||
sds.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add a cluster manager test of this case that uses "real" DNS clusters? I think the main sequence that can actually happen here is:
- CDS adds DNS cluster
- DNS cluster begins init
- CDS updates DNS cluster?
I think you could also pretty easily test this in your integration test below by doing:
- Have CDS deliver a DNS cluster configured with health checking.
- DNS cluster won't initialize pending health checking.
- Update DNS cluster
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the DNS with health check!
After a quick search I didn't find an existing example but this test case
TEST_F(ClusterManagerImplTest, RemoveWarmingCluster)
seem putting a defaultStaticCluster("fake_cluster")
in warm.
Writing a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A slight difference is that DNS cluster is always depending on resolver.
I modified the cluster to STATIC type to make it immediate ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this all looks good, but can you add a real integration test for this in ADS integration test? I think it's really easy based on what you already have. Just use a static cluster with health checking, it will be stuck initializing, then you can update it?
/wait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the impression that static cluster with bad health check doesn't block initialization...
Yeah, adding integrate test to confirm initialization and the behavior health check behavior. Will udpate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
Also confirmed that failed health check in static cluster also turn the warm cluster to ready.
And STRICT DNS resolve failure has the same scary side effect.
I end up not accept the tcp connection but not write back to hold the DNS request.
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>
@@ -416,6 +416,22 @@ AssertionResult BaseIntegrationTest::compareDiscoveryRequest( | |||
} | |||
} | |||
|
|||
AssertionResult compareSets(const std::set<std::string>& set1, const std::set<std::string>& set2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this function up so that more functions defined in this file can call it.
Signed-off-by: Yuchen Dai <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thank you!
…envoyproxy#13875) Signed-off-by: Yuchen Dai <[email protected]>
* cluster manager: avoid immediate activation for dynamic inserted cluster when initialize (envoyproxy#12783) Signed-off-by: Shikugawa <[email protected]> Signed-off-by: Yuchen Dai <[email protected]> * cluster: unstuck cluster manager when update the initializing cluster (envoyproxy#13875) Signed-off-by: Yuchen Dai <[email protected]> Co-authored-by: Rei Shimizu <[email protected]>
Commit Message:
Remove existing initializing cluster when updating secondary clusters.
the destroy of existing cluster.
the warming secondary cluster correctly.
Additional Description:
Risk Level: LOW
Testing: integration test
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
fix #13874