Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster: unstuck cluster manager when update the initializing cluster #13875

Merged
merged 20 commits into from
Nov 9, 2020

Conversation

lambdai
Copy link
Contributor

@lambdai lambdai commented Nov 3, 2020

Commit Message:
Remove existing initializing cluster when updating secondary clusters.

  • Return existing cluster in ClusterManagerImp::loadCluster() to delay
    the destroy of existing cluster.
  • Remove the existing cluster from the secondary cluster list to accounting
    the warming secondary cluster correctly.

Additional Description:
Risk Level: LOW
Testing: integration test
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
fix #13874

Signed-off-by: Yuchen Dai <[email protected]>
@lambdai lambdai changed the title Warmingclusterfix cluster: unstuck cluster manager when update the initializing cluster Nov 3, 2020
Signed-off-by: Yuchen Dai <[email protected]>
@lambdai
Copy link
Contributor Author

lambdai commented Nov 3, 2020

The new test AdsClusterV3Test.ClusterUpdateWhenWarming is failing on CI but I cannot reproduce locally

Is it another #11877 ? It seems CDS and EDS update on Envoy are concurrent executing

@lizan lizan self-assigned this Nov 3, 2020
@mattklein123
Copy link
Member

The PR that is being reverted is causing production issues, so let's land that revert, and then work on re-applying with better tests? cc @Shikugawa

/wait

Signed-off-by: Yuchen Dai <[email protected]>
@lambdai
Copy link
Contributor Author

lambdai commented Nov 3, 2020

linux_x64 release test passed!

@mattklein123
Copy link
Member

Sorry is this still valid after reverting the other PR? Or should I wait until that issue is resolved?

/wait-any

@lambdai
Copy link
Contributor Author

lambdai commented Nov 4, 2020

@mattklein123
There are two PR
#12783 and #13344

My fix is based on #12783 which is not revered.

BTW: I just realized #13344 is the only one reverted but I suspect the problem is introduced in #12783

@lambdai
Copy link
Contributor Author

lambdai commented Nov 4, 2020

@mattklein123
There are two PR
#12783 and #13344

My fix is based on #12783 which is not revered.

BTW: I just realized #13344 is the only one reverted but I suspect the problem is introduced in #12783

Because the major risky change in #13344 is protected by the default off runtime key.

See https://github.com/envoyproxy/envoy/pull/13886/files#diff-c904f9b0d49cdd938ffaa952192372529415afa8602d7dc6acb904e61111d8a5L433-L438

@lizan
Copy link
Member

lizan commented Nov 4, 2020

Based on my conversation with @rgs1 the crash that they are seeing is likely due to #13344.

@rgs1
Copy link
Member

rgs1 commented Nov 4, 2020

I can try this one tomorrow, we should spot the problem immediately.

@rgs1
Copy link
Member

rgs1 commented Nov 4, 2020

I can try this one tomorrow, we should spot the problem immediately.

That is, on top of master with #13344 already reverted.

@rgs1
Copy link
Member

rgs1 commented Nov 4, 2020

I can try this one tomorrow, we should spot the problem immediately.

That is, on top of master with #13344 already reverted.

Oh nvm, this is not for the crash.

Signed-off-by: Yuchen Dai <[email protected]>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks in general this makes sense to me with a few small comments.

/wait

Signed-off-by: Yuchen Dai <[email protected]>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks LGTM pending potentially a better integration test for the primary case. Can you also merge main to pick up the CI fix?

/wait

Comment on lines +3152 to +3154
// The override cluster is added. Manually drop the previous cluster. In production flow this is
// achieved by ClusterManagerImpl.
sds.reset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a cluster manager test of this case that uses "real" DNS clusters? I think the main sequence that can actually happen here is:

  1. CDS adds DNS cluster
  2. DNS cluster begins init
  3. CDS updates DNS cluster?

I think you could also pretty easily test this in your integration test below by doing:

  1. Have CDS deliver a DNS cluster configured with health checking.
  2. DNS cluster won't initialize pending health checking.
  3. Update DNS cluster

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the DNS with health check!

After a quick search I didn't find an existing example but this test case
TEST_F(ClusterManagerImplTest, RemoveWarmingCluster) seem putting a defaultStaticCluster("fake_cluster") in warm.

Writing a test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A slight difference is that DNS cluster is always depending on resolver.

I modified the cluster to STATIC type to make it immediate ready.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this all looks good, but can you add a real integration test for this in ADS integration test? I think it's really easy based on what you already have. Just use a static cluster with health checking, it will be stuck initializing, then you can update it?

/wait

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the impression that static cluster with bad health check doesn't block initialization...
Yeah, adding integrate test to confirm initialization and the behavior health check behavior. Will udpate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Also confirmed that failed health check in static cluster also turn the warm cluster to ready.
And STRICT DNS resolve failure has the same scary side effect.

I end up not accept the tcp connection but not write back to hold the DNS request.

Signed-off-by: Yuchen Dai <[email protected]>
@@ -416,6 +416,22 @@ AssertionResult BaseIntegrationTest::compareDiscoveryRequest(
}
}

AssertionResult compareSets(const std::set<std::string>& set1, const std::set<std::string>& set2,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this function up so that more functions defined in this file can call it.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you!

@mattklein123 mattklein123 merged commit 7a15e5d into envoyproxy:master Nov 9, 2020
@lambdai lambdai deleted the warmingclusterfix branch November 9, 2020 20:25
lambdai added a commit to lambdai/envoy-dai that referenced this pull request Nov 10, 2020
istio-testing pushed a commit to istio/envoy that referenced this pull request Nov 11, 2020
* cluster manager: avoid immediate activation for dynamic inserted cluster when initialize (envoyproxy#12783)

Signed-off-by: Shikugawa <[email protected]>
Signed-off-by: Yuchen Dai <[email protected]>

* cluster: unstuck cluster manager when update the initializing cluster (envoyproxy#13875)

Signed-off-by: Yuchen Dai <[email protected]>

Co-authored-by: Rei Shimizu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cluster initialization issue. Envoy start could be blocked forever
4 participants