Galley performance tuning with k8s api #8338

Closed
cmluciano opened this issue Aug 29, 2018 · 13 comments

Comments

@cmluciano (Member)

chat history:

kuat: There have been several reports of Istio taking down the k8s apiserver: #7675, #8184, and probably some others. Can you please evaluate Galley's load on the apiserver across Istio and see if we have a hot loop somewhere?

cmluciano: I think we should just be using the normal informer stuff, so it's odd that there are so many calls.

kuat: We should consider staggering the load with HPA enabled, at least. On GKE the effect is that the GKE apiserver autoscaling does not keep up with the Istio installation.
@cmluciano same here, which is why I suspect some non-informer call somewhere.

@cmluciano (Member, Author)

@irisdingbj Can you please take a look at the metrics from the k8s api-server and see if we should be using a more recent version of k8s client-go.

The informers should only be triggered when something changes. We should ensure that we are using informers properly and only watch the APIs that are absolutely necessary.

Given the comment about autoscaling not keeping up, kuat's suspicion of a non-informer call somewhere is quite plausible.
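
For reference, this is a minimal sketch of what "the normal informer stuff" looks like with client-go (using ConfigMaps as a stand-in resource; this is illustrative, not Galley's actual code). A shared informer issues one LIST plus one long-lived WATCH per resource type, and subsequent reads are served from the local cache, so the API server only sees traffic when something actually changes:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client config from the local kubeconfig (a real deployment would use in-cluster config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One shared informer factory; each resource type used from it costs exactly one LIST + WATCH.
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)

	// Only instantiate informers for the resources we actually need.
	cmInformer := factory.Core().V1().ConfigMaps().Informer()
	cmInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("added:", obj.(*corev1.ConfigMap).Name) },
		UpdateFunc: func(oldObj, newObj interface{}) { fmt.Println("updated:", newObj.(*corev1.ConfigMap).Name) },
		DeleteFunc: func(obj interface{}) { fmt.Println("deleted a configmap") },
	})

	stop := make(chan struct{})
	factory.Start(stop)
	// After the initial sync, reads come from the local cache, not the API server.
	factory.WaitForCacheSync(stop)
	<-stop // block forever; real code would wire this to a shutdown signal
}
```

If components stick to this pattern and only instantiate informers for the kinds they need, the steady-state load on the API server is essentially one open watch per kind.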

@lichuqiang (Member)

/cc

@irisdingbj (Member)

@cmluciano will take a look at this

@irisdingbj (Member)

Galley is using the normal informer stuff to watch the related Istio CRD resources.
@douglas-reid Is this the problem identified in PR #8356?

@irisdingbj (Member)

@ayj See above. I think Galley is safe with respect to the k8s API server. Any comments?

irisdingbj self-assigned this on Aug 31, 2018
@mandarjog (Contributor)

The issue with Mixer was that it was watching too many CRD “kinds”. We have moved Mixer to a model that uses far fewer CRDs, so the load will decrease.

If Galley uses the same mechanism, where every Kind has its own watch channel to the API server, it will face the same problem as the total number of CRDs in the system grows.

We have to actually measure the load.
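
One cheap way to actually measure the client-side load (before reaching for apiserver metrics) is to wrap client-go's transport and count every outgoing request. This is only an illustrative sketch, not existing Istio code; the `Instrument` helper and counter are made up for the example:

```go
package example

import (
	"log"
	"net/http"
	"sync/atomic"

	"k8s.io/client-go/rest"
)

// countingTransport counts and logs every request a client-go client sends to the API server.
type countingTransport struct {
	next  http.RoundTripper
	total *int64
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	atomic.AddInt64(t.total, 1)
	log.Printf("apiserver request #%d: %s %s", atomic.LoadInt64(t.total), req.Method, req.URL.Path)
	return t.next.RoundTrip(req)
}

// Instrument installs the counting transport on a rest.Config before any clients are built from it.
func Instrument(cfg *rest.Config, total *int64) {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return &countingTransport{next: rt, total: total}
	}
}
```

Comparing the per-path counts with and without HPA scale-ups would show whether the load really comes from informer watches or from some hot polling loop.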

@irisdingbj (Member)

We need to work on this on the Galley side as well, since we use separate watchers for every kind of Istio CRD.

@ozevren (Contributor) commented Sep 6, 2018

@ayj I believe there was a naked call loop in the validator webhook some time ago that you were planning to replace with a watcher. Did you get a chance to do that?

@ayj (Contributor) commented Sep 6, 2018

Not yet. The validation webhook currently polls a specific resource instance (not a collection) every ~5 seconds. #6451 is the tracking issue for the injector and validation webhooks.
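
For the record, a ~5-second poll of a single object can in principle be replaced by a watch restricted to that object with a field selector, so the API server is only contacted when it actually changes. A rough sketch, with placeholder resource type and names rather than the actual object the webhook reconciles:

```go
package example

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// WatchSingleObject watches exactly one named object instead of polling it on a timer.
// The namespace, name, and resource kind here are placeholders for illustration only.
func WatchSingleObject(client kubernetes.Interface, stop <-chan struct{}) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		0, // no periodic resync; rely purely on watch events
		informers.WithNamespace("istio-system"),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			// Restrict the LIST/WATCH to the single object we care about.
			opts.FieldSelector = "metadata.name=example-config"
		}),
	)

	informer := factory.Core().V1().ConfigMaps().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			fmt.Println("watched object changed; reloading instead of re-polling")
		},
	})

	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```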

@ozevren (Contributor) commented Sep 6, 2018

Here is the situation, to the best of my understanding. Please keep me honest.

Taking a 1 Mixer, 1 Pilot, 1 Galley setup as a baseline:

  • Galley will end up inheriting the same set of CRDs that Pilot and Mixer are currently listening on. With the switch to Galley, there should not be a change in the total number of CRDs being listened on.
  • The perf effect of Mixer moving to a smaller set of CRDs is orthogonal to whether Galley is used to ingest config or not. In either case, the perf effect should remain the same.
  • Galley is using 1 shared informer per apigroup/version/kind (see the sketch at the end of this comment). This is due to the nature of the APIServer libraries and is not something under our control.

Apart from #6451, I don't see any low-hanging fruit/obvious fixes that we can apply to the Galley code. (If I am missing something, please speak up.)

Assuming Galley has good fanout, introducing Galley-based config distribution should improve our perf footprint on the host system: in the old case, scaling Mixer and Pilot would linearly increase the load on the API server. With the new model, the load will be on Galley, and the API server load should stay constant (i.e., assuming the number of Galleys stays constant).

This doesn't mean we won't run into any issues like Azure/AKS#620, but it means introducing Galley (along with reducing Mixer CRDs) should help. If we run into such issues, we will need to fix them within the scope of Galley.

Based on this, to tune Galley performance, it is probably beneficial to understand:

  • The config count/size vs. resource consumption characteristics and limits.
  • The MCP client fanout characteristics.
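
To make the “1 shared informer per apigroup/version/kind” point concrete, below is roughly what the per-GVR watch pattern looks like with the dynamic client; each listed GroupVersionResource costs one LIST plus one long-lived WATCH per Galley replica, so the API-server load grows roughly with (number of kinds) × (number of replicas). The GVRs shown are illustrative, not an exhaustive list of what Galley watches, and this is a sketch rather than Galley's actual implementation:

```go
package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// WatchIstioConfig opens one informer (one LIST + long-lived WATCH) per GroupVersionResource.
// The GVRs below are illustrative, not the full set that Galley ingests.
func WatchIstioConfig(cfg *rest.Config, stop <-chan struct{}) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 30*time.Minute)

	gvrs := []schema.GroupVersionResource{
		{Group: "networking.istio.io", Version: "v1alpha3", Resource: "virtualservices"},
		{Group: "networking.istio.io", Version: "v1alpha3", Resource: "destinationrules"},
		// ...one entry per CRD kind that needs to be ingested.
	}
	for _, gvr := range gvrs {
		// Each ForResource call creates a separate informer, i.e. a separate watch channel to the API server.
		factory.ForResource(gvr).Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) { fmt.Println("config resource added") },
		})
	}

	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	return nil
}
```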

@ayj (Contributor) commented Sep 6, 2018

cc @Nino-K who was also looking at scaling MCP client/server for CF use cases.

@ozevren (Contributor) commented Nov 1, 2018

@ayj You've been looking at Galley performance for 1.1. Any blockers for 1.1 release? Can you share some numbers?

@ayj (Contributor) commented Dec 18, 2018

I think we can close this for now. The original issue was motivated by Mixer opening unnecessary duplicate CRD watches, which was exacerbated by HPA. Galley only creates one watch per CRD and does not have the same scaling characteristics as Mixer. We can open follow-up issues if necessary when specific problems arise.

ayj closed this as completed on Dec 18, 2018