Galley performance tuning with k8s api #8338

Closed
cmluciano opened this issue Aug 29, 2018 · 13 comments

Comments

@cmluciano (Member)

chat history:

kuat: There have been several reports of Istio taking down the k8s apiserver: #7675, #8184, and probably some others. Can you please evaluate Galley's load on the apiserver across Istio and see if we have a hot loop somewhere?

cmluciano: I think we should just be using the normal informer stuff, so it's odd that there are so many calls.

kuat: We should consider staggering the load with HPA enabled, at least. On GKE the effect is that the GKE apiserver autoscaling does not keep up with the Istio installation.
@cmluciano same here, which is why I suspect some non-informer call somewhere.

@cmluciano (Member, Author)

@irisdingbj Can you please take a look at the metrics from the k8s api-server and see if we should be using a more recent version of k8s client-go.

The informers should only be triggered when something changes. We should ensure that we are using informers properly and only watch the APIs that are absolutely necessary.

Given the comment about autoscaling not keeping up, kuat's suspicion of a non-informer call somewhere is quite plausible.
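
For reference, this is a minimal sketch of what "the normal informer stuff" looks like with client-go (using ConfigMaps as a stand-in resource; this is illustrative, not Galley's actual code). A shared informer issues one LIST plus one long-lived WATCH per resource type, and subsequent reads are served from the local cache, so the API server only sees traffic when something actually changes:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client config from the local kubeconfig (a real deployment would use in-cluster config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One shared informer factory; each resource type used from it costs exactly one LIST + WATCH.
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)

	// Only instantiate informers for the resources we actually need.
	cmInformer := factory.Core().V1().ConfigMaps().Informer()
	cmInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("added:", obj.(*corev1.ConfigMap).Name) },
		UpdateFunc: func(oldObj, newObj interface{}) { fmt.Println("updated:", newObj.(*corev1.ConfigMap).Name) },
		DeleteFunc: func(obj interface{}) { fmt.Println("deleted a configmap") },
	})

	stop := make(chan struct{})
	factory.Start(stop)
	// After the initial sync, reads come from the local cache, not the API server.
	factory.WaitForCacheSync(stop)
	<-stop // block forever; real code would wire this to a shutdown signal
}
```

If components stick to this pattern and only instantiate informers for the kinds they need, the steady-state load on the API server is essentially one open watch per kind.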

@lichuqiang (Member)

/cc

@irisdingbj (Member)

@cmluciano will take a look at this

@irisdingbj (Member)

Galley is using the normal informer stuff to watch the related Istio CRD resources.
@douglas-reid Is this the problem identified in PR #8356?

@irisdingbj (Member)

@ayj See above. I think Galley is safe with respect to the k8s API server. Any comments?

irisdingbj self-assigned this on Aug 31, 2018
@mandarjog (Contributor)

The issue with Mixer was that it was watching too many CRD “kinds”. We have moved Mixer to a model that uses far fewer CRDs, so the load will decrease.

If Galley uses the same mechanism, where every Kind has its own watch channel to the API server, it will face the same problem as the total number of CRDs in the system grows.

We have to actually measure the load.
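
One cheap way to actually measure the client-side load (before reaching for apiserver metrics) is to wrap client-go's transport and count every outgoing request. This is only an illustrative sketch, not existing Istio code; the `Instrument` helper and counter are made up for the example:

```go
package example

import (
	"log"
	"net/http"
	"sync/atomic"

	"k8s.io/client-go/rest"
)

// countingTransport counts and logs every request a client-go client sends to the API server.
type countingTransport struct {
	next  http.RoundTripper
	total *int64
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	atomic.AddInt64(t.total, 1)
	log.Printf("apiserver request #%d: %s %s", atomic.LoadInt64(t.total), req.Method, req.URL.Path)
	return t.next.RoundTrip(req)
}

// Instrument installs the counting transport on a rest.Config before any clients are built from it.
func Instrument(cfg *rest.Config, total *int64) {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return &countingTransport{next: rt, total: total}
	}
}
```

Comparing the per-path counts with and without HPA scale-ups would show whether the load really comes from informer watches or from some hot polling loop.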

@irisdingbj (Member)

We need to work on this on the Galley side as well, since we use separate watchers for every kind of Istio CRD.

@ozevren (Contributor) commented Sep 6, 2018

@ayj I believe there was a naked call loop in the validator webhook some time ago that you were planning to replace with a watcher. Did you get a chance to do that?

@ayj (Contributor) commented Sep 6, 2018

Not yet. The validation webhook currently polls a specific resource instance (not a collection) every ~5 seconds. #6451 is the tracking issue for the injector and validation webhooks.
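
For the record, a ~5-second poll of a single object can in principle be replaced by a watch restricted to that object with a field selector, so the API server is only contacted when it actually changes. A rough sketch, with placeholder resource type and names rather than the actual object the webhook reconciles:

```go
package example

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// WatchSingleObject watches exactly one named object instead of polling it on a timer.
// The namespace, name, and resource kind here are placeholders for illustration only.
func WatchSingleObject(client kubernetes.Interface, stop <-chan struct{}) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		0, // no periodic resync; rely purely on watch events
		informers.WithNamespace("istio-system"),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			// Restrict the LIST/WATCH to the single object we care about.
			opts.FieldSelector = "metadata.name=example-config"
		}),
	)

	informer := factory.Core().V1().ConfigMaps().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			fmt.Println("watched object changed; reloading instead of re-polling")
		},
	})

	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```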

@ozevren (Contributor) commented Sep 6, 2018

Here is the situation, to the best of my understanding. Please keep me honest.

Taking a 1 Mixer, 1 Pilot, 1 Galley setup as a baseline:

  • Galley will end up inheriting the same set of CRDs that Pilot and Mixer are currently listening on. With the switch to Galley, there should not be a change in the total number of CRDs being listened on.
  • The perf effect of Mixer moving to a smaller set of CRDs is orthogonal to whether Galley is used to ingest config or not. In either case, the perf effect should remain the same.
  • Galley is using 1 shared informer per apigroup/version/kind (see the sketch at the end of this comment). This is due to the nature of the APIServer libraries and is not something under our control.

Apart from #6451, I don't see any low-hanging fruit/obvious fixes that we can apply to the Galley code. (If I am missing something, please speak up.)

Assuming Galley has good fanout, introducing Galley-based config distribution should improve our perf footprint on the host system: in the old case, scaling Mixer and Pilot would linearly increase the load on the API server. With the new model, the load will be on Galley, and the API server load should stay constant (i.e., assuming the number of Galleys stays constant).

This doesn't mean we won't run into any issues like Azure/AKS#620, but it means introducing Galley (along with reducing Mixer CRDs) should help. If we run into such issues, we will need to fix them within the scope of Galley.

Based on this, to tune Galley performance, it is probably beneficial to understand:

  • The config count/size vs. resource consumption characteristics and limits.
  • The MCP client fanout characteristics.
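
To make the “1 shared informer per apigroup/version/kind” point concrete, below is roughly what the per-GVR watch pattern looks like with the dynamic client; each listed GroupVersionResource costs one LIST plus one long-lived WATCH per Galley replica, so the API-server load grows roughly with (number of kinds) × (number of replicas). The GVRs shown are illustrative, not an exhaustive list of what Galley watches, and this is a sketch rather than Galley's actual implementation:

```go
package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// WatchIstioConfig opens one informer (one LIST + long-lived WATCH) per GroupVersionResource.
// The GVRs below are illustrative, not the full set that Galley ingests.
func WatchIstioConfig(cfg *rest.Config, stop <-chan struct{}) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 30*time.Minute)

	gvrs := []schema.GroupVersionResource{
		{Group: "networking.istio.io", Version: "v1alpha3", Resource: "virtualservices"},
		{Group: "networking.istio.io", Version: "v1alpha3", Resource: "destinationrules"},
		// ...one entry per CRD kind that needs to be ingested.
	}
	for _, gvr := range gvrs {
		// Each ForResource call creates a separate informer, i.e. a separate watch channel to the API server.
		factory.ForResource(gvr).Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) { fmt.Println("config resource added") },
		})
	}

	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	return nil
}
```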

@ayj (Contributor) commented Sep 6, 2018

cc @Nino-K who was also looking at scaling MCP client/server for CF use cases.

@ozevren (Contributor) commented Nov 1, 2018

@ayj You've been looking at Galley performance for 1.1. Any blockers for 1.1 release? Can you share some numbers?

@ayj (Contributor) commented Dec 18, 2018

I think we can close this for now. The original issue was motivated by Mixer opening unnecessary duplicate CRD watches, which was exacerbated by HPA. Galley only creates one watch per CRD and does not have the same scaling characteristics as Mixer. We can open follow-up issues if necessary when specific problems arise.

ayj closed this as completed on Dec 18, 2018