Flux applies changes to the cluster very slowly #1422

brantb · 2018-10-03T20:04:38Z

Recently we've begun to see a significant increase in the amount of time flux takes to sync the git repository to the cluster:

ts=2018-10-03T19:09:42.295546122Z caller=sync.go:121 component=cluster method=Sync cmd=apply args= count=119
ts=2018-10-03T19:14:31.162556177Z caller=sync.go:164 component=cluster method=Sync cmd="kubectl apply -f -" took=4m48.866802454s err=null output="namespace \"cluster-admin\" configured\n <snip>

Based on the flux logs, it looks like individual kubectl apply calls are taking 6-11 seconds to return:

ts=2018-10-02T21:18:17.751129168Z  caller=sync.go:164 component=cluster method=Sync cmd="kubectl apply -f  -" took=4.865805366s err=null output="clusterissuer  \\"redacted\\" configured"
ts=2018-10-02T21:18:24.480141851Z  caller=sync.go:164 component=cluster method=Sync cmd="kubectl apply -f  -" took=6.728899883s err=null output="sealedsecret  \\"redacted\\" configured"
ts=2018-10-02T21:18:28.58635465Z  caller=sync.go:164 component=cluster method=Sync cmd="kubectl apply -f  -" took=4.106128199s err=null output="fluxhelmrelease  \\"redacted\\" configured"
ts=2018-10-02T21:18:33.328764968Z  caller=sync.go:164 component=cluster method=Sync cmd="kubectl apply -f  -" took=4.742312018s err=null output="horizontalpodautoscaler  \\"redacted\\" unchanged"

In contrast, syncing the entire git repo from my local machine takes under 5 seconds.

I launched a copy of k8s.gcr.io/hyperkube:v1.9.0 in the cluster, ran kubectl apply -Rf on a clone of our git repository, and it completed quickly there as well. I also tried moving the flux pod to a different node on the cluster to rule out a transient network issue, but there was no effect. It seems only flux is affected.

Let me know if there's any other information I can provide to help troubleshoot this, because I'm stumped.

k8s version: 1.11.2
flux: 1.7.1

(initial discussion)

The text was updated successfully, but these errors were encountered:

oliviabarrick · 2018-10-03T22:56:05Z

Do individual applies take 4-6 seconds if you try that locally?

time kubectl apply -f file.yaml

As I understand it, flux is deploying files or directories one at a time and this might be inefficient.

squaremo · 2018-10-04T11:22:30Z

As a bit of background: when flux runs a sync, it does it in two stages:

try to apply everything in one big lump of bytes, via stdin
if that fails, apply each resource individually, via stdin

The rationale for this is that if there are no problematic resources (i.e., step 1 sails through) it is much faster. (Though it could be better -- e.g., #1410 passes information about the previous sync back to the process, so it can make an informed guess about which approach will work best.)

From the logs, it looks like it's doing both the "big bang" kubectl apply (1.) and then individual kubectl applys (2.). It's surprising that either of the varieties take as long as they do.

I'd be interested to know if doing kubectl apply from inside the flux pod specifically takes comparably long:

$ kubectl get pods -n flux
NAME                                  READY     STATUS    RESTARTS   AGE
flux-687dfbc6d8-79fft                 1/1       Running   0          1h
$ kubectl exec -n flux -ti flux-687dfbc6d8-79fft -- /bin/sh
/home/flux # git clone /tmp/flux-gitclone044115885/ ./config
/home/flux # cd config
/home/flux # time kubectl apply -Rf ./

brantb · 2018-10-04T14:56:07Z

@justinbarrick Run locally, individual applies also finish very quickly, and don't exhibit the same problem.

@squaremo Interesting, thanks for the background info! We had one YAML manifest containing a formatting error, which must have been why flux was doing the sync in two stages.

I tested kubectl apply within the flux pod as you suggested and reproduced the slow behavior, but I decided to bump up the log verbosity and found something interesting:

# kubectl exec -it flux-54fb649979-whv55 -- time kubectl apply -v=6 -f /tmp/flux-working134349209/path-to/namespace.yaml
I1004 14:27:24.022489   19805 loader.go:357] Config loaded from file /root/.kube/config
I1004 14:27:24.023028   19805 merged_client_builder.go:123] Using in-cluster configuration
I1004 14:27:25.085618   19805 round_trippers.go:436] GET https://10.0.0.1:443/swagger-2.0.0.pb-v1 200 OK in 1062 milliseconds
I1004 14:27:25.200885   19805 merged_client_builder.go:123] Using in-cluster configuration
I1004 14:27:25.201544   19805 merged_client_builder.go:123] Using in-cluster configuration
I1004 14:27:25.202082   19805 decoder.go:224] decoding stream as YAML
I1004 14:27:25.202500   19805 merged_client_builder.go:123] Using in-cluster configuration
I1004 14:27:25.223500   19805 round_trippers.go:436] GET https://10.0.0.1:443/api 200 OK in 20 milliseconds
I1004 14:27:25.238057   19805 round_trippers.go:436] GET https://10.0.0.1:443/apis 200 OK in 13 milliseconds
I1004 14:27:25.241120   19805 cached_discovery.go:133] failed to write cache to /root/.kube/cache/discovery/10.0.0.1_443/servergroups.json due to mkdir /root/.kube/cache: read-only file system
I1004 14:27:25.252141   19805 round_trippers.go:436] GET https://10.0.0.1:443/apis/apiregistration.k8s.io/v1 200 OK in 10 milliseconds
I1004 14:27:25.252851   19805 cached_discovery.go:86] failed to write cache to /root/.kube/cache/discovery/10.0.0.1_443/apiregistration.k8s.io/v1/serverresources.json due to mkdir /root/.kube/cache: read-only file system
I1004 14:27:25.276198   19805 round_trippers.go:436] GET https://10.0.0.1:443/apis/apiregistration.k8s.io/v1beta1 200 OK in 23 milliseconds
I1004 14:27:25.280988   19805 cached_discovery.go:86] failed to write cache to /root/.kube/cache/discovery/10.0.0.1_443/apiregistration.k8s.io/v1beta1/serverresources.json due to mkdir /root/.kube/cache: read-only file system
I1004 14:27:31.085824   19805 round_trippers.go:436] GET https://10.0.0.1:443/apis/extensions/v1beta1 200 OK in 5804 milliseconds
I1004 14:27:31.086479   19805 cached_discovery.go:86] failed to write cache to /root/.kube/cache/discovery/10.0.0.1_443/extensions/v1beta1/serverresources.json due to mkdir /root/.kube/cache: read-only file system
<snip many lines like the previous two>
namespace "cluster-admin" configured
real	0m 43.37s
user	0m 0.24s
sys	0m 0.07s

Each of the API calls takes either 8-15ms or, randomly, 2-5 seconds. 💥 The reason I was unable to reproduce this elsewhere was because flux is making around 100 requests to the API server per apply operation, vs. 2-3 when .kube/cache/discovery is already populated. This made it much more obvious that our API server is having a bad day.

squaremo · 2018-10-04T15:00:42Z

Nice sleuthing! I wonder why that bit of filesystem is read-only. Do you mount your own kubeconfig file from a configmap?

brantb · 2018-10-04T15:14:12Z

We're installing flux from the helm chart, which does provide a default kubeconfig ConfigMap, but we haven't changed the defaults in any way.

apiVersion: v1
data:
  config: |
    apiVersion: v1
    clusters: []
    contexts:
    - context:
        cluster: ""
        namespace: default
        user: ""
      name: default
    current-context: default
    kind: Config
    preferences: {}
    users: []
kind: ConfigMap
metadata:
  annotations:
    flux.weave.works/antecedent: cluster-admin:fluxhelmrelease/flux
  creationTimestamp: 2018-09-28T18:42:14Z
  name: flux-kube-config
  namespace: cluster-admin
  resourceVersion: "274875"
  selfLink: /api/v1/namespaces/cluster-admin/configmaps/flux-kube-config
  uid: 37ce4bcc-c34e-11e8-a95e-062485839578

squaremo · 2018-10-04T15:23:28Z

Ahhh that'll be it then -- configmaps and secrets are mounted read-only, and since we mount it at $HOME/.kube/, that makes the cache directory read-only. I suspect the easiest solution will be to mount the kubeconfig elsewhere and use the environment variable KUBECONFIG to point to it.

https://github.com/weaveworks/flux/blob/master/cluster/kubernetes/sync.go#L149 suggests that the an environment entry for the container would carry through to kubectl (which doesn't set its own environment).

squaremo · 2018-10-08T09:54:22Z

@stefanprodan Something to try in the Helm chart I reckon ^

brantb · 2018-10-11T18:24:13Z

To follow up on this: I opened a ticket with our cloud provider (Azure) and found out that this was also the cause of the degraded API server performance. kubectl was making so many separate HTTP requests -- without keepalives, so a new TCP connection was being opened for each one -- that it was exhausting the SNAT port allocation for the node. This caused some API server connections to take many seconds to open.

I updated our flux chart with the changes from #1435 and the problem vanished. Thanks for the quick fix!

The linked issue fluxcd#1422 is not rendered in Read the docs as a link. https://docs.fluxcd.io/en/stable/troubleshooting.html#flux-is-taking-a-long-time-to-apply-manifests-when-it-syncs This change fixes the rendering to follow the `[fluxcd/flux#number]` format used else where in the docs.

stefanprodan mentioned this issue Oct 8, 2018

Helm chart: change kubectl config mount path #1435

Merged

stefanprodan closed this as completed in #1435 Oct 8, 2018

squaremo added the FAQ Issues that come up a lot label Oct 8, 2018

squaremo mentioned this issue Oct 8, 2018

Describe how to override kubeconfig in docs #1437

Merged

Crevil mentioned this issue Dec 4, 2019

Fix link in troubleshooting docs #2658

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flux applies changes to the cluster very slowly #1422

Flux applies changes to the cluster very slowly #1422

brantb commented Oct 3, 2018

oliviabarrick commented Oct 3, 2018

squaremo commented Oct 4, 2018

brantb commented Oct 4, 2018 •

edited

Loading

squaremo commented Oct 4, 2018

brantb commented Oct 4, 2018

squaremo commented Oct 4, 2018 •

edited

Loading

squaremo commented Oct 8, 2018

brantb commented Oct 11, 2018 •

edited

Loading

Flux applies changes to the cluster very slowly #1422

Flux applies changes to the cluster very slowly #1422

Comments

brantb commented Oct 3, 2018

oliviabarrick commented Oct 3, 2018

squaremo commented Oct 4, 2018

brantb commented Oct 4, 2018 • edited Loading

squaremo commented Oct 4, 2018

brantb commented Oct 4, 2018

squaremo commented Oct 4, 2018 • edited Loading

squaremo commented Oct 8, 2018

brantb commented Oct 11, 2018 • edited Loading

brantb commented Oct 4, 2018 •

edited

Loading

squaremo commented Oct 4, 2018 •

edited

Loading

brantb commented Oct 11, 2018 •

edited

Loading