This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Flux sync-loop is failing with internal Error #3450

Closed · Bujail opened this issue Mar 23, 2021 · 27 comments
Labels: blocked-needs-validation (Issue is waiting to be validated before we can proceed), bug, wontfix

Bujail commented Mar 23, 2021

Describe the bug

Flux sync-loop fails with internal error. Changes are not getting applied.

To Reproduce

Steps to reproduce the behaviour:

Flux deployment yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  name: flux
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  selector:
    matchLabels:
      name: flux
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        prometheus.io/port: "3031"
      labels:
        name: flux
    spec:
      containers:
      - args:
        - --memcached-service=
        - --ssh-keygen-dir=/var/fluxd/keygen
        - [email protected]:xyz/infrastructure/clusters/stg.git
        - --git-branch=master
        - --git-label=flux
        - --registry-exclude-image=*
        - --manifest-generation=true
        - --listen-metrics=:3031
        - --git-readonly
        - --git-timeout=300s
        - --sync-timeout=30m
        image: docker.io/fluxcd/flux:1.22.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/flux/v6/identity.pub
            port: 3030
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: flux
        ports:
        - containerPort: 3030
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/flux/v6/identity.pub
            port: 3030
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          requests:
            cpu: "3"
            memory: 2Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/fluxd/ssh
          name: git-key
          readOnly: true
        - mountPath: /var/fluxd/keygen
          name: git-keygen
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: registry
      nodeSelector:
        nodegroup: agent
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: flux
      serviceAccountName: flux
      terminationGracePeriodSeconds: 30
      volumes:
      - name: git-key
        secret:
          defaultMode: 256
          secretName: flux-git-deploy
      - emptyDir:
          medium: Memory
        name: git-keygen

Expected behavior

All new changes should be applied to the cluster.

Logs

ts=2021-03-23T04:00:03.498726475Z caller=loop.go:108 component=sync-loop err="collating resources in cluster for sync: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 36131; INTERNAL_ERROR"
ts=2021-03-23T04:00:03.500314177Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://[email protected]/xyz/infrastructure/clusters/stg.git branch=master HEAD=1fab6b6fac9342a7944d9f59c56d6f184d14a0d6
ts=2021-03-23T04:00:34.430292113Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://[email protected]/xyz/infrastructure/clusters/stg.git branch=master HEAD=1fab6b6fac9342a7944d9f59c56d6f184d14a0d6
ts=2021-03-23T04:07:11.5531765Z caller=sync.go:61 component=daemon info="trying to sync git changes to the cluster" old=d152f4eab76071b2fbbe00bd1860db5cd06d2a4d new=1fab6b6fac9342a7944d9f59c56d6f184d14a0d6
ts=2021-03-23T04:10:26.344102153Z caller=loop.go:108 component=sync-loop err="collating resources in cluster for sync: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 36405; INTERNAL_ERROR"
ts=2021-03-23T04:10:26.345656977Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://[email protected]/xyz/infrastructure/clusters/stg.git branch=feat/velero/encrypt HEAD=1fab6b6fac9342a7944d9f59c56d6f184d14a0d6

Additional context

  • Flux version: 1.22.0
  • Kubernetes version: 1.20.2
  • Git provider: GitLab
  • Container registry provider: GitLab
Bujail added the blocked-needs-validation and bug labels on Mar 23, 2021
@kingdonb
Member

Thanks, can you provide some more information, please? If this is a regression after upgrading, can you tell whether it goes away when you revert to the previous version of Flux? (Whether that was 1.21.2 or another version, please try backing down to the prior release.)

Where is your Kubernetes cluster hosted, and with which cloud provider (is it GKE?)

The error messages you listed aren't mentioned by any other reports I can find. I will do some research and get back to you.

In the meantime, please let me know whether you are able to revert to the earlier version, and whether reverting solves the problem or the issue persists.

kingdonb self-assigned this on Mar 24, 2021

Bujail commented Mar 25, 2021

Flux was upgraded to v1.22.0 after we hit the above INTERNAL_ERROR on v1.21.1. I tried v1.21.2 as well, but no luck.

Initially we were running Flux v1.17.1 and hit the timeout issue (#3049). Upgrading to v1.21.1, removing --sync-timeout from the arguments, and adding a timeout in the .flux file fixed the issue temporarily. After a few days it started reporting the INTERNAL_ERROR again. As a test we upgraded to v1.22.0, but we are still getting the same error.

Note:
The cluster is a vanilla deployment in AWS.
The main Git repo fetches a remote base repo while building the manifests, e.g.:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- ssh://[email protected]/xyz/gitops-base.git/base/calico-v3.17.1?ref=v0.104.0

patchesStrategicMerge:
- calico-config-patch.yaml
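
Since that remote base is fetched over SSH on every manifest generation, it may be worth timing a build outside of Flux to check whether the fetch itself is slow. A rough sketch, assuming the overlay above lives at clusters/stg (adjust the path to wherever the kustomization.yaml sits):

time kustomize build clusters/stg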

@kingdonb
Member

Are you able to reproduce this in a non-production cluster?

I haven't seen any other users reporting this issue, and that's a pretty wide range of versions, so I have to infer that it's something about one of your artifacts. The string "INTERNAL_ERROR" does not appear inside of the flux codebase, so it may be coming from somewhere else.

return errors.Wrap(err, "collating resources in cluster for sync")

This line of code in Sync, the main loop that applies changes in Flux, is where the string "collating resources in cluster for sync" comes from.

The rest of your error must be coming from something else. I will follow up; my suggestion for now is to try reproducing this in a test cluster that has as much as possible in common with the failing cluster (then confirm you don't see the issue in a vanilla cluster without any of those things), and then add things back one by one until you find the triggering factor.
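
One quick way to check that split for yourself, as a sketch against a local checkout of the source (paths are up to you):

git clone https://github.com/fluxcd/flux.git && cd flux
grep -rn "collating resources in cluster for sync" --include="*.go" .   # the Flux-owned prefix
grep -rn "INTERNAL_ERROR" --include="*.go" .   # expect no hits in Flux's own code; matches under vendor/, if any, would be from dependencies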

@dmduggan

I've started seeing this error also.

Flux version: v1.23.0
Kubernetes version: v1.19.6-eks
Git provider: GitHub Enterprise
Container registry provider: Harbor

kingdonb added this to the 1.23.1 milestone on Jul 13, 2021
@kingdonb
Member

Thank you for your report @dmduggan. There are two important open issues I'm aware of that might be resolved in the next release of Flux (though I don't have any PRs targeting either issue on my radar yet); #3500 is the other issue on my radar.

Is the issue intermittent for you, or has all syncing completely stopped? Are you able to provide any more detail or debugging help?

Since Flux v2 is the focus of development, there is limited time to spend on Flux v1 issues. I will be able to help more here if you can provide bisect information (e.g. in which release the issue first started affecting you, or whether there is any Flux v1 release that, when you revert to it, does not suffer from this issue, so we can see what changed).

Another thing that would be of immense help is a public repo which elicits this problem, or instructions to reproduce the issue. I cannot debug what I cannot see for myself. I'm very sorry this is affecting you, and while #3320 remains open, we can certainly try to address any bugs that exist in Flux v1.

Most of the issue reports we are seeing for Flux v1 at this point seem to be at the boundaries of scale, where Flux v1 is known to have some issues. Unfortunately, I do not have any large-scale Flux v1 installations that I can use for testing at present, or I might be in a better position to reproduce issues like this for debugging.

To reiterate, all users are recommended to begin planning (or executing) their migrations to Flux v2, to prepare for the eventual end of life of Flux v1, but if we can narrow down the issue it will be a feather in my cap to get it fixed and merged. At some point we may no longer be able to fix bugs in Flux v1, and though we will continue to offer security fixes for some time yet, users who are still struggling with scale issues or bugs will be best advised to migrate to v2.

Thanks again for your report! @Bujail are you still using Flux v1, and if so does this issue still affect you?


Bujail commented Jul 14, 2021

We were seeing this issue even for single objects. Now we are migrating to v2.

@dmduggan

I'm only seeing it on one of ~10 clusters, and the repo it is syncing against is one of our smaller Git repos. I tried adjusting how much of the repo it was scanning, and for a while that seemed to help some, but then I started seeing the following error alongside the other one:

ts=2021-07-15T22:37:31.927073711Z caller=loop.go:108 component=sync-loop err="collating resources in cluster for sync: the server was unable to return a response in the time allotted, but may still be processing the request"

Before that, it seemed to stop syncing altogether.

I asked AWS whether there were any performance issues with the EKS control plane, but it didn't sound like there were any.

It did seem to get worse after upgrading to Linkerd 2.10, which another member of my team reported in #3489. We've mostly been able to work around that on other clusters by increasing --sync-timeout to 25m. fluxctl sync --timeout 20m also times out.

We are also looking at migrating to flux v2, but might take us a while to get there.

@kingdonb
Member

25m is a very long timeout for syncing. I am not sure if these are the same issue.

I think I'd like to attach a debugger or profiler to try and get an idea where it is spending all of this time. Seeing this issue for even single objects seems to indicate that it is getting stuck in a loop somewhere, or waiting for some callback that is never going to come. I will do some experiments with a debugger and come up with a strategy for how to diagnose this further.
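
A low-tech way to see where a Go process is stuck, short of attaching a full debugger, is to ask the runtime for a goroutine dump: sending SIGQUIT makes the Go runtime print every goroutine's stack and exit, so the dump lands in the terminated container's logs. A sketch, assuming the daemon runs as deploy/flux in the flux namespace and the Alpine-based image ships a shell; note this restarts the pod:

kubectl -n flux exec deploy/flux -- sh -c 'kill -QUIT 1'
kubectl -n flux logs deploy/flux --previous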

kingdonb modified the milestone: 1.23.1 → 1.23.2, Jul 22, 2021
@aleclerc-sonrai

The last lines I see from the sync-loop are:

ts=2021-07-22T18:03:00.126940962Z caller=loop.go:142 component=sync-loop jobID=f6b24928-dcb8-7695-4530-1b556d09bfb9 state=in-progress
ts=2021-07-22T18:04:05.928214509Z caller=releaser.go:59 component=sync-loop jobID=f6b24928-dcb8-7695-4530-1b556d09bfb9 type=release updates=51

Then nothing more. I basically get one proper Flux sync when the app comes up, and that's the last one.

@kingdonb
Member

Thanks for joining this issue @aleclerc-sonrai; it helps to know this issue has more than just one or two people affected.

I'd like to close this issue (with a resolution) but I'm not sure if anyone has identified the root cause, or if we are really seeing only one issue here at all. I still haven't identified the source of that "INTERNAL_ERROR" in the top report.

I have #3156, but I don't think that's necessarily related. I moved this out to the 1.23.2 release milestone because I don't yet know what is causing it or how to solve it, but if we get a PR and confirmation that it solves the issue, I'll be more than happy to fast-track another release.

Short of that, what would help most is concrete steps to reproduce, or a public repo that I can deploy Flux against that will reproduce the issue. Then I can exercise the debugger and should be able to find out what's happening here.

@aleclerc-sonrai

I think I've fixed it by completely disabling the registry updater.

@kingdonb
Member

@aleclerc-sonrai you mean this one?

  # Disable registry scanning completely
  disableScanning: true

That's helpful information 👍 thanks

@aleclerc-sonrai

Yes. We were only using it for Quay, but maybe some new rate limiter there is slowing down that entire job.

@kingdonb
Member

I am again looking at #3156 to include in the next release.

Is anyone still struggling with this, and if so could you please try the image I've just built to see if it resolves it for you? I've pushed a build of kingdonb@410a3b4 to this image:

docker.io/kingdonb/flux:reset-sync-errors-410a3b43

I think there can definitely be multiple (unrelated) root causes for the sync-loop failure. I'd almost be surprised if all subscribers to this issue have really been experiencing the error in the same way, but even so, they could all share a common root cause that might be resolved by that commit.

I added some comments to #3156 explaining how I interpret the issue, but it's rather long and complicated, and I don't want to spread the details across more issues. I am mostly interested to know if it fixes a problem for anyone in this thread or not.
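
If it helps, a minimal sketch for swapping that image onto a running daemon (this assumes the Deployment and container are both named flux, as in the manifest at the top of this issue, and that it runs in a namespace called flux; adjust as needed):

kubectl -n flux set image deployment/flux flux=docker.io/kingdonb/flux:reset-sync-errors-410a3b43
kubectl -n flux rollout status deployment/flux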

@dmduggan

I should be able to try it out next week.


mmckane commented Jul 30, 2021

We seem to be having this issue on one of our clusters after it ran fine for over a year. @kingdonb we tried the image and it doesn't appear to resolve our issue. Here is what our error log looks like; it just seems to keep attempting the sync without ever finishing it.

ts=2021-07-30T17:11:46.697348263Z caller=loop.go:108 component=sync-loop err="collating resources in cluster for sync: the server was unable to return a response in the time allotted, but may still be processing the request"
ts=2021-07-30T17:11:46.699803362Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:13:09.097097299Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:15:09.60435917Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:17:10.097248921Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:19:10.458019763Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:21:10.759968606Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:22:20.517264293Z caller=sync.go:61 component=daemon info="trying to sync git changes to the cluster" old=326e2f590f04780d28b2b3e5da1dabecb7de2066 new=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:24:07.966396727Z caller=loop.go:108 component=sync-loop err="collating resources in cluster for sync: the server was unable to return a response in the time allotted, but may still be processing the request"
ts=2021-07-30T17:24:07.96922372Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:25:11.901339674Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:27:12.364358518Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:29:12.913049671Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:31:13.386117488Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:33:13.896769783Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://<redacted> HEAD=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8
ts=2021-07-30T17:34:39.278220261Z caller=sync.go:61 component=daemon info="trying to sync git changes to the cluster" old=326e2f590f04780d28b2b3e5da1dabecb7de2066 new=e3b8d304fb28d15d4e6c6495da446b0939cc6ba8

@kingdonb
Member

I'm very sorry for the trouble. I just wanted to note that earlier this month, the Flux website was updated to reflect that Flux v1 has formally been marked as superseded by Flux v2, due to the stability release of the Flux v2/GitOps Toolkit APIs:

https://fluxcd.io/docs/migration/timetable/

I am not trying to indicate that this won't be fixed in Flux v1 (although I cannot provide a timetable for a fix, and I cannot even say with certainty that this report should be linked to just one issue). I do not know how to reproduce this issue; it sounds like some reports are related to scaling issues, and others are unlikely to be driven by scale.

I will state that in the Flux v1 docs, we have posted this notice for some time asking folks to consider their upgrade plans:

https://fluxcd.io/legacy/flux/faq/#why-should-i-upgrade

Unfortunately as we have seen, the fix attempted in #3156 does not work for all users, and the latest reports on this issue do not appear to have added anything to the debugging story. Short of someone posting concrete steps to reproduce, or a public repo that I can sync and deploy Flux on that will reproduce the issue, I can't see a likely path to resolving this issue.

Thanks for looking into this.


dmduggan commented Aug 6, 2021

I was able to try docker.io/kingdonb/flux:reset-sync-errors-410a3b43 and the new 1.23.2 release, but didn't see any change. I'm not seeing INTERNAL_ERROR anymore, just the collating resources in cluster for sync: the server was unable to return a response in the time allotted, but may still be processing the request error now. I also tried some other things, like disableScanning: true and setting --git-path to a very small set of namespace YAML files. Still no luck.

@kingdonb
Member

I am putting out another Flux v1 release soon, probably a minor release that bumps Alpine from 3.12 to 3.14.1, with few other fixes if any.

I do not have a fix for this issue in my queue, and I am not sure how to fix it. I would welcome a PR from any contributor who has information about where this issue comes from, or how to reproduce it. So far we have come up empty; I do not have a clearly stated repro for this issue, so I cannot debug it further without some outside help.

kingdonb modified the milestone: 1.23.3 → 1.24.0, Aug 17, 2021
@kingdonb
Member

Hello all

Flux 1.24.0 has been released. Please re-check this issue, if you have been waiting for the release.

I have no reason to suspect this issue is resolved, but it still seems possible it could have been related to #3156, which was included in this release. I think we already confirmed that didn't fix the issue, but as I do not yet have another PR, ...

@kingdonb
Member

Today I got this error from kubectl; I have never seen anything like it before:

Failure                                              InternalError   an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 3; INTERNAL_ERROR") has prevented the request from succeeding

The error occurred during what appears to have been a temporary (even momentary) outage of my cluster's Keycloak/OIDC provider, or perhaps expiration of my CLI's OAuth2 token. I say this because on the next kubectl get command I was redirected to the Keycloak portal and received a 502 error. Refreshing within a few seconds, I reached the portal (the outage was only momentary in my case) and logged in, and the next kubectl get succeeded.

I have just upgraded my cluster to use kube-oidc-proxy which authenticates all external requests against this new Keycloak instance. The kube-oidc-proxy itself has two ingress-nginx load balancers in front of it (two? yeah, don't ask...)

I thought to look up other possible sources of this error related to nginx and found this top link:

https://stackoverflow.com/questions/56413290/getting-curl-92-http-2-stream-1-was-not-closed-cleanly-internal-error-err

It doesn't really shed much light, except that this error can come from Nginx, an ALB, or really any HTTP server, and perhaps only over TLS-secured connections.

Quick check: is there anyone still subscribed to this issue who is still struggling with sync-loop failures, with or without the INTERNAL_ERROR message? I haven't solved it; I only seem to be getting somewhat closer to understanding the source. But I will need a live case on someone's actual cluster running Flux v1 to proceed with further debugging.

(If everyone has upgraded to Flux v2 and/or there are no recurrences of this issue, then we will have to close it. I will wait about two weeks for any response before closing, but of course if someone finds this issue matches their scenario at some future point, we could reopen it if Flux is still in maintenance then.)

@dmduggan

Hi @kingdonb, I am still experiencing this on one of our clusters. I did upgrade to 1.24.0 but am still seeing the collating resources in cluster for sync: the server was unable to return a response in the time allotted, but may still be processing the request error. I'm no longer seeing INTERNAL_ERROR, though (not sure whether this warrants a new issue, since the title no longer applies to me).

I tried switching over to a fresh Git repo with all of the YAML copied over and didn't see much difference, except that fluxctl sync no longer timed out on me. So I switched back to the original repo and deleted the flux-sync tag. That also fixed the fluxctl sync timeout, but changes are still not being applied by Flux. Our workaround is to manually run kubectl apply -R -f infra -f projects, which takes 8 seconds. Flux is able to update image tags in the Git repo from Harbor, but it never applies the changes.

Let me know if there is anything I can collect from the cluster that might help.
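
For anyone wanting to try the same tag reset, a rough sketch of what that looks like (this assumes the sync marker uses the default flux-sync tag name and the Git remote is called origin; clusters started with a different --git-label, like the one at the top of this issue, will have a different tag name):

git push origin --delete flux-sync   # remove the remote sync tag
git tag -d flux-sync                 # remove the local copy, if present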


dmduggan commented Sep 21, 2021

I have started seeing the issue on a second cluster now. It seemed like a performance issue with EKS at first. I was able to work around it for a while by scaling down flux and helm-operator for 10 minutes, which seemed to let EKS catch up or something. I did see the INTERNAL_ERROR message once again on this cluster. Unfortunately, my scale-down trick no longer seems to be working. I have a support ticket open with AWS to see whether some EKS scaling needs to happen.

@kingdonb
Member

Users have been reporting this issue intermittently, and the trick that seems to resolve it for many people is to add settings for:

        - --git-timeout=300s
        - --sync-timeout=30m

The original poster had these values, and it didn't resolve the issue for them.

I have only seen INTERNAL_ERROR come from an nginx ingress or a load balancer. I don't think this message is emitted by Flux, but if things are timing out ahead of the specified Flux timeouts, that could indicate a timeout on the cluster control plane's load balancer that pre-empts Flux from completing its work.

I will note, though, that for many users simply raising --git-timeout and --sync-timeout above their default values has been enough to resolve this issue. For any new sync issues, please try that first.
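
A quick way to check which timeouts a running daemon actually has, as a sketch that assumes the Deployment is named flux (adjust the namespace to wherever yours runs):

kubectl -n flux get deployment flux -o yaml | grep -E 'git-timeout|sync-timeout'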

@dmduggan

We were able to transition to new clusters, so we are no longer seeing this issue anywhere. We've also established a process for switching between clusters, which should enable our transition to Flux v2.

@AlexLast

In case this helps anyone else in the future: we started seeing the following error in our EKS cluster:

err="collating resources in cluster for sync: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1; INTERNAL_ERROR"

We noticed the connection to the Kubernetes API was being closed after 60s, which aligns with the default setting for --request-timeout on kube-apiserver. The request was timing out because the response was particularly large: roughly 1GB of Kubernetes resources being listed.

The error seemingly started occurring out of the blue, but in the end we managed to correlate it with our EKS control plane nodes being autoscaled to some larger instance types. It turns out that all responses from the Kubernetes API in the EKS control plane are compressed with gzip, which is a single-threaded operation. Although our control plane had been scaled up to larger instance types, those instances actually had a slower CPU clock speed than the newer-generation instances, which increased the compression time for responses and caused the failure.

AWS moving the EKS control plane back to the previous instance types resolved the issue; however, the real problem here is Flux making requests to list all resources, which can produce very large responses from the API. We're planning to migrate to Flux v2 in the hope that its requests to the API are crafted in a more conservative fashion.
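
If you suspect the same thing, a rough way to gauge how large a cluster-wide listing is on your cluster (only an approximation of what fluxd requests, since kubectl get all does not cover every API group):

kubectl get all --all-namespaces -o yaml | wc -c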

pjbgf added the wontfix label on Jul 26, 2022

pjbgf commented Jul 26, 2022

This project is in Migration and security support only, so unfortunately this issue won't be fixed. We recommend that users migrate to Flux 2 at their earliest convenience.

More information about the Flux 2 transition timetable can be found at: https://fluxcd.io/docs/migration/timetable/.

pjbgf closed this as completed on Jul 26, 2022