Flux sync-loop is failing with internal Error #3450
Comments
Thanks, can you provide some more information please? If this is a regression after upgrading, can you tell whether it goes away when you revert to the previous version of Flux? (If that was 1.21.2, or another version, please try backing down to the prior version.) Where is your Kubernetes cluster hosted, and with what cloud provider (is it GKE?) The error messages you listed aren't mentioned in any other reports I can find. I will do some research and get back to you. In the meantime, please let me know whether you are able to revert to the earlier version, and whether reverting solves the problem or the issue persists.
Flux was upgraded to v1.22.0 after getting the above INTERNAL error in v1.21.1. I tried v1.21.2 as well, but no luck. At the beginning we were running Flux v1.17.1 and hit the timeout issue (#3049). By upgrading to v1.21.1 and removing …

Note:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
  - ssh://[email protected]/xyz/gitops-base.git/base/calico-v3.17.1?ref=v0.104.0
patchesStrategicMerge:
  - calico-config-patch.yaml
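For context, a Kustomization like the one above is normally consumed by Flux v1 through manifest generation, i.e. running the daemon with --manifest-generation=true and placing a .flux.yaml next to the overlay. The thread does not show the .flux.yaml actually used here; the snippet below is only a minimal sketch of that pattern.

```yaml
# .flux.yaml — minimal sketch of a Flux v1 manifest-generation config (not the one from this report)
version: 1
patchUpdated:
  generators:
    - command: kustomize build .   # Flux runs this in the directory containing .flux.yaml and applies the output
  patchFile: flux-patch.yaml       # where Flux records its automated annotation/image-update patches
```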
Are you able to reproduce this in a non-production cluster? I haven't seen any other users reporting this issue, and that's a pretty wide range of versions, so I have to infer that it's something about one of your artifacts. The string "INTERNAL_ERROR" does not appear inside the Flux codebase, so it may be coming from somewhere else.

flux/pkg/cluster/kubernetes/sync.go, line 60 (at a194e2f)

This line of code in sync.go … The rest of your error must be coming from something else. I will follow up; my suggestion for now is to try reproducing this in a cluster that shares as many of your own components as possible with the failing cluster (then confirm you don't see the issue in a vanilla cluster without any of those things), and then add them back one by one until you find the triggering factor.
I've started seeing this error also.
Thank you for your report @dmduggan. There are two important open issues I'm aware of that might be resolved in the next release of Flux (though I don't have any PRs targeting either issue on my radar yet); #3500 is the other issue I'm tracking. Is the issue intermittent for you, or has all syncing completely stopped? Are you able to provide any more detail or debugging help?

Since Flux v2 is the focus of development, there is limited time to spend on development addressing Flux v1 issues. I will be more helpful here if you can provide bisect information (e.g. which release the issue first started affecting you in, or whether there is any Flux v1 release that, when you revert to it, does not suffer from this issue, so we can see what changed). Another thing that would be of immense help is a public repo which elicits this problem, or instructions to reproduce the issue; I cannot debug what I cannot see for myself. I'm very sorry this is affecting you, and while #3320 remains open we can certainly try to address any bugs that exist in Flux v1. Most of the issue reports we are seeing for Flux v1 at this point seem to be at the boundaries of scale, where Flux v1 is known to have some issues, and I unfortunately do not have any large-scale Flux v1 installations that I can use for testing at present, or I might be in a better position to reproduce issues like this for debugging.

To reiterate, all users are recommended to begin planning (or executing) their migrations to Flux v2, to prepare for the eventual end of life of Flux v1, but if we can narrow down the issue it will be a feather in my cap to get it fixed and merged. At some point we may no longer be able to fix bugs in Flux v1, and though we will continue to offer security fixes for some time yet, users who are still struggling with scale issues or bugs will be best advised to migrate to v2. Thanks again for your report! @Bujail are you still using Flux v1, and if so, does this issue still affect you?
We were seeing this issue even for single objects. Now we are migrating to v2.
I'm just seeing it on one of ~10 clusters, and this is one of our smaller git repos that it is syncing against. I tried adjusting how much of the repo it was scanning for a while, and that seemed to help some, but then I started seeing the following error along with the other error:
Before that it seemed to stop syncing altogether. I tried asking AWS whether there were any performance issues with the EKS control plane, but it didn't sound like there were any. It did seem to get worse after upgrading to Linkerd 2.10, which another member of my team reported in #3489. We've mostly been able to work around that on other clusters by increasing sync-timeout to 25m. We are also looking at migrating to Flux v2, but it might take us a while to get there.
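For anyone looking for the shape of that workaround: below is a minimal sketch of raising the Flux v1 daemon's sync timeout via the fluxd container args. The flag names are the Flux v1 daemon flags; the 25m value simply mirrors the figure mentioned above, and the namespace, labels, and git URL are placeholders, not values from this thread.

```yaml
# Sketch only: relevant excerpt of a Flux v1 Deployment with a raised sync timeout.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flux
  namespace: flux                              # placeholder namespace
spec:
  selector:
    matchLabels:
      name: flux
  template:
    metadata:
      labels:
        name: flux
    spec:
      containers:
        - name: flux
          image: docker.io/fluxcd/flux:1.24.0  # use the version you actually run
          args:
            - --git-url=git@example.com:org/repo.git  # placeholder repository
            - --sync-timeout=25m   # upper bound for a single sync/apply run (the daemon default is much shorter)
            - --sync-interval=5m   # how often the daemon applies the desired state
```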
25m is a very long timeout for syncing. I am not sure these are the same issue. I think I'd like to attach a debugger or profiler to try to get an idea of where it is spending all of this time. Seeing this issue even for single objects seems to indicate that it is getting stuck in a loop somewhere, or waiting for some callback that is never going to come. I will do some experiments with a debugger and come up with a strategy for diagnosing this further.
Last lines I see from sync-loop are
then nothing more. I basically get one proper Flux sync when the app comes up, and that's the last one.
Thanks for joining this issue @aleclerc-sonrai – it helps to know this issue has affected more than just one or two people. I'd like to close this issue (with a resolution), but I'm not sure anyone has identified the root cause, or whether we are really seeing only one issue here at all. I still haven't identified the source of that "INTERNAL_ERROR" in the top report. I have #3156, but I don't think that's necessarily related. I moved this out to the 1.23.2 release milestone because I don't yet know what is causing it or how to solve it, but if we get a PR and confirmation that it solves the issue, I'll be more than happy to fast-track another release. Short of that, what would help most is concrete steps to reproduce, or a public repo that I can sync and deploy Flux on that will reproduce the issue. Then I can exercise the debugger and will surely be able to find out what's happening here.
I think I've fixed it by completely disabling the registry updater |
@aleclerc-sonrai you mean this one?
That's helpful information 👍 thanks
Yes. We were only using it for Quay, but maybe some new rate limiter there is slowing down that entire job.
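The comment above does not show which flag was actually meant, but Flux v1 does expose a daemon flag for turning registry scanning off entirely. A sketch of what that looks like in the fluxd container args; everything other than the flag name is a placeholder:

```yaml
# Sketch only: fluxd container args with image registry scanning turned off.
args:
  - --git-url=git@example.com:org/repo.git   # placeholder repository
  - --registry-disable-scanning=true         # disables registry scanning (and, with it, automated image updates)
```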
I am again looking at #3156 for inclusion in the next release. Is anyone still struggling with this, and if so, could you please try the image I've just built to see if it resolves it for you? I've pushed a build of kingdonb@410a3b4 to this image:
I think there can definitely be multiple (unrelated) root causes for the sync-loop failure. I'd almost be surprised if it turns out that all subscribers to this issue have really been experiencing the error in the same way, but even so, they could all be due to this common root cause that might be resolved with that commit. I added some comments to #3156 explaining how I interpret the issue, but it's rather long and complicated, and I don't want to spread the details across more issues. I am mostly interested to know whether it fixes a problem for anyone in this thread or not.
I should be able to try it out next week. |
We seem to be having this issue on one of our clusters after it ran fine for over a year. @kingdonb we tried the image and it doesn't appear to resolve our issue. Here is what our error log looks like; it just seems to keep attempting to sync without ever actually finishing the sync.
I'm very sorry for the trouble. I just wanted to note that earlier this month, the Flux website was updated to reflect that Flux v1 has formally been marked as superseded by Flux v2, due to the stability release of the Flux v2 / GitOps Toolkit APIs: https://fluxcd.io/docs/migration/timetable/ I am not trying to indicate that this won't be fixed in Flux v1 (although I cannot provide a timetable for a fix, and I cannot even say with certainty that this report maps to just one issue). I do not know how to reproduce this issue; it sounds like some reports are related to scaling issues, and others are unlikely to be driven by scale. I will note that in the Flux v1 docs we have posted this notice for some time, asking folks to consider their upgrade plans: https://fluxcd.io/legacy/flux/faq/#why-should-i-upgrade Unfortunately, as we have seen, the fix attempted in #3156 does not work for all users, and the latest reports on this issue do not appear to have added anything to the debugging story. Short of someone posting concrete steps to reproduce, or a public repo that I can sync and deploy Flux on to reproduce the issue, I can't see a likely path to resolving this issue. Thanks for looking into this.
I was able to try …
I am putting out another Flux v1 release soon, probably a minor release that bumps Alpine from 3.12 to 3.14.1, with few other fixes, if any. I do not have a fix for this issue in my queue, and I am not sure how to fix it. I would welcome a PR from any contributor who has information about where this issue comes from, or how to reproduce it. So far we have come up empty; I do not have a clearly stated repro for this issue, so I cannot debug it further without some outside help.
Hello all, Flux 1.24.0 has been released. Please re-check this issue if you have been waiting for the release. I have no reason to suspect this issue is resolved, but it still seems possible it could have been related to #3156, which was included in this release. I think we already confirmed that it didn't fix the issue, but as I do not yet have another PR, ...
Today I got this error from kubectl; I have never seen anything like it before:
The error occurred during what appears to have been a temporary (even momentary) outage of my cluster's Keycloak or OIDC provider, or perhaps expiration of my CLI's OAuth2 token. I say this because on the next … I have just upgraded my cluster to use kube-oidc-proxy, which authenticates all external requests against this new Keycloak instance. The kube-oidc-proxy itself has two ingress-nginx load balancers in front of it (two? yeah, don't ask...). I thought to look up other possible sources of this error related to nginx and found this top link: It doesn't really shed much light, except that you can sometimes get this error from Nginx, or from ALB, or from any HTTP server, or perhaps only on TLS-secured connections. Quick check: is there anyone actually still subscribed to this issue who is still struggling with sync-loop failures, with or without the INTERNAL_ERROR message? I haven't solved it; I only seem to be getting somewhat closer to understanding the source. But I will need a live case on someone's actual cluster that is running Flux v1 to proceed with debugging further. (If everyone has upgraded to Flux v2 and/or there are no reoccurrences of this issue, then we will have to close it. I will wait about two weeks for any response before closing, but of course if someone finds this issue matches their scenario at some future point, we could reopen it if Flux is still in maintenance then.)
Hi @kingdonb, I am still experiencing this on one of our clusters. I did upgrade to 1.24.0 but am still seeing the … I tried switching over to a fresh git repo with all of the YAML copied over and didn't see much difference, except that … Let me know if there is anything I can collect from the cluster that might help.
I started seeing the issue on a second cluster now. It seemed like a performance issue with EKS at first. I was able to work around it for a while by scaling down flux and helm-operator for 10 minutes, which seemed to let EKS catch up or something. I did see the INTERNAL_ERROR message again once on this cluster. Unfortunately, my scale-down trick doesn't seem to be working anymore. I have a support ticket in with AWS to see if there is some EKS scaling that needs to happen.
Users have been reporting this issue intermittently, and the trick that seems to resolve it for many people is to add settings for:
The original poster had these values, and it didn't resolve the issue for them. I have only seen … I will note, though, that for some or many users, simply adding an increased setting for …
We were able to transition to new clusters, so we are no longer seeing this issue anywhere. We've also established a process for switching between clusters, so this should enable our transition to Flux v2.
In case this helps anyone else in the future: we started seeing the following error in our EKS cluster:
We noticed the connection to the Kubernetes API was being closed after 60s, which aligns with the default setting for … The error seemingly started occurring out of the blue, but in the end we managed to correlate it with our EKS control plane nodes being autoscaled to some larger instance types. It turns out that all responses from the Kubernetes API in the EKS control plane are compressed with gzip, which is a single-threaded operation. While our control plane had been scaled up to larger instance types, those instance types actually had a slower CPU clock speed than the newer-generation instances, negatively affecting the compression time for responses and causing the failure. AWS moving the EKS control plane back to the previous instance types resolved the issue; however, the real issue here is Flux making requests to list all resources, which can produce pretty large responses from the API. We're planning to migrate to Flux v2 in the hope that its requests to the API are crafted in a more conservative fashion.
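One way to shrink those cluster-wide LIST responses in Flux v1, short of migrating, is to restrict the namespaces the daemon is allowed to see. This is only a sketch, under the assumption that the --k8s-allow-namespace daemon flag is available in the Flux v1 version in use; the git URL and namespace names are placeholders.

```yaml
# Sketch only: limit fluxd to specific namespaces so API LIST responses stay smaller.
args:
  - --git-url=git@example.com:org/repo.git   # placeholder repository
  - --k8s-allow-namespace=team-a             # repeat the flag for each namespace Flux should watch
  - --k8s-allow-namespace=team-b
```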
This project is in Migration and security support only, so unfortunately this issue won't be fixed. We recommend that users migrate to Flux 2 at their earliest convenience. More information about the Flux 2 transition timetable can be found at: https://fluxcd.io/docs/migration/timetable/
Describe the bug
Flux sync-loop fails with internal error. Changes are not getting applied.
To Reproduce
Steps to reproduce the behaviour:
Flux deployment yaml:
Expected behavior
All new changes should be applied to the cluster.
Logs
Additional context