Custom controllers appear to not get updates from shared informer after some time #522
Comments
I'm seeing similar symptoms using kubeapps v1.0.0-alpha.4 on vanilla, fresh AKS clusters in eastus and centralus using the "az aks create" defaults, which currently dispense Kubernetes v1.9.9 in my account. The problem doesn't manifest immediately, but after an hour or so of very light usage (deploying a few community charts) it seems to be reliably reproducible. Is there some logging that can be enabled to see if this is a disconnection issue, as speculated above?
The same happens for me with the Argo project's CRD controller (AKS cluster created with the az tool in Cloud Shell, Kubernetes 1.10.3 in North Europe). The logs from the controller don't show any problems.
We are experiencing this as well. A ~2 week old cluster suddenly started to display the behavior of controllers handling state updates many minutes after they happen: I would delete or create an ingress and only see it handled much later. This cluster worked fine up until 2-3 days ago. I then reprovisioned the cluster (deleting everything including the resource group) and indeed, as @nomisbeme said, initially everything worked. I would create an ingress / function resource and see the controller handle it properly. However, after about an hour or so, all state updates would again be handled minutes after they happened.
This just happened to me. New AKS cluster with k8s v1.10.5 in EastUS
UPDATE: just as I submitted this, I saw the ingress controller logs get the DELETE event and the ingress being deleted. It took upwards of 10 minutes from the time the ingress was deleted.
I believe we are also seeing this issue from one of our users of Voyager, since July 12. This is a 1.10.5 cluster. Here is some relevant log:
@ultimateboy Can you please look into this?
@rite2nikhil has been actively engaged in this. We believe we have found the problem: it was a misbehaving nginx (azureproxy) acting as a forward proxy for traffic from pods to the API server.
@prydonius the cluster you shared on the feedback email has been patched; can you see if that fixes the issue you reported? Others who want an expedited patch for their cluster can send the sub id, resource group, and cluster name to [email protected]. Thanks all for reporting the issue.
Thanks @rite2nikhil, I'll take a look!
@rite2nikhil Is there a release date for this patch so I can be sure that I have the fix on my cluster?
The fix is expected to roll out to production by 07/27. This will fix new clusters; older clusters will be patched in the next few weeks, as it will be a slower rollout process. However, an update, scale, or upgrade operation after 07/27 should apply the patch to existing clusters as well.
@rite2nikhil Can you verify that this has been patched?
Hello! This issue still seems to exist. After provisioning a new cluster I still get:
Any ideas when this issue will be resolved?
The fix has not rolled out yet as it did not make it into the release last week. Apologies for the delay; I expect it to start rolling out today and reach all production regions by the end of the week. To expedite a patch, email [email protected].
@rite2nikhil I sent an email (with cluster information) for patching our cluster. Has it been patched? Thanks.
I am seeing similar behavior with AKS in West Europe. Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}. Happy to offer any reproduction info, though the thread already contains what is needed to reproduce the issue. In short: deploy AKS and the nginx-ingress controller, wait 30-40 minutes, then update an ingress. The Kubernetes master acknowledges the change, but it is not reflected in the ingress controller pods' nginx.conf. After a considerably long time the change does sync; it may take up to an hour.
So approximately 20 minutes.
@ravicm The fix has been rolled out for this issue. Doing any update to your cluster and bouncing (deleting) the kube-system/azureproxy pod will apply the fix for this specific issue to your existing clusters.
@rite2nikhil I updated (scaled) one of our 1.10.6 AKS clusters and bounced the azureproxy pod, but deployments are still failing randomly... first try:
second try (same cluster)
The problem is that during helm install the tiller pod keeps a watch idle for more than a minute. We are increasing the idle timeout from pod to API server to 10 minutes, which will mitigate this scenario. This rollout will finish by the end of next week. Meanwhile you can send your cluster info (sub id, resource group, cluster name) to [email protected].
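For controller authors who cannot wait for the server-side timeout change, here is a minimal client-go sketch of one possible client-side mitigation (not the fix AKS shipped): bounding each watch with TimeoutSeconds so it is re-issued well inside a one-minute idle window. It assumes a recent client-go where Watch takes a context; the resource, namespace, and 50-second value are illustrative.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Ask the API server to end each watch after 50 seconds, i.e. inside the
	// ~1 minute idle window mentioned above, so the client re-issues the
	// watch (and sends bytes on the wire) before an intermediate proxy can
	// silently drop an idle connection.
	timeout := int64(50)
	ctx := context.Background()
	for {
		w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx,
			metav1.ListOptions{TimeoutSeconds: &timeout})
		if err != nil {
			log.Printf("watch failed, retrying: %v", err)
			time.Sleep(2 * time.Second) // brief backoff before retrying
			continue
		}
		for ev := range w.ResultChan() {
			log.Printf("event: %s", ev.Type)
		}
		// The channel closes when the timeout expires; loop and watch again.
		// A production controller would also track the last seen
		// resourceVersion so no events are missed between watches.
	}
}
```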
There's not sufficient detail here for me to work out what the underlying issue is/was, but that last comment sounds like we might be talking about LB connection tracking timeouts between pod and apiserver. If so, note the kubernetes apiserver duplicates the regular golang net/http server behaviour and configures a 3 minute TCP keep-alive timer. In other words, you should see some sort of TCP packets (L4) in both directions within at most 3 minutes. Note HTTP/websocket (L5) traffic might be idle for much longer. One minute is/was definitely not sufficient, assuming the apiserver hasn't been modified. If the LB is doing TCP termination before passing on to the apiserver, then you're on your own 😛
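To make that concrete on the client side, here is a minimal sketch of pinning the TCP keep-alive interval for connections to the apiserver, so an intermediate hop keeps seeing L4 traffic even while a watch is quiet at the HTTP layer. It assumes a recent client-go where rest.Config exposes a context-aware Dial hook; client-go's default transport may already set a similar keep-alive, so treat this as an idea to experiment with, not a confirmed fix.

```go
package main

import (
	"log"
	"net"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}

	// Send TCP keep-alive probes every 30 seconds on connections to the API
	// server, well inside the ~1 minute idle window discussed above.
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}
	cfg.Dial = dialer.DialContext

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	_ = client // wire this clientset into your informers/controllers as usual
}
```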
I've sent an email to [email protected] with the clusters I need patched. It would be great to get some explanation about all this mess, as it renders AKS virtually unusable for serious use. Having no updates on this blocker issue for more than a week and no response after writing to [email protected] is not making Azure the best place to run Kubernetes.
@Nuke1234 Can you please resend? Our on-calls have been patching clusters; sorry if it got missed. We understand the problem this is causing. Thanks folks for your patience; we are working on getting this fixed at the highest priority.
@rite2nikhil do you have an estimate for when the fix will be available and rolled out?
It seems we hit the same bug.
Any news? It would be great if you could keep us in the loop.
The patch worked for quite some time in our old clusters. Now that we have created a new cluster (k8s version 1.11.2), we are getting quite a few error messages. Does anyone know if this is common to k8s 1.11.2? Thanks.
I'm not sure if my case has to do with this issue here, but we see the following every ten minutes in our cert-manager pod log on our Azure AKS Kubernetes cluster running version 1.11.2: E0920 11:43:02.224615 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1079; INTERNAL_ERROR. Thanks for your help.
Same "unable to decode" problem here, on an app using client-go to watch resources |
We recently discovered while working with Microsoft that talking to the Kube master directly via its FQDN instead of the in-cluster service at 10.0.0.1 avoids these errors. It is most likely some intermediate load balancer timeout not playing well with pods that talk to the master over a long-lived watch.
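A rough sketch of that workaround from the client side. It assumes the pod is given the cluster's API server FQDN in a hypothetical APISERVER_FQDN environment variable, and that the API server certificate includes that FQDN in its SANs (it should for the AKS public endpoint); the service account token and CA still come from the in-cluster config.

```go
package main

import (
	"log"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}

	// APISERVER_FQDN is a placeholder env var you would set on the pod to
	// the cluster's API server FQDN (visible in the AKS portal / CLI).
	// Pointing Host at it bypasses the in-cluster 10.0.0.1 service and the
	// azureproxy hop mentioned earlier in this thread.
	if fqdn := os.Getenv("APISERVER_FQDN"); fqdn != "" {
		cfg.Host = "https://" + fqdn
	}

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	_ = client // use as normal with informers / controllers
}
```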
A recent related discussion in the Kubernetes Slack #sig-azure channel:
@rite2nikhil what's the update on this issue? This seems to be impacting us as well:
@juan-lee has been investigating this issue and will provide an update.
This issue has the workaround that should fix the issues described here.
@juan-lee the suggested workaround seems to send all communication between the controller and the API server over the public network. Is that correct?
It's actually over the Azure backplane, and the traffic was always taking this path. The difference is that now it is not going through an nginx reverse proxy (azureproxy) inside your cluster.
Closing this issue as old/stale. If this issue still comes up, please confirm you are running the latest AKS release. If you are on the latest release and the issue can be recreated outside of your specific cluster, please open a new GitHub issue. If you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc.), please open an Azure technical support ticket.
We've very recently been noticing an issue in our AKS clusters (as well as freshly created ones) where custom controllers unexpectedly stop responding to events on Kubernetes API resources. For example, we've seen the issue with Kubeapps, where users are unable to install Helm Charts or sync Helm Repositories after some time; the Helm CRD controller and AppRepository controller are used for these functions.
Nothing in the logs of these controllers indicates that the controller has disconnected from the shared informers, but they stop responding to new, updated, and deleted resources.
This has been reproduced on two clusters running Kubernetes v1.10.5 and v1.10.3.
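For context, a minimal sketch of the kind of shared-informer setup these controllers use (illustrative only, not taken from Kubeapps). Note that a non-zero resync period only re-delivers the informer's local cache to the handlers; it masks rather than fixes a silently dropped watch.

```go
package main

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Shared informer factory with a 5 minute resync: handlers are
	// periodically replayed from the local cache.
	factory := informers.NewSharedInformerFactory(client, 5*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			log.Printf("add %s", obj.(*corev1.Pod).Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			log.Printf("update %s", newObj.(*corev1.Pod).Name)
		},
		DeleteFunc: func(obj interface{}) {
			// obj may be a DeletedFinalStateUnknown tombstone, so only log its type.
			log.Printf("delete (type %T)", obj)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block forever; a real controller would run worker goroutines here
}
```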