
Custom controllers appear to not get updates from shared informer after some time #522

Closed
prydonius opened this issue Jul 11, 2018 · 37 comments


@prydonius

We've very recently been noticing an issue in our AKS clusters (as well as freshly created ones) where custom controllers silently stop responding to events on Kubernetes API resources. For example, we've seen the issue with Kubeapps, where users are unable to install Helm Charts or sync Helm Repositories after some time. The Helm CRD controller and AppRepository controller are used for these functions.

Nothing in the logs of these controllers indicates that they have disconnected from the shared informers, but they stop responding to new, updated, and deleted resources.

This has been reproduced on two clusters running Kubernetes v1.10.5 and v1.10.3.
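
For context, a minimal client-go sketch (not the Kubeapps or AppRepository controller code) of the pattern such controllers follow: they register add/update/delete handlers on a shared informer and rely entirely on the informer's underlying watch for event delivery, so if that watch silently goes stale the handlers simply stop firing. The ConfigMap resource below is illustrative only.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config: requests go to the kubernetes.default service IP,
	// the path that later comments identify as the problem on AKS.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Shared informer factory with a resync period. Note that resync only
	// replays the local cache; it does not re-list from the API server, so
	// it cannot paper over a silently dead watch connection.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	informer := factory.Core().V1().ConfigMaps().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("add") },
		UpdateFunc: func(oldObj, newObj interface{}) { fmt.Println("update") },
		DeleteFunc: func(obj interface{}) { fmt.Println("delete") },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, informer.HasSynced)
	<-stop // in a real controller, workers would drain a queue here
}
```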

@nomisbeme

I'm seeing similar symptoms using Kubeapps v1.0.0-alpha.4 on vanilla, fresh AKS clusters in eastus and centralus, using the "az aks create" defaults, which currently dispense Kubernetes v1.9.9 in my account.

The problem doesn't manifest immediately, but after an hour or so of very light usage (deploying a few community charts) it seems to be reliably reproducible.

Is there some logging that can be enabled to see if this is a disconnection issue as speculated above?

@hazsetata

The same happens for me with the Argo project's CRD controller (AKS cluster created with the az tool in Cloud Shell, Kubernetes 1.10.3 in Europe North). The logs from the controller don't show any problem.

@pavius

pavius commented Jul 15, 2018

We are experiencing this as well. A ~2-week-old cluster suddenly started showing controllers handling state updates many minutes after they happened. I would delete or create an ingress and nginx-ingress would not receive the event for some time. The same goes for Nuclio functions. At some point I even saw state updates with stale data (containing the content of the resource prior to the update).

This cluster worked fine up until 2-3 days ago.

I then reprovisioned the cluster (deleting everything including the resource group) and indeed, as @nomisbeme said, initially everything worked. I would create an ingress / function resource and see the controller handle it properly. However, after about an hour or so, all state updates would again be handled minutes after they happened.

@raghur

raghur commented Jul 16, 2018

This just happened to me. New AKS cluster with k8s v1.10.5 in EastUS

  1. Deployed nginx-ingress chart with
    helm install --namespace kube-system -n nginx-ingress stable/nginx-ingress --set controller.publishService.enabled=true --set rbac.create=false
  2. Created an ingress to a Jenkins instance; an address is allocated and I can browse the Jenkins instance successfully:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: jenkins-ingress
  labels:
    app: jenkins
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: jenkins2.<snipped>
      http:
        paths:
          - path: /
            backend:
              serviceName: jenkins
              servicePort: 8080
  3. After a few minutes, deleted the ingress resource.
  4. The ingress is deleted, but the ingress controller never finds out. The end result is that I can still browse to the IP/hostname and am served the Jenkins page.

UPDATE: just as I submitted this, I saw the ingress controller logs receive the DELETE event and the ingress being removed. It took upwards of 10 minutes from the time the ingress was deleted.
Repeating the entire process, I saw up to 30-minute delays for events to propagate to controllers :(

@tamalsaha

tamalsaha commented Jul 16, 2018

I believe we are also seeing this issue from one of our Voyager users, since July 12. This is a 1.10.5 cluster. Here are some relevant logs:

I0716 18:54:20.437833      13 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
I0716 18:54:20.489384      13 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
I0716 18:54:20.489739      13 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
W0716 18:54:20.532338      13 reflector.go:341] github.com/appscode/voyager/vendor/k8s.io/client-go/informers/factory.go:87: watch of *v1.ConfigMap ended with: too old resource version: 3859791 (3860126)
I0716 18:54:20.896945      13 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
I0716 18:54:21.532698      13 reflector.go:240] Listing and watching *v1.ConfigMap from github.com/appscode/voyager/vendor/k8s.io/client-go/informers/factory.go:87

@sauryadas
Contributor

@ultimateboy Can you please look into this?

@khenidak

@rite2nikhil has been actively engaged in this. We believe we have found the problem: it was a misbehaving nginx acting as a forward proxy that forwards the kubernetes default service to the hosted control plane. We have started patching clusters with the solution. Please allow some time before all clusters get the patch.

@rite2nikhil

@prydonius the cluster you shared in the feedback email has been patched; can you check whether that fixes the issue you reported?

Others who want an expedited patch for their cluster can send the subscription ID, resource group, and cluster name to [email protected]

Thanks all for reporting the issue.

@prydonius
Author

Thanks @rite2nikhil, I'll take a look!

@weinong weinong added cli bug and removed cli bug labels Jul 22, 2018
@richerlariviere

richerlariviere commented Jul 23, 2018

@rite2nikhil Is there a release date for this patch so I can be sure that I have the fix on my cluster?

@rite2nikhil

The fix is expected to roll out to production by 07/27. This will fix new clusters; older clusters will be patched over the next few weeks, as it will be a slower rollout process. However, an update, scale, or upgrade operation after 07/27 should apply the patch to current clusters as well.

@connorgorman

@rite2nikhil Can you verify that this has been patched?

@Nuke1234

Hello! This issue still seems to exist. After provisioning a new cluster I still get:

level=error ts=2018-07-30T08:59:42.36877976Z caller=main.go:216 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:270: Failed to list *v1.Pod: the server cannot complete the requested operation at this time, try again later (get pods)"

Any ideas when this issue will be resolved?

@rite2nikhil

The fix has not rolled out yet as it did not make it to the release last week. Apologies for the delay; I expect it to start rolling out today and reach all production regions by the end of the week. To expedite a patch, email [email protected]
with your subscription ID, resource group, and resource ID (cluster name).

@thezultimate

@rite2nikhil I sent an email (with cluster information) for patching our cluster. Has it been patched? Thanks.

@ravicm

ravicm commented Aug 5, 2018

I am seeing similar behavior with AKS in West Europe:

Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Happy to offer any repro info; the steps to reproduce the issue are already captured above. In short: deploy AKS and the nginx-ingress controller, wait 30-40 minutes, then update an ingress. The k8s master will acknowledge the change, but it is not reflected in the ingress controller pods' nginx.conf. After a considerably long time (it may take up to 1 hour), you will see the changes sync.

$ kubectl apply -f ingress.yaml 
ingress "test-ingress" configured

$ date
Sun Aug  5 10:56:14 EDT 2018
$ 

W0805 15:16:55.179523      23 controller.go:724] Error obtaining Endpoints for Service "default/nginx": no object matching key "default/nginx" in local store
I0805 15:16:55.179936      23 controller.go:167] Changes handled by the dynamic configuration, skipping backend reload.
I0805 15:16:55.181878      23 controller.go:202] Dynamic reconfiguration succeeded.
I0805 15:20:35.419390      23 event.go:221] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"test-ingress", UID:"8ef782bb-98b5-11e8-bfda-0a58ac1f036b", APIVersion:"extensions/v1beta1", ResourceVersion:"1590555", FieldPath:""}): type: 'Normal' reason: 'UPDATE' Ingress default/test-ingress
W0805 15:20:35.419561      23 controller.go:724] Error obtaining Endpoints for Service "default/nginx": no object matching key "default/nginx" in local store
I0805 15:20:35.419719      23 controller.go:169] Configuration changes detected, backend reload required.
I0805 15:20:35.514785      23 controller.go:185] Backend successfully reloaded.
I0805 15:20:35.518585      23 controller.go:202] Dynamic reconfiguration succeeded.

So, approximately a 20-minute delay.

@rite2nikhil

@ravicm The fix has been rolled out for this issue. Doing any update to your cluster and bouncing (deleting) the kube-system/azureproxy pod will apply the fix to older clusters for this specific issue.

@Nuke1234

Nuke1234 commented Aug 7, 2018

@rite2nikhil I updated (scaled) one of our 1.10.6 AKS clusters and bounced the azureproxy pod, but deployments are still failing randomly...

first try:

helm upgrade --install --namespace cds --wait --timeout=1800 --values system/values.yaml --set tags.stage1=true cds-system-1 cdsrepo/cds-system
Release "cds-system-1" does not exist. Installing it now.
Error: release cds-system-1 failed: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.extensions)

second try (same cluster)

helm upgrade --install --namespace cds --wait --timeout=1800 --values system/values.yaml --set tags.stage1=true cds-system-1 cdsrepo/cds-system
Release "cds-system-1" does not exist. Installing it now.
E0807 20:56:36.517972 31 portforward.go:178] lost connection to pod
Error: transport is closing

@rite2nikhil

The problem is that during helm install the tiller pod has a watch that sits idle for more than a minute. We are increasing the idle timeout from pod to API server to 10 minutes, which will mitigate this scenario. This rollout will finish by the end of next week. Meanwhile you can send your cluster info (subscription ID, resource group, cluster name) to [email protected]

@anguslees

There's not sufficient detail here for me to work out what the underlying issue is/was, but that last comment sounds like we might be talking about LB connection tracking timeouts between pod and apiserver.

If so, note that the kubernetes apiserver duplicates the regular golang net/http server behaviour and configures a 3-minute TCP keep-alive timer.

In other words, you should see some sort of TCP packets (L4) in both directions within at most 3 minutes. Note that HTTP/websocket (L5) traffic might be idle for much longer. One minute is/was definitely not sufficient, assuming the apiserver hasn't been modified. If the LB is doing TCP termination before passing on to the apiserver, then you're on your own 😛
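
To illustrate the point above, here is a minimal Go sketch (not the apiserver's actual source, just the net/http pattern it mirrors) of how a server ends up sending kernel keep-alive probes on a 3-minute period for every accepted connection:

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// tcpKeepAliveListener mirrors the pattern Go's net/http.ListenAndServe has
// historically used: each accepted connection gets TCP keep-alive probes.
type tcpKeepAliveListener struct {
	*net.TCPListener
}

func (ln tcpKeepAliveListener) Accept() (net.Conn, error) {
	conn, err := ln.AcceptTCP()
	if err != nil {
		return nil, err
	}
	// L4 keep-alive: probe packets flow even while the HTTP/websocket
	// stream (L5) on top is completely idle, so an LB/conntrack entry
	// should see traffic at least every ~3 minutes.
	conn.SetKeepAlive(true)
	conn.SetKeepAlivePeriod(3 * time.Minute)
	return conn, nil
}

func main() {
	ln, err := net.Listen("tcp", ":8443") // illustrative port
	if err != nil {
		panic(err)
	}
	srv := &http.Server{Handler: http.DefaultServeMux}
	srv.Serve(tcpKeepAliveListener{ln.(*net.TCPListener)})
}
```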

@Nuke1234

Nuke1234 commented Aug 8, 2018

I've sent an email with the clusters I need patched to [email protected]. It would be great to get some explanation about all this mess, as it renders AKS virtually unusable for serious use. Having no updates on this blocker issue for more than a week and no response after writing to [email protected] is not making Azure the best place to run Kubernetes.

@rite2nikhil

@Nuke1234 Can you please resend? Our on-calls have been patching clusters; sorry if it got missed.
@anguslees is correct in the explanation: idle pod->apiserver requests that get routed through the default service IP are causing the issue. The current patch allows a 1-minute idle, which we have learned is not sufficient for scenarios that issue a single watch (without retry on early connection close), so we are rolling out a change to make it 10 minutes.

We understand the problem this is causing. Thanks, folks, for your patience; we are working on getting this fixed at the highest priority.
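
For illustration only (not Voyager or tiller code), here is a hedged client-go sketch of the single-watch scenario and its simplest client-side mitigation: re-opening the watch whenever the result channel closes. It assumes a recent client-go release, where Watch takes a context; the ConfigMap resource and namespace are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for {
		// Open a watch; if an intermediate proxy drops the idle connection,
		// the result channel closes and we simply open a new watch. An
		// informer's reflector does the equivalent internally, which is why
		// a short proxy idle timeout tends to show up as delayed events
		// rather than a hard error in the controller logs.
		w, err := client.CoreV1().ConfigMaps("default").Watch(context.TODO(), metav1.ListOptions{})
		if err != nil {
			log.Printf("watch failed, retrying: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}
		for ev := range w.ResultChan() {
			log.Printf("event: %s", ev.Type)
		}
		log.Print("watch channel closed, re-establishing")
	}
}
```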

@malachma
Member

@rite2nikhil do you have an estimate for when the fix will be available and rolled out?

@mtparet

mtparet commented Aug 22, 2018

It seems we hit the same bug: Error: UPGRADE FAILED: watch closed before Until timeout when doing a helm upgrade --watch --timeout 600

@Nuke1234

Any news? It would be great if you could keep us in the loop.

@thezultimate

The patch worked for quite some time in our old clusters.

Now that we have created a new cluster (k8s version 1.11.2), we get quite a lot of error messages: ERROR: logging before flag.Parse: E0920 12:34:56.775821 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 681; INTERNAL_ERROR. And occasionally our custom controllers are slow to respond to changes (again).

Does anyone know if this is common to k8s 1.11.2?

Thanks.

@peterwy01

I'm not sure if my case has to do with this issue, but we see the following every ten minutes in our cert-manager pod log on our Azure AKS cluster running Kubernetes 1.11.2:

E0920 11:43:02.224615 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1079; INTERNAL_ERROR
E0920 11:43:02.224829 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1077; INTERNAL_ERROR
E0920 11:43:02.224974 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1071; INTERNAL_ERROR
E0920 11:43:02.225075 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1073; INTERNAL_ERROR
E0920 11:43:02.225229 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1067; INTERNAL_ERROR
E0920 11:43:02.509582 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1069; INTERNAL_ERROR
E0920 11:43:02.510031 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1075; INTERNAL_ERROR

Thanks for your help.

@m1o1

m1o1 commented Sep 20, 2018

Same "unable to decode" problem here, on an app using client-go to watch resources

@saykumar

saykumar commented Sep 20, 2018

We recently discovered, while working with Microsoft, that talking to the Kube master directly using its FQDN instead of the in-cluster service IP (10.0.0.1) avoids these errors.

It is most likely some intermediate load balancer timeout not playing well with pods that want to talk to the master and hold a watch open.
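
A hedged sketch of what that workaround can look like from a client-go consumer's side: building a rest.Config that points at the API server FQDN instead of using rest.InClusterConfig() (which targets the kubernetes.default service IP). The FQDN below is a placeholder, and the token/CA paths are the standard in-cluster service account mounts; whether this is appropriate depends on your cluster's network rules.

```go
package main

import (
	"io/ioutil"
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newClientViaFQDN builds a client that talks to the API server through its
// FQDN rather than the in-cluster service IP (10.0.0.1), bypassing whatever
// sits on that in-cluster path.
func newClientViaFQDN() (*kubernetes.Clientset, error) {
	// Standard in-cluster service account token mount.
	token, err := ioutil.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		return nil, err
	}
	config := &rest.Config{
		Host:        "https://<your-cluster-fqdn>:443", // placeholder: the cluster's API server FQDN
		BearerToken: string(token),
		TLSClientConfig: rest.TLSClientConfig{
			CAFile: "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
		},
	}
	return kubernetes.NewForConfig(config)
}

func main() {
	if _, err := newClientViaFQDN(); err != nil {
		log.Fatal(err)
	}
}
```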

@bergerx

bergerx commented Sep 21, 2018

A recent related discussion in kubernetes slack #sig-azure channel:
https://kubernetes.slack.com/archives/C5HJXTT9Q/p1537536887000100

@asridharan

@rite2nikhil what's the update on this issue?

This seems to be impacting us as well:
Azure/application-gateway-kubernetes-ingress#45

@rite2nikhil

@juan-lee has been investigating this issue and will provide an update.

@juan-lee
Contributor

juan-lee commented Oct 3, 2018

This issue has the workaround that should fix the issues described here.

@asridharan

@juan-lee the workaround suggested seems to send all communication between the controller and the API server over the public network. Is that correct?

@juan-lee
Contributor

> @juan-lee the workaround suggested seems to send all communication between the controller and the API server over the public network. Is that correct?

It's actually over the Azure backplane, and the traffic was always taking this path. The difference is that now it is not going through an nginx reverse proxy (azureproxy) inside of your cluster.

@asridharan

> @juan-lee the workaround suggested seems to send all communication between the controller and the API server over the public network. Is that correct?
>
> It's actually over the Azure backplane, and the traffic was always taking this path. The difference is that now it is not going through an nginx reverse proxy (azureproxy) inside of your cluster.

@akshaysngupta ^^

@jnoller
Contributor

jnoller commented Apr 4, 2019

Closing this issue as old/stale.

If this issue still comes up, please confirm you are running the latest AKS release. If you are on the latest release and the issue can be recreated outside of your specific cluster, please open a new GitHub issue.

If you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc.), please open an Azure technical support ticket.

@jnoller jnoller closed this as completed Apr 4, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Aug 4, 2020