
Custom controllers appear to not get updates from shared informer after some time #522

Closed
prydonius opened this issue Jul 11, 2018 · 37 comments


@prydonius

We've very recently been noticing an issue in our AKS clusters (as well as freshly created ones) where custom controllers silently stop responding to events on Kubernetes API resources. For example, we've seen the issue with Kubeapps, where users are unable to install Helm Charts or sync Helm Repositories after some time. The Helm CRD controller and AppRepository controller are used for these functions.

Nothing in the logs of these controllers indicates that they have disconnected from the shared informers, but they stop responding to new, updated, and deleted resources.

This has been reproduced on two clusters running Kubernetes v1.10.5 and v1.10.3.
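
For context, a minimal client-go sketch (not the Kubeapps or AppRepository controller code) of the pattern such controllers follow: they register add/update/delete handlers on a shared informer and rely entirely on the informer's underlying watch for event delivery, so if that watch silently goes stale the handlers simply stop firing. The ConfigMap resource below is illustrative only.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config: requests go to the kubernetes.default service IP,
	// the path that later comments identify as the problem on AKS.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Shared informer factory with a resync period. Note that resync only
	// replays the local cache; it does not re-list from the API server, so
	// it cannot paper over a silently dead watch connection.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	informer := factory.Core().V1().ConfigMaps().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("add") },
		UpdateFunc: func(oldObj, newObj interface{}) { fmt.Println("update") },
		DeleteFunc: func(obj interface{}) { fmt.Println("delete") },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, informer.HasSynced)
	<-stop // in a real controller, workers would drain a queue here
}
```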

@nomisbeme

I'm seeing similar symptoms using Kubeapps v1.0.0-alpha.4 on vanilla, fresh AKS clusters in eastus and centralus, using the "az aks create" defaults, which currently dispense Kubernetes v1.9.9 in my account.

The problem doesn't manifest immediately, but after an hour or so of very light usage (deploying a few community charts) it seems to be reliably reproducible.

Is there some logging that can be enabled to see if this is a disconnection issue as speculated above?

@hazsetata

The same happens for me with the Argo project's CRD controller (AKS cluster created with the az tool in Cloud Shell, Kubernetes 1.10.3 in Europe North). The logs from the controller don't show any problem.

@pavius

pavius commented Jul 15, 2018

We are experiencing this as well. A ~2-week-old cluster suddenly started showing controllers handling state updates many minutes after they happened. I would delete or create an ingress and nginx-ingress would not receive the event for some time. The same goes for Nuclio functions. At some point I even saw state updates with stale data (containing the content of the resource prior to the update).

This cluster worked fine up until 2-3 days ago.

I then reprovisioned the cluster (deleting everything including the resource group) and indeed, as @nomisbeme said, initially everything worked. I would create an ingress / function resource and see the controller handle it properly. However, after about an hour or so, all state updates would again be handled minutes after they happened.

@raghur

raghur commented Jul 16, 2018

This just happened to me. New AKS cluster with k8s v1.10.5 in EastUS

  1. Deployed nginx-ingress chart with
    helm install --namespace kube-system -n nginx-ingress stable/nginx-ingress --set controller.publishService.enabled=true --set rbac.create=false
  2. Created an ingress to a Jenkins instance; an address is allocated and I can browse the Jenkins instance successfully:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: jenkins-ingress
  labels:
    app: jenkins
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: jenkins2.<snipped>
      http:
        paths:
          - path: /
            backend:
              serviceName: jenkins
              servicePort: 8080
  3. After a few minutes, deleted the ingress resource.
  4. The ingress is deleted, but the ingress controller never finds out. The end result is that I can still browse to the IP/hostname and am served the Jenkins page.

UPDATE: just as I submitted this, I saw the ingress controller logs receive the DELETE event and the ingress being removed. It took upwards of 10 minutes from the time the ingress was deleted.
Repeating the entire process, I saw up to 30-minute delays for events to propagate to controllers :(

@tamalsaha

tamalsaha commented Jul 16, 2018

I believe we are also seeing this issue from one of our Voyager users, since July 12. This is a 1.10.5 cluster. Here are some relevant logs:

I0716 18:54:20.437833      13 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
I0716 18:54:20.489384      13 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
I0716 18:54:20.489739      13 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
W0716 18:54:20.532338      13 reflector.go:341] github.com/appscode/voyager/vendor/k8s.io/client-go/informers/factory.go:87: watch of *v1.ConfigMap ended with: too old resource version: 3859791 (3860126)
I0716 18:54:20.896945      13 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
I0716 18:54:21.532698      13 reflector.go:240] Listing and watching *v1.ConfigMap from github.com/appscode/voyager/vendor/k8s.io/client-go/informers/factory.go:87

@sauryadas
Contributor

@ultimateboy Can you please look into this?

@khenidak

@rite2nikhil has been actively engaged in this. We believe we have found the problem: it was a misbehaving nginx acting as a forward proxy that forwards the kubernetes default service to the hosted control plane. We have started patching clusters with the solution. Please allow some time before all clusters get the patch.

@rite2nikhil

@prydonius the cluster you shared in the feedback email has been patched; can you check whether that fixes the issue you reported?

Others who want an expedited patch for their cluster can send the subscription ID, resource group, and cluster name to [email protected]

Thanks all for reporting the issue.

@prydonius
Author

Thanks @rite2nikhil, I'll take a look!

@weinong weinong added cli bug and removed cli bug labels Jul 22, 2018
@richerlariviere

richerlariviere commented Jul 23, 2018

@rite2nikhil Is there a release date for this patch so I can be sure that I have the fix on my cluster?

@rite2nikhil

The fix is expected to roll out to production by 07/27. This will fix new clusters; older clusters will be patched over the next few weeks, as it will be a slower rollout process. However, an update, scale, or upgrade operation after 07/27 should apply the patch to current clusters as well.

@connorgorman

@rite2nikhil Can you verify that this has been patched?

@Nuke1234

Hello! This issue still seems to exist. After provisioning a new cluster I still get:

level=error ts=2018-07-30T08:59:42.36877976Z caller=main.go:216 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:270: Failed to list *v1.Pod: the server cannot complete the requested operation at this time, try again later (get pods)"

Any ideas when this issue will be resolved?

@rite2nikhil

The fix has not rolled out yet as it did not make it to the release last week. Apologies for the delay; I expect it to start rolling out today and reach all production regions by the end of the week. To expedite a patch, email [email protected]
with your subscription ID, resource group, and resource ID (cluster name).

@thezultimate

@rite2nikhil I sent an email (with cluster information) for patching our cluster. Has it been patched? Thanks.

@ravicm

ravicm commented Aug 5, 2018

I am seeing similar behavior with AKS in West Europe:

Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Happy to offer any repro info; the steps to reproduce the issue are already captured above. In short: deploy AKS and the nginx-ingress controller, wait 30-40 minutes, then update an ingress. The k8s master will acknowledge the change, but it is not reflected in the ingress controller pods' nginx.conf. After a considerably long time (it may take up to 1 hour), you will see the changes sync.

$ kubectl apply -f ingress.yaml 
ingress "test-ingress" configured

$ date
Sun Aug  5 10:56:14 EDT 2018
$ 

W0805 15:16:55.179523      23 controller.go:724] Error obtaining Endpoints for Service "default/nginx": no object matching key "default/nginx" in local store
I0805 15:16:55.179936      23 controller.go:167] Changes handled by the dynamic configuration, skipping backend reload.
I0805 15:16:55.181878      23 controller.go:202] Dynamic reconfiguration succeeded.
I0805 15:20:35.419390      23 event.go:221] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"default", Name:"test-ingress", UID:"8ef782bb-98b5-11e8-bfda-0a58ac1f036b", APIVersion:"extensions/v1beta1", ResourceVersion:"1590555", FieldPath:""}): type: 'Normal' reason: 'UPDATE' Ingress default/test-ingress
W0805 15:20:35.419561      23 controller.go:724] Error obtaining Endpoints for Service "default/nginx": no object matching key "default/nginx" in local store
I0805 15:20:35.419719      23 controller.go:169] Configuration changes detected, backend reload required.
I0805 15:20:35.514785      23 controller.go:185] Backend successfully reloaded.
I0805 15:20:35.518585      23 controller.go:202] Dynamic reconfiguration succeeded.

So, approximately a 20-minute delay.

@rite2nikhil

@ravicm The fix has been rolled out for this issue. Doing any update to your cluster and bouncing (deleting) the kube-system/azureproxy pod will apply the fix to older clusters for this specific issue.

@Nuke1234

Nuke1234 commented Aug 7, 2018

@rite2nikhil I updated (scaled) one of our 1.10.6 AKS clusters and bounced the azureproxy pod, but deployments are still failing randomly...

first try:

helm upgrade --install --namespace cds --wait --timeout=1800 --values system/values.yaml --set tags.stage1=true cds-system-1 cdsrepo/cds-system
Release "cds-system-1" does not exist. Installing it now.
Error: release cds-system-1 failed: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.extensions)

second try (same cluster)

helm upgrade --install --namespace cds --wait --timeout=1800 --values system/values.yaml --set tags.stage1=true cds-system-1 cdsrepo/cds-system
Release "cds-system-1" does not exist. Installing it now.
E0807 20:56:36.517972 31 portforward.go:178] lost connection to pod
Error: transport is closing

@rite2nikhil

The problem is that during helm install the tiller pod has a watch that sits idle for more than a minute. We are increasing the idle timeout from pod to API server to 10 minutes, which will mitigate this scenario. This rollout will finish by the end of next week. Meanwhile you can send your cluster info (subscription ID, resource group, cluster name) to [email protected]

@anguslees

There's not sufficient detail here for me to work out what the underlying issue is/was, but that last comment sounds like we might be talking about LB connection tracking timeouts between pod and apiserver.

If so, note that the kubernetes apiserver duplicates the regular golang net/http server behaviour and configures a 3-minute TCP keep-alive timer.

In other words, you should see some sort of TCP packets (L4) in both directions within at most 3 minutes. Note that HTTP/websocket (L5) traffic might be idle for much longer. One minute is/was definitely not sufficient, assuming the apiserver hasn't been modified. If the LB is doing TCP termination before passing on to the apiserver, then you're on your own 😛
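
To illustrate the point above, here is a minimal Go sketch (not the apiserver's actual source, just the net/http pattern it mirrors) of how a server ends up sending kernel keep-alive probes on a 3-minute period for every accepted connection:

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// tcpKeepAliveListener mirrors the pattern Go's net/http.ListenAndServe has
// historically used: each accepted connection gets TCP keep-alive probes.
type tcpKeepAliveListener struct {
	*net.TCPListener
}

func (ln tcpKeepAliveListener) Accept() (net.Conn, error) {
	conn, err := ln.AcceptTCP()
	if err != nil {
		return nil, err
	}
	// L4 keep-alive: probe packets flow even while the HTTP/websocket
	// stream (L5) on top is completely idle, so an LB/conntrack entry
	// should see traffic at least every ~3 minutes.
	conn.SetKeepAlive(true)
	conn.SetKeepAlivePeriod(3 * time.Minute)
	return conn, nil
}

func main() {
	ln, err := net.Listen("tcp", ":8443") // illustrative port
	if err != nil {
		panic(err)
	}
	srv := &http.Server{Handler: http.DefaultServeMux}
	srv.Serve(tcpKeepAliveListener{ln.(*net.TCPListener)})
}
```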

@Nuke1234

Nuke1234 commented Aug 8, 2018

I've sent an email with the clusters I need patched to [email protected]. It would be great to get some explanation about all this mess, as it renders AKS virtually unusable for serious use. Having no updates on this blocker issue for more than a week and no response after writing to [email protected] is not making Azure the best place to run Kubernetes.

@rite2nikhil

@Nuke1234 Can you please resend? Our on-calls have been patching clusters; sorry if it got missed.
@anguslees is correct in the explanation: idle pod->apiserver requests that get routed through the default service IP are causing the issue. The current patch allows a 1-minute idle, which we have learned is not sufficient for scenarios that issue a single watch (without retry on early connection close), so we are rolling out a change to make it 10 minutes.

We understand the problem this is causing. Thanks, folks, for your patience; we are working on getting this fixed at the highest priority.
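
For illustration only (not Voyager or tiller code), here is a hedged client-go sketch of the single-watch scenario and its simplest client-side mitigation: re-opening the watch whenever the result channel closes. It assumes a recent client-go release, where Watch takes a context; the ConfigMap resource and namespace are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for {
		// Open a watch; if an intermediate proxy drops the idle connection,
		// the result channel closes and we simply open a new watch. An
		// informer's reflector does the equivalent internally, which is why
		// a short proxy idle timeout tends to show up as delayed events
		// rather than a hard error in the controller logs.
		w, err := client.CoreV1().ConfigMaps("default").Watch(context.TODO(), metav1.ListOptions{})
		if err != nil {
			log.Printf("watch failed, retrying: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}
		for ev := range w.ResultChan() {
			log.Printf("event: %s", ev.Type)
		}
		log.Print("watch channel closed, re-establishing")
	}
}
```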

@malachma
Member

@rite2nikhil do you have an estimate for when the fix will be available and rolled out?

@mtparet

mtparet commented Aug 22, 2018

It seems we hit the same bug: Error: UPGRADE FAILED: watch closed before Until timeout when doing a helm upgrade --watch --timeout 600

@Nuke1234

Any news? It would be great if you could keep us in the loop.

@thezultimate

The patch worked for quite some time in our old clusters.

Now that we have created a new cluster (k8s version 1.11.2), we get quite a lot of error messages: ERROR: logging before flag.Parse: E0920 12:34:56.775821 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 681; INTERNAL_ERROR. And occasionally our custom controllers are slow to respond to changes (again).

Does anyone know if this is common to k8s 1.11.2?

Thanks.

@peterwy01

I'm not sure if my case has to do with this issue, but we see the following every ten minutes in our cert-manager pod log on our Azure AKS cluster running Kubernetes 1.11.2:

E0920 11:43:02.224615 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1079; INTERNAL_ERROR
E0920 11:43:02.224829 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1077; INTERNAL_ERROR
E0920 11:43:02.224974 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1071; INTERNAL_ERROR
E0920 11:43:02.225075 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1073; INTERNAL_ERROR
E0920 11:43:02.225229 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1067; INTERNAL_ERROR
E0920 11:43:02.509582 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1069; INTERNAL_ERROR
E0920 11:43:02.510031 1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1075; INTERNAL_ERROR

Thanks for your help.

@m1o1

m1o1 commented Sep 20, 2018

Same "unable to decode" problem here, on an app using client-go to watch resources

@saykumar

saykumar commented Sep 20, 2018

We recently discovered, while working with Microsoft, that talking to the Kube master directly using its FQDN instead of the in-cluster service IP (10.0.0.1) avoids these errors.

It is most likely some intermediate load balancer timeout not playing well with pods that want to talk to the master and hold a watch open.
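
A hedged sketch of what that workaround can look like from a client-go consumer's side: building a rest.Config that points at the API server FQDN instead of using rest.InClusterConfig() (which targets the kubernetes.default service IP). The FQDN below is a placeholder, and the token/CA paths are the standard in-cluster service account mounts; whether this is appropriate depends on your cluster's network rules.

```go
package main

import (
	"io/ioutil"
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newClientViaFQDN builds a client that talks to the API server through its
// FQDN rather than the in-cluster service IP (10.0.0.1), bypassing whatever
// sits on that in-cluster path.
func newClientViaFQDN() (*kubernetes.Clientset, error) {
	// Standard in-cluster service account token mount.
	token, err := ioutil.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		return nil, err
	}
	config := &rest.Config{
		Host:        "https://<your-cluster-fqdn>:443", // placeholder: the cluster's API server FQDN
		BearerToken: string(token),
		TLSClientConfig: rest.TLSClientConfig{
			CAFile: "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
		},
	}
	return kubernetes.NewForConfig(config)
}

func main() {
	if _, err := newClientViaFQDN(); err != nil {
		log.Fatal(err)
	}
}
```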

@bergerx

bergerx commented Sep 21, 2018

A recent related discussion in kubernetes slack #sig-azure channel:
https://kubernetes.slack.com/archives/C5HJXTT9Q/p1537536887000100

@asridharan

@rite2nikhil what's the update on this issue?

This seems to be impacting us as well:
Azure/application-gateway-kubernetes-ingress#45

@rite2nikhil

@juan-lee has been investigating this issue and will provide an update.

@juan-lee
Contributor

juan-lee commented Oct 3, 2018

This issue has the workaround that should fix the issues described here.

@asridharan

@juan-lee the workaround suggested seems to send all communication between the controller and the API server over the public network. Is that correct?

@juan-lee
Contributor

> @juan-lee the workaround suggested seems to send all communication between the controller and the API server over the public network. Is that correct?

It's actually over the Azure backplane, and the traffic was always taking this path. The difference is that now it is not going through an nginx reverse proxy (azureproxy) inside of your cluster.

@asridharan

> @juan-lee the workaround suggested seems to send all communication between the controller and the API server over the public network. Is that correct?
>
> It's actually over the Azure backplane, and the traffic was always taking this path. The difference is that now it is not going through an nginx reverse proxy (azureproxy) inside of your cluster.

@akshaysngupta ^^

@jnoller
Contributor

jnoller commented Apr 4, 2019

Closing this issue as old/stale.

If this issue still comes up, please confirm you are running the latest AKS release. If you are on the latest release and the issue can be recreated outside of your specific cluster, please open a new GitHub issue.

If you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc.), please open an Azure technical support ticket.

@jnoller jnoller closed this as completed Apr 4, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Aug 4, 2020