
[westeurope] Intermittent Ability to Communicate with API Server #577

Closed
EamonKeane opened this issue Aug 2, 2018 · 13 comments

EamonKeane commented Aug 2, 2018

What happened:
Communication with the API server is very patchy. For example, a helm install will work one moment, but the next minute a 502 Bad Gateway is returned from nginx-ingress (version quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.15.0). Another example is:

E0802 10:31:10.702009       1 reflector.go:322] github.com/jetstack/cert-manager/pkg/client/informers/externalversions/factory.go:71: 
Failed to watch *v1alpha1.Certificate: an error on the server 
("<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor=\"white\">\r\n
<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.13.6</center>\r\n</body>\r\n</html>") 
has prevented the request from succeeding (get certificates.certmanager.k8s.io)

What you expected to happen:
API server requests work as normal.

How to reproduce it (as minimally and precisely as possible):
Run 1000 requests against the API server and see if they all succeed.
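For example, a quick-and-dirty loop along these lines (just a sketch, assuming kubectl access to the cluster; the resource and timeout chosen are arbitrary):

# Fire 1000 list requests at the API server and count how many fail.
failures=0
for i in $(seq 1 1000); do
  kubectl get configmaps --all-namespaces --request-timeout=10s > /dev/null 2>&1 \
    || failures=$((failures + 1))
done
echo "failed requests: ${failures}/1000"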
Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T22:29:25Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Size of cluster (how many worker nodes are in the cluster?)
    2 Standard_DS14_v2
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
    Backend services: Prometheus, Grafana, Elasticsearch, Jenkins
  • Others:
    Kubernetes spec:
RESOURCE_GROUP=squareroute-develop
LOCATION=westeurope
NODE_VM_SIZE=Standard_DS14_v2
NODE_COUNT=2
CLUSTER_NAME=$RESOURCE_GROUP
NODE_OSDISK_SIZE=100
KUBERNETES_VERSION=1.10.5
TAGS="client=squareroute environment=develop"
MAX_PODS=30
NETWORK_PLUGIN=azure
az aks create \
    --name $CLUSTER_NAME \
    --resource-group $RESOURCE_GROUP \
    --generate-ssh-keys \
    --node-osdisk-size $NODE_OSDISK_SIZE \
    --node-vm-size $NODE_VM_SIZE \
    --node-count $NODE_COUNT \
    --network-plugin $NETWORK_PLUGIN \
    --vnet-subnet-id $SUBNET_ID \
    --kubernetes-version $KUBERNETES_VERSION \
    --max-pods $MAX_PODS \
    --location $LOCATION \
    --tags $TAGS
@DenisBiondic

I don't know if it is related, but here is an example log output from tiller running inside a k8s cluster (west EU as well):

[storage/driver] 2018/08/04 09:03:51 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 09:27:35 listing all releases with filter
[storage/driver] 2018/08/04 09:28:35 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:08:37 listing all releases with filter
[storage] 2018/08/04 11:13:32 listing all releases with filter
[storage/driver] 2018/08/04 11:14:32 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:20:45 listing all releases with filter
[storage/driver] 2018/08/04 11:21:45 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:26:41 listing all releases with filter
[storage] 2018/08/04 11:26:55 listing all releases with filter
[storage] 2018/08/04 11:27:37 listing all releases with filter
[storage/driver] 2018/08/04 11:28:37 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:28:42 listing all releases with filter
[storage] 2018/08/04 11:29:33 listing all releases with filter
[storage] 2018/08/04 11:32:20 listing all releases with filter
[storage] 2018/08/04 11:38:52 listing all releases with filter
[storage/driver] 2018/08/04 11:39:52 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:40:08 listing all releases with filter
[storage] 2018/08/04 11:40:27 listing all releases with filter
[storage] 2018/08/04 11:40:44 listing all releases with filter
[storage] 2018/08/04 11:41:44 listing all releases with filter
[storage/driver] 2018/08/04 11:42:44 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:42:53 listing all releases with filter
[storage/driver] 2018/08/04 11:43:53 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:53:03 listing all releases with filter
[storage/driver] 2018/08/04 11:54:03 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:54:07 listing all releases with filter

kubectl executes the queries completely fine, but I suspect that connectivity from inside the cluster to 10.0.0.1 is not working properly all the time. I've seen similar issues with other containers that query the API server directly.
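
A rough way to check that suspicion would be to poll the API server's service IP from inside a pod and watch for gaps, e.g. (just a sketch; it assumes curl is available in the image and the default service account token is mounted at the usual path):

# Poll the in-cluster API endpoint every 5 seconds and log the HTTP status.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
while true; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "Authorization: Bearer $TOKEN" https://10.0.0.1:443/healthz)
  echo "$(date -u +%H:%M:%S) /healthz -> $code"   # 000 means no response / timeout
  sleep 5
done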


theobolo commented Aug 6, 2018

Same here, but in North Europe; I posted another message on another issue: #581

Yep, same here. Sometimes the master API is really slow or simply not available for 5 or 10 minutes, sometimes longer. Are you having some trouble with master scaling on your side, @Azure? I've been experiencing this for 2 weeks, by the way. It's very important that your master servers are 100% available, since a lot of services use the k8s API for different things. For the moment I can say that AKS is not fully performant on that specific point.

EamonKeane (Author) commented Aug 6, 2018

AKS is basically unusable for me; for example, Jenkins builds fail around 50% of the time when a step that involves API communication doesn't get a response. The exact same setup works on GKE.

I haven't had time to investigate other configurations; my impression is that there are some ghosts in the Azure data centre networking machine. I need some reliable way to run Kubernetes on Azure. I've had similar issues with acs-engine (although I haven't tried it recently). Tools like Kubicorn dropped support for Azure because it was too difficult, so it looks like my only option is something like kubeadm or Kubernetes the Hard Way.

https://github.com/ivanfioravanti/kubernetes-the-hard-way-on-azure
https://github.com/sozercan/kubeadm-azure-terraform

weinong self-assigned this Aug 7, 2018

weinong (Contributor) commented Aug 7, 2018

Can you guys send details like subscriptionID, resource group, resource name and region to [email protected] for us to take a look?

malachma (Member) commented Aug 8, 2018

Hi @weinong, I have a customer case related to this issue. I will send you an e-mail with details about it.

@strtdusty

I just sent our subscription/cluster info. We are running in West US. I see the issue with the dashboard service, nginx-ingress, helm, and prometheus; all of them are having issues (bad gateway) connecting to the API server.

weinong (Contributor) commented Aug 8, 2018

Hi,

Your issue is similar to #522. I've patched your cluster; please let us know if it helps or not.

@EamonKeane (Author)

I made a new cluster with version 1.11.1 but the problem still appears to persist. 143 failed watches over the past two days with cert-manager. By comparison GKE has had zero over the past 29 days.

kubectl logs cert-manager-59fb9b6779-mbz42 | grep -o watch | wc -l
143
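
For what it's worth, grep -o watch counts every occurrence of the word "watch", successful or not; assuming the failure lines look like the one quoted in the issue description, counting only those lines is probably a tighter measure:

kubectl logs cert-manager-59fb9b6779-mbz42 | grep -c 'Failed to watch'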


m1o1 commented Aug 14, 2018

We have the same problem with watches.

From the cache.NewInformer documentation:

resyncPeriod: if non-zero, will re-list this often (you will get OnUpdate
calls, even if nothing changed). Otherwise, re-list will be delayed as
long as possible (until the upstream source closes the watch or times out,
or you stop the controller)

AKS might have trouble with the case where the resync period is 0 (or longer than the timeout that occurs between the pod and the API server). I mention the resync period because I'm curious whether the new timeout of 10 minutes will be sufficient for this case.

@EamonKeane (Author)

@andrew-dinunzio thanks for that context; it looks like that may be the issue. I've had helm pre-install hooks before that lasted 45 minutes, so I guess the new timeout wouldn't work for those. I'm not sure how long the watches are with cert-manager.


m1o1 commented Aug 15, 2018

Reading the comment here, I'm hoping that means there won't be any period longer than 3 minutes with no traffic between pods and the API server, which would be a good thing.

@EamonKeane (Author)

Thanks. Well, at least this seems like a fixable configuration issue that will hopefully be resolved soon.

jnoller (Contributor) commented Apr 4, 2019

Closing this issue as old/stale.

If this issue still comes up, please confirm you are running the latest AKS release. If you are on the latest release and the issue can be reproduced outside of your specific cluster, please open a new GitHub issue.

If you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc) please open an Azure technical support ticket.

jnoller closed this as completed Apr 4, 2019
ghost locked as resolved and limited conversation to collaborators Aug 2, 2020