
[westeurope] Intermittent Ability to Communicate with API Server #577

Closed
EamonKeane opened this issue Aug 2, 2018 · 13 comments

EamonKeane commented Aug 2, 2018

What happened:
Communication with the API server is very patchy. For example, a helm install will work one moment, but the next minute a 502 Bad Gateway is returned from nginx-ingress (version quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.15.0). Another example is:

E0802 10:31:10.702009       1 reflector.go:322] github.com/jetstack/cert-manager/pkg/client/informers/externalversions/factory.go:71: 
Failed to watch *v1alpha1.Certificate: an error on the server 
("<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body bgcolor=\"white\">\r\n
<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>nginx/1.13.6</center>\r\n</body>\r\n</html>") 
has prevented the request from succeeding (get certificates.certmanager.k8s.io)

What you expected to happen:
API server requests work as normal.

How to reproduce it (as minimally and precisely as possible):
Run 1000 requests against the API server and see if they all succeed.
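For example, a quick-and-dirty loop along these lines (just a sketch, assuming kubectl access to the cluster; the resource and timeout chosen are arbitrary):

# Fire 1000 list requests at the API server and count how many fail.
failures=0
for i in $(seq 1 1000); do
  kubectl get configmaps --all-namespaces --request-timeout=10s > /dev/null 2>&1 \
    || failures=$((failures + 1))
done
echo "failed requests: ${failures}/1000"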
Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T22:29:25Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Size of cluster (how many worker nodes are in the cluster?)
    2 Standard_DS14_v2
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
    Backend services: Prometheus, Grafana, Elasticsearch, Jenkins
  • Others:
    Kubernetes spec:
RESOURCE_GROUP=squareroute-develop
LOCATION=westeurope
NODE_VM_SIZE=Standard_DS14_v2
NODE_COUNT=2
CLUSTER_NAME=$RESOURCE_GROUP
NODE_OSDISK_SIZE=100
KUBERNETES_VERSION=1.10.5
TAGS="client=squareroute environment=develop"
MAX_PODS=30
NETWORK_PLUGIN=azure
az aks create \
    --name $CLUSTER_NAME \
    --resource-group $RESOURCE_GROUP \
    --generate-ssh-keys \
    --node-osdisk-size $NODE_OSDISK_SIZE \
    --node-vm-size $NODE_VM_SIZE \
    --node-count $NODE_COUNT \
    --network-plugin $NETWORK_PLUGIN \
    --vnet-subnet-id $SUBNET_ID \
    --kubernetes-version $KUBERNETES_VERSION \
    --max-pods $MAX_PODS \
    --location $LOCATION \
    --tags $TAGS
@DenisBiondic

I don't know if it is related, but here is an example log output from tiller running inside a k8s cluster (west EU as well):

[storage/driver] 2018/08/04 09:03:51 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 09:27:35 listing all releases with filter
[storage/driver] 2018/08/04 09:28:35 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:08:37 listing all releases with filter
[storage] 2018/08/04 11:13:32 listing all releases with filter
[storage/driver] 2018/08/04 11:14:32 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:20:45 listing all releases with filter
[storage/driver] 2018/08/04 11:21:45 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:26:41 listing all releases with filter
[storage] 2018/08/04 11:26:55 listing all releases with filter
[storage] 2018/08/04 11:27:37 listing all releases with filter
[storage/driver] 2018/08/04 11:28:37 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:28:42 listing all releases with filter
[storage] 2018/08/04 11:29:33 listing all releases with filter
[storage] 2018/08/04 11:32:20 listing all releases with filter
[storage] 2018/08/04 11:38:52 listing all releases with filter
[storage/driver] 2018/08/04 11:39:52 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:40:08 listing all releases with filter
[storage] 2018/08/04 11:40:27 listing all releases with filter
[storage] 2018/08/04 11:40:44 listing all releases with filter
[storage] 2018/08/04 11:41:44 listing all releases with filter
[storage/driver] 2018/08/04 11:42:44 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:42:53 listing all releases with filter
[storage/driver] 2018/08/04 11:43:53 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:53:03 listing all releases with filter
[storage/driver] 2018/08/04 11:54:03 list: failed to list: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)
[storage] 2018/08/04 11:54:07 listing all releases with filter

kubectl executes the queries completely fine, but I suspect that connectivity from inside the cluster to 10.0.0.1 is not working properly all the time. I've seen similar issues with other containers that query the API server directly.
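
A rough way to check that suspicion would be to poll the API server's service IP from inside a pod and watch for gaps, e.g. (just a sketch; it assumes curl is available in the image and the default service account token is mounted at the usual path):

# Poll the in-cluster API endpoint every 5 seconds and log the HTTP status.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
while true; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "Authorization: Bearer $TOKEN" https://10.0.0.1:443/healthz)
  echo "$(date -u +%H:%M:%S) /healthz -> $code"   # 000 means no response / timeout
  sleep 5
done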


theobolo commented Aug 6, 2018

Same here, but in North Europe; I posted another message on another issue: #581

Yep, same here. Sometimes the master API is really slow or simply not available for 5 or 10 minutes, sometimes longer. Are you having some trouble with master scaling on your side, @Azure? I've been experiencing this for 2 weeks, by the way. It's very important that your master servers are 100% available, since a lot of services use the k8s API for different things. For the moment I can say that AKS is not fully performant on that specific point.

EamonKeane (Author) commented Aug 6, 2018

AKS is basically unusable for me; for example, Jenkins builds fail around 50% of the time when a step that involves API communication doesn't get a response. The exact same setup works on GKE.

I haven't had time to investigate other configurations; my impression is that there are some ghosts in the Azure data centre networking machine. I need some reliable way to run Kubernetes on Azure. I've had similar issues with acs-engine (although I haven't tried it recently). Tools like Kubicorn dropped support for Azure because it was too difficult, so it looks like my only option is something like kubeadm or Kubernetes the Hard Way.

https://github.com/ivanfioravanti/kubernetes-the-hard-way-on-azure
https://github.com/sozercan/kubeadm-azure-terraform

weinong self-assigned this Aug 7, 2018

weinong (Contributor) commented Aug 7, 2018

Can you guys send details like subscriptionID, resource group, resource name and region to [email protected] for us to take a look?

malachma (Member) commented Aug 8, 2018

Hi @weinong, I have a customer case related to this issue. I will send you an e-mail with details about it.

@strtdusty

I just sent our subscription/cluster info. We are running in West US. I see the issue with the dashboard service, nginx-ingress, helm, and prometheus; all of them are having issues (bad gateway) connecting to the API server.

weinong (Contributor) commented Aug 8, 2018

Hi,

Your issue is similar to #522. I've patched your cluster; please let us know if it helps or not.

@EamonKeane (Author)

I made a new cluster with version 1.11.1 but the problem still appears to persist. 143 failed watches over the past two days with cert-manager. By comparison GKE has had zero over the past 29 days.

kubectl logs cert-manager-59fb9b6779-mbz42 | grep -o watch | wc -l
143
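
For what it's worth, grep -o watch counts every occurrence of the word "watch", successful or not; assuming the failure lines look like the one quoted in the issue description, counting only those lines is probably a tighter measure:

kubectl logs cert-manager-59fb9b6779-mbz42 | grep -c 'Failed to watch'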


m1o1 commented Aug 14, 2018

We have the same problem with watches.

From the cache.NewInformer documentation:

resyncPeriod: if non-zero, will re-list this often (you will get OnUpdate
calls, even if nothing changed). Otherwise, re-list will be delayed as
long as possible (until the upstream source closes the watch or times out,
or you stop the controller)

AKS might have trouble with the case where the resync period is 0 (or longer than the timeout that occurs between the pod and the API server). I mention the resync period because I'm curious whether the new timeout of 10 minutes will be sufficient for this case.

@EamonKeane (Author)

@andrew-dinunzio thanks for that context; it looks like that may be the issue. I've had helm pre-install hooks before that lasted 45 minutes, so I guess the new timeout wouldn't work for those. I'm not sure how long the watches are with cert-manager.


m1o1 commented Aug 15, 2018

Reading the comment here, I'm hoping that means there won't be any period longer than 3 minutes with no traffic between pods and the API server, which would be a good thing.

@EamonKeane (Author)

Thanks. Well, at least this seems like a fixable configuration issue that will hopefully be resolved soon.

jnoller (Contributor) commented Apr 4, 2019

Closing this issue as old/stale.

If this issue still comes up, please confirm you are running the latest AKS release. If you are on the latest release and the issue can be reproduced outside of your specific cluster, please open a new GitHub issue.

If you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc) please open an Azure technical support ticket.

jnoller closed this as completed Apr 4, 2019
ghost locked as resolved and limited conversation to collaborators Aug 2, 2020