GRPC Health check sometimes stuck in failed state #7965
Labels
inactive/not-enough-info
Unable to act on the request due to insufficient information
type/question
Not an "enhancement" or "bug". Please post on discuss.hashicorp
Overview of the Issue
We use Consul to register gRPC services, and it performs health checks using the built-in gRPC health checking protocol. Most of the time this works well, but sometimes, mostly after an update of our stack, some of the health checks fail over and over again.
The gRPC server is reachable from other sources, so the server itself is working.
Once a check gets into this state, it never leaves it unless Consul or the service is restarted.
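For context, here is a minimal sketch of the kind of registration involved, assuming the service is registered through the agent HTTP endpoint PUT /v1/agent/service/register with a gRPC check. The helper name buildGrpcServiceRegistration, the service name, the ID, and the interval and timeout values are hypothetical. The address and port are borrowed from the log fragment in this issue purely for illustration. The actual registration in our stack goes through the consul node package and may differ.

```javascript
// Hypothetical helper: builds the JSON body for Consul's
// PUT /v1/agent/service/register endpoint with a gRPC health check.
// Field names (Name, ID, Address, Port, Check.GRPC, Check.GRPCUseTLS,
// Check.Interval, Check.Timeout) follow the Consul agent HTTP API.
function buildGrpcServiceRegistration({ name, id, address, port, interval = "10s", timeout = "5s" }) {
  return {
    Name: name,
    ID: id,
    Address: address,
    Port: port,
    Check: {
      Name: `${name} gRPC health`,
      // Consul dials this host:port and calls the standard gRPC health service.
      GRPC: `${address}:${port}`,
      GRPCUseTLS: false, // assumption: plaintext gRPC inside the swarm network
      Interval: interval, // assumption: illustrative value
      Timeout: timeout,   // assumption: illustrative value
    },
  };
}

// Example values: name and id are made up; address and port come from the log fragment.
const registration = buildGrpcServiceRegistration({
  name: "example-grpc",
  id: "example-grpc-1",
  address: "172.18.0.26",
  port: 6000,
});

console.log(JSON.stringify(registration, null, 2));
```

If the container's address changes during a stack update, comparing the GRPC target registered here with the address the agent keeps dialing might help narrow down whether the check is pointing at a stale endpoint.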
Reproduction Steps
Steps to reproduce this issue, eg:
Consul info for both Client and Server
Clients: GRPC-node version 1.24.2, consul node package version 0.37.0
Consul version 1.7.3
Operating system and Environment details
AWS EC2 instances running docker swarm
Log Fragments
2020-05-27T16:24:35.779Z [WARN] agent: Check is now critical: check=service:9113f7ae09c4b503fa7e90fb74649112
2020-05-27T16:24:35.779Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {172.18.0.26:6000 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.18.0.26:6000: operation was canceled". Reconnecting...
"operation was canceled" suggests that the dial was canceled on the client side (inside the Consul agent) and the connection attempt never reached the service at all.
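To see how often this happens and against which targets, the agent log can be scanned for these warnings. The following is a rough sketch, assuming the log lines have the same shape as the single warning quoted above. The function name parseDialFailure and the canceledLocally flag are hypothetical, and the flag only encodes the interpretation given in this issue, not a confirmed diagnosis.

```javascript
// Pattern derived only from the one warning line quoted in this issue;
// other Consul or grpc-go versions may format the message differently.
const DIAL_FAILURE =
  /failed to connect to \{(\S+:\d+)[^}]*\}\. Err :connection error: desc = "transport: Error while dialing dial tcp \S+: (.+?)"/;

// Hypothetical helper: extracts the dial target and the failure reason
// from an agent log line, or returns null if the line does not match.
function parseDialFailure(line) {
  const match = DIAL_FAILURE.exec(line);
  if (!match) return null;
  const [, target, reason] = match;
  return {
    target,
    reason,
    // Assumption: this reason means the dial was canceled before reaching the service.
    canceledLocally: reason === "operation was canceled",
  };
}

const sample =
  '2020-05-27T16:24:35.779Z [WARN] agent: grpc: addrConn.createTransport failed to connect to ' +
  '{172.18.0.26:6000 0 <nil>}. Err :connection error: desc = "transport: Error while dialing ' +
  'dial tcp 172.18.0.26:6000: operation was canceled". Reconnecting...';

console.log(parseDialFailure(sample));
```

Running this over the full agent log and grouping by target would show whether the stuck checks all point at addresses that changed during the stack update.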