
GRPC Health check sometimes stuck in failed state #7965

Closed
daviderenger opened this issue May 28, 2020 · 2 comments
Labels
inactive/not-enough-info (Unable to act on the request due to insufficient information), type/question (Not an "enhancement" or "bug". Please post on discuss.hashicorp)

Comments


daviderenger commented May 28, 2020

Overview of the Issue

We are using Consul to register GRPC services, and it performs health checks with the built-in GRPC health protocol. Most of the time this works very well, but sometimes, mostly after an update of our stack, some of the health checks fail over and over again.

The GRPC server is reachable from other sources, so it is working.

Once a check gets into this state it never leaves it unless Consul or the service is restarted.
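
For illustration, here is a minimal sketch of the kind of registration involved, expressed directly against the Consul agent HTTP API (/v1/agent/service/register) rather than through the consul node package listed under "Consul info" below; the service name, address, port and interval are placeholders, and a local agent on 127.0.0.1:8500 plus Node 18+ (for the global fetch) are assumed:

```ts
// Minimal sketch: register a service with a GRPC health check through the
// Consul agent HTTP API. Names, addresses and intervals are placeholders.
async function registerGrpcService(): Promise<void> {
  const res = await fetch("http://127.0.0.1:8500/v1/agent/service/register", {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      Name: "example-grpc-service",   // hypothetical service name
      Address: "172.18.0.26",         // address of the container running the service
      Port: 6000,
      Check: {
        // Consul dials this address and calls grpc.health.v1.Health/Check
        GRPC: "172.18.0.26:6000",
        GRPCUseTLS: false,
        Interval: "10s",
      },
    }),
  });
  if (!res.ok) {
    throw new Error(`registration failed: ${res.status} ${await res.text()}`);
  }
}

registerGrpcService().catch((err) => console.error(err));
```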

Reproduction Steps

Steps to reproduce this issue, e.g.:

  1. Create a cluster with 1 client node and 1 server node
  2. Register some services (we have about 100) with GRPC health checks
  3. Redeploy the stack (docker swarm)
  4. Sometimes one or a couple of health checks are always failing (a sketch for listing the stuck checks follows this list)
  5. Redeploy Consul or the specific service and it works again
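
To confirm step 4, the agent-local checks can be listed via the Consul HTTP API (/v1/agent/checks) and filtered for the ones stuck in "critical", together with their last output. A sketch under the same assumptions as above (local agent on 127.0.0.1:8500, Node 18+):

```ts
// List agent-local checks and print any that are critical, using /v1/agent/checks.
interface AgentCheck {
  CheckID: string;
  Name: string;
  Status: string;   // "passing" | "warning" | "critical"
  Output: string;   // last check output, e.g. the dial error from the log below
}

async function listCriticalChecks(): Promise<void> {
  const res = await fetch("http://127.0.0.1:8500/v1/agent/checks");
  const checks: Record<string, AgentCheck> = await res.json();
  for (const check of Object.values(checks)) {
    if (check.Status === "critical") {
      console.log(`${check.CheckID}: ${check.Output}`);
    }
  }
}

listCriticalChecks().catch((err) => console.error(err));
```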

Consul info for both Client and Server

Clients: GRPC-node version 1.24.2, consul node package version 0.37.0
Consul version 1.7.3

Operating system and Environment details

AWS EC2 instances running docker swarm

Log Fragments

2020-05-27T16:24:35.779Z [WARN] agent: Check is now critical: check=service:9113f7ae09c4b503fa7e90fb74649112
2020-05-27T16:24:35.779Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {172.18.0.26:6000 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.18.0.26:6000: operation was canceled". Reconnecting...

"operation was canceled" indicates that it was canceled in the client and never got out to the service at all.

@jsosulska added the type/question (Not an "enhancement" or "bug". Please post on discuss.hashicorp) label on May 28, 2020
@jsosulska
Contributor

Hi @daviderenger,
Thanks for posting! For questions like this, you may find a faster response from the community as a whole by posting on our Discuss forums. Feel free to post there and close this thread.

I have some questions about your setup.

  1. Can you please explain why you are using one server? Is this a server running in -dev mode? If so, have you noticed similar things happen when there are 3 nodes?
  2. Can you post your server config file, with the sensitive bits commented out? If not using a config file, please post the full command you are running to get this set up. (A sketch for dumping the agent's resolved configuration follows this list.)
  3. When you say "Redeploy the stack (docker swarm)", how is this happening? Is it a "big bang" deployment?
  4. Can you post a bit more information about your topology? How many hosts are used for the swarm? What are their instance profiles?
  5. On your own step 5, you said that "redeploying Consul" fixes the issue. Is this a stop/start of the Consul client? Or the Consul server?
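
Regarding question 2: if the original config file is not handy, one alternative is to dump the agent's resolved configuration from the /v1/agent/self endpoint and redact anything sensitive before posting. A minimal sketch, again assuming a local agent on 127.0.0.1:8500 and Node 18+:

```ts
// Dump the agent's resolved configuration via /v1/agent/self.
// "Config" is a summary; "DebugConfig" holds the full runtime configuration.
async function dumpAgentConfig(): Promise<void> {
  const res = await fetch("http://127.0.0.1:8500/v1/agent/self");
  const self = await res.json();
  console.log(JSON.stringify(self.Config, null, 2));
  console.log(JSON.stringify(self.DebugConfig, null, 2));
}

dumpAgentConfig().catch((err) => console.error(err));
```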

If you can post more information and replication steps, it may help with debugging this. As I mentioned before, posting on Discuss will get this in front of a bigger audience as well, and you can link back to here.

Happy coding!

@jsosulska added the waiting-reply (Waiting on response from Original Poster or another individual in the thread) label on Jun 1, 2020
@jkirschner-hashicorp added the inactive/not-enough-info (Unable to act on the request due to insufficient information) label and removed the waiting-reply label on Aug 30, 2021
@jkirschner-hashicorp
Contributor

Hi @daviderenger,

I'm closing for now since we haven't heard any updates in a while. Feel free to reply and reopen if needed.
