
GRPC Health check sometimes stuck in failed state #7965

Closed
daviderenger opened this issue May 28, 2020 · 2 comments
Labels
inactive/not-enough-info (Unable to act on the request due to insufficient information), type/question (Not an "enhancement" or "bug". Please post on discuss.hashicorp)

Comments


daviderenger commented May 28, 2020

Overview of the Issue

We are using Consul to register GRPC services, and it performs health checks with the built-in GRPC health protocol. Most of the time this works very well, but sometimes, mostly after an update of our stack, some of the health checks fail over and over again.

The GRPC server is reachable from other sources, so it is working.

Once a check gets into this state it never leaves it unless Consul or the service is restarted.
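
For illustration, here is a minimal sketch of the kind of registration involved, expressed directly against the Consul agent HTTP API (/v1/agent/service/register) rather than through the consul node package listed under "Consul info" below; the service name, address, port and interval are placeholders, and a local agent on 127.0.0.1:8500 plus Node 18+ (for the global fetch) are assumed:

```ts
// Minimal sketch: register a service with a GRPC health check through the
// Consul agent HTTP API. Names, addresses and intervals are placeholders.
async function registerGrpcService(): Promise<void> {
  const res = await fetch("http://127.0.0.1:8500/v1/agent/service/register", {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      Name: "example-grpc-service",   // hypothetical service name
      Address: "172.18.0.26",         // address of the container running the service
      Port: 6000,
      Check: {
        // Consul dials this address and calls grpc.health.v1.Health/Check
        GRPC: "172.18.0.26:6000",
        GRPCUseTLS: false,
        Interval: "10s",
      },
    }),
  });
  if (!res.ok) {
    throw new Error(`registration failed: ${res.status} ${await res.text()}`);
  }
}

registerGrpcService().catch((err) => console.error(err));
```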

Reproduction Steps

Steps to reproduce this issue, e.g.:

  1. Create a cluster with 1 client node and 1 server node
  2. Register some services (we have about 100) with GRPC health checks
  3. Redeploy the stack (docker swarm)
  4. Sometimes one or a couple of health checks are always failing (a sketch for listing the stuck checks follows this list)
  5. Redeploy Consul or the specific service and it works again
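
To confirm step 4, the agent-local checks can be listed via the Consul HTTP API (/v1/agent/checks) and filtered for the ones stuck in "critical", together with their last output. A sketch under the same assumptions as above (local agent on 127.0.0.1:8500, Node 18+):

```ts
// List agent-local checks and print any that are critical, using /v1/agent/checks.
interface AgentCheck {
  CheckID: string;
  Name: string;
  Status: string;   // "passing" | "warning" | "critical"
  Output: string;   // last check output, e.g. the dial error from the log below
}

async function listCriticalChecks(): Promise<void> {
  const res = await fetch("http://127.0.0.1:8500/v1/agent/checks");
  const checks: Record<string, AgentCheck> = await res.json();
  for (const check of Object.values(checks)) {
    if (check.Status === "critical") {
      console.log(`${check.CheckID}: ${check.Output}`);
    }
  }
}

listCriticalChecks().catch((err) => console.error(err));
```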

Consul info for both Client and Server

Clients: GRPC-node version 1.24.2, consul node package version 0.37.0
Consul version 1.7.3

Operating system and Environment details

AWS EC2 instances running docker swarm

Log Fragments

2020-05-27T16:24:35.779Z [WARN] agent: Check is now critical: check=service:9113f7ae09c4b503fa7e90fb74649112
2020-05-27T16:24:35.779Z [WARN] agent: grpc: addrConn.createTransport failed to connect to {172.18.0.26:6000 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.18.0.26:6000: operation was canceled". Reconnecting...

"operation was canceled" indicates that it was canceled in the client and never got out to the service at all.

@jsosulska added the type/question (Not an "enhancement" or "bug". Please post on discuss.hashicorp) label on May 28, 2020
@jsosulska
Contributor

Hi @daviderenger,
Thanks for posting! For questions like this, you may find a faster response from the community as a whole by posting on our Discuss forums. Feel free to post there and close this thread.

I have some questions about your setup.

  1. Can you please explain why you are using one server? Is this a server running in -dev mode? If so, have you noticed similar things happen when there are 3 nodes?
  2. Can you post your server config file, with the sensitive bits commented out? If not using a config file, please post the full command you are running to get this set up. (A sketch for dumping the agent's resolved configuration follows this list.)
  3. When you say "Redeploy the stack (docker swarm)", how is this happening? Is it a "big bang" deployment?
  4. Can you post a bit more information about your topology? How many hosts are used for the swarm? What are their instance profiles?
  5. On your own step 5, you said that "redeploying Consul" fixes the issue. Is this a stop/start of the Consul client? Or the Consul server?
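
Regarding question 2: if the original config file is not handy, one alternative is to dump the agent's resolved configuration from the /v1/agent/self endpoint and redact anything sensitive before posting. A minimal sketch, again assuming a local agent on 127.0.0.1:8500 and Node 18+:

```ts
// Dump the agent's resolved configuration via /v1/agent/self.
// "Config" is a summary; "DebugConfig" holds the full runtime configuration.
async function dumpAgentConfig(): Promise<void> {
  const res = await fetch("http://127.0.0.1:8500/v1/agent/self");
  const self = await res.json();
  console.log(JSON.stringify(self.Config, null, 2));
  console.log(JSON.stringify(self.DebugConfig, null, 2));
}

dumpAgentConfig().catch((err) => console.error(err));
```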

If you can post more information and replication steps, it may help with debugging this. As I mentioned before, posting on Discuss will get this in front of a bigger audience as well, and you can link back to here.

Happy coding!

@jsosulska added the waiting-reply (Waiting on response from Original Poster or another individual in the thread) label on Jun 1, 2020
@jkirschner-hashicorp added the inactive/not-enough-info (Unable to act on the request due to insufficient information) label and removed the waiting-reply label on Aug 30, 2021
@jkirschner-hashicorp
Contributor

Hi @daviderenger,

I'm closing for now since we haven't heard any updates in a while. Feel free to reply and reopen if needed.
