
all SubConns are in TransientFailure, latest connection error: <nil> #2636

Closed
mpuncel opened this issue Feb 12, 2019 · 25 comments

Comments

@mpuncel

mpuncel commented Feb 12, 2019

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.18.0

What version of Go are you using (go version)?

1.11.2

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

Ran gRPC requests in production through a local Envoy sidecar, so all connections go to a local process over a Unix socket. WaitForReady was not set to true for these calls.

What did you expect to see?

I'd expect it to be exceedingly rare for connections to a local Unix-socket sidecar to be in the TRANSIENT_FAILURE state, and if they were, I'd expect to see a non-nil error message. We use whatever the default balancer implementation is, but we do pass in a custom dialer implementation:

	// The address and timeout arguments are ignored; we always dial the local Envoy socket.
	envoySocketDialer := func(address string, timeout time.Duration) (net.Conn, error) {
		return net.DialUnix("unix", nil, &net.UnixAddr{
			Name: envoySocketPath,
			Net:  "unix",
		})
	}
	options := []grpc.DialOption{
		...
		grpc.WithDialer(envoySocketDialer), // Override the dialer as a workaround for https://github.com/grpc/grpc-go/issues/2510
	}

What did you see instead?

A small percentage (<1%) of RPCs failed with an "all SubConns are in TransientFailure, latest connection error: <nil>" error message. Without knowing much about the connection pooling implementation, it seems like the error should typically be non-nil there.

@mpuncel
Author

mpuncel commented Feb 12, 2019

I suspect that setting WaitForReady(true) in the call options will effectively prevent this issue for me, but I'm curious whether there's a way to get more information about why all SubConns would be in the failure state, and whether there's a lurking bug, since there don't seem to be any connection errors.
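
For reference, a minimal sketch of the two places this option can be set (the stub and method names below are placeholders, not our actual code):

// Per RPC on the generated stub: queue the RPC (bounded by the context
// deadline) instead of failing fast while all SubConns are in TRANSIENT_FAILURE.
resp, err := client.SomeMethod(ctx, req, grpc.WaitForReady(true))

// Or as a default for every RPC made on this connection:
conn, err := grpc.Dial(target,
	grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
	// ...the other DialOptions (custom dialer, etc.)...
)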

@zhyon404

Ran into the same problem when upgrading to 1.18.0; 1.17.0 works fine.

@menghanl menghanl self-assigned this Feb 21, 2019
@jdejesus07

I am currently using 1.19.0 and hit the same issue described here. I added WaitForReady(true) to the call options and it worked as @mpuncel stated, thanks.
There definitely seems to be a bug here; I'll use this as a workaround for now and keep digging to see what I come up with.

@menghanl
Contributor

Thanks for reporting. This is indeed a bug.
I suspect we either set the SubConn to TransientFailure when it isn't, or we forgot to set the connection error. Either way, IMO the last connection error shouldn't be nil here.

Are there any other logs that indicate whether there was an actual error (like a broken connection) when this happened?

@menghanl
Contributor

Also, it could be related to #2565 (see #2406 for more information).

Can you try setting GRPC_GO_REQUIRE_HANDSHAKE=off and see if that changes anything?
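
Note that (in these versions) the variable is read when the grpc package initializes, so it needs to be in the process environment before the client starts (e.g. `GRPC_GO_REQUIRE_HANDSHAKE=off ./your-client`); setting it with os.Setenv from main is generally too late. A hypothetical Go launcher, if you need to set it programmatically (the binary path is a placeholder):

// Run the client binary with the handshake requirement disabled.
cmd := exec.Command("./your-grpc-client")
cmd.Env = append(os.Environ(), "GRPC_GO_REQUIRE_HANDSHAKE=off")
cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
if err := cmd.Run(); err != nil {
	log.Fatal(err)
}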

@jdejesus007

2019/03/01 14:17:38 http2: Framer 0xc4206b7260: wrote HEADERS flags=END_HEADERS stream=1 len=107

2019/03/01 14:17:38 http2: Framer 0xc4206b7260: wrote DATA flags=END_STREAM stream=1 len=1285 data="\x00\x00\x00\x05\x00\t\xbb'\x0f\v\xb5v:@\x11\r\xe0-\x90\xa0\x04T\xc0\x18\x012\n\b(\x12\x06\b\xb1\xa3\xdd\xe3\x052\n\br\x12\x06\b\x84\xcd\xe3\xe2\x052\v\b\xb0\x01\x12\x06\b\xa7ڵ\xe1\x052\v\b\xd5\f\x12\x06\b\x96\x8b\xed\xe2\x052\v\b\xd8 \x12\x06\b\xbd\x95\xdd\xe3\x052\v\b\xa1-\x12\x06\b\xa7\xa5\xdc\xe3\x052\v\b\xf6>\x12\x06\b\x96\x8b\xed\xe2\x052\n\b\x0f\x12\x06\b\x96\x8b\xed\xe2\x052\n\b\x12\x12\x06\b\x9b\xc0\xc9\xe2\x052\v\b\x8c\x01\x12\x06\b\xa3\xa9\xd1\xe0\x052\v\b\xc7\x01\x12\x06\b\xda\xd1\xdb\xe3\x052\v\b\xff\x02\x12\x06\bރ\x93\xe3\x052\v\b\x99\x1c\x12\x06\b\xb0\x89\xdc\xe3\x052\v\b\xb8#\x12\x06\b\x96\x8b\xed\xe2\x052\v\b\xc8,\x12\x06\b\xa7\xbe\xd2\xe3\x052\v\b\xf04\x12\x06\b\xb5\xe5\xd2\xe2\x052\v\b\xf6\x01\x12\x06\b\xe8\xa9\xd6\xe1\x052\v\b\xf9\x01\x12\x06\b\xbd\xe7\xb6\xe3\x052" (1029 bytes omitted)

INFO: 2019/03/01 14:17:38 pickfirstBalancer: HandleSubConnStateChange: 0xc4203ccee0, TRANSIENT_FAILURE

INFO: 2019/03/01 14:17:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"

I removed WaitForReady and turned the handshake requirement off, but experienced the same behavior.

@aanm
Contributor

aanm commented Mar 5, 2019

@menghanl setting GRPC_GO_REQUIRE_HANDSHAKE=off fixes the issue for me.

@menghanl
Contributor

menghanl commented Mar 6, 2019

There is more than one thing happening here, but the root cause is a failure to reset the reconnect backoff, which is fixed in #2669. Please try the fix and see if it helps.

Because of the bug, reconnecting doesn't happen in time, resulting in a longer-than-expected time in transient failure.

About `latest connection error: <nil>`: this error message was introduced for errors like a TLS misconfiguration, so that when the connection fails, the RPC error contains useful information.
It can still be nil when the connection is closed by the server. That shouldn't normally be noticeable because reconnecting happens immediately, but because of the backoff bug it's encountered much more easily.

One possible fix is to set the error whenever a connection goes into transient failure, so it will never be nil. I will save that for another fix.

@aanm
Contributor

aanm commented Mar 7, 2019

@menghanl patch #2669 does not solve the issue; it only works if GRPC_GO_REQUIRE_HANDSHAKE=off is set.

@menghanl
Contributor

menghanl commented Mar 7, 2019

@aanm
If GRPC_GO_REQUIRE_HANDSHAKE=off is required, it sounds like the server didn't finish the handshake in time.
How is your server set up? Is it a proxy, a gRPC server, or a server behind a mux (so the port is shared by gRPC and HTTP)?

@bbassingthwaite

bbassingthwaite commented Mar 12, 2019

@menghanl #2669 does not fix the issue for me either, but GRPC_GO_REQUIRE_HANDSHAKE=off does work for me.

@bbassingthwaite

We are multiplexing gRPC and HTTP traffic on a single port using cmux. Removing cmux from the equation fixes the problem.

import "github.com/cockroachdb/cmux"

server := grpc.NewServer()

lis, err := net.Listen("tcp", ":8081")
if err != nil {
    log.Fatal(err)
}
mux := cmux.New(lis)
grpcL := mux.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))

server.Serve(grpcL)

@menghanl
Contributor

menghanl commented Mar 12, 2019

@bbassingthwaite-va
That is caused by #2565 (also see #2406).
With that change, gRPC-Go clients have the same behavior as gRPC-Java clients, so you need to set the same writer-aware match as described in https://github.com/soheilhy/cmux#limitations

An issue (soheilhy/cmux#64) was filed before the gRPC-Go changes were checked in. Will send a PR to update their README as well.
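
A minimal sketch of the writer-aware match described there (using the github.com/soheilhy/cmux import path from that README; the serveMuxed helper and its arguments are just placeholders for illustration):

import (
	"net"
	"net/http"

	"github.com/soheilhy/cmux"
	"google.golang.org/grpc"
)

func serveMuxed(lis net.Listener, grpcSrv *grpc.Server, httpSrv *http.Server) error {
	mux := cmux.New(lis)

	// MatchWithWriters lets cmux write the server's HTTP/2 SETTINGS frame while
	// it is still matching, which is what handshake-waiting clients
	// (gRPC-Java, and gRPC-Go with the new behavior) expect before sending RPCs.
	grpcL := mux.MatchWithWriters(
		cmux.HTTP2MatchHeaderFieldSendSettings("content-type", "application/grpc"))
	httpL := mux.Match(cmux.Any())

	go grpcSrv.Serve(grpcL)
	go httpSrv.Serve(httpL)

	return mux.Serve()
}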

@aanm
Contributor

aanm commented Mar 13, 2019

> @aanm
> If GRPC_GO_REQUIRE_HANDSHAKE=off is required, it sounds like the server didn't finish the handshake in time.
> How is your server set up? Is it a proxy, a gRPC server, or a server behind a mux (so the port is shared by gRPC and HTTP)?

@menghanl sorry for the late reply. The server is a binary already compiled with grpc 1.7.5 (5b3c4e8); it's a program called cri-o. Its vendor directory pins the commit I mentioned: https://github.com/kubernetes-sigs/cri-o/blob/v1.13.1/vendor.conf#L24. It also vendors a cmux library: https://github.com/kubernetes-sigs/cri-o/blob/v1.13.1/vendor.conf#L115.

If you want to install it on your machine, these instructions should be enough:

sudo apt-key adv --recv-key --keyserver keyserver.ubuntu.com 8BECF1637AD8C79D

cat <<EOF | sudo tee /etc/apt/sources.list.d/projectatomic-ubuntu-ppa-artful.list
deb http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
deb-src http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
EOF
sudo apt-get update
sudo apt-get remove cri-o-1.* -y || true
sudo apt-get install cri-o-1.13 -y

The client that I'm using to test is a "hello world"-style program, compiled with grpc 1.18.0, that checks the status of the server:

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"

	criRuntime "k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2"
	"k8s.io/kubernetes/pkg/kubelet/util"
)

// criClient wraps the generated runtime service client (type filled in here so
// the snippet compiles).
type criClient struct {
	RuntimeServiceClient criRuntime.RuntimeServiceClient
}

func getGRPCClient(ctx context.Context) (*grpc.ClientConn, error) {
	addr, dialer, err := util.GetAddressAndDialer("unix:///var/run/crio/crio.sock")
	if err != nil {
		return nil, err
	}

	c, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(c, addr, grpc.WithDialer(dialer), grpc.WithInsecure(), grpc.WithBlock(), grpc.WithBackoffMaxDelay(15*time.Second))
	if err != nil {
		return nil, fmt.Errorf("failed to connect: %s", err)
	}
	return conn, nil
}

func newCRIClient(ctx context.Context) (*criClient, error) {
	cc, err := getGRPCClient(ctx)
	if err != nil {
		return nil, err
	}
	rsc := criRuntime.NewRuntimeServiceClient(cc)
	return &criClient{rsc}, nil
}

func testCRI() {
	c, err := newCRIClient(context.Background())
	if err != nil {
		panic(err)
	}
	sreq := &criRuntime.StatusRequest{
		Verbose: false,
	}
	sresp, err := c.RuntimeServiceClient.Status(context.Background(), sreq)
	fmt.Println(sresp)
	if err != nil {
		panic(err)
	}
}

If this client program is executed with GRPC_GO_REQUIRE_HANDSHAKE=off, everything works fine.

@menghanl
Contributor

menghanl commented Mar 13, 2019

@aanm That is caused by #2565 (also see #2406).
cri-o will need to set the writer-aware match as mentioned in https://github.com/soheilhy/cmux#limitations for gRPC-Java clients.
For now, until that's done, you can set GRPC_GO_REQUIRE_HANDSHAKE=off.

@dfawley dfawley added P2 and removed P1 labels Mar 14, 2019
@dfawley
Member

dfawley commented Mar 14, 2019

Lowering priority, since the main issue is resolved by eliminating the backoff after a successful connection is established. Ideally the error message would be populated with "server handshake not received" (or similar). When it's due to the handshake, this will happen naturally when we make the transport creation function blocking. When it's due to a subsequent connection attempt, we need to plumb the error from the transport to the addrConn.

saschagrunert added a commit to openSUSE/cri-o that referenced this issue Aug 5, 2019
gRPC clients which wait until they receive a `SETTINGS` frame may not be
able to connect to CRI-O because of a limitation of cmux:
- https://github.com/soheilhy/cmux#limitations

The issue does not appear to be solved yet and makes it impossible to connect
when upgrading cri-tools to the latest version of gRPC. A possible workaround
is to export the environment variable `GRPC_GO_REQUIRE_HANDSHAKE=off`.

Reference: grpc/grpc-go#2636

Signed-off-by: Sascha Grunert <[email protected]>
saschagrunert added a commit to openSUSE/cri-tools that referenced this issue Aug 5, 2019
The mentioned issue in 3d04f3b is
related to a limitation of cmux, where clients may block until they
receive a `SETTINGS` frame. My tests were only related to CRI-O, where
an appropriate fix in the project should solve the issue from a general
perspective.

Nevertheless, a general workaround for end-users trying to connect to a
cmux based implementation might be exporting the environment variable
`GRPC_GO_REQUIRE_HANDSHAKE=off`. This has been documented in the
README.md as well.

References:
- grpc/grpc-go#2636
- https://github.com/soheilhy/cmux#limitations
- cri-o/cri-o#2697

Signed-off-by: Sascha Grunert <[email protected]>
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cri-o that referenced this issue Sep 5, 2019
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cri-o that referenced this issue Sep 5, 2019
haircommander pushed a commit to haircommander/cri-o that referenced this issue Sep 5, 2019
@stale

stale bot commented Sep 6, 2019

This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@stale stale bot added the stale label Sep 6, 2019
@dfawley dfawley removed the stale label Sep 6, 2019
facebook-github-bot pushed a commit to magma/magma that referenced this issue Nov 19, 2019
Summary:
Pull Request resolved: #901

In production we are seeing: all SubConns are in TransientFailure, latest connection error: <nil>
as referenced by: grpc/grpc-go#2636. The primary issue is fixed in later grpc-go
versions and as such we update our go.mod for Magma to 1.25.0

Reviewed By: xjtian

Differential Revision: D18544967

fbshipit-source-id: 58208650d4b29feadae2cdd708b505b66359d068
gjalves pushed a commit to gjalves/magma that referenced this issue Nov 19, 2019
@perolausson

perolausson commented Dec 2, 2019

I've noticed this problem too, but found that it only seems to happen when I have left a program being debugged and then hibernated the system. On waking up I typically see that there is still a listener, as expected, but incoming requests are never accepted and eventually the process locks up. Is there some kind of timekeeping inside the Go implementation of the gRPC server that gets confused when time freezes and then unfreezes when the system wakes up again?

@menghanl
Contributor

menghanl commented Dec 4, 2019

@perolausson there's no server-side timekeeping. Could it be that the listener doesn't work well after the system is hibernated, and that in this case all connections are blocked on connecting but don't error either?

@perolausson

I don't really know what is going on, but I noticed that a few people reported seeing this in the context of containers (where pausing the container is quite common), and it has been a fairly consistent observation in my own work, as I hibernate my development machine all the time.

@noelukwa

noelukwa commented Dec 5, 2019

still facing this issue

@joelklabo

I'm also experiencing this on lnd version 0.12.0

@menghanl menghanl closed this as completed May 4, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 1, 2021