
all SubConns are in TransientFailure, latest connection error: <nil> #2636

Closed
mpuncel opened this issue Feb 12, 2019 · 25 comments

Comments

@mpuncel

mpuncel commented Feb 12, 2019

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.18.0

What version of Go are you using (go version)?

1.11.2

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

Ran gRPC requests in production through a local Envoy sidecar, so all connections go to a local process over a Unix socket. WaitForReady was not set to true for these calls.

What did you expect to see?

I'd expect it to be exceedingly rare for connections to a local Unix-socket sidecar to be in the TRANSIENT_FAILURE state, and if they were, I'd expect to see a non-nil error message. We use whatever the default balancer implementation is, but we do pass in a custom dialer implementation:

	// The address and timeout arguments are ignored; we always dial the local Envoy socket.
	envoySocketDialer := func(address string, timeout time.Duration) (net.Conn, error) {
		return net.DialUnix("unix", nil, &net.UnixAddr{
			Name: envoySocketPath,
			Net:  "unix",
		})
	}
	options := []grpc.DialOption{
		...
		grpc.WithDialer(envoySocketDialer), // Override the dialer as a workaround for https://github.com/grpc/grpc-go/issues/2510
	}

What did you see instead?

A small percentage (<1%) of RPCs failed with an "all SubConns are in TransientFailure, latest connection error: <nil>" error message. Without knowing much about the connection pooling implementation, it seems like the error should typically be non-nil there.

@mpuncel
Author

mpuncel commented Feb 12, 2019

I suspect that setting WaitForReady(true) in the call options will effectively prevent this issue for me, but I'm curious whether there's a way to get more information about why all SubConns would be in the failure state, and whether there's a lurking bug, since there don't seem to be any connection errors.
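
For reference, a minimal sketch of the two places this option can be set (the stub and method names below are placeholders, not our actual code):

// Per RPC on the generated stub: queue the RPC (bounded by the context
// deadline) instead of failing fast while all SubConns are in TRANSIENT_FAILURE.
resp, err := client.SomeMethod(ctx, req, grpc.WaitForReady(true))

// Or as a default for every RPC made on this connection:
conn, err := grpc.Dial(target,
	grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
	// ...the other DialOptions (custom dialer, etc.)...
)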

@zhyon404

Ran into the same problem when upgrading to 1.18.0; 1.17.0 works fine.

@menghanl menghanl self-assigned this Feb 21, 2019
@jdejesus07

I am currently using 1.19.0 and hit the same issue described here. I added WaitForReady(true) to the call options and it worked as @mpuncel stated, thanks.
There definitely seems to be a bug here; I'll use this as a workaround for now and keep digging to see what I come up with.

@menghanl
Contributor

Thanks for reporting. This is indeed a bug.
I suspect we either set the SubConn to TransientFailure when it isn't, or we forgot to set the connection error. Either way, IMO the last connection error shouldn't be nil here.

Are there any other logs that indicate whether there was an actual error (like a broken connection) when this happened?

@menghanl
Contributor

Also, it could be related to #2565 (see #2406 for more information).

Can you try setting GRPC_GO_REQUIRE_HANDSHAKE=off and see if that changes anything?
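
Note that (in these versions) the variable is read when the grpc package initializes, so it needs to be in the process environment before the client starts (e.g. `GRPC_GO_REQUIRE_HANDSHAKE=off ./your-client`); setting it with os.Setenv from main is generally too late. A hypothetical Go launcher, if you need to set it programmatically (the binary path is a placeholder):

// Run the client binary with the handshake requirement disabled.
cmd := exec.Command("./your-grpc-client")
cmd.Env = append(os.Environ(), "GRPC_GO_REQUIRE_HANDSHAKE=off")
cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
if err := cmd.Run(); err != nil {
	log.Fatal(err)
}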

@jdejesus007

2019/03/01 14:17:38 http2: Framer 0xc4206b7260: wrote HEADERS flags=END_HEADERS stream=1 len=107

2019/03/01 14:17:38 http2: Framer 0xc4206b7260: wrote DATA flags=END_STREAM stream=1 len=1285 data="\x00\x00\x00\x05\x00\t\xbb'\x0f\v\xb5v:@\x11\r\xe0-\x90\xa0\x04T\xc0\x18\x012\n\b(\x12\x06\b\xb1\xa3\xdd\xe3\x052\n\br\x12\x06\b\x84\xcd\xe3\xe2\x052\v\b\xb0\x01\x12\x06\b\xa7ڵ\xe1\x052\v\b\xd5\f\x12\x06\b\x96\x8b\xed\xe2\x052\v\b\xd8 \x12\x06\b\xbd\x95\xdd\xe3\x052\v\b\xa1-\x12\x06\b\xa7\xa5\xdc\xe3\x052\v\b\xf6>\x12\x06\b\x96\x8b\xed\xe2\x052\n\b\x0f\x12\x06\b\x96\x8b\xed\xe2\x052\n\b\x12\x12\x06\b\x9b\xc0\xc9\xe2\x052\v\b\x8c\x01\x12\x06\b\xa3\xa9\xd1\xe0\x052\v\b\xc7\x01\x12\x06\b\xda\xd1\xdb\xe3\x052\v\b\xff\x02\x12\x06\bރ\x93\xe3\x052\v\b\x99\x1c\x12\x06\b\xb0\x89\xdc\xe3\x052\v\b\xb8#\x12\x06\b\x96\x8b\xed\xe2\x052\v\b\xc8,\x12\x06\b\xa7\xbe\xd2\xe3\x052\v\b\xf04\x12\x06\b\xb5\xe5\xd2\xe2\x052\v\b\xf6\x01\x12\x06\b\xe8\xa9\xd6\xe1\x052\v\b\xf9\x01\x12\x06\b\xbd\xe7\xb6\xe3\x052" (1029 bytes omitted)

INFO: 2019/03/01 14:17:38 pickfirstBalancer: HandleSubConnStateChange: 0xc4203ccee0, TRANSIENT_FAILURE

INFO: 2019/03/01 14:17:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"

I removed WaitForReady and turned the handshake requirement off, but experienced the same behavior.

@aanm
Contributor

aanm commented Mar 5, 2019

@menghanl setting GRPC_GO_REQUIRE_HANDSHAKE=off fixes the issue for me.

@menghanl
Contributor

menghanl commented Mar 6, 2019

There is more than one thing happening here, but the root cause is a failure to reset the reconnect backoff, which is fixed in #2669. Please try the fix and see if it helps.

Because of the bug, reconnecting doesn't happen in time, resulting in a longer-than-expected time in transient failure.

About `latest connection error: <nil>`: this error message was introduced for errors like a TLS misconfiguration, so that when the connection fails, the RPC error contains useful information.
It can still be nil when the connection is closed by the server. That shouldn't normally be noticeable because reconnecting happens immediately, but because of the backoff bug it's encountered much more easily.

One possible fix is to set the error whenever a connection goes into transient failure, so it will never be nil. I will save that for another fix.

@aanm
Contributor

aanm commented Mar 7, 2019

@menghanl patch #2669 does not solve the issue; it only works if GRPC_GO_REQUIRE_HANDSHAKE=off is set.

@menghanl
Contributor

menghanl commented Mar 7, 2019

@aanm
If GRPC_GO_REQUIRE_HANDSHAKE=off is required, it sounds like the server didn't finish the handshake in time.
How is your server set up? Is it a proxy, a gRPC server, or a server behind a mux (so the port is shared by gRPC and HTTP)?

@bbassingthwaite

bbassingthwaite commented Mar 12, 2019

@menghanl #2669 does not fix the issue for me either, but GRPC_GO_REQUIRE_HANDSHAKE=off does work for me.

@bbassingthwaite

We are multiplexing gRPC and HTTP traffic on a single port using cmux. Removing cmux from the equation fixes the problem.

import "github.com/cockroachdb/cmux"

server := grpc.NewServer()

lis, err := net.Listen("tcp", ":8081")
if err != nil {
    log.Fatal(err)
}
mux := cmux.New(lis)
grpcL := mux.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))

server.Serve(grpcL)

@menghanl
Contributor

menghanl commented Mar 12, 2019

@bbassingthwaite-va
That is caused by #2565 (also see #2406).
With that change, gRPC-Go clients have the same behavior as gRPC-Java clients, so you need to set the same writer-aware match as described in https://github.com/soheilhy/cmux#limitations

An issue (soheilhy/cmux#64) was filed before the gRPC-Go changes were checked in. Will send a PR to update their README as well.
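
A minimal sketch of the writer-aware match described there (using the github.com/soheilhy/cmux import path from that README; the serveMuxed helper and its arguments are just placeholders for illustration):

import (
	"net"
	"net/http"

	"github.com/soheilhy/cmux"
	"google.golang.org/grpc"
)

func serveMuxed(lis net.Listener, grpcSrv *grpc.Server, httpSrv *http.Server) error {
	mux := cmux.New(lis)

	// MatchWithWriters lets cmux write the server's HTTP/2 SETTINGS frame while
	// it is still matching, which is what handshake-waiting clients
	// (gRPC-Java, and gRPC-Go with the new behavior) expect before sending RPCs.
	grpcL := mux.MatchWithWriters(
		cmux.HTTP2MatchHeaderFieldSendSettings("content-type", "application/grpc"))
	httpL := mux.Match(cmux.Any())

	go grpcSrv.Serve(grpcL)
	go httpSrv.Serve(httpL)

	return mux.Serve()
}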

@aanm
Contributor

aanm commented Mar 13, 2019

> @aanm
> If GRPC_GO_REQUIRE_HANDSHAKE=off is required, it sounds like the server didn't finish the handshake in time.
> How is your server set up? Is it a proxy, a gRPC server, or a server behind a mux (so the port is shared by gRPC and HTTP)?

@menghanl sorry for the late reply. The server is a binary already compiled with grpc 1.7.5 (5b3c4e8); it's a program called cri-o. Its vendor directory pins the commit I mentioned: https://github.com/kubernetes-sigs/cri-o/blob/v1.13.1/vendor.conf#L24. It also vendors a cmux library: https://github.com/kubernetes-sigs/cri-o/blob/v1.13.1/vendor.conf#L115.

If you want to install it on your machine, these instructions should be enough:

sudo apt-key adv --recv-key --keyserver keyserver.ubuntu.com 8BECF1637AD8C79D

cat <<EOF | sudo tee /etc/apt/sources.list.d/projectatomic-ubuntu-ppa-artful.list
deb http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
deb-src http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
EOF
sudo apt-get update
sudo apt-get remove cri-o-1.* -y || true
sudo apt-get install cri-o-1.13 -y

The client that I'm using to test is a "hello world"-style program, compiled with grpc 1.18.0, that checks the status of the server:

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"

	criRuntime "k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2"
	"k8s.io/kubernetes/pkg/kubelet/util"
)

// criClient wraps the generated runtime service client (type filled in here so
// the snippet compiles).
type criClient struct {
	RuntimeServiceClient criRuntime.RuntimeServiceClient
}

func getGRPCClient(ctx context.Context) (*grpc.ClientConn, error) {
	addr, dialer, err := util.GetAddressAndDialer("unix:///var/run/crio/crio.sock")
	if err != nil {
		return nil, err
	}

	c, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(c, addr, grpc.WithDialer(dialer), grpc.WithInsecure(), grpc.WithBlock(), grpc.WithBackoffMaxDelay(15*time.Second))
	if err != nil {
		return nil, fmt.Errorf("failed to connect: %s", err)
	}
	return conn, nil
}

func newCRIClient(ctx context.Context) (*criClient, error) {
	cc, err := getGRPCClient(ctx)
	if err != nil {
		return nil, err
	}
	rsc := criRuntime.NewRuntimeServiceClient(cc)
	return &criClient{rsc}, nil
}

func testCRI() {
	c, err := newCRIClient(context.Background())
	if err != nil {
		panic(err)
	}
	sreq := &criRuntime.StatusRequest{
		Verbose: false,
	}
	sresp, err := c.RuntimeServiceClient.Status(context.Background(), sreq)
	fmt.Println(sresp)
	if err != nil {
		panic(err)
	}
}

If this client program is executed with GRPC_GO_REQUIRE_HANDSHAKE=off, everything works fine.

@menghanl
Contributor

menghanl commented Mar 13, 2019

@aanm That is caused by #2565 (also see #2406).
cri-o will need to set the writer-aware match as mentioned in https://github.com/soheilhy/cmux#limitations for gRPC-Java clients.
For now, until that's done, you can set GRPC_GO_REQUIRE_HANDSHAKE=off.

@dfawley dfawley added P2 and removed P1 labels Mar 14, 2019
@dfawley
Member

dfawley commented Mar 14, 2019

Lowering priority, since the main issue is resolved by eliminating the backoff after a successful connection is established. Ideally the error message would be populated with "server handshake not received" (or similar). When it's due to the handshake, this will happen naturally when we make the transport creation function blocking. When it's due to a subsequent connection attempt, we need to plumb the error from the transport to the addrConn.

saschagrunert added a commit to openSUSE/cri-o that referenced this issue Aug 5, 2019
gRPC clients which wait until they receive a `SETTINGS` frame may not be
able to connect to CRI-O because of a limitation of cmux:
- https://github.com/soheilhy/cmux#limitations

The issue does not appear to be solved yet and makes it impossible to connect
when upgrading cri-tools to the latest version of gRPC. A possible workaround
is to export the environment variable `GRPC_GO_REQUIRE_HANDSHAKE=off`.

Reference: grpc/grpc-go#2636

Signed-off-by: Sascha Grunert <[email protected]>
saschagrunert added a commit to openSUSE/cri-tools that referenced this issue Aug 5, 2019
The mentioned issue in 3d04f3b is
related to a limitation of cmux, where clients may block until they
receive a `SETTINGS` frame. My tests were only related to CRI-O, where
an appropriate fix in the project should solve the issue from a general
perspective.

Nevertheless, a general workaround for end-users trying to connect to a
cmux based implementation might be exporting the environment variable
`GRPC_GO_REQUIRE_HANDSHAKE=off`. This has been documented in the
README.md as well.

References:
- grpc/grpc-go#2636
- https://github.com/soheilhy/cmux#limitations
- cri-o/cri-o#2697

Signed-off-by: Sascha Grunert <[email protected]>
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cri-o that referenced this issue Sep 5, 2019
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cri-o that referenced this issue Sep 5, 2019
haircommander pushed a commit to haircommander/cri-o that referenced this issue Sep 5, 2019
@stale

stale bot commented Sep 6, 2019

This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@stale stale bot added the stale label Sep 6, 2019
@dfawley dfawley removed the stale label Sep 6, 2019
facebook-github-bot pushed a commit to magma/magma that referenced this issue Nov 19, 2019
Summary:
Pull Request resolved: #901

In production we are seeing: all SubConns are in TransientFailure, latest connection error: <nil>
as referenced by: grpc/grpc-go#2636. The primary issue is fixed in later grpc-go
versions and as such we update our go.mod for Magma to 1.25.0

Reviewed By: xjtian

Differential Revision: D18544967

fbshipit-source-id: 58208650d4b29feadae2cdd708b505b66359d068
gjalves pushed a commit to gjalves/magma that referenced this issue Nov 19, 2019
@perolausson

perolausson commented Dec 2, 2019

I've noticed this problem too, but found that it only seems to happen when I have left a program being debugged and then hibernated the system. On waking up I typically see that there is still a listener, as expected, but incoming requests are never accepted and eventually the process locks up. Is there some kind of timekeeping inside the Go implementation of the gRPC server that gets confused when time freezes and then unfreezes when the system wakes up again?

@menghanl
Contributor

menghanl commented Dec 4, 2019

@perolausson there's no server-side timekeeping. Could it be that the listener doesn't work well after the system is hibernated, and that in this case all connections are blocked on connecting but don't error either?

@perolausson

I don't really know what is going on, but I noticed that a few people reported seeing this in the context of containers (where pausing the container is quite common), and it has been a fairly consistent observation in my own work, as I hibernate my development machine all the time.

@noelukwa

noelukwa commented Dec 5, 2019

still facing this issue

@joelklabo

I'm also experiencing this on lnd version 0.12.0

@menghanl menghanl closed this as completed May 4, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 1, 2021