all SubConns are in TransientFailure, latest connection error: <nil> #2636
Comments
I suspect that setting WaitForReady(true) in the call options will effectively prevent this issue for me, but I'm curious whether there is a way to get more information about why all SubConns would be in the failure state, and I'm wondering if there's a lurking bug, because there don't seem to be any connection errors. |
Ran into the same problem when upgrading to 1.18.0. |
I am currently using 1.19.0 and hit the same issue described here. I added WaitForReady(true) to the call options and it worked as @mpuncel stated. Ty |
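For reference, a minimal sketch of the WaitForReady workaround mentioned above (the target address and the per-call example are illustrative, not taken from this thread). With WaitForReady the RPC blocks until a SubConn becomes READY instead of failing fast while all SubConns are in TRANSIENT_FAILURE:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Apply WaitForReady(true) to every call on this ClientConn; individual
	// calls can also pass grpc.WaitForReady(true) as a per-call CallOption,
	// e.g. client.Status(ctx, req, grpc.WaitForReady(true)).
	conn, err := grpc.Dial("localhost:50051", // illustrative target
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```

Note that with WaitForReady the call can block until its context deadline if the connection never recovers, so it is usually paired with a per-call timeout.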
Thanks for reporting. This is indeed a bug. Are there any other logs you got that tell whether there's an actual error (like a broken connection) when this happens? |
2019/03/01 14:17:38 http2: Framer 0xc4206b7260: wrote HEADERS flags=END_HEADERS stream=1 len=107
2019/03/01 14:17:38 http2: Framer 0xc4206b7260: wrote DATA flags=END_STREAM stream=1 len=1285 data="\x00\x00\x00\x05\x00\t\xbb'\x0f\v\xb5v:@\x11\r\xe0-\x90\xa0\x04T\xc0\x18\x012\n\b(\x12\x06\b\xb1\xa3\xdd\xe3\x052\n\br\x12\x06\b\x84\xcd\xe3\xe2\x052\v\b\xb0\x01\x12\x06\b\xa7ڵ\xe1\x052\v\b\xd5\f\x12\x06\b\x96\x8b\xed\xe2\x052\v\b\xd8 \x12\x06\b\xbd\x95\xdd\xe3\x052\v\b\xa1-\x12\x06\b\xa7\xa5\xdc\xe3\x052\v\b\xf6>\x12\x06\b\x96\x8b\xed\xe2\x052\n\b\x0f\x12\x06\b\x96\x8b\xed\xe2\x052\n\b\x12\x12\x06\b\x9b\xc0\xc9\xe2\x052\v\b\x8c\x01\x12\x06\b\xa3\xa9\xd1\xe0\x052\v\b\xc7\x01\x12\x06\b\xda\xd1\xdb\xe3\x052\v\b\xff\x02\x12\x06\bރ\x93\xe3\x052\v\b\x99\x1c\x12\x06\b\xb0\x89\xdc\xe3\x052\v\b\xb8#\x12\x06\b\x96\x8b\xed\xe2\x052\v\b\xc8,\x12\x06\b\xa7\xbe\xd2\xe3\x052\v\b\xf04\x12\x06\b\xb5\xe5\xd2\xe2\x052\v\b\xf6\x01\x12\x06\b\xe8\xa9\xd6\xe1\x052\v\b\xf9\x01\x12\x06\b\xbd\xe7\xb6\xe3\x052" (1029 bytes omitted)
INFO: 2019/03/01 14:17:38 pickfirstBalancer: HandleSubConnStateChange: 0xc4203ccee0, TRANSIENT_FAILURE
INFO: 2019/03/01 14:17:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I removed the WaitForReady and turned off the handshake but experienced the same behavior. |
@menghanl that fixes the issue for me. |
There is more than one thing happening, but the root cause is a failure to reset the reconnect backoff, which is fixed in #2669. Please try the fix and see if it helps. Because of the bug, reconnecting doesn't happen in time, resulting in a longer than expected time in transient failure. As for the <nil> error: one possible fix is to set the error whenever a connection goes into transient failure, so it will never be nil. I will save that for another fix. |
@aanm |
We are multiplexing gRPC and HTTP traffic on a single port using cmux. Removing cmux from the equation fixes the problem.
import (
	"log"
	"net"

	"google.golang.org/grpc"
	"github.com/cockroachdb/cmux"
)

server := grpc.NewServer()
lis, err := net.Listen("tcp", ":8081")
if err != nil {
	log.Fatal(err)
}
mux := cmux.New(lis)
// Route connections carrying the gRPC content-type header to the gRPC server.
grpcL := mux.Match(cmux.HTTP2HeaderField("content-type", "application/grpc"))
go server.Serve(grpcL)
mux.Serve() // the original snippet stopped at server.Serve; cmux needs this to start routing |
@bbassingthwaite-va An issue (soheilhy/cmux#64) was filed before the gRPC-Go changes were checked in. Will send a PR to update their README as well. |
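For anyone hitting this with cmux on the server side, the workaround documented in cmux's README (https://github.com/soheilhy/cmux#limitations) is to use a MatchWriter so cmux sends the HTTP/2 SETTINGS frame itself. A rough sketch using the upstream import path, with the port and the non-gRPC handling being illustrative only:

```go
package main

import (
	"log"
	"net"
	"net/http"

	"github.com/soheilhy/cmux"
	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":8081") // illustrative port
	if err != nil {
		log.Fatal(err)
	}
	mux := cmux.New(lis)

	// MatchWithWriters + HTTP2MatchHeaderFieldSendSettings makes cmux reply
	// with a SETTINGS frame before handing the connection to the gRPC server,
	// which is what handshake-waiting gRPC clients are blocked on.
	grpcL := mux.MatchWithWriters(
		cmux.HTTP2MatchHeaderFieldSendSettings("content-type", "application/grpc"))
	httpL := mux.Match(cmux.Any()) // everything else is treated as plain HTTP

	grpcServer := grpc.NewServer()
	go grpcServer.Serve(grpcL)
	go http.Serve(httpL, http.NotFoundHandler()) // placeholder HTTP handler

	if err := mux.Serve(); err != nil {
		log.Fatalf("cmux serve: %v", err)
	}
}
```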
@menghanl sorry for the late reply. The server is a binary already compiled with grpc 1.7.5 5b3c4e8; it's a program called cri-o. In their vendor directory they have the commit I was mentioning: https://github.com/kubernetes-sigs/cri-o/blob/v1.13.1/vendor.conf#L24. They also have a cmux library in there: https://github.com/kubernetes-sigs/cri-o/blob/v1.13.1/vendor.conf#L115. If you want to install it on your computer, these instructions should be enough:
sudo apt-key adv --recv-key --keyserver keyserver.ubuntu.com 8BECF1637AD8C79D
cat <<EOF > /etc/apt/sources.list.d/projectatomic-ubuntu-ppa-artful.list
deb http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
deb-src http://ppa.launchpad.net/projectatomic/ppa/ubuntu bionic main
EOF
sudo apt-get update
sudo apt-get remove cri-o-1.* -y || true
sudo apt-get install cri-o-1.13 -y
The client that I'm using to test is a "hello world" type of program, compiled with grpc 1.18.0, that checks the status of the server:
import (
criRuntime "k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2"
"context"
"fmt"
"google.golang.org/grpc"
"k8s.io/kubernetes/pkg/kubelet/util"
"runtime"
"sync"
"time"
)
func getGRPCClient(ctx context.Context) (*grpc.ClientConn, error) {
addr, dialer, err := util.GetAddressAndDialer("unix:///var/run/crio/crio.sock")
if err != nil {
return nil, err
}
c, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()
conn, err := grpc.DialContext(c, addr, grpc.WithDialer(dialer), grpc.WithInsecure(), grpc.WithBlock(), grpc.WithBackoffMaxDelay(15*time.Second))
if err != nil {
return nil, fmt.Errorf("failed to connect: %s", err)
}
return conn, nil
}
func newCRIClient(ctx context.Context) (*criClient, error) {
cc, err := getGRPCClient(ctx)
if err != nil {
return nil, err
}
rsc := criRuntime.NewRuntimeServiceClient(cc)
return &criClient{rsc}, nil
}
func testCRI() {
c, err := newCRIClient(context.Background())
if err != nil {
panic(err)
}
sreq := &criRuntime.StatusRequest{
Verbose: false,
}
sresp, err := c.RuntimeServiceClient.Status(context.Background(), sreq)
fmt.Println(sresp)
if err != nil {
panic(err)
}
}
If this client program is executed with |
@aanm That is caused by #2565 (also see #2406). |
Lowering priority, since the main issue is resolved by eliminating the backoff after a successful connection is established. Ideally the error message would be populated with "server handshake not received" (or similar). When it's due to the handshake, this will happen naturally when we make the transport creation function blocking. When it's due to a subsequent connection attempt, we need to plumb the error from the transport to the addrConn. |
gRPC clients which wait until they receive a `SETTINGS` frame may not be able to connect to CRI-O because of a limitation of cmux: https://github.com/soheilhy/cmux#limitations. The issue is not yet solved and makes connecting impossible when upgrading cri-tools to the latest version of gRPC. A possible workaround is to export the environment variable `GRPC_GO_REQUIRE_HANDSHAKE=off`. Reference: grpc/grpc-go#2636 Signed-off-by: Sascha Grunert <[email protected]>
The issue mentioned in 3d04f3b is related to a limitation of cmux, where clients may block until they receive a `SETTINGS` frame. My tests were only related to CRI-O, where an appropriate fix in that project should solve the issue from a general perspective. Nevertheless, a general workaround for end users trying to connect to a cmux-based implementation might be exporting the environment variable `GRPC_GO_REQUIRE_HANDSHAKE=off`. This has been documented in the README.md as well. References:
- grpc/grpc-go#2636
- https://github.com/soheilhy/cmux#limitations
- cri-o/cri-o#2697
Signed-off-by: Sascha Grunert <[email protected]>
This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
Summary: Pull Request resolved: #901 In production we are seeing "all SubConns are in TransientFailure, latest connection error: <nil>", as referenced by grpc/grpc-go#2636. The primary issue is fixed in later grpc-go versions, and as such we update our go.mod for Magma to 1.25.0. Reviewed By: xjtian Differential Revision: D18544967 fbshipit-source-id: 58208650d4b29feadae2cdd708b505b66359d068
I've noticed this problem too, but found that it only seems to happen when I have left a program being debugged and then hibernated the system. On waking up I typically see that there is still a listener as expected, but incoming requests are never accepted, and eventually the process locks up. Is there some kind of time keeping inside the Go implementation of the gRPC server which gets buggered up when time freezes and is then later unfrozen when the system is woken up again? |
@perolausson there's no server-side time keeping. Could it be that the listener doesn't work well after the system is hibernated, and that in this case all connections are blocked on connecting but don't error either? |
I don't really know what is going on, but I noticed that a few people reported seeing this in the context of containers (where pausing the container is quite common), and from what I have seen it has been a fairly consistent observation in my own work, as I hibernate my development machine all the time. |
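Not something prescribed in this thread, but if connections are silently dying across a suspend/resume cycle, client-side keepalive is the usual way to make them fail fast and reconnect instead of hanging. A rough sketch (the target and intervals are illustrative, and the server must be configured to permit pings without active streams):

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Keepalive pings make a connection that died silently (e.g. across
	// hibernation) fail within roughly Time+Timeout, so the ClientConn can
	// tear it down and reconnect rather than staying stuck.
	conn, err := grpc.Dial("localhost:50051", // illustrative target
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after this much idle time
			Timeout:             10 * time.Second, // wait this long for the ping ack
			PermitWithoutStream: true,             // ping even when no RPCs are active
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```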
still facing this issue |
I'm also experiencing this on lnd version 0.12.0 |
Please answer these questions before submitting your issue.
What version of gRPC are you using?
1.18.0
What version of Go are you using (`go version`)?
1.11.2
What operating system (Linux, Windows, …) and version?
Linux
What did you do?
Ran gRPC requests in production through a local Envoy sidecar. This means that all connections are to a process over a local unix socket. WaitForReady was not set to true for these calls.
What did you expect to see?
I'd expect it to be exceedingly rare for connections to a local unix sidecar to be in the TRANSIENT_FAILURE state, and if they were I'd expect to see a non-nil error message. We use whatever the default balancer implementation is, but we do pass in a custom dialer implementation:
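(The reporter's dialer is not included in the issue. For illustration only, a typical unix-socket dialer with the gRPC 1.18-era API looks roughly like this; the socket path is hypothetical:)

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Hypothetical sidecar socket path; not the reporter's actual configuration.
	const envoySocket = "/tmp/envoy.sock"

	conn, err := grpc.Dial(envoySocket,
		grpc.WithInsecure(),
		// grpc.WithDialer was the pre-WithContextDialer way to dial a unix socket.
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```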
What did you see instead?
A small percentage (<1%) of RPCs failed with an "all SubConns are in TransientFailure, latest connection error: <nil>" error message. Without knowing much about the connection pooling implementation, it seems like the error should typically be non-nil there.