Why gRPC Server closes a connection at the 2nd GOAWAY. #6019
Are you running into any issue because of this behavior? If so, could you please explain that in detail. The GOAWAY behavior that we implement is based on https://httpwg.org/specs/rfc7540.html#rfc.section.6.8. When the server sends the first GOAWAY, it usually gives some time to the client to terminate the connection gracefully, before sending the second GOAWAY (at which point, the connection is hard closed). |
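For reference, a minimal sketch (mine, not from this thread) of the server-side knobs that control this window, with an illustrative port and durations: MaxConnectionAge triggers the first, graceful GOAWAY, and MaxConnectionAgeGrace bounds how long the server then waits before the second GOAWAY and the hard close.
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	s := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// After MaxConnectionAge, the server sends the first (graceful) GOAWAY.
		MaxConnectionAge: 3 * time.Minute,
		// MaxConnectionAgeGrace bounds how long in-flight RPCs get to finish
		// before the second GOAWAY is sent and the connection is closed.
		MaxConnectionAgeGrace: 30 * time.Second,
	}))
	// Register services on s here before serving.
	if err := s.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}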
@easwars Thank you for replying. In my environment, I sometimes see errors like the following.
e.g. client logs (v1.51.0)
e.g. server logs (v1.51.0)
Also I found the following successful cases.
The gRPC server is configured as below.
grpc.NewServer(
grpc.UnaryInterceptor(unaryInterceptor),
grpc.StreamInterceptor(streamInterceptor),
grpc.Creds(credentials.NewTLS(tlsConfig)),
grpc.KeepaliveParams(keepalive.ServerParameters{
MaxConnectionAge: 3 * time.Minute,
}),
)
The gRPC client is configured as below.
grpc.DialContext(ctx, address,
grpc.WithDefaultServiceConfig(`{ "loadBalancingConfig": [ { "round_robin": {} } ] }`),
grpc.FailOnNonTempDialError(true),
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.WithTransportCredentials(credentials.NewTLS(tlsConfig)),
)
The configuration is intended for load balancing connections. So I am wondering whether there is a possible race when the server closes the connection, as described in the comments of the test case (Lines 900 to 906 in 3151e83).
In addition, I have disabled dynamic windows and BDP in the client settings as shown in #5358, and I have not seen any errors since then. |
Do you feel that the server is closing the connection when it should not have? Or do you feel that the server is not giving enough time for the client to gracefully close the connection from its end? Because I still don't understand what problem you are facing. |
I am not sure why the server sends RST packets. For example, if the server closes the connection while there is still unread data in the socket buffer, it will send an RST packet. Is there a possibility of something like that happening around the GOAWAY? |
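To illustrate the TCP behavior being asked about here, a minimal plain-TCP sketch with no gRPC involved: when a process closes a socket while unread data remains in its receive buffer, the kernel (at least on Linux) responds with RST rather than a clean FIN, and the peer typically sees "connection reset by peer". The addresses and sleep durations are only for the demonstration.
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		// Close without ever reading what the client sent; the unread bytes
		// in the receive buffer cause the kernel to emit an RST.
		time.Sleep(100 * time.Millisecond)
		conn.Close()
	}()
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	if _, err := conn.Write([]byte("data the server never reads")); err != nil {
		log.Fatalf("write: %v", err)
	}
	time.Sleep(300 * time.Millisecond)
	buf := make([]byte, 1)
	if _, err := conn.Read(buf); err != nil {
		// Typically "connection reset by peer" rather than a clean io.EOF.
		log.Printf("read error: %v", err)
	}
}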
We used to rely on the client closing the connection, but that is a bug. The server needs to protect itself from a client that does not close the connection, and there are clients in the wild that don't do this properly. This was actually the case in v1.51, which you were using. It was only fixed in #5968. Can you test this again with v1.53? It's possible the changes made since v1.51 fix a problem you may be seeing. |
Yes, I will test with v1.53. Please allow a few days, as the issue has low reproducibility. |
Yes, I think so too. And this is not under the control of the gRPC server; it is under the control of the TCP implementation in the kernel.
The gRPC server flushes all data in its write buffer before closing the TCP connection. Also, what I feel is more important in this scenario is that the gRPC server should stop handling new streams on the connection once it sends the second GOAWAY frame. It is currently doing that, and I was able to verify the same. |
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
Is it possible for you to provide a reproduction test case that we can run? Otherwise, our code seems okay through inspection, and if we can't trigger the problem ourselves then we won't be able to debug it. Thanks! |
@dfawley You can reproduce the error as follows. The code is based on https://github.com/grpc/grpc-go/tree/master/examples/features/keepalive
server
package main
import (
"context"
"flag"
"fmt"
"log"
"math/rand"
"net"
"time"
"google.golang.org/grpc"
"google.golang.org/grpc/benchmark/latency"
pb "google.golang.org/grpc/examples/features/proto/echo"
"google.golang.org/grpc/keepalive"
)
var port = flag.Int("port", 50052, "port number")
var kasp = keepalive.ServerParameters{
MaxConnectionAge: 10 * time.Second,
}
// server implements EchoServer.
type server struct {
pb.UnimplementedEchoServer
}
var letterRunes = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
func randomString(n int) string {
b := make([]rune, n)
for i := range b {
b[i] = letterRunes[rand.Intn(len(letterRunes))]
}
return string(b)
}
func (s *server) UnaryEcho(ctx context.Context, req *pb.EchoRequest) (*pb.EchoResponse, error) {
msg := req.Message + randomString(630000)
return &pb.EchoResponse{Message: msg}, nil
}
func main() {
flag.Parse()
rand.Seed(time.Now().UnixNano())
address := fmt.Sprintf(":%v", *port)
lis, err := net.Listen("tcp", address)
if err != nil {
log.Fatalf("failed to listen: %v", err)
}
sLis := latency.Longhaul.Listener(lis)
s := grpc.NewServer(grpc.KeepaliveParams(kasp))
pb.RegisterEchoServer(s, &server{})
if err := s.Serve(sLis); err != nil {
log.Fatalf("failed to serve: %v", err)
}
}
client
package main
import (
"context"
"flag"
"fmt"
"log"
"math/rand"
"net"
"time"
"google.golang.org/grpc"
"google.golang.org/grpc/benchmark/latency"
"google.golang.org/grpc/credentials/insecure"
pb "google.golang.org/grpc/examples/features/proto/echo"
)
var addr = flag.String("addr", "localhost:50052", "the address to connect to")
func main() {
flag.Parse()
//const maxWindowSize int32 = (1 << 20) * 16
conn, err := grpc.Dial(*addr,
//grpc.WithInitialWindowSize(maxWindowSize),
//grpc.WithInitialConnWindowSize(maxWindowSize),
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
dialer := latency.Longhaul.Dialer(net.Dial)
sConn, err := dialer("tcp4", addr)
if err != nil {
return nil, err
}
return sConn, nil
}))
if err != nil {
log.Fatalf("did not connect: %v", err)
}
defer conn.Close()
for {
c := pb.NewEchoClient(conn)
fmt.Println("Performing unary request")
_, err = c.UnaryEcho(context.Background(), &pb.EchoRequest{Message: "keepalive demo"})
if err != nil {
log.Fatalf("unexpected error from UnaryEcho: %v", err)
}
fmt.Println("Got RPC response")
base := float64(1 * time.Second)
var delta = 0.10 * float64(1*time.Second)
var min = base - delta
var max = base + delta
interVal := time.Duration(min + (rand.Float64() * (max - min + 1)))
fmt.Printf("sleeping %v\n", interVal)
time.Sleep(interVal)
}
select {}
}
The error may differ depending on the situation (timing). As far as I understand, the error logs look like the following.
case 1 (connection closed after the 2nd GOAWAY)
server log
client log
packets
case 2 (connection closed at the 2nd GOAWAY)
server log
client log
|
Thanks for the example. I'll look into reproducing this on my side and see what I can find. There's an unavoidable race between the GOAWAY initiating the end of the connection (or even the second GOAWAY & connection closure) and the client attempting to use that connection. However, this should be caught internally and transparently retried, so there should be no visible RPC failure. |
Have you seen this error in a scenario without the latency package? It seems what's happening is that the server is sending all the data for the RPC to the client before closing the conn, but when the client encounters an error writing to the connection (because the server closed it), that leads to an error reading from the conn. My understanding is that reads should continue in that case until all the server's data is consumed. In this case an error is occurring instead, and I'm not sure if that's a misunderstanding on my part or if the data is lost by the latency package itself. I can make the test case you provided pass by either removing the latency package from the example or by not closing the Conn after the client's writer fails. |
This could be a misunderstanding on my part, actually. The documentation on net.Conn's Close says:
// Close closes the connection.
// Any blocked Read or Write operations will be unblocked and return errors.
So it may be unsafe to close the connection at this point. We probably have to close it when a local error occurs (or shut down the whole transport?), but not when an error writing to the connection occurs because the peer closed it. |
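A generic sketch of the pattern being described here (this is not the grpc-go transport code, only an illustration of "don't close on a peer-initiated write error; keep reading"); the package and function names are mine.
package conndrain

import (
	"io"
	"log"
	"net"
)

// writeThenDrain sends payload and, if the write fails because the peer has
// already closed its side, does not Close() the connection; instead it keeps
// reading until EOF so any data the peer flushed before closing is consumed.
func writeThenDrain(conn net.Conn, payload []byte) []byte {
	if _, err := conn.Write(payload); err != nil {
		// A write error here may just mean the peer closed its end; closing
		// our side now could discard data that is still in flight.
		log.Printf("write failed: %v", err)
	}
	data, err := io.ReadAll(conn) // read until EOF or a hard error such as RST
	if err != nil {
		log.Printf("read after write failure: %v", err)
	}
	return data
}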
@dfawley Thank you for debugging.
Which code do you mean? |
No, this is internal to gRPC: https://github.com/grpc/grpc-go/blob/master/internal/transport/http2_server.go#L340 Were you seeing this in production, though? Or just in test cases with the latency package? |
OK.
Yes, I can see the error in our production environment. |
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
Is it possible to fix this problem? |
Sorry about the ping from the bot. :) #6110 should fix this. |
@dfawley I still see the error with the example code.
client log (from example code)
server log (from example code)
|
I also built a real-world application with grpc-go HEAD and deployed it, and I can still see the error. |
@dfawley Can I reopen this issue? |
I'm still able to reproduce the problem. However, when I move the client & server into the same binary, the problem seems to go away. It may have something to do with how the TCP connection is handled. I need to debug more. |
Is there any traction on this issue? We have a similar issue in our production where our grpc client receives a lot of
grpcSrv := grpc.NewServer(
grpc.StatsHandler(&ocgrpc.ServerHandler{}),
grpc.KeepaliveParams(keepalive.ServerParameters{
MaxConnectionIdle: 5 * time.Minute,
}),
)
could the problem be
Why does this happen? |
Hi @Ghilteras, this is not expected. The client should essentially try to create a new connection with the server. We are still looking into this and I will keep you posted. Could you add some logs to this issue that could help us with debugging? |
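One client-side mitigation that may be worth trying (a sketch under the assumption that the failures are RPCs failing fast while the channel reconnects after a GOAWAY; the thread does not confirm that): mark RPCs wait-for-ready so they queue until a new connection is established instead of failing immediately. The package and function names are mine; the echo stubs come from the example code used earlier in this thread.
package clientutil

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	pb "google.golang.org/grpc/examples/features/proto/echo"
)

// echoWithWaitForReady issues a unary RPC that waits for the channel to be
// ready (e.g. while a replacement connection is being set up after a GOAWAY)
// instead of failing immediately on a transient connection error.
func echoWithWaitForReady(conn *grpc.ClientConn) {
	client := pb.NewEchoClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	resp, err := client.UnaryEcho(ctx, &pb.EchoRequest{Message: "ping"}, grpc.WaitForReady(true))
	if err != nil {
		log.Printf("UnaryEcho failed: %v", err)
		return
	}
	log.Printf("got %d bytes back", len(resp.Message))
}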
I think in my case this happens because there is an NGINX ingress between the GRPC server and the client, so in my case it's more related to this kubernetes/ingress-nginx#8973 |
Hi @hiyosi, I used the following server configuration, and the problem was not reproduced within 4 hours.
|
Same issue here.
grpc client:
KeepAliveParameters: keepalive.ClientParameters{
Time: 120 * time.Second, // send pings every 120 seconds if there is no activity
Timeout: 10 * time.Second, // wait 10 second for ping ack before considering the connection dead
PermitWithoutStream: true, // send pings even without active streams
}
grpc server:
grpc.KeepaliveParams(keepalive.ServerParameters{
MaxConnectionAge: time.Minute * time.Duration(2),
Time: 60 * time.Second,
}),
grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
MinTime: 10 * time.Second,
PermitWithoutStream: true,
}),
error log of grpc client:
error log of grpc server:
|
I can also avoid this case with the following on the grpc client:
// const MAX_WINDOW_SIZE int32 = (1 << 24)
grpc.WithInitialWindowSize(MAX_WINDOW_SIZE),
grpc.WithInitialConnWindowSize(MAX_WINDOW_SIZE),
|
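For completeness, a self-contained version of that workaround (the 16 MiB value, the insecure credentials, and the package and function names are only illustrative). As discussed in #5358, supplying explicit initial window sizes turns off grpc-go's dynamic, BDP-based window estimation on the client.
package clientwindow

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const maxWindowSize int32 = 1 << 24 // 16 MiB, illustrative

// dialWithStaticWindows dials with fixed stream and connection window sizes,
// which disables dynamic (BDP-based) window estimation on the client.
func dialWithStaticWindows(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithInitialWindowSize(maxWindowSize),
		grpc.WithInitialConnWindowSize(maxWindowSize),
	)
}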
This seems 100% the same as #5358. Let's consolidate these two issues. |
In my understanding, the gRPC server closes a connection when sending the 2nd GOAWAY frame if there are no active streams.
https://github.com/grpc/grpc-go/blob/v1.53.0/internal/transport/http2_server.go#L1346-L1350
https://github.com/grpc/grpc-go/blob/v1.53.0/internal/transport/http2_server.go#L336-L342
But is there a data race between Flush() and Close()? I sometimes see the error
connection reset by peer
around the 2nd GOAWAY. If some data (e.g. a PING frame) is received between Flush() and Close(), the server closes the connection by sending an RST packet instead of a FIN packet, doesn't it? And then an error occurs on the client side.
As described in
grpc-go/test/end2end_test.go
Lines 900 to 906 in 3151e83
I would like to know why the connection is closed by the server instead of waiting for the client to close it after the 2nd GOAWAY.
On the other hand, the same process is followed in the case of Too Many Pings, but I do not consider that to be a problem.