grpc-health-probe shuts down the gRPC connection non-gracefully #34
This came up in #33 as well. However, it does not surface in gRPC implementations other than Java, so I might need some help debugging this. The code is rather short, maybe take a look to see if we're missing a |
Welp, it seems like we do a bunch of … Should be a rather easy thing to fix. Would you be able to compile a binary to test if I give you a patch? |
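One piece of Go behavior that is relevant to the exit path discussed throughout this thread, shown here as a minimal standalone example rather than the probe's actual code: `os.Exit` terminates the process immediately, and deferred functions, such as a `defer conn.Close()`, are not run on that path.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	defer fmt.Println("deferred cleanup") // never printed: os.Exit skips deferred calls
	fmt.Println("doing work")
	os.Exit(1) // terminates the process right here
}
```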
See #36, can you compile and let me know if that helps? |
Hi Ahmet, thanks a lot for your quick response! I certainly can, but I need to head out in a minute. I will do it first thing in the morning (PST), is that OK? |
I just built the updated code and tested the binary, but unfortunately I am still getting the same errors. |
Can you please try this: change the defer statement to

```go
defer func() {
	log.Println("closing connection")
	conn.Close()
	log.Println("closed connection")
}()
```

and recompile & run. We should be seeing these clear messages on the exit path of the probe. |
The messages are printed out but the issue remains. I passed this on to one of my colleagues who knows Go to see if he can spot anything. |
That’s indeed peculiar. Is the Wireshark dump the same, or are you seeing an RST? |
My coworker made a few changes as an experiment and it still sends RST, same as before. |
My idea would be adding `time.Sleep(time.Second*2)` before `os.Exit()` to see if we can keep the process alive long enough for the connection to close. But since you applied the code block above and saw both messages printed, the close path does appear to run. One good litmus test would be to see if this happens with all Go gRPC clients calling Java gRPC servers. (I assume not, but it might help bisect where the problem is.) For that, you can maybe write a small program that makes an RPC and exits; a sketch follows below. |
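To make that litmus test concrete, here is a minimal sketch of such a throwaway client. The address, timeout, and use of the standard health service are illustrative assumptions, not part of the probe:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Dial without TLS and block until the connection is ready (assumes a local test server).
	conn, err := grpc.DialContext(ctx, "localhost:5000", grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		log.Fatalf("failed to dial: %v", err)
	}
	defer conn.Close() // runs because main returns normally; os.Exit would skip it

	// One RPC, then exit. Capture the traffic (e.g. in Wireshark) to see
	// whether the teardown ends in FIN/ACK or in RST.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{Service: ""})
	if err != nil {
		log.Fatalf("health check failed: %v", err)
	}
	log.Printf("status: %v", resp.GetStatus())
}
```

Running this against both a Go and a Java server while capturing packets would show whether the RST is specific to the probe or common to Go gRPC clients that exit right after an RPC.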
For issue #36: can we send SIGTERM instead of calling os.Exit(retCode)? Would that solve the issue? |
I don’t see why it would help. |
Is this being worked on for the next release? =) |
No. We can’t repro the issue in other languages. Furthermore, we don’t know what’s causing it. If you have ideas or time to debug, please contribute. |
I think I'm seeing something (perhaps) similar to this with a Go service? With this health service:
Running :
And triggering the health_probe in another terminal:
I'm seeing this WARNING in the service logs:
I wasn't sure if this was intended behaviour so I changed:
In the
And when running that against the service above I only see:
In the logs. I did a bit more munging and found that by inserting:
Between the |
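For anyone trying to reproduce a setup like the one described in the comment above, a minimal Go gRPC health service could look roughly like this. The port is an arbitrary assumption and this is only an illustrative sketch, not the commenter's actual service:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	lis, err := net.Listen("tcp", ":5000") // port chosen arbitrarily for the example
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	srv := grpc.NewServer()
	// The stock health implementation reports SERVING for the empty ("") service name by default.
	healthpb.RegisterHealthServer(srv, health.NewServer())
	log.Println("health service listening on :5000")
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```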
This PR changes the `NettyServerTransport#getLogLevel` method to log `SocketException`s to `LogLevel.FINE`, rather than exclusively pure IOExceptions. This may fix an unintentional regression introduced in c166ec2, although the message in my Java version (14.0.1) wouldn't have passed the old logic for quieting either. This also fixes the issue raised in #6423 that was locked for inactivity. This fixes

```
[2020/05/14 20:21:52 INFO] [io.grpc.netty.NettyServerTransport.connections] Transport failed
java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:345)
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:376)
	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1125)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:832)
```

being logged at level INFO, which occurs whenever a socket is improperly closed by the client, such as with the grpc-health-probe (they've got an [open issue](grpc-ecosystem/grpc-health-probe#34) for this).
Any advice on how we should proceed here? I'm not in favor of adding notable sleeps here and there, since I expect `defer conn.Close()` is actually supposed to close the connection. |
This is a known issue, already reported in #34. We spent some time researching it to no avail, as you can see in the thread. I have long been thinking about writing a small C client (to be closer to the metal and actually have control) to replace the Go implementation, but I never seem to get the time.
|
I can testify to seeing the same error in a Go service like #34 (comment). |
I suspect this issue is caused by grpc/grpc-go#460 |
We just hit this problem while running the prober on our GKE services. We were observing a slow leak of bpf_jit memory allocations on the node, counted with
Within weeks it exceeds the bpf_jit_limit, and the node ends up in a state where starting a new container fails with
The node needs a restart to recover.

**Workarounds:** We have observed this on 3 independent services. Using the GKE built-in gRPC prober solves the issue. Adding a sleep of 100ms (1ms is not enough in our case), as suggested in #34 (comment), also works; a sketch of that shape follows below.

**Proper fix:** Is there any idea for a proper fix, not involving a sleep here? (Or maybe grpc/grpc-go#460 needs more prioritization, in case it's really the cause.)

As a side note: do you think it's worth removing the issue's "harmless" tag? As our case shows, in certain circumstances this issue can cause trouble. |
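For reference, the sleep workaround mentioned above has roughly the following shape. The helper name and package are hypothetical, and 100 ms is simply the value reported to work in this thread; this is still a workaround, not the proper fix being asked about:

```go
package probe

import (
	"log"
	"os"
	"time"

	"google.golang.org/grpc"
)

// exitAfterClose is a hypothetical helper illustrating the workaround:
// close the client connection, give the TCP teardown (FIN/ACK) a moment
// to complete, and only then terminate the process.
func exitAfterClose(conn *grpc.ClientConn, retCode int) {
	if err := conn.Close(); err != nil {
		log.Printf("error closing connection: %v", err)
	}
	time.Sleep(100 * time.Millisecond) // reported to work above; 1 ms was not enough
	os.Exit(retCode)
}
```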
The probe process sends a TCP RST at the conclusion of the gRPC session instead of gracefully shutting down with a FIN/ACK exchange. This causes a lot of log noise in k8s pods, such as:

```
Feb 01, 2020 2:54:05 PM io.grpc.netty.shaded.io.grpc.netty.NettyServerTransport notifyTerminated
INFO: Transport failed
io.grpc.netty.shaded.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
```
Screen-shot of Wireshark packet exchange: