Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc-health-probe shuts down the gRPC connection non-gracefully #34

Open
JanDash opened this issue Feb 4, 2020 · 21 comments · Fixed by #36
Open

grpc-health-probe shuts down the gRPC connection non-gracefully #34

JanDash opened this issue Feb 4, 2020 · 21 comments · Fixed by #36
Labels
bug Something isn't working harmless long term issue

Comments

@JanDash
Copy link

JanDash commented Feb 4, 2020

The probe process sends a TCP RST at the conclusion of the gRPC session instead of gracefully shutting down with a FIN/ACK exchange. This causes a lot of noise logging in k8s pods such as:

Feb 01, 2020 2:54:05 PM io.grpc.netty.shaded.io.grpc.netty.NettyServerTransport notifyTerminated
INFO: Transport failed
io.grpc.netty.shaded.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

Screen-shot of Wireshark packet exchange:

Screen Shot 2020-02-04 at 9 25 50 AM

@ahmetb
Copy link
Collaborator

ahmetb commented Feb 4, 2020

This came up in #33 as well. However it does not surface up in gRPC implementations other than Java ––so I might need some help debugging this.

The code is rather short, maybe take a look to see if we're missing a conn.Close(). I'll take a look as well.

@ahmetb
Copy link
Collaborator

ahmetb commented Feb 4, 2020

Welp, it seems like we do a bunch of os.Exit(..) which causes defer conn.Close() to be not given a chance to be executed.

Should be a rather easy thing to fix. Would you be able to compile a binary to test if I can give you a patch?

@ahmetb
Copy link
Collaborator

ahmetb commented Feb 5, 2020

See #36, can you compile and let me know if that helps?

@JanDash
Copy link
Author

JanDash commented Feb 5, 2020

Hi Ahmet, thanks a lot for your quick response! I certainly can, but I need to head out in a minute. I will do it first thing in the morning (PST), is that OK?

@JanDash
Copy link
Author

JanDash commented Feb 5, 2020

I just built the updated code and tested the binary, but unfortunately I am still getting the same errors.

@ahmetb
Copy link
Collaborator

ahmetb commented Feb 5, 2020

Can you please try this. Change the defer conn.Close() to

defer func(){
    log.Println("closing connection")
    conn.Close()
    log.Println("closed connection")
}()

and recompile & run. We should be seeing these clear messages on the exit path of the probe.

@JanDash
Copy link
Author

JanDash commented Feb 6, 2020

The messages are printed out but the issue remains. I passed this on to one of my colleagues that knows go to see if he can spot anything.

@ahmetb
Copy link
Collaborator

ahmetb commented Feb 6, 2020

That’s indeed peculiar. Is the wireshark dump the same, or are you seeing RST?

@JanDash
Copy link
Author

JanDash commented Feb 6, 2020

My coworker made a few changes as an experiment and it still sends RST, same as before.
I am as confused as you are, typically sockets are closed gracefully by the O/S when a process dies. I don't really have any more cycles to experiment further unfortunately, but please feel free to update if you make a breakthrough.

@ahmetb
Copy link
Collaborator

ahmetb commented Feb 7, 2020

My idea would be adding time.Sleep(time.Second*2) before os.Exit() to see if we can keep the process alive for connection to close.

But since you applied the code block above and saw closed connection in the output, honestly I have no other answer.

One good litmus test would be to see if this happens to all Go gRPC clients calling Java gRPC servers. (I assume not, but it might help bisect where the problem is.) For that, you can maybe write a small program that makes an RPC and exits.

@ahmetb ahmetb closed this as completed in #36 Feb 7, 2020
@ahmetb
Copy link
Collaborator

ahmetb commented Feb 7, 2020

I went ahead and merged the #36 to master, so you can build off of there.

NB: Issue is still not fixed by #36.

@ahmetb ahmetb reopened this Feb 7, 2020
@thejasn
Copy link

thejasn commented Mar 12, 2020

For issue #36, can we send SIGTERM instead of os.Exit(retCode) ?? Will that solve the issue ??

@ahmetb
Copy link
Collaborator

ahmetb commented Mar 12, 2020

I don’t see why it would help.

@jsabalos
Copy link

is this being worked for next release? =)

@ahmetb
Copy link
Collaborator

ahmetb commented Mar 18, 2020

is this being worked for next release? =)

No. We can’t repro the issue in other languages. Furthermore we don’t know what’s causing it. If you have ideas or time to debug, please contribute.

@danieldooley
Copy link

I think I'm seeing something (perhaps) similar to this with a Go service?

With this health service:

package main

import (
	"log"
	"net"
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	"google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/reflection"
)

func main() {

	lis, err := net.Listen("tcp", "localhost:50052")
	if err != nil {
		log.Fatal("failed to listen")
	}

	h := health.NewServer()

	// Docs say that an empty service name is used to indicate "overall server health",
	// which is the only thing we care about. We only have one service. Refer to
	// https://github.com/grpc/grpc/blob/master/doc/health-checking.md
	h.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)

	s := grpc.NewServer()
	grpc_health_v1.RegisterHealthServer(s, h)

	// we're happy to do this because we can use grpcurl without having to provide a .proto file
	reflection.Register(s)

	log.Print("Starting healthcheck")

	if err := s.Serve(lis); err != nil {
		log.Fatal("failed to serve healthcheck")
	}

}

Running :

GRPC_GO_LOG_VERBOSITY_LEVEL=2 GRPC_GO_LOG_SEVERITY_LEVEL=info go run main.go

And triggering the health_probe in another terminal:

grpc-health-probe -addr localhost:50052

I'm seeing this WARNING in the service logs:

WARNING: 2020/04/21 14:02:57 transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:50052->127.0.0.1:53958: read: connection reset by peer
INFO: 2020/04/21 14:02:57 transport: loopyWriter.run returning. connection error: desc = "transport is closing"

I wasn't sure if this was intended behaviour so I changed:

defer conn.Close()

In the grpc-health-probe/main.go to:

defer func(){
        time.Sleep(time.Second)
	conn.Close()
}()

And when running that against the service above I only see:

INFO: 2020/04/21 14:03:20 transport: loopyWriter.run returning. connection error: desc = "transport is closing"

In the logs.

I did a bit more munging and found that by inserting:

time.Sleep(time.Millisecond)

Between the Check returning and conn.Close() (I put it on line line 234) being called seems to prevent this warning as well.

ejona86 pushed a commit to grpc/grpc-java that referenced this issue May 15, 2020
This PR changes the `NettyServerTransport#getLogLevel` method to log 
`SocketException`s to `LogLevel.FINE`, rather than exclusively pure
IOExceptions. This may fix an unintentional regression introduced in
c166ec2, although the message in my java version (14.0.1) wouldn't have
passed the old logic for quieting either. This also fixes the issue
raised in #6423 that was locked for inactivity.

This fixes 
```
[2020/05/14 20:21:52   INFO] [io.grpc.netty.NettyServerTransport.connections] Transport failed
java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:345)
        at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:376)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1125)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:832)
```
being logged to level INFO, which occurs whenever a socket is improperly 
closed by the client, such as with the grpc-health-probe (They've got an
[open issue](grpc-ecosystem/grpc-health-probe#34)
for this)
gwpries added a commit to AVOXI/grpc-health-probe that referenced this issue Sep 18, 2020
@ahmetb
Copy link
Collaborator

ahmetb commented Oct 28, 2020

Any advice how should we proceed here?
I'm not in favor of adding notable sleeps here and there since I expect defer conn.Close() is actually supposed to close the connection.

@JanDash
Copy link
Author

JanDash commented Oct 28, 2020 via email

dfawley pushed a commit to dfawley/grpc-java that referenced this issue Jan 15, 2021
This PR changes the `NettyServerTransport#getLogLevel` method to log 
`SocketException`s to `LogLevel.FINE`, rather than exclusively pure
IOExceptions. This may fix an unintentional regression introduced in
c166ec2, although the message in my java version (14.0.1) wouldn't have
passed the old logic for quieting either. This also fixes the issue
raised in grpc#6423 that was locked for inactivity.

This fixes 
```
[2020/05/14 20:21:52   INFO] [io.grpc.netty.NettyServerTransport.connections] Transport failed
java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:345)
        at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:376)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1125)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:832)
```
being logged to level INFO, which occurs whenever a socket is improperly 
closed by the client, such as with the grpc-health-probe (They've got an
[open issue](grpc-ecosystem/grpc-health-probe#34)
for this)
@owenthereal
Copy link

I can testify to seeing the same error in a Go service like #34 (comment).

@GeertJohan
Copy link

I suspect this issue is caused by grpc/grpc-go#460

@xorver
Copy link

xorver commented Aug 24, 2022

We just hit this problem while running the prober on our GKE services. We were observing slow leak of bpf_jit memory allocations on the node, counted with

jq -n "`sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{ print $2 }' | jq -s 'add'` / `sudo cat /proc/sys/net/core/bpf_jit_limit`"

In weeks time it exceeds the bpf_jit_limit, and the node is in a state when starting a new container fails with

OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: setns_init_linux.go:88: init seccomp caused: error loading seccomp filter into kernel: loading seccomp filter: errno 524: unknown

The node needs a restart to recover.

Workarounds

We have observed that on 3 independent services. Using the gke built in grpc prober solves the issue. Also adding a sleep of 100ms (1ms is not enough in our case) as suggested in #34 (comment) works.

Proper fix

Is there any idea for a proper fix, not involving a sleep here? (or maybe grpc/grpc-go#460 needs more prioritization in case it's really this)

As a side note. Do you think it's worth removing the issue's "harmless" tag? As our case shows, in certain circumstance this issue can cause troubles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working harmless long term issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants