grpc-health-probe shuts down the gRPC connection non-gracefully #34
This came up in #33 as well. However, it does not surface in gRPC implementations other than Java, so I might need some help debugging this. The code is rather short, maybe take a look to see if we're missing a |
Welp, it seems like we do a bunch of … Should be a rather easy thing to fix. Would you be able to compile a binary to test if I give you a patch? |
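One piece of Go behavior that is relevant to the exit path discussed throughout this thread, shown here as a minimal standalone example rather than the probe's actual code: `os.Exit` terminates the process immediately, and deferred functions, such as a `defer conn.Close()`, are not run on that path.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	defer fmt.Println("deferred cleanup") // never printed: os.Exit skips deferred calls
	fmt.Println("doing work")
	os.Exit(1) // terminates the process right here
}
```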
See #36, can you compile and let me know if that helps? |
Hi Ahmet, thanks a lot for your quick response! I certainly can, but I need to head out in a minute. I will do it first thing in the morning (PST), is that OK? |
I just built the updated code and tested the binary, but unfortunately I am still getting the same errors. |
Can you please try this: change the defer statement to

```go
defer func() {
	log.Println("closing connection")
	conn.Close()
	log.Println("closed connection")
}()
```

and recompile & run. We should be seeing these clear messages on the exit path of the probe. |
The messages are printed out but the issue remains. I passed this on to one of my colleagues who knows Go to see if he can spot anything. |
That’s indeed peculiar. Is the Wireshark dump the same, or are you seeing an RST? |
My coworker made a few changes as an experiment and it still sends RST, same as before. |
My idea would be adding `time.Sleep(time.Second*2)` before `os.Exit()` to see if we can keep the process alive long enough for the connection to close. But since you applied the code block above and saw both messages printed, the close path does appear to run. One good litmus test would be to see if this happens with all Go gRPC clients calling Java gRPC servers. (I assume not, but it might help bisect where the problem is.) For that, you can maybe write a small program that makes an RPC and exits; a sketch follows below. |
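To make that litmus test concrete, here is a minimal sketch of such a throwaway client. The address, timeout, and use of the standard health service are illustrative assumptions, not part of the probe:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Dial without TLS and block until the connection is ready (assumes a local test server).
	conn, err := grpc.DialContext(ctx, "localhost:5000", grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		log.Fatalf("failed to dial: %v", err)
	}
	defer conn.Close() // runs because main returns normally; os.Exit would skip it

	// One RPC, then exit. Capture the traffic (e.g. in Wireshark) to see
	// whether the teardown ends in FIN/ACK or in RST.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{Service: ""})
	if err != nil {
		log.Fatalf("health check failed: %v", err)
	}
	log.Printf("status: %v", resp.GetStatus())
}
```

Running this against both a Go and a Java server while capturing packets would show whether the RST is specific to the probe or common to Go gRPC clients that exit right after an RPC.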
For issue #36: can we send SIGTERM instead of calling os.Exit(retCode)? Would that solve the issue? |
I don’t see why it would help. |
Is this being worked on for the next release? =) |
No. We can’t repro the issue in other languages. Furthermore, we don’t know what’s causing it. If you have ideas or time to debug, please contribute. |
I think I'm seeing something (perhaps) similar to this with a Go service? With this health service:
Running :
And triggering the health_probe in another terminal:
I'm seeing this WARNING in the service logs:
I wasn't sure if this was intended behaviour so I changed:
In the
And when running that against the service above I only see:
In the logs. I did a bit more munging and found that by inserting:
Between the |
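For anyone trying to reproduce a setup like the one described in the comment above, a minimal Go gRPC health service could look roughly like this. The port is an arbitrary assumption and this is only an illustrative sketch, not the commenter's actual service:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	lis, err := net.Listen("tcp", ":5000") // port chosen arbitrarily for the example
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	srv := grpc.NewServer()
	// The stock health implementation reports SERVING for the empty ("") service name by default.
	healthpb.RegisterHealthServer(srv, health.NewServer())
	log.Println("health service listening on :5000")
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```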
This PR changes the `NettyServerTransport#getLogLevel` method to log `SocketException`s to `LogLevel.FINE`, rather than exclusively pure IOExceptions. This may fix an unintentional regression introduced in c166ec2, although the message in my Java version (14.0.1) wouldn't have passed the old logic for quieting either. This also fixes the issue raised in #6423 that was locked for inactivity. This fixes

```
[2020/05/14 20:21:52 INFO] [io.grpc.netty.NettyServerTransport.connections] Transport failed
java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:345)
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:376)
	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1125)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:832)
```

being logged at level INFO, which occurs whenever a socket is improperly closed by the client, such as with the grpc-health-probe (they've got an [open issue](grpc-ecosystem/grpc-health-probe#34) for this).
Any advice on how we should proceed here? I'm not in favor of adding notable sleeps here and there, since I expect `defer conn.Close()` is actually supposed to close the connection. |
This is a known issue, already reported in #34. We spent some time researching it to no avail, as you can see in the thread. I have long been thinking about writing a small C client (to be closer to the metal and actually have control) to replace the Go implementation, but I never seem to get the time.
|
I can testify to seeing the same error in a Go service like #34 (comment). |
I suspect this issue is caused by grpc/grpc-go#460 |
We just hit this problem while running the prober on our GKE services. We were observing a slow leak of bpf_jit memory allocations on the node, counted with
Within weeks it exceeds the bpf_jit_limit, and the node ends up in a state where starting a new container fails with
The node needs a restart to recover.

**Workarounds:** We have observed this on 3 independent services. Using the GKE built-in gRPC prober solves the issue. Adding a sleep of 100ms (1ms is not enough in our case), as suggested in #34 (comment), also works; a sketch of that shape follows below.

**Proper fix:** Is there any idea for a proper fix, not involving a sleep here? (Or maybe grpc/grpc-go#460 needs more prioritization, in case it's really the cause.)

As a side note: do you think it's worth removing the issue's "harmless" tag? As our case shows, in certain circumstances this issue can cause trouble. |
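For reference, the sleep workaround mentioned above has roughly the following shape. The helper name and package are hypothetical, and 100 ms is simply the value reported to work in this thread; this is still a workaround, not the proper fix being asked about:

```go
package probe

import (
	"log"
	"os"
	"time"

	"google.golang.org/grpc"
)

// exitAfterClose is a hypothetical helper illustrating the workaround:
// close the client connection, give the TCP teardown (FIN/ACK) a moment
// to complete, and only then terminate the process.
func exitAfterClose(conn *grpc.ClientConn, retCode int) {
	if err := conn.Close(); err != nil {
		log.Printf("error closing connection: %v", err)
	}
	time.Sleep(100 * time.Millisecond) // reported to work above; 1 ms was not enough
	os.Exit(retCode)
}
```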
The probe process sends a TCP RST at the conclusion of the gRPC session instead of gracefully shutting down with a FIN/ACK exchange. This causes a lot of log noise in k8s pods, such as:

```
Feb 01, 2020 2:54:05 PM io.grpc.netty.shaded.io.grpc.netty.NettyServerTransport notifyTerminated
INFO: Transport failed
io.grpc.netty.shaded.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
```
Screen-shot of Wireshark packet exchange: