backoff/timeout does not take into account whether an actual gRPC connection was established, causes denial of service #954
Comments
@jellevandenhooff in #120 the main topic is how annoying the messages are; here, however, I instead see that:
Since grpc-go wants to automate reconnection, I think it should also cover this type of misbehaving server to prevent self-DoS scenarios. It should be possible to cover not only the dialing stage but also the reconnection stage (think of the IRC "reconnecting too fast" error message). |
I have dug a bit into this; it turns out that the tight-looping (DoS-ing) behaviour is to be found in the transport's reconnect logic. Steps to reproduce the problem:
The client will generate an avalanche of reconnections, quickly exhausting all file descriptors. |
@iamqizhao, what do you think of adding logic to the transport monitor to back off before attempting to reconnect again if the first read in the reader goroutine fails? |
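For illustration, a minimal sketch of the kind of backoff being proposed here; this is not grpc-go's actual transport monitor, and all names in it are invented for the example:

package reconnect

import (
	"math/rand"
	"time"
)

// reconnectWithBackoff is a rough illustration of the proposal above, not
// grpc-go's real transport monitor. runTransport is assumed to dial, serve
// the connection until it breaks, and report whether the transport ever
// became usable (i.e. whether the first read succeeded).
func reconnectWithBackoff(runTransport func() (becameUsable bool)) {
	const (
		baseDelay  = 1 * time.Second
		maxDelay   = 2 * time.Minute
		multiplier = 1.6
		jitter     = 0.2
	)
	delay := baseDelay
	for {
		if runTransport() {
			// The connection was genuinely established; reset the backoff
			// and reconnect right away (the harder case discussed later in
			// this thread, where even this can loop, is left out here).
			delay = baseDelay
			continue
		}
		// The transport failed immediately (e.g. the first read in the
		// reader goroutine failed): back off before redialing instead of
		// retrying in a tight loop. Jitter avoids reconnects in lockstep.
		time.Sleep(delay + time.Duration(rand.Float64()*jitter*float64(delay)))
		if delay = time.Duration(float64(delay) * multiplier); delay > maxDelay {
			delay = maxDelay
		}
	}
}

The key point is that the delay only resets once a connection has actually become usable, so a server that accepts and immediately drops connections cannot force a tight redial loop.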
+1, this is still an issue. I have a very basic use case of gRPC, and the log output is flooded with this message every 4 minutes while it is sitting idle:
2017/02/22 09:37:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:41:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:45:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:49:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:53:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:57:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
This looks confusing for someone who has just started using gRPC and therefore warrants a Google search, which eventually lands people here. I am still wondering if I'm doing something wrong.
|
We will be adding verbosity levels to logging soon; errors like that won't spam the logs after that.
No plans yet to support cases where the server or client misbehaves. In this particular case, it seems like the underlying transport breaks due to something (maybe the environment configuration doesn't allow long-lived idle connections). However, the aforementioned monitor routine reconnects.
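As an aside, grpc-go's logging eventually became pluggable and leveled; a minimal sketch of silencing the noisy transport messages, assuming the present-day grpclog API rather than what existed when this comment was written:

package main

import (
	"io"
	"os"

	"google.golang.org/grpc/grpclog"
)

func main() {
	// Send Info and Warning output to io.Discard and keep only Errors, so
	// periodic transport messages no longer flood the application logs.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(io.Discard, io.Discard, os.Stderr))

	// ... create gRPC connections / clients as usual ...
}

Recent grpc-go versions also honour the GRPC_GO_LOG_SEVERITY_LEVEL and GRPC_GO_LOG_VERBOSITY_LEVEL environment variables, which give the same effect without code changes.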
|
I think we need to dig deeper into this one. I understand the logging work is in progress, but this might be pointing to an actual problem. This is basically just me running multiple gRPC services on my laptop, and it starts happening after a while, e.g. about 2 minutes after the services stop receiving requests:
|
@MakMukhi Do you have an ETA and/or gRPC version number for when the log spam will cease? |
I'm getting these logs when using the gcloud vision client (cloud.google.com/go/vision)
|
I am also getting these logs when using Google's Cloud Speech and Cloud Translate APIs. Is there an update on this? |
I have an easy repro for this: I just initialized one of the cloud.google.com/go libraries, did not make any requests, created an interactive prompt to read input from the user, and waited for a couple of minutes at the prompt. Simplest repro:
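A rough sketch of what such a repro might look like; the package path and constructor below are assumptions based on the cloud.google.com/go/vision client mentioned earlier in the thread, not the reporter's actual code:

package main

import (
	"bufio"
	"context"
	"fmt"
	"log"
	"os"

	vision "cloud.google.com/go/vision/apiv1"
)

func main() {
	ctx := context.Background()

	// Hypothetical repro: creating the client opens a gRPC connection to
	// the API; no RPCs are issued afterwards. The exact library the
	// reporter used is unknown.
	client, err := vision.NewImageAnnotatorClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Sit idle at an interactive prompt for a few minutes.
	fmt.Print("press enter to exit: ")
	bufio.NewReader(os.Stdin).ReadString('\n')
}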
Shortly after, you'll start seeing:
|
The logging API changes are in #922. After this, we will change the gRPC code to use the correct log level. |
@menghanl any updates on the zero-delay reconnect logic? |
@jellevandenhooff The logging issue is more annoying for users, so we will get that fixed soon. |
Let me know if you need a repro program. I am assuming the fix will get rid of this error message without any change in the caller code. If that's not the case, we're not fixing it right. |
Sorry, reopened. |
We discussed this a bit today and weren't sure exactly what the right behavior is in this case. As mentioned above, backoff applies only to cases where the server does not respond, not to the case where it can be reached but then immediately fails. Relatedly, the pick-first balancer, which restarts from the top of the list of backends upon any backend failure, would continually pick this same problem server if it is at the top. We will need to do some more investigation on this. |
A side effect (it seems) of these rapid retries against a misbehaving remote server (which in our case was due to a misconfiguration pointing at the wrong server) is hundreds of lingering TCP sessions, with GC getting triggered every 15 seconds or so and taking up to 70% of a core for about 15 seconds. Below is a snippet of the perf report. Is there any way to expedite a fix for this issue? |
This and similar problems have come up a few times lately, so we would like to do something soon. The tentative plan is to treat a connection as successfully established (and reset the backoff) only once the server's initial HTTP/2 settings frame has been received.
If a server is healthy enough to send a settings frame and then disconnects, and that happens repeatedly... I'm not sure there's much we can do about that without changes to the gRPC spec. One incompatibility with this behavior that I'm aware of is cmux; there is a workaround, but it would require a small code change. If anyone knows of other possible problems with this approach, please let us know. |
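Separately from the plan above, for anyone hitting this today: newer grpc-go releases let the client slow down its own reconnect attempts. A sketch, assuming a grpc-go version that has grpc.WithConnectParams and the google.golang.org/grpc/backoff package (neither existed when this issue was filed); the target address is a placeholder:

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Make reconnect attempts ramp up to a long delay, so a server that
	// accepts and immediately closes connections cannot drive the client
	// into a tight redial loop. The target below is a placeholder.
	conn, err := grpc.Dial("127.0.0.1:8002",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  1 * time.Second,
				Multiplier: 1.6,
				Jitter:     0.2,
				MaxDelay:   2 * time.Minute,
			},
			MinConnectTimeout: 20 * time.Second,
		}),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}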
Thanks for the details of the fix being considered. I wonder whether you are in a position to share when the patch is likely to be available. Thanks. |
Let's consolidate this with #1444. |
When I connect a gRPC client to an IP and port that has no listener, the client properly tries to reconnect and respects the dial timeout and fallback.
However, if I run socat -v TCP-LISTEN:8002,fork SYSTEM:'echo hi' and then attempt to connect my grpc-go client to it, it keeps reconnecting at a fast rate in what seems to be an infinite loop, effectively DoS'ing the host it's running on, since the TCP connections stay around in TIME_WAIT for the tcp_fin_timeout duration.
This can cause issues when using a TCP load balancer like ELB, which always has a TCP listen port open regardless of whether any backend instances are available: the connection is immediately closed after being accepted.
Even without a load balancer, if the server the client tries to connect to misbehaves for any reason, the same thing happens.
Note: I'm using grpc.WithTimeout and grpc.WithBlock.
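For completeness, a sketch of the kind of blocking dial described in this report, written with grpc.DialContext and a context timeout instead of the now-deprecated grpc.WithTimeout; the address is just the socat listener from the example above, and the reporter's exact options are unknown:

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Block until the connection is up or the timeout expires, roughly the
	// behaviour of grpc.WithTimeout + grpc.WithBlock. Against a listener
	// that accepts and immediately closes (the socat example, or an ELB
	// with no healthy backends), the ClientConn keeps redialing underneath.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "127.0.0.1:8002",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}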