"connection reset by peer, received prior goaway: code: NO_ERROR" sometimes occur in agents #3177
Comments
This is strange. Did you see this in versions prior to 1.3.0? The "received prior goaway: code: NO_ERROR" suggests that the server terminated the HTTP/2 connection, which is a little strange. The interval is also interesting: it seems to happen only sporadically, every few days. Was the agent running the entire time? The server is configured with a maximum connection age of 3 minutes. This means that it will intentionally send the client a GOAWAY (NO_ERROR), which causes the client to reconnect. This is a mechanism we employ to help distribute load when new servers come up. I wonder if the L4 load balancer is reusing a TCP connection that the server hasn't terminated yet (because it waits some time after sending the GOAWAY)... Here are related issues: |
I can see the error with agent version |
Do you know if the agent was running the entire time? That would help isolate the frequency of this event occurring. |
Yes, the agent was running the entire time. |
I'm not quite sure what is happening, but it sounds like an issue with grpc-go working with the L4 load balancer. It might be difficult considering how sporadically it happens, but maybe we can collect detailed grpc logs? https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md |
Thank you. |
So far, I have found the error in v1.2.0, v1.2.1, and v1.3.0 environments. I suspect the error always occurs between fetchEntries() and fetchBundles() in the agent. The following logs were captured on v1.2.0.
|
Is there a way to test this without the L4 load balancer in the middle? |
I need some time to prepare this environment. I will report the results shortly. Additional info. |
Any updates here, @hiyosi ? |
Sorry, it's still going to take some time. |
No worries! |
Thank you very much for reporting this issue @hiyosi . We discussed it again today on our contributor call ... the gRPC library has been recently updated in SPIRE - is there any way you can check with the latest code to see if the problem is still there? I thought maybe this would be easier than testing without the load balancer.
This sounds super likely when considering that SPIRE Server does not log anything (which I think it would if it received unexpected RST from a client). TCP connection pooling is also a known feature/optimization frequently used in L4 load balancing. If it is ok with you, we will go ahead and close this issue in two weeks if you're not able to do any additional testing by then. We're happy to re-open it whenever you are ready! |
I will do my best to report back to here within the next two weeks. |
Hey @hiyosi - thank you again for reporting this! We understand you are busy and that it takes time to test, and we don't want you to feel any pressure here. If you're able to find some time to test, please do post back here and let us know what you find. In the meantime, I'm going to close this issue out |
This may be related to https://github.com/grpc/grpc-go/issues/6019. |
https://spiffe.slack.com/archives/C7XDP01HB/p1655716118861989
SPIRE Agents sometimes log
received prior goaway: code: NO_ERROR
I can only see this error log when fetching bundles in my environment.
In my environment, there is no L7 proxy, only an L4 load balancer.