Use native grpc round robin balancer #12940
Conversation
@@ -130,8 +131,8 @@ func (c *ClientConnPool) dial(datacenter string, serverType string) (*grpc.Clien
	grpc.WithContextDialer(c.dialer),
	grpc.WithDisableRetry(),
	grpc.WithStatsHandler(newStatsHandler(defaultMetrics())),
	// nolint:staticcheck // there is no other supported alternative to WithBalancerName
	grpc.WithBalancerName("pick_first"),
This is no longer true https://pkg.go.dev/google.golang.org/grpc/balancer#section-directories
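For context, a minimal sketch of one non-deprecated way to select the built-in round-robin balancer, using `grpc.WithDefaultServiceConfig` instead of the deprecated `grpc.WithBalancerName`. The target string and other dial options below are illustrative, not taken from this PR:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Hypothetical target; in Consul this would go through the custom resolver.
	conn, err := grpc.Dial("my-scheme:///my-service",
		grpc.WithInsecure(),
		// Instead of the deprecated grpc.WithBalancerName("round_robin"),
		// select the balancer through the default service config.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```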
@@ -373,21 +373,11 @@ func TestClientConnPool_IntegrationWithGRPCResolver_Rebalance(t *testing.T) {
	first, err := client.Something(ctx, &testservice.Req{})
	require.NoError(t, err)

	t.Run("rebalance a different DC, does nothing", func(t *testing.T) {
No longer a relevant test since calls are round-robin
	// Only pass the first address initially, which will cause the
	// balancer to spin down the connection for its previous first address
	// if it is different. If we don't do this, it will keep using the old
	// first address as long as it is still in the list, making it impossible to
	// rebalance until that address is removed.
	var firstAddr []resolver.Address
	if len(addrs) > 0 {
		firstAddr = []resolver.Address{addrs[0]}
	}
	r.clientConn.UpdateState(resolver.State{Addresses: firstAddr})

	// Call UpdateState again with the entire list of addrs in case we need them
	// for failover.
If we do want to preserve the old behavior of killing streams (almost) every time server addresses are shuffled, we could add some logic here.
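A rough sketch of what that logic could look like if we wanted to keep the old behavior, based on the removed code above (the `serverResolver` stand-in type and `forceRebalance` name are assumed for illustration; only the `clientConn` field mirrors the real resolver):

```go
package resolversketch

import "google.golang.org/grpc/resolver"

// serverResolver is a stand-in for the real resolver type; only the
// clientConn field matters for this sketch.
type serverResolver struct {
	clientConn resolver.ClientConn
}

// forceRebalance reproduces the removed two-step update: advertising only the
// new first address makes the balancer spin down its previous connection, and
// the second update restores the full list so other servers remain available
// for failover.
func (r *serverResolver) forceRebalance(addrs []resolver.Address) {
	var firstAddr []resolver.Address
	if len(addrs) > 0 {
		firstAddr = []resolver.Address{addrs[0]}
	}
	r.clientConn.UpdateState(resolver.State{Addresses: firstAddr})

	// Then send the complete address list again in case it is needed for failover.
	r.clientConn.UpdateState(resolver.State{Addresses: addrs})
}
```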
Looks like this broke some tests. Will look into them when I'm back from vacation.
Does this mean that every client will maintain an open connection to every server (and it's only the target server being shuffled)? Or that when a new request happens we're always opening a new connection to a server? In the former case that might cause servers to hit open connection/file descriptor limits more often (5x the connections with 5 servers).
The latter; unless it's a stream (which will stay open indefinitely), each new gRPC request will target a new server.
(force-pushed from cb93ee5 to e6fc5be)
	retryFailedConn(t, conn)

	streamClient = pbsubscribe.NewStateChangeSubscriptionClient(conn)
Didn't see a need for this code to hard-reset the connection; it did not impact the test (nor did it cause me to add the retry below).
	retry.Run(t, func(r *retry.R) {
		_, err = streamClient.Subscribe(ctx, req)
		require.NoError(r, err)
	})
There's a slight timing issue (<1s) that flakes the second try. I'm unsure why the previous code did not run into the same issue but I didn't consider this to be a significant regression.
@kisunji I am not so sure this is desirable:
We use the shuffling as a way of ensuring load gets properly distributed across all servers. For example, suppose we have 3 servers and each one has 10k streams open. The load is too high, so more servers (possibly read replicas) are added. If we don't shuffle, those existing 30k streams will remain pinned to the original 3 servers. If we do get rid of the shuffle in the gRPC resolver mechanism, we probably need to replace it with some more graceful mechanism for stream re-establishment (or some other means of distributed load balancing).
Additionally, this is concerning as well:
In a cluster not under load this means we will spend more time doing TCP establishment, as we won't be getting the benefits of multiplexing (I think this is true but probably depends on how the gRPC library chooses when to select a new server). In a cluster under load it means that every client will have a gRPC conn open to every server.
I haven't delved too deeply into the gRPC client, so it's possible that the behavior of keeping a connection open to all addresses is already happening. The most important thing is likely not having a means to force stream re-establishment to another server.
Leaving this comment as a means of marking myself as having reviewed the PR.
Overall I have a few concerns but feel free to correct me if my knowledge of how the current gRPC resolver / connection pooling works today isn't accurate.
Possibly fixes #10603
Description
Users are reporting frequently logged WARNs being output by gRPC.
They are caused by our gRPC resolver attempting to shuffle server addresses by:
1. Calling UpdateState with only the new first address, which causes the balancer to spin down its connection to the previous first address if it is different.
2. gRPC terminating the existing connection while it applies that new state.
3. Calling UpdateState again with the entire list of addresses so the rest remain available for failover.

The WARN log seems to be caused by step 3 occurring while the connection is in a transitionary state from applying step 2.
Testing locally, it seems that removing the extra call to UpdateState also eliminates the WARN logs.

This PR attempts to replace our custom client-side shuffling of server addresses with gRPC-go's native round-robin load balancer.
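As a sketch of the direction (not the literal diff), the resolver's update then collapses to a single call, with the round_robin balancer doing the per-request spreading; `reportServers` is a hypothetical helper name:

```go
package resolversketch

import "google.golang.org/grpc/resolver"

// reportServers sketches the simplified flow after this PR: report the full,
// current server list in one UpdateState call and let the round_robin
// balancer distribute requests across it, instead of the two-step
// "first address only, then full list" update that produced the WARNs.
func reportServers(cc resolver.ClientConn, addrs []resolver.Address) {
	cc.UpdateState(resolver.State{Addresses: addrs})
}
```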
Impact
There are 2 changed behaviors from this PR:
1. Before: a client making a gRPC call always routes to the same server until the shuffle happens.
   Now: a client making a gRPC call will always route to a new server.
2. Before: a server address shuffle usually leads to existing connections being terminated (see step 2 above).
   Now: there is no server address shuffle.