Use native grpc round robin balancer #12940

Closed
wants to merge 4 commits

Conversation


@kisunji kisunji commented May 4, 2022

Possibly fixes #10603

Description

Users are reporting frequent WARN messages logged by gRPC.
They are caused by our gRPC resolver shuffling server addresses as follows:

  1. Randomly shuffle the server addresses every T interval.
  2. Pass only the first address to update the gRPC client connection, which kills any existing connection (unless the first address happens to equal the current one).
  3. Pass all of the server addresses to update the connection again, so that fallback addresses are available.

The WARN log seems to be caused by step 3 occurring while the connection is still in a transitional state from applying step 2.

Testing locally, it seems that removing the extra call to UpdateState also eliminates the WARN logs.

This PR attempts to replace our custom client-side shuffling of server addresses with native gRPC-go's round robin load balancer.
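For reference, a minimal sketch of what selecting the built-in policy at dial time can look like (hedged: the exact dial option used in this PR's diff may differ; `WithDefaultServiceConfig` is shown here because `WithBalancerName` is deprecated in grpc-go, and the target and credentials are placeholders rather than Consul's real dialer setup):

```go
package grpcpool

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialRoundRobin shows the general shape of selecting grpc-go's built-in
// round_robin load-balancing policy via the default service config.
func dialRoundRobin(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		// Placeholder credentials for the sketch only.
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Pick round_robin instead of the default pick_first policy.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
}
```

With round_robin, the balancer keeps a subchannel to every resolved address and rotates RPCs across them, which is what drives the impact notes below.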

Impact

This PR changes two behaviors:

  1. Clients will no longer have sticky connections to Servers
    Before: a client making a gRPC call always routes to the same server until the shuffle happens
    Now: a client making a gRPC call will always route to a new server
  2. Existing connections (streams) will no longer be terminated
    Before: a server address shuffle usually leads to existing connections being terminated (see step 2 above)
    Now: there is no server address shuffle

@@ -130,8 +131,8 @@ func (c *ClientConnPool) dial(datacenter string, serverType string) (*grpc.Clien
grpc.WithContextDialer(c.dialer),
grpc.WithDisableRetry(),
grpc.WithStatsHandler(newStatsHandler(defaultMetrics())),
// nolint:staticcheck // there is no other supported alternative to WithBalancerName
grpc.WithBalancerName("pick_first"),

@@ -373,21 +373,11 @@ func TestClientConnPool_IntegrationWithGRPCResolver_Rebalance(t *testing.T) {
first, err := client.Something(ctx, &testservice.Req{})
require.NoError(t, err)

t.Run("rebalance a different DC, does nothing", func(t *testing.T) {

No longer a relevant test since calls are round-robin

Comment on lines -273 to -285
// Only pass the first address initially, which will cause the
// balancer to spin down the connection for its previous first address
// if it is different. If we don't do this, it will keep using the old
// first address as long as it is still in the list, making it impossible to
// rebalance until that address is removed.
var firstAddr []resolver.Address
if len(addrs) > 0 {
firstAddr = []resolver.Address{addrs[0]}
}
r.clientConn.UpdateState(resolver.State{Addresses: firstAddr})

// Call UpdateState again with the entire list of addrs in case we need them
// for failover.

If we do want to preserve the old behavior of killing streams (almost) every time server addresses are shuffled, we could add some logic here.
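Purely as an illustration (a hypothetical sketch, not part of this PR; the method name, the forceCycle flag, and the stand-in struct are made up), such logic could look roughly like:

```go
package serverresolver

import "google.golang.org/grpc/resolver"

// serverResolver is a minimal stand-in for the resolver type in this file;
// only the field used below is included.
type serverResolver struct {
	clientConn resolver.ClientConn
}

// updateClientConn is a hypothetical method. When forceCycle is true it
// reproduces the old behavior: publishing only the new first address makes
// the balancer drop its connection to the previous first address, after
// which the full list is published again for failover.
func (r *serverResolver) updateClientConn(addrs []resolver.Address, forceCycle bool) {
	if forceCycle && len(addrs) > 0 {
		r.clientConn.UpdateState(resolver.State{Addresses: addrs[:1]})
	}
	r.clientConn.UpdateState(resolver.State{Addresses: addrs})
}
```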


kisunji commented May 4, 2022

Looks like this broke some tests. Will look into them when I'm back from vacation


@kyhavlov kyhavlov left a comment

Does this mean that every client will maintain an open connection to every server (and it's only the target server being shuffled)? Or that when a new request happens we're always opening a new connection to a server? In the former case that might cause servers to hit open connection/file descriptor limits more often (5x the connections with 5 servers).


kisunji commented May 17, 2022

Does this mean that every client will maintain an open connection to every server (and it's only the target server being shuffled)? Or that when a new request happens we're always opening a new connection to a server? In the former case that might cause servers to hit open connection/file descriptor limits more often (5x the connections with 5 servers).

The latter; unless it's a stream (which will stay open indefinitely), each new gRPC request will target a new server.

@kisunji kisunji force-pushed the kisunji/grpc-native-balancer branch from cb93ee5 to e6fc5be on May 17, 2022 14:42
Comment on lines -207 to -209
retryFailedConn(t, conn)

streamClient = pbsubscribe.NewStateChangeSubscriptionClient(conn)

Didn't see a need for this code to hard-reset the connection; it did not impact the test (nor did it cause me to add the retry below).

Comment on lines +208 to +211
retry.Run(t, func(r *retry.R) {
_, err = streamClient.Subscribe(ctx, req)
require.NoError(r, err)
})

There's a slight timing issue (<1s) that causes the second attempt to flake. I'm unsure why the previous code did not run into the same issue, but I don't consider this a significant regression.

@kisunji kisunji requested review from rboyer and boxofrad May 17, 2022 16:38
@rboyer rboyer requested a review from mkeeler May 17, 2022 17:05

mkeeler commented May 18, 2022

@kisunji I am not so sure this is desirable:

Existing connections (streams) will no longer be terminated
Before: a server address shuffle usually leads to existing connections being terminated (see step 2 above)
Now: there is no server address shuffle

We use the shuffling as a way of ensuring load gets properly distributed across all servers. For example, say we have 3 servers and each one has 10k streams open; the load is too high, so more servers (possibly read replicas) are added. If we don't shuffle, those existing 30k streams will remain pinned to the original 3 servers.

If we do get rid of the shuffle in the grpc resolver mechanisms, we probably need to replace it with some more graceful mechanism for stream reestablishment (or some other means of distributed load balancing).
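One generic mechanism in that direction, shown here only as a sketch and not something this PR implements, is grpc-go's server-side connection aging, which sends GOAWAY once a connection reaches a maximum age so clients re-establish connections (and streams) elsewhere; the durations below are illustrative:

```go
package server

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newAgingGRPCServer builds a server that gracefully closes client
// connections after MaxConnectionAge, nudging long-lived streams to
// reconnect and potentially land on a different server.
func newAgingGRPCServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      30 * time.Minute,
		MaxConnectionAgeGrace: 5 * time.Minute, // let in-flight RPCs finish
	}))
}
```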


mkeeler commented May 18, 2022

Additionally, this is concerning:

Clients will no longer have sticky connections to Servers
Before: a client making a gRPC call always routes to the same server until the shuffle happens
Now: a client making a gRPC call will always route to a new server

In a cluster not under load, this means we will spend more time doing TCP establishment, as we won't be getting the benefits of multiplexing (I think this is true, but it probably depends on how the grpc library chooses when to select a new server).

In a cluster under load it means that every client will have a grpc conn open to every server. That is <num servers> times more connections than that same cluster today. Like @kyhavlov mentioned in his comment, this could cause fd limits on servers to be reached much more quickly (and means this is likely a breaking change)


kisunji commented May 18, 2022

Thanks for the feedback @mkeeler! It looks like round_robin actually maintains a subchannel to every server address which also is probably not desirable (source). Will look for another solution!

@kisunji kisunji closed this May 18, 2022

mkeeler commented May 18, 2022

I haven't delved too deeply into the grpc client, so it's possible that the behavior of keeping a connection open to all addresses is already happening.

The most important thing is likely not having a means to force stream reestablishment to another server.


kisunji commented May 18, 2022

The pick_first policy stops traversing the list on a successful connection, so this PR would definitely cause more overhead.


@mkeeler mkeeler left a comment


Leaving this comment as a means of marking myself as having reviewed the PR.

Overall I have a few concerns but feel free to correct me if my knowledge of how the current gRPC resolver / connection pooling works today isn't accurate.

Successfully merging this pull request may close these issues.

New WARN in 1.10.0 caused by shuffling the servers in the gRPC ClientConn pool