Use native grpc round robin balancer #12940

Closed
wants to merge 4 commits

Conversation


@kisunji kisunji commented May 4, 2022

Possibly fixes #10603

Description

Users are reporting frequent WARN messages logged by gRPC.
They are caused by our gRPC resolver shuffling server addresses as follows:

  1. Randomly shuffle the server addresses every T interval.
  2. Pass only the first address to update the gRPC client connection, which kills any existing connection (unless the first address happens to equal the current one).
  3. Pass all of the server addresses to update the connection again, so that fallback addresses are available.

The WARN log seems to be caused by step 3 occurring while the connection is still in a transitional state from applying step 2.

Testing locally, it seems that removing the extra call to UpdateState also eliminates the WARN logs.

This PR attempts to replace our custom client-side shuffling of server addresses with native gRPC-go's round robin load balancer.
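For reference, a minimal sketch of what selecting the built-in policy at dial time can look like (hedged: the exact dial option used in this PR's diff may differ; `WithDefaultServiceConfig` is shown here because `WithBalancerName` is deprecated in grpc-go, and the target and credentials are placeholders rather than Consul's real dialer setup):

```go
package grpcpool

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialRoundRobin shows the general shape of selecting grpc-go's built-in
// round_robin load-balancing policy via the default service config.
func dialRoundRobin(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		// Placeholder credentials for the sketch only.
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Pick round_robin instead of the default pick_first policy.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
}
```

With round_robin, the balancer keeps a subchannel to every resolved address and rotates RPCs across them, which is what drives the impact notes below.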

Impact

This PR changes two behaviors:

  1. Clients will no longer have sticky connections to Servers
    Before: a client making a gRPC call always routes to the same server until the shuffle happens
    Now: a client making a gRPC call will always route to a new server
  2. Existing connections (streams) will no longer be terminated
    Before: a server address shuffle usually leads to existing connections being terminated (see step 2 above)
    Now: there is no server address shuffle

@@ -130,8 +131,8 @@ func (c *ClientConnPool) dial(datacenter string, serverType string) (*grpc.Clien
grpc.WithContextDialer(c.dialer),
grpc.WithDisableRetry(),
grpc.WithStatsHandler(newStatsHandler(defaultMetrics())),
// nolint:staticcheck // there is no other supported alternative to WithBalancerName
grpc.WithBalancerName("pick_first"),

@@ -373,21 +373,11 @@ func TestClientConnPool_IntegrationWithGRPCResolver_Rebalance(t *testing.T) {
first, err := client.Something(ctx, &testservice.Req{})
require.NoError(t, err)

t.Run("rebalance a different DC, does nothing", func(t *testing.T) {

No longer a relevant test since calls are round-robin

Comment on lines -273 to -285
// Only pass the first address initially, which will cause the
// balancer to spin down the connection for its previous first address
// if it is different. If we don't do this, it will keep using the old
// first address as long as it is still in the list, making it impossible to
// rebalance until that address is removed.
var firstAddr []resolver.Address
if len(addrs) > 0 {
firstAddr = []resolver.Address{addrs[0]}
}
r.clientConn.UpdateState(resolver.State{Addresses: firstAddr})

// Call UpdateState again with the entire list of addrs in case we need them
// for failover.

If we do want to preserve the old behavior of killing streams (almost) every time server addresses are shuffled, we could add some logic here.
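Purely as an illustration (a hypothetical sketch, not part of this PR; the method name, the forceCycle flag, and the stand-in struct are made up), such logic could look roughly like:

```go
package serverresolver

import "google.golang.org/grpc/resolver"

// serverResolver is a minimal stand-in for the resolver type in this file;
// only the field used below is included.
type serverResolver struct {
	clientConn resolver.ClientConn
}

// updateClientConn is a hypothetical method. When forceCycle is true it
// reproduces the old behavior: publishing only the new first address makes
// the balancer drop its connection to the previous first address, after
// which the full list is published again for failover.
func (r *serverResolver) updateClientConn(addrs []resolver.Address, forceCycle bool) {
	if forceCycle && len(addrs) > 0 {
		r.clientConn.UpdateState(resolver.State{Addresses: addrs[:1]})
	}
	r.clientConn.UpdateState(resolver.State{Addresses: addrs})
}
```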


kisunji commented May 4, 2022

Looks like this broke some tests. Will look into them when I'm back from vacation


@kyhavlov kyhavlov left a comment

Does this mean that every client will maintain an open connection to every server (and it's only the target server being shuffled)? Or that when a new request happens we're always opening a new connection to a server? In the former case that might cause servers to hit open connection/file descriptor limits more often (5x the connections with 5 servers).


kisunji commented May 17, 2022

Does this mean that every client will maintain an open connection to every server (and it's only the target server being shuffled)? Or that when a new request happens we're always opening a new connection to a server? In the former case that might cause servers to hit open connection/file descriptor limits more often (5x the connections with 5 servers).

The latter; unless it's a stream (which will stay open indefinitely), each new gRPC request will target a new server.

@kisunji kisunji force-pushed the kisunji/grpc-native-balancer branch from cb93ee5 to e6fc5be on May 17, 2022 14:42
Comment on lines -207 to -209
retryFailedConn(t, conn)

streamClient = pbsubscribe.NewStateChangeSubscriptionClient(conn)

Didn't see a need for this code to hard-reset the connection; it did not impact the test (nor did it cause me to add the retry below).

Comment on lines +208 to +211
retry.Run(t, func(r *retry.R) {
_, err = streamClient.Subscribe(ctx, req)
require.NoError(r, err)
})

There's a slight timing issue (<1s) that causes the second attempt to flake. I'm unsure why the previous code did not run into the same issue, but I don't consider this a significant regression.

@kisunji kisunji requested review from rboyer and boxofrad May 17, 2022 16:38
@rboyer rboyer requested a review from mkeeler May 17, 2022 17:05

mkeeler commented May 18, 2022

@kisunji I am not so sure this is desirable:

Existing connections (streams) will no longer be terminated
Before: a server address shuffle usually leads to existing connections being terminated (see step 2 above)
Now: there is no server address shuffle

We use the shuffling as a way of ensuring load gets properly distributed across all servers. For example, say we have 3 servers and each one has 10k streams open; the load is too high, so more servers (possibly read replicas) are added. If we don't shuffle, those existing 30k streams will remain pinned to the original 3 servers.

If we do get rid of the shuffle in the grpc resolver mechanisms, we probably need to replace it with some more graceful mechanism for stream reestablishment (or some other means of distributed load balancing).
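One generic mechanism in that direction, shown here only as a sketch and not something this PR implements, is grpc-go's server-side connection aging, which sends GOAWAY once a connection reaches a maximum age so clients re-establish connections (and streams) elsewhere; the durations below are illustrative:

```go
package server

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newAgingGRPCServer builds a server that gracefully closes client
// connections after MaxConnectionAge, nudging long-lived streams to
// reconnect and potentially land on a different server.
func newAgingGRPCServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      30 * time.Minute,
		MaxConnectionAgeGrace: 5 * time.Minute, // let in-flight RPCs finish
	}))
}
```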


mkeeler commented May 18, 2022

Additionally, this is concerning:

Clients will no longer have sticky connections to Servers
Before: a client making a gRPC call always routes to the same server until the shuffle happens
Now: a client making a gRPC call will always route to a new server

In a cluster not under load, this means we will spend more time doing TCP establishment, as we won't be getting the benefits of multiplexing (I think this is true, but it probably depends on how the grpc library chooses when to select a new server).

In a cluster under load it means that every client will have a grpc conn open to every server. That is <num servers> times more connections than that same cluster today. Like @kyhavlov mentioned in his comment, this could cause fd limits on servers to be reached much more quickly (and means this is likely a breaking change)


kisunji commented May 18, 2022

Thanks for the feedback @mkeeler! It looks like round_robin actually maintains a subchannel to every server address which also is probably not desirable (source). Will look for another solution!

@kisunji kisunji closed this May 18, 2022

mkeeler commented May 18, 2022

I haven't delved too deeply into the grpc client, so it's possible that the behavior of keeping a connection open to all addresses is already happening.

The most important thing is likely not having a means to force stream reestablishment to another server.


kisunji commented May 18, 2022

The pick_first policy stops traversing the list on a successful connection, so this PR would definitely cause more overhead.


@mkeeler mkeeler left a comment


Leaving this comment as a means of marking myself as having reviewed the PR.

Overall I have a few concerns but feel free to correct me if my knowledge of how the current gRPC resolver / connection pooling works today isn't accurate.

Successfully merging this pull request may close these issues.

New WARN in 1.10.0 caused by shuffling the servers in the gRPC ClientConn pool