
Update to go-grpc/[email protected] to resolve connection memory leak #13051

Merged Jun 8, 2022 · 1 commit

Conversation

@dekimsey (Collaborator) commented May 12, 2022

Description

Updates grpc-go (google.golang.org/grpc) to v1.37.1 to address a connection memory leak present in earlier versions.
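
For anyone reproducing the bump locally, the upgrade is roughly the following (a sketch; the exact commands and resulting module graph may differ across Consul branches):

$ go get google.golang.org/[email protected]
$ go mod tidy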

Testing & Reproduction steps

I am including the test case I used to validate the behavior. I have removed it from the PR since it tests internal library behavior and likely will not survive long-term. The test uses reflection to dig for connection leaks. When ported to 1.11.x, the test demonstrates the issue cleanly.

Test code

In agent/grpc/private/client_test.go:

func TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000(t *testing.T) {
	count := 5
	res := resolver.NewServerResolverBuilder(newConfig(t))
	registerWithGRPC(t, res)
	pool := NewClientConnPool(ClientConnPoolConfig{
			Servers:               res,
			UseTLSForDC:           useTLSForDcAlwaysTrue,
			DialingFromServer:     true,
			DialingFromDatacenter: "dc1",
	})

	for i := 0; i < count; i++ {
			name := fmt.Sprintf("server-%d", i)
			srv := newSimpleTestServer(t, name, "dc1", nil)
			res.AddServer(types.AreaWAN, srv.Metadata())
			t.Cleanup(srv.shutdown)
	}

	conn, err := pool.ClientConn("dc1")
	require.NoError(t, err)
	client := testservice.NewSimpleClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	t.Cleanup(cancel)

	_, err = client.Something(ctx, &testservice.Req{})
	require.NoError(t, err)

	t.Run("rebalance the dc", func(t *testing.T) {
			// Rebalance is random, but if we repeat it a few times it should give us a
			// new server.
			attempts := 1000
			for i := 0; i < attempts; i++ {
					res.NewRebalancer("dc1")()

					_, err := client.Something(ctx, &testservice.Req{})
					require.NoError(t, err)
			}
			// Confirm conns aren't leaking
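			// Reach into the client via reflection to read the unexported
			// "conns" field on the underlying *grpc.ClientConn (internal
			// state; this may break across grpc-go versions).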
			e := reflect.ValueOf(client).Elem()
			ccPtr := e.FieldByName("cc").Elem()
			conns := reflect.Indirect(ccPtr).FieldByName("conns").Len()
			assert.LessOrEqual(t, conns, 1, "excessive grpc connections remaining open")
			//time.Sleep(10 * time.Second)
	})
}
v1.11.5, arm64 grpc/[email protected]
$ (pushd agent/grpc; go test -v -run '^TestClientConnPool_IntegrationWithGRPCResolver_Rebalance')
~/src/consul/agent/grpc ~/src/consul
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_Rebalance
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_Rebalance/rebalance_a_different_DC,_does_nothing
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_Rebalance/rebalance_the_dc
--- PASS: TestClientConnPool_IntegrationWithGRPCResolver_Rebalance (0.10s)
    --- PASS: TestClientConnPool_IntegrationWithGRPCResolver_Rebalance/rebalance_a_different_DC,_does_nothing (0.00s)
    --- PASS: TestClientConnPool_IntegrationWithGRPCResolver_Rebalance/rebalance_the_dc (0.07s)
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000/rebalance_the_dc
    assertion_compare.go:342:
        	Error Trace:	client_test.go:438
        	Error:      	"1628" is not less than or equal to "1"
        	Test:       	TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000/rebalance_the_dc
        	Messages:   	[excessive grpc connections remaining open]
--- FAIL: TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000 (26.91s)
    --- FAIL: TestClientConnPool_IntegrationWithGRPCResolver_Rebalance5000/rebalance_the_dc (26.87s)
FAIL
exit status 1
FAIL	github.com/hashicorp/consul/agent/grpc	27.456s
v1.12.0, arm64 grpc/[email protected]
$ (pushd agent/grpc/private; go test -v -run '^TestClientConnPool_IntegrationWithGRPCResolver_Rebalance')
~/src/consul/agent/grpc/private ~/src/consul
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_Rebalance
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_Rebalance/rebalance_a_different_DC,_does_nothing
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_Rebalance/rebalance_the_dc
--- PASS: TestClientConnPool_IntegrationWithGRPCResolver_Rebalance (0.10s)
    --- PASS: TestClientConnPool_IntegrationWithGRPCResolver_Rebalance/rebalance_a_different_DC,_does_nothing (0.00s)
    --- PASS: TestClientConnPool_IntegrationWithGRPCResolver_Rebalance/rebalance_the_dc (0.05s)
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_RebalanceRepeatedly
=== RUN   TestClientConnPool_IntegrationWithGRPCResolver_RebalanceRepeatedly/rebalance_the_dc
--- PASS: TestClientConnPool_IntegrationWithGRPCResolver_RebalanceRepeatedly (16.94s)
    --- PASS: TestClientConnPool_IntegrationWithGRPCResolver_RebalanceRepeatedly/rebalance_the_dc (16.92s)
PASS
ok  	github.com/hashicorp/consul/agent/grpc/private	17.712s

Links

Fixes the issue discussed in #12288

PR Checklist

  • updated test coverage
  • external facing docs updated
  • not a security concern
  • checklist folder consulted

@Amier3 Amier3 requested a review from boxofrad May 24, 2022 19:16
@Amier3 (Contributor) commented May 24, 2022

Hey @dekimsey

Thanks for going ahead and updating this, much appreciated! We'll hopefully take a look this week or next and get you some feedback on the test cases.

@boxofrad (Contributor)
Thanks so much for taking this on @dekimsey! 🙌🏻

The upgrade looks good to me, and it's great to have a test demonstrating the fix.

I agree about the reflection, and I wonder if it'd be possible to count the number of connections on the server side instead, either by wrapping the underlying TCP listener or by implementing stats.Handler.
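
For what it's worth, here's a minimal sketch of the stats.Handler idea, assuming the handler is registered on the test server via grpc.NewServer(grpc.StatsHandler(...)); the connCounter type below is illustrative, not code from this PR:

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc/stats"
)

// connCounter is a hypothetical stats.Handler that tracks the number of
// open server connections.
type connCounter struct {
	open int64
}

// HandleConn increments on ConnBegin and decrements on ConnEnd, so Open()
// reflects the number of currently-open connections.
func (c *connCounter) HandleConn(_ context.Context, s stats.ConnStats) {
	switch s.(type) {
	case *stats.ConnBegin:
		atomic.AddInt64(&c.open, 1)
	case *stats.ConnEnd:
		atomic.AddInt64(&c.open, -1)
	}
}

func (c *connCounter) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}

func (c *connCounter) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
	return ctx
}

func (c *connCounter) HandleRPC(context.Context, stats.RPCStats) {}

// Open returns the current number of open connections.
func (c *connCounter) Open() int64 { return atomic.LoadInt64(&c.open) }

The rebalance test could then assert on counter.Open() at the end of the loop instead of reflecting on the client's internals.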

@dekimsey (Collaborator, Author) commented Jun 2, 2022

Thanks @boxofrad! I spent a bit of time poking at this idea. I used the existing patchGlobalMetrics helper to enable connection metrics and was able to see some data populated into the metrics sink, but I wasn't able to determine how to use that data to detect the issue. It's very possible I'm holding it wrong, though!

Test details aside, given the nature of the bug (a leak upstream), I'm not sure how valuable a test like this is beyond identifying the issue and confirming it is addressed. We aren't testing "our" code but an upstream library's, and that doesn't seem like a good fit for the test suite. I included the test because the initial reporter laid it out to demonstrate the issue, which was great, but I'm not sure about keeping it long-term.

@boxofrad (Contributor) commented Jun 2, 2022

I think you're right @dekimsey, let's go ahead and remove the test. Thanks so much for looking into it though!

@dekimsey force-pushed the 12288-grpc-memory-leak branch from fc406f9 to fd17158 (June 2, 2022 19:54)
@dekimsey (Collaborator, Author) commented Jun 2, 2022

I removed the test case (and documented it in the PR description), squashed the commits, and rebased the changes. Assuming the PR checks all pass, I think it's good to go.

Unfortunately, I wasn't able to figure out how to backport the changes to 1.11. I ran afoul of cascading dependency issues within go-discover and had to back out. I think it'll require a more familiar touch to port it properly.

@boxofrad (Contributor) commented Jun 6, 2022

That's great, thanks again! 🙇🏻‍♂️

It looks like the lint step is failing for reasons unrelated to this PR — I'll do some digging and then merge.

Unfortunately, I'm not sure it's feasible to backport this change to 1.11 because it'd also require upgrading a number of other dependencies (including go-discover as you mention) which has a wider blast radius than we're generally comfortable with for this kind of backport.

@boxofrad (Contributor) commented Jun 7, 2022

Hey @dekimsey 👋🏻

It looks like there's a bug in the lint script that was fixed in d457d8b; would you mind rebasing against main?

That should fix the failing check, and then I can go ahead and merge your PR.

Reported in hashicorp#12288

The test from the initial report was ported and accurately reproduced the issue.
However, since it tests an upstream library's internal behavior, it won't be
codified in our test suite. Refer to the ticket/PR for details on how to
demonstrate the behavior.
@dekimsey force-pushed the 12288-grpc-memory-leak branch from fd17158 to b636b31 (June 7, 2022 14:35)
@boxofrad boxofrad merged commit feead0b into hashicorp:main Jun 8, 2022
boxofrad added a commit that referenced this pull request Jun 8, 2022
boxofrad added a commit that referenced this pull request Jun 9, 2022