Fix some data races #10396
Conversation
ID: "node-wr", | ||
Name: "node-wr", | ||
Description: "node-wr", | ||
Rules: `node_prefix "" { policy = "write"}`, | ||
Syntax: acl.SyntaxCurrent, | ||
Datacenters: []string{"dc1"}, | ||
RaftIndex: structs.RaftIndex{CreateIndex: 1, ModifyIndex: 2}, | ||
}, nil | ||
} | ||
p.SetHash(false) |
I'm wondering about this fix. Why does the ACL resolver need to call SetHash? Shouldn't that be done when the policy is created instead of when it is read?
We do set the hash on creation in the PolicySet endpoint: https://github.com/hashicorp/consul/blob/main/agent/consul/acl_endpoint.go#L1204
testPolicyForID is used by a test resolver delegate, so I think it's necessary to do it on reads here: there is no write step in this function, since it resolves these hard-coded policies as if they had been created by some other process. (But maybe I'm misunderstanding your question.)
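For context, here is a minimal sketch of the shape the helper ends up with, reconstructed from the diff fragments in this thread. The case list, the final return, and the fallback branch are assumptions, not the merged code:

```go
import (
	"github.com/hashicorp/consul/acl"
	"github.com/hashicorp/consul/agent/structs"
)

// Sketch of testPolicyForID after this change; details reconstructed
// from the diff fragments above, so treat this as illustrative.
func testPolicyForID(policyID string) (bool, *structs.ACLPolicy, error) {
	switch policyID {
	case "node-wr":
		p := &structs.ACLPolicy{
			ID:          "node-wr",
			Name:        "node-wr",
			Description: "node-wr",
			Rules:       `node_prefix "" { policy = "write"}`,
			Syntax:      acl.SyntaxCurrent,
			Datacenters: []string{"dc1"},
			RaftIndex:   structs.RaftIndex{CreateIndex: 1, ModifyIndex: 2},
		}
		// Compute the hash once, up front, mirroring the PolicySet
		// write path, so concurrent reads never mutate the policy.
		p.SetHash(false)
		return true, p, nil
	default:
		// Assumed fallback for unknown IDs.
		return false, nil, nil
	}
}
```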
Thanks! I think that suggests that this fix is reasonable (because it better matches a production flow), which is reassuring.

I think my broader question is about the call to SetHash that caused the data race: the call in resolveWithCache (here).

That seems like a very strange place to be calling SetHash. From my reading of the code, this method is called in read flows, which should already have a hash set from when the policy was written. And if not, why would SetHash be called here, and not in a more core place (like the state store or RPC handler)? My thinking is that if this SetHash call is misplaced, it has the potential to cause or hide bugs, either now or in the future. I guess maybe it was done defensively because we are about to read the hash, but I think sometimes this kind of defense can actually introduce other problems.

I think answering that question isn't a blocker for this PR, but it may be worth exploring further to see if we can rationalize where SetHash is called.
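To illustrate the hazard being discussed, here is a self-contained sketch of the lazy hash-on-read pattern. These are simplified stand-in types, not Consul's actual code: a defensive "compute the hash if missing" step turns a read path into a conditional write, and two concurrent readers then race on the Hash field.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// Policy is an illustrative stand-in, not Consul's structs.ACLPolicy.
type Policy struct {
	Rules string
	Hash  []byte
}

func (p *Policy) SetHash() {
	sum := sha256.Sum256([]byte(p.Rules))
	p.Hash = sum[:] // writes to the shared struct
}

// resolve mimics a lazy hash-on-read: if the hash was never set at
// write time, the read path fills it in "defensively". Two goroutines
// resolving the same *Policy can both take the branch and race on Hash.
func resolve(p *Policy) string {
	if len(p.Hash) == 0 {
		p.SetHash()
	}
	return fmt.Sprintf("%x", p.Hash)
}

func main() {
	p := &Policy{Rules: `node_prefix "" { policy = "write" }`}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resolve(p) // both goroutines may write p.Hash: a data race
		}()
	}
	wg.Wait()
}
```

Pre-setting the hash at creation time removes the write from the read path entirely, which is why the fix in this PR sidesteps the race.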
Yeah, that's a good point about resolveWithCache. I'm not sure why it's done there. At first I thought Hash could have been a new field, where policies created before some version wouldn't have it, but the Hash field has existed since the New ACLs PR.

It also doesn't seem to be used for ACL compatibility, since the function to extract policies from embedded tokens sets the hash itself:

Line 475 in 0527dcf: policy.SetHash(true)

Same with these other synthetic policy extractors:

Line 189 in 0527dcf: policy.SetHash(true)
Line 233 in 0527dcf: policy.SetHash(true)

My sense is that it was there for tests like this one that don't set the hash in advance. I don't see a non-test use for it.

Agreed that it doesn't need to be handled in this PR.
Some global variables are patched to shorter values in these tests, but the goroutines that read them can outlive the test because nothing waited for them to exit. This commit adds a Wait() method to the routine manager so that tests can wait for the goroutines to exit. This prevents the data race, because the "reset to original value" can happen only after all other goroutines have stopped.
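A minimal sketch of the idea behind Wait(), using a simplified manager; Consul's actual routine manager differs in its signatures and bookkeeping:

```go
package routine

import (
	"context"
	"sync"
)

// Manager is a simplified stand-in for Consul's goroutine manager;
// only the parts needed to show the Wait() idea are included.
type Manager struct {
	wg sync.WaitGroup
}

// Start runs fn in a goroutine tracked by the manager.
func (m *Manager) Start(ctx context.Context, fn func(context.Context)) {
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		fn(ctx)
	}()
}

// Wait blocks until every started goroutine has returned. Tests can
// call this before restoring patched globals, so no goroutine reads
// the global after the test writes the original value back.
func (m *Manager) Wait() {
	m.wg.Wait()
}
```

In a test, cancelling the context and then calling Wait() before restoring a patched global guarantees no background reader is still running.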
The LogOutput io.Writer used by TestAgent must allow concurrent reads and writes, and a bytes.Buffer does not allow this. The bytes.Buffer must be wrapped with a lock to make this safe.
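A sketch of the wrapper pattern, assuming illustrative names rather than the exact type merged in this PR:

```go
package testutil

import (
	"bytes"
	"sync"
)

// lockedBuffer wraps bytes.Buffer with a mutex so it can safely back
// TestAgent's LogOutput while the agent's goroutines write logs and
// the test reads them. It implements io.Writer via the Write method.
type lockedBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (b *lockedBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

func (b *lockedBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}
```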
The dnsConfig pulled from the atomic.Value is a pointer, so modifying it in place creates a data race. Use the exported ReloadConfig interface instead.
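The underlying issue is that atomic.Value makes the pointer load and store atomic, not the struct behind the pointer. A sketch with illustrative names (not Consul's actual dnsConfig type):

```go
package config

import "sync/atomic"

// dnsConfig is an illustrative stand-in for the real config struct.
type dnsConfig struct {
	NodeTTL int64
}

var current atomic.Value // holds *dnsConfig

// Racy: loads the shared pointer and mutates the struct in place,
// while other goroutines may be reading the same struct.
func patchInPlaceRacy(ttl int64) {
	cfg := current.Load().(*dnsConfig)
	cfg.NodeTTL = ttl // data race with concurrent readers
}

// Safe: copy, modify the copy, and atomically publish a new pointer.
// Going through the agent's exported ReloadConfig path has the same
// effect, which is what this commit switches the test to use.
func reload(ttl int64) {
	cfg := *current.Load().(*dnsConfig) // copy the struct
	cfg.NodeTTL = ttl
	current.Store(&cfg)
}
```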
The test was modifying a pointer to a struct that had been passed to another goroutine. Instead, create a new struct to modify.

```
WARNING: DATA RACE
Write at 0x00c01407c3c0 by goroutine 832:
  github.com/hashicorp/consul/agent.TestServiceManager_PersistService_API()
      /home/daniel/pers/code/consul/agent/service_manager_test.go:446 +0x1d86
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202

Previous read at 0x00c01407c3c0 by goroutine 938:
  reflect.typedmemmove()
      /usr/lib/go/src/runtime/mbarrier.go:177 +0x0
  reflect.Value.Set()
      /usr/lib/go/src/reflect/value.go:1569 +0x13b
  github.com/mitchellh/copystructure.(*walker).Primitive()
      /home/daniel/go/pkg/mod/github.com/mitchellh/[email protected]/copystructure.go:289 +0x190
  github.com/mitchellh/reflectwalk.walkPrimitive()
      /home/daniel/go/pkg/mod/github.com/mitchellh/[email protected]/reflectwalk.go:252 +0x31b
  github.com/mitchellh/reflectwalk.walk()
      /home/daniel/go/pkg/mod/github.com/mitchellh/[email protected]/reflectwalk.go:179 +0x24d
  github.com/mitchellh/reflectwalk.walkStruct()
      /home/daniel/go/pkg/mod/github.com/mitchellh/[email protected]/reflectwalk.go:386 +0x4ec
  github.com/mitchellh/reflectwalk.walk()
      /home/daniel/go/pkg/mod/github.com/mitchellh/[email protected]/reflectwalk.go:188 +0x656
  github.com/mitchellh/reflectwalk.walkStruct()
      /home/daniel/go/pkg/mod/github.com/mitchellh/[email protected]/reflectwalk.go:386 +0x4ec
  github.com/mitchellh/reflectwalk.walk()
      /home/daniel/go/pkg/mod/github.com/mitchellh/[email protected]/reflectwalk.go:188 +0x656
  github.com/mitchellh/reflectwalk.Walk()
      /home/daniel/go/pkg/mod/github.com/mitchellh/[email protected]/reflectwalk.go:92 +0x164
  github.com/mitchellh/copystructure.Config.Copy()
      /home/daniel/go/pkg/mod/github.com/mitchellh/[email protected]/copystructure.go:69 +0xe7
  github.com/mitchellh/copystructure.Copy()
      /home/daniel/go/pkg/mod/github.com/mitchellh/[email protected]/copystructure.go:13 +0x84
  github.com/hashicorp/consul/agent.mergeServiceConfig()
      /home/daniel/pers/code/consul/agent/service_manager.go:362 +0x56
  github.com/hashicorp/consul/agent.(*serviceConfigWatch).handleUpdate()
      /home/daniel/pers/code/consul/agent/service_manager.go:279 +0x250
  github.com/hashicorp/consul/agent.(*serviceConfigWatch).runWatch()
      /home/daniel/pers/code/consul/agent/service_manager.go:246 +0x2d4

Goroutine 832 (running) created at:
  testing.(*T).Run()
      /usr/lib/go/src/testing/testing.go:1238 +0x5d7
  testing.runTests.func1()
      /usr/lib/go/src/testing/testing.go:1511 +0xa6
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202
  testing.runTests()
      /usr/lib/go/src/testing/testing.go:1509 +0x612
  testing.(*M).Run()
      /usr/lib/go/src/testing/testing.go:1417 +0x3b3
  main.main()
      _testmain.go:1181 +0x236

Goroutine 938 (running) created at:
  github.com/hashicorp/consul/agent.(*serviceConfigWatch).start()
      /home/daniel/pers/code/consul/agent/service_manager.go:223 +0x4e4
  github.com/hashicorp/consul/agent.(*ServiceManager).AddService()
      /home/daniel/pers/code/consul/agent/service_manager.go:98 +0x344
  github.com/hashicorp/consul/agent.(*Agent).addServiceLocked()
      /home/daniel/pers/code/consul/agent/agent.go:1942 +0x2e4
  github.com/hashicorp/consul/agent.(*Agent).AddService()
      /home/daniel/pers/code/consul/agent/agent.go:1929 +0x337
  github.com/hashicorp/consul/agent.TestServiceManager_PersistService_API()
      /home/daniel/pers/code/consul/agent/service_manager_test.go:400 +0x17c4
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202
```
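The fix pattern in miniature, with illustrative names rather than the test's actual types: once a pointer has been handed to another goroutine, treat the struct behind it as read-only and build a fresh struct for the next update.

```go
package main

import "fmt"

// Service is an illustrative stand-in for the struct the test shared
// with a background goroutine; field names are assumptions.
type Service struct {
	ID   string
	Port int
}

func main() {
	updates := make(chan *Service, 2)
	done := make(chan struct{})

	// Background "watch" goroutine, standing in for serviceConfigWatch:
	// it reads every struct sent to it.
	go func() {
		defer close(done)
		for svc := range updates {
			fmt.Println(svc.ID, svc.Port)
		}
	}()

	svc := &Service{ID: "web", Port: 8080}
	updates <- svc

	// Racy "update": writing through the shared pointer, e.g.
	// svc.Port = 9090, races with the goroutine still reading *svc.

	// The fix: build a new struct and hand that over instead.
	updated := &Service{ID: "web", Port: 9090}
	updates <- updated

	close(updates)
	<-done
}
```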
To pick up data race fixes
By setting the hash when we create the policy.

```
WARNING: DATA RACE
Read at 0x00c0028b4b10 by goroutine 1182:
  github.com/hashicorp/consul/agent/structs.(*ACLPolicy).SetHash()
      /home/daniel/pers/code/consul/agent/structs/acl.go:701 +0x40d
  github.com/hashicorp/consul/agent/structs.ACLPolicies.resolveWithCache()
      /home/daniel/pers/code/consul/agent/structs/acl.go:779 +0xfe
  github.com/hashicorp/consul/agent/structs.ACLPolicies.Compile()
      /home/daniel/pers/code/consul/agent/structs/acl.go:809 +0xf1
  github.com/hashicorp/consul/agent/consul.(*ACLResolver).ResolveTokenToIdentityAndAuthorizer()
      /home/daniel/pers/code/consul/agent/consul/acl.go:1226 +0x6ef
  github.com/hashicorp/consul/agent/consul.resolveTokenAsync()
      /home/daniel/pers/code/consul/agent/consul/acl_test.go:66 +0x5c

Previous write at 0x00c0028b4b10 by goroutine 1509:
  github.com/hashicorp/consul/agent/structs.(*ACLPolicy).SetHash()
      /home/daniel/pers/code/consul/agent/structs/acl.go:730 +0x3a8
  github.com/hashicorp/consul/agent/structs.ACLPolicies.resolveWithCache()
      /home/daniel/pers/code/consul/agent/structs/acl.go:779 +0xfe
  github.com/hashicorp/consul/agent/structs.ACLPolicies.Compile()
      /home/daniel/pers/code/consul/agent/structs/acl.go:809 +0xf1
  github.com/hashicorp/consul/agent/consul.(*ACLResolver).ResolveTokenToIdentityAndAuthorizer()
      /home/daniel/pers/code/consul/agent/consul/acl.go:1226 +0x6ef
  github.com/hashicorp/consul/agent/consul.resolveTokenAsync()
      /home/daniel/pers/code/consul/agent/consul/acl_test.go:66 +0x5c

Goroutine 1182 (running) created at:
  github.com/hashicorp/consul/agent/consul.TestACLResolver_Client.func4()
      /home/daniel/pers/code/consul/agent/consul/acl_test.go:1669 +0x459
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202

Goroutine 1509 (running) created at:
  github.com/hashicorp/consul/agent/consul.TestACLResolver_Client.func4()
      /home/daniel/pers/code/consul/agent/consul/acl_test.go:1668 +0x415
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1193 +0x202
```
@@ -319,14 +319,16 @@ func testPolicyForID(policyID string) (bool, *structs.ACLPolicy, error) {
  RaftIndex: structs.RaftIndex{CreateIndex: 1, ModifyIndex: 2},
  }, nil
  case "acl-wr":
- return true, &structs.ACLPolicy{
+ p := &structs.ACLPolicy{
How come this change was only needed for a subset of the cases?
I believe it is because only the "Concurrent-Token-Resolve" test case has a race on SetHash, and only these two policies are used by that test case.

I can add SetHash to the other cases as well, if we think that would better imitate production.
I think adding it to all of them would be nice for that reason. It would also avoid confusion when someone looks at this again later.
Sounds good! Added those in the latest commit
Thank you for reviewing this PR! I think those are excellent questions; I've replied below.
Have one request below about calling SetHash for all of the test policies, but will approve the PR anyways
A previous commit used SetHash on two of the cases to fix a data race. This commit applies that change to all cases, which should make the test helper behave more like production.
The one test failure (
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/412138.
Related to #8329
Fixes #9458
These commits are extracted from #10341. There's still a lot more work required to properly solve #9457 (ex: hashicorp/memberlist#238), so I'd like to get some of the data race fixes merged to prevent them from rotting, and so that contributors can run the race detector locally with much less noise.
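For anyone wanting to reproduce or verify these fixes locally, the standard Go race detector invocation applies; the package pattern here is just an illustrative target, not a project-mandated one:

```
go test -race ./agent/...
```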