-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
acl: remove ACL.GetPolicy RPC endpoint and ACLResolver.resolveTokenLegacy #11126
acl: remove ACL.GetPolicy RPC endpoint and ACLResolver.resolveTokenLegacy #11126
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
As part of removing the legacy ACL system ACL upgrading and the flag for legacy ACLs is removed from Clients. This commit also removes the 'acls' serf tag from client nodes. The tag is only ever read from server nodes. This commit also introduces a constant for the acl serf tag, to make it easier to track where it is used.
And all code that was no longer used once those two were removed.
6b63cb9
to
2a2da0b
Compare
2a2da0b
to
94be183
Compare
@@ -383,7 +383,7 @@ func TestLeader_FederationStateAntiEntropyPruning_ACLDeny(t *testing.T) { | |||
// Try to join. | |||
joinWAN(t, s2, s1) | |||
testrpc.WaitForLeader(t, s1.RPC, "dc1") | |||
testrpc.WaitForLeader(t, s1.RPC, "dc2") | |||
testrpc.WaitForLeader(t, s1.RPC, "dc2", testrpc.WithToken("root")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling out this test fix. I'm not sure why this is necessary now, but this fixed the test failure.
Since this is test setup, not the actual test logic, I'm thinking it's probably ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this is unusual, I'm running this test locally and it's failing on L379 due to ACL not found.
It doesn't pass when changing that line to include the root token either 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's because the secondary DC is declaring a "root" token, but the global management token can only be initialized in the primary:
https://github.com/hashicorp/consul/blob/main/agent/consul/leader.go#L431-L468
So "root" isn't a known token in DC2 until they are WAN joined, which is why your fix works for L386.
I think the WaitForLeader in L379 should potentially be removed because there's no token that can be used for those catalog queries before joinWAN
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah it's a similar problem that requires your fix. The anonymous token is only created in the primary DC, so sending a request with an empty token to dc2 will lead to ACL not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this wasn't needed before this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, in ResolveTokenToIdentityAndAuthorizer
, r.delegate.UseLegacyACLs()
was evaluating to true, and the legacy resolve method would handle missing tokens by returning a synthetic identity:missingIdentity
, rather than "ACL not found".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, removing both this testrpc.WithToken("root")
and the WaitForLeader
on L379 does seem to work.
I think I understand why it worked previously (I had guessed the resolver legacy case was at play but didn't trace it back to the different error being related), and why adding testrpc.WithToken("root")
fixed the failures (because it was actually waiting for replication, not simply the leader).
But I don't think I understand why not waiting for a leader in dc2 fixes the problem. Is this fix relying on some timing of the two servers which could cause the test to flake if the order changes?
I've pushed that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I don't think I understand why not waiting for a leader in dc2 fixes the problem. Is this fix relying on some timing of the two servers which could cause the test to flake if the order changes?
Do you mean that you're not sure why removing this line helps? testrpc.WaitForLeader(t, s2.RPC, "dc2")
There's a couple things I'm seeing:
- When s2 in dc2 starts it doesn't initialize any tokens in its state store, because it is not the primary datacenter. Only the primary DC will ensure the anonymous and global-management tokens exist. So specifying
c.ACLMasterToken = "root"
in s2 is a no-op, because a secondary DC can't persist the master token. - Since ACLs are in deny mode,
WaitForLeader
requires a token in order to queryCatalog.ListNodes
. This token can be the anonymous token, where an empty token string is passed in the request.
However, due to point 1, there is no token that can be passed into WaitForLeader
in s2. s2 can't create any tokens because it's not the authoritative ACL datacenter. It also doesn't have any tokens because it hasn't replicated any from the primary DC.
s2 can't get out of this situation until it is WAN joined with dc1.
I think the only potential source for flakiness here could be if: dc1 and dc2 are WAN joined, dc1 does not have a leader and therefore hasn't initialized ACLs, and then s2 attempts to use a token that needs to be replicated from s1 ("root").
Though I don't think we need to worry about this here because we wait for dc1 to have a leader before spinning up and joining s2, and then the dc2 WaitForLeader is wrapped in a retry as well. So that should provide enough time for the root token to be replicated to, and resolved in dc2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to add: The reason why an explicit token isn't needed when calling testrpc.WaitForLeader(t, s1.RPC, "dc1")
is that the anonymous token gets filled in here:
Line 1240 in 1904058
token = anonymousToken |
In the primary DC this token ID can be resolved to a policy.
The token doesn't have any permissions, but because it exists it won't lead to "ACL not found". Catalog.ListNodes
does ACL filtering, so the end result will be an empty list rather than a permission denied error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think part of my confusion here is that WaitForLeader
is implemented as an RPC call for some endpoint that requires ACLS, and ACL errors are assumed to mean something specific. If the implementation used something like /status/leader
instead, which does not require ACLs, that might be a bit easier to reason about.
Either way, sounds like this test is working now, so we can look into those improvements some other time.
if a.srv.UseLegacyACLs() { | ||
return fmt.Errorf("The ACL system is currently in legacy mode.") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was required to fix the test failures. The full removal of upgrades is happening in #11182
I had to make some changes to get the tests to pass, so re-requesting a review. |
This commit two test failures: 1. Remove check for "in legacy ACL mode", the actual upgrade will be removed in a following commit. 2. Remove the early WaitForLeader in dc2, because with it the test was failing with ACL not found.
94be183
to
a1e3fa8
Compare
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/463268. |
The first commit also removes ACL serf tags from Client agents (they were never used), and removes ACL upgrading from client agents. The next PR will remove ACL upgrades from servers.