Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

acl: remove ACL.GetPolicy RPC endpoint and ACLResolver.resolveTokenLegacy #11126

Merged
merged 3 commits into from
Oct 4, 2021

Conversation

dnephin
Copy link
Contributor

@dnephin dnephin commented Sep 22, 2021

The first commit also removes ACL serf tags from Client agents (they were never used), and removes ACL upgrading from client agents. The next PR will remove ACL upgrades from servers.

@dnephin dnephin added theme/acls ACL and token generation pr/no-changelog PR does not need a corresponding .changelog entry labels Sep 22, 2021
@dnephin dnephin marked this pull request as draft September 22, 2021 23:26
Copy link
Member

@rboyer rboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

As part of removing the legacy ACL system ACL upgrading and the flag for
legacy ACLs is removed from Clients.

This commit also removes the 'acls' serf tag from client nodes. The tag is only ever read
from server nodes.

This commit also introduces a constant for the acl serf tag, to make it easier to track where
it is used.
And all code that was no longer used once those two were removed.
@dnephin dnephin force-pushed the dnephin/acl-legacy-remove-resolve-and-get-policy branch from 6b63cb9 to 2a2da0b Compare September 29, 2021 18:37
@vercel vercel bot temporarily deployed to Preview – consul September 29, 2021 18:37 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging September 29, 2021 18:37 Inactive
@dnephin dnephin force-pushed the dnephin/acl-legacy-remove-resolve-and-get-policy branch from 2a2da0b to 94be183 Compare September 29, 2021 19:17
@dnephin dnephin requested a review from rboyer September 29, 2021 19:17
@vercel vercel bot temporarily deployed to Preview – consul September 29, 2021 19:17 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging September 29, 2021 19:17 Inactive
@dnephin dnephin marked this pull request as ready for review September 29, 2021 19:23
@@ -383,7 +383,7 @@ func TestLeader_FederationStateAntiEntropyPruning_ACLDeny(t *testing.T) {
// Try to join.
joinWAN(t, s2, s1)
testrpc.WaitForLeader(t, s1.RPC, "dc1")
testrpc.WaitForLeader(t, s1.RPC, "dc2")
testrpc.WaitForLeader(t, s1.RPC, "dc2", testrpc.WithToken("root"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling out this test fix. I'm not sure why this is necessary now, but this fixed the test failure.

Since this is test setup, not the actual test logic, I'm thinking it's probably ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this is unusual, I'm running this test locally and it's failing on L379 due to ACL not found.

It doesn't pass when changing that line to include the root token either 🤔

Copy link
Contributor

@freddygv freddygv Oct 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's because the secondary DC is declaring a "root" token, but the global management token can only be initialized in the primary:
https://github.com/hashicorp/consul/blob/main/agent/consul/leader.go#L431-L468

So "root" isn't a known token in DC2 until they are WAN joined, which is why your fix works for L386.

I think the WaitForLeader in L379 should potentially be removed because there's no token that can be used for those catalog queries before joinWAN.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it's a similar problem that requires your fix. The anonymous token is only created in the primary DC, so sending a request with an empty token to dc2 will lead to ACL not found.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this wasn't needed before this PR though.

Copy link
Contributor

@freddygv freddygv Oct 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, in ResolveTokenToIdentityAndAuthorizer, r.delegate.UseLegacyACLs() was evaluating to true, and the legacy resolve method would handle missing tokens by returning a synthetic identity:missingIdentity, rather than "ACL not found".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removing both this testrpc.WithToken("root") and the WaitForLeader on L379 does seem to work.

I think I understand why it worked previously (I had guessed the resolver legacy case was at play but didn't trace it back to the different error being related), and why adding testrpc.WithToken("root") fixed the failures (because it was actually waiting for replication, not simply the leader).

But I don't think I understand why not waiting for a leader in dc2 fixes the problem. Is this fix relying on some timing of the two servers which could cause the test to flake if the order changes?

I've pushed that change.

Copy link
Contributor

@freddygv freddygv Oct 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't think I understand why not waiting for a leader in dc2 fixes the problem. Is this fix relying on some timing of the two servers which could cause the test to flake if the order changes?

Do you mean that you're not sure why removing this line helps? testrpc.WaitForLeader(t, s2.RPC, "dc2")

There's a couple things I'm seeing:

  1. When s2 in dc2 starts it doesn't initialize any tokens in its state store, because it is not the primary datacenter. Only the primary DC will ensure the anonymous and global-management tokens exist. So specifying c.ACLMasterToken = "root" in s2 is a no-op, because a secondary DC can't persist the master token.
  2. Since ACLs are in deny mode, WaitForLeader requires a token in order to query Catalog.ListNodes. This token can be the anonymous token, where an empty token string is passed in the request.

However, due to point 1, there is no token that can be passed into WaitForLeader in s2. s2 can't create any tokens because it's not the authoritative ACL datacenter. It also doesn't have any tokens because it hasn't replicated any from the primary DC.

s2 can't get out of this situation until it is WAN joined with dc1.

I think the only potential source for flakiness here could be if: dc1 and dc2 are WAN joined, dc1 does not have a leader and therefore hasn't initialized ACLs, and then s2 attempts to use a token that needs to be replicated from s1 ("root").

Though I don't think we need to worry about this here because we wait for dc1 to have a leader before spinning up and joining s2, and then the dc2 WaitForLeader is wrapped in a retry as well. So that should provide enough time for the root token to be replicated to, and resolved in dc2.

Copy link
Contributor

@freddygv freddygv Oct 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to add: The reason why an explicit token isn't needed when calling testrpc.WaitForLeader(t, s1.RPC, "dc1") is that the anonymous token gets filled in here:

token = anonymousToken

In the primary DC this token ID can be resolved to a policy.

The token doesn't have any permissions, but because it exists it won't lead to "ACL not found". Catalog.ListNodes does ACL filtering, so the end result will be an empty list rather than a permission denied error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think part of my confusion here is that WaitForLeader is implemented as an RPC call for some endpoint that requires ACLS, and ACL errors are assumed to mean something specific. If the implementation used something like /status/leader instead, which does not require ACLs, that might be a bit easier to reason about.

Either way, sounds like this test is working now, so we can look into those improvements some other time.

Comment on lines -174 to -176
if a.srv.UseLegacyACLs() {
return fmt.Errorf("The ACL system is currently in legacy mode.")
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was required to fix the test failures. The full removal of upgrades is happening in #11182

@dnephin dnephin requested a review from a team September 29, 2021 19:57
@dnephin
Copy link
Contributor Author

dnephin commented Sep 29, 2021

I had to make some changes to get the tests to pass, so re-requesting a review.

This commit two test failures:

1. Remove check for "in legacy ACL mode", the actual upgrade will be removed in a following commit.
2. Remove the early WaitForLeader in dc2, because with it the test was
   failing with ACL not found.
@dnephin dnephin force-pushed the dnephin/acl-legacy-remove-resolve-and-get-policy branch from 94be183 to a1e3fa8 Compare October 1, 2021 22:05
@vercel vercel bot temporarily deployed to Preview – consul October 1, 2021 22:05 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging October 1, 2021 22:05 Inactive
@dnephin dnephin merged commit 9b1d268 into main Oct 4, 2021
@dnephin dnephin deleted the dnephin/acl-legacy-remove-resolve-and-get-policy branch October 4, 2021 20:29
@hc-github-team-consul-core
Copy link
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/463268.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/no-changelog PR does not need a corresponding .changelog entry theme/acls ACL and token generation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants