Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Registrations RBAC #1975

Closed
prasadborole1 opened this issue Nov 24, 2020 · 19 comments · Fixed by #2416
Closed

[RFC] Registrations RBAC #1975

prasadborole1 opened this issue Nov 24, 2020 · 19 comments · Fixed by #2416
Assignees

Comments

@prasadborole1
Copy link
Contributor

prasadborole1 commented Nov 24, 2020

Background:
SPIRE implements the concept of “admin” workloads. A workload presenting an admin SVID is able to act broadly across the entire registration API. In an ecosystem where multiple registrants are creating/deleting registrations, it results in registrants being able to affect registrations created by other registrants. Therefore, we need a way to limit the access of registration APIs to subsets of SPIFFE IDs by the registrant. More background discussion can be found in the github issue ​#716​.

Possible solutions:

  1. Plug-in-based approach (Registration authorizer):
    The registration authorizer plug-in would be invoked in the path of create, update and delete registration RPCs. This will allow plug-in to get details of requestor identity, action to be performed and subject of the action. It will return a boolean response denoting whether requestor identity is authorized to perform action on registration with specific SPIFFE ID present in the RPC request.
    We currently have a plug-in “Notifier” which gets invoked after each bundle update call. The registration authorizer plug-in would behave in a similar way. Since both of these plug-ins are intended to be used for very different purposes, I’m proposing to create a new one instead of overloading existing “Notifier” plug-in.
    This approach offers more flexibility as users can potentially implement their own authorization model depending on different use cases or can integrate with any other external authorization solution. By default we will have a Noop plug-in which will always return true to ensure backward compatibility.

  2. OPA-based config:
    In this approach, we make changes to SPIRE core to include OPA-based config which will define admin registration authorization. The basic implementation can be found ​here​.
    This approach would make authorization a first class citizen of SPIRE which brings both advantages and challenges. All the SPIRE users will get authorization by default but it means arriving at consensus on the basic permission model so that most customer use cases are addressed. Also, there will be challenges around compatibility when modifying the permission model while the feature is settling in the production environment.

Sample Implementation of Registration Authorizer plug-in:
A sample permission model for registration RPCs in the plug-in could look like:

Actor​ ​:​ ​Role, 
Role​ ​:​ ​[{
              Actions: ​[], 
              Resources: ​[], 
              Namespaces: ​[],
​         }, 
         {
              Actions: ​[], 
              Resources: ​[], 
              Namespaces: ​[],
​         }, 
​}]

Example:

spiffe:​//example.ca/foo/workload:​ ​foobarRegistrant, 
foobarRegistrant: ​[​ ​{
   Actions: ​["CREATE",​ ​"UPDATE",​ ​"DELETE"]
   Resources: ​[“REGISTREATIONS”],
   Namespaces: ​["spiffe://example.ca/foo",​ ​"spiffe://example.ca/bar"]
}]

In the above example, the “​Actor​” is s​piffe://example.ca/foo/workload​ admin workload which is mapped to the “​Role​” of the registrant​.
The role​ foobarRegistrant ​would allow “​Actions”​ like create, update and delete on specific namespaces.
The “​Namespaces​” holds prefixes of SPIFFE IDs on which the role allows the above Actions.
During each lookup, plug-in will look for a role and action and then verify the namespace of requested identity. The role could be shared with different actors.
Namespaces​ are prefixes of the SPIFFE IDs.
Resources​ are a set of RPCs which are authorizing a given role. (Initially we will just have registration RPCs. Adding this field so that in future we can possibly extend this to other RPCs)
List of RPCs which will invoke the plug-in:

  • Create registration
  • Update registration
  • Delete registration

The latency impact of plug-in to registration RPCs would be proportional to the number of actors(registrants) and roles. But assuming this can be performed in memory, overall RPC latency would not observe a sizable increase.
As a first step we can get Noop plug-in as a default. The details of implementations of actual registration authorization plug-in can be discussed separately.

Request For Comments:
This proposal is intended to get feedback on the overall approach for authorizing admin workloads in the near future as this is becoming an increasingly essential feature in hardening the SPIRE ecosystem in the production.

@prasadborole1 prasadborole1 changed the title [RFC] Regit [RFC] Registrations RBAC Nov 24, 2020
@rturner3
Copy link
Collaborator

Thanks for sending this, @prasadborole1 ! I think it would also be useful to examine the usefulness of this RBAC feature for more than just the Registration APIs so that we come up with a solution that can potentially be extended beyond the set of use cases presented here. Bundle management, for example, seems like another set of operations that would benefit from a finer grained authorization solution.

@amartinezfayo
Copy link
Member

Thank you @prasadborole1 for putting this together!
I'm personally optimistic about the OPA-based approach. I feel that having an homogeneous permission model is a good thing and can help to simplify the configuration and deployment once that's defined. Adopting OPA also brings a lot of things already figured out and implemented functionality that would need to be reimplemented otherwise, like the evaluation of policies. It seems better to me to have that handled by the exposed APIs which is less error prone than a custom implementation, at the cost of less flexibility.

@evan2645
Copy link
Member

I agree with both @rturner3 and @amartinezfayo. My vote is for option 2

In the end, we'll want this work to solve RPC authorization across the board, entry CRUD being just one use case... so we should be thinking about it in more generalized terms if possible. Here are a few questions rolling around in my head:

  • I noticed in the linked example that the entire request is passed into OPA for evaluation. Does this "just work" with all the native protobuf types etc?
  • How should the policy be configured? File-based config seems like a reasonable approach. Do folks typically maintain multiple smaller rego files (e.g. policy-file-per-api) or is there a better way to model it?
  • How does this interact with the current admin bit? Feels to me like the logic should be isAuthorized = admin || OPA

@lumjjb
Copy link
Contributor

lumjjb commented Feb 1, 2021

Wanted to chime in and mention that this is a great feature to have and am looking forward to it. We have a use case around this where we want to be able to create organizational policy around identities, so having something a call to a policy engine would be ideal.

Another more specific use case we have is around the kubernetes workload registrar. The specific issue we want to address with the policies is that the admin of the kubernetes cluster is able to create any identity they want. This is something that we are looking to lock down without necessarily having to create a single trust domain for every separate cluster. Being able to define a policy to enforce that identities created by the workload registrar cannot have the admin flag, or being able to restrict the identities to not be able to impersonate other clusters would be very useful.

@rturner3
Copy link
Collaborator

rturner3 commented Feb 1, 2021

This work may also enable #1207 as well, which I believe is a pretty critical requirement for traceability of registrations and bundles.

@lumjjb
Copy link
Contributor

lumjjb commented Mar 11, 2021

Would like to find out more about the work around this issue, really keen on using this :). is there anything I can do to help out?

@evan2645
Copy link
Member

Hi @lumjjb!

I think someone @rturner3 works with was going to pick this up, but it has been sitting for a long time. If we provide a detailed design, would you be willing to help implement it? I think we know roughly what it should look like... of course, you'd also have an opportunity to validate that :)

@lumjjb
Copy link
Contributor

lumjjb commented Mar 12, 2021

Yea! We should be able to help out - current planning our future work, and we should be able to fit this into our plan.

@lumjjb
Copy link
Contributor

lumjjb commented May 17, 2021

We should have some cycles to start looking into this.

@prasadborole1 @rturner3 is this something that you'd be free to chat about?

@prasadborole1
Copy link
Contributor Author

Thank you for following up on this and apologies for the delay. I believe the next step here is to come up with the design using OPA strategy. @lumjjb I can sync with you later in this week.

@lumjjb
Copy link
Contributor

lumjjb commented May 18, 2021

Awesome - i'm out later this week, would you be okay to sync up early next week? My email is on my github profile if you'd like to send an invite! Thanks!

@lumjjb
Copy link
Contributor

lumjjb commented Jun 4, 2021

@prasadborole1 and I had a chat to sync up on this. I've rebased their work on the main branch https://github.com/lumjjb/spire/tree/tj/policy-dirty and will be working on it from there.

I will be working on writing up some documentation around it, and maybe adding the ability for policy to apply to UDS calls as well with caller specified as "UDS".

@evan2645
Copy link
Member

evan2645 commented Jun 7, 2021

Great news @lumjjb! Thank you for picking this up, it is much needed.

It'd be great if you could join the SPIRE contributor sync at some point. Most folks undertaking non-trivial works on the codebase join, and it is a useful place to get opinions before going down a long path. They take place every Tuesday at 11:30 Pacific, the details are shared on the SIG-SPIRE shared calendar: https://calendar.google.com/calendar/ical/c_q2don6m2b33gljqftauib3hnuk%40group.calendar.google.com/public/basic.ics

@lumjjb
Copy link
Contributor

lumjjb commented Jun 8, 2021

It'd be great if you could join the SPIRE contributor sync at some point.

Awesome - yea i have an appointment already this week but ill probably drop by next week! Thanks Evan.

@lumjjb
Copy link
Contributor

lumjjb commented Jun 25, 2021

Here's a quick update on where I am with this issue...

Quick recap:

I rebased the branch that folks from Uber worked on. The integration for that was done with the old registration API. So I've integrated the similar functionality with the new SPIRE server APIs. Part of this effort was modifying the middleware API to provide requests to the Preprocess interface for unary interceptors.

During discussions during community meetings, we talked about perhaps using the OPA policy to encapsulate the entire policy.

Where we are now

Based on this idea, I explored several options that we could take with this. This is tricky mainly because of the interaction between OPA and SPIRE with relation with figuring out of a caller has certain properties - like IsAdmin, IsDownstream, or IsAgent. Obtaining these properties brings some cost with it and doing it for every request even though it may not be necessary to determine authorization will create additional cost within the critical path of every SPIRE request.

The policy databindings would mimic the following go code through this JSON

Possible Solutions

There are several approaches to solving the problem stated above:

  1. Incur cost for all calls and pass it to OPA policy
  2. OPA policy interfaces with SPIRE through HTTP / golang addon to get more
    information
  3. Use OPA policy and provide an option to "pass" onto default SPIRE policy
  4. OPA policy returns additional variables

I will provide more details about these options below, but for as a summary, my
general opinions are that:

Option 1 is infeasible due to costs,

Option 2 on paper sounds really enticing but either will result in over-engineering / creating artificial restrictions (i.e. can't use a remote OPA server). But the main drawback for this is that there is no notion of cost from OPA about these external calls, which means that writing an OPA policy that uses this calls will be much more complicated, as there is no way explicit way to
indicate to OPA that call IsAdmin() is expensive and should only be called as a last resort. For the specific usecase of reducing cost in the critical path, this option seems fairly fragile.

Option 3 is close to the initial proposal of the issue, but make it difficult to create finer grained policy controls.

Option 4 allows creation of finer grained policies without any additional cost but results in slightly more involved policy definition. This is my personal favorite choice.

Big thanks to @ashutosh-narkar, that helped consult on the different options.

Options

Option 1. Incur cost for all calls and pass it to OPA policy

The idea behind this is to obtain all the necessary booleans of whether a caller is admin, etc. and pass all these as inputs to the OPA policy. For example, the policy input would look like:

input := policy.Input{
		Caller:     spiffeID,
		FullMethod: fullMethod,
		Req:        req,
		Admin:      isAdmin,
		Local:      isLocal,
		Downstream: isDownstream,
		Agent:      isAgentCaller,
	}

However, this would cost in terms of implementation, and look like main...lumjjb:policy_opa#diff-95d2951732a69d7a4f84f0982eed9968c2aeedca645cb2bafaad3db6547a510eR12-R79

This way the OPA policy would have all the information needed to make a final decision, however, coming at a cost of preparing all these inputs.

Option 2: OPA policy interfaces with SPIRE through HTTP / golang addon to get more information

This option revolves around either two of these "plugin" features of OPA:
pull data during evaluation and custom built-in functions in go.

These both provide the ability for OPA to call into a HTTP endpoint or golang code for retrieving certain data. There are 2 major implications of using these methods, 1st is engineering overhead, and the 2nd, implicit policy complexity.

Engineering Implications

The use of either of them has some engineering implications.

For the earlier, a HTTP endpoint needs to be created and serve OPA query of whether a certain SPIFFE ID has the admin flag, or is an agent (which may need certain correlation with golang contexts), which is a fairly involved engineering effort for what we want to achieve.

For the later, golang code can be written to replace certain functions within the OPA engine. This is great for golang integration, but the implication of this is that we would not be able to do remote OPA policy evaluation. Remote OPA policy evaluation seems like it may be useful in the future as

Policy Complexity Implications

OPA is not aware of the cost of these functions. Therefore, it treats these functions like any other OPA functions. This means that the user has very little control on what ends up being executed to fulfill the policy. This could lead to incurring a cost equivalent to getting all the peroperties before hand (like in option 1). This may also make writing of policies "right" fairly complicated since there is an implicit cost based on the execution semantics of the OPA policy. This means that one may have to write a policy a certain way to be more optimal, or is at the mercy of the OPA engine in determining it.

Option 3. Use OPA policy and provide an option to "pass" onto default SPIRE policy

This option is similar to the original draft branch proposed. With the difference that on top of the OPA policy evaluation returning just a boolean allow indicating that the call should be allowed, an additional variable pass can indicate that the policy would like to defer to the default SPIRE policy.

So instead of a query of a result with:

result = {
  "allow": true/false,
}

We will have an additional field:

result = {
  "allow": true/false,
  "pass": true/false,
}

Thus, the evaluation would look something like the following:

// Check OPA policy and if allow=false and pass=true, go on to regular authz
// rules
allow, pass, err := m.opaAuth2(ctx, req, methodName)
if err != nil {
    return nil, err
}
if allow {
    return ctx, nil
} else if !pass {
    return nil, status.Errorf(codes.PermissionDenied, "OPA policy denied without passthrough")
}

authorizer, ok := m.authorizers[methodName]
if !ok {
    middleware.LogMisconfiguration(ctx, "Authorization misconfigured (method not registered); this is a bug")
    return nil, status.Errorf(codes.Internal, "authorization misconfigured for %q (method not registered)", methodName)
}
return authorizer.AuthorizeCaller(ctx, req)

Option 3 is close to the initial proposal of the issue, but make it difficult to create finer grained policy controls that involve additional flags like checking if the caller is admin or an agent.

Option 4. OPA policy returns additional variables

Option 4 allows creation of finer grained policies without any additional cost but results in slightly more involved policy definition. This is my personal favorite choice.

This option is similar to option 3 but with additional fields defined by the OPA policy,

So instead of the OPA policy returning

result = {
  "allow": true/false,
}

We have additional fields indicating that the policy would allow it if certain flags are met:

result = {
  "allow": true/false,
  "allow_if_admin": true/false,
  "allow_if_local": true/false,
  "allow_if_downstream": true/false,
  "allow_if_agent": true/false,
  "pass": true/false,
}

This would then be evaluated like:

Admit IF

allow || (allow_if_local && isLocal()) || (allow_if_admin && isAdmin()) || ...

where it would call the necessary go functions in order of cost.

Code that does this is as follows, below are some examples to understand the evaluation main...lumjjb:policy_opa#diff-95d2951732a69d7a4f84f0982eed9968c2aeedca645cb2bafaad3db6547a510eR81-R149

Example 1: all false

result = {
  "allow": false,
  "allow_if_admin": false,
  "allow_if_local": false,
  "allow_if_downstream": false,
  "allow_if_agent": false,
  "pass": false,
}

In this case, there is no way the it would be allowed so it would return false
immedietely without checking any of these other conditions

Example 2: multiple trues

result = {
  "allow": false,
  "allow_if_admin": true,
  "allow_if_local": true,
  "allow_if_downstream": false,
  "allow_if_agent": false,
  "pass": false,
}

In this case, allow is false, so in order of ease to evaluate, we check if it is a local call, and if it is then we shortcircuit and return true.

allow=false || (allow_if_local=true && isLocal()=true) <<SHORT-CIRCUIT>> || (allow_if_admin && isAdmin()) || ...

@lumjjb
Copy link
Contributor

lumjjb commented Jun 28, 2021

@azdagron @evan2645 Above is a quick rundown on the design points for the OPA policy integration, would like to discuss this during the Tuesday community call if there's time.

@evan2645
Copy link
Member

@lumjjb thank you so much for this, let's definitely discuss tomorrow

@rturner3
Copy link
Collaborator

Option 4 looks good to me. It avoids unnecessary performance penalties during authorization policy evaluation while retaining some flexibility as to where the OPA engine is running (in-memory vs. server). The drawback of complicating the authorization decision to be OPA + SPIRE vs. just OPA itself seems acceptable in this case given the available alternatives.

@lumjjb
Copy link
Contributor

lumjjb commented Jul 9, 2021

I am going to NOT include the old registration API to use this since the scope has now increased to handle the other API calls as well, so it doesn't quite match up with the old APIs. Unless there's an exception, I will assume this is the case. Cleaning up and writing tests and stuff, PR should be ready soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants