Merge pull request kubernetes#292 from tsandall/federated-placement-policy

Proposal: policy-based federated resource placement

nikhiljindal authored Apr 10, 2017. Commit a3b9c9f (parents 0ac727b + 19f1f6d) adds `federated-placement-policy.md` (371 additions, 0 deletions).

# Policy-based Federated Resource Placement

This document proposes a design for policy-based control over placement of
Federated resources.

Tickets:

- https://github.com/kubernetes/kubernetes/issues/39982

Authors:

- Torin Sandall (tsandall@github) and Tim Hinrichs.
- Based on discussions with Quinton Hoole (quinton-hoole@github) and Nikhil
Jindal (nikhiljindal@github).

## Background

Resource placement is a policy-rich problem affecting many deployments.
Placement may be based on company conventions, external regulation, pricing and
performance requirements, etc. Furthermore, placement policies evolve over time
and vary across organizations. As a result, it is difficult to anticipate the
policy requirements of all users.

A simple example of a placement policy is:

> Certain apps must be deployed on clusters in EU zones with sufficient PCI
> compliance.

The [Kubernetes Cluster
Federation](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation.md#policy-engine-and-migrationreplication-controllers)
design proposal includes a pluggable policy engine component that decides how
applications/resources are placed across federated clusters.

Currently, the placement decision can be controlled for Federated ReplicaSets
using the `federation.kubernetes.io/replica-set-preferences` annotation. In the
future, the [Cluster
Selector](https://github.com/kubernetes/kubernetes/issues/29887) annotation will
provide control over placement of other resources. The proposed design supports
policy-based control over both of these annotations (as well as others).

This proposal is based on a POC built using the Open Policy Agent project. [This
short video (7m)](https://www.youtube.com/watch?v=hRz13baBhfg) provides an
overview and demo of the POC.

## Design

The proposed design uses the [Open Policy Agent](http://www.openpolicyagent.org)
project (OPA) to realize the policy engine component from the Federation design
proposal. OPA is an open-source, general-purpose policy engine that includes a
declarative policy language and APIs to answer policy queries.

The proposed design allows administrators to author placement policies and have
them automatically enforced when resources are created or updated. The design
also covers support for automatic remediation of resource placement when policy
(or the relevant state of the world) changes.

In the proposed design, the policy engine (OPA) is deployed on top of Kubernetes
in the same cluster as the Federation Control Plane:

![Architecture](https://docs.google.com/drawings/d/1kL6cgyZyJ4eYNsqvic8r0kqPJxP9LzWVOykkXnTKafU/pub?w=807&h=407)

The proposed design is divided into the following sections:

1. Control over the initial placement decision (admission controller)
1. Remediation of resource placement (opa-kube-sync/remediator)
1. Replication of Kubernetes resources (opa-kube-sync/replicator)
1. Management and storage of policies (ConfigMap)

### 1. Initial Placement Decision

To provide policy-based control over the initial placement decision, we propose
a new admission controller that integrates with OPA:

When admitting requests, the admission controller executes an HTTP API call
against OPA. The API call passes the JSON representation of the resource in the
message body.
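The call can be sketched in Go as follows. This is a minimal illustration, not the admission controller's actual code: the helper names `policyQueryBody` and `newPolicyQuery` are hypothetical, the URL is the query path from the example below, and sending the static `token` as an HTTP `Bearer` credential is an assumption (OPA's token authentication mode accepts bearer tokens). The sketch only builds the request so that the `input` wrapping is easy to see; sending it with an `http.Client` and a timeout is left out.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// policyQueryBody wraps a resource under the "input" key, matching the
// example query body.
func policyQueryBody(resource map[string]interface{}) ([]byte, error) {
	return json.Marshal(map[string]interface{}{"input": resource})
}

// newPolicyQuery builds (but does not send) the POST that asks OPA for the
// desired annotations of a resource.
func newPolicyQuery(opaURL, token string, resource map[string]interface{}) (*http.Request, error) {
	body, err := policyQueryBody(resource)
	if err != nil {
		return nil, err // a serialization error, one of the failure modes under Error Handling
	}
	req, err := http.NewRequest(http.MethodPost, opaURL, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if token != "" {
		// Assumption: the static token is sent as a bearer token.
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	rs := map[string]interface{}{
		"apiVersion": "extensions/v1beta1",
		"kind":       "ReplicaSet",
		"metadata":   map[string]interface{}{"name": "nginx-eu", "namespace": "default"},
	}
	req, err := newPolicyQuery(
		"https://opa.federation.svc.cluster.local:8181/v1/data/io/k8s/federation/admission",
		"super-secret-token-value",
		rs,
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path) // POST /v1/data/io/k8s/federation/admission
}
```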

The response from OPA contains the desired value for the resource’s annotations
(defined in policy by the administrator). The admission controller updates the
annotations on the resource and admits the request:

![InitialPlacement](https://docs.google.com/drawings/d/1c9PBDwjJmdv_qVvPq0sQ8RVeZad91vAN1XT6K9Gz9k8/pub?w=812&h=288)

The admission controller updates the resource by **merging** the annotations in
the response with existing annotations on the resource. If there are overlapping
annotation keys, the admission controller replaces the existing value with the
value from the response.
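A minimal Go sketch of this merge rule, assuming annotations are handled as `map[string]string` (as in Kubernetes object metadata, where annotation values are strings). The function name `mergeAnnotations` is hypothetical.

```go
package main

import "fmt"

// mergeAnnotations returns the resource's annotations merged with the
// annotations from the policy engine response. On overlapping keys the
// response value wins. Neither input map is modified.
func mergeAnnotations(existing, fromPolicy map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(fromPolicy))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range fromPolicy {
		merged[k] = v // the policy value replaces any existing value for the same key
	}
	return merged
}

func main() {
	existing := map[string]string{
		"policy.federation.alpha.kubernetes.io/pci-compliance-level": "2",
		"federation.kubernetes.io/replica-set-preferences":           `{"clusters":{"gce-europe-west2":{"weight":1}}}`,
	}
	fromPolicy := map[string]string{
		"federation.kubernetes.io/replica-set-preferences": `{"clusters":{"gce-europe-west1":{"weight":1},"gce-europe-west2":{"weight":1}},"rebalance":true}`,
	}
	merged := mergeAnnotations(existing, fromPolicy)
	fmt.Println(merged["federation.kubernetes.io/replica-set-preferences"])
	// {"clusters":{"gce-europe-west1":{"weight":1},"gce-europe-west2":{"weight":1}},"rebalance":true}
}
```

Copying into a fresh map keeps the developer's original annotations available, for example if the controller later needs to report what was overridden.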

#### Example Policy Engine Query:

```http
POST /v1/data/io/k8s/federation/admission HTTP/1.1
Content-Type: application/json
```

```json
{
  "input": {
    "apiVersion": "extensions/v1beta1",
    "kind": "ReplicaSet",
    "metadata": {
      "annotations": {
        "policy.federation.alpha.kubernetes.io/eu-jurisdiction-required": "true",
        "policy.federation.alpha.kubernetes.io/pci-compliance-level": "2"
      },
      "creationTimestamp": "2017-01-23T16:25:14Z",
      "generation": 1,
      "labels": {
        "app": "nginx-eu"
      },
      "name": "nginx-eu",
      "namespace": "default",
      "resourceVersion": "364993",
      "selfLink": "/apis/extensions/v1beta1/namespaces/default/replicasets/nginx-eu",
      "uid": "84fab96d-e188-11e6-ac83-0a580a54020e"
    },
    "spec": {
      "replicas": 4,
      "selector": {...},
      "template": {...}
    }
  }
}
```

#### Example Policy Engine Response:

```http
HTTP/1.1 200 OK
Content-Type: application/json
```

```json
{
  "result": {
    "annotations": {
      "federation.kubernetes.io/replica-set-preferences": {
        "clusters": {
          "gce-europe-west1": {
            "weight": 1
          },
          "gce-europe-west2": {
            "weight": 1
          }
        },
        "rebalance": true
      }
    }
  }
}
```

> This example shows the policy engine returning the replica-set-preferences.
> The policy engine could similarly return a desired value for other annotations
> such as the Cluster Selector annotation.
#### Conflicts

A conflict arises if the developer and the policy define different values for an
annotation. In this case, the developer's intent is provided as a policy query
input and the policy author's intent is encoded in the policy itself. Since the
policy is the only place where both the developer and policy author intents are
known, the policy (or policy engine) should be responsible for resolving the
conflict.

There are a few options for handling conflicts. As a concrete example, this is
how a policy author could handle invalid clusters/conflicts:

```rego
package io.k8s.federation.admission

errors["requested replica-set-preferences includes invalid clusters"] {
  invalid_clusters = developer_clusters - policy_defined_clusters
  invalid_clusters != set()
}

annotations["federation.kubernetes.io/replica-set-preferences"] = value {
  value = developer_clusters & policy_defined_clusters
}

# Not shown here:
#
# policy_defined_clusters[...] { ... }
# developer_clusters[...] { ... }
```
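For readers less familiar with Rego, the same set logic can be written imperatively. The Go sketch below is illustrative only: it is not how OPA evaluates the policy, `checkClusters` is a hypothetical name, and cluster sets are assumed to be plain lists of cluster names (the name `gce-us-east1` is made up for the example).

```go
package main

import (
	"fmt"
	"sort"
)

// checkClusters mirrors the example policy: if any developer-requested
// cluster is outside the policy-defined set, an error is reported, and the
// allowed placement is the intersection of the two sets ("MAY be used").
func checkClusters(developer, policy []string) (errs []string, allowed []string) {
	permitted := make(map[string]bool, len(policy))
	for _, c := range policy {
		permitted[c] = true
	}
	seen := make(map[string]bool, len(developer))
	invalid := false
	for _, c := range developer {
		if seen[c] {
			continue // treat the developer's list as a set
		}
		seen[c] = true
		if permitted[c] {
			allowed = append(allowed, c) // developer_clusters & policy_defined_clusters
		} else {
			invalid = true // developer_clusters - policy_defined_clusters is non-empty
		}
	}
	if invalid {
		errs = append(errs, "requested replica-set-preferences includes invalid clusters")
	}
	sort.Strings(allowed) // deterministic output
	return errs, allowed
}

func main() {
	errs, allowed := checkClusters(
		[]string{"gce-europe-west1", "gce-us-east1"},
		[]string{"gce-europe-west1", "gce-europe-west2"},
	)
	fmt.Println(errs)    // [requested replica-set-preferences includes invalid clusters]
	fmt.Println(allowed) // [gce-europe-west1]
}
```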

The admission controller will execute a query against the
`io.k8s.federation.admission` package (`POST /v1/data/io/k8s/federation/admission`).
If the policy detects an invalid cluster, the `errors` key in the response will
contain a non-empty array, and the admission controller will deny the request.

```http
HTTP/1.1 200 OK
Content-Type: application/json
```

```json
{
  "result": {
    "errors": [
      "requested replica-set-preferences includes invalid clusters"
    ],
    "annotations": {
      "federation.kubernetes.io/replica-set-preferences": {
        ...
      }
    }
  }
}
```

This example shows how the policy could handle conflicts when the author's
intent is to define clusters that MAY be used. If the author's intent is to
define what clusters MUST be used, then the logic would not use intersection.
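On the admission controller side, acting on such a response could look roughly like the following Go sketch. It assumes the response shape shown above (`result.errors` and `result.annotations`); `decide` is a hypothetical name, and re-encoding object-valued annotations as compact JSON strings is an assumption, since annotation values must be strings and the proposal does not specify the encoding.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// policyResponse models the policy engine response. Annotation values are
// kept raw because the example returns structured JSON rather than strings.
type policyResponse struct {
	Result struct {
		Errors      []string                   `json:"errors"`
		Annotations map[string]json.RawMessage `json:"annotations"`
	} `json:"result"`
}

// decide returns the annotations to apply, or an error if the request must
// be denied because the policy reported errors.
func decide(body []byte) (map[string]string, error) {
	var resp policyResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decoding policy response: %v", err)
	}
	if len(resp.Result.Errors) > 0 {
		// Any reported error means the request is denied.
		return nil, fmt.Errorf("denied by placement policy: %s", strings.Join(resp.Result.Errors, "; "))
	}
	out := make(map[string]string, len(resp.Result.Annotations))
	for key, raw := range resp.Result.Annotations {
		var s string
		if err := json.Unmarshal(raw, &s); err == nil {
			out[key] = s // the value was already a JSON string
			continue
		}
		var buf bytes.Buffer
		if err := json.Compact(&buf, raw); err != nil {
			return nil, err
		}
		out[key] = buf.String() // structured value stored as compact JSON text
	}
	return out, nil
}

func main() {
	denied := []byte(`{"result":{"errors":["requested replica-set-preferences includes invalid clusters"],"annotations":{}}}`)
	if _, err := decide(denied); err != nil {
		fmt.Println(err)
	}
	ok := []byte(`{"result":{"annotations":{"federation.kubernetes.io/replica-set-preferences":{"rebalance": true}}}}`)
	ann, err := decide(ok)
	if err != nil {
		panic(err)
	}
	fmt.Println(ann["federation.kubernetes.io/replica-set-preferences"]) // {"rebalance":true}
}
```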

#### Configuration

The admission controller requires configuration for the OPA endpoint:

```json
{
  "EnforceSchedulingPolicy": {
    "url": "https://opa.federation.svc.cluster.local:8181/v1/data/io/k8s/federation/admission",
    "token": "super-secret-token-value"
  }
}
```

- `url` specifies the URL of the policy engine API to query. The query response
contains the annotations to apply to the resource.
- `token` specifies a static token to use for authentication when contacting the
policy engine. In the future, other authentication schemes may be supported.
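As an illustration of the expected file format, a Go sketch that parses this configuration might look like the following. The type and function names are hypothetical, and the only validation the sketch performs (a non-empty `url`) is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// enforceSchedulingPolicyConfig mirrors the "EnforceSchedulingPolicy" entry
// of the admission control config file.
type enforceSchedulingPolicyConfig struct {
	URL   string `json:"url"`
	Token string `json:"token"`
}

// parseConfig extracts the plugin's settings from the file contents.
func parseConfig(data []byte) (enforceSchedulingPolicyConfig, error) {
	var file struct {
		Plugin *enforceSchedulingPolicyConfig `json:"EnforceSchedulingPolicy"`
	}
	if err := json.Unmarshal(data, &file); err != nil {
		return enforceSchedulingPolicyConfig{}, err
	}
	if file.Plugin == nil || file.Plugin.URL == "" {
		return enforceSchedulingPolicyConfig{}, fmt.Errorf("EnforceSchedulingPolicy.url is required")
	}
	return *file.Plugin, nil
}

func main() {
	data := []byte(`{
  "EnforceSchedulingPolicy": {
    "url": "https://opa.federation.svc.cluster.local:8181/v1/data/io/k8s/federation/admission",
    "token": "super-secret-token-value"
  }
}`)
	cfg, err := parseConfig(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.URL)
}
```

Note that a strict JSON parser rejects typographic (curly) quotes, so the file must use plain ASCII double quotes.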

The configuration file is provided to the federation-apiserver with the
`--admission-control-config-file` command line argument.

The admission controller is enabled in the federation-apiserver by providing the
`--admission-control` command line argument. E.g.,
`--admission-control=AlwaysAdmit,EnforceSchedulingPolicy`.

The admission controller will be enabled by default.

#### Error Handling

The admission controller is designed to **fail closed** if policies have been
created.

Request handling may fail because of:

- Serialization errors
- Request timeouts or other network errors
- Authentication or authorization errors from the policy engine
- Other unexpected errors from the policy engine

In the event of request timeouts (or other network errors) or back-pressure
hints from the policy engine, the admission controller should retry after
applying a backoff. The admission controller should also create an event so that
developers can identify why their resources are not being scheduled.
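The retry behaviour can be sketched as a small exponential-backoff loop. This Go sketch assumes that retryable failures (timeouts, network errors, back-pressure hints) are wrapped around a sentinel `errRetryable`. Both that sentinel and `withBackoff` are hypothetical names, the doubling factor is an arbitrary choice, and event creation is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRetryable marks failures worth retrying (timeouts, network errors,
// back-pressure hints from the policy engine).
var errRetryable = errors.New("retryable policy engine error")

// withBackoff calls query up to maxAttempts times, doubling the delay after
// each retryable failure. Non-retryable errors are returned immediately.
func withBackoff(maxAttempts int, initial time.Duration, query func() error) error {
	delay := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = query()
		if err == nil || !errors.Is(err, errRetryable) {
			return err
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("policy engine unavailable after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := withBackoff(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("timeout contacting OPA: %w", errRetryable)
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>
}
```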

Policies are stored as ConfigMap resources in a well-known namespace. This
allows the admission controller to check if one or more policies exist. If one
or more policies exist, the admission controller will fail closed. Otherwise
the admission controller will **fail open**.
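The fail-open/fail-closed rule reduces to a check on whether any policy ConfigMaps exist. In this hypothetical Go sketch, the caller is assumed to have already counted the ConfigMaps in the well-known policy namespace (for example with a list call, not shown).

```go
package main

import "fmt"

// admitOnEngineFailure decides what to do when the policy engine cannot be
// consulted: fail closed (deny) if any policy ConfigMaps exist in the
// well-known namespace, otherwise fail open (admit).
func admitOnEngineFailure(policyConfigMaps int, engineErr error) (admit bool, reason string) {
	if policyConfigMaps > 0 {
		return false, fmt.Sprintf("policy engine unavailable and %d policies exist: %v", policyConfigMaps, engineErr)
	}
	return true, "no placement policies defined; admitting without policy check"
}

func main() {
	err := fmt.Errorf("connection refused")
	fmt.Println(admitOnEngineFailure(0, err)) // true no placement policies defined; admitting without policy check
	fmt.Println(admitOnEngineFailure(2, err)) // false policy engine unavailable and 2 policies exist: connection refused
}
```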

### 2. Remediation of Resource Placement

When policy changes or the environment in which resources are deployed changes
(e.g., a cluster’s PCI compliance rating is upgraded or downgraded), resources
might need to be moved to obey the placement policy. Sometimes administrators
may decide to remediate manually; other times they may want Kubernetes to
remediate automatically.

To automatically reschedule resources onto desired clusters, we introduce a
remediator component (**opa-kube-sync**) that is deployed as a sidecar with OPA.

![Remediation](https://docs.google.com/drawings/d/1ehuzwUXSpkOXzOUGyBW0_7jS8pKB4yRk_0YRb1X4zsY/pub?w=812&h=288)

The notifications sent to the remediator by OPA specify the new value for
annotations such as replica-set-preferences.

When the remediator component (in the sidecar) receives the notification it
sends a PATCH request to the federation-apiserver to update the affected
resource. This way, the actual rebalancing of ReplicaSets is still handled by
the [Rescheduling
Algorithm](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-replicasets.md)
in the Federated ReplicaSet controller.
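A hedged Go sketch of the PATCH body the remediator might send. The proposal does not specify the patch type; this assumes a JSON merge patch (`Content-Type: application/merge-patch+json`), which the Kubernetes API accepts, and `annotationPatch` is a hypothetical name. Because annotation values are strings, a structured value such as replica-set-preferences is encoded to a JSON string first.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// annotationPatch builds a JSON merge patch that sets a single annotation.
// Non-string values are JSON-encoded, since annotation values are strings.
func annotationPatch(key string, value interface{}) ([]byte, error) {
	str, ok := value.(string)
	if !ok {
		encoded, err := json.Marshal(value)
		if err != nil {
			return nil, err
		}
		str = string(encoded)
	}
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{key: str},
		},
	}
	return json.Marshal(patch)
}

func main() {
	prefs := map[string]interface{}{
		"clusters":  map[string]interface{}{"gce-europe-west1": map[string]int{"weight": 1}},
		"rebalance": true,
	}
	patch, err := annotationPatch("federation.kubernetes.io/replica-set-preferences", prefs)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

The body would then be sent with `PATCH` to the resource's path (for example, the `selfLink` in the query example above). A merge patch touches only the listed annotation, leaving the rest of the object to the Federated ReplicaSet controller.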

The remediator component must be deployed with a kubeconfig for the
federation-apiserver so that it can identify itself when sending the PATCH
requests. We can use the same mechanism that is used for the
federation-controller-manager (which also needs to identify itself when sending
requests to the federation-apiserver).

### 3. Replication of Kubernetes Resources

Administrators must be able to author policies that refer to properties of
Kubernetes resources. For example, consider the following sample policy (in
English):

> Certain apps must be deployed on clusters in EU zones with sufficient PCI
> compliance.

The policy definition must refer to the geographic region and PCI compliance
rating of federated clusters. Today, the geographic region is stored as an
attribute on the cluster resource and the PCI compliance rating is an example of
data that may be included in a label or annotation.

When the policy engine is queried for a placement decision (e.g., by the
admission controller), it must have access to the data representing the
federated clusters.

To provide OPA with the data representing federated clusters as well as other
Kubernetes resource types (such as federated ReplicaSets), we use a sidecar
container that is deployed alongside OPA. The sidecar (“opa-kube-sync”) is
responsible for replicating Kubernetes resources into OPA:

![Replication](https://docs.google.com/drawings/d/1XjdgszYMDHD3hP_2ynEh_R51p7gZRoa1DBTi4yq1rc0/pub?w=812&h=288)

The sidecar/replicator component will implement the (somewhat common) list/watch
pattern against the federation-apiserver:

- Initially, it will GET all resources of a particular type.
- Subsequently, it will GET with the **watch** and **resourceVersion**
parameters set and process add, remove, update events accordingly.

Each resource received by the sidecar/replicator component will be pushed into
OPA. The sidecar will likely rely on one of the existing Kubernetes Go client
libraries to handle the low-level list/watch behavior.
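Pushing each event into OPA can be illustrated with OPA's Data API, which supports `PUT /v1/data/<path>` to create or overwrite a document and `DELETE /v1/data/<path>` to remove one. The Go sketch below only maps a simplified watch event to the corresponding request. The `/v1/data/kubernetes/<kind>/[<namespace>/]<name>` layout, the `watchEvent` type, and `toOPACall` are hypothetical, and the list/watch itself would come from a Kubernetes client library as described above.

```go
package main

import (
	"fmt"
	"net/http"
)

// watchEvent is a simplified Kubernetes watch event (type plus object identity).
type watchEvent struct {
	Type      string // "ADDED", "MODIFIED" or "DELETED"
	Kind      string // e.g. "clusters", "replicasets"
	Namespace string // empty for cluster-scoped resources such as Cluster
	Name      string
}

// opaCall describes the OPA Data API request that mirrors one event.
type opaCall struct {
	Method string
	Path   string
}

// toOPACall maps a watch event onto a PUT (create/overwrite) or DELETE of a
// document under a hypothetical /v1/data/kubernetes/... layout.
func toOPACall(ev watchEvent) (opaCall, error) {
	path := "/v1/data/kubernetes/" + ev.Kind + "/"
	if ev.Namespace != "" {
		path += ev.Namespace + "/"
	}
	path += ev.Name
	switch ev.Type {
	case "ADDED", "MODIFIED":
		return opaCall{http.MethodPut, path}, nil // request body: the object's JSON
	case "DELETED":
		return opaCall{http.MethodDelete, path}, nil
	default:
		return opaCall{}, fmt.Errorf("unhandled watch event type %q", ev.Type)
	}
}

func main() {
	call, err := toOPACall(watchEvent{Type: "MODIFIED", Kind: "clusters", Name: "gce-europe-west1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(call.Method, call.Path) // PUT /v1/data/kubernetes/clusters/gce-europe-west1
}
```

Keeping the per-kind path computation in one place is one way to make adding support for a new resource type cheap.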

As new resource types are introduced in the federation-apiserver, the
sidecar/replicator component will need to be updated to support them. As a
result, the sidecar/replicator component must be designed so that it is easy to
add support for new resource types.

Eventually, the sidecar/replicator component may allow admins to configure which
resource types are replicated. As an optimization, the sidecar may also analyze
policies to determine which resource properties are required for policy
evaluation. This would allow it to replicate the minimum amount of data into
OPA.

### 4. Policy Management

Policies are written in a text-based, declarative language supported by OPA. The
policies can be loaded into the policy engine either on startup or via HTTP
APIs.

To avoid introducing additional persistent state, we propose storing policies
in ConfigMap resources in the Federation Control Plane inside a well-known
namespace (e.g., `kube-federationscheduling-policy`). The ConfigMap resources
will be replicated into the policy engine by the sidecar.

The sidecar can establish a watch on the ConfigMap resources in the Federation
Control Plane. This will enable hot-reloading of policies whenever they change.
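Replicating a policy ConfigMap into OPA can be sketched with OPA's Policy API, which accepts `PUT /v1/policies/<id>` with the policy text as a `text/plain` body. In this hypothetical Go sketch, each key of the ConfigMap's `data` becomes one policy, and the `<configmap>/<key>` id scheme is an assumption. When a watch event reports that a ConfigMap was deleted, the sidecar would issue `DELETE /v1/policies/<id>` for the same ids.

```go
package main

import (
	"fmt"
	"sort"
)

// policyUpload is one request against OPA's Policy API:
// PUT /v1/policies/<id> with the raw policy text as a text/plain body.
type policyUpload struct {
	Path string
	Body string
}

// policyUploads converts the data of a policy ConfigMap into Policy API
// uploads, one per key. The "<configmap>/<key>" id scheme is hypothetical.
func policyUploads(configMapName string, data map[string]string) []policyUpload {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic upload order
	uploads := make([]policyUpload, 0, len(keys))
	for _, k := range keys {
		uploads = append(uploads, policyUpload{
			Path: "/v1/policies/" + configMapName + "/" + k,
			Body: data[k],
		})
	}
	return uploads
}

func main() {
	cm := map[string]string{
		"placement.rego": "package io.k8s.federation.admission\n",
	}
	for _, u := range policyUploads("placement-policy", cm) {
		fmt.Println("PUT", u.Path) // PUT /v1/policies/placement-policy/placement.rego
	}
}
```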

## Applicability to Other Policy Engines

This proposal was designed based on a POC with OPA, but it can be applied to
other policy engines as well. The admission and remediation components
comprise two main pieces of functionality: (i) applying annotation values to
federated resources and (ii) asking the policy engine for annotation values. The
code for applying annotation values is completely independent of the policy
engine. The code that asks the policy engine for annotation values is used in
both the admission and remediation components. In the POC, asking OPA for
annotation values amounts to a simple RESTful API call that any other policy
engine could implement.
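The split between the two pieces can be sketched as a small Go interface. `PlacementEngine`, `staticEngine`, and `applyPlacement` are hypothetical names: an OPA-backed implementation would perform the REST call, while any other engine would only need to satisfy the same method.

```go
package main

import "fmt"

// PlacementEngine is a hypothetical abstraction over "asking the policy
// engine for annotation values"; OPA would be one implementation.
type PlacementEngine interface {
	Annotations(resource map[string]interface{}) (map[string]string, error)
}

// staticEngine is a stand-in engine that always returns fixed annotations.
type staticEngine map[string]string

func (s staticEngine) Annotations(map[string]interface{}) (map[string]string, error) {
	return s, nil
}

// applyPlacement is the engine-independent half: it asks any engine for
// annotation values and writes them onto the resource's metadata.
func applyPlacement(engine PlacementEngine, resource map[string]interface{}) error {
	values, err := engine.Annotations(resource)
	if err != nil {
		return err
	}
	meta, _ := resource["metadata"].(map[string]interface{})
	if meta == nil {
		meta = map[string]interface{}{}
		resource["metadata"] = meta
	}
	ann, _ := meta["annotations"].(map[string]interface{})
	if ann == nil {
		ann = map[string]interface{}{}
		meta["annotations"] = ann
	}
	for k, v := range values {
		ann[k] = v // engine values override existing keys, as in admission
	}
	return nil
}

func main() {
	rs := map[string]interface{}{"kind": "ReplicaSet"}
	engine := staticEngine{"federation.kubernetes.io/replica-set-preferences": `{"rebalance":true}`}
	if err := applyPlacement(engine, rs); err != nil {
		panic(err)
	}
	fmt.Println(rs["metadata"]) // map[annotations:map[federation.kubernetes.io/replica-set-preferences:{"rebalance":true}]]
}
```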

## Future Work

- This proposal uses ConfigMaps to store and manage policies. In the future, we
want to introduce a first-class **Policy** API resource.
