diff --git a/keps/prod-readiness/sig-windows/1981.yaml b/keps/prod-readiness/sig-windows/1981.yaml
new file mode 100644
index 00000000000..bb4a2ab4fd0
--- /dev/null
+++ b/keps/prod-readiness/sig-windows/1981.yaml
@@ -0,0 +1,3 @@
+kep-number: 1981
+alpha:
+ approver: "@deads2k"
\ No newline at end of file
diff --git a/keps/sig-windows/1981-windows-privileged-container-support/README.md b/keps/sig-windows/1981-windows-privileged-container-support/README.md
index c29e0dd0e0c..b910443b36e 100644
--- a/keps/sig-windows/1981-windows-privileged-container-support/README.md
+++ b/keps/sig-windows/1981-windows-privileged-container-support/README.md
@@ -84,10 +84,10 @@ tags, and then generate with `hack/update-toc.sh`.
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Use case 1: Privileged Daemon Sets](#use-case-1-privileged-daemon-sets)
- - [Use case 2: Node plugin containers](#use-case-2-node-plugin-containers)
+ - [Use case 2: Administrative tasks](#use-case-2-administrative-tasks)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- - [PSP support](#psp-support)
+ - [Pod Security Implications](#pod-security-implications)
- [Design Details](#design-details)
- [Overview](#overview)
- [Networking](#networking)
@@ -98,11 +98,14 @@ tags, and then generate with `hack/update-toc.sh`.
- [Container Images](#container-images)
- [Container Image Build/Definition](#container-image-builddefinition)
- [CRI Implementation Details](#cri-implementation-details)
+ - [Kubernetes API updates](#kubernetes-api-updates)
+ - [WindowsSecurityContextOptions.HostProcess Flag](#windowssecuritycontextoptionshostprocess-flag)
+ - [Alternatives](#alternatives)
+ - [Host Network Mode](#host-network-mode)
+ - [Example deployment spec](#example-deployment-spec)
- [Kubelet Implementation Details](#kubelet-implementation-details)
- - [Privileged Flag](#privileged-flag)
- - [Host Network Mode](#host-network-mode)
- - [ ContainerD Support Only](#-containerd-support-only)
- - [ Feature Gates](#-feature-gates)
+ - [CRI Support Only](#cri-support-only)
+ - [Feature Gates](#feature-gates)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
@@ -116,7 +119,7 @@ tags, and then generate with `hack/update-toc.sh`.
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
-- [Alternatives](#alternatives)
+- [Alternatives](#alternatives-1)
- [Open Questions](#open-questions)
@@ -138,12 +141,12 @@ checklist items _must_ be updated for the enhancement to be released.
Items marked with (R) are required *prior to targeting to a milestone / release*.
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
-- [ ] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
-- [ ] (R) Graduation criteria is in place
-- [ ] (R) Production readiness review completed
+- [X] (R) Design details are appropriately documented
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [X] (R) Graduation criteria is in place
+- [X] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
@@ -195,16 +198,15 @@ The lack of privileged container support within the Windows container model has
Furthermore, since kube-proxy cannot be run as a privileged daemonset, it must either be run with a proxy or directly on the host as a service. In the case that it is run as a service, the admin kube config must be stored on the Windows node which poses a security concern. This is also true for networking daemons such as Flannel.
-
### Goals
-- To provide a method to build, launch, and run a Windows-based container with privileged access to host resources, including the host network service, devices, disks (including hostPath volumes), etc.
-- To enable access to host network resources for privileged containers and pods with host network mode
+- To provide a method to build, launch, and run a Windows-based container with privileged access to host resources, including the host network service, devices, disks (including hostPath volumes), etc.
+- To enable access to host network resources for **privileged** containers and pods with host network mode
### Non-Goals
@@ -212,10 +214,12 @@ know that this has succeeded?
What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->
-- To provide access to host network resources for non-privileged containers and pods
-- To provide a privileged mode for Hyper-V containers, or a method to run privileged process containers within a Hyper-V isolation boundary. This is a non-goal as running a Hyper-V container in the root namespace from within the isolation boundary is not supported.
-- To enable privileged containers for Docker. This will only be for Containerd
+- To provide access to host network resources for **non-privileged** containers and pods.
+- To provide a privileged mode for Hyper-V containers, or a method to run privileged process containers within a Hyper-V isolation boundary. This is a non-goal as running a Hyper-V container in the root namespace from within the isolation boundary is not supported.
+- To enable privileged containers for Docker. This will only be for containerd.
+- To align privileged containers with pod namespaces - this functionality may be addressed in a future KEP.
+- Enabling the ability to mix privileged and non-privileged containers in the same Pod. (Multiple privileged containers running in the same Pod will be supported.)
## Proposal
@@ -228,18 +232,18 @@ nitty-gritty.
-->
### Use case 1: Privileged Daemon Sets
-Privileged daemon sets are used to deploy networking (CNI), storage (CSI), and device plugins, kube-proxy, and other agents to Linux nodes. Currently, similar set-up and deployment operations utilize wins or proxies (i.e. CSI-proxy, HNS-Proxy) for Windows nodes. With Windows privileged containers, privileged daemon sets will also be available to deploy desired plugins and agents to Windows nodes. For networking scenarios, host network mode will enable these privileged deployments to access and configure host network resources.
-Enablement of both host network mode and privileged containers will allow users to configure network policies between pods and the hosts. Privileged containers will also enable service mesh support on Windows by enabling the creation of an init container that can configure HNS.
-
-### Use case 2: Node plugin containers
-Windows privileged containers would also enable deployment of single privileged containers to Windows nodes. Other functionalities that could be enabled with privileged containers in or out of daemon set deployment include device enumeration, monitoring add-ons, among others.
+Privileged daemon sets are used to deploy networking (CNI), storage (CSI), device plugins, kube-proxy, and other components to Linux nodes. Currently, similar set-up and deployment operations utilize wins or dedicated proxies (i.e. CSI-proxy, HNS-Proxy) or these components are installed as services running on Windows nodes. With Windows privileged containers many of these components could run inside containers increasing consistency between how they are deployed and/or managed on Linux and Windows. For networking scenarios, host network mode will enable these privileged deployments to access and configure host network resources.
Some interesting scenario examples:
+
- Cluster API
- CSI Proxy
- Logging Daemons
+### Use case 2: Administrative tasks
+
+Windows privileged containers would also enable a wide variety of administrative tasks without requiring cluster operations to log onto each Windows nodes. Tasks like installing security patches, collecting specific event logs, etc could all be done via deployments of privileged containers.
### Notes/Constraints/Caveats (Optional)
@@ -249,10 +253,10 @@ What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->
+
- Host network mode support is only targeted for privileged containers and pods.
- Privileged pods can only consist of privileged containers. Standard Windows Server containers or other non-privileged containers will not be supported. This is because containers in a Kubernetes pod share an IP. For the privileged containers with host network mode, this container IP will be the host IP. As a result, a pod cannot consist of a privileged container with the host IP and an unprivileged Windows Server container(s) sharing a vNic on the host with a different IP, or vice versa.
-- We are currently investigating service mesh scenarios where privileged containers in a pod will need host networking access but run alongside non-privileged containers in a pod. This would require further changes and investigation.
-
+- We are currently investigating service mesh scenarios where privileged containers in a pod will need host networking access but run alongside non-privileged containers in a pod. This would require further investigation and is out of scope for this KEP.
### Risks and Mitigations
@@ -267,26 +271,33 @@ How will UX be reviewed, and by whom?
Consider including folks who also work outside the SIG or subproject.
-->
-Most of the fundamental changes to enable this feature for Windows containers is dependent on changes within hcsshim, which serves as the runtime (container creation and management) coordinator and shim layer for containerd. However:
+Most of the fundamental changes to enable this feature for Windows containers is dependent on changes within [hcsshim](https://github.com/microsoft/hcsshim), which serves as the runtime (container creation and management) coordinator and shim layer for containerd on Windows.
+
+However:
+
- Several upstream changes are required to support this feature in Kubernetes, including changes to containerd, OCI, CRI, and kubelet. The identified changes include (see CRI and Kubelet Implementation Details below for more details on changes):
- - Containerd: enabling host network mode for privileged containers and pods ([working prototype demo](https://drive.google.com/file/d/1WQO2NDX34Z1FPbW7jEymhcPMY4AZWQSE/view)). Prototype is done using containerd runtimehandler but this proposal is to use cri-api.
+ - Containerd: enabling host network mode for privileged containers and pods ([working prototype demo](https://drive.google.com/file/d/1WQO2NDX34Z1FPbW7jEymhcPMY4AZWQSE/view)). Prototype is done using containerd runtimehandler but this proposal is to use cri-api.
- OCI spec: https://github.com/opencontainers/runtime-spec
- - [TBD]
+ - Updates pending decisions made in this KEP regarding namings.
- CRI-api:
- - Adding a privileged field to the runtime spec and pass through to containers
- - Pass security context and flag of runtime spec to podsandbox spec (not currently supported, open issue: https://github.com/kubernetes/kubernetes/issues/92963)
- - Kubelet: Pass privileged flag and windows security context to runtime spec and other PSP changes (see below).
-- There are risks that changes at each of these levels may not be supported.
- - If containerd changes are not supported, host network mode will not be enabled.This would restrict the scenarios that privileged containers would enable, as CNI plugins, network policy, etc. rely on host network mode to enable access to host network resources.
- - If CRI changes to enable a privileged flag are not supported, there would be a less-ideal workaround via annotations in the pod container spec.
- - The CRI changes may make an annotation in the OCI spec until the OCI updates are included.
+ - Adding `WindowsPodSandboxConfig` and `WindowsSandboxSecurityContext` message
+ - Adding `host_process` flag to `WindowsContainerSecurityContext`
+ - Pass security context and flag of runtime spec to podsandbox spec (not currently supported, open issue: https://github.com/kubernetes/kubernetes/issues/92963)
+ - Kubelet: Pass host_process flag and windows security context options to runtime spec.
+- There are risks that changes at each of these levels may not be supported.
+ - If containerd changes are not supported, host network mode will not be enabled.This would restrict the scenarios that privileged containers would enable, as CNI plugins, network policy, etc. rely on host network mode to enable access to host network resources.
+ - If CRI changes to enable a privileged flag are not supported, there would be a less-ideal workaround via annotations in the pod container spec.
+ - The CRI changes may make an annotation in the OCI spec until the OCI updates are included.
-### PSP support
-Additionally, privileged containers may impact other pod security policies (PSPs) outside of allowPrivilegedEscalation. We will provide guidance similar to [Pod Security Standards](https://v1-18.docs.kubernetes.io/docs/concepts/security/pod-security-standards/) for Windows privileged containers. There is an [analysis for non-privileged containers](https://github.com/kubernetes/kubernetes/issues/64801#issuecomment-668897952) which can be augmented with the details below. There is an open question if PSP should be updated or the recommendation is to use an out of tree enforcement tool such as GateKeeper. The anticipated impacted PSPs include:
+### Pod Security Implications
+
+For alpha we will update [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) with information on the new `hostProcess` flag.
+
+Additionally, privileged containers may impact other pod security policies (PSPs) outside of allowPrivilegedEscalation. We will provide guidance similar to [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) for Windows privileged containers when graduating this feature out of alpha. There is an [analysis for non-privileged containers](https://github.com/kubernetes/kubernetes/issues/64801#issuecomment-668897952) which can be augmented with the details below.The anticipated impacted PSPs include:
- Use case
+ | Use case
|
Field name
|
@@ -302,11 +313,11 @@ Additionally, privileged containers may impact other pod security policies (PSPs
privileged
|
- yes
+ | no
|
- Required.
+ | Not applicable. Windows privileged containers will be controlled with a new `WindowsSecurityContextOptions.HostProcess` instead of the existing `privileged` field due to fundamental differences in their implementation on Windows.
|
- Alpha
+ | Alpha
|
@@ -316,7 +327,7 @@ Additionally, privileged containers may impact other pod security policies (PSPs
no
|
- No support for pid and ipc since jobs don’t provide isolation at that level.
+ | Windows does not have configurable PID/IPC namespaces (unlike Linux). Windows containers are always assigned their own process namespace. Job objects always run in the host's process namespace. These behaviors are not configurable. Future plans in this area include improvements to enable scheduling pods that can contain both normal and HostProcess/Job Object containers. These fields would not makes in this scenario because Windows cannot configure PID/IPC namespaces like in Linux.
|
N/A
|
@@ -352,7 +363,7 @@ Additionally, privileged containers may impact other pod security policies (PSPs
no
|
- Job objects have full access to write to the root file system. Current design does not have a way to control access to read only.
+ | Job objects have full access to write to the root file system. Current design does not have a way to control access to read only. Instead privileged/job object containers can be ran as users with limited/scoped files system access via RunAsUsername
|
N/A
|
@@ -443,7 +454,6 @@ Additionally, privileged containers may impact other pod security policies (PSPs
-
## Design Details
+
### Overview
+
Windows privileged containers will be implemented with [Job Objects](https://docs.microsoft.com/en-us/windows/win32/procthread/job-objects), a break from the previous container model using server silos. Job objects provide the ability to manage a group of processes as a group, and assign resource constraints to the processes in the job. Job objects have no process or file system isolation, enabling the privileged payload to view and edit the host file system with the correct permissions, among other host resources. The init process, and any processes it launches or that are explicitly launched by the user, are all assigned to the job object of that container. When the init process exits or is signaled to exit, all the processes in the job will be signaled to exit, the job handle will be closed and the storage will be unmounted.
+Because Windows privileged containers will work much differently than Linux privileged containers they will be referred to as **HostProcess** containers in kubernetes specs and user-facing documentation. Hopefully this will encourage users to seek documentation to better understand the capabilities and behaviors of these privileged containers.
+

#### Networking
- The container will be in the host’s network namespace (default network compartment) so it will have access to all the host’s network interfaces and have the host's IP as well.
+
#### Resource Limits
- Resource limits (disk, memory, cpu count) will be applied to the job and will be job wide. For example, with a limit of 10 MB is set for the job, if every process in the jobs memory allocations added up exceeds 10 MB this limit would be reached. This is the same behavior as other Windows container types. These limits would be specified the same way they are currently for whatever orchestrator/runtime is being used.
+- Disk resource tracking may work slightly differently for privileged Windows containers due to how these containers are bootstrapped. The extent of these differences are still being investigated but will be fully documented when understood. Resource usage will be trackable the differences would be in how resource usage is calculated.
+
#### Container Lifecycle
- The container's lifecycle will be managed by the container runtime just like other Windows container types.
-#### Container users
-- The privileged container should be able to run as any user that's available on the host or in the domain of the host machine. Password accounts are being investigated.
+
+#### Container users
+
+- The privileged container can run as any user that's available on the host or in the domain of the host machine. Password accounts are being investigated.
+Running privileged containers as non SYSTEM/admin accounts will be the primary way operators can restrict access to system resources (files, registry, named pipes, WMI, etc).
+More information on Windows resource access can be found at https://docs.microsoft.com/en-us/archive/msdn-magazine/2008/november/access-control-understanding-windows-file-and-registry-permissions.
+
#### Container Mounts
-- Directory mounts (i.e. directory C:\test mapped to C:\test in the privileged container) are still being investigated. Note that file system mapping in the current implementation would expose both the host filesystem and the container filesystem to the processes in the privileged container. We are still investigation how external mounts such as secrets or storage accounts would be added. The default visibility of all the host filesystem may present backwards compatibility challenges in the future if host path mounting is enabled in Windows. Currently the only method to restrict would be via the user in a way similar to [UID or GID](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/) in the PSP. Currently however, this host path mounting is unlikely to be enabled in the future as a restriction of Job Objects.
+
+- When privileged containers are started a new volume will be created on the host which will contain the contents of the container image. Containers will have a default working directory that points to this container volume. Containers will also have full access to the host file-system (unless restricted by filed-based ACLs and the run_as_username used to start the container.) Processes should use absolute paths when accessing files on the host and relative paths when accessing files brought in via the container image.
+Note: there will be no `chroot` equivalent.
+- An environment variable `$CONTAINER_SANDBOX_MOUNT_POINT` will be set to the absolute path where the container volume is mounted.
+- Volume mounts (including service account tokens) will be supported for privileged containers and will be mounted under the container volume. Programs running inside the container can either access volume mounts be using a relative path or by prefixing `$CONTAINER_SANDBOX_MOUNT_POINT` to their paths (example: use either `.\var\run\secrets\kubernetes.io\serviceaccount\` or `$CONTAINER_SANDBOX_MOUNT_POINT\var\run\secrets\kubernetes.io\serviceaccount\` to access service account tokens). These relative paths will be based on `Pod.containers.volumeMounts.mountPath`.
+- Client libraries such as https://pkg.go.dev/k8s.io/client-go/rest#InClusterConfig may be updated to prefix paths with `$CONTAINER_SANDBOX_MOUNT_POINT` if the environment variable is set for Windows so these libraries will work in `hostProcess`containers. This will be re-evaluated when transistioning from `alpha` to `beta` as we get more feedback.
+Note: it is not possible to feature-gate this behavior in client libraries and because of this the functionality should not be added to client libraries after privileged containers while this feature is in `alpha`.
+- Named Pipe mounts will **not** be supported. Instead named pipes should access via their path on the host (\\\\.\\pipe\\*). Unix domain sockets mounts **will** be supported.
+- All other volume types supported for normal containers on Windows will work with privileged containers.
+
#### Container Images
-- A slim base image may be required to satisfy hierarchy requirements from HCS. It was found that the graphdriver calls expect certain files to be in a windows image when un-tarring. These files were found to be simply several registry hive files in /windows/system32/config (which can be empty).
-- We are currently targeting to release a simple image alongside this feature including the above mentioned files, which can be used instead of traditional Windows base images of server core or otherwise. Server core and others that container the necessary files would also work as privileged container base images.
+
+- Privileged containers can be built on top of existing Windows base images (nanoserver, servercore, etc).
+- A slim base image may be introduced to act as a replacement for [scratch]. On windows graphdriver calls expect some files (mainly registry hives) and these would be included in the slim base image.
+- Privileged containers will not inherit the same [compatibility requirements](https://docs.microsoft.com/en-us/virtualization/windowscontainers/deploy-containers/version-compatibility) as process isolated containers from an OS perspective. Container runtimes like containerd may be able to use fields on `WindowsPodSandboxConfig` to skip OS version checks when pulling/starting these containers in the future. This will continue to be investigated as the feature matures.
+
#### Container Image Build/Definition
-- This is another ongoing area of investigation and open to feedback.
-
+
+- This is another ongoing area of investigation and open to feedback. Currently only a subset of dockerfile operations are supported (ADD, COPY, PATH, ENTRYPOINT, etc).
+
### CRI Implementation Details
-We will need to add a privileged field to the runtime spec. We can model this after the Linux pod security context and container security context that is a boolean that is set to “true” for privileged containers. References:
-- [Linux SandboxSecurityConfig](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1alpha2/api.proto#L293)
+
+We will need to add a `hostProcess` field to the runtime spec. We can model this after the Linux pod security context and container security context that is a boolean that is set to `true` for privileged containers. References:
+
+- [LinuxSandboxSecurityConfig](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1alpha2/api.proto#L293)
- [LinuxSandboxSecurityContext](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1alpha2/api.proto#L28)
- [LinuxContainerSecurityContext](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1alpha2/api.proto#L612)
-In windows, the same privileged field would need to be added in the windows pod security context (open issue) and the container security context will need to be augmented with the privileged flag.
+For Windows we are proposing the following updates to CRI-API:
-The new WindowsPodSandboxConfig:
-```
-message WindowsPodSandboxConfig {.
- WindowsSandboxSecurityContext security_context = 1;
+Add WindowsPodSandboxConfig (and it to PodSandboxConfig)
+
+```protobuf
+message WindowsPodSandboxConfig {
+ WindowsSandboxSecurityContext security_context = 1;
}
```
-A new WindowsSandboxSecurityContext:
-```
+Add WindowsSandboxSecurityContext:
+
+```protobuf
message WindowsSandboxSecurityContext {
- string run_as_user = 1;
- bool privileged = 2;
+ string run_as_username = 1;
+ string credential_spec = 2;
+ bool host_process = 3;
}
```
-The new field for the WindowsContainerSecurityContext:
-```
+Update WindowsContainerSecurityContext by adding host_process field:
+
+```protobuf
message WindowsContainerSecurityContext {
- string run_as_username = 1;
- string credential_spec = 2;
- bool privileged = 3;
+ string run_as_username = 1;
+ string credential_spec = 2;
+ bool host_process = 3;
}
```
-### Kubelet Implementation Details
-#### Privileged Flag
-There are no kubernetes API changes required. Kubelet can use the privileged flag from SecurityContext api and pass that to the new privileged flag in the CRI layer.
+*Note:* For alpha annotations on RunPodSandbox and CreateContainer CRI calls may be used until a version of containerd with Windows privileged container support is released.
-Add functionality to Kuberuntime_sandbox to split out the linux sandbox creation and add windows sandbox creation (WIP PR).
+### Kubernetes API updates
-https://github.com/kubernetes/kubernetes/blob/a9f1d72e1de6450b890a0c0e451725468f54f515/pkg/kubelet/kuberuntime/kuberuntime_sandbox.go#L136
+#### WindowsSecurityContextOptions.HostProcess Flag
-#### Host Network Mode
+A new `*bool` field named `hostProcess` will be added to [WindowsSecurityContextOptions](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#windowssecuritycontextoptions-v1-core).
-If using Windows privileged containers the Host Network Mode will also be enabled, as the pod will automatically get the host IP. The PodSpec has a hostNetwork boolean field:
+On Windows, all containers in a pod must be privileged. Because of this behavior and because `WindowsSecurityContextOptions` already exists on both `PodSecurityContext` and `Container.SecurityContext` Windows containers will use this new field instead re-using the existing `privileged` field which only exists on `SecurityContext`.
+Additionally, the existing `privileged` field does not clearly describe what capabilities the container has (see https://github.com/kubernetes/kubernetes/issues/44503).
+Documentation will be added to clearly describe what capabilities these new "hostProcess" containers have.
-https://github.com/kubernetes/kubernetes/blob/d20fd4088476ec39c5ae2151b8fffaf0f4834418/pkg/kubelet/container/helpers.go#L227-L229
+Current behavior applies `PodSecurityContext.WindowsSecurityContextOptions` settings to all `Container.SecurityContext.WindowsSecurityContextOptions` unless those settings are already specified on the container. To address this the following API validation will be added:
-https://github.com/kubernetes/kubernetes/blob/a9f1d72e1de6450b890a0c0e451725468f54f515/pkg/kubelet/kuberuntime/kuberuntime_sandbox.go#L98
+- If `PodSecurityContext.WindowsSecurityContextOptions.HostProcess = true` is set to true then no container in the pod sets `Container.SecurityContext.WindowsSecurityContextOptions.HostProcess = false`
+- If `PodSecurityContext.WindowsSecurityContextOptions.HostProcess` is not set then all containers in a pod must set `Container.SecurityContext.WindowsSecurityContextOptions.HostProcess = true`
+- If `PodSecurityContext.WindowsSecurityContextOptions.HostProcess = false` no containers may set `Container.SecurityContext.WindowsSecurityContextOptions.HostProcess = true`
+- `hostNetwork = true` must be set explicits if the pod contains all hostProcess containers (this value will not be inferred and/or defaulted)
-In the case of windows this field would need to be addressed. During the alpha stage the proposal for this field would be to add documentation that note the field is only applicable to Linux. In addition the check for hostnetwork mode will need to be updated to check the privileged flag on windows PodSpec. This will be implemented in the following way, with validation loosened in the future if required:
- - this pod must run on a windows host, and kubelets must reject it if not on windows hosts
- - all pods marked privileged on windows must have host network enabled, if not the pod does not validate
+Additionally kube-apiserver will disallow `hostProcess` containers to be scheduled if `--allow-privileged=false` is passed as an argument.
+https://github.com/kubernetes/kubernetes/blob/release-1.20/pkg/apis/core/validation/validation.go#L5767-L5771 for reference.
-In beta there is a possibility to enable the privileged container to be a part of a different network component. If this feature is enabled we will use the existing Pod HostNetwork field to enable/disable.
+##### Alternatives
-#### ContainerD Support Only
-There are no plans to update Docker to have support for Privileged containers due to requirements for support HCSv2. The additional fields for docker would be ignored. The default value should be false.
+Option 1: Re-use `SecurityContext.Privileged` field.
-Need to investigate if the choice of base image would cause errors when pulling with docker.
+Re-using the existing `SecurityContext.Privileged` field was considering and here are the pros/cons considered:
-#### Feature Gates
-Although there are no changes required to the kubernetes API, the ability to pass the the privileged flag to the CRI should be behind a feature gate in Kubelet:
+Pros
-https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#feature-stages
+- The field already exists and many policy tools already leverage it.
+Cons
-### Test Plan
+- Privileged containers on Windows will operate very differently than privileged containers on Linux. Having a new field should help avoid confusion around the differences between the two.
+- The privileged field does not have clear meaning for Linux containers today (see comments above).
+- `WindowsSecurityContextOptions.RunAsUserName` will the the primary way of restricting access to host/node resources (See [Container users](#container-users)). It is desirable that `RunAsUserName` and `HostProcess` fields live on the same property.
+- API validation to ensure all containers are either privileged or not will be difficult because there is no way of definitively knowing that a pod is intended for a Windows node.
+
+#### Host Network Mode
+
+Host Network mode for privileged Windows containers will always be enabled, as the pod will automatically get the host IP.
+
+Privileged Windows containers will be unable to align to pod namespaces due to limitations in the Windows OS. This functionality will likely be enabled in the future through a new KEP.
+
+Because of this we will require that `hostNetwork` is set to `true` when scheduling privileged pods. This will allow existing policy tools to detect and act on privileged Windows containers without any updates. In the future if/when functionality is added to support joining privileged containers to pod networks this validation will be revisited.
+
+#### Example deployment spec
+
+Here are two examples of valid specs each containing two privileged Windows containers:
+
+```yaml
+spec:
+ hostNetwork: true
+ securityContext:
+ windowsOptions:
+ hostProcess: true
+ containers:
+ - name: foo
+ image: image1:latest
+ - name: bar
+ image: image2:latest
+ nodeSelector:
+ "kubernetes.io/os": windows
+```
+
+```yaml
+spec:
+ hostNetwork: true
+ containers:
+ - name: foo
+ image: image1:latest
+ securityContext:
+ windowsOptions:
+ hostProcess: true
+ - name: bar
+ image: image2:latest
+ securityContext:
+ windowsOptions:
+ hostProcess: true
+ nodeSelector:
+ "kubernetes.io/os": windows
+```
+
+### Kubelet Implementation Details
+
+Kubelet will pass privileged flag from `WindowsSecurityContextOptions` to the appropriate CRI layer calls.
+
+*Note:* For alpha kubelet may add well-known annotations to CRI calls if privileged flags are set.
+
+Add functionality to Kuberuntime_sandbox to:
+- Split out the linux sandbox creation and add [windows sandbox creation](https://github.com/kubernetes/kubernetes/blob/a9f1d72e1de6450b890a0c0e451725468f54f515/pkg/kubelet/kuberuntime/kuberuntime_sandbox.go#L136)
+- Configure all privileged Windows pods to join the [host network](https://github.com/kubernetes/kubernetes/blob/a9f1d72e1de6450b890a0c0e451725468f54f515/pkg/kubelet/kuberuntime/kuberuntime_sandbox.go#L98)
+
+The following extra validation will be added to the kubelet for Windows. These checks will ensure privileged pods work correctly on Windows if these are not validated by apiserver.
+
+- Ensure all containers in a pod privileged, if any are.
+- Ensure `hostNetwork = true` is set if pod contains privileged containers.
+
+#### CRI Support Only
+
+There are no plans to update Docker and/or dockershim to have support for privileged containers due to requirements on HCSv2.
+Currently containerd is the only container runtime with a Windows implementation so containerd will be required.
+
+Validation will be added in the kubelet to fail to schedule a pod if the node is configured to use dockershim and the pod contains privileged Windows containers.
+
+#### Feature Gates
+
+Privileged container functionally on windows will be gated behind a new `WindowsHostProcessContainers` feature gate.
+
+https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#feature-stages
+
+### Test Plan
Alpha
-- Preliminary analysis/testing of known kubernetes scenarios:
- - [List]
-Beta
-- Testing and validation of key scenarios identified in the alpha analysis
-- Test grids
- - Validate running kubeproxy as a daemon set
- - [List]
+- Unit tests around validation logic for new API fields.
+- Add e2e test validating basic privileged container functionality (pod starts and run in a privileged context on the node)
+- Update Pod Security Standards doc to dissallow `hostProcess` containers in the baseline/default and restricted policies.
+Beta
+
+- Validate running kubeproxy as a daemon set
+- Validate CSI-proxy running as a daemon set
+- Validate running a CNI implementation as a daemon set
### Graduation Criteria
@@ -628,23 +749,27 @@ in back-to-back releases.
-->
Alpha
-- Version of containerd: v1.5
-- Version of Kubernetes: Target 1.20 or 1.21
-- Version of OS support: 1809/Windows 2019 LTSC and 2004
-- Alpha Feature Gate for passing privilege flag to CRI
+
+- Version of containerd: Target v1.5
+- Version of Kubernetes: Target 1.21
+- OS support: 1809/Windows 2019 LTSC and all future versions of Windows Server
+- Alpha Feature Gate for passing privileged flag **or** annotations to CRI calls.
Beta
-- Go through PSP Linux test (e2e: validation & conformance) and make them relevant for Windows (which apply, which dont and where we need to write new tests).
+
+- Go through PSP Linux test (e2e: validation & conformance) and make them relevant for Windows (which apply, which don't and where we need to write new tests).
- Provide guidance similar to Pod Security Standards for Windows privileged containers
-- Containerd: v1.5
-- Kubernetes Target 1.21 or 1.22
-- OS support: 1809/Windows 2019 LTSC and 2004
+- Containerd: v1.5
+- Kubernetes Target 1.22 or later
+- OS support: 1809/Windows 2019 LTSC and all future versions of Windows Server
- Beta Feature Gate for passing privilege flag to CRI
+- Extensive documentation around `HostProcess` containers on https://kubernetes.io/
+ - Includes clarification around disk limits mentioned in [Resource Limits](#resource-limits)
GA:
-- [need feedback]
-- Remove feature gate for passing privileged flag
+- Address any issues uncovered in alpha/beta
+- Remove feature gate for passing privileged flag
### Upgrade / Downgrade Strategy
@@ -660,9 +785,9 @@ enhancement:
cluster required to make on upgrade, in order to make use of the enhancement?
-->
-- Windows: This implementation requires no backports for OS components.
-- Kubernetes: [need feedback on]
-- Containerd: [need feedback]
+- Windows: This implementation requires no backports for OS components.
+- Kubernetes: No changes required outside of ensuring feature gates are set while feature is in development.
+- Containerd: Must run a version of containerd with privileged container support (targeting v1.5+).
### Version Skew Strategy
@@ -679,7 +804,7 @@ enhancement:
CRI or CNI may require updating that component before the kubelet.
-->
-N/A
+N/A
## Production Readiness Review Questionnaire
@@ -712,9 +837,9 @@ you need any help or guidance.
_This section must be completed when targeting alpha to a release._
* **How can this feature be enabled / disabled in a live cluster?**
- - [ ] Feature gate (also fill in values in `kep.yaml`)
- - Feature gate name:
- - Components depending on the feature gate:
+ - [x] Feature gate (also fill in values in `kep.yaml`)
+ - Feature gate name: WindowsPrivilegedContainers
+ - Components depending on the feature gate: Kubelet, kube-apiserver
- [ ] Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control
@@ -723,22 +848,19 @@ _This section must be completed when targeting alpha to a release._
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
* **Does enabling the feature change any default behavior?**
- Any change of default behavior may be surprising to users or break existing
- automations, so be extremely careful here.
+ No
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?**
- Also set `disable-supported` to `true` or `false` in `kep.yaml`.
- Describe the consequences on existing workloads (e.g., if this is a runtime
- feature, can it break the existing applications?).
+ This feature can be disabled.
+ If this feature flag is disabled in kube-apiserver than new pods which try to schedule `hostProcess` containers will be rejected by kube-apiserver.
+ If this flag is disabled in the kubelet then new `hostProcess` containers are will not be started and an appropriate event will be emitted.
* **What happens if we reenable the feature if it was previously rolled back?**
+Newly created privileged Windows containers will run as expected.
* **Are there any tests for feature enablement/disablement?**
- The e2e framework does not currently support enabling or disabling feature
- gates. However, unit tests in each component dealing with managing data, created
- with and without the feature, are necessary. At the very least, think about
- conversion tests if API types are being modified.
+No
### Rollout, Upgrade and Rollback Planning
@@ -809,7 +931,6 @@ _This section must be completed when targeting beta graduation to a release._
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-
### Scalability
_For alpha, this section is encouraged: reviewers should consider these questions
@@ -821,45 +942,26 @@ _For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field._
* **Will enabling / using this feature result in any new API calls?**
- Describe them, providing:
- - API call type (e.g. PATCH pods)
- - estimated throughput
- - originating component(s) (e.g. Kubelet, Feature-X-controller)
- focusing mostly on:
- - components listing and/or watching resources they didn't before
- - API calls that may be triggered by changes of some Kubernetes resources
- (e.g. update of object X triggers new updates of object Y)
- - periodic API calls to reconcile state (e.g. periodic fetching state,
- heartbeats, leader election, etc.)
+ No
* **Will enabling / using this feature result in introducing new API types?**
- Describe them, providing:
- - API type
- - Supported number of objects per cluster
- - Supported number of objects per namespace (for namespace-scoped objects)
+ No
* **Will enabling / using this feature result in any new calls to the cloud
provider?**
+ No
* **Will enabling / using this feature result in increasing size or count of
the existing API objects?**
- Describe them, providing:
- - API type(s):
- - Estimated increase in size: (e.g., new annotation of size 32B)
- - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+ A new field is being added so API object size will grow *slightly* larger.
* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs]?**
- Think about adding additional work or introducing new steps in between
- (e.g. need to do X to start a container), etc. Please describe the details.
+ No
* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**
- Things to keep in mind include: additional in-memory state, additional
- non-trivial computations, excessive access to disks (including increased log
- volume), significant amount of data sent and/or received over network, etc.
- This through this both in small and large cases, again with respect to the
- [supported limits].
+ No - Privileged containers will honor limits/reserves specified in the specs and will count against node quota just like unprivilged containers.
### Troubleshooting
@@ -915,9 +1017,9 @@ not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
-- Use Containerd Runtimehandlers and K8s RuntimeClasses - Runtimehandlers are using the prototype. Adding the ability to the CRI provides kubelet to have more control over the security context and and fields that it allows through giving additional checks (such as runasnonroot)
+- Use containerd Runtimehandlers and K8s RuntimeClasses - Runtimehandlers are using the prototype. Adding the ability to the CRI provides kubelet to have more control over the security context and and fields that it allows through giving additional checks (such as runasnonroot).
-- Use annotations on CRI to pass privileged flag to Containerd - Adding the field to the CRI spec allows for the existing CRI calls to work as is. The resulting code is cleaner and doesn’t rely on magic strings. There is currently a PR for adding the SecurityFields to the CRI API adding Sandbox level security support for window containers. The Runasusername will be required for privileged containers to make sure every container (including pause) runs as the correct user to limit access to the file system.
+- Use annotations on CRI to pass privileged flag to containerd - Adding the field to the CRI spec allows for the existing CRI calls to work as is. The resulting code is cleaner and doesn’t rely on magic strings. There is currently a PR for adding the SecurityFields to the CRI API adding Sandbox level security support for window containers. The Runasusername will be required for privileged containers to make sure every container (including pause) runs as the correct user to limit access to the file system.
## Open Questions
- What’s the future of plug-ins that will be impacted
diff --git a/keps/sig-windows/1981-windows-privileged-container-support/kep.yaml b/keps/sig-windows/1981-windows-privileged-container-support/kep.yaml
index 44aaef547db..4ec7f1eb6c2 100644
--- a/keps/sig-windows/1981-windows-privileged-container-support/kep.yaml
+++ b/keps/sig-windows/1981-windows-privileged-container-support/kep.yaml
@@ -1,16 +1,23 @@
title: Windows Privileged Container Support
kep-number: 1981
authors:
- - "@ambguo"
+ - "@ambguo"
+ - "@marosset"
owning-sig: sig-windows
participating-sigs:
- - sig-node
-status: provisional
+ - sig-auth
+ - sig-node
+ - sig-storage
+status: implementable
creation-date: 2020-10-09
reviewers:
- - "@marosset"
+ - "@jsturtevant"
+ - "@Random-Liu"
approvers:
- - "@marosset"
+ - "@liggitt"
+ - "@yujuhong"
+prr-approvers:
+ - "@deads2k"
see-also:
replaces:
@@ -21,20 +28,20 @@ stage: alpha
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
-latest-milestone: "v1.20"
+latest-milestone: "v1.21"
# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.21"
- beta: "v1.22"
- stable: "v1.23"
# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- - name:
+ - name: WindowsHostProcessContainers
components:
-
+ - kube-apiserver
+ - kubelet
+
disable-supported: true
# The following PRR answers are required at beta release