Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding API section to Windows KEP #732

Merged
merged 14 commits into from
Jan 28, 2019

Conversation

PatrickLang
Copy link
Contributor

@PatrickLang PatrickLang commented Jan 25, 2019

This PR adds the API details that @bgrant0607 requested. I'm currently double checking the pod & container objects to see if any other notes are needed and will remove the hold by EOD

Update history

  • 25 Jan 12:13 PST - first draft
  • 25 Jan 4:44 PST - draft complete, waiting for review

/sig windows

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. sig/windows Categorizes an issue or PR as relevant to SIG Windows. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 25, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/pm size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 25, 2019
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 26, 2019
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 26, 2019
@PatrickLang
Copy link
Contributor Author

/assign bgrant0607

I merged the API support list into the KEP per your feedback. I just finished double checking Pod/Container as well and clarified some more corner cases.

@PatrickLang
Copy link
Contributor Author

/hold remove

@bgrant0607
Copy link
Member

In the future, please squash commits before asking for a review, unless there are commits that should be reviewed separately but need to be merged together. Then add new commits in response to review feedback, and squash again after lgtm.

@bgrant0607
Copy link
Member

I'm looking at this, but reviewed #685, #729, kubernetes/kubernetes#73204, and kubernetes/community#3140 first

Copy link
Member

@bgrant0607 bgrant0607 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this level of detail. It helps.

Are there any differences in the way status is reported, such as exitCode or signal field values?


- Resource management and process isolation - Linux cgroups are used as a pod boundary for resource controls. Containers are created within that boundary for network, process and filesystem isolation. The cgroups APIs can be used to gather cpu/io/memory stats. Windows uses a Job object per container with a system namespace filter to contain all processes in a container and provide logical isolation from the host.
- There is no way to run a Windows container without the namespace filtering in place. This means that system privileges cannot be asserted in the context of the host, and privileged containers are not available on Windows. Containers cannot assume an identity from the host because the SAM is separate.
- Filesystems - Windows has a layered filesystem driver to mount container layers and create a copy filesystem based on NTFS. All file paths in the container are resolved only within the context of that container. T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing T

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailing T fixed, will push shortly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding exit code & signal notes too.

`V1.Container.ResourceRequirements.limits.cpu`
`V1.Container.ResourceRequirements.limits.memory`

Windows schedules CPU based on CPU count & percentage of cores. We need this represented because it can help optimize app performance. CPU count is immutable once set but you can change % of core allocations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So how is limits.cpu mapped to these 2 values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added link to code & docs, will be in next commit


`V1.Pod.volumeDevices` - this is an alpha feature, and is not implemented on Windows. Windows cannot attach raw block devices to pods.

`V1.Pod.Volumes` - EmptyDir, Secret, ConfigMap, HostPath - all work and have tests in TestGrid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does medium = Memory work for emptyDir?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, added a note on that. Will be in next commit.

`V1.Container.SecurityContext.privileged` - Windows doesn't support privileged containers
`V1.Container.SecurityContext.readOnlyRootFilesystem` - not possible on Windows, write access is required for registry & system processes to run inside the container
`V1.Container.SecurityContext.runAsGroup` - not possible on Windows, no GID support
`V1.Container.SecurityContext.runAsUser` - not possible on Windows, no UID support as int. This needs to change to IntStr, see [64009](https://github.com/kubernetes/kubernetes/pull/64009), to support Windows users as strings, or another field is needed.
Copy link
Member

@ddebroy ddebroy Jan 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the SecurityContext.run_as_username (introduced in kubernetes/kubernetes#64009) mainly a placeholder for now? Should it be plumbed through to OCI spec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's implemented in CRI & OCI, but needs to be hooked up in kubelet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened issue kubernetes/kubernetes#73387 to track finishing this work

Adding references to signals
Adding millicore to share scaling
Adding EmptyDirVolumeSource.medium node
Adding link to run_as_username bug
@PatrickLang
Copy link
Contributor Author

I just added 1 commit with updates based on several feedbacks

@michmike
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2019
@PatrickLang
Copy link
Contributor Author

Added 1 commit from @michmike so a separate PR isn't needed.

@PatrickLang
Copy link
Contributor Author

/hold remove

@michmike
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2019
@michmike
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: michmike, PatrickLang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@PatrickLang
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 28, 2019
@k8s-ci-robot k8s-ci-robot merged commit d941779 into kubernetes:master Jan 28, 2019
@@ -280,13 +281,20 @@ At a high level, these OS concepts are different:
- Identity - Linux uses userID (UID) and groupID (GID) which are represented as integer types. User and group names are not canonical - they are just an alias in /etc/groups or /etc/passwd back to UID+GID. Windows uses a larger binary security identifier (SID) which is stored in the Windows Security Access Manager (SAM) database. This database is not shared between the host and containers, or between containers.
- File permissions - Windows uses an access control list based on SIDs, rather than a bitmask of permissions and UID+GID
- File paths - convention on Windows is to use `\` instead of `/`. The Go IO libraries typically accept both and just make it work, but when you're setting a path or commandline that's interpreted inside a container, `\` may be needed.
- Signals - Windows interactive apps handle termination differently, and can implement one or more of these:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the container runtime (docker, containerd) send in order to gracefully terminate the application?

I found this issue:
moby/moby#25982

Does Kubernetes terminationGracePeriodSeconds work as expected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will write a test case to validate how terminationGracePeriodSeconds works.

@@ -313,7 +323,8 @@ Requests are subtracted from node available resources, so they can be used to av
`V1.Container.SecurityContext.privileged` - Windows doesn't support privileged containers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look at the markdown view. These lines need linebreaks or bullets

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i fixed them in the next PR Brian. thanks!

- beta.kubernetes.io/os = [windows|linux]
- beta.kubernetes.io/arch = [amd64|arm64|...]

If a deployment does not specify a nodeSelector like `"beta.kubernetes.io/os": windows`, it is possible the Pods can be scheduled on any host, Windows or Linux. This can be problematic since a Windows container can only run on Windows and a Linux container can only run on Linux. The best practice we will recommend is to use a nodeSelector.

However, we understand that in certain cases customers have a pre-existing large number of deployments for Linux containers. Since they will not want to change all deployments to add nodeSelectors, the alternative is to use Taints. Because the kubelet can set Taints during registration, it could easily be modified to automatically add a taint when running on Windows only (`“--register-with-taints=’os=Win1809:NoSchedule’” `). By adding a taint to all Windows nodes, nothing will be scheduled on them (that includes existing Linux Pods). In order for a Windows Pod to be scheduled on a Windows node, it would need both the nodeSelector to choose Windows, and a toleration.
However, we understand that in many cases customers have a pre-existing large number of deployments for Linux containers, as well as an ecosystem of off-the-shelf configurations, such as community Helm charts, and programmatic pod generation cases, such as with Operators. Customers will be hesitant to make the configuration change to add nodeSelectors. Our proposal as an alternative is to use Taints. Because the kubelet can set Taints during registration, it could easily be modified to automatically add a taint when running on Windows only (`“--register-with-taints=’os=Win1809:NoSchedule’” `). By adding a taint to all Windows nodes, nothing will be scheduled on them (that includes existing Linux Pods). In order for a Windows Pod to be scheduled on a Windows node, it would need both the nodeSelector to choose Windows, and a toleration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's stated that windows server 2019 will be the only supported version, could you change the example (Win1809) to reflect that?

Also, it's not mentioned at all in this proposal how compatibility between windows container with different OS versions (e.g., 1803 vs 2019) will be handled. This will affect how the taints should named (i.e., including OS "version" or not).

If this is something for the post-GA or the "What will never work" section, could you add it there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1809 is the same as Windows Server 2019. i used the 1809 designation since in the future we will support more releases like 1903, etc
Compatibility across different versions of container/OS is not our problem to solve. that comes directly from msft. i will mention this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean windows pods have to target the exact version (1809, 1903, etc), instead of just windows server 2019?

I know it's beyond sig-windows' responsibility to guarantee compatibility, but it'd be nice to have some clarification (from Microsoft) and document this clearly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yujuhong yes that is correct. Windows PODs have to target a container host with the same exact version. Taints and tolerations have to be OS specific. In our case, and we documented that in the PR #743, you need to create an OS specific taint. For v1.14 only 1809 (windows server 2019) is supported, both as a container host and as the OS for a container.

@@ -153,8 +161,9 @@ tolerations:
## Graduation Criteria
- All features and functionality under `What works today` is fully tested and vetted to be working by SIG-Windows
- SIG-Windows has high confidence to the stability and reliability of Windows Server containers on Kubernetes
- 100% green/passing conformance tests that are applicable to Windows (see the Testing Plan section for details on these tests)
- Comprehensive documentation that includes but is not limited to the following sections. Documentation will reside at https://kubernetes.io/docs
- 100% green/passing conformance tests that are applicable to Windows (see the Testing Plan section for details on these tests). These tests are adequate, non flaky, and continuously run. The test results are publicly accessible, enabled as part of the release-blocking suite
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 100% necessary? I still see some rare flakes in the conformance dashboards on testgrid. Just wanted to make sure that we set a reasonable target.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the key thing is that the tests themselves are not flaky. we can have tests that fail due to infra reasons or other infrequent situations, but the tests themselves should be solid



Additional dashboard pages will be added over time as we run the same test cases with additional CRI, CNI and cloud providers. They reflect work that may be stabilized in v1.15 or later and is not strictly required for v1.14.
Additional dashboard pages will be added over time as we run the same test cases with additional CRI, CNI and cloud providers. They are running the same test cases, and are not required for v1.14 graduation to stable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a section describing the primary test environment for GA? The list below is for "not required" tests, but there is no description of the main testing environment and setup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in latest PR #743


## Proposal

As of 29-11-2018 much of the work for enabling Windows nodes has already been completed. Both `kubelet` and `kube-proxy` have been adapted to work on Windows Server, and so the first goal of this KEP is largely already complete.

### What works today
- Windows-based containers can be created by kubelet, [provided the host OS version matches the container base image](https://docs.microsoft.com/en-us/virtualization/windowscontainers/deploy-containers/version-compatibility)
- Pod (single or multiple containers per Pod with process isolation), Deployment, ReplicaSet
- Pod (single or multiple containers per Pod with process isolation)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SIG Node would appreciate a section or documentation on what kubelet features work on window, since many resource management features are not part of conformance tests, and Node E2E suites cannot run on Windows yet.

As an example, we can start with node e2e tests, and/or scanning over existing kubelet config/flags.

(Just to be clear, this is purely for documentation, and whether a test passes or not should not block windows GA).

/cc @dchen1107

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will try to document all these. we added quite a bit in the PR for #743 already


- Windows Server 2019 on GCP - this is [in progress](https://k8s-testgrid.appspot.com/google-windows#windows-prototype), but not required for v1.14
- Windows Server 2019 on GCP - this is [in progress](https://k8s-testgrid.appspot.com/google-windows#windows-prototype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: the GCP tests are still using windows server 1803, but will switch to using 2019 (hopefully soon)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i commented in a PR i was tagged on this issue. yes, they must switch to 1809 soon :). thanks for tagging me.

@@ -203,6 +212,7 @@ The testing for Windows nodes will include multiple approaches:
All of the test cases will be maintained within the kubernetes/kubernetes repo. SIG-Windows specific tests for 2/3 will be in [test/e2e/windows](https://github.com/kubernetes/kubernetes/tree/master/test/e2e/windows)

Additional Windows test setup scripts, container image source code, and documentation will be kept in the [kubernetes-sigs/windows-testing](https://github.com/kubernetes-sigs/windows-testing) repo. One example is that the prow jobs need a list of repos to use for the test containers, and that will be maintained here - see [windows-testing#1](https://github.com/kubernetes-sigs/windows-testing/issues/1).
Building these containers for Windows requires a Windows build machine, which isn't part of the Kubernetes PR or official builds. If the SIG is given access to a suitable GCR.io account, images can be pushed there. Otherwise, we'll use continue pushing to Docker Hub.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ixdy having these test images on gcr.io would be great. Do you have any suggestions for doing this? Can we manually do this if automation is not an option?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the near-term we could copy these from Docker Hub to gcr.io/kubernetes-e2e-test-images.

Longer term this is probably something that the @kubernetes/k8s-infra-team would own.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ixdy can you please work offline with @PatrickLang on this?


`V1.Container.SecurityContext.seLinuxOptions` - not possible on Windows, no SELinux

`V1.Container.terminationMessagePath` - this has some limitations in that Windows doesn't support mapping single files. The default value is `/dev/termination-log`, which does work because it does not exist on Windows by default.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify a little bit on this? If mounting a single file is not possible, why would /dev/termination-log work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can mount /dev as a whole

`V1.podSecurityContext.runAsUser` provides a UID, not available on Windows
`V1.podSecurityContext.supplementalGroups` provides GID, not available on Windows

`V1.Pod.shareProcessNamespace` - this is an alpha feature, and depends on Linux cgroups which are not implemented on Windows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is beta already. This is a namespace feature, not sure why Linux cgroups is mentioned...

Can Windows containers share any namespace (not with the host, just among the containers)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, windows containers can share the namespace with some limitations that were outlined in the latest PR

Copy link
Contributor

@michmike michmike Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @PatrickLang on the question about cgroups

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we updated the docs on this in PR #743


`V1.Pod.shareProcessNamespace` - this is an alpha feature, and depends on Linux cgroups which are not implemented on Windows

`V1.Pod.volumeDevices` - this is an alpha feature, and is not implemented on Windows. Windows cannot attach raw block devices to pods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is also beta.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are correct. i updated the text

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants