updating sysctls for 3.11 #10361
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,18 +14,18 @@ toc::[] | |
|
||
Sysctl settings are exposed via Kubernetes, allowing users to modify certain | ||
kernel parameters at runtime for namespaces within a container. Only sysctls | ||
that are namespaced can be set independently on pods; if a sysctl is not | ||
namespaced (called _node-level_), it cannot be set within {product-title}. | ||
Moreover, only those sysctls considered _safe_ are whitelisted by default; other | ||
_unsafe_ sysctls can be manually enabled on the node to be available to the | ||
that are namespaced can be set independently on pods. If a sysctl is not | ||
namespaced, called _node-level_, it cannot be set within {product-title}. | ||
Moreover, only those sysctls considered _safe_ are whitelisted by default; you | ||
can manually enable other _unsafe_ sysctls on the node to be available to the | ||
user. | ||
|
||
[[undersatnding-sysctls]] | ||
== Understanding Sysctls | ||
== Understanding sysctls | ||
|
||
In Linux, the sysctl interface allows an administrator to modify kernel | ||
parameters at runtime. Parameters are available via the *_/proc/sys/_* virtual | ||
process file system. The parameters cover various subsystems such as: | ||
process file system. The parameters cover various subsystems, such as: | ||
|
||
- kernel (common prefix: *_kernel._*) | ||
- networking (common prefix: *_net._*) | ||
|
@@ -40,10 +40,10 @@ $ sudo sysctl -a | |
---- | ||
|
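As a minimal sketch of how the dotted sysctl names relate to the *_/proc/sys/_* file system mentioned above: each dot in a name corresponds to a directory separator under *_/proc/sys/_*. The helper names `sysctl_path` and `sysctl_read` are hypothetical and exist only for this illustration. The sketch ignores the edge case of names whose components themselves contain dots, such as some network interface names.

```shell
#!/bin/sh
# Hypothetical helper: convert a dotted sysctl name to its /proc/sys path.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Hypothetical helper: print the current value if the file is readable.
sysctl_read() {
    path=$(sysctl_path "$1")
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo "cannot read $path" >&2
        return 1
    fi
}

# Prints /proc/sys/net/ipv4/ip_local_port_range
sysctl_path net.ipv4.ip_local_port_range
```

On a Linux host, `sysctl_read kernel.shm_rmid_forced` should print the same value that `sysctl kernel.shm_rmid_forced` reports, without the name prefix.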
||
[[namespaced-vs-node-level-sysctls]] | ||
== Namespaced Versus Node-Level Sysctls | ||
== Namespaced versus node-level sysctls | ||
|
||
A number of sysctls are _namespaced_ in today’s Linux kernels. This means that | ||
they can be set independently for each pod on a node. Being namespaced is a | ||
you can set them independently for each pod on a node. Being namespaced is a | ||
requirement for sysctls to be accessible in a pod context within Kubernetes. | ||
|
||
The following sysctls are known to be namespaced: | ||
|
@@ -56,63 +56,64 @@ The following sysctls are known to be namespaced: | |
|
||
Sysctls that are not namespaced are called _node-level_ and must be set | ||
manually by the cluster administrator, either by means of the underlying Linux | ||
distribution of the nodes (e.g., via *_/etc/sysctls.conf_*) or using a DaemonSet | ||
with privileged containers. | ||
distribution of the nodes, such as by modifying the *_/etc/sysctls.conf_* file, | ||
or by using a DaemonSet with privileged containers. | ||
|
||
[NOTE] | ||
==== | ||
Consider marking nodes with special sysctls as tainted. Only schedule pods onto | ||
them that need those sysctl settings. Use the | ||
link:http://kubernetes.io/docs/user-guide/kubectl/kubectl_taint/[Kubernetes _taints and toleration_ feature] to implement this. | ||
xref:../admin_guide/scheduling/taints_tolerations.adoc#admin-guide-taints[taints | ||
and toleration feature] to mark the nodes. | ||
==== | ||
|
||
[[safe-vs-unsafe-sysclts]] | ||
== Safe Versus Unsafe Sysctls | ||
== Safe versus unsafe sysctls | ||
|
||
Sysctls are grouped into _safe_ and _unsafe_ sysctls. In addition to proper | ||
namespacing, a safe sysctl must be properly isolated between pods on the same | ||
node. This means that setting a safe sysctl for one pod: | ||
node. This means that if you set a sysctl as safe for one pod it must not: | ||
|
||
- must not have any influence on any other pod on the node, | ||
- must not allow to harm the node's health, and | ||
- must not allow to gain CPU or memory resources outside of the resource limits of | ||
a pod. | ||
- Influence any other pod on the node | ||
- Harm the node's health | ||
- Gain CPU or memory resources outside of the resource limits of a pod | ||
|
||
By far, most of the namespaced sysctls are not necessarily considered safe. | ||
|
||
For {product-title} 3.3.1, the following sysctls are supported (whitelisted) in | ||
the safe set: | ||
Currently, {product-title} supports, or whitelists, the following sysctls | ||
in the safe set: | ||
|
||
- *_kernel.shm_rmid_forced_* | ||
- *_net.ipv4.ip_local_port_range_* | ||
- *_net.ipv4.tcp_syncookies_* | ||
|
||
This list will be extended in future versions when the kubelet supports better | ||
This list might be extended in future versions when the kubelet supports better | ||
isolation mechanisms. | ||
|
||
All safe sysctls are enabled by default. All unsafe sysctls are disabled by | ||
default and must be allowed manually by the cluster administrator on a per-node | ||
basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to | ||
default, and the cluster administrator must manually enable them on a per-node | ||
basis. Pods with disabled unsafe sysctls will be scheduled but will fail to | ||
launch. | ||
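The safe and unsafe split described above can be sketched as a simple membership check. This is an illustration only: the function name `is_safe_sysctl` is hypothetical, and the hard-coded list simply copies the three sysctls named in this section, so it might not match the safe set of a given release.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for sysctls in the safe set listed above.
# The list is an assumption taken from this section, not read from the kubelet.
is_safe_sysctl() {
    case "$1" in
        kernel.shm_rmid_forced|net.ipv4.ip_local_port_range|net.ipv4.tcp_syncookies)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Classify the sysctls used in the pod example of this document.
for name in kernel.shm_rmid_forced net.ipv4.route.min_pmtu kernel.msgmax; do
    if is_safe_sysctl "$name"; then
        echo "$name: safe"
    else
        echo "$name: not in the safe set"
    fi
done
```

A check like this could be run against a pod definition before it is created, to predict whether the pod depends on sysctls that a cluster administrator must first enable on the node.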
|
||
[[enabling-unsafe-sysctls]] | ||
== Enabling unsafe sysctls | ||
|
||
The cluster administrator can allow certain unsafe sysctls for very special | ||
situations such as high-performance or real-time application tuning. | ||
|
||
If you want to use unsafe sysctls, cluster administrators must enable them | ||
individually on nodes. They can enable only namespaced sysctls. | ||
|
||
[WARNING] | ||
==== | ||
Due to their nature of being unsafe, the use of unsafe sysctls is | ||
at-your-own-risk and can lead to severe problems like wrong behavior of | ||
containers, resource shortage, or complete breakage of a node. | ||
==== | ||
|
||
[[enabling-unsafe-sysctls]] | ||
== Enabling Unsafe Sysctls | ||
|
||
With the warning above in mind, the cluster administrator can allow certain | ||
unsafe sysctls for very special situations, e.g., high-performance or real-time | ||
application tuning. | ||
|
||
If you want to use unsafe sysctls, cluster administrators must enable them | ||
individually on nodes. Only namespaced sysctls can be enabled this way. | ||
|
||
. Specify the unsafe sysctls to use as the value of the `kubeletArguments`\ parameter in the appropriate xref:../admin_guide/manage_nodes.adoc#modifying-nodes[node configuration map] | ||
file, as described in xref:../admin_guide/manage_nodes.adoc#configuring-node-resources[Configuring Node Resources]: | ||
. Use the `*kubeletArguments*` field in the *_/etc/origin/node/node-config.yaml_* | ||
file, as described in | ||
xref:../admin_guide/manage_nodes.adoc#configuring-node-resources[Configuring Node Resources], to set the desired unsafe sysctls: | ||
+ | ||
---- | ||
kubeletArguments: | ||
|
@@ -134,31 +135,49 @@ ifdef::openshift-origin[] | |
endif::[] | ||
|
||
[[setting-sysctls-for-a-pod]] | ||
== Setting Sysctls for a Pod | ||
== Setting sysctls for a pod | ||
@kalexand-rh This heading appears to be formatted incorrectly on the View page here. But, I think it is OK; the formatting looks off in all points in the repo history since the ifdefs were added right before the heading.
Interesting. It looks ok in a local build, too. Thanks for checking it.
||
|
||
Sysctls are set on pods using the pod's `securityContext`. The `securityContext` | ||
applies to all containers in the same pod. | ||
|
||
The following example uses the pod `securityContext` to set a safe sysctl | ||
`kernel.shm_rmid_forced` and two unsafe sysctls, `net.ipv4.route.min_pmtu` and | ||
`kernel.msgmax`. There is no distinction between _safe_ and _unsafe_ sysctls in | ||
the specification. | ||
|
||
Sysctls are set on pods using annotations. They apply to all containers in the | ||
same pod. | ||
[WARNING] | ||
==== | ||
To avoid destabilizing your operating system, modify sysctl parameters only | ||
after you understand their effects. | ||
==== | ||
|
||
Here is an example, with different annotations for safe and unsafe sysctls: | ||
Modify the YAML file that defines the pod and add the `securityContext` spec, as | ||
shown in the following example: | ||
|
||
[source,yaml] | ||
---- | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
name: sysctl-example | ||
annotations: | ||
security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1 | ||
security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3 | ||
spec: | ||
securityContext: | ||
sysctls: | ||
- name: kernel.shm_rmid_forced | ||
value: "0" | ||
- name: net.ipv4.route.min_pmtu | ||
value: "552" | ||
- name: kernel.msgmax | ||
value: "65536" | ||
Why the sysctl values have been changed? I'm not against such a change but I'd like to ensure that there is no mistakes.
@php-coder, that's what the K8 doc used:
Ok; perhaps our example with
||
... | ||
---- | ||
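The diff above replaces the comma-separated annotation format with a list of `name` and `value` entries under `securityContext.sysctls`. The following is a rough sketch of that mapping, assuming the annotation value is a comma-separated list of `name=value` pairs. The function name `annotation_to_sysctls` is hypothetical and is not part of any `oc` or `kubectl` tooling.

```shell
#!/bin/sh
# Hypothetical helper: turn an annotation-style list ("a=1,b=2") into
# securityContext.sysctls entries. Values are emitted as quoted strings,
# matching the quoting used in the pod example.
# The body runs in a subshell so the IFS and globbing changes do not leak.
annotation_to_sysctls() (
    IFS=,
    set -f   # disable globbing while splitting on commas
    for pair in $1; do
        name=${pair%%=*}    # text before the first "="
        value=${pair#*=}    # text after the first "="
        printf '%s\n' "- name: $name" "  value: \"$value\""
    done
)

annotation_to_sysctls 'net.ipv4.route.min_pmtu=552,kernel.msgmax=65536'
```

The output still has to be indented to sit under `sysctls:` in the pod definition.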
|
||
[NOTE] | ||
==== | ||
A pod with the unsafe sysctls specified above will fail to launch on any node | ||
that has not enabled those two unsafe sysctls explicitly. As with node-level | ||
sysctls, use the | ||
link:http://kubernetes.io/docs/user-guide/kubectl/kubectl_taint[taints and | ||
that the admin has not explicitly enabled those two unsafe sysctls. As with | ||
node-level sysctls, use the | ||
xref:../admin_guide/scheduling/taints_tolerations.adoc#admin-guide-taints[taints and | ||
toleration feature] or | ||
xref:../admin_guide/manage_nodes.adoc#updating-labels-on-nodes[labels on nodes] | ||
to schedule those pods onto the right nodes. | ||
|
`net.ipv4.tcp_syncookies` is in the safe set as well

@ingvagabund, how long has `net.ipv4.tcp_syncookies` been in the safe set? (I can open up a separate PR to fix older versions, if applicable.)

@ingvagabund ^

Since k8s 1.4: kubernetes/kubernetes#27180

https://github.com/kubernetes/kubernetes/pull/27180/files#diff-853d8d6fa2710e6a38f79cf40ba47f8bR44

Here's a PR to make that change in older versions: #11232