updating sysctls for 3.11 #10361

Merged 1 commit on Aug 6, 2018

107 changes: 63 additions & 44 deletions admin_guide/sysctls.adoc
@@ -14,18 +14,18 @@ toc::[]

Sysctl settings are exposed via Kubernetes, allowing users to modify certain
kernel parameters at runtime for namespaces within a container. Only sysctls
that are namespaced can be set independently on pods; if a sysctl is not
namespaced (called _node-level_), it cannot be set within {product-title}.
Moreover, only those sysctls considered _safe_ are whitelisted by default; other
_unsafe_ sysctls can be manually enabled on the node to be available to the
that are namespaced can be set independently on pods. If a sysctl is not
namespaced, called _node-level_, it cannot be set within {product-title}.
Moreover, only those sysctls considered _safe_ are whitelisted by default; you
can manually enable other _unsafe_ sysctls on the node to be available to the
user.

[[undersatnding-sysctls]]
== Understanding Sysctls
== Understanding sysctls

In Linux, the sysctl interface allows an administrator to modify kernel
parameters at runtime. Parameters are available via the *_/proc/sys/_* virtual
process file system. The parameters cover various subsystems such as:
process file system. The parameters cover various subsystems, such as:

- kernel (common prefix: *_kernel._*)
- networking (common prefix: *_net._*)
@@ -40,10 +40,10 @@ $ sudo sysctl -a
----
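
As a rough illustration of how dotted sysctl names relate to the */proc/sys/* tree described above, the following Python sketch maps a name to its file path. The helper names `sysctl_to_path` and `read_sysctl` are hypothetical and not part of the documented product, and the sketch assumes that each dot separates one path component, which does not hold for names that embed dots in an interface name.

```python
from pathlib import PurePosixPath

# Root of the sysctl virtual file system on Linux.
PROC_SYS = PurePosixPath("/proc/sys")


def sysctl_to_path(name: str) -> str:
    """Map a dotted sysctl name to its file under /proc/sys.

    Assumption: every dot separates one path component.
    """
    if not name or name.startswith(".") or name.endswith(".") or ".." in name:
        raise ValueError(f"malformed sysctl name: {name!r}")
    return str(PROC_SYS.joinpath(*name.split(".")))


def read_sysctl(name: str) -> str:
    """Read the current value of a sysctl (Linux only)."""
    with open(sysctl_to_path(name)) as handle:
        return handle.read().strip()
```

For example, `sysctl_to_path("net.ipv4.ip_local_port_range")` resolves to the file that `sysctl -a` reports under the same name.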

[[namespaced-vs-node-level-sysctls]]
== Namespaced Versus Node-Level Sysctls
== Namespaced versus node-level sysctls

A number of sysctls are _namespaced_ in today’s Linux kernels. This means that
they can be set independently for each pod on a node. Being namespaced is a
you can set them independently for each pod on a node. Being namespaced is a
requirement for sysctls to be accessible in a pod context within Kubernetes.

The following sysctls are known to be namespaced:
@@ -56,63 +56,64 @@ The following sysctls are known to be namespaced:

Sysctls that are not namespaced are called _node-level_ and must be set
manually by the cluster administrator, either by means of the underlying Linux
distribution of the nodes (e.g., via *_/etc/sysctls.conf_*) or using a DaemonSet
with privileged containers.
distribution of the nodes, such as by modifying the *_/etc/sysctls.conf_* file,
or by using a DaemonSet with privileged containers.

[NOTE]
====
Consider marking nodes with special sysctls as tainted. Only schedule pods onto
them that need those sysctl settings. Use the
link:http://kubernetes.io/docs/user-guide/kubectl/kubectl_taint/[Kubernetes _taints and toleration_ feature] to implement this.
xref:../admin_guide/scheduling/taints_tolerations.adoc#admin-guide-taints[taints
and toleration feature] to mark the nodes.
====

[[safe-vs-unsafe-sysclts]]
== Safe Versus Unsafe Sysctls
== Safe versus unsafe sysctls

Sysctls are grouped into _safe_ and _unsafe_ sysctls. In addition to proper
namespacing, a safe sysctl must be properly isolated between pods on the same
node. This means that setting a safe sysctl for one pod:
node. This means that if you set a sysctl as safe for one pod it must not:

- must not have any influence on any other pod on the node,
- must not allow to harm the node's health, and
- must not allow to gain CPU or memory resources outside of the resource limits of
a pod.
- Influence any other pod on the node
- Harm the node's health
- Gain CPU or memory resources outside of the resource limits of a pod

By far, most of the namespaced sysctls are not necessarily considered safe.

For {product-title} 3.3.1, the following sysctls are supported (whitelisted) in
the safe set:
Currently, {product-title} supports, or whitelists, the following sysctls
in the safe set:

- *_kernel.shm_rmid_forced_*
- *_net.ipv4.ip_local_port_range_*
Member

net.ipv4.tcp_syncookies is in the safe set as well

Contributor Author

@kalexand-rh, Jun 26, 2018

@ingvagabund, how long has net.ipv4.tcp_syncookies been in the safe set? (I can open up a separate PR to fix older versions, if applicable.)

Contributor Author

Here's a PR to make that change in older versions: #11232

- *_net.ipv4.tcp_syncookies_*

This list will be extended in future versions when the kubelet supports better
This list might be extended in future versions when the kubelet supports better
isolation mechanisms.

All safe sysctls are enabled by default. All unsafe sysctls are disabled by
default and must be allowed manually by the cluster administrator on a per-node
basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to
default, and the cluster administrator must manually enable them on a per-node
basis. Pods with disabled unsafe sysctls will be scheduled but will fail to
launch.
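
To make the safe and unsafe split concrete, here is a minimal Python sketch that partitions a set of requested sysctls using the three safe sysctls listed in this diff. The constant `SAFE_SYSCTLS` and the function `split_by_safety` are illustrative names only, and the safe set is copied from the text above rather than queried from a running kubelet.

```python
# Safe set as listed in the documentation text above (an assumption for
# any other release; the kubelet is the authority on a real cluster).
SAFE_SYSCTLS = frozenset({
    "kernel.shm_rmid_forced",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_syncookies",
})


def split_by_safety(requested: dict) -> tuple:
    """Split requested sysctls into (safe, unsafe) dictionaries."""
    safe = {k: v for k, v in requested.items() if k in SAFE_SYSCTLS}
    unsafe = {k: v for k, v in requested.items() if k not in SAFE_SYSCTLS}
    return safe, unsafe


if __name__ == "__main__":
    safe, unsafe = split_by_safety({
        "kernel.shm_rmid_forced": "0",
        "net.ipv4.route.min_pmtu": "552",
        "kernel.msgmax": "65536",
    })
    print("safe:", sorted(safe))
    print("needs per-node enablement:", sorted(unsafe))
```

Any name that ends up in the second dictionary would have to be enabled by the cluster administrator on the node before a pod that requests it can launch.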

[[enabling-unsafe-sysctls]]
== Enabling unsafe sysctls

The cluster administrator can allow certain unsafe sysctls for very special
situations such as high-performance or real-time application tuning.

If you want to use unsafe sysctls, cluster administrators must enable them
individually on nodes. They can enable only namespaced sysctls.

[WARNING]
====
Due to their nature of being unsafe, the use of unsafe sysctls is
at-your-own-risk and can lead to severe problems like wrong behavior of
containers, resource shortage, or complete breakage of a node.
====

[[enabling-unsafe-sysctls]]
== Enabling Unsafe Sysctls

With the warning above in mind, the cluster administrator can allow certain
unsafe sysctls for very special situations, e.g., high-performance or real-time
application tuning.

If you want to use unsafe sysctls, cluster administrators must enable them
individually on nodes. Only namespaced sysctls can be enabled this way.

. Specify the unsafe sysctls to use as the value of the `kubeletArguments`\ parameter in the appropriate xref:../admin_guide/manage_nodes.adoc#modifying-nodes[node configuration map]
file, as described in xref:../admin_guide/manage_nodes.adoc#configuring-node-resources[Configuring Node Resources]:
. Use the `*kubeletArguments*` field in the *_/etc/origin/node/node-config.yaml_*
file, as described in
xref:../admin_guide/manage_nodes.adoc#configuring-node-resources[Configuring Node Resources], to set the desired unsafe sysctls:
+
----
kubeletArguments:
@@ -134,31 +135,49 @@ ifdef::openshift-origin[]
endif::[]

[[setting-sysctls-for-a-pod]]
== Setting Sysctls for a Pod
== Setting sysctls for a pod
Contributor

@kalexand-rh This heading appears to be formatted incorrectly on the View page here. But, I think it is OK; the formatting looks off in all points in the repo history since the ifdefs were added right before the heading.

Contributor Author

Interesting. It looks ok in a local build, too. Thanks for checking it.


Sysctls are set on pods using the pod's `securityContext`. The `securityContext`
applies to all containers in the same pod.

The following example uses the pod `securityContext` to set a safe sysctl
`kernel.shm_rmid_forced` and two unsafe sysctls, `net.ipv4.route.min_pmtu` and
`kernel.msgmax`. There is no distinction between _safe_ and _unsafe_ sysctls in
the specification.

Sysctls are set on pods using annotations. They apply to all containers in the
same pod.
[WARNING]
====
To avoid destabilizing your operating system, modify sysctl parameters only
after you understand their effects.
====

Here is an example, with different annotations for safe and unsafe sysctls:
Modify the YAML file that defines the pod and add the `securityContext` spec, as
shown in the following example:

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
  annotations:
    security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
    security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.ipv4.route.min_pmtu
      value: "552"
    - name: kernel.msgmax
      value: "65536"
@php-coder, Jun 25, 2018

Why have the sysctl values been changed? I'm not against such a change, but I'd like to ensure that there are no mistakes.

Contributor Author

@php-coder, that's what the K8 doc used:

  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.ipv4.route.min_pmtu
      value: "552"
    - name: kernel.msgmax
      value: "65536"

Ok; perhaps our example with kernel.msgmax=1 2 3 was wrong.

...
----
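
The `securityContext` structure shown in the example can also be assembled programmatically. The following Python sketch builds the same fields as a dictionary and prints them as JSON; the function name `pod_with_sysctls` is hypothetical, and the container list is left out just as the example elides it, so the output is a fragment rather than a complete pod definition.

```python
import json


def pod_with_sysctls(pod_name: str, sysctls: dict) -> dict:
    """Build a partial pod manifest that sets sysctls via securityContext.

    Values must be strings, matching the quoted values in the example.
    """
    entries = []
    for name, value in sysctls.items():
        if not isinstance(value, str):
            raise TypeError(f"value for {name} must be a string, got {type(value).__name__}")
        entries.append({"name": name, "value": value})
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        # Containers are intentionally omitted in this sketch.
        "spec": {"securityContext": {"sysctls": entries}},
    }


if __name__ == "__main__":
    manifest = pod_with_sysctls("sysctl-example", {
        "kernel.shm_rmid_forced": "0",
        "net.ipv4.route.min_pmtu": "552",
        "kernel.msgmax": "65536",
    })
    print(json.dumps(manifest, indent=2))
```

Rejecting non-string values early is a design choice in this sketch: it catches an unquoted integer before the manifest is submitted.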

[NOTE]
====
A pod with the unsafe sysctls specified above will fail to launch on any node
that has not enabled those two unsafe sysctls explicitly. As with node-level
sysctls, use the
link:http://kubernetes.io/docs/user-guide/kubectl/kubectl_taint[taints and
that the admin has not explicitly enabled those two unsafe sysctls. As with
node-level sysctls, use the
xref:../admin_guide/scheduling/taints_tolerations.adoc#admin-guide-taints[taints and
toleration feature] or
xref:../admin_guide/manage_nodes.adoc#updating-labels-on-nodes[labels on nodes]
to schedule those pods onto the right nodes.