Update DaemonSet guide

Rewrite the "How Daemon Pods are scheduled" section of the DaemonSet guide to align with the current state and be clearer.

Signed-off-by: Grigoris Thanasoulas <[email protected]>
kubernetes · Feb 15, 2023 · d0b3ba5
1 parent 237fdab
Showing 1 changed file with 52 additions and 43 deletions.
diff --git a/content/en/docs/concepts/workloads/controllers/daemonset.md b/content/en/docs/concepts/workloads/controllers/daemonset.md
@@ -105,30 +105,24 @@ If you do not specify either, then the DaemonSet controller will create Pods on
 
 ## How Daemon Pods are scheduled
 
-### Scheduled by default scheduler
-
-{{< feature-state for_k8s_version="1.17" state="stable" >}}
-
-A DaemonSet ensures that all eligible nodes run a copy of a Pod. Normally, the
-node that a Pod runs on is selected by the Kubernetes scheduler. However,
-DaemonSet pods are created and scheduled by the DaemonSet controller instead.
-That introduces the following issues:
-
-* Inconsistent Pod behavior: Normal Pods waiting to be scheduled are created
-  and in `Pending` state, but DaemonSet pods are not created in `Pending`
-  state. This is confusing to the user.
-* [Pod preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/)
-  is handled by default scheduler. When preemption is enabled, the DaemonSet controller
-  will make scheduling decisions without considering pod priority and preemption.
-
-`ScheduleDaemonSetPods` allows you to schedule DaemonSets using the default
-scheduler instead of the DaemonSet controller, by adding the `NodeAffinity` term
-to the DaemonSet pods, instead of the `.spec.nodeName` term. The default
-scheduler is then used to bind the pod to the target host. If node affinity of
-the DaemonSet pod already exists, it is replaced (the original node affinity was
-taken into account before selecting the target host). The DaemonSet controller only
-performs these operations when creating or modifying DaemonSet pods, and no
-changes are made to the `spec.template` of the DaemonSet.
+A DaemonSet ensures that all eligible nodes run a copy of a Pod. The DaemonSet
+controller creates a Pod for each eligible node and sets the
+`.spec.affinity.nodeAffinity` field of the Pod to match the target host. After
+the Pod is created, the default scheduler typically takes over and then binds
+the Pod to the target host by setting the `.spec.nodeName` field. If the new
+Pod cannot fit on the node, the default scheduler may preempt (evict) some of
+the existing Pods based on the
+[priority](/docs/concepts/scheduling-eviction/pod-priority-preemption/#pod-priority)
+of the new Pod.
+
+The user can specify a different scheduler for the Pods of the DaemonSet by
+setting the `.spec.template.spec.schedulerName` field of the DaemonSet.
+
+The original node affinity specified at the
+`.spec.template.spec.affinity.nodeAffinity` field (if specified) is taken into
+consideration by the DaemonSet controller when evaluating the eligible nodes,
+but is replaced on the created Pod with the node affinity that matches the name
+of the eligible node.
 
 ```yaml
 nodeAffinity:
@@ -141,25 +135,40 @@ nodeAffinity:
         - target-host-name
 ```
 
-In addition, `node.kubernetes.io/unschedulable:NoSchedule` toleration is added
-automatically to DaemonSet Pods. The default scheduler ignores
-`unschedulable` Nodes when scheduling DaemonSet Pods.
-
-### Taints and Tolerations
-
-Although Daemon Pods respect
-[taints and tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/),
-the following tolerations are added to DaemonSet Pods automatically according to
-the related features.
-
-| Toleration Key                           | Effect     | Version | Description |
-| ---------------------------------------- | ---------- | ------- | ----------- |
-| `node.kubernetes.io/not-ready`           | NoExecute  | 1.13+   | DaemonSet pods will not be evicted when there are node problems such as a network partition. |
-| `node.kubernetes.io/unreachable`         | NoExecute  | 1.13+   | DaemonSet pods will not be evicted when there are node problems such as a network partition. |
-| `node.kubernetes.io/disk-pressure`       | NoSchedule | 1.8+    | DaemonSet pods tolerate disk-pressure attributes by default scheduler. |
-| `node.kubernetes.io/memory-pressure`     | NoSchedule | 1.8+    | DaemonSet pods tolerate memory-pressure attributes by default scheduler. |
-| `node.kubernetes.io/unschedulable`       | NoSchedule | 1.12+   | DaemonSet pods tolerate unschedulable attributes by default scheduler. |
-| `node.kubernetes.io/network-unavailable` | NoSchedule | 1.12+   | DaemonSet pods, who uses host network, tolerate network-unavailable attributes by default scheduler. |
+
+### Taints and tolerations
+
+The DaemonSet controller automatically adds a set of {{< glossary_tooltip
+text="tolerations" term_id="toleration" >}} to DaemonSet Pods:
+
+{{< table caption="Tolerations for DaemonSet pods" >}}
+
+| Toleration key                                                                                                        | Effect       | Details                                                                                                                                       |
+| --------------------------------------------------------------------------------------------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------------- |
+| [`node.kubernetes.io/not-ready`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-not-ready)             | `NoExecute`  | DaemonSet Pods can be scheduled onto nodes that are not healthy or ready to accept Pods. Any DaemonSet Pods running on such nodes will not be evicted. |
+| [`node.kubernetes.io/unreachable`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-unreachable)         | `NoExecute`  | DaemonSet Pods can be scheduled onto nodes that are unreachable from the node controller. Any DaemonSet Pods running on such nodes will not be evicted. |
+| [`node.kubernetes.io/disk-pressure`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-disk-pressure)     | `NoSchedule` | DaemonSet Pods can be scheduled onto nodes with disk pressure issues.                                                                         |
+| [`node.kubernetes.io/memory-pressure`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-memory-pressure) | `NoSchedule` | DaemonSet Pods can be scheduled onto nodes with memory pressure issues.                                                                        |
+| [`node.kubernetes.io/pid-pressure`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-pid-pressure) | `NoSchedule` | DaemonSet Pods can be scheduled onto nodes with process pressure issues.                                                                        |
+| [`node.kubernetes.io/unschedulable`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-unschedulable)   | `NoSchedule` | DaemonSet Pods can be scheduled onto nodes that are unschedulable.                                                                            |
+| [`node.kubernetes.io/network-unavailable`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-network-unavailable) | `NoSchedule` | **Only added for DaemonSet Pods that request host networking**, i.e., Pods that have `spec.hostNetwork: true`. Such DaemonSet Pods can be scheduled onto nodes with an unavailable network. |
+
+{{< /table >}}
+
+You can add your own tolerations to the Pods of a DaemonSet as well, by
+defining these in the Pod template of the DaemonSet.
+
+Because the DaemonSet controller sets the
+`node.kubernetes.io/unschedulable:NoSchedule` toleration automatically,
+Kubernetes can run DaemonSet Pods on nodes that are marked as _unschedulable_.
+
+If you use a DaemonSet to provide an important node-level function, such as
+[cluster networking](/docs/concepts/cluster-administration/networking/), it is
+helpful that Kubernetes places DaemonSet Pods on nodes before they are ready.
+For example, without that special toleration, you could end up in a deadlock
+situation where the node is not marked as ready because the network plugin is
+not running there, and at the same time the network plugin is not running on
+that node because the node is not yet ready.
 
 ## Communicating with Daemon Pods
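
Note (not part of the commit): the rewritten section mentions three fields that a user can set in the Pod template of a DaemonSet: `.spec.template.spec.schedulerName`, `.spec.template.spec.affinity.nodeAffinity`, and extra tolerations. The manifest below is a minimal sketch of how these might look together. The DaemonSet name, labels, scheduler name, taint key, and image are hypothetical placeholders chosen for illustration and do not come from the guide.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-node-agent            # hypothetical name
spec:
  selector:
    matchLabels:
      app: example-node-agent         # hypothetical label
  template:
    metadata:
      labels:
        app: example-node-agent
    spec:
      # Optional: hand the Pods to a scheduler other than the default one.
      # "my-custom-scheduler" is an assumed name; omit the field to use
      # the default scheduler.
      schedulerName: my-custom-scheduler
      affinity:
        # Considered by the DaemonSet controller when it evaluates which
        # nodes are eligible; on each created Pod it is replaced by a
        # node affinity that matches the name of the target node.
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      tolerations:
      # A user-defined toleration, in addition to the ones the DaemonSet
      # controller adds automatically. The taint key is hypothetical.
      - key: example.com/dedicated
        operator: Exists
        effect: NoSchedule
      containers:
      - name: agent
        image: registry.example/agent:1.0   # placeholder image
```

With a manifest along these lines, the Pods created for the DaemonSet would be expected to carry the custom toleration alongside the automatically added ones, and only Linux nodes would be treated as eligible.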