Skip to content

Commit

Permalink
Update DaemonSet guide
Browse files Browse the repository at this point in the history
Rewrite "How Daemon Pods are scheduled" section of the DaemonSet guide
to align with the current state and be more clear.

Signed-off-by: Grigoris Thanasoulas <[email protected]>
  • Loading branch information
Grigoris Thanasoulas committed Feb 15, 2023
1 parent 237fdab commit d0b3ba5
Showing 1 changed file with 52 additions and 43 deletions.
95 changes: 52 additions & 43 deletions content/en/docs/concepts/workloads/controllers/daemonset.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,30 +105,24 @@ If you do not specify either, then the DaemonSet controller will create Pods on

## How Daemon Pods are scheduled

### Scheduled by default scheduler

{{< feature-state for_k8s_version="1.17" state="stable" >}}

A DaemonSet ensures that all eligible nodes run a copy of a Pod. Normally, the
node that a Pod runs on is selected by the Kubernetes scheduler. However,
DaemonSet pods are created and scheduled by the DaemonSet controller instead.
That introduces the following issues:

* Inconsistent Pod behavior: Normal Pods waiting to be scheduled are created
and in `Pending` state, but DaemonSet pods are not created in `Pending`
state. This is confusing to the user.
* [Pod preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/)
is handled by default scheduler. When preemption is enabled, the DaemonSet controller
will make scheduling decisions without considering pod priority and preemption.

`ScheduleDaemonSetPods` allows you to schedule DaemonSets using the default
scheduler instead of the DaemonSet controller, by adding the `NodeAffinity` term
to the DaemonSet pods, instead of the `.spec.nodeName` term. The default
scheduler is then used to bind the pod to the target host. If node affinity of
the DaemonSet pod already exists, it is replaced (the original node affinity was
taken into account before selecting the target host). The DaemonSet controller only
performs these operations when creating or modifying DaemonSet pods, and no
changes are made to the `spec.template` of the DaemonSet.
A DaemonSet ensures that all eligible nodes run a copy of a Pod. The DaemonSet
controller creates a Pod for each eligible node and adds the
`spec.affinity.nodeAffinity` field of the Pod to match the target host. After
the Pod is created, the default scheduler typically takes over and then binds
the Pod to the target host by setting the `.spec.nodeName` field. If the new
Pod cannot fit on the node, the default scheduler may preempt (evict) some of
the existing Pods based on the
[priority](/docs/concepts/scheduling-eviction/pod-priority-preemption/#pod-priority)
of the new Pod.

The user can specify a different scheduler for the Pods of the DamonSet, by
setting the `.spec.template.spec.schedulerName` field of the DaemonSet.

The original node affinity specified at the
`.spec.template.spec.affinity.nodeAffinity` field (if specified) is taken into
consideration by the DaemonSet controller when evaluating the eligible nodes,
but is replaced on the created Pod with the node affinity that matches the name
of the eligible node.

```yaml
nodeAffinity:
Expand All @@ -141,25 +135,40 @@ nodeAffinity:
- target-host-name
```
In addition, `node.kubernetes.io/unschedulable:NoSchedule` toleration is added
automatically to DaemonSet Pods. The default scheduler ignores
`unschedulable` Nodes when scheduling DaemonSet Pods.

### Taints and Tolerations

Although Daemon Pods respect
[taints and tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/),
the following tolerations are added to DaemonSet Pods automatically according to
the related features.

| Toleration Key | Effect | Version | Description |
| ---------------------------------------- | ---------- | ------- | ----------- |
| `node.kubernetes.io/not-ready` | NoExecute | 1.13+ | DaemonSet pods will not be evicted when there are node problems such as a network partition. |
| `node.kubernetes.io/unreachable` | NoExecute | 1.13+ | DaemonSet pods will not be evicted when there are node problems such as a network partition. |
| `node.kubernetes.io/disk-pressure` | NoSchedule | 1.8+ | DaemonSet pods tolerate disk-pressure attributes by default scheduler. |
| `node.kubernetes.io/memory-pressure` | NoSchedule | 1.8+ | DaemonSet pods tolerate memory-pressure attributes by default scheduler. |
| `node.kubernetes.io/unschedulable` | NoSchedule | 1.12+ | DaemonSet pods tolerate unschedulable attributes by default scheduler. |
| `node.kubernetes.io/network-unavailable` | NoSchedule | 1.12+ | DaemonSet pods, who uses host network, tolerate network-unavailable attributes by default scheduler. |
### Taints and tolerations
The DaemonSet controller automatically adds a set of {{< glossary_tooltip
text="tolerations" term_id="toleration" >}} to DaemonSet Pods:
{{< table caption="Tolerations for DaemonSet pods" >}}
| Toleration key | Effect | Details |
| --------------------------------------------------------------------------------------------------------------------- | ------------ | --------------------------------------------------------------------------------------------------------------------------------------------- |
| [`node.kubernetes.io/not-ready`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-not-ready) | `NoExecute` | DaemonSet Pods can be scheduled onto nodes that are not healthy or ready to accept Pods. Any DaemonSet Pods running on such nodes will not be evicted. |
| [`node.kubernetes.io/unreachable`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-unreachable) | `NoExecute` | DaemonSet Pods can be scheduled onto nodes that are unreachable from the node controller. Any DaemonSet Pods running on such nodes will not be evicted. |
| [`node.kubernetes.io/disk-pressure`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-disk-pressure) | `NoSchedule` | DaemonSet Pods can be scheduled onto nodes with disk pressure issues. |
| [`node.kubernetes.io/memory-pressure`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-memory-pressure) | `NoSchedule` | DaemonSet Pods can be scheduled onto nodes with memory pressure issues. |
| [`node.kubernetes.io/pid-pressure`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-pid-pressure) | `NoSchedule` | DaemonSet Pods can be scheduled onto nodes with process pressure issues. |
| [`node.kubernetes.io/unschedulable`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-unschedulable) | `NoSchedule` | DaemonSet Pods can be scheduled onto nodes that are unschedulable. |
| [`node.kubernetes.io/network-unavailable`](/docs/reference/labels-annotations-taints/#node-kubernetes-io-network-unavailable) | `NoSchedule` | **Only added for DaemonSet Pods that request host networking**, i.e., Pods having `spec.hostNetwork: true`. Such DaemonSet Pods can be scheduled onto nodes with unavailable network.|

{{< /table >}}

You can add your own tolerations to the Pods of a Daemonset as well, by
defining these in the Pod template of the DaemonSet.

Because the DaemonSet controller sets the
`node.kubernetes.io/unschedulable:NoSchedule` toleration automatically,
Kubernetes can run DaemonSet Pods on nodes that are marked as _unschedulable_.

If you use a DaemonSet to provide an important node-level function, such as
[cluster networking](/docs/concepts/cluster-administration/networking/), it is
helpful that Kubernetes places DaemonSet Pods on nodes before they are ready.
For example, without that special toleration, you could end up in a deadlock
situation where the node is not marked as ready because the network plugin is
not running there, and at the same time the network plugin is not running on
that node because the node is not yet ready.

## Communicating with Daemon Pods

Expand Down

0 comments on commit d0b3ba5

Please sign in to comment.