dont skip drain for unhealthy nodes #839

Merged
merged 11 commits into from
Sep 22, 2023
19 changes: 19 additions & 0 deletions docs/FAQ.md
@@ -37,6 +37,7 @@ this document. Few of the answers assume that the MCM being used is in conjuctio
* [What health checks are performed on a machine?](#what-health-checks-are-performed-on-a-machine)
* [How does rate limiting replacement of machine work in MCM ? How is it related to meltdown protection?](#how-does-rate-limiting-replacement-of-machine-work-in-mcm-how-is-it-related-to-meltdown-protection)
* [How MCM responds when scale-out/scale-in is done during rolling update of a machinedeployment?](#how-mcm-responds-when-scale-outscale-in-is-done-during-rolling-update-of-a-machinedeployment)
* [How some unhealthy machines are drained quickly?](#how-some-unhealthy-machines-are-drained-quickly-)

* [Troubleshooting](#troubleshooting)
* [My machine is stuck in deletion for 1 hr, why?](#My-machine-is-stuck-in-deletion-for-1-hr-why)
@@ -256,6 +257,24 @@ During update for scaling event, a machineSet is updated if any of the below is

Once scaling is achieved, rollout continues.

## How some unhealthy machines are drained quickly ?

If a node is unhealthy for more than the `machine-health-timeout` specified for the `machine-controller`, the controller
health-check moves the machine phase to `Failed`. By default, the `machine-health-timeout` is `10` minutes.

`Failed` machines have their deletion timestamp set and the machine then moves to the `Terminating` phase. The node
drain process is initiated. The drain process is invoked either *gracefully* or *forcefully*.

The usual drain process is graceful. Pods are evicted from the node and the drain process waits until any existing
attached volumes are mounted on the new node. However, if the node's `Ready` condition is `False` or its
`ReadonlyFilesystem` condition is `True` for more than `5` minutes, then a forceful drain is initiated. In a forceful
drain, pods are deleted and `VolumeAttachment` objects associated with the old node are also deleted. This is followed
by the deletion of the cloud provider VM associated with the `Machine`, and finally by the deletion of the `Node` object.

During the deletion of the VM, only the local data and boot disks associated with the VM are deleted. The disks
associated with persistent volumes are left untouched, as their attach/detach and mount/unmount processes are handled
by the Kubernetes attach-detach controller in conjunction with the CSI driver.

# Troubleshooting
### My machine is stuck in deletion for 1 hr, why?

2 changes: 1 addition & 1 deletion pkg/apis/machine/v1alpha1/machine_types.go
@@ -188,7 +188,7 @@ const (
// MachineOperationHealthCheck indicates that the operation was a create
MachineOperationHealthCheck MachineOperationType = "HealthCheck"

// MachineOperationDelete indicates that the operation was a create
// MachineOperationDelete indicates that the operation was a delete
MachineOperationDelete MachineOperationType = "Delete"
)

18 changes: 11 additions & 7 deletions pkg/util/provider/machinecontroller/machine.go
@@ -24,13 +24,6 @@ import (
"strings"
"time"

machineapi "github.com/gardener/machine-controller-manager/pkg/apis/machine"
"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
"github.com/gardener/machine-controller-manager/pkg/apis/machine/validation"
"github.com/gardener/machine-controller-manager/pkg/util/provider/driver"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/status"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machineutils"
corev1 "k8s.io/api/core/v1"
apierrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
@@ -39,6 +32,14 @@ import (
"k8s.io/apimachinery/pkg/util/sets"
"k8s.io/client-go/tools/cache"
"k8s.io/klog/v2"

machineapi "github.com/gardener/machine-controller-manager/pkg/apis/machine"
"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
"github.com/gardener/machine-controller-manager/pkg/apis/machine/validation"
"github.com/gardener/machine-controller-manager/pkg/util/provider/driver"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/status"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machineutils"
)

/*
@@ -592,6 +593,9 @@ func (c *controller) triggerDeletionFlow(ctx context.Context, deleteMachineReque
case strings.Contains(machine.Status.LastOperation.Description, machineutils.InitiateDrain):
return c.drainNode(ctx, deleteMachineRequest)

case strings.Contains(machine.Status.LastOperation.Description, machineutils.DelVolumesAttachments):
return c.deleteNodeVolAttachments(ctx, deleteMachineRequest)

case strings.Contains(machine.Status.LastOperation.Description, machineutils.InitiateVMDeletion):
return c.deleteVM(ctx, deleteMachineRequest)
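
The hunk above adds a stage between drain and VM deletion, selected by matching a marker inside the last operation's description. The pattern can be sketched in isolation as below. The marker strings and `nextStep` are placeholders invented for this sketch; the real markers are the `machineutils` constants and their values may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder stage markers for illustration only. The real values are
// defined in machineutils and are not reproduced here.
const (
	initiateDrain         = "stage: drain"
	delVolumesAttachments = "stage: delete volume attachments"
	initiateVMDeletion    = "stage: delete VM"
)

// nextStep is a hypothetical helper that maps the last operation's
// description to the name of the step the deletion flow would run next.
func nextStep(description string) string {
	switch {
	case strings.Contains(description, initiateDrain):
		return "drainNode"
	case strings.Contains(description, delVolumesAttachments):
		return "deleteNodeVolAttachments"
	case strings.Contains(description, initiateVMDeletion):
		return "deleteVM"
	default:
		return "unknown"
	}
}

func main() {
	// After a successful drain, the description carries the marker for the
	// volume-attachment stage, so that stage runs next.
	desc := fmt.Sprintf("Drain successful. %s", delVolumesAttachments)
	fmt.Println(nextStep(desc))
}
```

Because each stage writes the marker of the stage that follows it, the flow can resume at the correct step after a controller restart, which is why the expected descriptions in the tests below now end with the volume-attachment marker rather than the VM-deletion marker.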

41 changes: 21 additions & 20 deletions pkg/util/provider/machinecontroller/machine_test.go
@@ -21,15 +21,6 @@ import (
"math"
"time"

machineapi "github.com/gardener/machine-controller-manager/pkg/apis/machine"
"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
"github.com/gardener/machine-controller-manager/pkg/apis/machine/validation"
fakemachineapi "github.com/gardener/machine-controller-manager/pkg/client/clientset/versioned/typed/machine/v1alpha1/fake"
customfake "github.com/gardener/machine-controller-manager/pkg/fakeclient"
"github.com/gardener/machine-controller-manager/pkg/util/provider/driver"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/status"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machineutils"
. "github.com/onsi/ginkgo"
. "github.com/onsi/ginkgo/extensions/table"
. "github.com/onsi/gomega"
@@ -39,6 +30,16 @@ import (
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/util/validation/field"
k8stesting "k8s.io/client-go/testing"

machineapi "github.com/gardener/machine-controller-manager/pkg/apis/machine"
"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
"github.com/gardener/machine-controller-manager/pkg/apis/machine/validation"
fakemachineapi "github.com/gardener/machine-controller-manager/pkg/client/clientset/versioned/typed/machine/v1alpha1/fake"
customfake "github.com/gardener/machine-controller-manager/pkg/fakeclient"
"github.com/gardener/machine-controller-manager/pkg/util/provider/driver"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/status"
"github.com/gardener/machine-controller-manager/pkg/util/provider/machineutils"
)

const testNamespace = "test"
@@ -1631,7 +1632,7 @@ var _ = Describe("machine", func() {
),
},
}),
Entry("Drain skipping as machine is NotReady for a long time (5 minutes)", &data{
Entry("Force Drain as machine is NotReady for a long time (5 minutes)", &data{
setup: setup{
secrets: []*corev1.Secret{
{
@@ -1696,7 +1697,7 @@ var _ = Describe("machine", func() {
},
},
expect: expect{
err: fmt.Errorf("Skipping drain as machine is NotReady for over 5minutes. %s", machineutils.InitiateVMDeletion),
err: fmt.Errorf(fmt.Sprintf("Drain successful. %s", machineutils.DelVolumesAttachments)),
retry: machineutils.ShortRetry,
machine: newMachine(
&v1alpha1.MachineTemplateSpec{
@@ -1715,7 +1716,7 @@ var _ = Describe("machine", func() {
LastUpdateTime: metav1.Now(),
},
LastOperation: v1alpha1.LastOperation{
Description: fmt.Sprintf("Skipping drain as machine is NotReady for over 5minutes. %s", machineutils.InitiateVMDeletion),
Description: fmt.Sprintf("Drain successful. %s", machineutils.DelVolumesAttachments),
State: v1alpha1.MachineStateProcessing,
Type: v1alpha1.MachineOperationDelete,
LastUpdateTime: metav1.Now(),
@@ -1733,7 +1734,7 @@ var _ = Describe("machine", func() {
),
},
}),
Entry("Drain skipping as machine is in ReadonlyFilesystem for a long time (5 minutes)", &data{
Entry("Force Drain as machine is in ReadonlyFilesystem for a long time (5 minutes)", &data{
setup: setup{
secrets: []*corev1.Secret{
{
@@ -1798,7 +1799,7 @@ var _ = Describe("machine", func() {
},
},
expect: expect{
err: fmt.Errorf("Skipping drain as machine is in ReadonlyFilesystem for over 5minutes. %s", machineutils.InitiateVMDeletion),
err: fmt.Errorf(fmt.Sprintf("Drain successful. %s", machineutils.DelVolumesAttachments)),
retry: machineutils.ShortRetry,
machine: newMachine(
&v1alpha1.MachineTemplateSpec{
@@ -1817,7 +1818,7 @@ var _ = Describe("machine", func() {
LastUpdateTime: metav1.Now(),
},
LastOperation: v1alpha1.LastOperation{
Description: fmt.Sprintf("Skipping drain as machine is in ReadonlyFilesystem for over 5minutes. %s", machineutils.InitiateVMDeletion),
Description: fmt.Sprintf("Drain successful. %s", machineutils.DelVolumesAttachments),
State: v1alpha1.MachineStateProcessing,
Type: v1alpha1.MachineOperationDelete,
LastUpdateTime: metav1.Now(),
@@ -1835,7 +1836,7 @@ var _ = Describe("machine", func() {
),
},
}),
Entry("Drain skipping as machine is NotReady for a long time(5 min) ,also ReadonlyFilesystem is true for a long time (5 minutes)", &data{
Entry("Force Drain as machine is NotReady for a long time(5 min) ,also ReadonlyFilesystem is true for a long time (5 minutes)", &data{
setup: setup{
secrets: []*corev1.Secret{
{
@@ -1905,7 +1906,7 @@ var _ = Describe("machine", func() {
},
},
expect: expect{
err: fmt.Errorf("Skipping drain as machine is NotReady for over 5minutes. %s", machineutils.InitiateVMDeletion),
err: fmt.Errorf(fmt.Sprintf("Drain successful. %s", machineutils.DelVolumesAttachments)),
retry: machineutils.ShortRetry,
machine: newMachine(
&v1alpha1.MachineTemplateSpec{
@@ -1924,7 +1925,7 @@ var _ = Describe("machine", func() {
LastUpdateTime: metav1.Now(),
},
LastOperation: v1alpha1.LastOperation{
Description: fmt.Sprintf("Skipping drain as machine is NotReady for over 5minutes. %s", machineutils.InitiateVMDeletion),
Description: fmt.Sprintf("Drain successful. %s", machineutils.DelVolumesAttachments),
State: v1alpha1.MachineStateProcessing,
Type: v1alpha1.MachineOperationDelete,
LastUpdateTime: metav1.Now(),
@@ -2236,7 +2237,7 @@ var _ = Describe("machine", func() {
LastUpdateTime: metav1.Now(),
},
LastOperation: v1alpha1.LastOperation{
Description: fmt.Sprintf("Drain failed due to - Failed to update node. However, since it's a force deletion shall continue deletion of VM. %s", machineutils.InitiateVMDeletion),
Description: fmt.Sprintf("Drain failed due to - Failed to update node. However, since it's a force deletion shall continue deletion of VM. %s", machineutils.DelVolumesAttachments),
State: v1alpha1.MachineStateProcessing,
Type: v1alpha1.MachineOperationDelete,
LastUpdateTime: metav1.Now(),
@@ -2450,7 +2451,7 @@ var _ = Describe("machine", func() {
LastUpdateTime: metav1.Now(),
},
LastOperation: v1alpha1.LastOperation{
Description: fmt.Sprintf("Drain failed due to - Failed to update node. However, since it's a force deletion shall continue deletion of VM. %s", machineutils.InitiateVMDeletion),
Description: fmt.Sprintf("Drain failed due to - Failed to update node. However, since it's a force deletion shall continue deletion of VM. %s", machineutils.DelVolumesAttachments),
State: v1alpha1.MachineStateProcessing,
Type: v1alpha1.MachineOperationDelete,
LastUpdateTime: metav1.Now(),