
[RFE] Do not reboot when draining fails #188

Open
Nuckal777 opened this issue Feb 20, 2023 · 4 comments
Labels
component/agent (Agent-related issue), enhancement (New feature or request)

Comments

@Nuckal777

In version 0.9.0 of the update-agent, certain errors while draining are ignored here. For availability, it would be great if nodes with remaining payload did not get rebooted.

@invidian added the component/agent label Feb 20, 2023
@invidian
Member

Hm, as far as I remember and understand, this is not new behavior in v0.9.0; it has always been like this. In Kubernetes, nodes are expected to go away randomly, and due to the way the operator is currently implemented, blocking the update on a single node also blocks it for the rest of the nodes. I think that is desired: if the update is faulty, one likely does not want to proceed with rolling it out to the entire cluster. On the other hand, problems while draining a node should not block the updates. If blocking is desired, the before-reboot hook should probably be used to make sure any additional conditions required for rebooting are satisfied; that is exactly what the hooks mechanism is for. It should be OK to drain a node from the hook with whatever handling is desired.
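To illustrate, a rough sketch of such a hook: a DaemonSet that only gets scheduled on nodes the operator has marked for reboot, runs its checks, and then sets the annotation the operator is configured to wait for. The label key, names, namespace, and image below are assumptions and should be checked against the operator documentation:

```yaml
# Hypothetical before-reboot hook, not an official manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: before-reboot-check            # hypothetical name
  namespace: reboot-coordinator        # assumed namespace
spec:
  selector:
    matchLabels:
      app: before-reboot-check
  template:
    metadata:
      labels:
        app: before-reboot-check
    spec:
      serviceAccountName: before-reboot-check   # needs RBAC to annotate nodes
      nodeSelector:
        # Assumed label set by the operator on nodes pending a reboot.
        flatcar-linux-update.v1.flatcar-linux.net/before-reboot: "true"
      containers:
        - name: check
          image: registry.example.com/before-reboot-check:latest   # placeholder
          # The container would verify any extra conditions (for example, that
          # the node has actually been drained) and then annotate its node so
          # the operator proceeds with the reboot.
```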

Can you provide some more details about your issue, @Nuckal777? Do you consider it a regression from the previous version?

@Nuckal777
Author

We only use the update-agent. When upgrading to 0.9.0, I forgot to update the RoleBinding to include evictions, which in turn caused no pods to be evicted and the drain to produce the corresponding errors. In the end, the payload was not moved. So my initial framing of this as a regression is apparently wrong, but I would still personally prefer a more defensive approach here. I would rather check for stalled upgrades than wonder why every pod on a node is interrupted for a short period of time.
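For reference, the missing permission boils down to the pods/eviction subresource, since draining evicts pods via the Eviction API. A minimal sketch of the rule, with placeholder names for the ClusterRole, binding, service account, and namespace:

```yaml
# Sketch of the eviction permission the agent's service account needs for
# draining to work. All names here are placeholders, not the project's
# actual manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: update-agent-drain            # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]                 # evictions are created, never updated
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]            # needed to find the pods to evict
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: update-agent-drain            # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: update-agent-drain
subjects:
  - kind: ServiceAccount
    name: flatcar-linux-update-agent  # assumed service account name
    namespace: reboot-coordinator     # assumed namespace
```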

@invidian added the question label Feb 20, 2023
@invidian
Member

invidian commented Feb 20, 2023

Interesting. Thank you for sharing your use case. I understand your rationale, but considering that the behavior did not change between the releases as far as I know, I would consider this a feature request rather than a regression or bug. Do you agree?

I think it should be OK to implement a new flag for the agent which requires the drain to succeed before the reboot. We just need to define what should happen when this error occurs. Should we exit and restart? Should we delay and retry draining? Or just stall the process until human intervention? What would be the desired behavior for you?
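As an illustration of the exit-and-restart option, a minimal sketch of what a strict drain path in the agent could look like. The exitOnDrainFailure flag is hypothetical, and this uses the generic k8s.io/kubectl/pkg/drain helpers rather than the agent's actual drain code:

```go
// Sketch only, not the update-agent's actual implementation: cordon and drain
// the node, and if draining fails while a (hypothetical) strict mode is
// enabled, exit non-zero instead of rebooting, so Kubernetes restarts the pod
// and the failure shows up as CrashLoopBackOff.
package agent

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
	"k8s.io/kubectl/pkg/drain"
)

func drainBeforeReboot(ctx context.Context, client kubernetes.Interface, nodeName string, exitOnDrainFailure bool) {
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1, // use each pod's own grace period
		Timeout:             5 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		klog.Fatalf("getting node %q: %v", nodeName, err)
	}

	// Mark the node unschedulable before evicting its pods.
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		klog.Fatalf("cordoning node %q: %v", nodeName, err)
	}

	if err := drain.RunNodeDrain(helper, nodeName); err != nil {
		if exitOnDrainFailure {
			// Strict mode: never reboot with payload still on the node.
			// Exiting non-zero surfaces the problem as CrashLoopBackOff.
			klog.Errorf("draining node %q failed: %v", nodeName, err)
			os.Exit(1)
		}
		// Permissive behavior (roughly the status quo): log the error and
		// continue to the reboot anyway.
		klog.Warningf("ignoring drain error on node %q: %v", nodeName, err)
	}
}
```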

@Nuckal777
Author

I think exit-and-restart is fine, as that would surface the issue via the CrashLoopBackOff pod status in Kubernetes.

@invidian added the enhancement label and removed the question label Feb 21, 2023