
[RFE] Do not reboot when draining fails #188

Open
Nuckal777 opened this issue Feb 20, 2023 · 4 comments
Labels
component/agent (Agent-related issue), enhancement (New feature or request)

Comments

@Nuckal777

In version 0.9.0 of the update-agent, certain errors while draining are ignored here. For availability, it would be great if nodes with remaining payload did not get rebooted.

@invidian added the component/agent label Feb 20, 2023
@invidian
Member

Hm, as far as I remember and understand, this is not new behavior in v0.9.0; it has always been like this. In Kubernetes, nodes are expected to go away randomly, and due to the way the operator is currently implemented, blocking the update on a single node also blocks it for the rest of the nodes. I think that is desired: if the update is faulty, one likely does not want to proceed with rolling it out to the entire cluster. On the other hand, problems while draining a node should not block the updates. If blocking is desired, the before-reboot hook should probably be used to make sure any additional conditions required for rebooting are satisfied; that is exactly what the hooks mechanism is for. It should be OK to drain a node from the hook with whatever handling is desired.
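To illustrate, a rough sketch of such a hook: a DaemonSet that only gets scheduled on nodes the operator has marked for reboot, runs its checks, and then sets the annotation the operator is configured to wait for. The label key, names, namespace, and image below are assumptions and should be checked against the operator documentation:

```yaml
# Hypothetical before-reboot hook, not an official manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: before-reboot-check            # hypothetical name
  namespace: reboot-coordinator        # assumed namespace
spec:
  selector:
    matchLabels:
      app: before-reboot-check
  template:
    metadata:
      labels:
        app: before-reboot-check
    spec:
      serviceAccountName: before-reboot-check   # needs RBAC to annotate nodes
      nodeSelector:
        # Assumed label set by the operator on nodes pending a reboot.
        flatcar-linux-update.v1.flatcar-linux.net/before-reboot: "true"
      containers:
        - name: check
          image: registry.example.com/before-reboot-check:latest   # placeholder
          # The container would verify any extra conditions (for example, that
          # the node has actually been drained) and then annotate its node so
          # the operator proceeds with the reboot.
```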

Can you provide some more details about your issue, @Nuckal777? Do you consider it a regression from the previous version?

@Nuckal777
Author

We only use the update-agent. When upgrading to 0.9.0, I forgot to update the RoleBinding to include evictions, which in turn caused no pods to be evicted and the drain to produce the corresponding errors. In the end, the payload was not moved. So my initial framing of this as a regression is apparently wrong, but I would still personally prefer a more defensive approach here. I would rather check for stalled upgrades than wonder why every pod on a node is interrupted for a short period of time.
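For reference, the missing permission boils down to the pods/eviction subresource, since draining evicts pods via the Eviction API. A minimal sketch of the rule, with placeholder names for the ClusterRole, binding, service account, and namespace:

```yaml
# Sketch of the eviction permission the agent's service account needs for
# draining to work. All names here are placeholders, not the project's
# actual manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: update-agent-drain            # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]                 # evictions are created, never updated
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]            # needed to find the pods to evict
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: update-agent-drain            # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: update-agent-drain
subjects:
  - kind: ServiceAccount
    name: flatcar-linux-update-agent  # assumed service account name
    namespace: reboot-coordinator     # assumed namespace
```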

@invidian added the question label Feb 20, 2023
@invidian
Member

invidian commented Feb 20, 2023

Interesting. Thank you for sharing your use case. I understand your rationale, but considering that the behavior did not change between the releases as far as I know, I would consider this a feature request rather than a regression or bug. Do you agree?

I think it should be OK to implement a new flag for the agent which requires the drain to succeed before the reboot. We just need to define what should happen when this error occurs. Should we exit and restart? Should we delay and retry draining? Or just stall the process until human intervention? What would be the desired behavior for you?
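As an illustration of the exit-and-restart option, a minimal sketch of what a strict drain path in the agent could look like. The exitOnDrainFailure flag is hypothetical, and this uses the generic k8s.io/kubectl/pkg/drain helpers rather than the agent's actual drain code:

```go
// Sketch only, not the update-agent's actual implementation: cordon and drain
// the node, and if draining fails while a (hypothetical) strict mode is
// enabled, exit non-zero instead of rebooting, so Kubernetes restarts the pod
// and the failure shows up as CrashLoopBackOff.
package agent

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
	"k8s.io/kubectl/pkg/drain"
)

func drainBeforeReboot(ctx context.Context, client kubernetes.Interface, nodeName string, exitOnDrainFailure bool) {
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1, // use each pod's own grace period
		Timeout:             5 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		klog.Fatalf("getting node %q: %v", nodeName, err)
	}

	// Mark the node unschedulable before evicting its pods.
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		klog.Fatalf("cordoning node %q: %v", nodeName, err)
	}

	if err := drain.RunNodeDrain(helper, nodeName); err != nil {
		if exitOnDrainFailure {
			// Strict mode: never reboot with payload still on the node.
			// Exiting non-zero surfaces the problem as CrashLoopBackOff.
			klog.Errorf("draining node %q failed: %v", nodeName, err)
			os.Exit(1)
		}
		// Permissive behavior (roughly the status quo): log the error and
		// continue to the reboot anyway.
		klog.Warningf("ignoring drain error on node %q: %v", nodeName, err)
	}
}
```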

@Nuckal777
Author

I think exit-and-restart is fine, as that would surface the issue via the CrashLoopBackOff pod status in Kubernetes.

@invidian added the enhancement label and removed the question label Feb 21, 2023