
AKS upgrade fails if you have node-role.kubernetes.io node labels #1835

Closed
ohorvath opened this issue Sep 3, 2020 · 15 comments
@ohorvath

ohorvath commented Sep 3, 2020

What happened:

If you have an AKS 1.15 node pool labeled with "node-role.kubernetes.io/something", the upgrade process will fail. It terminates all your pods and marks your nodes NotReady, but it can never bring in new nodes to replace the old ones. I think it tries to bring up the new nodes with the old, now-unsupported labels.

What you expected to happen:

The upgrade should work, either by provisioning new nodes with new labels or by setting a feature gate that still supports the old one. At the very least, we should have a way to replace node labels after creation so this can be fixed manually before starting the upgrade.

How to reproduce it (as minimally and precisely as possible):

Create an AKS cluster with a node pool that contains a node-role.kubernetes.io label, then start the upgrade process from the portal.
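
A minimal repro sketch, assuming placeholder resource names and example version numbers (the exact patch versions will differ):

```bash
# Resource group, cluster and pool names below are placeholders.
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --kubernetes-version 1.15.11 \
  --node-count 1

# Add a pool carrying the problematic label.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name workerpool \
  --node-count 2 \
  --labels node-role.kubernetes.io/worker=worker

# Start the upgrade (portal or CLI); this is where it gets stuck.
az aks upgrade \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --kubernetes-version 1.16.10
```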

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.15
  • Size of cluster (how many worker nodes are in the cluster?)
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
  • Others:
@ghost ghost added the triage label Sep 3, 2020
@ghost

ghost commented Sep 3, 2020

Hi ohorvath, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@jayush

jayush commented Sep 4, 2020

+1, we cannot even update node labels on a node pool before we upgrade to address this problem.

https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools

Labels can only be set for node pools during node pool creation. Labels must also be a key/value pair and have a valid syntax.

@jonathan-hurley

+1 We also saw this recently on a 1.15 to 1.16 upgrade.

We are unable to change our node labels to be compliant with 1.16, so our upgrade completely fails. We kind of need this ability so that:

  • We can have a successful upgrade to 1.16
  • We can adjust our nodeSelectors to the appropriate new labels so pods land correctly (see the sketch after this list).
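
As a hedged illustration only (the label key and deployment name here are made up), switching a workload's nodeSelector over to a compliant replacement label might look like:

```bash
# "role=worker" stands in for whatever compliant label the new nodes carry;
# "my-app" is a placeholder deployment name.
kubectl patch deployment my-app --type merge -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"role":"worker"}}}}}'
```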

@ncole

ncole commented Sep 4, 2020

+1. Ideally, just upgrading from 1.15 to 1.16 should Just Work. As it is, the portal just hangs for over an hour until it fails with no meaningful information.

The ability to update node labels after creation is also critically important. Not only setting the labels, but propagating the change to the existing nodes should be part of the update.

@ghost ghost added the action-required label Sep 6, 2020
@ghost

ghost commented Sep 6, 2020

Triage required from @Azure/aks-pm

@paulgmiller
Member

We're going to try to take a look at this and reproduce it. It seems like node-role.kubernetes.io is a common label that got deprecated.
kubernetes/kubernetes#84912 maybe?

@ghost

ghost commented Sep 9, 2020

Action required from @paulgmiller.

@ohorvath
Author

ohorvath commented Sep 9, 2020

> We're going to try to take a look at this and reproduce it. It seems like node-role.kubernetes.io is a common label that got deprecated.
> kubernetes/kubernetes#84912 maybe?

Yes, that's deprecated, but we can't remove it because node labels cannot be changed after AKS cluster creation. I mean, we could try to manipulate them programmatically node by node or move the workloads to a new node pool, but that's a pain for existing environments. So all of those clusters are stuck on 1.15 forever, and similar issues will happen in the future, I'm sure. A solution like a node label operator that controls node labels via tags or similar would be beneficial.
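
For what it's worth, the node-by-node manipulation mentioned above would look roughly like this (pool name is a placeholder; the pool's configured labels presumably come back on any new or re-imaged node, which is why this isn't a real fix):

```bash
# Remove the deprecated label from every node in the pool;
# the trailing '-' on the label key deletes it.
for node in $(kubectl get nodes -l agentpool=workerpool -o name); do
  kubectl label "$node" node-role.kubernetes.io/worker-
done
```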

@paulgmiller
Member

paulgmiller commented Sep 9, 2020

So we repro'd it: if you have a node pool you created with

az aks nodepool add ... --labels node-role.kubernetes.io/worker=worker

then that label will be handed off to kubelet, which handles it poorly (not sure whether it fails to start or just stops updating status).

Since we don't have a way to update labels (--labels is not supported in az aks nodepool update), the short-term way to fix this is to create a new node pool and cordon/drain/delete the old one.
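
A rough sketch of that short-term workaround, with placeholder names and a compliant placeholder label key:

```bash
# 1. Add a replacement pool without the node-role.kubernetes.io label.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name workerpool2 \
  --node-count 2 \
  --labels role=worker

# 2. Cordon and drain the nodes of the old pool so workloads move over.
for node in $(kubectl get nodes -l agentpool=workerpool -o name); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done

# 3. Delete the old pool.
az aks nodepool delete \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name workerpool
```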

Longer term we need to decide whether we should:

  1. Not carry node-role.kubernetes.io across to > 1.16. This seems a bit dangerous, since presumably there was a reason those labels were added.
  2a. Fail the upgrade and tell the customer to manually create a new node pool.
  2b. Fail the upgrade and give them a way to update labels (basically an upgrade in itself).
  2c. Only allow you to change node labels when you're doing an upgrade.

Interested in the feedback of those who've commented already.

@jonathan-hurley

I would personally like to see the ability for us to change the labels on node pools. This will allow us to prevent any problems during upgrade and also to continue to target certain nodes with our nodeSelectors.

@ohorvath
Author

ohorvath commented Sep 9, 2020

I'd like to manipulate node labels through API calls or CLI commands before or during upgrade. Like so:

az aks nodepool upgrade --removelabel XYZ --addlabel XYZ

@jonathan-hurley

Having the existing node labels automatically removed doesn't buy us much since our charts used them as nodeSelector targets. We'd definitely need the ability to adjust the existing nodepool labels.

@paulgmiller
Member

Current plan is to try to block upgrades/creates to >=1.16 with these bad labels starting next week (2a). We want to let you update labels, but that needs more work, and the first priority is to keep people from accidentally destroying their agent pools.

@ghost ghost added the action-required label Oct 6, 2020
@ghost ghost added the stale Stale issue label Jan 25, 2021
@ghost

ghost commented Jan 25, 2021

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ghost ghost closed this as completed Feb 9, 2021
@ghost

ghost commented Feb 9, 2021

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. ohorvath, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question/issue or suggestion.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 11, 2021
This issue was closed.