
Generated job name is invalid #346

Open
tuxpeople opened this issue Feb 15, 2025 · 6 comments

@tuxpeople

tuxpeople commented Feb 15, 2025

Version
v0.15.0-rc2

Platform/Architecture
Linux talos-test04 6.12.11-talos #1 SMP Tue Jan 28 09:32:23 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Describe the bug
The generated jobs have invalid names, ending with a "-":

$ kubectl logs -n kube-system system-upgrade-8445f958db-knnpd
[...]
I0215 06:38:31.911115       1 event.go:389] "Event occurred" object="kube-system/talos" fieldPath="" kind="Plan" apiVersion="upgrade.cattle.io/v1" type="Normal" reason="SyncJob" message="Jobs synced for version v1.9.4 on Nodes talos-test02. Hash: "
time="2025-02-15T06:38:31Z" level=error msg="error syncing 'kube-system/talos': handler system-upgrade: secrets \"system-upgrade\" not found, handler system-upgrade: failed to create kube-system/apply-talos-on-talos-test02-with- batch/v1, Kind=Job for system-upgrade kube-system/talos: Job.batch \"apply-talos-on-talos-test02-with-\" is invalid: [metadata.name: Invalid value: \"apply-talos-on-talos-test02-with-\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.labels: Invalid value: \"apply-talos-on-talos-test02-with-\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')], requeuing"

To Reproduce

  1. Deploy v0.15.0-rc2
  2. Add a plan

Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    meta.helm.sh/release-name: system-upgrade
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2025-02-14T12:19:07Z"
  generation: 3
  labels:
    app.kubernetes.io/component: system-upgrade
    app.kubernetes.io/instance: system-upgrade
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: system-upgrade
    helm.sh/chart: app-template-3.7.1
    helm.toolkit.fluxcd.io/name: system-upgrade
    helm.toolkit.fluxcd.io/namespace: kube-system
  name: system-upgrade
  namespace: kube-system
  resourceVersion: "146247497"
  uid: f406b0b7-74ab-4429-bd6a-5af8b1e2581a
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app.kubernetes.io/component: system-upgrade
      app.kubernetes.io/instance: system-upgrade
      app.kubernetes.io/name: system-upgrade
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        checksum/secrets: f9a2edb516d89dc9e0af00dcf3d13ae57cbe1bc631c4b35d393a497ef218d929
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: system-upgrade
        app.kubernetes.io/instance: system-upgrade
        app.kubernetes.io/name: system-upgrade
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
      automountServiceAccountToken: true
      containers:
      - env:
        - name: SYSTEM_UPGRADE_CONTROLLER_LEADER_ELECT
          value: "true"
        - name: SYSTEM_UPGRADE_CONTROLLER_NAME
          value: system-upgrade
        - name: SYSTEM_UPGRADE_CONTROLLER_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: SYSTEM_UPGRADE_CONTROLLER_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: SYSTEM_UPGRADE_JOB_BACKOFF_LIMIT
          value: "99"
        - name: SYSTEM_UPGRADE_JOB_PRIVILEGED
          value: "false"
        image: docker.io/rancher/system-upgrade-controller:v0.15.0-rc2
        imagePullPolicy: IfNotPresent
        name: app
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      enableServiceLinks: false
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccount: system-upgrade
      serviceAccountName: system-upgrade
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists

Plan YAML:

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  creationTimestamp: "2025-02-14T12:24:10Z"
  generation: 3
  labels:
    app.kubernetes.io/name: system-upgrade-plans
    kustomize.toolkit.fluxcd.io/name: system-upgrade-plans
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: talos
  namespace: kube-system
  resourceVersion: "146247587"
  uid: f02a54af-4644-4ce2-ab9f-e9a8f128e703
spec:
  concurrency: 1
  exclusive: true
  nodeSelector:
    matchExpressions:
    - key: kubernetes.io/os
      operator: In
      values:
      - linux
  postCompleteDelay: 2m
  secrets:
  - ignoreUpdates: true
    name: system-upgrade
    path: /var/run/secrets/talos.dev
  serviceAccountName: system-upgrade
  upgrade:
    args:
    - --node=$(SYSTEM_UPGRADE_NODE_NAME)
    - --tag=$(SYSTEM_UPGRADE_PLAN_LATEST_VERSION)
    - --powercycle
    image: ghcr.io/jfroy/tnu:0.4.0
  version: v1.9.4

Full deployment:
https://github.com/tuxpeople/k8s-homelab/tree/97e7256808cd65c0d004d4e58adbfd38e8f5984f/kubernetes/apps/kube-system/system-upgrade

Expected behavior
Jobs are created with valid names.

Actual behavior
Controller fails to create jobs

Additional context
I'm not a programmer, but I dug around a bit and I think the name gets created here:

Name: name.SafeConcatName("apply", plan.Name, "on", shortNodeName, "with", plan.Status.LatestHash),

Therefore, I think plan.Status.LatestHash is empty. I assume it's coming from here:

LatestHash string `json:"latestHash,omitempty"`

But if I do a kubectl get plan, the plan does not have a status at all.

Also, the event in the logs shows an empty hash (the message ends with Hash: ").
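
A minimal sketch of the failure mode, assuming SafeConcatName joins its arguments with "-" (which matches the name shown in the error; the real wrangler helper also truncates long names, which is not modeled here). The regex is the one quoted in the validation message:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Validation regex quoted in the error message above.
var rfc1123 = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

func main() {
	latestHash := "" // plan.Status.LatestHash was never populated
	// Assumed behavior: join the parts with "-" (hypothetical stand-in for
	// name.SafeConcatName). An empty last part leaves a trailing hyphen.
	jobName := strings.Join([]string{"apply", "talos", "on", "talos-test02", "with", latestHash}, "-")
	fmt.Println(jobName)                      // apply-talos-on-talos-test02-with-
	fmt.Println(rfc1123.MatchString(jobName)) // false: must end with an alphanumeric
}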

@brandond
Member

Yeah it shouldn't be possible to have an empty hash for a valid plan. That's an odd one for sure.

@brandond
Member

brandond commented Feb 18, 2025

Can you post the complete logs from the controller pod, with --debug if possible? Is there anything else odd going on in this environment?

@tuxpeople
Author

Yes, I had issues with etcd. But I've since recreated the cluster, upgraded from rc2 to v0.15.0, and moved system-upgrade-controller into its own namespace, and I still have the problem. The debug logs are here:

system-upgrade-controller-67b59df874-4t2kg.log

@brandond
Member

brandond commented Mar 3, 2025

 secrets:
 - ignoreUpdates: true
   name: system-upgrade
   path: /var/run/secrets/talos.dev

The secret specified in your plan does not exist. From the log message:

error syncing 'kube-system/talos':
handler system-upgrade: secrets "system-upgrade" not found
handler system-upgrade: failed to create kube-system/apply-talos-on-talos-test02-with- batch/v1, Kind=Job for system-upgrade kube-system/talos: Job.batch "apply-talos-on-talos-test02-with-" is invalid: [metadata.name: Invalid value: "apply-talos-on-talos-test02-with-": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.labels: Invalid value: "apply-talos-on-talos-test02-with-": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')]

We can improve the error handling for the case where a specified secret does not exist, but this is a case of user error.
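
A hypothetical sketch of such a guard; the types and function names here are illustrative stand-ins, not the controller's actual code. The idea is to fail with a descriptive error when the hash is empty (e.g. because a referenced secret could not be resolved) rather than emit a Job name the API server will reject:

package main

import (
	"errors"
	"fmt"
	"strings"
)

// Illustrative stand-ins for the controller's types; not the actual API.
type PlanStatus struct{ LatestHash string }
type Plan struct {
	Namespace, Name string
	Status          PlanStatus
}

// jobName refuses to build a name when the plan hash is empty, instead of
// producing one with a trailing "-" that fails RFC 1123 validation.
func jobName(plan Plan, shortNodeName string) (string, error) {
	if plan.Status.LatestHash == "" {
		return "", errors.New("plan " + plan.Namespace + "/" + plan.Name + " has no latest hash (missing secret?)")
	}
	return strings.Join([]string{"apply", plan.Name, "on", shortNodeName, "with", plan.Status.LatestHash}, "-"), nil
}

func main() {
	if _, err := jobName(Plan{Namespace: "kube-system", Name: "talos"}, "talos-test02"); err != nil {
		fmt.Println(err) // a clearer failure than an invalid Job name
	}
}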

@brandond
Member

brandond commented Mar 3, 2025

The above-linked PR makes secrets with ignoreUpdates: true optional. Up until now they were not hashed, but were still required to exist.

Note that if you need something from the secret for your upgrade pod to work properly, it probably should exist.

@brandond
Member

brandond commented Mar 3, 2025

Also note that your example plan was in the kube-system namespace, but referenced a serviceaccount that only exists in the system-upgrade namespace by default.

I suspect that you were copy-pasting stuff and didn't ensure that everything referenced in the example actually existed where it should?
