Restore from S3 repo - failed with error: found a datadownload with status "InProgress" during the node-agent starting, mark it as cancel #7761

Closed
thomasklosinsky opened this issue Apr 30, 2024 · 9 comments

@thomasklosinsky

What steps did you take and what happened:
1 pod with a pvc mounted via ceph-csi in namespace backuptest

Ran:
velero backup create NAME --snapshot-data-move --include-namespaces backuptest << successful
kubectl delete namespace backuptest << of course, successful
velero describe backup NAME --detail << all data looking good
velero repo get << all data there
velero restore create --from-backup NAME

What did you expect to happen:
I expected the pvc to be restored from S3 repo.
But the datadownload failed with this error:
found a datadownload with status "InProgress" during the node-agent starting, mark it as cancel

The following information will help us better understand what's going on:
Log from the node-agent:

time="2024-04-30T10:00:59Z" level=info msg="Restore PVC is created" logSource="pkg/exposer/generic_restore.go:103" owner=backuptest-01-20240430120059-2x42c pvc name=backuptest-01-20240430120059-2x42c source namespace=backuptest target PVC=test-pvc-2
time="2024-04-30T10:00:59Z" level=info msg="Restore is exposed" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:195"
time="2024-04-30T10:00:59Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"
time="2024-04-30T10:01:11Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"
time="2024-04-30T10:01:11Z" level=info msg="Data download is prepared" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:226"
time="2024-04-30T10:01:11Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"
time="2024-04-30T10:01:11Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:285"
time="2024-04-30T10:10:55Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"
time="2024-04-30T10:10:55Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:285"
time="2024-04-30T10:10:55Z" level=info msg="Data download is being canceled" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:287"
time="2024-04-30T10:10:55Z" level=warning msg="Async fs backup data path canceled" controller=DataDownload datadownload=backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:398"
time="2024-04-30T10:10:55Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"
time="2024-04-30T10:10:55Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"

Environment:

  • Velero version (use velero version): 1.13.2
  • Velero features (use velero client config get features): NOT SET
  • Kubernetes version (use kubectl version): 1.30.0
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration: S3-compatible, self-hosted (Ceph/RADOS)
  • OS (e.g. from /etc/os-release): Ubuntu 22.04 LTS

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@sseago (Collaborator) commented Apr 30, 2024

That error means that the node agent restarted in the middle of processing a datadownload.
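A minimal way to check for that, assuming a default velero install where the node-agent DaemonSet pods carry the label name=node-agent in the velero namespace (adjust the selector if yours differs):

# The RESTARTS column stays at 0 when the DaemonSet replaces an evicted pod with a
# brand-new one, so a changed pod name or a young pod AGE is the thing to look for.
kubectl -n velero get pods -l name=node-agent -o wide
kubectl -n velero get events --sort-by=.lastTimestamp | grep -i node-agent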

@thomasklosinsky (Author)

Hmmm... no restarts visible:
NAME READY STATUS RESTARTS AGE
node-agent-4r7rq 1/1 Running 0 45h
node-agent-8vzlr 1/1 Running 0 2d18h
velero-7d6578dfcc-kdgfq 1/1 Running 0 2d20h

@thomasklosinsky (Author)

Changed the node-agent DaemonSet to debug log level.

time="2024-05-02T08:02:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:02:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:03:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:03:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:03:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:03:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:04:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:04:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:04:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:04:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:05:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:05:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:05:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:05:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:06:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:06:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:06:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:06:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:07:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:07:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:07:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:07:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:08:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:08:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:08:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:08:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"

The restore has already been canceled, but the datadownload is still trying to "do something"...

Details of the datadownload:

kubectl -n velero get datadownloads -l velero.io/restore-name=backuptest-01-20240430120059 -o yaml
apiVersion: v1
items:
- apiVersion: velero.io/v2alpha1
  kind: DataDownload
  metadata:
    creationTimestamp: "2024-04-30T10:00:59Z"
    generateName: backuptest-01-20240430120059-
    generation: 6
    labels:
      velero.io/accepted-by: kubetestw1
      velero.io/async-operation-id: dd-0f71eee4-ef16-4b20-ac1f-47cda0cbba3c.437d6380-f23b-437415fe8
      velero.io/restore-name: backuptest-01-20240430120059
      velero.io/restore-uid: 0f71eee4-ef16-4b20-ac1f-47cda0cbba3c
    name: backuptest-01-20240430120059-2x42c
    namespace: velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Restore
      name: backuptest-01-20240430120059
      uid: 0f71eee4-ef16-4b20-ac1f-47cda0cbba3c
    resourceVersion: "1553867"
    uid: 2ac55336-9bce-4af8-bb67-16be88e90bda
  spec:
    backupStorageLocation: default
    cancel: true
    dataMoverConfig:
      WriteSparseFiles: "false"
    operationTimeout: 10m0s
    snapshotID: e1ee356bb1abc0855e739aed7eb201ae
    sourceNamespace: backuptest
    targetVolume:
      namespace: backuptest
      pv: ""
      pvc: test-pvc-2
  status:
    completionTimestamp: "2024-04-30T10:10:55Z"
    message: found a datadownload with status "InProgress" during the node-agent starting,
      mark it as cancel
    node: kubetestw2
    phase: Canceled
    progress: {}
    startTimestamp: "2024-04-30T10:00:59Z"
kind: List
metadata:
  resourceVersion: ""

@danfengliu (Contributor) commented May 6, 2024

Hi @thomasklosinsky, from the debug bundle you provided you can see that node agent node-agent-4r7rq started at 2024-04-30T10:10:54Z and node agent node-agent-8vzlr found an InProgress DataDownload at 2024-04-30T10:10:55Z. So 10:10:54 is the time the node agent restarted, and the new pod is node-agent-4r7rq. The restart was triggered while handling the restore request; most often a lack of cluster resources is the cause. You can verify that by checking the Kubernetes cluster event logs, or increase the memory of the cluster nodes to keep the node agent functional.
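A sketch of how to check that (the node name kubetestw2 is taken from this thread; adjust for your cluster):

# Look for eviction / resource-pressure events around the time the node agent restarted.
kubectl get events -A --sort-by=.lastTimestamp | grep -Ei 'evict|pressure|oom|node-agent'
# Inspect the node the DataDownload was running on for memory/disk pressure conditions.
kubectl describe node kubetestw2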

danfengliu self-assigned this May 6, 2024
@thomasklosinsky (Author)

backuptest01-20240506133312-wjdq4 1/1 Running 0 2m35s
node-agent-mdxp2 1/1 Running 0 116m
node-agent-p6nfj 0/1 Evicted 0 7s
velero-7d6578dfcc-hnvgz 1/1 Running 0 116m

        Normal   Pulled            pod/backuptest01-lq75c   Container image "velero/velero:v1.13.2" already present on machine
15m     Normal   Created           pod/backuptest01-lq75c   Created container eb16cf27-0fa9-46f2-b410-0ae78e45595a
15m     Normal   Started           pod/backuptest01-lq75c   Started container eb16cf27-0fa9-46f2-b410-0ae78e45595a
12m     Normal   Killing           pod/backuptest01-lq75c   Stopping container eb16cf27-0fa9-46f2-b410-0ae78e45595a
105s    Warning  Evicted           pod/node-agent-9fpgc     The node was low on resource: ephemeral-storage. Threshold quantity: 1255215563, available: 1111428Ki. Container node-agent was using 1289240Ki, request is 0, has larger consumption of ephemeral-storage.
105s    Normal   Killing           pod/node-agent-9fpgc     Stopping container node-agent
104s    Warning  Evicted           pod/node-agent-fzv9k     The node had condition: [DiskPressure].
103s    Normal   Scheduled         pod/node-agent-fzv9k     Successfully assigned velero/node-agent-fzv9k to kubetestw2
102s    Normal   Scheduled         pod/node-agent-hh546     Successfully assigned velero/node-agent-hh546 to kubetestw2
103s    Warning  Evicted           pod/node-agent-hh546     The node had condition: [DiskPressure].
96s     Normal   Scheduled         pod/node-agent-hpwcr     Successfully assigned velero/node-agent-hpwcr to kubetestw2
97s     Warning  Evicted           pod/node-agent-hpwcr     The node had condition: [DiskPressure].
71s     Normal   Scheduled         pod/node-agent-lvqbx     Successfully assigned velero/node-agent-lvqbx to kubetestw2
72s     Warning  Evicted           pod/node-agent-lvqbx     The node had condition: [DiskPressure].
39s     Normal   Scheduled         pod/node-agent-p6nfj     Successfully assigned velero/node-agent-p6nfj to kubetestw2
40s     Warning  Evicted           pod/node-agent-p6nfj     The node had condition: [DiskPressure].
88s     Normal   Scheduled         pod/node-agent-rhn86     Successfully assigned velero/node-agent-rhn86 to kubetestw2
89s     Warning  Evicted           pod/node-agent-rhn86     The node had condition: [DiskPressure].
100s    Normal   Scheduled         pod/node-agent-xnb5b     Successfully assigned velero/node-agent-xnb5b to kubetestw2
101s    Warning  Evicted           pod/node-agent-xnb5b     The node had condition: [DiskPressure].
104s    Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-9fpgc on node kubetestw2, will try to kill it
104s    Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-9fpgc
104s    Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-fzv9k
103s    Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-fzv9k on node kubetestw2, will try to kill it
103s    Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-fzv9k
103s    Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-hh546
101s    Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-hh546 on node kubetestw2, will try to kill it
101s    Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-hh546
101s    Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-xnb5b
97s     Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-xnb5b on node kubetestw2, will try to kill it
97s     Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-xnb5b
97s     Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-hpwcr
89s     Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-hpwcr on node kubetestw2, will try to kill it
89s     Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-hpwcr
89s     Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-rhn86
72s     Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-rhn86 on node kubetestw2, will try to kill it
72s     Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-rhn86
72s     Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-lvqbx
40s     Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-lvqbx on node kubetestw2, will try to kill it
40s     Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-lvqbx
40s     Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-p6nfj

So that means the I/O is too slow... right?

@danfengliu (Contributor) commented May 7, 2024

"The node was low on resource: ephemeral-storage" tells us that your local storage has no space left for the node agent to cache data. You should check your nodes' ephemeral storage and increase it.
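For reference, a couple of commands to see what a node advertises as ephemeral storage and what is actually free (kubetestw2 is the node from this thread; the df path assumes the kubelet's default root directory):

kubectl get node kubetestw2 -o jsonpath='{.status.capacity.ephemeral-storage}{"  "}{.status.allocatable.ephemeral-storage}{"\n"}'
# On the node itself: the filesystem backing /var/lib/kubelet is what ephemeral-storage accounting uses by default.
df -h /var/lib/kubelet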

@danfengliu (Contributor)

We have issues #7725 and #7620 as proposed ways to let users avoid consuming so much ephemeral storage, but those issues have not been planned yet. For now you have to increase your node's ephemeral storage so that the node agent has enough space to cache the data that is to be restored.
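If you want to see which pods are actually consuming the ephemeral storage on a node before resizing, the kubelet stats summary endpoint can be queried through the API server proxy. This is only a sketch: it assumes jq is installed, your RBAC allows nodes/proxy, and the per-pod ephemeral-storage fields are reported by your Kubernetes version.

kubectl get --raw "/api/v1/nodes/kubetestw2/proxy/stats/summary" \
  | jq -r '.pods[] | [.podRef.namespace, .podRef.name, (.["ephemeral-storage"].usedBytes // 0)] | @tsv' \
  | sort -t$'\t' -k3 -nr | head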

@thomasklosinsky (Author)

Added some space to the k8s worker VM disk and it's working now. Thanks a lot!
