Restore from S3 repo - failed with error: found a datadownload with status "InProgress" during the node-agent starting, mark it as cancel #7761

Closed
thomasklosinsky opened this issue Apr 30, 2024 · 9 comments

@thomasklosinsky

What steps did you take and what happened:
1 pod with a pvc mounted via ceph-csi in namespace backuptest

Ran:
velero backup create NAME --snapshot-data-move --include-namespaces backuptest << successful
kubectl delete namespace backuptest << of course, successful
velero describe backup NAME --detail << all data looking good
velero repo get << all data there
velero restore create --from-backup NAME

What did you expect to happen:
I expected the pvc to be restored from S3 repo.
But the datadownload failed with this error:
found a datadownload with status "InProgress" during the node-agent starting, mark it as cancel

The following information will help us better understand what's going on:
Log from the node-agent:

time="2024-04-30T10:00:59Z" level=info msg="Restore PVC is created" logSource="pkg/exposer/generic_restore.go:103" owner=backuptest-01-20240430120059-2x42c pvc name=backuptest-01-20240430120059-2x42c source namespace=backuptest target PVC=test-pvc-2
time="2024-04-30T10:00:59Z" level=info msg="Restore is exposed" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:195"
time="2024-04-30T10:00:59Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"
time="2024-04-30T10:01:11Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"
time="2024-04-30T10:01:11Z" level=info msg="Data download is prepared" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:226"
time="2024-04-30T10:01:11Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"
time="2024-04-30T10:01:11Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:285"
time="2024-04-30T10:10:55Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"
time="2024-04-30T10:10:55Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:285"
time="2024-04-30T10:10:55Z" level=info msg="Data download is being canceled" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:287"
time="2024-04-30T10:10:55Z" level=warning msg="Async fs backup data path canceled" controller=DataDownload datadownload=backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:398"
time="2024-04-30T10:10:55Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"
time="2024-04-30T10:10:55Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"

Environment:

  • Velero version (use velero version): 1.13.2
  • Velero features (use velero client config get features): NOT SET
  • Kubernetes version (use kubectl version): 1.30.0
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration: S3-compatible, self-hosted (Ceph/RADOS)
  • OS (e.g. from /etc/os-release): Ubuntu 22.04 LTS

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@sseago (Collaborator) commented Apr 30, 2024

That error means that the node agent restarted in the middle of processing a datadownload.
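A minimal way to check for that, assuming a default velero install where the node-agent DaemonSet pods carry the label name=node-agent in the velero namespace (adjust the selector if yours differs):

# The RESTARTS column stays at 0 when the DaemonSet replaces an evicted pod with a
# brand-new one, so a changed pod name or a young pod AGE is the thing to look for.
kubectl -n velero get pods -l name=node-agent -o wide
kubectl -n velero get events --sort-by=.lastTimestamp | grep -i node-agent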

@thomasklosinsky (Author)

Hmmm... no restarts visible:
NAME READY STATUS RESTARTS AGE
node-agent-4r7rq 1/1 Running 0 45h
node-agent-8vzlr 1/1 Running 0 2d18h
velero-7d6578dfcc-kdgfq 1/1 Running 0 2d20h

@thomasklosinsky (Author)

Changed the node-agent DaemonSet to debug log level.

time="2024-05-02T08:02:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:02:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:03:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:03:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:03:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:03:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:04:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:04:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:04:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:04:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:05:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:05:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:05:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:05:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:06:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:06:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:06:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:06:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:07:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:07:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:07:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:07:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:08:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:08:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataDownloadList"
time="2024-05-02T08:08:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v2alpha1.DataUploadList"
time="2024-05-02T08:08:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="*v2alpha1.DataUploadList"

The restore has already been canceled, but the datadownload is still trying to "do something"...

Details of the datadownload:

kubectl -n velero get datadownloads -l velero.io/restore-name=backuptest-01-20240430120059 -o yaml
apiVersion: v1
items:
- apiVersion: velero.io/v2alpha1
  kind: DataDownload
  metadata:
    creationTimestamp: "2024-04-30T10:00:59Z"
    generateName: backuptest-01-20240430120059-
    generation: 6
    labels:
      velero.io/accepted-by: kubetestw1
      velero.io/async-operation-id: dd-0f71eee4-ef16-4b20-ac1f-47cda0cbba3c.437d6380-f23b-437415fe8
      velero.io/restore-name: backuptest-01-20240430120059
      velero.io/restore-uid: 0f71eee4-ef16-4b20-ac1f-47cda0cbba3c
    name: backuptest-01-20240430120059-2x42c
    namespace: velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Restore
      name: backuptest-01-20240430120059
      uid: 0f71eee4-ef16-4b20-ac1f-47cda0cbba3c
    resourceVersion: "1553867"
    uid: 2ac55336-9bce-4af8-bb67-16be88e90bda
  spec:
    backupStorageLocation: default
    cancel: true
    dataMoverConfig:
      WriteSparseFiles: "false"
    operationTimeout: 10m0s
    snapshotID: e1ee356bb1abc0855e739aed7eb201ae
    sourceNamespace: backuptest
    targetVolume:
      namespace: backuptest
      pv: ""
      pvc: test-pvc-2
  status:
    completionTimestamp: "2024-04-30T10:10:55Z"
    message: found a datadownload with status "InProgress" during the node-agent starting,
      mark it as cancel
    node: kubetestw2
    phase: Canceled
    progress: {}
    startTimestamp: "2024-04-30T10:00:59Z"
kind: List
metadata:
  resourceVersion: ""

@danfengliu (Contributor) commented May 6, 2024

Hi @thomasklosinsky, from the debug bundle you provided you can see that node agent node-agent-4r7rq started at 2024-04-30T10:10:54Z and node agent node-agent-8vzlr found an InProgress DataDownload at 2024-04-30T10:10:55Z. So 10:10:54 is the time the node agent restarted, and the new pod is node-agent-4r7rq. The restart was triggered while handling the restore request; most often a lack of cluster resources is the cause. You can verify that by checking the Kubernetes cluster event logs, or increase the memory of the cluster nodes to keep the node agent functional.
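A sketch of how to check that (the node name kubetestw2 is taken from this thread; adjust for your cluster):

# Look for eviction / resource-pressure events around the time the node agent restarted.
kubectl get events -A --sort-by=.lastTimestamp | grep -Ei 'evict|pressure|oom|node-agent'
# Inspect the node the DataDownload was running on for memory/disk pressure conditions.
kubectl describe node kubetestw2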

danfengliu self-assigned this May 6, 2024
@thomasklosinsky (Author)

backuptest01-20240506133312-wjdq4 1/1 Running 0 2m35s
node-agent-mdxp2 1/1 Running 0 116m
node-agent-p6nfj 0/1 Evicted 0 7s
velero-7d6578dfcc-hnvgz 1/1 Running 0 116m

        Normal   Pulled            pod/backuptest01-lq75c   Container image "velero/velero:v1.13.2" already present on machine
15m     Normal   Created           pod/backuptest01-lq75c   Created container eb16cf27-0fa9-46f2-b410-0ae78e45595a
15m     Normal   Started           pod/backuptest01-lq75c   Started container eb16cf27-0fa9-46f2-b410-0ae78e45595a
12m     Normal   Killing           pod/backuptest01-lq75c   Stopping container eb16cf27-0fa9-46f2-b410-0ae78e45595a
105s    Warning  Evicted           pod/node-agent-9fpgc     The node was low on resource: ephemeral-storage. Threshold quantity: 1255215563, available: 1111428Ki. Container node-agent was using 1289240Ki, request is 0, has larger consumption of ephemeral-storage.
105s    Normal   Killing           pod/node-agent-9fpgc     Stopping container node-agent
104s    Warning  Evicted           pod/node-agent-fzv9k     The node had condition: [DiskPressure].
103s    Normal   Scheduled         pod/node-agent-fzv9k     Successfully assigned velero/node-agent-fzv9k to kubetestw2
102s    Normal   Scheduled         pod/node-agent-hh546     Successfully assigned velero/node-agent-hh546 to kubetestw2
103s    Warning  Evicted           pod/node-agent-hh546     The node had condition: [DiskPressure].
96s     Normal   Scheduled         pod/node-agent-hpwcr     Successfully assigned velero/node-agent-hpwcr to kubetestw2
97s     Warning  Evicted           pod/node-agent-hpwcr     The node had condition: [DiskPressure].
71s     Normal   Scheduled         pod/node-agent-lvqbx     Successfully assigned velero/node-agent-lvqbx to kubetestw2
72s     Warning  Evicted           pod/node-agent-lvqbx     The node had condition: [DiskPressure].
39s     Normal   Scheduled         pod/node-agent-p6nfj     Successfully assigned velero/node-agent-p6nfj to kubetestw2
40s     Warning  Evicted           pod/node-agent-p6nfj     The node had condition: [DiskPressure].
88s     Normal   Scheduled         pod/node-agent-rhn86     Successfully assigned velero/node-agent-rhn86 to kubetestw2
89s     Warning  Evicted           pod/node-agent-rhn86     The node had condition: [DiskPressure].
100s    Normal   Scheduled         pod/node-agent-xnb5b     Successfully assigned velero/node-agent-xnb5b to kubetestw2
101s    Warning  Evicted           pod/node-agent-xnb5b     The node had condition: [DiskPressure].
104s    Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-9fpgc on node kubetestw2, will try to kill it
104s    Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-9fpgc
104s    Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-fzv9k
103s    Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-fzv9k on node kubetestw2, will try to kill it
103s    Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-fzv9k
103s    Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-hh546
101s    Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-hh546 on node kubetestw2, will try to kill it
101s    Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-hh546
101s    Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-xnb5b
97s     Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-xnb5b on node kubetestw2, will try to kill it
97s     Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-xnb5b
97s     Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-hpwcr
89s     Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-hpwcr on node kubetestw2, will try to kill it
89s     Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-hpwcr
89s     Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-rhn86
72s     Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-rhn86 on node kubetestw2, will try to kill it
72s     Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-rhn86
72s     Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-lvqbx
40s     Warning  FailedDaemonPod   daemonset/node-agent     Found failed daemon pod velero/node-agent-lvqbx on node kubetestw2, will try to kill it
40s     Normal   SuccessfulDelete  daemonset/node-agent     Deleted pod: node-agent-lvqbx
40s     Normal   SuccessfulCreate  daemonset/node-agent     Created pod: node-agent-p6nfj

So that means the I/O is too slow... right?

@danfengliu (Contributor) commented May 7, 2024

"The node was low on resource: ephemeral-storage" tells us that your local storage has no space left for the node agent to cache data. You should check your nodes' ephemeral storage and increase it.
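For reference, a couple of commands to see what a node advertises as ephemeral storage and what is actually free (kubetestw2 is the node from this thread; the df path assumes the kubelet's default root directory):

kubectl get node kubetestw2 -o jsonpath='{.status.capacity.ephemeral-storage}{"  "}{.status.allocatable.ephemeral-storage}{"\n"}'
# On the node itself: the filesystem backing /var/lib/kubelet is what ephemeral-storage accounting uses by default.
df -h /var/lib/kubelet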

@danfengliu (Contributor)

We have issues #7725 and #7620 as proposed ways to let users avoid consuming so much ephemeral storage, but those issues have not been planned yet. For now you have to increase your node's ephemeral storage so that the node agent has enough space to cache the data that is to be restored.
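If you want to see which pods are actually consuming the ephemeral storage on a node before resizing, the kubelet stats summary endpoint can be queried through the API server proxy. This is only a sketch: it assumes jq is installed, your RBAC allows nodes/proxy, and the per-pod ephemeral-storage fields are reported by your Kubernetes version.

kubectl get --raw "/api/v1/nodes/kubetestw2/proxy/stats/summary" \
  | jq -r '.pods[] | [.podRef.namespace, .podRef.name, (.["ephemeral-storage"].usedBytes // 0)] | @tsv' \
  | sort -t$'\t' -k3 -nr | head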

@thomasklosinsky (Author)

Added some space to the k8s worker VM disk and it's working now. Thanks a lot!
