
Upgrading to any version beyond 1.12, we get an expired token error when backing up data using the datamover after 1 hr with IRSA #8173

Open
dharanui opened this issue Sep 1, 2024 · 21 comments

@dharanui

dharanui commented Sep 1, 2024

velero version: 1.14.1
error: async write error: "unable to write content chunk 96 of FILE:000002: mutable parameters: unable to read format blob: error getting kopia.repository blob: The provided token has expired: mutable parameters: unable to read format blob: error getting kopia.repository blob: The provided token has expired"

The DataUploads are failing after almost one hour of running.
Also tried increasing the repo maintenance frequency, but no luck.
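
For context, a minimal sketch of where the repository maintenance frequency can be set; this assumes the BackupRepository CR's maintenanceFrequency field, and it would not by itself address the token expiry:

apiVersion: velero.io/v1
kind: BackupRepository
metadata:
  name: mynamespace-default-kopia-abcde   # hypothetical repository name
  namespace: velero
spec:
  backupStorageLocation: default
  repositoryType: kopia
  volumeNamespace: mynamespace            # hypothetical namespace
  maintenanceFrequency: 1h                # assumed field controlling how often maintenance runs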

@Lyndon-Li
Contributor

Looks like the token used to access the object store has expired.

@dharanui
Author

dharanui commented Sep 2, 2024

Does it expire every hour? DataUploads that take less than an hour run and complete; the ones that take longer are getting cancelled. In the node-agent logs we see this error at that time.

@Lyndon-Li
Contributor

The expiration time of the token is not set by Velero, so you need to check how the token was created.

@dharanui
Author

dharanui commented Sep 2, 2024

but we were not getting this issue in 1.12

@dharanui
Author

dharanui commented Sep 2, 2024

We use IRSA and I see the IAM token is valid for 24h.

volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token

This commit looks like it could be relevant: https://github.com/vmware-tanzu/velero/pull/7374/files ?
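
For reference, a minimal sketch of how that projected token lifetime is usually driven with IRSA. The eks.amazonaws.com/token-expiration annotation is an assumption about the EKS pod identity webhook; it controls the projected web identity token, not the ~1h STS session credentials that are actually expiring here:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: velero                            # hypothetical service account name
  namespace: velero
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/velero-backup   # placeholder ARN
    eks.amazonaws.com/token-expiration: "86400"   # assumed to set expirationSeconds on the aws-iam-token volume above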

@Lyndon-Li
Contributor

Lyndon-Li commented Sep 4, 2024

> We use IRSA and I see the IAM token is valid for 24h.
> [...]
> This commit looks like it could be relevant: https://github.com/vmware-tanzu/velero/pull/7374/files ?

Why would that commit be related? Have you specified BSL->credentialFile?
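
For context, a minimal sketch of what an explicit BSL credential would look like; with IRSA the credential section (and the credentialFile touched by that PR) would normally be left unset. Bucket, region, and secret names are placeholders:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-velero-bucket       # placeholder
  config:
    region: eu-west-1              # placeholder
  credential:                      # only relevant when using a static key file instead of IRSA
    name: cloud-credentials        # Secret name (placeholder)
    key: cloud                     # key inside the Secret (placeholder)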

@dharanui
Author

dharanui commented Sep 4, 2024

Oops, sorry, no, we don't use credentialFile.
We are not getting that error now that we have rolled back to 1.12.
Could it be that the repository maintenance job is recreating the token, or something along those lines?

Or maybe the Kopia version changes with the Velero upgrade?

@Lyndon-Li
Contributor

Neither Velero nor Kopia changes the token being used; I guess there might be another token specified. We also have test cases for IRSA, but we didn't see the problem described here.

@SCLogo

SCLogo commented Sep 25, 2024

The issue happens with velero 1.13.2 with the data mover as well.

@dharanui
Author

dharanui commented Sep 25, 2024

@Lyndon-Li this was working fine up to 1.12 and started happening after upgrading to 1.13 and also 1.14. Do we know what has changed since 1.12? This is currently blocking us from upgrading to 1.14.

@catalinpan

As mentioned above, I'm getting the same error for restores which take longer than 1h. The restore eventually fails based on fsBackupTimeout, so the token error is not surfaced by the restore process itself.

I'm using the images below with an IAM role and IRSA:

In the restore-wait init container, this message shows up in a loop:

The filesystem restore done file /restores/data/.velero/file123 is not found yet. Retry later.

In the node-agent, this message shows up:

time="2024-10-02T23:20:34Z" level=error msg="Async fs restore data path failed" controller=PodVolumeRestore error="Failed to run kopia restore: Failed to copy snapshot data to the target: restore error: copy file: error creating file: cannot write data to file %q /host_pods/a2e48cae-8c75-4971-abb0-cbadb80674c8/volumes/kubernetes.io~csi/pvc-d38b075b-f1f3-4c59-8384-15f9d25fa782/mount/export/2024-Jul-12--0100.zip: unexpected content error: error getting cached content from blob \"pb3f655d8f0c66aa9377a3d660c143a45-s83fa0c09e23487b612d\": failed to get blob with ID pb3f655d8f0c66aa9377a3d660c143a45-s83fa0c09e23487b612d: The provided token has expired" logSource="pkg/controller/pod_volume_restore_controller.go:332" pvr=pvc-20241002183056-20241002221832khkt

The restore worked without any issues when downgraded to the versions below:

  • velero:v1.12.4
  • velero/velero-plugin-for-aws:v1.8.0
  • velero/velero-restore-helper:v1.10.2

Hope this will help a bit.
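
For reference, a minimal sketch of where the fsBackupTimeout mentioned above is typically configured, assuming it maps to the velero server's --fs-backup-timeout flag; raising it only delays the failure and does not fix the expired token:

# velero Deployment (sketch) - server container args
spec:
  template:
    spec:
      containers:
      - name: velero
        args:
        - server
        - --fs-backup-timeout=4h   # assumed flag; lets long file-system backups/restores run past 1h before timing out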

@dharanui
Author

dharanui commented Oct 4, 2024

Thanks @catalinpan.
We are using CSI snapshot data movement (https://velero.io/docs/main/csi-snapshot-data-movement/) instead of fs-backup.
For us the backup itself fails if it runs beyond one hour on velero v1.14.1. Downgrading to 1.12 made the backups work.
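
For context, a minimal sketch of a backup that uses CSI snapshot data movement, assuming the Backup spec's snapshotMoveData field; names are placeholders:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: nightly-backup             # hypothetical name
  namespace: velero
spec:
  includedNamespaces:
  - myapp                          # placeholder
  snapshotMoveData: true           # moves CSI snapshot data via the data mover (creates DataUploads)
  csiSnapshotTimeout: 10m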

Is there any workaround to make this work in 1.14?

@SCLogo

SCLogo commented Oct 8, 2024

we use velero 1.14.1
aws plugin: 1.10.1
with kube2iam 3600s tokens
CSI backup with data mover (kopia)
What I see in the logs is that velero or kopia does not request a new token when the old one expires; the DataUpload just goes to Failed (Cancelled).
Kopia requests an aws token using kube2iam at 12:03:46. It starts the upload and finishes. One hour later (we use hourly backups), another DataUpload request is created (2024-10-08T14:02:49Z) for the same resources and it exits with the token-has-expired error (2024-10-08T14:02:55Z); a new token is then requested (14:03:25).
Can it somehow be made to request a new token before it fails with the expired token error?

@Lyndon-Li
Contributor

This may be the expected behavior for now: multiple DUs may be created at the same time but they are processed one by one. If the 1st DU takes more than 1 hour, the second one's token will time out.
The data mover pod doesn't support IRSA, which may be the cause.

@SCLogo

SCLogo commented Oct 9, 2024

Those are two different backups. The 1st finishes without issue. The second starts, but its DU was created earlier than the last run, so it gets the old key that expires soon and the DU goes to Cancelled. If Velero tried to get a new key before exiting with an error, this problem would not come up. If I increase the duration of the key I can just hide the issue, but once a DU needs more time than I set, I need to set an even higher duration.

@SCLogo

SCLogo commented Oct 9, 2024

The default duration for the IAM role is 1 hour; we use that.

@dharanui
Author

Does increasing the default duration help in this case? @SCLogo

@SCLogo

SCLogo commented Nov 19, 2024 via email

@dharanui
Author

dharanui commented Dec 18, 2024

Hi @SCLogo / @Lyndon-Li, can you help me with how to override DurationSeconds while Velero is performing AssumeRole? I am using IRSA. Updating maxSessionDuration on the role is not helping because the default duration when assuming a role is 1 hr.
https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html

According to aws/aws-cli#9021 there is currently no environment variable for that.

@SCLogo

SCLogo commented Dec 21, 2024

@dharanui I am using kube2iam. The default max duration is 1 hour, but kube2iam asks for 30-minute temporary credentials.
If you pass iam-role-session-ttl: 1600s then kube2iam will ask for ~53 minutes because of a bug/feature (jtblin/kube2iam#240), see https://www.bluematador.com/blog/iam-access-in-kubernetes-kube2iam-vs-kiam
If you need more time, you need to set the max session duration higher.
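
For reference, a minimal sketch of where that TTL would be passed, assuming kube2iam exposes an --iam-role-session-ttl flag on its DaemonSet; the exact flag name and accepted duration format should be verified against the kube2iam version in use:

# kube2iam DaemonSet (sketch) - container args
spec:
  template:
    spec:
      containers:
      - name: kube2iam
        args:
        - --iam-role-session-ttl=3600s   # assumed flag; the granted TTL may be shorter (see jtblin/kube2iam#240)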

@dharanui
Author

Hi @Lyndon-Li / @SCLogo, any idea when this will be fixed so that we can make it work with IRSA?
