
Upgrading to any version beyond 1.12, we get an expired token error when backing up data using the datamover after 1 hr with IRSA #8173

Open
dharanui opened this issue Sep 1, 2024 · 21 comments

@dharanui

dharanui commented Sep 1, 2024

velero version: 1.14.1
error: async write error: "unable to write content chunk 96 of FILE:000002: mutable parameters: unable to read format blob: error getting kopia.repository blob: The provided token has expired: mutable parameters: unable to read format blob: error getting kopia.repository blob: The provided token has expired"

The DataUploads are failing after almost one hour of running.
Also tried increasing the repo maintenance frequency, but no luck.
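
For context, a minimal sketch of where the repository maintenance frequency can be set; this assumes the BackupRepository CR's maintenanceFrequency field, and it would not by itself address the token expiry:

apiVersion: velero.io/v1
kind: BackupRepository
metadata:
  name: mynamespace-default-kopia-abcde   # hypothetical repository name
  namespace: velero
spec:
  backupStorageLocation: default
  repositoryType: kopia
  volumeNamespace: mynamespace            # hypothetical namespace
  maintenanceFrequency: 1h                # assumed field controlling how often maintenance runs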

@Lyndon-Li
Contributor

Looks like the token used to access the object store has expired.

@dharanui
Author

dharanui commented Sep 2, 2024

Does it expire every hour? DataUploads that take less than an hour run and complete; the ones that take longer are getting cancelled. In the node-agent logs we see this error at that time.

@Lyndon-Li
Contributor

The expiration time of the token is not set by Velero, so you need to check how the token was created.

@dharanui
Author

dharanui commented Sep 2, 2024

but we were not getting this issue in 1.12

@dharanui
Author

dharanui commented Sep 2, 2024

We use IRSA and I see the IAM token is valid for 24h.

volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token

This commit looks like it could be relevant: https://github.com/vmware-tanzu/velero/pull/7374/files ?
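
For reference, a minimal sketch of how that projected token lifetime is usually driven with IRSA. The eks.amazonaws.com/token-expiration annotation is an assumption about the EKS pod identity webhook; it controls the projected web identity token, not the ~1h STS session credentials that are actually expiring here:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: velero                            # hypothetical service account name
  namespace: velero
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/velero-backup   # placeholder ARN
    eks.amazonaws.com/token-expiration: "86400"   # assumed to set expirationSeconds on the aws-iam-token volume above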

@Lyndon-Li
Contributor

Lyndon-Li commented Sep 4, 2024

> We use IRSA and I see the IAM token is valid for 24h.
> [...]
> This commit looks like it could be relevant: https://github.com/vmware-tanzu/velero/pull/7374/files ?

Why would that commit be related? Have you specified BSL->credentialFile?
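
For context, a minimal sketch of what an explicit BSL credential would look like; with IRSA the credential section (and the credentialFile touched by that PR) would normally be left unset. Bucket, region, and secret names are placeholders:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-velero-bucket       # placeholder
  config:
    region: eu-west-1              # placeholder
  credential:                      # only relevant when using a static key file instead of IRSA
    name: cloud-credentials        # Secret name (placeholder)
    key: cloud                     # key inside the Secret (placeholder)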

@dharanui
Author

dharanui commented Sep 4, 2024

Oops, sorry, no, we don't use credentialFile.
We are not getting that error now that we have rolled back to 1.12.
Could it be that the repository maintenance job is recreating the token, or something along those lines?

Or maybe the Kopia version changes with the Velero upgrade?

@Lyndon-Li
Contributor

Neither Velero nor Kopia changes the token being used; I guess there might be another token specified. We also have test cases for IRSA, but we didn't see the problem described here.

@SCLogo

SCLogo commented Sep 25, 2024

The issue happens with velero 1.13.2 with the data mover as well.

@dharanui
Author

dharanui commented Sep 25, 2024

@Lyndon-Li this was working fine up to 1.12 and started happening after upgrading to 1.13 and also 1.14. Do we know what has changed since 1.12? This is currently blocking us from upgrading to 1.14.

@catalinpan

As mentioned above, I'm getting the same error for restores which take longer than 1h. The restore eventually fails based on fsBackupTimeout, so the token error is not surfaced by the restore process itself.

I'm using the images below with an IAM role and IRSA:

In the restore-wait init container, this message shows up in a loop:

The filesystem restore done file /restores/data/.velero/file123 is not found yet. Retry later.

In the node-agent, this message shows up:

time="2024-10-02T23:20:34Z" level=error msg="Async fs restore data path failed" controller=PodVolumeRestore error="Failed to run kopia restore: Failed to copy snapshot data to the target: restore error: copy file: error creating file: cannot write data to file %q /host_pods/a2e48cae-8c75-4971-abb0-cbadb80674c8/volumes/kubernetes.io~csi/pvc-d38b075b-f1f3-4c59-8384-15f9d25fa782/mount/export/2024-Jul-12--0100.zip: unexpected content error: error getting cached content from blob \"pb3f655d8f0c66aa9377a3d660c143a45-s83fa0c09e23487b612d\": failed to get blob with ID pb3f655d8f0c66aa9377a3d660c143a45-s83fa0c09e23487b612d: The provided token has expired" logSource="pkg/controller/pod_volume_restore_controller.go:332" pvr=pvc-20241002183056-20241002221832khkt

The restore worked without any issues when downgraded to the versions below:

  • velero:v1.12.4
  • velero/velero-plugin-for-aws:v1.8.0
  • velero/velero-restore-helper:v1.10.2

Hope this will help a bit.
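
For reference, a minimal sketch of where the fsBackupTimeout mentioned above is typically configured, assuming it maps to the velero server's --fs-backup-timeout flag; raising it only delays the failure and does not fix the expired token:

# velero Deployment (sketch) - server container args
spec:
  template:
    spec:
      containers:
      - name: velero
        args:
        - server
        - --fs-backup-timeout=4h   # assumed flag; lets long file-system backups/restores run past 1h before timing out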

@dharanui
Author

dharanui commented Oct 4, 2024

Thanks @catalinpan.
We are using CSI snapshot data movement (https://velero.io/docs/main/csi-snapshot-data-movement/) instead of fs-backup.
For us the backup itself fails if it runs beyond one hour on velero v1.14.1. Downgrading to 1.12 made the backups work.
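
For context, a minimal sketch of a backup that uses CSI snapshot data movement, assuming the Backup spec's snapshotMoveData field; names are placeholders:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: nightly-backup             # hypothetical name
  namespace: velero
spec:
  includedNamespaces:
  - myapp                          # placeholder
  snapshotMoveData: true           # moves CSI snapshot data via the data mover (creates DataUploads)
  csiSnapshotTimeout: 10m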

Is there any workaround to make this work in 1.14?

@SCLogo

SCLogo commented Oct 8, 2024

we use velero 1.14.1
aws plugin: 1.10.1
with kube2iam 3600s tokens
CSI backup with data mover (kopia)
What I see in the logs is that velero or kopia does not request a new token when the old one expires; the DataUpload just goes to Failed (Cancelled).
Kopia requests an aws token using kube2iam at 12:03:46. It starts the upload and finishes. One hour later (we use hourly backups), another DataUpload request is created (2024-10-08T14:02:49Z) for the same resources and it exits with the token-has-expired error (2024-10-08T14:02:55Z); a new token is then requested (14:03:25).
Can it somehow be made to request a new token before it fails with the expired token error?

@Lyndon-Li
Contributor

This may be the expected behavior for now: multiple DUs may be created at the same time but they are processed one by one. If the 1st DU takes more than 1 hour, the second one's token will time out.
The data mover pod doesn't support IRSA, which may be the cause.

@SCLogo

SCLogo commented Oct 9, 2024

Those are two different backups. The 1st finishes without issue. The second starts, but its DU was created earlier than the last run, so it gets the old key that expires soon and the DU goes to Cancelled. If Velero tried to get a new key before exiting with an error, this problem would not come up. If I increase the duration of the key I can just hide the issue, but once a DU needs more time than I set, I need to set an even higher duration.

@SCLogo

SCLogo commented Oct 9, 2024

The default duration for the IAM role is 1 hour; we use that.

@dharanui
Author

Does increasing the default duration help in this case? @SCLogo

@SCLogo

SCLogo commented Nov 19, 2024 via email

@dharanui
Author

dharanui commented Dec 18, 2024

Hi @SCLogo / @Lyndon-Li, can you help me with how to override DurationSeconds while Velero is performing AssumeRole? I am using IRSA. Updating maxSessionDuration on the role is not helping because the default duration when assuming a role is 1 hr.
https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html

According to aws/aws-cli#9021 there is currently no environment variable for that.

@SCLogo

SCLogo commented Dec 21, 2024

@dharanui I am using kube2iam. The default max duration is 1 hour, but kube2iam asks for 30-minute temporary credentials.
If you pass iam-role-session-ttl: 1600s then kube2iam will ask for ~53 minutes because of a bug/feature (jtblin/kube2iam#240), see https://www.bluematador.com/blog/iam-access-in-kubernetes-kube2iam-vs-kiam
If you need more time, you need to set the max session duration higher.
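
For reference, a minimal sketch of where that TTL would be passed, assuming kube2iam exposes an --iam-role-session-ttl flag on its DaemonSet; the exact flag name and accepted duration format should be verified against the kube2iam version in use:

# kube2iam DaemonSet (sketch) - container args
spec:
  template:
    spec:
      containers:
      - name: kube2iam
        args:
        - --iam-role-session-ttl=3600s   # assumed flag; the granted TTL may be shorter (see jtblin/kube2iam#240)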

@dharanui
Author

Hi @Lyndon-Li / @SCLogo, any idea when this will be fixed so that we can make it work with IRSA?
