external-snapshotter constantly retrying CreateSnapshot calls on erro… #651

Merged 1 commit on Feb 24, 2022


zhucan
Member

@zhucan zhucan commented Jan 24, 2022

…r w/o backoff

Signed-off-by: zhucan [email protected]

What type of PR is this?
/kind bug

What this PR does / why we need it:
external-snapshotter constantly retrying CreateSnapshot calls on error w/o backoff

Which issue(s) this PR fixes:
Fixes #533

Special notes for your reviewer:
@xing-yang

Does this PR introduce a user-facing change?:

Fix a problem in CSI snapshotter sidecar that constantly retries CreateSnapshot call on error without exponential backoff.

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jan 24, 2022
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 24, 2022
@pohly
Contributor

pohly commented Jan 24, 2022

/skip

I don't know this code.

@xing-yang xing-yang self-assigned this Jan 24, 2022
@zhucan
Member Author

zhucan commented Jan 26, 2022

[Image: test results]
These are the test results. With the default retry interval settings, the retry delays are 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 5m.

@zhucan
Member Author

zhucan commented Feb 14, 2022

/retest

@zhucan
Member Author

zhucan commented Feb 23, 2022

/retest-required

@xing-yang
Collaborator

@zhucan Can you change the release note to the following?

Fix a problem in CSI snapshotter sidecar that constantly retries CreateSnapshot call on error without exponential backoff.

@xing-yang
Collaborator

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 23, 2022
@xing-yang
Collaborator

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xing-yang, zhucan


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2022
@k8s-ci-robot k8s-ci-robot merged commit f320a80 into kubernetes-csi:master Feb 24, 2022
@anshulahuja98

@xing-yang can we take a release of the external snapshotter with this fix included?

@pwschuurman
Contributor

I think this change has the unintended effect of not updating if the CSI driver returns success. I don't think an update that removes the AnnVolumeSnapshotBeingCreated annotation is indicative of a failure. The annotation can be removed even when the CSI driver's CreateSnapshot call does not return a failure: https://github.com/kubernetes-csi/external-snapshotter/blob/master/pkg/sidecar-controller/snapshot_controller.go#L352

This PR is currently breaking some tests for the PD CSI driver (which @amacaskill is planning to bring to testgrid). The effect is that the CSI controller does not get resynced after the AnnVolumeSnapshotBeingCreated annotation is removed, which makes the snapshotter update at the next resync period (default 15 minutes). This time period is longer than how long the test waits (5 minutes). It could also affect a production controller, if using the default resync period.

@xing-yang
Collaborator

@pwschuurman Do you see a 15 minutes delay for a VolumeSnapshot to be created successfully? Does this happen all the time? I wonder why CI didn't catch this.

@pwschuurman
Contributor

@xing-yang For the PD CSI driver, a volume snapshot typically takes <1 minute to become ready. However, with this change, a call to syncContent won't happen until the next controller resync interval, which is 15 minutes by default.

I chatted with @mattcary, and we think a short term fix would be to reintroduce the call to enqueueContentWork in the success case, but keep the logic as is for the error case (by checking the status on the new VolumeSnapshotContent).
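
The proposed short-term fix could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual patch. The types `snapshotContent`, `contentStatus`, and `contentError` are hypothetical, pared-down stand-ins for the real VolumeSnapshotContent API types, and `shouldEnqueueOnUpdate` is a made-up name for the decision the update handler would make before calling enqueueContentWork.

```go
package main

import "fmt"

// contentError, contentStatus, and snapshotContent are hypothetical stand-ins
// for the error and status fields of a VolumeSnapshotContent.
type contentError struct {
	Message string
}

type contentStatus struct {
	Error *contentError
}

type snapshotContent struct {
	Name   string
	Status *contentStatus
}

// shouldEnqueueOnUpdate (hypothetical) reports whether an update event should
// put the content back on the work queue right away.
//   - No error recorded in the new object's status: enqueue immediately, so a
//     successful CreateSnapshot is followed up without waiting for a resync.
//   - Error recorded: skip the immediate enqueue and leave the retry to the
//     rate-limited (exponential backoff) path.
func shouldEnqueueOnUpdate(newContent *snapshotContent) bool {
	if newContent == nil {
		return false
	}
	if newContent.Status != nil && newContent.Status.Error != nil {
		return false
	}
	return true
}

func main() {
	ok := &snapshotContent{Name: "content-ok", Status: &contentStatus{}}
	failed := &snapshotContent{
		Name:   "content-failed",
		Status: &contentStatus{Error: &contentError{Message: "CreateSnapshot failed"}},
	}
	for _, c := range []*snapshotContent{ok, failed} {
		fmt.Printf("%s: enqueue immediately = %t\n", c.Name, shouldEnqueueOnUpdate(c))
	}
}
```

Keying the decision on the status of the new object, rather than on the removal of the AnnVolumeSnapshotBeingCreated annotation, matches the observation above that the annotation can be removed in the success case too.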

@xing-yang
Collaborator

@pwschuurman That makes sense. Do you want to submit a fix?

@pwschuurman
Contributor

@xing-yang Yes, I can send out a PR to fix.
