Fix Snapshot Controller's unbounded VolumeSnapshot list call on startup #1238
Conversation
Skipping CI for Draft Pull Request.
```diff
 // We do not care about what is returned and just want to make sure the CRDs exist.
-listOptions := metav1.ListOptions{Limit: 0}
+listOptions := metav1.ListOptions{Limit: 1}
```
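For context, a minimal sketch of what such a CRD-existence probe can look like, assuming the generated VolumeSnapshot clientset (the import path version and function name here are illustrative, not the controller's exact code):

```go
package crdcheck

import (
	"context"
	"fmt"

	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v8/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ensureVolumeSnapshotCRDExists issues a cheap list call capped at one item.
// If the VolumeSnapshot CRD is not installed, the call returns an error and the
// controller can crash immediately instead of running without its API types.
func ensureVolumeSnapshotCRDExists(ctx context.Context, client snapshotclient.Interface) error {
	listOptions := metav1.ListOptions{Limit: 1}
	if _, err := client.SnapshotV1().VolumeSnapshots(metav1.NamespaceAll).List(ctx, listOptions); err != nil {
		return fmt.Errorf("VolumeSnapshot CRD check failed: %w", err)
	}
	return nil
}
```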
Listing snapshots is an expensive call. We should try to avoid listing all the snapshots.
The purpose for this call here is to ensure the CRD is installed. We don't really need to list the snapshots.
I see your question here: https://kubernetes.slack.com/archives/C0EG7JC6T/p1733776266305359
It's odd that {Limit: 0} would return 10,000 entries.
Agreed, at the very least can we wait for an informer sync of volumesnapshots instead of making a list call?
According to #504, which added this ensureCustomResourceDefinitionsExist call, waiting for informer sync was not sufficient because the controller showed as Ready even before the informer had synced, so ensureCustomResourceDefinitionsExist was added to force the controller to crash immediately. So it's not as simple as removing this function; the most correct solution would be to implement a readyz probe that waits for informer sync. But that is a lot of effort, so I think this PR is the best short-term fix.
(As for the limit=0 behavior, it is indeed counterintuitive, and I could not find where it's documented in the apiserver code, if at all. It seems that since resourceVersion is not set in this list call, the Limit=0 option gets plumbed through to the etcd list call here, and etcd does document that "If WithLimit is given a 0 limit, it is treated as no limit." https://github.com/etcd-io/etcd/blob/854bdd646c8ce50b879f79f403726c7ab0dc726c/client/v3/op.go#L345. Anyway, the details are unimportant; the point is that we observe limit=0 doing the opposite of what was intended!)
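To make that etcd behavior concrete, here is a small, hedged sketch of a raw etcd v3 range read with an explicit limit; this is illustrative only, not apiserver code, and the endpoint and /registry key prefix are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// WithLimit(1) caps the range read at one key; per the etcd docs quoted above,
	// WithLimit(0) would be treated as "no limit" and return the whole prefix.
	resp, err := cli.Get(ctx, "/registry/snapshot.storage.k8s.io/volumesnapshots/",
		clientv3.WithPrefix(), clientv3.WithLimit(1))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d key(s), more=%v, total count=%d\n", len(resp.Kvs), resp.More, resp.Count)
}
```

With WithLimit(0) instead, etcd would return every key under the prefix, which is exactly the unbounded behavior this PR avoids.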
Just saw that external-provisioner has a healthz check: kubernetes-csi/external-provisioner@52c3575#diff-db1b4e664fc5c6203530702df4e7c6eade0fd481ef5badffbc0bb4a92648d36e. So, in short, maybe snapshot-controller should also have a healthz check that starts serving OK only after the informer has synced.
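For reference, a healthz endpoint gated on informer sync could look roughly like the sketch below; this is illustrative only (it is not what this PR or external-provisioner implements verbatim), and the informer wiring and port are assumptions:

```go
package healthz

import (
	"net/http"
	"time"

	"k8s.io/client-go/tools/cache"
)

// newHealthzHandler reports OK only once every provided informer has synced;
// until then it returns 503 so a probe keeps the pod out of rotation or restarts it.
func newHealthzHandler(informersSynced ...cache.InformerSynced) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for _, synced := range informersSynced {
			if !synced() {
				http.Error(w, "informer caches not synced yet", http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})
}

// serveHealthz wires the handler onto :8081/healthz in the background.
func serveHealthz(informersSynced ...cache.InformerSynced) {
	mux := http.NewServeMux()
	mux.Handle("/healthz", newHealthzHandler(informersSynced...))
	srv := &http.Server{Addr: ":8081", Handler: mux, ReadHeaderTimeout: 3 * time.Second}
	go func() { _ = srv.ListenAndServe() }()
}
```

A readinessProbe pointed at that endpoint would then keep the pod NotReady until the caches are warm.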
Discussed in the CSI Implementation meeting today and decided we should not go with the "health check" fix. If the CRDs are not deployed, the informer will return errors and will never sync. So the current fix is good.
Tested on a local kops cluster with 10.291k VolumeSnapshots, adding a log statement to compare the startup list call with this PR (limit=1) and without it (limit=0).
lgtm
/lgtm Thanks!
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: AndrewSirenko, xing-yang. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/cherrypick release-8.0
/cherrypick release-7.0
@AndrewSirenko: new pull request created: #1240
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@AndrewSirenko: new pull request created: #1241
/cherrypick release-8.1
@AndrewSirenko: new pull request created: #1242
/cherrypick release-8.1
@AndrewSirenko: new pull request could not be created: failed to create pull request against kubernetes-csi/external-snapshotter#release-8.1 from head k8s-infra-cherrypick-robot:cherry-pick-1238-to-release-8.1: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for k8s-infra-cherrypick-robot:cherry-pick-1238-to-release-8.1."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request","status":"422"}
What type of PR is this?
/kind bug
What this PR does / why we need it:
As of v6.2.2, the snapshot controller performs a list operation on startup as a way of validating that the Snapshot CRD is installed on the cluster. v7.0.0 attempts to put a limit of 0 on this list call, but a limit of 0 is treated as an unbounded request by etcd.
For clusters with many VolumeSnapshots, the API server may struggle to handle an unbounded list request within 60s, leading to continuous restarts of the snapshot controller.
@wongma7 proposes a better long-term solution here, but this PR introduces a short-term fix by properly setting the list response limit to 1.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Within the k8s API server, paging gets set to false here because limit <= 0
We pass through limit = 0 to store.getList here
For etcd, a limit of 0 means no limitation, as documented here; a bounded alternative using Limit and Continue is sketched below
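Relatedly, when a caller genuinely does need every object on a large cluster, the safer pattern than a single unbounded list is to page with Limit and Continue. A hedged sketch using the same assumed VolumeSnapshot clientset as in the earlier sketch (page size and package name are illustrative):

```go
package snapshotutil

import (
	"context"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v8/apis/volumesnapshot/v1"
	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v8/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// listAllVolumeSnapshotsPaged pages through VolumeSnapshots 500 at a time so no
// single request asks the apiserver (and etcd) for an unbounded result set.
func listAllVolumeSnapshotsPaged(ctx context.Context, client snapshotclient.Interface) ([]snapshotv1.VolumeSnapshot, error) {
	var all []snapshotv1.VolumeSnapshot
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := client.SnapshotV1().VolumeSnapshots(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			return nil, err
		}
		all = append(all, list.Items...)
		if list.Continue == "" {
			return all, nil
		}
		opts.Continue = list.Continue
	}
}
```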
Does this PR introduce a user-facing change?: