Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Snapshot Controller's unbounded VolumeSnapshot list call on startup #1238

Merged

Conversation

AndrewSirenko
Copy link
Contributor

@AndrewSirenko AndrewSirenko commented Dec 9, 2024

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change

/kind bug

/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

As of v6.2.2, the snapshot controller performs a list operation on startup as a way of validating the Snapshot CRD is installed on the cluster. v7.0.0 attempts to put a limit on this list call of 0, but that is treated as an unbounded request by etcd.

For clusters with many VolumeSnapshots, the API Server may struggle to handle an unbounded list request within 60s, leading to continuous restarts in the snapshot controller.

@wongma7 proposes a better longterm solution here, but this PR introduces a short-term fix of properly setting the list response limit to 1.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Within k8s API server, paging gets set to false here because limit <= 0

We pass through limit = 0 to store.getList here

For etcd, limit of 0 allegedly means no limitation documented here

Does this PR introduce a user-facing change?:

Fix unbounded volumesnapshots list call on Snapshot Controller startup

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 9, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 9, 2024
// We do not care about what is returned and just want to make sure the CRDs exist.
listOptions := metav1.ListOptions{Limit: 0}
listOptions := metav1.ListOptions{Limit: 1}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List snapshots is an expensive call. We should try to avoid listing all the snapshots.
The purpose for this call here is to ensure the CRD is installed. We don't really need to list the snapshots.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your question here: https://kubernetes.slack.com/archives/C0EG7JC6T/p1733776266305359
It's odd that {Limit: 0} would return 10,000 entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, at the very least can we wait for an informer sync of volumesnapshots instead of making a list call?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to #504 which added this ensureCustomResourceDefinitionsExist waiting for informer sync was not sufficient bc the controller showed as Ready even before informer had synced, so ensureCustomResourceDefinitionsExist was added to force the controller to immediately crash. So I guess it's not as simple as removing this function , I think the most correct solution would be to implement readyz probe which waited for informer sync. But that is a lot of effort so I think this PR is the best short term fix.

(As for the limit=0 behavior, indeed it is counterintuitive & I could not find where it's documented in apiserver code if at all It seems that since resourceversion is not set in this list call then the lImit=0 option gets plumbed to etcd list call here and it does document that "// If WithLimit is given a 0 limit, it is treated as no limit." https://github.com/etcd-io/etcd/blob/854bdd646c8ce50b879f79f403726c7ab0dc726c/client/v3/op.go#L345. Anyway, the details are unimportant, the point is we observe that limit=0 does the opposite of intended!!)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw that external-provisioner has a healt hz check kubernetes-csi/external-provisioner@52c3575#diff-db1b4e664fc5c6203530702df4e7c6eade0fd481ef5badffbc0bb4a92648d36e. So in short mb snapshot-controller should have also a healthz check which starts serving OK only after informer synced.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed in the CSI Implementation meeting today and decided we should not go with the "health check" fix. If the CRDs are not deployed, informer will returns errors. It's not going to be synced. So the current fix is good.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 10, 2024
@AndrewSirenko AndrewSirenko changed the title [wip] Snapshot Controller startup should not LIST ALL volumesnapshots on startup Snapshot Controller startup should not LIST ALL volumesnapshots on startup Dec 10, 2024
@AndrewSirenko AndrewSirenko marked this pull request as ready for review December 10, 2024 00:13
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 10, 2024
@AndrewSirenko AndrewSirenko changed the title Snapshot Controller startup should not LIST ALL volumesnapshots on startup Fix Snapshot Controller's unbounded VolumeSnapshot list call on startup Dec 10, 2024
@AndrewSirenko
Copy link
Contributor Author

AndrewSirenko commented Dec 10, 2024

Tested on local kops cluster with 10.291k volumesnapshots. Added a log statement:

With this PR (limit=1)
"Found volumesnapshots with" len=1


Without PR limit=0
"Found volumesnapshots with" len=10291

@xing-yang
Copy link
Collaborator

lgtm

@mauriciopoppe
Copy link
Member

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2024
@xing-yang
Copy link
Collaborator

/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndrewSirenko, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit 46e8028 into kubernetes-csi:master Dec 10, 2024
8 checks passed
@AndrewSirenko
Copy link
Contributor Author

/cherrypick release-8.0

@AndrewSirenko
Copy link
Contributor Author

/cherrypick release-7.0

@k8s-infra-cherrypick-robot

@AndrewSirenko: new pull request created: #1240

In response to this:

/cherrypick release-8.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-infra-cherrypick-robot

@AndrewSirenko: new pull request created: #1241

In response to this:

/cherrypick release-7.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@AndrewSirenko
Copy link
Contributor Author

/cherrypick release-8.1

@k8s-infra-cherrypick-robot

@AndrewSirenko: new pull request created: #1242

In response to this:

/cherrypick release-8.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@AndrewSirenko
Copy link
Contributor Author

/cherrypick release-8.1

@k8s-infra-cherrypick-robot

@AndrewSirenko: new pull request could not be created: failed to create pull request against kubernetes-csi/external-snapshotter#release-8.1 from head k8s-infra-cherrypick-robot:cherry-pick-1238-to-release-8.1: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for k8s-infra-cherrypick-robot:cherry-pick-1238-to-release-8.1."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request","status":"422"}

In response to this:

/cherrypick release-8.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants