
Add pagination support to clients #532

Open
ncdc opened this issue Jul 23, 2019 · 24 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Milestone

Comments

@ncdc
Contributor

ncdc commented Jul 23, 2019

Once #341 merges, it would be nice to add implicit support for list pagination to both typedClient and unstructuredClient.

I'd be happy to take a stab at this.

@DirectXMan12
Contributor

/kind feature
/priority backlog

sure, sounds good

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jul 24, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 22, 2019
@DirectXMan12
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 5, 2019
DirectXMan12 pushed a commit that referenced this issue Jan 31, 2020
make webhook scaffolding work with old project
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 3, 2020
@vincepri
Member

vincepri commented Feb 3, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 3, 2020
@vincepri vincepri added this to the v0.5.x milestone Feb 21, 2020
@vincepri
Member

vincepri commented Feb 21, 2020

/help
/kind design
/priority important-longterm

We'll need to provide a new method called something like ListPages that can be given a function and calls it for each page
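
A hypothetical sketch of what such a helper could look like, layered on the existing client.List with the Limit/Continue options (ListPages is not an existing controller-runtime API; the name and signature below are illustrative only):

// Hypothetical ListPages helper (sketch only). Uses "context" and
// "sigs.k8s.io/controller-runtime/pkg/client".
func ListPages(ctx context.Context, c client.Client, list client.ObjectList, pageSize int64, fn func(client.ObjectList) error, opts ...client.ListOption) error {
	listOpts := &client.ListOptions{Limit: pageSize}
	listOpts.ApplyOptions(opts)
	for {
		if err := c.List(ctx, list, listOpts); err != nil {
			return err
		}
		if err := fn(list); err != nil {
			return err
		}
		// The continue token is carried on the list's metadata; empty means last page.
		if list.GetContinue() == "" {
			return nil
		}
		listOpts.Continue = list.GetContinue()
	}
}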

@k8s-ci-robot
Contributor

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help
/kind design
/priority important-longterm

We'll need to provide a new method called something like ListPages that can be given a function

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/design Categorizes issue or PR as related to design. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 21, 2020
@pmalek

pmalek commented May 24, 2022

Hi all,

I'm writing with regard to the issue (linked above) that would benefit from the introduction of pagination as a configurable option in the client's implementation (that's where I understand this is needed; if I'm mistaken, please correct me).

I can see that there hasn't been much traction on this issue, and I'm willing to help if there's a will to move forward with it.

I'm kinda new to controller-runtime, so any tips on where the actual change needs to happen are welcome.

@FillZpp
Contributor

FillZpp commented May 27, 2022

I'm writing with regard to the issue (Kong/kubernetes-ingress-controller#2382) that would benefit from the introduction of pagination as a configurable option in the client's implementation (that's where I understand this is needed; if I'm mistaken, please correct me).

Hi @pmalek, I looked through the issue you mentioned. IMHO it is different from this issue.

This issue, as @vincepri said, is about adding a new method called something like ListPages, which is intended for people who need to list from the apiserver directly with the controller-runtime client.

However, Kong/kubernetes-ingress-controller#2382, which you mentioned, uses the controller-runtime cache, which actually calls the client-go informer to list and watch objects. Also, I don't think it can be solved by a paged list: for now the reflector already uses resourceVersion=0 to read objects only from the kube-apiserver cache, but it would have to read from etcd for a paged list, which will probably make the time cost worse.

@pmalek-sumo

Hi @FillZpp 👋

Thanks for your response.

So what you're saying is that even if we'd like to make a patch introducing pagination to the initial List, either 1) it's impossible, or 2) it would not solve the issue? (If so, why?)

... but it will be changed to read from etcd for page list, which will probably make the time cost worse.

By this, do you mean that there's a plan to change the implementation sometime soon?

@FillZpp
Contributor

FillZpp commented May 27, 2022

@pmalek-sumo Aha, I'm making two points:

  1. What this issue is going to do is NOT enable paged lists in the cache (informer), so it has no relationship with Kong/kubernetes-ingress-controller#2382 (Kong Ingress Controller cannot scale beyond a limit).

  2. Setting this issue aside and looking only at Kong/kubernetes-ingress-controller#2382 (Kong Ingress Controller cannot scale beyond a limit), I don't think enabling paged lists in the informer can solve its problem, because:

    1. Since client-go defaults to setting resourceVersion="0" for the initial list (see), it is meaningless to set WatchListPageSize: for resourceVersion="0" the apiserver will always return all objects from its cache without a page limit (see the sketch after this list for where that field lives).
    2. Even if you hack into client-go to set resourceVersion="", which does support paged lists, the apiserver then has to read the objects from etcd instead of from its cache, which will actually make the list request take longer.
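
For reference, the knob mentioned above is an exported field on client-go's reflector; a minimal sketch (assumptions: a plain clientset and a hand-built reflector, since controller-runtime does not currently expose this field) of where it would be set:

// Sketch only: builds a reflector by hand just to show where client-go
// exposes the list page size. Uses corev1 "k8s.io/api/core/v1",
// "k8s.io/apimachinery/pkg/fields", "k8s.io/client-go/kubernetes" and
// "k8s.io/client-go/tools/cache".
func newPagedSecretReflector(cs kubernetes.Interface, namespace string) *cache.Reflector {
	lw := cache.NewListWatchFromClient(cs.CoreV1().RESTClient(), "secrets", namespace, fields.Everything())
	r := cache.NewReflector(lw, &corev1.Secret{}, cache.NewStore(cache.MetaNamespaceKeyFunc), 0)
	// The field exists, but as noted above it is effectively ignored while the
	// initial list is served with resourceVersion="0" from the apiserver cache.
	r.WatchListPageSize = 500
	return r
}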

@shaneutt
Member

@pmalek-sumo Aha, I'm making two points:

1. What this issue is going to do is NOT enable paged lists in the cache (informer), so it has no relationship with [Kong Ingress Controller cannot scale beyond a limit Kong/kubernetes-ingress-controller#2382](https://github.com/Kong/kubernetes-ingress-controller/issues/2382).

2. Setting this issue aside and looking only at [Kong Ingress Controller cannot scale beyond a limit Kong/kubernetes-ingress-controller#2382](https://github.com/Kong/kubernetes-ingress-controller/issues/2382), I don't think enabling paged lists in the informer can solve its problem, because:
   
   1. Since client-go defaults to setting `resourceVersion="0"` for the initial list ([see](https://github.com/kubernetes/client-go/blob/master/tools/cache/reflector.go#L572)), it is meaningless to set `WatchListPageSize`: for `resourceVersion="0"` the apiserver will always return all objects from its cache without a page limit.
   2. Even if you hack into client-go to set `resourceVersion=""`, which does support paged lists, the apiserver then has to read the objects from etcd instead of from its cache, which will actually make the list request take longer.

Other than changing the namespaces that are watched, do you have any suggestions for how we can make this initial listing of objects faster, or break it up so that it doesn't trip the apiserver limits?

@FillZpp
Contributor

FillZpp commented May 27, 2022

@pmalek-sumo @shaneutt I read your issue again; I'm not sure what actually caused the stream error when reading response body, may be caused by closed connection error from the apiserver. Have you asked people from sig-api-machinery who are familiar with the apiserver? Could it be that the apiserver doesn't have enough memory?

If it is caused by the apiserver memory limit, you may try a paged list with the client first and see if it solves your problem (even though it gets objects from etcd).

secretList := v1.SecretList{}
// 1. list with resourceVersion="0", so that all objects come from the apiserver cache
cli.List(context.TODO(), &secretList, &client.ListOptions{Raw: &metav1.ListOptions{ResourceVersion: "0"}})
// 2. list with a limit, so that a limited number of objects comes from etcd
cli.List(context.TODO(), &secretList, &client.ListOptions{Limit: 500})

If this works for you, I would like to help you guys find out how to enable the initial list with a page limit in client-go and controller-runtime :)
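
As a side note, client-go already ships a generic pager (k8s.io/client-go/tools/pager) that drives Limit/Continue for you; a rough sketch of paging through secrets with a plain clientset (sketch only; setup and the PageSize value are arbitrary assumptions):

// Paged list of secrets via client-go's pager helper (sketch).
// Uses "context", "fmt", corev1 "k8s.io/api/core/v1",
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/apimachinery/pkg/runtime",
// "k8s.io/client-go/kubernetes" and "k8s.io/client-go/tools/pager".
func pagedSecretList(ctx context.Context, cs kubernetes.Interface, namespace string) error {
	p := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
		return cs.CoreV1().Secrets(namespace).List(ctx, opts)
	}))
	p.PageSize = 500 // objects per request
	return p.EachListItem(ctx, metav1.ListOptions{}, func(obj runtime.Object) error {
		secret := obj.(*corev1.Secret)
		fmt.Println(secret.Name)
		return nil
	})
}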

@pmalek

pmalek commented May 27, 2022

Thanks @FillZpp.

Have you asked people from sig-api-machinery who are familiar with the apiserver?

We have not, yet.

Could it be that the apiserver doesn't have enough memory?

Memory consumption might also be an issue, since the OP in Kong/kubernetes-ingress-controller#2382 did mention OOMs, but the general consensus was that the culprit is most likely the long processing time and hence the timeout.

If it is caused by the apiserver memory limit, you may try a paged list with the client first and see if it solves your problem (even though it gets objects from etcd).

Do you suggest, then, that we basically create a client from client-go and try to list secrets (on a cluster where there are enough of them to cause an issue) as you've indicated? Or would that not be the same?

@FillZpp
Contributor

FillZpp commented May 28, 2022

but the general consensus was that the culprit is most likely the long processing time and hence the timeout.

Your controller quitting might be caused by the cache sync timeout, but that in turn was caused by the list request error from the apiserver. BTW, have you tried setting CacheSyncTimeout to a longer value? I'm afraid the timeout will always be there, though, if we don't solve the list problem.

Do you suggest, then, that we basically create a client from client-go and try to list secrets (on a cluster where there are enough of them to cause an issue) as you've indicated? Or would that not be the same?

Yeah, you could reproduce the informer's initial list with client-go or the controller-runtime client (on a cluster where there are enough objects to cause the issue), which can help us figure out whether a paged list can solve the problem or not.

I gave the simple example with the controller-runtime client, but it's OK to use client-go. Besides, you'd better use a loop to page through all the objects in batches, like this:

// 2. list with a limit, so that objects come from etcd in batches
startTime := time.Now()
opts := &client.ListOptions{Limit: 500}
for {
	if err := cli.List(context.TODO(), &secretList, opts); err != nil {
		return err
	}
	klog.Infof("Get objects number: %v", len(secretList.Items))

	opts.Continue = secretList.Continue
	// there are no more objects
	if secretList.Continue == "" {
		break
	}
}
klog.Infof("Time cost for page list: %v", time.Since(startTime))

@pmalek

pmalek commented May 30, 2022

Alright, so I was able to prepare a script to create/delete secrets and then list them with controller-runtime's client (basically what @FillZpp proposed).

Listing 256 secrets in the namespace, each with a 1MB payload, I got the following results:

# no pagination

$ go build -o main && ./main
I0530 14:45:10.203416   50289 main.go:29] Starting to list secrets
I0530 14:45:22.895777   50289 main.go:43] Get objects number: 256
I0530 14:45:22.895815   50289 main.go:51] Time cost for page list: 12.692023625s
Execution time: 0h:00m:14s sec

# pagination using limit=100

$ go build -o main && ./main
I0530 14:45:29.759525   50318 main.go:29] Starting to list secrets
I0530 14:45:34.363228   50318 main.go:43] Get objects number: 100
I0530 14:45:38.881193   50318 main.go:43] Get objects number: 100
I0530 14:45:41.502130   50318 main.go:43] Get objects number: 56
I0530 14:45:41.502155   50318 main.go:51] Time cost for page list: 11.742369583s

Using 350 secrets:

# pagination using limit=100

$ go build -o main && ./main
I0530 14:59:05.627028   50837 main.go:29] Starting to list secrets
I0530 14:59:21.869458   50837 main.go:43] Get objects number: 100
I0530 14:59:31.968267   50837 main.go:43] Get objects number: 100
I0530 14:59:38.376587   50837 main.go:43] Get objects number: 100
I0530 14:59:41.561738   50837 main.go:43] Get objects number: 50
I0530 14:59:41.561762   50837 main.go:51] Time cost for page list: 35.934283084s
Execution time: 0h:00m:36s sec

# no pagination

$ go build -o main && ./main
I0530 14:59:54.457833   50929 main.go:29] Starting to list secrets
I0530 15:00:25.470387   50929 main.go:43] Get objects number: 350
I0530 15:00:25.470412   50929 main.go:51] Time cost for page list: 31.012180292s

Secrets can be created as follows (using the secret creation script):

./main -secret-count 256 -payload-size $((1024*1024))

What I've also observed is that when I create a few more secrets (e.g. 350 or 500, again with 1MB of payload each), my kind cluster becomes unresponsive, eating up a lot of resources. It seems that kube-apiserver contributes a lot to that:

...
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  10101 root      20   0 5383416   4.2g  32752 S 143.0  43.0   2:30.45 kube-apiserver
  10107 root      20   0   13.9g   3.4g 194900 S   3.0  34.9   0:33.83 etcd
  10091 root      20   0 2186904   1.1g  16812 S   3.3  11.4   0:19.30 kube-controller

Secret creation script:

package main

import (
	"context"
	"errors"
	"flag"
	"fmt"

	"github.com/gammazero/workerpool"
	corev1 "k8s.io/api/core/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/klog/v2"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	var (
		secretCount int
		payloadSize int
		namespace   string
		clean       bool
	)
	flag.IntVar(&secretCount, "secret-count", 1, "number of secrets to generate")
	flag.IntVar(&payloadSize, "payload-size", 128, "payload size in bytes")
	flag.StringVar(&namespace, "namespace", "secrets", "namespace name")
	flag.BoolVar(&clean, "clean", false, "whether to clean afterwards")

	flag.Parse()

	cli, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}

	if clean {
		defer func() {
			err := cli.DeleteAllOf(context.Background(),
				&corev1.Secret{},
				client.InNamespace(namespace),
				client.MatchingLabels{"secrets": "experiment"},
			)
			if err != nil {
				klog.ErrorS(err, "error deleting secrets")
			}
		}()
	}

	wp := workerpool.New(16)
	for i := 0; i < secretCount; i++ {
		i := i
		wp.Submit(func() {
			secretName := fmt.Sprintf("secret-%d", i)
			klog.Infof("handling %s...\n", secretName)
			secret := corev1.Secret{
				ObjectMeta: metav1.ObjectMeta{
					Name:      secretName,
					Namespace: namespace,
					Labels: map[string]string{
						"secrets": "experiment",
					},
				},

				Data: map[string][]byte{
					"payload": randomHex(payloadSize / 2), // div by 2 because we encode in hex
				},
			}

			klog.InfoS("creating secret...", "secret", secretName)
			err := cli.Create(context.Background(), &secret, &client.CreateOptions{})
			if err != nil {
				var statusErr *k8serrors.StatusError
				if errors.As(err, &statusErr) {

					if statusErr.ErrStatus.Reason == metav1.StatusReasonAlreadyExists {
						klog.InfoS("secret already exists, updating it...", "secret", secretName)
						err = cli.Update(context.Background(), &secret, &client.UpdateOptions{})
						if err != nil {
							klog.ErrorS(err, "error updating secret", "secret", secretName)
						}
						return
					}

					klog.ErrorS(err, "error creating secret", "secret", secretName)
					return
				}

				panic(err)
			}
		})
	}

	wp.StopWait()
}
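
(The paste above does not include the randomHex helper the script calls; a minimal version, assuming it simply hex-encodes n random bytes so the resulting payload is 2*n bytes long, could be:)

// Assumed helper, not part of the original paste. Requires "crypto/rand"
// and "encoding/hex" in the import block above.
func randomHex(n int) []byte {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	dst := make([]byte, hex.EncodedLen(len(b)))
	hex.Encode(dst, b)
	return dst
}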

Listing script:

package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	var namespace string
	flag.StringVar(&namespace, "namespace", "secrets", "namespace name")
	flag.Parse()

	cli, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		fmt.Println("failed to create client")
		os.Exit(1)
	}

	secretList := corev1.SecretList{}

	klog.Infof("Starting to list secrets")

	// 2. list with a limit, so that objects come from etcd in batches
	startTime := time.Now()
	opts := &client.ListOptions{
		Limit:     100,
		Namespace: namespace,
	}
	for {
		err = cli.List(context.TODO(), &secretList, opts)
		if err != nil {
			klog.ErrorS(err, "Failed listing secrets")
			os.Exit(1)
		}
		klog.Infof("Get objects number: %v", len(secretList.Items))

		opts.Continue = secretList.Continue
		// there are no more objects
		if secretList.Continue == "" {
			break
		}
	}
	klog.Infof("Time cost for page list: %v", time.Since(startTime))
}

@pmalek

pmalek commented May 30, 2022

Anyway, since it was already pointed out that what I'm describing might be a different issue, let me summarize what I wrote above:

  • pagination doesn't seem to have a big impact on the time it takes to list secrets
  • what seems to be causing a problem in general is lots of secrets with big payloads, which overall slows down the apiserver

If you feel that, based on what I wrote above, we should create a new issue, I'll gladly do that and we can take it from there (to stop hijacking this one).

@FillZpp
Contributor

FillZpp commented Jun 1, 2022

What I've also observed is that when I create a few more secrets (e.g. 350 or 500, again with 1MB of payload each), my kind cluster becomes unresponsive, eating up a lot of resources. It seems that kube-apiserver contributes a lot to that:

@pmalek So it is caused by the resource limits of kube-apiserver, which means you cannot handle it whether you paginate or not?

If you feel that, based on what I wrote above, we should create a new issue, I'll gladly do that and we can take it from there (to stop hijacking this one).

Yeah, you can create a new issue for supporting an optional paged list when the cache initializes. It is also related to client-go, which has not yet provided any external option to let people enable paged lists.

@pmalek

pmalek commented Jun 9, 2022

So it is caused by the resource limits of kube-apiserver, which means you cannot handle it whether you paginate or not?

I don't know enough to weigh in here, but it might be that handling Secrets is not as efficient as it could be? Maybe that's a known fact and expected behavior; unfortunately, I don't know whether that's the case.

I'm not actively working on the issue, but if I find some time and enough context for this problem, I will raise a separate issue for it.

@krish7919

what seems to be causing a problem in general is lots of secrets with big payloads, which overall slows down the apiserver

I would disagree with this statement, as I can observe that even numerous Secrets with smaller payloads will cause this issue. However, I cannot verify that it is CPU pressure.
The only data point I have on CPU right now is that we run on EKS, which should typically autoscale the control plane when CPU thresholds are crossed, but this doesn't happen for us. My hunch is still that this is a pagination or timeout issue somewhere.

@krish7919

Yeah, you can create a new issue for supporting an optional paged list when the cache initializes. It is also related to client-go, which has not yet provided any external option to let people enable paged lists.

I have tried this in a hack-ish way by hardcoding the pageSize to 10, rebuilding the controller I use (the kubernetes-ingress-controller for Kong) in a KinD cluster, and this doesn't work either.

@FillZpp
Contributor

FillZpp commented Jun 17, 2022

I have tried this in a hack-ish way by hardcoding the pageSize to 10, rebuilding the controller I use (the kubernetes-ingress-controller for Kong) in a KinD cluster, and this doesn't work either.

@krish7919 I'm not sure whether you hardcoded client-go correctly, but you can simply test it as I suggested ((1) (2)) in your cluster. It cleanly reproduces the informer's initial list.

@krish7919

@FillZpp @pmalek I have posted my analysis in Kong/kubernetes-ingress-controller#2382 (comment) if you want to take a look.

@dbenque
Contributor

dbenque commented Jul 26, 2024

I had the same kind of error, stream error when reading response body, may be caused by closed connection, in a context where:

  • I was using an informer on configmaps
  • The cluster contains a lot of configmaps (huge ones... thanks helm)
  • I was running a controller on my laptop against a remote kubernetes cluster behind a slow connection

The error was raised by the controller while it was doing its initial listing of the configmaps. This is done by a lister that uses the parameter resourceVersion=0, which implies that the API server won't use pagination (see this issue). As a consequence, the API server tries to send all the ConfigMaps, with all their content, in one message.
In my case, one misleading fact was that the audit logs were showing success, answering a 200 status code for the List request, which at first led me to think the problem was on the client side... but we will see that it was not.

I followed this thread and the proposal by @FillZpp here.
I used client-go directly (not the controller-runtime client) in a dedicated test program to list all the configmaps with resourceVersion=0, and I could reproduce the problem. So controller-runtime was not the source of the problem. Looking in the client-go code, I ended up at that line.

case http2.StreamError:
	// This is trying to catch the scenario that the server may close the connection when sending the
	// response body. This can be caused by server timeout due to a slow network connection.

Because of this, I turned my investigation back to the API server side. I noticed that the error was raised exactly 1 minute after the List call. I found the --request-timeout parameter on the apiserver side:

--request-timeout duration     Default: 1m0s
    An optional field indicating the duration a handler must keep a request open before timing it out. This is the default request timeout for requests but may be overridden by flags such as --min-request-timeout for specific types of requests.

So I restarted the API servers of my cluster with this parameter set to 5m and launched the test again: no more errors; after about 2 minutes, the full list of configmaps was available on my client side.

So in my case, the stream error when reading response body, may be caused by closed connection was not a client-side issue, nor an API server problem in the sense that it was able to build the reply. Here the problem is a slow connection that leads to a timeout, with the server side closing the connection before the entire message can be streamed to the client.

Wondering if KEP-3157-watch-list could help in that particular case... because just bumping the timeout on apiserver is not a great solution.
