
Implement Cursor-Based Pagination in ListRepositories Endpoint #2097

Merged
merged 1 commit into mindersec:main on Jan 17, 2024

Conversation

Vyom-Yadav
Member

@Vyom-Yadav Vyom-Yadav commented Jan 9, 2024

Fixes #520

Major Changes:

  • Implemented Cursor-Based Pagination in ListRepositories Endpoint
  • Added UTs for encoding/decoding cursors
  • API supports fetch-all/non-paginated queries too

@Vyom-Yadav Vyom-Yadav requested a review from a team as a code owner January 9, 2024 18:46
@Vyom-Yadav
Member Author

It would have been awesome if there were some setup to perform e2e tests, or even to test the endpoints against an actual DB. Is this under discussion, or is anybody working on it?

@@ -281,6 +281,7 @@ CREATE INDEX idx_roles_project_id ON roles(project_id);
CREATE UNIQUE INDEX roles_organization_id_name_lower_idx ON roles (organization_id, LOWER(name));
CREATE INDEX idx_provider_access_tokens_project_id ON provider_access_tokens(project_id);
CREATE UNIQUE INDEX repositories_repo_id_idx ON repositories(repo_id);
CREATE UNIQUE INDEX repositories_cursor_pagination_idx ON repositories(project_id, provider, repo_id);
Contributor

(I haven't really read the PR in full, just replying to this one bit for now..)

The reason the tests are failing is that whenever you change the DB schema, you need to add two new migration files rather than modify the old ones. Currently the highest-numbered migration is 11, so you'd add 12. Also, one thing that bit me in the beginning: you need to add both the up and the down migration, even if the down is empty.

Member Author

I see. Added new migration files.

ORDER BY repo_name;
AND (repo_id > sqlc.narg('repo_id') OR sqlc.narg('repo_id') IS NULL)
ORDER BY project_id, provider, repo_id
LIMIT sqlc.narg('limit');
Contributor

This is an interesting approach - did you consider using OFFSET and LIMIT instead?

Member Author

Yes, I considered it. The main reason for not going with offset/limit pagination is that it is slower than cursor-based pagination, and records can be skipped (or returned twice) if rows are inserted or deleted between page fetches.

The downside to cursor-based pagination is that we cannot jump to an arbitrary page directly. Instead, it behaves like an infinite scroll, where more records are fetched as the user scrolls through.

Member

In addition to @Vyom-Yadav 's comments, offset-based queries require fetching all the items from the index before the offset, while using a cursor allows you to seek into the index and then march forward through the b-tree without needing to traverse the earlier entries (i.e. avoiding Schlemiel the Painter's Algorithm).

Now, I suspect we won't get into the really big numbers, but if we had 2000 repos at 100 per page, that would be scanning 100 + 200 + 300 + 400 ... + 2000 index entries = 21000 total index entries with offset and limit, vs 2000 with this query.
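To make that arithmetic concrete, here is a small sketch (using the hypothetical numbers from the comment above, not anything in the PR) that tallies the index entries visited by each strategy:

```go
package main

import "fmt"

// offsetScanCost returns the total index entries visited when paging
// through n rows with OFFSET/LIMIT at the given page size: the query
// for page k has to re-read all k*pageSize rows before its offset.
func offsetScanCost(n, pageSize int) int {
	total := 0
	for fetched := pageSize; fetched <= n; fetched += pageSize {
		total += fetched // 100 + 200 + ... + n
	}
	return total
}

func main() {
	// The example above: 2000 repos, 100 per page.
	fmt.Println(offsetScanCost(2000, 100)) // 21000 entries with OFFSET/LIMIT
	// A cursor seeks straight to each page's start, so every index
	// entry is visited exactly once: 2000 entries total.
}
```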

Contributor

I'm actually fine with this approach, only thing that threw me off was the project + provider + repo_id index, I would have thought project + repo was sufficient. But I'm also flexible and willing to iterate.

Contributor

@evankanderson @Vyom-Yadav thank you for the explanation of the cursor-based search. I don't have any more comments, so feel free to ack once you're happy with the PR.

Member Author

only thing that threw me off was the project + provider + repo_id index, I would have thought project + repo was sufficient.

The reason for that is we have to make sure the cursor is valid. A project + repo cursor would work right now because GitHub is the only provider, but if we had GitLab too (for example), then differentiating the cursor would not be possible. Basically, we have to encode the previous query parameters in the cursor for correct retrieval.
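The idea of encoding all the query parameters into the token can be sketched roughly like this (a minimal illustration with a delimiter-joined, base64-encoded payload; the delimiter and function names are illustrative, not the PR's actual helpers):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

const cursorDelimiter = "," // illustrative; the PR defines its own

// encodeCursor packs every query parameter into the cursor, so a
// follow-up page request provably resumes the *same* query
// (same project, same provider), not just "some repo id".
func encodeCursor(projectID, provider string, repoID int32) string {
	raw := fmt.Sprintf("%s%s%s%s%d",
		projectID, cursorDelimiter, provider, cursorDelimiter, repoID)
	return base64.StdEncoding.EncodeToString([]byte(raw))
}

func decodeCursor(cursor string) (projectID, provider string, repoID int32, err error) {
	raw, err := base64.StdEncoding.DecodeString(cursor)
	if err != nil {
		return "", "", 0, err
	}
	parts := strings.Split(string(raw), cursorDelimiter)
	if len(parts) != 3 {
		return "", "", 0, fmt.Errorf("invalid cursor")
	}
	var id int64
	if _, err := fmt.Sscanf(parts[2], "%d", &id); err != nil {
		return "", "", 0, err
	}
	return parts[0], parts[1], int32(id), nil
}

func main() {
	c := encodeCursor("proj-1", "github", 42)
	p, prov, id, _ := decodeCursor(c)
	fmt.Println(p, prov, id) // proj-1 github 42
}
```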

Contributor

That's a good point!

@JAORMX JAORMX requested a review from evankanderson January 10, 2024 08:50
@Vyom-Yadav Vyom-Yadav force-pushed the issue-520 branch 3 times, most recently from 24f9c50 to d8bef5e Compare January 11, 2024 16:19
Member

@evankanderson evankanderson left a comment

I know that @JAORMX and @jhrozek have some thoughts as well, so not merging/approving yet.

Comment on lines 172 to 173
limit := sql.NullInt32{Valid: false, Int32: 0}
if in.GetLimit() > 0 {
Member

I suspect that we always want to enforce a limit here, and probably enforce that the limit is e.g. < 100. (Either by clamping the value, or by returning an error if the requested limit is higher.)

Member Author

Good point. I added 100 as the max limit when a client passes a limit greater than 0.
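A minimal sketch of the clamping approach suggested above, assuming the `sql.NullInt32` limit from the snippet under review (the cap's name and value of 100 are taken from the discussion; the helper is illustrative):

```go
package main

import (
	"database/sql"
	"fmt"
)

const maxFetchLimit = 100 // server-side cap discussed in the review

// toSQLLimit maps a client-requested limit to the query's nullable
// limit: <= 0 means "fetch all" (NULL limit), and positive values
// are clamped to the cap rather than rejected with an error.
func toSQLLimit(reqLimit int32) sql.NullInt32 {
	if reqLimit <= 0 {
		return sql.NullInt32{Valid: false}
	}
	if reqLimit > maxFetchLimit {
		reqLimit = maxFetchLimit
	}
	return sql.NullInt32{Valid: true, Int32: reqLimit}
}

func main() {
	fmt.Println(toSQLLimit(250)) // clamped to the cap: {100 true}
	fmt.Println(toSQLLimit(0))   // fetch-all: {0 false}
}
```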

Comment on lines 205 to 221
if len(repos) > 0 {
lastScannedRepoId := repos[len(repos)-1].RepoID
resp.Cursor, err = encodeListRepositoriesByProjectIDCursor(projectID.String(), provider.Name, lastScannedRepoId)
if err != nil {
return nil, util.UserVisibleError(codes.InvalidArgument, err.Error())
}
}
Member

Do we want to fetch limit+1 and only add a cursor if we actually got the +1 row? I'm mixed here -- it's slightly nicer for clients if you know that you can stop iterating when cursor is empty, and you don't need to worry about the "returned an empty array" case, but on the other hand, it makes this code slightly more annoying.

Member Author

Yeah, let's do limit + 1. Slack's API uses the same pattern.

Context context = 5;
// cursor is the base64 encoded cursor to start listing from. Format base64.encode(project_id,provider,repo_id)
Member

I don't think we should document the format of the cursor, as we could change it in the future (and that should be okay). It's simply an opaque value.

We may want to document when cursor is returned, and how to know you've reached the end of the set of responses.

Member Author

Done. Added a comment to the returned cursor.

Comment on lines 117 to 135
key, err := util.DecodeCursor(cursor)
if err != nil {
return "", "", 0, err
}

keyArr := strings.Split(key, cursorDelimiter)
if len(keyArr) != 3 {
return "", "", 0, fmt.Errorf("invalid cursor")
}

parsedRepoId, err := strconv.ParseInt(keyArr[2], 10, 32)
if err != nil {
return "", "", 0, err
}

projectId = keyArr[0]
provider = keyArr[1]

repoId = int32(parsedRepoId)
Member

Another option here is to make a small struct or protobuf and then serialize that. In fact, you could:

type RepoCursor struct {
	ProjectId string
	Provider  string
	RepoId    int
}

func NewRepoCursor(encoded string) (*RepoCursor, error) {
	// ... decode here
}

// String implements fmt.Stringer
func (c *RepoCursor) String() string {
	if c == nil || c.ProjectId == "" || c.Provider == "" || c.RepoId == 0 {
		return ""
	}
	// Do the encode in this method, so all the encoding/decoding
	// has to happen in this file.
}

Note that we want to encapsulate knowledge of how a RepoCursor is serialized within these two methods, and not leak that implementation elsewhere, so we can change it later if we want. (For example, we could compress the string and add an integrity SHA hash later if we were worried about people messing with the cursor tokens.)
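The "integrity SHA hash" idea floated in passing could be sketched like this (purely illustrative: an HMAC-signed token so tampered cursors are rejected before parsing; the key handling and names are assumptions, not anything in the PR):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

var cursorKey = []byte("server-secret") // illustrative; a real key comes from config

// signCursor prepends an HMAC-SHA256 tag to the payload, so the
// server can later detect clients that edited their cursor tokens.
func signCursor(payload string) string {
	mac := hmac.New(sha256.New, cursorKey)
	mac.Write([]byte(payload))
	return base64.RawURLEncoding.EncodeToString(append(mac.Sum(nil), payload...))
}

// verifyCursor checks the tag in constant time and returns the
// payload only if it is intact.
func verifyCursor(token string) (string, bool) {
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil || len(raw) < sha256.Size {
		return "", false
	}
	sig, payload := raw[:sha256.Size], raw[sha256.Size:]
	mac := hmac.New(sha256.New, cursorKey)
	mac.Write(payload)
	if !hmac.Equal(sig, mac.Sum(nil)) {
		return "", false
	}
	return string(payload), true
}

func main() {
	tok := signCursor("proj,github,42")
	p, ok := verifyCursor(tok)
	fmt.Println(p, ok) // proj,github,42 true
}
```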

Member Author

Valid points and a better pattern. Defined a struct for the cursor.

@Vyom-Yadav Vyom-Yadav force-pushed the issue-520 branch 6 times, most recently from 612c836 to e4391a3 Compare January 13, 2024 08:55
@jhrozek
Contributor

jhrozek commented Jan 17, 2024

@Vyom-Yadav would you mind rebasing atop the current master? Sorry about that; there have been a lot of changes touching the RPC lately. Normally you only need to run make gen and then commit the new autogenerated code.

@Vyom-Yadav
Member Author

@jhrozek Done.

Contributor

@jhrozek jhrozek left a comment

The code looks good, and a quick test with repo list works as well (also with an old client).

@jhrozek jhrozek merged commit e88e4b2 into mindersec:main Jan 17, 2024
19 checks passed
Successfully merging this pull request may close these issues.

proto: consider using continuation tokens rather than offset
4 participants