
Implement Cursor-Based Pagination in ListRepositories Endpoint #2097

Merged
merged 1 commit into mindersec:main on Jan 17, 2024

Conversation

Vyom-Yadav
Member

@Vyom-Yadav Vyom-Yadav commented Jan 9, 2024

Fixes #520

Major Changes:

  • Implemented Cursor-Based Pagination in ListRepositories Endpoint
  • Added UTs for encoding/decoding cursors
  • API supports fetch-all/non-paginated queries too

@Vyom-Yadav Vyom-Yadav requested a review from a team as a code owner January 9, 2024 18:46
@Vyom-Yadav
Member Author

It would have been awesome if there were some setup to perform e2e tests, or even to test the endpoints against an actual DB. Is this under discussion, or is anybody working on it?

@@ -281,6 +281,7 @@ CREATE INDEX idx_roles_project_id ON roles(project_id);
CREATE UNIQUE INDEX roles_organization_id_name_lower_idx ON roles (organization_id, LOWER(name));
CREATE INDEX idx_provider_access_tokens_project_id ON provider_access_tokens(project_id);
CREATE UNIQUE INDEX repositories_repo_id_idx ON repositories(repo_id);
CREATE UNIQUE INDEX repositories_cursor_pagination_idx ON repositories(project_id, provider, repo_id);
Contributor

(I haven't really read the PR in full, just replying to this one bit for now..)

The reason the tests are failing is that whenever you change the DB schema, you need to add two new migration files rather than modify the old ones. Currently the highest-numbered migration is 11, so you'd add 12. Also, one thing that bit me in the beginning: you need to add both the up and the down migration, even if the down is empty.

Member Author

I see. Added new migration files.

ORDER BY repo_name;
AND (repo_id > sqlc.narg('repo_id') OR sqlc.narg('repo_id') IS NULL)
ORDER BY project_id, provider, repo_id
LIMIT sqlc.narg('limit');
Contributor

This is an interesting approach - did you consider using OFFSET and LIMIT instead?

Member Author

Yes, I considered it. The main reason for not going with offset/limit pagination is that it is slower than cursor-based pagination, and records can be skipped (or returned twice) if rows are inserted or deleted between page fetches.

The downside to cursor-based pagination is that we cannot jump to an arbitrary page directly. Instead, it behaves like an infinite scroll, where more records are fetched as the user scrolls through.

Member

In addition to @Vyom-Yadav 's comments, offset-based queries require fetching all the items from the index before the offset, while using a cursor allows you to seek into the index and then march forward through the b-tree without needing to traverse the earlier entries (i.e. avoiding Schlemiel the Painter's Algorithm).

Now, I suspect we won't get into the really big numbers, but if we had 2000 repos at 100 per page, that would be scanning 100 + 200 + 300 + 400 ... + 2000 index entries = 21000 total index entries with offset and limit, vs 2000 with this query.
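To make that arithmetic concrete, here is a small sketch (using the hypothetical numbers from the comment above, not anything in the PR) that tallies the index entries visited by each strategy:

```go
package main

import "fmt"

// offsetScanCost returns the total index entries visited when paging
// through n rows with OFFSET/LIMIT at the given page size: the query
// for page k has to re-read all k*pageSize rows before its offset.
func offsetScanCost(n, pageSize int) int {
	total := 0
	for fetched := pageSize; fetched <= n; fetched += pageSize {
		total += fetched // 100 + 200 + ... + n
	}
	return total
}

func main() {
	// The example above: 2000 repos, 100 per page.
	fmt.Println(offsetScanCost(2000, 100)) // 21000 entries with OFFSET/LIMIT
	// A cursor seeks straight to each page's start, so every index
	// entry is visited exactly once: 2000 entries total.
}
```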

Contributor

I'm actually fine with this approach, only thing that threw me off was the project + provider + repo_id index, I would have thought project + repo was sufficient. But I'm also flexible and willing to iterate.

Contributor

@evankanderson @Vyom-Yadav thank you for the explanation of the cursor-based search. I don't have any more comments, so feel free to ack once you're happy with the PR.

Member Author

only thing that threw me off was the project + provider + repo_id index, I would have thought project + repo was sufficient.

The reason for that is we have to make sure the cursor is valid. A project + repo cursor would work right now because GitHub is the only provider, but if we had GitLab too (for example), then differentiating the cursor would not be possible. Basically, we have to encode the previous query parameters in the cursor for correct retrieval.
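The idea of encoding all the query parameters into the token can be sketched roughly like this (a minimal illustration with a delimiter-joined, base64-encoded payload; the delimiter and function names are illustrative, not the PR's actual helpers):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

const cursorDelimiter = "," // illustrative; the PR defines its own

// encodeCursor packs every query parameter into the cursor, so a
// follow-up page request provably resumes the *same* query
// (same project, same provider), not just "some repo id".
func encodeCursor(projectID, provider string, repoID int32) string {
	raw := fmt.Sprintf("%s%s%s%s%d",
		projectID, cursorDelimiter, provider, cursorDelimiter, repoID)
	return base64.StdEncoding.EncodeToString([]byte(raw))
}

func decodeCursor(cursor string) (projectID, provider string, repoID int32, err error) {
	raw, err := base64.StdEncoding.DecodeString(cursor)
	if err != nil {
		return "", "", 0, err
	}
	parts := strings.Split(string(raw), cursorDelimiter)
	if len(parts) != 3 {
		return "", "", 0, fmt.Errorf("invalid cursor")
	}
	var id int64
	if _, err := fmt.Sscanf(parts[2], "%d", &id); err != nil {
		return "", "", 0, err
	}
	return parts[0], parts[1], int32(id), nil
}

func main() {
	c := encodeCursor("proj-1", "github", 42)
	p, prov, id, _ := decodeCursor(c)
	fmt.Println(p, prov, id) // proj-1 github 42
}
```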

Contributor

That's a good point!

@JAORMX JAORMX requested a review from evankanderson January 10, 2024 08:50
@Vyom-Yadav Vyom-Yadav force-pushed the issue-520 branch 3 times, most recently from 24f9c50 to d8bef5e Compare January 11, 2024 16:19
Member

@evankanderson evankanderson left a comment

I know that @JAORMX and @jhrozek have some thoughts as well, so not merging/approving yet.

Comment on lines 172 to 173
limit := sql.NullInt32{Valid: false, Int32: 0}
if in.GetLimit() > 0 {
Member

I suspect that we always want to enforce a limit here, and probably enforce that the limit is e.g. < 100. (Either by clamping the value, or by returning an error if the requested limit is higher.)

Member Author

Good point. I added 100 as the max limit when a client passes a limit greater than 0.
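A minimal sketch of the clamping approach suggested above, assuming the `sql.NullInt32` limit from the snippet under review (the cap's name and value of 100 are taken from the discussion; the helper is illustrative):

```go
package main

import (
	"database/sql"
	"fmt"
)

const maxFetchLimit = 100 // server-side cap discussed in the review

// toSQLLimit maps a client-requested limit to the query's nullable
// limit: <= 0 means "fetch all" (NULL limit), and positive values
// are clamped to the cap rather than rejected with an error.
func toSQLLimit(reqLimit int32) sql.NullInt32 {
	if reqLimit <= 0 {
		return sql.NullInt32{Valid: false}
	}
	if reqLimit > maxFetchLimit {
		reqLimit = maxFetchLimit
	}
	return sql.NullInt32{Valid: true, Int32: reqLimit}
}

func main() {
	fmt.Println(toSQLLimit(250)) // clamped to the cap: {100 true}
	fmt.Println(toSQLLimit(0))   // fetch-all: {0 false}
}
```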

Comment on lines 205 to 221
if len(repos) > 0 {
lastScannedRepoId := repos[len(repos)-1].RepoID
resp.Cursor, err = encodeListRepositoriesByProjectIDCursor(projectID.String(), provider.Name, lastScannedRepoId)
if err != nil {
return nil, util.UserVisibleError(codes.InvalidArgument, err.Error())
}
}
Member

Do we want to fetch limit+1 and only add a cursor if we actually got the +1 row? I'm mixed here -- it's slightly nicer for clients if you know that you can stop iterating when cursor is empty, and you don't need to worry about the "returned an empty array" case, but on the other hand, it makes this code slightly more annoying.

Member Author

Yeah, let's do limit + 1. Slack's API uses the same pattern.

Context context = 5;
// cursor is the base64 encoded cursor to start listing from. Format base64.encode(project_id,provider,repo_id)
Member

I don't think we should document the format of the cursor, as we could change it in the future (and that should be okay). It's simply an opaque value.

We may want to document when cursor is returned, and how to know you've reached the end of the set of responses.

Member Author

Done. Added a comment to the returned cursor.

Comment on lines 117 to 135
key, err := util.DecodeCursor(cursor)
if err != nil {
return "", "", 0, err
}

keyArr := strings.Split(key, cursorDelimiter)
if len(keyArr) != 3 {
return "", "", 0, fmt.Errorf("invalid cursor")
}

parsedRepoId, err := strconv.ParseInt(keyArr[2], 10, 32)
if err != nil {
return "", "", 0, err
}

projectId = keyArr[0]
provider = keyArr[1]

repoId = int32(parsedRepoId)
Member

Another option here is to make a small struct or protobuf and then serialize that. In fact, you could:

type RepoCursor struct {
	ProjectId string
	Provider  string
	RepoId    int
}

func NewRepoCursor(encoded string) (*RepoCursor, error) {
	// ... decode here
}

// String implements fmt.Stringer
func (c *RepoCursor) String() string {
	if c == nil || c.ProjectId == "" || c.Provider == "" || c.RepoId == 0 {
		return ""
	}
	// Do the encode in this method, so all the encoding/decoding
	// has to happen in this file.
}

Note that we want to encapsulate knowledge of how a RepoCursor is serialized within these two methods, and not leak that implementation elsewhere, so we can change it later if we want. (For example, we could compress the string and add an integrity SHA hash later if we were worried about people messing with the cursor tokens.)
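The "integrity SHA hash" idea floated in passing could be sketched like this (purely illustrative: an HMAC-signed token so tampered cursors are rejected before parsing; the key handling and names are assumptions, not anything in the PR):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

var cursorKey = []byte("server-secret") // illustrative; a real key comes from config

// signCursor prepends an HMAC-SHA256 tag to the payload, so the
// server can later detect clients that edited their cursor tokens.
func signCursor(payload string) string {
	mac := hmac.New(sha256.New, cursorKey)
	mac.Write([]byte(payload))
	return base64.RawURLEncoding.EncodeToString(append(mac.Sum(nil), payload...))
}

// verifyCursor checks the tag in constant time and returns the
// payload only if it is intact.
func verifyCursor(token string) (string, bool) {
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil || len(raw) < sha256.Size {
		return "", false
	}
	sig, payload := raw[:sha256.Size], raw[sha256.Size:]
	mac := hmac.New(sha256.New, cursorKey)
	mac.Write(payload)
	if !hmac.Equal(sig, mac.Sum(nil)) {
		return "", false
	}
	return string(payload), true
}

func main() {
	tok := signCursor("proj,github,42")
	p, ok := verifyCursor(tok)
	fmt.Println(p, ok) // proj,github,42 true
}
```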

Member Author

Valid points and a better pattern. Defined a struct for the cursor.

@Vyom-Yadav Vyom-Yadav force-pushed the issue-520 branch 6 times, most recently from 612c836 to e4391a3 Compare January 13, 2024 08:55
@jhrozek
Contributor

jhrozek commented Jan 17, 2024

@Vyom-Yadav would you mind rebasing atop the current master? Sorry about that; there have been a lot of changes touching the RPC lately. Normally you only need to run make gen and then commit the new autogenerated code.

@Vyom-Yadav
Member Author

@jhrozek Done.

Contributor

@jhrozek jhrozek left a comment

The code looks good, and a quick test with repo list works as well (also with an old client).

@jhrozek jhrozek merged commit e88e4b2 into mindersec:main Jan 17, 2024
19 checks passed
Successfully merging this pull request may close these issues.

proto: consider using continuation tokens rather than offset
4 participants