Implement Cursor-Based Pagination in ListRepositories Endpoint #2097
Conversation
It would have been awesome if there was some setup to perform e2e tests or even to test the endpoints using an actual DB. Is this under discussion / Is anybody working on it?
```sql
@@ -281,6 +281,7 @@ CREATE INDEX idx_roles_project_id ON roles(project_id);
CREATE UNIQUE INDEX roles_organization_id_name_lower_idx ON roles (organization_id, LOWER(name));
CREATE INDEX idx_provider_access_tokens_project_id ON provider_access_tokens(project_id);
CREATE UNIQUE INDEX repositories_repo_id_idx ON repositories(repo_id);
CREATE UNIQUE INDEX repositories_cursor_pagination_idx ON repositories(project_id, provider, repo_id);
```
(I haven't really read the PR in full, just replying to this one bit for now..)
The reason the tests are failing is that whenever you change the DB schema you need to add two new migration files, not modify the old ones. Currently the largest one is 11, so you'd add 12. Also, one thing that bit me in the beginning is that you need to add both up and down migrations (even if the down was empty).
I see. Added new migration files.
```sql
ORDER BY repo_name;

AND (repo_id > sqlc.narg('repo_id') OR sqlc.narg('repo_id') IS NULL)
ORDER BY project_id, provider, repo_id
LIMIT sqlc.narg('limit');
```
This is an interesting approach - did you consider using OFFSET and LIMIT instead?
Yes, I considered it. The main reason for not going with offset-limit pagination is that it is slower than cursor-based pagination, and records can be skipped or duplicated if rows are inserted or deleted between page requests.
The downside to cursor-based pagination is that we cannot jump to an arbitrary page directly. Instead, it behaves like an infinite scroll, where more records are fetched as the user scrolls through.
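The infinite-scroll behaviour described above can be sketched as a client loop that keeps passing the returned cursor back until it comes back empty. Everything here (`Page`, `fetchPage`) is illustrative, not the actual minder API; `fetchPage` just simulates the server paging over a fixed slice:

```go
package main

import "fmt"

// Page is a hypothetical response shape: a batch of items plus a
// cursor that is empty when there are no more pages.
type Page struct {
	Items  []string
	Cursor string
}

// fetchPage stands in for the RPC call; it returns up to limit items
// after the position named by cursor, and a new cursor if more remain.
func fetchPage(data []string, cursor string, limit int) Page {
	start := 0
	for i, d := range data {
		if d == cursor {
			start = i + 1
			break
		}
	}
	end := start + limit
	if end > len(data) {
		end = len(data)
	}
	page := Page{Items: data[start:end]}
	if end < len(data) {
		page.Cursor = data[end-1] // last item seen becomes the cursor
	}
	return page
}

func main() {
	data := []string{"a", "b", "c", "d", "e"}
	var all []string
	cursor := ""
	for {
		p := fetchPage(data, cursor, 2)
		all = append(all, p.Items...)
		if p.Cursor == "" {
			break // empty cursor means the end of the result set
		}
		cursor = p.Cursor
	}
	fmt.Println(all) // [a b c d e]
}
```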
In addition to @Vyom-Yadav's comments, offset-based queries require fetching all the items from the index before the offset, while using a cursor allows you to seek into the index and then march forward through the b-tree without needing to traverse the earlier entries (i.e. avoiding Schlemiel the Painter's Algorithm).
Now, I suspect we won't get into the really big numbers, but if we had 2000 repos at 100 per page, that would be scanning 100 + 200 + 300 + 400 ... + 2000 index entries = 21000 total index entries with offset and limit, vs 2000 with this query.
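A quick arithmetic check of the scan counts quoted above: with OFFSET/LIMIT, each successive page re-reads every index entry before it, while the cursor seek visits each entry exactly once. The helper name is illustrative:

```go
package main

import "fmt"

// offsetScanCost sums the index entries scanned across all pages when
// page k must skip (k-1)*perPage rows and then read perPage more,
// i.e. it touches k*perPage entries in total.
func offsetScanCost(pages, perPage int) int {
	total := 0
	for page := 1; page <= pages; page++ {
		total += page * perPage
	}
	return total
}

func main() {
	// 2000 repos at 100 per page = 20 pages.
	fmt.Println(offsetScanCost(20, 100)) // 21000
	fmt.Println(20 * 100)                // 2000 with the cursor query
}
```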
I'm actually fine with this approach; the only thing that threw me off was the project + provider + repo_id index, as I would have thought project + repo was sufficient. But I'm also flexible and willing to iterate.
@evankanderson @Vyom-Yadav thank you for the explanation about cursor-based pagination. I don't have any more comments, so feel free to ack once you're happy with the PR.
> only thing that threw me off was the project + provider + repo_id index, I would have thought project + repo was sufficient.

The reason for that is we have to make sure the cursor is valid. A project + repo cursor would work right now because GitHub is the only provider, but if we also had GitLab (for example), then differentiating the cursor would not be possible. Basically, we have to encode the previous query parameters in the cursor for correct retrieval.
That's a good point!
```go
limit := sql.NullInt32{Valid: false, Int32: 0}
if in.GetLimit() > 0 {
```
I suspect that we always want to enforce a limit here, and probably enforce that the limit is e.g. < 100. (Either by clamping the value, or by returning an error if the requested limit is higher.)
Good point. I added 100 as the max limit when a client passes a limit flag > 0.
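A sketch of what that clamping could look like, assuming the cap lives in a constant; `clampLimit` and `maxListLimit` are illustrative names, since the PR discussion only says 100 became the maximum:

```go
package main

import (
	"database/sql"
	"fmt"
)

// maxListLimit caps how many repositories a single page may return.
// The constant name is hypothetical; only the value 100 comes from
// the PR discussion.
const maxListLimit = 100

// clampLimit converts a client-supplied limit into the nullable value
// handed to the sqlc query: unset (Valid: false) when the client
// passed no limit, otherwise clamped to maxListLimit.
func clampLimit(requested int32) sql.NullInt32 {
	if requested <= 0 {
		return sql.NullInt32{Valid: false}
	}
	if requested > maxListLimit {
		requested = maxListLimit
	}
	return sql.NullInt32{Valid: true, Int32: requested}
}

func main() {
	fmt.Println(clampLimit(500)) // clamped to 100
	fmt.Println(clampLimit(0))   // unset: no LIMIT applied
}
```

The alternative mentioned in the review, returning an InvalidArgument error for oversized limits, would replace the clamp branch with an error return.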
```go
if len(repos) > 0 {
	lastScannedRepoId := repos[len(repos)-1].RepoID
	resp.Cursor, err = encodeListRepositoriesByProjectIDCursor(projectID.String(), provider.Name, lastScannedRepoId)
	if err != nil {
		return nil, util.UserVisibleError(codes.InvalidArgument, err.Error())
	}
}
```
Do we want to fetch limit+1 and only add a cursor if we actually got the +1 row? I'm mixed here -- it's slightly nicer for clients if you know that you can stop iterating when cursor is empty, and you don't need to worry about the "returned an empty array" case, but on the other hand, it makes this code slightly more annoying.
Yeah, let's do limit + 1. Slack also uses it.
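A minimal sketch of the limit+1 idea the thread settles on, with hypothetical names: issue the query with LIMIT limit+1, and only report "more pages exist" (and hence emit a cursor) when the extra row actually came back:

```go
package main

import "fmt"

// paginate takes the rows returned by a query issued with
// LIMIT limit+1. If the extra row is present, it is trimmed off and
// hasMore is true, signalling that a cursor should be returned;
// otherwise this was the last page and the cursor stays empty.
func paginate(rows []int, limit int) (page []int, hasMore bool) {
	if len(rows) > limit {
		return rows[:limit], true
	}
	return rows, false
}

func main() {
	// Simulated result of a LIMIT 4 query for a page size of 3.
	page, more := paginate([]int{1, 2, 3, 4}, 3)
	fmt.Println(page, more) // [1 2 3] true

	// Last page: fewer rows than limit+1 came back.
	page, more = paginate([]int{5, 6}, 3)
	fmt.Println(page, more) // [5 6] false
}
```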
proto/minder/v1/minder.proto
Outdated
```protobuf
Context context = 5;
// cursor is the base64 encoded cursor to start listing from. Format base64.encode(project_id,provider,repo_id)
```
I don't think we should document the format of the cursor, as we could change it in the future (and that should be okay). It's simply an opaque value.
We may want to document when cursor is returned, and how to know you've reached the end of the set of responses.
Done. Added a comment to the returned cursor.
internal/controlplane/common.go
Outdated
```go
key, err := util.DecodeCursor(cursor)
if err != nil {
	return "", "", 0, err
}

keyArr := strings.Split(key, cursorDelimiter)
if len(keyArr) != 3 {
	return "", "", 0, fmt.Errorf("invalid cursor")
}

parsedRepoId, err := strconv.ParseInt(keyArr[2], 10, 32)
if err != nil {
	return "", "", 0, err
}

projectId = keyArr[0]
provider = keyArr[1]

repoId = int32(parsedRepoId)
```
Another option here is to make a small struct or protobuf and then serialize that. In fact, you could:
```go
type RepoCursor struct {
	ProjectId string
	Provider  string
	RepoId    int
}

func NewCursor(encoded string) (RepoCursor, error) {
	// ...
}

// String implements fmt.Stringer
func (c *RepoCursor) String() string {
	if c == nil || c.ProjectId == "" || c.Provider == "" || c.RepoId == 0 {
		return ""
	}
	// Do the encode in this method, so all the encoding/decoding has to happen in this file
}
```
Note that we want to encapsulate knowledge of how a RepoCursor is serialized within these two methods, and not leak that implementation elsewhere, so we can change it later if we want. (For example, we could compress the string and add an integrity SHA hash later if we were worried about people messing with the cursor tokens.)
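A hedged, complete version of that encapsulation, assuming a comma-delimited base64 serialization like the one in the original helper (the delimiter, function names, and exact format are illustrative and may differ from what the PR ultimately merged):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
)

// RepoCursor follows the reviewer's suggested shape. All knowledge of
// how it is serialized stays inside String and NewRepoCursor, so the
// format can change later without touching callers.
type RepoCursor struct {
	ProjectId string
	Provider  string
	RepoId    int32
}

const cursorDelimiter = ","

// String implements fmt.Stringer, returning "" for a nil or
// incomplete cursor so callers can treat "" as "no more pages".
func (c *RepoCursor) String() string {
	if c == nil || c.ProjectId == "" || c.Provider == "" || c.RepoId == 0 {
		return ""
	}
	key := strings.Join(
		[]string{c.ProjectId, c.Provider, strconv.Itoa(int(c.RepoId))},
		cursorDelimiter)
	return base64.StdEncoding.EncodeToString([]byte(key))
}

// NewRepoCursor decodes an encoded cursor back into its parts.
func NewRepoCursor(encoded string) (*RepoCursor, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	parts := strings.Split(string(raw), cursorDelimiter)
	if len(parts) != 3 {
		return nil, fmt.Errorf("invalid cursor")
	}
	repoId, err := strconv.ParseInt(parts[2], 10, 32)
	if err != nil {
		return nil, err
	}
	return &RepoCursor{ProjectId: parts[0], Provider: parts[1], RepoId: int32(repoId)}, nil
}

func main() {
	c := &RepoCursor{ProjectId: "p1", Provider: "github", RepoId: 42}
	dec, err := NewRepoCursor(c.String())
	fmt.Println(dec, err) // round-trips back to the same fields
}
```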
Valid points and a better pattern. Defined a struct for the cursor.
@Vyom-Yadav would you mind rebasing atop the current master? Sorry about that, there's been a lot of changes touching the RPC lately. Normally you only need to run
Signed-off-by: Vyom-Yadav <[email protected]>
@jhrozek Done.
The code looks good, and a quick test with repo list works as well (also with an old client).
Fixes #520
Major Changes: