-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Scan Planning Endpoints to open api spec #9695
Add Scan Planning Endpoints to open api spec #9695
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If your change is based on latest master, can leverage the openAPI spec validation task :iceberg-open-api:build
I added in #9344 to identify the potential problems.
Minor clarification.
* Fix yaml for python codegen. * Add updated python.
Co-authored-by: Daniel Weeks <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly looks good to me, just a few small comments. Thanks for keep pushing for this!
open-api/rest-catalog-open-api.yaml
Outdated
@@ -629,7 +887,7 @@ paths: | |||
The snapshots to return in the body of the metadata. Setting the value to `all` would | |||
return the full set of snapshots currently valid for the table. Setting the value to | |||
`refs` would load all snapshots referenced by branches or tags. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove unnecessary newline changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noted this as well, but after thinking about it, I don't think the usual risk of merge conflicts really applies to this YAML file so I'm okay leaving this to keep it clean. I'm okay either way.
Co-authored-by: Daniel Weeks <[email protected]>
- $ref: '#/components/parameters/namespace' | ||
- $ref: '#/components/parameters/table' | ||
|
||
post: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I miss all the previous context, but looks like FetchScanTasksRequest
only took opaque plan-task
of string type and generated by the rest server, can we do get instead of post? Or do we plan on expand this later on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dramaticlly
Thanks for reviewing the pr will try to address your comment.
The main reason why we have fetchScanTasks as a POST instead of a GET, has to do with the structure of plan-task. Originally plan task was an opaque JSON object that vendors would return back to the client as a way of splitting up the work needed for scan planning. The client would send this plan-task back to the service in order to get associated file-scan-tasks.
Since plan-task was an opaque JSON object, this object could contain many attributes within it, and would be not be ideal to be sent as a query param which is customary for a GET request (since my understanding is that GET does not have a request body). In the current pr we have opted to have the plan-task be an opaque string instead of an opaque json object. However, the same idea should still hold that it would not be clean to embed a large JSON string with many attributes as a query param. Therefore, we decided that POST with a request body would be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @rahil-c for the context
open-api/rest-catalog-open-api.yaml
Outdated
|
||
- When "failed" the response must be a valid error response | ||
|
||
- Status "cancelled" is not a valid status from this endpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Could we separate this into its own paragraph? It might be confusing for quick readers if it's mixed with other valid statuses. For clarity, it could look something like this:
Responses must include a valid status as the following shows, please note that `cancelled` is not valid status
- "completed":
- "submitted":
- "failed":
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a powerful feature. Thanks @rahil-c for driving this. LGTM overall. Left minor comments.
404: | ||
description: | ||
Not Found | ||
- NoSuchPlanIdException, the plan-id does not exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: the server will reply 404 if it deletes a plan immediately after the plan is canceled, failed or fetched after the completed, right? Is there a use case that clients expect the server to keep the plan a bit longer to check the status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I saw the comment in the design doc to keep it for 24 hours, which is reasonable to me. Can we provide this kind of recommendation in the spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this specific case is where it's really the server choice and we probably want to keep the spec minimal
I think I'd advocate for any implementation recommendations to be done in a follow on PR with a separate discussion. For example, we recently added a separate section for the table spec implementation recommendations (things we advocate for implementations to do but are not required by the spec). That may apply here, but I'm still not 100% tbh since in the end servers will just clean up when they want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not sure if this is the right place to do that, but I feel like it makes sense to give a warning message here, that if a server does delete a plan aggressively, the clients may get confused error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with @amogh-jahagirdar that we should keep the spec minimal here in regards to putting a time to live for the plan
or an additional warning message around the service expiring the plan.
In the worst case, if the client expects a plan to be present and its not, it will hit a 404
exception, and the client can initiate a new plan. If we see this become an issue, in the future we can revisit this but I would rather not add onto this for now.
description: | ||
Whether to use the schema at the time the snapshot was written. | ||
|
||
When time travelling, the snapshot schema should be used (true). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do clients need to choose if the snapshot schema should be used in case of time traveling? Should this behavior be decided by the server? For example,
- current snapshot -> use the current schema
- historical snapshot -> use the snapshot schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iirc, there are two reasons:
- We want the client to be explicit about which snapshot they want to read
- This simplifies the request because there's only one way to read historical data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, clients currently specify the snapshot ID however there needs to be a mechanism for distinguishing which schema gets used based on if it's a time travel by a specific snapshot ID or if it's a time travel by branch. The client has that context, and it's easier for it to determine which schema should be used. The request input is kept simpler by having just a snapshot ID for time travel as @danielcweeks said rather than having a mix of different options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @danielcweeks and @amogh-jahagirdar for inputs. Sorry I didn't realized there are differences between a time travel by a specific snapshot ID and by branch name, in terms of which schema to use.
To help clarify, could we add a link here for further reading, or provide a detailed explanation? For example:
- When scanning with a snapshot ID, clients should use the snapshot schema (true).
- When scanning with a branch name, clients should use the table schema (false). Note that clients still send the snapshot ID rather than the branch name in the request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@flyrain All good! To clarify, the distinction in the schema that gets used during time travel is a convention that got established in the reference implementation but is not actually defined in the table spec itself.
As for the rationale behind this behavior in the reference implementation please refer to
https://github.com/apache/iceberg/pull/9131/files.
I do think it's probably beneficial to provide some more context as to why this field exists, which is to enable clients to align with the reference implementation.
Edit:
A concrete suggestion on a description that provides some context:
This flag is useful when a client wants to read from a branch and use the table schema or time travel to a specific a snapshot and use the snapshot schema (aligning with the reference implementation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your insights @amogh-jahagirdar @flyrain
However I actually think the current description for this is pretty straightforward and do not think we need to explain in the spec the reference implementation context around "snapshot schema" .
cc @rdblue @danielcweeks if you think we should clarify this further or keep as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's OK as is, I do think adding that historical context from the reference implementation is valuable for readers because without it the utility of the client side flag is unclear without digging through PRs or discussion threads.
Def not a blocker imo since it's largely describing context, if others think this context is useful I think we can address this in a follow.
Dev list discussion thread for this change: https://lists.apache.org/thread/flmw1qts0hv8n0k4pd9n1nfry322633y
cc @jackye1995 @rdblue @danielcweeks @nastra @amogh-jahagirdar
Testing
ran make install, make lint, make generate, and python3 rest-catalog-open-api.py