
Add Scan Planning Endpoints to open api spec #9695

Merged · 12 commits · Sep 10, 2024

Conversation

@rahil-c (Contributor) commented Feb 9, 2024

Dev list discussion thread for this change: https://lists.apache.org/thread/flmw1qts0hv8n0k4pd9n1nfry322633y

cc @jackye1995 @rdblue @danielcweeks @nastra @amogh-jahagirdar

Testing

Ran `make install`, `make lint`, `make generate`, and `python3 rest-catalog-open-api.py`.

@dramaticlly (Contributor) left a comment

If your change is based on the latest master, you can leverage the OpenAPI spec validation task :iceberg-open-api:build that I added in #9344 to identify potential problems.

rdblue added 2 commits on August 29, 2024:

* Fix yaml for python codegen.
* Add updated python. (Co-authored-by: Daniel Weeks <[email protected]>)
@jackye1995 (Contributor) left a comment

Mostly looks good to me, just a few small comments. Thanks for continuing to push for this!

@@ -629,7 +887,7 @@ paths:
The snapshots to return in the body of the metadata. Setting the value to `all` would
return the full set of snapshots currently valid for the table. Setting the value to
`refs` would load all snapshots referenced by branches or tags.

Contributor

nit: remove unnecessary newline changes

Contributor

I noted this as well, but after thinking about it, I don't think the usual risk of merge conflicts really applies to this YAML file, so I'm fine leaving this change in to keep the file clean. I'm okay either way.

- $ref: '#/components/parameters/namespace'
- $ref: '#/components/parameters/table'

post:
Contributor

Sorry if I'm missing previous context, but it looks like FetchScanTasksRequest only takes an opaque plan-task of string type, generated by the REST server. Can we use GET instead of POST? Or do we plan to expand this later on?

Contributor Author

@dramaticlly
Thanks for reviewing the PR; I'll try to address your comments.

The main reason we have fetchScanTasks as a POST instead of a GET has to do with the structure of plan-task. Originally, plan-task was an opaque JSON object that vendors would return to the client as a way of splitting up the work needed for scan planning. The client would send this plan-task back to the service in order to get the associated file-scan-tasks.

Since plan-task was an opaque JSON object, it could contain many attributes, and it would not be ideal to send it as a query param, which is customary for a GET request (since a GET does not have a request body). In the current PR we have opted to make plan-task an opaque string instead of an opaque JSON object. However, the same idea still holds: it would not be clean to embed a large JSON string with many attributes in a query param. Therefore, we decided that a POST with a request body would be more appropriate.
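The POST-with-body shape described here can be sketched in Python; the helper name and payload wrapper below are hypothetical illustrations, not part of the spec:

```python
import json

def build_fetch_scan_tasks_request(plan_task: str) -> dict:
    """Build the JSON body for the POST call (hypothetical helper).

    plan_task is the opaque, server-generated string described above;
    the client never inspects it, it only echoes it back to the service.
    """
    return {"plan-task": plan_task}

# An opaque token may itself be a large JSON string, which is why a request
# body (POST) fits better than a URL query parameter (GET has no body).
opaque = json.dumps({"shard": 17, "split-offsets": list(range(50))})
body = build_fetch_scan_tasks_request(opaque)
```

The point of the sketch: the client treats plan-task purely as a token, so the only requirement on the transport is that it can carry an arbitrarily large string, which a request body does and a query string does not do cleanly.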

Contributor

Thank you @rahil-c for the context.


- When "failed" the response must be a valid error response

- Status "cancelled" is not a valid status from this endpoint
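The status rules quoted above can be captured in a small client-side check; this is an illustrative sketch (the helper name and set are made up, not spec definitions):

```python
# Valid statuses for this endpoint, per the spec text above.
VALID_PLAN_STATUSES = {"completed", "submitted", "failed"}

def check_plan_status(status: str) -> str:
    """Validate a planning status (illustrative helper, not from the spec)."""
    if status == "cancelled":
        # "cancelled" exists elsewhere in the API but is not a valid
        # status for responses from this endpoint.
        raise ValueError("'cancelled' is not a valid status from this endpoint")
    if status not in VALID_PLAN_STATUSES:
        raise ValueError(f"unknown plan status: {status!r}")
    return status
```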
@flyrain (Contributor) commented Sep 6, 2024

Nit: could we separate this into its own paragraph? It might be confusing for quick readers if it's mixed in with the other valid statuses. For clarity, it could look something like this:

Responses must include a valid status as shown below; please note that `cancelled` is not a valid status:
- "completed":
- "submitted":
- "failed":

Contributor Author

Will fix this.

@jackye1995 (Contributor) left a comment

looks good to me!

@flyrain (Contributor) left a comment

It's a powerful feature. Thanks @rahil-c for driving this. LGTM overall. I left minor comments.

404:
  description:
    Not Found
    - NoSuchPlanIdException, the plan-id does not exist
Contributor

Question: the server will reply 404 if it deletes a plan immediately after the plan is cancelled or failed, or after it is fetched once completed, right? Is there a use case where clients expect the server to keep the plan around a bit longer to check its status?

Contributor

Oh, I saw the comment in the design doc about keeping it for 24 hours, which is reasonable to me. Can we provide this kind of recommendation in the spec?

@amogh-jahagirdar (Contributor) commented Sep 9, 2024

I think this specific case is really the server's choice, and we probably want to keep the spec minimal.

I think I'd advocate for any implementation recommendations to be done in a follow-on PR with a separate discussion. For example, we recently added a separate section to the table spec for implementation recommendations (things we advocate implementations do but that are not required by the spec). That may apply here, but I'm still not 100% sure, tbh, since in the end servers will just clean up when they want.

Contributor

I'm also not sure this is the right place to do that, but I feel like it makes sense to give a warning here that if a server deletes a plan aggressively, clients may get a confusing error message.

Contributor Author

I think I agree with @amogh-jahagirdar that we should keep the spec minimal here with regard to specifying a time-to-live for the plan or adding a warning about the service expiring it.

In the worst case, if the client expects a plan to be present and it's not, it will hit a 404, and the client can initiate a new plan. If this becomes an issue, we can revisit it in the future, but I would rather not add to this for now.
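The recovery path described here (hit a 404, start a new plan) can be sketched as follows; the exception class and the callables are hypothetical stand-ins for a real client, not spec-defined operations:

```python
class NoSuchPlanIdError(Exception):
    """Raised when the service replies 404 for a plan-id (illustrative)."""

def fetch_plan_result(plan_id, fetch, replan):
    """Fetch a plan's result, re-planning once if the plan has expired.

    fetch and replan are caller-supplied callables standing in for the
    real REST calls; they are assumptions for this sketch.
    """
    try:
        return fetch(plan_id)
    except NoSuchPlanIdError:
        # The server cleaned up the plan; initiate a new plan and retry.
        return fetch(replan())
```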

description:
  Whether to use the schema at the time the snapshot was written.

  When time travelling, the snapshot schema should be used (true).
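The convention discussed in the thread below can be sketched as a tiny resolver; the function and the toy schema dicts are illustrative assumptions, not spec definitions:

```python
def resolve_scan_schema(table_schema: dict, snapshot_schema: dict,
                        use_snapshot_schema: bool) -> dict:
    """Pick the schema a scan should use, mirroring the flag described above.

    True  -> schema at the time the snapshot was written (time travel)
    False -> current table schema (e.g. reading a branch)
    """
    return snapshot_schema if use_snapshot_schema else table_schema

# Toy schemas for illustration only.
snapshot_schema = {"schema-id": 1, "fields": ["id", "name"]}
table_schema = {"schema-id": 2, "fields": ["id", "name", "email"]}
```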
@flyrain (Contributor) commented Sep 7, 2024

Why do clients need to choose whether the snapshot schema should be used when time travelling? Shouldn't this behavior be decided by the server? For example,

  1. current snapshot -> use the current schema
  2. historical snapshot -> use the snapshot schema

Contributor

iirc, there are two reasons:

  1. We want the client to be explicit about which snapshot they want to read
  2. This simplifies the request because there's only one way to read historical data

Contributor

Yeah, clients currently specify the snapshot ID; however, there needs to be a mechanism for distinguishing which schema gets used based on whether it's a time travel by a specific snapshot ID or a time travel by branch. The client has that context, and it's easier for it to determine which schema should be used. The request input is kept simpler by having just a snapshot ID for time travel, as @danielcweeks said, rather than a mix of different options.

Contributor

Thanks @danielcweeks and @amogh-jahagirdar for the input. Sorry, I didn't realize there were differences between a time travel by a specific snapshot ID and by branch name in terms of which schema to use.

To help clarify, could we add a link here for further reading, or provide a detailed explanation? For example:

- When scanning with a snapshot ID, clients should use the snapshot schema (true).
- When scanning with a branch name, clients should use the table schema (false). Note that clients still send the snapshot ID rather than the branch name in the request.

@amogh-jahagirdar (Contributor) commented Sep 9, 2024

@flyrain All good! To clarify, the distinction in the schema that gets used during time travel is a convention that got established in the reference implementation but is not actually defined in the table spec itself.

As for the rationale behind this behavior in the reference implementation please refer to
https://github.com/apache/iceberg/pull/9131/files.

I do think it's probably beneficial to provide some more context as to why this field exists, which is to enable clients to align with the reference implementation.

Edit: a concrete suggestion for a description that provides some context:

This flag is useful when a client wants to read from a branch and use the table schema, or time travel to a specific snapshot and use the snapshot schema (aligning with the reference implementation).

@rahil-c (Contributor Author) commented Sep 10, 2024

Thanks for your insights @amogh-jahagirdar @flyrain

However, I actually think the current description is pretty straightforward, and I don't think we need to explain the reference implementation context around "snapshot schema" in the spec.

cc @rdblue @danielcweeks if you think we should clarify this further or keep as is.

@amogh-jahagirdar (Contributor) commented Sep 10, 2024

It's OK as is. I do think adding that historical context from the reference implementation is valuable for readers, because without it the utility of the client-side flag is unclear without digging through PRs or discussion threads.

Def not a blocker imo since it's largely describing context; if others think this context is useful, I think we can address it in a follow-up.
