
Add Scan Planning Endpoints to open api spec #9695

Merged · 12 commits · Sep 10, 2024

Conversation

@rahil-c (Contributor) commented Feb 9, 2024

Dev list discussion thread for this change: https://lists.apache.org/thread/flmw1qts0hv8n0k4pd9n1nfry322633y

cc @jackye1995 @rdblue @danielcweeks @nastra @amogh-jahagirdar

Testing

Ran `make install`, `make lint`, `make generate`, and `python3 rest-catalog-open-api.py`.

@dramaticlly (Contributor) left a comment

If your change is based on the latest master, you can leverage the OpenAPI spec validation task :iceberg-open-api:build that I added in #9344 to identify potential problems.

rdblue added 2 commits on August 29, 2024:

* Fix yaml for python codegen.
* Add updated python. (Co-authored-by: Daniel Weeks <[email protected]>)
@jackye1995 (Contributor) left a comment

Mostly looks good to me, just a few small comments. Thanks for continuing to push for this!

@@ -629,7 +887,7 @@ paths:
The snapshots to return in the body of the metadata. Setting the value to `all` would
return the full set of snapshots currently valid for the table. Setting the value to
`refs` would load all snapshots referenced by branches or tags.

Contributor

nit: remove unnecessary newline changes

Contributor

I noted this as well, but after thinking about it, I don't think the usual risk of merge conflicts really applies to this YAML file, so I'm fine leaving this change in to keep the file clean. I'm okay either way.

- $ref: '#/components/parameters/namespace'
- $ref: '#/components/parameters/table'

post:
Contributor

Sorry if I'm missing previous context, but it looks like FetchScanTasksRequest only takes an opaque plan-task of string type, generated by the REST server. Can we use GET instead of POST? Or do we plan to expand this later on?

Contributor Author

@dramaticlly
Thanks for reviewing the PR; I'll try to address your comments.

The main reason we have fetchScanTasks as a POST instead of a GET has to do with the structure of plan-task. Originally, plan-task was an opaque JSON object that vendors would return to the client as a way of splitting up the work needed for scan planning. The client would send this plan-task back to the service in order to get the associated file-scan-tasks.

Since plan-task was an opaque JSON object, it could contain many attributes, and it would not be ideal to send it as a query param, which is customary for a GET request (since a GET does not have a request body). In the current PR we have opted to make plan-task an opaque string instead of an opaque JSON object. However, the same idea still holds: it would not be clean to embed a large JSON string with many attributes in a query param. Therefore, we decided that a POST with a request body would be more appropriate.
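The POST-with-body shape described here can be sketched in Python; the helper name and payload wrapper below are hypothetical illustrations, not part of the spec:

```python
import json

def build_fetch_scan_tasks_request(plan_task: str) -> dict:
    """Build the JSON body for the POST call (hypothetical helper).

    plan_task is the opaque, server-generated string described above;
    the client never inspects it, it only echoes it back to the service.
    """
    return {"plan-task": plan_task}

# An opaque token may itself be a large JSON string, which is why a request
# body (POST) fits better than a URL query parameter (GET has no body).
opaque = json.dumps({"shard": 17, "split-offsets": list(range(50))})
body = build_fetch_scan_tasks_request(opaque)
```

The point of the sketch: the client treats plan-task purely as a token, so the only requirement on the transport is that it can carry an arbitrarily large string, which a request body does and a query string does not do cleanly.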

Contributor

Thank you @rahil-c for the context.


- When "failed" the response must be a valid error response

- Status "cancelled" is not a valid status from this endpoint
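The status rules quoted above can be captured in a small client-side check; this is an illustrative sketch (the helper name and set are made up, not spec definitions):

```python
# Valid statuses for this endpoint, per the spec text above.
VALID_PLAN_STATUSES = {"completed", "submitted", "failed"}

def check_plan_status(status: str) -> str:
    """Validate a planning status (illustrative helper, not from the spec)."""
    if status == "cancelled":
        # "cancelled" exists elsewhere in the API but is not a valid
        # status for responses from this endpoint.
        raise ValueError("'cancelled' is not a valid status from this endpoint")
    if status not in VALID_PLAN_STATUSES:
        raise ValueError(f"unknown plan status: {status!r}")
    return status
```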
@flyrain (Contributor) commented Sep 6, 2024

Nit: could we separate this into its own paragraph? It might be confusing for quick readers if it's mixed in with the other valid statuses. For clarity, it could look something like this:

Responses must include a valid status as shown below; please note that `cancelled` is not a valid status:
- "completed":
- "submitted":
- "failed":

Contributor Author

Will fix this.

@jackye1995 (Contributor) left a comment

looks good to me!

@flyrain (Contributor) left a comment

It's a powerful feature. Thanks @rahil-c for driving this. LGTM overall. I left minor comments.

404:
  description:
    Not Found
    - NoSuchPlanIdException, the plan-id does not exist
Contributor

Question: the server will reply 404 if it deletes a plan immediately after the plan is cancelled or failed, or after it is fetched once completed, right? Is there a use case where clients expect the server to keep the plan around a bit longer to check its status?

Contributor

Oh, I saw the comment in the design doc about keeping it for 24 hours, which is reasonable to me. Can we provide this kind of recommendation in the spec?

@amogh-jahagirdar (Contributor) commented Sep 9, 2024

I think this specific case is really the server's choice, and we probably want to keep the spec minimal.

I think I'd advocate for any implementation recommendations to be done in a follow-on PR with a separate discussion. For example, we recently added a separate section to the table spec for implementation recommendations (things we advocate implementations do but that are not required by the spec). That may apply here, but I'm still not 100% sure, tbh, since in the end servers will just clean up when they want.

Contributor

I'm also not sure this is the right place to do that, but I feel like it makes sense to give a warning here that if a server deletes a plan aggressively, clients may get a confusing error message.

Contributor Author

I think I agree with @amogh-jahagirdar that we should keep the spec minimal here with regard to specifying a time-to-live for the plan or adding a warning about the service expiring it.

In the worst case, if the client expects a plan to be present and it's not, it will hit a 404, and the client can initiate a new plan. If this becomes an issue, we can revisit it in the future, but I would rather not add to this for now.
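The recovery path described here (hit a 404, start a new plan) can be sketched as follows; the exception class and the callables are hypothetical stand-ins for a real client, not spec-defined operations:

```python
class NoSuchPlanIdError(Exception):
    """Raised when the service replies 404 for a plan-id (illustrative)."""

def fetch_plan_result(plan_id, fetch, replan):
    """Fetch a plan's result, re-planning once if the plan has expired.

    fetch and replan are caller-supplied callables standing in for the
    real REST calls; they are assumptions for this sketch.
    """
    try:
        return fetch(plan_id)
    except NoSuchPlanIdError:
        # The server cleaned up the plan; initiate a new plan and retry.
        return fetch(replan())
```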

description:
  Whether to use the schema at the time the snapshot was written.

  When time travelling, the snapshot schema should be used (true).
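The convention discussed in the thread below can be sketched as a tiny resolver; the function and the toy schema dicts are illustrative assumptions, not spec definitions:

```python
def resolve_scan_schema(table_schema: dict, snapshot_schema: dict,
                        use_snapshot_schema: bool) -> dict:
    """Pick the schema a scan should use, mirroring the flag described above.

    True  -> schema at the time the snapshot was written (time travel)
    False -> current table schema (e.g. reading a branch)
    """
    return snapshot_schema if use_snapshot_schema else table_schema

# Toy schemas for illustration only.
snapshot_schema = {"schema-id": 1, "fields": ["id", "name"]}
table_schema = {"schema-id": 2, "fields": ["id", "name", "email"]}
```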
@flyrain (Contributor) commented Sep 7, 2024

Why do clients need to choose whether the snapshot schema should be used when time travelling? Shouldn't this behavior be decided by the server? For example,

  1. current snapshot -> use the current schema
  2. historical snapshot -> use the snapshot schema

Contributor

iirc, there are two reasons:

  1. We want the client to be explicit about which snapshot they want to read
  2. This simplifies the request because there's only one way to read historical data

Contributor

Yeah, clients currently specify the snapshot ID; however, there needs to be a mechanism for distinguishing which schema gets used based on whether it's a time travel by a specific snapshot ID or a time travel by branch. The client has that context, and it's easier for it to determine which schema should be used. The request input is kept simpler by having just a snapshot ID for time travel, as @danielcweeks said, rather than a mix of different options.

Contributor

Thanks @danielcweeks and @amogh-jahagirdar for the input. Sorry, I didn't realize there were differences between a time travel by a specific snapshot ID and by branch name in terms of which schema to use.

To help clarify, could we add a link here for further reading, or provide a detailed explanation? For example:

- When scanning with a snapshot ID, clients should use the snapshot schema (true).
- When scanning with a branch name, clients should use the table schema (false). Note that clients still send the snapshot ID rather than the branch name in the request.

@amogh-jahagirdar (Contributor) commented Sep 9, 2024

@flyrain All good! To clarify, the distinction in the schema that gets used during time travel is a convention that got established in the reference implementation but is not actually defined in the table spec itself.

As for the rationale behind this behavior in the reference implementation please refer to
https://github.com/apache/iceberg/pull/9131/files.

I do think it's probably beneficial to provide some more context as to why this field exists, which is to enable clients to align with the reference implementation.

Edit: a concrete suggestion for a description that provides some context:

This flag is useful when a client wants to read from a branch and use the table schema, or time travel to a specific snapshot and use the snapshot schema (aligning with the reference implementation).

@rahil-c (Contributor Author) commented Sep 10, 2024

Thanks for your insights @amogh-jahagirdar @flyrain

However, I actually think the current description is pretty straightforward, and I don't think we need to explain the reference implementation context around "snapshot schema" in the spec.

cc @rdblue @danielcweeks if you think we should clarify this further or keep as is.

@amogh-jahagirdar (Contributor) commented Sep 10, 2024

It's OK as is. I do think adding that historical context from the reference implementation is valuable for readers, because without it the utility of the client-side flag is unclear without digging through PRs or discussion threads.

Def not a blocker imo since it's largely describing context; if others think this context is useful, I think we can address it in a follow-up.
