-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Store file io props to allow re-build it #802
Conversation
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
@@ -165,7 +175,7 @@ impl FileIOBuilder { | |||
/// Fetch the scheme string. | |||
/// | |||
/// The scheme_str will be empty if it's None. | |||
pub(crate) fn into_parts(self) -> (String, HashMap<String, String>) { | |||
pub fn into_parts(self) -> (String, HashMap<String, String>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we really expose this? Why not just derive Serialize
/Deserialize
for FileIOBuilder
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we really expose this? Why not just derive
Serialize
/Deserialize
forFileIOBuilder
?
Unlike TableMetadata
, FileIOBuilder
does not have a predefined struct layout. Implementing Serialize
or Deserialize
for FileIOBuilder
could lead to users relying on it directly. By exposing into_parts
, we give users the flexibility to make their own decisions. This approach also allows us to modify the API and adjust the internal layout in the future by changing the API directly without unintentionally breaking users' applications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the goal of the scheme_str
? If we want to have a specific impl, we could also pass that into the configuration itself. This is of course bound to a language, for example py-io-impl
and in Java it is just io-impl
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the goal of the
scheme_str
? If we want to have a specific impl, we could also pass that into the configuration itself. This is of course bound to a language, for examplepy-io-impl
and in Java it is justio-impl
.
Makes sense to me.
I exposed the necessary elements as they are to build FileIO
for now. I can address this in another PR that removes the requirement for scheme
to build FileIO
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementing Serialize or Deserialize for FileIOBuilder could lead to users relying on it directly.
That's exactly what I want. But user are not supposed to depend on the serialized format, but only the ser/de api provided by this library. The pros of this approach is that we don't need to expose internals fields to user, but only apis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But user are not supposed to depend on the serialized format, but only the ser/de api provided by this library.
Providing ser/de is a contract I don't want to establish with users. It implies a promise that users can deserialize the file I/O back at any time after serialization. Once we expose ser/de, users will inevitably try to use it in ways we never anticipated, such as storing it in a metadata service or transferring it between clusters.
There should be a much broader discussion about how we should handle serialization/deserialization across our project. Similar issues could occur in #774. However, I don't want to initiate that discussion here.
In my opinion, making an API-breaking change is much better than offering serialization/deserialization but breaking it unexpectedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offline discussion with @Xuanwo , and we reached an agreement that we should keep api consistent rather than a transparent format without clear spec.
This PR will store file io props to allow users to re-build it.
This PR will partly address #791
In a distributed system, we need to fetch table metadata, plan tasks, and distribute these tasks to different nodes. Each node will reconstruct the table instances to perform I/O while receiving a task. By storing file I/O properties, we can rebuild the
Table
without repeatedly fetching all the metadata from the catalog. This approach helps reduce the large number of catalog requests, especially when there are many workers.