Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Store file io props to allow re-build it #802

Merged
merged 3 commits into from
Dec 17, 2024

Conversation

Xuanwo
Copy link
Member

@Xuanwo Xuanwo commented Dec 14, 2024

This PR will store file io props to allow users to re-build it.

This PR will partly address #791


In a distributed system, we need to fetch table metadata, plan tasks, and distribute these tasks to different nodes. Each node will reconstruct the table instances to perform I/O while receiving a task. By storing file I/O properties, we can rebuild the Table without repeatedly fetching all the metadata from the catalog. This approach helps reduce the large number of catalog requests, especially when there are many workers.

@@ -165,7 +175,7 @@ impl FileIOBuilder {
/// Fetch the scheme string.
///
/// The scheme_str will be empty if it's None.
pub(crate) fn into_parts(self) -> (String, HashMap<String, String>) {
pub fn into_parts(self) -> (String, HashMap<String, String>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really expose this? Why not just derive Serialize/Deserialize for FileIOBuilder?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really expose this? Why not just derive Serialize/Deserialize for FileIOBuilder?

Unlike TableMetadata, FileIOBuilder does not have a predefined struct layout. Implementing Serialize or Deserialize for FileIOBuilder could lead to users relying on it directly. By exposing into_parts, we give users the flexibility to make their own decisions. This approach also allows us to modify the API and adjust the internal layout in the future by changing the API directly without unintentionally breaking users' applications.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the goal of the scheme_str? If we want to have a specific impl, we could also pass that into the configuration itself. This is of course bound to a language, for example py-io-impl and in Java it is just io-impl.

Copy link
Member Author

@Xuanwo Xuanwo Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the goal of the scheme_str? If we want to have a specific impl, we could also pass that into the configuration itself. This is of course bound to a language, for example py-io-impl and in Java it is just io-impl.

Makes sense to me.

I exposed the necessary elements as they are to build FileIO for now. I can address this in another PR that removes the requirement for scheme to build FileIO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing Serialize or Deserialize for FileIOBuilder could lead to users relying on it directly.

That's exactly what I want. But user are not supposed to depend on the serialized format, but only the ser/de api provided by this library. The pros of this approach is that we don't need to expose internals fields to user, but only apis.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But user are not supposed to depend on the serialized format, but only the ser/de api provided by this library.

Providing ser/de is a contract I don't want to establish with users. It implies a promise that users can deserialize the file I/O back at any time after serialization. Once we expose ser/de, users will inevitably try to use it in ways we never anticipated, such as storing it in a metadata service or transferring it between clusters.

There should be a much broader discussion about how we should handle serialization/deserialization across our project. Similar issues could occur in #774. However, I don't want to initiate that discussion here.

In my opinion, making an API-breaking change is much better than offering serialization/deserialization but breaking it unexpectedly.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offline discussion with @Xuanwo , and we reached an agreement that we should keep api consistent rather than a transparent format without clear spec.

@liurenjie1024 liurenjie1024 merged commit fbc2d42 into main Dec 17, 2024
16 checks passed
@liurenjie1024 liurenjie1024 deleted the allow-user-to-build-fileio branch December 17, 2024 03:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants