Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Store file io props to allow re-build it #802

Merged
merged 3 commits into from
Dec 17, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 15 additions & 4 deletions crates/iceberg/src/io/file_io.rs
Original file line number Diff line number Diff line change
Expand Up @@ -43,10 +43,20 @@ use crate::{Error, ErrorKind, Result};
/// | GCS | `storage-gcs` | `gcs` |
#[derive(Clone, Debug)]
pub struct FileIO {
builder: FileIOBuilder,

inner: Arc<Storage>,
}

impl FileIO {
/// Convert FileIO into [`FileIOBuilder`] which used to build this FileIO.
///
/// This function is useful when you want serialize and deserialize FileIO across
/// distributed systems.
pub fn into_builder(self) -> FileIOBuilder {
self.builder
}

/// Try to infer file io scheme from path. See [`FileIO`] for supported schemes.
///
/// - If it's a valid url, for example `s3://bucket/a`, url scheme will be used, and the rest of the url will be ignored.
Expand Down Expand Up @@ -134,7 +144,7 @@ impl FileIO {
}

/// Builder for [`FileIO`].
#[derive(Debug)]
#[derive(Clone, Debug)]
pub struct FileIOBuilder {
/// This is used to infer scheme of operator.
///
Expand Down Expand Up @@ -165,7 +175,7 @@ impl FileIOBuilder {
/// Fetch the scheme string.
///
/// The scheme_str will be empty if it's None.
pub(crate) fn into_parts(self) -> (String, HashMap<String, String>) {
pub fn into_parts(self) -> (String, HashMap<String, String>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really expose this? Why not just derive Serialize/Deserialize for FileIOBuilder?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really expose this? Why not just derive Serialize/Deserialize for FileIOBuilder?

Unlike TableMetadata, FileIOBuilder does not have a predefined struct layout. Implementing Serialize or Deserialize for FileIOBuilder could lead to users relying on it directly. By exposing into_parts, we give users the flexibility to make their own decisions. This approach also allows us to modify the API and adjust the internal layout in the future by changing the API directly without unintentionally breaking users' applications.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the goal of the scheme_str? If we want to have a specific impl, we could also pass that into the configuration itself. This is of course bound to a language, for example py-io-impl and in Java it is just io-impl.

Copy link
Member Author

@Xuanwo Xuanwo Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the goal of the scheme_str? If we want to have a specific impl, we could also pass that into the configuration itself. This is of course bound to a language, for example py-io-impl and in Java it is just io-impl.

Makes sense to me.

I exposed the necessary elements as they are to build FileIO for now. I can address this in another PR that removes the requirement for scheme to build FileIO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing Serialize or Deserialize for FileIOBuilder could lead to users relying on it directly.

That's exactly what I want. But user are not supposed to depend on the serialized format, but only the ser/de api provided by this library. The pros of this approach is that we don't need to expose internals fields to user, but only apis.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But user are not supposed to depend on the serialized format, but only the ser/de api provided by this library.

Providing ser/de is a contract I don't want to establish with users. It implies a promise that users can deserialize the file I/O back at any time after serialization. Once we expose ser/de, users will inevitably try to use it in ways we never anticipated, such as storing it in a metadata service or transferring it between clusters.

There should be a much broader discussion about how we should handle serialization/deserialization across our project. Similar issues could occur in #774. However, I don't want to initiate that discussion here.

In my opinion, making an API-breaking change is much better than offering serialization/deserialization but breaking it unexpectedly.

(self.scheme_str.unwrap_or_default(), self.props)
}

Expand All @@ -186,9 +196,10 @@ impl FileIOBuilder {
}

/// Builds [`FileIO`].
pub fn build(self) -> crate::Result<FileIO> {
let storage = Storage::build(self)?;
pub fn build(self) -> Result<FileIO> {
let storage = Storage::build(self.clone())?;
Ok(FileIO {
builder: self,
inner: Arc::new(storage),
})
}
Expand Down
Loading