Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading table from metadata file directly. #246

Closed
liurenjie1024 opened this issue Mar 10, 2024 · 21 comments · Fixed by #259
Closed

Loading table from metadata file directly. #246

liurenjie1024 opened this issue Mar 10, 2024 · 21 comments · Fixed by #259
Assignees
Milestone

Comments

@liurenjie1024
Copy link
Contributor

This feature is quite similar to StaticTable in python api, which allows user to load table without catalog.

@liurenjie1024 liurenjie1024 moved this to Todo in iceberg-rust Mar 11, 2024
@liurenjie1024 liurenjie1024 added this to the 0.3.0 Release milestone Mar 11, 2024
@liurenjie1024 liurenjie1024 self-assigned this Mar 11, 2024
@a-agmon
Copy link
Contributor

a-agmon commented Mar 12, 2024

Hi @liurenjie1024 ,
Thanks for adding this in response to my question on Slack :-)
I would be happy to work on this as good-first-issue if you think its a good idea.
I have implemented something similar for our use case. I can try and PR if you think any of the options here are in the right direction.
This might be something modest on table.rs such as:

    pub fn from_metadata(
        metadata: TableMetadata,
        table_ident: TableIdent,
        file_io: FileIO,
    ) -> Self {
        Self::builder()
            .metadata(metadata)
            .identifier(table_ident)
            .file_io(file_io)
            .build()
    }

or more ergonomic like the Python version - something like this (that's the way I have implemented this):

async fn from_metadata(metadata_file_location: &str, file_io: FileIO) -> anyhow::Result<Table> {
    let metadata_file = file_io.new_input(metadata_file_location)?;
    let mut metadata_file_reader = metadata_file.reader().await?;
    let mut metadata_file_content = String::new();
    metadata_file_reader
        .read_to_string(&mut metadata_file_content)
        .await?;
    let table_metadata = serde_json::from_str::<TableMetadata>(&metadata_file_content)?;
    Ok(Table::builder()
        .metadata(table_metadata)
        .identifier(TableIdent::from_strs(["static_ns", "static_table"]).unwrap())
        .file_io(file_io)
        .build())
}

either way would be happy to hear what you think

@liurenjie1024
Copy link
Contributor Author

Hi, @a-agmon

I think both method could be provided and the second version could be implemented as a simple wrapper as the first one.

What I want to add is that instead of implement it directly on Table, I want to add a new struct StaticTable as following:

pub struct StaticTable(Table);

impl StaticTable {
 pub async fn from_metadata(metadata_file_location: &str, file_io: FileIO) -> anyhow::Result<Self> {
   ...
 }

   pub fn scan(&self) -> TableScanBuilder<'_> {
        TableScanBuilder::new(self.0)
    }

   pub fn metadata(&self) -> TableMetadataRef {...}
}

The reason we need a StaticTable is that it's readoly and mainly used for scan. With this approach we can forbid applying transaction on StaticTable in a type safe approach. What do you think?

@a-agmon
Copy link
Contributor

a-agmon commented Mar 12, 2024

Thanks @liurenjie1024
Sounds good to me.

@a-agmon
Copy link
Contributor

a-agmon commented Mar 13, 2024

Hi @liurenjie1024 ,
Please take a look at the PR
How does this look?

@liurenjie1024
Copy link
Contributor Author

Hi, @a-agmon I've reviewed the pr and it looks great! I've left small comments to improve it.

@Xuanwo
Copy link
Member

Xuanwo commented Mar 14, 2024

From the discussion: https://apache-iceberg.slack.com/archives/C05HTENMJG4/p1710392414486259

The reason we need StaticTable Is that we don't want to allow modidication on tables loaded directly from metadata.

Should simply returning an error for modification operations be acceptable?

The current design requires users to write separate logic for Table and StaticTable, complicating their use in Rust, unlike Python where they can be used interchangeably.

@liurenjie1024
Copy link
Contributor Author

Should simply returning an error for modification operations be acceptable?

Yeah, that's one approach. But The motivation behind Is that we want to do it in a type safe way without runtime check.

@Xuanwo
Copy link
Member

Xuanwo commented Mar 14, 2024

But The motivation behind Is that we want to do it in a type safe way without runtime check.

Do you think it's a good idea to provide both ways?

@liurenjie1024
Copy link
Contributor Author

But The motivation behind Is that we want to do it in a type safe way without runtime check.

Do you think it's a good idea to provide both ways?

It's kind of weird to me to have both of them. Is there any concrete use case?

@Xuanwo
Copy link
Member

Xuanwo commented Mar 14, 2024

It's kind of weird to me to have both of them. Is there any concrete use case?

Database like databend supports loading iceberg table from both catalog or paths of a storage services.

For example:

CREATE TABLE <table_name> 
ENGINE = Iceberg 
LOCATION = 's3://<path_to_table>' 
CONNECTION_NAME = '<connection_name>'

If we provides Table and StaticTable, users will need to wrap them again like the following.

enum IcebergTable {
   Table(iceberg::Table), // build from catalog
   StaticTable(iceberg::StaticTable) // build from a path of table metadata
}

@a-agmon
Copy link
Contributor

a-agmon commented Mar 14, 2024

Perhaps we can also implement from_metadata_file() on Table? or provide an accessor to Table from StaticTable with some warning on usage?

@liurenjie1024
Copy link
Contributor Author

It's kind of weird to me to have both of them. Is there any concrete use case?

Database like databend supports loading iceberg table from both catalog or paths of a storage services.

For example:

CREATE TABLE <table_name> 

ENGINE = Iceberg 

LOCATION = 's3://<path_to_table>' 

CONNECTION_NAME = '<connection_name>'

If we provides Table and StaticTable, users will need to wrap them again like the following.

enum IcebergTable {

   Table(iceberg::Table), // build from catalog

   StaticTable(iceberg::StaticTable) // build from a path of table metadata

}

Make sense. Will following code work for you?

struct Table <const N: bool>{}

@Xuanwo
Copy link
Member

Xuanwo commented Mar 14, 2024

Will following code work for you?

The issue I encountered is that the Table's read-only status is determined at runtime, not compile time. Introducing a const READ_ONLY: bool would require us to pass this constant throughout our API.

@liurenjie1024
Copy link
Contributor Author

Perhaps we can also implement from_metadata_file() on Table? or provide an accessor to Table from StaticTable with some warning on usage?

Good point, how about allowing access inner Table as unsafe?

@liurenjie1024
Copy link
Contributor Author

If that still not works for you, I agree that we provide both apis.

@Xuanwo
Copy link
Member

Xuanwo commented Mar 14, 2024

Good point, how about allowing access inner Table as unsafe?

Providing a StaticTable::into_table() should work fine to me.

Alought I still stick with my idea that Table should be able to build without catalog and perform runtime checks againest read_only status, returning errors if user tried to write. Maybe we can hide this under a feature for user?

By the way, unsafe is preserved for memory safe in rust. It doesn't make sense to mark the access to inner table as unsafe.

@liurenjie1024
Copy link
Contributor Author

Alought I still stick with my idea that Table should be able to build without catalog and perform runtime checks againest read_only status, returning errors if user tried to write. Maybe we can hide this under a feature for user?

How about this:

  1. We add a readonly field in Table

  2. When it load from catalog, readonly Is false

  3. We keep StaticTable so that user have a type safe readonly table

  4. Add into method for StaticTable to get inner Table, and this case We set readonly to true

@a-agmon
Copy link
Contributor

a-agmon commented Mar 14, 2024

Maybe we can also incorporate something like this at this stage. like adding a readonly property that will default to false, while the StaticTable sets it as true.

/// Table represents a table in the catalog.
#[derive(TypedBuilder, Debug)]
pub struct Table {
    file_io: FileIO,
    #[builder(default, setter(strip_option, into))]
    metadata_location: Option<String>,
    #[builder(setter(into))]
    metadata: TableMetadataRef,
    identifier: TableIdent,
    #[builder(default = false)]
    readonly: bool,
}

I tried to sketch it here for example. I can add this if that seems right.

@liurenjie1024
Copy link
Contributor Author

Maybe we can also incorporate something like this at this stage. like adding a readonly property that will default to false, while the StaticTable sets it as true.

/// Table represents a table in the catalog.

#[derive(TypedBuilder, Debug)]

pub struct Table {

    file_io: FileIO,

    #[builder(default, setter(strip_option, into))]

    metadata_location: Option<String>,

    #[builder(setter(into))]

    metadata: TableMetadataRef,

    identifier: TableIdent,

    #[builder(default = false)]

    readonly: bool,

}

I tried to sketch it here for example. I can add this if that seems right.

Looks reasonably to me, thanks for this!

@a-agmon
Copy link
Contributor

a-agmon commented Mar 15, 2024

Maybe we can also incorporate something like this at this stage. like adding a readonly property that will default to false, while the StaticTable sets it as true.

/// Table represents a table in the catalog.

#[derive(TypedBuilder, Debug)]

pub struct Table {

    file_io: FileIO,

    #[builder(default, setter(strip_option, into))]

    metadata_location: Option<String>,

    #[builder(setter(into))]

    metadata: TableMetadataRef,

    identifier: TableIdent,

    #[builder(default = false)]

    readonly: bool,

}

I tried to sketch it here for example. I can add this if that seems right.

Looks reasonably to me, thanks for this!

Great! PR is now also updated with changes.

@liurenjie1024
Copy link
Contributor Author

Close by #259 Feel free to open when if necessary.

@github-project-automation github-project-automation bot moved this from Todo to Done in iceberg-rust Mar 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants