Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL Catalog #503

Closed
wants to merge 52 commits into from
Closed

SQL Catalog #503

wants to merge 52 commits into from

Conversation

callum-ryan
Copy link
Contributor

This is an add-on from @JanKaul's work in a previous PR - we have spoken on slack and he is happy for me to open this PR.

Signed-off-by: callum-ryan <[email protected]>
Signed-off-by: callum-ryan <[email protected]>
Signed-off-by: callum-ryan <[email protected]>
Signed-off-by: callum-ryan <[email protected]>
Signed-off-by: callum-ryan <[email protected]>
Copy link
Contributor

@sdd sdd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great first PR @callum-ryan, and thanks to @JanKaul for the original work on this and letting Callum pick it up. Great to see a new contributor 😄

This looks almost perfect for me, with just a few minor comments.

vec![Some(&catalog_name), Some(&namespace), Some(&name)],
)
.await?;
assert_eq!(row.len(), 1, "expected only one row from load_table query");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this assertion fails, the service panics and exits. A more graceful failure would be desirable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point, in hindsight the table's primary key should stop this even happening so I have removed.

crates/catalog/sql/src/catalog.rs Outdated Show resolved Hide resolved
crates/catalog/sql/src/catalog.rs Outdated Show resolved Hide resolved
crates/catalog/sql/src/catalog.rs Outdated Show resolved Hide resolved
let src_table_exist = self.table_exists(src).await;
let dst_table_exist = self.table_exists(dest).await;

match (src_table_exist, dst_table_exist) {
Copy link
Contributor

@sdd sdd Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking something more like this:

match (src_table_exist, dst_table_exist) {
    (Ok(true), Ok(false)) => Ok(())
    (_, Ok(true)) => {
        Err(Error::new(
            ErrorKind::Unexpected,
            "failed to rename table as destination already exists",
        ))
    }
    (Ok(false), _) => {
        Err(Error::new(
            ErrorKind::Unexpected,
            "failed to rename table as source does not exist",
        ))
    }
    _ => Err(Error::new(ErrorKind::Unexpected, "failed to rename table")),
}?;

Copy link
Contributor

@fqaiser94 fqaiser94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @callum-ryan 👋

I can help with reviewing this PR since I recently completed the Memory Catalog implementation for iceberg-rust.
I took a quick look at your PR and I could leave a lot of comments but I worry that would be a little demotivating.
Instead, I want to see if we can try something a little different.

I have commented 6 suggestions below.
I recommend you accept each of these suggestions blindly (just click on the Commit suggestion button next to each one).
These suggestions add ~1500 lines of code defining 59 tests that cover all the behaviours I would expect from a correctly functioning Catalog implementation.
Currently, your implementation passes 25 of them.
Here's the fun part: Try refactoring your code so it passes the remaining 34 tests as well before we continue with the review (let me know if you get stuck, I'm happy to help!)

Does that sounds like a good plan to you? 👍 👎

crates/catalog/sql/Cargo.toml Show resolved Hide resolved
crates/catalog/sql/src/catalog.rs Show resolved Hide resolved
crates/catalog/sql/src/catalog.rs Outdated Show resolved Hide resolved
crates/catalog/sql/src/catalog.rs Show resolved Hide resolved
crates/catalog/sql/src/catalog.rs Outdated Show resolved Hide resolved
crates/catalog/sql/src/catalog.rs Outdated Show resolved Hide resolved
@callum-ryan
Copy link
Contributor Author

Hi @callum-ryan 👋

I can help with reviewing this PR since I recently completed the Memory Catalog implementation for iceberg-rust. I took a quick look at your PR and I could leave a lot of comments but I worry that would be a little demotivating. Instead, I want to see if we can try something a little different.

I have commented 6 suggestions below. I recommend you accept each of these suggestions blindly (just click on the Commit suggestion button next to each one). These suggestions add ~1500 lines of code defining 59 tests that cover all the behaviours I would expect from a correctly functioning Catalog implementation. Currently, your implementation passes 25 of them. Here's the fun part: Try refactoring your code so it passes the remaining 34 tests as well before we continue with the review (let me know if you get stuck, I'm happy to help!)

Does that sounds like a good plan to you? 👍 👎

Thanks for this @fqaiser94, I like this approach! I will work on the refactor most likely tomorrow

@Xuanwo
Copy link
Member

Xuanwo commented Aug 3, 2024

I recommend you accept each of these suggestions blindly (just click on the Commit suggestion button next to each one).
These suggestions add ~1500 lines of code defining 59 tests that cover all the behaviours I would expect from a correctly functioning Catalog implementation.

Hi @fqaiser94, do you think it would be a good idea to create a new crate named catalog_integration_tests for those tests, so that all catalogs can use them? I imagine that each catalog only needs to be created once and then passed into test_catalog().

@fqaiser94
Copy link
Contributor

Hi @fqaiser94, do you think it would be a good idea to create a new crate named catalog_integration_tests for those tests, so that all catalogs can use them? I imagine that each catalog only needs to be created once and then passed into test_catalog().

@Xuanwo Ha, yea I was thinking of something similar but I definitely don't want to overload this PR with too many concerns. Let's see how things go in this PR first and if it still makes sense, I can potentially follow-up with another PR to consolidate catalog testing where it makes sense. The main concern I have is that the existing catalogs don't all follow exactly the same behaviour, there are small, subtle differences in behaviour like checking for tables before dropping a namespace, error messaging, etc. It may or may not be worth aligning behaviours but we can discuss all of that when we get there.

Eh I might start working on this in in parallel, time permitting.
Created a ticket for it here: #519
Let's continue any further discussion there.

Comment on lines +48 to +50
sqlite = ["sqlx/sqlite"]
postgres = ["sqlx/postgres"]
mysql = ["sqlx/mysql"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I'm considering whether it's possible to avoid providing features ourselves. Instead, users can active whether they want by adding sqlx in their dependencs try:

iceberg_catalog_sql = "0.x"
sqlx = { version = "0.7.4", features = ["sqlite", "postgres"] }

So we only depends on the interface of sqlx. Users can pick up whatever driver they want (even those implemented by their own).

Comment on lines +80 to +84
enum DatabaseType {
PostgreSQL,
MySQL,
SQLite,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about removing DatabaseType? We only need to handle the PostgreSQL cases, so we can add its name as a constant.

In this way, we don't need to decide which database we want to support.

};

sqlx::query(
&format!("create table if not exists {} ({} varchar(255) not null, {} varchar(255) not null, {} varchar(255) not null, {} varchar(255), {} varchar(255), {} varchar(5), primary key ({}, {}, {}))",
Copy link
Member

@Xuanwo Xuanwo Aug 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.map_err(from_sqlx_error)?;

sqlx::query(
&format!("create table if not exists {} ({} varchar(255) not null, {} varchar(255) not null, {} varchar(255), {} varchar(255), primary key ({}, {}, {}))",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about format string in SQL way to make them easy to read?

format!("
create table if not exists {NAMESPACE_PROPERTIES_TABLE_NAME} (
    {CATALOG_NAME} varchar(255) not null,
    ....
    primary key (
        {CATALOG_NAME}, 
        {NAMESPACE_NAME}, 
        {NAMESPACE_PROPERTY_KEY}
    )
")

@Xuanwo
Copy link
Member

Xuanwo commented Aug 5, 2024

Hi @callum-ryan, I believe the entire SQL Catalog is quite a large project. Although @JanKaul and you have put in a lot of effort, there are still many areas that need improvement before it can be merged.

Do you think it would be better to split this into multiple PRs so we can iterate faster and more smoothly? Or, do you think this PR is good enough to merge now and we can continue improving it later?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants