-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL Catalog #503
SQL Catalog #503
Conversation
Co-authored-by: Renjie Liu <[email protected]>
Co-authored-by: Renjie Liu <[email protected]>
Signed-off-by: callum-ryan <[email protected]>
Signed-off-by: callum-ryan <[email protected]>
Signed-off-by: callum-ryan <[email protected]>
Signed-off-by: callum-ryan <[email protected]>
Signed-off-by: callum-ryan <[email protected]>
Signed-off-by: callum-ryan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great first PR @callum-ryan, and thanks to @JanKaul for the original work on this and letting Callum pick it up. Great to see a new contributor 😄
This looks almost perfect for me, with just a few minor comments.
crates/catalog/sql/src/catalog.rs
Outdated
vec![Some(&catalog_name), Some(&namespace), Some(&name)], | ||
) | ||
.await?; | ||
assert_eq!(row.len(), 1, "expected only one row from load_table query"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this assertion fails, the service panics and exits. A more graceful failure would be desirable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good point, in hindsight the table's primary key should stop this even happening so I have removed.
Signed-off-by: callum-ryan <[email protected]>
let src_table_exist = self.table_exists(src).await; | ||
let dst_table_exist = self.table_exists(dest).await; | ||
|
||
match (src_table_exist, dst_table_exist) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking something more like this:
match (src_table_exist, dst_table_exist) {
(Ok(true), Ok(false)) => Ok(())
(_, Ok(true)) => {
Err(Error::new(
ErrorKind::Unexpected,
"failed to rename table as destination already exists",
))
}
(Ok(false), _) => {
Err(Error::new(
ErrorKind::Unexpected,
"failed to rename table as source does not exist",
))
}
_ => Err(Error::new(ErrorKind::Unexpected, "failed to rename table")),
}?;
Signed-off-by: callum-ryan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @callum-ryan 👋
I can help with reviewing this PR since I recently completed the Memory Catalog implementation for iceberg-rust.
I took a quick look at your PR and I could leave a lot of comments but I worry that would be a little demotivating.
Instead, I want to see if we can try something a little different.
I have commented 6 suggestions below.
I recommend you accept each of these suggestions blindly (just click on the Commit suggestion
button next to each one).
These suggestions add ~1500 lines of code defining 59 tests that cover all the behaviours I would expect from a correctly functioning Catalog
implementation.
Currently, your implementation passes 25 of them.
Here's the fun part: Try refactoring your code so it passes the remaining 34 tests as well before we continue with the review (let me know if you get stuck, I'm happy to help!)
Does that sounds like a good plan to you? 👍 👎
Co-authored-by: Farooq Qaiser <[email protected]>
Co-authored-by: Farooq Qaiser <[email protected]>
Co-authored-by: Farooq Qaiser <[email protected]>
Co-authored-by: Farooq Qaiser <[email protected]>
Co-authored-by: Farooq Qaiser <[email protected]>
Co-authored-by: Farooq Qaiser <[email protected]>
Thanks for this @fqaiser94, I like this approach! I will work on the refactor most likely tomorrow |
Hi @fqaiser94, do you think it would be a good idea to create a new crate named |
@Xuanwo Ha, yea I was thinking of something similar but I definitely don't want to overload this PR with too many concerns. Let's see how things go in this PR first and if it still makes sense, I can potentially follow-up with another PR to consolidate catalog testing where it makes sense. The main concern I have is that the existing catalogs don't all follow exactly the same behaviour, there are small, subtle differences in behaviour like checking for tables before dropping a namespace, error messaging, etc. It may or may not be worth aligning behaviours but we can discuss all of that when we get there. Eh I might start working on this in in parallel, time permitting. |
sqlite = ["sqlx/sqlite"] | ||
postgres = ["sqlx/postgres"] | ||
mysql = ["sqlx/mysql"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I'm considering whether it's possible to avoid providing features ourselves. Instead, users can active whether they want by adding sqlx
in their dependencs try:
iceberg_catalog_sql = "0.x"
sqlx = { version = "0.7.4", features = ["sqlite", "postgres"] }
So we only depends on the interface of sqlx. Users can pick up whatever driver they want (even those implemented by their own).
enum DatabaseType { | ||
PostgreSQL, | ||
MySQL, | ||
SQLite, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about removing DatabaseType
? We only need to handle the PostgreSQL cases, so we can add its name as a constant.
In this way, we don't need to decide which database we want to support.
}; | ||
|
||
sqlx::query( | ||
&format!("create table if not exists {} ({} varchar(255) not null, {} varchar(255) not null, {} varchar(255) not null, {} varchar(255), {} varchar(255), {} varchar(5), primary key ({}, {}, {}))", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with Java or Python implementation. Are they aligned?
Python:
Java:
.map_err(from_sqlx_error)?; | ||
|
||
sqlx::query( | ||
&format!("create table if not exists {} ({} varchar(255) not null, {} varchar(255) not null, {} varchar(255), {} varchar(255), primary key ({}, {}, {}))", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about format string in SQL way to make them easy to read?
format!("
create table if not exists {NAMESPACE_PROPERTIES_TABLE_NAME} (
{CATALOG_NAME} varchar(255) not null,
....
primary key (
{CATALOG_NAME},
{NAMESPACE_NAME},
{NAMESPACE_PROPERTY_KEY}
)
")
Hi @callum-ryan, I believe the entire SQL Catalog is quite a large project. Although @JanKaul and you have put in a lot of effort, there are still many areas that need improvement before it can be merged. Do you think it would be better to split this into multiple PRs so we can iterate faster and more smoothly? Or, do you think this PR is good enough to merge now and we can continue improving it later? |
This is an add-on from @JanKaul's work in a previous PR - we have spoken on slack and he is happy for me to open this PR.