Use the storage location name as the key in the clade schema #556

Merged · 3 commits · Jul 12, 2024
8 changes: 5 additions & 3 deletions clade/proto/schema.proto
@@ -11,16 +11,18 @@ message TableObject {
   string name = 1;
   // Path within the provided storage location, if any
   string path = 2;
Comment from @SergeiPatiakin (Collaborator), Jul 9, 2024:

I know we don't care about backwards compatibility here, but should we try to follow the Protobuf guidelines on field number assignment (https://protobuf.dev/programming-guides/proto3/#assigning)? This could help make errors clearer in case of a server/client protocol mismatch.

Specifically:

  • NewProtocol.TableObject.store seems like a distinct field from OldProtocol.TableObject.location, so should use a new field number
  • NewProtocol.StorageLocation.location seems like the same field as OldProtocol.StorageLocation.location, so should stay as field number 1
  • NewProtocol.StorageLocation.options seems like the same field as OldProtocol.StorageLocation.options, so should stay as field number 2
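The linked guideline additionally recommends reserving retired field numbers so they can never be reused. A hypothetical sketch of `TableObject` following the reviewer's advice — the `reserved` statements are an illustration of that guideline, not part of this PR:

```proto
syntax = "proto3";

message TableObject {
  string name = 1;
  // Path within the provided storage location, if any
  string path = 2;
  // Field 3 used to be `optional string location`; reserving the number
  // and name means a stale client and a new server can never silently
  // reinterpret each other's bytes under the old meaning.
  reserved 3;
  reserved "location";
  // Storage location identifier: a new concept, so a new field number
  optional string store = 4;
}
```

With `reserved` in place, reusing field 3 later would fail at compile time in `protoc`, rather than surfacing as a confusing decode error at runtime.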

-  // URL of the root storage location
-  optional string location = 3;
+  // Storage location identifier
+  optional string store = 4;
 }

 // A single root storage location, hosting many individual tables
 message StorageLocation {
-  // URL of the root storage location
+  // URL of the storage location root
   string location = 1;
   // Connection options for the object store client
   map<string, string> options = 2;
+  // Unique storage location identifier
+  string name = 3;
 }

 message ListSchemaRequest {
19 changes: 11 additions & 8 deletions src/catalog/metastore.rs
@@ -26,6 +26,9 @@ use std::str::FromStr;
 use std::sync::Arc;
 use url::Url;
 
+// Root URL for a storage location alongside client connection options
+type LocationAndOptions = (String, HashMap<String, String>);
+
 // This is the main entrypoint to all individual catalogs for various objects types.
 // The intention is to make it extensible and de-coupled from the underlying metastore
 // persistence mechanism (such as the presently used `Repository`).
@@ -97,7 +100,7 @@ impl Metastore {
         let store_options = catalog_schemas
             .stores
             .into_iter()
-            .map(|store| (store.location, store.options))
+            .map(|store| (store.name, (store.location, store.options)))
             .collect();
 
         // Turn the list of all collections, tables and their columns into a nested map.
@@ -119,7 +122,7 @@ impl Metastore {
     async fn build_schema(
         &self,
         schema: SchemaObject,
-        store_options: &HashMap<String, HashMap<String, String>>,
+        store_options: &HashMap<String, LocationAndOptions>,
     ) -> CatalogResult<(Arc<str>, Arc<SeafowlSchema>)> {
         let schema_name = schema.name;
@@ -140,21 +143,21 @@ impl Metastore {
     async fn build_table(
         &self,
         table: TableObject,
-        store_options: &HashMap<String, HashMap<String, String>>,
+        store_options: &HashMap<String, LocationAndOptions>,
     ) -> CatalogResult<(Arc<str>, Arc<dyn TableProvider>)> {
         // Build a delta table but don't load it yet; we'll do that only for tables that are
         // actually referenced in a statement, via the async `table` method of the schema provider.
         // TODO: this means that any `information_schema.columns` query will serially load all
         // delta tables present in the database. The real fix for this is to make DF use `TableSource`
         // for the information schema, and then implement `TableSource` for `DeltaTable` in delta-rs.
 
-        let table_log_store = match table.location {
+        let table_log_store = match table.store {
             // Use the provided customized location
-            Some(location) => {
-                let this_store_options = store_options
-                    .get(&location)
+            Some(name) => {
+                let (location, this_store_options) = store_options
+                    .get(&name)
                     .ok_or(CatalogError::Generic {
-                        reason: format!("Object store for location {location} not found"),
+                        reason: format!("Object store with name {name} not found"),
                     })?
                     .clone();
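Since the change is spread across files, here is a self-contained sketch of the new keying scheme: the lookup map is now keyed by the store's unique name instead of its location URL. The catalog types are reduced to minimal stand-ins (field names follow the proto schema; error handling is simplified to `String` in place of `CatalogError`):

```rust
use std::collections::HashMap;

// Root URL for a storage location alongside client connection options,
// mirroring the `LocationAndOptions` alias introduced in the diff.
type LocationAndOptions = (String, HashMap<String, String>);

// Simplified stand-in for the prost-generated `StorageLocation` message.
struct StorageLocation {
    name: String,
    location: String,
    options: HashMap<String, String>,
}

// Key the map by the unique store name rather than the location URL,
// as the updated `Metastore` code now does.
fn store_options(
    stores: Vec<StorageLocation>,
) -> HashMap<String, LocationAndOptions> {
    stores
        .into_iter()
        .map(|store| (store.name, (store.location, store.options)))
        .collect()
}

// Resolve a table's store name to its root URL plus client options,
// mirroring the lookup in `build_table`.
fn resolve(
    name: &str,
    options: &HashMap<String, LocationAndOptions>,
) -> Result<LocationAndOptions, String> {
    options
        .get(name)
        .cloned()
        .ok_or_else(|| format!("Object store with name {name} not found"))
}
```

One consequence of this design is that two stores may share a location URL (e.g. the same bucket with different credentials) as long as their names differ, which the URL-keyed map could not express.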
2 changes: 1 addition & 1 deletion src/catalog/repository.rs
@@ -130,7 +130,7 @@ impl SchemaStore for RepositoryStore {
             Some(TableObject {
                 name: name.clone(),
                 path: uuid.to_string(),
-                location: None,
+                store: None,
             })
         } else {
             None
8 changes: 5 additions & 3 deletions tests/fixtures.rs
@@ -29,28 +29,29 @@ pub fn schemas() -> ListSchemaResponse {
             tables: vec![TableObject {
                 name: "file".to_string(),
                 path: "delta-0.8.0-partitioned".to_string(),
-                location: None,
+                store: None,
             }],
         },
         SchemaObject {
             name: "s3".to_string(),
             tables: vec![TableObject {
                 name: "minio".to_string(),
                 path: "test-data/delta-0.8.0-partitioned".to_string(),
-                location: Some("s3://seafowl-test-bucket".to_string()),
+                store: Some("minio".to_string()),
             }],
         },
         SchemaObject {
             name: "gcs".to_string(),
             tables: vec![TableObject {
                 name: "fake".to_string(),
                 path: "delta-0.8.0-partitioned".to_string(),
-                location: Some("gs://test-data".to_string()),
+                store: Some("fake-gcs".to_string()),
             }],
         },
     ],
     stores: vec![
         StorageLocation {
+            name: "minio".to_string(),
             location: "s3://seafowl-test-bucket".to_string(),
             options: HashMap::from([
                 (
@@ -74,6 +75,7 @@ pub fn schemas() -> ListSchemaResponse {
             ]),
         },
         StorageLocation {
+            name: "fake-gcs".to_string(),
             location: "gs://test-data".to_string(),
             options: HashMap::from([(
                 GoogleConfigKey::ServiceAccount.as_ref().to_string(),