Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configure root path in Catalog or FileIO? #488

Closed
fqaiser94 opened this issue Jul 26, 2024 · 8 comments · Fixed by #511
Closed

Configure root path in Catalog or FileIO? #488

fqaiser94 opened this issue Jul 26, 2024 · 8 comments · Fixed by #511

Comments

@fqaiser94
Copy link
Contributor

fqaiser94 commented Jul 26, 2024

There was an unresolved discussion about whether we should:

  • Allow users to configure a path as a root of the catalog and create tables underneath that (see earlier commits in Add memory catalog implementation #475 to see an implementation of this idea) or
    • Pros:
    • Cons:
      • Adds another parameter for initializing MemoryCatalog
  • Users should configure FileIO with a root path (this is the approach currently implemented in iceberg-rust code)
    • Pros:
      • Consolidates path-related concerns in FileIO
    • Cons:
      • I think users may have to be aware of this implementation detail because when they send a CreateTableRequest, the location should not provide an absolute path, but rather a "relative" path from the root location assumed by FileIO (that or we need to add some logic to recognize relative and absolute file paths in the FileIO implementation). I'll also add that Iceberg community has so far not been super receptive to the idea of relative paths.
      • None of the Java FileIO implementations I looked at (S3FileIO, GcsFileIO) allow configuring a root path. I suspect this may affect interoperability with the Java ecosystem, not 100% sure though.

Overall, I have concerns about option 2 (which is what is currently implemented) and am currently in favour of changing the code to follow option 1: allow users to configure a path as a root of the catalog and create tables underneath that.

Thoughts?

@fqaiser94 fqaiser94 changed the title Catalog Configure root path in Catalog or FileIO? Jul 26, 2024
@fqaiser94
Copy link
Contributor Author

cc: @liurenjie1024 @Xuanwo let me know what you think, I've shared my thoughts above.

@liurenjie1024
Copy link
Contributor

liurenjie1024 commented Jul 28, 2024

Thanks @fqaiser94 for raising this.

In short, I'm also in favor of option 1 to align with java/python implementations.

The root path is a property of underlying storage layer(Apache OpenDAL), and I've checked code that currently we haven't exposed it to user. Currently the root is always set to top most dir(see fs storage). Also as @fqaiser94 currently iceberg community has no support for relative path, so setting root to top most dir fits into this requirement best to me.

@fqaiser94
Copy link
Contributor Author

Created a PR to switch to option 1 here: #494
Might be worth waiting for @Xuanwo to chime in first though since they proposed option 2 originally.

@Xuanwo
Copy link
Member

Xuanwo commented Jul 28, 2024

Thanks for starting this discussion first.

My reason for not including a default_table_root_location (or default_warehouse_location) at the catalog implementation level is that it might confuse our users about the actual location they are using.

My initial idea was to move root to the FileIO level for greater consistency and clarity. However, as @liurenjie1024 and @fqaiser94 pointed out, this property does not exist.

So I visit the pyiceberg's logic and found that it has a _get_default_warehouse_location:

https://github.com/apache/iceberg-python/blob/be5c42649914e71e8366c22558f8234ce062b145/pyiceberg/catalog/__init__.py#L885-L895

    def _get_default_warehouse_location(self, database_name: str, table_name: str) -> str:
        database_properties = self.load_namespace_properties(database_name)
        if database_location := database_properties.get(LOCATION):
            database_location = database_location.rstrip("/")
            return f"{database_location}/{table_name}"

        if warehouse_path := self.properties.get(WAREHOUSE_LOCATION):
            warehouse_path = warehouse_path.rstrip("/")
            return f"{warehouse_path}/{database_name}.db/{table_name}"

        raise ValueError("No default path is set, please specify a location when creating a table")

I believe it makes more sense for us to share the implementation of the default warehouse location logic.

  • load database location if database_properties has LOCATION, and use {database_location}/{table_name}
  • Otherwise, try WAREHOUSE_LOCATION and use {warehouse_path}/{database_name}.db/{table_name}
  • And finally returns error.

Maybe we can align with this behavior?

@liurenjie1024
Copy link
Contributor

Thanks for starting this discussion first.

My reason for not including a default_table_root_location (or default_warehouse_location) at the catalog implementation level is that it might confuse our users about the actual location they are using.

My initial idea was to move root to the FileIO level for greater consistency and clarity. However, as @liurenjie1024 and @fqaiser94 pointed out, this property does not exist.

So I visit the pyiceberg's logic and found that it has a _get_default_warehouse_location:

https://github.com/apache/iceberg-python/blob/be5c42649914e71e8366c22558f8234ce062b145/pyiceberg/catalog/__init__.py#L885-L895

    def _get_default_warehouse_location(self, database_name: str, table_name: str) -> str:
        database_properties = self.load_namespace_properties(database_name)
        if database_location := database_properties.get(LOCATION):
            database_location = database_location.rstrip("/")
            return f"{database_location}/{table_name}"

        if warehouse_path := self.properties.get(WAREHOUSE_LOCATION):
            warehouse_path = warehouse_path.rstrip("/")
            return f"{warehouse_path}/{database_name}.db/{table_name}"

        raise ValueError("No default path is set, please specify a location when creating a table")

I believe it makes more sense for us to share the implementation of the default warehouse location logic.

  • load database location if database_properties has LOCATION, and use {database_location}/{table_name}
  • Otherwise, try WAREHOUSE_LOCATION and use {warehouse_path}/{database_name}.db/{table_name}
  • And finally returns error.

Maybe we can align with this behavior?

Sounds reasonable to me. A more complete method is here: https://github.com/apache/iceberg-python/blob/055938d36d46efb94849b1d861cd8a8a111f6ae5/pyiceberg/catalog/__init__.py#L880

  1. Use user passed table location first.
  2. Fallback to database location.
  3. Fallback to warehouse location, the default_warehouse config could be treated as warehouse location property.

@Xuanwo
Copy link
Member

Xuanwo commented Jul 29, 2024

Sounds reasonable to me. A more complete method is here: https://github.com/apache/iceberg-python/blob/055938d36d46efb94849b1d861cd8a8a111f6ae5/pyiceberg/catalog/__init__.py#L880

Thank you for the correction. This logic seems sound to me.

@fqaiser94
Copy link
Contributor Author

Ack, let me see if I can refactor a little bit to match the suggested logic.

@fqaiser94
Copy link
Contributor Author

Accidentally closed the previous PR (and discovered you can't reopen closed PRs after a while 🤯).

New PR here: #511

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants