
[Iceberg]Support setting warehouse data directory for Hadoop catalog #24397

Open · wants to merge 9 commits into master
Conversation

@hantangwangd (Member) commented Jan 18, 2025

Description

This PR enables the Presto Iceberg Hadoop catalog to specify an independent warehouse data directory for storing table data files. In this way, we can manage metadata files on HDFS and store data files on object stores in a production environment.

See issue: #24383

Motivation and Context

Enables Presto Iceberg to leverage the capabilities of object stores.

Impact

The Hadoop catalog gains the capability to leverage object stores.

Test Plan

  • Built an object storage environment based on a MinIO Docker container, configured iceberg.catalog.warehouse to a local file path and iceberg.catalog.hadoop.warehouse.datadir to an S3 path, then fully ran IcebergDistributedTestBase, IcebergDistributedSmokeTestBase, and TestIcebergDistributedQueries in the CI tests. A configuration sketch is shown below.
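For reference, a minimal sketch of the kind of catalog configuration exercised above, written in the query-runner style used by the tests. The paths, bucket name, endpoint, and S3 settings are illustrative placeholders (assuming the usual java.util.Map and Guava ImmutableMap imports); this is not the actual CI configuration.

```java
// Illustrative sketch only: values are placeholders, not the actual test configuration.
Map<String, String> icebergProperties = ImmutableMap.<String, String>builder()
        .put("iceberg.catalog.type", "hadoop")
        // metadata files stay on a local/HDFS warehouse path
        .put("iceberg.catalog.warehouse", "/tmp/iceberg-warehouse")
        // data files of newly created tables are written to the MinIO-backed S3 bucket
        .put("iceberg.catalog.hadoop.warehouse.datadir", "s3://test-bucket/warehouse-data/")
        .put("hive.s3.endpoint", "http://127.0.0.1:9000")
        .put("hive.s3.path-style-access", "true")
        .build();
queryRunner.createCatalog("iceberg", "iceberg", icebergProperties);
```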

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

Iceberg Connector Changes
* Add table property ``write.data.path`` to specify an independent data write path for Iceberg tables
* Add connector configuration property ``iceberg.catalog.hadoop.warehouse.datadir`` for the Hadoop catalog to specify the root data write path for its newly created tables

@hantangwangd hantangwangd force-pushed the support_seperate_write_data_location branch from 0279ea7 to 2390fb0 on January 18, 2025 15:14
@hantangwangd hantangwangd force-pushed the support_seperate_write_data_location branch from 2390fb0 to c3c00e1 on January 26, 2025 18:05
@hantangwangd hantangwangd marked this pull request as ready for review January 26, 2025 19:07
@hantangwangd hantangwangd changed the title [WIP][Iceberg]Support setting warehouse data directory for Hadoop catalog [Iceberg]Support setting warehouse data directory for Hadoop catalog Jan 27, 2025
@hantangwangd hantangwangd force-pushed the support_seperate_write_data_location branch from c3c00e1 to d18cd8b on January 27, 2025 14:05
@steveburnett (Contributor) left a comment

Thanks for the doc! Looks good, only a few nits of punctuation and capitalization.

@hantangwangd hantangwangd force-pushed the support_seperate_write_data_location branch from d18cd8b to 5fd6b9c on January 27, 2025 16:01
steveburnett previously approved these changes Jan 27, 2025

@steveburnett (Contributor) left a comment

LGTM! (docs)

Pulled updated branch, new local doc build, looks good. Thanks!

@hantangwangd (Member Author)

Thanks @steveburnett for your suggestion, fixed!

agrawalreetika previously approved these changes Jan 28, 2025

@agrawalreetika (Member) left a comment

Thanks for the change. LGTM

@imjalpreet (Member) left a comment

@hantangwangd Thanks for the PR! I took a first pass and had some minor comments.

@@ -1023,6 +1040,62 @@ public void setTableProperties(ConnectorSession session, ConnectorTableHandle ta
transaction.commitTransaction();
}

protected Map<String, String> populateTableProperties(ConnectorTableMetadata tableMetadata, com.facebook.presto.iceberg.FileFormat fileFormat, ConnectorSession session, CatalogType catalogType)
{
ImmutableMap.Builder<String, String> propertiesBuilder = ImmutableMap.builderWithExpectedSize(5);
Member:

nit: should we update the builder size here? I think initially when we had 5 properties it was set to 5 but I can see now we have more than 5.

hantangwangd (Member Author):

Sure, I have changed it to 16, which is a little larger than the current property count, since more table properties will be added in the future.

}

@Config("iceberg.catalog.warehouse.datadir")
@ConfigDescription("Iceberg catalog default root data writing directory")
Member:

nit: should we mention that it is only supported for Hadoop catalog in the ConfigDescription?

Member:

Does it make sense to add a check to throw an error at the server startup if this config is present in the config file when the catalog type is not Hadoop?

hantangwangd (Member Author):

Does it make sense to add a check to throw an error at the server startup if this config is present in the config file when the catalog type is not Hadoop?

I added this check in IcebergCommonModule, but I'm not quite sure whether we should throw an error or just ignore this property for non-Hadoop catalogs. Currently, iceberg.catalog.warehouse is simply ignored rather than throwing an error. Perhaps we can hear what other reviewers think @tdcmeehan @ZacBlanco @agrawalreetika @kiersten-stokes?
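For illustration, a hedged sketch of what such a startup check might look like. The getter names on IcebergConfig and the exact place it would be wired into IcebergCommonModule are assumptions, not the actual implementation, and checkState is the usual Guava Preconditions static import.

```java
// Sketch only: getter names are assumed; the real check may be wired differently in IcebergCommonModule.
private static void validateWarehouseDataDir(IcebergConfig config)
{
    checkState(config.getCatalogWarehouseDataDir() == null || config.getCatalogType() == CatalogType.HADOOP,
            "iceberg.catalog.warehouse.datadir is only supported for the Hadoop catalog");
}
```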

should we mention that it is only supported for Hadoop catalog in the ConfigDescription?

Sure, added!

@@ -52,6 +52,7 @@ public class IcebergNativeCatalogFactory
private final String catalogName;
protected final CatalogType catalogType;
private final String catalogWarehouse;
private final String catalogWarehouseDataDir;
Member:

Should we move this to IcebergNativeMetadata since it's only being used in that class and not in this class?

hantangwangd (Member Author):

IcebergNativeMetadata already holds this field. I think this is mainly a matter of parameter passing: IcebergNativeMetadata needs to receive the value of catalogWarehouseDataDir, so we need to hold catalogWarehouseDataDir in IcebergNativeCatalogFactory or IcebergNativeMetadataFactory so that we can pass it to the new IcebergNativeMetadata created in IcebergNativeMetadataFactory.create().

So I hold catalogWarehouseDataDir in IcebergNativeCatalogFactory and pass it to IcebergNativeMetadata through that factory, because I believe this is the approach requiring the smallest changes. Do you have a better way to pass catalogWarehouseDataDir into IcebergNativeMetadata?
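A minimal sketch of the parameter flow described above, with heavily simplified, hypothetical class shapes (only the catalogWarehouseDataDir hand-off is shown; the real classes carry many more dependencies):

```java
// Hypothetical, simplified shapes for illustration only.
class IcebergNativeCatalogFactory
{
    private final String catalogWarehouseDataDir;   // bound from connector configuration

    IcebergNativeCatalogFactory(String catalogWarehouseDataDir)
    {
        this.catalogWarehouseDataDir = catalogWarehouseDataDir;
    }

    String getCatalogWarehouseDataDir()
    {
        return catalogWarehouseDataDir;
    }
}

class IcebergNativeMetadataFactory
{
    private final IcebergNativeCatalogFactory catalogFactory;

    IcebergNativeMetadataFactory(IcebergNativeCatalogFactory catalogFactory)
    {
        this.catalogFactory = catalogFactory;
    }

    IcebergNativeMetadata create()
    {
        // each newly created metadata instance receives the data dir held by the catalog factory
        return new IcebergNativeMetadata(catalogFactory.getCatalogWarehouseDataDir());
    }
}

class IcebergNativeMetadata
{
    private final String catalogWarehouseDataDir;

    IcebergNativeMetadata(String catalogWarehouseDataDir)
    {
        this.catalogWarehouseDataDir = catalogWarehouseDataDir;
    }
}
```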

@@ -150,6 +152,34 @@ public void testShowCreateTable()
")", schemaName, getLocation(schemaName, "orders")));
}

@Test
public void testTableWithSpecifiedWriteDataLocation()
Member:

It looks like testTableWithSpecifiedWriteDataLocation and testShowCreateTableWithSpecifiedWriteDataLocation are the same. Can you please check?

hantangwangd (Member Author):

I think they look the same mainly because both are placeholder implementations indicating the feature is not supported by default in the base test class. In the relevant Hadoop catalog subclass, they are overridden with different implementation logic.

throws IOException
{
String dataWriteLocation = Files.createTempDirectory("test_table_with_specified_write_data_location3").toAbsolutePath().toString();
assertQueryFails(String.format("create table test_table_with_specified_write_data_location3(a int, b varchar) with (write_data_path = '%s')", dataWriteLocation),
Member:

Should we attempt to create a partitioned table for this test?

hantangwangd (Member Author):

Sure, done!

@tdcmeehan tdcmeehan self-assigned this Jan 29, 2025
@ZacBlanco (Contributor) left a comment

I just have a few high level comments. At the core, I understand what we're trying to solve, but I'm not sure yet if this is the right solution. What happens when we have a table which already exists at the warehouse directory but has a data directory which doesn't align with the new datadir property? Should we respect the property or the table in the case of an insert? This could be confusing for users.

And if the table already exists and doesn't have a metadata folder which is in a "safe" filesystem, should we error, or warn the user? Do we even have a way of knowing that a filesystem is "safe" (supports atomic renames)?

Correct me if I'm wrong, but in theory we could have already supported this use case within the iceberg connector by using the SetTableProperty procedure and just setting "write.data.path" on the individual tables, right? From my understanding, all this change does is provide a default for new tables and make it a viewable table property. I'm wondering if it might be better to provide this as a schema-level property that users can set, similar to how the hive connector has a schema-level "location" property. Then we can set defaults for the data path on schema creation, but override it at the table level if we prefer.

However, I don't believe the metadata path could be set that way since the HadoopCatalog relies on listing the warehouse metadata directories to get the schemas and tables.

Just some thoughts for discussion. I think I just want to refine our approach and make sure there isn't any ambiguous behavior from the user perspective.

icebergProperties.put("iceberg.file-format", format.name());
icebergProperties.putAll(getConnectorProperties(CatalogType.valueOf(catalogType), icebergDataDirectory));
icebergProperties.putAll(extraConnectorProperties);
queryRunner.createCatalog(ICEBERG_CATALOG, "iceberg", ImmutableMap.copyOf(icebergProperties));
Contributor:

Is there any reason to change this?

hantangwangd (Member Author):

Yes, we need a way to let the properties explicitly set in the test class to override the default properties set by the underlying query runner. For example, we need to explicitly specify iceberg.catalog.warehouse in the test class.

@@ -370,6 +393,9 @@ Property Name Description
``location`` Optionally specifies the file system location URI for
the table.

``write_data_path`` Optionally specifies the file system location URI for
Contributor:

I would actually prefer if we used the same property name as in Iceberg to make it less confusing to users. I know we haven't followed this before, but I feel like it makes more sense than what we have now. It allows for more continuity and makes it easier for users to look up the Iceberg property reference as well.

hantangwangd (Member Author):

Agree, I think this makes sense. Then how do you think we should handle the previous table properties?

@ZacBlanco (Contributor) commented Feb 3, 2025

We don't have to do it in this PR but here's a solution I think would work for deprecation of the old names:

Inside of IcebergTableProperties, create a map of (new property name -> old property name), say deprecatedProperties.

Then we update all existing table properties that have Iceberg equivalents to use the Iceberg name (if one exists).

Inside of the constructor or getTableProperties method of IcebergTableProperties, we take the deprecatedProperties map, iterate over the existing properties, and if a property has an entry in the deprecatedProperties map, add a new hidden property with the old name and the same parameters as the new property.

In each of the getXXX methods, we would also need to fall back to checking for the old property name in the map. We would also need a method to add a warning to the session that the table property name is deprecated. We can leave it like that for a few releases and eventually remove the deprecated property names.

This could all be done in a separate PR
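A rough sketch of that mapping idea, with illustrative names only (not the actual IcebergTableProperties implementation), assuming the usual java.util.Map and Guava ImmutableMap imports:

```java
// Sketch only: class and method names are illustrative.
public final class DeprecatedTablePropertySketch
{
    // new Iceberg-style name -> old Presto-specific name
    private static final Map<String, String> DEPRECATED_PROPERTIES =
            ImmutableMap.of("write.data.path", "write_data_path");

    private DeprecatedTablePropertySketch() {}

    public static Object getTableProperty(Map<String, Object> tableProperties, String name)
    {
        Object value = tableProperties.get(name);
        if (value == null && DEPRECATED_PROPERTIES.containsKey(name)) {
            // fall back to the hidden, deprecated name; a session deprecation warning could be emitted here
            value = tableProperties.get(DEPRECATED_PROPERTIES.get(name));
        }
        return value;
    }
}
```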

@hantangwangd (Member Author) commented Feb 5, 2025

It sounds good. I have changed the newly added table property name to write.data.path, which is exactly the same as Iceberg's corresponding table property name. Please take a look when you get a chance, thanks!

@hantangwangd hantangwangd force-pushed the support_seperate_write_data_location branch from 5fd6b9c to 19b2e45 on January 30, 2025 16:52
@hantangwangd (Member Author)

@ZacBlanco Thanks for your feedback, very glad to be able to clarify and refine the approach through discussion. Overall, I think the core point of this change is to separate the data write path and the metadata path of Iceberg tables in the Hadoop catalog, and to provide a simpler and more convenient way for users to specify the data write path of tables. I will reply item by item below. Please let me know if you have any comments.

What happens when we have a table which already exists at the warehouse directory but has a data directory which doesn't align with the new datadir property? Should we respect the property or the table in the case of an insert?

Undoubtedly, when inserting data, the Iceberg table's write.data.path table property will be respected. In general, iceberg.catalog.warehouse.datadir is a Hadoop-catalog-level configuration property used to define the default data write path at table creation. It will not affect the data write path of existing tables.

  • We can explicitly specify the write data path of a table when creating it, and later change the write data path at any time through the SetTableProperty procedure, so the write data path of a table does not need to be aligned with the warehouse data dir.

  • We can change the location of the warehouse data dir at any time, which will only affect newly created tables, and will not affect the data write path of existing tables.

  • When we want to change the data write path of a table, we should explicitly use the SetTableProperty procedure (which needs to support the property write.data.location). The Iceberg table itself follows a similar strategy: existing data files are not affected by a later change to write.data.location. (A usage sketch follows this list.)
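For illustration, the intended flow might look like the following in test-style code. The SQL statements are sketches with placeholder names, and the set_table_property invocation shape is an assumption based on this discussion, not a verified signature.

```java
// Illustrative only: schema, table, and path values are placeholders.
// 1. A new table picks up the catalog-level default from the warehouse data dir configuration.
queryRunner.execute("CREATE TABLE orders_a (id int, status varchar)");
// 2. The default can be overridden per table at creation time.
queryRunner.execute("CREATE TABLE orders_b (id int, status varchar) " +
        "WITH (\"write.data.path\" = 's3://bucket/orders_b_data')");
// 3. Changing the path later (procedure shape assumed) affects only files written afterwards.
queryRunner.execute("CALL iceberg.system.set_table_property('tpch', 'orders_b', 'write.data.path', 's3://bucket/new_path')");
```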

Do you think the above behaviors make sense?

And if the table already exists and doesn't have a metadata folder which is in a "safe" filesystem, should we error, or warn the user? Do we even have a way of knowing that a filesystem is "safe" (supports atomic renames)?

I think we could specify the requirements for the file system where the metadata is located in the documentation. But I don't think the Presto engine is responsible for identifying and distinguishing specific file systems and reminding users; this is the user's own responsibility, because they have a better understanding of their own usage scenarios. For example, a considerable number of users do not need to consider the transaction consistency issues caused by cross-engine interoperability.

in theory we could have already supported this use case within the iceberg connector by using SetTableProperty procedure and just setting "write.data.path" on the individual tables

The SetTableProperty procedure and CreateTable can both specify and set table properties, but their usage scenarios are different. I don't think SetTableProperty can replace CreateTable for specifying table properties; its main purpose is to alter table properties. And I think users may not accept being required to execute a set-properties procedure after creating a table just to specify the data write path. What do you think?

I'm wondering if it might be better providing this as schema-level property that users can set, similar to how the hive connector has a schema-level "location" property. Then, we can set defaults for the data path on schema creation, but override it on a table-level if we prefer.

Firstly, for the Hadoop catalog we cannot arbitrarily set the value of location, which also holds metadata, the way Hive does. Therefore, we should only set a property like data.location that covers data files only. Secondly, I don't think the schema-level concept you mentioned conflicts with the catalog-level data dir. Their roles are similar, just at different levels. As described above, both are used to define the default data write path at table creation, neither affects the data write path of existing tables, and both can be overridden at the table level if we prefer.

Therefore, if necessary, we can add this schema-level data dir property in the future. I think it can work well with the catalog-level data dir. What's your opinion?

@steveburnett (Contributor)

New release note guidelines as of last week: PR #24354 automatically adds links to this PR to the release notes. Please remove the manual PR link in the following format from the release note entries for this PR.

:pr:`12345`

I have updated the Release Notes Guidelines to remove the examples of manually adding the PR link.

@ZacBlanco (Contributor) commented Feb 3, 2025

Thanks for your response. I did some more thinking around this and concluded that we don't really have a consistent way of specifying location for each connector. But I think we should strive to do so.

Some examples:

  1. In the hive connector, we allow schemas to set the location property. It can also be set through an external_location table property; otherwise it is controlled by the HMS configuration (external to Presto).
  2. In the iceberg connector when using the Hive catalog, the location is determined by either the HMS configuration (external to presto), or configured by the user specifying a location property.
  3. In the Iceberg REST connector, table location is specified via the catalog-level iceberg.catalog.warehouse property.

My thoughts are that

  1. We should definitely provide the ability to set both the iceberg write.data.location and write.metadata.location through the table properties in the CREATE TABLE statements. I think that we should use the iceberg property names directly rather than doing something like location or write_data_location and write_metadata_location.
  2. I think providing this configuration in Presto's iceberg catalog configuration for the HadoopCatalog in this PR is probably fine because the difference with the HMS and Hive connector is that it's an external process with its own configuration. With the hadoop catalog, we don't really have that option since it's all filesystem-based.
    a. Though this makes me wonder whether the RestCatalog should even utilize iceberg.catalog.warehouse as I might expect RestCatalog implementations to have some kind of similar default data and metadata locations. I looked at Nessie's docs and saw that they have the ability to specify default locations. I think we might want to ensure that those configs are respected over any Presto configurations. We may even want to remove the ability to specify default locations for the Presto iceberg catalog when configured with Rest/Nessie. The Hive connector doesn't really allow it, and I think it wouldn't be clear to users as to which configuration is the correct one to respect.

@ZacBlanco (Contributor) left a comment

Thanks for the great work - I do have a concern about the additional load the two new test classes might put on CI. Please let me know your thoughts.

Contributor:

I'm a little concerned about the impact on CI runtime when adding this test class and the TestIcebergSmokeOnS3Hadoop test class. These classes take quite a bit of time to complete, so I think we should try to be prudent about adding new configuration permutations that run through the whole gamut of tests.

That being said, since the only real difference in the configuration is the warehouse default data dir and the fact that we're running on S3, I think we ought to enable just one version of this test: either the default, which operates directly on the local filesystem, or this one. I would actually err towards using this class because it represents something closer to a real-world use case while still exercising the Hadoop catalog.

hantangwangd (Member Author):

It makes sense to enable just one version of these three test classes. And I agree with you that the new tests on S3 are more suitable to be enabled, since they still exercise the Hadoop catalog.

I tried to ignore the original test classes, but that led to a series of problems in the CI environment. I will make some further attempts. Would greatly appreciate it if you have any good ideas.

Contributor:

have you tried just putting @Test(enabled = false) at the class-level?

hantangwangd (Member Author):

It doesn't seem to work. I previously used @Ignore, but it also didn't work.

hantangwangd (Member Author):

A feasible way is to declare them as abstract classes; this works well in the CI environment, see #24508. Not sure if there is a better way.

@hantangwangd hantangwangd force-pushed the support_seperate_write_data_location branch 3 times, most recently from f3c462f to f08edd5 on February 5, 2025 14:00
@hantangwangd (Member Author) commented Feb 5, 2025

@ZacBlanco Thanks for your investigation and perspective. I think a significant portion of the current inconsistency is due to the Iceberg connector with Hive catalog. In Presto, its configuration is closer to the Hive connector than to the Iceberg library's Hive catalog, because it is a Presto-native implementation based on presto-hive-metastore rather than on the Iceberg library's Hive catalog.

As far as I can see, in the Iceberg library, warehouse is a commonly used catalog property, see Iceberg catalog properties. It has a role in all catalogs: for the Hadoop/Nessie/Jdbc catalogs it is mandatory, while for the Hive/Rest/InMemory catalogs it is optional. So it seems that in Presto, the catalog property iceberg.catalog.warehouse (which maps to the warehouse catalog property in the Iceberg library) should also be a common catalog property in the Iceberg connector and should not be removed when configured with Rest/Nessie.

As you said, unlike catalogs backed by an external process with its own configuration, the Hadoop catalog is purely filesystem-based, so the newly added configuration property should only be used for the Hadoop catalog. Therefore, in order to avoid confusion with the property iceberg.catalog.warehouse and causing users to believe that it is also a commonly used catalog property, I think we can make it clearer by doing the following things:

    1. Name this newly added configuration property iceberg.catalog.hadoop.warehouse.datadir instead of iceberg.catalog.warehouse.datadir.
    2. Add a check to throw an error at server startup if this config property is present when the catalog type is not Hadoop, as suggested by @imjalpreet.
    3. Clearly state in the documentation that this is a Hadoop-catalog-specific configuration property.

Do you think this makes sense?

@hantangwangd (Member Author)

New release note guidelines as of last week: PR #24354 automatically adds links to this PR to the release notes. Please remove the manual PR link in the following format from the release note entries for this PR.

Thanks @steveburnett for your message. I have fixed the release note and added some content to the documentation; please take a look when convenient.

@steveburnett (Contributor) left a comment

Thanks for the expanded documentation! I made a few minor revision suggestions to use imperative, to emphasize what the user should and should not do.

@ZacBlanco (Contributor)

@ZacBlanco Thanks for your investigation and perspective. I think a significant portion of the current inconsistency is due to Iceberg connector with Hive catalog. In presto, its configuration is closer to Hive connector than Iceberg lib's Hive
As you said, the newly added configuration property is an external process with Iceberg's own configuration, and it should only be used for Hadoop catalog. Therefore, in order to avoid the confusion with property iceberg.catalog.warehouse and causing users to believe that it is also a commonly used catalog property, I think we can make it more clear by doing the following things:

1. Name this newly added configuration property `iceberg.catalog.hadoop.warehouse.datadir` instead of `iceberg.catalog.warehouse.datadir`.

2. Add a check to throw an error at the server startup if this config property is present when the catalog type is not Hadoop, as suggested by @imjalpreet.

3. Clearly state in the document that this is a Hadoop catalog specific configuration property.

Do you think this makes sense?

This approach sounds good to me

@hantangwangd hantangwangd force-pushed the support_seperate_write_data_location branch 2 times, most recently from b39e317 to 3356d39 on February 5, 2025 18:09
@steveburnett (Contributor) left a comment

Thanks for the quick revision! I'm glad you rephrased my suggestion for technical accuracy. Just a few minor nits of formatting and spacing, but everything looks good.

@hantangwangd hantangwangd force-pushed the support_seperate_write_data_location branch from 3356d39 to 93191b4 on February 6, 2025 00:52
@ZacBlanco (Contributor) left a comment

Mostly nits. A couple questions for discussion

{
try {
String writeDataLocation = table.properties().get(WRITE_DATA_LOCATION);
return Optional.of(Strings.isNullOrEmpty(writeDataLocation) ? table.location() : writeDataLocation);
Contributor:

technically the default from the LocationProviders class in Iceberg is `table.location() + "/data"`

hantangwangd (Member Author):

After rechecking this part of the code, I found that this method seems unnecessary. It's OK to just use tableLocation as before, because when constructing the LocationProvider later we pass in both tableLocation and tableProperties, and the LocationProvider will determine the appropriate write data location based on them.

I have refactored this part of the code, please take a look when convenient, thanks.
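For reference, a small standalone sketch of that behavior using Iceberg's LocationProviders; the table location, bucket, and file names are illustrative placeholders.

```java
import com.google.common.collect.ImmutableMap;
import org.apache.iceberg.LocationProviders;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.io.LocationProvider;

public class LocationProviderSketch
{
    public static void main(String[] args)
    {
        String tableLocation = "file:/warehouse/db/my_table";

        // the provider resolves new data file paths under write.data.path when it is set...
        LocationProvider withDataPath = LocationProviders.locationsFor(
                tableLocation,
                ImmutableMap.of(TableProperties.WRITE_DATA_LOCATION, "s3://bucket/db/my_table_data"));
        System.out.println(withDataPath.newDataLocation("part-00000.parquet"));

        // ...and falls back to tableLocation + "/data" when it is not
        LocationProvider defaultProvider = LocationProviders.locationsFor(tableLocation, ImmutableMap.of());
        System.out.println(defaultProvider.newDataLocation("part-00000.parquet"));
    }
}
```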

}

@Test
public void testPropertiesTableWithSpecifiedDataWriteLocation()
Contributor:

Hmm this is getting me thinking that maybe we ought to have a test class which creates a query runner configured with all of the catalog types. Then we don't have to create new test classes which have the overhead of starting new query runners all the time.

I think this is fine for the PR, but I am concerned that we're creating a lot of overhead in testing for these different config scenarios. I think it might be worth an effort to try to consolidate to reduce the CI times.

hantangwangd (Member Author):

This sounds like a good idea, I think it's worth trying. And the testing scenarios of TestIcebergSystemTables are relatively simple, which makes it more suitable for trying this.

" metadata_delete_after_commit = false,\n" +
" metadata_previous_versions_max = 100,\n" +
" metrics_max_inferred_column = 100,\n" +
" \"write.data.path\" = '%s'\n" +
Contributor:

I think it's ok to leave for now, but I noticed adding a new table property is quite cumbersome because we have to update all of these tests which look for the exact SQL create statement. It's a bit of a burden for maintenance. Do you have any thoughts about how we might be able to make it less fragile?

hantangwangd (Member Author):

Perhaps we can extract a method that verifies the result of the show create table statement, taking the basic information, columns, and properties of the table as parameters. All relevant tests could utilize this method, which may result in fewer changes to the test cases when adding a new table property in the future. A rough sketch is shown below.
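Something along these lines, for example — a hypothetical helper whose name and formatting are illustrative, assuming the java.util.Map and java.util.stream.Collectors imports plus the assertion/query helpers (assertEquals, computeScalar) of the surrounding test class:

```java
// Hypothetical sketch only; not the actual test utility.
private void assertShowCreateTable(String tableName, String columnDefinitions, Map<String, String> tableProperties)
{
    String properties = tableProperties.entrySet().stream()
            .map(entry -> String.format("   %s = %s", entry.getKey(), entry.getValue()))
            .collect(Collectors.joining(",\n"));
    String expected = String.format("CREATE TABLE %s (\n%s\n)\nWITH (\n%s\n)", tableName, columnDefinitions, properties);
    assertEquals(computeScalar("SHOW CREATE TABLE " + tableName), expected);
}
```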

steveburnett previously approved these changes Feb 6, 2025

@steveburnett (Contributor) left a comment

LGTM! (docs)

Pulled updated branch, new local doc build, looks good.

@ZacBlanco (Contributor) left a comment

Thanks for the updates. One final question

propertiesBuilder.put(WRITE_DATA_LOCATION, writeDataLocation);
}
else {
throw new PrestoException(NOT_SUPPORTED, "Table property write.data.path is not supported with catalog type: " + catalogType);
Contributor:

Is this actually the case? If the property is set on the table correctly it will be reflected in the location provider, right? I don't think we need to disable it for other catalogs?

hantangwangd (Member Author):

It's a good question. Enabling this table property directly on all catalogs may cause unexpected problems. For example, if we specify this property for a table in the Hive file catalog, then when trying to drop the table we would fail with the following message:

Table test_table contains Iceberg path override properties and cannot be dropped from Presto

That would lead to a lot of trouble for tests on the Iceberg Hive catalog.

So I chose not to enable it on other catalogs for now, and will enable it after confirming that there are no issues. Do you think that's OK?

@hantangwangd hantangwangd force-pushed the support_seperate_write_data_location branch from b7206dc to 0749403 on February 8, 2025 01:54