[Iceberg]Support specifying S3 path as warehouse dir for Hadoop catalog #24221
base: master
Conversation
LGTM! (docs)
Pull updated branch, new local doc build, looks good. Thanks!
presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java (outdated review thread, resolved)
LGTM! (doc)
Pull updated branch, new local doc build, looks good. Thanks!
One minor question, otherwise it looks good. The testing infrastructure you added here is great!
@@ -396,8 +396,7 @@ private static boolean isDirectory(PrestoS3ObjectMetadata metadata)
    }

    return mediaType.is(X_DIRECTORY_MEDIA_TYPE) ||
            (mediaType.is(OCTET_STREAM_MEDIA_TYPE)
Why is this condition removed?
Object Storage Systems do not necessarily return the content type application/octet-stream; for example, Ozone returns the content type binary/octet-stream, see: https://github.com/apache/ozone/blob/master/hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/ObjectEndpoint.java#L645-L647. Besides, the sample responses for general purpose buckets in the Amazon S3 API documentation also use binary/octet-stream as the content type, see the examples in https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadObject.html.
Moreover, I believe the remaining two conditions can already confirm that this is a directory, so I removed this condition in order to support as many Object Storage Systems as possible.
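The revised check can be sketched as a standalone helper. This is a simplified, hypothetical stand-in for the logic in PrestoS3FileSystem.isDirectory() (the class and method names below are illustrative, not the actual Presto code):

```java
public class DirectoryCheckSketch
{
    // Sketch of the directory check after dropping the octet-stream condition:
    // an S3 object is treated as a directory if its content type is
    // application/x-directory, or if it is a zero-length object whose key
    // needs a trailing path separator appended.
    public static boolean isDirectory(String contentType, boolean keyNeedsPathSeparator, long contentLength)
    {
        boolean isXDirectory = contentType != null && contentType.startsWith("application/x-directory");
        return isXDirectory || (keyNeedsPathSeparator && contentLength == 0);
    }

    public static void main(String[] args)
    {
        // Empty directory markers from MinIO (application/octet-stream) and
        // Ozone (binary/octet-stream) are both recognized via the
        // zero-length-plus-separator condition, regardless of content type.
        System.out.println(isDirectory("application/octet-stream", true, 0));   // true
        System.out.println(isDirectory("binary/octet-stream", true, 0));        // true
        System.out.println(isDirectory("application/x-directory", false, 0));   // true
        System.out.println(isDirectory("application/octet-stream", false, 42)); // false
    }
}
```

Because the content type no longer participates in the empty-directory case, any S3-compatible store that writes zero-byte directory markers is handled the same way.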
I think this makes sense, but I'm not sure if there are additional implications. If our tests pass, then I am ok with it. Though I will defer to @imjalpreet for any additional thoughts; he is the one who implemented this check previously.
@hantangwangd Could we please verify if it is possible to create an external table on an empty directory in MinIO after removing the current condition?
I added this condition because we noticed that some systems, such as MinIO, were creating empty directories with a content type of application/octet-stream instead of application/x-directory.
For your reference: #20603
@imjalpreet Thanks for providing the relevant note and issue link. I have verified that we can create an external table on an empty directory in MinIO after removing the condition mediaType.is(OCTET_STREAM_MEDIA_TYPE); the remaining condition metadata.isKeyNeedsPathSeparator() && objectMetadata.getContentLength() == 0 is capable of handling the directory identification problem described in issue #20310.
Meanwhile, if we do not remove the condition mediaType.is(OCTET_STREAM_MEDIA_TYPE), we get the exception "External location must be a directory" when trying to create an external table on an empty directory in some other Object Storage Systems like Ozone.
Is the empty directory table covered by any test cases that run in CI? If not we should add it. Aside from that I think this PR is good to go
OK, I'll check it out.
@ZacBlanco @imjalpreet I have added the test method testCreateExternalTableOnEmptyS3Directory() in BaseTestHiveInsertOverwrite; it verifies the behavior of creating an external Hive table on an empty S3 directory. Please take a look when convenient, thanks.
LGTM, thank you for the tests!
Just a couple of nits, otherwise LGTM, thank you.
presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java (outdated review thread, resolved)
        super.updateConfiguration(config);

        if (this.icebergConfig.getCatalogType().equals(HADOOP)) {
            // re-map filesystem schemes to match Amazon Elastic MapReduce
I was a bit unsure about what this comment meant.
Thanks for pointing this out. It seems the original comment in PrestoS3ConfigurationUpdater was copied from some other framework. I have fixed the comment in both places, please take a look when available.
Thank you for the PR, LGTM!
Hi @tdcmeehan, this PR is ready for a final review, could you please take a look when convenient? Thanks a lot!
Thanks for the PR @hantangwangd. S3 now supports conditional writes, but does this PR take advantage of conditional writes to ensure catalog consistency? If not, what protects the integrity of Iceberg's ACID transactions, given it is the responsibility of the catalog to ensure the table metadata is atomically updated?
@tdcmeehan Thanks for your question. After checking the corresponding code, as I understand it, the Hadoop catalog commits a new table version by writing the new metadata file and then renaming it to the next versioned metadata file, which on S3 is implemented as a copy operation.
S3 ensures that the copy operation itself is atomic. So, as I understand it, it seems we don't necessarily need conditional writes to ensure the Hadoop catalog's consistency, although using conditional writes might lead to a better implementation, I guess. What's your opinion? If I have any misunderstanding, please let me know, thanks.
@hantangwangd consider two different Presto engines attempting to insert into the same table, which attempt to commit the insertion at exactly the same time, to the effect that it's arbitrary which one should succeed. While the copy operation itself is atomic, wouldn't the later of the two copy operations overwrite the first one, leading to a corrupt timeline? And as far as I know, the only way to guard against this is to externally lock using some sort of metastore or external catalog that supports atomic updates, or in S3 using conditional writes (basically, the copy operation should not succeed if the destination already exists).
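The race described here can be modeled in a few lines. In this sketch, an AtomicReference stands in for the latest-metadata pointer in the warehouse directory, and compareAndSet models the semantics of a conditional write; it is an illustration only, not the actual S3 or Iceberg API:

```java
import java.util.concurrent.atomic.AtomicReference;

public class CommitRaceSketch
{
    public static void main(String[] args)
    {
        // Stand-in for the pointer to the latest vN.metadata.json.
        AtomicReference<String> latestMetadata = new AtomicReference<>("v1.metadata.json");

        // Two engines both read v1 and prepare a commit based on it.
        String baseSeenByA = latestMetadata.get();
        String baseSeenByB = latestMetadata.get();

        // Unconditional writes (plain PUT/copy): both "succeed", and the later
        // write silently overwrites the earlier one -- engine A's commit is lost.
        latestMetadata.set("v2-from-A.metadata.json");
        latestMetadata.set("v2-from-B.metadata.json");
        System.out.println(latestMetadata.get()); // v2-from-B.metadata.json

        // Conditional writes (compare-and-swap, i.e. "fail if the destination
        // already changed"): only the writer whose base is still current wins.
        latestMetadata.set("v1.metadata.json");
        boolean aCommitted = latestMetadata.compareAndSet(baseSeenByA, "v2-from-A.metadata.json");
        boolean bCommitted = latestMetadata.compareAndSet(baseSeenByB, "v2-from-B.metadata.json");
        System.out.println(aCommitted + " " + bCommitted); // true false -- B must retry against v2
    }
}
```

The second half shows why a compare-and-swap primitive (a conditional write, or an external lock providing equivalent semantics) prevents the lost-update problem: the losing engine observes a failed commit and can retry on top of the new version instead of silently clobbering it.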
@tdcmeehan If we look deeper into the implementation of the Hadoop catalog's commit path, it checks the base metadata version before committing, acquires a lock through Iceberg's lock manager, and fails the commit if the target version file already exists. I think these measures can ensure the isolation and consistency of concurrent transactions, and ensure that the scenario you described will not lead to corruption or inconsistency. What do you think?
Sorry, I didn't notice that you were referring to two different Presto engines; in that case, the memory based lock manager in Iceberg cannot provide mutual exclusion across engines.
After checking the S3 documentation for the copy operation, it seems the copy by itself cannot prevent the later of two commits from overwriting the earlier one. However, Iceberg provides pluggable lock manager implementations, such as a DynamoDB based lock manager, which can be configured to ensure that only one commit succeeds.
@hantangwangd agreed that we need some sort of consistent external locking mechanism, such as a database, to ensure transactional integrity. My concern with using the Hadoop catalog in general over an object storage system that doesn't support atomic swaps is: how do users configure it? What is happening in S3, apparently even with conditional write support as you pointed out above, is that the rename isn't really atomic, because two versions may both succeed, with the later of the two operations overwriting the first. This requires some external solution outside of the object store, which is another infrastructure component. Can a Spark engine also be configured, but the user forgets to configure a DynamoDB based lock manager, and eventually this causes a corrupt timeline? Another thought is, what if we implement a custom S3-only catalog which properly leverages an algorithm that ensures transactional integrity of writes (it seems a Hadoop based solution can't currently do this)? The problem with this is, how do we test it against third party object stores which implement the S3 API? Do all of them now support conditional writes? Note the Iceberg community seems pessimistic about supporting a file-only catalog: https://lists.apache.org/thread/v7x65kxrrozwlvsgstobm7685541lf5w. So a final worry I have is that we are going against the grain with this approach, thinking of Presto in the context of larger engine interoperability.
@tdcmeehan Thanks for the information you provided. I also read the discussions from when the Iceberg community made the decision to deprecate the file-system-based Hadoop catalog.
As I learned from the conversations, it seems the Iceberg community mainly wants to get rid of the implementation details of a file based catalog, so they mostly converged toward the direction of the REST catalog.
I will convert this PR to a draft and do some more in-depth investigation, taking a look at the subsequent evolution direction after the emergence of conditional writes. Thank you for the whole discussion, it's very helpful to me.
Description
This PR enables the Iceberg connector to use an S3 location as the warehouse directory for the Hadoop catalog.
By configuring the S3 properties described in https://prestodb.io/docs/current/connector/hive.html#amazon-s3-configuration, we can specify an S3 location as the warehouse directory of the Hadoop catalog. This way, both metadata and data of Iceberg tables will be maintained on S3 storage.
An example configuration includes:
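A minimal sketch of such a catalog configuration (the values below are illustrative; the hive.s3.* property names follow the Hive connector's Amazon S3 configuration page linked above, assuming a MinIO-style endpoint):

```properties
connector.name=iceberg
iceberg.catalog.type=hadoop
iceberg.catalog.warehouse=s3://my-bucket/warehouse
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minio-access-key
hive.s3.aws-secret-key=minio-secret-key
hive.s3.path-style-access=true
```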
Motivation and Context
Support specifying an S3 location directly as the warehouse directory for Iceberg with the Hadoop catalog
Impact
N/A
Test Plan
- Tested against object storage systems Ozone and MinIO
- Ran IcebergDistributedTestBase and TestIcebergDistributedQueries on S3 storage provided by dockerized MinIO

Contributor checklist
Release Notes