AWS: Update ObjectStorageLocationProvider hash to optimize for S3 performance #11112
Conversation
Left some comments, but I had a possibly naive question just to check my understanding:
In the past for the object storage provider, we've used a wider character set in the hash portion of the file path as a means to maximize entropy and ultimately improve heat distribution (#7128). With this new approach, are we saying we can get a good enough heat distribution while also enabling S3 to scale capacity more quickly?
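To make the trade-off in this question concrete, here is a minimal sketch (not the provider's actual code; the murmur3 hash and the radix-32 rendering are stand-ins for the existing wider-character encoding): the same 32 bits of entropy can be rendered with a wide character set as a short prefix, or in base2 as a longer prefix built only from '0' and '1'.

    import java.nio.charset.StandardCharsets;
    import com.google.common.hash.Hashing;

    class HashPrefixSketch {
      public static void main(String[] args) {
        String fileName = "00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet";
        int hash = Hashing.murmur3_32_fixed().hashString(fileName, StandardCharsets.UTF_8).asInt();
        int positive = hash & 0x7FFFFFFF; // drop the sign bit so both renderings are comparable
        // Wider character set: roughly 6-7 characters carry the entropy (radix 32 as a stand-in).
        System.out.println(Integer.toString(positive, 32));
        // Base2: the same entropy spelled out as a much longer run of '0'/'1' characters.
        System.out.println(Integer.toBinaryString(positive));
      }
    }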
aws/src/main/java/org/apache/iceberg/aws/s3/S3LocationProvider.java
    private String computeHash(String fileName) {
      HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
      int hash = hashCode.asInt();
Nit: I think we could just inline hashCode.asInt() below when computing the binaryString.
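For illustration, the inlined form the nit is pointing at might look roughly like this (a sketch only; HASH_FUNC comes from the file under review, and any padding or truncation of the binary string is omitted here):

    String binaryString =
        Integer.toBinaryString(HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8).asInt());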
Works, updating @amogh-jahagirdar
aws/src/main/java/org/apache/iceberg/aws/s3/S3LocationProvider.java
     * s3://my-bucket/my-table/data/011101101010001111101000-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
     * </code>.
     */
    public class S3LocationProvider implements LocationProvider {
Why not just include this in the ObjectStoreLocationProvider as an option to choose between base2 and base32? It seems like we could include most of this in just the computeHash function. That might also allow for choosing whether to include partition context.
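A hypothetical sketch of that suggestion, keeping the existing provider and switching the encoding inside computeHash() (the useBase2Entropy flag and the radix-32 call are assumptions, not the actual implementation; HASH_FUNC is from the file under review):

    private String computeHash(String fileName) {
      int hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8).asInt();
      // useBase2Entropy would be derived from a table property chosen when the provider is created.
      return useBase2Entropy
          ? Integer.toBinaryString(hash)
          : Integer.toString(hash & 0x7FFFFFFF, 32); // stand-in for the existing wider-character encoding
    }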
That was our original plan, but with the partition values and other trailing directories removed, it turns into a very S3-specific location provider for the reasons mentioned in the above comment. I agree that if it were just the base2 optimization, it would probably fit into the ObjectStoreLocationProvider with one extra table property.
@ookumuso Overall, this looks like a great feature if this is better for S3 to repartition and distribute data, but it also seems like it would fit cleanly into the existing ObjectStoreLocationProvider as opposed to a separate provider.
     * </ol>
     *
     * The data file is placed immediately under the data location. Partition names are <b>not</b>
     * included. The data filename is prefixed with a 24-character binary hash, which ensures that files
The 24-character prefix and lack of a directory marker (/) are a little problematic for older clients (like HadoopFS) and procedures like orphan files, because anything that needs to list that directory will run into problems with the key space. We already have an issue with this with the current provider, but this would compound that problem since there's no hierarchy.
This would make it a little easier to break up prefix-separated ranges for listing (especially due to the low character limit), but that's something we would need to look into on the orphan files action.
Is the problem coming from the delimited list, or is it some type of parallelization issue due to the lack of directories? I was thinking that not having the directory essentially removes the requirement for discovering them, and it can just be a flat list to discover everything under the write.data.path.
The issue is having too much cardinality for FS implementations that think in terms of directory structure. By breaking this up we can address both scenarios (file-system-based and object store listing) at the same time.
Thanks for the review @danielcweeks! See my comment here regarding why we chose to go with a separate provider. It was essentially done to capture optimizations for all bucket types in one place without having an impact on the existing default ObjectStoreLocationProvider.
@danielcweeks let me know what you think about this. One alternative is to provide both: keep S3LocationProvider as is and add a base2 option for the ObjectStoreLocationProvider, to keep the option of partition values along with directories.
@ookumuso I'm still of the opinion that we should just incorporate this into the existing object store location provider. Ultimately, we can change the hashing behavior, as the existing behavior is based on earlier discussions with S3, so replacing it or providing alternative options is fine. I do think partition values in the path still need to be an option. I think we also need to consider adding a path separator to help with some of the maintenance routines. For example
@danielcweeks Understood, thanks for the feedback Daniel. Looks like it is not going to be feasible for us to remove it.
For partition values, I can probably send a separate follow-up as an option to remove them from the file name with a new table property so callers can decide, but I'm planning to exclude that for now.
Sorry for the late review, was busy with some internal work...
+1 for that. I think the main concern was that this seems to be too S3-specific, but if there is no major concern from others I am good with changing the ObjectStoreLocationProvider directly; that actually helps adoption.
I reviewed the internal experiments related to this, and I believe that having no path separators as well as having at least 24 bits are both important based on load testing. If the concern is compatibility with other systems, would it make sense to only use this approach for locations that begin with
@danielcweeks @amogh-jahagirdar @jackye1995 As discussed above, I made the change to update the existing ObjectStoreLocationProvider to use base2 entropy and also added a new config to omit the partition values. I set it to false by default to keep it backwards compatible, but let me know whether you would prefer it to be true, to prevent customers from relying on partition values in the file path. I kept the "/" delimiter in place due to the concerns raised by @danielcweeks, but I would like to understand what it would take to remove it as a separate thread, potentially following up in a separate PR. Orphan clean-up was one thing identified, but I am curious whether there are any other things we need to consider to make it a viable change.
Looks good to me. I think there are 2 pending topics, and I am approving since I am good with the current way it is written:
- should we do it for all paths, or just limit to s3:// and s3a:// paths: I think given that even the documentation of this is in AWS, it is probably fine to just apply this strategy for all users of the object storage location provider.
- should we add / in the middle of the binary hash values: that would mitigate the directory listing issue, but would impact the ability to scale up S3 due to the additional slashes in the middle. My understanding is that the directory listing concern could be resolved if we move all HDFS-based listing to use prefix-based listing, and the team can take that as a follow-up item.
Any further thoughts on the latest approach and the 2 points above? @danielcweeks @amogh-jahagirdar @nastra
Any additional comments? @amogh-jahagirdar @nastra @danielcweeks
I think we should treat all of the prefixes the same.
I think we should add a single /. That slash is important for directory-based file system implementations, as omitting it would result in effectively listing the full contents of the table and would likely break clients because there are too many results to hold in memory.
When you say "directory based file system implementations", do you mean an implementation of the catalog, or FileIO? And what is the use case of this? I thought the only place we are doing something like this is in orphan file removal, but for that we are just iterating through the directories to find matching files; we are holding all the matching files in memory, not holding everything in memory.
This is for FileIO, not catalog related.
This is related to orphan file removal. The hashing is all performed at the top level of the
The hard-coded default is 3. Usually there is a /data/ folder, which takes 1 level of depth, but that might not always be the case. So if we stick to that default, then we should ensure 3 levels of depth.
We should potentially introduce a separator concept.
What do you think about this approach @danielcweeks:
It might be a nice win-win, both solving the sparse directory problem and orphan clean-up. As @jackye1995 mentioned, it is a bit weird for partitioned paths, but we can do it for non-partitioned paths only, so it would look like:
I like removing the additional path when
I'm just wondering about where we want to put the slashes in the bit field. Breaking it up by three bits means that each recursive listing just operates on 8 subpaths, which feels too small. I feel like we might want to move to four bits (i.e. 16 subpaths per level).
Sounds good @danielcweeks, I will work on this change and update the PR!
@danielcweeks @jackye1995 Updated the change to divide the entropy into dirs, so we now follow this format:
Set the dir depth to 3 and dir length to 4 as discussed.
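As a rough illustration of the layout described above (a sketch under the stated parameters of 20 bits of entropy, dir depth 3, and dir length 4; the helper name, hash function, and exact delimiters are assumptions, not the merged code):

    // Assumes Guava's Hashing and Strings utilities are available, as in the rest of the provider.
    private static String entropyPath(String fileName) {
      int hash = Hashing.murmur3_32_fixed().hashString(fileName, StandardCharsets.UTF_8).asInt();
      // 20 bits of entropy rendered as base2, e.g. "01110110101000111110"
      String bits = Strings.padStart(Integer.toBinaryString(hash & 0xFFFFF), 20, '0');
      // Three 4-character dirs, with the remaining 8 bits kept in front of the file name,
      // e.g. "0111/0110/1010/00111110-<fileName>"
      return bits.substring(0, 4) + '/' + bits.substring(4, 8) + '/' + bits.substring(8, 12)
          + '/' + bits.substring(12) + '-' + fileName;
    }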
@danielcweeks any further comments?
@ookumuso a couple of small remaining comments:
@danielcweeks thanks:
Any additional concerns @danielcweeks?
Just waiting on checks. Thanks @ookumuso!
Also looks good to me! Thanks for the work!
Looks like CI has passed, merging. Thanks for the work and patience @ookumuso, and thanks for the review @danielcweeks!
The S3 team is introducing S3LocationProvider, a replacement for ObjectStorageLocationProvider that is better suited to the performance of Iceberg workloads running on Amazon S3. This configuration applies to Iceberg workloads using General Purpose and Directory Buckets in Amazon S3, and we expect it to further improve the throughput of Iceberg workloads. Although this change can benefit all Iceberg workloads, the degree of improvement may vary based on specific workload characteristics.
This implementation changes the hash scheme from 32-bit base64 to 20-bit base2. The reduced character range allows S3 to automatically scale request capacity more quickly to match the demands of the target workload and reduce the amount of time that workloads observe throttle responses. A 20-bit hash allows for 2^20 possible prefixes.
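A minimal sketch of what deriving a 20-bit base2 prefix could look like (the hash function and padding shown here are assumptions, not necessarily the exact merged code):

    int hash = Hashing.murmur3_32_fixed().hashString(fileName, StandardCharsets.UTF_8).asInt();
    // Keep only the low 20 bits and render them in base2: 2^20 = 1,048,576 possible prefixes,
    // each built from just the characters '0' and '1'.
    String prefix = Strings.padStart(Integer.toBinaryString(hash & 0xFFFFF), 20, '0');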
This implementation also changes the path structure to reduce the number of directories created. The partition directories are eliminated, and the position of the hash is moved to before the data filename.