Core: Optimize S3 layout of Datafiles by expanding first character set of the hash #7128

singhpk234 · 2023-03-16T19:37:56Z

About the change

Presently we use "%08x" to get, implies it will produce a 8 digits hex number, padded by preceding zeros. This effectively means the distribution will be skewed, also since we are relying on hex number our character set is any ways limited to [0-9][A-F].

This change attempts to use a wider character set as well as meantime making sure the distribution of first character remains as much uniform as possible.

Sample code for distribution :

  @Test
  public void distributionOfFirstChar() {
    Function<Object, Integer> HASH_FUNC =
            Transforms.bucket(Integer.MAX_VALUE).bind(Types.StringType.get());
    Map<String, Integer> hm = Maps.newHashMap();
    for (int i = 0; i < 1000000; ++i) {
      String randomUUID = UUID.randomUUID().toString();
      //String hashFunc = String.format("%08x", HASH_FUNC.apply(randomUUID));
      String hashFunc = HashUtils.computeHash(randomUUID);
      String firstChar = hashFunc.substring(0, 1);
      hm.put(firstChar, (hm.getOrDefault(firstChar, 0) + 1));
    }

    for (String key : hm.keySet()) {
      System.out.println(String.format("hm[%s] = %s", key, hm.get(key)));
    }
  }

Distribution of first character before (10M UUID String) :
it's being restricted to only [0-7]
hm[0] = 125099
hm[1] = 124953
hm[2] = 125440
hm[3] = 124705
hm[4] = 124777
hm[5] = 125103
hm[6] = 124908
hm[7] = 125015

Distribution of first character after this change (10M UUID String):
hm[0] = 15715
hm[1] = 15524
hm[2] = 15861
hm[3] = 15680
hm[4] = 15411
hm[5] = 15638
hm[6] = 19410
hm[7] = 19298
hm[8] = 19472
hm[9] = 19399
hm[A] = 15661
hm[B] = 15633
hm[C] = 15414
hm[D] = 15675
hm[E] = 15711
hm[F] = 15569
hm[G] = 15767
hm[H] = 15643
hm[I] = 15616
hm[J] = 15508
hm[K] = 15636
hm[L] = 15726
hm[M] = 15701
hm[N] = 15658
hm[O] = 15525
hm[P] = 15646
hm[Q] = 15686
hm[R] = 15666
hm[S] = 15675
hm[T] = 15521
hm[U] = 15569
hm[V] = 15613
hm[W] = 15398
hm[X] = 15797
hm[Y] = 15855
hm[Z] = 15620
hm[a] = 19318
hm[b] = 19460
hm[c] = 19579
hm[d] = 19673
hm[e] = 15790
hm[f] = 15687
hm[g] = 15622
hm[h] = 15833
hm[i] = 15693
hm[j] = 15547
hm[k] = 15725
hm[l] = 15521
hm[m] = 15911
hm[n] = 15468
hm[o] = 15579
hm[p] = 15753
hm[q] = 15594
hm[r] = 15723
hm[s] = 15628
hm[t] = 15433
hm[u] = 15645
hm[v] = 15544
hm[w] = 15761
hm[x] = 15524
hm[y] = 15565
hm[z] = 15527

More resources :

https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/

singhpk234 · 2023-03-16T19:38:35Z

cc @jackye1995 @amogh-jahagirdar @rdblue @danielcweeks @aokolnychyi @szehon-ho

core/src/main/java/org/apache/iceberg/util/HashUtils.java

core/src/main/java/org/apache/iceberg/LocationProviders.java

jackye1995 · 2023-03-17T04:44:23Z

core/src/main/java/org/apache/iceberg/LocationProviders.java

+          HASH_FUNC.hashBytes(fileName.getBytes(StandardCharsets.UTF_8)).asBytes();
+
+      StringBuilder hash = new StringBuilder();
+      for (int i = 0; i < 8; ++i) {


can we make 8 a static variable?

is there any benefit in making it a config?

I guess from what Daniel replied below the answer is no and 8 is more than enough.

+1, not required at the moment, though delta has this configurable, with a property called randomPrefixLength defaulting to 2 code pointer .if required, can take this as a follow-up pr.

danielcweeks · 2023-03-17T05:05:42Z

I'm not sure this is really necessary. While the distribution of the first character is relatively narrow, the reference you point to even states:

If we target conservative targets of 100 operations per second and 20 million stored objects per partition, a four character hex hash partition set in a bucket or sub-bucket namespace could theoretically grow to support millions of operations per second and over a trillion unique keys before we’d need a fifth character in the hash.

We currently have 8 characters in the hash so without confirmation from the S3 it seems like we're already following the recommendation.

jackye1995 · 2023-03-17T16:05:48Z

@danielcweeks this comes out of a test we did with one of our biggest customers, where the current object storage layout still resulted in throttling, but by extending it we were able to eliminate throttling.

There are 2 issues with the current layout, here are some data we collected internally:

The left padding is an area of concern. If the hex-encoded version of the hash is not always 8 characters long, then there will be a large number of entries that start with one or more “0” characters, which could create uneven heat distribution for S3. To test this theory we use UUID to simulate file paths, generate 10M random entries, and then apply the existing algorithm. We can plot the histogram of the length of a non-padded, hex-encoded entropy:

length[1] = 0 
length[2] = 2 
length[3] = 24 
length[4] = 281 
length[5] = 4589 
length[6] = 73243 
length[7] = 1173442 
length[8] = 8748419

This indicates that ~12.5% of all file paths will start with “0”, which for a hex-encoded string should have been 6.25% since there are 16 characters in the set.

All hex-encodings start with the inclusive range 0-8:

count(partition["0"]) = 1249993 
count(partition["1"]) = 1250614 
count(partition["2"]) = 1250830 
count(partition["3"]) = 1250695 
count(partition["4"]) = 1249028 
count(partition["5"]) = 1249782 
count(partition["6"]) = 1249436 
count(partition["7"]) = 1249622

The issue here is that the default algorithm generates values up to Integer.MAX_VALUE, which in Java is (2^31)-1. When this is converted to hex, it produces 0x7FFFFFFF, so it’s understandable why the first half of the range is restricted to 8 characters using the existing algorithm.

With that being said, maybe we do not need all the character range to solve it. An alternative approach is to use something like a reverse hex, which we have also tested to be working. But this approach will maximize the entropy in any cases without the need to worry about S3 internals. I will ask S3 team to take a look and see if the proposed approach is necessary.

core/src/main/java/org/apache/iceberg/LocationProviders.java

dpaani · 2023-03-20T20:21:03Z

core/src/main/java/org/apache/iceberg/LocationProviders.java

@@ -143,11 +143,11 @@ public String newDataLocation(PartitionSpec spec, StructLike partitionData, Stri

    @Override
    public String newDataLocation(String filename) {
-      int hash = HASH_FUNC.apply(filename);
+      String hash = computeHash(filename);


Is it possible to support hash type as a configuration ? This change might break if anyone having custom list operation based from object store on hex prefix (%08x).

I don't think that's a good idea. If someone has a custom list operation, then they're relying on behaviors of the implementation and take the risk that something like this will break their implementation.

what do you mean by custom list operation? Listing partitions based on the path after the random prefix?

hi @jackye1995 if a object storage path has millions of files, S3 listing usually takes very long time since it has to go thru all the keys recursively. This behavior can be improved by listing files with hash prefix in parallel like /data/00, /data/01, /data/0a etc.

+1 for what Daniel says.

if a object storage path has millions of files, S3 listing usually takes very long time since it has to go thru all the keys recursively.

do you mean during orphan file removal? There could be better ways in making that work without listing if it's considered slow, like running a join against the inventory list.

core/src/main/java/org/apache/iceberg/LocationProviders.java

rdblue · 2023-04-02T23:50:07Z

@singhpk234, I opened a PR against your branch to fix a few remaining issues, like unnecessary object allocation and to improve the number of bits we get out of the hash. I think this is ready once that's merged: singhpk234#57

singhpk234 · 2023-04-03T06:35:47Z

ACK, many thanks @rdblue, cherry-picked your commit.

core/src/main/java/org/apache/iceberg/LocationProviders.java

rdblue · 2023-04-03T17:13:04Z

Thanks, @singhpk234! Merged.

danielcweeks · 2023-04-03T18:27:09Z

Hey @singhpk234 I think this was supposed to be 6 bytes . To avoid the = buffer characters.

singhpk234 · 2023-04-03T18:32:00Z

@danielcweeks, since we added BaseEncoding.base64Url().omitPadding(); to ommit the = padding hence multiple of 3 requirement was gone (more details).

danielcweeks · 2023-04-03T18:36:38Z

@danielcweeks, since we added BaseEncoding.base64Url().omitPadding(); to ommit the = padding hence multiple of 3 requirement was gone (more details).

Didn't realize that was an option. Thanks @singhpk234

dramaticlly · 2023-04-04T22:19:22Z

@singhpk234 thanks for the change, just want to confirm. Looks like the entropy prefix moved

from length 8 hex like 7ffebe82 and 568b64a6
to length 6 base64 encoded string such as 3xWw8g and nRdUQA

dramaticlly · 2023-04-04T22:33:47Z

@singhpk234 thanks for the change, just want to confirm. Looks like the entropy prefix moved

from length 8 hex like 7ffebe82 and 568b64a6

to length 6 base64 encoded string such as 3xWw8g and nRdUQA

nvm, just realized most of time the trailing == is omitted as padding

github-actions bot added the core label Mar 16, 2023

jackye1995 reviewed Mar 16, 2023

View reviewed changes

core/src/main/java/org/apache/iceberg/util/HashUtils.java Outdated Show resolved Hide resolved

jackye1995 requested a review from danielcweeks March 16, 2023 20:51

jackye1995 reviewed Mar 16, 2023

View reviewed changes

core/src/main/java/org/apache/iceberg/util/HashUtils.java Outdated Show resolved Hide resolved

jackye1995 reviewed Mar 16, 2023

View reviewed changes

core/src/main/java/org/apache/iceberg/util/HashUtils.java Outdated Show resolved Hide resolved

singhpk234 force-pushed the enhancement/optimized-loc-provider branch from 89a7d66 to aef9631 Compare March 17, 2023 01:10

jackye1995 reviewed Mar 17, 2023

View reviewed changes

core/src/main/java/org/apache/iceberg/LocationProviders.java Outdated Show resolved Hide resolved

jackye1995 reviewed Mar 17, 2023

View reviewed changes

core/src/main/java/org/apache/iceberg/LocationProviders.java Outdated Show resolved Hide resolved

rdblue reviewed Mar 18, 2023

View reviewed changes

core/src/main/java/org/apache/iceberg/LocationProviders.java Outdated Show resolved Hide resolved

singhpk234 force-pushed the enhancement/optimized-loc-provider branch from aef9631 to ff3a9f7 Compare March 19, 2023 21:49

dpaani reviewed Mar 20, 2023

View reviewed changes

jackye1995 reviewed Mar 20, 2023

View reviewed changes

core/src/main/java/org/apache/iceberg/LocationProviders.java Outdated Show resolved Hide resolved

Prashant Singh added 4 commits March 20, 2023 16:23

Core: Optimize S3 layout

e846bba

address review feedback

7a04d65

Address review feedback

6757c8a

use the suggested hash function

e70c81a

singhpk234 force-pushed the enhancement/optimized-loc-provider branch from ff3a9f7 to e70c81a Compare March 20, 2023 23:29

rdblue reviewed Mar 21, 2023

View reviewed changes

core/src/main/java/org/apache/iceberg/LocationProviders.java Outdated Show resolved Hide resolved

rdblue reviewed Apr 2, 2023

View reviewed changes

core/src/main/java/org/apache/iceberg/LocationProviders.java Outdated Show resolved Hide resolved

rdblue and others added 2 commits April 2, 2023 23:34

Core: Improve bit density in object storage layout.

42aad4e

fix style checks

b0eb42a

rdblue reviewed Apr 3, 2023

View reviewed changes

core/src/main/java/org/apache/iceberg/LocationProviders.java Outdated Show resolved Hide resolved

remove lowe-case requriement from base64 encoder

024440c

rdblue merged commit a4a07ba into apache:master Apr 3, 2023

nastra mentioned this pull request Apr 14, 2023

ThreadLocal in ObjectStoreLocationProvider isn't serializable #7345

Closed

amogh-jahagirdar mentioned this pull request Sep 13, 2024

AWS: Update ObjectStorageLocationProvider hash to optimize for S3 performance #11112

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Core: Optimize S3 layout of Datafiles by expanding first character set of the hash #7128

Core: Optimize S3 layout of Datafiles by expanding first character set of the hash #7128

singhpk234 commented Mar 16, 2023 •

edited by jackye1995

Loading

singhpk234 commented Mar 16, 2023 •

edited

Loading

jackye1995 Mar 17, 2023

jackye1995 Mar 17, 2023 •

edited

Loading

jackye1995 Mar 17, 2023

singhpk234 Mar 19, 2023 •

edited

Loading

danielcweeks commented Mar 17, 2023 •

edited

Loading

jackye1995 commented Mar 17, 2023 •

edited

Loading

dpaani Mar 20, 2023

danielcweeks Mar 20, 2023

jackye1995 Mar 20, 2023

dpaani Mar 20, 2023

jackye1995 Mar 20, 2023

jackye1995 Mar 20, 2023

rdblue commented Apr 2, 2023

singhpk234 commented Apr 3, 2023

rdblue commented Apr 3, 2023

danielcweeks commented Apr 3, 2023

singhpk234 commented Apr 3, 2023 •

edited

Loading

danielcweeks commented Apr 3, 2023

dramaticlly commented Apr 4, 2023

dramaticlly commented Apr 4, 2023

Core: Optimize S3 layout of Datafiles by expanding first character set of the hash #7128

Core: Optimize S3 layout of Datafiles by expanding first character set of the hash #7128

Conversation

singhpk234 commented Mar 16, 2023 • edited by jackye1995 Loading

About the change

singhpk234 commented Mar 16, 2023 • edited Loading

Choose a reason for hiding this comment

jackye1995 Mar 17, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

singhpk234 Mar 19, 2023 • edited Loading

Choose a reason for hiding this comment

danielcweeks commented Mar 17, 2023 • edited Loading

jackye1995 commented Mar 17, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdblue commented Apr 2, 2023

singhpk234 commented Apr 3, 2023

rdblue commented Apr 3, 2023

danielcweeks commented Apr 3, 2023

singhpk234 commented Apr 3, 2023 • edited Loading

danielcweeks commented Apr 3, 2023

dramaticlly commented Apr 4, 2023

dramaticlly commented Apr 4, 2023

singhpk234 commented Mar 16, 2023 •

edited by jackye1995

Loading

singhpk234 commented Mar 16, 2023 •

edited

Loading

jackye1995 Mar 17, 2023 •

edited

Loading

singhpk234 Mar 19, 2023 •

edited

Loading

danielcweeks commented Mar 17, 2023 •

edited

Loading

jackye1995 commented Mar 17, 2023 •

edited

Loading

singhpk234 commented Apr 3, 2023 •

edited

Loading