Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Optimize S3 layout of Datafiles by expanding first character set of the hash #7128

Merged
merged 7 commits into from
Apr 3, 2023

Conversation

singhpk234
Copy link
Contributor

@singhpk234 singhpk234 commented Mar 16, 2023

About the change

Presently we use "%08x" to get, implies it will produce a 8 digits hex number, padded by preceding zeros. This effectively means the distribution will be skewed, also since we are relying on hex number our character set is any ways limited to [0-9][A-F].

This change attempts to use a wider character set as well as meantime making sure the distribution of first character remains as much uniform as possible.

Sample code for distribution :

  @Test
  public void distributionOfFirstChar() {
    Function<Object, Integer> HASH_FUNC =
            Transforms.bucket(Integer.MAX_VALUE).bind(Types.StringType.get());
    Map<String, Integer> hm = Maps.newHashMap();
    for (int i = 0; i < 1000000; ++i) {
      String randomUUID = UUID.randomUUID().toString();
      //String hashFunc = String.format("%08x", HASH_FUNC.apply(randomUUID));
      String hashFunc = HashUtils.computeHash(randomUUID);
      String firstChar = hashFunc.substring(0, 1);
      hm.put(firstChar, (hm.getOrDefault(firstChar, 0) + 1));
    }

    for (String key : hm.keySet()) {
      System.out.println(String.format("hm[%s] = %s", key, hm.get(key)));
    }
  }

Distribution of first character before (10M UUID String) :
it's being restricted to only [0-7]
hm[0] = 125099
hm[1] = 124953
hm[2] = 125440
hm[3] = 124705
hm[4] = 124777
hm[5] = 125103
hm[6] = 124908
hm[7] = 125015

Distribution of first character after this change (10M UUID String):
hm[0] = 15715
hm[1] = 15524
hm[2] = 15861
hm[3] = 15680
hm[4] = 15411
hm[5] = 15638
hm[6] = 19410
hm[7] = 19298
hm[8] = 19472
hm[9] = 19399
hm[A] = 15661
hm[B] = 15633
hm[C] = 15414
hm[D] = 15675
hm[E] = 15711
hm[F] = 15569
hm[G] = 15767
hm[H] = 15643
hm[I] = 15616
hm[J] = 15508
hm[K] = 15636
hm[L] = 15726
hm[M] = 15701
hm[N] = 15658
hm[O] = 15525
hm[P] = 15646
hm[Q] = 15686
hm[R] = 15666
hm[S] = 15675
hm[T] = 15521
hm[U] = 15569
hm[V] = 15613
hm[W] = 15398
hm[X] = 15797
hm[Y] = 15855
hm[Z] = 15620
hm[a] = 19318
hm[b] = 19460
hm[c] = 19579
hm[d] = 19673
hm[e] = 15790
hm[f] = 15687
hm[g] = 15622
hm[h] = 15833
hm[i] = 15693
hm[j] = 15547
hm[k] = 15725
hm[l] = 15521
hm[m] = 15911
hm[n] = 15468
hm[o] = 15579
hm[p] = 15753
hm[q] = 15594
hm[r] = 15723
hm[s] = 15628
hm[t] = 15433
hm[u] = 15645
hm[v] = 15544
hm[w] = 15761
hm[x] = 15524
hm[y] = 15565
hm[z] = 15527

More resources :

  1. https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/

@github-actions github-actions bot added the core label Mar 16, 2023
@singhpk234
Copy link
Contributor Author

singhpk234 commented Mar 16, 2023

@jackye1995 jackye1995 requested a review from danielcweeks March 16, 2023 20:51
@singhpk234 singhpk234 force-pushed the enhancement/optimized-loc-provider branch from 89a7d66 to aef9631 Compare March 17, 2023 01:10
HASH_FUNC.hashBytes(fileName.getBytes(StandardCharsets.UTF_8)).asBytes();

StringBuilder hash = new StringBuilder();
for (int i = 0; i < 8; ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make 8 a static variable?

Copy link
Contributor

@jackye1995 jackye1995 Mar 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any benefit in making it a config?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess from what Daniel replied below the answer is no and 8 is more than enough.

Copy link
Contributor Author

@singhpk234 singhpk234 Mar 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, not required at the moment, though delta has this configurable, with a property called randomPrefixLength defaulting to 2 code pointer .if required, can take this as a follow-up pr.

@danielcweeks
Copy link
Contributor

danielcweeks commented Mar 17, 2023

I'm not sure this is really necessary. While the distribution of the first character is relatively narrow, the reference you point to even states:

If we target conservative targets of 100 operations per second and 20 million stored objects per partition, a four character hex hash partition set in a bucket or sub-bucket namespace could theoretically grow to support millions of operations per second and over a trillion unique keys before we’d need a fifth character in the hash.

We currently have 8 characters in the hash so without confirmation from the S3 it seems like we're already following the recommendation.

@jackye1995
Copy link
Contributor

jackye1995 commented Mar 17, 2023

@danielcweeks this comes out of a test we did with one of our biggest customers, where the current object storage layout still resulted in throttling, but by extending it we were able to eliminate throttling.

There are 2 issues with the current layout, here are some data we collected internally:

  1. The left padding is an area of concern. If the hex-encoded version of the hash is not always 8 characters long, then there will be a large number of entries that start with one or more “0” characters, which could create uneven heat distribution for S3. To test this theory we use UUID to simulate file paths, generate 10M random entries, and then apply the existing algorithm. We can plot the histogram of the length of a non-padded, hex-encoded entropy:
length[1] = 0 
length[2] = 2 
length[3] = 24 
length[4] = 281 
length[5] = 4589 
length[6] = 73243 
length[7] = 1173442 
length[8] = 8748419 

This indicates that ~12.5% of all file paths will start with “0”, which for a hex-encoded string should have been 6.25% since there are 16 characters in the set.

  1. All hex-encodings start with the inclusive range 0-8:
count(partition["0"]) = 1249993 
count(partition["1"]) = 1250614 
count(partition["2"]) = 1250830 
count(partition["3"]) = 1250695 
count(partition["4"]) = 1249028 
count(partition["5"]) = 1249782 
count(partition["6"]) = 1249436 
count(partition["7"]) = 1249622 

The issue here is that the default algorithm generates values up to Integer.MAX_VALUE, which in Java is (2^31)-1. When this is converted to hex, it produces 0x7FFFFFFF, so it’s understandable why the first half of the range is restricted to 8 characters using the existing algorithm.

With that being said, maybe we do not need all the character range to solve it. An alternative approach is to use something like a reverse hex, which we have also tested to be working. But this approach will maximize the entropy in any cases without the need to worry about S3 internals. I will ask S3 team to take a look and see if the proposed approach is necessary.

@singhpk234 singhpk234 force-pushed the enhancement/optimized-loc-provider branch from aef9631 to ff3a9f7 Compare March 19, 2023 21:49
@@ -143,11 +143,11 @@ public String newDataLocation(PartitionSpec spec, StructLike partitionData, Stri

@Override
public String newDataLocation(String filename) {
int hash = HASH_FUNC.apply(filename);
String hash = computeHash(filename);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to support hash type as a configuration ? This change might break if anyone having custom list operation based from object store on hex prefix (%08x).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a good idea. If someone has a custom list operation, then they're relying on behaviors of the implementation and take the risk that something like this will break their implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by custom list operation? Listing partitions based on the path after the random prefix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @jackye1995 if a object storage path has millions of files, S3 listing usually takes very long time since it has to go thru all the keys recursively. This behavior can be improved by listing files with hash prefix in parallel like /data/00, /data/01, /data/0a etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for what Daniel says.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a object storage path has millions of files, S3 listing usually takes very long time since it has to go thru all the keys recursively.

do you mean during orphan file removal? There could be better ways in making that work without listing if it's considered slow, like running a join against the inventory list.

@singhpk234 singhpk234 force-pushed the enhancement/optimized-loc-provider branch from ff3a9f7 to e70c81a Compare March 20, 2023 23:29
@rdblue
Copy link
Contributor

rdblue commented Apr 2, 2023

@singhpk234, I opened a PR against your branch to fix a few remaining issues, like unnecessary object allocation and to improve the number of bits we get out of the hash. I think this is ready once that's merged: singhpk234#57

@singhpk234
Copy link
Contributor Author

ACK, many thanks @rdblue, cherry-picked your commit.

@rdblue rdblue merged commit a4a07ba into apache:master Apr 3, 2023
@rdblue
Copy link
Contributor

rdblue commented Apr 3, 2023

Thanks, @singhpk234! Merged.

@danielcweeks
Copy link
Contributor

Hey @singhpk234 I think this was supposed to be 6 bytes . To avoid the = buffer characters.

@singhpk234
Copy link
Contributor Author

singhpk234 commented Apr 3, 2023

@danielcweeks, since we added BaseEncoding.base64Url().omitPadding(); to ommit the = padding hence multiple of 3 requirement was gone (more details).

@danielcweeks
Copy link
Contributor

@danielcweeks, since we added BaseEncoding.base64Url().omitPadding(); to ommit the = padding hence multiple of 3 requirement was gone (more details).

Didn't realize that was an option. Thanks @singhpk234

@dramaticlly
Copy link
Contributor

@singhpk234 thanks for the change, just want to confirm. Looks like the entropy prefix moved

  • from length 8 hex like 7ffebe82 and 568b64a6
  • to length 6 base64 encoded string such as 3xWw8g and nRdUQA

@dramaticlly
Copy link
Contributor

@singhpk234 thanks for the change, just want to confirm. Looks like the entropy prefix moved

  • from length 8 hex like 7ffebe82 and 568b64a6
  • to length 6 base64 encoded string such as 3xWw8g and nRdUQA

nvm, just realized most of time the trailing == is omitted as padding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants