Hash Function for UUID Primary Keys #12538

ankitsultana · 2024-03-02T07:37:06Z

This can help reduce memory (disk for off-heap hashmap) usage for scenarios where all columns of the Primary Key are UUIDs (as defined in RFC 4122). This can also handle scenarios where some values are not valid UUIDs, by simply switching to being a no-op hash-function.

Using Murmur's 128 bit hash-function will either give the same or higher space savings (in case of composite keys), but it risks hash collision. Though chances of that are extremely low, we usually avoid any exposure for critical use-cases.

Test Plan: Added UTs. Also tested this in our cluster and some results are as follows.

Testing on Cluster (~35% Old Gen improvement)

Onboarded some test tables to one of our large servers. Each server ingested from the same Kafka topic, and both of them ended up with ~397M primary keys.

You can see the memory savings in the graph below. At 17:21 I triggered a full GC by running jmap -histo:live on both servers. I triggered a full GC again a few minutes later which corresponds to the second dip in old gen.

Here's the output of jmap -histo:live in the two servers:

Server with no hashing:

num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:     405356239    26288334696  [B ([email protected])
   2:     405016060    19440770880  java.util.concurrent.ConcurrentHashMap$Node ([email protected])
   3:     396819098    15872763920  org.apache.pinot.segment.local.upsert.ConcurrentMapPartitionUpsertMetadataManager$RecordLocation
   4:     396947570    13331326424  [Ljava.lang.Object; ([email protected])
   5:     405208908    12966685056  java.lang.String ([email protected])
   6:     396819098     9523658352  org.apache.pinot.spi.data.readers.PrimaryKey
   7:          8304     6710581760  [Ljava.util.concurrent.ConcurrentHashMap$Node; ([email protected])
   8:     118497450     3966607216  [C ([email protected])
   9:      61298209     3432699704  java.nio.HeapCharBuffer ([email protected])
  10:      57191904     3235026912  [Lorg.roaringbitmap.buffer.MappeableContainer;
...

Server with UUID hashing:

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:     404045476    19394182848  java.util.concurrent.ConcurrentHashMap$Node ([email protected])
   2:     404386800    16702510464  [B ([email protected])
   3:     396820100    15872804000  org.apache.pinot.segment.local.upsert.ConcurrentMapPartitionUpsertMetadataManager$RecordLocation
   4:     396820101    12698243232  org.apache.pinot.spi.utils.ByteArray
   5:          8152     6621435072  [Ljava.util.concurrent.ConcurrentHashMap$Node; ([email protected])
   6:         15846     2171269472  [I ([email protected])
   7:      30760333     1066705472  [C ([email protected])
   8:      15922426      891655856  java.nio.HeapCharBuffer ([email protected])
   9:      14830564      838747616  [Lorg.roaringbitmap.buffer.MappeableContainer;
  10:      14830564      593222560  org.roaringbitmap.buffer.MutableRoaringArray
...

…ng Type-4 UUIDs

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java

codecov-commenter · 2024-03-02T08:12:35Z

Codecov Report

Attention: Patch coverage is 93.33333% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 62.14%. Comparing base (59551e4) to head (5f14553).
Report is 453 commits behind head on master.

Files	Patch %	Lines
...rg/apache/pinot/segment/local/utils/HashUtils.java	92.85%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #12538      +/-   ##
============================================
+ Coverage     61.75%   62.14%   +0.39%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2517      +81     
  Lines        133233   137911    +4678     
  Branches      20636    21342     +706     
============================================
+ Hits          82274    85709    +3435     
- Misses        44911    45814     +903     
- Partials       6048     6388     +340

Flag	Coverage Δ
custom-integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration	`<0.01% <0.00%> (-0.01%)`	⬇️
integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration2	`0.00% <0.00%> (ø)`
java-11	`62.11% <93.33%> (+0.40%)`	⬆️
java-21	`62.03% <93.33%> (+0.40%)`	⬆️
skip-bytebuffers-false	`62.13% <93.33%> (+0.38%)`	⬆️
skip-bytebuffers-true	`62.02% <93.33%> (+34.29%)`	⬆️
temurin	`62.14% <93.33%> (+0.39%)`	⬆️
unittests	`62.14% <93.33%> (+0.39%)`	⬆️
unittests1	`46.72% <6.66%> (-0.17%)`	⬇️
unittests2	`27.77% <86.66%> (+0.04%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java

satishd

Thanks @ankitsultana for the PR, LGTM.

ankitsultana · 2024-04-01T03:11:40Z

@chenboat @satishd : I was thinking about this today and I think appending a null-byte is not the right approach, since it can lead to collisions (albeit on sub-optimally defined primary-keys). I have instead resorted to prepending the length of the byte-array for each element in the concatenate method. Note that this gets called only on invalid UUID value scenarios, which should be rare.

satishd

@ankitsultana
If any of the values is not of type UUID and hash function is used as UUID, it generates a string, and appending string bytes prefixed with the respective length. But that can also cause key collisions.

Other alternatives when any of the values is invalid UUID type:

Use primaryKey.asBytes() as the intention is more like a no-op hash function as you mentioned in this PR description.
Fallback to Murmur3 hash.

wdyt?

ankitsultana · 2024-04-02T07:21:49Z

@ankitsultana If any of the values is not of type UUID and hash function is used as UUID, it generates a string, and appending string bytes prefixed with the respective length. But that can also cause key collisions.

Other alternatives when any of the values is invalid UUID type:

Use primaryKey.asBytes() as the intention is more like a no-op hash function as you mentioned in this PR description.

Fallback to Murmur3 hash.

wdyt?

@satishd I guess you mean collision between two primary keys where one is UUID conformant and the other is not?

Unless we prepend the length of the string in the UUID conformant serialization scheme, there will always be theoretical chances of collision. I think prepending the length of the string removes chances of such collision beyond any reasonable doubt. I got rid of the null terminated strings because those felt too easy to collide (e.g. (foo, bar) and (fo, obar)).

satishd · 2024-04-02T17:05:20Z

@satishd I guess you mean collision between two primary keys where one is UUID conformant and the other is not?

Right.

Unless we prepend the length of the string in the UUID conformant serialization scheme, there will always be theoretical chances of collision. I think prepending the length of the string removes chances of such collision beyond any reasonable doubt. I got rid of the null terminated strings because those felt too easy to collide (e.g. (foo, bar) and (fo, obar)).

@ankitsultana Other approach like falling back to Murmur3 may create collisions but reduce size. I understand that you are trying to make a tradeoff here for consistency and storage by trying to generate the unique keys even when the values are not conforming to the respective UUID format. But it can cause collisions in very rare scenarios. This is because it uses value.toString().getBytes(StandardCharsets.UTF_8). The null values and strings as "null" result the same based on the concat method implementation.

These are rare scenarios and the proposed implementation in this PR seems reasonable to me.

chenboat · 2024-04-05T00:30:01Z

Can you be more specific on the PR title and description section on what is the hash function used? It is not clear currently.

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java

Jackie-Jiang

The fallback logic is kind of customized behavior. I'm okay moving forward, but we should consider making the hash function pluggable so that you may plug your own customized hash (this is actually not hash, but compression)

Jackie-Jiang · 2024-04-17T18:53:00Z

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java

+   * {@link UUID} object. The 16 bytes of each UUID are appended to a buffer which is then returned in the result.
+   * If any of the values is not a valid UUID, then we return the result of {@link #concatenate}.
+   */
+  public static byte[] hashUUID(Object[] values) {


For clarification, we are not really hashing the values, but compress it instead right?

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java

Jackie-Jiang

LGTM

ankitsultana added 2 commits March 1, 2024 23:19

[experiment] Add a Deterministic Hash Function for Upserts for Encodi…

6dc637c

…ng Type-4 UUIDs

test failure scenario

5d19f6d

ankitsultana added upsert performance labels Mar 2, 2024

ankitsultana changed the title ~~[experiment] Add Deterministic Hash Function for Upserts for Encoding Type-4 UUIDs~~ [experiment] Deterministic Hash Function for Upsert Type-4 UUID Primary Keys Mar 2, 2024

ankitsultana commented Mar 2, 2024

View reviewed changes

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java Outdated Show resolved Hide resolved

ankitsultana commented Mar 2, 2024

View reviewed changes

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java Outdated Show resolved Hide resolved

Jackie-Jiang reviewed Mar 6, 2024

View reviewed changes

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java Outdated Show resolved Hide resolved

ankitsultana marked this pull request as draft March 15, 2024 22:18

ankitsultana changed the title ~~[experiment] Deterministic Hash Function for Upsert Type-4 UUID Primary Keys~~ [experiment] Deterministic Hash Function for UUID Primary Keys Mar 18, 2024

save last weekend changes and push

07ee397

ankitsultana changed the title ~~[experiment] Deterministic Hash Function for UUID Primary Keys~~ Deterministic Hash Function for UUID Primary Keys Mar 21, 2024

ankitsultana added 2 commits March 21, 2024 00:17

push old saved changes

bf107eb

fix lint

60cd210

ankitsultana marked this pull request as ready for review March 21, 2024 16:51

ankitsultana requested a review from Jackie-Jiang March 25, 2024 17:38

ankitsultana commented Mar 25, 2024

View reviewed changes

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java Outdated Show resolved Hide resolved

chenboat assigned chenboat and unassigned chenboat Mar 28, 2024

chenboat reviewed Mar 28, 2024

View reviewed changes

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java Outdated Show resolved Hide resolved

chenboat reviewed Mar 28, 2024

View reviewed changes

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java Outdated Show resolved Hide resolved

address feedback + self-review

2f4094e

chenboat approved these changes Mar 28, 2024

View reviewed changes

eaugene reviewed Mar 29, 2024

View reviewed changes

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java Outdated Show resolved Hide resolved

satishd approved these changes Mar 29, 2024

View reviewed changes

ensure correctness in concatenate

4eb4964

satishd reviewed Apr 2, 2024

View reviewed changes

ankitsultana changed the title ~~Deterministic Hash Function for UUID Primary Keys~~ Hash Function for UUID Primary Keys Apr 2, 2024

chenboat reviewed Apr 5, 2024

View reviewed changes

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java Show resolved Hide resolved

Jackie-Jiang reviewed Apr 5, 2024

View reviewed changes

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java Outdated Show resolved Hide resolved

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/HashUtils.java Outdated Show resolved Hide resolved

ankitsultana closed this Apr 5, 2024

ankitsultana reopened this Apr 16, 2024

Jackie-Jiang reviewed Apr 17, 2024

View reviewed changes

ankitsultana added 2 commits May 16, 2024 17:26

address feedback: throw on null + switch to PrimaryKey.asBytes

4f388c6

fix test

5f14553

ankitsultana requested a review from Jackie-Jiang May 17, 2024 00:38

Jackie-Jiang approved these changes May 17, 2024

View reviewed changes

Jackie-Jiang merged commit a692560 into apache:master May 17, 2024
20 checks passed

gortiz pushed a commit to gortiz/pinot that referenced this pull request Jun 14, 2024

Hash Function for UUID Primary Keys (apache#12538)

8fd2245

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hash Function for UUID Primary Keys #12538

Hash Function for UUID Primary Keys #12538

ankitsultana commented Mar 2, 2024 •

edited

Loading

codecov-commenter commented Mar 2, 2024 •

edited

Loading

satishd left a comment

ankitsultana commented Apr 1, 2024

satishd left a comment

ankitsultana commented Apr 2, 2024 •

edited

Loading

satishd commented Apr 2, 2024 •

edited

Loading

chenboat commented Apr 5, 2024

Jackie-Jiang left a comment

Jackie-Jiang Apr 17, 2024

Jackie-Jiang left a comment

Hash Function for UUID Primary Keys #12538

Hash Function for UUID Primary Keys #12538

Conversation

ankitsultana commented Mar 2, 2024 • edited Loading

Testing on Cluster (~35% Old Gen improvement)

codecov-commenter commented Mar 2, 2024 • edited Loading

Codecov Report

satishd left a comment

Choose a reason for hiding this comment

ankitsultana commented Apr 1, 2024

satishd left a comment

Choose a reason for hiding this comment

ankitsultana commented Apr 2, 2024 • edited Loading

satishd commented Apr 2, 2024 • edited Loading

chenboat commented Apr 5, 2024

Jackie-Jiang left a comment

Choose a reason for hiding this comment

Jackie-Jiang Apr 17, 2024

Choose a reason for hiding this comment

Jackie-Jiang left a comment

Choose a reason for hiding this comment

ankitsultana commented Mar 2, 2024 •

edited

Loading

codecov-commenter commented Mar 2, 2024 •

edited

Loading

ankitsultana commented Apr 2, 2024 •

edited

Loading

satishd commented Apr 2, 2024 •

edited

Loading