Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash Function for UUID Primary Keys #12538

Merged
merged 9 commits into from
May 17, 2024

Conversation

ankitsultana
Copy link
Contributor

@ankitsultana ankitsultana commented Mar 2, 2024

This can help reduce memory (disk for off-heap hashmap) usage for scenarios where all columns of the Primary Key are UUIDs (as defined in RFC 4122). This can also handle scenarios where some values are not valid UUIDs, by simply switching to being a no-op hash-function.

Using Murmur's 128 bit hash-function will either give the same or higher space savings (in case of composite keys), but it risks hash collision. Though chances of that are extremely low, we usually avoid any exposure for critical use-cases.

Test Plan: Added UTs. Also tested this in our cluster and some results are as follows.

Testing on Cluster (~35% Old Gen improvement)

Onboarded some test tables to one of our large servers. Each server ingested from the same Kafka topic, and both of them ended up with ~397M primary keys.

You can see the memory savings in the graph below. At 17:21 I triggered a full GC by running jmap -histo:live on both servers. I triggered a full GC again a few minutes later which corresponds to the second dip in old gen.

image

Here's the output of jmap -histo:live in the two servers:

Server with no hashing:

num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:     405356239    26288334696  [B ([email protected])
   2:     405016060    19440770880  java.util.concurrent.ConcurrentHashMap$Node ([email protected])
   3:     396819098    15872763920  org.apache.pinot.segment.local.upsert.ConcurrentMapPartitionUpsertMetadataManager$RecordLocation
   4:     396947570    13331326424  [Ljava.lang.Object; ([email protected])
   5:     405208908    12966685056  java.lang.String ([email protected])
   6:     396819098     9523658352  org.apache.pinot.spi.data.readers.PrimaryKey
   7:          8304     6710581760  [Ljava.util.concurrent.ConcurrentHashMap$Node; ([email protected])
   8:     118497450     3966607216  [C ([email protected])
   9:      61298209     3432699704  java.nio.HeapCharBuffer ([email protected])
  10:      57191904     3235026912  [Lorg.roaringbitmap.buffer.MappeableContainer;
...

Server with UUID hashing:

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:     404045476    19394182848  java.util.concurrent.ConcurrentHashMap$Node ([email protected])
   2:     404386800    16702510464  [B ([email protected])
   3:     396820100    15872804000  org.apache.pinot.segment.local.upsert.ConcurrentMapPartitionUpsertMetadataManager$RecordLocation
   4:     396820101    12698243232  org.apache.pinot.spi.utils.ByteArray
   5:          8152     6621435072  [Ljava.util.concurrent.ConcurrentHashMap$Node; ([email protected])
   6:         15846     2171269472  [I ([email protected])
   7:      30760333     1066705472  [C ([email protected])
   8:      15922426      891655856  java.nio.HeapCharBuffer ([email protected])
   9:      14830564      838747616  [Lorg.roaringbitmap.buffer.MappeableContainer;
  10:      14830564      593222560  org.roaringbitmap.buffer.MutableRoaringArray
...

@ankitsultana ankitsultana changed the title [experiment] Add Deterministic Hash Function for Upserts for Encoding Type-4 UUIDs [experiment] Deterministic Hash Function for Upsert Type-4 UUID Primary Keys Mar 2, 2024
@codecov-commenter
Copy link

codecov-commenter commented Mar 2, 2024

Codecov Report

Attention: Patch coverage is 93.33333% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 62.14%. Comparing base (59551e4) to head (5f14553).
Report is 453 commits behind head on master.

Files Patch % Lines
...rg/apache/pinot/segment/local/utils/HashUtils.java 92.85% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12538      +/-   ##
============================================
+ Coverage     61.75%   62.14%   +0.39%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2517      +81     
  Lines        133233   137911    +4678     
  Branches      20636    21342     +706     
============================================
+ Hits          82274    85709    +3435     
- Misses        44911    45814     +903     
- Partials       6048     6388     +340     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 62.11% <93.33%> (+0.40%) ⬆️
java-21 62.03% <93.33%> (+0.40%) ⬆️
skip-bytebuffers-false 62.13% <93.33%> (+0.38%) ⬆️
skip-bytebuffers-true 62.02% <93.33%> (+34.29%) ⬆️
temurin 62.14% <93.33%> (+0.39%) ⬆️
unittests 62.14% <93.33%> (+0.39%) ⬆️
unittests1 46.72% <6.66%> (-0.17%) ⬇️
unittests2 27.77% <86.66%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ankitsultana ankitsultana marked this pull request as draft March 15, 2024 22:18
@ankitsultana ankitsultana changed the title [experiment] Deterministic Hash Function for Upsert Type-4 UUID Primary Keys [experiment] Deterministic Hash Function for UUID Primary Keys Mar 18, 2024
@ankitsultana ankitsultana changed the title [experiment] Deterministic Hash Function for UUID Primary Keys Deterministic Hash Function for UUID Primary Keys Mar 21, 2024
@ankitsultana ankitsultana marked this pull request as ready for review March 21, 2024 16:51
@chenboat chenboat assigned chenboat and unassigned chenboat Mar 28, 2024
Copy link
Member

@satishd satishd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ankitsultana for the PR, LGTM.

@ankitsultana
Copy link
Contributor Author

@chenboat @satishd : I was thinking about this today and I think appending a null-byte is not the right approach, since it can lead to collisions (albeit on sub-optimally defined primary-keys). I have instead resorted to prepending the length of the byte-array for each element in the concatenate method. Note that this gets called only on invalid UUID value scenarios, which should be rare.

Copy link
Member

@satishd satishd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankitsultana
If any of the values is not of type UUID and hash function is used as UUID, it generates a string, and appending string bytes prefixed with the respective length. But that can also cause key collisions.

Other alternatives when any of the values is invalid UUID type:

  1. Use primaryKey.asBytes() as the intention is more like a no-op hash function as you mentioned in this PR description.
  2. Fallback to Murmur3 hash.

wdyt?

@ankitsultana
Copy link
Contributor Author

ankitsultana commented Apr 2, 2024

@ankitsultana If any of the values is not of type UUID and hash function is used as UUID, it generates a string, and appending string bytes prefixed with the respective length. But that can also cause key collisions.

Other alternatives when any of the values is invalid UUID type:

  1. Use primaryKey.asBytes() as the intention is more like a no-op hash function as you mentioned in this PR description.
  2. Fallback to Murmur3 hash.

wdyt?

@satishd I guess you mean collision between two primary keys where one is UUID conformant and the other is not?

Unless we prepend the length of the string in the UUID conformant serialization scheme, there will always be theoretical chances of collision. I think prepending the length of the string removes chances of such collision beyond any reasonable doubt. I got rid of the null terminated strings because those felt too easy to collide (e.g. (foo, bar) and (fo, obar)).

@ankitsultana ankitsultana changed the title Deterministic Hash Function for UUID Primary Keys Hash Function for UUID Primary Keys Apr 2, 2024
@satishd
Copy link
Member

satishd commented Apr 2, 2024

@satishd I guess you mean collision between two primary keys where one is UUID conformant and the other is not?

Right.

Unless we prepend the length of the string in the UUID conformant serialization scheme, there will always be theoretical chances of collision. I think prepending the length of the string removes chances of such collision beyond any reasonable doubt. I got rid of the null terminated strings because those felt too easy to collide (e.g. (foo, bar) and (fo, obar)).

@ankitsultana Other approach like falling back to Murmur3 may create collisions but reduce size. I understand that you are trying to make a tradeoff here for consistency and storage by trying to generate the unique keys even when the values are not conforming to the respective UUID format. But it can cause collisions in very rare scenarios. This is because it uses value.toString().getBytes(StandardCharsets.UTF_8). The null values and strings as "null" result the same based on the concat method implementation.

These are rare scenarios and the proposed implementation in this PR seems reasonable to me.

@chenboat
Copy link
Contributor

chenboat commented Apr 5, 2024

Can you be more specific on the PR title and description section on what is the hash function used? It is not clear currently.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback logic is kind of customized behavior. I'm okay moving forward, but we should consider making the hash function pluggable so that you may plug your own customized hash (this is actually not hash, but compression)

* {@link UUID} object. The 16 bytes of each UUID are appended to a buffer which is then returned in the result.
* If any of the values is not a valid UUID, then we return the result of {@link #concatenate}.
*/
public static byte[] hashUUID(Object[] values) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarification, we are not really hashing the values, but compress it instead right?

@ankitsultana ankitsultana requested a review from Jackie-Jiang May 17, 2024 00:38
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Jackie-Jiang Jackie-Jiang merged commit a692560 into apache:master May 17, 2024
20 checks passed
gortiz pushed a commit to gortiz/pinot that referenced this pull request Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants