-
-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Why not use xx hash? #52
Comments
I vaguely remember discussion around alternative hashing schemes in the beginning. I don't remember the details but my guess is murmur won because of availability, portability & familiarity. Which doesn't mean we couldn't revisit that decision :) Needs a strong contributor though, not only to implement the change but also test it, make sure packaging & distribution continues to work, etc. Are you up for it? What motivated your investigation here? |
I don't know C, so would be hard for me to implement it. I need approximate counters for truly large number of events, billions.. which yields 32bit hashes unusable.. |
Billions doesn't strike me as super large. What is the "business cost" of a collisions – how much do collisions / approximate counts matter? |
But it seems that even hundreds of thousands represent an issue for 32bit hash: There is not much cost to collision as long as the counter does not get saturated and start colliding all the time.. as I assume would happen based on the table from above link. |
Maybe I wasn't clear enough, I don't mean billions of events, but billions of unique events :) |
As for the motivation, wanted to compare performance with this library: https://github.com/mikrosk/py-probstructs |
I may have interpreted that table in wrong way... seems in real time I will be getting collisions every few seconds which is not terrible.. and that this rate does not go up... which is counterintuitive actually.. Just did a calculation based on https://en.wikipedia.org/wiki/Birthday_problem#Number_of_people_with_a_shared_birthday with 100 million unique events there will be 2.3 million collisions. 2.3% that is still too much for my use case.. n.b. there should be more than 10k collisions in wikipedia dataset.. so not sure how you get 100% F1 score. |
https://github.com/Cyan4973/xxHash
XXH128 seems to be superior in every way to MurmurHash3_x86_32 currently used.. https://github.com/RaRe-Technologies/bounter/blob/8c83e957b072a50be37221dcec1177fbe0128faf/cbounter/cms_common.c
I am also a bit surprised by the perfect F1 scores given the high collision probability of a 32 bit hash.
https://preshing.com/20110504/hash-collision-probabilities/
The text was updated successfully, but these errors were encountered: