Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why not use xx hash? #52

Open
beviah opened this issue Feb 23, 2022 · 7 comments
Open

Why not use xx hash? #52

beviah opened this issue Feb 23, 2022 · 7 comments

Comments

@beviah
Copy link

beviah commented Feb 23, 2022

https://github.com/Cyan4973/xxHash

XXH128 seems to be superior in every way to MurmurHash3_x86_32 currently used.. https://github.com/RaRe-Technologies/bounter/blob/8c83e957b072a50be37221dcec1177fbe0128faf/cbounter/cms_common.c

I am also a bit surprised by the perfect F1 scores given the high collision probability of a 32 bit hash.
https://preshing.com/20110504/hash-collision-probabilities/

@piskvorky
Copy link
Owner

piskvorky commented Feb 23, 2022

I vaguely remember discussion around alternative hashing schemes in the beginning. I don't remember the details but my guess is murmur won because of availability, portability & familiarity.

Which doesn't mean we couldn't revisit that decision :) Needs a strong contributor though, not only to implement the change but also test it, make sure packaging & distribution continues to work, etc. Are you up for it? What motivated your investigation here?

@beviah
Copy link
Author

beviah commented Feb 23, 2022

I don't know C, so would be hard for me to implement it.

I need approximate counters for truly large number of events, billions.. which yields 32bit hashes unusable..

@piskvorky
Copy link
Owner

piskvorky commented Feb 23, 2022

Billions doesn't strike me as super large. What is the "business cost" of a collisions – how much do collisions / approximate counts matter?

@beviah
Copy link
Author

beviah commented Feb 23, 2022

But it seems that even hundreds of thousands represent an issue for 32bit hash:
https://preshing.com/images/small-probabilities.png

There is not much cost to collision as long as the counter does not get saturated and start colliding all the time.. as I assume would happen based on the table from above link.

@beviah
Copy link
Author

beviah commented Feb 23, 2022

Maybe I wasn't clear enough, I don't mean billions of events, but billions of unique events :)
Hundreds of millions at least..

@beviah
Copy link
Author

beviah commented Feb 23, 2022

As for the motivation, wanted to compare performance with this library: https://github.com/mikrosk/py-probstructs

@beviah
Copy link
Author

beviah commented Feb 23, 2022

I may have interpreted that table in wrong way... seems in real time I will be getting collisions every few seconds which is not terrible.. and that this rate does not go up... which is counterintuitive actually..

Just did a calculation based on https://en.wikipedia.org/wiki/Birthday_problem#Number_of_people_with_a_shared_birthday

with 100 million unique events there will be 2.3 million collisions. 2.3%

that is still too much for my use case..

n.b. there should be more than 10k collisions in wikipedia dataset.. so not sure how you get 100% F1 score.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants