-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up HLL presentation by 100x #491
Conversation
This is failing on the HLL downsize test (which passes on my machine). Is this a known-flaky test, or is this maybe pointing to some actual issue? |
The two builds actually failed on two different tests, one of them was that, the min hasher one too. Both are alas flaky tests. Restarted both |
LGTM though, merge when green |
I just tested with @kamalmarhubi using a lookup table as well as |
This breaks when mj > 30. If we're not concerned about that, can we at least add an assertion so we fail instead of giving the wrong result?
|
Ok, yeah, with a lookup table (pushed) it's over 100x as fast:
|
@@ -43,6 +43,8 @@ object HyperLogLog { | |||
/* Size of the hash in bits */ | |||
val hashSize = 128 | |||
|
|||
val powersOfNegativeTwo: Array[Double] = 0.until(hashSize).map{ i => math.pow(2.0, -i) }.toArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to instead of until?
I think it's the rhoW that's taken here? On inspection the code seems a little not right (I could be missing something), but it seems like this is the count of the number of leading 0s, so it should include hashSize for the case that the hash is all 0s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're always setting aside at least 4 bits for the bucket, so I think the max is actually 124 - this should be slightly more than needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry looks like we raced on actions there. Can you have hashSize itself rather than hashSize -1? the tests are green, so it seems unlikely this is wrong per say right? or do we have a test missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. For the record the parts that look not right are in rhoW:
- terminal condition is that
contains
returns false, but contains doesn't do any bounds checking; is it possible to overrun the array, however unlikely? - loop starts at 1 instead of 0. with even a small aggregation I imagine this doesn't matter, but isn't it technically incorrect?
Again I could be missing something, but if you think they might be worth addressing maybe lets file issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be negativePowersOfTwo
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I'm confused as to what's being asked here.
@ianoc I'm quite certain we don't need more than hashSize-1
@jnievelt IIRC the definition of rhoW is "index of first non-zero", not "number of zeros" so that might explain starting at 1?
@kamalmarhubi argh, good point
Speed up HLL presentation by 100x
If/when we clean this up to fix the name, can will also make the table |
@@ -447,7 +449,7 @@ case class DenseHLL(bits: Int, v: Bytes) extends HLL { | |||
count += 1 | |||
res += 1.0 | |||
} else { | |||
res += java.lang.Math.pow(2.0, -mj) | |||
res += HyperLogLog.powersOfNegativeTwo(mj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be negativePowersOfTwo, not powersOfNegativeTwo? the current name implies we are doing (-2)^(mj) which is not the same as (2)^(-mj)
Given the performance boost could you not replace all instances of |
@johnynek it's used in the |
yeah, private[algebird] is good, once its classed as an implementation detail outside the project so we can change it later |
@zg not, for two reasons:
|
Do we have an ETA on when a version of algebird with this optimization will be released? edit: I'm told this was just released yesterday. |
@eigenvariable Latest algebird was released only 2 days go. Check out the releases section for more info. |
This changes the use of
math.pow(2.0,-mj)
when computing the estimates inSparseHLL
andDenseHLL
to1.0 / (1 << mj)
.To benchmark the difference this makes, I first had to change the
HLLPresentBenchmark
to clone theHLL
instances before asking for theirapproximateSize
, otherwise the estimate gets cached as a lazy val and the benchmarks are meaningless.Having done that, you can see a dramatic difference. Because I'm impatient I have so far only run this with a single set of parameters, where it led to a 40x speedup.
Before:
After: