Improve HLLSeries performance. #575
Conversation
This commit improves the performance of the HLLSeries data structure. In particular, it introduces two new methods:

- `series.insert(bytes, timestamp)`
- `series.approximateSizeSince(threshold)`

Previously, adding data to an HLLSeries required allocating another series and then combining the two with the monoid, potentially creating a lot of garbage. The `insert` method does the same thing without the intermediate series.

Similarly, getting an approximate size used to require calling `.toHLL` to build an HLL structure and then querying that. By abstracting out the machinery that actually does the HLL calculation, we can compute this directly from an HLLSeries with `approximateSizeSince`.

There are several other internal efficiency improvements. Some independent benchmarks I've run show a 4-5x speed-up in building series from many individual events, and a 3-4x speed-up in calculating approximate sizes.

The data in the series and the serialization format are unchanged.
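As a rough illustration of the allocation pattern described above, here is a toy stand-in, not the real HLLSeries: the `TinySeries` name, the single bucket-to-timestamp row, and both method bodies are invented for this sketch.

```scala
// Toy: one "row" mapping bucket -> max timestamp, standing in for HLLSeries.
case class TinySeries(row: Map[Int, Long]) {
  // Old pattern: wrap each new event in its own one-element series, then
  // merge with a monoid-style combine -- an extra allocation per event.
  def ++(that: TinySeries): TinySeries =
    TinySeries((row.keySet ++ that.row.keySet).iterator.map { k =>
      k -> math.max(row.getOrElse(k, Long.MinValue), that.row.getOrElse(k, Long.MinValue))
    }.toMap)

  // New pattern: insert the event directly -- no intermediate series.
  def insert(bucket: Int, timestamp: Long): TinySeries =
    if (row.get(bucket).exists(_ >= timestamp)) this
    else TinySeries(row.updated(bucket, timestamp))
}
```

For example, `TinySeries(Map(1 -> 5L)).insert(1, 3L)` leaves the row unchanged (the existing timestamp is newer), while `insert(2, 7L)` adds a new bucket; both agree with `++` applied to a single-event series.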
@@ -79,34 +80,65 @@ object HyperLogLog {
  @inline
  def twopow(i: Int): Double = java.lang.Math.pow(2.0, i)

  @deprecated("this is no longer used", since = "1.12.3")
It may be nice to provide an alternative in the message here: `use j(Array[Byte], Int) instead` or something.
def asApprox(bits: Int, v: Double): Approximate[Long] = {
  val stdev = 1.04 / scala.math.sqrt(twopow(bits))
  val lowerBound = math.floor(math.max(v * (1.0 - 3 * stdev), 0.0)).toLong
  val upperBound = math.ceil(v * (1.0 + 3 * stdev)).toLong
Actually, I guess the `floor` and `ceil` should be reversed. We can't widen the bound and still claim the probability is the same. I think this is a (minor) off-by-one bug.
I'm happy to change that, but the bug seems to have been present before (I was trying to ensure identical behavior). Do you think it's worth fixing it here, or merging this PR (without differences) then making that change?
Let's wait, but we should not forget it, I think.
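For context, the interval computation under discussion can be sketched as a standalone function. This is an illustrative reimplementation, not the Algebird source; it reproduces the rounding as written in the PR (floor on the lower bound, ceil on the upper), which is exactly the direction the review suggests reversing.

```scala
object ApproxBounds {
  // Standard HLL error model: one standard deviation is 1.04 / sqrt(2^bits).
  def bounds(bits: Int, v: Double): (Long, Long) = {
    val stdev = 1.04 / math.sqrt(math.pow(2.0, bits))
    // As written in the PR: floor widens the interval downward, ceil widens
    // it upward. The review argues the rounding should go the other way,
    // since widening the bounds overstates the attached probability.
    val lowerBound = math.floor(math.max(v * (1.0 - 3 * stdev), 0.0)).toLong
    val upperBound = math.ceil(v * (1.0 + 3 * stdev)).toLong
    (lowerBound, upperBound)
  }
}
```

For `bits = 12` and an estimate of `1000.0`, the three-sigma band is about ±4.9%, giving bounds of `(951, 1049)`.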
@@ -45,10 +92,10 @@ case class HLLSeries(bits: Int, rows: Vector[Map[Int, Long]]) {
    if (rows.isEmpty)
      SparseHLL(bits, Map())
    else
-     rows.zipWithIndex.map{
+     rows.iterator.zipWithIndex.map {
Why don't we want `sumOption` here rather than `reduce(_ + _)`? Or some logic that checks if there are more than `k` things to aggregate and uses `sumOption`?
That's a good idea -- I'll try it. I had something fancier here that actually hurt performance, so I reverted to what was previously happening (more or less).
Some mostly minor comments - probably safe to ignore any of them :)
var i = 0
var sum = 0
var need = bits
while (i < bytes.length && need >= 0) {
I don't think this does anything if `need == 0`, so you can just use `need > 0`.
Agreed.
var zeros = 1 // start with a single zero
while (i < bytes.length) {
  while (j >= 0) {
    if (((bytes(i) >>> j) & 1) == 1) return zeros.toByte
Nit: feel free to ignore, but I think having `return` on its own line is worth it here. Mainly, I saw the `0.toByte` below, figured there was a `return` somewhere, but wasn't able to easily locate the return statement in this code.
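For reference, the run-of-zeros scan being reviewed can be written as a small self-contained function. This is a sketch following the loop shape quoted above, not the actual patch (the 8-bit inner loop and the `rho` name are assumptions), with `return` on its own line as the nit suggests:

```scala
object RhoSketch {
  // Count the leading zero bits across the byte array, plus one; return 0
  // if the array is all zeros.
  def rho(bytes: Array[Byte]): Byte = {
    var zeros = 1 // start with a single zero
    var i = 0
    while (i < bytes.length) {
      var j = 7
      while (j >= 0) {
        if (((bytes(i) >>> j) & 1) == 1)
          return zeros.toByte // easier to spot on its own line
        zeros += 1
        j -= 1
      }
      i += 1
    }
    0.toByte // no 1 bit anywhere
  }
}
```

For example, `rho(Array(0x20.toByte))` scans past two leading zeros before hitting the first 1 bit and returns 3.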
val it = rows(i).iterator
while (it.hasNext) {
  val (k, t) = it.next
  if (t >= threshold && !seen(k)) {
Perf nit: is the `!seen` worth it here? If the set already contains `k`, then `seen += k` should be functionally equivalent to a containment check, with no mutation anyway. I guess the question is: is `negativePowersOfTwo` expected to be slower than checking for membership in a hash table?
I think we need to avoid double-counting a particular `k`. If we remove the check then I think `sum` will be wrong.
Ah, of course, for some reason I missed that we were adding it into `sum` :| That said, you can still skip this check by using `add`, which returns `false` if the set already contains the element:

    if (t >= threshold && seen.add(k)) {
      sum += HyperLogLog.negativePowersOfTwo(i + 1)
    }
Ooh, nice!
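The `add` idiom from the exchange above, in isolation (a toy example, not the patch itself): `scala.collection.mutable.Set#add` returns `false` when the element is already present, so the membership test and the insertion collapse into a single call and each key contributes exactly once.

```scala
import scala.collection.mutable

object AddIdiom {
  // Count each distinct key the first time it is seen, without a separate
  // containment check before the mutation.
  def countDistinct(ks: Seq[Int]): Int = {
    val seen = mutable.Set.empty[Int]
    var sum = 0
    for (k <- ks)
      if (seen.add(k)) // true only the first time k appears
        sum += 1
    sum
  }
}
```

For example, `countDistinct(Seq(1, 2, 2, 3, 1))` counts 1, 2, and 3 once each and returns 3.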
Looks like I may have broken downsampling HLLs. I'll investigate more; I don't remember seeing this test fail earlier. EDIT: Things seemed fine locally, and seem fine now. Maybe the error was transient?
Current coverage is 64.41% (diff: 96.96%)

@@            develop     #575    diff @@
==========================================
  Files           111      111
  Lines          4524     4572     +48
  Methods        4111     4154     +43
  Messages          0        0
  Branches        374      379      +5
==========================================
+ Hits           2905     2945     +40
- Misses         1619     1627      +8
  Partials          0        0
This was suggested by @johnynek.
I think I've responded to all the review comments. Please let me know if there's anything else you'd like to see.
Minor comments, but 👍 when we fix those typos.
/** A super lightweight (hopefully) version of BitSet */
@deprecated("This is no longer used.", since = "1.12.3")
I think this should be `0.12.3`.
- def twopow(i: Int): Double = java.lang.Math.pow(2.0, i)
+ def twopow(i: Int): Double = Math.pow(2.0, i)

  @deprecated("This is no longer used. Use j(Array[Byte], Int) instead.", since = "1.12.3")
0.12.3
val it = (0 until limit).iterator
val h = monoid.sum(it.map(i => monoid.create(int2Bytes(i), i)))
val n = h.since(0L).toHLL.approximateSize.estimate
val delta = (limit * 0.2).toInt
Where is the `20%` coming from? Can you add a comment?
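One plausible reading (an assumption on my part, not something the PR states) is that the 20% tolerance is a loose multiple of the theoretical HLL standard error, 1.04 / sqrt(2^bits). A quick sketch of how wide a 20% band is in units of that error, at a few illustrative bit counts:

```scala
object ToleranceSketch {
  // Theoretical HLL standard error for a given register-index width.
  def stdErr(bits: Int): Double = 1.04 / math.sqrt(math.pow(2.0, bits))

  def main(args: Array[String]): Unit =
    // bit counts here are illustrative, not taken from the test under review
    for (bits <- Seq(4, 8, 12))
      println(f"bits=$bits%2d stderr=${stdErr(bits)}%.4f sigmas-in-20%%=${0.2 / stdErr(bits)}%.1f")
}
```

At 4 bits the standard error (0.26) already exceeds 20%, while at 8 bits a 20% band is roughly three standard deviations wide, which is why a comment pinning down the intent would help.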