Improve HLLSeries performance. #575
Conversation
This commit improves the performance of the HLLSeries data structure. In particular, it introduces two new methods:

- `series.insert(bytes, timestamp)`
- `series.approximateSizeSince(threshold)`

Previously, adding data to an HLLSeries required allocating another series and then combining the two with the monoid, potentially creating a lot of garbage. The `insert` method does the same thing without the intermediate series.

Similarly, getting an approximate size used to require calling `.toHLL` to build an HLL structure and then querying that. By abstracting out the machinery that actually does the HLL calculation, we can compute this directly from an HLLSeries with `approximateSizeSince`.

There are several other internal efficiency improvements. Some independent benchmarks I've run show a 4-5x speed-up in building series from many individual events, and a 3-4x speed-up in calculating approximate sizes.

The data in the series and the serialization format are unchanged.
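As a rough illustration of the allocation pattern described above, here is a toy stand-in, not the real HLLSeries: the `TinySeries` name, the single bucket-to-timestamp row, and both method bodies are invented for this sketch.

```scala
// Toy: one "row" mapping bucket -> max timestamp, standing in for HLLSeries.
case class TinySeries(row: Map[Int, Long]) {
  // Old pattern: wrap each new event in its own one-element series, then
  // merge with a monoid-style combine -- an extra allocation per event.
  def ++(that: TinySeries): TinySeries =
    TinySeries((row.keySet ++ that.row.keySet).iterator.map { k =>
      k -> math.max(row.getOrElse(k, Long.MinValue), that.row.getOrElse(k, Long.MinValue))
    }.toMap)

  // New pattern: insert the event directly -- no intermediate series.
  def insert(bucket: Int, timestamp: Long): TinySeries =
    if (row.get(bucket).exists(_ >= timestamp)) this
    else TinySeries(row.updated(bucket, timestamp))
}
```

For example, `TinySeries(Map(1 -> 5L)).insert(1, 3L)` leaves the row unchanged (the existing timestamp is newer), while `insert(2, 7L)` adds a new bucket; both agree with `++` applied to a single-event series.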
@@ -79,34 +80,65 @@ object HyperLogLog {
  @inline
  def twopow(i: Int): Double = java.lang.Math.pow(2.0, i)

  @deprecated("this is no longer used", since = "1.12.3")
It may be nice to provide an alternative in the message here: `use j(Array[Byte], Int) instead` or something.
def asApprox(bits: Int, v: Double): Approximate[Long] = {
  val stdev = 1.04 / scala.math.sqrt(twopow(bits))
  val lowerBound = math.floor(math.max(v * (1.0 - 3 * stdev), 0.0)).toLong
  val upperBound = math.ceil(v * (1.0 + 3 * stdev)).toLong
Actually, I guess the `floor` and `ceil` should be reversed. We can't widen the bound and still claim the probability is the same. I think this is a (minor) off-by-one bug.
I'm happy to change that, but the bug seems to have been present before (I was trying to ensure identical behavior). Do you think it's worth fixing it here, or merging this PR (without differences) then making that change?
Let's wait, but we should not forget it, I think.
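For context, the interval computation under discussion can be sketched as a standalone function. This is an illustrative reimplementation, not the Algebird source; it reproduces the rounding as written in the PR (floor on the lower bound, ceil on the upper), which is exactly the direction the review suggests reversing.

```scala
object ApproxBounds {
  // Standard HLL error model: one standard deviation is 1.04 / sqrt(2^bits).
  def bounds(bits: Int, v: Double): (Long, Long) = {
    val stdev = 1.04 / math.sqrt(math.pow(2.0, bits))
    // As written in the PR: floor widens the interval downward, ceil widens
    // it upward. The review argues the rounding should go the other way,
    // since widening the bounds overstates the attached probability.
    val lowerBound = math.floor(math.max(v * (1.0 - 3 * stdev), 0.0)).toLong
    val upperBound = math.ceil(v * (1.0 + 3 * stdev)).toLong
    (lowerBound, upperBound)
  }
}
```

For `bits = 12` and an estimate of `1000.0`, the three-sigma band is about ±4.9%, giving bounds of `(951, 1049)`.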
@@ -45,10 +92,10 @@ case class HLLSeries(bits: Int, rows: Vector[Map[Int, Long]]) {
    if (rows.isEmpty)
      SparseHLL(bits, Map())
    else
-     rows.zipWithIndex.map{
+     rows.iterator.zipWithIndex.map {
Why don't we want `sumOption` here rather than `reduce(_ + _)`? Or some logic that checks if there are more than `k` things to aggregate and uses `sumOption`?
That's a good idea -- I'll try it. I had something fancier here that actually hurt performance, so I reverted to what was previously happening (more or less).
Some mostly minor comments - probably safe to ignore any of them :)
var i = 0
var sum = 0
var need = bits
while (i < bytes.length && need >= 0) {
I don't think this does anything if `need == 0`, so you can just use `need > 0`.
Agreed.
var zeros = 1 // start with a single zero
while (i < bytes.length) {
  while (j >= 0) {
    if (((bytes(i) >>> j) & 1) == 1) return zeros.toByte
Nit: feel free to ignore, but I think having `return` on its own line is worth it here. Mainly, I saw the `0.toByte` below, figured there was a `return` somewhere, but wasn't able to easily locate the return statement in this code.
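For reference, the run-of-zeros scan being reviewed can be written as a small self-contained function. This is a sketch following the loop shape quoted above, not the actual patch (the 8-bit inner loop and the `rho` name are assumptions), with `return` on its own line as the nit suggests:

```scala
object RhoSketch {
  // Count the leading zero bits across the byte array, plus one; return 0
  // if the array is all zeros.
  def rho(bytes: Array[Byte]): Byte = {
    var zeros = 1 // start with a single zero
    var i = 0
    while (i < bytes.length) {
      var j = 7
      while (j >= 0) {
        if (((bytes(i) >>> j) & 1) == 1)
          return zeros.toByte // easier to spot on its own line
        zeros += 1
        j -= 1
      }
      i += 1
    }
    0.toByte // no 1 bit anywhere
  }
}
```

For example, `rho(Array(0x20.toByte))` scans past two leading zeros before hitting the first 1 bit and returns 3.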
val it = rows(i).iterator
while (it.hasNext) {
  val (k, t) = it.next
  if (t >= threshold && !seen(k)) {
Perf nit: is the `!seen` worth it here? If the set already contains `k`, then `seen += k` should be functionally equivalent to a containment check, with no mutation anyway. I guess the question is: is `negativePowersOfTwo` expected to be slower than checking for membership in a hash table?
I think we need to avoid double-counting a particular `k`. If we remove the check then I think `sum` will be wrong.
Ah, of course, for some reason I missed that we were adding it into `sum` :| That said, you can still skip this check by using `add`, which returns `false` if the set already contains the element:

    if (t >= threshold && seen.add(k)) {
      sum += HyperLogLog.negativePowersOfTwo(i + 1)
    }
Ooh, nice!
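The `add` idiom from the exchange above, in isolation (a toy example, not the patch itself): `scala.collection.mutable.Set#add` returns `false` when the element is already present, so the membership test and the insertion collapse into a single call and each key contributes exactly once.

```scala
import scala.collection.mutable

object AddIdiom {
  // Count each distinct key the first time it is seen, without a separate
  // containment check before the mutation.
  def countDistinct(ks: Seq[Int]): Int = {
    val seen = mutable.Set.empty[Int]
    var sum = 0
    for (k <- ks)
      if (seen.add(k)) // true only the first time k appears
        sum += 1
    sum
  }
}
```

For example, `countDistinct(Seq(1, 2, 2, 3, 1))` counts 1, 2, and 3 once each and returns 3.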
Looks like I may have broken downsampling HLLs. I'll investigate more; I don't remember seeing this test fail earlier. EDIT: Things seemed fine locally, and seem fine now. Maybe the error was transient?
Current coverage is 64.41% (diff: 96.96%)

@@            develop     #575    diff @@
==========================================
  Files           111      111
  Lines          4524     4572     +48
  Methods        4111     4154     +43
  Messages          0        0
  Branches        374      379      +5
==========================================
+ Hits           2905     2945     +40
- Misses         1619     1627      +8
  Partials          0        0
This was suggested by @johnynek.
I think I've responded to all the review comments. Please let me know if there's anything else you'd like to see.
Minor comments, but 👍 when we fix those typos.
/** A super lightweight (hopefully) version of BitSet */
@deprecated("This is no longer used.", since = "1.12.3")
I think this should be `0.12.3`.
- def twopow(i: Int): Double = java.lang.Math.pow(2.0, i)
+ def twopow(i: Int): Double = Math.pow(2.0, i)

  @deprecated("This is no longer used. Use j(Array[Byte], Int) instead.", since = "1.12.3")
0.12.3
val it = (0 until limit).iterator
val h = monoid.sum(it.map(i => monoid.create(int2Bytes(i), i)))
val n = h.since(0L).toHLL.approximateSize.estimate
val delta = (limit * 0.2).toInt
Where is the `20%` coming from? Can you add a comment?
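One plausible reading (an assumption on my part, not something the PR states) is that the 20% tolerance is a loose multiple of the theoretical HLL standard error, 1.04 / sqrt(2^bits). A quick sketch of how wide a 20% band is in units of that error, at a few illustrative bit counts:

```scala
object ToleranceSketch {
  // Theoretical HLL standard error for a given register-index width.
  def stdErr(bits: Int): Double = 1.04 / math.sqrt(math.pow(2.0, bits))

  def main(args: Array[String]): Unit =
    // bit counts here are illustrative, not taken from the test under review
    for (bits <- Seq(4, 8, 12))
      println(f"bits=$bits%2d stderr=${stdErr(bits)}%.4f sigmas-in-20%%=${0.2 / stdErr(bits)}%.1f")
}
```

At 4 bits the standard error (0.26) already exceeds 20%, while at 8 bits a 20% band is roughly three standard deviations wide, which is why a comment pinning down the intent would help.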