Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve HLLSeries performance. #575

Merged
merged 7 commits into from
Nov 17, 2016
Merged

Conversation

non
Copy link
Contributor

@non non commented Nov 15, 2016

This commit improves the performance of the HLLSeries data
structure. In particular, it introduces two new methods:

series.insert(bytes, timestamp)
series.approximateSizeSince(threshold)

Previously to add data to an HLLSeries required allocating another
series and then combining them with the monoid, potentially creating a
lot of garbage. The insert method does the same thing but without an
intermediate series.

Similarly, getting an approximate size used to require calling
.toHLL to build an HLL structure, and then asking that. By
abstracting out the machinery to actually do the HLL calculation,
we're able to do this directly from HLLSeries with
approximateSizeSince.

There are several other internal efficiency improvements. Some
independent benchmarks I've run show a 4-5x speed up in building
series' from many individual events, and a 3-4x speed up in
calculating the approximate sizes.

The data in the series and the serialization format are unchanged.

This commit improves the performance of the HLLSeries data
structure. In particular, it introduces two new methods:

    series.insert(bytes, timestamp)
    series.approximateSizeSince(threshold)

Previously to add data to an HLLSeries required allocating another
series and then combining them with the monoid, potentially creating a
lot of garbage. The `insert` method does the same thing but without an
intermediate series.

Similarly, getting an approximate size used to require calling
`.toHLL` to build an HLL structure, and then asking that. By
abstracting out the machinery to actually do the HLL calculation,
we're able to do this directly from HLLSeries with
`approximateSizeSince`.

There are several other internal efficiency improvements. Some
independent benchmarks I've run show a 4-5x speed up in building
series' from many individual events, and a 3-4x speed up in
calculating the approximate sizes.

The data in the series and the serialization format are unchanged.
@@ -79,34 +80,65 @@ object HyperLogLog {
@inline
def twopow(i: Int): Double = java.lang.Math.pow(2.0, i)

@deprecated("this is no longer used", since = "1.12.3")
Copy link

@thomas-stripe thomas-stripe Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be nice to provide an alternative in the message here: use j(Array[Byte], Int) instead or something.

def asApprox(bits: Int, v: Double): Approximate[Long] = {
val stdev = 1.04 / scala.math.sqrt(twopow(bits))
val lowerBound = math.floor(math.max(v * (1.0 - 3 * stdev), 0.0)).toLong
val upperBound = math.ceil(v * (1.0 + 3 * stdev)).toLong

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I guess the floor and ceil should be reversed. We can't widen the bound and still claim the probability is the same. I think this is a (minor) off-by-one bug.

Copy link
Contributor Author

@non non Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to change that, but the bug seems to have been present before (I was trying to ensure identical behavior). Do you think it's worth fixing it here, or merging this PR (without differences) then making that change?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wait, but we should not forget it, I think.

@@ -45,10 +92,10 @@ case class HLLSeries(bits: Int, rows: Vector[Map[Int, Long]]) {
if (rows.isEmpty)
SparseHLL(bits, Map())
else
rows.zipWithIndex.map{
rows.iterator.zipWithIndex.map {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we want sumOption here rather than reduce(_ + _)? Or some logic that checks if there are more than k things to aggregate use sumOption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea -- I'll try it. I had something fancier here that actually hurt performance so I had reverted to what was previously happening (more or less).

Copy link

@thomas-stripe thomas-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some mostly minor comments - probably safe to ignore any of them :)

var i = 0
var sum = 0
var need = bits
while (i < bytes.length && need >= 0) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this does anything if need == 0, so you can just need > 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

var zeros = 1 // start with a single zero
while (i < bytes.length) {
while (j >= 0) {
if (((bytes(i) >>> j) & 1) == 1) return zeros.toByte

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: feel free to ignore, but I think having return on its own line is worth it here. Mainly, I saw the 0.toByte below, figured there was a return somewhere, but wasn't able to easily locate the return statement in this code.

val it = rows(i).iterator
while (it.hasNext) {
val (k, t) = it.next
if (t >= threshold && !seen(k)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perf nit: Is the !seen worth it here? If the set already contains k, then seen += k should be functionally equivalent to a containment check, with no mutation anyway. I guess the question is, is negativePowersOfTwo expected to be slower than checking for membership in a hash table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to avoid double-counting a particular k. If we remove the check then I think sum will be wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, of course, for some reason I missed that we were adding it into sum :| That said, you can still skip this check by using add, which returns false if the set already contains the element:

if (t >= threshold && seen.add(k)) {
  sum += HyperLogLog.negativePowersOfTwo(i + 1)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, nice!

@non
Copy link
Contributor Author

non commented Nov 15, 2016

Looks like I may have broken downsampling HLLs. I'll investigate more, I don't remember seeing this test fail earlier.

EDIT: Things seemed fine locally, and seem fine now. Maybe the error was transient?

@codecov-io
Copy link

codecov-io commented Nov 15, 2016

Current coverage is 64.41% (diff: 96.96%)

Merging #575 into develop will increase coverage by 0.20%

@@            develop       #575   diff @@
==========================================
  Files           111        111          
  Lines          4524       4572    +48   
  Methods        4111       4154    +43   
  Messages          0          0          
  Branches        374        379     +5   
==========================================
+ Hits           2905       2945    +40   
- Misses         1619       1627     +8   
  Partials          0          0          

Powered by Codecov. Last update 9dd8f68...c48078d

@non
Copy link
Contributor Author

non commented Nov 15, 2016

I think I've responded to all the review comments. Please let me know if there's anything else you'd like to see.

Copy link

@oscar-stripe oscar-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, but 👍 when we fix those typos.

def asApprox(bits: Int, v: Double): Approximate[Long] = {
val stdev = 1.04 / scala.math.sqrt(twopow(bits))
val lowerBound = math.floor(math.max(v * (1.0 - 3 * stdev), 0.0)).toLong
val upperBound = math.ceil(v * (1.0 + 3 * stdev)).toLong

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wait, but we should not forget it, I think.


/** A super lightweight (hopefully) version of BitSet */
@deprecated("This is no longer used.", since = "1.12.3")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be 0.12.3

def twopow(i: Int): Double = java.lang.Math.pow(2.0, i)
def twopow(i: Int): Double = Math.pow(2.0, i)

@deprecated("This is no longer used. Use j(Array[Byte], Int) instead.", since = "1.12.3")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.12.3

val it = (0 until limit).iterator
val h = monoid.sum(it.map(i => monoid.create(int2Bytes(i), i)))
val n = h.since(0L).toHLL.approximateSize.estimate
val delta = (limit * 0.2).toInt

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is the 20% coming from? Can you add a comment?

@johnynek johnynek merged commit b48b559 into twitter:develop Nov 17, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants