
Implement an appendMonoid Aggregator factory which yields aggregators… #501

Merged: 5 commits, Dec 4, 2015

Conversation

erikerlandson (Contributor)

… that can take advantage of an efficient append method for faster aggregations.

This is a PR in response to:
http://erikerlandson.github.io/blog/2015/11/24/the-prepare-operation-considered-harmful-in-algebird/

The rationale for a solution based on a factory function is that the append method and prepare method need to remain logically consistent with each other. The factory function enforces that logical constraint, and in addition allows a few other efficient overrides.

Working through this issue has left me still feeling like an Aggregator class explicitly based on a Monoid, augmented with append, is desirable. The main problem is that it would not be backward compatible with the current design that requires a definition of prepare.

Related but tangential, I'm still feeling like a Monoid[T] subclass AppendMonoid[T, E], having zero, plus and append, is a possibly useful thing. The type laws for such an object would include all the Monoid type laws, and also probably something like:

mon.append(t, e) == mon.plus(t, mon.append(mon.zero, e))
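For concreteness, a minimal sketch of what such an AppendMonoid might look like (the trait and the Set-based instance below are hypothetical illustrations of the idea, not algebird code):

```scala
// Hypothetical sketch of the proposed AppendMonoid[T, E]; not algebird API.
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

trait AppendMonoid[T, E] extends Monoid[T] {
  // incorporate a raw element E into an accumulator T
  def append(t: T, e: E): T
}

// Example instance: accumulating Ints into a Set[Int]
object SetAppendMonoid extends AppendMonoid[Set[Int], Int] {
  def zero: Set[Int] = Set.empty[Int]
  def plus(l: Set[Int], r: Set[Int]): Set[Int] = l ++ r
  def append(t: Set[Int], e: Int): Set[Int] = t + e
}
```

The proposed law holds for this instance, since `t + e` is the same set as `t ++ Set(e)`.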

… that can take advantage of an efficient append method for faster aggregations
/**
 * @param appnd The append function
 * @param pres The presentation function
 * @param m The [[Monoid]] type class
 */
def appendMonoid[F, T, P](appnd: (T, F) => T, pres: T => P)(implicit m: Monoid[T]): MonoidAggregator[F, T, P] =
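A simplified, self-contained sketch of the factory idea (this is not the PR's actual implementation; the SimpleAggregator type and the derivation of prepare from appnd are illustrative stand-ins for algebird's MonoidAggregator):

```scala
// Illustrative only: a pared-down aggregator type standing in for
// algebird's MonoidAggregator, to show how a factory can derive
// prepare from appnd so the two stay logically consistent.
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

case class SimpleAggregator[F, T, P](
    prepare: F => T,
    monoid: Monoid[T],
    present: T => P) {
  def apply(in: Seq[F]): P =
    present(in.foldLeft(monoid.zero)((t, f) => monoid.plus(t, prepare(f))))
}

def appendMonoid[F, T, P](appnd: (T, F) => T, pres: T => P)(
    implicit m: Monoid[T]): SimpleAggregator[F, T, P] =
  SimpleAggregator(f => appnd(m.zero, f), m, pres)
```

Because prepare is defined as `appnd(m.zero, _)`, the two operations cannot drift out of sync, which is the rationale given above for the factory-function design.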
Collaborator:

so, appnd must satisfy the law:

appnd(appnd(Monoid.zero, f1), f2) == Monoid.plus(appnd(Monoid.zero, f1), appnd(Monoid.zero, f2))

Otherwise prepare doesn't mean what we expect. If you agree, we should add this law to the comment.

This addresses one performance issue, which may be a very important one, but the case of dealing with a bulk set of items (Monoid.sum/Semigroup.sumOption) is also an important optimization.

Since we expose the monoid/semigroup of an Aggregator, scalding and spark use that to get the sumOption speedup. In those cases they will not get the benefit of an optimized append, and will in fact call prepare, which may be even slower. So I think the claim of "faster aggregation" is more complex, and we need to point out that this could be slower than the standard approach depending on the Monoid and appnd function.
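The proposed law can be spot-checked for a concrete appnd; here is a minimal sketch with a hand-rolled Set monoid (these are stand-ins for the example, not algebird's Monoid):

```scala
// Stand-in monoid for checking the law; not algebird's Monoid.
object SetMonoid {
  val zero: Set[Int] = Set.empty[Int]
  def plus(l: Set[Int], r: Set[Int]): Set[Int] = l ++ r
}

val appnd: (Set[Int], Int) => Set[Int] = (t, f) => t + f

// the law: appnd(appnd(zero, f1), f2) ==
//          plus(appnd(zero, f1), appnd(zero, f2))
def lawHolds(f1: Int, f2: Int): Boolean =
  appnd(appnd(SetMonoid.zero, f1), f2) ==
    SetMonoid.plus(appnd(SetMonoid.zero, f1), appnd(SetMonoid.zero, f2))
```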

Contributor (Author):

I may be missing some angle here, but that seems like an opportunity for Scalding/Spark to take better advantage of traversableOnce.aggregate(...) over individual partitions.

Contributor:

@erikerlandson I think the question is whether the append() optimization is more effective than whatever optimization is in sumOption. They're both about eliminating unnecessary intermediate values of the type we have a Monoid on, but in different ways - it's sorta a question of, for each reduce(left: T, right: T), whether you need the left side to actually be a T (which sumOption eliminates) or the right side to actually be a T (which append eliminates). You could imagine an appendAll which requires neither to be, but that feels even more invasive for the Aggregator to eliminate vs. the AppendMonoid like you proposed.

Contributor:

To put it another way, if you ultimately think of this like a def foldLeft[B](z: B)(f: (B, A) => B): B on a Seq[A], and you have some Monoid[T], the question is do we have B == T (a constraint that sumOption gets rid of) or do we have A == T (a constraint that append gets rid of), or both, or neither, and what's the most efficient space to do this aggregation in.
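That framing can be sketched directly (illustrative helper functions, not algebird code): the append-style fold keeps B == T and never builds a T per element, while a sumOption-style fold keeps A == T but is free to use a non-T accumulator internally.

```scala
// append-style: accumulator is the monoid type T, elements stay raw (A free)
def aggregateAppend[T, A](zero: T, items: Seq[A])(append: (T, A) => T): T =
  items.foldLeft(zero)(append)

// sumOption-style: elements are already T (here Set[A]), but the
// accumulator B is a private mutable buffer rather than a T
def aggregateBuffered[A](items: Seq[Set[A]]): Set[A] = {
  val buf = scala.collection.mutable.Set.empty[A]
  items.foreach(buf ++= _)
  buf.toSet
}
```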

@erikerlandson (Contributor, Author)

The rub seems to be that there are some possibly-incompatible cases. If you have a zero, and if append is efficient, then it is advantageous to use Scala's aggregate formulation. But that may not be true, and assuming it is can make things worse.

My desire to always have my cake and eat it too makes me wonder if there is any advantage to adding some method like aggregateIsFast which defaults to false, and which can be tested by various methods (and in spark and scalding, hopefully) to be smart about how to do aggregations depending on the characteristics of the objects in question.

@erikerlandson (Contributor, Author)

Last night I spent some time trying to work through what I'm "really" after, and how that relates to algebraic concerns and algebird's type system.

Firstly, I've been coming at this from the perspective that I want a principled algebraic way to capture the functionality of Scala's traversableOnce.aggregate(z)(seqop, combop), while leveraging as much efficiency as possible. Starting from that goal, in no particular order:

  1. Right off the bat, algebird's Aggregator is not exactly that, although it can clearly embody that functionality. Aggregator is a type for representing a map-reduce computation, and so (with respect to my particular goals) it comes with some extra baggage. My unconscious tendency of equating it with Scala sequence aggregation is misleading.
  2. Scala's aggregate requires an identity element. If I have a Monoid to work with, then all is well. I also have to provide seqop (aka append), but that does not appear to be the limiting factor in any cases that I've ever seen.
  3. However, it is also desirable to support Semigroups, and so in that case aggregate has to be replaced with reduceLeft, and furthermore there is now a need for some way to instantiate Semigroup objects from data elements. In algebird Aggregator, the prepare (aka "map") phase is drafted into this task.
  4. So we have two not-directly-compatible cases of enhancing "normal" algebraic objects: Monoid + append and Semigroup + prepare
  5. With Algebird Aggregator, the first case was made a subclass of the second, I presume by analogy to the fact that Monoid is a subclass of Semigroup.
  6. However, I'm now wondering if they should actually be sibling classes that inherit from an abstract trait that expresses the idea of "things that can aggregate data elements":
trait ElementAggregator[A, E] {
  def aggregate(data: TraversableOnce[E]): A
}

trait MonoidElementAggregator[M, E] extends ElementAggregator[M, E] {
  def monoid: M
  def append(m: M, e: E): M
}

trait SemigroupElementAggregator[S, E] extends ElementAggregator[S, E] {
  def semigroup: S
  def construct(e: E): S
}
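To make the proposed split concrete, here is a hypothetical instance of the Monoid-backed flavor (simplified to Seq rather than TraversableOnce; none of this is algebird code):

```scala
// Simplified version of the sibling-trait sketch above, with one
// concrete Monoid-backed instance for Set[Int].
trait ElementAggregator[A, E] {
  def aggregate(data: Seq[E]): A
}

object SetMonoidAggregator extends ElementAggregator[Set[Int], Int] {
  val zero: Set[Int] = Set.empty[Int]
  def append(m: Set[Int], e: Int): Set[Int] = m + e
  // having a zero means empty input is handled without a prepare/construct
  def aggregate(data: Seq[Int]): Set[Int] = data.foldLeft(zero)(append)
}
```

The Semigroup flavor cannot fold from a zero; it must construct its first value from the first element, which is exactly the prepare/construct cost discussed below.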

@erikerlandson (Contributor, Author)

From my original blog post, the main cost with prepare/construct for working with the Semigroup case is actually instantiating Semigroup objects for every single data element (in the cases where that is nontrivial). And it isn't avoidable. Working with Monoids versus Semigroups comes with nontrivial advantages, from that point of view. I bring this up because some of Algebird's aggregating objects like CountMinSketch are semigroups. Based on discussions from #495, there was a desire to keep free parameters out of the identity element, which led me to discard the identity element and make t-digest a semigroup. I'm now wondering if that's the best solution.

@avibryant (Contributor)

I'm not sure I understand why "it isn't avoidable". That only seems to be true if you want to treat every element of your input identically. But why is that necessary? Can't you do the equivalent of list.tail.foldLeft(prepare(list.head))(append) ?

@erikerlandson (Contributor, Author)

@avibryant good point. Using aggregate can still be substantially faster than foldLeft, but it does require specifically invoking some parallelism:

scala>  val data = Vector.fill(1000000) { scala.util.Random.nextInt(100) }
data: scala.collection.immutable.Vector[Int] = Vector(73, 26, 81, 92, 90, 73, ...

scala> val zero = Set.empty[Int]
zero: scala.collection.immutable.Set[Int] = Set()

scala> val plus = (s1: Set[Int], s2: Set[Int]) => s1 ++ s2
plus: (Set[Int], Set[Int]) => scala.collection.immutable.Set[Int] = <function2>

scala> val append = (s: Set[Int], e: Int) => s + e
append: (Set[Int], Int) => scala.collection.immutable.Set[Int] = <function2>

scala> val construct = (e: Int) => Set(e)
construct: Int => scala.collection.immutable.Set[Int] = <function1>

// This is still the worst:
scala> benchmark(10) { data.map(construct).reduceLeft(plus) }
res11: Double = 0.4000747779

// Using one construction followed by foldLeft is quite a bit faster
scala> benchmark(10) { data.tail.foldLeft(construct(data.head))(append) }
res12: Double = 0.040250169200000005

// without parallelism, aggregate performs same as foldLeft above
scala> benchmark(10) { data.aggregate(zero)(append, plus) }
res13: Double = 0.038395296800000006

// with parallelism, aggregate will beat foldLeft:
scala> benchmark(10) { data.par.aggregate(zero)(append, plus) }
res14: Double = 0.011873102699999999

scala> 
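For reference, the benchmark helper used in the session above is not defined in the thread; a minimal reconstruction consistent with its usage (mean wall-clock seconds over a number of trials; the definition is an assumption) could be:

```scala
// Hypothetical reconstruction of the REPL session's `benchmark` helper:
// run `body` `trials` times and return the mean elapsed seconds.
def benchmark[A](trials: Int)(body: => A): Double = {
  val times = (1 to trials).map { _ =>
    val t0 = System.nanoTime()
    body
    (System.nanoTime() - t0) / 1e9
  }
  times.sum / trials
}
```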

@avibryant (Contributor)

Interesting. I don't have time to do this right now, but it seems like you could add a scala.collection.parallel.Task similar to Aggregate but that works with semigroups? The Aggregate implementation, for reference, is here: https://github.com/scala/scala/blob/v2.11.7/src/library/scala/collection/parallel/ParIterableLike.scala#L1005

@erikerlandson (Contributor, Author)

Seeing that the foldLeft formulation of aggregation -- data.tail.foldLeft(construct(data.head))(append) -- is as fast as the aggregate version (and can work without a monoid identity element) made me realize that the AlgebirdRDD aggregate method could be modified to use it and achieve the corresponding speed increase in cases with an appropriately optimized appendAll method. And it will not suffer any speed decrease in the default case. I updated the branch with those improvements.

@avibryant (Contributor)

This looks good. As a slightly separate thing (but that might be worth including in this PR anyway), I think we should have the default appendAll make use of sumOption; that would make it likely that the AlgebirdRDD was making use of an optimized method wherever possible.

@erikerlandson (Contributor, Author)

@avibryant I think appendAll already uses sumOption - it calls reduce, but that defaults to a thin wrapper around sumOption:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L194

@avibryant (Contributor)

Aha, you're right, I hadn't realized that. Thanks.

This LGTM, then.

def aggregateOption[B: ClassTag, C](agg: Aggregator[T, B, C]): Option[C] = {
val pr = rdd.mapPartitions({ data =>
if (data.isEmpty) Iterator.empty else {
val sg = agg.prepare(data.next)
Collaborator:

can we rename sg to be b here? It is of type B, right? sg sounds like semigroup to me.

Contributor (Author):

good point, I habitually confuse types with their type classes

appendSemigroup(prep, appnd, identity[T]_)(sg)

/**
* Obtain an [[Aggregator]] that uses an efficient append operation for faster aggregation
Collaborator:

I'm still not crazy about this documentation leaving out that this is disabling the sumOption optimization, that would have previously been there.

Have you tried a benchmark using a Map[K, V] value or HyperLogLog? Those two have been optimized a fair bit with sumOption (as have many others). In some cases the savings of doing your optimization here will be greater, but it won't necessarily be faster.

In the HLL case, you might imagine that this would be a win, but there is an optimized prepare that allocates a sparse object, and then sumOption uses a mutable Array. Here, with append, you could skip the allocation of the sparse HLL object, but you are still going to reallocate a new immutable HLL on each plus operation, and my guess is that will be significantly slower.

We have the benchmarks subproject here, which we try to use when submitting performance optimizations. If nothing else, we need to add a comment so this does not confuse new users further (basically, when performance is a concern, they have to benchmark).

@avibryant (Contributor)

This has addressed @johnynek's comments and LGTM, I'm going to merge. Thanks @erikerlandson !

avibryant added a commit that referenced this pull request Dec 4, 2015
Implement an appendMonoid Aggregator factory which yields aggregators…
@avibryant avibryant merged commit 938798e into twitter:develop Dec 4, 2015