Add trailing window aggregator #649

cdg-stripe · 2018-01-26T01:06:17Z

Description

Add a WindowMonoid utility for aggregating over a finite trailing window. The Window case class stores its elements in a queue. The monoid combines windows while trimming their elements to the window size. When the element type also has a group, aggregation is more efficient: elements pushed out of the queue upon aggregation are subtracted from the total.
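
A minimal sketch of the group optimization described above, assuming Int elements (where subtraction plays the role of the group inverse). The names IntWindow, single and plus are hypothetical and chosen for illustration; this is not the code in the PR.

```scala
import scala.collection.immutable.Queue

object GroupWindowSketch {
  // Hypothetical stand-in for Window[Int]: a running total plus the items behind it.
  final case class IntWindow(total: Int, items: Queue[Int])

  // Wrap a single value in a window of one item.
  def single(v: Int): IntWindow = IntWindow(v, Queue(v))

  // Combine two windows, keeping only the last `windowSize` items.
  // Because Int supports subtraction, each evicted item is subtracted from
  // the total instead of re-summing the surviving queue.
  def plus(windowSize: Int)(a: IntWindow, b: IntWindow): IntWindow = {
    var total = a.total + b.total
    var q = a.items ++ b.items
    while (q.size > windowSize) {
      val (evicted, rest) = q.dequeue
      total -= evicted
      q = rest
    }
    IntWindow(total, q)
  }
}
```

With a window size of 2, combining single(1), single(2) and single(3) leaves the items 2 and 3 with a total of 5.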

Testing

Unit tests

r? @johnynek

CLAassistant · 2018-01-26T01:06:23Z

All committers have signed the CLA.

johnynek

This is great.

Look in algebird-core/src/test/... to find the tests for this module. We want to make sure it passes the monoidLaws. You can search for monoidLaw for some examples.

After we add test coverage we should merge this.

johnynek · 2018-01-26T16:19:20Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+case class WindowMonoid[T](
+    windowSize: Int
+)(
+    implicit m: Monoid[T],


Let’s not use this m. You can get the zero from the p by case matching.

MansurAshraf · 2018-01-26T17:08:25Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+ * @param q (optional) queue to override default q containing input "total"
+ */
+
+case class Window[T](total: T, q: Queue[T] = Queue()) {


This looks great! What do you think about allowing the user to pass a PriorityQueue instead of just a Queue? (AFAIK Scala's PriorityQueue doesn't extend Queue.) The way the code is written right now, it implies that I need a commutative monoid to guarantee deterministic behavior, or that I need to control the order in which items are inserted into the window. The windows I work with usually have some sort of ordering defined (for example, the last 3 distinct IPs used by a user in the last hour). It would be nice to let users preserve that ordering somehow, IMO.

Actually, I think Window[T](total: T, items: Queue[T]) is better. We don't really want people to do something like Window(100), do we? That will have no items in the queue and, I imagine, break the laws.

If you look at items, it fills in the queue if none was provided, i.e. it is meant for initialization. The intent is to have a simple interface where you can wrap a window around a single object and sum in a sensible way:

implicit def wm[T] = WindowMonoid[T](2)
Window(1) + Window(2) + Window(3) == Window(5)

Needing to specify the queue would make this redundant

Window(1, Queue(1)) + ...

Tbh I'm not a big fan of even having to expose the queue, since you could change the implementation in the future (e.g. a PriorityQueue). A more general approach might use a PushPopper interface, but that's probably overkill.

MansurAshraf · 2018-01-26T17:12:04Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+    if (b.items.size >= windowSize) {
+      var total: T = b.total
+      var q = b.items
+      while (q.size > windowSize) {


Not sure if it's possible in this implementation, but it would be nice to generalize how elements are evicted from the window. Right now it seems I can only do fixed-size windows, but I would really like to do time-based rolling windows as well. For example: a rolling window of 1 hour, where elements are evicted based on time instead of size.

My thought was to eventually have two classes/monoids, based on whether we want aggregations over fixed items or fixed time.

// Monoid defines aggregation for fixed window length, e.g. last 28 elements. Queue fixed to 28.
case class Window[T](total: T, q: Queue[T])
// Monoid defines aggregation for fixed time, e.g. last 28 days. Queue can grow arbitrarily.
case class WindowWithTs[T](latestTime: Double, total: T, q: Queue[(Double, T)])

Does that seem sensible?
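
A rough sketch of how time-based eviction might look, under stated assumptions: elements are Long sums, timestamps are Doubles in arbitrary units, and the right-hand window never holds items older than the left-hand one. TimedWindow, single and plus are hypothetical names; nothing like this is implemented in this PR.

```scala
import scala.collection.immutable.Queue

object TimeWindowSketch {
  // Hypothetical time-based window: the latest timestamp seen, the running
  // total, and the (timestamp, value) pairs still inside the time span.
  final case class TimedWindow(latestTime: Double, total: Long, items: Queue[(Double, Long)])

  def single(time: Double, v: Long): TimedWindow =
    TimedWindow(time, v, Queue((time, v)))

  // Combine two windows (assumes b's items are not older than a's), then
  // evict every item whose timestamp is at or before `latest - span`.
  def plus(span: Double)(a: TimedWindow, b: TimedWindow): TimedWindow = {
    val latest = math.max(a.latestTime, b.latestTime)
    var total = a.total + b.total
    var q = a.items ++ b.items
    while (q.nonEmpty && q.head._1 <= latest - span) {
      val ((_, v), rest) = q.dequeue
      total -= v
      q = rest
    }
    TimedWindow(latest, total, q)
  }
}
```

Unlike the fixed-size case, the queue here can grow without bound when many items arrive within one span, which matches the "queue can grow arbitrarily" note above.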

That seems reasonable, but I am still not sure why we can't use PriorityQueue instead of Queue. As a user I would like to control the ordering, and if I pass you a PriorityQueue, I can do that by specifying my own comparator.

johnynek

in addition to the monoidLaws, I think we want something like:

forAll { (ts0: List[Int], n: Int) =>
  val ts = ts0.takeRight(n)
  val mon = new WindowMonoid(n)
  assert(mon.sum(ts.map(Window(_))).total == ts.sum)
}

That's the behavior you want, right?
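
For readers without ScalaCheck at hand, the same property can be sketched with plain random inputs. This is an illustration only: Win is a hypothetical stand-in for Window[Int], plus uses the monoid-only strategy (re-summing after the trim), and window sizes are restricted to positive values.

```scala
import scala.collection.immutable.Queue
import scala.util.Random

object WindowPropertySketch {
  // Hypothetical stand-in for Window[Int].
  final case class Win(total: Int, items: Queue[Int])

  // Monoid-only path: no subtraction available, so re-sum the trimmed queue.
  def plus(n: Int)(a: Win, b: Win): Win = {
    val items = (a.items ++ b.items).takeRight(n)
    Win(items.sum, items)
  }

  // Sum singleton windows left to right, starting from the empty window.
  def sumAll(n: Int, ts: List[Int]): Win =
    ts.foldLeft(Win(0, Queue.empty[Int]))((acc, t) => plus(n)(acc, Win(t, Queue(t))))

  // Check on `trials` random inputs that the window total equals the sum
  // of the last n elements.
  def holds(trials: Int, seed: Long): Boolean = {
    val rnd = new Random(seed)
    (1 to trials).forall { _ =>
      val ts = List.fill(rnd.nextInt(20))(rnd.nextInt(100))
      val n = 1 + rnd.nextInt(10)
      sumAll(n, ts).total == ts.takeRight(n).sum
    }
  }
}
```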

johnynek · 2018-01-26T18:47:02Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+ * @param q (optional) queue to override default q containing input "total"
+ */
+
+case class Window[T](total: T, q: Queue[T] = Queue()) {


Actually, I think Window[T](total: T, items: Queue[T]) is better. We don't really want people to do something like Window(100), do we? That will have no items in the queue and, I imagine, break the laws.

cdg-stripe · 2018-01-26T23:24:15Z

@MansurAshraf: I'm trying to grasp whether we need a fully general implementation up front, where the user inputs their own queue/priority. As mentioned above my thoughts are to have trailing windows based on one of:

Fixed number of elements
Fixed time interval (not implemented here, but it can be in a separate PR)

The "last N ip-addresses" problem is interesting. But rather than making this implementation more flexible/abstract, it might make sense to have a LastNDistinct monoid. I suspect this approach will be easier to test and implement.
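
One possible shape for such a monoid, as a sketch only: LastNDistinct does not exist in this PR, and the names LastN and plus below are hypothetical. The idea is to concatenate, keep each item's most recent occurrence, and trim to the newest n.

```scala
object LastNDistinctSketch {
  // Most recent distinct items, oldest first; holds at most n of them.
  final case class LastN[A](items: Vector[A])

  def plus[A](n: Int)(a: LastN[A], b: LastN[A]): LastN[A] = {
    // Reverse first so that `distinct` keeps the most recent occurrence
    // of each item, then keep the newest n and restore oldest-first order.
    val newestFirst = (a.items ++ b.items).reverse.distinct.take(n)
    LastN(newestFirst.reverse)
  }
}
```

The empty LastN acts as the identity, and the right-hand argument is treated as the more recent one, so the operation is order-sensitive (not commutative).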

johnynek · 2018-01-31T01:49:57Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+}
+
+/*
+  Example usage:


can you make this a scaladoc comment on the case class?

johnynek · 2018-01-31T01:53:37Z

algebird-test/src/test/scala/com/twitter/algebird/WindowLawsTest.scala

+      val ts = ts0.takeRight(n)
+      val mon = new WindowMonoid(n)
+      assert(mon.sum(ts0.map( Window(_) )).total == ts.sum)
+    }


can we have a second one:

forAll { (ts0: List[Int], ts1: List[Int], n: Int) =>
  val expected = Queue((ts0 ::: ts1).takeRight(n): _*)
  val mon = new WindowMonoid(n)
  val got = mon.plus(Window(ts0.sum, Queue(ts0: _*)), Window(ts1.sum, Queue(ts1: _*)))
  // compare the queue contents, since got is a Window and expected is a Queue
  assert(got.items == expected)
}

johnynek · 2018-01-31T01:54:15Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+/**
+ * Provides a natural monoid for combining windows truncated to some window size.
+ *
+ * @param windowSize


can you add more comment to this javaDoc param?

johnynek · 2018-01-31T01:58:48Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+      case Priority.Preferred(g) => Window(g.zero)
+      case Priority.Fallback(m)  => Window(m.zero)
+    }
+


what about a

def fromIterable[T](ts: Iterable[T]): Window[T] = {
  val monT: Monoid[T] = p.join
  val right = ts.toList.takeRight(windowSize)
  val total = monT.sum(right)
  Window(total, Queue.empty[T] ++ right)
}

@johnynek: I did this, and generalized to Traversable.

johnynek · 2018-01-31T02:08:28Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+      case Priority.Preferred(g) => plusG(a, b)(g)
+      case Priority.Fallback(m)  => plusM(a, b)(m)
+    }
+


I think you can optimize sumOption somewhat here: you can take blocks of data and ignore the left if the right is full. PS: Scalding uses this if it exists, so overriding sumOption on Semigroup (or any subclass) can be a very nice performance win on map/reduce.

def sumOption(ws: TraversableOnce[Window[T]]): Option[Window[T]] =
  if (ws.isEmpty) None
  else {
    val it = ws.toIterator
    var size = 0
    var queue = Queue.empty[Window[T]]
    while (it.hasNext) {
      val n = it.next
      queue = queue :+ n
      size += n.size
      val tailSize = size - queue.head.size
      if (tailSize >= windowSize) {
        queue = queue.tail
        size = tailSize
      }
    }
    // now we only have to merge the queue:
    Some(queue.tail.foldLeft(queue.head)(plus(_, _)))
  }

I think this can be a pretty big win.

@johnynek: Overwrote sumOption below.
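
The block-dropping idea can be sketched in a self-contained form. The assumptions are loud ones: Win is a hypothetical stand-in for Window[Int], plus re-sums after trimming, and sumOption takes the window size as an explicit parameter. This is not the version that landed in the PR.

```scala
import scala.collection.immutable.Queue

object SumOptionSketch {
  // Hypothetical stand-in for Window[Int].
  final case class Win(total: Int, items: Queue[Int])

  def plus(n: Int)(a: Win, b: Win): Win = {
    val items = (a.items ++ b.items).takeRight(n)
    Win(items.sum, items)
  }

  // Buffer incoming windows and drop the oldest buffered window whenever the
  // windows after it already hold at least n items, since it can no longer
  // contribute to the result. Only the survivors are merged with plus.
  def sumOption(n: Int)(ws: Iterator[Win]): Option[Win] =
    if (!ws.hasNext) None
    else {
      var pending = Queue.empty[Win]
      var size = 0
      while (ws.hasNext) {
        val w = ws.next()
        pending = pending :+ w
        size += w.items.size
        while (pending.size > 1 && size - pending.head.items.size >= n) {
          size -= pending.head.items.size
          pending = pending.tail
        }
      }
      Some(pending.tail.foldLeft(pending.head)((acc, w) => plus(n)(acc, w)))
    }
}
```

The saving comes from skipping plus entirely for windows that would be trimmed away anyway; only the buffered tail is ever merged.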

johnynek · 2018-01-31T02:16:18Z

This PR, #650, seems to fix the Ruby issue on Travis.

johnynek · 2018-01-31T02:16:49Z

(nothing like non-hermetic builds to make you feel real good, all day long).

johnynek · 2018-02-01T07:04:29Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+    if(ws.isEmpty) None
+    else {
+      val it = ws.toIterator
+      var queue = Queue[T]()


.empty is more efficient since you avoid allocating the varargs wrapper.

johnynek · 2018-02-01T07:06:03Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+      while (it.hasNext) {
+        queue = (queue ++ it.next.items).takeRight(windowSize)
+      }
+      Some(fromTraversable(queue))


Using fromTraversable here ignores that you already have a queue. I think summing the queue and reusing it will be a significant performance improvement.

johnynek · 2018-02-01T07:07:59Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+
+object Window {
+  def apply[T](v: T): Window[T] = Window[T](v, Queue[T](v))
+  def from[T](ts: Traversable[T])(implicit m: WindowMonoid[T]) = m.fromTraversable(ts)


Can we use Iterable? Scala 2.13 is changing the collections to avoid Traversable: https://www.scala-lang.org/blog/2017/02/28/collections-rework.html#traversable-and-iterable

codecov-io · 2018-02-01T23:30:54Z

Codecov Report

Merging #649 into develop will decrease coverage by 0.02%.
The diff coverage is 73.91%.

@@             Coverage Diff             @@
##           develop     #649      +/-   ##
===========================================
- Coverage    82.83%   82.81%   -0.03%     
===========================================
  Files          108      109       +1     
  Lines         5163     5209      +46     
  Branches       314      316       +2     
===========================================
+ Hits          4277     4314      +37     
- Misses         886      895       +9

Impacted Files	Coverage Δ
...e/src/main/scala/com/twitter/algebird/Window.scala	`73.91% <73.91%> (ø)`
.../main/scala/com/twitter/algebird/Successible.scala	`87.5% <0%> (-8.34%)`	⬇️
.../main/scala/com/twitter/algebird/BloomFilter.scala	`94.84% <0%> (-0.43%)`	⬇️
...c/main/scala/com/twitter/algebird/MapAlgebra.scala	`77.88% <0%> (+0.96%)`	⬆️
...src/main/scala/com/twitter/algebird/Interval.scala	`79.13% <0%> (+1.73%)`	⬆️
...src/main/scala/com/twitter/algebird/Priority.scala	`25% <0%> (+25%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2c79ed9...5321cfa.

cdg-stripe · 2018-02-02T17:24:32Z

@johnynek:

Added more tests. Caught that the zero element should have an empty queue, where originally it was Queue[T](T.zero). Also made the monoid on group elements more efficient. I dug around to see if there's anything else worth tweaking, but I think it's in a decent spot.

Also, what are your thoughts about Scala 2.13, when we can have a signature Window[Size, T]? It might be okay to have Window[T] and WindowN[Size, T] case classes.

johnynek · 2018-02-04T20:59:15Z

Yeah, I think in Scala 2.13 we will have to revisit a lot of stuff, but the compatibility story is open since we will need to support 2.12 for a while, I guess (we aren't even on 2.12 yet at Stripe).

johnynek

Two comments, but let's address them in a follow up PR.

Thanks for this!

johnynek · 2018-02-04T20:59:58Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+
+  require(windowSize >= 1, "Windows must have positive sizes")
+
+  def zero = p.fold(g => Window[T](g.zero, Queue.empty[T]))(m => Window(m.zero, Queue.empty[T]))


can we make this a val so we don't have to do the computation and allocation again on each call?

johnynek · 2018-02-04T21:02:36Z

algebird-core/src/main/scala/com/twitter/algebird/Window.scala

+  def zero = p.fold(g => Window[T](g.zero, Queue.empty[T]))(m => Window(m.zero, Queue.empty[T]))
+
+  def plus(a: Window[T], b: Window[T]): Window[T] =
+    p.fold(g => plusG(a, b)(g))(m => plusM(a, b)(m))


For performance (to avoid calling fold on each plus call), we are probably better off with two subclasses of WindowMonoid, which could become an abstract class. We can decide once, at allocation, which path to choose. Then the JIT should be able to inline both instances. Sadly, the JVM can't optimize the above very well.
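
A sketch of what the two-subclass layout could look like, with Int standing in for T so that the example is self-contained. WinMonoid, ViaGroup and ViaMonoid are hypothetical names, and the real follow-up may differ.

```scala
import scala.collection.immutable.Queue

object WindowMonoidSubclassSketch {
  // Hypothetical stand-in for Window[Int].
  final case class Win(total: Int, items: Queue[Int])

  // The strategy is fixed when the instance is allocated, so plus does not
  // branch on which algebraic structure is available on every call.
  abstract class WinMonoid(val windowSize: Int) {
    require(windowSize >= 1, "Windows must have positive sizes")
    // A val, so the empty window is allocated once per monoid instance.
    val zero: Win = Win(0, Queue.empty[Int])
    def plus(a: Win, b: Win): Win
  }

  // Group path: subtract the evicted items from the running total.
  final class ViaGroup(n: Int) extends WinMonoid(n) {
    def plus(a: Win, b: Win): Win = {
      val all = a.items ++ b.items
      val evicted = all.dropRight(n)
      Win(a.total + b.total - evicted.sum, all.takeRight(n))
    }
  }

  // Monoid path: re-sum whatever survives the trim.
  final class ViaMonoid(n: Int) extends WinMonoid(n) {
    def plus(a: Win, b: Win): Win = {
      val kept = (a.items ++ b.items).takeRight(n)
      Win(kept.sum, kept)
    }
  }
}
```

Both subclasses should agree on every input; they differ only in how the total is maintained.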

johnynek · 2018-02-22T19:18:39Z

published in 0.13.4:

https://github.com/twitter/algebird/releases/tag/v0.13.4

johnynek · 2018-02-22T19:47:06Z

@MansurAshraf note if you want the most recent K things, TopKMonoid should work:

https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/TopKMonoid.scala

(so would priorityqueue I guess).

johnynek reviewed Jan 26, 2018

MansurAshraf reviewed Jan 26, 2018

johnynek reviewed Jan 26, 2018

Add trailing window aggregator

0d0103e

johnynek requested changes Jan 31, 2018

cdg-stripe added 2 commits January 31, 2018 21:32

Create Window from traversable

a3b85cc

Override sumOption

622fddf

johnynek reviewed Feb 1, 2018

cdg-stripe added 11 commits February 1, 2018 09:55

Extend tests

afd2f07

Improve comments

1a1b82c

Make addition on groups more efficient

387d875

Cleanup plusM

7ea0e79

Line indentation

096ba21

Import operators

9203860

PR comments

6d5a948

Fold instead of match

d69f141

Fix tests

1277bd6

Require positive sizes

16aab4c

zero has a queue with no elements

5321cfa

johnynek reviewed Feb 4, 2018

johnynek approved these changes Feb 4, 2018

johnynek merged commit 03ca640 into twitter:develop Feb 4, 2018
