
Make TypedPipe Optimizations configurable #1738

Merged (3 commits) on Oct 21, 2017
Conversation

@johnynek (Collaborator) commented Oct 9, 2017

closes #1735

This makes it possible to control what optimizations are applied. This allows us to improve the tests since we can optimize only parts of the graph.
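A minimal sketch of what "configurable optimizations" could look like from a user's point of view (the config key, Rule names, and fallback behavior here are invented for illustration, not the actual scalding API):

```scala
// Hypothetical sketch: a Config entry selects which optimization rules run.
// The key name "scalding.optimization.phases" and the Rule values are
// assumptions, not scalding's real identifiers.
sealed trait Rule
case object ComposeMap extends Rule
case object DeferMerge extends Rule

val defaultRules: Seq[Rule] = Seq(ComposeMap, DeferMerge)

def rulesFrom(config: Map[String, String]): Seq[Rule] =
  config.get("scalding.optimization.phases") match {
    case None         => defaultRules // nothing configured: use the defaults
    case Some("none") => Seq.empty    // disable all optimizations, e.g. for tests
    case Some(cls)    => sys.error(s"would reflectively instantiate $cls")
  }
```

Being able to hand an empty rule set to the planner is what lets tests optimize only parts of the graph, as the description mentions.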

@ianoc (Collaborator) commented Oct 9, 2017

Are there any examples of using this with sets of optimization phases? (I didn't see one other than empty.)

But I guess one API question here is: should we point to an optimization phase as a single class like this, or should it be a list of rules? (That would require rules, or rule builders, that are reflectively instantiated -- but ultimately it would let us flip rules on and off per job, possibly from the command line.)

I also presume each rule in the seq here is itself a combined set of all the rules for that phase? Playing devil's advocate against my own suggestion: it might be too verbose, in that for full configurability we would want a spec something like

scalding.rules=[[com.twit...,com.twitt],[com.twit,...]]

or similar to express the rules in each phase. That might be pretty powerful for customization -- but possibly overkill.

(I've no super strong preference here, but this change is more API-impacting than a lot of the previous ones, since it adds to Config.)
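For concreteness, the list-of-lists spec floated above could be parsed like the following sketch (the key name, bracket format, and parser are all hypothetical, not anything scalding implements):

```scala
// Hypothetical parser for a spec like scalding.rules=[[a,b],[c]]:
// one inner list of rule class names per optimization phase.
def parsePhases(spec: String): List[List[String]] =
  spec.stripPrefix("[").stripSuffix("]")
    .split("""\],\[""").toList            // split phase groups on "],[""
    .map(_.stripPrefix("[").stripSuffix("]"))
    .filter(_.nonEmpty)
    .map(_.split(',').toList.map(_.trim)) // split each phase into rule names
```

Each resulting inner list would then be reflectively instantiated into the rules for that phase.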

@johnynek (Collaborator, Author) commented Oct 9, 2017

I saw two comments:

  1. are we exercising the optimization enough: do we know it is being respected when configured? I think we don't know for sure. I'll add some more tests to convince ourselves.

  2. could we make it easier to configure rules? I assume it will be somewhat rare to want to do this, and doing it by making a class with no constructor args and returning what you want would only be a few lines of code. I'd actually like to punt on making it easier to configure for now. I think we could come back if we thought it was too hard to configure, and we could do that even without breaking APIs I think.

How does it sound if I add more tests to address 1. above?
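The "class with no constructor args" approach might look like the following sketch (the trait and rule names are assumptions for illustration; scalding's actual types differ):

```scala
// Hypothetical types standing in for scalding's optimization API.
sealed trait Rule
case object ComposeMap extends Rule
case object RemoveUselessFork extends Rule

trait OptimizationPhases { def phases: Seq[Rule] }

// A user opts in by writing a zero-arg class naming the rules they want;
// the framework could then instantiate it reflectively from a Config string
// (e.g. via Class.forName on a fully qualified class name).
final class OnlyComposeMap extends OptimizationPhases {
  def phases = Seq(ComposeMap)
}
```

The point being made above: this is only a few lines of user code, so a richer rule-configuration syntax can be deferred without breaking APIs.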

@ianoc (Collaborator) commented Oct 9, 2017

If you're confident there isn't an API-breakage concern (or that we won't back ourselves into a corner), I think that really addresses most of my (1) and (2), since (1) was really trying to get at exercising the new API.

    def optimizeWriteBatch(writes: List[ToWrite], rules: Seq[Rule[TypedPipe]]): HMap[TypedPipe, TypedPipe] = {
      val dag = Dag.empty(typed.OptimizationRules.toLiteral)
      val (d1, ws) = writes.foldLeft((dag, List.empty[Id[_]])) {
        case ((dag, ws), Force(p)) =>
ianoc (review comment): Is this Force from a .forceToDisk, or from a force-to-disk Execution?

johnynek (author): ForceToDiskExecution.

    res
    dest.writeFrom(toPipe[T](dest.sinkFields)(flowDef, mode, dest.setter))
    // We want to fork after this point
    fork
ianoc (review comment): Why did we make this change? I guess it should be OK now with the better optimizer? I'm guessing we previously needed it the first way to avoid re-computation?

johnynek (author):

I'm not sure why it was ever the other way. Many .writes are not consumed, but by putting the fork before, you always do some extra boxing in and out of a single tuple.

I can revert it; I don't think tests pass or fail either way. We really need a set of performance tests, not just correctness tests.
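As a toy illustration of the two orderings being discussed (the ADT here is invented, not scalding's): forking before the write makes the write itself read through the Fork node, while forking after lets the write consume the pipe directly, so only later reuses of the pipe pay for the fork.

```scala
// Invented, simplified pipe ADT purely to show the two graph shapes.
sealed trait Pipe
case class Source(name: String) extends Pipe
case class Fork(of: Pipe) extends Pipe
case class Write(of: Pipe, dest: String) extends Pipe

// fork-before: the write reads through the Fork node (extra boxing on every record)
val forkBefore = Write(Fork(Source("in")), "out")

// fork-after: the write reads the source directly; a Fork node is only
// introduced for consumers that reuse the pipe after the write
val forkAfter = Write(Source("in"), "out")
```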

    private case class State(
      filesToCleanup: Map[Mode, Set[String]],
    - forcedPipes: Map[(Config, Mode, TypedPipe[Any]), Future[TypedPipe[Any]]]) {
    + initToOpt: HMap[TypedPipe, TypedPipe],
ianoc (review comment): Could we add a comment on this field?

    @@ -95,19 +101,37 @@ class AsyncFlowDefRunner extends Writer { self =>
          case None => this
        }

      def addPipe[T](c: Config,
      /**
       * Returns true if we actually add
ianoc (review comment): Could you expand on this comment? Something like: returns true if we will add to our forced pipes, false on a cache hit (a no-op). I think?

johnynek (author): Yes. The thing is, writes are side effects, and we only want to take each side effect once. So if this returns true, the caller should actually perform the side effect. I'll update the comment.
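The "side effect exactly once" contract can be sketched with a concurrent map (the names below are assumptions; the real AsyncFlowDefRunner state is richer):

```scala
import scala.collection.concurrent.TrieMap

// Sketch of a cache whose insert reports whether this caller "won":
// true means the caller must perform the side-effecting write exactly once;
// false means a cache hit, so the caller reuses the stored value instead.
final class ForcedPipes[K, V] {
  private val cache = TrieMap.empty[K, V]

  def add(key: K, value: V): Boolean =
    cache.putIfAbsent(key, value).isEmpty
}
```

putIfAbsent is atomic on TrieMap, so even under concurrent calls only one caller gets true per key, matching the once-only semantics described above.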

    private case class State(
      filesToCleanup: Map[Mode, Set[String]],
    - forcedPipes: Map[(Config, Mode, TypedPipe[Any]), Future[TypedPipe[Any]]]) {
    + initToOpt: HMap[TypedPipe, TypedPipe],
    + forcedPipes: HMap[StateKey, WorkVal]) {
ianoc (review comment): It's probably worth ensuring we have good comments on all the state fields here (I know some of this is old, but it helps review/reading). E.g., does "forced" here mean the same thing as TypedPipe's forceToDisk? Is it something we need to materialize as an output? And are these ultimately our results from the execution run?


    val phases = defaultOptimizationRules(
ianoc (review comment): If we are optimizing in toPipe, do the optimization rules in the Execution have any effect?

johnynek (author): We are careful to avoid calling toPipe in Execution; we call toPipeUnoptimized. This is because we want to optimize the whole graph (multiple sinks) that the Execution can see, which is likely to be better and to allow more aggressive optimization than the per-pipe approach of .write.
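A toy illustration of why whole-graph optimization can beat per-pipe optimization (types invented for illustration): with every sink visible at once, a shared subgraph can be detected and planned a single time, which a per-.write view cannot see.

```scala
// Invented, simplified pipe ADT.
sealed trait P
case class Src(n: String) extends P
case class Mapped(of: P, f: String) extends P

// Whole-graph planning sees all sinks together, so it can find nodes that
// appear under more than one sink and decide to materialize them once.
def sharedNodes(sinks: List[P]): Set[P] = {
  def nodes(p: P): List[P] = p match {
    case s @ Src(_)        => List(s)
    case m @ Mapped(of, _) => m :: nodes(of)
  }
  val all = sinks.flatMap(nodes)
  all.groupBy(identity).collect { case (n, occ) if occ.size > 1 => n }.toSet
}
```

Planning each sink alone (the per-pipe .write view) would see Src("in") once per call and could never notice the sharing.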

ianoc (review comment): Do we not hit toPipe going in and out of groupBys?

@ianoc (Collaborator) commented Oct 9, 2017

LGTM modulo a few comments and a final question.

@johnynek (Collaborator, Author) commented Oct 9, 2017

Thanks for the review! I'll add the comments you wanted, and if those look good, I'll merge (and on to the next item!).

I'm getting excited about scalding 0.18. I hope we don't hit any problems shipping it to Twitter (I think we have tried to keep the API changes small).

Of course, we still have the reducer-estimation API change that Twitter sent and that we reverted in 0.17. Was it a big pain to change code for that? Can we do anything to make it easier for people to adopt?

@ianoc (Collaborator) commented Oct 9, 2017

@johnynek Nah, it's not that bad. I suspect most people don't have custom reducer estimators, so it's not likely to bleed too far from that POV. I had a branch to adapt some of Stripe's code for it. I think we have only 2 reducer estimators internally, so it should be OK.

@johnynek (Collaborator, Author) commented Oct 9, 2017

Not anymore. groupBy is just building an ADT; then we recursively apply the planning, and we never call toPipe in that recursion (you can double-check).
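The "groupBy just builds an ADT" point can be sketched as follows (toy types, not scalding's): constructing the node runs nothing; all the work happens later in a single recursive planning pass.

```scala
// Invented, simplified ADT standing in for scalding's TypedPipe nodes.
sealed trait TP
case class Src(xs: List[Int]) extends TP
case class GroupBy(of: TP, key: Int => Int) extends TP

// Count planner visits to show when work actually happens.
var planned = 0

def plan(tp: TP): Unit = {
  planned += 1
  tp match {
    case Src(_)         => ()
    case GroupBy(of, _) => plan(of) // recursion over the ADT; nothing like toPipe here
  }
}

val g = GroupBy(Src(List(1, 2, 3)), _ % 2) // building the ADT runs no planning at all
assert(planned == 0)
plan(g) // visits GroupBy, then Src: exactly one recursive pass
```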

@ianoc (Collaborator) commented Oct 9, 2017

Mmm, if we are breaking the API here, should we mark toPipe and from in TypedPipe as deprecated (to be removed in scalding 0.19.0) and re-point them at some cascading-specific version?

@johnynek (Collaborator, Author):

Added #1745 so we can add configuration options without binary breaks.

About toPipe being deprecated, I definitely want to get there before 0.18 but I want to see how tractable it is to factor scalding-typed into a separate project, and I will try to do it there.

@johnynek (Collaborator, Author):

@ianoc how does this look? I think I addressed your concerns about comments. I also added tests covering the configuration of optimization rules, to make sure they are triggered as expected in both the toPipe and Execution cases.

@johnynek (Collaborator, Author):

Got verbal sign-off from @ianoc.

@johnynek johnynek merged commit 2930cd7 into develop Oct 21, 2017
@johnynek johnynek deleted the oscar/config-optimization branch October 21, 2017 00:14
Development

Successfully merging this pull request may close these issues.

Make optimizations in 0.18 configurable
2 participants