
[SPARK-14176][SQL] Add DataFrameWriter.trigger to set the stream batch period #11976

Closed · wants to merge 8 commits

Conversation

@zsxwing (Member) commented on Mar 26, 2016

What changes were proposed in this pull request?

Add a processing-time trigger to control the batch processing speed.

How was this patch tested?

Unit tests
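
For context, a rough end-to-end usage sketch of the API this PR converges on after review (hypothetical: `df`, the format, and the paths are placeholders, and the import path for ProcessingTime is assumed rather than quoted from the patch):

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.sql.ProcessingTime   // assumed import path

// Start a streaming query that fires a batch every 10 seconds.
// df, the format, and both paths below are illustrative placeholders.
val query = df.write
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoint")
  .trigger(ProcessingTime.create(10, TimeUnit.SECONDS))
  .startStream("/tmp/output")
```

(`startStream` here reflects the pre-2.0 `DataFrameWriter` API this PR is written against.)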

@SparkQA commented on Mar 26, 2016

Test build #54253 has finished for PR 11976 at commit bf5d675.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member, Author) commented on Mar 28, 2016

retest this please

@SparkQA commented on Mar 28, 2016

Test build #54329 has finished for PR 11976 at commit bf5d675.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

   * @since 2.0.0
   */
  def trigger(period: Duration): DataFrameWriter = {
    this.extraOptions += ("period" -> period.toMillis.toString)
Contributor commented:

Since this name is exposed for the user to accidentally modify (through option()), we should probably make it more specific. Maybe "triggerInterval".

Also, I think "interval" is better than "period".
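
To illustrate the concern with a hypothetical snippet (not code from the patch): the trigger setting lives in the same `extraOptions` map that backs `option()`, so a user option under the same generic key would silently overwrite it.

```scala
import scala.concurrent.duration._

// Hypothetical collision: at this point in the review, trigger() stores its value under the
// generic extraOptions key "period", which an unrelated user option can clobber.
df.write
  .trigger(10.seconds)           // stores extraOptions("period" -> "10000")
  .option("period", "monthly")   // user option overwrites the trigger interval
  .startStream("/tmp/output")    // placeholder path; df is also a placeholder
```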

@SparkQA commented on Mar 29, 2016

Test build #54376 has finished for PR 11976 at commit 6f5c6ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -276,7 +276,7 @@ trait StreamTest extends QueryTest with Timeouts {
        currentStream =
          sqlContext
            .streams
-           .startQuery(StreamExecution.nextName, metadataRoot, stream, sink)
+           .startQuery(StreamExecution.nextName, metadataRoot, stream, sink, 10L)
Contributor commented:

Why not 0?

@zsxwing (Member, Author) commented on Mar 30, 2016

I updated the PR to add Trigger and ProcessingTime, so that other triggers can be added in the future.

@SparkQA commented on Mar 30, 2016

Test build #54498 has finished for PR 11976 at commit 92d204c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Trigger
    • case class ProcessingTime(intervalMs: Long) extends Trigger with Logging
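
For reference, a simplified sketch of the two public classes listed above (the actual patch also mixes in Logging, validates the interval, and adds factory helpers on the ProcessingTime companion object):

```scala
// Simplified sketch of the reported hierarchy; not the exact code in the patch.
trait Trigger

case class ProcessingTime(intervalMs: Long) extends Trigger {
  require(intervalMs >= 0, "the trigger interval must not be negative")  // assumed validation
}
```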

   *
   * @since 2.0.0
   */
  def trigger(interval: Long, unit: TimeUnit): DataFrameWriter = {
Contributor commented:

Not all trigger modes are going to be time-based, though. In the doc we also propose data-size-based triggers.

Member (Author) replied:

> Not all trigger modes are going to be time-based, though. In the doc we also propose data-size-based triggers.

How about def trigger(trigger: Trigger) and expose Trigger and all its subclasses?
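
For concreteness, the suggested signature might look roughly like this (an illustrative sketch building on the Trigger/ProcessingTime sketch above; `WriterSketch` and its private field are stand-ins, not the real `DataFrameWriter` internals):

```scala
// Accepting any Trigger keeps the method open to future non-time-based triggers
// (e.g. data-size based) without changing the public API.
final class WriterSketch {
  private var currentTrigger: Trigger = ProcessingTime(0L)  // default: run batches back to back

  def trigger(trigger: Trigger): WriterSketch = {
    currentTrigger = trigger
    this  // return this so calls chain in the usual builder style
  }
}
```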

@SparkQA commented on Mar 31, 2016

Test build #54568 has finished for PR 11976 at commit f3526d0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented on Mar 31, 2016

Test build #54566 has finished for PR 11976 at commit a7355ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

metadataRoot,
stream,
sink,
ProcessingTime(0L))
Contributor commented:

Minor: maybe just make this the default arg since this function is internal.

@tdas (Contributor) commented on Apr 1, 2016

There does not seem to be any end-to-end test that makes sure the trigger is working and keeps the right timing. Also, what is the behavior if the previous batch takes longer than the interval? None of that is tested.
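
To make the timing question concrete, here is a hedged, self-contained sketch of the kind of loop a processing-time executor runs. The overrun policy assumed here (sleep out the remainder of the interval when a batch finishes early, start the next batch immediately when it overruns) is exactly what an end-to-end test would need to pin down; none of the names below are from the patch.

```scala
// Hypothetical processing-time trigger loop, for illustration only.
object TriggerLoopSketch {
  def runLoop(intervalMs: Long)(runBatch: () => Boolean): Unit = {
    var continue = true
    while (continue) {
      val batchStart = System.currentTimeMillis()
      continue = runBatch()                       // returns false to stop the query
      if (continue && intervalMs > 0) {
        val elapsed = System.currentTimeMillis() - batchStart
        if (elapsed < intervalMs) {
          Thread.sleep(intervalMs - elapsed)      // batch finished early: wait for the boundary
        }
        // else: the batch overran its interval; fall through and start the next batch at once
      }
    }
  }
}
```

The injectable `Clock` on `ProcessingTimeExecutor` (see the later test output) is what would let such timing behavior be unit tested deterministically instead of with real sleeps.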

* {{{
* def.writer.trigger(ProcessingTime.create(10, TimeUnit.SECONDS))
* def.writer.trigger(ProcessingTime.create("10 seconds"))
* }}}
Contributor commented:

Nice documentation! Maybe put the typesafe one second, and include the imports that are required.
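
For instance, the example with its imports spelled out might read as follows (a sketch: `df.write` is illustrative, and the import path for ProcessingTime plus the Duration-based overload are assumptions rather than quotes from the patch):

```scala
import java.util.concurrent.TimeUnit
import scala.concurrent.duration._

import org.apache.spark.sql.ProcessingTime   // assumed import path

df.write.trigger(ProcessingTime.create(10, TimeUnit.SECONDS))
df.write.trigger(ProcessingTime.create("10 seconds"))
df.write.trigger(ProcessingTime(10.seconds))  // "typesafe" scala.concurrent.duration form (assumed overload)
```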

@@ -78,6 +76,11 @@ class StreamExecution(
   /** A list of unique sources in the query plan. */
   private val uniqueSources = sources.distinct

+  private val triggerExecutor = trigger match {
+    case t: ProcessingTime => ProcessingTimeExecutor(t)
+    case t => throw new IllegalArgumentException(s"${t.getClass} is not supported")
Contributor commented:

The trait is sealed. Do we need this?

@marmbrus (Contributor) commented on Apr 1, 2016

This looks great! Minor comments only.

@@ -78,6 +78,17 @@ final class DataFrameWriter private[sql](df: DataFrame) {
   }

+  /**
+   * Set the trigger for the stream query. The default value is `ProcessingTime(0)` and it will run
+   * the query as fast as possible.
+   *
Contributor commented:

This Scaladoc should have an example right here, e.g.:

write.trigger(ProcessingTime("10 seconds"))
write.trigger("10 seconds")     // less verbose

@zsxwing (Member, Author) commented on Apr 1, 2016

Addressed all comments

@SparkQA commented on Apr 1, 2016

Test build #54722 has finished for PR 11976 at commit 7c4bc42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ProcessingTimeExecutor(processingTime: ProcessingTime, clock: Clock = new SystemClock())

@SparkQA commented on Apr 2, 2016

Test build #54733 has finished for PR 11976 at commit 6c1b382.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus (Contributor) commented on Apr 4, 2016

LGTM, merging to master.

@asfgit closed this in 855ed44 on Apr 4, 2016.
@zsxwing deleted the trigger branch on Apr 4, 2016.