
[SPARK-14257][SQL] Allow multiple continuous queries to be started from the same DataFrame #12049

Closed
wants to merge 7 commits

Conversation

@zsxwing (Member) commented Mar 29, 2016

What changes were proposed in this pull request?

Make StreamingRelation store the closure that creates the source, so that StreamExecution can instantiate its own source and we can start multiple continuous queries from the same DataFrame.
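In other words, a rough sketch of the proposed shape (using Spark-internal types such as `Source`, `Attribute`, and `LeafNode`; this illustrates the idea, not the exact merged code):

```scala
// Sketch: the relation stores a () => Source closure instead of a Source,
// so each StreamExecution can create its own Source instance.
case class StreamingRelation(
    sourceCreator: () => Source,
    output: Seq[Attribute]) extends LeafNode

// Inside StreamExecution, the closure is invoked once per query:
val source = sourceCreator()  // this query's private Source
```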

How was this patch tested?

test("DataFrame reuse")
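For context, the scenario that test presumably covers looks roughly like this (a hypothetical sketch only; the source/sink setup and the 2.0-era `stream()`/`startStream()` API names are assumptions, not the actual test body):

```scala
// Hypothetical sketch: two continuous queries started from one DataFrame.
val df = sqlContext.read.stream()            // one streaming DataFrame
val q1 = df.write.startStream("/tmp/out1")   // first continuous query
val q2 = df.write.startStream("/tmp/out2")   // second query from the same df
// With this patch each query materializes its own Source from the
// DataFrame's DataSource, so q2 no longer conflicts with q1.
```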

@SparkQA commented Mar 30, 2016

Test build #54472 has finished for PR 12049 at commit 50c39b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StreamingRelation(

val source = sourceCreator()
// We still need to use the previous `output` instead of `source.schema`, because
// "_logicalPlan" already uses the attributes of the previous `output`.
StreamingRelation(() => source, output)
Contributor:

I think it's confusing to have an opaque function that sometimes creates a source and sometimes returns a static source. It's not going to be clear from explain() which mode you are in.

Member Author (zsxwing):

> I think it's confusing to have an opaque function that sometimes creates a source and sometimes returns a static source. It's not going to be clear from explain() which mode you are in.

How about adding a new Relation for a static source (maybe call it StreamExecutionRelation)?

Contributor:

Or StreamingRelation could just hold a DataSource, and we could have a Map[DataSource, Source] here that's initialized at startup.

Member Author (zsxwing):

I tried Map[DataSource, Source] but failed because of RichSource.

  implicit class RichSource(s: Source) {
    def toDF(): DataFrame = Dataset.ofRows(sqlContext, StreamingRelation(s))

    def toDS[A: Encoder](): Dataset[A] = Dataset(sqlContext, StreamingRelation(s))
  }

If we only have StreamingRelation(DataSource), then RichSource needs to create a DataSource for the Source dynamically.

So the above code would be changed to:

  implicit class RichSource(s: Source) {
    def toDF(): DataFrame = Dataset.ofRows(sqlContext, StreamingRelation(DataSource(sqlContext, className = ...)))

    def toDS[A: Encoder](): Dataset[A] = Dataset(sqlContext, StreamingRelation(DataSource(sqlContext, className = ...)))
  }

Here I don't know what to use for className. Without code generation, we can't create a new class for each different Source instance. This seems too complicated.

Therefore, I ended up using the StreamExecutionRelation idea.
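That two-relation design (whose signatures the SparkQA summaries later in this thread report) can be sketched as follows. The placement of the conversion inside StreamExecution is an assumption for illustration, and the final merged StreamingRelation also carries a sourceName field:

```scala
// While the DataFrame is being built, the plan holds only the recipe:
case class StreamingRelation(dataSource: DataSource, output: Seq[Attribute]) extends LeafNode

// Once a query starts, the plan holds the materialized Source:
case class StreamingExecutionRelation(source: Source, output: Seq[Attribute]) extends LeafNode

// At query start, each StreamExecution materializes its own Source, so two
// queries over the same DataFrame no longer share state:
val logicalPlan = analyzedPlan transform {
  case StreamingRelation(dataSource, output) =>
    // Reuse the original `output` attributes: downstream operators in the
    // plan already reference them, so we must not switch to source.schema.
    StreamingExecutionRelation(dataSource.createSource(), output)
}
```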

Member Author (zsxwing):

> It's not going to be clear from explain() which mode you are in.

By the way, streaming DataFrames don't support explain yet. Should we fix that now?

@SparkQA commented Mar 31, 2016

Test build #54653 has finished for PR 12049 at commit 6196790.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StreamingRelation(dataSource: DataSource, output: Seq[Attribute]) extends LeafNode
    • case class StreamingExecutionRelation(source: Source, output: Seq[Attribute]) extends LeafNode

@SparkQA commented Apr 1, 2016

Test build #2722 has finished for PR 12049 at commit 6196790.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StreamingRelation(dataSource: DataSource, output: Seq[Attribute]) extends LeafNode
    • case class StreamingExecutionRelation(source: Source, output: Seq[Attribute]) extends LeafNode

@SparkQA commented Apr 1, 2016

Test build #2723 has finished for PR 12049 at commit 6196790.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StreamingRelation(dataSource: DataSource, output: Seq[Attribute]) extends LeafNode
    • case class StreamingExecutionRelation(source: Source, output: Seq[Attribute]) extends LeafNode

@SparkQA commented Apr 4, 2016

Test build #54878 has finished for PR 12049 at commit aa55afe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 4, 2016

Test build #54879 has finished for PR 12049 at commit ac51850.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 4, 2016

Test build #54881 has finished for PR 12049 at commit 9b5f007.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Materialize source to avoid creating it in every batch
val source = dataSource.createSource()
// We still need to use the previous `output` instead of `source.schema`, because
// "_logicalPlan" already uses the attributes of the previous `output`.
Contributor:

nit: I don't see anything named _logicalPlan.


/**
* Used to link a streaming [[DataSource]] into a
* [[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]].
Contributor:

Maybe include a description of how this gets turned into a StreamingExecutionRelation, and whose responsibility that is.

Member Author (zsxwing):

Done

@SparkQA commented Apr 5, 2016

Test build #54932 has finished for PR 12049 at commit 527f55f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 5, 2016

Test build #54955 has finished for PR 12049 at commit 48d760e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute])

@zsxwing (Member Author) commented Apr 5, 2016

retest this please

@SparkQA commented Apr 5, 2016

Test build #54994 has finished for PR 12049 at commit 48d760e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute])

@marmbrus (Contributor) commented Apr 5, 2016

Thanks, merging to master.

@asfgit asfgit closed this in 463bac0 Apr 5, 2016
@zsxwing zsxwing deleted the df-reuse branch April 5, 2016 18:15