[SPARK-36533][SS] Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches #33763
Conversation
@HeartSaVioR, @brkyvz, could you review this? Thanks!
cc @viirya
First pass.
Review threads on:
...c/main/scala/org/apache/spark/sql/execution/streaming/FakeLatestOffsetMicroBatchStream.scala (outdated)
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FakeLatestOffsetSource.scala (outdated)
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (outdated)
try {
  assert(q.awaitTermination(streamingTimeout.toMillis))
  // only one batch has data in both sources, thus counted, see SPARK-24050
(Beyond the scope of the PR) it would be ideal if we can revisit and fix it later.
sql/core/src/test/scala/org/apache/spark/sql/streaming/TriggerAvailableNowSuite.scala
Btw, thanks for the great contribution! Nice feature indeed.
(It seems I failed to post the review comments 4 days ago...)
Still reviewing this.
Review threads on:
...c/main/scala/org/apache/spark/sql/execution/streaming/FakeLatestOffsetMicroBatchStream.scala (outdated)
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FakeLatestOffsetSource.scala (outdated)
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (outdated)
Thanks for the work. I will review in next few days.
I think we also need to update the documentation, but I don't see a doc change included yet.
...src/main/java/org/apache/spark/sql/connector/read/streaming/SupportsTriggerAvailableNow.java
private def getInitialOffset: streaming.Offset = {
  delegate match {
    case _: Source => null
    case m: MicroBatchStream => m.initialOffset
  }
}
We can use null (for v1 sources) and initialOffset (for v2 streams) as the startOffset in latestOffset(startOffset, readLimit), since the readLimit is always allAvailable.
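To illustrate the idea behind this wrapper — snapshot the delegate source's latest offset when the query starts, then never report past it — here is a minimal Python sketch. None of these class or method names are Spark APIs; they only model the behavior described above, with integer offsets standing in for real source offsets:

```python
class DelegateSource:
    """Stand-in for an underlying source whose latest offset keeps growing."""
    def __init__(self):
        self.offset = 0

    def initial_offset(self):
        return 0

    def latest_offset(self):
        return self.offset


class AvailableNowWrapper:
    """Caps the delegate's reported offset at the value seen at query start."""
    def __init__(self, delegate):
        self.delegate = delegate
        self.target = None

    def prepare_for_trigger_available_now(self):
        # Record the target offset for the whole query.
        self.target = self.delegate.latest_offset()

    def latest_offset(self, start_offset, read_limit="allAvailable"):
        # Even if the delegate has moved on, never report past the snapshot.
        return min(self.delegate.latest_offset(), self.target)


src = DelegateSource()
src.offset = 5                       # data available before the query starts
wrapper = AvailableNowWrapper(src)
wrapper.prepare_for_trigger_available_now()

src.offset = 9                       # new data arrives after query start
print(wrapper.latest_offset(start_offset=src.initial_offset()))  # → 5
```

The query therefore drains everything up to offset 5 and stops, ignoring the later data, which is the Trigger.AvailableNow contract.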
retest this, please
@viirya Would you mind revisiting this? The PR looks OK to me, but I'd like to see your approval as you're reviewing the change. Thanks in advance!
val q = startQuery()

try {
Shouldn't we add more offsets (createFile) into the source here, after the latest offset was fetched? That way we can verify that we only process the data available at the beginning of the query.
I find it a bit hard to control the order between starting the query and adding new files into the source. Do you know if there is an easy way to do so?
I'm not sure we could do it easily, as we can't let the query be "suspended" after figuring out source offsets to process. The streaming query is running concurrently with the main thread.
Hmm, okay. I asked this because the test looks to me like it only verifies that the query can run with correct results, not the Trigger.AvailableNow behavior itself. Okay for me.
assert(q.recentProgress.count(_.numInputRows != 0) == 3)
This verifies that the query runs three micro-batches instead of one.
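The assertion counts progress entries with non-zero input rows. As a minimal illustration of the same check (with made-up progress records standing in for `q.recentProgress`):

```python
# Hypothetical progress entries mimicking q.recentProgress; only entries with
# numInputRows != 0 correspond to micro-batches that actually processed data.
recent_progress = [
    {"batchId": 0, "numInputRows": 2},
    {"batchId": 1, "numInputRows": 2},
    {"batchId": 2, "numInputRows": 1},
    {"batchId": 3, "numInputRows": 0},  # a no-data batch is not counted
]

non_empty_batches = sum(1 for p in recent_progress if p["numInputRows"] != 0)
print(non_empty_batches)  # → 3, i.e. three micro-batches carried data
```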
Btw, the test `SPARK-36533: Trigger.AvailableNow - checkpointing` covers everything here now. The change to `index` determines the number of micro-batches executed, and we also check the output DataFrame. I think we can simply remove this test.
sql/core/src/test/scala/org/apache/spark/sql/streaming/TriggerAvailableNowSuite.scala
q.stop()
}

var index = 3 // We have processed the first 3 rows in the first query
nit: probably better to expand the comment to note that it tracks the number of micro-batch executions from here on. The code is intuitive, but the elaboration is worth having given the importance of this variable.
+1 pending other reviewers' approval.
LGTM!
Thanks! Merging to master!
Thanks @bozhang2820 for the great contribution, and thanks all for reviewing! I just merged this into master.
@HeartSaVioR @viirya @bozhang2820 is there a way to call this from PySpark? I can't figure it out.
Nice catch. I realized it was missing. I'm going to address this soon.
### What changes were proposed in this pull request? This PR proposes to add Trigger.AvailableNow in PySpark on top of #33763. ### Why are the changes needed? We missed adding Trigger.AvailableNow in PySpark in #33763. ### Does this PR introduce _any_ user-facing change? Yes, Trigger.AvailableNow will be available in PySpark as well. ### How was this patch tested? Added simple validation in PySpark doc. Manually tested as below: ``` >>> spark.readStream.format("text").load("/WorkArea/ScalaProjects/spark-apache/dist/inputs").writeStream.format("console").trigger(once=True).start() <pyspark.sql.streaming.StreamingQuery object at 0x118dff6d0> ------------------------------------------- Batch: 0 ------------------------------------------- +-----+ |value| +-----+ | a| | b| | c| | d| | e| +-----+ >>> spark.readStream.format("text").load("/WorkArea/ScalaProjects/spark-apache/dist/inputs").writeStream.format("console").trigger(availableNow=True).start() <pyspark.sql.streaming.StreamingQuery object at 0x118dffe50> >>> ------------------------------------------- Batch: 0 ------------------------------------------- +-----+ |value| +-----+ | a| | b| | c| | d| | e| +-----+ >>> spark.readStream.format("text").option("maxfilespertrigger", "2").load("/WorkArea/ScalaProjects/spark-apache/dist/inputs").writeStream.format("console").trigger(availableNow=True).start() <pyspark.sql.streaming.StreamingQuery object at 0x118dff820> >>> ------------------------------------------- Batch: 0 ------------------------------------------- +-----+ |value| +-----+ | a| | b| +-----+ ------------------------------------------- Batch: 1 ------------------------------------------- +-----+ |value| +-----+ | c| | d| +-----+ ------------------------------------------- Batch: 2 ------------------------------------------- +-----+ |value| +-----+ | e| +-----+ >>> ``` Closes #34592 from HeartSaVioR/SPARK-36533-FOLLOWUP-pyspark. 
Authored-by: Jungtaek Lim <[email protected]> Signed-off-by: Jungtaek Lim <[email protected]>
What changes were proposed in this pull request?
This change creates a new type of Trigger, Trigger.AvailableNow, for streaming queries. It is like Trigger.Once, which processes all available data and then stops the query, but with better scalability, since data can be processed in multiple batches instead of one.
To achieve this, this change proposes a new interface, `SupportsTriggerAvailableNow`, which is an extension of `SupportsAdmissionControl`. It has one method, `prepareForTriggerAvailableNow`, which will be called at the beginning of a streaming query with Trigger.AvailableNow, to let the source record the offset of the latest data at that time (a.k.a. the target offset for the query). The source should then behave as if no new data arrives after the beginning of the query, i.e., the source will not return an offset higher than the target offset when `latestOffset` is called.

This change also updates `FileStreamSource` to be an implementation of `SupportsTriggerAvailableNow`.

For other sources that do not implement `SupportsTriggerAvailableNow`, this change also provides a new class, `FakeLatestOffsetSupportsTriggerAvailableNow`, which wraps the sources and makes them support Trigger.AvailableNow by overriding their `latestOffset` method to always return the latest offset at the beginning of the query.
Why are the changes needed?
Currently, streaming queries with Trigger.Once always load all available data in a single batch. Because of this, the amount of data a query can process is limited, or the Spark driver will run out of memory.
Does this PR introduce any user-facing change?
Users will be able to use Trigger.AvailableNow (to process all available data then stop the streaming query) with this change.
How was this patch tested?
Added unit tests.
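The overall behavior — snapshot a target offset at query start, drain up to it in several rate-limited micro-batches, then stop — can be sketched without Spark as follows. All names here are illustrative, not Spark APIs; integer indexes stand in for offsets and `max_per_batch` plays the role of an admission-control limit such as `maxFilesPerTrigger`:

```python
def run_available_now(data, max_per_batch):
    """Process all rows present at query start in multiple capped batches,
    then stop, like Trigger.AvailableNow (rows appended later are ignored)."""
    target = len(data)      # prepareForTriggerAvailableNow: snapshot the target offset
    start = 0
    batches = []
    while start < target:
        end = min(start + max_per_batch, target)   # per-batch admission control
        batches.append(data[start:end])
        start = end
    return batches

rows = ["a", "b", "c", "d", "e"]
batches = run_available_now(rows, max_per_batch=2)
print(len(batches))   # → 3 micro-batches, mirroring the maxFilesPerTrigger=2 example
print(batches)        # → [['a', 'b'], ['c', 'd'], ['e']]
```

With Trigger.Once the same five rows would land in a single batch; the loop above shows why AvailableNow scales better: the work is split into bounded micro-batches while still covering exactly the data available at query start.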