[SPARK-36533][SS] Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches

### What changes were proposed in this pull request?

This change creates a new type of Trigger: Trigger.AvailableNow for streaming queries. It is like Trigger.Once, which processes all available data and then stops the query, but with better scalability, since data can be processed in multiple batches instead of one.

To achieve this, this change proposes a new interface `SupportsTriggerAvailableNow`, which is an extension of `SupportsAdmissionControl`. It has one method, `prepareForTriggerAvailableNow`, which will be called at the beginning of a streaming query with Trigger.AvailableNow, to let the source record the offset of the latest data at that time (a.k.a. the target offset for the query). The source should then behave as if no new data arrives after the beginning of the query, i.e., the source will not return an offset higher than the target offset when `latestOffset` is called.

This change also updates `FileStreamSource` to be an implementation of `SupportsTriggerAvailableNow`.

For other sources that do not implement `SupportsTriggerAvailableNow`, this change provides a new class `AvailableNowDataStreamWrapper`, which wraps such sources and makes them support Trigger.AvailableNow by overriding their `latestOffset` method to always return the latest offset recorded at the beginning of the query.

### Why are the changes needed?

Currently, streaming queries with Trigger.Once will always load all of the available data in a single batch. Because of this, the amount of data a query can process is limited, or the Spark driver will run out of memory.

### Does this PR introduce _any_ user-facing change?

Users will be able to use Trigger.AvailableNow (to process all available data and then stop the streaming query) with this change.

### How was this patch tested?

Added unit tests.

Closes #33763 from bozhang2820/new-trigger.

Authored-by: Bo Zhang <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
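For reference, a minimal usage sketch of the new trigger, assuming `Trigger.AvailableNow()` as the factory method this change adds to `org.apache.spark.sql.streaming.Trigger`; all paths and the source/sink formats below are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Minimal sketch: drain everything currently in the input directory in
// (possibly) multiple micro-batches, then stop. Paths are placeholders.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

val query = spark.readStream
  .format("text")
  .load("/tmp/in")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/ckpt")
  .trigger(Trigger.AvailableNow())  // new trigger introduced by this change
  .start("/tmp/out")

query.awaitTermination()  // returns once all data available at start is processed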
1 parent ff8cc4b · commit e33cdfb · 12 changed files with 656 additions and 33 deletions.
.../main/java/org/apache/spark/sql/connector/read/streaming/SupportsTriggerAvailableNow.java (41 additions, 0 deletions)
@@ -0,0 +1,41 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.connector.read.streaming;

import org.apache.spark.annotation.Evolving;

/**
 * An interface for streaming sources that support running in Trigger.AvailableNow mode, which
 * will process all the available data at the beginning of the query in (possibly) multiple
 * batches.
 *
 * This mode will have better scalability compared to Trigger.Once mode.
 *
 * @since 3.3.0
 */
@Evolving
public interface SupportsTriggerAvailableNow extends SupportsAdmissionControl {

  /**
   * This will be called at the beginning of streaming queries with Trigger.AvailableNow, to let
   * the source record the offset for the current latest data at the time (a.k.a. the target
   * offset for the query). The source will behave as if there is no new data coming in after the
   * target offset, i.e., the source will not return an offset higher than the target offset when
   * {@link #latestOffset(Offset, ReadLimit) latestOffset} is called.
   */
  void prepareForTriggerAvailableNow();
}
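To make the contract concrete, here is a minimal sketch (not part of this commit) of a counter-based source implementing the interface. `CountOffset`, `CountingSource`, and all member names are hypothetical, and the remaining SparkDataStream methods are left abstract:

import org.apache.spark.sql.connector.read.streaming.{Offset, ReadLimit, SupportsTriggerAvailableNow}

// Hypothetical offset type: a JSON-serializable counter value.
case class CountOffset(value: Long) extends Offset {
  override def json: String = value.toString
}

// Hypothetical source producing a monotonically increasing count. Only the
// Trigger.AvailableNow pieces are sketched; the rest stays abstract.
abstract class CountingSource extends SupportsTriggerAvailableNow {
  @volatile protected var currentCount: Long = 0L  // advanced by the source
  private var targetCount: Option[Long] = None     // fixed at query start

  // Snapshot "all data available right now" as the target offset.
  override def prepareForTriggerAvailableNow(): Unit = {
    targetCount = Some(currentCount)
  }

  // Never admit data beyond the target captured at query start.
  override def latestOffset(start: Offset, limit: ReadLimit): Offset =
    CountOffset(targetCount.fold(currentCount)(math.min(currentCount, _)))
}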
...c/main/scala/org/apache/spark/sql/execution/streaming/AvailableNowDataStreamWrapper.scala (88 additions, 0 deletions)
@@ -0,0 +1,88 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.streaming

import org.apache.spark.internal.Logging
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, ReadLimit, SparkDataStream, SupportsAdmissionControl, SupportsTriggerAvailableNow}
import org.apache.spark.sql.connector.read.streaming

/**
 * This class wraps a [[SparkDataStream]] and makes it support Trigger.AvailableNow, by overriding
 * its [[latestOffset]] method to always return the latest offset at the beginning of the query.
 */
class AvailableNowDataStreamWrapper(val delegate: SparkDataStream)
  extends SparkDataStream with SupportsTriggerAvailableNow with Logging {

  private var fetchedOffset: streaming.Offset = _

  override def initialOffset(): streaming.Offset = delegate.initialOffset()

  override def deserializeOffset(json: String): streaming.Offset = delegate.deserializeOffset(json)

  override def commit(end: streaming.Offset): Unit = delegate.commit(end)

  override def stop(): Unit = delegate.stop()

  private def getInitialOffset: streaming.Offset = {
    delegate match {
      case _: Source => null
      case m: MicroBatchStream => m.initialOffset
    }
  }

  /**
   * Fetch and store the latest offset for all available data at the beginning of the query.
   */
  override def prepareForTriggerAvailableNow(): Unit = {
    fetchedOffset = delegate match {
      case s: SupportsAdmissionControl =>
        s.latestOffset(getInitialOffset, ReadLimit.allAvailable())
      case s: Source => s.getOffset.orNull
      case m: MicroBatchStream => m.latestOffset()
      case s => throw new IllegalStateException(s"Unexpected source: $s")
    }
  }

  /**
   * Always return [[ReadLimit.allAvailable]].
   */
  override def getDefaultReadLimit: ReadLimit = delegate match {
    case s: SupportsAdmissionControl =>
      val limit = s.getDefaultReadLimit
      if (limit != ReadLimit.allAvailable()) {
        logWarning(s"The read limit $limit is ignored because source $delegate does not " +
          "support running Trigger.AvailableNow queries.")
      }
      ReadLimit.allAvailable()
    case _ => ReadLimit.allAvailable()
  }

  /**
   * Return the latest offset pre-fetched in [[prepareForTriggerAvailableNow]].
   */
  override def latestOffset(startOffset: streaming.Offset, limit: ReadLimit): streaming.Offset =
    fetchedOffset

  override def reportLatestOffset: streaming.Offset = delegate match {
    // Return the real latest offset here since this is only used for metrics.
    case s: SupportsAdmissionControl => s.reportLatestOffset()
    case s: Source => s.getOffset.orNull
    case s: MicroBatchStream => s.latestOffset()
  }
}
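For context on how these wrappers fit together: when a query runs with Trigger.AvailableNow, the micro-batch engine is expected to route each source through the narrowest applicable wrapper. The helper below is an illustrative sketch of that selection, not engine code from this commit:

import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, SparkDataStream, SupportsTriggerAvailableNow}

// Illustrative sketch: sources with native support pass through unchanged;
// everything else is wrapped so latestOffset is pinned at query start.
def wrapForAvailableNow(stream: SparkDataStream): SparkDataStream = stream match {
  case s: SupportsTriggerAvailableNow => s  // native support, no wrapping needed
  case s: Source => new AvailableNowSourceWrapper(s)
  case m: MicroBatchStream => new AvailableNowMicroBatchStreamWrapper(m)
  case other => new AvailableNowDataStreamWrapper(other)
}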
.../scala/org/apache/spark/sql/execution/streaming/AvailableNowMicroBatchStreamWrapper.scala (39 additions, 0 deletions)
@@ -0,0 +1,39 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.streaming

import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory}
import org.apache.spark.sql.connector.read.streaming
import org.apache.spark.sql.connector.read.streaming.MicroBatchStream

/**
 * This class wraps a [[MicroBatchStream]] and makes it support Trigger.AvailableNow.
 *
 * See [[AvailableNowDataStreamWrapper]] for more details.
 */
class AvailableNowMicroBatchStreamWrapper(delegate: MicroBatchStream)
  extends AvailableNowDataStreamWrapper(delegate) with MicroBatchStream {

  override def latestOffset(): streaming.Offset = throw new UnsupportedOperationException(
    "latestOffset(Offset, ReadLimit) should be called instead of this method")

  override def planInputPartitions(start: streaming.Offset, end: streaming.Offset):
    Array[InputPartition] = delegate.planInputPartitions(start, end)

  override def createReaderFactory(): PartitionReaderFactory = delegate.createReaderFactory()
}
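A note on the design: the no-argument `latestOffset()` is deliberately disabled because, under Trigger.AvailableNow, offsets must flow through the admission-control path that returns the pre-fetched target. A hypothetical micro-batch loop against the wrapper might look like this (`dsv2Stream` is a placeholder for any concrete MicroBatchStream):

// `dsv2Stream` stands in for any concrete MicroBatchStream implementation.
val wrapped = new AvailableNowMicroBatchStreamWrapper(dsv2Stream)
wrapped.prepareForTriggerAvailableNow()

val start = wrapped.initialOffset()
val end = wrapped.latestOffset(start, wrapped.getDefaultReadLimit)  // pinned target
if (start != end) {
  // Partition planning and reads are delegated to the wrapped stream.
  val partitions = wrapped.planInputPartitions(start, end)
  val readerFactory = wrapped.createReaderFactory()
  // ... execute the batch, then commit:
  wrapped.commit(end)
}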
...e/src/main/scala/org/apache/spark/sql/execution/streaming/AvailableNowSourceWrapper.scala (38 additions, 0 deletions)
@@ -0,0 +1,38 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.streaming

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

/**
 * This class wraps a [[Source]] and makes it support Trigger.AvailableNow.
 *
 * See [[AvailableNowDataStreamWrapper]] for more details.
 */
class AvailableNowSourceWrapper(delegate: Source)
  extends AvailableNowDataStreamWrapper(delegate) with Source {

  override def schema: StructType = delegate.schema

  override def getOffset: Option[Offset] = throw new UnsupportedOperationException(
    "latestOffset(Offset, ReadLimit) should be called instead of this method")

  override def getBatch(start: Option[Offset], end: Offset): DataFrame =
    delegate.getBatch(start, end)
}
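Finally, a sketch of the call sequence the engine would drive against a wrapped DSv1 source (`someSource` is a placeholder for any concrete Source implementation):

// `someSource` stands in for any concrete DSv1 Source implementation.
val wrapper = new AvailableNowSourceWrapper(someSource)

// 1. At query start: snapshot the source's latest offset as the fixed target.
wrapper.prepareForTriggerAvailableNow()

// 2. When planning each micro-batch: the pre-fetched target is always
//    returned, so data arriving after query start is never admitted.
val target = wrapper.latestOffset(null, wrapper.getDefaultReadLimit)

// 3. Reading the batch is still delegated to the underlying source:
//    wrapper.getBatch(None, target.asInstanceOf[Offset])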