[SPARK-24386][SS] coalesce(1) aggregates in continuous processing #21560
Conversation
Test build #91817 has finished for PR 21560 at commit
Test build #91816 has finished for PR 21560 at commit
 * RDD for continuous coalescing. Asynchronously writes all partitions of `prev` into a local
 * continuous shuffle, and then reads them in the task thread using `reader`.
 */
class ContinuousCoalesceRDD(var reader: ContinuousShuffleReadRDD, var prev: RDD[InternalRow])
Why are `reader` and `prev` both `var` here?
They are vars to make this RDD checkpointable by making them clearable. This raises a good point: I don't think we should make this checkpointable. Rather, I suggest making these simple vals (well, just removing the modifier), and in clearDependencies, just throwing an error saying "Checkpointing this RDD is not supported".
We should do this for all the continuous shuffle RDDs.
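A minimal sketch of what that suggestion amounts to, based on the class header quoted above and the error later in this PR's diff (other constructor parameters and members are elided):

```scala
// Sketch only: drop the `var` modifiers and refuse checkpointing.
class ContinuousCoalesceRDD(reader: ContinuousShuffleReadRDD, prev: RDD[InternalRow] /*, ... */)
  extends RDD[InternalRow](prev.context, Nil) {

  // getPartitions / compute unchanged, elided here

  override def clearDependencies(): Unit = {
    throw new IllegalStateException("Checkpointing this RDD is not supported")
  }
}
```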
case Repartition(1, false, child) =>
  val isContinuous = child.collectFirst {
    case StreamingDataSourceV2Relation(_, _, _, r: ContinuousReader) => r
  }.isDefined
Could the check for whether the plan is continuous be factored out into a separate method so that other places can use it?
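A sketch of the suggested extraction, assuming it lives next to the existing rule so the referenced types are already in scope (the helper name is made up here):

```scala
// Hypothetical helper; the pattern match is lifted from the rule quoted above.
private def isContinuousPlan(plan: LogicalPlan): Boolean = {
  plan.collectFirst {
    case StreamingDataSourceV2Relation(_, _, _, r: ContinuousReader) => r
  }.isDefined
}

// The rule above would then become:
// case Repartition(1, false, child) =>
//   val isContinuous = isContinuousPlan(child)
```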
retest this please
Test build #91882 has finished for PR 21560 at commit
Overall, this almost looks good. My only high-level comment is that the ContinuousCoalesceRDD should not take ContinuousShuffleReadRDD as a parameter. That is very confusing. It needs a ContinuousShuffleReadRDD only because it is reusing the ContinuousShuffleReadRDD code for coalescing, which is an internal implementation detail, and that code should not be in ContinuousCoalesceExec. So move the creation of ContinuousShuffleReadRDD inside the ContinuousCoalesceRDD.
The rest are relatively minor changes.
    queueSize: Int,
    numShuffleWriters: Int,
    epochIntervalMs: Long)
  extends Partition {
Unnecessary
@@ -350,7 +350,14 @@ object UnsupportedOperationChecker {
           _: TypedFilter) =>
      case node if node.nodeName == "StreamingRelationV2" =>
      case node =>
        throwError(s"Continuous processing does not support ${node.nodeName} operations.")
        val aboveSinglePartitionCoalesce = node.find {
Will this allow kafkaStreamDF.coalesce(1).select(...).filter(...).agg(...)?
And are we allowing all stateful operations after this?
It will allow the first one, and I've added a test to verify.
It ought to allow the second one, but for some reason streaming deduplicate insists on inserting a shuffle above the coalesce(1). I will address this in a separate PR, since this seems like suboptimal behavior that isn't only restricted to continuous processing. For now I tweaked the condition to only allow aggregates.
        }.isDefined

        if (!aboveSinglePartitionCoalesce) {
          throwError(s"Continuous processing does not support ${node.nodeName} operations.")
It would be nice if this error statement said what is supported, i.e. that you can rewrite the query with coalesce(1).
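For example (the exact wording is just a suggestion; it mirrors the quoted check):

```scala
if (!aboveSinglePartitionCoalesce) {
  throwError(s"Continuous processing does not support ${node.nodeName} operations. " +
    "Consider rewriting the query with coalesce(1) so it runs on a single partition.")
}
```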
    val childRdd = child.execute()
    val endpointName = s"RPCContinuousShuffleReader-${UUID.randomUUID()}"
    val reader = new ContinuousShuffleReadRDD(
super nit: rename to readerRDD to avoid confusion with other v2 reader classes.
  }

  val threads = prev.partitions.map { prevSplit =>
    new Thread() {
Maybe use a thread pool (using org...spark.util.ThreadUtils) with a name to track threads. Then the cached threads in the thread pool can be reused across epochs.
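A sketch of that suggestion using Spark's ThreadUtils (the pool name and the runnable wiring are illustrative):

```scala
import org.apache.spark.util.ThreadUtils

// A named, cached daemon pool whose threads can be reused across epochs,
// instead of spawning a fresh Thread per parent partition on every epoch.
val threadPool = ThreadUtils.newDaemonCachedThreadPool("continuous-coalesce-writer")

// Instead of `new Thread() { ... }` per parent partition:
// runnables.foreach(threadPool.execute)   // as a later revision of this PR does
```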
import org.apache.spark.sql.execution.streaming.continuous.shuffle._

case class ContinuousCoalesceRDDPartition(index: Int) extends Partition {
  private[continuous] var writersInitialized: Boolean = false
Add docs on what this means.
  }

  override def clearDependencies() {
    super.clearDependencies()
As commented above, this should actually throw an exception so that this is never checkpointed.
@@ -51,7 +51,7 @@ class ContinuousDataSourceRDD(
    sc: SparkContext,
    dataQueueSize: Int,
    epochPollIntervalMs: Long,
    @transient private val readerFactories: Seq[InputPartition[UnsafeRow]])
    private val readerFactories: Seq[InputPartition[UnsafeRow]])
Since all the partitions do not need all the factories, the right thing to do is to put each partition's factory in its partition object, so that all the factories are not serialized for every task.
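A sketch of what that looks like; the getPreferredLocations hunk later in this thread already reads an inputPartition field off the partition, so the shape is roughly the following (imports shown as of the v2 reader API at the time, as an assumption):

```scala
import org.apache.spark.Partition
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.sources.v2.reader.InputPartition

// Each partition carries only its own InputPartition, so the full readerFactories
// sequence does not have to be serialized into every task.
case class ContinuousDataSourceRDDPartition(
    index: Int,
    inputPartition: InputPartition[UnsafeRow])
  extends Partition
```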
We need to be able to generate the full list of partitions from within a single task in order for coalesce to work.
Wait. I don't see the readerFactories object used anywhere other than in getPartitions, where they are saved as part of ContinuousDataSourceRDDPartition objects. And RDD.compute() seems to be picking it up from the ContinuousDataSourceRDDPartition objects, not from readerFactories. So I don't think readerFactories needs to be serialized.
At the very least, rename readerFactories to readerInputPartitions for consistency.
We list the partitions when computing the coalesce RDD. Should we instead be packing the partitions into the partitions of the coalesce RDD? I'd assumed it was valid to expect that rdd.partitions would work on executors, but maybe it's not.
  override def getDependencies: Seq[Dependency[_]] = {
    Seq(new NarrowDependency(prev) {
      def getParents(id: Int): Seq[Int] = Seq(0)
Shouldn't the one partition of this class depend on all parent RDD partitions, not just partition 0?
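To make the suggestion concrete, a one-partition-depends-on-all-parents dependency could look like this (sketch only; the helper name is made up):

```scala
import org.apache.spark.{Dependency, NarrowDependency}
import org.apache.spark.rdd.RDD

// The single coalesced output partition lists every parent partition as a parent.
def allParentsDependency[T](prev: RDD[T]): Seq[Dependency[_]] = Seq(
  new NarrowDependency(prev) {
    override def getParents(partitionId: Int): Seq[Int] = 0 until prev.getNumPartitions
  })
```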
Yeah, I confused myself when looking at the normal coalesce RDD. The default dependency handling is correct here.
Test build #92146 has finished for PR 21560 at commit
Looks great overall. (FYI I'm mostly concerned about functional correctness while reviewing.)
Left some comments, but they're minor (mostly for completeness, and this patch is an intermediate state), so you can skip addressing them.
@@ -61,12 +63,14 @@ class ContinuousShuffleReadRDD(
    numPartitions: Int,
    queueSize: Int = 1024,
    numShuffleWriters: Int = 1,
    epochIntervalMs: Long = 1000)
    epochIntervalMs: Long = 1000,
    val endpointNames: Seq[String] = Seq(s"RPCContinuousShuffleReader-${UUID.randomUUID()}"))
Same here: if possible it might be better to have complete code rather than just working with such an assumption.
This is just a default argument to make tests less wordy. I can remove it if you think that's best, but it doesn't impose a restriction.
    prev: RDD[InternalRow])
  extends RDD[InternalRow](context, Nil) {

  override def getPartitions: Array[Partition] = Array(ContinuousCoalesceRDDPartition(0))
We are addressing only the specific case where the number of partitions is 1, but we could have an assertion for that and try to write complete code so that we don't have to modify it again.
Agree. And since there's an assert(numPartitions == 1) in ContinuousCoalesceExec, we can probably just create an array of numPartitions partitions here.
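That is, something along these lines (sketch; it assumes numPartitions is threaded into the RDD as a parameter):

```scala
// Generalized getPartitions; with the assert(numPartitions == 1) in
// ContinuousCoalesceExec this still produces exactly one partition today.
override def getPartitions: Array[Partition] =
  (0 until numPartitions).map(i => ContinuousCoalesceRDDPartition(i)).toArray
```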
I've made some changes to try to restrict the assumption that the number of partitions is 1 to two places:
- ContinuousCoalesceExec
- The output partitioner in ContinuousCoalesceRDD, since it's not obvious to me what the right strategy to get this would be in the general case. If you have ideas I'm open to removing this too.
  extends RDD[UnsafeRow](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    (0 until numPartitions).map { partIndex =>
      ContinuousShuffleReadPartition(partIndex, queueSize, numShuffleWriters, epochIntervalMs)
      ContinuousShuffleReadPartition(
        partIndex, endpointNames(partIndex), queueSize, numShuffleWriters, epochIntervalMs)
This is effectively asserting numPartitions to be 1; otherwise it will throw an exception.
      case Repartition(1, false, _) =>
      case node: Aggregate =>
        val aboveSinglePartitionCoalesce = node.find {
          case Repartition(1, false, _) => true
What if we have multiple repartitions, where one meets this case and the others don't? I'm not sure we are restricting the repartition operation to appear only once.
I don't think there's any particular reason we need to. There's no reason we couldn't execute multiple repartitions if the optimizer isn't smart enough to combine them.
Oh wait, I see what you mean. Repartition(5, ...) would never be matched by this rule, since it only applies to Aggregate.
@@ -98,6 +98,10 @@ class ContinuousDataSourceRDD(
  override def getPreferredLocations(split: Partition): Seq[String] = {
    split.asInstanceOf[ContinuousDataSourceRDDPartition].inputPartition.preferredLocations()
  }

  override def clearDependencies(): Unit = {
    throw new IllegalStateException("Continuous RDDs cannot be checkpointed")
I'm wondering whether this method can be called in a normal situation, e.g. when a continuous query is gracefully terminated.
I don't know, I'm unfamiliar with this method. @tdas
@HeartSaVioR No, this method is not intended to be called in normal circumstances. And there's even less reason to call it on an internally generated RDD.
Looks good overall. Left a few comments.
          case _ => false
        }.isDefined

        if (!aboveSinglePartitionCoalesce) {
What if there was only a single partition to begin with? Then there's no need for Repartition(1) and this check should be skipped.
I agree that it wouldn't be needed, but partitioning information is not always available during analysis. So I don't think we can write the more granular check suggested here.
          case _ => false
        }.isDefined

        if (!aboveSinglePartitionCoalesce) {
Also, if there's a single parent partition and there's a Repartition(1), that node should probably be removed. Not sure if this is already being done.
(same comment as above applies here - we don't have partitioning information in analysis)
    if (!split.asInstanceOf[ContinuousCoalesceRDDPartition].writersInitialized) {
      val rpcEnv = SparkEnv.get.rpcEnv
      val outputPartitioner = new HashPartitioner(1)
Maybe I am missing something. Is this more like a repartition (just shuffles) than a coalesce?
Repartition would normally imply distributed execution, which isn't happening here.
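For reference, HashPartitioner(1) maps every key to partition 0, so the "shuffle" here is a local fan-in to the one coalesced partition rather than a distributed exchange:

```scala
import org.apache.spark.HashPartitioner

// With a single partition, every key hashes to partition 0.
val outputPartitioner = new HashPartitioner(1)
assert(outputPartitioner.getPartition("some key") == 0)
assert(outputPartitioner.getPartition(42) == 0)
```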
      runnables.foreach(threadPool.execute)
    }

    readerRDD.compute(readerRDD.partitions(split.index), context)
The writer.write and readerRDD.compute() are going to execute as separate tasks (but concurrently, since there are no stage boundaries), correct?
No, they'll be in the same task. Just different threads.
If it's the same task, then do we need the RPC mechanism to pass the rows around?
There is a queue inside the ContinuousShuffleReadRDD that is buffering all the records being sent out by the RPCContinuousShuffleWriter, and the compute function is returning data from that queue.
As I commented above, we don't really need the ContinuousShuffleReadRDD, just the ContinuousShuffleReader.
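A stripped-down picture of that handoff (the types and names here are illustrative, not the actual shuffle classes): writer threads push into a bounded queue inside the same task, and the iterator returned to the task thread drains it until the epoch marker arrives.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Illustrative message types; the real implementation passes UnsafeRows via an RPC endpoint.
sealed trait ShuffleMessage
case class Record(value: Int) extends ShuffleMessage
case object EpochMarker extends ShuffleMessage

val queue = new ArrayBlockingQueue[ShuffleMessage](1024)

// Writer side (runs on the writer threads inside the same task).
def write(value: Int): Unit = queue.put(Record(value))
def finishEpoch(): Unit = queue.put(EpochMarker)

// Reader side (roughly what compute() hands back to the task thread).
def epochIterator(): Iterator[Int] =
  Iterator.continually(queue.take())
    .takeWhile(_ != EpochMarker)
    .collect { case Record(v) => v }
```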
Agree. Also, the 2 * numShuffle partition threads here look like an overhead. Maybe OK for now, but the CoalesceRDD iterator could just iterate over the parent RDD partitions, tracking the epochs, returning the rows, and terminating when the epoch marker is received from all its parents.
Yeah, it could be made more efficient. Part of the goal here is to ensure that the shuffling does indeed work end-to-end, so we can work on both the shuffle framework and distributed repartitioning in parallel.
Test build #92310 has finished for PR 21560 at commit
Very close. Just one major refactoring comment.
  override def doExecute(): RDD[InternalRow] = {
    assert(numPartitions == 1)

    val childRdd = child.execute()
nit: Don't need this variable. And remove the excess empty lines.
s"ContinuousCoalesceRDD-part$i-${UUID.randomUUID()}" | ||
} | ||
|
||
val readerRDD = new ContinuousShuffleReadRDD( |
private
Also, honestly, you don't need the RDD here. You only need the shuffle reading code, which is the ContinuousShuffleReader and endpoint. So you can just instantiate that in the compute function. It's very confusing to have an RDD inside another RDD that is not hooked into the dependency chain.
      .coalesce(1)
      .select('value as 'copy, 'value)
      .where('copy =!= 2)
      .agg(max('value))
test transformations both before and after coalesce.
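Something along these lines, adapted from the quoted test (hypothetical; `input` and `spark` come from the test harness, and the test actually added in the PR may differ):

```scala
import org.apache.spark.sql.functions.max
import spark.implicits._   // for the 'symbol column syntax

// A projection before the coalesce and a filter plus aggregate after it.
val query = input.toDF()
  .select('value as 'copy, 'value)   // transformation before coalesce
  .coalesce(1)
  .where('copy =!= 2)                // transformation after coalesce
  .agg(max('value))
```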
You missed this comment.
Test build #92360 has finished for PR 21560 at commit
Sorry, that wasn't meant to be a complete push. Added the tests now.
LGTM assuming tests pass.
Test build #92387 has finished for PR 21560 at commit
What changes were proposed in this pull request?
Provide a continuous processing implementation of coalesce(1), as well as allowing aggregates on top of it.
The changes in ContinuousQueuedDataReader and related classes are to use split.index (the ID of the partition within the RDD currently being compute()d) rather than context.partitionId() (the partition ID of the scheduled task within the Spark job, that is, the post-coalesce writer). In the absence of a narrow dependency, these values were previously always the same, so there was no need to distinguish them.
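The distinction in code terms (a sketch; the names follow the standard RDD compute() signature):

```scala
import org.apache.spark.{Partition, TaskContext}

// Inside an RDD's compute(split, context):
//   split.index            -> partition of the RDD currently being computed
//   context.partitionId()  -> partition of the scheduled task (the post-coalesce side)
// With coalesce(1), a single task (partitionId 0) computes many parent splits, so the
// two values diverge and the reader must key its state off split.index.
def whichPartition(split: Partition, context: TaskContext): (Int, Int) =
  (split.index, context.partitionId())
```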
How was this patch tested?
new unit test