[SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization #33807
Conversation
cc @ueshin and @cloud-fan, can you take a look when you find some time?
```scala
override def outputPartitioning: Partitioning = child.outputPartitioning

override protected def doExecute(): RDD[InternalRow] = {
  child.execute().map(_.copy())
```
Why do we need to copy the unsafe rows before calling `localCheckpoint`?
Oh, I forgot to describe it. `localCheckpoint` caches (persists) the data, and since it stores the rows, they need to be copied first. This is actually what `Dataset.checkpoint` does as well: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L679
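A minimal sketch of the row-reuse hazard being discussed here, assuming a generic `RDD[InternalRow]`; the helper name `cacheSafely` is hypothetical and not from this PR:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

// Many Spark iterators reuse a single mutable row object, so persisting
// the references without copy() would cache N pointers to the same,
// last-written row.
def cacheSafely(rows: RDD[InternalRow]): RDD[InternalRow] =
  rows
    .map(_.copy())      // give each row its own backing object first
    .localCheckpoint()  // then it is safe to persist and reuse the RDD
```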
```scala
override protected def doExecute(): RDD[InternalRow] = {
  child.execute().map(_.copy())
    .localCheckpoint() // to avoid executing multiple jobs. zipWithIndex launches a Spark job.
```
I am still not sure we need to `localCheckpoint` in the middle here ... but let me keep it as is for now.
E.g., if the child RDD has a shuffle, the shuffle would be triggered twice; this checkpoint is to avoid that.
The shuffle will be reused. I think `localCheckpoint` is useful to save computation. E.g., with `df.sort(...).withSequenceColumn`, if we don't do `localCheckpoint`, the shuffle is still done only once, but the local sort after the shuffle will be done twice.
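A hedged sketch of the recomputation concern, assuming a running `SparkSession` named `spark`:

```scala
// zipWithIndex itself launches one Spark job (to count rows per
// partition), and each later action launches another; without
// localCheckpoint, the local sort above would re-run in every job.
val sorted  = spark.range(100).sort("id").rdd
val indexed = sorted.localCheckpoint().zipWithIndex()
indexed.count()   // first job materializes the checkpointed partitions
indexed.collect() // subsequent jobs reuse them instead of re-sorting
```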
```scala
 * increases one by one. This is for 'distributed-sequence' default index
 * in pandas API on Spark.
 */
case class AttachDistributedSequenceExec(
```
We could think about implementing this with an expression (like Python UDF or Window) ... but I decided to do it with plans to avoid making it too complicated.
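A rough sketch of the idea behind the new plan node, not the actual implementation: attach a 0-based, consecutive index by zipping the child's rows with `RDD.zipWithIndex`. Assumes a `SparkSession` named `spark`; the column name mirrors the `__index_level_0__` seen in the plans below.

```scala
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")
val withSeq = df.rdd
  .zipWithIndex()                                   // runs one job to count rows per partition
  .map { case (row, idx) => (idx, row.getString(0)) }
  .toDF("__index_level_0__", "value")
withSeq.show() // rows numbered 0, 1, 2 in partition order
```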
Test build #142723 has finished for PR 33807.
LGTM. One thing I'm worried about is that we can't push down filters through `AttachDistributedSequence`, but that won't happen, right?
Yeah, I think it won't happen. Just did a quick double-check.
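A small illustration (assuming a `SparkContext` named `sc`) of why a filter must not be pushed below the sequence-attaching node: the assigned numbers depend on which rows are present when the index is computed.

```scala
val data = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

// Filter evaluated after attaching the sequence keeps the original numbers:
data.zipWithIndex().filter { case (v, _) => v != "b" }.collect()
// -> Array((a,0), (c,2), (d,3))

// Filter pushed below zipWithIndex renumbers the surviving rows:
data.filter(_ != "b").zipWithIndex().collect()
// -> Array((a,0), (c,1), (d,2))
```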
Let me merge this one into 3.2 together.
Merged to master and branch-3.2.
[SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization

### What changes were proposed in this pull request?

This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage optimization of the SQL engine and avoid an unnecessary shuffle to create the default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests were added. Also, this PR will run all unit tests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 93cec49)
Signed-off-by: Hyukjin Kwon <[email protected]>