[SPARK-10731][SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame. #8876

rxin · 2015-09-23T02:11:50Z

Python DataFrame.head/take currently require scanning all partitions. This pull request changes them to delegate the actual implementation to the Scala DataFrame (by calling DataFrame.take).

This is more of a hack to fix this issue in 1.5.1. A proper fix would be to change executeCollect and executeTake to return InternalRow rather than Row, eliminating the extra round-trip conversion.
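To illustrate why scanning every partition is wasteful for a small take, here is a toy model in plain Python. It is only a sketch and does not use Spark: the names `make_partitions`, `take_scanning_all`, `take_incremental`, and the `scale_up` factor are hypothetical, and the incremental strategy is a simplified assumption about how a take-style job can stop early, not the actual Scala implementation.

```python
from typing import Callable, Dict, List

# A "partition" is modelled as a function that produces its rows when called.
Partition = Callable[[], List[int]]


def make_partitions(data: List[List[int]], counter: Dict[str, int]) -> List[Partition]:
    """Wrap each list of rows so that computing it increments a shared counter."""
    def make(rows: List[int]) -> Partition:
        def compute() -> List[int]:
            counter["computed"] += 1
            return list(rows)
        return compute
    return [make(rows) for rows in data]


def take_scanning_all(partitions: List[Partition], n: int) -> List[int]:
    """Compute every partition, then keep the first n rows."""
    rows: List[int] = []
    for compute in partitions:
        rows.extend(compute())
    return rows[:n]


def take_incremental(partitions: List[Partition], n: int, scale_up: int = 4) -> List[int]:
    """Compute partitions in growing batches and stop once n rows are available.

    The growth factor `scale_up` is an illustrative assumption.
    """
    rows: List[int] = []
    scanned = 0
    batch = 1
    while len(rows) < n and scanned < len(partitions):
        for compute in partitions[scanned:scanned + batch]:
            rows.extend(compute())
        scanned += batch
        batch *= scale_up
    return rows[:n]


if __name__ == "__main__":
    counter = {"computed": 0}
    parts = make_partitions([[1, 2], [3, 4], [5, 6], [7, 8]], counter)

    take_scanning_all(parts, 1)
    print("scan-all computed partitions:", counter["computed"])

    counter["computed"] = 0
    take_incremental(parts, 1)
    print("incremental computed partitions:", counter["computed"])
```

In this model both functions return the same rows; the difference is only in how many partitions are computed to get them.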


SparkQA · 2015-09-23T04:37:21Z

Test build #42876 has finished for PR 8876 at commit 778c2fd.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-09-23T06:22:14Z

Test build #42891 has finished for PR 8876 at commit 267ffa7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-09-23T09:12:14Z

Test build #1793 has finished for PR 8876 at commit 267ffa7.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2015-09-23T21:52:08Z

LGTM

SparkQA · 2015-09-23T23:38:15Z

Test build #42920 has finished for PR 8876 at commit c52ff7e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

… in Python DataFrame.

Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take).

This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion.

Author: Reynold Xin <[email protected]>

Closes #8876 from rxin/SPARK-10731.

(cherry picked from commit 9952217)
Signed-off-by: Reynold Xin <[email protected]>

… same in Python

## What changes were proposed in this pull request?

In PySpark, `df.take(1)` runs a single-stage job which computes only one partition of the DataFrame, while `df.limit(1).collect()` computes all partitions and runs a two-stage job. This difference in performance is confusing.

The reason why `limit(1).collect()` is so much slower is that `collect()` internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which causes Spark SQL to build a query where a global limit appears in the middle of the plan; this, in turn, ends up being executed inefficiently because limits in the middle of plans are now implemented by repartitioning to a single task rather than by running a `take()` job on the driver (this was done in #7334, a patch which was a prerequisite to allowing partition-local limits to be pushed beneath unions, etc.).

In order to fix this performance problem I think that we should generalize the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates to the Scala implementation and shares the same performance properties. This patch modifies `DataFrame.collect()` to first collect all results to the driver and then pass them to Python, allowing this query to be planned using Spark's `CollectLimit` optimizations.

## How was this patch tested?

Added a regression test in `sql/tests.py` which asserts that the expected number of jobs, stages, and tasks are run for both queries.

Author: Josh Rosen <[email protected]>

Closes #15068 from JoshRosen/pyspark-collect-limit.

(cherry picked from commit 6d06ff6)
Signed-off-by: Davies Liu <[email protected]>


[SPARK-10731][SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame.

778c2fd

Fix ExamplePointUDT.

267ffa7

UDT fix.

c52ff7e

asfgit closed this in 9952217 Sep 23, 2015

JoshRosen mentioned this pull request Sep 13, 2016

[SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python #15068

Closed
