
[SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs #12014

Closed · 4 commits

Conversation

@davies (Contributor) commented Mar 28, 2016

What changes were proposed in this pull request?

This PR adds support for chained Python UDFs, for example:

select udf1(udf2(a))
select udf1(udf2(a) + 3)
select udf1(udf2(a) + udf3(b)) 

Directly chained unary Python UDFs are also evaluated in a single batch of Python UDFs; other cases may require multiple batches.

For example,

>>> sqlContext.sql("select double(double(1))").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF#10 AS double(double(1))#9]
:     +- INPUT
+- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
   +- Scan OneRowRelation[]
>>> sqlContext.sql("select double(double(1) + double(2))").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
:     +- INPUT
+- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
   +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
      +- !BatchPythonEvaluation double(1), [pythonUDF#17]
         +- Scan OneRowRelation[]

TODO: support multiple unrelated Python UDFs in one batch (another PR).

How was this patch tested?

Added new unit tests for chained UDFs.

@davies (Contributor, Author) commented Mar 28, 2016

cc @marmbrus @rxin

@SparkQA commented Mar 28, 2016

Test build #54364 has finished for PR 12014 at commit 024a822.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor, Author) commented Mar 28, 2016

cc @cloud-fan

@hvanhovell (Contributor):

@davies I think the JIRA number should be SPARK-14215: https://issues.apache.org/jira/browse/SPARK-14215

@davies changed the title from "[SPARK-14125] [SQL] [PYSPARK] Support chained Python UDFs" to "[SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs" on Mar 28, 2016
@davies (Contributor, Author) commented Mar 28, 2016

@hvanhovell Corrected, thanks!

@SparkQA commented Mar 28, 2016

Test build #54365 has finished for PR 12014 at commit b741073.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1648,6 +1648,14 @@ def sort_array(col, asc=True):

# ---------------------------- User Defined Function ----------------------------------

def _wrap_function(sc, func, returnType):
Contributor:

What's the point of creating a new `_wrap_function` here? To decrease the size of the serialized Python function?

Contributor:

Oh I see, we want to chain the functions at the Python worker side.
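Judging from the worker-side diff in this PR (which reads a `num_commands` count), chaining at the worker amounts to folding the deserialized functions into one callable. A hypothetical sketch of that folding (not the actual worker code; `chain` and `compose_udfs` are illustrative names):

```python
from functools import reduce

def chain(f, g):
    # Apply g first, then f, matching udf1(udf2(...)).
    return lambda *a: f(g(*a))

def compose_udfs(funcs):
    # funcs are listed outermost-first; fold them into a single callable
    # so the whole chain runs in one round trip to the Python worker.
    return reduce(chain, funcs)

double = lambda x: x * 2
inc = lambda x: x + 1
composed = compose_udfs([double, inc])  # computes double(inc(x))
```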

@SparkQA commented Mar 29, 2016

Test build #2705 has finished for PR 12014 at commit b741073.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

Overall LGTM

row_based = read_int(infile)
num_commands = read_int(infile)
if row_based:
profiler = None # profiling is not supported for UDF
Member:

It seems profiler needs to be defined before this if block; the code refers to profiler later, outside the block.

Contributor (Author):

The other branch also defines profiler, so I think it's fine.

@SparkQA commented Mar 29, 2016

Test build #2707 has finished for PR 12014 at commit b741073.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case udf: PythonUDF if canEvaluate(udf) => udf
}
}

def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
// Skip EvaluatePython nodes.
case plan: EvaluatePython => plan

case plan: LogicalPlan if plan.resolved =>
// Extract any PythonUDFs from the current operator.
Contributor:

We should update the comments to explain our new strategy for extracting and evaluating Python UDFs.

Contributor (Author):

Updated.
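The extraction strategy under discussion can be illustrated with a toy pure-Python sketch (expression encodings and function names are illustrative, not Spark's actual code): walk the expression tree, collapse directly nested UDFs into one chain, and pull out a chain only once its inputs contain no other UDFs; each pass corresponds to one BatchPythonEvaluation node in the plans shown in the PR description.

```python
# Toy expressions: ("udf", name, child) | ("+", l, r) | ("lit", v) | ("ref", i)

def has_udf(e):
    return e[0] == "udf" or (e[0] == "+" and (has_udf(e[1]) or has_udf(e[2])))

def extract_batches(expr):
    """Repeatedly pull out one evaluable UDF chain per pass; each pass
    stands in for one BatchPythonEvaluation node."""
    batches, counter = [], [0]

    def pull_one(e, batch):
        if batch:                       # one chain per batch for now (see TODO)
            return e
        if e[0] == "udf":
            names, cur = [], e
            while cur[0] == "udf":      # collapse directly chained UDFs
                names.append(cur[1])
                cur = cur[2]
            if not has_udf(cur):        # inputs are UDF-free: evaluable now
                counter[0] += 1
                batch.append((names, cur))
                return ("ref", counter[0])
            return ("udf", e[1], pull_one(e[2], batch))
        if e[0] == "+":
            left = pull_one(e[1], batch)
            return ("+", left, pull_one(e[2], batch))
        return e

    while has_udf(expr):
        batch = []
        expr = pull_one(expr, batch)
        batches.append(batch)
    return batches

# select double(double(1)) -> one batch holding the composed chain
one = extract_batches(("udf", "double", ("udf", "double", ("lit", 1))))
# select double(double(1) + double(2)) -> three batches, as in the plan
three = extract_batches(
    ("udf", "double",
     ("+", ("udf", "double", ("lit", 1)), ("udf", "double", ("lit", 2)))))
```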

@davies (Contributor, Author) commented Mar 29, 2016

Merging this into master (the last commit only added comments).

@asfgit closed this in a7a93a1 on Mar 29, 2016
@SparkQA commented Mar 29, 2016

Test build #54467 has finished for PR 12014 at commit c57e8a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

@davies It sounds like you used the wrong JIRA number.

@davies (Contributor, Author) commented Mar 29, 2016

@gatorsmile Corrected in the PR, but the notification email is not updated.
