
[SPARK-24624][SQL][PYTHON] Support mixture of Python UDF and Scalar Pandas UDF #21650

Closed
wants to merge 13 commits

Conversation

@icexelloss (Contributor) commented Jun 27, 2018

What changes were proposed in this pull request?

This PR adds support for mixing Python UDFs and Scalar Pandas UDFs, in the following two cases:

(1)

from pyspark.sql.functions import col, udf, pandas_udf

@udf('int')
def f1(x):
    return x + 1

@pandas_udf('int')
def f2(x):
    return x + 1

df = spark.range(0, 1).toDF('v') \
    .withColumn('foo', f1(col('v'))) \
    .withColumn('bar', f2(col('v')))

Query plan:

>>> df.explain(True)
== Parsed Logical Plan ==
'Project [v#2L, foo#5, f2('v) AS bar#9]
+- AnalysisBarrier
      +- Project [v#2L, f1(v#2L) AS foo#5]
         +- Project [id#0L AS v#2L]
            +- Range (0, 1, step=1, splits=Some(4))

== Analyzed Logical Plan ==
v: bigint, foo: int, bar: int
Project [v#2L, foo#5, f2(v#2L) AS bar#9]
+- Project [v#2L, f1(v#2L) AS foo#5]
   +- Project [id#0L AS v#2L]
      +- Range (0, 1, step=1, splits=Some(4))

== Optimized Logical Plan ==
Project [id#0L AS v#2L, f1(id#0L) AS foo#5, f2(id#0L) AS bar#9]
+- Range (0, 1, step=1, splits=Some(4))

== Physical Plan ==
*(2) Project [id#0L AS v#2L, pythonUDF0#13 AS foo#5, pythonUDF0#14 AS bar#9]
+- ArrowEvalPython [f2(id#0L)], [id#0L, pythonUDF0#13, pythonUDF0#14]
   +- BatchEvalPython [f1(id#0L)], [id#0L, pythonUDF0#13]
      +- *(1) Range (0, 1, step=1, splits=4)

(2)

from pyspark.sql.functions import udf, pandas_udf
@udf('int')
def f1(x):
    return x + 1

@pandas_udf('int')
def f2(x):
    return x + 1

df = spark.range(0, 1).toDF('v')
df = df.withColumn('foo', f2(f1(df['v'])))

Query plan:

>>> df.explain(True)
== Parsed Logical Plan ==
Project [v#21L, f2(f1(v#21L)) AS foo#46]
+- AnalysisBarrier
      +- Project [v#21L, f1(f2(v#21L)) AS foo#39]
         +- Project [v#21L, <lambda>(<lambda>(v#21L)) AS foo#32]
            +- Project [v#21L, <lambda>(<lambda>(v#21L)) AS foo#25]
               +- Project [id#19L AS v#21L]
                  +- Range (0, 1, step=1, splits=Some(4))

== Analyzed Logical Plan ==
v: bigint, foo: int
Project [v#21L, f2(f1(v#21L)) AS foo#46]
+- Project [v#21L, f1(f2(v#21L)) AS foo#39]
   +- Project [v#21L, <lambda>(<lambda>(v#21L)) AS foo#32]
      +- Project [v#21L, <lambda>(<lambda>(v#21L)) AS foo#25]
         +- Project [id#19L AS v#21L]
            +- Range (0, 1, step=1, splits=Some(4))

== Optimized Logical Plan ==
Project [id#19L AS v#21L, f2(f1(id#19L)) AS foo#46]
+- Range (0, 1, step=1, splits=Some(4))

== Physical Plan ==
*(2) Project [id#19L AS v#21L, pythonUDF0#50 AS foo#46]
+- ArrowEvalPython [f2(pythonUDF0#49)], [id#19L, pythonUDF0#49, pythonUDF0#50]
   +- BatchEvalPython [f1(id#19L)], [id#19L, pythonUDF0#49]
      +- *(1) Range (0, 1, step=1, splits=4)

How was this patch tested?

New tests are added to BatchEvalPythonExecSuite and ScalarPandasUDFTests.

@icexelloss force-pushed the SPARK-24624-mix-udf branch from 6b47b69 to be3b99c on June 27, 2018 22:43
@icexelloss (Contributor Author)

This PR took me a while because I am not very familiar with Catalyst rules. I think in the end the change is relatively simple, but I would appreciate a more careful review from people who are familiar with Catalyst.

cc @BryanCutler @gatorsmile @HyukjinKwon @ueshin

def apply(plan: SparkPlan): SparkPlan = plan transformUp {
  // AggregateInPandasExec and FlatMapGroupsInPandasExec can be evaluated directly
  // in the Python worker, therefore we don't need to extract the UDFs.
  case plan: FlatMapGroupsInPandasExec => plan
Contributor Author

This is no longer needed because this rule will only extract Python UDFs and Scalar Pandas UDFs and ignore other types of UDFs.

@viirya (Member) commented Jun 27, 2018

@icexelloss Can you also show the query plan of the examples in the PR description? Thanks.

@maropu (Member) commented Jun 27, 2018

nit: Also, can you put [SQL][PYTHON] in the title?

@SparkQA commented Jun 28, 2018

Test build #92401 has finished for PR 21650 at commit be3b99c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 28, 2018

Test build #92400 has finished for PR 21650 at commit 6b47b69.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@icexelloss changed the title from [SPARK-24624] Support mixture of Python UDF and Scalar Pandas UDF to [SPARK-24624][SQL][PYTHON] Support mixture of Python UDF and Scalar Pandas UDF on Jun 28, 2018
@icexelloss (Contributor Author)

@viirya I have added the query plan output. @maropu I updated the PR title.

Thanks!

@BryanCutler (Member) commented Jun 28, 2018

Would you mind changing case (1) in your description? It threw me off a little as they looked independent at first glance. Maybe something like:

df = spark.range(0, 1).toDF('v') \
    .withColumn('foo', f1(df['v'])) \
    .withColumn('bar', f2(df['v']))

Also, are there any cases you know of that still aren't allowed?

df3 = df3.withColumn('f4_f2_f1', df['v'] + 1011)
df3 = df3.withColumn('f4_f3_f1', df['v'] + 1101)
df3 = df3.withColumn('f4_f3_f2', df['v'] + 1110)
df3 = df3.withColumn('f4_f3_f2_f1', df['v'] + 1111)
Member

So df3 holds the expected values?

Contributor Author

That's right. I can add a comment to make it clearer.

* Collect evaluable UDFs from the current node.
*
* This function collects Python UDFs or Scalar Python UDFs from expressions of the input node,
* and returns a list of UDFs of the same eval type.
Member

What happens if the user tries to mix a non-scalar UDF?

Contributor Author

Hmm... it currently throws an exception in the codegen stage (because non-scalar UDFs are not extracted by this rule).

We should probably throw a better exception, but I need to think a bit about how to do it.

Contributor Author

I tried this on master and got the same exception:

>>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
>>> df.select(foo(df['v'])).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", line 353, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
: java.lang.UnsupportedOperationException: Cannot evaluate expression: <lambda>(input[0, bigint, false])
	at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
	at org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
	at scala.Option.getOrElse(Option.scala:121)
        ...

Therefore, this PR doesn't change that behavior: neither master nor this PR extracts non-scalar UDFs in the expression.

Member

Yeah, that's not a very informative exception but we can fix that later. I made https://issues.apache.org/jira/browse/SPARK-24735 to track.

      case _ =>
        throw new IllegalArgumentException("Can not mix vectorized and non-vectorized UDFs")
      case (vectorizedUdfs, plainUdfs) =>
        throw new AnalysisException(
Member

Why change the exception type? Can you make a test that causes this?

Contributor Author

This is because we shouldn't reach here (otherwise it's a bug). I don't know what the best exception type is here, though.

*
* If expressions contain both UDF eval types, this function will only return Python UDFs.
*
* The caller should call this function multiple times until all evaluable UDFs are collected.
Member

So this will pipeline UDFs of the same eval type so that they can be processed together in the same call to the Python worker?

For example, if we have pandas_udf, pandas_udf, udf, udf, then both pandas_udfs will be sent together to the worker, then both udfs together; the Python runner gets executed twice.

On the other hand, if we have pandas_udf, udf, pandas_udf, udf, then each one has to be executed on its own, and the Python runner gets executed 4 times. Is that right?

Contributor Author

That's correct.
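To illustrate the two orderings discussed above (a sketch, not from the PR; the UDF names are made up and the resulting plans can be inspected with explain()):

from pyspark.sql.functions import udf, pandas_udf

@udf('long')
def plus_one(x):
    return x + 1

@pandas_udf('long')
def pd_plus_one(s):
    return s + 1

df = spark.range(3)

# Same eval types grouped together: one ArrowEvalPython pass and one
# BatchEvalPython pass in the physical plan.
df.select(pd_plus_one('id'), pd_plus_one('id'), plus_one('id'), plus_one('id')).explain()

# Chained alternating eval types: each UDF feeds the next, so per the
# discussion every switch of eval type adds another round trip to the
# Python worker.
df.select(pd_plus_one(plus_one(pd_plus_one(plus_one('id'))))).explain()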

@SparkQA commented Jun 29, 2018

Test build #92443 has finished for PR 21650 at commit 674e361.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert type(x) == int
return x + 1

def f2(x):
Member

Seems like this is neither @udf nor @pandas_udf, is it on purpose? If so, could you add a comment to explain why?

Contributor Author

Yes, the purpose is to test mixing udf, pandas_udf, and SQL expressions. I will add comments to make it clearer.

Contributor Author

Added comments in test
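For reference, a minimal sketch of the pattern being tested (illustrative names, not the exact test code):

from pyspark.sql.functions import udf, pandas_udf

@udf('int')
def f1(x):          # row-at-a-time Python UDF
    return x + 1

def f2(x):          # plain function: builds a SQL expression, no UDF at all
    return x + 10

@pandas_udf('int')
def f3(x):          # Scalar Pandas (vectorized) UDF
    return x + 100

df = spark.range(0, 1).toDF('v')
df.withColumn('out', f3(f2(f1(df['v'])))).show()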

@@ -166,8 +190,9 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
        ArrowEvalPythonExec(vectorizedUdfs, child.output ++ resultAttrs, child)
      case (vectorizedUdfs, plainUdfs) if vectorizedUdfs.isEmpty =>
        BatchEvalPythonExec(plainUdfs, child.output ++ resultAttrs, child)
      case _ =>
        throw new IllegalArgumentException("Can not mix vectorized and non-vectorized UDFs")
      case (vectorizedUdfs, plainUdfs) =>
Member

case _ => should work?

Contributor Author

Oh yes, let me revert.

@@ -97,6 +103,64 @@ class BatchEvalPythonExecSuite extends SparkPlanTest with SharedSQLContext {
}
assert(qualifiedPlanNodes.size == 1)
}

private def collectPythonExec(spark: SparkPlan): Seq[BatchEvalPythonExec] = spark.collect {
Member

plan would be better than spark?

Contributor Author

Yes! I meant to call it plan but apparently made a mistake :(

case b: BatchEvalPythonExec => b
}

private def collectPandasExec(spark: SparkPlan): Seq[ArrowEvalPythonExec] = spark.collect {
Member

ditto.

@SparkQA commented Jun 29, 2018

Test build #92482 has finished for PR 21650 at commit ce5e7f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member)

I had an idea for a slightly different approach. Would it be possible to "promote" the regular udf to a pandas_udf? By this I mean wrap the function using apply() so that it takes pd.Series as inputs and returns another pd.Series. Then we can send the entire mix of udfs and pandas_udfs to the worker in one shot, instead of in separate evaluations. Since the user is already using pandas_udfs, we know that the worker supports it, and I think the performance would be much better. Is there any downside or issue with doing it this way?
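A rough sketch of that promotion idea (hypothetical helper, single-argument case only; not part of this PR):

from pyspark.sql.functions import pandas_udf

def promote(f, return_type):
    # Lift a row-at-a-time function into a Series-at-a-time one via
    # pd.Series.apply, then register the wrapper as a scalar pandas_udf.
    return pandas_udf(lambda s: s.apply(f), return_type)

plus_one = promote(lambda x: x + 1, 'long')  # usable like any pandas_udf column function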

@icexelloss (Contributor Author) commented Jul 9, 2018

@BryanCutler I think your suggestion would change behavior. ArrowEvalExec and BatchEvalExec still differ in corner cases, for example type coercion (ArrowEvalExec supports type coercion but BatchEvalExec doesn't) and timestamp type (a regular UDF expects a Python datetime for a timestamp, while a pandas UDF expects a pd.Timestamp).

I think this is probably a good future improvement but not great for this Jira because of the behavior change. WDYT?
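For example, for the same TimestampType column the two UDF kinds see different Python types (an illustrative sketch, not from the PR):

from pyspark.sql.functions import udf, pandas_udf

@udf('string')
def row_type(t):
    return type(t).__name__              # datetime.datetime, one row at a time

@pandas_udf('string')
def series_type(t):
    # t is a pd.Series whose elements are pd.Timestamp values
    return t.apply(lambda v: type(v).__name__)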

@BryanCutler (Member)

I think the previous behavior was to not allow mixing pandas and regular udfs, but you're probably right that there are some cases where data could be handled differently. I'll try to look at this more in depth today.

@gatorsmile (Member)

ping @BryanCutler Any update about this PR?

    if (pythonUDFs.isEmpty) {
      plan.expressions.flatMap(collectEvaluableUDF(_, PythonEvalType.SQL_SCALAR_PANDAS_UDF))
    } else {
      pythonUDFs
Member

I think it would be better to loop through the expressions and find the first scalar Python UDF, either SQL_BATCHED_UDF or SQL_SCALAR_PANDAS_UDF, and then collect the rest of that type. This is really what is happening here, so I think it would be more straightforward to do it in a single loop instead of two flatMaps.

Contributor Author

@icexelloss commented Jul 23, 2018

What you said makes sense, and that was actually my first attempt, but it ended up being pretty complicated. The issue is that it is hard to find the UDFs in one traversal of the expression tree, because we need to pass the evalType down to every subtree, and the result of one subtree can affect the result of another (i.e., if we find one type of UDF in one subtree, we need to pass that type to all other subtrees because they must agree on the evalType). Because the code is recursive in nature, this makes it pretty complicated to pass the correct eval type in all places.

Another way is to do two traversals: in the first we look for the eval type, and in the second we look for UDFs of that eval type. This isn't much different from what I have now in terms of efficiency, though. I actually tried these approaches and found the current way to be the easiest to implement, the simplest logically, and the least likely to have bugs.

WDYT?

  private def collectEvaluatableUDF(expr: Expression): Seq[PythonUDF] = expr match {
    case udf: PythonUDF if PythonUDF.isScalarPythonUDF(udf) && canEvaluateInPython(udf) => Seq(udf)
    case e => e.children.flatMap(collectEvaluatableUDF)
  private def collectEvaluableUDF(expr: Expression, evalType: Int): Seq[PythonUDF] = expr match {
Member

It's a little confusing to have this function named so similarly to the one below; maybe you can combine them if just doing a single loop (see the other comment).

@@ -167,7 +191,8 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
      case (vectorizedUdfs, plainUdfs) if vectorizedUdfs.isEmpty =>
        BatchEvalPythonExec(plainUdfs, child.output ++ resultAttrs, child)
      case _ =>
        throw new IllegalArgumentException("Can not mix vectorized and non-vectorized UDFs")
        throw new AnalysisException(
          "Mixed Python and Scalar Pandas UDFs are not expected here")
Member

Change this to "Expected either Scalar Pandas UDFs or Batched UDFs but got both"

@@ -97,6 +103,64 @@ class BatchEvalPythonExecSuite extends SparkPlanTest with SharedSQLContext {
}
assert(qualifiedPlanNodes.size == 1)
}

private def collectPythonExec(plan: SparkPlan): Seq[BatchEvalPythonExec] = plan.collect {
Member

rename to collectBatchExec

case b: BatchEvalPythonExec => b
}

private def collectPandasExec(plan: SparkPlan): Seq[ArrowEvalPythonExec] = plan.collect {
Member

rename to collectArrowExec

import org.apache.spark.sql.test.SharedSQLContext
import org.apache.spark.sql.types.BooleanType

class BatchEvalPythonExecSuite extends SparkPlanTest with SharedSQLContext {
import testImplicits.newProductEncoder
import testImplicits.localSeqToDatasetHolder

val pythonUDF = new MyDummyPythonUDF
Member

pythonUDF -> pythonBatchedUDF

@@ -23,21 +23,27 @@ import scala.collection.mutable.ArrayBuffer
import org.apache.spark.api.python.{PythonEvalType, PythonFunction}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, GreaterThan, In}
import org.apache.spark.sql.execution.{FilterExec, InputAdapter, SparkPlanTest, WholeStageCodegenExec}
import org.apache.spark.sql.execution._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.test.SharedSQLContext
import org.apache.spark.sql.types.BooleanType

class BatchEvalPythonExecSuite extends SparkPlanTest with SharedSQLContext {
Member

I don't think your tests should be in this suite since it is just for BatchEvalPythonExec. How about ExtractPythonUDFsSuite?


df = self.spark.range(0, 10).toDF('v1')
df = df.withColumn('v2', udf(lambda x: x + 1, 'int')(df['v1']))
df = df.withColumn('v3', pandas_udf(lambda x: x + 2, 'int')(df['v1']))
Member

could you just chain the withColumn calls here? I think it's clearer than reassigning the df each time
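For example (a sketch; df['v1'] becomes col('v1') because df is not bound mid-chain):

from pyspark.sql.functions import col, udf, pandas_udf

df = (self.spark.range(0, 10).toDF('v1')
      .withColumn('v2', udf(lambda x: x + 1, 'int')(col('v1')))
      .withColumn('v3', pandas_udf(lambda x: x + 2, 'int')(col('v1'))))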

@@ -5471,6 +5598,22 @@ def foo(_):
self.assertEqual(r.a, 'hi')
self.assertEqual(r.b, 1)

def test_mixed_udf(self):
Member

test_mixed_udf -> test_mixed_scalar_udfs_followed_by_grouby_apply

df2 = df2.withColumn('f3_f1_f2', df['v'] + 111)
df2 = df2.withColumn('f3_f2_f1', df['v'] + 111)

self.assertEquals(df2.collect(), df1.collect())
Member

I think it would be better to combine this test with the one above and construct it as a list of cases that you could loop over, instead of so many blocks of withColumns. Something like:

class TestCase():
    def __init__(self, col_name, col_expected, col_projection, col_udf_expression, col_sql_expression):
        ...

cases = [
    TestCase('f4_f3_f2_f1', df['v'] + 1111, f4(df1['f3_f2_f1']), f4(f3(f2(f1(df['v']))), f4(f3(f1(df['v']) + 10)))
    ...]

expected_df = df

for case in cases:
    expected_df = expected_df.withColumn(case.col_name, case.col_expected)
    ....

self.assertEquals(expected_df.collect(), projection_df.collect())

Contributor Author

Sorry, could you please elaborate a bit? e.g.

TestCase('f4_f3_f2_f1', df['v'] + 1111, f4(df1['f3_f2_f1']), f4(f3(f2(f1(df['v']))), f4(f3(f1(df['v']) + 10)))

How is df1['f3_f2_f1'] defined in this test case?

Contributor Author

I chained the withColumn calls together instead of reassigning DataFrames. How does it look now?

@icexelloss force-pushed the SPARK-24624-mix-udf branch from ce5e7f5 to 78f2ebf on July 23, 2018 16:07
@SparkQA commented Jul 23, 2018

Test build #93451 has finished for PR 21650 at commit 4c9c007.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 23, 2018

Test build #93450 has finished for PR 21650 at commit 78f2ebf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@icexelloss (Contributor Author)

@BryanCutler I've addressed most of your comments and explained the ones that I didn't change. Do you mind taking another look? Thanks!

.withColumn('f3', f3(col('v'))) \
.withColumn('f4', f4(col('v'))) \
.withColumn('f2_f1', f2(col('f1'))) \
.withColumn('f3_f1', f3(col('f1'))) \
Member

This looks like it's testing udf + udf.

Contributor Author

Yeah, the way the test is written, I am trying to cover many combinations, so some combinations might not be mixed UDFs. Do you prefer that I remove these cases?

.withColumn('f1_f3', f1(f3(df['v']))) \
.withColumn('f2_f1', f2(f1(df['v']))) \
.withColumn('f2_f3', f2(f3(df['v']))) \
.withColumn('f3_f1', f3(f1(df['v']))) \
Member

Looks like the combinations of f1 and f3 duplicate a few tests in test_mixed_udf, for instance f4_f3.

Contributor Author

Yeah, the way the test is written, I am trying to cover many combinations, so there are some duplicate cases. Do you prefer that I remove these?

Member

Yea... I know it's still minor since the elapsed time will be virtually the same, but build/test time has recently been an issue, and I wonder if there's a better way than just avoiding duplicated tests for now.

Member

It was discussed here #21845

Contributor Author

I see. I don't think it's necessary (we would only remove a few cases and, like you said, the test time is virtually the same), and keeping them helps the readability of the tests (so it doesn't look like some test cases were missed).

But if that's the preferred practice, I can remove the duplicate cases in the next commit.

Member

I am okay with leaving it here too, since it's clear they are virtually the same, but let's remove duplicated or non-orthogonal tests next time.

Contributor Author

Gotcha. I will keep that in mind next time.

def test_mixed_scalar_udfs_followed_by_grouby_apply(self):
# Test Pandas UDF and scalar Python UDF followed by groupby apply
from pyspark.sql.functions import udf, pandas_udf, PandasUDFType
import pandas as pd
Member

Not a big deal at all really, but I would swap the import order (third-party, then pyspark).

assert type(x) == int
return x + 1

def f2(x):
Member

Ah, I see why it looks confusing. Can we add an assert here too (to check that it's a Column)?

Contributor Author

Added
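e.g. something along these lines (a sketch, not the exact test code):

from pyspark.sql import Column

def f2(x):
    # f2 is a plain function, so it receives a Column expression, not data
    assert isinstance(x, Column)
    return x + 10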

    // Python UDF can't be evaluated directly in JVM
    case children => !children.exists(hasPythonUDF)
  private def canEvaluateInPython(e: PythonUDF, evalType: Int): Boolean = {
    if (e.evalType != evalType) {
Member

Can we rename this function or write a comment? Both a Scalar Vectorized UDF and a normal UDF can each be evaluated in Python, but this returns false in that case.

@HyukjinKwon (Member)

I'm okay with the #21650 (comment) approach too, but it should be really simplified. Either way LGTM.

* type will be set to the eval type of the expression.
*
*/
private def canEvaluateInPython(e: PythonUDF, lazyEvalType: LazyEvalType): Boolean = {
Contributor Author

@BryanCutler I rewrote this function using mutable state based on your suggestion. It's not quite the same as your code, so please take a look and let me know if this looks better now. Thanks!

Member

The one method seems overly complicated, so I prefer the code from my suggestion.

Contributor Author

@icexelloss commented Jul 25, 2018

In your code:

  private def canEvaluateInPython(e: PythonUDF, firstEvalType: FirstEvalType): Boolean = {
    if (firstEvalType.isEvalTypeSet() && e.evalType != firstEvalType.evalType) {
      false
    } else {
      firstEvalType.evalType = e.evalType
      e.children match {
        // single PythonUDF child could be chained and evaluated in Python
        case Seq(u: PythonUDF) => canEvaluateInPython(u, firstEvalType)
        // Python UDF can't be evaluated directly in JVM
        case children => !children.exists(hasScalarPythonUDF)
      }
    }
  }

I think the confusing part here is that the value of firstEvalType.evalType keeps changing while we are traversing the tree, and we could be carrying the value across independent subtrees (i.e., after finishing one subtree, firstEvalType can be set to Scalar Pandas even though we didn't find an evaluable UDF, and we never reset it, so when we visit another subtree we could get wrong results). The fact that firstEvalType keeps changing as we traverse the tree seems very error-prone to me.

Member

I'm not sure I follow how this could get wrong results. firstEvalType.evalType = e.evalType is called only if the eval type is not set, or if it is set and equals the current eval type. In the latter case it does assign the same value again, but that's fine. If there is some case where this fails, can you add it as a test?

Contributor Author

@icexelloss commented Jul 25, 2018

Bryan, I tried to apply your implementation and this simple test fails:

import pandas as pd
from pyspark.sql.functions import udf, pandas_udf

@udf('int')
def f1(x):
    assert type(x) == int
    return x + 1

@pandas_udf('int')
def f2(x):
    assert type(x) == pd.Series
    return x + 10

df = self.spark.range(0, 1).toDF('v')
df_chained_1 = df.withColumn('f2_f1', f2(f1(df['v'])))
expected_chained_1 = df.withColumn('f2_f1', df['v'] + 11)
self.assertEquals(expected_chained_1.collect(), df_chained_1.collect())

Do you mind trying this too? Hopefully I didn't do something silly here..

Member

Is the above test part of sql/tests.py?

Contributor Author

Yes it's in the most recent commit.

Member

@BryanCutler commented Jul 25, 2018

Ok, I think I see the problem. Since there was a map over plan.expressions, a new FirstEvalType object was being created for each expression. Changing this to the following corrected the failure:

val setEvalType = new FirstEvalType
val udfs = plan.expressions.flatMap(collectEvaluableUDFs(_, setEvalType))

I updated my above code to this, does that look correct now?

Contributor Author

@icexelloss commented Jul 26, 2018

I applied your new code but the test I mentioned above still fails.

I think the issue could be that when visiting f2(f1(col('v'))), firstEvalType is set to Scalar Pandas first and isn't set to Batched SQL later, so f1 is not extracted. It's possible that my code is still different from yours somehow.

But similar to #21650 (comment), I think the state machine of firstEvalType here is fairly complicated with your suggested implementation (i.e., what is the expected state of the eval type holder before and after canEvaluateInPython, and what are the invariants of the algorithm), and I found myself thinking pretty hard to prove the state machine correct in all cases. If we want to go with this implementation, we need to think about it carefully and explain it in the code...

The lazyEvalType implementation is better IMHO because the state machine is simpler: lazyEvalType is empty until we find the first evaluable UDF, and the value doesn't change once it's set.

The first implementation (two passes, immutable state) is probably the simplest in terms of the mental complexity of the algorithm but is less efficient.

I am OK with either the immutable state or the lazy state. I think @HyukjinKwon prefers the immutable-state one. @BryanCutler WDYT?

@@ -94,36 +95,94 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
*/
object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {

private def hasPythonUDF(e: Expression): Boolean = {
private case class LazyEvalType(var evalType: Int = -1) {
Member

Hmmmm, looks messier than I thought... the previous one looks a bit better to me. WDYT @BryanCutler?

Member

I'm not too fond of the name LazyEvalType; it makes it sound like something else. Maybe CurrentEvalType?

Contributor Author

Yeah, the idea of LazyEvalType is a container object that can be set only once. Maybe the name LazyEvalType is confusing. I don't think CurrentEvalType is accurate either, because the original idea is that we don't change the value once it's set. Maybe call it EvalTypeHolder and add docs to explain?

      false
    } else {
      e.children match {
        case Seq(u: PythonUDF) => canEvaluateInPython(u, lazyEvalType)
Member

There are 2 paths for recursion here, which is probably not a good idea. This method is much more complicated now and a little difficult to follow.

@SparkQA commented Jul 25, 2018

Test build #93546 has finished for PR 21650 at commit 2bc906d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Ehh... @BryanCutler, WDYT about just doing the previous one for now? The approach you suggested sounds efficient of course, but this isn't a hot path, so I think the previous way is fine too, since it's a bit cleaner (though a bit less efficient), and partly because the code freeze is close.

@BryanCutler (Member)

Ehh... @BryanCutler, WDYT about just doing the previous one for now? The approach you suggested sounds efficient of course, but this isn't a hot path, so I think the previous way is fine too, since it's a bit cleaner (though a bit less efficient), and partly because the code freeze is close.

I didn't make the suggestion for performance; it was because it took me a while, looking at the previous code, before I realized the intent was to find the first evaluable udf and then all others matching that eval type. I think the previous code kind of masked that and made it more complicated to follow.

I wasn't really sure how the expression tree was evaluated, so my suggestion didn't handle chained expressions. The problem was that the eval type was being set while checking the children nodes; instead it should only be set after all children are determined to be of the same type. I'll update the above code again, which passes all tests as far as I can tell. I still prefer this approach, but I'm not a sql expert ;)

@HyukjinKwon (Member)

Hm, then how about giving it a try in a follow-up, @BryanCutler, if you see some value in it?

@icexelloss (Contributor Author)

@HyukjinKwon I think Bryan's implementation looks promising. Let me take a look.

@icexelloss (Contributor Author)

@BryanCutler @HyukjinKwon I updated the PR based on Bryan's suggestion. Please take a look and let me know if you have further comments.

Thanks!

@@ -94,36 +95,61 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
*/
object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {

private def hasPythonUDF(e: Expression): Boolean = {
private case class EvalTypeHolder(private var evalType: Int = -1) {
Member

@HyukjinKwon commented Jul 27, 2018

How about this:

  private type EvalType = Int
  private type EvalTypeChecker = EvalType => Boolean

  private def collectEvaluableUDFsFromExpressions(expressions: Seq[Expression]): Seq[PythonUDF] = {
    // Eval type checker is set in the middle of checking because once it's found,
    // the same eval type should be checked .. blah blah
    var evalChecker: Option[EvalTypeChecker] = None

    def collectEvaluableUDFs(expr: Expression): Seq[PythonUDF] = expr match {
      case udf: PythonUDF if PythonUDF.isScalarPythonUDF(udf) && canEvaluateInPython(udf)
        && evalChecker.isEmpty =>
        evalChecker = Some((otherEvalType: EvalType) => otherEvalType == udf.evalType)
        collectEvaluableUDFs(expr)
      case udf: PythonUDF if PythonUDF.isScalarPythonUDF(udf) && canEvaluateInPython(udf)
        && evalChecker.get(udf.evalType) =>
        Seq(udf)
      case e => e.children.flatMap(collectEvaluableUDFs)
    }

    expressions.flatMap(collectEvaluableUDFs)
  }

  def apply(plan: SparkPlan): SparkPlan = plan transformUp {
    case plan: SparkPlan => extract(plan)
  }

  /**
   * Extract all the PythonUDFs from the current operator and evaluate them before the operator.
   */
  private def extract(plan: SparkPlan): SparkPlan = {
    val udfs = collectEvaluableUDFsFromExpressions(plan.expressions)

Contributor Author

I see... you use a var and a nested function definition to remove the need for a holder object.

IMHO I usually find nested function definitions, and functions that refer to variables outside their definition scope, hard to read, but that could be my personal preference.

Another thing I like about the current impl is that the EvalTypeHolder class ensures its value is never changed once it's set, so I think that's more robust.

That being said, I am OK with your suggestion too if you insist or if @BryanCutler also prefers it.

Member

Yup, I do avoid nested functions, but here is where it's needed. If it's clear when it's set and unset within a function, I think the shorter one is fine.

Contributor Author

Ok, I will update the code then.

@SparkQA commented Jul 27, 2018

Test build #93668 has finished for PR 21650 at commit b25936d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2018

Test build #93667 has finished for PR 21650 at commit 6b22fea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

      case udf: PythonUDF if PythonUDF.isScalarPythonUDF(udf) && canEvaluateInPython(udf)
          && evalTypeChecker.isEmpty =>
        evalTypeChecker = Some((otherEvalType: EvalType) => otherEvalType == udf.evalType)
        Seq(udf)
Contributor Author

@icexelloss commented Jul 27, 2018

@HyukjinKwon In your code this line was collectEvaluableUDFs(expr). I think we should just return Seq(udf) to avoid checking the expression twice.

@SparkQA commented Jul 27, 2018

Test build #93688 has finished for PR 21650 at commit f3a45a5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@icexelloss (Contributor Author)

retest please

@SparkQA commented Jul 27, 2018

Test build #93686 has finished for PR 21650 at commit 8e995e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

LGTM.

Merged to master.

@asfgit closed this in e875209 on Jul 28, 2018
@icexelloss (Contributor Author)

Thanks @HyukjinKwon @BryanCutler for the review!
