[SPARK-23908][SQL] Add transform function. #21954

ueshin · 2018-08-02T03:12:44Z

What changes were proposed in this pull request?

This PR adds the transform function, which transforms each element of an array using the given lambda function.
Optionally, the lambda can take the index of each element as its second argument.

> SELECT transform(array(1, 2, 3), x -> x + 1);
 array(2, 3, 4)
> SELECT transform(array(1, 2, 3), (x, i) -> x + i);
 array(1, 3, 5)
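
For readers more familiar with Scala collections, the two SQL forms above behave roughly like the following sketch. This uses plain Scala collections, not Spark, and the names TransformSketch, transform and transformWithIndex are hypothetical helpers introduced only for illustration. The zero-based index is inferred from the second example's result.

```scala
// Plain Scala collections only; this is not the Spark implementation.
object TransformSketch {
  // Corresponds to: transform(arr, x -> f(x))
  def transform[A, B](arr: Seq[A])(f: A => B): Seq[B] =
    arr.map(f)

  // Corresponds to: transform(arr, (x, i) -> f(x, i)), where i is the zero-based index
  def transformWithIndex[A, B](arr: Seq[A])(f: (A, Int) => B): Seq[B] =
    arr.zipWithIndex.map { case (x, i) => f(x, i) }
}
```

Calling TransformSketch.transform(Seq(1, 2, 3))(_ + 1) and TransformSketch.transformWithIndex(Seq(1, 2, 3))(_ + _) mirrors the two queries above.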

How was this patch tested?

Added tests.

holdensmagicalunicorn · 2018-08-02T03:12:48Z

@ueshin, thanks! I am a bot who has found some folks who might be able to help with the review: @rxin, @cloud-fan and @hvanhovell

gatorsmile · 2018-08-02T03:19:40Z

cc @hvanhovell

viirya · 2018-08-02T04:22:10Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+  @transient lazy val functionsForEval: Seq[Expression] = functions.map {
+    case LambdaFunction(function, arguments, hidden) =>
+      val argumentMap = arguments.map { arg => arg.exprId -> arg }.toMap
+      function.transformUp {


Why do we need to replace the NamedLambdaVariables in function with the ones from arguments here? Aren't the arguments also NamedLambdaVariables, and haven't we already resolved the expressions in function in ResolveLambdaVariables?

I'm worried that the NamedLambdaVariable might be instantiated separately, e.g. during serialization. In that case, we might not be able to refer to the same instance and set the argument values correctly.

SparkQA · 2018-08-02T06:30:16Z

Test build #93934 has finished for PR 21954 at commit ee450c5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-02T07:05:02Z

Test build #93950 has finished for PR 21954 at commit c3bf6a0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ResolveHigherOrderFunctions(catalog: SessionCatalog) extends Rule[LogicalPlan]
case class ResolveLambdaVariables(conf: SQLConf) extends Rule[LogicalPlan]
case class NamedLambdaVariable(
case class LambdaFunction(
trait HigherOrderFunction extends Expression
trait ArrayBasedHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
case class ArrayTransform(

ueshin · 2018-08-02T07:23:38Z

Jenkins, retest this please.

hvanhovell · 2018-08-02T08:29:59Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+    val (elementType, containsNull) = input.dataType match {
+      case ArrayType(elementType, containsNull) => (elementType, containsNull)
+      case _ =>
+        val ArrayType(elementType, containsNull) = ArrayType.defaultConcreteType


When does this happen?

It happens when the first argument is not an array (e.g., https://github.com/apache/spark/pull/21954/files#diff-8e1a34391fdefa4a3a0349d7d454d86fR1798).

Then shall we fail the analysis before going into bind?

hvanhovell · 2018-08-02T08:41:58Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+    name: String,
+    dataType: DataType,
+    nullable: Boolean,
+    value: AtomicReference[Any] = new AtomicReference(),


You are only using the AtomicReference as a container, right?

Actually, I also use it when creating functionsForEval. I needed it for transformUp to work properly.

You did? Could you elaborate? There shouldn't be any concurrent access here.

When I tried to make copies of the NamedLambdaVariables, transformUp didn't replace the variables, and it generated wrong results.

Ah, maybe I should override fastEquals instead of using AtomicReference?

Hmm, it seems like just overriding fastEquals is not enough.

Yeah, that makes sense. Let's leave it for now.

I see. Thanks.
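
As a rough illustration of the "container" idea discussed in this thread: when a mutable holder is a constructor field, copies of the variable keep pointing at the same holder, so a value set through one copy is visible through the others. The sketch below is a simplified stand-in, not the Spark class. Var and SharedHolderSketch are hypothetical names, and it does not claim to reproduce the exact transformUp behavior described above.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for a lambda variable; NOT Spark's NamedLambdaVariable.
final case class Var(
    name: String,
    value: AtomicReference[Any] = new AtomicReference[Any]())

object SharedHolderSketch {
  // Returns what a copy and an independently created variable observe
  // after a value is set through the original variable.
  def demo(): (Any, Any) = {
    val original = Var("x")
    val copied = original.copy(name = "x_copy") // shares the same holder
    val separate = Var("x")                     // gets its own fresh holder
    original.value.set(42)
    (copied.value.get, separate.value.get)
  }
}
```

Here the copy sees the value that was set, while the separately instantiated variable still holds null, which is the situation the earlier serialization concern describes.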

hvanhovell

LGTM. One request: can you add a little bit of documentation on how execution currently works?

SparkQA · 2018-08-02T10:56:40Z

Test build #93973 has finished for PR 21954 at commit c3bf6a0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ResolveHigherOrderFunctions(catalog: SessionCatalog) extends Rule[LogicalPlan]
case class ResolveLambdaVariables(conf: SQLConf) extends Rule[LogicalPlan]
case class NamedLambdaVariable(
case class LambdaFunction(
trait HigherOrderFunction extends Expression
trait ArrayBasedHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
case class ArrayTransform(

ueshin · 2018-08-02T12:57:28Z

Jenkins, retest this please.

SparkQA · 2018-08-02T14:51:23Z

Test build #94002 has finished for PR 21954 at commit c3bf6a0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ResolveHigherOrderFunctions(catalog: SessionCatalog) extends Rule[LogicalPlan]
case class ResolveLambdaVariables(conf: SQLConf) extends Rule[LogicalPlan]
case class NamedLambdaVariable(
case class LambdaFunction(
trait HigherOrderFunction extends Expression
trait ArrayBasedHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
case class ArrayTransform(

gatorsmile · 2018-08-02T15:26:20Z

retest this please

SparkQA · 2018-08-02T19:53:02Z

Test build #94019 has finished for PR 21954 at commit c3bf6a0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ResolveHigherOrderFunctions(catalog: SessionCatalog) extends Rule[LogicalPlan]
case class ResolveLambdaVariables(conf: SQLConf) extends Rule[LogicalPlan]
case class NamedLambdaVariable(
case class LambdaFunction(
trait HigherOrderFunction extends Expression
trait ArrayBasedHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
case class ArrayTransform(

gatorsmile

LGTM

Thanks! Merged to master

cloud-fan · 2018-08-07T15:00:27Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+
+  override def inputTypes: Seq[AbstractDataType] = Seq(ArrayType, expectingFunctionType)
+
+  @transient lazy val functionForEval: Expression = functionsForEval.head


Does this need to be a lazy val? Seq#head is very cheap.

Ah, makes sense. We currently have some PRs for other higher-order functions, so I'll look at them and submit a follow-up if needed.

cloud-fan · 2018-08-07T15:07:23Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+  }
+
+  override def eval(input: InternalRow): Any = {
+    val arr = this.input.eval(input).asInstanceOf[ArrayData]


nit: we should do some renaming to avoid the conflict, e.g. rename ArrayBasedHigherOrderFunction#input to inputArray

I'll look at the other PRs and submit a follow-up for this as well.

cloud-fan · 2018-08-07T15:16:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala

+   */
+  private def createLambda(
+      e: Expression,
+      partialArguments: Seq[(DataType, Boolean)]): LambdaFunction = e match {


why call it "partial"?

They are partial because we only pass the dataType and nullable flag.

how about argInfo?

cloud-fan · 2018-08-07T15:25:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala

+  private def resolve(e: Expression, parentLambdaMap: LambdaVariableMap): Expression = e match {
+    case _ if e.resolved => e
+
+    case h: HigherOrderFunction if h.inputResolved =>


Can we add a basic type check here? Then we can fail fast if ArrayTransform#input is not an array type, and we don't need the hacky workaround in ArrayTransform#bind.

Let me think about it later.
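
A minimal sketch of the fail-fast idea suggested above, assuming a tiny hypothetical type model (DType, IntT, ArrayT) rather than Spark's DataType hierarchy. Instead of falling back to a default array type when the first argument is not an array, the check reports an error up front.

```scala
// Hypothetical, simplified type model; NOT Spark's DataType classes.
sealed trait DType
case object IntT extends DType
final case class ArrayT(element: DType, containsNull: Boolean) extends DType

object FailFastSketch {
  // Returns the element type and nullability for arrays, or an error message otherwise.
  def elementInfo(input: DType): Either[String, (DType, Boolean)] = input match {
    case ArrayT(element, containsNull) => Right((element, containsNull))
    case other => Left(s"transform expects an array as its first argument, got $other")
  }
}
```

With a check of this shape in the analysis phase, the binding step would only ever see array inputs and would not need a default-type fallback.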

cloud-fan · 2018-08-07T15:27:53Z

sql/core/src/test/resources/sql-tests/inputs/higher-order-functions.sql

+select transform(ys, 0) as v from nested;
+
+-- Transform a null array
+select transform(cast(null as array<int>), x -> x + 1) as v;


Shall we add a test for nested lambdas?

Actually we have some at #21965 and #21982.

arybin93 · 2019-03-22T12:36:03Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+  }
+
+  override def eval(input: InternalRow): Any = {
+    val arr = this.input.eval(input).asInstanceOf[ArrayData]


Do you have ideas about this problem?
https://issues.apache.org/jira/browse/SPARK-27052

viirya reviewed Aug 2, 2018

Add ArrayTransform.

c3bf6a0

ueshin force-pushed the issues/SPARK-23908/transform branch from ee450c5 to c3bf6a0 Compare August 2, 2018 05:00

hvanhovell reviewed Aug 2, 2018

ueshin mentioned this pull request Aug 2, 2018

[SPARK-23909][SQL] Add filter function. #21965

Closed

hvanhovell approved these changes Aug 2, 2018

gatorsmile reviewed Aug 2, 2018

asfgit closed this in 02f9677 Aug 2, 2018

cloud-fan reviewed Aug 7, 2018

This was referenced Aug 11, 2018

[SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments, and add argument type check. #22075

Closed

[SPARK-23938][SQL] Add map_zip_with function #22017

Closed

arybin93 reviewed Mar 22, 2019
