[SPARK-23587][SQL] Add interpreted execution for MapObjects expression #20771

viirya · 2018-03-08T08:38:39Z

What changes were proposed in this pull request?

Add interpreted execution for MapObjects expression.

How was this patch tested?

Added unit test.

SparkQA · 2018-03-08T11:07:36Z

Test build #88079 has finished for PR 20771 at commit 3627dc3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2018-03-08T13:24:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+  // The data with PythonUserDefinedType are actually stored with the data type of its sqlType.
+  // When we want to apply MapObjects on it, we have to use it.
+  lazy private val inputDataType = inputData.dataType match {
+    case p: PythonUserDefinedType => p.sqlType


Please use the UserDefinedType super class here.

(I just noticed that this wasn't introduced by you, but please change it anyway)
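
A minimal sketch of the point being made here, assuming a simplified type hierarchy: matching on the common super class unwraps every kind of user-defined type, not only the Python one. `UdtSketch`, `PythonUdt`, `ScalaUdt` and `storageType` are hypothetical stand-ins, not Spark's real classes.

```scala
object UdtSketch {
  // Hypothetical, simplified stand-ins for Catalyst data types.
  sealed trait DataType
  case object DoubleType extends DataType
  final case class ArrayType(elementType: DataType) extends DataType

  // Every user-defined type exposes the SQL type its data is stored as.
  abstract class UserDefinedType extends DataType { def sqlType: DataType }
  final class PythonUdt(val sqlType: DataType) extends UserDefinedType
  final class ScalaUdt(val sqlType: DataType) extends UserDefinedType

  // Matching on the super class covers all UDT subclasses with one case.
  def storageType(dt: DataType): DataType = dt match {
    case udt: UserDefinedType => udt.sqlType
    case other => other
  }
}
```

With this shape, a new UDT subclass is handled without touching the match.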

hvanhovell · 2018-03-08T13:30:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+  override def eval(input: InternalRow): Any = {
+    assert(input.numFields == 1,
+      "The input row of interpreted LambdaVariable should have only 1 field.")
+    input.get(0, dataType)


Not a change for this PR. Maybe we should use accessors here? The current `get` call uses pattern matching under the hood and is slower than virtual function dispatch. Implementing this would also be useful for BoundReference, for example.

You mean something like this?

lazy val accessor: InternalRow => Any = dataType match {
  case IntegerType => (inputRow) => inputRow.getInt(0)
  case LongType => (inputRow) => inputRow.getLong(0)
  ...
}
override def eval(input: InternalRow): Any = accessor(input)

Let's spin that off into a different ticket if we want to work on it.

Ok. After this is merged, I will create another PR for it.
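
For readers who want to try the accessor idea in isolation, here is a self-contained sketch. It assumes hypothetical stand-ins (`Row`, `DataType`, `LambdaVariableSketch`) rather than Spark's `InternalRow` and Catalyst types, and it only illustrates the shape of the approach, not the change that was eventually made.

```scala
object AccessorSketch {
  // Hypothetical stand-ins; not the real Catalyst classes.
  sealed trait DataType
  case object IntegerType extends DataType
  case object LongType extends DataType
  case object StringType extends DataType

  // A minimal row holding boxed values.
  final class Row(values: Array[Any]) {
    def getInt(i: Int): Int = values(i).asInstanceOf[Int]
    def getLong(i: Int): Long = values(i).asInstanceOf[Long]
    def get(i: Int): Any = values(i)
  }

  // The match on the data type runs once, when the accessor is built.
  def accessorFor(dataType: DataType, ordinal: Int): Row => Any = dataType match {
    case IntegerType => row => row.getInt(ordinal)
    case LongType => row => row.getLong(ordinal)
    case _ => row => row.get(ordinal)
  }

  final class LambdaVariableSketch(dataType: DataType) {
    // Built lazily on first use; eval only calls the chosen function.
    private lazy val accessor: Row => Any = accessorFor(dataType, 0)
    def eval(input: Row): Any = accessor(input)
  }
}
```

The per-row cost is then a single function call instead of a type match on every evaluation.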

SparkQA · 2018-03-08T16:03:42Z

Test build #88087 has finished for PR 20771 at commit 07f8143.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2018-03-08T17:30:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+      return inputCollection
+    }
+
+    val results = inputDataType match {


We shouldn't be doing this during eval. Please move this into a function val.

hvanhovell · 2018-03-08T17:30:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+        executeFuncOnCollection(inputCollection.asInstanceOf[ArrayData].array)
+    }
+
+    customCollectionCls match {


We shouldn't be doing this during eval. Please move this into a function val.
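
A rough sketch of what "move this into a function val" could look like, under the assumption that the target collection class is known when the expression is constructed. `CollectionConverter` and `targetCls` are hypothetical names; `targetCls` plays the role of `customCollectionCls`.

```scala
object FunctionValSketch {
  final class CollectionConverter(targetCls: Option[Class[_]]) {
    // The match runs once, on first use. eval only invokes the selected function.
    private lazy val convert: Seq[Any] => Any = targetCls match {
      case Some(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
        results => results
      case Some(cls) if classOf[scala.collection.Set[_]].isAssignableFrom(cls) =>
        results => results.toSet
      case _ =>
        results => results.toArray
    }

    def eval(results: Seq[Any]): Any = convert(results)
  }
}
```

The fallback to an array in the last case is an assumption of this sketch only.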

hvanhovell · 2018-03-08T17:30:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+    val inputCollection = inputData.eval(input)
+
+    if (inputCollection == null) {
+      return inputCollection


NIT: It is slightly clearer to return null here.

SparkQA · 2018-03-09T08:05:02Z

Test build #88114 has finished for PR 20771 at commit 9144287.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class StackTrace(elems: Seq[String])

viirya · 2018-03-09T09:15:15Z

retest this please.

hvanhovell · 2018-03-09T11:50:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+  private lazy val getResults: Seq[_] => Any = customCollectionCls match {
+    case Some(cls) if classOf[Seq[_]].isAssignableFrom(cls) =>
+      // Scala sequence
+      _.toSeq


This is the identity, right?

hvanhovell · 2018-03-09T11:51:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+      _.toSet
+    case Some(cls) if classOf[java.util.List[_]].isAssignableFrom(cls) =>
+      // Java list
+      if (cls == classOf[java.util.List[_]] || cls == classOf[java.util.AbstractList[_]] ||


IIUC you are matching against non concrete implementations of java.util.List? Maybe add this as documentation.

hvanhovell · 2018-03-09T11:53:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+        _.asJava
+      } else {
+        (results) => {
+          val builder = Try(cls.getConstructor(Integer.TYPE)).map { constructor =>


Can you try to do the constructor lookup only once? The duplication that this will cause is ok.

Not sure if I understand correctly. Please check update again.

hvanhovell · 2018-03-09T12:20:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+      x => executeFuncOnCollection(x.asInstanceOf[java.util.List[_]].asScala)
+    case ObjectType(cls) if cls == classOf[Object] =>
+      (inputCollection) => {
+        if (inputCollection.getClass.isArray) {


(I am sorry for sounding like a broken record) But can we move this check out of the function closure?

SparkQA · 2018-03-09T12:39:46Z

Test build #88119 has finished for PR 20771 at commit 9144287.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class StackTrace(elems: Seq[String])

SparkQA · 2018-03-10T12:25:34Z

Test build #88146 has finished for PR 20771 at commit e725608.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-03-23T06:52:13Z

ping @hvanhovell

hvanhovell · 2018-03-26T08:39:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+      x => executeFuncOnCollection(x.asInstanceOf[Array[_]].toSeq)
+    case ObjectType(cls) if classOf[java.util.List[_]].isAssignableFrom(cls) =>
+      x => executeFuncOnCollection(x.asInstanceOf[java.util.List[_]].asScala)
+    case ObjectType(cls) if cls == classOf[Object] =>


Ugghh... I now understand why this is needed. RowEncoder does not pass the needed type information down: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala#L146

This obviously needs to be done during evaluation. You got it right in the previous commit. I am sorry for misunderstanding this, and making you move it. Next time please call me out on this!
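
To illustrate why this check cannot be hoisted: when the static type is only `Object`, the concrete shape of the collection is known per value, so the dispatch has to happen for each input. The sketch below is an assumption-laden illustration (`toScalaSeq` is a hypothetical helper, and null inputs are assumed to be handled before it is called).

```scala
object RuntimeDispatchSketch {
  // Inspects the runtime class of each input value, because no static
  // type information is available to decide ahead of time.
  def toScalaSeq(input: Any): Seq[Any] = input match {
    case s: scala.collection.Seq[_] => s.toSeq
    case l: java.util.List[_] =>
      val buf = Seq.newBuilder[Any]
      val it = l.iterator()
      while (it.hasNext) buf += it.next()
      buf.result()
    case a if a.getClass.isArray => a.asInstanceOf[Array[_]].toSeq
    case other =>
      throw new IllegalArgumentException(
        s"unsupported input: ${other.getClass.getName}")
  }
}
```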

hvanhovell · 2018-03-26T08:45:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+      } else {
+        // Specifying concrete implementations of `java.util.List`
+        (results) => {
+          val constructors = cls.getConstructors()


Is there a way we can move the constructor resolution out of the closure? I am fine with some code duplication here :)...
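
One possible way to do the constructor resolution once, sketched under the assumption that `cls` is a concrete `java.util.List` implementation with a public constructor. `listBuilder` and `fill` are hypothetical helpers; the two near-identical closures are the code duplication mentioned above.

```scala
import java.lang.reflect.Constructor

object ListBuilderSketch {
  // Reflection happens here, once. The returned closure only calls newInstance.
  def listBuilder(cls: Class[_]): Seq[Any] => java.util.List[Any] = {
    // Prefer a constructor taking an initial capacity, if the class has one.
    val sized: Option[Constructor[_]] =
      try Some(cls.getConstructor(Integer.TYPE))
      catch { case _: NoSuchMethodException => None }

    sized match {
      case Some(ctor) =>
        results => fill(ctor.newInstance(Int.box(results.length)), results)
      case None =>
        val ctor = cls.getConstructor()
        results => fill(ctor.newInstance(), results)
    }
  }

  // Copies the results into the freshly created list instance.
  private def fill(instance: Any, results: Seq[Any]): java.util.List[Any] = {
    val list = instance.asInstanceOf[java.util.List[Any]]
    results.foreach(e => list.add(e))
    list
  }
}
```

For example, `java.util.ArrayList` takes the sized path, while `java.util.LinkedList` has no `int` constructor and falls back to the no-argument one.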

hvanhovell · 2018-03-26T09:01:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+        x => executeFuncOnCollection(x.asInstanceOf[Seq[_]])
+      }
+    case ArrayType(et, _) =>
+      x => executeFuncOnCollection(x.asInstanceOf[ArrayData].array)


This will blow up with UnsafeArrayData :(... It would be nice if we can avoid copying the entire array. We could implement an ArrayData wrapper that implements Seq or Iterable (I slightly prefer the latter).

Shall we implement this wrapper here, or a follow-up?
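
A sketch of the wrapper idea, assuming a hypothetical `ArrayLike` interface with only `numElements` and `get` (the real `ArrayData` API is richer and typed). It exposes the elements as an `Iterable` without copying them into a new array.

```scala
object ArrayWrapperSketch {
  // Hypothetical minimal interface standing in for ArrayData's element access.
  trait ArrayLike {
    def numElements: Int
    def get(ordinal: Int): Any
  }

  // Reads elements on demand from the underlying data; no intermediate copy.
  final class ArrayLikeIterable(data: ArrayLike) extends Iterable[Any] {
    override def iterator: Iterator[Any] = new Iterator[Any] {
      private var i = 0
      override def hasNext: Boolean = i < data.numElements
      override def next(): Any = {
        if (!hasNext) throw new NoSuchElementException("end of array")
        val value = data.get(i)
        i += 1
        value
      }
    }
  }
}
```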

hvanhovell · 2018-03-26T09:02:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+
+  private def executeFuncOnCollection(inputCollection: Seq[_]): Seq[_] = {
+    inputCollection.map { element =>
+      val row = InternalRow.fromSeq(Seq(element))


NIT reuse the row object.
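
A small sketch of row reuse, assuming a hypothetical one-field mutable row (`SingleFieldRow`); Spark has its own mutable row classes that would be used instead. The row is allocated once and updated per element.

```scala
object RowReuseSketch {
  // Hypothetical one-field mutable row.
  final class SingleFieldRow {
    private var value: Any = null
    def update(v: Any): Unit = { value = v }
    def get: Any = value
  }

  def mapElements(input: Seq[Any], f: SingleFieldRow => Any): Seq[Any] = {
    // Allocated once per call, not once per element.
    val row = new SingleFieldRow
    input.map { element =>
      row.update(element)
      f(row)
    }
  }
}
```

This is only safe if `f` does not keep a reference to the row after it returns, since the next element overwrites its content.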

hvanhovell · 2018-03-26T09:04:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+  }
+
+  // Executes lambda function on input collection.
+  private lazy val executeFunc: Any => Seq[_] = inputDataType match {


I am wondering if we shouldn't just return an Iterator instead of a Seq? This seems a bit more flexible and allows us to avoid materializing an intermediate sequence. WDYT?
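
To make the trade-off concrete, here is a sketch with hypothetical helpers (`executeLazily`, `toJavaList`): mapping over an `Iterator` is lazy, so the only collection that gets built is the final one chosen by the consumer.

```scala
object IteratorSketch {
  // Applies f lazily; nothing is computed until the consumer pulls elements.
  def executeLazily(input: Iterator[Any], f: Any => Any): Iterator[Any] =
    input.map(f)

  // The consumer decides the final shape, so only one collection is built.
  def toJavaList(results: Iterator[Any]): java.util.List[Any] = {
    val list = new java.util.ArrayList[Any]()
    results.foreach(e => list.add(e))
    list
  }
}
```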

hvanhovell · 2018-03-26T09:05:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+  }
+
+  // Converts the processed collection to custom collection class if any.
+  private lazy val getResults: Seq[_] => Any = customCollectionCls match {


Can you add a catch all clause that throws a nice exception to this match statement?
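
A sketch of such a catch-all clause, with a hypothetical `converterFor` helper and an illustrative message (the wording in the merged code may differ). Unsupported classes fail early with the class name in the error instead of a bare `MatchError`.

```scala
object CatchAllSketch {
  def converterFor(cls: Class[_]): Seq[Any] => Any = cls match {
    case c if classOf[Seq[_]].isAssignableFrom(c) =>
      results => results
    case c if classOf[scala.collection.Set[_]].isAssignableFrom(c) =>
      results => results.toSet
    case c =>
      // Catch-all: report which class was rejected.
      throw new RuntimeException(
        s"class ${c.getName} is not supported as a resulting collection")
  }
}
```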

SparkQA · 2018-03-29T10:33:57Z

Test build #88696 has finished for PR 20771 at commit f0ba614.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
throw new RuntimeException(s"class $cls is not supported by MapObjects as " +

SparkQA · 2018-03-29T16:53:19Z

Test build #88708 has finished for PR 20771 at commit face72c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-29T17:34:26Z

Test build #88709 has finished for PR 20771 at commit d4f0ecb.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
throw new RuntimeException(s\"class$`

viirya · 2018-04-03T14:02:24Z

ping @hvanhovell Do you have any more comments? Thanks.

hvanhovell

LGTM. Merging to master. Thanks!

viirya · 2018-04-03T23:39:23Z

Thanks @hvanhovell, I will open another ticket & PR for the accessors, based on #20771 (comment).

hvanhovell · 2018-04-03T23:42:26Z

@viirya can you also file a ticket for the UnsafeArrayData.array issue? We should just provide an IndexedSeq for ArrayData.

viirya · 2018-04-04T07:44:09Z

@hvanhovell Sure, I will do it too.

## What changes were proposed in this pull request?
Add interpreted execution for `MapObjects` expression.
## How was this patch tested?
Added unit test.
Author: Liang-Chi Hsieh <[email protected]>
Closes apache#20771 from viirya/SPARK-23587.

hvanhovell · 2018-04-04T08:51:47Z

@viirya to be clear: let's do this in two separate JIRAs/PRs.

viirya · 2018-04-04T09:01:13Z

@hvanhovell Yes. I thought "do it together" would be confusing, so I changed it to "do it too" later. :)

viirya added 3 commits March 8, 2018 08:14

Add interpreted execution for MapObjects expression.

c55a634

Merge remote-tracking branch 'upstream/master' into SPARK-23587

f86f40e

Fix bug.

3627dc3

Use lazy to avoid call dataType on UnresolvedAttribute.

07f8143

hvanhovell reviewed Mar 8, 2018

Merge remote-tracking branch 'upstream/master' into SPARK-23587

9144287

hvanhovell reviewed Mar 9, 2018

Address comments.

e725608

hvanhovell reviewed Mar 26, 2018

Address comments.

f0ba614

Improve test case.

d4f0ecb

viirya force-pushed the SPARK-23587 branch from face72c to d4f0ecb Compare March 29, 2018 14:18

hvanhovell approved these changes Apr 3, 2018

asfgit closed this in 1035aaa Apr 3, 2018

viirya deleted the SPARK-23587 branch December 27, 2023 18:35
