
[SPARK-23935][SQL] Adding map_entries function #21236

Closed

Conversation

mn-mikke
Contributor

@mn-mikke mn-mikke commented May 4, 2018

What changes were proposed in this pull request?

This PR adds the map_entries function, which returns an unordered array of all entries in the given map.
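As an illustration only (plain Java, not the Spark implementation; the class and method names here are hypothetical), the semantics can be sketched as turning each key/value pair of a map into a two-field entry:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapEntriesSketch {
    // Hypothetical sketch of map_entries semantics: each key/value pair of the
    // map becomes one (key, value) entry; the order of entries is unspecified.
    static List<int[]> mapEntries(Map<Integer, Integer> m) {
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : m.entrySet()) {
            out.add(new int[] {e.getKey(), e.getValue()});
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] entry : mapEntries(Map.of(1, 5, 2, 6))) {
            System.out.println(entry[0] + " -> " + entry[1]);
        }
    }
}
```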

How was this patch tested?

New tests added into:

  • CollectionExpressionSuite
  • DataFrameFunctionsSuite

CodeGen examples

Primitive types

val df = Seq(Map(1 -> 5, 2 -> 6)).toDF("m")
df.filter('m.isNotNull).select(map_entries('m)).debugCodegen

Result:

/* 042 */         boolean project_isNull_0 = false;
/* 043 */
/* 044 */         ArrayData project_value_0 = null;
/* 045 */
/* 046 */         final int project_numElements_0 = inputadapter_value_0.numElements();
/* 047 */         final ArrayData project_keys_0 = inputadapter_value_0.keyArray();
/* 048 */         final ArrayData project_values_0 = inputadapter_value_0.valueArray();
/* 049 */
/* 050 */         final long project_size_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray(
/* 051 */           project_numElements_0,
/* 052 */           32);
/* 053 */         if (project_size_0 > 2147483632) {
/* 054 */           final Object[] project_internalRowArray_0 = new Object[project_numElements_0];
/* 055 */           for (int z = 0; z < project_numElements_0; z++) {
/* 056 */             project_internalRowArray_0[z] = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(new Object[]{project_keys_0.getInt(z), project_values_0.getInt(z)});
/* 057 */           }
/* 058 */           project_value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_internalRowArray_0);
/* 059 */
/* 060 */         } else {
/* 061 */           final byte[] project_arrayBytes_0 = new byte[(int)project_size_0];
/* 062 */           UnsafeArrayData project_unsafeArrayData_0 = new UnsafeArrayData();
/* 063 */           Platform.putLong(project_arrayBytes_0, 16, project_numElements_0);
/* 064 */           project_unsafeArrayData_0.pointTo(project_arrayBytes_0, 16, (int)project_size_0);
/* 065 */
/* 066 */           final int project_structsOffset_0 = UnsafeArrayData.calculateHeaderPortionInBytes(project_numElements_0) + project_numElements_0 * 8;
/* 067 */           UnsafeRow project_unsafeRow_0 = new UnsafeRow(2);
/* 068 */           for (int z = 0; z < project_numElements_0; z++) {
/* 069 */             long offset = project_structsOffset_0 + z * 24L;
/* 070 */             project_unsafeArrayData_0.setLong(z, (offset << 32) + 24L);
/* 071 */             project_unsafeRow_0.pointTo(project_arrayBytes_0, 16 + offset, 24);
/* 072 */             project_unsafeRow_0.setInt(0, project_keys_0.getInt(z));
/* 073 */             project_unsafeRow_0.setInt(1, project_values_0.getInt(z));
/* 074 */           }
/* 075 */           project_value_0 = project_unsafeArrayData_0;
/* 076 */
/* 077 */         }
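A side note on the loop above: each struct element is recorded in the array header as a single long whose upper 32 bits hold the element's byte offset and whose lower 32 bits hold its length in bytes (24 here), as in the generated `(offset << 32) + 24L`. A minimal sketch of that encoding (hedged; this mirrors but does not reuse Spark's UnsafeArrayData internals):

```java
public class OffsetAndSizeSketch {
    // Pack a byte offset (upper 32 bits) and a length in bytes (lower 32 bits)
    // into one long, like the generated (offset << 32) + 24L above.
    static long pack(long offset, long size) {
        return (offset << 32) | size;
    }

    // Recover the offset from the upper 32 bits.
    static int offsetOf(long packed) {
        return (int) (packed >>> 32);
    }

    // Recover the length from the lower 32 bits.
    static int sizeOf(long packed) {
        return (int) packed;
    }

    public static void main(String[] args) {
        long packed = pack(40, 24);
        System.out.println(offsetOf(packed)); // 40
        System.out.println(sizeOf(packed));   // 24
    }
}
```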

Non-primitive types

val df = Seq(Map("a" -> "foo", "b" -> null)).toDF("m")
df.filter('m.isNotNull).select(map_entries('m)).debugCodegen

Result:

/* 042 */         boolean project_isNull_0 = false;
/* 043 */
/* 044 */         ArrayData project_value_0 = null;
/* 045 */
/* 046 */         final int project_numElements_0 = inputadapter_value_0.numElements();
/* 047 */         final ArrayData project_keys_0 = inputadapter_value_0.keyArray();
/* 048 */         final ArrayData project_values_0 = inputadapter_value_0.valueArray();
/* 049 */
/* 050 */         final Object[] project_internalRowArray_0 = new Object[project_numElements_0];
/* 051 */         for (int z = 0; z < project_numElements_0; z++) {
/* 052 */           project_internalRowArray_0[z] = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(new Object[]{project_keys_0.getUTF8String(z), project_values_0.getUTF8String(z)});
/* 053 */         }
/* 054 */         project_value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_internalRowArray_0);

@mn-mikke
Contributor Author

mn-mikke commented May 4, 2018

cc @ueshin @gatorsmile

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented May 4, 2018

Test build #90216 has finished for PR 21236 at commit 086e223.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class MapEntries(child: Expression) extends UnaryExpression with ExpectsInputTypes

@SparkQA

SparkQA commented May 5, 2018

Test build #90221 has finished for PR 21236 at commit b9e2409.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

s"""
|final int $structSize = ${UnsafeRow.calculateBitSetWidthInBytes(2) + longSize * 2};
Member

We can calculate structSize beforehand and inline it?

s"""
|final int $structSize = ${UnsafeRow.calculateBitSetWidthInBytes(2) + longSize * 2};
|final long $byteArraySize = $calculateArraySize($numElements, $longSize + $structSize);
|final int $structsOffset = $calculateHeader($numElements) + $numElements * $longSize;
Member

We can move this into else-clause?

mn-mikke added 2 commits May 7, 2018 15:15
…p_entries-to-master

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@SparkQA

SparkQA commented May 7, 2018

Test build #90318 has finished for PR 21236 at commit d05ad9b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 8, 2018

Test build #90367 has finished for PR 21236 at commit 6aa90ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@ueshin ueshin left a comment

LGTM except for one question.


val baseOffset = Platform.BYTE_ARRAY_OFFSET
val longSize = LongType.defaultSize
val structSize = UnsafeRow.calculateBitSetWidthInBytes(2) + longSize * 2
Member

I'm wondering whether it is right to use longSize here?
I know the value is 8, the same as the word size, but the meaning feels different?
cc @gatorsmile @cloud-fan

Contributor Author

@ueshin Really good question. I'm eager to learn the true purpose of the DataType.defaultSize function. Currently, it's used in this sense in more places (e.g. GenArrayData.genCodeToCreateArrayData and CodeGenerator.createUnsafeArray).

What about using Long.BYTES from Java 8 instead?

Member

IMHO, 8 is the better choice, since this value is not related to the element size of a long.
Ideally, it would be best to define a new constant for it.

Contributor Author

@kiszk Thanks for your suggestion, but it seems to me that LongType.defaultSize could be used in this case. The purpose of defaultSize doesn't appear to be limited to calculating estimated data sizes for statistics: GenerateUnsafeProjection.writeArrayToBuffer, InterpretedUnsafeProjection.getElementSize and other parts utilize defaultSize in the same way.

Member

This is not for the element size of arrays. I agree with @kiszk to use 8.
Maybe we need to add a constant to represent the word size in UnsafeRow or somewhere in the future pr.

Contributor Author

Oh OK, I misunderstood the comments. Thanks guys!
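For context on the numbers in this thread: for a struct of two int fields, the null-tracking bitset rounds up to one 8-byte word, and each fixed-length field occupies one 8-byte word, giving the 24-byte struct size that appears as 24L in the generated code. A hedged sketch of that arithmetic (bitSetWidthInBytes here is a stand-in mirroring what UnsafeRow.calculateBitSetWidthInBytes computes):

```java
public class StructSizeSketch {
    // Stand-in for UnsafeRow.calculateBitSetWidthInBytes: one 8-byte word per
    // 64 fields, rounded up (an assumption based on the usual bitset layout).
    static int bitSetWidthInBytes(int numFields) {
        return ((numFields + 63) / 64) * 8;
    }

    public static void main(String[] args) {
        // 8 is the word size the reviewers suggest using instead of
        // LongType.defaultSize (same value, different meaning).
        int wordSize = 8;
        int structSize = bitSetWidthInBytes(2) + wordSize * 2;
        System.out.println(structSize); // 24
    }
}
```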

| $unsafeArrayData.pointTo($data, $baseOffset, (int)$byteArraySize);
| UnsafeRow $unsafeRow = new UnsafeRow(2);
| for (int z = 0; z < $numElements; z++) {
| long offset = $structsOffset + z * $structSize;
Member

nit: $structSize -> ${structSize}L

@SparkQA

SparkQA commented May 14, 2018

Test build #90557 has finished for PR 21236 at commit 56ff20a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.



s"$values.isNullAt(z) ? null : (Object)${getValue(values)}"
} else {
getValue(values)
}
Member

nit: indent

s"""
|final long $byteArraySize = $calculateArraySize($numElements, ${longSize + structSize});
|if ($byteArraySize > ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}) {
| ${genCodeForAnyElements(ctx, keys, values, arrayData, numElements)}
Member

Hmm, should we use this idiom for other array functions? WDYT?

Contributor Author

For now, I've separated out the logic so that I can leverage it for the map_from_entries function. Moreover, I think it should be possible to replace UnsafeArrayData.createUnsafeArray with this logic, but I will do that in a different PR.
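For reference, the guard quoted above compares the computed byte-array size against ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH (Integer.MAX_VALUE - 15 = 2147483632, the literal visible in the generated code). A simplified sketch of that decision (the size formula here is an approximation for illustration, not Spark's exact calculateSizeOfUnderlyingByteArray):

```java
public class SizeGuardSketch {
    // Mirrors ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH = Integer.MAX_VALUE - 15.
    static final long MAX_ROUNDED_ARRAY_LENGTH = Integer.MAX_VALUE - 15; // 2147483632

    // Approximate stand-in for the byte-array size: one 8-byte header word,
    // plus one 8-byte offset-and-size word and the payload per element.
    static long byteArraySize(int numElements, int perElementBytes) {
        return 8L + (long) numElements * (8L + perElementBytes);
    }

    // Fall back to the generic (object-based) array when the unsafe
    // fixed-layout byte array would exceed the maximum allowed length.
    static boolean useGenericFallback(int numElements, int perElementBytes) {
        return byteArraySize(numElements, perElementBytes) > MAX_ROUNDED_ARRAY_LENGTH;
    }

    public static void main(String[] args) {
        System.out.println(useGenericFallback(2, 24));                 // small map
        System.out.println(useGenericFallback(Integer.MAX_VALUE, 24)); // huge map
    }
}
```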

@SparkQA

SparkQA commented May 14, 2018

Test build #90596 has finished for PR 21236 at commit 1bd0d5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 17, 2018

Test build #90720 has finished for PR 21236 at commit baa61e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mn-mikke
Contributor Author

@ueshin, @kiszk Thank you for the valuable comments! Do you have any more?

@ueshin
Member

ueshin commented May 21, 2018

I'll retrigger the build just to check again.

@ueshin
Member

ueshin commented May 21, 2018

Jenkins, retest this please.

@SparkQA

SparkQA commented May 21, 2018

Test build #90880 has finished for PR 21236 at commit baa61e5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mn-mikke
Contributor Author

retest this please

@SparkQA

SparkQA commented May 21, 2018

Test build #90881 has finished for PR 21236 at commit baa61e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented May 21, 2018

Jenkins, retest this please.

@mn-mikke
Contributor Author

retest this please

@SparkQA

SparkQA commented May 21, 2018

Test build #90887 has finished for PR 21236 at commit baa61e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2018

Test build #90888 has finished for PR 21236 at commit baa61e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented May 21, 2018

Thanks! merging to master.

@asfgit asfgit closed this in a6e883f May 21, 2018
asfgit pushed a commit that referenced this pull request Jul 17, 2018
… collection expressions.

## What changes were proposed in this pull request?

The PR tries to avoid serialization of private fields of already added collection functions and follows up on comments in [SPARK-23922](#21028) and [SPARK-23935](#21236)

## How was this patch tested?

Run tests from:
- CollectionExpressionSuite.scala
- DataFrameFunctionsSuite.scala

Author: Marek Novotny <[email protected]>

Closes #21352 from mn-mikke/SPARK-24305.
@@ -98,6 +98,9 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks {
if (expected.isNaN) result.isNaN else expected == result
case (result: Float, expected: Float) =>
if (expected.isNaN) result.isNaN else expected == result
case (result: UnsafeRow, expected: GenericInternalRow) =>
Member

@mn-mikke I was just looking over compiler warnings, and noticed it claims this case is never triggered. I think it's because it would always first match the (InternalRow, InternalRow) case above. Should it go before that then?

Contributor Author

Hi @srowen,
The (InternalRow, InternalRow) case was introduced later in #21838 and covers the logic of the UnsafeRow case, so we can just remove the unreachable piece of code.

Member

Roger that, looks like Wenchen just did so. Thanks!

asfgit pushed a commit that referenced this pull request Oct 25, 2018
## What changes were proposed in this pull request?

- Revert [SPARK-23935][SQL] Adding map_entries function: #21236
- Revert [SPARK-23937][SQL] Add map_filter SQL function: #21986
- Revert [SPARK-23940][SQL] Add transform_values SQL function: #22045
- Revert [SPARK-23939][SQL] Add transform_keys function: #22013
- Revert [SPARK-23938][SQL] Add map_zip_with function: #22017
- Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR: #21434

## How was this patch tested?
The existing tests.

Closes #22827 from gatorsmile/revertMap2.4.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>