-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23935][SQL] Adding map_entries function #21236
[SPARK-23935][SQL] Adding map_entries function #21236
Conversation
ok to test |
Test build #90216 has finished for PR 21236 at commit
|
Test build #90221 has finished for PR 21236 at commit
|
} | ||
|
||
s""" | ||
|final int $structSize = ${UnsafeRow.calculateBitSetWidthInBytes(2) + longSize * 2}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can calculate structSize
beforehand and inline it?
s""" | ||
|final int $structSize = ${UnsafeRow.calculateBitSetWidthInBytes(2) + longSize * 2}; | ||
|final long $byteArraySize = $calculateArraySize($numElements, $longSize + $structSize); | ||
|final int $structsOffset = $calculateHeader($numElements) + $numElements * $longSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can move this into else-clause?
…p_entries-to-master # Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Test build #90318 has finished for PR 21236 at commit
|
…p_entries-to-master
Test build #90367 has finished for PR 21236 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for one question.
|
||
val baseOffset = Platform.BYTE_ARRAY_OFFSET | ||
val longSize = LongType.defaultSize | ||
val structSize = UnsafeRow.calculateBitSetWidthInBytes(2) + longSize * 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering it is right to use longSize
here?
I know the value is 8
and is same as the word size, but feel like the meaning is different?
cc @gatorsmile @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin Really good question. I'm eager to learn about the true purpose of the DataType.defaultSize
function. Currently, it's used in this meaning at more places (e.g.GenArrayData.genCodeToCreateArrayData
and CodeGenerator.createUnsafeArray
.)
What about using Long.BYTES
from Java 8 instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, 8
is the better choice since it is not related to an element size of long
.
To my best guess, it would be best to define a new constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiszk Thanks for your suggestion, but it seems to me that LongType.defaultSize
could be used in this case. It seems that the purpose of defaultSize
is not only the calculation of estimated data size in statistics. GenerateUnsafeProjection.writeArrayToBuffer
, InterpretedUnsafeProjection.getElementSize
and other parts utilize defaultSize
in the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not for the element size of arrays. I agree with @kiszk to use 8
.
Maybe we need to add a constant to represent the word size in UnsafeRow
or somewhere in the future pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh OK, I misunderstood the comments. Thanks guys!
| $unsafeArrayData.pointTo($data, $baseOffset, (int)$byteArraySize); | ||
| UnsafeRow $unsafeRow = new UnsafeRow(2); | ||
| for (int z = 0; z < $numElements; z++) { | ||
| long offset = $structsOffset + z * $structSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: $structSize
-> ${$structSize}L
Test build #90557 has finished for PR 21236 at commit
|
|
||
val baseOffset = Platform.BYTE_ARRAY_OFFSET | ||
val longSize = LongType.defaultSize | ||
val structSize = UnsafeRow.calculateBitSetWidthInBytes(2) + longSize * 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not for the element size of arrays. I agree with @kiszk to use 8
.
Maybe we need to add a constant to represent the word size in UnsafeRow
or somewhere in the future pr.
s"$values.isNullAt(z) ? null : (Object)${getValue(values)}" | ||
} else { | ||
getValue(values) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
s""" | ||
|final long $byteArraySize = $calculateArraySize($numElements, ${longSize + structSize}); | ||
|if ($byteArraySize > ${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}) { | ||
| ${genCodeForAnyElements(ctx, keys, values, arrayData, numElements)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, should we use this idiom for other array functions? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I separated the logic that I can leverage for map_from_entries
function. Moreover, I think it should be possible to replace UnsafeArrayData.createUnsafeArray
with that logic, but will do it in a different PR.
Test build #90596 has finished for PR 21236 at commit
|
Test build #90720 has finished for PR 21236 at commit
|
I'd retrigger the build for just checking again. |
Jenkins, retest this please. |
Test build #90880 has finished for PR 21236 at commit
|
retest this please |
Test build #90881 has finished for PR 21236 at commit
|
Jenkins, retest this please. |
retest this please |
Test build #90887 has finished for PR 21236 at commit
|
Test build #90888 has finished for PR 21236 at commit
|
Thanks! merging to master. |
… collection expressions. ## What changes were proposed in this pull request? The PR tries to avoid serialization of private fields of already added collection functions and follows up on comments in [SPARK-23922](#21028) and [SPARK-23935](#21236) ## How was this patch tested? Run tests from: - CollectionExpressionSuite.scala - DataFrameFunctionsSuite.scala Author: Marek Novotny <[email protected]> Closes #21352 from mn-mikke/SPARK-24305.
@@ -98,6 +98,9 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks { | |||
if (expected.isNaN) result.isNaN else expected == result | |||
case (result: Float, expected: Float) => | |||
if (expected.isNaN) result.isNaN else expected == result | |||
case (result: UnsafeRow, expected: GenericInternalRow) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mn-mikke I was just looking over compiler warnings, and noticed it claims this case is never triggered. I think it's because it would always first match the (InternalRow, InternalRow) case above. Should it go before that then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roger that, looks like Wenchen just did so. Thanks!
## What changes were proposed in this pull request? - Revert [SPARK-23935][SQL] Adding map_entries function: #21236 - Revert [SPARK-23937][SQL] Add map_filter SQL function: #21986 - Revert [SPARK-23940][SQL] Add transform_values SQL function: #22045 - Revert [SPARK-23939][SQL] Add transform_keys function: #22013 - Revert [SPARK-23938][SQL] Add map_zip_with function: #22017 - Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR: #21434 ## How was this patch tested? The existing tests. Closes #22827 from gatorsmile/revertMap2.4. Authored-by: gatorsmile <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR adds
map_entries
function that returns an unordered array of all entries in the given map.How was this patch tested?
New tests added into:
CollectionExpressionSuite
DataFrameFunctionsSuite
CodeGen examples
Primitive types
Result:
Non-primitive types
Result: