-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23922][SQL] Add arrays_overlap function #21028
Conversation
Test build #89130 has finished for PR 21028 at commit
|
cc @ueshin |
Test build #89132 has finished for PR 21028 at commit
|
Test build #89450 has finished for PR 21028 at commit
|
* Checks if the two arrays contain at least one common element. | ||
*/ | ||
@ExpressionDescription( | ||
usage = "_FUNC_(a1, a2) - Returns true if a1 contains at least an element present also in a2.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a note for null handling?
|
||
override def dataType: DataType = BooleanType | ||
|
||
override def inputTypes: Seq[AbstractDataType] = left.dataType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seq(ArrayType, ArrayType)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, because in that way we would loose the information about the elementType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are similar functions, such as array_union
(#21061), array_intersect
(#21102), array_except
(#21103), and maybe concat
(#20858) which is slightly different though, to handle two (or more) arrays with the same element type.
I think we should use the same way to specify and check input types.
I'd like to discuss the best way for it here or somewhere else.
cc @kiszk @mn-mikke Do you have any suggestions?
Also cc @gatorsmile @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If common functions for a method, which accepts two array with the same type, are provided as trait, it would be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can create a new trait, which will first make sure all its children are array type, and then make sure all its children are same type after implicit type cast(make sure other databases also do implicit type cast for these functions).
Then update TypeCoercion
rule to handle this trait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea if implicit type cast is not allowed for these functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick note, there is a dedicated type coercion rule for concat
functions. So if there was the trait you described, we could remove the rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mn-mikke I am not sure, since it is quite a strange case, since it allows also string and byte. I am not sure we can do this with implicit type cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 Sorry, I should have been more explicit. I've been referring to the below case that I added into FunctionArgumentConversion
due to enabling type coercion of array types.
case c @ Concat(children) if children.forall(c => ArrayType.acceptsType(c.dataType)) &&
!haveSameType(children) =>
val types = children.map(_.dataType)
findWiderCommonType(types) match {
case Some(finalDataType) => Concat(children.map(Cast(_, finalDataType)))
case None => c
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implicit type cast is allowed in Presto. I am pushing here a proposal of trait, let me know what you think about it. Thanks.
checkEvaluation(ArraysOverlap(a5, a6), true) | ||
checkEvaluation(ArraysOverlap(a5, a7), null) | ||
checkEvaluation(ArraysOverlap(a6, a7), false) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add cases for one of the two arguments is null
and ArraysOverlap(Seq(null), Seq(null))
?
) | ||
} | ||
) | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can skip if the right is containsNull == false
?
| } | ||
| } | ||
| } | ||
|} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
Test build #89639 has finished for PR 21028 at commit
|
anymore comments @ueshin ? |
Test build #89926 has finished for PR 21028 at commit
|
* Base trait for [[BinaryExpression]]s with two arrays of the same element type and implicit | ||
* casting. | ||
*/ | ||
trait BinaryArrayExpressionWithImplicitCast extends BinaryExpression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ImplicitCastInputTypes
trait is able to work with any number of children. Would it be possible to implement this trait to behave in the same way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible indeed. Though, as far as I know there is no use case for a function with a different number of children, so I am not sure if it makes sense to generalize it. @cloud-fan @kiszk @ueshin WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiszk you are not wrong, but Concat
is a very specific case, since it supports also String
s and Binary
s, so it would anyway require a specific implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I would like to hear other opinions
Test build #89930 has finished for PR 21028 at commit
|
Test build #89932 has finished for PR 21028 at commit
|
Test build #89935 has finished for PR 21028 at commit
|
retest this please |
Test build #89949 has finished for PR 21028 at commit
|
true | ||
""", since = "2.4.0") | ||
// scalastyle:off line.size.limit | ||
case class ArraysOverlap(left: Expression, right: Expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you override prettyName
to a value following the conventions?
override def prettyName: String = "arrays_overlap"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, thanks!
} | ||
|
||
override def checkInputDataTypes(): TypeCheckResult = { | ||
TypeCoercion.findWiderTypeForTwo(left.dataType, right.dataType) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not call findWiderTypeForTwo
here. Instead we should directly check if the 2 inputs are array type and have the same element type, so that it's safe to say this expression is resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with your suggestion, if we have two arrays with different element types (but one can be casted to the other), we would throw an exception. Instead, with this approach that use case is valid and we perform an implicit cast to the wider type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we would throw an exception
No, the type coercion rule will be run again and again until this expression is resolved, or hit fixed point and fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, my bad, I got confused. I am fixing this, thanks.
Test build #90117 has finished for PR 21028 at commit
|
Test build #90510 has finished for PR 21028 at commit
|
Test build #90518 has finished for PR 21028 at commit
|
*/ | ||
private def fastEval(arr1: ArrayData, arr2: ArrayData): Any = { | ||
var hasNull = false | ||
val (bigger, smaller, biggerDt) = if (arr1.numElements() > arr2.numElements()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the biggerDt
is not used
index: String, | ||
code: String, | ||
isNullCode: String): String = { | ||
if (inputTypes.exists(_.asInstanceOf[ArrayType].containsNull)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this depend on whether the input array arrayVar
contains null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately we don't know which one we have here (the left or the rigth) as arrayVar
, since we don't know which one is the smaller/bigger and this can change record to record. So we can skip the null check only if both them don't contain null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see, makes sense!
s"${ev.isNull} = true;") | ||
s""" | ||
|for (int $i = 0; $i < $bigger.numElements(); $i ++) { | ||
|$isInSmaller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 space
s""" | ||
|for (int $j = 0; $j < $smaller.numElements(); $j ++) { | ||
| $compareValues | ||
| if (${ev.value}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this if?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was in the wrong place, thanks, nice catch!
|} | ||
|if ($smaller.numElements() > 0) { | ||
| $comparisonCode | ||
|} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if smaller
is empty, we should return false. However, the code here depends on the initial value of ev.value
and ev.isNull
, which, according to nullSafeCodeGen
, depends on nullable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but ev.isNull
anyway is it initiated to false
, unless one of the input is null
. And in that case we don't even reach this point because we just return null
.
ev.value
is initiated always to the defaultValue
which is false
. So when we arrive here we are sure that they are both false
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see, sorry I misread the code in nullSafeCodeGen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np, thanks for checking it. Always better one check more than one less.
|if (${ctx.genEqual(elementType, getFromSmaller, getFromBigger)}) { | ||
| ${ev.isNull} = false; | ||
| ${ev.value} = true; | ||
| break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we wanna break 2 loops here, that's why we generate 2 break
s. it might be cleaner to generate
for (int $i = 0; $i < $bigger.numElements() && !${ev.value}; $i ++) {
for (int $j = 0; $j < $smaller.numElements() && !${ev.value}; $j ++) {
...
}
}
Test build #90592 has finished for PR 21028 at commit
|
Test build #90635 has finished for PR 21028 at commit
|
*/ | ||
private def bruteForceEval(arr1: ArrayData, arr2: ArrayData): Any = { | ||
var hasNull = false | ||
if (arr1.numElements() > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also compare the numElements
here and check the array is empty for the smaller one? Otherwise the result is different if the arr1
is not empty and contains null
and arr2
is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I added also some test cases for this
Test build #90679 has finished for PR 21028 at commit
|
retest this please |
LGTM |
Test build #90711 has finished for PR 21028 at commit
|
override def inputTypes: Seq[AbstractDataType] = { | ||
(left.dataType, right.dataType) match { | ||
case (ArrayType(e1, hasNull1), ArrayType(e2, hasNull2)) => | ||
TypeCoercion.findTightestCommonType(e1, e2, caseSensitive) match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should fail the build now
Test build #90719 has finished for PR 21028 at commit
|
thanks, merging to master! |
… collection expressions. ## What changes were proposed in this pull request? The PR tries to avoid serialization of private fields of already added collection functions and follows up on comments in [SPARK-23922](#21028) and [SPARK-23935](#21236) ## How was this patch tested? Run tests from: - CollectionExpressionSuite.scala - DataFrameFunctionsSuite.scala Author: Marek Novotny <[email protected]> Closes #21352 from mn-mikke/SPARK-24305.
What changes were proposed in this pull request?
The PR adds the function
arrays_overlap
. This function returnstrue
if the input arrays contain a non-null common element; if not, it returnsnull
if any of the arrays contains anull
element,false
otherwise.How was this patch tested?
added UTs