Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23922][SQL] Add arrays_overlap function #21028

Closed
wants to merge 28 commits into from

Conversation

mgaido91
Copy link
Contributor

What changes were proposed in this pull request?

The PR adds the function arrays_overlap. This function returns true if the input arrays contain a non-null common element; if not, it returns null if any of the arrays contains a null element, false otherwise.

How was this patch tested?

added UTs

@SparkQA
Copy link

SparkQA commented Apr 10, 2018

Test build #89130 has finished for PR 21028 at commit e5ebdad.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArraysOverlap(left: Expression, right: Expression)

@gatorsmile
Copy link
Member

cc @ueshin

@SparkQA
Copy link

SparkQA commented Apr 10, 2018

Test build #89132 has finished for PR 21028 at commit 682bc73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 17, 2018

Test build #89450 has finished for PR 21028 at commit 876cd93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Checks if the two arrays contain at least one common element.
*/
@ExpressionDescription(
usage = "_FUNC_(a1, a2) - Returns true if a1 contains at least an element present also in a2.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a note for null handling?


override def dataType: DataType = BooleanType

override def inputTypes: Seq[AbstractDataType] = left.dataType match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seq(ArrayType, ArrayType)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, because in that way we would loose the information about the elementType

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are similar functions, such as array_union(#21061), array_intersect(#21102), array_except(#21103), and maybe concat(#20858) which is slightly different though, to handle two (or more) arrays with the same element type.
I think we should use the same way to specify and check input types.

I'd like to discuss the best way for it here or somewhere else.
cc @kiszk @mn-mikke Do you have any suggestions?
Also cc @gatorsmile @cloud-fan

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If common functions for a method, which accepts two array with the same type, are provided as trait, it would be fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can create a new trait, which will first make sure all its children are array type, and then make sure all its children are same type after implicit type cast(make sure other databases also do implicit type cast for these functions).

Then update TypeCoercion rule to handle this trait.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea if implicit type cast is not allowed for these functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick note, there is a dedicated type coercion rule for concat functions. So if there was the trait you described, we could remove the rule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mn-mikke I am not sure, since it is quite a strange case, since it allows also string and byte. I am not sure we can do this with implicit type cast.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 Sorry, I should have been more explicit. I've been referring to the below case that I added into FunctionArgumentConversion due to enabling type coercion of array types.

case c @ Concat(children) if children.forall(c => ArrayType.acceptsType(c.dataType)) &&
  !haveSameType(children) =>
  val types = children.map(_.dataType)
  findWiderCommonType(types) match {
    case Some(finalDataType) => Concat(children.map(Cast(_, finalDataType)))
    case None => c
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implicit type cast is allowed in Presto. I am pushing here a proposal of trait, let me know what you think about it. Thanks.

checkEvaluation(ArraysOverlap(a5, a6), true)
checkEvaluation(ArraysOverlap(a5, a7), null)
checkEvaluation(ArraysOverlap(a6, a7), false)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add cases for one of the two arguments is null and ArraysOverlap(Seq(null), Seq(null))?

)
}
)
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can skip if the right is containsNull == false?

| }
| }
| }
|} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@SparkQA
Copy link

SparkQA commented Apr 20, 2018

Test build #89639 has finished for PR 21028 at commit c895707.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class Summarizer(object):
  • class SummaryBuilder(JavaWrapper):
  • case class Reverse(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
  • case class ArrayPosition(left: Expression, right: Expression)
  • case class ElementAt(left: Expression, right: Expression) extends GetMapValueUtil
  • case class Concat(children: Seq[Expression]) extends Expression
  • abstract class GetMapValueUtil extends BinaryExpression with ImplicitCastInputTypes
  • case class GetMapValue(child: Expression, key: Expression)
  • class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T]

@mgaido91
Copy link
Contributor Author

anymore comments @ueshin ?

@SparkQA
Copy link

SparkQA commented Apr 27, 2018

Test build #89926 has finished for PR 21028 at commit 1dbcd0c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Base trait for [[BinaryExpression]]s with two arrays of the same element type and implicit
* casting.
*/
trait BinaryArrayExpressionWithImplicitCast extends BinaryExpression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ImplicitCastInputTypes trait is able to work with any number of children. Would it be possible to implement this trait to behave in the same way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible indeed. Though, as far as I know there is no use case for a function with a different number of children, so I am not sure if it makes sense to generalize it. @cloud-fan @kiszk @ueshin WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @ueshin pointed out here, concat is also a use case that has a different number of children. Am I wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiszk you are not wrong, but Concat is a very specific case, since it supports also Strings and Binarys, so it would anyway require a specific implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I would like to hear other opinions

@SparkQA
Copy link

SparkQA commented Apr 27, 2018

Test build #89930 has finished for PR 21028 at commit 076fc69.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 27, 2018

Test build #89932 has finished for PR 21028 at commit eafca0f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 27, 2018

Test build #89935 has finished for PR 21028 at commit 5925104.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Apr 28, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 28, 2018

Test build #89949 has finished for PR 21028 at commit 5925104.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

true
""", since = "2.4.0")
// scalastyle:off line.size.limit
case class ArraysOverlap(left: Expression, right: Expression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you override prettyName to a value following the conventions?

override def prettyName: String = "arrays_overlap"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks!

}

override def checkInputDataTypes(): TypeCheckResult = {
TypeCoercion.findWiderTypeForTwo(left.dataType, right.dataType) match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not call findWiderTypeForTwo here. Instead we should directly check if the 2 inputs are array type and have the same element type, so that it's safe to say this expression is resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with your suggestion, if we have two arrays with different element types (but one can be casted to the other), we would throw an exception. Instead, with this approach that use case is valid and we perform an implicit cast to the wider type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would throw an exception

No, the type coercion rule will be run again and again until this expression is resolved, or hit fixed point and fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, my bad, I got confused. I am fixing this, thanks.

@SparkQA
Copy link

SparkQA commented May 3, 2018

Test build #90117 has finished for PR 21028 at commit 2a1121c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 11, 2018

Test build #90510 has finished for PR 21028 at commit 3dd724b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 11, 2018

Test build #90518 has finished for PR 21028 at commit e36a5d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
private def fastEval(arr1: ArrayData, arr2: ArrayData): Any = {
var hasNull = false
val (bigger, smaller, biggerDt) = if (arr1.numElements() > arr2.numElements()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the biggerDt is not used

index: String,
code: String,
isNullCode: String): String = {
if (inputTypes.exists(_.asInstanceOf[ArrayType].containsNull)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this depend on whether the input array arrayVar contains null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately we don't know which one we have here (the left or the rigth) as arrayVar, since we don't know which one is the smaller/bigger and this can change record to record. So we can skip the null check only if both them don't contain null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, makes sense!

s"${ev.isNull} = true;")
s"""
|for (int $i = 0; $i < $bigger.numElements(); $i ++) {
|$isInSmaller
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 space

s"""
|for (int $j = 0; $j < $smaller.numElements(); $j ++) {
| $compareValues
| if (${ev.value}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this if?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was in the wrong place, thanks, nice catch!

|}
|if ($smaller.numElements() > 0) {
| $comparisonCode
|}
Copy link
Contributor

@cloud-fan cloud-fan May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if smaller is empty, we should return false. However, the code here depends on the initial value of ev.value and ev.isNull, which, according to nullSafeCodeGen, depends on nullable.

Copy link
Contributor Author

@mgaido91 mgaido91 May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but ev.isNull anyway is it initiated to false, unless one of the input is null. And in that case we don't even reach this point because we just return null.

ev.value is initiated always to the defaultValue which is false. So when we arrive here we are sure that they are both false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah i see, sorry I misread the code in nullSafeCodeGen

Copy link
Contributor Author

@mgaido91 mgaido91 May 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np, thanks for checking it. Always better one check more than one less.

|if (${ctx.genEqual(elementType, getFromSmaller, getFromBigger)}) {
| ${ev.isNull} = false;
| ${ev.value} = true;
| break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we wanna break 2 loops here, that's why we generate 2 breaks. it might be cleaner to generate

for (int $i = 0; $i < $bigger.numElements() && !${ev.value}; $i ++) {
  for (int $j = 0; $j < $smaller.numElements() && !${ev.value}; $j ++) {
    ...
  }
}

@SparkQA
Copy link

SparkQA commented May 14, 2018

Test build #90592 has finished for PR 21028 at commit 49d9372.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2018

Test build #90635 has finished for PR 21028 at commit 227437b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
private def bruteForceEval(arr1: ArrayData, arr2: ArrayData): Any = {
var hasNull = false
if (arr1.numElements() > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also compare the numElements here and check the array is empty for the smaller one? Otherwise the result is different if the arr1 is not empty and contains null and arr2 is empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I added also some test cases for this

@SparkQA
Copy link

SparkQA commented May 16, 2018

Test build #90679 has finished for PR 21028 at commit 2e9e024.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented May 17, 2018

Test build #90711 has finished for PR 21028 at commit 2e9e024.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

override def inputTypes: Seq[AbstractDataType] = {
(left.dataType, right.dataType) match {
case (ArrayType(e1, hasNull1), ArrayType(e2, hasNull2)) =>
TypeCoercion.findTightestCommonType(e1, e2, caseSensitive) match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should fail the build now

@SparkQA
Copy link

SparkQA commented May 17, 2018

Test build #90719 has finished for PR 21028 at commit 56c59ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 69350aa May 17, 2018
asfgit pushed a commit that referenced this pull request Jul 17, 2018
… collection expressions.

## What changes were proposed in this pull request?

The PR tries to avoid serialization of private fields of already added collection functions and follows up on comments in [SPARK-23922](#21028) and [SPARK-23935](#21236)

## How was this patch tested?

Run tests from:
- CollectionExpressionSuite.scala
- DataFrameFunctionsSuite.scala

Author: Marek Novotny <[email protected]>

Closes #21352 from mn-mikke/SPARK-24305.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants