
[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array #24073

Closed
wants to merge 7 commits into master from dilipbiswal/SPARK-27134

Conversation

@dilipbiswal (Contributor) commented Mar 12, 2019

What changes were proposed in this pull request?

Correct the logic array_distinct uses to compute the distinct elements of an array.

Below is a small repro snippet.

```
scala> val df = Seq(Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))).toDF("array_col")
df: org.apache.spark.sql.DataFrame = [array_col: array<array<int>>]

scala> val distinctDF = df.select(array_distinct(col("array_col")))
distinctDF: org.apache.spark.sql.DataFrame = [array_distinct(array_col): array<array<int>>]

scala> df.show(false)
+----------------------------------------+
|array_col                               |
+----------------------------------------+
|[[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]|
+----------------------------------------+
```

Error

```
scala> distinctDF.show(false)
+-------------------------+
|array_distinct(array_col)|
+-------------------------+
|[[1, 2], [1, 2], [1, 2]] |
+-------------------------+
```

Expected result

```
scala> distinctDF.show(false)
+-------------------------+
|array_distinct(array_col)|
+-------------------------+
|[[1, 2], [3, 4], [4, 5]] |
+-------------------------+
```

How was this patch tested?

Added an additional test.
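
For context, here is a minimal standalone sketch of the corrected approach in plain Scala (not the actual Catalyst expression code; distinctSketch and equiv are illustrative stand-ins for the expression's eval logic and ordering.equiv): rather than slicing the original array, each element is kept the first time it is seen.

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of the fixed approach: keep the first occurrence of each element
// (by the supplied equivalence), preserving encounter order. This is the
// O(n^2) path needed for element types with no usable equals/hashCode,
// such as nested arrays and binary.
def distinctSketch(data: Array[AnyRef], equiv: (AnyRef, AnyRef) => Boolean): Array[AnyRef] = {
  val kept = new ArrayBuffer[AnyRef]
  var alreadyStoredNull = false
  for (elem <- data) {
    if (elem != null) {
      // Linear scan over the elements kept so far.
      if (!kept.exists(v => v != null && equiv(v, elem))) kept += elem
    } else if (!alreadyStoredNull) {
      // Nulls are handled outside the comparison: store at most one.
      kept += null
      alreadyStoredNull = true
    }
  }
  kept.toArray
}

// The PR's repro, reduced: nested arrays compared element-wise.
val input: Array[AnyRef] =
  Array(Array(1, 2), Array(1, 2), Array(1, 2), Array(3, 4), Array(4, 5))
val out = distinctSketch(input, (a, b) =>
  java.util.Arrays.equals(a.asInstanceOf[Array[Int]], b.asInstanceOf[Array[Int]]))
// out now holds [1, 2], [3, 4], [4, 5] -- the expected result above.
```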

@SparkQA commented Mar 12, 2019

Test build #103388 has finished for PR 24073 at commit 860ed87.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 13, 2019

Test build #103391 has finished for PR 24073 at commit 049bf9b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

cc @ushin @kiszk

@SparkQA commented Mar 13, 2019

Test build #103395 has finished for PR 24073 at commit d59b0f7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

retest this please

@SparkQA commented Mar 13, 2019

Test build #103409 has finished for PR 24073 at commit d59b0f7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

retest this please

@maropu (Member) commented Mar 13, 2019

Can you put an example query to reproduce this bug in the PR description?

@kiszk (Member) commented Mar 13, 2019

cc @ueshin

@kiszk (Member) commented Mar 13, 2019

Thanks, let me see later.

@dilipbiswal (Contributor Author)

@maropu Updated the PR description. Thank you.

```
            }
          }
        }
        new GenericArrayData(data.slice(0, pos))
```
@viirya (Member):

Looks like the original implementation assumes the duplicate items are placed at the end of the data? I think it affects not only arrays of arrays, but also other element types like BinaryType.
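
To make the failure mode concrete, a hypothetical simplification (not the actual expression code) of what returning a prefix slice implies:

```scala
// Hypothetical simplification of the old behavior: `pos` ends up holding
// the number of distinct values, but the data itself is never rearranged,
// so a prefix slice keeps whatever happens to sit at the front.
val data = Array("x", "x", "x", "y", "z")
val pos  = data.distinct.length   // 3 distinct values
data.slice(0, pos)                // Array(x, x, x) -- wrong unless the
                                  // duplicates all sit at the end
```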

@viirya (Member):

This test case fails on current master, but passes after this change.

```
val a8 = Literal.create(Seq(2, 1, 2, 3, 4, 4, 5).map(_.toString.getBytes), ArrayType(BinaryType))
checkEvaluation(new ArrayDistinct(a8), Seq(2, 1, 3, 4, 5).map(_.toString.getBytes))
```
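
For background on why BinaryType hits the same bug: JVM arrays, including Array[Byte], compare by reference, so a generic distinct cannot deduplicate them, and an element-type-aware comparison such as ordering.equiv is needed. A quick plain-Scala check:

```scala
val a = "1".getBytes
val b = "1".getBytes
a == b                            // false: byte arrays compare by reference
java.util.Arrays.equals(a, b)     // true: element-wise comparison
Array(a, b).distinct.length       // 2, not 1 -- distinct relies on ==
```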

@dilipbiswal (Contributor Author) commented Mar 13, 2019:

@viirya

> Looks like the original implementation assumes the duplicate items are placed at the end of the data?

Probably. The thing is, it does not rearrange the data in any way, so I don't see how we can just return a slice of the original array.

> I think it affects not only arrays of arrays, but also other element types like BinaryType.

Yeah, I will add your test case or enhance the Array[Binary] test case.

Member:

Good catch.
Actually I'm not sure how I could miss this.
Thanks!

@SparkQA commented Mar 13, 2019

Test build #103419 has finished for PR 24073 at commit d59b0f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -3112,29 +3112,30 @@ case class ArrayDistinct(child: Expression)
      (data: Array[AnyRef]) => new GenericArrayData(data.distinct.asInstanceOf[Array[Any]])
    } else {
      (data: Array[AnyRef]) => {
        var foundNullElement = false
        var pos = 0
        val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
```
@srowen (Member):

Wouldn't this be an ArrayBuffer[Array[AnyRef]]?

@dilipbiswal (Contributor Author) commented Mar 13, 2019:

We are returning a GenericArrayData(arrayBuffer), and GenericArrayData takes a parameter of type Array[Any], so we are okay here, no? I looked at the existing implementations of ArrayUnion and ArrayExcept for reference.

@srowen (Member):

That's fine; you can pass an Array of anything to it then (right? Or is there some compile-time issue I'm not thinking of?). It's not that it doesn't work, but this code could locally be more precise. No big deal.

@dilipbiswal (Contributor Author):

@srowen Sorry, a little confused. We have an input which is an Array[AnyRef]. Now, if we declare the temporary buffer as an ArrayBuffer[Array[AnyRef]], how do we populate its contents for element types that are not arrays? For example:

Input1: Array[Integer] => Seq(1, 2, 1); output: ArrayBuffer[Int] => Array(1, 2)
Input2: Array[Array[Integer]] => Seq(Seq(1, 2), Seq(3, 4), Seq(3, 4)); output: ArrayBuffer[Array[Int]] => Array(Array(1, 2), Array(3, 4))
Input3: Array[Struct] => Seq(struct(...), struct(...))

Thanks for your help.

@srowen (Member):

Disregard this; I'm mistaken. The use case here was arrays of arrays, but this code isn't handling only array elements. Can the type be AnyRef, though?

@dilipbiswal (Contributor Author):

@srowen Thank you. Sure, I will change it to AnyRef.
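
For reference, a small sketch of why either element type compiles here, assuming GenericArrayData's Seq-based auxiliary constructor (illustrative only, not code from this PR):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.catalyst.util.GenericArrayData

// GenericArrayData has an auxiliary constructor taking Seq[Any]; since Seq
// is covariant, an ArrayBuffer[AnyRef] is accepted just as readily as an
// ArrayBuffer[Any], while being a more precise local type.
val asAny: ArrayBuffer[Any]    = ArrayBuffer(1, 2, 3)
val asRef: ArrayBuffer[AnyRef] = ArrayBuffer("a", "b")
new GenericArrayData(asAny)
new GenericArrayData(asRef)
```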

```
          var j = 0;
          while (!found && j < arrayBuffer.size) {
            val va = arrayBuffer(j)
            found = (va != null) && ordering.equiv(va, data(i))
```
@srowen (Member):

Rather than handle nulls separately, can you just check it here?

found = if (va == null) data(i) == null else ordering.equiv(...)

It also kind of looks like the ordering already handles nulls?

@dilipbiswal (Contributor Author):

@srowen Thanks. Actually, in my understanding, the ordering does not handle nulls. Given that, the proposed condition does not work when va != null and data(i) is null: we get a NullPointerException. We could probably do something like:

```
if (data(i) != null && va != null) {
  found = ordering.equiv(va, data(i))
} else if (data(i) == null && va == null) {
  found = true
}
```

Given this, I feel the existing code reads better, and I think it performs better when the array has many null and non-null values, since the null handling stays outside the inner loop. Also, I believe the other collection functions treat nulls separately. But I will change it if you feel otherwise. Please let me know.

@srowen (Member):

You're right; on second look, the ordering in ArrayType doesn't handle nulls, only nulls inside the array. This is fine. You could save a lookup with:

```
if (data(i) == null) {
  found = va == null
} else if (va != null) {
  found = ordering.equiv(va, data(i))
}
```

but I don't feel strongly about it.

@kiszk (Member):

@dilipbiswal explained my intention behind the null-handling part of the code. @srowen's code is simple and easy to read, but it may increase the number of iterations once a null has already been seen.

```
          pos = pos + 1
          if (data(i) != null) {
            found = false
            var j = 0;
```
@srowen (Member):

Remove semicolon

@dilipbiswal (Contributor Author):

@srowen will do.

```
        var pos = 0
        val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
        var alreadyStoredNull = false
        var found = false
```
@srowen (Member):

Move this var inside the if statement below?

@dilipbiswal (Contributor Author) commented Mar 13, 2019:

@srowen ok.

@SparkQA commented Mar 14, 2019

Test build #103467 has started for PR 24073 at commit dfdd109.

@dilipbiswal (Contributor Author)

@HyukjinKwon Hello, the test run seems to be stuck in ApproxCountDistinctForIntervalsQuerySuite, which seems unrelated to the change in this PR. Is there a way to stop/kill this run?

@maropu (Member) commented Mar 14, 2019

Jenkins is behaving oddly right now, and it seems the tests stalled in #24028 (comment), too.

@maropu (Member) commented Mar 14, 2019

retest this please

@SparkQA commented Mar 14, 2019

Test build #103479 has finished for PR 24073 at commit dfdd109.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

retest this please

@SparkQA commented Mar 14, 2019

Test build #103482 has finished for PR 24073 at commit dfdd109.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 15, 2019

Test build #103515 has finished for PR 24073 at commit 60eb195.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin (Member) commented Mar 15, 2019

LGTM.
Rethinking after #24073 (comment), I agree with @kiszk that we should skip traversing the ArrayBuffer once a null has been found.
@srowen Could you take another look, please?
Thanks!

This reverts commit 60eb195.
@dilipbiswal (Contributor Author)

@kiszk @ueshin @srowen Thanks a lot for reviewing. I have reverted the last commit, in which I had removed the special-cased null handling. It should be OK now. Thanks.

@srowen (Member) left a comment:

This seems fine. Is there a more general problem of this form? Someone else mentioned it might not be specific to arrays. It's OK to consider that separately if so; I'm just checking whether there is another very similar fix to be made elsewhere.

@kiszk (Member) commented Mar 15, 2019

LGTM, thank you for fixing this mistake.

@ueshin (Member) commented Mar 15, 2019

Jenkins, retest this please.

@SparkQA commented Mar 15, 2019

Test build #103537 has finished for PR 24073 at commit 1bde41c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen pushed a commit that referenced this pull request Mar 16, 2019
[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array

Closes #24073 from dilipbiswal/SPARK-27134.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit aea9a57)
Signed-off-by: Sean Owen <[email protected]>
@srowen (Member) commented Mar 16, 2019

Merged to master/2.4. It didn't cherry-pick cleanly into 2.3, and I wasn't sure whether it affected 2.3.

@srowen closed this in aea9a57 Mar 16, 2019
@dilipbiswal (Contributor Author) commented Mar 16, 2019

Thanks a lot @viirya @srowen @ueshin @kiszk

@dilipbiswal (Contributor Author)

@srowen Actually this function was added in 2.4. So we should be good :-)

kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array
MTelling pushed a commit to palantir/spark that referenced this pull request Feb 18, 2021
[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array
rshkv pushed a commit to palantir/spark that referenced this pull request Feb 18, 2021
[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array