[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array #734

MTelling · 2021-02-18T17:49:00Z

What changes were proposed in this pull request?

Correct the logic that array_distinct uses to compute distinct elements, so that it also works when the elements are themselves arrays.

Below is a small repro snippet.

scala> val df = Seq(Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))).toDF("array_col")
df: org.apache.spark.sql.DataFrame = [array_col: array<array<int>>]

scala> val distinctDF = df.select(array_distinct(col("array_col")))
distinctDF: org.apache.spark.sql.DataFrame = [array_distinct(array_col): array<array<int>>]

scala> df.show(false)
+----------------------------------------+
|array_col                               |
+----------------------------------------+
|[[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]|
+----------------------------------------+

Actual (incorrect) result

scala> distinctDF.show(false)
+-------------------------+
|array_distinct(array_col)|
+-------------------------+
|[[1, 2], [1, 2], [1, 2]] |
+-------------------------+

Expected result

scala> distinctDF.show(false)
+-------------------------+
|array_distinct(array_col)|
+-------------------------+
|[[1, 2], [3, 4], [4, 5]] |
+-------------------------+
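To make the expected semantics concrete, here is a minimal sketch using plain Scala collections: an order-preserving distinct that compares nested sequences by value. It is an illustration only. The names ArrayDistinctSketch and distinctPreservingOrder are hypothetical, and this is not the code changed in Spark by this patch.

```scala
import scala.collection.mutable

// Illustrative only: plain Scala collections, not Spark's internal implementation.
object ArrayDistinctSketch {
  // Keep the first occurrence of each element, in input order.
  // Scala Seq equality is structural, so Seq(1, 2) == Seq(1, 2) even for separate instances.
  def distinctPreservingOrder[A](xs: Seq[A]): Seq[A] = {
    val seen = mutable.LinkedHashSet.empty[A]
    xs.foreach(x => seen += x)
    seen.toSeq
  }

  // Same data as the repro snippet's single row.
  val input: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))

  // Three distinct inner sequences remain: (1, 2), (3, 4), (4, 5).
  val result: Seq[Seq[Int]] = distinctPreservingOrder(input)
}
```

The point of the sketch is that duplicates are detected by comparing the contents of each inner array, and the first occurrence of each distinct inner array is kept, which matches the expected output shown above.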

How was this patch tested?

Added an additional test.

Closes apache#24073 from dilipbiswal/SPARK-27134.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit aea9a57)
Signed-off-by: Sean Owen <[email protected]>

Upstream SPARK-27134 ticket and PR link (if not applicable, explain)

https://issues.apache.org/jira/browse/SPARK-27134

What changes were proposed in this pull request?

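Cherry-pick the upstream fix (commit aea9a57) for array_distinct returning wrong results on columns that contain arrays of arrays.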

How was this patch tested?

Existing tests

rshkv approved these changes Feb 18, 2021

rshkv merged commit 304713b into master Feb 18, 2021

rshkv deleted the mt/cherry-pick-spark-27134 branch February 18, 2021 21:47
