-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array #24073
Changes from all commits
1e76669
860ed87
049bf9b
d59b0f7
dfdd109
60eb195
1bde41c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3112,29 +3112,29 @@ case class ArrayDistinct(child: Expression) | |
(data: Array[AnyRef]) => new GenericArrayData(data.distinct.asInstanceOf[Array[Any]]) | ||
} else { | ||
(data: Array[AnyRef]) => { | ||
var foundNullElement = false | ||
var pos = 0 | ||
val arrayBuffer = new scala.collection.mutable.ArrayBuffer[AnyRef] | ||
var alreadyStoredNull = false | ||
for (i <- 0 until data.length) { | ||
if (data(i) == null) { | ||
if (!foundNullElement) { | ||
foundNullElement = true | ||
pos = pos + 1 | ||
} | ||
} else { | ||
if (data(i) != null) { | ||
var found = false | ||
var j = 0 | ||
var done = false | ||
while (j <= i && !done) { | ||
if (data(j) != null && ordering.equiv(data(j), data(i))) { | ||
done = true | ||
} | ||
j = j + 1 | ||
while (!found && j < arrayBuffer.size) { | ||
val va = arrayBuffer(j) | ||
found = (va != null) && ordering.equiv(va, data(i)) | ||
j += 1 | ||
} | ||
if (i == j - 1) { | ||
pos = pos + 1 | ||
if (!found) { | ||
arrayBuffer += data(i) | ||
} | ||
} else { | ||
// De-duplicate the null values. | ||
if (!alreadyStoredNull) { | ||
arrayBuffer += data(i) | ||
alreadyStoredNull = true | ||
} | ||
} | ||
} | ||
new GenericArrayData(data.slice(0, pos)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like original implementation assumes the duplicate items are placed at the end of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This test case fails at current master, but passes after this change. val a8 = Literal.create(Seq(2, 1, 2, 3, 4, 4, 5).map(_.toString.getBytes), ArrayType(BinaryType))
checkEvaluation(new ArrayDistinct(a8), Seq(2, 1, 3, 4, 5).map(_.toString.getBytes)) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Probably. The thing is, it does not rearrange the data in any way. So i don't know how we can just return the slice of the original array.
Yeah.. I will add your test case or enhance the Array[Binary] test case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good catch. |
||
new GenericArrayData(arrayBuffer) | ||
} | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than handle nulls separately, can you just check it here?
It also kind of looks like the ordering already handles nulls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen Thanks. Actually in my understanding, ordering does not seem to handle nulls. Given that, the proposed condition does not work for case when va != null and data[i] is null. We get a null pointer exception. We can probably do something like :
Given this, i feel the existing code reads better and i think performs better if we have many null and non-null values in the array by keeping the null handling outside the inner loop. Also, i believe, for other collection operation functions we treat nulls separately. But i will change, if you feel otherwise. Please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, on second look the ordering in ArrayType doesn't handle nulls, only nulls in the array. This is fine. You could save a lookup with...
but I don't feel strongly about it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dilipbiswal explained my intention in the code of
null
part. @srowen is code simple and easy for reading, but it may include # of iterations if we have already seennull
.