-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement array_union
function
#6981
Comments
The example is not correct. It should be
|
Thank you, @Weijun-H! Corrected now. |
Could I pick this ticket? |
@Weijun-H Yes, you can implement. Thanks for your initiative! |
I started doing some work on this @izveigor but I am stucked in handling the non trivial case:
I have the following newbie questions:
|
I think union with Null is not defined, so you could just match the List type, i.e. (List,List).
It sounds like the latter is better, but if you fail to find the way, the first approach is also fine. |
@jayzhan211 I looked deeper in the code, it seems that:
The latter is here: https://github.com/apache/arrow-rs/blob/03d0505fc864c09e6dcd208d3cdddeecefb90345/arrow-select/src/concat.rs#L111 and would require a separate release of arrow-rs to extend concatenation to use an HashSet internally. On the other side, in the current arrow-datafusion, I can't find any sign of deduplication. I created a draft PR here https://github.com/apache/arrow-datafusion/pull/7897/files but I am stuck at the moment |
We may not need pattern matching for Internal type of array. Type coercion should had been done in https://github.com/apache/arrow-datafusion/blob/9fde5c4282fd9f0e3332fb40998bf1562c17fcda/datafusion/optimizer/src/analyzer/type_coercion.rs#L582-L601
I think we can maintain HashSet for each row, convert array to primitives e.g.
I'm not sure how can we benefit from having HashSet inside arrow-rs. It seems to me easier to have that in DF |
I would prefer
But you can also try |
Hello @jayzhan211, can you help me move my next steps from https://github.com/apache/arrow-datafusion/pull/7897/files#r1369465094 ? I am still learning the internal API so the answer might be obvious :) |
In response to a question asked in the ASF slack, here is my best attempt to explain what is going on here Scalars vs ArraysFirst in order to understand what is going on we need to understand the relationship between SQL and arrow, and in particular how this relates to scalars. The arrow data model is concerned with arrays, these can be thought of as columns in a tabular data model. So an Int64Array might contain the data If you then wrote a SQL expression like
This will compute a new array Now when writing SQL queries it is common to want to write something like
In this case you want In this case DF uses an ColumnarValue enumeration which contains either an array or a ScalarValue Upstream arrow-rs has a similar concept using a trait called Some kernels are able to optimise the case where one or other of the sides is known to be a scalar, and so explicitly handle these special types. In the case of other kernels, DF will expand the scalar value to an array. In particular consider the expression
Say DF will call the underlying kernel with ListArray UnionNow we can get back to talking about Consider the example SQL
Both arguments are scalars, and so DF will expand them to arrays as above, although only to a single element in this case because both arguments are actually scalar. So the arguments to array_union kernel are actually 2-dimensional, i.e.
The kernel then wants to for each element of the input list arrays, in this case a list of integers, perform the union operation
Now this is where it gets tricky, as not only do we need to deduplicate the list values, but also incrementally build an output list array with the results. One way this might be achieved is something like (not tested at all)
This relies on the particulars of how ListArray is represented, which is explained in a bit more detail here. Unfortunately the operation being performed here is not a natural fit for a selection kernel, because it is selecting particular groups of values from list children, so unfortunately there isn't really a way to avoid dealing with the reality of how these arrays are encoded. Dealing with the offsets and values directly avoids fighting the borrow checker for the lifetime of rows and avoids downcasting for each list element. Additionally using RowConverter means this will generalise to lists of any value type, including lists or structs, whilst avoiding having to generate specialized code for each value type, which is a problematic patttern (##7988). I hope this at least clarifies some things... |
Is your feature request related to a problem or challenge?
Summary
array_union
list_union
Spark SQL: Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates.
Examples:
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: