-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement array_slice
and array_element
, remove array_trim
#6936
Conversation
(DataType::List(_), DataType::Int64, DataType::Int64) | ||
| (DataType::List(_), DataType::Int64, DataType::Null) | ||
| (DataType::List(_), DataType::Null, DataType::Int64) | ||
| (DataType::List(_), DataType::Null, DataType::Null) => Ok(ColumnarValue::Array(array_slice(&[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not work with columns, but all functions support column format.
@alamb if you have free time can you give me some advices how to solve the problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would entail supporting something like SELECT array_column[int_column] FROM t
which while cool I think would be much less common than SELECT array_column[3] FROM t
(aka a constant) as you have done here
I think writing the code that takes columnar inputs is likely something we could work out upstream (aka in arrow-rs) as a list_take
or similar kernel.
Thus I would suggest we get this PR in (which errors with a non constant) and file a ticket to support columnar field access as a follow on PR
@@ -951,11 +951,24 @@ impl TryFrom<&Expr> for protobuf::LogicalExprNode { | |||
// see discussion in https://github.com/apache/arrow-datafusion/issues/2565 | |||
return Err(Error::General("Proto serialization error: Expr::ScalarSubquery(_) | Expr::InSubquery(_) | Expr::Exists { .. } | Exp:OuterReferenceColumn not supported".to_string())); | |||
} | |||
Expr::GetIndexedField(GetIndexedField { key, expr }) => Self { | |||
Expr::GetIndexedField(GetIndexedField { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to parse GetIndexedField
in proto.
Should I create a separate ticket for this issue and allow other users to solve it or this problem may cause many problems in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can't figure out the protobuf part, I think we have in the past added a "NotImplemented" error and filed a follow on ticket to add support
Though maybe this comment is outdated as it looks like you have the code here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created a separate ticket: #7146.
I think PR is enough for an initial review. @alamb @jayzhan211 if you have free time, can you review it? |
query error | ||
select array_slice(make_array(1, 2, 3, 4, 5), 2, NULL), array_slice(make_array('h', 'e', 'l', 'l', 'o'), 3, NULL); | ||
---- | ||
[4, 5] [l, l, o] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is currently not working. See the issue: #7142.
[4, 5] [l, l, o] | ||
|
||
# array_slice scalar function #13 (with NULL and negative number) | ||
query error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This too.
---- | ||
[] [] | ||
|
||
# array_slice scalar function #19 (with negative indexes; nested array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does not seems any negative indexes
---- | ||
[[1, 2, 3, 4, 5]] | ||
|
||
# array_slice scalar function #20 (with first positive index and last negative index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first neg then pos?
---- | ||
[4, 5] [l, l] | ||
|
||
# array_slice scalar function #21 (with first negative index and last positive index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first pos then neg?
|
||
# struct[i] | ||
query III | ||
select struct(1, 3.14, 'h')['c0'], struct(3, 2.55, 'b')['c1'], struct(2, 6.43, 'a')['c2']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quite curious, why index with 'c0' works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we start counting from 0
.
Example:
❯ CREATE TABLE values(
a INT,
b INT,
c INT,
d FLOAT,
e VARCHAR,
f VARCHAR
) AS VALUES
(1, 1, 2, 1.1, 'Lorem', 'A'),
(2, 3, 4, 2.2, 'ipsum', ''),
(3, 5, 6, 3.3, 'dolor', 'BB'),
(4, 7, 8, 4.4, 'sit', NULL),
(NULL, 9, 10, 5.5, 'amet', 'CCC'),
(5, NULL, 12, 6.6, ',', 'DD'),
(6, 11, NULL, 7.7, 'consectetur', 'E'),
(7, 13, 14, NULL, 'adipiscing', 'F'),
(8, 15, 16, 8.8, NULL, '')
;
❯ select struct(a, b, c) from values;
+------------------------------------+
| struct(values.a,values.b,values.c) |
+------------------------------------+
| {c0: 1, c1: 1, c2: 2} |
| {c0: 2, c1: 3, c2: 4} |
| {c0: 3, c1: 5, c2: 6} |
| {c0: 4, c1: 7, c2: 8} |
| {c0: , c1: 9, c2: 10} |
| {c0: 5, c1: , c2: 12} |
| {c0: 6, c1: 11, c2: } |
| {c0: 7, c1: 13, c2: 14} |
| {c0: 8, c1: 15, c2: 16} |
+------------------------------------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can understand 0, but is c
hard coding somewhere?
Lets me ask another way, can we index it with col0
, column0
? If not, there might be a hard coding index with the prefix c
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The c0
is hardcoded and part of the output of the struct
function:
❯ select struct(1,2);
+---------------------------+
| struct(Int64(1),Int64(2)) |
+---------------------------+
| {c0: 1, c1: 2} |
+---------------------------+
datafusion/core/tests/sql/select.rs
Outdated
"+--------+", | ||
"| i0 |", | ||
"+--------+", | ||
"| [0, 1] |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why only one left.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the ticket: #7147
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@alamb I fixed conflicts and CI, so can you review this PR again if you have free time? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the effort @izveigor -- I don't think we should merge this PR with the regression in field access #7147 -- among other things we use this feature in IOx so adding a known regression seems like a step backwards
I also don't fully understand the implementation of slice
-- I think there is a significantly easier and general way to implement this using take
which I tried to write up in detal
Let me know what you think
@@ -152,21 +152,6 @@ async fn test_udaf_returning_struct() { | |||
assert_batches_eq!(expected, &execute(&ctx, sql).await.unwrap()); | |||
} | |||
|
|||
/// Demonstrate extracting the fields from a structure using a subquery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this test removed?
(this is an important usecase for us in IOx where we have a UDF that returns a struct that is accessed by name.
@@ -2550,69 +3060,6 @@ mod tests { | |||
assert_eq!("1-*-3-*-*-6-7-*", result.value(0)); | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this test removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: the answer is that trim_array was removed
}}; | ||
} | ||
|
||
macro_rules! slice { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't fully understand this implementation -- it seem like picking a single value from a list array can be implemented by calculating the appropriate offsets and then calling take
For example, given A ListArray like this
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
┌ ─ ─ ─ ─ ─ ─ ┐ │
┌─────────────┐ ┌───────┐ │ ┌───┐ ┌───┐ ┌───┐ ┌───┐
│ [A,B,C] │ │ (0,3) │ │ 1 │ │ 0 │ │ │ 1 │ │ A │ │ 0 │
├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤
│ [] (empty) │ │ (3,3) │ │ 1 │ │ 3 │ │ │ 1 │ │ B │ │ 1 │
├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤
│ NULL │ │ (3,4) │ │ 0 │ │ 4 │ │ │ 1 │ │ C │ │ 2 │
├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤
│ [D] │ │ 4,5 │ │ 1 │ │ 5 │ │ │ 1 │ │ ? │ │ 3 │
├─────────────┤ ├───────┤ │ ├───┤ ├───┤ ├───┤ ├───┤
│ [NULL, F] │ │ 5,7 │ │ 1 │ │ 5 │ │ │ 0 │ │ D │ │ 4 │
└─────────────┘ └───────┘ │ └───┘ ├───┤ ├───┤ ├───┤
│ 7 │ │ │ 1 │ │ ? │ │ 5 │
│ Validity └───┘ ├───┤ ├───┤
Logical Logical (nulls) Offsets │ │ 1 │ │ F │ │ 6 │
Values Offsets │ └───┘ └───┘
│ Values │ │
(offsets[i], │ ListArray (Array)
offsets[i+1]) └ ─ ─ ─ ─ ─ ─ ┘ │
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
Computing col[1]
(aka take the second element) could be computed by adding 1 to the first logical offset, checking if that is still in the array
So in this example, the take indicies would be
[
Some(1), (add 1 to start of (0, 3) is 1
None , // (adding 1 start of (3,3) is 4 which is outside,
None, // input was null,
None, // adding 1 start of (4,5 ) is 5 which is outside
Some(6), // adding 1 to start of (5,7) is 6
]
This calculation does not depend on datatype of the elements and would be fully general
A similar calculation could be done for slice though in that case you would have to build up the offsets of the new list array as well as the indices of the take of the values.
In general I think many of these array calculations and manipulations can be done in terms of the take kernel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite comprehend your comment. If I use take
kernel with ListArray
, the program returns me the elements of ListArray
(i. e. Array
instances). So if it's really worth it can you give me more information how should I take data?
Example:
use arrow::array::*;
use arrow::compute::take;
use arrow::datatypes::Int32Type;
fn main() {
let data = vec![
Some(vec![Some(0), Some(1), Some(2)]),
None,
Some(vec![Some(3), None, Some(5)]),
Some(vec![Some(6), Some(7)]),
];
let values = ListArray::from_iter_primitive::<Int32Type, _, _>(data);
let indices = UInt32Array::from(vec![0, 3]);
let taken = take(&values, &indices, None).unwrap();
let taken: &GenericListArray<i32> = taken.as_list();
let data = vec![
Some(vec![Some(0), Some(1), Some(2)]),
Some(vec![Some(6), Some(7)]),
];
let expected = ListArray::from_iter_primitive::<Int32Type, _, _>(data);
assert_eq!(expected, taken.clone());
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite comprehend your comment. If I use take kernel with ListArray, the program returns me the elements of ListArray (i. e. Array instances). So if it's really worth it can you give me more information how should I take data?
I was thinking we call take()
with the values array (like take(list_array.values(), indices
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in your example above, something more like
fn main() {
let list_array = todo!(); // define ListArray here
let data = vec![
Some(vec![Some(0), Some(1), Some(2)]),
None,
Some(vec![Some(3), None, Some(5)]),
Some(vec![Some(6), Some(7)]),
];
let indices = UInt32::from_iter_primitive::<Int32Type, _, _>(data);
let taken = take(&list_array.values(), &indices, None).unwrap();
I am fully accept the fact that we should prevent the regression.
I plan for examine the difference between |
@alamb I think we should remain
I understood my mistake about index column support. Now, in my opinion I should direct all forces to fix the bug with struct (which is very important for IOx). I think it is better development path than comparing The main problem with bug related to struct is difference between circumstances for
What do you think about that approach, @alamb? P.S. I really want that DataFusion index system matches with DuckDB and ClickHouse. |
@alamb @jayzhan211 can you review it again if you have free time. |
I plan to review this PR later today |
array_slice
and array_element
, remove array_trim
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @izveigor -- I think this looks good to go -- it is well tested and documented. I do think there is
I plan to run this PR through the IOx tests as one last verification, but then I think we should merge it.
Thank you for pushing this through 🦾
cc @vincev @smiklos who I think are also interested in this area
|
||
# struct[i] | ||
query III | ||
select struct(1, 3.14, 'h')['c0'], struct(3, 2.55, 'b')['c1'], struct(2, 6.43, 'a')['c2']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The c0
is hardcoded and part of the output of the struct
function:
❯ select struct(1,2);
+---------------------------+
| struct(Int64(1),Int64(2)) |
+---------------------------+
| {c0: 1, c1: 2} |
+---------------------------+
| array_to_string(array, delimeter) | Converts each element to its text representation. `array_to_string([1, 2, 3, 4], ',') -> 1,2,3,4` | | ||
| cardinality(array) | Returns the total number of elements in the array. `cardinality([[1, 2, 3], [4, 5, 6]]) -> 6` | | ||
| make_array(value1, [value2 [, ...]]) | Returns an Arrow array using the specified input expressions. `make_array(1, 2, 3) -> [1, 2, 3]` | | ||
| trim_array(array, n) | Removes the last n elements from the array. | | ||
| trim_array(array, n) | Deprecated | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -2550,69 +3060,6 @@ mod tests { | |||
assert_eq!("1-*-3-*-*-6-7-*", result.value(0)); | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: the answer is that trim_array was removed
/// If we use index with `DataType::Struct`, then we use the `struct_key` argument with `list_key` equal to `None`. | ||
/// `list_key` can be any expression, unlike `struct_key` which can only be `ScalarValue::Utf8`. | ||
#[derive(Clone, PartialEq, Eq, Hash, Debug)] | ||
pub struct GetIndexedFieldKey { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this API to be awkward -- I think we can do something to make it better. I am working on a PR now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filed #7193 to track improving this API
@@ -1952,6 +2104,364 @@ mod tests { | |||
) | |||
} | |||
|
|||
#[test] | |||
fn test_array_element() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think this kind of test is better to write in array.slt
|
||
macro_rules! slice { | ||
($ARRAY:expr, $KEY:expr, $EXTRA_KEY:expr, $RETURN_ELEMENT:expr, $ARRAY_TYPE:ident) => {{ | ||
let sliced_array: Vec<Arc<dyn Array>> = $ARRAY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: ArrayRef
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM :)
I tested this with IOx test suite and everything passed: https://github.com/influxdata/influxdb_iox/pull/8419 While I do think the logical plan node to access fields can and should be cleaned up I see no reason that should gate this PR. Nicely done @izveigor and thank you for all the help reviewing @jayzhan211 |
Which issue does this PR close?
Part of #6935
Closes #7147
Closes #6975
Closes #6974
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
Yes