-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] List chunk expression #2491
[FEAT] List chunk expression #2491
Conversation
76e1b35
to
2804531
Compare
2804531
to
6d204bb
Compare
@@ -118,6 +118,61 @@ pub fn get_slices_helper( | |||
.into_series()) | |||
} | |||
|
|||
fn get_chunks_helper( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of arguments here, wanna add some docstrings to help explain them?
We aren't very good with documentation of our code, but maybe a good time to start 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, added some docs. I also switched around the argument list while writing the docs because there was a more natural order to them: input/parent values (flat_child, field, validity) -> input/expression argument (size) -> "output"/result of preprocessing (elements to skip and new offsets)
src/daft-core/src/array/ops/list.rs
Outdated
let inner_list_field = field.to_exploded_field()?.to_fixed_size_list_field(size)?; | ||
let inner_list = FixedSizeListArray::new( | ||
inner_list_field.clone(), | ||
Series::from_arrow(field.to_exploded_field()?.into(), flat_child.to_arrow())?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this conversion to and from arrow necessary? I think to_arrow
is meant mostly for FFI back into Python.
I think flat_child.into_series()
should give you what you want here (you'll have to pull in the IntoSeries
trait)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right there was no need to make this so convoluted.
So there's an option of doing flat_child.inner.into_series()
which goes from &Series -> SeriesLike -> Series
. In the end of the day I settled on flat_child.clone()
, since flat_child
is a Series which only contains Arc<dyn SeriesLike>
, so only the Arc would be cloned.
Let me know if you think I should do otherwise.
@@ -2597,6 +2597,18 @@ def slice(self, start: int | Expression, end: int | Expression) -> Expression: | |||
end_expr = Expression._to_expression(end) | |||
return Expression._from_pyexpr(self._expr.list_slice(start_expr._expr, end_expr._expr)) | |||
|
|||
def chunk(self, size: int) -> Expression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also add this into our Expressions docs: docs/source/api_docs/expressions.rst
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. I didn't do this for the slice expression too, so I'll add it at the same time.
Adds an expression that splits lists into fixed sized lists with chunks of a given size. If the number of elements is not a multiple of the given chunk size, elements that do not fit into a chunk are discarded.
The API to use this expression is
daft.Expression.list.slice(size: int)
, wheresize
is the size of chunks to use. The given size must be an integer greater than 0.For example, a list [1, 2, 3, 4, 5] with a chunk size of 2 gives [[1, 2], [3, 4]].