Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] List chunk expression #2491

Merged

Conversation

desmondcheongzx
Copy link
Contributor

@desmondcheongzx desmondcheongzx commented Jul 9, 2024

Adds an expression that splits lists into fixed sized lists with chunks of a given size. If the number of elements is not a multiple of the given chunk size, elements that do not fit into a chunk are discarded.

The API to use this expression is daft.Expression.list.slice(size: int), where size is the size of chunks to use. The given size must be an integer greater than 0.

For example, a list [1, 2, 3, 4, 5] with a chunk size of 2 gives [[1, 2], [3, 4]].

@github-actions github-actions bot added the enhancement New feature or request label Jul 9, 2024
@desmondcheongzx desmondcheongzx force-pushed the add-chunk-expression branch 2 times, most recently from 76e1b35 to 2804531 Compare July 9, 2024 19:30
@@ -118,6 +118,61 @@ pub fn get_slices_helper(
.into_series())
}

fn get_chunks_helper(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of arguments here, wanna add some docstrings to help explain them?

We aren't very good with documentation of our code, but maybe a good time to start 😛

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, added some docs. I also switched around the argument list while writing the docs because there was a more natural order to them: input/parent values (flat_child, field, validity) -> input/expression argument (size) -> "output"/result of preprocessing (elements to skip and new offsets)

let inner_list_field = field.to_exploded_field()?.to_fixed_size_list_field(size)?;
let inner_list = FixedSizeListArray::new(
inner_list_field.clone(),
Series::from_arrow(field.to_exploded_field()?.into(), flat_child.to_arrow())?,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this conversion to and from arrow necessary? I think to_arrow is meant mostly for FFI back into Python.

I think flat_child.into_series() should give you what you want here (you'll have to pull in the IntoSeries trait)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right there was no need to make this so convoluted.

So there's an option of doing flat_child.inner.into_series() which goes from &Series -> SeriesLike -> Series. In the end of the day I settled on flat_child.clone(), since flat_child is a Series which only contains Arc<dyn SeriesLike>, so only the Arc would be cloned.

Let me know if you think I should do otherwise.

@@ -2597,6 +2597,18 @@ def slice(self, start: int | Expression, end: int | Expression) -> Expression:
end_expr = Expression._to_expression(end)
return Expression._from_pyexpr(self._expr.list_slice(start_expr._expr, end_expr._expr))

def chunk(self, size: int) -> Expression:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add this into our Expressions docs: docs/source/api_docs/expressions.rst

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I didn't do this for the slice expression too, so I'll add it at the same time.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 11, 2024
@desmondcheongzx desmondcheongzx requested a review from jaychia July 11, 2024 20:22
@desmondcheongzx desmondcheongzx merged commit fb7b04c into Eventual-Inc:main Jul 12, 2024
41 checks passed
@desmondcheongzx desmondcheongzx deleted the add-chunk-expression branch August 1, 2024 04:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants