Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] List slice expression #2479

Merged
merged 3 commits into from
Jul 6, 2024

Conversation

desmondcheongzx
Copy link
Contributor

@desmondcheongzx desmondcheongzx commented Jul 4, 2024

Adds an expression that returns a specified subset of elements from a list.

The API to use this expression is daft.Expression.list.slice(start: int | Expression, end: int | Expression).

  • start is the 0-indexed position of the list to start retrieving elements. If this value is negative, then it's the position from the end of the array. i.e. -1 points to the last element of the array
  • end is the 0-indexed position of the list to stop retrieving elements. If this value is negative, then it's the position from the end of the array. i.e. -1 points to the last element of the array

@github-actions github-actions bot added the enhancement New feature or request label Jul 4, 2024
daft/expressions/expressions.py Show resolved Hide resolved
daft/series.py Outdated
@@ -541,6 +541,8 @@ def sum(self) -> Series:
assert self._series is not None
return Series._from_pyseries(self._series.sum())

# TODO: Is this needed?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Mostly for testing only)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

// Given an i64 array that may have either 1 or `self.len()` elements, create an iterator with
// `self.len()` elements. If there was originally 1 element, we repeat this element `self.len()`
// times, otherwise we simply take the original array.
fn create_iter<'a>(&'a self, arr: &'a Int64Array) -> Box<dyn Iterator<Item = i64> + '_> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's dedup this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
}
let total_capacity = *new_offsets.last().unwrap();
let default = Series::full_null("null", self.child_data_type(), 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our growable allows you to add_null without needing a dummy array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this unnecessary default value

self.name(),
self.child_data_type(),
vec![&self.flat_child, &default],
true, // TODO is this right?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are very sure that you won't be adding any nulls, you can set this to false for some performance gains.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

total_capacity as usize,
);
for (i, start) in slicing_indexes.iter().enumerate() {
if *start >= 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's follow python behavior instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, changed the API

start_iter: impl Iterator<Item = i64>,
length_iter: impl Iterator<Item = i64>,
) -> DaftResult<Series> {
let list_size = self.fixed_element_len();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably dedup a lot of the logic, just need to turn the offsets into an interable or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deduped

self.fixed_size_list()?.get_slices(start_arr, length_arr)
}
dt => Err(DaftError::TypeError(format!(
"list splice not implemented for {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"list splice not implemented for {}",
"list slice not implemented for {}",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch! Fixed

table.eval_expression_list([col("col").list.slice()])
with pytest.raises(TypeError, match="takes 3 positional arguments but 4 were given"):
table.eval_expression_list([col("col").list.slice(0, 0, 0)])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also other corner-case (completely empty MicroPartition). This comes up quite a lot, especially in distributed settings when a filter is applied and completely removes data from a MicroPartition (making it empty).

You can make a well-typed empty series like this:

pa.array([], type=pa.list_(...))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this test

def test_list_slice_non_list_type():
table = MicroPartition.from_pydict(
{
"mapcol": [{"a": 1}, {"b": 1}, {"c": 1}],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a struct 😂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😂 fixed

Copy link
Contributor Author

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaychia thanks for the review! I addressed your comments

table.eval_expression_list([col("col").list.slice()])
with pytest.raises(TypeError, match="takes 3 positional arguments but 4 were given"):
table.eval_expression_list([col("col").list.slice(0, 0, 0)])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this test

daft/expressions/expressions.py Show resolved Hide resolved
daft/series.py Outdated
@@ -541,6 +541,8 @@ def sum(self) -> Series:
assert self._series is not None
return Series._from_pyseries(self._series.sum())

# TODO: Is this needed?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

// Given an i64 array that may have either 1 or `self.len()` elements, create an iterator with
// `self.len()` elements. If there was originally 1 element, we repeat this element `self.len()`
// times, otherwise we simply take the original array.
fn create_iter<'a>(&'a self, arr: &'a Int64Array) -> Box<dyn Iterator<Item = i64> + '_> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
}
let total_capacity = *new_offsets.last().unwrap();
let default = Series::full_null("null", self.child_data_type(), 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this unnecessary default value

self.name(),
self.child_data_type(),
vec![&self.flat_child, &default],
true, // TODO is this right?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

total_capacity as usize,
);
for (i, start) in slicing_indexes.iter().enumerate() {
if *start >= 0 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, changed the API

start_iter: impl Iterator<Item = i64>,
length_iter: impl Iterator<Item = i64>,
) -> DaftResult<Series> {
let list_size = self.fixed_element_len();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deduped

self.fixed_size_list()?.get_slices(start_arr, length_arr)
}
dt => Err(DaftError::TypeError(format!(
"list splice not implemented for {}",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch! Fixed

def test_list_slice_non_list_type():
table = MicroPartition.from_pydict(
{
"mapcol": [{"a": 1}, {"b": 1}, {"c": 1}],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😂 fixed

Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lookin good -- congrats on 1st PR!!!!!!!!!!!!

}

pub fn get_slices_helper(
parent_offsets: &arrow2::offset::OffsetsBuffer<i64>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can actually make this an impl Iterator<Item = i64> (since you mostly use it that way anyways!) and then the fixed size list case would be quite clean: std::iter::repeat(element_len).take(list_len)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks for the suggestion!

@desmondcheongzx desmondcheongzx enabled auto-merge (squash) July 6, 2024 00:32
@desmondcheongzx desmondcheongzx merged commit 7b81de6 into Eventual-Inc:main Jul 6, 2024
41 checks passed
@desmondcheongzx desmondcheongzx deleted the add-list-slice branch August 1, 2024 04:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants