Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(arrow-ord): remove duplicate logic and make it easier to add tests #6913

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

rluvaton
Copy link
Contributor

@rluvaton rluvaton commented Dec 26, 2024

Rationale for this change

While working on #6911 and I started to write tests, I found out I just duplicated the logic, and it was messier, so cleaning up to make it easier to create new tests.

Background: all sort arrays test helper were given 2 vectors with native rust types (e.g. Vec<Option<bool>>), one for the input and the other for the expected output.

and the sort test helper convert it to the input and expected arrays to Arrow arrays (e.g. BooleanArray), and then call sort (or sort_limit depend on the input) and assert that the expected match the sorted.

this is done to help make the test simpler by just providing what the actual arrow array represent without messing with converting to arrow type.

the problem is that some Rust types can have multiple Arrow types, for example Vec<Option<Vec<u8>>> can be BinaryArray or LargeBinaryArray or FixedSizeBinaryArray or BinaryViewArray (same for lists and string).

in that case each sort helper function convert the input data to some of the arrow type (like FixedSizeBinaryArray) or to multiple arrow type (BinaryArray and LargeBinaryArray)

and then do the same as described above.

for the cases that fixed_length was provided it will test FixedSizeBinaryArray

The problem with this is that there are a lot of duplication and adding tests to new type is just adding a lot of boilerplate instead of just starting to write code

What changes are included in this PR?

So instead of this approach I created a single test_sort_array function that is given the input, expected and a function to convert that rust input type (e.g. Vec<Option<Vec<u8>>> like explained above) to all variants from that type (e.g. BinaryArray and LargeBinaryArray and FixedSizeBinaryArray and BinaryViewArray)

then created for each test type a new helper to convert that rust type to arrow type

for creating binary array I returned FixedSizeBinaryArray if all element have the same size, this will no allow to avoid only testing fixed or variable length types.

and moved all the build array helpers to sub module for more organize code.

But why do I pass a function to build the array and not using Rust trait system? this will surely be cleaner as you can have the actual function implementation hidden behind the input type and it would make the test cleaner.

You (basically my other self) are right. The problem is that I tried and it can't work without implementing for each primitive type manually (I think), because let's see the following example:

Let's say I have a trait called ConvertToArray

trait ConvertToArrays {
	fn convert_to_array(input: Vec<Option<Self>>) -> Vec<ArrayRef>;
}

and then I implement it for primitive type to create primitive array:

impl<T> ConvertToArrays for T::Native
	where T: ArrowPrimitiveType,
	PrimitiveArray<T>: From<Vec<Option<T::Native>>>
{
	fn convert_to_array(input: Vec<Option<Self>>) -> Vec<ArrayRef> {
		vec![Arc::new(PrimitiveArray::<T>::from(data)) as ArrayRef]
	}
} 

all good and it works. the problem is I can't implement it for bool when I want to test sorting BooleanArray because ArrowPrimitiveType can be implemented in the future for bool and this would cause duplicate implementation. and for that reason Rust disallow that

Are there any user-facing changes?

nope, only tests

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 26, 2024
@alamb
Copy link
Contributor

alamb commented Dec 29, 2024

Thank you for these PRs @rluvaton -- I know we are seriously behind on PR review in arrow-rs. I hope that we will have more capacity after next week. We just need to find more people willing to help out with reviews

@rluvaton
Copy link
Contributor Author

I'm willing to help, although currently my codebase knowledge is not the greatest but I think it's getting better

@alamb
Copy link
Contributor

alamb commented Dec 29, 2024

I'm willing to help, although currently my codebase knowledge is not the greatest but I think it's getting better

Thank you 🙏 -- the most helpful thing would be to review outstanding PRs and ensure the description is understandable, the code is commented and covered with tests, and you think the code does what the description says

If you find such PRs, feel free to @ mention me and I'll try and give it a look asap

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea; I've left some suggestions to see if we can simplify the code even further.

Comment on lines +945 to +954
fn test_sort_arrays<
ListItemType: Clone,
IntoListFn: Fn(Vec<Option<ListItemType>>) -> Vec<ArrayRef>,
>(
into_lists_fn: IntoListFn,
data: Vec<Option<ListItemType>>,
options: Option<SortOptions>,
limit: Option<usize>,
expected_data: Vec<Option<ListItemType>>,
) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn test_sort_arrays<
ListItemType: Clone,
IntoListFn: Fn(Vec<Option<ListItemType>>) -> Vec<ArrayRef>,
>(
into_lists_fn: IntoListFn,
data: Vec<Option<ListItemType>>,
options: Option<SortOptions>,
limit: Option<usize>,
expected_data: Vec<Option<ListItemType>>,
) {
fn test_sort_arrays<ListItemType, IntoListFn>(
into_lists_fn: IntoListFn,
data: Vec<Option<ListItemType>>,
options: Option<SortOptions>,
limit: Option<usize>,
expected_data: Vec<Option<ListItemType>>,
) where
ListItemType: Clone,
IntoListFn: Fn(Vec<Option<ListItemType>>) -> Vec<ArrayRef>,
{

Thoughts on using where bound here to make it a bit more readable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, applied

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was missed from the changes you applied

arrow-ord/src/sort.rs Show resolved Hide resolved
arrow-ord/src/sort.rs Outdated Show resolved Hide resolved
arrow-ord/src/sort.rs Outdated Show resolved Hide resolved
@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 4, 2025

There's some CI errors to look into.

Beyond that, I'm confused by some of the reasoning provided in the body:

But why do I pass a function to build the array and not using Rust trait system? this will surely be cleaner as you can have the actual function implementation hidden behind the input type and it would make the test cleaner.

You (basically my other self) are right. The problem is that I tried and it can't work without implementing for each primitive type manually (I think), because let's see the following example:

...

all good and it works. the problem is I can't implement it for bool when I want to test sorting BooleanArray because ArrowPrimitiveType can be implemented in the future for bool and this would cause duplicate implementation. and for that reason Rust disallow that

I find it unlikely that ArrowPrimitiveType will be implemented for bool in the future, and even if it does I'm sure there'll be other problems such that this test code breaking is one of the smaller issues that would follow. But even disregarding that, none of the tests you've modified are for BooleanArrays, so I'm not sure it should be used as an example of why a trait based approach might not work.

Of course if a trait based approach is more verbose/complex than the current function based approach I can understand why this approach was chosen.

@alamb alamb marked this pull request as draft January 10, 2025 19:59
@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants