Avoid unnecessary null buffer construction when converting arrays to a different type #6243

Closed
etseidl opened this issue Aug 13, 2024 · 2 comments · Fixed by #6244
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

Contributor

etseidl commented Aug 13, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Related to #6219

While investigating #6219, I found several locations where transforming an array of type T to type U uses the pattern

let array_u = array_t
    .iter()
    .map(|o| o.map(|v| v as U))
    .collect::<PrimitiveArray<U>>()

The initial iter() call emits Option<T> (with nulls becoming None), which is then mapped to Option<U>. The resulting iterator is passed to from_iter():

fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {

from_iter() collects the Option<U>s, unwrapping each one into a values Buffer and rebuilding the original null buffer one value at a time.
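To make the cost concrete, here is a minimal std-only sketch of that pattern. `ToyArray`, `from_options`, and `cast_via_options` are hypothetical stand-ins invented for illustration; they are not arrow's `PrimitiveArray` or its `from_iter`. The point is only that the validity bitmap is rebuilt bit by bit even though the source array already holds an identical one.

```rust
/// Toy stand-in for a primitive array (hypothetical, not arrow's type):
/// a values vector plus a validity bitmap where bit i set = slot i is non-null.
#[derive(Debug, Clone, PartialEq)]
struct ToyArray<T> {
    values: Vec<T>,
    validity: Option<Vec<u8>>,
}

impl<T: Copy + Default> ToyArray<T> {
    /// Mimics the shape of collecting `Option<T>` items: each item is
    /// unwrapped and its validity bit is written individually.
    fn from_options<I: IntoIterator<Item = Option<T>>>(iter: I) -> Self {
        let mut values = Vec::new();
        let mut validity: Vec<u8> = Vec::new();
        for (i, item) in iter.into_iter().enumerate() {
            if i % 8 == 0 {
                validity.push(0); // start a new bitmap byte every 8 slots
            }
            if item.is_some() {
                validity[i / 8] |= 1u8 << (i % 8);
            }
            values.push(item.unwrap_or_default());
        }
        ToyArray { values, validity: Some(validity) }
    }

    /// Yields `Some(value)` for valid slots and `None` for null slots.
    fn iter(&self) -> impl Iterator<Item = Option<T>> + '_ {
        self.values.iter().enumerate().map(move |(i, v)| {
            let valid = match &self.validity {
                Some(bits) => bits[i / 8] & (1u8 << (i % 8)) != 0,
                None => true,
            };
            valid.then(|| *v)
        })
    }
}

/// The pattern from the issue: Option<i32> -> Option<i64> -> collect.
/// The output bitmap ends up equal to the input bitmap, but is recomputed.
fn cast_via_options(src: &ToyArray<i32>) -> ToyArray<i64> {
    ToyArray::from_options(src.iter().map(|o| o.map(|v| v as i64)))
}
```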

Describe the solution you'd like
We can avoid one map and the null buffer creation by instead mapping Option<T> to U, and cloning the original null buffer. Adding a from_iter_values_with_nulls() function to PrimitiveArray allows the above to become

let iter = array_t
    .iter()
    .map(|o| match o {
        Some(v) => v as U,
        None => U::default(),
    });
PrimitiveArray::<U>::from_iter_values_with_nulls(iter, array_t.nulls().cloned())

This results in a pretty dramatic performance increase.
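The proposed shape can be sketched with std-only toy types as follows. Everything here (`ToyArray`, `from_values_with_nulls`, `cast_reusing_nulls`) is a hypothetical model, not the arrow implementation. The toy bitmap is held in an `Arc<[u8]>` on the assumption that cloning the real null buffer is similarly a cheap reference-count bump rather than a copy.

```rust
use std::sync::Arc;

/// Toy stand-in for a primitive array (hypothetical, not arrow's type).
/// The validity bitmap is reference counted, so cloning it copies no bits.
#[derive(Debug, Clone)]
struct ToyArray<T> {
    values: Vec<T>,
    validity: Option<Arc<[u8]>>,
}

impl<T: Copy> ToyArray<T> {
    /// Collects plain values and attaches an existing bitmap unchanged.
    fn from_values_with_nulls<I: IntoIterator<Item = T>>(
        iter: I,
        validity: Option<Arc<[u8]>>,
    ) -> Self {
        let values: Vec<T> = iter.into_iter().collect();
        if let Some(bits) = &validity {
            // The bitmap must cover every slot.
            assert!(bits.len() * 8 >= values.len());
        }
        ToyArray { values, validity }
    }

    fn is_valid(&self, i: usize) -> bool {
        match &self.validity {
            Some(bits) => bits[i / 8] & (1u8 << (i % 8)) != 0,
            None => true,
        }
    }

    /// Yields `Some(value)` for valid slots and `None` for null slots.
    fn iter(&self) -> impl Iterator<Item = Option<T>> + '_ {
        (0..self.values.len()).map(move |i| self.is_valid(i).then(|| self.values[i]))
    }
}

/// Maps Option<i32> straight to i64 (nulls become the default value)
/// and reuses the source bitmap instead of rebuilding it.
fn cast_reusing_nulls(src: &ToyArray<i32>) -> ToyArray<i64> {
    let iter = src.iter().map(|o| match o {
        Some(v) => v as i64,
        None => i64::default(),
    });
    ToyArray::from_values_with_nulls(iter, src.validity.clone())
}
```

In this model the output shares the same bitmap allocation as the input, so the only per-element work left is the value conversion.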

Describe alternatives you've considered
An alternative is to modify the Array trait to add some sort of transform function that could consume the null buffer from the original array rather than cloning it. But given the use cases I know of are Parquet related, I don't know if that's the right approach.
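For comparison, a consuming transform might look roughly like the sketch below. `ToyArray` and `map_values` are hypothetical names on a std-only toy type, not a proposal for the actual Array trait signature. Taking `self` by value lets the bitmap move into the result without a clone.

```rust
/// Toy stand-in for a primitive array (hypothetical, not arrow's type).
#[derive(Debug)]
struct ToyArray<T> {
    values: Vec<T>,
    validity: Option<Vec<u8>>,
}

impl<T> ToyArray<T> {
    /// Hypothetical consuming transform: converts every slot with `f`
    /// and moves the validity bitmap into the result untouched.
    /// Note that `f` also runs on null slots, whose values are arbitrary.
    fn map_values<U, F: FnMut(T) -> U>(self, f: F) -> ToyArray<U> {
        ToyArray {
            values: self.values.into_iter().map(f).collect(),
            validity: self.validity, // moved, not cloned
        }
    }
}
```

The trade-off is that the caller gives up the source array, which may not suit call sites that still need it afterwards.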

Additional context

Contributor

alamb commented Aug 31, 2024

label_issue.py automatically added labels {'parquet'} from #6244

@alamb alamb added the arrow Changes to the arrow crate label Aug 31, 2024
Contributor

alamb commented Aug 31, 2024

label_issue.py automatically added labels {'arrow'} from #6244
