Avoid unnecessary null buffer construction when converting arrays to a different type #6243

etseidl · 2024-08-13T20:42:28Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Related to #6219

While investigating #6219, I found several locations where transforming an array of type T to type U uses the pattern

let array_u = array_t
    .iter()
    .map(|o| o.map(|v| v as U))
    .collect::<PrimitiveArray<U>>();

The initial iter() call yields Option<T> (with nulls becoming None), each of which is then mapped to Option<U>. collect() then passes the resulting iterator to from_iter()

arrow-rs/arrow-array/src/array/primitive_array.rs

Line 1318 in a693f0f

fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {

which unwraps each Option<U>, collects the values into a Buffer, and rebuilds the original null buffer one value at a time.
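To make the cost concrete, here is a minimal sketch of that pattern using a toy model rather than the real arrow-rs types. `ToyArray`, its `Vec<bool>` validity mask, and `convert_via_options` are all hypothetical names made up for illustration; the real `PrimitiveArray` stores a packed bitmap and is more involved. The point is only that the collecting side has to test every `Option` and push one validity bit per element, even though the input already had an identical mask.

```rust
// Toy stand-in for a primitive array: values plus an optional validity mask.
// (Hypothetical type for illustration; not the arrow-rs PrimitiveArray.)
#[derive(Debug, PartialEq)]
struct ToyArray<T> {
    values: Vec<T>,
    validity: Option<Vec<bool>>, // None means "no nulls"
}

impl<T: Copy> ToyArray<T> {
    // Yields Some(v) for valid slots and None for null slots.
    fn iter(&self) -> impl Iterator<Item = Option<T>> + '_ {
        self.values.iter().enumerate().map(move |(i, v)| match &self.validity {
            Some(mask) if !mask[i] => None,
            _ => Some(*v),
        })
    }
}

impl<T: Default> FromIterator<Option<T>> for ToyArray<T> {
    // Unwraps each Option and rebuilds the validity mask one slot at a time.
    fn from_iter<I: IntoIterator<Item = Option<T>>>(iter: I) -> Self {
        let mut values = Vec::new();
        let mut mask = Vec::new();
        for item in iter {
            mask.push(item.is_some());
            values.push(item.unwrap_or_default());
        }
        ToyArray { values, validity: Some(mask) }
    }
}

// The conversion pattern from above, specialised to i32 -> i64.
fn convert_via_options(input: &ToyArray<i32>) -> ToyArray<i64> {
    input.iter().map(|o| o.map(|v| v as i64)).collect()
}
```

The output mask ends up equal to the input mask, but it was produced by per-element work instead of being reused.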

Describe the solution you'd like
We can avoid one map and the null buffer creation by instead mapping Option<T> to U, and cloning the original null buffer. Adding a from_iter_values_with_nulls() function to PrimitiveArray allows the above to become

let iter = array_t
    .iter()
    .map(|o| match o {
        Some(v) => v as U,
        None => U::default(),
    });
PrimitiveArray::<U>::from_iter_values_with_nulls(iter, array_t.nulls().cloned())

This results in a pretty dramatic performance increase.
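A rough sketch of why this is cheaper, again on a toy model and not the actual arrow-rs implementation. `ToyArray` and `convert_reusing_nulls` are hypothetical, and the `Arc<Vec<bool>>` mask is an assumption standing in for a reference-counted null buffer, so that cloning it copies a pointer rather than the bits. For brevity the sketch maps the raw values directly instead of going through `Option<T>`.

```rust
use std::sync::Arc;

// Toy stand-in for a primitive array (hypothetical; not the arrow-rs type).
// The validity mask sits behind an Arc, so cloning it is a pointer copy.
#[derive(Debug)]
struct ToyArray<T> {
    values: Vec<T>,
    nulls: Option<Arc<Vec<bool>>>, // None means "no nulls"
}

impl<T> ToyArray<T> {
    // Toy analogue of the proposed constructor: plain values plus a ready-made mask.
    fn from_iter_values_with_nulls<I>(iter: I, nulls: Option<Arc<Vec<bool>>>) -> Self
    where
        I: IntoIterator<Item = T>,
    {
        let values: Vec<T> = iter.into_iter().collect();
        if let Some(mask) = &nulls {
            assert_eq!(mask.len(), values.len(), "mask and values must have equal length");
        }
        ToyArray { values, nulls }
    }
}

// i32 -> i64 conversion that never inspects validity per element.
fn convert_reusing_nulls(input: &ToyArray<i32>) -> ToyArray<i64> {
    let iter = input.values.iter().map(|&v| v as i64);
    ToyArray::from_iter_values_with_nulls(iter, input.nulls.clone())
}
```

In this model the input and output share the same mask allocation, so the only per-element work left is the value cast.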

Describe alternatives you've considered
An alternative is to modify the Array trait to add some sort of transform function that could consume the null buffer from the original array rather than cloning it. But given the use cases I know of are Parquet related, I don't know if that's the right approach.
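For comparison, a consuming transform might look something like the sketch below. This is purely illustrative: `ToyArray` and `transform` are hypothetical names, nothing like this is being proposed for the `Array` trait here, and the toy mask is a plain `Vec<bool>`. It only shows the ownership idea, where the mask is moved into the result instead of being cloned or rebuilt.

```rust
// Toy stand-in (hypothetical; not the arrow-rs Array trait or its types).
struct ToyArray<T> {
    values: Vec<T>,
    nulls: Option<Vec<bool>>, // None means "no nulls"
}

impl<T> ToyArray<T> {
    // Hypothetical consuming transform: maps every slot and moves the mask across.
    fn transform<U, F: FnMut(T) -> U>(self, f: F) -> ToyArray<U> {
        ToyArray {
            values: self.values.into_iter().map(f).collect(),
            nulls: self.nulls, // moved, neither cloned nor rebuilt
        }
    }
}
```

The trade-off is that the source array is no longer usable afterwards, which may or may not suit callers outside the Parquet readers.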

Additional context


alamb · 2024-08-31T13:33:59Z

label_issue.py automatically added labels {'parquet'} from #6244

alamb · 2024-08-31T13:34:02Z

label_issue.py automatically added labels {'arrow'} from #6244

etseidl added the "enhancement" label (Any new improvement worthy of an entry in the changelog) Aug 13, 2024

etseidl mentioned this issue Aug 13, 2024

Remove unnecessary null buffer construction when converting arrays to a different type #6244

Merged

alamb closed this as completed in #6244 Aug 14, 2024

etseidl mentioned this issue Aug 14, 2024

Use unary() for array conversion in Parquet array readers, speed up Decimal128, Decimal256 and Float16 #6252

Merged

alamb added the "parquet" label (Changes to the parquet crate) Aug 31, 2024

alamb added the "arrow" label (Changes to the arrow crate) Aug 31, 2024
