Avoid unnecessary null buffer construction when converting arrays to a different type #6243
Labels: arrow (changes to the arrow crate), enhancement (any new improvement worthy of an entry in the changelog), parquet (changes to the parquet crate)
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Related to #6219
While investigating #6219, I found several locations where transforming an array of type `T` to type `U` uses the following pattern: the initial `iter()` call emits `Option<T>` (with nulls becoming `None`), which is then mapped to `Option<U>`, and the resulting iterator is passed to `from_iter()`:

arrow-rs/arrow-array/src/array/primitive_array.rs
Line 1318 in a693f0f

`from_iter()` then iterates over the `Option<U>`s, unwrapping them and collecting them into a `Buffer`, and rebuilding the original null buffer one value at a time.

Describe the solution you'd like
We can avoid one `map` and the null buffer creation by instead mapping `Option<T>` to `U`, and cloning the original null buffer. Adding a `from_iter_values_with_nulls()` function to `PrimitiveArray` allows the above pattern to be replaced with a single pass over the values plus a clone of the existing null buffer. This results in a pretty dramatic performance increase.
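The difference between the two paths can be sketched with a standard-library-only model. This is an illustration under stated assumptions, not the arrow-rs implementation: `SimpleArray`, `from_iter_opt`, `convert_via_options`, and `convert_via_values` are hypothetical stand-ins, and the signature given to `from_iter_values_with_nulls` here is assumed rather than taken from the crate.

```rust
// Hypothetical stand-in for a primitive array: a values buffer plus an
// optional validity mask (None means "no nulls"). Not the arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
struct SimpleArray<T> {
    values: Vec<T>,
    nulls: Option<Vec<bool>>, // true = valid, false = null
}

impl<T: Copy + Default> SimpleArray<T> {
    /// Iterate as Option<T>, with null slots becoming None.
    fn iter(&self) -> impl Iterator<Item = Option<T>> + '_ {
        self.values.iter().enumerate().map(move |(i, v)| match &self.nulls {
            Some(mask) if !mask[i] => None,
            _ => Some(*v),
        })
    }

    /// Option-based path: unwrap every item and rebuild the validity
    /// mask one value at a time.
    fn from_iter_opt<I: IntoIterator<Item = Option<T>>>(iter: I) -> Self {
        let mut values = Vec::new();
        let mut mask = Vec::new();
        for item in iter {
            mask.push(item.is_some());
            values.push(item.unwrap_or_default());
        }
        let nulls = if mask.iter().all(|valid| *valid) { None } else { Some(mask) };
        SimpleArray { values, nulls }
    }

    /// Values-based path (assumed signature): collect plain values and
    /// reuse a validity mask supplied by the caller.
    fn from_iter_values_with_nulls<I: IntoIterator<Item = T>>(
        iter: I,
        nulls: Option<Vec<bool>>,
    ) -> Self {
        SimpleArray { values: iter.into_iter().collect(), nulls }
    }
}

/// iter() -> map over Option<T> -> from_iter_opt(): two closures per
/// element and a freshly built mask.
fn convert_via_options(a: &SimpleArray<i32>) -> SimpleArray<i64> {
    SimpleArray::from_iter_opt(a.iter().map(|x| x.map(|v| v as i64)))
}

/// Map the raw values directly and clone the existing mask.
fn convert_via_values(a: &SimpleArray<i32>) -> SimpleArray<i64> {
    SimpleArray::from_iter_values_with_nulls(
        a.values.iter().map(|v| *v as i64),
        a.nulls.clone(),
    )
}
```

Both functions produce the same logical contents when read back through `iter()`; the second never inspects individual `Option`s. Note that this model copies its `Vec<bool>` mask on clone, whereas a reference-counted null buffer would make the clone cheap.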
Describe alternatives you've considered
An alternative is to modify the `Array` trait to add some sort of transform function that could consume the null buffer from the original array rather than cloning it. But given that the use cases I know of are Parquet related, I don't know if that's the right approach.

Additional context