Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix for BinaryEncoder positions #698

Merged
merged 2 commits into from
Mar 18, 2023

Conversation

gsilvestrin
Copy link
Contributor

When merging multiple arrays, BinaryEncoder creates a array of positions/offsets that it is too big.

For instance in the following array:

&[
    &StringArray::from(vec![Some("a"), None, Some("cd"), None]),
    &StringArray::from(vec![Some("f"), None, Some("gh"), None]),
    &StringArray::from(vec![Some("t"), None, Some("uv"), None]),
]

Each sub-array has its own offset array. Since each array has 4 elements the offset array has 5 elements

offsets: [0, 1, 1, 3, 3]
offsets: [0, 1, 1, 3, 3]
offsets: [0, 1, 1, 3, 3]

When BinaryEncoder merge all sub-arrays into a single Array, it creates an array of length 12 (4*3) but an offset array with 15 elements

pos_array: PrimitiveArray<Int64>
[
  4,
  5,
  5,
  7,
  7,
  8,
  9,
  9,
  11,
  11,
  12,
  13,
  13,
  15,
  15,
]

This PR makes it create the correct offset array with 13 elements

pos_array: PrimitiveArray<Int64>
[
  4,
  5,
  5,
  7,
  7,
  8,
  8,
  10,
  10,
  11,
  11,
  13,
  13,
]

@gsilvestrin gsilvestrin requested review from changhiskhan and eddyxu and removed request for changhiskhan March 18, 2023 02:16
@@ -81,11 +81,19 @@ impl<'a> BinaryEncoder<'a> {

let start_offset = offsets[0];
// Did not use `add_scalar(positions, value_offset)`, so we can save a memory copy.
offsets.iter().for_each(|o| {
offsets.iter().enumerate().for_each(|(o_idx, o)| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could it be simpler like

let pos_builder = Builder::new(...)
pos_builder.append(start_pos);

for array in arrs.iter() {
    let offsets = array.offsets_array();
    let last_offset = pos_builder.last() # somehow
    pos_builder.append_iter(offsets.iter().skip(1).map(|p| p + last))
}

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no last but maybe pos_builder.values_slice()[pos_builder.len() - 1] would work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to this issue: #700

@gsilvestrin gsilvestrin force-pushed the gsilvestrin/bugfix-binary_encoder_positions branch from 25147e6 to 46ef95e Compare March 18, 2023 05:37
@gsilvestrin gsilvestrin merged commit f9ddd64 into main Mar 18, 2023
@gsilvestrin gsilvestrin deleted the gsilvestrin/bugfix-binary_encoder_positions branch March 18, 2023 06:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants