Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add StringViewArray and BinaryViewArray (#4253) #4585

Closed
wants to merge 3 commits into from

Conversation

tustvold
Copy link
Contributor

Draft as not yet standardised and needs a LOT more testing

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@alamb
Copy link
Contributor

alamb commented Jul 30, 2023

Related mailing list discussion: https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt

use arrow_buffer::Buffer;
use arrow_schema::ArrowError;

/// The element layout of a view buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "small string optimization" in rust !!! -- I had wondered what this would look like

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 30, 2023
Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I have just a few suggestions

arrow-array/src/array/byte_view_array.rs Show resolved Hide resolved
Comment on lines +120 to +123
/// View data buffers
/// This is not stored in `MutableArrayData` because these values constant and only needed
/// at the end, when freezing [_MutableArrayData].
view_buffers: Vec<Buffer>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend renaming this to avoid the suggestion that these buffers hold views

Suggested change
/// View data buffers
/// This is not stored in `MutableArrayData` because these values constant and only needed
/// at the end, when freezing [_MutableArrayData].
view_buffers: Vec<Buffer>,
/// Variadic data buffers referenced by views
/// This is not stored in `MutableArrayData` because these values constant and only needed
/// at the end, when freezing [_MutableArrayData].
variadic_character_buffers: Vec<Buffer>,

Comment on lines +29 to +63
let lhs_b = lhs.buffers();
let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
let rhs_b = rhs.buffers();

for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
// Only checking one null mask here because by the time the control flow reaches
// this point, the equality of the two masks would have already been verified.
if lhs.is_null(idx) {
continue;
}

let l_len = *l as u32;
let r_len = *r as u32;
if l_len != r_len {
return false;
} else if l_len <= 12 {
// Inline storage
if l != r {
return false;
}
} else {
let l_view = View::from(*l);
let r_view = View::from(*r);
let l_b = &lhs_b[(l_view.buffer_index as usize) + 1];
let r_b = &rhs_b[(r_view.buffer_index as usize) + 1];

let l_o = l_view.offset as usize;
let r_o = r_view.offset as usize;
let len = l_len as usize;
if l_b[l_o..l_o + len] != r_b[r_o..r_o + len] {
return false;
}
}
}
true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For equality comparison, we can compare the length and prefix of the views simultaneously:

Suggested change
let lhs_b = lhs.buffers();
let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
let rhs_b = rhs.buffers();
for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
// Only checking one null mask here because by the time the control flow reaches
// this point, the equality of the two masks would have already been verified.
if lhs.is_null(idx) {
continue;
}
let l_len = *l as u32;
let r_len = *r as u32;
if l_len != r_len {
return false;
} else if l_len <= 12 {
// Inline storage
if l != r {
return false;
}
} else {
let l_view = View::from(*l);
let r_view = View::from(*r);
let l_b = &lhs_b[(l_view.buffer_index as usize) + 1];
let r_b = &rhs_b[(r_view.buffer_index as usize) + 1];
let l_o = l_view.offset as usize;
let r_o = r_view.offset as usize;
let len = l_len as usize;
if l_b[l_o..l_o + len] != r_b[r_o..r_o + len] {
return false;
}
}
}
true
let lhs_b = lhs.buffers()[1..];
let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
let rhs_b = rhs.buffers()[1..];
for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
// Only checking one null mask here because by the time the control flow reaches
// this point, the equality of the two masks would have already been verified.
if lhs.is_null(idx) {
continue;
}
let l_len_prefix = *l as u64;
let r_len_prefix = *r as u64;
if l_len_prefix != r_len_prefix {
return false;
}
let len = *l as u32;
if len <= 12 {
if (*l >> 64) as u64 != (*r >> 64) as u64 {
return false;
}
continue;
}
let l_view = View::from(*l);
let r_view = View::from(*r);
let l_b = &lhs_b[l_view.buffer_index as usize];
let r_b = &rhs_b[r_view.buffer_index as usize];
// prefixes are already known to be equal; skip checking them
let len = len as usize - 4;
let l_o = l_view.offset as usize + 4;
let r_o = r_view.offset as usize + 4;
if l_b[l_o..l_o + len] != r_b[r_o..r_o + len] {
return false;
}
}
true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have hoped LLVM is smart enough to work this out but I can double-check

@alamb
Copy link
Contributor

alamb commented Feb 12, 2024

See #5374 for implementation discussions


/// An array of variable length byte view arrays
pub struct GenericByteViewArray<T: ByteViewType> {
data_type: DataType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need this? Can we just use T::DATA_TYPE?

let v: &[u8] = value.as_ref().as_ref();
let length: u32 = v.len().try_into().unwrap();
if length <= 12 {
let mut offset = [0; 16];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend renaming offset to view_buffer.

self.len()
);

assert!(i < self.views.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion is duplicate with the previous one.

}
} else {
let l_view = View::from(*l);
let r_view = View::from(*r);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can compare the prefix to short-circuit.

As described in paper:

 In case of long strings, the remaining four bytes
of the header are used to store the first four characters of the string,
allowing Umbra to short-circuit some comparisons

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I just note it's commend already. ignore this.

/// The element layout of a view buffer
///
/// See [`DataType::Utf8View`](arrow_schema::DataType)
pub struct View {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to use #[repr(C)] ?

buffers: &[Buffer],
) -> Result<(), ArrowError> {
validate_view_impl(views, buffers, |idx, b| {
std::str::from_utf8(b).map_err(|e| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simdutf8 crate may be better than this.

@alamb
Copy link
Contributor

alamb commented Feb 26, 2024

Thanks for the comments @sundy-li . Is there any chance you or someone at DataBend are interested in / planning on working on this feature?

@sundy-li
Copy link
Contributor

sundy-li commented Feb 27, 2024

Thanks for the comments @sundy-li . Is there any chance you or someone at DataBend are interested in / planning on working on this feature?

Yes, we are working on it at databendlabs/databend#14662

In databend, our column memory model is still based on arrow2. But reading/writing is using arrow-rs, we are planing to work on it after databendlabs/databend#14662 is finished.

@ariesdevil

@ariesdevil
Copy link
Contributor

Hi @alamb, I'm willing to work on this feature after databendlabs/databend#14662 is finished.

@alamb
Copy link
Contributor

alamb commented Feb 28, 2024

Hi @alamb, I'm willing to work on this feature after datafuselabs/databend#14662 is finished.

Thank you @sundy-li and @ariesdevil

I left a comment on #5374 #5374 (comment) about how we can potentially work on this together

@tustvold
Copy link
Contributor Author

tustvold commented Mar 6, 2024

I'm going to close this PR as I believe others are picking this up

@tustvold tustvold closed this Mar 6, 2024
@alamb
Copy link
Contributor

alamb commented Mar 6, 2024

Add StringViewArray implementation and layout and basic construction + tests apache/arrow#5469

Indeed -- we are tracking the work in #5374

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants