-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add COW conversion for Buffer and PrimitiveArray and unary_mut #3115
Changes from 1 commit
498c74f
a3ad2c5
4a028e6
8311372
42f6a1c
7ff701c
cee4d48
7cc4d0d
e4a55a1
f7fe8a1
6e3461a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,6 +15,7 @@ | |
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
use crate::array::SizedArray; | ||
use crate::builder::{BooleanBufferBuilder, BufferBuilder, PrimitiveBuilder}; | ||
use crate::iterator::PrimitiveIter; | ||
use crate::raw_pointer::RawPtrBox; | ||
|
@@ -489,6 +490,41 @@ impl<T: ArrowPrimitiveType> PrimitiveArray<T> { | |
) | ||
} | ||
} | ||
|
||
/// Returns `PrimitiveBuilder` of this primitive array for mutating its values if the underlying | ||
/// data buffer is not shared by others. | ||
pub fn into_builder(self) -> Result<PrimitiveBuilder<T>, Self> { | ||
let null_buffer = self | ||
.data | ||
.null_buffer() | ||
.cloned() | ||
.and_then(|b| b.into_mutable().ok()); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this needs to be split into the part that clones the null buffer and I also think this method should fail if the null buffer cannot be converted, where I think it currently just loses the null buffer There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, that's why the doc test is failed now. Looking into how to make it work. |
||
|
||
let len = self.len(); | ||
let null_bit_buffer = self.data.null_buffer().cloned(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We need to bit_slice this to be consistent with the values buffer |
||
|
||
let mut buffers = self.data.get_buffers(); | ||
let buffer = buffers.remove(0); | ||
|
||
let builder = buffer | ||
.into_mutable() | ||
.map(|buffer| PrimitiveBuilder::<T>::new_from_buffer(buffer, null_buffer)); | ||
|
||
match builder { | ||
Ok(builder) => Ok(builder), | ||
Err(buffer) => { | ||
let builder = ArrayData::builder(T::DATA_TYPE) | ||
.len(len) | ||
.add_buffer(buffer) | ||
.null_bit_buffer(null_bit_buffer); | ||
|
||
let array_data = unsafe { builder.build_unchecked() }; | ||
let array = PrimitiveArray::<T>::from(array_data); | ||
|
||
Err(array) | ||
} | ||
} | ||
} | ||
} | ||
|
||
#[inline] | ||
|
@@ -529,6 +565,9 @@ impl<T: ArrowPrimitiveType> Array for PrimitiveArray<T> { | |
} | ||
} | ||
|
||
/// Makes `PrimitiveArray<T>` can be "downcast_array". | ||
impl<T: ArrowPrimitiveType> SizedArray for PrimitiveArray<T> {} | ||
|
||
impl<'a, T: ArrowPrimitiveType> ArrayAccessor for &'a PrimitiveArray<T> { | ||
type Item = T::Native; | ||
|
||
|
@@ -1036,7 +1075,8 @@ impl<T: DecimalType + ArrowPrimitiveType> PrimitiveArray<T> { | |
mod tests { | ||
use super::*; | ||
use crate::builder::{Decimal128Builder, Decimal256Builder}; | ||
use crate::BooleanArray; | ||
use crate::{downcast_array, BooleanArray}; | ||
use std::sync::Arc; | ||
|
||
#[test] | ||
fn test_primitive_array_from_vec() { | ||
|
@@ -1939,4 +1979,42 @@ mod tests { | |
|
||
array.value(4); | ||
} | ||
|
||
#[test] | ||
fn test_into_builder() { | ||
let array: Int32Array = vec![1, 2, 3].into_iter().map(Some).collect(); | ||
|
||
let boxed: Arc<dyn SizedArray> = Arc::new(array); | ||
let array: Arc<Int32Array> = downcast_array(boxed).unwrap(); | ||
|
||
let col: Int32Array = Arc::try_unwrap(array).unwrap(); | ||
let mut builder = col.into_builder().unwrap(); | ||
|
||
builder.append_value(4); | ||
builder.append_null(); | ||
builder.append_value(2); | ||
|
||
let expected: Int32Array = vec![Some(4), None, Some(2)].into_iter().collect(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would have expected There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, currently the returned builder sets length to |
||
|
||
let new_array = builder.finish(); | ||
assert_eq!(expected, new_array); | ||
} | ||
|
||
#[test] | ||
fn test_into_builder_cloned_array() { | ||
let array: Int32Array = vec![1, 2, 3].into_iter().map(Some).collect(); | ||
|
||
let boxed: Arc<dyn SizedArray> = Arc::new(array); | ||
|
||
let col: Int32Array = PrimitiveArray::<Int32Type>::from(boxed.data().clone()); | ||
let err = col.into_builder(); | ||
|
||
match err { | ||
Ok(_) => panic!("Should not get builder from cloned array"), | ||
Err(returned) => { | ||
let expected: Int32Array = vec![1, 2, 3].into_iter().map(Some).collect(); | ||
assert_eq!(expected, returned) | ||
} | ||
} | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,6 +33,10 @@ impl BooleanBufferBuilder { | |
Self { buffer, len: 0 } | ||
} | ||
|
||
pub fn new_from_buffer(buffer: MutableBuffer) -> Self { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this need to be provided with the length in bits? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, added. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did you? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, missed it. |
||
Self { buffer, len: 0 } | ||
} | ||
|
||
#[inline] | ||
pub fn len(&self) -> usize { | ||
self.len | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -124,6 +124,14 @@ impl<T: ArrowNativeType> BufferBuilder<T> { | |||||
} | ||||||
} | ||||||
|
||||||
pub fn new_from_buffer(buffer: MutableBuffer) -> Self { | ||||||
Self { | ||||||
buffer, | ||||||
len: 0, | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
? |
||||||
_marker: PhantomData, | ||||||
} | ||||||
} | ||||||
|
||||||
/// Returns the current number of array elements in the internal buffer. | ||||||
/// | ||||||
/// # Example: | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,7 +16,7 @@ | |
// under the License. | ||
|
||
use crate::builder::BooleanBufferBuilder; | ||
use arrow_buffer::Buffer; | ||
use arrow_buffer::{Buffer, MutableBuffer}; | ||
|
||
/// Builder for creating the null bit buffer. | ||
/// This builder only materializes the buffer when we append `false`. | ||
|
@@ -42,6 +42,15 @@ impl NullBufferBuilder { | |
} | ||
} | ||
|
||
pub fn new_from_buffer(buffer: Option<MutableBuffer>, capacity: usize) -> Self { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think the option here is necessary, as if there is no MutableBuffer, callers can just call the regular constructor? |
||
let bitmap_builder = buffer.map(BooleanBufferBuilder::new_from_buffer); | ||
Self { | ||
bitmap_builder, | ||
len: 0, | ||
capacity, | ||
} | ||
} | ||
|
||
/// Appends `n` `true`s into the builder | ||
/// to indicate that these `n` items are not nulls. | ||
#[inline] | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,7 +19,7 @@ use std::fmt::Debug; | |
use std::iter::FromIterator; | ||
use std::ptr::NonNull; | ||
use std::sync::Arc; | ||
use std::{convert::AsRef, usize}; | ||
use std::{convert::AsRef, mem, usize}; | ||
|
||
use crate::alloc::{Allocation, Deallocation}; | ||
use crate::util::bit_chunk_iterator::{BitChunks, UnalignedBitChunk}; | ||
|
@@ -227,6 +227,28 @@ impl Buffer { | |
pub fn count_set_bits_offset(&self, offset: usize, len: usize) -> usize { | ||
UnalignedBitChunk::new(self.as_slice(), offset, len).count_ones() | ||
} | ||
|
||
/// Returns `MutableBuffer` for mutating the buffer if this buffer is not shared. | ||
pub fn into_mutable(self) -> Result<MutableBuffer, Self> { | ||
let offset_ptr = self.as_ptr(); | ||
let offset = self.offset; | ||
let length = self.length; | ||
Arc::try_unwrap(self.data) | ||
.map(|bytes| { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We need to verify that |
||
// The pointer of underlying buffer should not be offset. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a somewhat annoying limitation, I wonder if there is some way to avoid it 🤔 Perhaps we could push the offset into Bytes 🤔 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For the limitation, I assume that a non-zero offset means the Bytes is shared/sliced from others. So it is disallowed to be mutable here. Wondering pushing the offset into Bytes can change it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah ideally we'd do something like make MutableArray hold a Bytes, and then push an offset into Bytes. We could even extend the Allocation trait to allow reallocation of custom allocated data. Not sure how much, if any, of that you would be interested in doing 😅 was more just an observation There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Just question: i think There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Clone on an Arc increments the strong count, it does not perform a deep copy There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tustvold Sorry another question: I think in datafusion most There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We would likely need to update the kernels to make use of this, but theoretically. Whether this will make a major difference in practice I'm not sure, sorts, aggregates and joins, tend to dominate queries in my experience |
||
assert_eq!(offset_ptr, bytes.ptr().as_ptr()); | ||
|
||
let mutable_buffer = | ||
MutableBuffer::from_ptr(bytes.ptr(), bytes.capacity()); | ||
mem::forget(bytes); | ||
mutable_buffer | ||
}) | ||
.map_err(|bytes| Buffer { | ||
data: bytes, | ||
offset, | ||
length, | ||
}) | ||
} | ||
} | ||
|
||
/// Creating a `Buffer` instance by copying the memory from a `AsRef<[u8]>` into a newly | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -92,6 +92,15 @@ impl MutableBuffer { | |
} | ||
} | ||
|
||
/// Allocates a new [MutableBuffer] from given pointer `ptr`, `capacity`. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it would be "safer" for the to be |
||
pub(crate) fn from_ptr(ptr: NonNull<u8>, capacity: usize) -> Self { | ||
Self { | ||
data: ptr, | ||
len: 0, | ||
capacity, | ||
} | ||
} | ||
|
||
/// creates a new [MutableBuffer] with capacity and length capable of holding `len` bits. | ||
/// This is useful to create a buffer for packed bitmaps. | ||
pub fn new_null(len: usize) -> Self { | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -387,6 +387,10 @@ impl ArrayData { | |
&self.buffers[..] | ||
} | ||
|
||
pub fn get_buffers(self) -> Vec<Buffer> { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about child_data, null_buffers, etc... ? I wonder if a more general pattern would be to use |
||
self.buffers | ||
} | ||
|
||
/// Returns a slice of children data arrays | ||
pub fn child_data(&self) -> &[ArrayData] { | ||
&self.child_data[..] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you go from an
ArrayRef
toArc<dyn SizedArray>
?Edit: Is the intention to change
ArrayRef
to beArc<dyn SizedArray>
? 🤔There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot make
AsDynAny
work withArc<dyn Array>
. We have implementedArray
for reference type&'a T
.Any
must be'static
and and it must beSized
as well.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, got what you mean. Hmm, good question. I currently don't make it work with
ArrayRef
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposal in #3117