-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: String/Binary View Support #596
Conversation
Hmm looks like the Python process whereby we generate the pxd file does not play nice with the union containing nested struct declarations |
I'm on vacation at the moment but I believe you just have to give the anonymous structures names (i.e. declare them as top level types). This is how you would have to do it in cython anyway i believe! |
Sounds good. Enjoy the time away! |
0b11479
to
00fbf17
Compare
@paleolimbot started to look at this in more earnest to get a full round trip, instead of just a read. The current read implemenation also only assumes one variadic buffer. I see the Arrow specification says that binary view types have an extra member which is a buffer containing the lengths of the variadic buffers: https://arrow.apache.org/docs/format/CDataInterface.html#binary-view-arrays Do you think its worth adding that to the struct ArrowBinaryViewArray {
struct ArrowArray array_data;
// Variadic buffer lengths for Binary View arrays
int64_t* variadic_buffer_lengths;
}; So that it can be "safely" cast to/from ArrowArray pointers? |
With that approach the one thing I still wonder is where we should store the length of |
Awesome!
I think the arrow-nanoarrow/src/nanoarrow/nanoarrow_types.h Lines 794 to 796 in b3a9894
I think this refers to an extra buffer (not an extra struct member): https://github.com/apache/arrow-nanoarrow/pull/367/files#r1459338465 |
Ah OK interesting. Do you know if this is confirmed to be working upstream in Arrow? The reason I ask is that when running the test suite for what's already in this PR, when I hit the C++ Export method: The The Arrow docs also read a little vague to me:
I'm not sure if that is saying that the |
Bah ignore what I just said - bad debugging. Digging deeper! |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #596 +/- ##
==========================================
+ Coverage 85.02% 86.06% +1.04%
==========================================
Files 92 94 +2
Lines 11750 13697 +1947
==========================================
+ Hits 9990 11788 +1798
- Misses 1760 1909 +149 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more notes! I am gearing up for a release and it would be very cool to include this if we can...what do you see as blockers for getting this PR across the finish line and can I help with any of them?
python/tests/test_schema.py
Outdated
# TODO: run_end_encoded | ||
assert na.binary_view().type == na.Type.BINARY_VIEW |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# TODO: run_end_encoded | |
assert na.binary_view().type == na.Type.BINARY_VIEW | |
assert na.binary_view().type == na.Type.BINARY_VIEW |
(we have an issue for this one, feel free to punt! #618)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a hard time deciphering the code suggestion - do you just want me to remove these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's just that you can remove the TODO
Great! I should have time to see this through. There are no major blockers from my end. Let me know any feedback you have and I will take a look |
Great! I'll give this a thorough checkout + run through + review tomorrow 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a few changes are required to get this PR merged! This also needs a look from @bkietz to ensure that the string view details are correct!
src/nanoarrow/common/array_test.cc
Outdated
TEST(ArrayTest, ArrayTestAppendToBinaryViewArray) { | ||
struct ArrowArray array; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm missing something, these two tests can share a common void TestAppendToDataViewArray()
to avoid duplicating this entire test? (I know we haven't done a great job of this in the test suite so far 😳)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking that function should be added in the test suite or nanoarrow itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a test pattern (like you added below for consolidating the builder tests):
void TestAppendToDataViewArray(enum ArrowType view_type) {
// ...
}
TEST(ArrayTest, ArrayTestAppendToBinaryViewArray) {
ASSERT_NO_FATAL_FAILURE(TestAppendToDataViewArray(NANOARROW_TYPE_BINARY_VIEW));
}
TEST(ArrayTest, ArrayTestAppendToStringViewArray) {
ASSERT_NO_FATAL_FAILURE(TestAppendToDataViewArray(NANOARROW_TYPE_STRING_VIEW));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it matter that the string test uses ArrowArrayAppendString
whereas the binary one uses ArrowArrayAppendBytes
for the append func? Should I just use ArrowArrayAppendBytes
in the shared test implementation? Or parametrize it to accept a std::function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for missing this comment! Probably we should test both functions to make sure they both work (I believe an if
statement would work to get that coverage but we've also used std::function
(for example, in the integration test json tests) and I think either works!
src/nanoarrow/common/inline_array.h
Outdated
} else { | ||
int32_t n_vbufs = private_data->n_variadic_buffers; | ||
if (n_vbufs == 0 || | ||
private_data->variadic_buffers[n_vbufs - 1].size_bytes + value.size_bytes > | ||
NANOARROW_BINARY_VIEW_BLOCK_SIZE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the management of the array of buffers should be abstracted at least one level away to make it easier to spot potential problems and that here it would be more readable:
int64_t buffer_index;
NANOARROW_RETURN_NOT_OK(ArrowArrayAddVariadicBuffers(array, 1, &buffer_index));
(Exposing the ability to reserve variadic buffers also unlocks the ability to build by buffer and is probably a good idea anyway).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the buffer_index
in this case supposed to point to the index of the first buffer that has been added? If so, any preference for the name of that parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly just to tell you which one just got added (so that you can put a reference to it in the item you're about to append to the data buffer). Perhaps there's a better parameter name (or a way to retrieve the number of currently used variadic buffers using a function?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea...struggling with what to name this, especially if we allow this function to add multiple buffers at once. Maybe the second function to get the number of buffers allocated is a better pattern?
So
int64_t nbuf_capacity = ArrowArrayVariadicBufferSize(struct ArrowArray *array);
int64_t additional_buffers = 3;
NANOARROW_RETURN_NOT_OK(ArrowArrayAddVariadicBuffers(array, additional_buffers));
// nbuf_capacity + additional_buffers == ArrowArrayVariadicBufferSize(struct ArrowArray *array);
versus
int64_t nbuf_capacity = private_data->n_variadic_buffers;
int64_t next_buffer_avail;
int64_t additional_buffers = 3;
NANOARROW_RETURN_NOT_OK(ArrowArrayAddVariadicBuffers(array, additional_buffers, &next_buffer_avail));
// next_buffer_avail + additional_buffers == private_data->n_variadic_buffers;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the second function to get the number of buffers allocated is a better pattern?
Yes! Maybe ArrowArrayVariadicBufferCount()
though? (Invoking Size
suggests to me that there's only one of them and that the function will tell you how many bytes there are).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I think I've got this right. FWIW, maybe we would also add a ArrowArrayGetBuffer(struct ArrowArray* array, int32_t i)
function to abstract many of the calls that have to reach into private_data
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have this one!
arrow-nanoarrow/src/nanoarrow/common/inline_array.h
Lines 40 to 49 in 561de01
static inline struct ArrowBuffer* ArrowArrayBuffer(struct ArrowArray* array, int64_t i) { | |
struct ArrowArrayPrivateData* private_data = | |
(struct ArrowArrayPrivateData*)array->private_data; | |
switch (i) { | |
case 0: | |
return &private_data->bitmap.buffer; | |
default: | |
return private_data->buffers + i - 1; | |
} | |
} |
We do use it a number of internals (but could possibly use it in others!)
c501d49
to
ea4208a
Compare
src/nanoarrow/common/inline_array.h
Outdated
@@ -467,52 +468,136 @@ static inline ArrowErrorCode ArrowArrayAppendDouble(struct ArrowArray* array, | |||
return NANOARROW_OK; | |||
} | |||
|
|||
#define NANOARROW_BINARY_VIEW_FIXED_BUFFERS 2 | |||
#define NANOARROW_BINARY_VIEW_INLINE_SIZE 12 | |||
#define NANOARROW_BINARY_VIEW_PREVIEW_SIZE 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: other implementations call this the prefix rather than the preview
#define NANOARROW_BINARY_VIEW_PREVIEW_SIZE 4 | |
#define NANOARROW_BINARY_VIEW_PREFIX_SIZE 4 |
src/nanoarrow/common/inline_array.h
Outdated
struct ArrowBinaryViewTypeInlinedData { | ||
int32_t size; | ||
uint8_t data[NANOARROW_BINARY_VIEW_INLINE_SIZE]; | ||
}; | ||
|
||
struct ArrowBinaryViewTypeRefData { | ||
int32_t size; | ||
uint8_t data[NANOARROW_BINARY_VIEW_PREVIEW_SIZE]; | ||
int32_t buffer_index; | ||
int32_t offset; | ||
}; | ||
|
||
union ArrowBinaryViewType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These types should probably avoid including Type
in their name since they are values from an array of that type and don't encode type-level information. See also ArrowDecimal
, ArrowStringView
, ArrowInterval
struct ArrowBinaryViewTypeInlinedData { | |
int32_t size; | |
uint8_t data[NANOARROW_BINARY_VIEW_INLINE_SIZE]; | |
}; | |
struct ArrowBinaryViewTypeRefData { | |
int32_t size; | |
uint8_t data[NANOARROW_BINARY_VIEW_PREVIEW_SIZE]; | |
int32_t buffer_index; | |
int32_t offset; | |
}; | |
union ArrowBinaryViewType { | |
struct ArrowBinaryViewInlined { | |
int32_t size; | |
uint8_t data[NANOARROW_BINARY_VIEW_INLINE_SIZE]; | |
}; | |
struct ArrowBinaryViewRef { | |
int32_t size; | |
uint8_t data[NANOARROW_BINARY_VIEW_PREVIEW_SIZE]; | |
int32_t buffer_index; | |
int32_t offset; | |
}; | |
union ArrowBinaryView { |
src/nanoarrow/common/inline_array.h
Outdated
if (n_vbufs == 0 || | ||
private_data->variadic_buffers[n_vbufs - 1].size_bytes + value.size_bytes > | ||
NANOARROW_BINARY_VIEW_BLOCK_SIZE) { | ||
NANOARROW_RETURN_NOT_OK(ArrowArrayAddVariadicBuffers(array, 1, &buf_index)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't handle the case where value.size_bytes > NANOARROW_BINARY_VIEW_BLOCK_SIZE
, in that case you'll need an extra large variadic buffer to accommodate it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK interesting. So right now whenever a variadic buffer's capacity may be exceeded I am starting afresh in a new buffer - should I be packing the existing buffer as full as possible before-hand and potentially splitting the value array across two variadic buffers? Or is that only a consideration when value.size_bytes > NANOARROW_BINARY_VIEW_BLOCK_SIZE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I wrote this before observing that you aren't presizing data buffers #596 (comment)
Since you aren't presizing this isn't a bug
src/nanoarrow/common/inline_array.h
Outdated
|
||
struct ArrowBinaryViewTypeRefData { | ||
int32_t size; | ||
uint8_t data[NANOARROW_BINARY_VIEW_PREVIEW_SIZE]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint8_t data[NANOARROW_BINARY_VIEW_PREVIEW_SIZE]; | |
uint8_t prefix[NANOARROW_BINARY_VIEW_PREVIEW_SIZE]; |
src/nanoarrow/common/inline_array.h
Outdated
} | ||
|
||
for (int32_t i = 0; i < nbuffers; i++) { | ||
ArrowBufferInit(&private_data->variadic_buffers[n_current_bufs + i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrowBufferInit(&private_data->variadic_buffers[n_current_bufs + i]); | |
ArrowBufferInit(&private_data->variadic_buffers[n_current_bufs + i]); | |
private_data->variadic_buffer_sizes[i] = 0; |
NANOARROW_RETURN_NOT_OK( | ||
ArrowBufferAppend(data_buffer, value.data.data, value.size_bytes)); | ||
break; | ||
ArrowBufferAppend(variadic_buf, value.data.as_char, value.size_bytes)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for now, but the more usual pattern for building view arrays is to preallocate the data buffers (which reduces the frequency of reallocation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming we want to leave that to a follow up PR at this point, but is the idea that we would just pre-allocate the entire NANOARROW_BINARY_VIEW_BLOCK_SIZE
bytes? Or is there a more refined algorithm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be misunderstanding this, but would:
if (value.size_bytes < NANOARROW_BINARY_VIEW_BLOCK_SIZE) {
NANOARROW_RETURN_NOT_OK(ArrowBufferReserve(variadic_buf, NANOARROW_BINARY_VIEW_BLOCK_SIZE));
}
...just after adding a new variadic buffer solve this? (Feel free to ignore if this is off base...I agree that there's no bug here!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we wanted to reserve the size of the buffer we should do it after adding it with the call to ArrowArrayAddVariadicBuffers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The blocking logic is rather specific to our appending, and we should make sure that others can implement their own logic if they want, which might have nothing to do with our block size. (Also, value.size_bytes
is greater than the block size, that would result the initial reserve being immediately discarded).
src/nanoarrow/common/inline_array.h
Outdated
const union ArrowBufferViewData value_view = array_view->buffer_views[1].data; | ||
union ArrowBinaryViewType bvt; | ||
const size_t idx = sizeof(union ArrowBinaryViewType) * i; | ||
memcpy(&bvt, value_view.as_uint8 + idx, sizeof(union ArrowBinaryViewType)); | ||
const int32_t inline_size = bvt.inlined.size; | ||
view.size_bytes = inline_size; | ||
if (inline_size <= NANOARROW_BINARY_VIEW_INLINE_SIZE) { | ||
view.data.as_uint8 = value_view.as_uint8 + idx + | ||
sizeof(((union ArrowBinaryViewType*)0)->inlined.size); | ||
} else { | ||
const int32_t buf_index = bvt.ref.buffer_index; | ||
const int32_t nfixed_buf = NANOARROW_BINARY_VIEW_FIXED_BUFFERS; | ||
const struct ArrowBufferView variadic_vw = | ||
ArrowArrayViewBufferView(array_view, buf_index + nfixed_buf); | ||
view.data.as_uint8 = variadic_vw.data.as_uint8 + bvt.ref.offset; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd probably be best to add const struct ArrowBinaryView* as_binary_view;
to union ArrowBufferViewData
, and extract a function
static struct ArrowBufferView ArrowArrayViewGetBytesFromViewArrayUnsafe(
const struct ArrowArrayView* array_view, int64_t i) {
const struct ArrowBinaryView* bv = &array_view->buffer_views[1].data.as_binary_view[i];
struct ArrowBufferView out{.size_bytes = bv->inlined.size};
if (bv->inlined.size <= NANOARROW_BINARY_VIEW_INLINE_SIZE) {
out.data.as_uint8 = bv->inlined.data;
return out;
}
const int32_t buf_index = bv->ref.buf_index + NANOARROW_BINARY_VIEW_FIXED_BUFFERS;
out.data.data = array_view->array->buffers[buf_index];
out.data.as_uint8 += bv->ref.offset;
return out;
}
(replacing ArrowArrayViewBufferView).
Then we can rewrite this as:
const union ArrowBufferViewData value_view = array_view->buffer_views[1].data; | |
union ArrowBinaryViewType bvt; | |
const size_t idx = sizeof(union ArrowBinaryViewType) * i; | |
memcpy(&bvt, value_view.as_uint8 + idx, sizeof(union ArrowBinaryViewType)); | |
const int32_t inline_size = bvt.inlined.size; | |
view.size_bytes = inline_size; | |
if (inline_size <= NANOARROW_BINARY_VIEW_INLINE_SIZE) { | |
view.data.as_uint8 = value_view.as_uint8 + idx + | |
sizeof(((union ArrowBinaryViewType*)0)->inlined.size); | |
} else { | |
const int32_t buf_index = bvt.ref.buffer_index; | |
const int32_t nfixed_buf = NANOARROW_BINARY_VIEW_FIXED_BUFFERS; | |
const struct ArrowBufferView variadic_vw = | |
ArrowArrayViewBufferView(array_view, buf_index + nfixed_buf); | |
view.data.as_uint8 = variadic_vw.data.as_uint8 + bvt.ref.offset; | |
} | |
view = ArrowArrayViewGetBytesFromViewArrayUnsafe(string_array, i); |
and ArrowArrayViewGetStringUnsafe's case as
case NANOARROW_TYPE_STRING_VIEW:
case NANOARROW_TYPE_BINARY_VIEW: {
struct ArrowBufferView buf_view = ArrowArrayViewGetBytesFromViewArrayUnsafe(string_array, i);
view.data = buf_view.data.as_char;
view.size_bytes = buf_view.size_bytes;
break;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome idea - thanks!
d41254b
to
74418fd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more nits, looking good
a21f45a
to
313481f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is huge! Thank you!
closes #583