-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add encoding for FixedSizeBinary and FixedSizeList #129
Conversation
One thing I am not very sure is whether we should merge this code with plain encoding. Conceptually they are very similar. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean about this vs plain codec. Your understanding of the code is better than mine, but based on my reading, I would probably prefer to have a separate abstraction for this. The plain codec can be kept simple and be made reusable.
Is there any particular reason you want to merge the two? Other just reducing LOC?
@@ -181,6 +182,41 @@ ::arrow::Result<std::shared_ptr<::arrow::DataType>> FromLogicalType( | |||
"FromLogicalType: logical_type \"{}\" is not supported yet", logical_type.to_string())); | |||
} | |||
|
|||
::arrow::Result<std::shared_ptr<::arrow::ArrayBuilder>> GetArrayBuilder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only for numeric types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to support more.
auto a = std::static_pointer_cast<::arrow::FixedSizeBinaryArray>(arr); | ||
ARROW_RETURN_NOT_OK(out_->Write(a->raw_values(), a->length() * a->byte_width())); | ||
} else if (lance::arrow::is_fixed_size_list(arr->type())) { | ||
auto list_arr = std::dynamic_pointer_cast<::arrow::FixedSizeListArray>(arr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why use a dynamic cast here? i thought we already first check that it's a FixedSizeListArray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was thinking of being consistent to use a dynamic cast to cast base class to child class. dynamic_cast
is safer w.r.t. to virtual tables and etc.
I should use the same in line 37 to use dynamic_cast tho.
|
||
::arrow::Result<std::shared_ptr<::arrow::Array>> FixedSizedBinaryDecoder::ToFixedSizeBinaryArray( | ||
int32_t start, int32_t length) const { | ||
auto bytes = type_->byte_width(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd probably call this nbytes or byte_width?
std::shared_ptr<::arrow::Int32Array> indices) const { | ||
std::shared_ptr<::arrow::ArrayBuilder> builder; | ||
if (lance::arrow::is_fixed_size_list(type_)) { | ||
auto list_type = std::dynamic_pointer_cast<::arrow::FixedSizeListType>(type_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question here wrt dynamic cast
return ::arrow::Status::Invalid("FixedSizeBuilderDecoder::Take: Invalid data type: ", type_); | ||
} | ||
|
||
// TODO: Use thread pool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be done for all decoders? Or just ones where we may end up reading large amounts of data for each scalar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be done if we choose to pick "scalar" individually. Usually it works better for the case where scalar is big, i.e., large binary, tensors, images and etc.
Other cases, i.e., for numeric values, it is faster to read a block and then filter them. (i.e., 64KB = 8000 int64 values in one read)
::arrow::Result<std::shared_ptr<::arrow::Array>> FixedSizedBinaryDecoder::ToFixedSizeBinaryArray( | ||
int32_t start, int32_t length) const { | ||
auto bytes = type_->byte_width(); | ||
ARROW_ASSIGN_OR_RAISE(auto buf, infile_->ReadAt(position_ + start * bytes, length * bytes)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for fixed size binary it's just a plain encoder/decoder using fixed width data (e.g., any of the numeric dtypes) right? How come we don't just call plain_decoder like you do with the fixed size list handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes , we can definitely do it.
|
||
namespace lance::encodings { | ||
|
||
FixedSizeBinaryEncoder::FixedSizeBinaryEncoder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called FixedSizeBinaryEncoder/Decoder but it handles both FixedSizeBinary and FixedSizeList. Would it be better to call it FixedSizeEncoder/Decoder or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. So this encoding is in contrast to VarBinary
, (var-sized binary).
Maybe this is another argument to support merging with PlainEncoding
So conceptually, plain encoder is a special case of a fixed-size binary encoding with It use less "encoding enum" too, so in other language re-implementation, one less encoder to implement as well. Parquet uses the plain encoding for all of them tho. https://parquet.apache.org/docs/file-format/data-pages/encodings/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main remaining item: do we also need to add boolean and timestamp since sophon is going to need it?
One thing with timestamps is that it's a parametric type so you'll need the unit. The default conversion from pandas date range is timestamp[ns].
Or is that a separate thing?
Looking at the code, you're right that merging it makes more sense. Code is simpler this way.
return std::make_shared<::arrow::FixedSizeListBuilder>(pool, value_builder, list_type); | ||
} | ||
|
||
::arrow::Result<std::shared_ptr<::arrow::ArrayBuilder>> GetArrayBuilder( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sophon is going to need some datetime fields. Should those be added here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and also boolean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do
cpp/src/lance/encodings/plain.cc
Outdated
start, | ||
length.value(), | ||
length_)); | ||
} | ||
auto bytes = std::max(1, ::arrow::bit_width(type_->id()) / 8); | ||
auto nbyes = type_->byte_width(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: since you called it width
above, let's just stick with that to be consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Closes #15