Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add encoding for FixedSizeBinary and FixedSizeList #129

Merged
merged 14 commits into from
Aug 26, 2022
Merged

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Aug 25, 2022

Closes #15

@eddyxu eddyxu requested a review from changhiskhan August 25, 2022 06:20
@eddyxu eddyxu self-assigned this Aug 25, 2022
@eddyxu eddyxu added enhancement New feature or request c++ C++ issues arrow Apache Arrow related issues labels Aug 25, 2022
@eddyxu
Copy link
Contributor Author

eddyxu commented Aug 25, 2022

One thing I am not very sure is whether we should merge this code with plain encoding. Conceptually they are very similar.

Copy link
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean about this vs plain codec. Your understanding of the code is better than mine, but based on my reading, I would probably prefer to have a separate abstraction for this. The plain codec can be kept simple and be made reusable.

Is there any particular reason you want to merge the two? Other just reducing LOC?

@@ -181,6 +182,41 @@ ::arrow::Result<std::shared_ptr<::arrow::DataType>> FromLogicalType(
"FromLogicalType: logical_type \"{}\" is not supported yet", logical_type.to_string()));
}

::arrow::Result<std::shared_ptr<::arrow::ArrayBuilder>> GetArrayBuilder(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for numeric types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to support more.

auto a = std::static_pointer_cast<::arrow::FixedSizeBinaryArray>(arr);
ARROW_RETURN_NOT_OK(out_->Write(a->raw_values(), a->length() * a->byte_width()));
} else if (lance::arrow::is_fixed_size_list(arr->type())) {
auto list_arr = std::dynamic_pointer_cast<::arrow::FixedSizeListArray>(arr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use a dynamic cast here? i thought we already first check that it's a FixedSizeListArray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I was thinking of being consistent to use a dynamic cast to cast base class to child class. dynamic_cast is safer w.r.t. to virtual tables and etc.

I should use the same in line 37 to use dynamic_cast tho.


::arrow::Result<std::shared_ptr<::arrow::Array>> FixedSizedBinaryDecoder::ToFixedSizeBinaryArray(
int32_t start, int32_t length) const {
auto bytes = type_->byte_width();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd probably call this nbytes or byte_width?

std::shared_ptr<::arrow::Int32Array> indices) const {
std::shared_ptr<::arrow::ArrayBuilder> builder;
if (lance::arrow::is_fixed_size_list(type_)) {
auto list_type = std::dynamic_pointer_cast<::arrow::FixedSizeListType>(type_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question here wrt dynamic cast

return ::arrow::Status::Invalid("FixedSizeBuilderDecoder::Take: Invalid data type: ", type_);
}

// TODO: Use thread pool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be done for all decoders? Or just ones where we may end up reading large amounts of data for each scalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done if we choose to pick "scalar" individually. Usually it works better for the case where scalar is big, i.e., large binary, tensors, images and etc.

Other cases, i.e., for numeric values, it is faster to read a block and then filter them. (i.e., 64KB = 8000 int64 values in one read)

::arrow::Result<std::shared_ptr<::arrow::Array>> FixedSizedBinaryDecoder::ToFixedSizeBinaryArray(
int32_t start, int32_t length) const {
auto bytes = type_->byte_width();
ARROW_ASSIGN_OR_RAISE(auto buf, infile_->ReadAt(position_ + start * bytes, length * bytes));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for fixed size binary it's just a plain encoder/decoder using fixed width data (e.g., any of the numeric dtypes) right? How come we don't just call plain_decoder like you do with the fixed size list handling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes , we can definitely do it.


namespace lance::encodings {

FixedSizeBinaryEncoder::FixedSizeBinaryEncoder(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called FixedSizeBinaryEncoder/Decoder but it handles both FixedSizeBinary and FixedSizeList. Would it be better to call it FixedSizeEncoder/Decoder or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. So this encoding is in contrast to VarBinary, (var-sized binary).
Maybe this is another argument to support merging with PlainEncoding

@eddyxu
Copy link
Contributor Author

eddyxu commented Aug 25, 2022

Is there any particular reason you want to merge the two? Other just reducing LOC?

So conceptually, plain encoder is a special case of a fixed-size binary encoding with byte_width = size of(primitive type). They supposedly share the same code path, and future same optimizations as well.

It use less "encoding enum" too, so in other language re-implementation, one less encoder to implement as well.

Parquet uses the plain encoding for all of them tho.

https://parquet.apache.org/docs/file-format/data-pages/encodings/

@eddyxu eddyxu changed the title Add fixed size binary encoding Add encoding for FixedSizeBinary and FixedSizeList Aug 25, 2022
Copy link
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main remaining item: do we also need to add boolean and timestamp since sophon is going to need it?
One thing with timestamps is that it's a parametric type so you'll need the unit. The default conversion from pandas date range is timestamp[ns].
Or is that a separate thing?

Looking at the code, you're right that merging it makes more sense. Code is simpler this way.

return std::make_shared<::arrow::FixedSizeListBuilder>(pool, value_builder, list_type);
}

::arrow::Result<std::shared_ptr<::arrow::ArrayBuilder>> GetArrayBuilder(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sophon is going to need some datetime fields. Should those be added here as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also boolean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do

start,
length.value(),
length_));
}
auto bytes = std::max(1, ::arrow::bit_width(type_->id()) / 8);
auto nbyes = type_->byte_width();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: since you called it width above, let's just stick with that to be consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@eddyxu eddyxu merged commit e711e8c into main Aug 26, 2022
@eddyxu eddyxu deleted the lei/fix_length_binary branch August 26, 2022 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Apache Arrow related issues c++ C++ issues enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fixed size binary encoding
2 participants