Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic extension type handling #90

Merged
merged 8 commits into from
Aug 13, 2022
Merged

Basic extension type handling #90

merged 8 commits into from
Aug 13, 2022

Conversation

changhiskhan
Copy link
Contributor

I'm sure I'm missing some edge cases but seems like this gives us what we want?

@changhiskhan changhiskhan requested a review from eddyxu August 9, 2022 08:52
cpp/src/lance/arrow/CMakeLists.txt Outdated Show resolved Hide resolved
}

std::string Serialize() const override {
std::string result(" ");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this for? copy in memory?

Lets use something like https://github.com/eto-ai/lance/blob/main/cpp/src/lance/io/endian.h?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string buffer has 1 "\n" , right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i copy-pasted this from Arrow test codebase. It's just serializing the type not the data. This isn't trying to cover all data types.

explicit ParametricType(int32_t parameter)
: ExtensionType(::arrow::int32()), parameter_(parameter) {}

int32_t parameter() const { return parameter_; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why parameter is an int32?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just a test class, doesn't really matter what the parameter is.

cpp/src/lance/arrow/type.cc Show resolved Hide resolved
cpp/src/lance/arrow/type.cc Outdated Show resolved Hide resolved
std::shared_ptr<::arrow::Field> Field::ToArrow() const { return ::arrow::field(name(), type()); }
std::shared_ptr<::arrow::Field> Field::ToArrow() {
if (is_extension_field()) {
auto ext_type = ::arrow::GetExtensionType(extension_name());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this handle project?
For excample, if an image object has <URI: str, data:binary>. can it support SELECT image.uri FROM dataset?

I am ok if there is another issue to address that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me add a test case to check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if image is of ImageType extension, projecting image.uri returns the storage schema. So in the returned schema it's got image as a struct and uri as the utf8 sub-field.

cpp/src/lance/format/schema.h Outdated Show resolved Hide resolved
cpp/src/lance/format/schema.h Outdated Show resolved Hide resolved
@@ -138,6 +138,10 @@ class Field final {

std::string logical_type() const { return logical_type_; };

std::string extension_name() const { return extension_name_; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob can make the method to const std::string& extension_name() here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was trying to stay consistent with logical_type and name. Any particular reason this should be different?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yea, lets change logical type to const string& , avoid a copy each time calling the string.

@@ -201,10 +205,13 @@ class Field final {
// TODO: use enum to replace protobuf enum.
pb::Field::Type GetNodeType() const;

void init(std::shared_ptr<::arrow::DataType> dtype);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No error expect to happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just moving the existing code to an init method so I can call either with the input dtype or with the storage type. What errors/exceptions should I be expecting here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets change the method (not getter / setter) name to CapCase to be consistent w/ arrow and google

@changhiskhan changhiskhan requested a review from eddyxu August 10, 2022 23:00
@changhiskhan changhiskhan force-pushed the changhiskhan/ext-type branch from 7ae2eb9 to 76f43bd Compare August 13, 2022 00:59
@changhiskhan changhiskhan merged commit b5b2992 into main Aug 13, 2022
@changhiskhan changhiskhan deleted the changhiskhan/ext-type branch August 13, 2022 01:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants