Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DUCKDB] Native duckdb lance reader #347

Merged
merged 20 commits into from
Dec 5, 2022
Merged

[DUCKDB] Native duckdb lance reader #347

merged 20 commits into from
Dec 5, 2022

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Dec 3, 2022

Support SELECT * from lance_scan('s3://URI') directly via duckdb extension.

@eddyxu eddyxu marked this pull request as ready for review December 4, 2022 07:30
@eddyxu
Copy link
Contributor Author

eddyxu commented Dec 4, 2022

Screenshot 2022-12-03 at 10 42 45 PM

Screenshot 2022-12-03 at 11 20 51 PM

@eddyxu eddyxu requested a review from changhiskhan December 4, 2022 07:54
@eddyxu eddyxu self-assigned this Dec 4, 2022
@eddyxu eddyxu added c++ C++ issues duckdb arrow Apache Arrow related issues labels Dec 4, 2022
@eddyxu
Copy link
Contributor Author

eddyxu commented Dec 4, 2022

Filed #348 to track the optimization for zero-copy arrow=>ducdkb conversion.

@eddyxu eddyxu changed the title [DUCKDB] Native duckdb reader [DUCKDB] Native duckdb lance reader Dec 4, 2022
Copy link
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to add some unit tests?
perf optimization can happen separately

// TODO: dynamic_pointer_cast does not work here, IDK why.
auto array = std::static_pointer_cast<typename ::arrow::TypeTraits<T>::ArrayType>(arr);
assert(array != nullptr);
// TODO: How to use zero copy to move data from arrow to duckdb.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duckdb-pyarrow integration is zero-copy right? i forgot how but vaguely remember seeing code that uses the Arrow C structs to dress up arrow memory buffer as duckdb vector. not a blocker for this tho i think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked a few places, i.e.

Seems no code that convert from arrow => duckdb, it only has the reversed conversion (i..e, duckdb => arrow).

I can look into it more later. Filed #348 to track the work.

out->SetVectorType(::duckdb::VectorType::FLAT_VECTOR);
}

/// Convert a String array into duckdb vector.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binary array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. fixed.

auto indices_arr = std::static_pointer_cast<::arrow::Int8Array>(array->indices());
for (int i = 0; i < indices_arr->length(); ++i) {
auto idx = indices_arr->Value(i);
out->SetValue(i, std::string(dict_arr->Value(idx)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder whether this fixes the duckdb-arrow issue where list of dictionary querying doesn't work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea probally, we can actually make (a hardcoded set of ) user defined types work via our plugins tho.

@eddyxu eddyxu merged commit d868ad2 into main Dec 5, 2022
@eddyxu eddyxu deleted the lei/scan_lance branch December 5, 2022 02:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Apache Arrow related issues c++ C++ issues duckdb
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants