Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Scan Node reads multiple files #300

Merged
merged 26 commits into from
Nov 8, 2022
Merged

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Nov 8, 2022

As the leaf execution node, allows ScanNode to reads multiple files within the same fragment, thus the DataFragment / schema evolution details are hidden from the rest of the I/O execution stack (other than Take probably.)

@eddyxu eddyxu marked this pull request as ready for review November 8, 2022 04:25
@eddyxu eddyxu requested a review from changhiskhan November 8, 2022 04:25
@eddyxu eddyxu self-assigned this Nov 8, 2022
@eddyxu eddyxu added enhancement New feature or request arrow Apache Arrow related issues labels Nov 8, 2022
}

TEST_CASE("Scan with multiple readers") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth adding a test for nested and/or fixed_sized_list? i forgot if these all go through the same code path / offset calculation or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offsets / actual reads happen within each FileReader. The Scan node just orchestrates the reads amount of different files.

@@ -31,6 +32,15 @@ ::arrow::Result<std::shared_ptr<::arrow::RecordBatch>> MergeRecordBatches(
const std::shared_ptr<::arrow::RecordBatch>& rhs,
::arrow::MemoryPool* pool = ::arrow::default_memory_pool());

/// Merge a list of record batches into one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a quick note on what merge means here?
is it horizontal or vertical concat?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do.

}
for (auto name :
rhs->struct_type()->fields() //
| views::filter([&lhs](auto& f) { return !lhs->GetFieldByName(f->name()); }) //
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you feel about functional C++?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wish it is simpler like rust.

It simplifies a few for for loops, but not all , 😮‍💨

return nullptr;
}
auto batch = batches[0];
for (auto& b : batches | views::drop(1)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this require that all RecordBatches being merged are of the same length? if so, should it be checked here or in the caller or somewhere else altogether?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good �point. Added a check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It checks the length deep down the stack https://github.com/eto-ai/lance/blob/a6e788c37ea41f37cf0c5b993e041f3b262dafa6/cpp/src/lance/arrow/utils.cc#L69-L71

It is nice to raise here with clear context for sure.

@eddyxu eddyxu merged commit 39e47a8 into main Nov 8, 2022
@eddyxu eddyxu deleted the lei/scan_multiple_readers branch November 8, 2022 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Apache Arrow related issues enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants