-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] Scan Node reads multiple files #300
Conversation
} | ||
|
||
TEST_CASE("Scan with multiple readers") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth adding a test for nested and/or fixed_sized_list? i forgot if these all go through the same code path / offset calculation or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The offsets / actual reads happen within each FileReader
. The Scan
node just orchestrates the reads amount of different files.
cpp/src/lance/arrow/utils.h
Outdated
@@ -31,6 +32,15 @@ ::arrow::Result<std::shared_ptr<::arrow::RecordBatch>> MergeRecordBatches( | |||
const std::shared_ptr<::arrow::RecordBatch>& rhs, | |||
::arrow::MemoryPool* pool = ::arrow::default_memory_pool()); | |||
|
|||
/// Merge a list of record batches into one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a quick note on what merge means here?
is it horizontal or vertical concat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do.
} | ||
for (auto name : | ||
rhs->struct_type()->fields() // | ||
| views::filter([&lhs](auto& f) { return !lhs->GetFieldByName(f->name()); }) // |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do you feel about functional C++?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wish it is simpler like rust.
It simplifies a few for for loops, but not all , 😮💨
return nullptr; | ||
} | ||
auto batch = batches[0]; | ||
for (auto& b : batches | views::drop(1)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this require that all RecordBatches being merged are of the same length? if so, should it be checked here or in the caller or somewhere else altogether?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good �point. Added a check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It checks the length deep down the stack https://github.com/eto-ai/lance/blob/a6e788c37ea41f37cf0c5b993e041f3b262dafa6/cpp/src/lance/arrow/utils.cc#L69-L71
It is nice to raise here with clear context for sure.
As the leaf execution node, allows ScanNode to reads multiple files within the same fragment, thus the DataFragment / schema evolution details are hidden from the rest of the I/O execution stack (other than Take probably.)