[C++] Scan Node reads multiple files #300

eddyxu · 2022-11-08T03:05:17Z

As the leaf execution node, ScanNode can now read multiple files within the same fragment, so the DataFragment / schema evolution details are hidden from the rest of the I/O execution stack (other than Take, probably).

changhiskhan · 2022-11-08T18:50:09Z

cpp/src/lance/io/exec/scan_test.cc

+}
+
+TEST_CASE("Scan with multiple readers") {


is it worth adding a test for nested and/or fixed_sized_list? i forgot if these all go through the same code path / offset calculation or not.

The offsets / actual reads happen within each FileReader. The Scan node just orchestrates the reads among the different files.
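To make the division of labour above concrete, here is a minimal sketch. `ToyBatch`, `ToyFileReader`, and `ScanRows` are hypothetical stand-ins invented for illustration; they are not the lance `FileReader` or `Scan` classes, and the reads here run sequentially for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch: column name -> values.
using ToyBatch = std::map<std::string, std::vector<int>>;

// Hypothetical stand-in for a per-file reader. Each reader owns the columns
// stored in one file and does its own offset handling.
struct ToyFileReader {
  ToyBatch columns;

  ToyBatch ReadRows(std::size_t offset, std::size_t length) const {
    ToyBatch out;
    for (const auto& [name, values] : columns) {
      assert(offset + length <= values.size());
      auto first = values.begin() + static_cast<std::ptrdiff_t>(offset);
      out[name] = std::vector<int>(first, first + static_cast<std::ptrdiff_t>(length));
    }
    return out;
  }
};

// The scan hands the same row range to every reader and stitches the
// returned columns together. It never computes file offsets itself.
ToyBatch ScanRows(const std::vector<ToyFileReader>& readers,
                  std::size_t offset,
                  std::size_t length) {
  ToyBatch merged;
  for (const auto& reader : readers) {
    for (auto& [name, values] : reader.ReadRows(offset, length)) {
      merged.emplace(name, std::move(values));
    }
  }
  return merged;
}
```

With one reader holding an `id` column and another holding a `score` column, `ScanRows(readers, 1, 2)` would return a two-column batch covering rows 1 and 2 of both files.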

changhiskhan · 2022-11-08T18:51:57Z

cpp/src/lance/arrow/utils.h

@@ -31,6 +32,15 @@ ::arrow::Result<std::shared_ptr<::arrow::RecordBatch>> MergeRecordBatches(
    const std::shared_ptr<::arrow::RecordBatch>& rhs,
    ::arrow::MemoryPool* pool = ::arrow::default_memory_pool());

+/// Merge a list of record batches into one.


add a quick note on what merge means here?
is it horizontal or vertical concat?

Sure, will do.
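For readers following along, the two meanings in question can be sketched as below. This is an illustration with hypothetical toy types (`ToyColumn`, `ToyTable`), not the Arrow API. The field-name filter in `utils.cc` suggests the column-wise (horizontal) reading, but the thread does not state it outright.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy table: named integer columns of equal length.
struct ToyColumn {
  std::string name;
  std::vector<int> values;
};
using ToyTable = std::vector<ToyColumn>;

// Horizontal merge: same rows, more columns.
ToyTable MergeHorizontal(ToyTable lhs, const ToyTable& rhs) {
  for (const auto& col : rhs) {
    // Every appended column must have the same number of rows.
    assert(lhs.empty() || col.values.size() == lhs.front().values.size());
    lhs.push_back(col);
  }
  return lhs;
}

// Vertical concat: same columns, more rows.
ToyTable ConcatVertical(ToyTable top, const ToyTable& bottom) {
  assert(top.size() == bottom.size());
  for (std::size_t i = 0; i < top.size(); ++i) {
    assert(top[i].name == bottom[i].name);
    top[i].values.insert(top[i].values.end(),
                         bottom[i].values.begin(),
                         bottom[i].values.end());
  }
  return top;
}
```

Merging an `id` table with a `score` table horizontally yields two columns with the original row count, while concatenating two `id` tables vertically yields one longer column.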

changhiskhan · 2022-11-08T18:53:13Z

cpp/src/lance/arrow/utils.cc

-    }
+  for (auto name :
+       rhs->struct_type()->fields()                                                      //
+           | views::filter([&lhs](auto& f) { return !lhs->GetFieldByName(f->name()); })  //


how do you feel about functional C++?

i wish it were simpler, like rust.

It simplifies a few for loops, but not all 😮‍💨

changhiskhan · 2022-11-08T18:55:01Z

cpp/src/lance/arrow/utils.cc

+    return nullptr;
+  }
+  auto batch = batches[0];
+  for (auto& b : batches | views::drop(1)) {


does this require that all RecordBatches being merged are of the same length? if so, should it be checked here or in the caller or somewhere else altogether?

good point. Added a check

It checks the length deep down the stack https://github.com/eto-ai/lance/blob/a6e788c37ea41f37cf0c5b993e041f3b262dafa6/cpp/src/lance/arrow/utils.cc#L69-L71

It is nice to raise here with clear context for sure.
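A sketch of what raising early with clear context could look like when folding a list of batches. `SizedBatch`, `MergeOutcome`, and `MergeAll` are hypothetical names for illustration only. The actual check added in this PR is not shown in the thread, and the real code works on `::arrow::RecordBatch` with `::arrow::Result`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a record batch: only names and a row count.
struct SizedBatch {
  std::vector<std::string> column_names;
  std::size_t num_rows = 0;
};

// Hypothetical result type: a batch on success, a message on failure.
struct MergeOutcome {
  std::optional<SizedBatch> batch;  // empty if nothing to merge or on error
  std::string error;                // non-empty only on failure
};

MergeOutcome MergeAll(const std::vector<SizedBatch>& batches) {
  if (batches.empty()) {
    return {std::nullopt, ""};  // assumed to mirror the nullptr return above
  }
  // Validate up front so the message can name the offending batch.
  for (std::size_t i = 1; i < batches.size(); ++i) {
    if (batches[i].num_rows != batches[0].num_rows) {
      return {std::nullopt,
              "MergeAll: batch " + std::to_string(i) + " has " +
                  std::to_string(batches[i].num_rows) + " rows, expected " +
                  std::to_string(batches[0].num_rows)};
    }
  }
  // Fold the remaining batches into the first one, column-wise.
  SizedBatch merged = batches[0];
  for (std::size_t i = 1; i < batches.size(); ++i) {
    merged.column_names.insert(merged.column_names.end(),
                               batches[i].column_names.begin(),
                               batches[i].column_names.end());
  }
  // Invariant: the merged batch keeps the shared row count.
  assert(merged.num_rows == batches[0].num_rows);
  return {merged, ""};
}
```

Checking at this level costs one extra pass over the list, but the error names the batch index and both row counts instead of surfacing from deeper in the stack.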

eddyxu added 11 commits November 7, 2022 17:34

add merge vectors

b02e062

merge mulitple

cad48cb

add test

697b880

functional merge

7b818de

allow scanner node to take mulitple files

d2f28eb

clean up

345a7a3

simplify

7855087

more docs

93933f1

parallel read

e6e6b5e

clean format

807ef49

better comments

0c3fbae

eddyxu marked this pull request as ready for review November 8, 2022 04:25

eddyxu requested a review from changhiskhan November 8, 2022 04:25

eddyxu self-assigned this Nov 8, 2022

eddyxu added the enhancement (New feature or request) and arrow (Apache Arrow related issues) labels Nov 8, 2022

eddyxu added 13 commits November 7, 2022 20:39

remove old Scan::Make interface

49302c8

remove unnedesary wait

a8311ad

move local copy of file reader

bd6cc38

move

31c36be

ref to future

6f70176

ref

1b8cc41

test

c293d7a

set more cpu

ad123fe

do not set cpu

3a100cb

keep minimal thread counts

7a9154c

cleanup

8d40c9b

add test for reading several files

0075140

simply test dont move

283adf9

changhiskhan reviewed Nov 8, 2022

address comments

638523e

changhiskhan approved these changes Nov 8, 2022

use checkout GHA v3

0d61063

eddyxu merged commit 39e47a8 into main Nov 8, 2022

eddyxu deleted the lei/scan_multiple_readers branch November 8, 2022 19:59
