Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor I/O exec nodes #136

Merged
merged 24 commits into from
Sep 4, 2022
Merged

Refactor I/O exec nodes #136

merged 24 commits into from
Sep 4, 2022

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Sep 2, 2022

  • Reconstruct the I/O execute node to a more standard physical plan, represented by an ExecNode tree.
  • Make the Scan node (leaf node) own the logic of batch read.
  • Add Take node that do random reads based on the indices returned from the Filter node.

@eddyxu eddyxu force-pushed the lei/refactory_io_exec branch from 2b52d6d to d5c56ad Compare September 2, 2022 23:26
@eddyxu eddyxu requested a review from changhiskhan September 3, 2022 05:32
@eddyxu eddyxu self-assigned this Sep 3, 2022
@eddyxu eddyxu marked this pull request as ready for review September 3, 2022 05:36
cpp/src/lance/encodings/binary.h Outdated Show resolved Hide resolved
cpp/src/lance/encodings/plain.cc Show resolved Hide resolved
cpp/src/lance/encodings/plain_test.cc Outdated Show resolved Hide resolved
@@ -46,6 +46,24 @@ TEST_CASE("Test Write Int32 array") {
}
}

TEST_CASE("Read not full batch") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to test that it's not scanning the whole column and returning 6-8? or is that already tested elsewhere?

Copy link
Contributor Author

@eddyxu eddyxu Sep 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This one is tested on the decoder, not the "columnar reader“ / ExecNode. It might need mock arrow::RandomAccessFile to verify. Should we do it in a separate PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i’m ok with that. Can you make a github issue for that then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #138

cpp/src/lance/format/schema.cc Outdated Show resolved Hide resolved
ARROW_ASSIGN_OR_RAISE(auto scan_schema, schema.Project(scan_options->filter));
ARROW_ASSIGN_OR_RAISE(auto scan_node,
Scan::Make(reader, scan_schema, scan_options->batch_size));
ARROW_ASSIGN_OR_RAISE(child, Filter::Make(scan_options->filter, std::move(scan_node)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to use different variable names here instead of reassigning to child? reads a bit confusing here.

Copy link
Contributor Author

@eddyxu eddyxu Sep 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one, child is like the head of a linked list. The reason writing in this way is that, there is an optional Limit in between Take and Filter.

Otherwise, the code might look like

filter_node.= Filter::Make()
if (limit) {
   limit_node = Limit::Make(filter_node);
   take_node = Take::Make(limit_node);
} else {
  take_node = Take::Make(filter_node);
}

Any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i kind of feel like that makes the logic clearer because it’s more explicit and the cost is only one extra line. I don’t feel super strongly about it so it’s not a blocker.

Copy link
Contributor Author

@eddyxu eddyxu Sep 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i can update this. I lean to think the current way is more mathmatically general, i.e., ”inserting one ExecNode to the head of chain", instead of making each ::Make a special case tho.

But I am fine with either way.

/// Build a chain of:
///
/// Take -> (optionally) Limit -> Filter -> Scan
ARROW_ASSIGN_OR_RAISE(auto scan_schema, schema.Project(scan_options->filter));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's call this a filter_scan instead of just a plain scan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean filter_scan_schema?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, filter_scan_schema instead of scan_schema
and was there also a filter_scan_node instead of scan_node? i forgot

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

cpp/src/lance/io/exec/project.cc Show resolved Hide resolved
cpp/src/lance/io/exec/scan.cc Show resolved Hide resolved
cpp/src/lance/io/exec/take.cc Outdated Show resolved Hide resolved
@eddyxu eddyxu force-pushed the lei/refactory_io_exec branch from fbd1013 to a8513c2 Compare September 3, 2022 20:50
@eddyxu
Copy link
Contributor Author

eddyxu commented Sep 3, 2022

Rebased to main.

Copy link
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of nits remaining

@eddyxu eddyxu merged commit 728df90 into main Sep 4, 2022
@eddyxu eddyxu deleted the lei/refactory_io_exec branch September 4, 2022 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants