-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement CountRows for fast path of countings #221
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nits/questions
@@ -71,6 +71,31 @@ ::arrow::Result<std::shared_ptr<::arrow::Schema>> LanceFileFormat::Inspect( | |||
return impl_->manifest->schema().ToArrow(); | |||
} | |||
|
|||
::arrow::Future<::arrow::util::optional<int64_t>> LanceFileFormat::CountRows( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this already called from pyarrow or do we need extra python integration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyarrow
calls C++ Scanner.CountRows() which eventually calls this method.
ARROW_ASSIGN_OR_RAISE(auto infile, file->source().Open()); | ||
ARROW_ASSIGN_OR_RAISE(auto reader, | ||
lance::io::FileReader::Make(infile, this->impl_->manifest)); | ||
return ::arrow::util::make_optional(reader->length()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot but I'm assuming reader length returns the right thing even if it's ListArray's and whatnot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is returning the number of "rows" in the dataset. It does not read the actual data, so no matter what data type that is in the file.
|
||
auto dataset = MakeDataset(table, {}, 32).ValueOrDie(); | ||
auto scanner = dataset->NewScan().ValueOrDie()->Finish().ValueOrDie(); | ||
CHECK(scanner->CountRows().ValueOrDie() == 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway to check that the fast core path is hit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some print messages to make sure path is hit.
To have appropriate tests, i guess we can either make some dependency ingestions into the class, or, doing some surgery that writes a file with "correct metadata", but no data?
Or, set up some code coverage tool in CI?
Should we do that in a follow up ticket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm ok with that, I guess for now just verify by hand if you do count rows on coco it doesn't take forever :)
Closes #219