-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: make it possible to use rowid and rowaddr in filters #2973
Conversation
batch_size: Optional[int] = None, | ||
batch_readahead: Optional[int] = None, | ||
fragment_readahead: Optional[int] = None, | ||
scan_in_order: bool = True, | ||
scan_in_order: bool = None, | ||
fragments: Optional[Iterable[LanceFragment]] = None, | ||
full_text_query: Optional[Union[str, dict]] = None, | ||
*, | ||
prefilter: bool = False, | ||
with_row_id: bool = False, | ||
with_row_address: bool = False, | ||
use_stats: bool = True, | ||
fast_search: bool = False, | ||
prefilter: bool = None, | ||
with_row_id: bool = None, | ||
with_row_address: bool = None, | ||
use_stats: bool = None, | ||
fast_search: bool = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes in defaults should not be breaking changes since the defaults in ScannerBuilder
match the defaults that used to be here.
By using None
we can easily tell if the user is specifying a non-default value, in which case we will override whatever is in default_scan_options
.
/// the dataset schema (`dataset_schema`). This means that Substrait will | ||
/// not be able to access columns that are not in the dataset schema (e.g. | ||
/// _rowid, _rowaddr, etc.) | ||
#[allow(unused)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was needed because dataset_schema
is unused if the substrait
feature is not specified.
/// | ||
/// The schema for this conversion should be the full schema available to | ||
/// the filter (`full_schema`). However, due to a limitation in the way | ||
/// we do Substrait conversion today we can only do Substrait conversion with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lilmitation has been filed as a follow-up in #2972
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2973 +/- ##
==========================================
+ Coverage 78.24% 78.27% +0.03%
==========================================
Files 240 240
Lines 77284 78399 +1115
Branches 77284 78399 +1115
==========================================
+ Hits 60470 61370 +900
- Misses 13696 13910 +214
- Partials 3118 3119 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
RT.block_on(async { scanner.filter_substrait(substrait).await })?; | ||
RT.block_on(async { scanner.filter_substrait(substrait) })?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 Why shouldn't we await that future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method is no longer async. Now we don't actually compile the filters until scan time (since the input schema may depend on other calls to the scanner builder). The scanner.filter
and scanner.filter_substrait
methods just record whatever the user passes in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's no longer async, you could also consider removing the wrapping RT.block_on
. I think that would be a lot clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦
ds = lance.write_dataset(tab, str(tmp_path)) | ||
ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unimportant, but these should take paths as-is:
ds = lance.write_dataset(tab, str(tmp_path)) | |
ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True}) | |
ds = lance.write_dataset(tab, tmp_path) | |
ds = lance.dataset(tmp_path, default_scan_options={"with_row_id": True}) |
d231010
to
1a142da
Compare
1a142da
to
c07a0b7
Compare
c07a0b7
to
cce144e
Compare
This is particularly useful for operations like "delete by id"
I ran into a fair bit of difficulty with this PR and punted a few things to follow-ups (#2971 #2972).