Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: make it possible to use rowid and rowaddr in filters #2973

Merged
merged 1 commit into from
Oct 25, 2024

Conversation

westonpace
Copy link
Contributor

This is particularly useful for operations like "delete by id"

I ran into a fair bit of difficulty with this PR and punted a few things to follow-ups (#2971 #2972).

@github-actions github-actions bot added enhancement New feature or request python java labels Oct 3, 2024
Comment on lines 272 to +278
batch_size: Optional[int] = None,
batch_readahead: Optional[int] = None,
fragment_readahead: Optional[int] = None,
scan_in_order: bool = True,
scan_in_order: bool = None,
fragments: Optional[Iterable[LanceFragment]] = None,
full_text_query: Optional[Union[str, dict]] = None,
*,
prefilter: bool = False,
with_row_id: bool = False,
with_row_address: bool = False,
use_stats: bool = True,
fast_search: bool = False,
prefilter: bool = None,
with_row_id: bool = None,
with_row_address: bool = None,
use_stats: bool = None,
fast_search: bool = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes in defaults should not be breaking changes since the defaults in ScannerBuilder match the defaults that used to be here.

By using None we can easily tell if the user is specifying a non-default value, in which case we will override whatever is in default_scan_options.

/// the dataset schema (`dataset_schema`). This means that Substrait will
/// not be able to access columns that are not in the dataset schema (e.g.
/// _rowid, _rowaddr, etc.)
#[allow(unused)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed because dataset_schema is unused if the substrait feature is not specified.

///
/// The schema for this conversion should be the full schema available to
/// the filter (`full_schema`). However, due to a limitation in the way
/// we do Substrait conversion today we can only do Substrait conversion with
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lilmitation has been filed as a follow-up in #2972

@codecov-commenter
Copy link

codecov-commenter commented Oct 3, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 20 lines in your changes missing coverage. Please review.

Project coverage is 78.27%. Comparing base (f17d88d) to head (cce144e).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance/src/dataset/scanner.rs 66.00% 10 Missing and 7 partials ⚠️
java/core/lance-jni/src/blocking_scanner.rs 0.00% 1 Missing ⚠️
rust/lance-core/src/datatypes/schema.rs 87.50% 1 Missing ⚠️
rust/lance/src/dataset/fragment.rs 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2973      +/-   ##
==========================================
+ Coverage   78.24%   78.27%   +0.03%     
==========================================
  Files         240      240              
  Lines       77284    78399    +1115     
  Branches    77284    78399    +1115     
==========================================
+ Hits        60470    61370     +900     
- Misses      13696    13910     +214     
- Partials     3118     3119       +1     
Flag Coverage Δ
unittests 78.27% <66.66%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

RT.block_on(async { scanner.filter_substrait(substrait).await })?;
RT.block_on(async { scanner.filter_substrait(substrait) })?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Why shouldn't we await that future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is no longer async. Now we don't actually compile the filters until scan time (since the input schema may depend on other calls to the scanner builder). The scanner.filter and scanner.filter_substrait methods just record whatever the user passes in.

Copy link
Contributor

@wjones127 wjones127 Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's no longer async, you could also consider removing the wrapping RT.block_on. I think that would be a lot clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦

python/python/tests/test_integration.py Show resolved Hide resolved
Comment on lines +13 to +14
ds = lance.write_dataset(tab, str(tmp_path))
ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unimportant, but these should take paths as-is:

Suggested change
ds = lance.write_dataset(tab, str(tmp_path))
ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
ds = lance.write_dataset(tab, tmp_path)
ds = lance.dataset(tmp_path, default_scan_options={"with_row_id": True})

@westonpace westonpace force-pushed the feat/meta-cols-in-filter branch from d231010 to 1a142da Compare October 13, 2024 10:46
@westonpace westonpace force-pushed the feat/meta-cols-in-filter branch from 1a142da to c07a0b7 Compare October 24, 2024 15:29
@westonpace westonpace force-pushed the feat/meta-cols-in-filter branch from c07a0b7 to cce144e Compare October 25, 2024 01:46
@westonpace westonpace merged commit f6e7a66 into lancedb:main Oct 25, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request java python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants