feat: make it possible to use rowid and rowaddr in filters #2973

westonpace · 2024-10-03T16:53:17Z

This is particularly useful for operations like "delete by id".

I ran into a fair bit of difficulty with this PR and punted a few things to follow-ups (#2971 #2972).

westonpace · 2024-10-03T16:54:46Z

python/python/lance/dataset.py

        batch_size: Optional[int] = None,
        batch_readahead: Optional[int] = None,
        fragment_readahead: Optional[int] = None,
-        scan_in_order: bool = True,
+        scan_in_order: bool = None,
        fragments: Optional[Iterable[LanceFragment]] = None,
        full_text_query: Optional[Union[str, dict]] = None,
        *,
-        prefilter: bool = False,
-        with_row_id: bool = False,
-        with_row_address: bool = False,
-        use_stats: bool = True,
-        fast_search: bool = False,
+        prefilter: bool = None,
+        with_row_id: bool = None,
+        with_row_address: bool = None,
+        use_stats: bool = None,
+        fast_search: bool = None,


These changes in defaults should not be breaking changes since the defaults in ScannerBuilder match the defaults that used to be here.

By using None we can easily tell whether the user explicitly passed a value, in which case that value overrides whatever is in default_scan_options.
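The None-as-sentinel approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `BUILDER_DEFAULTS` and `resolve_scan_options` are hypothetical names, and the default values are simply the ones removed from the signature in the diff above.

```python
from typing import Any, Dict, Optional

# Hypothetical table of builder-level defaults (assumption: these mirror the
# values that used to be hard-coded in the Python signature).
BUILDER_DEFAULTS: Dict[str, Any] = {
    "scan_in_order": True,
    "prefilter": False,
    "with_row_id": False,
    "with_row_address": False,
    "use_stats": True,
    "fast_search": False,
}


def resolve_scan_options(
    default_scan_options: Optional[Dict[str, Any]] = None,
    **call_kwargs: Optional[bool],
) -> Dict[str, Any]:
    """Merge options with precedence: explicit call > dataset defaults > builder defaults."""
    resolved = dict(BUILDER_DEFAULTS)
    # Dataset-level defaults override the builder defaults.
    resolved.update(default_scan_options or {})
    # None means "not specified by the caller", so only non-None values override.
    resolved.update({k: v for k, v in call_kwargs.items() if v is not None})
    return resolved


# An explicit False still wins over a dataset-level True, which would be
# impossible to detect if the parameter default were False instead of None.
opts = resolve_scan_options({"with_row_id": True}, with_row_id=False)
```

With a `False` default in the signature, an explicit `with_row_id=False` would be indistinguishable from "not passed", so the dataset-level default could never be turned off per call.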

westonpace · 2024-10-03T16:55:25Z

rust/lance/src/dataset/scanner.rs

+    /// the dataset schema (`dataset_schema`).  This means that Substrait will
+    /// not be able to access columns that are not in the dataset schema (e.g.
+    /// _rowid, _rowaddr, etc.)
+    #[allow(unused)]


This was needed because dataset_schema is unused if the substrait feature is not specified.

westonpace · 2024-10-03T16:55:56Z

rust/lance/src/dataset/scanner.rs

+    ///
+    /// The schema for this conversion should be the full schema available to
+    /// the filter (`full_schema`).  However, due to a limitation in the way
+    /// we do Substrait conversion today we can only do Substrait conversion with


This limitation has been filed as a follow-up in #2972.

codecov-commenter · 2024-10-03T17:17:35Z

Codecov Report

Attention: Patch coverage is 66.66667% with 20 lines in your changes missing coverage. Please review.

Project coverage is 78.27%. Comparing base (f17d88d) to head (cce144e).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/scanner.rs	66.00%	10 Missing and 7 partials ⚠️
java/core/lance-jni/src/blocking_scanner.rs	0.00%	1 Missing ⚠️
rust/lance-core/src/datatypes/schema.rs	87.50%	1 Missing ⚠️
rust/lance/src/dataset/fragment.rs	0.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2973      +/-   ##
==========================================
+ Coverage   78.24%   78.27%   +0.03%     
==========================================
  Files         240      240              
  Lines       77284    78399    +1115     
  Branches    77284    78399    +1115     
==========================================
+ Hits        60470    61370     +900     
- Misses      13696    13910     +214     
- Partials     3118     3119       +1

Flag	Coverage Δ
unittests	`78.27% <66.66%> (+0.03%)`	⬆️

wjones127 · 2024-10-03T17:03:21Z

java/core/lance-jni/src/blocking_scanner.rs

-        RT.block_on(async { scanner.filter_substrait(substrait).await })?;
+        RT.block_on(async { scanner.filter_substrait(substrait) })?;


🤔 Why shouldn't we await that future?

westonpace · 2024-10-04

The method is no longer async. Now we don't actually compile the filters until scan time (since the input schema may depend on other calls to the scanner builder). The scanner.filter and scanner.filter_substrait methods just record whatever the user passes in.

wjones127 · 2024-10-04

If it's no longer async, you could also consider removing the wrapping RT.block_on. I think that would be a lot clearer.
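The "record now, compile at scan time" behavior discussed in this thread can be illustrated with a small sketch. `LazyFilterScanner` is a hypothetical stand-in written for this example; it is not the Lance scanner API, and the equality-only filter is a simplification.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


class LazyFilterScanner:
    """Hypothetical builder that defers filter validation until scan time."""

    def __init__(self, columns: Iterable[str]) -> None:
        self._columns: List[str] = list(columns)
        self._filter: Optional[Tuple[str, Any]] = None
        self._with_row_id = False

    def filter(self, column: str, value: Any) -> "LazyFilterScanner":
        # Only record the filter. The schema is not consulted here because
        # later builder calls may still change which columns are available.
        self._filter = (column, value)
        return self

    def with_row_id(self) -> "LazyFilterScanner":
        self._with_row_id = True
        return self

    def _full_schema(self) -> List[str]:
        # The schema visible to the filter depends on the final builder state.
        extra = ["_rowid"] if self._with_row_id else []
        return self._columns + extra

    def scan(self, rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
        if self._filter is None:
            return list(rows)
        column, value = self._filter
        # "Compilation" happens here, against the fully resolved schema.
        if column not in self._full_schema():
            raise KeyError(f"unknown column in filter: {column}")
        return [row for row in rows if row[column] == value]


rows = [{"x": 10, "_rowid": 0}, {"x": 20, "_rowid": 1}]
# The filter is set before with_row_id(); this only works because the
# filter is resolved at scan time rather than when filter() is called.
matched = LazyFilterScanner(["x"]).filter("_rowid", 1).with_row_id().scan(rows)
```

Because `filter` does no work beyond storing its arguments, it has no reason to be async, which is the point made in the reply above.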

wjones127 · 2024-10-03T17:08:32Z

python/python/tests/test_integration.py

+    ds = lance.write_dataset(tab, str(tmp_path))
+    ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})


Unimportant, but these should take paths as-is:

Suggested change

-    ds = lance.write_dataset(tab, str(tmp_path))
-    ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
+    ds = lance.write_dataset(tab, tmp_path)
+    ds = lance.dataset(tmp_path, default_scan_options={"with_row_id": True})

github-actions bot added the enhancement, python, and java labels on Oct 3, 2024

westonpace commented Oct 3, 2024

wjones127 reviewed Oct 3, 2024

wjones127 approved these changes Oct 4, 2024

westonpace force-pushed the feat/meta-cols-in-filter branch from d231010 to 1a142da Compare October 13, 2024 10:46

westonpace force-pushed the feat/meta-cols-in-filter branch from 1a142da to c07a0b7 Compare October 24, 2024 15:29

Make it possible to use rowid and rowaddr in filters

cce144e

westonpace force-pushed the feat/meta-cols-in-filter branch from c07a0b7 to cce144e Compare October 25, 2024 01:46

westonpace merged commit f6e7a66 into lancedb:main Oct 25, 2024
23 checks passed
